From doug.simon at oracle.com Sat Feb 1 08:03:35 2025 From: doug.simon at oracle.com (Douglas Simon) Date: Sat, 1 Feb 2025 08:03:35 +0000 Subject: Proposal: Remove EnableJVMCI flag Message-ID: Hi, https://bugs.openjdk.org/browse/JDK-8345826 was filed to make libgraal and new CDS optimizations more compatible: Since JDK 483, many more CDS optimizations are enabled when -XX:+AOTClassLinking is specified (see numbers in https://bugs.openjdk.org/browse/JDK-8342279). However, these optimizations require the archived module graph to be used. Today, if you enable UseGraalJIT, the archived module graph will be disabled. As a result, the *entire* CDS archive will be disabled. This will result in slower start-up time when UseGraalJIT is enabled. Further internal discussion resulted in the proposal to remove all use of EnableJVMCI in the VM code. This will mean -XX:+EnableJVMCI only applies to the Java code (i.e. adds jdk.internal.vm.ci to the root module set). However, further reflection suggests something more aggressive is worth considering: remove the EnableJVMCI flag altogether. This option was implemented to make use of JVMCI opt-in. However, JVMCI is effectively opt-in anyway without this option. There are two ways in which JVMCI can be used: as a JIT compiler by the CompileBroker and as a compiler for "guest" code (e.g., Truffle use case). 1. JVMCI as JIT. To enable JVMCI as JIT, flags such as UseJVMCICompiler, UseGraalJIT or EnableJVMCIProduct must be specified to the java launcher. Each of these flags sets EnableJVMCI to true as a side-effect. That is, use of JVMCI as JIT is already opt-in due to needing these other flags - specifying EnableJVMCI is redundant. 2. JVMCI as guest code compiler In this mode, the jdk.internal.vm.ci module must be loaded (i.e. EnableJVMCI currently has the side-effect of `--add-modules=jdk.internal.vm.ci`). 
This module has no unqualified exports (as seen in its module descriptor) so using it requires specifying at least one instance of --add-exports to the Java launcher. That is, once again EnableJVMCI alone is not sufficient for opting in to JVMCI. In light of the above, I propose removing EnableJVMCI altogether. This will require using --add-modules=jdk.internal.vm.ci when you actually want to use the JVMCI module. It will also require modifying JDK code guarded by this flag. It guards both VM code and use of the `jdk.internal.vm.ci` module and I consider them separately below. #### VM code All uses of EnableJVMCI to guard VM code would be adapted with one of the following strategies: 1. Remove the guard and make the code unconditional. 2. Replace EnableJVMCI with something else such as UseJVMCICompiler or a test of a global variable set to true as soon as JVMCI compiled code is about to be installed in the code cache (example). 3. Replace EnableJVMCI with a test of whether the jdk.internal.vm.ci module has been resolved (example). Of course, this change almost certainly needs a CSR as well but I'd like to get feedback on the primary change before worrying about that. -Doug -------------- next part -------------- An HTML attachment was scrubbed... 
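[Editor's note] Strategy 3 above (testing whether the jdk.internal.vm.ci module has been resolved) has a straightforward Java-level analogue; a minimal sketch using only the java.base ModuleLayer API (the class name JvmciModuleCheck is illustrative, not from the proposal):

```java
public class JvmciModuleCheck {
    public static void main(String[] args) {
        // jdk.internal.vm.ci appears in the boot layer only when it was
        // added to the root module set, e.g. with
        //   java --add-modules=jdk.internal.vm.ci JvmciModuleCheck
        boolean resolved = ModuleLayer.boot()
                .findModule("jdk.internal.vm.ci")
                .isPresent();
        System.out.println("jdk.internal.vm.ci resolved: " + resolved);
    }
}
```

Run with no extra flags this should print false on a stock JDK, since the module has no unqualified exports and is therefore not a default root module; with --add-modules=jdk.internal.vm.ci it should print true.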
URL: From galder at openjdk.org Mon Feb 3 14:22:52 2025 From: galder at openjdk.org (Galder Zamarreño) Date: Mon, 3 Feb 2025 14:22:52 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11] In-Reply-To: <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com> References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com> Message-ID: On Fri, 17 Jan 2025 17:53:24 GMT, Galder Zamarreño wrote: >> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance. >> >> Currently vectorization does not kick in for loops containing either of these calls because of the following error: >> >> >> VLoop::check_preconditions: failed: control flow in loop not allowed >> >> >> The control flow is due to the java implementation for these methods, e.g. >> >> >> public static long max(long a, long b) { >> return (a >= b) ? a : b; >> } >> >> >> This patch intrinsifies the calls to replace the CmpL + Bool nodes with MaxL/MinL nodes respectively. >> By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization. >> E.g. >> >> >> SuperWord::transform_loop: >> Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt strip_mined >> 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21) >> >> >> Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after. 
Before the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1155 >> long max 1173 >> >> >> After the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1042 >> long max 1042 >> >> >> This patch does not add platform-specific backend implementations for the MaxL/MinL nodes. >> Therefore, it still relies on the macro expansion to transform those into CMoveL. >> >> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results: >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PA... > > Galder Zamarreño has updated the pull request incrementally with one additional commit since the last revision: > > Fix typo @eastig fyi ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2631136070 From duke at openjdk.org Mon Feb 3 15:49:01 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Mon, 3 Feb 2025 15:49:01 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v3] In-Reply-To: References: Message-ID: > By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. Ferenc Rakoczi has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains four commits: - merging master - Use SHA3Parallel for matrix generation - fixing whitespace errors - 8348561: Add aarch64 intrinsics for ML-DSA ------------- Changes: https://git.openjdk.org/jdk/pull/23300/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23300&range=02 Stats: 2133 lines in 19 files changed: 2045 ins; 11 del; 77 mod Patch: https://git.openjdk.org/jdk/pull/23300.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23300/head:pull/23300 PR: https://git.openjdk.org/jdk/pull/23300 From duke at openjdk.org Mon Feb 3 16:15:32 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Mon, 3 Feb 2025 16:15:32 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v4] In-Reply-To: References: Message-ID: > By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: removed debugging code ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23300/files - new: https://git.openjdk.org/jdk/pull/23300/files/5630fd14..9f7c4a23 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23300&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23300&range=02-03 Stats: 25 lines in 3 files changed: 0 ins; 25 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23300.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23300/head:pull/23300 PR: https://git.openjdk.org/jdk/pull/23300 From vladimir.kozlov at oracle.com Mon Feb 3 17:45:39 2025 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Mon, 3 Feb 2025 09:45:39 -0800 Subject: Proposal: Remove EnableJVMCI flag In-Reply-To: References: Message-ID: <611affa4-09c6-41af-a853-1106e12dfbb9@oracle.com> Hi Doug, My concern is that some code (stubs, blobs, Interpreter) is generated before we load any modules. 
How do you handle JVMCI-specific code there if you have it? If you don't have such code then we can discuss. I am definitely against adding runtime checks for JVMCI presence into executed (assembler) code. It would be nice if, when the command line is parsed, we could detect the presence of the `--add-modules=jdk.internal.vm.ci` (or other related) flag and enable the JVMCI flag. I am fine with keeping `EnableJVMCI` but making it ergonomic. You may still want to disable JVMCI from the command line even if somewhere in a start script you have `--add-modules=jdk.internal.vm.ci`. Thanks, Vladimir K On 2/1/25 12:03 AM, Douglas Simon wrote: > Hi, > > https://bugs.openjdk.org/browse/JDK-8345826 was filed to make libgraal and > new CDS optimizations more compatible: > >> Since JDK 483, many more CDS optimizations are enabled when -XX:+AOTClassLinking is specified (see numbers in https://bugs.openjdk.org/browse/JDK-8342279). However, these optimizations require the archived module graph to be used. >> Today, if you enable UseGraalJIT, the archived module graph will be disabled. As a result, the *entire* CDS archive >> will be disabled. This will result in slower start-up time when UseGraalJIT is enabled. >> > > Further internal discussion resulted in > the proposal to remove all use of EnableJVMCI in the VM code. This will mean -XX:+EnableJVMCI only applies to the Java > code (i.e. adds jdk.internal.vm.ci to the root module set). > > However, further reflection suggests something more aggressive is worth considering: remove the EnableJVMCI flag altogether. > > This option was implemented to make use of JVMCI opt-in. However, JVMCI is effectively opt-in anyway without this > option. There are two ways in which JVMCI can be used: as a JIT compiler by the CompileBroker and as a compiler for > "guest" code (e.g., Truffle use case). > > 1. JVMCI as JIT. 
> > To enable JVMCI as JIT, flags such as UseJVMCICompiler, UseGraalJIT or EnableJVMCIProduct must be specified to the java > launcher. Each of these flags sets EnableJVMCI to true as a side-effect. That is, use of JVMCI as JIT is already opt-in > due to needing these other flags - specifying EnableJVMCI is redundant. > > 2. JVMCI as guest code compiler > > In this mode, the jdk.internal.vm.ci module must be loaded (i.e. EnableJVMCI currently has the side-effect of `--add-modules=jdk.internal.vm.ci`). This module has no unqualified exports (as seen in its module descriptor: https://github.com/openjdk/jdk/blob/master/src/jdk.internal.vm.ci/share/classes/module-info.java) so using it requires > specifying at least one instance of --add-exports to the Java launcher. That is, once again EnableJVMCI alone is not > sufficient for opting in to JVMCI. > > In light of the above, I propose removing EnableJVMCI altogether. This will require using --add-modules=jdk.internal.vm.ci when you actually want to use the JVMCI module. It will also require modifying JDK code > guarded by this flag. It guards both VM code and use of the `jdk.internal.vm.ci` module and I consider them separately > below. > > #### VM code > > All uses of EnableJVMCI to guard VM code would be adapted with one of the following strategies: > 1. Remove the guard and make the code unconditional. > 2. Replace EnableJVMCI with something else such as UseJVMCICompiler or a test of a global variable set to true as soon as > JVMCI compiled code is about to be installed in the code cache (example). > 3. Replace EnableJVMCI with a test of whether the jdk.internal.vm.ci module has been resolved (example: https://github.com/openjdk/jdk/pull/23408/files#diff-4e6668d768f7d67417cbac39bcb723552cc0b80ad218709cfa0e6e31f32b69f0R518). 
> > Of course, this change almost certainly needs a CSR as well but I'd like to get feedback on the primary change before > worrying about that. > > -Doug > From coleenp at openjdk.org Mon Feb 3 17:49:19 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Mon, 3 Feb 2025 17:49:19 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native Message-ID: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. Tested with tier1-8. ------------- Commit messages: - Removed @Stable. - Fix JFR bug. - 8345678: Make Class.getModifiers() non-native. 
Changes: https://git.openjdk.org/jdk/pull/22652/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=22652&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8346567 Stats: 218 lines in 34 files changed: 57 ins; 139 del; 22 mod Patch: https://git.openjdk.org/jdk/pull/22652.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22652/head:pull/22652 PR: https://git.openjdk.org/jdk/pull/22652 From liach at openjdk.org Mon Feb 3 17:49:20 2025 From: liach at openjdk.org (Chen Liang) Date: Mon, 3 Feb 2025 17:49:20 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native In-Reply-To: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: <1kHVpYCOExfkn8UHTZNZT6zwjRj3MCXJD2LVcY0NTrg=.0644323b-5f40-4441-8c19-763105aaf08d@github.com> On Mon, 9 Dec 2024 19:26:53 GMT, Coleen Phillimore wrote: > The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. > > There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. > > Tested with tier1-8. The change to java.lang.Class looks good. Looking at #23396, we might need to filter this field too. 
src/hotspot/share/classfile/javaClasses.cpp line 1504: > 1502: macro(_reflectionData_offset, k, "reflectionData", java_lang_ref_SoftReference_signature, false); \ > 1503: macro(_signers_offset, k, "signers", object_array_signature, false); \ > 1504: macro(_modifiers_offset, k, vmSymbols::modifiers_name(), int_signature, false) Do we need a trailing semicolon here? src/java.base/share/classes/java/lang/Class.java line 1315: > 1313: > 1314: // Set by the JVM when creating the instance of this java.lang.Class > 1315: private transient int modifiers; If this is set by the JVM, can this be marked `final` so JIT compiler can trust this field? Also preferable if we can move this together with components/signers/classData fields. ------------- PR Review: https://git.openjdk.org/jdk/pull/22652#pullrequestreview-2490110846 PR Comment: https://git.openjdk.org/jdk/pull/22652#issuecomment-2631658029 PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1876630297 PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1876627105 From coleenp at openjdk.org Mon Feb 3 17:49:20 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Mon, 3 Feb 2025 17:49:20 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native In-Reply-To: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: On Mon, 9 Dec 2024 19:26:53 GMT, Coleen Phillimore wrote: > The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. 
I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. > > There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. > > Tested with tier1-8. > Looking at https://github.com/openjdk/jdk/pull/23396, we might need to filter this field too. Yes, I agree. This patch is a follow on to that one, so I'll add it to the same places when that one is merged in here. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22652#issuecomment-2631661716 From coleenp at openjdk.org Mon Feb 3 17:49:20 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Mon, 3 Feb 2025 17:49:20 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native In-Reply-To: <1kHVpYCOExfkn8UHTZNZT6zwjRj3MCXJD2LVcY0NTrg=.0644323b-5f40-4441-8c19-763105aaf08d@github.com> References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> <1kHVpYCOExfkn8UHTZNZT6zwjRj3MCXJD2LVcY0NTrg=.0644323b-5f40-4441-8c19-763105aaf08d@github.com> Message-ID: On Mon, 9 Dec 2024 19:46:43 GMT, Chen Liang wrote: >> The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. >> >> There's a couple of JMH benchmarks added with this change. 
One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. >> >> Tested with tier1-8. > > src/hotspot/share/classfile/javaClasses.cpp line 1504: > >> 1502: macro(_reflectionData_offset, k, "reflectionData", java_lang_ref_SoftReference_signature, false); \ >> 1503: macro(_signers_offset, k, "signers", object_array_signature, false); \ >> 1504: macro(_modifiers_offset, k, vmSymbols::modifiers_name(), int_signature, false) > > Do we need a trailing semicolon here? yes. it is needed. > src/java.base/share/classes/java/lang/Class.java line 1315: > >> 1313: >> 1314: // Set by the JVM when creating the instance of this java.lang.Class >> 1315: private transient int modifiers; > > If this is set by the JVM, can this be marked `final` so JIT compiler can trust this field? Also preferable if we can move this together with components/signers/classData fields. The JVM rearranges these fields so that's why I put it near the caller. Let me check if final compiles. Edit: it looks better with the other fields though. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1876712191 PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1876713323 From duke at openjdk.org Mon Feb 3 17:49:20 2025 From: duke at openjdk.org (ExE Boss) Date: Mon, 3 Feb 2025 17:49:20 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> <1kHVpYCOExfkn8UHTZNZT6zwjRj3MCXJD2LVcY0NTrg=.0644323b-5f40-4441-8c19-763105aaf08d@github.com> Message-ID: On Mon, 9 Dec 2024 20:27:52 GMT, Coleen Phillimore wrote: >> src/hotspot/share/classfile/javaClasses.cpp line 1504: >> >>> 1502: macro(_reflectionData_offset, k, "reflectionData", java_lang_ref_SoftReference_signature, false); \ >>> 1503: macro(_signers_offset, k, "signers", object_array_signature, false); \ >>> 1504: macro(_modifiers_offset, k, vmSymbols::modifiers_name(), int_signature, false) >> >> Do we need a trailing semicolon here? > > yes. it is needed. This is **C++**, so yes. 
> Suggestion: > > macro(_modifiers_offset, k, vmSymbols::modifiers_name(), int_signature, false); I see, there's a trailing semi somewhere in the expansion of this macro so it compiles, but I added one in. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1878263513 From heidinga at openjdk.org Mon Feb 3 17:49:20 2025 From: heidinga at openjdk.org (Dan Heidinga) Date: Mon, 3 Feb 2025 17:49:20 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native In-Reply-To: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: <5bMxhTRPqj-dMhr3FoSrym2ttWuzjWwtXAEcQHbF9Vg=.859ae29f-2530-4130-b108-d47c100ac19f@github.com> On Mon, 9 Dec 2024 19:26:53 GMT, Coleen Phillimore wrote: > The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. > > There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. > > Tested with tier1-8. 
src/java.base/share/classes/java/lang/Class.java line 244: > 242: classLoader = loader; > 243: componentType = arrayComponentType; > 244: modifiers = 0; The comment above about assigning a parameter to the field to prevent the JIT from assuming an incorrect default also should apply to the new `modifiers` field. I think the constructor, which is never called, should also pass in a `dummyModifiers` value rather than using 0 directly ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1880689835 From coleenp at openjdk.org Mon Feb 3 17:49:20 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Mon, 3 Feb 2025 17:49:20 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native In-Reply-To: <5bMxhTRPqj-dMhr3FoSrym2ttWuzjWwtXAEcQHbF9Vg=.859ae29f-2530-4130-b108-d47c100ac19f@github.com> References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> <5bMxhTRPqj-dMhr3FoSrym2ttWuzjWwtXAEcQHbF9Vg=.859ae29f-2530-4130-b108-d47c100ac19f@github.com> Message-ID: On Wed, 11 Dec 2024 18:15:57 GMT, Dan Heidinga wrote: >> The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. >> >> There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. >> >> Tested with tier1-8. 
> > src/java.base/share/classes/java/lang/Class.java line 244: > >> 242: classLoader = loader; >> 243: componentType = arrayComponentType; >> 244: modifiers = 0; > > The comment above about assigning a parameter to the field to prevent the JIT from assuming an incorrect default also should apply to the new `modifiers` field. I think the constructor, which is never called, should also pass in a `dummyModifiers` value rather than using 0 directly Yes, definitely, didn't see that this is the right way to do this. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1887157349 From duke at openjdk.org Mon Feb 3 17:49:21 2025 From: duke at openjdk.org (ExE Boss) Date: Mon, 3 Feb 2025 17:49:21 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native In-Reply-To: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: <-GEiPPAhFzy-uaUwIACYA7fZVCT3wkuVd-gtf9rrlnw=.de130f97-59bd-4581-a568-05d6238cf90a@github.com> On Mon, 9 Dec 2024 19:26:53 GMT, Coleen Phillimore wrote: > The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. > > There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. 
> > Tested with tier1-8. src/java.base/share/classes/java/lang/Class.java line 1005: > 1003: private transient Object[] signers; // Read by VM, mutable > 1004: > 1005: @Stable The `modifiers` field doesn't need to be `@Stable`: Suggestion: test/micro/org/openjdk/bench/java/lang/reflect/Clazz.java line 65: > 63: */ > 64: @Benchmark > 65: public int getModifiers() throws NoSuchMethodException { The only `Throwable`s that can be thrown by calling `Class::getModifiers()` are `Error`s (e.g.: `StackOverflowError`) and `RuntimeException`s (e.g.: `NullPointerException`): Suggestion: public int getModifiers() { test/micro/org/openjdk/bench/java/lang/reflect/Clazz.java line 71: > 69: Clazz[] clazzArray = new Clazz[1]; > 70: @Benchmark > 71: public int getAppArrayModifiers() throws NoSuchMethodException { Suggestion: public int getAppArrayModifiers() { test/micro/org/openjdk/bench/java/lang/reflect/Clazz.java line 81: > 79: */ > 80: @Benchmark > 81: public int getArrayModifiers() throws NoSuchMethodException { Suggestion: public int getArrayModifiers() { ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1888757754 PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1888760732 PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1888760967 PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1888761412 From coleenp at openjdk.org Mon Feb 3 17:49:21 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Mon, 3 Feb 2025 17:49:21 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native In-Reply-To: <-GEiPPAhFzy-uaUwIACYA7fZVCT3wkuVd-gtf9rrlnw=.de130f97-59bd-4581-a568-05d6238cf90a@github.com> References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> <-GEiPPAhFzy-uaUwIACYA7fZVCT3wkuVd-gtf9rrlnw=.de130f97-59bd-4581-a568-05d6238cf90a@github.com> Message-ID: On Tue, 17 Dec 2024 15:54:48 GMT, ExE Boss wrote: >> The Class.getModifiers() 
method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. >> >> There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. >> >> Tested with tier1-8. > > src/java.base/share/classes/java/lang/Class.java line 1005: > >> 1003: private transient Object[] signers; // Read by VM, mutable >> 1004: >> 1005: @Stable > > The?`modifiers`?field doesn?t?need to?be?`@Stable`: > Suggestion: I now don't know whether we want @Stable here or not. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1890866329 From vklang at openjdk.org Mon Feb 3 17:49:21 2025 From: vklang at openjdk.org (Viktor Klang) Date: Mon, 3 Feb 2025 17:49:21 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native In-Reply-To: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: On Mon, 9 Dec 2024 19:26:53 GMT, Coleen Phillimore wrote: > The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. 
This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. > > There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. > > Tested with tier1-8. src/java.base/share/classes/java/lang/Class.java line 1006: > 1004: private final transient int modifiers; // Set by the VM > 1005: > 1006: // package-private @coleenp Could this field be @Stable, or does that only apply to `putfield`s? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1879797327 From liach at openjdk.org Mon Feb 3 17:49:21 2025 From: liach at openjdk.org (Chen Liang) Date: Mon, 3 Feb 2025 17:49:21 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: On Wed, 11 Dec 2024 10:24:03 GMT, Viktor Klang wrote: >> The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. >> >> There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. 
I don't think this is real life code. The other benchmarks added show no regression. >> >> Tested with tier1-8. > > src/java.base/share/classes/java/lang/Class.java line 1006: > >> 1004: private final transient int modifiers; // Set by the VM >> 1005: >> 1006: // package-private > > @coleenp Could this field be @Stable, or does that only apply to `putfield`s? I don't think this needs to be stable - finals in java.lang is trusted by the JIT compiler. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1880350790 From vklang at openjdk.org Mon Feb 3 17:49:24 2025 From: vklang at openjdk.org (Viktor Klang) Date: Mon, 3 Feb 2025 17:49:24 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: On Wed, 11 Dec 2024 14:52:48 GMT, Chen Liang wrote: >> src/java.base/share/classes/java/lang/Class.java line 1006: >> >>> 1004: private final transient int modifiers; // Set by the VM >>> 1005: >>> 1006: // package-private >> >> @coleenp Could this field be @Stable, or does that only apply to `putfield`s? > > I don't think this needs to be stable - finals in java.lang is trusted by the JIT compiler. Yeah, I was just thinking whether something set from inside the VM which is marked @Stable is constant-folded :) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1880374750 From coleenp at openjdk.org Mon Feb 3 17:49:24 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Mon, 3 Feb 2025 17:49:24 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: On Wed, 11 Dec 2024 15:06:54 GMT, Viktor Klang wrote: >> I don't think this needs to be stable - finals in java.lang is trusted by the JIT compiler. 
> > Yeah, I was just thinking whether something set from inside the VM which is marked @Stable is constant-folded :) I don't think @Stable would hurt but final should provide the same guarantee. It's set internally by the VM so there's no late setting. I don't know if this field implementation can constant fold in the case of Arrays which are (JVM_ACC_ABSTRACT | JVM_ACC_FINAL | JVM_ACC_PUBLIC). ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1880663099 From heidinga at openjdk.org Mon Feb 3 17:49:25 2025 From: heidinga at openjdk.org (Dan Heidinga) Date: Mon, 3 Feb 2025 17:49:25 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: On Wed, 11 Dec 2024 15:06:54 GMT, Viktor Klang wrote: >> I don't think this needs to be stable - finals in java.lang is trusted by the JIT compiler. > > Yeah, I was just thinking whether something set from inside the VM which is marked @Stable is constant-folded :) @viktorklang-ora `@Stable` is not about how the field was set, but about the JIT observing a non-default value at compile time. If it observes a non-default value, it can treat it as a compile time constant. 
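The semantics Dan describes can be shown with a minimal, plain-Java sketch. The real `@Stable` annotation lives in `jdk.internal.vm.annotation`, is JDK-internal, and is only trusted on privileged classes, so the field below carries it only as a comment; the class and method names are hypothetical.

```java
// Illustration of the @Stable protocol: the default value (0) means
// "not yet computed"; once the JIT observes a non-default value it may
// treat the field as a compile-time constant. The real annotation is
// JDK-internal, so it appears here only as a comment.
class CachedModifiers {
    /* @Stable */ private int modifiers; // 0 = not yet initialized

    int get() {
        int m = modifiers;
        if (m == 0) {                    // still the default value
            m = computeModifiers();      // stand-in for what the VM computes
            modifiers = m;               // single transition: default -> settled value
        }
        return m;
    }

    private int computeModifiers() {
        return java.lang.reflect.Modifier.PUBLIC | java.lang.reflect.Modifier.FINAL;
    }
}
```

Note the caveat from the thread: a plain `final` field in `java.lang` is already trusted by the JIT, which is why the annotation on `modifiers` was questioned in the first place.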
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1880692608 From vklang at openjdk.org Mon Feb 3 17:49:25 2025 From: vklang at openjdk.org (Viktor Klang) Date: Mon, 3 Feb 2025 17:49:25 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: <8Wx3xbbOnPXS5n1RuNaesqHbhKV3iLwrCVF0s6uWOrA=.cb20728e-e13c-4667-822b-3ba424cbc12f@github.com> On Wed, 11 Dec 2024 18:17:43 GMT, Dan Heidinga wrote: >> Yeah, I was just thinking whether something set from inside the VM which is marked @Stable is constant-folded :) > > @viktorklang-ora `@Stable` is not about how the field was set, but about the JIT observing a non-default value at compile time. If it observes a non-default value, it can treat it as a compile time constant. @DanHeidinga Great explanation, thank you! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1881782322 From duke at openjdk.org Mon Feb 3 18:14:54 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Mon, 3 Feb 2025 18:14:54 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v2] In-Reply-To: <7UgNYEuTu6rj7queOgM9xIy-6kQMdACrZiDLtlniMYw=.dff6f18b-1236-43b1-8280-2bce9160f32a@github.com> References: <7UgNYEuTu6rj7queOgM9xIy-6kQMdACrZiDLtlniMYw=.dff6f18b-1236-43b1-8280-2bce9160f32a@github.com> Message-ID: On Thu, 30 Jan 2025 16:23:56 GMT, Andrew Dinn wrote: > @ferakocz I'm afraid you lucked out on getting your change committed before my reorganization of the stub generation code. If you are unsure of how to do the merge so your new stub is declared and generated following the new model (see the doc comments in stubDeclarations.hpp for details) let me know and I'll be happy to help you sort it out. @adinn I think I managed to figure it out. Please take a look at the PR and let me know if I should have done anything differently. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23300#issuecomment-2631720583 From jbhateja at openjdk.org Mon Feb 3 18:14:56 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 3 Feb 2025 18:14:56 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v16] In-Reply-To: References: Message-ID: On Thu, 30 Jan 2025 11:03:43 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch adds C2 compiler support for various Float16 operations added by [PR#22128](https://github.com/openjdk/jdk/pull/22128) >> >> Following is the summary of changes included with this patch:- >> >> 1. Detection of various Float16 operations through inline expansion or pattern folding idealizations. >> 2. Float16 operations like add, sub, mul, div, max, and min are inferred through pattern folding idealization. >> 3. Float16 SQRT and FMA operation are inferred through inline expansion and their corresponding entry points are defined in the newly added Float16Math class. >> - These intrinsics receive unwrapped short arguments encoding IEEE 754 binary16 values. >> 5. New specialized IR nodes for Float16 operations, associated idealizations, and constant folding routines. >> 6. New Ideal type for constant and non-constant Float16 IR nodes. Please refer to [FAQs ](https://github.com/openjdk/jdk/pull/22754#issuecomment-2543982577)for more details. >> 7. Since Float16 uses short as its storage type, hence raw FP16 values are always loaded into general purpose register, but FP16 ISA generally operates over floating point registers, thus the compiler injects reinterpretation IR before and after Float16 operation nodes to move short value to floating point register and vice versa. >> 8. New idealization routines to optimize redundant reinterpretation chains. HF2S + S2HF = HF >> 9. X86 backend implementation for all supported intrinsics. >> 10. Functional and Performance validation tests. >> >> Kindly review the patch and share your feedback. 
>> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Update test/micro/org/openjdk/bench/jdk/incubator/vector/Float16OperationsBenchmark.java > > Co-authored-by: Emanuel Peter Hi @PaulSandoz , @eme64 , All outstanding comments have been addressed, please let me know if there are other comments. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22754#issuecomment-2631719276 From doug.simon at oracle.com Mon Feb 3 19:09:26 2025 From: doug.simon at oracle.com (Douglas Simon) Date: Mon, 3 Feb 2025 19:09:26 +0000 Subject: Proposal: Remove EnableJVMCI flag In-Reply-To: <611affa4-09c6-41af-a853-1106e12dfbb9@oracle.com> References: <611affa4-09c6-41af-a853-1106e12dfbb9@oracle.com> Message-ID: Hi Vladimir, On 3 Feb 2025, at 18:45, Vladimir Kozlov wrote: Hi Doug, My concern is that some code (stubs, blobs, Interpreter) are generated before we are loading any modules. How do you handle JVMCI-specific code there if you have it? If you don't have such code then we can discuss. You mean what would we do with generated code that currently tests EnableJVMCI? We have these 2 options as far as I can see: 1. Always generate the JVMCI part of the code (example). 2. Instead of testing EnableJVMCI, we instead test a JVMCI::_is_enabled bool which would be initialized during argument parsing (i.e. before any code is generated). JVMCI::_is_enabled would be set to true if jdk.internal.vm.ci is in the root module set or if any other JVMCI flags such as UseGraalJIT or UseJVMCICompiler are true. I suspect this option is the one to go with as it's pretty much equivalent to the current semantics (i.e. JVMCI-conditional VM code is only executed/generated if JVMCI is enabled). I am definitely against adding runtime checks for JVMCI presence into executed (assembler) code. I agree that we do not want that.
Would be nice if/when command line is parsed we can detect presence of `--add-modules=jdk.internal.vm.ci` (or others related) flag and enable JVMCI flag. I am fine to keep `EnableJVMCI` but make it ergonomic. I'd like EnableJVMCI to become purely an alias for --add-modules=jdk.internal.vm.ci. You may still want to disable JVMCI from command line even if somewhere in start script you have `--add-modules=jdk.internal.vm.ci`. I don't think we need to support such a contradiction - if the launcher has been asked to load jdk.internal.vm.ci as part of the root module set, then it wants JVMCI enabled. Either that or we make -EnableJVMCI undo any preceding --add-modules=jdk.internal.vm.ci (if that's even possible). -Doug On 2/1/25 12:03 AM, Douglas Simon wrote: Hi, https://bugs.openjdk.org/browse/JDK-8345826 was filed to make libgraal and new CDS optimizations more compatible: Since JDK 483, many more CDS optimizations are enabled when -XX:+AOTClassLinking is specified (see numbers in https://bugs.openjdk.org/browse/JDK-8342279). However, these optimizations require the archived module graph to be used. Today, if you enable UseGraalJIT, the archived module graph will be disabled. As a result, the *entire* CDS archive will be disabled. This will result in slower start-up time when UseGraalJIT is enabled. Further internal discussion resulted in the proposal to remove all use of EnableJVMCI in the VM code. This will mean -XX:+EnableJVMCI only applies to the Java code (i.e. adds jdk.internal.vm.ci to the root module set). However, further reflection suggests something more aggressive is worth considering: remove the EnableJVMCI flag altogether. This option was implemented to make use of JVMCI opt-in. However, JVMCI is effectively opt-in anyway without this option. There are two ways in which JVMCI can be used: as a JIT compiler by the CompileBroker and as a compiler for "guest" code (e.g., Truffle use case). 1. JVMCI as JIT.
To enable JVMCI as JIT, flags such as UseJVMCICompiler, UseGraalJIT or EnableJVMCIProduct must be specified to the java launcher. Each of these flags sets EnableJVMCI to true as a side-effect. That is, use of JVMCI as JIT is already opt-in due to needing these other flags - specifying EnableJVMCI is redundant. 2. JVMCI as guest code compiler In this mode, the jdk.internal.vm.ci module must be loaded (i.e. EnableJVMCI currently has the side-effect of `--add-modules=jdk.internal.vm.ci`). This module has no unqualified exports (as seen in its module descriptor ) so using it requires specifying at least one instance of --add-exports to the Java launcher. That is, once again EnableJVMCI alone is not sufficient for opting-in to JVMCI. In light of the above, I propose removing EnableJVMCI altogether. This will require using --add-modules=jdk.internal.vm.ci when you actually want to use the JVMCI module. It will also require modifying JDK code guarded by this flag. It guards both VM code and use of the `jdk.internal.vm.ci` module and I consider them separately below. #### VM code All uses of EnableJVMCI to guard VM code would be adapted with one of the following strategies: 1. Remove the guard and make the code unconditional. 2. Replace EnableJVMCI with something else such as UseJVMCICompiler or a test of a global variable set to true as soon as JVMCI compiled code is about to be installed in the code cache (example ). 3. Replace EnableJVMCI with a test of whether the jdk.internal.vm.ci module has been resolved (example ). Of course, this change almost certainly needs a CSR as well but I'd like to get feedback on the primary change before worrying about that. -Doug -------------- next part -------------- An HTML attachment was scrubbed...
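The "test whether the jdk.internal.vm.ci module has been resolved" strategy from the proposal above is observable from the Java side via the boot module layer. The snippet below is only an illustrative sketch with a hypothetical class name, not the HotSpot-side check the patch would actually use:

```java
// Checks whether jdk.internal.vm.ci was resolved into the boot layer,
// e.g. because the launcher was given --add-modules=jdk.internal.vm.ci.
public class JvmciModuleCheck {
    static boolean jvmciModulePresent() {
        return ModuleLayer.boot().findModule("jdk.internal.vm.ci").isPresent();
    }

    public static void main(String[] args) {
        System.out.println("jdk.internal.vm.ci resolved: " + jvmciModulePresent());
    }
}
```

Run with and without `--add-modules=jdk.internal.vm.ci` to see the result flip; on a default launch the module is not in the root module set.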
URL: From vladimir.kozlov at oracle.com Mon Feb 3 19:14:08 2025 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Mon, 3 Feb 2025 11:14:08 -0800 Subject: Proposal: Remove EnableJVMCI flag In-Reply-To: References: <611affa4-09c6-41af-a853-1106e12dfbb9@oracle.com> Message-ID: <888c013b-482e-4269-972e-078b8517485e@oracle.com> On 2/3/25 11:09 AM, Douglas Simon wrote: > Hi Vladimir, > >> On 3 Feb 2025, at 18:45, Vladimir Kozlov wrote: >> >> Hi Doug, >> >> My concern is that some code (stubs, blobs, Interpreter) are generated before we are loading any modules. >> How do you handle JVMCI-specific code there if you have it? If you don't have such code then we can discuss. > > You mean what would we do with generated code that currently tests EnableJVMCI? We have these 2 options as far as I can see: > 1. Always generate the JVMCI part of the code (example files#diff-524c9e019cb83916aa3db772fb33acbbe3e7465867a8d2f7e6376be3c8260eddL606>). > 2. Instead of testing EnableJVMCI, we instead test a JVMCI::_is_enabled bool which would be initialized during argument > parsing (i.e. before any code is generated). JVMCI::_is_enabled would be set to true if jdk.internal.vm.ci is in the > root module set or if any other JVMCI flags such as UseGraalJIT or UseJVMCICompiler are true. I suspect this option is > the one to go with as it's pretty much equivalent to the current semantics (i.e. JVMCI-conditional VM code is only executed/ > generated if JVMCI is enabled). I agree with option 2. This looks like the most reasonable approach. Thanks, Vladimir K > >> I am definitely against adding runtime checks for JVMCI presence into executed (assembler) code. > > I agree that we do not want that. >
> >> You may still want to disable JVMCI from command line even if somewhere in start script you have `--add- >> modules=jdk.internal.vm.ci`. > > I don't think we need to support such a contradiction - if the launcher has been asked to load jdk.internal.vm.ci as > part of the root module set, then it wants JVMCI enabled. Either that or we make -EnableJVMCI undo any preceding --add- > modules=jdk.internal.vm.ci (if that's even possible). > > -Doug > >> On 2/1/25 12:03 AM, Douglas Simon wrote: >>> Hi, >>> https://bugs.openjdk.org/browse/JDK-8345826 was filed to make libgraal >>> and new CDS optimizations more compatible: >>>> Since JDK 483, many more CDS optimizations are enabled when -XX:+AOTClassLinking is specified (see numbers >>>> in https:// bugs.openjdk.org/browse/JDK-8342279). However, these optimizations require the archived module graph to >>>> be used. Today, if you enable UseGraalJIT, the archived module graph will be disabled. As a result, the *entire* CDS >>>> archive will be disabled. This will result in slower start-up time when UseGraalJIT is enabled. >>>> >>> Further internal discussion >> focusedId=14736369&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14736369> resulted >>> in the proposal to remove all use of EnableJVMCI in the VM code. This will mean -XX:+EnableJVMCI only applies to the >>> Java code (i.e. adds jdk.internal.vm.ci to the root module set). >>> However, further reflection suggests something more aggressive is worth considering: remove the EnableJVMCI flag >>> altogether. >>> This option was implemented to make use of JVMCI opt-in. However, JVMCI is effectively opt-in anyway without this >>> option. There are two ways in which JVMCI can be used: as a JIT compiler by the CompileBroker and as a compiler for >>> "guest" code (e.g., Truffle use case). >>> 1. JVMCI as JIT.
>>> To enable JVMCI as JIT, flags such as UseJVMCICompiler, UseGraalJIT or EnableJVMCIProduct must be specified to the >>> java launcher. Each of these flags sets EnableJVMCI to true as a side-effect. That is, use of JVMCI as JIT is already >>> opt-in due to needing these other flags - specifying EnableJVMCI is redundant. >>> 2. JVMCI as guest code compiler >>> In this mode, the jdk.internal.vm.ci module must be loaded (i.e. EnableJVMCI currently has the side-effect of `--add- >>> modules=jdk.internal.vm.ci`). This module has no unqualified exports (as seen in its module descriptor >> github.com/openjdk/jdk/blob/master/src/jdk.internal.vm.ci/share/classes/module-info.java>) so using it requires >>> specifying at least one instance of --add-exports to the Java launcher. That is, once again EnableJVMCI alone is not >>> sufficient for opting-in to JVMCI. >>> In light of the above, I propose removing EnableJVMCI altogether. This will require using --add- >>> modules=jdk.internal.vm.ci when you actually want to use the JVMCI module. It will also require modifying JDK code >>> guarded by this flag. It guards both VM code and use of the `jdk.internal.vm.ci` module and I consider them >>> separately below. >>> #### VM code >>> All uses of EnableJVMCI to guard VM code would be adapted with one of the following strategies: >>> 1. Remove the guard and make the code unconditional. >>> 2. Replace EnableJVMCI with something else such as UseJVMCICompiler or a test of a global variable set to true as soon >>> as JVMCI compiled code is about to be installed in the code cache (example >> pull/23408/ files#diff-ee8337800ed1d1b84e3e49a2481809a6affac5d70ca23934a44497c9c758092fR456>). >>> 3. Replace EnableJVMCI with a test of whether the jdk.internal.vm.ci module has been resolved (example >> github.com/openjdk/jdk/pull/23408/files#diff-4e6668d768f7d67417cbac39bcb723552cc0b80ad218709cfa0e6e31f32b69f0R518>).
>>> Of course, this change almost certainly needs a CSR as well but I'd like to get feedback on the primary change before >>> worrying about that. >>> -Doug >> > From epeter at openjdk.org Tue Feb 4 08:52:15 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 4 Feb 2025 08:52:15 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v16] In-Reply-To: References: Message-ID: <_5bwBRKG8Zu7iywOJZ6WgUb6N4so1sAO6Ua8S0zQU94=.3200ef74-4e50-424b-a3da-637be63e3f0c@github.com> On Mon, 3 Feb 2025 18:11:11 GMT, Jatin Bhateja wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Update test/micro/org/openjdk/bench/jdk/incubator/vector/Float16OperationsBenchmark.java >> >> Co-authored-by: Emanuel Peter > > Hi @PaulSandoz , @eme64 , All outstanding comments have been addressed, please let me know if there are other comments. @jatin-bhateja Testing is all green :green_circle: Doing a last pass over the code. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22754#issuecomment-2633248273 From epeter at openjdk.org Tue Feb 4 09:03:17 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 4 Feb 2025 09:03:17 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v16] In-Reply-To: References: Message-ID: On Thu, 30 Jan 2025 11:03:43 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch adds C2 compiler support for various Float16 operations added by [PR#22128](https://github.com/openjdk/jdk/pull/22128) >> >> Following is the summary of changes included with this patch:- >> >> 1. Detection of various Float16 operations through inline expansion or pattern folding idealizations. >> 2. Float16 operations like add, sub, mul, div, max, and min are inferred through pattern folding idealization. >> 3.
Float16 SQRT and FMA operation are inferred through inline expansion and their corresponding entry points are defined in the newly added Float16Math class. >> - These intrinsics receive unwrapped short arguments encoding IEEE 754 binary16 values. >> 5. New specialized IR nodes for Float16 operations, associated idealizations, and constant folding routines. >> 6. New Ideal type for constant and non-constant Float16 IR nodes. Please refer to [FAQs ](https://github.com/openjdk/jdk/pull/22754#issuecomment-2543982577)for more details. >> 7. Since Float16 uses short as its storage type, hence raw FP16 values are always loaded into general purpose register, but FP16 ISA generally operates over floating point registers, thus the compiler injects reinterpretation IR before and after Float16 operation nodes to move short value to floating point register and vice versa. >> 8. New idealization routines to optimize redundant reinterpretation chains. HF2S + S2HF = HF >> 9. X86 backend implementation for all supported intrinsics. >> 10. Functional and Performance validation tests. >> >> Kindly review the patch and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Update test/micro/org/openjdk/bench/jdk/incubator/vector/Float16OperationsBenchmark.java > > Co-authored-by: Emanuel Peter src/hotspot/share/opto/convertnode.hpp line 222: > 220: class ReinterpretS2HFNode : public Node { > 221: public: > 222: ReinterpretS2HFNode(Node* in1) : Node(0, in1) {} Suggestion: ReinterpretS2HFNode(Node* in1) : Node(nullptr, in1) {} Oh, just caught this. I think you should not use `0` here any more, check all other uses. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22754#discussion_r1940762320 From epeter at openjdk.org Tue Feb 4 09:16:17 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 4 Feb 2025 09:16:17 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v16] In-Reply-To: References: Message-ID: On Thu, 30 Jan 2025 11:03:43 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch adds C2 compiler support for various Float16 operations added by [PR#22128](https://github.com/openjdk/jdk/pull/22128) >> >> Following is the summary of changes included with this patch:- >> >> 1. Detection of various Float16 operations through inline expansion or pattern folding idealizations. >> 2. Float16 operations like add, sub, mul, div, max, and min are inferred through pattern folding idealization. >> 3. Float16 SQRT and FMA operation are inferred through inline expansion and their corresponding entry points are defined in the newly added Float16Math class. >> - These intrinsics receive unwrapped short arguments encoding IEEE 754 binary16 values. >> 5. New specialized IR nodes for Float16 operations, associated idealizations, and constant folding routines. >> 6. New Ideal type for constant and non-constant Float16 IR nodes. Please refer to [FAQs ](https://github.com/openjdk/jdk/pull/22754#issuecomment-2543982577)for more details. >> 7. Since Float16 uses short as its storage type, hence raw FP16 values are always loaded into general purpose register, but FP16 ISA generally operates over floating point registers, thus the compiler injects reinterpretation IR before and after Float16 operation nodes to move short value to floating point register and vice versa. >> 8. New idealization routines to optimize redundant reinterpretation chains. HF2S + S2HF = HF >> 9. X86 backend implementation for all supported intrinsics. >> 10. Functional and Performance validation tests. >> >> Kindly review the patch and share your feedback. 
>> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Update test/micro/org/openjdk/bench/jdk/incubator/vector/Float16OperationsBenchmark.java > > Co-authored-by: Emanuel Peter Ooops, I found a few more details. But the C++ VM changes look really good now. The Java changes I leave to @PaulSandoz src/hotspot/share/opto/convertnode.cpp line 971: > 969: return true; > 970: default: > 971: return false; Does this cover all cases? What about `FmaHF`? src/hotspot/share/opto/convertnode.hpp line 234: > 232: class ReinterpretHF2SNode : public Node { > 233: public: > 234: ReinterpretHF2SNode(Node* in1) : Node(0, in1) {} Suggestion: ReinterpretHF2SNode(Node* in1) : Node(nullptr, in1) {} src/hotspot/share/opto/divnode.cpp line 866: > 864: // Dividing by self is 1. > 865: // IF the divisor is 1, we are an identity on the dividend. > 866: Node* DivHFNode::Identity(PhaseGVN* phase) { Remove line with `isA_Copy`. src/hotspot/share/opto/type.cpp line 1106: > 1104: if (_base == FloatBot || _base == FloatTop) return FLOAT; > 1105: if (_base == HalfFloatTop || _base == HalfFloatBot) return Type::BOTTOM; > 1106: if (_base == DoubleTop || _base == DoubleBot) return Type::BOTTOM; If you are already fixing the style, you should use curly braces as I said above ;) src/hotspot/share/opto/type.cpp line 1472: > 1470: //------------------------------meet------------------------------------------- > 1471: // Compute the MEET of two types. It returns a new Type object. > 1472: const Type* TypeH::xmeet(const Type* t) const { Suggestion: //------------------------------xmeet------------------------------------------- // Compute the MEET of two types. It returns a new Type object.
const Type* TypeH::xmeet(const Type* t) const { ------------- PR Review: https://git.openjdk.org/jdk/pull/22754#pullrequestreview-2592155651 PR Review Comment: https://git.openjdk.org/jdk/pull/22754#discussion_r1940766035 PR Review Comment: https://git.openjdk.org/jdk/pull/22754#discussion_r1940763403 PR Review Comment: https://git.openjdk.org/jdk/pull/22754#discussion_r1940766624 PR Review Comment: https://git.openjdk.org/jdk/pull/22754#discussion_r1940771256 PR Review Comment: https://git.openjdk.org/jdk/pull/22754#discussion_r1940771662 From jbhateja at openjdk.org Tue Feb 4 10:05:09 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 4 Feb 2025 10:05:09 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v17] In-Reply-To: References: Message-ID: > Hi All, > > This patch adds C2 compiler support for various Float16 operations added by [PR#22128](https://github.com/openjdk/jdk/pull/22128) > > Following is the summary of changes included with this patch:- > > 1. Detection of various Float16 operations through inline expansion or pattern folding idealizations. > 2. Float16 operations like add, sub, mul, div, max, and min are inferred through pattern folding idealization. > 3. Float16 SQRT and FMA operation are inferred through inline expansion and their corresponding entry points are defined in the newly added Float16Math class. > - These intrinsics receive unwrapped short arguments encoding IEEE 754 binary16 values. > 5. New specialized IR nodes for Float16 operations, associated idealizations, and constant folding routines. > 6. New Ideal type for constant and non-constant Float16 IR nodes. Please refer to [FAQs ](https://github.com/openjdk/jdk/pull/22754#issuecomment-2543982577)for more details. > 7. 
Since Float16 uses short as its storage type, hence raw FP16 values are always loaded into general purpose register, but FP16 ISA generally operates over floating point registers, thus the compiler injects reinterpretation IR before and after Float16 operation nodes to move short value to floating point register and vice versa. > 8. New idealization routines to optimize redundant reinterpretation chains. HF2S + S2HF = HF > 9. X86 backend implementation for all supported intrinsics. > 10. Functional and Performance validation tests. > > Kindly review the patch and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Fixing typos ------------- Changes: - all: https://git.openjdk.org/jdk/pull/22754/files - new: https://git.openjdk.org/jdk/pull/22754/files/8207c9ff..82a42213 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=22754&range=16 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=22754&range=15-16 Stats: 13 lines in 3 files changed: 0 ins; 0 del; 13 mod Patch: https://git.openjdk.org/jdk/pull/22754.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22754/head:pull/22754 PR: https://git.openjdk.org/jdk/pull/22754 From jbhateja at openjdk.org Tue Feb 4 10:05:11 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 4 Feb 2025 10:05:11 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v16] In-Reply-To: References: Message-ID: On Mon, 3 Feb 2025 18:11:11 GMT, Jatin Bhateja wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Update test/micro/org/openjdk/bench/jdk/incubator/vector/Float16OperationsBenchmark.java >> >> Co-authored-by: Emanuel Peter > > Hi @PaulSandoz , @eme64 , All outstanding comments have been addressed, please let me know if there are other comments. > @jatin-bhateja Testing is all green :green_circle:
Doing a last pass over the code. Thanks @eme64, looking forward to your approval :-) ------------- PR Comment: https://git.openjdk.org/jdk/pull/22754#issuecomment-2633414710 From jbhateja at openjdk.org Tue Feb 4 10:05:11 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 4 Feb 2025 10:05:11 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v16] In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 09:03:09 GMT, Emanuel Peter wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Update test/micro/org/openjdk/bench/jdk/incubator/vector/Float16OperationsBenchmark.java >> >> Co-authored-by: Emanuel Peter > > src/hotspot/share/opto/convertnode.cpp line 971: > >> 969: return true; >> 970: default: >> 971: return false; > > Does this cover all cases? What about `FmaHF`? FmaHF is a ternary operation and is intrinsified. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22754#discussion_r1940855109 From adinn at openjdk.org Tue Feb 4 11:48:09 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Tue, 4 Feb 2025 11:48:09 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v2] In-Reply-To: References: <7UgNYEuTu6rj7queOgM9xIy-6kQMdACrZiDLtlniMYw=.dff6f18b-1236-43b1-8280-2bce9160f32a@github.com> Message-ID: On Mon, 3 Feb 2025 18:11:51 GMT, Ferenc Rakoczi wrote: >> @ferakocz I'm afraid you lucked out on getting your change committed before my reorganization of the stub generation code. If you are unsure of how to do the merge so your new stub is declared and generated following the new model (see the doc comments in stubDeclarations.hpp for details) let me know and I'll be happy to help you sort it out. > >> @ferakocz I'm afraid you lucked out on getting your change committed before my reorganization of the stub generation code. 
If you are unsure of how to do the merge so your new stub is declared and generated following the new model (see the doc comments in stubDeclarations.hpp for details) let me know and I'll be happy to help you sort it out. > > @adinn I think I managed to figure it out. Please take a look at the PR and let me know if I should have done anything differently. @ferakocz Yes, the stub declaration part of it looks to be correct. The rest of the patch will need at least two reviewers (@theRealAph? @martinuy? @franferrax) and may take some time to review, given that they will probably need to read up on the maths and algorithms. As an aid for reviewers and maintainers it would be good to insert a comment into the generator file linking the implementations to the relevant maths and algorithm. I found the FIPS-204 spec and the CRYSTALS-Dilithium Algorithm Specifications and Supporting Documentation paper, Shi Bai, Léo Ducas et al, 2021 - are they the best ones to look at?
It should really be an unsigned short though. > > There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. > > Tested with tier1-8. src/hotspot/share/oops/arrayKlass.hpp line 2: > 1: /* > 2: * Copyright (c) 1997, 2025, Oracle and/or its affiliates. All rights reserved. arrayKlass.hpp isn't changed, is this left over from a previous iteration? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1941185550 From alanb at openjdk.org Tue Feb 4 14:00:13 2025 From: alanb at openjdk.org (Alan Bateman) Date: Tue, 4 Feb 2025 14:00:13 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native In-Reply-To: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: On Mon, 9 Dec 2024 19:26:53 GMT, Coleen Phillimore wrote: > The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. > > There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. > > Tested with tier1-8. Good cleanup. 
src/java.base/share/classes/java/lang/Class.java line 244: > 242: classLoader = loader; > 243: componentType = arrayComponentType; > 244: modifiers = dummyModifiers; I realize this ctor isn't used but "dummyModifiers" looks very strange as parameter name when compared to the others, renaming it to something like "mods" would make it less confusing for anyone reading through this code. ------------- Marked as reviewed by alanb (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/22652#pullrequestreview-2592938860 PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1941220263 From coleenp at openjdk.org Tue Feb 4 14:43:51 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Tue, 4 Feb 2025 14:43:51 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v2] In-Reply-To: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: > The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. > > There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. > > Tested with tier1-8. 
Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: Fix copyright and param name ------------- Changes: - all: https://git.openjdk.org/jdk/pull/22652/files - new: https://git.openjdk.org/jdk/pull/22652/files/8854fcc6..ff693418 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=22652&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=22652&range=00-01 Stats: 3 lines in 2 files changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/22652.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22652/head:pull/22652 PR: https://git.openjdk.org/jdk/pull/22652 From coleenp at openjdk.org Tue Feb 4 14:43:51 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Tue, 4 Feb 2025 14:43:51 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v2] In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: On Tue, 4 Feb 2025 14:40:47 GMT, Coleen Phillimore wrote: >> The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. >> >> There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. >> >> Tested with tier1-8. 
> > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Fix copyright and param name Thank you for your comments, Alan. ------------- PR Review: https://git.openjdk.org/jdk/pull/22652#pullrequestreview-2593075666 From coleenp at openjdk.org Tue Feb 4 14:43:51 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Tue, 4 Feb 2025 14:43:51 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v2] In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: On Tue, 4 Feb 2025 13:36:44 GMT, Alan Bateman wrote: >> Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix copyright and param name > > src/hotspot/share/oops/arrayKlass.hpp line 2: > >> 1: /* >> 2: * Copyright (c) 1997, 2025, Oracle and/or its affiliates. All rights reserved. > > arrayKlass.hpp isn't changed, is this left over from a previous iteration? yes, it was something that my copyright script thought I changed from merging some previous changes. > src/java.base/share/classes/java/lang/Class.java line 244: > >> 242: classLoader = loader; >> 243: componentType = arrayComponentType; >> 244: modifiers = dummyModifiers; > > I realize this ctor isn't used but "dummyModifiers" looks very strange as parameter name when compared to the others, renaming it to something like "mods" would make it less confusing for anyone reading through this code. I changed it to mods. Thanks for the suggestion. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1941301152 PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1941302820 From never at openjdk.org Tue Feb 4 16:36:26 2025 From: never at openjdk.org (Tom Rodriguez) Date: Tue, 4 Feb 2025 16:36:26 GMT Subject: RFR: 8349374: [JVMCI] concurrent use of HotSpotSpeculationLog can crash Message-ID: This ensures that collectFailedSpeculations sees the initialization of the recently allocated failedSpeculationsAddress memory. ------------- Commit messages: - 8349374: [JVMCI] concurrent use of HotSpotSpeculationLog can crash Changes: https://git.openjdk.org/jdk/pull/23444/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23444&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8349374 Stats: 8 lines in 1 file changed: 5 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/23444.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23444/head:pull/23444 PR: https://git.openjdk.org/jdk/pull/23444 From dnsimon at openjdk.org Tue Feb 4 16:58:09 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Tue, 4 Feb 2025 16:58:09 GMT Subject: RFR: 8349374: [JVMCI] concurrent use of HotSpotSpeculationLog can crash In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 16:31:50 GMT, Tom Rodriguez wrote: > This ensures that collectFailedSpeculations sees the initialization of the recently allocated failedSpeculationsAddress memory. src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotSpeculationLog.java line 179: > 177: } > 178: > 179: if (UnsafeAccess.UNSAFE.getLong(getFailedSpeculationsAddress()) != 0) { It's still possible for `getFailedSpeculationsAddress()` to return 0 (i.e. when `managesFailedSpeculations` is `false`). 
So I think this should be: diff --git a/src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotSpeculationLog.java b/src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotSpeculationLog.java index fd46e281c3b..a861c00d77d 100644 --- a/src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotSpeculationLog.java +++ b/src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotSpeculationLog.java @@ -171,8 +171,9 @@ public String toString() { @Override public void collectFailedSpeculations() { - if (failedSpeculationsAddress != 0 && UnsafeAccess.UNSAFE.getLong(failedSpeculationsAddress) != 0) { - failedSpeculations = compilerToVM().getFailedSpeculations(failedSpeculationsAddress, failedSpeculations); + long address = getFailedSpeculationsAddress(); + if (address != 0 && UnsafeAccess.UNSAFE.getLong(address) != 0) { + failedSpeculations = compilerToVM().getFailedSpeculations(address, failedSpeculations); assert failedSpeculations.getClass() == byte[][].class; } } ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23444#discussion_r1941551882 From never at openjdk.org Tue Feb 4 17:41:14 2025 From: never at openjdk.org (Tom Rodriguez) Date: Tue, 4 Feb 2025 17:41:14 GMT Subject: RFR: 8349374: [JVMCI] concurrent use of HotSpotSpeculationLog can crash In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 16:54:58 GMT, Doug Simon wrote: >> This ensures that collectFailedSpeculations sees the initialization of the recently allocated failedSpeculationsAddress memory. > > src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotSpeculationLog.java line 179: > >> 177: } >> 178: >> 179: if (UnsafeAccess.UNSAFE.getLong(getFailedSpeculationsAddress()) != 0) { > > It's still possible for `getFailedSpeculationsAddress()` to return 0 (i.e. when `managesFailedSpeculations` is `false`). 
So I think this should be: > > diff --git a/src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotSpeculationLog.java b/src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotSpeculationLog.java > index fd46e281c3b..a861c00d77d 100644 > --- a/src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotSpeculationLog.java > +++ b/src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotSpeculationLog.java > @@ -171,8 +171,9 @@ public String toString() { > > @Override > public void collectFailedSpeculations() { > - if (failedSpeculationsAddress != 0 && UnsafeAccess.UNSAFE.getLong(failedSpeculationsAddress) != 0) { > - failedSpeculations = compilerToVM().getFailedSpeculations(failedSpeculationsAddress, failedSpeculations); > + long address = getFailedSpeculationsAddress(); > + if (address != 0 && UnsafeAccess.UNSAFE.getLong(address) != 0) { > + failedSpeculations = compilerToVM().getFailedSpeculations(address, failedSpeculations); > assert failedSpeculations.getClass() == byte[][].class; > } > } I'm filtering out 0 above this line and `getFailedSpeculationsAddress()` can't return 0 if `failedSpeculationsAddress` is already non-zero. `failedSpeculationsAddress` also can't be 0 if `managesFailedSpeculations` is false since we throw `IllegalArgumentException` in that case. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23444#discussion_r1941612206 From epeter at openjdk.org Tue Feb 4 18:47:25 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 4 Feb 2025 18:47:25 GMT Subject: RFR: 8348411: C2: Remove the control input of LoadKlassNode and LoadNKlassNode [v4] In-Reply-To: <9IQfA04XvFFNu5R3LMh6dYWPAbf7D-VoAb2O0Gt0b8Q=.22b568c9-2a45-41b7-91d7-11c22bb6abe0@github.com> References: <9IQfA04XvFFNu5R3LMh6dYWPAbf7D-VoAb2O0Gt0b8Q=.22b568c9-2a45-41b7-91d7-11c22bb6abe0@github.com> Message-ID: On Thu, 30 Jan 2025 17:11:08 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch removes the control input of `LoadKlassNode` and `LoadNKlassNode`. 
They can only have a control input if created inside `Parse::array_store_check()`; the reason given is: >> >> // We are allowed to use the constant type only if cast succeeded >> >> But this seems incorrect: the load from the constant type can be done regardless, and it will be constant-folded. This patch only makes that more formal and cleans up `LoadKlassNode::can_remove_control`. >> >> Please take a look and leave your reviews, thanks a lot. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > format Nice cleanup. Though it looks like you are doing more than remove the ctrl input. I don't know the code very well, so I have some questions ;) src/hotspot/share/opto/parseHelper.cpp line 170: > 168: !too_many_traps(Deoptimization::Reason_array_check) && > 169: !tak->klass_is_exact() && > 170: tak->isa_aryklassptr()) { Looks like an implicit `nullptr` check. Not allowed by code style ;)
PR Review: https://git.openjdk.org/jdk/pull/23274#pullrequestreview-2593742600 PR Review Comment: https://git.openjdk.org/jdk/pull/23274#discussion_r1941714615 PR Review Comment: https://git.openjdk.org/jdk/pull/23274#discussion_r1941719070 From epeter at openjdk.org Tue Feb 4 18:47:25 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 4 Feb 2025 18:47:25 GMT Subject: RFR: 8348411: C2: Remove the control input of LoadKlassNode and LoadNKlassNode [v4] In-Reply-To: References: <9IQfA04XvFFNu5R3LMh6dYWPAbf7D-VoAb2O0Gt0b8Q=.22b568c9-2a45-41b7-91d7-11c22bb6abe0@github.com> Message-ID: On Tue, 4 Feb 2025 18:39:32 GMT, Emanuel Peter wrote: >> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: >> >> format > > src/hotspot/share/opto/parseHelper.cpp line 170: > >> 168: !too_many_traps(Deoptimization::Reason_array_check) && >> 169: !tak->klass_is_exact() && >> 170: tak->isa_aryklassptr()) { > > Looks like an implicit `nullptr` check. Not allowed by code style ;) Can you quickly explain this change from `tak != TypeInstKlassPtr::OBJECT` so I don't need to investigate myself, please? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23274#discussion_r1941715309 From qamai at openjdk.org Tue Feb 4 18:57:10 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Tue, 4 Feb 2025 18:57:10 GMT Subject: RFR: 8348411: C2: Remove the control input of LoadKlassNode and LoadNKlassNode [v4] In-Reply-To: References: <9IQfA04XvFFNu5R3LMh6dYWPAbf7D-VoAb2O0Gt0b8Q=.22b568c9-2a45-41b7-91d7-11c22bb6abe0@github.com> Message-ID: <067rRrzD6d7ZDU-HYPHQ-qVhPygP_3WqrrgZvikgjIc=.98110421-5c91-492a-8f35-a9544cde6189@github.com> On Tue, 4 Feb 2025 18:40:05 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/parseHelper.cpp line 170: >> >>> 168: !too_many_traps(Deoptimization::Reason_array_check) && >>> 169: !tak->klass_is_exact() && >>> 170: tak->isa_aryklassptr()) { >> >> Looks like an implicit `nullptr` check. 
Not allowed by code style ;) > > Can you quickly explain this change from `tak != TypeInstKlassPtr::OBJECT` so I don't need to investigate myself, please? > Looks like an implicit nullptr check. Not allowed by code style ;) But the verb here is `isa` and we use these as a `bool` a lot, though :/ > Can you quickly explain this change from tak != TypeInstKlassPtr::OBJECT so I don't need to investigate myself, please? The bottom type of an array can be either `Object` or an array of some kind, so `tak != TypeInstKlassPtr::OBJECT` is the same as `tak->isa_aryklassptr()`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23274#discussion_r1941732694 From qamai at openjdk.org Tue Feb 4 18:57:11 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Tue, 4 Feb 2025 18:57:11 GMT Subject: RFR: 8348411: C2: Remove the control input of LoadKlassNode and LoadNKlassNode [v4] In-Reply-To: References: <9IQfA04XvFFNu5R3LMh6dYWPAbf7D-VoAb2O0Gt0b8Q=.22b568c9-2a45-41b7-91d7-11c22bb6abe0@github.com> Message-ID: On Tue, 4 Feb 2025 18:43:04 GMT, Emanuel Peter wrote: >> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: >> >> format > > src/hotspot/share/opto/parseHelper.cpp line 193: > >> 191: // See issue JDK-8057622 for details. >> 192: >> 193: always_see_exact_class = true; > > Why is it ok to remove this? > If this branch is not taken, it used to be `false`, and would lead to something different below... The only use of this is to decide if we need to attach a control input to the `LoadKlass`. As the control input is not needed, this can be removed. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23274#discussion_r1941735400 From duke at openjdk.org Tue Feb 4 19:00:33 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Tue, 4 Feb 2025 19:00:33 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v2] In-Reply-To: References: <7UgNYEuTu6rj7queOgM9xIy-6kQMdACrZiDLtlniMYw=.dff6f18b-1236-43b1-8280-2bce9160f32a@github.com> Message-ID: On Mon, 3 Feb 2025 18:11:51 GMT, Ferenc Rakoczi wrote: >> @ferakocz I'm afraid you lucked out on getting your change committed before my reorganization of the stub generation code. If you are unsure of how to do the merge so your new stub is declared and generated following the new model (see the doc comments in stubDeclarations.hpp for details) let me know and I'll be happy to help you sort it out. > >> @ferakocz I'm afraid you lucked out on getting your change committed before my reorganization of the stub generation code. If you are unsure of how to do the merge so your new stub is declared and generated following the new model (see the doc comments in stubDeclarations.hpp for details) let me know and I'll be happy to help you sort it out. > > @adinn I think I managed to figure it out. Please take a look at the PR and let me know if I should have done anything differently. > @ferakocz Yes, the stub declaration part of it looks to be correct. > > The rest of the patch will need at least two reviewers (@theRealAph? @martinuy? @franferrax) and may take some time to review, given that they will probably need to read up on the maths and algorithms. As an aid for reviewers and maintainers it would be good to insert a comment into the generator file linking the implementations to the relevant maths and algorithm. I found the FIPS-204 spec and the CRYSTALS-Dilithium Algorithm Specifications and Supporting Documentation paper, Shi Bai, Léo Ducas et al, 2021 - are they the best ones to look at?
The Java implementation of ML-DSA is based on the FIPS-204 standard and the intrinsics' implementations are based on the corresponding Java methods, except that the montMul() calls in them are inlined. The rest of the transformation from Java code to intrinsic code is pretty straightforward, so a reviewer need not necessarily understand the whole mathematics of the ML-DSA algorithms, just that the Java and the corresponding intrinsic code do the same thing. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23300#issuecomment-2634810518 From epeter at openjdk.org Tue Feb 4 19:09:35 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 4 Feb 2025 19:09:35 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v17] In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 10:05:09 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch adds C2 compiler support for various Float16 operations added by [PR#22128](https://github.com/openjdk/jdk/pull/22128) >> >> Following is the summary of changes included with this patch:- >> >> 1. Detection of various Float16 operations through inline expansion or pattern folding idealizations. >> 2. Float16 operations like add, sub, mul, div, max, and min are inferred through pattern folding idealization. >> 3. Float16 SQRT and FMA operations are inferred through inline expansion and their corresponding entry points are defined in the newly added Float16Math class. >> - These intrinsics receive unwrapped short arguments encoding IEEE 754 binary16 values. >> 5. New specialized IR nodes for Float16 operations, associated idealizations, and constant folding routines. >> 6. New Ideal type for constant and non-constant Float16 IR nodes. Please refer to [FAQs](https://github.com/openjdk/jdk/pull/22754#issuecomment-2543982577) for more details. >> 7.
Since Float16 uses short as its storage type, raw FP16 values are always loaded into a general purpose register, but the FP16 ISA generally operates on floating point registers, so the compiler injects reinterpretation IR before and after Float16 operation nodes to move the short value to a floating point register and vice versa. >> 8. New idealization routines to optimize redundant reinterpretation chains. HF2S + S2HF = HF >> 9. X86 backend implementation for all supported intrinsics. >> 10. Functional and Performance validation tests. >> >> Kindly review the patch and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Fixing typos Thanks @jatin-bhateja for all your patience, this really took a while! It looks good to me - again I'm only reviewing the C++ VM changes, so someone else has to review the Java changes. ------------- Marked as reviewed by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/22754#pullrequestreview-2593800414 From epeter at openjdk.org Tue Feb 4 19:09:36 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 4 Feb 2025 19:09:36 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v16] In-Reply-To: References: Message-ID: <7WobCDj_e4Sw1CEYr3EVfgHTxJoxBfiFR63WwrzDDzs=.27e926d0-23e6-4231-a677-fdfd683083be@github.com> On Tue, 4 Feb 2025 09:56:15 GMT, Jatin Bhateja wrote: >> src/hotspot/share/opto/convertnode.cpp line 971: >> >>> 969: return true; >>> 970: default: >>> 971: return false; >> >> Does this cover all cases? What about `FmaHF`? > > FmaHF is a ternary operation and is intrinsified. Ah, right. My bad!
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22754#discussion_r1941748224 From liach at openjdk.org Tue Feb 4 19:21:44 2025 From: liach at openjdk.org (Chen Liang) Date: Tue, 4 Feb 2025 19:21:44 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v17] In-Reply-To: References: Message-ID: <7oq7j2pYG9ToDNcGyVWrphH_wFyvPRX2kl3qxgQYBss=.449139d7-e3a8-4587-b5ce-a5f7f9f5b613@github.com> On Tue, 4 Feb 2025 10:05:09 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch adds C2 compiler support for various Float16 operations added by [PR#22128](https://github.com/openjdk/jdk/pull/22128) >> >> Following is the summary of changes included with this patch:- >> >> 1. Detection of various Float16 operations through inline expansion or pattern folding idealizations. >> 2. Float16 operations like add, sub, mul, div, max, and min are inferred through pattern folding idealization. >> 3. Float16 SQRT and FMA operations are inferred through inline expansion and their corresponding entry points are defined in the newly added Float16Math class. >> - These intrinsics receive unwrapped short arguments encoding IEEE 754 binary16 values. >> 5. New specialized IR nodes for Float16 operations, associated idealizations, and constant folding routines. >> 6. New Ideal type for constant and non-constant Float16 IR nodes. Please refer to [FAQs](https://github.com/openjdk/jdk/pull/22754#issuecomment-2543982577) for more details. >> 7. Since Float16 uses short as its storage type, raw FP16 values are always loaded into a general purpose register, but the FP16 ISA generally operates on floating point registers, so the compiler injects reinterpretation IR before and after Float16 operation nodes to move the short value to a floating point register and vice versa. >> 8. New idealization routines to optimize redundant reinterpretation chains. HF2S + S2HF = HF >> 9. X86 backend implementation for all supported intrinsics. >> 10.
Functional and Performance validation tests. >> >> Kindly review the patch and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Fixing typos src/java.base/share/classes/jdk/internal/vm/vector/Float16Math.java line 42: > 40: } > 41: > 42: public interface Float16TernaryMathOp { Is there a reason we don't write the default impl explicitly in this class, but ask for a lambda for an implementation? Each intrinsified method only has one default impl, so I think we can just inline that into the method body here. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22754#discussion_r1941764924 From dnsimon at openjdk.org Tue Feb 4 19:39:15 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Tue, 4 Feb 2025 19:39:15 GMT Subject: RFR: 8349374: [JVMCI] concurrent use of HotSpotSpeculationLog can crash In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 16:31:50 GMT, Tom Rodriguez wrote: > This ensures that collectFailedSpeculations sees the initialization of the recently allocated failedSpeculationsAddress memory. Marked as reviewed by dnsimon (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/23444#pullrequestreview-2593861407 From dnsimon at openjdk.org Tue Feb 4 19:39:16 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Tue, 4 Feb 2025 19:39:16 GMT Subject: RFR: 8349374: [JVMCI] concurrent use of HotSpotSpeculationLog can crash In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 17:38:40 GMT, Tom Rodriguez wrote: > `failedSpeculationsAddress` also can't be 0 if `managesFailedSpeculations` is false since we throw `IllegalArgumentException` in that case. Ok, I'd forgotten about that invariant. Might be worth reminding the reader of it with a comment in collectFailedSpeculations. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23444#discussion_r1941785813 From never at openjdk.org Tue Feb 4 20:52:37 2025 From: never at openjdk.org (Tom Rodriguez) Date: Tue, 4 Feb 2025 20:52:37 GMT Subject: RFR: 8349374: [JVMCI] concurrent use of HotSpotSpeculationLog can crash [v2] In-Reply-To: References: Message-ID: > This ensures that collectFailedSpeculations sees the initialization of the recently allocated failedSpeculationsAddress memory. Tom Rodriguez has updated the pull request incrementally with one additional commit since the last revision: improve javadoc ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23444/files - new: https://git.openjdk.org/jdk/pull/23444/files/aefc1dfd..459f5c36 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23444&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23444&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23444.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23444/head:pull/23444 PR: https://git.openjdk.org/jdk/pull/23444 From never at openjdk.org Tue Feb 4 20:52:37 2025 From: never at openjdk.org (Tom Rodriguez) Date: Tue, 4 Feb 2025 20:52:37 GMT Subject: RFR: 8349374: [JVMCI] concurrent use of HotSpotSpeculationLog can crash [v2] In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 19:36:36 GMT, Doug Simon wrote: >> I'm filtering out 0 above this line and `getFailedSpeculationsAddress()` can't return 0 if `failedSpeculationsAddress` is already non-zero. >> >> `failedSpeculationsAddress` also can't be 0 if `managesFailedSpeculations` is false since we throw `IllegalArgumentException` in that case. > >> `failedSpeculationsAddress` also can't be 0 if `managesFailedSpeculations` is false since we throw `IllegalArgumentException` in that case. > > Ok, I'd forgotten about that invariant. Might be worth reminding the reader of it with a comment in collectFailedSpeculations. 
It's already the style in other places like the call to addFailedSpeculation so I'm not sure it's worth calling out here. I've updated the javadoc for getFailedSpeculationsAddress to specify that it always returns non-zero. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23444#discussion_r1941877712 From never at openjdk.org Tue Feb 4 20:56:53 2025 From: never at openjdk.org (Tom Rodriguez) Date: Tue, 4 Feb 2025 20:56:53 GMT Subject: RFR: 8349374: [JVMCI] concurrent use of HotSpotSpeculationLog can crash [v3] In-Reply-To: References: Message-ID: > This ensures that collectFailedSpeculations sees the initialization of the recently allocated failedSpeculationsAddress memory. Tom Rodriguez has updated the pull request incrementally with one additional commit since the last revision: improve comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23444/files - new: https://git.openjdk.org/jdk/pull/23444/files/459f5c36..5a5fd6fc Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23444&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23444&range=01-02 Stats: 6 lines in 1 file changed: 4 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/23444.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23444/head:pull/23444 PR: https://git.openjdk.org/jdk/pull/23444 From dlong at openjdk.org Wed Feb 5 01:13:20 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 5 Feb 2025 01:13:20 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v2] In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: On Tue, 4 Feb 2025 14:43:51 GMT, Coleen Phillimore wrote: >> The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. 
The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. >> >> There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. >> >> Tested with tier1-8. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Fix copyright and param name test/micro/org/openjdk/bench/java/lang/reflect/Clazz.java line 73: > 71: public int getAppArrayModifiers() { > 72: return clazzArray.getClass().getModifiers(); > 73: } I'm guessing this is the benchmark that shows an extra load. How about adding a benchmark that makes the Clazz[] final or @Stable, and see if that makes the extra load go away? 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1942114565 From jbhateja at openjdk.org Wed Feb 5 07:09:15 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 5 Feb 2025 07:09:15 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v17] In-Reply-To: <7oq7j2pYG9ToDNcGyVWrphH_wFyvPRX2kl3qxgQYBss=.449139d7-e3a8-4587-b5ce-a5f7f9f5b613@github.com> References: <7oq7j2pYG9ToDNcGyVWrphH_wFyvPRX2kl3qxgQYBss=.449139d7-e3a8-4587-b5ce-a5f7f9f5b613@github.com> Message-ID: On Tue, 4 Feb 2025 19:18:39 GMT, Chen Liang wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Fixing typos > > src/java.base/share/classes/jdk/internal/vm/vector/Float16Math.java line 42: > >> 40: } >> 41: >> 42: public interface Float16TernaryMathOp { > > Is there a reason we don't write the default impl explicitly in this class, but ask for a lambda for an implementation? Each intrinsified method only has one default impl, so I think we can just inline that into the method body here. This wrapper class is part of the java.base module and only contains intrinsic entry points for APIs defined in the Float16 class, which is part of an incubation module. Thus, exposing intrinsic fallback code through a lambda keeps the interface clean while the actual API logic and comments around it remain intact in the Float16 class. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22754#discussion_r1942344948 From epeter at openjdk.org Wed Feb 5 09:30:21 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 5 Feb 2025 09:30:21 GMT Subject: RFR: 8348411: C2: Remove the control input of LoadKlassNode and LoadNKlassNode [v4] In-Reply-To: <067rRrzD6d7ZDU-HYPHQ-qVhPygP_3WqrrgZvikgjIc=.98110421-5c91-492a-8f35-a9544cde6189@github.com> References: <9IQfA04XvFFNu5R3LMh6dYWPAbf7D-VoAb2O0Gt0b8Q=.22b568c9-2a45-41b7-91d7-11c22bb6abe0@github.com> <067rRrzD6d7ZDU-HYPHQ-qVhPygP_3WqrrgZvikgjIc=.98110421-5c91-492a-8f35-a9544cde6189@github.com> Message-ID: On Tue, 4 Feb 2025 18:52:13 GMT, Quan Anh Mai wrote: >> Can you quickly explain this change from `tak != TypeInstKlassPtr::OBJECT` so I don't need to investigate myself, please? > >> Looks like an implicit nullptr check. Not allowed by code style ;) > > But the verb here is `isa` and we use these as a `bool` a lot, though :/ > >> Can you quickly explain this change from tak != TypeInstKlassPtr::OBJECT so I don't need to investigate myself, please? > > The bottom type of an array can be either `Object` or an array of some kind, so `tak != TypeInstKlassPtr::OBJECT` is the same as `tak->isa_aryklassptr()`. Ah great, thanks for the explanation! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23274#discussion_r1942520434 From epeter at openjdk.org Wed Feb 5 09:35:17 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 5 Feb 2025 09:35:17 GMT Subject: RFR: 8348411: C2: Remove the control input of LoadKlassNode and LoadNKlassNode [v4] In-Reply-To: <9IQfA04XvFFNu5R3LMh6dYWPAbf7D-VoAb2O0Gt0b8Q=.22b568c9-2a45-41b7-91d7-11c22bb6abe0@github.com> References: <9IQfA04XvFFNu5R3LMh6dYWPAbf7D-VoAb2O0Gt0b8Q=.22b568c9-2a45-41b7-91d7-11c22bb6abe0@github.com> Message-ID: On Thu, 30 Jan 2025 17:11:08 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch removes the control input of `LoadKlassNode` and `LoadNKlassNode`. 
They can only have a control input if created inside `Parse::array_store_check()`, the reason given is: >> >> // We are allowed to use the constant type only if cast succeeded >> >> But this seems incorrect, the load from the constant type can be done regardless, and it will be constant-folded. This patch only makes that more formal and cleanup `LoadKlassNode::can_remove_control`. >> >> Please take a look and leave your reviews, thanks a lot. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > format Looks good, thanks for the explanations! I see we did not yet run internal tests for the last commit, though it is only formatting, so most likely ok. But the state of the code is also 2 weeks old, so it would be good if you merged and launched testing again before integration, just in case ;) ------------- Marked as reviewed by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23274#pullrequestreview-2595110848 From epeter at openjdk.org Wed Feb 5 09:35:18 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 5 Feb 2025 09:35:18 GMT Subject: RFR: 8348411: C2: Remove the control input of LoadKlassNode and LoadNKlassNode [v4] In-Reply-To: References: <9IQfA04XvFFNu5R3LMh6dYWPAbf7D-VoAb2O0Gt0b8Q=.22b568c9-2a45-41b7-91d7-11c22bb6abe0@github.com> Message-ID: On Tue, 4 Feb 2025 18:54:21 GMT, Quan Anh Mai wrote: >> src/hotspot/share/opto/parseHelper.cpp line 193: >> >>> 191: // See issue JDK-8057622 for details. >>> 192: >>> 193: always_see_exact_class = true; >> >> Why is it ok to remove this? >> If this branch is not taken, it used to be `false`, and would lead to something different below... > > The only use of this is to decide if we need to attach a control input to the `LoadKlass`. As the control input is not needed, this can be removed. Got it, thanks! 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23274#discussion_r1942528816 From adinn at openjdk.org Wed Feb 5 10:35:10 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Wed, 5 Feb 2025 10:35:10 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v2] In-Reply-To: References: <7UgNYEuTu6rj7queOgM9xIy-6kQMdACrZiDLtlniMYw=.dff6f18b-1236-43b1-8280-2bce9160f32a@github.com> Message-ID: On Tue, 4 Feb 2025 18:57:28 GMT, Ferenc Rakoczi wrote: >>> @ferakocz I'm afraid you lucked out on getting your change committed before my reorganization of the stub generation code. If you are unsure of how to do the merge so your new stub is declared and generated following the new model (see the doc comments in stubDeclarations.hpp for details) let me know and I'll be happy to help you sort it out. >> >> @adinn I think I managed to figure it out. Please take a look at the PR and let me know if I should have done anything differently. > >> @ferakocz Yes, the stub declaration part of it looks to be correct. >> >> The rest of the patch will need at least two reviewers (@theRealAph? @martinuy? @franferrax) and may take some time to review, given that they will probably need to read up on the maths and algorithms. As an aid for reviewers and maintainers it would be good to insert a comment into the generator file linking the implementations to the relevant maths and algorithm. I found the FIPS-204 spec and the CRYSTALS-Dilithium Algorithm Specifications and Supporting Documentation paper, Shi Bai, Léo Ducas et al, 2021 - are they the best ones to look at? > > The Java implementation of ML-DSA is based on the FIPS-204 standard and the intrinsics' implementations are based on the corresponding Java methods, except that the montMul() calls in them are inlined. 
The rest of the transformation from Java code to intrinsic code is pretty straightforward, so a reviewer need not necessarily understand the whole mathematics of the ML-DSA algorithms, just that the Java and the corresponding intrinsic code do the same thing. @ferakocz > The Java implementation of ML-DSA is based on the FIPS-204 standard and the intrinsics' implementations are based on the corresponding Java methods, except that the montMul() calls in them are inlined. The rest of the transformation from Java code to intrinsic code is pretty straightforward, so a reviewer need not necessarily understand the whole mathematics of the ML-DSA algorithms, just that the Java and the corresponding intrinsic code do the same thing. Yes, I located the relevant Java implementations in SHA3.java (keccak) and ML_DSA.java (dilithiumXXX) plus also SHA3Parallel.java (doubleKeccak). The first file does at least mention FIPS-202. The second does not include any reference, in particular does not mention FIPS-204. I still think it would be helpful for reviewers and maintainers if you were to add a comment in front of the generator routines that 1) notes that these routines are based on the relevant Java sources and 2) mentions that the Java code is in turn based on the FIPS-202 and FIPS-204 standards. While I agree that a reviewer or maintainer could simply check the generated code against the Java code I believe access to the underlying theory will be of aid when it comes to understanding what each variant is doing and verifying the equivalence of the two. That's why I'd also prefer to have two reviews to be sure that more than one of us who may be tasked with maintaining this code can be happy that we understand, at least, the equivalence in question. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23300#issuecomment-2636346476 From coleenp at openjdk.org Wed Feb 5 19:51:06 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Wed, 5 Feb 2025 19:51:06 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v2] In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: <0ZM_vg_dAmbdbeoIeZ8ylBUDj_4_jxM-aE6IKoH6ykM=.69c7554f-5e2b-40b9-8d1a-abe147548dbb@github.com> On Wed, 5 Feb 2025 01:10:39 GMT, Dean Long wrote: >> Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix copyright and param name > > test/micro/org/openjdk/bench/java/lang/reflect/Clazz.java line 73: > >> 71: public int getAppArrayModifiers() { >> 72: return clazzArray.getClass().getModifiers(); >> 73: } > > I'm guessing this is the benchmark that shows an extra load. How about adding a benchmark that makes the Clazz[] final or @Stable, and see if that makes the extra load go away? Name Cnt Base Error Test Error Unit Change getAppArrayModifiers 30 0.923 ± 0.004 1.260 ± 0.001 ns/op 0.73x (p = 0.000*) getAppArrayModifiersFinal 30 0.922 ± 0.000 1.260 ± 0.001 ns/op 0.73x (p = 0.000*) No it doesn't really help. There's still an extra load. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1943569183 From kvn at openjdk.org Wed Feb 5 20:15:10 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 5 Feb 2025 20:15:10 GMT Subject: RFR: 8349374: [JVMCI] concurrent use of HotSpotSpeculationLog can crash [v3] In-Reply-To: References: Message-ID: <1GBWBQfWNIwLEF26VW0tecseBegwuuRUDG-rNg1zdoU=.63aa4380-5c75-45ac-86dd-9c9fe308b9dc@github.com> On Tue, 4 Feb 2025 20:56:53 GMT, Tom Rodriguez wrote: >> This ensures that collectFailedSpeculations sees the initialization of the recently allocated failedSpeculationsAddress memory. 
> > Tom Rodriguez has updated the pull request incrementally with one additional commit since the last revision: > > improve comments Seems fine. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23444#pullrequestreview-2596877352 From dlong at openjdk.org Wed Feb 5 20:26:16 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 5 Feb 2025 20:26:16 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v2] In-Reply-To: <0ZM_vg_dAmbdbeoIeZ8ylBUDj_4_jxM-aE6IKoH6ykM=.69c7554f-5e2b-40b9-8d1a-abe147548dbb@github.com> References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> <0ZM_vg_dAmbdbeoIeZ8ylBUDj_4_jxM-aE6IKoH6ykM=.69c7554f-5e2b-40b9-8d1a-abe147548dbb@github.com> Message-ID: <0efX7bcHNl5p1RoF3VnqZIabdavsGosuMI14cZPDzbQ=.2bde6bbf-a59b-4f5b-9c68-7a8a258b2ee5@github.com> On Wed, 5 Feb 2025 19:42:02 GMT, Coleen Phillimore wrote: >> test/micro/org/openjdk/bench/java/lang/reflect/Clazz.java line 73: >> >>> 71: public int getAppArrayModifiers() { >>> 72: return clazzArray.getClass().getModifiers(); >>> 73: } >> >> I'm guessing this is the benchmark that shows an extra load. How about adding a benchmark that makes the Clazz[] final or @Stable, and see if that makes the extra load go away? > > Name Cnt Base Error Test Error Unit Change > getAppArrayModifiers 30 0.923 ± 0.004 1.260 ± 0.001 ns/op 0.73x (p = 0.000*) > getAppArrayModifiersFinal 30 0.922 ± 0.000 1.260 ± 0.001 ns/op 0.73x (p = 0.000*) > > No it doesn't really help. There's still an extra load. OK, if the extra load turns out to be a problem in the future, we could look into why the compilers are generating the load when the Class is known/constant. If the old intrinsic was able to pull the constant out of the Klass, then surely we can do the same and pull the value from the Class field. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1943616021 From dlong at openjdk.org Wed Feb 5 21:29:12 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 5 Feb 2025 21:29:12 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v2] In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: <1yPHOj_hANp7ZvMfmgi6lRkpokgNNaUSc09FJfZvWk8=.bfcf2780-4afe-4253-ae0b-e3bc6ab7ee86@github.com> On Tue, 4 Feb 2025 14:43:51 GMT, Coleen Phillimore wrote: >> The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. >> >> There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. >> >> Tested with tier1-8. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Fix copyright and param name src/hotspot/share/compiler/compileLog.cpp line 116: > 114: print(" unloaded='1'"); > 115: } else { > 116: print(" flags='%d'", klass->access_flags()); There may be tools that parse the log file and get confused by this change. Maybe we should also change the label from "flags" to "access flags". 
src/hotspot/share/jfr/recorder/checkpoint/types/jfrTypeSet.cpp line 350: > 348: writer->write(mark_symbol(klass, leakp)); > 349: writer->write(package_id(klass, leakp)); > 350: writer->write(klass->compute_modifier_flags()); Isn't this much more expensive than grabbing the value from the mirror, especially if we have to iterate over inner classes? src/hotspot/share/oops/instanceKlass.hpp line 1128: > 1126: #endif > 1127: > 1128: int compute_modifier_flags() const; I don't see why this can't stay u2. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1943680670 PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1943679056 PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1943682936 From dlong at openjdk.org Wed Feb 5 21:43:14 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 5 Feb 2025 21:43:14 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v2] In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: On Tue, 4 Feb 2025 14:43:51 GMT, Coleen Phillimore wrote: >> The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. >> >> There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. >> >> Tested with tier1-8. 
> > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Fix copyright and param name src/hotspot/share/opto/memnode.cpp line 2458: > 2456: return TypePtr::NULL_PTR; > 2457: } > 2458: // ??? I suspect that we still need this code to support intrinsics like LibraryCallKit::inline_native_classID() and maybe other users of this field, but the comment below no longer makes sense. src/hotspot/share/opto/memnode.cpp line 2459: > 2457: } > 2458: // ??? > 2459: // (Folds up the 1st indirection in aClassConstant.getModifiers().) Suggestion: // Fold up the load of the hidden field ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1943695585 PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1943696867 From dlong at openjdk.org Wed Feb 5 21:47:12 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 5 Feb 2025 21:47:12 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v2] In-Reply-To: <8Wx3xbbOnPXS5n1RuNaesqHbhKV3iLwrCVF0s6uWOrA=.cb20728e-e13c-4667-822b-3ba424cbc12f@github.com> References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> <8Wx3xbbOnPXS5n1RuNaesqHbhKV3iLwrCVF0s6uWOrA=.cb20728e-e13c-4667-822b-3ba424cbc12f@github.com> Message-ID: On Thu, 12 Dec 2024 10:16:01 GMT, Viktor Klang wrote: >> @viktorklang-ora `@Stable` is not about how the field was set, but about the JIT observing a non-default value at compile time. If it observes a non-default value, it can treat it as a compile time constant. > > @DanHeidinga Great explanation, thank you! If Class had other fields smaller than `int`, would we consider making this something like `char` to save space (allowing all the sub-word fields to be compacted)? 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1943701237 From dlong at openjdk.org Wed Feb 5 21:53:11 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 5 Feb 2025 21:53:11 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v2] In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: On Tue, 4 Feb 2025 14:43:51 GMT, Coleen Phillimore wrote: >> The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. >> >> There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. >> >> Tested with tier1-8. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Fix copyright and param name Overall looks good to me. Please ask @iwanowww to review compiler changes. ------------- Marked as reviewed by dlong (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/22652#pullrequestreview-2597046622 From liach at openjdk.org Thu Feb 6 04:40:11 2025 From: liach at openjdk.org (Chen Liang) Date: Thu, 6 Feb 2025 04:40:11 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v2] In-Reply-To: <0efX7bcHNl5p1RoF3VnqZIabdavsGosuMI14cZPDzbQ=.2bde6bbf-a59b-4f5b-9c68-7a8a258b2ee5@github.com> References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> <0ZM_vg_dAmbdbeoIeZ8ylBUDj_4_jxM-aE6IKoH6ykM=.69c7554f-5e2b-40b9-8d1a-abe147548dbb@github.com> <0efX7bcHNl5p1RoF3VnqZIabdavsGosuMI14cZPDzbQ=.2bde6bbf-a59b-4f5b-9c68-7a8a258b2ee5@github.com> Message-ID: <7KdNVSXLx0N027uyQgtUuN82VpXTlyPpPOnBv3sqYRs=.6b549b56-36f9-4ab3-8469-4779d93dd1e7@github.com> On Wed, 5 Feb 2025 20:23:05 GMT, Dean Long wrote: >> Name Cnt Base Error Test Error Unit Change >> getAppArrayModifiers 30 0.923 ± 0.004 1.260 ± 0.001 ns/op 0.73x (p = 0.000*) >> getAppArrayModifiersFinal 30 0.922 ± 0.000 1.260 ± 0.001 ns/op 0.73x (p = 0.000*) >> >> No it doesn't really help. There's still an extra load. > > OK, if the extra load turns out to be a problem in the future, we could look into why the compilers are generating the load when the Class is known/constant. If the old intrinsic was able to pull the constant out of the Klass, then surely we can do the same and pull the value from the Class field. Does `static final` help here? 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1944083490 From coleenp at openjdk.org Thu Feb 6 12:11:14 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 6 Feb 2025 12:11:14 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v2] In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> <8Wx3xbbOnPXS5n1RuNaesqHbhKV3iLwrCVF0s6uWOrA=.cb20728e-e13c-4667-822b-3ba424cbc12f@github.com> Message-ID: <4aAX8rSEcvkeYteaJUXHfVEzBbNGwGlhDLIz548dFcs=.616fa7dd-d5bf-42d5-aca0-0bea0b5591d0@github.com> On Wed, 5 Feb 2025 21:44:37 GMT, Dean Long wrote: >> @DanHeidinga Great explanation, thank you! > > If Class had other fields smaller than `int`, would be consider making this something like `char` to save space (allowing all the sub-word fields to be compacted)? I thought of doing this since I made modifiers u2 in the Hotspot code just previously, but all the Java code refers to this as an int. And I didn't see other fields to compact it with. Maybe if access_flags are moved we could make them both char (not short since they're unsigned). It feels weird to not have unsigned short to my C++ eyes. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1944613105 From coleenp at openjdk.org Thu Feb 6 13:13:29 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 6 Feb 2025 13:13:29 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v3] In-Reply-To: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: > The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. 
The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. > > There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. > > Tested with tier1-8. Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: Update src/hotspot/share/opto/memnode.cpp Co-authored-by: Dean Long <17332032+dean-long at users.noreply.github.com> ------------- Changes: - all: https://git.openjdk.org/jdk/pull/22652/files - new: https://git.openjdk.org/jdk/pull/22652/files/ff693418..f92620eb Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=22652&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=22652&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/22652.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22652/head:pull/22652 PR: https://git.openjdk.org/jdk/pull/22652 From coleenp at openjdk.org Thu Feb 6 13:13:29 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 6 Feb 2025 13:13:29 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v2] In-Reply-To: <1yPHOj_hANp7ZvMfmgi6lRkpokgNNaUSc09FJfZvWk8=.bfcf2780-4afe-4253-ae0b-e3bc6ab7ee86@github.com> References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> <1yPHOj_hANp7ZvMfmgi6lRkpokgNNaUSc09FJfZvWk8=.bfcf2780-4afe-4253-ae0b-e3bc6ab7ee86@github.com> Message-ID: On Wed, 5 Feb 2025 21:24:25 GMT, Dean Long wrote: >> Coleen Phillimore has updated the pull 
request incrementally with one additional commit since the last revision: >> >> Fix copyright and param name > > src/hotspot/share/compiler/compileLog.cpp line 116: > >> 114: print(" unloaded='1'"); >> 115: } else { >> 116: print(" flags='%d'", klass->access_flags()); > > There may be tools that parse the log file and get confused by this change. Maybe we should also change the label from "flags" to "access flags". Okay, I wanted to remove the one use of ciKlass::modifier_flags() and the field with this change, but I'll add it back since I added a Klass::modifier_flags() function. > src/hotspot/share/jfr/recorder/checkpoint/types/jfrTypeSet.cpp line 350: > >> 348: writer->write(mark_symbol(klass, leakp)); >> 349: writer->write(package_id(klass, leakp)); >> 350: writer->write(klass->compute_modifier_flags()); > > Isn't this much more expensive than grabbing the value from the mirror, especially if we have to iterate over inner classes? I was trying not to add a Klass::modifier_flags function, but now I have. > src/hotspot/share/opto/memnode.cpp line 2458: > >> 2456: return TypePtr::NULL_PTR; >> 2457: } >> 2458: // ??? > > I suspect that we still need this code to support intrinsics like LibraryCallKit::inline_native_classID() and maybe other users of this field, but the comment below no longer makes sense. Thank you for noticing the ??? that I left in and the comment. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1944651499 PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1944640356 PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1944697467 From coleenp at openjdk.org Thu Feb 6 13:13:29 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 6 Feb 2025 13:13:29 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v2] In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: On Tue, 4 Feb 2025 14:43:51 GMT, Coleen Phillimore wrote: >> The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. >> >> There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. >> >> Tested with tier1-8. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Fix copyright and param name Thank you for the detailed comments. 
------------- PR Review: https://git.openjdk.org/jdk/pull/22652#pullrequestreview-2598534835 From coleenp at openjdk.org Thu Feb 6 13:13:30 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 6 Feb 2025 13:13:30 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v2] In-Reply-To: <7KdNVSXLx0N027uyQgtUuN82VpXTlyPpPOnBv3sqYRs=.6b549b56-36f9-4ab3-8469-4779d93dd1e7@github.com> References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> <0ZM_vg_dAmbdbeoIeZ8ylBUDj_4_jxM-aE6IKoH6ykM=.69c7554f-5e2b-40b9-8d1a-abe147548dbb@github.com> <0efX7bcHNl5p1RoF3VnqZIabdavsGosuMI14cZPDzbQ=.2bde6bbf-a59b-4f5b-9c68-7a8a258b2ee5@github.com> <7KdNVSXLx0N027uyQgtUuN82VpXTlyPpPOnBv3sqYRs=.6b549b56-36f9-4ab3-8469-4779d93dd1e7@github.com> Message-ID: On Thu, 6 Feb 2025 04:37:17 GMT, Chen Liang wrote: >> OK, if the extra load turns out to be a problem in the future, we could look into why the compilers are generating the load when the Class is known/constant. If the old intrinsic was able to pull the constant out of the Klass, then surely we can do the same and pull the value from the Class field. > > Does `static final` help here? Yes. Yes it does. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1944694824 From coleenp at openjdk.org Thu Feb 6 13:23:54 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 6 Feb 2025 13:23:54 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v4] In-Reply-To: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: <4ruwzJXM3Jgy0rbobE3PPNAH4k8c10_4zAi6mCmc4Lw=.ccf7c825-4ffc-49fb-bc42-3c0168c1dcf8@github.com> > The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. 
The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. > > There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. > > Tested with tier1-8. Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: Add Klass::modifier_flags to look in the mirror, restore ciKlass::modifier_flags, add benchmark. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/22652/files - new: https://git.openjdk.org/jdk/pull/22652/files/f92620eb..85026362 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=22652&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=22652&range=02-03 Stats: 28 lines in 7 files changed: 26 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/22652.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22652/head:pull/22652 PR: https://git.openjdk.org/jdk/pull/22652 From coleenp at openjdk.org Thu Feb 6 14:31:28 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 6 Feb 2025 14:31:28 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v5] In-Reply-To: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: > The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. 
The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. > > There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. > > Tested with tier1-8. Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: Make compute_modifiers return u2. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/22652/files - new: https://git.openjdk.org/jdk/pull/22652/files/85026362..146e2551 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=22652&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=22652&range=03-04 Stats: 7 lines in 7 files changed: 0 ins; 0 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/22652.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22652/head:pull/22652 PR: https://git.openjdk.org/jdk/pull/22652 From coleenp at openjdk.org Thu Feb 6 14:31:29 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 6 Feb 2025 14:31:29 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v2] In-Reply-To: <1yPHOj_hANp7ZvMfmgi6lRkpokgNNaUSc09FJfZvWk8=.bfcf2780-4afe-4253-ae0b-e3bc6ab7ee86@github.com> References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> <1yPHOj_hANp7ZvMfmgi6lRkpokgNNaUSc09FJfZvWk8=.bfcf2780-4afe-4253-ae0b-e3bc6ab7ee86@github.com> Message-ID: <9iIj0xWClD_H4U0MiEUrQGqeIgjyFdC4tuN0sAP9kUo=.1c11d464-4380-4954-9e9f-c40872acff24@github.com> On Wed, 5 Feb 2025 21:26:29 GMT, Dean Long 
wrote: >> Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix copyright and param name > > src/hotspot/share/oops/instanceKlass.hpp line 1128: > >> 1126: #endif >> 1127: >> 1128: int compute_modifier_flags() const; > > I don't see why this can't stay u2. I had some compilation error for the conversion that has since disappeared into the ether with u2, so I've restored them to u2. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1944825437 From liach at openjdk.org Thu Feb 6 16:20:21 2025 From: liach at openjdk.org (Chen Liang) Date: Thu, 6 Feb 2025 16:20:21 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v5] In-Reply-To: <4aAX8rSEcvkeYteaJUXHfVEzBbNGwGlhDLIz548dFcs=.616fa7dd-d5bf-42d5-aca0-0bea0b5591d0@github.com> References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> <8Wx3xbbOnPXS5n1RuNaesqHbhKV3iLwrCVF0s6uWOrA=.cb20728e-e13c-4667-822b-3ba424cbc12f@github.com> <4aAX8rSEcvkeYteaJUXHfVEzBbNGwGlhDLIz548dFcs=.616fa7dd-d5bf-42d5-aca0-0bea0b5591d0@github.com> Message-ID: On Thu, 6 Feb 2025 12:08:59 GMT, Coleen Phillimore wrote: >> If Class had other fields smaller than `int`, would we consider making this something like `char` to save space (allowing all the sub-word fields to be compacted)? > > I thought of doing this since I made modifiers u2 in the Hotspot code just previously, but all the Java code refers to this as an int. And I didn't see other fields to compact it with. Maybe if access_flags are moved we could make them both char (not short since they're unsigned). It feels weird to not have unsigned short to my C++ eyes. From a Java perspective, using `char` for the field is completely fine; this field is only accessed via `getModifiers` and not set by Java code, so the automatic widening conversion can handle it all.
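The widening behaviour discussed above can be sketched in a few lines. This is a minimal illustration, not the actual java.lang.Class implementation (the class and field names here are made up for the example): a `char` field is u2-sized and unsigned, and the implicit char-to-int widening in the getter is lossless for the 16-bit modifier-flag range.

```java
// Illustrative sketch only: stores 16-bit modifier flags in a `char` field
// and returns them as int via the automatic widening conversion.
public class ModifiersSketch {
    private final char modifiers; // u2-sized storage, unsigned by definition

    ModifiersSketch(int mods) {
        this.modifiers = (char) mods; // in the real mirror the VM writes this field
    }

    public int getModifiers() {
        return modifiers; // implicit char -> int widening, always in [0, 0xFFFF]
    }

    public static void main(String[] args) {
        ModifiersSketch s = new ModifiersSketch(0x0011); // ACC_PUBLIC | ACC_FINAL
        System.out.println(s.getModifiers()); // prints 17
    }
}
```

Since Java code only reads the field through the getter and never stores into it, no narrowing casts leak into callers.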
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1945021458 From duke at openjdk.org Thu Feb 6 18:47:54 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Thu, 6 Feb 2025 18:47:54 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: References: Message-ID: > By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: Adding comments + some code reorganization ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23300/files - new: https://git.openjdk.org/jdk/pull/23300/files/9f7c4a23..9a3a9444 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23300&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23300&range=03-04 Stats: 447 lines in 3 files changed: 140 ins; 247 del; 60 mod Patch: https://git.openjdk.org/jdk/pull/23300.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23300/head:pull/23300 PR: https://git.openjdk.org/jdk/pull/23300 From qamai at openjdk.org Thu Feb 6 19:11:58 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Thu, 6 Feb 2025 19:11:58 GMT Subject: RFR: 8348411: C2: Remove the control input of LoadKlassNode and LoadNKlassNode [v5] In-Reply-To: References: Message-ID: > Hi, > > This patch removes the control input of `LoadKlassNode` and `LoadNKlassNode`. They can only have a control input if created inside `Parse::array_store_check()`, the reason given is: > > // We are allowed to use the constant type only if cast succeeded > > But this seems incorrect, the load from the constant type can be done regardless, and it will be constant-folded. This patch only makes that more formal and cleanup `LoadKlassNode::can_remove_control`. > > Please take a look and leave your reviews, thanks a lot. 
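For context on the `Parse::array_store_check()` path mentioned above, this is the bytecode-level situation it compiles: a store into a covariant array must be checked against the array's runtime element klass, which is what the klass load feeds. A minimal standalone sketch (not taken from the JDK sources):

```java
// Illustration of the runtime check behind Parse::array_store_check():
// the static type of `arr` is Object[], but its runtime element type is
// String, so a non-String store must fail with ArrayStoreException.
public class ArrayStoreCheckSketch {
    public static void main(String[] args) {
        Object[] arr = new String[1]; // covariant: String[] viewed as Object[]
        arr[0] = "ok";                // passes the element-type check
        try {
            arr[0] = Integer.valueOf(42); // fails the check at runtime
        } catch (ArrayStoreException e) {
            System.out.println("ArrayStoreException");
        }
    }
}
```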
Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: - Merge branch 'master' into loadklassctrl - format - clearer intention, revert formatting, add assert - remove always_see_exact_class - remove control input of LoadKlassNode ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23274/files - new: https://git.openjdk.org/jdk/pull/23274/files/175232a6..7c2b595b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23274&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23274&range=03-04 Stats: 34650 lines in 1350 files changed: 16246 ins; 10055 del; 8349 mod Patch: https://git.openjdk.org/jdk/pull/23274.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23274/head:pull/23274 PR: https://git.openjdk.org/jdk/pull/23274 From qamai at openjdk.org Thu Feb 6 19:15:13 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Thu, 6 Feb 2025 19:15:13 GMT Subject: RFR: 8348411: C2: Remove the control input of LoadKlassNode and LoadNKlassNode [v5] In-Reply-To: References: <9IQfA04XvFFNu5R3LMh6dYWPAbf7D-VoAb2O0Gt0b8Q=.22b568c9-2a45-41b7-91d7-11c22bb6abe0@github.com> Message-ID: <38AZvEN6jWtzUKAm6eRqJwarn31L2bZYw4-MTClOMaQ=.828db7e7-115f-4856-afbf-6c00bbc34224@github.com> On Wed, 5 Feb 2025 09:32:27 GMT, Emanuel Peter wrote: >> Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: >> >> - Merge branch 'master' into loadklassctrl >> - format >> - clearer intention, revert formatting, add assert >> - remove always_see_exact_class >> - remove control input of LoadKlassNode > > Looks good, thanks for the explanations! 
> > I see we did not yet run internal tests for the last commit, though it is only formatting, so most likely ok. > > But the state of the code is also 2 weeks old, so it would be good if you merged and launched testing again before integration, just in case ;) @eme64 I have merged the change with master, could you help me initiate the testing process, please? Thanks very much. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23274#issuecomment-2640769617 From vlivanov at openjdk.org Thu Feb 6 21:15:12 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Thu, 6 Feb 2025 21:15:12 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v2] In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> <0ZM_vg_dAmbdbeoIeZ8ylBUDj_4_jxM-aE6IKoH6ykM=.69c7554f-5e2b-40b9-8d1a-abe147548dbb@github.com> <0efX7bcHNl5p1RoF3VnqZIabdavsGosuMI14cZPDzbQ=.2bde6bbf-a59b-4f5b-9c68-7a8a258b2ee5@github.com> <7KdNVSXLx0N027uyQgtUuN82VpXTlyPpPOnBv3sqYRs=.6b549b56-36f9-4ab3-8469-4779d93dd1e7@github.com> Message-ID: On Thu, 6 Feb 2025 13:08:31 GMT, Coleen Phillimore wrote: >> Does `static final` help here? > > Yes. Yes it does. Cases when a class mirror is a compile-time constant are already well-optimized. Non-constant cases are the ones where missing optimization opportunities arise. In this particular case, C2 doesn't benefit from the observation that `Clazz[]` is a leaf type at runtime (no subclasses). Hence, a value loaded from a field typed as `Clazz[]` has exactly the same type, and `clazzArray.getClass()` can be constant-folded to `Clazz[].class`. Rather than a common case, it feels more like a corner case, so it is worth addressing as a follow-up enhancement. Another scenario is a meet of 2 primitive array types (which ends up as `bottom[]` in the C2 type system), but I believe it hasn't been optimized before.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22652#discussion_r1945451909 From vlivanov at openjdk.org Thu Feb 6 21:21:18 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Thu, 6 Feb 2025 21:21:18 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v5] In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: On Thu, 6 Feb 2025 14:31:28 GMT, Coleen Phillimore wrote: >> The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. >> >> There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. >> >> Tested with tier1-8. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Make compute_modifiers return u2. Looks good. (Except a left-over `???` in a comment.) I very much like this cleanup. Migrating from Klass to Class simplifies compiler logic since there's no need to care about primitives at runtime anymore. Speaking of missing optimization opportunities (demonstrated by one microbenchmark), it looks like a corner case and can be addressed later. ------------- Marked as reviewed by vlivanov (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/22652#pullrequestreview-2599983789 From coleenp at openjdk.org Thu Feb 6 23:26:31 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 6 Feb 2025 23:26:31 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v6] In-Reply-To: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: > The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. > > There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. > > Tested with tier1-8. Coleen Phillimore has updated the pull request incrementally with two additional commits since the last revision: - Remove ??? in the code. - Hide Class.modifiers field. 
------------- Changes: - all: https://git.openjdk.org/jdk/pull/22652/files - new: https://git.openjdk.org/jdk/pull/22652/files/146e2551..304a17ee Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=22652&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=22652&range=04-05 Stats: 6 lines in 3 files changed: 1 ins; 1 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/22652.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22652/head:pull/22652 PR: https://git.openjdk.org/jdk/pull/22652 From coleenp at openjdk.org Thu Feb 6 23:26:31 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 6 Feb 2025 23:26:31 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v5] In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: On Thu, 6 Feb 2025 14:31:28 GMT, Coleen Phillimore wrote: >> The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. >> >> There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. >> >> Tested with tier1-8. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Make compute_modifiers return u2. Thank you Vladimir for encouraging me to continue this change. I removed the ??? 
and hid the modifiers field for reflection as suggested in this PR. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22652#issuecomment-2641339406 From epeter at openjdk.org Fri Feb 7 07:07:14 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 7 Feb 2025 07:07:14 GMT Subject: RFR: 8348411: C2: Remove the control input of LoadKlassNode and LoadNKlassNode [v5] In-Reply-To: References: Message-ID: On Thu, 6 Feb 2025 19:11:58 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch removes the control input of `LoadKlassNode` and `LoadNKlassNode`. They can only have a control input if created inside `Parse::array_store_check()`, the reason given is: >> >> // We are allowed to use the constant type only if cast succeeded >> >> But this seems incorrect, the load from the constant type can be done regardless, and it will be constant-folded. This patch only makes that more formal and cleanup `LoadKlassNode::can_remove_control`. >> >> Please take a look and leave your reviews, thanks a lot. > > Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: > > - Merge branch 'master' into loadklassctrl > - format > - clearer intention, revert formatting, add assert > - remove always_see_exact_class > - remove control input of LoadKlassNode Marked as reviewed by epeter (Reviewer). 
------------- PR Review: https://git.openjdk.org/jdk/pull/23274#pullrequestreview-2600950916 From epeter at openjdk.org Fri Feb 7 07:07:15 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 7 Feb 2025 07:07:15 GMT Subject: RFR: 8348411: C2: Remove the control input of LoadKlassNode and LoadNKlassNode [v5] In-Reply-To: <38AZvEN6jWtzUKAm6eRqJwarn31L2bZYw4-MTClOMaQ=.828db7e7-115f-4856-afbf-6c00bbc34224@github.com> References: <9IQfA04XvFFNu5R3LMh6dYWPAbf7D-VoAb2O0Gt0b8Q=.22b568c9-2a45-41b7-91d7-11c22bb6abe0@github.com> <38AZvEN6jWtzUKAm6eRqJwarn31L2bZYw4-MTClOMaQ=.828db7e7-115f-4856-afbf-6c00bbc34224@github.com> Message-ID: On Thu, 6 Feb 2025 19:12:26 GMT, Quan Anh Mai wrote: >> Looks good, thanks for the explanations! >> >> I see we did not yet run internal tests for the last commit, though it is only formatting, so most likely ok. >> >> But the state of the code is also 2 weeks old, so it would be good if you merged and launched testing again before integration, just in case ;) > > @eme64 I have merged the change with master, could you help me initiate the testing process, please? Thanks very much. @merykitty Testing launched! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23274#issuecomment-2642097456 From galder at openjdk.org Fri Feb 7 12:31:11 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Fri, 7 Feb 2025 12:31:11 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11] In-Reply-To: <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com> References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com> Message-ID: On Fri, 17 Jan 2025 17:53:24 GMT, Galder Zamarreño wrote: >> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance.
>> >> Currently vectorization does not kick in for loops containing either of these calls because of the following error: >> >> >> VLoop::check_preconditions: failed: control flow in loop not allowed >> >> >> The control flow is due to the java implementation for these methods, e.g. >> >> >> public static long max(long a, long b) { >> return (a >= b) ? a : b; >> } >> >> >> This patch intrinsifies the calls to replace the CmpL + Bool nodes for MaxL/MinL nodes respectively. >> By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization. >> E.g. >> >> >> SuperWord::transform_loop: >> Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt strip_mined >> 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21) >> >> >> Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after. Before the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1155 >> long max 1173 >> >> >> After the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1042 >> long max 1042 >> >> >> This patch does not add an platform-specific backend implementations for the MaxL/MinL nodes. >> Therefore, it still relies on the macro expansion to transform those into CMoveL. 
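The kind of loop this intrinsic targets can be sketched as follows. This is only an approximation in the spirit of the `longClippingRange` benchmark discussed in this thread, not the benchmark source itself: the ternary control flow inside `Math.max`/`Math.min` is what previously blocked SuperWord, and with MinL/MaxL nodes the loop body becomes straight-line code.

```java
// Sketch (assumed shape, not the actual benchmark code): clamp each
// element of a long[] into [lo, hi] using Math.max and Math.min, the
// pattern that benefits from the MinL/MaxL intrinsics.
public class ClippingSketch {
    static void clip(long[] in, long[] out, long lo, long hi) {
        for (int i = 0; i < in.length; i++) {
            out[i] = Math.min(Math.max(in[i], lo), hi); // clamp into [lo, hi]
        }
    }

    public static void main(String[] args) {
        long[] in = {-5L, 3L, 99L};
        long[] out = new long[in.length];
        clip(in, out, 0L, 10L);
        System.out.println(java.util.Arrays.toString(out)); // [0, 3, 10]
    }
}
```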
>> >> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results: >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PA... > > Galder Zamarre?o has updated the pull request incrementally with one additional commit since the last revision: > > Fix typo @eastig is helping with the results on aarch64, so I will verify the numbers in same way done below for x86_64 once he provides me with the results. Here is a summary of the benchmarking results I'm seeing on x86_64 (I will push an update that just merges the latest master shortly). First I will go through the results of `MinMaxVector`. This benchmark computes throughput by default so the higher the number the better. # MinMaxVector AVX-512 Following are results with AVX-512 instructions: Benchmark (probability) (range) (seed) (size) Mode Cnt Baseline Patch Units MinMaxVector.longClippingRange N/A 90 0 1000 thrpt 4 834.127 3688.961 ops/ms MinMaxVector.longClippingRange N/A 100 0 1000 thrpt 4 1147.010 3687.721 ops/ms MinMaxVector.longLoopMax 50 N/A N/A 2048 thrpt 4 1126.718 1072.812 ops/ms MinMaxVector.longLoopMax 80 N/A N/A 2048 thrpt 4 1070.921 1070.538 ops/ms MinMaxVector.longLoopMax 100 N/A N/A 2048 thrpt 4 510.483 1073.081 ops/ms MinMaxVector.longLoopMin 50 N/A N/A 2048 thrpt 4 935.658 1016.910 ops/ms MinMaxVector.longLoopMin 80 N/A N/A 2048 thrpt 4 1007.410 933.774 ops/ms MinMaxVector.longLoopMin 100 N/A N/A 2048 thrpt 4 536.582 1017.337 ops/ms MinMaxVector.longReductionMax 50 N/A N/A 2048 thrpt 4 967.288 966.945 ops/ms MinMaxVector.longReductionMax 80 N/A N/A 2048 thrpt 4 967.327 967.382 ops/ms MinMaxVector.longReductionMax 100 N/A N/A 2048 thrpt 4 849.689 967.327 ops/ms MinMaxVector.longReductionMin 50 N/A N/A 2048 thrpt 4 966.323 967.275 ops/ms MinMaxVector.longReductionMin 80 N/A N/A 2048 thrpt 4 967.340 967.228 ops/ms MinMaxVector.longReductionMin 100 N/A N/A 2048 thrpt 4 880.921 967.233 ops/ms ### `longReduction[Min|Max]` 
performance improves slightly when probability is 100 Without the patch the code uses compare instructions: 7.83% ???? ???? ? 0x00007f4f700fb305: imulq $0xb, 0x20(%r14, %r8, 8), %rdi ???? ???? ? ;*lmul {reexecute=0 rethrow=0 return_oop=0} ???? ???? ? ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 24 (line 255) ???? ???? ? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124) 5.64% ???? ???? ? 0x00007f4f700fb30b: cmpq %rdi, %rdx ????????? ? 0x00007f4f700fb30e: jge 0x7f4f700fb32c ;*lreturn {reexecute=0 rethrow=0 return_oop=0} ????????? ? ; - java.lang.Math::max at 11 (line 2037) ????????? ? ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 30 (line 256) ????????? ? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124) 12.82% ?????????? ? 0x00007f4f700fb310: imulq $0xb, 0x28(%r14, %r8, 8), %rbp ?????????? ? ;*lmul {reexecute=0 rethrow=0 return_oop=0} ?????????? ? ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 24 (line 255) ?????????? ? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124) 7.46% ?????????? ? 0x00007f4f700fb316: cmpq %rbp, %rdi ?????????? ? 0x00007f4f700fb319: jl 0x7f4f700fb2e0 ;*iflt {reexecute=0 rethrow=0 return_oop=0} ????? ???? ? ; - java.lang.Math::max at 3 (line 2037) ????? ???? ? ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 30 (line 256) ????? ???? ? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124) And with the patch these become vectorized: ? ?? ????? 0x00007f56280fad10: vpmullq 0xf0(%rdx, %rsi, 8), %ymm10, %ymm4 8.35% ? ?? ????? 0x00007f56280fad1b: vpmullq 0xd0(%rdx, %rsi, 8), %ymm10, %ymm5 4.27% ? ?? ????? 
0x00007f56280fad26: vpmullq 0x10(%rdx, %rsi, 8), %ymm10, %ymm6 ? ?? ????? ; {no_reloc} 4.22% ? ?? ????? 0x00007f56280fad31: vpmullq 0x30(%rdx, %rsi, 8), %ymm10, %ymm7 4.00% ? ?? ????? 0x00007f56280fad3c: vpmullq 0xb0(%rdx, %rsi, 8), %ymm10, %ymm8 4.13% ? ?? ????? 0x00007f56280fad47: vpmullq 0x50(%rdx, %rsi, 8), %ymm10, %ymm11 4.10% ? ?? ????? 0x00007f56280fad52: vpmullq 0x70(%rdx, %rsi, 8), %ymm10, %ymm12 4.13% ? ?? ????? 0x00007f56280fad5d: vpmullq 0x90(%rdx, %rsi, 8), %ymm10, %ymm13 4.03% ? ?? ????? 0x00007f56280fad68: vpmaxsq %ymm6, %ymm3, %ymm3 ? ?? ????? 0x00007f56280fad6e: vpmaxsq %ymm7, %ymm3, %ymm3 4.72% ? ?? ????? 0x00007f56280fad74: vpmaxsq %ymm11, %ymm3, %ymm3 ? ?? ????? 0x00007f56280fad7a: vpmaxsq %ymm12, %ymm3, %ymm3 8.40% ? ?? ????? 0x00007f56280fad80: vpmaxsq %ymm13, %ymm3, %ymm3 23.11% ? ?? ????? 0x00007f56280fad86: vpmaxsq %ymm8, %ymm3, %ymm3 2.15% ? ?? ????? 0x00007f56280fad8c: vpmaxsq %ymm5, %ymm3, %ymm3 8.79% ? ?? ????? 0x00007f56280fad92: vpmaxsq %ymm4, %ymm3, %ymm3 ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0} ? ?? ????? ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 30 (line 256) ? ?? ????? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124) ### `longLoop[Min|Max]` performance improves considerably when probability is 100 Without the patch the code uses compare + move instructions: 4.53% ???? ?? ? ? 0x00007f96b40faf33: movq 0x18(%rax, %rsi, 8), %r13;*laload {reexecute=0 rethrow=0 return_oop=0} ???? ?? ? ? ; - org.openjdk.bench.java.lang.MinMaxVector::longLoopMax at 20 (line 236) ???? ?? ? ? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub at 19 (line 124) 2.69% ???? ?? ? ? 0x00007f96b40faf38: cmpq %r11, %r13 ????? ?? ? ? 0x00007f96b40faf3b: jl 0x7f96b40faf67 ;*lreturn {reexecute=0 rethrow=0 return_oop=0} ????? ?? ? ? ; - java.lang.Math::max at 11 (line 2037) ????? ?? ? ? 
; - org.openjdk.bench.java.lang.MinMaxVector::longLoopMax at 27 (line 236) ????? ?? ? ? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub at 19 (line 124) 8.75% ????? ??? ? ? 0x00007f96b40faf3d: movq %r13, 0x18(%rbp, %rsi, 8);*lastore {reexecute=0 rethrow=0 return_oop=0} ????? ??? ? ? ; - org.openjdk.bench.java.lang.MinMaxVector::longLoopMax at 30 (line 236) ????? ??? ? ? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub at 19 (line 124) And with the patch those become vectorized: 3.55% ? ?? 0x00007f13c80fa18a: vmovdqu 0xf0(%rbx, %r10, 8), %ymm5 ? ?? 0x00007f13c80fa194: vmovdqu 0xf0(%rdi, %r10, 8), %ymm6 2.35% ? ?? 0x00007f13c80fa19e: vpmaxsq %ymm6, %ymm5, %ymm5 5.03% ? ?? 0x00007f13c80fa1a4: vmovdqu %ymm5, 0xf0(%rax, %r10, 8) ? ?? ;*lastore {reexecute=0 rethrow=0 return_oop=0} ? ?? ; - org.openjdk.bench.java.lang.MinMaxVector::longLoopMax at 30 (line 236) ? ?? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub at 19 (line 124) It's interesting to observe that at probabilites of 50/80% the baseline performs better than at 100%. The reason for that is because at 50/80% the baseline already vectorizes. So, why isn't the baseline vectorizing at 100% probability? VLoop::check_preconditions Loop: N1256/N463 limit_check counted [int,int),+4 (3161 iters) main rc has_sfpt strip_mined 1256 CountedLoop === 1256 598 463 [[ 1256 1257 1271 1272 ]] inner stride: 4 main of N1256 strip mined !orig=[1126],[599],[590],[307] !jvms: MinMaxVector::longLoopMax @ bci:10 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) VLoop::check_preconditions: fails because of control flow. 
cl_exit 594 594 CountedLoopEnd === 415 593 [[ 1275 463 ]] [lt] P=0.999684, C=707717.000000 !orig=[462] !jvms: MinMaxVector::longLoopMax @ bci:7 (line 235) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) cl_exit->in(0) 415 415 Region === 415 411 412 [[ 415 594 416 451 ]] !orig=[423] !jvms: Math::max @ bci:11 (line 2037) MinMaxVector::longLoopMax @ bci:27 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) lpt->_head 1256 1256 CountedLoop === 1256 598 463 [[ 1256 1257 1271 1272 ]] inner stride: 4 main of N1256 strip mined !orig=[1126],[599],[590],[307] !jvms: MinMaxVector::longLoopMax @ bci:10 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) Loop: N1256/N463 limit_check counted [int,int),+4 (3161 iters) main rc has_sfpt strip_mined VLoop::check_preconditions: failed: control flow in loop not allowed At 100% probability baseline fails to vectorize because it observes a control flow. This control flow is not the one you see in min/max implementations, but this is one added by HotSpot as a result of the JIT profiling. It observes that one branch is always taken so it optimizes for that, and adds a branch for the uncommon case where the branch is not taken. ### `longClippingRange` performance improves considerably Without the patch the code uses compare + move instructions: 3.39% ?? ? ?? ? 0x00007febb40fb175: cmpq %rbp, %rcx ?? ?? ?? ? 0x00007febb40fb178: jge 0x7febb40fb17d ;*iflt {reexecute=0 rethrow=0 return_oop=0} ?? ?? ?? ? ; - java.lang.Math::max at 3 (line 2037) ?? ?? ?? ? ; - org.openjdk.bench.java.lang.MinMaxVector::longClippingRange at 25 (line 220) ?? ?? ?? ? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longClippingRange_jmhTest::longClippingRange_thrpt_jmhStub at 19 (line 124) 2.69% ?? ?? ?? ? 0x00007febb40fb17a: movq %rbp, %rcx ;*lreturn {reexecute=0 rethrow=0 return_oop=0} ?? ?? ?? ? ; - java.lang.Math::max at 11 (line 2037) ?? ?? ?? 
? ; - org.openjdk.bench.java.lang.MinMaxVector::longClippingRange at 25 (line 220) ?? ?? ?? ? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longClippingRange_jmhTest::longClippingRange_thrpt_jmhStub at 19 (line 124) 4.35% ?? ?? ?? ? 0x00007febb40fb17d: nop 2.93% ?? ? ?? ? 0x00007febb40fb180: cmpq %r8, %rcx ?? ? ? ?? ? 0x00007febb40fb183: jle 0x7febb40fb188 ;*ifgt {reexecute=0 rethrow=0 return_oop=0} ?? ? ? ?? ? ; - java.lang.Math::min at 3 (line 2132) ?? ? ? ?? ? ; - org.openjdk.bench.java.lang.MinMaxVector::longClippingRange at 32 (line 220) ?? ? ? ?? ? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longClippingRange_jmhTest::longClippingRange_thrpt_jmhStub at 19 (line 124) 3.51% ?? ? ? ?? ? 0x00007febb40fb185: movq %r8, %rcx ;*lreturn {reexecute=0 rethrow=0 return_oop=0} ?? ? ? ?? ? ; - java.lang.Math::min at 11 (line 2132) ?? ? ? ?? ? ; - org.openjdk.bench.java.lang.MinMaxVector::longClippingRange at 32 (line 220) ?? ? ? ?? ? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longClippingRange_jmhTest::longClippingRange_thrpt_jmhStub at 19 (line 124) 4.26% ?? ? ? ?? ? 0x00007febb40fb188: movq %rcx, 0x10(%rsi, %r9, 8);*lastore {reexecute=0 rethrow=0 return_oop=0} ?? ? ?? ? ; - org.openjdk.bench.java.lang.MinMaxVector::longClippingRange at 35 (line 220) ?? ? ?? ? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longClippingRange_jmhTest::longClippingRange_thrpt_jmhStub at 19 (line 124) With the patch these become vectorized: 0.20% ??? ? 0x00007f10180fd15c: vmovdqu 0x10(%r11, %rcx, 8), %ymm6 ??? ? 0x00007f10180fd163: vpmaxsq %ymm6, %ymm7, %ymm6 ??? ? 0x00007f10180fd169: vpminsq %ymm8, %ymm6, %ymm6 ??? ? 0x00007f10180fd16f: vmovdqu %ymm6, 0x10(%r8, %rcx, 8);*lastore {reexecute=0 rethrow=0 return_oop=0} ??? ? ; - org.openjdk.bench.java.lang.MinMaxVector::longClippingRange at 35 (line 220) ??? ? 
; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longClippingRange_jmhTest::longClippingRange_thrpt_jmhStub at 19 (line 124)

# `MinMaxVector` AVX2

Following are the results on the same machine as above, but forcing AVX2 to be used instead of AVX-512:

Benchmark (probability) (range) (seed) (size) Mode Cnt Baseline Patch Units
MinMaxVector.longClippingRange N/A 90 0 1000 thrpt 4 832.132 1813.609 ops/ms
MinMaxVector.longClippingRange N/A 100 0 1000 thrpt 4 832.546 1814.477 ops/ms
MinMaxVector.longLoopMax 50 N/A N/A 2048 thrpt 4 938.372 939.313 ops/ms
MinMaxVector.longLoopMax 80 N/A N/A 2048 thrpt 4 934.964 945.124 ops/ms
MinMaxVector.longLoopMax 100 N/A N/A 2048 thrpt 4 512.076 937.287 ops/ms
MinMaxVector.longLoopMin 50 N/A N/A 2048 thrpt 4 999.455 689.750 ops/ms
MinMaxVector.longLoopMin 80 N/A N/A 2048 thrpt 4 1000.352 876.326 ops/ms
MinMaxVector.longLoopMin 100 N/A N/A 2048 thrpt 4 536.359 999.475 ops/ms
MinMaxVector.longReductionMax 50 N/A N/A 2048 thrpt 4 409.413 409.363 ops/ms
MinMaxVector.longReductionMax 80 N/A N/A 2048 thrpt 4 409.374 409.141 ops/ms
MinMaxVector.longReductionMax 100 N/A N/A 2048 thrpt 4 883.614 409.318 ops/ms
MinMaxVector.longReductionMin 50 N/A N/A 2048 thrpt 4 404.723 404.705 ops/ms
MinMaxVector.longReductionMin 80 N/A N/A 2048 thrpt 4 404.755 404.748 ops/ms
MinMaxVector.longReductionMin 100 N/A N/A 2048 thrpt 4 848.784 404.669 ops/ms

### `longClippingRange` performance improves considerably

Baseline uses compare + move instructions as shown above. The patched version improves in spite of not being able to use AVX-512 instructions such as `vpmaxsq`; the improvement comes instead from vectorized compare (`vpcmpgtq`) + blend (`vblendvpd`) instructions:

? ? ???? 0x00007f9aa40f94ac: vpcmpgtq %ymm6, %ymm7, %ymm12
3.79% ? ? ???? 0x00007f9aa40f94b1: vblendvpd %ymm12, %ymm7, %ymm6, %ymm12
3.72% ? ? ???? 0x00007f9aa40f94b7: vpcmpgtq %ymm8, %ymm12, %ymm10
? ? ???? 0x00007f9aa40f94bc: vblendvpd %ymm10, %ymm8, %ymm12, %ymm10
3.78% ? ? ????
0x00007f9aa40f94c2: vmovdqu %ymm10, 0xf0(%r8, %rcx, 8) ? ? ???? ;*lastore {reexecute=0 rethrow=0 return_oop=0} ? ? ???? ; - org.openjdk.bench.java.lang.MinMaxVector::longClippingRange at 35 (line 220) ? ? ???? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longClippingRange_jmhTest::longClippingRange_thrpt_jmhStub at 19 (line 124) ### `longReduction[Min|Max]` performance drops considerably when probability is 100 Baseline uses compare + move instruction to implement this: ???? ???? ? ;*lmul {reexecute=0 rethrow=0 return_oop=0} ???? ???? ? ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 24 (line 255) ???? ???? ? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124) 6.30% ???? ???? ? 0x00007fd5580f678b: cmpq %rdi, %rdx ????????? ? 0x00007fd5580f678e: jge 0x7fd5580f67ac ;*lreturn {reexecute=0 rethrow=0 return_oop=0} ????????? ? ; - java.lang.Math::max at 11 (line 2037) ????????? ? ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 30 (line 256) ????????? ? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124) 12.88% ?????????? ? 0x00007fd5580f6790: imulq $0xb, 0x28(%r14, %r8, 8), %rbp ?????????? ? ;*lmul {reexecute=0 rethrow=0 return_oop=0} ?????????? ? ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 24 (line 255) ?????????? ? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124) 7.55% ?????????? ? 0x00007fd5580f6796: cmpq %rbp, %rdi ?????????? ? 0x00007fd5580f6799: jl 0x7fd5580f6760 ;*iflt {reexecute=0 rethrow=0 return_oop=0} ????? ???? ? ; - java.lang.Math::max at 3 (line 2037) ????? ???? ? ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 30 (line 256) ????? ???? ? 
; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124) With the patch the code uses conditional moves instead: 0.05% ?? 0x00007fc4700f5253: imulq $0xb, 0x28(%r14, %r11, 8), %rdx 10.62% ?? 0x00007fc4700f5259: imulq $0xb, 0x20(%r14, %r11, 8), %rax 0.63% ?? 0x00007fc4700f525f: imulq $0xb, 0x10(%r14, %r11, 8), %r8 ?? ;*lmul {reexecute=0 rethrow=0 return_oop=0} ?? ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 24 (line 255) ?? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124) 10.34% ?? 0x00007fc4700f5265: cmpq %r8, %r13 2.37% ?? 0x00007fc4700f5268: cmovlq %r8, %r13 ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0} ?? ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 30 (line 256) ?? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124) 1.15% ?? 0x00007fc4700f526c: imulq $0xb, 0x18(%r14, %r11, 8), %r8 ?? ;*lmul {reexecute=0 rethrow=0 return_oop=0} ?? ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 24 (line 255) ?? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124) 9.28% ?? 0x00007fc4700f5272: cmpq %r8, %r13 3.82% ?? 0x00007fc4700f5275: cmovlq %r8, %r13 21.61% ?? 0x00007fc4700f5279: cmpq %rax, %r13 11.55% ?? 0x00007fc4700f527c: cmovlq %rax, %r13 4.48% ?? 0x00007fc4700f5280: cmpq %rdx, %r13 11.76% ?? 0x00007fc4700f5283: cmovlq %rdx, %r13 ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0} ?? ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax at 30 (line 256) ?? 
; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub at 19 (line 124)

When one of the branches is taken always or almost always, the branchy baseline code can be optimized with branch prediction. However, the conditional move instructions force the CPU to compute both sides of the branch, so it performs worse in this scenario.

Why are vectorized instructions not used in this scenario? Vector instructions for min/max are not available with AVX2, and the vectorization trace shows it:

PackSet::print: 3 packs
Pack: 0
0: 1119 LoadL === 1105 343 1120 [[ 1117 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=997,663,[457] !jvms: MinMaxVector::longReductionMax @ bci:23 (line 255) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124)
1: 1112 LoadL === 1105 343 1113 [[ 1111 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=663,[457] !jvms: MinMaxVector::longReductionMax @ bci:23 (line 255) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124)
2: 997 LoadL === 1105 343 998 [[ 996 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=663,[457] !jvms: MinMaxVector::longReductionMax @ bci:23 (line 255) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124)
3: 663 LoadL === 1105 343 455 [[ 458 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=[457] !jvms: MinMaxVector::longReductionMax @ bci:23 (line 255) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124)
Pack: 1
0: 1117 MulL === _ 1119 162 [[ 1116 ]] !orig=996,458
!jvms: MinMaxVector::longReductionMax @ bci:24 (line 255) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124) 1: 1111 MulL === _ 1112 162 [[ 1110 ]] !orig=458 !jvms: MinMaxVector::longReductionMax @ bci:24 (line 255) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124) 2: 996 MulL === _ 997 162 [[ 995 ]] !orig=458 !jvms: MinMaxVector::longReductionMax @ bci:24 (line 255) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124) 3: 458 MulL === _ 663 162 [[ 459 ]] !jvms: MinMaxVector::longReductionMax @ bci:24 (line 255) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124) Pack: 2 0: 1116 MaxL === _ 1128 1117 [[ 1110 ]] !orig=995,459,1012 !jvms: MinMaxVector::longReductionMax @ bci:30 (line 256) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124) 1: 1110 MaxL === _ 1116 1111 [[ 995 ]] !orig=459,1012 !jvms: MinMaxVector::longReductionMax @ bci:30 (line 256) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124) 2: 995 MaxL === _ 1110 996 [[ 459 ]] !orig=459,1012 !jvms: MinMaxVector::longReductionMax @ bci:30 (line 256) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124) 3: 459 MaxL === _ 995 458 [[ 1128 923 570 ]] !orig=1012 !jvms: MinMaxVector::longReductionMax @ bci:30 (line 256) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124) WARNING: Removed pack: not implemented at any smaller size: 0: 1116 MaxL === _ 1128 1117 [[ 1110 ]] !orig=995,459,1012 !jvms: MinMaxVector::longReductionMax @ bci:30 (line 256) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124) 1: 1110 MaxL === _ 1116 1111 [[ 995 ]] !orig=459,1012 !jvms: MinMaxVector::longReductionMax @ bci:30 (line 256) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ 
bci:19 (line 124)
2: 995 MaxL === _ 1110 996 [[ 459 ]] !orig=459,1012 !jvms: MinMaxVector::longReductionMax @ bci:30 (line 256) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124)
3: 459 MaxL === _ 995 458 [[ 1128 923 570 ]] !orig=1012 !jvms: MinMaxVector::longReductionMax @ bci:30 (line 256) MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 (line 124)
After SuperWord::split_packs_only_implemented_with_smaller_size

One interesting option to explore here would be whether MaxL/MinL could be implemented in terms of vectorized compare instructions, as shown above in the `longClippingRange` scenario. Thoughts @rwestrel @eme64?

# `VectorReduction2.WithSuperword` on AVX-512 machine

As requested by Emanuel, I've also run this benchmark. Note that the results here are time per op, so lower numbers are better:

Benchmark (SIZE) (seed) Mode Cnt Baseline Patch Units
VectorReduction2.WithSuperword.longMaxBig 2048 0 avgt 3 3970.527 1918.821 ns/op
VectorReduction2.WithSuperword.longMaxDotProduct 2048 0 avgt 3 1369.634 1055.762 ns/op
VectorReduction2.WithSuperword.longMaxSimple 2048 0 avgt 3 722.314 2172.064 ns/op
VectorReduction2.WithSuperword.longMinBig 2048 0 avgt 3 3996.694 1918.398 ns/op
VectorReduction2.WithSuperword.longMinDotProduct 2048 0 avgt 3 1363.687 1056.375 ns/op
VectorReduction2.WithSuperword.longMinSimple 2048 0 avgt 3 718.150 2179.478 ns/op

`long[Min|Max]Big` and `long[Min|Max]DotProduct` benchmarks show considerable improvements, but something odd is happening in `long[Min|Max]Simple`.

### `long[Min|Max]Simple` performance drops considerably

Baseline uses compare + move instructions:

8.05% ?? ??? ? 0x00007f9d580f569b: movq 0x18(%r13, %r11, 8), %r8;*laload {reexecute=0 rethrow=0 return_oop=0}
?? ??? ? ; - org.openjdk.bench.vm.compiler.VectorReduction2::longMaxSimple at 22 (line 1054)
?? ??? ?
; - org.openjdk.bench.vm.compiler.jmh_generated.VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub at 17 (line 190)
0.23% ?? ??? ? 0x00007f9d580f56a0: cmpq %r8, %rsi
??? ??? ? 0x00007f9d580f56a3: jl 0x7f9d580f5713 ;*lreturn {reexecute=0 rethrow=0 return_oop=0}
??? ??? ? ; - java.lang.Math::max at 11 (line 2037)
??? ??? ? ; - org.openjdk.bench.vm.compiler.VectorReduction2::longMaxSimple at 28 (line 1055)
??? ??? ? ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub at 17 (line 190)

The patched version uses conditional moves instead of vectorized instructions:

2.76% ?? 0x00007fcd180f695c: movq 0x18(%r14, %r11, 8), %rdi;*laload {reexecute=0 rethrow=0 return_oop=0}
?? ; - org.openjdk.bench.vm.compiler.VectorReduction2::longMaxSimple at 22 (line 1054)
?? ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub at 17 (line 190)
?? 0x00007fcd180f6961: cmpq %rdi, %r13
3.11% ?? 0x00007fcd180f6964: cmovlq %rdi, %r13 ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0}
?? ; - org.openjdk.bench.vm.compiler.VectorReduction2::longMaxSimple at 28 (line 1055)
?? ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub at 17 (line 190)

Why are vectorized instructions not kicking in with the patch?
Because superword doesn't think it's profitable to vectorize this: PackSet::print: 2 packs Pack: 0 0: 733 LoadL === 721 184 734 [[ 732 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=669,500,[319] !jvms: VectorReduction2::longMaxSimple @ bci:22 (line 1054) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190) 1: 728 LoadL === 721 184 729 [[ 727 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=500,[319] !jvms: VectorReduction2::longMaxSimple @ bci:22 (line 1054) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190) 2: 669 LoadL === 721 184 670 [[ 668 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=500,[319] !jvms: VectorReduction2::longMaxSimple @ bci:22 (line 1054) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190) 3: 500 LoadL === 721 184 317 [[ 320 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=[319] !jvms: VectorReduction2::longMaxSimple @ bci:22 (line 1054) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190) Pack: 1 0: 732 MaxL === _ 743 733 [[ 727 ]] !orig=668,320,685 !jvms: VectorReduction2::longMaxSimple @ bci:28 (line 1055) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190) 1: 727 MaxL === _ 732 728 [[ 668 ]] !orig=320,685 !jvms: VectorReduction2::longMaxSimple @ bci:28 (line 1055) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190) 2: 668 MaxL === _ 727 669 [[ 320 ]] !orig=320,685 !jvms: 
VectorReduction2::longMaxSimple @ bci:28 (line 1055) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190) 3: 320 MaxL === _ 668 500 [[ 743 593 456 ]] !orig=685 !jvms: VectorReduction2::longMaxSimple @ bci:28 (line 1055) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190) WARNING: Removed pack: not profitable: 0: 732 MaxL === _ 743 733 [[ 727 ]] !orig=668,320,685 !jvms: VectorReduction2::longMaxSimple @ bci:28 (line 1055) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190) 1: 727 MaxL === _ 732 728 [[ 668 ]] !orig=320,685 !jvms: VectorReduction2::longMaxSimple @ bci:28 (line 1055) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190) 2: 668 MaxL === _ 727 669 [[ 320 ]] !orig=320,685 !jvms: VectorReduction2::longMaxSimple @ bci:28 (line 1055) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190) 3: 320 MaxL === _ 668 500 [[ 743 593 456 ]] !orig=685 !jvms: VectorReduction2::longMaxSimple @ bci:28 (line 1055) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190) WARNING: Removed pack: not profitable: 0: 733 LoadL === 721 184 734 [[ 732 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=669,500,[319] !jvms: VectorReduction2::longMaxSimple @ bci:22 (line 1054) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190) 1: 728 LoadL === 721 184 729 [[ 727 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=500,[319] !jvms: VectorReduction2::longMaxSimple @ bci:22 (line 1054) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190) 
2: 669 LoadL === 721 184 670 [[ 668 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=500,[319] !jvms: VectorReduction2::longMaxSimple @ bci:22 (line 1054) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
3: 500 LoadL === 721 184 317 [[ 320 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=[319] !jvms: VectorReduction2::longMaxSimple @ bci:22 (line 1054) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
After Superword::filter_packs_for_profitable
PackSet::print: 0 packs
SuperWord::transform_loop failed: SuperWord::SLP_extract did not vectorize

How can you make it vectorize? By doing something with the value from the array before passing it to min/max. That is what the `MinMaxVector.longReduction[Min|Max]` and `VectorReduction2.long[Min|Max]DotProduct` methods do.

# `VectorReduction2.NoSuperword` on AVX-512 machine

Benchmark (SIZE) (seed) Mode Cnt Baseline Patch Units
VectorReduction2.NoSuperword.longMaxBig 2048 0 avgt 3 3964.403 2966.258 ns/op
VectorReduction2.NoSuperword.longMaxDotProduct 2048 0 avgt 3 1686.373 2462.876 ns/op
VectorReduction2.NoSuperword.longMaxSimple 2048 0 avgt 3 722.219 2171.859 ns/op
VectorReduction2.NoSuperword.longMinBig 2048 0 avgt 3 3994.685 2971.143 ns/op
VectorReduction2.NoSuperword.longMinDotProduct 2048 0 avgt 3 1366.291 2428.173 ns/op
VectorReduction2.NoSuperword.longMinSimple 2048 0 avgt 3 719.218 2179.546 ns/op

Performance improves for `long[Min|Max]Big`. `long[Min|Max]Simple` suffers the same issue as shown in the previous section: when not vectorized, these benchmarks fall back on conditional moves. The drop in performance in `long[Min|Max]DotProduct` needs some explanation.
### `long[Min|Max]DotProduct` performance drops considerably Baseline uses compare + move instructions here: 5.67% ??? ???? ? 0x00007f3fcc0fa71d: movq 0x20(%r14, %r8, 8), %r9 5.19% ??? ???? ? 0x00007f3fcc0fa722: imulq 0x20(%rax, %r8, 8), %r9;*lmul {reexecute=0 rethrow=0 return_oop=0} ??? ???? ? ; - org.openjdk.bench.vm.compiler.VectorReduction2::longMaxDotProduct at 30 (line 1125) ??? ???? ? ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorReduction2_NoSuperword_longMaxDotProduct_jmhTest::longMaxDotProduct_avgt_jmhStub at 17 (line 190) 8.46% ??? ???? ? 0x00007f3fcc0fa728: cmpq %r9, %rsi ???????? ? 0x00007f3fcc0fa72b: jl 0x7f3fcc0fa751 ;*lreturn {reexecute=0 rethrow=0 return_oop=0} ???????? ? ; - java.lang.Math::max at 11 (line 2037) ???????? ? ; - org.openjdk.bench.vm.compiler.VectorReduction2::longMaxDotProduct at 36 (line 1126) ???????? ? ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorReduction2_NoSuperword_longMaxDotProduct_jmhTest::longMaxDotProduct_avgt_jmhStub at 17 (line 190) Patch transforms this into conditional moves: 11.00% ? 0x00007f66f40f70b2: movq 0x18(%r13, %rcx, 8), %rax ? 0x00007f66f40f70b7: imulq 0x18(%r9, %rcx, 8), %rax;*lmul {reexecute=0 rethrow=0 return_oop=0} ? ; - org.openjdk.bench.vm.compiler.VectorReduction2::longMaxDotProduct at 30 (line 1125) ? ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorReduction2_NoSuperword_longMaxDotProduct_jmhTest::longMaxDotProduct_avgt_jmhStub at 17 (line 190) ? 0x00007f66f40f70bd: cmpq %rdx, %rax 13.07% ? 0x00007f66f40f70c0: cmovlq %rdx, %rax ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0} ? ; - org.openjdk.bench.vm.compiler.VectorReduction2::longMaxDotProduct at 36 (line 1126) ? ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorReduction2_NoSuperword_longMaxDotProduct_jmhTest::longMaxDotProduct_avgt_jmhStub at 17 (line 190) This is similar to what we have seen above. Lacking superword functionality, the fallback for MaxL/MinL implies using conditional moves. 
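For reference, the reduction kernels being compared in these sections have roughly the following shape (my paraphrase of the JMH benchmarks, not their actual source — class and method names here are made up):

```java
// Paraphrased shapes of the benchmark kernels (hypothetical class, not the
// real JMH sources). With the patch, SuperWord deems a bare MaxL reduction
// over a load not profitable (it falls back to cmov), while the variant
// that multiplies before reducing does get vectorized.
public class ReductionShapes {
    // "simple" shape: load feeds straight into Math.max
    static long maxSimple(long[] a) {
        long acc = Long.MIN_VALUE;
        for (long v : a) {
            acc = Math.max(acc, v);            // reduction over MaxL alone
        }
        return acc;
    }

    // "dot product" shape: extra vector work feeds the reduction
    static long maxDotProduct(long[] a, long[] b) {
        long acc = Long.MIN_VALUE;
        for (int i = 0; i < a.length; i++) {
            acc = Math.max(acc, a[i] * b[i]);  // MulL pack + MaxL pack
        }
        return acc;
    }

    public static void main(String[] args) {
        long[] a = {3, -7, 11, 2};
        long[] b = {1, 1, 1, 1};
        System.out.println(maxSimple(a));        // prints 11
        System.out.println(maxDotProduct(a, b)); // prints 11
    }
}
```

The difference matters because, per the traces above, the `MulL` pack gives SuperWord enough vector work to keep the `MaxL` pack profitable, whereas a load feeding straight into `MaxL` does not.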
Although branch probabilities are not controlled here, we can observe that one of the branches is likely being taken ~100% of the time. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2642788364 From coleenp at openjdk.org Fri Feb 7 12:34:40 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Fri, 7 Feb 2025 12:34:40 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v7] In-Reply-To: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: > The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. > > There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. > > Tested with tier1-8. Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: Fix jvmci test. 
------------- Changes: - all: https://git.openjdk.org/jdk/pull/22652/files - new: https://git.openjdk.org/jdk/pull/22652/files/304a17ee..37a8cf81 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=22652&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=22652&range=05-06 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/22652.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22652/head:pull/22652 PR: https://git.openjdk.org/jdk/pull/22652 From galder at openjdk.org Fri Feb 7 12:39:24 2025 From: galder at openjdk.org (Galder Zamarreño) Date: Fri, 7 Feb 2025 12:39:24 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12] In-Reply-To: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> Message-ID: > This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance. > > Currently vectorization does not kick in for loops containing either of these calls because of the following error: > > > VLoop::check_preconditions: failed: control flow in loop not allowed > > > The control flow is due to the Java implementation of these methods, e.g. > > > public static long max(long a, long b) { > return (a >= b) ? a : b; > } > > > This patch intrinsifies the calls to replace the CmpL + Bool nodes with MaxL/MinL nodes, respectively. > By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization. > E.g.
> > > SuperWord::transform_loop: > Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt strip_mined > 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21) > > > Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after. Before the patch, on darwin/aarch64 (M1): > > > ============================== > Test summary > ============================== > TEST TOTAL PASS FAIL ERROR > jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java > 1 1 0 0 > ============================== > TEST SUCCESS > > long min 1155 > long max 1173 > > > After the patch, on darwin/aarch64 (M1): > > > ============================== > Test summary > ============================== > TEST TOTAL PASS FAIL ERROR > jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java > 1 1 0 0 > ============================== > TEST SUCCESS > > long min 1042 > long max 1042 > > > This patch does not add any platform-specific backend implementations for the MaxL/MinL nodes. > Therefore, it still relies on macro expansion to transform those into CMoveL. > > I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results: > > > ============================== > Test summary > ============================== > TEST TOTAL PASS FAIL ERROR > jtreg:test/hotspot/jtreg:tier1 2500 2500 0 0 >>> jtreg:test/jdk:tier1 ... Galder Zamarreño has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains 44 additional commits since the last revision:

- Merge branch 'master' into topic.intrinsify-max-min-long
- Fix typo
- Renaming methods and variables and add docu on algorithms
- Fix copyright years
- Make sure it runs with cpus with either avx512 or asimd
- Test can only run with 256 bit registers or bigger
  * Remove platform dependant check and use platform independent configuration instead.
- Fix license header
- Tests should also run on aarch64 asimd=true envs
- Added comment around the assertions
- Adjust min/max identity IR test expectations after changes
- ... and 34 more: https://git.openjdk.org/jdk/compare/f56622ff...a190ae68

------------- Changes: - all: https://git.openjdk.org/jdk/pull/20098/files - new: https://git.openjdk.org/jdk/pull/20098/files/724a346a..a190ae68 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20098&range=11 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20098&range=10-11 Stats: 206462 lines in 5108 files changed: 101636 ins; 84099 del; 20727 mod Patch: https://git.openjdk.org/jdk/pull/20098.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20098/head:pull/20098 PR: https://git.openjdk.org/jdk/pull/20098 From epeter at openjdk.org Fri Feb 7 16:40:15 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 7 Feb 2025 16:40:15 GMT Subject: RFR: 8348411: C2: Remove the control input of LoadKlassNode and LoadNKlassNode [v5] In-Reply-To: <38AZvEN6jWtzUKAm6eRqJwarn31L2bZYw4-MTClOMaQ=.828db7e7-115f-4856-afbf-6c00bbc34224@github.com> References: <9IQfA04XvFFNu5R3LMh6dYWPAbf7D-VoAb2O0Gt0b8Q=.22b568c9-2a45-41b7-91d7-11c22bb6abe0@github.com> <38AZvEN6jWtzUKAm6eRqJwarn31L2bZYw4-MTClOMaQ=.828db7e7-115f-4856-afbf-6c00bbc34224@github.com> Message-ID: On Thu, 6 Feb 2025 19:12:26 GMT, Quan Anh Mai wrote: >> Looks good, thanks for the explanations! >> >> I see we did not yet run internal tests for the last commit, though it is only formatting, so most likely ok.
>> >> But the state of the code is also 2 weeks old, so it would be good if you merged and launched testing again before integration, just in case ;) > > @eme64 I have merged the change with master, could you help me initiate the testing process, please? Thanks very much. @merykitty Testing is all passing! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23274#issuecomment-2643431552 From coleenp at openjdk.org Fri Feb 7 19:16:13 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Fri, 7 Feb 2025 19:16:13 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v7] In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: On Fri, 7 Feb 2025 12:34:40 GMT, Coleen Phillimore wrote: >> The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. >> >> There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. >> >> Tested with tier1-8. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Fix jvmci test. I added some code to hide the Class.modifiers field and fixed the JVMCI test. Please re-review. 
Also @iwanowww I think the intrinsic for isInterface can be removed and just be Java code like: public boolean isInterface() { return Modifier.isInterface(getModifiers()); } ------------- PR Comment: https://git.openjdk.org/jdk/pull/22652#issuecomment-2643799984 From vlivanov at openjdk.org Fri Feb 7 19:47:12 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Fri, 7 Feb 2025 19:47:12 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v7] In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: On Fri, 7 Feb 2025 12:34:40 GMT, Coleen Phillimore wrote: >> The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. >> >> There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. >> >> Tested with tier1-8. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Fix jvmci test. Marked as reviewed by vlivanov (Reviewer).
------------- PR Review: https://git.openjdk.org/jdk/pull/22652#pullrequestreview-2602686659 From never at openjdk.org Fri Feb 7 20:01:24 2025 From: never at openjdk.org (Tom Rodriguez) Date: Fri, 7 Feb 2025 20:01:24 GMT Subject: RFR: 8349374: [JVMCI] concurrent use of HotSpotSpeculationLog can crash [v3] In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 20:56:53 GMT, Tom Rodriguez wrote: >> This ensures that collectFailedSpeculations sees the initialization of the recently allocated failedSpeculationsAddress memory. > > Tom Rodriguez has updated the pull request incrementally with one additional commit since the last revision: > > improve comments Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23444#issuecomment-2643985603 From vlivanov at openjdk.org Fri Feb 7 20:01:27 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Fri, 7 Feb 2025 20:01:27 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v7] In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: On Fri, 7 Feb 2025 19:13:07 GMT, Coleen Phillimore wrote: > I think the intrinsic for isInterface can be removed Good point. Moreover, it seems most of intrinsics on Class queries can be replaced with a flag bit check on the mirror. (Do we have 16 unused bits in Class::modifiers after this change?) ------------- PR Comment: https://git.openjdk.org/jdk/pull/22652#issuecomment-2643997479 From never at openjdk.org Fri Feb 7 20:01:25 2025 From: never at openjdk.org (Tom Rodriguez) Date: Fri, 7 Feb 2025 20:01:25 GMT Subject: Integrated: 8349374: [JVMCI] concurrent use of HotSpotSpeculationLog can crash In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 16:31:50 GMT, Tom Rodriguez wrote: > This ensures that collectFailedSpeculations sees the initialization of the recently allocated failedSpeculationsAddress memory. This pull request has now been integrated. 
Changeset: 7f6c6878 Author: Tom Rodriguez URL: https://git.openjdk.org/jdk/commit/7f6c687815031d99931265007ff8867bf964cb25 Stats: 14 lines in 1 file changed: 9 ins; 0 del; 5 mod 8349374: [JVMCI] concurrent use of HotSpotSpeculationLog can crash Reviewed-by: kvn, dnsimon ------------- PR: https://git.openjdk.org/jdk/pull/23444 From coleenp at openjdk.org Fri Feb 7 21:14:13 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Fri, 7 Feb 2025 21:14:13 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v7] In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: On Fri, 7 Feb 2025 19:58:12 GMT, Vladimir Ivanov wrote: > Good point. Moreover, it seems most of intrinsics on Class queries can be replaced with a flag bit check on the mirror. (Do we have 16 unused bits in Class::modifiers after this change?) Yes, I think so. isArray and isPrimitive definitely. We could first change the modifiers field to "char" because that's its size and then have two booleans for each of these. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22652#issuecomment-2644136904 From liach at openjdk.org Fri Feb 7 21:37:12 2025 From: liach at openjdk.org (Chen Liang) Date: Fri, 7 Feb 2025 21:37:12 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v7] In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: On Fri, 7 Feb 2025 12:34:40 GMT, Coleen Phillimore wrote: >> The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. 
I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. >> >> There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. >> >> Tested with tier1-8. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Fix jvmci test. Making `isArray` and `isPrimitive` Java-based is going to be helpful for the interpreter performance of these methods in early bootstrap. ? ------------- PR Comment: https://git.openjdk.org/jdk/pull/22652#issuecomment-2644171713 From qamai at openjdk.org Sat Feb 8 04:23:18 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Sat, 8 Feb 2025 04:23:18 GMT Subject: Integrated: 8348411: C2: Remove the control input of LoadKlassNode and LoadNKlassNode In-Reply-To: References: Message-ID: On Thu, 23 Jan 2025 17:22:02 GMT, Quan Anh Mai wrote: > Hi, > > This patch removes the control input of `LoadKlassNode` and `LoadNKlassNode`. They can only have a control input if created inside `Parse::array_store_check()`, the reason given is: > > // We are allowed to use the constant type only if cast succeeded > > But this seems incorrect, the load from the constant type can be done regardless, and it will be constant-folded. This patch only makes that more formal and cleanup `LoadKlassNode::can_remove_control`. > > Please take a look and leave your reviews, thanks a lot. This pull request has now been integrated. 
Changeset: e9278de3 Author: Quan Anh Mai URL: https://git.openjdk.org/jdk/commit/e9278de3f8676c288bfdce96f8348470e7c42900 Stats: 60 lines in 10 files changed: 5 ins; 18 del; 37 mod 8348411: C2: Remove the control input of LoadKlassNode and LoadNKlassNode Reviewed-by: vlivanov, epeter ------------- PR: https://git.openjdk.org/jdk/pull/23274 From qamai at openjdk.org Sat Feb 8 04:23:17 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Sat, 8 Feb 2025 04:23:17 GMT Subject: RFR: 8348411: C2: Remove the control input of LoadKlassNode and LoadNKlassNode [v5] In-Reply-To: References: Message-ID: On Thu, 6 Feb 2025 19:11:58 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch removes the control input of `LoadKlassNode` and `LoadNKlassNode`. They can only have a control input if created inside `Parse::array_store_check()`, the reason given is: >> >> // We are allowed to use the constant type only if cast succeeded >> >> But this seems incorrect, the load from the constant type can be done regardless, and it will be constant-folded. This patch only makes that more formal and cleanup `LoadKlassNode::can_remove_control`. >> >> Please take a look and leave your reviews, thanks a lot. > > Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: > > - Merge branch 'master' into loadklassctrl > - format > - clearer intention, revert formatting, add assert > - remove always_see_exact_class > - remove control input of LoadKlassNode Thanks a lot for your reviews and testing! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23274#issuecomment-2644491331 From alanb at openjdk.org Sat Feb 8 19:44:12 2025 From: alanb at openjdk.org (Alan Bateman) Date: Sat, 8 Feb 2025 19:44:12 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v7] In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: On Fri, 7 Feb 2025 12:34:40 GMT, Coleen Phillimore wrote: >> The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. >> >> There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. >> >> Tested with tier1-8. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Fix jvmci test. No more comments from me. ------------- Marked as reviewed by alanb (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/22652#pullrequestreview-2604014387 From kvn at openjdk.org Sun Feb 9 19:43:29 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sun, 9 Feb 2025 19:43:29 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v2] In-Reply-To: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: > Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. > > Added C++ static asserts to make sure no virtual methods are added in a future. > > Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. > > Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: Fix Zero and Minimal VM builds ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23533/files - new: https://git.openjdk.org/jdk/pull/23533/files/11abd5e7..dda20f0b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=00-01 Stats: 6 lines in 1 file changed: 4 ins; 2 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23533.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23533/head:pull/23533 PR: https://git.openjdk.org/jdk/pull/23533 From kvn at openjdk.org Sun Feb 9 19:43:29 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sun, 9 Feb 2025 19:43:29 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod In-Reply-To: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> References: 
<0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Sun, 9 Feb 2025 17:45:30 GMT, Vladimir Kozlov wrote: > Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. > > Added C++ static asserts to make sure no virtual methods are added in a future. > > Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. > > Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp @dougxc and @tkrodriguez, please look if it affects Graal. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2646553512 From cjplummer at openjdk.org Mon Feb 10 03:14:22 2025 From: cjplummer at openjdk.org (Chris Plummer) Date: Mon, 10 Feb 2025 03:14:22 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v2] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Sun, 9 Feb 2025 19:43:29 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Fix Zero and Minimal VM builds I almost wished I hadn't looked because there is a lot of SA CodeBlob support that could use some cleanup. Most notably I think most of the wrapper subclasses are not needed by SA, and could be served by one common class. 
See what I'm doing in #23456 for JavaThread subclasses. Wrapper classes don't need to be 1-to-1 with the class type they are wrapping. A single wrapper class type can handle any number of hotspot types. It usually just makes more sense for them to be 1-to-1, but when they are trivial and the implementation is replicated across subtypes, just having one wrapper class implement them all can simplify things. The other thing I noticed is that a lot of the subtypes seem to be doing some unnecessary things like registering Observers, and they all do something like the following:

    44         Type type = db.lookupType("BufferBlob");

even when "type" is never referenced. I'm not suggesting you clean up any of this now, just pointing it out. I might file an issue and try to clean it up myself at some point. I still need to take a closer look at the SA changes.

src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeCache.java line 38:

> 36: public class CodeCache {
> 37: private static GrowableArray<CodeHeap> heapArray;
> 38: private static VirtualConstructor virtualConstructor;

What is the reason for switching from the virtualConstructor/hashMap approach to using getClassFor()? The hashmap is the model for JavaThread, MetaData, and CollectedHeap subtypes.
------------- PR Review: https://git.openjdk.org/jdk/pull/23533#pullrequestreview-2604594200 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1948335278 From cjplummer at openjdk.org Mon Feb 10 03:29:13 2025 From: cjplummer at openjdk.org (Chris Plummer) Date: Mon, 10 Feb 2025 03:29:13 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v2] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Mon, 10 Feb 2025 02:47:58 GMT, Chris Plummer wrote:

>> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision:
>>
>> Fix Zero and Minimal VM builds

> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeCache.java line 38:
>
>> 36: public class CodeCache {
>> 37: private static GrowableArray<CodeHeap> heapArray;
>> 38: private static VirtualConstructor virtualConstructor;
>
> What is the reason for switching from the virtualConstructor/hashMap approach to using getClassFor()? The hashmap is the model for JavaThread, MetaData, and CollectedHeap subtypes.

I think I found the answer. Since there is no longer a vtable, TypeDataBase.addressTypeIsEqualToType() will no longer work for CodeBlobs. I was wondering if the lack of a vtable might have some negative impact. Glad to see you found a solution. I hope the lack of a vtable does not creep in elsewhere. I know it will have some negative impact on things like the "findpc" functionality, which will no longer be able to tell the user that an address points to a CodeBlob instance. There's no test for this, but users might run across it.
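Chris's point above is the crux: with no vtable, the SA cannot match an address's dynamic type and has to key off the `CodeBlob::_kind` byte instead. A hypothetical sketch of what a `getClassFor()`-style lookup amounts to — the class names and kind numbering here are invented for illustration, not the actual SA code:

```java
// Stand-ins for the SA wrapper classes (names hypothetical).
class CodeBlobSketch {}
class NMethodSketch extends CodeBlobSketch {}
class BufferBlobSketch extends CodeBlobSketch {}
class RuntimeStubSketch extends CodeBlobSketch {}

class CodeCacheSketch {
    // Assumed values of the CodeBlob::_kind byte read out of the target VM.
    static final int KIND_NMETHOD      = 1;
    static final int KIND_BUFFER_BLOB  = 2;
    static final int KIND_RUNTIME_STUB = 3;

    // Pick the wrapper class from the kind byte, instead of comparing the
    // blob's (now absent) vtable pointer against known vtable addresses.
    static Class<? extends CodeBlobSketch> getClassFor(int kind) {
        switch (kind) {
            case KIND_NMETHOD:      return NMethodSketch.class;
            case KIND_BUFFER_BLOB:  return BufferBlobSketch.class;
            case KIND_RUNTIME_STUB: return RuntimeStubSketch.class;
            default:                return CodeBlobSketch.class;
        }
    }
}
```

The trade-off Chris notes follows directly: a kind byte identifies a blob only if you already know the address is a blob, whereas vtable matching let tools like "findpc" recognize one from a raw address.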
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1948352958 From jbhateja at openjdk.org Mon Feb 10 05:33:25 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 10 Feb 2025 05:33:25 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v17] In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 10:05:09 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch adds C2 compiler support for various Float16 operations added by [PR#22128](https://github.com/openjdk/jdk/pull/22128) >> >> Following is the summary of changes included with this patch:- >> >> 1. Detection of various Float16 operations through inline expansion or pattern folding idealizations. >> 2. Float16 operations like add, sub, mul, div, max, and min are inferred through pattern folding idealization. >> 3. Float16 SQRT and FMA operation are inferred through inline expansion and their corresponding entry points are defined in the newly added Float16Math class. >> - These intrinsics receive unwrapped short arguments encoding IEEE 754 binary16 values. >> 5. New specialized IR nodes for Float16 operations, associated idealizations, and constant folding routines. >> 6. New Ideal type for constant and non-constant Float16 IR nodes. Please refer to [FAQs ](https://github.com/openjdk/jdk/pull/22754#issuecomment-2543982577)for more details. >> 7. Since Float16 uses short as its storage type, hence raw FP16 values are always loaded into general purpose register, but FP16 ISA generally operates over floating point registers, thus the compiler injects reinterpretation IR before and after Float16 operation nodes to move short value to floating point register and vice versa. >> 8. New idealization routines to optimize redundant reinterpretation chains. HF2S + S2HF = HF >> 9. X86 backend implementation for all supported intrinsics. >> 10. Functional and Performance validation tests. 
>> >> Kindly review the patch and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Fixing typos Hi @PaulSandoz , Kindly let us know if this is good for integration. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22754#issuecomment-2646957788 From galder at openjdk.org Mon Feb 10 09:29:20 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Mon, 10 Feb 2025 09:29:20 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com> Message-ID: On Fri, 7 Feb 2025 12:27:42 GMT, Galder Zamarreño wrote: > At 100% probability baseline fails to vectorize because it observes a control flow. This control flow is not the one you see in min/max implementations, but this is one added by HotSpot as a result of the JIT profiling. It observes that one branch is always taken so it optimizes for that, and adds a branch for the uncommon case where the branch is not taken. I've dug further into this to try to understand how the baseline hotspot code works, and the explanation above is not entirely correct. Let's look at the IR differences between, say, 100% vs 80% branch situations.
At branch 80% you see: 1115 CountedLoop === 1115 598 463 [[ 1101 1115 1116 1118 451 594 ]] inner stride: 2 main of N1115 strip mined !orig=[599],[590],[307] !jvms: MinMaxVector::longLoopMax @ bci:10 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) 692 LoadL === 1083 1101 393 [[ 747 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=9; #long (does not depend only on test, unknown control) !orig=[395] !jvms: MinMaxVector::longLoopMax @ bci:26 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) 651 LoadL === 1095 1101 355 [[ 747 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=9; #long (does not depend only on test, unknown control) !orig=[357] !jvms: MinMaxVector::longLoopMax @ bci:20 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) 747 MaxL === _ 651 692 [[ 451 ]] !orig=[608],[416] !jvms: Math::max @ bci:11 (line 2037) MinMaxVector::longLoopMax @ bci:27 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) 451 StoreL === 1115 1101 449 747 [[ 1116 454 911 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=9; Memory: @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=9; !orig=1124 !jvms: MinMaxVector::longLoopMax @ bci:30 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) 594 CountedLoopEnd === 1115 593 [[ 1123 463 ]] [lt] P=0.999731, C=780799.000000 !orig=[462] !jvms: MinMaxVector::longLoopMax @ bci:7 (line 235) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) You see the counted loop with the LoadL for array loads and MaxL consuming those. The StoreL is for array assignment (I think). 
At branch 100% you see: 650 LoadL === 1105 1119 355 [[ 416 408 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=9; #long (does not depend only on test, unknown control) !orig=[357] !jvms: MinMaxVector::longLoopMax @ bci:20 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) 691 LoadL === 1093 1119 393 [[ 416 408 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=9; #long (does not depend only on test, unknown control) !orig=[395] !jvms: MinMaxVector::longLoopMax @ bci:26 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) 408 CmpL === _ 650 691 [[ 409 ]] !jvms: Math::max @ bci:3 (line 2037) MinMaxVector::longLoopMax @ bci:27 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) 409 Bool === _ 408 [[ 410 ]] [lt] !jvms: Math::max @ bci:3 (line 2037) MinMaxVector::longLoopMax @ bci:27 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) 410 If === 1132 409 [[ 411 412 ]] P=0.019892, C=79127.000000 !jvms: Math::max @ bci:3 (line 2037) MinMaxVector::longLoopMax @ bci:27 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) 411 IfTrue === 410 [[ 415 ]] #1 !jvms: Math::max @ bci:3 (line 2037) MinMaxVector::longLoopMax @ bci:27 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) 412 IfFalse === 410 [[ 415 ]] #0 !jvms: Math::max @ bci:3 (line 2037) MinMaxVector::longLoopMax @ bci:27 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) 415 Region === 415 411 412 [[ 415 594 416 451 ]] !orig=[423] !jvms: Math::max @ bci:11 (line 2037) MinMaxVector::longLoopMax @ bci:27 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) 594 CountedLoopEnd === 415 593 [[ 1139 463 ]] [lt] P=0.999683, C=706030.000000 !orig=[462] !jvms: 
MinMaxVector::longLoopMax @ bci:7 (line 235) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) You see a region within the counted loop with the if/else which belongs to the actual `Math.max` implementation, with the corresponding CmpL and the LoadL nodes for retrieving the longs from the arrays. What causes the difference? It's this section in `PhaseIdealLoop::conditional_move`: ```c++ // Check for highly predictable branch. No point in CMOV'ing if // we are going to predict accurately all the time. if (C->use_cmove() && (cmp_op == Op_CmpF || cmp_op == Op_CmpD)) { //keep going } else if (iff->_prob < infrequent_prob || iff->_prob > (1.0f - infrequent_prob)) return nullptr; At branch 100 `iff->_prob > (1.0f - infrequent_prob)` becomes true and no CMoveL is created so hotspot seems to stick to the original bytecode implementation of `Math.max`. At branch 80 that comparison is below and CMoveL is created, which eventually gets converted into a MaxL node and vectorization kicks in. The numbers are interesting. `infrequent_prob` appears to be a fixed number `0.181818187` and `1.0f` minus that is `0.818181812`. So, at branch 100 `iff->_prob` is `0.906792104` therefore higher than `0.818181812`, and at branch 80 `0.718619287`. I would have expected those `iff->_prob` to be closer to the branch % targets I set, but ignoring that, seems like ~90% would be the cut off. 
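The gating arithmetic is easy to reproduce with the constants reported in this thread (`infrequent_prob` ≈ 0.181818187, and the measured `iff->_prob` values of 0.906792104 and 0.718619287). A sketch of the non-float path — the `use_cmove()` escape hatch doesn't apply here, since the comparison is a `CmpL`, not `CmpF`/`CmpD`:

```java
// Sketch of the cmov gating check quoted above. The constants are the
// values reported in this thread, not re-derived from the HotSpot sources.
class CmovGateSketch {
    static final double INFREQUENT_PROB = 0.181818187;

    // true  -> conditional_move() keeps going (a CMove can be created)
    // false -> the branch is considered too predictable; bail out
    static boolean cmovAllowed(double prob) {
        return prob >= INFREQUENT_PROB && prob <= 1.0 - INFREQUENT_PROB;
    }
}
```

With these numbers, the 100% case (prob 0.9068) falls outside the [0.1818, 0.8182] window and keeps the branchy code, while the 80% case (prob 0.7186) falls inside it and gets the CMoveL that later becomes MaxL.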
------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2647410266 From yzheng at openjdk.org Mon Feb 10 10:17:15 2025 From: yzheng at openjdk.org (Yudi Zheng) Date: Mon, 10 Feb 2025 10:17:15 GMT Subject: RFR: 8346567: Make Class.getModifiers() non-native [v7] In-Reply-To: References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: On Fri, 7 Feb 2025 12:34:40 GMT, Coleen Phillimore wrote: >> The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. >> >> There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. >> >> Tested with tier1-8. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Fix jvmci test. JVMCI change looks good to me ------------- Marked as reviewed by yzheng (Committer). 
PR Review: https://git.openjdk.org/jdk/pull/22652#pullrequestreview-2605295926 From dnsimon at openjdk.org Mon Feb 10 11:03:13 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Mon, 10 Feb 2025 11:03:13 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Sun, 9 Feb 2025 19:36:28 GMT, Vladimir Kozlov wrote: > @dougxc and @tkrodriguez, please look if it affects Graal. I'm pretty sure JVMCI does not care about the virtual-ness of these C++ classes. Running tier9 in mach5 is a good way to be sure. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2647642674 From adinn at openjdk.org Mon Feb 10 11:07:14 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Mon, 10 Feb 2025 11:07:14 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v2] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: <0LQ3b0zaCg8HEDx4C5xM8W4-qmQ9PkoAClhyVxKxxtE=.8cd94c7a-8496-436c-8387-6aa443942bb6@github.com> On Sun, 9 Feb 2025 19:43:29 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Fix Zero and Minimal VM builds src/hotspot/share/code/codeBlob.cpp line 58: > 56: #include > 57: > 58: // Virtual methods are not allowed in code blobs to simplify caching compiled code. 
Is it worth considering generating this code plus also some of the existing code in the header using an iterator template macro? e.g.

    #define CODEBLOBS_DO(do_codeblob_abstract, do_codeblob_nonleaf, \
                         do_codeblob_leaf) \
      do_codeblob_abstract(CodeBlob) \
      do_codeblob_leaf(nmethod, Nmethod, nmethod) \
      do_codeblob_abstract(RuntimeBlob) \
      do_codeblob_nonleaf(BufferBlob, Buffer, buffer) \
      do_codeblob_leaf(AdapterBlob, Adapter, adapter) \
      . . . \
      do_codeblob_leaf(RuntimeStub, Runtime_Stub, runtime_stub) \
      . . .

The macro arguments to the templates would themselves be macros:

    do_codeblob_abstract(classname)                        // abstract, non-instantiable class
    do_codeblob_nonleaf(classname, kindname, accessorname) // instantiable, subclassable
    do_codeblob_leaf(classname, kindname, accessorname)    // instantiable, non-subclassable

Using a template macro like this to generate the code below -- *plus also* some of the code currently declared piecemeal in the header -- would guarantee all cases are covered now and will remain so later when the macro is updated. I think it would probably also allow case handling code in AOT cache code to be generated.

So, we would generate the code here as follows:

    #define empty1(classname)
    #define empty3(classname, kindname, accessorname)

    #define assert_nonvirtual_leaf(classname, kindname, accessorname) \
      static_assert(!std::is_polymorphic<classname>::value, \
                    "no virtual methods are allowed in " # classname );

    CODEBLOBS_DO(empty1, empty3, assert_nonvirtual_leaf)

    #undef assert_nonvirtual_leaf

Likewise in codeBlob.hpp we could generate `enum CodeBlobKind` to cover all the non-abstract classes and likewise generate the accessor methods `is_nmethod()`, `is_buffer_blob()` in class `CodeBlob` which allow the kind to be tested.

    #define codekind_enum_tag(classname, kindname, accessorname) \
      kindname,

    enum CodeBlobKind : u1 {
      None,
      CODEBLOBS_DO(empty1, codekind_enum_tag, codekind_enum_tag)
      Number_Of_Kinds
    };

    #define is_codeblob_define(classname, kindname, accessorname) \
      bool is_##accessorname() { return _kind == kindname; }

    class CodeBlob {
      . . .
      CODEBLOBS_DO(empty1, is_codeblob_define, is_codeblob_define);
      . . .

There may be other opportunities to use the iterator (e.g. in vmStructs.cpp?) but this looks like a good start.
Thank you for the reviews Yudi, Alan, Chen, Vladimir and Dean, and the help and comments with the various pieces of this. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22652#issuecomment-2647880184 From coleenp at openjdk.org Mon Feb 10 12:47:32 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Mon, 10 Feb 2025 12:47:32 GMT Subject: Integrated: 8346567: Make Class.getModifiers() non-native In-Reply-To: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> References: <7X3DYiPMRGAIWCyCP64kbZvHuxjmmszGxfH1dfSu38k=.7fdb2512-1999-4c7e-835c-da96d57ca1be@github.com> Message-ID: <-VYQTxGucpCCQZccdw6wMnDavFDAt75MDHY8mGxEMiw=.042099b8-41dc-4b0d-8bdd-a874f004a0f6@github.com> On Mon, 9 Dec 2024 19:26:53 GMT, Coleen Phillimore wrote: > The Class.getModifiers() method is implemented as a native method in java.lang.Class to access a field that we've calculated when creating the mirror. The field is final after that point. The VM doesn't need it anymore, so there's no real need for the jdk code to call into the VM to get it. This moves the field to Java and removes the intrinsic code. I promoted the compute_modifiers() functions to return int since that's how java.lang.Class uses the value. It should really be an unsigned short though. > > There's a couple of JMH benchmarks added with this change. One does show that for array classes for non-bootstrap class loader, this results in one extra load which in a long loop of just that, is observable. I don't think this is real life code. The other benchmarks added show no regression. > > Tested with tier1-8. This pull request has now been integrated. 
Changeset: c9cadbd2 Author: Coleen Phillimore URL: https://git.openjdk.org/jdk/commit/c9cadbd23fb13933b8968f283d27842cd35f8d6f Stats: 217 lines in 31 files changed: 71 ins; 127 del; 19 mod 8346567: Make Class.getModifiers() non-native Reviewed-by: alanb, vlivanov, yzheng, dlong ------------- PR: https://git.openjdk.org/jdk/pull/22652 From stefank at openjdk.org Mon Feb 10 16:26:12 2025 From: stefank at openjdk.org (Stefan Karlsson) Date: Mon, 10 Feb 2025 16:26:12 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v2] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Sun, 9 Feb 2025 19:43:29 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Fix Zero and Minimal VM builds We have a similar situation with oopDesc, which is not allowed to have a vtable. The solution there is to use the Klass as the proxy vtable and then have a bunch of Klass::oop_ functions that act like virtual dispatch functions for associated oopDesc functions. I wonder if a similar approach can be used here? Such an approach would (to me at least) have the benefit that we don't have to spread switch statements in various functions in the top-most class.
If you are interested in seeing a prototype of this, take a look at this branch: https://github.com/openjdk/jdk/compare/master...stefank:jdk:code_blob_vptr Just a suggestion if you want to consider alternatives to these switch statements. ------------- PR Review: https://git.openjdk.org/jdk/pull/23533#pullrequestreview-2606457754 From kvn at openjdk.org Mon Feb 10 16:39:19 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 10 Feb 2025 16:39:19 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v2] In-Reply-To: <0LQ3b0zaCg8HEDx4C5xM8W4-qmQ9PkoAClhyVxKxxtE=.8cd94c7a-8496-436c-8387-6aa443942bb6@github.com> References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> <0LQ3b0zaCg8HEDx4C5xM8W4-qmQ9PkoAClhyVxKxxtE=.8cd94c7a-8496-436c-8387-6aa443942bb6@github.com> Message-ID: <1P7Q-yHC0Ho8DPfgzZfxR27NmNQPJ4LcgEbilqdaVNw=.0c023c74-b3d9-4139-8363-5ebdf1a1805d@github.com> On Mon, 10 Feb 2025 11:04:38 GMT, Andrew Dinn wrote: >> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix Zero and Minimal VM builds > > src/hotspot/share/code/codeBlob.cpp line 58: > >> 56: #include >> 57: >> 58: // Virtual methods are not allowed in code blobs to simplify caching compiled code. > > Is it worth considering generating this code plus also some of the existing code in the header using an iterator template macro? e.g. > > #define CODEBLOBS_DO(do_codeblob_abstract, do_codeblob_nonleaf, \ > do_codeblob_leaf) \ > do_codeblob_abstract(CodeBlob) \ > do_codeblob_leaf(nmethod, Nmethod, nmethod) \ > do_codeblob_abstract(RuntimeBlob) \ > do_codeblob_nonleaf(BufferBlob, Buffer, buffer) \ > do_codeblob_leaf(AdapterBlob, Adapter, adapter) \ > . . . \ > do_codeblob_leaf(RuntimeStub, Runtime_Stub, runtime_stub) \ > . . . 
> > The macro arguments to the templates would themselves be macros: > > do_codeblob_abstract(classname) // abstract, non-instantiable class > do_codeblob_nonleaf(classname, kindname, accessorname) // instantiable, subclassable > do_codeblob_leaf(classname, kindname, accessorname) // instantiable, non-subclassable > > Using a template macro like this to generate the code below -- *plus also* some of the code currently declared piecemeal in the header -- would guarantee all cases are covered now and will remain so later so when the macro is updated. I think it would probably also allow case handling code in AOT cache code to be generated. > > So, we would generate the code here as follows > > #define EMPTY1(classname) > #define EMPTY3(classname, kindname, accessorname) > > #define assert_nonvirtual_leaf(classname, kindname, accessorname) \ > static_assert(!std::is_polymorphic::value, \ > "no virtual methods are allowed in " # classname ); > > CODEBLOBS_DO(empty1, empty3, assert_nonvirtual_leaf) > > #undef assert_nonvirtual_leaf > > Likewise in codeBlob.hpp we could generate `enum CodeBlobKind` to cover all the non-abstract classes and likewise generate the accessor methods `is_nmethod()`, `is_buffer_blob()` in class `CodeBlob` which allow the kind to be tested. > > #define codekind_enum_tag(classname, kindname, accessorname) \ > kindname, > > enum CodeBlobKind : u1 { > None, > CODEBLOBS_DO(empty1, codekind_enum_tag, codekind_enum_tag) > Number_Of_Kinds > }; > > ... Thank you @adinn for suggestion but no, I don't like macros - hard to debug and they add more complexity in this case. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1949483501 From kvn at openjdk.org Mon Feb 10 16:50:12 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 10 Feb 2025 16:50:12 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v2] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Mon, 10 Feb 2025 03:25:30 GMT, Chris Plummer wrote: >> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeCache.java line 38: >> >>> 36: public class CodeCache { >>> 37: private static GrowableArray heapArray; >>> 38: private static VirtualConstructor virtualConstructor; >> >> What is the reason for switching from the virtualConstructor/hashMap approach to using getClassFor()? The hashmap is the model for JavaThread, MetaData, and CollectedHeap subtypes. > > I think I found the answer. Since there is no longer a vtable, TypeDataBase.addressTypeIsEqualToType() will no longer work for Codeblobs. I was wondering if the lack of a vtable might have some negative impact. Glad to see you found a solution. I hope the lack of a vtable does not creep in elsewhere. I know it will have some negative impact on things like the "findpc" functionality, which will no longer be able to tell the user that an address points to a Codeblob instance. There's no test for this, but users might run across it. > What is the reason for switching from the virtualConstructor/hashMap approach to using getClassFor()? The hashmap is the model for JavaThread, MetaData, and CollectedHeap subtypes. I don't need any more mapping from CodeBlob class to corresponding virtual table name which does not exist anymore. `CodeBlob::_kind` field's value is used to determine which class should be used. I think `hashMap` is overkill here. I can construct array `Class cbClasses[]` and use `cbClasses[CodeBlob::_kind]` instead of `if/else` in `getClassFor`. 
But I would still need to check for an unknown value of `CodeBlob::_kind` somehow. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1949505126 From kvn at openjdk.org Mon Feb 10 17:06:13 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 10 Feb 2025 17:06:13 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v2] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Mon, 10 Feb 2025 16:23:53 GMT, Stefan Karlsson wrote: >> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix Zero and Minimal VM builds > > We have a similar situation with oopDesc that are not allowed to have a vtable. The solution there is to use the Klass as the proxy vtable and then have a bunch of Klass::oop_ functions that act like virtual dispatch functions for associated oopDesc functions. > > I wonder if a similar approach can be used here? Such an approach would (to me at least) have the benefit that we don't have to spread switch statements in various functions in the top-most class. > > If you are interested in seeing a prototype of this, take a look at this branch: > https://github.com/openjdk/jdk/compare/master...stefank:jdk:code_blob_vptr > > Just a suggestion if you want to consider alternatives to these switch statements. Thank you, @stefank. This is a very interesting suggestion, which I may take. I will check it.
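For comparison with the switch/array alternatives discussed in this thread, here is a minimal sketch of the kind-indexed class lookup Vladimir describes for the SA side. The class names, the kind numbering, and the `getClassFor` shape are illustrative stand-ins, not the real SA sources:

```java
// Hypothetical sketch: replacing vtable-based type detection with a
// lookup indexed by the value of a kind field (like CodeBlob::_kind).
public class CodeBlobFactory {
    // Placeholder stand-ins for the SA wrapper classes.
    static class NMethod {}
    static class BufferBlob {}
    static class RuntimeStub {}

    // Index mirrors a hypothetical CodeBlobKind ordinal; 0 means "none".
    static final Class<?>[] CB_CLASSES = {
        null,             // 0: None
        NMethod.class,    // 1
        BufferBlob.class, // 2
        RuntimeStub.class // 3
    };

    static Class<?> getClassFor(int kind) {
        // The unknown-kind check mentioned above still has to be explicit;
        // the array lookup alone would only catch out-of-range values.
        if (kind <= 0 || kind >= CB_CLASSES.length) {
            throw new IllegalArgumentException("unknown CodeBlob kind: " + kind);
        }
        return CB_CLASSES[kind];
    }
}
```

The array keeps the mapping in one place, which is the main advantage over an if/else chain repeated per call site.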
------------- PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2648688942 From mpowers at openjdk.org Mon Feb 10 21:01:18 2025 From: mpowers at openjdk.org (Mark Powers) Date: Mon, 10 Feb 2025 21:01:18 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: References: Message-ID: On Thu, 6 Feb 2025 18:47:54 GMT, Ferenc Rakoczi wrote: >> By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: > > Adding comments + some code reorganization Some measurements:

With Intrinsics
---------------
keygen ML-DSA-44  38.8 us/op
keygen ML-DSA-65  82.5 us/op
keygen ML-DSA-87 112.6 us/op
siggen ML-DSA-44 119.1 us/op
siggen ML-DSA-65 186.5 us/op
siggen ML-DSA-87 306.1 us/op
sigver ML-DSA-44  46.4 us/op
sigver ML-DSA-65  72.8 us/op
sigver ML-DSA-87 123.4 us/op

No Intrinsics
-------------
keygen ML-DSA-44  63.1 us/op
keygen ML-DSA-65 118.7 us/op
keygen ML-DSA-87 167.2 us/op
siggen ML-DSA-44 466.8 us/op
siggen ML-DSA-65 546.3 us/op
siggen ML-DSA-87 560.3 us/op
sigver ML-DSA-44  71.6 us/op
sigver ML-DSA-65 117.9 us/op
sigver ML-DSA-87 180.4 us/op

------------- PR Comment: https://git.openjdk.org/jdk/pull/23300#issuecomment-2649220775 From psandoz at openjdk.org Mon Feb 10 21:26:25 2025 From: psandoz at openjdk.org (Paul Sandoz) Date: Mon, 10 Feb 2025 21:26:25 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v17] In-Reply-To: References: Message-ID: On Tue, 4 Feb 2025 10:05:09 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch adds C2 compiler support for various Float16 operations added by [PR#22128](https://github.com/openjdk/jdk/pull/22128) >> >> Following is the summary of changes included with this patch:- >> >> 1.
Detection of various Float16 operations through inline expansion or pattern folding idealizations. >> 2. Float16 operations like add, sub, mul, div, max, and min are inferred through pattern folding idealization. >> 3. Float16 SQRT and FMA operation are inferred through inline expansion and their corresponding entry points are defined in the newly added Float16Math class. >> - These intrinsics receive unwrapped short arguments encoding IEEE 754 binary16 values. >> 5. New specialized IR nodes for Float16 operations, associated idealizations, and constant folding routines. >> 6. New Ideal type for constant and non-constant Float16 IR nodes. Please refer to [FAQs ](https://github.com/openjdk/jdk/pull/22754#issuecomment-2543982577)for more details. >> 7. Since Float16 uses short as its storage type, hence raw FP16 values are always loaded into general purpose register, but FP16 ISA generally operates over floating point registers, thus the compiler injects reinterpretation IR before and after Float16 operation nodes to move short value to floating point register and vice versa. >> 8. New idealization routines to optimize redundant reinterpretation chains. HF2S + S2HF = HF >> 9. X86 backend implementation for all supported intrinsics. >> 10. Functional and Performance validation tests. >> >> Kindly review the patch and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Fixing typos An impressive and substantial change. I focused on the Java code, there are some small tweaks, presented in comments, we can make to the intrinsics to improve the expression of code, and it has no impact on the intrinsic implementation. src/java.base/share/classes/jdk/internal/vm/vector/Float16Math.java line 32: > 30: * The class {@code Float16Math} constains intrinsic entry points corresponding > 31: * to scalar numeric operations defined in Float16 class. 
> 32: * @since 25 You can remove this line, since this is an internal class. src/java.base/share/classes/jdk/internal/vm/vector/Float16Math.java line 38: > 36: } > 37: > 38: public interface Float16UnaryMathOp { You can just use `UnaryOperator`, no need for a new type, here are the updated methods you can apply to this class. @FunctionalInterface public interface TernaryOperator { T apply(T a, T b, T c); } @IntrinsicCandidate public static T sqrt(Class box_class, T oa, UnaryOperator defaultImpl) { assert isNonCapturingLambda(defaultImpl) : defaultImpl; return defaultImpl.apply(oa); } @IntrinsicCandidate public static T fma(Class box_class, T oa, T ob, T oc, TernaryOperator defaultImpl) { assert isNonCapturingLambda(defaultImpl) : defaultImpl; return defaultImpl.apply(oa, ob, oc); } static boolean isNonCapturingLambda(Object o) { return o.getClass().getDeclaredFields().length == 0; } And in `src/hotspot/share/classfile/vmIntrinsics.hpp`: /* Float16Math API intrinsification support */ \ /* Float16 signatures */ \ do_signature(float16_unary_math_op_sig, "(Ljava/lang/Class;" \ "Ljava/lang/Object;" \ "Ljava/util/function/UnaryOperator;)" \ "Ljava/lang/Object;") \ do_signature(float16_ternary_math_op_sig, "(Ljava/lang/Class;" \ "Ljava/lang/Object;" \ "Ljava/lang/Object;" \ "Ljava/lang/Object;" \ "Ljdk/internal/vm/vector/Float16Math$TernaryOperator;)" \ "Ljava/lang/Object;") \ do_intrinsic(_sqrt_float16, jdk_internal_vm_vector_Float16Math, sqrt_name, float16_unary_math_op_sig, F_S) \ do_intrinsic(_fma_float16, jdk_internal_vm_vector_Float16Math, fma_name, float16_ternary_math_op_sig, F_S) \ src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Float16.java line 1202: > 1200: */ > 1201: public static Float16 sqrt(Float16 radicand) { > 1202: return (Float16) Float16Math.sqrt(Float16.class, radicand, With changes to the intrinsics (as presented in another comment) you no longer need explicit casts and the code is precisely the same as before except embedded in a lambda 
body: public static Float16 sqrt(Float16 radicand) { return Float16Math.sqrt(Float16.class, radicand, (_radicand) -> { // Rounding path of sqrt(Float16 -> double) -> Float16 is fine // for preserving the correct final value. The conversion // Float16 -> double preserves the exact numerical value. The // conversion of double -> Float16 also benefits from the // 2p+2 property of IEEE 754 arithmetic. return valueOf(Math.sqrt(_radicand.doubleValue())); } ); } Similarly for `fma`: return Float16Math.fma(Float16.class, a, b, c, (_a, _b, _c) -> { // product is numerically exact in float before the cast to // double; not necessary to widen to double before the // multiply. double product = (double)(_a.floatValue() * _b.floatValue()); return valueOf(product + _c.doubleValue()); }); test/jdk/jdk/incubator/vector/ScalarFloat16OperationsTest.java line 44: > 42: import static jdk.incubator.vector.Float16.*; > 43: > 44: public class ScalarFloat16OperationsTest { Now that we have IR tests do you still think this test is necessary or should we have more IR test instead? @eme64 thoughts? We could follow up in another PR if need be. 
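The `isNonCapturingLambda` guard sketched in the review above relies on a HotSpot implementation detail: the lambda metafactory only generates instance fields on a lambda's class when the lambda captures values. A standalone illustration (observable on HotSpot, but not guaranteed by the language specification):

```java
import java.util.function.UnaryOperator;

public class LambdaCaptureDemo {
    // Same shape as the check suggested in the review: captured values
    // appear as synthetic instance fields on the generated lambda class.
    static boolean isNonCapturingLambda(Object o) {
        return o.getClass().getDeclaredFields().length == 0;
    }

    public static void main(String[] args) {
        UnaryOperator<Integer> inc = x -> x + 1;        // captures nothing
        int step = 5;
        UnaryOperator<Integer> addStep = x -> x + step; // captures `step`
        System.out.println(isNonCapturingLambda(inc));     // true on HotSpot
        System.out.println(isNonCapturingLambda(addStep)); // false on HotSpot
    }
}
```

This is why the intrinsic entry points assert non-capturing lambdas: a capturing lambda would smuggle state past the intrinsic's fixed argument list.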
------------- PR Review: https://git.openjdk.org/jdk/pull/22754#pullrequestreview-2607094727 PR Review Comment: https://git.openjdk.org/jdk/pull/22754#discussion_r1949842011 PR Review Comment: https://git.openjdk.org/jdk/pull/22754#discussion_r1949871647 PR Review Comment: https://git.openjdk.org/jdk/pull/22754#discussion_r1949847574 PR Review Comment: https://git.openjdk.org/jdk/pull/22754#discussion_r1949858554 From jbhateja at openjdk.org Tue Feb 11 06:32:56 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 11 Feb 2025 06:32:56 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v18] In-Reply-To: References: Message-ID: > Hi All, > > This patch adds C2 compiler support for various Float16 operations added by [PR#22128](https://github.com/openjdk/jdk/pull/22128) > > Following is the summary of changes included with this patch:- > > 1. Detection of various Float16 operations through inline expansion or pattern folding idealizations. > 2. Float16 operations like add, sub, mul, div, max, and min are inferred through pattern folding idealization. > 3. Float16 SQRT and FMA operation are inferred through inline expansion and their corresponding entry points are defined in the newly added Float16Math class. > - These intrinsics receive unwrapped short arguments encoding IEEE 754 binary16 values. > 5. New specialized IR nodes for Float16 operations, associated idealizations, and constant folding routines. > 6. New Ideal type for constant and non-constant Float16 IR nodes. Please refer to [FAQs ](https://github.com/openjdk/jdk/pull/22754#issuecomment-2543982577)for more details. > 7. Since Float16 uses short as its storage type, hence raw FP16 values are always loaded into general purpose register, but FP16 ISA generally operates over floating point registers, thus the compiler injects reinterpretation IR before and after Float16 operation nodes to move short value to floating point register and vice versa. > 8. 
New idealization routines to optimize redundant reinterpretation chains. HF2S + S2HF = HF > 9. X86 backend implementation for all supported intrinsics. > 10. Functional and Performance validation tests. > > Kindly review the patch and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Review comments resolutions ------------- Changes: - all: https://git.openjdk.org/jdk/pull/22754/files - new: https://git.openjdk.org/jdk/pull/22754/files/82a42213..111c8084 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=22754&range=17 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=22754&range=16-17 Stats: 38 lines in 3 files changed: 2 ins; 11 del; 25 mod Patch: https://git.openjdk.org/jdk/pull/22754.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22754/head:pull/22754 PR: https://git.openjdk.org/jdk/pull/22754 From jbhateja at openjdk.org Tue Feb 11 06:32:56 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 11 Feb 2025 06:32:56 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v17] In-Reply-To: References: Message-ID: On Mon, 10 Feb 2025 20:43:19 GMT, Paul Sandoz wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Fixing typos > > test/jdk/jdk/incubator/vector/ScalarFloat16OperationsTest.java line 44: > >> 42: import static jdk.incubator.vector.Float16.*; >> 43: >> 44: public class ScalarFloat16OperationsTest { > > Now that we have IR tests do you still think this test is necessary or should we have more IR test instead? @eme64 thoughts? We could follow up in another PR if need be. 
Hi Paul, DataProviders used in this Functional validation test exercise each newly added Float16 operation over the entire value range, while our IR tests are more directed towards validating the newly added IR transforms and constant folding scenarios. We have a follow-up PR for auto-vectorizing Float16 operations which can be used to beef up any validation gap. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22754#discussion_r1950290083 From bkilambi at openjdk.org Tue Feb 11 10:43:22 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Tue, 11 Feb 2025 10:43:22 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: References: Message-ID: <1yB95sOajuS5ptFI0GQWLepii5JsZ9DOsje-TEFyFYs=.a325ad18-17ed-4e77-b1e3-0bad2cf55c67@github.com> On Thu, 6 Feb 2025 18:47:54 GMT, Ferenc Rakoczi wrote: >> By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: > > Adding comments + some code reorganization src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 2618: > 2616: INSN(smaxp, 0, 0b101001, false); // accepted arrangements: T8B, T16B, T4H, T8H, T2S, T4S > 2617: INSN(sminp, 0, 0b101011, false); // accepted arrangements: T8B, T16B, T4H, T8H, T2S, T4S > 2618: INSN(sqdmulh,0, 0b101101, false); // accepted arrangements: T4H, T8H, T2S, T4S Hi, not a comment on the algorithm itself but you might have to add these new instructions in the gtest for aarch64 here - test/hotspot/gtest/aarch64/aarch64-asmtest.py and use this file to generate test/hotspot/gtest/aarch64/asmtest.out.h which would contain these newly added instructions.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1950610623 From kvn at openjdk.org Tue Feb 11 23:58:14 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 11 Feb 2025 23:58:14 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v2] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Mon, 10 Feb 2025 03:11:22 GMT, Chris Plummer wrote: >> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix Zero and Minimal VM builds > > I almost wished I hadn't looked because there is a lot of SA CodeBlob support that could use some cleanup. Most notably I think most of the wrapper subclasses are not needed by SA, and could be served by one common class. See what I'm doing in #23456 for JavaThread subclasses. Wrapper classes don't need to be 1-to-1 with the class type they are wrapping. A single wrapper class type can handle any number of hotspot types. It usually just make more sense for them to be 1-to-1, but when they are trivial and the implementation is replicated across subtypes, just having one wrapper class implement them all can simplify things. > > The other thing I noticed is a lot of the subtypes seem to be doing some unnecessary things like registering Observers, and they all do something like the following: > > 44 Type type = db.lookupType("BufferBlob"); > > Even when it never references "type". > > I'm not suggesting you clean up any of this now, but just pointed it out. I might file an issue and try to clean it up myself at some point. > > I still need to take a closer look at the SA changes. Before I forgot to answer you, @plummercj I completely agree with your comment about cleaning up wrapper subclasses which do nothing. I think some wrapper subclasses for CodeBlob were kept because of `is*()` which were used only in `PStack` to print name. 
Why not use `getName()` for this purpose without the big `if/else` there? Another purpose could be a placeholder for additional information in the future, which never came. Other wrappers provide information available in `CodeBlob`, like `RuntimeStub.callerMustGCArguments()`. The `_caller_must_gc_arguments` field has been part of the VM's `CodeBlob` class for some time now. Looks like I missed the change in SA when I did the change in the VM. So yes, feel free to clean this up. I will help with review. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2652321179 From kvn at openjdk.org Wed Feb 12 00:11:28 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 12 Feb 2025 00:11:28 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v3] In-Reply-To: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: > Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. > > Added C++ static asserts to make sure no virtual methods are added in a future. > > Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob.
> > Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: Add CodeBlob proxy vtable ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23533/files - new: https://git.openjdk.org/jdk/pull/23533/files/dda20f0b..43ae0ed2 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=01-02 Stats: 322 lines in 13 files changed: 175 ins; 90 del; 57 mod Patch: https://git.openjdk.org/jdk/pull/23533.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23533/head:pull/23533 PR: https://git.openjdk.org/jdk/pull/23533 From kvn at openjdk.org Wed Feb 12 00:11:28 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 12 Feb 2025 00:11:28 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v2] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Sun, 9 Feb 2025 19:43:29 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Fix Zero and Minimal VM builds I adopted Stefan's suggestion. I agree that it is more "future-proof". I also remove underscore `_` from `CodeBlobKind` names. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2652333587 PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2652335723 From kvn at openjdk.org Wed Feb 12 00:14:31 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 12 Feb 2025 00:14:31 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v4] In-Reply-To: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: > Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. > > Added C++ static asserts to make sure no virtual methods are added in a future. > > Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. > > Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: Fix Minimal and Zero VM builds again ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23533/files - new: https://git.openjdk.org/jdk/pull/23533/files/43ae0ed2..7d3dce0e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=02-03 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23533.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23533/head:pull/23533 PR: https://git.openjdk.org/jdk/pull/23533 From kvn at openjdk.org Wed Feb 12 00:22:31 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 12 Feb 2025 00:22:31 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v5] In-Reply-To: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> 
References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: > Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. > > Added C++ static asserts to make sure no virtual methods are added in a future. > > Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. > > Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: Fix Minimal and Zero VM builds once more ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23533/files - new: https://git.openjdk.org/jdk/pull/23533/files/7d3dce0e..1d108349 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=03-04 Stats: 3 lines in 1 file changed: 3 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23533.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23533/head:pull/23533 PR: https://git.openjdk.org/jdk/pull/23533 From cjplummer at openjdk.org Wed Feb 12 03:06:15 2025 From: cjplummer at openjdk.org (Chris Plummer) Date: Wed, 12 Feb 2025 03:06:15 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v2] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Tue, 11 Feb 2025 23:55:46 GMT, Vladimir Kozlov wrote: > I think some wrapper subclasses for CodeBlob were kept because of `is*()` which were used only in `PStack` to print name. Why not use `getName()` for this purpose without big `if/else` there? Possibly getName() didn't exist when PStack was first written. 
It would be good if PStack not only included the type name as it does now, but also the actual name of the blob, which getName() would return. > An other purpose could be a place holder for additional information in a future which never come. Yes, and you also see that with the Observer registration and the `Type type = db.lookupType()` code, which are only needed if you are going to lookup fields of the subtypes, which most don't ever do, yet they all have this code. > Other wrapper provides information available in `CodeBlob`. Like `RuntimeStub. callerMustGCArguments()`. `_caller_must_gc_arguments` field is part of VM's `CodeBlob` class for some time now. Looks like I missed change in SA when did change in VM. Yeah, that's not working right for CodeBlob subtypes that are not RuntimeStubs. Easy to fix though. > So yes, feel free to clean this up. I will help with review. Ok. Let me see where things are at after you are done with the PR. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2652549878 From jbhateja at openjdk.org Wed Feb 12 09:13:17 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 12 Feb 2025 09:13:17 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v17] In-Reply-To: References: Message-ID: <1xQeG8IO8aJNUluyWTaz9cm2xmTKSNsZJMNhnicnm5s=.304de8b6-9bba-44db-9982-eddaf950a415@github.com> On Mon, 10 Feb 2025 21:23:28 GMT, Paul Sandoz wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Fixing typos > > An impressive and substantial change. I focused on the Java code, there are some small tweaks, presented in comments, we can make to the intrinsics to improve the expression of code, and it has no impact on the intrinsic implementation. Hi @PaulSandoz , Your comments have been addressed. 
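On the Float16 storage-type point (item 7 of the PR summary): the raw-shorts representation mirrors the binary16 conversions that have been available in java.lang.Float since JDK 20, independent of the incubating Float16 class. A quick illustration of how values round when squeezed into 16 bits:

```java
public class Binary16Demo {
    public static void main(String[] args) {
        // A binary16 value travels as a raw short; 1.5f is exactly
        // representable, so the round trip is lossless.
        short half = Float.floatToFloat16(1.5f);
        System.out.println(Float.float16ToFloat(half)); // 1.5
        // 0.1f is not representable in binary16 (10 significand bits),
        // so the conversion rounds to the nearest binary16 value.
        float rounded = Float.float16ToFloat(Float.floatToFloat16(0.1f));
        System.out.println(rounded); // ~0.09997559, not 0.1
    }
}
```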
------------- PR Comment: https://git.openjdk.org/jdk/pull/22754#issuecomment-2653071755 From psandoz at openjdk.org Wed Feb 12 14:49:27 2025 From: psandoz at openjdk.org (Paul Sandoz) Date: Wed, 12 Feb 2025 14:49:27 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v18] In-Reply-To: References: Message-ID: On Tue, 11 Feb 2025 06:32:56 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch adds C2 compiler support for various Float16 operations added by [PR#22128](https://github.com/openjdk/jdk/pull/22128) >> >> Following is the summary of changes included with this patch:- >> >> 1. Detection of various Float16 operations through inline expansion or pattern folding idealizations. >> 2. Float16 operations like add, sub, mul, div, max, and min are inferred through pattern folding idealization. >> 3. Float16 SQRT and FMA operation are inferred through inline expansion and their corresponding entry points are defined in the newly added Float16Math class. >> - These intrinsics receive unwrapped short arguments encoding IEEE 754 binary16 values. >> 5. New specialized IR nodes for Float16 operations, associated idealizations, and constant folding routines. >> 6. New Ideal type for constant and non-constant Float16 IR nodes. Please refer to [FAQs ](https://github.com/openjdk/jdk/pull/22754#issuecomment-2543982577)for more details. >> 7. Since Float16 uses short as its storage type, hence raw FP16 values are always loaded into general purpose register, but FP16 ISA generally operates over floating point registers, thus the compiler injects reinterpretation IR before and after Float16 operation nodes to move short value to floating point register and vice versa. >> 8. New idealization routines to optimize redundant reinterpretation chains. HF2S + S2HF = HF >> 9. X86 backend implementation for all supported intrinsics. >> 10. Functional and Performance validation tests. >> >> Kindly review the patch and share your feedback. 
>> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Review comments resolutions Looks good. I merged this PR with master, successfully (at the time) with no conflicts, and ran it through tier 1 to 3 testing, and there were no failures. ------------- Marked as reviewed by psandoz (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/22754#pullrequestreview-2612181239 From kvn at openjdk.org Wed Feb 12 16:28:32 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 12 Feb 2025 16:28:32 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v6] In-Reply-To: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: > Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in the Leyden AOT cache. This avoids the need to patch the hidden VPTR pointer to the class's virtual table. > > Added C++ static asserts to make sure no virtual methods are added in the future. > > Fixed/cleaned SA code that processes CodeBlob and its subclasses. Use the `CodeBlob::_kind` field value to determine the type of blob. 
> > Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: Fix Zero VM build ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23533/files - new: https://git.openjdk.org/jdk/pull/23533/files/1d108349..b09ddce6 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=04-05 Stats: 11 lines in 2 files changed: 7 ins; 1 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/23533.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23533/head:pull/23533 PR: https://git.openjdk.org/jdk/pull/23533 From jbhateja at openjdk.org Wed Feb 12 17:08:25 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 12 Feb 2025 17:08:25 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v18] In-Reply-To: References: Message-ID: On Wed, 12 Feb 2025 14:46:49 GMT, Paul Sandoz wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Review comments resolutions > > Looks good. I merged this PR with master, successfully (at the time) with no conflicts, and ran it through tier 1 to 3 testing and there were no failures. Thanks @PaulSandoz , @eme64 and @sviswa7 for your valuable feedback. 
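As the PR summary notes, the Float16 intrinsics receive unwrapped `short` arguments encoding IEEE 754 binary16 values. The conversion between that encoding and `float` has been available in the core JDK since JDK 20 via `Float.floatToFloat16`/`Float.float16ToFloat`; the following stand-alone sketch (illustrative only; it is not the intrinsified code path from the PR) shows a scalar binary16 max computed by widening to `float`:

```java
class Binary16Max {
    // Max of two IEEE 754 binary16 values stored in shorts, computed by
    // widening to float, comparing, and narrowing the result back.
    public static short maxFp16(short a, short b) {
        float fa = Float.float16ToFloat(a);
        float fb = Float.float16ToFloat(b);
        return Float.floatToFloat16(Math.max(fa, fb));
    }

    public static void main(String[] args) {
        short one = Float.floatToFloat16(1.0f);
        short two = Float.floatToFloat16(2.0f);
        System.out.println(Float.float16ToFloat(maxFp16(one, two))); // prints 2.0
    }
}
```

The intrinsified path described in the summary avoids exactly this kind of short-to-float round trip through general-purpose registers by keeping the value in a floating-point register between the S2HF/HF2S reinterpretation nodes.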
------------- PR Comment: https://git.openjdk.org/jdk/pull/22754#issuecomment-2654337191 From jbhateja at openjdk.org Wed Feb 12 17:08:28 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 12 Feb 2025 17:08:28 GMT Subject: Integrated: 8342103: C2 compiler support for Float16 type and associated scalar operations In-Reply-To: References: Message-ID: <0jFE4E2Aewb7aCN5nZrmV3Lz3SSsNSmhhUEiL9JQjMA=.c202afcf-340c-4fca-8a2a-778c7677fe1f@github.com> On Sun, 15 Dec 2024 18:05:02 GMT, Jatin Bhateja wrote: > Hi All, > > This patch adds C2 compiler support for various Float16 operations added by [PR#22128](https://github.com/openjdk/jdk/pull/22128) > > Following is the summary of changes included with this patch:- > > 1. Detection of various Float16 operations through inline expansion or pattern folding idealizations. > 2. Float16 operations like add, sub, mul, div, max, and min are inferred through pattern folding idealization. > 3. Float16 SQRT and FMA operation are inferred through inline expansion and their corresponding entry points are defined in the newly added Float16Math class. > - These intrinsics receive unwrapped short arguments encoding IEEE 754 binary16 values. > 5. New specialized IR nodes for Float16 operations, associated idealizations, and constant folding routines. > 6. New Ideal type for constant and non-constant Float16 IR nodes. Please refer to [FAQs ](https://github.com/openjdk/jdk/pull/22754#issuecomment-2543982577)for more details. > 7. Since Float16 uses short as its storage type, hence raw FP16 values are always loaded into general purpose register, but FP16 ISA generally operates over floating point registers, thus the compiler injects reinterpretation IR before and after Float16 operation nodes to move short value to floating point register and vice versa. > 8. New idealization routines to optimize redundant reinterpretation chains. HF2S + S2HF = HF > 9. X86 backend implementation for all supported intrinsics. > 10. 
Functional and Performance validation tests. > > Kindly review the patch and share your feedback. > > Best Regards, > Jatin This pull request has now been integrated. Changeset: 4b463ee7 Author: Jatin Bhateja URL: https://git.openjdk.org/jdk/commit/4b463ee70eceb94fdfbffa5c49dd58dcc6a6c890 Stats: 2855 lines in 56 files changed: 2788 ins; 0 del; 67 mod 8342103: C2 compiler support for Float16 type and associated scalar operations Co-authored-by: Paul Sandoz Co-authored-by: Bhavana Kilambi Co-authored-by: Joe Darcy Co-authored-by: Raffaello Giulietti Reviewed-by: psandoz, epeter, sviswanathan ------------- PR: https://git.openjdk.org/jdk/pull/22754 From kvn at openjdk.org Wed Feb 12 20:21:13 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 12 Feb 2025 20:21:13 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v6] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Wed, 12 Feb 2025 16:28:32 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Fix Zero VM build It is ready for re-review. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2654754643 From cjplummer at openjdk.org Thu Feb 13 02:36:16 2025 From: cjplummer at openjdk.org (Chris Plummer) Date: Thu, 13 Feb 2025 02:36:16 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v6] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: <070Dz3l6A_ZT20jprInpMdpeqE3gogKAmmpnCprr4j0=.3b4804dc-02d7-4aa6-af42-7ef076d4fe0d@github.com> On Wed, 12 Feb 2025 16:28:32 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Fix Zero VM build src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeBlob.java line 118: > 116: } > 117: > 118: public static Class getClassFor(Address addr) { Did you consider using a lookup table here that is indexed using the kind value? src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeBlob.java line 146: > 144: } > 145: } > 146: return null; Should this be an assert? src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeBlob.java line 213: > 211: > 212: public boolean isUncommonTrapBlob() { > 213: if (!VM.getVM().isServerCompiler()) return false; Why is the check needed? Why not just return the value `getKind() == UncommonTrapKind` result below? 
src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeCache.java line 95: > 93: } > 94: > 95: public CodeBlob createCodeBlobWrapper(Address cbAddr, Address start) { I think the use of the name "start" here is a carryover from `findBlobUnsafe(Address start)`. I find it a very misleading name. cbAddr points to the "start" of the blob. "start" points somewhere in the middle of the blob. In fact, callers of this API sometimes pass in findStart(addr) for cbAddr, which just adds to the confusion. Perhaps this is a good time to rename "start" to something else, although I can't come up with a good suggestion, but I think anything other than "start" would be an improvement. Maybe "pcAddr". That aligns with the "for PC=" message below. Or maybe just "ptr" which aligns with `createCodeBlobWrapper(findStart(ptr), ptr);` ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1953665953 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1953666268 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1953667349 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1953682557 From kvn at openjdk.org Thu Feb 13 03:43:17 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 13 Feb 2025 03:43:17 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v6] In-Reply-To: <070Dz3l6A_ZT20jprInpMdpeqE3gogKAmmpnCprr4j0=.3b4804dc-02d7-4aa6-af42-7ef076d4fe0d@github.com> References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> <070Dz3l6A_ZT20jprInpMdpeqE3gogKAmmpnCprr4j0=.3b4804dc-02d7-4aa6-af42-7ef076d4fe0d@github.com> Message-ID: <8VFudK82JuBbjj_s74lDlHd1TWurW8uiBbw2DutA-PU=.ec26075e-89ad-4caf-ae3f-f50e5407a5f6@github.com> On Thu, 13 Feb 2025 02:06:57 GMT, Chris Plummer wrote: >> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix Zero VM build > > 
src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeBlob.java line 118: > >> 116: } >> 117: >> 118: public static Class getClassFor(Address addr) { > > Did you consider using a lookup table here that is indexed using the kind value? Example please. > src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeBlob.java line 146: > >> 144: } >> 145: } >> 146: return null; > > Should this be an assert? I don't think we need it - the caller `CodeCache.createCodeBlobWrapper()` will throw `RuntimeException` when `null` is returned. > src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeBlob.java line 213: > >> 211: >> 212: public boolean isUncommonTrapBlob() { >> 213: if (!VM.getVM().isServerCompiler()) return false; > > Why is the check needed? Why not just return the value `getKind() == UncommonTrapKind` result below? `UncommonTrapKind` and `ExceptionKind` are not initialized for Client VM because corresponding `CodeBlobKind` values are not defined. See `CodeBlob.initialize()`. Their not initialized value will be 0 which matches `CodeBlobKind::None` value. Returning true in such case will be incorrect. > src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeCache.java line 95: > >> 93: } >> 94: >> 95: public CodeBlob createCodeBlobWrapper(Address cbAddr, Address start) { > > I think the use of the name "start" here is a carryover from `findBlobUnsafe(Address start)`. I find it a very misleading name. cbAddr points to the "start" of the blob. "start" points somewhere in the middle of the blob. In fact callers of this API somethimes pass in findStart(addr) for cbAddr, which just adds to the confusion. Perhaps this is a good time to rename "start" to something else, although I can't come up with a good suggestion, but I think anything other than "start" would be an improvement. Maybe "pcAddr". That aligns with the "for PC=" message below. 
Or maybe just "ptr" which aligns with `createCodeBlobWrapper(findStart(ptr), ptr);` `cbPc` with a comment explaining that it could be inside the code blob. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1953732919 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1953733212 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1953738572 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1953745389 From cjplummer at openjdk.org Thu Feb 13 05:22:14 2025 From: cjplummer at openjdk.org (Chris Plummer) Date: Thu, 13 Feb 2025 05:22:14 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v6] In-Reply-To: <8VFudK82JuBbjj_s74lDlHd1TWurW8uiBbw2DutA-PU=.ec26075e-89ad-4caf-ae3f-f50e5407a5f6@github.com> References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> <070Dz3l6A_ZT20jprInpMdpeqE3gogKAmmpnCprr4j0=.3b4804dc-02d7-4aa6-af42-7ef076d4fe0d@github.com> <8VFudK82JuBbjj_s74lDlHd1TWurW8uiBbw2DutA-PU=.ec26075e-89ad-4caf-ae3f-f50e5407a5f6@github.com> Message-ID: On Thu, 13 Feb 2025 03:26:19 GMT, Vladimir Kozlov wrote: >> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeBlob.java line 118: >> >>> 116: } >>> 117: >>> 118: public static Class getClassFor(Address addr) { >> >> Did you consider using a lookup table here that is indexed using the kind value? > > Example please.

    static Class[] wrapperClasses = new Class[Number_Of_Kinds];
    wrapperClasses[NMethodKind]   = NMethodBlob.class;
    wrapperClasses[BufferKind]    = BufferBlob.class;
    ...;
    wrapperClasses[SafepointKind] = SafepointBlob.class;

    CodeBlob cb = new CodeBlob(addr);
    return wrapperClasses[cb.getKind()];

>> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeBlob.java line 146: >> >>> 144: } >>> 145: } >>> 146: return null; >> >> Should this be an assert? 
> > I don't think we need it - the caller `CodeCache.createCodeBlobWrapper()` will throw `RuntimeException` when `null` is returned. I guess my real question is whether or not it can be considered normal behavior to return null. It seems it should never happen, which is why I was suggesting an assert. >> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeBlob.java line 213: >> >>> 211: >>> 212: public boolean isUncommonTrapBlob() { >>> 213: if (!VM.getVM().isServerCompiler()) return false; >> >> Why is the check needed? Why not just return the value `getKind() == UncommonTrapKind` result below? > > `UncommonTrapKind` and `ExceptionKind` are not initialized for Client VM because corresponding `CodeBlobKind` values are not defined. See `CodeBlob.initialize()`. > Their not initialized value will be 0 which matches `CodeBlobKind::None` value. Returning true in such case will be incorrect. Ok. Leaving UncommonTrapKind and ExceptionKind uninitialized seems a bit error prone. Perhaps they can be given some sort of INVALID value. >> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeCache.java line 95: >> >>> 93: } >>> 94: >>> 95: public CodeBlob createCodeBlobWrapper(Address cbAddr, Address start) { >> >> I think the use of the name "start" here is a carryover from `findBlobUnsafe(Address start)`. I find it a very misleading name. cbAddr points to the "start" of the blob. "start" points somewhere in the middle of the blob. In fact callers of this API somethimes pass in findStart(addr) for cbAddr, which just adds to the confusion. Perhaps this is a good time to rename "start" to something else, although I can't come up with a good suggestion, but I think anything other than "start" would be an improvement. Maybe "pcAddr". That aligns with the "for PC=" message below. Or maybe just "ptr" which aligns with `createCodeBlobWrapper(findStart(ptr), ptr);` > > `cbPc` with comment explaining that it could be inside code blob. That sounds fine. 
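The lookup-table idea sketched in the review exchange above can be written out as a small, self-contained program. The kind constants and wrapper classes below are placeholders (the real SA types live in `sun.jvm.hotspot.code`); the point is only that a kind-indexed array replaces a chain of per-type checks:

```java
class WrapperLookup {
    // Placeholder kind constants; the real values come from the VM's CodeBlobKind.
    public static final int NMETHOD_KIND   = 0;
    public static final int BUFFER_KIND    = 1;
    public static final int SAFEPOINT_KIND = 2;
    public static final int NUMBER_OF_KINDS = 3;

    // Placeholder stand-ins for the SA's CodeBlob wrapper subtypes.
    public static class NMethod {}
    public static class BufferBlob {}
    public static class SafepointBlob {}

    // Kind-indexed table, filled once; one array access replaces an if/else chain.
    private static final Class<?>[] WRAPPER_CLASSES = new Class<?>[NUMBER_OF_KINDS];
    static {
        WRAPPER_CLASSES[NMETHOD_KIND]   = NMethod.class;
        WRAPPER_CLASSES[BUFFER_KIND]    = BufferBlob.class;
        WRAPPER_CLASSES[SAFEPOINT_KIND] = SafepointBlob.class;
    }

    public static Class<?> getClassFor(int kind) {
        // Failing loudly here mirrors the "should this be an assert?" question:
        // an unknown kind indicates a bug, not a normal condition.
        if (kind < 0 || kind >= NUMBER_OF_KINDS || WRAPPER_CLASSES[kind] == null) {
            throw new IllegalArgumentException("unknown code blob kind: " + kind);
        }
        return WRAPPER_CLASSES[kind];
    }

    public static void main(String[] args) {
        System.out.println(getClassFor(SAFEPOINT_KIND).getSimpleName()); // prints SafepointBlob
    }
}
```

A table like this also sidesteps the uninitialized-kind pitfall discussed above: kinds that are not defined for a given VM configuration simply stay `null` and are rejected.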
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1953818292 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1953819796 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1953821968 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1953822595 From jrose at openjdk.org Thu Feb 13 07:44:19 2025 From: jrose at openjdk.org (John R Rose) Date: Thu, 13 Feb 2025 07:44:19 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v6] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Wed, 12 Feb 2025 16:28:32 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Fix Zero VM build I've read the code and it looks good. I find myself wishing for a few more comments to guide me, especially in knowing which methods to pay attention to, and which to ignore as "pure plumbing". The array of vptr-ptrs is the key element. It seems to work nicely. There are lots of regularizations here, which I enjoy. But the new code has (to me) distracting irregularities. Why define one Vptr as a struct and others as classes? Did we really regularize the names of all the print functions (they were irregular before)? I was glad to see lots of magic code deleted from SA. Although, having to look at SA at all is annoying! 
I noticed a lot of churn in "innocent bystander" client code that looks like this:

   p2i(_frame.pc()), decode_offset);
 - nm()->print_on(&ss);
 + nm()->print_on_v(&ss);
   nm()->method()->print_codes_on(&ss);

What is the client maintainer (or any casual reader) supposed to get from the "_v" suffix? I know we have made the "v/nv" distinction before, but it is rather obscure, not documented here. Is it described elsewhere in our code base? Our use of it here should be documented in codeBlob.hpp. Normally, we try to keep client APIs invariant while doing refactorings like this, so as to avoid touching all the client code. In this case, we have to use a new naming convention to distinguish all versions of (say) print_on: M. The implementation in each CB class K, which can be private if K::Vptr is a friend. P. The public API point, used outside of the CB classes, as well as inside. V. The name of the virtual function defined by each K::Vptr. I would expect P to have the "nice name" like print_on, not print_on_v, while the private method M would be print_on_impl or print_on_nv, and never called except from Vptr or other methods of the same name. But any convention will work, as long as it is documented and held to consistently. I'm sympathetic to both Andrew's call for macro-enforced regularity, and Vladimir's objection that macros make things hard to follow. If macros won't work for us here, let's define a documented pattern and stick to it closely, documenting our decisions as we go. 
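For readers following the naming discussion, the M/P/V layering can be sketched with a kind-indexed dispatch table. This is an illustrative Java rendering with placeholder names, not the actual implementation (which is C++ in `codeBlob.hpp`); in the real change the payoff is that instances carry a plain `_kind` tag instead of a compiler-generated vtable pointer that the AOT cache would have to patch:

```java
// V: the single virtual hook, implemented once per kind.
interface Vptr {
    void printOn(CodeBlob instance, StringBuilder out);
}

// One final class, no subtypes: behavior varies by the kind tag, not by vtable.
final class CodeBlob {
    static final int NMETHOD_KIND = 0;
    static final int BUFFER_KIND  = 1;

    // Kind-indexed table of stateless dispatchers; index must match the kind constants.
    private static final Vptr[] VTABLE = {
        (cb, out) -> cb.printOnNMethodImpl(out),  // entry for NMETHOD_KIND
        (cb, out) -> cb.printOnBufferImpl(out),   // entry for BUFFER_KIND
    };

    private final int kind;
    private final String name;

    CodeBlob(int kind, String name) { this.kind = kind; this.name = name; }

    // P: the public API point keeps the "nice name" and dispatches by kind.
    public void printOn(StringBuilder out) { VTABLE[kind].printOn(this, out); }

    // M: per-kind implementations, private to the class.
    private void printOnNMethodImpl(StringBuilder out) { out.append("nmethod ").append(name); }
    private void printOnBufferImpl(StringBuilder out)  { out.append("buffer blob ").append(name); }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        new CodeBlob(NMETHOD_KIND, "foo").printOn(sb);
        System.out.println(sb); // prints: nmethod foo
    }
}
```

Here `printOn` plays the role of P, the private `*Impl` methods play M, and `Vptr.printOn` plays V, matching the convention proposed above.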
------------- PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2655760868 From aboldtch at openjdk.org Thu Feb 13 08:32:21 2025 From: aboldtch at openjdk.org (Axel Boldt-Christmas) Date: Thu, 13 Feb 2025 08:32:21 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v6] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Wed, 12 Feb 2025 16:28:32 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Fix Zero VM build Similar to what @rose00 noted, I think the `_v` and `_nv` suffixes are unfortunate in the public API. Maybe we could add a protected `x_impl` containing the implementation, then dispatch to the correct one based on _kind, using the Vptr abstraction. And have the normal print_on method use this. We could let our leaf types directly call the specific implementation, not that I think that our print functions require compile time devirtualisation. There are many solutions here with their pros and cons. src/hotspot/share/code/codeBlob.hpp line 140: > 138: instance->print_value_on_nv(st); > 139: } > 140: }; I wonder why the base class is not abstract. 
AFAICT `print_value_on` is unreachable and `print_on` is only used by `DeoptimizationBlob::Vptr`, which also seems like a behavioural change, as before this patch calling `print_on` on a `DeoptimizationBlob` object would dispatch to `SingletonBlob::print_on`, not `CodeBlob::print_on`. Suggestion:

    struct Vptr {
      virtual void print_on(const CodeBlob* instance, outputStream* st) const = 0;
      virtual void print_value_on(const CodeBlob* instance, outputStream* st) const = 0;
    };

src/hotspot/share/code/codeBlob.hpp line 339: > 337: void print_value_on(outputStream* st) const; > 338: > 339: class Vptr : public CodeBlob::Vptr { I wonder if these should share the same type hierarchy as their container class. This would also solve the issue I noted in my other comment about not calling the correct `print_on`. Suggestion: class Vptr : public RuntimeBlob::Vptr { src/hotspot/share/code/codeBlob.hpp line 427: > 425: void print_value_on(outputStream* st) const; > 426: > 427: class Vptr : public CodeBlob::Vptr { Suggestion: class Vptr : public RuntimeBlob::Vptr { src/hotspot/share/code/codeBlob.hpp line 467: > 465: void print_value_on(outputStream* st) const; > 466: > 467: class Vptr : public CodeBlob::Vptr { Suggestion: class Vptr : public RuntimeBlob::Vptr { src/hotspot/share/code/codeBlob.hpp line 553: > 551: void print_value_on(outputStream* st) const; > 552: > 553: class Vptr : public CodeBlob::Vptr { This one specifically Suggestion: class Vptr : public SingletonBlob::Vptr { src/hotspot/share/code/codeBlob.hpp line 679: > 677: void print_value_on(outputStream* st) const; > 678: > 679: class Vptr : public CodeBlob::Vptr { Suggestion: class Vptr : public RuntimeBlob::Vptr { ------------- PR Review: https://git.openjdk.org/jdk/pull/23533#pullrequestreview-2614177723 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1954019308 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1954024528 PR Review Comment: 
https://git.openjdk.org/jdk/pull/23533#discussion_r1954028620 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1954028940 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1954027733 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1954029504 From dnsimon at openjdk.org Thu Feb 13 10:04:20 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Thu, 13 Feb 2025 10:04:20 GMT Subject: RFR: 8349977: JVMCIRuntime::_shared_library_javavm_id should be jlong Message-ID: The `JVMCIRuntime::_shared_library_javavm_id` field is initialized from a jlong in [libgraal](https://github.com/oracle/graal/blob/d544bbe3fe416d39e9e5b8fc645a67a36a5d7c07/substratevm/src/com.oracle.svm.core/src/com/oracle/svm/core/jni/functions/JNIInvocationInterface.java#L396-L397) and so it's C++ type in HotSpot should match. ------------- Commit messages: - converted JVMCIRuntime::_shared_library_javavm_id to jlong Changes: https://git.openjdk.org/jdk/pull/23610/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23610&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8349977 Stats: 7 lines in 3 files changed: 0 ins; 0 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/23610.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23610/head:pull/23610 PR: https://git.openjdk.org/jdk/pull/23610 From epeter at openjdk.org Thu Feb 13 11:39:15 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 13 Feb 2025 11:39:15 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com> Message-ID: On Mon, 10 Feb 2025 09:26:32 GMT, Galder Zamarre?o wrote: >> @eastig is helping with the results on aarch64, so I will verify the numbers in same way done below for x86_64 once he provides me with the 
results. >> >> Here is a summary of the benchmarking results I'm seeing on x86_64 (I will push an update that just merges the latest master shortly). >> >> First I will go through the results of `MinMaxVector`. This benchmark computes throughput by default so the higher the number the better. >> >> # MinMaxVector AVX-512 >> >> Following are results with AVX-512 instructions: >> >> Benchmark (probability) (range) (seed) (size) Mode Cnt Baseline Patch Units >> MinMaxVector.longClippingRange N/A 90 0 1000 thrpt 4 834.127 3688.961 ops/ms >> MinMaxVector.longClippingRange N/A 100 0 1000 thrpt 4 1147.010 3687.721 ops/ms >> MinMaxVector.longLoopMax 50 N/A N/A 2048 thrpt 4 1126.718 1072.812 ops/ms >> MinMaxVector.longLoopMax 80 N/A N/A 2048 thrpt 4 1070.921 1070.538 ops/ms >> MinMaxVector.longLoopMax 100 N/A N/A 2048 thrpt 4 510.483 1073.081 ops/ms >> MinMaxVector.longLoopMin 50 N/A N/A 2048 thrpt 4 935.658 1016.910 ops/ms >> MinMaxVector.longLoopMin 80 N/A N/A 2048 thrpt 4 1007.410 933.774 ops/ms >> MinMaxVector.longLoopMin 100 N/A N/A 2048 thrpt 4 536.582 1017.337 ops/ms >> MinMaxVector.longReductionMax 50 N/A N/A 2048 thrpt 4 967.288 966.945 ops/ms >> MinMaxVector.longReductionMax 80 N/A N/A 2048 thrpt 4 967.327 967.382 ops/ms >> MinMaxVector.longReductionMax 100 N/A N/A 2048 thrpt 4 849.689 967.327 ops/ms >> MinMaxVector.longReductionMin 50 N/A N/A 2048 thrpt 4 966.323 967.275 ops/ms >> MinMaxVector.longReductionMin 80 N/A N/A 2048 thrpt 4 967.340 967.228 ops/ms >> MinMaxVector.longReductionMin 100 N/A N/A 2048 thrpt 4 880.921 967.233 ops/ms >> >> >> ### `longReduction[Min|Max]` performance improves slightly when probability is 100 >> >> Without the patch the code uses compare instructions: >> >> >> 7.83% ???? ???? ? 0x00007f4f700fb305: imulq $0xb, 0x20(%r14, %r8, 8), %rdi >> ???? ???... > >> At 100% probability baseline fails to vectorize because it observes a control flow. 
This control flow is not the one you see in min/max implementations, but this is one added by HotSpot as a result of the JIT profiling. It observes that one branch is always taken so it optimizes for that, and adds a branch for the uncommon case where the branch is not taken. > > I've dug further into this to try to understand how the baseline hotspot code works, and the explanation above is not entirely correct. Let's look at the IR differences between say 100% vs 80% branch situations. > > At branch 80% you see: > > 1115 CountedLoop === 1115 598 463 [[ 1101 1115 1116 1118 451 594 ]] inner stride: 2 main of N1115 strip mined !orig=[599],[590],[307] !jvms: MinMaxVector::longLoopMax @ bci:10 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) > > 692 LoadL === 1083 1101 393 [[ 747 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=9; #long (does not depend only on test, unknown control) !orig=[395] !jvms: MinMaxVector::longLoopMax @ bci:26 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) > 651 LoadL === 1095 1101 355 [[ 747 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=9; #long (does not depend only on test, unknown control) !orig=[357] !jvms: MinMaxVector::longLoopMax @ bci:20 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) > 747 MaxL === _ 651 692 [[ 451 ]] !orig=[608],[416] !jvms: Math::max @ bci:11 (line 2037) MinMaxVector::longLoopMax @ bci:27 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) > > 451 StoreL === 1115 1101 449 747 [[ 1116 454 911 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=9; Memory: @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=9; !orig=1124 !jvms: MinMaxVector::longLoopMax @ bci:30 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ 
bci:19 (line 124) > > 594 CountedLoopEnd === 1115 593 [[ 1123 463 ]] [lt] P=0.999731, C=780799.000000 !orig=[462] !jvms: MinMaxVector::longLoopMax @ bci:7 (line 235) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) > > > You see the counted loop with the LoadL for array loads and MaxL consuming those. The StoreL is for array assignment (I think). > > At branch 100% you see: > > > ... @galderz Thanks for all the explanations, that's really helpful ? **Discussion** - AVX512: only imprivements. - Expecially with probability 100, where before we used the bytecode, which would then create an `unstable_if` with uncommon trap. That meant we could not re-discover the CMove / Max later in the IR. Now that we never inline the bytecode, and just intrinsify directly, we can use `vpmax` and that is faster. - Ah, maybe that was all incorrect, though it sounded reasonable. You seem to suggest that we actually did use to inline both branches, but that the issue was that `PhaseIdealLoop::conditional_move` does not like extreme probabilities, and so it did not convert 100% cases to CMove, and so it did not use to vectorize. Right. Getting the probability cutoff just right it a little tricky there, and the precise number can seem strange. But that's a discussion for another day. - The reduction case is only improved slightly... at least. Maybe we can further improve the throughput with [this](https://bugs.openjdk.org/browse/JDK-8345245) later on. - AVX2: mixed results - `longReductionMax/Min`: vector max / min is not implemented. We should investigate why. - It seems like the `MaxVL` and `MinVL` (e.g. `vpmaxsq`) instructions are only implemented directly for AVX512, see [this](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#ig_expand=4669,2611&text=max_epi64). - As you suggested @galderz we could consider implementing it via `cmove` in the backend for `AVX2` and maybe lower. Maybe we can talk with @jatin-bhateja about this. 
That would probably already be worth it on its own, in a separate RFE. Because I would suspect it could give speedup in the non 100% cases as well. Maybe this would even have to be an RFE that makes it in first, so we don't have regressions here? - But even still: just intfinsifying should not get us a regression, because there will always be cases where the auto-vectorizer fails, and so the scalar code should not be slower with your patch than on master, right? So we need to investigate this scalar issue as well. - VectorReduction2.WithSuperword on AVX-512 - `long[Min|Max]Simple performance drops considerably`. Yes, this case is not yet supposed to vectorize, I'm working on that - it is the issue with "simple" reductions, i.e. those that do no work other than reduce. Our current reduction heuristic thinks these are not profitable to vectorize - but that is wrong in almost all cases. You even filed an issue for that a while back ;) see https://bugs.openjdk.org/browse/JDK-8345044 and related issues. We could bite the bullet on this, knowing that I'm working on it and it will probably fix that issue, or we just wait a little here. Let's discuss. - VectorReduction2.NoSuperword on AVX-512 machine - Hmm, ok. So we seem to realize that the scalar case is slower with your patch in some cases, because now we have a `cmove` on the critical path, and previously we could just predict the branches, which was faster. Interesting that the number of other instructions has an effect here as well, you seem to see a speedup with the "big" benchmarks, but the "small" and "dot" benchmarks are slower. This is surprising. It would be great if we understood why it behaves this way. **Summary** Wow, things are more complicated than I would have thought, I hope you are not too discouraged ? We seem to have these issues, maybe there are more: - AVX2 does not have long-vector-min/max implemented. That can be done in a separate RFE. 
- Simple reductions do not vectorize, a known issue, see https://bugs.openjdk.org/browse/JDK-8345044, I'm working on that. - Scalar reductions are slower with your patch for extreme probabilities. Before, they were done with branches, and branch prediction was fast. Now with cmove or max instructions, the critical path is longer, and that makes things slow. Maybe this could be alleviated by reordering / reassociating the reduction path, see [JDK-8345245](https://bugs.openjdk.org/browse/JDK-8345245). Alternatively, we could convert the `cmove` back to a branch, but for that we would probably need to know the branching probability, which we now do not have any more, right? Tricky. This seems to be the real issue we need to address and discuss. @galderz What do you think? ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2656328729 From epeter at openjdk.org Thu Feb 13 11:49:18 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 13 Feb 2025 11:49:18 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com> Message-ID: On Mon, 10 Feb 2025 09:26:32 GMT, Galder Zamarreño wrote: >> @eastig is helping with the results on aarch64, so I will verify the numbers in same way done below for x86_64 once he provides me with the results. >> >> Here is a summary of the benchmarking results I'm seeing on x86_64 (I will push an update that just merges the latest master shortly). >> >> First I will go through the results of `MinMaxVector`. This benchmark computes throughput by default so the higher the number the better.
>> >> # MinMaxVector AVX-512 >> >> Following are results with AVX-512 instructions: >> >> Benchmark (probability) (range) (seed) (size) Mode Cnt Baseline Patch Units >> MinMaxVector.longClippingRange N/A 90 0 1000 thrpt 4 834.127 3688.961 ops/ms >> MinMaxVector.longClippingRange N/A 100 0 1000 thrpt 4 1147.010 3687.721 ops/ms >> MinMaxVector.longLoopMax 50 N/A N/A 2048 thrpt 4 1126.718 1072.812 ops/ms >> MinMaxVector.longLoopMax 80 N/A N/A 2048 thrpt 4 1070.921 1070.538 ops/ms >> MinMaxVector.longLoopMax 100 N/A N/A 2048 thrpt 4 510.483 1073.081 ops/ms >> MinMaxVector.longLoopMin 50 N/A N/A 2048 thrpt 4 935.658 1016.910 ops/ms >> MinMaxVector.longLoopMin 80 N/A N/A 2048 thrpt 4 1007.410 933.774 ops/ms >> MinMaxVector.longLoopMin 100 N/A N/A 2048 thrpt 4 536.582 1017.337 ops/ms >> MinMaxVector.longReductionMax 50 N/A N/A 2048 thrpt 4 967.288 966.945 ops/ms >> MinMaxVector.longReductionMax 80 N/A N/A 2048 thrpt 4 967.327 967.382 ops/ms >> MinMaxVector.longReductionMax 100 N/A N/A 2048 thrpt 4 849.689 967.327 ops/ms >> MinMaxVector.longReductionMin 50 N/A N/A 2048 thrpt 4 966.323 967.275 ops/ms >> MinMaxVector.longReductionMin 80 N/A N/A 2048 thrpt 4 967.340 967.228 ops/ms >> MinMaxVector.longReductionMin 100 N/A N/A 2048 thrpt 4 880.921 967.233 ops/ms >> >> >> ### `longReduction[Min|Max]` performance improves slightly when probability is 100 >> >> Without the patch the code uses compare instructions: >> >> >> 7.83% ???? ???? ? 0x00007f4f700fb305: imulq $0xb, 0x20(%r14, %r8, 8), %rdi >> ???? ???... > >> At 100% probability baseline fails to vectorize because it observes a control flow. This control flow is not the one you see in min/max implementations, but this is one added by HotSpot as a result of the JIT profiling. It observes that one branch is always taken so it optimizes for that, and adds a branch for the uncommon case where the branch is not taken. 
> > I've dug further into this to try to understand how the baseline hotspot code works, and the explanation above is not entirely correct. Let's look at the IR differences between say 100% vs 80% branch situations. > > At branch 80% you see: > > 1115 CountedLoop === 1115 598 463 [[ 1101 1115 1116 1118 451 594 ]] inner stride: 2 main of N1115 strip mined !orig=[599],[590],[307] !jvms: MinMaxVector::longLoopMax @ bci:10 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) > > 692 LoadL === 1083 1101 393 [[ 747 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=9; #long (does not depend only on test, unknown control) !orig=[395] !jvms: MinMaxVector::longLoopMax @ bci:26 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) > 651 LoadL === 1095 1101 355 [[ 747 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=9; #long (does not depend only on test, unknown control) !orig=[357] !jvms: MinMaxVector::longLoopMax @ bci:20 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) > 747 MaxL === _ 651 692 [[ 451 ]] !orig=[608],[416] !jvms: Math::max @ bci:11 (line 2037) MinMaxVector::longLoopMax @ bci:27 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) > > 451 StoreL === 1115 1101 449 747 [[ 1116 454 911 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=9; Memory: @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=9; !orig=1124 !jvms: MinMaxVector::longLoopMax @ bci:30 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) > > 594 CountedLoopEnd === 1115 593 [[ 1123 463 ]] [lt] P=0.999731, C=780799.000000 !orig=[462] !jvms: MinMaxVector::longLoopMax @ bci:7 (line 235) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124) > > > You see the counted loop 
with the LoadL for array loads and MaxL consuming those. The StoreL is for array assignment (I think). > > At branch 100% you see: > > > ... @galderz How sure are we that intrinsifying directly is really the right approach? Maybe the approach via `PhaseIdealLoop::conditional_move` where we know the branching probability is a better one. Though of course knowing the branching probability is no perfect heuristic for how good branch prediction is going to be, but it is at least something. So I'm wondering if there could be a different approach that sees all the wins you get here, without any of the regressions? If we are just interested in better vectorization: the current issue is that the auto-vectorizer cannot handle CFG, i.e. we do not yet do if-conversion. But if we had if-conversion, then the inlined CFG of min/max would just be converted to vector CMove (or vector min/max where available) at that point. We can take the branching probabilities into account, just like `PhaseIdealLoop::conditional_move` does - if that is necessary. Of course if-conversion is far away, and we will encounter a lot of issues with branch prediction etc, so I'm scared we might never get there - but I want to try ;) Do we see any other wins with your patch, that are not due to vectorization, but just scalar code?
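The if-conversion limitation described here can be made concrete with a small sketch (illustrative code of mine, not taken from the patch or from the `MinMaxVector` benchmark):

```java
import java.util.Arrays;

public class MaxLoops {
    // Branchy form: the ternary parses to real control flow (If/Region
    // nodes) in the loop body, which SuperWord currently cannot
    // if-convert, so the loop stays scalar.
    static void maxBranchy(long[] a, long[] b, long[] c) {
        for (int i = 0; i < c.length; i++) {
            c[i] = a[i] > b[i] ? a[i] : b[i];
        }
    }

    // Intrinsic form: Math.max(long, long) becomes a single MaxL node,
    // a straight-line body that can map to a vector max where the
    // hardware provides one.
    static void maxIntrinsic(long[] a, long[] b, long[] c) {
        for (int i = 0; i < c.length; i++) {
            c[i] = Math.max(a[i], b[i]);
        }
    }

    public static void main(String[] args) {
        long[] a = {3, -7, 5}, b = {2, 9, 5};
        long[] c1 = new long[3], c2 = new long[3];
        maxBranchy(a, b, c1);
        maxIntrinsic(a, b, c2);
        System.out.println(Arrays.equals(c1, c2)); // prints "true"
    }
}
```

Both loops compute the same result; they differ only in the IR shape the JIT sees, which is the crux of the vectorization discussion.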
@galderz Maybe we can discuss this offline at some point as well :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2656350896 PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2656351785 From yzheng at openjdk.org Thu Feb 13 12:35:15 2025 From: yzheng at openjdk.org (Yudi Zheng) Date: Thu, 13 Feb 2025 12:35:15 GMT Subject: RFR: 8349977: JVMCIRuntime::_shared_library_javavm_id should be jlong In-Reply-To: References: Message-ID: On Thu, 13 Feb 2025 09:59:41 GMT, Doug Simon wrote: > The `JVMCIRuntime::_shared_library_javavm_id` field is initialized from a jlong in [libgraal](https://github.com/oracle/graal/blob/d544bbe3fe416d39e9e5b8fc645a67a36a5d7c07/substratevm/src/com.oracle.svm.core/src/com/oracle/svm/core/jni/functions/JNIInvocationInterface.java#L396-L397) and so it's C++ type in HotSpot should match. LGTM ------------- Marked as reviewed by yzheng (Committer). PR Review: https://git.openjdk.org/jdk/pull/23610#pullrequestreview-2614845493 From dnsimon at openjdk.org Thu Feb 13 12:50:15 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Thu, 13 Feb 2025 12:50:15 GMT Subject: RFR: 8349977: JVMCIRuntime::_shared_library_javavm_id should be jlong In-Reply-To: References: Message-ID: On Thu, 13 Feb 2025 09:59:41 GMT, Doug Simon wrote: > The `JVMCIRuntime::_shared_library_javavm_id` field is initialized from a jlong in [libgraal](https://github.com/oracle/graal/blob/d544bbe3fe416d39e9e5b8fc645a67a36a5d7c07/substratevm/src/com.oracle.svm.core/src/com/oracle/svm/core/jni/functions/JNIInvocationInterface.java#L396-L397) and so it's C++ type in HotSpot should match. 
Passes the openjdk-pr-canary: https://github.com/dougxc/openjdk-pr-canary/blob/master/tested-prs/23610/b7a38951a54ff4c1186a3682f717805822575ea8.json ------------- PR Comment: https://git.openjdk.org/jdk/pull/23610#issuecomment-2656499655 From roland at openjdk.org Thu Feb 13 16:46:16 2025 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 13 Feb 2025 16:46:16 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com> Message-ID: On Thu, 13 Feb 2025 11:46:35 GMT, Emanuel Peter wrote: > Do we see any other wins with your patch, that are not due to vectorization, but just scalar code? I think there are some. The current transformation from the parsed version of min/max to a conditional move to a `Max`/`Min` node depends on the conditional move transformation which has its own set of heuristics and while it happens on simple test cases, that's not necessarily the case on all code shapes. I don't think we want to trust it too much. With the intrinsic, the type of the min or max can be narrowed down in a way it can't be when the code includes control flow or a conditional move. That in turn, once types have propagated, could cause some constant to appear and could be a significant win. The `Min`/`Max` nodes are floating nodes. They can hoist out of loop and common reliably in ways that are not guaranteed otherwise.
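The type-narrowing argument above can be illustrated with a sketch (an editorial example, not code from the PR; the constant-folding outcome is what the argument predicts, not something measured here):

```java
public class ClampDemo {
    // Clamp to [0, 255]: with intrinsified min/max these become two
    // floating MinL/MaxL nodes whose result type the compiler can
    // narrow to the range [0, 255]. A downstream expression such as
    // (clamp(v) >>> 8) could then fold to the constant 0. With inlined
    // branches, the control flow hides that range from type propagation.
    static long clamp(long v) {
        return Math.max(0L, Math.min(v, 255L));
    }

    public static void main(String[] args) {
        System.out.println(clamp(300)); // 255
        System.out.println(clamp(-5));  // 0
        System.out.println(clamp(42));  // 42
    }
}
```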
------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2657176312 From kvn at openjdk.org Thu Feb 13 17:05:05 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 13 Feb 2025 17:05:05 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v6] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> <070Dz3l6A_ZT20jprInpMdpeqE3gogKAmmpnCprr4j0=.3b4804dc-02d7-4aa6-af42-7ef076d4fe0d@github.com> <8VFudK82JuBbjj_s74lDlHd1TWurW8uiBbw2DutA-PU=.ec26075e-89ad-4caf-ae3f-f50e5407a5f6@github.com> Message-ID: On Thu, 13 Feb 2025 05:14:59 GMT, Chris Plummer wrote: >> Example please. > > static Class[] wrapperClasses = new Class[Number_Of_Kinds]; > wrapperClasses[NMethodKind] = NMethodBlob.class; > wrapperClasses[BufferKind] = BufferBlob.class; > ...; > wrapperClasses[SafepointKind] = SafepointBlob.class; > > > > CodeBlob cb = new CodeBlob(addr); > return wrapperClasses[cb.getKind()]; Done. >> I don't think we need it - the caller `CodeCache.createCodeBlobWrapper()` will throw `RuntimeException` when `null` is returned. > > I guess my real question is whether or not it can be considered normal behavior to return null. It seems it should never happen, which is why I was suggesting an assert. With your suggested `wrapperClasses[]` we will get an OOB exception. No need for a separate assert. >> `UncommonTrapKind` and `ExceptionKind` are not initialized for Client VM because corresponding `CodeBlobKind` values are not defined. See `CodeBlob.initialize()`. >> Their uninitialized value will be 0, which matches `CodeBlobKind::None` value. Returning true in such a case will be incorrect. > > Ok. Leaving UncommonTrapKind and ExceptionKind uninitialized seems a bit error prone. Perhaps they can be given some sort of INVALID value. Done. Initialized them to `Number_Of_Kinds + 1`.
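The kind-indexed lookup converged on above might look roughly like this as standalone Java (an editorial paraphrase with invented names such as `wrapperFor`; the actual SA classes differ):

```java
public class BlobWrappers {
    // Mirrors CodeBlobKind ordinals on the VM side (illustrative subset).
    static final int NMETHOD_KIND = 0, BUFFER_KIND = 1, NUMBER_OF_KINDS = 2;

    static class CodeBlob {}
    static class NMethodBlob extends CodeBlob {}
    static class BufferBlob extends CodeBlob {}

    // One wrapper class per kind. An out-of-range kind fails fast with
    // ArrayIndexOutOfBoundsException, which is why no separate assert
    // is needed.
    static final Class<?>[] wrapperClasses = new Class<?>[NUMBER_OF_KINDS];
    static {
        wrapperClasses[NMETHOD_KIND] = NMethodBlob.class;
        wrapperClasses[BUFFER_KIND] = BufferBlob.class;
    }

    static Class<?> wrapperFor(int kind) {
        return wrapperClasses[kind]; // throws AIOOBE on unknown kinds
    }
}
```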
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1954886028 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1954890522 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1954891616 From kvn at openjdk.org Thu Feb 13 17:05:04 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 13 Feb 2025 17:05:04 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v7] In-Reply-To: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: <5LGcbNB2_MigrbHGKV3CY8e6z-1iioFUuiSvTU8-lNY=.af273d17-6ab5-4b12-ae41-e6900494b5ee@github.com> > Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. > > Added C++ static asserts to make sure no virtual methods are added in a future. > > Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. > > Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp Vladimir Kozlov has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains eight additional commits since the last revision: - Update SA based on comments - Merge branch 'master' into 8349088 - Fix Zero VM build - Fix Minimal and Zero VM builds once more - Fix Minimal and Zero VM builds again - Add CodeBlob proxy vtable - Fix Zero and Minimal VM builds - 8349088: De-virtualize Codeblob and nmethod ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23533/files - new: https://git.openjdk.org/jdk/pull/23533/files/b09ddce6..515495b2 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=05-06 Stats: 11482 lines in 618 files changed: 7914 ins; 1738 del; 1830 mod Patch: https://git.openjdk.org/jdk/pull/23533.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23533/head:pull/23533 PR: https://git.openjdk.org/jdk/pull/23533 From kvn at openjdk.org Thu Feb 13 17:14:59 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 13 Feb 2025 17:14:59 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v8] In-Reply-To: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: > Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. > > Added C++ static asserts to make sure no virtual methods are added in a future. > > Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. 
> > Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: rename SA argument ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23533/files - new: https://git.openjdk.org/jdk/pull/23533/files/515495b2..61fdee68 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=06-07 Stats: 6 lines in 1 file changed: 2 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/23533.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23533/head:pull/23533 PR: https://git.openjdk.org/jdk/pull/23533 From kvn at openjdk.org Thu Feb 13 17:14:59 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 13 Feb 2025 17:14:59 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v6] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> <070Dz3l6A_ZT20jprInpMdpeqE3gogKAmmpnCprr4j0=.3b4804dc-02d7-4aa6-af42-7ef076d4fe0d@github.com> <8VFudK82JuBbjj_s74lDlHd1TWurW8uiBbw2DutA-PU=.ec26075e-89ad-4caf-ae3f-f50e5407a5f6@github.com> Message-ID: On Thu, 13 Feb 2025 05:19:48 GMT, Chris Plummer wrote: >> `cbPc` with comment explaining that it could be inside code blob. > > That sounds fine. done ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1954906986 From cjplummer at openjdk.org Thu Feb 13 17:14:59 2025 From: cjplummer at openjdk.org (Chris Plummer) Date: Thu, 13 Feb 2025 17:14:59 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v8] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Mon, 10 Feb 2025 16:57:18 GMT, Vladimir Kozlov wrote: >>> What is the reason for switching from the virtualConstructor/hashMap approach to using getClassFor()? 
The hashmap is the model for JavaThread, MetaData, and CollectedHeap subtypes. >> >> I don't need any more mapping from CodeBlob class to corresponding virtual table name which does not exist anymore. `CodeBlob::_kind` field's value is used to determine which class should be used. >> >> I think `hashMap` is overkill here. I can construct array `Class cbClasses[]` and use `cbClasses[CodeBlob::_kind]` instead of `if/else` in `getClassFor`. But I would still need to check for unknown value of `CodeBlob::_kind` somehow. > >> impact on things like the "findpc" functionality > > Do you mean `findpc()` function in VM which is used in debugger? Nothing should be changed for it. > It calls `os::print_location()` which calls `CodeBlob::dump_for_addr(addr, st, verbose);`: > https://github.com/openjdk/jdk/blob/master/src/hotspot/share/runtime/os.cpp#L1278 Actually I was referring to the clhsdb findpc command, which uses PointerFinder, but actually that should be ok because it special cases the codecache and knows how to find CodeBlobs in it. It's the clhsdb "inspect" command that will no longer be able to identify the type for an address that points to the start of a CodeBlob. This is true of any address that points to the start of a hotspot C++ object that does not have a vtable, or is not declared in vmstructs. So it's not a new issue, but is just adding more types to the list that "inspect" won't figure out. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1954906641 From epeter at openjdk.org Thu Feb 13 17:16:22 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 13 Feb 2025 17:16:22 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com> Message-ID: On Thu, 13 Feb 2025 16:43:22 GMT, Roland Westrelin wrote: > The current transformation from the parsed version of min/max to a conditional move to a Max/Min node depends on the conditional move transformation which has its own set of heuristics and while it happens on simple test cases, that's not necessarily the case on all code shapes. I don't think we want to trust it too much. Well, actually people have tried to improve the conditional move transformation, and it is really really difficult. It's hard not to get regressions. I'm wondering how much easier it is for min / max. Maybe we have similar limitations, especially with predicting how well branch prediction performs. You are probably right about type propagation and `Min / Max` being floating nodes. @rwestrel What do you think about the regressions in the scalar cases of this patch?
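The branch-prediction-versus-critical-path trade-off behind these scalar regressions can be seen in a reduction shape like this (an illustrative sketch, not the actual `VectorReduction2` benchmark code):

```java
public class ReductionMax {
    // Branchy form: 'acc' only updates when a new maximum appears.
    // With extreme probabilities the branch is almost never taken and
    // the predictor makes the loop-carried update nearly free.
    static long maxBranchy(long[] a) {
        long acc = Long.MIN_VALUE;
        for (long v : a) {
            if (v > acc) acc = v;
        }
        return acc;
    }

    // Intrinsic/cmove form: every iteration executes a max on the
    // loop-carried dependency chain, lengthening the critical path
    // regardless of how predictable the data is.
    static long maxIntrinsic(long[] a) {
        long acc = Long.MIN_VALUE;
        for (long v : a) {
            acc = Math.max(acc, v);
        }
        return acc;
    }
}
```

Both methods return the same result; the performance difference comes purely from how the update is executed each iteration.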
------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2657253439 From never at openjdk.org Thu Feb 13 17:22:13 2025 From: never at openjdk.org (Tom Rodriguez) Date: Thu, 13 Feb 2025 17:22:13 GMT Subject: RFR: 8349977: JVMCIRuntime::_shared_library_javavm_id should be jlong In-Reply-To: References: Message-ID: On Thu, 13 Feb 2025 09:59:41 GMT, Doug Simon wrote: > The `JVMCIRuntime::_shared_library_javavm_id` field is initialized from a jlong in [libgraal](https://github.com/oracle/graal/blob/d544bbe3fe416d39e9e5b8fc645a67a36a5d7c07/substratevm/src/com.oracle.svm.core/src/com/oracle/svm/core/jni/functions/JNIInvocationInterface.java#L396-L397) and so it's C++ type in HotSpot should match. Marked as reviewed by never (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/23610#pullrequestreview-2615740718 From jrose at openjdk.org Thu Feb 13 17:25:13 2025 From: jrose at openjdk.org (John R Rose) Date: Thu, 13 Feb 2025 17:25:13 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v8] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Thu, 13 Feb 2025 17:14:59 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > rename SA argument One related idea: The Vptr classes seem to be regular enough to be templated. 
That is, one class body, instantiated with a template argument for each code blob type (and probably another for the enum). That would remove some of the distracting per-class boilerplate. Something like: template class Vptr_Impl : public Vptr { override void print_on(const CodeBlob* instance, outputStream* st) const { assert(instance->kind() == Tkind, "sanity"); ((const CB_T*)instance)->print_on_impl(st); } ... override bool assert_sane(const CodeBlob* instance) { assert(instance->kind() == Tkind, ""); return true; } }; class CodeBlob { public: final Vptr* vptr() const { Vptr* vptr = vptr_array[_kind]; assert(vptr->assert_sane(this), "correct array element"); return vptr; } final void print_on(outputStream* st) const { vptr()->print_on(this, st); } }; Then: const Vptr* array[] = { &Vptr_Impl(), ... &Vptr_Impl(), ... }; The array could be filled by a macro that tracks the enum members; I like that as a small job for macros (no code in it). Then: class UncommonTrapBlob : public OtherBlob { protected: // impl "M" method is not public void print_on_impl(outputStream* st) const { OtherBlob::print_on_impl(st); st->print("my field = %d", _my_field); } // Vptr needs to call impl method friend class Vptr_Impl; // this might break down, so make it all public in the end }; I don't see any reason the Vptr subclasses need to be related in any more detail as subs or supers. Well, C++ is a bag of surprises, so there are probably several reasons the above sketch is wrong. But something like it might add a little more readability and predictability to the code.
------------- PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2657274388 From kvn at openjdk.org Thu Feb 13 17:29:14 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 13 Feb 2025 17:29:14 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v8] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Thu, 13 Feb 2025 17:22:18 GMT, John R Rose wrote: >> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> rename SA argument > > One related idea: The Vptr classes seem to be regular enough to be templated. That is, one class body, instantiated with a template argument for each code blob type (and probably another for the enum). That would remove some of the distracting per-class boilerplate. Something like: > > > template > class Vptr_Impl : public Vptr { > override void print_on(const CodeBlob* instance, outputStream* st) const { > assert(instance->kind() == Tkind, "sanity"); > ((const CB_T*)instance)->print_on_impl(st); > } > ? > override bool assert_sane(cosnt CodeBlob* instance) { > assert(instance->kind() == Tkind, ""); > return true; > } > }; > > class CodeBlob { > public: > final Vptr* vptr() const { > Vptr* vptr = vptr_array[_kind]; > assert(vptr->assert_sant(this), "correct array element"); > return vptr; > } > final void print_on(outputStream* st) const { > vptr()->print_on(this, st); > } > }; > > > Then: > > > const Vptr* array[] = { > &Vptr_Impl(), > ... > &Vptr_Impl(), > ... > }; > > > The array could be filled by a macro that tracks the enum members; I like that as a small job for macros (no code in it). 
> > Then: > > > class UncommonTrapBlob : public OtherBlob { > protected: // impl "M" method is not public > void print_on_impl(outputStream* st) const { > OtherBlob::print_on_impl(st); > st->print("my field = %d", _my_field); > } > // Vptr needs to call impl method > friend class Vptr_Impl; // this might break down, so make it all public in the end > }; > > > I don't see any reason the Vptr subclasses need to be related in any more detail as subs or supers. > > Well, C++ is a bag of surprises, so there are probably several reasons the above sketch is wrong. But something like it might add a little more readability and predictability to the code. Thank you, @rose00 and @xmas92, for review and suggestions. Let me say it first - printing code for code blobs and nmethod is a big mess. It requires a separate big change to clean it up. For example, I have to go through CodeBlob's virtual dispatch `print_value_on_v()` for nmethod because some sets of `nmethod::print*()` are defined only in debug VM: [nmethod.hpp#L919](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/nmethod.hpp#L919) Then `nmethod` has more mess which requires C++ trickery because it does not follow the print API in CodeBlob: void print(outputStream* st) const; // need to re-define this from CodeBlob else the overload hides it void print_on(outputStream* st) const override { CodeBlob::print_on(st); } void print_on(outputStream* st, const char* msg) const; ------------- PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2657282969 From kvn at openjdk.org Thu Feb 13 17:37:19 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 13 Feb 2025 17:37:19 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v8] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Thu, 13 Feb 2025 17:14:59 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring
in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > rename SA argument Saying that, I agree that I need to add comments explaining the printing API and how the Vptr class will work. I will work on @xmas92's suggestions and look into using `_impl`. I will try to look at the templates @rose00 suggested, but I don't want to complicate the code just for a few print methods. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2657303967 From kvn at openjdk.org Thu Feb 13 18:04:17 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 13 Feb 2025 18:04:17 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v8] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: <3RrosS3Q-iEBqaD4hVGMfjY2hDGLqwWwSUqgT0Za1k4=.1e32f3f0-6677-4082-b100-ce9b4603ec80@github.com> On Thu, 13 Feb 2025 17:14:59 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob.
>> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > rename SA argument > AFAICT `print_value_on` is unreachable It is reachable in product VM when `print_value_on_v()` is called for `nmethod` which does not have `print_value_on()` in product VM. Which can be solved by adding simple `nmethod::print_value_on()` for product VM but it will change current behavior. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2657354310 From cjplummer at openjdk.org Thu Feb 13 19:31:15 2025 From: cjplummer at openjdk.org (Chris Plummer) Date: Thu, 13 Feb 2025 19:31:15 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v8] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Thu, 13 Feb 2025 17:14:59 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > rename SA argument src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeCache.java line 97: > 95: // cbAddr - address of a code blob > 96: // cbPC - address inside of a code blob > 97: public CodeBlob createCodeBlobWrapper(Address cbAddr, Address cbPC) { Can you change findBlobUnsafe() above also? That's where the naming problem originated. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1955098013 From dnsimon at openjdk.org Thu Feb 13 19:38:19 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Thu, 13 Feb 2025 19:38:19 GMT Subject: Integrated: 8349977: JVMCIRuntime::_shared_library_javavm_id should be jlong In-Reply-To: References: Message-ID: <7BPX92kK6cDWVILYcvyQXfSssFDFjv0XjIZQlGnlRhI=.6521b1d0-9c70-4054-a276-601536946443@github.com> On Thu, 13 Feb 2025 09:59:41 GMT, Doug Simon wrote: > The `JVMCIRuntime::_shared_library_javavm_id` field is initialized from a jlong in [libgraal](https://github.com/oracle/graal/blob/d544bbe3fe416d39e9e5b8fc645a67a36a5d7c07/substratevm/src/com.oracle.svm.core/src/com/oracle/svm/core/jni/functions/JNIInvocationInterface.java#L396-L397) and so it's C++ type in HotSpot should match. This pull request has now been integrated. Changeset: a88e2a58 Author: Doug Simon URL: https://git.openjdk.org/jdk/commit/a88e2a58bf834081db55c2071d072567ea763354 Stats: 7 lines in 3 files changed: 0 ins; 0 del; 7 mod 8349977: JVMCIRuntime::_shared_library_javavm_id should be jlong Reviewed-by: yzheng, never ------------- PR: https://git.openjdk.org/jdk/pull/23610 From dnsimon at openjdk.org Thu Feb 13 19:38:19 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Thu, 13 Feb 2025 19:38:19 GMT Subject: RFR: 8349977: JVMCIRuntime::_shared_library_javavm_id should be jlong In-Reply-To: References: Message-ID: On Thu, 13 Feb 2025 09:59:41 GMT, Doug Simon wrote: > The `JVMCIRuntime::_shared_library_javavm_id` field is initialized from a jlong in [libgraal](https://github.com/oracle/graal/blob/d544bbe3fe416d39e9e5b8fc645a67a36a5d7c07/substratevm/src/com.oracle.svm.core/src/com/oracle/svm/core/jni/functions/JNIInvocationInterface.java#L396-L397) and so it's C++ type in HotSpot should match. Thanks for the reviews. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23610#issuecomment-2657542024 From dlong at openjdk.org Thu Feb 13 22:50:17 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 13 Feb 2025 22:50:17 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v8] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Thu, 13 Feb 2025 17:14:59 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > rename SA argument src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/c1/Runtime1.java line 65: > 63: public CodeBlob blobFor(int id) { > 64: Address blobAddr = blobsField.getStaticFieldAddress().getAddressAt(id * VM.getVM().getAddressSize()); > 65: return VM.getVM().getCodeCache().createCodeBlobWrapper(blobAddr); We don't need to change all the callers if we keep a 1-arg version of createCodeBlobWrapper(): public CodeBlob createCodeBlobWrapper(Address codeBlobAddr) { return createCodeBlobWrapper(codeBlobAddr, codeBlobAddr); } ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1955316582 From dlong at openjdk.org Thu Feb 13 23:04:17 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 13 Feb 2025 23:04:17 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v8] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> 
Message-ID: On Thu, 13 Feb 2025 17:14:59 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > rename SA argument src/hotspot/share/compiler/oopMap.cpp line 567: > 565: fr->print_on(tty); > 566: tty->print(" "); > 567: cb->print_value_on(tty); tty->cr(); We could minimize the number of files changed if we keep print_value_on() for compatibility: void print_value_on(outputStream* st) const { print_value_on_v(st); } src/hotspot/share/runtime/vframe.inline.hpp line 178: > 176: INTPTR_FORMAT " not found or invalid at %d", > 177: p2i(_frame.pc()), decode_offset); > 178: nm()->print_on_v(&ss); I suggest removing _v suffix to reduce changes and match existing naming. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1955325657 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1955327438 From dlong at openjdk.org Fri Feb 14 00:11:16 2025 From: dlong at openjdk.org (Dean Long) Date: Fri, 14 Feb 2025 00:11:16 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v8] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Thu, 13 Feb 2025 17:14:59 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. 
>> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > rename SA argument src/hotspot/share/code/codeBlob.hpp line 669: > 667: > 668: jobject receiver() { return _receiver; } > 669: ByteSize frame_data_offset() { return _frame_data_offset; } `frame_data_offset()` seems to be unused. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1955373697 From dlong at openjdk.org Fri Feb 14 00:17:14 2025 From: dlong at openjdk.org (Dean Long) Date: Fri, 14 Feb 2025 00:17:14 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v8] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Thu, 13 Feb 2025 17:14:59 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > rename SA argument HotSpot C++ changes look good. I skipped SA changes. ------------- Marked as reviewed by dlong (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/23533#pullrequestreview-2616477660 From roland at openjdk.org Fri Feb 14 16:55:14 2025 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 14 Feb 2025 16:55:14 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com> Message-ID: On Thu, 13 Feb 2025 16:43:22 GMT, Roland Westrelin wrote: >> @galderz How sure are that intrinsifying directly is really the right approach? >> >> Maybe the approach via `PhaseIdealLoop::conditional_move` where we know the branching probability is a better one. Though of course knowing the branching probability is no perfect heuristic for how good branch prediction is going to be, but it is at least something. >> >> So I'm wondering if there could be a different approach that sees all the wins you get here, without any of the regressions? >> >> If we are just interested in better vectorization: the current issue is that the auto-vectorizer cannot handle CFG, i.e. we do not yet do if-conversion. But if we had if-conversion, then the inlined CFG of min/max would just be converted to vector CMove (or vector min/max where available) at that point. We can take the branching probabilities into account, just like `PhaseIdealLoop::conditional_move` does - if that is necessary. Of course if-conversion is far away, and we will encounter a lot of issues with branch prediction etc, so I'm scared we might never get there - but I want to try ;) >> >> Do we see any other wins with your patch, that are not due to vectorization, but just scalar code? > >> Do we see any other wins with your patch, that are not due to vectorization, but just scalar code? > > I think there are some. 
> > The current transformation from the parsed version of min/max to a conditional move to a `Max`/`Min` node depends on the conditional move transformation which has its own set of heuristics and while it happens on simple test cases, that's not necessarily the case on all code shapes. I don't think we want to trust it too much. > > With the intrinsic, the type of the min or max can be narrowed down in a way it can't be whether the code includes control flow or a conditional move. That in turn, once types have propagated, could cause some constant to appear and could be a significant win. > > The `Min`/`Max` nodes are floating nodes. They can hoist out of loop and common reliably in ways that are not guaranteed otherwise. > @rwestrel What do you think about the regressions in the scalar cases of this patch? Shouldn't int `min`/`max` be affected the same way? I suppose extracting the branch probability from the `MethodData` and attaching it to the `Min`/`Max` nodes is not impossible. I did something like that in the `ScopedValue` PR that you reviewed (and was put on hold). Now, that would be quite a bit of extra complexity for what feels like a corner case. Another possibility would be to implement `CMove` with branches (https://bugs.openjdk.org/browse/JDK-8340206) or to move the implementation of `MinL`/`MaxL` into the ad files and experiment with branches there. It seems overall, we likely win more than we lose with this intrinsic, so I would integrate this change as it is and file a bug to keep track of remaining issues.
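The type-narrowing point above can be pictured at the Java source level (a hedged sketch with made-up names; the real narrowing happens on the value ranges C2 attaches to `MinL`/`MaxL` node inputs, not in source code). Once the compiler knows `x & 0xFF` lies in `[0, 255]`, a min with the constant `300` is provably the masked value and can fold away:

```java
public class MinNarrowingSketch {
    // If t is known to lie in [0, 255], Math.min(t, 300) is provably t,
    // so a Min node with narrowed input types can fold away entirely.
    static long clampedLow(long x) {
        long t = x & 0xFF;        // value range narrows to [0, 255]
        return Math.min(t, 300L); // always t once the range propagates
    }

    public static void main(String[] args) {
        for (long x = -1024; x <= 1024; x++) {
            if (clampedLow(x) != (x & 0xFF)) {
                throw new AssertionError("min did not fold to masked value");
            }
        }
        System.out.println("ok");
    }
}
```

With the branchy form, this kind of folding depends on the conditional-move heuristics firing first, which is the fragility described above.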
------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2659821025 From kvn at openjdk.org Fri Feb 14 23:14:18 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 14 Feb 2025 23:14:18 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v8] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Thu, 13 Feb 2025 17:14:59 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > rename SA argument I addressed most of @xmas92's and @dean-long's comments and am working on avoiding the `_v` suffix. Thank you, Dean, for the review.
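The overall shape of the de-virtualization can be sketched as follows (a simplified Java illustration with made-up names, not the actual HotSpot or SA code). Instead of C++ virtual dispatch through a hidden vtable pointer, which would have to be patched when a blob is restored from the AOT cache, each blob carries a kind tag and dispatch switches on it, similar to how the SA uses the `CodeBlob::_kind` field to determine the blob type:

```java
public class KindDispatchSketch {
    enum Kind { NMETHOD, RUNTIME_STUB, DEOPTIMIZATION }

    // A blob is plain data plus a kind tag - with no virtual methods,
    // its memory image contains no vtable pointer that needs patching.
    record Blob(Kind kind, String name) {
        String printValue() {
            // Manual dispatch on the kind tag replaces virtual dispatch.
            return switch (kind) {
                case NMETHOD        -> "nmethod: " + name;
                case RUNTIME_STUB   -> "stub: " + name;
                case DEOPTIMIZATION -> "deopt blob";
            };
        }
    }

    public static void main(String[] args) {
        Blob nm = new Blob(Kind.NMETHOD, "Foo::bar");
        System.out.println(nm.printValue()); // prints "nmethod: Foo::bar"
    }
}
```

The static asserts mentioned in the PR description are the C++ guard that keeps the classes in this tag-dispatched form, so no virtual table can silently reappear.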
------------- PR Review: https://git.openjdk.org/jdk/pull/23533#pullrequestreview-2618707275 PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2660443983 From kvn at openjdk.org Fri Feb 14 23:14:20 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 14 Feb 2025 23:14:20 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v6] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Thu, 13 Feb 2025 08:15:16 GMT, Axel Boldt-Christmas wrote: >> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix Zero VM build > > src/hotspot/share/code/codeBlob.hpp line 140: > >> 138: instance->print_value_on_nv(st); >> 139: } >> 140: }; > > I wonder why the base class is not abstract. AFAICT `print_value_on` is unreachable and `print_on` is only used by `DeoptimizationBlob::Vptr`, which also seems like a behavioural change, as before this patch calling `print_on` on a `DeoptimizationBlob` object would dispatch to `SingletonBlob::print_on`, not `CodeBlob::print_on`. > > Suggestion: > > struct Vptr { > virtual void print_on(const CodeBlob* instance, outputStream* st) const = 0; > virtual void print_value_on(const CodeBlob* instance, outputStream* st) const = 0; > }; done > src/hotspot/share/code/codeBlob.hpp line 339: > >> 337: void print_value_on(outputStream* st) const; >> 338: >> 339: class Vptr : public CodeBlob::Vptr { > > I wonder if these should share the same type hierarchy as their container class. This would also solve the issue I noted in my other comment about not calling the correct `print_on`.
> Suggestion: > > class Vptr : public RuntimeBlob::Vptr { Fixed > src/hotspot/share/code/codeBlob.hpp line 427: > >> 425: void print_value_on(outputStream* st) const; >> 426: >> 427: class Vptr : public CodeBlob::Vptr { > > Suggestion: > > class Vptr : public RuntimeBlob::Vptr { Fixed > src/hotspot/share/code/codeBlob.hpp line 467: > >> 465: void print_value_on(outputStream* st) const; >> 466: >> 467: class Vptr : public CodeBlob::Vptr { > > Suggestion: > > class Vptr : public RuntimeBlob::Vptr { Fixed > src/hotspot/share/code/codeBlob.hpp line 553: > >> 551: void print_value_on(outputStream* st) const; >> 552: >> 553: class Vptr : public CodeBlob::Vptr { > > This one specifically > Suggestion: > > class Vptr : public SingletonBlob::Vptr { fixed > src/hotspot/share/code/codeBlob.hpp line 679: > >> 677: void print_value_on(outputStream* st) const; >> 678: >> 679: class Vptr : public CodeBlob::Vptr { > > Suggestion: > > class Vptr : public RuntimeBlob::Vptr { fixed ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1956799673 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1956801833 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1956801994 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1956802109 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1956803039 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1956827486 From kvn at openjdk.org Fri Feb 14 23:14:22 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 14 Feb 2025 23:14:22 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v8] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: <_9qiqpCFRxCMY4nADw0lqrNuOZYIKUpeY_7FYyoQWC8=.78588553-bede-45b1-bf2d-5ad306b81e29@github.com> On Fri, 14 Feb 2025 00:08:35 GMT, Dean Long wrote: >> Vladimir Kozlov has updated the pull request 
incrementally with one additional commit since the last revision: >> >> rename SA argument > > src/hotspot/share/code/codeBlob.hpp line 669: > >> 667: >> 668: jobject receiver() { return _receiver; } >> 669: ByteSize frame_data_offset() { return _frame_data_offset; } > > `frame_data_offset()` seems to be unused. removed > src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/c1/Runtime1.java line 65: > >> 63: public CodeBlob blobFor(int id) { > >> 64: Address blobAddr = blobsField.getStaticFieldAddress().getAddressAt(id * VM.getVM().getAddressSize()); >> 65: return VM.getVM().getCodeCache().createCodeBlobWrapper(blobAddr); > > We don't need to change all the callers if we keep a 1-arg version of createCodeBlobWrapper(): > > public CodeBlob createCodeBlobWrapper(Address codeBlobAddr) { > return createCodeBlobWrapper(codeBlobAddr, codeBlobAddr); > } This is the only place where the arguments are the same. In the other two, the arguments are different. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1956672379 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1956667806 From kvn at openjdk.org Fri Feb 14 23:14:23 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 14 Feb 2025 23:14:23 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v8] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: <07aI9gwcVtc89Bte9DRQ6VwmCfhcBJJQlrXhxkRRgX0=.97d4a1cc-92a2-43dc-8516-2433eca67263@github.com> On Thu, 13 Feb 2025 19:27:19 GMT, Chris Plummer wrote: >> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> rename SA argument > > src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/CodeCache.java line 97: > >> 95: // cbAddr - address of a code blob >> 96: // cbPC - address inside of a code blob >> 97: public CodeBlob createCodeBlobWrapper(Address cbAddr, Address cbPC) { > > Can
you change findBlobUnsafe() above also? That's where the naming problem originated. After some thoughts I think `PC` is not usually used by us. I renamed `cbAddr` to `cbStart` and `cbPC`/`start` to `addr` in this whole file. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1956664966 From kvn at openjdk.org Sat Feb 15 02:08:20 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sat, 15 Feb 2025 02:08:20 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v8] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Thu, 13 Feb 2025 23:01:24 GMT, Dean Long wrote: >> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> rename SA argument > > src/hotspot/share/runtime/vframe.inline.hpp line 178: > >> 176: INTPTR_FORMAT " not found or invalid at %d", >> 177: p2i(_frame.pc()), decode_offset); >> 178: nm()->print_on_v(&ss); > > I suggest removing _v suffix to reduce changes and match existing naming. Done. Testing now. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1956985708 From kvn at openjdk.org Sat Feb 15 06:13:57 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sat, 15 Feb 2025 06:13:57 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v9] In-Reply-To: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: > Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. > > Added C++ static asserts to make sure no virtual methods are added in a future. > > Fixed/cleaned SA code which process CodeBlob and its subclasses. 
Use `CodeBlob::_kind` field value to determine the type of blob. > > Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: Address comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23533/files - new: https://git.openjdk.org/jdk/pull/23533/files/61fdee68..89a383e5 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=08 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=07-08 Stats: 115 lines in 12 files changed: 7 ins; 7 del; 101 mod Patch: https://git.openjdk.org/jdk/pull/23533.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23533/head:pull/23533 PR: https://git.openjdk.org/jdk/pull/23533 From kvn at openjdk.org Sat Feb 15 06:29:14 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sat, 15 Feb 2025 06:29:14 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v9] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: <2aYXBHyZE83suQFtY_POyft2gbRwwF_Xf_qajA62Pgw=.1fe1143c-33c5-4e78-b691-3f85f176c598@github.com> On Sat, 15 Feb 2025 06:13:57 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Address comments I removed `_v` from `CodeBlob::print*_on(st)` methods to reduce scope of VM changes. 
But I have to add an `_impl` suffix to these methods in CodeBlob subclasses. I renamed `nmethod::print_on(st, msg)` to `print_on_with_msg(st, msg)` to avoid a naming conflict C++ complains about. It caused a change in `dependencyContext.cpp`. I made the `CodeBlob::Vptr` class abstract as suggested. I added an empty `Vptr` class to `RuntimeBlob` because it is referenced in subclasses, and corrected the extensions in subclasses to avoid the mistakes @xmas92 pointed out. I also did some argument renaming in the SA's `CodeCache.java` as requested. Tier1-5 testing passed. Ready for a new round of reviews. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2660770028 From kvn at openjdk.org Sat Feb 15 06:34:56 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sat, 15 Feb 2025 06:34:56 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v10] In-Reply-To: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: > Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. > > Added C++ static asserts to make sure no virtual methods are added in a future. > > Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob.
> > Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: Remove commented lines left by mistake ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23533/files - new: https://git.openjdk.org/jdk/pull/23533/files/89a383e5..3fdf1c81 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=09 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23533&range=08-09 Stats: 2 lines in 1 file changed: 0 ins; 2 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23533.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23533/head:pull/23533 PR: https://git.openjdk.org/jdk/pull/23533 From aboldtch at openjdk.org Mon Feb 17 06:41:18 2025 From: aboldtch at openjdk.org (Axel Boldt-Christmas) Date: Mon, 17 Feb 2025 06:41:18 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v10] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Sat, 15 Feb 2025 06:34:56 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Remove commented lines left by mistake Not looked at the SA changes. lgtm. src/hotspot/share/code/codeBlob.hpp line 308: > 306: > 307: class Vptr : public CodeBlob::Vptr { > 308: }; Was this needed for some compiler? Or is it to be more explicit about the type hierarchy? 
------------- Marked as reviewed by aboldtch (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23533#pullrequestreview-2620128040 PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1957678232 From epeter at openjdk.org Mon Feb 17 08:40:17 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 17 Feb 2025 08:40:17 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com> Message-ID: On Fri, 14 Feb 2025 16:52:17 GMT, Roland Westrelin wrote: > I suppose extracting the branch probability from the MethodData and attaching it to the Min/Max nodes is not impossible. That is basically what `PhaseIdealLoop::conditional_move` already does, right? It detects the diamond and converts it to `CMove`. We could special case for `min / max`, and then we'd have the probability for the branch, which we could store at the node. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2662409450 From roland at openjdk.org Mon Feb 17 08:47:22 2025 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 17 Feb 2025 08:47:22 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com> Message-ID: On Mon, 17 Feb 2025 08:37:56 GMT, Emanuel Peter wrote: > > I suppose extracting the branch probability from the MethodData and attaching it to the Min/Max nodes is not impossible. > > That is basically what `PhaseIdealLoop::conditional_move` already does, right? It detects the diamond and converts it to `CMove`. 
We could special case for `min / max`, and then we'd have the probability for the branch, which we could store at the node. Possibly. We could also create the intrinsic they way it's done in the patch and extract the frequency from the `MethoData` for the min or max methods. The shape of the bytecodes for these methods should be simple enough that it should be feasible. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2662424292 From epeter at openjdk.org Mon Feb 17 10:39:15 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 17 Feb 2025 10:39:15 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com> Message-ID: <5oGMaD5b87inAMkco6l5ODRvWv7FRsHGJiu_UMrGrTc=.0be44429-d322-4a6f-b91d-b64a146fad05@github.com> On Mon, 17 Feb 2025 08:44:46 GMT, Roland Westrelin wrote: >>> I suppose extracting the branch probability from the MethodData and attaching it to the Min/Max nodes is not impossible. >> >> That is basically what `PhaseIdealLoop::conditional_move` already does, right? It detects the diamond and converts it to `CMove`. We could special case for `min / max`, and then we'd have the probability for the branch, which we could store at the node. > >> > I suppose extracting the branch probability from the MethodData and attaching it to the Min/Max nodes is not impossible. >> >> That is basically what `PhaseIdealLoop::conditional_move` already does, right? It detects the diamond and converts it to `CMove`. We could special case for `min / max`, and then we'd have the probability for the branch, which we could store at the node. > > Possibly. We could also create the intrinsic they way it's done in the patch and extract the frequency from the `MethoData` for the min or max methods. 
The shape of the bytecodes for these methods should be simple enough that it should be feasible. @rwestrel @galderz > It seems overall, we likely win more than we lose with this intrinsic, so I would integrate this change as it is and file a bug to keep track of remaining issues. I'm a little scared to just accept the regressions, especially for this "most average looking case": Imagine you have an array with random numbers. Or at least numbers in a random order. If we take the max, then we expect the first number to be max with probability 1, the second 1/2, the third 1/3, the i'th 1/i. So the average branch probability is `(sum_i 1/i) / n`. This goes closer and closer to zero, the larger the array. This means that the "average" case has an extreme probability. And so if we do not vectorize, then this gets us a regression with the current patch. And vectorization is a little fragile, it only takes very little for vectorization not to kick in. > The Min/Max nodes are floating nodes. They can hoist out of loop and common reliably in ways that are not guaranteed otherwise. I suppose we could write an optimization that can hoist loop-independent if-diamonds out of a loop. If the condition and all phi inputs are loop invariant, you could just cut the diamond out of the loop, and paste it before the loop entry. > Shouldn't int min/max be affected the same way? I think we should be able to see the same issue here, actually. Yes.
Here's a quick benchmark:

java -XX:CompileCommand=compileonly,TestIntMax::test* -XX:CompileCommand=printcompilation,TestIntMax::test* -XX:+TraceNewVectors TestIntMax.java
CompileCommand: compileonly TestIntMax.test* bool compileonly = true
CompileCommand: PrintCompilation TestIntMax.test* bool PrintCompilation = true
Warmup
5225 93 % 3 TestIntMax::test1 @ 5 (27 bytes)
5226 94 3 TestIntMax::test1 (27 bytes)
5226 95 % 4 TestIntMax::test1 @ 5 (27 bytes)
5238 96 4 TestIntMax::test1 (27 bytes)
Run
Time: 542056319
Warmup
6320 101 % 3 TestIntMax::test2 @ 5 (34 bytes)
6322 102 % 4 TestIntMax::test2 @ 5 (34 bytes)
6329 103 4 TestIntMax::test2 (34 bytes)
Run
Time: 166815209

That's a 4x regression on random input data! With:

import java.util.Random;

public class TestIntMax {
    private static Random RANDOM = new Random();

    public static void main(String[] args) {
        int[] a = new int[64 * 1024];
        for (int i = 0; i < a.length; i++) {
            a[i] = RANDOM.nextInt();
        }
        {
            System.out.println("Warmup");
            for (int i = 0; i < 10_000; i++) { test1(a); }
            System.out.println("Run");
            long t0 = System.nanoTime();
            for (int i = 0; i < 10_000; i++) { test1(a); }
            long t1 = System.nanoTime();
            System.out.println("Time: " + (t1 - t0));
        }
        {
            System.out.println("Warmup");
            for (int i = 0; i < 10_000; i++) { test2(a); }
            System.out.println("Run");
            long t0 = System.nanoTime();
            for (int i = 0; i < 10_000; i++) { test2(a); }
            long t1 = System.nanoTime();
            System.out.println("Time: " + (t1 - t0));
        }
    }

    public static int test1(int[] a) {
        int x = Integer.MIN_VALUE;
        for (int i = 0; i < a.length; i++) {
            x = Math.max(x, a[i]);
        }
        return x;
    }

    public static int test2(int[] a) {
        int x = Integer.MIN_VALUE;
        for (int i = 0; i < a.length; i++) {
            x = (x >= a[i]) ? x : a[i];
        }
        return x;
    }
}

------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2662706564 From roland at openjdk.org Mon Feb 17 10:50:22 2025 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 17 Feb 2025 10:50:22 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11] In-Reply-To: <5oGMaD5b87inAMkco6l5ODRvWv7FRsHGJiu_UMrGrTc=.0be44429-d322-4a6f-b91d-b64a146fad05@github.com> References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com> <5oGMaD5b87inAMkco6l5ODRvWv7FRsHGJiu_UMrGrTc=.0be44429-d322-4a6f-b91d-b64a146fad05@github.com> Message-ID: On Mon, 17 Feb 2025 10:36:52 GMT, Emanuel Peter wrote: > I suppose we could write an optimization that can hoist loop-independent if-diamonds out of a loop. If the condition and all phi inputs are loop invariant, you could just cut the diamond out of the loop, and paste it before the loop entry. Right. But it would likely not optimize as well. The new optimization would possibly have heuristics to limit complexity, so it could be limited. The diamond could be transformed into something else by some other optimization before it gets a chance to be hoisted. There are likely other optimizations that apply to floating nodes and would still not apply to branches: for instance, `MinL`/`MaxL` can be split thru phi even if the `min` call is not right after the merge point. With branches that's not true. Also, with more complexity comes more bugs.
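The hoisting argument can be pictured at the source level (again a hedged sketch with made-up names; C2 performs this on the floating `MaxL` node itself rather than by rewriting source). When both inputs of a max are loop invariant, a floating node can be evaluated once before the loop, whereas the branchy form offers no such guarantee:

```java
public class HoistMaxSketch {
    // bound is loop invariant, so a floating Max node computing it can
    // be hoisted out of the loop and evaluated once.
    static long clampSum(long[] a, long lo, long hi) {
        long sum = 0;
        for (int i = 0; i < a.length; i++) {
            long bound = Math.max(lo, hi); // invariant: hoistable
            sum += Math.min(a[i], bound);
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] a = { 1, 5, 10 };
        // bound = max(3, 7) = 7, so the terms are min(1,7)=1,
        // min(5,7)=5, min(10,7)=7 and the sum is 13.
        System.out.println(clampSum(a, 3, 7)); // prints 13
    }
}
```

Written with an if-diamond instead of `Math.max`, the same hoist would depend on a separate diamond-hoisting optimization of the kind discussed above actually firing.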
------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2662733218 From duke at openjdk.org Mon Feb 17 14:10:47 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Mon, 17 Feb 2025 14:10:47 GMT Subject: RFR: 8349721: Add aarch64 intrinsics for ML-KEM Message-ID: By using the aarch64 vector registers the speed of the computation of the ML-KEM algorithms (key generation, encapsulation, decapsulation) can be approximately doubled. ------------- Commit messages: - removing trailing spaces - kyber aarch64 intrinsics Changes: https://git.openjdk.org/jdk/pull/23663/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23663&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8349721 Stats: 2885 lines in 20 files changed: 2774 ins; 84 del; 27 mod Patch: https://git.openjdk.org/jdk/pull/23663.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23663/head:pull/23663 PR: https://git.openjdk.org/jdk/pull/23663 From roland at openjdk.org Mon Feb 17 14:19:12 2025 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 17 Feb 2025 14:19:12 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Mon, 11 Nov 2024 14:40:09 GMT, Emanuel Peter wrote: > Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below. > > **Background** > > With `-XX:+AlignVector`, all vector loads/stores must be aligned. We try to statically determine if we can always align the vectors. One condition is that the address `base` is already aligned. For arrays, we know that this always holds, because they are `ObjectAlignmentInBytes` aligned. But with native memory, the `base` is just some arbitrarily aligned pointer. > > **Problem** > > So far, we have just naively assumed that the `base` is always `ObjectAlignmentInBytes` aligned. 
But that does not hold for `native` memory segments: the `base` can also be unaligned. I had constructed such an example, and with `-XX:+AlignVector -XX:+VerifyAlignVector` this example hits the verification code. > > > MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1); > MemorySegment nativeUnaligned = nativeAligned.asSlice(1); > test3(nativeUnaligned); > > > When compiling the test method, we assume that the `nativeUnaligned.address()` is aligned - but it is not! > > static void test3(MemorySegment ms) { > for (int i = 0; i < RANGE; i++) { > long adr = i * 4L; > int v = ms.get(ELEMENT_LAYOUT, adr); > ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1)); > } > } > > > **Solution: Runtime Checks - Predicate and Multiversioning** > > Of course we could just forbid cases where we have a `native` base from vectorizing. But that would lead to regressions currently - in most cases we do get aligned `base`s, and we currently vectorize those. We cannot statically determine if the `base` is aligned, we need a runtime check. > > I came up with 2 options where to place the runtime checks: > - A new "auto vectorization" Parse Predicate: > - This only works when predicates are available. > - If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop. > - Multiversion the loop: > - Create 2 copies of the loop (fast and slow loops). > - The `fast_loop` can make speculative alignment assumptions, and add the corresponding check to the `multiversion_if` which decides which loop we take > - In the `slow_loop`, we make no assumption which means we can not vectorize, but we still compile - so even unaligned `base`s would end up with reasonably fast code. > - We "stall" the `... What are the architectures affected by this? Isn't it the case that x86 and aarch64 are unaffected by this? 
Is the motivation to use this as a way to do prep work for alias analysis?

Do you intend to use a single deoptimization reason for all vectorization related predicates? (that is, when you take care of aliasing, are you going to use the same reason for aliasing and alignment checks)

I went over the code and it looks reasonable to me. I intend to do a more careful review later.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2663262133

From roland at openjdk.org  Mon Feb 17 15:05:17 2025
From: roland at openjdk.org (Roland Westrelin)
Date: Mon, 17 Feb 2025 15:05:17 GMT
Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11]
In-Reply-To: <5oGMaD5b87inAMkco6l5ODRvWv7FRsHGJiu_UMrGrTc=.0be44429-d322-4a6f-b91d-b64a146fad05@github.com>
References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com>
 <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com>
 <5oGMaD5b87inAMkco6l5ODRvWv7FRsHGJiu_UMrGrTc=.0be44429-d322-4a6f-b91d-b64a146fad05@github.com>
Message-ID: <3ArmrOQcUoj8DhHTq1a40Oz3GE8bCDDy3FFeVgbladg=.b8e0e13b-39f3-41a6-8a1b-5ca4febb4a41@github.com>

On Mon, 17 Feb 2025 10:36:52 GMT, Emanuel Peter wrote:

> I think we should be able to see the same issue here, actually. Yes.
> Here a quick benchmark below:

I observe the same:

Warmup
751 3 b TestIntMax::test1 (27 bytes)
Run
Time: 360 550 158
Warmup
1862 15 b TestIntMax::test2 (34 bytes)
Run
Time: 92 116 170

But then with this:

diff --git a/src/hotspot/cpu/x86/x86_64.ad b/src/hotspot/cpu/x86/x86_64.ad
index 8cc4a970bfd..9abda8f4178 100644
--- a/src/hotspot/cpu/x86/x86_64.ad
+++ b/src/hotspot/cpu/x86/x86_64.ad
@@ -12037,16 +12037,20 @@ instruct cmovI_reg_l(rRegI dst, rRegI src, rFlagsReg cr)
 %}
 
-instruct maxI_rReg(rRegI dst, rRegI src)
+instruct maxI_rReg(rRegI dst, rRegI src, rFlagsReg cr)
 %{
   match(Set dst (MaxI dst src));
+  effect(KILL cr);
 
   ins_cost(200);
-  expand %{
-    rFlagsReg cr;
-    compI_rReg(cr, dst, src);
-    cmovI_reg_l(dst, src, cr);
+  ins_encode %{
+    Label done;
+    __ cmpl($src$$Register, $dst$$Register);
+    __ jccb(Assembler::less, done);
+    __ mov($dst$$Register, $src$$Register);
+    __ bind(done);
   %}
+  ins_pipe(pipe_cmov_reg);
 %}

the performance gap narrows:

Warmup
770 3 b TestIntMax::test1 (27 bytes)
Run
Time: 94 951 677
Warmup
1312 15 b TestIntMax::test2 (34 bytes)
Run
Time: 70 053 824

(the number for test2 fluctuates quite a bit). Does it ever make sense to implement `MaxI` with a conditional move then?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2663379660

From epeter at openjdk.org  Mon Feb 17 15:28:13 2025
From: epeter at openjdk.org (Emanuel Peter)
Date: Mon, 17 Feb 2025 15:28:13 GMT
Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory
In-Reply-To: 
References: 
Message-ID: 

On Mon, 17 Feb 2025 14:16:59 GMT, Roland Westrelin wrote:

>> Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below.
>>
>> **Background**
>>
>> With `-XX:+AlignVector`, all vector loads/stores must be aligned.
We try to statically determine if we can always align the vectors. One condition is that the address `base` is already aligned. For arrays, we know that this always holds, because they are `ObjectAlignmentInBytes` aligned. But with native memory, the `base` is just some arbitrarily aligned pointer. >> >> **Problem** >> >> So far, we have just naively assumed that the `base` is always `ObjectAlignmentInBytes` aligned. But that does not hold for `native` memory segments: the `base` can also be unaligned. I had constructed such an example, and with `-XX:+AlignVector -XX:+VerifyAlignVector` this example hits the verification code. >> >> >> MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1); >> MemorySegment nativeUnaligned = nativeAligned.asSlice(1); >> test3(nativeUnaligned); >> >> >> When compiling the test method, we assume that the `nativeUnaligned.address()` is aligned - but it is not! >> >> static void test3(MemorySegment ms) { >> for (int i = 0; i < RANGE; i++) { >> long adr = i * 4L; >> int v = ms.get(ELEMENT_LAYOUT, adr); >> ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1)); >> } >> } >> >> >> **Solution: Runtime Checks - Predicate and Multiversioning** >> >> Of course we could just forbid cases where we have a `native` base from vectorizing. But that would lead to regressions currently - in most cases we do get aligned `base`s, and we currently vectorize those. We cannot statically determine if the `base` is aligned, we need a runtime check. >> >> I came up with 2 options where to place the runtime checks: >> - A new "auto vectorization" Parse Predicate: >> - This only works when predicates are available. >> - If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop. >> - Multiversion the loop: >> - Create 2 copies of the loop (fast and slow loops). 
>> - The `fast_loop` can make speculative alignment assumptions, and add the corresponding check to the `multiversion_if` which decides which loop we take >> - In the `slow_loop`, we make no assumption which means we can not vectorize, but we still compile - so even ... > > What are the architectures affected by this? Isn't it the case that x86 and aarch64 are unaffected by this? Is the motivation to use this as a way to do prep work for alias analysis? > > Do you intend to use a single deoptimization reason for all vectorization related predicates? (that is when you take care of aliasing, are you going to to use the same reason for aliasing and alignment checks) > > I went over the code and it looks reasonable to me. I intend to do a more careful review later. @rwestrel Thanks for having a first look! > What are the architectures affected by this? Isn't it the case that x86 and aarch64 are unaffected by this? Yes, x86 and aarch64 are unaffected, as far as I know. Well, we can simulate strict alignment with `-XX:+AlignVector`, and there it should behave correctly, and it currently fails with the `-XX:+VerifyAlignVector`. It would be nice if that was not the case, so that we can write tests with arbitrary alignment, and turn on those flags freely. > Is the motivation to use this as a way to do prep work for alias analysis? I see this as a bug-fix AND preparation for future work. I suppose I might not have fixed this bug here since our platforms are not really affected, but I might as well fix it now since I can re-use most of the code later. > Do you intend to use a single deoptimization reason for all vectorization related predicates? (that is when you take care of aliasing, are you going to to use the same reason for aliasing and alignment checks) I suppose that is currently what I'm planning. But we could in principle separate them. But I would leave that for later, if there is any desire to do that. 
For now, I think it's ok to just go with a single "auto-vectorization" reason. Does that sound reasonable?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2663434802

From dnsimon at openjdk.org  Mon Feb 17 16:12:41 2025
From: dnsimon at openjdk.org (Doug Simon)
Date: Mon, 17 Feb 2025 16:12:41 GMT
Subject: RFR: 8346781: [JVMCI] Limit ServiceLoader to class initializers [v2]
In-Reply-To: 
References: 
Message-ID: 

> In the context of libgraal, the current use of ServiceLoader in JVMCI is problematic as libgraal does all class loading at image build time. There are static fields such as `JVMCIServiceLocator.cachedLocators` that need to be initialized [via reflection](https://github.com/oracle/graal/blob/30492c3f7847a13ae7f8dc50663a5a039e49a8e7/compiler/src/jdk.graal.compiler/src/jdk/graal/compiler/hotspot/libgraal/BuildTime.java#L175-L180) when building libgraal.
>
> This PR removes the need for such reflection by moving all use of ServiceLoader in JVMCI into `<clinit>` methods. These methods are executed when building libgraal. It also removes a few other public methods and fields that are no longer used by Graal. Given that JVMCI is still experimental and only has qualified exports to Graal, I don't think this needs a CSR.

Doug Simon has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains three additional commits since the last revision: - Merge remote-tracking branch 'openjdk-jdk/master' into JDK-8346781 - remove non-native-image build time use of ServiceLoader - make Cleaner.clean public ------------- Changes: - all: https://git.openjdk.org/jdk/pull/22869/files - new: https://git.openjdk.org/jdk/pull/22869/files/24bb39be..7c91d00c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=22869&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=22869&range=00-01 Stats: 212534 lines in 5089 files changed: 102007 ins; 88290 del; 22237 mod Patch: https://git.openjdk.org/jdk/pull/22869.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22869/head:pull/22869 PR: https://git.openjdk.org/jdk/pull/22869 From galder at openjdk.org Mon Feb 17 16:49:15 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Mon, 17 Feb 2025 16:49:15 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11] In-Reply-To: <3ArmrOQcUoj8DhHTq1a40Oz3GE8bCDDy3FFeVgbladg=.b8e0e13b-39f3-41a6-8a1b-5ca4febb4a41@github.com> References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com> <5oGMaD5b87inAMkco6l5ODRvWv7FRsHGJiu_UMrGrTc=.0be44429-d322-4a6f-b91d-b64a146fad05@github.com> <3ArmrOQcUoj8DhHTq1a40Oz3GE8bCDDy3FF eVgbladg=.b8e0e13b-39f3-41a6-8a1b-5ca4febb4a41@github.com> Message-ID: <_SUoth7bTq41M5TpGjQ5ADL2TOesK2tIIxmL21BZ6RU=.65284948-b4a8-4d01-a924-e9dfeefe1c88@github.com> On Mon, 17 Feb 2025 15:02:32 GMT, Roland Westrelin wrote: >> @rwestrel @galderz >> >>> It seems overall, we likely win more than we loose with this intrinsic, so I would integrate this change as it is and file a bug to keep track of remaining issues. 
>>
>> I'm a little scared to just accept the regressions, especially for this "most average looking case":
>> Imagine you have an array with random numbers. Or at least numbers in a random order. If we take the max, then we expect the first number to be max with probability 1, the second 1/2, the third 1/3, the i'th 1/i. So the average branch probability is `(sum_i 1/i) / n`. This goes closer and closer to zero, the larger the array. This means that the "average" case has an extreme probability. And so if we do not vectorize, then this gets us a regression with the current patch. And vectorization is a little fragile, it only takes very little for vectorization not to kick in.
>>
>>> The Min/Max nodes are floating nodes. They can hoist out of loop and common reliably in ways that are not guaranteed otherwise.
>>
>> I suppose we could write an optimization that can hoist out loop independent if-diamonds out of a loop. If the condition and all phi inputs are loop invariant, you could just cut the diamond out of the loop, and paste it before the loop entry.
>>
>>> Shouldn't int min/max be affected the same way?
>>
>> I think we should be able to see the same issue here, actually. Yes. Here a quick benchmark below:
>>
>> java -XX:CompileCommand=compileonly,TestIntMax::test* -XX:CompileCommand=printcompilation,TestIntMax::test* -XX:+TraceNewVectors TestIntMax.java
>> CompileCommand: compileonly TestIntMax.test* bool compileonly = true
>> CompileCommand: PrintCompilation TestIntMax.test* bool PrintCompilation = true
>> Warmup
>> 5225 93 % 3 TestIntMax::test1 @ 5 (27 bytes)
>> 5226 94 3 TestIntMax::test1 (27 bytes)
>> 5226 95 % 4 TestIntMax::test1 @ 5 (27 bytes)
>> 5238 96 4 TestIntMax::test1 (27 bytes)
>> Run
>> Time: 542056319
>> Warmup
>> 6320 101 % 3 TestIntMax::test2 @ 5 (34 bytes)
>> 6322 102 % 4 TestIntMax::test2 @ 5 (34 bytes)
>> 6329 103 4 TestIntMax::test2 (34 bytes)
>> Run
>> Time: 166815209
>>
>> That's a 4x regression on random input data!
>> >> With: >> >> import java.util.Random; >> >> public class TestIntMax { >> private static Random RANDOM = new Random(); >> >> public static void main(String[] args) { >> int[] a = new int[64 * 1024]; >> for (int i = 0; i < a.length; i++) { >>... > >> I think we should be able to see the same issue here, actually. Yes. Here a quick benchmark below: > > I observe the same: > > > Warmup > 751 3 b TestIntMax::test1 (27 bytes) > Run > Time: 360 550 158 > Warmup > 1862 15 b TestIntMax::test2 (34 bytes) > Run > Time: 92 116 170 > > > But then with this: > > > diff --git a/src/hotspot/cpu/x86/x86_64.ad b/src/hotspot/cpu/x86/x86_64.ad > index 8cc4a970bfd..9abda8f4178 100644 > --- a/src/hotspot/cpu/x86/x86_64.ad > +++ b/src/hotspot/cpu/x86/x86_64.ad > @@ -12037,16 +12037,20 @@ instruct cmovI_reg_l(rRegI dst, rRegI src, rFlagsReg cr) > %} > > > -instruct maxI_rReg(rRegI dst, rRegI src) > +instruct maxI_rReg(rRegI dst, rRegI src, rFlagsReg cr) > %{ > match(Set dst (MaxI dst src)); > + effect(KILL cr); > > ins_cost(200); > - expand %{ > - rFlagsReg cr; > - compI_rReg(cr, dst, src); > - cmovI_reg_l(dst, src, cr); > + ins_encode %{ > + Label done; > + __ cmpl($src$$Register, $dst$$Register); > + __ jccb(Assembler::less, done); > + __ mov($dst$$Register, $src$$Register); > + __ bind(done); > %} > + ins_pipe(pipe_cmov_reg); > %} > > // ============================================================================ > > > the performance gap narrows: > > > Warmup > 770 3 b TestIntMax::test1 (27 bytes) > Run > Time: 94 951 677 > Warmup > 1312 15 b TestIntMax::test2 (34 bytes) > Run > Time: 70 053 824 > > > (the number of test2 fluctuates quite a bit). Does it ever make sense to implement `MaxI` with a conditional move then? 
@rwestrel @eme64 I think that the data distribution in the `TestIntMax` above matters (see my explanations in https://github.com/openjdk/jdk/pull/20098#issuecomment-2642788364), so I've enhanced the test to control data distribution in the int[] (see at the bottom). Here are the results I see on my AVX-512 machine: Probability: 50% Warmup 7834 92 % b 3 TestIntMax::test1 @ 5 (27 bytes) 7836 93 b 3 TestIntMax::test1 (27 bytes) 7838 94 % b 4 TestIntMax::test1 @ 5 (27 bytes) 7851 95 b 4 TestIntMax::test1 (27 bytes) Run Time: 699 923 014 Warmup 9272 96 % b 3 TestIntMax::test2 @ 5 (34 bytes) 9274 97 b 3 TestIntMax::test2 (34 bytes) 9275 98 % b 4 TestIntMax::test2 @ 5 (34 bytes) 9287 99 b 4 TestIntMax::test2 (34 bytes) Run Time: 699 815 792 Probability: 80% Warmup 7872 92 % b 3 TestIntMax::test1 @ 5 (27 bytes) 7874 93 b 3 TestIntMax::test1 (27 bytes) 7875 94 % b 4 TestIntMax::test1 @ 5 (27 bytes) 7889 95 b 4 TestIntMax::test1 (27 bytes) Run Time: 699 947 633 Warmup 9310 96 % b 3 TestIntMax::test2 @ 5 (34 bytes) 9311 97 b 3 TestIntMax::test2 (34 bytes) 9312 98 % b 4 TestIntMax::test2 @ 5 (34 bytes) 9325 99 b 4 TestIntMax::test2 (34 bytes) Run Time: 699 827 882 Probability: 100% Warmup 7884 92 % b 3 TestIntMax::test1 @ 5 (27 bytes) 7886 93 b 3 TestIntMax::test1 (27 bytes) 7888 94 % b 4 TestIntMax::test1 @ 5 (27 bytes) 7901 95 b 4 TestIntMax::test1 (27 bytes) Run Time: 699 931 243 Warmup 9322 96 % b 3 TestIntMax::test2 @ 5 (34 bytes) 9323 97 b 3 TestIntMax::test2 (34 bytes) 9324 98 % b 4 TestIntMax::test2 @ 5 (34 bytes) 9336 99 b 4 TestIntMax::test2 (34 bytes) Run Time: 1 077 937 282 import java.util.Random; import java.util.concurrent.ThreadLocalRandom; import java.text.DecimalFormat; import java.text.DecimalFormatSymbols; class TestIntMax { static final int RANGE = 16 * 1024; static final int ITER = 100_000; public static void main(String[] args) { final int probability = Integer.parseInt(args[0]); final DecimalFormatSymbols symbols = new DecimalFormatSymbols(); 
symbols.setGroupingSeparator(' '); final DecimalFormat format = new DecimalFormat("#,###", symbols); System.out.printf("Probability: %d%%%n", probability); int[] a = new int[64 * 1024]; init(a, probability); { System.out.println("Warmup"); for (int i = 0; i < 10_000; i++) { test1(a); } System.out.println("Run"); long t0 = System.nanoTime(); for (int i = 0; i < 10_000; i++) { test1(a); } long t1 = System.nanoTime(); System.out.println("Time: " + format.format(t1 - t0)); } { System.out.println("Warmup"); for (int i = 0; i < 10_000; i++) { test2(a); } System.out.println("Run"); long t0 = System.nanoTime(); for (int i = 0; i < 10_000; i++) { test2(a); } long t1 = System.nanoTime(); System.out.println("Time: " + format.format(t1 - t0)); } } public static int test1(int[] a) { int x = Integer.MIN_VALUE; for (int i = 0; i < a.length; i++) { x = Math.max(x, a[i]); } return x; } public static int test2(int[] a) { int x = Integer.MIN_VALUE; for (int i = 0; i < a.length; i++) { x = (x >= a[i]) ? x : a[i]; } return x; } public static void init(int[] ints, int probability) { int aboveCount, abovePercent; do { int max = ThreadLocalRandom.current().nextInt(10); ints[0] = max; aboveCount = 0; for (int i = 1; i < ints.length; i++) { int value; if (ThreadLocalRandom.current().nextInt(101) <= probability) { int increment = ThreadLocalRandom.current().nextInt(10); value = max + increment; aboveCount++; } else { // Decrement by at least 1 int decrement = ThreadLocalRandom.current().nextInt(10) + 1; value = max - decrement; } ints[i] = value; max = Math.max(max, value); } abovePercent = ((aboveCount + 1) * 100) / ints.length; } while (abovePercent != probability); } } Focusing my comment below on 100% which is where the differences appear: test2 (100%): ;; B12: # out( B21 B13 ) <- in( B11 B20 ) Freq: 1.6744e+09 0x00007f15bcada2e9: movl 0x14(%rsi, %rdx, 4), %r11d ;*iaload {reexecute=0 rethrow=0 return_oop=0} ; - TestIntMax::test2 at 14 (line 71) 0x00007f15bcada2ee: cmpl %r11d, %r10d 
0x00007f15bcada2f1: jge 0x7f15bcada362 ;*istore_1 {reexecute=0 rethrow=0 return_oop=0} ; - TestIntMax::test2 at 25 (line 71) test1 (100%) ;; B10: # out( B10 B11 ) <- in( B9 B10 ) Loop( B10-B10 inner main of N64 strip mined) Freq: 1.6744e+09 0x00007f15bcad9a70: movl 0x4c(%rsi, %rdx, 4), %r11d 0x00007f15bcad9a75: movl %r11d, (%rsp) 0x00007f15bcad9a79: movl 0x48(%rsi, %rdx, 4), %r10d 0x00007f15bcad9a7e: movl %r10d, 4(%rsp) 0x00007f15bcad9a83: movl 0x10(%rsi, %rdx, 4), %r11d 0x00007f15bcad9a88: movl 0x14(%rsi, %rdx, 4), %r9d 0x00007f15bcad9a8d: movl 0x44(%rsi, %rdx, 4), %r10d 0x00007f15bcad9a92: movl %r10d, 8(%rsp) 0x00007f15bcad9a97: movl 0x18(%rsi, %rdx, 4), %r8d 0x00007f15bcad9a9c: cmpl %r11d, %eax 0x00007f15bcad9a9f: cmovll %r11d, %eax 0x00007f15bcad9aa3: cmpl %r9d, %eax 0x00007f15bcad9aa6: cmovll %r9d, %eax 0x00007f15bcad9aaa: movl 0x20(%rsi, %rdx, 4), %r10d 0x00007f15bcad9aaf: cmpl %r8d, %eax 0x00007f15bcad9ab2: cmovll %r8d, %eax 0x00007f15bcad9ab6: movl 0x24(%rsi, %rdx, 4), %r8d 0x00007f15bcad9abb: movl 0x28(%rsi, %rdx, 4), %r11d ; {no_reloc} 0x00007f15bcad9ac0: movl 0x2c(%rsi, %rdx, 4), %ecx 0x00007f15bcad9ac4: movl 0x30(%rsi, %rdx, 4), %r9d 0x00007f15bcad9ac9: movl 0x34(%rsi, %rdx, 4), %edi 0x00007f15bcad9acd: movl 0x38(%rsi, %rdx, 4), %ebx 0x00007f15bcad9ad1: movl 0x3c(%rsi, %rdx, 4), %ebp 0x00007f15bcad9ad5: movl 0x40(%rsi, %rdx, 4), %r13d 0x00007f15bcad9ada: movl 0x1c(%rsi, %rdx, 4), %r14d 0x00007f15bcad9adf: cmpl %r14d, %eax 0x00007f15bcad9ae2: cmovll %r14d, %eax 0x00007f15bcad9ae6: cmpl %r10d, %eax 0x00007f15bcad9ae9: cmovll %r10d, %eax 0x00007f15bcad9aed: cmpl %r8d, %eax 0x00007f15bcad9af0: cmovll %r8d, %eax 0x00007f15bcad9af4: cmpl %r11d, %eax 0x00007f15bcad9af7: cmovll %r11d, %eax 0x00007f15bcad9afb: cmpl %ecx, %eax 0x00007f15bcad9afd: cmovll %ecx, %eax 0x00007f15bcad9b00: cmpl %r9d, %eax 0x00007f15bcad9b03: cmovll %r9d, %eax 0x00007f15bcad9b07: cmpl %edi, %eax 0x00007f15bcad9b09: cmovll %edi, %eax 0x00007f15bcad9b0c: cmpl %ebx, %eax 
0x00007f15bcad9b0e: cmovll %ebx, %eax 0x00007f15bcad9b11: cmpl %ebp, %eax 0x00007f15bcad9b13: cmovll %ebp, %eax 0x00007f15bcad9b16: cmpl %r13d, %eax 0x00007f15bcad9b19: cmovll %r13d, %eax 0x00007f15bcad9b1d: cmpl 8(%rsp), %eax 0x00007f15bcad9b21: movl 8(%rsp), %r11d 0x00007f15bcad9b26: cmovll %r11d, %eax 0x00007f15bcad9b2a: cmpl 4(%rsp), %eax 0x00007f15bcad9b2e: movl 4(%rsp), %r10d 0x00007f15bcad9b33: cmovll %r10d, %eax 0x00007f15bcad9b37: cmpl (%rsp), %eax 0x00007f15bcad9b3a: movl (%rsp), %r11d 0x00007f15bcad9b3e: cmovll %r11d, %eax ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0} ; - TestIntMax::test1 at 15 (line 61) ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2663633050 From galder at openjdk.org Mon Feb 17 17:05:28 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Mon, 17 Feb 2025 17:05:28 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> Message-ID: On Fri, 7 Feb 2025 12:39:24 GMT, Galder Zamarre?o wrote: >> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance. >> >> Currently vectorization does not kick in for loops containing either of these calls because of the following error: >> >> >> VLoop::check_preconditions: failed: control flow in loop not allowed >> >> >> The control flow is due to the java implementation for these methods, e.g. >> >> >> public static long max(long a, long b) { >> return (a >= b) ? a : b; >> } >> >> >> This patch intrinsifies the calls to replace the CmpL + Bool nodes for MaxL/MinL nodes respectively. >> By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization. >> E.g. 
>> >> >> SuperWord::transform_loop: >> Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt strip_mined >> 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21) >> >> >> Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after. Before the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1155 >> long max 1173 >> >> >> After the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1042 >> long max 1042 >> >> >> This patch does not add an platform-specific backend implementations for the MaxL/MinL nodes. >> Therefore, it still relies on the macro expansion to transform those into CMoveL. >> >> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results: >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PA... > > Galder Zamarre?o has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 44 additional commits since the last revision: > > - Merge branch 'master' into topic.intrinsify-max-min-long > - Fix typo > - Renaming methods and variables and add docu on algorithms > - Fix copyright years > - Make sure it runs with cpus with either avx512 or asimd > - Test can only run with 256 bit registers or bigger > > * Remove platform dependant check > and use platform independent configuration instead. > - Fix license header > - Tests should also run on aarch64 asimd=true envs > - Added comment around the assertions > - Adjust min/max identity IR test expectations after changes > - ... and 34 more: https://git.openjdk.org/jdk/compare/ba549afe...a190ae68 Another interesting comparison arises above when comparing `test2` in 80% vs 100%: test2 (100%): ;; B12: # out( B21 B13 ) <- in( B11 B20 ) Freq: 1.6744e+09 0x00007f15bcada2e9: movl 0x14(%rsi, %rdx, 4), %r11d ;*iaload {reexecute=0 rethrow=0 return_oop=0} ; - TestIntMax::test2 at 14 (line 71) 0x00007f15bcada2ee: cmpl %r11d, %r10d 0x00007f15bcada2f1: jge 0x7f15bcada362 ;*istore_1 {reexecute=0 rethrow=0 return_oop=0} ; - TestIntMax::test2 at 25 (line 71) test2(80%): ;; B10: # out( B10 B11 ) <- in( B9 B10 ) Loop( B10-B10 inner main of N64 strip mined) Freq: 1.6744e+09 0x00007fe850ada2f0: movl 0x4c(%rsi, %rdx, 4), %r11d 0x00007fe850ada2f5: movl %r11d, (%rsp) 0x00007fe850ada2f9: movl 0x48(%rsi, %rdx, 4), %r10d 0x00007fe850ada2fe: movl %r10d, 4(%rsp) 0x00007fe850ada303: movl 0x10(%rsi, %rdx, 4), %r11d 0x00007fe850ada308: movl 0x14(%rsi, %rdx, 4), %r9d 0x00007fe850ada30d: movl 0x44(%rsi, %rdx, 4), %r10d 0x00007fe850ada312: movl %r10d, 8(%rsp) 0x00007fe850ada317: movl 0x18(%rsi, %rdx, 4), %r8d 0x00007fe850ada31c: cmpl %r11d, %eax 0x00007fe850ada31f: cmovll %r11d, %eax 0x00007fe850ada323: cmpl %r9d, %eax 0x00007fe850ada326: cmovll %r9d, %eax 0x00007fe850ada32a: movl 0x20(%rsi, %rdx, 4), %r10d 0x00007fe850ada32f: cmpl %r8d, %eax 0x00007fe850ada332: cmovll %r8d, %eax 0x00007fe850ada336: movl 
0x24(%rsi, %rdx, 4), %r8d 0x00007fe850ada33b: movl 0x28(%rsi, %rdx, 4), %r11d ; {no_reloc} 0x00007fe850ada340: movl 0x2c(%rsi, %rdx, 4), %ecx 0x00007fe850ada344: movl 0x30(%rsi, %rdx, 4), %r9d 0x00007fe850ada349: movl 0x34(%rsi, %rdx, 4), %edi 0x00007fe850ada34d: movl 0x38(%rsi, %rdx, 4), %ebx 0x00007fe850ada351: movl 0x3c(%rsi, %rdx, 4), %ebp 0x00007fe850ada355: movl 0x40(%rsi, %rdx, 4), %r13d 0x00007fe850ada35a: movl 0x1c(%rsi, %rdx, 4), %r14d 0x00007fe850ada35f: cmpl %r14d, %eax 0x00007fe850ada362: cmovll %r14d, %eax 0x00007fe850ada366: cmpl %r10d, %eax 0x00007fe850ada369: cmovll %r10d, %eax 0x00007fe850ada36d: cmpl %r8d, %eax 0x00007fe850ada370: cmovll %r8d, %eax 0x00007fe850ada374: cmpl %r11d, %eax 0x00007fe850ada377: cmovll %r11d, %eax 0x00007fe850ada37b: cmpl %ecx, %eax 0x00007fe850ada37d: cmovll %ecx, %eax 0x00007fe850ada380: cmpl %r9d, %eax 0x00007fe850ada383: cmovll %r9d, %eax 0x00007fe850ada387: cmpl %edi, %eax 0x00007fe850ada389: cmovll %edi, %eax 0x00007fe850ada38c: cmpl %ebx, %eax 0x00007fe850ada38e: cmovll %ebx, %eax 0x00007fe850ada391: cmpl %ebp, %eax 0x00007fe850ada393: cmovll %ebp, %eax 0x00007fe850ada396: cmpl %r13d, %eax 0x00007fe850ada399: cmovll %r13d, %eax 0x00007fe850ada39d: cmpl 8(%rsp), %eax 0x00007fe850ada3a1: movl 8(%rsp), %r11d 0x00007fe850ada3a6: cmovll %r11d, %eax 0x00007fe850ada3aa: cmpl 4(%rsp), %eax 0x00007fe850ada3ae: movl 4(%rsp), %r10d 0x00007fe850ada3b3: cmovll %r10d, %eax 0x00007fe850ada3b7: cmpl (%rsp), %eax 0x00007fe850ada3ba: movl (%rsp), %r11d 0x00007fe850ada3be: cmovll %r11d, %eax ;*istore_1 {reexecute=0 rethrow=0 return_oop=0} ; - TestIntMax::test2 at 25 (line 71) There are a couple of things is puzzling me. This test is like a reduction test and no vectorization appears to be kicking in any of the percentages (I've not enabled vectorization SW rejections to check). The other thing that is strange is the overall time. 
When no vectorization kicks in and the code uses cmovs, I've been seeing worse performance numbers compared to, say, compare and jumps, particularly in the 100% tests. With `TestIntMax` it appears to be the opposite: test2 at 100% uses jmp+cmp, which performs worse than the cmov versions.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2663665858

From dnsimon at openjdk.org  Mon Feb 17 17:11:22 2025
From: dnsimon at openjdk.org (Doug Simon)
Date: Mon, 17 Feb 2025 17:11:22 GMT
Subject: RFR: 8346781: [JVMCI] Limit ServiceLoader to class initializers [v2]
In-Reply-To: 
References: 
Message-ID: 

On Mon, 17 Feb 2025 16:12:41 GMT, Doug Simon wrote:

>> In the context of libgraal, the current use of ServiceLoader in JVMCI is problematic as libgraal does all class loading at image build time. There are static fields such as `JVMCIServiceLocator.cachedLocators` that need to be initialized [via reflection](https://github.com/oracle/graal/blob/30492c3f7847a13ae7f8dc50663a5a039e49a8e7/compiler/src/jdk.graal.compiler/src/jdk/graal/compiler/hotspot/libgraal/BuildTime.java#L175-L180) when building libgraal.
>>
>> This PR removes the need for such reflection by moving all use of ServiceLoader in JVMCI into `<clinit>` methods. These methods are executed when building libgraal. It also removes a few other public methods and fields that are no longer used by Graal. Given that JVMCI is still experimental and only has qualified exports to Graal, I don't think this needs a CSR.

> Doug Simon has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains three additional commits since the last revision: > > - Merge remote-tracking branch 'openjdk-jdk/master' into JDK-8346781 > - remove non-native-image build time use of ServiceLoader > - make Cleaner.clean public Passes openjdk-pr-canary: https://github.com/dougxc/openjdk-pr-canary/actions/runs/13374826011/job/37351770830#step:4:47 ------------- PR Comment: https://git.openjdk.org/jdk/pull/22869#issuecomment-2663687923 From galder at openjdk.org Mon Feb 17 17:21:15 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Mon, 17 Feb 2025 17:21:15 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> Message-ID: On Mon, 17 Feb 2025 17:02:47 GMT, Galder Zamarreño wrote: > This test is like a reduction test and no vectorization appears to be kicking in at any of the percentages (I've not enabled vectorization SW rejections to check). Ah, that's probably because of profitable vectorization checks. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2663710153 From dnsimon at openjdk.org Mon Feb 17 17:43:14 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Mon, 17 Feb 2025 17:43:14 GMT Subject: RFR: 8346781: [JVMCI] Limit ServiceLoader to class initializers [v2] In-Reply-To: References: Message-ID: On Mon, 23 Dec 2024 18:06:21 GMT, Doug Simon wrote: >> Doug Simon has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains three additional commits since the last revision: >> >> - Merge remote-tracking branch 'openjdk-jdk/master' into JDK-8346781 >> - remove non-native-image build time use of ServiceLoader >> - make Cleaner.clean public > > src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/services/Services.java line 52: > >> 50: * statement on this field - the guard cannot be behind a method call. >> 51: */ >> 52: public static final boolean IS_BUILDING_NATIVE_IMAGE = Boolean.parseBoolean(VM.getSavedProperty("jdk.vm.ci.services.aot")); > > This field is no longer used in JVMCI and I will remove its usages in Graal. Removed in https://github.com/oracle/graal/pull/10380 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22869#discussion_r1958608248 From yzheng at openjdk.org Mon Feb 17 17:56:15 2025 From: yzheng at openjdk.org (Yudi Zheng) Date: Mon, 17 Feb 2025 17:56:15 GMT Subject: RFR: 8346781: [JVMCI] Limit ServiceLoader to class initializers [v2] In-Reply-To: References: Message-ID: <98jmUmCaXEstTsMZUeuKA1QBro7kZvIZhrFsQWbQIj0=.f4e81caf-78b4-44b8-9d70-b1d68cfc6f7b@github.com> On Mon, 17 Feb 2025 16:12:41 GMT, Doug Simon wrote: >> In the context of libgraal, the current use of ServiceLoader in JVMCI is problematic as libgraal does all class loading at image build time. There are static fields such as `JVMCIServiceLocator.cachedLocators` that need to be initialized [via reflection](https://github.com/oracle/graal/blob/30492c3f7847a13ae7f8dc50663a5a039e49a8e7/compiler/src/jdk.graal.compiler/src/jdk/graal/compiler/hotspot/libgraal/BuildTime.java#L175-L180) when building libgraal. >> >> This PR removes the need for such reflection by moving all use of ServiceLoader in JVMCI into `<clinit>` methods. These methods are executed when building libgraal. It also removes a few other public methods and fields that are no longer used by Graal. Given that JVMCI only has qualified exports to Graal, a CSR is not needed.
> > Doug Simon has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Merge remote-tracking branch 'openjdk-jdk/master' into JDK-8346781 > - remove non-native-image build time use of ServiceLoader > - make Cleaner.clean public LGTM ------------- Marked as reviewed by yzheng (Committer). PR Review: https://git.openjdk.org/jdk/pull/22869#pullrequestreview-2621722417 From kvn at openjdk.org Mon Feb 17 18:43:23 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 17 Feb 2025 18:43:23 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v10] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Mon, 17 Feb 2025 06:24:35 GMT, Axel Boldt-Christmas wrote: >> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove commented lines left by mistake > > src/hotspot/share/code/codeBlob.hpp line 308: > >> 306: >> 307: class Vptr : public CodeBlob::Vptr { >> 308: }; > > Was this needed for some compiler? Or is it to be more explicit about the type hierarchy? Thank you, @xmas92, for review and suggestions. It is the second (explicit type hierarchy). I think it should be explicitly declared (even empty) because it is referenced in subclasses to avoid confusion. And it could be useful in the future if we need other virtual methods. Local build with `gcc` on Linux passed without it but I did not try to build on other platforms.
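[Editor's note: for readers following along, the explicit-Vptr pattern under discussion can be sketched in a few lines of standalone C++. This is an illustrative sketch with invented names, not the HotSpot sources: the blob class itself stays non-polymorphic, a plain kind tag selects an explicit `Vptr` object, and even an empty subclass `Vptr` spells out the hierarchy.]

```cpp
#include <cassert>
#include <type_traits>

// Illustrative sketch of de-virtualized dispatch (invented names, not
// HotSpot code): Blob has no virtual methods of its own, so its object
// layout contains no compiler-generated vptr that would need patching
// when the object is saved to and restored from an AOT cache image.
struct Blob {
  enum Kind { kBlob, kNmethod };

  // Explicit, hand-rolled "vtable". Subclasses extend it; declaring an
  // empty subclass Vptr keeps the hierarchy explicit even before it
  // adds anything.
  struct Vptr {
    virtual ~Vptr() = default;
    virtual int header_size(const Blob*) const { return 16; }
  };

  explicit Blob(Kind k) : _kind(k) {}
  int header_size() const;  // routed through vptr_of(_kind)

  Kind _kind;
};

struct Nmethod : Blob {
  struct Vptr : Blob::Vptr {
    int header_size(const Blob*) const override { return 64; }
  };
  Nmethod() : Blob(kNmethod) {}
};

// Mirrors the PR's static asserts guarding against someone
// reintroducing virtual methods on the blob classes themselves.
static_assert(!std::is_polymorphic<Blob>::value, "Blob must stay non-virtual");
static_assert(!std::is_polymorphic<Nmethod>::value, "Nmethod must stay non-virtual");

static Blob::Vptr blob_vptr;
static Nmethod::Vptr nmethod_vptr;

static const Blob::Vptr* vptr_of(Blob::Kind k) {
  return k == Blob::kNmethod ? &nmethod_vptr : &blob_vptr;
}

int Blob::header_size() const { return vptr_of(_kind)->header_size(this); }
```

Dispatch cost stays comparable to a virtual call, but the indirection now goes through ordinary static data selected by `_kind`, so a cached blob image needs no hidden-pointer fix-up.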
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23533#discussion_r1958673128 From dnsimon at openjdk.org Mon Feb 17 19:37:24 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Mon, 17 Feb 2025 19:37:24 GMT Subject: RFR: 8346781: [JVMCI] Limit ServiceLoader to class initializers [v2] In-Reply-To: References: Message-ID: <0elzblvKiIjGRnZiBSPjStJpDMTPJyXObkHwVuStSJg=.8ac2fd8e-d38c-42de-a1fa-c94eac144a73@github.com> On Mon, 17 Feb 2025 16:12:41 GMT, Doug Simon wrote: >> In the context of libgraal, the current use of ServiceLoader in JVMCI is problematic as libgraal does all class loading at image build time. There are static fields such as `JVMCIServiceLocator.cachedLocators` that need to be initialized [via reflection](https://github.com/oracle/graal/blob/30492c3f7847a13ae7f8dc50663a5a039e49a8e7/compiler/src/jdk.graal.compiler/src/jdk/graal/compiler/hotspot/libgraal/BuildTime.java#L175-L180) when building libgraal. >> >> This PR removes the need for such reflection by moving all use of ServiceLoader in JVMCI into `<clinit>` methods. These methods are executed when building libgraal. It also removes a few other public methods and fields that are no longer used by Graal. Given that JVMCI only has qualified exports to Graal, a CSR is not needed. > > Doug Simon has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Merge remote-tracking branch 'openjdk-jdk/master' into JDK-8346781 > - remove non-native-image build time use of ServiceLoader > - make Cleaner.clean public Thanks for the reviews.
------------- PR Comment: https://git.openjdk.org/jdk/pull/22869#issuecomment-2663944221 From dnsimon at openjdk.org Mon Feb 17 19:37:24 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Mon, 17 Feb 2025 19:37:24 GMT Subject: Integrated: 8346781: [JVMCI] Limit ServiceLoader to class initializers In-Reply-To: References: Message-ID: On Mon, 23 Dec 2024 17:58:23 GMT, Doug Simon wrote: > In the context of libgraal, the current use of ServiceLoader in JVMCI is problematic as libgraal does all class loading at image build time. There are static fields such as `JVMCIServiceLocator.cachedLocators` that need to be initialized [via reflection](https://github.com/oracle/graal/blob/30492c3f7847a13ae7f8dc50663a5a039e49a8e7/compiler/src/jdk.graal.compiler/src/jdk/graal/compiler/hotspot/libgraal/BuildTime.java#L175-L180) when building libgraal. > > This PR removes the need for such reflection by moving all use of ServiceLoader in JVMCI into `<clinit>` methods. These methods are executed when building libgraal. It also removes a few other public methods and fields that are no longer used by Graal. Given that JVMCI only has qualified exports to Graal, a CSR is not needed. This pull request has now been integrated.
Changeset: 8ec58939 Author: Doug Simon URL: https://git.openjdk.org/jdk/commit/8ec589390f7dc67dd883a1efddb8da32790f6591 Stats: 166 lines in 7 files changed: 10 ins; 126 del; 30 mod 8346781: [JVMCI] Limit ServiceLoader to class initializers Reviewed-by: never, yzheng ------------- PR: https://git.openjdk.org/jdk/pull/22869 From jwaters at openjdk.org Tue Feb 18 02:39:25 2025 From: jwaters at openjdk.org (Julian Waters) Date: Tue, 18 Feb 2025 02:39:25 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v18] In-Reply-To: References: Message-ID: On Tue, 11 Feb 2025 06:32:56 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch adds C2 compiler support for various Float16 operations added by [PR#22128](https://github.com/openjdk/jdk/pull/22128) >> >> Following is the summary of changes included with this patch:- >> >> 1. Detection of various Float16 operations through inline expansion or pattern folding idealizations. >> 2. Float16 operations like add, sub, mul, div, max, and min are inferred through pattern folding idealization. >> 3. Float16 SQRT and FMA operation are inferred through inline expansion and their corresponding entry points are defined in the newly added Float16Math class. >> - These intrinsics receive unwrapped short arguments encoding IEEE 754 binary16 values. >> 5. New specialized IR nodes for Float16 operations, associated idealizations, and constant folding routines. >> 6. New Ideal type for constant and non-constant Float16 IR nodes. Please refer to [FAQs ](https://github.com/openjdk/jdk/pull/22754#issuecomment-2543982577)for more details. >> 7. Since Float16 uses short as its storage type, hence raw FP16 values are always loaded into general purpose register, but FP16 ISA generally operates over floating point registers, thus the compiler injects reinterpretation IR before and after Float16 operation nodes to move short value to floating point register and vice versa. >> 8. 
New idealization routines to optimize redundant reinterpretation chains. HF2S + S2HF = HF >> 9. X86 backend implementation for all supported intrinsics. >> 10. Functional and Performance validation tests. >> >> Kindly review the patch and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Review comments resolutions Is anyone else getting compile failures after this was integrated? This weirdly seems to only happen on Linux * For target hotspot_variant-server_libjvm_objs_mulnode.o: /home/runner/work/jdk/jdk/src/hotspot/share/opto/mulnode.cpp: In member function ‘virtual const Type* FmaHFNode::Value(PhaseGVN*) const’: /home/runner/work/jdk/jdk/src/hotspot/share/opto/mulnode.cpp:1944:37: error: call of overloaded ‘make(double)’ is ambiguous 1944 | return TypeH::make(fma(f1, f2, f3)); | ^ In file included from /home/runner/work/jdk/jdk/src/hotspot/share/opto/node.hpp:31, from /home/runner/work/jdk/jdk/src/hotspot/share/opto/addnode.hpp:28, from /home/runner/work/jdk/jdk/src/hotspot/share/opto/mulnode.cpp:26: /home/runner/work/jdk/jdk/src/hotspot/share/opto/type.hpp:544:23: note: candidate: ‘static const TypeH* TypeH::make(float)’ 544 | static const TypeH* make(float f); | ^~~~ /home/runner/work/jdk/jdk/src/hotspot/share/opto/type.hpp:545:23: note: candidate: ‘static const TypeH* TypeH::make(short int)’
545 | static const TypeH* make(short f); | ^~~~ ------------- PR Comment: https://git.openjdk.org/jdk/pull/22754#issuecomment-2664473623 From cjplummer at openjdk.org Tue Feb 18 03:05:16 2025 From: cjplummer at openjdk.org (Chris Plummer) Date: Tue, 18 Feb 2025 03:05:16 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v10] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Sat, 15 Feb 2025 06:34:56 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in the future. >> >> Fixed/cleaned SA code which processes CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. >> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Remove commented lines left by mistake SA changes look good. Thanks for taking care of this. ------------- Marked as reviewed by cjplummer (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23533#pullrequestreview-2622331256 From galder at openjdk.org Tue Feb 18 08:04:21 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Tue, 18 Feb 2025 08:04:21 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> Message-ID: On Fri, 7 Feb 2025 12:39:24 GMT, Galder Zamarreño wrote: >> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance.
>> >> Currently vectorization does not kick in for loops containing either of these calls because of the following error: >> >> >> VLoop::check_preconditions: failed: control flow in loop not allowed >> >> >> The control flow is due to the java implementation for these methods, e.g. >> >> >> public static long max(long a, long b) { >> return (a >= b) ? a : b; >> } >> >> >> This patch intrinsifies the calls to replace the CmpL + Bool nodes for MaxL/MinL nodes respectively. >> By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization. >> E.g. >> >> >> SuperWord::transform_loop: >> Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt strip_mined >> 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21) >> >> >> Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after. Before the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1155 >> long max 1173 >> >> >> After the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1042 >> long max 1042 >> >> >> This patch does not add platform-specific backend implementations for the MaxL/MinL nodes. >> Therefore, it still relies on the macro expansion to transform those into CMoveL.
>> >> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results: >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PA... > Galder Zamarreño has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 44 additional commits since the last revision: > > - Merge branch 'master' into topic.intrinsify-max-min-long > - Fix typo > - Renaming methods and variables and add docu on algorithms > - Fix copyright years > - Make sure it runs with cpus with either avx512 or asimd > - Test can only run with 256 bit registers or bigger > > * Remove platform dependant check > and use platform independent configuration instead. > - Fix license header > - Tests should also run on aarch64 asimd=true envs > - Added comment around the assertions > - Adjust min/max identity IR test expectations after changes > - ... and 34 more: https://git.openjdk.org/jdk/compare/6ad0c61a...a190ae68 What is happening with int min/max needs a separate investigation because based on my testing, the int min/max intrinsic is both a regression and a performance improvement! Check this out: make test TEST="micro:org.openjdk.bench.java.lang.MinMaxVector.intReductionSimpleMax" MICRO="FORK=1" Benchmark (probability) (size) Mode Cnt Score Error Units MinMaxVector.intReductionSimpleMax 50 2048 thrpt 4 460.585 ± 0.348 ops/ms MinMaxVector.intReductionSimpleMax 80 2048 thrpt 4 460.633 ± 0.103 ops/ms MinMaxVector.intReductionSimpleMax 100 2048 thrpt 4 460.580 ±
0.091 ops/ms make test TEST="micro:org.openjdk.bench.java.lang.MinMaxVector.intReductionSimpleMax" MICRO="FORK=1;OPTIONS=-jvmArgs -XX:CompileCommand=option,org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_intReductionSimpleMax_jmhTest::intReductionSimpleMax_thrpt_jmhStub,ccstrlist,DisableIntrinsic,_max" Benchmark (probability) (size) Mode Cnt Score Error Units MinMaxVector.intReductionSimpleMax 50 2048 thrpt 4 460.479 ± 0.044 ops/ms MinMaxVector.intReductionSimpleMax 80 2048 thrpt 4 460.587 ± 0.106 ops/ms MinMaxVector.intReductionSimpleMax 100 2048 thrpt 4 1027.831 ± 9.353 ops/ms 80%: ?? ? 0x00007ffb200fa089: cmpl %r11d, %r10d 3.04% ?? ? 0x00007ffb200fa08c: cmovll %r11d, %r10d 4.38% ?? ? 0x00007ffb200fa090: cmpl %ebx, %r10d 1.61% ?? ? 0x00007ffb200fa093: cmovll %ebx, %r10d 2.79% ?? ? 0x00007ffb200fa097: cmpl %edi, %r10d 2.92% ?? ? 0x00007ffb200fa09a: cmovll %edi, %r10d ;*ireturn {reexecute=0 rethrow=0 return_oop=0} ?? ? ; - java.lang.Math::max at 10 (line 2023) ?? ? ; - org.openjdk.bench.java.lang.MinMaxVector::intReductionSimpleMax at 23 (line 232) 100%: 3.11% ??????? ?????? ? 0x00007f26c00f8f9c: nopl (%rax) 3.31% ??????? ?????? ? 0x00007f26c00f8fa0: cmpl %r10d, %ecx ???????? ?????? ? 0x00007f26c00f8fa3: jge 0x7f26c00f8ff1 ;*ireturn {reexecute=0 rethrow=0 return_oop=0} ???????? ?????? ? ; - java.lang.Math::max at 10 (line 2023) ???????? ?????? ? ; - org.openjdk.bench.java.lang.MinMaxVector::intReductionSimpleMax at 23 (line 232) ???????? ?????? ? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_intReductionSimpleMax_jmhTest::intReductionSimpleMax_thrpt_jmhStub at 19 (line 124) make test TEST="micro:org.openjdk.bench.java.lang.MinMaxVector.intReductionMultiplyMax" MICRO="FORK=1" Benchmark (probability) (size) Mode Cnt Score Error Units MinMaxVector.intReductionMultiplyMax 50 2048 thrpt 4 2815.614 ± 0.406 ops/ms MinMaxVector.intReductionMultiplyMax 80 2048 thrpt 4 2814.943 ±
2.174 ops/ms MinMaxVector.intReductionMultiplyMax 100 2048 thrpt 4 2815.285 ± 1.725 ops/ms make test TEST="micro:org.openjdk.bench.java.lang.MinMaxVector.intReductionMultiplyMax" MICRO="FORK=1;OPTIONS=-jvmArgs -XX:CompileCommand=option,org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_intReductionMultiplyMax_jmhTest::intReductionMultiplyMax_thrpt_jmhStub,ccstrlist,DisableIntrinsic,_max" Benchmark (probability) (size) Mode Cnt Score Error Units MinMaxVector.intReductionMultiplyMax 50 2048 thrpt 4 2802.062 ± 0.710 ops/ms MinMaxVector.intReductionMultiplyMax 80 2048 thrpt 4 2814.874 ± 4.058 ops/ms MinMaxVector.intReductionMultiplyMax 100 2048 thrpt 4 883.879 ± 0.327 ops/ms 80%: 3.54% ? ?? ????? 0x00007faa700fa177: vpmaxsd %ymm4, %ymm5, %ymm13;*ireturn {reexecute=0 rethrow=0 return_oop=0} ? ?? ????? ; - java.lang.Math::max at 10 (line 2023) 100: 7.50% ??????????????????? ? 0x00007f75280f8849: imull $0xb, 0x2c(%rbp, %r11, 4), %r10d ??????????????????? ? ;*imul {reexecute=0 rethrow=0 return_oop=0} ??????????????????? ? ; - org.openjdk.bench.java.lang.MinMaxVector::intReductionMultiplyMax at 20 (line 221) ??????????????????? ? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_intReductionMultiplyMax_jmhTest::intReductionMultiplyMax_thrpt_jmhStub at 19 (line 124) 3.85% ??????????????????? ? 0x00007f75280f884f: cmpl %r10d, %r8d ??????????????????? ? 0x00007f75280f8852: jl 0x7f75280f87d0 ;*if_icmplt {reexecute=0 rethrow=0 return_oop=0} ?????????? ???????? ? ; - java.lang.Math::max at 2 (line 2023) ?????????? ???????? ? ; - org.openjdk.bench.java.lang.MinMaxVector::intReductionMultiplyMax at 26 (line 222) ?????????? ???????? ? ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_intReductionMultiplyMax_jmhTest::intReductionMultiplyMax_thrpt_jmhStub at 19 (line 124) I ran the exact same test with longs and I don't see such an issue. The performance is always the same either with the intrinsic or disabling it as shown above.
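[Editor's note: one way to see why the 100% case behaves so differently from the 50/80% cases is the branch-frequency argument raised elsewhere in this thread: for data in random order the "new maximum found" branch of a reduction is taken only about H(n) = 1 + 1/2 + ... + 1/n times in n iterations, so it is almost never taken and trivially predictable, while ascending data takes it every time. The following standalone C++ sketch is illustrative only and unrelated to the JMH benchmark code.]

```cpp
#include <cassert>
#include <cstdlib>

// Counts how often the "update the running maximum" branch is taken.
// For random data the count grows only like the harmonic number H(n),
// so the branch is almost always not-taken; for ascending data it is
// taken on every iteration (the 100% case in the benchmarks above).
static int count_max_updates(const int* a, int n) {
  int m = a[0];
  int updates = 0;
  for (int i = 1; i < n; i++) {
    if (a[i] > m) {  // branchy max; a cmov would instead always pay the compare latency
      m = a[i];
      updates++;
    }
  }
  return updates;
}
```

This is why a single compile-time choice between cmov and branch can be a win on one input distribution and a regression on another.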
------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2664871838 From galder at openjdk.org Tue Feb 18 08:16:19 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Tue, 18 Feb 2025 08:16:19 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11] In-Reply-To: <3ArmrOQcUoj8DhHTq1a40Oz3GE8bCDDy3FFeVgbladg=.b8e0e13b-39f3-41a6-8a1b-5ca4febb4a41@github.com> References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com> <5oGMaD5b87inAMkco6l5ODRvWv7FRsHGJiu_UMrGrTc=.0be44429-d322-4a6f-b91d-b64a146fad05@github.com> <3ArmrOQcUoj8DhHTq1a40Oz3GE8bCDDy3FFeVgbladg=.b8e0e13b-39f3-41a6-8a1b-5ca4febb4a41@github.com> Message-ID: On Mon, 17 Feb 2025 15:02:32 GMT, Roland Westrelin wrote: >> @rwestrel @galderz >> >>> It seems overall, we likely win more than we lose with this intrinsic, so I would integrate this change as it is and file a bug to keep track of remaining issues. >> >> I'm a little scared to just accept the regressions, especially for this "most average looking case": >> Imagine you have an array with random numbers. Or at least numbers in a random order. If we take the max, then we expect the first number to be max with probability 1, the second 1/2, the third 1/3, the i'th 1/i. So the average branch probability is `(sum_i 1/i) / n`. This goes closer and closer to zero, the larger the array. This means that the "average" case has an extreme probability. And so if we do not vectorize, then this gets us a regression with the current patch. And vectorization is a little fragile, it only takes very little for vectorization not to kick in. >> >>> The Min/Max nodes are floating nodes.
>> >> I suppose we could write an optimization that can hoist out loop independent if-diamonds out of a loop. If the condition and all phi inputs are loop invariant, you could just cut the diamond out of the loop, and paste it before the loop entry. >> >>> Shouldn't int min/max be affected the same way? >> >> I think we should be able to see the same issue here, actually. Yes. Here a quick benchmark below: >> >> >> java -XX:CompileCommand=compileonly,TestIntMax::test* -XX:CompileCommand=printcompilation,TestIntMax::test* -XX:+TraceNewVectors TestIntMax.java >> CompileCommand: compileonly TestIntMax.test* bool compileonly = true >> CompileCommand: PrintCompilation TestIntMax.test* bool PrintCompilation = true >> Warmup >> 5225 93 % 3 TestIntMax::test1 @ 5 (27 bytes) >> 5226 94 3 TestIntMax::test1 (27 bytes) >> 5226 95 % 4 TestIntMax::test1 @ 5 (27 bytes) >> 5238 96 4 TestIntMax::test1 (27 bytes) >> Run >> Time: 542056319 >> Warmup >> 6320 101 % 3 TestIntMax::test2 @ 5 (34 bytes) >> 6322 102 % 4 TestIntMax::test2 @ 5 (34 bytes) >> 6329 103 4 TestIntMax::test2 (34 bytes) >> Run >> Time: 166815209 >> >> That's a 4x regression on random input data! >> >> With: >> >> import java.util.Random; >> >> public class TestIntMax { >> private static Random RANDOM = new Random(); >> >> public static void main(String[] args) { >> int[] a = new int[64 * 1024]; >> for (int i = 0; i < a.length; i++) { >>... > >> I think we should be able to see the same issue here, actually. Yes. 
Here a quick benchmark below: > > I observe the same: > > > Warmup > 751 3 b TestIntMax::test1 (27 bytes) > Run > Time: 360 550 158 > Warmup > 1862 15 b TestIntMax::test2 (34 bytes) > Run > Time: 92 116 170 > > > But then with this: > > > diff --git a/src/hotspot/cpu/x86/x86_64.ad b/src/hotspot/cpu/x86/x86_64.ad > index 8cc4a970bfd..9abda8f4178 100644 > --- a/src/hotspot/cpu/x86/x86_64.ad > +++ b/src/hotspot/cpu/x86/x86_64.ad > @@ -12037,16 +12037,20 @@ instruct cmovI_reg_l(rRegI dst, rRegI src, rFlagsReg cr) > %} > > > -instruct maxI_rReg(rRegI dst, rRegI src) > +instruct maxI_rReg(rRegI dst, rRegI src, rFlagsReg cr) > %{ > match(Set dst (MaxI dst src)); > + effect(KILL cr); > > ins_cost(200); > - expand %{ > - rFlagsReg cr; > - compI_rReg(cr, dst, src); > - cmovI_reg_l(dst, src, cr); > + ins_encode %{ > + Label done; > + __ cmpl($src$$Register, $dst$$Register); > + __ jccb(Assembler::less, done); > + __ mov($dst$$Register, $src$$Register); > + __ bind(done); > %} > + ins_pipe(pipe_cmov_reg); > %} > > // ============================================================================ > > > the performance gap narrows: > > > Warmup > 770 3 b TestIntMax::test1 (27 bytes) > Run > Time: 94 951 677 > Warmup > 1312 15 b TestIntMax::test2 (34 bytes) > Run > Time: 70 053 824 > > > (the number of test2 fluctuates quite a bit). Does it ever make sense to implement `MaxI` with a conditional move then? Note something I spoke with @rwestrel yesterday in the context of long min/max vs int min/max. Int has an ad implementation for min/max whereas long does not. My very first prototype of this issue was to mimic what int did with long, but talking to @rwestrel we decided it would be better to implement this without introducing platform specific changes. So, following Roland's thread in https://github.com/openjdk/jdk/pull/20098#issuecomment-2663379660, I could add ad changes for say x86 and aarch64 for long such that it uses branch instead of cmov.
Note that the cmov fallback of long min/max comes from macro expansion, not platform specific changes. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2664893516 From galder at openjdk.org Tue Feb 18 08:20:15 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Tue, 18 Feb 2025 08:20:15 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11] In-Reply-To: <3ArmrOQcUoj8DhHTq1a40Oz3GE8bCDDy3FFeVgbladg=.b8e0e13b-39f3-41a6-8a1b-5ca4febb4a41@github.com> References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com> <5oGMaD5b87inAMkco6l5ODRvWv7FRsHGJiu_UMrGrTc=.0be44429-d322-4a6f-b91d-b64a146fad05@github.com> <3ArmrOQcUoj8DhHTq1a40Oz3GE8bCDDy3FFeVgbladg=.b8e0e13b-39f3-41a6-8a1b-5ca4febb4a41@github.com> Message-ID: On Mon, 17 Feb 2025 15:02:32 GMT, Roland Westrelin wrote: >> @rwestrel @galderz >> >>> It seems overall, we likely win more than we lose with this intrinsic, so I would integrate this change as it is and file a bug to keep track of remaining issues. >> >> I'm a little scared to just accept the regressions, especially for this "most average looking case": >> Imagine you have an array with random numbers. Or at least numbers in a random order. If we take the max, then we expect the first number to be max with probability 1, the second 1/2, the third 1/3, the i'th 1/i. So the average branch probability is `(sum_i 1/i) / n`. This goes closer and closer to zero, the larger the array. This means that the "average" case has an extreme probability. And so if we do not vectorize, then this gets us a regression with the current patch. And vectorization is a little fragile, it only takes very little for vectorization not to kick in. >> >>> The Min/Max nodes are floating nodes.
They can hoist out of loop and common reliably in ways that are not guaranteed otherwise. >> >> I suppose we could write an optimization that can hoist out loop independent if-diamonds out of a loop. If the condition and all phi inputs are loop invariant, you could just cut the diamond out of the loop, and paste it before the loop entry. >> >>> Shouldn't int min/max be affected the same way? >> >> I think we should be able to see the same issue here, actually. Yes. Here a quick benchmark below: >> >> >> java -XX:CompileCommand=compileonly,TestIntMax::test* -XX:CompileCommand=printcompilation,TestIntMax::test* -XX:+TraceNewVectors TestIntMax.java >> CompileCommand: compileonly TestIntMax.test* bool compileonly = true >> CompileCommand: PrintCompilation TestIntMax.test* bool PrintCompilation = true >> Warmup >> 5225 93 % 3 TestIntMax::test1 @ 5 (27 bytes) >> 5226 94 3 TestIntMax::test1 (27 bytes) >> 5226 95 % 4 TestIntMax::test1 @ 5 (27 bytes) >> 5238 96 4 TestIntMax::test1 (27 bytes) >> Run >> Time: 542056319 >> Warmup >> 6320 101 % 3 TestIntMax::test2 @ 5 (34 bytes) >> 6322 102 % 4 TestIntMax::test2 @ 5 (34 bytes) >> 6329 103 4 TestIntMax::test2 (34 bytes) >> Run >> Time: 166815209 >> >> That's a 4x regression on random input data! >> >> With: >> >> import java.util.Random; >> >> public class TestIntMax { >> private static Random RANDOM = new Random(); >> >> public static void main(String[] args) { >> int[] a = new int[64 * 1024]; >> for (int i = 0; i < a.length; i++) { >>... > >> I think we should be able to see the same issue here, actually. Yes. 
Here a quick benchmark below: > > I observe the same: > > > Warmup > 751 3 b TestIntMax::test1 (27 bytes) > Run > Time: 360 550 158 > Warmup > 1862 15 b TestIntMax::test2 (34 bytes) > Run > Time: 92 116 170 > > > But then with this: > > > diff --git a/src/hotspot/cpu/x86/x86_64.ad b/src/hotspot/cpu/x86/x86_64.ad > index 8cc4a970bfd..9abda8f4178 100644 > --- a/src/hotspot/cpu/x86/x86_64.ad > +++ b/src/hotspot/cpu/x86/x86_64.ad > @@ -12037,16 +12037,20 @@ instruct cmovI_reg_l(rRegI dst, rRegI src, rFlagsReg cr) > %} > > > -instruct maxI_rReg(rRegI dst, rRegI src) > +instruct maxI_rReg(rRegI dst, rRegI src, rFlagsReg cr) > %{ > match(Set dst (MaxI dst src)); > + effect(KILL cr); > > ins_cost(200); > - expand %{ > - rFlagsReg cr; > - compI_rReg(cr, dst, src); > - cmovI_reg_l(dst, src, cr); > + ins_encode %{ > + Label done; > + __ cmpl($src$$Register, $dst$$Register); > + __ jccb(Assembler::less, done); > + __ mov($dst$$Register, $src$$Register); > + __ bind(done); > %} > + ins_pipe(pipe_cmov_reg); > %} > > // ============================================================================ > > > the performance gap narrows: > > > Warmup > 770 3 b TestIntMax::test1 (27 bytes) > Run > Time: 94 951 677 > Warmup > 1312 15 b TestIntMax::test2 (34 bytes) > Run > Time: 70 053 824 > > > (the number of test2 fluctuates quite a bit). Does it ever make sense to implement `MaxI` with a conditional move then? To make it more explicit: implementing long min/max in ad files as cmp will likely remove all the 100% regressions that are observed here. I'm going to repeat the same MinMaxVector int min/max reduction test above with the ad changes @rwestrel suggested to see what effect they have. 
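[Editor's note: the two scalar shapes being compared here can be written out as plain C++. This is an illustrative rendition of what the encodings compute, not the .ad code itself: one is the cmov-style data-dependent form, the other is the compare-and-short-branch form from the `maxI_rReg` patch quoted above.]

```cpp
#include <cassert>

// cmov-style lowering: no control flow, so nothing to mispredict, but
// the result always waits on the compare (cmpl + cmovll).
static int maxI_cmov(int dst, int src) {
  return dst < src ? src : dst;  // compilers commonly emit a cmov for this shape
}

// branch-style lowering, mirroring the patched maxI_rReg encoding:
// cmpl src, dst; jccb less, done; mov dst, src; done:
static int maxI_branch(int dst, int src) {
  if (!(src < dst)) {  // jccb(Assembler::less, done) skips the move when src < dst
    dst = src;         // mov dst, src
  }
  return dst;
}
```

Which form wins depends on predictability: the branch form is cheap when the taken probability is extreme (almost always or almost never), while the cmov form is immune to the mispredictions that dominate the unpredictable middle range.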
------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2664903731 From epeter at openjdk.org Tue Feb 18 08:46:17 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 18 Feb 2025 08:46:17 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com> <5oGMaD5b87inAMkco6l5ODRvWv7FRsHGJiu_UMrGrTc=.0be44429-d322-4a6f-b91d-b64a146fad05@github.com> <3ArmrOQcUoj8DhHTq1a40Oz3GE8bCDDy3FFeVgbladg=.b8e0e13b-39f3-41a6-8a1b-5ca4febb4a41@github.com> Message-ID: On Tue, 18 Feb 2025 08:17:59 GMT, Galder Zamarreño wrote: >>> I think we should be able to see the same issue here, actually. Yes. Here a quick benchmark below: >> >> I observe the same: >> >> >> Warmup >> 751 3 b TestIntMax::test1 (27 bytes) >> Run >> Time: 360 550 158 >> Warmup >> 1862 15 b TestIntMax::test2 (34 bytes) >> Run >> Time: 92 116 170 >> >> >> But then with this: >> >> >> diff --git a/src/hotspot/cpu/x86/x86_64.ad b/src/hotspot/cpu/x86/x86_64.ad >> index 8cc4a970bfd..9abda8f4178 100644 >> --- a/src/hotspot/cpu/x86/x86_64.ad >> +++ b/src/hotspot/cpu/x86/x86_64.ad >> @@ -12037,16 +12037,20 @@ instruct cmovI_reg_l(rRegI dst, rRegI src, rFlagsReg cr) >> %} >> >> >> -instruct maxI_rReg(rRegI dst, rRegI src) >> +instruct maxI_rReg(rRegI dst, rRegI src, rFlagsReg cr) >> %{ >> match(Set dst (MaxI dst src)); >> + effect(KILL cr); >> >> ins_cost(200); >> - expand %{ >> - rFlagsReg cr; >> - compI_rReg(cr, dst, src); >> - cmovI_reg_l(dst, src, cr); >> + ins_encode %{ >> + Label done; >> + __ cmpl($src$$Register, $dst$$Register); >> + __ jccb(Assembler::less, done); >> + __ mov($dst$$Register, $src$$Register); >> + __ bind(done); >> %} >> + ins_pipe(pipe_cmov_reg); >> %} >> >> //
============================================================================ >> >> >> the performance gap narrows: >> >> >> Warmup >> 770 3 b TestIntMax::test1 (27 bytes) >> Run >> Time: 94 951 677 >> Warmup >> 1312 15 b TestIntMax::test2 (34 bytes) >> Run >> Time: 70 053 824 >> >> >> (the number of test2 fluctuates quite a bit). Does it ever make sense to implement `MaxI` with a conditional move then? > > To make it more explicit: implementing long min/max in ad files as cmp will likely remove all the 100% regressions that are observed here. I'm going to repeat the same MinMaxVector int min/max reduction test above with the ad changes @rwestrel suggested to see what effect they have. @galderz I think we will have the same issue with both `int` and `long`: As far as I know, it is really a difficult problem to decide at compile-time if a `cmove` or `branch` is the better choice. I'm not sure there is any heuristic for which you will not find a micro-benchmark where the heuristic made the wrong choice. To my understanding, these are the factors that impact the performance: - `cmove` requires all inputs to complete before it can execute, and it has an inherent latency of a cycle or so itself. But you cannot have any branch mispredictions, and hence no branch misprediction penalties (i.e. when the CPU has to flush out the ops from the wrong branch and restart at the branch). - `branch` can hide some latencies, because we can already continue with the branch that is speculated on. We do not need to wait for the inputs of the comparison to arrive, and we can already continue with the speculated resulting value. But if the speculation is ever wrong, we have to pay the misprediction penalty. In my understanding, there are roughly 3 scenarios: - The branch probability is so extreme that the branch predictor would be correct almost always, and so it is profitable to do branching code. - The branching probability is somewhere in the middle, and the branch is not predictable. 
Branch mispredictions are very expensive, and so it is better to use `cmove`. - The branching probability is somewhere in the middle, but the branch is predictable (e.g. swapps back and forth). The branch predictor will have almost no mispredictions, and it is faster to use branching code. Modeling this precisely is actually a little complex. You would have to know the cost of the `cmove` and the `branching` version of the code. That depends on the latency of the inputs, and the outputs: does the `cmove` dramatically increase the latency on the critical path, and `branching` could hide some of that latency? And you would have to know how good the branch predictor is, which you cannot derive from the branching probability of our profiling (at least not when the probabilities are in the middle, and you don't know if it is a random or predictable pattern). If we can find a perfect heuristic - that would be fantastic ;) If we cannot find a perfect heuristic, then we should think about what are the most "common" or "relevant" scenarios, I think. But let's discuss all of this in a call / offline :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2664956307 From galder at openjdk.org Tue Feb 18 09:24:18 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Tue, 18 Feb 2025 09:24:18 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com> <5oGMaD5b87inAMkco6l5ODRvWv7FRsHGJiu_UMrGrTc=.0be44429-d322-4a6f-b91d-b64a146fad05@github.com> <3ArmrOQcUoj8DhHTq1a40Oz3GE8bCDDy3FF eVgbladg=.b8e0e13b-39f3-41a6-8a1b-5ca4febb4a41@github.com> Message-ID: On Tue, 18 Feb 2025 08:43:38 GMT, Emanuel Peter wrote: > But let's discuss all of this in a call / offline :) Yup. 
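The cmove-versus-branch trade-off discussed above can be made concrete with a small illustrative Java sketch (this is not code from the patch): a branchy max and a branch-free max that compute the same result. Whether C2 actually emits a jump or a cmov for either shape depends on profiling and the matcher, so this only models the two code shapes being compared.

```java
class MaxSketch {
    // Branchy form: the ternary is profiled and may become a jump or a cmov.
    static int maxBranchy(int a, int b) {
        return a > b ? a : b;
    }

    // Branch-free form: always executes the same instructions; like a cmov
    // it has a fixed latency and no misprediction penalty.
    static int maxBranchless(int a, int b) {
        long d = (long) a - b;                // widen so a - b cannot overflow
        return (int) (b + (d & ~(d >> 63)));  // add d only when d > 0, else add 0
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 7, -7, Integer.MIN_VALUE, Integer.MAX_VALUE};
        for (int a : samples) {
            for (int b : samples) {
                if (maxBranchy(a, b) != Math.max(a, b)) throw new AssertionError();
                if (maxBranchless(a, b) != Math.max(a, b)) throw new AssertionError();
            }
        }
        System.out.println("ok");
    }
}
```

On predictable data both forms tend to run well; on random data near 50% probability the branchy form pays misprediction penalties while the branch-free form does not, which is the heuristic problem described above.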
> I ran the exact same test with longs and I don't see such an issue. The performance is always the same either with the intrinsic or disabling it as shown above.

For the equivalent long tests I think I made a mistake in the id of the disabled intrinsic, it should be `_maxL` and not `_max`. I will repeat the tests and post if any similar differences are observed.

------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2665045881

From roland at openjdk.org Tue Feb 18 09:35:16 2025
From: roland at openjdk.org (Roland Westrelin)
Date: Tue, 18 Feb 2025 09:35:16 GMT
Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory
In-Reply-To: 
References: 
Message-ID: 

On Mon, 11 Nov 2024 14:40:09 GMT, Emanuel Peter wrote:

> Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below.
>
> **Background**
>
> With `-XX:+AlignVector`, all vector loads/stores must be aligned. We try to statically determine if we can always align the vectors. One condition is that the address `base` is already aligned. For arrays, we know that this always holds, because they are `ObjectAlignmentInBytes` aligned. But with native memory, the `base` is just some arbitrarily aligned pointer.
>
> **Problem**
>
> So far, we have just naively assumed that the `base` is always `ObjectAlignmentInBytes` aligned. But that does not hold for `native` memory segments: the `base` can also be unaligned. I had constructed such an example, and with `-XX:+AlignVector -XX:+VerifyAlignVector` this example hits the verification code.
>
>
> MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1);
> MemorySegment nativeUnaligned = nativeAligned.asSlice(1);
> test3(nativeUnaligned);
>
>
> When compiling the test method, we assume that the `nativeUnaligned.address()` is aligned - but it is not!
>
>
> static void test3(MemorySegment ms) {
>     for (int i = 0; i < RANGE; i++) {
>         long adr = i * 4L;
>         int v = ms.get(ELEMENT_LAYOUT, adr);
>         ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1));
>     }
> }
>
>
> **Solution: Runtime Checks - Predicate and Multiversioning**
>
> Of course we could just forbid cases where we have a `native` base from vectorizing. But that would lead to regressions currently - in most cases we do get aligned `base`s, and we currently vectorize those. We cannot statically determine if the `base` is aligned, we need a runtime check.
>
> I came up with 2 options where to place the runtime checks:
> - A new "auto vectorization" Parse Predicate:
>   - This only works when predicates are available.
>   - If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop.
> - Multiversion the loop:
>   - Create 2 copies of the loop (fast and slow loops).
>   - The `fast_loop` can make speculative alignment assumptions, and add the corresponding check to the `multiversion_if` which decides which loop we take
>   - In the `slow_loop`, we make no assumption which means we can not vectorize, but we still compile - so even unaligned `base`s would end up with reasonably fast code.
>   - We "stall" the `...

Would it make sense to add verification code that makes sure that whenever a loop is flagged as multi version, c2 can find the multi version guard (and maybe whenever there's a multi version guard, loops that are guarded are indeed flagged correctly)?

src/hotspot/share/opto/loopTransform.cpp line 751:

> 749: // Peeling also destroys the connection of the main loop
> 750: // to the multiversion_if.
> 751: cl->set_no_multiversion();

Would we want to change the multiversion guard at this point so it constant folds and the slow version is removed?
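The multiversioning shape described in the PR text above can be modeled at the source level roughly like this (a toy sketch with invented names - the real `multiversion_if` and the two loop copies are generated inside C2, never written by the user):

```java
class MultiversionModel {
    static final int VECTOR_ALIGNMENT = 16; // assumed alignment in bytes, invented for the sketch

    // Fast copy: only entered when the speculative alignment check holds,
    // so a vectorizer could assume an aligned base here.
    static long sumFast(int[] data) {
        long s = 0;
        for (int v : data) s += v;
        return s;
    }

    // Slow copy: no assumptions, still compiled, just not vectorized.
    static long sumSlow(int[] data) {
        long s = 0;
        for (int v : data) s += v;
        return s;
    }

    // Models the multiversion_if: a runtime check on the base address
    // decides which copy of the loop runs.
    static long sum(long baseAddress, int[] data) {
        return (baseAddress % VECTOR_ALIGNMENT == 0)
                ? sumFast(data)
                : sumSlow(data);
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4};
        System.out.println(sum(64, data)); // aligned base  -> fast copy, prints 10
        System.out.println(sum(65, data)); // unaligned base -> slow copy, prints 10
    }
}
```

Both copies compute the same result; the point of the shape is that the unaligned case still runs reasonably fast compiled code instead of deoptimizing.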
src/hotspot/share/opto/loopUnswitch.cpp line 513: > 511: > 512: // Create new Region. > 513: RegionNode* region = new RegionNode(1); So we create a new `Region` every time a new condition is added? src/hotspot/share/opto/loopnode.cpp line 1097: > 1095: // PhaseIdealLoop::add_parse_predicate only checks trap limits per method, so > 1096: // we do a custom check here. > 1097: if (!C->too_many_traps(cloned_sfpt->jvms()->method(), cloned_sfpt->jvms()->bci(), Deoptimization::Reason_auto_vectorization_check)) { Isn't that done by `add_parse_predicate`? src/hotspot/share/opto/traceAutoVectorizationTag.hpp line 32: > 30: > 31: #define COMPILER_TRACE_AUTO_VECTORIZATION_TAG(flags) \ > 32: flags(POINTER_PARSING, "Trace VPointer/MemPointer parsing") \ Has anything changed here? I stared at it a few times and couldn't figure out what has. ------------- PR Review: https://git.openjdk.org/jdk/pull/22016#pullrequestreview-2622881581 PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1959338954 PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1959344256 PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1959347164 PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1959349092 From roland at openjdk.org Tue Feb 18 09:48:14 2025 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 18 Feb 2025 09:48:14 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Mon, 17 Feb 2025 15:24:44 GMT, Emanuel Peter wrote: > > Do you intend to use a single deoptimization reason for all vectorization related predicates? (that is when you take care of aliasing, are you going to to use the same reason for aliasing and alignment checks) > > I suppose that is currently what I'm planning. But we could in principle separate them. But I would leave that for later, if there is any desire to do that. 
For now, I think it's ok to just go with a single "auto-vectorization" reason. > > Does that sound reasonable? Yes, it sounds reasonable. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2665104472 From epeter at openjdk.org Tue Feb 18 09:48:16 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 18 Feb 2025 09:48:16 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: <47tXBG3sQGZVEE5Ya2wr46CopmDjy8OClbpqagIsjgA=.6d07b495-4777-4c7e-a3b7-820f100ec2c0@github.com> On Tue, 18 Feb 2025 09:09:15 GMT, Roland Westrelin wrote: >> Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below. >> >> **Background** >> >> With `-XX:+AlignVector`, all vector loads/stores must be aligned. We try to statically determine if we can always align the vectors. One condition is that the address `base` is already aligned. For arrays, we know that this always holds, because they are `ObjectAlignmentInBytes` aligned. But with native memory, the `base` is just some arbitrarily aligned pointer. >> >> **Problem** >> >> So far, we have just naively assumed that the `base` is always `ObjectAlignmentInBytes` aligned. But that does not hold for `native` memory segments: the `base` can also be unaligned. I had constructed such an example, and with `-XX:+AlignVector -XX:+VerifyAlignVector` this example hits the verification code. >> >> >> MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1); >> MemorySegment nativeUnaligned = nativeAligned.asSlice(1); >> test3(nativeUnaligned); >> >> >> When compiling the test method, we assume that the `nativeUnaligned.address()` is aligned - but it is not! 
>> >> static void test3(MemorySegment ms) { >> for (int i = 0; i < RANGE; i++) { >> long adr = i * 4L; >> int v = ms.get(ELEMENT_LAYOUT, adr); >> ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1)); >> } >> } >> >> >> **Solution: Runtime Checks - Predicate and Multiversioning** >> >> Of course we could just forbid cases where we have a `native` base from vectorizing. But that would lead to regressions currently - in most cases we do get aligned `base`s, and we currently vectorize those. We cannot statically determine if the `base` is aligned, we need a runtime check. >> >> I came up with 2 options where to place the runtime checks: >> - A new "auto vectorization" Parse Predicate: >> - This only works when predicates are available. >> - If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop. >> - Multiversion the loop: >> - Create 2 copies of the loop (fast and slow loops). >> - The `fast_loop` can make speculative alignment assumptions, and add the corresponding check to the `multiversion_if` which decides which loop we take >> - In the `slow_loop`, we make no assumption which means we can not vectorize, but we still compile - so even ... > > src/hotspot/share/opto/loopTransform.cpp line 751: > >> 749: // Peeling also destroys the connection of the main loop >> 750: // to the multiversion_if. >> 751: cl->set_no_multiversion(); > > Would we want to change the multiversion guard at this point so it constant folds and the slow version is removed? I suppose we can probably do that. Otherwise, we just have to wait until the `OpaqueMultiversioningNode` constant folds after loop-opts. > src/hotspot/share/opto/loopUnswitch.cpp line 513: > >> 511: >> 512: // Create new Region. >> 513: RegionNode* region = new RegionNode(1); > > So we create a new `Region` every time a new condition is added? Yes. Are you ok with that? 
Or would you prefer if we extended an existing region (is that possible?) and then we'd have 2 cases, one where there is none yet, and one where we'd extend. I think adding one each time is easier, and it would get commoned anyway, right? > src/hotspot/share/opto/traceAutoVectorizationTag.hpp line 32: > >> 30: >> 31: #define COMPILER_TRACE_AUTO_VECTORIZATION_TAG(flags) \ >> 32: flags(POINTER_PARSING, "Trace VPointer/MemPointer parsing") \ > > Has anything changed here? I stared at it a few times and couldn't figure out what has. I added the tag `SPECULATIVE_RUNTIME_CHECKS`. And then had to change alignment for all others ;) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1959397988 PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1959392450 PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1959394676 From epeter at openjdk.org Tue Feb 18 09:51:15 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 18 Feb 2025 09:51:15 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 09:14:28 GMT, Roland Westrelin wrote: >> Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below. >> >> **Background** >> >> With `-XX:+AlignVector`, all vector loads/stores must be aligned. We try to statically determine if we can always align the vectors. One condition is that the address `base` is already aligned. For arrays, we know that this always holds, because they are `ObjectAlignmentInBytes` aligned. But with native memory, the `base` is just some arbitrarily aligned pointer. >> >> **Problem** >> >> So far, we have just naively assumed that the `base` is always `ObjectAlignmentInBytes` aligned. But that does not hold for `native` memory segments: the `base` can also be unaligned. 
I had constructed such an example, and with `-XX:+AlignVector -XX:+VerifyAlignVector` this example hits the verification code. >> >> >> MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1); >> MemorySegment nativeUnaligned = nativeAligned.asSlice(1); >> test3(nativeUnaligned); >> >> >> When compiling the test method, we assume that the `nativeUnaligned.address()` is aligned - but it is not! >> >> static void test3(MemorySegment ms) { >> for (int i = 0; i < RANGE; i++) { >> long adr = i * 4L; >> int v = ms.get(ELEMENT_LAYOUT, adr); >> ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1)); >> } >> } >> >> >> **Solution: Runtime Checks - Predicate and Multiversioning** >> >> Of course we could just forbid cases where we have a `native` base from vectorizing. But that would lead to regressions currently - in most cases we do get aligned `base`s, and we currently vectorize those. We cannot statically determine if the `base` is aligned, we need a runtime check. >> >> I came up with 2 options where to place the runtime checks: >> - A new "auto vectorization" Parse Predicate: >> - This only works when predicates are available. >> - If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop. >> - Multiversion the loop: >> - Create 2 copies of the loop (fast and slow loops). >> - The `fast_loop` can make speculative alignment assumptions, and add the corresponding check to the `multiversion_if` which decides which loop we take >> - In the `slow_loop`, we make no assumption which means we can not vectorize, but we still compile - so even ... > > src/hotspot/share/opto/loopnode.cpp line 1097: > >> 1095: // PhaseIdealLoop::add_parse_predicate only checks trap limits per method, so >> 1096: // we do a custom check here. 
>> 1097: if (!C->too_many_traps(cloned_sfpt->jvms()->method(), cloned_sfpt->jvms()->bci(), Deoptimization::Reason_auto_vectorization_check)) {
>
> Isn't that done by `add_parse_predicate`?

@rwestrel I only see `if (!C->too_many_traps(reason)) {` in `PhaseIdealLoop::add_parse_predicate`. And as the comment I put here says, that only checks the `reason` per `method`, and not per `bci`. Do you see anything else?

------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1959403871

From epeter at openjdk.org Tue Feb 18 09:56:13 2025
From: epeter at openjdk.org (Emanuel Peter)
Date: Tue, 18 Feb 2025 09:56:13 GMT
Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory
In-Reply-To: 
References: 
Message-ID: 

On Tue, 18 Feb 2025 09:32:19 GMT, Roland Westrelin wrote:

> Would it make sense to add verification code that makes sure that whenever a loop is flagged as multi version, c2 can find the multi version guard (and maybe whenever there's a multi version guard, loops that are guarded are indeed flagged correctly)?

I'd have to see if that is possible. Well:

> verification code that makes sure that whenever a loop is flagged as multi version, c2 can find the multi version guard

That is maybe possible. At least I cannot think of a reason why it should not work right now. Well, maybe: what if the predicates get messed up somehow, is that possible? Then you would lose the connection. Ah: what if the pre-loop somehow gets "messed up", i.e. it loses its loop structure? Then we could not really go from the main-loop to the pre-loop to the selector-if any more.

> whenever there's a multi version guard, loops that are guarded are indeed flagged correctly

That one is more tricky. Because what if the loop somehow gets folded away? How would we catch that?
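The per-method versus per-bci trap accounting raised in the review comment above can be modeled with a toy counter (this is not HotSpot code; the class, method names and the limit are invented for the sketch): a per-method limit tolerates several traps for a reason anywhere in the method, while the conservative per-bci rule refuses as soon as a single trap was recorded at that bytecode index.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model (not HotSpot code) contrasting the two checks being discussed.
class TrapModel {
    static final int PER_METHOD_TRAP_LIMIT = 100; // invented value

    private final Map<String, Integer> perMethodCount = new HashMap<>();
    private final Set<String> trappedAtBci = new HashSet<>();

    void recordTrap(String method, int bci, String reason) {
        perMethodCount.merge(method + "/" + reason, 1, Integer::sum);
        trappedAtBci.add(method + "/" + bci + "/" + reason);
    }

    // Method-wide budget: many traps are allowed before giving up.
    boolean tooManyTrapsPerMethod(String method, String reason) {
        return perMethodCount.getOrDefault(method + "/" + reason, 0)
                >= PER_METHOD_TRAP_LIMIT;
    }

    // Conservative per-bci heuristic: one recorded trap is already enough.
    boolean tooManyTrapsAtBci(String method, int bci, String reason) {
        return trappedAtBci.contains(method + "/" + bci + "/" + reason);
    }

    public static void main(String[] args) {
        TrapModel model = new TrapModel();
        model.recordTrap("test3", 42, "auto_vectorization_check");
        // After one trap at bci 42 the per-bci check already refuses there,
        // while the per-method budget is far from exhausted.
        System.out.println(model.tooManyTrapsAtBci("test3", 42, "auto_vectorization_check"));
        System.out.println(model.tooManyTrapsPerMethod("test3", "auto_vectorization_check"));
    }
}
```

This is only meant to show why switching a guard from the per-method to the per-bci form can change behavior noticeably: the per-bci form is far more eager to give up.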
------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2665123097 From roland at openjdk.org Tue Feb 18 09:56:14 2025 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 18 Feb 2025 09:56:14 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 09:48:58 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/loopnode.cpp line 1097: >> >>> 1095: // PhaseIdealLoop::add_parse_predicate only checks trap limits per method, so >>> 1096: // we do a custom check here. >>> 1097: if (!C->too_many_traps(cloned_sfpt->jvms()->method(), cloned_sfpt->jvms()->bci(), Deoptimization::Reason_auto_vectorization_check)) { >> >> Isn't that done by `add_parse_predicate`? > > @rwestrel I only see `if (!C->too_many_traps(reason)) {` in `PhaseIdealLoop::add_parse_predicate`. And as the comment I put here that only checks the `reason` per `method`, and not per `bci`. Do you see anything else? Seems like it's a bug that `PhaseIdealLoop::add_parse_predicate` doesn't check the `bci` too. Could you fix it? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1959411405 From epeter at openjdk.org Tue Feb 18 10:07:18 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 18 Feb 2025 10:07:18 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 09:53:14 GMT, Roland Westrelin wrote: >> @rwestrel I only see `if (!C->too_many_traps(reason)) {` in `PhaseIdealLoop::add_parse_predicate`. And as the comment I put here that only checks the `reason` per `method`, and not per `bci`. Do you see anything else? > > Seems like it's a bug that `PhaseIdealLoop::add_parse_predicate` doesn't check the `bci` too. Could you fix it? @rwestrel So we would check both, right? But is that what we want for all predicates? 
`C->too_many_traps(reason)` checks against `PerMethodTrapLimit`:

    if (trap_count(reason) >= Deoptimization::per_method_trap_limit(reason)) {

But the `bci` check works with `PerBytecodeTrapLimit`, and it actually has a comment like this:

    if (md->has_trap_at(bci, m, reason) != 0) {
      // Assume PerBytecodeTrapLimit==0, for a more conservative heuristic.
      // Also, if there are multiple reasons, or if there is no per-BCI record,
      // assume the worst.

So the `bci` check fails if there has been even a single trap recorded.

So it seems that such a change would affect the behavior in ways I cannot yet predict.

What do you think?

------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1959431345

From galder at openjdk.org Tue Feb 18 10:09:18 2025
From: galder at openjdk.org (Galder Zamarreño)
Date: Tue, 18 Feb 2025 10:09:18 GMT
Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11]
In-Reply-To: 
References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com> <5oGMaD5b87inAMkco6l5ODRvWv7FRsHGJiu_UMrGrTc=.0be44429-d322-4a6f-b91d-b64a146fad05@github.com> <3ArmrOQcUoj8DhHTq1a40Oz3GE8bCDDy3FFeVgbladg=.b8e0e13b-39f3-41a6-8a1b-5ca4febb4a41@github.com>
Message-ID: <50uPQ3ue90Xr_LSEm8z3XLTL1yx2A-Q0SJ8rdmv-gsg=.960a6c31-9850-4ce3-bd88-41d4342a5605@github.com>

On Tue, 18 Feb 2025 09:21:46 GMT, Galder Zamarreño wrote:

> For the equivalent long tests I think I made a mistake in the id of the disabled intrinsic, it should be _maxL and not _max. I will repeat the tests and post if any similar differences observed.
FYI Indeed a similar pattern is observed for long min/max (with the patch in this PR):

make test TEST="micro:org.openjdk.bench.java.lang.MinMaxVector.longReductionSimpleMax" MICRO="FORK=1"

Benchmark                            (probability)  (size)   Mode  Cnt    Score   Error   Units
MinMaxVector.longReductionSimpleMax             50    2048  thrpt    4  460.392 ± 0.076  ops/ms
MinMaxVector.longReductionSimpleMax             80    2048  thrpt    4  460.459 ± 0.438  ops/ms
MinMaxVector.longReductionSimpleMax            100    2048  thrpt    4  460.469 ± 0.057  ops/ms

make test TEST="micro:org.openjdk.bench.java.lang.MinMaxVector.longReductionSimpleMax" MICRO="FORK=1;OPTIONS=-jvmArgs -XX:CompileCommand=option,org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionSimpleMax_jmhTest::longReductionSimpleMax_thrpt_jmhStub,ccstrlist,DisableIntrinsic,_maxL"

Benchmark                            (probability)  (size)   Mode  Cnt     Score   Error   Units
MinMaxVector.longReductionSimpleMax             50    2048  thrpt    4   460.453 ± 0.188  ops/ms
MinMaxVector.longReductionSimpleMax             80    2048  thrpt    4   460.507 ± 0.192  ops/ms
MinMaxVector.longReductionSimpleMax            100    2048  thrpt    4  1013.498 ± 1.607  ops/ms

make test TEST="micro:org.openjdk.bench.java.lang.MinMaxVector.longReductionMultiplyMax" MICRO="FORK=1"

Benchmark                              (probability)  (size)   Mode  Cnt    Score   Error   Units
MinMaxVector.longReductionMultiplyMax             50    2048  thrpt    4  966.429 ± 0.359  ops/ms
MinMaxVector.longReductionMultiplyMax             80    2048  thrpt    4  966.569 ± 0.338  ops/ms
MinMaxVector.longReductionMultiplyMax            100    2048  thrpt    4  966.548 ± 0.575  ops/ms

make test TEST="micro:org.openjdk.bench.java.lang.MinMaxVector.longReductionMultiplyMax" MICRO="FORK=1;OPTIONS=-jvmArgs -XX:CompileCommand=option,org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMultiplyMax_jmhTest::longReductionMultiplyMax_thrpt_jmhStub,ccstrlist,DisableIntrinsic,_maxL"

Benchmark                              (probability)  (size)   Mode  Cnt    Score   Error   Units
MinMaxVector.longReductionMultiplyMax             50    2048  thrpt    4  966.130 ± 5.549  ops/ms
MinMaxVector.longReductionMultiplyMax             80    2048  thrpt    4  966.380 ± 0.663  ops/ms
MinMaxVector.longReductionMultiplyMax            100    2048  thrpt    4  859.233 ± 7.817  ops/ms

------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2665159015

From epeter at openjdk.org Tue Feb 18 10:11:16 2025
From: epeter at openjdk.org (Emanuel Peter)
Date: Tue, 18 Feb 2025 10:11:16 GMT
Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory
In-Reply-To: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com>
References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com>
Message-ID: 

On Tue, 18 Feb 2025 09:57:29 GMT, Roland Westrelin wrote:

> > That one is more tricky. Because what if the loop somehow gets folded away? How would we catch that?
>
> There is code that removes the OuterStripMinedLoop if the CountedLoop goes away and also, if I recall correctly, logic that verifies no `OuterStripMinedLoop` is left behind without a `CountedLoop`, so it's probably possible. Question is whether we want that or not. Seems like quite a bit of extra complexity.

Hmm ok, I see. I wonder how bad it is to leave the slow-loop there until after loop-opts. I mean it was already created, and it now has no loop-opts performed on it (it is stalled), so it just sits there like dead code. So I'm not sure there is really a performance benefit to killing it a little earlier. Maybe a very small one?
------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2665161507 From roland at openjdk.org Tue Feb 18 10:11:17 2025 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 18 Feb 2025 10:11:17 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: <-h_j1wlUqiWpk7lHDe2qqLlTPUdRLJ2NBaid6KJURCQ=.e1ef0bfa-4043-42b0-be58-ac130373c788@github.com> On Tue, 18 Feb 2025 10:04:59 GMT, Emanuel Peter wrote: >> Seems like it's a bug that `PhaseIdealLoop::add_parse_predicate` doesn't check the `bci` too. Could you fix it? > > @rwestrel So we would check both, right? But is that what we want for all predicates? > > `C->too_many_traps(reason)` checks against `PerMethodTrapLimit`: > > if (trap_count(reason) >= Deoptimization::per_method_trap_limit(reason)) { > > > But the `bci` check works with `PerBytecodeTrapLimit`, and it actually has a comment like this: > > if (md->has_trap_at(bci, m, reason) != 0) { > // Assume PerBytecodeTrapLimit==0, for a more conservative heuristic. > // Also, if there are multiple reasons, or if there is no per-BCI record, > // assume the worst. > > So the `bci` check fails if there has been even a single trapping recorded. > > So it seems that such a change would affect the behavior in ways I cannot yet predict. > > What do you think? That code is supposed to mirror the `GraphKit::add_parse_predicate()`. It doesn't. Would you like me to fix this separately? 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1959437628 From epeter at openjdk.org Tue Feb 18 10:20:15 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 18 Feb 2025 10:20:15 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: <-h_j1wlUqiWpk7lHDe2qqLlTPUdRLJ2NBaid6KJURCQ=.e1ef0bfa-4043-42b0-be58-ac130373c788@github.com> References: <-h_j1wlUqiWpk7lHDe2qqLlTPUdRLJ2NBaid6KJURCQ=.e1ef0bfa-4043-42b0-be58-ac130373c788@github.com> Message-ID: On Tue, 18 Feb 2025 10:09:00 GMT, Roland Westrelin wrote: > That code Which code are you referring to? Ah, probably you are talking about `PhaseIdealLoop::add_parse_predicate`, which is using the method wide check. And `GraphKit::add_parse_predicate` actually queries `GraphKit::too_many_traps`, which knows the current `bci()`, and can query the per-bci count. > Would you like me to fix this separately? Yes, please. I definitely don't want to do it in this PR ;) And I don't have as much experience with traps as you do. We'd have to think a little about what cases this affects, and if performance would go up or down in all those cases. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1959451204 From epeter at openjdk.org Tue Feb 18 10:29:12 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 18 Feb 2025 10:29:12 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: <-h_j1wlUqiWpk7lHDe2qqLlTPUdRLJ2NBaid6KJURCQ=.e1ef0bfa-4043-42b0-be58-ac130373c788@github.com> References: <-h_j1wlUqiWpk7lHDe2qqLlTPUdRLJ2NBaid6KJURCQ=.e1ef0bfa-4043-42b0-be58-ac130373c788@github.com> Message-ID: On Tue, 18 Feb 2025 10:09:00 GMT, Roland Westrelin wrote: >> @rwestrel So we would check both, right? But is that what we want for all predicates? 
>> >> `C->too_many_traps(reason)` checks against `PerMethodTrapLimit`: >> >> if (trap_count(reason) >= Deoptimization::per_method_trap_limit(reason)) { >> >> >> But the `bci` check works with `PerBytecodeTrapLimit`, and it actually has a comment like this: >> >> if (md->has_trap_at(bci, m, reason) != 0) { >> // Assume PerBytecodeTrapLimit==0, for a more conservative heuristic. >> // Also, if there are multiple reasons, or if there is no per-BCI record, >> // assume the worst. >> >> So the `bci` check fails if there has been even a single trapping recorded. >> >> So it seems that such a change would affect the behavior in ways I cannot yet predict. >> >> What do you think? > > That code is supposed to mirror the `GraphKit::add_parse_predicate()`. It doesn't. Would you like me to fix this separately? @rwestrel do you consider that a blocking issue for this PR here? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1959463556 From roland at openjdk.org Tue Feb 18 10:29:13 2025 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 18 Feb 2025 10:29:13 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: <-h_j1wlUqiWpk7lHDe2qqLlTPUdRLJ2NBaid6KJURCQ=.e1ef0bfa-4043-42b0-be58-ac130373c788@github.com> Message-ID: On Tue, 18 Feb 2025 10:25:08 GMT, Emanuel Peter wrote: >> That code is supposed to mirror the `GraphKit::add_parse_predicate()`. It doesn't. Would you like me to fix this separately? > > @rwestrel do you consider that a blocking issue for this PR here? 
No ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1959465988 From adinn at openjdk.org Tue Feb 18 13:36:27 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Tue, 18 Feb 2025 13:36:27 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: References: Message-ID: <3kiI1J7jcczgzTRi9HZztzhGe1blcy8Ga11xoGhzueY=.98543172-5b38-4199-bead-0988de0e0e75@github.com> On Thu, 6 Feb 2025 18:47:54 GMT, Ferenc Rakoczi wrote: >> By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: > > Adding comments + some code reorganization src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 2594: > 2592: guarantee(T != T1Q && T != T1D, "incorrect arrangement"); \ > 2593: if (!acceptT2D) guarantee(T != T2D, "incorrect arrangement"); \ > 2594: if (strcmp(#NAME, "sqdmulh") == 0) guarantee(T != T8B && T != T16B, "incorrect arrangement"); \ Suggestion: I think it might be better to change this test from a strcmp call to (opc2 == 0b101101). The strcmp test is clearer to a reader of the code but the call may not be guaranteed to be compiled out at build time while the latter will. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1959758334 From adinn at openjdk.org Tue Feb 18 13:46:17 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Tue, 18 Feb 2025 13:46:17 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: References: Message-ID: On Thu, 6 Feb 2025 18:47:54 GMT, Ferenc Rakoczi wrote: >> By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. 
> > Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision:
> >
> >   Adding comments + some code reorganization

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4066:

> 4064: }
> 4065:
> 4066: // Execute on round of keccak of two computations in parallel.

Suggestion: It would be helpful to add comments that relate the register and instruction selection to the original Java source code, e.g. change the header as follows:

// Performs 2 keccak round transformations using vector parallelism
//
// Two sets of 25 * 64-bit input states a0[lo:hi]...a24[lo:hi] are passed in
// the lower/upper halves of registers v0...v24 and the transformed states
// are returned in the same registers. Intermediate 64-bit pairs
// c0...c5 and d0...d5 are computed in registers v25...v30. v31 is
// loaded with the required pair of 64 bit rounding constants.
// During computation of the output states some intermediate results are
// shuffled around registers v0...v30. Comments on each line indicate
// how the values in registers correspond to variables ai, ci, di in
// the Java source code, likewise how the generated machine instructions
// correspond to Java source operations (n.b. rol means rotate left).
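As a reader aid, the four vector instructions used in this stub map onto simple Java `long` operations. A hedged sketch (helper names are made up for illustration; the real scalar Java code lives in the `sun.security.provider` SHA-3 implementation):

```java
// Illustrative Java equivalents of the aarch64 SHA-3 vector instructions
// used by the keccak stub. Each helper operates on one 64-bit lane.
public class KeccakOps {
    // eor3 Vd, Vn, Vm, Va: three-way XOR (the theta column parities c0..c4)
    static long eor3(long n, long m, long a) { return n ^ m ^ a; }

    // rax1 Vd, Vn, Vm: XOR with a left-rotated operand, d = n ^ rol(m, 1)
    static long rax1(long n, long m) { return n ^ Long.rotateLeft(m, 1); }

    // xar Vd, Vn, Vm, #imm: XOR then rotate right by imm; the stub passes
    // (64 - r) so the net effect is rol(n ^ m, r), matching the comments
    static long xar(long n, long m, int r) { return Long.rotateLeft(n ^ m, r); }

    // bcax Vd, Vn, Vm, Va: bit-clear and XOR, d = n ^ (m & ~a) -- the chi step
    static long bcax(long n, long m, long a) { return n ^ (m & ~a); }

    public static void main(String[] args) {
        // e.g. d0 = c4 ^ rol(c1, 1) from the theta step:
        long c4 = 1L, c1 = Long.MIN_VALUE;       // rol(c1, 1) == 1
        System.out.println(rax1(c4, c1) == 0L);  // prints true
    }
}
```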
Then annotate the generation steps as follows:

__ eor3(v29, __ T16B, v4, v9, v14);        // c4 = a4 ^ a9 ^ a14
__ eor3(v26, __ T16B, v1, v6, v11);        // c1 = a1 ^ a6 ^ a11
__ eor3(v28, __ T16B, v3, v8, v13);        // c3 = a3 ^ a8 ^ a13
__ eor3(v25, __ T16B, v0, v5, v10);        // c0 = a0 ^ a5 ^ a10
__ eor3(v27, __ T16B, v2, v7, v12);        // c2 = a2 ^ a7 ^ a12
__ eor3(v29, __ T16B, v29, v19, v24);      // c4 ^= a19 ^ a24
__ eor3(v26, __ T16B, v26, v16, v21);      // c1 ^= a16 ^ a21
__ eor3(v28, __ T16B, v28, v18, v23);      // c3 ^= a18 ^ a23
__ eor3(v25, __ T16B, v25, v15, v20);      // c0 ^= a15 ^ a20
__ eor3(v27, __ T16B, v27, v17, v22);      // c2 ^= a17 ^ a22
__ rax1(v30, __ T2D, v29, v26);            // d0 = c4 ^ rol(c1, 1)
__ rax1(v26, __ T2D, v26, v28);            // d2 = c1 ^ rol(c3, 1)
__ rax1(v28, __ T2D, v28, v25);            // d4 = c3 ^ rol(c0, 1)
__ rax1(v25, __ T2D, v25, v27);            // d1 = c0 ^ rol(c2, 1)
__ rax1(v27, __ T2D, v27, v29);            // d3 = c2 ^ rol(c4, 1)
__ eor(v0, __ T16B, v0, v30);              // a0 = a0 ^ d0
__ xar(v29, __ T2D, v1, v25, (64 - 1));    // a10' = rol((a1^d1), 1)
__ xar(v1, __ T2D, v6, v25, (64 - 44));    // a1 = rol((a6^d1), 44)
__ xar(v6, __ T2D, v9, v28, (64 - 20));    // a6 = rol((a9^d4), 20)
__ xar(v9, __ T2D, v22, v26, (64 - 61));   // a9 = rol((a22^d2), 61)
__ xar(v22, __ T2D, v14, v28, (64 - 39));  // a22 = rol((a14^d4), 39)
__ xar(v14, __ T2D, v20, v30, (64 - 18));  // a14 = rol((a20^d0), 18)
__ xar(v31, __ T2D, v2, v26, (64 - 62));   // a20' = rol((a2^d2), 62)
__ xar(v2, __ T2D, v12, v26, (64 - 43));   // a2 = rol((a12^d2), 43)
__ xar(v12, __ T2D, v13, v27, (64 - 25));  // a12 = rol((a13^d3), 25)
__ xar(v13, __ T2D, v19, v28, (64 - 8));   // a13 = rol((a19^d4), 8)
__ xar(v19, __ T2D, v23, v27, (64 - 56));  // a19 = rol((a23^d3), 56)
__ xar(v23, __ T2D, v15, v30, (64 - 41));  // a23 = rol((a15^d0), 41)
__ xar(v15, __ T2D, v4, v28, (64 - 27));   // a15 = rol((a4^d4), 27)
__ xar(v28, __ T2D, v24, v28, (64 - 14));  // a4' = rol((a24^d4), 14)
__ xar(v24, __ T2D, v21, v25, (64 - 2));   // a24 = rol((a21^d1), 2)
__ xar(v8, __ T2D, v8, v27, (64 - 55));    // a21' = rol((a8^d3), 55)
__ xar(v4, __ T2D, v16, v25, (64 - 45));   // a8' = rol((a16^d1), 45)
__ xar(v16, __ T2D, v5, v30, (64 - 36));   // a16 = rol((a5^d0), 36)
__ xar(v5, __ T2D, v3, v27, (64 - 28));    // a5 = rol((a3^d3), 28)
__ xar(v27, __ T2D, v18, v27, (64 - 21));  // a3' = rol((a18^d3), 21)
__ xar(v3, __ T2D, v17, v26, (64 - 15));   // a18' = rol((a17^d2), 15)
__ xar(v25, __ T2D, v11, v25, (64 - 10));  // a17' = rol((a11^d1), 10)
__ xar(v26, __ T2D, v7, v26, (64 - 6));    // a11' = rol((a7^d2), 6)
__ xar(v30, __ T2D, v10, v30, (64 - 3));   // a7' = rol((a10^d0), 3)
__ bcax(v20, __ T16B, v31, v22, v8);       // a20 = a20' ^ (~a21 & a22')
__ bcax(v21, __ T16B, v8, v23, v22);       // a21 = a21' ^ (~a22 & a23)
__ bcax(v22, __ T16B, v22, v24, v23);      // a22 = a22 ^ (~a23 & a24)
__ bcax(v23, __ T16B, v23, v31, v24);      // a23 = a23 ^ (~a24 & a20')
__ bcax(v24, __ T16B, v24, v8, v31);       // a24 = a24 ^ (~a20' & a21')
__ ld1r(v31, __ T2D, __ post(rscratch1, 8)); // rc = round_constants[i]
__ bcax(v17, __ T16B, v25, v19, v3);       // a17 = a17' ^ (~a18' & a19)
__ bcax(v18, __ T16B, v3, v15, v19);       // a18 = a18' ^ (~a19 & a15')
__ bcax(v19, __ T16B, v19, v16, v15);      // a19 = a19 ^ (~a15 & a16)
__ bcax(v15, __ T16B, v15, v25, v16);      // a15 = a15 ^ (~a16 & a17')
__ bcax(v16, __ T16B, v16, v3, v25);       // a16 = a16 ^ (~a17' & a18')
__ bcax(v10, __ T16B, v29, v12, v26);      // a10 = a10' ^ (~a11' & a12)
__ bcax(v11, __ T16B, v26, v13, v12);      // a11 = a11' ^ (~a12 & a13)
__ bcax(v12, __ T16B, v12, v14, v13);      // a12 = a12 ^ (~a13 & a14)
__ bcax(v13, __ T16B, v13, v29, v14);      // a13 = a13 ^ (~a14 & a10')
__ bcax(v14, __ T16B, v14, v26, v29);      // a14 = a14 ^ (~a10' & a11')
__ bcax(v7, __ T16B, v30, v9, v4);         // a7 = a7' ^ (~a8' & a9)
__ bcax(v8, __ T16B, v4, v5, v9);          // a8 = a8' ^ (~a9 & a5)
__ bcax(v9, __ T16B, v9, v6, v5);          // a9 = a9 ^ (~a5 & a6)
__ bcax(v5, __ T16B, v5, v30, v6);         // a5 = a5 ^ (~a6 & a7)
__ bcax(v6, __ T16B, v6, v4, v30);         // a6 = a6 ^ (~a7 & a8')
__ bcax(v3, __ T16B, v27, v0, v28);        // a3 = a3' ^ (~a4' & a0)
__ bcax(v4, __ T16B, v28, v1, v0);         // a4 = a4' ^ (~a0 & a1)
__ bcax(v0, __ T16B, v0, v2, v1);          // a0 = a0 ^ (~a1 & a2)
__ bcax(v1, __ T16B, v1, v27, v2);         // a1 = a1 ^ (~a2 & a3)
__ bcax(v2, __ T16B, v2, v28, v27);        // a2 = a2 ^ (~a3 & a4')
__ eor(v0, __ T16B, v0, v31);              // a0 = a0 ^ rc

------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1959776475 From kvn at openjdk.org Tue Feb 18 19:24:22 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 18 Feb 2025 19:24:22 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v2] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Wed, 12 Feb 2025 03:03:34 GMT, Chris Plummer wrote:

>> Before I forgot to answer you, @plummercj
>> I completely agree with your comment about cleaning up wrapper subclasses which do nothing.
>>
>> I think some wrapper subclasses for CodeBlob were kept because of `is*()` which were used only in `PStack` to print name. Why not use `getName()` for this purpose without big `if/else` there?
>>
>> An other purpose could be a place holder for additional information in a future which never come.
>>
>> Other wrapper provides information available in `CodeBlob`. Like `RuntimeStub. callerMustGCArguments()`. `_caller_must_gc_arguments` field is part of VM's `CodeBlob` class for some time now. Looks like I missed change in SA when did change in VM.
>>
>> So yes, feel free to clean this up. I will help with review.
>
>> I think some wrapper subclasses for CodeBlob were kept because of `is*()` which were used only in `PStack` to print name. Why not use `getName()` for this purpose without big `if/else` there?
>
> Possibly getName() didn't exist when PStack was first written. It would be good if PStack not only included the type name as it does now, but also the actual name of the blob, which getName() would return.
> >> An other purpose could be a place holder for additional information in a future which never come. > > Yes, and you also see that with the Observer registration and the `Type type = db.lookupType()` code, which are only needed if you are going to lookup fields of the subtypes, which most don't ever do, yet they all have this code. > >> Other wrapper provides information available in `CodeBlob`. Like `RuntimeStub. callerMustGCArguments()`. `_caller_must_gc_arguments` field is part of VM's `CodeBlob` class for some time now. Looks like I missed change in SA when did change in VM. > > Yeah, that's not working right for CodeBlob subtypes that are not RuntimeStubs. Easy to fix though. > >> So yes, feel free to clean this up. I will help with review. > > Ok. Let me see where things are at after you are done with the PR. Thank you, @plummercj , for review. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2666228333 From kvn at openjdk.org Tue Feb 18 19:24:34 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 18 Feb 2025 19:24:34 GMT Subject: RFR: 8349088: De-virtualize Codeblob and nmethod [v10] In-Reply-To: References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Sat, 15 Feb 2025 06:34:56 GMT, Vladimir Kozlov wrote: >> Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. >> >> Added C++ static asserts to make sure no virtual methods are added in a future. >> >> Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. 
>> >> Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Remove commented lines left by mistake Thank you all for reviews and suggestions. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23533#issuecomment-2666253220 From kvn at openjdk.org Tue Feb 18 19:26:04 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 18 Feb 2025 19:26:04 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> Message-ID: On Tue, 18 Feb 2025 10:07:07 GMT, Emanuel Peter wrote: >>> That one is more tricky. Because what if the loop somehow gets folded away? How would we catch that? >> >> There is code that removes the `OuterStripMinedLoop` if the `CountedLoop` goes away and also, if I recall correctly, logic that verifies no ``OuterStripMinedLoop` is left behind without a `CountedLoop` so it's probably possible. Question is whether we want that or not. Seems like quite a bit of extra complexity. > >> > That one is more tricky. Because what if the loop somehow gets folded away? How would we catch that? > >>There is code that removes the OuterStripMinedLoop if the CountedLoop goes away and also, if I recall correctly, logic that verifies no ``OuterStripMinedLoopis left behind without aCountedLoop` so it's probably possible. Question is whether we want that or not. Seems like quite a bit of extra complexity. > > Hmm ok, I see. I wonder how bad it is to leave the slow-loop there until after loop-opts. I mean it was already created, and it now has no loop-opts performed on it (it is stalled), so it just sits there like dead code. So I'm not sure there is really a performance benefit to kill it already a little earlier. Maybe a very small one? 
@eme64, my main concern is that loop multiversion code will blow up inlining decisions. Our benchmarks may not be affected because we may never trigger multiversion code on our hardware (as Roland pointed out). Maybe you can force its generation and then compare performance. Do we really need it for these changes? Can we simply generate the un-vectorized loop?

"x86 and aarch64 are unaffected". Which platforms are affected? Should we really sacrifice code complexity for platforms we don't support?

Another question is what deoptimization `Action` is taken when the predicate fails. I saw a comment in the code: "We only want to use the auto-vectorization check as a trap once per bci." Does it mean you immediately deoptimize code? Can we hit the uncommon trap a few times before deoptimization? Deoptimization after one trap assumes we will process the same un-aligned data again. In a test it could be true, but is it true in reality too?

------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2666176147 From epeter at openjdk.org Tue Feb 18 19:26:07 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 18 Feb 2025 19:26:07 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> Message-ID: On Tue, 18 Feb 2025 16:10:20 GMT, Vladimir Kozlov wrote:

>>> > That one is more tricky. Because what if the loop somehow gets folded away? How would we catch that?
>>
>>> There is code that removes the `OuterStripMinedLoop` if the `CountedLoop` goes away and also, if I recall correctly, logic that verifies no `OuterStripMinedLoop` is left behind without a `CountedLoop` so it's probably possible. Question is whether we want that or not. Seems like quite a bit of extra complexity.
>>
>> Hmm ok, I see. I wonder how bad it is to leave the slow-loop there until after loop-opts.
I mean it was already created, and it now has no loop-opts performed on it (it is stalled), so it just sits there like dead code. So I'm not sure there is really a performance benefit to kill it already a little earlier. Maybe a very small one?

> @eme64, my main concern is that loop multiversion code will blow up inlining decisions. Our benchmarks may not be affected because we may never trigger multiversion code on our hardware (as Roland pointed out). Maybe you can force its generation and then compare performance. Do we really need it for these changes? Can we simply generate the un-vectorized loop?
>
> "x86 and aarch64 are unaffected". Which platforms are affected? Should we really sacrifice code complexity for platforms we don't support?
>
> Another question is what deoptimization `Action` is taken when the predicate fails. I saw a comment in the code: "We only want to use the auto-vectorization check as a trap once per bci." Does it mean you immediately deoptimize code? Can we hit the uncommon trap a few times before deoptimization? Deoptimization after one trap assumes we will process the same un-aligned data again. In a test it could be true, but is it true in reality too?

@vnkozlov

> "x86 and aarch64 are unaffected". Which platforms are affected? Should we really sacrifice code complexity for platforms we don't support?

I would say most of the code here, i.e. the predicate and multi-version parts, are also relevant for the upcoming patch for aliasing analysis runtime-checks. These are especially important for `MemorySegment` cases where there could basically always be aliasing and only runtime-checks can help us vectorize. There is really only a small part, which is emitting the actual alignment-check.

> Do we really need it for these changes? Can we simply generate the un-vectorized loop?

The alternatives on architectures that are actually affected by this bug:
- Not fix the bug, and risk possible `SIGBUS`.
And on our platforms, that just means living with the HALT caused by `VerifyAlignVector`.
- Disable ALL vectorization of cases where we cannot guarantee statically that accesses are aligned. That would certainly disable all uses of `MemorySegment`, and that is probably not preferable.

> my main concern is that loop multiversion code will blow up inlining decisions. Our benchmarks may not be affected because we may never trigger multiversion code on our hardware (as Roland pointed out). Maybe you can force its generation and then compare performance.

Right. I suppose code size might be slightly affected. But I only multi-version if we are already going to pre-main-post the loop. And that means that the loop is already copied 3x, and doing 4x is not that noticeable, I would suspect. Also, with OSR we already currently don't generate predicates, and so it is generating the multi-versioning for those. And I really could not measure any difference in the performance benchmarking. I doubt it is even noticeable in compile time.

> Another question is what deoptimization `Action` is taken when the predicate fails. I saw a comment in the code: "We only want to use the auto-vectorization check as a trap once per bci." Does it mean you immediately deoptimize code? Can we hit the uncommon trap a few times before deoptimization? Deoptimization after one trap assumes we will process the same un-aligned data again. In a test it could be true, but is it true in reality too?

Yes, when we deopt for the bci, we recompile immediately. The alternative is to make the check per method, but then the risk is that one loop deopting causes other loops to be multi-versioned instead of using predicates too. Counting deopts per bci is currently not done at all. But I suppose we could make it a bit more "forgiving"... but is that worth it?
I suppose if in reality we do see non-aligned cases (or in the future cases where we have problematic aliasing), then it will probably repeat, and is worth recompiling to handle both cases. But that is speculation, and we can discuss :)

TLDR: @vnkozlov I would not have fixed the bug with such a heavy mechanism if I did not intend to use it for runtime checks for aliasing analysis. And 90% of the code here is reusable for that.

------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2666357998 From kvn at openjdk.org Tue Feb 18 19:26:11 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 18 Feb 2025 19:26:11 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> Message-ID: <7CUvxR76ROhB7TB2qqbF2nQB5RNIj4GpRvKqZSw-dDM=.8917fc6a-3e84-4a9b-8df7-2eec07cfa768@github.com> On Tue, 18 Feb 2025 17:20:23 GMT, Emanuel Peter wrote:

> Do we really need it for these changes? Can we simply generate the un-vectorized loop?

To clarify: this question was about the second phase, after we deoptimize and recompile when we hit a predicate check failure. I am fine with the predicate change.

> And I really could not measure any difference in the performance benchmarking. I doubt it is even noticeable in compile time.

Right. If a method has a vectorizable loop, it most likely has big generated code and is not inlined already. So adding a 4th loop may not affect it significantly.
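For readers following along, the unaligned-`native`-base condition this thread keeps referring to boils down to a simple address test. A minimal sketch, illustrative only (the real check is emitted by C2 as IR in the parse predicate or `multiversion_if`, not written in Java; the addresses below are hypothetical):

```java
// Models the runtime alignment check on a native base address as plain
// long arithmetic, mirroring the nativeAligned.asSlice(1) example above.
public class AlignCheck {
    // True if 'address' is aligned to 'elemSize' bytes (elemSize a power of two).
    static boolean aligned(long address, int elemSize) {
        return (address & (elemSize - 1)) == 0;
    }

    public static void main(String[] args) {
        long base = 0x7f0000001000L;  // hypothetical 4K-aligned native allocation
        long sliced = base + 1;       // models asSlice(1): off-by-one base
        System.out.println(aligned(base, 4));    // true: fast (vectorized) loop is safe
        System.out.println(aligned(sliced, 4));  // false: must take the slow loop
    }
}
```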
------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2666506254 PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2666525354 From kvn at openjdk.org Tue Feb 18 19:26:16 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 18 Feb 2025 19:26:16 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Mon, 11 Nov 2024 14:40:09 GMT, Emanuel Peter wrote:

> Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below.
>
> **Background**
>
> With `-XX:+AlignVector`, all vector loads/stores must be aligned. We try to statically determine if we can always align the vectors. One condition is that the address `base` is already aligned. For arrays, we know that this always holds, because they are `ObjectAlignmentInBytes` aligned. But with native memory, the `base` is just some arbitrarily aligned pointer.
>
> **Problem**
>
> So far, we have just naively assumed that the `base` is always `ObjectAlignmentInBytes` aligned. But that does not hold for `native` memory segments: the `base` can also be unaligned. I had constructed such an example, and with `-XX:+AlignVector -XX:+VerifyAlignVector` this example hits the verification code.
>
>     MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1);
>     MemorySegment nativeUnaligned = nativeAligned.asSlice(1);
>     test3(nativeUnaligned);
>
> When compiling the test method, we assume that the `nativeUnaligned.address()` is aligned - but it is not!
>
>     static void test3(MemorySegment ms) {
>         for (int i = 0; i < RANGE; i++) {
>             long adr = i * 4L;
>             int v = ms.get(ELEMENT_LAYOUT, adr);
>             ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1));
>         }
>     }
>
> **Solution: Runtime Checks - Predicate and Multiversioning**
>
> Of course we could just forbid cases where we have a `native` base from vectorizing.
But that would lead to regressions currently - in most cases we do get aligned `base`s, and we currently vectorize those. We cannot statically determine if the `base` is aligned, we need a runtime check.
>
> I came up with 2 options where to place the runtime checks:
> - A new "auto vectorization" Parse Predicate:
>   - This only works when predicates are available.
>   - If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop.
> - Multiversion the loop:
>   - Create 2 copies of the loop (fast and slow loops).
>   - The `fast_loop` can make speculative alignment assumptions, and add the corresponding check to the `multiversion_if` which decides which loop we take
>   - In the `slow_loop`, we make no assumption which means we can not vectorize, but we still compile - so even unaligned `base`s would end up with reasonably fast code.
>   - We "stall" the `...

What probabilities are used for the multi-version loop branches? Is the non-vectorized version moved out of the hot path in the generated code?

About the actual probability value: I was thinking PROB_LIKELY_MAG(3). PROB_LIKELY_MAG(1) will only guarantee that the vectorized loop comes first, but it could be enough without moving the other loop out of the hot path. Needs testing.

------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2666554240 PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2666710345 From epeter at openjdk.org Tue Feb 18 19:26:18 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 18 Feb 2025 19:26:18 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 18:29:42 GMT, Vladimir Kozlov wrote:

> What probabilities are used for the multi-version loop branches? Is the non-vectorized version moved out of the hot path in the generated code?
I'm not sure what you are asking. Are you asking what probability I'm setting for the multi-version branch?

This is the loop selector, which later gets copied for each of the checks: `const LoopSelector loop_selector(lpt, opaque, PROB_FAIR, COUNT_UNKNOWN);`

So 50%. But maybe you are suggesting it should really be biased towards the fast-path, right? What probability would you suggest? It should probably be fairly low, since there can be multiple checks added, and each one lowers the probability of arriving at the true-loop. So for scheduling, we should keep the probability high, so the true-loop is scheduled closer, right?

Is that what you meant?

------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2666602599 From kvn at openjdk.org Tue Feb 18 19:26:19 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 18 Feb 2025 19:26:19 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 18:45:34 GMT, Emanuel Peter wrote:

> > What probabilities are used for the multi-version loop branches? Is the non-vectorized version moved out of the hot path in the generated code?
>
> I'm not sure what you are asking. Are you asking what probability I'm setting for the multi-version branch?
>
> This is the loop selector, which later gets copied for each of the checks: `const LoopSelector loop_selector(lpt, opaque, PROB_FAIR, COUNT_UNKNOWN);`
>
> So 50%. But maybe you are suggesting it should really be biased towards the fast-path, right? What probability would you suggest? It should probably be fairly low, since there can be multiple checks added, and each one lowers the probability of arriving at the true-loop. So for scheduling, we should keep the probability high, so the true-loop is scheduled closer, right?
>
> Is that what you meant?

Yes. I want to prioritize the fast path, assuming it is the vectorized loop and that we get aligned data more frequently.
It is actually difficult to judge without statistic from real applications. It should be reversed if an application works mostly on unaligned data. Can we profile alignment in Interpreter (and C1)? ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2666635167 From kvn at openjdk.org Tue Feb 18 20:11:04 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 18 Feb 2025 20:11:04 GMT Subject: Integrated: 8349088: De-virtualize Codeblob and nmethod In-Reply-To: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> References: <0PvzE8go0Q4VGhH_OF3OyPPgoD4qhxxfBgSHe41chBU=.e471b2a1-96ad-490d-b3d0-a050bd00d7d8@github.com> Message-ID: On Sun, 9 Feb 2025 17:45:30 GMT, Vladimir Kozlov wrote: > Remove virtual methods from CodeBlob and nmethod to simplify saving/restoring in Leyden AOT cache. It avoids the need to patch hidden VPTR pointer to class's virtual table. > > Added C++ static asserts to make sure no virtual methods are added in a future. > > Fixed/cleaned SA code which process CodeBlob and its subclasses. Use `CodeBlob::_kind` field value to determine the type of blob. > > Tested tier1-5, hs-tier6-rt (for JFR testing), stress, xcomp This pull request has now been integrated. 
Changeset: 46d4a601
Author: Vladimir Kozlov
URL: https://git.openjdk.org/jdk/commit/46d4a601e04f90b11d4ccc97a49f4e7010b4fd83
Stats: 529 lines in 23 files changed: 262 ins; 152 del; 115 mod

8349088: De-virtualize Codeblob and nmethod

Co-authored-by: Stefan Karlsson
Co-authored-by: Chris Plummer
Reviewed-by: cjplummer, aboldtch, dlong

------------- PR: https://git.openjdk.org/jdk/pull/23533 From coleenp at openjdk.org Tue Feb 18 23:49:52 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Tue, 18 Feb 2025 23:49:52 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native Message-ID:

Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes.
Tested with tier1-4 and performance tests.

------------- Commit messages:
 - Add ')' removed from jvmci test.
 - Shrink modifiers flag so isPrimitive can share word.
 - Remove isPrimitive intrinsic in favor of a boolean.
 - Make isInterface non-native.
 - Make isArray non-native

Changes: https://git.openjdk.org/jdk/pull/23572/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23572&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8349860
Stats: 178 lines in 19 files changed: 37 ins; 115 del; 26 mod
Patch: https://git.openjdk.org/jdk/pull/23572.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/23572/head:pull/23572

PR: https://git.openjdk.org/jdk/pull/23572 From liach at openjdk.org Tue Feb 18 23:49:54 2025 From: liach at openjdk.org (Chen Liang) Date: Tue, 18 Feb 2025 23:49:54 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native In-Reply-To: References: Message-ID: <6EpQLprXKfUDUQ6UIl0Vo0M5OPmCJ4SjcnOeprbO40w=.7d6cd0d3-ec59-4935-adb9-484764f0235c@github.com> On Tue, 11 Feb 2025 20:56:39 GMT, Coleen Phillimore wrote:

> Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes.
> Tested with tier1-4 and performance tests.

We often need to determine what primitive type a `class` is. Currently we do it through `Wrapper.forPrimitiveType`. Do you see potential value in encoding the primitive status in a byte, so primitive info also knows what primitive type this class is instead of doing identity comparisons? @cl4es Can you offer some insight here?

src/java.base/share/classes/jdk/internal/reflect/Reflection.java line 59:

> 57: Reflection.class, ALL_MEMBERS,
> 58: AccessibleObject.class, ALL_MEMBERS,
> 59: Class.class, Set.of("classLoader", "classData", "modifiers", "isPrimitive"),

I think the field is named `isPrimitive`, right?
test/hotspot/jtreg/compiler/jvmci/jdk.vm.ci.runtime.test/src/jdk/vm/ci/runtime/test/TestResolvedJavaType.java line 933: > 931: if (f.getDeclaringClass().equals(metaAccess.lookupJavaType(Class.class))) { > 932: String name = f.getName(); > 933: return name.equals("classLoader") || name.equals("classData") || name.equals("modifiers") || name.equals("isPrimitive"); Same field name remark. test/jdk/jdk/internal/reflect/Reflection/Filtering.java line 59: > 57: { Class.class, "classData" }, > 58: { Class.class, "modifiers" }, > 59: { Class.class, "isPrimitive" }, Same field name remark. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23572#issuecomment-2654120983 PR Comment: https://git.openjdk.org/jdk/pull/23572#issuecomment-2659605250 PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1951773863 PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1951774073 PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1951774214 From coleenp at openjdk.org Tue Feb 18 23:49:54 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Tue, 18 Feb 2025 23:49:54 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native In-Reply-To: References: Message-ID: On Tue, 11 Feb 2025 20:56:39 GMT, Coleen Phillimore wrote: > Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. > Tested with tier1-4 and performance tests. I had a look at Wrapper.forPrimitiveType() and it's not an intrinsic so I don't really know how hot it is. It's a comparison, vs getting a field out of Class. Not sure how to measure it. So I can't address it in this change. 
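Field-naming questions aside, the non-native checks described in the RFR can be sketched roughly as follows. This is a minimal illustration of the idea, not the actual java.lang.Class patch; the class and field names here are hypothetical:

```java
// Sketch of the approach: isInterface() reads modifier flags, isArray()
// tests the component mirror for non-null, and isPrimitive() reads a
// JVM-initialized boolean. Names are illustrative, not the JDK's.
public class ClassShape {
    final int modifiers;          // set by the JVM at mirror creation
    final Object componentType;   // non-null only for array classes
    final boolean isPrimitive;    // the new final transient boolean

    ClassShape(int modifiers, Object componentType, boolean isPrimitive) {
        this.modifiers = modifiers;
        this.componentType = componentType;
        this.isPrimitive = isPrimitive;
    }

    boolean isInterface() {
        return (modifiers & java.lang.reflect.Modifier.INTERFACE) != 0;
    }
    boolean isArray()     { return componentType != null; }
    boolean isPrimitive() { return isPrimitive; }  // field, not a VM call
}
```

All three checks become plain field reads or comparisons, which is why no intrinsic or native transition is needed.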
------------- PR Comment: https://git.openjdk.org/jdk/pull/23572#issuecomment-2659396480 From redestad at openjdk.org Tue Feb 18 23:49:54 2025 From: redestad at openjdk.org (Claes Redestad) Date: Tue, 18 Feb 2025 23:49:54 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native In-Reply-To: References: Message-ID: On Tue, 11 Feb 2025 20:56:39 GMT, Coleen Phillimore wrote: > Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. > Tested with tier1-4 and performance tests. Touching `Wrapper` seems out of scope for this PR, but if `Class.isPrimitive` gets cheaper from this then `Wrapper.forPrimitiveType` should definitely be examined in a follow-up. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23572#issuecomment-2661970849 From coleenp at openjdk.org Tue Feb 18 23:49:55 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Tue, 18 Feb 2025 23:49:55 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native In-Reply-To: <6EpQLprXKfUDUQ6UIl0Vo0M5OPmCJ4SjcnOeprbO40w=.7d6cd0d3-ec59-4935-adb9-484764f0235c@github.com> References: <6EpQLprXKfUDUQ6UIl0Vo0M5OPmCJ4SjcnOeprbO40w=.7d6cd0d3-ec59-4935-adb9-484764f0235c@github.com> Message-ID: On Wed, 12 Feb 2025 00:05:13 GMT, Chen Liang wrote: >> Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. >> Tested with tier1-4 and performance tests. 
> > src/java.base/share/classes/jdk/internal/reflect/Reflection.java line 59: > >> 57: Reflection.class, ALL_MEMBERS, >> 58: AccessibleObject.class, ALL_MEMBERS, >> 59: Class.class, Set.of("classLoader", "classData", "modifiers", "isPrimitive"), > > I think the field is named `isPrimitive`, right? The method is isPrimitive so I think I had to give the field isPrimitiveType as a name, so this is wrong. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1952521536 From dlong at openjdk.org Wed Feb 19 02:38:03 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 19 Feb 2025 02:38:03 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native In-Reply-To: References: <6EpQLprXKfUDUQ6UIl0Vo0M5OPmCJ4SjcnOeprbO40w=.7d6cd0d3-ec59-4935-adb9-484764f0235c@github.com> Message-ID: On Wed, 12 Feb 2025 12:05:22 GMT, Coleen Phillimore wrote: >> src/java.base/share/classes/jdk/internal/reflect/Reflection.java line 59: >> >>> 57: Reflection.class, ALL_MEMBERS, >>> 58: AccessibleObject.class, ALL_MEMBERS, >>> 59: Class.class, Set.of("classLoader", "classData", "modifiers", "isPrimitive"), >> >> I think the field is named `isPrimitive`, right? > > The method is isPrimitive so I think I had to give the field isPrimitiveType as a name, so this is wrong. I don't know if we have a style guide that covers this, but I believe the method and field could both be named `isPrimitive`. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1960863953 From dlong at openjdk.org Wed Feb 19 02:56:57 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 19 Feb 2025 02:56:57 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native In-Reply-To: References: Message-ID: On Tue, 11 Feb 2025 20:56:39 GMT, Coleen Phillimore wrote: > Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. > Tested with tier1-4 and performance tests. src/hotspot/share/classfile/javaClasses.inline.hpp line 301: > 299: #ifdef ASSERT > 300: // The heapwalker walks through Classes that have had their Klass pointers removed, so can't assert this. > 301: // assert(is_primitive == java_class->bool_field(_is_primitive_offset), "must match what we told Java"); I don't understand this comment about the heapwalker. It sounds like we could have `is_primitive` set to true incorrectly. If so, what prevents the asserts below from failing? And why not use the value from _is_primitive_offset instead? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1960876174 From liach at openjdk.org Wed Feb 19 02:56:58 2025 From: liach at openjdk.org (Chen Liang) Date: Wed, 19 Feb 2025 02:56:58 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native In-Reply-To: References: <6EpQLprXKfUDUQ6UIl0Vo0M5OPmCJ4SjcnOeprbO40w=.7d6cd0d3-ec59-4935-adb9-484764f0235c@github.com> Message-ID: On Wed, 19 Feb 2025 02:35:25 GMT, Dean Long wrote: >> The method is isPrimitive so I think I had to give the field isPrimitiveType as a name, so this is wrong. > > I don't know if we have a style guide that covers this, but I believe the method and field could both be named `isPrimitive`. 
I would personally name such a boolean field `primitive`, but I don't have a strong preference on the field naming as long as its references in tests and other locations are correct. In addition, I believe this field may soon be widened to carry more hotspot-specific flags (such as hidden, etc.) so the name is bound to change. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1960876569 From haosun at openjdk.org Wed Feb 19 02:58:03 2025 From: haosun at openjdk.org (Hao Sun) Date: Wed, 19 Feb 2025 02:58:03 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: References: Message-ID: On Thu, 6 Feb 2025 18:47:54 GMT, Ferenc Rakoczi wrote: >> By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: > > Adding comments + some code reorganization Hi. Here is the test result of our CI. ### copyright year the following files should update the copyright year to 2025. src/hotspot/cpu/aarch64/assembler_aarch64.hpp src/hotspot/cpu/aarch64/stubRoutines_aarch64.hpp src/hotspot/share/runtime/globals.hpp src/java.base/share/classes/sun/security/provider/ML_DSA.java src/java.base/share/classes/sun/security/provider/SHA3Parallel.java test/micro/org/openjdk/bench/java/security/MLDSA.java ### cross-build failure Cross build for riscv64/s390/ppc64 failed. 
Here is the error message for ppc64 === Output from failing command(s) repeated here === * For target support_interim-jmods_support__create_java.base.jmod_exec: # # A fatal error has been detected by the Java Runtime Environment: # # Internal Error (/tmp/jdk-src/src/hotspot/share/asm/codeBuffer.hpp:200), pid=72752, tid=72769 # assert(allocates2(pc)) failed: not in CodeBuffer memory: 0x0000e85cb03dc620 <= 0x0000e85cb03e8ab4 <= 0x0000e85cb03e8ab0 # # JRE version: OpenJDK Runtime Environment (25.0) (fastdebug build 25-internal-git-1e01c6deec3) # Java VM: OpenJDK 64-Bit Server VM (fastdebug 25-internal-git-1e01c6deec3, mixed mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64) # Problematic frame: # V [libjvm.so+0x3b391c] Instruction_aarch64::~Instruction_aarch64()+0xbc # # Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" (or dumping to /tmp/ci-scripts/jdk-src/make/ # # An error report file with more information is saved as: # /tmp/jdk-src/make/hs_err_pid72752.log ... (rest of output omitted) * All command lines available in /sysroot/ppc64el/tmp/build-ppc64el/make-support/failure-logs. 
=== End of repeated output === I suppose we should make a similar update to the one in `src/hotspot/cpu/aarch64/stubDeclarations_aarch64.hpp` for the other platforms ------------- PR Comment: https://git.openjdk.org/jdk/pull/23300#issuecomment-2667389849 From dlong at openjdk.org Wed Feb 19 03:32:53 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 19 Feb 2025 03:32:53 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native In-Reply-To: References: Message-ID: On Tue, 11 Feb 2025 20:56:39 GMT, Coleen Phillimore wrote: > Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. > Tested with tier1-4 and performance tests. src/java.base/share/classes/java/lang/Class.java line 1287: > 1285: */ > 1286: public Class getComponentType() { > 1287: // Only return for array types. Storage may be reused for Class for instance types. I don't see any changes to componentType related to reuse. So was this comment and the code below already obsolete? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1960897176 From dlong at openjdk.org Wed Feb 19 03:37:52 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 19 Feb 2025 03:37:52 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native In-Reply-To: References: Message-ID: On Tue, 11 Feb 2025 20:56:39 GMT, Coleen Phillimore wrote: > Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. > Tested with tier1-4 and performance tests. src/hotspot/share/prims/jvm.cpp line 2283: > 2281: // Otherwise it returns its argument value which is the _the_class Klass*. 
> 2282: // Please, refer to the description in the jvmtiThreadState.hpp. > 2283: Does this "RedefineClasses support" comment still belong here? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1960900041 From dholmes at openjdk.org Wed Feb 19 05:14:58 2025 From: dholmes at openjdk.org (David Holmes) Date: Wed, 19 Feb 2025 05:14:58 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native In-Reply-To: References: Message-ID: On Tue, 11 Feb 2025 20:56:39 GMT, Coleen Phillimore wrote: > Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. > Tested with tier1-4 and performance tests. Just a few passing comments as this is mainly compiler stuff. Does the SA not need any updates in relation to this? src/hotspot/share/classfile/javaClasses.cpp line 1371: > 1369: #endif > 1370: set_modifiers(java_class, JVM_ACC_ABSTRACT | JVM_ACC_FINAL | JVM_ACC_PUBLIC); > 1371: set_is_primitive(java_class); Just wondering what the comments at the start of this method are alluding to now that we do have a field at the Java level. ??? src/hotspot/share/prims/jvm.cpp line 1262: > 1260: JVM_END > 1261: > 1262: JVM_ENTRY(jboolean, JVM_IsArrayClass(JNIEnv *env, jclass cls)) Where are the changes to jvm.h? src/java.base/share/classes/java/lang/Class.java line 1009: > 1007: private transient Object classData; // Set by VM > 1008: private transient Object[] signers; // Read by VM, mutable > 1009: private final transient char modifiers; // Set by the VM Why the change of type here? 
------------- PR Review: https://git.openjdk.org/jdk/pull/23572#pullrequestreview-2625638624 PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1960955739 PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1960959718 PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1960960668 From epeter at openjdk.org Wed Feb 19 07:19:56 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 19 Feb 2025 07:19:56 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 19:18:34 GMT, Vladimir Kozlov wrote: >> Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below. >> >> **Background** >> >> With `-XX:+AlignVector`, all vector loads/stores must be aligned. We try to statically determine if we can always align the vectors. One condition is that the address `base` is already aligned. For arrays, we know that this always holds, because they are `ObjectAlignmentInBytes` aligned. But with native memory, the `base` is just some arbitrarily aligned pointer. >> >> **Problem** >> >> So far, we have just naively assumed that the `base` is always `ObjectAlignmentInBytes` aligned. But that does not hold for `native` memory segments: the `base` can also be unaligned. I had constructed such an example, and with `-XX:+AlignVector -XX:+VerifyAlignVector` this example hits the verification code. >> >> >> MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1); >> MemorySegment nativeUnaligned = nativeAligned.asSlice(1); >> test3(nativeUnaligned); >> >> >> When compiling the test method, we assume that the `nativeUnaligned.address()` is aligned - but it is not! 
>> >> static void test3(MemorySegment ms) { >> for (int i = 0; i < RANGE; i++) { >> long adr = i * 4L; >> int v = ms.get(ELEMENT_LAYOUT, adr); >> ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1)); >> } >> } >> >> >> **Solution: Runtime Checks - Predicate and Multiversioning** >> >> Of course we could just forbid cases where we have a `native` base from vectorizing. But that would lead to regressions currently - in most cases we do get aligned `base`s, and we currently vectorize those. We cannot statically determine if the `base` is aligned, we need a runtime check. >> >> I came up with 2 options where to place the runtime checks: >> - A new "auto vectorization" Parse Predicate: >> - This only works when predicates are available. >> - If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop. >> - Multiversion the loop: >> - Create 2 copies of the loop (fast and slow loops). >> - The `fast_loop` can make speculative alignment assumptions, and add the corresponding check to the `multiversion_if` which decides which loop we take >> - In the `slow_loop`, we make no assumption which means we can not vectorize, but we still compile - so even ... > > About actual probability value. I was thinking PROB_LIKELY_MAG(3). PROB_LIKELY_MAG(1) will only guarantee that vectorized loop will be first but it could be enough without moving other loop from hot path. Needs testing. @vnkozlov I suggest that I change the probability to something quite low now, just to make sure that the fast-loop is placed nicely. When I do the experiments for aliasing-analysis runtime-checks, then I will be able to benchmark much better for both cases, since it is much easier to create many different cases. At that point, I could still adapt the probabilities to a different constant. 
Or maybe I can somehow adjust the probabilities in the chain such that they are balanced. Like if there is 1 condition, give it `0.5`, if there are 2 give them each `sqrt(0.5)`, if there are `n` then `pow(0.5, 1/n)`, so that once you multiply them you get `pow(pow(0.5, 1/n),n) = 0.5`. We could also set another "target" probability than `0.5`. The issue is that experimenting now is a little difficult, because I only have the alignment-checks to play with, which are really really rare to fail in the "real world", I think. But aliasing-checks are more likely to fail, so there could be more interesting benchmark results there. Does that sound ok? > Can we profile alignment in Interpreter (and C1)? It would be nice if we could profile alignment or aliasing. Maybe that is possible. But I suppose there are always cases where profiling is not available (Xcomp ?), and we should have reasonable defaults there. We could investigate profiling in a second step, to improve things if we think that is worth it. Profiling these things would also be additional complexity - I'm not convinced yet it is worth it. What do you think? ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2667703955 From epeter at openjdk.org Wed Feb 19 07:42:52 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 19 Feb 2025 07:42:52 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v2] In-Reply-To: References: Message-ID: > Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below. > > **Background** > > With `-XX:+AlignVector`, all vector loads/stores must be aligned. We try to statically determine if we can always align the vectors. One condition is that the address `base` is already aligned. For arrays, we know that this always holds, because they are `ObjectAlignmentInBytes` aligned. 
But with native memory, the `base` is just some arbitrarily aligned pointer. > > **Problem** > > So far, we have just naively assumed that the `base` is always `ObjectAlignmentInBytes` aligned. But that does not hold for `native` memory segments: the `base` can also be unaligned. I had constructed such an example, and with `-XX:+AlignVector -XX:+VerifyAlignVector` this example hits the verification code. > > > MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1); > MemorySegment nativeUnaligned = nativeAligned.asSlice(1); > test3(nativeUnaligned); > > > When compiling the test method, we assume that the `nativeUnaligned.address()` is aligned - but it is not! > > static void test3(MemorySegment ms) { > for (int i = 0; i < RANGE; i++) { > long adr = i * 4L; > int v = ms.get(ELEMENT_LAYOUT, adr); > ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1)); > } > } > > > **Solution: Runtime Checks - Predicate and Multiversioning** > > Of course we could just forbid cases where we have a `native` base from vectorizing. But that would lead to regressions currently - in most cases we do get aligned `base`s, and we currently vectorize those. We cannot statically determine if the `base` is aligned, we need a runtime check. > > I came up with 2 options where to place the runtime checks: > - A new "auto vectorization" Parse Predicate: > - This only works when predicates are available. > - If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop. > - Multiversion the loop: > - Create 2 copies of the loop (fast and slow loops). 
> - The `fast_loop` can make speculative alignment assumptions, and add the corresponding check to the `multiversion_if` which decides which loop we take > - In the `slow_loop`, we make no assumption which means we can not vectorize, but we still compile - so even unaligned `base`s would end up with reasonably fast code. > - We "stall" the `... Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 63 commits: - Merge branch 'master' into JDK-8323582-SW-native-alignment - remove multiversion mark if we break the structure - register opaque with igvn - copyright and rm CFG check - IR rules for all cases - 3 test versions - test changed to unaligned ints - stub for slicing - add Verify/AlignVector runs to test - refactor verify - ... and 53 more: https://git.openjdk.org/jdk/compare/9042aa82...a98ffabf ------------- Changes: https://git.openjdk.org/jdk/pull/22016/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=22016&range=01 Stats: 1074 lines in 27 files changed: 951 ins; 28 del; 95 mod Patch: https://git.openjdk.org/jdk/pull/22016.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22016/head:pull/22016 PR: https://git.openjdk.org/jdk/pull/22016 From adinn at openjdk.org Wed Feb 19 10:44:55 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Wed, 19 Feb 2025 10:44:55 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v2] In-Reply-To: References: <7UgNYEuTu6rj7queOgM9xIy-6kQMdACrZiDLtlniMYw=.dff6f18b-1236-43b1-8280-2bce9160f32a@github.com> Message-ID: On Tue, 4 Feb 2025 18:57:28 GMT, Ferenc Rakoczi wrote: >>> @ferakocz I'm afraid you lucked out on getting your change committed before my reorganization of the stub generation code. If you are unsure of how to do the merge so your new stub is declared and generated following the new model (see the doc comments in stubDeclarations.hpp for details) let me know and I'll be happy to help you sort it out. 
>> >> @adinn I think I managed to figure it out. Please take a look at the PR and let me know if I should have done anything differently. > >> @ferakocz Yes, the stub declaration part of it looks to be correct. >> >> The rest of the patch will need at least two reviewers (@theRealAph? @martinuy? @franferrax) and may take some time to review, given that they will probably need to read up on the maths and algorithms. As an aid for reviewers and maintainers it would be good to insert a comment into the generator file linking the implementations to the relevant maths and algorithm. I found the FIPS-204 spec and the CRYSTALS-Dilithium Algorithm Specifications and Supporting Documentation paper, Shi Bai, Léo Ducas et al, 2021 - are they the best ones to look at? > > The Java implementation of ML-DSA is based on the FIPS-204 standard and the intrinsics' implementations are based on the corresponding Java methods, except that the montMul() calls in them are inlined. The rest of the transformation from Java code to intrinsic code is pretty straightforward, so a reviewer need not necessarily understand the whole mathematics of the ML-DSA algorithms, just that the Java and the corresponding intrinsic code do the same thing. @ferakocz Apologies for the delays in reviewing and the limited feedback up to now. The code clearly does the job well but I think it would be made clearer and easier to maintain by tweaking/extending some of the generator methods and adding more detailed commenting. I am afraid I may take a few days to provide the relevant details because of other commitments. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23300#issuecomment-2668251335 From roland at openjdk.org Wed Feb 19 12:14:56 2025 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 19 Feb 2025 12:14:56 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v2] In-Reply-To: References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> Message-ID: <5hd7BMjze01r6SZOvQ_Ogf_XV1UekB_mYQbpR5_Wzms=.a911ee76-094f-477c-8d24-564c4f0c39d3@github.com> On Tue, 18 Feb 2025 17:20:23 GMT, Emanuel Peter wrote: > Right. I suppose code size might be slightly affected. But I only multi-version if we are already going to pre-main-post the loop. And that means that the loop is already copied 3x, and doing 4x is not that noticeable I would suspect. Wouldn't usual optimizations be applied to the slow loop as well (pre/main/post, unrolling)? ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2668476997 From epeter at openjdk.org Wed Feb 19 13:08:56 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 19 Feb 2025 13:08:56 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v2] In-Reply-To: <5hd7BMjze01r6SZOvQ_Ogf_XV1UekB_mYQbpR5_Wzms=.a911ee76-094f-477c-8d24-564c4f0c39d3@github.com> References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> <5hd7BMjze01r6SZOvQ_Ogf_XV1UekB_mYQbpR5_Wzms=.a911ee76-094f-477c-8d24-564c4f0c39d3@github.com> Message-ID: On Wed, 19 Feb 2025 12:12:27 GMT, Roland Westrelin wrote: > > Right. I suppose code size might be slightly affected. But I only multi-version if we are already going to pre-main-post the loop. And that means that the loop is already copied 3x, and doing 4x is not that noticeable I would suspect. 
> > Wouldn't usual optimizations be applied to the slow loop as well (pre/main/post, unrolling)? That is what I'm avoiding by `stalling` the slow-loop ;) I only `un-stall` the slow-loop if we actually add a check to the multiversion-if, and at that point we do care about the slow-loop. Does that make sense? ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2668601537 From roland at openjdk.org Wed Feb 19 13:20:56 2025 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 19 Feb 2025 13:20:56 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v2] In-Reply-To: References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> <5hd7BMjze01r6SZOvQ_Ogf_XV1UekB_mYQbpR5_Wzms=.a911ee76-094f-477c-8d24-564c4f0c39d3@github.com> Message-ID: On Wed, 19 Feb 2025 13:06:02 GMT, Emanuel Peter wrote: > That is what I'm avoiding by `stalling` the slow-loop ;) I only `un-stall` the slow-loop if we actually add a check to the multiversion-if, and at that point we do care about the slow-loop. So if the slow loop is kept, it's fully optimized (other than what misaligned accesses prevent)? 
------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2668625485 From epeter at openjdk.org Wed Feb 19 13:20:57 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 19 Feb 2025 13:20:57 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v2] In-Reply-To: References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> <5hd7BMjze01r6SZOvQ_Ogf_XV1UekB_mYQbpR5_Wzms=.a911ee76-094f-477c-8d24-564c4f0c39d3@github.com> Message-ID: On Wed, 19 Feb 2025 13:15:46 GMT, Roland Westrelin wrote: > > > That is what I'm avoiding by `stalling` the slow-loop ;) I only `un-stall` the slow-loop if we actually add a check to the multiversion-if, and at that point we do care about the slow-loop. > > > > > > So if the slow loop is kept, it's fully optimized (other than what misaligned accesses prevent)? > > Exactly. 
In a sense that would give you similar results as with unswitching, where we also possibly optimize both branches / loops. So the overhead in the final code is 2x: we can expect the fast and slow paths to be about the same size so the section of code for the loop would see its size grow by 2x. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2668653066 From coleenp at openjdk.org Wed Feb 19 13:54:55 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Wed, 19 Feb 2025 13:54:55 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native In-Reply-To: References: <6EpQLprXKfUDUQ6UIl0Vo0M5OPmCJ4SjcnOeprbO40w=.7d6cd0d3-ec59-4935-adb9-484764f0235c@github.com> Message-ID: On Wed, 19 Feb 2025 02:54:36 GMT, Chen Liang wrote: >> I don't know if we have a style guide that covers this, but I believe the method and field could both be named `isPrimitive`. > > I would personally name such a boolean field `primitive`, but I don't have a strong preference on the field naming as long as its references in tests and other locations are correct. In addition, I believe this field may soon be widened to carry more hotspot-specific flags (such as hidden, etc.) so the name is bound to change. I like 'primitive'. 'hidden' is also a possibility to add to this and give it the same treatment. I didn't do that one here to limit the changes and I haven't seen all the calls to isHidden so would need to find out how to measure the effects of that change. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1961722833 From rriggs at openjdk.org Wed Feb 19 15:12:59 2025 From: rriggs at openjdk.org (Roger Riggs) Date: Wed, 19 Feb 2025 15:12:59 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native In-Reply-To: References: Message-ID: On Tue, 11 Feb 2025 20:56:39 GMT, Coleen Phillimore wrote: > Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. > Tested with tier1-4 and performance tests. Is the change to isInterface and isPrimitive performance neutral? As @IntrinsicCandidates, there would be some performance gain. src/hotspot/share/prims/jvm.cpp line 2284: > 2282: // Please, refer to the description in the jvmtiThreadState.hpp. > 2283: > 2284: JVM_ENTRY(jboolean, JVM_IsInterface(JNIEnv *env, jclass cls)) JVM_IsInterface is deleted in Class.c, what purpose is this? ------------- PR Review: https://git.openjdk.org/jdk/pull/23572#pullrequestreview-2627122068 PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1961858757 From rriggs at openjdk.org Wed Feb 19 15:15:57 2025 From: rriggs at openjdk.org (Roger Riggs) Date: Wed, 19 Feb 2025 15:15:57 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native In-Reply-To: References: Message-ID: On Tue, 11 Feb 2025 20:56:39 GMT, Coleen Phillimore wrote: > Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. > Tested with tier1-4 and performance tests. 
src/java.base/share/classes/java/lang/Class.java line 807: > 805: */ > 806: public boolean isArray() { > 807: return componentType != null; The componentType declaration should have a comment indicating that == null is the sole indication that the class is not an array. Perhaps there should be an assert somewhere validating/cross checking that requirement. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1961869286 From epeter at openjdk.org Wed Feb 19 15:25:56 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 19 Feb 2025 15:25:56 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v2] In-Reply-To: References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> <5hd7BMjze01r6SZOvQ_Ogf_XV1UekB_mYQbpR5_Wzms=.a911ee76-094f-477c-8d24-564c4f0c39d3@github.com> Message-ID: On Wed, 19 Feb 2025 13:26:37 GMT, Roland Westrelin wrote: > So the overhead in the final code is 2x: we can expect the fast and slow paths to be about the same size so the section of code for the loop would see its size grow by 2x. Yes, if you get to the point where you add a multi-version-if condition, i.e. where SuperWord has decided it needs a speculative assumption (here for alignment, later for aliasing), then we get the whole loop 2x. I suppose we could try to make the pre-main-post loop more complicated and just multi-version the main-loop, but that sounds much more complicated. Do you see any better way than having the 2x code size if we need both a slow and fast loop? 
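[Editor's note] The fast/slow multiversioning being discussed can be written out at the source level as a sketch (illustrative only — the real transform happens on C2's loop IR, and `isAligned`/`OBJECT_ALIGNMENT` here are stand-ins for the speculative check added to the multiversion_if):

```java
// Conceptual shape of a multiversioned loop: the multiversion_if performs a
// speculative runtime check; only the fast version may assume alignment and
// vectorize, while the slow version stays scalar but still compiles.
class MultiversionSketch {
    static final int OBJECT_ALIGNMENT = 8; // stand-in for ObjectAlignmentInBytes

    static boolean isAligned(long baseAddress) {
        return (baseAddress % OBJECT_ALIGNMENT) == 0;
    }

    static long sum(long baseAddress, int[] data) {
        long total = 0;
        if (isAligned(baseAddress)) {
            // fast_loop: alignment assumption holds -> eligible for vectorization
            for (int i = 0; i < data.length; i++) total += data[i];
        } else {
            // slow_loop: no assumptions, scalar code, but still reasonably fast
            for (int i = 0; i < data.length; i++) total += data[i];
        }
        return total;
    }
}
```

Both versions must compute the same result; the check only selects which assumptions the compiled body may rely on, which is why keeping both roughly doubles the code for this loop.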
------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2668974247 From liach at openjdk.org Wed Feb 19 15:45:56 2025 From: liach at openjdk.org (Chen Liang) Date: Wed, 19 Feb 2025 15:45:56 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native In-Reply-To: References: Message-ID: <2sugnK5bK-SWGVluAWw-UNTKKkErTTNYTxCk7t0mOGo=.3734936f-7a10-48ec-8901-01ece733791f@github.com> On Wed, 19 Feb 2025 05:08:36 GMT, David Holmes wrote: >> Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. >> Tested with tier1-4 and performance tests. > > src/java.base/share/classes/java/lang/Class.java line 1009: > >> 1007: private transient Object classData; // Set by VM >> 1008: private transient Object[] signers; // Read by VM, mutable >> 1009: private final transient char modifiers; // Set by the VM > > Why the change of type here? This is to improve the layout so the introduction of a boolean field does not increase the size of a Class object. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1961925828 From kvn at openjdk.org Wed Feb 19 16:08:56 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 19 Feb 2025 16:08:56 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 07:17:30 GMT, Emanuel Peter wrote: > The issue is that experimenting now is a little difficult, because I only have the alignment-checks to play with, which are really really rare to fail in the "real world", I think. But aliasing-checks are more likely to fail, so there could be more interesting benchmark results there. > > Does that sound ok? Yes, it is good plan. 
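[Editor's note] The balanced-probability idea from earlier in the thread (give each of n chained checks probability pow(0.5, 1/n) so their product is the target 0.5) is easy to sanity-check numerically; this snippet is illustrative arithmetic, not compiler code, and `perCheck` is a hypothetical name:

```java
// Per-check branch probability such that n chained checks multiply back to
// the chosen target, e.g. n checks of pow(0.5, 1/n) give a product of 0.5.
class ProbabilitySketch {
    static double perCheck(double target, int n) {
        return Math.pow(target, 1.0 / n);
    }
}
```

For n = 1 this is just the target itself, for n = 2 it is sqrt(0.5) ≈ 0.7071, and in general pow(perCheck(t, n), n) recovers t up to floating-point error, so the combined chain keeps a constant overall probability regardless of how many checks it contains.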
------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2669094347 From kvn at openjdk.org Wed Feb 19 16:18:57 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 19 Feb 2025 16:18:57 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 07:17:30 GMT, Emanuel Peter wrote: > > Can we profile alignment in Interpreter (and C1)? > > It would be nice if we could profile alignment or aliasing. Maybe that is possible. But I suppose there are always cases where profiling is not available (Xcomp ?), and we should have reasonable defaults there. We could investigate profiling in a second step, to improve things if we think that is worth it. Profiling these things would also be additional complexity - I'm not convinced yet it is worth it. > > What do you think? You should not worry about `-Xcomp` - it is a testing flag and we can use some default there. I am fine if you think profiling will not bring us much benefit. Note, I am not asking to create counters - just a bit to indicate if we had unaligned access to native memory in a method. In such a case we may skip the predicate and generate a multiversioned loop during compilation. On the other hand, we may have unaligned access only during startup and not later when we compile the method. Anyway, it does not affect these changes. I will look at the changes more later. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2669115673 From epeter at openjdk.org Wed Feb 19 16:18:57 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 19 Feb 2025 16:18:57 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 16:14:09 GMT, Vladimir Kozlov wrote: > I am fine if you think profiling will not bring us much benefits Yeah, I think it is a good assumption that we will always get aligned and non-aliasing inputs. And if that is not the case, then this is a rare case, and it should be ok to pay the price of recompilation, I think. > I will look on changes more later. Thank you :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2669122452 From liach at openjdk.org Wed Feb 19 16:21:57 2025 From: liach at openjdk.org (Chen Liang) Date: Wed, 19 Feb 2025 16:21:57 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native In-Reply-To: References: Message-ID: <-rVJ4riSt_UybCT4tvNKCBxGfrHr-xnGx0DNDZyGgsA=.11b43081-86f2-47db-b52c-5f74b8e27960@github.com> On Wed, 19 Feb 2025 03:30:04 GMT, Dean Long wrote: >> Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. >> Tested with tier1-4 and performance tests. > > src/java.base/share/classes/java/lang/Class.java line 1287: > >> 1285: */ >> 1286: public Class getComponentType() { >> 1287: // Only return for array types. Storage may be reused for Class for instance types. > > I don't see any changes to componentType related to reuse. So was this comment and the code below already obsolete? It was.
Previously, the componentType field was reused for the class initialization monitor int array, which caused problems with core reflection if a program reflectively accessed this field more than a few hundred times. See [JDK-8337622](https://bugs.openjdk.org/browse/JDK-8337622). ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1961989175 From liach at openjdk.org Wed Feb 19 16:25:55 2025 From: liach at openjdk.org (Chen Liang) Date: Wed, 19 Feb 2025 16:25:55 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native In-Reply-To: References: Message-ID: On Tue, 11 Feb 2025 20:56:39 GMT, Coleen Phillimore wrote: > Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. > Tested with tier1-4 and performance tests. Re Roger's IntrinsicCandidate remark: One behavior that might be affected would be C2's inlining preferences. Some inline-sensitive workloads like the FFM API might be affected if some Class attribute access cannot be inlined because the incoming Class object is not constant. See #23460 and #23628. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23572#issuecomment-2669138528 From coleenp at openjdk.org Wed Feb 19 17:16:02 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Wed, 19 Feb 2025 17:16:02 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native In-Reply-To: References: Message-ID: On Tue, 11 Feb 2025 20:56:39 GMT, Coleen Phillimore wrote: > Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. > Tested with tier1-4 and performance tests. Thanks for looking at this change.
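The three checks under discussion can be sketched in plain Java - hypothetical class and field names that mirror the review comments, not the actual java.lang.Class source; ACC_INTERFACE is 0x0200 per the JVM specification:

```java
public class MirrorChecks {
    static final int ACC_INTERFACE = 0x0200; // JVMS class access flag

    // Hypothetical stand-in for java.lang.Class; the VM would initialize
    // these fields when it creates a mirror.
    static final class Mirror {
        final char modifiers;       // u2-sized on purpose, to pack tightly
        final boolean primitive;    // true only for the primitive mirrors
        final Mirror componentType; // non-null only for array classes

        Mirror(char modifiers, boolean primitive, Mirror componentType) {
            this.modifiers = modifiers;
            this.primitive = primitive;
            this.componentType = componentType;
        }

        boolean isInterface() { return (modifiers & ACC_INTERFACE) != 0; }
        boolean isArray()     { return componentType != null; }
        boolean isPrimitive() { return primitive; }
    }

    public static void main(String[] args) {
        // 0x0411 = ACC_ABSTRACT | ACC_FINAL | ACC_PUBLIC, the modifiers the
        // VM sets on primitive mirrors per the set_modifiers call quoted in
        // this thread.
        Mirror intMirror = new Mirror((char) 0x0411, true, null);
        Mirror intArray = new Mirror((char) 0x0411, false, intMirror);
        System.out.println(intMirror.isPrimitive()); // true
        System.out.println(intArray.isArray());      // true
    }
}
```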
------------- PR Review: https://git.openjdk.org/jdk/pull/23572#pullrequestreview-2626906239 From coleenp at openjdk.org Wed Feb 19 17:16:04 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Wed, 19 Feb 2025 17:16:04 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 05:01:53 GMT, David Holmes wrote: >> Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. >> Tested with tier1-4 and performance tests. > > src/hotspot/share/classfile/javaClasses.cpp line 1371: > >> 1369: #endif >> 1370: set_modifiers(java_class, JVM_ACC_ABSTRACT | JVM_ACC_FINAL | JVM_ACC_PUBLIC); >> 1371: set_is_primitive(java_class); > > Just wondering what the comments at the start of this method are alluding to now that we do have a field at the Java level. I think this comment is talking about the java.lang.Class.klass field being null. Which it still is, since there's no Klass pointer for basic types. But I have no idea what the comment in ClassFileParser is about, and I don't think introducing a new Klass for primitive types is an improvement. There are comments elsewhere that the klass is null for primitive types, including at the call to java_lang_Class::is_primitive(), so this whole comment is only confusing and I'll remove it. Or change it to: // Mirrors for basic types have a null klass field, which makes them special. > src/hotspot/share/prims/jvm.cpp line 1262: > >> 1260: JVM_END >> 1261: >> 1262: JVM_ENTRY(jboolean, JVM_IsArrayClass(JNIEnv *env, jclass cls)) > > Where are the changes to jvm.h? Good catch, I also removed getProtectionDomain.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1961739084 PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1961773882 From coleenp at openjdk.org Wed Feb 19 17:16:05 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Wed, 19 Feb 2025 17:16:05 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native In-Reply-To: References: Message-ID: <_j9Wkg21aBltyVrbO4wxGFKmmLDy0T-eorRL4epfS4k=.5a453b6b-d673-4cc6-b29f-192fa74e290c@github.com> On Wed, 19 Feb 2025 02:54:05 GMT, Dean Long wrote: >> Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. >> Tested with tier1-4 and performance tests. > > src/hotspot/share/classfile/javaClasses.inline.hpp line 301: > >> 299: #ifdef ASSERT >> 300: // The heapwalker walks through Classes that have had their Klass pointers removed, so can't assert this. >> 301: // assert(is_primitive == java_class->bool_field(_is_primitive_offset), "must match what we told Java"); > > I don't understand this comment about the heapwalker. It sounds like we could have `is_primitive` set to true incorrectly. If so, what prevents the asserts below from failing? And why not use the value from _is_primitive_offset instead? This is a good question. The heapwalker walks through dead mirrors so I can't assert that a null klass field matches our boolean setting but I don't know why this never asserts (can't find any instances in the bug database) but it seems like it could. I'll use the bool field in the mirror in the assert though but not in the return since the caller likely will fetch the klass pointer next. > src/hotspot/share/prims/jvm.cpp line 2283: > >> 2281: // Otherwise it returns its argument value which is the _the_class Klass*. 
>> 2282: // Please, refer to the description in the jvmtiThreadState.hpp. >> 2283: > > Does this "RedefineClasses support" comment still belong here? I think so. The comment in jvmtiThreadState.hpp has details why this is. We do a mirror switch before verification apparently because of bug 6214132 it says. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1961770573 PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1962059680 From coleenp at openjdk.org Wed Feb 19 17:16:06 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Wed, 19 Feb 2025 17:16:06 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native In-Reply-To: <2sugnK5bK-SWGVluAWw-UNTKKkErTTNYTxCk7t0mOGo=.3734936f-7a10-48ec-8901-01ece733791f@github.com> References: <2sugnK5bK-SWGVluAWw-UNTKKkErTTNYTxCk7t0mOGo=.3734936f-7a10-48ec-8901-01ece733791f@github.com> Message-ID: On Wed, 19 Feb 2025 15:42:54 GMT, Chen Liang wrote: >> src/java.base/share/classes/java/lang/Class.java line 1009: >> >>> 1007: private transient Object classData; // Set by VM >>> 1008: private transient Object[] signers; // Read by VM, mutable >>> 1009: private final transient char modifiers; // Set by the VM >> >> Why the change of type here? > > This is to improve the layout so the introduction of a boolean field does not increase the size of a Class object. I changed modifiers to u2 so that we won't have an alignment gap with the bool isPrimitiveType flag. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1962060783 From coleenp at openjdk.org Wed Feb 19 17:16:07 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Wed, 19 Feb 2025 17:16:07 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native In-Reply-To: <-rVJ4riSt_UybCT4tvNKCBxGfrHr-xnGx0DNDZyGgsA=.11b43081-86f2-47db-b52c-5f74b8e27960@github.com> References: <-rVJ4riSt_UybCT4tvNKCBxGfrHr-xnGx0DNDZyGgsA=.11b43081-86f2-47db-b52c-5f74b8e27960@github.com> Message-ID: On Wed, 19 Feb 2025 16:19:22 GMT, Chen Liang wrote: >> src/java.base/share/classes/java/lang/Class.java line 1287: >> >>> 1285: */ >>> 1286: public Class getComponentType() { >>> 1287: // Only return for array types. Storage may be reused for Class for instance types. >> >> I don't see any changes to componentType related to reuse. So was this comment and the code below already obsolete? > > It was. Before the componentType field was reused for the class initialization monitor int array, and it caused problems with core reflection if a program reflectively accesses this field after a few hundred times. See [JDK-8337622](https://bugs.openjdk.org/browse/JDK-8337622). Yes, this comment is obsolete. We used to share the componentType mirror with an internal 'init-lock' but it caused a bug that was fixed. If it's not an array the componentType is now always null. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1962069719 From galder at openjdk.org Wed Feb 19 17:42:08 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Wed, 19 Feb 2025 17:42:08 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> Message-ID: On Fri, 7 Feb 2025 12:39:24 GMT, Galder Zamarreño wrote: >> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance. >> >> Currently vectorization does not kick in for loops containing either of these calls because of the following error: >> >> >> VLoop::check_preconditions: failed: control flow in loop not allowed >> >> >> The control flow is due to the java implementation for these methods, e.g. >> >> >> public static long max(long a, long b) { >> return (a >= b) ? a : b; >> } >> >> >> This patch intrinsifies the calls to replace the CmpL + Bool nodes for MaxL/MinL nodes respectively. >> By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization. >> E.g. >> >> >> SuperWord::transform_loop: >> Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt strip_mined >> 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21) >> >> >> Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after.
Before the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1155 >> long max 1173 >> >> >> After the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1042 >> long max 1042 >> >> >> This patch does not add an platform-specific backend implementations for the MaxL/MinL nodes. >> Therefore, it still relies on the macro expansion to transform those into CMoveL. >> >> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results: >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PA... > > Galder Zamarre?o has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 44 additional commits since the last revision: > > - Merge branch 'master' into topic.intrinsify-max-min-long > - Fix typo > - Renaming methods and variables and add docu on algorithms > - Fix copyright years > - Make sure it runs with cpus with either avx512 or asimd > - Test can only run with 256 bit registers or bigger > > * Remove platform dependant check > and use platform independent configuration instead. > - Fix license header > - Tests should also run on aarch64 asimd=true envs > - Added comment around the assertions > - Adjust min/max identity IR test expectations after changes > - ... 
and 34 more: https://git.openjdk.org/jdk/compare/75abfbc2...a190ae68

Following our discussion, I've run `MinMaxVector.long` benchmarks with superword disabled and with/without the `_maxL` intrinsic in both AVX-512 and AVX2 modes. The first thing I've observed is that lacking superword, the results with AVX-512 or AVX2 are identical, so I will just focus on AVX-512 results below.

Benchmark                              (probability)  (range)  (seed)  (size)  Mode  Cnt     -maxL      +maxL  Units
MinMaxVector.longClippingRange                   N/A       90       0    1000  thrpt   4  1012.017  1011.8109  ops/ms
MinMaxVector.longClippingRange                   N/A      100       0    1000  thrpt   4  1012.113  1011.9530  ops/ms
MinMaxVector.longLoopMax                          50      N/A     N/A    2048  thrpt   4   463.946   473.9408  ops/ms
MinMaxVector.longLoopMax                          80      N/A     N/A    2048  thrpt   4   465.391   473.8063  ops/ms
MinMaxVector.longLoopMax                         100      N/A     N/A    2048  thrpt   4   510.992   471.6280  ops/ms (-8%)
MinMaxVector.longLoopMin                          50      N/A     N/A    2048  thrpt   4   496.036   495.3142  ops/ms
MinMaxVector.longLoopMin                          80      N/A     N/A    2048  thrpt   4   495.797   497.1214  ops/ms
MinMaxVector.longLoopMin                         100      N/A     N/A    2048  thrpt   4   495.302   495.1535  ops/ms
MinMaxVector.longReductionMultiplyMax             50      N/A     N/A    2048  thrpt   4   405.495   405.3936  ops/ms
MinMaxVector.longReductionMultiplyMax             80      N/A     N/A    2048  thrpt   4   405.342   405.4505  ops/ms
MinMaxVector.longReductionMultiplyMax            100      N/A     N/A    2048  thrpt   4   846.492   405.4779  ops/ms (-52%)
MinMaxVector.longReductionMultiplyMin             50      N/A     N/A    2048  thrpt   4   414.755   414.7036  ops/ms
MinMaxVector.longReductionMultiplyMin             80      N/A     N/A    2048  thrpt   4   414.705   414.7093  ops/ms
MinMaxVector.longReductionMultiplyMin            100      N/A     N/A    2048  thrpt   4   414.761   414.7150  ops/ms
MinMaxVector.longReductionSimpleMax               50      N/A     N/A    2048  thrpt   4   460.435   460.3764  ops/ms
MinMaxVector.longReductionSimpleMax               80      N/A     N/A    2048  thrpt   4   460.438   460.4718  ops/ms
MinMaxVector.longReductionSimpleMax              100      N/A     N/A    2048  thrpt   4  1023.005   460.5417  ops/ms (-55%)
MinMaxVector.longReductionSimpleMin               50      N/A     N/A    2048  thrpt   4   459.184   459.1662  ops/ms
MinMaxVector.longReductionSimpleMin               80      N/A     N/A    2048  thrpt   4   459.265   459.2588  ops/ms
MinMaxVector.longReductionSimpleMin              100      N/A     N/A    2048  thrpt   4   459.263   459.1304  ops/ms

`longLoopMax at 100%`, `longReductionMultiplyMax at 100%` and `longReductionSimpleMax at 100%` are regressions with the `_maxL` intrinsic. The cause is familiar: without the intrinsic, cmp+mov are emitted, while with the intrinsic and the conditions above, `cmov` is emitted:

# `longLoopMax` @ 100%

-maxL:

  4.18%  0x00007fb7580f84b2: cmpq %r13, %r11
         0x00007fb7580f84b5: jl 0x7fb7580f84ec     ;*lreturn {reexecute=0 rethrow=0 return_oop=0}
                                                   ; - java.lang.Math::max at 11 (line 2038)
                                                   ; - org.openjdk.bench.java.lang.MinMaxVector::longLoopMax at 27 (line 256)
                                                   ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub at 19 (line 124)
  4.23%  0x00007fb7580f84bb: movq %r11, 0x10(%rbp, %rsi, 8)
                                                   ;*lastore {reexecute=0 rethrow=0 return_oop=0}
                                                   ; - org.openjdk.bench.java.lang.MinMaxVector::longLoopMax at 30 (line 256)
                                                   ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub at 19 (line 124)

+maxL:

  1.06%  0x00007fe1b40f5ed1: movq 0x20(%rbx, %r10, 8), %r14
                                                   ;*laload {reexecute=0 rethrow=0 return_oop=0}
                                                   ; - org.openjdk.bench.java.lang.MinMaxVector::longLoopMax at 26 (line 256)
                                                   ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub at 19 (line 124)
  1.34%  0x00007fe1b40f5ed6: cmpq %r14, %r9
  2.78%  0x00007fe1b40f5ed9: cmovlq %r14, %r9
  2.58%  0x00007fe1b40f5edd: movq %r9, 0x20(%rax, %r10, 8)
                                                   ;*lastore {reexecute=0 rethrow=0 return_oop=0}
                                                   ; - org.openjdk.bench.java.lang.MinMaxVector::longLoopMax at 30 (line 256)
                                                   ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub at 19 (line 124)

# `longReductionMultiplyMax` @ 100%

-maxL:

  6.71%  0x00007f8af40f6278: imulq $0xb, 0x18(%r14, %r8, 8), %rdx
                                                   ;*lmul {reexecute=0 rethrow=0 return_oop=0}
                                                   ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMultiplyMax at 24 (line 285)
                                                   ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMultiplyMax_jmhTest::longReductionMultiplyMax_thrpt_jmhStub at 19 (line 124)
  5.28%  0x00007f8af40f627e: nop
 10.23%  0x00007f8af40f6280: cmpq %rdx, %rdi
         0x00007f8af40f6283: jge 0x7f8af40f62a7    ;*lreturn {reexecute=0 rethrow=0 return_oop=0}
                                                   ; - java.lang.Math::max at 11 (line 2038)
                                                   ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMultiplyMax at 30 (line 286)
                                                   ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMultiplyMax_jmhTest::longReductionMultiplyMax_thrpt_jmhStub at 19 (line 124)

+maxL:

 11.07%  0x00007f47000f5c4d: imulq $0xb, 0x18(%r14, %r11, 8), %rax
                                                   ;*lmul {reexecute=0 rethrow=0 return_oop=0}
                                                   ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMultiplyMax at 24 (line 285)
                                                   ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMultiplyMax_jmhTest::longReductionMultiplyMax_thrpt_jmhStub at 19 (line 124)
  0.07%  0x00007f47000f5c53: cmpq %rdx, %rax
 11.87%  0x00007f47000f5c56: cmovlq %rdx, %rax     ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0}
                                                   ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMultiplyMax at 30 (line 286)
                                                   ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMultiplyMax_jmhTest::longReductionMultiplyMax_thrpt_jmhStub at 19 (line 124)

# `longReductionSimpleMax` @ 100%

-maxL:

  5.71%  0x00007fc2380f75f9: movq 0x20(%r14, %r8, 8), %rdi
                                                   ;*laload {reexecute=0 rethrow=0 return_oop=0}
                                                   ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionSimpleMax at 20 (line 295)
                                                   ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionSimpleMax_jmhTest::longReductionSimpleMax_thrpt_jmhStub at 19 (line 124)
  1.85%  0x00007fc2380f75fe: nop
  4.52%  0x00007fc2380f7600: cmpq %rdi, %rdx
         0x00007fc2380f7603: jge 0x7fc2380f7667    ;*lreturn {reexecute=0 rethrow=0 return_oop=0}
                                                   ; - java.lang.Math::max at 11 (line 2038)
                                                   ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionSimpleMax at 26 (line 296)
                                                   ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionSimpleMax_jmhTest::longReductionSimpleMax_thrpt_jmhStub at 19 (line 124)

+maxL:

  3.06%  0x00007fa6d00f6020: movq 0x70(%r14, %r11, 8), %r8
                                                   ;*laload {reexecute=0 rethrow=0 return_oop=0}
                                                   ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionSimpleMax at 20 (line 295)
                                                   ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionSimpleMax_jmhTest::longReductionSimpleMax_thrpt_jmhStub at 19 (line 124)
         0x00007fa6d00f6025: cmpq %r8, %r13
  2.88%  0x00007fa6d00f6028: cmovlq %r8, %r13      ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0}
                                                   ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionSimpleMax at 26 (line 296)
                                                   ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionSimpleMax_jmhTest::longReductionSimpleMax_thrpt_jmhStub at 19 (line 124)

------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2669329851 From galder at openjdk.org Wed Feb 19 17:47:06 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Wed, 19 Feb 2025 17:47:06 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> Message-ID: On Fri, 7 Feb 2025 12:39:24 GMT, Galder Zamarreño wrote: >> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance. >> >> Currently vectorization does not kick in for loops containing either of these calls because of the following error: >> >> >> VLoop::check_preconditions: failed: control flow in loop not allowed >> >> >> The control flow is due to the java implementation for these methods, e.g. >> >> >> public static long max(long a, long b) { >> return (a >= b) ? a : b; >> } >> >> >> This patch intrinsifies the calls to replace the CmpL + Bool nodes for MaxL/MinL nodes respectively. >> By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization. >> E.g. >> >> >> SuperWord::transform_loop: >> Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt strip_mined >> 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21) >> >> >> Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after.
Before the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1155 >> long max 1173 >> >> >> After the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1042 >> long max 1042 >> >> >> This patch does not add an platform-specific backend implementations for the MaxL/MinL nodes. >> Therefore, it still relies on the macro expansion to transform those into CMoveL. >> >> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results: >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PA... > > Galder Zamarre?o has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 44 additional commits since the last revision: > > - Merge branch 'master' into topic.intrinsify-max-min-long > - Fix typo > - Renaming methods and variables and add docu on algorithms > - Fix copyright years > - Make sure it runs with cpus with either avx512 or asimd > - Test can only run with 256 bit registers or bigger > > * Remove platform dependant check > and use platform independent configuration instead. > - Fix license header > - Tests should also run on aarch64 asimd=true envs > - Added comment around the assertions > - Adjust min/max identity IR test expectations after changes > - ... 
and 34 more: https://git.openjdk.org/jdk/compare/557d790a...a190ae68 I will run a comparison next with the same batch of tests but looking at `int` and see if there are any differences compared with `long` or not. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2669342758 From coleenp at openjdk.org Wed Feb 19 18:40:36 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Wed, 19 Feb 2025 18:40:36 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v2] In-Reply-To: References: Message-ID: > Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. > Tested with tier1-4 and performance tests. Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: Code review comments. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23572/files - new: https://git.openjdk.org/jdk/pull/23572/files/2d9b9ff5..3e731b9f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23572&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23572&range=00-01 Stats: 17 lines in 3 files changed: 3 ins; 10 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/23572.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23572/head:pull/23572 PR: https://git.openjdk.org/jdk/pull/23572 From coleenp at openjdk.org Wed Feb 19 18:40:37 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Wed, 19 Feb 2025 18:40:37 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v2] In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 15:07:57 GMT, Roger Riggs wrote: >> Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: >> >> Code review comments. 
> > src/hotspot/share/prims/jvm.cpp line 2284: > >> 2282: // Please, refer to the description in the jvmtiThreadState.hpp. >> 2283: >> 2284: JVM_ENTRY(jboolean, JVM_IsInterface(JNIEnv *env, jclass cls)) > > JVM_IsInterface is deleted in Class.c, what purpose is this? The old classfile verifier uses JVM_IsInterface. > src/java.base/share/classes/java/lang/Class.java line 807: > >> 805: */ >> 806: public boolean isArray() { >> 807: return componentType != null; > > The componentType declaration should have a comment indicating that a non-null value is the sole indication that the class is an array. > Perhaps there should be an assert somewhere validating/cross checking that requirement. I added an assert for set_component_mirror() in the vm, but I don't see how to assert it in Java. Is the comment like: // A non-null componentType is the sole indication that the class is an array; see isArray() ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1962078501 PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1962186820 From coleenp at openjdk.org Wed Feb 19 17:16:05 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Wed, 19 Feb 2025 17:16:05 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v2] In-Reply-To: References: <-rVJ4riSt_UybCT4tvNKCBxGfrHr-xnGx0DNDZyGgsA=.11b43081-86f2-47db-b52c-5f74b8e27960@github.com> Message-ID: <3orjlwIP5PIjb_UBpCUiIV7ZM1U_5BJfZws3PCleKhw=.55438aa0-1c98-476f-b1db-56672a1bbe4a@github.com> On Wed, 19 Feb 2025 17:10:09 GMT, Coleen Phillimore wrote: >> It was. Before the componentType field was reused for the class initialization monitor int array, and it caused problems with core reflection if a program reflectively accesses this field after a few hundred times. See [JDK-8337622](https://bugs.openjdk.org/browse/JDK-8337622). > > Yes, this comment is obsolete.
We used to share the componentType mirror with an internal 'init-lock' but it caused a bug that was fixed. If it's not an array the componentType is now always null. So for JDK 8 and 21+, the init_lock and componentType are not shared. In JDK 11 and 17, Hotspot shares the fields, but it's not observable with the older implementation of reflection. See https://bugs.openjdk.org/browse/JDK-8337622. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1962189932 From coleenp at openjdk.org Wed Feb 19 18:42:56 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Wed, 19 Feb 2025 18:42:56 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native In-Reply-To: References: Message-ID: On Tue, 11 Feb 2025 20:56:39 GMT, Coleen Phillimore wrote: > Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. > Tested with tier1-4 and performance tests. I ran our standard set of benchmarks on this change with no differences in performance. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23572#issuecomment-2669470645 From eastigeevich at openjdk.org Wed Feb 19 19:54:05 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Wed, 19 Feb 2025 19:54:05 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> Message-ID: On Wed, 19 Feb 2025 17:43:54 GMT, Galder Zamarreño wrote: >> Galder Zamarreño has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains 44 additional commits since the last revision:
>>
>> - Merge branch 'master' into topic.intrinsify-max-min-long
>> - Fix typo
>> - Renaming methods and variables and add docu on algorithms
>> - Fix copyright years
>> - Make sure it runs with cpus with either avx512 or asimd
>> - Test can only run with 256 bit registers or bigger
>>
>>   * Remove platform dependant check and use platform independent configuration instead.
>>
>> - Fix license header
>> - Tests should also run on aarch64 asimd=true envs
>> - Added comment around the assertions
>> - Adjust min/max identity IR test expectations after changes
>> - ... and 34 more: https://git.openjdk.org/jdk/compare/384bab03...a190ae68
>
> I will run a comparison next with the same batch of tests but looking at `int` and see if there are any differences compared with `long` or not.

Hi @galderz,
Results from Graviton 3 (Neoverse-V1).
Without the patch:

Benchmark                        (probability)  (range)  (seed)  (size)   Mode  Cnt      Score     Error   Units
MinMaxVector.intClippingRange              N/A       90       0    1000  thrpt    8  12565.427 ±  37.538  ops/ms
MinMaxVector.intClippingRange              N/A      100       0    1000  thrpt    8  12462.072 ±  84.067  ops/ms
MinMaxVector.intLoopMax                     50      N/A     N/A    2048  thrpt    8   5113.090 ±  68.720  ops/ms
MinMaxVector.intLoopMax                     80      N/A     N/A    2048  thrpt    8   5129.857 ±  35.005  ops/ms
MinMaxVector.intLoopMax                    100      N/A     N/A    2048  thrpt    8   5116.081 ±   8.946  ops/ms
MinMaxVector.intLoopMin                     50      N/A     N/A    2048  thrpt    8   6174.544 ±  52.573  ops/ms
MinMaxVector.intLoopMin                     80      N/A     N/A    2048  thrpt    8   6110.884 ±  54.447  ops/ms
MinMaxVector.intLoopMin                    100      N/A     N/A    2048  thrpt    8   6178.661 ±  48.450  ops/ms
MinMaxVector.intReductionMax                50      N/A     N/A    2048  thrpt    8   5109.270 ±  10.525  ops/ms
MinMaxVector.intReductionMax                80      N/A     N/A    2048  thrpt    8   5123.426 ±  28.229  ops/ms
MinMaxVector.intReductionMax               100      N/A     N/A    2048  thrpt    8   5133.799 ±   7.693  ops/ms
MinMaxVector.intReductionMin                50      N/A     N/A    2048  thrpt    8   5130.209 ±  15.491  ops/ms
MinMaxVector.intReductionMin                80      N/A     N/A    2048  thrpt    8   5127.823 ±  27.767  ops/ms
MinMaxVector.intReductionMin               100      N/A     N/A    2048  thrpt    8   5118.217 ±  22.186  ops/ms
MinMaxVector.longClippingRange             N/A       90       0    1000  thrpt    8   1831.026 ±  15.502  ops/ms
MinMaxVector.longClippingRange             N/A      100       0    1000  thrpt    8   1827.194 ±  22.076  ops/ms
MinMaxVector.longLoopMax                    50      N/A     N/A    2048  thrpt    8   2643.383 ±   9.830  ops/ms
MinMaxVector.longLoopMax                    80      N/A     N/A    2048  thrpt    8   2640.417 ±   7.797  ops/ms
MinMaxVector.longLoopMax                   100      N/A     N/A    2048  thrpt    8   1244.321 ±   1.001  ops/ms
MinMaxVector.longLoopMin                    50      N/A     N/A    2048  thrpt    8   3239.234 ±   8.813  ops/ms
MinMaxVector.longLoopMin                    80      N/A     N/A    2048  thrpt    8   3252.713 ±   3.446  ops/ms
MinMaxVector.longLoopMin                   100      N/A     N/A    2048  thrpt    8   1204.370 ±  10.537  ops/ms
MinMaxVector.longReductionMax               50      N/A     N/A    2048  thrpt    8   2536.322 ±   0.127  ops/ms
MinMaxVector.longReductionMax               80      N/A     N/A    2048  thrpt    8   2536.318 ±   0.277  ops/ms
MinMaxVector.longReductionMax              100      N/A     N/A    2048  thrpt    8   1395.273 ±  13.862  ops/ms
MinMaxVector.longReductionMin               50      N/A     N/A    2048  thrpt    8   2536.325 ±   0.146  ops/ms
MinMaxVector.longReductionMin               80      N/A     N/A    2048  thrpt    8   2536.265 ±   0.272  ops/ms
MinMaxVector.longReductionMin              100      N/A     N/A    2048  thrpt    8   1389.982 ±   5.345  ops/ms

With the patch:

Benchmark                        (probability)  (range)  (seed)  (size)   Mode  Cnt      Score     Error   Units
MinMaxVector.intClippingRange              N/A       90       0    1000  thrpt    8  12598.201 ±  52.631  ops/ms
MinMaxVector.intClippingRange              N/A      100       0    1000  thrpt    8  12555.284 ±  62.472  ops/ms
MinMaxVector.intLoopMax                     50      N/A     N/A    2048  thrpt    8   5079.499 ±  16.392  ops/ms
MinMaxVector.intLoopMax                     80      N/A     N/A    2048  thrpt    8   5100.673 ±  30.376  ops/ms
MinMaxVector.intLoopMax                    100      N/A     N/A    2048  thrpt    8   5082.544 ±  23.540  ops/ms
MinMaxVector.intLoopMin                     50      N/A     N/A    2048  thrpt    8   6137.512 ±  30.198  ops/ms
MinMaxVector.intLoopMin                     80      N/A     N/A    2048  thrpt    8   6136.233 ±   7.726  ops/ms
MinMaxVector.intLoopMin                    100      N/A     N/A    2048  thrpt    8   6142.262 ±  96.510  ops/ms
MinMaxVector.intReductionMax                50      N/A     N/A    2048  thrpt    8   5116.055 ±  23.270  ops/ms
MinMaxVector.intReductionMax                80      N/A     N/A    2048  thrpt    8   5111.481 ±  12.236  ops/ms
MinMaxVector.intReductionMax               100      N/A     N/A    2048  thrpt    8   5106.367 ±   9.035  ops/ms
MinMaxVector.intReductionMin                50      N/A     N/A    2048  thrpt    8   5115.666 ±  15.539  ops/ms
MinMaxVector.intReductionMin                80      N/A     N/A    2048  thrpt    8   5133.127 ±   4.918  ops/ms
MinMaxVector.intReductionMin               100      N/A     N/A    2048  thrpt    8   5120.469 ±  24.355  ops/ms
MinMaxVector.longClippingRange             N/A       90       0    1000  thrpt    8   5094.259 ±  14.092  ops/ms
MinMaxVector.longClippingRange             N/A      100       0    1000  thrpt    8   5096.835 ±  16.517  ops/ms
MinMaxVector.longLoopMax                    50      N/A     N/A    2048  thrpt    8   2636.438 ±  18.760  ops/ms
MinMaxVector.longLoopMax                    80      N/A     N/A    2048  thrpt    8   2644.069 ±   3.933  ops/ms
MinMaxVector.longLoopMax                   100      N/A     N/A    2048  thrpt    8   2646.250 ±   2.007  ops/ms
MinMaxVector.longLoopMin                    50      N/A     N/A    2048  thrpt    8   2648.504 ±  18.294  ops/ms
MinMaxVector.longLoopMin                    80      N/A     N/A    2048  thrpt    8   2658.082 ±   3.362  ops/ms
MinMaxVector.longLoopMin                   100      N/A     N/A    2048  thrpt    8   2647.532 ±   5.600  ops/ms
MinMaxVector.longReductionMax               50      N/A     N/A    2048  thrpt    8   2536.254 ±   0.086  ops/ms
MinMaxVector.longReductionMax               80      N/A     N/A    2048  thrpt    8   2536.209 ±   0.129  ops/ms
MinMaxVector.longReductionMax              100      N/A     N/A    2048  thrpt    8   2536.342 ±   0.068  ops/ms
MinMaxVector.longReductionMin               50      N/A     N/A    2048  thrpt    8   2536.271 ±   0.203  ops/ms
MinMaxVector.longReductionMin               80      N/A     N/A    2048  thrpt    8   2536.250 ±   0.343  ops/ms
MinMaxVector.longReductionMin              100      N/A     N/A    2048  thrpt    8   2536.246 ±
0.179 ops/ms ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2669613497 From coleenp at openjdk.org Wed Feb 19 20:30:34 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Wed, 19 Feb 2025 20:30:34 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v3] In-Reply-To: References: Message-ID: <9ZTXNeE806c5EDt4Y6QFMqull0_SobjS7mOQGk2wE5s=.81291418-85a7-4826-9ecf-dcdd050ecaf1@github.com> > Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. > Tested with tier1-4 and performance tests. Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: Rename isPrimitiveType field to primitive. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23572/files - new: https://git.openjdk.org/jdk/pull/23572/files/3e731b9f..d08091ac Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23572&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23572&range=01-02 Stats: 11 lines in 5 files changed: 2 ins; 0 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/23572.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23572/head:pull/23572 PR: https://git.openjdk.org/jdk/pull/23572 From dlong at openjdk.org Wed Feb 19 21:19:58 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 19 Feb 2025 21:19:58 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v3] In-Reply-To: <_j9Wkg21aBltyVrbO4wxGFKmmLDy0T-eorRL4epfS4k=.5a453b6b-d673-4cc6-b29f-192fa74e290c@github.com> References: <_j9Wkg21aBltyVrbO4wxGFKmmLDy0T-eorRL4epfS4k=.5a453b6b-d673-4cc6-b29f-192fa74e290c@github.com> Message-ID: <3qpqR3PC8PFmdgaIoSYA3jDWdl-oon0-AcIzXcI76rY=.38635503-c067-4f6e-a4f1-92c1b6d991d1@github.com> On Wed, 19 Feb 2025 14:19:58 GMT, 
Coleen Phillimore wrote:

> ... but not in the return since the caller likely will fetch the klass pointer next.

I notice that too. Callers are using is_primitive() to short-circuit calls to as_Klass(), which means they seem to be aware of this implementation detail when maybe they shouldn't. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1962384926 From sviswanathan at openjdk.org Wed Feb 19 23:21:07 2025 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Wed, 19 Feb 2025 23:21:07 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v18] In-Reply-To: References: Message-ID: <2OIYkOt8CJ-CqnQIK8sgMDtvLxJUyD5r_mKj5QT7_a8=.10b1d382-d9ae-40a1-b895-09086c80dee6@github.com> On Tue, 18 Feb 2025 02:36:13 GMT, Julian Waters wrote:

>> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>>
>> Review comments resolutions
>
> Is anyone else getting compile failures after this was integrated? This weirdly seems to only happen on Linux
>
> * For target hotspot_variant-server_libjvm_objs_mulnode.o:
> /home/runner/work/jdk/jdk/src/hotspot/share/opto/mulnode.cpp: In member function 'virtual const Type* FmaHFNode::Value(PhaseGVN*) const':
> /home/runner/work/jdk/jdk/src/hotspot/share/opto/mulnode.cpp:1944:37: error: call of overloaded 'make(double)' is ambiguous
> 1944 | return TypeH::make(fma(f1, f2, f3));
>      |        ^
> In file included from /home/runner/work/jdk/jdk/src/hotspot/share/opto/node.hpp:31,
> from /home/runner/work/jdk/jdk/src/hotspot/share/opto/addnode.hpp:28,
> from /home/runner/work/jdk/jdk/src/hotspot/share/opto/mulnode.cpp:26:
> /home/runner/work/jdk/jdk/src/hotspot/share/opto/type.hpp:544:23: note: candidate: 'static const TypeH* TypeH::make(float)'
> 544 | static const TypeH* make(float f);
>     |                     ^~~~
> /home/runner/work/jdk/jdk/src/hotspot/share/opto/type.hpp:545:23: note: candidate: 'static const TypeH* TypeH::make(short int)'
> 545 | static const TypeH* make(short f);
>     |                     ^~~~

@TheShermanTanker I don't see any compile failures on Linux. Both the fastdebug and release build successfully. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22754#issuecomment-2669979058 From dholmes at openjdk.org Thu Feb 20 02:52:58 2025 From: dholmes at openjdk.org (David Holmes) Date: Thu, 20 Feb 2025 02:52:58 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v3] In-Reply-To: <9ZTXNeE806c5EDt4Y6QFMqull0_SobjS7mOQGk2wE5s=.81291418-85a7-4826-9ecf-dcdd050ecaf1@github.com> References: <9ZTXNeE806c5EDt4Y6QFMqull0_SobjS7mOQGk2wE5s=.81291418-85a7-4826-9ecf-dcdd050ecaf1@github.com> Message-ID: On Wed, 19 Feb 2025 20:30:34 GMT, Coleen Phillimore wrote:

>> Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes.
>> Tested with tier1-4 and performance tests.
>
> Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision:
>
> Rename isPrimitiveType field to primitive.

src/java.base/share/classes/java/lang/Class.java line 1296:

> 1294:
> 1295: // The componentType field's null value is the sole indication that the class is an array,
> 1296: // see isArray().

Suggestion:

// The componentType field's null value is the sole indication that the class
// is an array - see isArray().

src/java.base/share/classes/java/lang/Class.java line 1297:

> 1295: // The componentType field's null value is the sole indication that the class is an array,
> 1296: // see isArray().
> 1297: private transient final Class<?> componentType;

Why the `transient` and how does this impact serialization?? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1962781718 PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1962782083 From liach at openjdk.org Thu Feb 20 04:31:55 2025 From: liach at openjdk.org (Chen Liang) Date: Thu, 20 Feb 2025 04:31:55 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v3] In-Reply-To: References: <9ZTXNeE806c5EDt4Y6QFMqull0_SobjS7mOQGk2wE5s=.81291418-85a7-4826-9ecf-dcdd050ecaf1@github.com> Message-ID: On Thu, 20 Feb 2025 02:50:17 GMT, David Holmes wrote:

>> Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision:
>>
>> Rename isPrimitiveType field to primitive.
>
> src/java.base/share/classes/java/lang/Class.java line 1297:
>
>> 1295: // The componentType field's null value is the sole indication that the class is an array,
>> 1296: // see isArray().
>> 1297: private transient final Class<?> componentType;
>
> Why the `transient` and how does this impact serialization??

The fields in `Class` are just inconsistently transient or not. `Class` has special treatment in the serialization specification, so the presence or absence of the `transient` modifier has no effect.
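As an aside, the overall pattern the PR pursues — pure-Java checks over fields the VM initializes in the class mirror, instead of native methods — can be sketched with a simplified, hypothetical model class (this is illustrative only, not the actual java.lang.Class code):

```java
// Simplified model of the pattern discussed in this thread: replace native
// Class.isArray()/isInterface()/isPrimitive() with plain-Java checks over
// fields that the JVM fills in when it creates a class mirror.
// MirrorModel and its constructor arguments are illustrative only.
final class MirrorModel {
    static final int ACC_INTERFACE = 0x0200; // JVMS class access flag

    private final int modifiers;             // written by the VM
    private final MirrorModel componentType; // non-null only for arrays
    private final boolean primitive;         // true only for primitive mirrors

    MirrorModel(int modifiers, MirrorModel componentType, boolean primitive) {
        this.modifiers = modifiers;
        this.componentType = componentType;
        this.primitive = primitive;
    }

    boolean isInterface() { return (modifiers & ACC_INTERFACE) != 0; }
    boolean isArray()     { return componentType != null; }
    boolean isPrimitive() { return primitive; }
}
```

In this shape none of the three queries needs a JNI transition; each is a trivially inlinable field check.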
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1962841415 From galder at openjdk.org Thu Feb 20 06:27:57 2025 From: galder at openjdk.org (Galder Zamarreño) Date: Thu, 20 Feb 2025 06:27:57 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> Message-ID: On Wed, 19 Feb 2025 19:50:50 GMT, Evgeny Astigeevich wrote:

>> I will run a comparison next with the same batch of tests but looking at `int` and see if there are any differences compared with `long` or not.
>
> Hi @galderz,
> Results from Graviton 3 (Neoverse-V1).
> Without the patch:
>
> Benchmark                        (probability)  (range)  (seed)  (size)   Mode  Cnt      Score     Error   Units
> MinMaxVector.intClippingRange              N/A       90       0    1000  thrpt    8  12565.427 ±  37.538  ops/ms
> MinMaxVector.intClippingRange              N/A      100       0    1000  thrpt    8  12462.072 ±  84.067  ops/ms
> MinMaxVector.intLoopMax                     50      N/A     N/A    2048  thrpt    8   5113.090 ±  68.720  ops/ms
> MinMaxVector.intLoopMax                     80      N/A     N/A    2048  thrpt    8   5129.857 ±  35.005  ops/ms
> MinMaxVector.intLoopMax                    100      N/A     N/A    2048  thrpt    8   5116.081 ±   8.946  ops/ms
> MinMaxVector.intLoopMin                     50      N/A     N/A    2048  thrpt    8   6174.544 ±  52.573  ops/ms
> MinMaxVector.intLoopMin                     80      N/A     N/A    2048  thrpt    8   6110.884 ±  54.447  ops/ms
> MinMaxVector.intLoopMin                    100      N/A     N/A    2048  thrpt    8   6178.661 ±  48.450  ops/ms
> MinMaxVector.intReductionMax                50      N/A     N/A    2048  thrpt    8   5109.270 ±  10.525  ops/ms
> MinMaxVector.intReductionMax                80      N/A     N/A    2048  thrpt    8   5123.426 ±  28.229  ops/ms
> MinMaxVector.intReductionMax               100      N/A     N/A    2048  thrpt    8   5133.799 ±   7.693  ops/ms
> MinMaxVector.intReductionMin                50      N/A     N/A    2048  thrpt    8   5130.209 ±  15.491  ops/ms
> MinMaxVector.intReductionMin                80      N/A     N/A    2048  thrpt    8   5127.823 ±  27.767  ops/ms
> MinMaxVector.intReductionMin               100      N/A     N/A    2048  thrpt    8   5118.217 ±  22.186  ops/ms
> MinMaxVector.longClippingRange             N/A       90       0    1000  thrpt    8   1831.026 ±  15.502  ops/ms
> MinMaxVector.longClippingRange             N/A      100       0    1000  thrpt    8   1827.194 ±  22.076  ops/ms
> MinMaxVector.longLoopMax                    50      N/A     N/A    2048  thrpt    8   2643.383 ±   9.830  ops/ms
> MinMaxVector.longLoopMax                    80      N/A     N/A    2048  thrpt    8   2640.417 ±   7.797  ops/ms
> MinMaxVector.longLoopMax                   100      N/A     N/A    2048  thrpt    8   1244.321 ±   1.001  ops/ms
> MinMaxVector.longLoopMin                    50      N/A     N/A    2048  thrpt    8   3239.234 ±   8.813  ops/ms
> MinMaxVector.longLoopMin                    80      N/A     N/A    2048  thrpt    8   3252.713 ± 3...

Thanks @eastig for the results on Graviton 3. I'm summarising them here:

Benchmark                        (probability)  (range)  (seed)  (size)   Mode  Cnt      Base     Patch   Units
MinMaxVector.longClippingRange             N/A       90       0    1000  thrpt    8  1831.026  5094.259  ops/ms (+178%)
MinMaxVector.longClippingRange             N/A      100       0    1000  thrpt    8  1827.194  5096.835  ops/ms (+180%)
MinMaxVector.longLoopMax                    50      N/A     N/A    2048  thrpt    8  2643.383  2636.438  ops/ms
MinMaxVector.longLoopMax                    80      N/A     N/A    2048  thrpt    8  2640.417  2644.069  ops/ms
MinMaxVector.longLoopMax                   100      N/A     N/A    2048  thrpt    8  1244.321  2646.250  ops/ms (+112%)
MinMaxVector.longLoopMin                    50      N/A     N/A    2048  thrpt    8  3239.234  2648.504  ops/ms (-18%)
MinMaxVector.longLoopMin                    80      N/A     N/A    2048  thrpt    8  3252.713  2658.082  ops/ms (-18%)
MinMaxVector.longLoopMin                   100      N/A     N/A    2048  thrpt    8  1204.370  2647.532  ops/ms (+119%)
MinMaxVector.longReductionMax               50      N/A     N/A    2048  thrpt    8  2536.322  2536.254  ops/ms
MinMaxVector.longReductionMax               80      N/A     N/A    2048  thrpt    8  2536.318  2536.209  ops/ms
MinMaxVector.longReductionMax              100      N/A     N/A    2048  thrpt    8  1395.273  2536.342  ops/ms (+81%)
MinMaxVector.longReductionMin               50      N/A     N/A    2048  thrpt    8  2536.325  2536.271  ops/ms
MinMaxVector.longReductionMin               80      N/A     N/A    2048  thrpt    8  2536.265  2536.250  ops/ms
MinMaxVector.longReductionMin              100      N/A     N/A    2048  thrpt    8  1389.982  2536.246  ops/ms (+82%)

On Graviton 3 there are wide enough registers for vectorization to kick in, so we see similar improvements to x64 AVX-512 in https://github.com/openjdk/jdk/pull/20098#issuecomment-2642788364.
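For context, the `ClippingRange` shape being measured is essentially a clamp loop. Below is a minimal sketch of that shape (an assumption about the benchmark's structure; the actual JMH source lives in the `MinMaxVector` micro):

```java
public class ClippingSketch {
    // Clamp each element into [lo, hi] using Math.min/Math.max. With the
    // MinL/MaxL intrinsics, the ternary inside min/max no longer appears as
    // control flow to SuperWord, so a loop like this can auto-vectorize.
    static long[] clip(long[] src, long lo, long hi) {
        long[] dst = new long[src.length];
        for (int i = 0; i < src.length; i++) {
            dst[i] = Math.min(Math.max(src[i], lo), hi);
        }
        return dst;
    }
}
```

The large `longClippingRange` gains above are consistent with this loop going from scalar to vectorized min/max.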
There is some variance in the 50/80% probability results; this was also observed, though more slightly, on x64, but on the aarch64 system it looks more pronounced. It is interesting that it happened with min but not max, but it could just be variance. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2670574593 From galder at openjdk.org Thu Feb 20 06:53:04 2025 From: galder at openjdk.org (Galder Zamarreño) Date: Thu, 20 Feb 2025 06:53:04 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> Message-ID: On Fri, 7 Feb 2025 12:39:24 GMT, Galder Zamarreño wrote:

>> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance.
>>
>> Currently vectorization does not kick in for loops containing either of these calls because of the following error:
>>
>> VLoop::check_preconditions: failed: control flow in loop not allowed
>>
>> The control flow is due to the java implementation for these methods, e.g.
>>
>> public static long max(long a, long b) {
>>     return (a >= b) ? a : b;
>> }
>>
>> This patch intrinsifies the calls to replace the CmpL + Bool nodes for MaxL/MinL nodes respectively.
>> By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization.
>> E.g.
>>
>> SuperWord::transform_loop:
>> Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt strip_mined
>> 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21)
>>
>> Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after.
>> Before the patch, on darwin/aarch64 (M1):
>>
>> ==============================
>> Test summary
>> ==============================
>> TEST                                                                      TOTAL  PASS  FAIL  ERROR
>> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java       1     1     0      0
>> ==============================
>> TEST SUCCESS
>>
>> long min 1155
>> long max 1173
>>
>> After the patch, on darwin/aarch64 (M1):
>>
>> ==============================
>> Test summary
>> ==============================
>> TEST                                                                      TOTAL  PASS  FAIL  ERROR
>> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java       1     1     0      0
>> ==============================
>> TEST SUCCESS
>>
>> long min 1042
>> long max 1042
>>
>> This patch does not add any platform-specific backend implementations for the MaxL/MinL nodes.
>> Therefore, it still relies on the macro expansion to transform those into CMoveL.
>>
>> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results:
>>
>> ==============================
>> Test summary
>> ==============================
>> TEST TOTAL PA...
>
> Galder Zamarreño has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 44 additional commits since the last revision:
>
> - Merge branch 'master' into topic.intrinsify-max-min-long
> - Fix typo
> - Renaming methods and variables and add docu on algorithms
> - Fix copyright years
> - Make sure it runs with cpus with either avx512 or asimd
> - Test can only run with 256 bit registers or bigger
>
>   * Remove platform dependant check and use platform independent configuration instead.
>
> - Fix license header
> - Tests should also run on aarch64 asimd=true envs
> - Added comment around the assertions
> - Adjust min/max identity IR test expectations after changes
> - ...
and 34 more: https://git.openjdk.org/jdk/compare/af7645e5...a190ae68

To follow up https://github.com/openjdk/jdk/pull/20098#issuecomment-2669329851, I've run the `MinMaxVector.int` benchmarks with **superword disabled** and with/without the `_max`/`_min` intrinsics in both AVX-512 and AVX2 modes.

# AVX-512

Benchmark                              (probability)  (range)  (seed)  (size)   Mode  Cnt  -min/-max  +min/+max  Units
MinMaxVector.intClippingRange                    N/A       90       0    1000  thrpt    4   1067.050   1038.640  ops/ms
MinMaxVector.intClippingRange                    N/A      100       0    1000  thrpt    4   1041.922   1039.004  ops/ms
MinMaxVector.intLoopMax                           50      N/A     N/A    2048  thrpt    4    605.173    604.337  ops/ms
MinMaxVector.intLoopMax                           80      N/A     N/A    2048  thrpt    4    605.106    604.309  ops/ms
MinMaxVector.intLoopMax                          100      N/A     N/A    2048  thrpt    4    604.547    604.432  ops/ms
MinMaxVector.intLoopMin                           50      N/A     N/A    2048  thrpt    4    495.042    605.216  ops/ms (+22%)
MinMaxVector.intLoopMin                           80      N/A     N/A    2048  thrpt    4    495.105    495.217  ops/ms
MinMaxVector.intLoopMin                          100      N/A     N/A    2048  thrpt    4    495.040    495.176  ops/ms
MinMaxVector.intReductionMultiplyMax              50      N/A     N/A    2048  thrpt    4    407.920    407.984  ops/ms
MinMaxVector.intReductionMultiplyMax              80      N/A     N/A    2048  thrpt    4    407.710    407.965  ops/ms
MinMaxVector.intReductionMultiplyMax             100      N/A     N/A    2048  thrpt    4    874.881    407.922  ops/ms (-53%)
MinMaxVector.intReductionMultiplyMin              50      N/A     N/A    2048  thrpt    4    407.911    407.947  ops/ms
MinMaxVector.intReductionMultiplyMin              80      N/A     N/A    2048  thrpt    4    408.015    408.024  ops/ms
MinMaxVector.intReductionMultiplyMin             100      N/A     N/A    2048  thrpt    4    407.978    407.994  ops/ms
MinMaxVector.intReductionSimpleMax                50      N/A     N/A    2048  thrpt    4    460.538    460.439  ops/ms
MinMaxVector.intReductionSimpleMax                80      N/A     N/A    2048  thrpt    4    460.579    460.542  ops/ms
MinMaxVector.intReductionSimpleMax               100      N/A     N/A    2048  thrpt    4    998.211    460.404  ops/ms (-53%)
MinMaxVector.intReductionSimpleMin                50      N/A     N/A    2048  thrpt    4    460.570    460.447  ops/ms
MinMaxVector.intReductionSimpleMin                80      N/A     N/A    2048  thrpt    4    460.552    460.493  ops/ms
MinMaxVector.intReductionSimpleMin               100      N/A     N/A    2048  thrpt    4    460.455    460.485  ops/ms

There is some
improvement in `intLoopMin` @ 50%, but this didn't materialize in the `perfasm` run, so I don't think it can strictly be correlated with the use/non-use of the intrinsic. The `intReductionMultiplyMax` and `intReductionSimpleMax` @ 100% regressions with the `max` intrinsic activated are consistent with what we saw with long.

### `intReductionMultiplyMin` and `intReductionSimpleMin` @ 100% same performance

There is something very intriguing happening here, and I don't know whether it's due to min itself or to int vs long. Basically, with or without the `min` intrinsic the performance of these 2 benchmarks is the same at 100% branch probability. What is going on? Let's look at one of them:

-min

# VM options: -Djava.library.path=/home/vagrant/1/jdk-intrinsify-max-min-long/build/release-linux-x86_64/images/test/micro/native -XX:+UnlockDiagnosticVMOptions -XX:DisableIntrinsic=_max,_min -XX:-UseSuperWord
...
3.04%  0x00007f49280f76e9: cmpl %edi, %r10d
3.14%  0x00007f49280f76ec: cmovgl %edi, %r10d  ;*ireturn {reexecute=0 rethrow=0 return_oop=0}
                                               ; - java.lang.Math::min at 10 (line 2119)
                                               ; - org.openjdk.bench.java.lang.MinMaxVector::intReductionSimpleMin at 23 (line 212)
                                               ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_intReductionSimpleMin_jmhTest::intReductionSimpleMin_thrpt_jmhStub at 19 (line 124)

+min

# VM options: -Djava.library.path=/home/vagrant/1/jdk-intrinsify-max-min-long/build/release-linux-x86_64/images/test/micro/native -XX:-UseSuperWord
...
3.10%  0x00007fbf340f6b97: cmpl %edi, %r10d
3.08%  0x00007fbf340f6b9a: cmovgl %edi, %r10d  ;*invokestatic min {reexecute=0 rethrow=0 return_oop=0}
                                               ; - org.openjdk.bench.java.lang.MinMaxVector::intReductionSimpleMin at 23 (line 212)
                                               ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_intReductionSimpleMin_jmhTest::intReductionSimpleMin_thrpt_jmhStub at 19 (line 124)

Both are `cmov`.
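The cmp+branch vs `cmov` distinction being discussed can be sketched in plain Java (the helper names are illustrative; the branchless variant is only an analogy for what a `cmov` computes, not how `Math.min` is implemented):

```java
public class MinShapes {
    // Branchy form: typically compiles to cmp + conditional branch (or a
    // cmov when C2 decides to convert it). At 100% branch probability the
    // branch predicts perfectly, which is why non-intrinsic numbers can win.
    static long minBranchy(long a, long b) {
        return (a <= b) ? a : b;
    }

    // Branchless form: behaves like a cmov - the result is computed from
    // data, with no control flow. Caveat: the subtraction can overflow, so
    // this is only correct when a - b fits in a long; it is an illustration,
    // not a Math.min drop-in.
    static long minBranchless(long a, long b) {
        long diff = a - b;
        long mask = diff >> 63; // -1 when a < b, else 0
        return b + (diff & mask);
    }
}
```

A `cmov` trades the branch misprediction risk for a data dependency on both inputs, which is why the two shapes can swap places depending on branch probability.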
You can see how, without the intrinsic, the `Math::min` bytecode gets executed and compiled into a `cmov`, and the same happens with the intrinsic. I will verify this with long shortly to see if this behaviour is specific to the `min` operation or something to do with int vs long.

# AVX2

Here are the AVX2 numbers:

Benchmark                              (probability)  (range)  (seed)  (size)   Mode  Cnt  -min/-max  +min/+max  Units
MinMaxVector.intClippingRange                    N/A       90       0    1000  thrpt    4   1068.265   1039.087  ops/ms
MinMaxVector.intClippingRange                    N/A      100       0    1000  thrpt    4   1067.705   1038.760  ops/ms
MinMaxVector.intLoopMax                           50      N/A     N/A    2048  thrpt    4    605.015    604.364  ops/ms
MinMaxVector.intLoopMax                           80      N/A     N/A    2048  thrpt    4    605.169    604.366  ops/ms
MinMaxVector.intLoopMax                          100      N/A     N/A    2048  thrpt    4    604.527    604.494  ops/ms
MinMaxVector.intLoopMin                           50      N/A     N/A    2048  thrpt    4    605.099    605.057  ops/ms
MinMaxVector.intLoopMin                           80      N/A     N/A    2048  thrpt    4    495.071    605.080  ops/ms (+22%)
MinMaxVector.intLoopMin                          100      N/A     N/A    2048  thrpt    4    495.134    495.047  ops/ms
MinMaxVector.intReductionMultiplyMax              50      N/A     N/A    2048  thrpt    4    407.953    407.987  ops/ms
MinMaxVector.intReductionMultiplyMax              80      N/A     N/A    2048  thrpt    4    407.861    408.005  ops/ms
MinMaxVector.intReductionMultiplyMax             100      N/A     N/A    2048  thrpt    4    873.915    407.995  ops/ms (-53%)
MinMaxVector.intReductionMultiplyMin              50      N/A     N/A    2048  thrpt    4    408.019    407.987  ops/ms
MinMaxVector.intReductionMultiplyMin              80      N/A     N/A    2048  thrpt    4    407.971    408.009  ops/ms
MinMaxVector.intReductionMultiplyMin             100      N/A     N/A    2048  thrpt    4    407.970    407.956  ops/ms
MinMaxVector.intReductionSimpleMax                50      N/A     N/A    2048  thrpt    4    460.443    460.514  ops/ms
MinMaxVector.intReductionSimpleMax                80      N/A     N/A    2048  thrpt    4    460.484    460.581  ops/ms
MinMaxVector.intReductionSimpleMax               100      N/A     N/A    2048  thrpt    4   1015.601    460.446  ops/ms (-54%)
MinMaxVector.intReductionSimpleMin                50      N/A     N/A    2048  thrpt    4    460.494    460.532  ops/ms
MinMaxVector.intReductionSimpleMin                80      N/A     N/A    2048  thrpt    4    460.489    460.451  ops/ms
MinMaxVector.intReductionSimpleMin               100      N/A     N/A    2048  thrpt    4   1021.420    460.435  ops/ms (-55%)
This time we see an improvement in `intLoopMin` @ 80%, but again it was not observable in the `perfasm` run. `intReductionMultiplyMax` and `intReductionSimpleMax` @ 100% have regressions, the familiar one of cmp+mov vs cmov. `intReductionMultiplyMin` @ 100% does not have a regression for the same reasons as above: both use cmov. The interesting thing is `intReductionSimpleMin` @ 100%. We see a regression there, but I didn't observe it in the `perfasm` run, so this could be down to variance in whether `cmov` is applied or not. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2670609470 From epeter at openjdk.org Thu Feb 20 07:21:45 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 20 Feb 2025 07:21:45 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v3] In-Reply-To: References: Message-ID:

> Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below.
>
> **Background**
>
> With `-XX:+AlignVector`, all vector loads/stores must be aligned. We try to statically determine if we can always align the vectors. One condition is that the address `base` is already aligned. For arrays, we know that this always holds, because they are `ObjectAlignmentInBytes` aligned. But with native memory, the `base` is just some arbitrarily aligned pointer.
>
> **Problem**
>
> So far, we have just naively assumed that the `base` is always `ObjectAlignmentInBytes` aligned. But that does not hold for `native` memory segments: the `base` can also be unaligned. I had constructed such an example, and with `-XX:+AlignVector -XX:+VerifyAlignVector` this example hits the verification code.
>
> MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1);
> MemorySegment nativeUnaligned = nativeAligned.asSlice(1);
> test3(nativeUnaligned);
>
> When compiling the test method, we assume that the `nativeUnaligned.address()` is aligned - but it is not!
>
> static void test3(MemorySegment ms) {
>     for (int i = 0; i < RANGE; i++) {
>         long adr = i * 4L;
>         int v = ms.get(ELEMENT_LAYOUT, adr);
>         ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1));
>     }
> }
>
> **Solution: Runtime Checks - Predicate and Multiversioning**
>
> Of course we could just forbid cases where we have a `native` base from vectorizing. But that would lead to regressions currently - in most cases we do get aligned `base`s, and we currently vectorize those. We cannot statically determine if the `base` is aligned, we need a runtime check.
>
> I came up with 2 options where to place the runtime checks:
> - A new "auto vectorization" Parse Predicate:
>   - This only works when predicates are available.
>   - If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop.
> - Multiversion the loop:
>   - Create 2 copies of the loop (fast and slow loops).
>   - The `fast_loop` can make speculative alignment assumptions, and add the corresponding check to the `multiversion_if` which decides which loop we take
>   - In the `slow_loop`, we make no assumption which means we can not vectorize, but we still compile - so even unaligned `base`s would end up with reasonably fast code.
>   - We "stall" the `...
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:

adjust selector if probability

------------- Changes:
- all: https://git.openjdk.org/jdk/pull/22016/files
- new: https://git.openjdk.org/jdk/pull/22016/files/a98ffabf..b3044bc5

Webrevs:
- full: https://webrevs.openjdk.org/?repo=jdk&pr=22016&range=02
- incr: https://webrevs.openjdk.org/?repo=jdk&pr=22016&range=01-02

Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
Patch: https://git.openjdk.org/jdk/pull/22016.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/22016/head:pull/22016
PR: https://git.openjdk.org/jdk/pull/22016 From roland at openjdk.org Thu Feb 20 09:46:58 2025 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 20 Feb 2025 09:46:58 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v3] In-Reply-To: References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> <5hd7BMjze01r6SZOvQ_Ogf_XV1UekB_mYQbpR5_Wzms=.a911ee76-094f-477c-8d24-564c4f0c39d3@github.com> Message-ID: On Wed, 19 Feb 2025 15:23:13 GMT, Emanuel Peter wrote:

> Do you see any better way than having the 2x code size if we need both a slow and fast loop?

No, but I was confused by your comment about 3x and 4x, which is why I asked for clarification. Compiled code size affects inlining decisions: if a callee has compiled code and it's larger than some threshold, then the callee is considered too expensive to inline. With your change, some method that was considered ok to inline could now be considered too big. I think that's what Vladimir is concerned by. I don't see what you can do about it, that said.
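The code-size point can be seen in a Java-level analogy of what multiversioning emits (the method names are illustrative, not C2 internals): both versions carry the full loop body, which is where the roughly 2x size comes from.

```java
public class MultiversionSketch {
    // One runtime check (the "multiversion_if") picks between a fast version
    // compiled under a speculative alignment assumption and a slow fallback
    // that makes no assumption. Both versions are semantically identical,
    // which is why duplicating them doubles the loop's code size.
    static long sum(long[] a, long baseAddress) {
        if ((baseAddress & 63) == 0) {
            return sumFast(a);  // fast_loop: may be vectorized
        }
        return sumSlow(a);      // slow_loop: scalar, but still compiled code
    }

    static long sumFast(long[] a) {
        long s = 0;
        for (long v : a) s += v; // the alignment assumption would allow vectorization here
        return s;
    }

    static long sumSlow(long[] a) {
        long s = 0;
        for (long v : a) s += v; // same semantics, no assumption
        return s;
    }
}
```

Either path produces the same answer; only the generated machine code for the two loop bodies would differ.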
------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2670957288 From roland at openjdk.org Thu Feb 20 09:46:58 2025 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 20 Feb 2025 09:46:58 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v3] In-Reply-To: References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> <5hd7BMjze01r6SZOvQ_Ogf_XV1UekB_mYQbpR5_Wzms=.a911ee76-094f-477c-8d24-564c4f0c39d3@github.com> Message-ID: On Thu, 20 Feb 2025 09:39:59 GMT, Roland Westrelin wrote: >>> So the overhead in the final code is 2x: we can expect the fast and slow paths to be about the same size so the section of code for the loop would see its size grow by 2x. >> >> Yes, if you get to the point where you add a multi-version-if condition, i.e. where SuperWord has decided it needs a speculative assumption (here for alignment, later for aliasing), then we get the whole loop 2x. I suppose we could try to make the pre-main-post loop more complicated and just multi-version the main-loop, but that sounds much more complicated. >> >> Do you see any better way than having the 2x code size if we need both a slow and fast loop? > >> Do you see any better way than having the 2x code size if we need both a slow and fast loop? > > No but I was confused by your comment about 3x and 4x which is why I asked for clarification. > Compiled code size affects inlining decisions: if a callee has compiled code and it's larger than some threshold, then the callee is considered too expensive to inline. With your change, some method that was considered ok to inline could now be considered too big. I think that's what Vladimir is concerned by. I don't see what you can do about it, this said. > @rwestrel I think I had tried some verifications above, but I could not even get it to work in all cases in `SuperWord`. 
> > In `VLoop::check_preconditions_helper`, I try to find either the predicate or the multiversioning if. But I cannot always find it, and I think that one reason was that the pre-loop can be lost. At least that is what I remember from 4+ weeks ago.

Do you understand when that happens? It doesn't feel right that the pre-loop can be lost. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2670971210 From roland at openjdk.org Thu Feb 20 09:47:01 2025 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 20 Feb 2025 09:47:01 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v3] In-Reply-To: <47tXBG3sQGZVEE5Ya2wr46CopmDjy8OClbpqagIsjgA=.6d07b495-4777-4c7e-a3b7-820f100ec2c0@github.com> References: <47tXBG3sQGZVEE5Ya2wr46CopmDjy8OClbpqagIsjgA=.6d07b495-4777-4c7e-a3b7-820f100ec2c0@github.com> Message-ID: On Tue, 18 Feb 2025 09:42:17 GMT, Emanuel Peter wrote:

>> src/hotspot/share/opto/loopUnswitch.cpp line 513:
>>
>>> 511:
>>> 512: // Create new Region.
>>> 513: RegionNode* region = new RegionNode(1);
>>
>> So we create a new `Region` every time a new condition is added?
>
> Yes. Are you ok with that? Or would you prefer if we extended an existing region (is that possible?) and then we'd have 2 cases, one where there is none yet, and one where we'd extend. I think adding one each time is easier, and it would get commoned anyway, right?
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1963217281 From roland at openjdk.org Thu Feb 20 09:47:03 2025 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 20 Feb 2025 09:47:03 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v3] In-Reply-To: References: <-h_j1wlUqiWpk7lHDe2qqLlTPUdRLJ2NBaid6KJURCQ=.e1ef0bfa-4043-42b0-be58-ac130373c788@github.com> Message-ID: On Tue, 18 Feb 2025 10:26:37 GMT, Roland Westrelin wrote: >> @rwestrel do you consider that a blocking issue for this PR here? > > No I filed: https://bugs.openjdk.org/browse/JDK-8350330 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1963215126 From epeter at openjdk.org Thu Feb 20 10:35:06 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 20 Feb 2025 10:35:06 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v3] In-Reply-To: References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> <5hd7BMjze01r6SZOvQ_Ogf_XV1UekB_mYQbpR5_Wzms=.a911ee76-094f-477c-8d24-564c4f0c39d3@github.com> Message-ID: On Thu, 20 Feb 2025 09:44:16 GMT, Roland Westrelin wrote: > > @rwestrel I think I had tried some verifications above, but I could not even get it to work in all cases in `SuperWord`. > > In `VLoop::check_preconditions_helper`, I try to find either the predicate or the multiversioning if. But I cannot always find it, and I think that one reason was that the pre-loop can be lost. At least that is what I remember from 4+ weeks ago. > > Do you understand when that happens? It doesn't feel right that the pre loop can be lost. `VLoop::check_preconditions_helper` has a check like this: // To align vector memory accesses in the main-loop, we will have to adjust // the pre-loop limit. 
    if (_cl->is_main_loop()) {
      CountedLoopEndNode* pre_end = _cl->find_pre_loop_end();
      if (pre_end == nullptr) {
        return VStatus::make_failure(VLoop::FAILURE_PRE_LOOP_LIMIT);
      }
      Node* pre_opaq1 = pre_end->limit();
      if (pre_opaq1->Opcode() != Op_Opaque1) {
        return VStatus::make_failure(VLoop::FAILURE_PRE_LOOP_LIMIT);
      }
      _pre_loop_end = pre_end;
    }

I don't remember exactly why the pre-loop disappears. They are rare cases. The pre-loop somehow folds away, maybe because it only has a single iteration, or just so few that it would never take the backedge.

------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2671093141

From galder at openjdk.org Thu Feb 20 10:56:58 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Thu, 20 Feb 2025 10:56:58 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> Message-ID:

On Thu, 20 Feb 2025 06:50:07 GMT, Galder Zamarreño wrote:

> There is something very intriguing happening here, which I don't know whether it's due to min itself or to int vs long.
Benchmark                              (probability)  (size)  Mode  Cnt  -min/-max  +min/+max  Units
MinMaxVector.intReductionMultiplyMax   100            2048    thrpt 4     876.867    407.905   ops/ms (-53%)
MinMaxVector.intReductionMultiplyMin   100            2048    thrpt 4     407.963    407.956   ops/ms (1)
MinMaxVector.longReductionMultiplyMax  100            2048    thrpt 4     838.845    405.371   ops/ms (-51%)
MinMaxVector.longReductionMultiplyMin  100            2048    thrpt 4     825.602    414.757   ops/ms (-49%)
MinMaxVector.intReductionSimpleMax     100            2048    thrpt 4    1032.561    460.486   ops/ms (-55%)
MinMaxVector.intReductionSimpleMin     100            2048    thrpt 4     460.530    460.490   ops/ms (2)
MinMaxVector.longReductionSimpleMax    100            2048    thrpt 4    1017.560    460.436   ops/ms (-54%)
MinMaxVector.longReductionSimpleMin    100            2048    thrpt 4     959.507    459.197   ops/ms (-52%)

(1) (2) It seems it's a combination of both int AND min reduction operations and disabling the intrinsic. The rest of the reduction operations seem to use cmp+mov in that situation, but not int+min, which uses cmov. Maybe this is intentional or maybe it's a bug, but it's interesting to notice.

`intReductionMultiplyMin` -min:

# VM options: -Djava.library.path=/home/vagrant/1/jdk-intrinsify-max-min-long/build/release-linux-x86_64/images/test/micro/native -XX:+UnlockDiagnosticVMOptions -XX:DisableIntrinsic=_min -XX:-UseSuperWord
# Benchmark: org.openjdk.bench.java.lang.MinMaxVector.intReductionMultiplyMin
# Parameters: (probability = 100, size = 2048)
...
2.29%  0x00007f4aa40f5835: cmpl %edi, %r10d
4.25%  0x00007f4aa40f5838: cmovgl %edi, %r10d  ;*ireturn {reexecute=0 rethrow=0 return_oop=0}
                                               ; - java.lang.Math::min at 10 (line 2119)
                                               ; - org.openjdk.bench.java.lang.MinMaxVector::intReductionMultiplyMin at 26 (line 202)
                                               ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_intReductionMultiplyMin_jmhTest::intReductionMultiplyMin_thrpt_jmhStub at 19 (line 124)

`intReductionMultiplyMin` +min:

# VM options: -Djava.library.path=/home/vagrant/1/jdk-intrinsify-max-min-long/build/release-linux-x86_64/images/test/micro/native -XX:-UseSuperWord
# Benchmark: org.openjdk.bench.java.lang.MinMaxVector.intReductionMultiplyMin
# Parameters: (probability = 100, size = 2048)
...
2.06%  0x00007ff8ec0f4c35: cmpl %edi, %r10d
4.31%  0x00007ff8ec0f4c38: cmovgl %edi, %r10d  ;*invokestatic min {reexecute=0 rethrow=0 return_oop=0}
                                               ; - org.openjdk.bench.java.lang.MinMaxVector::intReductionMultiplyMin at 26 (line 202)
                                               ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_intReductionMultiplyMin_jmhTest::intReductionMultiplyMin_thrpt_jmhStub at 19 (line 124)

`longReductionMultiplyMin` -min:

# VM options: -Djava.library.path=/home/vagrant/1/jdk-intrinsify-max-min-long/build/release-linux-x86_64/images/test/micro/native -XX:+UnlockDiagnosticVMOptions -XX:DisableIntrinsic=_minL -XX:-UseSuperWord
# Benchmark: org.openjdk.bench.java.lang.MinMaxVector.longReductionMultiplyMin
# Parameters: (probability = 100, size = 2048)
...
0.01%  0x00007ff9d80f7609: imulq $0xb, 0x10(%r12, %r10, 8), %rbp  ;*lmul {reexecute=0 rethrow=0 return_oop=0}
                                                                  ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMultiplyMin at 24 (line 265)
                                                                  ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMultiplyMin_jmhTest::longReductionMultiplyMin_thrpt_jmhStub at 19 (line 124)
       0x00007ff9d80f760f: testq %rbp, %rbp
       0x00007ff9d80f7612: jge 0x7ff9d80f7646  ;*lreturn {reexecute=0 rethrow=0 return_oop=0}
                                               ; - java.lang.Math::min at 11 (line 2134)
                                               ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMultiplyMin at 30 (line 266)
                                               ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMultiplyMin_jmhTest::longReductionMultiplyMin_thrpt_jmhStub at 19 (line 124)

`longReductionMultiplyMin` +min:

# VM options: -Djava.library.path=/home/vagrant/1/jdk-intrinsify-max-min-long/build/release-linux-x86_64/images/test/micro/native -XX:-UseSuperWord
# Benchmark: org.openjdk.bench.java.lang.MinMaxVector.longReductionMultiplyMin
# Parameters: (probability = 100, size = 2048)
...
0.01%  0x00007f83400f7d76: cmpq %r13, %rdx
0.12%  0x00007f83400f7d79: cmovlq %rdx, %r13  ;*invokestatic min {reexecute=0 rethrow=0 return_oop=0}
                                              ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMultiplyMin at 30 (line 266)
                                              ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMultiplyMin_jmhTest::longReductionMultiplyMin_thrpt_jmhStub at 19 (line 124)

------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2671144644

From galder at openjdk.org Thu Feb 20 11:03:58 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Thu, 20 Feb 2025 11:03:58 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com> <5oGMaD5b87inAMkco6l5ODRvWv7FRsHGJiu_UMrGrTc=.0be44429-d322-4a6f-b91d-b64a146fad05@github.com> <3ArmrOQcUoj8DhHTq1a40Oz3GE8bCDDy3FFeVgbladg=.b8e0e13b-39f3-41a6-8a1b-5ca4febb4a41@github.com> Message-ID:

On Tue, 18 Feb 2025 08:43:38 GMT, Emanuel Peter wrote:

>> To make it more explicit: implementing long min/max in ad files as cmp will likely remove all the 100% regressions that are observed here. I'm going to repeat the same MinMaxVector int min/max reduction test above with the ad changes @rwestrel suggested to see what effect they have.
>
> @galderz I think we will have the same issue with both `int` and `long`: As far as I know, it is really a difficult problem to decide at compile-time if a `cmove` or `branch` is the better choice. I'm not sure there is any heuristic for which you will not find a micro-benchmark where the heuristic made the wrong choice.
>
> To my understanding, these are the factors that impact the performance:
> - `cmove` requires all inputs to complete before it can execute, and it has an inherent latency of a cycle or so itself. But you cannot have any branch mispredictions, and hence no branch misprediction penalties (i.e. when the CPU has to flush out the ops from the wrong branch and restart at the branch).
> - `branch` can hide some latencies, because we can already continue with the branch that is speculated on. We do not need to wait for the inputs of the comparison to arrive, and we can already continue with the speculated resulting value. But if the speculation is ever wrong, we have to pay the misprediction penalty.
>
> In my understanding, there are roughly 3 scenarios:
> - The branch probability is so extreme that the branch predictor would be correct almost always, and so it is profitable to do branching code.
> - The branching probability is somewhere in the middle, and the branch is not predictable. Branch mispredictions are very expensive, and so it is better to use `cmove`.
> - The branching probability is somewhere in the middle, but the branch is predictable (e.g. swaps back and forth). The branch predictor will have almost no mispredictions, and it is faster to use branching code.
>
> Modeling this precisely is actually a little complex. You would have to know the cost of the `cmove` and the `branching` version of the code. That depends on the latency of the inputs, and the outputs: does the `cmove` dramatically increase the latency on the critical path, and `branching` could hide some of that latency?
> And you would have to know how good the branch predictor is, which you cannot derive from the branching probability of our profiling (at least not when the probabilities are in the middle, and you don't know if it is a random or predictable pattern).
>
> If we can find a perfect heuristic - that would be fantastic ;)
>
> If we cannot find a perfect heuristic, then we should think about what are the most "common" or "relevant" scenarios, I think.
>
> But let's discuss all of this in a call / offline :)

FYI @eme64 @chhagedorn @rwestrel

Since we know that vectorization does not always kick in, there was a worry that scalar fallbacks would heavily suffer with the work included in this PR to add a long intrinsic for min/max. Looking at the same scenarios with int (read my comments https://github.com/openjdk/jdk/pull/20098#issuecomment-2669329851 and https://github.com/openjdk/jdk/pull/20098#issuecomment-2671144644), it looks clear that the same kind of regressions are also present there. So, if those int scalar regressions were not a problem when the int min/max intrinsic was added, I would expect the same to apply to long.

Re: https://github.com/openjdk/jdk/pull/20098#issuecomment-2671144644 - I was trying to think what could be causing this. I thought maybe it's due to the int min/max backend, which is implemented in a platform-specific way, vs the long min/max backend, which relies on platform-independent macro expansion. But if that theory was true, I would expect the same behaviour with int max vs long max, but that's not the case. It seems odd to only see this difference with min.
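Emanuel's distinction between branch *probability* and branch *predictability* can be made concrete with a small sketch (illustrative only; class and method names are made up). Both streams below take the branch exactly or almost exactly 50% of the time — which is all the profile records — yet a hardware predictor handles them very differently:

```java
import java.util.Random;

// Two branch streams with the same ~50% taken-probability but very different
// predictability. Profiling that only records the probability cannot tell
// these two cases apart, which is why a cmove-vs-branch heuristic based on
// probability alone can be wrong either way.
public class BranchPredictability {
    static int count(int[] a) {
        int taken = 0;
        for (int v : a) {
            if (v == 1) taken++;   // branch with ~0.5 probability in both cases
        }
        return taken;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        int[] alternating = new int[n];   // predictable: 0,1,0,1,... (the "swaps back and forth" case)
        int[] random = new int[n];        // unpredictable: coin flips
        Random rnd = new Random(42);
        for (int i = 0; i < n; i++) {
            alternating[i] = i & 1;
            random[i] = rnd.nextInt(2);
        }
        // The alternating stream is predicted almost perfectly; the random
        // stream is mispredicted roughly half the time.
        System.out.println(count(alternating) + " " + count(random));
    }
}
```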
------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2671163220

From jbhateja at openjdk.org Thu Feb 20 11:37:08 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 20 Feb 2025 11:37:08 GMT Subject: RFR: 8342103: C2 compiler support for Float16 type and associated scalar operations [v18] In-Reply-To: References: Message-ID:

On Tue, 18 Feb 2025 02:36:13 GMT, Julian Waters wrote:

> Is anyone else getting compile failures after this was integrated? This weirdly seems to only happen on Linux
>
> ```
> * For target hotspot_variant-server_libjvm_objs_mulnode.o:
> /home/runner/work/jdk/jdk/src/hotspot/share/opto/mulnode.cpp: In member function ‘virtual const Type* FmaHFNode::Value(PhaseGVN*) const’:
> /home/runner/work/jdk/jdk/src/hotspot/share/opto/mulnode.cpp:1944:37: error: call of overloaded ‘make(double)’ is ambiguous
>  1944 |     return TypeH::make(fma(f1, f2, f3));
>       |                                       ^
> In file included from /home/runner/work/jdk/jdk/src/hotspot/share/opto/node.hpp:31,
>                  from /home/runner/work/jdk/jdk/src/hotspot/share/opto/addnode.hpp:28,
>                  from /home/runner/work/jdk/jdk/src/hotspot/share/opto/mulnode.cpp:26:
> /home/runner/work/jdk/jdk/src/hotspot/share/opto/type.hpp:544:23: note: candidate: ‘static const TypeH* TypeH::make(float)’
>   544 |   static const TypeH* make(float f);
>       |                       ^~~~
> /home/runner/work/jdk/jdk/src/hotspot/share/opto/type.hpp:545:23: note: candidate: ‘static const TypeH* TypeH::make(short int)’
>   545 |   static const TypeH* make(short f);
>       |                       ^~~~
> ```

Hi @TheShermanTanker, please file a separate JBS issue for the errors you are observing with non-standard build options.
I am also seeing some other build issues with the following configuration --with-extra-cxxflags=-D__CORRECT_ISO_CPP11_MATH_H_PROTO_FP Best Regards, Jatin ------------- PR Comment: https://git.openjdk.org/jdk/pull/22754#issuecomment-2671231948 From coleenp at openjdk.org Thu Feb 20 13:00:03 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 20 Feb 2025 13:00:03 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v3] In-Reply-To: References: <9ZTXNeE806c5EDt4Y6QFMqull0_SobjS7mOQGk2wE5s=.81291418-85a7-4826-9ecf-dcdd050ecaf1@github.com> Message-ID: On Thu, 20 Feb 2025 04:29:04 GMT, Chen Liang wrote: >> src/java.base/share/classes/java/lang/Class.java line 1297: >> >>> 1295: // The componentType field's null value is the sole indication that the class is an array, >>> 1296: // see isArray(). >>> 1297: private transient final Class componentType; >> >> Why the `transient` and how does this impact serialization?? > > The fields in `Class` are just inconsistently transient or not. `Class` has special treatment in the serialization specification, so the presence or absence of the `transient` modifier has no effect. Thanks Chen. I was wondering why the other JVM installed fields were transient and this one wasn't so I added it to see if someone noticed and could verify whether it's right or not. 
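Whatever the internal field layout ends up being, the observable contract of the three methods in the PR title can be pinned down with plain reflection. A quick sanity sketch (nothing JDK-internal; `ACC_INTERFACE = 0x0200` is the class-file modifier bit the review mentions):

```java
public class ClassKindChecks {
    public static void main(String[] args) {
        // isArray(): true exactly when the class has a component type.
        System.out.println(int[].class.isArray() && int[].class.getComponentType() == int.class);
        // isInterface(): derivable from the modifier bits (ACC_INTERFACE = 0x0200).
        System.out.println(Runnable.class.isInterface()
                == ((Runnable.class.getModifiers() & 0x0200) != 0));
        // isPrimitive(): true only for the primitive mirrors, not their box types.
        System.out.println(int.class.isPrimitive() && !Integer.class.isPrimitive());
    }
}
```

Each check prints `true`, i.e. the non-native implementations discussed in the PR must preserve exactly these observable equivalences.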
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1963520059 From duke at openjdk.org Thu Feb 20 17:24:57 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Thu, 20 Feb 2025 17:24:57 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: <1yB95sOajuS5ptFI0GQWLepii5JsZ9DOsje-TEFyFYs=.a325ad18-17ed-4e77-b1e3-0bad2cf55c67@github.com> References: <1yB95sOajuS5ptFI0GQWLepii5JsZ9DOsje-TEFyFYs=.a325ad18-17ed-4e77-b1e3-0bad2cf55c67@github.com> Message-ID: On Tue, 11 Feb 2025 10:40:31 GMT, Bhavana Kilambi wrote: >> Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: >> >> Adding comments + some code reorganization > > src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 2618: > >> 2616: INSN(smaxp, 0, 0b101001, false); // accepted arrangements: T8B, T16B, T4H, T8H, T2S, T4S >> 2617: INSN(sminp, 0, 0b101011, false); // accepted arrangements: T8B, T16B, T4H, T8H, T2S, T4S >> 2618: INSN(sqdmulh,0, 0b101101, false); // accepted arrangements: T4H, T8H, T2S, T4S > > Hi, not a comment on the algorithm itself but you might have to add these new instructions in the gtest for aarch64 here - test/hotspot/gtest/aarch64/aarch64-asmtest.py and use this file to generate test/hotspot/gtest/aarch64/asmtest.out.h which would contain these newly added instructions. I have tried that, but the python script (actually the as command that it started) threw error messages: aarch64ops.s:338:24: error: index must be a multiple of 8 in range [0, 32760]. 
        prfm PLDL1KEEP, [x15, 43]
                              ^
aarch64ops.s:357:20: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4]
        sub x1, x10, x23, sxth #2
                          ^
aarch64ops.s:359:20: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4]
        add x11, x21, x5, uxtb #3
                          ^
aarch64ops.s:360:22: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4]
        adds x11, x17, x17, uxtw #1
                            ^
aarch64ops.s:361:20: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4]
        sub x11, x0, x15, uxtb #1
                          ^
aarch64ops.s:362:19: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4]
        subs x7, x1, x0, sxth #2
                         ^

This is without any modifications from what is in the master branch currently.

------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1964049673

From duke at openjdk.org Thu Feb 20 17:33:18 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Thu, 20 Feb 2025 17:33:18 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v6] In-Reply-To: References: Message-ID:

> By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled.
Ferenc Rakoczi has updated the pull request incrementally with four additional commits since the last revision: - Accepting suggested change from Andrew Dinn - Added comments suggested by Andrew Dinn - Fixed copyright years - renaming a couple of functions ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23300/files - new: https://git.openjdk.org/jdk/pull/23300/files/9a3a9444..54373d5a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23300&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23300&range=04-05 Stats: 98 lines in 6 files changed: 2 ins; 0 del; 96 mod Patch: https://git.openjdk.org/jdk/pull/23300.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23300/head:pull/23300 PR: https://git.openjdk.org/jdk/pull/23300 From coleenp at openjdk.org Thu Feb 20 20:11:11 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 20 Feb 2025 20:11:11 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v4] In-Reply-To: References: Message-ID: > Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. > Tested with tier1-4 and performance tests. 
Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: Update src/java.base/share/classes/java/lang/Class.java Co-authored-by: David Holmes <62092539+dholmes-ora at users.noreply.github.com> ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23572/files - new: https://git.openjdk.org/jdk/pull/23572/files/d08091ac..7a4c595b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23572&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23572&range=02-03 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/23572.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23572/head:pull/23572 PR: https://git.openjdk.org/jdk/pull/23572 From coleenp at openjdk.org Thu Feb 20 20:19:15 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 20 Feb 2025 20:19:15 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v5] In-Reply-To: References: Message-ID: > Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. > Tested with tier1-4 and performance tests. 
Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: Fix whitespace ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23572/files - new: https://git.openjdk.org/jdk/pull/23572/files/7a4c595b..02347433 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23572&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23572&range=03-04 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23572.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23572/head:pull/23572 PR: https://git.openjdk.org/jdk/pull/23572 From vlivanov at openjdk.org Thu Feb 20 21:56:55 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Thu, 20 Feb 2025 21:56:55 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v5] In-Reply-To: References: Message-ID: On Thu, 20 Feb 2025 20:19:15 GMT, Coleen Phillimore wrote: >> Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. >> Tested with tier1-4 and performance tests. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Fix whitespace Looks good! Regarding @IntrinsicCandidate and its effects on JIT-compiler inlining decisions, @ForceInline could be added, but IMO it's not necessary since new implementations are small. ------------- Marked as reviewed by vlivanov (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/23572#pullrequestreview-2631244815 From coleenp at openjdk.org Thu Feb 20 23:25:57 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 20 Feb 2025 23:25:57 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v5] In-Reply-To: <3qpqR3PC8PFmdgaIoSYA3jDWdl-oon0-AcIzXcI76rY=.38635503-c067-4f6e-a4f1-92c1b6d991d1@github.com> References: <_j9Wkg21aBltyVrbO4wxGFKmmLDy0T-eorRL4epfS4k=.5a453b6b-d673-4cc6-b29f-192fa74e290c@github.com> <3qpqR3PC8PFmdgaIoSYA3jDWdl-oon0-AcIzXcI76rY=.38635503-c067-4f6e-a4f1-92c1b6d991d1@github.com> Message-ID: On Wed, 19 Feb 2025 21:16:51 GMT, Dean Long wrote: >> This is a good question. The heapwalker walks through dead mirrors so I can't assert that a null klass field matches our boolean setting but I don't know why this never asserts (can't find any instances in the bug database) but it seems like it could. I'll use the bool field in the mirror in the assert though but not in the return since the caller likely will fetch the klass pointer next. > >> ... but not in the return since the caller likely will fetch the klass pointer next. > > I notice that too. Callers are using is_primitive() to short-circuit calls to as_Klass(), which means they seem to be aware of this implementation detail when maybe they shouldn't. There are 136 callers so yes, it might be something that shouldn't be known in this many places. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1964492501 From coleenp at openjdk.org Thu Feb 20 23:31:55 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 20 Feb 2025 23:31:55 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v5] In-Reply-To: References: Message-ID: On Thu, 20 Feb 2025 20:19:15 GMT, Coleen Phillimore wrote: >> Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. >> Tested with tier1-4 and performance tests. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Fix whitespace Thanks Vladimir for review and for answering my earlier questions on this change. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23572#issuecomment-2672941007 From liach at openjdk.org Thu Feb 20 23:40:55 2025 From: liach at openjdk.org (Chen Liang) Date: Thu, 20 Feb 2025 23:40:55 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v5] In-Reply-To: References: Message-ID: On Thu, 20 Feb 2025 20:19:15 GMT, Coleen Phillimore wrote: >> Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. >> Tested with tier1-4 and performance tests. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Fix whitespace You are right, using the field directly is indeed better. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1964502825

From epeter at openjdk.org Fri Feb 21 07:04:57 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 21 Feb 2025 07:04:57 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID:

On Wed, 19 Feb 2025 16:14:09 GMT, Vladimir Kozlov wrote:

>> @vnkozlov I suggest that I change the probability to something quite low now, just to make sure that the fast-loop is placed nicely. When I do the experiments for aliasing-analysis runtime-checks, then I will be able to benchmark much better for both cases, since it is much easier to create many different cases. At that point, I could still adapt the probabilities to a different constant. Or maybe I can somehow adjust the probabilities in the chain such that they are balanced. Like if there is 1 condition, give it `0.5`, if there are 2 give them each `sqrt(0.5)`, if there are `n` then `pow(0.5, 1/n)`, so that once you multiply them you get `pow(pow(0.5, 1/n),n) = 0.5`. We could also set another "target" probability than `0.5`. The issue is that experimenting now is a little difficult, because I only have the alignment-checks to play with, which are really really rare to fail in the "real world", I think. But aliasing-checks are more likely to fail, so there could be more interesting benchmark results there.
>>
>> Does that sound ok?
>>
>>> Can we profile alignment in Interpreter (and C1)?
>>
>> It would be nice if we could profile alignment or aliasing. Maybe that is possible. But I suppose there are always cases where profiling is not available (Xcomp ?), and we should have reasonable defaults there. We could investigate profiling in a second step, to improve things if we think that is worth it. Profiling these things would also be additional complexity - I'm not convinced yet it is worth it.
>>
>> What do you think?
> >> > Can we profile alignment in Interpreter (and C1)? >> >> It would be nice if we could profile alignment or aliasing. Maybe that is possible. But I suppose there are always cases where profiling is not available (Xcomp ?), and we should have reasonable defaults there. We could investigate profiling in a second step, to improve things if we think that is worth it. Profiling these things would also be additional complexity - I'm not convinced yet it is worth it. >> >> What do you think? > > You should not worry about `-Xcomp` it is testing flag - we can use some default there. > I am fine if you think profiling will not bring us much benefits. Note, I am not asking create counters - just a bit to indicate if we had unaligned access to native memory in a method. In such case we may skip predicate and generate multi versions loop during compilation. On other hand, we may have unaligned access only during startup and not later when we compile method. Anyway, it does not affect these changes. > > I will look on changes more later. @vnkozlov I made the change with the probability `PROB_FAIR` -> `PROB_LIKELY_MAG(3)` and ran testing again. @rwestrel Do you want me to find examples for the pre-loop disappearing, I suppose I can find some easily by adding an assert in SuperWord, where we bail out, as I showed above. 
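The balanced-chain idea quoted in this thread — give each of `n` chained conditions probability `pow(0.5, 1/n)` so that their product lands on the 0.5 target — is easy to check numerically (illustrative snippet; the class name is made up):

```java
import java.util.Locale;

// Verifies that assigning each of n chained conditions the probability
// pow(target, 1/n) makes the combined probability of the whole chain equal
// to the target, regardless of n.
public class ChainProbability {
    public static void main(String[] args) {
        double target = 0.5;  // desired combined probability of the chain
        for (int n = 1; n <= 4; n++) {
            double per = Math.pow(target, 1.0 / n);  // probability given to each condition
            double product = Math.pow(per, n);       // combined probability of n conditions
            System.out.printf(Locale.ROOT, "n=%d per=%.4f product=%.4f%n", n, per, product);
        }
    }
}
```

For n=2 this reproduces the `sqrt(0.5) ≈ 0.7071` per-condition probability mentioned above, with the product staying at 0.5 for every chain length.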
------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2673745463 From epeter at openjdk.org Fri Feb 21 08:22:59 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 21 Feb 2025 08:22:59 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com> <5oGMaD5b87inAMkco6l5ODRvWv7FRsHGJiu_UMrGrTc=.0be44429-d322-4a6f-b91d-b64a146fad05@github.com> <3ArmrOQcUoj8DhHTq1a40Oz3GE8bCDDy3FF eVgbladg=.b8e0e13b-39f3-41a6-8a1b-5ca4febb4a41@github.com> Message-ID: On Thu, 20 Feb 2025 11:00:59 GMT, Galder Zamarre?o wrote: > So, if those int scalar regressions were not a problem when int min/max intrinsic was added, I would expect the same to apply to long. Do you know when they were added? If that was a long time ago, we might not have noticed back then, but we might notice now. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2673875104 From epeter at openjdk.org Fri Feb 21 08:23:00 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 21 Feb 2025 08:23:00 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> Message-ID: On Thu, 20 Feb 2025 06:50:07 GMT, Galder Zamarre?o wrote: > The interesting thing is intReductionSimpleMin @ 100%. We see a regression there but I didn't observe it with the perfasm run. So, this could be due to variance in the application of cmov or not? I don't see the error / variance in the results you posted. 
Often I look at those, and if it is anywhere above 10% of the average, then I'm suspicious ;) ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2673879859 From epeter at openjdk.org Fri Feb 21 08:30:00 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 21 Feb 2025 08:30:00 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <6-Fgj-Lrd7GSpR0ZAi8YFlOZB12hCBB6p3oGZ1xodvA=.1ce2fa12-daff-4459-8fb8-1052acaf5639@github.com> <5oGMaD5b87inAMkco6l5ODRvWv7FRsHGJiu_UMrGrTc=.0be44429-d322-4a6f-b91d-b64a146fad05@github.com> <3ArmrOQcUoj8DhHTq1a40Oz3GE8bCDDy3FF eVgbladg=.b8e0e13b-39f3-41a6-8a1b-5ca4febb4a41@github.com> Message-ID: On Thu, 20 Feb 2025 11:00:59 GMT, Galder Zamarre?o wrote: > Re: https://github.com/openjdk/jdk/pull/20098#issuecomment-2671144644 - I was trying to think what could be causing this. Maybe it is an issue with probabilities? Do you know at what point (if at all) the `MinI` node appears/disappears in that example? 
------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2673892612 From duke at openjdk.org Fri Feb 21 10:09:56 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Fri, 21 Feb 2025 10:09:56 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: <3kiI1J7jcczgzTRi9HZztzhGe1blcy8Ga11xoGhzueY=.98543172-5b38-4199-bead-0988de0e0e75@github.com> References: <3kiI1J7jcczgzTRi9HZztzhGe1blcy8Ga11xoGhzueY=.98543172-5b38-4199-bead-0988de0e0e75@github.com> Message-ID: On Tue, 18 Feb 2025 13:33:52 GMT, Andrew Dinn wrote: >> Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: >> >> Adding comments + some code reorganization > > src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 2594: > >> 2592: guarantee(T != T1Q && T != T1D, "incorrect arrangement"); \ >> 2593: if (!acceptT2D) guarantee(T != T2D, "incorrect arrangement"); \ >> 2594: if (strcmp(#NAME, "sqdmulh") == 0) guarantee(T != T8B && T != T16B, "incorrect arrangement"); \ > > Suggestion: > > I think it might be better to change this test from a strcmp call to (opc2 == 0b101101). The strcmp test is clearer to a reader of the code but the call may not be guaranteed to be compiled out at build time while the latter will. Changed as suggested. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1965215153 From duke at openjdk.org Fri Feb 21 10:14:00 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Fri, 21 Feb 2025 10:14:00 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: References: Message-ID: On Tue, 18 Feb 2025 13:43:18 GMT, Andrew Dinn wrote: >> Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: >> >> Adding comments + some code reorganization > > src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4066: > >> 4064: } >> 4065: >> 4066: // Execute on round of keccak of two computations in parallel. > > Suggestion: > > It would be helpful to add comments that relate the register and instruction selection to the original Java source code. e.g. change the header as follows > > // Performs 2 keccak round transformations using vector parallelism > // > // Two sets of 25 * 64-bit input states a0[lo:hi]...a24[lo:hi] are passed in > // the lower/upper halves of registers v0...v24 and the transformed states > // are returned in the same registers. Intermediate 64-bit pairs > // c0...c5 and d0...d5 are computed in registers v25...v30. v31 is > // loaded with the required pair of 64 bit rounding constants. > // During computation of the output states some intermediate results are > // shuffled around registers v0...v30. Comments on each line indicate > // how the values in registers correspond to variables ai, ci, di in > // the Java source code, likewise how the generated machine instructions > // correspond to Java source operations (n.b. rol means rotate left). 
> > Then annotate the generation steps as follows: > > __ eor3(v29, __ T16B, v4, v9, v14); // c4 = a4 ^ a9 ^ a14 > __ eor3(v26, __ T16B, v1, v6, v11); // c1 = a1 ^ a6 ^ a11 > __ eor3(v28, __ T16B, v3, v8, v13); // c3 = a3 ^ a8 ^ a13 > __ eor3(v25, __ T16B, v0, v5, v10); // c0 = a0 ^ a5 ^ a10 > __ eor3(v27, __ T16B, v2, v7, v12); // c2 = a2 ^ a7 ^ a12 > __ eor3(v29, __ T16B, v29, v19, v24); // c4 ^= a19 ^ a24 > __ eor3(v26, __ T16B, v26, v16, v21); // c1 ^= a16 ^ a21 > __ eor3(v28, __ T16B, v28, v18, v23); // c3 ^= a18 ^ a23 > __ eor3(v25, __ T16B, v25, v15, v20); // c0 ^= a15 ^ a20 > __ eor3(v27, __ T16B, v27, v17, v22); // c2 ^= a17 ^ a22 > > __ rax1(v30, __ T2D, v29, v26); // d0 = c4 ^ rol(c1, 1) > __ rax1(v26, __ T2D, v26, v28); // d2 = c1 ^ rol(c3, 1) > __ rax1(v28, __ T2D, v28, v25); // d4 = c3 ^ rol(c0, 1) > __ rax1(v25, __ T2D, v25, v27); // d1 = c0 ^ rol(c2, 1) > __ rax1(v27, __ T2D, v27, v29); // d3 = c2 ^ rol(c4, 1) > > __ eor(v0, __ T16B, v0, v30); // a0 = a0 ^ d0 > __ xar(v29, __ T2D, v1, v25, (64 - 1)); // a10' = rol((a1^d1), 1) > __ xar(v1, __ T2D, v6, v25, (64 - 44)); // a1 = rol((a6^d1), 44) > __ xar(v6, __ T2D, v9, v28, (64 - 20)); // a6 = rol((a9^d4), 20) > __ xar(v... Although this piece of code is not new, and I don't really think that this level of commenting is necessary, especially in code that is very unlikely to change, I added the comments. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1965220606 From duke at openjdk.org Fri Feb 21 10:25:59 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Fri, 21 Feb 2025 10:25:59 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 02:55:18 GMT, Hao Sun wrote: >> Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: >> >> Adding comments + some code reorganization > > Hi. Here is the test result of our CI. 
> > ### copyright year > > the following files should update the copyright year to 2025. > > > src/hotspot/cpu/aarch64/assembler_aarch64.hpp > src/hotspot/cpu/aarch64/stubRoutines_aarch64.hpp > src/hotspot/share/runtime/globals.hpp > src/java.base/share/classes/sun/security/provider/ML_DSA.java > src/java.base/share/classes/sun/security/provider/SHA3Parallel.java > test/micro/org/openjdk/bench/java/security/MLDSA.java > > > ### cross-build failure > > Cross build for riscv64/s390/ppc64 failed. > > Here shows the error msg for ppc64 > > > === Output from failing command(s) repeated here === > * For target support_interim-jmods_support__create_java.base.jmod_exec: > # > # A fatal error has been detected by the Java Runtime Environment: > # > # Internal Error (/tmp/jdk-src/src/hotspot/share/asm/codeBuffer.hpp:200), pid=72752, tid=72769 > # assert(allocates2(pc)) failed: not in CodeBuffer memory: 0x0000e85cb03dc620 <= 0x0000e85cb03e8ab4 <= 0x0000e85cb03e8ab0 > # > # JRE version: OpenJDK Runtime Environment (25.0) (fastdebug build 25-internal-git-1e01c6deec3) > # Java VM: OpenJDK 64-Bit Server VM (fastdebug 25-internal-git-1e01c6deec3, mixed mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64) > # Problematic frame: > # V [libjvm.so+0x3b391c] Instruction_aarch64::~Instruction_aarch64()+0xbc > # > # Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" (or dumping to /tmp/ci-scripts/jdk-src/make/ > # > # An error report file with more information is saved as: > # /tmp/jdk-src/make/hs_err_pid72752.log > ... (rest of output omitted) > > * All command lines available in /sysroot/ppc64el/tmp/build-ppc64el/make-support/failure-logs. 
> === End of repeated output === > > > I suppose we should make the similar update at file `src/hotspot/cpu/aarch64/stubDeclarations_aarch64.hpp` to other platforms @shqking, I changed the copyright years, but I don't really understand how the aarch64-specific code can overflow buffers on other architectures. As far as I understand, Instruction_aarch64 should not have been there in a ppc build. Was this a build attempted on an aarch64 for the other architectures? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23300#issuecomment-2674156680 From yzheng at openjdk.org Fri Feb 21 12:14:57 2025 From: yzheng at openjdk.org (Yudi Zheng) Date: Fri, 21 Feb 2025 12:14:57 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v5] In-Reply-To: References: Message-ID: On Thu, 20 Feb 2025 20:19:15 GMT, Coleen Phillimore wrote: >> Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. >> Tested with tier1-4 and performance tests. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Fix whitespace LGTM! As @iwanowww said, not inlining such trivial methods seems more like an inliner bug/enhancement opportunity. ------------- Marked as reviewed by yzheng (Committer). 
PR Review: https://git.openjdk.org/jdk/pull/23572#pullrequestreview-2632877796 From coleenp at openjdk.org Fri Feb 21 12:31:46 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Fri, 21 Feb 2025 12:31:46 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v6] In-Reply-To: References: Message-ID: > Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. > Tested with tier1-4 and performance tests. Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: Remove JVM_GetClassModifiers from jvm.h too. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23572/files - new: https://git.openjdk.org/jdk/pull/23572/files/02347433..c23718b3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23572&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23572&range=04-05 Stats: 3 lines in 1 file changed: 0 ins; 3 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23572.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23572/head:pull/23572 PR: https://git.openjdk.org/jdk/pull/23572 From coleenp at openjdk.org Fri Feb 21 12:31:48 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Fri, 21 Feb 2025 12:31:48 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v6] In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 14:21:47 GMT, Coleen Phillimore wrote: >> src/hotspot/share/prims/jvm.cpp line 1262: >> >>> 1260: JVM_END >>> 1261: >>> 1262: JVM_ENTRY(jboolean, JVM_IsArrayClass(JNIEnv *env, jclass cls)) >> >> Where are the changes to jvm.h? > > Good catch, I also removed JVM_GetProtectionDomain. and JVM_GetClassModifiers. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1965401052 From coleenp at openjdk.org Fri Feb 21 12:31:49 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Fri, 21 Feb 2025 12:31:49 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v5] In-Reply-To: References: Message-ID: <3jNPEzaXa0Ncf8eu3vct6a_jyH7k4tH_mbRBaKmbMc0=.d3a86a0f-1bed-4084-af92-959f4dbd52f4@github.com> On Thu, 20 Feb 2025 23:38:34 GMT, Chen Liang wrote: >> Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix whitespace > > You are right, using the field directly is indeed better. I don't use the field directly because the field is a short and getModifiers makes it into Modifier. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1965399996 From liach at openjdk.org Fri Feb 21 14:04:02 2025 From: liach at openjdk.org (Chen Liang) Date: Fri, 21 Feb 2025 14:04:02 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v5] In-Reply-To: <3jNPEzaXa0Ncf8eu3vct6a_jyH7k4tH_mbRBaKmbMc0=.d3a86a0f-1bed-4084-af92-959f4dbd52f4@github.com> References: <3jNPEzaXa0Ncf8eu3vct6a_jyH7k4tH_mbRBaKmbMc0=.d3a86a0f-1bed-4084-af92-959f4dbd52f4@github.com> Message-ID: On Fri, 21 Feb 2025 12:27:56 GMT, Coleen Phillimore wrote: >> You are right, using the field directly is indeed better. > > I don't use the field directly because the field is a short and getModifiers makes it into Modifier. Indeed, even though this checks for the specific bit so widening has no effect, it is better to be cautious here. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1965522767 From kvn at openjdk.org Fri Feb 21 19:08:01 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 21 Feb 2025 19:08:01 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v3] In-Reply-To: References: Message-ID: On Thu, 20 Feb 2025 07:21:45 GMT, Emanuel Peter wrote: >> Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below. >> >> **Background** >> >> With `-XX:+AlignVector`, all vector loads/stores must be aligned. We try to statically determine if we can always align the vectors. One condition is that the address `base` is already aligned. For arrays, we know that this always holds, because they are `ObjectAlignmentInBytes` aligned. But with native memory, the `base` is just some arbitrarily aligned pointer. >> >> **Problem** >> >> So far, we have just naively assumed that the `base` is always `ObjectAlignmentInBytes` aligned. But that does not hold for `native` memory segments: the `base` can also be unaligned. I had constructed such an example, and with `-XX:+AlignVector -XX:+VerifyAlignVector` this example hits the verification code. >> >> >> MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1); >> MemorySegment nativeUnaligned = nativeAligned.asSlice(1); >> test3(nativeUnaligned); >> >> >> When compiling the test method, we assume that the `nativeUnaligned.address()` is aligned - but it is not! >> >> static void test3(MemorySegment ms) { >> for (int i = 0; i < RANGE; i++) { >> long adr = i * 4L; >> int v = ms.get(ELEMENT_LAYOUT, adr); >> ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1)); >> } >> } >> >> >> **Solution: Runtime Checks - Predicate and Multiversioning** >> >> Of course we could just forbid cases where we have a `native` base from vectorizing. 
But that would lead to regressions currently - in most cases we do get aligned `base`s, and we currently vectorize those. We cannot statically determine if the `base` is aligned, we need a runtime check. >> >> I came up with 2 options where to place the runtime checks: >> - A new "auto vectorization" Parse Predicate: >> - This only works when predicates are available. >> - If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop. >> - Multiversion the loop: >> - Create 2 copies of the loop (fast and slow loops). >> - The `fast_loop` can make speculative alignment assumptions, and add the corresponding check to the `multiversion_if` which decides which loop we take >> - In the `slow_loop`, we make no assumption which means we can not vectorize, but we still compile - so even ... > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > adjust selector if probability How profitable (performance wise) to optimize slow path loop? Can we skip any optimizations for it - treat it as not-Counted? src/hotspot/share/opto/loopTransform.cpp line 3363: > 3361: if (cl->is_pre_loop() || cl->is_post_loop()) return true; > 3362: > 3363: // If we are stalled, check if we can get unstalled. Can you expand comment explaining cases when we "stall" and what it means? src/hotspot/share/opto/loopopts.cpp line 4514: > 4512: // and then rejecting the slow_loop by constant folding the multiversion_if. > 4513: // > 4514: // Therefore, we "stall" the optimization of the slow_loop until we add We don't use "stall" term. We use "delay" - this is what happens here if I understand it correctly. src/hotspot/share/opto/loopopts.cpp line 4520: > 4518: // multiversion_if folds away the "stalled" slow_loop. 
If we add any > 4519: // speculative assumption, then we mark the OpaqueMultiversioningNode > 4520: // with "unstall_slow_loop", so that the slow_loop can be optimized. "unstall_slow_loop" - > "optimize_slow_loop" ------------- PR Review: https://git.openjdk.org/jdk/pull/22016#pullrequestreview-2633960596 PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1966019182 PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1966028103 PR Review Comment: https://git.openjdk.org/jdk/pull/22016#discussion_r1966032230 From dlong at openjdk.org Fri Feb 21 21:10:58 2025 From: dlong at openjdk.org (Dean Long) Date: Fri, 21 Feb 2025 21:10:58 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v5] In-Reply-To: References: <3jNPEzaXa0Ncf8eu3vct6a_jyH7k4tH_mbRBaKmbMc0=.d3a86a0f-1bed-4084-af92-959f4dbd52f4@github.com> Message-ID: On Fri, 21 Feb 2025 14:01:20 GMT, Chen Liang wrote: >> I don't use the field directly because the field is a short and getModifiers makes it into Modifier. > > Indeed, even though this checks for the specific bit so widening has no effect, it is better to be cautious here. > I don't use the field directly because the field is a short and getModifiers makes it into Modifier. But getModifiers() returns `int`, not `Modifier` (which is all static). ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1966170358 From coleenp at openjdk.org Sat Feb 22 14:49:38 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Sat, 22 Feb 2025 14:49:38 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v7] In-Reply-To: References: Message-ID: > Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. 
> Tested with tier1-4 and performance tests. Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: Use modifiers field directly in isInterface. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23572/files - new: https://git.openjdk.org/jdk/pull/23572/files/c23718b3..db7c9782 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23572&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23572&range=05-06 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23572.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23572/head:pull/23572 PR: https://git.openjdk.org/jdk/pull/23572 From coleenp at openjdk.org Sat Feb 22 14:49:38 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Sat, 22 Feb 2025 14:49:38 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v5] In-Reply-To: References: <3jNPEzaXa0Ncf8eu3vct6a_jyH7k4tH_mbRBaKmbMc0=.d3a86a0f-1bed-4084-af92-959f4dbd52f4@github.com> Message-ID: On Fri, 21 Feb 2025 21:08:33 GMT, Dean Long wrote: >> Indeed, even though this checks for the specific bit so widening has no effect, it is better to be cautious here. > >> I don't use the field directly because the field is a short and getModifiers makes it into Modifier. > > But getModifiers() returns `int`, not `Modifier` (which is all static). I mis-remembered why I called getModifiers(), maybe because of all the other calls to getModifiers() in Class.java which used to be needed, but I did want to call Modifier.isInterface(). If using the 'modifiers' field directly is better, I'll change it to that. 
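The approach being discussed — checking the modifier bit directly rather than going through getModifiers() — can be sketched as below. The field names and types here are illustrative stand-ins, not the actual java.lang.Class internals:

```java
import java.lang.reflect.Modifier;

// Hypothetical sketch of the non-native checks from JDK-8349860.
// In the real Class, 'modifiers' and 'componentType' are initialized
// by the VM; the names and layout here are assumptions for illustration.
class MirrorSketch {
    private final char modifiers;        // narrow field, widened to int for the bit test
    private final Object componentType;  // non-null only for array classes
    private final boolean isPrimitive;   // the new final transient boolean in the proposal

    MirrorSketch(int mods, Object component, boolean primitive) {
        this.modifiers = (char) mods;
        this.componentType = component;
        this.isPrimitive = primitive;
    }

    boolean isInterface() { return Modifier.isInterface(modifiers); }
    boolean isArray()     { return componentType != null; }
    boolean isPrimitive() { return isPrimitive; }

    public static void main(String[] args) {
        MirrorSketch iface = new MirrorSketch(Modifier.INTERFACE, null, false);
        MirrorSketch array = new MirrorSketch(0, int.class, false);
        System.out.println(iface.isInterface() && array.isArray() && !array.isPrimitive());
    }
}
```

As the thread notes, widening the narrow field to int does not affect the single-bit test, but calling the existing Modifier helper keeps the intent explicit.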
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1966527692 From epeter at openjdk.org Mon Feb 24 07:25:59 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 24 Feb 2025 07:25:59 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 16:14:09 GMT, Vladimir Kozlov wrote: >> @vnkozlov I suggest that I change the probability to something quite low now, just to make sure that the fast-loop is placed nicely. When I do the experiments for aliasing-analysis runtime-checks, then I will be able to benchmark much better for both cases, since it is much easier to create many different cases. At that point, I could still adapt the probabilities to a different constant. Or maybe I can somehow adjust the probabilities in the chain such that they are balanced. Like if there is 1 condition, give it `0.5`, if there are 2 give them each `sqrt(0.5)`, if there are `n` then `pow(0.5, 1/n)`, so that once you multiply them you get `pow(pow(0.5, 1/n),n) = 0.5`. We could also set another "target" probability than `0.5`. The issue is that experimenting now is a little difficult, because I only have the alignment-checks to play with, which are really really rare to fail in the "real world", I think. But aliasing-checks are more likely to fail, so there could be more interesti ng benchmark results there. >> >> Does that sound ok? >> >>> Can we profile alignment in Interpreter (and C1)? >> >> It would be nice if we could profile alignment or aliasing. Maybe that is possible. But I suppose there are always cases where profiling is not available (Xcomp ?), and we should have reasonable defaults there. We could investigate profiling in a second step, to improve things if we think that is worth it. Profiling these things would also be additional complexity - I'm not convinced yet it is worth it. >> >> What do you think? 
> >> > Can we profile alignment in Interpreter (and C1)? >> >> It would be nice if we could profile alignment or aliasing. Maybe that is possible. But I suppose there are always cases where profiling is not available (Xcomp ?), and we should have reasonable defaults there. We could investigate profiling in a second step, to improve things if we think that is worth it. Profiling these things would also be additional complexity - I'm not convinced yet it is worth it. >> >> What do you think? > > You should not worry about `-Xcomp` it is testing flag - we can use some default there. > I am fine if you think profiling will not bring us much benefits. Note, I am not asking create counters - just a bit to indicate if we had unaligned access to native memory in a method. In such case we may skip predicate and generate multi versions loop during compilation. On other hand, we may have unaligned access only during startup and not later when we compile method. Anyway, it does not affect these changes. > > I will look on changes more later. @vnkozlov I'll think about the "stall" vs "delay" suggestion. > How profitable (performance wise) to optimize slow path loop? Can we skip any optimizations for it - treat it as not-Counted? I suppose that depends on if the slow path loop will be taken. Imagine we are working on some unaligned MemorySegment (or with aliasing runtime-checks failing). In these cases without optimizing we would for example not unroll. But unrolling can give quite the speedup, of course at the cost of more compile time and code size. Also some RangeCheck eliminations only happen if you have a pre-main-post loop structure. There are probably other optimizations as well. So yes, if the slow path loop is taken often, then optimizing is probably worth it. What do you think? 
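The fast/slow multiversioning shape described above can be sketched in plain Java. This is purely illustrative; in C2 the split happens at the IR level with a multiversion_if guarding a speculatively-optimized loop copy:

```java
// Illustrative sketch of loop multiversioning: a runtime alignment
// check selects a "fast" loop (which may assume aligned accesses and
// vectorize) or a "slow" fallback that makes no such assumption.
class MultiversionSketch {
    static long sum(long base, int[] data) {
        if ((base & 7) == 0) {       // speculative check: base 8-byte aligned?
            return fastLoop(data);   // may use aligned vector loads/stores
        } else {
            return slowLoop(data);   // no alignment assumption; still unrollable
        }
    }

    static long fastLoop(int[] d) { long s = 0; for (int v : d) s += v; return s; }
    static long slowLoop(int[] d) { long s = 0; for (int v : d) s += v; return s; }

    public static void main(String[] args) {
        int[] d = {1, 2, 3};
        System.out.println(sum(16, d) + " " + sum(17, d));
    }
}
```

Both versions compute the same result; the point debated above is how much optimization effort (unrolling, range-check elimination) the slow copy should still receive.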
------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2677607527 From haosun at openjdk.org Mon Feb 24 07:44:55 2025 From: haosun at openjdk.org (Hao Sun) Date: Mon, 24 Feb 2025 07:44:55 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: References: Message-ID: On Fri, 21 Feb 2025 10:23:37 GMT, Ferenc Rakoczi wrote: > Was this a build attempted on an aarch64 for the other architectures? Yes. It's a cross-build on AArch64 for other architectures. > Instruction_aarch64 should not have been there in a ppc build Oops. I didn't check the error message carefully. It might be some issue in our CI. I will check that. Sorry for the noise. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23300#issuecomment-2677637524 From epeter at openjdk.org Mon Feb 24 08:03:59 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 24 Feb 2025 08:03:59 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 16:14:09 GMT, Vladimir Kozlov wrote: >> @vnkozlov I suggest that I change the probability to something quite low now, just to make sure that the fast-loop is placed nicely. When I do the experiments for aliasing-analysis runtime-checks, then I will be able to benchmark much better for both cases, since it is much easier to create many different cases. At that point, I could still adapt the probabilities to a different constant. Or maybe I can somehow adjust the probabilities in the chain such that they are balanced. Like if there is 1 condition, give it `0.5`, if there are 2 give them each `sqrt(0.5)`, if there are `n` then `pow(0.5, 1/n)`, so that once you multiply them you get `pow(pow(0.5, 1/n),n) = 0.5`. We could also set another "target" probability than `0.5`. 
The issue is that experimenting now is a little difficult, because I only have the alignment-checks to play with, which are really really rare to fail in the "real world", I think. But aliasing-checks are more likely to fail, so there could be more interesti ng benchmark results there. >> >> Does that sound ok? >> >>> Can we profile alignment in Interpreter (and C1)? >> >> It would be nice if we could profile alignment or aliasing. Maybe that is possible. But I suppose there are always cases where profiling is not available (Xcomp ?), and we should have reasonable defaults there. We could investigate profiling in a second step, to improve things if we think that is worth it. Profiling these things would also be additional complexity - I'm not convinced yet it is worth it. >> >> What do you think? > >> > Can we profile alignment in Interpreter (and C1)? >> >> It would be nice if we could profile alignment or aliasing. Maybe that is possible. But I suppose there are always cases where profiling is not available (Xcomp ?), and we should have reasonable defaults there. We could investigate profiling in a second step, to improve things if we think that is worth it. Profiling these things would also be additional complexity - I'm not convinced yet it is worth it. >> >> What do you think? > > You should not worry about `-Xcomp` it is testing flag - we can use some default there. > I am fine if you think profiling will not bring us much benefits. Note, I am not asking create counters - just a bit to indicate if we had unaligned access to native memory in a method. In such case we may skip predicate and generate multi versions loop during compilation. On other hand, we may have unaligned access only during startup and not later when we compile method. Anyway, it does not affect these changes. > > I will look on changes more later. 
@vnkozlov I mean the issue this: once I implement aliasing-analysis runtime-checks with this multiversion approach, then we'd get regressions if we do not optimize the slow path loop. Currently, we would not vectorize (because we have to be ready for aliasing cases), but we at least unroll, and whatever else we can except vectorization. But if we do not optimize the slow path loop, then we would get performance regressions in aliasing cases because we have no unrolling for them any more. I think we need to avoid that - would you agree? ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2677667789 From adinn at openjdk.org Mon Feb 24 08:39:54 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Mon, 24 Feb 2025 08:39:54 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: References: Message-ID: On Mon, 24 Feb 2025 07:41:58 GMT, Hao Sun wrote: >> @shqking, I changed the copyright years, but I don't really understand how the aarch64-specific code can overflow buffers on other architectures. As far as I understand, Instruction_aarch64 should not have been there in a ppc build. >> Was this a build attempted on an aarch64 for the other architectures? > >> Was this a build attempted on an aarch64 for the other architectures? > > Yes. It's a cross-build on AArch64 for other architectures. > >> Instruction_aarch64 should not have been there in a ppc build > > Oops. I didn't check the error message carefully. It might be some issue in our CI. I will check that. > > Sorry for the noise. @shqking There is a [known issue](https://bugs.openjdk.org/browse/JDK-8349921) with cross-builds that is still being investigated. I think that may explain the problem you are seeing. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23300#issuecomment-2677735964 From bkilambi at openjdk.org Mon Feb 24 09:37:54 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Mon, 24 Feb 2025 09:37:54 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: References: <1yB95sOajuS5ptFI0GQWLepii5JsZ9DOsje-TEFyFYs=.a325ad18-17ed-4e77-b1e3-0bad2cf55c67@github.com> Message-ID: On Thu, 20 Feb 2025 17:22:25 GMT, Ferenc Rakoczi wrote: >> src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 2618: >> >>> 2616: INSN(smaxp, 0, 0b101001, false); // accepted arrangements: T8B, T16B, T4H, T8H, T2S, T4S >>> 2617: INSN(sminp, 0, 0b101011, false); // accepted arrangements: T8B, T16B, T4H, T8H, T2S, T4S >>> 2618: INSN(sqdmulh,0, 0b101101, false); // accepted arrangements: T4H, T8H, T2S, T4S >> >> Hi, not a comment on the algorithm itself but you might have to add these new instructions in the gtest for aarch64 here - test/hotspot/gtest/aarch64/aarch64-asmtest.py and use this file to generate test/hotspot/gtest/aarch64/asmtest.out.h which would contain these newly added instructions. > > I have tried that, but the python script (actually the as command that it started) threw error messages: > > aarch64ops.s:338:24: error: index must be a multiple of 8 in range [0, 32760]. 
> prfm PLDL1KEEP, [x15, 43] > ^ > aarch64ops.s:357:20: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4] > sub x1, x10, x23, sxth #2 > ^ > aarch64ops.s:359:20: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4] > add x11, x21, x5, uxtb #3 > ^ > aarch64ops.s:360:22: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4] > adds x11, x17, x17, uxtw #1 > ^ > aarch64ops.s:361:20: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4] > sub x11, x0, x15, uxtb #1 > ^ > aarch64ops.s:362:19: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4] > subs x7, x1, x0, sxth #2 > ^ > This is without any modifications from what is in the master branch currently. You might have to use an assembler from the latest binutils build (if the system default isn't the latest) and add the path to the assembler in the "AS" variable. Also you can run it something like - `python aarch64-asmtest.py | expand > asmtest.out.h`. Please let me know if you still face problems. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967284270 From adinn at openjdk.org Mon Feb 24 11:50:55 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Mon, 24 Feb 2025 11:50:55 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v6] In-Reply-To: References: Message-ID: On Thu, 20 Feb 2025 17:33:18 GMT, Ferenc Rakoczi wrote: >> By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. 
> > Ferenc Rakoczi has updated the pull request incrementally with four additional commits since the last revision: > > - Accepting suggested change from Andrew Dinn > - Added comments suggested by Andrew Dinn > - Fixed copyright years > - renaming a couple of functions src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4593: > 4591: // chunks of) vector registers v30 and v31, resp. > 4592: // The inputs are in v0-v7 and v16-v23 and the results go to v16-v23, > 4593: // four 32-bit values in each register Suggestion: Once again it would be good to annotate the lines in this code with comments that relate the generated code back to the original Java code. In the header comment you should refer to the relevant Java class and the var names there: // computes (in parallel across 8 x 4S vectors) // a = b * c * 2^-32 mod MONT_Q // where // inputs b and c are in v0, ..., v7 and v16, ... v23, // scratch registers v24, ... v27 are clobbered // output a is written back into v16, ... v23 // constants q and q_inv are in v30, v31 // // See the equivalent Java code in method ML_DSA.montMul Then comment the generation lines as shown below ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967490923 From adinn at openjdk.org Mon Feb 24 11:53:55 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Mon, 24 Feb 2025 11:53:55 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v6] In-Reply-To: References: Message-ID: On Thu, 20 Feb 2025 17:33:18 GMT, Ferenc Rakoczi wrote: >> By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. 
> > Ferenc Rakoczi has updated the pull request incrementally with four additional commits since the last revision: > > - Accepting suggested change from Andrew Dinn > - Added comments suggested by Andrew Dinn > - Fixed copyright years > - renaming a couple of functions src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4604: > 4602: FloatRegister vr7 = by_constant ? v29 : v7; > 4603: > 4604: __ sqdmulh(v24, __ T4S, vr0, v16); + __ sqdmulh(v24, __ T4S, v0, v16); // aHigh = hi32(2 * b * c) + __ mulv(v16, __ T4S, v0, v16); // aLow = lo32(b * c) src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4613: > 4611: __ mulv(v19, __ T4S, vr3, v19); > 4612: > 4613: __ mulv(v16, __ T4S, v16, v30); __ mulv(v16, __ T4S, v16, v30); // m = aLow * qinv src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4618: > 4616: __ mulv(v19, __ T4S, v19, v30); > 4617: > 4618: __ sqdmulh(v16, __ T4S, v16, v31); __ sqdmulh(v16, __ T4S, v16, v31); // n = hi32(2 * m * q) src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4623: > 4621: __ sqdmulh(v19, __ T4S, v19, v31); > 4622: > 4623: __ shsubv(v16, __ T4S, v24, v16); __ shsubv(v16, __ T4S, v24, v16); // a = (aHigh - n) / 2 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967491928 PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967492635 PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967493031 PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967493643 From bkilambi at openjdk.org Mon Feb 24 12:14:03 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Mon, 24 Feb 2025 12:14:03 GMT Subject: RFR: 8345125: Aarch64: Add aarch64 backend for Float16 operations Message-ID: This patch adds aarch64 backend for scalar FP16 operations namely - add, subtract, multiply, divide, fma, sqrt, min and max. 
------------- Commit messages: - 8345125: Aarch64: Add aarch64 backend for Float16 operations Changes: https://git.openjdk.org/jdk/pull/23748/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23748&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8345125 Stats: 1007 lines in 13 files changed: 326 ins; 1 del; 680 mod Patch: https://git.openjdk.org/jdk/pull/23748.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23748/head:pull/23748 PR: https://git.openjdk.org/jdk/pull/23748 From roland at openjdk.org Mon Feb 24 12:54:56 2025 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 24 Feb 2025 12:54:56 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v3] In-Reply-To: References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> <5hd7BMjze01r6SZOvQ_Ogf_XV1UekB_mYQbpR5_Wzms=.a911ee76-094f-477c-8d24-564c4f0c39d3@github.com> Message-ID: On Thu, 20 Feb 2025 09:44:16 GMT, Roland Westrelin wrote: >>> Do you see any better way than having the 2x code size if we need both a slow and fast loop? >> >> No but I was confused by your comment about 3x and 4x which is why I asked for clarification. >> Compiled code size affects inlining decisions: if a callee has compiled code and it's larger than some threshold, then the callee is considered too expensive to inline. With your change, some method that was considered ok to inline could now be considered too big. I think that's what Vladimir is concerned by. I don't see what you can do about it, this said. > >> @rwestrel I think I had tried some verifications above, but I could not even get it to work in all cases in `SuperWord`. >> >> In `VLoop::check_preconditions_helper`, I try to find either the predicate or the multiversioning if. But I cannot always find it, and I think that one reason was that the pre-loop can be lost. At least that is what I remember from 4+ weeks ago. 
> > Do you understand when that happens? It doesn't feel right that the pre loop can be lost. > @rwestrel Do you want me to find examples for the pre-loop disappearing? I suppose I can find some easily by adding an assert in SuperWord, where we bail out, as I showed above. Yes, if not too much work. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2678332801 From epeter at openjdk.org Mon Feb 24 14:32:57 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 24 Feb 2025 14:32:57 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v3] In-Reply-To: References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> <5hd7BMjze01r6SZOvQ_Ogf_XV1UekB_mYQbpR5_Wzms=.a911ee76-094f-477c-8d24-564c4f0c39d3@github.com> Message-ID: On Mon, 24 Feb 2025 12:52:42 GMT, Roland Westrelin wrote: > > @rwestrel Do you want me to find examples for the pre-loop disappearing? I suppose I can find some easily by adding an assert in SuperWord, where we bail out, as I showed above. > > Yes, if not too much work. Ok, let's add this: diff --git a/src/hotspot/share/opto/vectorization.cpp b/src/hotspot/share/opto/vectorization.cpp index e607a1065dd..290ee249a42 100644 --- a/src/hotspot/share/opto/vectorization.cpp +++ b/src/hotspot/share/opto/vectorization.cpp @@ -98,6 +98,7 @@ VStatus VLoop::check_preconditions_helper() { // the pre-loop limit. CountedLoopEndNode* pre_end = _cl->find_pre_loop_end(); if (pre_end == nullptr) { + assert(false, "found no pre-loop"); return VStatus::make_failure(VLoop::FAILURE_PRE_LOOP_LIMIT); } Node* pre_opaq1 = pre_end->limit(); And run that: rr /oracle-work/jdk-fork7/build/linux-x64-slowdebug/jdk/bin/java -Xcomp -XX:+TraceLoopOpts -XX:CompileCommand=compileonly,jdk.internal.classfile.impl.StackMapGenerator::processBlock --version .... 
PreMainPost Loop: N7127/N4014 limit_check profile_predicated predicated counted [0,int),+1 (2147483648 iters) rc has_sfpt strip_mined Unroll 2 Loop: N7127/N4014 counted [int,int),+1 (2147483648 iters) main rc has_sfpt strip_mined Loop: N0/N0 has_call has_sfpt Loop: N7453/N7460 limit_check profile_predicated predicated counted [0,int),+1 (4 iters) pre rc has_sfpt Loop: N7126/N7125 sfpts={ 7128 } Loop: N7508/N4014 counted [int,int),+2 (2147483648 iters) main rc has_sfpt strip_mined Loop: N7409/N7416 counted [int,int),+1 (4 iters) post rc has_sfpt Parallel IV: 7728 Loop: N7453/N7460 limit_check profile_predicated predicated counted [0,int),+1 (4 iters) pre has_sfpt Parallel IV: 7725 Loop: N7508/N4014 counted [int,int),+2 (2147483648 iters) main has_sfpt strip_mined Parallel IV: 7718 Loop: N7409/N7416 counted [int,int),+1 (4 iters) post has_sfpt Loop: N0/N0 has_call has_sfpt Loop: N7453/N7460 limit_check profile_predicated predicated counted [0,int),+1 (4 iters) pre has_sfpt Loop: N7126/N7125 sfpts={ 7128 } Loop: N7508/N4014 counted [int,int),+2 (2147483648 iters) main has_sfpt strip_mined Loop: N7409/N7416 counted [int,int),+1 (4 iters) post has_sfpt RangeCheck Loop: N7508/N4014 counted [int,int),+2 (2147483648 iters) main has_sfpt rce strip_mined Unroll 4 Loop: N7508/N4014 limit_check counted [int,int),+2 (2147483648 iters) main has_sfpt rce strip_mined Loop: N0/N0 has_call has_sfpt Loop: N7453/N7460 limit_check profile_predicated predicated counted [0,int),+1 (4 iters) pre rc has_sfpt Loop: N7126/N7125 limit_check sfpts={ 7128 } Loop: N8146/N4014 limit_check counted [int,int),+4 (2147483648 iters) main has_sfpt strip_mined Loop: N7409/N7416 counted [int,int),+1 (4 iters) post rc has_sfpt ... # Internal Error (/oracle-work/jdk-fork7/open/src/hotspot/share/opto/vectorization.cpp:101), pid=1381339, tid=1381348 # assert(false) failed: found no pre-loop The pre-loop node is not dead actually. The issue is with the main-loop in `CountedLoopNode::is_canonical_loop_entry`. 
We skip through some predicates, but then we cannot find the ZeroTripGuard, rather I'm seeing this: (rr) p ctrl->dump_bfs(2,0,"#cd") dist dump --------------------------------------------- 2 974 ConI === 0 [[ ... ]] #int:1 2 8060 IfTrue === 8056 [[ 8073 ]] #1 1 8073 If === 8060 974 [[ 8074 8077 ]] #Last Value Assertion Predicate P=0.999999, C=-1.000000 0 8077 IfTrue === 8073 [[ 8103 ]] #1 The pre-loop is further up though: (rr) p this->dump_bfs(26,0,"#c") dist dump --------------------------------------------- 26 7453 CountedLoop === 7453 4015 7460 [[ 7452 7453 7454 7455 ]] inner stride: 1 pre of N7127 !orig=[7127],[7118],[2645] !jvms: StackMapGenerator::processBlock @ bci:2677 (line 671) 25 7455 If === 7453 7441 [[ 7456 7464 ]] P=0.000001, C=-1.000000 !orig=[2686] !jvms: StackMapGenerator$Frame::popStack @ bci:5 (line 1001) StackMapGenerator::processBlock @ bci:2681 (line 671) 24 7456 IfFalse === 7455 [[ 7448 7457 ]] #0 !orig=[2631],[2628] !jvms: StackMapGenerator$Frame::popStack @ bci:5 (line 1001) StackMapGenerator::processBlock @ bci:2681 (line 671) 23 7457 RangeCheck === 7456 7446 [[ 7458 7467 ]] P=0.999999, C=-1.000000 !orig=[1189] !jvms: StackMapGenerator$Frame::popStack @ bci:33 (line 1002) StackMapGenerator::processBlock @ bci:2681 (line 671) 22 7458 IfTrue === 7457 [[ 7459 ]] #1 !orig=[777],385 !jvms: StackMapGenerator$Frame::popStack @ bci:33 (line 1002) StackMapGenerator::processBlock @ bci:2681 (line 671) 21 7459 CountedLoopEnd === 7458 7443 [[ 7460 7482 ]] [lt] P=0.900000, C=-1.000000 !orig=7122,[5398] !jvms: StackMapGenerator::processBlock @ bci:2674 (line 670) 20 7482 IfFalse === 7459 [[ 7486 ]] #0 19 7486 If === 7482 7485 [[ 7461 7487 ]] P=0.999999, C=-1.000000 18 7487 IfTrue === 7486 [[ 7977 ]] #1 17 7977 If === 7487 974 [[ 7978 7981 ]] #Init Value Assertion Predicate P=0.999999, C=-1.000000 16 7981 IfTrue === 7977 [[ 7994 ]] #1 15 7994 If === 7981 974 [[ 7995 7998 ]] #Last Value Assertion Predicate P=0.999999, C=-1.000000 14 7998 IfTrue === 7994 
[[ 8118 ]] #1 13 8118 If === 7998 8117 [[ 8119 8122 ]] #Last Value Assertion Predicate P=0.999999, C=-1.000000 12 8122 IfTrue === 8118 [[ 8007 ]] #1 11 8007 If === 8122 8006 [[ 8008 8011 ]] #Init Value Assertion Predicate P=0.999999, C=-1.000000 10 8011 IfTrue === 8007 [[ 8056 ]] #1 9 8056 If === 8011 974 [[ 8057 8060 ]] #Init Value Assertion Predicate P=0.999999, C=-1.000000 8 8060 IfTrue === 8056 [[ 8073 ]] #1 7 8073 If === 8060 974 [[ 8074 8077 ]] #Last Value Assertion Predicate P=0.999999, C=-1.000000 6 8077 IfTrue === 8073 [[ 8103 ]] #1 5 8173 IfFalse === 7122 [[ 7128 7129 ]] #0 !orig=[7524],[7123],[5442] !jvms: StackMapGenerator::processBlock @ bci:2674 (line 670) 5 8103 If === 8077 8102 [[ 8104 8107 ]] #Last Value Assertion Predicate P=0.999999, C=-1.000000 4 7128 SafePoint === 8173 1 778 1 1 7129 780 1 1 781 781 782 783 784 1 1 1 785 786 [[ 7124 ]] SafePoint !orig=385 !jvms: StackMapGenerator::processBlock @ bci:2688 (line 670) 4 8107 IfTrue === 8103 [[ 8086 ]] #1 3 7124 OuterStripMinedLoopEnd === 7128 781 [[ 7125 7471 ]] P=0.900000, C=-1.000000 3 8086 If === 8107 8085 [[ 8087 8090 ]] #Init Value Assertion Predicate P=0.999999, C=-1.000000 2 7122 CountedLoopEnd === 8146 7121 [[ 8173 4014 ]] [lt] P=0.900000, C=-1.000000 !orig=[5398] !jvms: StackMapGenerator::processBlock @ bci:2674 (line 670) 2 7125 IfTrue === 7124 [[ 7126 ]] #1 2 8090 IfTrue === 8086 [[ 7126 ]] #1 1 4014 IfTrue === 7122 [[ 8146 ]] #1 !jvms: StackMapGenerator::processBlock @ bci:2674 (line 670) 1 7126 OuterStripMinedLoop === 7126 8090 7125 [[ 7126 8146 ]] 0 8146 CountedLoop === 8146 7126 4014 [[ 8146 1191 8157 8158 7122 7503 ]] inner stride: 4 main of N8146 strip mined !orig=[7508],[7127],[7118],[2645] !jvms: StackMapGenerator::processBlock @ bci:2677 (line 671) It looks like we are skipping some predicates, but not enough of them maybe? In `AssertionPredicates::find_entry` we see: - `8090 IfTrue === 8086 [[ 7126 ]] #1`: `is_predicate` returns `true`. 
- `8107 IfTrue === 8103 [[ 8086 ]] #1`: `is_predicate` returns `true`. - `8077 IfTrue === 8073 [[ 8103 ]] #1`: `is_predicate` returns `false`. The reason is that the assertion predicate Opaque nodes have already disappeared. I talked with @chhagedorn and he says that there are some "dying" initialized assertion predicates from unrolling that can be in the way. They would be cleaned out by IGVN later, and then we can see through. But at this point they are in the way, so we cannot see through and find the ZeroTripGuard; the predicate iterator is not good enough yet. But @chhagedorn is working on that. https://bugs.openjdk.org/browse/JDK-8350579 The implication is that the ZeroTripGuard can temporarily not be found, and so we cannot even find the pre-loop, nor the multiversion-if. So I cannot really add an assert now. And who knows, there may be other blocking reasons on top of that. @rwestrel Does that make sense? What do you think we should do? ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2678602660 From adinn at openjdk.org Mon Feb 24 14:58:57 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Mon, 24 Feb 2025 14:58:57 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v6] In-Reply-To: References: Message-ID: <_ApJlty8yCwyY8FiRhczpoKGf1G83hvMuXvOWeKHb90=.5758138f-b03b-49be-ab7a-3b4b56cbe7a6@github.com> On Thu, 20 Feb 2025 17:33:18 GMT, Ferenc Rakoczi wrote: >> By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. 
> > Ferenc Rakoczi has updated the pull request incrementally with four additional commits since the last revision: > > - Accepting suggested change from Andrew Dinn > - Added comments suggested by Andrew Dinn > - Fixed copyright years > - renaming a couple of functions src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4654: > 4652: > 4653: void dilithium_add_sub32() { > 4654: __ addv(v24, __ T4S, v0, v16); __ addv(v24, __ T4S, v0, v16); // a0 = b + c src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4663: > 4661: __ addv(v31, __ T4S, v7, v23); > 4662: > 4663: __ subv(v0, __ T4S, v0, v16); __ subv(v0, __ T4S, v0, v16); // a1 = b - c src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4674: > 4672: > 4673: void dilithium_montmul_sub_add16() { > 4674: __ sqdmulh(v24, __ T4S, v1, v16); __ mulv(v16, __ T4S, v16, v30); // m = aLow * qinv ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967809436 PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967809840 PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967811299 From epeter at openjdk.org Mon Feb 24 15:30:07 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 24 Feb 2025 15:30:07 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v3] In-Reply-To: References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> <5hd7BMjze01r6SZOvQ_Ogf_XV1UekB_mYQbpR5_Wzms=.a911ee76-094f-477c-8d24-564c4f0c39d3@github.com> Message-ID: On Mon, 24 Feb 2025 12:52:42 GMT, Roland Westrelin wrote: >>> @rwestrel I think I had tried some verifications above, but I could not even get it to work in all cases in `SuperWord`. >>> >>> In `VLoop::check_preconditions_helper`, I try to find either the predicate or the multiversioning if. But I cannot always find it, and I think that one reason was that the pre-loop can be lost. 
At least that is what I remember from 4+ weeks ago. >> >> Do you understand when that happens? It doesn't feel right that the pre loop can be lost. > >> @rwestrel Do you want me to find examples for the pre-loop disappearing? I suppose I can find some easily by adding an assert in SuperWord, where we bail out, as I showed above. > > Yes, if not too much work. @rwestrel I think we should just file an RFE to keep track of these assertions we would like to add once those issues are fixed. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2678803600 From adinn at openjdk.org Mon Feb 24 15:33:08 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Mon, 24 Feb 2025 15:33:08 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v6] In-Reply-To: References: Message-ID: On Thu, 20 Feb 2025 17:33:18 GMT, Ferenc Rakoczi wrote: >> By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. 
> > Ferenc Rakoczi has updated the pull request incrementally with four additional commits since the last revision: > > - Accepting suggested change from Andrew Dinn > - Added comments suggested by Andrew Dinn > - Fixed copyright years > - renaming a couple of functions src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4683: > 4681: __ mulv(v19, __ T4S, v7, v19); > 4682: > 4683: __ mulv(v16, __ T4S, v16, v30); __ mulv(v16, __ T4S, v16, v30); // m = aLow * qinv src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4688: > 4686: __ mulv(v19, __ T4S, v19, v30); > 4687: > 4688: __ sqdmulh(v16, __ T4S, v16, v31); __ sqdmulh(v16, __ T4S, v16, v31); // n = hi32(2 * m * q) src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4693: > 4691: __ sqdmulh(v19, __ T4S, v19, v31); > 4692: > 4693: __ shsubv(v16, __ T4S, v24, v16); __ shsubv(v16, __ T4S, v24, v16); // a = (aHigh - n) / 2 src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4698: > 4696: __ shsubv(v19, __ T4S, v27, v19); > 4697: > 4698: __ subv(v1, __ T4S, v0, v16); __ subv(v1, __ T4S, v0, v16); // x1 = x - a src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4703: > 4701: __ subv(v7, __ T4S, v6, v19); > 4702: > 4703: __ addv(v0, __ T4S, v0, v16); __ addv(v0, __ T4S, v0, v16); // x0 = x + a src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4742: > 4740: > 4741: for (int i = 0; i < 4; i++) { > 4742: __ ldpq(v30, v31, Address(dilithiumConsts, 0)); __ ldpq(v30, v31, Address(dilithiumConsts, 0)); // qinv, q src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4813: > 4811: // level 5 > 4812: for (int i = 0; i < 1024; i += 256) { > 4813: __ ldpq(v30, v31, Address(dilithiumConsts, 0)); __ ldpq(v30, v31, Address(dilithiumConsts, 0)); // qinv, q src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4853: > 4851: // level 6 > 4852: for (int i = 0; i < 1024; i += 128) { > 4853: __ ldpq(v30, v31, Address(dilithiumConsts, 0)); __ ldpq(v30, v31, Address(dilithiumConsts, 0)); // qinv, q 
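[Editorial note: the Montgomery-multiplication annotations suggested above can be cross-checked against a scalar reference model. The following C++ sketch is an illustration, not code from the patch (the function names are invented here), assuming the standard ML-DSA constants q = 8380417 and qinv = q^-1 mod 2^32 = 58728449; each 32-bit vector lane performs this computation.]

```cpp
#include <cstdint>

// Scalar model of one 32-bit lane of the annotated Montgomery multiply:
// computes a with a * 2^32 == b * c (mod Q).
constexpr int32_t Q    = 8380417;   // ML-DSA modulus
constexpr int32_t QINV = 58728449;  // Q^-1 mod 2^32, so Q * QINV == 1 (mod 2^32)

// hi32(2 * x * y): what sqdmulh computes per lane (ignoring the saturating
// case x == y == INT32_MIN, which cannot arise for reduced inputs).
static int32_t sqdmulh(int32_t x, int32_t y) {
  return (int32_t)(((int64_t)x * y * 2) >> 32);
}

// (x - y) / 2 with a widened intermediate: what shsubv computes per lane.
static int32_t shsub(int32_t x, int32_t y) {
  return (int32_t)(((int64_t)x - y) >> 1);
}

static int32_t mont_mul(int32_t b, int32_t c) {
  int32_t aHigh = sqdmulh(b, c);                    // aHigh = hi32(2 * b * c)
  int32_t aLow  = (int32_t)((int64_t)b * c);        // aLow = lo32(b * c)
  int32_t m     = (int32_t)((int64_t)aLow * QINV);  // m = lo32(aLow * qinv)
  int32_t n     = sqdmulh(m, Q);                    // n = hi32(2 * m * q)
  return shsub(aHigh, n);                           // a = (aHigh - n) / 2
}
```

[Since m * q == b * c (mod 2^32), the low halves cancel exactly and the result satisfies a * 2^32 == b * c (mod q), which is the postcondition stated in the suggested header comment.]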
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4876: > 4874: // level 7 > 4875: for (int i = 0; i < 1024; i += 128) { > 4876: __ ldpq(v30, v31, Address(dilithiumConsts, 0)); __ ldpq(v30, v31, Address(dilithiumConsts, 0)); // qinv, q src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4905: > 4903: > 4904: void dilithium_sub_add_montmul16() { > 4905: __ subv(v20, __ T4S, v0, v1); __ subv(v20, __ T4S, v0, v1); // b = x0 - x1 src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4910: > 4908: __ subv(v23, __ T4S, v6, v7); > 4909: > 4910: __ addv(v0, __ T4S, v0, v1); __ addv(v0, __ T4S, v0, v1); // a0 = x0 + x1 src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4915: > 4913: __ addv(v6, __ T4S, v6, v7); > 4914: > 4915: __ sqdmulh(v24, __ T4S, v20, v16); __ sqdmulh(v24, __ T4S, v20, v16); // aHigh = hi32(2 * b * c) __ mulv(v1, __ T4S, v20, v16); // aLow = lo32(b * c) src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4924: > 4922: __ mulv(v7, __ T4S, v23, v19); > 4923: > 4924: __ mulv(v1, __ T4S, v1, v30); __ mulv(v1, __ T4S, v1, v30); // m = aLow * qinv src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4929: > 4927: __ mulv(v7, __ T4S, v7, v30); > 4928: > 4929: __ sqdmulh(v1, __ T4S, v1, v31); __ sqdmulh(v1, __ T4S, v1, v31); // n = hi32(2 * m * q) src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4934: > 4932: __ sqdmulh(v7, __ T4S, v7, v31); > 4933: > 4934: __ shsubv(v1, __ T4S, v24, v1); __ shsubv(v1, __ T4S, v24, v1); // a1 = (aHigh - n) / 2 src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5044: > 5042: // level0 > 5043: for (int i = 0; i < 1024; i += 128) { > 5044: __ ldpq(v30, v31, Address(dilithiumConsts, 0)); __ ldpq(v30, v31, Address(dilithiumConsts, 0)); // qinv, q src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5115: > 5113: __ str(v31, __ Q, Address(coeffs, i + 224)); > 5114: dilithium_load32zetas(zetas); > 5115: __ ldpq(v30, v31, Address(dilithiumConsts, 0)); __ ldpq(v30, v31, Address(dilithiumConsts, 0)); // qinv, q 
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5166: > 5164: __ lea(dilithiumConsts, ExternalAddress((address) StubRoutines::aarch64::_dilithiumConsts)); > 5165: > 5166: __ ldpq(v30, v31, Address(dilithiumConsts, 0)); __ ldpq(v30, v31, Address(dilithiumConsts, 0)); // qinv, q __ ldr(v29, __ Q, Address(dilithiumConsts, 48)); // rsquare src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5228: > 5226: __ lea(dilithiumConsts, ExternalAddress((address) StubRoutines::aarch64::_dilithiumConsts)); > 5227: > 5228: __ ldpq(v30, v31, Address(dilithiumConsts, 0)); __ ldpq(v30, v31, Address(dilithiumConsts, 0)); // qinv, q ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967863821 PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967864748 PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967865658 PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967866379 PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967866822 PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967867752 PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967869143 PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967870036 PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967870373 PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967871386 PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967871949 PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967872681 PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967873281 PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967873918 PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967874418 PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967875655 PR Review Comment: 
https://git.openjdk.org/jdk/pull/23300#discussion_r1967876745 PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967877717 PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1967878884 From roland at openjdk.org Mon Feb 24 15:49:01 2025 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 24 Feb 2025 15:49:01 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v3] In-Reply-To: References: <-9c7vyB-BTXBPy8qurDSvPUzcAv9LY_d8g8Xj5wnhi4=.7bac2991-37d1-40f5-be3e-bb7a9bdb9f26@github.com> <5hd7BMjze01r6SZOvQ_Ogf_XV1UekB_mYQbpR5_Wzms=.a911ee76-094f-477c-8d24-564c4f0c39d3@github.com> Message-ID: On Mon, 24 Feb 2025 12:52:42 GMT, Roland Westrelin wrote: >>> @rwestrel I think I had tried some verifications above, but I could not even get it to work in all cases in `SuperWord`. >>> >>> In `VLoop::check_preconditions_helper`, I try to find either the predicate or the multiversioning if. But I cannot always find it, and I think that one reason was that the pre-loop can be lost. At least that is what I remember from 4+ weeks ago. >> >> Do you understand when that happens? It doesn't feel right that the pre loop can be lost. > >> @rwestrel Do you want me to find examples for the pre-loop disappearing? I suppose I can find some easily by adding an assert in SuperWord, where we bail out, as I showed above. > > Yes, if not too much work. > @rwestrel I think we should just file an RFE to keep track of these assertions we would like to add once those issues are fixed. That sounds reasonable to me. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2678873056 From coleenp at openjdk.org Mon Feb 24 15:59:58 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Mon, 24 Feb 2025 15:59:58 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v7] In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 05:12:38 GMT, David Holmes wrote: > Does the SA not need any updates in relation to this? No, the SA doesn't know about these compiler intrinsics. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23572#issuecomment-2678913119 From coleenp at openjdk.org Mon Feb 24 15:59:59 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Mon, 24 Feb 2025 15:59:59 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v7] In-Reply-To: References: <_j9Wkg21aBltyVrbO4wxGFKmmLDy0T-eorRL4epfS4k=.5a453b6b-d673-4cc6-b29f-192fa74e290c@github.com> <3qpqR3PC8PFmdgaIoSYA3jDWdl-oon0-AcIzXcI76rY=.38635503-c067-4f6e-a4f1-92c1b6d991d1@github.com> Message-ID: <4eQr952WCBhGqlLqX0q2TCDLuFrwh_UmxgJcb2BOs_s=.8e7f55a7-60ec-4cc8-9a8b-cca84ccbba10@github.com> On Thu, 20 Feb 2025 23:23:08 GMT, Coleen Phillimore wrote: >>> ... but not in the return since the caller likely will fetch the klass pointer next. >> >> I notice that too. Callers are using is_primitive() to short-circuit calls to as_Klass(), which means they seem to be aware of this implementation detail when maybe they shouldn't. > > There are 70 callers so yes, it might be something that shouldn't be known in this many places. Definitely out of the scope of this PR. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1967943222 From adinn at openjdk.org Mon Feb 24 16:21:58 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Mon, 24 Feb 2025 16:21:58 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v6] In-Reply-To: References: Message-ID: <6B25PDNMw8dDUm8r5rX4heL3cfvbsPVKqnVg7e1Ax84=.43b91704-15fa-4445-b8be-216fffcf12d4@github.com> On Thu, 20 Feb 2025 17:33:18 GMT, Ferenc Rakoczi wrote: >> By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request incrementally with four additional commits since the last revision: > > - Accepting suggested change from Andrew Dinn > - Added comments suggested by Andrew Dinn > - Fixed copyright years > - renaming a couple of functions Please add comments as indicated to relate generated code to original Java source. Otherwise good to go. ------------- Marked as reviewed by adinn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23300#pullrequestreview-2637711807 From adinn at openjdk.org Mon Feb 24 16:21:59 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Mon, 24 Feb 2025 16:21:59 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: References: Message-ID: <36J5kPTCknNCBjMx56e9JmLK2vFbvxBXXXOvTmv5pDs=.6aaa25e2-4cd9-4217-8da3-3280c1d3c4db@github.com> On Fri, 21 Feb 2025 10:23:37 GMT, Ferenc Rakoczi wrote: >> Hi. Here is the test result of our CI. >> >> ### copyright year >> >> the following files should update the copyright year to 2025. 
>> >> >> src/hotspot/cpu/aarch64/assembler_aarch64.hpp >> src/hotspot/cpu/aarch64/stubRoutines_aarch64.hpp >> src/hotspot/share/runtime/globals.hpp >> src/java.base/share/classes/sun/security/provider/ML_DSA.java >> src/java.base/share/classes/sun/security/provider/SHA3Parallel.java >> test/micro/org/openjdk/bench/java/security/MLDSA.java >> >> >> ### cross-build failure >> >> Cross build for riscv64/s390/ppc64 failed. >> >> Here shows the error msg for ppc64 >> >> >> === Output from failing command(s) repeated here === >> * For target support_interim-jmods_support__create_java.base.jmod_exec: >> # >> # A fatal error has been detected by the Java Runtime Environment: >> # >> # Internal Error (/tmp/jdk-src/src/hotspot/share/asm/codeBuffer.hpp:200), pid=72752, tid=72769 >> # assert(allocates2(pc)) failed: not in CodeBuffer memory: 0x0000e85cb03dc620 <= 0x0000e85cb03e8ab4 <= 0x0000e85cb03e8ab0 >> # >> # JRE version: OpenJDK Runtime Environment (25.0) (fastdebug build 25-internal-git-1e01c6deec3) >> # Java VM: OpenJDK 64-Bit Server VM (fastdebug 25-internal-git-1e01c6deec3, mixed mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64) >> # Problematic frame: >> # V [libjvm.so+0x3b391c] Instruction_aarch64::~Instruction_aarch64()+0xbc >> # >> # Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" (or dumping to /tmp/ci-scripts/jdk-src/make/ >> # >> # An error report file with more information is saved as: >> # /tmp/jdk-src/make/hs_err_pid72752.log >> ... (rest of output omitted) >> >> * All command lines available in /sysroot/ppc64el/tmp/build-ppc64el/make-support/failure-logs. 
>> === End of repeated output === >> >> I suppose we should make a similar update to `src/hotspot/cpu/aarch64/stubDeclarations_aarch64.hpp` for the other platforms > > @shqking, I changed the copyright years, but I don't really understand how the aarch64-specific code can overflow buffers on other architectures. As far as I understand, Instruction_aarch64 should not have been there in a ppc build. > Was this a build attempted on an aarch64 host for the other architectures? @ferakocz I have indicated a few places where I think you should add comments to clarify the relationship to the original Java code or just clarify what data is being used. I think the code is ok to go in as it is but I would really like to investigate a better structuring of the generator code. This can be done as a follow-up rather than delay getting this version committed. There are two things I still see as problematic with the current code. 1) There are lots of places in your auxiliary generator methods and also in their client methods where you generate distinct sequences of calls to the assembler sharing essentially the same code shape i.e. the same instructions but with different vector register arguments. For example, in `dilithium_montmul32` you generate the multiply sequence to Montgomery multiply 4x4s registers in v0..v3 by 4x4s registers in v16..v19 and then repeat exactly the same code in exactly the same sequence to multiply the 4x4s registers in v4..v7 by 4x4s registers in v20..v23. Likewise, `dilithium_sub_add_montmul16` generates that same shape of code but uses the montmul sequence with odd registers v1..v7 paired against the compact sequence v16..v19. As another example, you generate various 4- or 8-long sequences of subv and addv operations at various points, including in some of the top-level methods. I appreciate that you have folded one of the montmul cases into the other by adding the `bool by_constant` parameter to `dilithium_montmul32`. 
However, I think it would be worth investigating an alternative that would allow more, and more systematic, use of auxiliary methods. 2) Your current auxiliary generator methods rely on a fixed mapping of input, output and scratch registers to specific registers. This is part of the reason why you cannot always call your auxiliaries (or smaller pieces of them) from other locations where the same code shape is generated -- the input and output mappings of data to registers expected by the auxiliary do not match the register sequences in which the relevant data are (transiently) located. This same fact also means that the repeated code sections heavily depend on naming exactly the right register on each generator line. That makes it harder for a maintainer to recognize that what is really just one common, abstract operation is, at each occurrence, consuming, combining and updating several input sequences of related registers to generate one or more output sequences. That also means that it would be very easy to introduce an error if the code ever needed to be changed. I would like to investigate an alternative approach where your auxiliary generator methods and their callers pass arguments that identify the vector register sequences to be consumed as inputs, used as temporaries and written as outputs. In cases where the routines operate on sequences of 4 or 8 successive vectors, that would, at the very least, involve specifying the first register for each input, temporary or output, e.g. for the montmul32 multiply v0+ by v16+ using v24+ as temporaries and v30+ as constants and output the results to v16+. However, that leaves it implicit that the first two inputs involve 8 registers while the temporaries involve 4 and the constants 2. The more general requirement is not just to specify the vector sequence length (2, 4 or 8) but also to allow the default stride of one (e.g. 
v0, v2, ...) or constant sequences (v28, v28, ... as would be needed for multiplying by a constant). I have prototyped a simple vector sequence type `VRSeq` that models an indexable sequence of FloatRegisters and allows many of your higher level routines to simply declare register sets they operate on and then pass them as arguments to a range of simple auxiliary generator functions that can be used in many places where you currently have a lot of inline calls to the assembler -- see attachment: [vseq.zip](https://github.com/user-attachments/files/18946470/vseq.zip) I'll raise a JIRA to cover recoding the current implementation using this type and post a follow-up PR that uses it to see how far it helps simplify the code. I believe it will make it easier for maintainers to understand the structure of the generated code and observe/verify the use of registers to store specific values. It should also allow assertions about the use of registers to be added to the code to ensure that values are not being overwritten (except in circumstances where that is legitimate). Meanwhile I'll approve this PR modulo the commenting I suggested. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23300#issuecomment-2678977770 From adinn at openjdk.org Mon Feb 24 16:33:54 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Mon, 24 Feb 2025 16:33:54 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v6] In-Reply-To: References: Message-ID: On Thu, 20 Feb 2025 17:33:18 GMT, Ferenc Rakoczi wrote: >> By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. 
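[Editorial note: to make the `VRSeq` idea mentioned above concrete, here is a hypothetical sketch of such a type. The names, fields and signatures are guesses for illustration only -- the actual prototype is in the attached vseq.zip. A register sequence is described by a base register, a length and a stride, where a stride of 0 yields a constant sequence.]

```cpp
#include <cassert>

// Illustrative model of an indexable vector-register sequence. In HotSpot
// the element type would be FloatRegister; plain ints stand in for the
// register numbers here.
struct VRSeq {
  int _base;    // first register number, e.g. 16 for v16
  int _stride;  // 1 for v16,v17,...; 2 for v0,v2,...; 0 for v28,v28,...
  int _length;  // number of registers in the sequence

  VRSeq(int base, int length, int stride = 1)
    : _base(base), _stride(stride), _length(length) {}

  int operator[](int i) const {
    assert(i >= 0 && i < _length);
    return _base + i * _stride;  // yields the i-th register of the sequence
  }
};

// A generator helper could then be written once against sequences instead of
// hard-coded register names, e.g. (pseudo-signature):
//   void montmul(VRSeq out, VRSeq in1, VRSeq in2, VRSeq tmp, VRSeq consts);
```

[A helper written this way could cover both the v0..v7 by v16..v23 case and the odd-register case in `dilithium_sub_add_montmul16` simply by passing different sequences.]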
> > Ferenc Rakoczi has updated the pull request incrementally with four additional commits since the last revision: > > - Accepting suggested change from Andrew Dinn > - Added comments suggested by Andrew Dinn > - Fixed copyright years > - renaming a couple of functions I raised [JDK-8350589](https://bugs.openjdk.org/browse/JDK-8350589) to cover investigation of an alternative implementation. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23300#issuecomment-2679012108 From aph at openjdk.org Mon Feb 24 17:09:54 2025 From: aph at openjdk.org (Andrew Haley) Date: Mon, 24 Feb 2025 17:09:54 GMT Subject: RFR: 8345125: Aarch64: Add aarch64 backend for Float16 scalar operations In-Reply-To: References: Message-ID: On Mon, 24 Feb 2025 12:09:57 GMT, Bhavana Kilambi wrote: > This patch adds aarch64 backend for scalar FP16 operations namely - add, subtract, multiply, divide, fma, sqrt, min and max. src/hotspot/cpu/aarch64/aarch64.ad line 17275: > 17273: > 17274: // This pattern would result in the following instructions (the first two are for ConvF2HF > 17275: // and the last instruction is for ReinterpretS2HF) - Suggestion: // Without this pattern, (ReinterpretS2HF (ConvF2HF src)) would result in the following instructions (the first two for ConvF2HF // and the last instruction for ReinterpretS2HF) - Reads a little better, I think? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23748#discussion_r1968070079 From adinn at openjdk.org Mon Feb 24 17:15:59 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Mon, 24 Feb 2025 17:15:59 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v6] In-Reply-To: References: Message-ID: On Thu, 20 Feb 2025 17:33:18 GMT, Ferenc Rakoczi wrote: >> By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. 
> > Ferenc Rakoczi has updated the pull request incrementally with four additional commits since the last revision: > > - Accepting suggested change from Andrew Dinn > - Added comments suggested by Andrew Dinn > - Fixed copyright years > - renaming a couple of functions Marked as reviewed by adinn (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/23300#pullrequestreview-2637878768 From adinn at openjdk.org Mon Feb 24 17:16:00 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Mon, 24 Feb 2025 17:16:00 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: References: <1yB95sOajuS5ptFI0GQWLepii5JsZ9DOsje-TEFyFYs=.a325ad18-17ed-4e77-b1e3-0bad2cf55c67@github.com> Message-ID: On Thu, 20 Feb 2025 17:22:25 GMT, Ferenc Rakoczi wrote: >> src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 2618: >> >>> 2616: INSN(smaxp, 0, 0b101001, false); // accepted arrangements: T8B, T16B, T4H, T8H, T2S, T4S >>> 2617: INSN(sminp, 0, 0b101011, false); // accepted arrangements: T8B, T16B, T4H, T8H, T2S, T4S >>> 2618: INSN(sqdmulh,0, 0b101101, false); // accepted arrangements: T4H, T8H, T2S, T4S >> >> Hi, not a comment on the algorithm itself but you might have to add these new instructions in the gtest for aarch64 here - test/hotspot/gtest/aarch64/aarch64-asmtest.py and use this file to generate test/hotspot/gtest/aarch64/asmtest.out.h which would contain these newly added instructions. > > I have tried that, but the python script (actually the as command that it started) threw error messages: > > aarch64ops.s:338:24: error: index must be a multiple of 8 in range [0, 32760]. 
> prfm PLDL1KEEP, [x15, 43] > ^ > aarch64ops.s:357:20: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4] > sub x1, x10, x23, sxth #2 > ^ > aarch64ops.s:359:20: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4] > add x11, x21, x5, uxtb #3 > ^ > aarch64ops.s:360:22: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4] > adds x11, x17, x17, uxtw #1 > ^ > aarch64ops.s:361:20: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4] > sub x11, x0, x15, uxtb #1 > ^ > aarch64ops.s:362:19: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4] > subs x7, x1, x0, sxth #2 > ^ > This is without any modifications from what is in the master branch currently. @ferakocz This also really needs addressing before committing the patch. Perhaps @theRealAph can advise on how to circumvent the problems you found when trying to update the python script? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1968076559 From aph at openjdk.org Mon Feb 24 17:31:52 2025 From: aph at openjdk.org (Andrew Haley) Date: Mon, 24 Feb 2025 17:31:52 GMT Subject: RFR: 8345125: Aarch64: Add aarch64 backend for Float16 scalar operations In-Reply-To: References: Message-ID: On Mon, 24 Feb 2025 12:09:57 GMT, Bhavana Kilambi wrote: > This patch adds aarch64 backend for scalar FP16 operations namely - add, subtract, multiply, divide, fma, sqrt, min and max. src/hotspot/cpu/aarch64/aarch64.ad line 6978: > 6976: // ldr instruction has 32/64/128 bit variants but not a 16-bit variant. This > 6977: // loads the 16-bit value from constant pool into a 32-bit register but only > 6978: // the bottom half will be populated. Surely what actually happens here is that it loads a 32-bit word from the constant pool. The bottom 16 bits of this word contain the half-precision constant, the top 16 bits are zero. 
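The point about the 32-bit constant-pool load can be illustrated in miniature. Below is a hedged sketch (not JDK code; the helper name `hf_constant_word` is invented for illustration) modeling the constant-pool slot as a 32-bit word whose low 16 bits hold the IEEE 754 half-precision encoding, so the 32-bit `ldr` variant fetches the constant with the top 16 bits zero:

```python
import struct

def hf_constant_word(value):
    # Model of the constant-pool slot described above: a 32-bit word whose
    # low 16 bits are the IEEE 754 binary16 encoding of `value` and whose
    # top 16 bits are zero -- what the 32-bit ldr variant would fetch.
    (half_bits,) = struct.unpack("<H", struct.pack("<e", value))
    return half_bits  # zero-extension to 32 bits is implicit

word = hf_constant_word(1.5)
# fp16 1.5 encodes as 0x3E00; the top half of the loaded word stays zero.
```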
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23748#discussion_r1968101418 From bkilambi at openjdk.org Mon Feb 24 17:44:52 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Mon, 24 Feb 2025 17:44:52 GMT Subject: RFR: 8345125: Aarch64: Add aarch64 backend for Float16 scalar operations In-Reply-To: References: Message-ID: On Mon, 24 Feb 2025 17:28:43 GMT, Andrew Haley wrote: >> This patch adds aarch64 backend for scalar FP16 operations namely - add, subtract, multiply, divide, fma, sqrt, min and max. > > src/hotspot/cpu/aarch64/aarch64.ad line 6978: > >> 6976: // ldr instruction has 32/64/128 bit variants but not a 16-bit variant. This >> 6977: // loads the 16-bit value from constant pool into a 32-bit register but only >> 6978: // the bottom half will be populated. > > Surely what actually happens here is that it loads a 32-bit word from the constant pool. The bottom 16 bits of this word contain the half-precision constant, the top 16 bits are zero. I agree. The wording didn't quite convey that. I will change it in my next PS. Thank you for looking into the patch! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23748#discussion_r1968120239 From liach at openjdk.org Mon Feb 24 17:52:00 2025 From: liach at openjdk.org (Chen Liang) Date: Mon, 24 Feb 2025 17:52:00 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v7] In-Reply-To: References: Message-ID: On Sat, 22 Feb 2025 14:49:38 GMT, Coleen Phillimore wrote: >> Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. >> Tested with tier1-4 and performance tests. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Use modifiers field directly in isInterface. 
The limited changes to the Java codebase look reasonable. We should probably get a double check from Alan or some other architect. ------------- Marked as reviewed by liach (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23572#pullrequestreview-2637961573 From rriggs at openjdk.org Mon Feb 24 19:10:02 2025 From: rriggs at openjdk.org (Roger Riggs) Date: Mon, 24 Feb 2025 19:10:02 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v7] In-Reply-To: References: Message-ID: On Sat, 22 Feb 2025 14:49:38 GMT, Coleen Phillimore wrote: >> Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. >> Tested with tier1-4 and performance tests. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Use modifiers field directly in isInterface. A nice simplification. src/java.base/share/classes/java/lang/Class.java line 241: > 239: private Class(ClassLoader loader, Class arrayComponentType, char mods, ProtectionDomain pd, boolean isPrim) { > 240: // Initialize final field for classLoader. The initialization value of non-null > 241: // prevents future JIT optimizations from assuming this final field is null. To add a bit more depth to this comment, I'd add: "The following assignments are done directly by the VM without calling this constructor." Or something to that effect. ------------- Marked as reviewed by rriggs (Reviewer).
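As a rough model of the change under review (a toy, not the actual JDK code; `ACC_INTERFACE = 0x0200` is the standard JVM access flag, but the class and field names here are stand-ins), the three predicates become plain field checks rather than native calls:

```python
ACC_INTERFACE = 0x0200  # standard JVM access flag marking interfaces

class MirrorModel:
    # Toy stand-in for java.lang.Class: in the real change the VM fills in
    # these fields directly when it creates the mirror object.
    def __init__(self, modifiers, component_type=None, is_primitive=False):
        self.modifiers = modifiers
        self.component_type = component_type   # non-null only for arrays
        self.is_primitive_flag = is_primitive  # the new final transient boolean

    def is_interface(self):
        # check the modifier flags instead of calling into the VM
        return (self.modifiers & ACC_INTERFACE) != 0

    def is_array(self):
        # check whether the component mirror is non-null
        return self.component_type is not None

    def is_primitive(self):
        # read the VM-initialized boolean
        return self.is_primitive_flag
```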
PR Review: https://git.openjdk.org/jdk/pull/23572#pullrequestreview-2638174546 PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1968254793 From coleenp at openjdk.org Mon Feb 24 19:30:41 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Mon, 24 Feb 2025 19:30:41 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v8] In-Reply-To: References: Message-ID: > Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. > Tested with tier1-4 and performance tests. Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: Add a comment about Class constructor. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23572/files - new: https://git.openjdk.org/jdk/pull/23572/files/db7c9782..591abdda Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23572&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23572&range=06-07 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23572.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23572/head:pull/23572 PR: https://git.openjdk.org/jdk/pull/23572 From coleenp at openjdk.org Mon Feb 24 19:30:41 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Mon, 24 Feb 2025 19:30:41 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v7] In-Reply-To: References: Message-ID: On Mon, 24 Feb 2025 19:06:30 GMT, Roger Riggs wrote: >> Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: >> >> Use modifiers field directly in isInterface. 
> > src/java.base/share/classes/java/lang/Class.java line 241: > >> 239: private Class(ClassLoader loader, Class arrayComponentType, char mods, ProtectionDomain pd, boolean isPrim) { >> 240: // Initialize final field for classLoader. The initialization value of non-null >> 241: // prevents future JIT optimizations from assuming this final field is null. > > To add a bit more depth to this comment, I'd add. > > "The following assignments are done directly by the VM without calling this constructor." > Or something to that effect. Okay, that's a good comment. I'll add it. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23572#discussion_r1968297499 From coleenp at openjdk.org Mon Feb 24 19:30:41 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Mon, 24 Feb 2025 19:30:41 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v7] In-Reply-To: References: Message-ID: <5i_vwoj0oivW08tMAX5Bp2m7yK_pgQOy0b7_MizQ-uM=.0f54046e-8972-4d05-89d6-aee42b079b48@github.com> On Sat, 22 Feb 2025 14:49:38 GMT, Coleen Phillimore wrote: >> Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. >> Tested with tier1-4 and performance tests. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Use modifiers field directly in isInterface. Thanks for reviewing Roger. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23572#issuecomment-2679447427 From dlong at openjdk.org Mon Feb 24 21:09:57 2025 From: dlong at openjdk.org (Dean Long) Date: Mon, 24 Feb 2025 21:09:57 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v8] In-Reply-To: References: Message-ID: On Mon, 24 Feb 2025 19:30:41 GMT, Coleen Phillimore wrote: >> Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. >> Tested with tier1-4 and performance tests. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Add a comment about Class constructor. Marked as reviewed by dlong (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/23572#pullrequestreview-2638441924 From kvn at openjdk.org Tue Feb 25 00:37:00 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 25 Feb 2025 00:37:00 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: <9mXRl7rScxJwxNNlV_H1gxndtzZ6g-gE8cMsc6VsTJQ=.b5a77c13-6e7e-4203-898a-3318e298d30f@github.com> On Mon, 24 Feb 2025 08:00:24 GMT, Emanuel Peter wrote: > But if we do not optimize the slow path loop, then we would get performance regressions in aliasing cases because we have no unrolling for them any more. Okay, we are back to our previous conversation - we will wait for your aliasing-analysis runtime-checks implementation and do performance runs to see if the "slow" path affects performance. Okay. PS: A "slow" path implies that it is not taken frequently and should not affect the general performance of the application.
------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2680031423 From epeter at openjdk.org Tue Feb 25 07:11:55 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 25 Feb 2025 07:11:55 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: <9mXRl7rScxJwxNNlV_H1gxndtzZ6g-gE8cMsc6VsTJQ=.b5a77c13-6e7e-4203-898a-3318e298d30f@github.com> References: <9mXRl7rScxJwxNNlV_H1gxndtzZ6g-gE8cMsc6VsTJQ=.b5a77c13-6e7e-4203-898a-3318e298d30f@github.com> Message-ID: On Tue, 25 Feb 2025 00:34:14 GMT, Vladimir Kozlov wrote: > > But if we do not optimize the slow path loop, then we would get performance regressions in aliasing cases because we have no unrolling for them any more. > > Okay, we are back to our previous conversation - we will wait your aliasing-analysis runtime-checks implementation and do performance runs to see if "slow" path affects performance. > > Okay. Sounds good, we will revisit and write more benchmarks there. > > PS: "slow" path implies that it is not taking frequently and it should not affect general performance of application. For me "slow" just means less optimized, because some assumption does not hold. The "fast" path is faster, because it has more assumptions and can optimize more (i.e. vectorize in this case, or vectorize more instructions). Do you have a better name than "fast/slow"? 
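The fast/slow multiversioning shape being discussed can be sketched abstractly (hypothetical names and widths; the real C2 transformation emits two copies of the loop guarded by a `multiversion_if`): both versions compute the same result, but the fast one runs under a speculative alignment assumption that is checked once at runtime.

```python
VECTOR_BYTES = 16  # hypothetical vector width the fast path assumes

def increment_ints(buf, base_address):
    # multiversion_if: check the speculative assumption once, at runtime.
    if base_address % VECTOR_BYTES == 0:
        # fast_loop: alignment may be assumed, so this copy could be
        # vectorized (modeled here by processing in vector-sized chunks).
        step = VECTOR_BYTES // 4  # 4-byte ints per vector
        for i in range(0, len(buf), step):
            for j in range(i, min(i + step, len(buf))):
                buf[j] += 1
    else:
        # slow_loop: no alignment assumption, so no vectorization -- but it
        # is still compiled and can still be unrolled/optimized otherwise.
        for i in range(len(buf)):
            buf[i] += 1
    return buf
```

Whatever the two paths end up being called, the key property is that they are semantically identical and differ only in which optimizations their assumptions permit.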
------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2680885496 From epeter at openjdk.org Tue Feb 25 07:15:56 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 25 Feb 2025 07:15:56 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: <9mXRl7rScxJwxNNlV_H1gxndtzZ6g-gE8cMsc6VsTJQ=.b5a77c13-6e7e-4203-898a-3318e298d30f@github.com> References: <9mXRl7rScxJwxNNlV_H1gxndtzZ6g-gE8cMsc6VsTJQ=.b5a77c13-6e7e-4203-898a-3318e298d30f@github.com> Message-ID: On Tue, 25 Feb 2025 00:34:14 GMT, Vladimir Kozlov wrote: >> @vnkozlov I mean the issue this: once I implement aliasing-analysis runtime-checks with this multiversion approach, then we'd get regressions if we do not optimize the slow path loop. Currently, we would not vectorize (because we have to be ready for aliasing cases), but we at least unroll, and whatever else we can except vectorization. But if we do not optimize the slow path loop, then we would get performance regressions in aliasing cases because we have no unrolling for them any more. I think we need to avoid that - would you agree? > >> But if we do not optimize the slow path loop, then we would get performance regressions in aliasing cases because we have no unrolling for them any more. > > Okay, we are back to our previous conversation - we will wait your aliasing-analysis runtime-checks implementation and do performance runs to see if "slow" path affects performance. > > Okay. > > PS: "slow" path implies that it is not taking frequently and it should not affect general performance of application. @vnkozlov @rwestrel Let me summarize the tasks left to do here: - Rename `stalled` -> `delayed`. And `unstall` -> `resume_optimizations` or alike. Improve some comments. - File follow-up RFE for more verification (must find multiversion-if from multiversioned loop) - currently blocked by predicate traversal issue. 
Maybe we can also assert that we can always find the pre-loop from the main-loop, at least during loop-opts. - When working on aliasing-analysis runtime-check, we have to do more performance analysis, and show the need of both the fast and slow path loops. Let me know if there is more ;) ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2680894298 From epeter at openjdk.org Tue Feb 25 09:27:13 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 25 Feb 2025 09:27:13 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v4] In-Reply-To: References: Message-ID: > Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below. > > **Background** > > With `-XX:+AlignVector`, all vector loads/stores must be aligned. We try to statically determine if we can always align the vectors. One condition is that the address `base` is already aligned. For arrays, we know that this always holds, because they are `ObjectAlignmentInBytes` aligned. But with native memory, the `base` is just some arbitrarily aligned pointer. > > **Problem** > > So far, we have just naively assumed that the `base` is always `ObjectAlignmentInBytes` aligned. But that does not hold for `native` memory segments: the `base` can also be unaligned. I had constructed such an example, and with `-XX:+AlignVector -XX:+VerifyAlignVector` this example hits the verification code. > > > MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1); > MemorySegment nativeUnaligned = nativeAligned.asSlice(1); > test3(nativeUnaligned); > > > When compiling the test method, we assume that the `nativeUnaligned.address()` is aligned - but it is not! 
> > static void test3(MemorySegment ms) { > for (int i = 0; i < RANGE; i++) { > long adr = i * 4L; > int v = ms.get(ELEMENT_LAYOUT, adr); > ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1)); > } > } > > > **Solution: Runtime Checks - Predicate and Multiversioning** > > Of course we could just forbid cases where we have a `native` base from vectorizing. But that would lead to regressions currently - in most cases we do get aligned `base`s, and we currently vectorize those. We cannot statically determine if the `base` is aligned, we need a runtime check. > > I came up with 2 options where to place the runtime checks: > - A new "auto vectorization" Parse Predicate: > - This only works when predicates are available. > - If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop. > - Multiversion the loop: > - Create 2 copies of the loop (fast and slow loops). > - The `fast_loop` can make speculative alignment assumptions, and add the corresponding check to the `multiversion_if` which decides which loop we take > - In the `slow_loop`, we make no assumption which means we can not vectorize, but we still compile - so even unaligned `base`s would end up with reasonably fast code. > - We "stall" the `... Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 66 commits: - Merge branch 'master' into JDK-8323582-SW-native-alignment - stall -> delay, plus some more comments - adjust selector if probability - Merge branch 'master' into JDK-8323582-SW-native-alignment - remove multiversion mark if we break the structure - register opaque with igvn - copyright and rm CFG check - IR rules for all cases - 3 test versions - test changed to unaligned ints - ... 
and 56 more: https://git.openjdk.org/jdk/compare/d551daca...8eb52292 ------------- Changes: https://git.openjdk.org/jdk/pull/22016/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=22016&range=03 Stats: 1089 lines in 27 files changed: 966 ins; 28 del; 95 mod Patch: https://git.openjdk.org/jdk/pull/22016.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22016/head:pull/22016 PR: https://git.openjdk.org/jdk/pull/22016 From epeter at openjdk.org Tue Feb 25 09:36:58 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 25 Feb 2025 09:36:58 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: <9mXRl7rScxJwxNNlV_H1gxndtzZ6g-gE8cMsc6VsTJQ=.b5a77c13-6e7e-4203-898a-3318e298d30f@github.com> References: <9mXRl7rScxJwxNNlV_H1gxndtzZ6g-gE8cMsc6VsTJQ=.b5a77c13-6e7e-4203-898a-3318e298d30f@github.com> Message-ID: On Tue, 25 Feb 2025 00:34:14 GMT, Vladimir Kozlov wrote: >> @vnkozlov I mean the issue this: once I implement aliasing-analysis runtime-checks with this multiversion approach, then we'd get regressions if we do not optimize the slow path loop. Currently, we would not vectorize (because we have to be ready for aliasing cases), but we at least unroll, and whatever else we can except vectorization. But if we do not optimize the slow path loop, then we would get performance regressions in aliasing cases because we have no unrolling for them any more. I think we need to avoid that - would you agree? > >> But if we do not optimize the slow path loop, then we would get performance regressions in aliasing cases because we have no unrolling for them any more. > > Okay, we are back to our previous conversation - we will wait your aliasing-analysis runtime-checks implementation and do performance runs to see if "slow" path affects performance. > > Okay. > > PS: "slow" path implies that it is not taking frequently and it should not affect general performance of application. 
@vnkozlov @rwestrel - I did the `stall` -> `delay` renaming, and added some more comments in places you asked for it. Let me know if that looks better. - Filed: [JDK-8350637](https://bugs.openjdk.org/browse/JDK-8350637): C2: verify that main_loop finds pre_loop and that multiversion loops find the multiversion_if - I added a comment to [JDK-8324751](https://bugs.openjdk.org/browse/JDK-8324751) C2 SuperWord: Aliasing Analysis runtime check, to check performance around slow_loop. Let me know what more I can do ;) ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2681315131 From aph at openjdk.org Tue Feb 25 09:40:57 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 25 Feb 2025 09:40:57 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: References: <1yB95sOajuS5ptFI0GQWLepii5JsZ9DOsje-TEFyFYs=.a325ad18-17ed-4e77-b1e3-0bad2cf55c67@github.com> Message-ID: On Mon, 24 Feb 2025 17:11:24 GMT, Andrew Dinn wrote: >> I have tried that, but the python script (actually the as command that it started) threw error messages: >> >> aarch64ops.s:338:24: error: index must be a multiple of 8 in range [0, 32760]. >> prfm PLDL1KEEP, [x15, 43] >> ^ >> aarch64ops.s:357:20: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4] >> sub x1, x10, x23, sxth #2 >> ^ >> aarch64ops.s:359:20: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4] >> add x11, x21, x5, uxtb #3 >> ^ >> aarch64ops.s:360:22: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4] >> adds x11, x17, x17, uxtw #1 >> ^ >> aarch64ops.s:361:20: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4] >> sub x11, x0, x15, uxtb #1 >> ^ >> aarch64ops.s:362:19: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4] >> subs x7, x1, x0, sxth #2 >> ^ >> This is without any modifications from what is in the master branch currently. 
> @ferakocz This also really needs addressing before committing the patch. Perhaps @theRealAph can advise on how to circumvent the problems you found when trying to update the python script? > You might have to use an assembler from the latest binutils build (if the system default isn't the latest) and add the path to the assembler in the "AS" variable. Also you can run it something like - `python aarch64-asmtest.py | expand > asmtest.out.h`. Please let me know if you still face problems. People have been running this script for a decade now. Let's look at just one of these: aarch64ops.s:357:20: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4] sub x1, x10, x23, sxth #2 From the AArch64 manual: SUB (extended register) SUB <Xd|SP>, <Xn|SP>, <R><m>{, <extend> {#<amount>}} It thinks this is a SUB (shifted register), but it's really a SUB (extended register). fedora:aarch64 $ cat t.s sub x1, x10, x23, sxth #2 fedora:aarch64 $ as t.s fedora:aarch64 $ objdump -D a.out Disassembly of section .text: 0000000000000000 <.text>: 0: cb37a941 sub x1, x10, w23, sxth #2 So perhaps binutils expects w23 here, not x23. But the manual (ARM DDI 0487K.a) says x23 should be just fine, and, what's more, gives the x form preferred status. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1969374124 From duke at openjdk.org Tue Feb 25 11:17:58 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Tue, 25 Feb 2025 11:17:58 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: References: <1yB95sOajuS5ptFI0GQWLepii5JsZ9DOsje-TEFyFYs=.a325ad18-17ed-4e77-b1e3-0bad2cf55c67@github.com> Message-ID: On Tue, 25 Feb 2025 09:36:49 GMT, Andrew Haley wrote: >> @ferakocz This also really needs addressing before committing the patch. Perhaps @theRealAph can advise on how to circumvent the problems you found when trying to update the python script?
> >> You might have to use an assembler from the latest binutils build (if the system default isn't the latest) and add the path to the assembler in the "AS" variable. Also you can run it something like - `python aarch64-asmtest.py | expand > asmtest.out.h`. Please let me know if you still face problems. > > People have been running this script for a decade now. > > Let's look at just one of these: > > > aarch64ops.s:357:20: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4] > sub x1, x10, x23, sxth #2 > > > From the AArch64 manual: > > SUB (extended register) > SUB <Xd|SP>, <Xn|SP>, <R><m>{, <extend> {#<amount>}} > > It thinks this is a SUB (shifted register), but it's really a SUB (extended register). > > > fedora:aarch64 $ cat t.s > sub x1, x10, x23, sxth #2 > fedora:aarch64 $ as t.s > fedora:aarch64 $ objdump -D a.out > Disassembly of section .text: > > 0000000000000000 <.text>: > 0: cb37a941 sub x1, x10, w23, sxth #2 > > > So perhaps binutils expects w23 here, not x23. But the manual (ARM DDI 0487K.a) says x23 should be just fine, and, what's more, gives the x form preferred status. @theRealAph, maybe we are not reading the same manual (ARM DDI 0487K.a).
In my copy, SUB (extended register) is defined as SUB <Xd|SP>, <Xn|SP>, <R><m>{, <extend> {#<amount>}} and <R> should be W when <extend> is SXTH, and the as I have enforces this: ferakocz at ferakocz-mac aarch64 % cat t.s sub x1, x10, w23, sxth #2 ferakocz at ferakocz-mac aarch64 % cat > t1.s sub x1, x10, x23, sxth #2 ferakocz at ferakocz-mac aarch64 % cat t.s sub x1, x10, w23, sxth #2 ferakocz at ferakocz-mac aarch64 % cat t1.s sub x1, x10, x23, sxth #2 ferakocz at ferakocz-mac aarch64 % as --version Apple clang version 16.0.0 (clang-1600.0.26.6) Target: arm64-apple-darwin24.3.0 Thread model: posix InstalledDir: /Library/Developer/CommandLineTools/usr/bin ferakocz at ferakocz-mac aarch64 % as t.s ferakocz at ferakocz-mac aarch64 % objdump -D t.o t.o: file format mach-o arm64 Disassembly of section __TEXT,__text: 0000000000000000 : 0: cb37a941 sub x1, x10, w23, sxth #2 ferakocz at ferakocz-mac aarch64 % as t1.s t1.s:1:19: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4] sub x1, x10, x23, sxth #2 ^ I have not found the place in the manual where it allows/encourages the use of x instead of w, but I admit I haven't read through all of the 14568 pages. So I'm stuck for now. What 'as' are you using? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1969561791 From coleenp at openjdk.org Tue Feb 25 12:40:03 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Tue, 25 Feb 2025 12:40:03 GMT Subject: RFR: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native [v8] In-Reply-To: References: Message-ID: On Mon, 24 Feb 2025 19:30:41 GMT, Coleen Phillimore wrote: >> Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. >> Tested with tier1-4 and performance tests.
> > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Add a comment about Class constructor. Thanks for reviewing Dean, Roger, Vladimir, Yudi and Chen, and comments David. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23572#issuecomment-2681823548 From coleenp at openjdk.org Tue Feb 25 12:40:04 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Tue, 25 Feb 2025 12:40:04 GMT Subject: Integrated: 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native In-Reply-To: References: Message-ID: On Tue, 11 Feb 2025 20:56:39 GMT, Coleen Phillimore wrote: > Class.isInterface() can check modifier flags, Class.isArray() can check whether component mirror is non-null and Class.isPrimitive() needs a new final transient boolean in java.lang.Class that the JVM code initializes. > Tested with tier1-4 and performance tests. This pull request has now been integrated. Changeset: c413549e Author: Coleen Phillimore URL: https://git.openjdk.org/jdk/commit/c413549eb775f4209416c718dc9aa0748144a6b4 Stats: 202 lines in 20 files changed: 43 ins; 128 del; 31 mod 8349860: Make Class.isArray(), Class.isInterface() and Class.isPrimitive() non-native Reviewed-by: dlong, rriggs, vlivanov, yzheng, liach ------------- PR: https://git.openjdk.org/jdk/pull/23572 From coleenp at openjdk.org Tue Feb 25 13:19:11 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Tue, 25 Feb 2025 13:19:11 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists In-Reply-To: References: Message-ID: On Mon, 3 Feb 2025 16:29:25 GMT, Fredrik Bredberg wrote: > I've combined two `ObjectMonitor`'s lists, `EntryList` and `cxq`, into one list. The `entry_list`. 
> > This way c2 no longer has to check both `EntryList` and `cxq` in order to opt out if the "conceptual entry list" is empty, which also means that the constant question about if it's safe to first check the `EntryList` and then `cxq` will be a thing of the past. > > In the current multi-queue design new threads were always added to the `cxq`, then `ObjectMonitor::exit` would choose a successor from the head of `EntryList`. When the `EntryList` was empty and `cxq` was not, `ObjectMonitor::exit` would detach the singly linked `cxq` list and add the elements to the doubly linked `EntryList`. The element that was first added to `cxq` would be at the tail of the `EntryList`. This way you ended up working through the contending threads in LIFO-chunks. > > The new list-design is as much a multi-queue as the current. Conceptually it can be looked upon as if the old singly linked `cxq` list doesn't end with a null pointer, but instead has a link that points to the head of the doubly linked `entry_list`. > > You always add to the `entry_list` by Compare And Exchange to the head. The most common case is that you remove from the tail (the successor is chosen in strict FIFO order). The head is volatile, but the interior is stable. > > The first contending thread that "pushes" itself onto `entry_list` will be the last thread in the list. Each newly pushed thread in `entry_list` will be linked through its next pointer, and have its prev pointer set to null, thus pushing new threads onto `entry_list` will form a singly linked list. The list is always in the right order (via the next-pointers) and is never moved to another list. > > Since we choose the successor in FIFO order, the exiting thread needs to find the tail of the `entry_list`. This is done by walking from the `entry_list` head. While walking the list we assign the prev pointers of each thread, essentially forming a doubly linked list.
The tail pointer is cached in `entry_list_tail` so that we don't need to walk from the `entry_list` head each time we need to find the tail (successor). > > Performance-wise the new design seems to be equal to the old design, even though c2 generates two fewer instructions per monitor unlock operation. > > However, the complexity of the source has been reduced by removing the `TS_CXQ` state and adding functions instead of inlining `cmpxchg` here and there, and the fact that c2 no longer has to check b... src/hotspot/share/jvmci/vmStructs_jvmci.cpp line 332: > 330: volatile_nonstatic_field(ObjectMonitor, _owner, int64_t) \ > 331: volatile_nonstatic_field(ObjectMonitor, _recursions, intptr_t) \ > 332: volatile_nonstatic_field(ObjectMonitor, _EntryListTail, ObjectWaiter*) \ You may need to coordinate with @mur47x111 to see what graal does with this field. I suspect the graal code also checks both cxq and EntryList in the unlock fast path and now only needs to check _EntryList. In which case we don't need to export EntryListTail. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1947058523 From fbredberg at openjdk.org Tue Feb 25 13:19:10 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Tue, 25 Feb 2025 13:19:10 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists Message-ID: I've combined two `ObjectMonitor`'s lists, `EntryList` and `cxq`, into one list. The `entry_list`. This way c2 no longer has to check both `EntryList` and `cxq` in order to opt out if the "conceptual entry list" is empty, which also means that the constant question about if it's safe to first check the `EntryList` and then `cxq` will be a thing of the past. In the current multi-queue design new threads were always added to the `cxq`, then `ObjectMonitor::exit` would choose a successor from the head of `EntryList`.
When the `EntryList` was empty and `cxq` was not, `ObjectMonitor::exit` would detach the singly linked `cxq` list and add the elements to the doubly linked `EntryList`. The element that was first added to `cxq` would be at the tail of the `EntryList`. This way you ended up working through the contending threads in LIFO-chunks. The new list-design is as much a multi-queue as the current. Conceptually it can be looked upon as if the old singly linked `cxq` list doesn't end with a null pointer, but instead has a link that points to the head of the doubly linked `entry_list`. You always add to the `entry_list` by Compare And Exchange to the head. The most common case is that you remove from the tail (the successor is chosen in strict FIFO order). The head is volatile, but the interior is stable. The first contending thread that "pushes" itself onto `entry_list` will be the last thread in the list. Each newly pushed thread in `entry_list` will be linked through its next pointer, and have its prev pointer set to null, thus pushing new threads onto `entry_list` will form a singly linked list. The list is always in the right order (via the next-pointers) and is never moved to another list. Since we choose the successor in FIFO order, the exiting thread needs to find the tail of the `entry_list`. This is done by walking from the `entry_list` head. While walking the list we assign the prev pointers of each thread, essentially forming a doubly linked list. The tail pointer is cached in `entry_list_tail` so that we don't need to walk from the `entry_list` head each time we need to find the tail (successor). Performance-wise the new design seems to be equal to the old design, even though c2 generates two fewer instructions per monitor unlock operation.
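[Editor's note] The push-to-head/walk-to-tail scheme described above can be sketched in a few lines of standard C++. This is a hypothetical, simplified model for illustration only: `std::atomic` stands in for HotSpot's Atomic class, the names (`entry_list`, `entry_list_tail`) mirror the PR text, and none of this is the actual HotSpot code.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical model of the entry_list scheme described in the PR text.
struct Waiter {
  Waiter* next = nullptr;   // set once at push; the interior stays stable
  Waiter* prev = nullptr;   // assigned lazily while walking to the tail
  int id;
  explicit Waiter(int i) : id(i) {}
};

struct Monitor {
  std::atomic<Waiter*> entry_list{nullptr};  // head: most recently pushed
  Waiter* entry_list_tail = nullptr;         // cached tail: FIFO successor

  // Contending threads push themselves onto the head with CAS.
  void push(Waiter* w) {
    Waiter* head = entry_list.load(std::memory_order_relaxed);
    do {
      w->next = head;  // link through next only: singly linked at first
    } while (!entry_list.compare_exchange_weak(head, w,
                                               std::memory_order_release,
                                               std::memory_order_relaxed));
  }

  // The exiting thread chooses the successor in FIFO order: walk from the
  // head, assigning prev pointers as it goes, and cache the tail.
  Waiter* choose_successor() {
    if (entry_list_tail != nullptr) return entry_list_tail;
    Waiter* w = entry_list.load(std::memory_order_acquire);
    if (w == nullptr) return nullptr;
    while (w->next != nullptr) {
      w->next->prev = w;    // doubly link the walked portion
      w = w->next;
    }
    entry_list_tail = w;    // first-pushed waiter == FIFO successor
    return w;
  }
};
```

This mirrors why the head is the only contended word: pushers touch only the head CAS, while the tail walk happens on the exit path and its result is cached.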
However, the complexity of the source has been reduced by removing the `TS_CXQ` state and adding functions instead of inlining `cmpxchg` here and there, and the fact that c2 no longer has to check both `EntryList` and `cxq` makes this PR worthwhile, I think. Tests tier1-7 pass okay, as do micro-benchmarks like `vm.lang.LockUnlock`. Unsupported platforms { ppc, riscv, s390 } have been tested with QEMU. ------------- Commit messages: - Moved set_bad_pointers() and added accessors. - Merge branch 'master' into 8343840_rewrite_objectmonitor_lists - Atomic hygiene - Fixed a bug in UnlinkAfterAcquire - General cleanup - Updated theory of operations comment - 8343840: Rewrite the ObjectMonitor lists Changes: https://git.openjdk.org/jdk/pull/23421/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23421&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8343840 Stats: 594 lines in 9 files changed: 213 ins; 219 del; 162 mod Patch: https://git.openjdk.org/jdk/pull/23421.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23421/head:pull/23421 PR: https://git.openjdk.org/jdk/pull/23421 From aph at openjdk.org Tue Feb 25 13:19:02 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 25 Feb 2025 13:19:02 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: References: <1yB95sOajuS5ptFI0GQWLepii5JsZ9DOsje-TEFyFYs=.a325ad18-17ed-4e77-b1e3-0bad2cf55c67@github.com> Message-ID: On Tue, 25 Feb 2025 11:15:39 GMT, Ferenc Rakoczi wrote: >>> You might have to use an assembler from the latest binutils build (if the system default isn't the latest) and add the path to the assembler in the "AS" variable. Also you can run it something like - `python aarch64-asmtest.py | expand > asmtest.out.h`. Please let me know if you still face problems. >> >> People have been running this script for a decade now.
>> >> Let's look at just one of these: >> >> >> aarch64ops.s:357:20: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4] >> sub x1, x10, x23, sxth #2 >> >> >> From the AArch64 manual: >> >> SUB (extended register) >> SUB <Xd|SP>, <Xn|SP>, <R><m>{, <extend> {#<amount>}} >> >> It thinks this is a SUB (shifted register), but it's really a SUB (extended register). >> >> >> fedora:aarch64 $ cat t.s >> sub x1, x10, x23, sxth #2 >> fedora:aarch64 $ as t.s >> fedora:aarch64 $ objdump -D a.out >> Disassembly of section .text: >> >> 0000000000000000 <.text>: >> 0: cb37a941 sub x1, x10, w23, sxth #2 >> >> >> So perhaps binutils expects w23 here, not x23. But the manual (ARM DDI 0487K.a) says x23 should be just fine, and, what's more, gives the x form preferred status. > > @theRealAlph, maybe we are not reading the same manual (ARM DDI 0487K.a). In my copy: > SUB (extended register) is defined as > SUB <Xd|SP>, <Xn|SP>, <R><m>{, <extend> {#<amount>}} > and <R> should be W when <extend> is SXTH > and the as I have enforces this: > > ferakocz at ferakocz-mac aarch64 % cat t.s > sub x1, x10, w23, sxth #2 > ferakocz at ferakocz-mac aarch64 % cat > t1.s > sub x1, x10, x23, sxth #2 > ferakocz at ferakocz-mac aarch64 % cat t.s > sub x1, x10, w23, sxth #2 > ferakocz at ferakocz-mac aarch64 % cat t1.s > sub x1, x10, x23, sxth #2 > ferakocz at ferakocz-mac aarch64 % as --version > Apple clang version 16.0.0 (clang-1600.0.26.6) > Target: arm64-apple-darwin24.3.0 > Thread model: posix > InstalledDir: /Library/Developer/CommandLineTools/usr/bin > ferakocz at ferakocz-mac aarch64 % as t.s > ferakocz at ferakocz-mac aarch64 % objdump -D t.o > > t.o: file format mach-o arm64 > > Disassembly of section __TEXT,__text: > > 0000000000000000 : > 0: cb37a941 sub x1, x10, w23, sxth #2 > ferakocz at ferakocz-mac aarch64 % as t1.s > t1.s:1:19: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4] > sub x1, x10, x23, sxth #2 > ^ > > I have not found the place in the manual where it allows/encourages the use of x instead of w, but I admit I
haven't read through all of the 14568 pages. > > So I'm stuck for now. What 'as' are you using? > I have not found the place in the manual where it allows/encourages the use of x instead of w, but I admit I > haven't read through all of the 14568 pages. Yes, you've got a point, but it's always worked. Is this a macos thing, maybe? > So I'm stuck for now. What 'as' are you using? Latest binutils, today. I checked it out half an hour ago. GNU assembler (GNU Binutils) 2.44.50.20250225 Copyright (C) 2025 Free Software Foundation, Inc. Try this: diff --git a/test/hotspot/gtest/aarch64/aarch64-asmtest.py b/test/hotspot/gtest/aarch64/aarch64-asmtest.py index 9c770632e25..b1674fff04d 100644 --- a/test/hotspot/gtest/aarch64/aarch64-asmtest.py +++ b/test/hotspot/gtest/aarch64/aarch64-asmtest.py @@ -476,8 +476,13 @@ class AddSubExtendedOp(ThreeRegInstruction): + ", " + str(self.amount) + ");")) def astr(self): - return (super(AddSubExtendedOp, self).astr() - + (", " + AddSubExtendedOp.optNames[self.option] + prefix = self.asmRegPrefix + return (super(ThreeRegInstruction, self).astr() + + ('%s, %s, %s' + % (self.reg[0].astr(prefix), + self.reg[1].astr(prefix), + self.reg[1].astr("w")) + + ", " + AddSubExtendedOp.optNames[self.option] + " #" + str(self.amount))) class AddSubImmOp(TwoRegImmedInstruction): ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1969760509 From aph at openjdk.org Tue Feb 25 13:19:03 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 25 Feb 2025 13:19:03 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: References: <1yB95sOajuS5ptFI0GQWLepii5JsZ9DOsje-TEFyFYs=.a325ad18-17ed-4e77-b1e3-0bad2cf55c67@github.com> Message-ID: On Tue, 25 Feb 2025 13:14:52 GMT, Andrew Haley wrote: >> @theRealAlph, maybe we are not reading the same manual (ARM DDI 0487K.a). 
In my copy: >> SUB (extended register) is defined as >> SUB , , {, {#}} >> and should be W when is SXTH >> and the as I have enforces this: >> >> ferakocz at ferakocz-mac aarch64 % cat t.s >> sub x1, x10, w23, sxth #2 >> ferakocz at ferakocz-mac aarch64 % cat > t1.s >> sub x1, x10, x23, sxth #2 >> ferakocz at ferakocz-mac aarch64 % cat t.s >> sub x1, x10, w23, sxth #2 >> ferakocz at ferakocz-mac aarch64 % cat t1.s >> sub x1, x10, x23, sxth #2 >> ferakocz at ferakocz-mac aarch64 % as --version >> Apple clang version 16.0.0 (clang-1600.0.26.6) >> Target: arm64-apple-darwin24.3.0 >> Thread model: posix >> InstalledDir: /Library/Developer/CommandLineTools/usr/bin >> ferakocz at ferakocz-mac aarch64 % as t.s >> ferakocz at ferakocz-mac aarch64 % objdump -D t.o >> >> t.o: file format mach-o arm64 >> >> Disassembly of section __TEXT,__text: >> >> 0000000000000000 : >> 0: cb37a941 sub x1, x10, w23, sxth #2 >> ferakocz at ferakocz-mac aarch64 % as t1.s >> t1.s:1:19: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4] >> sub x1, x10, x23, sxth #2 >> ^ >> >> I have not found the place in the manual where it allows/encourages the use of x instead of w, but I admit I haven't read through all of the 14568 pages. >> >> So I'm stuck for now. What 'as' are you using? > >> I have not found the place in the manual where it allows/encourages the use of x instead of w, but I admit I > haven't read through all of the 14568 pages. > > Yes, you've got a point, but it's always worked. Is this a macos thing, maybe? > >> So I'm stuck for now. What 'as' are you using? > > Latest binutils, today. I checked it out half an hour ago. > > GNU assembler (GNU Binutils) 2.44.50.20250225 > Copyright (C) 2025 Free Software Foundation, Inc. 
> > Try this: > > > diff --git a/test/hotspot/gtest/aarch64/aarch64-asmtest.py b/test/hotspot/gtest/aarch64/aarch64-asmtest.py > index 9c770632e25..b1674fff04d 100644 > --- a/test/hotspot/gtest/aarch64/aarch64-asmtest.py > +++ b/test/hotspot/gtest/aarch64/aarch64-asmtest.py > @@ -476,8 +476,13 @@ class AddSubExtendedOp(ThreeRegInstruction): > + ", " + str(self.amount) + ");")) > > def astr(self): > - return (super(AddSubExtendedOp, self).astr() > - + (", " + AddSubExtendedOp.optNames[self.option] > + prefix = self.asmRegPrefix > + return (super(ThreeRegInstruction, self).astr() > + + ('%s, %s, %s' > + % (self.reg[0].astr(prefix), > + self.reg[1].astr(prefix), > + self.reg[1].astr("w")) > + + ", " + AddSubExtendedOp.optNames[self.option] > + " #" + str(self.amount))) > > class AddSubImmOp(TwoRegImmedInstruction): I just tried it with top-of trunk latest binutils: fedora:aarch64 $ ~/binutils-gdb-install/bin/as -march=armv9-a+sha3+sve2-bitperm aarch64ops.s fedora:aarch64 $ ~/binutils-gdb-install/bin/as --version GNU assembler (GNU Binutils) 2.44.50.20250225 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1969761898 From dholmes at openjdk.org Tue Feb 25 13:19:16 2025 From: dholmes at openjdk.org (David Holmes) Date: Tue, 25 Feb 2025 13:19:16 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists In-Reply-To: References: Message-ID: On Mon, 3 Feb 2025 16:29:25 GMT, Fredrik Bredberg wrote: > I've combined two `ObjectMonitor`'s lists, `EntryList` and `cxq`, into one list. The `entry_list`. > > This way c2 no longer has to check both `EntryList` and `cxq` in order to opt out if the "conceptual entry list" is empty, which also means that the constant question about if it's safe to first check the `EntryList` and then `cxq` will be a thing of the past. > > In the current multi-queue design new threads where always added to the `cxq`, then `ObjectMonitor::exit` would choose a successor from the head of `EntryList`. 
When the `EntryList` was empty and `cxq` was not, `ObjectMonitor::exit` whould detached the singly linked `cxq` list, and add the elements to the doubly linked `EntryList`. The element that was first added to `cxq` whould be at the tail of the `EntryList`. This way you ended up working through the contending threads in LIFO-chunks. > > The new list-design is as much a multi-queue as the current. Conceptually it can be looked upon as if the old singly linked `cxq` list doesn't end with a null pointer, but instead has a link that points to the head of the doubly linked `entry_list`. > > You always add to the `entry_list` by Compare And Exchange to the head. The most common case is that you remove from the tail (the successor is chosen in strict FIFO order). The head is volatile, but the interior is stable. > > The first contending thread that "pushes" itself onto `entry_list`, will be the last thread in the list. Each newly pushed thread in `entry_list` will be linked trough its next pointer, and have its prev pointer set to null, thus pushing new threads onto `entry_list` will form a singly linked list. The list is always in the right order (via the next-pointers) and is never moved to another list. > > Since we choose the successor in FIFO order, the exiting thread needs to find the tail of the `entry_list`. This is done by walking from the `entry_list` head. While walking the list we assign the prev pointers of each thread, essentially forming a doubly linked list. The tail pointer is cached in `entry_list_tail` so that we don't need to walk from the `entry_list` head each time we need to find the tail (successor). > > Performance wise the new design seems to be equal to the old design, even though c2 generates two less instructions per monitor unlock operation. > > However the complexity of the source has been reduced by removing the `TS_CXQ` state and adding functions instead of inlining `cmpxchg` here and there, and the fact that c2 no longer has to check b... 
src/hotspot/share/runtime/objectMonitor.cpp line 704: > 702: > 703: for (;;) { > 704: ObjectWaiter* front = Atomic::load_acquire(&_entry_list); Technically you don't need a load_acquire here because you do not access any members of front before hitting the cmpxchg that gives you a full fence.. For good code hygiene Atomic::load would suffice. src/hotspot/share/runtime/objectMonitor.cpp line 723: > 721: > 722: for (;;) { > 723: ObjectWaiter* front = Atomic::load_acquire(&_entry_list); Technically you don't need a `load_acquire` here because you do not access any members of `front` before hitting the cmpxchg that gives you a full fence.. For good code hygiene `Atomic::load` would suffice. src/hotspot/share/runtime/objectMonitor.cpp line 1264: > 1262: return w; > 1263: } > 1264: w = Atomic::load_acquire(&_entry_list); Suggestion: // Need acquire here to match the implicit release of the cmpxchg that updated _entry_list, so we // can access w->_next. w = Atomic::load_acquire(&_entry_list); src/hotspot/share/runtime/objectMonitor.cpp line 1303: > 1301: // Check if we are unlinking the last element in the _entry_list. > 1302: // This is by far the most common case. > 1303: if (currentNode->_next == nullptr) { The direct checks of `_next` and _prev` for null/non-null do not work with your use of `set_bad_pointers`. If you actually intend to keep `set_bad_pointers` in the final code then you should be using accessors e.g. ObjectWaiter* next() { assert (_next != 0xBAD, "corrupted list!"); return _next; } src/hotspot/share/runtime/objectMonitor.cpp line 1306: > 1304: assert(_entry_list_tail == nullptr || _entry_list_tail == currentNode, "invariant"); > 1305: > 1306: ObjectWaiter* v = Atomic::load_acquire(&_entry_list); Again technically you do not need `load_acquire` here because you do not access any fields of `v` when `v` could be other than the current node. `Atomic::load` will suffice. 
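[Editor's note] The acquire-vs-plain-load rule in the review comments above can be illustrated with standard C++ atomics. This is a hypothetical sketch: `std::atomic` stands in for HotSpot's Atomic class, and the names (`publish`, `read_payload`, `head_is`) are invented for illustration.

```cpp
#include <atomic>
#include <cassert>

// Sketch of when an acquire load matters: only when the loaded pointer is
// dereferenced. A plain/relaxed load suffices when the value is merely
// compared before a CAS that fences anyway.
struct Node { int payload; Node* next; };

std::atomic<Node*> list_head{nullptr};

// Publisher: initialize the node, then release-store the pointer so the
// initialization is visible to an acquire-loading reader.
void publish(Node* n) {
  n->payload = 42;
  list_head.store(n, std::memory_order_release);
}

// Consumer that dereferences: must pair with the release via acquire.
int read_payload() {
  Node* n = list_head.load(std::memory_order_acquire);
  return n != nullptr ? n->payload : -1;
}

// Consumer that only compares the pointer (e.g. before a cmpxchg):
// a relaxed load is enough, mirroring "Atomic::load would suffice".
bool head_is(Node* expected) {
  return list_head.load(std::memory_order_relaxed) == expected;
}
```

The design point of the review is code hygiene: using the weaker load documents that no field of the loaded object is accessed on that path.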
src/hotspot/share/runtime/objectMonitor.cpp line 1315: > 1313: } > 1314: // The CAS above can fail from interference IFF a contending > 1315: // thread "pushed" itself onto entry_list. Suggestion: // The CAS above can fail from interference IFF a contending // thread "pushed" itself onto entry_list. So fall-through to // building the doubly-linked list. assert(currentNode->prev == nullptr, "invariant"); src/hotspot/share/runtime/objectMonitor.cpp line 1334: > 1332: } > 1333: > 1334: assert(currentNode->_next != nullptr, "invariant"); Suggestion: else { // currentNode->_next != nullptr // If we get here it means the current thread enqueued itself on the EntryList but was then able to // "steal" the lock before the chosen successor was able to. Consequently currentNode must be an // interior node in the EntryList, or the head. src/hotspot/share/runtime/objectMonitor.cpp line 1337: > 1335: assert(currentNode != _entry_list_tail, "invariant"); > 1336: > 1337: if (currentNode->_prev == nullptr) { Suggestion: // Check if we are in the singly-linked portion of the EntryList. If we are the head then we try to remove // ourselves, else we convert to the doubly-linked list. if (currentNode->_prev == nullptr) { src/hotspot/share/runtime/objectMonitor.cpp line 1347: > 1345: // else we convert to the doubly-linked list. > 1346: if (currentNode->_prev == nullptr) { > 1347: ObjectWaiter* v = Atomic::load_acquire(&_entry_list); Again no `load_acquire` needed. src/hotspot/share/runtime/objectMonitor.cpp line 1352: > 1350: // The CAS above can fail from interference IFF a contending > 1351: // thread "pushed" itself onto entry_list, in which case > 1352: // currentNode must now be in the interior of the list. Suggestion: // currentNode must now be in the interior of the list. Fall-through // to building the doubly-linked list. 
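[Editor's note] The unlink cases walked through in the comments above (the tail, the head of the singly-linked portion, an interior node) can be sketched single-threaded. This is a hypothetical model, not the HotSpot code — in particular it omits the cmpxchg needed to remove the head under contention and assumes prev pointers are already assigned.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical single-threaded sketch of the three unlink cases; field
// names mirror the PR text but this is not the HotSpot implementation.
struct W { W* _next = nullptr; W* _prev = nullptr; };

struct List {
  W* _entry_list = nullptr;       // head: most recently pushed
  W* _entry_list_tail = nullptr;  // cached tail: FIFO successor

  void unlink(W* n) {
    if (n->_next == nullptr) {            // case 1: n is the tail
      _entry_list_tail = n->_prev;
      if (n->_prev == nullptr) {
        _entry_list = nullptr;            // n was also the head (last node)
      } else {
        n->_prev->_next = nullptr;
      }
    } else if (n->_prev == nullptr) {     // case 2: head of the list
      _entry_list = n->_next;
      n->_next->_prev = nullptr;
    } else {                              // case 3: interior node
      n->_prev->_next = n->_next;
      n->_next->_prev = n->_prev;
    }
  }
};
```

Case 1 is the "by far the most common case" called out in the code under review, since the successor is chosen from the tail.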
src/hotspot/share/runtime/objectMonitor.cpp line 1353: > 1351: // thread "pushed" itself onto entry_list, in which case > 1352: // currentNode must now be in the interior of the list. > 1353: assert(_entry_list != currentNode, "invariant"); Not sure you really need this. The fact the cmpxchg failed means we can't be the head of the list. Also by reading it again you are potentially finding a different head to that which existed when the cmpxchg failed. src/hotspot/share/runtime/objectMonitor.cpp line 1362: > 1360: } > 1361: > 1362: // We now assume we are unlinking currentNode from the interior of a Suggestion: // We now know we are unlinking currentNode from the interior of a src/hotspot/share/runtime/objectMonitor.cpp line 1534: > 1532: ObjectWaiter* w = nullptr; > 1533: > 1534: w = _entry_list; Use `Atomic::load` for consistency and good code hygiene. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1962360900 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1962359972 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1962364788 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1957707916 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1962368696 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1957692735 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1957696030 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1957698728 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1962370002 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1957699877 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1957701253 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1957701596 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1962372883 From fbredberg at openjdk.org Tue Feb 25 13:19:11 2025 From: 
fbredberg at openjdk.org (Fredrik Bredberg) Date: Tue, 25 Feb 2025 13:19:11 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists In-Reply-To: References: Message-ID: On Fri, 7 Feb 2025 19:17:24 GMT, Coleen Phillimore wrote: >> I've combined two `ObjectMonitor`'s lists, `EntryList` and `cxq`, into one list. The `entry_list`. >> >> This way c2 no longer has to check both `EntryList` and `cxq` in order to opt out if the "conceptual entry list" is empty, which also means that the constant question about if it's safe to first check the `EntryList` and then `cxq` will be a thing of the past. >> >> In the current multi-queue design new threads where always added to the `cxq`, then `ObjectMonitor::exit` would choose a successor from the head of `EntryList`. When the `EntryList` was empty and `cxq` was not, `ObjectMonitor::exit` whould detached the singly linked `cxq` list, and add the elements to the doubly linked `EntryList`. The element that was first added to `cxq` whould be at the tail of the `EntryList`. This way you ended up working through the contending threads in LIFO-chunks. >> >> The new list-design is as much a multi-queue as the current. Conceptually it can be looked upon as if the old singly linked `cxq` list doesn't end with a null pointer, but instead has a link that points to the head of the doubly linked `entry_list`. >> >> You always add to the `entry_list` by Compare And Exchange to the head. The most common case is that you remove from the tail (the successor is chosen in strict FIFO order). The head is volatile, but the interior is stable. >> >> The first contending thread that "pushes" itself onto `entry_list`, will be the last thread in the list. Each newly pushed thread in `entry_list` will be linked trough its next pointer, and have its prev pointer set to null, thus pushing new threads onto `entry_list` will form a singly linked list. The list is always in the right order (via the next-pointers) and is never moved to another list. 
>> >> Since we choose the successor in FIFO order, the exiting thread needs to find the tail of the `entry_list`. This is done by walking from the `entry_list` head. While walking the list we assign the prev pointers of each thread, essentially forming a doubly linked list. The tail pointer is cached in `entry_list_tail` so that we don't need to walk from the `entry_list` head each time we need to find the tail (successor). >> >> Performance wise the new design seems to be equal to the old design, even though c2 generates two less instructions per monitor unlock operation. >> >> However the complexity of the source has been reduced by removing the `TS_CXQ` state and adding functions instead of inlining `cmpxchg` here and there, and the fac... > > src/hotspot/share/jvmci/vmStructs_jvmci.cpp line 332: > >> 330: volatile_nonstatic_field(ObjectMonitor, _owner, int64_t) \ >> 331: volatile_nonstatic_field(ObjectMonitor, _recursions, intptr_t) \ >> 332: volatile_nonstatic_field(ObjectMonitor, _EntryListTail, ObjectWaiter*) \ > > You may need to coordinate with @mur47x111 to see what graal does with this field. I suspect the graal code also checks both ctx and EntryList in the unlock fast path and now only needs to check _EntryList. In which case we don't need to export EntryListTail. Thanks for the heads up @coleenp . I was planing on contacting the Graal team when this PR gets closer to getting integrated. I'll delete the `_EntryListTail` export, and make sure to ask for a review from @mur47x111 when that time comes. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1949002357 From yzheng at openjdk.org Tue Feb 25 13:19:11 2025 From: yzheng at openjdk.org (Yudi Zheng) Date: Tue, 25 Feb 2025 13:19:11 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists In-Reply-To: References: Message-ID: On Fri, 7 Feb 2025 19:17:24 GMT, Coleen Phillimore wrote: >> I've combined two `ObjectMonitor`'s lists, `EntryList` and `cxq`, into one list. 
The `entry_list`. >> >> This way c2 no longer has to check both `EntryList` and `cxq` in order to opt out if the "conceptual entry list" is empty, which also means that the constant question about if it's safe to first check the `EntryList` and then `cxq` will be a thing of the past. >> >> In the current multi-queue design new threads where always added to the `cxq`, then `ObjectMonitor::exit` would choose a successor from the head of `EntryList`. When the `EntryList` was empty and `cxq` was not, `ObjectMonitor::exit` whould detached the singly linked `cxq` list, and add the elements to the doubly linked `EntryList`. The element that was first added to `cxq` whould be at the tail of the `EntryList`. This way you ended up working through the contending threads in LIFO-chunks. >> >> The new list-design is as much a multi-queue as the current. Conceptually it can be looked upon as if the old singly linked `cxq` list doesn't end with a null pointer, but instead has a link that points to the head of the doubly linked `entry_list`. >> >> You always add to the `entry_list` by Compare And Exchange to the head. The most common case is that you remove from the tail (the successor is chosen in strict FIFO order). The head is volatile, but the interior is stable. >> >> The first contending thread that "pushes" itself onto `entry_list`, will be the last thread in the list. Each newly pushed thread in `entry_list` will be linked trough its next pointer, and have its prev pointer set to null, thus pushing new threads onto `entry_list` will form a singly linked list. The list is always in the right order (via the next-pointers) and is never moved to another list. >> >> Since we choose the successor in FIFO order, the exiting thread needs to find the tail of the `entry_list`. This is done by walking from the `entry_list` head. While walking the list we assign the prev pointers of each thread, essentially forming a doubly linked list. 
The tail pointer is cached in `entry_list_tail` so that we don't need to walk from the `entry_list` head each time we need to find the tail (successor). >> >> Performance wise the new design seems to be equal to the old design, even though c2 generates two less instructions per monitor unlock operation. >> >> However the complexity of the source has been reduced by removing the `TS_CXQ` state and adding functions instead of inlining `cmpxchg` here and there, and the fac... > > src/hotspot/share/jvmci/vmStructs_jvmci.cpp line 332: > >> 330: volatile_nonstatic_field(ObjectMonitor, _owner, int64_t) \ >> 331: volatile_nonstatic_field(ObjectMonitor, _recursions, intptr_t) \ >> 332: volatile_nonstatic_field(ObjectMonitor, _EntryListTail, ObjectWaiter*) \ > > You may need to coordinate with @mur47x111 to see what graal does with this field. I suspect the graal code also checks both ctx and EntryList in the unlock fast path and now only needs to check _EntryList. In which case we don't need to export EntryListTail. Indeed. You may delete this export and I will make the Graal side changes accordingly at [MonitorSnippets.java#L680](https://github.com/oracle/graal/blob/3d543641b056fdaa8e7444f09615067f8d766f6e/compiler/src/jdk.graal.compiler/src/jdk/graal/compiler/hotspot/replacements/MonitorSnippets.java#L680) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1948809912 From fbredberg at openjdk.org Tue Feb 25 13:19:16 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Tue, 25 Feb 2025 13:19:16 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists In-Reply-To: References: Message-ID: On Wed, 19 Feb 2025 20:55:28 GMT, David Holmes wrote: >> I've combined two `ObjectMonitor`'s lists, `EntryList` and `cxq`, into one list. The `entry_list`. 
>> >> This way c2 no longer has to check both `EntryList` and `cxq` in order to opt out if the "conceptual entry list" is empty, which also means that the constant question about if it's safe to first check the `EntryList` and then `cxq` will be a thing of the past. >> >> In the current multi-queue design new threads where always added to the `cxq`, then `ObjectMonitor::exit` would choose a successor from the head of `EntryList`. When the `EntryList` was empty and `cxq` was not, `ObjectMonitor::exit` whould detached the singly linked `cxq` list, and add the elements to the doubly linked `EntryList`. The element that was first added to `cxq` whould be at the tail of the `EntryList`. This way you ended up working through the contending threads in LIFO-chunks. >> >> The new list-design is as much a multi-queue as the current. Conceptually it can be looked upon as if the old singly linked `cxq` list doesn't end with a null pointer, but instead has a link that points to the head of the doubly linked `entry_list`. >> >> You always add to the `entry_list` by Compare And Exchange to the head. The most common case is that you remove from the tail (the successor is chosen in strict FIFO order). The head is volatile, but the interior is stable. >> >> The first contending thread that "pushes" itself onto `entry_list`, will be the last thread in the list. Each newly pushed thread in `entry_list` will be linked trough its next pointer, and have its prev pointer set to null, thus pushing new threads onto `entry_list` will form a singly linked list. The list is always in the right order (via the next-pointers) and is never moved to another list. >> >> Since we choose the successor in FIFO order, the exiting thread needs to find the tail of the `entry_list`. This is done by walking from the `entry_list` head. While walking the list we assign the prev pointers of each thread, essentially forming a doubly linked list. 
The tail pointer is cached in `entry_list_tail` so that we don't need to walk from the `entry_list` head each time we need to find the tail (successor). >> >> Performance-wise the new design seems to be equal to the old design, even though c2 generates two fewer instructions per monitor unlock operation. >> >> However, the complexity of the source has been reduced by removing the `TS_CXQ` state and adding functions instead of inlining `cmpxchg` here and there, and the fac... > > src/hotspot/share/runtime/objectMonitor.cpp line 704: > >> 702: >> 703: for (;;) { >> 704: ObjectWaiter* front = Atomic::load_acquire(&_entry_list); > > Technically you don't need a load_acquire here because you do not access any members of front before hitting the cmpxchg that gives you a full fence. For good code hygiene Atomic::load would suffice. Fixed > src/hotspot/share/runtime/objectMonitor.cpp line 723: > >> 721: >> 722: for (;;) { >> 723: ObjectWaiter* front = Atomic::load_acquire(&_entry_list); > > Technically you don't need a `load_acquire` here because you do not access any members of `front` before hitting the cmpxchg that gives you a full fence. For good code hygiene `Atomic::load` would suffice. Fixed > src/hotspot/share/runtime/objectMonitor.cpp line 1264: > >> 1262: return w; >> 1263: } >> 1264: w = Atomic::load_acquire(&_entry_list); > > Suggestion: > > // Need acquire here to match the implicit release of the cmpxchg that updated _entry_list, so we > // can access w->_next. > w = Atomic::load_acquire(&_entry_list); Fixed > src/hotspot/share/runtime/objectMonitor.cpp line 1303: > >> 1301: // Check if we are unlinking the last element in the _entry_list. >> 1302: // This is by far the most common case. >> 1303: if (currentNode->_next == nullptr) { > > The direct checks of `_next` and `_prev` for null/non-null do not work with your use of `set_bad_pointers`. If you actually intend to keep `set_bad_pointers` in the final code then you should be using accessors e.g.
> > ObjectWaiter* next() { > assert (_next != 0xBAD, "corrupted list!"); > return _next; > } Fixed > src/hotspot/share/runtime/objectMonitor.cpp line 1306: > >> 1304: assert(_entry_list_tail == nullptr || _entry_list_tail == currentNode, "invariant"); >> 1305: >> 1306: ObjectWaiter* v = Atomic::load_acquire(&_entry_list); > > Again technically you do not need `load_acquire` here because you do not access any fields of `v` when `v` could be other than the current node. `Atomic::load` will suffice. Fixed > src/hotspot/share/runtime/objectMonitor.cpp line 1315: > >> 1313: } >> 1314: // The CAS above can fail from interference IFF a contending >> 1315: // thread "pushed" itself onto entry_list. > > Suggestion: > > // The CAS above can fail from interference IFF a contending > // thread "pushed" itself onto entry_list. So fall-through to > // building the doubly-linked list. > assert(currentNode->prev == nullptr, "invariant"); Fixed > src/hotspot/share/runtime/objectMonitor.cpp line 1334: > >> 1332: } >> 1333: >> 1334: assert(currentNode->_next != nullptr, "invariant"); > > Suggestion: > > else { // currentNode->_next != nullptr > > // If we get here it means the current thread enqueued itself on the EntryList but was then able to > // "steal" the lock before the chosen successor was able to. Consequently currentNode must be an > // interior node in the EntryList, or the head. Added the comment but left out the suggested "else" and kept the assert. I know that the if statement above always ends in a return, but if that is changed this feels safer. > src/hotspot/share/runtime/objectMonitor.cpp line 1337: > >> 1335: assert(currentNode != _entry_list_tail, "invariant"); >> 1336: >> 1337: if (currentNode->_prev == nullptr) { > > Suggestion: > > // Check if we are in the singly-linked portion of the EntryList. If we are the head then we try to remove > // ourselves, else we convert to the doubly-linked list. 
> if (currentNode->_prev == nullptr) { Fixed > src/hotspot/share/runtime/objectMonitor.cpp line 1347: > >> 1345: // else we convert to the doubly-linked list. >> 1346: if (currentNode->_prev == nullptr) { >> 1347: ObjectWaiter* v = Atomic::load_acquire(&_entry_list); > > Again no `load_acquire` needed. Fixed > src/hotspot/share/runtime/objectMonitor.cpp line 1352: > >> 1350: // The CAS above can fail from interference IFF a contending >> 1351: // thread "pushed" itself onto entry_list, in which case >> 1352: // currentNode must now be in the interior of the list. > > Suggestion: > > // currentNode must now be in the interior of the list. Fall-through > // to building the doubly-linked list. Fixed > src/hotspot/share/runtime/objectMonitor.cpp line 1353: > >> 1351: // thread "pushed" itself onto entry_list, in which case >> 1352: // currentNode must now be in the interior of the list. >> 1353: assert(_entry_list != currentNode, "invariant"); > > Not sure you really need this. The fact that the cmpxchg failed means we can't be the head of the list. Also by reading it again you are potentially finding a different head to that which existed when the cmpxchg failed. You are right I don't really need it, but sometimes I feel that comments can rot, but asserts can't. I guess I put this one in so that it's easier to see what state the currentNode is in (not head) without reading through the logic that ends up in the else-statement. > src/hotspot/share/runtime/objectMonitor.cpp line 1362: > >> 1360: } >> 1361: >> 1362: // We now assume we are unlinking currentNode from the interior of a > > Suggestion: > > // We now know we are unlinking currentNode from the interior of a Fixed > src/hotspot/share/runtime/objectMonitor.cpp line 1534: > >> 1532: ObjectWaiter* w = nullptr; >> 1533: >> 1534: w = _entry_list; > > Use `Atomic::load` for consistency and good code hygiene.
Fixed ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1963144747 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1963135003 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1963050591 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1967825628 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1963136473 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1963132242 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1961646077 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1961647021 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1963137807 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1969341568 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1961659147 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1963133824 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1963141844 From aph at openjdk.org Tue Feb 25 13:40:57 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 25 Feb 2025 13:40:57 GMT Subject: RFR: 8345125: Aarch64: Add aarch64 backend for Float16 scalar operations In-Reply-To: References: Message-ID: On Mon, 24 Feb 2025 12:09:57 GMT, Bhavana Kilambi wrote: > This patch adds aarch64 backend for scalar FP16 operations namely - add, subtract, multiply, divide, fma, sqrt, min and max. test/hotspot/gtest/aarch64/aarch64-asmtest.py line 19: > 17: 0x7e0, 0xfc0, 0x1f80, 0x3ff0, 0x7e00, 0x8000, > 18: 0x81ff, 0xc1ff, 0xc003, 0xc7ff, 0xdfff, 0xe03f, > 19: 0xe1ff, 0xf801, 0xfc00, 0xfc07, 0xff03, 0xfffe] So here you've deleted the duplicated `0x7e00` (good) but also the not-duplicated `0xe10f`. Is `0xe10f` not valid? 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23748#discussion_r1969800950 From aph at openjdk.org Tue Feb 25 13:46:57 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 25 Feb 2025 13:46:57 GMT Subject: RFR: 8345125: Aarch64: Add aarch64 backend for Float16 scalar operations In-Reply-To: References: Message-ID: On Mon, 24 Feb 2025 12:09:57 GMT, Bhavana Kilambi wrote: > This patch adds aarch64 backend for scalar FP16 operations namely - add, subtract, multiply, divide, fma, sqrt, min and max. Overall, this looks like a great piece of work. I only have a few changes in comments and a question, then we're good to go. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23748#issuecomment-2682030036 From aph at openjdk.org Tue Feb 25 13:52:55 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 25 Feb 2025 13:52:55 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: References: <1yB95sOajuS5ptFI0GQWLepii5JsZ9DOsje-TEFyFYs=.a325ad18-17ed-4e77-b1e3-0bad2cf55c67@github.com> Message-ID: On Tue, 25 Feb 2025 13:15:49 GMT, Andrew Haley wrote: >>> I have not found the place in the manual where it allows/encourages the use of x instead of w, but I admit I > haven't read through all of the 14568 pages. >> >> Yes, you've got a point, but it's always worked. Is this a macos thing, maybe? >> >>> So I'm stuck for now. What 'as' are you using? >> >> Latest binutils, today. I checked it out half an hour ago. >> >> GNU assembler (GNU Binutils) 2.44.50.20250225 >> Copyright (C) 2025 Free Software Foundation, Inc.
>> >> Try this: >> >> >> diff --git a/test/hotspot/gtest/aarch64/aarch64-asmtest.py b/test/hotspot/gtest/aarch64/aarch64-asmtest.py >> index 9c770632e25..b1674fff04d 100644 >> --- a/test/hotspot/gtest/aarch64/aarch64-asmtest.py >> +++ b/test/hotspot/gtest/aarch64/aarch64-asmtest.py >> @@ -476,8 +476,13 @@ class AddSubExtendedOp(ThreeRegInstruction): >> + ", " + str(self.amount) + ");")) >> >> def astr(self): >> - return (super(AddSubExtendedOp, self).astr() >> - + (", " + AddSubExtendedOp.optNames[self.option] >> + prefix = self.asmRegPrefix >> + return (super(ThreeRegInstruction, self).astr() >> + + ('%s, %s, %s' >> + % (self.reg[0].astr(prefix), >> + self.reg[1].astr(prefix), >> + self.reg[1].astr("w")) >> + + ", " + AddSubExtendedOp.optNames[self.option] >> + " #" + str(self.amount))) >> >> class AddSubImmOp(TwoRegImmedInstruction): > > I just tried it with top-of trunk latest binutils: > > fedora:aarch64 $ ~/binutils-gdb-install/bin/as -march=armv9-a+sha3+sve2-bitperm aarch64ops.s > fedora:aarch64 $ ~/binutils-gdb-install/bin/as --version > GNU assembler (GNU Binutils) 2.44.50.20250225 Aha! aph at Andrews-MacBook-Pro ~ % as t.s t.s:1:19: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4] sub x1, x10, x23, sxth #2 ^ aph at Andrews-MacBook-Pro ~ % as --version Apple clang version 16.0.0 (clang-1600.0.26.6) Target: arm64-apple-darwin24.3.0 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1969823700 From bkilambi at openjdk.org Tue Feb 25 13:55:58 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Tue, 25 Feb 2025 13:55:58 GMT Subject: RFR: 8345125: Aarch64: Add aarch64 backend for Float16 scalar operations In-Reply-To: References: Message-ID: On Tue, 25 Feb 2025 13:37:51 GMT, Andrew Haley wrote: >> This patch adds aarch64 backend for scalar FP16 operations namely - add, subtract, multiply, divide, fma, sqrt, min and max. 
> > test/hotspot/gtest/aarch64/aarch64-asmtest.py line 19: > >> 17: 0x7e0, 0xfc0, 0x1f80, 0x3ff0, 0x7e00, 0x8000, >> 18: 0x81ff, 0xc1ff, 0xc003, 0xc7ff, 0xdfff, 0xe03f, >> 19: 0xe1ff, 0xf801, 0xfc00, 0xfc07, 0xff03, 0xfffe] > > So here you've deleted the duplicated `0x7e00` (good) but also the not-duplicated `0xe10f`. Is `0xe10f` not valid? Hi, yes `0xe10f` does not seem to be valid. While I tried generating the `asmtest.out.h` I ran into errors with this value - aarch64ops.s:1105: Error: immediate out of range at operand 3 -- eor z6.h,z6.h,#0xe10f aarch64ops.s:1123: Error: immediate out of range at operand 3 -- eor z3.h,z3.h,#0xe10f So I looked it up here - https://gist.github.com/dinfuehr/51a01ac58c0b23e4de9aac313ed6a06a to see if this number is a legal immediate and it looks like it isn't. Maybe it's just chance that this number wasn't generated before as an immediate operand and these errors didn't show up until now. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23748#discussion_r1969827032 From galder at openjdk.org Tue Feb 25 14:57:05 2025 From: galder at openjdk.org (Galder Zamarreño) Date: Tue, 25 Feb 2025 14:57:05 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> Message-ID: On Fri, 7 Feb 2025 12:39:24 GMT, Galder Zamarreño wrote: >> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance. >> >> Currently vectorization does not kick in for loops containing either of these calls because of the following error: >> >> >> VLoop::check_preconditions: failed: control flow in loop not allowed >> >> >> The control flow is due to the Java implementation for these methods, e.g. >> >> >> public static long max(long a, long b) { >> return (a >= b) ?
a : b; >> } >> >> This patch intrinsifies the calls to replace the CmpL + Bool nodes with MaxL/MinL nodes respectively. >> By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization. >> E.g. >> >> >> SuperWord::transform_loop: >> Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt strip_mined >> 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21) >> >> >> Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after. Before the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1155 >> long max 1173 >> >> >> After the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1042 >> long max 1042 >> >> >> This patch does not add any platform-specific backend implementations for the MaxL/MinL nodes. >> Therefore, it still relies on the macro expansion to transform those into CMoveL. >> >> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results: >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PA... > > Galder Zamarreño has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains 44 additional commits since the last revision: > > - Merge branch 'master' into topic.intrinsify-max-min-long > - Fix typo > - Renaming methods and variables and add docu on algorithms > - Fix copyright years > - Make sure it runs with cpus with either avx512 or asimd > - Test can only run with 256 bit registers or bigger > > * Remove platform dependant check > and use platform independent configuration instead. > - Fix license header > - Tests should also run on aarch64 asimd=true envs > - Added comment around the assertions > - Adjust min/max identity IR test expectations after changes > - ... and 34 more: https://git.openjdk.org/jdk/compare/d6aa3453...a190ae68 > > The interesting thing is intReductionSimpleMin @ 100%. We see a regression there but I didn't observe it with the perfasm run. So, this could be due to variance in the application of cmov or not? > > I don't see the error / variance in the results you posted. Often I look at those, and if it is anywhere above 10% of the average, then I'm suspicious ;) > > Re: [#20098 (comment)](https://github.com/openjdk/jdk/pull/20098#issuecomment-2671144644) - I was trying to think what could be causing this. > > Maybe it is an issue with probabilities? Do you know at what point (if at all) the `MinI` node appears/disappears in that example? @eme64 I think you're in the right direction: minLongA = negate(maxLongA); minLongB = negate(maxLongB); minIntA = toInts(minLongA); minIntB = toInts(minLongB); To keep the same data distribution algorithm for both min and max operations, I started with positive numbers for max and found out that I could use the same data with the same properties for min by negating them. As you can see in the above snippet, the min values for ints had not been negated.
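For reference, the negation trick discussed above relies on the identity `min(a, b) == -max(-a, -b)`, which holds for any values whose negation does not overflow (i.e. excluding `Long.MIN_VALUE` in two's complement). A hypothetical standalone sketch, not the benchmark code:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// min(a, b) == -max(-a, -b) whenever -a and -b do not overflow,
// so a data set exercising max can simply be negated to exercise min
// with the same distribution properties.
int64_t min_via_max(int64_t a, int64_t b) {
  return -std::max(-a, -b);
}
```

This is why the int min data also has to be negated: taking the ints from the un-negated max data changes the distribution.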
I'll fix that and show final numbers with the same subset shown in https://github.com/openjdk/jdk/pull/20098#issuecomment-2671144644 ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2682263423 From tschatzl at openjdk.org Tue Feb 25 15:04:28 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 25 Feb 2025 15:04:28 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier Message-ID: Hi all, please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. ### Current situation With this change, G1 will reduce the post-write barrier so that it much more resembles Parallel GC's, as described in the JEP. The reason is that G1 lacks throughput compared to Parallel/Serial GC due to its larger barrier. The main reason for the current barrier is how G1 implements concurrent refinement: * G1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. * For correctness, dirty card updates require fine-grained synchronization between mutator and refinement threads, * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible.
These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo-code: // Filtering if (region(@x.a) == region(y)) goto done; // same region check if (y == null) goto done; // null value check if (card(@x.a) == young_card) goto done; // write to young gen check StoreLoad; // synchronize if (card(@x.a) == dirty_card) goto done; *card(@x.a) = dirty // Card tracking enqueue(card-address(@x.a)) into thread-local-dcq; if (thread-local-dcq is not full) goto done; call runtime to move thread-local-dcq into dcqs done: Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for Parallel and Serial GC. The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse-grained synchronization based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a second card table ("refinement table"). The second card table also replaces the dirty card queue. In that scheme the fine-grained synchronization is unnecessary because mutator and refinement threads always write to different memory areas (so no concurrent write that could lose an update can occur). This removes the necessity for synchronization for every reference write. Also, no card enqueuing is required anymore. Only the filters and the card mark remain. ### How this works In the beginning both the card table and the refinement table are completely unmarked (contain "clean" cards).
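As a rough standalone illustration of the reduced mutator write path (the filters plus a plain card mark, with no StoreLoad and no enqueue), here is a sketch. All constants, field names, and the layout are made up for illustration; the same-region filter is omitted because it needs region arithmetic not modeled here:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical model of a card-marking post-write barrier.
constexpr int kCardShift = 9;          // 512-byte cards
constexpr uint8_t kCleanCard = 0xff;
constexpr uint8_t kDirtyCard = 0x00;

struct Heap {
  uintptr_t base;        // start of the heap
  uint8_t* card_table;   // per-thread pointer in the real design,
                         // so the tables can be swapped atomically
};

// Post-barrier for x.a = y: two cheap filters, then a plain store.
// No StoreLoad fence and no enqueue into a dirty card queue.
inline void post_write_barrier(Heap& h, uintptr_t field_addr, uintptr_t y) {
  if (y == 0) return;                            // null value check
  size_t idx = (field_addr - h.base) >> kCardShift;
  if (h.card_table[idx] != kCleanCard) return;   // skip already non-clean cards
  h.card_table[idx] = kDirtyCard;                // unconditional card mark
}
```

The non-clean check doubles as the mechanism that preserves special card values (such as the to-collection-set marking described later), since the mutator never overwrites a card that is not clean.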
The mutator dirties the card table, until G1 heuristics think that a significant enough number of cards has been dirtied, based on what is allocated for scanning them during the garbage collection. At that point, the card table and the refinement table are exchanged "atomically" using handshakes. The mutator keeps dirtying the card table (the previously clean refinement table, which is now the card table), while the refinement threads look for and refine dirty cards on the refinement table as before. Refinement of cards is very similar to before: if an interesting reference in a dirty card has been found, G1 records it in appropriate remembered sets. In this implementation there is an exception for references to the current collection set (typically young gen) - the refinement threads redirty that card on the card table with a special `to-collection-set` value. This is valid because races with the mutator for that write do not matter - the entire card will eventually be rescanned anyway, regardless of whether it ends up as dirty or to-collection-set. The advantage of marking to-collection-set cards specially is that the next time the card tables are swapped, the refinement threads will not re-refine them, on the assumption that that reference to the collection set will not change. This decreases refinement work substantially. If refinement gets interrupted by GC, the refinement table will be merged with the card table before card scanning, which works as before. New barrier pseudo-code for an assignment `x.a = y`: // Filtering if (region(@x.a) == region(y)) goto done; // same region check if (y == null) goto done; // null value check if (card(@x.a) != clean_card) goto done; // skip already non-clean cards *card(@x.a) = dirty This is basically the Serial/Parallel GC barrier with additional filters to keep the number of dirty cards as small as possible. A few more comments about the barrier: * the barrier now loads the card table base offset from a thread local instead of inlining it.
This is necessary for this mechanism to work, as the card table to be dirtied changes over time, and it may even be faster on some architectures (code size); some architectures already do this. * all existing pre-filters were kept. Benchmarks showed some significant regressions with respect to pause times and even throughput compared to G1 in master. Using the Parallel GC barrier (just the dirty card write) would be possible, and further investigation on stripping parts will be made as follow-up. * the final check tests for non-clean cards to avoid overwriting existing cards, in particular the "to-collection set" cards described above. Current G1 marks the cards corresponding to young gen regions as all "young" so that the original barrier could potentially avoid the `StoreLoad`. This implementation removes this facility (which might be re-introduced later), but measurements showed that pre-dirtying the young generation region's cards as "dirty" (G1 does not need to use an extra "young" value) did not yield any measurable performance difference. ### Refinement process The goal of the refinement (threads) is to make sure that the number of cards to scan in the garbage collection is below a particular threshold. The prototype changes the refinement threads into a single control thread and a set of (refinement) worker threads. Unlike the previous implementation, the control thread does not do any refinement, but only executes the heuristics to start a calculated number of worker threads and tracks refinement progress. The refinement trigger is based on the currently known number of pending (i.e. dirty) cards on the card table and a pending card generation rate, fairly similarly to the previous algorithm. After the refinement control thread determines that it is time to do refinement, it starts the following sequence: 1) **Swap the card table**.
This consists of several steps: 1) **Swap the global card table** - the global card table pointer is swapped; newly created threads and runtime calls will eventually use the new values, at the latest after the next two steps. 2) **Update the pointers in all JavaThread**'s TLS storage to the new card table pointer using a handshake operation 3) **Update the pointers in the GC thread**'s TLS storage to the new card table pointer using the SuspendibleThreadSet mechanism 2) **Snapshot the heap** - determine the extent of work needed for all regions where the refinement threads need to do some work on the refinement table (the previous card table). The snapshot stores the work progress for each region so that work can be interrupted and continued at any time. This work either consists of refinement of the particular card (old generation regions) or clearing the cards (next collection set/young generation regions). 3) **Sweep the refinement table** by activating the refinement worker threads. The threads refine dirty cards using the heap snapshot where worker threads claim parts of regions to process. * Cards with references to the young generation are not added to the young generation's card-based remembered set. Instead these cards are marked as to-collection-set in the card table and any remaining refinement of that card skipped. * If refinement encounters a card that is already marked as to-collection-set it is not refined but re-marked as to-collection-set on the card table. * During refinement, the refinement table is also cleared (in bulk for collection set regions as they do not need any refinement, and in other regions as they are refined for the non-clean cards). * Dirty cards within unparsable heap areas are forwarded to/redirtied on the card table as is. 4) **Completion work**, mostly statistics. If the work is interrupted by a non-garbage collection synchronization point, work is suspended temporarily and resumed later using the heap snapshot.
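The card-table swap described above can be modeled as follows. This is an illustrative sketch only: the names are made up, and a plain loop over threads stands in for the handshake and SuspendibleThreadSet steps:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Each mutator caches the current card table pointer in thread-local
// storage; the swap updates the global pair and then publishes the new
// primary to every thread.
struct MutatorThread {
  std::atomic<uint8_t*> tls_card_table{nullptr};
};

struct CardTables {
  uint8_t* primary;      // mutators dirty this one
  uint8_t* refinement;   // refinement workers sweep this one
};

void swap_card_tables(CardTables& ct, std::vector<MutatorThread*>& threads) {
  std::swap(ct.primary, ct.refinement);  // swap the global pointers
  for (MutatorThread* t : threads) {     // publish to each thread (the
    t->tls_card_table.store(ct.primary,  // handshake stand-in)
                            std::memory_order_release);
  }
  // Mutators now dirty the previously clean table while refinement
  // sweeps the other one; the two never write to the same table, so no
  // per-store synchronization is required.
}
```

The key property this models is that after the swap completes, mutator and refinement threads operate on disjoint tables.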
After the refinement process the refinement table is all clean again and ready to be swapped again. ### Garbage collection pause changes Since a garbage collection (young or full GC) pause may occur at any point during the refinement process, the garbage collection needs some compensating work for the not yet swept parts of the refinement table. Note that this situation is very rare, and the heuristics try to avoid that, so in most cases nothing needs to be done as the refinement table is all clean. If this happens, young collections add a new phase called `Merge Refinement Table` in the garbage collection pause right before the `Merge Heap Roots` phase. This compensating phase does the following: 0) (Optional) Snapshot the heap if not done yet (if the process has been interrupted between steps 1 and 3 of the refinement process) 1) Merge the refinement table into the card table - in this step the dirty cards of interesting regions are merged into the card table 2) Completion work (statistics) If a full collection interrupts concurrent refinement, the refinement table is simply cleared and all dirty cards thrown away. A garbage collection generates new cards (e.g. references from promoted objects into the young generation) on the refinement table. This acts similarly to the extra DCQS used to record these interesting references/cards and redirty the card table using them in the previous implementation. G1 swaps the card tables at the end of the collection to keep the post-condition of the refinement table being all clean (and any to-be-refined cards on the card table) at the end of garbage collection. ### Performance metrics Following is an overview of the changes in behavior. Some numbers are provided in the CR in the first comment. #### Native memory usage The refinement table takes an additional 0.2% of the Java heap size of native memory compared to JDK 21 and above (in JDK 21 we removed one card table sized data structure, so this is a non-issue when updating from before).
Some of that additional memory usage is automatically reclaimed by removing the dirty card queues. Additional memory is reclaimed by managing the cards containing to-collection-set references on the card table, by dropping the explicit remembered sets for the young generation completely and any remembered set entries which would otherwise be duplicated into the other region's remembered sets. In some applications/benchmarks these gains completely offset the additional card table; however, most of the time this is not the case, particularly for throughput applications currently. It is possible to allocate the refinement table lazily, which means that since these applications often do not need any concurrent refinement, there is no overhead at all but actually a net reduction of native memory usage. This is not implemented in this prototype. #### Latency ("Pause times") Not affected or slightly better. Pause times decrease due to a shorter "Merge remembered sets" phase due to no work required for the remembered sets for the young generation - they are always already on the card table! However merging of the refinement table into the card table is extremely fast and is always faster than merging remembered sets for the young gen in my measurements. Since this work is linearly scanning some memory, this is embarrassingly parallel too. The cards created during garbage collection do not need to be redirtied, so that phase has also been removed. The card table swap is based on predictions for mutator card dirtying rate and refinement rate as before, and the policy is actually fairly similar to before. It is still rather aggressive, but in most cases takes fewer CPU resources than the one before, mostly because refining takes less CPU time. Many applications do not do any refinement at all, like before. More investigation could be done to improve this in the future.
#### Throughput This change always increases throughput in my measurements; depending on benchmark/application it may not actually show up in scores though. Due to the pre-barrier and the additional filters in the barrier, G1 is still slower than Parallel on raw throughput benchmarks, but is typically somewhere half-way to Parallel GC or closer. ### Platform support Since the post-write barrier changed, additional work for some platforms is required to allow this change to proceed. At this time all work for all platforms is done, but needs testing - GraalVM (contributed by the GraalVM team) - S390 (contributed by A. Kumar from IBM) - PPC (contributed by M. Doerr, from SAP) - ARM (should work, HelloWorld compiles and runs) - RISCV (should work, HelloWorld compiles and runs) - x86 (should work, build/HelloWorld compiles and runs) None of the above-mentioned platforms implement the barrier method to write cards for a reference array (aarch64 and x64 are fully implemented); they call the runtime as before. I believe it is doable fairly easily now with this simplified barrier for some extra performance, but not necessary. ### Alternatives The JEP text extensively discusses alternatives. ### Reviewing The change can be roughly divided into these fairly isolated parts * platform-specific changes to the barrier * refinement and refinement control thread changes; this is best reviewed starting from the `G1ConcurrentRefineThread::run_service` method * changes to garbage collection: `merge_refinement_table()` in `g1RemSet.cpp` * policy modifications are typically related to code around the calls to `G1Policy::record_dirtying_stats`. Further information is available in the [JEP draft](https://bugs.openjdk.org/browse/JDK-8340827); there is also a somewhat more extensive discussion of the change on my [blog](https://tschatzl.github.io/2025/02/21/new-write-barriers.html). Some additional comments: * the pre-marking of young generation cards has been removed.
Benchmarks did not show any significant difference either way. To me this makes some sense, because the entire young gen will quickly get marked anyway; i.e. one only saves a single additional card table write (for every card). With the old barrier the cost of a card table mark was much higher.
* G1 sets `UseCondCardMark` to true by default. The conditional card mark corresponds to the third filter in the write barrier now, and since I decided to keep all filters for this change, it makes sense to directly use this mechanism.

If there are any questions, feel free to ask.

Testing: tier1-7 (multiple tier1-7, tier1-8 with slightly older versions)

Thanks,
Thomas

------------- Commit messages:
- * only provide byte map base for JavaThreads
- * mdoerr review: fix comments in ppc code
- * fix crash when writing dirty cards for memory regions during card table switching
- * remove mention of "enqueue" or "enqueuing" for actions related to post barrier
- * remove some commented out debug code
- Card table as DCQ

Changes: https://git.openjdk.org/jdk/pull/23739/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8342382
Stats: 6543 lines in 103 files changed: 2162 ins; 3461 del; 920 mod
Patch: https://git.openjdk.org/jdk/pull/23739.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739
PR: https://git.openjdk.org/jdk/pull/23739

From mdoerr at openjdk.org Tue Feb 25 15:04:29 2025
From: mdoerr at openjdk.org (Martin Doerr)
Date: Tue, 25 Feb 2025 15:04:29 GMT
Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier
In-Reply-To: References: Message-ID: 

On Sun, 23 Feb 2025 18:53:33 GMT, Thomas Schatzl wrote:

> Hi all,
>
> please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier.
> > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. 
> > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... PPC64 code looks great! Thanks for doing this! Only some comments are no longer correct. src/hotspot/cpu/ppc/gc/g1/g1BarrierSetAssembler_ppc.cpp line 244: > 242: > 243: __ xorr(R0, store_addr, new_val); // tmp1 := store address ^ new value > 244: __ srdi_(R0, R0, G1HeapRegion::LogOfHRGrainBytes); // tmp1 := ((store address ^ new value) >> LogOfHRGrainBytes) Comment: R0 is used instead of tmp1 src/hotspot/cpu/ppc/gc/g1/g1BarrierSetAssembler_ppc.cpp line 259: > 257: > 258: __ ld(tmp1, G1ThreadLocalData::card_table_base_offset(), thread); > 259: __ srdi(tmp2, store_addr, CardTable::card_shift()); // tmp1 := card address relative to card table base Comment: tmp2 is used, here src/hotspot/cpu/ppc/gc/g1/g1BarrierSetAssembler_ppc.cpp line 261: > 259: __ srdi(tmp2, store_addr, CardTable::card_shift()); // tmp1 := card address relative to card table base > 260: if (UseCondCardMark) { > 261: __ lbzx(R0, tmp1, tmp2); // tmp1 := card address Can you remove the comment, please? It's wrong. 
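The `xorr`/`srdi_` pair in the quoted PPC code implements the same-region filter from the barrier pseudo code: XORing the store address with the new value and shifting right by the log of the region size yields zero exactly when both addresses fall into the same heap region. A small sketch of the computation (the region-size constant here is an illustrative assumption, not necessarily G1's actual `G1HeapRegion::LogOfHRGrainBytes`):

```cpp
#include <cstdint>

// Illustrative region size: 2 MB regions => log2 = 21. G1 chooses the real
// grain size based on heap size; this constant is an assumption for the sketch.
const unsigned kLogOfHRGrainBytes = 21;

// Same-region filter: the XOR cancels all common high bits, so shifting out
// the offset-within-region bits leaves zero iff both addresses share a region.
bool same_region(uintptr_t store_addr, uintptr_t new_val) {
  return ((store_addr ^ new_val) >> kLogOfHRGrainBytes) == 0;
}
```

This is why a single XOR plus a shift-and-test suffices in the assembly: no region lookup or division is needed, only bit arithmetic on the two addresses.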
------------- PR Review: https://git.openjdk.org/jdk/pull/23739#pullrequestreview-2637143540 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1967669777 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1967670850 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1967671593 From duke at openjdk.org Tue Feb 25 15:04:29 2025 From: duke at openjdk.org (Piotr Tarsa) Date: Tue, 25 Feb 2025 15:04:29 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier In-Reply-To: References: Message-ID: On Sun, 23 Feb 2025 18:53:33 GMT, Thomas Schatzl wrote: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. > > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. 
> > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... 
In this PR you've written:

if (region(@x.a) != region(y)) goto done; // same region check

but on https://tschatzl.github.io/2025/02/21/new-write-barriers.html you wrote:

(1) if (region(x.a) == region(y)) goto done; // Ignore references within the same region/area

I guess the second one is correct.

------------- PR Comment: https://git.openjdk.org/jdk/pull/23739#issuecomment-2677075290

From stuefe at openjdk.org Tue Feb 25 15:04:29 2025
From: stuefe at openjdk.org (Thomas Stuefe)
Date: Tue, 25 Feb 2025 15:04:29 GMT
Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier
In-Reply-To: References: Message-ID: 

On Sun, 23 Feb 2025 18:53:33 GMT, Thomas Schatzl wrote:

> Hi all,
>
> please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier.
>
> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25.
>
> ### Current situation
>
> With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier.
>
> The main reason for the current barrier is how g1 implements concurrent refinement:
> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations.
> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads,
> * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible.
> > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. > > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... @tschatzl I did not contribute the ppc port. Did you mean @TheRealMDoerr or @reinrich ? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23739#issuecomment-2677512780 From tschatzl at openjdk.org Tue Feb 25 15:13:43 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 25 Feb 2025 15:13:43 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v2] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. 
> > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. > > The main reason for the current barrier is how g1 implements concurrent refinement: > * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, > * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. 
> > There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: * remove unnecessarily added logging ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/0100d8e2..9ef9c5f4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=00-01 Stats: 4 lines in 4 files changed: 0 ins; 1 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From duke at openjdk.org Tue Feb 25 16:00:57 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Tue, 25 Feb 2025 16:00:57 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: References: <1yB95sOajuS5ptFI0GQWLepii5JsZ9DOsje-TEFyFYs=.a325ad18-17ed-4e77-b1e3-0bad2cf55c67@github.com> Message-ID: <_CekdxBJviS_sZCVN62_yFx-cTF4qrIuAnqbIeUmFck=.3a6afffb-8fbe-4809-a4ca-1bc22b52a628@github.com> On Tue, 25 Feb 2025 13:50:35 GMT, Andrew Haley wrote: >> I just tried it with top-of trunk latest binutils: >> >> fedora:aarch64 $ ~/binutils-gdb-install/bin/as -march=armv9-a+sha3+sve2-bitperm aarch64ops.s >> fedora:aarch64 $ ~/binutils-gdb-install/bin/as --version >> GNU assembler (GNU Binutils) 2.44.50.20250225 > > Aha! 
> > > aph at Andrews-MacBook-Pro ~ % as t.s > t.s:1:19: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4] > sub x1, x10, x23, sxth #2 > ^ > aph at Andrews-MacBook-Pro ~ % as --version > Apple clang version 16.0.0 (clang-1600.0.26.6) > Target: arm64-apple-darwin24.3.0 OK, so GNU as is more forgiving than Apple as... ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1970076152 From kvn at openjdk.org Tue Feb 25 17:32:02 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 25 Feb 2025 17:32:02 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v4] In-Reply-To: References: Message-ID: On Tue, 25 Feb 2025 09:27:13 GMT, Emanuel Peter wrote: >> Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below. >> >> **Background** >> >> With `-XX:+AlignVector`, all vector loads/stores must be aligned. We try to statically determine if we can always align the vectors. One condition is that the address `base` is already aligned. For arrays, we know that this always holds, because they are `ObjectAlignmentInBytes` aligned. But with native memory, the `base` is just some arbitrarily aligned pointer. >> >> **Problem** >> >> So far, we have just naively assumed that the `base` is always `ObjectAlignmentInBytes` aligned. But that does not hold for `native` memory segments: the `base` can also be unaligned. I had constructed such an example, and with `-XX:+AlignVector -XX:+VerifyAlignVector` this example hits the verification code. >> >> >> MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1); >> MemorySegment nativeUnaligned = nativeAligned.asSlice(1); >> test3(nativeUnaligned); >> >> >> When compiling the test method, we assume that the `nativeUnaligned.address()` is aligned - but it is not! 
>> >> static void test3(MemorySegment ms) { >> for (int i = 0; i < RANGE; i++) { >> long adr = i * 4L; >> int v = ms.get(ELEMENT_LAYOUT, adr); >> ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1)); >> } >> } >> >> >> **Solution: Runtime Checks - Predicate and Multiversioning** >> >> Of course we could just forbid cases where we have a `native` base from vectorizing. But that would lead to regressions currently - in most cases we do get aligned `base`s, and we currently vectorize those. We cannot statically determine if the `base` is aligned, we need a runtime check. >> >> I came up with 2 options where to place the runtime checks: >> - A new "auto vectorization" Parse Predicate: >> - This only works when predicates are available. >> - If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop. >> - Multiversion the loop: >> - Create 2 copies of the loop (fast and slow loops). >> - The `fast_loop` can make speculative alignment assumptions, and add the corresponding check to the `multiversion_if` which decides which loop we take >> - In the `slow_loop`, we make no assumption which means we can not vectorize, but we still compile - so even ... > > Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 66 commits: > > - Merge branch 'master' into JDK-8323582-SW-native-alignment > - stall -> delay, plus some more comments > - adjust selector if probability > - Merge branch 'master' into JDK-8323582-SW-native-alignment > - remove multiversion mark if we break the structure > - register opaque with igvn > - copyright and rm CFG check > - IR rules for all cases > - 3 test versions > - test changed to unaligned ints > - ... and 56 more: https://git.openjdk.org/jdk/compare/d551daca...8eb52292 This looks good for me. 
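The fast/slow multiversioning discussed in this thread boils down to a runtime alignment check selecting between two copies of the same loop. A hand-written analogue (the alignment constant and the loop bodies are illustrative; in HotSpot the check and both loop versions are generated by C2, not written by hand):

```cpp
#include <cstddef>
#include <cstdint>

// Multiversioning sketch: one runtime check picks between a loop the compiler
// may vectorize under an alignment assumption and a fallback with identical
// semantics. The constant 8 stands in for ObjectAlignmentInBytes (assumption).
void increment_all(int32_t* base, size_t n) {
  const size_t assumed_alignment = 8;
  if (reinterpret_cast<uintptr_t>(base) % assumed_alignment == 0) {
    // "fast_loop": the alignment assumption holds, so vector loads/stores
    // derived from base would be aligned and vectorization is safe.
    for (size_t i = 0; i < n; i++) {
      base[i] += 1;
    }
  } else {
    // "slow_loop": same semantics, no alignment assumption, stays scalar
    // (or less vectorized) - slower but always correct.
    for (size_t i = 0; i < n; i++) {
      base[i] += 1;
    }
  }
}
```

Both versions compute the same result; only the optimization assumptions differ, which is why "slow" here means "less optimized" rather than "rarely taken".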
-------------

Marked as reviewed by kvn (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/22016#pullrequestreview-2641927937

From kvn at openjdk.org Tue Feb 25 17:32:02 2025
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Tue, 25 Feb 2025 17:32:02 GMT
Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory
In-Reply-To: References: <9mXRl7rScxJwxNNlV_H1gxndtzZ6g-gE8cMsc6VsTJQ=.b5a77c13-6e7e-4203-898a-3318e298d30f@github.com> Message-ID: <_pnjKfnS2e4hYWJ5_y8CudFAOmKB7FrD8cad8wCfZus=.16ac819a-2a99-4a8b-9640-3fa3bde53970@github.com>

On Tue, 25 Feb 2025 07:09:24 GMT, Emanuel Peter wrote:

> > PS: "slow" path implies that it is not taken frequently and it should not affect general performance of application.
>
> For me "slow" just means less optimized, because some assumption does not hold. The "fast" path is faster, because it has more assumptions and can optimize more (i.e. vectorize in this case, or vectorize more instructions). Do you have a better name than "fast/slow"?

I think I nit-picked here. I see your good comments in `loopTransform.cpp` and `loopnode.hpp` explaining multiversioning fast_loop/slow_loop. I think it is fine to keep "slow/fast". We can use "uncommon" to indicate the infrequent path.

------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2682745643

From bkilambi at openjdk.org Tue Feb 25 19:45:31 2025
From: bkilambi at openjdk.org (Bhavana Kilambi)
Date: Tue, 25 Feb 2025 19:45:31 GMT
Subject: RFR: 8345125: Aarch64: Add aarch64 backend for Float16 scalar operations [v2]
In-Reply-To: References: Message-ID: <8QDbenZGakijqUrwAcaVogoJBEiNpzYhN3sDrrteSDk=.d8539631-ab03-45ff-a762-0b6e14c63f89@github.com>

> This patch adds aarch64 backend for scalar FP16 operations namely - add, subtract, multiply, divide, fma, sqrt, min and max.
Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: Address review comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23748/files - new: https://git.openjdk.org/jdk/pull/23748/files/a608a035..4d699740 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23748&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23748&range=00-01 Stats: 7 lines in 1 file changed: 0 ins; 0 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/23748.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23748/head:pull/23748 PR: https://git.openjdk.org/jdk/pull/23748 From bkilambi at openjdk.org Tue Feb 25 19:49:01 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Tue, 25 Feb 2025 19:49:01 GMT Subject: RFR: 8345125: Aarch64: Add aarch64 backend for Float16 scalar operations [v2] In-Reply-To: References: Message-ID: On Mon, 24 Feb 2025 17:06:59 GMT, Andrew Haley wrote: >> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: >> >> Address review comments > > src/hotspot/cpu/aarch64/aarch64.ad line 17275: > >> 17273: >> 17274: // This pattern would result in the following instructions (the first two are for ConvF2HF >> 17275: // and the last instruction is for ReinterpretS2HF) - > > Suggestion: > > // Without this pattern, (ReinterpretS2HF (ConvF2HF src)) would result in the following instructions (the first two for ConvF2HF > // and the last instruction for ReinterpretS2HF) - > > Reads a little better, I think? Addressed this in the new patch. 
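For readers unfamiliar with the binary16 layout these ConvF2HF/ReinterpretS2HF patterns shuffle around: a half-precision value is a 16-bit pattern (1 sign bit, 5 exponent bits, 10 mantissa bits) that, when loaded as a 32-bit word as described above, occupies the low 16 bits with the upper half zero. A simplified conversion sketch for normal values only (no rounding, NaN, infinity or subnormal handling, unlike the real ConvF2HF instruction):

```cpp
#include <cstdint>
#include <cstring>

// Illustrative float -> IEEE 754 binary16 bit conversion, valid only for
// normal half-precision-representable values; real hardware conversion
// (aarch64 fcvt) additionally handles rounding, NaN, infinity and subnormals.
uint16_t float_to_half_bits(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));          // reinterpret float as bits
  uint32_t sign = (bits >> 16) & 0x8000;         // move sign to bit 15
  int32_t  exp  = static_cast<int32_t>((bits >> 23) & 0xFF) - 127 + 15;  // rebias
  uint32_t mant = (bits >> 13) & 0x3FF;          // truncate mantissa, no rounding
  return static_cast<uint16_t>(sign | (static_cast<uint32_t>(exp) << 10) | mant);
}
```

Storing such a 16-bit result into a 32-bit register zero-fills the upper half, which matches the "bottom 16 bits contain the half-precision constant, the top 16 bits are zero" behavior discussed in the review.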
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23748#discussion_r1970437734 From bkilambi at openjdk.org Tue Feb 25 19:48:59 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Tue, 25 Feb 2025 19:48:59 GMT Subject: RFR: 8345125: Aarch64: Add aarch64 backend for Float16 scalar operations [v2] In-Reply-To: References: Message-ID: On Mon, 24 Feb 2025 17:42:05 GMT, Bhavana Kilambi wrote: >> src/hotspot/cpu/aarch64/aarch64.ad line 6978: >> >>> 6976: // ldr instruction has 32/64/128 bit variants but not a 16-bit variant. This >>> 6977: // loads the 16-bit value from constant pool into a 32-bit register but only >>> 6978: // the bottom half will be populated. >> >> Surely what actually happens here is that it loads a 32-bit word from the constant pool. The bottom 16 bits of this word contain the half-precision constant, the top 16 bits are zero. > > I agree. The wording didn't quite convey that. I will change it in my next PS. Thank you for looking into the patch! Addressed this in the new patch. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23748#discussion_r1970437283 From mpowers at openjdk.org Wed Feb 26 01:03:52 2025 From: mpowers at openjdk.org (Mark Powers) Date: Wed, 26 Feb 2025 01:03:52 GMT Subject: RFR: 8349721: Add aarch64 intrinsics for ML-KEM In-Reply-To: References: Message-ID: On Mon, 17 Feb 2025 13:53:30 GMT, Ferenc Rakoczi wrote: > By using the aarch64 vector registers the speed of the computation of the ML-KEM algorithms (key generation, encapsulation, decapsulation) can be approximately doubled. 
ML-KEM benchmark results of this PR:

MLKEM.decapsulate  512   11.80 us/op
MLKEM.decapsulate  768   18.19 us/op
MLKEM.decapsulate 1024   29.57 us/op
MLKEM.encapsulate  512    8.80 us/op
MLKEM.encapsulate  768   13.49 us/op
MLKEM.encapsulate 1024   22.53 us/op
MLKEM.keygen       512    7.49 us/op
MLKEM.keygen       768   11.22 us/op
MLKEM.keygen      1024   19.08 us/op

ML-KEM no intrinsics:

MLKEM.decapsulate  512   31.23 us/op
MLKEM.decapsulate  768   50.09 us/op
MLKEM.decapsulate 1024   75.92 us/op
MLKEM.encapsulate  512   22.72 us/op
MLKEM.encapsulate  768   37.27 us/op
MLKEM.encapsulate 1024   59.69 us/op
MLKEM.keygen       512   17.95 us/op
MLKEM.keygen       768   30.95 us/op
MLKEM.keygen      1024   49.04 us/op

------------- PR Comment: https://git.openjdk.org/jdk/pull/23663#issuecomment-2683631601

From dholmes at openjdk.org Wed Feb 26 07:02:12 2025
From: dholmes at openjdk.org (David Holmes)
Date: Wed, 26 Feb 2025 07:02:12 GMT
Subject: RFR: 8343840: Rewrite the ObjectMonitor lists
In-Reply-To: References: Message-ID: 

On Mon, 3 Feb 2025 16:29:25 GMT, Fredrik Bredberg wrote:

> I've combined two of `ObjectMonitor`'s lists, `EntryList` and `cxq`, into one list: the `entry_list`.
>
> This way c2 no longer has to check both `EntryList` and `cxq` in order to opt out if the "conceptual entry list" is empty, which also means that the constant question about whether it's safe to first check the `EntryList` and then `cxq` will be a thing of the past.
>
> In the current multi-queue design new threads were always added to the `cxq`, then `ObjectMonitor::exit` would choose a successor from the head of the `EntryList`. When the `EntryList` was empty and `cxq` was not, `ObjectMonitor::exit` would detach the singly linked `cxq` list and add the elements to the doubly linked `EntryList`. The element that was first added to `cxq` would be at the tail of the `EntryList`. This way you ended up working through the contending threads in LIFO-chunks.
>
> The new list-design is as much a multi-queue as the current.
Conceptually it can be looked upon as if the old singly linked `cxq` list doesn't end with a null pointer, but instead has a link that points to the head of the doubly linked `entry_list`.

> You always add to the `entry_list` by Compare And Exchange to the head. The most common case is that you remove from the tail (the successor is chosen in strict FIFO order). The head is volatile, but the interior is stable.
>
> The first contending thread that "pushes" itself onto `entry_list` will be the last thread in the list. Each newly pushed thread in `entry_list` will be linked through its next pointer, and have its prev pointer set to null; thus pushing new threads onto `entry_list` forms a singly linked list. The list is always in the right order (via the next-pointers) and is never moved to another list.
>
> Since we choose the successor in FIFO order, the exiting thread needs to find the tail of the `entry_list`. This is done by walking from the `entry_list` head. While walking the list we also assign the prev pointers of each thread, essentially forming a doubly linked list. The tail pointer is cached in `entry_list_tail` so that we don't need to walk from the `entry_list` head each time we need to find the tail (successor).
>
> Performance wise the new design seems to be equal to the old design, even though c2 generates two fewer instructions per monitor unlock operation.
>
> However the complexity of the source has been reduced by removing the `TS_CXQ` state and adding functions instead of inlining `cmpxchg` here and there, and the fact that c2 no longer has to check b...

Disclaimer for other reviewers: I have been looking at this code for some time now. Overall code looks good. I have quite a few comments/suggestions about comments.

I suggest renaming `_vthread_cxq_head` to just `_vthread_head` as the `cxq` part is no longer meaningful.
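The CAS push and tail walk that the quoted description explains can be sketched as a toy model (names and structure are illustrative, not HotSpot's actual `ObjectWaiter`/`ObjectMonitor` code, and this omits the `entry_list_tail` caching):

```cpp
#include <atomic>

// Toy entry_list: arriving waiters CAS themselves onto the head, forming a
// singly linked list via next; only the lock owner later walks from the head,
// filling in prev pointers, to find the FIFO successor at the tail.
struct Waiter {
  Waiter* next = nullptr;
  Waiter* prev = nullptr;
};

std::atomic<Waiter*> entry_list{nullptr};

// Lock-free push, done by contending threads: head is the only contended word.
void push(Waiter* node) {
  Waiter* head = entry_list.load();
  do {
    node->next = head;  // re-linked each retry against the freshly observed head
  } while (!entry_list.compare_exchange_weak(head, node));
}

// Only the monitor owner calls this, so the interior of the list is stable.
Waiter* find_tail() {
  Waiter* prev = nullptr;
  for (Waiter* w = entry_list.load(); w != nullptr; w = w->next) {
    w->prev = prev;     // form the doubly linked interior while walking
    prev = w;
  }
  return prev;          // the tail = the first thread that pushed = FIFO successor
}
```

The head is the only word that needs atomic updates; the prev links and tail are touched only under the monitor lock, which is what makes the single-list design simpler than the old `cxq` + `EntryList` pair.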
I agree that even though this seems performance neutral, the code simplification (for people reading it for the first time) will be worth it. Thanks. src/hotspot/share/jvmci/vmStructs_jvmci.cpp line 331: > 329: volatile_nonstatic_field(ObjectMonitor, _owner, int64_t) \ > 330: volatile_nonstatic_field(ObjectMonitor, _recursions, intptr_t) \ > 331: volatile_nonstatic_field(ObjectMonitor, _entry_list, ObjectWaiter*) \ Suggestion: volatile_nonstatic_field(ObjectMonitor, _entry_list, ObjectWaiter*) \ Extra space src/hotspot/share/runtime/objectMonitor.cpp line 166: > 164: // its next pointer, and have its prev pointer set to null. Thus > 165: // pushing six threads A-F (in that order) onto entry_list, will > 166: // form a singly-linked list, see 1) below. Suggestion: have diagram 1 immediately follow this text so the reader doesn't have to jump down. src/hotspot/share/runtime/objectMonitor.cpp line 172: > 170: // from the entry_list head. While walking the list we also assign > 171: // the prev pointers of each thread, essentially forming a doubly > 172: // linked list, see 2) below. Suggestion: have diagram 2 immediately follow this text so the reader doesn't have to jump down. src/hotspot/share/runtime/objectMonitor.cpp line 176: > 174: // Once we have formed a doubly linked list it's easy to find the > 175: // successor, wake it up, have it remove itself, and update the > 176: // tail pointer, as seen in 2) and 3) below. Suggestion: // tail pointer, as seen in 3) below. But have diagram 3 right here. src/hotspot/share/runtime/objectMonitor.cpp line 179: > 177: // > 178: // At any time new threads can add themselves to the entry_list, see > 179: // 4) and 5). Diagrams 4 and 5 do not follow from what has just been described, but the use of "at any time" implies to me you intended to show them affecting the queue as we have already seen it. Again show the diagram you want here. 
src/hotspot/share/runtime/objectMonitor.cpp line 183: > 181: // If the thread that removes itself from the end of the list hasn't > 182: // got any prev pointer, we just set the tail pointer to null, see > 183: // 5) and 6). Suggestion: // If the thread to be removed is the only thread in the entry list: // entry_list -> A -> null // entry_list_tail ---^ // we remove it and just set the tail pointer to null, // entry_list -> null // entry_list_tail -> null src/hotspot/share/runtime/objectMonitor.cpp line 187: > 185: // Next time we need to find the successor and the tail is null, we > 186: // just start walking from the entry_list head again forming a new > 187: // doubly linked list, see 6) and 7) below. Suggestion: // Next time we need to find the successor and the tail is null, // entry_list ->I->H->G->null // entry_list_tail ->null // we just start walking from the entry_list head again forming a new // doubly linked list: // entry_list ->I<=>H<=>G->null // entry_list_tail ----------^ src/hotspot/share/runtime/objectMonitor.cpp line 189: > 187: // doubly linked list, see 6) and 7) below. > 188: // > 189: // 1) entry_list ->F->E->D->C->B->A->null Suggestion: // 1) entry_list ->F->E->D->C->B->A->null Right-justify the names please src/hotspot/share/runtime/objectMonitor.cpp line 215: > 213: // The mutex property of the monitor itself protects the entry_list > 214: // from concurrent interference. > 215: // -- Only the monitor owner may detach nodes from the entry_list. Suggestion for this block - get rid of invariants headings and just say: // The monitor itself protects all of the operations on the entry_list except for the CAS of a new arrival // to the head. Only the monitor owner can read or write the prev links (e.g. to remove itself) or update // the tail. src/hotspot/share/runtime/objectMonitor.cpp line 225: > 223: // concurrent detaching thread. This mechanism is immune from the > 224: // ABA corruption. 
More precisely, the CAS-based "push" onto > 225: // entry_list is ABA-oblivious. Not sure this actually says anything to help people understand the code or its operation. There basically is no A-B-A issue with the use of CAS here. src/hotspot/share/runtime/objectMonitor.cpp line 227: > 225: // entry_list is ABA-oblivious. > 226: // > 227: // * The entry_list form a queue of threads stalled trying to acquire Suggestion: // * The entry_list forms a queue of threads stalled trying to acquire src/hotspot/share/runtime/objectMonitor.cpp line 232: > 230: // thread notices that the tail of the entry_list is not known, we > 231: // convert the singly-linked entry_list into a doubly linked list by > 232: // assigning the prev pointers and the entry_list_tail pointer. Didn't we essentially say all this at the beginning? src/hotspot/share/runtime/objectMonitor.cpp line 260: > 258: // > 259: // * notify() or notifyAll() simply transfers threads from the WaitSet > 260: // to either the entry_list. Subsequent exit() operations will Suggestion: // to the entry_list. Subsequent exit() operations will src/hotspot/share/runtime/objectMonitor.cpp line 704: > 702: > 703: for (;;) { > 704: ObjectWaiter* front = Atomic::load(&_entry_list); In comments and code pick "head" or "front" to use to describe what _entry_list points to and use that consistently. I think "front" is much more common. src/hotspot/share/runtime/objectMonitor.cpp line 705: > 703: for (;;) { > 704: ObjectWaiter* front = Atomic::load(&_entry_list); > 705: No need for blank line. src/hotspot/share/runtime/objectMonitor.cpp line 718: > 716: // if we added current to _entry_list. Once on _entry_list, current > 717: // stays on-queue until it acquires the lock. > 718: bool ObjectMonitor::try_lock_or_add_to_entry_list(JavaThread* current, ObjectWaiter* node) { Nit: the name suggests we do the try_lock first, when we don't. 
If we reverse the name we should also reverse the true/false return so that true relates to the first part of the name. See what others think. src/hotspot/share/runtime/objectMonitor.cpp line 719: > 717: // stays on-queue until it acquires the lock. > 718: bool ObjectMonitor::try_lock_or_add_to_entry_list(JavaThread* current, ObjectWaiter* node) { > 719: node->_prev = nullptr; Shouldn't this already be the case? src/hotspot/share/runtime/objectMonitor.cpp line 724: > 722: for (;;) { > 723: ObjectWaiter* front = Atomic::load(&_entry_list); > 724: No need for blank line. src/hotspot/share/runtime/objectMonitor.cpp line 731: > 729: > 730: // Interference - the CAS failed because _entry_list changed. Just retry. > 731: // As an optional optimization we retry the lock. Suggestion: // Interference - the CAS failed because _entry_list changed. Before // retrying the CAS retry taking the lock as it may now be free. src/hotspot/share/runtime/objectMonitor.cpp line 812: > 810: guarantee(_entry_list == nullptr, > 811: "must be no entering threads: entry_list=" INTPTR_FORMAT, > 812: p2i(_entry_list)); Mustn't re-read _entry_list in the p2i as it may have changed from the value that is causing the guarantee to fail. The old guarantees were buggy in this regard - a temp is needed. src/hotspot/share/runtime/objectMonitor.cpp line 1299: > 1297: assert(_entry_list_tail == nullptr || _entry_list_tail == currentNode, "invariant"); > 1298: > 1299: ObjectWaiter* v = Atomic::load(&_entry_list); Nit: use `w` to be consistent with similar code. The original used `w` for EntryList and `v` for cxq IIRC. src/hotspot/share/runtime/objectMonitor.cpp line 2018: > 2016: // that in prepend-mode we invert the order of the waiters. Let's say that the > 2017: // waitset is "ABCD" and the entry_list is "XYZ". After a notifyAll() in prepend > 2018: // mode the waitset will be empty and the entry_list will be "DCBAXYZ". 
We don't support different ordering modes any more so we always "prepend" such that waiters are added to the entry_list in the reverse order of waiting. So given waitList -> A -> B -> C -> D, and _entry_list -> x -> y -> z we will get _entry_list -> D -> C -> B -> A -> X -> Y -> Z src/hotspot/share/runtime/objectMonitor.hpp line 195: > 193: volatile intx _recursions; // recursion count, 0 for first entry > 194: ObjectWaiter* volatile _entry_list; // Threads blocked on entry or reentry. > 195: // The list is actually composed of WaitNodes, Suggestion: // The list is actually composed of wait-nodes, Pre-existing (check for other uses) `WaitNodes` reads like a class name but it isn't. ------------- Changes requested by dholmes (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23421#pullrequestreview-2643098063 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1970923830 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1970940771 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1970940914 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1970941662 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1970936929 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1970946641 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1970948581 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1970934947 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1970956573 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1970965071 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1970965291 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1970966451 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1970967237 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1970971522 PR Review Comment: 
https://git.openjdk.org/jdk/pull/23421#discussion_r1970968581 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1970975419 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1970976144 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1970976457 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1970977990 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1970979335 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1970982964 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1971037645 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1970926134 From haosun at openjdk.org Wed Feb 26 08:30:55 2025 From: haosun at openjdk.org (Hao Sun) Date: Wed, 26 Feb 2025 08:30:55 GMT Subject: RFR: 8345125: Aarch64: Add aarch64 backend for Float16 scalar operations [v2] In-Reply-To: <8QDbenZGakijqUrwAcaVogoJBEiNpzYhN3sDrrteSDk=.d8539631-ab03-45ff-a762-0b6e14c63f89@github.com> References: <8QDbenZGakijqUrwAcaVogoJBEiNpzYhN3sDrrteSDk=.d8539631-ab03-45ff-a762-0b6e14c63f89@github.com> Message-ID: On Tue, 25 Feb 2025 19:45:31 GMT, Bhavana Kilambi wrote: >> This patch adds aarch64 backend for scalar FP16 operations namely - add, subtract, multiply, divide, fma, sqrt, min and max. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Address review comments src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 2097: > 2095: > 2096: // Half-precision floating-point instructions > 2097: INSN(fabdh, 0b011, 0b11, 0b000101, 0b0); I suppose `fabdh` and `fnmulh` are added to keep aligned with the float and double ones, i.e. `fabd(s|d)` and `fnmul(s|d)`. I noticed that there are matching rules for `fabd(s|d)`, i.e. `absd(F|D)_reg`. I wonder if we need to add the corresponding rule for fp16 here?
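For readers unfamiliar with the instruction under discussion: `fabd` computes the absolute difference of its two operands, which is why a matching rule for it would combine an absolute-value node with a subtraction node. A scalar sketch of the semantics (using `float` purely for illustration; the `fabdh` variant applies the same operation to half-precision values):

```cpp
#include <cassert>
#include <cmath>

// Scalar semantics of AArch64 fabd: absolute difference |a - b|.
// fabdh does the same on 16-bit floats; float is used here for clarity.
inline float fabd_semantics(float a, float b) {
  return std::fabs(a - b);
}
```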
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23748#discussion_r1971142347 From bkilambi at openjdk.org Wed Feb 26 08:52:53 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Wed, 26 Feb 2025 08:52:53 GMT Subject: RFR: 8345125: Aarch64: Add aarch64 backend for Float16 scalar operations [v2] In-Reply-To: References: <8QDbenZGakijqUrwAcaVogoJBEiNpzYhN3sDrrteSDk=.d8539631-ab03-45ff-a762-0b6e14c63f89@github.com> Message-ID: On Wed, 26 Feb 2025 08:26:57 GMT, Hao Sun wrote: >> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: >> >> Address review comments > > src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 2097: > >> 2095: >> 2096: // Half-precision floating-point instructions >> 2097: INSN(fabdh, 0b011, 0b11, 0b000101, 0b0); > > I suppose `fadbh` and `fnmulh` are added to keep aligned with the float and double ones, i.e. `fabd(s|d)` and `fnmul(s|d)`. > > > I noticed that there are matching rules for `fabd(s|d)`, i.e. `absd(F|D)_reg`. I wonder if we need add the corresponding rule for fp16 here? Hi @shqking , thanks for your review comments. Yes I added `fabdh` and `fnmulh` to keep aligned with float and double types. For adding support for FP16 `absd` we need `AbsHF` to be supported (along with SubHF) but `AbsHF` node is not implemented currently. `abs` operation is directly executed from the java code here - https://github.com/openjdk/jdk/blob/037e47112bdf2fa2324f7c58198f6d433f17d9fd/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Float16.java#L1464 and is not intrinsified or pattern matched like other FP16 operations. 
Same with `negate` operation for FP16 - https://github.com/openjdk/jdk/blob/037e47112bdf2fa2324f7c58198f6d433f17d9fd/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Float16.java#L1449 On the Valhalla repo, while these operation were being developed, I tried adding support for `AbsHF/NegHF` which emitted `fabs` and `fneg` instructions but the performance with the direct java code(bit manipulation operations) was much faster (sorry don't remember the exact number) so we decided to go with the java implementation instead. I still added `fabd` here because `op21` is 0 only in `fabd` H variant and felt that it'd be better to handle it here as it belongs to this group of instructions. Please let me know your thoughts. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23748#discussion_r1971175829 From roland at openjdk.org Wed Feb 26 09:16:03 2025 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 26 Feb 2025 09:16:03 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v4] In-Reply-To: References: Message-ID: <6R7kv7XGOWIBrjPQCemB6u2vd_tFl_xMQGQaVWoxkK0=.d26f6780-82f8-4ab9-a4bc-ff7831ed9a1a@github.com> On Tue, 25 Feb 2025 09:27:13 GMT, Emanuel Peter wrote: >> Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below. >> >> **Background** >> >> With `-XX:+AlignVector`, all vector loads/stores must be aligned. We try to statically determine if we can always align the vectors. One condition is that the address `base` is already aligned. For arrays, we know that this always holds, because they are `ObjectAlignmentInBytes` aligned. But with native memory, the `base` is just some arbitrarily aligned pointer. >> >> **Problem** >> >> So far, we have just naively assumed that the `base` is always `ObjectAlignmentInBytes` aligned. But that does not hold for `native` memory segments: the `base` can also be unaligned. 
I had constructed such an example, and with `-XX:+AlignVector -XX:+VerifyAlignVector` this example hits the verification code. >> >> >> MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1); >> MemorySegment nativeUnaligned = nativeAligned.asSlice(1); >> test3(nativeUnaligned); >> >> >> When compiling the test method, we assume that the `nativeUnaligned.address()` is aligned - but it is not! >> >> static void test3(MemorySegment ms) { >> for (int i = 0; i < RANGE; i++) { >> long adr = i * 4L; >> int v = ms.get(ELEMENT_LAYOUT, adr); >> ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1)); >> } >> } >> >> >> **Solution: Runtime Checks - Predicate and Multiversioning** >> >> Of course we could just forbid cases where we have a `native` base from vectorizing. But that would lead to regressions currently - in most cases we do get aligned `base`s, and we currently vectorize those. We cannot statically determine if the `base` is aligned, we need a runtime check. >> >> I came up with 2 options where to place the runtime checks: >> - A new "auto vectorization" Parse Predicate: >> - This only works when predicates are available. >> - If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop. >> - Multiversion the loop: >> - Create 2 copies of the loop (fast and slow loops). >> - The `fast_loop` can make speculative alignment assumptions, and add the corresponding check to the `multiversion_if` which decides which loop we take >> - In the `slow_loop`, we make no assumption which means we can not vectorize, but we still compile - so even ... > > Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 66 commits: > > - Merge branch 'master' into JDK-8323582-SW-native-alignment > - stall -> delay, plus some more comments > - adjust selector if probability > - Merge branch 'master' into JDK-8323582-SW-native-alignment > - remove multiversion mark if we break the structure > - register opaque with igvn > - copyright and rm CFG check > - IR rules for all cases > - 3 test versions > - test changed to unaligned ints > - ... and 56 more: https://git.openjdk.org/jdk/compare/d551daca...8eb52292 Would it be possible and make sense to remove useless slow path loops the way it's done for predicates or zero trip guards? In `PhaseIdealLoop::build_loop_late_post_work()`, collect all `OpaqueMultiversioningNode` in a list. Then iterate over all loops the way it's done in `PhaseIdealLoop::eliminate_useless_zero_trip_guard()`, find loops marked as multi version, check we can get from the loop to the `OpaqueMultiversioningNode` and mark that one as useful. Eliminate all `OpaqueMultiversioningNode` not marked as useful. That way if some transformation such as peeling makes the loop non multi version or if the expected shape breaks for some reason, the slow loop is eliminated on next loop opts pass. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2684365921 From epeter at openjdk.org Wed Feb 26 10:02:09 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 26 Feb 2025 10:02:09 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v4] In-Reply-To: <6R7kv7XGOWIBrjPQCemB6u2vd_tFl_xMQGQaVWoxkK0=.d26f6780-82f8-4ab9-a4bc-ff7831ed9a1a@github.com> References: <6R7kv7XGOWIBrjPQCemB6u2vd_tFl_xMQGQaVWoxkK0=.d26f6780-82f8-4ab9-a4bc-ff7831ed9a1a@github.com> Message-ID: On Wed, 26 Feb 2025 09:12:46 GMT, Roland Westrelin wrote: > Would it be possible and make sense to remove useless slow path loops the way it's done for predicates or zero trip guards? 
In `PhaseIdealLoop::build_loop_late_post_work()`, collect all `OpaqueMultiversioningNode` in a list. Then iterate over all loops the way it's done in `PhaseIdealLoop::eliminate_useless_zero_trip_guard()`, find loops marked as multi version, check we can get from the loop to the `OpaqueMultiversioningNode` and mark that one as useful. Eliminate all `OpaqueMultiversioningNode` not marked as useful. That way if some transformation such as peeling makes the loop non multi version or if the expected shape breaks for some reason, the slow loop is eliminated on next loop opts pass. I suppose we could try that. Is it ok to do that in a separate RFE, so we are keeping this here to a more manageable size? And would we not have similar issues with traversing from the loops to their `OpaqueMultiversioningNode`? What if some are not reachable in the meantime? Then we would just lose the `multiversion_if` early, and could not use it any more. So maybe we'd have to do that after the verification: [JDK-8350637](https://bugs.openjdk.org/browse/JDK-8350637): C2: verify that main_loop finds pre_loop and that multiversion loops find the multiversion_if I wonder if we do not have similar issues with `PhaseIdealLoop::eliminate_useless_zero_trip_guard()` currently. Maybe it's rare enough we don't notice. @rwestrel What do you think? ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2684482233 From roland at openjdk.org Wed Feb 26 10:18:12 2025 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 26 Feb 2025 10:18:12 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v4] In-Reply-To: References: <6R7kv7XGOWIBrjPQCemB6u2vd_tFl_xMQGQaVWoxkK0=.d26f6780-82f8-4ab9-a4bc-ff7831ed9a1a@github.com> Message-ID: On Wed, 26 Feb 2025 09:59:36 GMT, Emanuel Peter wrote: > I suppose we could try that. Is it ok to do that in a separate RFE, so we are keeping this here to a more manageable size? 
Ok > And would we not have similar issues with traversing from the loops to their `OpaqueMultiversioningNode`? What if some are not reachable in the meantime? Then we would just lose the `multiversion_if` early, and could not use it any more. So maybe we'd have to do that after the verification: [JDK-8350637](https://bugs.openjdk.org/browse/JDK-8350637): C2: verify that main_loop finds pre_loop and that multiversion loops find the multiversion_if > > I wonder if we do not have similar issues with `PhaseIdealLoop::eliminate_useless_zero_trip_guard()` currently. Maybe it's rare enough we don't notice. I don't think that's a problem. When that code runs the graph is in a stable shape. There's no dead condition that needs to go through igvn to be cleaned up. We've just run igvn and haven't made any change to the graph yet. ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2684523673 From aph at openjdk.org Wed Feb 26 10:27:02 2025 From: aph at openjdk.org (Andrew Haley) Date: Wed, 26 Feb 2025 10:27:02 GMT Subject: RFR: 8345125: Aarch64: Add aarch64 backend for Float16 scalar operations [v2] In-Reply-To: References: <8QDbenZGakijqUrwAcaVogoJBEiNpzYhN3sDrrteSDk=.d8539631-ab03-45ff-a762-0b6e14c63f89@github.com> Message-ID: On Wed, 26 Feb 2025 08:49:58 GMT, Bhavana Kilambi wrote: >> src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 2097: >> >>> 2095: >>> 2096: // Half-precision floating-point instructions >>> 2097: INSN(fabdh, 0b011, 0b11, 0b000101, 0b0); >> >> I suppose `fadbh` and `fnmulh` are added to keep aligned with the float and double ones, i.e. `fabd(s|d)` and `fnmul(s|d)`. >> >> >> I noticed that there are matching rules for `fabd(s|d)`, i.e. `absd(F|D)_reg`. I wonder if we need add the corresponding rule for fp16 here? > > Hi @shqking , thanks for your review comments. Yes I added `fabdh` and `fnmulh` to keep aligned with float and double types. 
> For adding support for FP16 `absd` we need `AbsHF` to be supported (along with SubHF) but `AbsHF` node is not implemented currently. `abs` operation is directly executed from the java code here - https://github.com/openjdk/jdk/blob/037e47112bdf2fa2324f7c58198f6d433f17d9fd/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Float16.java#L1464 and is not intrinsified or pattern matched like other FP16 operations. Same with `negate` operation for FP16 - https://github.com/openjdk/jdk/blob/037e47112bdf2fa2324f7c58198f6d433f17d9fd/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Float16.java#L1449 > On the Valhalla repo, while these operation were being developed, I tried adding support for `AbsHF/NegHF` which emitted `fabs` and `fneg` instructions but the performance with the direct java code(bit manipulation operations) was much faster (sorry don't remember the exact number) so we decided to go with the java implementation instead. > I still added `fabd` here because `op21` is 0 only in `fabd` H variant and felt that it'd be better to handle it here as it belongs to this group of instructions. Please let me know your thoughts. According to the RM, fabd is in _Advanced SIMD scalar three same FP16_, but the rest are in _Floating-point data-processing (2 source)_. The decoding scheme looks rather different.`fabd`, then, doesn't really fit here, but in a section with the rest of the three same FP16 instructions. The encoding scheme for _Advanced SIMD scalar three same FP16_ is pretty simple, so I suggest you create a new group for them, and put `fabd` in there. 
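The point about encoding groups is that each group in the assembler shares one bit-field layout, so a single emitter routine can serve every member of the group. A sketch of the idea is below; the field positions and values are made up for illustration and do NOT match the real "Advanced SIMD scalar three same FP16" encoding, for which the Arm Architecture Reference Manual is the authority:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative packing of named bit-fields into a 32-bit instruction word.
// Field widths/positions here are invented for the sketch, not the real
// AArch64 layout: register fields are 5 bits wide, as on AArch64.
inline uint32_t pack_three_same(uint32_t group_bits, uint32_t rm,
                                uint32_t opcode, uint32_t rn, uint32_t rd) {
  return (group_bits << 21) | (rm << 16) | (opcode << 11) | (rn << 5) | rd;
}
```

With one such routine per group, adding a new member like `fabd` to the right group is a one-line table entry rather than a special case, which is the motivation for creating a separate group instead of forcing it into this one.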
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23748#discussion_r1971330062 From epeter at openjdk.org Wed Feb 26 10:30:15 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 26 Feb 2025 10:30:15 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v4] In-Reply-To: References: <6R7kv7XGOWIBrjPQCemB6u2vd_tFl_xMQGQaVWoxkK0=.d26f6780-82f8-4ab9-a4bc-ff7831ed9a1a@github.com> Message-ID: On Wed, 26 Feb 2025 10:15:48 GMT, Roland Westrelin wrote: > > And would we not have similar issues with traversing from the loops to their `OpaqueMultiversioningNode`? What if some are not reachable in the meantime? Then we would just lose the `multiversion_if` early, and could not use it any more. So maybe we'd have to do that after the verification: [JDK-8350637](https://bugs.openjdk.org/browse/JDK-8350637): C2: verify that main_loop finds pre_loop and that multiversion loops find the multiversion_if > > I wonder if we do not have similar issues with `PhaseIdealLoop::eliminate_useless_zero_trip_guard()` currently. Maybe it's rare enough we don't notice. > > I don't think that's a problem. When that code runs the graph is in a stable shape. There's no dead condition that needs to go through igvn to be cleaned up. We've just run igvn and haven't made any change to the graph yet. Ah ok, I'll have to look into it myself then. But if we know that it happens at the beginning of a loop-opts phase just after igvn, and no predicates were hacked yet, then that should work fine. 
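The fast/slow loop structure being discussed can be pictured in source-level terms as follows. This is only a caricature under assumed names (`inc_fast`, `inc_slow`, `kVecAlign`): the real transformation is performed on C2's IR, not in source code, and the runtime check corresponds to the `multiversion_if`:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Slow version: makes no alignment assumption, never vectorized.
inline void inc_slow(int32_t* p, std::size_t n) {
  for (std::size_t i = 0; i < n; i++) p[i] += 1;
}

// Fast version: the compiler may vectorize it, relying on the guard below.
inline void inc_fast(int32_t* p, std::size_t n) {
  for (std::size_t i = 0; i < n; i++) p[i] += 1;
}

// The "multiversion_if": a runtime check on the base address selects the
// version that may assume alignment, or the fallback with no assumptions.
inline void inc(int32_t* p, std::size_t n) {
  const std::uintptr_t kVecAlign = 16;  // assumed vector alignment
  if (reinterpret_cast<std::uintptr_t>(p) % kVecAlign == 0) {
    inc_fast(p, n);   // fast loop
  } else {
    inc_slow(p, n);   // slow loop, still compiled and correct
  }
}
```

If later loop opts destroy the fast loop's expected shape, the slow branch and its guard become dead weight, which is what the proposed cleanup pass would detect and remove.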
------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2684550571 From epeter at openjdk.org Wed Feb 26 10:36:06 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 26 Feb 2025 10:36:06 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v4] In-Reply-To: References: <6R7kv7XGOWIBrjPQCemB6u2vd_tFl_xMQGQaVWoxkK0=.d26f6780-82f8-4ab9-a4bc-ff7831ed9a1a@github.com> Message-ID: On Wed, 26 Feb 2025 10:15:48 GMT, Roland Westrelin wrote: >>> Would it be possible and make sense to remove useless slow path loops the way it's done for predicates or zero trip guards? In `PhaseIdealLoop::build_loop_late_post_work()`, collect all `OpaqueMultiversioningNode` in a list. Then iterate over all loops the way it's done in `PhaseIdealLoop::eliminate_useless_zero_trip_guard()`, find loops marked as multi version, check we can get from the loop to the `OpaqueMultiversioningNode` and mark that one as useful. Eliminate all `OpaqueMultiversioningNode` not marked as useful. That way if some transformation such as peeling makes the loop non multi version or if the expected shape breaks for some reason, the slow loop is eliminated on next loop opts pass. >> >> I suppose we could try that. Is it ok to do that in a separate RFE, so we are keeping this here to a more manageable size? >> >> I don't see it as super critical personally, as the slow_path is `delayed`, so no loop-opts are performed on it. The overhead is minimal if we keep it until after loop-opts, I think. But I'm not against trying. It would take a bit of effort to construct test cases where we have the loop fold away after multiversion_if is added, but that is probably possible. >> >> And would we not have similar issues with traversing from the loops to their `OpaqueMultiversioningNode`? What if some are not reachable in the meantime? Then we would just lose the `multiversion_if` early, and could not use it any more. 
So maybe we'd have to do that after the verification: >> [JDK-8350637](https://bugs.openjdk.org/browse/JDK-8350637): C2: verify that main_loop finds pre_loop and that multiversion loops find the multiversion_if >> >> I wonder if we do not have similar issues with `PhaseIdealLoop::eliminate_useless_zero_trip_guard()` currently. Maybe it's rare enough we don't notice. >> >> @rwestrel What do you think? > >> I suppose we could try that. Is it ok to do that in a separate RFE, so we are keeping this here to a more manageable size? > > Ok > >> And would we not have similar issues with traversing from the loops to their `OpaqueMultiversioningNode`? What if some are not reachable in the meantime? Then we would just lose the `multiversion_if` early, and could not use it any more. So maybe we'd have to do that after the verification: [JDK-8350637](https://bugs.openjdk.org/browse/JDK-8350637): C2: verify that main_loop finds pre_loop and that multiversion loops find the multiversion_if >> >> I wonder if we do not have similar issues with `PhaseIdealLoop::eliminate_useless_zero_trip_guard()` currently. Maybe it's rare enough we don't notice. > > I don't think that's a problem. When that code runs the graph is in a stable shape. There's no dead condition that needs to go through igvn to be cleaned up. We've just run igvn and haven't made any change to the graph yet. @rwestrel I filed this follow-up RFE: [JDK-8350756](https://bugs.openjdk.org/browse/JDK-8350756): C2 SuperWord Multiversioning: remove useless slow loop when the fast loop disappears We'll have to be careful to only fold the `slow_loop` away if it is not used, i.e. if we did not in the meantime use the `multiversion_if`, and maybe the `fast_loop` structure is only disintegrating because of some speculative assumption, maybe because of more unrolling that only happens with vectorization. It would be good to have a test-case for that.
I'm writing that here so I will remember it later ;) @rwestrel Do you have any other ideas / suggestions? ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2684567780 From galder at openjdk.org Wed Feb 26 11:36:11 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Wed, 26 Feb 2025 11:36:11 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> Message-ID: On Fri, 7 Feb 2025 12:39:24 GMT, Galder Zamarre?o wrote: >> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance. >> >> Currently vectorization does not kick in for loops containing either of these calls because of the following error: >> >> >> VLoop::check_preconditions: failed: control flow in loop not allowed >> >> >> The control flow is due to the java implementation for these methods, e.g. >> >> >> public static long max(long a, long b) { >> return (a >= b) ? a : b; >> } >> >> >> This patch intrinsifies the calls to replace the CmpL + Bool nodes for MaxL/MinL nodes respectively. >> By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization. >> E.g. >> >> >> SuperWord::transform_loop: >> Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt strip_mined >> 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21) >> >> >> Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after. 
Before the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1155 >> long max 1173 >> >> >> After the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1042 >> long max 1042 >> >> >> This patch does not add an platform-specific backend implementations for the MaxL/MinL nodes. >> Therefore, it still relies on the macro expansion to transform those into CMoveL. >> >> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results: >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PA... > > Galder Zamarre?o has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 44 additional commits since the last revision: > > - Merge branch 'master' into topic.intrinsify-max-min-long > - Fix typo > - Renaming methods and variables and add docu on algorithms > - Fix copyright years > - Make sure it runs with cpus with either avx512 or asimd > - Test can only run with 256 bit registers or bigger > > * Remove platform dependant check > and use platform independent configuration instead. > - Fix license header > - Tests should also run on aarch64 asimd=true envs > - Added comment around the assertions > - Adjust min/max identity IR test expectations after changes > - ... 
and 34 more: https://git.openjdk.org/jdk/compare/abdd4f5e...a190ae68 > > Re: [#20098 (comment)](https://github.com/openjdk/jdk/pull/20098#issuecomment-2671144644) - I was trying to think what could be causing this. > > Maybe it is an issue with probabilities? Do you know at what point (if at all) the `MinI` node appears/disappears in that example? The probabilities are fine. I think the issue with `Math.min(II)` seems to be specific to when its compilation happens, and the combined fact that the intrinsic has been disabled and vectorization does not kick in (explicitly disabled). Note that other parts of the JDK invoke `Math.min(II)`. In the slow cases it appears the compilation happens before the benchmark kicks in, and so it takes the profiling data before the benchmark to decide how to compile this in. In the slow versions you see this `PrintMethodData`: static java.lang.Math::min(II)I interpreter_invocation_count: 18171 invocation_counter: 18171 backedge_counter: 0 decompile_count: 0 mdo size: 328 bytes 0 iload_0 1 iload_1 2 if_icmpgt 9 0 bci: 2 BranchData taken(7732) displacement(56) not taken(10180) 5 iload_0 6 goto 10 32 bci: 6 JumpData taken(10180) displacement(24) 9 iload_1 10 ireturn org.openjdk.bench.java.lang.MinMaxVector::intReductionSimpleMin(Lorg/openjdk/bench/java/lang/MinMaxVector$LoopState;)I interpreter_invocation_count: 189 invocation_counter: 189 backedge_counter: 313344 decompile_count: 0 mdo size: 384 bytes 0 iconst_0 1 istore_2 2 iconst_0 3 istore_3 4 iload_3 5 aload_1 6 fast_igetfield 35 9 if_icmpge 33 0 bci: 9 BranchData taken(58) displacement(72) not taken(192512) 12 aload_1 13 fast_agetfield 41 16 iload_3 17 iaload 18 istore #4 20 iload_2 21 fast_iload #4 23 invokestatic 32 32 bci: 23 CounterData count(192512) 26 istore_2 27 iinc #3 1 30 goto 4 48 bci: 30 JumpData taken(192512) displacement(-48) 33 iload_2 34 ireturn The benchmark method calls Math.min `192_512` times, yet the method data shows only `18_171` invocations, of which `7_732` 
are taken which is 42%. So it gets compiled with a `cmov` and the benchmark will be slow because it will branch 100% one of the sides. In the fast version, `PrintMethodData` looks like this: static java.lang.Math::min(II)I interpreter_invocation_count: 1575322 invocation_counter: 1575322 backedge_counter: 0 decompile_count: 0 mdo size: 368 bytes 0 iload_0 1 iload_1 2 if_icmpgt 9 0 bci: 2 BranchData taken(1418001) displacement(56) not taken(157062) 5 iload_0 6 goto 10 32 bci: 6 JumpData taken(157062) displacement(24) 9 iload_1 10 ireturn org.openjdk.bench.java.lang.MinMaxVector::intReductionSimpleMin(Lorg/openjdk/bench/java/lang/MinMaxVector$LoopState;)I interpreter_invocation_count: 858 invocation_counter: 858 backedge_counter: 1756214 decompile_count: 0 mdo size: 424 bytes 0 iconst_0 1 istore_2 2 iconst_0 3 istore_3 4 iload_3 5 aload_1 6 fast_igetfield 35 9 if_icmpge 33 0 bci: 9 BranchData taken(733) displacement(72) not taken(1637363) 12 aload_1 13 fast_agetfield 41 16 iload_3 17 iaload 18 istore #4 20 iload_2 21 fast_iload #4 23 invokestatic 32 32 bci: 23 CounterData count(1637363) 26 istore_2 27 iinc #3 1 30 goto 4 48 bci: 30 JumpData taken(1637363) displacement(-48) 33 iload_2 34 ireturn The benchmark method calls Math.min `1_637_363` times, and the method data shows `1_575_322` invocations, of which `1_418_001` are taken which is 90%. So no cmov is introduced and the benchmark will be fast because it will branch 100% one of the sides. A factor here might be my Xeon machine. I run the benchmark on a 4 core VM inside it, so given the limited resources compilation can take longer. I've noticed that it's easier to replicate this scenario there rather than my M1 laptop, which has 10 cores. >> So, if those int scalar regressions were not a problem when int min/max intrinsic was added, I would expect the same to apply to long. > > Do you know when they were added? If that was a long time ago, we might not have noticed back then, but we might notice now. 
I don't know when they were added. > That said: if we know that it is only in the high-probability cases, then we can address those separately. I would not consider it a blocking issue, as long as we file the follow-up RFE for int/max scalar case with high branch probability. > > What would be really helpful: a list of all regressions / issues, and how we intend to deal with them. If we later find a regression that someone cares about, then we can come back to that list, and justify the decision we made here. I'll make up a list of regressions and post it here. I won't create RFEs for now. I'd rather wait until we have the list in front of us and we can decide which RFEs to create. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2684701935 From duke at openjdk.org Wed Feb 26 14:18:14 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Wed, 26 Feb 2025 14:18:14 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v7] In-Reply-To: References: Message-ID: > By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. 
Ferenc Rakoczi has updated the pull request incrementally with two additional commits since the last revision: - Added more comments, mainly as suggested by Andrew Dinn - Changed aarch64-asmtest.py as suggested by Bhavana-Kilambi ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23300/files - new: https://git.openjdk.org/jdk/pull/23300/files/54373d5a..aa0570db Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23300&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23300&range=05-06 Stats: 478 lines in 3 files changed: 40 ins; 6 del; 432 mod Patch: https://git.openjdk.org/jdk/pull/23300.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23300/head:pull/23300 PR: https://git.openjdk.org/jdk/pull/23300 From haosun at openjdk.org Wed Feb 26 14:41:56 2025 From: haosun at openjdk.org (Hao Sun) Date: Wed, 26 Feb 2025 14:41:56 GMT Subject: RFR: 8345125: Aarch64: Add aarch64 backend for Float16 scalar operations [v2] In-Reply-To: References: <8QDbenZGakijqUrwAcaVogoJBEiNpzYhN3sDrrteSDk=.d8539631-ab03-45ff-a762-0b6e14c63f89@github.com> Message-ID: On Wed, 26 Feb 2025 08:49:58 GMT, Bhavana Kilambi wrote: >> src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 2097: >> >>> 2095: >>> 2096: // Half-precision floating-point instructions >>> 2097: INSN(fabdh, 0b011, 0b11, 0b000101, 0b0); >> >> I suppose `fadbh` and `fnmulh` are added to keep aligned with the float and double ones, i.e. `fabd(s|d)` and `fnmul(s|d)`. >> >> >> I noticed that there are matching rules for `fabd(s|d)`, i.e. `absd(F|D)_reg`. I wonder if we need add the corresponding rule for fp16 here? > > Hi @shqking , thanks for your review comments. Yes I added `fabdh` and `fnmulh` to keep aligned with float and double types. > For adding support for FP16 `absd` we need `AbsHF` to be supported (along with SubHF) but `AbsHF` node is not implemented currently. 
`abs` operation is directly executed from the java code here - https://github.com/openjdk/jdk/blob/037e47112bdf2fa2324f7c58198f6d433f17d9fd/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Float16.java#L1464 and is not intrinsified or pattern matched like other FP16 operations. Same with the `negate` operation for FP16 - https://github.com/openjdk/jdk/blob/037e47112bdf2fa2324f7c58198f6d433f17d9fd/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Float16.java#L1449
> On the Valhalla repo, while these operations were being developed, I tried adding support for `AbsHF/NegHF`, which emitted `fabs` and `fneg` instructions, but the performance with the direct java code (bit manipulation operations) was much faster (sorry, I don't remember the exact number), so we decided to go with the java implementation instead.
> I still added `fabd` here because `op21` is 0 only in the `fabd` H variant and I felt that it'd be better to handle it here as it belongs to this group of instructions. Please let me know your thoughts.

@Bhavana-Kilambi Thanks for your explanation of the missing `AbsHF`. It's okay to me to have `fabdh` and `fnmulh` in this patch.

Overall it's good to me, except for aph's comment above.

------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23748#discussion_r1971712164

From adinn at openjdk.org Wed Feb 26 14:58:07 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Wed, 26 Feb 2025 14:58:07 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v7] In-Reply-To: References: Message-ID: <8h5rWJFe3PKLNO6QiDZiAj98ePBoCilk0b9w420hZLE=.a17a4ecd-757b-405c-8f5a-5470bde5bf18@github.com> On Wed, 26 Feb 2025 14:18:14 GMT, Ferenc Rakoczi wrote: >> By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled.
> > Ferenc Rakoczi has updated the pull request incrementally with two additional commits since the last revision:
> >
> > - Added more comments, mainly as suggested by Andrew Dinn
> > - Changed aarch64-asmtest.py as suggested by Bhavana-Kilambi

Ok, still good

------------- Marked as reviewed by adinn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23300#pullrequestreview-2644812035

From galder at openjdk.org Wed Feb 26 18:33:03 2025 From: galder at openjdk.org (Galder Zamarreño) Date: Wed, 26 Feb 2025 18:33:03 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> Message-ID: On Wed, 26 Feb 2025 11:32:57 GMT, Galder Zamarreño wrote:

> > That said: if we know that it is only in the high-probability cases, then we can address those separately. I would not consider it a blocking issue, as long as we file the follow-up RFE for int/max scalar case with high branch probability.
> > What would be really helpful: a list of all regressions / issues, and how we intend to deal with them. If we later find a regression that someone cares about, then we can come back to that list, and justify the decision we made here.
>
> I'll make up a list of regressions and post it here. I won't create RFEs for now. I'd rather wait until we have the list in front of us and we can decide which RFEs to create.

Before noting the regressions, it's worth noting that this PR also improves performance in certain scenarios. I will summarise those tomorrow.

Here's a summary of the regressions:

### Regression 1

Given a loop with a long min/max reduction pattern with one side of the branch taken near 100% of the time, when Superword finds the pattern not profitable, then HotSpot will use scalar instructions (cmov) and performance will regress.

Possible solutions:
a) make Superword recognise these scenarios as profitable.
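For concreteness, the reduction shape under discussion is roughly the following (a hedged sketch with made-up names, not the actual `MinMaxVector` benchmark source; with a strictly decreasing input, the `min` update is taken on every iteration, i.e. the branch is ~100% one-sided, which is the case where a scalar cmov loses against branching code):

```java
// Sketch of a long min reduction with a highly one-sided branch.
public class LongMinReduction {
    static long reduceMin(long[] a) {
        long min = Long.MAX_VALUE;
        for (int i = 0; i < a.length; i++) {
            // The Long.min pattern that C2 would intrinsify/vectorize.
            min = Math.min(min, a[i]);
        }
        return min;
    }

    public static void main(String[] args) {
        long[] a = new long[1024];
        for (int i = 0; i < a.length; i++) {
            a[i] = 1024 - i; // strictly decreasing: the "update min" side is always taken
        }
        System.out.println(reduceMin(a)); // prints 1
    }
}
```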
### Regression 2

Given a loop with a long min/max reduction pattern with one side of the branch taken near 100% of the time, when the platform does not support vector instructions to achieve this (e.g. AVX-512 quad word vpmax/vpmin), then HotSpot will use scalar instructions (cmov) and performance will regress.

Possible solutions:
a) find a way to use other vector instructions (vpcmp+vpblend+vmov?)
b) fall back on more suitable scalar instructions, e.g. cmp+mov, when the branch is very one-sided

### Regression 3

Given a loop with a long min/max non-reduction pattern (e.g. `longLoopMax`) with one side of the branch taken near 100% of the time, when the platform does not vectorize it (either lack of CPU instruction support, or Superword finding it not profitable), then HotSpot will use scalar instructions (cmov) and performance will regress.

Possible solutions:
a) find a way to use other vector instructions (e.g. `longLoopMax` vectorizes with AVX2 and might also do so with earlier instruction sets)
b) fall back on more suitable scalar instructions, e.g. cmp+mov, when the branch is very one-sided.

------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2685865807

From roland at openjdk.org Wed Feb 26 19:34:04 2025 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 26 Feb 2025 19:34:04 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v4] In-Reply-To: References: Message-ID: On Tue, 25 Feb 2025 09:27:13 GMT, Emanuel Peter wrote:

>> Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below.
>>
>> **Background**
>>
>> With `-XX:+AlignVector`, all vector loads/stores must be aligned. We try to statically determine if we can always align the vectors. One condition is that the address `base` is already aligned. For arrays, we know that this always holds, because they are `ObjectAlignmentInBytes` aligned.
But with native memory, the `base` is just some arbitrarily aligned pointer.
>>
>> **Problem**
>>
>> So far, we have just naively assumed that the `base` is always `ObjectAlignmentInBytes` aligned. But that does not hold for `native` memory segments: the `base` can also be unaligned. I had constructed such an example, and with `-XX:+AlignVector -XX:+VerifyAlignVector` this example hits the verification code.
>>
>> MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1);
>> MemorySegment nativeUnaligned = nativeAligned.asSlice(1);
>> test3(nativeUnaligned);
>>
>> When compiling the test method, we assume that the `nativeUnaligned.address()` is aligned - but it is not!
>>
>> static void test3(MemorySegment ms) {
>>     for (int i = 0; i < RANGE; i++) {
>>         long adr = i * 4L;
>>         int v = ms.get(ELEMENT_LAYOUT, adr);
>>         ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1));
>>     }
>> }
>>
>> **Solution: Runtime Checks - Predicate and Multiversioning**
>>
>> Of course we could just forbid cases where we have a `native` base from vectorizing. But that would lead to regressions currently - in most cases we do get aligned `base`s, and we currently vectorize those. We cannot statically determine if the `base` is aligned, we need a runtime check.
>>
>> I came up with 2 options where to place the runtime checks:
>> - A new "auto vectorization" Parse Predicate:
>>   - This only works when predicates are available.
>>   - If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop.
>> - Multiversion the loop:
>>   - Create 2 copies of the loop (fast and slow loops).
>>   - The `fast_loop` can make speculative alignment assumptions, and add the corresponding check to the `multiversion_if` which decides which loop we take
>>   - In the `slow_loop`, we make no assumption which means we can not vectorize, but we still compile - so even ...
>
> Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 66 commits:
>
> - Merge branch 'master' into JDK-8323582-SW-native-alignment
> - stall -> delay, plus some more comments
> - adjust selector if probability
> - Merge branch 'master' into JDK-8323582-SW-native-alignment
> - remove multiversion mark if we break the structure
> - register opaque with igvn
> - copyright and rm CFG check
> - IR rules for all cases
> - 3 test versions
> - test changed to unaligned ints
> - ... and 56 more: https://git.openjdk.org/jdk/compare/d551daca...8eb52292

Looks good to me.

------------- Marked as reviewed by roland (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/22016#pullrequestreview-2645658428

From epeter at openjdk.org Thu Feb 27 06:57:04 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 27 Feb 2025 06:57:04 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> Message-ID: <63F-0aHgMthexL0b2DFmkW8_QrJeo8OOlCaIyZApfpY=.4744070d-9d56-4031-8684-be14cf66d1e5@github.com> On Wed, 26 Feb 2025 18:29:58 GMT, Galder Zamarreño wrote:

>>> > Re: [#20098 (comment)](https://github.com/openjdk/jdk/pull/20098#issuecomment-2671144644) - I was trying to think what could be causing this.
>>>
>>> Maybe it is an issue with probabilities? Do you know at what point (if at all) the `MinI` node appears/disappears in that example?
>>
>> The probabilities are fine.
>> >> I think the issue with `Math.min(II)` seems to be specific to when its compilation happens, and the combined fact that the intrinsic has been disabled and vectorization does not kick in (explicitly disabled). Note that other parts of the JDK invoke `Math.min(II)`. >> >> In the slow cases it appears the compilation happens before the benchmark kicks in, and so it takes the profiling data before the benchmark to decide how to compile this in. >> >> In the slow versions you see this `PrintMethodData`: >> >> static java.lang.Math::min(II)I >> interpreter_invocation_count: 18171 >> invocation_counter: 18171 >> backedge_counter: 0 >> decompile_count: 0 >> mdo size: 328 bytes >> >> 0 iload_0 >> 1 iload_1 >> 2 if_icmpgt 9 >> 0 bci: 2 BranchData taken(7732) displacement(56) >> not taken(10180) >> 5 iload_0 >> 6 goto 10 >> 32 bci: 6 JumpData taken(10180) displacement(24) >> 9 iload_1 >> 10 ireturn >> >> org.openjdk.bench.java.lang.MinMaxVector::intReductionSimpleMin(Lorg/openjdk/bench/java/lang/MinMaxVector$LoopState;)I >> interpreter_invocation_count: 189 >> invocation_counter: 189 >> backedge_counter: 313344 >> decompile_count: 0 >> mdo size: 384 bytes >> >> 0 iconst_0 >> 1 istore_2 >> 2 iconst_0 >> 3 istore_3 >> 4 iload_3 >> 5 aload_1 >> 6 fast_igetfield 35 >> 9 if_icmpge 33 >> 0 bci: 9 BranchData taken(58) displacement(72) >> not taken(192512) >> 12 aload_1 >> 13 fast_agetfield 41 >> 16 iload_3 >> 17 iaload >> 18 istore #4 >> 20 iload_2 >> 21 fast_iload #4 >> 23 invokestatic 32 >> 32 bci: 23 CounterData count(192512) >> 26 istore_2 >> 27 iinc #3 1 >> 30 goto 4 >> 48 bci: 30 JumpData taken(192512) displacement(-48) >> 33 iload_2 >> 34 ireturn >> >> >> The benchmark method calls Math... > >> > That said: if we know that it is only in the high-probability cases, then we can address those separately. I would not consider it a blocking issue, as long as we file the follow-up RFE for int/max scalar case with high branch probability. 
>> > What would be really helpful: a list of all regressions / issues, and how we intend to deal with them. If we later find a regression that someone cares about, then we can come back to that list, and justify the decision we made here. >> >> I'll make up a list of regressions and post it here. I won't create RFEs for now. I'd rather wait until we have the list in front of us and we can decide which RFEs to create. > > Before noting the regressions, it's worth noting that PR also improves performance certain scenarios. I will summarise those tomorrow. > > Here's a summary of the regressions > > ### Regression 1 > Given a loop with a long min/max reduction pattern with one side of branch taken near 100% of time, when Supeword finds the pattern not profitable, then HotSpot will use scalar instructions (cmov) and performance will regress. > > Possible solutions: > a) make Superword recognise these scenarios as profitable. > > ### Regression 2 > Given a loop with a long min/max reduction pattern with one side of branch near 100% of time, when the platform does not support vector instructions to achieve this (e.g. AVX-512 quad word vpmax/vpmin), then HotSpot will use scalar instructions (cmov) and performance will regress. > > Possible solutions > a) find a way to use other vector instructions (vpcmp+vpblend+vmov?) > b) fallback on more suitable scalar instructions, e.g. cmp+mov, when the branch is very one-sided > > ### Regression 3 > Given a loop with a long min/max non-reduction pattern (e.g. `longLoopMax`) with one side of branch taken near 100% of time, when the platform does not vectorize it (either lack of CPU instruction support, or Superword finding not profitable), then HotSpot will use scalar instructions (cmov) and performance will regress. > > Possible solutions: > a) find a way to use other vector instructions (e.g. `longLoopMax` vectorizes with AVX2 and might also do with earlier instruction sets) > b) fallback on more suitable scalar instructions, e.g. 
cmp+mov, when the branch is very one-sided.

@galderz Thanks for the summary of regressions!

Yes, there are plenty of speedups, I assume primarily because of `Long.min/max` vectorization, but possibly also because the operation can now "float" out of a loop, for example.

All your Regressions 1-3 are cases with "extreme" probability (close to 100% / 0%); you listed no others. That matches my intuition that branching code is usually better than cmove in extreme probability cases.

As for possible solutions: in all Regression 1-3 cases, it seems the issue is the scalar cmove. So actually in all cases a possible solution is using branching code (i.e. `cmp+mov`). So to me, these are the follow-up RFEs:

- Detect "extreme" probability scalar cmoves, and replace them with branching code. This should take care of all regressions here. This one has high priority, as it fixes the regression caused by this patch. But it would also help to improve performance for the `Integer.min/max` cases, which have the same issue.
- Additional performance improvement: make SuperWord recognize more cases as profitable (see Regression 1). Optional.
- Additional performance improvement: extend backend capabilities for vectorization (see Regression 2 + 3). Optional.

Does that make sense, or am I missing something?

------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2687067125

From epeter at openjdk.org Thu Feb 27 07:02:10 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 27 Feb 2025 07:02:10 GMT Subject: RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory [v4] In-Reply-To: References: <6R7kv7XGOWIBrjPQCemB6u2vd_tFl_xMQGQaVWoxkK0=.d26f6780-82f8-4ab9-a4bc-ff7831ed9a1a@github.com> Message-ID: On Wed, 26 Feb 2025 10:15:48 GMT, Roland Westrelin wrote: >>> Would it be possible and make sense to remove useless slow path loops the way it's done for predicates or zero trip guards?
In `PhaseIdealLoop::build_loop_late_post_work()`, collect all `OpaqueMultiversioningNode` in a list. Then iterate over all loops the way it's done in `PhaseIdealLoop::eliminate_useless_zero_trip_guard()`, find loops marked as multi version, check we can get from the loop to the `OpaqueMultiversioningNode` and mark that one as useful. Eliminate all `OpaqueMultiversioningNode` not marked as useful. That way if some transformation such as peeling makes the loop non multi version or if the expected shape breaks for some reason, the slow loop is eliminated on next loop opts pass. >> >> I suppose we could try that. Is it ok to do that in a separate RFE, so we are keeping this here to a more manageable size? >> >> I don't see it as super critical personally, as the slow_path is `delayed`, so no loop-opts are performed on it. The overhead is minimal if we keep it until after loop-opts, I think. But I'm not against trying. It would take a bit of effort to construct test cases where we have the loop fold away after multiversion_if is added, but that is probably possible. >> >> And would we not have similar issues with traversing from the loops to their `OpaqueMultiversioningNode`? What if some are not reachable in the meantime? Then we would just lose the `multiversion_if` early, and could not use it any more. So maybe we'd have to do that after the verification: >> [JDK-8350637](https://bugs.openjdk.org/browse/JDK-8350637): C2: verify that main_loop finds pre_loop and that multiversion loops find the multiversion_if >> >> I wonder if we do not have similar issues with `PhaseIdealLoop::eliminate_useless_zero_trip_guard()` currently. Maybe it's rare enough we don't notice. >> >> @rwestrel What do you think? > >> I suppose we could try that. Is it ok to do that in a separate RFE, so we are keeping this here to a more manageable size? > > Ok > >> And would we not have similar issues with traversing from the loops to their `OpaqueMultiversioningNode`? 
What if some are not reachable in the meantime? Then we would just lose the `multiversion_if` early, and could not use it any more. So maybe we'd have to do that after the verification: [JDK-8350637](https://bugs.openjdk.org/browse/JDK-8350637): C2: verify that main_loop finds pre_loop and that multiversion loops find the multiversion_if >> >> I wonder if we do not have similar issues with `PhaseIdealLoop::eliminate_useless_zero_trip_guard()` currently. Maybe it's rare enough we don't notice. > > I don't think that's a problem. When that code runs the graph is in a stable shape. There's no dead condition that needs to go through igvn to be cleaned up. We've just run igvn and haven't made any change to the graph yet. @rwestrel @vnkozlov Thank you for the reviews, and all the good questions, and ideas for follow-up RFE's ? ------------- PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2687071561 From epeter at openjdk.org Thu Feb 27 07:02:11 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 27 Feb 2025 07:02:11 GMT Subject: Integrated: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory In-Reply-To: References: Message-ID: On Mon, 11 Nov 2024 14:40:09 GMT, Emanuel Peter wrote: > Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below. > > **Background** > > With `-XX:+AlignVector`, all vector loads/stores must be aligned. We try to statically determine if we can always align the vectors. One condition is that the address `base` is already aligned. For arrays, we know that this always holds, because they are `ObjectAlignmentInBytes` aligned. But with native memory, the `base` is just some arbitrarily aligned pointer. > > **Problem** > > So far, we have just naively assumed that the `base` is always `ObjectAlignmentInBytes` aligned. But that does not hold for `native` memory segments: the `base` can also be unaligned. 
I had constructed such an example, and with `-XX:+AlignVector -XX:+VerifyAlignVector` this example hits the verification code. > > > MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1); > MemorySegment nativeUnaligned = nativeAligned.asSlice(1); > test3(nativeUnaligned); > > > When compiling the test method, we assume that the `nativeUnaligned.address()` is aligned - but it is not! > > static void test3(MemorySegment ms) { > for (int i = 0; i < RANGE; i++) { > long adr = i * 4L; > int v = ms.get(ELEMENT_LAYOUT, adr); > ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1)); > } > } > > > **Solution: Runtime Checks - Predicate and Multiversioning** > > Of course we could just forbid cases where we have a `native` base from vectorizing. But that would lead to regressions currently - in most cases we do get aligned `base`s, and we currently vectorize those. We cannot statically determine if the `base` is aligned, we need a runtime check. > > I came up with 2 options where to place the runtime checks: > - A new "auto vectorization" Parse Predicate: > - This only works when predicates are available. > - If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop. > - Multiversion the loop: > - Create 2 copies of the loop (fast and slow loops). > - The `fast_loop` can make speculative alignment assumptions, and add the corresponding check to the `multiversion_if` which decides which loop we take > - In the `slow_loop`, we make no assumption which means we can not vectorize, but we still compile - so even unaligned `base`s would end up with reasonably fast code. > - We "stall" the `... This pull request has now been integrated. 
Changeset: 885338b5 Author: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/885338b5f38ed05d8b91efc0178b371f2f89310e Stats: 1089 lines in 27 files changed: 966 ins; 28 del; 95 mod 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory Reviewed-by: roland, kvn ------------- PR: https://git.openjdk.org/jdk/pull/22016 From adinn at openjdk.org Thu Feb 27 09:50:59 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Thu, 27 Feb 2025 09:50:59 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: References: Message-ID: On Fri, 21 Feb 2025 10:23:37 GMT, Ferenc Rakoczi wrote: >> Hi. Here is the test result of our CI. >> >> ### copyright year >> >> the following files should update the copyright year to 2025. >> >> >> src/hotspot/cpu/aarch64/assembler_aarch64.hpp >> src/hotspot/cpu/aarch64/stubRoutines_aarch64.hpp >> src/hotspot/share/runtime/globals.hpp >> src/java.base/share/classes/sun/security/provider/ML_DSA.java >> src/java.base/share/classes/sun/security/provider/SHA3Parallel.java >> test/micro/org/openjdk/bench/java/security/MLDSA.java >> >> >> ### cross-build failure >> >> Cross build for riscv64/s390/ppc64 failed. 
>>
>> Here is the error message for ppc64:
>>
>> === Output from failing command(s) repeated here ===
>> * For target support_interim-jmods_support__create_java.base.jmod_exec:
>> #
>> # A fatal error has been detected by the Java Runtime Environment:
>> #
>> # Internal Error (/tmp/jdk-src/src/hotspot/share/asm/codeBuffer.hpp:200), pid=72752, tid=72769
>> # assert(allocates2(pc)) failed: not in CodeBuffer memory: 0x0000e85cb03dc620 <= 0x0000e85cb03e8ab4 <= 0x0000e85cb03e8ab0
>> #
>> # JRE version: OpenJDK Runtime Environment (25.0) (fastdebug build 25-internal-git-1e01c6deec3)
>> # Java VM: OpenJDK 64-Bit Server VM (fastdebug 25-internal-git-1e01c6deec3, mixed mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64)
>> # Problematic frame:
>> # V [libjvm.so+0x3b391c] Instruction_aarch64::~Instruction_aarch64()+0xbc
>> #
>> # Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" (or dumping to /tmp/ci-scripts/jdk-src/make/
>> #
>> # An error report file with more information is saved as:
>> # /tmp/jdk-src/make/hs_err_pid72752.log
>> ... (rest of output omitted)
>>
>> * All command lines available in /sysroot/ppc64el/tmp/build-ppc64el/make-support/failure-logs.
>> === End of repeated output ===
>>
>> I suppose we should make a similar update to the file `src/hotspot/cpu/aarch64/stubDeclarations_aarch64.hpp` for the other platforms.
>
> @shqking, I changed the copyright years, but I don't really understand how the aarch64-specific code can overflow buffers on other architectures. As far as I understand, Instruction_aarch64 should not have been there in a ppc build.
> Was this a build attempted on an aarch64 for the other architectures?

@ferakocz Apologies for raising yet another resolve conflict. You will need to make a further adjustment to the compiler blob declaration to accommodate a fix I just pushed to resolve a problem with cross-compilation.
Your patch should now specify do_arch_blob(compiler, 50000 ZGC_ONLY(+10000)) ------------- PR Comment: https://git.openjdk.org/jdk/pull/23300#issuecomment-2687427983 From adinn at openjdk.org Thu Feb 27 09:56:02 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Thu, 27 Feb 2025 09:56:02 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v7] In-Reply-To: References: Message-ID: On Wed, 26 Feb 2025 14:18:14 GMT, Ferenc Rakoczi wrote: >> By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. > > Ferenc Rakoczi has updated the pull request incrementally with two additional commits since the last revision: > > - Added more comments, mainly as suggested by Andrew Dinn > - Changed aarch64-asmtest.py as suggested by Bhavana-Kilambi Oops. sorry - cut and paste error -- the new setting should be do_arch_blob(compiler, 55000 ZGC_ONLY(+5000)) ------------- PR Comment: https://git.openjdk.org/jdk/pull/23300#issuecomment-2687440017 From aph at openjdk.org Thu Feb 27 10:19:06 2025 From: aph at openjdk.org (Andrew Haley) Date: Thu, 27 Feb 2025 10:19:06 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: <_CekdxBJviS_sZCVN62_yFx-cTF4qrIuAnqbIeUmFck=.3a6afffb-8fbe-4809-a4ca-1bc22b52a628@github.com> References: <1yB95sOajuS5ptFI0GQWLepii5JsZ9DOsje-TEFyFYs=.a325ad18-17ed-4e77-b1e3-0bad2cf55c67@github.com> <_CekdxBJviS_sZCVN62_yFx-cTF4qrIuAnq bIeUmFck=.3a6afffb-8fbe-4809-a4ca-1bc22b52a628@github.com> Message-ID: On Tue, 25 Feb 2025 15:58:18 GMT, Ferenc Rakoczi wrote: >> Aha! 
aph at Andrews-MacBook-Pro ~ % as t.s
t.s:1:19: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in range [0, 4]
sub x1, x10, x23, sxth #2
                  ^
aph at Andrews-MacBook-Pro ~ % as --version
Apple clang version 16.0.0 (clang-1600.0.26.6)
Target: arm64-apple-darwin24.3.0

> OK, so GNU as is more forgiving than Apple as...

Did my patch to aarch64-asmtest.py solve the problem?

------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23300#discussion_r1973284472

From coleenp at openjdk.org Thu Feb 27 14:28:07 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 27 Feb 2025 14:28:07 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists In-Reply-To: References: Message-ID: On Wed, 26 Feb 2025 05:42:12 GMT, David Holmes wrote:

>> I've combined two of `ObjectMonitor`'s lists, `EntryList` and `cxq`, into one list: the `entry_list`.
>>
>> This way c2 no longer has to check both `EntryList` and `cxq` in order to opt out if the "conceptual entry list" is empty, which also means that the constant question about whether it's safe to first check the `EntryList` and then `cxq` will be a thing of the past.
>>
>> In the current multi-queue design new threads were always added to the `cxq`, then `ObjectMonitor::exit` would choose a successor from the head of the `EntryList`. When the `EntryList` was empty and `cxq` was not, `ObjectMonitor::exit` would detach the singly linked `cxq` list, and add the elements to the doubly linked `EntryList`. The element that was first added to `cxq` would be at the tail of the `EntryList`. This way you ended up working through the contending threads in LIFO-chunks.
>>
>> The new list-design is as much a multi-queue as the current one. Conceptually it can be looked upon as if the old singly linked `cxq` list doesn't end with a null pointer, but instead has a link that points to the head of the doubly linked `entry_list`.
>>
>> You always add to the `entry_list` by Compare And Exchange to the head.
The most common case is that you remove from the tail (the successor is chosen in strict FIFO order). The head is volatile, but the interior is stable.
>>
>> The first contending thread that "pushes" itself onto `entry_list` will be the last thread in the list. Each newly pushed thread in `entry_list` will be linked through its next pointer, and have its prev pointer set to null; thus pushing new threads onto `entry_list` will form a singly linked list. The list is always in the right order (via the next-pointers) and is never moved to another list.
>>
>> Since we choose the successor in FIFO order, the exiting thread needs to find the tail of the `entry_list`. This is done by walking from the `entry_list` head. While walking the list we assign the prev pointers of each thread, essentially forming a doubly linked list. The tail pointer is cached in `entry_list_tail` so that we don't need to walk from the `entry_list` head each time we need to find the tail (successor).
>>
>> Performance-wise the new design seems to be equal to the old design, even though c2 generates two fewer instructions per monitor unlock operation.
>>
>> However, the complexity of the source has been reduced by removing the `TS_CXQ` state and adding functions instead of inlining `cmpxchg` here and there, and the fac...
>
> src/hotspot/share/runtime/objectMonitor.cpp line 166:
>
>> 164: // its next pointer, and have its prev pointer set to null. Thus
>> 165: // pushing six threads A-F (in that order) onto entry_list, will
>> 166: // form a singly-linked list, see 1) below.
>
> Suggestion: have diagram 1 immediately follow this text so the reader doesn't have to jump down.

I like this suggestion. I like these comments.

> src/hotspot/share/runtime/objectMonitor.cpp line 718:
>
>> 716: // if we added current to _entry_list. Once on _entry_list, current
>> 717: // stays on-queue until it acquires the lock.
>> 718: bool ObjectMonitor::try_lock_or_add_to_entry_list(JavaThread* current, ObjectWaiter* node) { > > Nit: the name suggests we do the try_lock first, when we don't. If we reverse the name we should also reverse the true/false return so that true relates to the first part of the name. See what others think. How about add_to_entry_list with a boolean parameter that tries the lock if it fails, and only have one of these functions? Although the return true if you get the lock makes it weird. bool add_to_entry_list(JavaThread* current, ObjectWaiter* node, bool or_lock) { return true if locked, false otherwise; } Maybe that makes sense. > src/hotspot/share/runtime/objectMonitor.cpp line 719: > >> 717: // stays on-queue until it acquires the lock. >> 718: bool ObjectMonitor::try_lock_or_add_to_entry_list(JavaThread* current, ObjectWaiter* node) { >> 719: node->_prev = nullptr; > > Shouldn't this already be the case? I think for the vthread case, it isn't yet(?). Maybe motivation to fix the ObjectWaiter constructor with this patch? > src/hotspot/share/runtime/objectMonitor.cpp line 2018: > >> 2016: // that in prepend-mode we invert the order of the waiters. Let's say that the >> 2017: // waitset is "ABCD" and the entry_list is "XYZ". After a notifyAll() in prepend >> 2018: // mode the waitset will be empty and the entry_list will be "DCBAXYZ". > > We don't support different ordering modes any more so we always "prepend" such that waiters are added to the entry_list in the reverse order of waiting. So given waitList -> A -> B -> C -> D, and _entry_list -> x -> y -> z we will get _entry_list -> D -> C -> B -> A -> X -> Y -> Z One of the benefits of this work is to read, understand and clean up misleading and out of date comments in this code. 
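For readers skimming the thread, the scheme restated in the PR description above (lock-free CAS push to the entry_list head, FIFO removal from a lazily discovered tail, prev pointers assigned during a single walk) can be sketched in a few dozen lines. This is an illustrative toy under simplifying assumptions, not the HotSpot code: the `Node`/`EntryList` names and the `push`/`successor`/`pop_tail` split are invented here, and only a single "owner" thread may call `successor()`/`pop_tail()` while any thread may `push()`.

```cpp
#include <atomic>
#include <cassert>

// Toy model of the combined entry_list: lock-free CAS push to the head,
// FIFO removal from a lazily discovered tail. Only the "monitor owner"
// may call successor()/pop_tail(); any thread may call push().
struct Node {
  Node* next = nullptr;
  Node* prev = nullptr;
  int id;
  explicit Node(int i) : id(i) {}
};

struct EntryList {
  std::atomic<Node*> head{nullptr};
  Node* tail = nullptr;  // cached tail, touched only by the owner

  void push(Node* n) {  // corresponds to the CAS "push" onto the head
    n->prev = nullptr;
    Node* h = head.load(std::memory_order_relaxed);
    do {
      n->next = h;
    } while (!head.compare_exchange_weak(
        h, n, std::memory_order_release, std::memory_order_relaxed));
  }

  Node* successor() {  // owner-only: find the FIFO successor (the tail)
    if (tail == nullptr) {
      Node* n = head.load(std::memory_order_acquire);
      if (n == nullptr) return nullptr;
      while (n->next != nullptr) {  // one walk, assigning prev links
        n->next->prev = n;
        n = n->next;
      }
      tail = n;
    }
    return tail;
  }

  Node* pop_tail() {  // owner-only: unlink the successor
    Node* t = successor();
    if (t == nullptr) return nullptr;
    if (t->prev != nullptr) {  // interior unlink via the prev link
      t->prev->next = nullptr;
      tail = t->prev;
      return t;
    }
    Node* expected = t;  // t may be the only node: try to empty the list
    if (head.compare_exchange_strong(expected, nullptr)) {
      tail = nullptr;
      return t;
    }
    tail = nullptr;  // new arrivals raced onto the head: re-walk,
    successor();     // which assigns t->prev, then unlink as before
    t->prev->next = nullptr;
    tail = t->prev;
    return t;
  }
};
```

The interesting case is the last branch of `pop_tail()`: the successor has no prev link yet and is no longer alone because new waiters were pushed concurrently, so the owner drops the cached tail and simply re-walks from the head, mirroring the "start walking from the entry_list head again" recovery described in the thread.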
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1973636957 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1973657207 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1973681891 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1973684370 From coleenp at openjdk.org Thu Feb 27 14:28:05 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 27 Feb 2025 14:28:05 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists In-Reply-To: References: Message-ID: On Mon, 3 Feb 2025 16:29:25 GMT, Fredrik Bredberg wrote: > I've combined two `ObjectMonitor`'s lists, `EntryList` and `cxq`, into one list. The `entry_list`. > > This way c2 no longer has to check both `EntryList` and `cxq` in order to opt out if the "conceptual entry list" is empty, which also means that the constant question about whether it's safe to first check the `EntryList` and then `cxq` will be a thing of the past. > > In the current multi-queue design new threads were always added to the `cxq`, then `ObjectMonitor::exit` would choose a successor from the head of `EntryList`. When the `EntryList` was empty and `cxq` was not, `ObjectMonitor::exit` would detach the singly linked `cxq` list and add the elements to the doubly linked `EntryList`. The element that was first added to `cxq` would be at the tail of the `EntryList`. This way you ended up working through the contending threads in LIFO-chunks. > > The new list-design is as much a multi-queue as the current. Conceptually it can be looked upon as if the old singly linked `cxq` list doesn't end with a null pointer, but instead has a link that points to the head of the doubly linked `entry_list`. > > You always add to the `entry_list` by Compare And Exchange to the head. The most common case is that you remove from the tail (the successor is chosen in strict FIFO order). The head is volatile, but the interior is stable.
> > The first contending thread that "pushes" itself onto `entry_list` will be the last thread in the list. Each newly pushed thread in `entry_list` will be linked through its next pointer, and have its prev pointer set to null; thus pushing new threads onto `entry_list` will form a singly linked list. The list is always in the right order (via the next-pointers) and is never moved to another list. > > Since we choose the successor in FIFO order, the exiting thread needs to find the tail of the `entry_list`. This is done by walking from the `entry_list` head. While walking the list we assign the prev pointers of each thread, essentially forming a doubly linked list. The tail pointer is cached in `entry_list_tail` so that we don't need to walk from the `entry_list` head each time we need to find the tail (successor). > > Performance-wise the new design seems to be equal to the old design, even though c2 generates two fewer instructions per monitor unlock operation. > > However, the complexity of the source has been reduced by removing the `TS_CXQ` state and adding functions instead of inlining `cmpxchg` here and there, and the fact that c2 no longer has to check b... This looks really good - I have some small change and improvement requests. src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 418: > 416: // have released the lock. > 417: // Refer to the comments in synchronizer.cpp for how we might encode extra > 418: // state in _succ so we can avoid fetching entry_list. There is no comment in synchronizer about this (that I can find), and it's not clear this is a good idea, so can you remove this line with this change? src/hotspot/share/runtime/objectMonitor.cpp line 701: > 699: void ObjectMonitor::add_to_entry_list(JavaThread* current, ObjectWaiter* node) { > 700: node->_prev = nullptr; > 701: node->TState = ObjectWaiter::TS_ENTER; I think you should do this in a future cleanup.
The ObjectWaiter's constructor should initialize these fields to TS_ENTER or TS_WAIT when it's created and make prev, next null (or 0xBAD?). And fix the constructor to have an initialization list instead. src/hotspot/share/runtime/objectMonitor.cpp line 735: > 733: assert(!has_successor(current), "invariant"); > 734: assert(has_owner(current), "invariant"); > 735: return true; I wonder for a future RFE we can move these asserts into TryLock. src/hotspot/share/runtime/objectMonitor.cpp line 1285: > 1283: // By convention we unlink a contending thread from _entry_list immediately > 1284: // after the thread acquires the lock in ::enter(). Equally, we could defer > 1285: // unlinking the thread until ::exit()-time. Since you're here, remove these two lines 1222-1223. I really don't think pointing out an alternate implementation that we did not choose is helpful to understanding this code. src/hotspot/share/runtime/objectMonitor.hpp line 46: > 44: class ObjectWaiter : public CHeapObj { > 45: public: > 46: enum TStates : uint8_t { TS_UNDEF, TS_READY, TS_RUN, TS_WAIT, TS_ENTER }; TS_READY looks unused. src/hotspot/share/runtime/objectMonitor.hpp line 79: > 77: void set_bad_pointers() { > 78: #ifdef ASSERT > 79: // Diagnostic hygiene ... hygiene seems like the wrong word here. Can you remove this comment? src/hotspot/share/runtime/synchronizer.cpp line 369: > 367: // We have one or more waiters. Since this is an inflated monitor > 368: // that we own, we can transfer one or more threads from the waitset > 369: // to the entry_list here and now, avoiding the slow-path. Not related to this change but I found that this quick_notify isn't quicker. ------------- Changes requested by coleenp (Reviewer). 
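A minimal sketch of the constructor cleanup suggested above: move field initialization into an initialization list so no caller has to null out the links before enqueueing. The class and field names here are simplified stand-ins of our own, not the real HotSpot `ObjectWaiter`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical waiter node with the suggested initialization list:
// links start as null and the state is passed in explicitly.
enum TStates : uint8_t { TS_RUN, TS_WAIT, TS_ENTER };

class WaiterNode {
 public:
  explicit WaiterNode(TStates initial_state)
      : _next(nullptr), _prev(nullptr), _state(initial_state) {}
  WaiterNode* _next;
  WaiterNode* _prev;
  TStates _state;
};
```

With this shape, the `node->_prev = nullptr; node->TState = ...` lines quoted in the review would become redundant at every enqueue site.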
PR Review: https://git.openjdk.org/jdk/pull/23421#pullrequestreview-2647862248 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1973630782 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1973654464 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1973664035 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1973670396 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1973678657 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1973632087 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1973634214 From fbredberg at openjdk.org Thu Feb 27 15:54:28 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Thu, 27 Feb 2025 15:54:28 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: Message-ID: > I've combined two `ObjectMonitor`'s lists, `EntryList` and `cxq`, into one list. The `entry_list`. > > This way c2 no longer has to check both `EntryList` and `cxq` in order to opt out if the "conceptual entry list" is empty, which also means that the constant question about whether it's safe to first check the `EntryList` and then `cxq` will be a thing of the past. > > In the current multi-queue design new threads were always added to the `cxq`, then `ObjectMonitor::exit` would choose a successor from the head of `EntryList`. When the `EntryList` was empty and `cxq` was not, `ObjectMonitor::exit` would detach the singly linked `cxq` list and add the elements to the doubly linked `EntryList`. The element that was first added to `cxq` would be at the tail of the `EntryList`. This way you ended up working through the contending threads in LIFO-chunks. > > The new list-design is as much a multi-queue as the current.
Conceptually it can be looked upon as if the old singly linked `cxq` list doesn't end with a null pointer, but instead has a link that points to the head of the doubly linked `entry_list`. > > You always add to the `entry_list` by Compare And Exchange to the head. The most common case is that you remove from the tail (the successor is chosen in strict FIFO order). The head is volatile, but the interior is stable. > > The first contending thread that "pushes" itself onto `entry_list` will be the last thread in the list. Each newly pushed thread in `entry_list` will be linked through its next pointer, and have its prev pointer set to null; thus pushing new threads onto `entry_list` will form a singly linked list. The list is always in the right order (via the next-pointers) and is never moved to another list. > > Since we choose the successor in FIFO order, the exiting thread needs to find the tail of the `entry_list`. This is done by walking from the `entry_list` head. While walking the list we assign the prev pointers of each thread, essentially forming a doubly linked list. The tail pointer is cached in `entry_list_tail` so that we don't need to walk from the `entry_list` head each time we need to find the tail (successor). > > Performance-wise the new design seems to be equal to the old design, even though c2 generates two fewer instructions per monitor unlock operation. > > However, the complexity of the source has been reduced by removing the `TS_CXQ` state and adding functions instead of inlining `cmpxchg` here and there, and the fact that c2 no longer has to check b... Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: Update after review by David and Coleen.
------------- Changes: - all: https://git.openjdk.org/jdk/pull/23421/files - new: https://git.openjdk.org/jdk/pull/23421/files/e1d4fac6..283c2431 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23421&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23421&range=00-01 Stats: 124 lines in 5 files changed: 28 ins; 36 del; 60 mod Patch: https://git.openjdk.org/jdk/pull/23421.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23421/head:pull/23421 PR: https://git.openjdk.org/jdk/pull/23421 From fbredberg at openjdk.org Thu Feb 27 16:00:13 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Thu, 27 Feb 2025 16:00:13 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: Message-ID: On Wed, 26 Feb 2025 05:42:25 GMT, David Holmes wrote: >> Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: >> >> Update after review by David and Coleen. > > src/hotspot/share/runtime/objectMonitor.cpp line 172: > >> 170: // from the entry_list head. While walking the list we also assign >> 171: // the prev pointers of each thread, essentially forming a doubly >> 172: // linked list, see 2) below. > > Suggestion: have diagram 2 immediately follow this text so the reader doesn't have to jump down. Fixed ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1973880640 From fbredberg at openjdk.org Thu Feb 27 16:00:14 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Thu, 27 Feb 2025 16:00:14 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: Message-ID: On Thu, 27 Feb 2025 14:09:45 GMT, Coleen Phillimore wrote: >> Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: >> >> Update after review by David and Coleen. 
> > src/hotspot/share/runtime/objectMonitor.cpp line 701: > >> 699: void ObjectMonitor::add_to_entry_list(JavaThread* current, ObjectWaiter* node) { >> 700: node->_prev = nullptr; >> 701: node->TState = ObjectWaiter::TS_ENTER; > > I think you should do this in a future cleanup. The ObjectWaiter's constructor should initialize these fields to TS_ENTER or TS_WAIT when it's created and make prev, next null (or 0xBAD?). And fix the constructor to have an initialization list instead. Sounds like a plan. > src/hotspot/share/runtime/synchronizer.cpp line 369: > >> 367: // We have one or more waiters. Since this is an inflated monitor >> 368: // that we own, we can transfer one or more threads from the waitset >> 369: // to the entry_list here and now, avoiding the slow-path. > > Not related to this change but I found that this quick_notify isn't quicker. Let's make quick_notify quicker (in another RFE). ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1973883699 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1973878764 From galder at openjdk.org Thu Feb 27 16:41:13 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Thu, 27 Feb 2025 16:41:13 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12] In-Reply-To: References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> Message-ID: On Fri, 7 Feb 2025 12:39:24 GMT, Galder Zamarreño wrote: >> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in order to help improve vectorization performance. >> >> Currently vectorization does not kick in for loops containing either of these calls because of the following error: >> >> >> VLoop::check_preconditions: failed: control flow in loop not allowed >> >> >> The control flow is due to the java implementation for these methods, e.g.
>> >> >> public static long max(long a, long b) { >> return (a >= b) ? a : b; >> } >> >> >> This patch intrinsifies the calls to replace the CmpL + Bool nodes with MaxL/MinL nodes respectively. >> By doing this, vectorization no longer finds the control flow and so it can carry out the vectorization. >> E.g. >> >> >> SuperWord::transform_loop: >> Loop: N518/N126 counted [int,int),+4 (1025 iters) main has_sfpt strip_mined >> 518 CountedLoop === 518 246 126 [[ 513 517 518 242 521 522 422 210 ]] inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] !jvms: Test::test @ bci:14 (line 21) >> >> >> Applying the same changes to `ReductionPerf` as in https://github.com/openjdk/jdk/pull/13056, we can compare the results before and after. Before the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1155 >> long max 1173 >> >> >> After the patch, on darwin/aarch64 (M1): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java >> 1 1 0 0 >> ============================== >> TEST SUCCESS >> >> long min 1042 >> long max 1042 >> >> >> This patch does not add any platform-specific backend implementations for the MaxL/MinL nodes. >> Therefore, it still relies on the macro expansion to transform those into CMoveL. >> >> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these results: >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PA... > > Galder Zamarreño has updated the pull request with a new target base due to a merge or a rebase.
The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 44 additional commits since the last revision: > > - Merge branch 'master' into topic.intrinsify-max-min-long > - Fix typo > - Renaming methods and variables and add docu on algorithms > - Fix copyright years > - Make sure it runs with cpus with either avx512 or asimd > - Test can only run with 256 bit registers or bigger > > * Remove platform dependant check > and use platform independent configuration instead. > - Fix license header > - Tests should also run on aarch64 asimd=true envs > - Added comment around the assertions > - Adjust min/max identity IR test expectations after changes > - ... and 34 more: https://git.openjdk.org/jdk/compare/92e82467...a190ae68 Also, I've started a [discussion on jmh-dev](https://mail.openjdk.org/pipermail/jmh-dev/2025-February/004094.html) to see if there's a way to minimise pollution of `Math.min(II)` compilation. As a follow to https://github.com/openjdk/jdk/pull/20098#issuecomment-2684701935 I looked at where the other `Math.min(II)` calls are coming from, and a big chunk seem related to the JMH infrastructure. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2688510211 From galder at openjdk.org Thu Feb 27 16:38:04 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Thu, 27 Feb 2025 16:38:04 GMT Subject: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12] In-Reply-To: <63F-0aHgMthexL0b2DFmkW8_QrJeo8OOlCaIyZApfpY=.4744070d-9d56-4031-8684-be14cf66d1e5@github.com> References: <6uzJCMkW_tFnyxzMbFGYfs7p3mezuBhizHl9dkR1Jro=.2da99701-7b40-492f-b15a-ef1ff7530ef7@github.com> <63F-0aHgMthexL0b2DFmkW8_QrJeo8OOlCaIyZApfpY=.4744070d-9d56-4031-8684-be14cf66d1e5@github.com> Message-ID: On Thu, 27 Feb 2025 06:54:30 GMT, Emanuel Peter wrote: > Detect "extreme" probability scalar cmove, and replace them with branching code. 
This should take care of all regressions here. This one has high priority, as it fixes the regression caused by this patch here. But it would also help to improve performance for the Integer.min/max cases, which have the same issue. +1 and the rest of the suggestions. Shall I create a JDK bug for this? > Additional performance improvement: make SuperWord recognize more cases as profitable (see Regression 1). Optional. > Additional performance improvement: extend backend capabilities for vectorization (see Regression 2 + 3). Optional. Do we need JDK bug(s) for these? If so, how many? 1 or 2? ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2688502397 From duke at openjdk.org Thu Feb 27 16:48:07 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Thu, 27 Feb 2025 16:48:07 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5] In-Reply-To: References: <1yB95sOajuS5ptFI0GQWLepii5JsZ9DOsje-TEFyFYs=.a325ad18-17ed-4e77-b1e3-0bad2cf55c67@github.com> <_CekdxBJviS_sZCVN62_yFx-cTF4qrIuAnqbIeUmFck=.3a6afffb-8fbe-4809-a4ca-1bc22b52a628@github.com> Message-ID: On Thu, 27 Feb 2025 10:15:48 GMT, Andrew Haley wrote: >> OK, so GNU as is more forgiving than Apple as... > > Did my patch to aarch64-asmtest.py solve the problem?
> > src/hotspot/share/runtime/objectMonitor.hpp line 46: > >> 44: class ObjectWaiter : public CHeapObj { >> 45: public: >> 46: enum TStates : uint8_t { TS_UNDEF, TS_READY, TS_RUN, TS_WAIT, TS_ENTER }; > > TS_READY looks unused. Edit: this could be a trivial further PR. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974015687 From coleenp at openjdk.org Thu Feb 27 17:16:04 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 27 Feb 2025 17:16:04 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: Message-ID: On Thu, 27 Feb 2025 15:54:28 GMT, Fredrik Bredberg wrote: >> I've combined two `ObjectMonitor`'s lists, `EntryList` and `cxq`, into one list. The `entry_list`. >> >> This way c2 no longer has to check both `EntryList` and `cxq` in order to opt out if the "conceptual entry list" is empty, which also means that the constant question about whether it's safe to first check the `EntryList` and then `cxq` will be a thing of the past. >> >> In the current multi-queue design new threads were always added to the `cxq`, then `ObjectMonitor::exit` would choose a successor from the head of `EntryList`. When the `EntryList` was empty and `cxq` was not, `ObjectMonitor::exit` would detach the singly linked `cxq` list and add the elements to the doubly linked `EntryList`. The element that was first added to `cxq` would be at the tail of the `EntryList`. This way you ended up working through the contending threads in LIFO-chunks. >> >> The new list-design is as much a multi-queue as the current. Conceptually it can be looked upon as if the old singly linked `cxq` list doesn't end with a null pointer, but instead has a link that points to the head of the doubly linked `entry_list`. >> >> You always add to the `entry_list` by Compare And Exchange to the head. The most common case is that you remove from the tail (the successor is chosen in strict FIFO order).
The head is volatile, but the interior is stable. >> >> The first contending thread that "pushes" itself onto `entry_list` will be the last thread in the list. Each newly pushed thread in `entry_list` will be linked through its next pointer, and have its prev pointer set to null; thus pushing new threads onto `entry_list` will form a singly linked list. The list is always in the right order (via the next-pointers) and is never moved to another list. >> >> Since we choose the successor in FIFO order, the exiting thread needs to find the tail of the `entry_list`. This is done by walking from the `entry_list` head. While walking the list we assign the prev pointers of each thread, essentially forming a doubly linked list. The tail pointer is cached in `entry_list_tail` so that we don't need to walk from the `entry_list` head each time we need to find the tail (successor). >> >> Performance-wise the new design seems to be equal to the old design, even though c2 generates two fewer instructions per monitor unlock operation. >> >> However, the complexity of the source has been reduced by removing the `TS_CXQ` state and adding functions instead of inlining `cmpxchg` here and there, and the fac... > > Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: > > Update after review by David and Coleen. This change looks great. Thank you! src/hotspot/share/runtime/objectMonitor.cpp line 219: > 217: // entry_list_tail ----------^ > 218: // > 219: // * The monitor itself protects all of the operations on the This is a nice comment and really helps understand the algorithm. src/hotspot/share/runtime/objectMonitor.cpp line 948: > 946: current->_ParkEvent->reset(); > 947: > 948: if (try_lock_or_add_to_entry_list(current, &node)) { try_lock_or_add_to_entry_list() name makes sense in this context. if (add_to_entry_list(current, &node, /*try_lock*/true)) { return; // We got the lock } Makes less sense.
I propose leaving the names and the functions for now. ------------- Marked as reviewed by coleenp (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23421#pullrequestreview-2648493876 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974006126 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974014351 From ayang at openjdk.org Thu Feb 27 18:34:14 2025 From: ayang at openjdk.org (Albert Mingkun Yang) Date: Thu, 27 Feb 2025 18:34:14 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v2] In-Reply-To: References: Message-ID: <3zmj-DeeRyPMHc32YnvfqACN0xJxLQ6jZZ7sd-Baa3w=.672912f6-e4a3-4679-b8a3-b7f6ad51589d@github.com> On Tue, 25 Feb 2025 15:13:43 GMT, Thomas Schatzl wrote: >> Hi all, >> >> please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. >> >> The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight but we would like to have this ready by JDK 25. >> >> ### Current situation >> >> With this change, G1 will reduce the post write barrier to much more resemble Parallel GC's as described in the JEP. The reason is that G1 lacks in throughput compared to Parallel/Serial GC due to larger barrier. >> >> The main reason for the current barrier is how g1 implements concurrent refinement: >> * g1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. 
>> * For correctness dirty card updates requires fine-grained synchronization between mutator and refinement threads, >> * Finally there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. >> >> These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudo code: >> >> >> // Filtering >> if (region(@x.a) == region(y)) goto done; // same region check >> if (y == null) goto done; // null value check >> if (card(@x.a) == young_card) goto done; // write to young gen check >> StoreLoad; // synchronize >> if (card(@x.a) == dirty_card) goto done; >> >> *card(@x.a) = dirty >> >> // Card tracking >> enqueue(card-address(@x.a)) into thread-local-dcq; >> if (thread-local-dcq is not full) goto done; >> >> call runtime to move thread-local-dcq into dcqs >> >> done: >> >> >> Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for parallel and serial gc. >> >> The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining. >> >> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). >> >> The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse grained based on atomically switching c... > > Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: > > * remove unnecessarily added logging src/hotspot/share/gc/g1/g1BarrierSet.hpp line 54: > 52: // them, keeping the write barrier simple. > 53: // > 54: // The refinement threads mark cards in the the current collection set specially on the "the the" typo. 
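The filtering sequence in the barrier pseudo-code quoted above can be made concrete. The sketch below is a toy of our own making, with invented card/region sizes and a plain counter standing in for the dirty card queue; it illustrates only the order of the filters, not actual G1 behaviour (the real barrier also needs a StoreLoad fence and enqueues the card address for refinement).

```cpp
#include <cassert>
#include <cstdint>

// Toy post-write barrier for "x.a = y". Addresses are plain byte
// offsets into a pretend heap; all sizes here are illustrative only.
enum CardValue : uint8_t { CLEAN = 0, DIRTY = 1, YOUNG = 2 };

constexpr int kCardShift = 9;     // 512-byte cards
constexpr int kRegionShift = 20;  // 1 MiB regions

static uint8_t g_cards[1 << 12] = {};  // covers a 2 MiB toy heap
static int g_enqueued = 0;             // stands in for the dirty card queue

void post_write_barrier(uintptr_t field_addr, uintptr_t new_value) {
  // Filtering, in the order the quoted pseudo-code lists it:
  if ((field_addr >> kRegionShift) == (new_value >> kRegionShift))
    return;                                  // same-region check
  if (new_value == 0) return;                // null value check
  uint8_t* card = &g_cards[field_addr >> kCardShift];
  if (*card == YOUNG) return;                // write-to-young-gen check
  // <-- the StoreLoad fence would sit here in the real barrier
  if (*card == DIRTY) return;                // already dirty
  *card = DIRTY;
  // Card tracking: real G1 enqueues the card address into a DCQ here
  g_enqueued++;
}
```

Walking a few stores through this sketch makes it clear why the filters matter: repeated stores to the same card, null stores, and intra-region stores all fall out before the expensive dirty-and-enqueue tail is reached.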
src/hotspot/share/gc/g1/g1CardTable.inline.hpp line 47: > 45: > 46: // Returns bits from a where mask is 0, and bits from b where mask is 1. > 47: inline size_t blend(size_t a, size_t b, size_t mask) { Can you provide some input/output examples in the doc? src/hotspot/share/gc/g1/g1CardTableClaimTable.cpp line 45: > 43: } > 44: > 45: void G1CardTableClaimTable::initialize(size_t max_reserved_regions) { Should the arg be `uint`? src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 280: > 278: assert_state(State::SweepRT); > 279: > 280: set_state_start_time(); This method is called in a loop; would that skew the state-starting time? src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 344: > 342: size_t _num_clean; > 343: size_t _num_dirty; > 344: size_t _num_to_cset; Seem never read. src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 349: > 347: > 348: bool do_heap_region(G1HeapRegion* r) override { > 349: if (!r->is_free()) { I am a bit lost on this closure; the intention seems to set unclaimed to all non-free regions, why can't this be done in one go, instead of first setting all regions to claimed (`reset_all_claims_to_claimed`), then set non-free ones unclaimed? src/hotspot/share/gc/g1/g1ConcurrentRefine.hpp line 116: > 114: > 115: // Current heap snapshot. > 116: G1CardTableClaimTable* _sweep_state; Since this is a table, I wonder if we can name it "x_table" instead of "x_state". src/hotspot/share/gc/g1/g1RemSet.cpp line 147: > 145: if (_contains[region]) { > 146: return; > 147: } Indentation seems broken. src/hotspot/share/gc/g1/g1RemSet.cpp line 830: > 828: size_t const start_idx = region_card_base_idx + claim.value(); > 829: > 830: size_t* card_cur_card = (size_t*)card_table->byte_for_index(start_idx); This var name should end with "_word", instead of "_card". 
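On the request above for input/output examples on `blend`: the documented contract ("bits from a where mask is 0, and bits from b where mask is 1") is the classic branch-free bit blend. The sketch below is our own illustration of that contract with a worked example, not necessarily the exact code in the patch.

```cpp
#include <cassert>
#include <cstddef>

// Returns bits from a where mask is 0, and bits from b where mask is 1.
// Equivalent to (a & ~mask) | (b & mask), written with one fewer
// operation as a ^ ((a ^ b) & mask).
inline size_t blend(size_t a, size_t b, size_t mask) {
  return a ^ ((a ^ b) & mask);
}

// Worked example: a = 0x00FF, b = 0xFF00, mask = 0x0F0F
//   a & ~mask = 0x00F0  (kept from a)
//   b &  mask = 0x0F00  (taken from b)
//   result    = 0x0FF0
```

The two boundary cases document the contract nicely: `mask == 0` returns `a` unchanged, and an all-ones mask returns `b` unchanged.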
src/hotspot/share/gc/g1/g1RemSet.cpp line 1252: > 1250: G1ConcurrentRefineWorkState::snapshot_heap_into(&constructed); > 1251: claim = &constructed; > 1252: } It's not super obvious to me why the "has_sweep_claims" checking needs to be on this level. Can `G1ConcurrentRefineWorkState` return a valid `G1CardTableClaimTable*` directly? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1974124792 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1971426039 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1973435950 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1974083760 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1973447654 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1973452168 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1974056492 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1973423400 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1974108760 PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1974134441 From fbredberg at openjdk.org Thu Feb 27 19:57:03 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Thu, 27 Feb 2025 19:57:03 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: Message-ID: On Wed, 26 Feb 2025 05:19:44 GMT, David Holmes wrote: >> Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: >> >> Update after review by David and Coleen. 
> > src/hotspot/share/jvmci/vmStructs_jvmci.cpp line 331: > >> 329: volatile_nonstatic_field(ObjectMonitor, _owner, int64_t) \ >> 330: volatile_nonstatic_field(ObjectMonitor, _recursions, intptr_t) \ >> 331: volatile_nonstatic_field(ObjectMonitor, _entry_list, ObjectWaiter*) \ > > Suggestion: > > volatile_nonstatic_field(ObjectMonitor, _entry_list, ObjectWaiter*) \ > > Extra space Fixed > src/hotspot/share/runtime/objectMonitor.cpp line 176: > >> 174: // Once we have formed a doubly linked list it's easy to find the >> 175: // successor, wake it up, have it remove itself, and update the >> 176: // tail pointer, as seen in 2) and 3) below. > > Suggestion: > > // tail pointer, as seen in 3) below. > > But have diagram 3 right here. Fixed > src/hotspot/share/runtime/objectMonitor.cpp line 179: > >> 177: // >> 178: // At any time new threads can add themselves to the entry_list, see >> 179: // 4) and 5). > > Diagrams 4 and 5 do not follow from what has just been described, but the use of "at any time" implies to me you intended to show them affecting the queue as we have already seen it. > > Again show the diagram you want here. Rewrote diagram. > src/hotspot/share/runtime/objectMonitor.cpp line 183: > >> 181: // If the thread that removes itself from the end of the list hasn't >> 182: // got any prev pointer, we just set the tail pointer to null, see >> 183: // 5) and 6). > > Suggestion: > > // If the thread to be removed is the only thread in the entry list: > // entry_list -> A -> null > // entry_list_tail ---^ > // we remove it and just set the tail pointer to null, > // entry_list -> null > // entry_list_tail -> null Rewrote the diagram. Wanted to show how things work when the thread that removes itself from the end of the list hasn't got any prev pointer (and it's not the only thread in the entry list).
> src/hotspot/share/runtime/objectMonitor.cpp line 187: > >> 185: // Next time we need to find the successor and the tail is null, we >> 186: // just start walking from the entry_list head again forming a new >> 187: // doubly linked list, see 6) and 7) below. > > Suggestion: > > // Next time we need to find the successor and the tail is null, > // entry_list ->I->H->G->null > // entry_list_tail ->null > // we just start walking from the entry_list head again forming a new > // doubly linked list: > // entry_list ->I<=>H<=>G->null > // entry_list_tail ----------^ Rewrote diagram. Didn't abandon the "number list" since everything else is written that way. > src/hotspot/share/runtime/objectMonitor.cpp line 189: > >> 187: // doubly linked list, see 6) and 7) below. >> 188: // >> 189: // 1) entry_list ->F->E->D->C->B->A->null > > Suggestion: > > // 1) entry_list ->F->E->D->C->B->A->null > > Right-justify the names please. I think it's more readable to have it left-justified, since entry_list and entry_list_tail both start with the same text. > src/hotspot/share/runtime/objectMonitor.cpp line 215: > >> 213: // The mutex property of the monitor itself protects the entry_list >> 214: // from concurrent interference. >> 215: // -- Only the monitor owner may detach nodes from the entry_list. > > Suggestion for this block - get rid of invariants headings and just say: > > // The monitor itself protects all of the operations on the entry_list except for the CAS of a new arrival > // to the head. Only the monitor owner can read or write the prev links (e.g. to remove itself) or update > // the tail. Fixed 
There basically is no A-B-A issue with the use of CAS here. Rewrote the comment. > src/hotspot/share/runtime/objectMonitor.cpp line 227: > >> 225: // entry_list is ABA-oblivious. >> 226: // >> 227: // * The entry_list form a queue of threads stalled trying to acquire > > Suggestion: > > // * The entry_list forms a queue of threads stalled trying to acquire Fixed > src/hotspot/share/runtime/objectMonitor.hpp line 195: > >> 193: volatile intx _recursions; // recursion count, 0 for first entry >> 194: ObjectWaiter* volatile _entry_list; // Threads blocked on entry or reentry. >> 195: // The list is actually composed of WaitNodes, > > Suggestion: > > // The list is actually composed of wait-nodes, > > Pre-existing (check for other uses) `WaitNodes` reads like a class name but it isn't. Fixed ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974244653 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974247893 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974246933 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974250054 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974251792 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974246012 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974252355 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974252954 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974253676 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974245155 From fbredberg at openjdk.org Thu Feb 27 19:57:04 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Thu, 27 Feb 2025 19:57:04 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: Message-ID: On Thu, 27 Feb 2025 13:59:38 GMT, Coleen Phillimore wrote: >> src/hotspot/share/runtime/objectMonitor.cpp line 166: >> >>> 164: // its 
next pointer, and have its prev pointer set to null. Thus >>> 165: // pushing six threads A-F (in that order) onto entry_list, will >>> 166: // form a singly-linked list, see 1) below. >> >> Suggestion: have diagram 1 immediately follow this text so the reader doesn't have to jump down. > > I like this suggestion. I like these comments. Fixed ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974247465 From fbredberg at openjdk.org Thu Feb 27 20:04:02 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Thu, 27 Feb 2025 20:04:02 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: Message-ID: On Wed, 26 Feb 2025 06:08:14 GMT, David Holmes wrote: >> Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: >> >> Update after review by David and Coleen. > > src/hotspot/share/runtime/objectMonitor.cpp line 232: > >> 230: // thread notices that the tail of the entry_list is not known, we >> 231: // convert the singly-linked entry_list into a doubly linked list by >> 232: // assigning the prev pointers and the entry_list_tail pointer. > > Didn't we essentially say all this at the beginning? This text makes more sense before the newly added "Example:", so I moved it. > src/hotspot/share/runtime/objectMonitor.cpp line 260: > >> 258: // >> 259: // * notify() or notifyAll() simply transfers threads from the WaitSet >> 260: // to either the entry_list. Subsequent exit() operations will > > Suggestion: > > // to the entry_list. Subsequent exit() operations will Fixed > src/hotspot/share/runtime/objectMonitor.cpp line 704: > >> 702: >> 703: for (;;) { >> 704: ObjectWaiter* front = Atomic::load(&_entry_list); > > In comments and code pick "head" or "front" to use to describe what _entry_list points to and use that consistently. I think "front" is much more common. A `grep -r` suggests that `head` is more common, so I changed to `head`. 
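The push described above — new arrivals CAS themselves onto the head of `_entry_list`, next pointing at the old head and prev left null — can be sketched as a small standalone model. This is illustrative only (it uses `std::atomic` rather than HotSpot's `Atomic::` wrappers, and `Node`/`push_to_entry_list` are invented names):

```cpp
#include <atomic>
#include <cassert>

// Minimal model of the lock-free "push to head". The prev pointer stays
// null here; it is only assigned later, by the monitor owner, when the
// list is walked to find the tail.
struct Node {
  Node* next = nullptr;
  Node* prev = nullptr;
};

std::atomic<Node*> entry_list_head{nullptr};

void push_to_entry_list(Node* n) {
  n->prev = nullptr;
  Node* old_head = entry_list_head.load();
  do {
    n->next = old_head;  // link to the head we observed
    // The CAS fails (and reloads old_head) if another thread got in first.
  } while (!entry_list_head.compare_exchange_weak(old_head, n));
}
```

Pushing A, then B, then C leaves the head pointing at C with `C->next == B` and `B->next == A` — the reverse arrival order shown in diagram 1), where pushing A-F yields `entry_list ->F->E->D->C->B->A->null`.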
> src/hotspot/share/runtime/objectMonitor.cpp line 705: > >> 703: for (;;) { >> 704: ObjectWaiter* front = Atomic::load(&_entry_list); >> 705: > > No need for blank line. Fixed ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974257620 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974259984 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974261995 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974260402 From fbredberg at openjdk.org Thu Feb 27 20:12:58 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Thu, 27 Feb 2025 20:12:58 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: Message-ID: On Thu, 27 Feb 2025 13:56:15 GMT, Coleen Phillimore wrote: >> Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: >> >> Update after review by David and Coleen. > > src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 418: > >> 416: // have released the lock. >> 417: // Refer to the comments in synchronizer.cpp for how we might encode extra >> 418: // state in _succ so we can avoid fetching entry_list. > > There is no comment in synchronizer about this (that I can find), and whether or not this is a good idea, can you remove this line with this change? Removed > src/hotspot/share/runtime/objectMonitor.hpp line 79: > >> 77: void set_bad_pointers() { >> 78: #ifdef ASSERT >> 79: // Diagnostic hygiene ... > > hygiene seems like the wrong word here. Can you remove this comment? 
Removed ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974271052 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974271724 From fbredberg at openjdk.org Thu Feb 27 20:12:59 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Thu, 27 Feb 2025 20:12:59 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: Message-ID: <0ALa3fouoHHnr9xwosMUd0gxQnQFwomxSmQ8_4wijcY=.acdb876b-6b94-4320-904a-f7741d54c8de@github.com> On Thu, 27 Feb 2025 14:11:21 GMT, Coleen Phillimore wrote: >> src/hotspot/share/runtime/objectMonitor.cpp line 718: >> >>> 716: // if we added current to _entry_list. Once on _entry_list, current >>> 717: // stays on-queue until it acquires the lock. >>> 718: bool ObjectMonitor::try_lock_or_add_to_entry_list(JavaThread* current, ObjectWaiter* node) { >> >> Nit: the name suggests we do the try_lock first, when we don't. If we reverse the name we should also reverse the true/false return so that true relates to the first part of the name. See what others think. > > How about add_to_entry_list with a boolean parameter that tries the lock if it fails, and only have one of these functions? Although the return true if you get the lock makes it weird. > > > bool add_to_entry_list(JavaThread* current, ObjectWaiter* node, bool or_lock) { > return true if locked, false otherwise; > } > > > Maybe that makes sense. I wasn't completely happy with naming this `try_lock_or_add_to_entry_list` for the exact reason David points out. It does NOT first `try_lock` and then if that fails `add_to_entry_list`. It does the complete opposite. It first tries to add to the entry list and if that fails, it tries to lock. So why on earth did I end up with this solution? Because I went along with how the current family of `try_enter`, `spin_enter` and `TryLockWithContentionMark` works. They all try to lock the monitor and if they succeed they return true, otherwise they return false. 
And this is exactly how my `try_lock_or_add_to_entry_list` works, except for the fact that when it returns false (because we didn't get the lock) the current thread has been added to the `entry_list`. I also think that combining the two functions into one (as Coleen suggests) just adds to the confusion, mostly because of the "weird" return value. I guess we just have to choose what kind of weirdness we can accept. I'm absolutely willing to change it if anyone has a strong opinion, or comes up with something that the majority think is better. For me joining the `TryLockWithContentionMark` etc. camp seemed like the most reasonable kind of weird. >> src/hotspot/share/runtime/objectMonitor.cpp line 719: >> >>> 717: // stays on-queue until it acquires the lock. >>> 718: bool ObjectMonitor::try_lock_or_add_to_entry_list(JavaThread* current, ObjectWaiter* node) { >>> 719: node->_prev = nullptr; >> >> Shouldn't this already be the case? > > I think for the vthread case, it isn't yet(?). Maybe motivation to fix the ObjectWaiter constructor with this patch? For the most part it is. But as Coleen points out, the vthread case might not be, and I'm not willing to risk it. >> src/hotspot/share/runtime/objectMonitor.cpp line 2018: >> >>> 2016: // that in prepend-mode we invert the order of the waiters. Let's say that the >>> 2017: // waitset is "ABCD" and the entry_list is "XYZ". After a notifyAll() in prepend >>> 2018: // mode the waitset will be empty and the entry_list will be "DCBAXYZ". >> >> We don't support different ordering modes any more so we always "prepend" such that waiters are added to the entry_list in the reverse order of waiting. So given waitList -> A -> B -> C -> D, and _entry_list -> x -> y -> z we will get _entry_list -> D -> C -> B -> A -> X -> Y -> Z > > One of the benefits of this work is to read, understand and clean up misleading and out of date comments in this code. Rewrote the comment. 
Let the waitset remain as a string "ABCD" because it would be too messy to try to depict it as a circular doubly linked list. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974266558 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974267473 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974270597 From fbredberg at openjdk.org Thu Feb 27 20:13:01 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Thu, 27 Feb 2025 20:13:01 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: Message-ID: On Wed, 26 Feb 2025 06:19:38 GMT, David Holmes wrote: >> Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: >> >> Update after review by David and Coleen. > > src/hotspot/share/runtime/objectMonitor.cpp line 724: > >> 722: for (;;) { >> 723: ObjectWaiter* front = Atomic::load(&_entry_list); >> 724: > > No need for blank line. Fixed > src/hotspot/share/runtime/objectMonitor.cpp line 731: > >> 729: >> 730: // Interference - the CAS failed because _entry_list changed. Just retry. >> 731: // As an optional optimization we retry the lock. > > Suggestion: > > // Interference - the CAS failed because _entry_list changed. Before > // retrying the CAS retry taking the lock as it may now be free. Fixed > src/hotspot/share/runtime/objectMonitor.cpp line 812: > >> 810: guarantee(_entry_list == nullptr, >> 811: "must be no entering threads: entry_list=" INTPTR_FORMAT, >> 812: p2i(_entry_list)); > > Mustn't re-read _entry_list in the p2i as it may have changed from the value that is causing the guarantee to fail. The old guarantees were buggy in this regard - a temp is needed. 
Fixed > src/hotspot/share/runtime/objectMonitor.cpp line 1299: > >> 1297: assert(_entry_list_tail == nullptr || _entry_list_tail == currentNode, "invariant"); >> 1298: >> 1299: ObjectWaiter* v = Atomic::load(&_entry_list); > > Nit: use `w` to be consistent with similar code. The original used `w` for EntryList and `v` for cxq IIRC. Fixed ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974268658 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974268941 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974267878 PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974269555 From fbredberg at openjdk.org Thu Feb 27 20:19:01 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Thu, 27 Feb 2025 20:19:01 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: Message-ID: On Thu, 27 Feb 2025 14:15:15 GMT, Coleen Phillimore wrote: >> Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: >> >> Update after review by David and Coleen. > > src/hotspot/share/runtime/objectMonitor.cpp line 735: > >> 733: assert(!has_successor(current), "invariant"); >> 734: assert(has_owner(current), "invariant"); >> 735: return true; > > I wonder for a future RFE we can move these asserts into TryLock. Good idea! 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974277231 From fbredberg at openjdk.org Thu Feb 27 20:19:02 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Thu, 27 Feb 2025 20:19:02 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: Message-ID: On Thu, 27 Feb 2025 17:12:40 GMT, Coleen Phillimore wrote: >> src/hotspot/share/runtime/objectMonitor.hpp line 46: >> >>> 44: class ObjectWaiter : public CHeapObj { >>> 45: public: >>> 46: enum TStates : uint8_t { TS_UNDEF, TS_READY, TS_RUN, TS_WAIT, TS_ENTER }; >> >> TS_READY looks unused. > > Edit: this could be a trivial further PR. And so does `TS_UNDEF`, but the enum value for `TS_UNDEF` will be zero and maybe there is some hidden "check for uninitialized `TStates` code" somewhere that stops working... A grep also finds: `src/hotspot/share/prims/jvmtiRawMonitor.hpp: enum TStates { TS_READY, TS_RUN, TS_WAIT, TS_ENTER }; ` So, since this is not really in the core part of this PR, I'd like to postpone that change to a later cleanup RFE. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974278590 From fbredberg at openjdk.org Thu Feb 27 20:40:56 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Thu, 27 Feb 2025 20:40:56 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: Message-ID: <_XnhdwtuB6AhiTL4TYmV4yqIy_WwQEeASn2b2zL9-V0=.05ec2994-8599-4f76-871d-a9e2bbe8afa2@github.com> On Thu, 27 Feb 2025 15:54:28 GMT, Fredrik Bredberg wrote: >> I've combined two `ObjectMonitor`'s lists, `EntryList` and `cxq`, into one list. The `entry_list`. >> >> This way c2 no longer has to check both `EntryList` and `cxq` in order to opt out if the "conceptual entry list" is empty, which also means that the constant question about if it's safe to first check the `EntryList` and then `cxq` will be a thing of the past. 
>> >> In the current multi-queue design new threads were always added to the `cxq`, then `ObjectMonitor::exit` would choose a successor from the head of `EntryList`. When the `EntryList` was empty and `cxq` was not, `ObjectMonitor::exit` would detach the singly linked `cxq` list, and add the elements to the doubly linked `EntryList`. The element that was first added to `cxq` would be at the tail of the `EntryList`. This way you ended up working through the contending threads in LIFO-chunks. >> >> The new list-design is as much a multi-queue as the current. Conceptually it can be looked upon as if the old singly linked `cxq` list doesn't end with a null pointer, but instead has a link that points to the head of the doubly linked `entry_list`. >> >> You always add to the `entry_list` by Compare And Exchange to the head. The most common case is that you remove from the tail (the successor is chosen in strict FIFO order). The head is volatile, but the interior is stable. >> >> The first contending thread that "pushes" itself onto `entry_list`, will be the last thread in the list. Each newly pushed thread in `entry_list` will be linked through its next pointer, and have its prev pointer set to null, thus pushing new threads onto `entry_list` will form a singly linked list. The list is always in the right order (via the next-pointers) and is never moved to another list. >> >> Since we choose the successor in FIFO order, the exiting thread needs to find the tail of the `entry_list`. This is done by walking from the `entry_list` head. While walking the list we assign the prev pointers of each thread, essentially forming a doubly linked list. The tail pointer is cached in `entry_list_tail` so that we don't need to walk from the `entry_list` head each time we need to find the tail (successor). >> >> Performance wise the new design seems to be equal to the old design, even though c2 generates two fewer instructions per monitor unlock operation. 
>> However the complexity of the source has been reduced by removing the `TS_CXQ` state and adding functions instead of inlining `cmpxchg` here and there, and the fac... > > Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: > > Update after review by David and Coleen. I've used QEMU to smoke test this PR on ppc64le, riscv64 and s390x, but it would be nice if @TheRealMDoerr, @RealFYang and @offamitkumar could check if it runs okay on real hardware as well. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23421#issuecomment-2689061860 From fbredberg at openjdk.org Thu Feb 27 20:53:56 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Thu, 27 Feb 2025 20:53:56 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: Message-ID: On Thu, 27 Feb 2025 15:54:28 GMT, Fredrik Bredberg wrote: >> I've combined two `ObjectMonitor`'s lists, `EntryList` and `cxq`, into one list. The `entry_list`. >> >> This way c2 no longer has to check both `EntryList` and `cxq` in order to opt out if the "conceptual entry list" is empty, which also means that the constant question about if it's safe to first check the `EntryList` and then `cxq` will be a thing of the past. >> >> In the current multi-queue design new threads were always added to the `cxq`, then `ObjectMonitor::exit` would choose a successor from the head of `EntryList`. When the `EntryList` was empty and `cxq` was not, `ObjectMonitor::exit` would detach the singly linked `cxq` list, and add the elements to the doubly linked `EntryList`. The element that was first added to `cxq` would be at the tail of the `EntryList`. This way you ended up working through the contending threads in LIFO-chunks. >> >> The new list-design is as much a multi-queue as the current. 
Conceptually it can be looked upon as if the old singly linked `cxq` list doesn't end with a null pointer, but instead has a link that points to the head of the doubly linked `entry_list`. >> >> You always add to the `entry_list` by Compare And Exchange to the head. The most common case is that you remove from the tail (the successor is chosen in strict FIFO order). The head is volatile, but the interior is stable. >> >> The first contending thread that "pushes" itself onto `entry_list`, will be the last thread in the list. Each newly pushed thread in `entry_list` will be linked through its next pointer, and have its prev pointer set to null, thus pushing new threads onto `entry_list` will form a singly linked list. The list is always in the right order (via the next-pointers) and is never moved to another list. >> >> Since we choose the successor in FIFO order, the exiting thread needs to find the tail of the `entry_list`. This is done by walking from the `entry_list` head. While walking the list we assign the prev pointers of each thread, essentially forming a doubly linked list. The tail pointer is cached in `entry_list_tail` so that we don't need to walk from the `entry_list` head each time we need to find the tail (successor). >> >> Performance wise the new design seems to be equal to the old design, even though c2 generates two fewer instructions per monitor unlock operation. >> >> However the complexity of the source has been reduced by removing the `TS_CXQ` state and adding functions instead of inlining `cmpxchg` here and there, and the fac... > > Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: > > Update after review by David and Coleen. @pchilano Since I have removed the `cxq` list @dholmes-ora suggested that I should also rename `_vthread_cxq_head`. Thereby removing the term "cxq" altogether. I chose to rename `_vthread_cxq_head` to `_vthread_list_head`. Hope that is okay. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23421#issuecomment-2689083393 From fbredberg at openjdk.org Thu Feb 27 20:59:55 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Thu, 27 Feb 2025 20:59:55 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: Message-ID: <1S7kUz3GfEDitlf6dU4nF5Tl1X7UNBhMDdWCPE9Apos=.a1e7abc2-065d-4fe3-95b2-d0d5ca884dac@github.com> On Mon, 10 Feb 2025 12:51:43 GMT, Fredrik Bredberg wrote: >> src/hotspot/share/jvmci/vmStructs_jvmci.cpp line 332: >> >>> 330: volatile_nonstatic_field(ObjectMonitor, _owner, int64_t) \ >>> 331: volatile_nonstatic_field(ObjectMonitor, _recursions, intptr_t) \ >>> 332: volatile_nonstatic_field(ObjectMonitor, _EntryListTail, ObjectWaiter*) \ >> >> You may need to coordinate with @mur47x111 to see what graal does with this field. I suspect the graal code also checks both cxq and EntryList in the unlock fast path and now only needs to check _EntryList. In which case we don't need to export EntryListTail. > > Thanks for the heads up @coleenp. I was planning on contacting the Graal team when this PR gets closer to getting integrated. I'll delete the `_EntryListTail` export, and make sure to ask for a review from @mur47x111 when that time comes. 
They seem to have everything under control: [[JDK-8349711] Adapt JDK-8343840: Rewrite the ObjectMonitor lists](https://github.com/oracle/graal/pull/10757) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23421#discussion_r1974327790 From fyang at openjdk.org Fri Feb 28 05:23:54 2025 From: fyang at openjdk.org (Fei Yang) Date: Fri, 28 Feb 2025 05:23:54 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: <_XnhdwtuB6AhiTL4TYmV4yqIy_WwQEeASn2b2zL9-V0=.05ec2994-8599-4f76-871d-a9e2bbe8afa2@github.com> References: <_XnhdwtuB6AhiTL4TYmV4yqIy_WwQEeASn2b2zL9-V0=.05ec2994-8599-4f76-871d-a9e2bbe8afa2@github.com> Message-ID: On Thu, 27 Feb 2025 20:38:32 GMT, Fredrik Bredberg wrote: > I've used QEMU to smoke test this PR on ppc64le, riscv64 and s390x, But it would be nice if @TheRealMDoerr, @RealFYang and @offamitkumar could check if it runs okay on real hardware as well. FYI: hs:tier1 - hs:tier3 test good on linux-riscv64 platform. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23421#issuecomment-2689751810 From duke at openjdk.org Fri Feb 28 06:22:09 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Fri, 28 Feb 2025 06:22:09 GMT Subject: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v8] In-Reply-To: References: Message-ID: > By using the aarch64 vector registers the speed of the computation of the ML-DSA algorithms (key generation, document signing, signature verification) can be approximately doubled. Ferenc Rakoczi has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 13 commits: - Merged master. - Added more comments, mainly as suggested by Andrew Dinn - Changed aarch64-asmtest.py as suggested by Bhavana-Kilambi - Accepting suggested change from Andrew Dinn - Added comments suggested by Andrew Dinn - Fixed copyright years - renaming a couple of functions - Adding comments + some code reorganization - removed debugging code - merging master - ... 
and 3 more: https://git.openjdk.org/jdk/compare/ab4b0ef9...d82dfb2f ------------- Changes: https://git.openjdk.org/jdk/pull/23300/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23300&range=07 Stats: 2611 lines in 22 files changed: 2030 ins; 92 del; 489 mod Patch: https://git.openjdk.org/jdk/pull/23300.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23300/head:pull/23300 PR: https://git.openjdk.org/jdk/pull/23300 From dholmes at openjdk.org Fri Feb 28 07:02:55 2025 From: dholmes at openjdk.org (David Holmes) Date: Fri, 28 Feb 2025 07:02:55 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: Message-ID: On Thu, 27 Feb 2025 15:54:28 GMT, Fredrik Bredberg wrote: >> I've combined two `ObjectMonitor`'s lists, `EntryList` and `cxq`, into one list. The `entry_list`. >> >> This way c2 no longer has to check both `EntryList` and `cxq` in order to opt out if the "conceptual entry list" is empty, which also means that the constant question about if it's safe to first check the `EntryList` and then `cxq` will be a thing of the past. >> >> In the current multi-queue design new threads were always added to the `cxq`, then `ObjectMonitor::exit` would choose a successor from the head of `EntryList`. When the `EntryList` was empty and `cxq` was not, `ObjectMonitor::exit` would detach the singly linked `cxq` list, and add the elements to the doubly linked `EntryList`. The element that was first added to `cxq` would be at the tail of the `EntryList`. This way you ended up working through the contending threads in LIFO-chunks. >> >> The new list-design is as much a multi-queue as the current. Conceptually it can be looked upon as if the old singly linked `cxq` list doesn't end with a null pointer, but instead has a link that points to the head of the doubly linked `entry_list`. >> >> You always add to the `entry_list` by Compare And Exchange to the head. 
The most common case is that you remove from the tail (the successor is chosen in strict FIFO order). The head is volatile, but the interior is stable. >> >> The first contending thread that "pushes" itself onto `entry_list`, will be the last thread in the list. Each newly pushed thread in `entry_list` will be linked through its next pointer, and have its prev pointer set to null, thus pushing new threads onto `entry_list` will form a singly linked list. The list is always in the right order (via the next-pointers) and is never moved to another list. >> >> Since we choose the successor in FIFO order, the exiting thread needs to find the tail of the `entry_list`. This is done by walking from the `entry_list` head. While walking the list we assign the prev pointers of each thread, essentially forming a doubly linked list. The tail pointer is cached in `entry_list_tail` so that we don't need to walk from the `entry_list` head each time we need to find the tail (successor). >> >> Performance wise the new design seems to be equal to the old design, even though c2 generates two fewer instructions per monitor unlock operation. >> >> However the complexity of the source has been reduced by removing the `TS_CXQ` state and adding functions instead of inlining `cmpxchg` here and there, and the fac... > > Fredrik Bredberg has updated the pull request incrementally with one additional commit since the last revision: > > Update after review by David and Coleen. Okay that's good enough for me. :) Thanks ------------- Marked as reviewed by dholmes (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/23421#pullrequestreview-2649910490 From amitkumar at openjdk.org Fri Feb 28 07:02:56 2025 From: amitkumar at openjdk.org (Amit Kumar) Date: Fri, 28 Feb 2025 07:02:56 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: <_XnhdwtuB6AhiTL4TYmV4yqIy_WwQEeASn2b2zL9-V0=.05ec2994-8599-4f76-871d-a9e2bbe8afa2@github.com> Message-ID: On Fri, 28 Feb 2025 05:21:34 GMT, Fei Yang wrote: > I've used QEMU to smoke test this PR on ppc64le, riscv64 and s390x, But it would be nice if @TheRealMDoerr, @RealFYang and @offamitkumar could check if it runs okay on real hardware as well. Tier1 test passed on s390x. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23421#issuecomment-2689887509 From duke at openjdk.org Fri Feb 28 09:46:32 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Fri, 28 Feb 2025 09:46:32 GMT Subject: RFR: 8349721: Add aarch64 intrinsics for ML-KEM [v2] In-Reply-To: References: Message-ID: > By using the aarch64 vector registers the speed of the computation of the ML-KEM algorithms (key generation, encapsulation, decapsulation) can be approximately doubled. Ferenc Rakoczi has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains three commits: - Merged master - removing trailing spaces - kyber aarch64 intrinsics ------------- Changes: https://git.openjdk.org/jdk/pull/23663/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23663&range=01 Stats: 2885 lines in 20 files changed: 2774 ins; 84 del; 27 mod Patch: https://git.openjdk.org/jdk/pull/23663.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23663/head:pull/23663 PR: https://git.openjdk.org/jdk/pull/23663 From duke at openjdk.org Fri Feb 28 10:15:09 2025 From: duke at openjdk.org (Ferenc Rakoczi) Date: Fri, 28 Feb 2025 10:15:09 GMT Subject: RFR: 8349721: Add aarch64 intrinsics for ML-KEM [v3] In-Reply-To: References: Message-ID: > By using the aarch64 vector registers the speed of the computation of the ML-KEM algorithms (key generation, encapsulation, decapsulation) can be approximately doubled. Ferenc Rakoczi has updated the pull request incrementally with one additional commit since the last revision: A little cleanup ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23663/files - new: https://git.openjdk.org/jdk/pull/23663/files/ff0f8430..4adc5cf2 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23663&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23663&range=01-02 Stats: 24 lines in 3 files changed: 0 ins; 23 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23663.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23663/head:pull/23663 PR: https://git.openjdk.org/jdk/pull/23663 From tschatzl at openjdk.org Fri Feb 28 10:35:03 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Fri, 28 Feb 2025 10:35:03 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v2] In-Reply-To: <3zmj-DeeRyPMHc32YnvfqACN0xJxLQ6jZZ7sd-Baa3w=.672912f6-e4a3-4679-b8a3-b7f6ad51589d@github.com> References: <3zmj-DeeRyPMHc32YnvfqACN0xJxLQ6jZZ7sd-Baa3w=.672912f6-e4a3-4679-b8a3-b7f6ad51589d@github.com> 
Message-ID: On Thu, 27 Feb 2025 18:24:15 GMT, Albert Mingkun Yang wrote: >> Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: >> >> * remove unnecessarily added logging > > src/hotspot/share/gc/g1/g1BarrierSet.hpp line 54: > >> 52: // them, keeping the write barrier simple. >> 53: // >> 54: // The refinement threads mark cards in the the current collection set specially on the > > "the the" typo. I fixed one more occurrence in files changed in this CR. There are about 10 more of these duplications in our code; I will fix them separately. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1975186407 From mdoerr at openjdk.org Fri Feb 28 10:50:00 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Fri, 28 Feb 2025 10:50:00 GMT Subject: RFR: 8343840: Rewrite the ObjectMonitor lists [v2] In-Reply-To: References: <_XnhdwtuB6AhiTL4TYmV4yqIy_WwQEeASn2b2zL9-V0=.05ec2994-8599-4f76-871d-a9e2bbe8afa2@github.com> Message-ID: On Fri, 28 Feb 2025 07:00:40 GMT, Amit Kumar wrote: > I've used QEMU to smoke test this PR on ppc64le, riscv64 and s390x, but it would be nice if @TheRealMDoerr, @RealFYang and @offamitkumar could check if it runs okay on real hardware as well. The PPC64 code looks correct and some quick tests have passed. I'll run larger test suites over the weekend.
------------- PR Comment: https://git.openjdk.org/jdk/pull/23421#issuecomment-2690327204 From tschatzl at openjdk.org Fri Feb 28 11:25:53 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Fri, 28 Feb 2025 11:25:53 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v2] In-Reply-To: <3zmj-DeeRyPMHc32YnvfqACN0xJxLQ6jZZ7sd-Baa3w=.672912f6-e4a3-4679-b8a3-b7f6ad51589d@github.com> References: <3zmj-DeeRyPMHc32YnvfqACN0xJxLQ6jZZ7sd-Baa3w=.672912f6-e4a3-4679-b8a3-b7f6ad51589d@github.com> Message-ID: <9tS5E1tteGutSNX7rZh5WYLdZoF7Vgl_4_pjuAdT4WU=.c8c73c45-7abb-48a9-b623-769d3c1679ca@github.com> On Thu, 27 Feb 2025 12:07:29 GMT, Albert Mingkun Yang wrote: >> Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: >> >> * remove unnecessarily added logging > > src/hotspot/share/gc/g1/g1ConcurrentRefine.cpp line 349: > >> 347: >> 348: bool do_heap_region(G1HeapRegion* r) override { >> 349: if (!r->is_free()) { > > I am a bit lost on this closure; the intention seems to set unclaimed to all non-free regions, why can't this be done in one go, instead of first setting all regions to claimed (`reset_all_claims_to_claimed`), then set non-free ones unclaimed? `do_heap_region()` only visits committed regions in this case. I wanted to avoid the additional check in the iteration code. If you still think it is more clear to filter those out later, please tell me. I'll add a comment for now. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1975250646 From tschatzl at openjdk.org Fri Feb 28 12:14:01 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Fri, 28 Feb 2025 12:14:01 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v2] In-Reply-To: <3zmj-DeeRyPMHc32YnvfqACN0xJxLQ6jZZ7sd-Baa3w=.672912f6-e4a3-4679-b8a3-b7f6ad51589d@github.com> References: <3zmj-DeeRyPMHc32YnvfqACN0xJxLQ6jZZ7sd-Baa3w=.672912f6-e4a3-4679-b8a3-b7f6ad51589d@github.com> Message-ID: <87L5pcyGAgyDsXTwlSdAFLyIAOcUl1ZdYXK-nwzLrUQ=.c3db7522-b3e6-46e0-b268-e457c3d2bdc2@github.com> On Thu, 27 Feb 2025 18:31:16 GMT, Albert Mingkun Yang wrote: >> Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: >> >> * remove unnecessarily added logging > > src/hotspot/share/gc/g1/g1RemSet.cpp line 1252: > >> 1250: G1ConcurrentRefineWorkState::snapshot_heap_into(&constructed); >> 1251: claim = &constructed; >> 1252: } > > It's not super obvious to me why the "has_sweep_claims" checking needs to be on this level. Can `G1ConcurrentRefineWorkState` return a valid `G1CardTableClaimTable*` directly? I agree. I remember having similar thoughts as well, but then did not do anything about this. Will fix. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23739#discussion_r1975311607 From tschatzl at openjdk.org Fri Feb 28 13:43:24 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Fri, 28 Feb 2025 13:43:24 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v3] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier. 
> > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight, but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1's post-write barrier will be reduced to much more closely resemble Parallel GC's, as described in the JEP. The reason is that G1 lags behind Parallel/Serial GC in throughput due to its larger barrier. > > The main reason for the current barrier is how G1 implements concurrent refinement: > * G1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness, dirty card updates require fine-grained synchronization between mutator and refinement threads. > * Finally, there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudocode: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for Parallel and Serial GC. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining.
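[For illustration, the filtering steps of the pseudocode above can be written as a small, self-contained C sketch. The 512-byte card size matches HotSpot's default; the 1 MiB region size, the card marker values, the table size, and all function names are illustrative assumptions, and the dirty card queue enqueue at the end is elided:]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define CARD_SHIFT   9    /* 512-byte cards (HotSpot default) */
#define REGION_SHIFT 20   /* 1 MiB heap regions, assumed for illustration */
#define NUM_CARDS    1024

enum { CARD_DIRTY = 0, CARD_YOUNG = 2, CARD_CLEAN = 0xff }; /* illustrative */

static uint8_t card_table[NUM_CARDS];

/* Sketch of the current G1 post-write barrier for the assignment x.a = y. */
static void g1_post_write_barrier(uintptr_t field_addr, uintptr_t new_val) {
    /* Filtering */
    if ((field_addr >> REGION_SHIFT) == (new_val >> REGION_SHIFT))
        return;                                   /* same region check */
    if (new_val == 0)
        return;                                   /* null value check */
    uint8_t *card = &card_table[(field_addr >> CARD_SHIFT) % NUM_CARDS];
    if (*card == CARD_YOUNG)
        return;                                   /* write to young gen check */
    atomic_thread_fence(memory_order_seq_cst);    /* StoreLoad; synchronize */
    if (*card == CARD_DIRTY)
        return;                                   /* card already dirty */
    *card = CARD_DIRTY;
    /* Card tracking: enqueue card address into the thread-local dcq (elided) */
}
```

Even with the enqueue elided, the sketch makes the cost visible: three conditional branches, a full fence, and a fourth branch precede the single store that Parallel/Serial GC would execute on its own.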
> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse-grained synchronization based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request incrementally with two additional commits since the last revision: - * ayang review 1 (ctd) * split up sweep-rt state into "start" (to be called once) and "step" (to be called repeatedly) phases * move building the snapshot out of g1remset - * ayang review 1 * use uint for number of reserved regions consistently * rename *sweep_state to *sweep_table * improved comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/9ef9c5f4..7d361fc1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=01-02 Stats: 108 lines in 8 files changed: 40 ins; 24 del; 44 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739 From tschatzl at openjdk.org Fri Feb 28 17:52:56 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Fri, 28 Feb 2025 17:52:56 GMT Subject: RFR: 8342382: Implementation of JEP G1: Improve Application Throughput with a More Efficient Write-Barrier [v4] In-Reply-To: References: Message-ID: > Hi all, > > please review this change that implements (currently Draft) JEP: G1: Improve Application Throughput with a More Efficient Write-Barrier.
> > The reason for posting this early is that this is a large change, and the JEP process is already taking very long with no end in sight, but we would like to have this ready by JDK 25. > > ### Current situation > > With this change, G1's post-write barrier will be reduced to much more closely resemble Parallel GC's, as described in the JEP. The reason is that G1 lags behind Parallel/Serial GC in throughput due to its larger barrier. > > The main reason for the current barrier is how G1 implements concurrent refinement: > * G1 tracks dirtied cards using sets (dirty card queue set - dcqs) of buffers (dirty card queues - dcq) containing the location of dirtied cards. Refinement threads pick up their contents to re-refine. The barrier needs to enqueue card locations. > * For correctness, dirty card updates require fine-grained synchronization between mutator and refinement threads. > * Finally, there is generic code to avoid dirtying cards altogether (filters), to avoid executing the synchronization and the enqueuing as much as possible. > > These tasks require the current barrier to look as follows for an assignment `x.a = y` in pseudocode: > > > // Filtering > if (region(@x.a) == region(y)) goto done; // same region check > if (y == null) goto done; // null value check > if (card(@x.a) == young_card) goto done; // write to young gen check > StoreLoad; // synchronize > if (card(@x.a) == dirty_card) goto done; > > *card(@x.a) = dirty > > // Card tracking > enqueue(card-address(@x.a)) into thread-local-dcq; > if (thread-local-dcq is not full) goto done; > > call runtime to move thread-local-dcq into dcqs > > done: > > > Overall this post-write barrier alone is in the range of 40-50 total instructions, compared to three or four(!) for Parallel and Serial GC. > > The large size of the inlined barrier not only has a large code footprint, but also prevents some compiler optimizations like loop unrolling or inlining.
> There are several papers showing that this barrier alone can decrease throughput by 10-20% ([Yang12](https://dl.acm.org/doi/10.1145/2426642.2259004)), which is corroborated by some benchmarks (see links). > > The main idea for this change is to not use fine-grained synchronization between refinement and mutator threads, but coarse-grained synchronization based on atomically switching card tables. Mutators only work on the "primary" card table, refinement threads on a se... Thomas Schatzl has updated the pull request incrementally with one additional commit since the last revision: * fix assert ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23739/files - new: https://git.openjdk.org/jdk/pull/23739/files/7d361fc1..d87935a0 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23739&range=02-03 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23739.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23739/head:pull/23739 PR: https://git.openjdk.org/jdk/pull/23739
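[The coarse-grained scheme described in the quoted text (which is truncated in the archive) can be illustrated with a minimal C sketch: mutators dirty cards through a pointer to the current "primary" table, and a refinement pass atomically swaps in the other table and then sweeps the retired one without any per-card synchronization with mutators. All names, the card values, and the assumption of a single refinement control thread performing the swap are illustrative, not the actual HotSpot implementation:]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_CARDS 1024
enum { CARD_CLEAN = 0, CARD_DIRTY = 1 };   /* illustrative values */

static uint8_t table_a[NUM_CARDS], table_b[NUM_CARDS];

/* Mutators always dirty cards through this pointer: no StoreLoad fence,
 * no per-card synchronization, no dirty card queues. */
static _Atomic(uint8_t *) primary_table = table_a;

static void mutator_dirty_card(size_t card_index) {
    uint8_t *t = atomic_load_explicit(&primary_table, memory_order_acquire);
    t[card_index] = CARD_DIRTY;
}

/* Called by a single refinement control thread (assumed): publish the other
 * table as primary, then sweep the retired table at leisure.
 * Returns the number of cards refined. */
static size_t refinement_sweep(void) {
    uint8_t *old = atomic_load_explicit(&primary_table, memory_order_relaxed);
    uint8_t *fresh = (old == table_a) ? table_b : table_a;
    atomic_store_explicit(&primary_table, fresh, memory_order_release);
    size_t refined = 0;
    for (size_t i = 0; i < NUM_CARDS; i++) {
        if (old[i] == CARD_DIRTY) {
            old[i] = CARD_CLEAN;   /* "re-refine" the card, then clean it */
            refined++;
        }
    }
    return refined;
}
```

The point of the sketch is that the synchronization cost is paid once per table swap rather than once per dirtied card, which is what lets the mutator-side barrier shrink toward Parallel GC's single store.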