From sviswanathan at openjdk.org Thu Aug 1 00:01:37 2024 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Thu, 1 Aug 2024 00:01:37 GMT Subject: RFR: 8337062: x86_64: Unordered add/mul reduction support for vector api [v4] In-Reply-To: References: Message-ID: <2T9i21zOQfC_9sVWWga1HR1lYgJNxBEhxHsEvixKe9U=.061a8435-44a8-4211-a05d-42e744fb10ab@github.com> On Wed, 31 Jul 2024 08:35:37 GMT, Christian Hagedorn wrote: >> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: >> >> Jatin review comment resolution > > Testing with tier1-4 + hs-precheckin-comp + hs-comp-stress passed. Let's wait for @vnkozlov for possible review comments. Thanks a lot @chhagedorn for testing. I will give it another day before integrating. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20306#issuecomment-2261679770 From svkamath at openjdk.org Thu Aug 1 05:54:48 2024 From: svkamath at openjdk.org (Smita Kamath) Date: Thu, 1 Aug 2024 05:54:48 GMT Subject: RFR: 8337632: AES-GCM Algorithm optimization for x86_64 Message-ID: Hi, I want to submit an AES-GCM algorithm optimization. This implementation uses AVX512/VAES instructions. Additionally, it reduces PARALLEL_LEN from 7680 to 512 bytes. The performance numbers are below. Kindly review the code. Thank you.
Benchmark | Datasize | BaseJDK (ops/s) | Patch (ops/s) | %Gain
-- | -- | -- | -- | --
full.AESGCMBench.decrypt | 512 | 2928259.197 | 3269964.387 | 11.67
full.AESGCMBench.decrypt | 1024 | 2494254.611 | 3010987.731 | 20.72
full.AESGCMBench.decrypt | 1500 | 1883453.546 | 1934915.846 | 2.73
full.AESGCMBench.decrypt | 2048 | 1825780.711 | 2452861.368 | 34.34
full.AESGCMBench.decrypt | 4096 | 1275108.345 | 1806329.066 | 41.66
full.AESGCMBench.decrypt | 8192 | 1033936.634 | 1196836.052 | 15.75
full.AESGCMBench.decrypt | 16384 | 681494.768 | 711630.498 | 4.42
full.AESGCMBench.decrypt | 32768 | 385026.017 | 395043.193 | 2.60
full.AESGCMBench.decrypt | 65536 | 207373.924 | 214723.588 | 3.54
full.AESGCMBench.encrypt | 512 | 2658008.476 | 2882496.94 | 8.45
full.AESGCMBench.encrypt | 1024 | 2283709.63 | 2589534.403 | 13.39
full.AESGCMBench.encrypt | 1500 | 1794993.519 | 1817669.531 | 1.26
full.AESGCMBench.encrypt | 2048 | 1745532.435 | 2191097.29 | 25.52
full.AESGCMBench.encrypt | 4096 | 1203301.174 | 1649593.953 | 37.08
full.AESGCMBench.encrypt | 8192 | 985174.988 | 1132407.54 | 14.94
full.AESGCMBench.encrypt | 16384 | 658980.441 | 684765.771 | 3.91
full.AESGCMBench.encrypt | 32768 | 373543.798 | 391518.837 | 4.81
full.AESGCMBench.encrypt | 65536 | 202532.315 | 205084.833 | 1.26

------------- Commit messages: - Merge branch 'master' of https://git.openjdk.org/jdk into avx512-small-sizes - Removed trailing space - clean up code - Zero out registers and fix issue - Update formatting - Merge master - Code cleanup - Removed shuffle - Updated htbl gen code and other parts - updated htbl code - ...
and 1 more: https://git.openjdk.org/jdk/compare/65646b5f...24c9c792 Changes: https://git.openjdk.org/jdk/pull/17515/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=17515&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8337632 Stats: 966 lines in 6 files changed: 493 ins; 217 del; 256 mod Patch: https://git.openjdk.org/jdk/pull/17515.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17515/head:pull/17515 PR: https://git.openjdk.org/jdk/pull/17515 From sviswanathan at openjdk.org Thu Aug 1 23:05:40 2024 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Thu, 1 Aug 2024 23:05:40 GMT Subject: Integrated: 8337062: x86_64: Unordered add/mul reduction support for vector api In-Reply-To: References: Message-ID: On Wed, 24 Jul 2024 01:31:56 GMT, Sandhya Viswanathan wrote: > Vector API doesn't define an order on reduction. The requires_strict_order flag was recently added as part of [JDK-8320725](https://bugs.openjdk.org/browse/JDK-8320725) to identify if a reduction should be ordered or unordered. This flag is used to implement efficient vector api unordered reduction for floating point add/mul on x86_64. > > Performance for add reduction before: > Benchmark (size) Mode Cnt Score Error Units > Float128Vector.ADDLanes 1024 thrpt 5 4667.317 ± 0.456 ops/ms > Float256Vector.ADDLanes 1024 thrpt 5 5861.845 ± 0.933 ops/ms > Float512Vector.ADDLanes 1024 thrpt 5 4831.763 ± 36.330 ops/ms > Double128Vector.ADDLanes 1024 thrpt 5 2402.777 ± 0.814 ops/ms > Double256Vector.ADDLanes 1024 thrpt 5 4628.929 ± 1.638 ops/ms > Double512Vector.ADDLanes 1024 thrpt 5 4327.784 ± 13.728 ops/ms > > Performance for add reduction after: > Benchmark (size) Mode Cnt Score Error Units > Float128Vector.ADDLanes 1024 thrpt 5 4879.820 ± 7.407 ops/ms > Float256Vector.ADDLanes 1024 thrpt 5 9614.422 ± 4.621 ops/ms > Float512Vector.ADDLanes 1024 thrpt 5 15007.357 ± 57.316 ops/ms > Double128Vector.ADDLanes 1024 thrpt 5 2443.077 ±
1.694 ops/ms > Double256Vector.ADDLanes 1024 thrpt 5 4873.086 ± 1.680 ops/ms > Double512Vector.ADDLanes 1024 thrpt 5 9485.805 ± 31.852 ops/ms > > Performance for mul reduction before: > Benchmark (size) Mode Cnt Score Error Units > Float128Vector.MULLanes 1024 thrpt 5 4692.669 ± 3.555 ops/ms > Float256Vector.MULLanes 1024 thrpt 5 5866.017 ± 7.740 ops/ms > Float512Vector.MULLanes 1024 thrpt 5 4852.888 ± 46.561 ops/ms > Double128Vector.MULLanes 1024 thrpt 5 2402.173 ± 1.795 ops/ms > Double256Vector.MULLanes 1024 thrpt 5 4646.541 ± 2.136 ops/ms > Double512Vector.MULLanes 1024 thrpt 5 4292.133 ± 19.717 ops/ms > > Performance for mul reduction after: > Benchmark (size) Mode Cnt Score Error Units > Float128Vector.MULLanes 1024 thrpt 5 4885.890 ± 1.386 ops/ms > Float256Vector.MULLanes 1024 thrpt 5 9441.757 ± 46.048 ops/ms > Float512Vector.MULLanes 1024 thrpt 5 15091.997 ± 60.052 ops/ms > Double128Vector.MULLanes 1024 thrpt 5 2444.268 ± 1.677 ops/ms > Double256Vector.MULLanes 1024 thrpt 5 4871.302 ± 3.373 ops/ms > Double51... This pull request has now been integrated. Changeset: dc35f3e8 Author: Sandhya Viswanathan URL: https://git.openjdk.org/jdk/commit/dc35f3e8a84c8f622a4cabb8aee0f96de2e2ea30 Stats: 417 lines in 17 files changed: 290 ins; 1 del; 126 mod 8337062: x86_64: Unordered add/mul reduction support for vector api Reviewed-by: jbhateja, sgibbons ------------- PR: https://git.openjdk.org/jdk/pull/20306 From kvn at openjdk.org Thu Aug 1 23:11:35 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 1 Aug 2024 23:11:35 GMT Subject: RFR: 8337062: x86_64: Unordered add/mul reduction support for vector api [v4] In-Reply-To: References: Message-ID: On Thu, 25 Jul 2024 23:40:45 GMT, Sandhya Viswanathan wrote: >> Vector API doesn't define an order on reduction. The requires_strict_order flag was recently added as part of [JDK-8320725](https://bugs.openjdk.org/browse/JDK-8320725) to identify if a reduction should be ordered or unordered.
This flag is used to implement efficient vector api unordered reduction for floating point add/mul on x86_64. >> >> Performance for add reduction before: >> Benchmark (size) Mode Cnt Score Error Units >> Float128Vector.ADDLanes 1024 thrpt 5 4667.317 ± 0.456 ops/ms >> Float256Vector.ADDLanes 1024 thrpt 5 5861.845 ± 0.933 ops/ms >> Float512Vector.ADDLanes 1024 thrpt 5 4831.763 ± 36.330 ops/ms >> Double128Vector.ADDLanes 1024 thrpt 5 2402.777 ± 0.814 ops/ms >> Double256Vector.ADDLanes 1024 thrpt 5 4628.929 ± 1.638 ops/ms >> Double512Vector.ADDLanes 1024 thrpt 5 4327.784 ± 13.728 ops/ms >> >> Performance for add reduction after: >> Benchmark (size) Mode Cnt Score Error Units >> Float128Vector.ADDLanes 1024 thrpt 5 4879.820 ± 7.407 ops/ms >> Float256Vector.ADDLanes 1024 thrpt 5 9614.422 ± 4.621 ops/ms >> Float512Vector.ADDLanes 1024 thrpt 5 15007.357 ± 57.316 ops/ms >> Double128Vector.ADDLanes 1024 thrpt 5 2443.077 ± 1.694 ops/ms >> Double256Vector.ADDLanes 1024 thrpt 5 4873.086 ± 1.680 ops/ms >> Double512Vector.ADDLanes 1024 thrpt 5 9485.805 ± 31.852 ops/ms >> >> Performance for mul reduction before: >> Benchmark (size) Mode Cnt Score Error Units >> Float128Vector.MULLanes 1024 thrpt 5 4692.669 ± 3.555 ops/ms >> Float256Vector.MULLanes 1024 thrpt 5 5866.017 ± 7.740 ops/ms >> Float512Vector.MULLanes 1024 thrpt 5 4852.888 ± 46.561 ops/ms >> Double128Vector.MULLanes 1024 thrpt 5 2402.173 ± 1.795 ops/ms >> Double256Vector.MULLanes 1024 thrpt 5 4646.541 ± 2.136 ops/ms >> Double512Vector.MULLanes 1024 thrpt 5 4292.133 ± 19.717 ops/ms >> >> Performance for mul reduction after: >> Benchmark (size) Mode Cnt Score Error Units >> Float128Vector.MULLanes 1024 thrpt 5 4885.890 ± 1.386 ops/ms >> Float256Vector.MULLanes 1024 thrpt 5 9441.757 ± 46.048 ops/ms >> Float512Vector.MULLanes 1024 thrpt 5 15091.997 ± 60.052 ops/ms >> Double128Vector.MULLanes 1024 thrpt 5 2444.268 ± 1.677 ops/ms >> Double256...
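The ordered/unordered distinction matters because floating-point addition is not associative: a strictly ordered reduction and a lane-wise reduction (the shape a SIMD loop naturally produces) can return different results for the same input. The sketch below simulates both in plain scalar Java; the class and method names are illustrative only, and it neither calls the Vector API nor reproduces the x86_64 implementation.

```java
public class ReductionOrder {
    // Strictly ordered reduction: (((0 + a[0]) + a[1]) + a[2]) + ...
    static float orderedSum(float[] a) {
        float sum = 0.0f;
        for (float v : a) {
            sum += v;
        }
        return sum;
    }

    // Unordered, lane-wise reduction: element i is accumulated into partial
    // sum i % lanes (as a SIMD loop would), then the partial sums are combined.
    static float lanewiseSum(float[] a, int lanes) {
        float[] acc = new float[lanes];
        for (int i = 0; i < a.length; i++) {
            acc[i % lanes] += a[i];
        }
        float sum = 0.0f;
        for (float v : acc) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Near 1e8 adjacent floats are 8 apart, so 1e8f + 1f rounds back to 1e8f.
        float[] data = {1e8f, 1f, -1e8f, 1f};
        System.out.println("ordered:  " + orderedSum(data));     // ordered:  1.0
        System.out.println("lanewise: " + lanewiseSum(data, 2)); // lanewise: 2.0
    }
}
```

In the ordered sum the first 1f is absorbed by 1e8f, while the two-lane version pairs 1e8f with -1e8f and 1f with 1f, so nothing is lost. This is why a faster unordered reduction is only permissible where the API leaves the reduction order unspecified.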
> > Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: > > Jatin review comment resolution This change looks fine to me. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20306#issuecomment-2264177701 From sviswanathan at openjdk.org Thu Aug 1 23:26:35 2024 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Thu, 1 Aug 2024 23:26:35 GMT Subject: RFR: 8337062: x86_64: Unordered add/mul reduction support for vector api [v4] In-Reply-To: References: Message-ID: On Thu, 1 Aug 2024 23:09:00 GMT, Vladimir Kozlov wrote: >> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: >> >> Jatin review comment resolution > > This change looks fine to me. Thanks a lot @vnkozlov. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20306#issuecomment-2264194922 From duke at openjdk.org Fri Aug 2 00:08:37 2024 From: duke at openjdk.org (duke) Date: Fri, 2 Aug 2024 00:08:37 GMT Subject: Withdrawn: 8327240: Obsolete Tier2CompileThreshold/Tier2BackEdgeThreshold product flags In-Reply-To: <86N-93rC4Q2Q1d_YQSARfjQAHNNCEMvCXMq0_fk5A48=.9c621bb8-b724-40fb-afd7-835773a0e942@github.com> References: <86N-93rC4Q2Q1d_YQSARfjQAHNNCEMvCXMq0_fk5A48=.9c621bb8-b724-40fb-afd7-835773a0e942@github.com> Message-ID: On Mon, 22 Apr 2024 20:23:49 GMT, Sonia Zaldana Calles wrote: > Hi all, > > This PR removes the unused options ```Tier2CompileThreshold``` and ```Tier2BackEdgeThreshold```. > > Testing: > - [x] Verified warning is issued as support was removed. > > Thanks, > Sonia This pull request has been closed without being integrated. 
------------- PR: https://git.openjdk.org/jdk/pull/18904 From kvn at openjdk.org Fri Aug 2 06:06:04 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 2 Aug 2024 06:06:04 GMT Subject: RFR: 8337702: Use new ForwardExceptionNode to call StubRoutines::forward_exception_entry() Message-ID: Currently C2 uses a `TailCall` node when it generates code to forward an exception in C2 runtime stubs. The `StubRoutines::forward_exception_entry()` address is passed as a constant and the method pointer is `NULL`: [generateOptoStub.cpp#L258](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/generateOptoStub.cpp#L258) On the other hand, the TailCall mach node uses 2 registers as parameters, which are hardcoded in `Matcher`: [matcher.cpp#L828](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/matcher.cpp#L828) As a result, we waste two registers to pass a constant and NULL. Also, an incorrect relocation is used for such a call because the address of the `forward_exception` stub is passed in a register in the mach node. When it is converted to an `Address` for the `jmp` instruction, the default `external_word_type` relocation is used when `runtime_call_type` should be used. See the discussion in PR [JDK-8337396](https://github.com/openjdk/jdk/pull/20412) I added a new ideal node, `ForwardExceptionNode`, to solve these issues. It is similar to the `Rethrow` node (whose mach node definition I used as a template), but I kept it based on the `Return` node, similar to the `TailCall` node.
Tested tier1-3,stress,xcomp ------------- Commit messages: - 8337702: Use new ForwardExceptionNode to call StubRoutines::forward_exception_entry() Changes: https://git.openjdk.org/jdk/pull/20437/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=20437&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8337702 Stats: 150 lines in 14 files changed: 128 ins; 1 del; 21 mod Patch: https://git.openjdk.org/jdk/pull/20437.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20437/head:pull/20437 PR: https://git.openjdk.org/jdk/pull/20437 From kvn at openjdk.org Fri Aug 2 15:28:04 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 2 Aug 2024 15:28:04 GMT Subject: RFR: 8337702: Use new ForwardExceptionNode to call StubRoutines::forward_exception_entry() [v2] In-Reply-To: References: Message-ID: > Currently C2 uses a `TailCall` node when it generates code to forward an exception in C2 runtime stubs. > The `StubRoutines::forward_exception_entry()` address is passed as a constant and the method pointer is `NULL`: > [generateOptoStub.cpp#L258](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/generateOptoStub.cpp#L258) > > On the other hand, the TailCall mach node uses 2 registers as parameters, which are hardcoded in `Matcher`: [matcher.cpp#L828](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/matcher.cpp#L828) > As a result, we waste two registers to pass a constant and NULL. > > Also, an incorrect relocation is used for such a call because the address of the `forward_exception` stub is passed in a register in the mach node. When it is converted to an `Address` for the `jmp` instruction, the default `external_word_type` relocation is used when `runtime_call_type` should be used. See the discussion in PR [JDK-8337396](https://github.com/openjdk/jdk/pull/20412) > > I added a new ideal node, `ForwardExceptionNode`, to solve these issues. It is similar to the `Rethrow` node (whose mach node definition I used as a template), but I kept it based on the `Return` node, similar to the `TailCall` node.
> > Tested tier1-3,stress,xcomp Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: Fix arm (32 bits) build ------------- Changes: - all: https://git.openjdk.org/jdk/pull/20437/files - new: https://git.openjdk.org/jdk/pull/20437/files/134b8a84..0e8321e2 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20437&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20437&range=00-01 Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/20437.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20437/head:pull/20437 PR: https://git.openjdk.org/jdk/pull/20437 From fjiang at openjdk.org Sat Aug 3 16:24:40 2024 From: fjiang at openjdk.org (Feilong Jiang) Date: Sat, 3 Aug 2024 16:24:40 GMT Subject: RFR: 8337780: RISC-V: C2: Change C calling convention for sp to NS Message-ID: Hi, please review this patch that changes the C calling convention for sp to NS as we have already saved/restored sp in `enter()`/`leave()`. This could reduce the frame size by 16 bytes for those C2 runtime stubs [1] as we do not have to save sp on the method entry. I also checked the calling convention type for sp on other platforms (AArch64, PPC, x86, x64, S390), and they are all treated as NS. 
Testing: - [x] tier1~3 & hotspot:tier4 with release build 1: https://github.com/openjdk/jdk/blob/367e0a65561f95aad61b40930d5f46843fee3444/src/hotspot/share/opto/runtime.cpp#L147-L167 ------------- Commit messages: - make sp C convention save type as NS Changes: https://git.openjdk.org/jdk/pull/20449/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=20449&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8337780 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/20449.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20449/head:pull/20449 PR: https://git.openjdk.org/jdk/pull/20449 From duke at openjdk.org Mon Aug 5 00:36:39 2024 From: duke at openjdk.org (leo liang) Date: Mon, 5 Aug 2024 00:36:39 GMT Subject: RFR: 8324241: Always record evol_method deps to avoid excessive method flushing [v5] In-Reply-To: References: Message-ID: On Fri, 26 Jan 2024 13:14:57 GMT, Volker Simonis wrote: >> Currently we don't record dependencies on redefined methods (i.e. `evol_method` dependencies) in JIT compiled methods if none of the `can_redefine_classes`, `can_retransform_classes` or `can_generate_breakpoint_events` JVMTI capabilities is set. This means that if a JVMTI agent which requests one of these capabilities is dynamically attached, all the methods which have been JIT compiled until that point will be marked for deoptimization and flushed from the code cache. For large, warmed-up applications this means deoptimization and instant recompilation of thousands if not tens of thousands of methods, which can lead to dramatic performance/latency drops for several minutes.
>> >> One could argue that dynamic agent attach is now deprecated anyway (see [JEP 451: Prepare to Disallow the Dynamic Loading of Agents](https://openjdk.org/jeps/451)) and this problem could be solved by making the recording of `evol_method` dependencies dependent on the new `-XX:+EnableDynamicAgentLoading` flag instead of the concrete JVMTI capabilities (because the presence of the flag indicates that an agent will be loaded eventually). >> >> But there is a single, however important, exception to this rule, and that's JFR. JFR is advertised as a low-overhead profiler which can be enabled in production at any time. However, when JFR is started dynamically (e.g. through JCMD or JMX), it will silently load a HotSpot-internal JVMTI agent which requests the `can_retransform_classes` capability and retransforms some classes. This will inevitably trigger the deoptimization of all compiled methods as described above. >> >> I'd therefore like to propose to *always* and unconditionally record `evol_method` dependencies in JIT compiled code by exporting the relevant properties right at startup in `init_globals()`:
>> ```c++
>> jint init_globals() {
>>   management_init();
>>   JvmtiExport::initialize_oop_storage();
>> +#if INCLUDE_JVMTI
>> +  JvmtiExport::set_can_hotswap_or_post_breakpoint(true);
>> +  JvmtiExport::set_all_dependencies_are_recorded(true);
>> +#endif
>> ```
>>
>> My measurements indicate that the overhead of doing so is minimal (around 1% increase of nmethod size) and justifies the benefit. E.g. a Spring Petclinic application started with `-Xbatch -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation` compiles ~11500 methods (~9000 with C1 and ~2500 with C2), resulting in an aggregated nmethod size of around 40mb. Additionally recording `evol_method` dependencies only increases this size by about 400kb....
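A quick check of the quoted figures, reading the aggregate nmethod size as roughly 40 MB and using decimal units (with binary units the ratio is about 0.98%, so the conclusion does not change):

$$\frac{400\ \text{KB}}{40\ \text{MB}} = \frac{0.4\ \text{MB}}{40\ \text{MB}} = 0.01 = 1\%$$

This agrees with the stated overhead of around 1% increase in nmethod size.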
> > Volker Simonis has updated the pull request incrementally with one additional commit since the last revision: > > Fixed whitepspace in flag documentation Hi there, we are seeing this issue when we run JFR on our services under load: we see a large CPU spike after JFR is triggered, which causes 500 errors in our service. We are currently using corretto-17 in our service. Has this fix been backported to JDK 17? I can't find this change mentioned in [JDK update](https://wiki.openjdk.org/display/JDKUpdates/Archived+Releases) or in [jdk17u tag compare](https://github.com/openjdk/jdk17u/compare/jdk-17.0.9+9...jdk-17.0.13+1). Also, is there a workaround for this issue if the PR is not backported to Java 17? `-XX:+EnableDynamicAgentLoading` seems to only be supported in Java 21, so that wouldn't help for now. ------------- PR Comment: https://git.openjdk.org/jdk/pull/17509#issuecomment-2267971903 From fyang at openjdk.org Mon Aug 5 01:09:37 2024 From: fyang at openjdk.org (Fei Yang) Date: Mon, 5 Aug 2024 01:09:37 GMT Subject: RFR: 8337780: RISC-V: C2: Change C calling convention for sp to NS In-Reply-To: References: Message-ID: On Sat, 3 Aug 2024 16:17:53 GMT, Feilong Jiang wrote: > Hi, please review this patch that changes the C calling convention for sp to NS as sp is always saved and restored by the prolog/epilog code. > This could reduce the frame size by 16 bytes for those C2 runtime stubs [1] as we do not have to save sp on the method entry. > > I also checked the calling convention type for sp on other platforms (AArch64, PPC, x86, x64, S390), and they are all treated as NS. > > > Testing: > - [x] tier1~3 & hotspot:tier4 with release build > > 1: https://github.com/openjdk/jdk/blob/367e0a65561f95aad61b40930d5f46843fee3444/src/hotspot/share/opto/runtime.cpp#L147-L167 Looks good. Thanks. ------------- Marked as reviewed by fyang (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/20449#pullrequestreview-2217869143 From syan at openjdk.org Mon Aug 5 02:12:36 2024 From: syan at openjdk.org (SendaoYan) Date: Mon, 5 Aug 2024 02:12:36 GMT Subject: [jdk23] RFR: 8335806: RISC-V: Corrected typos Bizarrely In-Reply-To: <7jTWdv7YfToIIvfLR90RmHHB4YsuqYxSTakvL1BFb2s=.4caec85f-c42e-49a5-9f1c-8be3e06ddb2d@github.com> References: <7jTWdv7YfToIIvfLR90RmHHB4YsuqYxSTakvL1BFb2s=.4caec85f-c42e-49a5-9f1c-8be3e06ddb2d@github.com> Message-ID: On Mon, 8 Jul 2024 01:40:56 GMT, SendaoYan wrote: > Hi all, > > This pull request contains a backport of commit [3f37c571](https://github.com/openjdk/jdk/commit/3f37c5718d676b7001e6a084aed3ba645745a144) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by SendaoYan on 8 Jul 2024 and was reviewed by Andrew Haley and Amit Kumar. > > In [c2_MacroAssembler_riscv.cpp](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp#L1325), the word Bizzarely should be Bizarrely. Trivial fix, no risk. > > Thanks! Maybe this PR does not need to be backported. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20065#issuecomment-2268040150 From syan at openjdk.org Mon Aug 5 02:12:37 2024 From: syan at openjdk.org (SendaoYan) Date: Mon, 5 Aug 2024 02:12:37 GMT Subject: [jdk23] Withdrawn: 8335806: RISC-V: Corrected typos Bizarrely In-Reply-To: <7jTWdv7YfToIIvfLR90RmHHB4YsuqYxSTakvL1BFb2s=.4caec85f-c42e-49a5-9f1c-8be3e06ddb2d@github.com> References: <7jTWdv7YfToIIvfLR90RmHHB4YsuqYxSTakvL1BFb2s=.4caec85f-c42e-49a5-9f1c-8be3e06ddb2d@github.com> Message-ID: On Mon, 8 Jul 2024 01:40:56 GMT, SendaoYan wrote: > Hi all, > > This pull request contains a backport of commit [3f37c571](https://github.com/openjdk/jdk/commit/3f37c5718d676b7001e6a084aed3ba645745a144) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository.
> > The commit being backported was authored by SendaoYan on 8 Jul 2024 and was reviewed by Andrew Haley and Amit Kumar. > > In [c2_MacroAssembler_riscv.cpp](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp#L1325), the word Bizzarely should be Bizarrely. Trivial fix, no risk. > > Thanks! This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/20065 From dfenacci at openjdk.org Mon Aug 5 08:06:32 2024 From: dfenacci at openjdk.org (Damon Fenacci) Date: Mon, 5 Aug 2024 08:06:32 GMT Subject: RFR: 8335444: Generalize implementation of AndNode mul_ring In-Reply-To: References: <7l_zt0Q884dkzJbP5kjhfvZPmFQ4CMRgakWoUCopfvw=.fc1307d0-6a2c-40cc-adc6-c1b21424d0dd@github.com> Message-ID: On Wed, 31 Jul 2024 15:27:35 GMT, Jasmine Karthikeyan wrote: >> Not a review but I quickly ran it through our testing and the following test fails with `-XX:UseAVX=3` on linux-x64-debug and windows-x64-debug and without any flags on windows-x64-debug which seems to be related to your patch: >> >> Test: compiler/vectorization/runner/BasicBooleanOpTest.java >> >> Output: >> >> Failed IR Rules (1) of Methods (1) >> ---------------------------------- >> 1) Method "public boolean[] compiler.vectorization.runner.BasicBooleanOpTest.vectorAnd()" - [Failed IR rules: 1]: >> * @IR rule 3: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#MACRO_LOGIC_V#_", ">0"}, failOn={}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={"avx512f", "true"}, applyIfAnd={}, applyIfNot={})" >> > Phase "PrintIdeal": >> - counts: Graph contains wrong number of nodes: >> * Constraint 1: "MacroLogicV" >> - Failed comparison: [found] 0 > 0 [given] >> - No nodes matched! > > Thank you for running testing @chhagedorn! I think I didn't run into this because my device doesn't support AVX-512.
Does the failure have an ideal node printout as well? I think that could help in diagnosing the issue. Thanks! @jaskarth out of curiosity: could you by chance notice any measurable performance difference (e.g. for specific/ad-hoc benchmarks)? ------------- PR Comment: https://git.openjdk.org/jdk/pull/20066#issuecomment-2268429164 From rehn at openjdk.org Mon Aug 5 08:42:31 2024 From: rehn at openjdk.org (Robbin Ehn) Date: Mon, 5 Aug 2024 08:42:31 GMT Subject: RFR: 8337780: RISC-V: C2: Change C calling convention for sp to NS In-Reply-To: References: Message-ID: On Sat, 3 Aug 2024 16:17:53 GMT, Feilong Jiang wrote: > Hi, please review this patch that changes the C calling convention for sp to NS as sp is always saved and restored by the prolog/epilog code.. > This could reduce the frame size by 16 bytes for those C2 runtime stubs [1] as we do not have to save sp on the method entry. > > I also checked the calling convention type for sp on other platforms (AArch64, PPC, x86, x64, S390), and they are all treated as NS. > > > Testing: > - [x] tier1~3 & hotspot:tier4 with release build > > 1: https://github.com/openjdk/jdk/blob/367e0a65561f95aad61b40930d5f46843fee3444/src/hotspot/share/opto/runtime.cpp#L147-L167 Thanks! ------------- Marked as reviewed by rehn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/20449#pullrequestreview-2218376385 From dnsimon at openjdk.org Mon Aug 5 09:04:40 2024 From: dnsimon at openjdk.org (Doug Simon) Date: Mon, 5 Aug 2024 09:04:40 GMT Subject: RFR: 8336489: Track scoped accesses in JVMCI compiled code [v5] In-Reply-To: References: Message-ID: On Fri, 26 Jul 2024 17:16:02 GMT, Carlo Refice wrote: >> This PR adds JVMCI support to scoped access tracking introduced in #20158. >> >> In this PR: >> * The `Method::is_scoped` flag is now exposed in JVMCI as `HotSpotResolvedJavaMethod.isScoped()`, and serialized to / deserialized from the JVMCI compiled code stream as a boolean flag. 
>> * To determine whether a compiled method has a scoped access, we simply check whether `HotSpotResolvedJavaMethod.isScoped()` returns `true` for the root method or any of the methods that were inlined in the compilation. >> * The above check is implemented as the method `HotSpotCompiledNMethod.hasScopedAccess()`, instead of as an explicit flag set in the constructor of `HotSpotCompiledNMethod`. This keeps the change isolated to JVMCI, without requiring coordinated changes to the Graal compiler. No other changes in the compiler are necessary to benefit from the optimization. > > Carlo Refice has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: > > - Add test for ResolvedJavaMethod#isScoped() > - Pull HotSpotResolvedJavaMethod#isScoped() to ResolvedJavaMethod > - Fix truncation of Method and ConstMethod flags in HotSpotResolvedJavaMethodImpl > - Clarify HotSpotResolvedJavaMethod#isScoped javadoc > - Track scoped accesses in JVMCI compiled code Marked as reviewed by dnsimon (Reviewer). src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotResolvedJavaMethod.java line 2: > 1: /* > 2: * Copyright (c) 2011, 2024, Oracle and/or its affiliates. All rights reserved. No need for copyright update as this source is not otherwise modified. test/hotspot/jtreg/compiler/jvmci/jdk.vm.ci.runtime.test/src/jdk/vm/ci/runtime/test/TestResolvedJavaMethod.java line 485: > 483: ResolvedJavaMethod m = e.getValue(); > 484: Method key = e.getKey(); > 485: boolean expect = key.isAnnotationPresent(scopedAnnotationClass); Please add an assertion after the loop that `expect` was true at least once.
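The reviewer's point is that a loop of per-item assertions passes vacuously if no item ever exercises the interesting case. Below is a minimal, hypothetical sketch of that guard in plain Java; the class name `AtLeastOnceCheck`, the method `check`, and the map-of-pairs input are invented for illustration and are not the actual `TestResolvedJavaMethod` code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AtLeastOnceCheck {
    // Hypothetical input: method name -> {expected isScoped, actual isScoped}.
    static void check(Map<String, boolean[]> cases) {
        boolean sawScoped = false;
        for (Map.Entry<String, boolean[]> e : cases.entrySet()) {
            boolean expect = e.getValue()[0];
            boolean actual = e.getValue()[1];
            if (expect != actual) {
                throw new AssertionError(e.getKey() + ": expected " + expect + ", got " + actual);
            }
            sawScoped |= expect;
        }
        // Without this, the loop passes vacuously if no scoped method was ever checked.
        if (!sawScoped) {
            throw new AssertionError("no method with the scoped annotation was checked");
        }
    }

    public static void main(String[] args) {
        Map<String, boolean[]> cases = new LinkedHashMap<>();
        cases.put("scopedMethod", new boolean[] {true, true});
        cases.put("plainMethod", new boolean[] {false, false});
        check(cases);
        System.out.println("ok");
    }
}
```

The sketch throws `AssertionError` explicitly rather than using the `assert` keyword, so the guard still runs when assertions are disabled.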
------------- PR Review: https://git.openjdk.org/jdk/pull/20256#pullrequestreview-2218382563 PR Review Comment: https://git.openjdk.org/jdk/pull/20256#discussion_r1703752873 PR Review Comment: https://git.openjdk.org/jdk/pull/20256#discussion_r1703783605 From chagedorn at openjdk.org Mon Aug 5 09:54:38 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 5 Aug 2024 09:54:38 GMT Subject: RFR: 8335257: Refactor code to create Initialized Assertion Predicates into separate class [v4] In-Reply-To: References: <4KPUDSysCAHzoOs0lmJ3ds9i4keYFlsdDgkm_FMganU=.69f994eb-dd68-40b8-854d-3f7a445807a7@github.com> Message-ID: On Thu, 25 Jul 2024 06:33:10 GMT, Christian Hagedorn wrote: >> This is the next patch for Assertion Predicates. It refactors the code to create an Initialized Assertion Predicate. Changes include: >> >> - `clone_assertion_predicate_and_initialize()` currently does two things: Cloning a Template Assertion Predicate and creating an Initialized Assertion Predicate. I've split this method into two methods `clone_template_assertion_predicate()` and `create_initialized_assertion_predicate()`: >> - `clone_template_assertion_predicate()`: Now only clones the template. I have not cleaned the code up further because I will soon replace the `If` node with a dedicated `TemplateAssertionPredicateNode`. >> - `create_initialized_assertion_predicate()`: I refactored the code for Initialized Assertion Predicate into a separate class `InitializedAssertionPredicate` which creates the complete Initialized Assertion Predicate `If` with a `HaltNode`. I could get rid of some of the arguments because they can be fetched inside the new class methods. >> - Moved `assertion_predicate_has_loop_opaque_node()` asserts to both methods. >> - Renamed `AssertionPredicateType::Init_value` -> `AssertionPredicateType::InitValue` (same for last value). >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request with a new target base due to a merge or a rebase.
The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: > > - Fix product build > - Merge branch 'refs/heads/master' into JDK-8335257 > - update transformation sketch and rename "Template Assertion Predicate Expression" to "Template Assertion Expression" > - review Emanuel > - Merge branch 'refs/heads/master' into JDK-8335257 > - 8335257: Refactor code to create Initialized Assertion Predicates into separate class Testing looked good. ------------- PR Comment: https://git.openjdk.org/jdk/pull/19940#issuecomment-2268659342 From chagedorn at openjdk.org Mon Aug 5 09:54:39 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 5 Aug 2024 09:54:39 GMT Subject: Integrated: 8335257: Refactor code to create Initialized Assertion Predicates into separate class In-Reply-To: <4KPUDSysCAHzoOs0lmJ3ds9i4keYFlsdDgkm_FMganU=.69f994eb-dd68-40b8-854d-3f7a445807a7@github.com> References: <4KPUDSysCAHzoOs0lmJ3ds9i4keYFlsdDgkm_FMganU=.69f994eb-dd68-40b8-854d-3f7a445807a7@github.com> Message-ID: On Fri, 28 Jun 2024 13:40:50 GMT, Christian Hagedorn wrote: > This is the next patch for Assertion Predicates. It refactors the code to create an Initialized Assertion Predicate. Changes include: > > - `clone_assertion_predicate_and_initialize()` currently does two things: Cloning a Template Assertion Predicate and creating an Initialized Assertion Predicate. I've split this method into two methods `clone_template_assertion_predicate()` and `create_initialized_assertion_predicate()`: > - `clone_template_assertion_predicate()`: Now only clones the template. I have not cleaned the code up further because I will soon replace the `If` node with a dedicated `TemplateAssertionPredicateNode`.
> - `create_initialized_assertion_predicate()`: I refactored the code for Initialized Assertion Predicate into a separate class `InitializedAssertionPredicate` which creates the complete Initialized Assertion Predicate `If` with a `HaltNode`. I could get rid of some of the arguments because they can be fetched inside the new class methods. > - Moved `assertion_predicate_has_loop_opaque_node()` asserts to both methods. > - Renamed `AssertionPredicateType::Init_value` -> `AssertionPredicateType::InitValue` (same for last value). > > Thanks, > Christian This pull request has now been integrated. Changeset: be34730f Author: Christian Hagedorn URL: https://git.openjdk.org/jdk/commit/be34730fb4e6818ac13c46b34b735c967351e5cd Stats: 222 lines in 8 files changed: 114 ins; 16 del; 92 mod 8335257: Refactor code to create Initialized Assertion Predicates into separate class Reviewed-by: kvn, epeter ------------- PR: https://git.openjdk.org/jdk/pull/19940 From fgao at openjdk.org Mon Aug 5 10:35:34 2024 From: fgao at openjdk.org (Fei Gao) Date: Mon, 5 Aug 2024 10:35:34 GMT Subject: RFR: 8336464: C2: Force CastX2P to be a two-address instruction In-Reply-To: References: <9HI9_zwkUQqTJDp-WhIpJzyRM2bUYXeR2NokG09u1yI=.ebe9f7ad-c7ea-42d1-b2bb-77a1689390f2@github.com> Message-ID: <18flwdx7lv9satJek9RU6Gq7L06hGpKeN60WDby9kgw=.3d234e57-0d5b-4d09-bef1-fa8c67a9e7d0@github.com> On Tue, 30 Jul 2024 17:40:42 GMT, Vladimir Kozlov wrote: > I am not sure about this change. The reason we keep `src` and `dst` in different registers for different types is most likely for cases when `src` could be used in other operations. Overwriting the `src` register may give you more spills than before. > > If there are no other `src` usages, RA should handle this in shared code, I think. Thanks for your review @vnkozlov.
My initial idea was this: if we keep `src` and `dst` in the same register and `src` is also used by other operations, then yes, we need to generate extra spill code like:

    Spill src to new_src
    CastX2P src src
    ... // Other operations use new_src

In the final code, `CastX2P` can be removed, and it becomes:

    mov src new_src
    ... // Other operations use new_src

If we keep `src` and `dst` in different registers, we may get:

    CastX2P src dst
    ... // Other operations use src

In the final code, it will be:

    mov src dst
    ... // Other operations use src

So I thought that keeping `src` and `dst` in the same register would not generate extra `mov`s, because `CastX2P` itself can be removed in the end. But some cases I wrote, shown in the PR description, contradicted that expectation in an unexpected way, so now I'm not sure about it either. Anyway, it's an experiment, and comments are welcome :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/20159#issuecomment-2268744405 From duke at openjdk.org Mon Aug 5 11:25:36 2024 From: duke at openjdk.org (Yuri Gaevsky) Date: Mon, 5 Aug 2024 11:25:36 GMT Subject: RFR: 8322174: RISC-V: C2 VectorizedHashCode RVV Version [v2] In-Reply-To: References: Message-ID: On Thu, 25 Jan 2024 14:47:47 GMT, Yuri Gaevsky wrote: >> The patch adds the possibility to use RVV instructions for faster vectorizedHashCode calculations on RVV v1.0.0 capable hardware. >> >> Testing: hotspot/jtreg/compiler/ under QEMU-8.1 with RVV v1.0.0. > > Yuri Gaevsky has updated the pull request incrementally with two additional commits since the last revision: > > - num_8b_elems_in_vec --> nof_vec_elems > - Removed checks for (MaxVectorSize >= 16) per @RealFYang suggestion. .
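The quoted RVV patch speeds up `vectorizedHashCode`, which recent JDKs use to compute array hash codes such as `Arrays.hashCode(int[])`. The sketch below is a scalar Java model of why that polynomial hash (`h = 31*h + a[i]`, starting from 1) can be vectorized at all. It is not the RISC-V implementation. It processes a chunk of elements at a time using precomputed powers of 31, and the per-element products inside a chunk are independent of one another. The chunk width `VL = 4` and the class and method names are hypothetical choices made for this illustration.

```java
import java.util.Arrays;

public class HashCodeChunks {
    /** Reference: the polynomial hash used by Arrays.hashCode(int[]): h = 31*h + a[i], starting from 1. */
    static int scalarHash(int[] a) {
        int h = 1;
        for (int x : a) h = 31 * h + x;
        return h;
    }

    /**
     * Chunked evaluation: for a chunk c[0..VL-1],
     * h' = h * 31^VL + sum_k c[k] * 31^(VL-1-k).
     * The products inside a chunk do not depend on each other, which is what lets a
     * vector unit compute them in parallel. All arithmetic wraps mod 2^32, like the scalar loop.
     */
    static int chunkedHash(int[] a) {
        final int VL = 4;            // hypothetical "vector length" for this sketch
        int[] pow = new int[VL + 1]; // pow[k] = 31^k (mod 2^32)
        pow[0] = 1;
        for (int k = 1; k <= VL; k++) pow[k] = pow[k - 1] * 31;

        int h = 1;
        int i = 0;
        for (; i + VL <= a.length; i += VL) {
            int sum = 0;
            for (int k = 0; k < VL; k++) sum += a[i + k] * pow[VL - 1 - k];
            h = h * pow[VL] + sum;
        }
        for (; i < a.length; i++) h = 31 * h + a[i]; // scalar tail for leftover elements
        return h;
    }

    public static void main(String[] args) {
        int[] data = {3, -7, 42, 0, 9, 1000, -1, 12345, 8};
        System.out.println(scalarHash(data) == Arrays.hashCode(data));  // true
        System.out.println(chunkedHash(data) == Arrays.hashCode(data)); // true
    }
}
```

The identity holds for any chunk width, because wrapping `int` arithmetic is a ring. A real vector implementation would pick the chunk width from the hardware vector length instead of fixing it at 4.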
------------- PR Comment: https://git.openjdk.org/jdk/pull/17413#issuecomment-2268837135 From duke at openjdk.org Mon Aug 5 12:44:04 2024 From: duke at openjdk.org (Carlo Refice) Date: Mon, 5 Aug 2024 12:44:04 GMT Subject: RFR: 8336489: Track scoped accesses in JVMCI compiled code [v6] In-Reply-To: References: Message-ID: <7rGn7mfbOM5lYucokA47_R5AhNd59_c_bYu-gIFsxhA=.13cde341-93f7-4f0c-b4e8-cc0a9b5b3b79@github.com> > This PR adds JVMCI support to scoped access tracking introduced in #20158. > > In this PR: > * The `Method::is_scoped` flag is now exposed in JVMCI as `HotSpotResolvedJavaMethod.isScoped()`, and serialized to / deserialized from the JVMCI compiled code stream as a boolean flag. > * To determine whether a compiled method has a scoped access, we simply check whether `HotSpotResolvedJavaMethod.isScoped()` returns `true` for the root method or any of the methods that were inlined in the compilation. > * The above check is implemented as the method `HotSpotCompiledNMethod.hasScopedAccess()`, instead of as an explicit flag set in the constructor of `HotSpotCompiledNMethod`. This keeps the change isolated to JVMCI, without requiring coordinated changes to the Graal compiler. No other changes in the compiler are necessary to benefit from the optimization.
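The rule in the second bullet fits in a few lines of plain Java. This is a hypothetical sketch, not the JVMCI code. `ResolvedMethod` stands in for `HotSpotResolvedJavaMethod`, and the inlined methods are passed in as an explicit list, whereas the real `HotSpotCompiledNMethod.hasScopedAccess()` presumably gets them from the data held by the compiled-method object.

```java
import java.util.List;

public class ScopedAccessCheck {
    /** Hypothetical stand-in for the one property needed from a resolved method. */
    interface ResolvedMethod {
        boolean isScoped();
    }

    /**
     * Sketch of the rule from the PR description: the compiled code has a scoped access
     * if the root method, or any method inlined into it, is scoped.
     */
    static boolean hasScopedAccess(ResolvedMethod root, List<? extends ResolvedMethod> inlined) {
        if (root.isScoped()) return true;
        for (ResolvedMethod m : inlined) {
            if (m.isScoped()) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        ResolvedMethod plain = () -> false;
        ResolvedMethod scoped = () -> true;
        System.out.println(hasScopedAccess(plain, List.of(plain)));         // false
        System.out.println(hasScopedAccess(plain, List.of(plain, scoped))); // true
    }
}
```

Deriving the answer from information the compiler already provides, rather than adding a new flag, is the same reason the PR gives for not requiring changes to Graal.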
Carlo Refice has updated the pull request incrementally with one additional commit since the last revision: Assert that ResolvedJavaMethod#isScoped returns true at least once ------------- Changes: - all: https://git.openjdk.org/jdk/pull/20256/files - new: https://git.openjdk.org/jdk/pull/20256/files/8fabb8bb..8b61ab79 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20256&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20256&range=04-05 Stats: 6 lines in 2 files changed: 5 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/20256.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20256/head:pull/20256 PR: https://git.openjdk.org/jdk/pull/20256 From dnsimon at openjdk.org Mon Aug 5 12:44:04 2024 From: dnsimon at openjdk.org (Doug Simon) Date: Mon, 5 Aug 2024 12:44:04 GMT Subject: RFR: 8336489: Track scoped accesses in JVMCI compiled code [v6] In-Reply-To: <7rGn7mfbOM5lYucokA47_R5AhNd59_c_bYu-gIFsxhA=.13cde341-93f7-4f0c-b4e8-cc0a9b5b3b79@github.com> References: <7rGn7mfbOM5lYucokA47_R5AhNd59_c_bYu-gIFsxhA=.13cde341-93f7-4f0c-b4e8-cc0a9b5b3b79@github.com> Message-ID: <-SyircI8QCAJzsfJKVgFenv6sKN71plFuXIfgMCYs0Y=.2ca49336-29a2-4f0b-b468-ff9a804aad66@github.com> On Mon, 5 Aug 2024 12:40:54 GMT, Carlo Refice wrote: >> This PR adds JVMCI support to scoped access tracking introduced in #20158. >> >> In this PR: >> * The `Method::is_scoped` flag is now exposed in JVMCI as `HotSpotResolvedJavaMethod.isScoped()`, and serialized to / deserialized from the JVMCI compiled code stream as a boolean flag. >> * To determine whether a compiled method has a scoped access, we simply check whether `HotSpotResolvedJavaMethod.isScoped()` returns `true` for the root method or any of the methods that were inlined in the compilation. >> * The above check is implemented as the method `HotSpotCompiledNMethod.hasScopedAccess()`, instead of as an explicit flag set in the constructor of `HotSpotCompiledNMethod`.
This keeps the change isolated to JVMCI, without requiring coordinated changes to the Graal compiler. No other changes in the compiler are necessary to benefit from the optimization. > > Carlo Refice has updated the pull request incrementally with one additional commit since the last revision: > > Assert that ResolvedJavaMethod#isScoped returns true at least once Still looks good to me. ------------- Marked as reviewed by dnsimon (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/20256#pullrequestreview-2218888317 From duke at openjdk.org Mon Aug 5 13:51:33 2024 From: duke at openjdk.org (duke) Date: Mon, 5 Aug 2024 13:51:33 GMT Subject: RFR: 8336489: Track scoped accesses in JVMCI compiled code [v6] In-Reply-To: <7rGn7mfbOM5lYucokA47_R5AhNd59_c_bYu-gIFsxhA=.13cde341-93f7-4f0c-b4e8-cc0a9b5b3b79@github.com> References: <7rGn7mfbOM5lYucokA47_R5AhNd59_c_bYu-gIFsxhA=.13cde341-93f7-4f0c-b4e8-cc0a9b5b3b79@github.com> Message-ID: <3EGnCOsj_Raj57jeHQDZRDyrZ6uvLUXmLqr4vjectuE=.20d4b8c8-3111-4cd9-99ce-65bbd3aacd7a@github.com> On Mon, 5 Aug 2024 12:44:04 GMT, Carlo Refice wrote: >> This PR adds JVMCI support to scoped access tracking introduced in #20158. >> >> In this PR: >> * The `Method::is_scoped` flag is now exposed in JVMCI as `HotSpotResolvedJavaMethod.isScoped()`, and serialized to / deserialized from the JVMCI compiled code stream as a boolean flag. >> * To determine whether a compiled method has a scoped access, we simply check whether `HotSpotResolvedJavaMethod.isScoped()` returns `true` for the root method or any of the methods that were inlined in the compilation. >> * The above check is implemented as the method `HotSpotCompiledNMethod.hasScopedAccess()`, instead of as an explicit flag set in the constructor of `HotSpotCompiledNMethod`. This keeps the change isolated to JVMCI, without requiring coordinated changes to the Graal compiler. No other changes in the compiler are necessary to benefit from the optimization.
> > Carlo Refice has updated the pull request incrementally with one additional commit since the last revision: > > Assert that ResolvedJavaMethod#isScoped returns true at least once @c-refice Your change (at version 8b61ab799495a7839eb8551686b89ea5f9fab79e) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20256#issuecomment-2269127221 From duke at openjdk.org Mon Aug 5 14:12:37 2024 From: duke at openjdk.org (Carlo Refice) Date: Mon, 5 Aug 2024 14:12:37 GMT Subject: Integrated: 8336489: Track scoped accesses in JVMCI compiled code In-Reply-To: References: Message-ID: On Fri, 19 Jul 2024 14:46:38 GMT, Carlo Refice wrote: > This PR adds JVMCI support to scoped access tracking introduced in #20158. > > In this PR: > * The `Method::is_scoped` flag is now exposed in JVMCI as `HotSpotResolvedJavaMethod.isScoped()`, and serialized to / deserialized from the JVMCI compiled code stream as a boolean flag. > * To determine whether a compiled method has a scoped access, we simply check whether `HotSpotResolvedJavaMethod.isScoped()` returns `true` for the root method or any of the methods that were inlined in the compilation. > * The above check is implemented as the method `HotSpotCompiledNMethod.hasScopedAccess()`, instead of as an explicit flag set in the constructor of `HotSpotCompiledNMethod`. This keeps the change isolated to JVMCI, without requiring coordinated changes to the Graal compiler. No other changes in the compiler are necessary to benefit from the optimization. This pull request has now been integrated.
Changeset: c095c0e6 Author: Carlo Refice URL: https://git.openjdk.org/jdk/commit/c095c0e6a58b1665d51ff19381e32f168e99e8f5 Stats: 72 lines in 11 files changed: 66 ins; 0 del; 6 mod 8336489: Track scoped accesses in JVMCI compiled code Reviewed-by: dnsimon, never ------------- PR: https://git.openjdk.org/jdk/pull/20256 From jkarthikeyan at openjdk.org Mon Aug 5 16:48:52 2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Mon, 5 Aug 2024 16:48:52 GMT Subject: RFR: 8336860: x86: Change integer src operand for CMoveL of 0 and 1 to long [v3] In-Reply-To: References: Message-ID: > Hi all, > This patch fixes `cmovL_imm_01*` instructions matching against an integer immediate 1 instead of a long immediate 1. I noticed while looking at the backend implementation of CMove that the rules specify `immI_1` instead of `immL1`, which means that the instructions can't be matched and instead fall through to the base case.
Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: Add IR test for codegen ------------- Changes: - all: https://git.openjdk.org/jdk/pull/20275/files - new: https://git.openjdk.org/jdk/pull/20275/files/586d7703..1b926ecf Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20275&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20275&range=01-02 Stats: 87 lines in 2 files changed: 87 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/20275.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20275/head:pull/20275 PR: https://git.openjdk.org/jdk/pull/20275 From jkarthikeyan at openjdk.org Mon Aug 5 16:53:37 2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Mon, 5 Aug 2024 16:53:37 GMT Subject: RFR: 8336860: x86: Change integer src operand for CMoveL of 0 and 1 to long [v2] In-Reply-To: <6NP28KkoRw8wbXPcby4lWfvQS2CLEomS3Pkt0-z3A-U=.172dca0b-5730-4c6f-8f47-7c90370963b0@github.com> References: <6NP28KkoRw8wbXPcby4lWfvQS2CLEomS3Pkt0-z3A-U=.172dca0b-5730-4c6f-8f47-7c90370963b0@github.com> Message-ID: <8_O6mW5JUdObq4VNzC_kIKDNvt2EBAtugwn743XRx70=.a02d8de3-b7d6-4a24-bec6-0190556c5897@github.com> On Thu, 25 Jul 2024 14:28:35 GMT, Emanuel Peter wrote: >> @eme64 or @TobiHartmann might take a look too, I guess? > > @liach @jaskarth I'll run some testing. > > Can you point me to the "base case" you mention in your PR description? @eme64 I've added an IR test that matches all of the relevant comparison types. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/20275#issuecomment-2269499751 From kvn at openjdk.org Mon Aug 5 18:58:32 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 5 Aug 2024 18:58:32 GMT Subject: RFR: 8336464: C2: Force CastX2P to be a two-address instruction In-Reply-To: <9HI9_zwkUQqTJDp-WhIpJzyRM2bUYXeR2NokG09u1yI=.ebe9f7ad-c7ea-42d1-b2bb-77a1689390f2@github.com> References: <9HI9_zwkUQqTJDp-WhIpJzyRM2bUYXeR2NokG09u1yI=.ebe9f7ad-c7ea-42d1-b2bb-77a1689390f2@github.com> Message-ID: <82HQYCOB8dTzpifkI60mRVD7eNUV63zctGDu5a7Mg5M=.ffff16b5-0a73-43f7-b546-f4c312905e57@github.com> On Fri, 12 Jul 2024 13:59:23 GMT, Fei Gao wrote: > This patch forces `CastX2P` to be a two-address instruction, so that C2 could allocate the same register for `dst` and `src`. Then we can remove the instruction completely in the assembly. > > The motivation comes from some cast operations like `castPP`. The difference for ADLC between `castPP` and `CastX2P` lies in that `CastX2P` always has different types for `dst` and `src`. We can force ADLC to generate an extra `two_adr()` for `CastX2P` like it does automatically for `castPP`, which could tell register allocator that the instruction needs the same register for `dst` and `src`. > > However, sometimes, RA and GCM in C2 can't work as we expected. > > For example, with the existing code we get this assembly: > > ldp x10, x11, [x17,#136] > add x10, x10, x15 > add x11, x11, x10 > ldr x12, [x17,#152] > str x16, [x10] > add x10, x12, x15 > str x16, [x11] > str x16, [x10] > > > After applying the patch independently, the assembly is: > > ldr x10, [x16,#136] <--- 1 > add x10, x10, x15 > ldr x11, [x16,#144] <--- 2 > mov x13, x10 <--- 3 > str x17, [x13] > ldr x12, [x16,#152] > add x10, x11, x10 > str x17, [x10] > add x10, x12, x15 > str x17, [x10] > > > C2 generates a totally extra `mov`, see 3, and we even lose the chance to merge the loads into a load pair, see 1 and 2. That's terrible.
> > Although this scenario would disappear after combining with https://github.com/openjdk/jdk/pull/20157, I'm still not sure if this patch is worthwhile. One idea I had a long time ago [JDK-6768706](https://bugs.openjdk.org/browse/JDK-6768706) is to add mach instructions with complex matching rules which use CastX2P/CastP2X in common patterns. Then you can avoid moves between registers. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20159#issuecomment-2269707305 From ascarpino at openjdk.org Mon Aug 5 19:15:35 2024 From: ascarpino at openjdk.org (Anthony Scarpino) Date: Mon, 5 Aug 2024 19:15:35 GMT Subject: RFR: 8337632: AES-GCM Algorithm optimization for x86_64 In-Reply-To: References: Message-ID: On Mon, 22 Jan 2024 09:38:25 GMT, Smita Kamath wrote: > Hi, > I want to submit an AES-GCM algorithm optimization. This implementation uses AVX512/VAES instructions. Additionally, it reduces PARALLEL_LEN from 7680 to 512 bytes. The performance numbers are as below. Kindly review the code. Thank you. > > Benchmark | Datasize | BaseJDK (ops/s) | Patch (ops/s) | %Gain > -- | -- | -- | -- | -- > full.AESGCMBench.decrypt | 512 | 2928259.197 | 3269964.387 | 11.67 > full.AESGCMBench.decrypt | 1024 | 2494254.611 | 3010987.731 | 20.72 > full.AESGCMBench.decrypt | 1500 | 1883453.546 | 1934915.846 | 2.73 > full.AESGCMBench.decrypt | 2048 | 1825780.711 | 2452861.368 | 34.34 > full.AESGCMBench.decrypt | 4096 | 1275108.345 | 1806329.066 | 41.66 > full.AESGCMBench.decrypt | 8192 | 1033936.634 | 1196836.052 | 15.75 > full.AESGCMBench.decrypt | 16384 | 681494.768 | 711630.498 | 4.42 > full.AESGCMBench.decrypt | 32768 | 385026.017 | 395043.193 | 2.6 > full.AESGCMBench.decrypt | 65536 | 207373.924 | 214723.588 | 3.54 >
> full.AESGCMBench.encrypt | 512 | 2658008.476 | 2882496.94 | 8.45 > full.AESGCMBench.encrypt | 1024 | 2283709.63 | 2589534.403 | 13.39 > full.AESGCMBench.encrypt | 1500 | 1794993.519 | 1817669.531 | 1.26 > full.AESGCMBench.encrypt | 2048 | 1745532.435 | 2191097.29 | 25.52 > full.AESGCMBench.encrypt | 4096 | 1203301.174 | 1649593.953 | 37.08 > full.AESGCMBench.encrypt | 8192 | 985174.988 | 1132407.54 | 14.94 > full.AESGCMBench.encrypt | 16384 | 658980.441 | 684765.771 | 3.91 > full.AESGCMBench.encrypt | 32768 | 373543.798 | 391518.837 | 4.81 > full.AESGCMBench.encrypt | 65536 | 202532.315 | 205084.833 | 1.260301597 That's a good performance increase for such a small code change. I reviewed the simple java code change. I'll let a hotspot reviewer handle the rest of the code. ------------- PR Comment: https://git.openjdk.org/jdk/pull/17515#issuecomment-2269737577 From dlong at openjdk.org Mon Aug 5 19:30:30 2024 From: dlong at openjdk.org (Dean Long) Date: Mon, 5 Aug 2024 19:30:30 GMT Subject: RFR: 8336464: C2: Force CastX2P to be a two-address instruction In-Reply-To: <9HI9_zwkUQqTJDp-WhIpJzyRM2bUYXeR2NokG09u1yI=.ebe9f7ad-c7ea-42d1-b2bb-77a1689390f2@github.com> References: <9HI9_zwkUQqTJDp-WhIpJzyRM2bUYXeR2NokG09u1yI=.ebe9f7ad-c7ea-42d1-b2bb-77a1689390f2@github.com> Message-ID: On Fri, 12 Jul 2024 13:59:23 GMT, Fei Gao wrote: > This patch forces `CastX2P` to be a two-address instruction, so that C2 could allocate the same register for `dst` and `src`. Then we can remove the instruction completely in the assembly. > > The motivation comes from some cast operations like `castPP`. The difference for ADLC between `castPP` and `CastX2P` lies in that `CastX2P` always has different types for `dst` and `src`. We can force ADLC to generate an extra `two_adr()` for `CastX2P` like it does automatically for `castPP`, which could tell register allocator that the instruction needs the same register for `dst` and `src`. 
> > However, sometimes, RA and GCM in C2 can't work as we expected. > > For example, with the existing code we get this assembly: > > ldp x10, x11, [x17,#136] > add x10, x10, x15 > add x11, x11, x10 > ldr x12, [x17,#152] > str x16, [x10] > add x10, x12, x15 > str x16, [x11] > str x16, [x10] > > > After applying the patch independently, the assembly is: > > ldr x10, [x16,#136] <--- 1 > add x10, x10, x15 > ldr x11, [x16,#144] <--- 2 > mov x13, x10 <--- 3 > str x17, [x13] > ldr x12, [x16,#152] > add x10, x11, x10 > str x17, [x10] > add x10, x12, x15 > str x17, [x10] > > > C2 generates a totally extra `mov`, see 3, and we even lose the chance to merge the loads into a load pair, see 1 and 2. That's terrible. > > Although this scenario would disappear after combining with https://github.com/openjdk/jdk/pull/20157, I'm still not sure if this patch is worthwhile. I thought reusing one of the inputs for the destination is the default, and we have to add TEMP to rules to prevent this from happening. So I don't understand why sometimes the register allocator doesn't reuse the register when there are no other uses. There is a concept of "chain rule" in ADLC that I don't quite understand, but I suspect that it is related. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20159#issuecomment-2269762174 From jbhateja at openjdk.org Tue Aug 6 00:33:34 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 6 Aug 2024 00:33:34 GMT Subject: RFR: 8336860: x86: Change integer src operand for CMoveL of 0 and 1 to long [v3] In-Reply-To: References: Message-ID: On Mon, 5 Aug 2024 16:48:52 GMT, Jasmine Karthikeyan wrote: >> Hi all, >> This patch fixes `cmovL_imm_01*` instructions matching against an integer immediate 1 instead of a long immediate 1. I noticed while looking at the backend implementation of CMove that the rules specify `immI_1` instead of `immL1`, which means that the instructions can't be matched and instead fall through to the base case.
I added a small benchmark and got these results: >> >> Baseline Patch Improvement >> Benchmark Mode Cnt Score Error Units Score Error Units >> BasicRules.cmovL_imm_01 avgt 15 259.073 ± 5.806 ns/op 231.108 ± 2.730 ns/op (+ 11.41%) >> >> Thoughts and reviews would be appreciated! > > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Add IR test for codegen LGTM ------------- Marked as reviewed by jbhateja (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/20275#pullrequestreview-2220047712 From jkarthikeyan at openjdk.org Tue Aug 6 03:41:04 2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Tue, 6 Aug 2024 03:41:04 GMT Subject: RFR: 8335444: Generalize implementation of AndNode mul_ring [v2] In-Reply-To: <7l_zt0Q884dkzJbP5kjhfvZPmFQ4CMRgakWoUCopfvw=.fc1307d0-6a2c-40cc-adc6-c1b21424d0dd@github.com> References: <7l_zt0Q884dkzJbP5kjhfvZPmFQ4CMRgakWoUCopfvw=.fc1307d0-6a2c-40cc-adc6-c1b21424d0dd@github.com> Message-ID: > Hi all, > I've written this patch which improves type calculation for bitwise-and operations. Previously, the only case that was considered was when one of the inputs to the node was a positive constant. I've generalized this behavior, as well as added a case to better estimate the result for arbitrary ranges. Since these are very common patterns to see, this can help propagate more precise types throughout the ideal graph for a large number of methods, making other optimizations and analyses stronger. I was interested in where this patch improves types, so I ran CTW for `java_base` and `java_base_2` and printed out the differences in this gist [here](https://gist.github.com/jaskarth/b45260d81ab621656f4a55cc51cf5292). While I don't think it's particularly complicated I've also added some discussion of the mathematics below, mostly because I thought it was interesting to work through :) > > This patch passes tier1-3 testing on my linux x64 machine.
Thoughts and reviews would be very appreciated! Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: Fix IR test checks and add benchmark ------------- Changes: - all: https://git.openjdk.org/jdk/pull/20066/files - new: https://git.openjdk.org/jdk/pull/20066/files/808d8067..1de4cee9 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20066&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20066&range=00-01 Stats: 22 lines in 2 files changed: 16 ins; 4 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/20066.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20066/head:pull/20066 PR: https://git.openjdk.org/jdk/pull/20066 From qxing at openjdk.org Tue Aug 6 03:44:38 2024 From: qxing at openjdk.org (Qizheng Xing) Date: Tue, 6 Aug 2024 03:44:38 GMT Subject: RFR: 8333334: C2: Make result of `Node::dominates` more precise to enhance scalar replacement [v7] In-Reply-To: References: Message-ID: <8ZssgK4ARZKnqIQTKLs48iLlujjN6lzXDMry6BiunG0=.78b8aa81-4d06-486d-9d48-80d70caeff45@github.com> On Tue, 9 Jul 2024 03:10:55 GMT, Qizheng Xing wrote: >> This patch changes the algorithm of `Node::dominates` to make the result more precise, and allows the iterators of `ConcurrentHashMap` to be scalar replaced. >> >> The previous algorithm will return a conservative result when encountering a dead control flow, and only try the first two input paths of a multi-input Region node, which may prevent the scalar replacement in some cases. >> >> For example, with G1 GC enabled, C2 generates GC barriers for `ConcurrentHashMap` iteration operations at some early phases, and then eliminates them in a later IGVN, but `LoadNode` is also idealized in the same IGVN. 
This causes `LoadNode::Ideal` to see some dead barrier control flows, and refuse to split some instance field loads through Phi due to the conservative result of `Node::dominates`, and thus the scalar replacement cannot be applied to iterators in the later macro elimination phase. >> >> This patch allows `Node::dominates` to try other paths of the last multi-input Region node when the first path is dead, and makes `ConcurrentHashMap` iteration ~30% faster: >> >> >> Benchmark (nkeys) Mode Cnt Score Error Units >> Maps.testConcurrentHashMapIterators 10000 avgt 15 414099.085 ± 33230.945 ns/op # baseline >> Maps.testConcurrentHashMapIterators 10000 avgt 15 315490.281 ± 3037.056 ns/op # patch >> >> >> Testing: tier1-4. > > Qizheng Xing has updated the pull request incrementally with one additional commit since the last revision: > > Add copyright. Hi all, This patch has now passed all GHA tests and is ready for further reviews. If there are any other suggestions for this PR, please let me know. Thanks!
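Ignoring a dead input changes the dominance answer, and a toy control-flow graph makes that visible. The sketch below is hypothetical and much simpler than C2's `Node::dominates`, which according to the description above previously returned a conservative result when it encountered dead control flow. Here, `def` dominates `use` if no live path from `use` back to `root` avoids `def`, and inputs flagged as dead are simply skipped wherever they occur. `CfgNode`, its `dead` flag, and the node names are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ToyDominance {
    /** Hypothetical CFG node: a block with control inputs (predecessors) and a dead flag. */
    static final class CfgNode {
        final String name;
        final List<CfgNode> preds = new ArrayList<>();
        boolean dead;

        CfgNode(String name, CfgNode... inputs) {
            this.name = name;
            this.preds.addAll(List.of(inputs));
        }
    }

    /**
     * Returns true if every live path from use back to root passes through def.
     * Equivalently: walking predecessors from use, root is unreachable once def and dead nodes are cut out.
     */
    static boolean dominates(CfgNode def, CfgNode use, CfgNode root) {
        if (def == use) return true;
        Deque<CfgNode> work = new ArrayDeque<>();
        Set<CfgNode> seen = new HashSet<>();
        work.push(use);
        seen.add(use);
        while (!work.isEmpty()) {
            CfgNode n = work.pop();
            if (n == root) return false; // found a live path to root that avoids def
            for (CfgNode p : n.preds) {
                if (p == def || p.dead || !seen.add(p)) continue; // blocked by def, dead, or already visited
                work.push(p);
            }
        }
        return true;
    }

    public static void main(String[] args) {
        CfgNode root = new CfgNode("root");
        CfgNode iff = new CfgNode("if", root);
        CfgNode taken = new CfgNode("taken", iff);
        CfgNode other = new CfgNode("other", iff);
        CfgNode region = new CfgNode("region", taken, other);

        System.out.println(dominates(taken, region, root)); // false: "other" is a live path around "taken"
        other.dead = true;
        System.out.println(dominates(taken, region, root)); // true: the only live path goes through "taken"
    }
}
```

The second query shows the precision gain the PR aims for: once the second Region input is known to be dead, the first input really does dominate the merge point.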
------------- PR Comment: https://git.openjdk.org/jdk/pull/19496#issuecomment-2270323013 From jkarthikeyan at openjdk.org Tue Aug 6 03:45:38 2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Tue, 6 Aug 2024 03:45:38 GMT Subject: RFR: 8335444: Generalize implementation of AndNode mul_ring In-Reply-To: References: <7l_zt0Q884dkzJbP5kjhfvZPmFQ4CMRgakWoUCopfvw=.fc1307d0-6a2c-40cc-adc6-c1b21424d0dd@github.com> Message-ID: On Wed, 31 Jul 2024 18:38:48 GMT, Christian Hagedorn wrote: >> Not a review but I quickly ran it through our testing and the following test fails with `-XX:UseAVX=3` on linux-x64-debug and windows-x64-debug and without any flags on windows-x64-debug which seems to be related to your patch: >> >> Test: compiler/vectorization/runner/BasicBooleanOpTest.java >> >> Output: >> >> Failed IR Rules (1) of Methods (1) >> ---------------------------------- >> 1) Method "public boolean[] compiler.vectorization.runner.BasicBooleanOpTest.vectorAnd()" - [Failed IR rules: 1]: >> * @IR rule 3: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#MACRO_LOGIC_V#_", ">0"}, failOn={}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={"avx512f", "true"}, applyIfAnd={}, applyIfNot={})" >> > Phase "PrintIdeal": >> - counts: Graph contains wrong number of nodes: >> * Constraint 1: "MacroLogicV" >> - Failed comparison: [found] 0 > 0 [given] >> - No nodes matched! > >> Thank you for running testing @chhagedorn! I think I didn't run into this because my device doesn't support AVX-512. Does the failure have an ideal node printout as well? I think that could help in diagnosing the issue. Thanks! 
> > Sure, here is the log file for linux-x64-debug: [test_failure.log](https://github.com/user-attachments/files/16445886/test_failure.log) @chhagedorn I've had a chance to investigate the IR test failure, and it seems it's because with the patch we can find a sharper type during IGVN than with the baseline. After parsing, the code shape inside the loop looks like this: `StoreB[idx] = (a & b) & 1;` As the StoreB is storing a boolean type, we perform a bitwise-and of 1 to ensure the boolean value is in bounds. With the patch, the type of `(a & b)` is `bool` so the `& 1` is removed in `AndINode::Identity`. In the baseline, the type of `(a & b)` is `int`, so the redundant AndNode isn't removed. Later on, the logic chain is transformed into a `MacroLogicVNode`. Since the IR output of the operation has changed I've updated the IR checks, bringing it in line with the other operations in the test. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20066#issuecomment-2270323255 From jkarthikeyan at openjdk.org Tue Aug 6 04:07:36 2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Tue, 6 Aug 2024 04:07:36 GMT Subject: RFR: 8335444: Generalize implementation of AndNode mul_ring In-Reply-To: References: <7l_zt0Q884dkzJbP5kjhfvZPmFQ4CMRgakWoUCopfvw=.fc1307d0-6a2c-40cc-adc6-c1b21424d0dd@github.com> Message-ID: <6rrlmqYOiKBOErJVqS7elZ6W8c5R8eDukiBKrZsatpM=.eba78749-be7d-439b-90bc-1ad82350e6ea@github.com> On Mon, 5 Aug 2024 08:03:52 GMT, Damon Fenacci wrote: >> Thank you for running testing @chhagedorn! I think I didn't run into this because my device doesn't support AVX-512. Does the failure have an ideal node printout as well? I think that could help in diagnosing the issue. Thanks! > > @jaskarth out of curiosity: could you by chance notice any measurable performance difference (e.g. for specific/ad-hoc benchmarks)? 
@dafedafe I added a microbenchmark based on the case I saw above, and got these results:

    Benchmark                                                   (COUNT)  (seed)  Mode  Cnt  Baseline               Patch                  Improvement
    TypeVectorOperations.TypeVectorOperationsNonSuperWord.andZ      512       0  avgt    8  155.288 ± 1.175 ns/op  188.844 ± 4.189 ns/op  (+ 19.5%)
    TypeVectorOperations.TypeVectorOperationsNonSuperWord.andZ     2048       0  avgt    8  629.098 ± 7.489 ns/op  732.558 ± 3.983 ns/op  (+ 15.2%)
    TypeVectorOperations.TypeVectorOperationsSuperWord.andZ         512       0  avgt    8   22.917 ± 0.338 ns/op   23.578 ± 1.003 ns/op  (+ 2.8%)
    TypeVectorOperations.TypeVectorOperationsSuperWord.andZ        2048       0  avgt    8   35.683 ± 0.232 ns/op   37.546 ± 1.063 ns/op  (+ 5.1%)

In general though I've found that unfortunately it's pretty difficult to identify specific places where performance is improved, since rather than improving nodes locally this analysis strengthens other idealizations that use int types. By improving the type we might be able to find more operations that evaluate to constants or prune out redundant comparisons, either directly or through another node that transforms the type further. I've been wanting to make our type analysis stronger, so that we can find more nontrivial optimizations without needing specialized idealization rules.
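The kind of range reasoning discussed in this thread can be illustrated with a deliberately coarse model. The sketch below computes a sound range for `a & b` from the operands' ranges, using only the fact that `&` can clear bits but never set them. It is not the algorithm in the patch, which handles more cases. `Range`, `andRange`, and the fallback to the full `int` range are assumptions of this sketch. It also shows why, once both inputs are known to lie in [0, 1], the trailing `& 1` in `StoreB[idx] = (a & b) & 1` is an identity that `AndINode::Identity` can drop.

```java
public class AndBounds {
    /** Closed interval [lo, hi] of int values (hypothetical stand-in for C2's TypeInt). */
    record Range(int lo, int hi) {
        boolean contains(int v) { return lo <= v && v <= hi; }
    }

    static final Range FULL = new Range(Integer.MIN_VALUE, Integer.MAX_VALUE);

    /**
     * Sound range for (a & b). If an operand x is non-negative, x & y keeps only bits of x,
     * so 0 <= (x & y) <= x <= hi(x). If neither operand is known to be non-negative,
     * this sketch gives up and returns the full int range.
     */
    static Range andRange(Range a, Range b) {
        boolean aNonNeg = a.lo() >= 0;
        boolean bNonNeg = b.lo() >= 0;
        if (aNonNeg && bNonNeg) return new Range(0, Math.min(a.hi(), b.hi()));
        if (aNonNeg) return new Range(0, a.hi());
        if (bNonNeg) return new Range(0, b.hi());
        return FULL;
    }

    /** For a non-empty range r, (v & 1) == v holds for every v in r exactly when r lies within [0, 1]. */
    static boolean andOneIsIdentity(Range r) {
        return r.lo() >= 0 && r.hi() <= 1;
    }

    public static void main(String[] args) {
        Range bool = new Range(0, 1);
        Range r = andRange(bool, bool);
        System.out.println(r);                   // Range[lo=0, hi=1]
        System.out.println(andOneIsIdentity(r)); // true
    }
}
```

When both operands may be negative, a sharper bound is possible. The fallback to the full range only keeps the sketch short and does not affect its soundness.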
------------- PR Comment: https://git.openjdk.org/jdk/pull/20066#issuecomment-2270339413 From chagedorn at openjdk.org Tue Aug 6 07:19:34 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 6 Aug 2024 07:19:34 GMT Subject: RFR: 8335444: Generalize implementation of AndNode mul_ring In-Reply-To: References: <7l_zt0Q884dkzJbP5kjhfvZPmFQ4CMRgakWoUCopfvw=.fc1307d0-6a2c-40cc-adc6-c1b21424d0dd@github.com> Message-ID: <9DmseJVXRKHJ7Ht52KIMGhFzsg5U8SqsBlnsuenDkbc=.013ead04-4e50-40bd-a999-fd5dc5b353ce@github.com> On Wed, 31 Jul 2024 18:38:48 GMT, Christian Hagedorn wrote: >> Not a review but I quickly ran it through our testing and the following test fails with `-XX:UseAVX=3` on linux-x64-debug and windows-x64-debug and without any flags on windows-x64-debug which seems to be related to your patch: >> >> Test: compiler/vectorization/runner/BasicBooleanOpTest.java >> >> Output: >> >> Failed IR Rules (1) of Methods (1) >> ---------------------------------- >> 1) Method "public boolean[] compiler.vectorization.runner.BasicBooleanOpTest.vectorAnd()" - [Failed IR rules: 1]: >> * @IR rule 3: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#MACRO_LOGIC_V#_", ">0"}, failOn={}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={"avx512f", "true"}, applyIfAnd={}, applyIfNot={})" >> > Phase "PrintIdeal": >> - counts: Graph contains wrong number of nodes: >> * Constraint 1: "MacroLogicV" >> - Failed comparison: [found] 0 > 0 [given] >> - No nodes matched! > >> Thank you for running testing @chhagedorn! I think I didn't run into this because my device doesn't support AVX-512. Does the failure have an ideal node printout as well? I think that could help in diagnosing the issue. Thanks! 
> > Sure, here is the log file for linux-x64-debug: [test_failure.log](https://github.com/user-attachments/files/16445886/test_failure.log) > @chhagedorn I've had a chance to investigate the IR test failure, and it seems it's because with the patch we can find a sharper type during IGVN than with the baseline. After parsing, the code shape inside the loop looks like this: `StoreB[idx] = (a & b) & 1;` As the StoreB is storing a boolean type, we perform a bitwise-and of 1 to ensure the boolean value is in bounds. With the patch, the type of `(a & b)` is `bool` so the `& 1` is removed in `AndINode::Identity`. In the baseline, the type of `(a & b)` is `int`, so the redundant AndNode isn't removed. Later on, the logic chain is transformed into a `MacroLogicVNode`. > > Since the IR output of the operation has changed I've updated the IR checks, bringing it in line with the other operations in the test. Thanks for the investigation and the update! That sounds reasonable. Let me submit some testing again for that test. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20066#issuecomment-2270555730 From chagedorn at openjdk.org Tue Aug 6 07:48:35 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 6 Aug 2024 07:48:35 GMT Subject: RFR: 8335444: Generalize implementation of AndNode mul_ring [v2] In-Reply-To: References: <7l_zt0Q884dkzJbP5kjhfvZPmFQ4CMRgakWoUCopfvw=.fc1307d0-6a2c-40cc-adc6-c1b21424d0dd@github.com> Message-ID: On Tue, 6 Aug 2024 03:41:04 GMT, Jasmine Karthikeyan wrote: >> Hi all, >> I've written this patch which improves type calculation for bitwise-and functions. Previously, the only cases that were considered were if one of the inputs to the node were a positive constant. I've generalized this behavior, as well as added a case to better estimate the result for arbitrary ranges. 
Since these are very common patterns to see, this can help propagate more precise types throughout the ideal graph for a large number of methods, making other optimizations and analyses stronger. I was interested in where this patch improves types, so I ran CTW for `java_base` and `java_base_2` and printed out the differences in this gist [here](https://gist.github.com/jaskarth/b45260d81ab621656f4a55cc51cf5292). While I don't think it's particularly complicated, I've also added some discussion of the mathematics below, mostly because I thought it was interesting to work through :) >> >> This patch passes tier1-3 testing on my linux x64 machine. Thoughts and reviews would be very appreciated! > > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Fix IR test checks and add benchmark The test is still failing, with `UseAVX=3` and without. It seems that `MacroLogicVNode` is present in both logs (see used flags in command line dump): [normal.log](https://github.com/user-attachments/files/16507041/normal.log) [avx3.log](https://github.com/user-attachments/files/16507059/avx3.log) ------------- PR Comment: https://git.openjdk.org/jdk/pull/20066#issuecomment-2270604849 From chagedorn at openjdk.org Tue Aug 6 11:52:01 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 6 Aug 2024 11:52:01 GMT Subject: RFR: 8337876: [IR Framework] Add support for IR tests with @Stable Message-ID: It is currently not possible to write IR tests with `@Stable` annotations because one needs to somehow add the IR test classes to the boot classpath. This patch provides support to write such IR tests. I've added a section to the README to provide guidance on how this can be done. This is motivated by https://github.com/openjdk/jdk/pull/19635.
I've tested this patch by taking the current patch of https://github.com/openjdk/jdk/pull/19635, dropping the `RestrictStable` flag and modifying the tests to work with the new IR framework feature: TestFramework testFramework = new TestFramework(); testFramework .addFlags("-XX:-TieredCompilation", "-XX:+UseParallelGC") .addTestClassesToBootClassPath() .start(); Thanks, Christian ------------- Commit messages: - 8337876: [IR Framework] Add support for IR tests with @Stable Changes: https://git.openjdk.org/jdk/pull/20477/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=20477&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8337876 Stats: 36 lines in 4 files changed: 26 ins; 1 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/20477.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20477/head:pull/20477 PR: https://git.openjdk.org/jdk/pull/20477 From chagedorn at openjdk.org Tue Aug 6 11:52:01 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 6 Aug 2024 11:52:01 GMT Subject: RFR: 8337876: [IR Framework] Add support for IR tests with @Stable In-Reply-To: References: Message-ID: On Tue, 6 Aug 2024 11:43:32 GMT, Christian Hagedorn wrote: > It is currently not possible to write IR tests with `@Stable` annotations because one needs to somehow add the IR test classes to the boot classpath. This patch provides support to write such IR tests. I've added a section to the README to provide guidance on how this can be done. > > This is motivated by https://github.com/openjdk/jdk/pull/19635.
> > I've tested this patch by taking the current patch of https://github.com/openjdk/jdk/pull/19635, dropping the `RestrictStable` flag and modifying the tests to work with the new IR framework feature: > > TestFramework testFramework = new TestFramework(); > testFramework > .addFlags("-XX:-TieredCompilation", > "-XX:+UseParallelGC") > .addTestClassesToBootClassPath() > .start(); > > > Thanks, > Christian @shipilev @liach Since you've already looked at this idea in https://github.com/openjdk/jdk/pull/19635, I'm kindly asking you to also have a look at this PR, thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/20477#issuecomment-2271093719 From shade at openjdk.org Tue Aug 6 12:16:40 2024 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 6 Aug 2024 12:16:40 GMT Subject: RFR: 8337876: [IR Framework] Add support for IR tests with @Stable In-Reply-To: References: Message-ID: On Tue, 6 Aug 2024 11:43:32 GMT, Christian Hagedorn wrote: > It is currently not possible to write IR tests with `@Stable` annotations because one needs to somehow add the IR test classes to the boot classpath. This patch provides support to write such IR tests. I've added a section to the README to provide guidance on how this can be done. > > This is motivated by https://github.com/openjdk/jdk/pull/19635. > > I've tested this patch by taking the current patch of https://github.com/openjdk/jdk/pull/19635, dropping the `RestrictStable` flag and modifying the tests to work with the new IR framework feature: > > TestFramework testFramework = new TestFramework(); > testFramework > .addFlags("-XX:-TieredCompilation", > "-XX:+UseParallelGC") > .addTestClassesToBootClassPath() > .start(); > > > Thanks, > Christian Looks okay, I just think we don't need to mention `@Stable`.
test/hotspot/jtreg/compiler/lib/ir_framework/README.md line 161: > 159: > 160: ### 2.5 IR Tests with `@Stable` annotation > 161: To run tests with `@Stable` annotations, one need to add the test classes to the boot classpath. This can easily be achieved by calling `TestFramework.addTestClassesToBootClassPath()` on the test framework object: I think mentioning `@Stable` here is overly specific. This guidance applies to any code that expects to be run in privileged mode, whether it is `@Stable`, `@Contended`, `@ReservedStackAccess`, etc. I say it should be e.g. "2.5 IR Tests With Privileged Classes" or some such. test/hotspot/jtreg/compiler/lib/ir_framework/TestFramework.java line 329: > 327: /** > 328: * Add test classes to boot classpath. This adds all classes found on path {@link jdk.test.lib.Utils#TEST_CLASSES} > 329: * to the boot classpath with "-Xbootclasspath/a". This is useful when trying to run tests with @Stable annotations. Again, `@Stable` is overly specific here. This is "just" about the privileged code. test/hotspot/jtreg/compiler/lib/ir_framework/driver/TestVMProcess.java line 100: > 98: String bootClassPath = "-Xbootclasspath/a:."; > 99: if (testClassesOnBootClassPath) { > 100: // Add test classes themselves to boot classpath. This is required, for example, for IR tests with @Stable. Here too, drop the mention of `@Stable`? test/hotspot/jtreg/compiler/lib/ir_framework/driver/TestVMProcess.java line 107: > 105: cmds.add("-XX:+WhiteBoxAPI"); > 106: // Ignore CompileThreshold and CompileCommand flags which have an impact on the profiling information. > 107: List jtregVMFlags = Arrays.stream(Utils.getTestJavaOpts()).filter(s -> !s.contains("CompileThreshold")).toList(); This hunk seems unnecessary? 
I would like to have IR Framework patch that is easily backportable and does not contain additional, non-essential hunks :) ------------- PR Review: https://git.openjdk.org/jdk/pull/20477#pullrequestreview-2221121777 PR Review Comment: https://git.openjdk.org/jdk/pull/20477#discussion_r1705425371 PR Review Comment: https://git.openjdk.org/jdk/pull/20477#discussion_r1705425878 PR Review Comment: https://git.openjdk.org/jdk/pull/20477#discussion_r1705426506 PR Review Comment: https://git.openjdk.org/jdk/pull/20477#discussion_r1705428312 From chagedorn at openjdk.org Tue Aug 6 13:17:06 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 6 Aug 2024 13:17:06 GMT Subject: RFR: 8337876: [IR Framework] Add support for IR tests with @Stable [v2] In-Reply-To: References: Message-ID: > It is currently not possible to write IR tests with `@Stable` annotations because one needs to somehow add the IR test classes to the boot classpath. This patch provides support to write such IR tests. I've added a section to the README to provide guidance on how this can be done. > > This is motivated by https://github.com/openjdk/jdk/pull/19635.
> > I've tested this patch by taking the current patch of https://github.com/openjdk/jdk/pull/19635, dropping the `RestrictStable` flag and modifying the tests to work with the new IR framework feature: > > TestFramework testFramework = new TestFramework(); > testFramework > .addFlags("-XX:-TieredCompilation", > "-XX:+UseParallelGC") > .addTestClassesToBootClassPath() > .start(); > > > Thanks, > Christian Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: Review by Aleksey ------------- Changes: - all: https://git.openjdk.org/jdk/pull/20477/files - new: https://git.openjdk.org/jdk/pull/20477/files/cd4d53d6..557f9d05 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20477&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20477&range=00-01 Stats: 7 lines in 3 files changed: 1 ins; 0 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/20477.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20477/head:pull/20477 PR: https://git.openjdk.org/jdk/pull/20477 From chagedorn at openjdk.org Tue Aug 6 13:17:06 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 6 Aug 2024 13:17:06 GMT Subject: RFR: 8337876: [IR Framework] Add support for IR tests with @Stable [v2] In-Reply-To: References: Message-ID: <4JIBBawhzWSjpmQnH5aVKPSW_8X9BW-A6FMJaNrIt5k=.71d9aced-5669-4297-9415-2e274843c755@github.com> On Tue, 6 Aug 2024 12:10:42 GMT, Aleksey Shipilev wrote: >> Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: >> >> Review by Aleksey > > test/hotspot/jtreg/compiler/lib/ir_framework/README.md line 161: > >> 159: >> 160: ### 2.5 IR Tests with `@Stable` annotation >> 161: To run tests with `@Stable` annotations, one need to add the test classes to the boot classpath. 
This can easily be achieved by calling `TestFramework.addTestClassesToBootClassPath()` on the test framework object: > > I think mentioning `@Stable` here is overly specific. This guidance applies to any code that expects to be run in privileged mode, whether it is `@Stable`, `@Contended`, `@ReservedStackAccess`, etc. I say it should be e.g. "2.5 IR Tests With Privileged Classes" or some such. Thanks for the quick review! You're right, that's too specific. I've pushed an update according to your suggestion. > test/hotspot/jtreg/compiler/lib/ir_framework/driver/TestVMProcess.java line 107: > >> 105: cmds.add("-XX:+WhiteBoxAPI"); >> 106: // Ignore CompileThreshold and CompileCommand flags which have an impact on the profiling information. >> 107: List jtregVMFlags = Arrays.stream(Utils.getTestJavaOpts()).filter(s -> !s.contains("CompileThreshold")).toList(); > > This hunk seems unnecessary? I would like to have IR Framework patch that is easily backportable and does not contain additional, non-essential hunks :) Fair point, could be done separately, dropped :) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20477#discussion_r1705515066 PR Review Comment: https://git.openjdk.org/jdk/pull/20477#discussion_r1705514947 From adinn at openjdk.org Tue Aug 6 13:21:32 2024 From: adinn at openjdk.org (Andrew Dinn) Date: Tue, 6 Aug 2024 13:21:32 GMT Subject: RFR: 8337702: Use new ForwardExceptionNode to call StubRoutines::forward_exception_entry() [v2] In-Reply-To: References: Message-ID: On Fri, 2 Aug 2024 15:28:04 GMT, Vladimir Kozlov wrote: >> Currently C2 uses `TailCall` node when it generates code to forward exception in C2 runtime stubs. 
>> The `StubRoutines::forward_exception_entry()` address is passed as a constant and the method pointer is `NULL`: >> [generateOptoStub.cpp#L258](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/generateOptoStub.cpp#L258) >> >> On the other hand, the TailCall mach node takes 2 registers as parameters, which are hardcoded in `Matcher`: [matcher.cpp#L828](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/matcher.cpp#L828) >> As a result, we waste two registers to pass a constant and NULL. >> >> Also, an incorrect relocation is used for such a call because the address of the `forward_exception` stub is passed in a register in the mach node. When it is converted to `Address` for the `jmp` instruction, the default `external_word_type` relocation is used when `runtime_call_type` should be used. See the discussion in PR [JDK-8337396](https://github.com/openjdk/jdk/pull/20412) >> >> I added a new ideal node `ForwardExceptionNode` to solve these issues. It is similar to the `Rethrow` node (whose mach node definition I used as a template), but I kept it based on the `Return` node, similar to the `TailCall` node. >> >> Tested tier1-3,stress,xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Fix arm (32 bits) build Hi Vladimir. This looks like a very nice solution. The code changes look ok as far as I can tell. However, I'm not really in a position to approve the patch as I am not very familiar with the details of the matcher and formssel code. Sorry.
------------- PR Review: https://git.openjdk.org/jdk/pull/20437#pullrequestreview-2221288027 From shade at openjdk.org Tue Aug 6 13:24:37 2024 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 6 Aug 2024 13:24:37 GMT Subject: RFR: 8337876: [IR Framework] Add support for IR tests with @Stable [v2] In-Reply-To: References: Message-ID: <-p3rdej_RlMTcR7zGrGte4cD-l1A1Tquf8SSfpxfMIU=.36413f8e-daaa-4782-b95f-f6a624ee5b97@github.com> On Tue, 6 Aug 2024 13:17:06 GMT, Christian Hagedorn wrote: >> It is currently not possible to write IR tests with `@Stable` annotations because one needs to somehow add the IR test classes to the boot classpath. This patch provides support to write such IR tests. I've added a section to the README to provide guidance on how this can be done. >> >> This is motivated by https://github.com/openjdk/jdk/pull/19635. >> >> I've tested this patch by taking the current patch of https://github.com/openjdk/jdk/pull/19635, dropping the `RestrictStable` flag and modifying the tests to work with the new IR framework feature: >> >> TestFramework testFramework = new TestFramework(); >> testFramework >> .addFlags("-XX:-TieredCompilation", >> "-XX:+UseParallelGC") >> .addTestClassesToBootClassPath() >> .start(); >> >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > Review by Aleksey Looks fine, except one nit. I think we should maybe do a simple test to confirm this works on both Linux and Windows? test/hotspot/jtreg/compiler/lib/ir_framework/driver/TestVMProcess.java line 102: > 100: if (testClassesOnBootClassPath) { > 101: // Add test classes themselves to boot classpath to make them privileged. > 102: bootClassPath += ":" + Utils.TEST_CLASSES; I think `:` is not portable, and should instead be `File.pathSeparator`?
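A minimal Java sketch of the portable variant suggested above follows. It assumes the extra boot classpath entry arrives as a plain directory string. The `testClasses` parameter is a hypothetical stand-in for `Utils.TEST_CLASSES`, and `bootClassPathOption` is an illustrative name, not the framework's code. The `:` directly after `-Xbootclasspath/a` belongs to the option syntax on every platform. Only the separator between entries varies, and `File.pathSeparator` handles that (`:` on Linux and macOS, `;` on Windows).

```java
import java.io.File;

class BootClassPathSketch {
    // Hypothetical helper: builds the -Xbootclasspath/a option for the test VM.
    // The ':' right after "/a" is part of the option syntax; the entries themselves
    // are joined with the platform-specific File.pathSeparator.
    static String bootClassPathOption(boolean addTestClasses, String testClasses) {
        String option = "-Xbootclasspath/a:.";
        if (addTestClasses) {
            option += File.pathSeparator + testClasses;
        }
        return option;
    }

    public static void main(String[] args) {
        System.out.println(bootClassPathOption(false, "/tmp/classes"));
        System.out.println(bootClassPathOption(true, "/tmp/classes"));
    }
}
```

On Linux the second line prints `-Xbootclasspath/a:.:/tmp/classes`. On Windows it prints `-Xbootclasspath/a:.;/tmp/classes`.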
------------- PR Review: https://git.openjdk.org/jdk/pull/20477#pullrequestreview-2221294603 PR Review Comment: https://git.openjdk.org/jdk/pull/20477#discussion_r1705529991 From fjiang at openjdk.org Tue Aug 6 14:04:34 2024 From: fjiang at openjdk.org (Feilong Jiang) Date: Tue, 6 Aug 2024 14:04:34 GMT Subject: RFR: 8337780: RISC-V: C2: Change C calling convention for sp to NS In-Reply-To: References: Message-ID: On Mon, 5 Aug 2024 01:07:11 GMT, Fei Yang wrote: >> Hi, please review this patch that changes the C calling convention for sp to NS, as sp is always saved and restored by the prolog/epilog code. >> This could reduce the frame size by 16 bytes for those C2 runtime stubs [1] as we do not have to save sp on the method entry. >> >> I also checked the calling convention type for sp on other platforms (AArch64, PPC, x86, x64, S390), and they are all treated as NS. >> >> >> Testing: >> - [x] tier1~3 & hotspot:tier4 with release build >> >> 1: https://github.com/openjdk/jdk/blob/367e0a65561f95aad61b40930d5f46843fee3444/src/hotspot/share/opto/runtime.cpp#L147-L167 > > Looks good. Thanks. @RealFYang @robehn -- Thanks for the review! ------------- PR Comment: https://git.openjdk.org/jdk/pull/20449#issuecomment-2271371090 From fjiang at openjdk.org Tue Aug 6 14:04:35 2024 From: fjiang at openjdk.org (Feilong Jiang) Date: Tue, 6 Aug 2024 14:04:35 GMT Subject: Integrated: 8337780: RISC-V: C2: Change C calling convention for sp to NS In-Reply-To: References: Message-ID: On Sat, 3 Aug 2024 16:17:53 GMT, Feilong Jiang wrote: > Hi, please review this patch that changes the C calling convention for sp to NS, as sp is always saved and restored by the prolog/epilog code. > This could reduce the frame size by 16 bytes for those C2 runtime stubs [1] as we do not have to save sp on the method entry. > > I also checked the calling convention type for sp on other platforms (AArch64, PPC, x86, x64, S390), and they are all treated as NS.
> > > Testing: > - [x] tier1~3 & hotspot:tier4 with release build > > 1: https://github.com/openjdk/jdk/blob/367e0a65561f95aad61b40930d5f46843fee3444/src/hotspot/share/opto/runtime.cpp#L147-L167 This pull request has now been integrated. Changeset: 53db937d Author: Feilong Jiang URL: https://git.openjdk.org/jdk/commit/53db937dd0766695906dc20c1dbbd3228c02fe1e Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod 8337780: RISC-V: C2: Change C calling convention for sp to NS Reviewed-by: fyang, rehn ------------- PR: https://git.openjdk.org/jdk/pull/20449 From dnsimon at openjdk.org Tue Aug 6 16:14:52 2024 From: dnsimon at openjdk.org (Doug Simon) Date: Tue, 6 Aug 2024 16:14:52 GMT Subject: RFR: 8337887: [JVMCI] Clarify jdk.vm.ci.code.Architecture.getName javadoc Message-ID: This PR improves the documentation of `jdk.vm.ci.code.Architecture.getName`. ------------- Commit messages: - improve documentation of getName for each JVMCI Architecture class Changes: https://git.openjdk.org/jdk/pull/20476/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=20476&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8337887 Stats: 15 lines in 4 files changed: 11 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/20476.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20476/head:pull/20476 PR: https://git.openjdk.org/jdk/pull/20476 From dnsimon at openjdk.org Tue Aug 6 16:14:52 2024 From: dnsimon at openjdk.org (Doug Simon) Date: Tue, 6 Aug 2024 16:14:52 GMT Subject: RFR: 8337887: [JVMCI] Clarify jdk.vm.ci.code.Architecture.getName javadoc In-Reply-To: References: Message-ID: On Tue, 6 Aug 2024 10:38:06 GMT, Doug Simon wrote: > This PR improves the documentation of `jdk.vm.ci.code.Architecture.getName`. src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/code/Architecture.java line 129: > 127: } > 128: > 129: /// Gets the name of this architecture. 
The value returned for This seems like a good opportunity to use the new [markdown support for javadoc](https://openjdk.org/jeps/467). ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20476#discussion_r1705312698 From never at openjdk.org Tue Aug 6 16:14:52 2024 From: never at openjdk.org (Tom Rodriguez) Date: Tue, 6 Aug 2024 16:14:52 GMT Subject: RFR: 8337887: [JVMCI] Clarify jdk.vm.ci.code.Architecture.getName javadoc In-Reply-To: References: Message-ID: <9W9petR9Nxb6o83-z2Xlx036dKUVS5JEDQVmzlYYif4=.4d7d0f57-7a33-44ea-9598-7c028248df43@github.com> On Tue, 6 Aug 2024 10:38:06 GMT, Doug Simon wrote: > This PR improves the documentation of `jdk.vm.ci.code.Architecture.getName`. Marked as reviewed by never (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/20476#pullrequestreview-2221610853 From dnsimon at openjdk.org Tue Aug 6 16:40:34 2024 From: dnsimon at openjdk.org (Doug Simon) Date: Tue, 6 Aug 2024 16:40:34 GMT Subject: RFR: 8337887: [JVMCI] Clarify jdk.vm.ci.code.Architecture.getName javadoc In-Reply-To: References: Message-ID: On Tue, 6 Aug 2024 10:38:06 GMT, Doug Simon wrote: > This PR improves the documentation of `jdk.vm.ci.code.Architecture.getName`. Thanks for the review. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20476#issuecomment-2271702667 From dnsimon at openjdk.org Tue Aug 6 16:40:35 2024 From: dnsimon at openjdk.org (Doug Simon) Date: Tue, 6 Aug 2024 16:40:35 GMT Subject: Integrated: 8337887: [JVMCI] Clarify jdk.vm.ci.code.Architecture.getName javadoc In-Reply-To: References: Message-ID: On Tue, 6 Aug 2024 10:38:06 GMT, Doug Simon wrote: > This PR improves the documentation of `jdk.vm.ci.code.Architecture.getName`. This pull request has now been integrated. 
Changeset: 3f8b3e55 Author: Doug Simon URL: https://git.openjdk.org/jdk/commit/3f8b3e55276336cbb814d3b746d4337678389969 Stats: 15 lines in 4 files changed: 11 ins; 0 del; 4 mod 8337887: [JVMCI] Clarify jdk.vm.ci.code.Architecture.getName javadoc Reviewed-by: never ------------- PR: https://git.openjdk.org/jdk/pull/20476 From kvn at openjdk.org Tue Aug 6 17:03:32 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 6 Aug 2024 17:03:32 GMT Subject: RFR: 8337702: Use new ForwardExceptionNode to call StubRoutines::forward_exception_entry() [v2] In-Reply-To: References: Message-ID: On Tue, 6 Aug 2024 13:18:50 GMT, Andrew Dinn wrote: >> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix arm (32 bits) build > > Hi Vladimir. > > This looks like a very nice solution. The code changes look ok as far a I can tell. However, I'm not really in a position to approve the patch as I am not very familiar with the details of the matcher and formssel code. Sorry. Thank you, @adinn, for review. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20437#issuecomment-2271742113 From jkarthikeyan at openjdk.org Wed Aug 7 01:20:09 2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Wed, 7 Aug 2024 01:20:09 GMT Subject: RFR: 8335444: Generalize implementation of AndNode mul_ring [v3] In-Reply-To: <7l_zt0Q884dkzJbP5kjhfvZPmFQ4CMRgakWoUCopfvw=.fc1307d0-6a2c-40cc-adc6-c1b21424d0dd@github.com> References: <7l_zt0Q884dkzJbP5kjhfvZPmFQ4CMRgakWoUCopfvw=.fc1307d0-6a2c-40cc-adc6-c1b21424d0dd@github.com> Message-ID: <6lNmXhkDKMNu7r8yYrBYs4JUJQ-drbj9Gj1_EaAqQj0=.c699170b-0551-4c04-8322-0182cf9b5866@github.com> > Hi all, > I've written this patch which improves type calculation for bitwise-and functions. Previously, the only cases that were considered were if one of the inputs to the node were a positive constant. 
I've generalized this behavior, as well as added a case to better estimate the result for arbitrary ranges. Since these are very common patterns to see, this can help propagate more precise types throughout the ideal graph for a large number of methods, making other optimizations and analyses stronger. I was interested in where this patch improves types, so I ran CTW for `java_base` and `java_base_2` and printed out the differences in this gist [here](https://gist.github.com/jaskarth/b45260d81ab621656f4a55cc51cf5292). While I don't think it's particularly complicated I've also added some discussion of the mathematics below, mostly because I thought it was interesting to work through :) > > This patch passes tier1-3 testing on my linux x64 machine. Thoughts and reviews would be very appreciated! Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: Check IR before macro expansion ------------- Changes: - all: https://git.openjdk.org/jdk/pull/20066/files - new: https://git.openjdk.org/jdk/pull/20066/files/1de4cee9..ca2db583 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20066&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20066&range=01-02 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/20066.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20066/head:pull/20066 PR: https://git.openjdk.org/jdk/pull/20066 From jkarthikeyan at openjdk.org Wed Aug 7 01:20:10 2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Wed, 7 Aug 2024 01:20:10 GMT Subject: RFR: 8335444: Generalize implementation of AndNode mul_ring [v2] In-Reply-To: References: <7l_zt0Q884dkzJbP5kjhfvZPmFQ4CMRgakWoUCopfvw=.fc1307d0-6a2c-40cc-adc6-c1b21424d0dd@github.com> Message-ID: On Tue, 6 Aug 2024 03:41:04 GMT, Jasmine Karthikeyan wrote: >> Hi all, >> I've written this patch which improves type calculation for bitwise-and functions. 
Previously, the only cases that were considered were if one of the inputs to the node were a positive constant. I've generalized this behavior, as well as added a case to better estimate the result for arbitrary ranges. Since these are very common patterns to see, this can help propagate more precise types throughout the ideal graph for a large number of methods, making other optimizations and analyses stronger. I was interested in where this patch improves types, so I ran CTW for `java_base` and `java_base_2` and printed out the differences in this gist [here](https://gist.github.com/jaskarth/b45260d81ab621656f4a55cc51cf5292). While I don't think it's particularly complicated I've also added some discussion of the mathematics below, mostly because I thought it was interesting to work through :) >> >> This patch passes tier1-3 testing on my linux x64 machine. Thoughts and reviews would be very appreciated! > > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Fix IR test checks and add benchmark Hmm, that failure is quite peculiar, since now it seems we're failing because we're creating extra `MacroLogicV` nodes rather than failing because we're not creating any. Unfortunately, I didn't have much luck debugging the root cause since I don't have access to AVX-512 hardware. I've changed the IR check to a phase before `MacroLogicV` nodes are created, which should hopefully fix the failure. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20066#issuecomment-2272437897 From thartmann at openjdk.org Wed Aug 7 05:36:03 2024 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 7 Aug 2024 05:36:03 GMT Subject: Integrated: 8337968: Problem list compiler/vectorapi/VectorRebracket128Test.java Message-ID: Problem list until [JDK-8330538](https://bugs.openjdk.org/browse/JDK-8330538) is fixed to reduce the noise in testing. 
Thanks, Tobias ------------- Commit messages: - 8337968: Problem list compiler/vectorapi/VectorRebracket128Test.java Changes: https://git.openjdk.org/jdk/pull/20485/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=20485&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8337968 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/20485.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20485/head:pull/20485 PR: https://git.openjdk.org/jdk/pull/20485 From thartmann at openjdk.org Wed Aug 7 05:36:04 2024 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 7 Aug 2024 05:36:04 GMT Subject: Integrated: 8337968: Problem list compiler/vectorapi/VectorRebracket128Test.java In-Reply-To: References: Message-ID: On Wed, 7 Aug 2024 05:28:06 GMT, Tobias Hartmann wrote: > Problem list until [JDK-8330538](https://bugs.openjdk.org/browse/JDK-8330538) is fixed to reduce the noise in testing. > > Thanks, > Tobias This pull request has now been integrated. Changeset: 66286b25 Author: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/66286b25a183de2ffd0689da9c2bd1978b881aa7 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod 8337968: Problem list compiler/vectorapi/VectorRebracket128Test.java Reviewed-by: chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/20485 From chagedorn at openjdk.org Wed Aug 7 05:36:03 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 7 Aug 2024 05:36:03 GMT Subject: Integrated: 8337968: Problem list compiler/vectorapi/VectorRebracket128Test.java In-Reply-To: References: Message-ID: On Wed, 7 Aug 2024 05:28:06 GMT, Tobias Hartmann wrote: > Problem list until [JDK-8330538](https://bugs.openjdk.org/browse/JDK-8330538) is fixed to reduce the noise in testing. > > Thanks, > Tobias Looks good and trivial ------------- Marked as reviewed by chagedorn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/20485#pullrequestreview-2222660159 From thartmann at openjdk.org Wed Aug 7 05:36:03 2024 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 7 Aug 2024 05:36:03 GMT Subject: Integrated: 8337968: Problem list compiler/vectorapi/VectorRebracket128Test.java In-Reply-To: References: Message-ID: On Wed, 7 Aug 2024 05:28:06 GMT, Tobias Hartmann wrote: > Problem list until [JDK-8330538](https://bugs.openjdk.org/browse/JDK-8330538) is fixed to reduce the noise in testing. > > Thanks, > Tobias Thanks Christian! ------------- PR Comment: https://git.openjdk.org/jdk/pull/20485#issuecomment-2272648934 From dlunden at openjdk.org Wed Aug 7 07:34:03 2024 From: dlunden at openjdk.org (Daniel Lundén) Date: Wed, 7 Aug 2024 07:34:03 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 Message-ID: If a method has a large number of parameters, we currently bail out from C2 compilation. ### Changeset Allowing C2 compilation of methods with a large number of parameters requires fundamental changes to the register mask data structure, used in many places in C2. In particular, register masks currently have a statically determined size and cannot represent arbitrary numbers of stack slots. This is needed if we want to compile methods with arbitrary numbers of parameters. Register mask operations are present in performance-sensitive parts of C2, which further complicates changes. Changes: - Add functionality to dynamically grow/extend register masks. I experimented with a number of design choices to achieve this. To keep the common case (normal number of method parameters) quick and also to avoid more intrusive changes to the current `RegMask` interface, I decided to leave the "base" statically allocated memory for masks unchanged and only use dynamically allocated memory in the rare cases where it is needed.
- Generalize the "chunk"-logic from `PhaseChaitin::Select()` to allow arbitrary-sized chunks, and also move most of the logic into register mask methods to separate concerns and to make the `PhaseChaitin::Select()` code more readable. - Remove all `can_represent` checks and bailouts. - Performance tuning. A particularly important change is the early-exit optimization in `RegMask::overlap`, used in the performance-sensitive method `PhaseChaitin::interfere_with_live`. - Add a new test case `TestManyMethodArguments.java` and extend an old test `TestNestedSynchronize.java`. ### Testing - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/10178060450) - `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. - Standard performance benchmarking. No observed conclusive overall performance degradation/improvement. - Specific benchmarking of C2 compilation time. The changes increase C2 compilation time by, approximately and on average, 1% for methods that could also be compiled before this changeset (see the figure below). The reason for the degradation is further checks required in performance-sensitive code (in particular `PhaseChaitin::remove_bound_register_from_interfering_live_ranges`). I have tried optimizing in various ways, but changes I found that lead to improvement also lead to less readable code (and are, in my opinion, not worth it). 
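The following simplified Java model illustrates the design described above. All class and method names are hypothetical. HotSpot's actual `RegMask` is C++ and differs in detail, and the base size of four 64-bit words is an arbitrary assumption. A fixed base array serves the common case, and an extension array is allocated only when a bit beyond the base is set. `overlap` returns on the first shared word, in the spirit of the early-exit optimization mentioned for `RegMask::overlap`.

```java
import java.util.Arrays;

// Hypothetical, simplified model of a growable register mask; not HotSpot's RegMask.
class GrowableMask {
    static final int BASE_WORDS = 4;          // assumed size of the statically allocated part
    private final long[] base = new long[BASE_WORDS];
    private long[] ext = null;                // allocated only when a bit beyond the base is set

    // Reads word i from the base or the extension; missing words count as zero.
    private long word(int i) {
        if (i < BASE_WORDS) return base[i];
        int j = i - BASE_WORDS;
        return (ext != null && j < ext.length) ? ext[j] : 0L;
    }

    private int words() {
        return BASE_WORDS + (ext == null ? 0 : ext.length);
    }

    void set(int bit) {
        int i = bit >>> 6;
        if (i < BASE_WORDS) {
            base[i] |= 1L << (bit & 63);
            return;
        }
        int j = i - BASE_WORDS;
        if (ext == null) {
            ext = new long[j + 1];            // first use of the dynamic part
        } else if (j >= ext.length) {
            ext = Arrays.copyOf(ext, Math.max(j + 1, ext.length * 2));
        }
        ext[j] |= 1L << (bit & 63);
    }

    boolean member(int bit) {
        return (word(bit >>> 6) & (1L << (bit & 63))) != 0;
    }

    // Early exit: stops at the first word the two masks share instead of scanning everything.
    // Words past the shorter mask are zero in it, so checking the common prefix suffices.
    boolean overlap(GrowableMask other) {
        int n = Math.min(words(), other.words());
        for (int i = 0; i < n; i++) {
            if ((word(i) & other.word(i)) != 0) return true;
        }
        return false;
    }
}
```

Because the base array is always present, masks that never touch high stack slots avoid any allocation. This matches the trade-off the description gives for leaving the statically allocated part unchanged.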
![c2-regression](https://github.com/user-attachments/assets/ffb90ace-4420-4c21-aac9-695c1ca2b645) ------------- Commit messages: - Support methods with many arguments in C2 Changes: https://git.openjdk.org/jdk/pull/20404/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=20404&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8325467 Stats: 10995 lines in 13 files changed: 10766 ins; 80 del; 149 mod Patch: https://git.openjdk.org/jdk/pull/20404.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20404/head:pull/20404 PR: https://git.openjdk.org/jdk/pull/20404 From dlong at openjdk.org Wed Aug 7 08:04:32 2024 From: dlong at openjdk.org (Dean Long) Date: Wed, 7 Aug 2024 08:04:32 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 In-Reply-To: References: Message-ID: On Wed, 31 Jul 2024 12:36:38 GMT, Daniel Lundén wrote: > If a method has a large number of parameters, we currently bail out from C2 compilation. > > ### Changeset > > Allowing C2 compilation of methods with a large number of parameters requires fundamental changes to the register mask data structure, used in many places in C2. In particular, register masks currently have a statically determined size and cannot represent arbitrary numbers of stack slots. This is needed if we want to compile methods with arbitrary numbers of parameters. Register mask operations are present in performance-sensitive parts of C2, which further complicates changes. > > Changes: > - Add functionality to dynamically grow/extend register masks. I experimented with a number of design choices to achieve this. To keep the common case (normal number of method parameters) quick and also to avoid more intrusive changes to the current `RegMask` interface, I decided to leave the "base" statically allocated memory for masks unchanged and only use dynamically allocated memory in the rare cases where it is needed.
> - Generalize the "chunk"-logic from `PhaseChaitin::Select()` to allow arbitrary-sized chunks, and also move most of the logic into register mask methods to separate concerns and to make the `PhaseChaitin::Select()` code more readable. > - Remove all `can_represent` checks and bailouts. > - Performance tuning. A particularly important change is the early-exit optimization in `RegMask::overlap`, used in the performance-sensitive method `PhaseChaitin::interfere_with_live`. > - Add a new test case `TestManyMethodArguments.java` and extend an old test `TestNestedSynchronize.java`. > > ### Testing > > - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/10178060450) > - `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. > - Standard performance benchmarking. No observed conclusive overall performance degradation/improvement. > - Specific benchmarking of C2 compilation time. The changes increase C2 compilation time by, approximately and on average, 1% for methods that could also be compiled before this changeset (see the figure below). The reason for the degradation is further checks required in performance-sensitive code (in particular `PhaseChaitin::remove_bound_register_from_interfering_live_ranges`). I have tried optimizing in various ways, but changes I found that lead to improvement also lead to less readable code (and are, in my opinion, not worth it). > > ![c2-regression](https:/... test/hotspot/gtest/opto/test_regmask.cpp line 83: > 81: > 82: TEST_VM(RegMask, Set_ALL) { > 83: // Check that Set_All doesn't add bits outside of CHUNK_SIZE The comment refers to CHUNK_SIZE, which no longer exists. 
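The quoted PR description keeps each mask's statically allocated "base" storage. It allocates extra memory only for the rare, very large masks, and it relies on an early-exit `RegMask::overlap`. The Java sketch below shows that layout in simplified form. It is not HotSpot's C++ `RegMask`. The class name, `BASE_WORDS = 4`, and the use of `Arrays.copyOf` (standing in for arena allocation) are assumptions made for illustration.

```java
import java.util.Arrays;

// Hypothetical sketch, not HotSpot's RegMask: a bit set whose fixed-size
// "base" array covers the common case, plus an extension array that is
// allocated only when a register index falls beyond the base.
final class GrowableRegMask {
    static final int BASE_WORDS = 4;           // assumed size of the static part
    private final long[] base = new long[BASE_WORDS];
    private long[] ext;                        // null until a high register is inserted

    // Ensure base + ext can hold at least totalWords 64-bit words.
    private void grow(int totalWords) {
        int extWords = totalWords - BASE_WORDS;
        if (ext == null) {
            ext = new long[extWords];
        } else if (ext.length < extWords) {
            ext = Arrays.copyOf(ext, extWords);
        }
    }

    // Read word i; words past the allocated end are implicitly zero.
    private long word(int i) {
        if (i < BASE_WORDS) {
            return base[i];
        }
        int j = i - BASE_WORDS;
        return (ext != null && j < ext.length) ? ext[j] : 0L;
    }

    int words() {
        return BASE_WORDS + (ext == null ? 0 : ext.length);
    }

    void insert(int reg) {
        int w = reg >>> 6;                     // 64 registers per word
        long bit = 1L << (reg & 63);
        if (w < BASE_WORDS) {
            base[w] |= bit;
        } else {
            grow(w + 1);
            ext[w - BASE_WORDS] |= bit;
        }
    }

    boolean member(int reg) {
        return (word(reg >>> 6) & (1L << (reg & 63))) != 0;
    }

    // Early-exit overlap test: return at the first word the two masks share.
    // Only the common prefix is scanned; the shorter mask is zero beyond it.
    boolean overlap(GrowableRegMask other) {
        int n = Math.min(words(), other.words());
        for (int i = 0; i < n; i++) {
            if ((word(i) & other.word(i)) != 0) {
                return true;
            }
        }
        return false;
    }
}
```

In this sketch, masks that only use low register numbers never allocate `ext`. That mirrors the stated goal of keeping the common case cheap.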
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r1706564281 From gcao at openjdk.org Wed Aug 7 08:06:02 2024 From: gcao at openjdk.org (Gui Cao) Date: Wed, 7 Aug 2024 08:06:02 GMT Subject: RFR: 8337971: Problem list several jvmci tests on linux-riscv64 until JDK-8331704 is fixed Message-ID: Hi, As described in the [JDK-8331704](https://bugs.openjdk.org/browse/JDK-8331704) issue, there are some jvmci test cases that fail on linux-riscv64. We put the failed test cases into the Problem list until JDK-8331704 is fixed. Please take a look and have some reviews. Thanks a lot. ### Testing - [ ] Run tier1-3 tests on SOPHON SG2042 (release) ------------- Commit messages: - Problem list several jvmci tests on linux-riscv64 until JDK-8331704 is fixed Changes: https://git.openjdk.org/jdk/pull/20487/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=20487&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8337971 Stats: 6 lines in 1 file changed: 6 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/20487.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20487/head:pull/20487 PR: https://git.openjdk.org/jdk/pull/20487 From dlong at openjdk.org Wed Aug 7 08:11:34 2024 From: dlong at openjdk.org (Dean Long) Date: Wed, 7 Aug 2024 08:11:34 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 In-Reply-To: References: Message-ID: On Wed, 31 Jul 2024 12:36:38 GMT, Daniel Lundén wrote: > If a method has a large number of parameters, we currently bail out from C2 compilation. > > ### Changeset > > Allowing C2 compilation of methods with a large number of parameters requires fundamental changes to the register mask data structure, used in many places in C2. In particular, register masks currently have a statically determined size and cannot represent arbitrary numbers of stack slots. This is needed if we want to compile methods with arbitrary numbers of parameters.
Register mask operations are present in performance-sensitive parts of C2, which further complicates changes. > > Changes: > - Add functionality to dynamically grow/extend register masks. I experimented with a number of design choices to achieve this. To keep the common case (normal number of method parameters) quick and also to avoid more intrusive changes to the current `RegMask` interface, I decided to leave the "base" statically allocated memory for masks unchanged and only use dynamically allocated memory in the rare cases where it is needed. > - Generalize the "chunk"-logic from `PhaseChaitin::Select()` to allow arbitrary-sized chunks, and also move most of the logic into register mask methods to separate concerns and to make the `PhaseChaitin::Select()` code more readable. > - Remove all `can_represent` checks and bailouts. > - Performance tuning. A particularly important change is the early-exit optimization in `RegMask::overlap`, used in the performance-sensitive method `PhaseChaitin::interfere_with_live`. > - Add a new test case `TestManyMethodArguments.java` and extend an old test `TestNestedSynchronize.java`. > > ### Testing > > - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/10178060450) > - `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. > - Standard performance benchmarking. No observed conclusive overall performance degradation/improvement. > - Specific benchmarking of C2 compilation time. The changes increase C2 compilation time by, approximately and on average, 1% for methods that could also be compiled before this changeset (see the figure below). The reason for the degradation is further checks required in performance-sensitive code (in particular `PhaseChaitin::remove_bound_register_from_interfering_live_ranges`). 
I have tried optimizing in various ways, but changes I found that lead to improvement also lead to less readable code (and are, in my opinion, not worth it). > > ![c2-regression](https:/... What is the new maximum RegMask size and OptoReg value? What happens when C2 hits that limit? It's not clear from looking at your tests if they are under the limit or test going past the limit. There is at least one hidden limit that would cause an overflow if we accidentally go past it: OptoRegPair reg values currently must fit in `short`. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20404#issuecomment-2272877618 From fyang at openjdk.org Wed Aug 7 08:13:32 2024 From: fyang at openjdk.org (Fei Yang) Date: Wed, 7 Aug 2024 08:13:32 GMT Subject: RFR: 8337971: Problem list several jvmci tests on linux-riscv64 until JDK-8331704 is fixed In-Reply-To: References: Message-ID: <9QmqSS2u2xSyDA548VYAlUfqZw85FjpbQpcyfNsSi20=.2ce32e57-a5dc-4893-b1c1-de46de94ed53@github.com> On Wed, 7 Aug 2024 07:15:07 GMT, Gui Cao wrote: > Hi, As described in the [JDK-8331704](https://bugs.openjdk.org/browse/JDK-8331704) issue, there are some jvmci test cases that fail on linux-riscv64. we put the failed test cases into the Problem list until JDK-8331704 is fixed. > > Please take a look and have some reviews. Thanks a lot. > > ### Testing > - [ ] Run tier1-3 tests on SOPHON SG2042 (release) Marked as reviewed by fyang (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/20487#pullrequestreview-2223209474 From shade at openjdk.org Wed Aug 7 08:27:31 2024 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 7 Aug 2024 08:27:31 GMT Subject: RFR: 8337971: Problem list several jvmci tests on linux-riscv64 until JDK-8331704 is fixed In-Reply-To: References: Message-ID: On Wed, 7 Aug 2024 07:15:07 GMT, Gui Cao wrote: > Hi, As described in the [JDK-8331704](https://bugs.openjdk.org/browse/JDK-8331704) issue, there are some jvmci test cases that fail on linux-riscv64. 
we put the failed test cases into the Problem list until JDK-8331704 is fixed. > > Please take a look and have some reviews. Thanks a lot. > > ### Testing > - [ ] Run hotspot:tier1 tests on SOPHON SG2042 (release) Marked as reviewed by shade (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/20487#pullrequestreview-2223294819 From dlunden at openjdk.org Wed Aug 7 09:25:32 2024 From: dlunden at openjdk.org (Daniel Lundén) Date: Wed, 7 Aug 2024 09:25:32 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 In-Reply-To: References: Message-ID: On Wed, 7 Aug 2024 08:01:50 GMT, Dean Long wrote: >> If a method has a large number of parameters, we currently bail out from C2 compilation. >> >> ### Changeset >> >> Allowing C2 compilation of methods with a large number of parameters requires fundamental changes to the register mask data structure, used in many places in C2. In particular, register masks currently have a statically determined size and cannot represent arbitrary numbers of stack slots. This is needed if we want to compile methods with arbitrary numbers of parameters. Register mask operations are present in performance-sensitive parts of C2, which further complicates changes. >> >> Changes: >> - Add functionality to dynamically grow/extend register masks. I experimented with a number of design choices to achieve this. To keep the common case (normal number of method parameters) quick and also to avoid more intrusive changes to the current `RegMask` interface, I decided to leave the "base" statically allocated memory for masks unchanged and only use dynamically allocated memory in the rare cases where it is needed. >> - Generalize the "chunk"-logic from `PhaseChaitin::Select()` to allow arbitrary-sized chunks, and also move most of the logic into register mask methods to separate concerns and to make the `PhaseChaitin::Select()` code more readable. >> - Remove all `can_represent` checks and bailouts.
>> - Performance tuning. A particularly important change is the early-exit optimization in `RegMask::overlap`, used in the performance-sensitive method `PhaseChaitin::interfere_with_live`. >> - Add a new test case `TestManyMethodArguments.java` and extend an old test `TestNestedSynchronize.java`. >> >> ### Testing >> >> - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/10178060450) >> - `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. >> - Standard performance benchmarking. No observed conclusive overall performance degradation/improvement. >> - Specific benchmarking of C2 compilation time. The changes increase C2 compilation time by, approximately and on average, 1% for methods that could also be compiled before this changeset (see the figure below). The reason for the degradation is further checks required in performance-sensitive code (in particular `PhaseChaitin::remove_bound_register_from_interfering_live_ranges`). I have tried optimizing in various ways, but changes I found that lead to improvement also lead to less readable code (and are, in my opinion, no... > > test/hotspot/gtest/opto/test_regmask.cpp line 83: > >> 81: >> 82: TEST_VM(RegMask, Set_ALL) { >> 83: // Check that Set_All doesn't add bits outside of CHUNK_SIZE > > The comment refers to CHUNK_SIZE, which no longer exists. Thanks, I'll fix it (and grep for other occurrences) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r1706680848 From dlunden at openjdk.org Wed Aug 7 11:01:31 2024 From: dlunden at openjdk.org (Daniel Lundén) Date: Wed, 7 Aug 2024 11:01:31 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 In-Reply-To: References: Message-ID: On Wed, 7 Aug 2024 08:09:23 GMT, Dean Long wrote: >> If a method has a large number of parameters, we currently bail out from C2 compilation.
>> >> ### Changeset >> >> Allowing C2 compilation of methods with a large number of parameters requires fundamental changes to the register mask data structure, used in many places in C2. In particular, register masks currently have a statically determined size and cannot represent arbitrary numbers of stack slots. This is needed if we want to compile methods with arbitrary numbers of parameters. Register mask operations are present in performance-sensitive parts of C2, which further complicates changes. >> >> Changes: >> - Add functionality to dynamically grow/extend register masks. I experimented with a number of design choices to achieve this. To keep the common case (normal number of method parameters) quick and also to avoid more intrusive changes to the current `RegMask` interface, I decided to leave the "base" statically allocated memory for masks unchanged and only use dynamically allocated memory in the rare cases where it is needed. >> - Generalize the "chunk"-logic from `PhaseChaitin::Select()` to allow arbitrary-sized chunks, and also move most of the logic into register mask methods to separate concerns and to make the `PhaseChaitin::Select()` code more readable. >> - Remove all `can_represent` checks and bailouts. >> - Performance tuning. A particularly important change is the early-exit optimization in `RegMask::overlap`, used in the performance-sensitive method `PhaseChaitin::interfere_with_live`. >> - Add a new test case `TestManyMethodArguments.java` and extend an old test `TestNestedSynchronize.java`. >> >> ### Testing >> >> - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/10178060450) >> - `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. >> - Standard performance benchmarking. No observed conclusive overall performance degradation/improvement. >> - Specific benchmarking of C2 compilation time. 
The changes increase C2 compilation time by, approximately and on average, 1% for methods that could also be compiled before this changeset (see the figure below). The reason for the degradation is further checks required in performance-sensitive code (in particular `PhaseChaitin::remove_bound_register_from_interfering_live_ranges`). I have tried optimizing in various ways, but changes I found that lead to improvement also lead to less readable code (and are, in my opinion, no... > > What is the new maximum RegMask size and OptoReg value? What happens when C2 hits that limit? It's not clear from looking at your tests if they are under the limit or test going past the limit. There is at least one hidden limit that would cause an overflow if we accidentally go past it: OptoRegPair reg values currently must fit in `short`. Thanks for the comments @dean-long. There is no limit on register mask size, besides that it has to fit in memory. I allocate extended register mask memory in `comp_arena`, but I realize I should add an explicit check to see if the allocation succeeds or not when growing the register mask (and bail out if it doesn't). Good catch on the implications of `short` in `OptoRegPair` (and `int` in `OptoReg`). In practice, I doubt we'll ever reach these limits, but we should still ensure we add checks for this. Do you know why we use `int` in `OptoReg` but `short` in `OptoRegPair`? I don't see why we should not change it to `short` in `OptoReg` as well for consistency. 
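Dean points out that `OptoRegPair` register values must currently fit in `short`. The minimal Java sketch below shows why that is a hidden limit: a plain narrowing cast wraps silently instead of failing. It adds an explicit range guard. The class and method names are hypothetical, and this is not C2 code. Inside C2, such a guard would more plausibly lead to a compilation bailout or a sanity limit on mask size than to an exception.

```java
// Hypothetical sketch: storing int register numbers in 16-bit fields,
// as a pair type with `short` members would, with an explicit range check
// instead of silent truncation.
final class RegPairSketch {
    final short first;
    final short second;

    RegPairSketch(int first, int second) {
        this.first = checkedShort(first);
        this.second = checkedShort(second);
    }

    static boolean fitsInShort(int value) {
        return value >= Short.MIN_VALUE && value <= Short.MAX_VALUE;
    }

    // A bare (short) cast would turn 40000 into -25536 without any error.
    static short checkedShort(int value) {
        if (!fitsInShort(value)) {
            throw new IllegalArgumentException("register " + value + " does not fit in short");
        }
        return (short) value;
    }
}
```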
------------- PR Comment: https://git.openjdk.org/jdk/pull/20404#issuecomment-2273197937 From dlunden at openjdk.org Wed Aug 7 11:57:32 2024 From: dlunden at openjdk.org (Daniel Lundén) Date: Wed, 7 Aug 2024 11:57:32 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 In-Reply-To: References: Message-ID: On Wed, 7 Aug 2024 10:58:57 GMT, Daniel Lundén wrote: > but I realize I should add an explicit check to see if the allocation succeeds or not when growing the register mask (and bail out if it doesn't) After a bit of investigation, I do not believe we actually need explicit checks for whether the allocation succeeds. This is already handled internally in the arena allocation, and the VM will crash if we run out of memory. This should never really happen in practice, but we could set some sanity limit on register mask size if we feel that's appropriate. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20404#issuecomment-2273291282 From fyang at openjdk.org Wed Aug 7 13:14:32 2024 From: fyang at openjdk.org (Fei Yang) Date: Wed, 7 Aug 2024 13:14:32 GMT Subject: RFR: 8337702: Use new ForwardExceptionNode to call StubRoutines::forward_exception_entry() [v2] In-Reply-To: References: Message-ID: <2YYUNNYadbPKGGQ8jNqLpSX-Q4jTLeRVa0OXT0Xj_RU=.ed35c73c-8045-4008-bfd1-14d370fff721@github.com> On Fri, 2 Aug 2024 15:28:04 GMT, Vladimir Kozlov wrote: >> Currently C2 uses `TailCall` node when it generates code to forward exception in C2 runtime stubs. >> `StubRoutines::forward_exception_entry()` address is passed as constant and method pointer is `NULL`: >> [generateOptoStub.cpp#L258](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/generateOptoStub.cpp#L258) >> >> On other hand TailCall mach node uses 2 registers as parameter which is hardcoded in `Matcher`: [matcher.cpp#L828](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/matcher.cpp#L828) >> As result we waste two registers to pass constant and NULL.
>> >> Also incorrect relocation is used for such call because the address of `forward_exception` stub passed in register in mach node. When it is converted to `Address` for `jmp` instruction the default `external_word_type` relocation is used when `runtime_call_type` should be used. See discussion in PR [JDK-8337396](https://github.com/openjdk/jdk/pull/20412) >> >> I added new ideal node `ForwardExceptionNode` to solve these issues. It is similar to `Rethrow` node (which mach node definition I used as template) but I kept it based on `Return` node similar to `TailCall` node. >> >> Tested tier1-3,stress,xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Fix arm (32 bits) build FYI: Also performed tier1-3 test on linux-riscv64. Result looks good. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20437#issuecomment-2273442114 From kurashige.taizo at fujitsu.com Wed Aug 7 13:39:07 2024 From: kurashige.taizo at fujitsu.com (Taizo Kurashige (Fujitsu)) Date: Wed, 7 Aug 2024 13:39:07 +0000 Subject: Question about JDK-8221092 In-Reply-To: References: Message-ID: Hi all, I'm sorry to bother you again. If possible, could anyone please give me some insight? Thanks. ---------------------------------------------------------------------------------------- Taizo Kurashige GitHub account : https://github.com/kurashige23 ________________________________ Hi all, If possible, could anyone give me some insight? Thanks. ---------------------------------------------------------------------------------------- Taizo Kurashige GitHub account : https://github.com/kurashige23 ________________________________ Hi Sandhya, Thank you for your response. Thanks to you, I understood the following. - All Skylake processors have CPUID family=6, model=0x55, stepping < 5 - CascadeLake processors have CPUID family=6, model=0x55, stepping >=5.
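The classification rule summarized in the bullets above can be written as a small Java sketch. The rule is: family 6, model 0x55, split on stepping 5, with UseAVX defaulting to 2 on the lower steppings. The sketch encodes only what is stated in this thread. The names and the `hardwareMaxAvx` parameter are hypothetical, it is not HotSpot's CPU-detection code, and it does not answer the open question of where Intel documents these stepping values.

```java
// Hypothetical sketch of the rule quoted in this thread: CPUID family 6,
// model 0x55 covers both Skylake server parts and Cascade Lake, and the
// stepping value separates them. Names are illustrative only.
final class Model55Classifier {
    enum Kind { SKYLAKE, CASCADE_LAKE, OTHER }

    static Kind classify(int family, int model, int stepping) {
        if (family != 6 || model != 0x55) {
            return Kind.OTHER;
        }
        return stepping < 5 ? Kind.SKYLAKE : Kind.CASCADE_LAKE;
    }

    // Per the thread, the default UseAVX level is 2 on the stepping < 5 parts,
    // even when the hardware supports a higher level (assumed parameter).
    static int defaultUseAvx(int family, int model, int stepping, int hardwareMaxAvx) {
        return classify(family, model, stepping) == Kind.SKYLAKE
                ? Math.min(hardwareMaxAvx, 2)
                : hardwareMaxAvx;
    }
}
```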
Is there a specification for what the stepping value is for a particular processor? For example, is it defined in any documentation that CascadeLake processors have stepping >=5? I searched the documentation provided by Intel but couldn't find it. I want some evidence that the following is true. - All Skylake processors have stepping < 5 - CascadeLake processors have stepping >=5. If stepping per processor is specified somewhere, please let me know. Thanks. ---------------------------------------------------------------------------------------- Taizo Kurashige GitHub account : https://github.com/kurashige23 ________________________________ Hi Taizo, UseAVX is set to 2 for all Skylake processors (CPUID family=6, model=0x55, stepping < 5), not just Skylake X. CascadeLake processors have CPUID family=6, model=0x55, stepping >=5. Hope this helps. Best Regards, Sandhya -------- Forwarded Message -------- Subject: Re: Question about JDK-8221092 Date: Wed, 10 Jul 2024 06:39:43 +0000 From: Taizo Kurashige (Fujitsu) To: hotspot-compiler-dev at openjdk.org Hi all, Could someone please respond to this question if possible? Thank you. ---------------------------------------------------------------------------------------- Taizo Kurashige GitHub account : https://github.com/kurashige23 ------------------------------------------------------------------------------------------------------------------------ *From:* Kurashige, Taizo *Sent:* July 3, 2024 15:09 *To:* hotspot-compiler-dev at openjdk.org *Subject:* Question about JDK-8221092 Hi all, I have a question about https://bugs.openjdk.org/browse/JDK-8221092. If possible, could someone please provide some insight? Here's what I would like to know: 1. Is it correct to understand that "Skylake (X7) processors" refers to the Skylake processors listed at https://ark.intel.com/content/www/us/en/ark/products/codename/37572/products-formerly-skylake.html, specifically those in the 7000 series with an "X" or "XE" in their names?
For example, "Intel® Core™ i9-7920X X-series Processor (16.5M Cache, up to 4.30 GHz)" or "Intel® Core™ i9-7980XE Extreme Edition Processor (24.75M Cache, up to 4.20 GHz)". 2. In the fix for JDK-8221092, if the stepping is less than 5, the processor is considered to be of Skylake (X7) or an earlier version. In such cases, UseAVX is set to 2. Is there any documentation that the stepping for Skylake (X7) is 5? Thank you. ---------------------------------------------------------------------------------------- Taizo Kurashige GitHub account : https://github.com/kurashige23 From dhanalla at openjdk.org Thu Aug 8 00:23:41 2024 From: dhanalla at openjdk.org (Dhamoder Nalla) Date: Thu, 8 Aug 2024 00:23:41 GMT Subject: RFR: 8315916: assert(C->live_nodes() <= C->max_node_limit()) failed: Live Node limit exceeded Message-ID: In the debug build, the assert is triggered during the parsing (before Code_Gen). In the Release build, however, the compilation bails out at `Compile::check_node_count()` during the code generation phase and completes execution without any issues. When I commented out the assert(C->live_nodes() <= C->max_node_limit()), both debug and Release builds exhibited the same behavior: the compilation bails out, and execution completes without any issues. The assert statement is not essential, as it is causing unnecessary failures in the debug build.
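The behavior described above involves the same node budget being enforced in two ways: a debug-only assert, and a release-mode check that abandons the whole compilation. A narrower option is to skip only the transformation that would grow the graph too much. The hypothetical Java sketch below contrasts a compilation-wide bail-out with an optimization-local one. `NodeBudget`, its methods, and the 10% safety margin are illustrative assumptions, not C2's API.

```java
// Hypothetical sketch contrasting two reactions to a live-node budget.
// Names and the safety margin are illustrative, not C2's.
final class NodeBudget {
    private final int maxNodes;
    private int liveNodes;
    private String failureReason;   // non-null once the whole compilation is abandoned

    NodeBudget(int maxNodes, int liveNodes) {
        this.maxNodes = maxNodes;
        this.liveNodes = liveNodes;
    }

    // Compilation-wide bail-out: record why, and let the caller stop compiling.
    boolean checkNodeCount(int needed) {
        if (liveNodes + needed > maxNodes) {
            failureReason = "out of nodes";
            return false;
        }
        return true;
    }

    // Optimization-local bail-out: skip just this transformation when its
    // estimated growth would push the graph close to the limit.
    boolean tryTransform(int estimatedNewNodes) {
        int threshold = maxNodes - maxNodes / 10;   // assumed 10% safety margin
        if (liveNodes + estimatedNewNodes > threshold) {
            return false;                          // graph left unchanged
        }
        liveNodes += estimatedNewNodes;
        return true;
    }

    boolean failed() { return failureReason != null; }
    int liveNodes() { return liveNodes; }
}
```

The local variant keeps the rest of the method compiled, whereas the compilation-wide check discards all of the work done so far.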
------------- Commit messages: - Removing the unnecessary assert Changes: https://git.openjdk.org/jdk/pull/20504/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=20504&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8315916 Stats: 4 lines in 1 file changed: 0 ins; 4 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/20504.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20504/head:pull/20504 PR: https://git.openjdk.org/jdk/pull/20504 From gcao at openjdk.org Thu Aug 8 05:25:36 2024 From: gcao at openjdk.org (Gui Cao) Date: Thu, 8 Aug 2024 05:25:36 GMT Subject: RFR: 8337971: Problem list several jvmci tests on linux-riscv64 until JDK-8331704 is fixed In-Reply-To: References: Message-ID: On Wed, 7 Aug 2024 07:15:07 GMT, Gui Cao wrote: > Hi, As described in the [JDK-8331704](https://bugs.openjdk.org/browse/JDK-8331704) issue, there are some jvmci test cases that fail on linux-riscv64. we put the failed test cases into the Problem list until JDK-8331704 is fixed. > > Please take a look and have some reviews. Thanks a lot. > > ### Testing > - [x] Run hotspot:tier1 tests on SOPHON SG2042 (release) Thanks all for the review. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20487#issuecomment-2274974395 From duke at openjdk.org Thu Aug 8 05:25:37 2024 From: duke at openjdk.org (duke) Date: Thu, 8 Aug 2024 05:25:37 GMT Subject: RFR: 8337971: Problem list several jvmci tests on linux-riscv64 until JDK-8331704 is fixed In-Reply-To: References: Message-ID: <-pGGZm_DfLFH-5fGpCawcichEqRQFlRWM6s0xz2AFgo=.fdd686a8-e9bf-404a-966f-e3b052f26b34@github.com> On Wed, 7 Aug 2024 07:15:07 GMT, Gui Cao wrote: > Hi, As described in the [JDK-8331704](https://bugs.openjdk.org/browse/JDK-8331704) issue, there are some jvmci test cases that fail on linux-riscv64. we put the failed test cases into the Problem list until JDK-8331704 is fixed. > > Please take a look and have some reviews. Thanks a lot. 
> > ### Testing > - [x] Run hotspot:tier1 tests on SOPHON SG2042 (release) @zifeihan Your change (at version 5ed0820229c20c07f4176fc5809569f343f3dfe7) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20487#issuecomment-2274975130 From gcao at openjdk.org Thu Aug 8 05:25:37 2024 From: gcao at openjdk.org (Gui Cao) Date: Thu, 8 Aug 2024 05:25:37 GMT Subject: Integrated: 8337971: Problem list several jvmci tests on linux-riscv64 until JDK-8331704 is fixed In-Reply-To: References: Message-ID: On Wed, 7 Aug 2024 07:15:07 GMT, Gui Cao wrote: > Hi, As described in the [JDK-8331704](https://bugs.openjdk.org/browse/JDK-8331704) issue, there are some jvmci test cases that fail on linux-riscv64. we put the failed test cases into the Problem list until JDK-8331704 is fixed. > > Please take a look and have some reviews. Thanks a lot. > > ### Testing > - [x] Run hotspot:tier1 tests on SOPHON SG2042 (release) This pull request has now been integrated. Changeset: 16df9c33 Author: Gui Cao Committer: Fei Yang URL: https://git.openjdk.org/jdk/commit/16df9c33e9bbc9329ae60ba14332c09aadaba3f0 Stats: 6 lines in 1 file changed: 6 ins; 0 del; 0 mod 8337971: Problem list several jvmci tests on linux-riscv64 until JDK-8331704 is fixed Reviewed-by: fyang, shade ------------- PR: https://git.openjdk.org/jdk/pull/20487 From chagedorn at openjdk.org Thu Aug 8 06:25:30 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 8 Aug 2024 06:25:30 GMT Subject: RFR: 8315916: assert(C->live_nodes() <= C->max_node_limit()) failed: Live Node limit exceeded In-Reply-To: References: Message-ID: On Thu, 8 Aug 2024 00:18:37 GMT, Dhamoder Nalla wrote: > In the debug build, the assert is triggered during the parsing (before Code_Gen). In the Release build, however, the compilation bails out at `Compile::check_node_count()` during the code generation phase and completes execution without any issues. 
> > When I commented out the assert(C->live_nodes() <= C->max_node_limit()), both debug and Release builds exhibited the same behavior: the compilation bails out, and execution completes without any issues. > > The assert statement is not essential, as it is causing unnecessary failures in the debug build. Hi @dhanalla, this is not the right way to handle this assertion failure. The assertion is here to catch real issues when creating too many nodes due to a bug in the code. For example, in [JDK-8256934](https://bugs.openjdk.org/browse/JDK-8256934), we hit this assert due to an inefficient cloning algorithm in Partial Peeling. We should not remove the assert. For such bugs, you first need to investigate why we hit the node limit with your reproducer. Once you find the problem, it can usually be put into one of the following categories: 1. We have a real bug and by fixing it, we no longer create this many nodes. 2. It is a false positive and it is expected to create this many nodes (note that the node limit of 80000 is quite large, so it needs to be explained well why it is a false positive - more often than not, there is still a bug somewhere that is first missed). 3. We have a real bug but the fix is too hard, risky, or just not worth the complexity, especially for a real edge case (also needs to be explained and justified well). Note that for categories 2 and 3, when we cannot easily fix the problem of creating too many nodes, we should implement a fix that bails out of the current optimization, not the entire compilation, to reduce the performance impact. This was, for example, done in JDK-8256934, where a fix was too risky at that point during the release and a proper fix was delayed. The fix was to bail out from Partial Peeling when hitting a critically high number of live nodes (an estimate to ensure we never hit the node limit). You should then describe your analysis in the PR and then explain your proposed solution.
You should also add the reproducer as a test case to your patch. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20504#issuecomment-2275041220 From chagedorn at openjdk.org Thu Aug 8 06:40:17 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 8 Aug 2024 06:40:17 GMT Subject: RFR: 8337876: [IR Framework] Add support for IR tests with @Stable [v3] In-Reply-To: References: Message-ID: > It is currently not possible to write IR tests with `@Stable` annotations because one need to somehow add the IR test classes to the boot classpath. This patch provides support to write such IR tests. I've added a section to the README to provide guidance on how this can be done. > > This is motivated by https://github.com/openjdk/jdk/pull/19635. > > I've tested this patch by taking the current patch of https://github.com/openjdk/jdk/pull/19635, dropping the `RestrictStable` flag and modifying the tests to work with the new IR framework feature: > > TestFramework testFramework = new TestFramework(); > testFramework > .addFlags("-XX:-TieredCompilation", > "-XX:+UseParallelGC") > .addTestClassesToBootClassPath() > .start(); > > > Thanks, > Christian Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: Fix file separator + add test ------------- Changes: - all: https://git.openjdk.org/jdk/pull/20477/files - new: https://git.openjdk.org/jdk/pull/20477/files/557f9d05..f4ad6df8 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20477&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20477&range=01-02 Stats: 71 lines in 2 files changed: 70 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/20477.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20477/head:pull/20477 PR: https://git.openjdk.org/jdk/pull/20477 From chagedorn at openjdk.org Thu Aug 8 06:40:17 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 8 Aug 2024 06:40:17 GMT Subject: RFR: 8337876: [IR
Framework] Add support for IR tests with @Stable [v2] In-Reply-To: References: Message-ID: On Tue, 6 Aug 2024 13:17:06 GMT, Christian Hagedorn wrote: >> It is currently not possible to write IR tests with `@Stable` annotations because one need to somehow add the IR test classes to the boot classpath. This patch provides support to write such IR tests. I've added a section to the README to provide guidance on how this can be done. >> >> This is motivated by https://github.com/openjdk/jdk/pull/19635. >> >> I've tested this patch by taking the current patch of https://github.com/openjdk/jdk/pull/19635, dropping the `RestrictStable` flag and modifying the tests to work with the new IR framework feature: >> >> TestFramework testFramework = new TestFramework(); >> testFramework >> .addFlags("-XX:-TieredCompilation", >> "-XX:+UseParallelGC") >> .addTestClassesToBootClassPath() >> .start(); >> >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > Review by Aleksey Sorry for the delay. That is definitely a good idea to add a simple test to check the new feature. I've added such a test with `@Stable` that fails when not adding the test class to the boot classpath and works otherwise. 
------------- PR Review: https://git.openjdk.org/jdk/pull/20477#pullrequestreview-2226971263 From chagedorn at openjdk.org Thu Aug 8 06:40:17 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 8 Aug 2024 06:40:17 GMT Subject: RFR: 8337876: [IR Framework] Add support for IR tests with @Stable [v2] In-Reply-To: <-p3rdej_RlMTcR7zGrGte4cD-l1A1Tquf8SSfpxfMIU=.36413f8e-daaa-4782-b95f-f6a624ee5b97@github.com> References: <-p3rdej_RlMTcR7zGrGte4cD-l1A1Tquf8SSfpxfMIU=.36413f8e-daaa-4782-b95f-f6a624ee5b97@github.com> Message-ID: On Tue, 6 Aug 2024 13:21:15 GMT, Aleksey Shipilev wrote: >> Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: >> >> Review by Aleksey > > test/hotspot/jtreg/compiler/lib/ir_framework/driver/TestVMProcess.java line 102: > >> 100: if (testClassesOnBootClassPath) { >> 101: // Add test classes themselves to boot classpath to make them privileged. >> 102: bootClassPath += ":" + Utils.TEST_CLASSES; > > I think `:` is not portable, and should instead be `File.pathSeparator`? Good catch! Fixed. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20477#discussion_r1708739469 From shade at openjdk.org Thu Aug 8 08:31:36 2024 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 8 Aug 2024 08:31:36 GMT Subject: RFR: 8337876: [IR Framework] Add support for IR tests with @Stable [v3] In-Reply-To: References: Message-ID: On Thu, 8 Aug 2024 06:40:17 GMT, Christian Hagedorn wrote: >> It is currently not possible to write IR tests with `@Stable` annotations because one need to somehow add the IR test classes to the boot classpath. This patch provides support to write such IR tests. I've added a section to the README to provide guidance on how this can be done. >> >> This is motivated by https://github.com/openjdk/jdk/pull/19635. 
>> >> I've tested this patch by taking the current patch of https://github.com/openjdk/jdk/pull/19635, dropping the `RestrictStable` flag and modifying the tests to work with the new IR framework feature: >> >> TestFramework testFramework = new TestFramework(); >> testFramework >> .addFlags("-XX:-TieredCompilation", >> "-XX:+UseParallelGC") >> .addTestClassesToBootClassPath() >> .start(); >> >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > Fix file separator + add test OK, thanks. ------------- Marked as reviewed by shade (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/20477#pullrequestreview-2227203236 From chagedorn at openjdk.org Thu Aug 8 08:36:32 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 8 Aug 2024 08:36:32 GMT Subject: RFR: 8337876: [IR Framework] Add support for IR tests with @Stable [v3] In-Reply-To: References: Message-ID: On Thu, 8 Aug 2024 06:40:17 GMT, Christian Hagedorn wrote: >> It is currently not possible to write IR tests with `@Stable` annotations because one needs to somehow add the IR test classes to the boot classpath. This patch provides support to write such IR tests. I've added a section to the README to provide guidance on how this can be done. >> >> This is motivated by https://github.com/openjdk/jdk/pull/19635. >> >> I've tested this patch by taking the current patch of https://github.com/openjdk/jdk/pull/19635, dropping the `RestrictStable` flag and modifying the tests to work with the new IR framework feature: >> >> TestFramework testFramework = new TestFramework(); >> testFramework >> .addFlags("-XX:-TieredCompilation", >> "-XX:+UseParallelGC") >> .addTestClassesToBootClassPath() >> .start(); >> >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > Fix file separator + add test Thanks Aleksey for your review!
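To make the portability point from the review concrete, here is a minimal, self-contained Java sketch. It is not the IR framework's actual code: the class `BootClassPathBuilder` and its `append` method are hypothetical. The only real APIs it relies on are `java.io.File.pathSeparator` (`:` on Unix-like systems, `;` on Windows) and the jtreg-provided `test.classes` system property; the fallback value used outside jtreg is an assumption.

```java
import java.io.File;

/** Hypothetical helper illustrating portable classpath concatenation. */
public final class BootClassPathBuilder {
    private BootClassPathBuilder() {}

    /**
     * Appends an entry to a classpath string using File.pathSeparator
     * (":" on Unix-like systems, ";" on Windows) instead of a hard-coded ":".
     */
    public static String append(String classPath, String entry) {
        if (classPath == null || classPath.isEmpty()) {
            return entry; // nothing to separate from
        }
        return classPath + File.pathSeparator + entry;
    }

    public static void main(String[] args) {
        // jtreg sets "test.classes"; the fallback is an assumption for standalone runs.
        String testClasses = System.getProperty("test.classes", "build/classes");
        String bootClassPath = append("lib/a.jar", testClasses);
        // Such a path would typically be passed to the test VM via -Xbootclasspath/a:
        System.out.println("-Xbootclasspath/a:" + bootClassPath);
    }
}
```

The point of using `File.pathSeparator` rather than a literal `":"` is that the generated `-Xbootclasspath/a:` argument stays valid on Windows, where `:` also appears inside drive-letter paths such as `C:\...`.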
------------- PR Comment: https://git.openjdk.org/jdk/pull/20477#issuecomment-2275258274 From dlunden at openjdk.org Thu Aug 8 09:29:17 2024 From: dlunden at openjdk.org (Daniel Lundén) Date: Thu, 8 Aug 2024 09:29:17 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v2] In-Reply-To: References: Message-ID: <7O0X1427MXg_MrnMb9qcH64x3lG3feYt47iZhBujztk=.3fee1482-8bcf-42d9-bc4d-0e074f03222e@github.com> > If a method has a large number of parameters, we currently bail out from C2 compilation. > > ### Changeset > > Allowing C2 compilation of methods with a large number of parameters requires fundamental changes to the register mask data structure, used in many places in C2. In particular, register masks currently have a statically determined size and cannot represent arbitrary numbers of stack slots. This is needed if we want to compile methods with arbitrary numbers of parameters. Register mask operations are present in performance-sensitive parts of C2, which further complicates changes. > > Changes: > - Add functionality to dynamically grow/extend register masks. I experimented with a number of design choices to achieve this. To keep the common case (normal number of method parameters) quick and also to avoid more intrusive changes to the current `RegMask` interface, I decided to leave the "base" statically allocated memory for masks unchanged and only use dynamically allocated memory in the rare cases where it is needed. > - Generalize the "chunk"-logic from `PhaseChaitin::Select()` to allow arbitrary-sized chunks, and also move most of the logic into register mask methods to separate concerns and to make the `PhaseChaitin::Select()` code more readable. > - Remove all `can_represent` checks and bailouts. > - Performance tuning. A particularly important change is the early-exit optimization in `RegMask::overlap`, used in the performance-sensitive method `PhaseChaitin::interfere_with_live`.
> - Add a new test case `TestManyMethodArguments.java` and extend an old test `TestNestedSynchronize.java`. > > ### Testing > > - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/10178060450) > - `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. > - Standard performance benchmarking. No observed conclusive overall performance degradation/improvement. > - Specific benchmarking of C2 compilation time. The changes increase C2 compilation time by, approximately and on average, 1% for methods that could also be compiled before this changeset (see the figure below). The reason for the degradation is further checks required in performance-sensitive code (in particular `PhaseChaitin::remove_bound_register_from_interfering_live_ranges`). I have tried optimizing in various ways, but changes I found that lead to improvement also lead to less readable code (and are, in my opinion, not worth it). > > ![c2-regression](https:/... 
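The register-mask design described above can be sketched in Java: a fixed, statically sized base, a dynamically allocated extension that exists only when a high register or stack slot is inserted, and an early exit in `overlap`. This is a hypothetical model for illustration only. HotSpot's `RegMask` is C++, and the class name `GrowableRegMask`, the `BASE_WORDS` size, and the growth policy below are assumptions, not the patch's code.

```java
import java.util.Arrays;

/**
 * Hypothetical model of a register mask with a fixed-size base and a lazily
 * allocated extension for high bits (e.g. many stack slots).
 */
public final class GrowableRegMask {
    private static final int BASE_WORDS = 4;   // assumed size of the static part
    private final long[] base = new long[BASE_WORDS];
    private long[] ext = null;                 // allocated only when needed

    public void insert(int reg) {
        int word = reg >>> 6;
        if (word < BASE_WORDS) {
            base[word] |= 1L << (reg & 63);
            return;
        }
        int extWord = word - BASE_WORDS;
        if (ext == null) {
            ext = new long[Math.max(extWord + 1, BASE_WORDS)];
        } else if (extWord >= ext.length) {
            ext = Arrays.copyOf(ext, Math.max(extWord + 1, ext.length * 2));
        }
        ext[extWord] |= 1L << (reg & 63);
    }

    public boolean member(int reg) {
        int word = reg >>> 6;
        if (word < BASE_WORDS) {
            return (base[word] & (1L << (reg & 63))) != 0;
        }
        int extWord = word - BASE_WORDS;
        return ext != null && extWord < ext.length
                && (ext[extWord] & (1L << (reg & 63))) != 0;
    }

    /** Returns true as soon as a common bit is found (early exit). */
    public boolean overlap(GrowableRegMask other) {
        for (int i = 0; i < BASE_WORDS; i++) {
            if ((base[i] & other.base[i]) != 0) {
                return true; // common case: decided without touching extensions
            }
        }
        if (ext == null || other.ext == null) {
            return false;    // at least one mask has no high bits at all
        }
        int n = Math.min(ext.length, other.ext.length);
        for (int i = 0; i < n; i++) {
            if ((ext[i] & other.ext[i]) != 0) {
                return true;
            }
        }
        return false;
    }
}
```

As long as no register beyond the base range is ever inserted, `ext` stays `null`. That keeps both the memory footprint and the cost of `overlap` identical to a purely static mask, which mirrors the "keep the common case quick" goal stated above.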
Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: Remove leftover CHUNK_SIZE reference ------------- Changes: - all: https://git.openjdk.org/jdk/pull/20404/files - new: https://git.openjdk.org/jdk/pull/20404/files/a0589873..cbb2c251 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20404&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20404&range=00-01 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/20404.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20404/head:pull/20404 PR: https://git.openjdk.org/jdk/pull/20404 From rcastanedalo at openjdk.org Thu Aug 8 14:17:24 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Thu, 8 Aug 2024 14:17:24 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v3] In-Reply-To: References: Message-ID: > This changeset implements JEP 475 (Late Barrier Expansion for G1), including support for the x64 and aarch64 platforms. See the [JEP description](https://openjdk.org/jeps/475) for further detail. > > We aim to integrate this work in JDK 24. The purpose of this pull request is twofold: > > - to allow maintainers of the arm (32-bit), ppc, riscv, s390, and x86 (32-bit) ports to contribute a port of these platforms in time for JDK 24; and > - to allow reviewers to review the platform-independent, x64 and aarch64, and test changes in parallel with the porting work.
> > ## Summary of the Changes > > ### Platform-Independent Changes (`src/hotspot/share`) > > These consist mainly of: > > - a complete rewrite of `G1BarrierSetC2`, to instruct C2 to expand G1 barriers late instead of early; > - a few minor changes to C2 itself, to support removal of redundant decompression operations and to address an OopMap construction issue triggered by this JEP's increased usage of ADL `TEMP` operands; and > - temporary support for porting the JEP to the remaining platforms. > > The temporary support code (guarded by the pre-processor flag `G1_LATE_BARRIER_MIGRATION_SUPPORT`) will **not** be part of the final pull request, and hence does not need to be reviewed. > > ### Platform-Dependent Changes (`src/hotspot/cpu`) > > These include changes to the ADL instruction definitions and the `G1BarrierSetAssembler` class of the x64 and aarch64 platforms. > > #### ADL Changes > > The changeset uses ADL predicates to force C2 to implement memory accesses tagged with barrier information using G1-specific, barrier-aware instruction versions (e.g. `g1StoreP` instead of the GC-agnostic `storeP`). These new instruction versions generate machine code according to the corresponding tagged barrier information, relying on the G1 barrier implementations provided by the `G1BarrierSetAssembler` class. On the aarch64 platform, the bulk of the ADL code is generated from a higher-level version using m4, to reduce redundancy. > > #### `G1BarrierSetAssembler` Changes > > Both platforms basically reuse the barrier implementation for the bytecode interpreter, with the different barrier tests and operations refactored into dedicated functions. Besides this, `G1BarrierSetAssembler` is extended with assembly-stub routines that implement the out-of-line, slow path of the barriers. These routines include calls from the barrier into the JVM, which require support for saving and restoring live registers, provided by the `SaveLiveRegisters` class. This c...
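For readers unfamiliar with the individual barrier tests and operations mentioned above, the following Java model sketches the filtering sequence of a G1-style post-write barrier: single-region test, null test, young-card test, already-dirty test, and finally card dirtying plus queue insertion. It is a simplified, hypothetical simulation, not HotSpot code. The region and card sizes, the card values, and the use of plain `long` addresses are assumptions. The real barrier also issues a StoreLoad fence before re-reading the card, which this single-threaded model omits.

```java
import java.util.ArrayDeque;
import java.util.Arrays;

/**
 * Hypothetical Java model of a G1-style post-write barrier's filtering steps.
 * Sizes and card values are illustrative assumptions.
 */
public final class G1PostBarrierModel {
    public static final int LOG_REGION_BYTES = 20;  // assumed 1 MiB regions
    public static final int LOG_CARD_BYTES = 9;     // assumed 512-byte cards
    public static final byte CLEAN = -1;            // assumed card values
    public static final byte DIRTY = 0;
    public static final byte YOUNG = 2;

    public final byte[] cards;
    public final ArrayDeque<Integer> dirtyCardQueue = new ArrayDeque<>();

    public G1PostBarrierModel(long heapBytes) {
        cards = new byte[(int) (heapBytes >>> LOG_CARD_BYTES)];
        Arrays.fill(cards, CLEAN);
    }

    /** Models the barrier run after storing newVal into the field at address field. */
    public void postBarrier(long field, long newVal) {
        // 1. Single-region test: same-region stores need no remembered-set update.
        if (((field ^ newVal) >>> LOG_REGION_BYTES) == 0) {
            return;
        }
        // 2. Null test: storing null creates no cross-region reference.
        if (newVal == 0) {
            return;
        }
        int card = (int) (field >>> LOG_CARD_BYTES);
        // 3. Young-card test: cards in young regions are not tracked.
        if (cards[card] == YOUNG) {
            return;
        }
        // 4. Already-dirty test (the real barrier fences before this re-read).
        if (cards[card] == DIRTY) {
            return;
        }
        // 5. Dirty the card and enqueue it (the "queue insertion" step).
        cards[card] = DIRTY;
        dirtyCardQueue.add(card);
    }
}
```

Steps 1 to 4 form the cheap inline fast path. Step 5 corresponds to the out-of-line slow path that, per the description above, moves into assembly stubs under late barrier expansion.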
Roberto Castañeda Lozano has updated the pull request incrementally with one additional commit since the last revision: Flatten barrier assembly generation code by removing helpers individual barrier tests and operations ------------- Changes: - all: https://git.openjdk.org/jdk/pull/19746/files - new: https://git.openjdk.org/jdk/pull/19746/files/d722d4c7..20ef68c8 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=19746&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=19746&range=01-02 Stats: 263 lines in 2 files changed: 77 ins; 116 del; 70 mod Patch: https://git.openjdk.org/jdk/pull/19746.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/19746/head:pull/19746 PR: https://git.openjdk.org/jdk/pull/19746 From rcastanedalo at openjdk.org Thu Aug 8 14:23:36 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Thu, 8 Aug 2024 14:23:36 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v3] In-Reply-To: References: Message-ID: On Wed, 19 Jun 2024 08:45:45 GMT, Albert Mingkun Yang wrote: >> Note that if we want to optimize the barrier code layout (see the [JEP description](https://openjdk.org/jeps/475), *Candidate optimizations* sub-section), splitting the assembly of each barrier into at least two blocks is necessary, since we need to separate the inline from the out-of-line (barrier stub) code. And since the assembly code has to be split into multiple functions anyway, I think it makes sense to group the code by logical blocks (different barrier tests, queue insertion, etc.), as proposed in this changeset. This also improves code reuse, e.g. the same `generate_queue_insertion` implementation is used for the pre- and post-barriers. >> If you still think there is value in grouping together the blocks that can be grouped together (e.g.
`generate_single_region_test` + `generate_new_val_null_test` + `generate_card_young_test`), I can prototype the refactoring and let the G1 maintainers decide which alternative is more readable/maintainable. > >> This also improves code reuse > > In this area, I think code duplication is less of an issue -- it's more crucial that one can follow the asm flow as if reading real asm. (Of course, this is subjective; feel free to keep it as is.) I'm back from vacation now and resuming my work on this JEP. After some offline discussions, I have pushed a new version (commit 20ef68c81e) without helper functions, except for `generate_queue_insertion()` which is still included. @albertnetymk please have a look and let me know if you find the new style more readable. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1709618766 From rcastanedalo at openjdk.org Thu Aug 8 15:37:19 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Thu, 8 Aug 2024 15:37:19 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v4] In-Reply-To: References: Message-ID: > This changeset implements JEP 475 (Late Barrier Expansion for G1), including support for the x64 and aarch64 platforms. See the [JEP description](https://openjdk.org/jeps/475) for further detail. > > We aim to integrate this work in JDK 24. The purpose of this pull request is twofold: > > - to allow maintainers of the arm (32-bit), ppc, riscv, s390, and x86 (32-bit) ports to contribute a port of these platforms in time for JDK 24; and > - to allow reviewers to review the platform-independent, x64 and aarch64, and test changes in parallel with the porting work.
> > ## Summary of the Changes > > ### Platform-Independent Changes (`src/hotspot/share`) > > These consist mainly of: > > - a complete rewrite of `G1BarrierSetC2`, to instruct C2 to expand G1 barriers late instead of early; > - a few minor changes to C2 itself, to support removal of redundant decompression operations and to address an OopMap construction issue triggered by this JEP's increased usage of ADL `TEMP` operands; and > - temporary support for porting the JEP to the remaining platforms. > > The temporary support code (guarded by the pre-processor flag `G1_LATE_BARRIER_MIGRATION_SUPPORT`) will **not** be part of the final pull request, and hence does not need to be reviewed. > > ### Platform-Dependent Changes (`src/hotspot/cpu`) > > These include changes to the ADL instruction definitions and the `G1BarrierSetAssembler` class of the x64 and aarch64 platforms. > > #### ADL Changes > > The changeset uses ADL predicates to force C2 to implement memory accesses tagged with barrier information using G1-specific, barrier-aware instruction versions (e.g. `g1StoreP` instead of the GC-agnostic `storeP`). These new instruction versions generate machine code according to the corresponding tagged barrier information, relying on the G1 barrier implementations provided by the `G1BarrierSetAssembler` class. On the aarch64 platform, the bulk of the ADL code is generated from a higher-level version using m4, to reduce redundancy. > > #### `G1BarrierSetAssembler` Changes > > Both platforms basically reuse the barrier implementation for the bytecode interpreter, with the different barrier tests and operations refactored into dedicated functions. Besides this, `G1BarrierSetAssembler` is extended with assembly-stub routines that implement the out-of-line, slow path of the barriers. These routines include calls from the barrier into the JVM, which require support for saving and restoring live registers, provided by the `SaveLiveRegisters` class. This c...
Roberto Castañeda Lozano has updated the pull request incrementally with one additional commit since the last revision: Also include HOTSPOT_TARGET_CPU_ARCH-based G1 ADL source file ------------- Changes: - all: https://git.openjdk.org/jdk/pull/19746/files - new: https://git.openjdk.org/jdk/pull/19746/files/20ef68c8..47079ea1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=19746&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=19746&range=02-03 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/19746.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/19746/head:pull/19746 PR: https://git.openjdk.org/jdk/pull/19746 From rcastanedalo at openjdk.org Thu Aug 8 15:37:19 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Thu, 8 Aug 2024 15:37:19 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v2] In-Reply-To: References: Message-ID: On Sat, 29 Jun 2024 03:51:29 GMT, Amit Kumar wrote: >> make/hotspot/gensrc/GensrcAdlc.gmk line 205: >> >>> 203: ifeq ($(call check-jvm-feature, g1gc), true) >>> 204: AD_SRC_FILES += $(call uniq, $(wildcard $(foreach d, $(AD_SRC_ROOTS), \ >>> 205: $d/cpu/$(HOTSPOT_TARGET_CPU_ARCH)/gc/g1/g1_$(HOTSPOT_TARGET_CPU).ad \ >> >> On s390, the `g1_s390.ad` file is not compiled with the current code.
>> >> Suggestion: >> >> $d/cpu/$(HOTSPOT_TARGET_CPU_ARCH)/gc/g1/g1_$(HOTSPOT_TARGET_CPU_ARCH).ad \ > > I guess this one might be better: > > diff --git a/make/hotspot/gensrc/GensrcAdlc.gmk b/make/hotspot/gensrc/GensrcAdlc.gmk > index e34f0725397..ef9c15b2975 100644 > --- a/make/hotspot/gensrc/GensrcAdlc.gmk > +++ b/make/hotspot/gensrc/GensrcAdlc.gmk > @@ -203,6 +203,7 @@ ifeq ($(call check-jvm-feature, compiler2), true) > ifeq ($(call check-jvm-feature, g1gc), true) > AD_SRC_FILES += $(call uniq, $(wildcard $(foreach d, $(AD_SRC_ROOTS), \ > $d/cpu/$(HOTSPOT_TARGET_CPU_ARCH)/gc/g1/g1_$(HOTSPOT_TARGET_CPU).ad \ > + $d/cpu/$(HOTSPOT_TARGET_CPU_ARCH)/gc/g1/g1_$(HOTSPOT_TARGET_CPU_ARCH).ad \ > ))) > endif > > > The build is fine with both changes (tested on Mac-M1). Thanks! I went with the second option (commit 47079ea1) for consistency with other collectors. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1709781421 From shade at openjdk.org Thu Aug 8 16:33:35 2024 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 8 Aug 2024 16:33:35 GMT Subject: RFR: 8337876: [IR Framework] Add support for IR tests with @Stable [v3] In-Reply-To: References: Message-ID: On Thu, 8 Aug 2024 06:40:17 GMT, Christian Hagedorn wrote: >> It is currently not possible to write IR tests with `@Stable` annotations because one needs to somehow add the IR test classes to the boot classpath. This patch provides support to write such IR tests. I've added a section to the README to provide guidance on how this can be done. >> >> This is motivated by https://github.com/openjdk/jdk/pull/19635.
>> >> I've tested this patch by taking the current patch of https://github.com/openjdk/jdk/pull/19635, dropping the `RestrictStable` flag and modifying the tests to work with the new IR framework feature: >> >> TestFramework testFramework = new TestFramework(); >> testFramework >> .addFlags("-XX:-TieredCompilation", >> "-XX:+UseParallelGC") >> .addTestClassesToBootClassPath() >> .start(); >> >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > Fix file separator + add test Can/should we get some more reviews here? I would like to get the `@Stable` PR, which depends on it, moving :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/20477#issuecomment-2276226349 From ayang at openjdk.org Thu Aug 8 16:47:40 2024 From: ayang at openjdk.org (Albert Mingkun Yang) Date: Thu, 8 Aug 2024 16:47:40 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v4] In-Reply-To: References: Message-ID: On Thu, 8 Aug 2024 15:37:19 GMT, Roberto Castañeda Lozano wrote: >> This changeset implements JEP 475 (Late Barrier Expansion for G1), including support for the x64 and aarch64 platforms. See the [JEP description](https://openjdk.org/jeps/475) for further detail. >> >> We aim to integrate this work in JDK 24. The purpose of this pull request is twofold: >> >> - to allow maintainers of the arm (32-bit), ppc, riscv, s390, and x86 (32-bit) ports to contribute a port of these platforms in time for JDK 24; and >> - to allow reviewers to review the platform-independent, x64 and aarch64, and test changes in parallel with the porting work.
>> >> ## Summary of the Changes >> >> ### Platform-Independent Changes (`src/hotspot/share`) >> >> These consist mainly of: >> >> - a complete rewrite of `G1BarrierSetC2`, to instruct C2 to expand G1 barriers late instead of early; >> - a few minor changes to C2 itself, to support removal of redundant decompression operations and to address an OopMap construction issue triggered by this JEP's increased usage of ADL `TEMP` operands; and >> - temporary support for porting the JEP to the remaining platforms. >> >> The temporary support code (guarded by the pre-processor flag `G1_LATE_BARRIER_MIGRATION_SUPPORT`) will **not** be part of the final pull request, and hence does not need to be reviewed. >> >> ### Platform-Dependent Changes (`src/hotspot/cpu`) >> >> These include changes to the ADL instruction definitions and the `G1BarrierSetAssembler` class of the x64 and aarch64 platforms. >> >> #### ADL Changes >> >> The changeset uses ADL predicates to force C2 to implement memory accesses tagged with barrier information using G1-specific, barrier-aware instruction versions (e.g. `g1StoreP` instead of the GC-agnostic `storeP`). These new instruction versions generate machine code according to the corresponding tagged barrier information, relying on the G1 barrier implementations provided by the `G1BarrierSetAssembler` class. On the aarch64 platform, the bulk of the ADL code is generated from a higher-level version using m4, to reduce redundancy. >> >> #### `G1BarrierSetAssembler` Changes >> >> Both platforms basically reuse the barrier implementation for the bytecode interpreter, with the different barrier tests and operations refactored into dedicated functions. Besides this, `G1BarrierSetAssembler` is extended with assembly-stub routines that implement the out-of-line, slow path of the barriers. These routines include calls from the barrier into the JVM, which require support for saving and restoring live ...
> > Roberto Castañeda Lozano has updated the pull request incrementally with one additional commit since the last revision: > > Also include HOTSPOT_TARGET_CPU_ARCH-based G1 ADL source file Some naming comments/suggestions, up to you. Consider `g1_write_barrier_post_c2` and `generate_c2_post_barrier_stub`: the latter is the "next" step if the slower path is taken. I wonder if it can be renamed to something like "...write_barrier_post_c2_stub" to make it obvious that they are related. Both "write_barrier_pre" and "pre_write_barrier" exist. It's not obvious whether that is intended (to highlight some difference) or not. ------------- Marked as reviewed by ayang (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/19746#pullrequestreview-2228393022 From jbhateja at openjdk.org Thu Aug 8 17:00:05 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 8 Aug 2024 17:00:05 GMT Subject: RFR: 8338021: Support saturating vector operators in VectorAPI Message-ID: Hi All, As per the discussion on the panama-dev mailing list[1], this patch adds support for the following new vector operators. . SATURATING_UADD : Saturating unsigned addition. . SATURATING_ADD : Saturating signed addition. . SATURATING_USUB : Saturating unsigned subtraction. . SATURATING_SUB : Saturating signed subtraction. . UMAX : Unsigned max. . UMIN : Unsigned min. The new vector operators apply only to integral types, since integral values wrap around in overflow/underflow scenarios (after setting the appropriate status flags). For floating-point types, the IEEE 754 spec defines multiple schemes to handle underflow; one of them is gradual underflow, which transitions the value to the subnormal range. Similarly, overflow implicitly saturates the floating-point value to an infinite value. As the name suggests, these are saturating operations, i.e. the result of the computation is strictly capped by the lower and upper bounds of the result type and is not wrapped around in underflow or overflow scenarios.
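The saturating semantics described above can be illustrated with scalar reference implementations for `int`. This is a hypothetical sketch; it does not reproduce the scalar APIs this PR adds to the primitive box classes, whose names are not given here. Signed results are computed exactly in `long` and then clamped, while the unsigned variants use the real `Integer.compareUnsigned` to detect wrap-around. Unsigned values are carried in ordinary `int`s, so the unsigned maximum 0xFFFFFFFF appears as -1.

```java
/**
 * Hypothetical scalar reference implementations of saturating and unsigned
 * int operations (illustrative only; not the JDK's API).
 */
public final class SaturatingArith {
    private SaturatingArith() {}

    /** Signed saturating add: clamps to [Integer.MIN_VALUE, Integer.MAX_VALUE]. */
    public static int addSat(int a, int b) {
        long r = (long) a + b; // exact in 64 bits
        return (int) Math.max(Integer.MIN_VALUE, Math.min(Integer.MAX_VALUE, r));
    }

    /** Signed saturating subtract. */
    public static int subSat(int a, int b) {
        long r = (long) a - b;
        return (int) Math.max(Integer.MIN_VALUE, Math.min(Integer.MAX_VALUE, r));
    }

    /** Unsigned saturating add: clamps to 0xFFFFFFFF (-1 as a signed int). */
    public static int uaddSat(int a, int b) {
        int r = a + b; // wraps on unsigned overflow
        return Integer.compareUnsigned(r, a) < 0 ? -1 : r;
    }

    /** Unsigned saturating subtract: clamps to 0. */
    public static int usubSat(int a, int b) {
        return Integer.compareUnsigned(a, b) < 0 ? 0 : a - b;
    }

    /** Unsigned maximum. */
    public static int umax(int a, int b) {
        return Integer.compareUnsigned(a, b) >= 0 ? a : b;
    }

    /** Unsigned minimum. */
    public static int umin(int a, int b) {
        return Integer.compareUnsigned(a, b) <= 0 ? a : b;
    }
}
```

For comparison, plain `Integer.MAX_VALUE + 1` wraps to `Integer.MIN_VALUE`, whereas `addSat(Integer.MAX_VALUE, 1)` stays at `Integer.MAX_VALUE`.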
Summary of changes: - Java side implementation of new vector operators. - Add new scalar saturating APIs for each of the above saturating vector operators in the corresponding primitive box classes; the fallback implementation of the vector operators is based on them. - C2 compiler IR and inline expander changes. - Optimized x86 backend implementation for new vector operators and their predicated counterparts. - Extends existing VectorAPI Jtreg test suite to cover new operations. Kindly review and share your feedback. Best Regards, PS: Intrinsification and auto-vectorization of new core-lib API will be addressed separately in a follow-up patch. [1] https://mail.openjdk.org/pipermail/panama-dev/2024-May/020408.html ------------- Commit messages: - Removed redundant comment - 8338021: Support saturating vector operators in VectorAPI Changes: https://git.openjdk.org/jdk/pull/20507/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=20507&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8338021 Stats: 9013 lines in 67 files changed: 8923 ins; 28 del; 62 mod Patch: https://git.openjdk.org/jdk/pull/20507.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20507/head:pull/20507 PR: https://git.openjdk.org/jdk/pull/20507 From jbhateja at openjdk.org Thu Aug 8 17:02:05 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 8 Aug 2024 17:02:05 GMT Subject: RFR: 8338023: Support two vector selectFrom API Message-ID: <28KQHru1heR-YOVsRVo8Ffj_4D29IV8vD2tombvTHdI=.dba80ac3-9804-4074-ac0f-8acb9b042a08@github.com> Hi All, As per the discussion on the panama-dev mailing list[1], this patch adds support for the following new two-vector permutation API. Declaration:- Vector.selectFrom(Vector v1, Vector v2) Semantics:- Using the index values stored in the lanes of "this" vector, assemble the values stored in the first (v1) and second (v2) vector arguments. Thus, the first and second vectors serve as a table whose elements are selected based on the index vector.
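The two-vector selection semantics described above can be modeled with plain arrays. The following is a hypothetical scalar sketch, not the Vector API implementation. `int[]` arrays stand in for vectors of `VLEN` lanes, and the class name `TwoVectorSelect` is invented for illustration. Indices below `VLEN` select from `v1` and the rest from `v2`, and the bounds check assumes the valid index range [0, 2*VLEN) given in this revision of the PR description.

```java
import java.util.Arrays;

/**
 * Hypothetical scalar model of two-vector selectFrom semantics,
 * using int arrays as stand-ins for vectors of VLEN lanes.
 */
public final class TwoVectorSelect {
    private TwoVectorSelect() {}

    /** result[i] = index[i] < VLEN ? v1[index[i]] : v2[index[i] - VLEN]. */
    public static int[] selectFrom(int[] index, int[] v1, int[] v2) {
        int vlen = index.length;
        if (v1.length != vlen || v2.length != vlen) {
            throw new IllegalArgumentException("all vectors must have VLEN lanes");
        }
        int[] result = new int[vlen];
        for (int i = 0; i < vlen; i++) {
            int idx = index[i];
            // Assumed valid range: [0, 2*VLEN).
            if (idx < 0 || idx >= 2 * vlen) {
                throw new IndexOutOfBoundsException("lane " + i + ": index " + idx);
            }
            result[i] = idx < vlen ? v1[idx] : v2[idx - vlen];
        }
        return result;
    }

    public static void main(String[] args) {
        int[] v1 = {10, 11, 12, 13};
        int[] v2 = {20, 21, 22, 23};
        int[] idx = {7, 0, 5, 2};
        System.out.println(Arrays.toString(selectFrom(idx, v1, v2)));
        // prints [23, 10, 21, 12]
    }
}
```

Viewed this way, `v1` and `v2` together act as a single 2*VLEN-entry lookup table, which is why a single two-table permute instruction (where the target ISA provides one) can replace the longer sequence needed by the existing `rearrange`-based approach.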
The API is applicable to all integral and floating-point types. The result of this operation is semantically equivalent to the expression v1.rearrange(this.toShuffle(), v2). Values held in the index vector lanes must lie within the valid two-vector index range [0, 2*VLEN); otherwise an IndexOutOfBoundsException is thrown. Summary of changes: - Java side implementation of the new selectFrom API. - C2 compiler IR and inline expander changes. - In the absence of a direct two-vector permutation instruction in the target ISA, a lowering transformation breaks the new IR down into constituent IR supported by the target platform. - Optimized x86 backend implementation for AVX512 and legacy targets. - Functional tests covering the new API. The JMH micro-benchmark included with this patch shows around a 10-15x gain over the existing rearrange API. Test System: Intel(R) Xeon(R) Platinum 8480+ [Sapphire Rapids Server]

Benchmark                                     (size)   Mode  Cnt      Score  Error  Units
SelectFromBenchmark.rearrangeFromByteVector     1024  thrpt    2   2041.762         ops/ms
SelectFromBenchmark.rearrangeFromByteVector     2048  thrpt    2   1028.550         ops/ms
SelectFromBenchmark.rearrangeFromIntVector      1024  thrpt    2    962.605         ops/ms
SelectFromBenchmark.rearrangeFromIntVector      2048  thrpt    2    479.004         ops/ms
SelectFromBenchmark.rearrangeFromLongVector     1024  thrpt    2    359.758         ops/ms
SelectFromBenchmark.rearrangeFromLongVector     2048  thrpt    2    178.192         ops/ms
SelectFromBenchmark.rearrangeFromShortVector    1024  thrpt    2   1463.459         ops/ms
SelectFromBenchmark.rearrangeFromShortVector    2048  thrpt    2    727.556         ops/ms
SelectFromBenchmark.selectFromByteVector        1024  thrpt    2  33254.830         ops/ms
SelectFromBenchmark.selectFromByteVector        2048  thrpt    2  17313.174         ops/ms
SelectFromBenchmark.selectFromIntVector         1024  thrpt    2  10756.804         ops/ms
SelectFromBenchmark.selectFromIntVector         2048  thrpt    2   5398.244         ops/ms
SelectFromBenchmark.selectFromLongVector        1024  thrpt    2   5856.859         ops/ms
SelectFromBenchmark.selectFromLongVector        2048  thrpt    2   1513.378         ops/ms
SelectFromBenchmark.selectFromShortVector       1024  thrpt    2  17888.617         ops/ms
SelectFromBenchmark.selectFromShortVector       2048  thrpt    2   9079.565         ops/ms

Kindly review and share your feedback. Best Regards, Jatin [1] https://mail.openjdk.org/pipermail/panama-dev/2024-May/020408.html ------------- Commit messages: - Adding Benchmark - 8338023: Support two vector selectFrom API Changes: https://git.openjdk.org/jdk/pull/20508/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=20508&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8338023 Stats: 2737 lines in 95 files changed: 2719 ins; 17 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/20508.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20508/head:pull/20508 PR: https://git.openjdk.org/jdk/pull/20508 From jbhateja at openjdk.org Thu Aug 8 17:20:06 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 8 Aug 2024 17:20:06 GMT Subject: RFR: 8338021: Support saturating vector operators in VectorAPI [v2] In-Reply-To: References: Message-ID: > Hi All, > > As per the discussion on the panama-dev mailing list[1], this patch adds support for the following new vector operators. > > > . SATURATING_UADD : Saturating unsigned addition. > . SATURATING_ADD : Saturating signed addition. > . SATURATING_USUB : Saturating unsigned subtraction. > . SATURATING_SUB : Saturating signed subtraction. > . UMAX : Unsigned max. > . UMIN : Unsigned min. > > > The new vector operators apply only to integral types, since integral values wrap around in overflow/underflow scenarios (after setting the appropriate status flags). For floating-point types, the IEEE 754 spec defines multiple schemes to handle underflow; one of them is gradual underflow, which transitions the value to the subnormal range. Similarly, overflow implicitly saturates the floating-point value to an infinite value. > > As the name suggests, these are saturating operations, i.e. the result of the computation is strictly capped by the lower and upper bounds of the result type and is not wrapped around in underflow or overflow scenarios.
> > Summary of changes: > - Java side implementation of new vector operators. > - Add new scalar saturating APIs for each of the above saturating vector operators in the corresponding primitive box classes; the fallback implementation of the vector operators is based on them. > - C2 compiler IR and inline expander changes. > - Optimized x86 backend implementation for new vector operators and their predicated counterparts. > - Extends existing VectorAPI Jtreg test suite to cover new operations. > > Kindly review and share your feedback. > > Best Regards, > PS: Intrinsification and auto-vectorization of new core-lib API will be addressed separately in a follow-up patch. > > [1] https://mail.openjdk.org/pipermail/panama-dev/2024-May/020408.html Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8338201 - Removed redundant comment - 8338021: Support saturating vector operators in VectorAPI ------------- Changes: - all: https://git.openjdk.org/jdk/pull/20507/files - new: https://git.openjdk.org/jdk/pull/20507/files/1ffe4c68..5468e72b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20507&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20507&range=00-01 Stats: 3609 lines in 32 files changed: 177 ins; 3316 del; 116 mod Patch: https://git.openjdk.org/jdk/pull/20507.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20507/head:pull/20507 PR: https://git.openjdk.org/jdk/pull/20507 From dlong at openjdk.org Thu Aug 8 19:41:35 2024 From: dlong at openjdk.org (Dean Long) Date: Thu, 8 Aug 2024 19:41:35 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 In-Reply-To: References: Message-ID: On Wed, 7 Aug 2024 10:58:57 GMT, Daniel Lundén wrote: > Do you know why we use int in OptoReg
but short in OptoRegPair? I don't see why we should not change it to short in OptoReg as well for consistency. I don't know the reason for the inconsistency. I agree they should use the same type. I suggest using OptoReg::Name in OptoRegPair, and changing its type to short, along with checks that we don't overflow the smaller value. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20404#issuecomment-2276526582 From kvn at openjdk.org Fri Aug 9 02:49:39 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 9 Aug 2024 02:49:39 GMT Subject: RFR: 8337876: [IR Framework] Add support for IR tests with @Stable [v3] In-Reply-To: References: Message-ID: On Thu, 8 Aug 2024 06:40:17 GMT, Christian Hagedorn wrote: >> It is currently not possible to write IR tests with `@Stable` annotations because one needs to somehow add the IR test classes to the boot classpath. This patch provides support to write such IR tests. I've added a section to the README to provide guidance on how this can be done. >> >> This is motivated by https://github.com/openjdk/jdk/pull/19635.
PR Review: https://git.openjdk.org/jdk/pull/20477#pullrequestreview-2229184149 From kvn at openjdk.org Fri Aug 9 03:15:32 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 9 Aug 2024 03:15:32 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 In-Reply-To: References: Message-ID: On Thu, 8 Aug 2024 19:38:41 GMT, Dean Long wrote: > > Do you know why we use int in OptoReg but short in OptoRegPair? I don't see why we should not change it to short in OptoReg as well for consistency. > > I don't know the reason for the inconsistency. I agree they should use the same type. I suggest using OptoReg::Name in OptoRegPair, and changing its type to short, along with checks that we don't overflow the smaller type. Platform-specific code uses `int`. Converting `OptoReg::Name` may need more changes than you think. I agree with changing the `OptoReg::Name` type to `short`, but it should be separate from this RFE. `short` in `OptoRegPair` is for memory saving. In a lot of places (all?) we use `copy value` for it. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20404#issuecomment-2277062569 From jkarthikeyan at openjdk.org Fri Aug 9 03:31:38 2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Fri, 9 Aug 2024 03:31:38 GMT Subject: RFR: 8338021: Support saturating vector operators in VectorAPI [v2] In-Reply-To: References: Message-ID: On Thu, 8 Aug 2024 17:20:06 GMT, Jatin Bhateja wrote: >> Hi All, >> >> As per the discussion on the panama-dev mailing list [1], the patch adds support for the following new vector operators. >> >> >> . SATURATING_UADD : Saturating unsigned addition. >> . SATURATING_ADD : Saturating signed addition. >> . SATURATING_USUB : Saturating unsigned subtraction. >> . SATURATING_SUB : Saturating signed subtraction. >> . UMAX : Unsigned max. >> . UMIN : Unsigned min. >> >> >> New vector operators are applicable only to integral types, since their values wrap around in overflow/underflow scenarios after setting appropriate status flags.
For floating-point types, as per the IEEE 754 specification, there are multiple schemes to handle underflow; one of them is gradual underflow, which transitions the value to the subnormal range. Similarly, overflow implicitly saturates the floating-point value to infinity. >> >> As the name suggests, these are saturating operations, i.e. the result of the computation is strictly capped by the lower and upper bounds of the result type and is not wrapped around in underflow or overflow scenarios. >> >> Summary of changes: >> - Java side implementation of new vector operators. >> - Add new scalar saturating APIs for each of the above saturating vector operators in the corresponding primitive box classes; the fallback implementation of the vector operators is based on them. >> - C2 compiler IR and inline expander changes. >> - Optimized x86 backend implementation for new vector operators and their predicated counterparts. >> - Extends existing VectorAPI Jtreg test suite to cover new operations. >> >> Kindly review and share your feedback. >> >> Best Regards, >> PS: Intrinsification and auto-vectorization of new core-lib API will be addressed separately in a follow-up patch. >> >> [1] https://mail.openjdk.org/pipermail/panama-dev/2024-May/020408.html > > Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains three additional commits since the last revision: > > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8338201 > - Removed redundant comment > - 8338021: Support saturating vector operators in VectorAPI src/hotspot/share/opto/type.cpp line 495: > 493: TypeInt::POS1 = TypeInt::make(1,max_jint, WidenMin); // Positive values > 494: TypeInt::INT = TypeInt::make(min_jint,max_jint, WidenMax); // 32-bit integers > 495: TypeInt::UINT = TypeInt::make(0, max_juint, WidenMin); // Unsigned ints This would make an illegal type, right? Since `TypeInt` is signed, using `max_juint` as the hi value would end up as signed -1, resulting in the type `0..-1`, an empty type. I wonder if there's a better way to handle this, since in the type system empty types are in a sense equivalent to `TOP`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1710642379 From chagedorn at openjdk.org Fri Aug 9 07:20:43 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 9 Aug 2024 07:20:43 GMT Subject: RFR: 8337876: [IR Framework] Add support for IR tests with @Stable [v3] In-Reply-To: References: Message-ID: On Thu, 8 Aug 2024 06:40:17 GMT, Christian Hagedorn wrote: >> It is currently not possible to write IR tests with `@Stable` annotations because one needs to somehow add the IR test classes to the boot classpath. This patch provides support to write such IR tests. I've added a section to the README to provide guidance on how this can be done. >> >> This is motivated by https://github.com/openjdk/jdk/pull/19635.
>> >> I've tested this patch by taking the current patch of https://github.com/openjdk/jdk/pull/19635, dropping the `RestrictStable` flag and modifying the tests to work with the new IR framework feature: >> >> TestFramework testFramework = new TestFramework(); >> testFramework >> .addFlags("-XX:-TieredCompilation", >> "-XX:+UseParallelGC") >> .addTestClassesToBootClassPath() >> .start(); >> >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > Fix file separator + add test Thanks Vladimir for your review! ------------- PR Comment: https://git.openjdk.org/jdk/pull/20477#issuecomment-2277305104 From chagedorn at openjdk.org Fri Aug 9 07:20:45 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 9 Aug 2024 07:20:45 GMT Subject: Integrated: 8337876: [IR Framework] Add support for IR tests with @Stable In-Reply-To: References: Message-ID: On Tue, 6 Aug 2024 11:43:32 GMT, Christian Hagedorn wrote: > It is currently not possible to write IR tests with `@Stable` annotations because one needs to somehow add the IR test classes to the boot classpath. This patch provides support to write such IR tests. I've added a section to the README to provide guidance on how this can be done. > > This is motivated by https://github.com/openjdk/jdk/pull/19635. > > I've tested this patch by taking the current patch of https://github.com/openjdk/jdk/pull/19635, dropping the `RestrictStable` flag and modifying the tests to work with the new IR framework feature: > > TestFramework testFramework = new TestFramework(); > testFramework > .addFlags("-XX:-TieredCompilation", > "-XX:+UseParallelGC") > .addTestClassesToBootClassPath() > .start(); > > > Thanks, > Christian This pull request has now been integrated.
Changeset: c01f53ac Author: Christian Hagedorn URL: https://git.openjdk.org/jdk/commit/c01f53ac2dab1d4d2cd1e4d45a67f9373d4a9c7e Stats: 103 lines in 5 files changed: 96 ins; 0 del; 7 mod 8337876: [IR Framework] Add support for IR tests with @Stable Reviewed-by: shade, kvn ------------- PR: https://git.openjdk.org/jdk/pull/20477 From dlong at openjdk.org Fri Aug 9 07:22:37 2024 From: dlong at openjdk.org (Dean Long) Date: Fri, 9 Aug 2024 07:22:37 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v2] In-Reply-To: <7O0X1427MXg_MrnMb9qcH64x3lG3feYt47iZhBujztk=.3fee1482-8bcf-42d9-bc4d-0e074f03222e@github.com> References: <7O0X1427MXg_MrnMb9qcH64x3lG3feYt47iZhBujztk=.3fee1482-8bcf-42d9-bc4d-0e074f03222e@github.com> Message-ID: On Thu, 8 Aug 2024 09:29:17 GMT, Daniel Lundén wrote: >> If a method has a large number of parameters, we currently bail out from C2 compilation. >> >> ### Changeset >> >> Allowing C2 compilation of methods with a large number of parameters requires fundamental changes to the register mask data structure, used in many places in C2. In particular, register masks currently have a statically determined size and cannot represent arbitrary numbers of stack slots. This is needed if we want to compile methods with arbitrary numbers of parameters. Register mask operations are present in performance-sensitive parts of C2, which further complicates changes. >> >> Changes: >> - Add functionality to dynamically grow/extend register masks. I experimented with a number of design choices to achieve this. To keep the common case (normal number of method parameters) quick and also to avoid more intrusive changes to the current `RegMask` interface, I decided to leave the "base" statically allocated memory for masks unchanged and only use dynamically allocated memory in the rare cases where it is needed.
>> - Generalize the "chunk"-logic from `PhaseChaitin::Select()` to allow arbitrary-sized chunks, and also move most of the logic into register mask methods to separate concerns and to make the `PhaseChaitin::Select()` code more readable. >> - Remove all `can_represent` checks and bailouts. >> - Performance tuning. A particularly important change is the early-exit optimization in `RegMask::overlap`, used in the performance-sensitive method `PhaseChaitin::interfere_with_live`. >> - Add a new test case `TestManyMethodArguments.java` and extend an old test `TestNestedSynchronize.java`. >> >> ### Testing >> >> - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/10178060450) >> - `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. >> - Standard performance benchmarking. No observed conclusive overall performance degradation/improvement. >> - Specific benchmarking of C2 compilation time. The changes increase C2 compilation time by, approximately and on average, 1% for methods that could also be compiled before this changeset (see the figure below). The reason for the degradation is further checks required in performance-sensitive code (in particular `PhaseChaitin::remove_bound_register_from_interfering_live_ranges`). I have tried optimizing in various ways, but changes I found that lead to improvement also lead to less readable code (and are, in my opinion, no... > > Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: > > Remove leftover CHUNK_SIZE reference A separate RFE is fine with me. I'm just concerned that unlimited-size regmasks could allow OptoRegPair to overflow. Does C2 have a maximum frame size?
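Dean's suggestion earlier in this thread (narrow the register type to `short` and check that values do not overflow the smaller type) can be sketched roughly as follows. All names here (`narrow_reg_to_short`, `RegPairSketch`, `make_reg_pair_checked`) are hypothetical illustrations, not HotSpot APIs, and the sketch only assumes 16-bit signed storage for the pair, as described for `OptoRegPair` in the discussion.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>

// Hypothetical helper, not HotSpot code: narrow a 32-bit register number
// to 16 bits, refusing values that do not fit instead of silently
// truncating them.
std::optional<int16_t> narrow_reg_to_short(int reg) {
  if (reg < std::numeric_limits<int16_t>::min() ||
      reg > std::numeric_limits<int16_t>::max()) {
    return std::nullopt;  // would overflow the smaller type
  }
  return static_cast<int16_t>(reg);
}

// Hypothetical compact pair, loosely modeled on the idea behind
// OptoRegPair: two 16-bit register numbers to save memory.
struct RegPairSketch {
  int16_t first;
  int16_t second;
};

// Build a pair only if both register numbers survive the narrowing check.
std::optional<RegPairSketch> make_reg_pair_checked(int first, int second) {
  std::optional<int16_t> f = narrow_reg_to_short(first);
  std::optional<int16_t> s = narrow_reg_to_short(second);
  if (!f.has_value() || !s.has_value()) {
    return std::nullopt;  // a caller could bail out of compilation here
  }
  return RegPairSketch{*f, *s};
}
```

In HotSpot itself the failure path would more likely be an `assert`/`guarantee` or a compilation bailout; `std::optional` is used here only to keep the sketch self-contained.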
------------- PR Comment: https://git.openjdk.org/jdk/pull/20404#issuecomment-2277309657 From thartmann at openjdk.org Fri Aug 9 08:39:37 2024 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 9 Aug 2024 08:39:37 GMT Subject: RFR: 8337702: Use new ForwardExceptionNode to call StubRoutines::forward_exception_entry() [v2] In-Reply-To: References: Message-ID: <3yuLs7rogsy-qznycivDI6m7TizfSMJrbvEgfqhyW30=.4018f37c-64d2-4629-808c-1c538d67a9ce@github.com> On Fri, 2 Aug 2024 15:28:04 GMT, Vladimir Kozlov wrote: >> Currently C2 uses a `TailCall` node when it generates code to forward an exception in C2 runtime stubs. >> The `StubRoutines::forward_exception_entry()` address is passed as a constant and the method pointer is `NULL`: >> [generateOptoStub.cpp#L258](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/generateOptoStub.cpp#L258) >> >> On the other hand, the TailCall mach node uses 2 registers as parameters, which are hardcoded in `Matcher`: [matcher.cpp#L828](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/matcher.cpp#L828) >> As a result, we waste two registers to pass a constant and NULL. >> >> Also, an incorrect relocation is used for such a call because the address of the `forward_exception` stub is passed in a register in the mach node. When it is converted to `Address` for the `jmp` instruction, the default `external_word_type` relocation is used when `runtime_call_type` should be used. See discussion in PR [JDK-8337396](https://github.com/openjdk/jdk/pull/20412) >> >> I added a new ideal node `ForwardExceptionNode` to solve these issues. It is similar to the `Rethrow` node (whose mach node definition I used as a template), but I kept it based on the `Return` node, similar to the `TailCall` node. >> >> Tested tier1-3,stress,xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Fix arm (32 bits) build Nice enhancement. I noticed that `TailJump` is in vmStructs.cpp, should `ForwardException` be added as well?
src/hotspot/cpu/aarch64/aarch64.ad line 16188: > 16186: > 16187: // Forward exception. > 16188: instruct ForwardExceptionjmp() Suggestion: instruct ForwardException() Same for other AD files. src/hotspot/share/opto/callnode.hpp line 160: > 158: public: > 159: ForwardExceptionNode( Node *cntrl, Node *i_o, Node *memory, Node *frameptr, Node *retadr) > 160: : ReturnNode( TypeFunc::Parms, cntrl, i_o, memory, frameptr, retadr ) { Suggestion: ForwardExceptionNode(Node* cntrl, Node* i_o, Node* memory, Node* frameptr, Node* retadr) : ReturnNode(TypeFunc::Parms, cntrl, i_o, memory, frameptr, retadr) { ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/20437#pullrequestreview-2229654576 PR Review Comment: https://git.openjdk.org/jdk/pull/20437#discussion_r1711050995 PR Review Comment: https://git.openjdk.org/jdk/pull/20437#discussion_r1711043275 From rcastanedalo at openjdk.org Fri Aug 9 11:48:17 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Fri, 9 Aug 2024 11:48:17 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v5] In-Reply-To: References: Message-ID: > This changeset implements JEP 475 (Late Barrier Expansion for G1), including support for the x64 and aarch64 platforms. See the [JEP description](https://openjdk.org/jeps/475) for further detail. > > We aim to integrate this work in JDK 24. The purpose of this pull request is twofold: > > - to allow maintainers of the arm (32-bit), ppc, riscv, s390, and x86 (32-bit) ports to contribute a port of these platforms in time for JDK 24; and > - to allow reviewers to review the platform-independent, x64 and aarch64, and test changes in parallel with the porting work.
> > ## Summary of the Changes > > ### Platform-Independent Changes (`src/hotspot/share`) > > These consist mainly of: > > - a complete rewrite of `G1BarrierSetC2`, to instruct C2 to expand G1 barriers late instead of early; > - a few minor changes to C2 itself, to support removal of redundant decompression operations and to address an OopMap construction issue triggered by this JEP's increased usage of ADL `TEMP` operands; and > - temporary support for porting the JEP to the remaining platforms. > > The temporary support code (guarded by the pre-processor flag `G1_LATE_BARRIER_MIGRATION_SUPPORT`) will **not** be part of the final pull request, and hence does not need to be reviewed. > > ### Platform-Dependent Changes (`src/hotspot/cpu`) > > These include changes to the ADL instruction definitions and the `G1BarrierSetAssembler` class of the x64 and aarch64 platforms. > > #### ADL Changes > > The changeset uses ADL predicates to force C2 to implement memory accesses tagged with barrier information using G1-specific, barrier-aware instruction versions (e.g. `g1StoreP` instead of the GC-agnostic `storeP`). These new instruction versions generate machine code according to the corresponding tagged barrier information, relying on the G1 barrier implementations provided by the `G1BarrierSetAssembler` class. On the aarch64 platform, the bulk of the ADL code is generated from a higher-level version using m4, to reduce redundancy. > > #### `G1BarrierSetAssembler` Changes > > Both platforms basically reuse the barrier implementation for the bytecode interpreter, with the different barrier tests and operations refactored into dedicated functions. Besides this, `G1BarrierSetAssembler` is extended with assembly-stub routines that implement the out-of-line, slow path of the barriers. These routines include calls from the barrier into the JVM, which require support for saving and restoring live registers, provided by the `SaveLiveRegisters` class. This c...
Roberto Castañeda Lozano has updated the pull request incrementally with one additional commit since the last revision: Give barrier generation helper functions a more consistent name ------------- Changes: - all: https://git.openjdk.org/jdk/pull/19746/files - new: https://git.openjdk.org/jdk/pull/19746/files/47079ea1..1834bf41 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=19746&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=19746&range=03-04 Stats: 455 lines in 3 files changed: 0 ins; 0 del; 455 mod Patch: https://git.openjdk.org/jdk/pull/19746.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/19746/head:pull/19746 PR: https://git.openjdk.org/jdk/pull/19746 From rcastanedalo at openjdk.org Fri Aug 9 11:52:35 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Fri, 9 Aug 2024 11:52:35 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v5] In-Reply-To: References: Message-ID: On Fri, 9 Aug 2024 11:48:17 GMT, Roberto Castañeda Lozano wrote: >> This changeset implements JEP 475 (Late Barrier Expansion for G1), including support for the x64 and aarch64 platforms. See the [JEP description](https://openjdk.org/jeps/475) for further detail. >> >> We aim to integrate this work in JDK 24. The purpose of this pull request is twofold: >> >> - to allow maintainers of the arm (32-bit), ppc, riscv, s390, and x86 (32-bit) ports to contribute a port of these platforms in time for JDK 24; and >> - to allow reviewers to review the platform-independent, x64 and aarch64, and test changes in parallel with the porting work.
>> >> ## Summary of the Changes >> >> ### Platform-Independent Changes (`src/hotspot/share`) >> >> These consist mainly of: >> >> - a complete rewrite of `G1BarrierSetC2`, to instruct C2 to expand G1 barriers late instead of early; >> - a few minor changes to C2 itself, to support removal of redundant decompression operations and to address an OopMap construction issue triggered by this JEP's increased usage of ADL `TEMP` operands; and >> - temporary support for porting the JEP to the remaining platforms. >> >> The temporary support code (guarded by the pre-processor flag `G1_LATE_BARRIER_MIGRATION_SUPPORT`) will **not** be part of the final pull request, and hence does not need to be reviewed. >> >> ### Platform-Dependent Changes (`src/hotspot/cpu`) >> >> These include changes to the ADL instruction definitions and the `G1BarrierSetAssembler` class of the x64 and aarch64 platforms. >> >> #### ADL Changes >> >> The changeset uses ADL predicates to force C2 to implement memory accesses tagged with barrier information using G1-specific, barrier-aware instruction versions (e.g. `g1StoreP` instead of the GC-agnostic `storeP`). These new instruction versions generate machine code according to the corresponding tagged barrier information, relying on the G1 barrier implementations provided by the `G1BarrierSetAssembler` class. On the aarch64 platform, the bulk of the ADL code is generated from a higher-level version using m4, to reduce redundancy. >> >> #### `G1BarrierSetAssembler` Changes >> >> Both platforms basically reuse the barrier implementation for the bytecode interpreter, with the different barrier tests and operations refactored into dedicated functions. Besides this, `G1BarrierSetAssembler` is extended with assembly-stub routines that implement the out-of-line, slow path of the barriers. These routines include calls from the barrier into the JVM, which require support for saving and restoring live ...
> > Roberto Castañeda Lozano has updated the pull request incrementally with one additional commit since the last revision: > > Give barrier generation helper functions a more consistent name Thanks for reviewing, Albert! > ``` > g1_write_barrier_post_c2 > generate_c2_post_barrier_stub > ``` > > The latter is the "next" step if the slower path is taken. I wonder if it can be renamed to something like "...write_barrier_post_c2_stub" to make it obvious that they are related. I agree with your suggestion, but will postpone it to a follow-up task to avoid interfering with the ongoing port work (the names are dictated by the platform-independent `G1PreBarrierStubC2::emit_code()` and `G1PostBarrierStubC2::emit_code()` functions, so a name change would affect every platform). > Both "write_barrier_pre" and "pre_write_barrier" exist. It's not obvious whether that is intended (to highlight some diff) or not. This is accidental, as far as I can see. `write_barrier_pre` is the pre-existing name for the interpreter barrier generation functions; I would rather leave it as-is to avoid making this changeset even larger. Instead, I have renamed the helper functions `g1_pre_write_barrier()` and `g1_post_write_barrier()` to `write_barrier_pre()` and `write_barrier_post()`, for consistency (and dropped `g1_` since it is obvious from the context) in commit 1834bf4.
------------- PR Comment: https://git.openjdk.org/jdk/pull/19746#issuecomment-2277770042 From rcastanedalo at openjdk.org Fri Aug 9 12:03:37 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Fri, 9 Aug 2024 12:03:37 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v2] In-Reply-To: <4c-MLXwKcNcSnloSkYkuk3gnv3ux5i5beS51Fd9Z8MQ=.cd0a7eba-ff26-4855-a01c-d1ae5182100b@github.com> References: <4c-MLXwKcNcSnloSkYkuk3gnv3ux5i5beS51Fd9Z8MQ=.cd0a7eba-ff26-4855-a01c-d1ae5182100b@github.com> Message-ID: <5Q8PqULlpKfoPLXRqI0ua0dVWAy3zPBqtFpycNwBg0Y=.f2830c84-63ba-43cd-85e3-2245e4ac8917@github.com> On Sun, 21 Jul 2024 08:21:39 GMT, Martin Doerr wrote: >> Roberto Castañeda Lozano has updated the pull request incrementally with one additional commit since the last revision: >> >> Build barrier data in G1BarrierSetC2::get_store_barrier() by adding, rather than removing, barrier tags > > src/hotspot/cpu/x86/gc/g1/g1_x86_64.ad line 86: > >> 84: // an indirect memory operand) to reduce C2's scheduling and register >> 85: // allocation pressure (fewer Mach nodes). The same holds for g1StoreN and >> 86: // g1EncodePAndStoreN. > > I'm not convinced that this is beneficial. We're wasting a temp register just for an addition? I agree that using indirect memory operands is the most readable choice, and is slightly less wasteful from a register usage perspective. However, when I tried this choice a couple of months ago, I observed timeouts in some CTW runs, which as far as I remember were caused when LCM processed huge basic blocks with lots of memory writes (e.g. arising from static initializations of large String arrays such as [here](https://github.com/apache/lucene/blob/ea562f6ef2b32fe6eadf57c6381d9a69acb043c7/lucene/analysis/common/src/java/org/apache/lucene/analysis/en/KStemData1.java#L47-L748)), in combination with C2 stress options.
In these scenarios, the large number of additional Mach nodes seemed to cause the timeouts. I settled on materializing the store address internally to guard against such corner cases. I did not see any significant performance difference between the two choices in my benchmark results. I would like to study whether LCM can be made more robust in this scenario, which would enable using indirect memory operands here, but I think this would be best addressed in a separate RFE. Would it be OK for now to extend the code comment with the details provided in the above explanation? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1711337413 From syan at openjdk.org Fri Aug 9 13:39:58 2024 From: syan at openjdk.org (SendaoYan) Date: Fri, 9 Aug 2024 13:39:58 GMT Subject: RFR: 8338112: Test testlibrary_tests/ir_framework/tests/TestPrivilegedMode.java fails with release build Message-ID: <_yw7yokeFXUsQ8mZjBAfM8TnNzn-mIoDrcj3GvvxKTQ=.0640becf-9b54-4016-b238-3a0721a4944f@github.com> Hi all, The testcase `testlibrary_tests/ir_framework/tests/TestPrivilegedMode.java` fails with a release build; the test log snippet says `IR verification disabled due to not running a debug build`. So I think it needs `@requires vm.debug == true`.
The change has been verified with linux x64 release build and linux x64 fastdebug build; no risk. ------------- Commit messages: - 8338112: Test testlibrary_tests/ir_framework/tests/TestPrivilegedMode.java fails with release build Changes: https://git.openjdk.org/jdk/pull/20524/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=20524&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8338112 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/20524.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20524/head:pull/20524 PR: https://git.openjdk.org/jdk/pull/20524 From chagedorn at openjdk.org Fri Aug 9 13:39:59 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 9 Aug 2024 13:39:59 GMT Subject: RFR: 8338112: Test testlibrary_tests/ir_framework/tests/TestPrivilegedMode.java fails with release build In-Reply-To: <_yw7yokeFXUsQ8mZjBAfM8TnNzn-mIoDrcj3GvvxKTQ=.0640becf-9b54-4016-b238-3a0721a4944f@github.com> References: <_yw7yokeFXUsQ8mZjBAfM8TnNzn-mIoDrcj3GvvxKTQ=.0640becf-9b54-4016-b238-3a0721a4944f@github.com> Message-ID: On Fri, 9 Aug 2024 13:12:14 GMT, SendaoYan wrote: > Hi all, > The testcase `testlibrary_tests/ir_framework/tests/TestPrivilegedMode.java` fails with a release build; the test log snippet says `IR verification disabled due to not running a debug build`. So I think it needs `@requires vm.debug == true`. > > The change has been verified with linux x64 release build and linux x64 fastdebug build; no risk. Changes requested by chagedorn (Reviewer). test/hotspot/jtreg/testlibrary_tests/ir_framework/tests/TestPrivilegedMode.java line 35: > 33: * @test > 34: * @requires vm.flagless > 35: * @requires vm.debug == true I think the test can also fail when you have a build without the C2 compiler.
Just to be sure, I would add the following: Suggestion: * @requires vm.debug == true & vm.compMode != "Xint" & vm.compiler2.enabled & vm.flagless ------------- PR Review: https://git.openjdk.org/jdk/pull/20524#pullrequestreview-2230182000 PR Review Comment: https://git.openjdk.org/jdk/pull/20524#discussion_r1711457625 From mdoerr at openjdk.org Fri Aug 9 14:08:34 2024 From: mdoerr at openjdk.org (Martin Doerr) Date: Fri, 9 Aug 2024 14:08:34 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v2] In-Reply-To: <5Q8PqULlpKfoPLXRqI0ua0dVWAy3zPBqtFpycNwBg0Y=.f2830c84-63ba-43cd-85e3-2245e4ac8917@github.com> References: <4c-MLXwKcNcSnloSkYkuk3gnv3ux5i5beS51Fd9Z8MQ=.cd0a7eba-ff26-4855-a01c-d1ae5182100b@github.com> <5Q8PqULlpKfoPLXRqI0ua0dVWAy3zPBqtFpycNwBg0Y=.f2830c84-63ba-43cd-85e3-2245e4ac8917@github.com> Message-ID: On Fri, 9 Aug 2024 12:00:26 GMT, Roberto Castañeda Lozano wrote: >> src/hotspot/cpu/x86/gc/g1/g1_x86_64.ad line 86: >> >>> 84: // an indirect memory operand) to reduce C2's scheduling and register >>> 85: // allocation pressure (fewer Mach nodes). The same holds for g1StoreN and >>> 86: // g1EncodePAndStoreN. >> >> I'm not convinced that this is beneficial. We're wasting a temp register just for an addition? > > I agree that using indirect memory operands is the most readable choice, and is slightly less wasteful from a register usage perspective. However, when I tried this choice a couple of months ago, I observed timeouts in some CTW runs, which as far as I remember were caused when LCM processed huge basic blocks with lots of memory writes (e.g. arising from static initializations of large String arrays such as [here](https://github.com/apache/lucene/blob/ea562f6ef2b32fe6eadf57c6381d9a69acb043c7/lucene/analysis/common/src/java/org/apache/lucene/analysis/en/KStemData1.java#L47-L748)), in combination with C2 stress options. In these scenarios, the large number of additional Mach nodes seemed to cause the timeouts.
I settled on materializing the store address internally to guard against such corner cases. I did not see any significant performance difference between the two choices in my benchmark results. > > I would like to study whether LCM can be made more robust in this scenario, which would enable using indirect memory operands here, but I think this would be best addressed in a separate RFE. Would it be OK for now to extend the code comment with the details provided in the above explanation? OK, doing it in a separate RFE is fine with me. This sounds like a C2 problem that should be investigated. It may cause other performance problems, too. Maybe a native profiler can show what takes too much time. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1711536279 From syan at openjdk.org Fri Aug 9 14:20:02 2024 From: syan at openjdk.org (SendaoYan) Date: Fri, 9 Aug 2024 14:20:02 GMT Subject: RFR: 8338112: Test testlibrary_tests/ir_framework/tests/TestPrivilegedMode.java fails with release build [v2] In-Reply-To: <_yw7yokeFXUsQ8mZjBAfM8TnNzn-mIoDrcj3GvvxKTQ=.0640becf-9b54-4016-b238-3a0721a4944f@github.com> References: <_yw7yokeFXUsQ8mZjBAfM8TnNzn-mIoDrcj3GvvxKTQ=.0640becf-9b54-4016-b238-3a0721a4944f@github.com> Message-ID: > Hi all, > The testcase `testlibrary_tests/ir_framework/tests/TestPrivilegedMode.java` fails with a release build; the test log snippet says `IR verification disabled due to not running a debug build`. So I think it needs `@requires vm.debug == true`.
> > The change has been verified with linux x64 release build and linux x64 fastdebug build; no risk. SendaoYan has updated the pull request incrementally with one additional commit since the last revision: Since this test also fails with a debug build without compilers, change the requires to `* @requires vm.debug == true & vm.compMode != "Xint" & vm.compiler2.enabled & vm.flagless` ------------- Changes: - all: https://git.openjdk.org/jdk/pull/20524/files - new: https://git.openjdk.org/jdk/pull/20524/files/8bcf5c41..537a9d1f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20524&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20524&range=00-01 Stats: 2 lines in 1 file changed: 0 ins; 1 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/20524.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20524/head:pull/20524 PR: https://git.openjdk.org/jdk/pull/20524 From syan at openjdk.org Fri Aug 9 14:20:02 2024 From: syan at openjdk.org (SendaoYan) Date: Fri, 9 Aug 2024 14:20:02 GMT Subject: RFR: 8338112: Test testlibrary_tests/ir_framework/tests/TestPrivilegedMode.java fails with release build In-Reply-To: <_yw7yokeFXUsQ8mZjBAfM8TnNzn-mIoDrcj3GvvxKTQ=.0640becf-9b54-4016-b238-3a0721a4944f@github.com> References: <_yw7yokeFXUsQ8mZjBAfM8TnNzn-mIoDrcj3GvvxKTQ=.0640becf-9b54-4016-b238-3a0721a4944f@github.com> Message-ID: On Fri, 9 Aug 2024 13:12:14 GMT, SendaoYan wrote: > Hi all, > The testcase `testlibrary_tests/ir_framework/tests/TestPrivilegedMode.java` fails with a release build; the test log snippet says `IR verification disabled due to not running a debug build`. So I think it needs `@requires vm.debug == true`.
------------- PR Comment: https://git.openjdk.org/jdk/pull/20524#issuecomment-2278052302 From syan at openjdk.org Fri Aug 9 14:20:03 2024 From: syan at openjdk.org (SendaoYan) Date: Fri, 9 Aug 2024 14:20:03 GMT Subject: RFR: 8338112: Test testlibrary_tests/ir_framework/tests/TestPrivilegedMode.java fails with release build [v2] In-Reply-To: References: <_yw7yokeFXUsQ8mZjBAfM8TnNzn-mIoDrcj3GvvxKTQ=.0640becf-9b54-4016-b238-3a0721a4944f@github.com> Message-ID: On Fri, 9 Aug 2024 13:22:56 GMT, Christian Hagedorn wrote: >> SendaoYan has updated the pull request incrementally with one additional commit since the last revision: >> >> since this test also fails when have a debug build without compilers, so change requires to `* @requires vm.debug == true & vm.compMode != "Xint" & vm.compiler2.enabled & vm.flagless` > > test/hotspot/jtreg/testlibrary_tests/ir_framework/tests/TestPrivilegedMode.java line 35: > >> 33: * @test >> 34: * @requires vm.flagless >> 35: * @requires vm.debug == true > > I think the test can also fail when you have a build without C2 compiler. Just to be sure, I would add the following: > Suggestion: > > * @requires vm.debug == true & vm.compMode != "Xint" & vm.compiler2.enabled & vm.flagless Thanks for your advice. 
Does `vm.compiler2.enabled & vm.flagless` include the situation `vm.compMode != "Xint"`, I mean `vm.compMode != "Xint"` not needed when set requires `vm.compiler2.enabled & vm.flagless` ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20524#discussion_r1711552578 From chagedorn at openjdk.org Fri Aug 9 14:25:31 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 9 Aug 2024 14:25:31 GMT Subject: RFR: 8338112: Test testlibrary_tests/ir_framework/tests/TestPrivilegedMode.java fails with release build [v2] In-Reply-To: References: <_yw7yokeFXUsQ8mZjBAfM8TnNzn-mIoDrcj3GvvxKTQ=.0640becf-9b54-4016-b238-3a0721a4944f@github.com> Message-ID: On Fri, 9 Aug 2024 14:15:56 GMT, SendaoYan wrote: >> test/hotspot/jtreg/testlibrary_tests/ir_framework/tests/TestPrivilegedMode.java line 35: >> >>> 33: * @test >>> 34: * @requires vm.flagless >>> 35: * @requires vm.debug == true >> >> I think the test can also fail when you have a build without C2 compiler. Just to be sure, I would add the following: >> Suggestion: >> >> * @requires vm.debug == true & vm.compMode != "Xint" & vm.compiler2.enabled & vm.flagless > > Thanks for your advice. Does `vm.compiler2.enabled & vm.flagless` include the situation `vm.compMode != "Xint"`, I mean `vm.compMode != "Xint"` not needed when set requires `vm.compiler2.enabled & vm.flagless` Yes, I guess it's not needed. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20524#discussion_r1711563721 From syan at openjdk.org Fri Aug 9 14:50:03 2024 From: syan at openjdk.org (SendaoYan) Date: Fri, 9 Aug 2024 14:50:03 GMT Subject: RFR: 8338112: Test testlibrary_tests/ir_framework/tests/TestPrivilegedMode.java fails with release build [v3] In-Reply-To: <_yw7yokeFXUsQ8mZjBAfM8TnNzn-mIoDrcj3GvvxKTQ=.0640becf-9b54-4016-b238-3a0721a4944f@github.com> References: <_yw7yokeFXUsQ8mZjBAfM8TnNzn-mIoDrcj3GvvxKTQ=.0640becf-9b54-4016-b238-3a0721a4944f@github.com> Message-ID: <7U4YnZY9Dgf7Mi2y9o3vd_mvmWL_Y-zccHPX98zUcks=.f8966376-3a94-4dcc-8aa9-3020cc496786@github.com> > Hi all, > The testcase `testlibrary_tests/ir_framework/tests/TestPrivilegedMode.java` fails with release build, the test log snippet says `IR verification disabled due to not running a debug build`. So I think it's need `@requires vm.debug == true`. And ths test will fail without c2 compiler, so it also need `@requires vm.compiler2.enabled`. 
> > The change has been verified with linux x64 release build and linux x64 fastdebug build, no risk, SendaoYan has updated the pull request incrementally with one additional commit since the last revision: delete vm.compMode != "Xint" ------------- Changes: - all: https://git.openjdk.org/jdk/pull/20524/files - new: https://git.openjdk.org/jdk/pull/20524/files/537a9d1f..292ad2a8 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20524&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20524&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/20524.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20524/head:pull/20524 PR: https://git.openjdk.org/jdk/pull/20524 From syan at openjdk.org Fri Aug 9 14:50:03 2024 From: syan at openjdk.org (SendaoYan) Date: Fri, 9 Aug 2024 14:50:03 GMT Subject: RFR: 8338112: Test testlibrary_tests/ir_framework/tests/TestPrivilegedMode.java fails with release build [v3] In-Reply-To: References: <_yw7yokeFXUsQ8mZjBAfM8TnNzn-mIoDrcj3GvvxKTQ=.0640becf-9b54-4016-b238-3a0721a4944f@github.com> Message-ID: On Fri, 9 Aug 2024 14:22:42 GMT, Christian Hagedorn wrote: >> Thanks for your advice. Does `vm.compiler2.enabled & vm.flagless` include the situation `vm.compMode != "Xint"`, I mean `vm.compMode != "Xint"` not needed when set requires `vm.compiler2.enabled & vm.flagless` > > Yes, I guess it's not needed. The code has been changed according to your advice. Thanks.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20524#discussion_r1711602565 From duke at openjdk.org Fri Aug 9 15:23:48 2024 From: duke at openjdk.org (duke) Date: Fri, 9 Aug 2024 15:23:48 GMT Subject: RFR: 8289552: Make intrinsic conversions between bit representations of half precision values and floats [v13] In-Reply-To: References: Message-ID: <3r3BVhGKPPKpcHX9Xfz2nBwWnod3-FrolykXcf9EQPc=.65a6e150-f33d-4dd6-a5e3-3058427c821f@github.com> On Thu, 6 Oct 2022 06:28:04 GMT, Smita Kamath wrote: >> 8289552: Make intrinsic conversions between bit representations of half precision values and floats > > Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: > > Updated instruct to use kmovw @smita-kamath Your change (at version a00c3ecdab6b2c8ca6883e92bb51e3fa99544a17) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/9781#issuecomment-1275006152 From epeter at openjdk.org Fri Aug 9 15:23:48 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 9 Aug 2024 15:23:48 GMT Subject: RFR: 8289552: Make intrinsic conversions between bit representations of half precision values and floats [v13] In-Reply-To: References: <66be8SJdxPOqmqsQ1YIwS4zM4GwPerypGIf8IbfxhRs=.1d03c94a-f3e5-40ae-999e-bdd5f328170d@github.com> Message-ID: On Tue, 11 Oct 2022 17:00:53 GMT, Smita Kamath wrote: >> I started new testing. > > @vnkozlov Thank you for reviewing the patch. @smita-kamath I think I just found another regression of this feature: https://bugs.openjdk.org/browse/JDK-8338126 Can you please have a look? 
------------- PR Comment: https://git.openjdk.org/jdk/pull/9781#issuecomment-2278194773 From kvn at openjdk.org Fri Aug 9 17:12:50 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 9 Aug 2024 17:12:50 GMT Subject: RFR: 8337702: Use new ForwardExceptionNode to call StubRoutines::forward_exception_entry() [v3] In-Reply-To: References: Message-ID: > Currently C2 uses `TailCall` node when it generates code to forward exception in C2 runtime stubs. > `StubRoutines::forward_exception_entry()` address is passed as constant and method pointer is `NULL`: > [generateOptoStub.cpp#L258](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/generateOptoStub.cpp#L258) > > On other hand TailCall mach node uses 2 registers as parameter which is hardcoded in `Matcher`: [matcher.cpp#L828](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/matcher.cpp#L828) > As result we waste two registers to pass constant and NULL. > > Also incorrect relocation is used for such call because the address of `forward_exception` stub passed in register in mach node. When it is converted to `Address` for `jmp` instruction the default `external_word_type` relocation is used when `runtime_call_type` should be used. See discussion in PR [JDK-8337396](https://github.com/openjdk/jdk/pull/20412) > > I added new ideal node `ForwardExceptionNode` to solve these issues. It is similar to `Rethrow` node (which mach node definition I used as template) but I kept it based on `Return` node similar to `TailCall` node. 
> > Tested tier1-3,stress,xcomp Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: Update src/hotspot/share/opto/callnode.hpp Co-authored-by: Tobias Hartmann ------------- Changes: - all: https://git.openjdk.org/jdk/pull/20437/files - new: https://git.openjdk.org/jdk/pull/20437/files/0e8321e2..d8513442 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20437&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20437&range=01-02 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/20437.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20437/head:pull/20437 PR: https://git.openjdk.org/jdk/pull/20437 From azafari at openjdk.org Fri Aug 9 18:00:02 2024 From: azafari at openjdk.org (Afshin Zafari) Date: Fri, 9 Aug 2024 18:00:02 GMT Subject: RFR: 8300800: UB: Shift exponent 32 is too large for 32-bit type 'int' Message-ID: The operand of shift which is a constant `0` changed to `unsigned long`. 
------------- Commit messages: - 8300800: UB: Shift exponent 32 is too large for 32-bit type 'int' Changes: https://git.openjdk.org/jdk/pull/20530/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=20530&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8300800 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/20530.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20530/head:pull/20530 PR: https://git.openjdk.org/jdk/pull/20530 From svkamath at openjdk.org Fri Aug 9 18:02:49 2024 From: svkamath at openjdk.org (Smita Kamath) Date: Fri, 9 Aug 2024 18:02:49 GMT Subject: RFR: 8289552: Make intrinsic conversions between bit representations of half precision values and floats [v13] In-Reply-To: References: <66be8SJdxPOqmqsQ1YIwS4zM4GwPerypGIf8IbfxhRs=.1d03c94a-f3e5-40ae-999e-bdd5f328170d@github.com> Message-ID: <_PlJd1cbdiMGb1yUCWWZDf13xpTIBH2FtPcJ62VduhE=.36f54fc6-2c4f-488d-8d41-874cbf24d722@github.com> On Fri, 9 Aug 2024 15:21:22 GMT, Emanuel Peter wrote: >> @vnkozlov Thank you for reviewing the patch. > > @smita-kamath I think I just found another regression of this feature: https://bugs.openjdk.org/browse/JDK-8338126 > Can you please have a look? @eme64, Sure will look into it. Thank you. ------------- PR Comment: https://git.openjdk.org/jdk/pull/9781#issuecomment-2278462816 From gziemski at openjdk.org Fri Aug 9 19:47:31 2024 From: gziemski at openjdk.org (Gerard Ziemski) Date: Fri, 9 Aug 2024 19:47:31 GMT Subject: RFR: 8300800: UB: Shift exponent 32 is too large for 32-bit type 'int' In-Reply-To: References: Message-ID: <_ssyuLGxjN084l07Pq95QmfnP5RNZVyQKzhbXHBNdas=.0e01ec30-5572-473e-bd33-3f4d348b9c1b@github.com> On Fri, 9 Aug 2024 17:54:17 GMT, Afshin Zafari wrote: > The operand of shift which is a constant `0` changed to `unsigned long`. LGTM, thanks! ------------- Marked as reviewed by gziemski (Committer). 
PR Review: https://git.openjdk.org/jdk/pull/20530#pullrequestreview-2230930783 From kvn at openjdk.org Fri Aug 9 19:56:33 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 9 Aug 2024 19:56:33 GMT Subject: RFR: 8337702: Use new ForwardExceptionNode to call StubRoutines::forward_exception_entry() [v2] In-Reply-To: <2YYUNNYadbPKGGQ8jNqLpSX-Q4jTLeRVa0OXT0Xj_RU=.ed35c73c-8045-4008-bfd1-14d370fff721@github.com> References: <2YYUNNYadbPKGGQ8jNqLpSX-Q4jTLeRVa0OXT0Xj_RU=.ed35c73c-8045-4008-bfd1-14d370fff721@github.com> Message-ID: On Wed, 7 Aug 2024 13:12:01 GMT, Fei Yang wrote: >> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix arm (32 bits) build > > FYI: Also performed tier1-3 test on linux-riscv64. Result looks good. Thank you, @RealFYang, for testing. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20437#issuecomment-2278646121 From kvn at openjdk.org Fri Aug 9 19:56:34 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 9 Aug 2024 19:56:34 GMT Subject: RFR: 8337702: Use new ForwardExceptionNode to call StubRoutines::forward_exception_entry() [v2] In-Reply-To: <3yuLs7rogsy-qznycivDI6m7TizfSMJrbvEgfqhyW30=.4018f37c-64d2-4629-808c-1c538d67a9ce@github.com> References: <3yuLs7rogsy-qznycivDI6m7TizfSMJrbvEgfqhyW30=.4018f37c-64d2-4629-808c-1c538d67a9ce@github.com> Message-ID: On Fri, 9 Aug 2024 08:34:59 GMT, Tobias Hartmann wrote: >> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix arm (32 bits) build > > src/hotspot/cpu/aarch64/aarch64.ad line 16188: > >> 16186: >> 16187: // Forward exception. >> 16188: instruct ForwardExceptionjmp() > > Suggestion: > > instruct ForwardException() > > > Same for other AD files. 
We can't use the same name for Mach node as for Ideal node: src/hotspot/cpu/x86/x86_64.ad(12596) Syntax Error: :duplicate name ForwardException for instruction Error Context: >>>(<<<) src/hotspot/cpu/x86/x86_64.ad(12597) Syntax Error: :Identifier expected, but found '%{ match(ForwardEx[...]'. Error Context: >>>%<<<{ ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20437#discussion_r1712071962 From kvn at openjdk.org Fri Aug 9 20:05:50 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 9 Aug 2024 20:05:50 GMT Subject: RFR: 8337702: Use new ForwardExceptionNode to call StubRoutines::forward_exception_entry() [v4] In-Reply-To: References: Message-ID: > Currently C2 uses `TailCall` node when it generates code to forward exception in C2 runtime stubs. > `StubRoutines::forward_exception_entry()` address is passed as constant and method pointer is `NULL`: > [generateOptoStub.cpp#L258](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/generateOptoStub.cpp#L258) > > On other hand TailCall mach node uses 2 registers as parameter which is hardcoded in `Matcher`: [matcher.cpp#L828](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/matcher.cpp#L828) > As result we waste two registers to pass constant and NULL. > > Also incorrect relocation is used for such call because the address of `forward_exception` stub passed in register in mach node. When it is converted to `Address` for `jmp` instruction the default `external_word_type` relocation is used when `runtime_call_type` should be used. See discussion in PR [JDK-8337396](https://github.com/openjdk/jdk/pull/20412) > > I added new ideal node `ForwardExceptionNode` to solve these issues. It is similar to `Rethrow` node (which mach node definition I used as template) but I kept it based on `Return` node similar to `TailCall` node. 
> > Tested tier1-3,stress,xcomp Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: add ForwardExceptionNode type to vmStruct ------------- Changes: - all: https://git.openjdk.org/jdk/pull/20437/files - new: https://git.openjdk.org/jdk/pull/20437/files/d8513442..931e15ce Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20437&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20437&range=02-03 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/20437.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20437/head:pull/20437 PR: https://git.openjdk.org/jdk/pull/20437 From kvn at openjdk.org Fri Aug 9 20:15:35 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 9 Aug 2024 20:15:35 GMT Subject: RFR: 8337702: Use new ForwardExceptionNode to call StubRoutines::forward_exception_entry() [v2] In-Reply-To: <3yuLs7rogsy-qznycivDI6m7TizfSMJrbvEgfqhyW30=.4018f37c-64d2-4629-808c-1c538d67a9ce@github.com> References: <3yuLs7rogsy-qznycivDI6m7TizfSMJrbvEgfqhyW30=.4018f37c-64d2-4629-808c-1c538d67a9ce@github.com> Message-ID: On Fri, 9 Aug 2024 08:36:32 GMT, Tobias Hartmann wrote: >> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix arm (32 bits) build > > Nice enhancement. > > I noticed that `TailJump` is in vmStructs.cpp, should `ForwardException` be added as well? Hi @TobiHartmann, I cleaned up the `ForwardExceptionNode` constructor using your suggestion and added its type to `vmStruct`. I kept the names of the Mach instructions unchanged (with `jmp` at the end) because we can't use the same name as the Ideal node. We do generate a `jmp` instruction, so I think the name is appropriate in this sense.
------------- PR Comment: https://git.openjdk.org/jdk/pull/20437#issuecomment-2278672039 From kbarrett at openjdk.org Sun Aug 11 04:54:37 2024 From: kbarrett at openjdk.org (Kim Barrett) Date: Sun, 11 Aug 2024 04:54:37 GMT Subject: RFR: 8300800: UB: Shift exponent 32 is too large for 32-bit type 'int' In-Reply-To: References: Message-ID: On Fri, 9 Aug 2024 17:54:17 GMT, Afshin Zafari wrote: > The operand of shift which is a constant `0` changed to `unsigned long`. This should use UCONST64 instead of a UL suffix, because that suffix is platform-dependent. Windows is LLP64, so L is 32 bits. This is code that is (probably, I haven't checked for sure) used by the windows-aarch64 port. Changes requested by kbarrett (Reviewer). src/hotspot/cpu/aarch64/immediate_aarch64.cpp line 298: > 296: uint64_t or_bits_sub = replicate(or_bit, 1, nbits); > 297: uint64_t and_bits_top = (and_bits_sub << nbits) | ones(nbits); > 298: uint64_t or_bits_top = (0UL << nbits) | or_bits_sub; This should use UCONST64 instead of a UL suffix, because that suffix is platform-dependent. Windows is LLP64, so L is 32 bits. This is code that is (probably, I haven't checked for sure) used by the windows-aarch64 port. ------------- Changes requested by kbarrett (Reviewer). 
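To make the portability point in the 8300800 review concrete, here is a minimal C++ sketch (not the actual patch). It assumes only that HotSpot's `UCONST64` yields a 64-bit unsigned literal; `UCONST64_SKETCH` and `or_bits_top_sketch` are hypothetical stand-ins. The width of `0UL` depends on the data model (64 bits on LP64 Linux/macOS, 32 bits on LLP64 Windows), whereas a `ULL` literal is at least 64 bits everywhere, so shifting it by 32 is well defined on every platform.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical stand-in for HotSpot's UCONST64 (assumption: the real macro
// also produces a 64-bit unsigned literal; its exact definition may differ).
#define UCONST64_SKETCH(x) (x##ULL)

// The literal's type, not the destination variable, decides whether a
// shift by 32 is defined behavior.
static_assert(std::numeric_limits<decltype(UCONST64_SKETCH(0))>::digits >= 64,
              "a ULL literal is at least 64 bits on every data model");
// By contrast, 0UL is only 32 bits wide on LLP64 (Windows), so `0UL << 32`
// would still be undefined behavior there.

// Mirrors the shape of the patched expression: shift a zero constant into
// the top bits and OR in the replicated low pattern.
uint64_t or_bits_top_sketch(unsigned nbits, uint64_t or_bits_sub) {
  assert(nbits < 64 && "shift count must be smaller than the operand width");
  return (UCONST64_SKETCH(0) << nbits) | or_bits_sub;
}
```

Because zero shifted left by any valid count is still zero, the computed value never changes; the fix only removes the undefined behavior reported for a shift by 32.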
PR Review: https://git.openjdk.org/jdk/pull/20530#pullrequestreview-2231737254 PR Review: https://git.openjdk.org/jdk/pull/20530#pullrequestreview-2231737350 PR Review Comment: https://git.openjdk.org/jdk/pull/20530#discussion_r1712907362 From thartmann at openjdk.org Mon Aug 12 05:08:31 2024 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 12 Aug 2024 05:08:31 GMT Subject: RFR: 8338112: Test testlibrary_tests/ir_framework/tests/TestPrivilegedMode.java fails with release build [v3] In-Reply-To: <7U4YnZY9Dgf7Mi2y9o3vd_mvmWL_Y-zccHPX98zUcks=.f8966376-3a94-4dcc-8aa9-3020cc496786@github.com> References: <_yw7yokeFXUsQ8mZjBAfM8TnNzn-mIoDrcj3GvvxKTQ=.0640becf-9b54-4016-b238-3a0721a4944f@github.com> <7U4YnZY9Dgf7Mi2y9o3vd_mvmWL_Y-zccHPX98zUcks=.f8966376-3a94-4dcc-8aa9-3020cc496786@github.com> Message-ID: On Fri, 9 Aug 2024 14:50:03 GMT, SendaoYan wrote: >> Hi all, >> The testcase `testlibrary_tests/ir_framework/tests/TestPrivilegedMode.java` fails with release build, the test log snippet says `IR verification disabled due to not running a debug build`. So I think it's need `@requires vm.debug == true`. And ths test will fail without c2 compiler, so it also need `@requires vm.compiler2.enabled`. >> >> The change has been verified with linux x64 release build and linux x64 fastdebug build, no risk, > > SendaoYan has updated the pull request incrementally with one additional commit since the last revision: > > delete vm.compMode != "Xint" Looks good to me. ------------- Marked as reviewed by thartmann (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/20524#pullrequestreview-2232052022 From amitkumar at openjdk.org Mon Aug 12 05:25:33 2024 From: amitkumar at openjdk.org (Amit Kumar) Date: Mon, 12 Aug 2024 05:25:33 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v5] In-Reply-To: References: Message-ID: On Fri, 9 Aug 2024 11:48:17 GMT, Roberto Castañeda Lozano wrote: >> This changeset implements JEP 475 (Late Barrier Expansion for G1), including support for the x64 and aarch64 platforms. See the [JEP description](https://openjdk.org/jeps/475) for further detail. >> >> We aim to integrate this work in JDK 24. The purpose of this pull request is twofold: >> >> - to allow maintainers of the arm (32-bit), ppc, riscv, s390, and x86 (32-bit) ports to contribute a port of these platforms in time for JDK 24; and >> - to allow reviewers to review the platform-independent, x64 and aarch64, and test changes in parallel with the porting work. >> >> ## Summary of the Changes >> >> ### Platform-Independent Changes (`src/hotspot/share`) >> >> These consist mainly of: >> >> - a complete rewrite of `G1BarrierSetC2`, to instruct C2 to expand G1 barriers late instead of early; >> - a few minor changes to C2 itself, to support removal of redundant decompression operations and to address an OopMap construction issue triggered by this JEP's increased usage of ADL `TEMP` operands; and >> - temporary support for porting the JEP to the remaining platforms. >> >> The temporary support code (guarded by the pre-processor flag `G1_LATE_BARRIER_MIGRATION_SUPPORT`) will **not** be part of the final pull request, and hence does not need to be reviewed. >> >> ### Platform-Dependent Changes (`src/hotspot/cpu`) >> >> These include changes to the ADL instruction definitions and the `G1BarrierSetAssembler` class of the x64 and aarch64 platforms.
>> >> #### ADL Changes >> >> The changeset uses ADL predicates to force C2 to implement memory accesses tagged with barrier information using G1-specific, barrier-aware instruction versions (e.g. `g1StoreP` instead of the GC-agnostic `storeP`). These new instruction versions generate machine code according to the corresponding tagged barrier information, relying on the G1 barrier implementations provided by the `G1BarrierSetAssembler` class. On the aarch64 platform, the bulk of the ADL code is generated from a higher-level version using m4, to reduce redundancy. >> >> #### `G1BarrierSetAssembler` Changes >> >> Both platforms basically reuse the barrier implementation for the bytecode interpreter, with the different barrier tests and operations refactored into dedicated functions. Besides this, `G1BarrierSetAssembler` is extended with assembly-stub routines that implement the out-of-line, slow path of the barriers. These routines include calls from the barrier into the JVM, which require support for saving and restoring live ... > > Roberto Castañeda Lozano has updated the pull request incrementally with one additional commit since the last revision: > > Give barrier generation helper functions a more consistent name Is there an issue if we replace this code:

    if (in_bytes(SATBMarkQueue::byte_width_of_active()) == 4) {
      __ ldrw(rscratch1, in_progress);
    } else {
      assert(in_bytes(SATBMarkQueue::byte_width_of_active()) == 1, "Assumption");
      __ ldrb(rscratch1, in_progress);
    }

in method `G1BarrierSetAssembler::gen_write_ref_array_pre_barrier` with `generate_queue_test_and_insertion(masm, rthread, rscratch1)`? You would, though, have to move `gen_write_ref_array_pre_barrier` on top, otherwise the compiler wouldn't be able to find it.
------------- PR Review: https://git.openjdk.org/jdk/pull/19746#pullrequestreview-2232065079 From thartmann at openjdk.org Mon Aug 12 05:33:31 2024 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 12 Aug 2024 05:33:31 GMT Subject: RFR: 8337702: Use new ForwardExceptionNode to call StubRoutines::forward_exception_entry() [v4] In-Reply-To: References: Message-ID: On Fri, 9 Aug 2024 20:05:50 GMT, Vladimir Kozlov wrote: >> Currently C2 uses `TailCall` node when it generates code to forward exception in C2 runtime stubs. >> `StubRoutines::forward_exception_entry()` address is passed as constant and method pointer is `NULL`: >> [generateOptoStub.cpp#L258](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/generateOptoStub.cpp#L258) >> >> On other hand TailCall mach node uses 2 registers as parameter which is hardcoded in `Matcher`: [matcher.cpp#L828](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/matcher.cpp#L828) >> As result we waste two registers to pass constant and NULL. >> >> Also incorrect relocation is used for such call because the address of `forward_exception` stub passed in register in mach node. When it is converted to `Address` for `jmp` instruction the default `external_word_type` relocation is used when `runtime_call_type` should be used. See discussion in PR [JDK-8337396](https://github.com/openjdk/jdk/pull/20412) >> >> I added new ideal node `ForwardExceptionNode` to solve these issues. It is similar to `Rethrow` node (which mach node definition I used as template) but I kept it based on `Return` node similar to `TailCall` node. >> >> Tested tier1-3,stress,xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > add ForwardExceptionNode type to vmStruct Looks good! ------------- Marked as reviewed by thartmann (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/20437#pullrequestreview-2232070772 From thartmann at openjdk.org Mon Aug 12 05:33:32 2024 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 12 Aug 2024 05:33:32 GMT Subject: RFR: 8337702: Use new ForwardExceptionNode to call StubRoutines::forward_exception_entry() [v2] In-Reply-To: References: <3yuLs7rogsy-qznycivDI6m7TizfSMJrbvEgfqhyW30=.4018f37c-64d2-4629-808c-1c538d67a9ce@github.com> Message-ID: On Fri, 9 Aug 2024 19:53:33 GMT, Vladimir Kozlov wrote: >> src/hotspot/cpu/aarch64/aarch64.ad line 16188: >> >>> 16186: >>> 16187: // Forward exception. >>> 16188: instruct ForwardExceptionjmp() >> >> Suggestion: >> >> instruct ForwardException() >> >> >> Same for other AD files. > > We can't use the same name for Mach node as for Ideal node: > > src/hotspot/cpu/x86/x86_64.ad(12596) Syntax Error: :duplicate name ForwardException for instruction > Error Context: >>>(<<<) > src/hotspot/cpu/x86/x86_64.ad(12597) Syntax Error: :Identifier expected, but found '%{ > match(ForwardEx[...]'. > Error Context: >>>%<<<{ Ah, makes sense. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20437#discussion_r1713185542 From chagedorn at openjdk.org Mon Aug 12 05:53:31 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 12 Aug 2024 05:53:31 GMT Subject: RFR: 8338112: Test testlibrary_tests/ir_framework/tests/TestPrivilegedMode.java fails with release build [v3] In-Reply-To: <7U4YnZY9Dgf7Mi2y9o3vd_mvmWL_Y-zccHPX98zUcks=.f8966376-3a94-4dcc-8aa9-3020cc496786@github.com> References: <_yw7yokeFXUsQ8mZjBAfM8TnNzn-mIoDrcj3GvvxKTQ=.0640becf-9b54-4016-b238-3a0721a4944f@github.com> <7U4YnZY9Dgf7Mi2y9o3vd_mvmWL_Y-zccHPX98zUcks=.f8966376-3a94-4dcc-8aa9-3020cc496786@github.com> Message-ID: On Fri, 9 Aug 2024 14:50:03 GMT, SendaoYan wrote: >> Hi all, >> The testcase `testlibrary_tests/ir_framework/tests/TestPrivilegedMode.java` fails with release build, the test log snippet says `IR verification disabled due to not running a debug build`. So I think it's need `@requires vm.debug == true`. And ths test will fail without c2 compiler, so it also need `@requires vm.compiler2.enabled`. >> >> The change has been verified with linux x64 release build and linux x64 fastdebug build, no risk, > > SendaoYan has updated the pull request incrementally with one additional commit since the last revision: > > delete vm.compMode != "Xint" Looks good, thanks! ------------- Marked as reviewed by chagedorn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/20524#pullrequestreview-2232088066 From syan at openjdk.org Mon Aug 12 06:24:39 2024 From: syan at openjdk.org (SendaoYan) Date: Mon, 12 Aug 2024 06:24:39 GMT Subject: RFR: 8338112: Test testlibrary_tests/ir_framework/tests/TestPrivilegedMode.java fails with release build [v3] In-Reply-To: <7U4YnZY9Dgf7Mi2y9o3vd_mvmWL_Y-zccHPX98zUcks=.f8966376-3a94-4dcc-8aa9-3020cc496786@github.com> References: <_yw7yokeFXUsQ8mZjBAfM8TnNzn-mIoDrcj3GvvxKTQ=.0640becf-9b54-4016-b238-3a0721a4944f@github.com> <7U4YnZY9Dgf7Mi2y9o3vd_mvmWL_Y-zccHPX98zUcks=.f8966376-3a94-4dcc-8aa9-3020cc496786@github.com> Message-ID: <9j-_jBEq5uBNzRg4sQTnew9Mg---rLQJR3QUW9KkGzM=.7093ce97-e392-4a7d-be47-5adf04fa27f8@github.com> On Fri, 9 Aug 2024 14:50:03 GMT, SendaoYan wrote: >> Hi all, >> The testcase `testlibrary_tests/ir_framework/tests/TestPrivilegedMode.java` fails with release build, the test log snippet says `IR verification disabled due to not running a debug build`. So I think it's need `@requires vm.debug == true`. And ths test will fail without c2 compiler, so it also need `@requires vm.compiler2.enabled`. >> >> The change has been verified with linux x64 release build and linux x64 fastdebug build, no risk, > > SendaoYan has updated the pull request incrementally with one additional commit since the last revision: > > delete vm.compMode != "Xint" Thanks all for the review. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/20524#issuecomment-2283186104 From duke at openjdk.org Mon Aug 12 06:24:39 2024 From: duke at openjdk.org (duke) Date: Mon, 12 Aug 2024 06:24:39 GMT Subject: RFR: 8338112: Test testlibrary_tests/ir_framework/tests/TestPrivilegedMode.java fails with release build [v3] In-Reply-To: <7U4YnZY9Dgf7Mi2y9o3vd_mvmWL_Y-zccHPX98zUcks=.f8966376-3a94-4dcc-8aa9-3020cc496786@github.com> References: <_yw7yokeFXUsQ8mZjBAfM8TnNzn-mIoDrcj3GvvxKTQ=.0640becf-9b54-4016-b238-3a0721a4944f@github.com> <7U4YnZY9Dgf7Mi2y9o3vd_mvmWL_Y-zccHPX98zUcks=.f8966376-3a94-4dcc-8aa9-3020cc496786@github.com> Message-ID: On Fri, 9 Aug 2024 14:50:03 GMT, SendaoYan wrote: >> Hi all, >> The testcase `testlibrary_tests/ir_framework/tests/TestPrivilegedMode.java` fails with release build, the test log snippet says `IR verification disabled due to not running a debug build`. So I think it's need `@requires vm.debug == true`. And ths test will fail without c2 compiler, so it also need `@requires vm.compiler2.enabled`. >> >> The change has been verified with linux x64 release build and linux x64 fastdebug build, no risk, > > SendaoYan has updated the pull request incrementally with one additional commit since the last revision: > > delete vm.compMode != "Xint" @sendaoYan Your change (at version 292ad2a830b7ec3670acd95f40c3632fd0b0aba4) is now ready to be sponsored by a Committer. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/20524#issuecomment-2283188371 From syan at openjdk.org Mon Aug 12 06:31:40 2024 From: syan at openjdk.org (SendaoYan) Date: Mon, 12 Aug 2024 06:31:40 GMT Subject: RFR: 8338112: Test testlibrary_tests/ir_framework/tests/TestPrivilegedMode.java fails with release build [v3] In-Reply-To: <7U4YnZY9Dgf7Mi2y9o3vd_mvmWL_Y-zccHPX98zUcks=.f8966376-3a94-4dcc-8aa9-3020cc496786@github.com> References: <_yw7yokeFXUsQ8mZjBAfM8TnNzn-mIoDrcj3GvvxKTQ=.0640becf-9b54-4016-b238-3a0721a4944f@github.com> <7U4YnZY9Dgf7Mi2y9o3vd_mvmWL_Y-zccHPX98zUcks=.f8966376-3a94-4dcc-8aa9-3020cc496786@github.com> Message-ID: On Fri, 9 Aug 2024 14:50:03 GMT, SendaoYan wrote: >> Hi all, >> The testcase `testlibrary_tests/ir_framework/tests/TestPrivilegedMode.java` fails with release build, the test log snippet says `IR verification disabled due to not running a debug build`. So I think it's need `@requires vm.debug == true`. And ths test will fail without c2 compiler, so it also need `@requires vm.compiler2.enabled`. >> >> The change has been verified with linux x64 release build and linux x64 fastdebug build, no risk, > > SendaoYan has updated the pull request incrementally with one additional commit since the last revision: > > delete vm.compMode != "Xint" Thanks for the sponsor. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/20524#issuecomment-2283196621 From syan at openjdk.org Mon Aug 12 06:31:40 2024 From: syan at openjdk.org (SendaoYan) Date: Mon, 12 Aug 2024 06:31:40 GMT Subject: Integrated: 8338112: Test testlibrary_tests/ir_framework/tests/TestPrivilegedMode.java fails with release build In-Reply-To: <_yw7yokeFXUsQ8mZjBAfM8TnNzn-mIoDrcj3GvvxKTQ=.0640becf-9b54-4016-b238-3a0721a4944f@github.com> References: <_yw7yokeFXUsQ8mZjBAfM8TnNzn-mIoDrcj3GvvxKTQ=.0640becf-9b54-4016-b238-3a0721a4944f@github.com> Message-ID: On Fri, 9 Aug 2024 13:12:14 GMT, SendaoYan wrote: > Hi all, > The testcase `testlibrary_tests/ir_framework/tests/TestPrivilegedMode.java` fails with release build, the test log snippet says `IR verification disabled due to not running a debug build`. So I think it's need `@requires vm.debug == true`. And ths test will fail without c2 compiler, so it also need `@requires vm.compiler2.enabled`. > > The change has been verified with linux x64 release build and linux x64 fastdebug build, no risk, This pull request has now been integrated. Changeset: 0e7c1c1a Author: SendaoYan Committer: Jie Fu URL: https://git.openjdk.org/jdk/commit/0e7c1c1afeaba1c125b70cabe7b1b7a3193ee5c3 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod 8338112: Test testlibrary_tests/ir_framework/tests/TestPrivilegedMode.java fails with release build Reviewed-by: chagedorn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/20524 From jbhateja at openjdk.org Mon Aug 12 06:32:33 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 12 Aug 2024 06:32:33 GMT Subject: RFR: 8338021: Support saturating vector operators in VectorAPI [v2] In-Reply-To: References: Message-ID: On Fri, 9 Aug 2024 03:28:53 GMT, Jasmine Karthikeyan wrote: >> Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains three additional commits since the last revision: >> >> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8338201 >> - Removed redundant comment >> - 8338021: Support saturating vector operators in VectorAPI > > src/hotspot/share/opto/type.cpp line 495: > >> 493: TypeInt::POS1 = TypeInt::make(1,max_jint, WidenMin); // Positive values >> 494: TypeInt::INT = TypeInt::make(min_jint,max_jint, WidenMax); // 32-bit integers >> 495: TypeInt::UINT = TypeInt::make(0, max_juint, WidenMin); // Unsigned ints > > This would make an illegal type, right? Since `TypeInt` is signed, using `max_juint` as the hi value would end up as signed -1, resulting in the type `0..-1`, an empty type. I wonder if there's a better way to handle this, since in the type system empty types are in a sense equivalent to `TOP`. @jaskarth, its usage in the existing patch is limited to [type comparison](https://github.com/openjdk/jdk/pull/20507/files#diff-3559dcf23b719805be5fd06fd5c1851dbd8f53e47afe6d99cba13a3de0ebc6b2R1542). My plan is to address intrinsification of the new core lib APIs, the associated value range folding optimizations (since unsigned 32-bit values range over [0, 2^32 - 1] whereas signed values range over [-2^31, 2^31 - 1]) and auto-vectorization in a follow-up patch. **Notes on C2 type system:** Unlike Type::FLOAT, integral types are specified by a _lo and _hi value range, and these ranges are pruned using the flow functions associated with each IR operation. Constraining the value ranges allows logic pruning, e.g. in1[TypeInt] & 0x7FFFFFFF will chop off the negative value range of in1, so a control structure like `if (in1 < 0) { true_path; } else { false_path; }`, which uses in1 as the flow condition, will sweep out the true path. The C2 type system only maintains value ranges for integral types, i.e. long and int; sub-word types, which per the JVM specification are stored in an int "word", only constrain the value range of TypeInt.
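Two technical points in this exchange can be checked with a small C++ sketch: a signed `[0, max_juint]` range collapses to the empty `0..-1`, and masking with `0x7FFFFFFF` lets a range-based type system prove `in1 < 0` false. `IntRange`, `and_with_nonneg_mask` and `may_be_negative` are hypothetical names that only model the `_lo`/`_hi` idea; they are not HotSpot's `TypeInt` API.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical, simplified model of a C2-style signed int type [lo, hi].
// This is NOT HotSpot's TypeInt; it only mirrors the idea of _lo/_hi bounds.
struct IntRange {
  int32_t lo;
  int32_t hi;
  // lo > hi means the range contains no values (C2 treats such types like TOP).
  constexpr bool is_empty() const { return lo > hi; }
};

constexpr int32_t kMinJint = std::numeric_limits<int32_t>::min();
constexpr int32_t kMaxJint = std::numeric_limits<int32_t>::max();
constexpr uint32_t kMaxJuint = std::numeric_limits<uint32_t>::max();

// Flow function for (x & mask) with a non-negative constant mask: whatever
// x's range is, the result lies in [0, mask].
constexpr IntRange and_with_nonneg_mask(IntRange /*x*/, int32_t mask) {
  return IntRange{0, mask};
}

// Can the condition (x < 0) be true for some value in r?
constexpr bool may_be_negative(IntRange r) {
  return !r.is_empty() && r.lo < 0;
}

// Storing max_juint in a signed 32-bit bound wraps to -1 (modular conversion,
// guaranteed since C++20 and what mainstream compilers do anyway) ...
constexpr int32_t kMaxJuintAsSigned = static_cast<int32_t>(kMaxJuint);
// ... so a "[0, max_juint]" range built from signed bounds is really [0, -1].
constexpr IntRange kUintAsSigned{0, kMaxJuintAsSigned};
```

In this model the `if (in1 < 0)` branch can be removed exactly when `may_be_negative` returns false for the range of `in1`.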
A type which represents a constant value has the same _hi and _lo value. Floating point types Type::FLOAT / DOUBLE cannot maintain upper / lower value ranges due to rounding constraints. Thus the C2 type system maintains separate types, TypeF and TypeD, which are singletons and represent a constant value. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1713220777 From duke at openjdk.org Mon Aug 12 06:38:33 2024 From: duke at openjdk.org (Tobias Hotz) Date: Mon, 12 Aug 2024 06:38:33 GMT Subject: RFR: 8337702: Use new ForwardExceptionNode to call StubRoutines::forward_exception_entry() [v4] In-Reply-To: References: Message-ID: On Fri, 9 Aug 2024 20:05:50 GMT, Vladimir Kozlov wrote: >> Currently C2 uses a `TailCall` node when it generates code to forward exceptions in C2 runtime stubs. >> The `StubRoutines::forward_exception_entry()` address is passed as a constant and the method pointer is `NULL`: >> [generateOptoStub.cpp#L258](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/generateOptoStub.cpp#L258) >> >> On the other hand, the TailCall mach node uses 2 registers as parameters, which are hardcoded in `Matcher`: [matcher.cpp#L828](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/matcher.cpp#L828) >> As a result we waste two registers to pass a constant and NULL. >> >> Also, an incorrect relocation is used for such a call because the address of the `forward_exception` stub is passed in a register in the mach node. When it is converted to `Address` for the `jmp` instruction, the default `external_word_type` relocation is used when `runtime_call_type` should be used. See the discussion in PR [JDK-8337396](https://github.com/openjdk/jdk/pull/20412) >> >> I added a new ideal node `ForwardExceptionNode` to solve these issues. It is similar to the `Rethrow` node (whose mach node definition I used as a template), but I kept it based on the `Return` node, similar to the `TailCall` node.
>> >> Tested tier1-3,stress,xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > add ForwardExceptionNode type to vmStruct src/hotspot/share/opto/generateOptoStub.cpp line 258: > 256: > 257: assert (StubRoutines::forward_exception_entry() != nullptr, "must be generated before"); > 258: Node *exc_target = makecon(TypeRawPtr::make( StubRoutines::forward_exception_entry() )); `exc_target` is no longer used, so this should probably be removed ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20437#discussion_r1713228039 From kbarrett at openjdk.org Mon Aug 12 07:20:46 2024 From: kbarrett at openjdk.org (Kim Barrett) Date: Mon, 12 Aug 2024 07:20:46 GMT Subject: RFR: 8338156: Fix -Wzero-as-null-pointer-constant warnings in jvmciCompilerToVM.cpp Message-ID: Please review this change to remove -Wzero-as-null-pointer-constant warnings in jvmciCompilerToVM.cpp. We're changing uses of _0 suffixed exception macros to instead use the corresponding _NULL suffixed macros where appropriate. Testing: mach5 tier1 ------------- Commit messages: - fix jvmCompilerToVM.cpp Changes: https://git.openjdk.org/jdk/pull/20538/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=20538&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8338156 Stats: 31 lines in 1 file changed: 0 ins; 0 del; 31 mod Patch: https://git.openjdk.org/jdk/pull/20538.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20538/head:pull/20538 PR: https://git.openjdk.org/jdk/pull/20538 From kbarrett at openjdk.org Mon Aug 12 07:28:59 2024 From: kbarrett at openjdk.org (Kim Barrett) Date: Mon, 12 Aug 2024 07:28:59 GMT Subject: RFR: 8338158: Cleanup ShouldNotXXX uses in machnode.cpp Message-ID: <1r_IesfNVnksCD44wEapS7EN6CgvsJnKGl0PHT04MoQ=.ffc4d0fa-98b9-483a-9b96-f75927afa0be@github.com> Please review this change to machnode.cpp to remove dead code following calls to ShouldNotReachHere() and ShouldNotCallThis(). 
This takes advantage of the availability and use of [[noreturn]] attributes on all supported platforms, and makes all uses of those functions in this file consistent in this respect. As a side effect, this removes some -Wzero-as-null-pointer-constant warnings. Testing: mach5 tier1 ------------- Commit messages: - ShouldNotXXX cleanup in machnode.cpp Changes: https://git.openjdk.org/jdk/pull/20540/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=20540&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8338158 Stats: 9 lines in 1 file changed: 0 ins; 4 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/20540.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20540/head:pull/20540 PR: https://git.openjdk.org/jdk/pull/20540 From tschatzl at openjdk.org Mon Aug 12 07:40:30 2024 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Mon, 12 Aug 2024 07:40:30 GMT Subject: RFR: 8338156: Fix -Wzero-as-null-pointer-constant warnings in jvmciCompilerToVM.cpp In-Reply-To: References: Message-ID: On Mon, 12 Aug 2024 07:16:34 GMT, Kim Barrett wrote: > Please review this change to remove -Wzero-as-null-pointer-constant warnings > in jvmciCompilerToVM.cpp. We're changing uses of _0 suffixed exception macros > to instead use the corresponding _NULL suffixed macros where appropriate. > > Testing: mach5 tier1 Marked as reviewed by tschatzl (Reviewer). src/hotspot/share/jvmci/jvmciCompilerToVM.cpp line 553: > 551: if (!klass->is_interface()) { > 552: THROW_MSG_NULL(vmSymbols::java_lang_IllegalArgumentException(), > 553: err_msg("Expected interface type, got %s", klass->external_name())); Unlike in the other cases, the indentation has not been updated. Looks good otherwise. 
------------- PR Review: https://git.openjdk.org/jdk/pull/20538#pullrequestreview-2232240694 PR Review Comment: https://git.openjdk.org/jdk/pull/20538#discussion_r1713292066 From chagedorn at openjdk.org Mon Aug 12 07:54:31 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 12 Aug 2024 07:54:31 GMT Subject: RFR: 8338158: Cleanup ShouldNotXXX uses in machnode.cpp In-Reply-To: <1r_IesfNVnksCD44wEapS7EN6CgvsJnKGl0PHT04MoQ=.ffc4d0fa-98b9-483a-9b96-f75927afa0be@github.com> References: <1r_IesfNVnksCD44wEapS7EN6CgvsJnKGl0PHT04MoQ=.ffc4d0fa-98b9-483a-9b96-f75927afa0be@github.com> Message-ID: On Mon, 12 Aug 2024 07:23:57 GMT, Kim Barrett wrote: > Please review this change to machnode.cpp to remove dead code following calls > to ShouldNotReachHere() and ShouldNotCallThis(). This takes advantage of the > availability and use of [[noreturn]] attributes on all supported platforms, > and makes all uses of those functions in this file consistent in this respect. > > As a side effect, this removes some -Wzero-as-null-pointer-constant warnings. > > Testing: mach5 tier1 Looks good! Will you update `fatal()` uses separately? There is, for example: https://github.com/openjdk/jdk/blob/03204600c596214895ef86581eba9722f76d39b3/src/hotspot/share/ci/ciEnv.cpp#L851-L852 ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/20540#pullrequestreview-2232266040 From azafari at openjdk.org Mon Aug 12 08:19:44 2024 From: azafari at openjdk.org (Afshin Zafari) Date: Mon, 12 Aug 2024 08:19:44 GMT Subject: RFR: 8300800: UB: Shift exponent 32 is too large for 32-bit type 'int' [v2] In-Reply-To: References: Message-ID: > The constant `0` operand of the shift is changed to `unsigned long`. Afshin Zafari has updated the pull request incrementally with one additional commit since the last revision: Used UCONST64(0) instead of UL.
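Here is a minimal sketch of why the shifted operand needs a 64-bit type, assuming (as the bug title says) that the shift count can reach 32. `or_bits_top` is a hypothetical stand-in for the shift expression being fixed, not the actual HotSpot code.

```cpp
#include <cstdint>

// Hypothetical reduction of the expression under review:
//   or_bits_top = (0 << nbits) | or_bits_sub;
// If the literal 0 is a 32-bit int, nbits == 32 is undefined behavior: a shift
// count must be smaller than the bit width of the (promoted) left operand.
// Writing 0UL is not enough on LLP64 targets such as Windows, where
// unsigned long is also 32 bits wide; a fixed 64-bit operand is needed
// (HotSpot code spells it UCONST64(0); uint64_t{0} is used here instead).
constexpr uint64_t or_bits_top(uint64_t or_bits_sub, unsigned nbits) {
  return (uint64_t{0} << nbits) | or_bits_sub;  // well-defined for nbits < 64
}
```

Because shifting zero always yields zero, the result equals `or_bits_sub` for every valid `nbits`.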
------------- Changes: - all: https://git.openjdk.org/jdk/pull/20530/files - new: https://git.openjdk.org/jdk/pull/20530/files/9143adaf..69033fa7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20530&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20530&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/20530.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20530/head:pull/20530 PR: https://git.openjdk.org/jdk/pull/20530 From adinn at openjdk.org Mon Aug 12 08:31:31 2024 From: adinn at openjdk.org (Andrew Dinn) Date: Mon, 12 Aug 2024 08:31:31 GMT Subject: RFR: 8300800: UB: Shift exponent 32 is too large for 32-bit type 'int' [v2] In-Reply-To: References: Message-ID: On Mon, 12 Aug 2024 08:19:44 GMT, Afshin Zafari wrote: >> The constant `0` operand of the shift is changed to `unsigned long`. > > Afshin Zafari has updated the pull request incrementally with one additional commit since the last revision: > > Used UCONST64(0) instead of UL. Kim is correct that this code is needed for Windows/aarch64. Hence it should use UCONST64(0). ------------- Marked as reviewed by adinn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/20530#pullrequestreview-2232342472 From fgao at openjdk.org Mon Aug 12 08:38:37 2024 From: fgao at openjdk.org (Fei Gao) Date: Mon, 12 Aug 2024 08:38:37 GMT Subject: RFR: 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug" [v4] In-Reply-To: References: <16J-lJ2AceGTVcRWBcP15yKcwO-1IA1XsngyOuNjf7k=.0776f081-ae2c-4279-87cf-d909806c2bc4@github.com> Message-ID: On Fri, 5 Jul 2024 09:19:47 GMT, Andrew Haley wrote: >>> > We can still see the extra register copy. So, I guess it is not caused by this PR and it already existed before? >>> >>> If so then might it be better to relocate Andrew's patch to a separate issue? We may want to backport it independently of this change. >> >> Thanks for your review @adinn.
That sounds quite worthwhile. I'm afraid we can't relocate the fix directly, because it depends on part of my cleanup in this PR. I'll try to extract the common part to make the separate fix simple and clean. Thanks. > >> > We can still see the extra register copy. So, I guess it is not caused by this PR and it already existed before? >> >> If so then might it be better to relocate Andrew's patch to a separate issue? We may want to backport it independently of this change. > > And there may be a more powerful way to fix it than my suggestion. I'm thinking of something analogous to `iRegIorL2I` which may be used instead of `regP` as an input to all address-producing patterns. Hi @theRealAph @adinn @dean-long, can I have a review please? Thanks :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/16991#issuecomment-2283393982 From rcastanedalo at openjdk.org Mon Aug 12 08:38:37 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Mon, 12 Aug 2024 08:38:37 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v5] In-Reply-To: References: Message-ID: On Mon, 12 Aug 2024 05:23:06 GMT, Amit Kumar wrote: > is there an issue if we replace this code: > > ``` > if (in_bytes(SATBMarkQueue::byte_width_of_active()) == 4) { > __ ldrw(rscratch1, in_progress); > } else { > assert(in_bytes(SATBMarkQueue::byte_width_of_active()) == 1, "Assumption"); > __ ldrb(rscratch1, in_progress); > } > ``` > > in method `G1BarrierSetAssembler::gen_write_ref_array_pre_barrier` with `generate_queue_test_and_insertion(masm, rthread, rscratch1)`? > > Though you would have to move `gen_write_ref_array_pre_barrier` to the top, otherwise the compiler wouldn't be able to find it. Thanks for the suggestion Amit!
This refactoring would work (assuming you mean `generate_pre_barrier_fast_path` instead of `generate_queue_test_and_insertion`); however, I am hesitant to apply it because 1) it would further increase the size of the changelog and hence the burden of reviewing it, and 2) it is not a clear maintainability win: some engineers prefer a little bit of code duplication to preserve the assembly code flow (see discussion [here](https://github.com/openjdk/jdk/pull/19746#discussion_r1645713269)). ------------- PR Comment: https://git.openjdk.org/jdk/pull/19746#issuecomment-2283395013 From rcastanedalo at openjdk.org Mon Aug 12 08:46:16 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Mon, 12 Aug 2024 08:46:16 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v2] In-Reply-To: References: <4c-MLXwKcNcSnloSkYkuk3gnv3ux5i5beS51Fd9Z8MQ=.cd0a7eba-ff26-4855-a01c-d1ae5182100b@github.com> <5Q8PqULlpKfoPLXRqI0ua0dVWAy3zPBqtFpycNwBg0Y=.f2830c84-63ba-43cd-85e3-2245e4ac8917@github.com> Message-ID: On Fri, 9 Aug 2024 14:05:43 GMT, Martin Doerr wrote: >> I agree that using indirect memory operands is the most readable choice, and is slightly less wasteful from a register usage perspective. However, when I tried this choice a couple of months ago, I observed timeouts in some CTW runs, which as far as I remember were caused when LCM processed huge basic blocks with lots of memory writes (e.g. arising from static initializations of large String arrays, such as [here](https://github.com/apache/lucene/blob/ea562f6ef2b32fe6eadf57c6381d9a69acb043c7/lucene/analysis/common/src/java/org/apache/lucene/analysis/en/KStemData1.java#L47-L748)), in combination with C2 stress options. In these scenarios, the large number of additional Mach nodes seemed to cause the timeouts. I settled on materializing the store address internally to guard against such corner cases.
I did not see any significant performance difference between the two choices in my benchmark results. >> >> I would like to study whether LCM can be made more robust in this scenario, which would enable using indirect memory operands here, but I think this would be best addressed in a separate RFE. Would it be OK for now to extend the code comment with the details provided in the above explanation? > > Ok, doing it in a separate RFE is fine with me. This sounds like a C2 problem which should get investigated. It may cause other performance problems, too. Maybe a native profiler can show what takes too much time. Thanks Martin, I have added this to my list of follow-up tasks and extended the comment in the code with some more details (commit d21104ca8). ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1713372749 From rcastanedalo at openjdk.org Mon Aug 12 08:46:16 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Mon, 12 Aug 2024 08:46:16 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v6] In-Reply-To: References: Message-ID: > This changeset implements JEP 475 (Late Barrier Expansion for G1), including support for the x64 and aarch64 platforms. See the [JEP description](https://openjdk.org/jeps/475) for further details. > > We aim to integrate this work in JDK 24. The purpose of this pull request is twofold: > > - to allow maintainers of the arm (32-bit), ppc, riscv, s390, and x86 (32-bit) ports to contribute a port of these platforms in time for JDK 24; and > - to allow reviewers to review the platform-independent, x64 and aarch64, and test changes in parallel with the porting work.
> > ## Summary of the Changes > > ### Platform-Independent Changes (`src/hotspot/share`) > > These consist mainly of: > > - a complete rewrite of `G1BarrierSetC2`, to instruct C2 to expand G1 barriers late instead of early; > - a few minor changes to C2 itself, to support removal of redundant decompression operations and to address an OopMap construction issue triggered by this JEP's increased usage of ADL `TEMP` operands; and > - temporary support for porting the JEP to the remaining platforms. > > The temporary support code (guarded by the pre-processor flag `G1_LATE_BARRIER_MIGRATION_SUPPORT`) will **not** be part of the final pull request, and hence does not need to be reviewed. > > ### Platform-Dependent Changes (`src/hotspot/cpu`) > > These include changes to the ADL instruction definitions and the `G1BarrierSetAssembler` class of the x64 and aarch64 platforms. > > #### ADL Changes > > The changeset uses ADL predicates to force C2 to implement memory accesses tagged with barrier information using G1-specific, barrier-aware instruction versions (e.g. `g1StoreP` instead of the GC-agnostic `storeP`). These new instruction versions generate machine code according to the corresponding tagged barrier information, relying on the G1 barrier implementations provided by the `G1BarrierSetAssembler` class. On the aarch64 platform, the bulk of the ADL code is generated from a higher-level version using m4, to reduce redundancy. > > #### `G1BarrierSetAssembler` Changes > > Both platforms basically reuse the barrier implementation for the bytecode interpreter, with the different barrier tests and operations refactored into dedicated functions. Besides this, `G1BarrierSetAssembler` is extended with assembly-stub routines that implement the out-of-line, slow path of the barriers. These routines include calls from the barrier into the JVM, which require support for saving and restoring live registers, provided by the `SaveLiveRegisters` class. This c...
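For readers less familiar with what "barrier tests" and the "out-of-line, slow path" refer to, here is a rough C++ model of the filtering done by a G1 post-barrier fast path, following the classic description of G1's card-marking post-barrier. It is not code from this changeset; all constants and names are illustrative assumptions.

```cpp
#include <cstdint>

// Illustration-only constants (assumptions, not values read from a JVM):
constexpr unsigned kLogRegionBytes = 20;  // assume 1 MiB heap regions
constexpr unsigned kCardShift = 9;        // assume 512-byte cards
constexpr uint8_t kYoungCard = 2;         // assumed "young" card value
constexpr uint8_t kDirtyCard = 0;         // assumed "dirty" card value

enum class PostBarrier { kFilteredSameRegion, kFilteredNull, kFilteredYoung, kFilteredDirty, kSlowPath };

// Index of the card-table entry covering a heap address.
constexpr uintptr_t card_index(uintptr_t addr) { return addr >> kCardShift; }

// Rough model of the tests a G1 post-barrier fast path applies to a reference
// store `*store_addr = new_val`. Only kSlowPath would reach the out-of-line
// stub that dirties and enqueues the card. `card` is the current value of the
// card covering store_addr.
constexpr PostBarrier post_barrier_fast_path(uintptr_t store_addr, uintptr_t new_val, uint8_t card) {
  if (((store_addr ^ new_val) >> kLogRegionBytes) == 0) {
    return PostBarrier::kFilteredSameRegion;  // store within one region
  }
  if (new_val == 0) {
    return PostBarrier::kFilteredNull;  // storing null creates no reference
  }
  if (card == kYoungCard) {
    return PostBarrier::kFilteredYoung;  // young regions are always scanned
  }
  // A real barrier issues a StoreLoad fence and re-reads the card here.
  if (card == kDirtyCard) {
    return PostBarrier::kFilteredDirty;  // card already dirty and enqueued
  }
  return PostBarrier::kSlowPath;
}
```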
Roberto Castañeda Lozano has updated the pull request incrementally with one additional commit since the last revision: Further motivate the choice of internal store address materialization in x64 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/19746/files - new: https://git.openjdk.org/jdk/pull/19746/files/1834bf41..d21104ca Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=19746&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=19746&range=04-05 Stats: 4 lines in 1 file changed: 1 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/19746.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/19746/head:pull/19746 PR: https://git.openjdk.org/jdk/pull/19746 From amitkumar at openjdk.org Mon Aug 12 08:50:35 2024 From: amitkumar at openjdk.org (Amit Kumar) Date: Mon, 12 Aug 2024 08:50:35 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v5] In-Reply-To: References: Message-ID: On Mon, 12 Aug 2024 08:35:57 GMT, Roberto Castañeda Lozano wrote: > > is there an issue if we replace this code: > > ``` > > if (in_bytes(SATBMarkQueue::byte_width_of_active()) == 4) { > > __ ldrw(rscratch1, in_progress); > > } else { > > assert(in_bytes(SATBMarkQueue::byte_width_of_active()) == 1, "Assumption"); > > __ ldrb(rscratch1, in_progress); > > } > > ```
this refactoring would work (assuming you mean `generate_pre_barrier_fast_path` instead of `generate_queue_test_and_insertion`), however I am hesitant to apply it because 1) it would further increase the size of the changelog and hence the burden of reviewing it and 2) it is not a clear maintainability win: some engineers prefer a little bit of code duplication to preserve the assembly code flow (see discussion [here](https://github.com/openjdk/jdk/pull/19746#discussion_r1645713269)). Ha! makes sense. Are you planning to rebase it with master ? Nothing important, but there were couple of failures which are fixed after this PR. So will make test result a bit clean for us ?. ------------- PR Comment: https://git.openjdk.org/jdk/pull/19746#issuecomment-2283418237 From aph at openjdk.org Mon Aug 12 09:13:37 2024 From: aph at openjdk.org (Andrew Haley) Date: Mon, 12 Aug 2024 09:13:37 GMT Subject: RFR: 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug" [v5] In-Reply-To: References: <16J-lJ2AceGTVcRWBcP15yKcwO-1IA1XsngyOuNjf7k=.0776f081-ae2c-4279-87cf-d909806c2bc4@github.com> Message-ID: On Mon, 29 Jul 2024 13:05:46 GMT, Fei Gao wrote: >> On LP64 systems, if the heap can be moved into low virtual address space (below 4GB) and the heap size is smaller than the interesting threshold of 4 GB, we can use unscaled decoding pattern for narrow klass decoding. It means that a generic field reference can be decoded by: >> >> cast<64> (32-bit compressed reference) + field_offset >> >> >> When the `field_offset` is an immediate, on aarch64 platform, the unscaled decoding pattern can match perfectly with a direct addressing mode, i.e., `base_plus_offset`, supported by `LDR/STR` instructions. But for certain data width, not all immediates can be encoded in the instruction field of `LDR/STR` [[1]](https://github.com/openjdk/jdk/blob/8db7bad992a0f31de9c7e00c2657c18670539102/src/hotspot/cpu/aarch64/assembler_aarch64.inline.hpp#L33). 
The ranges differ as data widths vary. >> >> For example, when we try to load a value of long type at offset `1030`, the address expression is `(AddP (DecodeN base) 1030)`. Before the patch, the expression matched `operand indOffIN()`. But for 64-bit `LDR/STR`, a signed immediate byte offset must be in the range -256 to 255, or a positive immediate byte offset must be a multiple of 8 in the range 0 to 32760 [[2]](https://developer.arm.com/documentation/ddi0602/2023-09/Base-Instructions/LDR--immediate---Load-Register--immediate--?lang=en). `1030` can't be encoded in the instruction field, so after matching, when the instruction encoding is checked, the assertion fails. >> >> In this patch, we filter out invalid immediates when deciding whether the current addressing mode can be matched as `base_plus_offset`. We introduce `indOffIN4/indOffLN4` and `indOffIN8/indOffLN8` for 32-bit and 64-bit data types, respectively. E.g., for `memory4`, we remove the generic `indOffIN/indOffLN`, which match the wrong unscaled immediate range, and replace them with `indOffIN4/indOffLN4` instead. >> >> Since 8-bit and 16-bit `LDR/STR` instructions also support the unscaled decoding pattern, we add the addressing mode to the lists of `memory1` and `memory2` by introducing `indOffIN1/indOffLN1` and `indOffIN2/indOffLN2`. >> >> We also remove the unused operands `indOffI/indOffl/indOffIN/indOffLN` to avoid misuse. >> >> Tier 1-3 passed on aarch64. > > Fei Gao has updated the pull request with a new target base due to a merge or a rebase.
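The encoding rule quoted in the description can be modelled in a few lines of C++, which shows why offset `1030` fails for an 8-byte access but would be fine for a 2-byte one. `ldst_offset_encodable` is a hypothetical helper written for illustration; it is not HotSpot's `offset_ok_for_immed`.

```cpp
#include <cstdint>

// Hypothetical helper (not HotSpot's offset_ok_for_immed) modelling the two
// immediate-offset forms available to an AArch64 load/store of
// (1 << size_log2) bytes without writeback:
//  * unscaled (LDUR/STUR): signed 9-bit byte offset in [-256, 255];
//  * scaled (LDR/STR, unsigned offset): non-negative multiple of the access
//    size whose quotient fits in 12 bits, i.e. at most 4095 * access size.
constexpr bool ldst_offset_encodable(int64_t offset, unsigned size_log2) {
  if (offset >= -256 && offset <= 255) {
    return true;  // unscaled form
  }
  const int64_t size = int64_t{1} << size_log2;
  return offset >= 0 && offset % size == 0 && (offset >> size_log2) <= 4095;
}
```

For an 8-byte access (`size_log2 == 3`) the largest scaled offset is 4095 * 8 = 32760, matching the range quoted above.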
The pull request now contains eight commits: > > - Merge branch 'master' into fg8319690 > - Discard IndOffXX style and let legitimize_address() fix any out-of-range immediate offsets > - Merge branch 'master' into fg8319690 > - Add the assertion back and merge matchrules with a better predicate > - Merge branch 'master' into fg8319690 > - Remove unused immIOffset/immLOffset > - Merge branch 'master' into fg8319690 > - 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug" > > On LP64 systems, if the heap can be moved into low virtual > address space (below 4GB) and the heap size is smaller than the > interesting threshold of 4 GB, we can use unscaled decoding > pattern for narrow klass decoding. It means that a generic field > reference can be decoded by: > ``` > cast<64> (32-bit compressed reference) + field_offset > ``` > > When the `field_offset` is an immediate, on aarch64 platform, the > unscaled decoding pattern can match perfectly with a direct > addressing mode, i.e., `base_plus_offset`, supported by LDR/STR > instructions. But for certain data width, not all immediates can > be encoded in the instruction field of LDR/STR[1]. The ranges are > different as data widths vary. > > For example, when we try to load a value of long type at offset of > `1030`, the address expression is `(AddP (DecodeN base) 1030)`. > Before the patch, the expression was matching with > `operand indOffIN()`. But, for 64-bit LDR/STR, signed immediate > byte offset must be in the range -256 to 255 or positive immediate > byte offset must be a multiple of 8 in the range 0 to 32760[2]. > `1030` can't be encoded in the instruction field. So, after > matching, when we do checking for instruction encoding, the > assertion would fail. > > In this patch, we're going to filter out invalid immediates > when deciding if current addressing mode can be matched as > `base_plus_offset`. 
We introduce `indOffIN4/indOffLN4` and > `indOffIN8/indOffLN8` for 32-bit data type and 64-bit data > type separately in the patch. E.g., for `memory4`, we remove > the generic `indOffIN/indOffLN`, which matches wrong unscaled > immediate range, and replace them with `indOffIN4/indOffLN4` > instead. > > Since 8-bit and 16-bit LDR/STR instructions also support the > unscaled decoding pattern, we add the addressing mode in the > lists of `memory1` and `memory... Looks good. That's a nice simplification and cleanup. ------------- Marked as reviewed by aph (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/16991#pullrequestreview-2232427855 From epeter at openjdk.org Mon Aug 12 09:16:47 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 12 Aug 2024 09:16:47 GMT Subject: RFR: 8338124: C2 SuperWord: MulAddS2I input permutation still partially broken after JDK-8333840 Message-ID: [JDK-8333840](https://bugs.openjdk.org/browse/JDK-8333840) attempted to fix some input permutation cases. Sadly, there is a typo which means that there can still be some broken cases that produce wrong results. `MulAddS2I` is a special case, where we have to verify that the 4 inputs for all members of the pack are in the same permutation. I only started the checking on the second element. This worked on all my previous examples, but does not work on the added regression test. We must, of course, start the checking on the first element. 
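The off-by-one described above can be illustrated with a hypothetical model in which every pack member is compared against a reference input order. This is not the SuperWord code; `Perm`, `members_match` and the example data are assumptions made for illustration only.

```cpp
#include <array>
#include <cstddef>

// Hypothetical model (not SuperWord's code): each member of a MulAddS2I pack
// records the order of its four inputs (a, b, c, d of a*b + c*d), and every
// member must use the same order as a reference order `ref`.
using Perm = std::array<int, 4>;

template <std::size_t N>
constexpr bool members_match(const std::array<Perm, N>& pack, const Perm& ref, std::size_t start) {
  for (std::size_t i = start; i < N; ++i) {
    for (std::size_t k = 0; k < 4; ++k) {
      if (pack[i][k] != ref[k]) {
        return false;
      }
    }
  }
  return true;
}

// Example pack whose FIRST member uses a different input order than kRef.
constexpr Perm kRef{{0, 1, 2, 3}};
constexpr std::array<Perm, 3> kPack{{Perm{{2, 3, 0, 1}}, Perm{{0, 1, 2, 3}}, Perm{{0, 1, 2, 3}}}};
```

Calling `members_match(kPack, kRef, 1)` wrongly accepts the pack, while starting at index 0 rejects it.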
------------- Commit messages: - JDK-8338124 Changes: https://git.openjdk.org/jdk/pull/20539/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=20539&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8338124 Stats: 18 lines in 2 files changed: 16 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/20539.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20539/head:pull/20539 PR: https://git.openjdk.org/jdk/pull/20539 From epeter at openjdk.org Mon Aug 12 09:21:10 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 12 Aug 2024 09:21:10 GMT Subject: RFR: 8335628: C2 SuperWord: cleanup: remove SuperWord::longer_type_for_conversion [v4] In-Reply-To: <9Wq_S7WPbQa5JP4G3dqR8fqW8lfLZufZ5kYGs7QKh2o=.ab980543-4c6e-4c00-bbb5-2756768ad0df@github.com> References: <9Wq_S7WPbQa5JP4G3dqR8fqW8lfLZufZ5kYGs7QKh2o=.ab980543-4c6e-4c00-bbb5-2756768ad0df@github.com> Message-ID: > After [JDK-8325155](https://bugs.openjdk.org/browse/JDK-8325155), there is now only a single use of `SuperWord::longer_type_for_conversion`. > The check there is rather useless, and we should just remove it to reduce code complexity. > > **What did this check do?** > It checks whether the input type of a node is larger than the output type, relevant in these cases: > - `is_convert_opcode` > - `is_scalar_op_that_returns_int_but_vector_op_returns_long` > > If there is a larger type, it checks whether this larger type can even create vectors of at least size 2: > - `Matcher::max_vector_size_auto_vectorization(longer_bt) < 2` > > Of course, if there cannot be vectors for this larger type, then we could avoid creating these vectors. > > **Why is it rather useless?** > On most modern platforms, all types have vectors of at least length 2. So this would never fail on those platforms. > And even if this check fails: we would pack the nodes and later reject those packs when we do the `implemented` checks. Hence, we do not wrongly vectorize anyway.
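To make the removed condition concrete, here is a hypothetical sketch of what it computed. It is not the SuperWord code: the real check asks `Matcher::max_vector_size_auto_vectorization` for a lane count, whereas this sketch derives lanes from an assumed per-platform `max_vector_bytes` parameter.

```cpp
// Hypothetical model of the removed check (not the SuperWord code): take the
// larger of the input and output element sizes of a conversion, and ask how
// many lanes of that size fit into the widest auto-vectorization vector.
// `max_vector_bytes` is an assumed per-platform parameter.
constexpr int lanes_for_longer_type(int in_elem_bytes, int out_elem_bytes, int max_vector_bytes) {
  return max_vector_bytes / (in_elem_bytes > out_elem_bytes ? in_elem_bytes : out_elem_bytes);
}

// The old check refused to pack when fewer than 2 lanes were possible.
constexpr bool old_check_rejects(int in_elem_bytes, int out_elem_bytes, int max_vector_bytes) {
  return lanes_for_longer_type(in_elem_bytes, out_elem_bytes, max_vector_bytes) < 2;
}
```

For example, an int-to-long conversion with 16-byte vectors still gets 2 lanes of the longer type, so the check never fires on platforms with at least 16-byte vectors.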
Removing this check means that most platforms no longer have to perform it, while the few platforms on which it would fail only spend a little more compilation time (a very, very minimal increase). I think this is worth the reduction in complexity. But we could always revert this change if there are problems because of it. Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: - Merge branch 'master' into JDK-8335628-rm-sw-longer_type_for_conversion - Merge branch 'master' into JDK-8335628-rm-sw-longer_type_for_conversion - Merge branch 'master' into JDK-8335628-rm-sw-longer_type_for_conversion - 8335628 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/20009/files - new: https://git.openjdk.org/jdk/pull/20009/files/7b441807..b7ceb7cd Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20009&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20009&range=02-03 Stats: 42024 lines in 1343 files changed: 23021 ins; 13209 del; 5794 mod Patch: https://git.openjdk.org/jdk/pull/20009.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20009/head:pull/20009 PR: https://git.openjdk.org/jdk/pull/20009 From kbarrett at openjdk.org Mon Aug 12 09:26:31 2024 From: kbarrett at openjdk.org (Kim Barrett) Date: Mon, 12 Aug 2024 09:26:31 GMT Subject: RFR: 8300800: UB: Shift exponent 32 is too large for 32-bit type 'int' [v2] In-Reply-To: References: Message-ID: On Mon, 12 Aug 2024 08:19:44 GMT, Afshin Zafari wrote: >> The constant `0` operand of the shift is changed to `unsigned long`. > > Afshin Zafari has updated the pull request incrementally with one additional commit since the last revision: > > Used UCONST64(0) instead of UL. src/hotspot/cpu/aarch64/immediate_aarch64.cpp line 298: > 296: uint64_t or_bits_sub = replicate(or_bit, 1, nbits); > 297: uint64_t and_bits_top = (and_bits_sub << nbits) | ones(nbits);
src/hotspot/cpu/aarch64/immediate_aarch64.cpp line 298: > 296: uint64_t or_bits_sub = replicate(or_bit, 1, nbits); > 297: uint64_t and_bits_top = (and_bits_sub << nbits) | ones(nbits); > 298: uint64_t or_bits_top = (UCONST64(0) << nbits) | or_bits_sub; I focused on the UL suffix earlier, and didn't really think about what this is doing. Why are we shifting a zero value at all? This equivalent to `uint64_t or_bits_top = or_bits_sub;`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20530#discussion_r1713423493 From chagedorn at openjdk.org Mon Aug 12 10:48:34 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 12 Aug 2024 10:48:34 GMT Subject: RFR: 8338124: C2 SuperWord: MulAddS2I input permutation still partially broken after JDK-8333840 In-Reply-To: References: Message-ID: On Mon, 12 Aug 2024 07:17:22 GMT, Emanuel Peter wrote: > [JDK-8333840](https://bugs.openjdk.org/browse/JDK-8333840) attempted to fix some input permutation cases. Sadly, there is a typo which means that there can still be some broken cases that produce wrong results. > > `MulAddS2I` is a special case, where we have to verify that the 4 inputs for all members of the pack are in the same permutation. I only started the checking on the second element. This worked on all my previous examples, but does not work on the added regression test. We must, of course, start the checking on the first element. Looks good! ------------- Marked as reviewed by chagedorn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/20539#pullrequestreview-2232634333 From epeter at openjdk.org Mon Aug 12 12:13:35 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 12 Aug 2024 12:13:35 GMT Subject: RFR: 8335628: C2 SuperWord: cleanup: remove SuperWord::longer_type_for_conversion [v3] In-Reply-To: References: <9Wq_S7WPbQa5JP4G3dqR8fqW8lfLZufZ5kYGs7QKh2o=.ab980543-4c6e-4c00-bbb5-2756768ad0df@github.com> Message-ID: On Mon, 29 Jul 2024 15:56:00 GMT, Vladimir Kozlov wrote: >> Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: >> >> - Merge branch 'master' into JDK-8335628-rm-sw-longer_type_for_conversion >> - Merge branch 'master' into JDK-8335628-rm-sw-longer_type_for_conversion >> - 8335628 > > There was a micro benchmark `TypeVectorOperations.java` in the [8283091](https://github.com/openjdk/jdk/commit/a1795901ee292fa6272768cef2fedcaaf8044074) changes. Can you check that there is no regression on our supported platforms? @vnkozlov I ran that benchmark on both x64 (AVX512) and aarch64 (NEON). There is no significant difference: the difference between the SuperWord and non-SuperWord runs of that benchmark is a factor of 2x-10x, while the difference between master and the patch for the SuperWord run is within the normal noise range. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20009#issuecomment-2283804394 From rcastanedalo at openjdk.org Mon Aug 12 12:13:42 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Mon, 12 Aug 2024 12:13:42 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v5] In-Reply-To: References: Message-ID: On Mon, 12 Aug 2024 08:48:24 GMT, Amit Kumar wrote: > Are you planning to rebase it with master?
Nothing important, but there were a couple of failures which are fixed after this PR, so it would make the test results a bit cleaner for us. Actually, I have refrained from updating to the latest mainline changes to avoid interfering with the porting work while it is in progress, but if there is consensus among the port maintainers I would be happy to update the changeset regularly. @TheRealMDoerr @feilongjiang @offamitkumar @snazarkin what do you prefer? ------------- PR Comment: https://git.openjdk.org/jdk/pull/19746#issuecomment-2283802801 From mdoerr at openjdk.org Mon Aug 12 12:25:41 2024 From: mdoerr at openjdk.org (Martin Doerr) Date: Mon, 12 Aug 2024 12:25:41 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v6] In-Reply-To: References: Message-ID: On Mon, 12 Aug 2024 08:46:16 GMT, Roberto Castañeda Lozano wrote: >> This changeset implements JEP 475 (Late Barrier Expansion for G1), including support for the x64 and aarch64 platforms. See the [JEP description](https://openjdk.org/jeps/475) for further detail. >> >> We aim to integrate this work in JDK 24. The purpose of this pull request is double-fold: >> >> - to allow maintainers of the arm (32-bit), ppc, riscv, s390, and x86 (32-bit) ports to contribute a port of these platforms in time for JDK 24; and >> - to allow reviewers to review the platform-independent, x64 and aarch64, and test changes in parallel with the porting work. >> >> ## Summary of the Changes >> >> ### Platform-Independent Changes (`src/hotspot/share`) >> >> These consist mainly of: >> >> - a complete rewrite of `G1BarrierSetC2`, to instruct C2 to expand G1 barriers late instead of early; >> - a few minor changes to C2 itself, to support removal of redundant decompression operations and to address an OopMap construction issue triggered by this JEP's increased usage of ADL `TEMP` operands; and >> - temporary support for porting the JEP to the remaining platforms.
>> >> The temporary support code (guarded by the pre-processor flag `G1_LATE_BARRIER_MIGRATION_SUPPORT`) will **not** be part of the final pull request, and hence does not need to be reviewed. >> >> ### Platform-Dependent Changes (`src/hotspot/cpu`) >> >> These include changes to the ADL instruction definitions and the `G1BarrierSetAssembler` class of the x64 and aarch64 platforms. >> >> #### ADL Changes >> >> The changeset uses ADL predicates to force C2 to implement memory accesses tagged with barrier information using G1-specific, barrier-aware instruction versions (e.g. `g1StoreP` instead of the GC-agnostic `storeP`). These new instruction versions generate machine code accordingly to the corresponding tagged barrier information, relying on the G1 barrier implementations provided by the `G1BarrierSetAssembler` class. In the aarch64 platform, the bulk of the ADL code is generated from a higher-level version using m4, to reduce redundancy. >> >> #### `G1BarrierSetAssembler` Changes >> >> Both platforms basically reuse the barrier implementation for the bytecode interpreter, with the different barrier tests and operations refactored into dedicated functions. Besides this, `G1BarrierSetAssembler` is extended with assembly-stub routines that implement the out-of-line, slow path of the barriers. These routines include calls from the barrier into the JVM, which require support for saving and restoring live ... > > Roberto Castañeda Lozano has updated the pull request incrementally with one additional commit since the last revision: > > Further motivate the choice of internal store address materialization in x64 I'm a bit concerned about regular updates. We should at least check that all platforms are in good shape before merging. JDK head looks good at the moment, so I'd appreciate an update.
------------- PR Comment: https://git.openjdk.org/jdk/pull/19746#issuecomment-2283828888 From thartmann at openjdk.org Mon Aug 12 12:30:31 2024 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 12 Aug 2024 12:30:31 GMT Subject: RFR: 8338124: C2 SuperWord: MulAddS2I input permutation still partially broken after JDK-8333840 In-Reply-To: References: Message-ID: On Mon, 12 Aug 2024 07:17:22 GMT, Emanuel Peter wrote: > [JDK-8333840](https://bugs.openjdk.org/browse/JDK-8333840) attempted to fix some input permutation cases. Sadly, there is a typo which means that there can still be some broken cases that produce wrong results. > > `MulAddS2I` is a special case, where we have to verify that the 4 inputs for all members of the pack are in the same permutation. I only started the checking on the second element. This worked on all my previous examples, but does not work on the added regression test. We must, of course, start the checking on the first element. Looks good to me too! ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/20539#pullrequestreview-2232856837 From adinn at openjdk.org Mon Aug 12 12:50:33 2024 From: adinn at openjdk.org (Andrew Dinn) Date: Mon, 12 Aug 2024 12:50:33 GMT Subject: RFR: 8300800: UB: Shift exponent 32 is too large for 32-bit type 'int' [v2] In-Reply-To: References: Message-ID: On Mon, 12 Aug 2024 09:23:24 GMT, Kim Barrett wrote: >> Afshin Zafari has updated the pull request incrementally with one additional commit since the last revision: >> >> Used UCONST64(0) instead of UL. > > src/hotspot/cpu/aarch64/immediate_aarch64.cpp line 298: > >> 296: uint64_t or_bits_sub = replicate(or_bit, 1, nbits); >> 297: uint64_t and_bits_top = (and_bits_sub << nbits) | ones(nbits); >> 298: uint64_t or_bits_top = (UCONST64(0) << nbits) | or_bits_sub; > > I focused on the UL suffix earlier, and didn't really think about what this is doing. Why are we shifting > a zero value at all? 
This is equivalent to `uint64_t or_bits_top = or_bits_sub;`. I believe this question came up in an earlier thread and was answered by @theRealAph. The shift of zero is there to emphasise continuity of this case with the other cases, where a non-zero value is shifted, i.e. it serves to document the connection between this implementation and the algorithm that it embodies. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20530#discussion_r1713703513 From enikitin at openjdk.org Mon Aug 12 13:14:32 2024 From: enikitin at openjdk.org (Evgeny Nikitin) Date: Mon, 12 Aug 2024 13:14:32 GMT Subject: RFR: 8337102: JITTester: Fix breaks in static initialization blocks In-Reply-To: References: <8JfbPJYkBr00hKBUCngwNu5JM_Q-aPRnV2xsvUmhmTA=.2d5cda28-83ec-4236-9609-74391591f4d8@github.com> Message-ID: On Tue, 30 Jul 2024 17:01:21 GMT, Igor Veresov wrote: > Why would it generate a break if there is no enclosing loop? That's the essence of the change - the static initialisation block creates an inner body (Block node) at [factories/StaticConstructorDefinitionFactory.java#L58](https://github.com/openjdk/jdk/blob/89a15f1414f89d2dd32eac791e9155fcb4207e56/test/hotspot/jtreg/testlibrary/jittester/src/jdk/test/lib/jittester/factories/StaticConstructorDefinitionFactory.java#L58), with `canHaveBreaks` enabled. The inner blocks/nodes just inherit that (well, the majority of them): `body = new IRNodeBuilder() .. .setCanHaveBreaks(true) // <- Children inherit this "can" ... .produce(); ... return new StaticConstructorDefinition(body);` That behaviour has been checked, and the current version happily creates such erroneous blocks.
------------- PR Comment: https://git.openjdk.org/jdk/pull/20310#issuecomment-2283948493 From enikitin at openjdk.org Mon Aug 12 13:17:31 2024 From: enikitin at openjdk.org (Evgeny Nikitin) Date: Mon, 12 Aug 2024 13:17:31 GMT Subject: RFR: 8337102: JITTester: Fix breaks in static initialization blocks In-Reply-To: References: <8JfbPJYkBr00hKBUCngwNu5JM_Q-aPRnV2xsvUmhmTA=.2d5cda28-83ec-4236-9609-74391591f4d8@github.com> Message-ID: <7pM2y92iNEuWqMjSO37Tj4sECTgIqNqxIZNTw2J6h1s=.cea05d8e-691e-4093-a57f-ec0e0d255fd4@github.com> On Tue, 30 Jul 2024 17:04:37 GMT, Emanuel Peter wrote: > A drive-by question: I suppose this would still allow `break` in `switch`? I don't know, but the change only alters the static initialisation behaviour: if `switch` statements get their `break`s correctly now, they will continue to get them after the change. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20310#issuecomment-2283957047 From rcastanedalo at openjdk.org Mon Aug 12 13:18:33 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Mon, 12 Aug 2024 13:18:33 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v2] In-Reply-To: <7O0X1427MXg_MrnMb9qcH64x3lG3feYt47iZhBujztk=.3fee1482-8bcf-42d9-bc4d-0e074f03222e@github.com> References: <7O0X1427MXg_MrnMb9qcH64x3lG3feYt47iZhBujztk=.3fee1482-8bcf-42d9-bc4d-0e074f03222e@github.com> Message-ID: On Thu, 8 Aug 2024 09:29:17 GMT, Daniel Lundén wrote: >> If a method has a large number of parameters, we currently bail out from C2 compilation. >> >> ### Changeset >> >> Allowing C2 compilation of methods with a large number of parameters requires fundamental changes to the register mask data structure, used in many places in C2. In particular, register masks currently have a statically determined size and cannot represent arbitrary numbers of stack slots. This is needed if we want to compile methods with arbitrary numbers of parameters.
Register mask operations are present in performance-sensitive parts of C2, which further complicates changes. >> >> Changes: >> - Add functionality to dynamically grow/extend register masks. I experimented with a number of design choices to achieve this. To keep the common case (normal number of method parameters) quick and also to avoid more intrusive changes to the current `RegMask` interface, I decided to leave the "base" statically allocated memory for masks unchanged and only use dynamically allocated memory in the rare cases where it is needed. >> - Generalize the "chunk"-logic from `PhaseChaitin::Select()` to allow arbitrary-sized chunks, and also move most of the logic into register mask methods to separate concerns and to make the `PhaseChaitin::Select()` code more readable. >> - Remove all `can_represent` checks and bailouts. >> - Performance tuning. A particularly important change is the early-exit optimization in `RegMask::overlap`, used in the performance-sensitive method `PhaseChaitin::interfere_with_live`. >> - Add a new test case `TestManyMethodArguments.java` and extend an old test `TestNestedSynchronize.java`. >> >> ### Testing >> >> - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/10178060450) >> - `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. >> - Standard performance benchmarking. No observed conclusive overall performance degradation/improvement. >> - Specific benchmarking of C2 compilation time. The changes increase C2 compilation time by, approximately and on average, 1% for methods that could also be compiled before this changeset (see the figure below). The reason for the degradation is further checks required in performance-sensitive code (in particular `PhaseChaitin::remove_bound_register_from_interfering_live_ranges`). 
I have tried optimizing in various ways, but changes I found that lead to improvement also lead to less readable code (and are, in my opinion, no... > > Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: > > Remove leftover CHUNK_SIZE reference src/hotspot/share/opto/postaloc.cpp line 765: > 763: // in both registers. > 764: OptoReg::Name nreg_lo = OptoReg::add(nreg,-1); > 765: if( !lrgs(lidx).mask().Member(nreg_lo) ) { // Nearly always adjacent Is the removal of `// Either a spill slot, or` intentional? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r1713751207 From rcastanedalo at openjdk.org Mon Aug 12 13:45:37 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Mon, 12 Aug 2024 13:45:37 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v2] In-Reply-To: <7O0X1427MXg_MrnMb9qcH64x3lG3feYt47iZhBujztk=.3fee1482-8bcf-42d9-bc4d-0e074f03222e@github.com> References: <7O0X1427MXg_MrnMb9qcH64x3lG3feYt47iZhBujztk=.3fee1482-8bcf-42d9-bc4d-0e074f03222e@github.com> Message-ID: <0XaixuIp0R9cEzxQncknnvUAvgWSi9LPNe--m2veEVg=.408f013c-c0dc-41ec-b8df-a11e1a0b0357@github.com> On Thu, 8 Aug 2024 09:29:17 GMT, Daniel Lundén wrote: >> If a method has a large number of parameters, we currently bail out from C2 compilation. >> >> ### Changeset >> >> Allowing C2 compilation of methods with a large number of parameters requires fundamental changes to the register mask data structure, used in many places in C2. In particular, register masks currently have a statically determined size and cannot represent arbitrary numbers of stack slots. This is needed if we want to compile methods with arbitrary numbers of parameters. Register mask operations are present in performance-sensitive parts of C2, which further complicates changes. >> >> Changes: >> - Add functionality to dynamically grow/extend register masks.
I experimented with a number of design choices to achieve this. To keep the common case (normal number of method parameters) quick and also to avoid more intrusive changes to the current `RegMask` interface, I decided to leave the "base" statically allocated memory for masks unchanged and only use dynamically allocated memory in the rare cases where it is needed. >> - Generalize the "chunk"-logic from `PhaseChaitin::Select()` to allow arbitrary-sized chunks, and also move most of the logic into register mask methods to separate concerns and to make the `PhaseChaitin::Select()` code more readable. >> - Remove all `can_represent` checks and bailouts. >> - Performance tuning. A particularly important change is the early-exit optimization in `RegMask::overlap`, used in the performance-sensitive method `PhaseChaitin::interfere_with_live`. >> - Add a new test case `TestManyMethodArguments.java` and extend an old test `TestNestedSynchronize.java`. >> >> ### Testing >> >> - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/10178060450) >> - `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. >> - Standard performance benchmarking. No observed conclusive overall performance degradation/improvement. >> - Specific benchmarking of C2 compilation time. The changes increase C2 compilation time by, approximately and on average, 1% for methods that could also be compiled before this changeset (see the figure below). The reason for the degradation is further checks required in performance-sensitive code (in particular `PhaseChaitin::remove_bound_register_from_interfering_live_ranges`). I have tried optimizing in various ways, but changes I found that lead to improvement also lead to less readable code (and are, in my opinion, no... 
> > Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: > > Remove leftover CHUNK_SIZE reference src/hotspot/share/opto/regmask.hpp line 82: > 80: // In rare situations (e.g., "more than 90+ parameters on Intel"), we need to > 81: // extend the register mask with dynamically allocated memory. > 82: uintptr_t* _RM_UP_EXT = nullptr; Have you considered using a growable array (`src/hotspot/share/utilities/growableArray.hpp`) for this part? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r1713809721 From rcastanedalo at openjdk.org Mon Aug 12 14:00:36 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Mon, 12 Aug 2024 14:00:36 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v6] In-Reply-To: References: Message-ID: On Mon, 12 Aug 2024 12:23:24 GMT, Martin Doerr wrote: > I'm a bit concerned about regular updates. We should at least check that all platforms are in good shape before merging. JDK head looks good at the moment, so I'd appreciate an update. OK, I will test and push a merge of jdk-24+10 (Thu Aug 8) in the next few days, unless @feilongjiang or @snazarkin object. We can then check in a few weeks whether another update is required. ------------- PR Comment: https://git.openjdk.org/jdk/pull/19746#issuecomment-2284064522 From mdoerr at openjdk.org Mon Aug 12 14:06:35 2024 From: mdoerr at openjdk.org (Martin Doerr) Date: Mon, 12 Aug 2024 14:06:35 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v6] In-Reply-To: References: Message-ID: On Mon, 12 Aug 2024 08:46:16 GMT, Roberto Castañeda Lozano wrote: >> This changeset implements JEP 475 (Late Barrier Expansion for G1), including support for the x64 and aarch64 platforms. See the [JEP description](https://openjdk.org/jeps/475) for further detail. >> >> We aim to integrate this work in JDK 24.
The purpose of this pull request is double-fold: >> >> - to allow maintainers of the arm (32-bit), ppc, riscv, s390, and x86 (32-bit) ports to contribute a port of these platforms in time for JDK 24; and >> - to allow reviewers to review the platform-independent, x64 and aarch64, and test changes in parallel with the porting work. >> >> ## Summary of the Changes >> >> ### Platform-Independent Changes (`src/hotspot/share`) >> >> These consist mainly of: >> >> - a complete rewrite of `G1BarrierSetC2`, to instruct C2 to expand G1 barriers late instead of early; >> - a few minor changes to C2 itself, to support removal of redundant decompression operations and to address an OopMap construction issue triggered by this JEP's increased usage of ADL `TEMP` operands; and >> - temporary support for porting the JEP to the remaining platforms. >> >> The temporary support code (guarded by the pre-processor flag `G1_LATE_BARRIER_MIGRATION_SUPPORT`) will **not** be part of the final pull request, and hence does not need to be reviewed. >> >> ### Platform-Dependent Changes (`src/hotspot/cpu`) >> >> These include changes to the ADL instruction definitions and the `G1BarrierSetAssembler` class of the x64 and aarch64 platforms. >> >> #### ADL Changes >> >> The changeset uses ADL predicates to force C2 to implement memory accesses tagged with barrier information using G1-specific, barrier-aware instruction versions (e.g. `g1StoreP` instead of the GC-agnostic `storeP`). These new instruction versions generate machine code accordingly to the corresponding tagged barrier information, relying on the G1 barrier implementations provided by the `G1BarrierSetAssembler` class. In the aarch64 platform, the bulk of the ADL code is generated from a higher-level version using m4, to reduce redundancy. 
>> >> #### `G1BarrierSetAssembler` Changes >> >> Both platforms basically reuse the barrier implementation for the bytecode interpreter, with the different barrier tests and operations refactored into dedicated functions. Besides this, `G1BarrierSetAssembler` is extended with assembly-stub routines that implement the out-of-line, slow path of the barriers. These routines include calls from the barrier into the JVM, which require support for saving and restoring live ... > > Roberto Castañeda Lozano has updated the pull request incrementally with one additional commit since the last revision: > > Further motivate the choice of internal store address materialization in x64 src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp line 203: > 201: // Do we need to load the previous value? > 202: if (obj != noreg) { > 203: __ load_heap_oop(pre_val, Address(obj, 0), noreg, noreg, AS_RAW); How do we handle implicit null checks for which `obj` is null? Note that we may expect the store instruction to trigger SIGSEGV. Does it work correctly if we trigger the SIGSEGV here, in the pre-barrier? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1713842991 From iveresov at openjdk.org Mon Aug 12 15:01:30 2024 From: iveresov at openjdk.org (Igor Veresov) Date: Mon, 12 Aug 2024 15:01:30 GMT Subject: RFR: 8337102: JITTester: Fix breaks in static initialization blocks In-Reply-To: References: <8JfbPJYkBr00hKBUCngwNu5JM_Q-aPRnV2xsvUmhmTA=.2d5cda28-83ec-4236-9609-74391591f4d8@github.com> Message-ID: On Mon, 12 Aug 2024 13:12:01 GMT, Evgeny Nikitin wrote: > > Why would it generate a break if there is no enclosing loop?
> > That's the essence of the change - the static initialisation block creates an inner body (Block node) at [factories/StaticConstructorDefinitionFactory.java#L58](https://github.com/openjdk/jdk/blob/89a15f1414f89d2dd32eac791e9155fcb4207e56/test/hotspot/jtreg/testlibrary/jittester/src/jdk/test/lib/jittester/factories/StaticConstructorDefinitionFactory.java#L58), with `canHaveBreaks` enabled. The inner blocks/nodes just inherit that. Well, majority of them: > > ``` > body = new IRNodeBuilder() > .. > .setCanHaveBreaks(true) // <- Children inherit this "can" > ... > .produce(); > ... > return new StaticConstructorDefinition(body); > ``` > > That behaviour has been checked and current version happily creates such erroneous blocks. What I mean is - loops with breaks are totally legal in static initializers. So, the bug, I think, is actually that we generated a break outside of the loop. Disabling breaks of course solves it, but it doesn't look like it fixes the underlying problem, does it? ------------- PR Comment: https://git.openjdk.org/jdk/pull/20310#issuecomment-2284215303 From kvn at openjdk.org Mon Aug 12 16:14:08 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 12 Aug 2024 16:14:08 GMT Subject: RFR: 8337702: Use new ForwardExceptionNode to call StubRoutines::forward_exception_entry() [v5] In-Reply-To: References: Message-ID: > Currently C2 uses `TailCall` node when it generates code to forward exception in C2 runtime stubs. > `StubRoutines::forward_exception_entry()` address is passed as constant and method pointer is `NULL`: > [generateOptoStub.cpp#L258](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/generateOptoStub.cpp#L258) > > On other hand TailCall mach node uses 2 registers as parameter which is hardcoded in `Matcher`: [matcher.cpp#L828](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/matcher.cpp#L828) > As result we waste two registers to pass constant and NULL. 
> > Also incorrect relocation is used for such call because the address of `forward_exception` stub passed in register in mach node. When it is converted to `Address` for `jmp` instruction the default `external_word_type` relocation is used when `runtime_call_type` should be used. See discussion in PR [JDK-8337396](https://github.com/openjdk/jdk/pull/20412) > > I added new ideal node `ForwardExceptionNode` to solve these issues. It is similar to `Rethrow` node (which mach node definition I used as template) but I kept it based on `Return` node similar to `TailCall` node. > > Tested tier1-3,stress,xcomp Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: Removed unused variable. Updated copyright year. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/20437/files - new: https://git.openjdk.org/jdk/pull/20437/files/931e15ce..f08c1728 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20437&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20437&range=03-04 Stats: 3 lines in 2 files changed: 0 ins; 1 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/20437.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20437/head:pull/20437 PR: https://git.openjdk.org/jdk/pull/20437 From kvn at openjdk.org Mon Aug 12 16:14:09 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 12 Aug 2024 16:14:09 GMT Subject: RFR: 8337702: Use new ForwardExceptionNode to call StubRoutines::forward_exception_entry() [v4] In-Reply-To: References: Message-ID: On Mon, 12 Aug 2024 06:35:53 GMT, Tobias Hotz wrote: >> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> add ForwardExceptionNode type to vmStruct > > src/hotspot/share/opto/generateOptoStub.cpp line 258: > >> 256: >> 257: assert (StubRoutines::forward_exception_entry() != nullptr, "must be generated before"); >> 258: Node *exc_target = makecon(TypeRawPtr::make( 
StubRoutines::forward_exception_entry() )); > > `exc_target` is no longer used, so this should probably be removed removed ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20437#discussion_r1714056302 From thartmann at openjdk.org Mon Aug 12 16:41:32 2024 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 12 Aug 2024 16:41:32 GMT Subject: RFR: 8337702: Use new ForwardExceptionNode to call StubRoutines::forward_exception_entry() [v5] In-Reply-To: References: Message-ID: On Mon, 12 Aug 2024 16:14:08 GMT, Vladimir Kozlov wrote: >> Currently C2 uses `TailCall` node when it generates code to forward exception in C2 runtime stubs. >> `StubRoutines::forward_exception_entry()` address is passed as constant and method pointer is `NULL`: >> [generateOptoStub.cpp#L258](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/generateOptoStub.cpp#L258) >> >> On other hand TailCall mach node uses 2 registers as parameter which is hardcoded in `Matcher`: [matcher.cpp#L828](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/matcher.cpp#L828) >> As result we waste two registers to pass constant and NULL. >> >> Also incorrect relocation is used for such call because the address of `forward_exception` stub passed in register in mach node. When it is converted to `Address` for `jmp` instruction the default `external_word_type` relocation is used when `runtime_call_type` should be used. See discussion in PR [JDK-8337396](https://github.com/openjdk/jdk/pull/20412) >> >> I added new ideal node `ForwardExceptionNode` to solve these issues. It is similar to `Rethrow` node (which mach node definition I used as template) but I kept it based on `Return` node similar to `TailCall` node. >> >> Tested tier1-3,stress,xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Removed unused variable. Updated copyright year. Still good. 
------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/20437#pullrequestreview-2233550496 From kvn at openjdk.org Mon Aug 12 16:49:31 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 12 Aug 2024 16:49:31 GMT Subject: RFR: 8337702: Use new ForwardExceptionNode to call StubRoutines::forward_exception_entry() [v5] In-Reply-To: References: Message-ID: On Mon, 12 Aug 2024 16:14:08 GMT, Vladimir Kozlov wrote: >> Currently C2 uses `TailCall` node when it generates code to forward exception in C2 runtime stubs. >> `StubRoutines::forward_exception_entry()` address is passed as constant and method pointer is `NULL`: >> [generateOptoStub.cpp#L258](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/generateOptoStub.cpp#L258) >> >> On other hand TailCall mach node uses 2 registers as parameter which is hardcoded in `Matcher`: [matcher.cpp#L828](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/matcher.cpp#L828) >> As result we waste two registers to pass constant and NULL. >> >> Also incorrect relocation is used for such call because the address of `forward_exception` stub passed in register in mach node. When it is converted to `Address` for `jmp` instruction the default `external_word_type` relocation is used when `runtime_call_type` should be used. See discussion in PR [JDK-8337396](https://github.com/openjdk/jdk/pull/20412) >> >> I added new ideal node `ForwardExceptionNode` to solve these issues. It is similar to `Rethrow` node (which mach node definition I used as template) but I kept it based on `Return` node similar to `TailCall` node. >> >> Tested tier1-3,stress,xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Removed unused variable. Updated copyright year. 
Thank you, Tobias ------------- PR Comment: https://git.openjdk.org/jdk/pull/20437#issuecomment-2284485072 From kvn at openjdk.org Mon Aug 12 17:20:33 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 12 Aug 2024 17:20:33 GMT Subject: RFR: 8338124: C2 SuperWord: MulAddS2I input permutation still partially broken after JDK-8333840 In-Reply-To: References: Message-ID: On Mon, 12 Aug 2024 07:17:22 GMT, Emanuel Peter wrote: > [JDK-8333840](https://bugs.openjdk.org/browse/JDK-8333840) attempted to fix some input permutation cases. Sadly, there is a typo which means that there can still be some broken cases that produce wrong results. > > `MulAddS2I` is a special case, where we have to verify that the 4 inputs for all members of the pack are in the same permutation. I only started the checking on the second element. This worked on all my previous examples, but does not work on the added regression test. We must, of course, start the checking on the first element. Good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/20539#pullrequestreview-2233629728 From kvn at openjdk.org Mon Aug 12 17:20:37 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 12 Aug 2024 17:20:37 GMT Subject: RFR: 8335628: C2 SuperWord: cleanup: remove SuperWord::longer_type_for_conversion [v4] In-Reply-To: References: <9Wq_S7WPbQa5JP4G3dqR8fqW8lfLZufZ5kYGs7QKh2o=.ab980543-4c6e-4c00-bbb5-2756768ad0df@github.com> Message-ID: On Mon, 12 Aug 2024 09:21:10 GMT, Emanuel Peter wrote: >> After [JDK-8325155](https://bugs.openjdk.org/browse/JDK-8325155), there is now only a single use of `SuperWord::longer_type_for_conversion`. >> The check there is rather useless, and we should just remove it for the sake of code complexity. 
>> >> **What did this check do?** >> It checks if the input-type of a node is larger than the output type, relevant in these cases: >> - `is_convert_opcode` >> - `is_scalar_op_that_returns_int_but_vector_op_returns_long` >> >> If there is a larger type, it checks if this larger type can even create vectors of at least size 2: >> - `Matcher::max_vector_size_auto_vectorization(longer_bt) < 2)` >> >> Of course if there cannot be vectors for this larger type, then we could avoid creating these vectors. >> >> **Why is it rather useless?** >> On most modern platforms, all types have vectors of at least length 2. So this would never fail on those platforms. >> And even if this check fails: we would pack the nodes and later reject those packs when we do the `implemented` checks. Hence, we do not wrongly vectorize anyway. The cost of removing this check is that on most platforms we do not have to do this check, and the few that it would fail for just have a bit more compilation time (but very very minimal increase). I think this is worth the reduction in complexity. But we could always revert this change if there are problems because of it. > > Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: > > - Merge branch 'master' into JDK-8335628-rm-sw-longer_type_for_conversion > - Merge branch 'master' into JDK-8335628-rm-sw-longer_type_for_conversion > - Merge branch 'master' into JDK-8335628-rm-sw-longer_type_for_conversion > - 8335628 Thank you for additional performance testing. ------------- Marked as reviewed by kvn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/20009#pullrequestreview-2233628212 From kvn at openjdk.org Mon Aug 12 17:23:35 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 12 Aug 2024 17:23:35 GMT Subject: Integrated: 8337702: Use new ForwardExceptionNode to call StubRoutines::forward_exception_entry() In-Reply-To: References: Message-ID: On Fri, 2 Aug 2024 06:00:22 GMT, Vladimir Kozlov wrote: > Currently C2 uses the `TailCall` node when it generates code to forward exceptions in C2 runtime stubs. > The `StubRoutines::forward_exception_entry()` address is passed as a constant and the method pointer is `NULL`: > [generateOptoStub.cpp#L258](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/generateOptoStub.cpp#L258) > > On the other hand, the TailCall mach node uses two registers as parameters, which are hardcoded in `Matcher`: [matcher.cpp#L828](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/matcher.cpp#L828) > As a result, we waste two registers to pass a constant and NULL. > > Also, an incorrect relocation is used for such a call because the address of the `forward_exception` stub is passed in a register in the mach node. When it is converted to `Address` for the `jmp` instruction, the default `external_word_type` relocation is used when `runtime_call_type` should be used. See the discussion in PR [JDK-8337396](https://github.com/openjdk/jdk/pull/20412) > > I added a new ideal node, `ForwardExceptionNode`, to solve these issues. It is similar to the `Rethrow` node (whose mach node definition I used as a template), but I kept it based on the `Return` node, similar to the `TailCall` node. > > Tested tier1-3,stress,xcomp This pull request has now been integrated.
Changeset: 99edb4a4 Author: Vladimir Kozlov URL: https://git.openjdk.org/jdk/commit/99edb4a45d6a4a871dec9c07b41b3ab66b89a4b6 Stats: 155 lines in 15 files changed: 130 ins; 2 del; 23 mod 8337702: Use new ForwardExceptionNode to call StubRoutines::forward_exception_entry() Reviewed-by: thartmann ------------- PR: https://git.openjdk.org/jdk/pull/20437 From kvn at openjdk.org Mon Aug 12 17:24:34 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 12 Aug 2024 17:24:34 GMT Subject: RFR: 8338158: Cleanup ShouldNotXXX uses in machnode.cpp In-Reply-To: <1r_IesfNVnksCD44wEapS7EN6CgvsJnKGl0PHT04MoQ=.ffc4d0fa-98b9-483a-9b96-f75927afa0be@github.com> References: <1r_IesfNVnksCD44wEapS7EN6CgvsJnKGl0PHT04MoQ=.ffc4d0fa-98b9-483a-9b96-f75927afa0be@github.com> Message-ID: On Mon, 12 Aug 2024 07:23:57 GMT, Kim Barrett wrote: > Please review this change to machnode.cpp to remove dead code following calls > to ShouldNotReachHere() and ShouldNotCallThis(). This takes advantage of the > availability and use of [[noreturn]] attributes on all supported platforms, > and makes all uses of those functions in this file consistent in this respect. > > As a side effect, this removes some -Wzero-as-null-pointer-constant warnings. > > Testing: mach5 tier1 Please activate GHA testing for this branch to make sure all builds passed in it. ------------- PR Review: https://git.openjdk.org/jdk/pull/20540#pullrequestreview-2233635108 From enikitin at openjdk.org Mon Aug 12 17:37:32 2024 From: enikitin at openjdk.org (Evgeny Nikitin) Date: Mon, 12 Aug 2024 17:37:32 GMT Subject: RFR: 8337102: JITTester: Fix breaks in static initialization blocks In-Reply-To: References: <8JfbPJYkBr00hKBUCngwNu5JM_Q-aPRnV2xsvUmhmTA=.2d5cda28-83ec-4236-9609-74391591f4d8@github.com> Message-ID: On Mon, 12 Aug 2024 14:58:51 GMT, Igor Veresov wrote: > What I mean is - loops with breaks are totally legal in static initializers. Yes. But I cannot imagine a SIB defined inside a loop. 
Probably, you could give me a counterexample, but AFAIR they are only allowed on class level and one cannot define a class in a loop. > So, the bug, I think, is actually that we generated a break outside of the loop. Checking whether we have an enclosing cycle (and therefore breaks are allowed), AFAIU, is done via passing those `canHaveBreaks` through children constructors. For example, the [BlockFactory#L59](https://github.com/openjdk/jdk/blob/89a15f1414f89d2dd32eac791e9155fcb4207e56/test/hotspot/jtreg/testlibrary/jittester/src/jdk/test/lib/jittester/factories/BlockFactory.java#L59): BlockFactory(TypeKlass klass, Type returnType, long complexityLimit, int statementLimit, int operatorLimit, int level, boolean subBlock, boolean canHaveBreaks, // <-- passed down the children's tree boolean canHaveContinues, boolean canHaveReturn, boolean canHaveThrows) Loops, including those created in SIBs, explicitly define `canHaveBreaks=true` for their children. For example, ForLoop does this: https://github.com/openjdk/jdk/blob/89a15f1414f89d2dd32eac791e9155fcb4207e56/test/hotspot/jtreg/testlibrary/jittester/src/jdk/test/lib/jittester/factories/ForFactory.java#L154 https://github.com/openjdk/jdk/blob/89a15f1414f89d2dd32eac791e9155fcb4207e56/test/hotspot/jtreg/testlibrary/jittester/src/jdk/test/lib/jittester/factories/ForFactory.java#L169 https://github.com/openjdk/jdk/blob/89a15f1414f89d2dd32eac791e9155fcb4207e56/test/hotspot/jtreg/testlibrary/jittester/src/jdk/test/lib/jittester/factories/ForFactory.java#L183 And neither SIB nor For/While/DoWhile factories do take `canHaveBreaks` in their constructors, as they control that themselves. 
Here's [WhileFactory](https://github.com/openjdk/jdk/blob/89a15f1414f89d2dd32eac791e9155fcb4207e56/test/hotspot/jtreg/testlibrary/jittester/src/jdk/test/lib/jittester/factories/WhileFactory.java#L51), for example: WhileFactory(TypeKlass ownerClass, Type returnType, long complexityLimit, int statementLimit, int operatorLimit, int level, boolean canHaveReturn) ------------- PR Comment: https://git.openjdk.org/jdk/pull/20310#issuecomment-2284569281 From kcr at openjdk.org Mon Aug 12 17:57:32 2024 From: kcr at openjdk.org (Kevin Rushforth) Date: Mon, 12 Aug 2024 17:57:32 GMT Subject: [jdk23] RFR: 8334715: [riscv] Mixed use of tab and whitespace in riscv.ad In-Reply-To: References: Message-ID: <3jqW_xPuEwfyoMb3We9lfGj1ZkjDnVLFoSZkk6N2u5Y=.cc1a8e52-8b93-405d-bb7c-01f278aebbaa@github.com> On Fri, 21 Jun 2024 15:56:35 GMT, SendaoYan wrote: > Hi all, > > This pull request contains a backport of commit [93d98027](https://github.com/openjdk/jdk/commit/93d98027649615afeeeb6a9510230d9655a74a8f) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by SendaoYan on 21 Jun 2024 and was reviewed by Christian Hagedorn and Amit Kumar. > > Thanks! JDK 23 is in the release candidate (RC) phase. This PR does not meet the RC criteria. Please close it. ------------- PR Comment: https://git.openjdk.org/jdk/pull/19834#issuecomment-2284603855 From dhanalla at openjdk.org Mon Aug 12 18:31:06 2024 From: dhanalla at openjdk.org (Dhamoder Nalla) Date: Mon, 12 Aug 2024 18:31:06 GMT Subject: RFR: 8315916: assert(C->live_nodes() <= C->max_node_limit()) failed: Live Node limit exceeded [v2] In-Reply-To: References: Message-ID: > In the debug build, the assert is triggered during the parsing (before Code_Gen). In the Release build, however, the compilation bails out at `Compile::check_node_count()` during the code generation phase and completes execution without any issues. 
> > When I commented out the assert(C->live_nodes() <= C->max_node_limit()), both debug and Release builds exhibited the same behavior: the compilation bails out, and execution completes without any issues. > > The assert statement is not essential, as it is causing unnecessary failures in the debug build. Dhamoder Nalla has updated the pull request incrementally with one additional commit since the last revision: add test case ------------- Changes: - all: https://git.openjdk.org/jdk/pull/20504/files - new: https://git.openjdk.org/jdk/pull/20504/files/ba37b4f5..7d3367f6 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20504&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20504&range=00-01 Stats: 3 lines in 1 file changed: 3 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/20504.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20504/head:pull/20504 PR: https://git.openjdk.org/jdk/pull/20504 From kbarrett at openjdk.org Mon Aug 12 18:39:32 2024 From: kbarrett at openjdk.org (Kim Barrett) Date: Mon, 12 Aug 2024 18:39:32 GMT Subject: RFR: 8300800: UB: Shift exponent 32 is too large for 32-bit type 'int' [v2] In-Reply-To: References: Message-ID: On Mon, 12 Aug 2024 08:19:44 GMT, Afshin Zafari wrote: >> The operand of shift which is a constant `0` changed to `unsigned long`. > > Afshin Zafari has updated the pull request incrementally with one additional commit since the last revision: > > Used UCONST64(0) instead of UL. Looks good. ------------- Marked as reviewed by kbarrett (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/20530#pullrequestreview-2233771418 From kbarrett at openjdk.org Mon Aug 12 18:39:33 2024 From: kbarrett at openjdk.org (Kim Barrett) Date: Mon, 12 Aug 2024 18:39:33 GMT Subject: RFR: 8300800: UB: Shift exponent 32 is too large for 32-bit type 'int' [v2] In-Reply-To: References: Message-ID: On Mon, 12 Aug 2024 12:47:42 GMT, Andrew Dinn wrote: >> src/hotspot/cpu/aarch64/immediate_aarch64.cpp line 298: >> >>> 296: uint64_t or_bits_sub = replicate(or_bit, 1, nbits); >>> 297: uint64_t and_bits_top = (and_bits_sub << nbits) | ones(nbits); >>> 298: uint64_t or_bits_top = (UCONST64(0) << nbits) | or_bits_sub; >> >> I focused on the UL suffix earlier, and didn't really think about what this is doing. Why are we shifting >> a zero value at all? This is equivalent to `uint64_t or_bits_top = or_bits_sub;`. > > I believe this question came up in an earlier thread and was answered by @theRealAph. The shift of zero is there to emphasise continuity of this case with other cases where a non-zero value is shifted, i.e., it serves to emphasize/document the connection between this implementation and the algorithm that it embodies. Thanks for the background. It still looks weird and I can't unsee it now. But a comment might be almost as intrusive to readability. So okay.
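The undefined behavior named in the bug title comes from shifting a 32-bit `int` by 32 or more bits. Widening the left operand to a 64-bit unsigned value makes the shift well defined for any count below 64. The sketch below uses standard `<cstdint>`'s `UINT64_C` in the role that HotSpot's `UCONST64` plays. `ones` is a hypothetical stand-in for the helper of the same name in immediate_aarch64.cpp, not its actual implementation.

```cpp
#include <cstdint>

// Hypothetical stand-in: a mask with the low n bits set, valid for 0 <= n <= 64.
// Shifting a 64-bit value by 64 is itself undefined, so n == 64 is special-cased.
uint64_t ones(unsigned n) {
  return n >= 64 ? ~UINT64_C(0) : (UINT64_C(1) << n) - 1;
}

// Mirrors the fixed line. With a plain int literal, `0 << nbits` is undefined
// for nbits >= 32. A 64-bit unsigned zero is well defined for nbits < 64, and
// the whole expression still evaluates to just or_bits_sub.
uint64_t or_bits_top(uint64_t or_bits_sub, unsigned nbits) {
  return (UINT64_C(0) << nbits) | or_bits_sub;
}

// The neighbouring case that the shifted zero is meant to parallel (assumes nbits < 64).
uint64_t and_bits_top(uint64_t and_bits_sub, unsigned nbits) {
  return (and_bits_sub << nbits) | ones(nbits);
}
```

As Kim points out, `or_bits_top` equals `or_bits_sub` for every valid `nbits`. The shifted zero only keeps the code visually parallel to `and_bits_top`.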
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20530#discussion_r1714219217 From dlong at openjdk.org Mon Aug 12 20:24:32 2024 From: dlong at openjdk.org (Dean Long) Date: Mon, 12 Aug 2024 20:24:32 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v2] In-Reply-To: <7O0X1427MXg_MrnMb9qcH64x3lG3feYt47iZhBujztk=.3fee1482-8bcf-42d9-bc4d-0e074f03222e@github.com> References: <7O0X1427MXg_MrnMb9qcH64x3lG3feYt47iZhBujztk=.3fee1482-8bcf-42d9-bc4d-0e074f03222e@github.com> Message-ID: <2J558IEDhou1AcFbhww1g1J3eN51lvvj7nWzS0v4RRs=.46e850b9-c335-42e0-a01b-acda28e7a44c@github.com> On Thu, 8 Aug 2024 09:29:17 GMT, Daniel Lundén wrote: >> If a method has a large number of parameters, we currently bail out from C2 compilation. >> >> ### Changeset >> >> Allowing C2 compilation of methods with a large number of parameters requires fundamental changes to the register mask data structure, used in many places in C2. In particular, register masks currently have a statically determined size and cannot represent arbitrary numbers of stack slots. This is needed if we want to compile methods with arbitrary numbers of parameters. Register mask operations are present in performance-sensitive parts of C2, which further complicates changes. >> >> Changes: >> - Add functionality to dynamically grow/extend register masks. I experimented with a number of design choices to achieve this. To keep the common case (normal number of method parameters) quick and also to avoid more intrusive changes to the current `RegMask` interface, I decided to leave the "base" statically allocated memory for masks unchanged and only use dynamically allocated memory in the rare cases where it is needed. >> - Generalize the "chunk"-logic from `PhaseChaitin::Select()` to allow arbitrary-sized chunks, and also move most of the logic into register mask methods to separate concerns and to make the `PhaseChaitin::Select()` code more readable.
>> - Remove all `can_represent` checks and bailouts. >> - Performance tuning. A particularly important change is the early-exit optimization in `RegMask::overlap`, used in the performance-sensitive method `PhaseChaitin::interfere_with_live`. >> - Add a new test case `TestManyMethodArguments.java` and extend an old test `TestNestedSynchronize.java`. >> >> ### Testing >> >> - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/10178060450) >> - `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. >> - Standard performance benchmarking. No observed conclusive overall performance degradation/improvement. >> - Specific benchmarking of C2 compilation time. The changes increase C2 compilation time by, approximately and on average, 1% for methods that could also be compiled before this changeset (see the figure below). The reason for the degradation is further checks required in performance-sensitive code (in particular `PhaseChaitin::remove_bound_register_from_interfering_live_ranges`). I have tried optimizing in various ways, but changes I found that lead to improvement also lead to less readable code (and are, in my opinion, no... > > Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: > > Remove leftover CHUNK_SIZE reference Don't we still need RegMask::can_represent() checks to make sure regmask size doesn't exceed what OptoReg and OptoRegPair can represent? Otherwise we will need checks every time we create a new OptoReg or OptoRegPair, or is there a better way?
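The register-mask design described in this PR keeps an unchanged, statically sized base for the common case. It allocates memory dynamically only when a method needs more stack slots, and it adds an early exit in `overlap`. That combination can be sketched as a simplified, hypothetical `GrowableMask`, which is not HotSpot's `RegMask`. The 4-word inline base and the `std::vector` extension are assumptions for illustration. The sketch also does not model the OptoReg/OptoRegPair representability limits raised in the review.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch, not HotSpot's RegMask: a fixed inline base covering the
// common case, plus a heap-allocated extension used only for very large masks.
class GrowableMask {
  static constexpr std::size_t kBaseWords = 4;  // assumption: 256 bits inline
  uint64_t base_[kBaseWords] = {};
  std::vector<uint64_t> ext_;  // stays empty unless a high bit is inserted

  std::size_t words() const { return kBaseWords + ext_.size(); }
  uint64_t word(std::size_t i) const {
    return i < kBaseWords ? base_[i] : ext_[i - kBaseWords];
  }

 public:
  void insert(std::size_t bit) {
    std::size_t w = bit / 64;
    if (w >= words()) ext_.resize(w + 1 - kBaseWords, 0);  // grow on demand
    uint64_t& slot = w < kBaseWords ? base_[w] : ext_[w - kBaseWords];
    slot |= UINT64_C(1) << (bit % 64);
  }

  bool member(std::size_t bit) const {
    std::size_t w = bit / 64;
    return w < words() && ((word(w) >> (bit % 64)) & 1) != 0;
  }

  // Returns on the first shared word. The shorter mask has no bits beyond its
  // own length, so only the common prefix of words can overlap.
  bool overlap(const GrowableMask& other) const {
    std::size_t n = std::min(words(), other.words());
    for (std::size_t i = 0; i < n; i++) {
      if (word(i) & other.word(i)) return true;
    }
    return false;
  }
};
```

Masks that only use low-numbered registers never touch the heap. A mask grows only when a bit beyond the inline base is inserted.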
------------- PR Comment: https://git.openjdk.org/jdk/pull/20404#issuecomment-2284840611 From psandoz at openjdk.org Mon Aug 12 22:06:30 2024 From: psandoz at openjdk.org (Paul Sandoz) Date: Mon, 12 Aug 2024 22:06:30 GMT Subject: RFR: 8338023: Support two vector selectFrom API In-Reply-To: <28KQHru1heR-YOVsRVo8Ffj_4D29IV8vD2tombvTHdI=.dba80ac3-9804-4074-ac0f-8acb9b042a08@github.com> References: <28KQHru1heR-YOVsRVo8Ffj_4D29IV8vD2tombvTHdI=.dba80ac3-9804-4074-ac0f-8acb9b042a08@github.com> Message-ID: On Thu, 8 Aug 2024 06:57:28 GMT, Jatin Bhateja wrote: > Hi All, > > As per the discussion on the panama-dev mailing list[1], the patch adds support for the following new two-vector permutation APIs. > > > Declaration:- > Vector.selectFrom(Vector v1, Vector v2) > > > Semantics:- > Using index values stored in the lanes of "this" vector, assemble the values stored in the first (v1) and second (v2) vector arguments. Thus, the first and second vectors serve as a table, whose elements are selected based on the index value vector. The API is applicable to all integral and floating-point types. The result of this operation is semantically equivalent to the expression v1.rearrange(this.toShuffle(), v2). Values held in index vector lanes must lie within the valid two-vector index range [0, 2*VLEN), else an IndexOutOfBoundsException is thrown. > > Summary of changes: > - Java side implementation of the new selectFrom API. > - C2 compiler IR and inline expander changes. > - In the absence of a direct two-vector permutation instruction in the target ISA, a lowering transformation dismantles the new IR into constituent IR supported by target platforms. > - Optimized x86 backend implementation for AVX512 and legacy targets. > - Functional tests covering the new API.
> > JMH micro included with this patch shows around a 10-15x gain over the existing rearrange API:
> Test System: Intel(R) Xeon(R) Platinum 8480+ [Sapphire Rapids Server]
>
> Benchmark (size) Mode Cnt Score Error Units
> SelectFromBenchmark.rearrangeFromByteVector 1024 thrpt 2 2041.762 ops/ms
> SelectFromBenchmark.rearrangeFromByteVector 2048 thrpt 2 1028.550 ops/ms
> SelectFromBenchmark.rearrangeFromIntVector 1024 thrpt 2 962.605 ops/ms
> SelectFromBenchmark.rearrangeFromIntVector 2048 thrpt 2 479.004 ops/ms
> SelectFromBenchmark.rearrangeFromLongVector 1024 thrpt 2 359.758 ops/ms
> SelectFromBenchmark.rearrangeFromLongVector 2048 thrpt 2 178.192 ops/ms
> SelectFromBenchmark.rearrangeFromShortVector 1024 thrpt 2 1463.459 ops/ms
> SelectFromBenchmark.rearrangeFromShortVector 2048 thrpt 2 727.556 ops/ms
> SelectFromBenchmark.selectFromByteVector 1024 thrpt 2 33254.830 ops/ms
> SelectFromBenchmark.selectFromByteVector 2048 thrpt 2 17313.174 ops/ms
> SelectFromBenchmark.selectFromIntVector 1024 thrpt 2 10756.804 ops/ms
> SelectFromBenchmark.selectFromIntVector 2048 thrpt 2 5398.2...
The results look promising. I can provide guidance on the specification, e.g., we can specify the behavior in terms of rearrange, with the addition of throwing on out-of-bounds indexes. Regarding the throwing of exceptions, some wider context will help to know where we are heading before we finalize the specification. I believe we are considering changing the default throwing behavior for out-of-bounds indexes to wrapping, thereby avoiding bounds checks. If that is the case, should we wait until that is done and then update, rather than submitting a CSR just yet? I see you created a specific intrinsic, which will avoid the cost of shuffle creation. Should we apply the same approach (in a subsequent PR) to the single-argument shuffle? Or perhaps, if we manage to optimize shuffles and change the default to wrapping, we don't require a specific intrinsic and can just defer to rearrange?
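The proposed two-vector `selectFrom` works as follows: lane `i` of the result takes element `idx[i]` from the concatenation of `v1` and `v2`, and every index must lie in `[0, 2*VLEN)`. A scalar C++ model makes this concrete. `select_from` is a hypothetical name, and the index lanes are plain `int`s rather than lanes of the data's element type. The exception mirrors the currently proposed throwing behavior, which the review suggests may change to wrapping.

```cpp
#include <array>
#include <cstddef>
#include <stdexcept>

// Scalar model of the proposed two-vector selectFrom (hypothetical name):
// result[i] = table[idx[i]], where the table is v1 followed by v2.
template <typename T, std::size_t VLEN>
std::array<T, VLEN> select_from(const std::array<int, VLEN>& idx,
                                const std::array<T, VLEN>& v1,
                                const std::array<T, VLEN>& v2) {
  const int vlen = static_cast<int>(VLEN);
  std::array<T, VLEN> result{};
  for (std::size_t i = 0; i < VLEN; i++) {
    const int j = idx[i];
    // Mirrors the proposed throwing behavior for indices outside [0, 2*VLEN).
    if (j < 0 || j >= 2 * vlen) throw std::out_of_range("index outside [0, 2*VLEN)");
    result[i] = j < vlen ? v1[j] : v2[j - vlen];
  }
  return result;
}
```

With VLEN = 4, index 5 selects `v2[1]` and index 7 selects `v2[3]`. For in-range indices, the PR states this is equivalent to `v1.rearrange(this.toShuffle(), v2)`.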
------------- PR Review: https://git.openjdk.org/jdk/pull/20508#pullrequestreview-2234095541 From psandoz at openjdk.org Mon Aug 12 22:36:48 2024 From: psandoz at openjdk.org (Paul Sandoz) Date: Mon, 12 Aug 2024 22:36:48 GMT Subject: RFR: 8338021: Support saturating vector operators in VectorAPI [v2] In-Reply-To: References: Message-ID: On Thu, 8 Aug 2024 17:20:06 GMT, Jatin Bhateja wrote: >> Hi All, >> >> As per the discussion on the panama-dev mailing list[1], the patch adds support for the following new vector operators. >> >> >> . SATURATING_UADD : Saturating unsigned addition. >> . SATURATING_ADD : Saturating signed addition. >> . SATURATING_USUB : Saturating unsigned subtraction. >> . SATURATING_SUB : Saturating signed subtraction. >> . UMAX : Unsigned max. >> . UMIN : Unsigned min. >> >> >> The new vector operators are applicable only to integral types, since their values wrap around in overflowing/underflowing scenarios after setting appropriate status flags. For floating-point types, as per the IEEE 754 spec, there are multiple schemes to handle underflow; one of them is gradual underflow, which transitions the value to the subnormal range. Similarly, overflow implicitly saturates the floating-point value to an infinite value. >> >> As the name suggests, these are saturating operations, i.e., the result of the computation is strictly capped by the lower and upper bounds of the result type and is not wrapped around in underflowing or overflowing scenarios. >> >> Summary of changes: >> - Java side implementation of the new vector operators. >> - Add new scalar saturating APIs for each of the above saturating vector operators in the corresponding primitive box classes; the fallback implementation of the vector operators is based on them. >> - C2 compiler IR and inline expander changes. >> - Optimized x86 backend implementation for the new vector operators and their predicated counterparts. >> - Extend the existing VectorAPI jtreg test suite to cover the new operations. >> >> Kindly review and share your feedback.
>> >> Best Regards, >> PS: Intrinsification and auto-vectorization of the new core-lib APIs will be addressed separately in a follow-up patch. >> >> [1] https://mail.openjdk.org/pipermail/panama-dev/2024-May/020408.html > > Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8338201 > - Removed redundant comment > - 8338021: Support saturating vector operators in VectorAPI Naming-wise, for the scalar methods I recommend the pattern `op{Saturating}{Unsigned}`, which fits better with naming patterns used elsewhere, where we tend to be literal. For the vector operations, we should refer to unsigned consistently with the unsigned compare operation names. Here we can be more terse, which makes me wonder if we should use `U` consistently for unsigned and `S` for saturating, e.g., `SUADD`, `UGT`, `UMAX`, etc. That then flows into the names used in `VectorSupport.java` and `vectorSupport.hpp`.
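The saturating semantics described in this PR clamp each result to the bounds of the lane type instead of wrapping. The sketch below illustrates this for 8-bit lanes. The function names are hypothetical and follow neither naming scheme under discussion. `umax` and `umin` take signed storage and compare the bit patterns as unsigned, which is roughly how unsigned operations would apply to Java's signed `byte` lanes.

```cpp
#include <cstdint>

// Signed saturating add/sub: widen to int so the exact result is known, then clamp.
int8_t sat_add(int8_t a, int8_t b) {
  int r = int(a) + int(b);
  return r > INT8_MAX ? INT8_MAX : r < INT8_MIN ? INT8_MIN : static_cast<int8_t>(r);
}

int8_t sat_sub(int8_t a, int8_t b) {
  int r = int(a) - int(b);
  return r > INT8_MAX ? INT8_MAX : r < INT8_MIN ? INT8_MIN : static_cast<int8_t>(r);
}

// Unsigned saturating add/sub: clamp to [0, UINT8_MAX] instead of wrapping.
uint8_t sat_uadd(uint8_t a, uint8_t b) {
  unsigned r = unsigned(a) + unsigned(b);
  return r > UINT8_MAX ? UINT8_MAX : static_cast<uint8_t>(r);
}

uint8_t sat_usub(uint8_t a, uint8_t b) {
  return a > b ? static_cast<uint8_t>(a - b) : 0;
}

// Unsigned max/min over signed storage: compare the bit patterns as unsigned.
int8_t umax(int8_t a, int8_t b) { return uint8_t(a) >= uint8_t(b) ? a : b; }
int8_t umin(int8_t a, int8_t b) { return uint8_t(a) <= uint8_t(b) ? a : b; }
```

For example, `umax(-1, 1)` returns `-1`, because the bit pattern `0xFF` is 255 when read as unsigned.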
Currently, StaticConstructorDefinitionFactory allows them, causing non-compilable constructions like this: > > > class Test_0 { > static { > if (true) { > break; // <- compilation error here > } > } > } > > > It seems like previously an attempt have been made to resolve this by disabling SIBs whatsoever. > > Currently, out of 100 generated tests we have 2-3 compilation errors. > Allowing SIBs raises this to 80 out of 100 tests failing due to erroneous 'break' blocks. > Disabling breaks in StaticConstructorDefinition gives us SIBs, and returns failure rate to the same 2-3%. > Disabling breaks in StaticConstructorDefinition doesn't prevent breaks from happening, as loop factories (`ForFactory`, `WhileFactory`, etc.) explicitly allow for breaks in their descendant trees. > > Testing: > 1. 200-300 generations in various setups to get the numbers mentioned above; > 2. I checked manually that breaks do not disappear from code, > 3. ... and appear in loops' (for, while, do-while) descendants. Marked as reviewed by iveresov (Reviewer). Got it, thanks for the explanation. ------------- PR Review: https://git.openjdk.org/jdk/pull/20310#pullrequestreview-2234254656 PR Comment: https://git.openjdk.org/jdk/pull/20310#issuecomment-2285141949 From syan at openjdk.org Tue Aug 13 01:09:02 2024 From: syan at openjdk.org (SendaoYan) Date: Tue, 13 Aug 2024 01:09:02 GMT Subject: [jdk23] RFR: 8334715: [riscv] Mixed use of tab and whitespace in riscv.ad In-Reply-To: <3jqW_xPuEwfyoMb3We9lfGj1ZkjDnVLFoSZkk6N2u5Y=.cc1a8e52-8b93-405d-bb7c-01f278aebbaa@github.com> References: <3jqW_xPuEwfyoMb3We9lfGj1ZkjDnVLFoSZkk6N2u5Y=.cc1a8e52-8b93-405d-bb7c-01f278aebbaa@github.com> Message-ID: On Mon, 12 Aug 2024 17:55:11 GMT, Kevin Rushforth wrote: > JDK 23 is in the release candidate (RC) phase. This PR does not meet the RC criteria. Please close it. 
Okay ------------- PR Comment: https://git.openjdk.org/jdk/pull/19834#issuecomment-2285155765 From syan at openjdk.org Tue Aug 13 01:09:02 2024 From: syan at openjdk.org (SendaoYan) Date: Tue, 13 Aug 2024 01:09:02 GMT Subject: [jdk23] Withdrawn: 8334715: [riscv] Mixed use of tab and whitespace in riscv.ad In-Reply-To: References: Message-ID: <-ZQYYHEaj3Whlb4EClWM8hDl3MU8fkJQEYc97R_A9w8=.243e06c3-6576-4ba0-ab18-2a05eabdc4f6@github.com> On Fri, 21 Jun 2024 15:56:35 GMT, SendaoYan wrote: > Hi all, > > This pull request contains a backport of commit [93d98027](https://github.com/openjdk/jdk/commit/93d98027649615afeeeb6a9510230d9655a74a8f) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by SendaoYan on 21 Jun 2024 and was reviewed by Christian Hagedorn and Amit Kumar. > > Thanks! This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/19834 From dlong at openjdk.org Tue Aug 13 03:22:47 2024 From: dlong at openjdk.org (Dean Long) Date: Tue, 13 Aug 2024 03:22:47 GMT Subject: RFR: 8338158: Cleanup ShouldNotXXX uses in machnode.cpp In-Reply-To: <1r_IesfNVnksCD44wEapS7EN6CgvsJnKGl0PHT04MoQ=.ffc4d0fa-98b9-483a-9b96-f75927afa0be@github.com> References: <1r_IesfNVnksCD44wEapS7EN6CgvsJnKGl0PHT04MoQ=.ffc4d0fa-98b9-483a-9b96-f75927afa0be@github.com> Message-ID: On Mon, 12 Aug 2024 07:23:57 GMT, Kim Barrett wrote: > Please review this change to machnode.cpp to remove dead code following calls > to ShouldNotReachHere() and ShouldNotCallThis(). This takes advantage of the > availability and use of [[noreturn]] attributes on all supported platforms, > and makes all uses of those functions in this file consistent in this respect. > > As a side effect, this removes some -Wzero-as-null-pointer-constant warnings. > > Testing: mach5 tier1 Marked as reviewed by dlong (Reviewer).
------------- PR Review: https://git.openjdk.org/jdk/pull/20540#pullrequestreview-2234423646 From duke at openjdk.org Tue Aug 13 04:43:48 2024 From: duke at openjdk.org (duke) Date: Tue, 13 Aug 2024 04:43:48 GMT Subject: RFR: 8337102: JITTester: Fix breaks in static initialization blocks In-Reply-To: <8JfbPJYkBr00hKBUCngwNu5JM_Q-aPRnV2xsvUmhmTA=.2d5cda28-83ec-4236-9609-74391591f4d8@github.com> References: <8JfbPJYkBr00hKBUCngwNu5JM_Q-aPRnV2xsvUmhmTA=.2d5cda28-83ec-4236-9609-74391591f4d8@github.com> Message-ID: On Wed, 24 Jul 2024 10:28:54 GMT, Evgeny Nikitin wrote: > Static initialisation blocks (SIBs) should not have `break`s in their code, as well as their descendants. Currently, StaticConstructorDefinitionFactory allows them, causing non-compilable constructions like this: > > > class Test_0 { > static { > if (true) { > break; // <- compilation error here > } > } > } > > > It seems like previously an attempt has been made to resolve this by disabling SIBs altogether. > > Currently, out of 100 generated tests we have 2-3 compilation errors. > Allowing SIBs raises this to 80 out of 100 tests failing due to erroneous 'break' statements. > Disabling breaks in StaticConstructorDefinition gives us SIBs, and returns the failure rate to the same 2-3%. > Disabling breaks in StaticConstructorDefinition doesn't prevent breaks from happening, as loop factories (`ForFactory`, `WhileFactory`, etc.) explicitly allow for breaks in their descendant trees. > > Testing: > 1. 200-300 generations in various setups to get the numbers mentioned above; > 2. I checked manually that breaks do not disappear from code, > 3. ... and appear in loops' (for, while, do-while) descendants. @lepestock Your change (at version ad8b42b554152cc2af8790c06a30cca14e44be23) is now ready to be sponsored by a Committer.
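Evgeny's explanation describes how `canHaveBreaks` is threaded through the factories. Loops turn it on for their bodies, a static initialization block turns it off, and every other statement inherits it from its parent. A toy generator can illustrate the pattern. This deterministic C++ sketch is hypothetical: `Kind`, `Ctx`, `gen` and `gen_body` are not part of JITTester's Java API.

```cpp
#include <string>

// Hypothetical statement kinds for a toy generator (not JITTester's Java API).
enum class Kind { If, While, StaticInit };

// The flag each factory passes down to the children it creates.
struct Ctx { bool can_have_breaks; };

std::string gen(Kind k, Ctx ctx, int depth);

// Body of a compound statement: nest an `if` while depth remains, then emit
// `break;` only if the inherited context allows it (an empty statement otherwise).
std::string gen_body(Ctx ctx, int depth) {
  if (depth > 0) return gen(Kind::If, ctx, depth - 1);
  return ctx.can_have_breaks ? "break;" : ";";
}

std::string gen(Kind k, Ctx ctx, int depth) {
  switch (k) {
    case Kind::StaticInit:  // a SIB never allows breaks for its own body
      return "static { " + gen_body(Ctx{false}, depth) + " }";
    case Kind::While:       // loops re-enable breaks for everything below them
      return "while (true) { " + gen_body(Ctx{true}, depth) + " }";
    case Kind::If:          // other statements inherit the flag unchanged
      return "if (true) { " + gen_body(ctx, depth) + " }";
  }
  return "";
}
```

The same nested `if` emits an empty statement under the static block but may emit `break;` under a loop. A loop generated inside a SIB would therefore re-enable breaks for its own body.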
------------- PR Comment: https://git.openjdk.org/jdk/pull/20310#issuecomment-2285327822 From epeter at openjdk.org Tue Aug 13 05:55:57 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 13 Aug 2024 05:55:57 GMT Subject: RFR: 8335628: C2 SuperWord: cleanup: remove SuperWord::longer_type_for_conversion [v4] In-Reply-To: References: <9Wq_S7WPbQa5JP4G3dqR8fqW8lfLZufZ5kYGs7QKh2o=.ab980543-4c6e-4c00-bbb5-2756768ad0df@github.com> Message-ID: On Mon, 12 Aug 2024 17:17:32 GMT, Vladimir Kozlov wrote: >> Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: >> >> - Merge branch 'master' into JDK-8335628-rm-sw-longer_type_for_conversion >> - Merge branch 'master' into JDK-8335628-rm-sw-longer_type_for_conversion >> - Merge branch 'master' into JDK-8335628-rm-sw-longer_type_for_conversion >> - 8335628 > > Thank you for additional performance testing. Thanks @vnkozlov @chhagedorn for the reviews. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20009#issuecomment-2285389004 From epeter at openjdk.org Tue Aug 13 05:56:03 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 13 Aug 2024 05:56:03 GMT Subject: RFR: 8338124: C2 SuperWord: MulAddS2I input permutation still partially broken after JDK-8333840 In-Reply-To: References: Message-ID: On Mon, 12 Aug 2024 17:18:20 GMT, Vladimir Kozlov wrote: >> [JDK-8333840](https://bugs.openjdk.org/browse/JDK-8333840) attempted to fix some input permutation cases. Sadly, there is a typo which means that there can still be some broken cases that produce wrong results. >> >> `MulAddS2I` is a special case, where we have to verify that the 4 inputs for all members of the pack are in the same permutation. I only started the checking on the second element. 
This worked on all my previous examples, but does not work on the added regression test. We must, of course, start the checking on the first element. > > Good. Thanks for the reviews @vnkozlov @chhagedorn @TobiHartmann ! ------------- PR Comment: https://git.openjdk.org/jdk/pull/20539#issuecomment-2285389821 From epeter at openjdk.org Tue Aug 13 05:55:57 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 13 Aug 2024 05:55:57 GMT Subject: Integrated: 8335628: C2 SuperWord: cleanup: remove SuperWord::longer_type_for_conversion In-Reply-To: <9Wq_S7WPbQa5JP4G3dqR8fqW8lfLZufZ5kYGs7QKh2o=.ab980543-4c6e-4c00-bbb5-2756768ad0df@github.com> References: <9Wq_S7WPbQa5JP4G3dqR8fqW8lfLZufZ5kYGs7QKh2o=.ab980543-4c6e-4c00-bbb5-2756768ad0df@github.com> Message-ID: On Wed, 3 Jul 2024 15:38:06 GMT, Emanuel Peter wrote: > After [JDK-8325155](https://bugs.openjdk.org/browse/JDK-8325155), there is now only a single use of `SuperWord::longer_type_for_conversion`. > The check there is rather useless, and we should just remove it for the sake of code complexity. > > **What did this check do?** > It checks if the input-type of a node is larger than the output type, relevant in these cases: > - `is_convert_opcode` > - `is_scalar_op_that_returns_int_but_vector_op_returns_long` > > If there is a larger type, it checks if this larger type can even create vectors of at least size 2: > - `Matcher::max_vector_size_auto_vectorization(longer_bt) < 2)` > > Of course if there cannot be vectors for this larger type, then we could avoid creating these vectors. > > **Why is it rather useless?** > On most modern platforms, all types have vectors of at least length 2. So this would never fail on those platforms. > And even if this check fails: we would pack the nodes and later reject those packs when we do the `implemented` checks. Hence, we do not wrongly vectorize anyway. 
The cost of removing this check is that on most platforms we do not have to do this check, and the few that it would fail for just have a bit more compilation time (but very very minimal increase). I think this is worth the reduction in complexity. But we could always revert this change if there are problems because of it. This pull request has now been integrated. Changeset: 73ddb7de Author: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/73ddb7deab26c526337ec6e7cd5f528f698a552c Stats: 27 lines in 2 files changed: 0 ins; 26 del; 1 mod 8335628: C2 SuperWord: cleanup: remove SuperWord::longer_type_for_conversion Reviewed-by: chagedorn, kvn ------------- PR: https://git.openjdk.org/jdk/pull/20009 From kbarrett at openjdk.org Tue Aug 13 05:56:16 2024 From: kbarrett at openjdk.org (Kim Barrett) Date: Tue, 13 Aug 2024 05:56:16 GMT Subject: RFR: 8338156: Fix -Wzero-as-null-pointer-constant warnings in jvmciCompilerToVM.cpp [v2] In-Reply-To: References: Message-ID: On Mon, 12 Aug 2024 07:38:12 GMT, Thomas Schatzl wrote: >> Kim Barrett has updated the pull request incrementally with one additional commit since the last revision: >> >> fix indent in jvmciCompilerToVM.cpp > > src/hotspot/share/jvmci/jvmciCompilerToVM.cpp line 553: > >> 551: if (!klass->is_interface()) { >> 552: THROW_MSG_NULL(vmSymbols::java_lang_IllegalArgumentException(), >> 553: err_msg("Expected interface type, got %s", klass->external_name())); > > Unlike in the other cases, the indentation has not been updated. Looks good otherwise. In other cases I was maintaining pre-existing alignment. Here it was already formatted differently, and I thought this was a file that used this indentation style (not that I like it). But looking around, I'm not finding other uses of this style, so I've aligned the arguments. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20538#discussion_r1714703715 From kbarrett at openjdk.org Tue Aug 13 05:56:16 2024 From: kbarrett at openjdk.org (Kim Barrett) Date: Tue, 13 Aug 2024 05:56:16 GMT Subject: RFR: 8338156: Fix -Wzero-as-null-pointer-constant warnings in jvmciCompilerToVM.cpp [v2] In-Reply-To: References: Message-ID: > Please review this change to remove -Wzero-as-null-pointer-constant warnings > in jvmciCompilerToVM.cpp. We're changing uses of _0 suffixed exception macros > to instead use the corresponding _NULL suffixed macros where appropriate. > > Testing: mach5 tier1 Kim Barrett has updated the pull request incrementally with one additional commit since the last revision: fix indent in jvmciCompilerToVM.cpp ------------- Changes: - all: https://git.openjdk.org/jdk/pull/20538/files - new: https://git.openjdk.org/jdk/pull/20538/files/fdf8f929..a97ee392 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20538&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20538&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/20538.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20538/head:pull/20538 PR: https://git.openjdk.org/jdk/pull/20538 From epeter at openjdk.org Tue Aug 13 05:56:04 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 13 Aug 2024 05:56:04 GMT Subject: Integrated: 8338124: C2 SuperWord: MulAddS2I input permutation still partially broken after JDK-8333840 In-Reply-To: References: Message-ID: On Mon, 12 Aug 2024 07:17:22 GMT, Emanuel Peter wrote: > [JDK-8333840](https://bugs.openjdk.org/browse/JDK-8333840) attempted to fix some input permutation cases. Sadly, there is a typo which means that there can still be some broken cases that produce wrong results. > > `MulAddS2I` is a special case, where we have to verify that the 4 inputs for all members of the pack are in the same permutation. 
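The MulAddS2I requirement (every member of the pack must see its four inputs in the same permutation) can be illustrated with a hypothetical C++ sketch. `Member`, `input_order`, and `all_members_match` are invented for illustration and are not C2's node or pack classes. The sketch assumes the expected permutation comes from somewhere other than the first pack member. The `start` parameter exists only to demonstrate the failure mode: if the loop begins at index 1, a mismatch in the first member goes undetected.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical pack member: records the order in which its four
// MulAddS2I inputs were matched, e.g. {0, 1, 2, 3} or a swapped variant.
struct Member {
  int input_order[4];
};

bool same_input_order(const Member& a, const Member& b) {
  for (int k = 0; k < 4; k++) {
    if (a.input_order[k] != b.input_order[k]) return false;
  }
  return true;
}

// Returns true only if every member from index 'start' onward matches the
// expected permutation. The correct check uses start == 0; start == 1
// reproduces the bug of skipping the first member.
bool all_members_match(const std::vector<Member>& pack, const Member& expected,
                       std::size_t start = 0) {
  for (std::size_t i = start; i < pack.size(); i++) {
    if (!same_input_order(pack[i], expected)) return false;
  }
  return true;
}
```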
I only started the checking on the second element. This worked on all my previous examples, but does not work on the added regression test. We must, of course, start the checking on the first element. This pull request has now been integrated. Changeset: c27a8c8c Author: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/c27a8c8c8b867e6812b905f6154762802a498dbd Stats: 18 lines in 2 files changed: 16 ins; 0 del; 2 mod 8338124: C2 SuperWord: MulAddS2I input permutation still partially broken after JDK-8333840 Reviewed-by: chagedorn, thartmann, kvn ------------- PR: https://git.openjdk.org/jdk/pull/20539 From kbarrett at openjdk.org Tue Aug 13 06:33:50 2024 From: kbarrett at openjdk.org (Kim Barrett) Date: Tue, 13 Aug 2024 06:33:50 GMT Subject: RFR: 8338158: Cleanup ShouldNotXXX uses in machnode.cpp In-Reply-To: References: <1r_IesfNVnksCD44wEapS7EN6CgvsJnKGl0PHT04MoQ=.ffc4d0fa-98b9-483a-9b96-f75927afa0be@github.com> Message-ID: <3XUVo3pib1RhzrqA-76Y8xud2h_2KNjqTUrVrOn3hpQ=.9ef57efe-024b-4762-b893-b40ade5987d0@github.com> On Mon, 12 Aug 2024 17:21:33 GMT, Vladimir Kozlov wrote: > Please activate GHA testing for this branch to make sure all builds passed in it. https://github.com/kimbarrett/openjdk-jdk/actions/runs/10363346918/job/28687676623 All green. Although there seems to be some configuration problem with linux-riscv64, yet still reports success?! ------------- PR Comment: https://git.openjdk.org/jdk/pull/20540#issuecomment-2285437770 From kbarrett at openjdk.org Tue Aug 13 06:37:49 2024 From: kbarrett at openjdk.org (Kim Barrett) Date: Tue, 13 Aug 2024 06:37:49 GMT Subject: RFR: 8338158: Cleanup ShouldNotXXX uses in machnode.cpp In-Reply-To: References: <1r_IesfNVnksCD44wEapS7EN6CgvsJnKGl0PHT04MoQ=.ffc4d0fa-98b9-483a-9b96-f75927afa0be@github.com> Message-ID: <1NRzN6bd1tAkoSLffSJBxO20fYTB2gSUEYXzKRHIPXI=.2bc04ab6-258c-499c-b8d8-4fb95e24b1f1@github.com> On Mon, 12 Aug 2024 07:52:22 GMT, Christian Hagedorn wrote: > Looks good! 
Just a general question, will you update `fatal()` uses in a separate RFE? I'm not planning to. There are lots and lots of these sorts of things for ShouldNotXXX, fatal(), &etc. I only did these because I was touching some to eliminate -Wzero-as-null-pointer-constant warnings, and there were few enough it was still a relatively small change. Also there was already inconsistency of usage in this one file. I suggest just similarly cleaning these up as code is being touched for other reasons. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20540#issuecomment-2285443478 From chagedorn at openjdk.org Tue Aug 13 06:57:48 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 13 Aug 2024 06:57:48 GMT Subject: RFR: 8338158: Cleanup ShouldNotXXX uses in machnode.cpp In-Reply-To: <1NRzN6bd1tAkoSLffSJBxO20fYTB2gSUEYXzKRHIPXI=.2bc04ab6-258c-499c-b8d8-4fb95e24b1f1@github.com> References: <1r_IesfNVnksCD44wEapS7EN6CgvsJnKGl0PHT04MoQ=.ffc4d0fa-98b9-483a-9b96-f75927afa0be@github.com> <1NRzN6bd1tAkoSLffSJBxO20fYTB2gSUEYXzKRHIPXI=.2bc04ab6-258c-499c-b8d8-4fb95e24b1f1@github.com> Message-ID: On Tue, 13 Aug 2024 06:34:50 GMT, Kim Barrett wrote: > > Looks good! Just a general question, will you update `fatal()` uses in a separate RFE? > > I'm not planning to. There are lots and lots of these sorts of things for ShouldNotXXX, fatal(), &etc. I only did these because I was touching some to eliminate -Wzero-as-null-pointer-constant warnings, and there were few enough it was still a relatively small change. Also there was already inconsistency of usage in this one file. I suggest just similarly cleaning these up as code is being touched for other reasons. Sounds good, thanks for sharing some details!
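The 8338158 cleanup removes code that follows calls to ShouldNotReachHere() and ShouldNotCallThis(), relying on those functions being declared `[[noreturn]]`. Below is a hypothetical standalone C++ sketch showing why that trailing code becomes unnecessary. `should_not_reach_here` only mimics the role of HotSpot's ShouldNotReachHere() and is not its implementation.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Hypothetical stand-in for ShouldNotReachHere(): reports and aborts.
// [[noreturn]] tells the compiler that control never continues after a call.
[[noreturn]] void should_not_reach_here(const char* where) {
  std::fprintf(stderr, "should not reach here: %s\n", where);
  std::abort();
}

enum class Shape { Circle, Square };

int corner_count(Shape s) {
  switch (s) {
    case Shape::Circle: return 0;
    case Shape::Square: return 4;
  }
  should_not_reach_here("corner_count");
  // No dead "return 0;" is needed here: because the call above is
  // [[noreturn]], the compiler does not warn about falling off the end.
}
```

Without the attribute, compilers may warn that control reaches the end of a non-void function, which is why such dead returns tended to accumulate.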
------------- PR Comment: https://git.openjdk.org/jdk/pull/20540#issuecomment-2285479162 From fgao at openjdk.org Tue Aug 13 08:39:01 2024 From: fgao at openjdk.org (Fei Gao) Date: Tue, 13 Aug 2024 08:39:01 GMT Subject: RFR: 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug" [v5] In-Reply-To: References: <16J-lJ2AceGTVcRWBcP15yKcwO-1IA1XsngyOuNjf7k=.0776f081-ae2c-4279-87cf-d909806c2bc4@github.com> Message-ID: On Mon, 12 Aug 2024 09:10:42 GMT, Andrew Haley wrote: > Looks good. That's a nice simplification and cleanup. @theRealAph thanks for your approval! ------------- PR Comment: https://git.openjdk.org/jdk/pull/16991#issuecomment-2285682132 From dlunden at openjdk.org Tue Aug 13 10:19:49 2024 From: dlunden at openjdk.org (Daniel Lundén) Date: Tue, 13 Aug 2024 10:19:49 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v2] In-Reply-To: References: <7O0X1427MXg_MrnMb9qcH64x3lG3feYt47iZhBujztk=.3fee1482-8bcf-42d9-bc4d-0e074f03222e@github.com> Message-ID: On Mon, 12 Aug 2024 13:16:11 GMT, Roberto Castañeda Lozano wrote: >> Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove leftover CHUNK_SIZE reference > > src/hotspot/share/opto/postaloc.cpp line 765: > >> 763: // in both registers. >> 764: OptoReg::Name nreg_lo = OptoReg::add(nreg,-1); >> 765: if( !lrgs(lidx).mask().Member(nreg_lo) ) { // Nearly always adjacent > > Is the removal of `// Either a spill slot, or` intentional? Unintentional, and I have to look closer at this. I suspect the "Either a spill slot" comment refers to if the register is larger than or equal to `LRG::SPILL_REG`, which I believe is implied by `!RegMask::can_represent(nreg_lo)` at this stage of `PhaseChaitin`. We should probably replace `RegMask::can_represent(nreg_lo)` with an explicit check `nreg_lo < LRG::SPILL_REG`.
> src/hotspot/share/opto/regmask.hpp line 82: > >> 80: // In rare situations (e.g., "more than 90+ parameters on Intel"), we need to >> 81: // extend the register mask with dynamically allocated memory. >> 82: uintptr_t* _RM_UP_EXT = nullptr; > > Have you considered using a growable array (`src/hotspot/share/utilities/growableArray.hpp`) for this part? No, I'll have a look and see if it makes sense to use growable arrays in this case. Thanks! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r1715045904 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r1715046316 From dlunden at openjdk.org Tue Aug 13 10:25:50 2024 From: dlunden at openjdk.org (Daniel Lundén) Date: Tue, 13 Aug 2024 10:25:50 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v2] In-Reply-To: <2J558IEDhou1AcFbhww1g1J3eN51lvvj7nWzS0v4RRs=.46e850b9-c335-42e0-a01b-acda28e7a44c@github.com> References: <7O0X1427MXg_MrnMb9qcH64x3lG3feYt47iZhBujztk=.3fee1482-8bcf-42d9-bc4d-0e074f03222e@github.com> <2J558IEDhou1AcFbhww1g1J3eN51lvvj7nWzS0v4RRs=.46e850b9-c335-42e0-a01b-acda28e7a44c@github.com> Message-ID: On Mon, 12 Aug 2024 20:22:14 GMT, Dean Long wrote: >> Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove leftover CHUNK_SIZE reference > > Don't we still need RegMask::can_represent() checks to make sure regmask size doesn't exceed what OptoReg and OptoRegPair can represent? Otherwise we will need checks every time we create a new OptoReg or OptoRegPair, or is there a better way? @dean-long > Does C2 have a maximum frame size? I'll investigate, unless someone already has the answer. > Don't we still need RegMask::can_represent() checks to make sure regmask size doesn't exceed what OptoReg and OptoRegPair can represent? Otherwise we will need checks every time we create a new OptoReg or OptoRegPair, or is there a better way?
Yes, we should add checks somewhere. My initial thought was to have them whenever growing or offsetting register masks (operations that change what a register mask can represent), and bail out upon failing a check. But, then we may have bailouts whenever, e.g., inserting into register masks, which is probably difficult to handle. Also, even before my changes, `OptoReg`s may overflow in `PhaseChaitin::Select` because, even though register masks have a limit, there is no limit on how much `chunk` can grow. That is, one could also argue that `OptoReg` overflow is a separate issue. I will have to think a bit more about this. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20404#issuecomment-2285904155 From jwaters at openjdk.org Tue Aug 13 12:44:49 2024 From: jwaters at openjdk.org (Julian Waters) Date: Tue, 13 Aug 2024 12:44:49 GMT Subject: RFR: 8338156: Fix -Wzero-as-null-pointer-constant warnings in jvmciCompilerToVM.cpp [v2] In-Reply-To: References: Message-ID: On Tue, 13 Aug 2024 05:56:16 GMT, Kim Barrett wrote: >> Please review this change to remove -Wzero-as-null-pointer-constant warnings >> in jvmciCompilerToVM.cpp. We're changing uses of _0 suffixed exception macros >> to instead use the corresponding _NULL suffixed macros where appropriate. >> >> Testing: mach5 tier1 > > Kim Barrett has updated the pull request incrementally with one additional commit since the last revision: > > fix indent in jvmciCompilerToVM.cpp Marked as reviewed by jwaters (Committer). 
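The register-mask thread raises three design points: a fixed-size mask extended with dynamically allocated memory through `_RM_UP_EXT`, the suggestion to use a growable array for that extension, and the question of where to enforce an upper bound so that OptoReg values stay representable. A small hypothetical sketch can illustrate how these fit together. `SketchRegMask` and its members are invented for illustration, and `std::vector` stands in for HotSpot's `GrowableArray`. The sizes are arbitrary, and the bound check only mimics the role of `RegMask::can_represent()`; it is not that function's definition in the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical register mask: an inline bitset for the common case plus an
// on-demand extension for rare methods with very many arguments.
class SketchRegMask {
  static constexpr int kInlineWords = 4;   // 4 * 64 = 256 registers inline
  static constexpr int kBitsPerWord = 64;

  uint64_t inline_words_[kInlineWords] = {};
  std::vector<uint64_t> ext_words_;        // grown only when needed
  int max_reg_;                            // hard upper bound on register numbers

  // Returns the word holding bit 'reg', growing the extension if needed.
  uint64_t& word_for_insert(int reg) {
    int w = reg / kBitsPerWord;
    if (w < kInlineWords) return inline_words_[w];
    std::size_t e = static_cast<std::size_t>(w - kInlineWords);
    if (e >= ext_words_.size()) ext_words_.resize(e + 1, 0);
    return ext_words_[e];
  }

 public:
  explicit SketchRegMask(int max_reg) : max_reg_(max_reg) {}

  // can_represent()-style bound: out-of-range register numbers are
  // rejected instead of growing the mask without limit.
  bool can_represent(int reg) const { return reg >= 0 && reg < max_reg_; }

  // Returns false (the caller would bail out) if 'reg' is out of range.
  bool insert(int reg) {
    if (!can_represent(reg)) return false;
    word_for_insert(reg) |= uint64_t(1) << (reg % kBitsPerWord);
    return true;
  }

  bool member(int reg) const {
    if (!can_represent(reg)) return false;
    int w = reg / kBitsPerWord;
    uint64_t word = 0;
    if (w < kInlineWords) {
      word = inline_words_[w];
    } else {
      std::size_t e = static_cast<std::size_t>(w - kInlineWords);
      if (e >= ext_words_.size()) return false;  // never grown: bit unset
      word = ext_words_[e];
    }
    return ((word >> (reg % kBitsPerWord)) & 1) != 0;
  }

  std::size_t extension_words() const { return ext_words_.size(); }
};
```

Keeping the bound check in `insert` means out-of-range registers are refused at the point where the mask would otherwise grow. The trade-off raised in the thread is that every insertion site then needs a way to handle the refusal.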
------------- PR Review: https://git.openjdk.org/jdk/pull/20538#pullrequestreview-2235448924 From rcastanedalo at openjdk.org Tue Aug 13 14:23:59 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Tue, 13 Aug 2024 14:23:59 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v6] In-Reply-To: References: Message-ID: On Mon, 12 Aug 2024 14:03:53 GMT, Martin Doerr wrote: >> Roberto Castañeda Lozano has updated the pull request incrementally with one additional commit since the last revision: >> >> Further motivate the choice of internal store address materialization in x64 > > src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp line 203: > >> 201: // Do we need to load the previous value? >> 202: if (obj != noreg) { >> 203: __ load_heap_oop(pre_val, Address(obj, 0), noreg, noreg, AS_RAW); > > How do we handle implicit null checks for which `obj` is null? Note that we may expect the store instruction to trigger SIGSEGV. Does it work correctly if we trigger the SIGSEGV, here in the pre barrier? Good question! I have checked (no pun intended) and it turns out C2 never uses stores (or other memory access operations) with late-expanded barriers to perform implicit null checks. This is accidental, due to the fact that all these memory operations use MachTemp nodes in the C2 code generation stage (to reserve registers for their ADL TEMP operands). C2's implicit null check analysis requires that all inputs of a candidate memory operation dominate the null check [1], which fails if the operation uses MachTemp nodes, since these are always placed in the same basic block [2]. Note that this optimization triggers very rarely, if at all, for memory operations in the current early barrier expansion model, since the additional control flow of the barrier code obfuscates the analysis.
For late barrier expansion, the analysis could be easily extended to recognize and hoist MachTemp nodes together with their user memory operation that is a candidate to implement the implicit null check [3], but that would require extending the barrier assembly emission step to populate the implicit null exception table correctly. Since this seems non-trivial, and would also affect other garbage collectors (ZGC), I suggest to simply assert for now that we do not generate implicit null checks for memory operations with barrier data (as in [4]), and leave full support for implicit null checks for these G1 and ZGC operations to a future RFE. What do you think? [1] https://github.com/robcasloz/jdk/blob/d21104ca8ff1eef88a9d87fb78dda3009414b5b8/src/hotspot/share/opto/lcm.cpp#L310-L328 [2] https://github.com/robcasloz/jdk/blob/d21104ca8ff1eef88a9d87fb78dda3009414b5b8/src/hotspot/share/opto/gcm.cpp#L1397-L1404 [3] https://github.com/robcasloz/jdk/commit/e0ab1e418b81c0acddff2190ca57b3335b5214ba [4] https://github.com/robcasloz/jdk/commit/e0ab1e418b81c0acddff2190ca57b3335b5214ba#diff-554fddca91406a67dc0f8faee12dc30c709181685a0add7f4ba9ae5ace68f192R2031-R2032 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1715387255 From kbarrett at openjdk.org Tue Aug 13 15:42:48 2024 From: kbarrett at openjdk.org (Kim Barrett) Date: Tue, 13 Aug 2024 15:42:48 GMT Subject: RFR: 8338156: Fix -Wzero-as-null-pointer-constant warnings in jvmciCompilerToVM.cpp [v2] In-Reply-To: References: Message-ID: On Mon, 12 Aug 2024 07:38:23 GMT, Thomas Schatzl wrote: >> Kim Barrett has updated the pull request incrementally with one additional commit since the last revision: >> >> fix indent in jvmciCompilerToVM.cpp > > Marked as reviewed by tschatzl (Reviewer). Thanks for reviews @tschatzl and @TheShermanTanker . 
------------- PR Comment: https://git.openjdk.org/jdk/pull/20538#issuecomment-2286559517 From kvn at openjdk.org Tue Aug 13 16:10:50 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 13 Aug 2024 16:10:50 GMT Subject: RFR: 8338158: Cleanup ShouldNotXXX uses in machnode.cpp In-Reply-To: <1r_IesfNVnksCD44wEapS7EN6CgvsJnKGl0PHT04MoQ=.ffc4d0fa-98b9-483a-9b96-f75927afa0be@github.com> References: <1r_IesfNVnksCD44wEapS7EN6CgvsJnKGl0PHT04MoQ=.ffc4d0fa-98b9-483a-9b96-f75927afa0be@github.com> Message-ID: <0-H1xAn1uyxof9BWXAU9xcM1JwDQTUu_goHglbTgnDo=.d7e419ed-4801-4482-9388-b46f45ebaecc@github.com> On Mon, 12 Aug 2024 07:23:57 GMT, Kim Barrett wrote: > Please review this change to machnode.cpp to remove dead code following calls > to ShouldNotReachHere() and ShouldNotCallThis(). This takes advantage of the > availability and use of [[noreturn]] attributes on all supported platforms, > and makes all uses of those functions in this file consistent in this respect. > > As a side effect, this removes some -Wzero-as-null-pointer-constant warnings. > > Testing: mach5 tier1 Looks good. Thank you for testing in GHA. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/20540#pullrequestreview-2235993043 From azafari at openjdk.org Tue Aug 13 16:29:01 2024 From: azafari at openjdk.org (Afshin Zafari) Date: Tue, 13 Aug 2024 16:29:01 GMT Subject: RFR: 8300800: UB: Shift exponent 32 is too large for 32-bit type 'int' [v2] In-Reply-To: References: Message-ID: On Mon, 12 Aug 2024 18:37:16 GMT, Kim Barrett wrote: >> Afshin Zafari has updated the pull request incrementally with one additional commit since the last revision: >> >> Used UCONST64(0) instead of UL. > > Looks good. Thank you @kimbarrett and @adinn for your comments. 
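Two changes in this thread involve the `-Wzero-as-null-pointer-constant` warning, which both GCC and Clang provide. In 8338156, `_0`-suffixed exception macros are replaced by their `_NULL`-suffixed counterparts. In 8338158, some of these warnings go away as a side effect. The hypothetical sketch below illustrates the underlying issue: in a pointer-returning function, returning a literal `0` uses it as a null pointer constant, which the warning flags, whereas returning `nullptr` does not trigger it. The `SKETCH_RETURN_*` macros and the `Klass` struct are invented for illustration and are not HotSpot's `THROW_*`/`CHECK_*` definitions.

```cpp
#include <cassert>

struct Klass { const char* name; };

// Hypothetical macros that follow the naming convention described in the
// thread: the suffix names the value returned on the error path.
#define SKETCH_RETURN_0    return 0
#define SKETCH_RETURN_NULL return nullptr

// Pointer-returning function: the _NULL variant is correct here. Using
// SKETCH_RETURN_0 instead would compile, but would trigger
// -Wzero-as-null-pointer-constant.
Klass* find_klass(bool pending_exception, Klass* candidate) {
  if (pending_exception) {
    SKETCH_RETURN_NULL;
  }
  return candidate;
}

// Integral result: the _0 variant is still the natural choice.
int klass_count(bool pending_exception) {
  if (pending_exception) {
    SKETCH_RETURN_0;
  }
  return 1;
}
```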
------------- PR Comment: https://git.openjdk.org/jdk/pull/20530#issuecomment-2286647134 From azafari at openjdk.org Tue Aug 13 16:29:01 2024 From: azafari at openjdk.org (Afshin Zafari) Date: Tue, 13 Aug 2024 16:29:01 GMT Subject: Integrated: 8300800: UB: Shift exponent 32 is too large for 32-bit type 'int' In-Reply-To: References: Message-ID: <28yqAVNZTTeMaUBH7KHNJjdwWjegs1uM5pd89pqzFRM=.ac81c063-5424-4fa6-ac79-fc57f45c5362@github.com> On Fri, 9 Aug 2024 17:54:17 GMT, Afshin Zafari wrote: > The operand of shift which is a constant `0` changed to `unsigned long`. This pull request has now been integrated. Changeset: 21ca91e5 Author: Afshin Zafari URL: https://git.openjdk.org/jdk/commit/21ca91e55dd83dc011e67a2d056e3e3bd44d40b5 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod 8300800: UB: Shift exponent 32 is too large for 32-bit type 'int' Reviewed-by: kbarrett, adinn, gziemski ------------- PR: https://git.openjdk.org/jdk/pull/20530 From tschatzl at openjdk.org Tue Aug 13 16:52:50 2024 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 13 Aug 2024 16:52:50 GMT Subject: RFR: 8338156: Fix -Wzero-as-null-pointer-constant warnings in jvmciCompilerToVM.cpp [v2] In-Reply-To: References: Message-ID: On Tue, 13 Aug 2024 05:56:16 GMT, Kim Barrett wrote: >> Please review this change to remove -Wzero-as-null-pointer-constant warnings >> in jvmciCompilerToVM.cpp. We're changing uses of _0 suffixed exception macros >> to instead use the corresponding _NULL suffixed macros where appropriate. >> >> Testing: mach5 tier1 > > Kim Barrett has updated the pull request incrementally with one additional commit since the last revision: > > fix indent in jvmciCompilerToVM.cpp Marked as reviewed by tschatzl (Reviewer). 
------------- PR Review: https://git.openjdk.org/jdk/pull/20538#pullrequestreview-2236085904 From dnsimon at openjdk.org Tue Aug 13 17:19:49 2024 From: dnsimon at openjdk.org (Doug Simon) Date: Tue, 13 Aug 2024 17:19:49 GMT Subject: RFR: 8338156: Fix -Wzero-as-null-pointer-constant warnings in jvmciCompilerToVM.cpp [v2] In-Reply-To: References: Message-ID: On Tue, 13 Aug 2024 05:56:16 GMT, Kim Barrett wrote: >> Please review this change to remove -Wzero-as-null-pointer-constant warnings >> in jvmciCompilerToVM.cpp. We're changing uses of _0 suffixed exception macros >> to instead use the corresponding _NULL suffixed macros where appropriate. >> >> Testing: mach5 tier1 > > Kim Barrett has updated the pull request incrementally with one additional commit since the last revision: > > fix indent in jvmciCompilerToVM.cpp Marked as reviewed by dnsimon (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/20538#pullrequestreview-2236137977 From kbarrett at openjdk.org Tue Aug 13 18:06:00 2024 From: kbarrett at openjdk.org (Kim Barrett) Date: Tue, 13 Aug 2024 18:06:00 GMT Subject: RFR: 8338156: Fix -Wzero-as-null-pointer-constant warnings in jvmciCompilerToVM.cpp [v2] In-Reply-To: References: Message-ID: On Tue, 13 Aug 2024 16:50:33 GMT, Thomas Schatzl wrote: >> Kim Barrett has updated the pull request incrementally with one additional commit since the last revision: >> >> fix indent in jvmciCompilerToVM.cpp > > Marked as reviewed by tschatzl (Reviewer). 
Thanks for reviews @tschatzl , @TheShermanTanker , and @dougxc ------------- PR Comment: https://git.openjdk.org/jdk/pull/20538#issuecomment-2286821598 From kbarrett at openjdk.org Tue Aug 13 18:06:01 2024 From: kbarrett at openjdk.org (Kim Barrett) Date: Tue, 13 Aug 2024 18:06:01 GMT Subject: Integrated: 8338156: Fix -Wzero-as-null-pointer-constant warnings in jvmciCompilerToVM.cpp In-Reply-To: References: Message-ID: <8iUiq2I5lAMSNY8dC-3nN16-9aJiKeBT3APMoHRh8NM=.ba0bbb48-d9e5-4edb-890a-d7758743c68e@github.com> On Mon, 12 Aug 2024 07:16:34 GMT, Kim Barrett wrote: > Please review this change to remove -Wzero-as-null-pointer-constant warnings > in jvmciCompilerToVM.cpp. We're changing uses of _0 suffixed exception macros > to instead use the corresponding _NULL suffixed macros where appropriate. > > Testing: mach5 tier1 This pull request has now been integrated. Changeset: ca99f37f Author: Kim Barrett URL: https://git.openjdk.org/jdk/commit/ca99f37f82bf59fc720babbc155502ef92d34de6 Stats: 32 lines in 1 file changed: 0 ins; 0 del; 32 mod 8338156: Fix -Wzero-as-null-pointer-constant warnings in jvmciCompilerToVM.cpp Reviewed-by: tschatzl, jwaters, dnsimon ------------- PR: https://git.openjdk.org/jdk/pull/20538 From kbarrett at openjdk.org Tue Aug 13 18:06:54 2024 From: kbarrett at openjdk.org (Kim Barrett) Date: Tue, 13 Aug 2024 18:06:54 GMT Subject: Integrated: 8338158: Cleanup ShouldNotXXX uses in machnode.cpp In-Reply-To: <1r_IesfNVnksCD44wEapS7EN6CgvsJnKGl0PHT04MoQ=.ffc4d0fa-98b9-483a-9b96-f75927afa0be@github.com> References: <1r_IesfNVnksCD44wEapS7EN6CgvsJnKGl0PHT04MoQ=.ffc4d0fa-98b9-483a-9b96-f75927afa0be@github.com> Message-ID: On Mon, 12 Aug 2024 07:23:57 GMT, Kim Barrett wrote: > Please review this change to machnode.cpp to remove dead code following calls > to ShouldNotReachHere() and ShouldNotCallThis(). 
This takes advantage of the > availability and use of [[noreturn]] attributes on all supported platforms, > and makes all uses of those functions in this file consistent in this respect. > > As a side effect, this removes some -Wzero-as-null-pointer-constant warnings. > > Testing: mach5 tier1 This pull request has now been integrated. Changeset: 8e682aca Author: Kim Barrett URL: https://git.openjdk.org/jdk/commit/8e682aca24fba0803dceef513957fb2122895b87 Stats: 9 lines in 1 file changed: 0 ins; 4 del; 5 mod 8338158: Cleanup ShouldNotXXX uses in machnode.cpp Reviewed-by: chagedorn, kvn, dlong ------------- PR: https://git.openjdk.org/jdk/pull/20540 From kbarrett at openjdk.org Tue Aug 13 18:06:54 2024 From: kbarrett at openjdk.org (Kim Barrett) Date: Tue, 13 Aug 2024 18:06:54 GMT Subject: RFR: 8338158: Cleanup ShouldNotXXX uses in machnode.cpp In-Reply-To: References: <1r_IesfNVnksCD44wEapS7EN6CgvsJnKGl0PHT04MoQ=.ffc4d0fa-98b9-483a-9b96-f75927afa0be@github.com> <1NRzN6bd1tAkoSLffSJBxO20fYTB2gSUEYXzKRHIPXI=.2bc04ab6-258c-499c-b8d8-4fb95e24b1f1@github.com> Message-ID: On Tue, 13 Aug 2024 06:55:23 GMT, Christian Hagedorn wrote: >>> Looks good! Just a general question, will you update `fatal()` uses in a separate RFE? >> >> I'm not planning to. There are lots and lots of these sorts of things for ShouldNotXXX, fatal(), &etc. >> I only did these because I was touching some to eliminate -Wzero-as-null-pointer-constant warnings, >> and there were few enough it was still a relatively small change. Also there was already inconstency >> of usage in this one file. I suggest just similarly cleaning these up as code is being touched for other >> reasons. > >> > Looks good! Just a general question, will you update `fatal()` uses in a separate RFE? >> >> I'm not planning to. There are lots and lots of these sorts of things for ShouldNotXXX, fatal(), &etc. 
I only did these because I was touching some to eliminate -Wzero-as-null-pointer-constant warnings, and there were few enough it was still a relatively small change. Also there was already inconsistency of usage in this one file. I suggest just similarly cleaning these up as code is being touched for other reasons. > > Sounds good, thanks for sharing some details! Thanks for reviews, @chhagedorn , @vnkozlov , and @dean-long ------------- PR Comment: https://git.openjdk.org/jdk/pull/20540#issuecomment-2286823915 From kvn at openjdk.org Tue Aug 13 18:45:51 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 13 Aug 2024 18:45:51 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v2] In-Reply-To: References: <7O0X1427MXg_MrnMb9qcH64x3lG3feYt47iZhBujztk=.3fee1482-8bcf-42d9-bc4d-0e074f03222e@github.com> <2J558IEDhou1AcFbhww1g1J3eN51lvvj7nWzS0v4RRs=.46e850b9-c335-42e0-a01b-acda28e7a44c@github.com> Message-ID: On Tue, 13 Aug 2024 10:23:16 GMT, Daniel Lundén wrote: > Does C2 have a maximum frame size? No, but there is arbitrary 1M limit check: [chaitin.cpp#L631](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/chaitin.cpp#L631)
>>> 82: uintptr_t* _RM_UP_EXT = nullptr; >> >> Have you considered using a growable array (`src/hotspot/share/utilities/growableArray.hpp`) for this part? > > No, I'll have a look and see if it makes sense to use growable arrays in this case. Thanks! Yes, it is good suggestion. Please look. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r1715794736 From mdoerr at openjdk.org Tue Aug 13 20:44:53 2024 From: mdoerr at openjdk.org (Martin Doerr) Date: Tue, 13 Aug 2024 20:44:53 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v6] In-Reply-To: References: Message-ID: On Tue, 13 Aug 2024 14:21:01 GMT, Roberto Casta?eda Lozano wrote: >> src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp line 203: >> >>> 201: // Do we need to load the previous value? >>> 202: if (obj != noreg) { >>> 203: __ load_heap_oop(pre_val, Address(obj, 0), noreg, noreg, AS_RAW); >> >> How do we handle implicit null checks for which `obj` is null? Note that we may expect the store instruction to trigger SIGSEGV. Does it work correctly if we trigger the SIGSEGV, here in the pre barrier? > > Good question! I have checked (no pun intended) and it turns out C2 never uses stores (or other memory access operations) with late-expanded barriers to perform implicit null checks. This is accidental, due to the fact that all these memory operations use MachTemp nodes in the C2 code generation stage (to reserve registers for their ADL TEMP operands). C2's implicit null check analysis requires that all inputs of a candidate memory operation dominate the null check [1], which fails if the operation uses MachTemp nodes, since these are always placed in the same basic block [2]. > > Note that this optimization triggers very rarely, if at all, for memory operations in the current early barrier expansion model, since the additional control flow of the barrier code obfuscates the analysis. 
For late barrier expansion, the analysis could be easily extended to recognize and hoist MachTemp nodes together with their user memory operation that is a candidate to implement the implicit null check [3], but that would require extending the barrier assembly emission step to populate the implicit null exception table correctly. Since this seems non-trivial, and would also affect other garbage collectors (ZGC), I suggest to simply assert for now that we do not generate implicit null checks for memory operations with barrier data (as in [4]), and leave full support for implicit null checks for these G1 and ZGC operations to a future RFE. What do you think? > > [1] https://github.com/robcasloz/jdk/blob/d21104ca8ff1eef88a9d87fb78dda3009414b5b8/src/hotspot/share/opto/lcm.cpp#L310-L328 > [2] https://github.com/robcasloz/jdk/blob/d21104ca8ff1eef88a9d87fb78dda3009414b5b8/src/hotspot/share/opto/gcm.cpp#L1397-L1404 > [3] https://github.com/robcasloz/jdk/commit/e0ab1e418b81c0acddff2190ca57b3335b5214ba > [4] https://github.com/robcasloz/jdk/commit/e0ab1e418b81c0acddff2190ca57b3335b5214ba#diff-554fddca91406a67dc0f8faee12dc30c709181685a0add7f4ba9ae5ace68f192R2031-R2032 Thanks for figuring it out! Makes sense. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1715904218 From syan at openjdk.org Wed Aug 14 03:36:10 2024 From: syan at openjdk.org (SendaoYan) Date: Wed, 14 Aug 2024 03:36:10 GMT Subject: RFR: 8338344: Test TestPrivilegedMode.java intermittent fails java.lang.NoClassDefFoundError: jdk/test/lib/Platform Message-ID: Hi, Test `test/hotspot/jtreg/testlibrary_tests/ir_framework/tests/TestPrivilegedMode.java` always fails in our daily CI with a fastdebug build, but I can't reproduce this failure standalone.
When the test fails with `java.lang.NoClassDefFoundError: jdk/test/lib/Platform`, `jdk/test/lib/Platform.class` is located in `/var/tmp/tone/run/jtreg/jt-work/jtreg/hotspot_jtreg/classes/5/test/lib`, but `bootClassPath` is [set to](https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/lib/ir_framework/driver/TestVMProcess.java#L103) `/var/tmp/tone/run/jtreg/jt-work/jtreg/hotspot_jtreg/classes/5/testlibrary_tests/ir_framework/tests/TestPrivilegedMode.d`: `bootClassPath += File.pathSeparator + Utils.TEST_CLASSES;` I think `bootClassPath` should use `Utils.TEST_CLASS_PATH` instead. ------------- Commit messages: - 8338344: Test TestPrivilegedMode.java intermittent fails java.lang.NoClassDefFoundError: jdk/test/lib/Platform Changes: https://git.openjdk.org/jdk/pull/20576/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=20576&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8338344 Stats: 3 lines in 2 files changed: 1 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/20576.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20576/head:pull/20576 PR: https://git.openjdk.org/jdk/pull/20576 From jbhateja at openjdk.org Wed Aug 14 04:59:23 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 14 Aug 2024 04:59:23 GMT Subject: RFR: 8338021: Support saturating vector operators in VectorAPI [v3] In-Reply-To: References: Message-ID: > Hi All, > > As per the discussion on the panama-dev mailing list[1], this patch adds support for the following new vector operators. > > > . SATURATING_UADD : Saturating unsigned addition. > . SATURATING_ADD : Saturating signed addition. > . SATURATING_USUB : Saturating unsigned subtraction. > . SATURATING_SUB : Saturating signed subtraction. > . UMAX : Unsigned max > . UMIN : Unsigned min. > > > The new vector operators apply only to integral types, since their values wrap around in overflowing and underflowing scenarios after setting the appropriate status flags.
For floating point types, as per IEEE 754 specs there are multiple schemes to handle underflow, one of them is gradual underflow which transitions the value to subnormal range. Similarly, overflow implicitly saturates the floating-point value to an Infinite value. > > As the name suggests, these are saturating operations, i.e. the result of the computation is strictly capped by lower and upper bounds of the result type and is not wrapped around in underflowing or overflowing scenarios. > > Summary of changes: > - Java side implementation of new vector operators. > - Add new scalar saturating APIs for each of the above saturating vector operators in the corresponding primitive box classes; the fallback implementation of the vector operators is based on them. > - C2 compiler IR and inline expander changes. > - Optimized x86 backend implementation for new vector operators and their predicated counterparts. > - Extends existing VectorAPI Jtreg test suite to cover new operations. > > Kindly review and share your feedback. > > Best Regards, > PS: Intrinsification and auto-vectorization of new core-lib API will be addressed separately in a follow-up patch.
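The scalar C++ sketch below gives a rough reference for the lane-wise semantics described above, using 8-bit lanes. The signed saturating operations clamp to [-128, 127]. The unsigned ones reinterpret the lane bits as [0, 255] and clamp to that range, and UMAX/UMIN compare the lanes as unsigned values. The helper names are hypothetical; they are neither the Vector API operators nor the new scalar methods that the patch adds to the box classes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Clamp an exact int result into the signed 8-bit range.
int8_t clamp_s8(int r) {
  return static_cast<int8_t>(std::max(-128, std::min(r, 127)));
}

// Clamp an exact int result into the unsigned 8-bit range and return the
// lane bits reinterpreted as int8_t (as a signed byte lane stores them).
int8_t clamp_u8(int r) {
  return static_cast<int8_t>(static_cast<uint8_t>(std::max(0, std::min(r, 255))));
}

// SATURATING_ADD / SATURATING_SUB: signed lanes, computed exactly in int.
int8_t sat_add_s8(int8_t a, int8_t b) { return clamp_s8(int(a) + int(b)); }
int8_t sat_sub_s8(int8_t a, int8_t b) { return clamp_s8(int(a) - int(b)); }

// SATURATING_UADD / SATURATING_USUB: the same bits viewed as unsigned.
int8_t sat_uadd_s8(int8_t a, int8_t b) {
  return clamp_u8(int(uint8_t(a)) + int(uint8_t(b)));
}
int8_t sat_usub_s8(int8_t a, int8_t b) {
  return clamp_u8(int(uint8_t(a)) - int(uint8_t(b)));
}

// UMAX / UMIN: compare lanes as unsigned, return the original lane bits.
int8_t umax_s8(int8_t a, int8_t b) { return uint8_t(a) >= uint8_t(b) ? a : b; }
int8_t umin_s8(int8_t a, int8_t b) { return uint8_t(a) <= uint8_t(b) ? a : b; }
```

For example, 100 + 100 in a signed byte lane saturates to 127 instead of wrapping to -56. UMAX(-1, 1) yields -1, because the lane bits 0xFF form the largest unsigned byte.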
> > [1] https://mail.openjdk.org/pipermail/panama-dev/2024-May/020408.html Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Review comments resolutions ------------- Changes: - all: https://git.openjdk.org/jdk/pull/20507/files - new: https://git.openjdk.org/jdk/pull/20507/files/5468e72b..8c9bfeca Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20507&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20507&range=01-02 Stats: 776 lines in 34 files changed: 0 ins; 0 del; 776 mod Patch: https://git.openjdk.org/jdk/pull/20507.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20507/head:pull/20507 PR: https://git.openjdk.org/jdk/pull/20507 From chagedorn at openjdk.org Wed Aug 14 06:00:57 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 14 Aug 2024 06:00:57 GMT Subject: RFR: 8338344: Test TestPrivilegedMode.java intermittent fails java.lang.NoClassDefFoundError: jdk/test/lib/Platform In-Reply-To: References: Message-ID: On Wed, 14 Aug 2024 03:29:13 GMT, SendaoYan wrote: > Hi, > Test `test/hotspot/jtreg/testlibrary_tests/ir_framework/tests/TestPrivilegedMode.java` always fails by ours daily CI with fastdebug build, but I can't reproduce this fail standalone. > When the test fails `java.lang.NoClassDefFoundError: jdk/test/lib/Platform`, the `jdk/test/lib/Platform.class` locate in `/var/tmp/tone/run/jtreg/jt-work/jtreg/hotspot_jtreg/classes/5/test/lib`, but `bootClassPath` [set as](https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/lib/ir_framework/driver/TestVMProcess.java#L103) `/var/tmp/tone/run/jtreg/jt-work/jtreg/hotspot_jtreg/classes/5/testlibrary_tests/ir_framework/tests/TestPrivilegedMode.d` > > bootClassPath += File.pathSeparator + Utils.TEST_CLASSES; > > > I think `bootClassPath` should set as `Utils.TEST_CLASS_PATH`. Why doesn't it fail locally/standalone? What is different in your CI? 
------------- PR Comment: https://git.openjdk.org/jdk/pull/20576#issuecomment-2287914893 From syan at openjdk.org Wed Aug 14 07:12:50 2024 From: syan at openjdk.org (SendaoYan) Date: Wed, 14 Aug 2024 07:12:50 GMT Subject: RFR: 8338344: Test TestPrivilegedMode.java intermittent fails java.lang.NoClassDefFoundError: jdk/test/lib/Platform In-Reply-To: References: Message-ID: On Wed, 14 Aug 2024 05:58:27 GMT, Christian Hagedorn wrote: > Why doesn't it fail locally/standalone? What is different in your CI? When I run this test locally, `jdk/test/lib/Platform.class` is located in `JTwork/classes/0/testlibrary_tests/ir_framework/tests/TestPrivilegedMode.d`, and `bootClassPath` is set to the same path, so the test passes. Local run command: rm -rf tmp/ ; time jtreg -va -nr -w tmp -conc:40 -jdk:build/linux-x86_64-server-fastdebug/images/jdk/ test/hotspot/jtreg/testlibrary_tests/ir_framework/tests/TestPrivilegedMode.java When the test runs in our CI, `jdk/test/lib/Platform.class` is located in `JTwork/hotspot_jtreg/classes/5/test/lib`, so the boot class loader can't find `jdk/test/lib/Platform.class`. The CI invokes jtreg like this: timeout 860000 jtreg -nativepath:/var/tmp/tone/run/jtreg/test-images/hotspot/jtreg/native -e:TEST_IMAGE_DIR=/var/tmp/tone/run/jtreg/test-images -e:LD_LIBRARY_PATH=/var/tmp/tone/run/jtreg/jdk-repo/build/tools/lib -a -ea -esa -retain:fail,error,*.dmp,javacore.*,heapdump.*,*.trc -ignore:quiet -xml:verify -v:fail,error -timeoutFactor:10 -conc:16 -w jt-work/jtreg -r jt-report/jtreg jdk-repo/test/hotspot/jtreg jdk-repo/test/jdk jdk-repo/test/langtools jdk-repo/test/jaxp jdk-repo/test/lib-test I think we cannot expect `jdk/test/lib/Platform.class` to always be located in `TEST_CLASSES`; we should set `bootClassPath` to `TEST_CLASS_PATH` to make sure the boot class loader can find `jdk/test/lib/Platform.class`.
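The proposed fix boils down to which jtreg-provided path gets appended to the test VM's boot class path. A hedged sketch, not the actual `TestVMProcess` code: it assumes the argument starts as `-Xbootclasspath/a:.` and that jtreg exposes the full test class path through the `test.class.path` system property (the property `Utils.TEST_CLASS_PATH` is understood to read). The class and method names are hypothetical.

```java
import java.io.File;

// Hypothetical sketch, not TestVMProcess: build the -Xbootclasspath/a:
// argument for the test VM from jtreg's system properties.
final class BootClassPathSketch {

    // "test.classes" names only the directory holding the test's own classes;
    // "test.class.path" is the full class path of the test, which should also
    // include the output directories of @library entries such as /test/lib.
    static String bootClassPathArg(String testClassPath) {
        String bootClassPath = "-Xbootclasspath/a:.";
        // Appending the full test class path (instead of only test.classes)
        // keeps library classes such as jdk/test/lib/Platform visible to the
        // boot class loader regardless of which directory jtreg compiled them into.
        bootClassPath += File.pathSeparator + testClassPath;
        return bootClassPath;
    }

    // Reads the property jtreg sets; falls back to "." outside jtreg.
    static String fromJtregProperties() {
        return bootClassPathArg(System.getProperty("test.class.path", "."));
    }
}
```

Under this assumption, the local and CI layouts described above both work, because the directory that actually contains `jdk/test/lib/Platform.class` is part of `test.class.path` in either case.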
------------- PR Comment: https://git.openjdk.org/jdk/pull/20576#issuecomment-2288012486 From jbhateja at openjdk.org Wed Aug 14 07:43:52 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 14 Aug 2024 07:43:52 GMT Subject: RFR: 8338023: Support two vector selectFrom API In-Reply-To: References: <28KQHru1heR-YOVsRVo8Ffj_4D29IV8vD2tombvTHdI=.dba80ac3-9804-4074-ac0f-8acb9b042a08@github.com> Message-ID: On Mon, 12 Aug 2024 22:03:44 GMT, Paul Sandoz wrote: > The results look promising. I can provide guidance on the specification e.g., we can specify the behavior in terms of rearrange, with the addition of throwing on out of bounds indexes. > > Regarding the throwing of exceptions, some wider context will help to know where we are heading before we finalize the specification. I believe we are considering changing the default throwing behavior for index out of bounds to wrapping, thereby we can avoid bounds checks. If that is the case we should wait until that is done then update rather than submitting a CSR just yet? > > I see you created a specific intrinsic, which will avoid the cost of shuffle creation. Should we apply the same approach (in a subsequent PR) to the single argument shuffle? Or perhaps if we manage to optimize shuffles and change the default wrapping we don't require a specific intrinsic and can just use defer to rearrange? Hi @PaulSandoz , Thanks for your comments. With this new API we intend to enforce stricter specification w.r.t to index values to emit a lean instruction sequence preventing any cycles spent on massaging inputs to a consumable form, thus preventing redundant wrapping and unwrapping operations. 
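A stricter index contract of this kind can be modeled in scalar Java. The sketch below is illustrative only: it uses plain `int[]` arrays as stand-ins for vectors (not `jdk.incubator.vector` types), and the class and method names are hypothetical. Lane `i` reads from the concatenation of `v1` and `v2`, and, as suggested in the quoted review comment, an index outside `[0, 2*VLEN)` is rejected with an exception rather than wrapped.

```java
// Hypothetical scalar model of a strict two-vector lookup over plain arrays.
final class TwoVectorLookup {

    static int[] select(int[] indexes, int[] v1, int[] v2) {
        int vlen = v1.length;
        if (v2.length != vlen || indexes.length != vlen) {
            throw new IllegalArgumentException("all operands must have VLEN lanes");
        }
        int[] result = new int[vlen];
        for (int i = 0; i < vlen; i++) {
            int idx = indexes[i];
            // Strict contract: out-of-range indexes throw instead of being
            // wrapped, so no extra index normalization is required.
            if (idx < 0 || idx >= 2 * vlen) {
                throw new IndexOutOfBoundsException("index " + idx + " in lane " + i);
            }
            // Indexes [0, VLEN) select from v1, [VLEN, 2*VLEN) from v2.
            result[i] = idx < vlen ? v1[idx] : v2[idx - vlen];
        }
        return result;
    }
}
```

For example, with `VLEN = 4`, indexes `{0, 5, 3, 6}` over `v1 = {10, 11, 12, 13}` and `v2 = {20, 21, 22, 23}` yield `{10, 21, 13, 22}`.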
The existing [two vector rearrange API](https://docs.oracle.com/en/java/javase/22/docs/api/jdk.incubator.vector/jdk/incubator/vector/Vector.html#rearrange(jdk.incubator.vector.VectorShuffle,jdk.incubator.vector.Vector)) has a flexible specification, which allows out-of-bounds shuffle indexes to be wrapped into an exceptional index with a negative value. Even if we optimize the existing two-vector rearrange implementation, we will still need to emit additional instructions to generate indexes that lie within the two-vector range [0, 2*VLEN). I see this as a specialized API, like vector compress/expand, which caters to targets like x86-AVX512+ and aarch64-SVE that offer direct instructions for two-vector lookups. Maybe the API nomenclature can be refined to better reflect its semantics, i.e. from selectFrom to twoVectorLookup? ------------- PR Comment: https://git.openjdk.org/jdk/pull/20508#issuecomment-2288062038 From rcastanedalo at openjdk.org Wed Aug 14 08:29:45 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Wed, 14 Aug 2024 08:29:45 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v7] In-Reply-To: References: Message-ID: > This changeset implements JEP 475 (Late Barrier Expansion for G1), including support for the x64 and aarch64 platforms. See the [JEP description](https://openjdk.org/jeps/475) for further detail. > > We aim to integrate this work in JDK 24. The purpose of this pull request is double-fold: > > - to allow maintainers of the arm (32-bit), ppc, riscv, s390, and x86 (32-bit) ports to contribute a port of these platforms in time for JDK 24; and > - to allow reviewers to review the platform-independent, x64 and aarch64, and test changes in parallel with the porting work.
> > ## Summary of the Changes > > ### Platform-Independent Changes (`src/hotspot/share`) > > These consist mainly of: > > - a complete rewrite of `G1BarrierSetC2`, to instruct C2 to expand G1 barriers late instead of early; > - a few minor changes to C2 itself, to support removal of redundant decompression operations and to address an OopMap construction issue triggered by this JEP's increased usage of ADL `TEMP` operands; and > - temporary support for porting the JEP to the remaining platforms. > > The temporary support code (guarded by the pre-processor flag `G1_LATE_BARRIER_MIGRATION_SUPPORT`) will **not** be part of the final pull request, and hence does not need to be reviewed. > > ### Platform-Dependent Changes (`src/hotspot/cpu`) > > These include changes to the ADL instruction definitions and the `G1BarrierSetAssembler` class of the x64 and aarch64 platforms. > > #### ADL Changes > > The changeset uses ADL predicates to force C2 to implement memory accesses tagged with barrier information using G1-specific, barrier-aware instruction versions (e.g. `g1StoreP` instead of the GC-agnostic `storeP`). These new instruction versions generate machine code accordingly to the corresponding tagged barrier information, relying on the G1 barrier implementations provided by the `G1BarrierSetAssembler` class. In the aarch64 platform, the bulk of the ADL code is generated from a higher-level version using m4, to reduce redundancy. > > #### `G1BarrierSetAssembler` Changes > > Both platforms basically reuse the barrier implementation for the bytecode interpreter, with the different barrier tests and operations refactored into dedicated functions. Besides this, `G1BarrierSetAssembler` is extended with assembly-stub routines that implement the out-of-line, slow path of the barriers. These routines include calls from the barrier into the JVM, which require support for saving and restoring live registers, provided by the `SaveLiveRegisters` class. This c... 
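The "barrier tests" mentioned above are cheap filters that let most stores skip the barrier's slow path. As an illustrative model only (Java rather than HotSpot assembly, with hypothetical names and parameters), the first two filters of a G1 post-write barrier for a store `*fieldAddr = newVal` can be sketched as follows. The real barrier additionally inspects the current state of the card before dirtying and enqueuing it, which this sketch omits.

```java
// Hypothetical Java model of G1 post-write barrier filters; addresses are
// modeled as longs, and the region/card sizes are parameters, not HotSpot values.
final class G1PostBarrierModel {

    // True if the store must proceed to the card-marking part of the barrier.
    static boolean needsCardMark(long fieldAddr, long newVal, int logRegionBytes) {
        // Filter 1: a reference within the same heap region needs no
        // remembered-set entry (same region <=> identical high address bits).
        if (((fieldAddr ^ newVal) >>> logRegionBytes) == 0) {
            return false;
        }
        // Filter 2: storing null creates no cross-region reference.
        return newVal != 0L;
    }

    // Index of the card covering fieldAddr, relative to the heap start.
    static long cardIndex(long fieldAddr, long heapBase, int cardShift) {
        return (fieldAddr - heapBase) >>> cardShift;
    }
}
```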
Roberto Castañeda Lozano has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains eight additional commits since the last revision: - Rename 'HeapRegionBounds' to 'G1HeapRegionBounds' - Merge jdk-24+10 - Further motivate the choice of internal store address materialization in x64 - Give barrier generation helper functions a more consistent name - Also include HOTSPOT_TARGET_CPU_ARCH-based G1 ADL source file - Flatten barrier assembly generation code by removing helpers individual barrier tests and operations - Build barrier data in G1BarrierSetC2::get_store_barrier() by adding, rather than removing, barrier tags - Implement JEP 475 Co-authored-by: Erik Österlund, Siyao Liu, and Kim Barrett ------------- Changes: - all: https://git.openjdk.org/jdk/pull/19746/files - new: https://git.openjdk.org/jdk/pull/19746/files/d21104ca..88d28b9f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=19746&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=19746&range=05-06 Stats: 99129 lines in 2523 files changed: 60137 ins; 27053 del; 11939 mod Patch: https://git.openjdk.org/jdk/pull/19746.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/19746/head:pull/19746 PR: https://git.openjdk.org/jdk/pull/19746 From rcastanedalo at openjdk.org Wed Aug 14 08:29:45 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Wed, 14 Aug 2024 08:29:45 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v6] In-Reply-To: References: Message-ID: On Mon, 12 Aug 2024 13:57:27 GMT, Roberto Castañeda Lozano wrote: > OK, I will test and push a merge of jdk-24+10 (Thu Aug 8) in the next days, unless @feilongjiang or @snazarkin object. We can then check in a few weeks if another update is required. Done.
------------- PR Comment: https://git.openjdk.org/jdk/pull/19746#issuecomment-2288146342 From fjiang at openjdk.org Wed Aug 14 09:12:53 2024 From: fjiang at openjdk.org (Feilong Jiang) Date: Wed, 14 Aug 2024 09:12:53 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v5] In-Reply-To: References: Message-ID: On Mon, 12 Aug 2024 12:10:28 GMT, Roberto Castañeda Lozano wrote: > > Are you planning to rebase it with master? Nothing important, but there were couple of failures which are fixed after this PR. So will make test result a bit clean for us. > > Actually, I have refrained to update to the latest mainline changes to avoid interfering with the porting work while it is in progress, but if there is consensus among the port maintainers I would be happy to update the changeset regularly. @TheRealMDoerr @feilongjiang @offamitkumar @snazarkin what do you prefer? I have already merged upstream commits on my local branch, so I'm fine with regular updates. ------------- PR Comment: https://git.openjdk.org/jdk/pull/19746#issuecomment-2288247680 From chagedorn at openjdk.org Wed Aug 14 10:51:49 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 14 Aug 2024 10:51:49 GMT Subject: RFR: 8338344: Test TestPrivilegedMode.java intermittent fails java.lang.NoClassDefFoundError: jdk/test/lib/Platform In-Reply-To: References: Message-ID: On Wed, 14 Aug 2024 03:29:13 GMT, SendaoYan wrote: > Hi, > Test `test/hotspot/jtreg/testlibrary_tests/ir_framework/tests/TestPrivilegedMode.java` always fails by ours daily CI with fastdebug build, but I can't reproduce this fail standalone.
> When the test fails `java.lang.NoClassDefFoundError: jdk/test/lib/Platform`, the `jdk/test/lib/Platform.class` locate in `/var/tmp/tone/run/jtreg/jt-work/jtreg/hotspot_jtreg/classes/5/test/lib`, but `bootClassPath` [set as](https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/lib/ir_framework/driver/TestVMProcess.java#L103) `/var/tmp/tone/run/jtreg/jt-work/jtreg/hotspot_jtreg/classes/5/testlibrary_tests/ir_framework/tests/TestPrivilegedMode.d` > > bootClassPath += File.pathSeparator + Utils.TEST_CLASSES; > > > I think `bootClassPath` should set as `Utils.TEST_CLASS_PATH`. Thanks for the detailed explanation. Using `TEST_CLASS_PATH` instead seems reasonable. test/hotspot/jtreg/testlibrary_tests/ir_framework/tests/TestPrivilegedMode.java line 38: > 36: * @modules java.base/jdk.internal.vm.annotation > 37: * @library /test/lib / > 38: * @build jdk.test.lib.Platform Is that really required or is it enough to set `TEST_CLASS_PATH`? ------------- PR Review: https://git.openjdk.org/jdk/pull/20576#pullrequestreview-2237804609 PR Review Comment: https://git.openjdk.org/jdk/pull/20576#discussion_r1716711503 From syan at openjdk.org Wed Aug 14 11:55:48 2024 From: syan at openjdk.org (SendaoYan) Date: Wed, 14 Aug 2024 11:55:48 GMT Subject: RFR: 8338344: Test TestPrivilegedMode.java intermittent fails java.lang.NoClassDefFoundError: jdk/test/lib/Platform In-Reply-To: References: Message-ID: On Wed, 14 Aug 2024 10:48:44 GMT, Christian Hagedorn wrote: >> Hi, >> Test `test/hotspot/jtreg/testlibrary_tests/ir_framework/tests/TestPrivilegedMode.java` always fails by ours daily CI with fastdebug build, but I can't reproduce this fail standalone. 
>> When the test fails `java.lang.NoClassDefFoundError: jdk/test/lib/Platform`, the `jdk/test/lib/Platform.class` locate in `/var/tmp/tone/run/jtreg/jt-work/jtreg/hotspot_jtreg/classes/5/test/lib`, but `bootClassPath` [set as](https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/lib/ir_framework/driver/TestVMProcess.java#L103) `/var/tmp/tone/run/jtreg/jt-work/jtreg/hotspot_jtreg/classes/5/testlibrary_tests/ir_framework/tests/TestPrivilegedMode.d` >> >> bootClassPath += File.pathSeparator + Utils.TEST_CLASSES; >> >> >> I think `bootClassPath` should set as `Utils.TEST_CLASS_PATH`. > > test/hotspot/jtreg/testlibrary_tests/ir_framework/tests/TestPrivilegedMode.java line 38: > >> 36: * @modules java.base/jdk.internal.vm.annotation >> 37: * @library /test/lib / >> 38: * @build jdk.test.lib.Platform > > Is that really required or is it enough to set `TEST_CLASS_PATH`? Yes, it's enough to set `TEST_CLASS_PATH`. But the jtreg [doc](https://openjdk.org/jtreg/tag-spec.html) suggest that `appropriate @build directives to ensure that the classes will be compiled`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20576#discussion_r1716792631 From syan at openjdk.org Wed Aug 14 12:17:02 2024 From: syan at openjdk.org (SendaoYan) Date: Wed, 14 Aug 2024 12:17:02 GMT Subject: RFR: 8338344: Test TestPrivilegedMode.java intermittent fails java.lang.NoClassDefFoundError: jdk/test/lib/Platform [v2] In-Reply-To: References: Message-ID: > Hi, > Test `test/hotspot/jtreg/testlibrary_tests/ir_framework/tests/TestPrivilegedMode.java` always fails by ours daily CI with fastdebug build, but I can't reproduce this fail standalone. 
> When the test fails `java.lang.NoClassDefFoundError: jdk/test/lib/Platform`, the `jdk/test/lib/Platform.class` locate in `/var/tmp/tone/run/jtreg/jt-work/jtreg/hotspot_jtreg/classes/5/test/lib`, but `bootClassPath` [set as](https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/lib/ir_framework/driver/TestVMProcess.java#L103) `/var/tmp/tone/run/jtreg/jt-work/jtreg/hotspot_jtreg/classes/5/testlibrary_tests/ir_framework/tests/TestPrivilegedMode.d` > > bootClassPath += File.pathSeparator + Utils.TEST_CLASSES; > > > I think `bootClassPath` should set as `Utils.TEST_CLASS_PATH`. SendaoYan has updated the pull request incrementally with one additional commit since the last revision: delete "@build jdk.test.lib.Platform" ------------- Changes: - all: https://git.openjdk.org/jdk/pull/20576/files - new: https://git.openjdk.org/jdk/pull/20576/files/756c4bec..cb0f9212 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20576&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20576&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 1 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/20576.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20576/head:pull/20576 PR: https://git.openjdk.org/jdk/pull/20576 From chagedorn at openjdk.org Wed Aug 14 12:17:03 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 14 Aug 2024 12:17:03 GMT Subject: RFR: 8338344: Test TestPrivilegedMode.java intermittent fails java.lang.NoClassDefFoundError: jdk/test/lib/Platform [v2] In-Reply-To: References: Message-ID: On Wed, 14 Aug 2024 11:53:27 GMT, SendaoYan wrote: >> test/hotspot/jtreg/testlibrary_tests/ir_framework/tests/TestPrivilegedMode.java line 38: >> >>> 36: * @modules java.base/jdk.internal.vm.annotation >>> 37: * @library /test/lib / >>> 38: * @build jdk.test.lib.Platform >> >> Is that really required or is it enough to set `TEST_CLASS_PATH`? > > Yes, it's enough to set `TEST_CLASS_PATH`. 
But the jtreg [doc](https://openjdk.org/jtreg/tag-spec.html) suggest that `appropriate @build directives to ensure that the classes will be compiled`. Okay, but then I suggest to remove `@build` when `TEST_CLASS_PATH` is enough because the IR framework is already special in that regard as it aims for simplicity: The IR test itself should not need to worry about the IR framework internal classes that need to be built. For example, we already silently build and install the `WhiteBox` class without the need to specify it with `@build` in the IR test. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20576#discussion_r1716814640 From syan at openjdk.org Wed Aug 14 12:22:52 2024 From: syan at openjdk.org (SendaoYan) Date: Wed, 14 Aug 2024 12:22:52 GMT Subject: RFR: 8338344: Test TestPrivilegedMode.java intermittent fails java.lang.NoClassDefFoundError: jdk/test/lib/Platform [v2] In-Reply-To: References: Message-ID: On Wed, 14 Aug 2024 12:11:47 GMT, Christian Hagedorn wrote: >> Yes, it's enough to set `TEST_CLASS_PATH`. But the jtreg [doc](https://openjdk.org/jtreg/tag-spec.html) suggest that `appropriate @build directives to ensure that the classes will be compiled`. > > Okay, but then I suggest to remove `@build` when `TEST_CLASS_PATH` is enough because the IR framework is already special in that regard as it aims for simplicity: The IR test itself should not need to worry about the IR framework internal classes that need to be built. For example, we already silently build and install the `WhiteBox` class without the need to specify it with `@build` in the IR test. Okay, the `@build jdk.test.lib.Platform` line has been deleted.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20576#discussion_r1716825652 From rcastanedalo at openjdk.org Wed Aug 14 12:38:51 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Wed, 14 Aug 2024 12:38:51 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v5] In-Reply-To: References: Message-ID: <259a7NXcZVtVnc3vlOTN2eF4zPq3U_QBKDLNnvE1OJw=.894d8054-8947-40c2-a62d-1dd387477013@github.com> On Wed, 14 Aug 2024 09:10:10 GMT, Feilong Jiang wrote: > I have already merged upstream commits on my local branch, so I'm fine with regular updates. Thanks, let's go with this version and see if we need a new update in a few weeks (or, perhaps, all platforms have been ported by then). ------------- PR Comment: https://git.openjdk.org/jdk/pull/19746#issuecomment-2288628007 From rcastanedalo at openjdk.org Wed Aug 14 13:11:25 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Wed, 14 Aug 2024 13:11:25 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v8] In-Reply-To: References: Message-ID: > This changeset implements JEP 475 (Late Barrier Expansion for G1), including support for the x64 and aarch64 platforms. See the [JEP description](https://openjdk.org/jeps/475) for further detail. > > We aim to integrate this work in JDK 24. The purpose of this pull request is double-fold: > > - to allow maintainers of the arm (32-bit), ppc, riscv, s390, and x86 (32-bit) ports to contribute a port of these platforms in time for JDK 24; and > - to allow reviewers to review the platform-independent, x64 and aarch64, and test changes in parallel with the porting work.
> > ## Summary of the Changes > > ### Platform-Independent Changes (`src/hotspot/share`) > > These consist mainly of: > > - a complete rewrite of `G1BarrierSetC2`, to instruct C2 to expand G1 barriers late instead of early; > - a few minor changes to C2 itself, to support removal of redundant decompression operations and to address an OopMap construction issue triggered by this JEP's increased usage of ADL `TEMP` operands; and > - temporary support for porting the JEP to the remaining platforms. > > The temporary support code (guarded by the pre-processor flag `G1_LATE_BARRIER_MIGRATION_SUPPORT`) will **not** be part of the final pull request, and hence does not need to be reviewed. > > ### Platform-Dependent Changes (`src/hotspot/cpu`) > > These include changes to the ADL instruction definitions and the `G1BarrierSetAssembler` class of the x64 and aarch64 platforms. > > #### ADL Changes > > The changeset uses ADL predicates to force C2 to implement memory accesses tagged with barrier information using G1-specific, barrier-aware instruction versions (e.g. `g1StoreP` instead of the GC-agnostic `storeP`). These new instruction versions generate machine code accordingly to the corresponding tagged barrier information, relying on the G1 barrier implementations provided by the `G1BarrierSetAssembler` class. In the aarch64 platform, the bulk of the ADL code is generated from a higher-level version using m4, to reduce redundancy. > > #### `G1BarrierSetAssembler` Changes > > Both platforms basically reuse the barrier implementation for the bytecode interpreter, with the different barrier tests and operations refactored into dedicated functions. Besides this, `G1BarrierSetAssembler` is extended with assembly-stub routines that implement the out-of-line, slow path of the barriers. These routines include calls from the barrier into the JVM, which require support for saving and restoring live registers, provided by the `SaveLiveRegisters` class. This c... 
Roberto Castañeda Lozano has updated the pull request incrementally with one additional commit since the last revision: Assert that no implicit null checks are generated for memory accesses with barriers ------------- Changes: - all: https://git.openjdk.org/jdk/pull/19746/files - new: https://git.openjdk.org/jdk/pull/19746/files/88d28b9f..554de779 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=19746&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=19746&range=06-07 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/19746.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/19746/head:pull/19746 PR: https://git.openjdk.org/jdk/pull/19746 From rcastanedalo at openjdk.org Wed Aug 14 13:11:25 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Wed, 14 Aug 2024 13:11:25 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v6] In-Reply-To: References: Message-ID: On Tue, 13 Aug 2024 20:42:36 GMT, Martin Doerr wrote: >> Good question! I have checked (no pun intended) and it turns out C2 never uses stores (or other memory access operations) with late-expanded barriers to perform implicit null checks. This is accidental, due to the fact that all these memory operations use MachTemp nodes in the C2 code generation stage (to reserve registers for their ADL TEMP operands). C2's implicit null check analysis requires that all inputs of a candidate memory operation dominate the null check [1], which fails if the operation uses MachTemp nodes, since these are always placed in the same basic block [2]. >> >> Note that this optimization triggers very rarely, if at all, for memory operations in the current early barrier expansion model, since the additional control flow of the barrier code obfuscates the analysis.
For late barrier expansion, the analysis could be easily extended to recognize and hoist MachTemp nodes together with their user memory operation that is a candidate to implement the implicit null check [3], but that would require extending the barrier assembly emission step to populate the implicit null exception table correctly. Since this seems non-trivial, and would also affect other garbage collectors (ZGC), I suggest to simply assert for now that we do not generate implicit null checks for memory operations with barrier data (as in [4]), and leave full support for implicit null checks for these G1 and ZGC operations to a future RFE. What do you think? >> >> [1] https://github.com/robcasloz/jdk/blob/d21104ca8ff1eef88a9d87fb78dda3009414b5b8/src/hotspot/share/opto/lcm.cpp#L310-L328 >> [2] https://github.com/robcasloz/jdk/blob/d21104ca8ff1eef88a9d87fb78dda3009414b5b8/src/hotspot/share/opto/gcm.cpp#L1397-L1404 >> [3] https://github.com/robcasloz/jdk/commit/e0ab1e418b81c0acddff2190ca57b3335b5214ba >> [4] https://github.com/robcasloz/jdk/commit/e0ab1e418b81c0acddff2190ca57b3335b5214ba#diff-554fddca91406a67dc0f8faee12dc30c709181685a0add7f4ba9ae5ace68f192R2031-R2032 > > Thanks for figuring it out! Makes sense. Added the assertion in commit 554de779. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1716895420 From chagedorn at openjdk.org Wed Aug 14 14:15:48 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 14 Aug 2024 14:15:48 GMT Subject: RFR: 8338344: Test TestPrivilegedMode.java intermittent fails java.lang.NoClassDefFoundError: jdk/test/lib/Platform [v2] In-Reply-To: References: Message-ID: On Wed, 14 Aug 2024 12:17:02 GMT, SendaoYan wrote: >> Hi, >> Test `test/hotspot/jtreg/testlibrary_tests/ir_framework/tests/TestPrivilegedMode.java` always fails by ours daily CI with fastdebug build, but I can't reproduce this fail standalone. 
>> When the test fails `java.lang.NoClassDefFoundError: jdk/test/lib/Platform`, the `jdk/test/lib/Platform.class` locate in `/var/tmp/tone/run/jtreg/jt-work/jtreg/hotspot_jtreg/classes/5/test/lib`, but `bootClassPath` [set as](https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/lib/ir_framework/driver/TestVMProcess.java#L103) `/var/tmp/tone/run/jtreg/jt-work/jtreg/hotspot_jtreg/classes/5/testlibrary_tests/ir_framework/tests/TestPrivilegedMode.d` >> >> bootClassPath += File.pathSeparator + Utils.TEST_CLASSES; >> >> >> I think `bootClassPath` should set as `Utils.TEST_CLASS_PATH`. > > SendaoYan has updated the pull request incrementally with one additional commit since the last revision: > > delete "@build jdk.test.lib.Platform" Looks good, thanks! ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/20576#pullrequestreview-2238317232 From shade at openjdk.org Wed Aug 14 14:21:50 2024 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 14 Aug 2024 14:21:50 GMT Subject: RFR: 8338344: Test TestPrivilegedMode.java intermittent fails java.lang.NoClassDefFoundError: jdk/test/lib/Platform [v2] In-Reply-To: References: Message-ID: On Wed, 14 Aug 2024 12:17:02 GMT, SendaoYan wrote: >> Hi, >> Test `test/hotspot/jtreg/testlibrary_tests/ir_framework/tests/TestPrivilegedMode.java` always fails by ours daily CI with fastdebug build, but I can't reproduce this fail standalone. 
>> When the test fails `java.lang.NoClassDefFoundError: jdk/test/lib/Platform`, the `jdk/test/lib/Platform.class` locate in `/var/tmp/tone/run/jtreg/jt-work/jtreg/hotspot_jtreg/classes/5/test/lib`, but `bootClassPath` [set as](https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/lib/ir_framework/driver/TestVMProcess.java#L103) `/var/tmp/tone/run/jtreg/jt-work/jtreg/hotspot_jtreg/classes/5/testlibrary_tests/ir_framework/tests/TestPrivilegedMode.d` >> >> bootClassPath += File.pathSeparator + Utils.TEST_CLASSES; >> >> >> I think `bootClassPath` should set as `Utils.TEST_CLASS_PATH`. > > SendaoYan has updated the pull request incrementally with one additional commit since the last revision: > > delete "@build jdk.test.lib.Platform" Marked as reviewed by shade (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/20576#pullrequestreview-2238335158 From syan at openjdk.org Wed Aug 14 15:51:51 2024 From: syan at openjdk.org (SendaoYan) Date: Wed, 14 Aug 2024 15:51:51 GMT Subject: RFR: 8338344: Test TestPrivilegedMode.java intermittent fails java.lang.NoClassDefFoundError: jdk/test/lib/Platform [v2] In-Reply-To: References: Message-ID: On Wed, 14 Aug 2024 12:17:02 GMT, SendaoYan wrote: >> Hi, >> Test `test/hotspot/jtreg/testlibrary_tests/ir_framework/tests/TestPrivilegedMode.java` always fails by ours daily CI with fastdebug build, but I can't reproduce this fail standalone. 
>> When the test fails `java.lang.NoClassDefFoundError: jdk/test/lib/Platform`, the `jdk/test/lib/Platform.class` locate in `/var/tmp/tone/run/jtreg/jt-work/jtreg/hotspot_jtreg/classes/5/test/lib`, but `bootClassPath` [set as](https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/lib/ir_framework/driver/TestVMProcess.java#L103) `/var/tmp/tone/run/jtreg/jt-work/jtreg/hotspot_jtreg/classes/5/testlibrary_tests/ir_framework/tests/TestPrivilegedMode.d` >> >> bootClassPath += File.pathSeparator + Utils.TEST_CLASSES; >> >> >> I think `bootClassPath` should set as `Utils.TEST_CLASS_PATH`. > > SendaoYan has updated the pull request incrementally with one additional commit since the last revision: > > delete "@build jdk.test.lib.Platform" Thanks all for the review and approved. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20576#issuecomment-2289156575 From duke at openjdk.org Wed Aug 14 15:51:51 2024 From: duke at openjdk.org (duke) Date: Wed, 14 Aug 2024 15:51:51 GMT Subject: RFR: 8338344: Test TestPrivilegedMode.java intermittent fails java.lang.NoClassDefFoundError: jdk/test/lib/Platform [v2] In-Reply-To: References: Message-ID: <0mCW78nVJZYD4atRzjW3xCW-QsoHB-8rjH_YPx193tg=.6134f635-446d-4336-b20e-c703ed9f9df3@github.com> On Wed, 14 Aug 2024 12:17:02 GMT, SendaoYan wrote: >> Hi, >> Test `test/hotspot/jtreg/testlibrary_tests/ir_framework/tests/TestPrivilegedMode.java` always fails by ours daily CI with fastdebug build, but I can't reproduce this fail standalone. 
>> When the test fails `java.lang.NoClassDefFoundError: jdk/test/lib/Platform`, the `jdk/test/lib/Platform.class` locate in `/var/tmp/tone/run/jtreg/jt-work/jtreg/hotspot_jtreg/classes/5/test/lib`, but `bootClassPath` [set as](https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/lib/ir_framework/driver/TestVMProcess.java#L103) `/var/tmp/tone/run/jtreg/jt-work/jtreg/hotspot_jtreg/classes/5/testlibrary_tests/ir_framework/tests/TestPrivilegedMode.d` >> >> bootClassPath += File.pathSeparator + Utils.TEST_CLASSES; >> >> >> I think `bootClassPath` should set as `Utils.TEST_CLASS_PATH`. > > SendaoYan has updated the pull request incrementally with one additional commit since the last revision: > > delete "@build jdk.test.lib.Platform" @sendaoYan Your change (at version cb0f9212c67ec51372885f3927fcffccc19ac33f) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20576#issuecomment-2289159181 From qamai at openjdk.org Wed Aug 14 16:30:26 2024 From: qamai at openjdk.org (Quan Anh Mai) Date: Wed, 14 Aug 2024 16:30:26 GMT Subject: RFR: 8315066: Add unsigned bounds and known bits to TypeInt/Long [v7] In-Reply-To: References: Message-ID: > Hi, > > This patch adds unsigned bounds and known bits constraints to TypeInt and TypeLong. This opens more transformation opportunities in an elegant manner as well as helps avoid some ad-hoc rules in Hotspot. > > In general, a TypeInt/Long represents a set of values x that satisfies: x s>= lo && x s<= hi && x u>= ulo && x u<= uhi && (x & zeros) == 0 && (~x & ones) == 0. These constraints are not independent, e.g. an int that lies in [0, 3] in signed domain must also lie in [0, 3] in unsigned domain and have all bits but the last 2 being unset. As a result, we must normalize the constraints (tighten the constraints so that they are optimal) before constructing a TypeInt/Long instance. > > This is extracted from #15440 , node value transformations are left for later PRs. 
I have also added unit tests to verify the soundness of constraint normalization. > > Please kindly review, thanks a lot. > > Testing > > - [x] GHA > - [x] Linux x64, tier 1-4 Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 11 commits: - fix compile errors - Merge branch 'master' into unsignedbounds - add comments - Merge branch 'master' into unsignedbounds - fix release build - add comments, group arguments to reduce C-style reference passing arguments - fix tests, add verify - add unit tests - fix template parameter - refactor - ... and 1 more: https://git.openjdk.org/jdk/compare/d8e4d3f2...d5ad9f1a ------------- Changes: https://git.openjdk.org/jdk/pull/17508/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=17508&range=06 Stats: 1482 lines in 17 files changed: 925 ins; 286 del; 271 mod Patch: https://git.openjdk.org/jdk/pull/17508.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17508/head:pull/17508 PR: https://git.openjdk.org/jdk/pull/17508 From qamai at openjdk.org Wed Aug 14 16:30:26 2024 From: qamai at openjdk.org (Quan Anh Mai) Date: Wed, 14 Aug 2024 16:30:26 GMT Subject: RFR: 8315066: Add unsigned bounds and known bits to TypeInt/Long [v2] In-Reply-To: References: Message-ID: On Mon, 22 Jan 2024 09:54:59 GMT, Emanuel Peter wrote: >> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: >> >> fix tests, add verify > > @merykitty I just had a quick look. Thanks for splitting out parts and making it more reviewable that way! Since John Rose is generally excited (https://github.com/openjdk/jdk/pull/15440#issuecomment-1901609719), I'll now put a bit more effort into reviewing this. > > Thanks for adding some gtests. > I would really like to see some IR tests, where we can see that this folds cases, and folds them correctly.
> And just some general Java correctness tests, which test your optimizations in an end-to-end way. > > I have a general concern with the math functions. They have quite a few arguments, often 5-10, and on average half of them are passed by reference. Sometimes it is hard to immediately see which arguments will not be mutated, which are return values, which are both arguments and return values, which are simply further constrained/narrowed, etc. > > I wonder if it might be better to have types like: > > SRange {lo, hi} > URange {lo, hi} > KnownBits {ones, zeros} > > Make them immutable, i.e. the fields are constant. > Then, as function parameters, you always pass these in as const and return the new values (possibly in some combined type, or a pair or tuple or whatever). > > I think that would make the code cleaner, with fewer arguments, and a bit easier to reason about, since things are immutable. > > Plus, you can then put the range-inference methods inside those classes, and you can directly ask such an object whether it is empty, etc. For example, you could have something like: > `SRange::constrained_with(KnownBits) -> returns SRange`. Basically I'm asking for the code to be a little more object-oriented, and less C-style ;) @eme64 Gentle ping regarding this PR. Thanks a lot. ------------- PR Comment: https://git.openjdk.org/jdk/pull/17508#issuecomment-2289245712 From syan at openjdk.org Wed Aug 14 18:58:53 2024 From: syan at openjdk.org (SendaoYan) Date: Wed, 14 Aug 2024 18:58:53 GMT Subject: Integrated: 8338344: Test TestPrivilegedMode.java intermittent fails java.lang.NoClassDefFoundError: jdk/test/lib/Platform In-Reply-To: References: Message-ID: On Wed, 14 Aug 2024 03:29:13 GMT, SendaoYan wrote: > Hi, > The test `test/hotspot/jtreg/testlibrary_tests/ir_framework/tests/TestPrivilegedMode.java` always fails in our daily CI with fastdebug builds, but I can't reproduce this failure standalone.
> When the test fails with `java.lang.NoClassDefFoundError: jdk/test/lib/Platform`, `jdk/test/lib/Platform.class` is located in `/var/tmp/tone/run/jtreg/jt-work/jtreg/hotspot_jtreg/classes/5/test/lib`, but `bootClassPath` is [set to](https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/lib/ir_framework/driver/TestVMProcess.java#L103) `/var/tmp/tone/run/jtreg/jt-work/jtreg/hotspot_jtreg/classes/5/testlibrary_tests/ir_framework/tests/TestPrivilegedMode.d`: > > bootClassPath += File.pathSeparator + Utils.TEST_CLASSES; > > > I think `bootClassPath` should be set to `Utils.TEST_CLASS_PATH` instead. This pull request has now been integrated. Changeset: e3a5e265 Author: SendaoYan Committer: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/e3a5e265a7747b02b8f828fbedea0dda7246fc51 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod 8338344: Test TestPrivilegedMode.java intermittent fails java.lang.NoClassDefFoundError: jdk/test/lib/Platform Reviewed-by: chagedorn, shade ------------- PR: https://git.openjdk.org/jdk/pull/20576 From dlong at openjdk.org Wed Aug 14 19:03:53 2024 From: dlong at openjdk.org (Dean Long) Date: Wed, 14 Aug 2024 19:03:53 GMT Subject: RFR: 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug" [v5] In-Reply-To: References: <16J-lJ2AceGTVcRWBcP15yKcwO-1IA1XsngyOuNjf7k=.0776f081-ae2c-4279-87cf-d909806c2bc4@github.com> Message-ID: <_eEXPKMxJIQZSNbOzfpHZwDPHB7tM02bba7bipV1jrc=.5f699c96-8d55-477e-9783-4872efd762c2@github.com> On Mon, 29 Jul 2024 13:05:46 GMT, Fei Gao wrote: >> On LP64 systems, if the heap can be moved into low virtual address space (below 4GB) and the heap size is smaller than the interesting threshold of 4 GB, we can use unscaled decoding pattern for narrow klass decoding.
It means that a generic field reference can be decoded by: >> >> cast<64> (32-bit compressed reference) + field_offset >> >> >> When the `field_offset` is an immediate, on aarch64 platform, the unscaled decoding pattern can match perfectly with a direct addressing mode, i.e., `base_plus_offset`, supported by `LDR/STR` instructions. But for certain data width, not all immediates can be encoded in the instruction field of `LDR/STR` [[1]](https://github.com/openjdk/jdk/blob/8db7bad992a0f31de9c7e00c2657c18670539102/src/hotspot/cpu/aarch64/assembler_aarch64.inline.hpp#L33). The ranges are different as data widths vary. >> >> For example, when we try to load a value of long type at offset of `1030`, the address expression is `(AddP (DecodeN base) 1030)`. Before the patch, the expression was matching with `operand indOffIN()`. But, for 64-bit `LDR/STR`, signed immediate byte offset must be in the range -256 to 255 or positive immediate byte offset must be a multiple of 8 in the range 0 to 32760 [[2]](https://developer.arm.com/documentation/ddi0602/2023-09/Base-Instructions/LDR--immediate---Load-Register--immediate--?lang=en). `1030` can't be encoded in the instruction field. So, after matching, when we do checking for instruction encoding, the assertion would fail. >> >> In this patch, we're going to filter out invalid immediates when deciding if current addressing mode can be matched as `base_plus_offset`. We introduce `indOffIN4/indOffLN4` and `indOffIN8/indOffLN8` for 32-bit data type and 64-bit data type separately in the patch. E.g., for `memory4`, we remove the generic `indOffIN/indOffLN`, which matches wrong unscaled immediate range, and replace them with `indOffIN4/indOffLN4` instead. >> >> Since 8-bit and 16-bit `LDR/STR` instructions also support the unscaled decoding pattern, we add the addressing mode in the lists of `memory1` and `memory2` by introducing `indOffIN1/indOffLN1` and `indOffIN2/indOffLN2`. 
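The encodability rule quoted above can be sketched as a small Java predicate. This is a hypothetical model, not HotSpot's `offset_ok_for_immed`. It accepts either of the two immediate forms described: an unscaled signed 9-bit offset (-256..255, the form used by LDUR/STUR), or an unsigned 12-bit offset scaled by the access size. For 8-byte accesses the scaled form covers multiples of 8 from 0 to 32760.

```java
// Hypothetical model of the AArch64 LDR/STR immediate-offset rule; not HotSpot code.
public final class LdrStrOffset {
    /**
     * @param offset   byte offset from the base register
     * @param sizeLog2 log2 of the access size in bytes (0 = 1 byte ... 3 = 8 bytes)
     */
    static boolean encodable(long offset, int sizeLog2) {
        // Unscaled form: signed 9-bit immediate.
        if (offset >= -256 && offset <= 255) {
            return true;
        }
        // Scaled form: unsigned 12-bit immediate multiplied by the access size,
        // so the offset must be a non-negative multiple of the size.
        long size = 1L << sizeLog2;
        return offset >= 0 && (offset & (size - 1)) == 0 && (offset >> sizeLog2) <= 4095;
    }

    public static void main(String[] args) {
        System.out.println(encodable(1030, 3));  // false: > 255 and not a multiple of 8
        System.out.println(encodable(1024, 3));  // true: 8 * 128
        System.out.println(encodable(32760, 3)); // true: 8 * 4095
        System.out.println(encodable(32768, 3)); // false: 8 * 4096 is out of range
        System.out.println(encodable(1030, 1));  // true for a 2-byte access: 2 * 515
    }
}
```

The last line shows why the permitted range depends on the data width. The same offset, 1030, is fine for a 16-bit access but not for a 64-bit one.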
>> >> We also remove unused operands `indOffI/indOffl/indOffIN/indOffLN` to avoid misuse. >> >> Tier 1-3 passed on aarch64. > > Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains eight commits: > > - Merge branch 'master' into fg8319690 > - Discard IndOffXX style and let legitimize_address() fix any out-of-range immediate offsets > - Merge branch 'master' into fg8319690 > - Add the assertion back and merge matchrules with a better predicate > - Merge branch 'master' into fg8319690 > - Remove unused immIOffset/immLOffset > - Merge branch 'master' into fg8319690 > - 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug" > > On LP64 systems, if the heap can be moved into low virtual > address space (below 4GB) and the heap size is smaller than the > interesting threshold of 4 GB, we can use unscaled decoding > pattern for narrow klass decoding. It means that a generic field > reference can be decoded by: > ``` > cast<64> (32-bit compressed reference) + field_offset > ``` > > When the `field_offset` is an immediate, on aarch64 platform, the > unscaled decoding pattern can match perfectly with a direct > addressing mode, i.e., `base_plus_offset`, supported by LDR/STR > instructions. But for certain data width, not all immediates can > be encoded in the instruction field of LDR/STR[1]. The ranges are > different as data widths vary. > > For example, when we try to load a value of long type at offset of > `1030`, the address expression is `(AddP (DecodeN base) 1030)`. > Before the patch, the expression was matching with > `operand indOffIN()`. But, for 64-bit LDR/STR, signed immediate > byte offset must be in the range -256 to 255 or positive immediate > byte offset must be a multiple of 8 in the range 0 to 32760[2]. > `1030` can't be encoded in the instruction field. So, after > matching, when we do checking for instruction encoding, the > assertion would fail. 
> > In this patch, we're going to filter out invalid immediates > when deciding if current addressing mode can be matched as > `base_plus_offset`. We introduce `indOffIN4/indOffLN4` and > `indOffIN8/indOffLN8` for 32-bit data type and 64-bit data > type separately in the patch. E.g., for `memory4`, we remove > the generic `indOffIN/indOffLN`, which matches wrong unscaled > immediate range, and replace them with `indOffIN4/indOffLN4` > instead. > > Since 8-bit and 16-bit LDR/STR instructions also support the > unscaled decoding pattern, we add the addressing mode in the > lists of `memory1` and `memory... Looks good. ------------- Marked as reviewed by dlong (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/16991#pullrequestreview-2238974915 From sviswanathan at openjdk.org Thu Aug 15 00:43:54 2024 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Thu, 15 Aug 2024 00:43:54 GMT Subject: RFR: 8337632: AES-GCM Algorithm optimization for x86_64 In-Reply-To: References: Message-ID: <_SE6Bidb70F5A9OGDWUIBGuvKQz3ObZnyJNJKtPQI60=.d16d3a29-f58f-4b8f-81fe-f809fd1fc96f@github.com> On Mon, 22 Jan 2024 09:38:25 GMT, Smita Kamath wrote: > Hi, > I want to submit an AES-GCM algorithm optimization. This implementation is using AVX512/VAES Instructions. Additionally, it reduces PARALLEL_LEN from 7680 to 512 bytes. The performance numbers are as below. Kindly review the code. Thank you. 
>
> Benchmark | Datasize | BaseJDK (ops/s) | Patch (ops/s) | %Gain
> -- | -- | -- | -- | --
> full.AESGCMBench.decrypt | 512 | 2928259.197 | 3269964.387 | 11.67
> full.AESGCMBench.decrypt | 1024 | 2494254.611 | 3010987.731 | 20.72
> full.AESGCMBench.decrypt | 1500 | 1883453.546 | 1934915.846 | 2.73
> full.AESGCMBench.decrypt | 2048 | 1825780.711 | 2452861.368 | 34.34
> full.AESGCMBench.decrypt | 4096 | 1275108.345 | 1806329.066 | 41.66
> full.AESGCMBench.decrypt | 8192 | 1033936.634 | 1196836.052 | 15.75
> full.AESGCMBench.decrypt | 16384 | 681494.768 | 711630.498 | 4.42
> full.AESGCMBench.decrypt | 32768 | 385026.017 | 395043.193 | 2.6
> full.AESGCMBench.decrypt | 65536 | 207373.924 | 214723.588 | 3.54
> full.AESGCMBench.encrypt | 512 | 2658008.476 | 2882496.94 | 8.45
> full.AESGCMBench.encrypt | 1024 | 2283709.63 | 2589534.403 | 13.39
> full.AESGCMBench.encrypt | 1500 | 1794993.519 | 1817669.531 | 1.26
> full.AESGCMBench.encrypt | 2048 | 1745532.435 | 2191097.29 | 25.52
> full.AESGCMBench.encrypt | 4096 | 1203301.174 | 1649593.953 | 37.08
> full.AESGCMBench.encrypt | 8192 | 985174.988 | 1132407.54 | 14.94
> full.AESGCMBench.encrypt | 16384 | 658980.441 | 684765.771 | 3.91
> full.AESGCMBench.encrypt | 32768 | 373543.798 | 391518.837 | 4.81
> full.AESGCMBench.encrypt | 65536 | 202532.315 | 205084.833 | 1.26

src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 3371: > 3369: //B00_03, B04_07, B08_11, B12_15 overwritten with shuffled cipher text > 3370: __ bind(cont); > 3371: if (no_ghash) { We always call `initial_blocks_16_avx512` with `no_ghash` set to false, so we could remove the `no_ghash` parameter and the code that handles `no_ghash` being true. The `GHASH` parameter would then also no longer be required.
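The %Gain column in the benchmark table above is consistent with (Patch - BaseJDK) / BaseJDK × 100, rounded to two decimals. That formula is inferred from the figures, not stated in the mail. A quick Java check against a few rows, using illustrative helper names:

```java
// Recomputes the %Gain column from the ops/s figures (illustrative helper names).
public final class GainCheck {
    static double gainPercent(double base, double patch) {
        return (patch - base) / base * 100.0;
    }

    // Round to two decimals, matching the precision used in the table.
    static double round2(double v) {
        return Math.round(v * 100.0) / 100.0;
    }

    public static void main(String[] args) {
        System.out.println(round2(gainPercent(2928259.197, 3269964.387))); // 11.67 (decrypt, 512)
        System.out.println(round2(gainPercent(1275108.345, 1806329.066))); // 41.66 (decrypt, 4096)
        System.out.println(round2(gainPercent(202532.315, 205084.833)));   // 1.26 (encrypt, 65536)
    }
}
```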
src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 3454: > 3452: __ evmovdquq(ADD_1234, ExternalAddress(counter_mask_add_1234_addr()), Assembler::AVX_512bit, rbx /*rscratch*/); > 3453: > 3454: //Shuffle counter, Broadcast counter value to 512 bit register and subtract 1 from the pre-incremented counter value The comment should be: // Shuffle counter, subtract 1 from the pre-incremented counter value, and broadcast counter value to 512 bit register src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 3480: > 3478: __ cmpl(len, 2 * 32 * 16); > 3479: __ jcc(Assembler::below, ENCRYPT_BIG_NBLKS); > 3480: ghash16_encrypt_parallel16_avx512(in, out, ct, pos, avx512_subkeyHtbl, CTR_CHECK, rounds, key, true, true, false, false, false, ghashin_offset, aesout_offset, HashKey_32); The call to `ghash16_encrypt_parallel16_avx512` needs to pass in CTL_BLOCKx, AAD_HASHx, ADDBE_4x4, ADDBE_1234, AAD_HASHx, SHUF_MASK, just as we do for `initial_blocks_16_avx512`. GL and GH also need to be passed in. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1717646768 PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1717518011 PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1717678309 From syan at openjdk.org Thu Aug 15 01:12:52 2024 From: syan at openjdk.org (SendaoYan) Date: Thu, 15 Aug 2024 01:12:52 GMT Subject: RFR: 8338344: Test TestPrivilegedMode.java intermittent fails java.lang.NoClassDefFoundError: jdk/test/lib/Platform [v2] In-Reply-To: References: Message-ID: On Wed, 14 Aug 2024 12:17:02 GMT, SendaoYan wrote: >> Hi, >> The test `test/hotspot/jtreg/testlibrary_tests/ir_framework/tests/TestPrivilegedMode.java` always fails in our daily CI with fastdebug builds, but I can't reproduce this failure standalone.
>> When the test fails with `java.lang.NoClassDefFoundError: jdk/test/lib/Platform`, `jdk/test/lib/Platform.class` is located in `/var/tmp/tone/run/jtreg/jt-work/jtreg/hotspot_jtreg/classes/5/test/lib`, but `bootClassPath` is [set to](https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/lib/ir_framework/driver/TestVMProcess.java#L103) `/var/tmp/tone/run/jtreg/jt-work/jtreg/hotspot_jtreg/classes/5/testlibrary_tests/ir_framework/tests/TestPrivilegedMode.d`: >> >> bootClassPath += File.pathSeparator + Utils.TEST_CLASSES; >> >> >> I think `bootClassPath` should be set to `Utils.TEST_CLASS_PATH` instead. > > SendaoYan has updated the pull request incrementally with one additional commit since the last revision: > > delete "@build jdk.test.lib.Platform" Thanks for sponsoring. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20576#issuecomment-2290205735 From jkarthikeyan at openjdk.org Thu Aug 15 03:03:49 2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Thu, 15 Aug 2024 03:03:49 GMT Subject: RFR: 8338021: Support saturating vector operators in VectorAPI [v2] In-Reply-To: References: Message-ID: On Mon, 12 Aug 2024 06:29:03 GMT, Jatin Bhateja wrote: > its usage in the existing patch is limited to [type comparison.](https://github.com/openjdk/jdk/pull/20507/files#diff-3559dcf23b719805be5fd06fd5c1851dbd8f53e47afe6d99cba13a3de0ebc6b2R1542) Ah, that makes sense to me. I took a closer look and I think since the patch is creating a `VectorReinterpret` node after unsigned vector nodes, it might be able to avoid cases where the type might get filtered/joined, like with `PhiNode::Value`. That might lead to errors since `empty_type->filter(other_type) == TOP`. It's unfortunate that it's not really possible to disambiguate between an empty type and an unsigned range, which would allow us to solve this elegantly.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1717820103 From enikitin at openjdk.org Thu Aug 15 05:38:53 2024 From: enikitin at openjdk.org (Evgeny Nikitin) Date: Thu, 15 Aug 2024 05:38:53 GMT Subject: Integrated: 8337102: JITTester: Fix breaks in static initialization blocks In-Reply-To: <8JfbPJYkBr00hKBUCngwNu5JM_Q-aPRnV2xsvUmhmTA=.2d5cda28-83ec-4236-9609-74391591f4d8@github.com> References: <8JfbPJYkBr00hKBUCngwNu5JM_Q-aPRnV2xsvUmhmTA=.2d5cda28-83ec-4236-9609-74391591f4d8@github.com> Message-ID: On Wed, 24 Jul 2024 10:28:54 GMT, Evgeny Nikitin wrote: > Static initialization blocks (SIBs) should not contain `break` statements, either directly or in their descendants (outside of loops). Currently, StaticConstructorDefinitionFactory allows them, causing non-compilable constructions like this: > > > class Test_0 { > static { > if (true) { > break; // <- compilation error here > } > } > } > > > It seems that an attempt was previously made to resolve this by disabling SIBs altogether. > > Currently, out of 100 generated tests we have 2-3 compilation errors. > Allowing SIBs raises this to 80 out of 100 tests failing due to erroneous `break` statements. > Disabling breaks in StaticConstructorDefinition gives us SIBs and returns the failure rate to the same 2-3%. > Disabling breaks in StaticConstructorDefinition does not prevent breaks from appearing inside loops, as loop factories (`ForFactory`, `WhileFactory`, etc.) explicitly allow breaks in their descendant trees. > > Testing: > 1. 200-300 generations in various setups to get the numbers mentioned above; > 2. I checked manually that breaks do not disappear from code, > 3. ... and appear in loops' (for, while, do-while) descendants. This pull request has now been integrated.
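The distinction the fix relies on is a Java language rule, illustrated here outside JITTester. An unlabeled `break` with no enclosing loop or `switch` is a compile-time error, including inside a static initializer. A `break` nested in a loop within that same block is legal, which is why loop factories may still emit breaks in their descendants. The class below is a hypothetical example; it compiles as written, but moving the `break` outside the loop would not.

```java
// Illustrative only: an unlabeled break is legal in a static initializer
// as long as it is nested inside a loop (or switch) within that block.
public final class SibBreakExample {
    static int firstMultipleOfSeven;

    static {
        for (int i = 1; ; i++) {
            if (i % 7 == 0) {
                firstMultipleOfSeven = i;
                break; // OK: exits the for loop, not the initializer
            }
        }
        // A bare `break;` here, outside any loop or switch, would be rejected by javac.
    }

    public static void main(String[] args) {
        System.out.println(firstMultipleOfSeven); // prints 7
    }
}
```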
Changeset: 4669e7b7 Author: Evgeny Nikitin Committer: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/4669e7b7b02636a8bd7381a9d401aaaf0c1d7294 Stats: 3 lines in 2 files changed: 0 ins; 0 del; 3 mod 8337102: JITTester: Fix breaks in static initialization blocks Reviewed-by: kvn, iveresov ------------- PR: https://git.openjdk.org/jdk/pull/20310 From chagedorn at openjdk.org Thu Aug 15 06:09:52 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 15 Aug 2024 06:09:52 GMT Subject: RFR: 8335444: Generalize implementation of AndNode mul_ring [v2] In-Reply-To: References: <7l_zt0Q884dkzJbP5kjhfvZPmFQ4CMRgakWoUCopfvw=.fc1307d0-6a2c-40cc-adc6-c1b21424d0dd@github.com> Message-ID: On Wed, 7 Aug 2024 01:16:47 GMT, Jasmine Karthikeyan wrote: > Hmm, that failure is quite peculiar, since now it seems we're failing because we're creating extra `MacroLogicV` nodes rather than failing because we're not creating any. Unfortunately, I didn't have much luck debugging the root cause since I don't have access to AVX-512 hardware. I've changed the IR check to a phase before `MacroLogicV` nodes are created, which should hopefully fix the failure. Even if you don't have access to real hardware, you could try to emulate it with [Intel SDE](https://www.intel.com/content/www/us/en/developer/articles/tool/software-development-emulator.html) to see what's going on. Might be worth a shot. Nevertheless, I will have a look at your patch again in closer detail tomorrow or hopefully on Monday. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/20066#issuecomment-2290733900 From thartmann at openjdk.org Thu Aug 15 06:40:53 2024 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 15 Aug 2024 06:40:53 GMT Subject: RFR: 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug" [v5] In-Reply-To: References: <16J-lJ2AceGTVcRWBcP15yKcwO-1IA1XsngyOuNjf7k=.0776f081-ae2c-4279-87cf-d909806c2bc4@github.com> Message-ID: On Mon, 29 Jul 2024 13:05:46 GMT, Fei Gao wrote: >> On LP64 systems, if the heap can be moved into low virtual address space (below 4GB) and the heap size is smaller than the interesting threshold of 4 GB, we can use unscaled decoding pattern for narrow klass decoding. It means that a generic field reference can be decoded by: >> >> cast<64> (32-bit compressed reference) + field_offset >> >> >> When the `field_offset` is an immediate, on aarch64 platform, the unscaled decoding pattern can match perfectly with a direct addressing mode, i.e., `base_plus_offset`, supported by `LDR/STR` instructions. But for certain data width, not all immediates can be encoded in the instruction field of `LDR/STR` [[1]](https://github.com/openjdk/jdk/blob/8db7bad992a0f31de9c7e00c2657c18670539102/src/hotspot/cpu/aarch64/assembler_aarch64.inline.hpp#L33). The ranges are different as data widths vary. >> >> For example, when we try to load a value of long type at offset of `1030`, the address expression is `(AddP (DecodeN base) 1030)`. Before the patch, the expression was matching with `operand indOffIN()`. But, for 64-bit `LDR/STR`, signed immediate byte offset must be in the range -256 to 255 or positive immediate byte offset must be a multiple of 8 in the range 0 to 32760 [[2]](https://developer.arm.com/documentation/ddi0602/2023-09/Base-Instructions/LDR--immediate---Load-Register--immediate--?lang=en). `1030` can't be encoded in the instruction field. 
So, after matching, when we do checking for instruction encoding, the assertion would fail. >> >> In this patch, we're going to filter out invalid immediates when deciding if current addressing mode can be matched as `base_plus_offset`. We introduce `indOffIN4/indOffLN4` and `indOffIN8/indOffLN8` for 32-bit data type and 64-bit data type separately in the patch. E.g., for `memory4`, we remove the generic `indOffIN/indOffLN`, which matches wrong unscaled immediate range, and replace them with `indOffIN4/indOffLN4` instead. >> >> Since 8-bit and 16-bit `LDR/STR` instructions also support the unscaled decoding pattern, we add the addressing mode in the lists of `memory1` and `memory2` by introducing `indOffIN1/indOffLN1` and `indOffIN2/indOffLN2`. >> >> We also remove unused operands `indOffI/indOffl/indOffIN/indOffLN` to avoid misuse. >> >> Tier 1-3 passed on aarch64. > > Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains eight commits: > > - Merge branch 'master' into fg8319690 > - Discard IndOffXX style and let legitimize_address() fix any out-of-range immediate offsets > - Merge branch 'master' into fg8319690 > - Add the assertion back and merge matchrules with a better predicate > - Merge branch 'master' into fg8319690 > - Remove unused immIOffset/immLOffset > - Merge branch 'master' into fg8319690 > - 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug" > > On LP64 systems, if the heap can be moved into low virtual > address space (below 4GB) and the heap size is smaller than the > interesting threshold of 4 GB, we can use unscaled decoding > pattern for narrow klass decoding. 
It means that a generic field > reference can be decoded by: > ``` > cast<64> (32-bit compressed reference) + field_offset > ``` > > When the `field_offset` is an immediate, on aarch64 platform, the > unscaled decoding pattern can match perfectly with a direct > addressing mode, i.e., `base_plus_offset`, supported by LDR/STR > instructions. But for certain data width, not all immediates can > be encoded in the instruction field of LDR/STR[1]. The ranges are > different as data widths vary. > > For example, when we try to load a value of long type at offset of > `1030`, the address expression is `(AddP (DecodeN base) 1030)`. > Before the patch, the expression was matching with > `operand indOffIN()`. But, for 64-bit LDR/STR, signed immediate > byte offset must be in the range -256 to 255 or positive immediate > byte offset must be a multiple of 8 in the range 0 to 32760[2]. > `1030` can't be encoded in the instruction field. So, after > matching, when we do checking for instruction encoding, the > assertion would fail. > > In this patch, we're going to filter out invalid immediates > when deciding if current addressing mode can be matched as > `base_plus_offset`. We introduce `indOffIN4/indOffLN4` and > `indOffIN8/indOffLN8` for 32-bit data type and 64-bit data > type separately in the patch. E.g., for `memory4`, we remove > the generic `indOffIN/indOffLN`, which matches wrong unscaled > immediate range, and replace them with `indOffIN4/indOffLN4` > instead. > > Since 8-bit and 16-bit LDR/STR instructions also support the > unscaled decoding pattern, we add the addressing mode in the > lists of `memory1` and `memory... This PR is still referring to [JDK-8319690](https://bugs.openjdk.org/browse/JDK-8319690) which got integrated. Should this refer to a new RFE? 
------------- PR Comment: https://git.openjdk.org/jdk/pull/16991#issuecomment-2290760909 From epeter at openjdk.org Thu Aug 15 06:41:56 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 15 Aug 2024 06:41:56 GMT Subject: RFR: 8327381: Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v11] In-Reply-To: <8gXt0Adth6IJw_QitnY53W_4Ouup1DuUHZmkoM8ytuY=.ccb05c52-cdec-4d89-a993-c005b8aa3d0d@github.com> References: <8gXt0Adth6IJw_QitnY53W_4Ouup1DuUHZmkoM8ytuY=.ccb05c52-cdec-4d89-a993-c005b8aa3d0d@github.com> Message-ID: On Mon, 8 Jul 2024 16:46:33 GMT, Kangcheng Xu wrote: >> Kangcheng Xu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 20 additional commits since the last revision: >> >> - Merge branch 'master' into boolnode-refactor >> - update test values, @run directive, and remove an empty line >> - Merge branch 'master' into boolnode-refactor >> - move test location, add negative test case, simplify imports >> - Merge branch 'master' into boolnode-refactor >> - refactor BoolNode::Value() and extract code to ::Value_cmpu_and_mask >> - update comments >> - fix indentation again >> - apply test only on x64, aarch64 and riscv64 >> - also renames the class name in @run >> - ... and 10 more: https://git.openjdk.org/jdk/compare/d3817351...715a6304 > > I pushed a commit that splits test cases compounded with `&` and `|` into separate subcases, to prevent semantically equivalent ones from being optimized out and to make the test clearer. To make the test pass even with `CmpU3` miscounted as `CmpU`, I specified `counts = {IRNode.CMP_U, ">=1"}` instead of strictly `1`. I hope this is acceptable.
> > --- > > A case illustrating `CmpU3` matched as `CmpU`: > > > @Test > @Arguments(values = {Argument.DEFAULT, Argument.DEFAULT}) > @IR(counts = {IRNode.CMP_U, "1"}, // <-- expecting strictly 1 > phase = CompilePhase.AFTER_PARSING, > applyIfPlatformOr = {"x64", "true", "aarch64", "true", "riscv64", "true"}) > public static boolean testShouldHaveCpmUCase1(int x, int m) { > return !(Integer.compareUnsigned((x & m), m - 1) > 0); > } > > > > 1) Method "public static boolean compiler.c2.gvn.TestBoolNodeGVN.testShouldHaveCpmUCase1(int,int)" - [Failed IR rules: 1]: > * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={AFTER_PARSING}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#CMP_U#_", "1"}, applyIfPlatform={}, failOn={}, applyIfPlatformOr={"x64", "true", "aarch64", "true", "riscv64", "true"}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})" > > Phase "After Parsing": > - counts: Graph contains wrong number of nodes: > * Constraint 1: "(\d+(\s){2}(CmpU.*)+(\s){2}===.*)" > - Failed comparison: [found] 2 = 1 [given] > - Matched nodes (2): > mismatched--> * 28 CmpU3 === _ 23 27 [[ 39 ]] !jvms: TestBoolNodeGVN::testShouldHaveCpmUCase1 @ bci:6 (line 93) > * 31 CmpU === _ 23 27 [[ 32 ]] !jvms: TestBoolNodeGVN::testShouldHaveCpmUCase1 @ bci:9 (line 93) @tabjy Are you planning to keep working on this? I talked with @chhagedorn and we would like you to change the IR rule to something like `@IR(counts = {IRNode.CMP_U + "\\b", "1"})`. In a later, separate RFE, we can then adjust the regex for all nodes in a bulk update.
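The effect of a word boundary can be checked with plain `java.util.regex`. The loose pattern below is the constraint regex printed in the failure above. The bounded variant adds `\b` after `CmpU`. That is a hypothetical adjustment: it does not show how the IR framework would actually splice a suffix onto `IRNode.CMP_U`. Both are applied to node lines shaped like the dump. In Java source the regex word boundary has to be written `"\\b"`, because the literal `"\b"` is a backspace character.

```java
import java.util.List;
import java.util.regex.Pattern;

// Compares the printed constraint regex with a word-bounded variant (hypothetical).
public final class CmpURegex {
    static long count(Pattern p, List<String> lines) {
        return lines.stream().filter(l -> p.matcher(l).find()).count();
    }

    public static void main(String[] args) {
        // Node lines shaped like the IR dump: id, two spaces, name, two spaces, "===".
        List<String> lines = List.of(
                "28  CmpU3  === _ 23 27 [[ 39 ]]",
                "31  CmpU  === _ 23 27 [[ 32 ]]");

        Pattern loose = Pattern.compile("(\\d+(\\s){2}(CmpU.*)+(\\s){2}===.*)");
        Pattern bounded = Pattern.compile("(\\d+(\\s){2}(CmpU\\b.*)+(\\s){2}===.*)");

        System.out.println(count(loose, lines));   // 2: CmpU3 is matched too
        System.out.println(count(bounded, lines)); // 1: no word boundary between "U" and "3"
    }
}
```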
------------- PR Comment: https://git.openjdk.org/jdk/pull/18198#issuecomment-2290761682 From epeter at openjdk.org Thu Aug 15 06:47:54 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 15 Aug 2024 06:47:54 GMT Subject: RFR: 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug" [v5] In-Reply-To: References: <16J-lJ2AceGTVcRWBcP15yKcwO-1IA1XsngyOuNjf7k=.0776f081-ae2c-4279-87cf-d909806c2bc4@github.com> Message-ID: On Thu, 15 Aug 2024 06:38:28 GMT, Tobias Hartmann wrote: >> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains eight commits: >> >> - Merge branch 'master' into fg8319690 >> - Discard IndOffXX style and let legitimize_address() fix any out-of-range immediate offsets >> - Merge branch 'master' into fg8319690 >> - Add the assertion back and merge matchrules with a better predicate >> - Merge branch 'master' into fg8319690 >> - Remove unused immIOffset/immLOffset >> - Merge branch 'master' into fg8319690 >> - 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug" >> >> On LP64 systems, if the heap can be moved into low virtual >> address space (below 4GB) and the heap size is smaller than the >> interesting threshold of 4 GB, we can use unscaled decoding >> pattern for narrow klass decoding. It means that a generic field >> reference can be decoded by: >> ``` >> cast<64> (32-bit compressed reference) + field_offset >> ``` >> >> When the `field_offset` is an immediate, on aarch64 platform, the >> unscaled decoding pattern can match perfectly with a direct >> addressing mode, i.e., `base_plus_offset`, supported by LDR/STR >> instructions. But for certain data width, not all immediates can >> be encoded in the instruction field of LDR/STR[1]. The ranges are >> different as data widths vary. >> >> For example, when we try to load a value of long type at offset of >> `1030`, the address expression is `(AddP (DecodeN base) 1030)`. 
>> Before the patch, the expression was matching with >> `operand indOffIN()`. But, for 64-bit LDR/STR, signed immediate >> byte offset must be in the range -256 to 255 or positive immediate >> byte offset must be a multiple of 8 in the range 0 to 32760[2]. >> `1030` can't be encoded in the instruction field. So, after >> matching, when we do checking for instruction encoding, the >> assertion would fail. >> >> In this patch, we're going to filter out invalid immediates >> when deciding if current addressing mode can be matched as >> `base_plus_offset`. We introduce `indOffIN4/indOffLN4` and >> `indOffIN8/indOffLN8` for 32-bit data type and 64-bit data >> type separately in the patch. E.g., for `memory4`, we remove >> the generic `indOffIN/indOffLN`, which matches wrong unscaled >> immediate range, and replace them with `indOffIN4/indOffLN4` >> instead. >> >> Since 8-bit and 16-bit LDR/STR instructions also support the >> ... > > This PR is still referring to [JDK-8319690](https://bugs.openjdk.org/browse/JDK-8319690) which got integrated. Should this refer to a new RFE? @TobiHartmann @fg1417 Yes, the bug as such is already "fixed" (I had removed the assert because it was not necessary for correctness). I guess this should be an RFE. Either as a **cleanup** or **performance improvement** (if it can be measured). ------------- PR Comment: https://git.openjdk.org/jdk/pull/16991#issuecomment-2290767815 From thartmann at openjdk.org Thu Aug 15 06:48:54 2024 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 15 Aug 2024 06:48:54 GMT Subject: RFR: 8336830: C2: assert(get_loop(lca)->_nest < n_loop->_nest || lca->in(0)->is_NeverBranch()) failed: must not be moved into inner loop In-Reply-To: References: Message-ID: On Thu, 25 Jul 2024 15:16:39 GMT, Roland Westrelin wrote: > A store is sunk from a counted loop into an enclosing infinite > loop. The assert fires because: > > > get_loop(lca)->_nest < n_loop->_nest > > > is false. 
That happens because the outer loop was found to be infinite > in the current loop opts pass. When that happens, it's not properly > attached to the loop tree. The second part of the assert was added to > cover a similar case: > > > lca->in(0)->is_NeverBranch() > > > but it doesn't work in this case because lca is not a projection of the > `NeverBranch`. It's the exit projection of the counted loop. The fix I > propose changes that part of the assert to test that lca is, indeed, > in an infinite loop in a way that's robust. > > I also removed some code that I believe to be useless following > 8335709. Looks good to me too. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/20334#pullrequestreview-2239816039 From jbhateja at openjdk.org Thu Aug 15 07:02:53 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 15 Aug 2024 07:02:53 GMT Subject: RFR: 8338021: Support saturating vector operators in VectorAPI [v2] In-Reply-To: References: Message-ID: On Thu, 15 Aug 2024 03:01:00 GMT, Jasmine Karthikeyan wrote: >> @jaskarth , its usage in existing patch is limited to [type comparison.](https://github.com/openjdk/jdk/pull/20507/files#diff-3559dcf23b719805be5fd06fd5c1851dbd8f53e47afe6d99cba13a3de0ebc6b2R1542). >> >> My plan is to address intrinsification of new core lib APIs, associated value range folding optimization (since unsigned numbers have different value range of [0, MAX_VALUE) vs signed [-MIN_VALUE/2, +MAX_VALUE/2) numbers) and auto-vectorization in a follow up patch. >> >> **Notes on C2 type system:** >> Unlike Type::FLOAT, integral type ranges are specified using _lo and _hi value range, these ranges are pruned using flow functions associated with each operation IR. Constraining the value ranges allows logic pruning, e.g. in1[TypeInt] & 0x7FFFFFFF will chop off negative value ranges from in1, thus a control structure like `if (in1 < 0) { true_path; } else { false_path; }`, which uses in1 as a flow condition, will sweep out the true path.
>> >> C2 type system only maintains value ranges for integral types i.e. long and int; any sub-word type, which per the JVM specification has an int storage "word", only constrains the value range of TypeInt. >> >> A type which represents a constant value has the same _hi and _lo value. >> >> Floating point types Type::FLOAT / DOUBLE cannot maintain upper / lower value ranges due to rounding constraints. >> Thus C2 type system maintains separate types TypeF and TypeD which are singletons and represent a constant value. > >> its usage in existing patch is limited to [type comparison.](https://github.com/openjdk/jdk/pull/20507/files#diff-3559dcf23b719805be5fd06fd5c1851dbd8f53e47afe6d99cba13a3de0ebc6b2R1542) > > Ah, that makes sense to me. I took a closer look and I think since the patch is creating a `VectorReinterpret` node after unsigned vector nodes, it might be able to avoid cases where the type might get filtered/joined, like with `PhiNode::Value`. That might lead to errors since `empty_type->filter(other_type) == TOP`. It's unfortunate that it's not really possible to disambiguate between an empty type and an unsigned range, which would allow us to solve this elegantly. @jaskarth, the central idea behind introducing VectorReinterpretNode after unsigned vector IR is to facilitate the unboxing-boxing optimization; this explicit reinterpretation ensures type compatibility between the value being boxed and the box type, which is always a signed vector type.
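The range-pruning idea in the notes on the C2 type system can be illustrated with a toy interval model. This is a minimal sketch and not C2's `TypeInt` API. The names `IntRange`, `make_range`, `and_range` and `fold_less_than_zero` are hypothetical, and the AND rule is a simplified stand-in for the kind of flow function the notes describe. Masking with `0x7FFFFFFF` yields a non-negative range, so a `< 0` test on the masked value folds to false and its true path becomes dead.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Toy model of an integer type as a closed value range [lo, hi].
// Hypothetical: C2's real TypeInt carries more state than this.
struct IntRange {
  int32_t lo;
  int32_t hi;
  bool is_con() const { return lo == hi; }  // a constant has lo == hi
};

IntRange make_range(int32_t lo, int32_t hi) {
  assert(lo <= hi);  // empty ranges are not modeled in this sketch
  return {lo, hi};
}

// Simplified flow function for (a & b): if either operand is known to be
// non-negative, the result is non-negative and no larger than that operand.
IntRange and_range(IntRange a, IntRange b) {
  if (a.lo >= 0 && b.lo >= 0) return {0, std::min(a.hi, b.hi)};
  if (a.lo >= 0) return {0, a.hi};
  if (b.lo >= 0) return {0, b.hi};
  return {INT32_MIN, INT32_MAX};  // no useful bound in this toy model
}

enum class Fold { AlwaysTrue, AlwaysFalse, Unknown };

// Decide "x < 0" purely from the range of x.
Fold fold_less_than_zero(IntRange x) {
  if (x.hi < 0) return Fold::AlwaysTrue;
  if (x.lo >= 0) return Fold::AlwaysFalse;  // the true path is dead
  return Fold::Unknown;
}
```

The mask `0x7FFFFFFF` is modeled as a range with `lo == hi`, which matches the remark that a constant has equal `_lo` and `_hi`.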
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1718044262 From chagedorn at openjdk.org Thu Aug 15 11:17:19 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 15 Aug 2024 11:17:19 GMT Subject: RFR: 8336729: C2: Div/Mod nodes without zero check could be split through iv phi of outer loop of long counted loop nest resulting in SIGFPE Message-ID: The previous fix for preventing `Div/Mod` nodes to be split through iv phis when their divisor could become zero (see [JDK-8299259](https://bugs.openjdk.org/browse/JDK-8299259)) is not complete. Originally, we thought that it is enough to prevent a split through the phi if it belongs to a `BaseCountedLoop`: https://github.com/openjdk/jdk/blob/da7311bbe37c2b9632b117d52a77c659047820b7/src/hotspot/share/opto/loopopts.cpp#L302-L304 The reasoning behind this was that the iv phi type of `BaseCountedLoops` can be improved in such a way that it does not contain zero but the backedge input can include zero: https://github.com/openjdk/jdk/blob/da7311bbe37c2b9632b117d52a77c659047820b7/test/hotspot/jtreg/compiler/splitif/TestSplitDivisionThroughPhi.java#L69-L78 This optimization is not possible for `LoopNodes` where we know nothing about the number of iterations. However, there was an oversight: A `LongCountedLoop` is later split into an inner and an outer `LoopNode` where the inner loop is transformed into a `CountedLoop` while the outer loop stays a `LoopNode`. Both loops share the same optimized iv phi type as the original `LongCountedLoop`! Therefore, we can have the same situation as fixed in JDK-8299259 but with a `LoopNode` instead of a `CountedLoopNode` (see https://github.com/openjdk/jdk/pull/11900 for more details). The simplest way to fix this is to extend the bailout fix of JDK-8299259 to not only apply to `BaseCountedLoopNodes` but to `LoopNodes` in general which is what I propose with this patch. 
Thanks to @eme64 for extracting the original simpler reproducer to reproduce this problem easier. I've added an even simpler non-jasm reproducer in this patch in favor of the original jasm reproducer. Thanks, Christian ------------- Commit messages: - 8336729: C2: Div/Mod nodes without zero check could be split through iv phi of outer loop of long counted loop nest resulting in SIGFPE Changes: https://git.openjdk.org/jdk/pull/20594/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=20594&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8336729 Stats: 50 lines in 3 files changed: 45 ins; 0 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/20594.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20594/head:pull/20594 PR: https://git.openjdk.org/jdk/pull/20594 From thartmann at openjdk.org Thu Aug 15 11:21:48 2024 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 15 Aug 2024 11:21:48 GMT Subject: RFR: 8336729: C2: Div/Mod nodes without zero check could be split through iv phi of outer loop of long counted loop nest resulting in SIGFPE In-Reply-To: References: Message-ID: On Thu, 15 Aug 2024 11:11:54 GMT, Christian Hagedorn wrote: > The previous fix for preventing `Div/Mod` nodes to be split through iv phis when their divisor could become zero (see [JDK-8299259](https://bugs.openjdk.org/browse/JDK-8299259)) is not complete. 
Originally, we thought that it is enough to prevent a split through the phi if it belongs to a `BaseCountedLoop`: > https://github.com/openjdk/jdk/blob/da7311bbe37c2b9632b117d52a77c659047820b7/src/hotspot/share/opto/loopopts.cpp#L302-L304 > > The reasoning behind this was that the iv phi type of `BaseCountedLoops` can be improved in such a way that it does not contain zero but the backedge input can include zero: > https://github.com/openjdk/jdk/blob/da7311bbe37c2b9632b117d52a77c659047820b7/test/hotspot/jtreg/compiler/splitif/TestSplitDivisionThroughPhi.java#L69-L78 > > This optimization is not possible for `LoopNodes` where we know nothing about the number of iterations. > > However, there was an oversight: A `LongCountedLoop` is later split into an inner and an outer `LoopNode` where the inner loop is transformed into a `CountedLoop` while the outer loop stays a `LoopNode`. Both loops share the same optimized iv phi type as the original `LongCountedLoop`! Therefore, we can have the same situation as fixed in JDK-8299259 but with a `LoopNode` instead of a `CountedLoopNode` (see https://github.com/openjdk/jdk/pull/11900 for more details). > > The simplest way to fix this is to extend the bailout fix of JDK-8299259 to not only apply to `BaseCountedLoopNodes` but to `LoopNodes` in general which is what I propose with this patch. > > Thanks to @eme64 for extracting the original simpler reproducer to reproduce this problem easier. I've added an even simpler non-jasm reproducer in this patch in favor of the original jasm reproducer. > > Thanks, > Christian That looks good to me. ------------- Marked as reviewed by thartmann (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/20594#pullrequestreview-2240184428 From chagedorn at openjdk.org Thu Aug 15 11:33:47 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 15 Aug 2024 11:33:47 GMT Subject: RFR: 8336729: C2: Div/Mod nodes without zero check could be split through iv phi of outer loop of long counted loop nest resulting in SIGFPE In-Reply-To: References: Message-ID: On Thu, 15 Aug 2024 11:11:54 GMT, Christian Hagedorn wrote: > The previous fix for preventing `Div/Mod` nodes to be split through iv phis when their divisor could become zero (see [JDK-8299259](https://bugs.openjdk.org/browse/JDK-8299259)) is not complete. Originally, we thought that it is enough to prevent a split through the phi if it belongs to a `BaseCountedLoop`: > https://github.com/openjdk/jdk/blob/da7311bbe37c2b9632b117d52a77c659047820b7/src/hotspot/share/opto/loopopts.cpp#L302-L304 > > The reasoning behind this was that the iv phi type of `BaseCountedLoops` can be improved in such a way that it does not contain zero but the backedge input can include zero: > https://github.com/openjdk/jdk/blob/da7311bbe37c2b9632b117d52a77c659047820b7/test/hotspot/jtreg/compiler/splitif/TestSplitDivisionThroughPhi.java#L69-L78 > > This optimization is not possible for `LoopNodes` where we know nothing about the number of iterations. > > However, there was an oversight: A `LongCountedLoop` is later split into an inner and an outer `LoopNode` where the inner loop is transformed into a `CountedLoop` while the outer loop stays a `LoopNode`. Both loops share the same optimized iv phi type as the original `LongCountedLoop`! Therefore, we can have the same situation as fixed in JDK-8299259 but with a `LoopNode` instead of a `CountedLoopNode` (see https://github.com/openjdk/jdk/pull/11900 for more details). 
> > The simplest way to fix this is to extend the bailout fix of JDK-8299259 to not only apply to `BaseCountedLoopNodes` but to `LoopNodes` in general which is what I propose with this patch. > > Thanks to @eme64 for extracting the original simpler reproducer to reproduce this problem easier. I've added an even simpler non-jasm reproducer in this patch in favor of the original jasm reproducer. > > Thanks, > Christian Thanks Tobias for your review! ------------- PR Comment: https://git.openjdk.org/jdk/pull/20594#issuecomment-2291109691 From epeter at openjdk.org Thu Aug 15 11:48:49 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 15 Aug 2024 11:48:49 GMT Subject: RFR: 8336729: C2: Div/Mod nodes without zero check could be split through iv phi of outer loop of long counted loop nest resulting in SIGFPE In-Reply-To: References: Message-ID: On Thu, 15 Aug 2024 11:11:54 GMT, Christian Hagedorn wrote: > The previous fix for preventing `Div/Mod` nodes to be split through iv phis when their divisor could become zero (see [JDK-8299259](https://bugs.openjdk.org/browse/JDK-8299259)) is not complete. Originally, we thought that it is enough to prevent a split through the phi if it belongs to a `BaseCountedLoop`: > https://github.com/openjdk/jdk/blob/da7311bbe37c2b9632b117d52a77c659047820b7/src/hotspot/share/opto/loopopts.cpp#L302-L304 > > The reasoning behind this was that the iv phi type of `BaseCountedLoops` can be improved in such a way that it does not contain zero but the backedge input can include zero: > https://github.com/openjdk/jdk/blob/da7311bbe37c2b9632b117d52a77c659047820b7/test/hotspot/jtreg/compiler/splitif/TestSplitDivisionThroughPhi.java#L69-L78 > > This optimization is not possible for `LoopNodes` where we know nothing about the number of iterations. 
> > However, there was an oversight: A `LongCountedLoop` is later split into an inner and an outer `LoopNode` where the inner loop is transformed into a `CountedLoop` while the outer loop stays a `LoopNode`. Both loops share the same optimized iv phi type as the original `LongCountedLoop`! Therefore, we can have the same situation as fixed in JDK-8299259 but with a `LoopNode` instead of a `CountedLoopNode` (see https://github.com/openjdk/jdk/pull/11900 for more details). > > The simplest way to fix this is to extend the bailout fix of JDK-8299259 to not only apply to `BaseCountedLoopNodes` but to `LoopNodes` in general which is what I propose with this patch. > > Thanks to @eme64 for extracting the original simpler reproducer to reproduce this problem easier. I've added an even simpler non-jasm reproducer in this patch in favor of the original jasm reproducer. > > Thanks, > Christian Marked as reviewed by epeter (Reviewer). test/hotspot/jtreg/compiler/splitif/TestSplitDivisionThroughPhi.java line 105: > 103: } > 104: > 105: // Fixed with JDK-8336792. Do you want to add this bug-number to the `@bug` above? ------------- PR Review: https://git.openjdk.org/jdk/pull/20594#pullrequestreview-2240216094 PR Review Comment: https://git.openjdk.org/jdk/pull/20594#discussion_r1718298025 From chagedorn at openjdk.org Thu Aug 15 11:59:04 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 15 Aug 2024 11:59:04 GMT Subject: RFR: 8336729: C2: Div/Mod nodes without zero check could be split through iv phi of outer loop of long counted loop nest resulting in SIGFPE [v2] In-Reply-To: References: Message-ID: <4-nlf_ja5yRZN2A35C3qQO9ksc298jYlJ41jed5ItZ4=.2a56e71b-9113-46c9-8741-b1cb1a02c162@github.com> > The previous fix for preventing `Div/Mod` nodes to be split through iv phis when their divisor could become zero (see [JDK-8299259](https://bugs.openjdk.org/browse/JDK-8299259)) is not complete. 
Originally, we thought that it is enough to prevent a split through the phi if it belongs to a `BaseCountedLoop`: > https://github.com/openjdk/jdk/blob/da7311bbe37c2b9632b117d52a77c659047820b7/src/hotspot/share/opto/loopopts.cpp#L302-L304 > > The reasoning behind this was that the iv phi type of `BaseCountedLoops` can be improved in such a way that it does not contain zero but the backedge input can include zero: > https://github.com/openjdk/jdk/blob/da7311bbe37c2b9632b117d52a77c659047820b7/test/hotspot/jtreg/compiler/splitif/TestSplitDivisionThroughPhi.java#L69-L78 > > This optimization is not possible for `LoopNodes` where we know nothing about the number of iterations. > > However, there was an oversight: A `LongCountedLoop` is later split into an inner and an outer `LoopNode` where the inner loop is transformed into a `CountedLoop` while the outer loop stays a `LoopNode`. Both loops share the same optimized iv phi type as the original `LongCountedLoop`! Therefore, we can have the same situation as fixed in JDK-8299259 but with a `LoopNode` instead of a `CountedLoopNode` (see https://github.com/openjdk/jdk/pull/11900 for more details). > > The simplest way to fix this is to extend the bailout fix of JDK-8299259 to not only apply to `BaseCountedLoopNodes` but to `LoopNodes` in general which is what I propose with this patch. > > Thanks to @eme64 for extracting the original simpler reproducer to reproduce this problem easier. I've added an even simpler non-jasm reproducer in this patch in favor of the original jasm reproducer. 
> > Thanks, > Christian Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: add bug number ------------- Changes: - all: https://git.openjdk.org/jdk/pull/20594/files - new: https://git.openjdk.org/jdk/pull/20594/files/8ba74170..c11cee54 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20594&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20594&range=00-01 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/20594.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20594/head:pull/20594 PR: https://git.openjdk.org/jdk/pull/20594 From chagedorn at openjdk.org Thu Aug 15 11:59:05 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 15 Aug 2024 11:59:05 GMT Subject: RFR: 8336729: C2: Div/Mod nodes without zero check could be split through iv phi of outer loop of long counted loop nest resulting in SIGFPE In-Reply-To: References: Message-ID: On Thu, 15 Aug 2024 11:11:54 GMT, Christian Hagedorn wrote: > The previous fix for preventing `Div/Mod` nodes to be split through iv phis when their divisor could become zero (see [JDK-8299259](https://bugs.openjdk.org/browse/JDK-8299259)) is not complete. Originally, we thought that it is enough to prevent a split through the phi if it belongs to a `BaseCountedLoop`: > https://github.com/openjdk/jdk/blob/da7311bbe37c2b9632b117d52a77c659047820b7/src/hotspot/share/opto/loopopts.cpp#L302-L304 > > The reasoning behind this was that the iv phi type of `BaseCountedLoops` can be improved in such a way that it does not contain zero but the backedge input can include zero: > https://github.com/openjdk/jdk/blob/da7311bbe37c2b9632b117d52a77c659047820b7/test/hotspot/jtreg/compiler/splitif/TestSplitDivisionThroughPhi.java#L69-L78 > > This optimization is not possible for `LoopNodes` where we know nothing about the number of iterations. 
> > However, there was an oversight: A `LongCountedLoop` is later split into an inner and an outer `LoopNode` where the inner loop is transformed into a `CountedLoop` while the outer loop stays a `LoopNode`. Both loops share the same optimized iv phi type as the original `LongCountedLoop`! Therefore, we can have the same situation as fixed in JDK-8299259 but with a `LoopNode` instead of a `CountedLoopNode` (see https://github.com/openjdk/jdk/pull/11900 for more details). > > The simplest way to fix this is to extend the bailout fix of JDK-8299259 to not only apply to `BaseCountedLoopNodes` but to `LoopNodes` in general which is what I propose with this patch. > > Thanks to @eme64 for extracting the original simpler reproducer to reproduce this problem easier. I've added an even simpler non-jasm reproducer in this patch in favor of the original jasm reproducer. > > Thanks, > Christian Thanks Emanuel for your review! ------------- PR Comment: https://git.openjdk.org/jdk/pull/20594#issuecomment-2291137767 From chagedorn at openjdk.org Thu Aug 15 11:59:06 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 15 Aug 2024 11:59:06 GMT Subject: RFR: 8336729: C2: Div/Mod nodes without zero check could be split through iv phi of outer loop of long counted loop nest resulting in SIGFPE [v2] In-Reply-To: References: Message-ID: <7Crj7f7ZqUTht8SAPF8eWCIu7tobJQIg7F-2sQVOXT4=.54c0171b-81df-4ae9-b7fc-89add725b631@github.com> On Thu, 15 Aug 2024 11:46:05 GMT, Emanuel Peter wrote: >> Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: >> >> add bug number > > test/hotspot/jtreg/compiler/splitif/TestSplitDivisionThroughPhi.java line 105: > >> 103: } >> 104: >> 105: // Fixed with JDK-8336792. > > Do you want to add this bug-number to the `@bug` above? I've already added it for the newly added runs but I guess it does not hurt to add it to the existing ones as well. 
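Plain arithmetic shows why a zero-free phi type is not enough. Take a count-down loop such as `for (int i = init; i > 0; i--)`. The iv phi `i` never becomes zero inside the loop, so its type can be narrowed to exclude zero. The value fed back along the backedge (`i - 1`) is zero on the last iteration, though. Splitting a `Div`/`Mod` through the phi clones it for each phi input, so the backedge clone can see a zero divisor. The toy C++ below is not HotSpot code (`LoopRanges`, `observe_countdown` and `range_contains_zero` are hypothetical names). It records both ranges to make the gap concrete.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>

// Observed value ranges for a simple count-down loop (toy model, not C2 code).
struct LoopRanges {
  int phi_lo = INT_MAX, phi_hi = INT_MIN;            // values of the iv phi i
  int backedge_lo = INT_MAX, backedge_hi = INT_MIN;  // values flowing back into i
};

// Models: for (int i = init; i > 0; i--) { ... x / i ... }
LoopRanges observe_countdown(int init) {
  assert(init > 0);  // the body must run at least once for the ranges to be set
  LoopRanges r;
  for (int i = init; i > 0; i--) {
    r.phi_lo = std::min(r.phi_lo, i);
    r.phi_hi = std::max(r.phi_hi, i);
    int next = i - 1;  // backedge input of the phi
    r.backedge_lo = std::min(r.backedge_lo, next);
    r.backedge_hi = std::max(r.backedge_hi, next);
  }
  return r;
}

// Splitting a division through the phi is only safe if no phi input can be
// zero, not merely if the phi's own type excludes zero.
bool range_contains_zero(int lo, int hi) { return lo <= 0 && 0 <= hi; }
```

For `init = 10` the phi covers `[1, 10]` while the backedge input covers `[0, 9]`. The divisor check must therefore look at the inputs, which is what bailing out for every `LoopNode` achieves conservatively.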
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20594#discussion_r1718305363 From tholenstein at openjdk.org Thu Aug 15 12:09:53 2024 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Thu, 15 Aug 2024 12:09:53 GMT Subject: RFR: 8320308: C2 compilation crashes in LibraryCallKit::inline_unsafe_access In-Reply-To: References: Message-ID: On Wed, 10 Jul 2024 21:25:47 GMT, Vladimir Ivanov wrote: > > } else if (_gvn.type(base->uncast()) == TypePtr::NULL_PTR) { > > IMO a better alternative is to drop speculative part before performing the comparison: > > ``` > } else if (base_type->remove_speculative() == TypePtr::NULL_PTR) { > ``` This does not work unfortunately. type `LibraryCallKit::classify_unsafe_addr(Node* &base, ...` is called with `base` = `147 CheckCastPP` and then `base_type` is `java/lang/Object * (speculative=byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact * (inline_depth=2))` `base_type->remove_speculative()` results in `java/lang/Object *` ------------- PR Comment: https://git.openjdk.org/jdk/pull/20033#issuecomment-2291151246 From epeter at openjdk.org Thu Aug 15 12:13:20 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 15 Aug 2024 12:13:20 GMT Subject: RFR: 8336729: C2: Div/Mod nodes without zero check could be split through iv phi of outer loop of long counted loop nest resulting in SIGFPE [v3] In-Reply-To: References: Message-ID: On Thu, 15 Aug 2024 12:10:27 GMT, Christian Hagedorn wrote: >> The previous fix for preventing `Div/Mod` nodes to be split through iv phis when their divisor could become zero (see [JDK-8299259](https://bugs.openjdk.org/browse/JDK-8299259)) is not complete. 
Originally, we thought that it is enough to prevent a split through the phi if it belongs to a `BaseCountedLoop`: >> https://github.com/openjdk/jdk/blob/da7311bbe37c2b9632b117d52a77c659047820b7/src/hotspot/share/opto/loopopts.cpp#L302-L304 >> >> The reasoning behind this was that the iv phi type of `BaseCountedLoops` can be improved in such a way that it does not contain zero but the backedge input can include zero: >> https://github.com/openjdk/jdk/blob/da7311bbe37c2b9632b117d52a77c659047820b7/test/hotspot/jtreg/compiler/splitif/TestSplitDivisionThroughPhi.java#L69-L78 >> >> This optimization is not possible for `LoopNodes` where we know nothing about the number of iterations. >> >> However, there was an oversight: A `LongCountedLoop` is later split into an inner and an outer `LoopNode` where the inner loop is transformed into a `CountedLoop` while the outer loop stays a `LoopNode`. Both loops share the same optimized iv phi type as the original `LongCountedLoop`! Therefore, we can have the same situation as fixed in JDK-8299259 but with a `LoopNode` instead of a `CountedLoopNode` (see https://github.com/openjdk/jdk/pull/11900 for more details). >> >> The simplest way to fix this is to extend the bailout fix of JDK-8299259 to not only apply to `BaseCountedLoopNodes` but to `LoopNodes` in general which is what I propose with this patch. >> >> Thanks to @eme64 for extracting the original simpler reproducer to reproduce this problem easier. I've added an even simpler non-jasm reproducer in this patch in favor of the original jasm reproducer. >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > fix number Marked as reviewed by epeter (Reviewer). 
------------- PR Review: https://git.openjdk.org/jdk/pull/20594#pullrequestreview-2240243725 From chagedorn at openjdk.org Thu Aug 15 12:13:20 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 15 Aug 2024 12:13:20 GMT Subject: RFR: 8336729: C2: Div/Mod nodes without zero check could be split through iv phi of outer loop of long counted loop nest resulting in SIGFPE [v3] In-Reply-To: References: Message-ID: > The previous fix for preventing `Div/Mod` nodes to be split through iv phis when their divisor could become zero (see [JDK-8299259](https://bugs.openjdk.org/browse/JDK-8299259)) is not complete. Originally, we thought that it is enough to prevent a split through the phi if it belongs to a `BaseCountedLoop`: > https://github.com/openjdk/jdk/blob/da7311bbe37c2b9632b117d52a77c659047820b7/src/hotspot/share/opto/loopopts.cpp#L302-L304 > > The reasoning behind this was that the iv phi type of `BaseCountedLoops` can be improved in such a way that it does not contain zero but the backedge input can include zero: > https://github.com/openjdk/jdk/blob/da7311bbe37c2b9632b117d52a77c659047820b7/test/hotspot/jtreg/compiler/splitif/TestSplitDivisionThroughPhi.java#L69-L78 > > This optimization is not possible for `LoopNodes` where we know nothing about the number of iterations. > > However, there was an oversight: A `LongCountedLoop` is later split into an inner and an outer `LoopNode` where the inner loop is transformed into a `CountedLoop` while the outer loop stays a `LoopNode`. Both loops share the same optimized iv phi type as the original `LongCountedLoop`! Therefore, we can have the same situation as fixed in JDK-8299259 but with a `LoopNode` instead of a `CountedLoopNode` (see https://github.com/openjdk/jdk/pull/11900 for more details). > > The simplest way to fix this is to extend the bailout fix of JDK-8299259 to not only apply to `BaseCountedLoopNodes` but to `LoopNodes` in general which is what I propose with this patch. 
> > Thanks to @eme64 for extracting the original simpler reproducer to reproduce this problem easier. I've added an even simpler non-jasm reproducer in this patch in favor of the original jasm reproducer. > > Thanks, > Christian Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: fix number ------------- Changes: - all: https://git.openjdk.org/jdk/pull/20594/files - new: https://git.openjdk.org/jdk/pull/20594/files/c11cee54..242b877a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20594&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20594&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/20594.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20594/head:pull/20594 PR: https://git.openjdk.org/jdk/pull/20594 From fgao at openjdk.org Thu Aug 15 15:19:57 2024 From: fgao at openjdk.org (Fei Gao) Date: Thu, 15 Aug 2024 15:19:57 GMT Subject: RFR: 8338442: AArch64: Clean up IndOffXX type and let legitimize_address() fix out-of-range operands [v2] In-Reply-To: References: <16J-lJ2AceGTVcRWBcP15yKcwO-1IA1XsngyOuNjf7k=.0776f081-ae2c-4279-87cf-d909806c2bc4@github.com> Message-ID: On Fri, 8 Dec 2023 02:13:02 GMT, Dean Long wrote: >> I think this patch is excessive for the problem and introduces a lot of code duplication.
Maybe it would be simpler, smaller, and faster to check for what we need: >> >> >> diff --git a/src/hotspot/cpu/aarch64/aarch64.ad b/src/hotspot/cpu/aarch64/aarch64.ad >> index 233f9b6af7c..ea842912ce9 100644 >> --- a/src/hotspot/cpu/aarch64/aarch64.ad >> +++ b/src/hotspot/cpu/aarch64/aarch64.ad >> @@ -5911,7 +5911,8 @@ operand indIndexN(iRegN reg, iRegL lreg) >> >> operand indOffIN(iRegN reg, immIOffset off) >> %{ >> - predicate(CompressedOops::shift() == 0); >> + predicate(CompressedOops::shift() == 0 >> + && Address::offset_ok_for_immed(n->in(3)->find_int_con(min_jint), exact_log2(sizeof(jint)))); >> constraint(ALLOC_IN_RC(ptr_reg)); >> match(AddP (DecodeN reg) off); >> op_cost(0); >> @@ -5926,7 +5927,8 @@ operand indOffIN(iRegN reg, immIOffset off) >> >> operand indOffLN(iRegN reg, immLoffset off) >> %{ >> - predicate(CompressedOops::shift() == 0); >> + predicate(CompressedOops::shift() == 0 >> + && Address::offset_ok_for_immed(n->in(3)->find_long_con(min_jint), exact_log2(sizeof(jlong)))); >> constraint(ALLOC_IN_RC(ptr_reg)); >> match(AddP (DecodeN reg) off); >> op_cost(0); > > @theRealAph , your patch only works if when `indOffIN` is used in `memory4` and `indOffLN` is used in `memory8`, right? > Introducing new operands like `indOffIN4` is consistent with how the code currently works with `indOffI4`. In fact I think the new `indOffIN` could be folded into the existing `indOffI` by using multiple `match` lines and a better predicate. @dean-long thanks for your approval! Thanks for your suggestions @eme64 @TobiHartmann. Updated with a new RFE :) I'll integrate it. 
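The LDR/STR immediate rule discussed in this PR can be written as a small predicate. The rule allows either a signed 9-bit unscaled byte offset, or a non-negative offset that is a multiple of the access size and fits in 12 bits after scaling. This is a sketch based only on the ranges quoted in the PR description, and `ldst_offset_encodable` is a hypothetical name. HotSpot's real check is `Address::offset_ok_for_immed`, whose exact behavior may differ. The sketch shows why offset `1030` is rejected for an 8-byte access but accepted for a 2-byte one.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not HotSpot's Address::offset_ok_for_immed): returns
// true if `offset` can be encoded directly in an AArch64 load/store of
// (1 << size_log2) bytes, per the ranges described in the PR.
bool ldst_offset_encodable(int64_t offset, unsigned size_log2) {
  assert(size_log2 <= 3);  // 1, 2, 4 or 8 byte accesses
  // Unscaled form (LDUR/STUR): any signed 9-bit byte offset.
  if (offset >= -256 && offset <= 255) return true;
  // Scaled form: non-negative multiple of the access size whose scaled
  // value fits in an unsigned 12-bit field.
  int64_t size = int64_t{1} << size_log2;
  if (offset < 0 || offset % size != 0) return false;
  return (offset >> size_log2) <= 4095;
}
```

For 8-byte accesses the scaled form tops out at 4095 * 8 = 32760, which matches the upper bound quoted in the PR. `1030` fails because it is larger than 255 and not a multiple of 8.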
------------- PR Comment: https://git.openjdk.org/jdk/pull/16991#issuecomment-2291501106 From fgao at openjdk.org Thu Aug 15 15:19:57 2024 From: fgao at openjdk.org (Fei Gao) Date: Thu, 15 Aug 2024 15:19:57 GMT Subject: Integrated: 8338442: AArch64: Clean up IndOffXX type and let legitimize_address() fix out-of-range operands In-Reply-To: <16J-lJ2AceGTVcRWBcP15yKcwO-1IA1XsngyOuNjf7k=.0776f081-ae2c-4279-87cf-d909806c2bc4@github.com> References: <16J-lJ2AceGTVcRWBcP15yKcwO-1IA1XsngyOuNjf7k=.0776f081-ae2c-4279-87cf-d909806c2bc4@github.com> Message-ID: On Wed, 6 Dec 2023 06:24:59 GMT, Fei Gao wrote: > On LP64 systems, if the heap can be moved into low virtual address space (below 4GB) and the heap size is smaller than the interesting threshold of 4 GB, we can use unscaled decoding pattern for narrow klass decoding. It means that a generic field reference can be decoded by: > > cast<64> (32-bit compressed reference) + field_offset > > > When the `field_offset` is an immediate, on aarch64 platform, the unscaled decoding pattern can match perfectly with a direct addressing mode, i.e., `base_plus_offset`, supported by `LDR/STR` instructions. But for certain data width, not all immediates can be encoded in the instruction field of `LDR/STR` [[1]](https://github.com/openjdk/jdk/blob/8db7bad992a0f31de9c7e00c2657c18670539102/src/hotspot/cpu/aarch64/assembler_aarch64.inline.hpp#L33). The ranges are different as data widths vary. > > For example, when we try to load a value of long type at offset of `1030`, the address expression is `(AddP (DecodeN base) 1030)`. Before the patch, the expression was matching with `operand indOffIN()`. But, for 64-bit `LDR/STR`, signed immediate byte offset must be in the range -256 to 255 or positive immediate byte offset must be a multiple of 8 in the range 0 to 32760 [[2]](https://developer.arm.com/documentation/ddi0602/2023-09/Base-Instructions/LDR--immediate---Load-Register--immediate--?lang=en). 
`1030` can't be encoded in the instruction field. So, after matching, when we do checking for instruction encoding, the assertion would fail. > > In this patch, we're going to filter out invalid immediates when deciding if current addressing mode can be matched as `base_plus_offset`. We introduce `indOffIN4/indOffLN4` and `indOffIN8/indOffLN8` for 32-bit data type and 64-bit data type separately in the patch. E.g., for `memory4`, we remove the generic `indOffIN/indOffLN`, which matches wrong unscaled immediate range, and replace them with `indOffIN4/indOffLN4` instead. > > Since 8-bit and 16-bit `LDR/STR` instructions also support the unscaled decoding pattern, we add the addressing mode in the lists of `memory1` and `memory2` by introducing `indOffIN1/indOffLN1` and `indOffIN2/indOffLN2`. > > We also remove unused operands `indOffI/indOffl/indOffIN/indOffLN` to avoid misuse. > > Tier 1-3 passed on aarch64. This pull request has now been integrated. Changeset: 38591315 Author: Fei Gao URL: https://git.openjdk.org/jdk/commit/38591315058e6d3b764ca325facc5bf46bf7b16b Stats: 373 lines in 7 files changed: 12 ins; 250 del; 111 mod 8338442: AArch64: Clean up IndOffXX type and let legitimize_address() fix out-of-range operands Reviewed-by: aph, dlong ------------- PR: https://git.openjdk.org/jdk/pull/16991 From kvn at openjdk.org Thu Aug 15 16:28:49 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 15 Aug 2024 16:28:49 GMT Subject: RFR: 8336729: C2: Div/Mod nodes without zero check could be split through iv phi of outer loop of long counted loop nest resulting in SIGFPE [v3] In-Reply-To: References: Message-ID: On Thu, 15 Aug 2024 12:13:20 GMT, Christian Hagedorn wrote: >> The previous fix for preventing `Div/Mod` nodes to be split through iv phis when their divisor could become zero (see [JDK-8299259](https://bugs.openjdk.org/browse/JDK-8299259)) is not complete. 
Originally, we thought that it is enough to prevent a split through the phi if it belongs to a `BaseCountedLoop`: >> https://github.com/openjdk/jdk/blob/da7311bbe37c2b9632b117d52a77c659047820b7/src/hotspot/share/opto/loopopts.cpp#L302-L304 >> >> The reasoning behind this was that the iv phi type of `BaseCountedLoops` can be improved in such a way that it does not contain zero but the backedge input can include zero: >> https://github.com/openjdk/jdk/blob/da7311bbe37c2b9632b117d52a77c659047820b7/test/hotspot/jtreg/compiler/splitif/TestSplitDivisionThroughPhi.java#L69-L78 >> >> This optimization is not possible for `LoopNodes` where we know nothing about the number of iterations. >> >> However, there was an oversight: A `LongCountedLoop` is later split into an inner and an outer `LoopNode` where the inner loop is transformed into a `CountedLoop` while the outer loop stays a `LoopNode`. Both loops share the same optimized iv phi type as the original `LongCountedLoop`! Therefore, we can have the same situation as fixed in JDK-8299259 but with a `LoopNode` instead of a `CountedLoopNode` (see https://github.com/openjdk/jdk/pull/11900 for more details). >> >> The simplest way to fix this is to extend the bailout fix of JDK-8299259 to not only apply to `BaseCountedLoopNodes` but to `LoopNodes` in general which is what I propose with this patch. >> >> Thanks to @eme64 for extracting the original simpler reproducer to reproduce this problem easier. I've added an even simpler non-jasm reproducer in this patch in favor of the original jasm reproducer. >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > fix number Looks good. ------------- Marked as reviewed by kvn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/20594#pullrequestreview-2240788090 From sviswanathan at openjdk.org Thu Aug 15 23:41:49 2024 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Thu, 15 Aug 2024 23:41:49 GMT Subject: RFR: 8337632: AES-GCM Algorithm optimization for x86_64 In-Reply-To: References: Message-ID: On Mon, 22 Jan 2024 09:38:25 GMT, Smita Kamath wrote: > Hi, > I want to submit an AES-GCM algorithm optimization. This implementation is using AVX512/VAES Instructions. Additionally, it reduces PARALLEL_LEN from 7680 to 512 bytes. The performance numbers are as below. Kindly review the code. Thank you. > > Benchmark | Datasize | BaseJDK (ops/s) | Patch(ops/s) | %Gain > -- | -- | -- | -- | -- > full.AESGCMBench.decrypt | 512 | 2928259.197 | 3269964.387 | 11.67 > full.AESGCMBench.decrypt | 1024 | 2494254.611 | 3010987.731 | 20.72 > full.AESGCMBench.decrypt | 1500 | 1883453.546 | 1934915.846 | 2.73 > full.AESGCMBench.decrypt | 2048 | 1825780.711 | 2452861.368 | 34.34 > full.AESGCMBench.decrypt | 4096 | 1275108.345 | 1806329.066 | 41.66 > full.AESGCMBench.decrypt | 8192 | 1033936.634 | 1196836.052 | 15.75 > full.AESGCMBench.decrypt | 16384 | 681494.768 | 711630.498 | 4.42 > full.AESGCMBench.decrypt | 32768 | 385026.017 | 395043.193 | 2.6 > full.AESGCMBench.decrypt | 65536 | 207373.924 | 214723.588 | 3.54
> full.AESGCMBench.encrypt | 512 | 2658008.476 | 2882496.94 | 8.45 > full.AESGCMBench.encrypt | 1024 | 2283709.63 | 2589534.403 | 13.39 > full.AESGCMBench.encrypt | 1500 | 1794993.519 | 1817669.531 | 1.26 > full.AESGCMBench.encrypt | 2048 | 1745532.435 | 2191097.29 | 25.52 > full.AESGCMBench.encrypt | 4096 | 1203301.174 | 1649593.953 | 37.08 > full.AESGCMBench.encrypt | 8192 | 985174.988 | 1132407.54 | 14.94 > full.AESGCMBench.encrypt | 16384 | 658980.441 | 684765.771 | 3.91 > full.AESGCMBench.encrypt | 32768 | 373543.798 | 391518.837 | 4.81 > full.AESGCMBench.encrypt | 65536 | 202532.315 | 205084.833 | 1.260301597 src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 3084: > 3082: > 3083: __ cmpl(CTR_CHECK, (256 - 16)); > 3084: __ jcc(Assembler::greaterEqual, blocks_overflow); Should this be: __ cmpb(CTR_CHECK, (256 - 16)); __ jcc(Assembler::aboveEqual, blocks_overflow); ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1719118790 From sviswanathan at openjdk.org Fri Aug 16 00:21:51 2024 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Fri, 16 Aug 2024 00:21:51 GMT Subject: RFR: 8337632: AES-GCM Algorithm optimization for x86_64 In-Reply-To: References: Message-ID: On Mon, 22 Jan 2024 09:38:25 GMT, Smita Kamath wrote: > Hi, > I want to submit an AES-GCM algorithm optimization. This implementation is using AVX512/VAES Instructions. Additionally, it reduces PARALLEL_LEN from 7680 to 512 bytes. The performance numbers are as below. Kindly review the code. Thank you.
> > Benchmark | Datasize | BaseJDK (ops/s) | Patch(ops/s) | %Gain > -- | -- | -- | -- | -- > full.AESGCMBench.decrypt | 512 | 2928259.197 | 3269964.387 | 11.67 > full.AESGCMBench.decrypt | 1024 | 2494254.611 | 3010987.731 | 20.72 > full.AESGCMBench.decrypt | 1500 | 1883453.546 | 1934915.846 | 2.73 > full.AESGCMBench.decrypt | 2048 | 1825780.711 | 2452861.368 | 34.34 > full.AESGCMBench.decrypt | 4096 | 1275108.345 | 1806329.066 | 41.66 > full.AESGCMBench.decrypt | 8192 | 1033936.634 | 1196836.052 | 15.75 > full.AESGCMBench.decrypt | 16384 | 681494.768 | 711630.498 | 4.42 > full.AESGCMBench.decrypt | 32768 | 385026.017 | 395043.193 | 2.6 > full.AESGCMBench.decrypt | 65536 | 207373.924 | 214723.588 | 3.54 > full.AESGCMBench.encrypt | 512 | 2658008.476 | 2882496.94 | 8.45 > full.AESGCMBench.encrypt | 1024 | 2283709.63 | 2589534.403 | 13.39 > full.AESGCMBench.encrypt | 1500 | 1794993.519 | 1817669.531 | 1.26 > full.AESGCMBench.encrypt | 2048 | 1745532.435 | 2191097.29 | 25.52 > full.AESGCMBench.encrypt | 4096 | 1203301.174 | 1649593.953 | 37.08 > full.AESGCMBench.encrypt | 8192 | 985174.988 | 1132407.54 | 14.94 > full.AESGCMBench.encrypt | 16384 | 658980.441 | 684765.771 | 3.91 > full.AESGCMBench.encrypt | 32768 | 373543.798 | 391518.837 | 4.81 > full.AESGCMBench.encrypt | 65536 | 202532.315 | 205084.833 | 1.260301597 src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 298: > 296: // Align stack > 297: __ andq(rsp, -64); > 298: __ subptr(rsp, 200 * longSize); // Create space on the stack for htbl entries It would be helpful to document the structure of subkeyHtbl (what is stored in the 200 long entries).
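Sandhya's suggestions for lines 3084 and 3120 (`cmpb` with `aboveEqual`, and `addb`) rest on treating the check register as an 8-bit unsigned counter. The sketch below models that in plain C++. It assumes, without confirmation from the patch, that `CTR_CHECK` mirrors the low byte of the block counter and that the fast path may only add 16 when that byte cannot carry. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Threshold from the stub: the fast path is safe while low_byte + 16 <= 255.
constexpr unsigned kLimit = 256 - 16;  // 240 == 0xF0

// Models cmpb + jcc(aboveEqual): unsigned byte comparison.
bool overflow_unsigned(uint8_t low_byte) {
  return low_byte >= kLimit;
}

// Models cmpb + jcc(greaterEqual): the same bits compared as signed bytes,
// where the immediate 0xF0 reads as -16. (Conversion to int8_t is modular in
// C++20 and on mainstream two's-complement compilers before that.)
bool overflow_signed(uint8_t low_byte) {
  return static_cast<int8_t>(low_byte) >= static_cast<int8_t>(kLimit);
}

// Models addb: the check value wraps modulo 256, tracking the real low byte.
uint8_t advance_byte(uint8_t low_byte) {
  return static_cast<uint8_t>(low_byte + 16);
}

// Models addl: a 32-bit register keeps growing past 255 unless it is reset.
uint32_t advance_dword(uint32_t counter) {
  return counter + 16;
}
```

With the byte forms, the check fires exactly when the low byte is in [0xF0, 0xFF]. Pairing `cmpb` with a signed condition would also fire for 0x00 to 0x7F. Using `addl` without masking leaves the register at or above 240 after the first wrap, unless the overflow path resets it.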
src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 3120: > 3118: //increment counter overflow check register > 3119: __ evshufi64x2(CTR_BE, B12_15, B12_15, 255, Assembler::AVX_512bit); > 3120: __ addl(CTR_CHECK, 16); Should this be: __ addb(CTR_CHECK, 16); ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1719144485 PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1719122880 From jkarthikeyan at openjdk.org Fri Aug 16 03:22:49 2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Fri, 16 Aug 2024 03:22:49 GMT Subject: RFR: 8335444: Generalize implementation of AndNode mul_ring [v2] In-Reply-To: References: <7l_zt0Q884dkzJbP5kjhfvZPmFQ4CMRgakWoUCopfvw=.fc1307d0-6a2c-40cc-adc6-c1b21424d0dd@github.com> Message-ID: On Thu, 15 Aug 2024 06:07:04 GMT, Christian Hagedorn wrote: > you could try to emulate it with [Intel SDE](https://www.intel.com/content/www/us/en/developer/articles/tool/software-development-emulator.html) to see what's going on Oh I hadn't heard of that tool before, I'll give it a try. Thank you for the link! ------------- PR Comment: https://git.openjdk.org/jdk/pull/20066#issuecomment-2292658400 From jbhateja at openjdk.org Fri Aug 16 08:43:52 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 16 Aug 2024 08:43:52 GMT Subject: RFR: 8337632: AES-GCM Algorithm optimization for x86_64 In-Reply-To: References: Message-ID: On Mon, 22 Jan 2024 09:38:25 GMT, Smita Kamath wrote: > Hi, > I want to submit an AES-GCM algorithm optimization. This implementation is using AVX512/VAES Instructions. Additionally, it reduces PARALLEL_LEN from 7680 to 512 bytes. The performance numbers are as below. Kindly review the code. Thank you. 
> > Benchmark | Datasize | BaseJDK (ops/s) | Patch(ops/s) | %Gain > -- | -- | -- | -- | -- > full.AESGCMBench.decrypt | 512 | 2928259.197 | 3269964.387 | 11.67 > full.AESGCMBench.decrypt | 1024 | 2494254.611 | 3010987.731 | 20.72 > full.AESGCMBench.decrypt | 1500 | 1883453.546 | 1934915.846 | 2.73 > full.AESGCMBench.decrypt | 2048 | 1825780.711 | 2452861.368 | 34.34 > full.AESGCMBench.decrypt | 4096 | 1275108.345 | 1806329.066 | 41.66 > full.AESGCMBench.decrypt | 8192 | 1033936.634 | 1196836.052 | 15.75 > full.AESGCMBench.decrypt | 16384 | 681494.768 | 711630.498 | 4.42 > full.AESGCMBench.decrypt | 32768 | 385026.017 | 395043.193 | 2.6 > full.AESGCMBench.decrypt | 65536 | 207373.924 | 214723.588 | 3.54 > full.AESGCMBench.encrypt | 512 | 2658008.476 | 2882496.94 | 8.45 > full.AESGCMBench.encrypt | 1024 | 2283709.63 | 2589534.403 | 13.39 > full.AESGCMBench.encrypt | 1500 | 1794993.519 | 1817669.531 | 1.26 > full.AESGCMBench.encrypt | 2048 | 1745532.435 | 2191097.29 | 25.52 > full.AESGCMBench.encrypt | 4096 | 1203301.174 | 1649593.953 | 37.08 > full.AESGCMBench.encrypt | 8192 | 985174.988 | 1132407.54 | 14.94 > full.AESGCMBench.encrypt | 16384 | 658980.441 | 684765.771 | 3.91 > full.AESGCMBench.encrypt | 32768 | 373543.798 | 391518.837 | 4.81 > full.AESGCMBench.encrypt | 65536 | 202532.315 | 205084.833 | 1.260301597 src/hotspot/cpu/x86/assembler_x86.cpp line 8977: > 8975: > 8976: void Assembler::vinserti64x2(XMMRegister dst, XMMRegister nds, XMMRegister src, uint8_t imm8) { > 8977: assert(VM_Version::supports_avx512dq(), ""); You may also add an assertion for the VL feature; some VM instances may have a custom feature set. src/hotspot/cpu/x86/assembler_x86.cpp line 11049: > 11047: > 11048: void Assembler::evbroadcastf64x2(XMMRegister dst, Address src, int vector_len) { > 11049: assert(VM_Version::supports_avx512dq(), ""); Same as above.
src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 181: > 179: 0x0000000000000000UL, 0x0400000000000000UL, > 180: 0x0000000000000000UL, 0x0400000000000000UL, > 181: }; Even though this file will be compiled only on 64-bit targets (LP64) where unsigned longs are 8 bytes, it is still good to follow the convention of defining word-size-agnostic long long constants; please update the suffix to ULL. src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 191: > 189: 0x0000000000000000UL, 0x0200000000000000UL, > 190: 0x0000000000000000UL, 0x0300000000000000UL, > 191: 0x0000000000000000UL, 0x0400000000000000UL, Same as above. src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 202: > 200: 0x0000000000000002UL, 0x0000000000000000UL, > 201: 0x0000000000000003UL, 0x0000000000000000UL, > 202: 0x0000000000000004UL, 0x0000000000000000UL, Same as above. src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 2827: > 2825: //compute HashKey ^ (8 + n), HashKey ^ (7 + n), ... HashKey ^ (5 + n) > 2826: gfmul_avx512(ZT7, ZT5); > 2827: __ evmovdquq(Address(avx512_htbl, 16 * 16), ZT7, Assembler::AVX_512bit); You can define a macro for this repetitive sequence and pass the varying arguments to it for brevity.
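A small, self-contained table illustrates the `UL` to `ULL` request. `kCounterIncrements` is a hypothetical name that mirrors the constants quoted at lines 179-181. One detail is worth keeping in mind: `unsigned long` is 64 bits on LP64 platforms such as Linux and macOS but only 32 bits on 64-bit Windows (LLP64). By contrast, `unsigned long long` is guaranteed to be at least 64 bits everywhere.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

// unsigned long long is at least 64 bits on every conforming implementation.
static_assert(sizeof(unsigned long long) * CHAR_BIT >= 64,
              "ULL literals can hold any 64-bit constant");

// unsigned long is only guaranteed 32 bits (it is 32 bits on 64-bit Windows).
static_assert(sizeof(unsigned long) * CHAR_BIT >= 32,
              "the only width the standard promises for unsigned long");

// Hypothetical table in the style of the reviewed constants, spelled with ULL
// so the literal type is 64-bit regardless of the platform's data model.
static const uint64_t kCounterIncrements[] = {
    0x0000000000000000ULL, 0x0400000000000000ULL,
    0x0000000000000000ULL, 0x0400000000000000ULL,
};
```

For these particular initializers, both spellings store the same values. A hexadecimal `UL` literal too large for `unsigned long` is given type `unsigned long long`, and every value converts losslessly into the `uint64_t` array. The suffix change is therefore about stating the intended width explicitly, as the review comment asks.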
src/hotspot/cpu/x86/stubGenerator_x86_64_ghash.cpp line 63: > 61: 0x0000000000000001UL, 0xC200000000000000UL, > 62: 0x0000000000000001UL, 0xC200000000000000UL, > 63: 0x0000000000000001UL, 0xC200000000000000UL Please suffix it with ULL, our intent is to declare wordsize agnostic long constants ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1719466371 PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1719466856 PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1719486366 PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1719498315 PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1719498445 PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1719529833 PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1719458801 From chagedorn at openjdk.org Fri Aug 16 11:38:50 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 16 Aug 2024 11:38:50 GMT Subject: RFR: 8336729: C2: Div/Mod nodes without zero check could be split through iv phi of outer loop of long counted loop nest resulting in SIGFPE [v3] In-Reply-To: References: Message-ID: On Thu, 15 Aug 2024 12:13:20 GMT, Christian Hagedorn wrote: >> The previous fix for preventing `Div/Mod` nodes to be split through iv phis when their divisor could become zero (see [JDK-8299259](https://bugs.openjdk.org/browse/JDK-8299259)) is not complete. 
Originally, we thought that it is enough to prevent a split through the phi if it belongs to a `BaseCountedLoop`: >> https://github.com/openjdk/jdk/blob/da7311bbe37c2b9632b117d52a77c659047820b7/src/hotspot/share/opto/loopopts.cpp#L302-L304 >> >> The reasoning behind this was that the iv phi type of `BaseCountedLoops` can be improved in such a way that it does not contain zero but the backedge input can include zero: >> https://github.com/openjdk/jdk/blob/da7311bbe37c2b9632b117d52a77c659047820b7/test/hotspot/jtreg/compiler/splitif/TestSplitDivisionThroughPhi.java#L69-L78 >> >> This optimization is not possible for `LoopNodes` where we know nothing about the number of iterations. >> >> However, there was an oversight: A `LongCountedLoop` is later split into an inner and an outer `LoopNode` where the inner loop is transformed into a `CountedLoop` while the outer loop stays a `LoopNode`. Both loops share the same optimized iv phi type as the original `LongCountedLoop`! Therefore, we can have the same situation as fixed in JDK-8299259 but with a `LoopNode` instead of a `CountedLoopNode` (see https://github.com/openjdk/jdk/pull/11900 for more details). >> >> The simplest way to fix this is to extend the bailout fix of JDK-8299259 to not only apply to `BaseCountedLoopNodes` but to `LoopNodes` in general which is what I propose with this patch. >> >> Thanks to @eme64 for extracting the original simpler reproducer to reproduce this problem easier. I've added an even simpler non-jasm reproducer in this patch in favor of the original jasm reproducer. >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > fix number Thanks Vladimir for your review! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/20594#issuecomment-2293344903 From mbaesken at openjdk.org Fri Aug 16 12:01:57 2024 From: mbaesken at openjdk.org (Matthias Baesken) Date: Fri, 16 Aug 2024 12:01:57 GMT Subject: RFR: 8333098: ubsan: bytecodeInfo.cpp:318:59: runtime error: division by zero Message-ID: When running test compiler/classUnloading/methodUnloading/TestOverloadCompileQueues.java with ubsan enabled binaries we run into the issue reported below. Reason seems to be that we divide by zero in the code in some special cases (we should instead check for `CompilationPolicy::min_invocations() == 0` and handle it separately). /jdk/src/hotspot/share/opto/bytecodeInfo.cpp:318:59: runtime error: division by zero #0 0x7f5145c0dda2 in InlineTree::should_not_inline(ciMethod*, ciMethod*, int, bool&, ciCallProfile&) src/hotspot/share/opto/bytecodeInfo.cpp:318 #1 0x7f51466366d7 in InlineTree::try_to_inline(ciMethod*, ciMethod*, int, JVMState*, ciCallProfile&, bool&) src/hotspot/share/opto/bytecodeInfo.cpp:382 #2 0x7f514663d36b in InlineTree::ok_to_inline(ciMethod*, JVMState*, ciCallProfile&, bool&) src/hotspot/share/opto/bytecodeInfo.cpp:596 #3 0x7f51470dffd6 in Compile::call_generator(ciMethod*, int, bool, JVMState*, bool, float, ciKlass*, bool) src/hotspot/share/opto/doCall.cpp:189 #4 0x7f51470e18ab in Parse::do_call() src/hotspot/share/opto/doCall.cpp:641 #5 0x7f514887dbf1 in Parse::do_one_block() src/hotspot/share/opto/parse1.cpp:1607 #6 0x7f514887fefa in Parse::do_all_blocks() src/hotspot/share/opto/parse1.cpp:724 #7 0x7f514888d4da in Parse::Parse(JVMState*, ciMethod*, float) src/hotspot/share/opto/parse1.cpp:628 #8 0x7f51469d8418 in ParseGenerator::generate(JVMState*) src/hotspot/share/opto/callGenerator.cpp:99 #9 0x7f5146d99cff in Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*) src/hotspot/share/opto/compile.cpp:793 #10 0x7f51469d5ebf in C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)
src/hotspot/share/opto/c2compiler.cpp:142 #11 0x7f5146db0274 in CompileBroker::invoke_compiler_on_method(CompileTask*) src/hotspot/share/compiler/compileBroker.cpp:2303 #12 0x7f5146db2826 in CompileBroker::compiler_thread_loop() src/hotspot/share/compiler/compileBroker.cpp:1961 #13 0x7f51478d475a in JavaThread::thread_main_inner() src/hotspot/share/runtime/javaThread.cpp:759 #14 0x7f51491620ea in Thread::call_run() src/hotspot/share/runtime/thread.cpp:225 #15 0x7f51487ac201 in thread_native_entry src/hotspot/os/linux/os_linux.cpp:846 #16 0x7f514e5cf6e9 in start_thread (/lib64/libpthread.so.0+0xa6e9) (BuildId: 2f8d3c2d0f4d7888c2598d2ff6356537f5708a73) #17 0x7f514db1550e in clone (/lib64/libc.so.6+0x11850e) (BuildId: f732026552f6adff988b338e92d466bc81a01c37) ------------- Commit messages: - JDK-8333098 Changes: https://git.openjdk.org/jdk/pull/20615/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=20615&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8333098 Stats: 8 lines in 1 file changed: 6 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/20615.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20615/head:pull/20615 PR: https://git.openjdk.org/jdk/pull/20615 From rcastanedalo at openjdk.org Fri Aug 16 13:06:55 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Fri, 16 Aug 2024 13:06:55 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v2] In-Reply-To: <4c-MLXwKcNcSnloSkYkuk3gnv3ux5i5beS51Fd9Z8MQ=.cd0a7eba-ff26-4855-a01c-d1ae5182100b@github.com> References: <4c-MLXwKcNcSnloSkYkuk3gnv3ux5i5beS51Fd9Z8MQ=.cd0a7eba-ff26-4855-a01c-d1ae5182100b@github.com> Message-ID: On Sun, 21 Jul 2024 08:27:52 GMT, Martin Doerr wrote: >> Roberto Castañeda Lozano has updated the pull request incrementally with one additional commit since the last revision: >> >> Build barrier data in G1BarrierSetC2::get_store_barrier() by adding, rather than removing, barrier tags > >
src/hotspot/cpu/x86/gc/g1/g1_x86_64.ad line 123: > >> 121: if ((barrier_data() & G1C2BarrierPost) != 0) { >> 122: __ movl($tmp2$$Register, $src$$Register); >> 123: if ((barrier_data() & G1C2BarrierPostNotNull) == 0) { > > `decode_heap_oop` contains a null check in some cases which makes some of your code redundant. Optimization idea: In case of `(((barrier_data() & G1C2BarrierPostNotNull) == 0) && CompressedOops::base() != nullptr)` use a null check and bail out because there's nothing left to do if it's null. After that, we can always use `decode_heap_oop_not_null`. Also note that the latter supports specifying different src and dst registers which saves the extra move operation. Thanks for the suggestion, Martin! I have prototyped the optimization [here](https://github.com/robcasloz/jdk/blob/JDK-8334060-g1-late-barrier-expansion-x64-optimizations/src/hotspot/cpu/x86/gc/g1/g1_x86_64.ad) (guarded by `RemoveRedundantNullChecks`), and in my opinion its expected benefit does not justify the additional complexity, especially since the scope is limited (in my earlier experiments, most of the stores are implemented with `g1EncodePAndStoreN` rather than `g1StoreN`, plus the optimization only applies to a specific compressed OOPs mode). I have run a few general-purpose benchmarks using a non-zero base compressed oops mode and the optimization did not yield any statistically significant improvement, but please let me know if you have any specific benchmark/configuration in mind and I can re-check. > src/hotspot/cpu/x86/gc/g1/g1_x86_64.ad line 182: > >> 180: $tmp2$$Register /* pre_val */, >> 181: $tmp3$$Register /* tmp */, >> 182: RegSet::of($mem$$Register, $newval$$Register, $oldval$$Register) /* preserve */); > > The only value which can get overwritten is `oldval`. Optimization idea: Pass `oldval` to the SATB barrier. There is no load of the old value required. 
Thanks, I will test and apply this one: it is simple enough and, in a way, even more intuitive than the current solution. For reference, I prototyped it [here](https://github.com/robcasloz/jdk/blob/JDK-8334060-g1-late-barrier-expansion-x64-optimizations/src/hotspot/cpu/x86/gc/g1/g1_x86_64.ad) (guarded by `UseOldValInPreBarriers`). It should also apply to `g1CompareAndSwapP`, right? > src/hotspot/cpu/x86/gc/g1/g1_x86_64.ad line 301: > >> 299: RegSet::of($mem$$Register, $newval$$Register) /* preserve */); >> 300: __ movq($tmp1$$Register, $newval$$Register); >> 301: __ xchgq($newval$$Register, Address($mem$$Register, 0)); > > Optimization idea: Despite its name, `g1_pre_write_barrier` can be moved after the xchg operation because there's no safepoint within this MachNode. This allows avoiding loading the old value twice. Thanks, I also prototyped this [here](https://github.com/robcasloz/jdk/blob/JDK-8334060-g1-late-barrier-expansion-x64-optimizations/src/hotspot/cpu/x86/gc/g1/g1_x86_64.ad) (guarded by `UseExchangedValueinPreBarriers`). The atomic barrier implementation in this PR is purposefully simple based on 1) the assumption that the cost of the atomic operations will dominate that of their barriers, and 2) the risk of introducing subtle bugs which can be difficult to catch by regular testing. Because of this, I feel hesitant to introduce this kind of optimization for atomic operation barriers. But I am happy to reconsider if you have any specific benchmark/configuration in mind where the benefit could outweigh the cost.
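Martin's last suggestion is to run the pre-barrier after the `xchg` and use the exchanged value as `pre_val`. It can be modelled with `std::atomic`. This is a simplified, hypothetical sketch: `SatbQueue` stands in for G1's SATB buffer and is not HotSpot code. The reordering is only sound under the condition stated in the review, namely that no safepoint can occur between the exchange and the enqueue.

```cpp
#include <atomic>
#include <cassert>
#include <vector>

struct Obj { int id; };

// Hypothetical stand-in for a thread's SATB buffer: while concurrent marking
// is active, every non-null reference about to be overwritten is recorded.
struct SatbQueue {
  bool marking_active = true;
  std::vector<Obj*> buffer;
  void enqueue(Obj* pre_val) {
    if (marking_active && pre_val != nullptr) buffer.push_back(pre_val);
  }
};

// Barrier first: load the old value for the pre-barrier, then exchange.
// The field is read twice.
Obj* get_and_set_barrier_first(std::atomic<Obj*>& field, Obj* new_val, SatbQueue& q) {
  q.enqueue(field.load());
  return field.exchange(new_val);
}

// Exchange first: the value returned by the exchange is exactly the reference
// that was overwritten, so it can be passed to the pre-barrier directly.
Obj* get_and_set_exchange_first(std::atomic<Obj*>& field, Obj* new_val, SatbQueue& q) {
  Obj* old_val = field.exchange(new_val);
  q.enqueue(old_val);
  return old_val;
}
```

Both variants record the overwritten reference; the second reads the field only once. The reply above frames the trade-off: does saving one load justify extra complexity next to an atomic operation whose cost likely dominates?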
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1719811953 PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1719812882 PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1719814312 From kvn at openjdk.org Fri Aug 16 19:35:50 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 16 Aug 2024 19:35:50 GMT Subject: RFR: 8333098: ubsan: bytecodeInfo.cpp:318:59: runtime error: division by zero In-Reply-To: References: Message-ID: On Fri, 16 Aug 2024 11:57:09 GMT, Matthias Baesken wrote: > When running test > compiler/classUnloading/methodUnloading/TestOverloadCompileQueues.java > with ubsan enabled binaries we run into the issue reported below. > Reason seems to be that we divide by zero in the code in some special cases (we should instead check for `CompilationPolicy::min_invocations() == 0 `and handle it separately). > > /jdk/src/hotspot/share/opto/bytecodeInfo.cpp:318:59: runtime error: division by zero > #0 0x7f5145c0dda2 in InlineTree::should_not_inline(ciMethod*, ciMethod*, int, bool&, ciCallProfile&) src/hotspot/share/opto/bytecodeInfo.cpp:318 > #1 0x7f51466366d7 in InlineTree::try_to_inline(ciMethod*, ciMethod*, int, JVMState*, ciCallProfile&, bool&) src/hotspot/share/opto/bytecodeInfo.cpp:382 > #2 0x7f514663d36b in InlineTree::ok_to_inline(ciMethod*, JVMState*, ciCallProfile&, bool&) src/hotspot/share/opto/bytecodeInfo.cpp:596 > #3 0x7f51470dffd6 in Compile::call_generator(ciMethod*, int, bool, JVMState*, bool, float, ciKlass*, bool) src/hotspot/share/opto/doCall.cpp:189 > #4 0x7f51470e18ab in Parse::do_call() src/hotspot/share/opto/doCall.cpp:641 > #5 0x7f514887dbf1 in Parse::do_one_block() src/hotspot/share/opto/parse1.cpp:1607 > #6 0x7f514887fefa in Parse::do_all_blocks() src/hotspot/share/opto/parse1.cpp:724 > #7 0x7f514888d4da in Parse::Parse(JVMState*, ciMethod*, float) src/hotspot/share/opto/parse1.cpp:628 > #8 0x7f51469d8418 in ParseGenerator::generate(JVMState*) 
src/hotspot/share/opto/callGenerator.cpp:99 > #9 0x7f5146d99cff in Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*) src/hotspot/share/opto/compile.cpp:793 > #10 0x7f51469d5ebf in C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*) src/hotspot/share/opto/c2compiler.cpp:142 > #11 0x7f5146db0274 in CompileBroker::invoke_compiler_on_method(CompileTask*) src/hotspot/share/compiler/compileBroker.cpp:2303 > #12 0x7f5146db2826 in CompileBroker::compiler_thread_loop() src/hotspot/share/compiler/compileBroker.cpp:1961 > #13 0x7f51478d475a in JavaThread::thread_main_inner() src/hotspot/share/runtime/javaThread.cpp:759 > #14 0x7f51491620ea in Thread::call_run() src/hotspot/share/runtime/thread.cpp:225 > #15 0x7f51487ac201 in thread_native_entry src/hotspot/os/linux/os_linux.cpp:846 > #16 0x7f514e5cf6e9 in start_thread (/lib64/libpthread.so.0+0xa6e9) (BuildId: 2f8d3c2d0f4d7888c2598d2ff6356537f5708a73) > #17 0x7f514db1550e in clone (/lib64/libc.so.6+0x11850e) (BuildId: f732026552f6adff988b338e92d466bc81a01c37) First, I think `Tier4MinInvocationThreshold` and `Tier3MinInvocationThreshold` should be `double` flags. I see that they are used only in `double` type expressions (usually with a scaling value, which is `double`). That may be another RFE. I agree that we need to check `Tier4MinInvocationThreshold` == 0 because it is an acceptable value and can be specified on the command line. I think 0 means that `(freq < min_freq)` is always `true`.
So your change could be changed: if (cp_min_inv == 0) { // Tier4MinInvocationThreshold == 0 means we should not inline min_freq = freq + 1.; ------------- PR Review: https://git.openjdk.org/jdk/pull/20615#pullrequestreview-2243429627 From dlong at openjdk.org Fri Aug 16 19:47:50 2024 From: dlong at openjdk.org (Dean Long) Date: Fri, 16 Aug 2024 19:47:50 GMT Subject: RFR: 8333098: ubsan: bytecodeInfo.cpp:318:59: runtime error: division by zero In-Reply-To: References: Message-ID: On Fri, 16 Aug 2024 11:57:09 GMT, Matthias Baesken wrote: > When running test > compiler/classUnloading/methodUnloading/TestOverloadCompileQueues.java > with ubsan enabled binaries we run into the issue reported below. > Reason seems to be that we divide by zero in the code in some special cases (we should instead check for `CompilationPolicy::min_invocations() == 0 `and handle it separately). > > /jdk/src/hotspot/share/opto/bytecodeInfo.cpp:318:59: runtime error: division by zero > #0 0x7f5145c0dda2 in InlineTree::should_not_inline(ciMethod*, ciMethod*, int, bool&, ciCallProfile&) src/hotspot/share/opto/bytecodeInfo.cpp:318 > #1 0x7f51466366d7 in InlineTree::try_to_inline(ciMethod*, ciMethod*, int, JVMState*, ciCallProfile&, bool&) src/hotspot/share/opto/bytecodeInfo.cpp:382 > #2 0x7f514663d36b in InlineTree::ok_to_inline(ciMethod*, JVMState*, ciCallProfile&, bool&) src/hotspot/share/opto/bytecodeInfo.cpp:596 > #3 0x7f51470dffd6 in Compile::call_generator(ciMethod*, int, bool, JVMState*, bool, float, ciKlass*, bool) src/hotspot/share/opto/doCall.cpp:189 > #4 0x7f51470e18ab in Parse::do_call() src/hotspot/share/opto/doCall.cpp:641 > #5 0x7f514887dbf1 in Parse::do_one_block() src/hotspot/share/opto/parse1.cpp:1607 > #6 0x7f514887fefa in Parse::do_all_blocks() src/hotspot/share/opto/parse1.cpp:724 > #7 0x7f514888d4da in Parse::Parse(JVMState*, ciMethod*, float) src/hotspot/share/opto/parse1.cpp:628 > #8 0x7f51469d8418 in ParseGenerator::generate(JVMState*) 
src/hotspot/share/opto/callGenerator.cpp:99 > #9 0x7f5146d99cff in Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*) src/hotspot/share/opto/compile.cpp:793 > #10 0x7f51469d5ebf in C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*) src/hotspot/share/opto/c2compiler.cpp:142 > #11 0x7f5146db0274 in CompileBroker::invoke_compiler_on_method(CompileTask*) src/hotspot/share/compiler/compileBroker.cpp:2303 > #12 0x7f5146db2826 in CompileBroker::compiler_thread_loop() src/hotspot/share/compiler/compileBroker.cpp:1961 > #13 0x7f51478d475a in JavaThread::thread_main_inner() src/hotspot/share/runtime/javaThread.cpp:759 > #14 0x7f51491620ea in Thread::call_run() src/hotspot/share/runtime/thread.cpp:225 > #15 0x7f51487ac201 in thread_native_entry src/hotspot/os/linux/os_linux.cpp:846 > #16 0x7f514e5cf6e9 in start_thread (/lib64/libpthread.so.0+0xa6e9) (BuildId: 2f8d3c2d0f4d7888c2598d2ff6356537f5708a73) > #17 0x7f514db1550e in clone (/lib64/libc.so.6+0x11850e) (BuildId: f732026552f6adff988b338e92d466bc81a01c37) src/hotspot/share/opto/bytecodeInfo.cpp line 321: > 319: int cp_min_inv = CompilationPolicy::min_invocations(); > 320: if (cp_min_inv == 0) { > 321: min_freq = MinInlineFrequencyRatio; Doesn't this need to be 1.0 or infinity to preserve the existing behavior? However, I'm not convinced the existing behavior is correct. If min_invocations decreases, shouldn't that make it easier to inline, not harder? @veresov, do you agree? 
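A small model of the low-frequency check makes Dean's question concrete. It assumes, as the thread suggests, that the original code takes `min_freq` as the larger of `MinInlineFrequencyRatio` and `1.0 / min_invocations()`, and rejects the call site when `freq < min_freq`. The function names and the 0.01 ratio are illustrative only. Under IEEE 754, `1.0 / 0` evaluates to +infinity; that division is formally undefined in C++, which is what the sanitizer reports. So the unpatched check effectively rejected every call site, which is also what Vladimir's `freq + 1.` proposal keeps.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>

// Assumed shape of the unpatched computation, with the division by zero
// replaced by the +infinity that IEEE 754 arithmetic would have produced.
double min_freq_unpatched(double min_inline_freq_ratio, int min_invocations) {
  double inv = (min_invocations == 0)
                   ? std::numeric_limits<double>::infinity()
                   : 1.0 / min_invocations;
  return std::max(min_inline_freq_ratio, inv);
}

// The patch as quoted in the review: fall back to the ratio alone.
double min_freq_patched(double min_inline_freq_ratio, int min_invocations) {
  if (min_invocations == 0) {
    return min_inline_freq_ratio;
  }
  return std::max(min_inline_freq_ratio, 1.0 / min_invocations);
}

// A call site is rejected as low frequency when freq < min_freq.
bool is_low_frequency(double freq, double min_freq) {
  return freq < min_freq;
}
```

When `min_invocations()` is 0, the patch as written relaxes the check instead of preserving the old outcome. Whether that is desirable is the open question in this review.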
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20615#discussion_r1720282823 From dlong at openjdk.org Fri Aug 16 19:51:49 2024 From: dlong at openjdk.org (Dean Long) Date: Fri, 16 Aug 2024 19:51:49 GMT Subject: RFR: 8333098: ubsan: bytecodeInfo.cpp:318:59: runtime error: division by zero In-Reply-To: References: Message-ID: On Fri, 16 Aug 2024 19:33:15 GMT, Vladimir Kozlov wrote: >> When running test >> compiler/classUnloading/methodUnloading/TestOverloadCompileQueues.java >> with ubsan enabled binaries we run into the issue reported below. >> Reason seems to be that we divide by zero in the code in some special cases (we should instead check for `CompilationPolicy::min_invocations() == 0 `and handle it separately). >> >> /jdk/src/hotspot/share/opto/bytecodeInfo.cpp:318:59: runtime error: division by zero >> #0 0x7f5145c0dda2 in InlineTree::should_not_inline(ciMethod*, ciMethod*, int, bool&, ciCallProfile&) src/hotspot/share/opto/bytecodeInfo.cpp:318 >> #1 0x7f51466366d7 in InlineTree::try_to_inline(ciMethod*, ciMethod*, int, JVMState*, ciCallProfile&, bool&) src/hotspot/share/opto/bytecodeInfo.cpp:382 >> #2 0x7f514663d36b in InlineTree::ok_to_inline(ciMethod*, JVMState*, ciCallProfile&, bool&) src/hotspot/share/opto/bytecodeInfo.cpp:596 >> #3 0x7f51470dffd6 in Compile::call_generator(ciMethod*, int, bool, JVMState*, bool, float, ciKlass*, bool) src/hotspot/share/opto/doCall.cpp:189 >> #4 0x7f51470e18ab in Parse::do_call() src/hotspot/share/opto/doCall.cpp:641 >> #5 0x7f514887dbf1 in Parse::do_one_block() src/hotspot/share/opto/parse1.cpp:1607 >> #6 0x7f514887fefa in Parse::do_all_blocks() src/hotspot/share/opto/parse1.cpp:724 >> #7 0x7f514888d4da in Parse::Parse(JVMState*, ciMethod*, float) src/hotspot/share/opto/parse1.cpp:628 >> #8 0x7f51469d8418 in ParseGenerator::generate(JVMState*) src/hotspot/share/opto/callGenerator.cpp:99 >> #9 0x7f5146d99cff in Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*) 
src/hotspot/share/opto/compile.cpp:793 >> #10 0x7f51469d5ebf in C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*) src/hotspot/share/opto/c2compiler.cpp:142 >> #11 0x7f5146db0274 in CompileBroker::invoke_compiler_on_method(CompileTask*) src/hotspot/share/compiler/compileBroker.cpp:2303 >> #12 0x7f5146db2826 in CompileBroker::compiler_thread_loop() src/hotspot/share/compiler/compileBroker.cpp:1961 >> #13 0x7f51478d475a in JavaThread::thread_main_inner() src/hotspot/share/runtime/javaThread.cpp:759 >> #14 0x7f51491620ea in Thread::call_run() src/hotspot/share/runtime/thread.cpp:225 >> #15 0x7f51487ac201 in thread_native_entry src/hotspot/os/linux/os_linux.cpp:846 >> #16 0x7f514e5cf6e9 in start_thread (/lib64/libpthread.so.0+0xa6e9) (BuildId: 2f8d3c2d0f4d7888c2598d2ff6356537f5708a73) >> #17 0x7f514db1550e in clone (/lib64/libc.so.6+0x11850e) (BuildId: ... > > First, I think `Tier4MinInvocationThreshold` and `Tier3MinInvocationThreshold` should be `double` flags. I see that they are used only in `double` type expressions (usually with scaling value which is `double`). May be another RFE. > > I agree that we need to check `Tier4MinInvocationThreshold` == 0 because it is acceptable value and can be specified on command line. I think 0 means that `(freq < min_freq)` is always `true`. So your change could be changed: > > if (cp_min_inv == 0) { > // Tier4MinInvocationThreshold == 0 means we should not inline > min_freq = freq + 1.; @vnkozlov , I think this is happening because the scaling factor is 0.001, so 600 turns into 0. 
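Dean explains how the divisor becomes zero, and a tiny model can check the arithmetic. It hypothetically assumes that the scaled threshold is the flag value times the scaling factor, truncated to an integer. The helper name is illustrative, not HotSpot's.

```cpp
#include <cassert>

// Hypothetical model: scale an integer compile threshold and truncate toward zero.
int scaled_threshold(int threshold, double scale) {
  return static_cast<int>(threshold * scale);
}
```

Under this model, any threshold below 1000 scaled by 0.001 yields a product below 1 and therefore truncates to 0. That 0 is the divisor the sanitizer flagged.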
------------- PR Comment: https://git.openjdk.org/jdk/pull/20615#issuecomment-2294126266 From iveresov at openjdk.org Fri Aug 16 20:51:48 2024 From: iveresov at openjdk.org (Igor Veresov) Date: Fri, 16 Aug 2024 20:51:48 GMT Subject: RFR: 8333098: ubsan: bytecodeInfo.cpp:318:59: runtime error: division by zero In-Reply-To: References: Message-ID: On Fri, 16 Aug 2024 19:44:58 GMT, Dean Long wrote: >> When running test >> compiler/classUnloading/methodUnloading/TestOverloadCompileQueues.java >> with ubsan enabled binaries we run into the issue reported below. >> Reason seems to be that we divide by zero in the code in some special cases (we should instead check for `CompilationPolicy::min_invocations() == 0 `and handle it separately). >> >> /jdk/src/hotspot/share/opto/bytecodeInfo.cpp:318:59: runtime error: division by zero >> #0 0x7f5145c0dda2 in InlineTree::should_not_inline(ciMethod*, ciMethod*, int, bool&, ciCallProfile&) src/hotspot/share/opto/bytecodeInfo.cpp:318 >> #1 0x7f51466366d7 in InlineTree::try_to_inline(ciMethod*, ciMethod*, int, JVMState*, ciCallProfile&, bool&) src/hotspot/share/opto/bytecodeInfo.cpp:382 >> #2 0x7f514663d36b in InlineTree::ok_to_inline(ciMethod*, JVMState*, ciCallProfile&, bool&) src/hotspot/share/opto/bytecodeInfo.cpp:596 >> #3 0x7f51470dffd6 in Compile::call_generator(ciMethod*, int, bool, JVMState*, bool, float, ciKlass*, bool) src/hotspot/share/opto/doCall.cpp:189 >> #4 0x7f51470e18ab in Parse::do_call() src/hotspot/share/opto/doCall.cpp:641 >> #5 0x7f514887dbf1 in Parse::do_one_block() src/hotspot/share/opto/parse1.cpp:1607 >> #6 0x7f514887fefa in Parse::do_all_blocks() src/hotspot/share/opto/parse1.cpp:724 >> #7 0x7f514888d4da in Parse::Parse(JVMState*, ciMethod*, float) src/hotspot/share/opto/parse1.cpp:628 >> #8 0x7f51469d8418 in ParseGenerator::generate(JVMState*) src/hotspot/share/opto/callGenerator.cpp:99 >> #9 0x7f5146d99cff in Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*) 
src/hotspot/share/opto/compile.cpp:793 >> #10 0x7f51469d5ebf in C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*) src/hotspot/share/opto/c2compiler.cpp:142 >> #11 0x7f5146db0274 in CompileBroker::invoke_compiler_on_method(CompileTask*) src/hotspot/share/compiler/compileBroker.cpp:2303 >> #12 0x7f5146db2826 in CompileBroker::compiler_thread_loop() src/hotspot/share/compiler/compileBroker.cpp:1961 >> #13 0x7f51478d475a in JavaThread::thread_main_inner() src/hotspot/share/runtime/javaThread.cpp:759 >> #14 0x7f51491620ea in Thread::call_run() src/hotspot/share/runtime/thread.cpp:225 >> #15 0x7f51487ac201 in thread_native_entry src/hotspot/os/linux/os_linux.cpp:846 >> #16 0x7f514e5cf6e9 in start_thread (/lib64/libpthread.so.0+0xa6e9) (BuildId: 2f8d3c2d0f4d7888c2598d2ff6356537f5708a73) >> #17 0x7f514db1550e in clone (/lib64/libc.so.6+0x11850e) (BuildId: ... > > src/hotspot/share/opto/bytecodeInfo.cpp line 321: > >> 319: int cp_min_inv = CompilationPolicy::min_invocations(); >> 320: if (cp_min_inv == 0) { >> 321: min_freq = MinInlineFrequencyRatio; > > Doesn't this need to be 1.0 or infinity to preserve the existing behavior? > However, I'm not convinced the existing behavior is correct. If min_invocations decreases, shouldn't that make it easier to inline, not harder? @veresov, do you agree? I think the existing behavior is correct - the goal is to compute a minimum call freq that is considered for inlining. `1.0/min_invocations()` means call site was reached at least once before the method was submitted for a C2 compile. The question is what should we do if min_invocations() is 0. That basically means that there was no profiling. I would probably just return false in that case, which would mean "we can't really say that's low frequency, because we don't have data". It's a meaningless mode of operation regardless though... 
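The frequency cutoff under discussion can be modeled in isolation. This is a hypothetical standalone sketch, not the code in `bytecodeInfo.cpp`. `low_call_site_frequency` and its parameters are illustrative: `min_ratio` stands in for `MinInlineFrequencyRatio` and `min_invocations` for `CompilationPolicy::min_invocations()`. The zero case follows Igor's "we don't have data, so don't call it low frequency" option. Vladimir's alternative above would instead force the site to count as low frequency.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical model of the call-site frequency filter. Returns true when
// the site is considered too cold to inline.
bool low_call_site_frequency(double freq,          // call-site frequency ratio
                             double min_ratio,     // stands in for MinInlineFrequencyRatio
                             int min_invocations)  // stands in for min_invocations()
{
  assert(min_invocations >= 0);
  if (min_invocations == 0) {
    // No invocation threshold means no profile to judge by, and
    // 1.0 / min_invocations would divide by zero: don't reject the site.
    return false;
  }
  // Floor of "reached at least once per min_invocations calls".
  const double min_freq = std::max(min_ratio, 1.0 / min_invocations);
  return freq < min_freq;
}
```

With the default threshold of 600, `1.0 / 600` is below 0.0085, so the ratio flag alone decides the cutoff.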
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20615#discussion_r1720334554 From mdoerr at openjdk.org Fri Aug 16 20:58:53 2024 From: mdoerr at openjdk.org (Martin Doerr) Date: Fri, 16 Aug 2024 20:58:53 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v2] In-Reply-To: References: <4c-MLXwKcNcSnloSkYkuk3gnv3ux5i5beS51Fd9Z8MQ=.cd0a7eba-ff26-4855-a01c-d1ae5182100b@github.com> Message-ID: <8fuUEkswt05x0IuT4PrNQuYgLd49g4EpZWOPPQog4PQ=.70b5edb6-98d0-4276-8578-f7a496b7f2a7@github.com> On Fri, 16 Aug 2024 13:01:28 GMT, Roberto Castañeda Lozano wrote: >> src/hotspot/cpu/x86/gc/g1/g1_x86_64.ad line 123: >> >>> 121: if ((barrier_data() & G1C2BarrierPost) != 0) { >>> 122: __ movl($tmp2$$Register, $src$$Register); >>> 123: if ((barrier_data() & G1C2BarrierPostNotNull) == 0) { >> >> `decode_heap_oop` contains a null check in some cases which makes some of your code redundant. Optimization idea: In case of `(((barrier_data() & G1C2BarrierPostNotNull) == 0) && CompressedOops::base() != nullptr)` use a null check and bail out because there's nothing left to do if it's null. After that, we can always use `decode_heap_oop_not_null`. Also note that the latter supports specifying different src and dst registers which saves the extra move operation. > > Thanks for the suggestion, Martin! I have prototyped the optimization [here](https://github.com/robcasloz/jdk/blob/JDK-8334060-g1-late-barrier-expansion-x64-optimizations/src/hotspot/cpu/x86/gc/g1/g1_x86_64.ad) (guarded by `RemoveRedundantNullChecks`), and in my opinion its expected benefit does not justify the additional complexity, especially since the scope is limited (in my earlier experiments, most of the stores are implemented with `g1EncodePAndStoreN` rather than `g1StoreN`, plus the optimization only applies to a specific compressed OOPs mode).
I have run a few general-purpose benchmarks using a non-zero base compressed oops mode and the optimization did not yield any statistically significant improvement, but please let me know if you have any specific benchmark/configuration in mind and I can re-check. Thanks for trying! I think I should try it on PPC64. The null check can be integrated into the oop decoding, so all Compressed Oops Modes would benefit. It could be that x86 is less sensitive to such optimizations. >> src/hotspot/cpu/x86/gc/g1/g1_x86_64.ad line 182: >> >>> 180: $tmp2$$Register /* pre_val */, >>> 181: $tmp3$$Register /* tmp */, >>> 182: RegSet::of($mem$$Register, $newval$$Register, $oldval$$Register) /* preserve */); >> >> The only value which can get overwritten is `oldval`. Optimization idea: Pass `oldval` to the SATB barrier. There is no load of the old value required. > > Thanks, I will test and apply this one: it is simple enough and, in a way, even more intuitive than the current solution. For reference, I prototyped it [here](https://github.com/robcasloz/jdk/blob/JDK-8334060-g1-late-barrier-expansion-x64-optimizations/src/hotspot/cpu/x86/gc/g1/g1_x86_64.ad) (guarded by `UseOldValInPreBarriers`). It should also apply to `g1CompareAndSwapP`, right? Exactly. 
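Martin's first suggestion rests on how compressed oops are decoded. With a non-zero heap base, decoding has to special-case null (narrow value 0). In a zero-based mode, 0 decodes to 0 on its own. The C++ below is a hypothetical scalar model of that reasoning. `NarrowOopMode`, `decode`, `decode_not_null`, and `post_barrier_new_val` are illustrative names. They are not the `MacroAssembler` routines (`decode_heap_oop`, `decode_heap_oop_not_null`) named in the comment.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of a compressed-oop mode; not HotSpot's CompressedOops API.
struct NarrowOopMode {
  uint64_t base;   // heap base; 0 in zero-based modes
  unsigned shift;  // log2 of object alignment, e.g. 3
};

// General decode: with a non-zero base, narrow 0 must map back to null,
// so an explicit check is required.
uint64_t decode(uint32_t narrow, const NarrowOopMode& m) {
  if (narrow == 0) {
    return 0;
  }
  return m.base + (static_cast<uint64_t>(narrow) << m.shift);
}

// Decode for values already known to be non-null: no check.
uint64_t decode_not_null(uint32_t narrow, const NarrowOopMode& m) {
  assert(narrow != 0 && "caller must have excluded null");
  return m.base + (static_cast<uint64_t>(narrow) << m.shift);
}

// Shape of the suggested post-barrier prologue. If the stored value may be
// null, test the narrow value once and skip the barrier for null stores.
// After that test, the unchecked decode is always safe.
// Returns the decoded new value, or 0 when the barrier is skipped.
uint64_t post_barrier_new_val(uint32_t narrow_src, const NarrowOopMode& m,
                              bool known_not_null) {
  if (!known_not_null && narrow_src == 0) {
    return 0;  // null store: nothing left for the post barrier to do
  }
  return decode_not_null(narrow_src, m);
}
```

This also shows why the suggestion is limited to `CompressedOops::base() != nullptr`. With a zero base, `decode` and `decode_not_null` agree even for 0, so there is no redundant check to remove.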
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1720338672 PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1720340149 From mdoerr at openjdk.org Fri Aug 16 21:05:51 2024 From: mdoerr at openjdk.org (Martin Doerr) Date: Fri, 16 Aug 2024 21:05:51 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v2] In-Reply-To: References: <4c-MLXwKcNcSnloSkYkuk3gnv3ux5i5beS51Fd9Z8MQ=.cd0a7eba-ff26-4855-a01c-d1ae5182100b@github.com> Message-ID: <4aS0KysKaIPue1D60nootRPb8m7pdP_2hlPSwGUIk8w=.6b68b278-62ce-4df0-a6c6-c7fee40557aa@github.com> On Fri, 16 Aug 2024 13:03:51 GMT, Roberto Castañeda Lozano wrote: >> src/hotspot/cpu/x86/gc/g1/g1_x86_64.ad line 301: >> >>> 299: RegSet::of($mem$$Register, $newval$$Register) /* preserve */); >>> 300: __ movq($tmp1$$Register, $newval$$Register); >>> 301: __ xchgq($newval$$Register, Address($mem$$Register, 0)); >> >> Optimization idea: Despite its name, `g1_pre_write_barrier` can be moved after the xchg operation because there's no safepoint within this MachNode. This allows avoiding loading the old value twice. > > Thanks, I also prototyped this [here](https://github.com/robcasloz/jdk/blob/JDK-8334060-g1-late-barrier-expansion-x64-optimizations/src/hotspot/cpu/x86/gc/g1/g1_x86_64.ad) (guarded by `UseExchangedValueinPreBarriers`). The atomic barrier implementation in this PR is purposefully simple based on 1) the assumption that the cost of the atomic operations will dominate that of their barriers, and 2) the risk of introducing subtle bugs which can be difficult to catch by regular testing. Because of this, I feel hesitant to introduce this kind of optimizations for atomic operation barriers. But I am happy to reconsider, if you have any specific benchmark/configuration in mind where the benefit could outweigh the cost.
Note that we had such an optimization already in C2: https://github.com/openjdk/jdk8u-dev/blob/4106121e0ae42d644e45c6eab9037874110ed670/hotspot/src/share/vm/opto/library_call.cpp#L3114 But, it's probably not a big deal. Maybe I can try it on PPC64 which may be more sensitive to accesses on contended memory. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1720347473 From dhanalla at openjdk.org Fri Aug 16 21:31:47 2024 From: dhanalla at openjdk.org (Dhamoder Nalla) Date: Fri, 16 Aug 2024 21:31:47 GMT Subject: RFR: 8315916: assert(C->live_nodes() <= C->max_node_limit()) failed: Live Node limit exceeded In-Reply-To: References: Message-ID: On Thu, 8 Aug 2024 06:22:34 GMT, Christian Hagedorn wrote: > Hi @dhanalla, this is not the right way to handle this assertion failure. The assertion is here to catch real issues when creating too many nodes due to a bug in the code. For example, in [JDK-8256934](https://bugs.openjdk.org/browse/JDK-8256934), we hit this assert due to an inefficient cloning algorithm in Partial Peeling. We should not remove the assert. > > For such bugs, you first need to investigate why we hit the node limit with your reproducer. Once you find the problem, it can usually be put into one of the following categories: > > 1. We have a real bug and by fixing it, we no longer create this many nodes. > 2. It is a false-positive and it is expected to create this many nodes (note that the node limit of 80000 is quite large, so it needs to be explained well why it is a false-positive - more often than not, there is still a bug somewhere that is first missed). > 3. We have a real bug but the fix is too hard, risky, or just not worth the complexity, especially for a real edge-case (also needs to be explained and justified well). 
> > Note that for category 2 and 3, when we cannot easily fix the problem of creating too many nodes, we should implement a bail out fix from the current optimization and not the entire compilation to reduce the performance impact. This was, for example, done in JDK-8256934, where a fix was too risky at that point during the release and a proper fix was delayed. The fix was to bail out from Partial Peeling when hitting a critically high amount of live nodes (an estimate to ensure we never hit the node limit). > > You should then describe your analysis in the PR then explain your proposed solution. You should also add the reproducer as test case to your patch. Thanks @chhagedorn for reviewing this PR. This scenario corresponds to Case 2 mentioned above, where more than 80,000 nodes are expected to be created. As an alternative solution, could we consider limiting the JVM option `EliminateAllocationArraySizeLimit` (in `c2_globals.hpp`) to a range between 0 and 1024, instead of the current range of 0 to `max_jint`, as the upper limit of `max_jint` may not be practical? ------------- PR Comment: https://git.openjdk.org/jdk/pull/20504#issuecomment-2294337305 From kvn at openjdk.org Fri Aug 16 21:36:57 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 16 Aug 2024 21:36:57 GMT Subject: RFR: 8333098: ubsan: bytecodeInfo.cpp:318:59: runtime error: division by zero In-Reply-To: References: Message-ID: On Fri, 16 Aug 2024 19:33:15 GMT, Vladimir Kozlov wrote: >> When running test >> compiler/classUnloading/methodUnloading/TestOverloadCompileQueues.java >> with ubsan enabled binaries we run into the issue reported below. >> Reason seems to be that we divide by zero in the code in some special cases (we should instead check for `CompilationPolicy::min_invocations() == 0 `and handle it separately). 
>> >> /jdk/src/hotspot/share/opto/bytecodeInfo.cpp:318:59: runtime error: division by zero >> #0 0x7f5145c0dda2 in InlineTree::should_not_inline(ciMethod*, ciMethod*, int, bool&, ciCallProfile&) src/hotspot/share/opto/bytecodeInfo.cpp:318 >> #1 0x7f51466366d7 in InlineTree::try_to_inline(ciMethod*, ciMethod*, int, JVMState*, ciCallProfile&, bool&) src/hotspot/share/opto/bytecodeInfo.cpp:382 >> #2 0x7f514663d36b in InlineTree::ok_to_inline(ciMethod*, JVMState*, ciCallProfile&, bool&) src/hotspot/share/opto/bytecodeInfo.cpp:596 >> #3 0x7f51470dffd6 in Compile::call_generator(ciMethod*, int, bool, JVMState*, bool, float, ciKlass*, bool) src/hotspot/share/opto/doCall.cpp:189 >> #4 0x7f51470e18ab in Parse::do_call() src/hotspot/share/opto/doCall.cpp:641 >> #5 0x7f514887dbf1 in Parse::do_one_block() src/hotspot/share/opto/parse1.cpp:1607 >> #6 0x7f514887fefa in Parse::do_all_blocks() src/hotspot/share/opto/parse1.cpp:724 >> #7 0x7f514888d4da in Parse::Parse(JVMState*, ciMethod*, float) src/hotspot/share/opto/parse1.cpp:628 >> #8 0x7f51469d8418 in ParseGenerator::generate(JVMState*) src/hotspot/share/opto/callGenerator.cpp:99 >> #9 0x7f5146d99cff in Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*) src/hotspot/share/opto/compile.cpp:793 >> #10 0x7f51469d5ebf in C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*) src/hotspot/share/opto/c2compiler.cpp:142 >> #11 0x7f5146db0274 in CompileBroker::invoke_compiler_on_method(CompileTask*) src/hotspot/share/compiler/compileBroker.cpp:2303 >> #12 0x7f5146db2826 in CompileBroker::compiler_thread_loop() src/hotspot/share/compiler/compileBroker.cpp:1961 >> #13 0x7f51478d475a in JavaThread::thread_main_inner() src/hotspot/share/runtime/javaThread.cpp:759 >> #14 0x7f51491620ea in Thread::call_run() src/hotspot/share/runtime/thread.cpp:225 >> #15 0x7f51487ac201 in thread_native_entry src/hotspot/os/linux/os_linux.cpp:846 >> #16 0x7f514e5cf6e9 in start_thread (/lib64/libpthread.so.0+0xa6e9) (BuildId: 
2f8d3c2d0f4d7888c2598d2ff6356537f5708a73) >> #17 0x7f514db1550e in clone (/lib64/libc.so.6+0x11850e) (BuildId: ... > > First, I think `Tier4MinInvocationThreshold` and `Tier3MinInvocationThreshold` should be `double` flags. I see that they are used only in `double` type expressions (usually with scaling value which is `double`). May be another RFE. > > I agree that we need to check `Tier4MinInvocationThreshold` == 0 because it is acceptable value and can be specified on command line. I think 0 means that `(freq < min_freq)` is always `true`. So your change could be changed: > > if (cp_min_inv == 0) { > // Tier4MinInvocationThreshold == 0 means we should not inline > min_freq = freq + 1.; > @vnkozlov , I think this is happening because the scaling factor is 0.001, so 600 turns into 0. Yes, I noticed that. But my point is that the flag's definition allows 0 regardless scaling factor: product(intx, Tier4InvocationThreshold, 5000, \ "Compile if number of method invocations crosses this " \ "threshold") \ range(0, max_jint) \ ------------- PR Comment: https://git.openjdk.org/jdk/pull/20615#issuecomment-2294343354 From kvn at openjdk.org Fri Aug 16 22:05:07 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 16 Aug 2024 22:05:07 GMT Subject: RFR: 8333098: ubsan: bytecodeInfo.cpp:318:59: runtime error: division by zero In-Reply-To: References: Message-ID: On Fri, 16 Aug 2024 20:49:20 GMT, Igor Veresov wrote: > think the existing behavior is correct - the goal is to compute a minimum call freq that is considered for inlining. 1.0/min_invocations() means call site was reached at least once before the method was submitted for a C2 compile. Do we really need to take `1.0/min_invocations()` into account? `InlineSmallCode` should handle cases like that. Also `MinInlineFrequencyRatio ` is 0.0085 which corresponds to `Tier4MinInvocationThreshold` value equal approximately 117. The default value is 600. 
The only way to change it is to use `CompileThreshold` or `CompileThresholdScaling` which nobody use in production but only in testing. So in production we use `MinInlineFrequencyRatio` as `min_freq`. Why complicate the code. An other issue. This is `should_not_inline()` method. Returning `false` means **inlining** called method. I was curious why we return `false` when we run with `-Xcomp` (see check `(UseInterpreter)` at line 301) ? If we return `false` when `min_invocations()` is 0 it will be similar to `-Xcomp` mode. Is it intentional to inline when we don't have profiling info? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20615#discussion_r1720392599 From iveresov at openjdk.org Fri Aug 16 22:47:59 2024 From: iveresov at openjdk.org (Igor Veresov) Date: Fri, 16 Aug 2024 22:47:59 GMT Subject: RFR: 8333098: ubsan: bytecodeInfo.cpp:318:59: runtime error: division by zero In-Reply-To: References: Message-ID: On Fri, 16 Aug 2024 22:02:16 GMT, Vladimir Kozlov wrote: > > think the existing behavior is correct - the goal is to compute a minimum call freq that is considered for inlining. 1.0/min_invocations() means call site was reached at least once before the method was submitted for a C2 compile. > > Do we really need to take `1.0/min_invocations()` into account? `InlineSmallCode` should handle cases like that. > I'm not sure I understand the question. `InlineSmallCode` acts on already compiled methods. This logic analyzes call site frequencies. > Also `MinInlineFrequencyRatio ` is 0.0085 which corresponds to `Tier4MinInvocationThreshold` value equal approximately 117. The default value is 600. The only way to change it is to use `CompileThreshold` or `CompileThresholdScaling` which nobody use in production but only in testing. So in production we use `MinInlineFrequencyRatio` as `min_freq`. Why complicate the code. Can't exactly tell. But I'm sure I had a reason, probably some corner case exposed by testing. 
Probably something didn't get inlined with very low thresholds. > > An other issue. This is `should_not_inline()` method. Returning `false` means **inlining** called method. I was curious why we return `false` when we run with `-Xcomp` (see check `(UseInterpreter)` at line 301) ? If we return `false` when `min_invocations()` is 0 it will be similar to `-Xcomp` mode. > > Is it intentional to inline when we don't have profiling info? I think some tests run with -Xcomp and expect inlining to happen. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20615#discussion_r1720426267 From dlong at openjdk.org Fri Aug 16 22:47:59 2024 From: dlong at openjdk.org (Dean Long) Date: Fri, 16 Aug 2024 22:47:59 GMT Subject: RFR: 8333098: ubsan: bytecodeInfo.cpp:318:59: runtime error: division by zero In-Reply-To: References: Message-ID: <0otu-G39asM13W41LLoDM10iaBzumIn1cxuJcHq3KWs=.58db3172-a0b4-4988-b448-287e310423c1@github.com> On Fri, 16 Aug 2024 22:44:44 GMT, Igor Veresov wrote: >>> think the existing behavior is correct - the goal is to compute a minimum call freq that is considered for inlining. 1.0/min_invocations() means call site was reached at least once before the method was submitted for a C2 compile. >> >> Do we really need to take `1.0/min_invocations()` into account? `InlineSmallCode` should handle cases like that. >> >> Also `MinInlineFrequencyRatio ` is 0.0085 which corresponds to `Tier4MinInvocationThreshold` value equal approximately 117. The default value is 600. The only way to change it is to use `CompileThreshold` or `CompileThresholdScaling` which nobody use in production but only in testing. So in production we use `MinInlineFrequencyRatio` as `min_freq`. Why complicate the code. >> >> An other issue. This is `should_not_inline()` method. Returning `false` means **inlining** called method. I was curious why we return `false` when we run with `-Xcomp` (see check `(UseInterpreter)` at line 301) ? 
>> If we return `false` when `min_invocations()` is 0 it will be similar to `-Xcomp` mode. >> >> Is it intentional to inline when we don't have profiling info? > >> > think the existing behavior is correct - the goal is to compute a minimum call freq that is considered for inlining. 1.0/min_invocations() means call site was reached at least once before the method was submitted for a C2 compile. >> >> Do we really need to take `1.0/min_invocations()` into account? `InlineSmallCode` should handle cases like that. >> > > I'm not sure I understand the question. `InlineSmallCode` acts on already compiled methods. This logic analyzes call site frequencies. > >> Also `MinInlineFrequencyRatio ` is 0.0085 which corresponds to `Tier4MinInvocationThreshold` value equal approximately 117. The default value is 600. The only way to change it is to use `CompileThreshold` or `CompileThresholdScaling` which nobody use in production but only in testing. So in production we use `MinInlineFrequencyRatio` as `min_freq`. Why complicate the code. > > Can't exactly tell. But I'm sure I had a reason, probably some corner case exposed by testing. Probably something didn't get inlined with very low thresholds. > >> >> An other issue. This is `should_not_inline()` method. Returning `false` means **inlining** called method. I was curious why we return `false` when we run with `-Xcomp` (see check `(UseInterpreter)` at line 301) ? If we return `false` when `min_invocations()` is 0 it will be similar to `-Xcomp` mode. >> >> Is it intentional to inline when we don't have profiling info? > > I think some tests run with -Xcomp and expect inlining to happen. So lowering CompileThreshold to below 117, means "compile sooner", but inline less, because should_not_inline will return true more often? 
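The 117 figure and the effect Dean asks about both follow from the cutoff having the form max(r, 1/N), as described in the thread. Here r is `MinInlineFrequencyRatio` (0.0085) and N is `CompilationPolicy::min_invocations()`. A worked version:

```latex
\begin{aligned}
\mathit{min\_freq}(N) &= \max\!\left(r,\ \tfrac{1}{N}\right), \qquad r = 0.0085 \\
\tfrac{1}{N} > r &\iff N < \tfrac{1}{r} = \tfrac{1}{0.0085} \approx 117.6 \\
N = 600 &\;\Rightarrow\; \mathit{min\_freq} = \max(0.0085,\ 0.00167) = 0.0085 \\
N = 10 &\;\Rightarrow\; \mathit{min\_freq} = \max(0.0085,\ 0.1) = 0.1 \\
N \to 0^{+} &\;\Rightarrow\; \mathit{min\_freq} \to \infty
\end{aligned}
```

So for N below about 117, the 1/N term dominates and the cutoff rises. The condition `freq < min_freq` then holds more often, and fewer call sites are inlined. At N = 0 the expression has no finite value, which is the division by zero in the ubsan report.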
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20615#discussion_r1720426670 From iveresov at openjdk.org Fri Aug 16 22:59:52 2024 From: iveresov at openjdk.org (Igor Veresov) Date: Fri, 16 Aug 2024 22:59:52 GMT Subject: RFR: 8333098: ubsan: bytecodeInfo.cpp:318:59: runtime error: division by zero In-Reply-To: <0otu-G39asM13W41LLoDM10iaBzumIn1cxuJcHq3KWs=.58db3172-a0b4-4988-b448-287e310423c1@github.com> References: <0otu-G39asM13W41LLoDM10iaBzumIn1cxuJcHq3KWs=.58db3172-a0b4-4988-b448-287e310423c1@github.com> Message-ID: On Fri, 16 Aug 2024 22:45:58 GMT, Dean Long wrote: >>> > think the existing behavior is correct - the goal is to compute a minimum call freq that is considered for inlining. 1.0/min_invocations() means call site was reached at least once before the method was submitted for a C2 compile. >>> >>> Do we really need to take `1.0/min_invocations()` into account? `InlineSmallCode` should handle cases like that. >>> >> >> I'm not sure I understand the question. `InlineSmallCode` acts on already compiled methods. This logic analyzes call site frequencies. >> >>> Also `MinInlineFrequencyRatio ` is 0.0085 which corresponds to `Tier4MinInvocationThreshold` value equal approximately 117. The default value is 600. The only way to change it is to use `CompileThreshold` or `CompileThresholdScaling` which nobody use in production but only in testing. So in production we use `MinInlineFrequencyRatio` as `min_freq`. Why complicate the code. >> >> Can't exactly tell. But I'm sure I had a reason, probably some corner case exposed by testing. Probably something didn't get inlined with very low thresholds. >> >>> >>> An other issue. This is `should_not_inline()` method. Returning `false` means **inlining** called method. I was curious why we return `false` when we run with `-Xcomp` (see check `(UseInterpreter)` at line 301) ? If we return `false` when `min_invocations()` is 0 it will be similar to `-Xcomp` mode. 
>>> >>> Is it intentional to inline when we don't have profiling info? >> >> I think some tests run with -Xcomp and expect inlining to happen. > > So lowering CompileThreshold to below 117, means "compile sooner", but inline less, because should_not_inline will return true more often? Yes, the problem is that the tests still expect you to inline things even with small or 0 thresholds. Hence the `1/min_invocations()` floor. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20615#discussion_r1720434880 From sviswanathan at openjdk.org Fri Aug 16 23:11:56 2024 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Fri, 16 Aug 2024 23:11:56 GMT Subject: RFR: 8337632: AES-GCM Algorithm optimization for x86_64 In-Reply-To: References: Message-ID: On Mon, 22 Jan 2024 09:38:25 GMT, Smita Kamath wrote: > Hi, > I want to submit an AES-GCM algorithm optimization. This implementation is using AVX512/VAES Instructions. Additionally, it reduces PARALLEL_LEN from 7680 to 512 bytes. The performance numbers are as below. Kindly review the code. Thank you. > > Benchmark | Datasize | BaseJDK (ops/s) | Patch(ops/s) | %Gain > -- | -- | -- | -- | -- > full.AESGCMBench.decrypt | 512 | 2928259.197 | 3269964.387 | 11.67 > full.AESGCMBench.decrypt | 1024 | 2494254.611 | 3010987.731 | 20.72 > full.AESGCMBench.decrypt | 1500 | 1883453.546 | 1934915.846 | 2.73 > full.AESGCMBench.decrypt | 2048 | 1825780.711 | 2452861.368 | 34.34 > full.AESGCMBench.decrypt | 4096 | 1275108.345 | 1806329.066 | 41.66 > full.AESGCMBench.decrypt | 8192 | 1033936.634 | 1196836.052 | 15.75 > full.AESGCMBench.decrypt | 16384 | 681494.768 | 711630.498 | 4.42 > full.AESGCMBench.decrypt | 32768 | 385026.017 | 395043.193 | 2.6 > full.AESGCMBench.decrypt | 65536 | 207373.924 | 214723.588 | 3.54
> full.AESGCMBench.encrypt | 512 | 2658008.476 | 2882496.94 | 8.45 > full.AESGCMBench.encrypt | 1024 | 2283709.63 | 2589534.403 | 13.39 > full.AESGCMBench.encrypt | 1500 | 1794993.519 | 1817669.531 | 1.26 > full.AESGCMBench.encrypt | 2048 | 1745532.435 | 2191097.29 | 25.52 > full.AESGCMBench.encrypt | 4096 | 1203301.174 | 1649593.953 | 37.08 > full.AESGCMBench.encrypt | 8192 | 985174.988 | 1132407.54 | 14.94 > full.AESGCMBench.encrypt | 16384 | 658980.441 | 684765.771 | 3.91 > full.AESGCMBench.encrypt | 32768 | 373543.798 | 391518.837 | 4.81 > full.AESGCMBench.encrypt | 65536 | 202532.315 | 205084.833 | 1.260301597 src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 2755: > 2753: __ vpshufb(HK, HK, xmm10, Assembler::AVX_128bit); > 2754: __ movdqu(xmm11, ExternalAddress(ghash_polynomial_addr()), r15); > 2755: __ movdqu(xmm12, ExternalAddress(ghash_polynomial_two_one_addr()), r15); There is a mix of direct xmm register usage and ZT based usage in this method, will be good to be consistent. src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 2932: > 2930: void StubGenerator::ghash16_avx512(bool start_ghash, bool do_reduction, bool uload_shuffle, bool hk_broadcast, bool do_hxor, > 2931: Register in, Register pos, Register subkeyHtbl, XMMRegister HASH, int in_offset, > 2932: int in_disp, int displacement, int hashkey_offset) { GL, GH and SHUFM could be added to the parameter list. src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 3038: > 3036: //new reduction > 3037: __ evmovdquq(xmm23, ExternalAddress(ghash_polynomial_addr()), Assembler::AVX_512bit, rbx /*rscratch*/); > 3038: __ evpclmulqdq(HASH, GL, xmm23, 0x10, Assembler::AVX_512bit); Good to refer to xmm23 as ZTMP22. src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 3048: > 3046: > 3047: //Stitched GHASH of 16 blocks(with reduction) with encryption of N blocks > 3048: //followed with GHASH of the N blocks. Should this comment be updated as there are 0 blocks to cipher? 
src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 3053: > 3051: //there is 0 blocks to cipher so there are only 16 blocks for ghash and reduction > 3052: ghash16_avx512(start_ghash, do_reduction, false, false, true, in, pos, subkeyHtbl, HASH, ghashin_offset, 0, 0, hashkey_offset); > 3053: //**ZT01 may include sensitive data Spurious comment, no ZT01? src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 3078: > 3076: const XMMRegister GHKEY1 = xmm1, GHKEY2 = xmm18, GHDAT1 = xmm8, GHDAT2 = xmm22; > 3077: const XMMRegister ADDBE_4x4 = xmm27, ADDBE_1234 = xmm28; > 3078: const XMMRegister GHASH_IN = xmm14, TO_REDUCE_L = xmm25, TO_REDUCE_H = xmm24; Good to add a const XMMRegister ZT = xmm23; and then use ZT below inplace of xmm23. src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 3222: > 3220: if (do_hash_reduction) { > 3221: __ evmovdquq(xmm23, ExternalAddress(ghash_polynomial_reduction_addr()), Assembler::AVX_512bit, rbx /*rscratch*/); > 3222: __ evpclmulqdq(THH1, TO_REDUCE_L, xmm23, 0x10, Assembler::AVX_512bit); Use previously defined ZT here. src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 3297: > 3295: const XMMRegister T2 = xmm4; > 3296: const XMMRegister T3 = xmm5; > 3297: const XMMRegister T4 = xmm6; Good to define const XMMRegister T5 = xmm30 and use that below. src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 3324: > 3322: > 3323: //move to AES encryption rounds > 3324: __ movdqu(xmm30, ExternalAddress(key_shuffle_mask_addr()), rbx /*rscratch*/); Use T5 here and below. src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 3417: > 3415: const XMMRegister ADDBE_4x4 = xmm27; > 3416: const XMMRegister ADDBE_1234 = xmm28; > 3417: const XMMRegister ADD_1234 = xmm13; Looks like xmm9 is available across so ADD_1234 could use xmm9 and then it will not need to be reloaded. 
src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 3503: > 3501: > 3502: __ bind(ENCRYPT_N_GHASH_32_N_BLKS); > 3503: ghash16_avx512(true, false, false, false, true, in, pos, avx512_subkeyHtbl, AAD_HASHx, stack_offset, 0, 0, HashKey_32); ghash16_avx512 needs to pass in GL, GH, and SHUF_MASK. src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 3515: > 3513: __ subl(len, 16 * 16); > 3514: __ addl(pos, 16 * 16); > 3515: gcm_enc_dec_last_avx512(len, in, pos, AAD_HASHx, avx512_subkeyHtbl, ghashin_offset, HashKey_16, true, true); gcm_enc_dec_last needs to pass as argument: GL, GH, and SHUF_MASK. Note: Looks like GL, GH are internal scope only for all the methods (ghash16_avx512, ghash16_encrypt_parallel16_avx512, gcm_enc_dec_last). In which case we can skip passing GL/GH as argument everywhere. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1720445419 PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1720360760 PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1720409416 PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1720386633 PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1720386899 PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1720419639 PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1720420026 PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1720423880 PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1720423997 PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1720394168 PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1720371004 PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1720379555 From john.r.rose at oracle.com Fri Aug 16 23:38:45 2024 From: john.r.rose at oracle.com (John Rose) Date: Fri, 16 Aug 2024 19:38:45 -0400 Subject: RFR: 8338023: Support two vector selectFrom API In-Reply-To: 
References: <28KQHru1heR-YOVsRVo8Ffj_4D29IV8vD2tombvTHdI=.dba80ac3-9804-4074-ac0f-8acb9b042a08@github.com> Message-ID: (Better late than never, although I wish I'd been more explicit about this on panama-dev.) I think we should be moving away from throwing exceptions on all reorder/shuffle/permute vector ops, and moving toward wrapping. These ops all operate on vectors (small arrays) of vector lane indexes (small array indexes in a fixed domain, always a power of two). The throwing behavior checks an input for bad indexes and throws a (scalar) exception if there are any at all. The wrapping behavior reduces bad indexes to good ones by an unsigned modulo operation (which is at worst a mask for powers of two). If I'm right, then new API points should start out with wrap semantics, not throw semantics. And old API points should be migrated ASAP. There's no loss of functionality in such a move. Instead the defaults are moved around. Before, throwing was the default and wrapping was an explicit operation. After, wrapping would be the default and throwing would be explicit. Both wrapping and throwing checks are available through explicit calls to VectorShuffle methods checkIndexes and wrapIndexes. OK, so why is wrapping better than throwing? And first, why did we start with throwing as the default? Well, we chose throwing as the default to make the vector operations more Java-like. Java scalar operations don't try to reduce bad array indexes into the array domain; they throw. Since a shuffle op is like an array reference, it makes sense to emulate the checks built into Java array references. Or it did make sense. I think there is a technical debt here which is turning out to be hard to pay off. The tech debt is to suppress or hoist or strength-reduce the vector instructions that perform the check for invalid indexes (in parallel), then ask "did any of those checks fail?" (a mask reduction), then do a conditional branch to failure code.
I think I was over-confident that our scalar tactics for reducing array range checks would apply to vectors as well. On second thought, vectorizing our key optimization, of loop range splitting (pre/main/post loops) is kind of a nightmare. Instead, consider the alternative of wrapping. First, you use vpand or the like to mask the indexes down to the valid range. Then you run the shuffle/permute instruction. That's it. There is no scalar query or branch. And, there are probably some circumstances where you can omit the vpand operation: Perhaps the hardware already masks the inputs (as with shift instructions). Or, perhaps C2 can do bitwise inference of the vectors and figure out that the vpand is a nop. (I am agitating for bitwise types in C2; this is a use case for them.) In the worst case, the vpand op is fast and pipelines well. This is why I think we should switch, ASAP, to masking instead of throwing, on bad indexes. I think some of our reports from customers have shown that the extra checks necessary for throwing on bad indexes are giving their code surprising slowdowns, relative to C-based vector code. Did I miss a point? -- John On 14 Aug 2024, at 3:43, Jatin Bhateja wrote: > On Mon, 12 Aug 2024 22:03:44 GMT, Paul Sandoz wrote: > >> The results look promising. I can provide guidance on the specification e.g., we can specify the behavior in terms of rearrange, with the addition of throwing on out of bounds indexes. >> >> Regarding the throwing of exceptions, some wider context will help to know where we are heading before we finalize the specification. I believe we are considering changing the default throwing behavior for index out of bounds to wrapping, thereby we can avoid bounds checks. If that is the case we should wait until that is done then update rather than submitting a CSR just yet? >> >> I see you created a specific intrinsic, which will avoid the cost of shuffle creation.
Should we apply the same approach (in a subsequent PR) to the single argument shuffle? Or perhaps if we manage to optimize shuffles and change the default wrapping we don't require a specific intrinsic and can just use defer to rearrange? > > Hi @PaulSandoz , > Thanks for your comments. With this new API we intend to enforce stricter specification w.r.t to index values to emit a lean instruction sequence preventing any cycles spent on massaging inputs to a consumable form, thus preventing redundant wrapping and unwrapping operations. > > Existing [two vector rearrange API](https://docs.oracle.com/en/java/javase/22/docs/api/jdk.incubator.vector/jdk/incubator/vector/Vector.html#rearrange(jdk.incubator.vector.VectorShuffle,jdk.incubator.vector.Vector)) has a flexible specification which allows wrapping out of bounds shuffle indexes into exceptional index with a -ve value. > > Even if we optimize existing two vector rearrange implementation we will still need to emit additional instructions to generate an indexes which lie within two vector range [0, 2*VLEN). I see this as a specialized API like vector compress/expand which cater to targets like x86-AVX512+ and aarch64-SVE which offers direct instruction for two vector lookups. > > May be the API nomenclature can be refined to better reflect its semantics i.e. from selectFrom to twoVectorLookup ? 
> > ------------- > > PR Comment: https://git.openjdk.org/jdk/pull/20508#issuecomment-2288062038 From dlong at openjdk.org Sat Aug 17 00:14:47 2024 From: dlong at openjdk.org (Dean Long) Date: Sat, 17 Aug 2024 00:14:47 GMT Subject: RFR: 8333098: ubsan: bytecodeInfo.cpp:318:59: runtime error: division by zero In-Reply-To: References: <0otu-G39asM13W41LLoDM10iaBzumIn1cxuJcHq3KWs=.58db3172-a0b4-4988-b448-287e310423c1@github.com> Message-ID: On Fri, 16 Aug 2024 22:57:35 GMT, Igor Veresov wrote: >> So lowering CompileThreshold to below 117, means "compile sooner", but inline less, because should_not_inline will return true more often? > > Yes, the problem is that the tests still expect you to inline things even with small or 0 thresholds. Hence the `1/min_invocations()` floor. But this turns into should not inline for freq < infinity or freq < 1, or basically "always should not inline". Wouldn't we want to use MIN2 instead of MAX2 for floor? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20615#discussion_r1720483786 From iveresov at openjdk.org Sat Aug 17 02:18:58 2024 From: iveresov at openjdk.org (Igor Veresov) Date: Sat, 17 Aug 2024 02:18:58 GMT Subject: RFR: 8333098: ubsan: bytecodeInfo.cpp:318:59: runtime error: division by zero In-Reply-To: References: <0otu-G39asM13W41LLoDM10iaBzumIn1cxuJcHq3KWs=.58db3172-a0b4-4988-b448-287e310423c1@github.com> Message-ID: On Sat, 17 Aug 2024 00:11:55 GMT, Dean Long wrote: >> Yes, the problem is that the tests still expect you to inline things even with small or 0 thresholds. Hence the `1/min_invocations()` floor. > > But this turns into should not inline for freq < infinity or freq < 1, or basically "always should not inline". Wouldn't we want to use MIN2 instead of MAX2 for floor? Good question. I think what I originally meant it to be is the minimum measurable frequency. In that case MAX2 would work. On the other hand, it's not a scaled value. 
I think we can try removing it and leaving only `freq < MinInlineFrequencyRatio` and see which test fails and why. I, unfortunately, don't remember the details. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20615#discussion_r1720551018 From aturbanov at openjdk.org Sat Aug 17 17:26:56 2024 From: aturbanov at openjdk.org (Andrey Turbanov) Date: Sat, 17 Aug 2024 17:26:56 GMT Subject: RFR: 8338023: Support two vector selectFrom API In-Reply-To: <28KQHru1heR-YOVsRVo8Ffj_4D29IV8vD2tombvTHdI=.dba80ac3-9804-4074-ac0f-8acb9b042a08@github.com> References: <28KQHru1heR-YOVsRVo8Ffj_4D29IV8vD2tombvTHdI=.dba80ac3-9804-4074-ac0f-8acb9b042a08@github.com> Message-ID: On Thu, 8 Aug 2024 06:57:28 GMT, Jatin Bhateja wrote: > Hi All, > > As per the discussion on panama-dev mailing list[1], patch adds support for the following new two vector permutation APIs. > > > Declaration:- > Vector.selectFrom(Vector v1, Vector v2) > > > Semantics:- > Using index values stored in the lanes of "this" vector, assemble the values stored in first (v1) and second (v2) vector arguments. Thus, first and second vector serve as a table, whose elements are selected based on index value vector. API is applicable to all integral and floating-point types. The result of this operation is semantically equivalent to expression v1.rearrange(this.toShuffle(), v2). Values held in index vector lanes must lie within valid two vector index range [0, 2*VLEN) else an IndexOutOfBoundsException is thrown. > > Summary of changes: > - Java side implementation of new selectFrom API. > - C2 compiler IR and inline expander changes. > - In absence of direct two vector permutation instruction in target ISA, a lowering transformation dismantles new IR into constituent IR supported by target platforms. > - Optimized x86 backend implementation for AVX512 and legacy target. > - Function tests covering new API.
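A scalar reference model of the `selectFrom` semantics quoted above, offered only as a sketch: it uses `int[]` arrays for the index vector and the two table vectors (in the proposed API the index vector appears to share the data's element type), and the class and method names are hypothetical. Lane `i` reads `v1[idx]` when `idx < VLEN` and `v2[idx - VLEN]` when `VLEN <= idx < 2*VLEN`, mirroring the `is_exceptional_idx` logic in the test helpers reviewed in this thread; any index outside `[0, 2*VLEN)` throws.

```java
import java.util.Arrays;

class TwoVectorSelect {
    // Hypothetical scalar model of two-vector selectFrom:
    // result[i] = v1[idx] if idx < VLEN, else v2[idx - VLEN] if idx < 2*VLEN.
    static int[] selectFrom(int[] indexes, int[] v1, int[] v2) {
        int vlen = v1.length;
        if (v2.length != vlen || indexes.length != vlen) {
            throw new IllegalArgumentException("operands must have the same length");
        }
        int[] r = new int[vlen];
        for (int i = 0; i < vlen; i++) {
            int idx = indexes[i];
            if (idx < 0 || idx >= 2 * vlen) {
                // Out-of-range lanes throw, as the PR description specifies.
                throw new IndexOutOfBoundsException("lane " + i + ": index " + idx);
            }
            // Indexes >= VLEN (the tests' "exceptional" indexes) select from v2.
            r[i] = idx < vlen ? v1[idx] : v2[idx - vlen];
        }
        return r;
    }

    public static void main(String[] args) {
        int[] v1 = {0, 1, 2, 3};
        int[] v2 = {10, 11, 12, 13};
        int[] idx = {7, 0, 4, 2};
        System.out.println(Arrays.toString(selectFrom(idx, v1, v2))); // [13, 0, 10, 2]
    }
}
```

The two tables behave like one concatenated array of length `2*VLEN`, which is why a single hardware two-table lookup (where available) can implement the operation directly.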
> > JMH micro included with this patch shows around 10-15x gain over existing rearrange API :- > Test System: Intel(R) Xeon(R) Platinum 8480+ [ Sapphire Rapids Server] > > > Benchmark (size) Mode Cnt Score Error Units > SelectFromBenchmark.rearrangeFromByteVector 1024 thrpt 2 2041.762 ops/ms > SelectFromBenchmark.rearrangeFromByteVector 2048 thrpt 2 1028.550 ops/ms > SelectFromBenchmark.rearrangeFromIntVector 1024 thrpt 2 962.605 ops/ms > SelectFromBenchmark.rearrangeFromIntVector 2048 thrpt 2 479.004 ops/ms > SelectFromBenchmark.rearrangeFromLongVector 1024 thrpt 2 359.758 ops/ms > SelectFromBenchmark.rearrangeFromLongVector 2048 thrpt 2 178.192 ops/ms > SelectFromBenchmark.rearrangeFromShortVector 1024 thrpt 2 1463.459 ops/ms > SelectFromBenchmark.rearrangeFromShortVector 2048 thrpt 2 727.556 ops/ms > SelectFromBenchmark.selectFromByteVector 1024 thrpt 2 33254.830 ops/ms > SelectFromBenchmark.selectFromByteVector 2048 thrpt 2 17313.174 ops/ms > SelectFromBenchmark.selectFromIntVector 1024 thrpt 2 10756.804 ops/ms > SelectFromBenchmark.selectFromIntVector 2048 thrpt 2 5398.2... test/jdk/jdk/incubator/vector/Byte128VectorTests.java line 331: > 329: boolean is_exceptional_idx = (int)order[idx] >= vector_len; > 330: int oidx = is_exceptional_idx ? ((int)order[idx] - vector_len) : (int)order[idx]; > 331: Assert.assertEquals(r[idx], (is_exceptional_idx ? b[i + oidx] : a[i + oidx]), "at index #" + idx + ", order = " + (int)order[idx] + ", a = " + a[i + oidx] + ", b = " + b[i + oidx]); Suggestion: Assert.assertEquals(r[idx], (is_exceptional_idx ? b[i + oidx] : a[i + oidx]), "at index #" + idx + ", order = " + (int)order[idx] + ", a = " + a[i + oidx] + ", b = " + b[i + oidx]); test/jdk/jdk/incubator/vector/Double64VectorTests.java line 348: > 346: boolean is_exceptional_idx = (int)order[idx] >= vector_len; > 347: int oidx = is_exceptional_idx ? ((int)order[idx] - vector_len) : (int)order[idx]; > 348: Assert.assertEquals(r[idx], (is_exceptional_idx ? 
b[i + oidx] : a[i + oidx]), "at index #" + idx + ", order = " + (int)order[idx] + ", a = " + a[i + oidx] + ", b = " + b[i + oidx]); Suggestion: Assert.assertEquals(r[idx], (is_exceptional_idx ? b[i + oidx] : a[i + oidx]), "at index #" + idx + ", order = " + (int)order[idx] + ", a = " + a[i + oidx] + ", b = " + b[i + oidx]); test/jdk/jdk/incubator/vector/DoubleMaxVectorTests.java line 353: > 351: boolean is_exceptional_idx = (int)order[idx] >= vector_len; > 352: int oidx = is_exceptional_idx ? ((int)order[idx] - vector_len) : (int)order[idx]; > 353: Assert.assertEquals(r[idx], (is_exceptional_idx ? b[i + oidx] : a[i + oidx]), "at index #" + idx + ", order = " + (int)order[idx] + ", a = " + a[i + oidx] + ", b = " + b[i + oidx]); Suggestion: Assert.assertEquals(r[idx], (is_exceptional_idx ? b[i + oidx] : a[i + oidx]), "at index #" + idx + ", order = " + (int)order[idx] + ", a = " + a[i + oidx] + ", b = " + b[i + oidx]); test/jdk/jdk/incubator/vector/Float128VectorTests.java line 348: > 346: boolean is_exceptional_idx = (int)order[idx] >= vector_len; > 347: int oidx = is_exceptional_idx ? ((int)order[idx] - vector_len) : (int)order[idx]; > 348: Assert.assertEquals(r[idx], (is_exceptional_idx ? b[i + oidx] : a[i + oidx]), "at index #" + idx + ", order = " + (int)order[idx] + ", a = " + a[i + oidx] + ", b = " + b[i + oidx]); Suggestion: Assert.assertEquals(r[idx], (is_exceptional_idx ? b[i + oidx] : a[i + oidx]), "at index #" + idx + ", order = " + (int)order[idx] + ", a = " + a[i + oidx] + ", b = " + b[i + oidx]); test/jdk/jdk/incubator/vector/Float256VectorTests.java line 348: > 346: boolean is_exceptional_idx = (int)order[idx] >= vector_len; > 347: int oidx = is_exceptional_idx ? ((int)order[idx] - vector_len) : (int)order[idx]; > 348: Assert.assertEquals(r[idx], (is_exceptional_idx ? 
b[i + oidx] : a[i + oidx]), "at index #" + idx + ", order = " + (int)order[idx] + ", a = " + a[i + oidx] + ", b = " + b[i + oidx]); Suggestion: Assert.assertEquals(r[idx], (is_exceptional_idx ? b[i + oidx] : a[i + oidx]), "at index #" + idx + ", order = " + (int)order[idx] + ", a = " + a[i + oidx] + ", b = " + b[i + oidx]); test/jdk/jdk/incubator/vector/Float512VectorTests.java line 348: > 346: boolean is_exceptional_idx = (int)order[idx] >= vector_len; > 347: int oidx = is_exceptional_idx ? ((int)order[idx] - vector_len) : (int)order[idx]; > 348: Assert.assertEquals(r[idx], (is_exceptional_idx ? b[i + oidx] : a[i + oidx]), "at index #" + idx + ", order = " + (int)order[idx] + ", a = " + a[i + oidx] + ", b = " + b[i + oidx]); Suggestion: Assert.assertEquals(r[idx], (is_exceptional_idx ? b[i + oidx] : a[i + oidx]), "at index #" + idx + ", order = " + (int)order[idx] + ", a = " + a[i + oidx] + ", b = " + b[i + oidx]); test/jdk/jdk/incubator/vector/FloatMaxVectorTests.java line 353: > 351: boolean is_exceptional_idx = (int)order[idx] >= vector_len; > 352: int oidx = is_exceptional_idx ? ((int)order[idx] - vector_len) : (int)order[idx]; > 353: Assert.assertEquals(r[idx], (is_exceptional_idx ? b[i + oidx] : a[i + oidx]), "at index #" + idx + ", order = " + (int)order[idx] + ", a = " + a[i + oidx] + ", b = " + b[i + oidx]); Suggestion: Assert.assertEquals(r[idx], (is_exceptional_idx ? b[i + oidx] : a[i + oidx]), "at index #" + idx + ", order = " + (int)order[idx] + ", a = " + a[i + oidx] + ", b = " + b[i + oidx]); test/jdk/jdk/incubator/vector/Int512VectorTests.java line 331: > 329: boolean is_exceptional_idx = (int)order[idx] >= vector_len; > 330: int oidx = is_exceptional_idx ? ((int)order[idx] - vector_len) : (int)order[idx]; > 331: Assert.assertEquals(r[idx], (is_exceptional_idx ? 
b[i + oidx] : a[i + oidx]), "at index #" + idx + ", order = " + (int)order[idx] + ", a = " + a[i + oidx] + ", b = " + b[i + oidx]); Suggestion: Assert.assertEquals(r[idx], (is_exceptional_idx ? b[i + oidx] : a[i + oidx]), "at index #" + idx + ", order = " + (int)order[idx] + ", a = " + a[i + oidx] + ", b = " + b[i + oidx]); test/jdk/jdk/incubator/vector/IntMaxVectorTests.java line 336: > 334: boolean is_exceptional_idx = (int)order[idx] >= vector_len; > 335: int oidx = is_exceptional_idx ? ((int)order[idx] - vector_len) : (int)order[idx]; > 336: Assert.assertEquals(r[idx], (is_exceptional_idx ? b[i + oidx] : a[i + oidx]), "at index #" + idx + ", order = " + (int)order[idx] + ", a = " + a[i + oidx] + ", b = " + b[i + oidx]); Suggestion: Assert.assertEquals(r[idx], (is_exceptional_idx ? b[i + oidx] : a[i + oidx]), "at index #" + idx + ", order = " + (int)order[idx] + ", a = " + a[i + oidx] + ", b = " + b[i + oidx]); test/jdk/jdk/incubator/vector/Long256VectorTests.java line 288: > 286: boolean is_exceptional_idx = (int)order[idx] >= vector_len; > 287: int oidx = is_exceptional_idx ? ((int)order[idx] - vector_len) : (int)order[idx]; > 288: Assert.assertEquals(r[idx], (is_exceptional_idx ? b[i + oidx] : a[i + oidx]), "at index #" + idx + ", order = " + (int)order[idx] + ", a = " + a[i + oidx] + ", b = " + b[i + oidx]); Suggestion: Assert.assertEquals(r[idx], (is_exceptional_idx ? b[i + oidx] : a[i + oidx]), "at index #" + idx + ", order = " + (int)order[idx] + ", a = " + a[i + oidx] + ", b = " + b[i + oidx]); test/jdk/jdk/incubator/vector/Long64VectorTests.java line 288: > 286: boolean is_exceptional_idx = (int)order[idx] >= vector_len; > 287: int oidx = is_exceptional_idx ? ((int)order[idx] - vector_len) : (int)order[idx]; > 288: Assert.assertEquals(r[idx], (is_exceptional_idx ? 
b[i + oidx] : a[i + oidx]), "at index #" + idx + ", order = " + (int)order[idx] + ", a = " + a[i + oidx] + ", b = " + b[i + oidx]); Suggestion: Assert.assertEquals(r[idx], (is_exceptional_idx ? b[i + oidx] : a[i + oidx]), "at index #" + idx + ", order = " + (int)order[idx] + ", a = " + a[i + oidx] + ", b = " + b[i + oidx]); test/jdk/jdk/incubator/vector/Short256VectorTests.java line 331: > 329: boolean is_exceptional_idx = (int)order[idx] >= vector_len; > 330: int oidx = is_exceptional_idx ? ((int)order[idx] - vector_len) : (int)order[idx]; > 331: Assert.assertEquals(r[idx], (is_exceptional_idx ? b[i + oidx] : a[i + oidx]), "at index #" + idx + ", order = " + (int)order[idx] + ", a = " + a[i + oidx] + ", b = " + b[i + oidx]); Suggestion: Assert.assertEquals(r[idx], (is_exceptional_idx ? b[i + oidx] : a[i + oidx]), "at index #" + idx + ", order = " + (int)order[idx] + ", a = " + a[i + oidx] + ", b = " + b[i + oidx]); ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20508#discussion_r1720807165 PR Review Comment: https://git.openjdk.org/jdk/pull/20508#discussion_r1720807191 PR Review Comment: https://git.openjdk.org/jdk/pull/20508#discussion_r1720807216 PR Review Comment: https://git.openjdk.org/jdk/pull/20508#discussion_r1720807254 PR Review Comment: https://git.openjdk.org/jdk/pull/20508#discussion_r1720807143 PR Review Comment: https://git.openjdk.org/jdk/pull/20508#discussion_r1720807202 PR Review Comment: https://git.openjdk.org/jdk/pull/20508#discussion_r1720807129 PR Review Comment: https://git.openjdk.org/jdk/pull/20508#discussion_r1720807262 PR Review Comment: https://git.openjdk.org/jdk/pull/20508#discussion_r1720807098 PR Review Comment: https://git.openjdk.org/jdk/pull/20508#discussion_r1720807239 PR Review Comment: https://git.openjdk.org/jdk/pull/20508#discussion_r1720807206 PR Review Comment: https://git.openjdk.org/jdk/pull/20508#discussion_r1720807231 From aturbanov at openjdk.org Sat Aug 17 17:28:56 2024 From: aturbanov at 
openjdk.org (Andrey Turbanov) Date: Sat, 17 Aug 2024 17:28:56 GMT Subject: RFR: 8338021: Support saturating vector operators in VectorAPI [v3] In-Reply-To: References: Message-ID: <9inKZjq3czAlh1fgRHhzGPxABxYlC6FEVpg7nloQYok=.9cd4a3f6-6d87-40c1-b9ee-63927bd7391f@github.com> On Wed, 14 Aug 2024 04:59:23 GMT, Jatin Bhateja wrote: >> Hi All, >> >> As per the discussion on panama-dev mailing list[1], patch adds support for the following new vector operators. >> >> >> . SUADD : Saturating unsigned addition. >> . SADD : Saturating signed addition. >> . SUSUB : Saturating unsigned subtraction. >> . SSUB : Saturating signed subtraction. >> . UMAX : Unsigned max >> . UMIN : Unsigned min. >> >> >> New vector operators are applicable to only integral types since their values wrap around in over/underflowing scenarios after setting appropriate status flags. For floating point types, as per IEEE 754 specs there are multiple schemes to handle underflow, one of them is gradual underflow which transitions the value to subnormal range. Similarly, overflow implicitly saturates the floating-point value to an infinite value. >> >> As the name suggests, these are saturating operations, i.e. the result of the computation is strictly capped by lower and upper bounds of the result type and is not wrapped around in underflowing or overflowing scenarios. >> >> Summary of changes: >> - Java side implementation of new vector operators. >> - Add new scalar saturating APIs for each of the above saturating vector operator in corresponding primitive box classes, fallback implementation of vector operators is based over it. >> - C2 compiler IR and inline expander changes. >> - Optimized x86 backend implementation for new vector operators and their predicated counterparts. >> - Extends existing VectorAPI Jtreg test suite to cover new operations. >> >> Kindly review and share your feedback.
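To make the six operators concrete, here is a scalar sketch on `int`, assuming Java's two's-complement wraparound. The helper names (`suadd`, `sadd`, `susub`, `ssub`, `umax`, `umin`) are hypothetical and are not the scalar methods the PR adds to the box classes. `suadd` reuses the overflow test visible in the reviewed `Byte`/`Long`/`Short` code, `compareUnsigned(res, a | b) < 0`: since `a + b == (a | b) + (a & b)`, an unwrapped sum is never unsigned-less-than `a | b`, while a wrapped one always is.

```java
class SaturatingOps {
    // SUADD: unsigned add, clamped to 0xFFFFFFFF (-1 as a signed int).
    static int suadd(int a, int b) {
        int res = a + b;
        boolean overflow = Integer.compareUnsigned(res, a | b) < 0;
        return overflow ? -1 : res;
    }

    // SADD: signed add, clamped to [MIN_VALUE, MAX_VALUE].
    static int sadd(int a, int b) {
        int res = a + b;
        // Overflow iff a and b share a sign that res does not have.
        if (((a ^ res) & (b ^ res)) < 0) {
            return a < 0 ? Integer.MIN_VALUE : Integer.MAX_VALUE;
        }
        return res;
    }

    // SUSUB: unsigned subtract, clamped at 0.
    static int susub(int a, int b) {
        return Integer.compareUnsigned(a, b) < 0 ? 0 : a - b;
    }

    // SSUB: signed subtract, clamped to [MIN_VALUE, MAX_VALUE].
    static int ssub(int a, int b) {
        int res = a - b;
        // Overflow iff a and b differ in sign and res's sign differs from a's.
        if (((a ^ b) & (a ^ res)) < 0) {
            return a < 0 ? Integer.MIN_VALUE : Integer.MAX_VALUE;
        }
        return res;
    }

    // UMAX / UMIN: max and min under unsigned ordering.
    static int umax(int a, int b) { return Integer.compareUnsigned(a, b) >= 0 ? a : b; }
    static int umin(int a, int b) { return Integer.compareUnsigned(a, b) <= 0 ? a : b; }

    public static void main(String[] args) {
        System.out.println(sadd(Integer.MAX_VALUE, 1));              // 2147483647
        System.out.println(ssub(Integer.MIN_VALUE, 1));              // -2147483648
        System.out.println(Integer.toUnsignedString(suadd(-2, 5)));  // 4294967295
        System.out.println(susub(3, 5));                             // 0
        System.out.println(umax(-1, 1));                             // -1 (0xFFFFFFFF is the unsigned max)
    }
}
```

Note that the signed and unsigned saturating adds differ only in how overflow is detected and which bound is returned; the underlying wrapped sum is the same bit pattern.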
>> >> Best Regards, >> PS: Intrinsification and auto-vectorization of new core-lib API will be addressed separately in a follow-up patch. >> >> [1] https://mail.openjdk.org/pipermail/panama-dev/2024-May/020408.html > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Review comments resolutions src/java.base/share/classes/java/lang/Byte.java line 81: > 79: * A constant holding polarity(sign) mask used by saturating operations. > 80: */ > 81: public static final byte POLARITY_MASK_BYTE = (byte)(1 << 7); Suggestion: public static final byte POLARITY_MASK_BYTE = (byte)(1 << 7); src/java.base/share/classes/java/lang/Byte.java line 672: > 670: byte res = (byte)(a + b); > 671: boolean overflow = Byte.compareUnsigned(res, (byte)(a | b)) < 0; > 672: if (overflow) { Suggestion: if (overflow) { src/java.base/share/classes/java/lang/Long.java line 93: > 91: * A constant holding polarity(sign) mask used by saturating operations. > 92: */ > 93: public static final long POLARITY_MASK_LONG = 1L << 63; Suggestion: public static final long POLARITY_MASK_LONG = 1L << 63; src/java.base/share/classes/java/lang/Long.java line 2033: > 2031: long res = a + b; > 2032: boolean overflow = Long.compareUnsigned(res, (a | b)) < 0; > 2033: if (overflow) { Suggestion: if (overflow) { src/java.base/share/classes/java/lang/Short.java line 707: > 705: short res = (short)(a + b); > 706: boolean overflow = Short.compareUnsigned(res, (short)(a | b)) < 0; > 707: if (overflow) { Suggestion: if (overflow) { ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1720807587 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1720807612 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1720807574 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1720807513 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1720807466 From duke at 
openjdk.org Sun Aug 18 07:35:41 2024 From: duke at openjdk.org (Joshua Cao) Date: Sun, 18 Aug 2024 07:35:41 GMT Subject: RFR: 8032218: Emit single post-constructor barrier for chain of superclass constructors [v4] In-Reply-To: <9Gvg3HswFVpqRnVZQ4HLf6WJKcM1_St_iVTC0GKhMgk=.b1081e57-b73e-43ee-ab1b-d8d6b0bf9cc5@github.com> References: <9Gvg3HswFVpqRnVZQ4HLf6WJKcM1_St_iVTC0GKhMgk=.b1081e57-b73e-43ee-ab1b-d8d6b0bf9cc5@github.com> Message-ID: > [C2 emits a StoreStore barrier for each constructor call](https://github.com/openjdk/jdk/blob/72ca7bafcd49a98c1fe09da72e4e47683f052e9d/src/hotspot/share/opto/parse1.cpp#L1016) in a chain of superclass constructor calls. It is unnecessary. We only need to emit a single barrier for each object allocation / each pair of `Allocation/InitializeNode`. > > [Macro expansion emits a trailing StoreStore after an InitializeNode](https://github.com/openjdk/jdk/blob/32946e1882e9b22c983cbba3c6bda3cc7295946a/src/hotspot/share/opto/macro.cpp#L1610-L1628). This `StoreStore` is sufficient as the post-constructor barrier. From the [InitializeNode definition](https://github.com/openjdk/jdk/blob/32946e1882e9b22c983cbba3c6bda3cc7295946a/src/hotspot/share/opto/memnode.cpp#L3639-L3642): > >> // An InitializeNode collects and isolates object initialization after > // an AllocateNode and before the next possible safepoint. As a > // memory barrier (MemBarNode), it keeps critical stores from drifting > // down past any safepoint or any publication of the allocation. > > This PR modifies `Parse::do_exits()` such that it only emits a barrier for a constructor if we find that the constructed object does not have an `InitializeNode`. It is possible that we cannot find an `InitializeNode` i.e. if the outermost method of the compilation unit is the constructor. We still need to emit a barrier in these cases. > > Passes hotspot tier1 locally on x86 linux machine. New tests make sure that there is a single `StoreStore` for chained constructors. 
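For illustration, a hypothetical three-level hierarchy of the kind the PR targets: `new C(...)` runs the `C`, `B` and `A` constructors, each writing a `final` field. As the description reads, when these constructors are inlined into one C2 compilation, the old code emitted a post-constructor barrier at each constructor exit, whereas the single StoreStore that follows the allocation's `InitializeNode` suffices. This program only shows the source shape; whether and how many barriers are emitted is decided by C2 and is not observable from the program itself.

```java
class ConstructorChainDemo {
    // Each constructor in the chain writes a final field, the case that
    // calls for a post-constructor barrier before the object is published.
    static class A {
        final int a;
        A(int a) { this.a = a; }
    }

    static class B extends A {
        final int b;
        B(int a, int b) { super(a); this.b = b; }
    }

    static class C extends B {
        final int c;
        C(int a, int b, int c) { super(a, b); this.c = c; }
    }

    // A hot allocation site: if C2 compiles this method and inlines C(), B()
    // and A(), the whole chain belongs to a single Allocate/Initialize pair.
    static C make(int i) {
        return new C(i, i + 1, i + 2);
    }

    public static void main(String[] args) {
        long sum = 0;
        for (int i = 0; i < 100_000; i++) {
            C obj = make(i);
            sum += obj.a + obj.b + obj.c;
        }
        System.out.println(sum);
    }
}
```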
Joshua Cao has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 27 commits: - Add tests for Stable fields - Fix typo in comment block - Merge branch 'master' into chainstorestore - Attempt2: Only emit StoreStore in do_exits when there is no parent caller - Merge branch 'master' of https://git.openjdk.org/jdk into chainstorestore - 8032218: Emit single post-constructor barrier for chain of superclass constructors - Add riscv64 to test - Merge branch 'master' into storestore - Merge branch 'master' into storestore - Apply suggestions from code review some formatting suggestions from @shipilev Co-authored-by: Aleksey Shipilëv - ... and 17 more: https://git.openjdk.org/jdk/compare/8635642d...acca7a26 ------------- Changes: https://git.openjdk.org/jdk/pull/18870/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=18870&range=03 Stats: 449 lines in 2 files changed: 445 ins; 1 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/18870.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18870/head:pull/18870 PR: https://git.openjdk.org/jdk/pull/18870 From duke at openjdk.org Sun Aug 18 07:37:56 2024 From: duke at openjdk.org (Joshua Cao) Date: Sun, 18 Aug 2024 07:37:56 GMT Subject: RFR: 8032218: Emit single post-constructor barrier for chain of superclass constructors [v3] In-Reply-To: References: <9Gvg3HswFVpqRnVZQ4HLf6WJKcM1_St_iVTC0GKhMgk=.b1081e57-b73e-43ee-ab1b-d8d6b0bf9cc5@github.com> Message-ID: <1g-oDpBAPK3n_MMlSj8od-AKe8MVnY-3DySB2tRqyzM=.6351673b-2ed6-4070-b985-95a54cc6df84@github.com> On Sun, 23 Jun 2024 07:11:41 GMT, Joshua Cao wrote: >> [C2 emits a StoreStore barrier for each constructor call](https://github.com/openjdk/jdk/blob/72ca7bafcd49a98c1fe09da72e4e47683f052e9d/src/hotspot/share/opto/parse1.cpp#L1016) in a chain of superclass constructor calls. It is unnecessary. We only need to emit a single barrier for each object allocation / each pair of `Allocation/InitializeNode`.
>> >> [Macro expansion emits a trailing StoreStore after an InitializeNode](https://github.com/openjdk/jdk/blob/32946e1882e9b22c983cbba3c6bda3cc7295946a/src/hotspot/share/opto/macro.cpp#L1610-L1628). This `StoreStore` is sufficient as the post-constructor barrier. From the [InitializeNode definition](https://github.com/openjdk/jdk/blob/32946e1882e9b22c983cbba3c6bda3cc7295946a/src/hotspot/share/opto/memnode.cpp#L3639-L3642): >> >>> // An InitializeNode collects and isolates object initialization after >> // an AllocateNode and before the next possible safepoint. As a >> // memory barrier (MemBarNode), it keeps critical stores from drifting >> // down past any safepoint or any publication of the allocation. >> >> This PR modifies `Parse::do_exits()` such that it only emits a barrier for a constructor if we find that the constructed object does not have an `InitializeNode`. It is possible that we cannot find an `InitializeNode` i.e. if the outermost method of the compilation unit is the constructor. We still need to emit a barrier in these cases. >> >> Passes hotspot tier1 locally on x86 linux machine. New tests make sure that there is a single `StoreStore` for chained constructors. > > Joshua Cao has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 24 commits: > > - Attempt2: Only emit StoreStore in do_exits when there is no parent > caller > - Merge branch 'master' of https://git.openjdk.org/jdk into chainstorestore > - 8032218: Emit single post-constructor barrier for chain of superclass constructors > - Add riscv64 to test > - Merge branch 'master' into storestore > - Merge branch 'master' into storestore > - Apply suggestions from code review > > some formatting suggestions from @shipilev > > Co-authored-by: Aleksey Shipilëv > - Guard everything by feature flag > - Revert "Statistics for barriers generated/eliminated" > > This reverts commit 33d23635048afd3a1b40ae91e6fadf577742fa4f.
> - Make flag product diagnostic and guard string concat storestore by flag > - ... and 14 more: https://git.openjdk.org/jdk/compare/72ca7baf...2a201397 Had some merge conflicts with https://bugs.openjdk.org/browse/JDK-8333791. As I understand it, when it comes to post-constructor barriers, we can treat stable and final fields the same. Fixed the merge conflict and added tests to reflect those changes. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18870#issuecomment-2295161506 From duke at openjdk.org Sun Aug 18 13:15:58 2024 From: duke at openjdk.org (Tomáš Zezula) Date: Sun, 18 Aug 2024 13:15:58 GMT Subject: RFR: 8338538: [JVMCI] Allow HotSpotJVMCIRuntime#getJObjectValue to be called by host compiler threads. Message-ID: The `HotSpotJVMCIRuntime#getJObjectValue` method is currently invoked in two distinct scenarios: Truffle Compiler: In this scenario, the method is called by a Truffle compiler thread. This thread is an ordinary Java thread that enters the shared library compiler (libgraal) via a Java native method call. Consequently, it always has a valid `JavaFrameAnchor` when invoking `HotSpotJVMCIRuntime#getJObjectValue` within the shared library compiler. Host Compiler: In the second scenario, the method is called by the host compiler thread while inlining a Truffle call target into a host method. Here, the compiler thread is a JavaThread in the `_thread_in_vm` state before entering the shared library compiler (libgraal) and does not have a `JavaFrameAnchor`. The `HotSpotJVMCIRuntime#getJObjectValue` method currently supports only the first scenario by asserting that the caller has a `JavaFrameAnchor`. However, this method should be adapted to also support the second scenario, where the caller thread lacks a `JavaFrameAnchor` but has an explicitly pushed JNI handle block. It is crucial that the `HotSpotJVMCIRuntime#getJObjectValue` method ensures it does not use the top-most `JNIHandleBlock`, which is never released.
Utilizing this block for Java constants could potentially lead to memory leaks in the Java heap. To accommodate both scenarios, the method should be modified to allow execution also by threads without a `JavaFrameAnchor` provided they have an explicitly pushed JNI handle block. Implementation Details: The method determines whether the caller thread has pushed a JNI handle block by using `THREAD->active_handles()->pop_frame_link()`. The `pop_frame_link` is set when [JavaThread::push_jni_handle_block](https://github.com/openjdk/jdk/blob/bd4160cea8b6b0fcf0507199ed76a12f5d0aaba9/src/hotspot/share/runtime/javaThread.cpp#L1360) is called and is reset in [JavaThread::pop_jni_handle_block](https://github.com/openjdk/jdk/blob/bd4160cea8b6b0fcf0507199ed76a12f5d0aaba9/src/hotspot/share/runtime/javaThread.cpp#L1371). Each active JavaThread has a non-null `_active_handles` pointer, which is initialized in [JavaThread::run](https://github.com/openjdk/jdk/blob/bd4160cea8b6b0fcf0507199ed76a12f5d0aaba9/src/hotspot/share/runtime/javaThread.cpp#L730). ------------- Commit messages: - JDK-8338538: [JVMCI] Allow HotSpotJVMCIRuntime#getJObjectValue to be called by host compiler threads. Changes: https://git.openjdk.org/jdk/pull/20620/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=20620&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8338538 Stats: 3 lines in 1 file changed: 1 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/20620.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20620/head:pull/20620 PR: https://git.openjdk.org/jdk/pull/20620 From dnsimon at openjdk.org Sun Aug 18 13:57:47 2024 From: dnsimon at openjdk.org (Doug Simon) Date: Sun, 18 Aug 2024 13:57:47 GMT Subject: RFR: 8338538: [JVMCI] Allow HotSpotJVMCIRuntime#getJObjectValue to be called by host compiler threads. In-Reply-To: References: Message-ID: On Sun, 18 Aug 2024 13:11:24 GMT, Tomáš Zezula wrote:
> The `HotSpotJVMCIRuntime#getJObjectValue` method is currently invoked in two distinct scenarios: > > Truffle Compiler: In this scenario, the method is called by a Truffle compiler thread. This thread is an ordinary Java thread that enters the shared library compiler (libgraal) via a Java native method call. Consequently, it always has a valid `JavaFrameAnchor` when invoking `HotSpotJVMCIRuntime#getJObjectValue` within the shared library compiler. > > Host Compiler: In the second scenario, the method is called by the host compiler thread while inlining a Truffle call target into a host method. Here, the compiler thread is a JavaThread in the `_thread_in_vm` state before entering the shared library compiler (libgraal) and does not have a `JavaFrameAnchor`. > > The `HotSpotJVMCIRuntime#getJObjectValue` method currently supports only the first scenario by asserting that the caller has a `JavaFrameAnchor`. However, this method should be adapted to also support the second scenario, where the caller thread lacks a `JavaFrameAnchor` but has an explicitly pushed JNI handle block. It is crucial that the `HotSpotJVMCIRuntime#getJObjectValue` method ensures it does not use the top-most `JNIHandleBlock`, which is never released. Utilizing this block for Java constants could potentially lead to memory leaks in the Java heap. To accommodate both scenarios, the method should be modified to allow execution also by threads without a `JavaFrameAnchor` provided they have an explicitly pushed JNI handle block. > > Implementation Details: The method determines whether the caller thread has pushed a JNI handle block by using `THREAD->active_handles()->pop_frame_link()`.
The `pop_frame_link` is set when [JavaThread::push_jni_handle_block](https://github.com/openjdk/jdk/blob/bd4160cea8b6b0fcf0507199ed76a12f5d0aaba9/src/hotspot/share/runtime/javaThread.cpp#L1360) is called and is reset in [JavaThread::pop_jni_handle_block](https://github.com/openjdk/jdk/blob/bd4160cea8b6b0fcf0507199ed76a12f5d0aaba9/src/hotspot/share/runtime/javaThread.cpp#L1371). Each active JavaThread has a non-null `_active_handles` pointer, which is initialized in [JavaThread::run](https://github.com/openjdk/jdk/blob/bd4160cea8b6b0fcf0507199ed76a12f5d0aaba9/src/hotspot/share/runtime/javaThread.cpp#L730). Marked as reviewed by dnsimon (Reviewer). src/hotspot/share/jvmci/jvmciCompilerToVM.cpp line 713: > 711: C2V_VMENTRY_0(jlong, getJObjectValue, (JNIEnv* env, jobject, jobject constant_jobject)) > 712: requireNotInHotSpot("getJObjectValue", JVMCI_CHECK_0); > 713: // Ensure that we are not using the top-most JNIHandleBlock, which is never released. Suggestion: // Ensure that current JNI handle scope is not the top-most JNIHandleBlock as handles // in that scope are only released when the thread exits. ------------- PR Review: https://git.openjdk.org/jdk/pull/20620#pullrequestreview-2244319995 PR Review Comment: https://git.openjdk.org/jdk/pull/20620#discussion_r1720983156 From duke at openjdk.org Sun Aug 18 17:20:26 2024 From: duke at openjdk.org (Tomáš Zezula) Date: Sun, 18 Aug 2024 17:20:26 GMT Subject: RFR: 8338538: [JVMCI] Allow HotSpotJVMCIRuntime#getJObjectValue to be called by host compiler threads. [v2] In-Reply-To: References: Message-ID: > The `HotSpotJVMCIRuntime#getJObjectValue` method is currently invoked in two distinct scenarios: > > Truffle Compiler: In this scenario, the method is called by a Truffle compiler thread. This thread is an ordinary Java thread that enters the shared library compiler (libgraal) via a Java native method call.
Consequently, it always has a valid `JavaFrameAnchor` when invoking `HotSpotJVMCIRuntime#getJObjectValue` within the shared library compiler. > > Host Compiler: In the second scenario, the method is called by the host compiler thread while inlining a Truffle call target into a host method. Here, the compiler thread is a JavaThread in the `_thread_in_vm` state before entering the shared library compiler (libgraal) and does not have a `JavaFrameAnchor`. > > The `HotSpotJVMCIRuntime#getJObjectValue` method currently supports only the first scenario by asserting that the caller has a `JavaFrameAnchor`. However, this method should be adapted to also support the second scenario, where the caller thread lacks a `JavaFrameAnchor` but has an explicitly pushed JNI handle block. It is crucial that the `HotSpotJVMCIRuntime#getJObjectValue` method ensures it does not use the top-most `JNIHandleBlock`, which is never released. Utilizing this block for Java constants could potentially lead to memory leaks in the Java heap. To accommodate both scenarios, the method should be modified to allow execution also by threads without a `JavaFrameAnchor` provided they have an explicitly pushed JNI handle block. > > Implementation Details: The method determines whether the caller thread has pushed a JNI handle block by using `THREAD->active_handles()->pop_frame_link()`. The `pop_frame_link` is set when [JavaThread::push_jni_handle_block](https://github.com/openjdk/jdk/blob/bd4160cea8b6b0fcf0507199ed76a12f5d0aaba9/src/hotspot/share/runtime/javaThread.cpp#L1360) is called and is reset in [JavaThread::pop_jni_handle_block](https://github.com/openjdk/jdk/blob/bd4160cea8b6b0fcf0507199ed76a12f5d0aaba9/src/hotspot/share/runtime/javaThread.cpp#L1371). Each active JavaThread has a non-null `_active_handles` pointer, which is initialized in [JavaThread::run](https://github.com/openjdk/jdk/blob/bd4160cea8b6b0fcf0507199ed76a12f5d0aaba9/src/hotspot/share/runtime/javaThread.cpp#L730).
Tomáš Zezula has updated the pull request incrementally with one additional commit since the last revision: Updated comment in getObjectValue. Co-authored-by: Douglas Simon ------------- Changes: - all: https://git.openjdk.org/jdk/pull/20620/files - new: https://git.openjdk.org/jdk/pull/20620/files/03010245..aa3838e1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20620&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20620&range=00-01 Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/20620.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20620/head:pull/20620 PR: https://git.openjdk.org/jdk/pull/20620 From thartmann at openjdk.org Mon Aug 19 06:12:53 2024 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 19 Aug 2024 06:12:53 GMT Subject: RFR: 8333334: C2: Make result of `Node::dominates` more precise to enhance scalar replacement [v7] In-Reply-To: References: Message-ID: <2EBT6Sg3_rPMDKh2eAjMQozkNVF6EMukgu4kuPqslvI=.9625c896-0d7e-4ef4-997b-c2c134e13bf8@github.com> On Tue, 9 Jul 2024 03:10:55 GMT, Qizheng Xing wrote: >> This patch changes the algorithm of `Node::dominates` to make the result more precise, and allows the iterators of `ConcurrentHashMap` to be scalar replaced. >> >> The previous algorithm will return a conservative result when encountering a dead control flow, and only try the first two input paths of a multi-input Region node, which may prevent the scalar replacement in some cases. >> >> For example, with G1 GC enabled, C2 generates GC barriers for `ConcurrentHashMap` iteration operations at some early phases, and then eliminates them in a later IGVN, but `LoadNode` is also idealized in the same IGVN.
This causes `LoadNode::Ideal` to see some dead barrier control flow and to refuse to split some instance field loads through Phi due to the conservative result of `Node::dominates`, so scalar replacement cannot be applied to the iterators in the later macro elimination phase. >> >> This patch allows `Node::dominates` to try other paths of the last multi-input Region node when the first path is dead, and makes `ConcurrentHashMap` iteration ~30% faster: >> >> >> Benchmark (nkeys) Mode Cnt Score Error Units >> Maps.testConcurrentHashMapIterators 10000 avgt 15 414099.085 ± 33230.945 ns/op # baseline >> Maps.testConcurrentHashMapIterators 10000 avgt 15 315490.281 ± 3037.056 ns/op # patch >> >> >> Testing: tier1-4. > > Qizheng Xing has updated the pull request incrementally with one additional commit since the last revision: > > Add copyright. This looks good to me, but you need a second review. Please add braces around the if/else branches that you modified. Marked as reviewed by thartmann (Reviewer). src/hotspot/share/opto/memnode.cpp line 1664: > 1662: Node* region; > 1663: DomResult dom_result = DomResult::Dominate; > 1664: PhaseIterGVN* igvn = phase->is_IterGVN(); `igvn` should be moved below, to its only usage. src/hotspot/share/opto/memnode.cpp line 1687: > 1685: region = mem->in(0); > 1686: } > 1687: // Otherwise we encounter a complex graph. Suggestion: // Otherwise we encountered a complex graph. src/hotspot/share/opto/node.hpp line 1113: > 1111: NotDominate, // 'this' node does not dominate 'sub'. > 1112: Dominate, // 'this' node dominates or is equal to 'sub'. > 1113: EncounteredDeadCode, // Result is undefined due to encountering dead code. Suggestion: EncounteredDeadCode // Result is undefined due to encountering dead code.
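To make the three-valued `DomResult` concrete, here is a minimal Java model of a dominance query that separates a definite answer from one blocked by dead code. It is a hypothetical simplification for illustration only. It is not HotSpot's `Node::dominates` algorithm, it assumes an acyclic control graph, and the names `DomModel`, `Node.dead`, and `dominates` are invented here. A live path that bypasses `dom` gives a definite `NOT_DOMINATE`. Otherwise, any dead path makes the answer `ENCOUNTERED_DEAD_CODE`, meaning it stays undecided until IGVN removes the dead code.

```java
import java.util.List;

// Hypothetical model (not HotSpot code): control nodes form an acyclic graph whose
// edges point from a node to its control inputs; the root has no inputs.
public class DomModel {
    enum DomResult { NOT_DOMINATE, DOMINATE, ENCOUNTERED_DEAD_CODE }

    static final class Node {
        final String name;
        final boolean dead;       // true if this control path is dead and will be removed later
        final List<Node> inputs;  // one input for a plain control node, several for a Region

        Node(String name, boolean dead, Node... inputs) {
            this.name = name;
            this.dead = dead;
            this.inputs = List.of(inputs);
        }
    }

    // Does 'dom' lie on every path from 'sub' up to the root?
    static DomResult dominates(Node dom, Node sub) {
        if (sub == dom) {
            return DomResult.DOMINATE;
        }
        if (sub.dead) {
            return DomResult.ENCOUNTERED_DEAD_CODE;
        }
        if (sub.inputs.isEmpty()) {
            return DomResult.NOT_DOMINATE; // reached the root on a live path without meeting 'dom'
        }
        boolean sawDeadCode = false;
        for (Node in : sub.inputs) {
            DomResult r = dominates(dom, in);
            if (r == DomResult.NOT_DOMINATE) {
                return DomResult.NOT_DOMINATE; // a live path bypasses 'dom': definite answer
            }
            if (r == DomResult.ENCOUNTERED_DEAD_CODE) {
                sawDeadCode = true; // keep looking: another live path may still give a definite "no"
            }
        }
        return sawDeadCode ? DomResult.ENCOUNTERED_DEAD_CODE : DomResult.DOMINATE;
    }

    public static void main(String[] args) {
        Node root = new Node("root", false);
        Node a = new Node("a", false, root);
        Node b = new Node("b", false, a);
        Node deadBarrier = new Node("deadBarrier", true, a);
        Node region = new Node("region", false, b, deadBarrier);
        Node c = new Node("c", false, root);
        Node region2 = new Node("region2", false, b, c);

        System.out.println(dominates(a, b));        // DOMINATE
        System.out.println(dominates(b, region));   // ENCOUNTERED_DEAD_CODE
        System.out.println(dominates(b, region2));  // NOT_DOMINATE
    }
}
```

In this model, once the dead barrier path is removed and `region` keeps only `b` as its input, the same query would return `DOMINATE`.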
------------- PR Review: https://git.openjdk.org/jdk/pull/19496#pullrequestreview-2244668452 PR Review: https://git.openjdk.org/jdk/pull/19496#pullrequestreview-2244684033 PR Review Comment: https://git.openjdk.org/jdk/pull/19496#discussion_r1721251431 PR Review Comment: https://git.openjdk.org/jdk/pull/19496#discussion_r1721248449 PR Review Comment: https://git.openjdk.org/jdk/pull/19496#discussion_r1721246810 From thartmann at openjdk.org Mon Aug 19 06:12:54 2024 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 19 Aug 2024 06:12:54 GMT Subject: RFR: 8333334: C2: Make result of `Node::dominates` more precise to enhance scalar replacement [v6] In-Reply-To: References: <_KheD9_dnqMix5cuAMGYOELNuqsbOB2_VgYBJsG092U=.0a88d9e6-32e9-4f60-85c7-0b1967d7b57a@github.com> Message-ID: On Tue, 9 Jul 2024 02:46:13 GMT, Qizheng Xing wrote: >> test/hotspot/jtreg/compiler/c2/irTests/scalarReplacement/ScalarReplacementWithGCBarrierTests.java line 107: >> >>> 105: @IR(phase = { CompilePhase.AFTER_PARSING }, counts = { IRNode.ALLOC, "1" }) >>> 106: @IR(phase = { CompilePhase.INCREMENTAL_BOXING_INLINE }, counts = { IRNode.ALLOC, "2" }) >>> 107: @IR(applyIf = { "UseG1GC", "true" }, phase = { CompilePhase.ITER_GVN_AFTER_ELIMINATION }, counts = { IRNode.ALLOC, "1" }) >> >> Your test checks for the number of allocations to return to one. However, in your description and comments you talk about load nodes that aren't folded. What about adding another IR test to check the number of loads for completeness? > > I think it's hard to add an intuitive and meaningful IR test for load nodes in this test case. > > To construct a case containing both eliminable GC pre-barriers and allocations that interfere with each other, I had to put several object field reads/writes into nested loops. This makes a large number of load nodes appear in this test case after some optimization phases.
Perhaps after several more optimizations, most of them will disappear and new loads will appear again. So it's difficult to tell which load we really care about without looking at the Ideal graph dump via IGV. Right, I think the test is fine as is. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19496#discussion_r1721256045 From jbhateja at openjdk.org Mon Aug 19 06:47:50 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 19 Aug 2024 06:47:50 GMT Subject: RFR: 8338023: Support two vector selectFrom API In-Reply-To: <28KQHru1heR-YOVsRVo8Ffj_4D29IV8vD2tombvTHdI=.dba80ac3-9804-4074-ac0f-8acb9b042a08@github.com> References: <28KQHru1heR-YOVsRVo8Ffj_4D29IV8vD2tombvTHdI=.dba80ac3-9804-4074-ac0f-8acb9b042a08@github.com> Message-ID: On Thu, 8 Aug 2024 06:57:28 GMT, Jatin Bhateja wrote: > Hi All, > > As per the discussion on the panama-dev mailing list[1], this patch adds support for the following new two-vector permutation APIs. > > > Declaration:- > Vector.selectFrom(Vector v1, Vector v2) > > > Semantics:- > Using the index values stored in the lanes of "this" vector, assemble the values stored in the first (v1) and second (v2) vector arguments. Thus, the first and second vectors serve as a table whose elements are selected based on the index vector. The API is applicable to all integral and floating-point types. The result of this operation is semantically equivalent to the expression v1.rearrange(this.toShuffle(), v2). Values held in the index vector lanes must lie within the valid two-vector index range [0, 2*VLEN), otherwise an IndexOutOfBoundsException is thrown. > > Summary of changes: > - Java side implementation of the new selectFrom API. > - C2 compiler IR and inline expander changes. > - In the absence of a direct two-vector permutation instruction in the target ISA, a lowering transformation dismantles the new IR into constituent IR supported by the target platform. > - Optimized x86 backend implementation for AVX512 and legacy targets. > - Functional tests covering the new API.
> > The JMH micro included with this patch shows around a 10-15x gain over the existing rearrange API:- > Test System: Intel(R) Xeon(R) Platinum 8480+ [ Sapphire Rapids Server] > > > Benchmark (size) Mode Cnt Score Error Units > SelectFromBenchmark.rearrangeFromByteVector 1024 thrpt 2 2041.762 ops/ms > SelectFromBenchmark.rearrangeFromByteVector 2048 thrpt 2 1028.550 ops/ms > SelectFromBenchmark.rearrangeFromIntVector 1024 thrpt 2 962.605 ops/ms > SelectFromBenchmark.rearrangeFromIntVector 2048 thrpt 2 479.004 ops/ms > SelectFromBenchmark.rearrangeFromLongVector 1024 thrpt 2 359.758 ops/ms > SelectFromBenchmark.rearrangeFromLongVector 2048 thrpt 2 178.192 ops/ms > SelectFromBenchmark.rearrangeFromShortVector 1024 thrpt 2 1463.459 ops/ms > SelectFromBenchmark.rearrangeFromShortVector 2048 thrpt 2 727.556 ops/ms > SelectFromBenchmark.selectFromByteVector 1024 thrpt 2 33254.830 ops/ms > SelectFromBenchmark.selectFromByteVector 2048 thrpt 2 17313.174 ops/ms > SelectFromBenchmark.selectFromIntVector 1024 thrpt 2 10756.804 ops/ms > SelectFromBenchmark.selectFromIntVector 2048 thrpt 2 5398.2... > _Mailing list message from [John Rose](mailto:john.r.rose at oracle.com) on [hotspot-compiler-dev](mailto:hotspot-compiler-dev at mail.openjdk.org):_ > > (Better late than never, although I wish I’d been more explicit about this on panama-dev.) > > I think we should be moving away from throwing exceptions on all reorder/shuffle/permute vector ops, and moving toward wrapping. These ops all operate on vectors (small arrays) of vector lane indexes (small array indexes in a fixed domain, always a power of two). The throwing behavior checks an input for bad indexes and throws a (scalar) exception if there are any at all. The wrapping behavior reduces bad indexes to good ones by an unsigned modulo operation (which is at worst a mask for powers of two). > > If I’m right, then new API points should start out with wrap semantics, not throw semantics. And old API points should be migrated ASAP.
> > There’s no loss of functionality in such a move. Instead the defaults are moved around. Before, throwing was the default and wrapping was an explicit operation. After, wrapping would be the default and throwing would be explicit. Both wrapping and throwing checks are available through explicit calls to VectorShuffle methods checkIndexes and wrapIndexes. > > OK, so why is wrapping better than throwing? And first, why did we start with throwing as the default? Well, we chose throwing as the default to make the vector operations more Java-like. Java scalar operations don’t try to reduce bad array indexes into the array domain; they throw. Since a shuffle op is like an array reference, it makes sense to emulate the checks built into Java array references. > > Or it did make sense. I think there is a technical debt here which is turning out to be hard to pay off. The tech debt is to suppress or hoist or strength-reduce the vector instructions that perform the check for invalid indexes (in parallel), then ask “did any of those checks fail?” (a mask reduction), then do a conditional branch to failure code. I think I was over-confident that our scalar tactics for reducing array range checks would apply to vectors as well. On second thought, vectorizing our key optimization, of loop range splitting (pre/main/post loops) is kind of a nightmare. > > Instead, consider the alternative of wrapping. First, you use vpand or the like to mask the indexes down to the valid range. Then you run the shuffle/permute instruction. That’s it. There is no scalar query or branch. And, there are probably some circumstances where you can omit the vpand operation: Perhaps the hardware already masks the inputs (as with shift instructions). Or, perhaps C2 can do bitwise inference of the vectors and figure out that the vpand is a nop. (I am agitating for bitwise types in C2; this is a use case for them.) In the worst case, the vpand op is fast and pipelines well.
> > This is why I think we should switch, ASAP, to masking instead of throwing, on bad indexes. > > I think some of our reports from customers have shown that the extra checks necessary for throwing on bad indexes are giving their code surprising slowdowns, relative to C-based vector code. > > Did I miss a point? > > John > > On 14 Aug 2024, at 3:43, Jatin Bhateja wrote: Hi @rose00, I agree that wrapping should be the default behaviour when indices are passed through shuffles. The idea was to pick exception-throwing semantics for out-of-bounds indexes *only* for the selectFrom flavour of APIs, which accept indexes through a vector interface; this saves redundant partial wrapping and unwrapping for the cross-vector permutation API, which has direct mappings in the x86 and AArch64 ISAs. As @PaulSandoz [suggested](https://github.com/openjdk/jdk/pull/20508#pullrequestreview-2234095541), we can also tune the existing single 'selectFrom' API to adopt default exception-throwing semantics if any of the indices lies beyond the valid index range. While we will keep the default wrapping semantics for APIs accepting shuffles, this small deviation in semantics for the selectFrom family of APIs will enable generating efficient code and will let users choose between the rearrange and selectFrom APIs based on a convenience vs. efficient-code trade-off. Since the API interfaces were crafted with long-term flexibility in mind, having multiple permutation interfaces (selectFrom / rearrange) that accept indexes through either a vector or a shuffle enables the compiler to emit efficient code.
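The two index-handling policies debated above can be compared on plain arrays. The sketch below is a scalar model written for illustration, not the Vector API implementation. `SelectFromModel`, `selectFromChecked`, and `selectFromWrapped` are hypothetical names, lanes are `int`, and VLEN is assumed to be a power of two. Lane `i` of the result is `v1[idx]` when `idx < VLEN` and `v2[idx - VLEN]` otherwise. The checked variant throws on any index outside [0, 2*VLEN). The wrapped variant first reduces each index with an unsigned modulo, which for a power-of-two range is a single AND mask (the role `vpand` plays in the message above).

```java
import java.util.Arrays;
import java.util.Objects;

// Hypothetical scalar model of a two-table select; not the jdk.incubator.vector implementation.
public class SelectFromModel {

    // Throwing policy: every index must lie in [0, 2 * VLEN).
    static int[] selectFromChecked(int[] index, int[] v1, int[] v2) {
        int vlen = checkShapes(index, v1, v2);
        int[] out = new int[vlen];
        for (int i = 0; i < vlen; i++) {
            int idx = Objects.checkIndex(index[i], 2 * vlen); // throws IndexOutOfBoundsException
            out[i] = idx < vlen ? v1[idx] : v2[idx - vlen];
        }
        return out;
    }

    // Wrapping policy: reduce each index modulo 2 * VLEN with a mask (VLEN is a power of two).
    static int[] selectFromWrapped(int[] index, int[] v1, int[] v2) {
        int vlen = checkShapes(index, v1, v2);
        int mask = 2 * vlen - 1;
        int[] out = new int[vlen];
        for (int i = 0; i < vlen; i++) {
            int idx = index[i] & mask; // unsigned modulo: -1 becomes 2 * VLEN - 1
            out[i] = idx < vlen ? v1[idx] : v2[idx - vlen];
        }
        return out;
    }

    private static int checkShapes(int[] index, int[] v1, int[] v2) {
        int vlen = v1.length;
        if (index.length != vlen || v2.length != vlen || Integer.bitCount(vlen) != 1) {
            throw new IllegalArgumentException("lane counts must match and be a power of two");
        }
        return vlen;
    }

    public static void main(String[] args) {
        int[] v1 = {10, 11, 12, 13};
        int[] v2 = {20, 21, 22, 23};
        System.out.println(Arrays.toString(selectFromChecked(new int[] {0, 5, 3, 6}, v1, v2)));
        // [10, 21, 13, 22]
        System.out.println(Arrays.toString(selectFromWrapped(new int[] {9, -1, 4, 15}, v1, v2)));
        // [11, 23, 20, 23]
    }
}
```

The wrapped variant has no per-lane range check that must be reduced into a branch. The checked variant corresponds to the exception semantics proposed for `selectFrom`.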
Best Regards, Jatin ------------- PR Comment: https://git.openjdk.org/jdk/pull/20508#issuecomment-2295785781 From jbhateja at openjdk.org Mon Aug 19 07:19:30 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 19 Aug 2024 07:19:30 GMT Subject: RFR: 8338021: Support saturating vector operators in VectorAPI [v4] In-Reply-To: References: Message-ID: > Hi All, > > As per the discussion on the panama-dev mailing list[1], this patch adds support for the following new vector operators. > > > . SUADD : Saturating unsigned addition. > . SADD : Saturating signed addition. > . SUSUB : Saturating unsigned subtraction. > . SSUB : Saturating signed subtraction. > . UMAX : Unsigned max. > . UMIN : Unsigned min. > > > The new vector operators are applicable only to integral types, since integral values wrap around in overflow/underflow scenarios after setting the appropriate status flags. For floating-point types, IEEE 754 specifies multiple schemes to handle underflow; one of them is gradual underflow, which transitions the value into the subnormal range. Similarly, overflow implicitly saturates a floating-point value to an infinite value. > > As the name suggests, these are saturating operations, i.e. the result of the computation is strictly capped by the lower and upper bounds of the result type and is not wrapped around in underflow or overflow scenarios. > > Summary of changes: > - Java side implementation of the new vector operators. > - Add new scalar saturating APIs for each of the above saturating vector operators in the corresponding primitive box classes; the fallback implementation of the vector operators is based on them. > - C2 compiler IR and inline expander changes. > - Optimized x86 backend implementation for the new vector operators and their predicated counterparts. > - Extends the existing VectorAPI jtreg test suite to cover the new operations. > > Kindly review and share your feedback.
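The saturating semantics described in this PR can be illustrated on scalar `int` values. This is a minimal sketch built on our own assumptions: the helper names below are hypothetical and are not the scalar APIs the patch adds to the primitive box classes. Signed results are clamped to [Integer.MIN_VALUE, Integer.MAX_VALUE]. Unsigned results, which treat the 32 bits as an unsigned value, are clamped to [0, 2^32 - 1], and 2^32 - 1 has the bit pattern of -1.

```java
// Hypothetical scalar reference versions of the saturating and unsigned operators (int lanes).
public class SaturatingOps {

    private static int clampToInt(long r) {
        return (int) Math.max(Integer.MIN_VALUE, Math.min(Integer.MAX_VALUE, r));
    }

    static int sadd(int a, int b) { return clampToInt((long) a + b); } // SADD
    static int ssub(int a, int b) { return clampToInt((long) a - b); } // SSUB

    // SUADD: unsigned add, saturating at 0xFFFF_FFFF (all bits set).
    static int suadd(int a, int b) {
        long r = Integer.toUnsignedLong(a) + Integer.toUnsignedLong(b);
        return r > 0xFFFF_FFFFL ? -1 : (int) r;
    }

    // SUSUB: unsigned subtract, saturating at 0.
    static int susub(int a, int b) {
        return Integer.compareUnsigned(a, b) < 0 ? 0 : a - b;
    }

    static int umax(int a, int b) { return Integer.compareUnsigned(a, b) >= 0 ? a : b; } // UMAX
    static int umin(int a, int b) { return Integer.compareUnsigned(a, b) <= 0 ? a : b; } // UMIN

    public static void main(String[] args) {
        System.out.println(Integer.MAX_VALUE + 1);                   // -2147483648 (ordinary wraparound)
        System.out.println(sadd(Integer.MAX_VALUE, 1));              // 2147483647
        System.out.println(ssub(Integer.MIN_VALUE, 1));              // -2147483648
        System.out.println(Integer.toUnsignedString(suadd(-1, 1)));  // 4294967295
        System.out.println(susub(1, 2));                             // 0
        System.out.println(umax(-1, 1));                             // -1 (largest unsigned value)
    }
}
```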
> > Best Regards, > PS: Intrinsification and auto-vectorization of the new core-lib APIs will be addressed separately in a follow-up patch. > > [1] https://mail.openjdk.org/pipermail/panama-dev/2024-May/020408.html Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Review comments resolutions. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/20507/files - new: https://git.openjdk.org/jdk/pull/20507/files/8c9bfeca..c42b4afa Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20507&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20507&range=02-03 Stats: 5 lines in 3 files changed: 0 ins; 0 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/20507.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20507/head:pull/20507 PR: https://git.openjdk.org/jdk/pull/20507 From mli at openjdk.org Mon Aug 19 07:26:52 2024 From: mli at openjdk.org (Hamlin Li) Date: Mon, 19 Aug 2024 07:26:52 GMT Subject: RFR: 8321010: RISC-V: C2 RoundVF [v15] In-Reply-To: <8hGiyN1XJKBa5eFp9xy15NfL5iFkhFHaG55bR6gX-_I=.001d319e-fec1-4e44-ac38-ae0b13aaa104@github.com> References: <8hGiyN1XJKBa5eFp9xy15NfL5iFkhFHaG55bR6gX-_I=.001d319e-fec1-4e44-ac38-ae0b13aaa104@github.com> Message-ID: On Thu, 25 Jul 2024 14:29:55 GMT, Hamlin Li wrote: >> Hi, >> Could you review this patch, which adds RoundVF/RoundVD intrinsics? >> >> Current tests show that it brings a performance gain when vlenb >= 32 (as on bananapi), but causes a regression when vlenb == 16 (as on k230), so I only enable the intrinsic when vlenb >= 32. >> >> For double, even with vlenb == 32 there is still some regression, so in this PR I only enable it when vlenb >= 64. Although there is no hardware to verify it, the trend of the performance data on bananapi and k230 suggests it is likely to bring better performance rather than a regression for double when vlenb >= 64.
Please compare the data of `Benchmark on bananapi, +UseSuperWord` and `Benchmark on k230, +UseSuperWord, enable RoundVF/D when vlenb == 16`. >> >> Thanks! >> >> ## Tests >> >> test/hotspot/jtreg/compiler/vectorization/TestRoundVectRiscv64.java test/hotspot/jtreg/compiler/c2/cr6340864/TestFloatVect.java test/hotspot/jtreg/compiler/c2/cr6340864/TestDoubleVect.java test/hotspot/jtreg/compiler/floatingpoint/TestRound.java >> >> test/jdk/java/lang/Math/RoundTests.java >> >> ## Performance - with Intrinsic >> >> ### on bananapi >> Benchmark on bananapi, +UseSuperWord >> >> Benchmark on bananapi, +UseSuperWord | (TESTSIZE) | Mode | Cnt | Score +intrinsic | Score -intrinsic | Error | Units | Improvement >> -- | -- | -- | -- | -- | -- | -- | -- | -- >> FpRoundingBenchmark.test_round_double | 2048 | avgt | 10 | 23794.153 | 20557.467 | 2899.266 | ns/op | 0.864 >> FpRoundingBenchmark.test_round_float | 2048 | avgt | 10 | 11531.853 | 16562.431 | 865.779 | ns/op | 1.436 >> >> >> >> ### on k230 (enable intrinsic even when vlenb == 16) >> Benchmark on k230, +UseSuperWord, enable RoundVF/D when vlenb == 16 >> >> Benchmark on k230, +UseSuperWord, enable RoundVF/D ... > > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: > > minor Hey, is anyone available to review this patch again? Thanks!
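When reading the bananapi table in the message above, note that for an `avgt` benchmark reported in ns/op, lower is better. The Improvement column matches the ratio Score(-intrinsic) / Score(+intrinsic), so values above 1 mean the intrinsic is faster. The small sketch below recomputes the two reported ratios. The class and method names are illustrative only.

```java
import java.util.Locale;

// Recomputes the "Improvement" column of the bananapi table (avgt, ns/op: lower is better).
public class Improvement {
    static double improvement(double scoreWithIntrinsic, double scoreWithoutIntrinsic) {
        return scoreWithoutIntrinsic / scoreWithIntrinsic; // > 1.0 means the intrinsic is faster
    }

    public static void main(String[] args) {
        // Locale.ROOT keeps '.' as the decimal separator regardless of the default locale.
        System.out.println(String.format(Locale.ROOT, "%.3f", improvement(23794.153, 20557.467))); // 0.864 (double)
        System.out.println(String.format(Locale.ROOT, "%.3f", improvement(11531.853, 16562.431))); // 1.436 (float)
    }
}
```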
------------- PR Comment: https://git.openjdk.org/jdk/pull/17745#issuecomment-2295847406 From mbaesken at openjdk.org Mon Aug 19 07:32:49 2024 From: mbaesken at openjdk.org (Matthias Baesken) Date: Mon, 19 Aug 2024 07:32:49 GMT Subject: RFR: 8333098: ubsan: bytecodeInfo.cpp:318:59: runtime error: division by zero In-Reply-To: References: Message-ID: <1URWnKWj-n-cf_AfAAHfCbICIz4pbq3ATEtHu29h36Q=.2d7ff232-715f-4b8b-b859-f0dafdd73ee0@github.com> On Fri, 16 Aug 2024 11:57:09 GMT, Matthias Baesken wrote: > When running the test > compiler/classUnloading/methodUnloading/TestOverloadCompileQueues.java > with ubsan-enabled binaries, we run into the issue reported below. > The reason seems to be that we divide by zero in some special cases (we should instead check for `CompilationPolicy::min_invocations() == 0` and handle it separately). > > /jdk/src/hotspot/share/opto/bytecodeInfo.cpp:318:59: runtime error: division by zero > #0 0x7f5145c0dda2 in InlineTree::should_not_inline(ciMethod*, ciMethod*, int, bool&, ciCallProfile&) src/hotspot/share/opto/bytecodeInfo.cpp:318 > #1 0x7f51466366d7 in InlineTree::try_to_inline(ciMethod*, ciMethod*, int, JVMState*, ciCallProfile&, bool&) src/hotspot/share/opto/bytecodeInfo.cpp:382 > #2 0x7f514663d36b in InlineTree::ok_to_inline(ciMethod*, JVMState*, ciCallProfile&, bool&) src/hotspot/share/opto/bytecodeInfo.cpp:596 > #3 0x7f51470dffd6 in Compile::call_generator(ciMethod*, int, bool, JVMState*, bool, float, ciKlass*, bool) src/hotspot/share/opto/doCall.cpp:189 > #4 0x7f51470e18ab in Parse::do_call() src/hotspot/share/opto/doCall.cpp:641 > #5 0x7f514887dbf1 in Parse::do_one_block() src/hotspot/share/opto/parse1.cpp:1607 > #6 0x7f514887fefa in Parse::do_all_blocks() src/hotspot/share/opto/parse1.cpp:724 > #7 0x7f514888d4da in Parse::Parse(JVMState*, ciMethod*, float) src/hotspot/share/opto/parse1.cpp:628 > #8 0x7f51469d8418 in ParseGenerator::generate(JVMState*) src/hotspot/share/opto/callGenerator.cpp:99 > #9 0x7f5146d99cff in
Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*) src/hotspot/share/opto/compile.cpp:793 > #10 0x7f51469d5ebf in C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*) src/hotspot/share/opto/c2compiler.cpp:142 > #11 0x7f5146db0274 in CompileBroker::invoke_compiler_on_method(CompileTask*) src/hotspot/share/compiler/compileBroker.cpp:2303 > #12 0x7f5146db2826 in CompileBroker::compiler_thread_loop() src/hotspot/share/compiler/compileBroker.cpp:1961 > #13 0x7f51478d475a in JavaThread::thread_main_inner() src/hotspot/share/runtime/javaThread.cpp:759 > #14 0x7f51491620ea in Thread::call_run() src/hotspot/share/runtime/thread.cpp:225 > #15 0x7f51487ac201 in thread_native_entry src/hotspot/os/linux/os_linux.cpp:846 > #16 0x7f514e5cf6e9 in start_thread (/lib64/libpthread.so.0+0xa6e9) (BuildId: 2f8d3c2d0f4d7888c2598d2ff6356537f5708a73) > #17 0x7f514db1550e in clone (/lib64/libc.so.6+0x11850e) (BuildId: f732026552f6adff988b338e92d466bc81a01c37) > Doesn't this need to be 1.0 or infinity to preserve the existing behavior? So I would change the code to `min_freq = 1.0;`, right? ------------- PR Comment: https://git.openjdk.org/jdk/pull/20615#issuecomment-2295853203 From mbaesken at openjdk.org Mon Aug 19 07:32:50 2024 From: mbaesken at openjdk.org (Matthias Baesken) Date: Mon, 19 Aug 2024 07:32:50 GMT Subject: RFR: 8333098: ubsan: bytecodeInfo.cpp:318:59: runtime error: division by zero In-Reply-To: References: Message-ID: On Fri, 16 Aug 2024 19:33:15 GMT, Vladimir Kozlov wrote: >First, I think Tier4MinInvocationThreshold and Tier3MinInvocationThreshold should be double flags. > I see that they are used only in double type expressions (usually with scaling value which is double). Maybe another RFE. I can open a new JBS issue, 'Make Tier4MinInvocationThreshold and Tier3MinInvocationThreshold double flags'; is that okay?
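The question raised in this thread, namely which value `min_freq` should take when the divisor is zero, can be explored with a small Java model. For illustration only, this assumes that the flagged expression has the form `max(ratio, 1.0 / minInvocations)`. The actual code at bytecodeInfo.cpp:318 is not shown in this thread, and the names below are hypothetical. Java, like IEEE 754 hardware, evaluates `1.0 / 0` to positive infinity. In C++, by contrast, division by zero is undefined behavior even for floating-point operands, which is what ubsan reports. An explicit guard that returns infinity reproduces the unguarded IEEE result, while returning 1.0 would change it.

```java
// Hypothetical model of a frequency threshold whose divisor can be zero; not HotSpot code.
public class MinFreqModel {

    // Unguarded form: relies on IEEE 754 semantics (1.0 / 0 == +Infinity in Java).
    static double minFreqUnguarded(double ratio, int minInvocations) {
        return Math.max(ratio, 1.0 / minInvocations);
    }

    // Guarded form: handles the zero divisor explicitly and keeps the IEEE 754 result.
    static double minFreqGuarded(double ratio, int minInvocations) {
        if (minInvocations == 0) {
            return Double.POSITIVE_INFINITY;
        }
        return Math.max(ratio, 1.0 / minInvocations);
    }

    public static void main(String[] args) {
        System.out.println(minFreqUnguarded(0.0, 0)); // Infinity
        System.out.println(minFreqGuarded(0.0, 0));   // Infinity
        System.out.println(minFreqGuarded(0.1, 5));   // 0.2
    }
}
```

Under the assumed form, a comparison like `freq < min_freq` holds for every finite `freq` when the threshold is infinite. A threshold of 1.0 would instead let frequencies of 1.0 or more pass. Whether that difference matters depends on the real code.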
------------- PR Comment: https://git.openjdk.org/jdk/pull/20615#issuecomment-2295857064 From jbhateja at openjdk.org Mon Aug 19 07:36:15 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 19 Aug 2024 07:36:15 GMT Subject: RFR: 8338023: Support two vector selectFrom API [v2] In-Reply-To: <28KQHru1heR-YOVsRVo8Ffj_4D29IV8vD2tombvTHdI=.dba80ac3-9804-4074-ac0f-8acb9b042a08@github.com> References: <28KQHru1heR-YOVsRVo8Ffj_4D29IV8vD2tombvTHdI=.dba80ac3-9804-4074-ac0f-8acb9b042a08@github.com> Message-ID: > Hi All, > > As per the discussion on the panama-dev mailing list[1], this patch adds support for the following new two-vector permutation APIs. > > > Declaration:- > Vector.selectFrom(Vector v1, Vector v2) > > > Semantics:- > Using the index values stored in the lanes of "this" vector, assemble the values stored in the first (v1) and second (v2) vector arguments. Thus, the first and second vectors serve as a table whose elements are selected based on the index vector. The API is applicable to all integral and floating-point types. The result of this operation is semantically equivalent to the expression v1.rearrange(this.toShuffle(), v2). Values held in the index vector lanes must lie within the valid two-vector index range [0, 2*VLEN), otherwise an IndexOutOfBoundsException is thrown. > > Summary of changes: > - Java side implementation of the new selectFrom API. > - C2 compiler IR and inline expander changes. > - In the absence of a direct two-vector permutation instruction in the target ISA, a lowering transformation dismantles the new IR into constituent IR supported by the target platform. > - Optimized x86 backend implementation for AVX512 and legacy targets. > - Functional tests covering the new API.
> > The JMH micro included with this patch shows around a 10-15x gain over the existing rearrange API:- > Test System: Intel(R) Xeon(R) Platinum 8480+ [ Sapphire Rapids Server] > > > Benchmark (size) Mode Cnt Score Error Units > SelectFromBenchmark.rearrangeFromByteVector 1024 thrpt 2 2041.762 ops/ms > SelectFromBenchmark.rearrangeFromByteVector 2048 thrpt 2 1028.550 ops/ms > SelectFromBenchmark.rearrangeFromIntVector 1024 thrpt 2 962.605 ops/ms > SelectFromBenchmark.rearrangeFromIntVector 2048 thrpt 2 479.004 ops/ms > SelectFromBenchmark.rearrangeFromLongVector 1024 thrpt 2 359.758 ops/ms > SelectFromBenchmark.rearrangeFromLongVector 2048 thrpt 2 178.192 ops/ms > SelectFromBenchmark.rearrangeFromShortVector 1024 thrpt 2 1463.459 ops/ms > SelectFromBenchmark.rearrangeFromShortVector 2048 thrpt 2 727.556 ops/ms > SelectFromBenchmark.selectFromByteVector 1024 thrpt 2 33254.830 ops/ms > SelectFromBenchmark.selectFromByteVector 2048 thrpt 2 17313.174 ops/ms > SelectFromBenchmark.selectFromIntVector 1024 thrpt 2 10756.804 ops/ms > SelectFromBenchmark.selectFromIntVector 2048 thrpt 2 5398.2... Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Review suggestions incorporated.
------------- Changes: - all: https://git.openjdk.org/jdk/pull/20508/files - new: https://git.openjdk.org/jdk/pull/20508/files/82c0b0a2..055fb22f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20508&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20508&range=00-01 Stats: 31 lines in 31 files changed: 0 ins; 0 del; 31 mod Patch: https://git.openjdk.org/jdk/pull/20508.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20508/head:pull/20508 PR: https://git.openjdk.org/jdk/pull/20508 From rcastanedalo at openjdk.org Mon Aug 19 08:53:30 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Mon, 19 Aug 2024 08:53:30 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v9] In-Reply-To: References: Message-ID: > This changeset implements JEP 475 (Late Barrier Expansion for G1), including support for the x64 and aarch64 platforms. See the [JEP description](https://openjdk.org/jeps/475) for further detail. > > We aim to integrate this work in JDK 24. The purpose of this pull request is twofold: > > - to allow maintainers of the arm (32-bit), ppc, riscv, s390, and x86 (32-bit) ports to contribute a port of these platforms in time for JDK 24; and > - to allow reviewers to review the platform-independent, x64 and aarch64, and test changes in parallel with the porting work. > > ## Summary of the Changes > > ### Platform-Independent Changes (`src/hotspot/share`) > > These consist mainly of: > > - a complete rewrite of `G1BarrierSetC2`, to instruct C2 to expand G1 barriers late instead of early; > - a few minor changes to C2 itself, to support removal of redundant decompression operations and to address an OopMap construction issue triggered by this JEP's increased usage of ADL `TEMP` operands; and > - temporary support for porting the JEP to the remaining platforms.
> > The temporary support code (guarded by the pre-processor flag `G1_LATE_BARRIER_MIGRATION_SUPPORT`) will **not** be part of the final pull request, and hence does not need to be reviewed. > > ### Platform-Dependent Changes (`src/hotspot/cpu`) > > These include changes to the ADL instruction definitions and the `G1BarrierSetAssembler` class of the x64 and aarch64 platforms. > > #### ADL Changes > > The changeset uses ADL predicates to force C2 to implement memory accesses tagged with barrier information using G1-specific, barrier-aware instruction versions (e.g. `g1StoreP` instead of the GC-agnostic `storeP`). These new instruction versions generate machine code according to the corresponding tagged barrier information, relying on the G1 barrier implementations provided by the `G1BarrierSetAssembler` class. On the aarch64 platform, the bulk of the ADL code is generated from a higher-level version using m4, to reduce redundancy. > > #### `G1BarrierSetAssembler` Changes > > Both platforms basically reuse the barrier implementation for the bytecode interpreter, with the different barrier tests and operations refactored into dedicated functions. Besides this, `G1BarrierSetAssembler` is extended with assembly-stub routines that implement the out-of-line, slow path of the barriers. These routines include calls from the barrier into the JVM, which require support for saving and restoring live registers, provided by the `SaveLiveRegisters` class. This c...
Roberto Castañeda Lozano has updated the pull request incrementally with one additional commit since the last revision: Pass oldval to the pre-barrier in g1CompareAndExchange/SwapP ------------- Changes: - all: https://git.openjdk.org/jdk/pull/19746/files - new: https://git.openjdk.org/jdk/pull/19746/files/554de779..92112802 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=19746&range=08 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=19746&range=07-08 Stats: 28 lines in 3 files changed: 12 ins; 0 del; 16 mod Patch: https://git.openjdk.org/jdk/pull/19746.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/19746/head:pull/19746 PR: https://git.openjdk.org/jdk/pull/19746 From rcastanedalo at openjdk.org Mon Aug 19 08:53:30 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano)
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1721433361 From chagedorn at openjdk.org Mon Aug 19 10:31:52 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 19 Aug 2024 10:31:52 GMT Subject: RFR: 8333334: C2: Make result of `Node::dominates` more precise to enhance scalar replacement [v7] In-Reply-To: References: Message-ID: On Tue, 9 Jul 2024 03:10:55 GMT, Qizheng Xing wrote: >> This patch changes the algorithm of `Node::dominates` to make the result more precise, and allows the iterators of `ConcurrentHashMap` to be scalar replaced. >> >> The previous algorithm will return a conservative result when encountering a dead control flow, and only try the first two input paths of a multi-input Region node, which may prevent the scalar replacement in some cases. >> >> For example, with G1 GC enabled, C2 generates GC barriers for `ConcurrentHashMap` iteration operations at some early phases, and then eliminates them in a later IGVN, but `LoadNode` is also idealized in the same IGVN. This causes `LoadNode::Ideal` to see some dead barrier control flows, and refuse to split some instance field loads through Phi due to the conservative result of `Node::dominates`, and thus the scalar replacement can not be applied to iterators in the later macro elimination phase. >> >> This patch allows `Node::dominates` to try other paths of the last multi-input Region node when the first path is dead, and makes `ConcurrentHashMap` iteration ~30% faster: >> >> >> Benchmark (nkeys) Mode Cnt Score Error Units >> Maps.testConcurrentHashMapIterators 10000 avgt 15 414099.085 ? 33230.945 ns/op # baseline >> Maps.testConcurrentHashMapIterators 10000 avgt 15 315490.281 ? 3037.056 ns/op # patch >> >> >> Testing: tier1-4. > > Qizheng Xing has updated the pull request incrementally with one additional commit since the last revision: > > Add copyright. Generally, the fix idea looks good to me, too! 
As Tobias has already mentioned, you should add some braces for the existing one-liner if-statements, just to be safe. I have a few comments. src/hotspot/share/opto/memnode.cpp line 431: > 429: // control input of a memory operation predates (dominates) > 430: // an allocation it wants to look past. > 431: Node::DomResult MemNode::all_controls_dominate(Node* dom, Node* sub) { Now you have many checks of the form: all_controls_dominate(this, st_alloc) == DomResult::Dominate But actually, there is only one place in IGVN where you care about the third, dead-code result. Maybe you can abstract that away and do the following: - Rename this method to `maybe_all_controls_dominate()`. - Add a new method `all_controls_dominate()` which checks the result for `DomResult::Dominate`: bool MemNode::all_controls_dominate(Node* dom, Node* sub) { DomResult dom_result = maybe_all_controls_dominate(dom, sub); return dom_result == DomResult::Dominate; } - The calls in `LoadNode::split_through_phi()` use `maybe_all_controls_dominate()`. - All other callers in existing code do not need to be updated since they call the new `all_controls_dominate()` method, which mimics the old behavior without caring about dead code. Might be cleaner, but it's just a thought. src/hotspot/share/opto/memnode.cpp line 1695: > 1693: if (dom_result != DomResult::Dominate) { > 1694: if (dom_result == DomResult::EncounteredDeadCode) { > 1695: // Wait for the dead code to be removed. You could extend the comment here to mention that it is guaranteed that the dead code will eventually be removed in IGVN, so that we get an unambiguous result about whether it's dominated or not.
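The refactoring suggested above can be sketched in Java. It pairs a three-valued `maybe_all_controls_dominate()` with a boolean `all_controls_dominate()` wrapper for callers that do not care about dead code. This is an illustrative model, not HotSpot code: the enum mirrors `Node::DomResult`, the query is passed in as a function, and `SplitDecision` and `decideSplit` are hypothetical names that stand in for the `LoadNode::split_through_phi()` caller.

```java
import java.util.function.BiFunction;

// Illustrative model of the suggested API split; all names here are hypothetical.
public class DomWrapper {
    enum DomResult { NOT_DOMINATE, DOMINATE, ENCOUNTERED_DEAD_CODE }
    enum SplitDecision { SPLIT, GIVE_UP, RETRY_AFTER_DEAD_CODE_REMOVED }

    // Boolean view for most callers: dead code is simply treated as "does not dominate",
    // which mimics the old conservative behavior.
    static <N> boolean allControlsDominate(BiFunction<N, N, DomResult> maybeAllControlsDominate, N dom, N sub) {
        return maybeAllControlsDominate.apply(dom, sub) == DomResult.DOMINATE;
    }

    // The one caller that distinguishes "no" from "not known yet".
    static SplitDecision decideSplit(DomResult result) {
        switch (result) {
            case DOMINATE:
                return SplitDecision.SPLIT;
            case ENCOUNTERED_DEAD_CODE:
                return SplitDecision.RETRY_AFTER_DEAD_CODE_REMOVED; // IGVN will eventually remove the dead code
            default:
                return SplitDecision.GIVE_UP;
        }
    }

    public static void main(String[] args) {
        BiFunction<String, String, DomResult> query = (dom, sub) -> DomResult.ENCOUNTERED_DEAD_CODE;
        System.out.println(allControlsDominate(query, "dom", "sub")); // false
        System.out.println(decideSplit(query.apply("dom", "sub")));    // RETRY_AFTER_DEAD_CODE_REMOVED
    }
}
```

The design keeps the third state visible only where a caller can act on it. Every other caller gets the same answer as the pre-patch boolean API.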
test/hotspot/jtreg/compiler/c2/irTests/scalarReplacement/ScalarReplacementWithGCBarrierTests.java line 107: > 105: @IR(phase = { CompilePhase.AFTER_PARSING }, counts = { IRNode.ALLOC, "1" }) > 106: @IR(phase = { CompilePhase.INCREMENTAL_BOXING_INLINE }, counts = { IRNode.ALLOC, "2" }) > 107: @IR(applyIf = { "UseG1GC", "true" }, phase = { CompilePhase.ITER_GVN_AFTER_ELIMINATION }, counts = { IRNode.ALLOC, "1" }) I think the `applyIf` is redundant and can be removed because you are only running the framework test with TestFramework.runWithFlags("-XX:+UseG1GC"); Suggestion: @IR(phase = { CompilePhase.ITER_GVN_AFTER_ELIMINATION }, counts = { IRNode.ALLOC, "1" }) test/hotspot/jtreg/compiler/c2/irTests/scalarReplacement/ScalarReplacementWithGCBarrierTests.java line 110: > 108: private int testScalarReplacementWithGCBarrier(List list) { > 109: Iter iter = list.iter(); > 110: for (;;) { I think it's cleaner to use `while (true)` instead: Suggestion: while (true) { ------------- PR Review: https://git.openjdk.org/jdk/pull/19496#pullrequestreview-2245027794 PR Review Comment: https://git.openjdk.org/jdk/pull/19496#discussion_r1721468222 PR Review Comment: https://git.openjdk.org/jdk/pull/19496#discussion_r1721477723 PR Review Comment: https://git.openjdk.org/jdk/pull/19496#discussion_r1721483868 PR Review Comment: https://git.openjdk.org/jdk/pull/19496#discussion_r1721484716 From chagedorn at openjdk.org Mon Aug 19 10:31:53 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 19 Aug 2024 10:31:53 GMT Subject: RFR: 8333334: C2: Make result of `Node::dominates` more precise to enhance scalar replacement [v7] In-Reply-To: References: Message-ID: On Mon, 19 Aug 2024 09:16:01 GMT, Christian Hagedorn wrote: >> Qizheng Xing has updated the pull request incrementally with one additional commit since the last revision: >> >> Add copyright. 
> > src/hotspot/share/opto/memnode.cpp line 431: > >> 429: // control input of a memory operation predates (dominates) >> 430: // an allocation it wants to look past. >> 431: Node::DomResult MemNode::all_controls_dominate(Node* dom, Node* sub) { > > Now you have many checks of the form: > > all_controls_dominate(this, st_alloc) == DomResult::Dominate > > But actually, there is only one place in IGVN where you care about the third, dead-code result. Maybe you can abstract that away and do the following: > - Rename this method to `maybe_all_controls_dominate()`. > - Add a new method `all_controls_dominate()` which checks the result for `DomResult::Dominate`: > > bool MemNode::all_controls_dominate(Node* dom, Node* sub) { > DomResult dom_result = maybe_all_controls_dominate(dom, sub); > return dom_result == DomResult::Dominate; > } > > - The calls in `LoadNode::split_through_phi()` use `maybe_all_controls_dominate()`. > - All other callers in existing code do not need to be updated since they call the new `all_controls_dominate()` method, which mimics the old behavior without caring about dead code. > > Might be cleaner, but it's just a thought. Maybe also add a method comment about the implications of returning `DomResult::EncounteredDeadCode` (i.e., that the answer is undecided as long as there is dead code, but that by the end of IGVN, once the dead code has been cleaned up, we know the definite result). ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19496#discussion_r1721474671 From chagedorn at openjdk.org Mon Aug 19 10:36:54 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 19 Aug 2024 10:36:54 GMT Subject: RFR: 8315916: assert(C->live_nodes() <= C->max_node_limit()) failed: Live Node limit exceeded In-Reply-To: References: Message-ID: On Fri, 16 Aug 2024 21:28:57 GMT, Dhamoder Nalla wrote: > > Hi @dhanalla, this is not the right way to handle this assertion failure.
The assertion is here to catch real issues when creating too many nodes due to a bug in the code. For example, in [JDK-8256934](https://bugs.openjdk.org/browse/JDK-8256934), we hit this assert due to an inefficient cloning algorithm in Partial Peeling. We should not remove the assert. > > For such bugs, you first need to investigate why we hit the node limit with your reproducer. Once you find the problem, it can usually be put into one of the following categories: > > > > 1. We have a real bug and by fixing it, we no longer create this many nodes. > > 2. It is a false positive, and it is expected to create this many nodes (note that the node limit of 80000 is quite large, so it needs to be explained well why it is a false positive - more often than not, there is still a bug somewhere that was missed at first). > > 3. We have a real bug but the fix is too hard, risky, or just not worth the complexity, especially for a real edge case (this also needs to be explained and justified well). > > > > Note that for categories 2 and 3, when we cannot easily fix the problem of creating too many nodes, we should implement a bail-out from the current optimization rather than from the entire compilation, to reduce the performance impact. This was, for example, done in JDK-8256934, where a fix was too risky at that point in the release and a proper fix was delayed. The fix was to bail out from Partial Peeling when hitting a critically high number of live nodes (an estimate to ensure we never hit the node limit). > > You should then describe your analysis in the PR and explain your proposed solution. You should also add the reproducer as a test case to your patch. > > Thanks @chhagedorn for reviewing this PR. This scenario corresponds to Case 2 mentioned above, where more than 80,000 nodes are expected to be created.
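The bail-out strategy described for categories 2 and 3 (give up on one optimization when it risks approaching the node limit, rather than failing the whole compilation) can be sketched as a simple budget check. This is an illustrative C++ model only: `NodeBudget`, `may_apply_optimization`, and the 10% headroom are assumptions, not the check used in the JDK-8256934 fix.

```cpp
#include <cassert>

// Hypothetical snapshot of compilation state; field names are illustrative,
// not C2's actual API.
struct NodeBudget {
  unsigned live_nodes;
  unsigned max_node_limit;  // the thread mentions a limit of 80000
};

// Decide whether one optimization may run, given an estimate of how many
// nodes it could add. Returning false skips only this optimization; the
// compilation itself continues.
bool may_apply_optimization(const NodeBudget& b, unsigned estimated_new_nodes) {
  const unsigned headroom = b.max_node_limit / 10;  // illustrative safety margin
  if (b.live_nodes >= b.max_node_limit) {
    return false;
  }
  // Written as a subtraction on the right-hand side to avoid unsigned overflow.
  return estimated_new_nodes + headroom <= b.max_node_limit - b.live_nodes;
}
```

With a limit of 80000, a transformation estimated to add 1000 nodes would be skipped at 75000 live nodes, while the compilation continues without it.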
As an alternative solution, could we consider limiting the JVM option `EliminateAllocationArraySizeLimit` (in `c2_globals.hpp`) to a range between 0 and 1024, instead of the current range of 0 to `max_jint`, as the upper limit of `max_jint` may not be practical? Hi @dhanalla, can you elaborate on why this is expected and not an actual bug where we unnecessarily create too many nodes? ------------- PR Comment: https://git.openjdk.org/jdk/pull/20504#issuecomment-2296249618 From chagedorn at openjdk.org Mon Aug 19 12:11:53 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 19 Aug 2024 12:11:53 GMT Subject: RFR: 8335444: Generalize implementation of AndNode mul_ring [v3] In-Reply-To: <6lNmXhkDKMNu7r8yYrBYs4JUJQ-drbj9Gj1_EaAqQj0=.c699170b-0551-4c04-8322-0182cf9b5866@github.com> References: <7l_zt0Q884dkzJbP5kjhfvZPmFQ4CMRgakWoUCopfvw=.fc1307d0-6a2c-40cc-adc6-c1b21424d0dd@github.com> <6lNmXhkDKMNu7r8yYrBYs4JUJQ-drbj9Gj1_EaAqQj0=.c699170b-0551-4c04-8322-0182cf9b5866@github.com> Message-ID: On Wed, 7 Aug 2024 01:20:09 GMT, Jasmine Karthikeyan wrote: >> Hi all, >> I've written this patch which improves type calculation for bitwise-and functions. Previously, the only case that was considered was when one of the inputs to the node was a positive constant. I've generalized this behavior, as well as added a case to better estimate the result for arbitrary ranges. Since these are very common patterns to see, this can help propagate more precise types throughout the ideal graph for a large number of methods, making other optimizations and analyses stronger. I was interested in where this patch improves types, so I ran CTW for `java_base` and `java_base_2` and printed out the differences in this gist [here](https://gist.github.com/jaskarth/b45260d81ab621656f4a55cc51cf5292).
While I don't think it's particularly complicated I've also added some discussion of the mathematics below, mostly because I thought it was interesting to work through :) >> >> This patch passes tier1-3 testing on my linux x64 machine. Thoughts and reviews would be very appreciated! > > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Check IR before macro expansion Interesting findings! I like that you unified the integer and long case with a template. A few comments. src/hotspot/share/opto/mulnode.cpp line 604: > 602: > 603: // If both are constants, we can calculate a precise result. > 604: if(r0->is_con() && r1->is_con()) { Suggestion: if (r0->is_con() && r1->is_con()) { src/hotspot/share/opto/mulnode.cpp line 611: > 609: if (r0->_lo >= 0 && r1->_lo >= 0) { > 610: return IntegerType::make(0, MIN2(r0->_hi, r1->_hi), widen); > 611: } Since you've already worked out the math in the PR comment, do you also want to add it here to the different cases? It could help to support the correctness of the code. src/hotspot/share/opto/mulnode.cpp line 626: > 624: > 625: assert(r0->_lo < 0 && r1->_lo < 0, "positive ranges should already be handled!"); > 626: static_assert(std::is_signed::value, "native type of IntegerType must be signed!"); Maybe you want to have that static assert on method entry already. src/hotspot/share/opto/mulnode.cpp line 635: > 633: // Since count_leading_zeros is undefined at 0 (~(-1)) the number of digits in the native type can be used instead, > 634: // as it returns 31 and 63 for signed integers and longs respectively. > 635: int shift_bits = sel_val == 0 ? std::numeric_limits::digits : count_leading_zeros(sel_val) - 1; `sel_val` can only be 0 if `r0->_lo` and `r1->_lo` are both -1. 
While I think it's correct how you handle the case here, wouldn't it be simpler/more readable if you handle this case separately by setting -1 as lower bound directly instead of using "`min >> #digits`"? ------------- PR Review: https://git.openjdk.org/jdk/pull/20066#pullrequestreview-2245217606 PR Review Comment: https://git.openjdk.org/jdk/pull/20066#discussion_r1721586313 PR Review Comment: https://git.openjdk.org/jdk/pull/20066#discussion_r1721632659 PR Review Comment: https://git.openjdk.org/jdk/pull/20066#discussion_r1721636891 PR Review Comment: https://git.openjdk.org/jdk/pull/20066#discussion_r1721685114 From chagedorn at openjdk.org Mon Aug 19 12:11:54 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 19 Aug 2024 12:11:54 GMT Subject: RFR: 8335444: Generalize implementation of AndNode mul_ring [v3] In-Reply-To: References: <7l_zt0Q884dkzJbP5kjhfvZPmFQ4CMRgakWoUCopfvw=.fc1307d0-6a2c-40cc-adc6-c1b21424d0dd@github.com> <8fVdLHTpCnPF-OMl9zFOgoCbENn2a_I06Iq1Ksapfs0=.056922a3-f56d-4ae0-ae63-c0a6d4463ba7@github.com> Message-ID: On Wed, 31 Jul 2024 17:12:47 GMT, Damon Fenacci wrote: >> I actually did experiment with this before coming up with the current approach, but I found that there are cases where it produces incorrect bounds. With a small example where `r0 = [-3, -1]` and `r1 = [-7, -1]`, using `r0->_lo & r1->_lo` results in `-3 & -7`, which equals `-7`. However, in the case where the value in `r0` is `-3` and `r1` is `-6`, we can get an out of bounds result as `-3 & -6` equals `-8`. Because of that I thought the best way to fix this was to find the common leading 1s. This approach does lead to us losing some information in the lower-order bits but I thought it was an acceptable compromise since the code to handle finding common bits across a range of integers becomes quite a bit more complicated. I hope this clarifies! > > You're totally right! It is not even related to the LSB (`-14 & -6` would have the same problem with `-12`). 
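The leading-ones argument can be checked with a small standalone model. The sketch below is not the C2 implementation: `Range`, `clz32`, and `and_range` are hypothetical names, it covers only 32-bit values, and its upper bound for two all-negative ranges (`min(a.hi, b.hi)`, since `x & y` is no larger than either negative operand) follows the reasoning in Christian's question rather than anything confirmed in the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for C2's TypeInt: a closed signed range [lo, hi].
struct Range {
  int32_t lo;
  int32_t hi;
};

// Number of leading zero bits in a 32-bit value (32 for 0).
static int clz32(uint32_t v) {
  int n = 0;
  for (uint32_t mask = 0x80000000u; mask != 0 && (v & mask) == 0; mask >>= 1) {
    n++;
  }
  return n;
}

// Conservative range of (x & y) for x in a, y in b.
Range and_range(Range a, Range b) {
  if (a.lo == a.hi && b.lo == b.hi) {
    int32_t v = a.lo & b.lo;           // both constant: exact result
    return {v, v};
  }
  if (a.lo >= 0 && b.lo >= 0) {
    return {0, std::min(a.hi, b.hi)};  // x & y <= min(x, y) for non-negative x, y
  }
  if (a.lo >= 0) {
    return {0, a.hi};                  // sign bit of x is clear, and x & y <= x
  }
  if (b.lo >= 0) {
    return {0, b.hi};
  }
  // Both ranges contain negative values. A negative x >= a.lo keeps at least the
  // leading one-bits of a.lo, so a negative x & y keeps at least k leading ones,
  // where k is the number of leading ones of min(a.lo, b.lo).
  uint32_t sel = ~static_cast<uint32_t>(std::min(a.lo, b.lo));
  int k = clz32(sel);                  // 1 <= k <= 32
  // Smallest value with k leading ones: -(2^(32 - k)).
  int64_t lo = -(int64_t{1} << (32 - k));
  // Upper bound: if both ranges are entirely negative, x & y <= min(x, y);
  // otherwise the largest result comes from a non-negative operand.
  int32_t hi = (a.hi < 0 && b.hi < 0) ? std::min(a.hi, b.hi) : std::max(a.hi, b.hi);
  return {static_cast<int32_t>(lo), hi};
}
```

For the ranges from the discussion, `[-3, -1]` and `[-7, -1]`, this yields `[-8, -1]`, which contains `-3 & -6 == -8`; the naive lower bound `-3 & -7 == -7` would not.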
Finding the leading 1s is the right solution. Thanks a lot for the clarification! Can the upper limit be improved, similarly to what you added for the "both ranges are positive" case, if we know that both ranges are negative? In the positive case, we have values from 000...0 to 011...1, while in the negative case, we have values from 100...0 to 111...1. This suggests that we can use the same argument as for the positive case and say that the maximum will be the maximum of the smaller range (i.e. `MIN2(r0->_hi, r1->_hi)`)? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20066#discussion_r1721669165 From rcastanedalo at openjdk.org Mon Aug 19 12:19:51 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Mon, 19 Aug 2024 12:19:51 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v2] In-Reply-To: <8fuUEkswt05x0IuT4PrNQuYgLd49g4EpZWOPPQog4PQ=.70b5edb6-98d0-4276-8578-f7a496b7f2a7@github.com> References: <4c-MLXwKcNcSnloSkYkuk3gnv3ux5i5beS51Fd9Z8MQ=.cd0a7eba-ff26-4855-a01c-d1ae5182100b@github.com> <8fuUEkswt05x0IuT4PrNQuYgLd49g4EpZWOPPQog4PQ=.70b5edb6-98d0-4276-8578-f7a496b7f2a7@github.com> Message-ID: On Fri, 16 Aug 2024 20:54:25 GMT, Martin Doerr wrote: > The null check can be integrated into the oop decoding, so all Compressed Oops Modes would benefit. But integrating the null check into the zero-based OOP decoding operation would require adding a conditional branch to OOP decoding (as prototyped [here](https://github.com/robcasloz/jdk/blob/ac71b1a02c8c1bd4989f762d27092fd3bf19ccd7/src/hotspot/cpu/x86/macroAssembler_x86.cpp#L5859-L5875)), right? This would effectively mean moving the post-barrier null check above the post-barrier inter-region check, which I am not sure is beneficial.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1721697447 From rcastanedalo at openjdk.org Mon Aug 19 12:22:52 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Mon, 19 Aug 2024 12:22:52 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v2] In-Reply-To: <4aS0KysKaIPue1D60nootRPb8m7pdP_2hlPSwGUIk8w=.6b68b278-62ce-4df0-a6c6-c7fee40557aa@github.com> References: <4c-MLXwKcNcSnloSkYkuk3gnv3ux5i5beS51Fd9Z8MQ=.cd0a7eba-ff26-4855-a01c-d1ae5182100b@github.com> <4aS0KysKaIPue1D60nootRPb8m7pdP_2hlPSwGUIk8w=.6b68b278-62ce-4df0-a6c6-c7fee40557aa@github.com> Message-ID: On Fri, 16 Aug 2024 21:03:14 GMT, Martin Doerr wrote: >> Thanks, I also prototyped this [here](https://github.com/robcasloz/jdk/blob/JDK-8334060-g1-late-barrier-expansion-x64-optimizations/src/hotspot/cpu/x86/gc/g1/g1_x86_64.ad) (guarded by `UseExchangedValueinPreBarriers`). The atomic barrier implementation in this PR is purposefully simple based on 1) the assumption that the cost of the atomic operations will dominate that of their barriers, and 2) the risk of introducing subtle bugs which can be difficult to catch by regular testing. Because of this, I feel hesitant to introduce this kind of optimization for atomic operation barriers. But I am happy to reconsider, if you have any specific benchmark/configuration in mind where the benefit could outweigh the cost. > > Note that we had such an optimization already in C2: https://github.com/openjdk/jdk8u-dev/blob/4106121e0ae42d644e45c6eab9037874110ed670/hotspot/src/share/vm/opto/library_call.cpp#L3114 > But, it's probably not a big deal. Maybe I can try it on PPC64 which may be more sensitive to accesses on contended memory. Thanks for the reference; I would still prefer to keep this part as is for simplicity. We can always optimize the atomic barriers in follow-up work, if a need arises.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1721701794 From duke at openjdk.org Mon Aug 19 13:29:00 2024 From: duke at openjdk.org (duke) Date: Mon, 19 Aug 2024 13:29:00 GMT Subject: Withdrawn: 8324756: Test vmTestbase/vm/mlvm/meth/stress/compiler/deoptimize is too slow due to dependency verification In-Reply-To: References: Message-ID: On Wed, 1 May 2024 17:54:30 GMT, Ian Myers wrote: > This change removes dependency verification by passing -XX:-VerifyDependencies in the test. > > `vmTestbase/vm/mlvm/meth/stress/compiler/deoptimize/Test.java` takes 20min to run on linux-x86_64-server-fastdebug: > > time CONF=linux-x86_64-server-fastdebug make test TEST=vmTestbase/vm/mlvm/meth/stress/compiler/deoptimize/Test.java > CONF=linux-x86_64-server-fastdebug make test **1412.82s user 15.27s system 115% cpu 20:41.19 total** > > > Passing -XX:-VerifyDependencies flag speeds up the run time to 1min: > > time CONF=linux-x86_64-server-fastdebug make test TEST=vmTestbase/vm/mlvm/meth/stress/compiler/deoptimize/Test.java TEST_VM_OPTS="-XX:-VerifyDependencies" > CONF=linux-x86_64-server-fastdebug make test **287.27s user 16.19s system 496% cpu 1:01.10 total** > > > Adding -XX:-VerifyDependencies to the test file accomplishes the same run time of 1min: > > time CONF=linux-x86_64-server-fastdebug make test TEST=vmTestbase/vm/mlvm/meth/stress/compiler/deoptimize/Test.java > CONF=linux-x86_64-server-fastdebug make test **272.33s user 14.56s system 464% cpu 1:01.75 total** This pull request has been closed without being integrated. 
------------- PR: https://git.openjdk.org/jdk/pull/19040 From mdoerr at openjdk.org Mon Aug 19 13:45:55 2024 From: mdoerr at openjdk.org (Martin Doerr) Date: Mon, 19 Aug 2024 13:45:55 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v2] In-Reply-To: References: <4c-MLXwKcNcSnloSkYkuk3gnv3ux5i5beS51Fd9Z8MQ=.cd0a7eba-ff26-4855-a01c-d1ae5182100b@github.com> <8fuUEkswt05x0IuT4PrNQuYgLd49g4EpZWOPPQog4PQ=.70b5edb6-98d0-4276-8578-f7a496b7f2a7@github.com> Message-ID: On Mon, 19 Aug 2024 12:16:44 GMT, Roberto Castañeda Lozano wrote: >> Thanks for trying! I think I should try it on PPC64. The null check can be integrated into the oop decoding, so all Compressed Oops Modes would benefit. It could be that x86 is less sensitive to such optimizations. > >> The null check can be integrated into the oop decoding, so all Compressed Oops Modes would benefit. > > But integrating the null check into the zero-based OOP decoding operation would require adding a conditional branch to OOP decoding (as prototyped [here](https://github.com/robcasloz/jdk/blob/ac71b1a02c8c1bd4989f762d27092fd3bf19ccd7/src/hotspot/cpu/x86/macroAssembler_x86.cpp#L5859-L5875)), right? This would effectively mean moving the post-barrier null check above the post-barrier inter-region check, which I am not sure is beneficial. In case of heap base != null, a branch already exists, which makes the other null check redundant. So we have a null check, a region-crossing check, and another null check. Maybe this compressed oop mode is not important enough. For the other compressed oop modes, yes, this means moving the null check above the region-crossing check. On PPC64, the null check can be combined with the shift instruction, so we save one compare instruction. Technically, it would even be possible to use only one branch instruction for both checks, but I'm not sure if it's worth the complexity. I'll think about it.
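To make the trade-off concrete: with a non-null heap base, decoding has to branch on null anyway (otherwise null would decode to the base address), while zero-based decoding maps null to null with a plain shift. The C++ model below illustrates both modes together with a simplified post-barrier filter in the order discussed (inter-region check first, then null check). It is a hypothetical sketch, not HotSpot code: `decode_zero_based`, `decode_with_base`, `needs_post_barrier`, and the region-size parameter are assumptions, and the card checks of the real G1 barrier are omitted.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of compressed-reference decoding (not HotSpot code).
// A narrow value of 0 encodes null.

// Zero-based mode: null decodes to 0 with a plain shift, no branch needed.
uint64_t decode_zero_based(uint32_t narrow, int shift) {
  return static_cast<uint64_t>(narrow) << shift;
}

// Mode with a non-null heap base: without this branch, null would decode to
// `base`, so the decoder already knows whether the value is null.
uint64_t decode_with_base(uint32_t narrow, uint64_t base, int shift) {
  if (narrow == 0) {
    return 0;
  }
  return base + (static_cast<uint64_t>(narrow) << shift);
}

// Simplified post-barrier filter in the order discussed in the thread:
// same-region stores are filtered first, then null stores.
bool needs_post_barrier(uint64_t field_addr, uint64_t new_val, int log_region_size) {
  if (((field_addr ^ new_val) >> log_region_size) == 0) {
    return false;  // field and new value are in the same region
  }
  if (new_val == 0) {
    return false;  // storing null never creates a cross-region reference
  }
  return true;     // a real G1 barrier would go on to check the card
}
```

In the heap-base mode, the `narrow == 0` branch already answers the null question, which is why the later null check in the barrier becomes redundant there.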
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1721813878 From rcastanedalo at openjdk.org Mon Aug 19 14:27:53 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Mon, 19 Aug 2024 14:27:53 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v2] In-Reply-To: References: <4c-MLXwKcNcSnloSkYkuk3gnv3ux5i5beS51Fd9Z8MQ=.cd0a7eba-ff26-4855-a01c-d1ae5182100b@github.com> <8fuUEkswt05x0IuT4PrNQuYgLd49g4EpZWOPPQog4PQ=.70b5edb6-98d0-4276-8578-f7a496b7f2a7@github.com> Message-ID: <3H3rBSKDnpg5fmYqcZ5hT9yH2EAxCocycRompQJJCOo=.1b30fd89-09e9-4708-bd20-cdea00e809a7@github.com> On Mon, 19 Aug 2024 13:43:04 GMT, Martin Doerr wrote: >>> The null check can be integrated into the oop decoding, so all Compressed Oops Modes would benefit. >> >> But integrating the null check into the zero-based OOP decoding operation would require adding a conditional branch to OOP decoding (as prototyped [here](https://github.com/robcasloz/jdk/blob/ac71b1a02c8c1bd4989f762d27092fd3bf19ccd7/src/hotspot/cpu/x86/macroAssembler_x86.cpp#L5859-L5875)), right? This would effectively mean moving the post-barrier null check above the post-barrier inter-region check, which I am not sure is beneficial. > > In case of heap base != null, a branch already exists, which makes the other null check redundant. So we have a null check, a region-crossing check, and another null check. Maybe this compressed oop mode is not important enough. > > For the other compressed oop modes, yes, this means moving the null check above the region-crossing check. On PPC64, the null check can be combined with the shift instruction, so we save one compare instruction. Technically, it would even be possible to use only one branch instruction for both checks, but I'm not sure if it's worth the complexity. I'll think about it. OK, thanks.
I just ran some benchmarks with zero-based OOP compression ([prototype here](https://github.com/robcasloz/jdk/tree/JDK-8334060-g1-late-barrier-expansion-x64-optimizations)) and could not observe any significant performance effect on three different x64 implementations. I think I will keep the `g1StoreN` implementation as-is in the x64 and aarch64 backends, for simplicity. Again, we can revisit this in follow-up work if need be. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1721881065 From kvn at openjdk.org Mon Aug 19 17:39:53 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 19 Aug 2024 17:39:53 GMT Subject: RFR: 8333334: C2: Make result of `Node::dominates` more precise to enhance scalar replacement [v7] In-Reply-To: References: Message-ID: On Tue, 9 Jul 2024 03:10:55 GMT, Qizheng Xing wrote: >> This patch changes the algorithm of `Node::dominates` to make the result more precise, and allows the iterators of `ConcurrentHashMap` to be scalar replaced. >> >> The previous algorithm will return a conservative result when encountering a dead control flow, and only try the first two input paths of a multi-input Region node, which may prevent the scalar replacement in some cases. >> >> For example, with G1 GC enabled, C2 generates GC barriers for `ConcurrentHashMap` iteration operations at some early phases, and then eliminates them in a later IGVN, but `LoadNode` is also idealized in the same IGVN. This causes `LoadNode::Ideal` to see some dead barrier control flows, and refuse to split some instance field loads through Phi due to the conservative result of `Node::dominates`, and thus the scalar replacement can not be applied to iterators in the later macro elimination phase. 
>> >> This patch allows `Node::dominates` to try other paths of the last multi-input Region node when the first path is dead, and makes `ConcurrentHashMap` iteration ~30% faster: >> >> >> Benchmark (nkeys) Mode Cnt Score Error Units >> Maps.testConcurrentHashMapIterators 10000 avgt 15 414099.085 ± 33230.945 ns/op # baseline >> Maps.testConcurrentHashMapIterators 10000 avgt 15 315490.281 ± 3037.056 ns/op # patch >> >> >> Testing: tier1-4. > > Qizheng Xing has updated the pull request incrementally with one additional commit since the last revision: > > Add copyright. src/hotspot/share/opto/memnode.cpp line 433: > 431: Node::DomResult MemNode::all_controls_dominate(Node* dom, Node* sub) { > 432: if (dom == nullptr || dom->is_top() || sub == nullptr || sub->is_top()) > 433: return DomResult::EncounteredDeadCode; // Conservative answer for dead code Please fix the style of this and other code you touched: add `{}` around condition bodies where they are missing. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19496#discussion_r1722125613 From kvn at openjdk.org Mon Aug 19 17:39:51 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 19 Aug 2024 17:39:51 GMT Subject: RFR: 8333334: C2: Make result of `Node::dominates` more precise to enhance scalar replacement [v7] In-Reply-To: References: Message-ID: On Mon, 19 Aug 2024 09:20:58 GMT, Christian Hagedorn wrote: >> src/hotspot/share/opto/memnode.cpp line 431: >> >>> 429: // control input of a memory operation predates (dominates) >>> 430: // an allocation it wants to look past. >>> 431: Node::DomResult MemNode::all_controls_dominate(Node* dom, Node* sub) { >> >> Now you have many checks of the form: >> >> all_controls_dominate(this, st_alloc) == DomResult::Dominate >> >> But actually, there is only one place in IGVN where you care about the third, dead-code result. Maybe you can abstract that away and do the following: >> - Rename this method to `maybe_all_controls_dominate()`.
>> - Add a new method `all_controls_dominate()` which checks the result for `DomResult::Dominate`: >> >> bool MemNode::all_controls_dominate(Node* dom, Node* sub) { >> DomResult dom_result = maybe_all_controls_dominate(dom, sub); >> return dom_result == DomResult::Dominate; >> } >> >> - The calls in `LoadNode::split_through_phi()` use `maybe_all_controls_dominate()`. >> - All other callers in existing code do not need to be updated since they call the new `all_controls_dominate()` method, which mimics the old behavior without caring about dead code. >> >> Might be cleaner, but it's just a thought. > > Maybe also add a method comment about the implications of returning `DomResult::EncounteredDeadCode` (i.e., that the answer is undecided as long as there is dead code, but that by the end of IGVN, once the dead code has been cleaned up, we know the definite result). I agree with Christian's suggestion of a wrapper method that checks the result. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19496#discussion_r1722134179 From sviswanathan at openjdk.org Mon Aug 19 18:06:52 2024 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Mon, 19 Aug 2024 18:06:52 GMT Subject: RFR: 8337632: AES-GCM Algorithm optimization for x86_64 In-Reply-To: References: Message-ID: <9qMj55IMZFiI_Fgjn_VpYK601cqTbSe1iWCePgN1qFg=.4ea0faf5-56d7-4ddd-a7db-e4ce13096014@github.com> On Mon, 22 Jan 2024 09:38:25 GMT, Smita Kamath wrote: > Hi, > I want to submit an AES-GCM algorithm optimization. This implementation uses AVX512/VAES instructions. Additionally, it reduces PARALLEL_LEN from 7680 to 512 bytes. The performance numbers are as below. Kindly review the code. Thank you.
>
> Benchmark | Datasize | BaseJDK (ops/s) | Patch (ops/s) | %Gain
> -- | -- | -- | -- | --
> full.AESGCMBench.decrypt | 512 | 2928259.197 | 3269964.387 | 11.67
> full.AESGCMBench.decrypt | 1024 | 2494254.611 | 3010987.731 | 20.72
> full.AESGCMBench.decrypt | 1500 | 1883453.546 | 1934915.846 | 2.73
> full.AESGCMBench.decrypt | 2048 | 1825780.711 | 2452861.368 | 34.34
> full.AESGCMBench.decrypt | 4096 | 1275108.345 | 1806329.066 | 41.66
> full.AESGCMBench.decrypt | 8192 | 1033936.634 | 1196836.052 | 15.75
> full.AESGCMBench.decrypt | 16384 | 681494.768 | 711630.498 | 4.42
> full.AESGCMBench.decrypt | 32768 | 385026.017 | 395043.193 | 2.6
> full.AESGCMBench.decrypt | 65536 | 207373.924 | 214723.588 | 3.54
> full.AESGCMBench.encrypt | 512 | 2658008.476 | 2882496.94 | 8.45
> full.AESGCMBench.encrypt | 1024 | 2283709.63 | 2589534.403 | 13.39
> full.AESGCMBench.encrypt | 1500 | 1794993.519 | 1817669.531 | 1.26
> full.AESGCMBench.encrypt | 2048 | 1745532.435 | 2191097.29 | 25.52
> full.AESGCMBench.encrypt | 4096 | 1203301.174 | 1649593.953 | 37.08
> full.AESGCMBench.encrypt | 8192 | 985174.988 | 1132407.54 | 14.94
> full.AESGCMBench.encrypt | 16384 | 658980.441 | 684765.771 | 3.91
> full.AESGCMBench.encrypt | 32768 | 373543.798 | 391518.837 | 4.81
> full.AESGCMBench.encrypt | 65536 | 202532.315 | 205084.833 | 1.26

src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 2749: > 2747: const XMMRegister ZT1 = xmm0, ZT2 = xmm1, ZT3 = xmm2, ZT4 = xmm3; > 2748: const XMMRegister ZT5 = xmm4, ZT6 = xmm5, ZT7 = xmm7, ZT8 = xmm8; > 2749: const XMMRegister T5 = xmm4; Why do we need T5 separately? Could we not use ZT5 itself? src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 3433: > 3431: __ movl(pos, 0); > 3432: __ cmpl(len, 256); > 3433: __ jcc(Assembler::lessEqual, ENC_DEC_DONE); Could this be Assembler::less instead? Then a 256-byte length could be handled here as well.
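The %Gain column appears to be computed as (patch - base) / base * 100. The short C++ check below recomputes a few rows under that assumption; `percent_gain` and `round2` are illustrative helpers, not part of the benchmark.

```cpp
#include <cassert>
#include <cmath>

// Relative throughput gain in percent: (patch - base) / base * 100.
double percent_gain(double base_ops, double patch_ops) {
  return (patch_ops - base_ops) / base_ops * 100.0;
}

// Round to two decimals, matching how most rows in the table are reported.
double round2(double v) {
  return std::round(v * 100.0) / 100.0;
}
```

For example, the decrypt row at 512 bytes gives (3269964.387 - 2928259.197) / 2928259.197 * 100, which rounds to 11.67, and the last encrypt row rounds to 1.26.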
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1720478148 PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1722162267 From qamai at openjdk.org Mon Aug 19 18:10:58 2024 From: qamai at openjdk.org (Quan Anh Mai) Date: Mon, 19 Aug 2024 18:10:58 GMT Subject: RFR: 8335444: Generalize implementation of AndNode mul_ring [v3] In-Reply-To: <6lNmXhkDKMNu7r8yYrBYs4JUJQ-drbj9Gj1_EaAqQj0=.c699170b-0551-4c04-8322-0182cf9b5866@github.com> References: <7l_zt0Q884dkzJbP5kjhfvZPmFQ4CMRgakWoUCopfvw=.fc1307d0-6a2c-40cc-adc6-c1b21424d0dd@github.com> <6lNmXhkDKMNu7r8yYrBYs4JUJQ-drbj9Gj1_EaAqQj0=.c699170b-0551-4c04-8322-0182cf9b5866@github.com> Message-ID: On Wed, 7 Aug 2024 01:20:09 GMT, Jasmine Karthikeyan wrote: >> Hi all, >> I've written this patch which improves type calculation for bitwise-and functions. Previously, the only cases that were considered were if one of the inputs to the node were a positive constant. I've generalized this behavior, as well as added a case to better estimate the result for arbitrary ranges. Since these are very common patterns to see, this can help propagate more precise types throughout the ideal graph for a large number of methods, making other optimizations and analyses stronger. I was interested in where this patch improves types, so I ran CTW for `java_base` and `java_base_2` and printed out the differences in this gist [here](https://gist.github.com/jaskarth/b45260d81ab621656f4a55cc51cf5292). While I don't think it's particularly complicated I've also added some discussion of the mathematics below, mostly because I thought it was interesting to work through :) >> >> This patch passes tier1-3 testing on my linux x64 machine. Thoughts and reviews would be very appreciated! > > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Check IR before macro expansion This is really nice. 
Given Christian's suggestion, what do you think about inferring the unsigned bounds and known bits from the signed bounds, performing the calculation on those, and then deriving the signed bounds from the result at the end? src/hotspot/share/opto/mulnode.cpp line 601: > 599: typedef typename IntegerType::NativeType NativeType; > 600: > 601: int widen = MAX2(r0->_widen,r1->_widen); Suggestion: int widen = MAX2(r0->_widen, r1->_widen); ------------- PR Review: https://git.openjdk.org/jdk/pull/20066#pullrequestreview-2246160281 PR Review Comment: https://git.openjdk.org/jdk/pull/20066#discussion_r1722165063 From qamai at openjdk.org Mon Aug 19 18:13:51 2024 From: qamai at openjdk.org (Quan Anh Mai) Date: Mon, 19 Aug 2024 18:13:51 GMT Subject: RFR: 8335444: Generalize implementation of AndNode mul_ring [v3] In-Reply-To: References: <7l_zt0Q884dkzJbP5kjhfvZPmFQ4CMRgakWoUCopfvw=.fc1307d0-6a2c-40cc-adc6-c1b21424d0dd@github.com> <6lNmXhkDKMNu7r8yYrBYs4JUJQ-drbj9Gj1_EaAqQj0=.c699170b-0551-4c04-8322-0182cf9b5866@github.com> Message-ID: <94xIgQwDqMDeu5NPWB58Q23Oc3rTuf1sKIIzPuq9hgM=.5a0667d7-05aa-4eef-b0ab-ad71ec40dcae@github.com> On Mon, 19 Aug 2024 18:06:02 GMT, Quan Anh Mai wrote: >> Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: >> >> Check IR before macro expansion > > src/hotspot/share/opto/mulnode.cpp line 601: > >> 599: typedef typename IntegerType::NativeType NativeType; >> 600: >> 601: int widen = MAX2(r0->_widen,r1->_widen); > > Suggestion: > > int widen = MAX2(r0->_widen, r1->_widen); For example, since `res_uhi = min(uhi1, uhi2)`, your cases where one of the inputs is non-negative follow naturally, because for a non-negative `TypeInt`, the unsigned upper bound is the same as the signed upper bound.
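One possible reading of Quan's suggestion, reduced to the upper bound only: map each signed range to an unsigned range, take `min(uhi1, uhi2)`, and map back. The C++ sketch below is an illustration under assumptions (`SRange`, `URange`, `to_unsigned`, `and_unsigned_hi`, and `and_signed_from_unsigned` are hypothetical names, and known-bits tracking is omitted), but it shows how the "one input is non-negative" cases fall out without special-casing.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helpers illustrating the suggestion; not C2 code.
struct SRange { int32_t lo, hi; };
struct URange { uint32_t lo, hi; };

// Unsigned view of a signed range. If the range does not straddle zero,
// signed and unsigned order agree on it.
URange to_unsigned(SRange r) {
  if (r.lo >= 0 || r.hi < 0) {
    return {static_cast<uint32_t>(r.lo), static_cast<uint32_t>(r.hi)};
  }
  return {0u, UINT32_MAX};  // straddles zero: both ends of unsigned space
}

// Viewed as unsigned, x & y never exceeds either operand.
uint32_t and_unsigned_hi(URange a, URange b) {
  return std::min(a.hi, b.hi);
}

// Map the unsigned result range [0, uhi] back to a signed range.
SRange and_signed_from_unsigned(SRange a, SRange b) {
  uint32_t uhi = and_unsigned_hi(to_unsigned(a), to_unsigned(b));
  if (uhi <= static_cast<uint32_t>(INT32_MAX)) {
    return {0, static_cast<int32_t>(uhi)};  // sign bit clear in every result
  }
  return {INT32_MIN, INT32_MAX};  // this sketch gives up; known bits would refine it
}
```

For `[0, 100] & [-5, 50]`, the second range straddles zero, so its unsigned view is the full range and the result is `[0, 100]`, matching the non-negative-input case; for two negative ranges this sketch returns the full range, which is where the known-bits part of the suggestion would be needed.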
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20066#discussion_r1722170192 From sviswanathan at openjdk.org Mon Aug 19 22:05:52 2024 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Mon, 19 Aug 2024 22:05:52 GMT Subject: RFR: 8338023: Support two vector selectFrom API [v2] In-Reply-To: References: <28KQHru1heR-YOVsRVo8Ffj_4D29IV8vD2tombvTHdI=.dba80ac3-9804-4074-ac0f-8acb9b042a08@github.com> Message-ID: On Mon, 19 Aug 2024 07:36:15 GMT, Jatin Bhateja wrote: >> Hi All, >> >> As per the discussion on the panama-dev mailing list[1], this patch adds support for the following new two-vector permutation API. >> >> >> Declaration:- >> Vector.selectFrom(Vector v1, Vector v2) >> >> >> Semantics:- >> Using index values stored in the lanes of "this" vector, assemble the values stored in the first (v1) and second (v2) vector arguments. Thus, the first and second vectors serve as a table whose elements are selected based on the index vector. The API is applicable to all integral and floating-point types. The result of this operation is semantically equivalent to the expression v1.rearrange(this.toShuffle(), v2). Values held in index vector lanes must lie within the valid two-vector index range [0, 2*VLEN); otherwise an IndexOutOfBoundsException is thrown. >> >> Summary of changes: >> - Java side implementation of the new selectFrom API. >> - C2 compiler IR and inline expander changes. >> - In the absence of a direct two-vector permutation instruction in the target ISA, a lowering transformation breaks the new IR down into constituent IR supported by the target platform. >> - Optimized x86 backend implementation for AVX512 and legacy targets. >> - Functional tests covering the new API.
>> >> JMH micro included with this patch shows around 10-15x gain over existing rearrange API :- >> Test System: Intel(R) Xeon(R) Platinum 8480+ [ Sapphire Rapids Server] >> >> >> Benchmark (size) Mode Cnt Score Error Units >> SelectFromBenchmark.rearrangeFromByteVector 1024 thrpt 2 2041.762 ops/ms >> SelectFromBenchmark.rearrangeFromByteVector 2048 thrpt 2 1028.550 ops/ms >> SelectFromBenchmark.rearrangeFromIntVector 1024 thrpt 2 962.605 ops/ms >> SelectFromBenchmark.rearrangeFromIntVector 2048 thrpt 2 479.004 ops/ms >> SelectFromBenchmark.rearrangeFromLongVector 1024 thrpt 2 359.758 ops/ms >> SelectFromBenchmark.rearrangeFromLongVector 2048 thrpt 2 178.192 ops/ms >> SelectFromBenchmark.rearrangeFromShortVector 1024 thrpt 2 1463.459 ops/ms >> SelectFromBenchmark.rearrangeFromShortVector 2048 thrpt 2 727.556 ops/ms >> SelectFromBenchmark.selectFromByteVector 1024 thrpt 2 33254.830 ops/ms >> SelectFromBenchmark.selectFromByteVector 2048 thrpt 2 17313.174 ops/ms >> SelectFromBenchmark.selectFromIntVector 1024 thrpt 2 10756.804 ops/ms >> S... > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Review suggestions incorporated. @rose00 @PaulSandoz Please see the work in progress (https://github.com/openjdk/jdk/pull/20634) to make wrap indices as default for rearrange and selectFrom apis. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/20508#issuecomment-2297535442 From dlong at openjdk.org Tue Aug 20 00:03:48 2024 From: dlong at openjdk.org (Dean Long) Date: Tue, 20 Aug 2024 00:03:48 GMT Subject: RFR: 8333098: ubsan: bytecodeInfo.cpp:318:59: runtime error: division by zero In-Reply-To: <1URWnKWj-n-cf_AfAAHfCbICIz4pbq3ATEtHu29h36Q=.2d7ff232-715f-4b8b-b859-f0dafdd73ee0@github.com> References: <1URWnKWj-n-cf_AfAAHfCbICIz4pbq3ATEtHu29h36Q=.2d7ff232-715f-4b8b-b859-f0dafdd73ee0@github.com> Message-ID: On Mon, 19 Aug 2024 07:27:55 GMT, Matthias Baesken wrote: > > Doesn't this need to be 1.0 or infinity to preserve the existing behavior? > > So I would change the code to `min_freq = 1.0;` , right ? I would do it slightly differently. See my suggested change. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20615#issuecomment-2297719367 From dlong at openjdk.org Tue Aug 20 00:03:49 2024 From: dlong at openjdk.org (Dean Long) Date: Tue, 20 Aug 2024 00:03:49 GMT Subject: RFR: 8333098: ubsan: bytecodeInfo.cpp:318:59: runtime error: division by zero In-Reply-To: References: Message-ID: On Fri, 16 Aug 2024 11:57:09 GMT, Matthias Baesken wrote: > When running test > compiler/classUnloading/methodUnloading/TestOverloadCompileQueues.java > with ubsan enabled binaries we run into the issue reported below. > Reason seems to be that we divide by zero in the code in some special cases (we should instead check for `CompilationPolicy::min_invocations() == 0 `and handle it separately). 
> > /jdk/src/hotspot/share/opto/bytecodeInfo.cpp:318:59: runtime error: division by zero > #0 0x7f5145c0dda2 in InlineTree::should_not_inline(ciMethod*, ciMethod*, int, bool&, ciCallProfile&) src/hotspot/share/opto/bytecodeInfo.cpp:318 > #1 0x7f51466366d7 in InlineTree::try_to_inline(ciMethod*, ciMethod*, int, JVMState*, ciCallProfile&, bool&) src/hotspot/share/opto/bytecodeInfo.cpp:382 > #2 0x7f514663d36b in InlineTree::ok_to_inline(ciMethod*, JVMState*, ciCallProfile&, bool&) src/hotspot/share/opto/bytecodeInfo.cpp:596 > #3 0x7f51470dffd6 in Compile::call_generator(ciMethod*, int, bool, JVMState*, bool, float, ciKlass*, bool) src/hotspot/share/opto/doCall.cpp:189 > #4 0x7f51470e18ab in Parse::do_call() src/hotspot/share/opto/doCall.cpp:641 > #5 0x7f514887dbf1 in Parse::do_one_block() src/hotspot/share/opto/parse1.cpp:1607 > #6 0x7f514887fefa in Parse::do_all_blocks() src/hotspot/share/opto/parse1.cpp:724 > #7 0x7f514888d4da in Parse::Parse(JVMState*, ciMethod*, float) src/hotspot/share/opto/parse1.cpp:628 > #8 0x7f51469d8418 in ParseGenerator::generate(JVMState*) src/hotspot/share/opto/callGenerator.cpp:99 > #9 0x7f5146d99cff in Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*) src/hotspot/share/opto/compile.cpp:793 > #10 0x7f51469d5ebf in C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*) src/hotspot/share/opto/c2compiler.cpp:142 > #11 0x7f5146db0274 in CompileBroker::invoke_compiler_on_method(CompileTask*) src/hotspot/share/compiler/compileBroker.cpp:2303 > #12 0x7f5146db2826 in CompileBroker::compiler_thread_loop() src/hotspot/share/compiler/compileBroker.cpp:1961 > #13 0x7f51478d475a in JavaThread::thread_main_inner() src/hotspot/share/runtime/javaThread.cpp:759 > #14 0x7f51491620ea in Thread::call_run() src/hotspot/share/runtime/thread.cpp:225 > #15 0x7f51487ac201 in thread_native_entry src/hotspot/os/linux/os_linux.cpp:846 > #16 0x7f514e5cf6e9 in start_thread (/lib64/libpthread.so.0+0xa6e9) (BuildId: 
2f8d3c2d0f4d7888c2598d2ff6356537f5708a73) > #17 0x7f514db1550e in clone (/lib64/libc.so.6+0x11850e) (BuildId: f732026552f6adff988b338e92d466bc81a01c37) src/hotspot/share/opto/bytecodeInfo.cpp line 324: > 322: } else { > 323: min_freq = MAX2(MinInlineFrequencyRatio, 1.0 / cp_min_inv); > 324: } Suggestion: int cp_min_inv = MAX2(1, CompilationPolicy::min_invocations()); double min_freq = MAX2(MinInlineFrequencyRatio, 1.0 / cp_min_inv); ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20615#discussion_r1722523432 From mbaesken at openjdk.org Tue Aug 20 07:30:23 2024 From: mbaesken at openjdk.org (Matthias Baesken) Date: Tue, 20 Aug 2024 07:30:23 GMT Subject: RFR: 8333098: ubsan: bytecodeInfo.cpp:318:59: runtime error: division by zero In-Reply-To: References: Message-ID: On Fri, 16 Aug 2024 11:57:09 GMT, Matthias Baesken wrote: > When running test > compiler/classUnloading/methodUnloading/TestOverloadCompileQueues.java > with ubsan enabled binaries we run into the issue reported below. > Reason seems to be that we divide by zero in the code in some special cases (we should instead check for `CompilationPolicy::min_invocations() == 0 `and handle it separately). 
> > /jdk/src/hotspot/share/opto/bytecodeInfo.cpp:318:59: runtime error: division by zero > #0 0x7f5145c0dda2 in InlineTree::should_not_inline(ciMethod*, ciMethod*, int, bool&, ciCallProfile&) src/hotspot/share/opto/bytecodeInfo.cpp:318 > #1 0x7f51466366d7 in InlineTree::try_to_inline(ciMethod*, ciMethod*, int, JVMState*, ciCallProfile&, bool&) src/hotspot/share/opto/bytecodeInfo.cpp:382 > #2 0x7f514663d36b in InlineTree::ok_to_inline(ciMethod*, JVMState*, ciCallProfile&, bool&) src/hotspot/share/opto/bytecodeInfo.cpp:596 > #3 0x7f51470dffd6 in Compile::call_generator(ciMethod*, int, bool, JVMState*, bool, float, ciKlass*, bool) src/hotspot/share/opto/doCall.cpp:189 > #4 0x7f51470e18ab in Parse::do_call() src/hotspot/share/opto/doCall.cpp:641 > #5 0x7f514887dbf1 in Parse::do_one_block() src/hotspot/share/opto/parse1.cpp:1607 > #6 0x7f514887fefa in Parse::do_all_blocks() src/hotspot/share/opto/parse1.cpp:724 > #7 0x7f514888d4da in Parse::Parse(JVMState*, ciMethod*, float) src/hotspot/share/opto/parse1.cpp:628 > #8 0x7f51469d8418 in ParseGenerator::generate(JVMState*) src/hotspot/share/opto/callGenerator.cpp:99 > #9 0x7f5146d99cff in Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*) src/hotspot/share/opto/compile.cpp:793 > #10 0x7f51469d5ebf in C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*) src/hotspot/share/opto/c2compiler.cpp:142 > #11 0x7f5146db0274 in CompileBroker::invoke_compiler_on_method(CompileTask*) src/hotspot/share/compiler/compileBroker.cpp:2303 > #12 0x7f5146db2826 in CompileBroker::compiler_thread_loop() src/hotspot/share/compiler/compileBroker.cpp:1961 > #13 0x7f51478d475a in JavaThread::thread_main_inner() src/hotspot/share/runtime/javaThread.cpp:759 > #14 0x7f51491620ea in Thread::call_run() src/hotspot/share/runtime/thread.cpp:225 > #15 0x7f51487ac201 in thread_native_entry src/hotspot/os/linux/os_linux.cpp:846 > #16 0x7f514e5cf6e9 in start_thread (/lib64/libpthread.so.0+0xa6e9) (BuildId: 
2f8d3c2d0f4d7888c2598d2ff6356537f5708a73) > #17 0x7f514db1550e in clone (/lib64/libc.so.6+0x11850e) (BuildId: f732026552f6adff988b338e92d466bc81a01c37) Thanks Dean, I added the suggestion. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20615#issuecomment-2298158958 From rcastanedalo at openjdk.org Tue Aug 20 07:34:51 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Tue, 20 Aug 2024 07:34:51 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v2] In-Reply-To: <7O0X1427MXg_MrnMb9qcH64x3lG3feYt47iZhBujztk=.3fee1482-8bcf-42d9-bc4d-0e074f03222e@github.com> References: <7O0X1427MXg_MrnMb9qcH64x3lG3feYt47iZhBujztk=.3fee1482-8bcf-42d9-bc4d-0e074f03222e@github.com> Message-ID: On Thu, 8 Aug 2024 09:29:17 GMT, Daniel Lundén wrote: >> If a method has a large number of parameters, we currently bail out from C2 compilation. >> >> ### Changeset >> >> Allowing C2 compilation of methods with a large number of parameters requires fundamental changes to the register mask data structure, used in many places in C2. In particular, register masks currently have a statically determined size and cannot represent arbitrary numbers of stack slots. This is needed if we want to compile methods with arbitrary numbers of parameters. Register mask operations are present in performance-sensitive parts of C2, which further complicates changes. >> >> Changes: >> - Add functionality to dynamically grow/extend register masks. I experimented with a number of design choices to achieve this. To keep the common case (normal number of method parameters) quick and also to avoid more intrusive changes to the current `RegMask` interface, I decided to leave the "base" statically allocated memory for masks unchanged and only use dynamically allocated memory in the rare cases where it is needed.
>> - Generalize the "chunk"-logic from `PhaseChaitin::Select()` to allow arbitrary-sized chunks, and also move most of the logic into register mask methods to separate concerns and to make the `PhaseChaitin::Select()` code more readable. >> - Remove all `can_represent` checks and bailouts. >> - Performance tuning. A particularly important change is the early-exit optimization in `RegMask::overlap`, used in the performance-sensitive method `PhaseChaitin::interfere_with_live`. >> - Add a new test case `TestManyMethodArguments.java` and extend an old test `TestNestedSynchronize.java`. >> >> ### Testing >> >> - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/10178060450) >> - `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. >> - Standard performance benchmarking. No observed conclusive overall performance degradation/improvement. >> - Specific benchmarking of C2 compilation time. The changes increase C2 compilation time by, approximately and on average, 1% for methods that could also be compiled before this changeset (see the figure below). The reason for the degradation is further checks required in performance-sensitive code (in particular `PhaseChaitin::remove_bound_register_from_interfering_live_ranges`). I have tried optimizing in various ways, but changes I found that lead to improvement also lead to less readable code (and are, in my opinion, no... > > Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: > > Remove leftover CHUNK_SIZE reference src/hotspot/share/opto/regmask.hpp line 129: > 127: } else { > 128: assert(_RM_UP_EXT != nullptr, "sanity"); > 129: assert( i >= _RM_SIZE, "sanity"); This assertion is trivial (in my opinion) and can be removed. src/hotspot/share/opto/regmask.hpp line 144: > 142: > 143: // Return a suitable arena for (extended) register mask allocation.
> 144: static Arena* _get_arena(); Maybe inline this function into its only user? src/hotspot/share/opto/regmask.hpp line 148: > 146: // Grow the register mask to ensure it can fit at least min_size words. > 147: void _grow(unsigned int min_size, bool init = true) { > 148: if(min_size > _rm_size) { Suggestion: if (min_size > _rm_size) { src/hotspot/share/opto/regmask.hpp line 165: > 163: if (init) { > 164: int fill = 0; > 165: if(is_AllStack()) { Suggestion: if (is_AllStack()) { src/hotspot/share/opto/regmask.hpp line 208: > 206: > 207: // Set a range of words in the register mask to a given value. > 208: void _set_range(unsigned int start, int value, unsigned int range) { Suggestion: `length`, `words`, or similar would be a more idiomatic name for the third parameter (`range`), in my opinion. src/hotspot/share/opto/regmask.hpp line 278: > 276: // itself as the _all_stack flag. We need to record this fact using the now > 277: // separate _all_stack flag. > 278: set_AllStack(_RM_UP[_RM_MAX] & (uintptr_t(1) << _WordBitMask)); As we discussed offline, I suggest simplifying this by updating the ADLC code as well to set `_all_stack` explicitly. Here is a patch that accomplishes that: https://github.com/robcasloz/jdk/commit/835628893ed5ea838c16afa49adbd57e4152a5dd, feel free to merge if you agree. The patch passes tier1-3 for all Oracle-supported platforms. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r1721792317 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r1722812982 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r1722813539 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r1722814189 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r1722827247 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r1722822957 From mbaesken at openjdk.org Tue Aug 20 07:30:23 2024 From: mbaesken at openjdk.org (Matthias Baesken) Date: Tue, 20 Aug 2024 07:30:23 GMT Subject: RFR: 8333098: ubsan: bytecodeInfo.cpp:318:59: runtime error: division by zero [v2] In-Reply-To: References: Message-ID: > When running test > compiler/classUnloading/methodUnloading/TestOverloadCompileQueues.java > with ubsan enabled binaries we run into the issue reported below. > Reason seems to be that we divide by zero in the code in some special cases (we should instead check for `CompilationPolicy::min_invocations() == 0 `and handle it separately). 
> > /jdk/src/hotspot/share/opto/bytecodeInfo.cpp:318:59: runtime error: division by zero > #0 0x7f5145c0dda2 in InlineTree::should_not_inline(ciMethod*, ciMethod*, int, bool&, ciCallProfile&) src/hotspot/share/opto/bytecodeInfo.cpp:318 > #1 0x7f51466366d7 in InlineTree::try_to_inline(ciMethod*, ciMethod*, int, JVMState*, ciCallProfile&, bool&) src/hotspot/share/opto/bytecodeInfo.cpp:382 > #2 0x7f514663d36b in InlineTree::ok_to_inline(ciMethod*, JVMState*, ciCallProfile&, bool&) src/hotspot/share/opto/bytecodeInfo.cpp:596 > #3 0x7f51470dffd6 in Compile::call_generator(ciMethod*, int, bool, JVMState*, bool, float, ciKlass*, bool) src/hotspot/share/opto/doCall.cpp:189 > #4 0x7f51470e18ab in Parse::do_call() src/hotspot/share/opto/doCall.cpp:641 > #5 0x7f514887dbf1 in Parse::do_one_block() src/hotspot/share/opto/parse1.cpp:1607 > #6 0x7f514887fefa in Parse::do_all_blocks() src/hotspot/share/opto/parse1.cpp:724 > #7 0x7f514888d4da in Parse::Parse(JVMState*, ciMethod*, float) src/hotspot/share/opto/parse1.cpp:628 > #8 0x7f51469d8418 in ParseGenerator::generate(JVMState*) src/hotspot/share/opto/callGenerator.cpp:99 > #9 0x7f5146d99cff in Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*) src/hotspot/share/opto/compile.cpp:793 > #10 0x7f51469d5ebf in C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*) src/hotspot/share/opto/c2compiler.cpp:142 > #11 0x7f5146db0274 in CompileBroker::invoke_compiler_on_method(CompileTask*) src/hotspot/share/compiler/compileBroker.cpp:2303 > #12 0x7f5146db2826 in CompileBroker::compiler_thread_loop() src/hotspot/share/compiler/compileBroker.cpp:1961 > #13 0x7f51478d475a in JavaThread::thread_main_inner() src/hotspot/share/runtime/javaThread.cpp:759 > #14 0x7f51491620ea in Thread::call_run() src/hotspot/share/runtime/thread.cpp:225 > #15 0x7f51487ac201 in thread_native_entry src/hotspot/os/linux/os_linux.cpp:846 > #16 0x7f514e5cf6e9 in start_thread (/lib64/libpthread.so.0+0xa6e9) (BuildId: 
2f8d3c2d0f4d7888c2598d2ff6356537f5708a73) > #17 0x7f514db1550e in clone (/lib64/libc.so.6+0x11850e) (BuildId: f732026552f6adff988b338e92d466bc81a01c37) Matthias Baesken has updated the pull request incrementally with one additional commit since the last revision: add suggestion from Dean Long ------------- Changes: - all: https://git.openjdk.org/jdk/pull/20615/files - new: https://git.openjdk.org/jdk/pull/20615/files/370f8760..daa9ebf3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20615&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20615&range=00-01 Stats: 7 lines in 1 file changed: 0 ins; 5 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/20615.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20615/head:pull/20615 PR: https://git.openjdk.org/jdk/pull/20615 From dlunden at openjdk.org Tue Aug 20 09:21:48 2024 From: dlunden at openjdk.org (Daniel Lundén) Date: Tue, 20 Aug 2024 09:21:48 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v2] In-Reply-To: <2wZdLgj3bxWUKOupkG2Wl6pImQorm-FSORgSnDqUHqo=.3e45f7d4-14eb-4d0c-ad4c-22dac3e5fcc1@github.com> References: <7O0X1427MXg_MrnMb9qcH64x3lG3feYt47iZhBujztk=.3fee1482-8bcf-42d9-bc4d-0e074f03222e@github.com> <2wZdLgj3bxWUKOupkG2Wl6pImQorm-FSORgSnDqUHqo=.3e45f7d4-14eb-4d0c-ad4c-22dac3e5fcc1@github.com> Message-ID: On Tue, 13 Aug 2024 19:04:26 GMT, Vladimir Kozlov wrote: >> No, I'll have a look and see if it makes sense to use growable arrays in this case. Thanks! > > Yes, it is good suggestion. Please look. I have investigated using a `GrowableArray` now, including a discussion with people involved in their development. Currently, there are some limitations with `GrowableArray` that have a negative performance impact for our use case: - There is no efficient copy/clone operation. - `GrowableArray` construction currently default-initializes everything within their capacity, which is unnecessary in our case (and probably unnecessary in general as well).
Yes, one option is to implement the above for `GrowableArray`. There is, however, already an ongoing discussion on how to best do this, and there are some pitfalls. I prefer to not start a more general and largely orthogonal discussion on `GrowableArray` changes here and suggest we update my current manual allocation to use a `GrowableArray` in a separate RFE, after `GrowableArray` has the required functionality. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r1722983790 From dlunden at openjdk.org Tue Aug 20 09:46:24 2024 From: dlunden at openjdk.org (Daniel Lundén) Date: Tue, 20 Aug 2024 09:46:24 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v3] In-Reply-To: References: Message-ID: > If a method has a large number of parameters, we currently bail out from C2 compilation. > > ### Changeset > > Allowing C2 compilation of methods with a large number of parameters requires fundamental changes to the register mask data structure, used in many places in C2. In particular, register masks currently have a statically determined size and cannot represent arbitrary numbers of stack slots. This is needed if we want to compile methods with arbitrary numbers of parameters. Register mask operations are present in performance-sensitive parts of C2, which further complicates changes. > > Changes: > - Add functionality to dynamically grow/extend register masks. I experimented with a number of design choices to achieve this. To keep the common case (normal number of method parameters) quick and also to avoid more intrusive changes to the current `RegMask` interface, I decided to leave the "base" statically allocated memory for masks unchanged and only use dynamically allocated memory in the rare cases where it is needed.
> - Generalize the "chunk"-logic from `PhaseChaitin::Select()` to allow arbitrary-sized chunks, and also move most of the logic into register mask methods to separate concerns and to make the `PhaseChaitin::Select()` code more readable. > - Remove all `can_represent` checks and bailouts. > - Performance tuning. A particularly important change is the early-exit optimization in `RegMask::overlap`, used in the performance-sensitive method `PhaseChaitin::interfere_with_live`. > - Add a new test case `TestManyMethodArguments.java` and extend an old test `TestNestedSynchronize.java`. > > ### Testing > > - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/10178060450) > - `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. > - Standard performance benchmarking. No observed conclusive overall performance degradation/improvement. > - Specific benchmarking of C2 compilation time. The changes increase C2 compilation time by, approximately and on average, 1% for methods that could also be compiled before this changeset (see the figure below). The reason for the degradation is further checks required in performance-sensitive code (in particular `PhaseChaitin::remove_bound_register_from_interfering_live_ranges`). I have tried optimizing in various ways, but changes I found that lead to improvement also lead to less readable code (and are, in my opinion, not worth it). > > ![c2-regression](https:/... 
Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: Add can_represent asserts ------------- Changes: - all: https://git.openjdk.org/jdk/pull/20404/files - new: https://git.openjdk.org/jdk/pull/20404/files/cbb2c251..4e7f4dbf Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20404&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20404&range=01-02 Stats: 18 lines in 3 files changed: 13 ins; 0 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/20404.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20404/head:pull/20404 PR: https://git.openjdk.org/jdk/pull/20404 From rcastanedalo at openjdk.org Tue Aug 20 09:46:24 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Tue, 20 Aug 2024 09:46:24 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v2] In-Reply-To: References: <7O0X1427MXg_MrnMb9qcH64x3lG3feYt47iZhBujztk=.3fee1482-8bcf-42d9-bc4d-0e074f03222e@github.com> <2wZdLgj3bxWUKOupkG2Wl6pImQorm-FSORgSnDqUHqo=.3e45f7d4-14eb-4d0c-ad4c-22dac3e5fcc1@github.com> Message-ID: <3sYWVqP5w-RV04QNkGkOHLwjxKu6MxXbKk9W8c4IMts=.bbd17391-cf3d-44cc-9d0f-17051aa94a83@github.com> On Tue, 20 Aug 2024 09:18:52 GMT, Daniel Lundén wrote: >> Yes, it is good suggestion. Please look. > > I have investigated using a `GrowableArray` now, including a discussion with people involved in their development. Currently, there are some limitations with `GrowableArray` that have a negative performance impact for our use case: > - There is no efficient copy/clone operation. > - `GrowableArray` construction currently default-initializes everything within their capacity, which is unnecessary in our case (and probably unnecessary in general as well). > > Yes, one option is to implement the above for `GrowableArray`. There is, however, already an ongoing discussion on how to best do this, and there are some pitfalls.
I prefer to not start a more general and largely orthogonal discussion on `GrowableArray` changes here and suggest we update my current manual allocation to use a `GrowableArray` in a separate RFE, after `GrowableArray` has the required functionality. Fair enough, thanks for investigating! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r1723001910 From rcastanedalo at openjdk.org Tue Aug 20 09:46:24 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Tue, 20 Aug 2024 09:46:24 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v2] In-Reply-To: <3sYWVqP5w-RV04QNkGkOHLwjxKu6MxXbKk9W8c4IMts=.bbd17391-cf3d-44cc-9d0f-17051aa94a83@github.com> References: <7O0X1427MXg_MrnMb9qcH64x3lG3feYt47iZhBujztk=.3fee1482-8bcf-42d9-bc4d-0e074f03222e@github.com> <2wZdLgj3bxWUKOupkG2Wl6pImQorm-FSORgSnDqUHqo=.3e45f7d4-14eb-4d0c-ad4c-22dac3e5fcc1@github.com> <3sYWVqP5w-RV04QNkGkOHLwjxKu6MxXbKk9W8c4IMts=.bbd17391-cf3d-44cc-9d0f-17051aa94a83@github.com> Message-ID: <7ZZ4D_qWsHXdlQ08dnnWtND-BuzKRc9HAUxN0W2_UzU=.f8204e66-52e2-43c5-9099-e295e419ec8f@github.com> On Tue, 20 Aug 2024 09:30:33 GMT, Roberto Castañeda Lozano wrote: >> I have investigated using a `GrowableArray` now, including a discussion with people involved in their development. Currently, there are some limitations with `GrowableArray` that have a negative performance impact for our use case: >> - There is no efficient copy/clone operation. >> - `GrowableArray` construction currently default-initializes everything within their capacity, which is unnecessary in our case (and probably unnecessary in general as well). >> >> Yes, one option is to implement the above for `GrowableArray`. There is, however, already an ongoing discussion on how to best do this, and there are some pitfalls.
I prefer to not start a more general and largely orthogonal discussion on `GrowableArray` changes here and suggest we update my current manual allocation to use a `GrowableArray` in a separate RFE, after `GrowableArray` has the required functionality. > > Fair enough, thanks for investigating! Maybe you can add a comment above the definition of `_RM_UP_EXT` motivating your choice, based on what you found. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r1723014204 From rcastanedalo at openjdk.org Tue Aug 20 09:46:24 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Tue, 20 Aug 2024 09:46:24 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v2] In-Reply-To: <7O0X1427MXg_MrnMb9qcH64x3lG3feYt47iZhBujztk=.3fee1482-8bcf-42d9-bc4d-0e074f03222e@github.com> References: <7O0X1427MXg_MrnMb9qcH64x3lG3feYt47iZhBujztk=.3fee1482-8bcf-42d9-bc4d-0e074f03222e@github.com> Message-ID: On Thu, 8 Aug 2024 09:29:17 GMT, Daniel Lundén wrote: >> If a method has a large number of parameters, we currently bail out from C2 compilation. >> >> ### Changeset >> >> Allowing C2 compilation of methods with a large number of parameters requires fundamental changes to the register mask data structure, used in many places in C2. In particular, register masks currently have a statically determined size and cannot represent arbitrary numbers of stack slots. This is needed if we want to compile methods with arbitrary numbers of parameters. Register mask operations are present in performance-sensitive parts of C2, which further complicates changes. >> >> Changes: >> - Add functionality to dynamically grow/extend register masks. I experimented with a number of design choices to achieve this.
To keep the common case (normal number of method parameters) quick and also to avoid more intrusive changes to the current `RegMask` interface, I decided to leave the "base" statically allocated memory for masks unchanged and only use dynamically allocated memory in the rare cases where it is needed. >> - Generalize the "chunk"-logic from `PhaseChaitin::Select()` to allow arbitrary-sized chunks, and also move most of the logic into register mask methods to separate concerns and to make the `PhaseChaitin::Select()` code more readable. >> - Remove all `can_represent` checks and bailouts. >> - Performance tuning. A particularly important change is the early-exit optimization in `RegMask::overlap`, used in the performance-sensitive method `PhaseChaitin::interfere_with_live`. >> - Add a new test case `TestManyMethodArguments.java` and extend an old test `TestNestedSynchronize.java`. >> >> ### Testing >> >> - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/10178060450) >> - `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. >> - Standard performance benchmarking. No observed conclusive overall performance degradation/improvement. >> - Specific benchmarking of C2 compilation time. The changes increase C2 compilation time by, approximately and on average, 1% for methods that could also be compiled before this changeset (see the figure below). The reason for the degradation is further checks required in performance-sensitive code (in particular `PhaseChaitin::remove_bound_register_from_interfering_live_ranges`). I have tried optimizing in various ways, but changes I found that lead to improvement also lead to less readable code (and are, in my opinion, no... 
> > Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: > > Remove leftover CHUNK_SIZE reference src/hotspot/share/opto/regmask.hpp line 120: > 118: // and between the two marks can still be 0. > 119: unsigned int _lwm; > 120: unsigned int _hwm; It seems `_hwm` does not take `_all_stack` into account, i.e. it is perfectly legal to have a register mask with `_hwm < _rm_max && _all_stack = true`. I think it would be worth noting it in a code comment. src/hotspot/share/opto/regmask.hpp line 195: > 193: // If the source is smaller than us, we need to set the gap according to > 194: // the sources all_stack flag. > 195: if (src._rm_size < _rm_size ) { Suggestion: if (src._rm_size < _rm_size) { ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r1723019002 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r1722958258 From rcastanedalo at openjdk.org Tue Aug 20 13:18:52 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Tue, 20 Aug 2024 13:18:52 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v3] In-Reply-To: References: Message-ID: On Tue, 20 Aug 2024 09:46:24 GMT, Daniel Lundén wrote: >> If a method has a large number of parameters, we currently bail out from C2 compilation. >> >> ### Changeset >> >> Allowing C2 compilation of methods with a large number of parameters requires fundamental changes to the register mask data structure, used in many places in C2. In particular, register masks currently have a statically determined size and cannot represent arbitrary numbers of stack slots. This is needed if we want to compile methods with arbitrary numbers of parameters. Register mask operations are present in performance-sensitive parts of C2, which further complicates changes. >> >> Changes: >> - Add functionality to dynamically grow/extend register masks.
I experimented with a number of design choices to achieve this. To keep the common case (normal number of method parameters) quick and also to avoid more intrusive changes to the current `RegMask` interface, I decided to leave the "base" statically allocated memory for masks unchanged and only use dynamically allocated memory in the rare cases where it is needed. >> - Generalize the "chunk"-logic from `PhaseChaitin::Select()` to allow arbitrary-sized chunks, and also move most of the logic into register mask methods to separate concerns and to make the `PhaseChaitin::Select()` code more readable. >> - Remove all `can_represent` checks and bailouts. >> - Performance tuning. A particularly important change is the early-exit optimization in `RegMask::overlap`, used in the performance-sensitive method `PhaseChaitin::interfere_with_live`. >> - Add a new test case `TestManyMethodArguments.java` and extend an old test `TestNestedSynchronize.java`. >> >> ### Testing >> >> - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/10178060450) >> - `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. >> - Standard performance benchmarking. No observed conclusive overall performance degradation/improvement. >> - Specific benchmarking of C2 compilation time. The changes increase C2 compilation time by, approximately and on average, 1% for methods that could also be compiled before this changeset (see the figure below). The reason for the degradation is further checks required in performance-sensitive code (in particular `PhaseChaitin::remove_bound_register_from_interfering_live_ranges`). I have tried optimizing in various ways, but changes I found that lead to improvement also lead to less readable code (and are, in my opinion, no... 
> > Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: > > Add can_represent asserts src/hotspot/share/opto/regmask.hpp line 362: > 360: // Verify watermarks are sane, i.e., within bounds and that no > 361: // register words below or above the watermarks have bits set. > 362: bool valid_watermarks() const { For sanity, should we also assert here, and enforce across the code, that `_lwm <= _hwm`, even in the case of `Size() == 0`, or is this invariant too expensive/difficult to maintain? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r1723297427 From dlunden at openjdk.org Tue Aug 20 14:05:53 2024 From: dlunden at openjdk.org (Daniel Lundén) Date: Tue, 20 Aug 2024 14:05:53 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v2] In-Reply-To: References: <7O0X1427MXg_MrnMb9qcH64x3lG3feYt47iZhBujztk=.3fee1482-8bcf-42d9-bc4d-0e074f03222e@github.com> Message-ID: On Tue, 20 Aug 2024 07:20:37 GMT, Roberto Castañeda Lozano wrote: >> Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove leftover CHUNK_SIZE reference > > src/hotspot/share/opto/regmask.hpp line 144: > >> 142: >> 143: // Return a suitable arena for (extended) register mask allocation. >> 144: static Arena* _get_arena(); > > Maybe inline this function into its only user? I would like to, but that results in a cyclic header inclusion problem between `regmask.hpp` and `compile.hpp`. It could maybe be solved, but it doesn't look trivial. Putting the definition in `regmask.cpp` seems like the simplest solution, but I'm open to suggestions.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r1723377552 From bkilambi at openjdk.org Tue Aug 20 14:26:50 2024 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Tue, 20 Aug 2024 14:26:50 GMT Subject: RFR: 8338021: Support saturating vector operators in VectorAPI [v4] In-Reply-To: References: Message-ID: <_AcGpxU2tXImdvN3I65WLEFa5bDnLe6sdlHACNyxRUI=.7a79c223-254e-474a-bc40-fa861b9c1520@github.com> On Mon, 19 Aug 2024 07:19:30 GMT, Jatin Bhateja wrote: >> Hi All, >> >> As per the discussion on the panama-dev mailing list[1], the patch adds support for the following new vector operators. >> >> >> . SUADD : Saturating unsigned addition. >> . SADD : Saturating signed addition. >> . SUSUB : Saturating unsigned subtraction. >> . SSUB : Saturating signed subtraction. >> . UMAX : Unsigned max. >> . UMIN : Unsigned min. >> >> >> The new vector operators are applicable only to integral types, since their values wrap around in overflowing/underflowing scenarios after setting appropriate status flags. For floating-point types, as per the IEEE 754 spec, there are multiple schemes to handle underflow; one of them is gradual underflow, which transitions the value to the subnormal range. Similarly, overflow implicitly saturates the floating-point value to an infinite value. >> >> As the name suggests, these are saturating operations, i.e., the result of the computation is strictly capped by the lower and upper bounds of the result type and is not wrapped around in underflowing or overflowing scenarios. >> >> Summary of changes: >> - Java side implementation of new vector operators. >> - Add new scalar saturating APIs for each of the above saturating vector operators in the corresponding primitive box classes; the fallback implementation of the vector operators is based on them. >> - C2 compiler IR and inline expander changes. >> - Optimized x86 backend implementation for new vector operators and their predicated counterparts.
>> - Extend the existing VectorAPI jtreg test suite to cover the new operations. >> >> Kindly review and share your feedback. >> >> Best Regards, >> PS: Intrinsification and auto-vectorization of the new core-lib APIs will be addressed separately in a follow-up patch. >> >> [1] https://mail.openjdk.org/pipermail/panama-dev/2024-May/020408.html > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Review comments resolutions. src/hotspot/share/opto/addnode.hpp line 404: > 402: //------------------------------UMaxINode--------------------------------------- > 403: // Maximum of 2 unsigned integers. > 404: class UMaxINode : public Node { Would it be better to define `max_opcode()` and `min_opcode()` for `UMaxINode` and `UMinINode`? These are used to find commutative patterns in `AddNode::Ideal()` and `MulNode::Ideal()` and optimize them. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1723414044 From dlunden at openjdk.org Tue Aug 20 14:32:25 2024 From: dlunden at openjdk.org (Daniel Lundén) Date: Tue, 20 Aug 2024 14:32:25 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v4] In-Reply-To: References: Message-ID: > If a method has a large number of parameters, we currently bail out from C2 compilation. > > ### Changeset > > Allowing C2 compilation of methods with a large number of parameters requires fundamental changes to the register mask data structure, used in many places in C2. In particular, register masks currently have a statically determined size and cannot represent arbitrary numbers of stack slots. This is needed if we want to compile methods with arbitrary numbers of parameters. Register mask operations are present in performance-sensitive parts of C2, which further complicates changes. > > Changes: > - Add functionality to dynamically grow/extend register masks. I experimented with a number of design choices to achieve this.
To keep the common case (normal number of method parameters) quick and also to avoid more intrusive changes to the current `RegMask` interface, I decided to leave the "base" statically allocated memory for masks unchanged and only use dynamically allocated memory in the rare cases where it is needed. > - Generalize the "chunk"-logic from `PhaseChaitin::Select()` to allow arbitrary-sized chunks, and also move most of the logic into register mask methods to separate concerns and to make the `PhaseChaitin::Select()` code more readable. > - Remove all `can_represent` checks and bailouts. > - Performance tuning. A particularly important change is the early-exit optimization in `RegMask::overlap`, used in the performance-sensitive method `PhaseChaitin::interfere_with_live`. > - Add a new test case `TestManyMethodArguments.java` and extend an old test `TestNestedSynchronize.java`. > > ### Testing > > - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/10178060450) > - `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. > - Standard performance benchmarking. No observed conclusive overall performance degradation/improvement. > - Specific benchmarking of C2 compilation time. The changes increase C2 compilation time by, approximately and on average, 1% for methods that could also be compiled before this changeset (see the figure below). The reason for the degradation is further checks required in performance-sensitive code (in particular `PhaseChaitin::remove_bound_register_from_interfering_live_ranges`). I have tried optimizing in various ways, but changes I found that lead to improvement also lead to less readable code (and are, in my opinion, not worth it). > > ![c2-regression](https:/... 
Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: Update after Roberto's comments and suggestions ------------- Changes: - all: https://git.openjdk.org/jdk/pull/20404/files - new: https://git.openjdk.org/jdk/pull/20404/files/4e7f4dbf..95396668 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20404&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20404&range=02-03 Stats: 56 lines in 2 files changed: 34 ins; 5 del; 17 mod Patch: https://git.openjdk.org/jdk/pull/20404.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20404/head:pull/20404 PR: https://git.openjdk.org/jdk/pull/20404 From dlunden at openjdk.org Tue Aug 20 14:32:26 2024 From: dlunden at openjdk.org (Daniel Lundén) Date: Tue, 20 Aug 2024 14:32:26 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v2] In-Reply-To: References: <7O0X1427MXg_MrnMb9qcH64x3lG3feYt47iZhBujztk=.3fee1482-8bcf-42d9-bc4d-0e074f03222e@github.com> Message-ID: On Mon, 19 Aug 2024 13:29:18 GMT, Roberto Castañeda Lozano wrote: >> Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove leftover CHUNK_SIZE reference > > src/hotspot/share/opto/regmask.hpp line 129: > >> 127: } else { >> 128: assert(_RM_UP_EXT != nullptr, "sanity"); >> 129: assert( i >= _RM_SIZE, "sanity"); > > This assertion is trivial (in my opinion) and can be removed. Indeed, it's a leftover from a previous iteration. Thanks. > src/hotspot/share/opto/regmask.hpp line 208: > >> 206: >> 207: // Set a range of words in the register mask to a given value. >> 208: void _set_range(unsigned int start, int value, unsigned int range) { > > Suggestion: `length`, `words`, or similar would be a more idiomatic name for the third parameter (`range`), in my opinion. Thanks, updated. > src/hotspot/share/opto/regmask.hpp line 278: > >> 276: // itself as the _all_stack flag.
We need to record this fact using the now >> 277: // separate _all_stack flag. >> 278: set_AllStack(_RM_UP[_RM_MAX] & (uintptr_t(1) << _WordBitMask)); > > As we discussed offline, I suggest simplifying this by updating the ADLC code as well to set `_all_stack` explicitly. Here is a patch that accomplishes that: https://github.com/robcasloz/jdk/commit/835628893ed5ea838c16afa49adbd57e4152a5dd, feel free to merge if you agree. The patch passes tier1-3 for all Oracle-supported platforms. Yes, nice cleanup. Now included! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r1723421527 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r1723423140 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r1723422754 From dlunden at openjdk.org Tue Aug 20 14:27:52 2024 From: dlunden at openjdk.org (Daniel Lundén) Date: Tue, 20 Aug 2024 14:27:52 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v3] In-Reply-To: References: Message-ID: On Tue, 20 Aug 2024 13:16:32 GMT, Roberto Castañeda Lozano wrote: >> Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: >> >> Add can_represent asserts > > src/hotspot/share/opto/regmask.hpp line 362: > >> 360: // Verify watermarks are sane, i.e., within bounds and that no >> 361: // register words below or above the watermarks have bits set. >> 362: bool valid_watermarks() const { > > For sanity, should we also assert here, and enforce across the code, that `_lwm <= _hwm`, even in the case of `Size() == 0`, or is this invariant too expensive/difficult to maintain? We could do that, but I'm not sure I see the benefit. Could you elaborate a bit?
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r1723415718 From dlunden at openjdk.org Tue Aug 20 14:34:52 2024 From: dlunden at openjdk.org (Daniel Lundén) Date: Tue, 20 Aug 2024 14:34:52 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v4] In-Reply-To: References: Message-ID: On Tue, 20 Aug 2024 14:32:25 GMT, Daniel Lundén wrote: >> If a method has a large number of parameters, we currently bail out from C2 compilation. >> >> ### Changeset >> >> Allowing C2 compilation of methods with a large number of parameters requires fundamental changes to the register mask data structure, used in many places in C2. In particular, register masks currently have a statically determined size and cannot represent arbitrary numbers of stack slots. This is needed if we want to compile methods with arbitrary numbers of parameters. Register mask operations are present in performance-sensitive parts of C2, which further complicates changes. >> >> Changes: >> - Add functionality to dynamically grow/extend register masks. I experimented with a number of design choices to achieve this. To keep the common case (normal number of method parameters) quick and also to avoid more intrusive changes to the current `RegMask` interface, I decided to leave the "base" statically allocated memory for masks unchanged and only use dynamically allocated memory in the rare cases where it is needed. >> - Generalize the "chunk"-logic from `PhaseChaitin::Select()` to allow arbitrary-sized chunks, and also move most of the logic into register mask methods to separate concerns and to make the `PhaseChaitin::Select()` code more readable. >> - Remove all `can_represent` checks and bailouts. >> - Performance tuning. A particularly important change is the early-exit optimization in `RegMask::overlap`, used in the performance-sensitive method `PhaseChaitin::interfere_with_live`.
>> - Add a new test case `TestManyMethodArguments.java` and extend an old test `TestNestedSynchronize.java`. >> >> ### Testing >> >> - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/10178060450) >> - `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. >> - Standard performance benchmarking. No observed conclusive overall performance degradation/improvement. >> - Specific benchmarking of C2 compilation time. The changes increase C2 compilation time by, approximately and on average, 1% for methods that could also be compiled before this changeset (see the figure below). The reason for the degradation is further checks required in performance-sensitive code (in particular `PhaseChaitin::remove_bound_register_from_interfering_live_ranges`). I have tried optimizing in various ways, but changes I found that lead to improvement also lead to less readable code (and are, in my opinion, no... > > Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: > > Update after Roberto's comments and suggestions Thanks to @robcasloz for the comments and contributions! ------------- PR Comment: https://git.openjdk.org/jdk/pull/20404#issuecomment-2299012443 From dlunden at openjdk.org Tue Aug 20 14:34:53 2024 From: dlunden at openjdk.org (Daniel Lundén) Date: Tue, 20 Aug 2024 14:34:53 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v2] In-Reply-To: References: <7O0X1427MXg_MrnMb9qcH64x3lG3feYt47iZhBujztk=.3fee1482-8bcf-42d9-bc4d-0e074f03222e@github.com> Message-ID: On Tue, 20 Aug 2024 09:41:14 GMT, Roberto Castañeda Lozano wrote: >> Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove leftover CHUNK_SIZE reference > > src/hotspot/share/opto/regmask.hpp line 120: > >> 118: // and between the two marks can still be 0.
>> 119: unsigned int _lwm; >> 120: unsigned int _hwm; > > It seems `_hwm` does not take `_all_stack` into account, i.e. it is perfectly legal to have a register mask with `_hwm < _rm_max && _all_stack = true`. I think it would be worth noting it in a code comment. Now incorporated as part of the nice diagram that you contributed! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r1723426887 From bkilambi at openjdk.org Tue Aug 20 14:55:52 2024 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Tue, 20 Aug 2024 14:55:52 GMT Subject: RFR: 8338021: Support saturating vector operators in VectorAPI [v4] In-Reply-To: References: Message-ID: On Mon, 19 Aug 2024 07:19:30 GMT, Jatin Bhateja wrote: >> Hi All, >> >> As per the discussion on the panama-dev mailing list[1], the patch adds support for the following new vector operators. >> >> >> . SUADD : Saturating unsigned addition. >> . SADD : Saturating signed addition. >> . SUSUB : Saturating unsigned subtraction. >> . SSUB : Saturating signed subtraction. >> . UMAX : Unsigned max. >> . UMIN : Unsigned min. >> >> >> The new vector operators are applicable only to integral types, since their values wrap around in overflowing/underflowing scenarios after setting appropriate status flags. For floating-point types, as per the IEEE 754 spec, there are multiple schemes to handle underflow; one of them is gradual underflow, which transitions the value to the subnormal range. Similarly, overflow implicitly saturates the floating-point value to an infinite value. >> >> As the name suggests, these are saturating operations, i.e., the result of the computation is strictly capped by the lower and upper bounds of the result type and is not wrapped around in underflowing or overflowing scenarios. >> >> Summary of changes: >> - Java side implementation of new vector operators.
>> - Add new scalar saturating APIs for each of the above saturating vector operators in the corresponding primitive box classes; the fallback implementation of the vector operators is based on them. >> - C2 compiler IR and inline expander changes. >> - Optimized x86 backend implementation for new vector operators and their predicated counterparts. >> - Extend the existing VectorAPI jtreg test suite to cover the new operations. >> >> Kindly review and share your feedback. >> >> Best Regards, >> PS: Intrinsification and auto-vectorization of the new core-lib APIs will be addressed separately in a follow-up patch. >> >> [1] https://mail.openjdk.org/pipermail/panama-dev/2024-May/020408.html > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Review comments resolutions. src/hotspot/share/opto/vectornode.hpp line 150: > 148: class SaturatingVectorNode : public VectorNode { > 149: private: > 150: bool _is_unsigned; Would it be better to make it a `const bool`?
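The semantics in the quoted description (a lane result is clamped to the bounds of its type instead of wrapping) can be shown with a small scalar C++ sketch. Treat it as an illustration only: it assumes 8-bit lanes, the helper names are hypothetical, and it does not reproduce the Java Vector API or the C2 code in this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical scalar reference helpers for 8-bit lanes. They only illustrate
// the saturating/unsigned semantics described above; not Vector API or C2 code.

// SADD: signed add, clamped to [INT8_MIN, INT8_MAX] instead of wrapping.
int8_t sadd8(int8_t a, int8_t b) {
  int wide = int(a) + int(b);  // computed in int, so it cannot overflow
  return static_cast<int8_t>(std::max(int(INT8_MIN), std::min(wide, int(INT8_MAX))));
}

// SSUB: signed subtract, clamped to the same bounds.
int8_t ssub8(int8_t a, int8_t b) {
  int wide = int(a) - int(b);
  return static_cast<int8_t>(std::max(int(INT8_MIN), std::min(wide, int(INT8_MAX))));
}

// SUADD: unsigned add, clamped to UINT8_MAX.
uint8_t suadd8(uint8_t a, uint8_t b) {
  unsigned wide = unsigned(a) + unsigned(b);
  return static_cast<uint8_t>(std::min(wide, unsigned(UINT8_MAX)));
}

// SUSUB: unsigned subtract, clamped at 0.
uint8_t susub8(uint8_t a, uint8_t b) {
  return a > b ? static_cast<uint8_t>(a - b) : uint8_t(0);
}

// UMAX / UMIN: compare the same bit patterns as unsigned values, so the
// all-ones pattern (-1 as a signed byte) is the largest value.
int8_t umax8(int8_t a, int8_t b) {
  return static_cast<int8_t>(std::max(static_cast<uint8_t>(a), static_cast<uint8_t>(b)));
}

int8_t umin8(int8_t a, int8_t b) {
  return static_cast<int8_t>(std::min(static_cast<uint8_t>(a), static_cast<uint8_t>(b)));
}
```

With these helpers, `sadd8(100, 100)` returns 127 rather than the wrapped value -56. `umax8(-1, 1)` returns -1, since 0xFF is the largest unsigned byte.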
src/hotspot/share/opto/vectornode.hpp line 172: > 170: class SaturatingAddVBNode : public SaturatingVectorNode { > 171: public: > 172: SaturatingAddVBNode(Node* in1, Node* in2, const TypeVect* vt, bool is_unsigned) : SaturatingVectorNode(in1,in2,vt,is_unsigned) {} Style: spaces after the commas in `SaturatingVectorNode(in1,in2,vt,is_unsigned)` ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1723459735 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1723463554 From chagedorn at openjdk.org Tue Aug 20 15:49:58 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 20 Aug 2024 15:49:58 GMT Subject: Integrated: 8336729: C2: Div/Mod nodes without zero check could be split through iv phi of outer loop of long counted loop nest resulting in SIGFPE In-Reply-To: References: Message-ID: On Thu, 15 Aug 2024 11:11:54 GMT, Christian Hagedorn wrote: > The previous fix for preventing `Div/Mod` nodes from being split through iv phis when their divisor could become zero (see [JDK-8299259](https://bugs.openjdk.org/browse/JDK-8299259)) is not complete. Originally, we thought that it is enough to prevent a split through the phi if it belongs to a `BaseCountedLoop`: > https://github.com/openjdk/jdk/blob/da7311bbe37c2b9632b117d52a77c659047820b7/src/hotspot/share/opto/loopopts.cpp#L302-L304 > > The reasoning behind this was that the iv phi type of `BaseCountedLoops` can be improved in such a way that it does not contain zero, but the backedge input can include zero: > https://github.com/openjdk/jdk/blob/da7311bbe37c2b9632b117d52a77c659047820b7/test/hotspot/jtreg/compiler/splitif/TestSplitDivisionThroughPhi.java#L69-L78 > > This optimization is not possible for `LoopNodes`, where we know nothing about the number of iterations.
> > However, there was an oversight: A `LongCountedLoop` is later split into an inner and an outer `LoopNode` where the inner loop is transformed into a `CountedLoop` while the outer loop stays a `LoopNode`. Both loops share the same optimized iv phi type as the original `LongCountedLoop`! Therefore, we can have the same situation as fixed in JDK-8299259 but with a `LoopNode` instead of a `CountedLoopNode` (see https://github.com/openjdk/jdk/pull/11900 for more details). > > The simplest way to fix this is to extend the bailout fix of JDK-8299259 to not only apply to `BaseCountedLoopNodes` but to `LoopNodes` in general, which is what I propose with this patch. > > Thanks to @eme64 for extracting the original simpler reproducer, which reproduces this problem more easily. I've added an even simpler non-jasm reproducer in this patch in favor of the original jasm reproducer. > > Thanks, > Christian This pull request has now been integrated. Changeset: 55a97ec8 Author: Christian Hagedorn URL: https://git.openjdk.org/jdk/commit/55a97ec8793242c0cacbafd3a4fead25cdce2934 Stats: 52 lines in 3 files changed: 45 ins; 0 del; 7 mod 8336729: C2: Div/Mod nodes without zero check could be split through iv phi of outer loop of long counted loop nest resulting in SIGFPE Co-authored-by: Emanuel Peter Reviewed-by: epeter, kvn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/20594 From kvn at openjdk.org Tue Aug 20 15:56:51 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 20 Aug 2024 15:56:51 GMT Subject: RFR: 8333098: ubsan: bytecodeInfo.cpp:318:59: runtime error: division by zero [v2] In-Reply-To: References: Message-ID: On Tue, 20 Aug 2024 07:30:23 GMT, Matthias Baesken wrote: >> When running test >> compiler/classUnloading/methodUnloading/TestOverloadCompileQueues.java >> with ubsan-enabled binaries we run into the issue reported below.
>> Reason seems to be that we divide by zero in the code in some special cases (we should instead check for `CompilationPolicy::min_invocations() == 0 `and handle it separately). >> >> /jdk/src/hotspot/share/opto/bytecodeInfo.cpp:318:59: runtime error: division by zero >> #0 0x7f5145c0dda2 in InlineTree::should_not_inline(ciMethod*, ciMethod*, int, bool&, ciCallProfile&) src/hotspot/share/opto/bytecodeInfo.cpp:318 >> #1 0x7f51466366d7 in InlineTree::try_to_inline(ciMethod*, ciMethod*, int, JVMState*, ciCallProfile&, bool&) src/hotspot/share/opto/bytecodeInfo.cpp:382 >> #2 0x7f514663d36b in InlineTree::ok_to_inline(ciMethod*, JVMState*, ciCallProfile&, bool&) src/hotspot/share/opto/bytecodeInfo.cpp:596 >> #3 0x7f51470dffd6 in Compile::call_generator(ciMethod*, int, bool, JVMState*, bool, float, ciKlass*, bool) src/hotspot/share/opto/doCall.cpp:189 >> #4 0x7f51470e18ab in Parse::do_call() src/hotspot/share/opto/doCall.cpp:641 >> #5 0x7f514887dbf1 in Parse::do_one_block() src/hotspot/share/opto/parse1.cpp:1607 >> #6 0x7f514887fefa in Parse::do_all_blocks() src/hotspot/share/opto/parse1.cpp:724 >> #7 0x7f514888d4da in Parse::Parse(JVMState*, ciMethod*, float) src/hotspot/share/opto/parse1.cpp:628 >> #8 0x7f51469d8418 in ParseGenerator::generate(JVMState*) src/hotspot/share/opto/callGenerator.cpp:99 >> #9 0x7f5146d99cff in Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*) src/hotspot/share/opto/compile.cpp:793 >> #10 0x7f51469d5ebf in C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*) src/hotspot/share/opto/c2compiler.cpp:142 >> #11 0x7f5146db0274 in CompileBroker::invoke_compiler_on_method(CompileTask*) src/hotspot/share/compiler/compileBroker.cpp:2303 >> #12 0x7f5146db2826 in CompileBroker::compiler_thread_loop() src/hotspot/share/compiler/compileBroker.cpp:1961 >> #13 0x7f51478d475a in JavaThread::thread_main_inner() src/hotspot/share/runtime/javaThread.cpp:759 >> #14 0x7f51491620ea in Thread::call_run() 
src/hotspot/share/runtime/thread.cpp:225 >> #15 0x7f51487ac201 in thread_native_entry src/hotspot/os/linux/os_linux.cpp:846 >> #16 0x7f514e5cf6e9 in start_thread (/lib64/libpthread.so.0+0xa6e9) (BuildId: 2f8d3c2d0f4d7888c2598d2ff6356537f5708a73) >> #17 0x7f514db1550e in clone (/lib64/libc.so.6+0x11850e) (BuildId: ... > > Matthias Baesken has updated the pull request incrementally with one additional commit since the last revision: > > add suggestion from Dean Long I think we need a comment here explaining how min_freq is calculated and its edge cases. ------------- PR Review: https://git.openjdk.org/jdk/pull/20615#pullrequestreview-2248411559 From kvn at openjdk.org Tue Aug 20 15:56:52 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 20 Aug 2024 15:56:52 GMT Subject: RFR: 8333098: ubsan: bytecodeInfo.cpp:318:59: runtime error: division by zero [v2] In-Reply-To: References: Message-ID: On Mon, 19 Aug 2024 07:30:21 GMT, Matthias Baesken wrote: > > First, I think Tier4MinInvocationThreshold and Tier3MinInvocationThreshold should be double flags. > > I see that they are used only in double type expressions (usually with scaling value which is double). Maybe another RFE. > > I can open a new JBS issue 'Make Tier4MinInvocationThreshold and Tier3MinInvocationThreshold double flags', is that okay? @veresov, do you agree to this suggestion? Or should we restrict the values of these flags to disallow 0? ------------- PR Comment: https://git.openjdk.org/jdk/pull/20615#issuecomment-2299189059 From iveresov at openjdk.org Tue Aug 20 16:04:51 2024 From: iveresov at openjdk.org (Igor Veresov) Date: Tue, 20 Aug 2024 16:04:51 GMT Subject: RFR: 8333098: ubsan: bytecodeInfo.cpp:318:59: runtime error: division by zero [v2] In-Reply-To: References: Message-ID: On Tue, 20 Aug 2024 15:52:26 GMT, Vladimir Kozlov wrote: > > > First, I think Tier4MinInvocationThreshold and Tier3MinInvocationThreshold should be double flags.
> > > I see that they are used only in double type expressions (usually with scaling value which is double). Maybe another RFE. > > > > > > I can open a new JBS issue 'Make Tier4MinInvocationThreshold and Tier3MinInvocationThreshold double flags', is that okay? > > > > @veresov, do you agree to this suggestion? Or should we restrict the values of these flags to disallow 0? I don't quite see a real need to do that, but we can. Should we convert all threshold flags to double then? About 0: it's actually a valid threshold. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20615#issuecomment-2299209889 From rcastanedalo at openjdk.org Tue Aug 20 16:07:53 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Tue, 20 Aug 2024 16:07:53 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v2] In-Reply-To: References: <7O0X1427MXg_MrnMb9qcH64x3lG3feYt47iZhBujztk=.3fee1482-8bcf-42d9-bc4d-0e074f03222e@github.com> Message-ID: On Tue, 20 Aug 2024 14:03:37 GMT, Daniel Lundén wrote: >> src/hotspot/share/opto/regmask.hpp line 144: >> >>> 142: >>> 143: // Return a suitable arena for (extended) register mask allocation. >>> 144: static Arena* _get_arena(); >> >> Maybe inline this function into its only user? > > I would like to, but that results in a cyclic header inclusion problem between `regmask.hpp` and `compile.hpp`. It could maybe be solved, but it doesn't look trivial. Putting the definition in `regmask.cpp` seems like the simplest solution, but I'm open to suggestions. I see, fair enough, thanks.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r1723578565 From kvn at openjdk.org Tue Aug 20 16:16:52 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 20 Aug 2024 16:16:52 GMT Subject: RFR: 8333098: ubsan: bytecodeInfo.cpp:318:59: runtime error: division by zero [v2] In-Reply-To: References: Message-ID: On Tue, 20 Aug 2024 16:02:20 GMT, Igor Veresov wrote: > I don't quite see a real need to do that, but we can. Should we convert all threshold flags to double then? Yes, for all threshold flags. We always use scaling for them. With double flags, the results of related expressions and conditions will be more predictable. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20615#issuecomment-2299237900 From rcastanedalo at openjdk.org Tue Aug 20 16:43:31 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Tue, 20 Aug 2024 16:43:31 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v3] In-Reply-To: References: Message-ID: On Tue, 20 Aug 2024 14:25:22 GMT, Daniel Lundén wrote: >> src/hotspot/share/opto/regmask.hpp line 362: >> >>> 360: // Verify watermarks are sane, i.e., within bounds and that no >>> 361: // register words below or above the watermarks have bits set. >>> 362: bool valid_watermarks() const { >> >> For sanity, should we also assert here, and enforce across the code, that `_lwm <= _hwm`, even in the case of `Size() == 0`, or is this invariant too expensive/difficult to maintain? > > We could do that, but I'm not sure I see the benefit. Could you elaborate a bit? The benefit, in my opinion, is better comprehensibility due to a simpler model with stronger invariants. An alternative would be to extend the comment at the definition of `_lwm` and `_hwm` clarifying that the value of these variables is unspecified when the register set is empty.
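Roberto's suggested invariant can be illustrated with a toy model. The sketch below is hypothetical and heavily simplified. It assumes a fixed array of four 64-bit words, uses invented names such as `ToyRegMask`, and adopts the convention that an empty mask keeps `_lwm == _hwm == 0`. It is not HotSpot's `RegMask`. It only shows how a `valid_watermarks()`-style check could also enforce `_lwm <= _hwm`, including for an empty mask.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical toy model of a word-based register mask with watermarks.
// NOT HotSpot's RegMask: sizes, names and the empty-mask convention
// (_lwm == _hwm == 0) are assumptions made for illustration only.
class ToyRegMask {
 public:
  static const unsigned kWords = 4;  // 4 x 64 = 256 bits

  ToyRegMask() : _lwm(0), _hwm(0) { clear(); }

  // Zero every word and collapse the watermarks, keeping _lwm <= _hwm.
  void clear() {
    for (unsigned i = 0; i < kWords; i++) _words[i] = 0;
    _lwm = 0;
    _hwm = 0;
  }

  // Only words in [_lwm, _hwm] can be non-zero, so only they are scanned.
  bool is_empty() const {
    for (unsigned i = _lwm; i <= _hwm; i++) {
      if (_words[i] != 0) return false;
    }
    return true;
  }

  // Set bit `reg` and widen the watermarks to cover its word.
  void insert(unsigned reg) {
    unsigned w = reg / 64;
    assert(w < kWords && "register out of range");
    bool was_empty = is_empty();
    _words[w] |= uint64_t(1) << (reg % 64);
    if (was_empty) {
      _lwm = w;
      _hwm = w;
    } else {
      if (w < _lwm) _lwm = w;
      if (w > _hwm) _hwm = w;
    }
  }

  // Watermarks are in bounds, ordered, and no bits are set outside them.
  bool valid_watermarks() const {
    if (_lwm > _hwm || _hwm >= kWords) return false;
    for (unsigned i = 0; i < _lwm; i++) {
      if (_words[i] != 0) return false;
    }
    for (unsigned i = _hwm + 1; i < kWords; i++) {
      if (_words[i] != 0) return false;
    }
    return true;
  }

 private:
  uint64_t _words[kWords];
  unsigned _lwm;
  unsigned _hwm;
};
```

The trade-off from the discussion shows up even in this toy. Every operation that can empty the mask (here only `clear()`) must restore the invariant. In exchange, `valid_watermarks()` can check one stronger condition unconditionally.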
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r1723624136 From sviswanathan at openjdk.org Tue Aug 20 18:27:04 2024 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Tue, 20 Aug 2024 18:27:04 GMT Subject: RFR: 8337632: AES-GCM Algorithm optimization for x86_64 In-Reply-To: References: Message-ID: On Mon, 22 Jan 2024 09:38:25 GMT, Smita Kamath wrote: > Hi, > I want to submit an AES-GCM algorithm optimization. This implementation is using AVX512/VAES Instructions. Additionally, it reduces PARALLEL_LEN from 7680 to 512 bytes. The performance numbers are as below. Kindly review the code. Thank you. > > Benchmark | Datasize | BaseJDK (ops/s) | Patch(ops/s) | %Gain > -- | -- | -- | -- | -- > full.AESGCMBench.decrypt | 512 | 2928259.197 | 3269964.387 | 11.67 > full.AESGCMBench.decrypt | 1024 | 2494254.611 | 3010987.731 | 20.72 > full.AESGCMBench.decrypt | 1500 | 1883453.546 | 1934915.846 | 2.73 > full.AESGCMBench.decrypt | 2048 | 1825780.711 | 2452861.368 | 34.34 > full.AESGCMBench.decrypt | 4096 | 1275108.345 | 1806329.066 | 41.66 > full.AESGCMBench.decrypt | 8192 | 1033936.634 | 1196836.052 | 15.75 > full.AESGCMBench.decrypt | 16384 | 681494.768 | 711630.498 | 4.42 > full.AESGCMBench.decrypt | 32768 | 385026.017 | 395043.193 | 2.6 > full.AESGCMBench.decrypt | 65536 | 207373.924 | 214723.588 | 3.54
> full.AESGCMBench.encrypt | 512 | 2658008.476 | 2882496.94 | 8.45 > full.AESGCMBench.encrypt | 1024 | 2283709.63 | 2589534.403 | 13.39 > full.AESGCMBench.encrypt | 1500 | 1794993.519 | 1817669.531 | 1.26 > full.AESGCMBench.encrypt | 2048 | 1745532.435 | 2191097.29 | 25.52 > full.AESGCMBench.encrypt | 4096 | 1203301.174 | 1649593.953 | 37.08 > full.AESGCMBench.encrypt | 8192 | 985174.988 | 1132407.54 | 14.94 > full.AESGCMBench.encrypt | 16384 | 658980.441 | 684765.771 | 3.91 > full.AESGCMBench.encrypt | 32768 | 373543.798 | 391518.837 | 4.81 > full.AESGCMBench.encrypt | 65536 | 202532.315 | 205084.833 | 1.260301597 src/hotspot/cpu/x86/assembler_x86.cpp line 8978: > 8976: void Assembler::vinserti64x2(XMMRegister dst, XMMRegister nds, XMMRegister src, uint8_t imm8) { > 8977: assert(VM_Version::supports_avx512dq(), ""); > 8978: InstructionAttr attributes(AVX_512bit, /* vex_w */ true, /* legacy_mode */ false, /* no_mask_reg */ true, /* uses_vl */ true); This instruction supports both 256-bit and 512-bit vector lengths. You could take vector_len as input and use that instead of the fixed AVX_512bit here. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1722512827 From dhanalla at openjdk.org Tue Aug 20 23:55:03 2024 From: dhanalla at openjdk.org (Dhamoder Nalla) Date: Tue, 20 Aug 2024 23:55:03 GMT Subject: RFR: 8315916: assert(C->live_nodes() <= C->max_node_limit()) failed: Live Node limit exceeded In-Reply-To: References: Message-ID: On Fri, 16 Aug 2024 21:28:57 GMT, Dhamoder Nalla wrote: >> Hi @dhanalla, this is not the right way to handle this assertion failure. The assertion is here to catch real issues when creating too many nodes due to a bug in the code. For example, in [JDK-8256934](https://bugs.openjdk.org/browse/JDK-8256934), we hit this assert due to an inefficient cloning algorithm in Partial Peeling. We should not remove the assert.
>> >> For such bugs, you first need to investigate why we hit the node limit with your reproducer. Once you find the problem, it can usually be put into one of the following categories: >> 1. We have a real bug and by fixing it, we no longer create this many nodes. >> 2. It is a false-positive and it is expected to create this many nodes (note that the node limit of 80000 is quite large, so it needs to be explained well why it is a false-positive - more often than not, there is still a bug somewhere that is first missed). >> 3. We have a real bug but the fix is too hard, risky, or just not worth the complexity, especially for a real edge-case (also needs to be explained and justified well). >> >> Note that for category 2 and 3, when we cannot easily fix the problem of creating too many nodes, we should implement a bail out fix from the current optimization and not the entire compilation to reduce the performance impact. This was, for example, done in JDK-8256934, where a fix was too risky at that point during the release and a proper fix was delayed. The fix was to bail out from Partial Peeling when hitting a critically high amount of live nodes (an estimate to ensure we never hit the node limit). >> >> You should then describe your analysis in the PR then explain your proposed solution. You should also add the reproducer as test case to your patch. > >> Hi @dhanalla, this is not the right way to handle this assertion failure. The assertion is here to catch real issues when creating too many nodes due to a bug in the code. For example, in [JDK-8256934](https://bugs.openjdk.org/browse/JDK-8256934), we hit this assert due to an inefficient cloning algorithm in Partial Peeling. We should not remove the assert. >> >> For such bugs, you first need to investigate why we hit the node limit with your reproducer. Once you find the problem, it can usually be put into one of the following categories: >> >> 1. 
We have a real bug and by fixing it, we no longer create this many nodes. >> 2. It is a false-positive and it is expected to create this many nodes (note that the node limit of 80000 is quite large, so it needs to be explained well why it is a false-positive - more often than not, there is still a bug somewhere that is first missed). >> 3. We have a real bug but the fix is too hard, risky, or just not worth the complexity, especially for a real edge-case (also needs to be explained and justified well). >> >> Note that for category 2 and 3, when we cannot easily fix the problem of creating too many nodes, we should implement a bail out fix from the current optimization and not the entire compilation to reduce the performance impact. This was, for example, done in JDK-8256934, where a fix was too risky at that point during the release and a proper fix was delayed. The fix was to bail out from Partial Peeling when hitting a critically high amount of live nodes (an estimate to ensure we never hit the node limit). >> >> You should then describe your analysis in the PR then explain your proposed solution. You should also add the reproducer as test case to your patch. > > Thanks @chhagedorn for reviewing this PR. > This scenario corresponds to Case 2 mentioned above, where more than 80,000 nodes are expected to be created. > As an alternative solution, could we consider limiting the JVM option `EliminateAllocationArraySizeLimit` (in `c2_globals.hpp`) to a range between 0 and 1024, instead of the current range of 0 to `max_jint`, as the upper limit of `max_jint` may not be practical? > > > Hi @dhanalla, this is not the right way to handle this assertion failure. The assertion is here to catch real issues when creating too many nodes due to a bug in the code. For example, in [JDK-8256934](https://bugs.openjdk.org/browse/JDK-8256934), we hit this assert due to an inefficient cloning algorithm in Partial Peeling. We should not remove the assert. 
> > > For such bugs, you first need to investigate why we hit the node limit with your reproducer. Once you find the problem, it can usually be put into one of the following categories: > > > > > > 1. We have a real bug and by fixing it, we no longer create this many nodes. > > > 2. It is a false-positive and it is expected to create this many nodes (note that the node limit of 80000 is quite large, so it needs to be explained well why it is a false-positive - more often than not, there is still a bug somewhere that is first missed). > > > 3. We have a real bug but the fix is too hard, risky, or just not worth the complexity, especially for a real edge-case (also needs to be explained and justified well). > > > > > > Note that for category 2 and 3, when we cannot easily fix the problem of creating too many nodes, we should implement a bail out fix from the current optimization and not the entire compilation to reduce the performance impact. This was, for example, done in JDK-8256934, where a fix was too risky at that point during the release and a proper fix was delayed. The fix was to bail out from Partial Peeling when hitting a critically high amount of live nodes (an estimate to ensure we never hit the node limit). > > > You should then describe your analysis in the PR then explain your proposed solution. You should also add the reproducer as test case to your patch. > > > > > > Thanks @chhagedorn for reviewing this PR. This scenario corresponds to Case 2 mentioned above, where more than 80,000 nodes are expected to be created. As an alternative solution, could we consider limiting the JVM option `EliminateAllocationArraySizeLimit` (in `c2_globals.hpp`) to a range between 0 and 1024, instead of the current range of 0 to `max_jint`, as the upper limit of `max_jint` may not be practical? > > Hi @dhanalla, can you elaborate more why it is expected and not an actual bug where we unnecessarily create too many nodes? 
The test case (ReductionPerf.java) involves multiple arrays, each with a size of 8k. Using the JVM option -XX:EliminateAllocationArraySizeLimit=10240 (which is larger than array size 8k) will enable scalar replacement for all array elements. This, in turn, may result in constructing a graph with over 80k live nodes. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20504#issuecomment-2299947813 From chagedorn at openjdk.org Wed Aug 21 07:05:04 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 21 Aug 2024 07:05:04 GMT Subject: RFR: 8315916: assert(C->live_nodes() <= C->max_node_limit()) failed: Live Node limit exceeded In-Reply-To: References: Message-ID: On Tue, 20 Aug 2024 23:52:36 GMT, Dhamoder Nalla wrote: > > > > Hi @dhanalla, this is not the right way to handle this assertion failure. The assertion is here to catch real issues when creating too many nodes due to a bug in the code. For example, in [JDK-8256934](https://bugs.openjdk.org/browse/JDK-8256934), we hit this assert due to an inefficient cloning algorithm in Partial Peeling. We should not remove the assert. > > > > For such bugs, you first need to investigate why we hit the node limit with your reproducer. Once you find the problem, it can usually be put into one of the following categories: > > > > > > > > 1. We have a real bug and by fixing it, we no longer create this many nodes. > > > > 2. It is a false-positive and it is expected to create this many nodes (note that the node limit of 80000 is quite large, so it needs to be explained well why it is a false-positive - more often than not, there is still a bug somewhere that is first missed). > > > > 3. We have a real bug but the fix is too hard, risky, or just not worth the complexity, especially for a real edge-case (also needs to be explained and justified well). 
> > > > > > > > Note that for category 2 and 3, when we cannot easily fix the problem of creating too many nodes, we should implement a bail out fix from the current optimization and not the entire compilation to reduce the performance impact. This was, for example, done in JDK-8256934, where a fix was too risky at that point during the release and a proper fix was delayed. The fix was to bail out from Partial Peeling when hitting a critically high amount of live nodes (an estimate to ensure we never hit the node limit). > > > > You should then describe your analysis in the PR then explain your proposed solution. You should also add the reproducer as test case to your patch. > > > > > > > > > Thanks @chhagedorn for reviewing this PR. This scenario corresponds to Case 2 mentioned above, where more than 80,000 nodes are expected to be created. As an alternative solution, could we consider limiting the JVM option `EliminateAllocationArraySizeLimit` (in `c2_globals.hpp`) to a range between 0 and 1024, instead of the current range of 0 to `max_jint`, as the upper limit of `max_jint` may not be practical? > > > > > > Hi @dhanalla, can you elaborate more why it is expected and not an actual bug where we unnecessarily create too many nodes? > > The test case (ReductionPerf.java) involves multiple arrays, each with a size of 8k. Using the JVM option -XX:EliminateAllocationArraySizeLimit=10240 (which is larger than array size 8k) will enable scalar replacement for all array elements. This, in turn, may result in constructing a graph with over 80k live nodes. I see, thanks for explaining the test behavior. > As an alternative solution, could we consider limiting the JVM option EliminateAllocationArraySizeLimit (in c2_globals.hpp) to a range between 0 and 1024, instead of the current range of 0 to max_jint, as the upper limit of max_jint may not be practical? I think that is just a mitigation which makes it less likely.
You could probably still just come up with a test with a lot more arrays of size 1024 and hit the node limit again. I suggest first extracting a minimal test case that isolates the problem. Then you can also play around with different values for `EliminateAllocationArraySizeLimit`. I could imagine that you can also trigger this problem with just one huge array when you set the limit large enough. This could make it easier to understand and explain exactly where the nodes are created, what kinds of nodes they are, etc. Once we know that, we can try to implement a bailout right there which is independent of how big `EliminateAllocationArraySizeLimit` is. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20504#issuecomment-2301286532 From rcastanedalo at openjdk.org Wed Aug 21 12:09:06 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Wed, 21 Aug 2024 12:09:06 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v4] In-Reply-To: References: Message-ID: On Tue, 20 Aug 2024 14:32:25 GMT, Daniel Lundén wrote: >> If a method has a large number of parameters, we currently bail out from C2 compilation. >> >> ### Changeset >> >> Allowing C2 compilation of methods with a large number of parameters requires fundamental changes to the register mask data structure, used in many places in C2. In particular, register masks currently have a statically determined size and cannot represent arbitrary numbers of stack slots. This is needed if we want to compile methods with arbitrary numbers of parameters. Register mask operations are present in performance-sensitive parts of C2, which further complicates changes. >> >> Changes: >> - Add functionality to dynamically grow/extend register masks.
To keep the common case (normal number of method parameters) quick and also to avoid more intrusive changes to the current `RegMask` interface, I decided to leave the "base" statically allocated memory for masks unchanged and only use dynamically allocated memory in the rare cases where it is needed. >> - Generalize the "chunk"-logic from `PhaseChaitin::Select()` to allow arbitrary-sized chunks, and also move most of the logic into register mask methods to separate concerns and to make the `PhaseChaitin::Select()` code more readable. >> - Remove all `can_represent` checks and bailouts. >> - Performance tuning. A particularly important change is the early-exit optimization in `RegMask::overlap`, used in the performance-sensitive method `PhaseChaitin::interfere_with_live`. >> - Add a new test case `TestManyMethodArguments.java` and extend an old test `TestNestedSynchronize.java`. >> >> ### Testing >> >> - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/10178060450) >> - `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. >> - Standard performance benchmarking. No observed conclusive overall performance degradation/improvement. >> - Specific benchmarking of C2 compilation time. The changes increase C2 compilation time by, approximately and on average, 1% for methods that could also be compiled before this changeset (see the figure below). The reason for the degradation is further checks required in performance-sensitive code (in particular `PhaseChaitin::remove_bound_register_from_interfering_live_ranges`). I have tried optimizing in various ways, but changes I found that lead to improvement also lead to less readable code (and are, in my opinion, no... 
> > Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: > > Update after Roberto's comments and suggestions Regarding testing of these changes, I am concerned that the current standard tests in our tiers do not exercise RegMasks extended with a dynamically allocated array very much (since very few methods have a very large number of arguments). Would it be possible to 1) extend `test/hotspot/gtest/opto/test_regmask.cpp` with tests that exercise extended RegMasks and 2) re-run the standard test tiers with a (temporary) `RM_SIZE` value that is low enough to also exercise the new logic more often? ------------- PR Review: https://git.openjdk.org/jdk/pull/20404#pullrequestreview-2250613513 From jbhateja at openjdk.org Wed Aug 21 16:42:44 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 21 Aug 2024 16:42:44 GMT Subject: RFR: 8338023: Support two vector selectFrom API [v3] In-Reply-To: <28KQHru1heR-YOVsRVo8Ffj_4D29IV8vD2tombvTHdI=.dba80ac3-9804-4074-ac0f-8acb9b042a08@github.com> References: <28KQHru1heR-YOVsRVo8Ffj_4D29IV8vD2tombvTHdI=.dba80ac3-9804-4074-ac0f-8acb9b042a08@github.com> Message-ID: > Hi All, > > As per the discussion on panama-dev mailing list[1], patch adds the support for following new two vector permutation APIs. > > > Declaration:- > Vector.selectFrom(Vector v1, Vector v2) > > > Semantics:- > Using index values stored in the lanes of "this" vector, assemble the values stored in first (v1) and second (v2) vector arguments. Thus, first and second vector serves as a table, whose elements are selected based on index value vector. API is applicable to all integral and floating-point types. The result of this operation is semantically equivalent to expression v1.rearrange(this.toShuffle(), v2). Values held in index vector lanes must lie within valid two vector index range [0, 2*VLEN) else an IndexOutOfBoundsException is thrown.
> > Summary of changes: > - Java side implementation of new selectFrom API. > - C2 compiler IR and inline expander changes. > - In absence of direct two vector permutation instruction in target ISA, a lowering transformation dismantles new IR into constituent IR supported by target platforms. > - Optimized x86 backend implementation for AVX512 and legacy target. > - Function tests covering new API. > > JMH micro included with this patch shows around 10-15x gain over existing rearrange API :- > Test System: Intel(R) Xeon(R) Platinum 8480+ [ Sapphire Rapids Server] > > > Benchmark (size) Mode Cnt Score Error Units > SelectFromBenchmark.rearrangeFromByteVector 1024 thrpt 2 2041.762 ops/ms > SelectFromBenchmark.rearrangeFromByteVector 2048 thrpt 2 1028.550 ops/ms > SelectFromBenchmark.rearrangeFromIntVector 1024 thrpt 2 962.605 ops/ms > SelectFromBenchmark.rearrangeFromIntVector 2048 thrpt 2 479.004 ops/ms > SelectFromBenchmark.rearrangeFromLongVector 1024 thrpt 2 359.758 ops/ms > SelectFromBenchmark.rearrangeFromLongVector 2048 thrpt 2 178.192 ops/ms > SelectFromBenchmark.rearrangeFromShortVector 1024 thrpt 2 1463.459 ops/ms > SelectFromBenchmark.rearrangeFromShortVector 2048 thrpt 2 727.556 ops/ms > SelectFromBenchmark.selectFromByteVector 1024 thrpt 2 33254.830 ops/ms > SelectFromBenchmark.selectFromByteVector 2048 thrpt 2 17313.174 ops/ms > SelectFromBenchmark.selectFromIntVector 1024 thrpt 2 10756.804 ops/ms > SelectFromBenchmark.selectFromIntVector 2048 thrpt 2 5398.2... Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Pass explicit wrap argument to selectFrom API with default value set to true. 
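The selectFrom semantics described above can be modeled on plain arrays, which makes the index rules easy to check. This is a hypothetical Java sketch, not the `jdk.incubator.vector` implementation: `idx` plays the role of the index vector ("this"), `v1` and `v2` are the two tables, and the `wrap` flag mirrors the explicit argument added in this revision. Here wrapping is assumed to mean reduction modulo `2*VLEN`; without it, an index outside `[0, 2*VLEN)` throws `IndexOutOfBoundsException`.

```java
import java.util.Arrays;

/** Hypothetical array model of a two-table select; not the Vector API implementation. */
final class TwoVectorSelect {
    /**
     * Lane i of the result is v1[j] if j < VLEN, else v2[j - VLEN], where j is idx[i]
     * (first reduced modulo 2*VLEN when wrap is true).
     */
    static int[] selectFrom(int[] idx, int[] v1, int[] v2, boolean wrap) {
        int vlen = v1.length;
        if (v2.length != vlen || idx.length != vlen) {
            throw new IllegalArgumentException("all three vectors must have VLEN lanes");
        }
        int[] result = new int[vlen];
        for (int i = 0; i < vlen; i++) {
            int j = wrap ? Math.floorMod(idx[i], 2 * vlen) : idx[i];
            if (j < 0 || j >= 2 * vlen) {
                throw new IndexOutOfBoundsException("lane " + i + ": index " + j);
            }
            result[i] = (j < vlen) ? v1[j] : v2[j - vlen];
        }
        return result;
    }

    public static void main(String[] args) {
        int[] v1 = {10, 11, 12, 13};
        int[] v2 = {20, 21, 22, 23};
        // Indexes 0..3 pick from v1, 4..7 pick from v2.
        System.out.println(Arrays.toString(selectFrom(new int[] {0, 5, 7, 2}, v1, v2, false)));
        // prints "[10, 21, 23, 12]"
    }
}
```

Later in this thread the direction shifts toward always wrapping, with no boolean parameter; in this model that corresponds to always passing `wrap = true`, which never throws.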
------------- Changes: - all: https://git.openjdk.org/jdk/pull/20508/files - new: https://git.openjdk.org/jdk/pull/20508/files/055fb22f..e24632cb Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20508&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20508&range=01-02 Stats: 491 lines in 40 files changed: 430 ins; 1 del; 60 mod Patch: https://git.openjdk.org/jdk/pull/20508.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20508/head:pull/20508 PR: https://git.openjdk.org/jdk/pull/20508 From jbhateja at openjdk.org Wed Aug 21 16:52:06 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 21 Aug 2024 16:52:06 GMT Subject: RFR: 8338023: Support two vector selectFrom API [v3] In-Reply-To: References: <28KQHru1heR-YOVsRVo8Ffj_4D29IV8vD2tombvTHdI=.dba80ac3-9804-4074-ac0f-8acb9b042a08@github.com> Message-ID: On Wed, 21 Aug 2024 16:42:44 GMT, Jatin Bhateja wrote: >> Hi All, >> >> As per the discussion on panama-dev mailing list[1], patch adds the support for following new two vector permutation APIs. >> >> >> Declaration:- >> Vector.selectFrom(Vector v1, Vector v2) >> >> >> Semantics:- >> Using index values stored in the lanes of "this" vector, assemble the values stored in first (v1) and second (v2) vector arguments. Thus, first and second vector serves as a table, whose elements are selected based on index value vector. API is applicable to all integral and floating-point types. The result of this operation is semantically equivalent to expression v1.rearrange(this.toShuffle(), v2). Values held in index vector lanes must lie within valid two vector index range [0, 2*VLEN) else an IndexOutOfBoundsException is thrown. >> >> Summary of changes: >> - Java side implementation of new selectFrom API. >> - C2 compiler IR and inline expander changes. >> - In absence of direct two vector permutation instruction in target ISA, a lowering transformation dismantles new IR into constituent IR supported by target platforms.
>> - Optimized x86 backend implementation for AVX512 and legacy target. >> - Function tests covering new API. >> >> JMH micro included with this patch shows around 10-15x gain over existing rearrange API :- >> Test System: Intel(R) Xeon(R) Platinum 8480+ [ Sapphire Rapids Server] >> >> >> Benchmark (size) Mode Cnt Score Error Units >> SelectFromBenchmark.rearrangeFromByteVector 1024 thrpt 2 2041.762 ops/ms >> SelectFromBenchmark.rearrangeFromByteVector 2048 thrpt 2 1028.550 ops/ms >> SelectFromBenchmark.rearrangeFromIntVector 1024 thrpt 2 962.605 ops/ms >> SelectFromBenchmark.rearrangeFromIntVector 2048 thrpt 2 479.004 ops/ms >> SelectFromBenchmark.rearrangeFromLongVector 1024 thrpt 2 359.758 ops/ms >> SelectFromBenchmark.rearrangeFromLongVector 2048 thrpt 2 178.192 ops/ms >> SelectFromBenchmark.rearrangeFromShortVector 1024 thrpt 2 1463.459 ops/ms >> SelectFromBenchmark.rearrangeFromShortVector 2048 thrpt 2 727.556 ops/ms >> SelectFromBenchmark.selectFromByteVector 1024 thrpt 2 33254.830 ops/ms >> SelectFromBenchmark.selectFromByteVector 2048 thrpt 2 17313.174 ops/ms >> SelectFromBenchmark.selectFromIntVector 1024 thrpt 2 10756.804 ops/ms >> S... > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Pass explicit wrap argument to selectFrom API with default value set to true. Hi @rose00, @sviswa7, @PaulSandoz, As suggested, the new selectFrom API now takes an explicit 'wrap' argument. Following are the performance numbers of the modified JMH micro included with the patch.
Baseline:-
Benchmark                                     (size)   Mode  Cnt     Score  Error   Units
SelectFromBenchmark.rearrangeFromByteVector     4096  thrpt    2  5849.771         ops/ms
SelectFromBenchmark.rearrangeFromDoubleVector   4096  thrpt    2   430.712         ops/ms
SelectFromBenchmark.rearrangeFromFloatVector    4096  thrpt    2   942.737         ops/ms
SelectFromBenchmark.rearrangeFromIntVector      4096  thrpt    2  1057.695         ops/ms
SelectFromBenchmark.rearrangeFromLongVector     4096  thrpt    2   616.360         ops/ms
SelectFromBenchmark.rearrangeFromShortVector    4096  thrpt    2  2146.465         ops/ms

With Patch:-
Benchmark                                     (size)   Mode  Cnt     Score  Error   Units
SelectFromBenchmark.selectFromByteVector        4096  thrpt    2  9543.775         ops/ms
SelectFromBenchmark.selectFromDoubleVector      4096  thrpt    2   558.195         ops/ms
SelectFromBenchmark.selectFromFloatVector       4096  thrpt    2  1325.059         ops/ms
SelectFromBenchmark.selectFromIntVector         4096  thrpt    2  1418.748         ops/ms
SelectFromBenchmark.selectFromLongVector        4096  thrpt    2   687.231         ops/ms
SelectFromBenchmark.selectFromShortVector       4096  thrpt    2  4782.395         ops/ms

With WIP wrap index acceleration PR#20634:
Benchmark                                     (size)   Mode  Cnt     Score  Error   Units
SelectFromBenchmark.rearrangeFromByteVector     4096  thrpt    2  7602.645         ops/ms
SelectFromBenchmark.rearrangeFromDoubleVector   4096  thrpt    2   441.684         ops/ms
SelectFromBenchmark.rearrangeFromFloatVector    4096  thrpt    2   926.112         ops/ms
SelectFromBenchmark.rearrangeFromIntVector      4096  thrpt    2  1061.695         ops/ms
SelectFromBenchmark.rearrangeFromLongVector     4096  thrpt    2   644.058         ops/ms
SelectFromBenchmark.rearrangeFromShortVector    4096  thrpt    2  2777.735         ops/ms

------------- PR Comment: https://git.openjdk.org/jdk/pull/20508#issuecomment-2302541724 From sviswanathan at openjdk.org Wed Aug 21 17:51:06 2024 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Wed, 21 Aug 2024 17:51:06 GMT Subject: RFR: 8338023: Support two vector selectFrom API [v3] In-Reply-To: References: <28KQHru1heR-YOVsRVo8Ffj_4D29IV8vD2tombvTHdI=.dba80ac3-9804-4074-ac0f-8acb9b042a08@github.com> Message-ID: On Wed, 21 Aug 2024 16:49:40 GMT, Jatin Bhateja wrote: >> Jatin
Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Pass explicit wrap argument to selectFrom API with default value set to true. > > Hi @rose00 , @sviswa7 , @PaulSandoz , > As suggested, now passing explicit 'wrap' argument to new selectFrom API. > > Following are the performance number of modified JMH micro included with the patch. > > > > Baseline:- > Benchmark (size) Mode Cnt Score Error Units > SelectFromBenchmark.rearrangeFromByteVector 4096 thrpt 2 5849.771 ops/ms > SelectFromBenchmark.rearrangeFromDoubleVector 4096 thrpt 2 430.712 ops/ms > SelectFromBenchmark.rearrangeFromFloatVector 4096 thrpt 2 942.737 ops/ms > SelectFromBenchmark.rearrangeFromIntVector 4096 thrpt 2 1057.695 ops/ms > SelectFromBenchmark.rearrangeFromLongVector 4096 thrpt 2 616.360 ops/ms > SelectFromBenchmark.rearrangeFromShortVector 4096 thrpt 2 2146.465 ops/ms > > With Patch:- > Benchmark (size) Mode Cnt Score Error Units > SelectFromBenchmark.selectFromByteVector 4096 thrpt 2 9543.775 ops/ms > SelectFromBenchmark.selectFromDoubleVector 4096 thrpt 2 558.195 ops/ms > SelectFromBenchmark.selectFromFloatVector 4096 thrpt 2 1325.059 ops/ms > SelectFromBenchmark.selectFromIntVector 4096 thrpt 2 1418.748 ops/ms > SelectFromBenchmark.selectFromLongVector 4096 thrpt 2 687.231 ops/ms > SelectFromBenchmark.selectFromShortVector 4096 thrpt 2 4782.395 ops/ms > > > With WIP wrap index acceleration PR#20634: > Benchmark (size) Mode Cnt Score Error Units > SelectFromBenchmark.rearrangeFromByteVector 4096 thrpt 2 7602.645 ops/ms > SelectFromBenchmark.rearrangeFromDoubleVector 4096 thrpt 2 441.684 ops/ms > SelectFromBenchmark.rearrangeFromFloatVector 4096 thrpt 2 926.112 ops/ms > SelectFromBenchmark.rearrangeFromIntVector 4096 thrpt 2 1061.695 ops/ms > SelectFromBenchmark.rearrangeFromLongVector 4096 thrpt 2 644.058 ops/ms > SelectFromBenchmark.rearrangeFromShortVector 4096 thrpt 2 2777.735 ops/ms @jatin-bhateja Thanks, the PR 
(https://github.com/openjdk/jdk/pull/20634) is still a work in progress and can be simplified much further. The changes I am currently working on are: wrap by default for rearrange and selectFrom as suggested by John and Paul, no additional API with a boolean wrap parameter, and no changes to shuffle constructors. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20508#issuecomment-2302641840 From psandoz at openjdk.org Wed Aug 21 18:30:06 2024 From: psandoz at openjdk.org (Paul Sandoz) Date: Wed, 21 Aug 2024 18:30:06 GMT Subject: RFR: 8338023: Support two vector selectFrom API [v3] In-Reply-To: References: <28KQHru1heR-YOVsRVo8Ffj_4D29IV8vD2tombvTHdI=.dba80ac3-9804-4074-ac0f-8acb9b042a08@github.com> Message-ID: On Wed, 21 Aug 2024 16:42:44 GMT, Jatin Bhateja wrote: >> Hi All, >> >> As per the discussion on panama-dev mailing list[1], patch adds the support for following new two vector permutation APIs. >> >> >> Declaration:- >> Vector.selectFrom(Vector v1, Vector v2) >> >> >> Semantics:- >> Using index values stored in the lanes of "this" vector, assemble the values stored in first (v1) and second (v2) vector arguments. Thus, first and second vector serves as a table, whose elements are selected based on index value vector. API is applicable to all integral and floating-point types. The result of this operation is semantically equivalent to expression v1.rearrange(this.toShuffle(), v2). Values held in index vector lanes must lie within valid two vector index range [0, 2*VLEN) else an IndexOutOfBoundsException is thrown. >> >> Summary of changes: >> - Java side implementation of new selectFrom API. >> - C2 compiler IR and inline expander changes. >> - In absence of direct two vector permutation instruction in target ISA, a lowering transformation dismantles new IR into constituent IR supported by target platforms. >> - Optimized x86 backend implementation for AVX512 and legacy target. >> - Function tests covering new API.
>> JMH micro included with this patch shows around 10-15x gain over existing rearrange API :- >> Test System: Intel(R) Xeon(R) Platinum 8480+ [ Sapphire Rapids Server] >> >> >> Benchmark (size) Mode Cnt Score Error Units >> SelectFromBenchmark.rearrangeFromByteVector 1024 thrpt 2 2041.762 ops/ms >> SelectFromBenchmark.rearrangeFromByteVector 2048 thrpt 2 1028.550 ops/ms >> SelectFromBenchmark.rearrangeFromIntVector 1024 thrpt 2 962.605 ops/ms >> SelectFromBenchmark.rearrangeFromIntVector 2048 thrpt 2 479.004 ops/ms >> SelectFromBenchmark.rearrangeFromLongVector 1024 thrpt 2 359.758 ops/ms >> SelectFromBenchmark.rearrangeFromLongVector 2048 thrpt 2 178.192 ops/ms >> SelectFromBenchmark.rearrangeFromShortVector 1024 thrpt 2 1463.459 ops/ms >> SelectFromBenchmark.rearrangeFromShortVector 2048 thrpt 2 727.556 ops/ms >> SelectFromBenchmark.selectFromByteVector 1024 thrpt 2 33254.830 ops/ms >> SelectFromBenchmark.selectFromByteVector 2048 thrpt 2 17313.174 ops/ms >> SelectFromBenchmark.selectFromIntVector 1024 thrpt 2 10756.804 ops/ms >> S... > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Pass explicit wrap argument to selectFrom API with default value set to true. Is it possible for the intrinsic to be responsible for wrapping, if needed? I was looking at [`vpermi2b`](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=vpermi2b&ig_expand=4917,4982,5004,5010,5014&techs=AVX_512) and AFAICT it implicitly wraps, operating on the lower N bits. Is that correct?
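Paul's question turns on whether a separate wrap step is redundant for a given instruction. A small model makes the distinction concrete. This is a hypothetical Java sketch over byte lanes, assuming VLEN is a power of two: `lowBitsSelect` follows Paul's reading that `vpermi2b` uses only the low index bits, while `msbZeroSelect` models the SSSE3 `pshufb` rule for a single 16-byte lane (index bit 7 set yields zero, otherwise the low 4 bits select a byte). For the first kind, applying the explicit wrap beforehand never changes the result; for the second it can. Java's own `int` shift behaves like the first kind: only the low five bits of the shift distance are used, so `x >> (j & 31)` always equals `x >> j`.

```java
/** Hypothetical byte-lane model of two index conventions; not a full ISA model. */
final class PermuteIndexModel {
    /** Explicit wrap as a separate AND; assumes vlen is a power of two. */
    static int wrap(int idx, int vlen) {
        return idx & (2 * vlen - 1);
    }

    /** Low-bits convention: index bits above log2(2*vlen) are ignored by the lookup itself. */
    static byte lowBitsSelect(byte[] t1, byte[] t2, int idx) {
        int vlen = t1.length;
        int j = idx & (2 * vlen - 1);
        return (j < vlen) ? t1[j] : t2[j - vlen];
    }

    /** pshufb-like rule for one 16-byte table: bit 7 set gives 0, else the low 4 bits index. */
    static byte msbZeroSelect(byte[] t16, int idx) {
        if ((idx & 0x80) != 0) {
            return 0;
        }
        return t16[idx & 0x0F];
    }

    /** Java int shifts use only the low 5 bits of the distance (JLS 15.19). */
    static boolean shiftMaskIsRedundant(int x, int j) {
        return (x >> (j & 31)) == (x >> j);
    }

    public static void main(String[] args) {
        byte[] t1 = {1, 2, 3, 4, 5, 6, 7, 8};
        byte[] t2 = {101, 102, 103, 104, 105, 106, 107, 108};
        byte[] t16 = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16};
        // Same result with or without the explicit wrap: prints "4 4"
        System.out.println(lowBitsSelect(t1, t2, 0x83) + " " + lowBitsSelect(t1, t2, wrap(0x83, 8)));
        // Wrapping changes the pshufb-like result: prints "0 4"
        System.out.println(msbZeroSelect(t16, 0x83) + " " + msbZeroSelect(t16, 0x83 & 0x0F));
    }
}
```

Under these assumptions, an explicit AND in front of a low-bits instruction is a candidate for removal, much like `j & 31` in front of a Java shift, whereas in front of a `pshufb`-style instruction it carries meaning and must stay.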
------------- PR Comment: https://git.openjdk.org/jdk/pull/20508#issuecomment-2302707611 From john.r.rose at oracle.com Wed Aug 21 18:40:15 2024 From: john.r.rose at oracle.com (John Rose) Date: Wed, 21 Aug 2024 11:40:15 -0700 Subject: RFR: 8338023: Support two vector selectFrom API [v3] In-Reply-To: References: <28KQHru1heR-YOVsRVo8Ffj_4D29IV8vD2tombvTHdI=.dba80ac3-9804-4074-ac0f-8acb9b042a08@github.com> Message-ID: On 21 Aug 2024, at 10:51, Sandhya Viswanathan wrote: > @jatin-bhateja Thanks, the PR (https://github.com/openjdk/jdk/pull/20634) is still a work in progress and can be simplified much further. The changes I am currently working on are: wrap by default for rearrange and selectFrom as suggested by John and Paul, no additional API with a boolean wrap parameter, and no changes to shuffle constructors. Yes, thank you Sandhya; this is the destination I hope to arrive at. Not necessarily 100% in this PR, but this PR should be consistent with it. To review: Shuffles store their indexes "partially wrapped" so as to preserve information about which indexes were out of bounds, but they also preserve all index values mod VLEN. It's always an option, though not a requirement, to fully wrap, removing the OOB info and reducing every index down to 0..VLEN-1. When using a vector instead of a shuffle for steering, we think of this as creating a temp shuffle first, then doing the appropriate operation(s). But for best instruction selection, we have found that it's fastest to force everything down to 0..VLEN-1 immediately, at least in the vector case, and to a doubled dynamic range, mod 2VLEN, for the two-input case. There's always an equivalent expression which uses an explicit shuffle to carry either VLEN (fully wrapped) or 2VLEN (partially wrapped) indexes. For the vector-steered version we implement only the most favorable pattern of shuffle usage, one which never throws. And of course we don't build a temp shuffle either.
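John's distinction between partial and full wrapping can be made concrete with one encoding that satisfies both properties he lists: an out-of-range index keeps its value mod VLEN and also stays recognizable as out of bounds. The sketch below is hypothetical Java with assumed helper names. It stores an out-of-range index `i` as `floorMod(i, VLEN) - VLEN` (always negative), so a later full wrap gives the same answer as wrapping the raw index directly. The two-input helper reduces mod `2*VLEN`, the range described for vector-steered two-input selection, and never throws.

```java
/** Hypothetical model of shuffle index wrapping; names and encoding are assumptions. */
final class ShuffleWrapModel {
    /** In-range indexes are kept; out-of-range ones become floorMod(i, vlen) - vlen, in [-vlen, -1]. */
    static int partialWrap(int i, int vlen) {
        return (i >= 0 && i < vlen) ? i : Math.floorMod(i, vlen) - vlen;
    }

    /** A negative stored index marks an originally out-of-bounds value. */
    static boolean wasOutOfBounds(int stored) {
        return stored < 0;
    }

    /** Full wrap: every index lands in [0, vlen); the out-of-bounds information is lost. */
    static int fullWrap(int i, int vlen) {
        return Math.floorMod(i, vlen);
    }

    /** Two-input steering: reduce to [0, 2*vlen) immediately, so selection never throws. */
    static int twoInputWrap(int i, int vlen) {
        return Math.floorMod(i, 2 * vlen);
    }

    public static void main(String[] args) {
        int stored = partialWrap(5, 4);
        // prints "-3 true 1 1": OOB flag kept, and the value mod VLEN is preserved
        System.out.println(stored + " " + wasOutOfBounds(stored) + " "
                + fullWrap(stored, 4) + " " + fullWrap(5, 4));
    }
}
```

With this encoding, a full wrap of a partially wrapped index is just another `floorMod`, and a bounds check is just a sign test.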
From john.r.rose at oracle.com Wed Aug 21 18:51:25 2024 From: john.r.rose at oracle.com (John Rose) Date: Wed, 21 Aug 2024 11:51:25 -0700 Subject: RFR: 8338023: Support two vector selectFrom API [v3] In-Reply-To: References: <28KQHru1heR-YOVsRVo8Ffj_4D29IV8vD2tombvTHdI=.dba80ac3-9804-4074-ac0f-8acb9b042a08@github.com> Message-ID: On 21 Aug 2024, at 11:30, Paul Sandoz wrote: > Is it possible for the intrinsic to be responsible for wrapping, if needed? I was looking at [`vpermi2b`](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=vpermi2b&ig_expand=4917,4982,5004,5010,5014&techs=AVX_512) and AFAICT it implicitly wraps, operating on the lower N bits. Is that correct? That's not a bad idea. But it is also possible (and routine) for the JIT to take an expression like (i >> (j&31)) down to (i >> j) if the hardware takes care of the (j&31) inside its >> operation. I think that some hardware permutation operations do something similar to >> in that they simply ignore irrelevant bits in the steering indexes. (Other operations do exotic things with irrelevant bits, such as interpreting the sign bit as a command to "force this one to zero".) If the wrapping operation for steering indexes is just a vpand against a simple constant, then maybe (maybe!) the JIT can easily drop that vpand, when the input is passed to a friendly auto-masking instruction, just like with (i >> (j&31)). On the other hand, Paul's idea might be more robust. It would require that the permutation intrinsics would apply vpand at the right places, and omit vpand when possible. On the other other hand (the first hand) the classic way of doing it doesn't introduce vpand inside of intrinsics, which has a routine advantage: The vpands introduced outside of the intrinsic can be user-introduced or framework-introduced or both. In all cases, the JIT treats them uniformly and can collapse them together.
Putting magic fixup instructions inside of intrinsic expansion risks making them invisible to the routine optimizations of the JIT. So, assuming the vpand gets good optimization, putting it outside of the intrinsic is the most robust option, as long as "good optimization" includes the >>(j&31) trick for auto-masking instructions. So the intrinsic should look for a vpand in its steering input, and pop off the IR node if the hardware masking is found to produce the same result. From sviswanathan at openjdk.org Wed Aug 21 19:34:06 2024 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Wed, 21 Aug 2024 19:34:06 GMT Subject: RFR: 8338023: Support two vector selectFrom API [v3] In-Reply-To: References: <28KQHru1heR-YOVsRVo8Ffj_4D29IV8vD2tombvTHdI=.dba80ac3-9804-4074-ac0f-8acb9b042a08@github.com> Message-ID: On Wed, 21 Aug 2024 18:27:09 GMT, Paul Sandoz wrote: > Is it possible for the intrinsic to be responsible for wrapping, if needed? I was looking at [`vpermi2b`](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=vpermi2b&ig_expand=4917,4982,5004,5010,5014&techs=AVX_512) and AFAICT it implicitly wraps, operating on the lower N bits. Is that correct? It is good to keep wrapping separate, for two reasons: 1) Not all permute instructions wrap; e.g. pshufb behaves differently if the MSB is set. 2) By keeping wrapping separate, it can be moved out of the loop if the shuffle is loop-invariant. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20508#issuecomment-2302865908 From kvn at openjdk.org Wed Aug 21 21:22:02 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 21 Aug 2024 21:22:02 GMT Subject: RFR: 8338745: Intrinsify Continuation.pin() and Continuation.unpin() In-Reply-To: References: Message-ID: On Wed, 21 Aug 2024 20:11:05 GMT, Markus Grönlund wrote: > Greetings, > > Please help review this change set that implements C2 intrinsics for jdk.internal.vm.Continuation.pin() and jdk.internal.vm.Continuation.unpin().
> > This work is a consequence of [JDK-8338417](https://bugs.openjdk.org/browse/JDK-8338417), which required us to introduce explicit pin constructs for Virtual threads in a relatively performance-sensitive path. > > Testing: jdk_jfr, loom testing > > Comment: I changed the type of the ContinuationEntry::_pin_count field from uint to uint32_t to make the size explicit and to access it uniformly from the intrinsic code using T_INT. > > Thanks > Markus src/hotspot/share/opto/library_call.cpp line 3815: > 3813: bool LibraryCallKit::inline_native_Continuation_unpin() { > 3814: return inline_native_Continuation_pinning_shared_impl(true); > 3815: } I don't see the need for these 2 methods. You can directly call inline_native_Continuation_pinning_shared_impl() in the switch in try_to_inline() and pass true or false there. You may also rename it to `inline_native_Continuation_pinning()`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20664#discussion_r1725766345 From dlong at openjdk.org Wed Aug 21 23:58:06 2024 From: dlong at openjdk.org (Dean Long) Date: Wed, 21 Aug 2024 23:58:06 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v4] In-Reply-To: References: Message-ID: On Tue, 20 Aug 2024 14:32:25 GMT, Daniel Lundén wrote: >> If a method has a large number of parameters, we currently bail out from C2 compilation. >> >> ### Changeset >> >> Allowing C2 compilation of methods with a large number of parameters requires fundamental changes to the register mask data structure, used in many places in C2. In particular, register masks currently have a statically determined size and cannot represent arbitrary numbers of stack slots. This is needed if we want to compile methods with arbitrary numbers of parameters. Register mask operations are present in performance-sensitive parts of C2, which further complicates changes. >> >> Changes: >> - Add functionality to dynamically grow/extend register masks.
I experimented with a number of design choices to achieve this. To keep the common case (normal number of method parameters) quick and also to avoid more intrusive changes to the current `RegMask` interface, I decided to leave the "base" statically allocated memory for masks unchanged and only use dynamically allocated memory in the rare cases where it is needed. >> - Generalize the "chunk"-logic from `PhaseChaitin::Select()` to allow arbitrary-sized chunks, and also move most of the logic into register mask methods to separate concerns and to make the `PhaseChaitin::Select()` code more readable. >> - Remove all `can_represent` checks and bailouts. >> - Performance tuning. A particularly important change is the early-exit optimization in `RegMask::overlap`, used in the performance-sensitive method `PhaseChaitin::interfere_with_live`. >> - Add a new test case `TestManyMethodArguments.java` and extend an old test `TestNestedSynchronize.java`. >> >> ### Testing >> >> - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/10178060450) >> - `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. >> - Standard performance benchmarking. No observed conclusive overall performance degradation/improvement. >> - Specific benchmarking of C2 compilation time. The changes increase C2 compilation time by, approximately and on average, 1% for methods that could also be compiled before this changeset (see the figure below). The reason for the degradation is further checks required in performance-sensitive code (in particular `PhaseChaitin::remove_bound_register_from_interfering_live_ranges`). I have tried optimizing in various ways, but changes I found that lead to improvement also lead to less readable code (and are, in my opinion, no... 
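The design described in the quoted changeset (keep the statically allocated "base" part and only allocate dynamically in the rare case it is needed, plus an early exit in `overlap`) can be sketched as follows. This is an illustrative model under stated assumptions, not the actual `RegMask` implementation: `GrowableMask` is a hypothetical class, and the inline size of four 64-bit words is an arbitrary choice.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch (not HotSpot's RegMask): a bit set with a fixed inline
// part for the common case and a heap-allocated extension that is only
// created when a bit beyond the inline capacity is set.
class GrowableMask {
public:
    static constexpr std::size_t kInlineWords = 4;  // assumption
    static constexpr std::size_t kBitsPerWord = 64;

    void set(std::size_t bit) {
        word_ref(bit / kBitsPerWord) |= uint64_t{1} << (bit % kBitsPerWord);
    }

    bool test(std::size_t bit) const {
        return (word(bit / kBitsPerWord) >> (bit % kBitsPerWord)) & 1u;
    }

    // Early exit: stop at the first word the two masks share. Words present
    // in only one mask are implicitly zero in the other and cannot overlap.
    bool overlap(const GrowableMask& other) const {
        for (std::size_t w = 0; w < kInlineWords; ++w) {
            if (base_[w] & other.base_[w]) return true;
        }
        std::size_t n = std::min(ext_.size(), other.ext_.size());
        for (std::size_t i = 0; i < n; ++i) {
            if (ext_[i] & other.ext_[i]) return true;
        }
        return false;
    }

    bool uses_extension() const { return !ext_.empty(); }

private:
    uint64_t word(std::size_t w) const {
        if (w < kInlineWords) return base_[w];
        std::size_t i = w - kInlineWords;
        return i < ext_.size() ? ext_[i] : 0;
    }

    uint64_t& word_ref(std::size_t w) {
        if (w < kInlineWords) return base_[w];
        std::size_t i = w - kInlineWords;
        if (i >= ext_.size()) ext_.resize(i + 1, 0);  // grow only on demand
        return ext_[i];
    }

    std::array<uint64_t, kInlineWords> base_{};
    std::vector<uint64_t> ext_;
};
```

Masks that only touch low bits never allocate, which mirrors the goal of leaving the common case unchanged, and `overlap` returns as soon as one shared bit is found instead of scanning both masks to the end.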
> > Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: > > Update after Roberto's comments and suggestions src/hotspot/share/adlc/formsopt.cpp line 368: > 366: fprintf(fp," 0x%x,", regs_in_word(i, false)); > 367: } > 368: fprintf(fp," 0x%x );\n", regs_in_word(i, false)); Does the use of RegMask_Size() above for the length still make sense? IIUC, RegMask_Size() is what previously determined the max size of frames and register masks. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r1725973452 From sviswanathan at openjdk.org Thu Aug 22 01:20:16 2024 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Thu, 22 Aug 2024 01:20:16 GMT Subject: RFR: 8338023: Support two vector selectFrom API [v3] In-Reply-To: References: <28KQHru1heR-YOVsRVo8Ffj_4D29IV8vD2tombvTHdI=.dba80ac3-9804-4074-ac0f-8acb9b042a08@github.com> Message-ID: On Wed, 21 Aug 2024 16:42:44 GMT, Jatin Bhateja wrote: >> Hi All, >> >> As per the discussion on the panama-dev mailing list[1], this patch adds support for the following new two-vector permutation APIs. >> >> >> Declaration:- >> Vector.selectFrom(Vector v1, Vector v2) >> >> >> Semantics:- >> Using index values stored in the lanes of "this" vector, assemble the values stored in the first (v1) and second (v2) vector arguments. Thus, the first and second vectors serve as a table whose elements are selected based on the index value vector. The API is applicable to all integral and floating-point types. The result of this operation is semantically equivalent to the expression v1.rearrange(this.toShuffle(), v2). Values held in index vector lanes must lie within the valid two-vector index range [0, 2*VLEN), else an IndexOutOfBoundsException is thrown. >> >> Summary of changes: >> - Java side implementation of new selectFrom API. >> - C2 compiler IR and inline expander changes.
>> - In the absence of a direct two-vector permutation instruction in the target ISA, a lowering transformation dismantles the new IR into constituent IR supported by the target platform. >> - Optimized x86 backend implementation for AVX512 and legacy targets. >> - Functional tests covering the new API. >> >> The JMH micro included with this patch shows around a 10-15x gain over the existing rearrange API: >> Test System: Intel(R) Xeon(R) Platinum 8480+ [Sapphire Rapids Server]
>>
>> Benchmark                                     (size)   Mode  Cnt      Score  Error  Units
>> SelectFromBenchmark.rearrangeFromByteVector     1024  thrpt    2   2041.762         ops/ms
>> SelectFromBenchmark.rearrangeFromByteVector     2048  thrpt    2   1028.550         ops/ms
>> SelectFromBenchmark.rearrangeFromIntVector      1024  thrpt    2    962.605         ops/ms
>> SelectFromBenchmark.rearrangeFromIntVector      2048  thrpt    2    479.004         ops/ms
>> SelectFromBenchmark.rearrangeFromLongVector     1024  thrpt    2    359.758         ops/ms
>> SelectFromBenchmark.rearrangeFromLongVector     2048  thrpt    2    178.192         ops/ms
>> SelectFromBenchmark.rearrangeFromShortVector    1024  thrpt    2   1463.459         ops/ms
>> SelectFromBenchmark.rearrangeFromShortVector    2048  thrpt    2    727.556         ops/ms
>> SelectFromBenchmark.selectFromByteVector        1024  thrpt    2  33254.830         ops/ms
>> SelectFromBenchmark.selectFromByteVector        2048  thrpt    2  17313.174         ops/ms
>> SelectFromBenchmark.selectFromIntVector         1024  thrpt    2  10756.804         ops/ms
>> S...
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>
> Pass explicit wrap argument to selectFrom API with default value set to true.

@rose00 @PaulSandoz I have updated https://github.com/openjdk/jdk/pull/20634. Please take a look and check whether it meets your expectations for the existing rearrange/selectFrom APIs. Jatin can then base the new two-vector selectFrom API in this PR on similar lines.
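The two-vector semantics quoted above (each lane of the index vector selects from the concatenation of v1 and v2) can be modeled on scalars. This is a hedged sketch, not the Vector API: `select_from_two` is a hypothetical name, VLEN is assumed to be a power of two, the `wrap` flag assumes that wrapping reduces indices modulo 2*VLEN (the exact semantics of the explicit wrap argument are not spelled out in this thread), and `std::out_of_range` stands in for the Java IndexOutOfBoundsException.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical scalar model of two-vector selectFrom: lane i of `index`
// picks element index[i] from the concatenation v1 ++ v2.
template <typename T, std::size_t VLEN>
std::array<T, VLEN> select_from_two(const std::array<int, VLEN>& index,
                                    const std::array<T, VLEN>& v1,
                                    const std::array<T, VLEN>& v2,
                                    bool wrap = true) {
    static_assert(VLEN > 0 && (VLEN & (VLEN - 1)) == 0,
                  "VLEN must be a power of two");
    std::array<T, VLEN> out{};
    for (std::size_t lane = 0; lane < VLEN; ++lane) {
        int idx = index[lane];
        if (wrap) {
            // Assumed wrapping: reduce modulo 2*VLEN with a cheap AND.
            idx &= static_cast<int>(2 * VLEN - 1);
        } else if (idx < 0 || idx >= static_cast<int>(2 * VLEN)) {
            throw std::out_of_range("index outside [0, 2*VLEN)");
        }
        out[lane] = idx < static_cast<int>(VLEN) ? v1[idx] : v2[idx - VLEN];
    }
    return out;
}
```

With VLEN = 4, the index vector {0, 5, 3, 6} picks v1[0], v2[1], v1[3], v2[2]; with wrapping enabled, 9 behaves like 1 and -1 like 7, while with wrapping disabled an index of 8 throws.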
------------- PR Comment: https://git.openjdk.org/jdk/pull/20508#issuecomment-2303383784 From thartmann at openjdk.org Thu Aug 22 05:52:09 2024 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 22 Aug 2024 05:52:09 GMT Subject: RFR: 8338745: Intrinsify Continuation.pin() and Continuation.unpin() In-Reply-To: References: Message-ID: On Wed, 21 Aug 2024 20:11:05 GMT, Markus Grönlund wrote: > Greetings, > > Please help review this change set that implements C2 intrinsics for jdk.internal.vm.Continuation.pin() and jdk.internal.vm.Continuation.unpin(). > > This work is a consequence of [JDK-8338417](https://bugs.openjdk.org/browse/JDK-8338417), which required us to introduce explicit pin constructs for Virtual threads in a relatively performance-sensitive path. > > Testing: jdk_jfr, loom testing > > Comment: I changed the type of the ContinuationEntry::_pin_count field from uint to uint32_t to make the size explicit and to access it uniformly from the intrinsic code using T_INT. > > Thanks > Markus src/hotspot/share/opto/library_call.cpp line 3787: > 3785: set_all_memory(input_memory_state); > 3786: uncommon_trap_exact(Deoptimization::Reason_intrinsic, > 3787: Deoptimization::Action_reinterpret); Why do you use `Action_reinterpret` and not `Action_make_not_entrant` here? You may need a check to avoid endless re-compilation of a method that always hits the trap: if (too_many_traps(Deoptimization::Reason_intrinsic)) { return false; } See other intrinsics.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20664#discussion_r1726376702 From chagedorn at openjdk.org Thu Aug 22 06:02:07 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 22 Aug 2024 06:02:07 GMT Subject: RFR: 8327381: Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v11] In-Reply-To: References: <8gXt0Adth6IJw_QitnY53W_4Ouup1DuUHZmkoM8ytuY=.ccb05c52-cdec-4d89-a993-c005b8aa3d0d@github.com> Message-ID: On Thu, 15 Aug 2024 06:39:20 GMT, Emanuel Peter wrote: > In a later and separate RFE, we can then adjust the regex for all nodes, in a bulk update. Good idea, I filed [JDK-8338809](https://bugs.openjdk.org/browse/JDK-8338809) to follow up on this after this PR gets integrated. But for now, I suggest using an explicit regex as suggested above with `@IR(counts = {IRNode.CMP_U + "\b", "1"})`. We can then follow up and change this IR test once JDK-8338809 is tackled. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18198#issuecomment-2303846952 From dlong at openjdk.org Thu Aug 22 07:17:03 2024 From: dlong at openjdk.org (Dean Long) Date: Thu, 22 Aug 2024 07:17:03 GMT Subject: RFR: 8338745: Intrinsify Continuation.pin() and Continuation.unpin() In-Reply-To: <6bsD-W2hquUgwdv7sxyEZHABzNqTNLyxm3d-GMMOR1g=.6d8c102b-e1c7-4408-aadf-6823c52f2ce0@github.com> References: Message-ID: <6bsD-W2hquUgwdv7sxyEZHABzNqTNLyxm3d-GMMOR1g=.6d8c102b-e1c7-4408-aadf-6823c52f2ce0@github.com> On Thu, 22 Aug 2024 05:47:23 GMT, Tobias Hartmann wrote: >> Greetings, >> >> Please help review this change set that implements C2 intrinsics for jdk.internal.vm.Continuation.pin() and jdk.internal.vm.Continuation.unpin(). >> >> This work is a consequence of [JDK-8338417](https://bugs.openjdk.org/browse/JDK-8338417), which required us to introduce explicit pin constructs for Virtual threads in a relatively performance-sensitive path.
>> >> Testing: jdk_jfr, loom testing >> >> Comment: I changed the type of the ContinuationEntry::_pin_count field from uint to uint32_t to make the size explicit and to access it uniformly from the intrinsic code using T_INT. >> >> Thanks >> Markus > > src/hotspot/share/opto/library_call.cpp line 3787: > >> 3785: set_all_memory(input_memory_state); >> 3786: uncommon_trap_exact(Deoptimization::Reason_intrinsic, >> 3787: Deoptimization::Action_reinterpret); > > Why do you use `Action_reinterpret` and not `Action_make_not_entrant` here? > > You may need a check to avoid endless re-compilation of a method that always hits the trap: > > if (too_many_traps(Deoptimization::Reason_intrinsic)) { > return false; > } > > > See other intrinsics. A pin_count overflow/underflow should be a per-thread condition, not global. If there is nothing in the nmethod to be invalidated, maybe this should be `Action_none`?
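The over/underflow check under discussion can be modeled as a per-thread counter update with a slow-path exit. This is a hedged sketch, not the intrinsic: `ContinuationEntryModel`, `PinOutcome` and `pinning_fast_path` are hypothetical names. It only assumes what the thread states: a `uint32_t` pin count, a single shared routine taking a pin/unpin flag (as suggested for `inline_native_Continuation_pinning()`), and a slow path in which the interpreter throws IllegalStateException. Because the failure depends only on the current thread's counter, nothing about the compiled code becomes stale, which is the reasoning behind not invalidating the nmethod.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical model (not the actual intrinsic) of the per-thread pin
// counter; in the thread this is ContinuationEntry::_pin_count (uint32_t).
struct ContinuationEntryModel {
    uint32_t pin_count = 0;
};

// Done: the fast path updated the counter.
// SlowPath: stands in for taking the uncommon trap and letting the
// interpreter re-execute the call and throw the exception.
enum class PinOutcome { Done, SlowPath };

// One shared routine with a flag, in the spirit of the suggested
// inline_native_Continuation_pinning(bool unpin).
PinOutcome pinning_fast_path(ContinuationEntryModel& entry, bool unpin) {
    if (unpin) {
        if (entry.pin_count == 0) {
            return PinOutcome::SlowPath;  // underflow: leave count unchanged
        }
        --entry.pin_count;
    } else {
        if (entry.pin_count == std::numeric_limits<uint32_t>::max()) {
            return PinOutcome::SlowPath;  // overflow: leave count unchanged
        }
        ++entry.pin_count;
    }
    return PinOutcome::Done;
}
```

On the slow path the counter is left untouched, so re-executing the call in the interpreter observes exactly the same state and can report the error.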
------------- PR Comment: https://git.openjdk.org/jdk/pull/19496#issuecomment-2304142364 From qxing at openjdk.org Thu Aug 22 09:02:47 2024 From: qxing at openjdk.org (Qizheng Xing) Date: Thu, 22 Aug 2024 09:02:47 GMT Subject: RFR: 8333334: C2: Make result of `Node::dominates` more precise to enhance scalar replacement [v8] In-Reply-To: References: Message-ID: > This patch changes the algorithm of `Node::dominates` to make the result more precise, and allows the iterators of `ConcurrentHashMap` to be scalar replaced. > > The previous algorithm will return a conservative result when encountering a dead control flow, and only try the first two input paths of a multi-input Region node, which may prevent the scalar replacement in some cases. > > For example, with G1 GC enabled, C2 generates GC barriers for `ConcurrentHashMap` iteration operations at some early phases, and then eliminates them in a later IGVN, but `LoadNode` is also idealized in the same IGVN. This causes `LoadNode::Ideal` to see some dead barrier control flows, and refuse to split some instance field loads through Phi due to the conservative result of `Node::dominates`, and thus the scalar replacement can not be applied to iterators in the later macro elimination phase. > > This patch allows `Node::dominates` to try other paths of the last multi-input Region node when the first path is dead, and makes `ConcurrentHashMap` iteration ~30% faster: > > > Benchmark (nkeys) Mode Cnt Score Error Units > Maps.testConcurrentHashMapIterators 10000 avgt 15 414099.085 ? 33230.945 ns/op # baseline > Maps.testConcurrentHashMapIterators 10000 avgt 15 315490.281 ? 3037.056 ns/op # patch > > > Testing: tier1-4. Qizheng Xing has updated the pull request incrementally with four additional commits since the last revision: - Add wrapper method for checking `DomResult` of `all_controls_dominate`. - Remove redundant `applyIf` and fix style for IR test. - Fix style. - Add brackets around modified if/else branches. 
------------- Changes: - all: https://git.openjdk.org/jdk/pull/19496/files - new: https://git.openjdk.org/jdk/pull/19496/files/35e7a0d8..8ee38498 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=19496&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=19496&range=06-07 Stats: 62 lines in 4 files changed: 28 ins; 1 del; 33 mod Patch: https://git.openjdk.org/jdk/pull/19496.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/19496/head:pull/19496 PR: https://git.openjdk.org/jdk/pull/19496 From thartmann at openjdk.org Thu Aug 22 09:15:04 2024 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 22 Aug 2024 09:15:04 GMT Subject: RFR: 8338745: Intrinsify Continuation.pin() and Continuation.unpin() In-Reply-To: <6bsD-W2hquUgwdv7sxyEZHABzNqTNLyxm3d-GMMOR1g=.6d8c102b-e1c7-4408-aadf-6823c52f2ce0@github.com> References: <6bsD-W2hquUgwdv7sxyEZHABzNqTNLyxm3d-GMMOR1g=.6d8c102b-e1c7-4408-aadf-6823c52f2ce0@github.com> Message-ID: On Thu, 22 Aug 2024 07:14:29 GMT, Dean Long wrote: >> src/hotspot/share/opto/library_call.cpp line 3787: >> >>> 3785: set_all_memory(input_memory_state); >>> 3786: uncommon_trap_exact(Deoptimization::Reason_intrinsic, >>> 3787: Deoptimization::Action_reinterpret); >> >> Why do you use `Action_reinterpret` and not `Action_make_not_entrant` here? >> >> You may need a check to avoid endless re-compilation of a method that always hits the trap: >> >> if (too_many_traps(Deoptimization::Reason_intrinsic)) { >> return false; >> } >> >> >> See other intrinsics. > > A pin_count overflow/underflow should be a per-thread condition, not global. If there is nothing in the nmethod to be invalidated, maybe this should be `Action_none`? Right, that would make sense to me because `Deoptimization::Action_reinterpret` might also invalidate the nmethod. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20664#discussion_r1726664123 From mgronlun at openjdk.org Thu Aug 22 09:15:04 2024 From: mgronlun at openjdk.org (Markus =?UTF-8?B?R3LDtm5sdW5k?=) Date: Thu, 22 Aug 2024 09:15:04 GMT Subject: RFR: 8338745: Intrinsify Continuation.pin() and Continuation.unpin() In-Reply-To: References: <6bsD-W2hquUgwdv7sxyEZHABzNqTNLyxm3d-GMMOR1g=.6d8c102b-e1c7-4408-aadf-6823c52f2ce0@github.com> Message-ID: On Thu, 22 Aug 2024 09:09:15 GMT, Tobias Hartmann wrote: >> A pin_count overflow/underflow should be a per-thread condition, not global. If there is nothing in the nmethod to be invalidated, maybe this should be `Action_none`? > > Right, that would make sense to me because `Deoptimization::Action_reinterpret` might also invalidate the nmethod. The functional requirement I have is that the branch takes an uncommon trap and restarts / re-executes the same method the interpreter, because that version enters the VM where an IllegalStateException is thrown. I don't need the compiled method to be invalidated, only that an attempt that over/underflows (thread-local) restarts in the interpeter for the exception to be thrown. Is Action_none better suited for this purpose? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20664#discussion_r1726665198 From mgronlun at openjdk.org Thu Aug 22 09:15:04 2024 From: mgronlun at openjdk.org (Markus =?UTF-8?B?R3LDtm5sdW5k?=) Date: Thu, 22 Aug 2024 09:15:04 GMT Subject: RFR: 8338745: Intrinsify Continuation.pin() and Continuation.unpin() In-Reply-To: References: <6bsD-W2hquUgwdv7sxyEZHABzNqTNLyxm3d-GMMOR1g=.6d8c102b-e1c7-4408-aadf-6823c52f2ce0@github.com> Message-ID: On Thu, 22 Aug 2024 09:09:52 GMT, Markus Gr?nlund wrote: >> Right, that would make sense to me because `Deoptimization::Action_reinterpret` might also invalidate the nmethod. 
> > The functional requirement I have is that the branch takes an uncommon trap and restarts / re-executes the same method the interpreter, because that version enters the VM where an IllegalStateException is thrown. > > I don't need the compiled method to be invalidated, only that an attempt that over/underflows (thread-local) restarts in the interpeter for the exception to be thrown. Is Action_none better suited for this purpose? The pattern of the uncommon trap construct was taken from the precedent in LibraryCallKit::inline_profileBoolean(). ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20664#discussion_r1726667330 From thartmann at openjdk.org Thu Aug 22 09:16:07 2024 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 22 Aug 2024 09:16:07 GMT Subject: RFR: 8333334: C2: Make result of `Node::dominates` more precise to enhance scalar replacement [v8] In-Reply-To: References: Message-ID: <0BpAllKEaewoCZRHaouNoRFzduFgvTsh5cOQBzif_V4=.68759a0e-3320-4400-9773-f39ce40e914b@github.com> On Thu, 22 Aug 2024 09:02:47 GMT, Qizheng Xing wrote: >> This patch changes the algorithm of `Node::dominates` to make the result more precise, and allows the iterators of `ConcurrentHashMap` to be scalar replaced. >> >> The previous algorithm will return a conservative result when encountering a dead control flow, and only try the first two input paths of a multi-input Region node, which may prevent the scalar replacement in some cases. >> >> For example, with G1 GC enabled, C2 generates GC barriers for `ConcurrentHashMap` iteration operations at some early phases, and then eliminates them in a later IGVN, but `LoadNode` is also idealized in the same IGVN. This causes `LoadNode::Ideal` to see some dead barrier control flows, and refuse to split some instance field loads through Phi due to the conservative result of `Node::dominates`, and thus the scalar replacement can not be applied to iterators in the later macro elimination phase. 
>> >> This patch allows `Node::dominates` to try other paths of the last multi-input Region node when the first path is dead, and makes `ConcurrentHashMap` iteration ~30% faster: >> >> >> Benchmark (nkeys) Mode Cnt Score Error Units >> Maps.testConcurrentHashMapIterators 10000 avgt 15 414099.085 ± 33230.945 ns/op # baseline >> Maps.testConcurrentHashMapIterators 10000 avgt 15 315490.281 ± 3037.056 ns/op # patch >> >> >> Testing: tier1-4. > > Qizheng Xing has updated the pull request incrementally with four additional commits since the last revision: > > - Add wrapper method for checking `DomResult` of `all_controls_dominate`. > - Remove redundant `applyIf` and fix style for IR test. > - Fix style. > - Add brackets around modified if/else branches. Still good. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/19496#pullrequestreview-2253945939 From mgronlun at openjdk.org Thu Aug 22 09:15:05 2024 From: mgronlun at openjdk.org (Markus Grönlund) Date: Thu, 22 Aug 2024 09:15:05 GMT Subject: RFR: 8338745: Intrinsify Continuation.pin() and Continuation.unpin() In-Reply-To: References: Message-ID: On Wed, 21 Aug 2024 21:19:03 GMT, Vladimir Kozlov wrote: >> Greetings, >> >> Please help review this change set that implements C2 intrinsics for jdk.internal.vm.Continuation.pin() and jdk.internal.vm.Continuation.unpin(). >> >> This work is a consequence of [JDK-8338417](https://bugs.openjdk.org/browse/JDK-8338417), which required us to introduce explicit pin constructs for Virtual threads in a relatively performance-sensitive path. >> >> Testing: jdk_jfr, loom testing >> >> Comment: I changed the type of the ContinuationEntry::_pin_count field from uint to uint32_t to make the size explicit and to access it uniformly from the intrinsic code using T_INT.
>> >> Thanks >> Markus > > src/hotspot/share/opto/library_call.cpp line 3815: > >> 3813: bool LibraryCallKit::inline_native_Continuation_unpin() { >> 3814: return inline_native_Continuation_pinning_shared_impl(true); >> 3815: } > > I don't see the need for these 2 methods. You can directly call inline_native_Continuation_pinning_shared_impl() in the switch in try_to_inline() and pass true or false there. > You may also rename it to `inline_native_Continuation_pinning()`. Ok. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20664#discussion_r1726670122 From mgronlun at openjdk.org Thu Aug 22 09:27:03 2024 From: mgronlun at openjdk.org (Markus Grönlund) Date: Thu, 22 Aug 2024 09:27:03 GMT Subject: RFR: 8338745: Intrinsify Continuation.pin() and Continuation.unpin() In-Reply-To: References: <6bsD-W2hquUgwdv7sxyEZHABzNqTNLyxm3d-GMMOR1g=.6d8c102b-e1c7-4408-aadf-6823c52f2ce0@github.com> Message-ID: On Thu, 22 Aug 2024 09:11:03 GMT, Markus Grönlund wrote: >> The functional requirement I have is that the branch takes an uncommon trap and restarts / re-executes the same method in the interpreter, because that version enters the VM, where an IllegalStateException is thrown. >> >> I don't need the compiled method to be invalidated, only that an attempt that over/underflows (thread-local) restarts in the interpreter for the exception to be thrown. Is Action_none better suited for this purpose? > > The pattern of the uncommon trap construct was taken from the precedent in LibraryCallKit::inline_profileBoolean(). Is it an implicit invariant that execution always continues in the interpreter after an uncommon trap? I.e., I don't need to explicitly tell it to "re-execute" there?
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20664#discussion_r1726689737 From chagedorn at openjdk.org Thu Aug 22 09:31:08 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 22 Aug 2024 09:31:08 GMT Subject: RFR: 8333334: C2: Make result of `Node::dominates` more precise to enhance scalar replacement [v8] In-Reply-To: References: Message-ID: On Thu, 22 Aug 2024 09:02:47 GMT, Qizheng Xing wrote: >> This patch changes the algorithm of `Node::dominates` to make the result more precise, and allows the iterators of `ConcurrentHashMap` to be scalar replaced. >> >> The previous algorithm will return a conservative result when encountering a dead control flow, and only try the first two input paths of a multi-input Region node, which may prevent the scalar replacement in some cases. >> >> For example, with G1 GC enabled, C2 generates GC barriers for `ConcurrentHashMap` iteration operations at some early phases, and then eliminates them in a later IGVN, but `LoadNode` is also idealized in the same IGVN. This causes `LoadNode::Ideal` to see some dead barrier control flows, and refuse to split some instance field loads through Phi due to the conservative result of `Node::dominates`, and thus the scalar replacement cannot be applied to iterators in the later macro elimination phase. >> >> This patch allows `Node::dominates` to try other paths of the last multi-input Region node when the first path is dead, and makes `ConcurrentHashMap` iteration ~30% faster: >> >> >> Benchmark (nkeys) Mode Cnt Score Error Units >> Maps.testConcurrentHashMapIterators 10000 avgt 15 414099.085 ± 33230.945 ns/op # baseline >> Maps.testConcurrentHashMapIterators 10000 avgt 15 315490.281 ± 3037.056 ns/op # patch >> >> >> Testing: tier1-4. > > Qizheng Xing has updated the pull request incrementally with four additional commits since the last revision: > > - Add wrapper method for checking `DomResult` of `all_controls_dominate`.
> - Remove redundant `applyIf` and fix style for IR test. > - Fix style. > - Add brackets around modified if/else branches. Otherwise, the update looks good, thanks! src/hotspot/share/opto/memnode.cpp line 1717: > 1715: // Wait for the dead code to be removed. > 1716: // The dead code will eventually be removed in IGVN, > 1717: // so we have an unambiguous result whether it's dominated or not. Suggestion: // There is some dead code which eventually will be removed in IGVN. Once this is the case, we get an unambiguous // dominance result. Push the node to the worklist again until the dead code is removed. ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/19496#pullrequestreview-2253980782 PR Review Comment: https://git.openjdk.org/jdk/pull/19496#discussion_r1726697616 From qxing at openjdk.org Thu Aug 22 09:40:22 2024 From: qxing at openjdk.org (Qizheng Xing) Date: Thu, 22 Aug 2024 09:40:22 GMT Subject: RFR: 8333334: C2: Make result of `Node::dominates` more precise to enhance scalar replacement [v9] In-Reply-To: References: Message-ID: > This patch changes the algorithm of `Node::dominates` to make the result more precise, and allows the iterators of `ConcurrentHashMap` to be scalar replaced. > > The previous algorithm will return a conservative result when encountering a dead control flow, and only try the first two input paths of a multi-input Region node, which may prevent the scalar replacement in some cases. > > For example, with G1 GC enabled, C2 generates GC barriers for `ConcurrentHashMap` iteration operations at some early phases, and then eliminates them in a later IGVN, but `LoadNode` is also idealized in the same IGVN. This causes `LoadNode::Ideal` to see some dead barrier control flows, and refuse to split some instance field loads through Phi due to the conservative result of `Node::dominates`, and thus the scalar replacement cannot be applied to iterators in the later macro elimination phase.
> > This patch allows `Node::dominates` to try other paths of the last multi-input Region node when the first path is dead, and makes `ConcurrentHashMap` iteration ~30% faster: > > > Benchmark (nkeys) Mode Cnt Score Error Units > Maps.testConcurrentHashMapIterators 10000 avgt 15 414099.085 ± 33230.945 ns/op # baseline > Maps.testConcurrentHashMapIterators 10000 avgt 15 315490.281 ± 3037.056 ns/op # patch > > > Testing: tier1-4. Qizheng Xing has updated the pull request incrementally with two additional commits since the last revision: - Fix whitespace. - Update comments in method `split_through_phi`. Co-authored-by: Christian Hagedorn ------------- Changes: - all: https://git.openjdk.org/jdk/pull/19496/files - new: https://git.openjdk.org/jdk/pull/19496/files/8ee38498..f8118454 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=19496&range=08 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=19496&range=07-08 Stats: 3 lines in 1 file changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/19496.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/19496/head:pull/19496 PR: https://git.openjdk.org/jdk/pull/19496 From chagedorn at openjdk.org Thu Aug 22 10:47:07 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 22 Aug 2024 10:47:07 GMT Subject: RFR: 8333334: C2: Make result of `Node::dominates` more precise to enhance scalar replacement [v9] In-Reply-To: References: Message-ID: On Thu, 22 Aug 2024 09:40:22 GMT, Qizheng Xing wrote: >> This patch changes the algorithm of `Node::dominates` to make the result more precise, and allows the iterators of `ConcurrentHashMap` to be scalar replaced. >> >> The previous algorithm will return a conservative result when encountering a dead control flow, and only try the first two input paths of a multi-input Region node, which may prevent the scalar replacement in some cases.
>> >> For example, with G1 GC enabled, C2 generates GC barriers for `ConcurrentHashMap` iteration operations at some early phases, and then eliminates them in a later IGVN, but `LoadNode` is also idealized in the same IGVN. This causes `LoadNode::Ideal` to see some dead barrier control flows, and refuse to split some instance field loads through Phi due to the conservative result of `Node::dominates`, and thus the scalar replacement cannot be applied to iterators in the later macro elimination phase. >> >> This patch allows `Node::dominates` to try other paths of the last multi-input Region node when the first path is dead, and makes `ConcurrentHashMap` iteration ~30% faster: >> >> >> Benchmark (nkeys) Mode Cnt Score Error Units >> Maps.testConcurrentHashMapIterators 10000 avgt 15 414099.085 ± 33230.945 ns/op # baseline >> Maps.testConcurrentHashMapIterators 10000 avgt 15 315490.281 ± 3037.056 ns/op # patch >> >> >> Testing: tier1-4. > > Qizheng Xing has updated the pull request incrementally with two additional commits since the last revision: > > - Fix whitespace. > - Update comments in method `split_through_phi`. > > Co-authored-by: Christian Hagedorn Marked as reviewed by chagedorn (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/19496#pullrequestreview-2254171973 From mgronlun at openjdk.org Thu Aug 22 11:13:17 2024 From: mgronlun at openjdk.org (Markus Grönlund) Date: Thu, 22 Aug 2024 11:13:17 GMT Subject: RFR: 8338745: Intrinsify Continuation.pin() and Continuation.unpin() [v2] In-Reply-To: References: Message-ID: > Greetings, > > Please help review this change set that implements C2 intrinsics for jdk.internal.vm.Continuation.pin() and jdk.internal.vm.Continuation.unpin(). > > This work is a consequence of [JDK-8338417](https://bugs.openjdk.org/browse/JDK-8338417), which required us to introduce explicit pin constructs for Virtual threads in a relatively performance-sensitive path.
> > Testing: jdk_jfr, loom testing > > Comment: I changed the type of the ContinuationEntry::_pin_count field from uint to uint32_t to make the size explicit and to access it uniformly from the intrinsic code using T_INT. > > Thanks > Markus Markus Grönlund has updated the pull request incrementally with one additional commit since the last revision: Deoptimization::Action_none for no deopt ------------- Changes: - all: https://git.openjdk.org/jdk/pull/20664/files - new: https://git.openjdk.org/jdk/pull/20664/files/ab3af53a..fa9a737a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20664&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20664&range=00-01 Stats: 20 lines in 2 files changed: 1 ins; 13 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/20664.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20664/head:pull/20664 PR: https://git.openjdk.org/jdk/pull/20664 From mgronlun at openjdk.org Thu Aug 22 11:24:05 2024 From: mgronlun at openjdk.org (Markus Grönlund) Date: Thu, 22 Aug 2024 11:24:05 GMT Subject: RFR: 8338745: Intrinsify Continuation.pin() and Continuation.unpin() [v2] In-Reply-To: References: <6bsD-W2hquUgwdv7sxyEZHABzNqTNLyxm3d-GMMOR1g=.6d8c102b-e1c7-4408-aadf-6823c52f2ce0@github.com> Message-ID: On Thu, 22 Aug 2024 09:24:18 GMT, Markus Grönlund wrote: >> The pattern of the uncommon trap construct was taken from the precedent in LibraryCallKit::inline_profileBoolean(). > > Is it an implicit invariant that execution always continues in the interpreter after an uncommon trap? I.e., I don't need to explicitly tell it to "re-execute" there? It is updated to use Action_none to keep the nmethod. The trap code picks up the correct bytecode (invokestatic) from the trap scope. So after unrolling the host method (the inliner), the trap bytecode (i.e., the invokestatic call to Continuation.pin() or unpin()) is re-executed in the interpreter.
This is also without setting the explicit re-execute state (which may mean something else). ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20664#discussion_r1726868890 From yzheng at openjdk.org Thu Aug 22 11:30:05 2024 From: yzheng at openjdk.org (Yudi Zheng) Date: Thu, 22 Aug 2024 11:30:05 GMT Subject: RFR: 8338745: Intrinsify Continuation.pin() and Continuation.unpin() [v2] In-Reply-To: References: Message-ID: On Thu, 22 Aug 2024 11:13:17 GMT, Markus Grönlund wrote: >> Greetings, >> >> Please help review this change set that implements C2 intrinsics for jdk.internal.vm.Continuation.pin() and jdk.internal.vm.Continuation.unpin(). >> >> This work is a consequence of [JDK-8338417](https://bugs.openjdk.org/browse/JDK-8338417), which required us to introduce explicit pin constructs for Virtual threads in a relatively performance-sensitive path. >> >> Testing: jdk_jfr, loom testing >> >> Comment: I changed the type of the ContinuationEntry::_pin_count field from uint to uint32_t to make the size explicit and to access it uniformly from the intrinsic code using T_INT. >> >> Thanks >> Markus > > Markus Grönlund has updated the pull request incrementally with one additional commit since the last revision: > > Deoptimization::Action_none for no deopt src/hotspot/share/opto/library_call.cpp line 3733: > 3731: // TLS > 3732: Node* tls_ptr = _gvn.transform(new ThreadLocalNode()); > 3733: Node* last_continuation_offset = basic_plus_adr(top(), tls_ptr, in_bytes(JavaThread::cont_entry_offset())); Could you please export `JavaThread::_cont_entry` and `ContinuationEntry::_pin_count` to JVMCI? Thanks!
diff --git a/src/hotspot/share/jvmci/vmStructs_jvmci.cpp b/src/hotspot/share/jvmci/vmStructs_jvmci.cpp
index 688691fb976..a25ecd2bab5 100644
--- a/src/hotspot/share/jvmci/vmStructs_jvmci.cpp
+++ b/src/hotspot/share/jvmci/vmStructs_jvmci.cpp
@@ -35,6 +35,7 @@
 #include "oops/methodCounters.hpp"
 #include "oops/objArrayKlass.hpp"
 #include "prims/jvmtiThreadState.hpp"
+#include "runtime/continuationEntry.hpp"
 #include "runtime/deoptimization.hpp"
 #include "runtime/flags/jvmFlag.hpp"
 #include "runtime/osThread.hpp"
@@ -244,10 +245,13 @@
   nonstatic_field(JavaThread, _held_monitor_count, intx) \
   nonstatic_field(JavaThread, _lock_stack, LockStack) \
   nonstatic_field(JavaThread, _om_cache, OMCache) \
+  nonstatic_field(JavaThread, _cont_entry, ContinuationEntry*) \
   JVMTI_ONLY(nonstatic_field(JavaThread, _is_in_VTMS_transition, bool)) \
   JVMTI_ONLY(nonstatic_field(JavaThread, _is_in_tmp_VTMS_transition, bool)) \
   JVMTI_ONLY(nonstatic_field(JavaThread, _is_disable_suspend, bool)) \
   \
+  nonstatic_field(ContinuationEntry, _pin_count, uint32_t) \
+  \
   nonstatic_field(LockStack, _top, uint32_t) \
   \
   JVMTI_ONLY(static_field(JvmtiVTMSTransitionDisabler, _VTMS_notify_jvmti_events, bool)) \
diff --git a/src/hotspot/share/runtime/continuationEntry.hpp b/src/hotspot/share/runtime/continuationEntry.hpp
index 459321f444c..ac76cd6f088 100644
--- a/src/hotspot/share/runtime/continuationEntry.hpp
+++ b/src/hotspot/share/runtime/continuationEntry.hpp
@@ -39,6 +39,7 @@ class RegisterMap;
 
 // Metadata stored in the continuation entry frame
 class ContinuationEntry {
+  friend class JVMCIVMStructs;
   ContinuationEntryPD _pd;
 #ifdef ASSERT
  private:
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20664#discussion_r1726876549 From mgronlun at openjdk.org Thu Aug 22 12:10:37 2024 From: mgronlun at openjdk.org (Markus Grönlund) Date: Thu, 22 Aug 2024 12:10:37 GMT Subject: RFR: 8338745: Intrinsify Continuation.pin() and Continuation.unpin() [v3] In-Reply-To: References: 
Message-ID: > Greetings, > > Please help review this change set that implements C2 intrinsics for jdk.internal.vm.Continuation.pin() and jdk.internal.vm.Continuation.unpin(). > > This work is a consequence of [JDK-8338417](https://bugs.openjdk.org/browse/JDK-8338417), which required us to introduce explicit pin constructs for Virtual threads in a relatively performance-sensitive path. > > Testing: jdk_jfr, loom testing > > Comment: I changed the type of the ContinuationEntry::_pin_count field from uint to uint32_t to make the size explicit and to access it uniformly from the intrinsic code using T_INT. > > Thanks > Markus Markus Grönlund has updated the pull request incrementally with one additional commit since the last revision: JVMCI exportation ------------- Changes: - all: https://git.openjdk.org/jdk/pull/20664/files - new: https://git.openjdk.org/jdk/pull/20664/files/fa9a737a..ba68f2c2 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20664&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20664&range=01-02 Stats: 4 lines in 2 files changed: 4 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/20664.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20664/head:pull/20664 PR: https://git.openjdk.org/jdk/pull/20664 From mgronlun at openjdk.org Thu Aug 22 12:10:38 2024 From: mgronlun at openjdk.org (Markus Grönlund) Date: Thu, 22 Aug 2024 12:10:38 GMT Subject: RFR: 8338745: Intrinsify Continuation.pin() and Continuation.unpin() [v3] In-Reply-To: References: Message-ID: On Thu, 22 Aug 2024 11:27:27 GMT, Yudi Zheng wrote: >> Markus Grönlund has updated the pull request incrementally with one additional commit since the last revision: >> >> JVMCI exportation > > src/hotspot/share/opto/library_call.cpp line 3733: > >> 3731: // TLS >> 3732: Node* tls_ptr = _gvn.transform(new ThreadLocalNode()); >> 3733: Node* last_continuation_offset = basic_plus_adr(top(), tls_ptr, in_bytes(JavaThread::cont_entry_offset())); > > Could
you please export `JavaThread::_cont_entry` and `ContinuationEntry::_pin_count` to JVMCI? Thanks! > > diff --git a/src/hotspot/share/jvmci/vmStructs_jvmci.cpp b/src/hotspot/share/jvmci/vmStructs_jvmci.cpp > index 688691fb976..a25ecd2bab5 100644 > --- a/src/hotspot/share/jvmci/vmStructs_jvmci.cpp > +++ b/src/hotspot/share/jvmci/vmStructs_jvmci.cpp > @@ -35,6 +35,7 @@ > #include "oops/methodCounters.hpp" > #include "oops/objArrayKlass.hpp" > #include "prims/jvmtiThreadState.hpp" > +#include "runtime/continuationEntry.hpp" > #include "runtime/deoptimization.hpp" > #include "runtime/flags/jvmFlag.hpp" > #include "runtime/osThread.hpp" > @@ -244,10 +245,13 @@ > nonstatic_field(JavaThread, _held_monitor_count, intx) \ > nonstatic_field(JavaThread, _lock_stack, LockStack) \ > nonstatic_field(JavaThread, _om_cache, OMCache) \ > + nonstatic_field(JavaThread, _cont_entry, ContinuationEntry*) \ > JVMTI_ONLY(nonstatic_field(JavaThread, _is_in_VTMS_transition, bool)) \ > JVMTI_ONLY(nonstatic_field(JavaThread, _is_in_tmp_VTMS_transition, bool)) \ > JVMTI_ONLY(nonstatic_field(JavaThread, _is_disable_suspend, bool)) \ > \ > + nonstatic_field(ContinuationEntry, _pin_count, uint32_t) \ > + \ > nonstatic_field(LockStack, _top, uint32_t) \ > ... Done (only tested that it builds). 
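For readers following the intrinsic, the pin-count bookkeeping discussed in this thread can be pictured with a small hypothetical Java model: an unsigned 32-bit counter (the `uint32_t` field, read as `T_INT`) held in a Java `int`, with a fast path that increments or decrements it and a slow path for overflow or underflow. The class name `PinCountModel`, the exception type, and the messages are assumptions for illustration only. In the real intrinsic the slow path is an uncommon trap that re-executes the call in the interpreter, and this sketch is not the JDK code.

```java
// Hypothetical model, not the JDK implementation: a uint32_t pin counter
// accessed as a Java int (T_INT), with fast and slow paths.
final class PinCountModel {
    // Unsigned 32-bit count stored in an int; -1 represents 0xFFFFFFFF.
    private int pinCount;

    void pin() {
        // Fast path: increment unless the unsigned counter is already at its maximum.
        if (pinCount == -1) {
            throw slowPath("pin overflow");
        }
        pinCount++;
    }

    void unpin() {
        // Fast path: decrement unless the counter is already zero.
        if (pinCount == 0) {
            throw slowPath("pin underflow");
        }
        pinCount--;
    }

    boolean isPinned() {
        return pinCount != 0;
    }

    long count() {
        // Widen as unsigned so 0xFFFFFFFF reads as 4294967295, not -1.
        return Integer.toUnsignedLong(pinCount);
    }

    private static IllegalStateException slowPath(String reason) {
        // Stands in for the uncommon trap (assumed exception type).
        return new IllegalStateException(reason);
    }
}
```

Keeping the counter in an `int` and comparing against `-1` mirrors how an unsigned 32-bit field can be accessed uniformly as `T_INT`.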
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20664#discussion_r1726932124 From roland at openjdk.org Thu Aug 22 14:46:06 2024 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 22 Aug 2024 14:46:06 GMT Subject: RFR: 8334724: C2: remove PhaseIdealLoop::cast_incr_before_loop() [v2] In-Reply-To: References: <1w8L64U3JeY8qKskSNm4OlTablOucHHThfW-9nakz4E=.e52b0193-9048-4382-8242-f5296483898a@github.com> <6kgDCB1rxZyn1JEX-8hvKyOQE07oMs8_kr7Cjbix3Gg=.61a77a4f-394c-48d6-a482-fe73f3314f5b@github.com> Message-ID: On Wed, 26 Jun 2024 15:02:18 GMT, Vladimir Kozlov wrote: >> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: >> >> review > > Looks good. @vnkozlov what do you think of the comment above about performance regressions? In your opinion, is that enough data to proceed with the change as is? ------------- PR Comment: https://git.openjdk.org/jdk/pull/19831#issuecomment-2304853536 From roland at openjdk.org Thu Aug 22 15:09:19 2024 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 22 Aug 2024 15:09:19 GMT Subject: RFR: 8336830: C2: assert(get_loop(lca)->_nest < n_loop->_nest || lca->in(0)->is_NeverBranch()) failed: must not be moved into inner loop [v2] In-Reply-To: References: Message-ID: > A store is sunk from a counted loop into an enclosing infinite > loop. The assert fires because: > > > get_loop(lca)->_nest < n_loop->_nest > > > is false. That happens because the outer loop was found to be infinite > in the current loop opts pass. When that happens, it's not properly > attached to the loop tree. The second part of the assert was added to > cover a similar case: > > > lca->in(0)->is_NeverBranch() > > > but it doesn't work in this case because lca is not a projection of the > `NeverBranch`. It's the exit projection of the counted loop. The fix I > propose changes that part of the assert to test that lca is, indeed, > in an infinite loop in a way that's robust.
> > I also removed some code that I believe to be useless following > 8335709. Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: undo unrelated change ------------- Changes: - all: https://git.openjdk.org/jdk/pull/20334/files - new: https://git.openjdk.org/jdk/pull/20334/files/420873b7..6d0444ba Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20334&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20334&range=00-01 Stats: 3 lines in 1 file changed: 2 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/20334.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20334/head:pull/20334 PR: https://git.openjdk.org/jdk/pull/20334 From roland at openjdk.org Thu Aug 22 15:09:19 2024 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 22 Aug 2024 15:09:19 GMT Subject: RFR: 8336830: C2: assert(get_loop(lca)->_nest < n_loop->_nest || lca->in(0)->is_NeverBranch()) failed: must not be moved into inner loop [v2] In-Reply-To: References: Message-ID: On Tue, 30 Jul 2024 16:52:31 GMT, Vladimir Kozlov wrote: >> src/hotspot/share/opto/loopopts.cpp line 1275: >> >>> 1273: for (;;) { >>> 1274: Node* dom = idom(useblock); >>> 1275: if (loop->is_member(get_loop(dom))) { >> >> This would not be backported, right? Do you think we should do it in a separate RFE? Or is it necessary for the fix? > > Yes, it should be separate RFE or you back port [JDK-8335709](https://bugs.openjdk.org/browse/JDK-8335709) first. Right. That makes sense. I removed it in latest commit and will create a separate PR for it. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20334#discussion_r1727259517 From roland at openjdk.org Thu Aug 22 15:31:16 2024 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 22 Aug 2024 15:31:16 GMT Subject: RFR: 8338844: C2: remove useless code in PhaseIdealLoop::place_outside_loop() after 8335709 Message-ID: This removes code that shouldn't be necessary now that the NeverBranch/CProj nodes are assigned the correct loop. I proposed this initially as part of https://github.com/openjdk/jdk/pull/20334 where the recommendation was to take care of it separately to make backports easier. ------------- Commit messages: - fix Changes: https://git.openjdk.org/jdk/pull/20678/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=20678&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8338844 Stats: 3 lines in 1 file changed: 0 ins; 2 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/20678.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20678/head:pull/20678 PR: https://git.openjdk.org/jdk/pull/20678 From kvn at openjdk.org Thu Aug 22 15:54:08 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 22 Aug 2024 15:54:08 GMT Subject: RFR: 8338745: Intrinsify Continuation.pin() and Continuation.unpin() [v3] In-Reply-To: References: Message-ID: On Thu, 22 Aug 2024 12:10:37 GMT, Markus Grönlund wrote: >> Greetings, >> >> Please help review this change set that implements C2 intrinsics for jdk.internal.vm.Continuation.pin() and jdk.internal.vm.Continuation.unpin(). >> >> This work is a consequence of [JDK-8338417](https://bugs.openjdk.org/browse/JDK-8338417), which required us to introduce explicit pin constructs for Virtual threads in a relatively performance-sensitive path. >> >> Testing: jdk_jfr, loom testing >> >> Comment: I changed the type of the ContinuationEntry::_pin_count field from uint to uint32_t to make the size explicit and to access it uniformly from the intrinsic code using T_INT.
>> >> Thanks >> Markus > > Markus Grönlund has updated the pull request incrementally with one additional commit since the last revision: > > JVMCI exportation Marked as reviewed by kvn (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/20664#pullrequestreview-2255034428 From kvn at openjdk.org Thu Aug 22 15:54:08 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 22 Aug 2024 15:54:08 GMT Subject: RFR: 8338745: Intrinsify Continuation.pin() and Continuation.unpin() [v3] In-Reply-To: References: <6bsD-W2hquUgwdv7sxyEZHABzNqTNLyxm3d-GMMOR1g=.6d8c102b-e1c7-4408-aadf-6823c52f2ce0@github.com> Message-ID: On Thu, 22 Aug 2024 11:21:46 GMT, Markus Grönlund wrote: >> Is it an implicit invariant that execution always continues in the interpreter after an uncommon trap? I.e., I don't need to explicitly tell it to "re-execute" there? > > It is updated to use Action::none to keep the nmethod. The trap code picks up the correct bytecode (invokestatic) from the trap scope. So after unrolling the host method (the inliner), the trap bytecode (i.e., the invokestatic call to Continuation.pin() or unpin()) is re-executed in the interpreter. This is also without setting the explicit re-execute state (which may mean something else). Yes, we should not throw out the compiled nmethod if one thread needs to go into the Interpreter and throw an exception. Other threads will continue to use this nmethod. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20664#discussion_r1727404499 From chagedorn at openjdk.org Thu Aug 22 15:56:07 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 22 Aug 2024 15:56:07 GMT Subject: RFR: 8338844: C2: remove useless code in PhaseIdealLoop::place_outside_loop() after 8335709 In-Reply-To: References: Message-ID: On Thu, 22 Aug 2024 15:26:40 GMT, Roland Westrelin wrote: > This removes code that shouldn't be necessary now that the > NeverBranch/CProj nodes are assigned the correct loop.
I proposed this > initially as part of https://github.com/openjdk/jdk/pull/20334 where > the recommendation was to take care of it separately to make > backports easier. Looks good to me. ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/20678#pullrequestreview-2255039207 From kvn at openjdk.org Thu Aug 22 16:08:08 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 22 Aug 2024 16:08:08 GMT Subject: RFR: 8333334: C2: Make result of `Node::dominates` more precise to enhance scalar replacement [v9] In-Reply-To: References: Message-ID: On Thu, 22 Aug 2024 09:40:22 GMT, Qizheng Xing wrote: >> This patch changes the algorithm of `Node::dominates` to make the result more precise, and allows the iterators of `ConcurrentHashMap` to be scalar replaced. >> >> The previous algorithm will return a conservative result when encountering a dead control flow, and only try the first two input paths of a multi-input Region node, which may prevent the scalar replacement in some cases. >> >> For example, with G1 GC enabled, C2 generates GC barriers for `ConcurrentHashMap` iteration operations at some early phases, and then eliminates them in a later IGVN, but `LoadNode` is also idealized in the same IGVN. This causes `LoadNode::Ideal` to see some dead barrier control flows, and refuse to split some instance field loads through Phi due to the conservative result of `Node::dominates`, and thus the scalar replacement can not be applied to iterators in the later macro elimination phase. >> >> This patch allows `Node::dominates` to try other paths of the last multi-input Region node when the first path is dead, and makes `ConcurrentHashMap` iteration ~30% faster: >> >> >> Benchmark (nkeys) Mode Cnt Score Error Units >> Maps.testConcurrentHashMapIterators 10000 avgt 15 414099.085 ± 33230.945 ns/op # baseline >> Maps.testConcurrentHashMapIterators 10000 avgt 15 315490.281 ± 3037.056 ns/op # patch >> >> >> Testing: tier1-4.
> > Qizheng Xing has updated the pull request incrementally with two additional commits since the last revision: > > - Fix whitespace. > - Update comments in method `split_through_phi`. > > Co-authored-by: Christian Hagedorn Good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/19496#pullrequestreview-2255068785 From kvn at openjdk.org Thu Aug 22 16:26:05 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 22 Aug 2024 16:26:05 GMT Subject: RFR: 8334724: C2: remove PhaseIdealLoop::cast_incr_before_loop() [v2] In-Reply-To: <6kgDCB1rxZyn1JEX-8hvKyOQE07oMs8_kr7Cjbix3Gg=.61a77a4f-394c-48d6-a482-fe73f3314f5b@github.com> References: <1w8L64U3JeY8qKskSNm4OlTablOucHHThfW-9nakz4E=.e52b0193-9048-4382-8242-f5296483898a@github.com> <6kgDCB1rxZyn1JEX-8hvKyOQE07oMs8_kr7Cjbix3Gg=.61a77a4f-394c-48d6-a482-fe73f3314f5b@github.com> Message-ID: <8QYwwkPgR9grcCttNV0KHliNjH9kmBGUt7kc2b5wPW0=.b10e5195-eb8e-4148-960d-670554b57ace@github.com> On Mon, 24 Jun 2024 12:29:38 GMT, Roland Westrelin wrote: >> I propose removing `PhaseIdealLoop::cast_incr_before_loop()` and the >> `CastII` nodes that it adds at counted loop heads. >> >> They were added to prevent nodes from floating above the zero trip guard >> when the backedge of a counted loop is removed. In particular, when a >> range check is hoisted by predication, pre/main/post loops are created >> and if one of the main or post loops loses its backedge, an array load >> that's control dependent on a predicate above the pre loop could float >> above the zero trip guard of the main or post loop. That can no longer >> happen AFAICT with changes related to assert predicates. The array >> load is now updated to have a control dependency that's below the zero >> trip guard. >> >> The reason I'm revisiting this is that I noticed that >> `PhaseIdealLoop::cast_incr_before_loop()` has a bug. When it adds the >>
When it adds the >> `CastII`, it looks for the loop phi and picks input 1 of the phi it >> finds as input to the `CastII`. To find the loop phi, it starts from >> the loop incremement and loop for a use that's a phi and has the loop >> head as control. It never checks that the phi it finds is the loop >> phi. There can be more than one phi as uses of the increment at the >> loop head and it can pick the wrong one. I tried to write a test case >> where this would cause a bug but couldn't actually find any use for >> the `CastII` anymore. >> >> In my testing, the only issue when the `CastII` are not added is that >> some IR tests for vectorization fails: >> >> compiler/vectorization/TestPopulateIndex.java >> compiler/vectorization/runner/ArrayShiftOpTest.java >> compiler/vectorization/runner/LoopArrayIndexComputeTest.java >> >> because removing the `CastII` causes split if to occur with some nodes >> that take the loop phi as input. That then causes pattern matching >> during superword to break. I added logic to prevent split if for those >> cases. > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > review I am not comfortable with big regression on MacOSX aarch64 even if you can't reproduce it locally. We need to rerun that testing to make sure it is random as you said. Please, merge latest JDK. ------------- PR Comment: https://git.openjdk.org/jdk/pull/19831#issuecomment-2305168673 From dlong at openjdk.org Thu Aug 22 19:51:07 2024 From: dlong at openjdk.org (Dean Long) Date: Thu, 22 Aug 2024 19:51:07 GMT Subject: RFR: 8338745: Intrinsify Continuation.pin() and Continuation.unpin() [v3] In-Reply-To: References: <6bsD-W2hquUgwdv7sxyEZHABzNqTNLyxm3d-GMMOR1g=.6d8c102b-e1c7-4408-aadf-6823c52f2ce0@github.com> Message-ID: On Thu, 22 Aug 2024 15:50:32 GMT, Vladimir Kozlov wrote: >> It is updated to use Action::none to keep the nmethod. 
The trap code picks up the correct bytecode (invokestatic) from the trap scope. So after unrolling the host method (the inliner), the trap bytecode (i.e., the invokestatic call to Continuation.pin() or unpin()) is re-executed in the interpreter. This is also without setting the explicit re-execute state (which may mean something else). > > Yes, we should not throw out the compiled nmethod if one thread needs to go into the Interpreter and throw an exception. > Other threads will continue to use this nmethod. > This is also without setting the explicit re-execute state (which may mean something else). I think reexecute is implicitly set for uncommon traps, and the explicit flag is for deoptimization safepoints. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20664#discussion_r1727725892 From jkarthikeyan at openjdk.org Thu Aug 22 20:27:06 2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Thu, 22 Aug 2024 20:27:06 GMT Subject: RFR: 8335444: Generalize implementation of AndNode mul_ring [v3] In-Reply-To: References: <7l_zt0Q884dkzJbP5kjhfvZPmFQ4CMRgakWoUCopfvw=.fc1307d0-6a2c-40cc-adc6-c1b21424d0dd@github.com> <8fVdLHTpCnPF-OMl9zFOgoCbENn2a_I06Iq1Ksapfs0=.056922a3-f56d-4ae0-ae63-c0a6d4463ba7@github.com> Message-ID: On Mon, 19 Aug 2024 11:51:53 GMT, Christian Hagedorn wrote: >> You're totally right! It is not even related to the LSB (`-14 & -6` would have the same problem with `-12`). Finding the leading 1s is the right solution. Thanks a lot for the clarification! > > Can the upper limit be improved similar to what you added for the "both ranges are positive" case if we know that both ranges are negative? > In the positive case, we have values from: > > 011...1 > 000...0 > while in the negative case, we have values from: > > 111...1 > 100...0 > > It suggests that we can then use the same argument as for the positive case and say that the maximum will be the maximum of the smaller range (i.e. `MIN2(r0->_hi, r1->_hi)`)?
This is a great observation! Since bitwise-and can only remove bits, the largest possible value is the smaller of each range's `hi` value so I think it's correct to use the minimum here rather than the maximum. I didn't look into this case too deeply initially since I didn't find any bitwise-and nodes of two negative ranges in my investigation, but I think we should include it since it's a simple enough condition to check. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20066#discussion_r1727766730 From jkarthikeyan at openjdk.org Thu Aug 22 20:33:04 2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Thu, 22 Aug 2024 20:33:04 GMT Subject: RFR: 8335444: Generalize implementation of AndNode mul_ring [v3] In-Reply-To: References: <7l_zt0Q884dkzJbP5kjhfvZPmFQ4CMRgakWoUCopfvw=.fc1307d0-6a2c-40cc-adc6-c1b21424d0dd@github.com> <6lNmXhkDKMNu7r8yYrBYs4JUJQ-drbj9Gj1_EaAqQj0=.c699170b-0551-4c04-8322-0182cf9b5866@github.com> Message-ID: On Mon, 19 Aug 2024 11:18:04 GMT, Christian Hagedorn wrote: >> Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: >> >> Check IR before macro expansion > > src/hotspot/share/opto/mulnode.cpp line 611: > >> 609: if (r0->_lo >= 0 && r1->_lo >= 0) { >> 610: return IntegerType::make(0, MIN2(r0->_hi, r1->_hi), widen); >> 611: } > > Since you've already worked out the math in the PR comment, do you also want to add it here to the different cases? It could help to support the correctness of the code. I agree, I think it would be good to include the math in the code to make it easier to understand. I'll add it in the next commit. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20066#discussion_r1727774283 From jkarthikeyan at openjdk.org Thu Aug 22 20:39:06 2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Thu, 22 Aug 2024 20:39:06 GMT Subject: RFR: 8335444: Generalize implementation of AndNode mul_ring [v3] In-Reply-To: References: <7l_zt0Q884dkzJbP5kjhfvZPmFQ4CMRgakWoUCopfvw=.fc1307d0-6a2c-40cc-adc6-c1b21424d0dd@github.com> <6lNmXhkDKMNu7r8yYrBYs4JUJQ-drbj9Gj1_EaAqQj0=.c699170b-0551-4c04-8322-0182cf9b5866@github.com> Message-ID: On Mon, 19 Aug 2024 12:06:16 GMT, Christian Hagedorn wrote: >> Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: >> >> Check IR before macro expansion > > src/hotspot/share/opto/mulnode.cpp line 635: > >> 633: // Since count_leading_zeros is undefined at 0 (~(-1)) the number of digits in the native type can be used instead, >> 634: // as it returns 31 and 63 for signed integers and longs respectively. >> 635: int shift_bits = sel_val == 0 ? std::numeric_limits::digits : count_leading_zeros(sel_val) - 1; > > `sel_val` can only be 0 if `r0->_lo` and `r1->_lo` are both -1. While I think it's correct how you handle the case here, wouldn't it be simpler/more readable if you handle this case separately by setting -1 as lower bound directly instead of using "`min >> #digits`"? This is a good point! The logic becomes a lot easier to parse when we don't need to work around `count_leading_zeros` being undefined at 0 and just set the lower bound manually. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20066#discussion_r1727781384 From jkarthikeyan at openjdk.org Fri Aug 23 03:36:08 2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Fri, 23 Aug 2024 03:36:08 GMT Subject: RFR: 8335444: Generalize implementation of AndNode mul_ring [v3] In-Reply-To: References: <7l_zt0Q884dkzJbP5kjhfvZPmFQ4CMRgakWoUCopfvw=.fc1307d0-6a2c-40cc-adc6-c1b21424d0dd@github.com> <6lNmXhkDKMNu7r8yYrBYs4JUJQ-drbj9Gj1_EaAqQj0=.c699170b-0551-4c04-8322-0182cf9b5866@github.com> Message-ID: On Mon, 19 Aug 2024 18:07:43 GMT, Quan Anh Mai wrote: >> Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: >> >> Check IR before macro expansion > > This is really nice. Given Christian's suggestion, what do you think about inferring the unsigned bounds and known bits from the signed bounds, do the calculation there and work back the signed bounds in the end? Thanks for taking a look, @merykitty! With regard to unsigned and bitwise bounds I was thinking for this patch it might be better to keep everything in the signed domain to keep the patch simple, since AFAIK the analysis to go between the signed domain and the unsigned/bitwise domain can become quite complicated. I think it might be better to do that sort of analysis more generally in the type system, such as what you're doing with #17508. I was thinking this PR could be a sort of baseline for any future improvements using that system. Let me know what you think about this approach, or if you think I should explore unsigned and bitwise bounds in this patch. Thanks! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/20066#issuecomment-2306115893 From jbhateja at openjdk.org Fri Aug 23 05:46:29 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 23 Aug 2024 05:46:29 GMT Subject: RFR: 8338023: Support two vector selectFrom API [v4] In-Reply-To: <28KQHru1heR-YOVsRVo8Ffj_4D29IV8vD2tombvTHdI=.dba80ac3-9804-4074-ac0f-8acb9b042a08@github.com> References: <28KQHru1heR-YOVsRVo8Ffj_4D29IV8vD2tombvTHdI=.dba80ac3-9804-4074-ac0f-8acb9b042a08@github.com> Message-ID: > Hi All, > > As per the discussion on panama-dev mailing list[1], patch adds the support for following new two vector permutation APIs. > > > Declaration:- > Vector.selectFrom(Vector v1, Vector v2) > > > Semantics:- > Using index values stored in the lanes of "this" vector, assemble the values stored in first (v1) and second (v2) vector arguments. Thus, first and second vector serves as a table, whose elements are selected based on index value vector. API is applicable to all integral and floating-point types. The result of this operation is semantically equivalent to expression v1.rearrange(this.toShuffle(), v2). Values held in index vector lanes must lie within valid two vector index range [0, 2*VLEN) else an IndexOutOfBoundException is thrown. > > Summary of changes: > - Java side implementation of new selectFrom API. > - C2 compiler IR and inline expander changes. > - In absence of direct two vector permutation instruction in target ISA, a lowering transformation dismantles new IR into constituent IR supported by target platforms. > - Optimized x86 backend implementation for AVX512 and legacy target. > - Function tests covering new API. 
> > JMH micro included with this patch shows around 10-15x gain over existing rearrange API :- > Test System: Intel(R) Xeon(R) Platinum 8480+ [ Sapphire Rapids Server] > > > Benchmark (size) Mode Cnt Score Error Units > SelectFromBenchmark.rearrangeFromByteVector 1024 thrpt 2 2041.762 ops/ms > SelectFromBenchmark.rearrangeFromByteVector 2048 thrpt 2 1028.550 ops/ms > SelectFromBenchmark.rearrangeFromIntVector 1024 thrpt 2 962.605 ops/ms > SelectFromBenchmark.rearrangeFromIntVector 2048 thrpt 2 479.004 ops/ms > SelectFromBenchmark.rearrangeFromLongVector 1024 thrpt 2 359.758 ops/ms > SelectFromBenchmark.rearrangeFromLongVector 2048 thrpt 2 178.192 ops/ms > SelectFromBenchmark.rearrangeFromShortVector 1024 thrpt 2 1463.459 ops/ms > SelectFromBenchmark.rearrangeFromShortVector 2048 thrpt 2 727.556 ops/ms > SelectFromBenchmark.selectFromByteVector 1024 thrpt 2 33254.830 ops/ms > SelectFromBenchmark.selectFromByteVector 2048 thrpt 2 17313.174 ops/ms > SelectFromBenchmark.selectFromIntVector 1024 thrpt 2 10756.804 ops/ms > SelectFromBenchmark.selectFromIntVector 2048 thrpt 2 5398.2... Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Defaulting to index wrapping scheme. 
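The commit above makes wrapping the default. The following hypothetical sketch contrasts full wrapping with the partial wrapping scheme Jatin summarizes in this thread: an out-of-range index is first wrapped into `[0, VLEN)`, then shifted by `-VLEN` into `[-VLEN, -1]`, so a later consumer can either reject it or recover a valid index by adding `VLEN`. The method names are illustrative, not Vector API methods. The modulo-`2*VLEN` wrapping in `selectFromWrapped` is an assumption about the new default and should not be read as the final API specification.

```java
// Hypothetical scalar sketch of the index-wrapping schemes described in this
// thread; method names are illustrative, not Vector API methods.
final class IndexWrapping {
    // Full wrapping: map any index into [0, vlen).
    static int wrap(int index, int vlen) {
        return Math.floorMod(index, vlen);
    }

    // Partial wrapping: in-range indexes are unchanged; an out-of-range index is
    // wrapped into [0, vlen) and then shifted by -vlen, landing in [-vlen, -1].
    static int partialWrap(int index, int vlen) {
        if (index >= 0 && index < vlen) {
            return index;
        }
        return Math.floorMod(index, vlen) - vlen;
    }

    // Two-vector selection with wrapping, assumed here to be modulo 2*vlen.
    static int[] selectFromWrapped(int[] index, int[] v1, int[] v2) {
        int vlen = v1.length;
        int[] result = new int[vlen];
        for (int i = 0; i < vlen; i++) {
            int idx = Math.floorMod(index[i], 2 * vlen);
            result[i] = idx < vlen ? v1[idx] : v2[idx - vlen];
        }
        return result;
    }
}
```

`Math.floorMod` is used rather than `%` so that negative indexes also wrap into the non-negative range.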
------------- Changes: - all: https://git.openjdk.org/jdk/pull/20508/files - new: https://git.openjdk.org/jdk/pull/20508/files/e24632cb..d7ad6887 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20508&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20508&range=02-03 Stats: 424 lines in 39 files changed: 0 ins; 361 del; 63 mod Patch: https://git.openjdk.org/jdk/pull/20508.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20508/head:pull/20508 PR: https://git.openjdk.org/jdk/pull/20508 From jbhateja at openjdk.org Fri Aug 23 05:58:05 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 23 Aug 2024 05:58:05 GMT Subject: RFR: 8338023: Support two vector selectFrom API [v4] In-Reply-To: References: <28KQHru1heR-YOVsRVo8Ffj_4D29IV8vD2tombvTHdI=.dba80ac3-9804-4074-ac0f-8acb9b042a08@github.com> Message-ID: On Fri, 23 Aug 2024 05:46:29 GMT, Jatin Bhateja wrote: >> Hi All, >> >> As per the discussion on the panama-dev mailing list[1], this patch adds support for the following new two-vector permutation API. >> >> >> Declaration:- >> Vector.selectFrom(Vector v1, Vector v2) >> >> >> Semantics:- >> Using index values stored in the lanes of "this" vector, assemble the values stored in the first (v1) and second (v2) vector arguments. Thus, the first and second vectors serve as a table whose elements are selected based on the index vector. The API is applicable to all integral and floating-point types. The result of this operation is semantically equivalent to the expression v1.rearrange(this.toShuffle(), v2). Values held in index vector lanes must lie within the valid two-vector index range [0, 2*VLEN), else an IndexOutOfBoundsException is thrown. >> >> Summary of changes: >> - Java side implementation of the new selectFrom API. >> - C2 compiler IR and inline expander changes. >> - In the absence of a direct two-vector permutation instruction in the target ISA, a lowering transformation dismantles the new IR into constituent IR supported by target platforms.
>> - Optimized x86 backend implementation for AVX512 and legacy targets. >> - Functional tests covering the new API. >> >> JMH micro included with this patch shows around 10-15x gain over existing rearrange API :- >> Test System: Intel(R) Xeon(R) Platinum 8480+ [ Sapphire Rapids Server] >> >> >> Benchmark (size) Mode Cnt Score Error Units >> SelectFromBenchmark.rearrangeFromByteVector 1024 thrpt 2 2041.762 ops/ms >> SelectFromBenchmark.rearrangeFromByteVector 2048 thrpt 2 1028.550 ops/ms >> SelectFromBenchmark.rearrangeFromIntVector 1024 thrpt 2 962.605 ops/ms >> SelectFromBenchmark.rearrangeFromIntVector 2048 thrpt 2 479.004 ops/ms >> SelectFromBenchmark.rearrangeFromLongVector 1024 thrpt 2 359.758 ops/ms >> SelectFromBenchmark.rearrangeFromLongVector 2048 thrpt 2 178.192 ops/ms >> SelectFromBenchmark.rearrangeFromShortVector 1024 thrpt 2 1463.459 ops/ms >> SelectFromBenchmark.rearrangeFromShortVector 2048 thrpt 2 727.556 ops/ms >> SelectFromBenchmark.selectFromByteVector 1024 thrpt 2 33254.830 ops/ms >> SelectFromBenchmark.selectFromByteVector 2048 thrpt 2 17313.174 ops/ms >> SelectFromBenchmark.selectFromIntVector 1024 thrpt 2 10756.804 ops/ms >> S... > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Defaulting to index wrapping scheme. Hi @rose00, @PaulSandoz, @sviswa7, The latest patch removes the explicit wrap argument passed to the selectFrom API and instead uses wrapping as the mitigation strategy for partially wrapped out-of-bounds (OOB) indexes. Summarizing the new scheme for index wrapping:- - A shuffle always holds indexes in the valid vector index range, or partially wrapped OOB indexes. - The following are the shuffle creation intercepts: - VectorShuffle.fromArray: partially wraps OOB indexes. - iotaShuffle: accepts an explicit wrap argument to choose between wrapping and partial wrapping of OOB indexes. - Vector.toShuffle: partially wraps OOB indexes.
- Partial wrapping generates negative indexes for OOB indices after wrapping them into the valid index range.
- The objective is to delegate the mitigation strategy to subsequent APIs, which can either throw an IndexOutOfBoundsException or create a valid index by adding the vector length.
- Note that the partial wrapping scheme first wraps an OOB index (index < 0 or index >= VECLEN) into the valid index range and then subtracts VECLEN from the wrapped index, producing a negative number in the range [-VECLEN, -1].
- With the new scheme, wrapping is the default mitigation strategy; hence the only client that makes effective use of partially wrapped indexes is the two-vector rearrange, which uses them to compute the mask for blending the two permuted vectors.
- The two-vector rearrange and the selectFrom API differ in their acceptable index range: the former accepts shuffle indices in the single-vector index range [0, VECLEN), while the latter operates on the two-vector index range [0, 2*VECLEN).

Best Regards,
Jatin

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20508#issuecomment-2306344606

From jbhateja at openjdk.org Fri Aug 23 06:09:48 2024
From: jbhateja at openjdk.org (Jatin Bhateja)
Date: Fri, 23 Aug 2024 06:09:48 GMT
Subject: RFR: 8338023: Support two vector selectFrom API [v5]
In-Reply-To: <28KQHru1heR-YOVsRVo8Ffj_4D29IV8vD2tombvTHdI=.dba80ac3-9804-4074-ac0f-8acb9b042a08@github.com>
References: <28KQHru1heR-YOVsRVo8Ffj_4D29IV8vD2tombvTHdI=.dba80ac3-9804-4074-ac0f-8acb9b042a08@github.com>
Message-ID: <7e5pWnvjqk-dQYNeaZjFzXcd5WlzniZPl5T4l1rKQGE=.0882bcd4-e307-4a29-aa41-5496ee029a60@github.com>

> Hi All,
>
> As per the discussion on the panama-dev mailing list[1], this patch adds support for the following new two-vector permutation API.
>
> Declaration:
> Vector.selectFrom(Vector v1, Vector v2)
>
> Semantics:
> Using the index values stored in the lanes of "this" vector, assemble the values stored in the first (v1) and second (v2) vector arguments.
> Thus, the first and second vectors serve as a table whose elements are selected based on the index vector. The API is applicable to all integral and floating-point types. The result of this operation is semantically equivalent to the expression v1.rearrange(this.toShuffle(), v2). Values held in the index vector lanes must lie within the valid two-vector index range [0, 2*VLEN), else an IndexOutOfBoundsException is thrown.
>
> Summary of changes:
> - Java side implementation of the new selectFrom API.
> - C2 compiler IR and inline expander changes.
> - In the absence of a direct two-vector permutation instruction in the target ISA, a lowering transformation dismantles the new IR into constituent IR supported by the target platform.
> - Optimized x86 backend implementation for AVX512 and legacy targets.
> - Functional tests covering the new API.
>
> The JMH micro-benchmark included with this patch shows around a 10-15x gain over the existing rearrange API:
> Test System: Intel(R) Xeon(R) Platinum 8480+ [Sapphire Rapids Server]
>
> Benchmark                                     (size)   Mode  Cnt      Score  Error  Units
> SelectFromBenchmark.rearrangeFromByteVector     1024  thrpt    2   2041.762         ops/ms
> SelectFromBenchmark.rearrangeFromByteVector     2048  thrpt    2   1028.550         ops/ms
> SelectFromBenchmark.rearrangeFromIntVector      1024  thrpt    2    962.605         ops/ms
> SelectFromBenchmark.rearrangeFromIntVector      2048  thrpt    2    479.004         ops/ms
> SelectFromBenchmark.rearrangeFromLongVector     1024  thrpt    2    359.758         ops/ms
> SelectFromBenchmark.rearrangeFromLongVector     2048  thrpt    2    178.192         ops/ms
> SelectFromBenchmark.rearrangeFromShortVector    1024  thrpt    2   1463.459         ops/ms
> SelectFromBenchmark.rearrangeFromShortVector    2048  thrpt    2    727.556         ops/ms
> SelectFromBenchmark.selectFromByteVector        1024  thrpt    2  33254.830         ops/ms
> SelectFromBenchmark.selectFromByteVector        2048  thrpt    2  17313.174         ops/ms
> SelectFromBenchmark.selectFromIntVector         1024  thrpt    2  10756.804         ops/ms
> SelectFromBenchmark.selectFromIntVector         2048  thrpt    2   5398.2...
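As a rough scalar reference for the semantics discussed in this thread, the sketch below models the two-vector `selectFrom`, the partial wrapping performed by `toShuffle`, and the two-vector `rearrange` on plain `int[]` lanes. It is a hypothetical illustration, not the `jdk.incubator.vector` implementation: `SelectFromModel` and its static methods are made-up names, and wrapping `selectFrom` indexes into [0, 2*VLEN) (instead of throwing) is an assumption based on the v4 "Defaulting to index wrapping scheme" commit.

```java
import java.util.Arrays;

// Hypothetical scalar model of the index handling discussed in this thread.
// Lanes are modelled as int[]; this is NOT the jdk.incubator.vector code.
final class SelectFromModel {

    // Partial wrapping: an in-range index is kept; an OOB index is first
    // wrapped into [0, vlen) and then vlen is subtracted, giving [-vlen, -1].
    static int partialWrap(int index, int vlen) {
        if (index >= 0 && index < vlen) {
            return index;
        }
        return Math.floorMod(index, vlen) - vlen;
    }

    // Models toShuffle(): every lane is partially wrapped.
    static int[] toShuffle(int[] indexes) {
        int vlen = indexes.length;
        int[] shuffle = new int[vlen];
        for (int lane = 0; lane < vlen; lane++) {
            shuffle[lane] = partialWrap(indexes[lane], vlen);
        }
        return shuffle;
    }

    // Two-vector rearrange: a non-negative shuffle index selects from v1, a
    // negative one selects from v2 at index + vlen. In this model the sign of
    // each index acts as the blend mask between the two permuted vectors.
    static int[] rearrange(int[] v1, int[] shuffle, int[] v2) {
        int vlen = v1.length;
        int[] result = new int[vlen];
        for (int lane = 0; lane < vlen; lane++) {
            int s = shuffle[lane];
            result[lane] = s >= 0 ? v1[s] : v2[s + vlen];
        }
        return result;
    }

    // Two-vector select: each index picks from the concatenation v1 ++ v2.
    // Assumption: indexes are wrapped into [0, 2*vlen) rather than rejected.
    static int[] selectFrom(int[] indexes, int[] v1, int[] v2) {
        int vlen = v1.length;
        int[] result = new int[vlen];
        for (int lane = 0; lane < vlen; lane++) {
            int idx = Math.floorMod(indexes[lane], 2 * vlen);
            result[lane] = idx < vlen ? v1[idx] : v2[idx - vlen];
        }
        return result;
    }

    public static void main(String[] args) {
        int[] v1 = {10, 11, 12, 13};
        int[] v2 = {20, 21, 22, 23};
        int[] indexes = {0, 5, 7, 2}; // within [0, 2*VLEN) for VLEN = 4
        System.out.println(Arrays.toString(selectFrom(indexes, v1, v2)));           // [10, 21, 23, 12]
        System.out.println(Arrays.toString(rearrange(v1, toShuffle(indexes), v2))); // [10, 21, 23, 12]
    }
}
```

For in-range indexes both paths agree, matching the stated equivalence with v1.rearrange(this.toShuffle(), v2). Outside [0, 2*VLEN) they can diverge in this model: with VLEN = 4, index 9 selects v1[1] through `selectFrom`'s wrap but v2[1] through partial wrapping, which is one way to see why the acceptable index ranges of the two APIs are specified separately.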
Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:

  Removing redundant checkIndex routine

-------------

Changes:
 - all: https://git.openjdk.org/jdk/pull/20508/files
 - new: https://git.openjdk.org/jdk/pull/20508/files/d7ad6887..6cb1a46d

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=20508&range=04
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20508&range=03-04

Stats: 35 lines in 7 files changed: 0 ins; 35 del; 0 mod
Patch: https://git.openjdk.org/jdk/pull/20508.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/20508/head:pull/20508

PR: https://git.openjdk.org/jdk/pull/20508

From qxing at openjdk.org Fri Aug 23 06:40:06 2024
From: qxing at openjdk.org (Qizheng Xing)
Date: Fri, 23 Aug 2024 06:40:06 GMT
Subject: RFR: 8333334: C2: Make result of `Node::dominates` more precise to enhance scalar replacement [v5]
In-Reply-To:
References:
Message-ID: <4_W_2kSYRHFwaDrdGK2v1nY-K_kjPBCYDPbswB4FjfM=.cf64d77f-e509-4969-b241-53e85061d696@github.com>

On Thu, 4 Jul 2024 06:47:40 GMT, Tobias Hartmann wrote:

>> Qizheng Xing has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Add `@requires` for G1 GC.
>
> Thanks, test results look good now. I'll pass the detailed review to someone else.

@TobiHartmann @chhagedorn @vnkozlov Thanks for the review!
-------------

PR Comment: https://git.openjdk.org/jdk/pull/19496#issuecomment-2306391675

From duke at openjdk.org Fri Aug 23 06:40:07 2024
From: duke at openjdk.org (duke)
Date: Fri, 23 Aug 2024 06:40:07 GMT
Subject: RFR: 8333334: C2: Make result of `Node::dominates` more precise to enhance scalar replacement [v9]
In-Reply-To:
References:
Message-ID: <9u5-EOoeyN_Lc9JBnXSmMCyc0dydOZIh02AyrRes9kc=.d8993bcf-71a6-4720-a685-6aac2d0cf7d6@github.com>

On Thu, 22 Aug 2024 09:40:22 GMT, Qizheng Xing wrote:

>> This patch changes the algorithm of `Node::dominates` to make the result more precise, and allows the iterators of `ConcurrentHashMap` to be scalar replaced.
>>
>> The previous algorithm will return a conservative result when encountering a dead control flow, and only try the first two input paths of a multi-input Region node, which may prevent the scalar replacement in some cases.
>>
>> For example, with G1 GC enabled, C2 generates GC barriers for `ConcurrentHashMap` iteration operations at some early phases, and then eliminates them in a later IGVN, but `LoadNode` is also idealized in the same IGVN. This causes `LoadNode::Ideal` to see some dead barrier control flows, and refuse to split some instance field loads through Phi due to the conservative result of `Node::dominates`, and thus the scalar replacement cannot be applied to iterators in the later macro elimination phase.
>>
>> This patch allows `Node::dominates` to try other paths of the last multi-input Region node when the first path is dead, and makes `ConcurrentHashMap` iteration ~30% faster:
>>
>> Benchmark                            (nkeys)  Mode  Cnt       Score       Error  Units
>> Maps.testConcurrentHashMapIterators    10000  avgt   15  414099.085 ± 33230.945  ns/op  # baseline
>> Maps.testConcurrentHashMapIterators    10000  avgt   15  315490.281 ±  3037.056  ns/op  # patch
>>
>> Testing: tier1-4.
>
> Qizheng Xing has updated the pull request incrementally with two additional commits since the last revision:
>
>  - Fix whitespace.
>  - Update comments in method `split_through_phi`.
>
>    Co-authored-by: Christian Hagedorn

@MaxXSoft Your change (at version f8118454e4dcd665fe2c2337a7044e3cbaf4ad7e) is now ready to be sponsored by a Committer.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/19496#issuecomment-2306392844

From mgronlun at openjdk.org Fri Aug 23 09:21:07 2024
From: mgronlun at openjdk.org (Markus Grönlund)
Date: Fri, 23 Aug 2024 09:21:07 GMT
Subject: RFR: 8338745: Intrinsify Continuation.pin() and Continuation.unpin() [v3]
In-Reply-To:
References:
Message-ID: <7kDAWMRy5cWRV_WFMhcQJ0i6gkVNMNmvutUQNyUvSiQ=.95990a93-0346-4e65-b617-ff6b3598780c@github.com>

On Thu, 22 Aug 2024 15:51:49 GMT, Vladimir Kozlov wrote:

>> Markus Grönlund has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   JVMCI exportation
>
> Marked as reviewed by kvn (Reviewer).

Thank you for your reviews and comments, @vnkozlov, @dean-long, @TobiHartmann and @mur47x111.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20664#issuecomment-2306664937

From mgronlun at openjdk.org Fri Aug 23 09:29:08 2024
From: mgronlun at openjdk.org (Markus Grönlund)
Date: Fri, 23 Aug 2024 09:29:08 GMT
Subject: Integrated: 8338745: Intrinsify Continuation.pin() and Continuation.unpin()
In-Reply-To:
References:
Message-ID:

On Wed, 21 Aug 2024 20:11:05 GMT, Markus Grönlund wrote:

> Greetings,
>
> Please help review this change set that implements C2 intrinsics for jdk.internal.vm.Continuation.pin() and jdk.internal.vm.Continuation.unpin().
>
> This work is a consequence of [JDK-8338417](https://bugs.openjdk.org/browse/JDK-8338417), which required us to introduce explicit pin constructs for Virtual threads in a relatively performance-sensitive path.
>
> Testing: jdk_jfr, loom testing
>
> Comment: I changed the type of the ContinuationEntry::_pin_count field from uint to uint32_t to make the size explicit and to access it uniformly from the intrinsic code using T_INT.
>
> Thanks
> Markus

This pull request has now been integrated.

Changeset: fead3cf5
Author: Markus Grönlund
URL: https://git.openjdk.org/jdk/commit/fead3cf54130e3ab10f94a94dfbd382e4cb1e597
Stats: 109 lines in 9 files changed: 105 ins; 0 del; 4 mod

8338745: Intrinsify Continuation.pin() and Continuation.unpin()

Reviewed-by: kvn

-------------

PR: https://git.openjdk.org/jdk/pull/20664

From qxing at openjdk.org Fri Aug 23 09:33:18 2024
From: qxing at openjdk.org (Qizheng Xing)
Date: Fri, 23 Aug 2024 09:33:18 GMT
Subject: Integrated: 8333334: C2: Make result of `Node::dominates` more precise to enhance scalar replacement
In-Reply-To:
References:
Message-ID:

On Fri, 31 May 2024 09:01:38 GMT, Qizheng Xing wrote:

> This patch changes the algorithm of `Node::dominates` to make the result more precise, and allows the iterators of `ConcurrentHashMap` to be scalar replaced.
>
> The previous algorithm will return a conservative result when encountering a dead control flow, and only try the first two input paths of a multi-input Region node, which may prevent the scalar replacement in some cases.
>
> For example, with G1 GC enabled, C2 generates GC barriers for `ConcurrentHashMap` iteration operations at some early phases, and then eliminates them in a later IGVN, but `LoadNode` is also idealized in the same IGVN. This causes `LoadNode::Ideal` to see some dead barrier control flows, and refuse to split some instance field loads through Phi due to the conservative result of `Node::dominates`, and thus the scalar replacement cannot be applied to iterators in the later macro elimination phase.
>
> This patch allows `Node::dominates` to try other paths of the last multi-input Region node when the first path is dead, and makes `ConcurrentHashMap` iteration ~30% faster:
>
> Benchmark                            (nkeys)  Mode  Cnt       Score       Error  Units
> Maps.testConcurrentHashMapIterators    10000  avgt   15  414099.085 ± 33230.945  ns/op  # baseline
> Maps.testConcurrentHashMapIterators    10000  avgt   15  315490.281 ±  3037.056  ns/op  # patch
>
> Testing: tier1-4.

This pull request has now been integrated.

Changeset: 965dd1ac
Author: Qizheng Xing
URL: https://git.openjdk.org/jdk/commit/965dd1acd0ce5b225d85e2c55cc097856e0e9f3c
Stats: 240 lines in 6 files changed: 190 ins; 4 del; 46 mod

8333334: C2: Make result of `Node::dominates` more precise to enhance scalar replacement

Reviewed-by: chagedorn, kvn, thartmann

-------------

PR: https://git.openjdk.org/jdk/pull/19496

From rcastanedalo at openjdk.org Fri Aug 23 09:47:05 2024
From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano)
Date: Fri, 23 Aug 2024 09:47:05 GMT
Subject: RFR: 8325467: Support methods with many arguments in C2 [v4]
In-Reply-To:
References:
Message-ID:

On Tue, 20 Aug 2024 14:32:25 GMT, Daniel Lundén wrote:

>> If a method has a large number of parameters, we currently bail out from C2 compilation.
>>
>> ### Changeset
>>
>> Allowing C2 compilation of methods with a large number of parameters requires fundamental changes to the register mask data structure, used in many places in C2. In particular, register masks currently have a statically determined size and cannot represent arbitrary numbers of stack slots. This is needed if we want to compile methods with arbitrary numbers of parameters. Register mask operations are present in performance-sensitive parts of C2, which further complicates changes.
>>
>> Changes:
>> - Add functionality to dynamically grow/extend register masks. I experimented with a number of design choices to achieve this.
>>   To keep the common case (normal number of method parameters) quick and also to avoid more intrusive changes to the current `RegMask` interface, I decided to leave the "base" statically allocated memory for masks unchanged and only use dynamically allocated memory in the rare cases where it is needed.
>> - Generalize the "chunk"-logic from `PhaseChaitin::Select()` to allow arbitrary-sized chunks, and also move most of the logic into register mask methods to separate concerns and to make the `PhaseChaitin::Select()` code more readable.
>> - Remove all `can_represent` checks and bailouts.
>> - Performance tuning. A particularly important change is the early-exit optimization in `RegMask::overlap`, used in the performance-sensitive method `PhaseChaitin::interfere_with_live`.
>> - Add a new test case `TestManyMethodArguments.java` and extend an old test `TestNestedSynchronize.java`.
>>
>> ### Testing
>>
>> - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/10178060450)
>> - `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64.
>> - Standard performance benchmarking. No observed conclusive overall performance degradation/improvement.
>> - Specific benchmarking of C2 compilation time. The changes increase C2 compilation time by, approximately and on average, 1% for methods that could also be compiled before this changeset (see the figure below). The reason for the degradation is further checks required in performance-sensitive code (in particular `PhaseChaitin::remove_bound_register_from_interfering_live_ranges`). I have tried optimizing in various ways, but changes I found that lead to improvement also lead to less readable code (and are, in my opinion, no...
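The "static base plus on-demand extension" layout described above can be sketched, very loosely, as the Java model below. Everything here is hypothetical: `ExtensibleMask`, `BASE_WORDS`, `FIRST_STACK_SLOT` and the method names are illustrative stand-ins rather than HotSpot's `RegMask` API, and the real structure's all-stack flag, offsets and low/high water marks are not modelled. It shows the common case touching only the fixed array, an overlap test with an early exit, and a stack-slot query decided from the highest set bit so that bits living in the extension are not missed.

```java
// Hypothetical, simplified model of a bit mask with a statically sized base
// and a lazily allocated extension. NOT HotSpot's RegMask: all names, sizes
// and the register/stack boundary are illustrative assumptions.
final class ExtensibleMask {
    static final int BASE_WORDS = 2;        // assumed static part: 2 x 64 = 128 bits
    static final int FIRST_STACK_SLOT = 48; // assumed: bits >= 48 denote stack slots

    private final long[] base = new long[BASE_WORDS]; // always present
    private long[] ext = null;                        // allocated only when needed

    void insert(int bit) {
        int w = bit >>> 6;
        long m = 1L << (bit & 63);
        if (w < BASE_WORDS) {
            base[w] |= m; // common case: no allocation
            return;
        }
        int e = w - BASE_WORDS;
        if (ext == null || e >= ext.length) {
            long[] grown = new long[e + 1]; // rare case: grow the extension
            if (ext != null) {
                System.arraycopy(ext, 0, grown, 0, ext.length);
            }
            ext = grown;
        }
        ext[e] |= m;
    }

    // Word i of the conceptually unbounded bit vector (0 if not allocated).
    private long word(int i) {
        if (i < BASE_WORDS) {
            return base[i];
        }
        int e = i - BASE_WORDS;
        return (ext != null && e < ext.length) ? ext[e] : 0L;
    }

    private int words() {
        return BASE_WORDS + (ext == null ? 0 : ext.length);
    }

    boolean member(int bit) {
        return (word(bit >>> 6) & (1L << (bit & 63))) != 0;
    }

    // Returns as soon as one common bit is found (early exit).
    boolean overlaps(ExtensibleMask other) {
        int n = Math.min(words(), other.words());
        for (int i = 0; i < n; i++) {
            if ((word(i) & other.word(i)) != 0) {
                return true;
            }
        }
        return false;
    }

    // Highest set bit, or -1 if the mask is empty.
    int lastElem() {
        for (int i = words() - 1; i >= 0; i--) {
            long w = word(i);
            if (w != 0) {
                return i * 64 + 63 - Long.numberOfLeadingZeros(w);
            }
        }
        return -1;
    }

    // True if no stack slot is set. Decided from the highest set bit, so bits
    // stored in the extension are taken into account.
    boolean hasNoStackSlots() {
        return lastElem() < FIRST_STACK_SLOT;
    }
}
```

For instance, after inserting bit 200 into such a mask, a helper mask that marks stack slots only within the 128 base bits does not overlap it, while `hasNoStackSlots()` still returns false: in this model, an overlap test against a fixed-size helper mask silently misses bits stored in the extension.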
>
> Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision:
>
>   Update after Roberto's comments and suggestions

I think `RegMask::is_UP()` needs to be updated to handle extended RegMasks (unless we can prove that it will never be called on an extended one). According to its comment and current behavior for non-extended RegMasks, when I run the following:

    RegMask rm;
    rm.Insert(OptoReg::Name(REGISTER_THAT_DOESNT_FIT_IN_RM_UP));

I expect `rm.is_UP()` to return false (since `REGISTER_THAT_DOESNT_FIT_IN_RM_UP` is necessarily a stack location), but it returns true (see the test case `up_reproducer` in the work-in-progress branch https://github.com/robcasloz/jdk/blob/626fac6aa882f738f969ab2fb4813386547216c8/test/hotspot/gtest/opto/test_regmask.cpp#L524-L528). The reason is that `is_UP()` checks the presence of stack locations by testing the RegMask's overlap with `Matcher::STACK_ONLY_mask`, which is only filled up to `RM_SIZE`.

Other uses of `Matcher::STACK_ONLY_mask` might also need to be revisited, in particular the line `tmp_rm.SUBTRACT(Matcher::STACK_ONLY_mask)` in `PhaseChaitin::Split`.

-------------

PR Review: https://git.openjdk.org/jdk/pull/20404#pullrequestreview-2256825515

From dlunden at openjdk.org Fri Aug 23 11:17:07 2024
From: dlunden at openjdk.org (Daniel Lundén)
Date: Fri, 23 Aug 2024 11:17:07 GMT
Subject: RFR: 8325467: Support methods with many arguments in C2 [v4]
In-Reply-To:
References:
Message-ID:

On Wed, 21 Aug 2024 23:55:15 GMT, Dean Long wrote:

>> Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Update after Roberto's comments and suggestions
>
> src/hotspot/share/adlc/formsopt.cpp line 368:
>
>> 366:       fprintf(fp," 0x%x,", regs_in_word(i, false));
>> 367:     }
>> 368:     fprintf(fp," 0x%x );\n", regs_in_word(i, false));
>
> Does use of RegMask_Size() above for the length still make sense?
> IIUC, RegMask_Size() is what previously determined the max size of frames and register masks.

`RegMask_Size()` [determines](https://github.com/dlunde/jdk/blob/953966686ef7b86972bb2c156c931c60c4e5cb70/src/hotspot/share/adlc/output_h.cpp#L101) the base, statically allocated chunk for register masks. If I'm not mistaken, it does not affect the max frame size. The "all-stack" bit of current register masks already allows frame growth beyond what a register mask can represent.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r1728804110

From dlunden at openjdk.org Fri Aug 23 11:17:08 2024
From: dlunden at openjdk.org (Daniel Lundén)
Date: Fri, 23 Aug 2024 11:17:08 GMT
Subject: RFR: 8325467: Support methods with many arguments in C2 [v3]
In-Reply-To:
References:
Message-ID:

On Tue, 20 Aug 2024 16:34:40 GMT, Roberto Castañeda Lozano wrote:

>> We could do that, but I'm not sure I see the benefit. Could you elaborate a bit?
>
> The benefit, in my opinion, is better comprehensibility due to a simpler model with stronger invariants. An alternative would be to extend the comment at the definition of `_lwm` and `_hwm` clarifying that the value of these variables is unspecified when the register set is empty.

Recapping what we discussed in person here: we should not try to enforce `_lwm <= _hwm`, but describe how they work and what they guarantee in more detail through comments in `regmask.hpp`.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r1728806664

From rcastanedalo at openjdk.org Fri Aug 23 11:21:06 2024
From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano)
Date: Fri, 23 Aug 2024 11:21:06 GMT
Subject: RFR: 8325467: Support methods with many arguments in C2 [v4]
In-Reply-To:
References:
Message-ID:

On Tue, 20 Aug 2024 14:32:25 GMT, Daniel Lundén wrote:

>> If a method has a large number of parameters, we currently bail out from C2 compilation.
>>
>> ### Changeset
>>
>> Allowing C2 compilation of methods with a large number of parameters requires fundamental changes to the register mask data structure, used in many places in C2. In particular, register masks currently have a statically determined size and cannot represent arbitrary numbers of stack slots. This is needed if we want to compile methods with arbitrary numbers of parameters. Register mask operations are present in performance-sensitive parts of C2, which further complicates changes.
>> >> ### Changeset >> >> Allowing C2 compilation of methods with a large number of parameters requires fundamental changes to the register mask data structure, used in many places in C2. In particular, register masks currently have a statically determined size and cannot represent arbitrary numbers of stack slots. This is needed if we want to compile methods with arbitrary numbers of parameters. Register mask operations are present in performance-sensitive parts of C2, which further complicates changes. >> >> Changes: >> - Add functionality to dynamically grow/extend register masks. I experimented with a number of design choices to achieve this. To keep the common case (normal number of method parameters) quick and also to avoid more intrusive changes to the current `RegMask` interface, I decided to leave the "base" statically allocated memory for masks unchanged and only use dynamically allocated memory in the rare cases where it is needed. >> - Generalize the "chunk"-logic from `PhaseChaitin::Select()` to allow arbitrary-sized chunks, and also move most of the logic into register mask methods to separate concerns and to make the `PhaseChaitin::Select()` code more readable. >> - Remove all `can_represent` checks and bailouts. >> - Performance tuning. A particularly important change is the early-exit optimization in `RegMask::overlap`, used in the performance-sensitive method `PhaseChaitin::interfere_with_live`. >> - Add a new test case `TestManyMethodArguments.java` and extend an old test `TestNestedSynchronize.java`. >> >> ### Testing >> >> - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/10178060450) >> - `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. >> - Standard performance benchmarking. No observed conclusive overall performance degradation/improvement. >> - Specific benchmarking of C2 compilation time. 
>>   The changes increase C2 compilation time by, approximately and on average, 1% for methods that could also be compiled before this changeset (see the figure below). The reason for the degradation is further checks required in performance-sensitive code (in particular `PhaseChaitin::remove_bound_register_from_interfering_live_ranges`). I have tried optimizing in various ways, but changes I found that lead to improvement also lead to less readable code (and are, in my opinion, no...
>
> Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision:
>
>   Update after Roberto's comments and suggestions

src/hotspot/share/opto/regmask.hpp line 261:

> 259:
> 260:   bool is_offset() const { return _offset > 0; }
> 261:   unsigned int offset() const { return _offset; }

This function seems unused.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r1728810898

From dlunden at openjdk.org Fri Aug 23 11:39:06 2024
From: dlunden at openjdk.org (Daniel Lundén)
Date: Fri, 23 Aug 2024 11:39:06 GMT
Subject: RFR: 8325467: Support methods with many arguments in C2 [v4]
In-Reply-To:
References:
Message-ID:

On Wed, 21 Aug 2024 12:05:57 GMT, Roberto Castañeda Lozano wrote:

> Would it be possible to 1) extend test/hotspot/gtest/opto/test_regmask.cpp with tests that exercise extended RegMasks and 2) re-run the standard test tiers with a (temporary) RM_SIZE value that is low enough to also exercise the new logic more often?

Yes, good idea. I'm adding 2) to my TODO list (and as we discussed in person, you offered to work a bit on 1) yourself).

> I think RegMask::is_UP() needs to be updated to handle extended RegMasks (unless we can prove that it will never be called on an extended one).

Thanks, good catch. The problem is that `overlap` (intentionally) does not consider the all-stack flag. I'll experiment by adding some `assert`s to see if this is also a more widespread problem.
For `is_UP`, I think the best solution is to rewrite it using `RegMask::find_last_elem` and `OptoReg::is_reg` instead.

> Other uses of Matcher::STACK_ONLY_mask might also need to be revisited, in particular the line tmp_rm.SUBTRACT(Matcher::STACK_ONLY_mask) in PhaseChaitin::Split.

Yes, I'll have a look. `SUBTRACT` should be fine, though; I have rewritten it to handle such cases.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20404#issuecomment-2306907174

From mdoerr at openjdk.org Fri Aug 23 13:31:06 2024
From: mdoerr at openjdk.org (Martin Doerr)
Date: Fri, 23 Aug 2024 13:31:06 GMT
Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v2]
In-Reply-To: <3H3rBSKDnpg5fmYqcZ5hT9yH2EAxCocycRompQJJCOo=.1b30fd89-09e9-4708-bd20-cdea00e809a7@github.com>
References: <4c-MLXwKcNcSnloSkYkuk3gnv3ux5i5beS51Fd9Z8MQ=.cd0a7eba-ff26-4855-a01c-d1ae5182100b@github.com> <8fuUEkswt05x0IuT4PrNQuYgLd49g4EpZWOPPQog4PQ=.70b5edb6-98d0-4276-8578-f7a496b7f2a7@github.com> <3H3rBSKDnpg5fmYqcZ5hT9yH2EAxCocycRompQJJCOo=.1b30fd89-09e9-4708-bd20-cdea00e809a7@github.com>
Message-ID:

On Mon, 19 Aug 2024 14:25:13 GMT, Roberto Castañeda Lozano wrote:

>> In case of heap base != null, a branch already exists, which makes the other null check redundant. So, we have a null check, a region crossing check, and another null check. Maybe this compressed oop mode is not important enough.
>>
>> For the other compressed oop modes, yes, this means moving the null check above the region crossing check. On PPC64, the null check can be combined with the shift instruction, so we save one compare instruction. Technically, it would even be possible to use only one branch instruction for both checks, but I'm not sure if it's worth the complexity. I'll think about it.
>
> OK, thanks.
> I just ran some benchmarks with zero-based OOP compression ([prototype here](https://github.com/robcasloz/jdk/tree/JDK-8334060-g1-late-barrier-expansion-x64-optimizations)) and could not observe any significant performance effect on three different x64 implementations. I think I will keep the `g1StoreN` implementation as-is in the x64 and aarch64 backends, for simplicity. Again, we can revisit this in follow-up work if need be.

I have an experimental implementation for PPC64. I have moved the oop decoding into `G1BarrierSetAssembler::g1_write_barrier_post_c2`: https://github.com/TheRealMDoerr/jdk/blob/a48598075862f17e7b1cfbec29af4c2431809257/src/hotspot/cpu/ppc/gc/g1/g1BarrierSetAssembler_ppc.cpp#L476

This has 2 advantages:
- Reduce replicated code in the .ad file.
- Make the discussed optimization easy.

Please take a look.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1728978594

From mdoerr at openjdk.org Fri Aug 23 13:36:06 2024
From: mdoerr at openjdk.org (Martin Doerr)
Date: Fri, 23 Aug 2024 13:36:06 GMT
Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v2]
In-Reply-To:
References: <4c-MLXwKcNcSnloSkYkuk3gnv3ux5i5beS51Fd9Z8MQ=.cd0a7eba-ff26-4855-a01c-d1ae5182100b@github.com> <4aS0KysKaIPue1D60nootRPb8m7pdP_2hlPSwGUIk8w=.6b68b278-62ce-4df0-a6c6-c7fee40557aa@github.com>
Message-ID: <6DcMr9PUa8OZEhO861hkhqTMYlKDs96tgPs4Fu1u72I=.9dd2d0d4-9075-4c36-9ab6-ff886e257b4b@github.com>

On Mon, 19 Aug 2024 12:20:21 GMT, Roberto Castañeda Lozano wrote:

>> Note that we had such an optimization already in C2: https://github.com/openjdk/jdk8u-dev/blob/4106121e0ae42d644e45c6eab9037874110ed670/hotspot/src/share/vm/opto/library_call.cpp#L3114
>> But, it's probably not a big deal. Maybe I can try it on PPC64 which may be more sensitive to accesses on contended memory.
>
> Thanks for the reference, I would still prefer to keep this part as is for simplicity.
> We can always optimize the atomic barriers in follow-up work, if a need arises.

After thinking more about this, I figured out that we can optimize further by moving the pre_barrier after the cmpxchg. We can skip all G1 barriers if the cmpxchg fails: https://github.com/TheRealMDoerr/jdk/blob/a48598075862f17e7b1cfbec29af4c2431809257/src/hotspot/cpu/ppc/gc/g1/g1_ppc.ad#L171
This may reduce the load on GC queue handling and related work for GC threads. I'm testing this version and I actually like it more than the version I had before. Please take a look.
(Note that my final version will need https://github.com/openjdk/jdk/pull/20689 to be integrated and merged into your PR.)

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1728987173

From kurashige.taizo at fujitsu.com Fri Aug 23 14:41:30 2024
From: kurashige.taizo at fujitsu.com (Taizo Kurashige (Fujitsu))
Date: Fri, 23 Aug 2024 14:41:30 +0000
Subject: Question about JDK-8221092
In-Reply-To:
References:
Message-ID:

Hi all,

I'm sorry to bother you again.
If possible, could anyone please give me some insight on the following?

Is there a specification for what the stepping value is for a particular processor?
For example, is it defined in any documentation that Cascade Lake processors have stepping >= 5?
I searched the documentation provided by Intel but couldn't find it.
I want some evidence that the following is true:
- All Skylake processors have stepping < 5.
- Cascade Lake processors have stepping >= 5.

Thanks.
----------------------------------------------------------------------------------------
Taizo Kurashige
GitHub account : https://github.com/kurashige23
________________________________
Hi all,

I'm sorry to bother you again.
If possible, could anyone please give me some insight?

Thanks.
----------------------------------------------------------------------------------------
Taizo Kurashige
GitHub account : https://github.com/kurashige23
________________________________
Hi all,

If possible, could anyone give me some insight?

Thanks.
----------------------------------------------------------------------------------------
Taizo Kurashige
GitHub account : https://github.com/kurashige23
________________________________
Hi Sandhya,

Thank you for your response.
Thanks to you, I understood the following:
- All Skylake processors have cpuid family=6, model=0x55, stepping < 5.
- Cascade Lake processors have cpuid family=6, model=0x55, stepping >= 5.

If possible, I would like you to tell me about the following.
Is there a specification for what the stepping value is for a particular processor?
For example, is it defined in any documentation that Cascade Lake processors have stepping >= 5?
I searched the documentation provided by Intel but couldn't find it.
I want some evidence that the following is true:
- All Skylake processors have stepping < 5.
- Cascade Lake processors have stepping >= 5.

If the stepping per processor is specified somewhere, please let me know.

Thanks.
----------------------------------------------------------------------------------------
Taizo Kurashige
GitHub account : https://github.com/kurashige23
________________________________
Hi Taizo,

UseAVX is set to 2 for all Skylake processors (cpuid family=6, model=0x55, stepping < 5), not just Skylake X.
Cascade Lake processors have cpuid family=6, model=0x55, stepping >= 5.

Hope this helps.

Best Regards,
Sandhya

-------- Forwarded Message --------
Subject: Re: Question about JDK-8221092
Date: Wed, 10 Jul 2024 06:39:43 +0000
From: Taizo Kurashige (Fujitsu)
To: hotspot-compiler-dev at openjdk.org

Hi all,

Could someone please respond to this question if possible?

Thank you.
----------------------------------------------------------------------------------------
Taizo Kurashige
GitHub account : https://github.com/kurashige23
------------------------------------------------------------------------------------------------------------------------
*From:* Kurashige, Taizo
*Sent:* July 3, 2024 15:09
*To:* hotspot-compiler-dev at openjdk.org
*Subject:* Question about JDK-8221092

Hi all,

I have a question about https://bugs.openjdk.org/browse/JDK-8221092.
If possible, could someone please provide some insight?

Here's what I would like to know:

1. Is it correct to understand that "Skylake (X7) processors" refers to the Skylake processors listed at https://ark.intel.com/content/www/us/en/ark/products/codename/37572/products-formerly-skylake.html, specifically those in the 7000 series with an "X" or "XE" in their names? For example, "Intel® Core™ i9-7920X X-series Processor (16.5M Cache, up to 4.30 GHz)" or "Intel® Core™ i9-7980XE Extreme Edition Processor (24.75M Cache, up to 4.20 GHz)".

2. In the fix for JDK-8221092, if the stepping is less than 5, the processor is considered to be Skylake (X7) or an earlier version. In such cases, UseAVX is set to 2. Is there any documentation stating that the stepping for Skylake (X7) is 5?

Thank you.
----------------------------------------------------------------------------------------
Taizo Kurashige
GitHub account : https://github.com/kurashige23
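The family/model/stepping rule given in Sandhya's reply earlier in this thread can be sketched as follows. The decoding uses Intel's layout of CPUID leaf 1 EAX (stepping in bits 3:0, model in 7:4, family in 11:8, extended model in 19:16, extended family in 27:20, with the extended model applied for families 6 and 15). `CpuSignature` and its methods are hypothetical names, this is not HotSpot's `VM_Version` code, and the sketch only restates the rule from the thread; it is not the Intel documentation being asked for.

```java
// Hypothetical sketch: decode the CPUID leaf 1 EAX signature and apply the
// family/model/stepping rule quoted in this thread.
// Not HotSpot's VM_Version code; class and method names are illustrative.
final class CpuSignature {
    final int family;
    final int model;
    final int stepping;

    private CpuSignature(int family, int model, int stepping) {
        this.family = family;
        this.model = model;
        this.stepping = stepping;
    }

    // Intel's layout of CPUID.(EAX=1):EAX.
    static CpuSignature decode(int eax) {
        int stepping = eax & 0xF;             // bits 3:0
        int baseModel = (eax >>> 4) & 0xF;    // bits 7:4
        int baseFamily = (eax >>> 8) & 0xF;   // bits 11:8
        int extModel = (eax >>> 16) & 0xF;    // bits 19:16
        int extFamily = (eax >>> 20) & 0xFF;  // bits 27:20
        int family = (baseFamily == 0xF) ? baseFamily + extFamily : baseFamily;
        int model = (baseFamily == 0x6 || baseFamily == 0xF)
                ? (extModel << 4) + baseModel
                : baseModel;
        return new CpuSignature(family, model, stepping);
    }

    // Per the reply in this thread: family 6, model 0x55, stepping < 5 is
    // Skylake, for which UseAVX is set to 2.
    boolean isSkylake55() {
        return family == 6 && model == 0x55 && stepping < 5;
    }

    // Per the same reply: family 6, model 0x55, stepping >= 5 is Cascade Lake.
    boolean isCascadeLake55() {
        return family == 6 && model == 0x55 && stepping >= 5;
    }
}
```

For example, the signature 0x00050654 decodes to family 6, model 0x55, stepping 4 (the Skylake branch), and 0x00050657 to stepping 7 (the Cascade Lake branch).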
From rcastanedalo at openjdk.org Fri Aug 23 14:43:05 2024
From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano)
Date: Fri, 23 Aug 2024 14:43:05 GMT
Subject: RFR: 8325467: Support methods with many arguments in C2 [v4]
In-Reply-To:
References:
Message-ID:

On Fri, 23 Aug 2024 11:36:37 GMT, Daniel Lundén wrote:

> (and as we discussed in person, you offered to work a bit on 1) yourself).

Yes, here is an extension of the current test set: https://github.com/openjdk/jdk/commit/4fc5282e24a8449107a66f534a794b8e680537f7, feel free to merge it into this PR, and of course make any change you find necessary. The patch introduces new tests for extended RegMasks but also for basic RegMask operations that were not covered. I have also added a few tests that cover working with offsets (rollovers, insertions, deletions, find operations, and different combinations of `SUBTRACT_inner`). Finally, it introduces a minimal debug-only extension of RegMask which I found necessary to write the tests. Beware that I have only run the new tests locally (linux-x64), so please check that they work on other platforms.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20404#issuecomment-2307234407

From duke at openjdk.org Fri Aug 23 15:06:07 2024
From: duke at openjdk.org (Daniel Skantz)
Date: Fri, 23 Aug 2024 15:06:07 GMT
Subject: RFR: 8330157: C2: Add a stress flag for bailouts [v9]
In-Reply-To:
References:
Message-ID: <1v-WNWigdtAWl6wS1BE3S4kikAZo6zuyOc9Q9KxxmZo=.1b5c9937-3043-440d-ab77-839e7d152bf3@github.com>

On Tue, 2 Jul 2024 06:42:32 GMT, Daniel Skantz wrote:

>> This patch adds a diagnostic/stress flag for C2 bailouts. It can be used to support testing of existing bailouts to prevent issues like [JDK-8318445](https://bugs.openjdk.org/browse/JDK-8318445), and can test for issues only seen at runtime such as [JDK-8326376](https://bugs.openjdk.org/browse/JDK-8326376).
>> It can also be useful if we want to add more bailouts ([JDK-8318900](https://bugs.openjdk.org/browse/JDK-8318900)).
>>
>> We check two invariants:
>> a) Bailouts should be successful starting from any given `failing()` check.
>> b) The VM should not record a bailout when one is pending (in which case we have continued to optimize for too long).
>>
>> a) and b) are checked by randomly starting a bailout at calls to `failing()` with a user-given probability.
>>
>> The added flag should not have any effect in debug mode.
>>
>> Testing:
>>
>> T1-5, with and without the flag. We want to check that this does not cause any test failures without the flag set, and no unexpected failures with it. Tests failing because of a timeout, or because an error is printed to the output when compilation fails, can be expected in some cases.
>
> Daniel Skantz has updated the pull request incrementally with one additional commit since the last revision:
>
>   +1 whitespace

Comment to avoid timeout.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/19646#issuecomment-2307276311

From dlunden at openjdk.org Fri Aug 23 17:17:13 2024
From: dlunden at openjdk.org (Daniel Lundén)
Date: Fri, 23 Aug 2024 17:17:13 GMT
Subject: RFR: 8325467: Support methods with many arguments in C2 [v4]

On Tue, 20 Aug 2024 14:32:25 GMT, Daniel Lundén wrote:

>> If a method has a large number of parameters, we currently bail out from C2 compilation.
>>
>> ### Changeset
>>
>> Allowing C2 compilation of methods with a large number of parameters requires fundamental changes to the register mask data structure, used in many places in C2. In particular, register masks currently have a statically determined size and cannot represent arbitrary numbers of stack slots. This is needed if we want to compile methods with arbitrary numbers of parameters. Register mask operations are present in performance-sensitive parts of C2, which further complicates changes.
>>
>> Changes:
>> - Add functionality to dynamically grow/extend register masks. I experimented with a number of design choices to achieve this. To keep the common case (normal number of method parameters) quick and also to avoid more intrusive changes to the current `RegMask` interface, I decided to leave the "base" statically allocated memory for masks unchanged and only use dynamically allocated memory in the rare cases where it is needed.
>> - Generalize the "chunk"-logic from `PhaseChaitin::Select()` to allow arbitrary-sized chunks, and also move most of the logic into register mask methods to separate concerns and to make the `PhaseChaitin::Select()` code more readable.
>> - Remove all `can_represent` checks and bailouts.
>> - Performance tuning. A particularly important change is the early-exit optimization in `RegMask::overlap`, used in the performance-sensitive method `PhaseChaitin::interfere_with_live`.
>> - Add a new test case `TestManyMethodArguments.java` and extend an old test `TestNestedSynchronize.java`.
>>
>> ### Testing
>>
>> - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/10178060450)
>> - `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64.
>> - Standard performance benchmarking. No observed conclusive overall performance degradation/improvement.
>> - Specific benchmarking of C2 compilation time. The changes increase C2 compilation time by, approximately and on average, 1% for methods that could also be compiled before this changeset (see the figure below). The reason for the degradation is further checks required in performance-sensitive code (in particular `PhaseChaitin::remove_bound_register_from_interfering_live_ranges`). I have tried optimizing in various ways, but changes I found that lead to improvement also lead to less readable code (and are, in my opinion, no...
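The design described above, a statically sized "base" that serves the common case plus storage that is only allocated when a method needs more stack slots, together with an early exit in `overlap`, can be illustrated with a small model. This is a minimal Java sketch under stated assumptions, not the C++ `RegMask` from the patch: the class `GrowableRegMask`, the base size `BASE_WORDS`, and the doubling growth policy are all hypothetical.

```java
import java.util.Arrays;

// Hypothetical model (not HotSpot's C++ RegMask): a fixed-size base bit set
// covers the common case; extension words are allocated only when a bit
// beyond the base is inserted. Register numbers are assumed non-negative.
final class GrowableRegMask {
    static final int BASE_WORDS = 4;          // assumption: size of the static part
    static final int BITS_PER_WORD = Long.SIZE;

    private final long[] base = new long[BASE_WORDS];
    private long[] ext;                       // null until first needed

    void insert(int reg) {
        int word = reg / BITS_PER_WORD;
        long bit = 1L << (reg % BITS_PER_WORD);
        if (word < BASE_WORDS) {
            base[word] |= bit;
            return;
        }
        int extWord = word - BASE_WORDS;
        if (ext == null) {
            ext = new long[extWord + 1];
        } else if (extWord >= ext.length) {
            // Grow geometrically so that repeated inserts stay cheap.
            ext = Arrays.copyOf(ext, Math.max(extWord + 1, 2 * ext.length));
        }
        ext[extWord] |= bit;
    }

    boolean member(int reg) {
        int word = reg / BITS_PER_WORD;
        long bit = 1L << (reg % BITS_PER_WORD);
        if (word < BASE_WORDS) {
            return (base[word] & bit) != 0;
        }
        int extWord = word - BASE_WORDS;
        return ext != null && extWord < ext.length && (ext[extWord] & bit) != 0;
    }

    // Early exit: return at the first shared word, and skip the extension
    // entirely unless both masks actually have one.
    boolean overlap(GrowableRegMask other) {
        for (int i = 0; i < BASE_WORDS; i++) {
            if ((base[i] & other.base[i]) != 0) {
                return true;
            }
        }
        if (ext == null || other.ext == null) {
            return false;
        }
        int n = Math.min(ext.length, other.ext.length);
        for (int i = 0; i < n; i++) {
            if ((ext[i] & other.ext[i]) != 0) {
                return true;
            }
        }
        return false;
    }
}
```

Because the base words are always present, masks that never exceed them never allocate or scan `ext`, and `overlap` returns as soon as it finds a shared word; how this compares with simply making the static part larger is a separate performance question.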
>
> Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision:
>
>   Update after Roberto's comments and suggestions

After doing some simple performance testing, it looks like the best option may, after all, be to simply use bigger static register masks. Please hold off a bit with reviewing! I'm running some more rigorous tests for more architectures over the weekend to make sure there are no caveats.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20404#issuecomment-2307494194

From qamai at openjdk.org Fri Aug 23 17:48:10 2024
From: qamai at openjdk.org (Quan Anh Mai)
Date: Fri, 23 Aug 2024 17:48:10 GMT
Subject: RFR: 8335444: Generalize implementation of AndNode mul_ring [v3]
In-Reply-To: <6lNmXhkDKMNu7r8yYrBYs4JUJQ-drbj9Gj1_EaAqQj0=.c699170b-0551-4c04-8322-0182cf9b5866@github.com>
References: <7l_zt0Q884dkzJbP5kjhfvZPmFQ4CMRgakWoUCopfvw=.fc1307d0-6a2c-40cc-adc6-c1b21424d0dd@github.com> <6lNmXhkDKMNu7r8yYrBYs4JUJQ-drbj9Gj1_EaAqQj0=.c699170b-0551-4c04-8322-0182cf9b5866@github.com>

On Wed, 7 Aug 2024 01:20:09 GMT, Jasmine Karthikeyan wrote:

>> Hi all,
>> I've written this patch which improves type calculation for bitwise-and functions. Previously, the only cases that were considered were if one of the inputs to the node were a positive constant. I've generalized this behavior, as well as added a case to better estimate the result for arbitrary ranges. Since these are very common patterns to see, this can help propagate more precise types throughout the ideal graph for a large number of methods, making other optimizations and analyses stronger. I was interested in where this patch improves types, so I ran CTW for `java_base` and `java_base_2` and printed out the differences in this gist [here](https://gist.github.com/jaskarth/b45260d81ab621656f4a55cc51cf5292).
>> While I don't think it's particularly complicated, I've also added some discussion of the mathematics below, mostly because I thought it was interesting to work through :)
>>
>> This patch passes tier1-3 testing on my linux x64 machine. Thoughts and reviews would be very appreciated!
>
> Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision:
>
>   Check IR before macro expansion

I agree, I think your math is correct.

-------------

Marked as reviewed by qamai (Committer).

PR Review: https://git.openjdk.org/jdk/pull/20066#pullrequestreview-2257781634

From qamai at openjdk.org Fri Aug 23 17:56:08 2024
From: qamai at openjdk.org (Quan Anh Mai)
Date: Fri, 23 Aug 2024 17:56:08 GMT
Subject: RFR: 8335444: Generalize implementation of AndNode mul_ring [v3]
References: <7l_zt0Q884dkzJbP5kjhfvZPmFQ4CMRgakWoUCopfvw=.fc1307d0-6a2c-40cc-adc6-c1b21424d0dd@github.com> <8fVdLHTpCnPF-OMl9zFOgoCbENn2a_I06Iq1Ksapfs0=.056922a3-f56d-4ae0-ae63-c0a6d4463ba7@github.com>
Message-ID: <2N1kqQ66DJXQP2E66oTFa7-pQE2XvC-KBYyef61Ks4s=.29af55be-da2d-4bb6-b73b-407c2744613e@github.com>

On Thu, 22 Aug 2024 20:24:14 GMT, Jasmine Karthikeyan wrote:

>> Can the upper limit be improved, similar to what you added for the "both ranges are positive" case, if we know that both ranges are negative?
>> In the positive case, we have values from:
>>
>>     011...1
>>     000...0
>>
>> while in the negative case, we have values from:
>>
>>     111...1
>>     100...0
>>
>> It suggests that we can then use the same argument as for the positive case and say that the maximum will be the maximum of the smaller range (i.e. `MIN2(r0->_hi, r1->_hi)`)?
>
> This is a great observation! Since bitwise-and can only remove bits, the largest possible value is the smaller of each range's `hi` value, so I think it's correct to use the minimum here rather than the maximum.
> I didn't look into this case too deeply initially since I didn't find any bitwise-and nodes of two negative ranges in my investigation, but I think we should include it since it's a simple enough condition to check.

From another point of view, you can decompose `[lo, hi]` into `[lo, -1] v [0, hi]`. Then `[lo1, hi1] & [lo2, hi2]` can be calculated as:

    ([lo1, -1] & [lo2, -1]) v ([lo1, -1] & [0, hi2]) v ([0, hi1] & [lo2, -1]) v ([0, hi1] & [0, hi2]) =
    [lo, -1] v [0, hi2] v [0, hi1] v [0, min(hi1, hi2)] =
    [lo, max(hi1, hi2)]

with `lo` being the lower bound you calculated right above.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20066#discussion_r1729328074

From qamai at openjdk.org Fri Aug 23 17:56:08 2024
From: qamai at openjdk.org (Quan Anh Mai)
Date: Fri, 23 Aug 2024 17:56:08 GMT
Subject: RFR: 8335444: Generalize implementation of AndNode mul_ring [v3]
In-Reply-To: <2N1kqQ66DJXQP2E66oTFa7-pQE2XvC-KBYyef61Ks4s=.29af55be-da2d-4bb6-b73b-407c2744613e@github.com>
References: <7l_zt0Q884dkzJbP5kjhfvZPmFQ4CMRgakWoUCopfvw=.fc1307d0-6a2c-40cc-adc6-c1b21424d0dd@github.com> <8fVdLHTpCnPF-OMl9zFOgoCbENn2a_I06Iq1Ksapfs0=.056922a3-f56d-4ae0-ae63-c0a6d4463ba7@github.com> <2N1kqQ66DJXQP2E66oTFa7-pQE2XvC-KBYyef61Ks4s=.29af55be-da2d-4bb6-b73b-407c2744613e@github.com>

On Fri, 23 Aug 2024 17:52:19 GMT, Quan Anh Mai wrote:

>> This is a great observation! Since bitwise-and can only remove bits, the largest possible value is the smaller of each range's `hi` value, so I think it's correct to use the minimum here rather than the maximum. I didn't look into this case too deeply initially since I didn't find any bitwise-and nodes of two negative ranges in my investigation, but I think we should include it since it's a simple enough condition to check.
>
> From another point of view, you can decompose `[lo, hi]` into `[lo, -1] v [0, hi]`.
> Then `[lo1, hi1] & [lo2, hi2]` can be calculated as:
>
>     ([lo1, -1] & [lo2, -1]) v ([lo1, -1] & [0, hi2]) v ([0, hi1] & [lo2, -1]) v ([0, hi1] & [0, hi2]) =
>     [lo, -1] v [0, hi2] v [0, hi1] v [0, min(hi1, hi2)] =
>     [lo, max(hi1, hi2)]
>
> with `lo` being the lower bound you calculated right above.

Just to be clear, I think `MAX2` is the correct thing here.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20066#discussion_r1729329256

From psandoz at openjdk.org Fri Aug 23 22:33:09 2024
From: psandoz at openjdk.org (Paul Sandoz)
Date: Fri, 23 Aug 2024 22:33:09 GMT
Subject: RFR: 8338023: Support two vector selectFrom API [v5]
In-Reply-To: <7e5pWnvjqk-dQYNeaZjFzXcd5WlzniZPl5T4l1rKQGE=.0882bcd4-e307-4a29-aa41-5496ee029a60@github.com>
References: <28KQHru1heR-YOVsRVo8Ffj_4D29IV8vD2tombvTHdI=.dba80ac3-9804-4074-ac0f-8acb9b042a08@github.com> <7e5pWnvjqk-dQYNeaZjFzXcd5WlzniZPl5T4l1rKQGE=.0882bcd4-e307-4a29-aa41-5496ee029a60@github.com>
Message-ID: <2_P1qPMS46tgh4RUSuitcjXYnd0koS_BxfRRRmj79EY=.c3baeeaa-87f7-47d4-bc70-ae2afd9de745@github.com>

On Fri, 23 Aug 2024 06:09:48 GMT, Jatin Bhateja wrote:

>> Hi All,
>>
>> As per the discussion on the panama-dev mailing list [1], this patch adds support for the following new two-vector permutation API.
>>
>>
>> Declaration:-
>> Vector.selectFrom(Vector v1, Vector v2)
>>
>>
>> Semantics:-
>> Using index values stored in the lanes of "this" vector, assemble the values stored in the first (v1) and second (v2) vector arguments. Thus, the first and second vectors serve as a table, whose elements are selected based on the index value vector. The API is applicable to all integral and floating-point types. The result of this operation is semantically equivalent to the expression v1.rearrange(this.toShuffle(), v2). Values held in index vector lanes must lie within the valid two-vector index range [0, 2*VLEN), else an IndexOutOfBoundsException is thrown.
>>
>> Summary of changes:
>> - Java side implementation of new selectFrom API.
>> - C2 compiler IR and inline expander changes.
>> - In the absence of a direct two-vector permutation instruction in the target ISA, a lowering transformation dismantles the new IR into constituent IR supported by the target platform.
>> - Optimized x86 backend implementation for AVX512 and legacy targets.
>> - Functional tests covering the new API.
>>
>> The JMH micro included with this patch shows around a 10-15x gain over the existing rearrange API:
>> Test System: Intel(R) Xeon(R) Platinum 8480+ [Sapphire Rapids Server]
>>
>>
>> Benchmark                                     (size)   Mode  Cnt      Score   Error   Units
>> SelectFromBenchmark.rearrangeFromByteVector     1024  thrpt    2   2041.762          ops/ms
>> SelectFromBenchmark.rearrangeFromByteVector     2048  thrpt    2   1028.550          ops/ms
>> SelectFromBenchmark.rearrangeFromIntVector      1024  thrpt    2    962.605          ops/ms
>> SelectFromBenchmark.rearrangeFromIntVector      2048  thrpt    2    479.004          ops/ms
>> SelectFromBenchmark.rearrangeFromLongVector     1024  thrpt    2    359.758          ops/ms
>> SelectFromBenchmark.rearrangeFromLongVector     2048  thrpt    2    178.192          ops/ms
>> SelectFromBenchmark.rearrangeFromShortVector    1024  thrpt    2   1463.459          ops/ms
>> SelectFromBenchmark.rearrangeFromShortVector    2048  thrpt    2    727.556          ops/ms
>> SelectFromBenchmark.selectFromByteVector        1024  thrpt    2  33254.830          ops/ms
>> SelectFromBenchmark.selectFromByteVector        2048  thrpt    2  17313.174          ops/ms
>> SelectFromBenchmark.selectFromIntVector         1024  thrpt    2  10756.804          ops/ms
>> S...
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>
>   Removing redundant checkIndex routine

API changes look good. (Note: at the moment we are not proposing to change how shuffles work. As you point out, the two-vector `selectFrom` and `rearrange` differ in the index representation.)

IIUC, if the more direct two-table instruction is not available, you fall back to calling two single-arg rearranges with a blend, as a lowering transformation, similar to the fallback Java expression.
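The semantics quoted above (index lanes in [0, 2*VLEN) selecting from the concatenation of v1 and v2) and the fallback described in the review (two single-vector rearranges merged by a blend) can be modeled on plain arrays. This is a minimal scalar sketch that assumes `int` lanes and uses the array length as VLEN; the class and method names are hypothetical, and it does not use the incubating `jdk.incubator.vector` API or reflect how C2 actually lowers the operation.

```java
import java.util.Arrays;

// Scalar model of two-vector selectFrom on plain int arrays (hypothetical
// names; not the jdk.incubator.vector implementation).
final class SelectFromSketch {

    // Reference semantics: lane i of the result is element index[i] of the
    // concatenation v1 ++ v2. Valid indices lie in [0, 2 * VLEN).
    static int[] selectFrom(int[] index, int[] v1, int[] v2) {
        int vlen = index.length;
        if (v1.length != vlen || v2.length != vlen) {
            throw new IllegalArgumentException("all vectors must have the same length");
        }
        int[] result = new int[vlen];
        for (int lane = 0; lane < vlen; lane++) {
            int idx = index[lane];
            if (idx < 0 || idx >= 2 * vlen) {
                throw new IndexOutOfBoundsException(
                        "index " + idx + " outside [0, " + (2 * vlen) + ")");
            }
            result[lane] = idx < vlen ? v1[idx] : v2[idx - vlen];
        }
        return result;
    }

    // Fallback shape: two single-vector rearranges (each using the index
    // wrapped into [0, VLEN)) merged by a blend whose mask is "index >= VLEN".
    // Assumes the indices were already validated.
    static int[] selectFromViaBlend(int[] index, int[] v1, int[] v2) {
        int vlen = index.length;
        int[] fromV1 = new int[vlen];
        int[] fromV2 = new int[vlen];
        boolean[] takeV2 = new boolean[vlen];
        for (int lane = 0; lane < vlen; lane++) {
            int wrapped = index[lane] % vlen;
            fromV1[lane] = v1[wrapped];         // rearrange of v1
            fromV2[lane] = v2[wrapped];         // rearrange of v2
            takeV2[lane] = index[lane] >= vlen; // blend mask
        }
        int[] result = new int[vlen];
        for (int lane = 0; lane < vlen; lane++) {
            result[lane] = takeV2[lane] ? fromV2[lane] : fromV1[lane];
        }
        return result;
    }

    public static void main(String[] args) {
        int[] v1 = {10, 11, 12, 13};
        int[] v2 = {20, 21, 22, 23};
        int[] index = {7, 0, 5, 2};
        System.out.println(Arrays.toString(selectFrom(index, v1, v2)));
        System.out.println(Arrays.toString(selectFromViaBlend(index, v1, v2)));
    }
}
```

Both methods print `[23, 10, 21, 12]` for the inputs in `main`. The blend form shows why the fallback is equivalent: lanes whose index is at least VLEN take the element from the rearranged v2, and all other lanes take it from the rearranged v1.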
The float/double conversion bothers me. I'm not suggesting we do something about it here, just noting it down for any future conversation on shuffles. Ideally we would want the equivalent integral vector (int or long) to represent the index, which is tricky to express in the API, or alternatively treat it as a bitwise no-op conversion (there is also an impact on `toShuffle`).

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20508#issuecomment-2307886044

From sviswanathan at openjdk.org Fri Aug 23 23:33:14 2024
From: sviswanathan at openjdk.org (Sandhya Viswanathan)
Date: Fri, 23 Aug 2024 23:33:14 GMT
Subject: RFR: 8338021: Support saturating vector operators in VectorAPI [v4]

On Mon, 19 Aug 2024 07:19:30 GMT, Jatin Bhateja wrote:

>> Hi All,
>>
>> As per the discussion on the panama-dev mailing list [1], this patch adds support for the following new vector operators.
>>
>>
>>    . SUADD : Saturating unsigned addition.
>>    . SADD  : Saturating signed addition.
>>    . SUSUB : Saturating unsigned subtraction.
>>    . SSUB  : Saturating signed subtraction.
>>    . UMAX  : Unsigned max.
>>    . UMIN  : Unsigned min.
>>
>>
>> The new vector operators are applicable only to integral types, since their values wrap around in overflow/underflow scenarios after setting the appropriate status flags. For floating-point types, as per the IEEE 754 specification, there are multiple schemes to handle underflow; one of them is gradual underflow, which transitions the value to the subnormal range. Similarly, overflow implicitly saturates the floating-point value to an infinite value.
>>
>> As the name suggests, these are saturating operations, i.e. the result of the computation is strictly capped by the lower and upper bounds of the result type and is not wrapped around in underflow or overflow scenarios.
>>
>> Summary of changes:
>> - Java side implementation of new vector operators.
>> - Add new scalar saturating APIs for each of the above saturating vector operators in the corresponding primitive box classes; the fallback implementation of the vector operators is based on them.
>> - C2 compiler IR and inline expander changes.
>> - Optimized x86 backend implementation for the new vector operators and their predicated counterparts.
>> - Extends the existing VectorAPI jtreg test suite to cover the new operations.
>>
>> Kindly review and share your feedback.
>>
>> Best Regards,
>> PS: Intrinsification and auto-vectorization of the new core-lib API will be addressed separately in a follow-up patch.
>>
>> [1] https://mail.openjdk.org/pipermail/panama-dev/2024-May/020408.html
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>
>   Review comments resolutions.

src/java.base/share/classes/java/lang/Byte.java line 647:

> 645:      */
> 646:     public static byte subSaturating(byte a, byte b) {
> 647:         byte res = (byte)(a - b);

Could we not do subSaturating as an int operation, along the same lines as addSaturating?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1729570756

From duke at openjdk.org Sat Aug 24 06:27:26 2024
From: duke at openjdk.org (Shaojin Wen)
Date: Sat, 24 Aug 2024 06:27:26 GMT
Subject: RFR: 8333893: Optimization for StringBuilder append boolean & null [v17]

> After PR https://github.com/openjdk/jdk/pull/16245, C2 optimizes stores into primitive arrays by combining values into larger stores.
>
> This PR rewrites the code of the appendNull and append(boolean) methods so that these two methods can be optimized by C2.

Shaojin Wen has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains 19 additional commits since the last revision:

 - Merge remote-tracking branch 'upstream/master' into optim_str_builder_append_202406
 - Merge remote-tracking branch 'upstream/master' into optim_str_builder_append_202406
 - Merge remote-tracking branch 'upstream/master' into optim_str_builder_append_202406
 - replace unsafe with putChar
 - Merge remote-tracking branch 'upstream/master' into optim_str_builder_append_202406
 - private static final field `UNSAFE`
 - Utf16 case remove `append first utf16 char`
 - `delete` -> `setLength`
 - copyright 2024
 - optimization for x64
 - ... and 9 more: https://git.openjdk.org/jdk/compare/d881a7b4...61196ecd

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/19626/files
  - new: https://git.openjdk.org/jdk/pull/19626/files/d2dcc24d..61196ecd

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=19626&range=16
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=19626&range=15-16

  Stats: 46886 lines in 1334 files changed: 26149 ins; 14297 del; 6440 mod
  Patch: https://git.openjdk.org/jdk/pull/19626.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/19626/head:pull/19626

PR: https://git.openjdk.org/jdk/pull/19626

From sroy at openjdk.org Sun Aug 25 16:53:35 2024
From: sroy at openjdk.org (Suchismith Roy)
Date: Sun, 25 Aug 2024 16:53:35 GMT
Subject: RFR: JDK-8332423 : [PPC64] Remove C1_MacroAssembler::call_c_with_frame_resize

JBS Issue : [JDK-8332423](https://bugs.openjdk.org/browse/JDK-8332423)
C1_MacroAssembler::call_c_with_frame_resize is only used with frame_resize == 0.
Also, call_c is adapted according to the endianness of the system.
We can adapt the existing code to handle the endianness check in one place, instead of repeating the check at every place that calls call_c.
-------------

Commit messages:
 - spaces
 - spaces
 - spaces
 - remove Endianess check
 - define call_c_runtime
 - c1

Changes: https://git.openjdk.org/jdk/pull/19947/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=19947&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8332423
  Stats: 151 lines in 10 files changed: 103 ins; 37 del; 11 mod
  Patch: https://git.openjdk.org/jdk/pull/19947.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/19947/head:pull/19947

PR: https://git.openjdk.org/jdk/pull/19947

From sroy at openjdk.org Sun Aug 25 17:00:33 2024
From: sroy at openjdk.org (Suchismith Roy)
Date: Sun, 25 Aug 2024 17:00:33 GMT
Subject: RFR: JDK-8332423 : [PPC64] Remove C1_MacroAssembler::call_c_with_frame_resize [v2]

> JBS Issue : [JDK-8332423](https://bugs.openjdk.org/browse/JDK-8332423)
> C1_MacroAssembler::call_c_with_frame_resize is only used with frame_resize == 0.
> Also, call_c is adapted according to the endianness of the system.
> We can adapt the existing code to handle the endianness check in one place, instead of repeating the check at every place that calls call_c.
Suchismith Roy has updated the pull request incrementally with one additional commit since the last revision:

  remove this->

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/19947/files
  - new: https://git.openjdk.org/jdk/pull/19947/files/20f8f106..2b07a3bc

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=19947&range=01
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=19947&range=00-01

  Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
  Patch: https://git.openjdk.org/jdk/pull/19947.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/19947/head:pull/19947

PR: https://git.openjdk.org/jdk/pull/19947

From thartmann at openjdk.org Mon Aug 26 05:22:11 2024
From: thartmann at openjdk.org (Tobias Hartmann)
Date: Mon, 26 Aug 2024 05:22:11 GMT
Subject: RFR: 8338844: C2: remove useless code in PhaseIdealLoop::place_outside_loop() after 8335709
Message-ID: <1oaalHVBy5D8Rmphp7YCHYjPFjB9q_A3zVqg2pqTB9w=.a3bf7351-27a3-4f9d-a15a-d4c0c48fa3fc@github.com>

On Thu, 22 Aug 2024 15:26:40 GMT, Roland Westrelin wrote:

> This removes code that shouldn't be necessary now that the
> NeverBranch/CProj nodes are assigned the correct loop. I proposed this
> initially as part of https://github.com/openjdk/jdk/pull/20334 where
> the recommendation was to take care of it separately to make
> backports easier.

Looks good to me too.

-------------

Marked as reviewed by thartmann (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/20678#pullrequestreview-2259764961

From thartmann at openjdk.org Mon Aug 26 05:23:10 2024
From: thartmann at openjdk.org (Tobias Hartmann)
Date: Mon, 26 Aug 2024 05:23:10 GMT
Subject: RFR: 8336830: C2: assert(get_loop(lca)->_nest < n_loop->_nest || lca->in(0)->is_NeverBranch()) failed: must not be moved into inner loop [v2]

On Thu, 22 Aug 2024 15:09:19 GMT, Roland Westrelin wrote:

>> A store is sunk from a counted loop into an enclosing infinite
>> loop.
>> The assert fires because:
>>
>>
>>     get_loop(lca)->_nest < n_loop->_nest
>>
>>
>> is false. That happens because the outer loop was found to be infinite
>> in the current loop opts pass. When that happens, it's not properly
>> attached to the loop tree. The second part of the assert was added to
>> cover a similar case:
>>
>>
>>     lca->in(0)->is_NeverBranch()
>>
>>
>> but it doesn't work in this case because lca is not a projection of the
>> `NeverBranch`. It's the exit projection of the counted loop. The fix I
>> propose changes that part of the assert to test that lca is, indeed,
>> in an infinite loop in a way that's robust.
>>
>> I also removed some code that I believe to be useless following
>> 8335709.
>
> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision:
>
>   undo unrelated change

Still good.

-------------

Marked as reviewed by thartmann (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/20334#pullrequestreview-2259766162

From thartmann at openjdk.org Mon Aug 26 05:37:06 2024
From: thartmann at openjdk.org (Tobias Hartmann)
Date: Mon, 26 Aug 2024 05:37:06 GMT
Subject: RFR: 8330157: C2: Add a stress flag for bailouts [v9]
In-Reply-To: <1v-WNWigdtAWl6wS1BE3S4kikAZo6zuyOc9Q9KxxmZo=.1b5c9937-3043-440d-ab77-839e7d152bf3@github.com>
References: <1v-WNWigdtAWl6wS1BE3S4kikAZo6zuyOc9Q9KxxmZo=.1b5c9937-3043-440d-ab77-839e7d152bf3@github.com>
Message-ID: <0NVULwL13VG5MGVef1Qjcp6HPhMQJtXeIdCIJGYavT8=.1501f68e-d15a-48fa-bba4-4d03817f73bd@github.com>

On Fri, 23 Aug 2024 15:03:14 GMT, Daniel Skantz wrote:

>> Daniel Skantz has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   +1 whitespace
>
> Comment to avoid timeout.

@danielogh is this ready for further reviews, or are you still working on the suggestion that @eme64 had?
-------------

PR Comment: https://git.openjdk.org/jdk/pull/19646#issuecomment-2309355395

From chagedorn at openjdk.org Mon Aug 26 05:52:04 2024
From: chagedorn at openjdk.org (Christian Hagedorn)
Date: Mon, 26 Aug 2024 05:52:04 GMT
Subject: RFR: 8336830: C2: assert(get_loop(lca)->_nest < n_loop->_nest || lca->in(0)->is_NeverBranch()) failed: must not be moved into inner loop [v2]

On Thu, 22 Aug 2024 15:09:19 GMT, Roland Westrelin wrote:

>> A store is sunk from a counted loop into an enclosing infinite
>> loop. The assert fires because:
>>
>>
>>     get_loop(lca)->_nest < n_loop->_nest
>>
>>
>> is false. That happens because the outer loop was found to be infinite
>> in the current loop opts pass. When that happens, it's not properly
>> attached to the loop tree. The second part of the assert was added to
>> cover a similar case:
>>
>>
>>     lca->in(0)->is_NeverBranch()
>>
>>
>> but it doesn't work in this case because lca is not a projection of the
>> `NeverBranch`. It's the exit projection of the counted loop. The fix I
>> propose changes that part of the assert to test that lca is, indeed,
>> in an infinite loop in a way that's robust.
>>
>> I also removed some code that I believe to be useless following
>> 8335709.
>
> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision:
>
>   undo unrelated change

Looks good!

-------------

Marked as reviewed by chagedorn (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/20334#pullrequestreview-2259801469

From thartmann at openjdk.org Mon Aug 26 06:29:07 2024
From: thartmann at openjdk.org (Tobias Hartmann)
Date: Mon, 26 Aug 2024 06:29:07 GMT
Subject: RFR: 8032218: Emit single post-constructor barrier for chain of superclass constructors [v4]
References: <9Gvg3HswFVpqRnVZQ4HLf6WJKcM1_St_iVTC0GKhMgk=.b1081e57-b73e-43ee-ab1b-d8d6b0bf9cc5@github.com>
Message-ID: <8Lnc2ef2a_dQDtrX5zRSSB5qt6onOWaAdwGkJQawai4=.02c7e56a-3485-42c8-bd2c-9de3a58a6826@github.com>

On Sun, 18 Aug 2024 07:35:41 GMT, Joshua Cao wrote:

>> [C2 emits a StoreStore barrier for each constructor call](https://github.com/openjdk/jdk/blob/72ca7bafcd49a98c1fe09da72e4e47683f052e9d/src/hotspot/share/opto/parse1.cpp#L1016) in a chain of superclass constructor calls. It is unnecessary. We only need to emit a single barrier for each object allocation / each pair of `Allocation/InitializeNode`.
>>
>> [Macro expansion emits a trailing StoreStore after an InitializeNode](https://github.com/openjdk/jdk/blob/32946e1882e9b22c983cbba3c6bda3cc7295946a/src/hotspot/share/opto/macro.cpp#L1610-L1628). This `StoreStore` is sufficient as the post-constructor barrier. From the [InitializeNode definition](https://github.com/openjdk/jdk/blob/32946e1882e9b22c983cbba3c6bda3cc7295946a/src/hotspot/share/opto/memnode.cpp#L3639-L3642):
>>
>>> // An InitializeNode collects and isolates object initialization after
>> // an AllocateNode and before the next possible safepoint. As a
>> // memory barrier (MemBarNode), it keeps critical stores from drifting
>> // down past any safepoint or any publication of the allocation.
>>
>> This PR modifies `Parse::do_exits()` such that it only emits a barrier for a constructor if we find that the constructed object does not have an `InitializeNode`. It is possible that we cannot find an `InitializeNode` i.e. if the outermost method of the compilation unit is the constructor.
>> We still need to emit a barrier in these cases.
>>
>> Passes hotspot tier1 locally on an x86 Linux machine. New tests make sure that there is a single `StoreStore` for chained constructors.
>
> Joshua Cao has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 27 commits:
>
>  - Add tests for Stable fields
>  - Fix typo in comment block
>  - Merge branch 'master' into chainstorestore
>  - Attempt2: Only emit StoreStore in do_exits when there is no parent
>    caller
>  - Merge branch 'master' of https://git.openjdk.org/jdk into chainstorestore
>  - 8032218: Emit single post-constructor barrier for chain of superclass constructors
>  - Add riscv64 to test
>  - Merge branch 'master' into storestore
>  - Merge branch 'master' into storestore
>  - Apply suggestions from code review
>
>    some formatting suggestions from @shipilev
>
>    Co-authored-by: Aleksey Shipilëv
>  - ... and 17 more: https://git.openjdk.org/jdk/compare/8635642d...acca7a26

`compiler/stringopts/TestStringObjectInitialization.java` fails on Linux AArch64 with `-XX:-TieredCompilation -XX:+AlwaysIncrementalInline`:

java.lang.NullPointerException: Cannot read the array length because "this.value" is null
    at java.base/java.lang.String.length(String.java:1593)
    at java.base/java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:590)
    at java.base/java.lang.StringBuilder.append(StringBuilder.java:179)
    at compiler.stringopts.TestStringObjectInitialization.add(TestStringObjectInitialization.java:62)
    at compiler.stringopts.TestStringObjectInitialization.run(TestStringObjectInitialization.java:67)
    at compiler.stringopts.TestStringObjectInitialization$Runner.run(TestStringObjectInitialization.java:85)
    at java.base/java.lang.Thread.run(Thread.java:1575)
STATUS:Failed.`main' threw exception: java.lang.NullPointerException: Cannot read the array length because "this.value" is null
java.lang.NullPointerException: Cannot read the array length because "this.value" is null
    at java.base/java.lang.String.length(String.java:1593)
    at java.base/java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:590)
    at java.base/java.lang.StringBuilder.append(StringBuilder.java:179)
    at compiler.stringopts.TestStringObjectInitialization.add(TestStringObjectInitialization.java:62)
    at compiler.stringopts.TestStringObjectInitialization.run(TestStringObjectInitialization.java:67)
    at compiler.stringopts.TestStringObjectInitialization$Runner.run(TestStringObjectInitialization.java:85)
    at java.base/java.lang.Thread.run(Thread.java:1575)
STATUS:Failed.`main' threw exception: java.lang.NullPointerException: Cannot read the array length because "this.value" is null
java.lang.NullPointerException: Cannot read the array length because "this.value" is null
    at java.base/java.lang.String.length(String.java:1593)
    at java.base/java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:590)
    at java.base/java.lang.StringBuilder.append(StringBuilder.java:179)
    at compiler.stringopts.TestStringObjectInitialization.add(TestStringObjectInitialization.java:62)
    at compiler.stringopts.TestStringObjectInitialization.run(TestStringObjectInitialization.java:67)
    at compiler.stringopts.TestStringObjectInitialization$Runner.run(TestStringObjectInitialization.java:85)
    at java.base/java.lang.Thread.run(Thread.java:1575)

I also see this with an internal stress test:

# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (/workspace/open/src/hotspot/share/opto/escape.cpp:4674), pid=22930, tid=42355
# assert(n->is_Mem()) failed: memory node required.
#
# JRE version: Java(TM) SE Runtime Environment (24.0) (fastdebug build 24-internal-2024-08-26-0524112.tobias.hartmann.jdk3)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 24-internal-2024-08-26-0524112.tobias.hartmann.jdk3, mixed mode, tiered, compressed class ptrs, z gc, linux-amd64)
# Problematic frame:
# V [libjvm.so+0xbcc316] ConnectionGraph::split_unique_types(GrowableArray&, GrowableArray&, GrowableArray&, Unique_Node_List&)+0x3796
#

Current CompileTask:
C2:617119 124774 4 sun.util.locale.LocaleExtensions::toID (136 bytes)

Stack: [0x00007fd262f9b000,0x00007fd26309b000], sp=0x00007fd263095a70, free space=1002k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0xbcc316] ConnectionGraph::split_unique_types(GrowableArray&, GrowableArray&, GrowableArray&, Unique_Node_List&)+0x3796 (escape.cpp:4674)
V [libjvm.so+0xbd46b9] ConnectionGraph::compute_escape()+0x20e9 (escape.cpp:397)
V [libjvm.so+0xbd4f11] ConnectionGraph::do_analysis(Compile*, PhaseIterGVN*)+0xf1 (escape.cpp:119)
V [libjvm.so+0x9fbeaa] Compile::Optimize()+0x63a (compile.cpp:2324)
V [libjvm.so+0x9ffe13] Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x1b43 (compile.cpp:852)
V [libjvm.so+0x84f2c5] C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x1d5 (c2compiler.cpp:142)
V [libjvm.so+0xa0bad8] CompileBroker::invoke_compiler_on_method(CompileTask*)+0x928 (compileBroker.cpp:2303)
V [libjvm.so+0xa0c768] CompileBroker::compiler_thread_loop()+0x478 (compileBroker.cpp:1961)
V [libjvm.so+0xeb67bc] JavaThread::thread_main_inner()+0xcc (javaThread.cpp:758)
V [libjvm.so+0x17e2ab6] Thread::call_run()+0xb6 (thread.cpp:225)
V [libjvm.so+0x14cc747] thread_native_entry(Thread*)+0x127 (os_linux.cpp:858)

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18870#issuecomment-2309423452

From rcastanedalo at openjdk.org Mon Aug 26 07:26:08 2024
From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano)
Date: Mon, 26 Aug 2024 07:26:08 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v2] In-Reply-To: References: <4c-MLXwKcNcSnloSkYkuk3gnv3ux5i5beS51Fd9Z8MQ=.cd0a7eba-ff26-4855-a01c-d1ae5182100b@github.com> <8fuUEkswt05x0IuT4PrNQuYgLd49g4EpZWOPPQog4PQ=.70b5edb6-98d0-4276-8578-f7a496b7f2a7@github.com> <3H3rBSKDnpg5fmYqcZ5hT9yH2EAxCocycRompQJJCOo=.1b30fd89-09e9-4708-bd20-cdea00e809a7@github.com> Message-ID: <7Bjcf6MF4aTBuk4DmTnGzP0WwCWqJx_sv5k2sGMt9No=.ca426f04-4fa5-4ab1-a414-8c5e6a4e0dce@github.com> On Fri, 23 Aug 2024 13:28:03 GMT, Martin Doerr wrote: >> OK, thanks. I just ran some benchmarks with zero-based OOP compression ([prototype here](https://github.com/robcasloz/jdk/tree/JDK-8334060-g1-late-barrier-expansion-x64-optimizations)) and could not observe any significant performance effect on three different x64 implementations. I think I will keep the `g1StoreN` implementation as-is in the x64 and aarch64 backends, for simplicity. Again, we can revisit this in follow-up work if need be. > > I have an experimental implementation for PPC64. I have moved the oop decoding into `G1BarrierSetAssembler::g1_write_barrier_post_c2`: > https://github.com/TheRealMDoerr/jdk/blob/0aedfb0aa1c545319257c0e613066b91404a07ca/src/hotspot/cpu/ppc/gc/g1/g1BarrierSetAssembler_ppc.cpp#L476 > This has 2 advantages: > - Reduce replicated code in the .ad file. > - Make the discussed optimization easy. Please take a look. Great that you already have an experimental port! Thanks for the heads-up, I agree that the OOP decoding + null check fusion becomes less intrusive, but I still prefer the current decoupled implementation for x64 and aarch64 (even simpler, IMO). In the benchmarks I have run (admittedly, only on x64), I could not observe any positive effect, whereas I found a slight regression in one case using zero-based OOP compression. 
I have not investigated further, but I wonder if hoisting the null check above the region-crossing test could have a negative impact on branch predictability. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1730806021 From roland at openjdk.org Mon Aug 26 07:34:13 2024 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 26 Aug 2024 07:34:13 GMT Subject: RFR: 8336830: C2: assert(get_loop(lca)->_nest < n_loop->_nest || lca->in(0)->is_NeverBranch()) failed: must not be moved into inner loop [v2] In-Reply-To: References: Message-ID: On Mon, 26 Aug 2024 05:20:43 GMT, Tobias Hartmann wrote: >> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: >> >> undo unrelated change > > Still good. @TobiHartmann @chhagedorn @eme64 thanks for the reviews ------------- PR Comment: https://git.openjdk.org/jdk/pull/20334#issuecomment-2309531567 From roland at openjdk.org Mon Aug 26 07:34:14 2024 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 26 Aug 2024 07:34:14 GMT Subject: Integrated: 8336830: C2: assert(get_loop(lca)->_nest < n_loop->_nest || lca->in(0)->is_NeverBranch()) failed: must not be moved into inner loop In-Reply-To: References: Message-ID: On Thu, 25 Jul 2024 15:16:39 GMT, Roland Westrelin wrote: > A store is sunk from a counted loop into an enclosing infinite > loop. The assert fires because: > > > get_loop(lca)->_nest < n_loop->_nest > > > is false. That happens because the outer loop was found to be infinite > in the current loop opts pass. When that happens, it's not properly > attached to the loop tree. The second part of the assert was added to > cover a similar case: > > > lca->in(0)->is_NeverBranch() > > > but it doesn't work in this case because lca is not a projection of the > `NeverBranch`. It's the exit projection of the counted loop.
The fix I > propose changes that part of the assert to test that lca is, indeed, > in an infinite loop in a way that's robust. > > I also removed some code that I believe to be useless following > 8335709. This pull request has now been integrated. Changeset: 0c14579f Author: Roland Westrelin URL: https://git.openjdk.org/jdk/commit/0c14579fef902f0501d0510bdc32e8cece34834a Stats: 58 lines in 2 files changed: 57 ins; 0 del; 1 mod 8336830: C2: assert(get_loop(lca)->_nest < n_loop->_nest || lca->in(0)->is_NeverBranch()) failed: must not be moved into inner loop Co-authored-by: Emanuel Peter Reviewed-by: thartmann, chagedorn, epeter ------------- PR: https://git.openjdk.org/jdk/pull/20334 From roland at openjdk.org Mon Aug 26 07:35:07 2024 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 26 Aug 2024 07:35:07 GMT Subject: RFR: 8338844: C2: remove useless code in PhaseIdealLoop::place_outside_loop() after 8335709 In-Reply-To: <1oaalHVBy5D8Rmphp7YCHYjPFjB9q_A3zVqg2pqTB9w=.a3bf7351-27a3-4f9d-a15a-d4c0c48fa3fc@github.com> References: <1oaalHVBy5D8Rmphp7YCHYjPFjB9q_A3zVqg2pqTB9w=.a3bf7351-27a3-4f9d-a15a-d4c0c48fa3fc@github.com> Message-ID: On Mon, 26 Aug 2024 05:19:36 GMT, Tobias Hartmann wrote: >> This removes code that shouldn't be necessary now that the >> NeverBranch/CProj nodes are assigned the correct loop. I proposed this >> initially as part of https://github.com/openjdk/jdk/pull/20334 where >> the recommendation was to take care of it separately to make >> backports easier. > > Looks good to me too. @TobiHartmann @chhagedorn thanks for the reviews. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/20678#issuecomment-2309535403 From roland at openjdk.org Mon Aug 26 07:35:08 2024 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 26 Aug 2024 07:35:08 GMT Subject: Integrated: 8338844: C2: remove useless code in PhaseIdealLoop::place_outside_loop() after 8335709 In-Reply-To: References: Message-ID: On Thu, 22 Aug 2024 15:26:40 GMT, Roland Westrelin wrote: > This removes code that shouldn't be necessary now that the > NeverBranch/CProj nodes are assigned the correct loop. I proposed this > initially as part of https://github.com/openjdk/jdk/pull/20334 where > the recommendation was to take care of it separately to make > backports easier. This pull request has now been integrated. Changeset: ce83f6af Author: Roland Westrelin URL: https://git.openjdk.org/jdk/commit/ce83f6af64efd673b83c945765f68e8a3bf89774 Stats: 3 lines in 1 file changed: 0 ins; 2 del; 1 mod 8338844: C2: remove useless code in PhaseIdealLoop::place_outside_loop() after 8335709 Reviewed-by: chagedorn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/20678 From mdoerr at openjdk.org Mon Aug 26 07:46:06 2024 From: mdoerr at openjdk.org (Martin Doerr) Date: Mon, 26 Aug 2024 07:46:06 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v2] In-Reply-To: <7Bjcf6MF4aTBuk4DmTnGzP0WwCWqJx_sv5k2sGMt9No=.ca426f04-4fa5-4ab1-a414-8c5e6a4e0dce@github.com> References: <4c-MLXwKcNcSnloSkYkuk3gnv3ux5i5beS51Fd9Z8MQ=.cd0a7eba-ff26-4855-a01c-d1ae5182100b@github.com> <8fuUEkswt05x0IuT4PrNQuYgLd49g4EpZWOPPQog4PQ=.70b5edb6-98d0-4276-8578-f7a496b7f2a7@github.com> <3H3rBSKDnpg5fmYqcZ5hT9yH2EAxCocycRompQJJCOo=.1b30fd89-09e9-4708-bd20-cdea00e809a7@github.com> <7Bjcf6MF4aTBuk4DmTnGzP0WwCWqJx_sv5k2sGMt9No=.ca426f04-4fa5-4ab1-a414-8c5e6a4e0dce@github.com> Message-ID: On Mon, 26 Aug 2024 07:23:40 GMT, Roberto Castañeda Lozano wrote: >> I have an experimental implementation for PPC64.
I have moved the oop decoding into `G1BarrierSetAssembler::g1_write_barrier_post_c2`: >> https://github.com/TheRealMDoerr/jdk/blob/0aedfb0aa1c545319257c0e613066b91404a07ca/src/hotspot/cpu/ppc/gc/g1/g1BarrierSetAssembler_ppc.cpp#L476 >> This has 2 advantages: >> - Reduce replicated code in the .ad file. >> - Make the discussed optimization easy. Please take a look. > > Great that you already have an experimental port! Thanks for the heads-up, I agree that the OOP decoding + null check fusion becomes less intrusive, but I still prefer the current decoupled implementation for x64 and aarch64 (even simpler, IMO). In the benchmarks I have run (admittedly, only on x64), I could not observe any positive effect, whereas I found a slight regression in one case using zero-based OOP compression. I have not investigated further, but I wonder if hoisting the null check above the region-crossing test could have a negative impact on branch predictability. It can be implemented like this: - If oop decoding requires a null check, redirect the branch to jump over the barrier code. - Else insert the null check after the region crossing check. This way, I don't see how it can have a negative effect. But I leave you free to decide about x86 and aarch64. Optimizations could be done later if needed as you already mentioned. Did you also see my comment https://github.com/openjdk/jdk/pull/19746#discussion_r1728987173 ? It's in the "resolved" discussion. 
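The null-check placement discussed above can be illustrated with a small, self-contained model of a G1-style post-write barrier. This is a hypothetical C++ sketch, not HotSpot code: the region and card sizes (`kLogRegionBytes`, `kLogCardBytes`), the `PostBarrierOutcome` enum and the `post_barrier` function are invented for illustration, and the young-card check, the StoreLoad fence and the dirty-card queue are left out. It models only the second case (oop decoding needs no null check of its own), where the null check is placed after the region-crossing check so that same-region stores never execute it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model parameters (assumptions, not values read from HotSpot).
constexpr unsigned kLogRegionBytes = 20;  // assume 1 MiB heap regions
constexpr unsigned kLogCardBytes = 9;     // assume 512-byte cards

enum class PostBarrierOutcome { SameRegion, NullValue, CardDirtied };

// Toy G1-style post-write barrier for storing new_val into the field at
// store_addr. Filters run cheapest-first: region-crossing test, then the
// null check, then the card mark.
PostBarrierOutcome post_barrier(std::uintptr_t store_addr,
                                std::uintptr_t new_val,
                                std::vector<std::uint8_t>& card_table) {
  // A reference into the same region never needs to be recorded.
  if (((store_addr ^ new_val) >> kLogRegionBytes) == 0) {
    return PostBarrierOutcome::SameRegion;
  }
  // A null store creates no cross-region reference. Testing it only after
  // the region filter keeps this branch off the same-region fast path.
  if (new_val == 0) {
    return PostBarrierOutcome::NullValue;
  }
  std::size_t card = store_addr >> kLogCardBytes;
  assert(card < card_table.size() && "model card table too small");
  card_table[card] = 0;  // 0 means "dirty" in this model
  return PostBarrierOutcome::CardDirtied;
}
```

For example, with the assumed 1 MiB regions, storing `0x300100` into a field at `0x300000` is filtered as same-region, while storing null into the same field falls through to the null check.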
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1730832653 From rcastanedalo at openjdk.org Mon Aug 26 08:32:06 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Mon, 26 Aug 2024 08:32:06 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v2] In-Reply-To: <6DcMr9PUa8OZEhO861hkhqTMYlKDs96tgPs4Fu1u72I=.9dd2d0d4-9075-4c36-9ab6-ff886e257b4b@github.com> References: <4c-MLXwKcNcSnloSkYkuk3gnv3ux5i5beS51Fd9Z8MQ=.cd0a7eba-ff26-4855-a01c-d1ae5182100b@github.com> <4aS0KysKaIPue1D60nootRPb8m7pdP_2hlPSwGUIk8w=.6b68b278-62ce-4df0-a6c6-c7fee40557aa@github.com> <6DcMr9PUa8OZEhO861hkhqTMYlKDs96tgPs4Fu1u72I=.9dd2d0d4-9075-4c36-9ab6-ff886e257b4b@github.com> Message-ID: On Fri, 23 Aug 2024 13:33:09 GMT, Martin Doerr wrote: >> Thanks for the reference, I would still prefer to keep this part as is for simplicity. We can always optimize the atomic barriers in follow-up work, if a need arises. > > After thinking more about this, I figured out that we can optimize more when moving the pre_barrier after the cmpxchg. We can skip all G1 barriers if the cmpxchg fails: > https://github.com/TheRealMDoerr/jdk/blob/0aedfb0aa1c545319257c0e613066b91404a07ca/src/hotspot/cpu/ppc/gc/g1/g1_ppc.ad#L171 > The cmpxchg jumps to no_update on failure. This may reduce load on GC queue handling and related work for GC threads. I'm testing this version and I actually like it more than the version I had before. Please take a look. > > (Note that my final version will need https://github.com/openjdk/jdk/pull/20689 to be integrated and merged into your PR.) Right, that makes sense since for PPC's cmpxchg implementation (unlike x64 or aarch64+LSE) you are already explicitly branching on failure anyway.
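The effect of moving the pre-barrier after the cmpxchg, so that a failed CAS skips every G1 barrier, can be sketched with a toy model. This is hypothetical C++, not the PPC64 `g1_ppc.ad` code: `std::atomic<std::uintptr_t>` stands in for a reference field, and the invented `BarrierQueues` struct stands in for G1's SATB and dirty-card queues, with the marking-active check and the region and card filters omitted. A failed CAS overwrites nothing, so there is neither an old value to log nor a new reference to record.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Toy stand-ins for G1's queues (hypothetical, for illustration only).
struct BarrierQueues {
  std::vector<std::uintptr_t> satb_queue;   // overwritten values (pre-barrier)
  std::vector<std::uintptr_t> dirty_cards;  // updated field addresses (post-barrier)
};

// CAS on a reference field with both barriers emitted only on the success
// path, mirroring "the cmpxchg jumps to no_update on failure".
bool cas_with_barriers(std::atomic<std::uintptr_t>& field,
                       std::uintptr_t expected, std::uintptr_t new_val,
                       BarrierQueues& q) {
  std::uintptr_t witness = expected;
  if (!field.compare_exchange_strong(witness, new_val)) {
    return false;  // no update happened: skip pre- and post-barrier
  }
  // On success the overwritten value equals `expected`, so the pre-barrier
  // can log it after the CAS (null values are not logged).
  if (expected != 0) {
    q.satb_queue.push_back(expected);
  }
  // The post-barrier records the updated location for non-null stores.
  if (new_val != 0) {
    q.dirty_cards.push_back(reinterpret_cast<std::uintptr_t>(&field));
  }
  return true;
}
```

In this model, the only work a failed CAS performs is the compare-and-swap itself; all queue traffic happens on the success path.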
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1730891297 From rcastanedalo at openjdk.org Mon Aug 26 08:41:08 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Mon, 26 Aug 2024 08:41:08 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v2] In-Reply-To: References: <4c-MLXwKcNcSnloSkYkuk3gnv3ux5i5beS51Fd9Z8MQ=.cd0a7eba-ff26-4855-a01c-d1ae5182100b@github.com> <8fuUEkswt05x0IuT4PrNQuYgLd49g4EpZWOPPQog4PQ=.70b5edb6-98d0-4276-8578-f7a496b7f2a7@github.com> <3H3rBSKDnpg5fmYqcZ5hT9yH2EAxCocycRompQJJCOo=.1b30fd89-09e9-4708-bd20-cdea00e809a7@github.com> <7Bjcf6MF4aTBuk4DmTnGzP0WwCWqJx_sv5k2sGMt9No=.ca426f04-4fa5-4ab1-a414-8c5e6a4e0dce@github.com> Message-ID: <6PF-kgezzOb9Ed7j-BbrwaURnLJH5aFgOouFwTYiFrE=.670092b8-6980-43f3-a091-25312cfa0f1b@github.com> On Mon, 26 Aug 2024 07:43:39 GMT, Martin Doerr wrote: > This way, I don't see how it can have a negative effect. I agree, this is the implementation I tried out originally (https://github.com/openjdk/jdk/pull/19746#discussion_r1719811953). > Did you also see my comment https://github.com/openjdk/jdk/pull/19746#discussion_r1728987173 ? It's in the "resolved" discussion. Yes, thanks, I "unresolved" it now.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1730905873 From roland at openjdk.org Mon Aug 26 08:46:41 2024 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 26 Aug 2024 08:46:41 GMT Subject: RFR: 8334724: C2: remove PhaseIdealLoop::cast_incr_before_loop() [v3] In-Reply-To: <1w8L64U3JeY8qKskSNm4OlTablOucHHThfW-9nakz4E=.e52b0193-9048-4382-8242-f5296483898a@github.com> References: <1w8L64U3JeY8qKskSNm4OlTablOucHHThfW-9nakz4E=.e52b0193-9048-4382-8242-f5296483898a@github.com> Message-ID: <_J7rxi43QXdtD_aJuhnbnoxDFUIi7HMFsHUutkpijw8=.324943d4-001a-4833-a106-ee46f7fbc6e6@github.com> > I propose removing `PhaseIdealLoop::cast_incr_before_loop()` and the > `CastII` nodes that it adds at counted loop heads. > > They were added to prevent nodes from floating above the zero trip guard > when the backedge of a counted loop is removed. In particular, when a > range check is hoisted by predication, pre/main/post loops are created > and if one of the main or post loops loses its backedge, an array load > that's control dependent on a predicate above the pre loop could float > above the zero trip guard of the main or post loop. That can no longer > happen AFAICT with changes related to assert predicates. The array > load is now updated to have a control dependency that's below the zero > trip guard. > > The reason I'm revisiting this is that I noticed that > `PhaseIdealLoop::cast_incr_before_loop()` has a bug. When it adds the > `CastII`, it looks for the loop phi and picks input 1 of the phi it > finds as input to the `CastII`. To find the loop phi, it starts from > the loop increment and looks for a use that's a phi and has the loop > head as control. It never checks that the phi it finds is the loop > phi. There can be more than one phi as uses of the increment at the > loop head and it can pick the wrong one. I tried to write a test case > where this would cause a bug but couldn't actually find any use for > the `CastII` anymore.
> > In my testing, the only issue when the `CastII` nodes are not added is that > some IR tests for vectorization fail: > > compiler/vectorization/TestPopulateIndex.java > compiler/vectorization/runner/ArrayShiftOpTest.java > compiler/vectorization/runner/LoopArrayIndexComputeTest.java > > because removing the `CastII` causes split if to occur with some nodes > that take the loop phi as input. That then causes pattern matching > during superword to break. I added logic to prevent split if for those > cases. Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: - Merge branch 'master' into JDK-8334724 - review - fix ------------- Changes: - all: https://git.openjdk.org/jdk/pull/19831/files - new: https://git.openjdk.org/jdk/pull/19831/files/51c093da..bc08cae3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=19831&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=19831&range=01-02 Stats: 86779 lines in 2390 files changed: 49952 ins; 24670 del; 12157 mod Patch: https://git.openjdk.org/jdk/pull/19831.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/19831/head:pull/19831 PR: https://git.openjdk.org/jdk/pull/19831 From rcastanedalo at openjdk.org Mon Aug 26 08:49:06 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Mon, 26 Aug 2024 08:49:06 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v2] In-Reply-To: <6PF-kgezzOb9Ed7j-BbrwaURnLJH5aFgOouFwTYiFrE=.670092b8-6980-43f3-a091-25312cfa0f1b@github.com> References: <4c-MLXwKcNcSnloSkYkuk3gnv3ux5i5beS51Fd9Z8MQ=.cd0a7eba-ff26-4855-a01c-d1ae5182100b@github.com> <8fuUEkswt05x0IuT4PrNQuYgLd49g4EpZWOPPQog4PQ=.70b5edb6-98d0-4276-8578-f7a496b7f2a7@github.com>
<3H3rBSKDnpg5fmYqcZ5hT9yH2EAxCocycRompQJJCOo=.1b30fd89-09e9-4708-bd20-cdea00e809a7@github.com> <7Bjcf6MF4aTBuk4DmTnGzP0WwCWqJx_sv5k2sGMt9No=.ca426f04-4fa5-4ab1-a414-8c5e6a4e0dce@github.com> <6PF-kgezzOb9Ed7j-BbrwaURnLJH5aFgOouFwTYiFrE=.670092b8-6980-43f3-a091-25312cfa0f1b@github.com> Message-ID: On Mon, 26 Aug 2024 08:38:39 GMT, Roberto Castañeda Lozano wrote: >> It can be implemented like this: >> >> - If oop decoding requires a null check, redirect the branch to jump over the barrier code. >> - Else insert the null check after the region crossing check. >> >> This way, I don't see how it can have a negative effect. But I leave you free to decide about x86 and aarch64. Optimizations could be done later if needed as you already mentioned. >> >> Did you also see my comment https://github.com/openjdk/jdk/pull/19746#discussion_r1728987173 ? It's in the "resolved" discussion. > >> This way, I don't see how it can have a negative effect. > > I agree, this is the implementation I tried out originally (https://github.com/openjdk/jdk/pull/19746#discussion_r1719811953). > >> Did you also see my comment https://github.com/openjdk/jdk/pull/19746#discussion_r1728987173 ? It's in the "resolved" discussion. > > Yes, thanks, I "unresolved" it now. > I have an experimental implementation for PPC64. An unrelated comment about your PPC64 implementation: did you try running `test/hotspot/jtreg/compiler/gcbarriers/TestG1BarrierGeneration.java`? It expects the ADL instructions that implement `GetAndSetP` and `GetAndSetN` to be called `g1XChgP` and `g1XChgN`.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1730916202 From dfenacci at openjdk.org Mon Aug 26 08:50:10 2024 From: dfenacci at openjdk.org (Damon Fenacci) Date: Mon, 26 Aug 2024 08:50:10 GMT Subject: RFR: 8335444: Generalize implementation of AndNode mul_ring [v3] In-Reply-To: <6lNmXhkDKMNu7r8yYrBYs4JUJQ-drbj9Gj1_EaAqQj0=.c699170b-0551-4c04-8322-0182cf9b5866@github.com> References: <7l_zt0Q884dkzJbP5kjhfvZPmFQ4CMRgakWoUCopfvw=.fc1307d0-6a2c-40cc-adc6-c1b21424d0dd@github.com> <6lNmXhkDKMNu7r8yYrBYs4JUJQ-drbj9Gj1_EaAqQj0=.c699170b-0551-4c04-8322-0182cf9b5866@github.com> Message-ID: On Wed, 7 Aug 2024 01:20:09 GMT, Jasmine Karthikeyan wrote: >> Hi all, >> I've written this patch which improves type calculation for bitwise-and functions. Previously, the only cases that were considered were if one of the inputs to the node were a positive constant. I've generalized this behavior, as well as added a case to better estimate the result for arbitrary ranges. Since these are very common patterns to see, this can help propagate more precise types throughout the ideal graph for a large number of methods, making other optimizations and analyses stronger. I was interested in where this patch improves types, so I ran CTW for `java_base` and `java_base_2` and printed out the differences in this gist [here](https://gist.github.com/jaskarth/b45260d81ab621656f4a55cc51cf5292). While I don't think it's particularly complicated I've also added some discussion of the mathematics below, mostly because I thought it was interesting to work through :) >> >> This patch passes tier1-3 testing on my linux x64 machine. Thoughts and reviews would be very appreciated! > > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Check IR before macro expansion Thanks @jaskarth! ------------- Marked as reviewed by dfenacci (Committer). 
PR Review: https://git.openjdk.org/jdk/pull/20066#pullrequestreview-2260151664 From mdoerr at openjdk.org Mon Aug 26 09:45:06 2024 From: mdoerr at openjdk.org (Martin Doerr) Date: Mon, 26 Aug 2024 09:45:06 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v2] In-Reply-To: References: <4c-MLXwKcNcSnloSkYkuk3gnv3ux5i5beS51Fd9Z8MQ=.cd0a7eba-ff26-4855-a01c-d1ae5182100b@github.com> <8fuUEkswt05x0IuT4PrNQuYgLd49g4EpZWOPPQog4PQ=.70b5edb6-98d0-4276-8578-f7a496b7f2a7@github.com> <3H3rBSKDnpg5fmYqcZ5hT9yH2EAxCocycRompQJJCOo=.1b30fd89-09e9-4708-bd20-cdea00e809a7@github.com> <7Bjcf6MF4aTBuk4DmTnGzP0WwCWqJx_sv5k2sGMt9No=.ca426f04-4fa5-4ab1-a414-8c5e6a4e0dce@github.com> <6PF-kgezzOb9Ed7j-BbrwaURnLJH5aFgOouFwTYiFrE=.670092b8-6980-43f3-a091-25312cfa0f1b@github.com> Message-ID: <1vQH6zpEgjhIO_mq9DCnpwxgDXmnqdx0owlvjJq4Fcw=.78e60c6e-2b23-4e94-a998-e7ba9eafcb6a@github.com> On Mon, 26 Aug 2024 08:46:10 GMT, Roberto Castañeda Lozano wrote: >>> This way, I don't see how it can have a negative effect. >> >> I agree, this is the implementation I tried out originally (https://github.com/openjdk/jdk/pull/19746#discussion_r1719811953). >> >>> Did you also see my comment https://github.com/openjdk/jdk/pull/19746#discussion_r1728987173 ? It's in the "resolved" discussion. >> >> Yes, thanks, I "unresolved" it now. > >> I have an experimental implementation for PPC64. > > An unrelated comment about your PPC64 implementation: did you try running `test/hotspot/jtreg/compiler/gcbarriers/TestG1BarrierGeneration.java`? It expects the ADL instructions that implement `GetAndSetP` and `GetAndSetN` to be called `g1XChgP` and `g1XChgN`. That one is among the failing tests. Can we agree on better names than `g1XChgP` and `g1XChgN`? They are not readable very well IMHO. All the other nodes have nice names.
Regardless if you implement the compressed oops optimization or not, I'd reconsider moving the oop decoding into `G1BarrierSetAssembler::g1_write_barrier_post_c2` because it makes the .ad file shorter because you can get rid of the replicated `decode_heap_oop`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1730991651 From duke at openjdk.org Mon Aug 26 12:32:10 2024 From: duke at openjdk.org (Shaojin Wen) Date: Mon, 26 Aug 2024 12:32:10 GMT Subject: [jdk23] RFR: 8335390: C2 MergeStores: wrong result with Unsafe In-Reply-To: References: Message-ID: On Tue, 2 Jul 2024 09:06:46 GMT, Emanuel Peter wrote: > Hi all, > > This pull request contains a backport of commit [9046d7ae](https://github.com/openjdk/jdk/commit/9046d7aee3082b6cbf79876efc1c508cb893caad) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by Emanuel Peter on 2 Jul 2024 and was reviewed by Tobias Hartmann, Christian Hagedorn and Vladimir Kozlov. > > Thanks! It's been a month, any progress? ------------- PR Comment: https://git.openjdk.org/jdk/pull/19985#issuecomment-2310093529 From roland at openjdk.org Mon Aug 26 12:38:04 2024 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 26 Aug 2024 12:38:04 GMT Subject: RFR: 8320308: C2 compilation crashes in LibraryCallKit::inline_unsafe_access In-Reply-To: References: Message-ID: On Thu, 4 Jul 2024 12:17:51 GMT, Tobias Holenstein wrote: > We failed in `LibraryCallKit::inline_unsafe_access()` while trying to inline `Unsafe::getShortUnaligned`. 
> https://github.com/openjdk/jdk/blob/34c6e0deac567c0f4ed08aa2824671551d843e95/test/hotspot/jtreg/compiler/parsing/TestUnsafeArrayAccessWithNullBase.java#L86 > The reason is that base (the array) is `ConP #null` hidden behind two `CheckCastPP` with `speculative=byte[int:>=0]` > > We call `Node* adr = make_unsafe_address(base, offset, type, kind == Relaxed);` > https://github.com/openjdk/jdk/blob/34c6e0deac567c0f4ed08aa2824671551d843e95/src/hotspot/share/opto/library_call.cpp#L2361 > - with **base** = `147 CheckCastPP` > - `118 ConP === 0 [[[ 106 101 71 ] #null` > type > > Depending on the **offset** we go two different paths in `LibraryCallKit::make_unsafe_address` which both lead to the same error in the end. > 1. For `UNSAFE.getShortUnaligned(array, 1_049_000)` we get kind = `Type::AnyPtr` because `offset >= os::vm_page_size()`. Since we assume base can't be null we insert an assert: > https://github.com/openjdk/jdk/blob/34c6e0deac567c0f4ed08aa2824671551d843e95/src/hotspot/share/opto/library_call.cpp#L2111 > > 2. 
whereas for `UNSAFE.getShortUnaligned(array, 1)` we get kind = `Type::OopPtr` > https://github.com/openjdk/jdk/blob/c17fa910cf3bad48547a3f0d68a30795ec3194e6/src/hotspot/share/opto/library_call.cpp#L2078 > and insert a null check > https://github.com/openjdk/jdk/blob/34c6e0deac567c0f4ed08aa2824671551d843e95/src/hotspot/share/opto/library_call.cpp#L2090 > In both cases we call `basic_plus_adr(..)` on a base being `top()`, which returns **adr** = `1 Con === 0 [[ ]] #top` > > https://github.com/openjdk/jdk/blob/3d5d51e228c19aa216451f647023101ae8bdbc79/src/hotspot/share/opto/library_call.cpp#L2386 => `_gvn.type(adr)` is _top_ > > https://github.com/openjdk/jdk/blob/3d5d51e228c19aa216451f647023101ae8bdbc79/src/hotspot/share/opto/library_call.cpp#L2394 => `adr_type` is _nullptr_ > > https://github.com/openjdk/jdk/blob/3d5d51e228c19aa216451f647023101ae8bdbc79/src/hotspot/share/opto/library_call.cpp#L2405-L2406 => `BasicType bt` is _T_ILLEGAL_ > > https://github.com/openjdk/jdk/blob/3d5d51e228c19aa216451f647023101ae8bdbc79/src/hotspot/share/opto/library_call.cpp#L2424 => we fail here with `SIGSEGV: null pointer dereference` because `alias_type->adr_type()` is _nullptr_ > > ### Fix > the fix is to bail out in this case > https://github.com/openjdk/jdk/blob/3d5d51e228c19a... Thanks for the extra details. Is igvn run between incremental inlining and the crash? Or is that all part of a single incremental inlining sequence? In `LibraryCallKit::make_unsafe_address`, `base` is the `CheckCastPP`. What I don't quite understand is how we can get `top` out of `basic_plus_adr` if the `base` input is a `CheckCastPP`.
------------- PR Comment: https://git.openjdk.org/jdk/pull/20033#issuecomment-2310106130 From chagedorn at openjdk.org Mon Aug 26 12:50:09 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 26 Aug 2024 12:50:09 GMT Subject: [jdk23] RFR: 8335390: C2 MergeStores: wrong result with Unsafe In-Reply-To: References: Message-ID: On Mon, 26 Aug 2024 12:29:30 GMT, Shaojin Wen wrote: >> Hi all, >> >> This pull request contains a backport of commit [9046d7ae](https://github.com/openjdk/jdk/commit/9046d7aee3082b6cbf79876efc1c508cb893caad) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. >> >> The commit being backported was authored by Emanuel Peter on 2 Jul 2024 and was reviewed by Tobias Hartmann, Christian Hagedorn and Vladimir Kozlov. >> >> Thanks! > > It's been a month, any progress? @wenshao just FYI, he is currently out of the office and returns in two weeks. ------------- PR Comment: https://git.openjdk.org/jdk/pull/19985#issuecomment-2310131739 From roland at openjdk.org Mon Aug 26 13:11:50 2024 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 26 Aug 2024 13:11:50 GMT Subject: RFR: 8333258: C2: high memory usage in PhaseCFG::insert_anti_dependences() [v4] In-Reply-To: References: Message-ID: <6HZFG2hoQCIkKaeps2lZF4Xim5J9OR0C_PaxMdfkoY0=.0aa9a862-21d7-41a0-9f7e-6dd1781a1e2b@github.com> > In a debug build, `PhaseCFG::insert_anti_dependences()` is called > twice for a single node: once for actual processing, once for > verification. > > In TestAntiDependenciesHighMemUsage, the test has a `Region` that > merges 337 incoming paths. It also has one `Phi` per memory slice that > is stored to: 1000 `Phi` nodes. Each `Phi` node has 337 inputs that > are identical except for one. The common input is the memory state on > method entry. The test has 60 `Load` nodes that need to be processed for > anti dependences. All `Load` nodes share the same memory input: the memory > state on method entry.
For each `Load`, all `Phi` nodes are pushed 336 > times on the work lists for anti dependence processing because all of > them appear multiple times as uses of each `Load`'s memory state: `Phi`s > are pushed 336 000 times on 2 work lists. Memory is not reclaimed on exit > from `PhaseCFG::insert_anti_dependences()` so memory usage grows as > `Load` nodes are processed: > > 336000 * 2 work lists * 60 loads * 8 bytes pointer = 322 MB. > > The fix I propose for this is to not push `Phi` nodes more than once > when they have the same inputs multiple times. > > In TestAntiDependenciesHighMemUsage2, the test has 4000 loads. For > each of them, when processed for anti dependences, all 4000 loads are > pushed on the work lists because they share the same memory > input. Then when they are popped from the work list, they are > discarded because only stores are of interest: > > 4000 loads processed * 4000 loads pushed * 2 work lists * 8 bytes pointer = 256 MB. > > The fix I propose for this is to test before pushing on the work list > whether a node is a store or not. > > Finally, I propose adding a `ResourceMark` so memory doesn't > accumulate over calls to `PhaseCFG::insert_anti_dependences()`. Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains 10 additional commits since the last revision: - more review - Merge branch 'master' into JDK-8333258 - review - Merge branch 'master' into JDK-8333258 - refactoring - Merge branch 'master' into JDK-8333258 - review - Merge branch 'master' into JDK-8333258 - whitespaces - tests & fix ------------- Changes: - all: https://git.openjdk.org/jdk/pull/19791/files - new: https://git.openjdk.org/jdk/pull/19791/files/c44775bc..157f8381 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=19791&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=19791&range=02-03 Stats: 64190 lines in 1861 files changed: 37616 ins; 17558 del; 9016 mod Patch: https://git.openjdk.org/jdk/pull/19791.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/19791/head:pull/19791 PR: https://git.openjdk.org/jdk/pull/19791 From roland at openjdk.org Mon Aug 26 13:14:11 2024 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 26 Aug 2024 13:14:11 GMT Subject: RFR: 8333258: C2: high memory usage in PhaseCFG::insert_anti_dependences() In-Reply-To: References: <1923E_Gv9WiGUXISiuvMvAS3czkTGShKu2k0k7b042Y=.50d46e9b-4301-4a4d-a0e9-c9bf5b88e4e4@github.com> Message-ID: On Tue, 23 Jul 2024 16:18:26 GMT, Emanuel Peter wrote: > Hmm. Personally I'd rather have something more vague than misleading. Maybe it could be `store -> use_mem_state` and `mem -> def_mem_state`? I guess that is then a bit wordy. Up to you. I went with that. Thanks for the suggestion. > > The assert is on loop entry. There's an if between the new assert and the condition that was removed but the if block ends with a continue. So the assert is guaranteed to be executed every time the removed condition was executed. > > I understand, and agree on a technical level. Someone in the future may break things, and that is why I would prefer the assert to be there. But up to you.
The assert on loop entry is executed more often (on every element of the queue) than it would be at the location you recommend, which is why I'd like to leave it there. ------------- PR Comment: https://git.openjdk.org/jdk/pull/19791#issuecomment-2310183296 From rcastanedalo at openjdk.org Mon Aug 26 13:26:07 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Mon, 26 Aug 2024 13:26:07 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v2] In-Reply-To: <1vQH6zpEgjhIO_mq9DCnpwxgDXmnqdx0owlvjJq4Fcw=.78e60c6e-2b23-4e94-a998-e7ba9eafcb6a@github.com> References: <4c-MLXwKcNcSnloSkYkuk3gnv3ux5i5beS51Fd9Z8MQ=.cd0a7eba-ff26-4855-a01c-d1ae5182100b@github.com> <8fuUEkswt05x0IuT4PrNQuYgLd49g4EpZWOPPQog4PQ=.70b5edb6-98d0-4276-8578-f7a496b7f2a7@github.com> <3H3rBSKDnpg5fmYqcZ5hT9yH2EAxCocycRompQJJCOo=.1b30fd89-09e9-4708-bd20-cdea00e809a7@github.com> <7Bjcf6MF4aTBuk4DmTnGzP0WwCWqJx_sv5k2sGMt9No=.ca426f04-4fa5-4ab1-a414-8c5e6a4e0dce@github.com> <6PF-kgezzOb9Ed7j-BbrwaURnLJH5aFgOouFwTYiFrE=.670092b8-6980-43f3-a091-25312cfa0f1b@github.com> <1vQH6zpEgjhIO_mq9DCnpwxgDXmnqdx0owlvjJq4Fcw=.78e60c6e-2b23-4e94-a998-e7ba9eafcb6a@github.com> Message-ID: On Mon, 26 Aug 2024 09:42:29 GMT, Martin Doerr wrote: > That one is among the failing tests. Can we agree on better names than g1XChgP and g1XChgN? They are not readable very well IMHO. Sure, I agree that `g1GetAndSetP` and `g1GetAndSetN` are more consistent with the corresponding Ideal operation names, will rename the instructions in x64 and aarch64 and update the test expectations. > Regardless if you implement the compressed oops optimization or not, I'd reconsider moving the oop decoding into G1BarrierSetAssembler::g1_write_barrier_post_c2 because it makes the .ad file shorter because you can get rid of the replicated decode_heap_oop. Thanks, will try it out.
-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1731240303

From roland at openjdk.org Mon Aug 26 13:34:41 2024
From: roland at openjdk.org (Roland Westrelin)
Date: Mon, 26 Aug 2024 13:34:41 GMT
Subject: RFR: 8333258: C2: high memory usage in PhaseCFG::insert_anti_dependences() [v5]
In-Reply-To:
References:
Message-ID: <4SiD1hBSzfbqKvKT8t6D8WDxQqH4XmAWAMllX929M6E=.8ee795df-2d1e-4b1c-84be-d396cd3274c1@github.com>

> In a debug build, `PhaseCFG::insert_anti_dependences()` is called
> twice for a single node: once for actual processing, once for
> verification.
>
> In TestAntiDependenciesHighMemUsage, the test has a `Region` that
> merges 337 incoming paths. It also has one `Phi` per memory slice that
> is stored to: 1000 `Phi` nodes. Each `Phi` node has 337 inputs that
> are identical except for one. The common input is the memory state on
> method entry. The test has 60 `Load` nodes that need to be processed
> for anti-dependences. All `Load`s share the same memory input: the
> memory state on method entry. For each `Load`, all `Phi` nodes are
> pushed 336 times on the work lists because all of them appear multiple
> times as uses of each `Load`'s memory state: `Phi`s are pushed 336,000
> times on 2 work lists. Memory is not reclaimed on exit from
> `PhaseCFG::insert_anti_dependences()`, so memory usage grows as
> `Load` nodes are processed:
>
> 336000 * 2 work lists * 60 loads * 8 bytes per pointer = 322 MB.
>
> The fix I propose for this is to not push `Phi` nodes more than once
> when they have the same input multiple times.
>
> In TestAntiDependenciesHighMemUsage2, the test has 4000 loads. For
> each of them, when processed for anti-dependences, all 4000 loads are
> pushed on the work lists because they share the same memory
> input. Then when they are popped from the work list, they are
> discarded because only stores are of interest:
>
> 4000 loads processed * 4000 loads pushed * 2 work lists * 8 bytes per pointer = 256 MB.
>
> The fix I propose for this is to test whether a node is a store
> before pushing it on the work list.
>
> Finally, I propose adding a `ResourceMark` so memory doesn't
> accumulate over calls to `PhaseCFG::insert_anti_dependences()`.

Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision:

  more review

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/19791/files
  - new: https://git.openjdk.org/jdk/pull/19791/files/157f8381..15a33090

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=19791&range=04
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=19791&range=03-04

  Stats: 3 lines in 1 file changed: 2 ins; 0 del; 1 mod
  Patch: https://git.openjdk.org/jdk/pull/19791.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/19791/head:pull/19791

PR: https://git.openjdk.org/jdk/pull/19791

From roland at openjdk.org Mon Aug 26 13:34:42 2024
From: roland at openjdk.org (Roland Westrelin)
Date: Mon, 26 Aug 2024 13:34:42 GMT
Subject: RFR: 8333258: C2: high memory usage in PhaseCFG::insert_anti_dependences() [v3]
In-Reply-To:
References:
Message-ID:

On Tue, 23 Jul 2024 16:20:23 GMT, Emanuel Peter wrote:

> But I would like you to fix the comments here:
>
> ```
> // The relevant stores "nearby" the load consist of a tree rooted
> // at initial_mem, with internal nodes of type MergeMem.
> // Therefore, the branches visited by the worklist are of this form:
> // initial_mem -> (MergeMem ->)* store
> // The anti-dependence constraints apply only to the fringe of this tree.
> ```
>
> There are not just `MergeMem` but also `Phi` nodes.

I updated the comment. Does that look better to you?

@eme64 can you have a look at the updated change?
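The two work-list fixes in the PR description (push each `Phi` at most once, and drop loads before they are pushed) can be illustrated with a small Java model. This is a hypothetical sketch, not HotSpot code: `Node`, `Kind`, `buildScenario`, `naivePushes` and `filteredPushes` are invented for illustration, and the real `PhaseCFG::insert_anti_dependences()` also follows `MergeMem` nodes and keeps two work lists. The model only reproduces the shapes from the description: 1000 `Phi`s that each use the same memory state 336 times plus 60 loads, and 4000 loads that share one memory state.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of the work-list growth described in the PR; not HotSpot code.
final class AntiDepWorklistSketch {
    enum Kind { MEM_STATE, LOAD, STORE, PHI }

    static final class Node {
        final Kind kind;
        // One entry per use edge: a Phi that takes the same input 336 times appears 336 times.
        final List<Node> outs = new ArrayList<>();

        Node(Kind kind) {
            this.kind = kind;
        }
    }

    // Memory state used by `phis` Phis (each using it `edgesPerPhi` times) and by `loads` Loads.
    static Node buildScenario(int phis, int edgesPerPhi, int loads) {
        Node mem = new Node(Kind.MEM_STATE);
        for (int p = 0; p < phis; p++) {
            Node phi = new Node(Kind.PHI);
            for (int e = 0; e < edgesPerPhi; e++) {
                mem.outs.add(phi);
            }
        }
        for (int l = 0; l < loads; l++) {
            mem.outs.add(new Node(Kind.LOAD));
        }
        return mem;
    }

    // Before the fix: every use edge is pushed, repeated Phi edges and loads included.
    static int naivePushes(Node mem) {
        List<Node> worklist = new ArrayList<>(mem.outs);
        return worklist.size();
    }

    // After the fix, as described: loads are skipped and each Phi is pushed only once.
    static int filteredPushes(Node mem) {
        List<Node> worklist = new ArrayList<>();
        Set<Node> seenPhis = new HashSet<>(); // Node keeps Object's identity equals/hashCode
        for (Node use : mem.outs) {
            if (use.kind == Kind.LOAD) {
                continue; // a load is never the store that creates an anti-dependence
            }
            if (use.kind == Kind.PHI && !seenPhis.add(use)) {
                continue; // this Phi is already on the work list
            }
            worklist.add(use);
        }
        return worklist.size();
    }

    public static void main(String[] args) {
        Node s1 = buildScenario(1000, 336, 60); // TestAntiDependenciesHighMemUsage shape
        Node s2 = buildScenario(0, 0, 4000);    // TestAntiDependenciesHighMemUsage2 shape
        System.out.println(naivePushes(s1) + " -> " + filteredPushes(s1)); // 336060 -> 1000
        System.out.println(naivePushes(s2) + " -> " + filteredPushes(s2)); // 4000 -> 0
        // Totals from the PR description, in bytes (8-byte pointers, 2 work lists).
        System.out.println(336_000L * 2 * 60 * 8);  // 322560000
        System.out.println(4_000L * 4_000 * 2 * 8); // 256000000
    }
}
```

In this model, one scan of the shared memory state goes from 336,060 pushes to 1,000 in the first shape and from 4,000 to 0 in the second. The byte totals agree with the 322 MB and 256 MB estimates in the description, counted in decimal megabytes.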
-------------

PR Comment: https://git.openjdk.org/jdk/pull/19791#issuecomment-2310224862
PR Comment: https://git.openjdk.org/jdk/pull/19791#issuecomment-2310225420

From dnsimon at openjdk.org Mon Aug 26 15:54:03 2024
From: dnsimon at openjdk.org (Doug Simon)
Date: Mon, 26 Aug 2024 15:54:03 GMT
Subject: RFR: 8338538: [JVMCI] Allow HotSpotJVMCIRuntime#getJObjectValue to be called by a HotSpot CompileBroker compiler thread [v2]
In-Reply-To:
References:
Message-ID: <9knDvTaTOK5E2imn4UPh8Gwp5iWykTXmB9CDXreKHa4=.830431c5-2cc4-48b4-baf5-bae53c50a9aa@github.com>

On Sun, 18 Aug 2024 17:20:26 GMT, Tomáš Zezula wrote:

>> The `HotSpotJVMCIRuntime#getJObjectValue` method is currently invoked in two distinct scenarios:
>>
>> Truffle Compiler: In this scenario, the method is called by a Truffle compiler thread. This thread is an ordinary Java thread that enters the shared library compiler (libgraal) via a Java native method call. Consequently, it always has a valid `JavaFrameAnchor` when invoking `HotSpotJVMCIRuntime#getJObjectValue` within the shared library compiler.
>>
>> Host Compiler: In the second scenario, the method is called by a HotSpot CompileBroker compiler thread while inlining a Truffle call target into a host method. Here, the compiler thread is a JavaThread in the `_thread_in_vm` state before entering the shared library compiler (libgraal) and does not have a `JavaFrameAnchor`.
>>
>> The `HotSpotJVMCIRuntime#getJObjectValue` method currently supports only the first scenario by asserting that the caller has a `JavaFrameAnchor`. However, this method should be adapted to also support the second scenario, where the caller thread lacks a `JavaFrameAnchor` but has an explicitly pushed JNI handle block. It is crucial that the `HotSpotJVMCIRuntime#getJObjectValue` method ensures it does not use the top-most `JNIHandleBlock`, which is never released. Utilizing this block for Java constants could potentially lead to memory leaks in the Java heap. To accommodate both scenarios, the method should be modified to also allow execution by threads without a `JavaFrameAnchor`, provided they have an explicitly pushed JNI handle block.
>>
>> Implementation Details: The method determines whether the caller thread has pushed a JNI handle block by using `THREAD->active_handles()->pop_frame_link()`. The `pop_frame_link` is set when [JavaThread::push_jni_handle_block](https://github.com/openjdk/jdk/blob/bd4160cea8b6b0fcf0507199ed76a12f5d0aaba9/src/hotspot/share/runtime/javaThread.cpp#L1360) is called and is reset in [JavaThread::pop_jni_handle_block](https://github.com/openjdk/jdk/blob/bd4160cea8b6b0fcf0507199ed76a12f5d0aaba9/src/hotspot/share/runtime/javaThread.cpp#L1371). Each active JavaThread has a non-null `_active_handles` pointer, which is initialized in [JavaThread::run](https://github.com/openjdk/jdk/blob/bd4160cea8b6b0fcf0507199ed76a12f5d0aaba9/src/hotspot/share/runtime/javaThread.cpp#L730).
>
> Tomáš Zezula has updated the pull request incrementally with one additional commit since the last revision:
>
>   Updated comment in getObjectValue.
>
>   Co-authored-by: Douglas Simon

Marked as reviewed by dnsimon (Reviewer).

-------------

PR Review: https://git.openjdk.org/jdk/pull/20620#pullrequestreview-2261027154

From duke at openjdk.org Mon Aug 26 16:53:07 2024
From: duke at openjdk.org (Tomáš Zezula)
Date: Mon, 26 Aug 2024 16:53:07 GMT
Subject: Integrated: 8338538: [JVMCI] Allow HotSpotJVMCIRuntime#getJObjectValue to be called by a HotSpot CompileBroker compiler thread
In-Reply-To:
References:
Message-ID:

On Sun, 18 Aug 2024 13:11:24 GMT, Tomáš Zezula wrote:

> The `HotSpotJVMCIRuntime#getJObjectValue` method is currently invoked in two distinct scenarios:
>
> Truffle Compiler: In this scenario, the method is called by a Truffle compiler thread. This thread is an ordinary Java thread that enters the shared library compiler (libgraal) via a Java native method call.
> Consequently, it always has a valid `JavaFrameAnchor` when invoking `HotSpotJVMCIRuntime#getJObjectValue` within the shared library compiler.
>
> Host Compiler: In the second scenario, the method is called by a HotSpot CompileBroker compiler thread while inlining a Truffle call target into a host method. Here, the compiler thread is a JavaThread in the `_thread_in_vm` state before entering the shared library compiler (libgraal) and does not have a `JavaFrameAnchor`.
>
> The `HotSpotJVMCIRuntime#getJObjectValue` method currently supports only the first scenario by asserting that the caller has a `JavaFrameAnchor`. However, this method should be adapted to also support the second scenario, where the caller thread lacks a `JavaFrameAnchor` but has an explicitly pushed JNI handle block. It is crucial that the `HotSpotJVMCIRuntime#getJObjectValue` method ensures it does not use the top-most `JNIHandleBlock`, which is never released. Utilizing this block for Java constants could potentially lead to memory leaks in the Java heap. To accommodate both scenarios, the method should be modified to also allow execution by threads without a `JavaFrameAnchor`, provided they have an explicitly pushed JNI handle block.
>
> Implementation Details: The method determines whether the caller thread has pushed a JNI handle block by using `THREAD->active_handles()->pop_frame_link()`. The `pop_frame_link` is set when [JavaThread::push_jni_handle_block](https://github.com/openjdk/jdk/blob/bd4160cea8b6b0fcf0507199ed76a12f5d0aaba9/src/hotspot/share/runtime/javaThread.cpp#L1360) is called and is reset in [JavaThread::pop_jni_handle_block](https://github.com/openjdk/jdk/blob/bd4160cea8b6b0fcf0507199ed76a12f5d0aaba9/src/hotspot/share/runtime/javaThread.cpp#L1371). Each active JavaThread has a non-null `_active_handles` pointer, which is initialized in [JavaThread::run](https://github.com/openjdk/jdk/blob/bd4160cea8b6b0fcf0507199ed76a12f5d0aaba9/src/hotspot/share/runtime/javaThread.cpp#L730).

This pull request has now been integrated.

Changeset: a15af699
Author:    Tomáš Zezula
Committer: Doug Simon
URL:       https://git.openjdk.org/jdk/commit/a15af6998e8f7adac2ded94ef5a47e22ddb53452
Stats:     4 lines in 1 file changed: 2 ins; 0 del; 2 mod

8338538: [JVMCI] Allow HotSpotJVMCIRuntime#getJObjectValue to be called by a HotSpot CompileBroker compiler thread

Reviewed-by: dnsimon

-------------

PR: https://git.openjdk.org/jdk/pull/20620

From kxu at openjdk.org Mon Aug 26 17:33:16 2024
From: kxu at openjdk.org (Kangcheng Xu)
Date: Mon, 26 Aug 2024 17:33:16 GMT
Subject: RFR: 8327381: Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v13]
In-Reply-To:
References:
Message-ID:

On Mon, 8 Jul 2024 17:24:59 GMT, Kangcheng Xu wrote:

>> This PR resolves [JDK-8327381](https://bugs.openjdk.org/browse/JDK-8327381)
>>
>> Currently the transformation of expressions with patterns `((x & m) u<= m)` or `((m & x) u<= m)` to `true` is in the `BoolNode::Ideal` function, with a new constant node of value `1` created. However, this is technically a type-improving (reduction in range) transformation that's better suited to the `BoolNode::Value` function.
>>
>> New unit test `test/hotspot/jtreg/compiler/c2/TestBoolNodeGvn.java` asserting on IR nodes and correctness of this transformation is added and passing.
>
> Kangcheng Xu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
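The `((x & m) u<= m)` fold quoted above is sound because masking can only clear bits. Read as an unsigned number, `x & m` can never exceed `m`, so `(x & m) u<= m` holds for every `x`, and `(m & x)` is the same value. The sketch below checks this identity on random inputs and shows that the signed comparison does not have the same property. The class and method names are made up for illustration, and `Integer.compareUnsigned` supplies the `u<=` semantics.

```java
import java.util.Random;

// Illustrative check of the identity behind folding ((x & m) u<= m) to true; not C2 code.
final class MaskCompareSketch {
    // Unsigned <=: Java has no unsigned int, so compare with Integer.compareUnsigned.
    static boolean maskedUnsignedLe(int x, int m) {
        return Integer.compareUnsigned(x & m, m) <= 0;
    }

    // Signed <= for contrast; this is NOT always true.
    static boolean maskedSignedLe(int x, int m) {
        return (x & m) <= m;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        for (int i = 0; i < 1_000_000; i++) {
            int x = rnd.nextInt();
            int m = rnd.nextInt();
            if (!maskedUnsignedLe(x, m)) {
                throw new AssertionError("identity violated for x=" + x + ", m=" + m);
            }
        }
        // Signed counterexample: m = -1 has all bits set, so x & m == x, and 0 <= -1 is false.
        System.out.println(maskedSignedLe(0, -1)); // false
    }
}
```

The signed failure is the reason the fold is stated for the unsigned comparison only.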
> The pull request contains 23 additional commits since the last revision:
>
>  - Merge branch 'master' into boolnode-refactor
>  - spread boolean AND and OR into subcases, update number of expected CMP_U nodes
>  - Merge branch 'master' into boolnode-refactor
>  - Merge branch 'master' into boolnode-refactor
>  - update test values, @run directive, and remove an empty line
>  - Merge branch 'master' into boolnode-refactor
>  - move test location, add negative test case, simplify imports
>  - Merge branch 'master' into boolnode-refactor
>  - refactor BoolNode::Value() and extract code to ::Value_cmpu_and_mask
>  - update comments
>  - ... and 13 more: https://git.openjdk.org/jdk/compare/460de6e1...29655d35

I sincerely apologize for not following up on this in a timely manner. This won't happen again.

> [...] change the IR rule to something like
> `@IR(counts = {IRNode.CMP_U + "\b", "1"})`

First, I assume you mean `... + "\\b"` (double-escaped for the regex). Unfortunately, this does not work. `IRNode.CMP_U` has a postfix `#_`, making the expression `_#CMP_U#_\\b`.

https://github.com/openjdk/jdk/blob/a15af6998e8f7adac2ded94ef5a47e22ddb53452/test/hotspot/jtreg/compiler/lib/ir_framework/IRNode.java#L440-L443

Even if we explicitly make it `@IR(counts = {"_#CMP_U\\b#_", "1"})`, the only mapping registered by `beforeMatchingNameRegex(irNodePlaceholder, irNodeRegex)` is still the one without the `\b`, which leads to the following failure:

    Violations (4)
    --------------
     - IR Node "_#CMP_U\b#_" defined in class IRNode has no regex/compiler phase mapping (i.e. no static initializer block that adds a mapping entry to IRNode.IR_NODE_MAPPINGS). Have you just created the entry "_#CMP_U\b#_" in class IRNode and forgot to add a mapping? Violation for IR rule 1 at public static boolean compiler.c2.gvn.TestBoolNodeGVN.testShouldHaveCpmUCase1(int,int).
     - [repeated violations omitted]

I think the word break needs to go into the second argument of `beforeMatchingNameRegex(irNodePlaceholder, irNodeRegex)`, not into the placeholder. This can only be done in `IRNode.java`. I propose the following change:

         public static final String CMP_U = PREFIX + "CMP_U" + POSTFIX;
         static {
    -        beforeMatchingNameRegex(CMP_U, "CmpU");
    +        beforeMatchingNameRegex(CMP_U, "CmpU\\b");
         }

The three existing tests currently referencing `IRNode.CMP_U` (`compiler.c2.irTests.CmpUWithZero`, `compiler.intrinsics.TestCompareUnsigned`, `compiler.c2.irTests.TestUnsignedComparison`) are all passing w/o this change. It does not break existing tests. I'm going to push a commit to do so. If you think it's not appropriate to change `IRNode.java` within the scope of this issue, I can revert it.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18198#issuecomment-2310711029

From kxu at openjdk.org Mon Aug 26 17:46:51 2024
From: kxu at openjdk.org (Kangcheng Xu)
Date: Mon, 26 Aug 2024 17:46:51 GMT
Subject: RFR: 8327381: Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v14]
In-Reply-To:
References:
Message-ID:

> This PR resolves [JDK-8327381](https://bugs.openjdk.org/browse/JDK-8327381)
>
> Currently the transformation of expressions with patterns `((x & m) u<= m)` or `((m & x) u<= m)` to `true` is in the `BoolNode::Ideal` function, with a new constant node of value `1` created. However, this is technically a type-improving (reduction in range) transformation that's better suited to the `BoolNode::Value` function.
>
> New unit test `test/hotspot/jtreg/compiler/c2/TestBoolNodeGvn.java` asserting on IR nodes and correctness of this transformation is added and passing.

Kangcheng Xu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
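The effect of appending `\b` to the `CmpU` node-name regex in the proposed `IRNode.java` change can be seen with plain `java.util.regex`. The following is a hypothetical sketch, not the IR framework's matching code. It assumes the unwanted matches come from longer node names that merely start with `CmpU`; the names `CmpUL` and `CmpU3` are used only as examples. It uses `Matcher.find()`, which searches for the pattern anywhere in the input.

```java
import java.util.List;
import java.util.regex.Pattern;

// Sketch: why appending \b to the node-name regex narrows the match.
// find() locates the pattern anywhere in the input, so "CmpU" also hits longer names
// that merely start with it. \b requires a word boundary right after the final 'U',
// which fails when the next character is a letter, digit or underscore.
final class WordBoundarySketch {
    static final Pattern PREFIX_ONLY = Pattern.compile("CmpU");
    // In Java source "\\b" is backslash + 'b', the regex word boundary;
    // "\b" would be a literal backspace character (U+0008) instead.
    static final Pattern WITH_BOUNDARY = Pattern.compile("CmpU\\b");

    static long count(Pattern p, List<String> names) {
        return names.stream().filter(n -> p.matcher(n).find()).count();
    }

    public static void main(String[] args) {
        // Illustrative node names; only the first one is exactly CmpU.
        List<String> names = List.of("CmpU", "CmpUL", "CmpU3", "CmpI");
        System.out.println(count(PREFIX_ONLY, names));   // 3
        System.out.println(count(WITH_BOUNDARY, names)); // 1
    }
}
```

This sketch also explains the `"\\b"` versus `"\b"` exchange above: only the doubled backslash reaches the regex engine as a word-boundary assertion.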
The pull request contains 25 additional commits since the last revision:

 - Merge branch 'master' into boolnode-refactor
 - add a word break to IRNode.CMP_U
 - Merge branch 'master' into boolnode-refactor
 - spread boolean AND and OR into subcases, update number of expected CMP_U nodes
 - Merge branch 'master' into boolnode-refactor
 - Merge branch 'master' into boolnode-refactor
 - update test values, @run directive, and remove an empty line
 - Merge branch 'master' into boolnode-refactor
 - move test location, add negative test case, simplify imports
 - Merge branch 'master' into boolnode-refactor
 - ... and 15 more: https://git.openjdk.org/jdk/compare/898a7c02...719199c2

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/18198/files
  - new: https://git.openjdk.org/jdk/pull/18198/files/29655d35..719199c2

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=18198&range=13
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18198&range=12-13

  Stats: 60241 lines in 1888 files changed: 34337 ins; 16939 del; 8965 mod
  Patch: https://git.openjdk.org/jdk/pull/18198.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18198/head:pull/18198

PR: https://git.openjdk.org/jdk/pull/18198

From kxu at openjdk.org Mon Aug 26 18:09:09 2024
From: kxu at openjdk.org (Kangcheng Xu)
Date: Mon, 26 Aug 2024 18:09:09 GMT
Subject: RFR: 8328528: C2 should optimize long-typed parallel iv in an int counted loop [v8]
In-Reply-To:
References: <4kAmwISMCRKsdUxHZsI9SdO-8rLy7a3GbFXd2ERlpi0=.8f87e9e4-cb37-43b6-b17e-a6b424e60c83@github.com>
Message-ID:

On Mon, 22 Jul 2024 17:40:48 GMT, Kangcheng Xu wrote:

>> Currently, parallel IV optimization only happens in an int counted loop with int-typed parallel IVs. This PR adds support for optimizing long-typed IVs.
>>
>> Additionally, this ticket contributes to the resolution of [JDK-8275913](https://bugs.openjdk.org/browse/JDK-8275913). Meanwhile, I'm working on adding support for parallel IV replacement for long counted loops, which will depend on this PR.
>
> Kangcheng Xu has updated the pull request incrementally with one additional commit since the last revision:
>
>   use @run driver and Argument.RANDOM_ONCE

Currently blocked by the same problem as #18198. A solution is proposed over there; I will update once the approach is approved. See https://github.com/openjdk/jdk/pull/18198#issuecomment-2310711029

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18489#issuecomment-2310774271

From darcy at openjdk.org Mon Aug 26 22:16:04 2024
From: darcy at openjdk.org (Joe Darcy)
Date: Mon, 26 Aug 2024 22:16:04 GMT
Subject: RFR: 8338021: Support saturating vector operators in VectorAPI [v4]
In-Reply-To:
References:
Message-ID: <3hk6EiDY3Qxq_sjSBoL7SBsk_5_FsuRa7iZ0caxSs8s=.6db958ed-8f74-49d1-b949-a7da94357592@github.com>

On Mon, 19 Aug 2024 07:19:30 GMT, Jatin Bhateja wrote:

>> Hi All,
>>
>> As per the discussion on the panama-dev mailing list[1], this patch adds support for the following new vector operators.
>>
>>
>>  . SUADD : Saturating unsigned addition.
>>  . SADD  : Saturating signed addition.
>>  . SUSUB : Saturating unsigned subtraction.
>>  . SSUB  : Saturating signed subtraction.
>>  . UMAX  : Unsigned max.
>>  . UMIN  : Unsigned min.
>>
>>
>> The new vector operators are applicable only to integral types, since their values wrap around in overflowing/underflowing scenarios after setting the appropriate status flags. For floating-point types, as per the IEEE 754 spec, there are multiple schemes to handle underflow; one of them is gradual underflow, which transitions the value to the subnormal range. Similarly, overflow implicitly saturates the floating-point value to infinity.
>>
>> As the name suggests, these are saturating operations, i.e. the result of the computation is strictly capped by the lower and upper bounds of the result type and is not wrapped around in underflowing or overflowing scenarios.
>>
>> Summary of changes:
>> - Java side implementation of new vector operators.
>> - Add new scalar saturating APIs for each of the above saturating vector operators in the corresponding primitive box classes; the fallback implementation of the vector operators is based on them.
>> - C2 compiler IR and inline expander changes.
>> - Optimized x86 backend implementation for new vector operators and their predicated counterparts.
>> - Extends the existing VectorAPI Jtreg test suite to cover the new operations.
>>
>> Kindly review and share your feedback.
>>
>> Best Regards,
>> PS: Intrinsification and auto-vectorization of the new core-lib API will be addressed separately in a follow-up patch.
>>
>> [1] https://mail.openjdk.org/pipermail/panama-dev/2024-May/020408.html
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>
>   Review comments resolutions.

Some general impressions of the API change in the `java.lang` classes. I don't think the change as-is, especially the new constant fields, is a great fit for the current API, and I think those constants would look worse in a future where there was an "UnsignedInt" value class, or similar fuller platform support.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20507#issuecomment-2311195882

From duke at openjdk.org Mon Aug 26 22:48:10 2024
From: duke at openjdk.org (duke)
Date: Mon, 26 Aug 2024 22:48:10 GMT
Subject: Withdrawn: 8333891: Method excluded with directive is not compiled after removal of directive
In-Reply-To: <2xstE3V0PD8FGcijx_THSX1YgIJ7fZLponoL7b96TiY=.04ecae5f-9e3a-4c26-9893-72822f31c753@github.com>
References: <2xstE3V0PD8FGcijx_THSX1YgIJ7fZLponoL7b96TiY=.04ecae5f-9e3a-4c26-9893-72822f31c753@github.com>
Message-ID:

On Mon, 10 Jun 2024 20:05:03 GMT, Evgeny Astigeevich wrote:

> A Java method can become non-compilable if there are issues with its compilation or if its compiled version causes problems.
> Additionally, a method can be marked as non-compilable using a compile command or a compiler directive. Since compiler directives can be updated, a directive that disables a method's compilation can be changed or removed.
>
> Currently, when a Java method is marked as non-compilable, the reason for this status is unknown. If a change in a compiler directive makes the method compilable again, we cannot clear the non-compilable status because we don't know if the directive initially caused the method to become non-compilable.
>
> To resolve the issue, two method flags are introduced: `is_c1_exclude` and `is_c2_excluded`. They indicate that a Java method is excluded from compilation by a directive. With these flags we can find out whether a Java method has been excluded with a directive. If the directive changes and allows compilation of the method, we can detect this and clear the non-compilable status.
>
> As accesses to flags must be race-free, we have to remove getting a directive from `CompileBroker::compile_method`. We combine the two `CompileBroker::compile_method` variants into one. We also move the calculation of whether compilation is blocking into `CompileBroker::compile_method_base`. The directive used for that calculation is passed to a compile task, so a compile task does not need to get it again.
>
> A regression test is added.
>
> Tested fastdebug builds on Linux AArch64, Linux x86_64 and Windows Server 2019 x86_64:
> - Tier1/2/3 passed.

This pull request has been closed without being integrated.

-------------

PR: https://git.openjdk.org/jdk/pull/19637

From sviswanathan at openjdk.org Mon Aug 26 23:15:11 2024
From: sviswanathan at openjdk.org (Sandhya Viswanathan)
Date: Mon, 26 Aug 2024 23:15:11 GMT
Subject: RFR: 8338021: Support saturating vector operators in VectorAPI [v4]
In-Reply-To:
References:
Message-ID:

On Mon, 19 Aug 2024 07:19:30 GMT, Jatin Bhateja wrote:

>> Hi All,
>>
>> As per the discussion on the panama-dev mailing list[1], this patch adds support for the following new vector operators.
>>
>>
>>  . SUADD : Saturating unsigned addition.
>>  . SADD  : Saturating signed addition.
>>  . SUSUB : Saturating unsigned subtraction.
>>  . SSUB  : Saturating signed subtraction.
>>  . UMAX  : Unsigned max.
>>  . UMIN  : Unsigned min.
>>
>>
>> The new vector operators are applicable only to integral types, since their values wrap around in overflowing/underflowing scenarios after setting the appropriate status flags. For floating-point types, as per the IEEE 754 spec, there are multiple schemes to handle underflow; one of them is gradual underflow, which transitions the value to the subnormal range. Similarly, overflow implicitly saturates the floating-point value to infinity.
>>
>> As the name suggests, these are saturating operations, i.e. the result of the computation is strictly capped by the lower and upper bounds of the result type and is not wrapped around in underflowing or overflowing scenarios.
>>
>> Summary of changes:
>> - Java side implementation of new vector operators.
>> - Add new scalar saturating APIs for each of the above saturating vector operators in the corresponding primitive box classes; the fallback implementation of the vector operators is based on them.
>> - C2 compiler IR and inline expander changes.
>> - Optimized x86 backend implementation for new vector operators and their predicated counterparts.
>> - Extends the existing VectorAPI Jtreg test suite to cover the new operations.
>>
>> Kindly review and share your feedback.
>>
>> Best Regards,
>> PS: Intrinsification and auto-vectorization of the new core-lib API will be addressed separately in a follow-up patch.
>>
>> [1] https://mail.openjdk.org/pipermail/panama-dev/2024-May/020408.html
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>
>   Review comments resolutions.
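Readers unfamiliar with saturating arithmetic may find a scalar model of the six operators listed above useful. The following sketch shows their per-lane semantics for `int` lanes; the Vector API applies such operations lane by lane. The helper names are hypothetical and are not the scalar APIs that the PR adds to the box classes. Byte, short and long lanes behave analogously, each with its own bounds.

```java
// Sketch of the scalar (per-lane) semantics of the six operators for 32-bit int lanes.
// Helper names are hypothetical; they are not the API added by the PR.
final class SaturatingIntSketch {
    static int saturatingAdd(int a, int b) {            // SADD
        long r = (long) a + b;                           // widen so the true sum is exact
        return (int) Math.max(Integer.MIN_VALUE, Math.min(Integer.MAX_VALUE, r));
    }

    static int saturatingSub(int a, int b) {            // SSUB
        long r = (long) a - b;
        return (int) Math.max(Integer.MIN_VALUE, Math.min(Integer.MAX_VALUE, r));
    }

    static int saturatingUnsignedAdd(int a, int b) {    // SUADD
        long r = Integer.toUnsignedLong(a) + Integer.toUnsignedLong(b);
        return r > 0xFFFF_FFFFL ? -1 : (int) r;          // -1 is 0xFFFFFFFF, the unsigned maximum
    }

    static int saturatingUnsignedSub(int a, int b) {    // SUSUB
        return Integer.compareUnsigned(a, b) < 0 ? 0 : a - b; // clamp at unsigned zero
    }

    static int unsignedMax(int a, int b) {              // UMAX
        return Integer.compareUnsigned(a, b) >= 0 ? a : b;
    }

    static int unsignedMin(int a, int b) {              // UMIN
        return Integer.compareUnsigned(a, b) <= 0 ? a : b;
    }

    public static void main(String[] args) {
        System.out.println(saturatingAdd(Integer.MAX_VALUE, 1)); // 2147483647
        System.out.println(saturatingSub(Integer.MIN_VALUE, 1)); // -2147483648
        System.out.println(saturatingUnsignedAdd(-1, 1));        // -1, i.e. 0xFFFFFFFF
        System.out.println(saturatingUnsignedSub(1, 2));         // 0
        System.out.println(unsignedMax(-1, 1));                  // -1 (0xFFFFFFFF is the larger unsigned value)
        System.out.println(unsignedMin(-1, 1));                  // 1
        System.out.println(Integer.MAX_VALUE + 1);               // -2147483648: plain int addition wraps
    }
}
```

Widening to `long` keeps the 32-bit cases simple. A `long` lane has no wider primitive to widen into, so it would need overflow detection instead (for example via `Math.addExact`, which throws `ArithmeticException` on overflow).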
src/hotspot/cpu/x86/assembler_x86.cpp line 3454:

> 3452: 
> 3453: void Assembler::evmovdquw(XMMRegister dst, KRegister mask, XMMRegister src, bool merge, int vector_len) {
> 3454:   assert(VM_Version::supports_avx512vlbw(), "");

VL is not needed for 512-bit vectors.

src/hotspot/cpu/x86/assembler_x86.cpp line 4583:

> 4581: void Assembler::evpcmpgtb(KRegister kdst, XMMRegister nds, Address src, int vector_len) {
> 4582:   assert(VM_Version::supports_avx512vlbw(), "");
> 4583:   assert(vector_len == Assembler::AVX_512bit || VM_Version::supports_avx512vl(), "");

The supports_avx512vlbw() check on the previous line in this function needs to be changed to supports_avx512bw(). If it helps, the VL check already happens in vex_prefix() if the higher-bank registers are used for vector lengths below 512 bits.

src/hotspot/cpu/x86/assembler_x86.cpp line 4596:

> 4594: void Assembler::evpcmpgtb(KRegister kdst, KRegister mask, XMMRegister nds, Address src, int vector_len) {
> 4595:   assert(VM_Version::supports_avx512vlbw(), "");
> 4596:   assert(vector_len == Assembler::AVX_512bit || VM_Version::supports_avx512vl(), "");

The supports_avx512vlbw() check on the previous line in this function needs to be changed to supports_avx512bw().

src/hotspot/cpu/x86/assembler_x86.cpp line 4611:

> 4609: void Assembler::evpcmpub(KRegister kdst, XMMRegister nds, XMMRegister src, ComparisonPredicate vcc, int vector_len) {
> 4610:   assert(vector_len == Assembler::AVX_512bit || VM_Version::supports_avx512vl(), "");
> 4611:   assert(VM_Version::supports_avx512vlbw(), "");

I think you meant this to be supports_avx512bw().

src/hotspot/cpu/x86/assembler_x86.cpp line 4620:

> 4618: void Assembler::evpcmpuw(KRegister kdst, XMMRegister nds, XMMRegister src, ComparisonPredicate vcc, int vector_len) {
> 4619:   assert(vector_len == Assembler::AVX_512bit || VM_Version::supports_avx512vl(), "");
> 4620:   assert(VM_Version::supports_avx512vlbw(), "");

The supports_avx512vlbw() check in this function needs to be changed to supports_avx512bw().

src/hotspot/cpu/x86/assembler_x86.cpp line 4645:

> 4643: void Assembler::evpcmpuw(KRegister kdst, XMMRegister nds, Address src, ComparisonPredicate vcc, int vector_len) {
> 4644:   assert(VM_Version::supports_avx512vlbw(), "");
> 4645:   assert(vector_len == Assembler::AVX_512bit || VM_Version::supports_avx512vl(), "");

The supports_avx512vlbw() check on the previous line in this function needs to be changed to supports_avx512bw().

src/hotspot/cpu/x86/assembler_x86.cpp line 4672:

> 4670: void Assembler::evpcmpeqb(KRegister kdst, KRegister mask, XMMRegister nds, Address src, int vector_len) {
> 4671:   assert(VM_Version::supports_avx512vlbw(), "");
> 4672:   assert(vector_len == Assembler::AVX_512bit || VM_Version::supports_avx512vl(), "");

The supports_avx512vlbw() check on the previous line in this function needs to be changed to supports_avx512bw().

src/hotspot/cpu/x86/assembler_x86.cpp line 8191:

> 8189: void Assembler::vpminub(XMMRegister dst, XMMRegister nds, XMMRegister src, int vector_len) {
> 8190:   assert(UseAVX > 0 && (vector_len == Assembler::AVX_512bit || (!needs_evex(dst, nds, src) || VM_Version::supports_avx512vl())), "");
> 8191:   assert(!needs_evex(dst, nds, src) || VM_Version::supports_avx512bw(), "");

It would be good to keep the asserts for the new vmin/vmax instructions similar to the ones in vpaddsb.

src/hotspot/cpu/x86/assembler_x86.cpp line 8311:

> 8309: }
> 8310: 
> 8311: void Assembler::evpminud(XMMRegister dst, KRegister mask, XMMRegister nds, Address src, bool merge, int vector_len) {

The assert(VM_Version::supports_evex(), "") check is missing.

src/hotspot/cpu/x86/assembler_x86.cpp line 8340:

> 8338: 
> 8339: void Assembler::evpminuq(XMMRegister dst, KRegister mask, XMMRegister nds, Address src, bool merge, int vector_len) {
> 8340:   assert(vector_len == AVX_512bit || VM_Version::supports_avx512vl(), "");

The assert(VM_Version::supports_evex(), "") check is missing.

src/hotspot/cpu/x86/assembler_x86.cpp line 8402:

> 8400: void Assembler::vpmaxuw(XMMRegister dst, XMMRegister nds, XMMRegister src, int vector_len) {
> 8401:   assert(vector_len == AVX_128bit ? VM_Version::supports_avx() :
> 8402:          (vector_len == AVX_256bit ? VM_Version::supports_avx2() : VM_Version::supports_avx512bw()), "");

Why is the supports_avx() check here only, and not in the other newly added v* integral instructions? On AVX1 platforms, the only supported integral vector width is 128 bits.

src/hotspot/cpu/x86/assembler_x86.cpp line 8478:

> 8476: 
> 8477: void Assembler::evpmaxud(XMMRegister dst, KRegister mask, XMMRegister nds, Address src, bool merge, int vector_len) {
> 8478:   assert(vector_len == AVX_512bit || VM_Version::supports_avx512vl(), "");

The assert(VM_Version::supports_evex(), "") check is missing.

src/hotspot/cpu/x86/assembler_x86.cpp line 8506:

> 8504: 
> 8505: void Assembler::evpmaxuq(XMMRegister dst, KRegister mask, XMMRegister nds, Address src, bool merge, int vector_len) {
> 8506:   assert(vector_len == AVX_512bit || VM_Version::supports_avx512vl(), "");

The assert(VM_Version::supports_evex(), "") check is missing.

src/hotspot/cpu/x86/assembler_x86.cpp line 10229:

> 10227:   InstructionMark im(this);
> 10228:   assert(VM_Version::supports_avx512bw() && (vector_len == AVX_512bit || VM_Version::supports_avx512vl()), "");
> 10229:   InstructionAttr attributes(vector_len, /* vex_w */ true,/* legacy_mode */ false, /* no_mask_reg */ false,/* uses_vl */ true);

vex_w could be false here.

src/hotspot/cpu/x86/assembler_x86.cpp line 10256:

> 10254:   InstructionMark im(this);
> 10255:   assert(VM_Version::supports_avx512bw() && (vector_len == AVX_512bit || VM_Version::supports_avx512vl()), "");
> 10256:   InstructionAttr attributes(vector_len, /* vex_w */ true,/* legacy_mode */ false, /* no_mask_reg */ false,/* uses_vl */ true);

vex_w could be false here.

src/hotspot/cpu/x86/assembler_x86.cpp line 10283:

> 10281:   InstructionMark im(this);
> 10282:   assert(VM_Version::supports_avx512bw() && (vector_len == AVX_512bit || VM_Version::supports_avx512vl()), "");
> 10283:   InstructionAttr attributes(vector_len, /* vex_w */ true,/* legacy_mode */ false, /* no_mask_reg */ false,/* uses_vl */ true);

vex_w could be false here.

src/hotspot/cpu/x86/assembler_x86.cpp line 10310:

> 10308:   InstructionMark im(this);
> 10309:   assert(VM_Version::supports_avx512bw() && (vector_len == AVX_512bit || VM_Version::supports_avx512vl()), "");
> 10310:   InstructionAttr attributes(vector_len, /* vex_w */ true,/* legacy_mode */ false, /* no_mask_reg */ false,/* uses_vl */ true);

vex_w could be false here.

src/hotspot/cpu/x86/assembler_x86.cpp line 10337:

> 10335:   InstructionMark im(this);
> 10336:   assert(VM_Version::supports_avx512bw() && (vector_len == AVX_512bit || VM_Version::supports_avx512vl()), "");
> 10337:   InstructionAttr attributes(vector_len, /* vex_w */ true,/* legacy_mode */ false, /* no_mask_reg */ false,/* uses_vl */ true);

vex_w could be false here.

src/hotspot/cpu/x86/assembler_x86.cpp line 10364:

> 10362:   InstructionMark im(this);
> 10363:   assert(VM_Version::supports_avx512bw() && (vector_len == AVX_512bit || VM_Version::supports_avx512vl()), "");
> 10364:   InstructionAttr attributes(vector_len, /* vex_w */ true,/* legacy_mode */ false, /* no_mask_reg */ false,/* uses_vl */ true);

vex_w could be false here.

src/hotspot/cpu/x86/assembler_x86.cpp line 10391:

> 10389:   InstructionMark im(this);
> 10390:   assert(VM_Version::supports_avx512bw() && (vector_len == AVX_512bit || VM_Version::supports_avx512vl()), "");
> 10391:   InstructionAttr attributes(vector_len, /* vex_w */ true,/* legacy_mode */ false, /* no_mask_reg */ false,/* uses_vl */ true);

vex_w could be false here.
src/hotspot/cpu/x86/assembler_x86.cpp line 10419: > 10417: InstructionMark im(this); > 10418: assert(VM_Version::supports_avx512bw() && (vector_len == AVX_512bit || VM_Version::supports_avx512vl()), ""); > 10419: InstructionAttr attributes(vector_len, /* vex_w */ true,/* legacy_mode */ false, /* no_mask_reg */ false,/* uses_vl */ true); vex_w could be false here. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1731912227 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1731608860 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1731609177 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1731917735 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1731612730 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1731726012 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1731726337 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1731748671 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1731769490 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1731771330 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1731823750 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1731870793 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1731870288 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1731888852 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1731889468 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1731890265 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1731909994 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1731910246 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1731910516 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1731910755 
PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1731911129 From kxu at openjdk.org Mon Aug 26 23:25:47 2024 From: kxu at openjdk.org (Kangcheng Xu) Date: Mon, 26 Aug 2024 23:25:47 GMT Subject: RFR: 8328528: C2 should optimize long-typed parallel iv in an int counted loop [v9] In-Reply-To: <4kAmwISMCRKsdUxHZsI9SdO-8rLy7a3GbFXd2ERlpi0=.8f87e9e4-cb37-43b6-b17e-a6b424e60c83@github.com> References: <4kAmwISMCRKsdUxHZsI9SdO-8rLy7a3GbFXd2ERlpi0=.8f87e9e4-cb37-43b6-b17e-a6b424e60c83@github.com> Message-ID: > Currently, parallel iv optimization only happens in an int counted loop with int-typed parallel iv's. This PR adds support for long-typed iv to be optimized. > > Additionally, this ticket contributes to the resolution of [JDK-8275913](https://bugs.openjdk.org/browse/JDK-8275913). Meanwhile, I'm working on adding support for parallel IV replacement for long counted loops which will depend on this PR. Kangcheng Xu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 27 additional commits since the last revision: - Merge branch 'master' into long-typed-parallel-iv - use @run driver and Argument.RANDOM_ONCE - Merge branch 'master' into long-typed-parallel-iv - add random strides to tests - fix tests on larger strides - add more expressive comments and test cases - Merge branch 'master' into long-typed-parallel-iv - update comments to clarify on type casting - add pseudocode for subgraphs before/after the transformation - remove WIP support for long counted loops - ... 
and 17 more: https://git.openjdk.org/jdk/compare/93db32b9...20bdc791 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/18489/files - new: https://git.openjdk.org/jdk/pull/18489/files/596dbf9a..20bdc791 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=18489&range=08 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18489&range=07-08 Stats: 60236 lines in 1886 files changed: 34337 ins; 16939 del; 8960 mod Patch: https://git.openjdk.org/jdk/pull/18489.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/18489/head:pull/18489 PR: https://git.openjdk.org/jdk/pull/18489 From duke at openjdk.org Mon Aug 26 23:29:13 2024 From: duke at openjdk.org (Yagmur Eren) Date: Mon, 26 Aug 2024 23:29:13 GMT Subject: RFR: 8330159: [C2] Remove or clarify Compile::init_start Message-ID: <7Q9VAVqBUDDnqEcCOWeRh5N-YC0dsg9lyQsOv6xbO80=.6d55dc58-0281-45fd-a92f-d0f6ddd910cc@github.com> The Compile::init_start method contained only an assertion. To clean up, this method is removed and the locations where this method is called are replaced with the corresponding assertion.
See issue: [JDK-8330159](https://bugs.openjdk.org/browse/JDK-8330159) ------------- Commit messages: - compile::init_start replaced with assert Changes: https://git.openjdk.org/jdk/pull/20715/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=20715&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8330159 Stats: 12 lines in 3 files changed: 0 ins; 9 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/20715.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20715/head:pull/20715 PR: https://git.openjdk.org/jdk/pull/20715 From chagedorn at openjdk.org Tue Aug 27 07:11:10 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 27 Aug 2024 07:11:10 GMT Subject: RFR: 8327381: Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v13] In-Reply-To: References: Message-ID: On Mon, 26 Aug 2024 17:30:19 GMT, Kangcheng Xu wrote: >> Kangcheng Xu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 23 additional commits since the last revision: >> >> - Merge branch 'master' into boolnode-refactor >> - spread boolean AND and OR into subcases, update number of expected CMP_U nodes >> - Merge branch 'master' into boolnode-refactor >> - Merge branch 'master' into boolnode-refactor >> - update test values, @run directive, and remove an empty line >> - Merge branch 'master' into boolnode-refactor >> - move test location, add negative test case, simplify imports >> - Merge branch 'master' into boolnode-refactor >> - refactor BoolNode::Value() and extract code to ::Value_cmpu_and_mask >> - update comments >> - ... and 13 more: https://git.openjdk.org/jdk/compare/383e93a8...29655d35 > > I sincerely apologize for not following up on this timely. This won't happen again. > >> [...] change the IR rule to something like >> `@IR(counts = {IRNode.CMP_U + "\b", "1"}` > > First, I assume you mean `... 
+ "\\b"` (double escaped for regex). > > Unfortunately this does not work. `IRNode.CMP_U` has a postfix `#_`, making the expression `_#CMP_U#_\\b`. > > https://github.com/openjdk/jdk/blob/a15af6998e8f7adac2ded94ef5a47e22ddb53452/test/hotspot/jtreg/compiler/lib/ir_framework/IRNode.java#L440-L443 > > Even if we explicitly make it `@IR(counts = {"_#CMP_U\\b#_", "1"}`, it is still the one without the `\b` registered by `beforeMatchingNameRegex(irNodePlaceholder, irNodeRegex)`, leading to unexpected node type during assertion: > > > Violations (4) > -------------- > - IR Node "_#CMP_U\b#_" defined in class IRNode has no regex/compiler phase mapping (i.e. no static initializer block that adds a mapping entry to IRNode.IR_NODE_MAPPINGS). > Have you just created the entry "_#CMP_U\b#_" in class IRNode and forgot to add a mapping? > Violation for IR rule 1 at public static boolean compiler.c2.gvn.TestBoolNodeGVN.testShouldHaveCpmUCase1(int,int). > - [repeated violations omitted] > > > I think it's the second argument to `beforeMatchingNameRegex(irNodePlaceholder, irNodeRegex)` you want to add the word break to. Not the placeholder. This can only be done in `IRNode.java`. I propose the following change: > > > public static final String CMP_U = PREFIX + "CMP_U" + POSTFIX; > static { > - beforeMatchingNameRegex(CMP_U, "CmpU"); > + beforeMatchingNameRegex(CMP_U, "CmpU\\b"); > } > > > The three existing tests currently referencing `IRNode.CMP_U`: `compiler.c2.irTests.CmpUWithZero`, `compiler.intrinsics.TestCompareUnsigned`, `compiler.c2.irTests.TestUnsignedComparison`, are all passing w/o this change. It does not break existing tests. > > I'm going to push a commit to do so. If you think it's not appropriate to change `IRNode.java` with the scope of this issue, I can revert it. Hi @tabjy, no worries! You're right, you cannot just append the `\\b` as suggested above - this was only possible in an older version of the IR framework. 
I think going with what you suggested with `beforeMatchingNameRegex(CMP_U, "CmpU\\b")` should do the trick. You can go ahead and push that. Then I can run some internal testing again, just to be sure. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18198#issuecomment-2311734977 From kxu at openjdk.org Tue Aug 27 07:15:05 2024 From: kxu at openjdk.org (Kangcheng Xu) Date: Tue, 27 Aug 2024 07:15:05 GMT Subject: RFR: 8327381: Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v13] In-Reply-To: References: Message-ID: <_Hapg-YC_6t1VjwYF45SF7WFznbWB2ghLR4sBVjbeoM=.4868a9c3-c8aa-435d-886a-4d43aff717ee@github.com> On Tue, 27 Aug 2024 07:08:25 GMT, Christian Hagedorn wrote: >> I sincerely apologize for not following up on this timely. This won't happen again. >> >>> [...] change the IR rule to something like >>> `@IR(counts = {IRNode.CMP_U + "\b", "1"}` >> >> First, I assume you mean `... + "\\b"` (double escaped for regex). >> >> Unfortunately this does not work. `IRNode.CMP_U` has a postfix `#_`, making the expression `_#CMP_U#_\\b`. >> >> https://github.com/openjdk/jdk/blob/a15af6998e8f7adac2ded94ef5a47e22ddb53452/test/hotspot/jtreg/compiler/lib/ir_framework/IRNode.java#L440-L443 >> >> Even if we explicitly make it `@IR(counts = {"_#CMP_U\\b#_", "1"}`, it is still the one without the `\b` registered by `beforeMatchingNameRegex(irNodePlaceholder, irNodeRegex)`, leading to unexpected node type during assertion: >> >> >> Violations (4) >> -------------- >> - IR Node "_#CMP_U\b#_" defined in class IRNode has no regex/compiler phase mapping (i.e. no static initializer block that adds a mapping entry to IRNode.IR_NODE_MAPPINGS). >> Have you just created the entry "_#CMP_U\b#_" in class IRNode and forgot to add a mapping? >> Violation for IR rule 1 at public static boolean compiler.c2.gvn.TestBoolNodeGVN.testShouldHaveCpmUCase1(int,int). 
>> - [repeated violations omitted] >> >> >> I think it's the second argument to `beforeMatchingNameRegex(irNodePlaceholder, irNodeRegex)` you want to add the word break to. Not the placeholder. This can only be done in `IRNode.java`. I propose the following change: >> >> >> public static final String CMP_U = PREFIX + "CMP_U" + POSTFIX; >> static { >> - beforeMatchingNameRegex(CMP_U, "CmpU"); >> + beforeMatchingNameRegex(CMP_U, "CmpU\\b"); >> } >> >> >> The three existing tests currently referencing `IRNode.CMP_U`: `compiler.c2.irTests.CmpUWithZero`, `compiler.intrinsics.TestCompareUnsigned`, `compiler.c2.irTests.TestUnsignedComparison`, are all passing w/o this change. It does not break existing tests. >> >> I'm going to push a commit to do so. If you think it's not appropriate to change `IRNode.java` with the scope of this issue, I can revert it. > > Hi @tabjy, no worries! You're right, you cannot just append the `\\b` as suggested above - this was only possible in an older version of the IR framework. I think going with what you suggested with `beforeMatchingNameRegex(CMP_U, "CmpU\\b")` should do the trick. You can go ahead and push that. Then I can run some internal testing again, just to be sure. @chhagedorn It's already pushed. The HEAD has the up to date master merged in. Please let me know how the test goes. Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/18198#issuecomment-2311742000 From rcastanedalo at openjdk.org Tue Aug 27 07:30:46 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Tue, 27 Aug 2024 07:30:46 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v10] In-Reply-To: References: Message-ID: > This changeset implements JEP 475 (Late Barrier Expansion for G1), including support for the x64 and aarch64 platforms. See the [JEP description](https://openjdk.org/jeps/475) for further detail. > > We aim to integrate this work in JDK 24.
The purpose of this pull request is twofold: > > - to allow maintainers of the arm (32-bit), ppc, riscv, s390, and x86 (32-bit) ports to contribute a port of these platforms in time for JDK 24; and > - to allow reviewers to review the platform-independent, x64 and aarch64, and test changes in parallel with the porting work. > > ## Summary of the Changes > > ### Platform-Independent Changes (`src/hotspot/share`) > > These consist mainly of: > > - a complete rewrite of `G1BarrierSetC2`, to instruct C2 to expand G1 barriers late instead of early; > - a few minor changes to C2 itself, to support removal of redundant decompression operations and to address an OopMap construction issue triggered by this JEP's increased usage of ADL `TEMP` operands; and > - temporary support for porting the JEP to the remaining platforms. > > The temporary support code (guarded by the pre-processor flag `G1_LATE_BARRIER_MIGRATION_SUPPORT`) will **not** be part of the final pull request, and hence does not need to be reviewed. > > ### Platform-Dependent Changes (`src/hotspot/cpu`) > > These include changes to the ADL instruction definitions and the `G1BarrierSetAssembler` class of the x64 and aarch64 platforms. > > #### ADL Changes > > The changeset uses ADL predicates to force C2 to implement memory accesses tagged with barrier information using G1-specific, barrier-aware instruction versions (e.g. `g1StoreP` instead of the GC-agnostic `storeP`). These new instruction versions generate machine code according to the corresponding tagged barrier information, relying on the G1 barrier implementations provided by the `G1BarrierSetAssembler` class. On the aarch64 platform, the bulk of the ADL code is generated from a higher-level version using m4, to reduce redundancy. > > #### `G1BarrierSetAssembler` Changes > > Both platforms basically reuse the barrier implementation for the bytecode interpreter, with the different barrier tests and operations refactored into dedicated functions.
Besides this, `G1BarrierSetAssembler` is extended with assembly-stub routines that implement the out-of-line, slow path of the barriers. These routines include calls from the barrier into the JVM, which require support for saving and restoring live registers, provided by the `SaveLiveRegisters` class. This c... Roberto Castañeda Lozano has updated the pull request incrementally with one additional commit since the last revision: Rename g1XChgX to g1GetAndSetX for consistency with Ideal operation names ------------- Changes: - all: https://git.openjdk.org/jdk/pull/19746/files - new: https://git.openjdk.org/jdk/pull/19746/files/92112802..daf38d3f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=19746&range=09 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=19746&range=08-09 Stats: 10 lines in 4 files changed: 0 ins; 0 del; 10 mod Patch: https://git.openjdk.org/jdk/pull/19746.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/19746/head:pull/19746 PR: https://git.openjdk.org/jdk/pull/19746 From rcastanedalo at openjdk.org Tue Aug 27 07:38:08 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Tue, 27 Aug 2024 07:38:08 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v2] In-Reply-To: References: <4c-MLXwKcNcSnloSkYkuk3gnv3ux5i5beS51Fd9Z8MQ=.cd0a7eba-ff26-4855-a01c-d1ae5182100b@github.com> <8fuUEkswt05x0IuT4PrNQuYgLd49g4EpZWOPPQog4PQ=.70b5edb6-98d0-4276-8578-f7a496b7f2a7@github.com> <3H3rBSKDnpg5fmYqcZ5hT9yH2EAxCocycRompQJJCOo=.1b30fd89-09e9-4708-bd20-cdea00e809a7@github.com> <7Bjcf6MF4aTBuk4DmTnGzP0WwCWqJx_sv5k2sGMt9No=.ca426f04-4fa5-4ab1-a414-8c5e6a4e0dce@github.com> <6PF-kgezzOb9Ed7j-BbrwaURnLJH5aFgOouFwTYiFrE=.670092b8-6980-43f3-a091-25312cfa0f1b@github.com> <1vQH6zpEgjhIO_mq9DCnpwxgDXmnqdx0owlvjJq4Fcw=.78e60c6e-2b23-4e94-a998-e7ba9eafcb6a@github.com> Message-ID: On Mon, 26 Aug 2024 13:23:16 GMT, Roberto Castañeda Lozano wrote: > Sure, I agree that g1GetAndSetP and g1GetAndSetN are
more consistent with the corresponding Ideal operation names, will rename the instructions in x64 and aarch64 and update the test expectations. Done (commit daf38d3). @offamitkumar @feilongjiang @snazarkin please note that the ADL instructions `g1XChgP` and `g1XChgN` have been renamed to `g1GetAndSetP` and `g1GetAndSetN`, and the same naming is expected across all platforms by the test `compiler/gcbarriers/TestG1BarrierGeneration.java` included in this changeset. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1732301224 From chagedorn at openjdk.org Tue Aug 27 07:39:04 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 27 Aug 2024 07:39:04 GMT Subject: RFR: 8330159: [C2] Remove or clarify Compile::init_start In-Reply-To: <7Q9VAVqBUDDnqEcCOWeRh5N-YC0dsg9lyQsOv6xbO80=.6d55dc58-0281-45fd-a92f-d0f6ddd910cc@github.com> References: <7Q9VAVqBUDDnqEcCOWeRh5N-YC0dsg9lyQsOv6xbO80=.6d55dc58-0281-45fd-a92f-d0f6ddd910cc@github.com> Message-ID: On Mon, 26 Aug 2024 13:54:16 GMT, Yagmur Eren wrote: > The Compile::init_start method contained only an assertion. To clean up, this method is removed and the locations where this method is called are replaced with the corresponding assertion. See issue: [JDK-8330159](https://bugs.openjdk.org/browse/JDK-8330159) Hi @nelanbu, I don't think this is correct. In `Compile::start()`, we have the following code: https://github.com/openjdk/jdk/blob/b8e8e965e541881605f9dbcd4d9871d4952b9232/src/hotspot/share/opto/compile.cpp#L1121-L1131 It asserts that `failing()` is false. Therefore, `init_start()` bails out before checking the assert with `start()`, and your refactoring no longer performs that bail-out.
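To illustrate why a bare assertion is not equivalent, here is a heavily simplified, hypothetical C++ mock; the classes are stand-ins, not C2's real `Compile` and `StartNode`. It assumes only what the comment states: `start()` asserts `!failing()`, so `init_start()` has to return early for a failing compilation before it compares against `start()`.

```cpp
#include <cassert>

// Hypothetical, heavily simplified stand-ins for C2's StartNode/Compile.
struct StartNode {};

class Compile {
 public:
  bool failing() const { return _failure_reason != nullptr; }
  void record_failure(const char* reason) { _failure_reason = reason; }
  void set_start(StartNode* s) { _start = s; }

  // Like the accessor described in the review, start() must not be used
  // once the compilation has failed.
  StartNode* start() const {
    assert(!failing() && "start() called on a failing compilation");
    return _start;
  }

  // Mirrors the existing init_start(): bail out first, because calling
  // start() on a failing compilation would trip the assert above. A bare
  // `assert(s == start())` at the call site loses this early return.
  void init_start(StartNode* s) const {
    if (failing()) {
      return;
    }
    assert(s == start());
    (void)s;  // avoid unused-parameter warnings when NDEBUG is defined
  }

 private:
  const char* _failure_reason = nullptr;
  StartNode* _start = nullptr;
};
```

With this mock, `init_start(nullptr)` is harmless after `record_failure(...)`, whereas calling `start()` directly at that point would fail the assertion.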
What you could do instead:
- Simplify the code in `init_start()` and add an assertion message: `assert(failing() || s == start(), "should be StartNode");`
- Change `init_start_node()` into a more meaningful name like `verify_start()`, as we are not actually initializing anything but rather sanity-checking the start node.
- Guard the method with `DEBUG_ONLY/ifdef ASSERT` since it only calls an assert in the debug VM and does nothing in the product VM.
------------- PR Comment: https://git.openjdk.org/jdk/pull/20715#issuecomment-2311782455 From chagedorn at openjdk.org Tue Aug 27 07:39:11 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 27 Aug 2024 07:39:11 GMT Subject: RFR: 8327381: Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v14] In-Reply-To: References: Message-ID: On Mon, 26 Aug 2024 17:46:51 GMT, Kangcheng Xu wrote: >> This PR resolves [JDK-8327381](https://bugs.openjdk.org/browse/JDK-8327381) >> >> Currently the transformations for expressions with patterns `((x & m) u<= m)` or `((m & x) u<= m)` to `true` are in the `BoolNode::Ideal` function with a new constant node of value `1` created. However, this is technically a type-improving (reduction in range) transformation that's better suited to the `BoolNode::Value` function. >> >> New unit test `test/hotspot/jtreg/compiler/c2/TestBoolNodeGvn.java` asserting on IR nodes and correctness of this transformation is added and passing. > > Kangcheng Xu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains 25 additional commits since the last revision: > > - Merge branch 'master' into boolnode-refactor > - add a word break to IRNode.CMP_U > - Merge branch 'master' into boolnode-refactor > - spread boolean AND and OR into subcases, update number of expected CMP_U nodes > - Merge branch 'master' into boolnode-refactor > - Merge branch 'master' into boolnode-refactor > - update test values, @run directive, and remove an empty line > - Merge branch 'master' into boolnode-refactor > - move test location, add negative test case, simplify imports > - Merge branch 'master' into boolnode-refactor > - ... and 15 more: https://git.openjdk.org/jdk/compare/34161970...719199c2 Right, I missed that. Testing has been submitted; I'll report back once it's complete. ------------- PR Comment: https://git.openjdk.org/jdk/pull/18198#issuecomment-2311782534 From mbaesken at openjdk.org Tue Aug 27 09:15:03 2024 From: mbaesken at openjdk.org (Matthias Baesken) Date: Tue, 27 Aug 2024 09:15:03 GMT Subject: RFR: 8333098: ubsan: bytecodeInfo.cpp:318:59: runtime error: division by zero [v2] In-Reply-To: References: Message-ID: On Tue, 20 Aug 2024 07:30:23 GMT, Matthias Baesken wrote: >> When running test >> compiler/classUnloading/methodUnloading/TestOverloadCompileQueues.java >> with ubsan-enabled binaries we run into the issue reported below. >> The reason seems to be that we divide by zero in some special cases (we should instead check for `CompilationPolicy::min_invocations() == 0` and handle it separately).
>> >> /jdk/src/hotspot/share/opto/bytecodeInfo.cpp:318:59: runtime error: division by zero >> #0 0x7f5145c0dda2 in InlineTree::should_not_inline(ciMethod*, ciMethod*, int, bool&, ciCallProfile&) src/hotspot/share/opto/bytecodeInfo.cpp:318 >> #1 0x7f51466366d7 in InlineTree::try_to_inline(ciMethod*, ciMethod*, int, JVMState*, ciCallProfile&, bool&) src/hotspot/share/opto/bytecodeInfo.cpp:382 >> #2 0x7f514663d36b in InlineTree::ok_to_inline(ciMethod*, JVMState*, ciCallProfile&, bool&) src/hotspot/share/opto/bytecodeInfo.cpp:596 >> #3 0x7f51470dffd6 in Compile::call_generator(ciMethod*, int, bool, JVMState*, bool, float, ciKlass*, bool) src/hotspot/share/opto/doCall.cpp:189 >> #4 0x7f51470e18ab in Parse::do_call() src/hotspot/share/opto/doCall.cpp:641 >> #5 0x7f514887dbf1 in Parse::do_one_block() src/hotspot/share/opto/parse1.cpp:1607 >> #6 0x7f514887fefa in Parse::do_all_blocks() src/hotspot/share/opto/parse1.cpp:724 >> #7 0x7f514888d4da in Parse::Parse(JVMState*, ciMethod*, float) src/hotspot/share/opto/parse1.cpp:628 >> #8 0x7f51469d8418 in ParseGenerator::generate(JVMState*) src/hotspot/share/opto/callGenerator.cpp:99 >> #9 0x7f5146d99cff in Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*) src/hotspot/share/opto/compile.cpp:793 >> #10 0x7f51469d5ebf in C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*) src/hotspot/share/opto/c2compiler.cpp:142 >> #11 0x7f5146db0274 in CompileBroker::invoke_compiler_on_method(CompileTask*) src/hotspot/share/compiler/compileBroker.cpp:2303 >> #12 0x7f5146db2826 in CompileBroker::compiler_thread_loop() src/hotspot/share/compiler/compileBroker.cpp:1961 >> #13 0x7f51478d475a in JavaThread::thread_main_inner() src/hotspot/share/runtime/javaThread.cpp:759 >> #14 0x7f51491620ea in Thread::call_run() src/hotspot/share/runtime/thread.cpp:225 >> #15 0x7f51487ac201 in thread_native_entry src/hotspot/os/linux/os_linux.cpp:846 >> #16 0x7f514e5cf6e9 in start_thread (/lib64/libpthread.so.0+0xa6e9) (BuildId: 
2f8d3c2d0f4d7888c2598d2ff6356537f5708a73) >> #17 0x7f514db1550e in clone (/lib64/libc.so.6+0x11850e) (BuildId: ... > > Matthias Baesken has updated the pull request incrementally with one additional commit since the last revision: > > add suggestion from Dean Long I created https://bugs.openjdk.org/browse/JDK-8339067 Convert Threshold flags (like Tier4MinInvocationThreshold and Tier3MinInvocationThreshold) to double for the flag conversion issue brought up in this review. Besides adding a comment (?), is there anything else that should be done in this PR? ------------- PR Comment: https://git.openjdk.org/jdk/pull/20615#issuecomment-2311981836 PR Comment: https://git.openjdk.org/jdk/pull/20615#issuecomment-2311983148 From jbhateja at openjdk.org Tue Aug 27 09:58:44 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 27 Aug 2024 09:58:44 GMT Subject: RFR: 8338023: Support two vector selectFrom API [v6] In-Reply-To: <28KQHru1heR-YOVsRVo8Ffj_4D29IV8vD2tombvTHdI=.dba80ac3-9804-4074-ac0f-8acb9b042a08@github.com> References: <28KQHru1heR-YOVsRVo8Ffj_4D29IV8vD2tombvTHdI=.dba80ac3-9804-4074-ac0f-8acb9b042a08@github.com> Message-ID: > Hi All, > > As per the discussion on the panama-dev mailing list[1], this patch adds support for the following new two-vector permutation API. > > > Declaration:- > Vector.selectFrom(Vector v1, Vector v2) > > > Semantics:- > Using index values stored in the lanes of "this" vector, assemble the values stored in the first (v1) and second (v2) vector arguments. Thus, the first and second vectors serve as a table whose elements are selected based on the index value vector. The API is applicable to all integral and floating-point types. The result of this operation is semantically equivalent to the expression v1.rearrange(this.toShuffle(), v2). Values held in index vector lanes must lie within the valid two-vector index range [0, 2*VLEN), else an IndexOutOfBoundsException is thrown. > > Summary of changes: > - Java side implementation of new selectFrom API.
> - C2 compiler IR and inline expander changes.
> - In the absence of a direct two-vector permutation instruction in the target ISA, a lowering transformation dismantles the new IR into constituent IR supported by the target platform.
> - Optimized x86 backend implementation for AVX512 and legacy targets.
> - Functional tests covering the new API.
>
> The JMH micro included with this patch shows around a 10-15x gain over the existing rearrange API:
> Test System: Intel(R) Xeon(R) Platinum 8480+ [Sapphire Rapids Server]
>
> Benchmark                                   (size)   Mode  Cnt      Score   Error  Units
> SelectFromBenchmark.rearrangeFromByteVector    1024  thrpt    2   2041.762           ops/ms
> SelectFromBenchmark.rearrangeFromByteVector    2048  thrpt    2   1028.550           ops/ms
> SelectFromBenchmark.rearrangeFromIntVector     1024  thrpt    2    962.605           ops/ms
> SelectFromBenchmark.rearrangeFromIntVector     2048  thrpt    2    479.004           ops/ms
> SelectFromBenchmark.rearrangeFromLongVector    1024  thrpt    2    359.758           ops/ms
> SelectFromBenchmark.rearrangeFromLongVector    2048  thrpt    2    178.192           ops/ms
> SelectFromBenchmark.rearrangeFromShortVector   1024  thrpt    2   1463.459           ops/ms
> SelectFromBenchmark.rearrangeFromShortVector   2048  thrpt    2    727.556           ops/ms
> SelectFromBenchmark.selectFromByteVector       1024  thrpt    2  33254.830           ops/ms
> SelectFromBenchmark.selectFromByteVector       2048  thrpt    2  17313.174           ops/ms
> SelectFromBenchmark.selectFromIntVector        1024  thrpt    2  10756.804           ops/ms
> SelectFromBenchmark.selectFromIntVector        2048  thrpt    2  5398.2...
Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Review comments resolutions.
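A scalar reference model helps make the two-vector `selectFrom` semantics described above concrete. This is a sketch in C++, not the Vector API implementation, with two simplifying assumptions: indices are plain `int` values for every element type, and `std::out_of_range` stands in for Java's `IndexOutOfBoundsException`. The name `select_from` is hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Scalar model of two-vector selection: lane i of the result is element
// index[i] of the 2*N-element table formed by concatenating v1 and v2.
template <typename T, std::size_t N>
std::array<T, N> select_from(const std::array<int, N>& index,
                             const std::array<T, N>& v1,
                             const std::array<T, N>& v2) {
  std::array<T, N> result{};
  for (std::size_t lane = 0; lane < N; ++lane) {
    const int idx = index[lane];
    // Indices outside [0, 2*N) are rejected, as in the description above.
    if (idx < 0 || idx >= static_cast<int>(2 * N)) {
      throw std::out_of_range("selectFrom index outside [0, 2*VLEN)");
    }
    const std::size_t pos = static_cast<std::size_t>(idx);
    // The lower half of the table comes from v1, the upper half from v2.
    result[lane] = (pos < N) ? v1[pos] : v2[pos - N];
  }
  return result;
}
```

For example, with `v1 = {10, 11, 12, 13}`, `v2 = {20, 21, 22, 23}` and indices `{0, 5, 3, 7}`, the model selects `{10, 21, 13, 23}`.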
------------- Changes: - all: https://git.openjdk.org/jdk/pull/20508/files - new: https://git.openjdk.org/jdk/pull/20508/files/6cb1a46d..408a8694 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20508&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20508&range=04-05 Stats: 112 lines in 7 files changed: 91 ins; 14 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/20508.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20508/head:pull/20508 PR: https://git.openjdk.org/jdk/pull/20508 From jbhateja at openjdk.org Tue Aug 27 10:04:04 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 27 Aug 2024 10:04:04 GMT Subject: RFR: 8338023: Support two vector selectFrom API [v5] In-Reply-To: <2_P1qPMS46tgh4RUSuitcjXYnd0koS_BxfRRRmj79EY=.c3baeeaa-87f7-47d4-bc70-ae2afd9de745@github.com> References: <28KQHru1heR-YOVsRVo8Ffj_4D29IV8vD2tombvTHdI=.dba80ac3-9804-4074-ac0f-8acb9b042a08@github.com> <7e5pWnvjqk-dQYNeaZjFzXcd5WlzniZPl5T4l1rKQGE=.0882bcd4-e307-4a29-aa41-5496ee029a60@github.com> <2_P1qPMS46tgh4RUSuitcjXYnd0koS_BxfRRRmj79EY=.c3baeeaa-87f7-47d4-bc70-ae2afd9de745@github.com> Message-ID: On Fri, 23 Aug 2024 22:29:46 GMT, Paul Sandoz wrote: > API changes look good. (Note at the moment we are not proposing to change how shuffles works - as you point out the two vector `selectFrom` and `rearrange` differ in the index representation.) > > IIUC if the more direct two-table instruction is not available you fall back to calling two single arg rearranges with a blend, as a lowering transformation, similar to the fallback Java expression. > > The float/double conversion bothers me, not suggesting we do something about it here, noting down for any future conversation on shuffles. Ideally we would want the equivalent integral vector (int or long) to represent the index, tricky to express in the API, or alternative treat as a bitwise no-op conversion (there is also impact on `toShuffle` too). 
Thanks @PaulSandoz,

> IIUC if the more direct two-table instruction is not available you fall back to calling two single arg rearranges with a blend, as a lowering transformation, similar to the fallback Java expression.

The idea here is to be as performant as possible and to avoid the additional boxing penalties incurred when intrinsification fails because the target does not directly support two-vector permutation but does support its constituents. I have now unwrapped and optimized the fallback implementation to operate directly on index vector lanes instead of going through an intermediate shuffle.

> The float/double conversion bothers me, not suggesting we do something about it here, noting down for any future conversation on shuffles.

I agree.

> Ideally we would want the equivalent integral vector (int or long) to represent the index, tricky to express in the API, or alternative treat as a bitwise no-op conversion (there is also impact on `toShuffle` too).

Since a floating-point index vector may carry special values such as NaN, POSITIVE_INFINITY and NEGATIVE_INFINITY, the default wrapping semantics make it necessary to convert it into an integral vector and then apply wrapping normalization to the valid two-vector index range. With the existing sequence, we bypass partial wrapping (part of toShuffle) altogether, which may save a few instructions.
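The fallback discussed above, two single-vector rearranges combined by a blend, can be sketched as a scalar C++ model. This is not the C2 lowering code. It assumes that "wrapping" means reducing each index modulo `2 * VLEN` (floor-mod, so negative values wrap as well), and it covers integer indices only. The name `select_from_lowered` is hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Scalar model of the lowering: rearrange v1 and v2 with the same
// in-vector positions, then blend using a mask that selects v2 wherever
// the wrapped index points into the upper half of the 2*N-element table.
template <typename T, std::size_t N>
std::array<T, N> select_from_lowered(const std::array<int, N>& index,
                                     const std::array<T, N>& v1,
                                     const std::array<T, N>& v2) {
  const int table_len = static_cast<int>(2 * N);
  std::array<T, N> from_v1{}, from_v2{};
  std::array<bool, N> use_v2{};
  for (std::size_t lane = 0; lane < N; ++lane) {
    // Assumed wrapping: floor-mod into [0, 2*N).
    const int wrapped = ((index[lane] % table_len) + table_len) % table_len;
    // Position within a single N-lane vector, used by both rearranges.
    const std::size_t in_vec = static_cast<std::size_t>(wrapped) % N;
    from_v1[lane] = v1[in_vec];                      // like v1.rearrange(s)
    from_v2[lane] = v2[in_vec];                      // like v2.rearrange(s)
    use_v2[lane] = wrapped >= static_cast<int>(N);   // blend mask
  }
  std::array<T, N> result{};
  for (std::size_t lane = 0; lane < N; ++lane) {
    result[lane] = use_v2[lane] ? from_v2[lane] : from_v1[lane];  // blend
  }
  return result;
}
```

For in-range indices, this model produces the same lanes as the direct two-table selection. The blend mask is the only extra work that the fallback performs compared with a single two-table permute.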
------------- PR Comment: https://git.openjdk.org/jdk/pull/20508#issuecomment-2312089987 From rcastanedalo at openjdk.org Tue Aug 27 12:39:07 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Tue, 27 Aug 2024 12:39:07 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v2] In-Reply-To: References: <4c-MLXwKcNcSnloSkYkuk3gnv3ux5i5beS51Fd9Z8MQ=.cd0a7eba-ff26-4855-a01c-d1ae5182100b@github.com> <8fuUEkswt05x0IuT4PrNQuYgLd49g4EpZWOPPQog4PQ=.70b5edb6-98d0-4276-8578-f7a496b7f2a7@github.com> <3H3rBSKDnpg5fmYqcZ5hT9yH2EAxCocycRompQJJCOo=.1b30fd89-09e9-4708-bd20-cdea00e809a7@github.com> <7Bjcf6MF4aTBuk4DmTnGzP0WwCWqJx_sv5k2sGMt9No=.ca426f04-4fa5-4ab1-a414-8c5e6a4e0dce@github.com> <6PF-kgezzOb9Ed7j-BbrwaURnLJH5aFgOouFwTYiFrE=.670092b8-6980-43f3-a091-25312cfa0f1b@github.com> <1vQH6zpEgjhIO_mq9DCnpwxgDXmnqdx0owlvjJq4Fcw=.78e60c6e-2b23-4e94-a998-e7ba9eafcb6a@github.com> Message-ID: On Tue, 27 Aug 2024 07:34:57 GMT, Roberto Castañeda Lozano wrote: >>> That one is among the failing tests. Can we agree on better names than g1XChgP and g1XChgN? They are not readable very well IMHO. >> >> Sure, I agree that `g1GetAndSetP` and `g1GetAndSetN` are more consistent with the corresponding Ideal operation names, will rename the instructions in x64 and aarch64 and update the test expectations. >> >>> Regardless if you implement the compressed oops optimization or not, I'd reconsider moving the oop decoding into G1BarrierSetAssembler::g1_write_barrier_post_c2 because it makes the .ad file shorter because you can get rid of the replicated decode_heap_oop. >> >> Thanks, will try it out. > >> Sure, I agree that g1GetAndSetP and g1GetAndSetN are more consistent with the corresponding Ideal operation names, will rename the instructions in x64 and aarch64 and update the test expectations. > > Done (commit daf38d3).
> > @offamitkumar @feilongjiang @snazarkin please note that the ADL instructions `g1XChgP` and `g1XChgN` have been renamed to `g1GetAndSetP` and `g1GetAndSetN`, and the same naming is expected across all platforms by the test `compiler/gcbarriers/TestG1BarrierGeneration.java` included in this changeset. > Regardless if you implement the compressed oops optimization or not, I'd reconsider moving the oop decoding into G1BarrierSetAssembler::g1_write_barrier_post_c2 because it makes the .ad file shorter because you can get rid of the replicated decode_heap_oop. I tried this refactoring [here](https://github.com/openjdk/jdk/commit/d4e83fd7d77c5415700b33556752e8c8da811dea), thanks again Martin for the suggestion. In my opinion, the result is similar in terms of readability/maintainability because the benefit of removing explicit `decode_heap_oop` operations in the ADL file is somewhat negated by the increased complexity of the `write_barrier_post` and `g1_write_barrier_post_c2` signatures. For aarch64, moving non-destructive `decode_heap_oop` operations would probably require passing both source and destination registers to these functions explicitly, which would make them more complex. I feel that this refactoring is only amortized when the new-value decoding operations are more complex, as in your PPC implementation. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1732770143 From chagedorn at openjdk.org Tue Aug 27 13:41:10 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 27 Aug 2024 13:41:10 GMT Subject: RFR: 8327381: Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v14] In-Reply-To: References: Message-ID: <_z9yxiS9mXR6k6qT2vQEOWn97sOoKtWO7GmNLrbHmnA=.7a6238bd-1788-4ce6-b37f-ead56ae1545e@github.com> On Mon, 26 Aug 2024 17:46:51 GMT, Kangcheng Xu wrote: >> This PR resolves [JDK-8327381](https://bugs.openjdk.org/browse/JDK-8327381) >> >> Currently the transformations for expressions with patterns `((x & m) u<= m)` or `((m & x) u<= m)` to `true` is in `BoolNode::Ideal` function with a new constant node of value `1` created. However, this is technically a type-improving (reduction in range) transformation that's better suited in `BoolNode::Value` function. >> >> New unit test `test/hotspot/jtreg/compiler/c2/TestBoolNodeGvn.java` asserting on IR nodes and correctness of this transformation is added and passing. > > Kangcheng Xu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 25 additional commits since the last revision: > > - Merge branch 'master' into boolnode-refactor > - add a word break to IRNode.CMP_U > - Merge branch 'master' into boolnode-refactor > - spread boolean AND and OR into subcases, update number of expected CMP_U nodes > - Merge branch 'master' into boolnode-refactor > - Merge branch 'master' into boolnode-refactor > - update test values, @run directive, and remove an empty line > - Merge branch 'master' into boolnode-refactor > - move test location, add negative test case, simplify imports > - Merge branch 'master' into boolnode-refactor > - ... and 15 more: https://git.openjdk.org/jdk/compare/1d08ac0d...719199c2 Testing looked good! 
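The reason this fold is a pure type fact, and therefore fits `BoolNode::Value`, is that `x & m` can only keep bits that are already set in `m`, so interpreted as an unsigned value it never exceeds `m`. The small Java check below illustrates the property; it is an illustration only, not C2 code, and the class name is made up. It also shows that the property depends on the unsigned comparison: with a signed compare, `m = -1, x = 0` gives `0 <= -1`, which is false.

```java
/** Illustration (not C2 code) of why (x & m) u<= m always holds. */
final class AndMaskFold {

    // x & m keeps only bits already set in m, so as an unsigned value it cannot exceed m.
    static boolean foldHolds(int x, int m) {
        return Integer.compareUnsigned(x & m, m) <= 0
            && Integer.compareUnsigned(m & x, m) <= 0;   // commuted form ((m & x) u<= m)
    }

    // The same comparison done signed does NOT always hold, which is why the fold needs u<=.
    static boolean signedHolds(int x, int m) {
        return (x & m) <= m;
    }

    // Exhaustively check every (x, m) pair drawn from [lo, hi].
    static boolean holdsForRange(int lo, int hi) {
        for (int x = lo; x <= hi; x++) {
            for (int m = lo; m <= hi; m++) {
                if (!foldHolds(x, m)) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(holdsForRange(-300, 300));             // true
        System.out.println(foldHolds(0xFFFF_FFFF, 0x8000_0000));  // true
        System.out.println(signedHolds(0, -1));                   // false: 0 <= -1 fails
    }
}
```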
------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18198#pullrequestreview-2263397629 From dfenacci at openjdk.org Tue Aug 27 14:28:47 2024 From: dfenacci at openjdk.org (Damon Fenacci) Date: Tue, 27 Aug 2024 14:28:47 GMT Subject: RFR: 8326615: C1/C2 don't handle allocation failure properly during initialization (RuntimeStub::new_runtime_stub fatal crash) [v6] In-Reply-To: References: Message-ID: <07P5RLBPr-KRh3KSuMegKqvSZgvhj3rKrY9jcWa2ELQ=.f9dc370c-0c09-4ffc-b8db-7424c9b11912@github.com> > # Issue > > The test `compiler/startup/StartupOutput.java` fails intermittently due to a crash after correctly printing the error `Initial size of CodeCache is too small` (the test limits the code cache using `-XX:InitialCodeCacheSize=1024K -XX:ReservedCodeCacheSize=1200k`). > The appearance of the issue is very dependent on thread scheduling. The original report happens during C1 initialization but C2 initialization is affected as well. > > # Causes > > There is one occurrence during C1 initialization and one during C2 initialization where a call to `RuntimeStub::new_runtime_stub` can fail fatally if there is not enough space left. > For C1: `Compiler::init_c1_runtime` -> `Runtime1::initialize` -> `Runtime1::generate_blob_for` -> `Runtime1::generate_blob` -> `RuntimeStub::new_runtime_stub`. > For C2: `C2Compiler::initialize` -> `OptoRuntime::generate` -> `OptoRuntime::generate_stub` -> `Compile::Compile` -> `Compile::Code_Gen` -> `PhaseOutput::install` -> `PhaseOutput::install_stub` -> `RuntimeStub::new_runtime_stub`. > > # Solution > > https://github.com/openjdk/jdk/pull/15970 introduced an optional argument to `RuntimeStub::new_runtime_stub` to determine if it fails fatally or not. We can take advantage of it to avoid crashing and instead pass the information about the success or failure of the allocation up the (C1 and C2 initialization) call stack to where we can set the compilations as failed.
Damon Fenacci has updated the pull request incrementally with 12 additional commits since the last revision: - JDK-8326615: calculate minimum code cache size based on initial compiler buffer sizes - JDK-8326615 add forgotten problemlisted configuration after revert - JDK-8326615 add forgotten problemlisted test after revert - Revert "JDK-8326615: compiler/startup/StartupOutput.java intermittently Internal Error (codeBlob.cpp:429) Initial size of CodeCache is too small" This reverts commit d3a5b2fd - Revert "JDK-8326615: update copyright year" This reverts commit bd0e039f562eecbf8f63eeb41b29f2703b6f0f17. - Revert "Update src/hotspot/share/c1/c1_Compiler.cpp" This reverts commit 5bf1a5ec5b05de15d55018dabdf48449f0ccb9a1. - Revert "Update src/hotspot/share/c1/c1_Runtime1.cpp" This reverts commit c505aac55825a34567fc55115b4ab9eb60a2cc71. - Revert "JDK-8326615: handle allocation failures in barrier set" This reverts commit bd2a7adfdf0cba845565ae1ee059323daa1b40db. - Revert "JDK-8326615: update copyright year" This reverts commit f16d9910cd44380a3f348b96d6ec15eea937920d. - Revert "Update src/hotspot/share/gc/x/c1/xBarrierSetC1.cpp" This reverts commit 0374efedacf0885ad17ac3e346b1dd4741bb7cdb. - ... 
and 2 more: https://git.openjdk.org/jdk/compare/ea23c61e...c7f484a4 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/19280/files - new: https://git.openjdk.org/jdk/pull/19280/files/ea23c61e..c7f484a4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=19280&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=19280&range=04-05 Stats: 93 lines in 20 files changed: 23 ins; 27 del; 43 mod Patch: https://git.openjdk.org/jdk/pull/19280.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/19280/head:pull/19280 PR: https://git.openjdk.org/jdk/pull/19280 From kvn at openjdk.org Tue Aug 27 15:27:05 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 27 Aug 2024 15:27:05 GMT Subject: RFR: 8333098: ubsan: bytecodeInfo.cpp:318:59: runtime error: division by zero [v2] In-Reply-To: References: Message-ID: On Tue, 27 Aug 2024 09:12:04 GMT, Matthias Baesken wrote: > Besides adding some comment (?) is there anything else that should be done in this PR ? Current change looks fine to me. Please add a comment about avoiding division by 0. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20615#issuecomment-2312871504 From dlunden at openjdk.org Tue Aug 27 15:41:06 2024 From: dlunden at openjdk.org (Daniel Lundén) Date: Tue, 27 Aug 2024 15:41:06 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v4] In-Reply-To: References: Message-ID: On Tue, 20 Aug 2024 14:32:25 GMT, Daniel Lundén wrote: >> If a method has a large number of parameters, we currently bail out from C2 compilation. >> >> ### Changeset >> >> Allowing C2 compilation of methods with a large number of parameters requires fundamental changes to the register mask data structure, used in many places in C2. In particular, register masks currently have a statically determined size and cannot represent arbitrary numbers of stack slots. This is needed if we want to compile methods with arbitrary numbers of parameters.
Register mask operations are present in performance-sensitive parts of C2, which further complicates changes. >> >> Changes: >> - Add functionality to dynamically grow/extend register masks. I experimented with a number of design choices to achieve this. To keep the common case (normal number of method parameters) quick and also to avoid more intrusive changes to the current `RegMask` interface, I decided to leave the "base" statically allocated memory for masks unchanged and only use dynamically allocated memory in the rare cases where it is needed. >> - Generalize the "chunk"-logic from `PhaseChaitin::Select()` to allow arbitrary-sized chunks, and also move most of the logic into register mask methods to separate concerns and to make the `PhaseChaitin::Select()` code more readable. >> - Remove all `can_represent` checks and bailouts. >> - Performance tuning. A particularly important change is the early-exit optimization in `RegMask::overlap`, used in the performance-sensitive method `PhaseChaitin::interfere_with_live`. >> - Add a new test case `TestManyMethodArguments.java` and extend an old test `TestNestedSynchronize.java`. >> >> ### Testing >> >> - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/10178060450) >> - `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. >> - Standard performance benchmarking. No observed conclusive overall performance degradation/improvement. >> - Specific benchmarking of C2 compilation time. The changes increase C2 compilation time by, approximately and on average, 1% for methods that could also be compiled before this changeset (see the figure below). The reason for the degradation is further checks required in performance-sensitive code (in particular `PhaseChaitin::remove_bound_register_from_interfering_live_ranges`). 
I have tried optimizing in various ways, but changes I found that lead to improvement also lead to less readable code (and are, in my opinion, no... > > Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: > > Update after Roberto's comments and suggestions

I explored the possibility of calculating a static upper bound for register mask size, given that parameters must fit in 255 32-bit words (from the JVM spec). My conclusion is that the required increase in static register mask size is too expensive, and that we should proceed with my current solution.

**Details**

In C2, register masks must be capable of representing incoming arguments (up to 255) as well as the maximum number of outgoing arguments among all function calls in the compiled method (also up to 255). Additionally, arguments are 64-bit aligned in the frame, which doubles the number of required register mask bits. Example on my x64 machine (1 word = 32 bits here):

- Registers require at minimum 18 words in register masks.
- We currently add 4 words for representing arguments, locks, and some other stack locations, so in total 22 words.
- To ensure we can fit 255 incoming arguments in the mask, we need 255 * 2 = 510 bits, i.e. 16 words.
- To ensure we can fit 255 outgoing arguments in the mask, we need 255 * 2 = 510 bits, i.e. 16 words.

That is, we go from 18 + 4 = 22 words to at least 18 + 16 + 16 = 50 words, only taking incoming and outgoing arguments into account. Performance experiments indicate a 3.3% C2 compilation speed degradation (compared to 1% for the solution in this PR).
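The word counts in the x64 example can be reproduced with a ceiling division. The sketch below only re-does the arithmetic from the comment (18 register words, 4 extra words today, up to 255 incoming and 255 outgoing arguments, 2 mask bits per argument because of 64-bit alignment, 32-bit words). The class and method names are hypothetical, and this is not how C2 computes `RegMask` sizes.

```java
/** Hypothetical re-computation of the register-mask sizing arithmetic (x64, 32-bit words). */
final class RegMaskSizing {
    static final int BITS_PER_WORD = 32;
    // Arguments are 64-bit aligned in the frame, so each occupies two stack slots, i.e. two mask bits.
    static final int BITS_PER_ARG = 2;

    // Ceiling division: number of words needed to hold `bits` bits.
    static int wordsFor(int bits) {
        return (bits + BITS_PER_WORD - 1) / BITS_PER_WORD;
    }

    static int argumentWords(int maxArgs) {
        return wordsFor(maxArgs * BITS_PER_ARG);
    }

    // Static bound: register words plus room for worst-case incoming and outgoing arguments.
    static int totalWords(int registerWords, int maxIncoming, int maxOutgoing) {
        return registerWords + argumentWords(maxIncoming) + argumentWords(maxOutgoing);
    }

    public static void main(String[] args) {
        int registerWords = 18;  // minimum words for registers on the x64 example machine
        int currentExtra = 4;    // arguments, locks and other stack locations today
        System.out.println("current:      " + (registerWords + currentExtra) + " words");         // 22
        System.out.println("static bound: " + totalWords(registerWords, 255, 255) + " words");    // 50
    }
}
```

Since 255 × 2 = 510 bits and 16 × 32 = 512 bits, each argument range rounds up to exactly 16 words, so the static bound is 50 words versus today's 22, more than double.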
------------- PR Comment: https://git.openjdk.org/jdk/pull/20404#issuecomment-2312905150 From dlong at openjdk.org Tue Aug 27 16:16:35 2024 From: dlong at openjdk.org (Dean Long) Date: Tue, 27 Aug 2024 16:16:35 GMT Subject: RFR: 8335120: assert(!target->can_be_statically_bound() || target == cha_monomorphic_target) failed Message-ID: <53hCMrAv73-MOnpcsLz6wrv688AfIJ1mLeLog1WX07o=.4296304e-70d5-4a60-afb1-d00644732519@github.com> The failing test uses class redefinition, which happens at safepoints. Because JIT compiler threads do not block safepoints, compiler threads can sometimes see multiple versions of the same method. The above assert can fail if target and cha_monomorphic_target are different versions of the same method. The fix is to introduce ciMethod::equals to deal with this possibility. ------------- Commit messages: - add ciMethod::equals() to compare possibly-redefined methods Changes: https://git.openjdk.org/jdk/pull/20730/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=20730&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8335120 Stats: 19 lines in 3 files changed: 18 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/20730.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20730/head:pull/20730 PR: https://git.openjdk.org/jdk/pull/20730 From dfenacci at openjdk.org Tue Aug 27 16:29:46 2024 From: dfenacci at openjdk.org (Damon Fenacci) Date: Tue, 27 Aug 2024 16:29:46 GMT Subject: RFR: 8326615: C1/C2 don't handle allocation failure properly during initialization (RuntimeStub::new_runtime_stub fatal crash) [v7] In-Reply-To: References: Message-ID: > # Issue > > The test `compiler/startup/StartupOutput.java` fails intermittently due to a crash after correctly printing the error `Initial size of CodeCache is too small` (the test limits the code cache using `-XX:InitialCodeCacheSize=1024K -XX:ReservedCodeCacheSize=1200k`). > The appearance of the issue is very dependent on thread scheduling.
The original report happens during C1 initialization but C2 initialization is affected as well. > > # Causes > > There is one occurrence during C1 initialization and one during C2 initialization where a call to `RuntimeStub::new_runtime_stub` can fail fatally if there is not enough space left. > For C1: `Compiler::init_c1_runtime` -> `Runtime1::initialize` -> `Runtime1::generate_blob_for` -> `Runtime1::generate_blob` -> `RuntimeStub::new_runtime_stub`. > For C2: `C2Compiler::initialize` -> `OptoRuntime::generate` -> `OptoRuntime::generate_stub` -> `Compile::Compile` -> `Compile::Code_Gen` -> `PhaseOutput::install` -> `PhaseOutput::install_stub` -> `RuntimeStub::new_runtime_stub`. > > # Solution > > https://github.com/openjdk/jdk/pull/15970 introduced an optional argument to `RuntimeStub::new_runtime_stub` to determine if it fails fatally or not. We can take advantage of it to avoid crashing and instead pass the information about the success or failure of the allocation up the (C1 and C2 initialization) call stack up to where we can set the compilations as failed. Damon Fenacci has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 23 commits: - Merge tag 'jdk-24+7' into JDK-8326615 Added tag jdk-24+7 for changeset 21a6cf84 - JDK-8326615: calculate minimum code cache size based on initial compiler buffer sizes - JDK-8326615 add forgotten problemlisted configuration after revert - JDK-8326615 add forgotten problemlisted test after revert - Revert "JDK-8326615: compiler/startup/StartupOutput.java intermittently Internal Error (codeBlob.cpp:429) Initial size of CodeCache is too small" This reverts commit d3a5b2fd - Revert "JDK-8326615: update copyright year" This reverts commit bd0e039f562eecbf8f63eeb41b29f2703b6f0f17. - Revert "Update src/hotspot/share/c1/c1_Compiler.cpp" This reverts commit 5bf1a5ec5b05de15d55018dabdf48449f0ccb9a1. 
- Revert "Update src/hotspot/share/c1/c1_Runtime1.cpp" This reverts commit c505aac55825a34567fc55115b4ab9eb60a2cc71. - Revert "JDK-8326615: handle allocation failures in barrier set" This reverts commit bd2a7adfdf0cba845565ae1ee059323daa1b40db. - Revert "JDK-8326615: update copyright year" This reverts commit f16d9910cd44380a3f348b96d6ec15eea937920d. - ... and 13 more: https://git.openjdk.org/jdk/compare/21a6cf84...f79d0a31 ------------- Changes: https://git.openjdk.org/jdk/pull/19280/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=19280&range=06 Stats: 29 lines in 7 files changed: 22 ins; 2 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/19280.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/19280/head:pull/19280 PR: https://git.openjdk.org/jdk/pull/19280 From mdoerr at openjdk.org Tue Aug 27 17:18:06 2024 From: mdoerr at openjdk.org (Martin Doerr) Date: Tue, 27 Aug 2024 17:18:06 GMT Subject: RFR: JDK-8332423 : [PPC64] Remove C1_MacroAssembler::call_c_with_frame_resize [v2] In-Reply-To: References: Message-ID: On Sun, 25 Aug 2024 17:00:33 GMT, Suchismith Roy wrote: >> JBS Issue : [JDK-8332423](https://bugs.openjdk.org/browse/JDK-8332423) >> C1_MacroAssembler::call_c_with_frame_resize is only used with frame_resize == 0. >> Also, call_c is adapted as per the endianness of the system. >> We can adapt the existing code to handle the endianness check in one place instead of repeatedly checking it at every place that calls call_c. > > Suchismith Roy has updated the pull request incrementally with one additional commit since the last revision: > > remove this-> Nice cleanup! I have a couple of small change requests. src/hotspot/cpu/ppc/interp_masm_ppc_64.cpp line 138: > 136: // func to get the address of the same-named entrypoint in the > 137: // generated interpreter code. > 138: call_c(CAST_FROM_FN_PTR(address, Interpreter::remove_activation_preserving_args_entry), relocInfo::none); `relocInfo::none` can be omitted. It's the default.
src/hotspot/cpu/ppc/macroAssembler_ppc.cpp line 1038: > 1036: } > 1037: > 1038: address MacroAssembler::call_c(address function_entry,relocInfo::relocType) { Please restore the original line. src/hotspot/cpu/ppc/macroAssembler_ppc.hpp line 367: > 365: // calling conventions. Updates and returns _last_calls_return_pc. > 366: address call_c(Register function_descriptor); > 367: address call_c(address function_entry, relocInfo::relocType rt = relocInfo::relocType::none) { I think `relocInfo::none` should be used like above. src/hotspot/cpu/ppc/macroAssembler_ppc.hpp line 419: > 417: void call_VM_leaf(address entry_point, Register arg_1, Register arg_2); > 418: void call_VM_leaf(address entry_point, Register arg_1, Register arg_2, Register arg_3); > 419: Please don't remove the empty line. src/hotspot/cpu/ppc/stubGenerator_ppc.cpp line 619: > 617: return stub->entry_point(); > 618: } > 619: #undef __ I guess this was added unintentionally? src/hotspot/cpu/ppc/templateInterpreterGenerator_ppc.cpp line 1467: > 1465: > 1466: __ mr(R3_ARG1, R16_thread); > 1467: __ call_c(CAST_FROM_FN_PTR(address, JavaThread::check_special_condition_for_native_trans), relocInfo::none); `relocInfo::none` can be omitted. It's the default. ------------- Changes requested by mdoerr (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/19947#pullrequestreview-2264025594 PR Review Comment: https://git.openjdk.org/jdk/pull/19947#discussion_r1733252476 PR Review Comment: https://git.openjdk.org/jdk/pull/19947#discussion_r1733248903 PR Review Comment: https://git.openjdk.org/jdk/pull/19947#discussion_r1733249839 PR Review Comment: https://git.openjdk.org/jdk/pull/19947#discussion_r1733250155 PR Review Comment: https://git.openjdk.org/jdk/pull/19947#discussion_r1733253411 PR Review Comment: https://git.openjdk.org/jdk/pull/19947#discussion_r1733253817 From mdoerr at openjdk.org Tue Aug 27 17:41:13 2024 From: mdoerr at openjdk.org (Martin Doerr) Date: Tue, 27 Aug 2024 17:41:13 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v2] In-Reply-To: References: Message-ID: On Tue, 27 Aug 2024 12:36:39 GMT, Roberto Castañeda Lozano wrote: >>> Sure, I agree that g1GetAndSetP and g1GetAndSetN are more consistent with the corresponding Ideal operation names, will rename the instructions in x64 and aarch64 and update the test expectations. >> >> Done (commit daf38d3). >> >> @offamitkumar @feilongjiang @snazarkin please note that the ADL instructions `g1XChgP` and `g1XChgN` have been renamed to `g1GetAndSetP` and `g1GetAndSetN`, and the same naming is expected across all platforms by the test `compiler/gcbarriers/TestG1BarrierGeneration.java` included in this changeset.
> >> Regardless if you implement the compressed oops optimization or not, I'd reconsider moving the oop decoding into G1BarrierSetAssembler::g1_write_barrier_post_c2 because it makes the .ad file shorter because you can get rid of the replicated decode_heap_oop. > > I tried this refactoring [here](https://github.com/openjdk/jdk/commit/d4e83fd7d77c5415700b33556752e8c8da811dea), thanks again Martin for the suggestion. In my opinion, the result is similar in terms of readability/maintainability because the benefit of removing explicit `decode_heap_oop` operations in the ADL file is somewhat negated by the increased complexity of the `write_barrier_post` and `g1_write_barrier_post_c2` signatures. For aarch64, moving non-destructive `decode_heap_oop` operations would probably require passing both source and destination registers to these functions explicitly, which would make them more complex. I feel that this refactoring is only amortized when the new-value decoding operations are more complex, as in your PPC implementation. Thanks for trying! I like your refactored version. I prefer moving stuff out of the .ad files. The complexity of the barrier code is not significantly higher. Note that you could use `decode_heap_oop_not_null(Register dst, Register src)` with `tmp2` as dst (`generate_post_barrier_fast_path` can deal with new_val == tmp2). That would save a move instruction in some cases. I haven't looked into the aarch64 code. I leave you free to decide. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1733283320 From sviswanathan at openjdk.org Tue Aug 27 18:30:08 2024 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Tue, 27 Aug 2024 18:30:08 GMT Subject: RFR: 8338021: Support saturating vector operators in VectorAPI [v4] In-Reply-To: References: Message-ID: On Mon, 19 Aug 2024 07:19:30 GMT, Jatin Bhateja wrote: >> Hi All, >> >> As per the discussion on the panama-dev mailing list[1], this patch adds support for the following new vector operators. >> >> >> . SUADD : Saturating unsigned addition. >> . SADD : Saturating signed addition. >> . SUSUB : Saturating unsigned subtraction. >> . SSUB : Saturating signed subtraction. >> . UMAX : Unsigned max. >> . UMIN : Unsigned min. >> >> >> The new vector operators apply only to integral types, since integral values wrap around in overflow and underflow scenarios (after setting the appropriate status flags). For floating-point types, IEEE 754 provides multiple schemes to handle underflow; one of them is gradual underflow, which transitions the value to the subnormal range. Similarly, overflow implicitly saturates the floating-point value to an infinite value. >> >> As the name suggests, these are saturating operations, i.e. the result of the computation is strictly capped by the lower and upper bounds of the result type and is not wrapped around in underflow or overflow scenarios. >> >> Summary of changes: >> - Java side implementation of new vector operators. >> - Add new scalar saturating APIs for each of the above saturating vector operators in the corresponding primitive box classes; the fallback implementation of the vector operators is based on them. >> - C2 compiler IR and inline expander changes. >> - Optimized x86 backend implementation for new vector operators and their predicated counterparts. >> - Extends existing VectorAPI Jtreg test suite to cover new operations. >> >> Kindly review and share your feedback.
>> >> Best Regards, >> PS: Intrinsification and auto-vectorization of new core-lib API will be addressed separately in a follow-up patch. >> >> [1] https://mail.openjdk.org/pipermail/panama-dev/2024-May/020408.html > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Review comments resolutions. src/hotspot/cpu/x86/x86.ad line 1773: > 1771: return false; > 1772: } > 1773: if (bt == T_LONG && !VM_Version::supports_avx512vl()) { we should be able to support bt == T_LONG for 512 bit irrespective of avx512vl. src/hotspot/cpu/x86/x86.ad line 1953: > 1951: if (UseAVX < 1 || size_in_bits < 128 || (size_in_bits == 512 && !VM_Version::supports_avx512bw())) { > 1952: return false; > 1953: } UseAVX < 1 could be written as UseAVX == 0. Could we not do register version for size_in_bit < 128? src/hotspot/cpu/x86/x86.ad line 1962: > 1960: return false; // Implementation limitation > 1961: } > 1962: break; Could we not do register version for size_in_bit < 128? src/hotspot/cpu/x86/x86.ad line 2143: > 2141: if (is_subword_type(bt) && !VM_Version::supports_avx512bw()) { > 2142: return false; // Implementation limitation > 2143: } UMinV and UMaxV are supported on AVX1, AVX2 platform. src/hotspot/cpu/x86/x86.ad line 2155: > 2153: return false; // Implementation limitation > 2154: } > 2155: return true; Byte/Short saturating vector add is supported for AVX1, AVX2 platforms. Could we not do register version for size_in_bit < 128? 
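The lane-wise semantics of the saturating operators described in this thread can be modelled with ordinary scalar Java. The sketch below covers `byte` lanes, where the result is computed exactly in `int` and then clamped, and `long` lanes, where no wider primitive exists, so overflow is detected from the sign bits. It is a hypothetical model for illustration: the scalar helpers the PR adds to the box classes may be named and structured differently.

```java
/** Hypothetical scalar model of the saturating lane operations; not the PR's box-class helpers. */
final class SaturatingOps {

    private static byte clampToByte(int v) {
        return (byte) Math.max(Byte.MIN_VALUE, Math.min(Byte.MAX_VALUE, v));
    }

    // ---- byte lanes: compute exactly in int, then clamp ----
    static byte sadd(byte a, byte b) { return clampToByte(a + b); }   // SADD
    static byte ssub(byte a, byte b) { return clampToByte(a - b); }   // SSUB

    static byte suadd(byte a, byte b) {                               // SUADD
        int r = Byte.toUnsignedInt(a) + Byte.toUnsignedInt(b);
        return (byte) Math.min(r, 0xFF);   // cap at 255 (bit pattern 0xFF)
    }

    static byte susub(byte a, byte b) {                               // SUSUB
        int r = Byte.toUnsignedInt(a) - Byte.toUnsignedInt(b);
        return (byte) Math.max(r, 0);      // floor at 0
    }

    static byte umax(byte a, byte b) {                                // UMAX
        return Byte.toUnsignedInt(a) >= Byte.toUnsignedInt(b) ? a : b;
    }

    static byte umin(byte a, byte b) {                                // UMIN
        return Byte.toUnsignedInt(a) <= Byte.toUnsignedInt(b) ? a : b;
    }

    // ---- long lanes: no wider primitive, so detect overflow from the sign bits ----
    static long sadd(long a, long b) {
        long r = a + b;
        // Overflow iff a and b share a sign and r's sign differs from it.
        if (((a ^ r) & (b ^ r)) < 0) {
            return a < 0 ? Long.MIN_VALUE : Long.MAX_VALUE;
        }
        return r;
    }

    static long ssub(long a, long b) {
        long r = a - b;
        // Overflow iff a and b differ in sign and r's sign differs from a's.
        if (((a ^ b) & (a ^ r)) < 0) {
            return a < 0 ? Long.MIN_VALUE : Long.MAX_VALUE;
        }
        return r;
    }

    static long suadd(long a, long b) {
        long r = a + b;
        // Unsigned wrap-around happened iff the sum is unsigned-smaller than an operand.
        return Long.compareUnsigned(r, a) < 0 ? -1L : r;   // -1L is the unsigned maximum
    }

    static long susub(long a, long b) {
        return Long.compareUnsigned(a, b) < 0 ? 0L : a - b;
    }

    public static void main(String[] args) {
        System.out.println(sadd((byte) 100, (byte) 100));                // 127
        System.out.println(suadd((byte) 200, (byte) 100));               // -1 (0xFF, i.e. 255 unsigned)
        System.out.println(susub((byte) 10, (byte) 20));                 // 0
        System.out.println(sadd(Long.MAX_VALUE, 1L) == Long.MAX_VALUE);  // true
    }
}
```

The `byte` variants show why the subword cases are cheap to express (a wider intermediate always exists), while the `long` variants need the sign-bit overflow tests that a backend would otherwise get from flags or dedicated instructions.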
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1733330892 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1733333203 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1733333608 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1733336005 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1733338300 From kvn at openjdk.org Tue Aug 27 18:46:02 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 27 Aug 2024 18:46:02 GMT Subject: RFR: 8335120: assert(!target->can_be_statically_bound() || target == cha_monomorphic_target) failed In-Reply-To: <53hCMrAv73-MOnpcsLz6wrv688AfIJ1mLeLog1WX07o=.4296304e-70d5-4a60-afb1-d00644732519@github.com> References: <53hCMrAv73-MOnpcsLz6wrv688AfIJ1mLeLog1WX07o=.4296304e-70d5-4a60-afb1-d00644732519@github.com> Message-ID: On Tue, 27 Aug 2024 16:11:22 GMT, Dean Long wrote: > The failing test uses class redefinition, which happen at safepoints. Because JIT compiler threads do not block safepoints, compiler threads can sometimes see multiple versions of the same method. The above assert can fail if target and cha_monomorphic_target are different versions of the same method. The fix is to introduce ciMethod::equals to deal with this possibility. Do you have regression test? ------------- PR Review: https://git.openjdk.org/jdk/pull/20730#pullrequestreview-2264195431 From psandoz at openjdk.org Tue Aug 27 20:03:07 2024 From: psandoz at openjdk.org (Paul Sandoz) Date: Tue, 27 Aug 2024 20:03:07 GMT Subject: RFR: 8338023: Support two vector selectFrom API [v6] In-Reply-To: References: <28KQHru1heR-YOVsRVo8Ffj_4D29IV8vD2tombvTHdI=.dba80ac3-9804-4074-ac0f-8acb9b042a08@github.com> Message-ID: On Tue, 27 Aug 2024 09:58:44 GMT, Jatin Bhateja wrote: >> Hi All, >> >> As per the discussion on panama-dev mailing list[1], patch adds the support for following new two vector permutation APIs. 
>> >> >> Declaration:- >> Vector.selectFrom(Vector v1, Vector v2) >> >> >> Semantics:- >> Using the index values stored in the lanes of "this" vector, assemble the values stored in the first (v1) and second (v2) vector arguments. Thus, the first and second vectors together serve as a table whose elements are selected based on the index value vector. The API is applicable to all integral and floating-point types. The result of this operation is semantically equivalent to the expression v1.rearrange(this.toShuffle(), v2). Values held in index vector lanes must lie within the valid two-vector index range [0, 2*VLEN), else an IndexOutOfBoundsException is thrown. >> >> Summary of changes: >> - Java side implementation of new selectFrom API. >> - C2 compiler IR and inline expander changes. >> - In the absence of a direct two-vector permutation instruction in the target ISA, a lowering transformation dismantles the new IR into constituent IR supported by target platforms. >> - Optimized x86 backend implementation for AVX512 and legacy targets. >> - Functional tests covering the new API.
>> >> JMH micro included with this patch shows around 10-15x gain over existing rearrange API :- >> Test System: Intel(R) Xeon(R) Platinum 8480+ [ Sapphire Rapids Server]
>>
>> Benchmark | (size) | Mode | Cnt | Score | Units
>> -- | -- | -- | -- | -- | --
>> SelectFromBenchmark.rearrangeFromByteVector | 1024 | thrpt | 2 | 2041.762 | ops/ms
>> SelectFromBenchmark.rearrangeFromByteVector | 2048 | thrpt | 2 | 1028.550 | ops/ms
>> SelectFromBenchmark.rearrangeFromIntVector | 1024 | thrpt | 2 | 962.605 | ops/ms
>> SelectFromBenchmark.rearrangeFromIntVector | 2048 | thrpt | 2 | 479.004 | ops/ms
>> SelectFromBenchmark.rearrangeFromLongVector | 1024 | thrpt | 2 | 359.758 | ops/ms
>> SelectFromBenchmark.rearrangeFromLongVector | 2048 | thrpt | 2 | 178.192 | ops/ms
>> SelectFromBenchmark.rearrangeFromShortVector | 1024 | thrpt | 2 | 1463.459 | ops/ms
>> SelectFromBenchmark.rearrangeFromShortVector | 2048 | thrpt | 2 | 727.556 | ops/ms
>> SelectFromBenchmark.selectFromByteVector | 1024 | thrpt | 2 | 33254.830 | ops/ms
>> SelectFromBenchmark.selectFromByteVector | 2048 | thrpt | 2 | 17313.174 | ops/ms
>> SelectFromBenchmark.selectFromIntVector | 1024 | thrpt | 2 | 10756.804 | ops/ms
>> S...
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>
> Review comments resolutions.

I think we should leave the fallback expression as `vec2.rearrange(vec1.toShuffle(), vec3);`, let's address that separately if needed. Otherwise, you have introduced an additional code path that requires more explicit testing. My comment was related to understanding what `SelectFromTwoVectorNode::Ideal` and `VectorRearrangeNode::Ideal` are doing - the former lowers, if needed, into the rearrange expression and the latter adjusts, if needed, the index vector (a comment describing this transformation would be useful, like you have in the former method).
------------- PR Comment: https://git.openjdk.org/jdk/pull/20508#issuecomment-2313401788 From dlong at openjdk.org Tue Aug 27 20:22:02 2024 From: dlong at openjdk.org (Dean Long) Date: Tue, 27 Aug 2024 20:22:02 GMT Subject: RFR: 8335120: assert(!target->can_be_statically_bound() || target == cha_monomorphic_target) failed In-Reply-To: References: <53hCMrAv73-MOnpcsLz6wrv688AfIJ1mLeLog1WX07o=.4296304e-70d5-4a60-afb1-d00644732519@github.com> Message-ID: On Tue, 27 Aug 2024 18:43:53 GMT, Vladimir Kozlov wrote: > Do you have regression test? No, I ran the closed stress test 100s of times, with and without the fix. With the fix, the assert went away, and without the fix, I got a handful of crashes. A regression test is not realistic because the failure depends on a race condition between the compiler and class redefinition. I just added the "noreg-hard" label to the bug. Thanks for asking. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20730#issuecomment-2313436185 From dlong at openjdk.org Tue Aug 27 21:14:03 2024 From: dlong at openjdk.org (Dean Long) Date: Tue, 27 Aug 2024 21:14:03 GMT Subject: RFR: 8335120: assert(!target->can_be_statically_bound() || target == cha_monomorphic_target) failed In-Reply-To: <53hCMrAv73-MOnpcsLz6wrv688AfIJ1mLeLog1WX07o=.4296304e-70d5-4a60-afb1-d00644732519@github.com> References: <53hCMrAv73-MOnpcsLz6wrv688AfIJ1mLeLog1WX07o=.4296304e-70d5-4a60-afb1-d00644732519@github.com> Message-ID: On Tue, 27 Aug 2024 16:11:22 GMT, Dean Long wrote: > The failing test uses class redefinition, which happen at safepoints. Because JIT compiler threads do not block safepoints, compiler threads can sometimes see multiple versions of the same method. The above assert can fail if target and cha_monomorphic_target are different versions of the same method. The fix is to introduce ciMethod::equals to deal with this possibility. @iwanowww , please review. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/20730#issuecomment-2313522440 From sdohrmann at openjdk.org Tue Aug 27 22:25:03 2024 From: sdohrmann at openjdk.org (Steve Dohrmann) Date: Tue, 27 Aug 2024 22:25:03 GMT Subject: RFR: 8329035: New Data Destination instructions support Message-ID: Adds assembler support for APX New Data Destination (NDD) and No Flags (NF) features. The NDD feature is supported by new functions that take an additional destination-only register operand. If the instruction also supports NF, a no_flags boolean parameter is present. To use these instructions with NF behavior, but without NDD semantics, the same register can be supplied for both the new destination and the (first) source operand. Some instructions support NF but not NDD. These instructions have a new function that just adds a boolean no_flags parameter. Existing functions were not overloaded with a boolean here because of signature collisions (bool / int) with functions that take immediate operands. All of the new functions have a letter "e" prefix, to avoid signature collisions and to indicate they will be evex encoded. 
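The toy model below contrasts the NDD/NF semantics described above with the legacy destructive two-operand form. Registers are array slots and a single boolean stands in for the arithmetic flags. The names `add`/`eadd` and the `noFlags` parameter are hypothetical illustrations modeled on the description, not HotSpot's `Assembler` signatures. Passing the same register as the new destination and the first source gives NF behavior without NDD semantics, as the description notes.

```java
// Toy model of legacy two-operand vs. APX NDD/NF add semantics.
// Names are hypothetical; this is not HotSpot's Assembler API.
final class NddSemanticsModel {
    final long[] regs = new long[32]; // APX extends x86-64 to 32 general-purpose registers
    boolean zeroFlag;                 // stands in for the full flags state

    // Legacy destructive form: dst = dst + src; flags are always updated.
    void add(int dst, int src) {
        regs[dst] += regs[src];
        zeroFlag = regs[dst] == 0;
    }

    // NDD form: newDst = src1 + src2; src1 and src2 are left intact.
    // With noFlags == true the flags are preserved (NF behavior).
    void eadd(int newDst, int src1, int src2, boolean noFlags) {
        long result = regs[src1] + regs[src2];
        regs[newDst] = result;
        if (!noFlags) {
            zeroFlag = result == 0;
        }
    }
}
```

For example, `eadd(1, 1, 4, true)` updates register 1 in place and leaves the flags untouched.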
------------- Commit messages: - fix 32-bit build name errors, missing no_flags arg, and addw functions - 8329035: New Data Destination instructions support Changes: https://git.openjdk.org/jdk/pull/20698/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=20698&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8329035 Stats: 1749 lines in 2 files changed: 1725 ins; 1 del; 23 mod Patch: https://git.openjdk.org/jdk/pull/20698.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20698/head:pull/20698 PR: https://git.openjdk.org/jdk/pull/20698 From sviswanathan at openjdk.org Tue Aug 27 22:28:21 2024 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Tue, 27 Aug 2024 22:28:21 GMT Subject: RFR: 8338021: Support saturating vector operators in VectorAPI [v4] In-Reply-To: References: Message-ID: <4k6vX8rkREK9CYMZjs0KfHikLJJ1NWbMtWYYzLcYPc0=.53547148-1abb-4a7f-8238-944c13a26304@github.com> On Mon, 19 Aug 2024 07:19:30 GMT, Jatin Bhateja wrote: >> Hi All, >> >> As per the discussion on the panama-dev mailing list[1], this patch adds support for the following new vector operators. >> >> >> . SUADD : Saturating unsigned addition. >> . SADD : Saturating signed addition. >> . SUSUB : Saturating unsigned subtraction. >> . SSUB : Saturating signed subtraction. >> . UMAX : Unsigned max. >> . UMIN : Unsigned min. >> >> >> The new vector operators apply only to integral types, since integral values wrap around in overflow/underflow scenarios after setting the appropriate status flags. For floating-point types, IEEE 754 defines multiple schemes to handle underflow; one of them is gradual underflow, which transitions the value into the subnormal range. Similarly, overflow implicitly saturates a floating-point value to infinity. >> >> As the name suggests, these are saturating operations, i.e. the result of the computation is strictly capped by the lower and upper bounds of the result type and does not wrap around in underflow or overflow scenarios.
>> >> Summary of changes: >> - Java side implementation of the new vector operators. >> - Add new scalar saturating APIs for each of the above saturating vector operators in the corresponding primitive box classes; the fallback implementation of the vector operators is based on them. >> - C2 compiler IR and inline expander changes. >> - Optimized x86 backend implementation for the new vector operators and their predicated counterparts. >> - Extends the existing VectorAPI jtreg test suite to cover the new operations. >> >> Kindly review and share your feedback. >> >> Best Regards, >> PS: Intrinsification and auto-vectorization of the new core-lib APIs will be addressed separately in a follow-up patch. >> >> [1] https://mail.openjdk.org/pipermail/panama-dev/2024-May/020408.html > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Review comments resolutions. src/hotspot/cpu/x86/x86.ad line 10635: > 10633: %} > 10634: > 10635: instruct saturating_unsigned_add_reg_avx(vec dst, vec src1, vec src2, vec xtmp1, vec xtmp2, vec xtmp3, vec xtmp4) Should the temps here, and in all the places related to !avx512vl(), be legVec instead of vec?
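The following scalar sketch spells out the semantics of the six operators listed above for 32-bit `int` lanes, for readers who want them in code. Plain Java arithmetic would wrap instead: `Integer.MAX_VALUE + 1` becomes `Integer.MIN_VALUE`. The class and method names are hypothetical and are not the scalar APIs the PR adds. Unsigned operators reinterpret the same 32 bits using `Integer.compareUnsigned`.

```java
// Scalar reference semantics for the six operators, shown for int lanes.
// Class and method names are hypothetical; they are not the PR's API.
final class SaturatingIntOps {

    // SADD: signed add clamped to [Integer.MIN_VALUE, Integer.MAX_VALUE].
    static int sadd(int a, int b) {
        long r = (long) a + b;
        return (int) Math.max(Integer.MIN_VALUE, Math.min(Integer.MAX_VALUE, r));
    }

    // SSUB: signed subtract with the same clamping.
    static int ssub(int a, int b) {
        long r = (long) a - b;
        return (int) Math.max(Integer.MIN_VALUE, Math.min(Integer.MAX_VALUE, r));
    }

    // SUADD: operands treated as unsigned; on overflow the result sticks at 0xFFFFFFFF (-1).
    static int suadd(int a, int b) {
        int r = a + b;
        return Integer.compareUnsigned(r, a) < 0 ? -1 : r;
    }

    // SUSUB: unsigned subtract floored at 0.
    static int susub(int a, int b) {
        return Integer.compareUnsigned(a, b) < 0 ? 0 : a - b;
    }

    // UMAX / UMIN: max and min under unsigned comparison.
    static int umax(int a, int b) {
        return Integer.compareUnsigned(a, b) >= 0 ? a : b;
    }

    static int umin(int a, int b) {
        return Integer.compareUnsigned(a, b) <= 0 ? a : b;
    }
}
```

For example, `umin(-1, 1)` returns `1`, because `-1` is `0xFFFFFFFF`, the largest unsigned value.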
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1733588147 From john.r.rose at oracle.com Tue Aug 27 23:44:28 2024 From: john.r.rose at oracle.com (John Rose) Date: Tue, 27 Aug 2024 16:44:28 -0700 Subject: RFR: 8338023: Support two vector selectFrom API [v5] In-Reply-To: <2_P1qPMS46tgh4RUSuitcjXYnd0koS_BxfRRRmj79EY=.c3baeeaa-87f7-47d4-bc70-ae2afd9de745@github.com> References: <28KQHru1heR-YOVsRVo8Ffj_4D29IV8vD2tombvTHdI=.dba80ac3-9804-4074-ac0f-8acb9b042a08@github.com> <7e5pWnvjqk-dQYNeaZjFzXcd5WlzniZPl5T4l1rKQGE=.0882bcd4-e307-4a29-aa41-5496ee029a60@github.com> <2_P1qPMS46tgh4RUSuitcjXYnd0koS_BxfRRRmj79EY=.c3baeeaa-87f7-47d4-bc70-ae2afd9de745@github.com> Message-ID: On 23 Aug 2024, at 15:33, Paul Sandoz wrote: > The float/double conversion bothers me, not suggesting we do something about it here, noting down for any future conversation on shuffles. Yes, it's a pain which is noticeable in the vector/shuffle conversions. In the worst case it adds dynamic reformatting operations to get from the artificially "uniform" float/double index format into the real format the hardware requires. As a workaround, the user could convert the float/double payloads bitwise into int/long payloads, and then do the shuffling in the uniform int/long API, later reconverting back to float/double after the payloads are reordered. Those conversions don't actually use any dynamic operations. For prototyping, it seems fine to take the hit and ignore the fact that the index vectors are in an odd (though "uniform") format.
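Here is a scalar sketch of the bitwise workaround described above, with arrays standing in for vector lanes. The float payloads are reinterpreted as raw `int` bits. The `int` payloads are then reordered with plain `int` indexes, and the bits are finally reinterpreted back. No value conversion happens, so `-0.0f` and the canonical NaN keep their bit patterns. In Vector API terms this would presumably be `reinterpretAsInts()` and `reinterpretAsFloats()` wrapped around an `IntVector` rearrange. Treat that mapping as a reading of the email, not a tested recipe.

```java
// Scalar sketch of the bitwise workaround; arrays stand in for vector lanes.
final class BitwiseShuffleSketch {

    // Reinterpret float lanes as raw int bits (no numeric conversion).
    static int[] toBits(float[] lanes) {
        int[] bits = new int[lanes.length];
        for (int i = 0; i < lanes.length; i++) {
            bits[i] = Float.floatToRawIntBits(lanes[i]);
        }
        return bits;
    }

    // Reinterpret raw int bits back as float lanes.
    static float[] fromBits(int[] bits) {
        float[] lanes = new float[bits.length];
        for (int i = 0; i < bits.length; i++) {
            lanes[i] = Float.intBitsToFloat(bits[i]);
        }
        return lanes;
    }

    // Reorder int payloads using ordinary int lane indexes.
    static int[] permute(int[] lanes, int[] indexes) {
        int[] out = new int[indexes.length];
        for (int i = 0; i < indexes.length; i++) {
            out[i] = lanes[indexes[i]];
        }
        return out;
    }

    static float[] shuffleFloats(float[] lanes, int[] indexes) {
        return fromBits(permute(toBits(lanes), indexes));
    }
}
```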
From sviswanathan at openjdk.org Wed Aug 28 00:15:19 2024 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Wed, 28 Aug 2024 00:15:19 GMT Subject: RFR: 8338021: Support saturating vector operators in VectorAPI [v2] In-Reply-To: References: Message-ID: On Thu, 15 Aug 2024 06:59:53 GMT, Jatin Bhateja wrote: >>> its usage in the existing patch is limited to [type comparison.](https://github.com/openjdk/jdk/pull/20507/files#diff-3559dcf23b719805be5fd06fd5c1851dbd8f53e47afe6d99cba13a3de0ebc6b2R1542) >> >> Ah, that makes sense to me. I took a closer look and I think since the patch is creating a `VectorReinterpret` node after unsigned vector nodes, it might be able to avoid cases where the type might get filtered/joined, like with `PhiNode::Value`. That might lead to errors since `empty_type->filter(other_type) == TOP`. It's unfortunate that it's not really possible to disambiguate between an empty type and an unsigned range, which would allow us to solve this elegantly. > > Hey @jaskarth, the central idea behind introducing VectorReinterpretNode after unsigned vector IR is to facilitate the unboxing-boxing optimization; this explicit reinterpretation ensures type compatibility between the value being boxed and the box type, which is always a signed vector type. > > As mentioned previously, my plan is to handle value-range-related concerns in a follow-up patch along with intrinsification and auto-vectorization of the newly created scalar saturating IR; this patch does not generate scalar IR with the newly defined unsigned types. I wonder if it would have been simpler if we added unsigned vector operators like Op_SaturatingUnsignedAddVB etc. We are not adding unsigned data types to Java, only supporting unsigned (saturating) operations on existing signed integral types.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1733659843 From sroy at openjdk.org Wed Aug 28 06:26:56 2024 From: sroy at openjdk.org (Suchismith Roy) Date: Wed, 28 Aug 2024 06:26:56 GMT Subject: RFR: JDK-8332423 : [PPC64] Remove C1_MacroAssembler::call_c_with_frame_resize [v3] In-Reply-To: References: Message-ID: > JBS Issue : [JDK-8332423](https://bugs.openjdk.org/browse/JDK-8332423) > C1_MacroAssembler::call_c_with_frame_resize is only used with frame_resize == 0. > Also, call_c is adapted as per endianess of system. > We can adapt the exisiting code to handle the endianness check at one place and not have to repeatedly check at multiple places to make calls to call_c. Suchismith Roy has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: remove stubgenerator merged code ------------- Changes: - all: https://git.openjdk.org/jdk/pull/19947/files - new: https://git.openjdk.org/jdk/pull/19947/files/2b07a3bc..7c0816bf Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=19947&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=19947&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/19947.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/19947/head:pull/19947 PR: https://git.openjdk.org/jdk/pull/19947 From sroy at openjdk.org Wed Aug 28 06:49:01 2024 From: sroy at openjdk.org (Suchismith Roy) Date: Wed, 28 Aug 2024 06:49:01 GMT Subject: RFR: JDK-8332423 : [PPC64] Remove C1_MacroAssembler::call_c_with_frame_resize [v4] In-Reply-To: References: Message-ID: > JBS Issue : [JDK-8332423](https://bugs.openjdk.org/browse/JDK-8332423) > C1_MacroAssembler::call_c_with_frame_resize is only used with frame_resize == 0. > Also, call_c is adapted as per endianess of system. 
> We can adapt the exisiting code to handle the endianness check at one place and not have to repeatedly check at multiple places to make calls to call_c. Suchismith Roy has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: remove stubgenerator change ------------- Changes: - all: https://git.openjdk.org/jdk/pull/19947/files - new: https://git.openjdk.org/jdk/pull/19947/files/7c0816bf..2230d6c7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=19947&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=19947&range=02-03 Stats: 100 lines in 1 file changed: 0 ins; 100 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/19947.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/19947/head:pull/19947 PR: https://git.openjdk.org/jdk/pull/19947 From sroy at openjdk.org Wed Aug 28 06:54:57 2024 From: sroy at openjdk.org (Suchismith Roy) Date: Wed, 28 Aug 2024 06:54:57 GMT Subject: RFR: JDK-8332423 : [PPC64] Remove C1_MacroAssembler::call_c_with_frame_resize [v5] In-Reply-To: References: Message-ID: > JBS Issue : [JDK-8332423](https://bugs.openjdk.org/browse/JDK-8332423) > C1_MacroAssembler::call_c_with_frame_resize is only used with frame_resize == 0. > Also, call_c is adapted as per endianess of system. > We can adapt the exisiting code to handle the endianness check at one place and not have to repeatedly check at multiple places to make calls to call_c. 
Suchismith Roy has updated the pull request incrementally with two additional commits since the last revision: - review comments - review comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/19947/files - new: https://git.openjdk.org/jdk/pull/19947/files/2230d6c7..f7d7854c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=19947&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=19947&range=03-04 Stats: 4 lines in 3 files changed: 1 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/19947.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/19947/head:pull/19947 PR: https://git.openjdk.org/jdk/pull/19947 From mbaesken at openjdk.org Wed Aug 28 07:33:57 2024 From: mbaesken at openjdk.org (Matthias Baesken) Date: Wed, 28 Aug 2024 07:33:57 GMT Subject: RFR: 8333098: ubsan: bytecodeInfo.cpp:318:59: runtime error: division by zero [v3] In-Reply-To: References: Message-ID: > When running test > compiler/classUnloading/methodUnloading/TestOverloadCompileQueues.java > with ubsan enabled binaries we run into the issue reported below. > Reason seems to be that we divide by zero in the code in some special cases (we should instead check for `CompilationPolicy::min_invocations() == 0 `and handle it separately). 
> > /jdk/src/hotspot/share/opto/bytecodeInfo.cpp:318:59: runtime error: division by zero > #0 0x7f5145c0dda2 in InlineTree::should_not_inline(ciMethod*, ciMethod*, int, bool&, ciCallProfile&) src/hotspot/share/opto/bytecodeInfo.cpp:318 > #1 0x7f51466366d7 in InlineTree::try_to_inline(ciMethod*, ciMethod*, int, JVMState*, ciCallProfile&, bool&) src/hotspot/share/opto/bytecodeInfo.cpp:382 > #2 0x7f514663d36b in InlineTree::ok_to_inline(ciMethod*, JVMState*, ciCallProfile&, bool&) src/hotspot/share/opto/bytecodeInfo.cpp:596 > #3 0x7f51470dffd6 in Compile::call_generator(ciMethod*, int, bool, JVMState*, bool, float, ciKlass*, bool) src/hotspot/share/opto/doCall.cpp:189 > #4 0x7f51470e18ab in Parse::do_call() src/hotspot/share/opto/doCall.cpp:641 > #5 0x7f514887dbf1 in Parse::do_one_block() src/hotspot/share/opto/parse1.cpp:1607 > #6 0x7f514887fefa in Parse::do_all_blocks() src/hotspot/share/opto/parse1.cpp:724 > #7 0x7f514888d4da in Parse::Parse(JVMState*, ciMethod*, float) src/hotspot/share/opto/parse1.cpp:628 > #8 0x7f51469d8418 in ParseGenerator::generate(JVMState*) src/hotspot/share/opto/callGenerator.cpp:99 > #9 0x7f5146d99cff in Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*) src/hotspot/share/opto/compile.cpp:793 > #10 0x7f51469d5ebf in C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*) src/hotspot/share/opto/c2compiler.cpp:142 > #11 0x7f5146db0274 in CompileBroker::invoke_compiler_on_method(CompileTask*) src/hotspot/share/compiler/compileBroker.cpp:2303 > #12 0x7f5146db2826 in CompileBroker::compiler_thread_loop() src/hotspot/share/compiler/compileBroker.cpp:1961 > #13 0x7f51478d475a in JavaThread::thread_main_inner() src/hotspot/share/runtime/javaThread.cpp:759 > #14 0x7f51491620ea in Thread::call_run() src/hotspot/share/runtime/thread.cpp:225 > #15 0x7f51487ac201 in thread_native_entry src/hotspot/os/linux/os_linux.cpp:846 > #16 0x7f514e5cf6e9 in start_thread (/lib64/libpthread.so.0+0xa6e9) (BuildId: 
2f8d3c2d0f4d7888c2598d2ff6356537f5708a73) > #17 0x7f514db1550e in clone (/lib64/libc.so.6+0x11850e) (BuildId: f732026552f6adff988b338e92d466bc81a01c37) Matthias Baesken has updated the pull request incrementally with one additional commit since the last revision: add comment ------------- Changes: - all: https://git.openjdk.org/jdk/pull/20615/files - new: https://git.openjdk.org/jdk/pull/20615/files/daa9ebf3..bc483da4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20615&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20615&range=01-02 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/20615.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20615/head:pull/20615 PR: https://git.openjdk.org/jdk/pull/20615 From kbarrett at openjdk.org Wed Aug 28 08:27:25 2024 From: kbarrett at openjdk.org (Kim Barrett) Date: Wed, 28 Aug 2024 08:27:25 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v10] In-Reply-To: References: Message-ID: <5aSkkYnXN8xLzsZy4OSEVNrIG1rv6dOPESBb4I-nfYE=.7027cc32-4dcf-4674-a9af-c960d8a2d95e@github.com> On Tue, 27 Aug 2024 07:30:46 GMT, Roberto Castañeda Lozano wrote: >> This changeset implements JEP 475 (Late Barrier Expansion for G1), including support for the x64 and aarch64 platforms. See the [JEP description](https://openjdk.org/jeps/475) for further detail. >> >> We aim to integrate this work in JDK 24. The purpose of this pull request is twofold: >> >> - to allow maintainers of the arm (32-bit), ppc, riscv, s390, and x86 (32-bit) ports to contribute a port of these platforms in time for JDK 24; and >> - to allow reviewers to review the platform-independent, x64 and aarch64, and test changes in parallel with the porting work.
>> >> ## Summary of the Changes >> >> ### Platform-Independent Changes (`src/hotspot/share`) >> >> These consist mainly of: >> >> - a complete rewrite of `G1BarrierSetC2`, to instruct C2 to expand G1 barriers late instead of early; >> - a few minor changes to C2 itself, to support removal of redundant decompression operations and to address an OopMap construction issue triggered by this JEP's increased usage of ADL `TEMP` operands; and >> - temporary support for porting the JEP to the remaining platforms. >> >> The temporary support code (guarded by the pre-processor flag `G1_LATE_BARRIER_MIGRATION_SUPPORT`) will **not** be part of the final pull request, and hence does not need to be reviewed. >> >> ### Platform-Dependent Changes (`src/hotspot/cpu`) >> >> These include changes to the ADL instruction definitions and the `G1BarrierSetAssembler` class of the x64 and aarch64 platforms. >> >> #### ADL Changes >> >> The changeset uses ADL predicates to force C2 to implement memory accesses tagged with barrier information using G1-specific, barrier-aware instruction versions (e.g. `g1StoreP` instead of the GC-agnostic `storeP`). These new instruction versions generate machine code according to the corresponding tagged barrier information, relying on the G1 barrier implementations provided by the `G1BarrierSetAssembler` class. On the aarch64 platform, the bulk of the ADL code is generated from a higher-level version using m4, to reduce redundancy. >> >> #### `G1BarrierSetAssembler` Changes >> >> Both platforms basically reuse the barrier implementation for the bytecode interpreter, with the different barrier tests and operations refactored into dedicated functions. Besides this, `G1BarrierSetAssembler` is extended with assembly-stub routines that implement the out-of-line, slow path of the barriers. These routines include calls from the barrier into the JVM, which require support for saving and restoring live ...
> > Roberto Castañeda Lozano has updated the pull request incrementally with one additional commit since the last revision: > > Rename g1XChgX to g1GetAndSetX for consistency with Ideal operation names I've only looked at the changes in gc directories (shared and cpu-specific). src/hotspot/share/gc/g1/c2/g1BarrierSetC2.cpp line 160: > 158: * To reduce the number of updates to the remembered set, the post-barrier > 159: * filters out updates to fields in objects located in the Young Generation, the > 160: * same region as the reference, when the null is being written, or if the card s/the null/null/ src/hotspot/share/gc/g1/c2/g1BarrierSetC2.cpp line 166: > 164: * post-barrier completely, if it is possible during compile time to prove the > 165: * object is newly allocated and that no safepoint exists between the allocation > 166: * and the store. It might be worth saying explicitly that this is a compile-time version of the above mentioned young generation filter. src/hotspot/share/gc/g1/c2/g1BarrierSetC2.cpp line 229: > 227: } > 228: > 229: void refine_barrier_by_new_val_type(Node* n) { This function should probably be `static`. ------------- Changes requested by kbarrett (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/19746#pullrequestreview-2259069811 PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1734167614 PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1734196887 PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1734207820 From kbarrett at openjdk.org Wed Aug 28 08:27:27 2024 From: kbarrett at openjdk.org (Kim Barrett) Date: Wed, 28 Aug 2024 08:27:27 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v9] In-Reply-To: References: Message-ID: On Mon, 19 Aug 2024 08:53:30 GMT, Roberto Castañeda Lozano wrote: >> This changeset implements JEP 475 (Late Barrier Expansion for G1), including support for the x64 and aarch64 platforms.
See the [JEP description](https://openjdk.org/jeps/475) for further detail. >> >> We aim to integrate this work in JDK 24. The purpose of this pull request is twofold: >> >> - to allow maintainers of the arm (32-bit), ppc, riscv, s390, and x86 (32-bit) ports to contribute a port of these platforms in time for JDK 24; and >> - to allow reviewers to review the platform-independent, x64 and aarch64, and test changes in parallel with the porting work. >> >> ## Summary of the Changes >> >> ### Platform-Independent Changes (`src/hotspot/share`) >> >> These consist mainly of: >> >> - a complete rewrite of `G1BarrierSetC2`, to instruct C2 to expand G1 barriers late instead of early; >> - a few minor changes to C2 itself, to support removal of redundant decompression operations and to address an OopMap construction issue triggered by this JEP's increased usage of ADL `TEMP` operands; and >> - temporary support for porting the JEP to the remaining platforms. >> >> The temporary support code (guarded by the pre-processor flag `G1_LATE_BARRIER_MIGRATION_SUPPORT`) will **not** be part of the final pull request, and hence does not need to be reviewed. >> >> ### Platform-Dependent Changes (`src/hotspot/cpu`) >> >> These include changes to the ADL instruction definitions and the `G1BarrierSetAssembler` class of the x64 and aarch64 platforms. >> >> #### ADL Changes >> >> The changeset uses ADL predicates to force C2 to implement memory accesses tagged with barrier information using G1-specific, barrier-aware instruction versions (e.g. `g1StoreP` instead of the GC-agnostic `storeP`). These new instruction versions generate machine code according to the corresponding tagged barrier information, relying on the G1 barrier implementations provided by the `G1BarrierSetAssembler` class. On the aarch64 platform, the bulk of the ADL code is generated from a higher-level version using m4, to reduce redundancy.
>> >> #### `G1BarrierSetAssembler` Changes >> >> Both platforms basically reuse the barrier implementation for the bytecode interpreter, with the different barrier tests and operations refactored into dedicated functions. Besides this, `G1BarrierSetAssembler` is extended with assembly-stub routines that implement the out-of-line, slow path of the barriers. These routines include calls from the barrier into the JVM, which require support for saving and restoring live ... > > Roberto Castañeda Lozano has updated the pull request incrementally with one additional commit since the last revision: > > Pass oldval to the pre-barrier in g1CompareAndExchange/SwapP src/hotspot/cpu/aarch64/gc/g1/g1BarrierSetAssembler_aarch64.cpp line 218: > 216: __ cbz(new_val, done); > 217: } > 218: // Storing region crossing non-null, is card already dirty? s/already dirty/young/ src/hotspot/cpu/aarch64/gc/g1/g1BarrierSetAssembler_aarch64.cpp line 280: > 278: > 279: #undef __ > 280: #define __ masm-> These "changes" to `__` are unnecessary and confusing. We have the same define near the top of the file, unconditionally. This one is conditional on COMPILER2, but is left in place at the end of the conditional block, affecting following unconditional code. src/hotspot/share/opto/memnode.cpp line 3468: > 3466: // Capture an unaliased, unconditional, simple store into an initializer. > 3467: // Or, if it is independent of the allocation, hoist it above the allocation. > 3468: if (ReduceFieldZeroing && ReduceInitialCardMarks && /*can_reshape &&*/ It's not obvious to me how this is related to the late barrier changes.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1730194278 PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1730238757 PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1730246320 From kbarrett at openjdk.org Wed Aug 28 08:27:28 2024 From: kbarrett at openjdk.org (Kim Barrett) Date: Wed, 28 Aug 2024 08:27:28 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v10] In-Reply-To: <5aSkkYnXN8xLzsZy4OSEVNrIG1rv6dOPESBb4I-nfYE=.7027cc32-4dcf-4674-a9af-c960d8a2d95e@github.com> References: <5aSkkYnXN8xLzsZy4OSEVNrIG1rv6dOPESBb4I-nfYE=.7027cc32-4dcf-4674-a9af-c960d8a2d95e@github.com> Message-ID: On Wed, 28 Aug 2024 08:09:44 GMT, Kim Barrett wrote: >> Roberto Castañeda Lozano has updated the pull request incrementally with one additional commit since the last revision: >> >> Rename g1XChgX to g1GetAndSetX for consistency with Ideal operation names > > src/hotspot/share/gc/g1/c2/g1BarrierSetC2.cpp line 166: > >> 164: * post-barrier completely, if it is possible during compile time to prove the >> 165: * object is newly allocated and that no safepoint exists between the allocation >> 166: * and the store. > > It might be worth saying explicitly that this is a compile-time version of the above mentioned young > generation filter. We can similarly elide the post-barrier if we can prove at compile-time that the value being written is null. That case isn't handled here though. Instead that's checked for in `refine_barrier_by_new_val_type` and in `get_store_barrier`. I'm not sure why it's structured that way.
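The runtime filters discussed in these comments can be modeled as a chain of early exits. The sketch follows the order in the quoted `g1BarrierSetC2.cpp` comment and the aarch64 code: a same-region store, a null store, a young card, and then an already-dirty card. The region size and card encodings are hypothetical placeholders, not G1's real values. The real barrier also issues a StoreLoad fence before re-checking the card, which a scalar model cannot express. A compile-time proof that the new value is null (or that the object is freshly allocated) amounts to knowing statically that one of these exits is always taken, which is why the whole post-barrier can then be elided.

```java
// Hypothetical model of G1 post-barrier filtering; constants are placeholders, not G1's values.
final class G1PostBarrierModel {
    static final int LOG_REGION_BYTES = 20; // assume 1 MiB regions
    static final byte DIRTY_CARD = 0;       // hypothetical card encodings
    static final byte YOUNG_CARD = 2;
    static final byte CLEAN_CARD = -1;

    /** Returns true if the store would reach the slow path that records the card. */
    static boolean needsSlowPath(long fieldAddr, long newValAddr, byte cardValue) {
        // 1. Same-region store: no cross-region pointer to remember.
        if (((fieldAddr ^ newValAddr) >>> LOG_REGION_BYTES) == 0) return false;
        // 2. Null store: nothing to remember.
        if (newValAddr == 0) return false;
        // 3. Young card: the holder is in the young generation, which is scanned anyway.
        if (cardValue == YOUNG_CARD) return false;
        // 4. Already-dirty card: the card is already pending refinement.
        if (cardValue == DIRTY_CARD) return false;
        return true;
    }
}
```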
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1734201007 From roland at openjdk.org Wed Aug 28 08:37:18 2024 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 28 Aug 2024 08:37:18 GMT Subject: RFR: 8334724: C2: remove PhaseIdealLoop::cast_incr_before_loop() [v2] In-Reply-To: <8QYwwkPgR9grcCttNV0KHliNjH9kmBGUt7kc2b5wPW0=.b10e5195-eb8e-4148-960d-670554b57ace@github.com> References: <1w8L64U3JeY8qKskSNm4OlTablOucHHThfW-9nakz4E=.e52b0193-9048-4382-8242-f5296483898a@github.com> <6kgDCB1rxZyn1JEX-8hvKyOQE07oMs8_kr7Cjbix3Gg=.61a77a4f-394c-48d6-a482-fe73f3314f5b@github.com> <8QYwwkPgR9grcCttNV0KHliNjH9kmBGUt7kc2b5wPW0=.b10e5195-eb8e-4148-960d-670554b57ace@github.com> Message-ID: On Thu, 22 Aug 2024 16:23:42 GMT, Vladimir Kozlov wrote: > I am not comfortable with big regression on MacOSX aarch64 even if you can't reproduce it locally. We need to rerun that testing to make sure it is random as you said. Christian ran performance testing again (thanks Christian!) and there was no regression this time. ------------- PR Comment: https://git.openjdk.org/jdk/pull/19831#issuecomment-2314683726 From adinn at openjdk.org Wed Aug 28 09:23:19 2024 From: adinn at openjdk.org (Andrew Dinn) Date: Wed, 28 Aug 2024 09:23:19 GMT Subject: RFR: 8339063: [aarch64] Skip verify_sve_vector_length after native calls if SVE supports 128 bits VL only [v2] In-Reply-To: References: Message-ID: On Wed, 28 Aug 2024 08:51:11 GMT, Joshua Zhu wrote: >> Please review this minor enhancement that skips verify_sve_vector_length after native calls. >> It works on SVE micro-architecture that only supports 128-bit vector length. > > Joshua Zhu has updated the pull request incrementally with one additional commit since the last revision: > > Fix compilation failure with --disable-precompiled-headers The code changes look ok. What have you done to test it? 
------------- PR Review: https://git.openjdk.org/jdk/pull/20724#pullrequestreview-2265676276 From luhenry at openjdk.org Wed Aug 28 09:35:28 2024 From: luhenry at openjdk.org (Ludovic Henry) Date: Wed, 28 Aug 2024 09:35:28 GMT Subject: RFR: 8321010: RISC-V: C2 RoundVF [v15] In-Reply-To: <8hGiyN1XJKBa5eFp9xy15NfL5iFkhFHaG55bR6gX-_I=.001d319e-fec1-4e44-ac38-ae0b13aaa104@github.com> References: <8hGiyN1XJKBa5eFp9xy15NfL5iFkhFHaG55bR6gX-_I=.001d319e-fec1-4e44-ac38-ae0b13aaa104@github.com> Message-ID: On Thu, 25 Jul 2024 14:29:55 GMT, Hamlin Li wrote: >> Hi, >> Could you review this patch, which adds RoundVF/RoundVD intrinsics? >> >> Current tests show that it brings a performance gain when vlenb >= 32 (as on bananapi), but a regression when vlenb == 16 (as on k230). So I only enable the intrinsic when vlenb >= 32. >> >> For double, there is still some regression even with vlenb == 32, so in this PR I only enable it when vlenb >= 64. Although there is no hardware to verify it, judging from the trend of the performance data on bananapi and k230, I expect double to see a gain rather than a regression when vlenb >= 64. Please compare the data of `Benchmark on bananapi, +UseSuperWord` and `Benchmark on k230, +UseSuperWord, enable RoundVF/D when vlenb == 16`. >> >> Thanks!
>> >> ## Tests >> >> test/hotspot/jtreg/compiler/vectorization/TestRoundVectRiscv64.java test/hotspot/jtreg/compiler/c2/cr6340864/TestFloatVect.java test/hotspot/jtreg/compiler/c2/cr6340864/TestDoubleVect.java test/hotspot/jtreg/compiler/floatingpoint/TestRound.java >> >> test/jdk/java/lang/Math/RoundTests.java >> >> ## Performance - with Intrinsic >> >> ### on bananapi >> Benchmark on bananapi, +UseSuperWord >> >> Benchmark on bananapi, +UseSuperWord | (TESTSIZE) | Mode | Cnt | Score +intrinsic | Score -intrinsic | Error | Units | Improvement >> -- | -- | -- | -- | -- | -- | -- | -- | -- >> FpRoundingBenchmark.test_round_double | 2048 | avgt | 10 | 23794.153 | 20557.467 | 2899.266 | ns/op | 0.864 >> FpRoundingBenchmark.test_round_float | 2048 | avgt | 10 | 11531.853 | 16562.431 | 865.779 | ns/op | 1.436 >> >> >> >> ### on k230 (enable intrinsic even when vlenb == 16) >> Benchmark on k230, +UseSuperWord, enable RoundVF/D when vlenb == 16 >> >> Benchmark on k230, +UseSuperWord, enable RoundVF/D ... > > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: > > minor Marked as reviewed by luhenry (Committer). src/hotspot/cpu/riscv/riscv.ad line 1916: > 1914: return UseRVV && MaxVectorSize >= 32; > 1915: case Op_RoundVD: > 1916: return UseRVV && MaxVectorSize >= 64; It would be worth leaving the same comment you have in the PR description here as well, to make it very clear in the sources why the option is enabled/disabled based on these parameters.
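The sketch below mirrors the quoted `riscv.ad` gate and shows the scalar loop shape that RoundVF targets, as a hedged illustration. The PR text uses MaxVectorSize (in bytes) and vlenb interchangeably, so the thresholds are 32 bytes for float and 64 bytes for double. The class and method names are hypothetical. The test values pin down the `Math.round` edge cases that any vectorized version must preserve: ties round toward positive infinity, NaN maps to 0, and out-of-range values saturate.

```java
// Hypothetical mirror of the riscv.ad gate quoted above, plus the scalar loop
// shape that RoundVF is meant to vectorize. Not HotSpot code.
final class RoundVectorSketch {

    // Thresholds in bytes, as in the quoted match-rule predicate.
    static boolean roundVectorSupported(boolean useRVV, int maxVectorSize, boolean isDouble) {
        return useRVV && maxVectorSize >= (isDouble ? 64 : 32);
    }

    // A loop of Math.round calls over floats; with SuperWord enabled this is the
    // kind of loop that may be turned into RoundVF nodes when the gate holds.
    static void roundAll(float[] in, int[] out) {
        for (int i = 0; i < in.length; i++) {
            out[i] = Math.round(in[i]);
        }
    }
}
```

The Improvement column in the quoted table appears to be Score -intrinsic divided by Score +intrinsic, since lower ns/op is better. For example, 16562.431 / 11531.853 ≈ 1.436 for float, and 20557.467 / 23794.153 ≈ 0.864 for double.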
------------- PR Review: https://git.openjdk.org/jdk/pull/17745#pullrequestreview-2265701366 PR Review Comment: https://git.openjdk.org/jdk/pull/17745#discussion_r1734332605 From thartmann at openjdk.org Wed Aug 28 09:39:28 2024 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 28 Aug 2024 09:39:28 GMT Subject: RFR: 8327381: Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v14] In-Reply-To: References: Message-ID: On Mon, 26 Aug 2024 17:46:51 GMT, Kangcheng Xu wrote: >> This PR resolves [JDK-8327381](https://bugs.openjdk.org/browse/JDK-8327381) >> >> Currently the transformations for expressions with patterns `((x & m) u<= m)` or `((m & x) u<= m)` to `true` is in `BoolNode::Ideal` function with a new constant node of value `1` created. However, this is technically a type-improving (reduction in range) transformation that's better suited in `BoolNode::Value` function. >> >> New unit test `test/hotspot/jtreg/compiler/c2/TestBoolNodeGvn.java` asserting on IR nodes and correctness of this transformation is added and passing. > > Kangcheng Xu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 25 additional commits since the last revision: > > - Merge branch 'master' into boolnode-refactor > - add a word break to IRNode.CMP_U > - Merge branch 'master' into boolnode-refactor > - spread boolean AND and OR into subcases, update number of expected CMP_U nodes > - Merge branch 'master' into boolnode-refactor > - Merge branch 'master' into boolnode-refactor > - update test values, @run directive, and remove an empty line > - Merge branch 'master' into boolnode-refactor > - move test location, add negative test case, simplify imports > - Merge branch 'master' into boolnode-refactor > - ... and 15 more: https://git.openjdk.org/jdk/compare/112017ab...719199c2 That looks good to me. 
@eme64 is currently out but you can integrate this now. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/18198#pullrequestreview-2265712164 From mli at openjdk.org Wed Aug 28 10:29:59 2024 From: mli at openjdk.org (Hamlin Li) Date: Wed, 28 Aug 2024 10:29:59 GMT Subject: RFR: 8321010: RISC-V: C2 RoundVF [v16] In-Reply-To: References: Message-ID: > Hi, > Could you review this patch, which adds RoundVF/RoundVD intrinsics? > > Current tests show that it brings a performance gain when vlenb >= 32 (as on bananapi), but a regression when vlenb == 16 (as on k230). So I only enable the intrinsic when vlenb >= 32. > > For double, there is still some regression even with vlenb == 32, so in this PR I only enable it when vlenb >= 64. Although there is no hardware to verify it, judging from the trend of the performance data on bananapi and k230, I expect double to see a gain rather than a regression when vlenb >= 64. Please compare the data of `Benchmark on bananapi, +UseSuperWord` and `Benchmark on k230, +UseSuperWord, enable RoundVF/D when vlenb == 16`. > > Thanks!
> > ## Tests > > test/hotspot/jtreg/compiler/vectorization/TestRoundVectRiscv64.java test/hotspot/jtreg/compiler/c2/cr6340864/TestFloatVect.java test/hotspot/jtreg/compiler/c2/cr6340864/TestDoubleVect.java test/hotspot/jtreg/compiler/floatingpoint/TestRound.java > > test/jdk/java/lang/Math/RoundTests.java > > ## Performance - with Intrinsic > > ### on bananapi > Benchmark on bananapi, +UseSuperWord > > Benchmark on bananapi, +UseSuperWord | (TESTSIZE) | Mode | Cnt | Score +intrinsic | Score -intrinsic | Error | Units | Improvement > -- | -- | -- | -- | -- | -- | -- | -- | -- > FpRoundingBenchmark.test_round_double | 2048 | avgt | 10 | 23794.153 | 20557.467 | 2899.266 | ns/op | 0.864 > FpRoundingBenchmark.test_round_float | 2048 | avgt | 10 | 11531.853 | 16562.431 | 865.779 | ns/op | 1.436 > > > > ### on k230 (enable intrinsic even when vlenb == 16) > Benchmark on k230, +UseSuperWord, enable RoundVF/D when vlenb == 16 > > Benchmark on k230, +UseSuperWord, enable RoundVF/D when vlenb == 16 | (TESTSIZE) | Mode | Cnt | Score +intrinsic ... Hamlin Li has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 25 commits: - comments - Merge branch 'master' into round-F+D-v - minor - minor - minor - add additional tests - enable roundVD when MaxVectorSize >= 64 - enable intrinsic when MaxVectorSize >= 32 - Merge branch 'master' into round-F+D-v - enable when vlenb >= 32 - ... 
and 15 more: https://git.openjdk.org/jdk/compare/be34730f...c35fcddc ------------- Changes: https://git.openjdk.org/jdk/pull/17745/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=17745&range=15 Stats: 921 lines in 11 files changed: 921 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/17745.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17745/head:pull/17745 PR: https://git.openjdk.org/jdk/pull/17745 From mli at openjdk.org Wed Aug 28 10:29:59 2024 From: mli at openjdk.org (Hamlin Li) Date: Wed, 28 Aug 2024 10:29:59 GMT Subject: RFR: 8321010: RISC-V: C2 RoundVF [v15] In-Reply-To: References: <8hGiyN1XJKBa5eFp9xy15NfL5iFkhFHaG55bR6gX-_I=.001d319e-fec1-4e44-ac38-ae0b13aaa104@github.com> Message-ID: <5x97Bsr8ACFZIPBHwBLRJJdvcYuzB3H5WLc1ehyIPi0=.169febcd-d56e-434a-9827-ec4ba897324e@github.com> On Wed, 28 Aug 2024 09:31:52 GMT, Ludovic Henry wrote: >> Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: >> >> minor > > src/hotspot/cpu/riscv/riscv.ad line 1916: > >> 1914: return UseRVV && MaxVectorSize >= 32; >> 1915: case Op_RoundVD: >> 1916: return UseRVV && MaxVectorSize >= 64; > > It would be worth leaving the same comment you've in the PR description here as well, to make it very clear in the sources why the option is enable/disabled on these parameters. Added comments. Thanks! 
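RoundVF and RoundVD are the C2 vector nodes for `Math.round` applied element-wise in a loop. The sketch below shows, under assumptions, the scalar loop shape that `FpRoundingBenchmark.test_round_float`/`test_round_double` presumably exercise. The benchmark source is not quoted here, so the class and method names are hypothetical. Whether C2 actually emits RoundVF/RoundVD for such a loop depends on `+UseSuperWord`, `UseRVV`, and the `MaxVectorSize` thresholds in the patch (>= 32 for float, >= 64 for double). The scalar results are the same either way.

```java
import java.util.Arrays;

// Hypothetical scalar loops of the kind SuperWord may turn into RoundVF / RoundVD.
public class RoundLoops {
    // Math.round(float) returns int: nearest value, ties toward positive infinity,
    // NaN -> 0, out-of-range inputs clamp to Integer.MIN_VALUE / MAX_VALUE.
    static void roundFloats(float[] in, int[] out) {
        for (int i = 0; i < in.length; i++) {
            out[i] = Math.round(in[i]);
        }
    }

    // Math.round(double) returns long, with the same tie and NaN rules.
    static void roundDoubles(double[] in, long[] out) {
        for (int i = 0; i < in.length; i++) {
            out[i] = Math.round(in[i]);
        }
    }

    public static void main(String[] args) {
        float[] f = {2.5f, -2.5f, 1.4f, Float.NaN, 1e20f};
        int[] fi = new int[f.length];
        roundFloats(f, fi);
        System.out.println(Arrays.toString(fi)); // [3, -2, 1, 0, 2147483647]

        double[] d = {2.5, -2.5, -0.5, Double.NaN};
        long[] dl = new long[d.length];
        roundDoubles(d, dl);
        System.out.println(Arrays.toString(dl)); // [3, -2, 0, 0]
    }
}
```

Any vectorized version must reproduce exactly these scalar results, including the tie and NaN cases. That requirement is what tests such as `TestRound.java` and `RoundTests.java` guard.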
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17745#discussion_r1734407261 From duke at openjdk.org Wed Aug 28 14:14:27 2024 From: duke at openjdk.org (duke) Date: Wed, 28 Aug 2024 14:14:27 GMT Subject: RFR: 8327381: Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value [v14] In-Reply-To: References: Message-ID: On Mon, 26 Aug 2024 17:46:51 GMT, Kangcheng Xu wrote: >> This PR resolves [JDK-8327381](https://bugs.openjdk.org/browse/JDK-8327381) >> >> Currently the transformations for expressions with patterns `((x & m) u<= m)` or `((m & x) u<= m)` to `true` is in `BoolNode::Ideal` function with a new constant node of value `1` created. However, this is technically a type-improving (reduction in range) transformation that's better suited in `BoolNode::Value` function. >> >> New unit test `test/hotspot/jtreg/compiler/c2/TestBoolNodeGvn.java` asserting on IR nodes and correctness of this transformation is added and passing. > > Kangcheng Xu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 25 additional commits since the last revision: > > - Merge branch 'master' into boolnode-refactor > - add a word break to IRNode.CMP_U > - Merge branch 'master' into boolnode-refactor > - spread boolean AND and OR into subcases, update number of expected CMP_U nodes > - Merge branch 'master' into boolnode-refactor > - Merge branch 'master' into boolnode-refactor > - update test values, @run directive, and remove an empty line > - Merge branch 'master' into boolnode-refactor > - move test location, add negative test case, simplify imports > - Merge branch 'master' into boolnode-refactor > - ... and 15 more: https://git.openjdk.org/jdk/compare/2ab36fdd...719199c2 @tabjy Your change (at version 719199c29d06081f825202e13ea96d45141e18e4) is now ready to be sponsored by a Committer. 
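The fold quoted above is a pure range fact. `x & m` keeps only bits that are also set in `m`, so as an unsigned 32-bit value it can never exceed `m`, and `(x & m) u<= m` is true for every `x`. This is why the transformation fits `BoolNode::Value`. The small Java check below is a minimal sketch that assumes `Integer.compareUnsigned(a, b) <= 0` as a stand-in for C2's unsigned `u<=` comparison. The class and method names are hypothetical. Random sampling only illustrates the identity; the bit-subset argument is what proves it.

```java
import java.util.Random;

// Hypothetical check of the identity behind folding ((x & m) u<= m) to true.
public class AndMaskULe {
    // Pattern ((x & m) u<= m).
    static boolean andMaskULe(int x, int m) {
        return Integer.compareUnsigned(x & m, m) <= 0;
    }

    // Pattern ((m & x) u<= m), with the AND operands swapped.
    static boolean maskAndULe(int x, int m) {
        return Integer.compareUnsigned(m & x, m) <= 0;
    }

    public static void main(String[] args) {
        int[] edges = {0, 1, -1, 0x7fffffff, 0x80000000, 0x0f0f0f0f};
        for (int x : edges) {
            for (int m : edges) {
                if (!andMaskULe(x, m) || !maskAndULe(x, m)) {
                    throw new AssertionError("x=" + x + " m=" + m);
                }
            }
        }
        Random rnd = new Random(42);
        for (int i = 0; i < 1_000_000; i++) {
            int x = rnd.nextInt();
            int m = rnd.nextInt();
            if (!andMaskULe(x, m) || !maskAndULe(x, m)) {
                throw new AssertionError("x=" + x + " m=" + m);
            }
        }
        System.out.println("((x & m) u<= m) held for every sampled pair");
    }
}
```

The identity does not hold for a signed `<=`. For example, `x = 0`, `m = -1` gives `0 <= -1`, which is false. This is why the pattern is restricted to the unsigned comparison.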
------------- PR Comment: https://git.openjdk.org/jdk/pull/18198#issuecomment-2315439915 From jzhu at openjdk.org Wed Aug 28 14:17:19 2024 From: jzhu at openjdk.org (Joshua Zhu) Date: Wed, 28 Aug 2024 14:17:19 GMT Subject: RFR: 8339063: [aarch64] Skip verify_sve_vector_length after native calls if SVE supports 128 bits VL only [v2] In-Reply-To: References: Message-ID: On Wed, 28 Aug 2024 09:21:00 GMT, Andrew Dinn wrote: > The code changes look ok. What have you done to test it? @adinn Thanks for your review! > The maximum SVE vector length "VLmax" is determined by the hardware: 16 <= VLmax <= 256. The value of VL can be configured at runtime: 16 <= VL <= VLmax, where VL must be a multiple of 16. > > Once we find cpu's VLMax is 16 bytes only, the verification "verify_sve_vector_length()" after native calls is not required - in other words, VL cannot be configured to a value other than 16. I checked the behavior of prctl(PR_SVE_SET_VL, value) by a separated C case. https://github.com/JoshuaZhuwj/openjdk_cases/blob/master/8339063/setSVEVL.c The output is aligned with the above expectation. https://github.com/JoshuaZhuwj/openjdk_cases/blob/master/8339063/output I have an aarch64 hardware at hand with only 128-bit SVE vector length. With this change applied, the generated native wrapper and native entry no longer check SVE VL change after native calls in the machine. I also ensure no regression failures by jtreg case: test/hotspot/jtreg/compiler/c2/aarch64/TestSVEWithJNI.java Also no regression failures when JVM starts up by specifying different MaxVectorSize. 
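The argument for skipping `verify_sve_vector_length()` can be made concrete by enumerating the vector lengths a thread could switch to. The sketch below is a hypothetical model that takes the constraint exactly as stated in the comment (16 <= VL <= VLmax <= 256 bytes, VL a multiple of 16). It is not HotSpot code, and the names are illustrative. A real implementation may offer only a subset of these lengths (for example, powers of two). That only shrinks the set, so the conclusion for VLmax == 16 is unaffected.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the SVE vector-length constraint quoted in the review.
public class SveVlModel {
    // All runtime VLs (in bytes) allowed by 16 <= VL <= VLmax, VL a multiple of 16.
    static List<Integer> allowedLengths(int vlMaxBytes) {
        if (vlMaxBytes < 16 || vlMaxBytes > 256 || vlMaxBytes % 16 != 0) {
            throw new IllegalArgumentException("VLmax out of range: " + vlMaxBytes);
        }
        List<Integer> lengths = new ArrayList<>();
        for (int vl = 16; vl <= vlMaxBytes; vl += 16) {
            lengths.add(vl);
        }
        return lengths;
    }

    // A native call can only leave a different VL behind if more than one VL exists.
    static boolean needsVlCheckAfterNativeCall(int vlMaxBytes) {
        return allowedLengths(vlMaxBytes).size() > 1;
    }

    public static void main(String[] args) {
        for (int vlMax : new int[] {16, 32, 64, 256}) {
            System.out.println(vlMax + " -> " + allowedLengths(vlMax)
                    + ", check needed: " + needsVlCheckAfterNativeCall(vlMax));
        }
    }
}
```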
------------- PR Comment: https://git.openjdk.org/jdk/pull/20724#issuecomment-2315459504 From rcastanedalo at openjdk.org Wed Aug 28 15:49:22 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Wed, 28 Aug 2024 15:49:22 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v2] In-Reply-To: References: <4c-MLXwKcNcSnloSkYkuk3gnv3ux5i5beS51Fd9Z8MQ=.cd0a7eba-ff26-4855-a01c-d1ae5182100b@github.com> <8fuUEkswt05x0IuT4PrNQuYgLd49g4EpZWOPPQog4PQ=.70b5edb6-98d0-4276-8578-f7a496b7f2a7@github.com> <3H3rBSKDnpg5fmYqcZ5hT9yH2EAxCocycRompQJJCOo=.1b30fd89-09e9-4708-bd20-cdea00e809a7@github.com> <7Bjcf6MF4aTBuk4DmTnGzP0WwCWqJx_sv5k2sGMt9No=.ca426f04-4fa5-4ab1-a414-8c5e6a4e0dce@github.com> <6PF-kgezzOb9Ed7j-BbrwaURnLJH5aFgOouFwTYiFrE=.670092b8-6980-43f3-a091-25312cfa0f1b@github.com> <1vQH6zpEgjhIO_mq9DCnpwxgDXmnqdx0owlvjJq4Fcw=.78e60c6e-2b23-4e94-a998-e7ba9eafcb6a@github.com> Message-ID: <6rvTU-KY2DpLx3sK7zmkaZCBaNELr3FLGItGwUJzNUM=.0e75c7a7-4309-49c5-b48b-5aa642bcbe43@github.com> On Tue, 27 Aug 2024 17:38:28 GMT, Martin Doerr wrote: >>> Regardless if you implement the compressed oops optimization or not, I'd reconsider moving the oop decoding into G1BarrierSetAssembler::g1_write_barrier_post_c2 because it makes the .ad file shorter because you can get rid of the replicated decode_heap_oop. >> >> I tried this refactoring [here](https://github.com/openjdk/jdk/commit/d4e83fd7d77c5415700b33556752e8c8da811dea), thanks again Martin for the suggestion. In my opinion, the result is similar in terms of readability/maintainability because the benefit of removing explicit `decode_heap_oop` operations in the ADL file is somewhat negated by the increased complexity of the `write_barrier_post` and `g1_write_barrier_post_c2` signatures. 
For aarch64, moving non-destructive `decode_heap_oop` operations would probably require passing both source and destination registers to these functions explicitly, which would make them more complex.
I feel that this refactoring is only amortized when the new-value decoding operations are more complex, as in your PPC implementation. > > Thanks for trying! I like your refactored version. I prefer moving stuff out of the .ad files. The complexity of the barrier code is not significantly higher. Note that you could use `decode_heap_oop_not_null(Register dst, Register src)` with `tmp2` as dst (`generate_post_barrier_fast_path` can deal with new_val == tmp2). That would save a move instruction in some cases. > I haven't looked into the aarch64 code. I leave you free to decide. Thanks, I prototyped the refactored version for both x64 and aarch64 [here](https://github.com/robcasloz/jdk/commit/c1ae871eadac0d44981b7892ac8f7b64e8734283). I do not have a strong opinion for or against this refactoring. @albertnetymk @kimbarrett what do you think about it? (asking since you have recently looked at this code). ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1734924686 From adinn at openjdk.org Wed Aug 28 15:55:18 2024 From: adinn at openjdk.org (Andrew Dinn) Date: Wed, 28 Aug 2024 15:55:18 GMT Subject: RFR: 8339063: [aarch64] Skip verify_sve_vector_length after native calls if SVE supports 128 bits VL only [v2] In-Reply-To: References: Message-ID: On Wed, 28 Aug 2024 08:51:11 GMT, Joshua Zhu wrote: >> Please review this minor enhancement that skips verify_sve_vector_length after native calls. >> It works on SVE micro-architecture that only supports 128-bit vector length. > > Joshua Zhu has updated the pull request incrementally with one additional commit since the last revision: > > Fix compilation failure with --disable-precompiled-headers Marked as reviewed by adinn (Reviewer). Ok, that sounds like it is sufficient. 
------------- PR Review: https://git.openjdk.org/jdk/pull/20724#pullrequestreview-2266680405 PR Comment: https://git.openjdk.org/jdk/pull/20724#issuecomment-2315725636 From sviswanathan at openjdk.org Wed Aug 28 16:08:25 2024 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Wed, 28 Aug 2024 16:08:25 GMT Subject: RFR: 8338021: Support saturating vector operators in VectorAPI [v2] In-Reply-To: References: Message-ID: On Wed, 28 Aug 2024 00:12:26 GMT, Sandhya Viswanathan wrote: >> Hey @jaskarth , Central idea behind introducing VectorReinterpretNode after unsigned vector IR is to facilitate unboxing-boxing optimization, this explicit reinterpretation ensures type compatibility between value being boxed and box type which is always signed vector types. >> >> As mentioned previously my plan is to address is handle value range related concerns in a follow up patch along with intrisification and auto-vectorization of newly created scalar saturating IR, this patch is not generating scalar IR with newly defined unsigned types. > > Wonder if it would have been simpler if we added unsigned vector operators like Op_SaturatingUnsignedAddVB etc. We are not adding unsigned data types to Java, only supporting unsigned (saturating) operations on existing signed integral types. If the aim is to reduce the number of nodes, we could merge the Op_SaturatingAddVB, Op_SaturatingAddVS, Op_SaturatingAddVI, and Op_SaturatingAddVL into one Op_SaturatingAddV. Likewise for unsigned saturating add into Op_SaturatingUnsignedAddV. 
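The point that Java has no unsigned types, and that unsigned saturating operations simply reinterpret existing signed lanes, can be seen in a scalar reference for one byte lane. The sketch below is a hypothetical illustration of the semantics, not the Vector API's operator names or the C2 implementation. The same `byte` bit pattern is fed to both signed and unsigned variants. Only the interpretation and the clamping bounds differ.

```java
// Hypothetical scalar reference for saturating byte-lane arithmetic.
public class SaturatingByteOps {
    // Signed saturating add: compute the exact sum in int, clamp to [-128, 127].
    static byte addSat(byte a, byte b) {
        int sum = a + b;
        return (byte) Math.max(Byte.MIN_VALUE, Math.min(Byte.MAX_VALUE, sum));
    }

    // Unsigned saturating add on the same signed type: reinterpret each byte
    // as 0..255, add, and clamp to 255 (bit pattern 0xFF, i.e. (byte) -1).
    static byte addSatUnsigned(byte a, byte b) {
        int sum = Byte.toUnsignedInt(a) + Byte.toUnsignedInt(b);
        return (byte) Math.min(0xFF, sum);
    }

    // Unsigned saturating subtract: clamp at 0 instead of wrapping around.
    static byte subSatUnsigned(byte a, byte b) {
        int diff = Byte.toUnsignedInt(a) - Byte.toUnsignedInt(b);
        return (byte) Math.max(0, diff);
    }

    public static void main(String[] args) {
        System.out.println(addSat((byte) 100, (byte) 100));         // 127
        System.out.println(addSat((byte) -100, (byte) -100));       // -128
        System.out.println(addSatUnsigned((byte) 200, (byte) 100)); // -1 (0xFF, 255 unsigned)
        System.out.println(subSatUnsigned((byte) 10, (byte) 20));   // 0
    }
}
```

Because the lane type never changes, a distinct "unsigned" opcode is enough to select the clamping rule. No unsigned vector type is needed to carry the values.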
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1734951862 From vlivanov at openjdk.org Wed Aug 28 16:48:20 2024 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 28 Aug 2024 16:48:20 GMT Subject: RFR: 8335120: assert(!target->can_be_statically_bound() || target == cha_monomorphic_target) failed In-Reply-To: <53hCMrAv73-MOnpcsLz6wrv688AfIJ1mLeLog1WX07o=.4296304e-70d5-4a60-afb1-d00644732519@github.com> References: <53hCMrAv73-MOnpcsLz6wrv688AfIJ1mLeLog1WX07o=.4296304e-70d5-4a60-afb1-d00644732519@github.com> Message-ID: On Tue, 27 Aug 2024 16:11:22 GMT, Dean Long wrote: > The failing test uses class redefinition, which happen at safepoints. Because JIT compiler threads do not block safepoints, compiler threads can sometimes see multiple versions of the same method. The above assert can fail if target and cha_monomorphic_target are different versions of the same method. The fix is to introduce ciMethod::equals to deal with this possibility. Looks good. ------------- Marked as reviewed by vlivanov (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/20730#pullrequestreview-2266794232 From dcubed at openjdk.org Wed Aug 28 16:52:21 2024 From: dcubed at openjdk.org (Daniel D. Daugherty) Date: Wed, 28 Aug 2024 16:52:21 GMT Subject: Integrated: 8339175: ProblemList runtime/interpreter/LastJsrTest.java on all platforms with Xcomp Message-ID: A trivial fix to ProblemList runtime/interpreter/LastJsrTest.java on all platforms with Xcomp. 
------------- Commit messages: - 8339175: ProblemList runtime/interpreter/LastJsrTest.java on all platforms with Xcomp Changes: https://git.openjdk.org/jdk/pull/20751/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=20751&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8339175 Stats: 3 lines in 2 files changed: 2 ins; 1 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/20751.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20751/head:pull/20751 PR: https://git.openjdk.org/jdk/pull/20751 From matsaave at openjdk.org Wed Aug 28 16:52:21 2024 From: matsaave at openjdk.org (Matias Saavedra Silva) Date: Wed, 28 Aug 2024 16:52:21 GMT Subject: Integrated: 8339175: ProblemList runtime/interpreter/LastJsrTest.java on all platforms with Xcomp In-Reply-To: References: Message-ID: On Wed, 28 Aug 2024 16:41:30 GMT, Daniel D. Daugherty wrote: > A trivial fix to ProblemList runtime/interpreter/LastJsrTest.java on all platforms with Xcomp. Looks good, thanks! ------------- Marked as reviewed by matsaave (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/20751#pullrequestreview-2266791371 From dcubed at openjdk.org Wed Aug 28 16:52:22 2024 From: dcubed at openjdk.org (Daniel D. Daugherty) Date: Wed, 28 Aug 2024 16:52:22 GMT Subject: Integrated: 8339175: ProblemList runtime/interpreter/LastJsrTest.java on all platforms with Xcomp In-Reply-To: References: Message-ID: On Wed, 28 Aug 2024 16:44:12 GMT, Matias Saavedra Silva wrote: >> A trivial fix to ProblemList runtime/interpreter/LastJsrTest.java on all platforms with Xcomp. > > Looks good, thanks! @matias9927 - Thanks (again) for the lightning fast review. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20751#issuecomment-2315828811 From dcubed at openjdk.org Wed Aug 28 16:52:22 2024 From: dcubed at openjdk.org (Daniel D. 
Daugherty) Date: Wed, 28 Aug 2024 16:52:22 GMT Subject: Integrated: 8339175: ProblemList runtime/interpreter/LastJsrTest.java on all platforms with Xcomp In-Reply-To: References: Message-ID: <5UBP_2bC_dkPfxFtBDz57mCxCRyF7oiltdPuGydA1CE=.2ef9242a-d4d0-4e8b-be7a-b60c4dc22462@github.com> On Wed, 28 Aug 2024 16:41:30 GMT, Daniel D. Daugherty wrote: > A trivial fix to ProblemList runtime/interpreter/LastJsrTest.java on all platforms with Xcomp. This pull request has now been integrated. Changeset: 379f3db0 Author: Daniel D. Daugherty URL: https://git.openjdk.org/jdk/commit/379f3db001fe4bffd3a00e0363a98275e7b2eba8 Stats: 3 lines in 2 files changed: 2 ins; 1 del; 0 mod 8339175: ProblemList runtime/interpreter/LastJsrTest.java on all platforms with Xcomp Reviewed-by: matsaave ------------- PR: https://git.openjdk.org/jdk/pull/20751 From jbhateja at openjdk.org Wed Aug 28 17:56:26 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 28 Aug 2024 17:56:26 GMT Subject: RFR: 8329035: New Data Destination instructions support In-Reply-To: References: Message-ID: On Fri, 23 Aug 2024 22:44:09 GMT, Steve Dohrmann wrote: > Adds assembler support for APX New Data Destination (NDD) and No Flags (NF) features. > > The NDD feature is supported by new functions that take an additional destination-only register operand. If the instruction also supports NF, a no_flags boolean parameter is present. To use these instructions with NF behavior, but without NDD semantics, the same register can be supplied for both the new destination and the (first) source operand. > > Some instructions support NF but not NDD. These instructions have a new function that just adds a boolean no_flags parameter. Existing functions were not overloaded with a boolean here because of signature collisions (bool / int) with functions that take immediate operands. > > All of the new functions have a letter "e" prefix, to avoid signature collisions and to indicate they will be evex encoded. 
src/hotspot/cpu/x86/assembler_x86.cpp line 1579: > 1577: > 1578: void Assembler::eaddl(Register dst, Register src1, Register src2, bool no_flags) { > 1579: InstructionAttr attributes(AVX_128bit, /* vex_w */ false, /* legacy_mode */ false, /* no_mask_reg */ true, /* uses_vl */ false); Should we not auto demote these instruction to use legacy MAP0 encoding, if dst and src1 / src2 are same and does not belong to EGPR set? We do REX to VEX promotion and EVEX to VEX demotions at assembler level if the required criteria is met. src/hotspot/cpu/x86/assembler_x86.cpp line 2647: > 2645: InstructionAttr attributes(AVX_128bit, /* vex_w */ false, /* legacy_mode */ false, /* no_mask_reg */ true, /* uses_vl */ false); > 2646: attributes.set_address_attributes(/* tuple_type */ EVEX_NOSCALE, /* input_size_in_bits */ EVEX_32bit); > 2647: vex_prefix_ndd(src, dst->encoding(), 0, VEX_SIMD_NONE, VEX_OPCODE_0F_3C, &attributes, no_flags); Suggestion: eevex_prefix_ndd(src, dst->encoding(), 0, VEX_SIMD_NONE, VEX_OPCODE_0F_3C, &attributes, no_flags); src/hotspot/cpu/x86/assembler_x86.hpp line 794: > 792: bool eevex_x, int nds_enc, VexSimdPrefix pre, VexOpcode opc, bool no_flags = false); > 793: > 794: void vex_prefix_ndd(Address adr, int ndd_enc, int xreg_enc, VexSimdPrefix pre, VexOpcode opc, InstructionAttr *attributes, bool no_flags = false) { Suggestion: void eevx_prefix_ndd(Address adr, int ndd_enc, int xreg_enc, VexSimdPrefix pre, VexOpcode opc, InstructionAttr *attributes, bool no_flags = false) { NDD is only supported with 4 byte extended evex encoding. src/hotspot/cpu/x86/assembler_x86.hpp line 798: > 796: } > 797: > 798: void vex_prefix_nf(Address adr, int ndd_enc, int xreg_enc, VexSimdPrefix pre, VexOpcode opc, InstructionAttr *attributes, bool no_flags = false) { Same as above. 
src/hotspot/cpu/x86/assembler_x86.hpp line 809: > 807: InstructionAttr *attributes, bool src_is_gpr = false, bool nds_is_ndd = false, bool force_evex = false, bool no_flags = false); > 808: > 809: int vex_prefix_and_encode_ndd(int dst_enc, int nds_enc, int src_enc, VexSimdPrefix pre, VexOpcode opc, Suggestion: int vex_prefix_and_encode_ndd(int ndd_enc, int nds_enc, int src_enc, VexSimdPrefix pre, VexOpcode opc, src/hotspot/cpu/x86/assembler_x86.hpp line 811: > 809: int vex_prefix_and_encode_ndd(int dst_enc, int nds_enc, int src_enc, VexSimdPrefix pre, VexOpcode opc, > 810: InstructionAttr *attributes, bool no_flags = false) { > 811: return vex_prefix_and_encode(dst_enc, nds_enc, src_enc, pre, opc, attributes, /* src_is_gpr */ true, /* nds_is_ndd */ true , /* force_evex */ true, no_flags); Suggestion: return vex_prefix_and_encode(dst_enc, nds_enc, src_enc, pre, opc, attributes, /* src_is_gpr */ true, /* nds_is_ndd */ true , /* force_evex */ true, no_flags); Suggestion: return vex_prefix_and_encode(ndd_enc, nds_enc, src_enc, pre, opc, attributes, /* src_is_gpr */ true, /* nds_is_ndd */ true , /* force_evex */ true, no_flags); src/hotspot/cpu/x86/assembler_x86.hpp line 814: > 812: } > 813: > 814: int vex_prefix_and_encode_nf(int dst_enc, int nds_enc, int src_enc, VexSimdPrefix pre, VexOpcode opc, Suggestion: int vex_prefix_and_encode_nf(int ndd_enc, int nds_enc, int src_enc, VexSimdPrefix pre, VexOpcode opc, src/hotspot/cpu/x86/assembler_x86.hpp line 816: > 814: int vex_prefix_and_encode_nf(int dst_enc, int nds_enc, int src_enc, VexSimdPrefix pre, VexOpcode opc, > 815: InstructionAttr *attributes, bool no_flags = false) { > 816: return vex_prefix_and_encode(dst_enc, nds_enc, src_enc, pre, opc, attributes, /* src_is_gpr */ true, /* nds_is_ndd */ false, /* force_evex */ true, no_flags); Suggestion: return vex_prefix_and_encode(ndd_enc, nds_enc, src_enc, pre, opc, attributes, /* src_is_gpr */ true, /* nds_is_ndd */ false, /* force_evex */ true, no_flags); ------------- 
PR Review Comment: https://git.openjdk.org/jdk/pull/20698#discussion_r1734889129 PR Review Comment: https://git.openjdk.org/jdk/pull/20698#discussion_r1735083052 PR Review Comment: https://git.openjdk.org/jdk/pull/20698#discussion_r1734877069 PR Review Comment: https://git.openjdk.org/jdk/pull/20698#discussion_r1734877452 PR Review Comment: https://git.openjdk.org/jdk/pull/20698#discussion_r1734908102 PR Review Comment: https://git.openjdk.org/jdk/pull/20698#discussion_r1734908555 PR Review Comment: https://git.openjdk.org/jdk/pull/20698#discussion_r1734909059 PR Review Comment: https://git.openjdk.org/jdk/pull/20698#discussion_r1734909377 From mdoerr at openjdk.org Wed Aug 28 20:19:24 2024 From: mdoerr at openjdk.org (Martin Doerr) Date: Wed, 28 Aug 2024 20:19:24 GMT Subject: RFR: JDK-8332423 : [PPC64] Remove C1_MacroAssembler::call_c_with_frame_resize [v5] In-Reply-To: References: Message-ID: On Wed, 28 Aug 2024 06:54:57 GMT, Suchismith Roy wrote: >> JBS Issue : [JDK-8332423](https://bugs.openjdk.org/browse/JDK-8332423) >> C1_MacroAssembler::call_c_with_frame_resize is only used with frame_resize == 0. >> Also, call_c is adapted as per endianess of system. >> We can adapt the exisiting code to handle the endianness check at one place and not have to repeatedly check at multiple places to make calls to call_c. > > Suchismith Roy has updated the pull request incrementally with two additional commits since the last revision: > > - review comments > - review comments LGTM. Thanks for the update! One more thing: `C1_MacroAssembler::call_c_with_frame_resize` should be removed completely. ------------- Marked as reviewed by mdoerr (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/19947#pullrequestreview-2267182675 Changes requested by mdoerr (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/19947#pullrequestreview-2267186515 From dlong at openjdk.org Wed Aug 28 20:48:24 2024 From: dlong at openjdk.org (Dean Long) Date: Wed, 28 Aug 2024 20:48:24 GMT Subject: RFR: 8335120: assert(!target->can_be_statically_bound() || target == cha_monomorphic_target) failed In-Reply-To: References: <53hCMrAv73-MOnpcsLz6wrv688AfIJ1mLeLog1WX07o=.4296304e-70d5-4a60-afb1-d00644732519@github.com> Message-ID: <4bcYvq1xWiAzMPmIw5XY7n4Jbkk63v3aPx23Vptf6FA=.b29ccea8-7a0b-4fda-98d2-d2e872a15581@github.com> On Wed, 28 Aug 2024 16:45:30 GMT, Vladimir Ivanov wrote: >> The failing test uses class redefinition, which happen at safepoints. Because JIT compiler threads do not block safepoints, compiler threads can sometimes see multiple versions of the same method. The above assert can fail if target and cha_monomorphic_target are different versions of the same method. The fix is to introduce ciMethod::equals to deal with this possibility. > > Looks good. Thanks @iwanowww. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20730#issuecomment-2316222323 From svkamath at openjdk.org Wed Aug 28 22:42:33 2024 From: svkamath at openjdk.org (Smita Kamath) Date: Wed, 28 Aug 2024 22:42:33 GMT Subject: RFR: 8337632: AES-GCM Algorithm optimization for x86_64 [v2] In-Reply-To: References: Message-ID: > Hi, > I want to submit an AES-GCM algorithm optimization. This implementation is using AVX512/VAES Instructions. Additionally, it reduces PARALLEL_LEN from 7680 to 512 bytes. The performance numbers are as below. Kindly review the code. Thank you. 
>
> Benchmark | Datasize | BaseJDK (ops/s) | Patch (ops/s) | %Gain
> -- | -- | -- | -- | --
> full.AESGCMBench.decrypt | 512 | 2928259.197 | 3269964.387 | 11.67
> full.AESGCMBench.decrypt | 1024 | 2494254.611 | 3010987.731 | 20.72
> full.AESGCMBench.decrypt | 1500 | 1883453.546 | 1934915.846 | 2.73
> full.AESGCMBench.decrypt | 2048 | 1825780.711 | 2452861.368 | 34.34
> full.AESGCMBench.decrypt | 4096 | 1275108.345 | 1806329.066 | 41.66
> full.AESGCMBench.decrypt | 8192 | 1033936.634 | 1196836.052 | 15.75
> full.AESGCMBench.decrypt | 16384 | 681494.768 | 711630.498 | 4.42
> full.AESGCMBench.decrypt | 32768 | 385026.017 | 395043.193 | 2.60
> full.AESGCMBench.decrypt | 65536 | 207373.924 | 214723.588 | 3.54
> full.AESGCMBench.encrypt | 512 | 2658008.476 | 2882496.94 | 8.45
> full.AESGCMBench.encrypt | 1024 | 2283709.63 | 2589534.403 | 13.39
> full.AESGCMBench.encrypt | 1500 | 1794993.519 | 1817669.531 | 1.26
> full.AESGCMBench.encrypt | 2048 | 1745532.435 | 2191097.29 | 25.52
> full.AESGCMBench.encrypt | 4096 | 1203301.174 | 1649593.953 | 37.08
> full.AESGCMBench.encrypt | 8192 | 985174.988 | 1132407.54 | 14.94
> full.AESGCMBench.encrypt | 16384 | 658980.441 | 684765.771 | 3.91
> full.AESGCMBench.encrypt | 32768 | 373543.798 | 391518.837 | 4.81
> full.AESGCMBench.encrypt | 65536 | 202532.315 | 205084.833 | 1.26
Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: Addressed review comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17515/files - new: https://git.openjdk.org/jdk/pull/17515/files/24c9c792..6b21983c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17515&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17515&range=00-01 Stats: 183 lines in 5 files changed: 17 ins; 45 del; 121 mod Patch: https://git.openjdk.org/jdk/pull/17515.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17515/head:pull/17515 PR:
https://git.openjdk.org/jdk/pull/17515 From kvn at openjdk.org Wed Aug 28 22:43:26 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 28 Aug 2024 22:43:26 GMT Subject: RFR: 8333098: ubsan: bytecodeInfo.cpp:318:59: runtime error: division by zero [v3] In-Reply-To: References: Message-ID: On Wed, 28 Aug 2024 07:33:57 GMT, Matthias Baesken wrote: >> When running test >> compiler/classUnloading/methodUnloading/TestOverloadCompileQueues.java >> with ubsan enabled binaries we run into the issue reported below. >> Reason seems to be that we divide by zero in the code in some special cases (we should instead check for `CompilationPolicy::min_invocations() == 0 `and handle it separately). >> >> /jdk/src/hotspot/share/opto/bytecodeInfo.cpp:318:59: runtime error: division by zero >> #0 0x7f5145c0dda2 in InlineTree::should_not_inline(ciMethod*, ciMethod*, int, bool&, ciCallProfile&) src/hotspot/share/opto/bytecodeInfo.cpp:318 >> #1 0x7f51466366d7 in InlineTree::try_to_inline(ciMethod*, ciMethod*, int, JVMState*, ciCallProfile&, bool&) src/hotspot/share/opto/bytecodeInfo.cpp:382 >> #2 0x7f514663d36b in InlineTree::ok_to_inline(ciMethod*, JVMState*, ciCallProfile&, bool&) src/hotspot/share/opto/bytecodeInfo.cpp:596 >> #3 0x7f51470dffd6 in Compile::call_generator(ciMethod*, int, bool, JVMState*, bool, float, ciKlass*, bool) src/hotspot/share/opto/doCall.cpp:189 >> #4 0x7f51470e18ab in Parse::do_call() src/hotspot/share/opto/doCall.cpp:641 >> #5 0x7f514887dbf1 in Parse::do_one_block() src/hotspot/share/opto/parse1.cpp:1607 >> #6 0x7f514887fefa in Parse::do_all_blocks() src/hotspot/share/opto/parse1.cpp:724 >> #7 0x7f514888d4da in Parse::Parse(JVMState*, ciMethod*, float) src/hotspot/share/opto/parse1.cpp:628 >> #8 0x7f51469d8418 in ParseGenerator::generate(JVMState*) src/hotspot/share/opto/callGenerator.cpp:99 >> #9 0x7f5146d99cff in Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*) src/hotspot/share/opto/compile.cpp:793 >> #10 0x7f51469d5ebf in 
C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*) src/hotspot/share/opto/c2compiler.cpp:142 >> #11 0x7f5146db0274 in CompileBroker::invoke_compiler_on_method(CompileTask*) src/hotspot/share/compiler/compileBroker.cpp:2303 >> #12 0x7f5146db2826 in CompileBroker::compiler_thread_loop() src/hotspot/share/compiler/compileBroker.cpp:1961 >> #13 0x7f51478d475a in JavaThread::thread_main_inner() src/hotspot/share/runtime/javaThread.cpp:759 >> #14 0x7f51491620ea in Thread::call_run() src/hotspot/share/runtime/thread.cpp:225 >> #15 0x7f51487ac201 in thread_native_entry src/hotspot/os/linux/os_linux.cpp:846 >> #16 0x7f514e5cf6e9 in start_thread (/lib64/libpthread.so.0+0xa6e9) (BuildId: 2f8d3c2d0f4d7888c2598d2ff6356537f5708a73) >> #17 0x7f514db1550e in clone (/lib64/libc.so.6+0x11850e) (BuildId: ... > > Matthias Baesken has updated the pull request incrementally with one additional commit since the last revision: > > add comment Good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/20615#pullrequestreview-2267384863 From iveresov at openjdk.org Wed Aug 28 22:43:28 2024 From: iveresov at openjdk.org (Igor Veresov) Date: Wed, 28 Aug 2024 22:43:28 GMT Subject: RFR: 8333098: ubsan: bytecodeInfo.cpp:318:59: runtime error: division by zero [v3] In-Reply-To: References: Message-ID: On Wed, 28 Aug 2024 07:33:57 GMT, Matthias Baesken wrote: >> When running test >> compiler/classUnloading/methodUnloading/TestOverloadCompileQueues.java >> with ubsan enabled binaries we run into the issue reported below. >> Reason seems to be that we divide by zero in the code in some special cases (we should instead check for `CompilationPolicy::min_invocations() == 0 `and handle it separately). 
>> >> /jdk/src/hotspot/share/opto/bytecodeInfo.cpp:318:59: runtime error: division by zero >> #0 0x7f5145c0dda2 in InlineTree::should_not_inline(ciMethod*, ciMethod*, int, bool&, ciCallProfile&) src/hotspot/share/opto/bytecodeInfo.cpp:318 >> #1 0x7f51466366d7 in InlineTree::try_to_inline(ciMethod*, ciMethod*, int, JVMState*, ciCallProfile&, bool&) src/hotspot/share/opto/bytecodeInfo.cpp:382 >> #2 0x7f514663d36b in InlineTree::ok_to_inline(ciMethod*, JVMState*, ciCallProfile&, bool&) src/hotspot/share/opto/bytecodeInfo.cpp:596 >> #3 0x7f51470dffd6 in Compile::call_generator(ciMethod*, int, bool, JVMState*, bool, float, ciKlass*, bool) src/hotspot/share/opto/doCall.cpp:189 >> #4 0x7f51470e18ab in Parse::do_call() src/hotspot/share/opto/doCall.cpp:641 >> #5 0x7f514887dbf1 in Parse::do_one_block() src/hotspot/share/opto/parse1.cpp:1607 >> #6 0x7f514887fefa in Parse::do_all_blocks() src/hotspot/share/opto/parse1.cpp:724 >> #7 0x7f514888d4da in Parse::Parse(JVMState*, ciMethod*, float) src/hotspot/share/opto/parse1.cpp:628 >> #8 0x7f51469d8418 in ParseGenerator::generate(JVMState*) src/hotspot/share/opto/callGenerator.cpp:99 >> #9 0x7f5146d99cff in Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*) src/hotspot/share/opto/compile.cpp:793 >> #10 0x7f51469d5ebf in C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*) src/hotspot/share/opto/c2compiler.cpp:142 >> #11 0x7f5146db0274 in CompileBroker::invoke_compiler_on_method(CompileTask*) src/hotspot/share/compiler/compileBroker.cpp:2303 >> #12 0x7f5146db2826 in CompileBroker::compiler_thread_loop() src/hotspot/share/compiler/compileBroker.cpp:1961 >> #13 0x7f51478d475a in JavaThread::thread_main_inner() src/hotspot/share/runtime/javaThread.cpp:759 >> #14 0x7f51491620ea in Thread::call_run() src/hotspot/share/runtime/thread.cpp:225 >> #15 0x7f51487ac201 in thread_native_entry src/hotspot/os/linux/os_linux.cpp:846 >> #16 0x7f514e5cf6e9 in start_thread (/lib64/libpthread.so.0+0xa6e9) (BuildId: 
2f8d3c2d0f4d7888c2598d2ff6356537f5708a73) >> #17 0x7f514db1550e in clone (/lib64/libc.so.6+0x11850e) (BuildId: ... > > Matthias Baesken has updated the pull request incrementally with one additional commit since the last revision: > > add comment Marked as reviewed by iveresov (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/20615#pullrequestreview-2267386184 From kvn at openjdk.org Wed Aug 28 22:47:32 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 28 Aug 2024 22:47:32 GMT Subject: RFR: 8334724: C2: remove PhaseIdealLoop::cast_incr_before_loop() [v3] In-Reply-To: <_J7rxi43QXdtD_aJuhnbnoxDFUIi7HMFsHUutkpijw8=.324943d4-001a-4833-a106-ee46f7fbc6e6@github.com> References: <1w8L64U3JeY8qKskSNm4OlTablOucHHThfW-9nakz4E=.e52b0193-9048-4382-8242-f5296483898a@github.com> <_J7rxi43QXdtD_aJuhnbnoxDFUIi7HMFsHUutkpijw8=.324943d4-001a-4833-a106-ee46f7fbc6e6@github.com> Message-ID: On Mon, 26 Aug 2024 08:46:41 GMT, Roland Westrelin wrote: >> I propose removing `PhaseIdealLoop::cast_incr_before_loop()` and the >> `CastII` nodes that it adds at counted loop heads. >> >> They were added to prevent nodes to float above the zero trip guard >> when the backedge of a counted loop is removed. In particular, when a >> range check is hoisted by predication, pre/main/post loops are created >> and if one of the main or post loops lose its backedge, an array load >> that's control dependent on a predicate above the pre loop could float >> above the zero trip guard of the main or post loop. That can no longer >> happen AFAICT with changes related to assert predicates. The array >> load is now updated to have a control dependency that's below the zero >> trip guard. >> >> The reason I'm revisiting this is that I noticed that >> `PhaseIdealLoop::cast_incr_before_loop()` has a bug. When it adds the >> `CastII`, it looks for the loop phi and picks input 1 of the phi it >> finds as input to the `CastII`. 
To find the loop phi, it starts from >> the loop incremement and loop for a use that's a phi and has the loop >> head as control. It never checks that the phi it finds is the loop >> phi. There can be more than one phi as uses of the increment at the >> loop head and it can pick the wrong one. I tried to write a test case >> where this would cause a bug but couldn't actually find any use for >> the `CastII` anymore. >> >> In my testing, the only issue when the `CastII` are not added is that >> some IR tests for vectorization fails: >> >> compiler/vectorization/TestPopulateIndex.java >> compiler/vectorization/runner/ArrayShiftOpTest.java >> compiler/vectorization/runner/LoopArrayIndexComputeTest.java >> >> because removing the `CastII` causes split if to occur with some nodes >> that take the loop phi as input. That then causes pattern matching >> during superword to break. I added logic to prevent split if for those >> cases. > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Merge branch 'master' into JDK-8334724 > - review > - fix Good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/19831#pullrequestreview-2267389359 From kvn at openjdk.org Wed Aug 28 22:49:24 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 28 Aug 2024 22:49:24 GMT Subject: RFR: 8335120: assert(!target->can_be_statically_bound() || target == cha_monomorphic_target) failed In-Reply-To: <53hCMrAv73-MOnpcsLz6wrv688AfIJ1mLeLog1WX07o=.4296304e-70d5-4a60-afb1-d00644732519@github.com> References: <53hCMrAv73-MOnpcsLz6wrv688AfIJ1mLeLog1WX07o=.4296304e-70d5-4a60-afb1-d00644732519@github.com> Message-ID: On Tue, 27 Aug 2024 16:11:22 GMT, Dean Long wrote: > The failing test uses class redefinition, which happen at safepoints. 
Because JIT compiler threads do not block safepoints, compiler threads can sometimes see multiple versions of the same method. The above assert can fail if target and cha_monomorphic_target are different versions of the same method. The fix is to introduce ciMethod::equals to deal with this possibility. Okay. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/20730#pullrequestreview-2267391571 From kxu at openjdk.org Wed Aug 28 22:57:48 2024 From: kxu at openjdk.org (Kangcheng Xu) Date: Wed, 28 Aug 2024 22:57:48 GMT Subject: RFR: 8325495: C2: implement optimization for series of Add of unique value Message-ID: This pull request resolves [JDK-8325495](https://bugs.openjdk.org/browse/JDK-8325495) by converting series of additions of the same operand into multiplications. I.e., `a + a + ... + a + a + a => n*a`. As an added benefit, it also converts `C * a + a` into `(C+1) * a` and `a << C + a` into `(2^C + 1) * a` (with respect to constant `C`). This is actually a side effect of IGVN being iterative: at converting the `i`-th addition, the previous `i-1` additions would have already been optimized to multiplication (and thus, further into bit shifts and additions/subtractions if possible). Some notable examples of this transformation include: - `a + a + a` => `a*3` => `(a<<1) + a` - `a + a + a + a` => `a*4` => `a<<2` - `a*3 + a` => `a*4` => `a<<2` - `(a << 1) + a + a` => `a*2 + a + a` => `a*3 + a` => `a*4 => a<<2` See included IR unit tests for more. 
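These rewrites are safe for Java `int` because addition, multiplication and left shift all wrap modulo 2^32. As a result, `a + a + a`, `a*3` and `(a << 1) + a` agree even when the sum overflows. The standalone sketch below checks the listed identities on a few edge values. The class name is made up for illustration; this is not one of the PR's IR tests. Note the explicit parentheses: in Java, `a << C + a` parses as `a << (C + a)`, so the intended form is `(a << C) + a`.

```java
// Hypothetical demo class: checks that the rewrites described in the PR
// preserve Java int semantics, including two's-complement wrap-around.
public final class SerialAddIdentities {
    // a + a + a  ==  a * 3  ==  (a << 1) + a
    static boolean threeTimes(int a) {
        int sum = a + a + a;
        return sum == a * 3 && sum == (a << 1) + a;
    }

    // a + a + a + a  ==  a * 4  ==  a << 2
    static boolean fourTimes(int a) {
        int sum = a + a + a + a;
        return sum == a * 4 && sum == (a << 2);
    }

    // C * a + a  ==  (C + 1) * a   and   (a << C) + a  ==  ((1 << C) + 1) * a
    static boolean foldConstant(int a, int c) {
        int shift = c & 31; // Java masks int shift counts to their low 5 bits
        return c * a + a == (c + 1) * a
            && (a << shift) + a == ((1 << shift) + 1) * a;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 7, Integer.MAX_VALUE, Integer.MIN_VALUE, 0x4000_0001};
        for (int a : samples) {
            for (int c = -3; c <= 33; c++) {
                if (!threeTimes(a) || !fourTimes(a) || !foldConstant(a, c)) {
                    throw new AssertionError("identity failed for a=" + a + ", c=" + c);
                }
            }
        }
        System.out.println("all identities hold");
    }
}
```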
------------- Commit messages: - add more IR tests - update comments of existing tests - Merge branch 'master' into arithmetic-canonicalization - add more IR tests - add more IR tests - distinguish AndNode from MulNode - add initial IR unit tests - passes all hotspot-compiler tests - implement arithmetic canonicalization for additions Changes: https://git.openjdk.org/jdk/pull/20754/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=20754&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8325495 Stats: 316 lines in 5 files changed: 312 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/20754.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20754/head:pull/20754 PR: https://git.openjdk.org/jdk/pull/20754 From kxu at openjdk.org Wed Aug 28 23:06:18 2024 From: kxu at openjdk.org (Kangcheng Xu) Date: Wed, 28 Aug 2024 23:06:18 GMT Subject: RFR: 8325495: C2: implement optimization for series of Add of unique value In-Reply-To: References: Message-ID: On Wed, 28 Aug 2024 19:27:29 GMT, Kangcheng Xu wrote: > This pull request resolves [JDK-8325495](https://bugs.openjdk.org/browse/JDK-8325495) by converting series of additions of the same operand into multiplications. I.e., `a + a + ... + a + a + a => n*a`. > > As an added benefit, it also converts `C * a + a` into `(C+1) * a` and `a << C + a` into `(2^C + 1) * a` (with respect to constant `C`). This is actually a side effect of IGVN being iterative: at converting the `i`-th addition, the previous `i-1` additions would have already been optimized to multiplication (and thus, further into bit shifts and additions/subtractions if possible). > > Some notable examples of this transformation include: > - `a + a + a` => `a*3` => `(a<<1) + a` > - `a + a + a + a` => `a*4` => `a<<2` > - `a*3 + a` => `a*4` => `a<<2` > - `(a << 1) + a + a` => `a*2 + a + a` => `a*3 + a` => `a*4 => a<<2` > > See included IR unit tests for more. 
Also, please feel free to suggest better naming for `SerialAdditionCanonicalization` (test) and `AddNode::find_repeated_operand_in_chained_addition`. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20754#issuecomment-2316383587 From dlong at openjdk.org Thu Aug 29 00:37:23 2024 From: dlong at openjdk.org (Dean Long) Date: Thu, 29 Aug 2024 00:37:23 GMT Subject: RFR: 8335120: assert(!target->can_be_statically_bound() || target == cha_monomorphic_target) failed In-Reply-To: References: <53hCMrAv73-MOnpcsLz6wrv688AfIJ1mLeLog1WX07o=.4296304e-70d5-4a60-afb1-d00644732519@github.com> Message-ID: On Wed, 28 Aug 2024 22:47:02 GMT, Vladimir Kozlov wrote: >> The failing test uses class redefinition, which happen at safepoints. Because JIT compiler threads do not block safepoints, compiler threads can sometimes see multiple versions of the same method. The above assert can fail if target and cha_monomorphic_target are different versions of the same method. The fix is to introduce ciMethod::equals to deal with this possibility. > > Okay. Thanks @vnkozlov . ------------- PR Comment: https://git.openjdk.org/jdk/pull/20730#issuecomment-2316499121 From dlong at openjdk.org Thu Aug 29 00:37:24 2024 From: dlong at openjdk.org (Dean Long) Date: Thu, 29 Aug 2024 00:37:24 GMT Subject: Integrated: 8335120: assert(!target->can_be_statically_bound() || target == cha_monomorphic_target) failed In-Reply-To: <53hCMrAv73-MOnpcsLz6wrv688AfIJ1mLeLog1WX07o=.4296304e-70d5-4a60-afb1-d00644732519@github.com> References: <53hCMrAv73-MOnpcsLz6wrv688AfIJ1mLeLog1WX07o=.4296304e-70d5-4a60-afb1-d00644732519@github.com> Message-ID: On Tue, 27 Aug 2024 16:11:22 GMT, Dean Long wrote: > The failing test uses class redefinition, which happen at safepoints. Because JIT compiler threads do not block safepoints, compiler threads can sometimes see multiple versions of the same method. The above assert can fail if target and cha_monomorphic_target are different versions of the same method. 
The fix is to introduce ciMethod::equals to deal with this possibility. This pull request has now been integrated. Changeset: 0ddcd701 Author: Dean Long URL: https://git.openjdk.org/jdk/commit/0ddcd7017576a0f9c97a74b7d47624ae06ed06d6 Stats: 19 lines in 3 files changed: 18 ins; 0 del; 1 mod 8335120: assert(!target->can_be_statically_bound() || target == cha_monomorphic_target) failed Reviewed-by: kvn, vlivanov ------------- PR: https://git.openjdk.org/jdk/pull/20730 From sviswanathan at openjdk.org Thu Aug 29 00:39:22 2024 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Thu, 29 Aug 2024 00:39:22 GMT Subject: RFR: 8337632: AES-GCM Algorithm optimization for x86_64 [v2] In-Reply-To: References: Message-ID: <6a58ZAJUKVmo6a0AZcEAD2GU9T4CsnapCx8lG1w2SGw=.fde0e4a0-2565-40d1-bd24-21fcac33cc8b@github.com> On Wed, 28 Aug 2024 22:42:33 GMT, Smita Kamath wrote: >> Hi, >> I want to submit an AES-GCM algorithm optimization. This implementation is using AVX512/VAES Instructions. Additionally, it reduces PARALLEL_LEN from 7680 to 512 bytes. The performance numbers are as below. Kindly review the code. Thank you. >> >> Benchmark | Datasize | BaseJDK (ops/s) | Patch(ops/s) | %Gain >> -- | -- | -- | -- | -- >> full.AESGCMBench.decrypt | 512 | 2928259.197 | 3269964.387 | 11.67 >> full.AESGCMBench.decrypt | 1024 | 2494254.611 | 3010987.731 | 20.72 >> full.AESGCMBench.decrypt | 1500 | 1883453.546 | 1934915.846 | 2.73 >> full.AESGCMBench.decrypt | 2048 | 1825780.711 | 2452861.368 | 34.34 >> full.AESGCMBench.decrypt | 4096 | 1275108.345 | 1806329.066 | 41.66 >> full.AESGCMBench.decrypt | 8192 | 1033936.634 | 1196836.052 | 15.75 >> full.AESGCMBench.decrypt | 16384 | 681494.768 | 711630.498 | 4.42 >> full.AESGCMBench.decrypt | 32768 | 385026.017 | 395043.193 | 2.6 >> full.AESGCMBench.decrypt | 65536 | 207373.924 | 214723.588 | 3.54
>> full.AESGCMBench.encrypt | 512 | 2658008.476 | 2882496.94 | 8.45 >> full.AESGCMBench.encrypt | 1024 | 2283709.63 | 2589534.403 | 13.39 >> full.AESGCMBench.encrypt | 1500 | 1794993.519 | 1817669.531 | 1.26 >> full.AESGCMBench.encrypt | 2048 | 1745532.435 | 2191097.29 | 25.52 >> full.AESGCMBench.encrypt | 4096 | 1203301.174 | 1649593.953 | 37.08 >> full.AESGCMBench.encrypt | 8192 | 985174.988 | 1132407.54 | 14.94 >> full.AESGCMBench.encrypt | 16384 | 658980.441 | 684765.771 | 3.91 >> full.AESGCMBench.encrypt | 32768 | 373543.798 | 391518.837 | 4.81 >> full.AESGCMBench.encrypt | 65536 | 202532.315 | 205084.833 | 1.260301597 > > Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: > > Addressed review comments src/hotspot/cpu/x86/assembler_x86.cpp line 8984: > 8982: void Assembler::vinserti64x2(XMMRegister dst, XMMRegister nds, XMMRegister src, uint8_t imm8, int vector_len) { > 8983: assert(VM_Version::supports_avx512dq(), ""); > 8984: assert(vector_len == AVX_256bit || VM_Version::supports_avx512vl(), ""); As this is an evex instruction, we could call it evinserti64x2. Also the assert should be: assert(vector_len == AVX_512bit || VM_Version::supports_avx512vl(), ""); src/hotspot/cpu/x86/assembler_x86.cpp line 11057: > 11055: void Assembler::evbroadcastf64x2(XMMRegister dst, Address src, int vector_len) { > 11056: assert(VM_Version::supports_avx512dq(), ""); > 11057: assert(vector_len == AVX_256bit || VM_Version::supports_avx512vl(), ""); The assert should be: assert(vector_len == AVX_512bit || VM_Version::supports_avx512vl(), ""); src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 203: > 201: 0x0000000000000003ULL, 0x0000000000000000ULL, > 202: 0x0000000000000004ULL, 0x0000000000000000ULL, > 203: }; This is same as COUNTER_MASK_LINC0[] in this file. Could we reuse that instead? 
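The suggested asserts follow from how EVEX encoding works. The 128/256-bit forms require AVX512VL, while the 512-bit form needs only the base AVX-512 feature (AVX512DQ here, which the first assert already checks). The hypothetical Java model below compares the condition in the patch with the one suggested in the review, for a CPU that has AVX512DQ but not AVX512VL. The enum and method names are illustrative only.

```java
// Hypothetical model of the EVEX vector-length check discussed in the review.
public final class EvexLengthCheck {
    enum VectorLen { AVX_128BIT, AVX_256BIT, AVX_512BIT }

    // Condition as written in the patch under review.
    static boolean originalCheck(VectorLen len, boolean hasAvx512vl) {
        return len == VectorLen.AVX_256BIT || hasAvx512vl;
    }

    // Condition suggested in the review: the 512-bit form needs no AVX512VL,
    // narrower EVEX forms do.
    static boolean suggestedCheck(VectorLen len, boolean hasAvx512vl) {
        return len == VectorLen.AVX_512BIT || hasAvx512vl;
    }

    public static void main(String[] args) {
        // Model a CPU with AVX512DQ but without AVX512VL.
        for (VectorLen len : VectorLen.values()) {
            System.out.println(len + ": original=" + originalCheck(len, false)
                + ", suggested=" + suggestedCheck(len, false));
        }
    }
}
```

On such a CPU, the original condition rejects the 512-bit form and accepts the 256-bit one, which is the opposite of what the hardware allows. The suggested condition accepts only the 512-bit form.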
src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 2813: > 2811: __ evshufi64x2(ZT5, ZT7, ZT7, 0x00, Assembler::AVX_512bit);//;; broadcast HashKey ^ 8 across all ZT5 > 2812: > 2813: for (int i = 20, j = 52; i >= 0;) { This should be i > 0. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1735419450 PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1735419749 PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1735427066 PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1735420157 From sviswanathan at openjdk.org Thu Aug 29 00:43:30 2024 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Thu, 29 Aug 2024 00:43:30 GMT Subject: RFR: 8337632: AES-GCM Algorithm optimization for x86_64 [v2] In-Reply-To: References: Message-ID: On Wed, 28 Aug 2024 22:42:33 GMT, Smita Kamath wrote: >> Hi, >> I want to submit an AES-GCM algorithm optimization. This implementation is using AVX512/VAES Instructions. Additionally, it reduces PARALLEL_LEN from 7680 to 512 bytes. The performance numbers are as below. Kindly review the code. Thank you. >> >> Benchmark | Datasize | BaseJDK (ops/s) | Patch(ops/s) | %Gain >> -- | -- | -- | -- | -- >> full.AESGCMBench.decrypt | 512 | 2928259.197 | 3269964.387 | 11.67 >> full.AESGCMBench.decrypt | 1024 | 2494254.611 | 3010987.731 | 20.72 >> full.AESGCMBench.decrypt | 1500 | 1883453.546 | 1934915.846 | 2.73 >> full.AESGCMBench.decrypt | 2048 | 1825780.711 | 2452861.368 | 34.34 >> full.AESGCMBench.decrypt | 4096 | 1275108.345 | 1806329.066 | 41.66 >> full.AESGCMBench.decrypt | 8192 | 1033936.634 | 1196836.052 | 15.75 >> full.AESGCMBench.decrypt | 16384 | 681494.768 | 711630.498 | 4.42 >> full.AESGCMBench.decrypt | 32768 | 385026.017 | 395043.193 | 2.6 >> full.AESGCMBench.decrypt | 65536 | 207373.924 | 214723.588 | 3.54
>> full.AESGCMBench.encrypt | 512 | 2658008.476 | 2882496.94 | 8.45 >> full.AESGCMBench.encrypt | 1024 | 2283709.63 | 2589534.403 | 13.39 >> full.AESGCMBench.encrypt | 1500 | 1794993.519 | 1817669.531 | 1.26 >> full.AESGCMBench.encrypt | 2048 | 1745532.435 | 2191097.29 | 25.52 >> full.AESGCMBench.encrypt | 4096 | 1203301.174 | 1649593.953 | 37.08 >> full.AESGCMBench.encrypt | 8192 | 985174.988 | 1132407.54 | 14.94 >> full.AESGCMBench.encrypt | 16384 | 658980.441 | 684765.771 | 3.91 >> full.AESGCMBench.encrypt | 32768 | 373543.798 | 391518.837 | 4.81 >> full.AESGCMBench.encrypt | 65536 | 202532.315 | 205084.833 | 1.260301597 > > Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: > > Addressed review comments Please also update copyright dates in stubGenerator_x86_64_aes.cpp and stubGenerator_x86_64_ghash.cpp files. ------------- PR Comment: https://git.openjdk.org/jdk/pull/17515#issuecomment-2316505875 From chagedorn at openjdk.org Thu Aug 29 05:26:25 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 29 Aug 2024 05:26:25 GMT Subject: RFR: 8334724: C2: remove PhaseIdealLoop::cast_incr_before_loop() [v3] In-Reply-To: <_J7rxi43QXdtD_aJuhnbnoxDFUIi7HMFsHUutkpijw8=.324943d4-001a-4833-a106-ee46f7fbc6e6@github.com> References: <1w8L64U3JeY8qKskSNm4OlTablOucHHThfW-9nakz4E=.e52b0193-9048-4382-8242-f5296483898a@github.com> <_J7rxi43QXdtD_aJuhnbnoxDFUIi7HMFsHUutkpijw8=.324943d4-001a-4833-a106-ee46f7fbc6e6@github.com> Message-ID: On Mon, 26 Aug 2024 08:46:41 GMT, Roland Westrelin wrote: >> I propose removing `PhaseIdealLoop::cast_incr_before_loop()` and the >> `CastII` nodes that it adds at counted loop heads. >> >> They were added to prevent nodes to float above the zero trip guard >> when the backedge of a counted loop is removed. 
In particular, when a >> range check is hoisted by predication, pre/main/post loops are created >> and if one of the main or post loops lose its backedge, an array load >> that's control dependent on a predicate above the pre loop could float >> above the zero trip guard of the main or post loop. That can no longer >> happen AFAICT with changes related to assert predicates. The array >> load is now updated to have a control dependency that's below the zero >> trip guard. >> >> The reason I'm revisiting this is that I noticed that >> `PhaseIdealLoop::cast_incr_before_loop()` has a bug. When it adds the >> `CastII`, it looks for the loop phi and picks input 1 of the phi it >> finds as input to the `CastII`. To find the loop phi, it starts from >> the loop incremement and loop for a use that's a phi and has the loop >> head as control. It never checks that the phi it finds is the loop >> phi. There can be more than one phi as uses of the increment at the >> loop head and it can pick the wrong one. I tried to write a test case >> where this would cause a bug but couldn't actually find any use for >> the `CastII` anymore. >> >> In my testing, the only issue when the `CastII` are not added is that >> some IR tests for vectorization fails: >> >> compiler/vectorization/TestPopulateIndex.java >> compiler/vectorization/runner/ArrayShiftOpTest.java >> compiler/vectorization/runner/LoopArrayIndexComputeTest.java >> >> because removing the `CastII` causes split if to occur with some nodes >> that take the loop phi as input. That then causes pattern matching >> during superword to break. I added logic to prevent split if for those >> cases. > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains three additional commits since the last revision: > > - Merge branch 'master' into JDK-8334724 > - review > - fix I will run some testing again with latest master, just to be sure. ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/19831#pullrequestreview-2267706574 From kxu at openjdk.org Thu Aug 29 05:37:30 2024 From: kxu at openjdk.org (Kangcheng Xu) Date: Thu, 29 Aug 2024 05:37:30 GMT Subject: Integrated: 8327381: Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value In-Reply-To: References: Message-ID: On Mon, 11 Mar 2024 14:58:06 GMT, Kangcheng Xu wrote: > This PR resolves [JDK-8327381](https://bugs.openjdk.org/browse/JDK-8327381) > > Currently the transformations for expressions with patterns `((x & m) u<= m)` or `((m & x) u<= m)` to `true` is in `BoolNode::Ideal` function with a new constant node of value `1` created. However, this is technically a type-improving (reduction in range) transformation that's better suited in `BoolNode::Value` function. > > New unit test `test/hotspot/jtreg/compiler/c2/TestBoolNodeGvn.java` asserting on IR nodes and correctness of this transformation is added and passing. This pull request has now been integrated. 
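The fold is always valid because every bit set in `x & m` is also set in `m`. Read as unsigned integers, `x & m` can therefore never exceed `m`. The minimal Java sketch below spot-checks this fact for random inputs. The class name is hypothetical, and since Java has no unsigned `int`, `Integer.compareUnsigned` stands in for `u<=`. Random sampling is only a sanity check; the bit-subset argument is what makes the fact hold for all inputs.

```java
// Hypothetical demo: (x & m) u<= m and (m & x) u<= m hold for every x and m.
public final class MaskedUnsignedCompare {
    // Unsigned "<=" expressed with Integer.compareUnsigned.
    static boolean maskedLeMask(int x, int m) {
        return Integer.compareUnsigned(x & m, m) <= 0;
    }

    // Same fact with the operands of & swapped.
    static boolean swappedLeMask(int x, int m) {
        return Integer.compareUnsigned(m & x, m) <= 0;
    }

    public static void main(String[] args) {
        java.util.Random rnd = new java.util.Random(42);
        for (int i = 0; i < 1_000_000; i++) {
            int x = rnd.nextInt();
            int m = rnd.nextInt();
            if (!maskedLeMask(x, m) || !swappedLeMask(x, m)) {
                throw new AssertionError("counterexample: x=" + x + ", m=" + m);
            }
        }
        System.out.println("(x & m) u<= m held for all samples");
    }
}
```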
Changeset: 1383fec4 Author: Kangcheng Xu URL: https://git.openjdk.org/jdk/commit/1383fec41756322bf2832c55633e46395b937b40 Stats: 186 lines in 4 files changed: 166 ins; 17 del; 3 mod 8327381: Refactor type-improving transformations in BoolNode::Ideal to BoolNode::Value Reviewed-by: chagedorn, thartmann, jkarthikeyan, epeter ------------- PR: https://git.openjdk.org/jdk/pull/18198 From jbhateja at openjdk.org Thu Aug 29 05:42:58 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 29 Aug 2024 05:42:58 GMT Subject: RFR: 8338023: Support two vector selectFrom API [v7] In-Reply-To: <28KQHru1heR-YOVsRVo8Ffj_4D29IV8vD2tombvTHdI=.dba80ac3-9804-4074-ac0f-8acb9b042a08@github.com> References: <28KQHru1heR-YOVsRVo8Ffj_4D29IV8vD2tombvTHdI=.dba80ac3-9804-4074-ac0f-8acb9b042a08@github.com> Message-ID: > Hi All, > > As per the discussion on panama-dev mailing list[1], patch adds the support for following new two vector permutation APIs. > > > Declaration:- > Vector.selectFrom(Vector v1, Vector v2) > > > Semantics:- > Using index values stored in the lanes of "this" vector, assemble the values stored in first (v1) and second (v2) vector arguments. Thus, first and second vector serves as a table, whose elements are selected based on index value vector. API is applicable to all integral and floating-point types. The result of this operation is semantically equivalent to expression v1.rearrange(this.toShuffle(), v2). Values held in index vector lanes must lie within valid two vector index range [0, 2*VLEN) else an IndexOutOfBoundException is thrown. > > Summary of changes: > - Java side implementation of new selectFrom API. > - C2 compiler IR and inline expander changes. > - In absence of direct two vector permutation instruction in target ISA, a lowering transformation dismantles new IR into constituent IR supported by target platforms. > - Optimized x86 backend implementation for AVX512 and legacy target. > - Function tests covering new API. 
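A scalar model of the two-vector selection, following the semantics stated in the description above, may help when reading the lowering. This is a hypothetical helper over plain `int[]` arrays, not the Vector API itself, and the final API's handling of out-of-range indices may differ from this PR revision. Lane `i` takes `v1[idx]` when `idx < VLEN` and `v2[idx - VLEN]` otherwise. Indices outside `[0, 2*VLEN)` are rejected.

```java
// Hypothetical scalar model of the two-vector selectFrom semantics.
public final class TwoVectorSelect {
    static int[] selectFrom(int[] index, int[] v1, int[] v2) {
        int vlen = index.length;
        if (v1.length != vlen || v2.length != vlen) {
            throw new IllegalArgumentException("all vectors must have the same length");
        }
        int[] result = new int[vlen];
        for (int i = 0; i < vlen; i++) {
            int idx = index[i];
            // Valid indices lie in [0, 2*VLEN); anything else is rejected.
            if (idx < 0 || idx >= 2 * vlen) {
                throw new IndexOutOfBoundsException("index " + idx + " not in [0, " + 2 * vlen + ")");
            }
            // The first VLEN indices address v1, the next VLEN address v2.
            result[i] = idx < vlen ? v1[idx] : v2[idx - vlen];
        }
        return result;
    }

    public static void main(String[] args) {
        int[] v1 = {10, 11, 12, 13};
        int[] v2 = {20, 21, 22, 23};
        int[] index = {7, 0, 5, 2};
        System.out.println(java.util.Arrays.toString(selectFrom(index, v1, v2)));
    }
}
```

With these inputs, the program prints `[23, 10, 21, 12]`: index 7 maps to `v2[3]`, 0 to `v1[0]`, 5 to `v2[1]` and 2 to `v1[2]`.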
> > JMH micro included with this patch shows around 10-15x gain over existing rearrange API :- > Test System: Intel(R) Xeon(R) Platinum 8480+ [ Sapphire Rapids Server] > > > Benchmark (size) Mode Cnt Score Error Units > SelectFromBenchmark.rearrangeFromByteVector 1024 thrpt 2 2041.762 ops/ms > SelectFromBenchmark.rearrangeFromByteVector 2048 thrpt 2 1028.550 ops/ms > SelectFromBenchmark.rearrangeFromIntVector 1024 thrpt 2 962.605 ops/ms > SelectFromBenchmark.rearrangeFromIntVector 2048 thrpt 2 479.004 ops/ms > SelectFromBenchmark.rearrangeFromLongVector 1024 thrpt 2 359.758 ops/ms > SelectFromBenchmark.rearrangeFromLongVector 2048 thrpt 2 178.192 ops/ms > SelectFromBenchmark.rearrangeFromShortVector 1024 thrpt 2 1463.459 ops/ms > SelectFromBenchmark.rearrangeFromShortVector 2048 thrpt 2 727.556 ops/ms > SelectFromBenchmark.selectFromByteVector 1024 thrpt 2 33254.830 ops/ms > SelectFromBenchmark.selectFromByteVector 2048 thrpt 2 17313.174 ops/ms > SelectFromBenchmark.selectFromIntVector 1024 thrpt 2 10756.804 ops/ms > SelectFromBenchmark.selectFromIntVector 2048 thrpt 2 5398.2... 
Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Adding descriptive comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/20508/files - new: https://git.openjdk.org/jdk/pull/20508/files/408a8694..8d71f175 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20508&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20508&range=05-06 Stats: 7 lines in 1 file changed: 6 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/20508.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20508/head:pull/20508 PR: https://git.openjdk.org/jdk/pull/20508 From jbhateja at openjdk.org Thu Aug 29 05:46:24 2024 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 29 Aug 2024 05:46:24 GMT Subject: RFR: 8338023: Support two vector selectFrom API [v6] In-Reply-To: References: <28KQHru1heR-YOVsRVo8Ffj_4D29IV8vD2tombvTHdI=.dba80ac3-9804-4074-ac0f-8acb9b042a08@github.com> Message-ID: On Tue, 27 Aug 2024 20:00:56 GMT, Paul Sandoz wrote: > My comment was related to understanding what `SelectFromTwoVectorNode::Ideal` and `VectorRearrangeNode::Ideal` are doing - the former lowers, if needed, into the rearrange expression and the latter adjusts, if needed, the index vector (a comment describing this transformation would be useful, like you have in the former method). Done. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/20508#issuecomment-2316759572 From dfenacci at openjdk.org Thu Aug 29 06:34:54 2024 From: dfenacci at openjdk.org (Damon Fenacci) Date: Thu, 29 Aug 2024 06:34:54 GMT Subject: RFR: 8326615: C1/C2 don't handle allocation failure properly during initialization (RuntimeStub::new_runtime_stub fatal crash) [v8] In-Reply-To: References: Message-ID: > # Issue > > The test `compiler/startup/StartupOutput.java` fails intermittently due to a crash after correctly printing the error `Initial size of CodeCache is too small` (the test limits the code cache using `-XX:InitialCodeCacheSize=1024K -XX:ReservedCodeCacheSize=1200k`). > The appearance of the issue is very dependent on thread scheduling. The original report happens during C1 initialization but C2 initialization is affected as well. > > # Causes > > There is one occurrence during C1 initialization and one during C2 initialization where a call to `RuntimeStub::new_runtime_stub` can fail fatally if there is not enough space left. > For C1: `Compiler::init_c1_runtime` -> `Runtime1::initialize` -> `Runtime1::generate_blob_for` -> `Runtime1::generate_blob` -> `RuntimeStub::new_runtime_stub`. > For C2: `C2Compiler::initialize` -> `OptoRuntime::generate` -> `OptoRuntime::generate_stub` -> `Compile::Compile` -> `Compile::Code_Gen` -> `PhaseOutput::install` -> `PhaseOutput::install_stub` -> `RuntimeStub::new_runtime_stub`. > > # Solution > > https://github.com/openjdk/jdk/pull/15970 introduced an optional argument to `RuntimeStub::new_runtime_stub` to determine if it fails fatally or not. We can take advantage of it to avoid crashing and instead pass the information about the success or failure of the allocation up the (C1 and C2 initialization) call stack up to where we can set the compilations as failed.
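The fail-soft pattern described under "Solution" can be sketched as follows. All names here are hypothetical stand-ins: this is not HotSpot code, and the real optional argument on `RuntimeStub::new_runtime_stub` may be named differently. When told that failure is not fatal, the allocator reports failure instead of aborting. Each level of the initialization chain then returns `false`, so only the top level decides to mark the compiler as failed.

```java
// Hypothetical model of propagating a stub-allocation failure instead of crashing.
public final class FailSoftStubInit {
    static final class CodeCache {
        private int freeBytes;
        private int nextId = 0;

        CodeCache(int freeBytes) { this.freeBytes = freeBytes; }

        // Returns a stub id, or -1 when there is no room and failure is not fatal.
        int newRuntimeStub(int size, boolean allocFailIsFatal) {
            if (size > freeBytes) {
                if (allocFailIsFatal) {
                    throw new IllegalStateException("CodeCache is full"); // the old fatal path
                }
                return -1;
            }
            freeBytes -= size;
            return nextId++;
        }
    }

    // Each level returns false instead of aborting, so the caller decides.
    static boolean generateBlob(CodeCache cache, int size) {
        return cache.newRuntimeStub(size, false) >= 0;
    }

    static boolean initializeRuntime(CodeCache cache, int[] stubSizes) {
        for (int size : stubSizes) {
            if (!generateBlob(cache, size)) {
                return false; // propagate the failure up the call stack
            }
        }
        return true;
    }

    public static void main(String[] args) {
        boolean ok = initializeRuntime(new CodeCache(100), new int[] {40, 40, 40});
        // At the top of the chain the compiler is marked as failed; nothing crashes.
        System.out.println(ok ? "compiler initialized" : "compiler initialization failed");
    }
}
```

With a 100-byte cache and three 40-byte stubs, the third allocation fails. The program then prints `compiler initialization failed` instead of throwing.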
Damon Fenacci has updated the pull request incrementally with two additional commits since the last revision: - JDK-8326615: fix min code cache calculation - JDK-8326615: remove empty line from problemlist ------------- Changes: - all: https://git.openjdk.org/jdk/pull/19280/files - new: https://git.openjdk.org/jdk/pull/19280/files/f79d0a31..0ea93c08 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=19280&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=19280&range=06-07 Stats: 8 lines in 4 files changed: 2 ins; 3 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/19280.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/19280/head:pull/19280 PR: https://git.openjdk.org/jdk/pull/19280 From dfenacci at openjdk.org Thu Aug 29 06:37:34 2024 From: dfenacci at openjdk.org (Damon Fenacci) Date: Thu, 29 Aug 2024 06:37:34 GMT Subject: RFR: 8326615: C1/C2 don't handle allocation failure properly during initialization (RuntimeStub::new_runtime_stub fatal crash) [v9] In-Reply-To: References: Message-ID: > # Issue > > The test `compiler/startup/StartupOutput.java` fails intermittently due to a crash after correctly printing the error `Initial size of CodeCache is too small` (the test limits the code cache using `-XX:InitialCodeCacheSize=1024K -XX:ReservedCodeCacheSize=1200k`). > The appearance of the issue is very dependent on thread scheduling. The original report happens during C1 initialization but C2 initialization is affected as well. > > # Causes > > There is one occurrence during C1 initialization and one during C2 initialization where a call to `RuntimeStub::new_runtime_stub` can fail fatally if there is not enough space left. > For C1: `Compiler::init_c1_runtime` -> `Runtime1::initialize` -> `Runtime1::generate_blob_for` -> `Runtime1::generate_blob` -> `RuntimeStub::new_runtime_stub`.
> For C2: `C2Compiler::initialize` -> `OptoRuntime::generate` -> `OptoRuntime::generate_stub` -> `Compile::Compile` -> `Compile::Code_Gen` -> `PhaseOutput::install` -> `PhaseOutput::install_stub` -> `RuntimeStub::new_runtime_stub`. > > # Solution > > https://github.com/openjdk/jdk/pull/15970 introduced an optional argument to `RuntimeStub::new_runtime_stub` to determine if it fails fatally or not. We can take advantage of it to avoid crashing and instead pass the information about the success or failure of the allocation up the (C1 and C2 initialization) call stack up to where we can set the compilations as failed. Damon Fenacci has updated the pull request incrementally with one additional commit since the last revision: JDK-8326615: update copyright year ------------- Changes: - all: https://git.openjdk.org/jdk/pull/19280/files - new: https://git.openjdk.org/jdk/pull/19280/files/0ea93c08..2bf65dad Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=19280&range=08 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=19280&range=07-08 Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/19280.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/19280/head:pull/19280 PR: https://git.openjdk.org/jdk/pull/19280 From mbaesken at openjdk.org Thu Aug 29 07:10:22 2024 From: mbaesken at openjdk.org (Matthias Baesken) Date: Thu, 29 Aug 2024 07:10:22 GMT Subject: RFR: 8333098: ubsan: bytecodeInfo.cpp:318:59: runtime error: division by zero [v3] In-Reply-To: References: Message-ID: On Wed, 28 Aug 2024 07:33:57 GMT, Matthias Baesken wrote: >> When running test >> compiler/classUnloading/methodUnloading/TestOverloadCompileQueues.java >> with ubsan enabled binaries we run into the issue reported below. >> Reason seems to be that we divide by zero in the code in some special cases (we should instead check for `CompilationPolicy::min_invocations() == 0 `and handle it separately). 
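One way to picture "handle it separately" is sketched below. The sketch assumes the failing expression has the shape `1.0 / min_invocations()`; the actual code at bytecodeInfo.cpp:318 is not quoted in this thread. The fallback value chosen for the zero case is purely illustrative, and all names are hypothetical. In Java, the unguarded double division quietly yields `Infinity`. In C++, division by zero is undefined behaviour, which is what UBSan reports.

```java
// Hypothetical sketch: guard a ratio whose denominator can legitimately be 0.
public final class GuardedRatio {
    // Unguarded form: 1.0 / 0 evaluates to Infinity in Java, so Math.max returns
    // Infinity; the equivalent C++ division is undefined behaviour.
    static double minFrequencyUnguarded(double floorRatio, int minInvocations) {
        return Math.max(floorRatio, 1.0 / minInvocations);
    }

    // Guarded form: treat minInvocations == 0 as a separate case.
    // Assumption: falling back to floorRatio is an illustrative choice only.
    static double minFrequency(double floorRatio, int minInvocations) {
        if (minInvocations == 0) {
            return floorRatio;
        }
        return Math.max(floorRatio, 1.0 / minInvocations);
    }

    public static void main(String[] args) {
        System.out.println(minFrequencyUnguarded(0.01, 0)); // Infinity
        System.out.println(minFrequency(0.01, 0));          // 0.01
        System.out.println(minFrequency(0.01, 4));          // 0.25
    }
}
```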
>> >> /jdk/src/hotspot/share/opto/bytecodeInfo.cpp:318:59: runtime error: division by zero >> #0 0x7f5145c0dda2 in InlineTree::should_not_inline(ciMethod*, ciMethod*, int, bool&, ciCallProfile&) src/hotspot/share/opto/bytecodeInfo.cpp:318 >> #1 0x7f51466366d7 in InlineTree::try_to_inline(ciMethod*, ciMethod*, int, JVMState*, ciCallProfile&, bool&) src/hotspot/share/opto/bytecodeInfo.cpp:382 >> #2 0x7f514663d36b in InlineTree::ok_to_inline(ciMethod*, JVMState*, ciCallProfile&, bool&) src/hotspot/share/opto/bytecodeInfo.cpp:596 >> #3 0x7f51470dffd6 in Compile::call_generator(ciMethod*, int, bool, JVMState*, bool, float, ciKlass*, bool) src/hotspot/share/opto/doCall.cpp:189 >> #4 0x7f51470e18ab in Parse::do_call() src/hotspot/share/opto/doCall.cpp:641 >> #5 0x7f514887dbf1 in Parse::do_one_block() src/hotspot/share/opto/parse1.cpp:1607 >> #6 0x7f514887fefa in Parse::do_all_blocks() src/hotspot/share/opto/parse1.cpp:724 >> #7 0x7f514888d4da in Parse::Parse(JVMState*, ciMethod*, float) src/hotspot/share/opto/parse1.cpp:628 >> #8 0x7f51469d8418 in ParseGenerator::generate(JVMState*) src/hotspot/share/opto/callGenerator.cpp:99 >> #9 0x7f5146d99cff in Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*) src/hotspot/share/opto/compile.cpp:793 >> #10 0x7f51469d5ebf in C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*) src/hotspot/share/opto/c2compiler.cpp:142 >> #11 0x7f5146db0274 in CompileBroker::invoke_compiler_on_method(CompileTask*) src/hotspot/share/compiler/compileBroker.cpp:2303 >> #12 0x7f5146db2826 in CompileBroker::compiler_thread_loop() src/hotspot/share/compiler/compileBroker.cpp:1961 >> #13 0x7f51478d475a in JavaThread::thread_main_inner() src/hotspot/share/runtime/javaThread.cpp:759 >> #14 0x7f51491620ea in Thread::call_run() src/hotspot/share/runtime/thread.cpp:225 >> #15 0x7f51487ac201 in thread_native_entry src/hotspot/os/linux/os_linux.cpp:846 >> #16 0x7f514e5cf6e9 in start_thread (/lib64/libpthread.so.0+0xa6e9) (BuildId: 
2f8d3c2d0f4d7888c2598d2ff6356537f5708a73) >> #17 0x7f514db1550e in clone (/lib64/libc.so.6+0x11850e) (BuildId: ... > > Matthias Baesken has updated the pull request incrementally with one additional commit since the last revision: > > add comment Thanks for the reviews ! ------------- PR Comment: https://git.openjdk.org/jdk/pull/20615#issuecomment-2316863984 From mbaesken at openjdk.org Thu Aug 29 07:10:23 2024 From: mbaesken at openjdk.org (Matthias Baesken) Date: Thu, 29 Aug 2024 07:10:23 GMT Subject: Integrated: 8333098: ubsan: bytecodeInfo.cpp:318:59: runtime error: division by zero In-Reply-To: References: Message-ID: On Fri, 16 Aug 2024 11:57:09 GMT, Matthias Baesken wrote: > When running test > compiler/classUnloading/methodUnloading/TestOverloadCompileQueues.java > with ubsan enabled binaries we run into the issue reported below. > Reason seems to be that we divide by zero in the code in some special cases (we should instead check for `CompilationPolicy::min_invocations() == 0 `and handle it separately). 
> > /jdk/src/hotspot/share/opto/bytecodeInfo.cpp:318:59: runtime error: division by zero > #0 0x7f5145c0dda2 in InlineTree::should_not_inline(ciMethod*, ciMethod*, int, bool&, ciCallProfile&) src/hotspot/share/opto/bytecodeInfo.cpp:318 > #1 0x7f51466366d7 in InlineTree::try_to_inline(ciMethod*, ciMethod*, int, JVMState*, ciCallProfile&, bool&) src/hotspot/share/opto/bytecodeInfo.cpp:382 > #2 0x7f514663d36b in InlineTree::ok_to_inline(ciMethod*, JVMState*, ciCallProfile&, bool&) src/hotspot/share/opto/bytecodeInfo.cpp:596 > #3 0x7f51470dffd6 in Compile::call_generator(ciMethod*, int, bool, JVMState*, bool, float, ciKlass*, bool) src/hotspot/share/opto/doCall.cpp:189 > #4 0x7f51470e18ab in Parse::do_call() src/hotspot/share/opto/doCall.cpp:641 > #5 0x7f514887dbf1 in Parse::do_one_block() src/hotspot/share/opto/parse1.cpp:1607 > #6 0x7f514887fefa in Parse::do_all_blocks() src/hotspot/share/opto/parse1.cpp:724 > #7 0x7f514888d4da in Parse::Parse(JVMState*, ciMethod*, float) src/hotspot/share/opto/parse1.cpp:628 > #8 0x7f51469d8418 in ParseGenerator::generate(JVMState*) src/hotspot/share/opto/callGenerator.cpp:99 > #9 0x7f5146d99cff in Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*) src/hotspot/share/opto/compile.cpp:793 > #10 0x7f51469d5ebf in C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*) src/hotspot/share/opto/c2compiler.cpp:142 > #11 0x7f5146db0274 in CompileBroker::invoke_compiler_on_method(CompileTask*) src/hotspot/share/compiler/compileBroker.cpp:2303 > #12 0x7f5146db2826 in CompileBroker::compiler_thread_loop() src/hotspot/share/compiler/compileBroker.cpp:1961 > #13 0x7f51478d475a in JavaThread::thread_main_inner() src/hotspot/share/runtime/javaThread.cpp:759 > #14 0x7f51491620ea in Thread::call_run() src/hotspot/share/runtime/thread.cpp:225 > #15 0x7f51487ac201 in thread_native_entry src/hotspot/os/linux/os_linux.cpp:846 > #16 0x7f514e5cf6e9 in start_thread (/lib64/libpthread.so.0+0xa6e9) (BuildId: 
2f8d3c2d0f4d7888c2598d2ff6356537f5708a73) > #17 0x7f514db1550e in clone (/lib64/libc.so.6+0x11850e) (BuildId: f732026552f6adff988b338e92d466bc81a01c37) This pull request has now been integrated. Changeset: f080b4bb Author: Matthias Baesken URL: https://git.openjdk.org/jdk/commit/f080b4bb8a75284db1b6037f8c00ef3b1ef1add1 Stats: 4 lines in 1 file changed: 2 ins; 0 del; 2 mod 8333098: ubsan: bytecodeInfo.cpp:318:59: runtime error: division by zero Reviewed-by: kvn, iveresov ------------- PR: https://git.openjdk.org/jdk/pull/20615 From sroy at openjdk.org Thu Aug 29 07:49:49 2024 From: sroy at openjdk.org (Suchismith Roy) Date: Thu, 29 Aug 2024 07:49:49 GMT Subject: RFR: JDK-8332423 : [PPC64] Remove C1_MacroAssembler::call_c_with_frame_resize [v6] In-Reply-To: References: Message-ID: <0XReE7gZIQm55hR-SDnLB5664zfafkhOyivXTmGEIFk=.be01447c-7667-45f5-b1c4-08688b75ce4e@github.com> > JBS Issue : [JDK-8332423](https://bugs.openjdk.org/browse/JDK-8332423) > C1_MacroAssembler::call_c_with_frame_resize is only used with frame_resize == 0. > Also, call_c is adapted to the endianness of the system. > We can adapt the existing code to handle the endianness check in one place instead of repeating it at every place that calls call_c.
Suchismith Roy has updated the pull request incrementally with two additional commits since the last revision: - header file change - remove frame_resize ------------- Changes: - all: https://git.openjdk.org/jdk/pull/19947/files - new: https://git.openjdk.org/jdk/pull/19947/files/f7d7854c..7c7de0ec Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=19947&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=19947&range=04-05 Stats: 13 lines in 3 files changed: 0 ins; 12 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/19947.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/19947/head:pull/19947 PR: https://git.openjdk.org/jdk/pull/19947 From mdoerr at openjdk.org Thu Aug 29 08:09:20 2024 From: mdoerr at openjdk.org (Martin Doerr) Date: Thu, 29 Aug 2024 08:09:20 GMT Subject: RFR: JDK-8332423 : [PPC64] Remove C1_MacroAssembler::call_c_with_frame_resize [v6] In-Reply-To: <0XReE7gZIQm55hR-SDnLB5664zfafkhOyivXTmGEIFk=.be01447c-7667-45f5-b1c4-08688b75ce4e@github.com> References: <0XReE7gZIQm55hR-SDnLB5664zfafkhOyivXTmGEIFk=.be01447c-7667-45f5-b1c4-08688b75ce4e@github.com> Message-ID: On Thu, 29 Aug 2024 07:49:49 GMT, Suchismith Roy wrote: >> JBS Issue : [JDK-8332423](https://bugs.openjdk.org/browse/JDK-8332423) >> C1_MacroAssembler::call_c_with_frame_resize is only used with frame_resize == 0. >> Also, call_c is adapted to the endianness of the system. >> We can adapt the existing code to handle the endianness check in one place instead of repeating it at every place that calls call_c. > > Suchismith Roy has updated the pull request incrementally with two additional commits since the last revision: > > - header file change > - remove frame_resize This looks good, now! Thank you! ------------- Marked as reviewed by mdoerr (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/19947#pullrequestreview-2267988286 From sroy at openjdk.org Thu Aug 29 08:19:19 2024 From: sroy at openjdk.org (Suchismith Roy) Date: Thu, 29 Aug 2024 08:19:19 GMT Subject: RFR: JDK-8332423 : [PPC64] Remove C1_MacroAssembler::call_c_with_frame_resize [v6] In-Reply-To: References: <0XReE7gZIQm55hR-SDnLB5664zfafkhOyivXTmGEIFk=.be01447c-7667-45f5-b1c4-08688b75ce4e@github.com> Message-ID: On Thu, 29 Aug 2024 08:06:54 GMT, Martin Doerr wrote: >> Suchismith Roy has updated the pull request incrementally with two additional commits since the last revision: >> >> - header file change >> - remove frame_resize > > This looks good, now! Thank you! @TheRealMDoerr I would need another review for this, right? ------------- PR Comment: https://git.openjdk.org/jdk/pull/19947#issuecomment-2316990248 From mdoerr at openjdk.org Thu Aug 29 08:22:20 2024 From: mdoerr at openjdk.org (Martin Doerr) Date: Thu, 29 Aug 2024 08:22:20 GMT Subject: RFR: JDK-8332423 : [PPC64] Remove C1_MacroAssembler::call_c_with_frame_resize [v6] In-Reply-To: References: <0XReE7gZIQm55hR-SDnLB5664zfafkhOyivXTmGEIFk=.be01447c-7667-45f5-b1c4-08688b75ce4e@github.com> Message-ID: On Thu, 29 Aug 2024 08:06:54 GMT, Martin Doerr wrote: >> Suchismith Roy has updated the pull request incrementally with two additional commits since the last revision: >> >> - header file change >> - remove frame_resize > > This looks good, now! Thank you! > @TheRealMDoerr I would need another review for this, right? Yes, please. It's not classified as trivial. Maybe you can ask somebody from your team.
------------- PR Comment: https://git.openjdk.org/jdk/pull/19947#issuecomment-2316996414 From ayang at openjdk.org Thu Aug 29 08:40:21 2024 From: ayang at openjdk.org (Albert Mingkun Yang) Date: Thu, 29 Aug 2024 08:40:21 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v2] In-Reply-To: <6rvTU-KY2DpLx3sK7zmkaZCBaNELr3FLGItGwUJzNUM=.0e75c7a7-4309-49c5-b48b-5aa642bcbe43@github.com> References: <4c-MLXwKcNcSnloSkYkuk3gnv3ux5i5beS51Fd9Z8MQ=.cd0a7eba-ff26-4855-a01c-d1ae5182100b@github.com> <8fuUEkswt05x0IuT4PrNQuYgLd49g4EpZWOPPQog4PQ=.70b5edb6-98d0-4276-8578-f7a496b7f2a7@github.com> <3H3rBSKDnpg5fmYqcZ5hT9yH2EAxCocycRompQJJCOo=.1b30fd89-09e9-4708-bd20-cdea00e809a7@github.com> <7Bjcf6MF4aTBuk4DmTnGzP0WwCWqJx_sv5k2sGMt9No=.ca426f04-4fa5-4ab1-a414-8c5e6a4e0dce@github.com> <6PF-kgezzOb9Ed7j-BbrwaURnLJH5aFgOouFwTYiFrE=.670092b8-6980-43f3-a091-25312cfa0f1b@github.com> <1vQH6zpEgjhIO_mq9DCnpwxgDXmnqdx0owlvjJq4Fcw=.78e60c6e-2b23-4e94-a998-e7ba9eafcb6a@github.com> <6rvTU-KY2DpLx3sK7zmkaZCBaNELr3FLGItGwUJzNUM=.0e75c7a7-4309-49c5-b48b-5aa642bcbe43@github.com> Message-ID: On Wed, 28 Aug 2024 15:46:57 GMT, Roberto Castañeda Lozano wrote: >> Thanks for trying! I like your refactored version. I prefer moving stuff out of the .ad files. The complexity of the barrier code is not significantly higher. Note that you could use `decode_heap_oop_not_null(Register dst, Register src)` with `tmp2` as dst (`generate_post_barrier_fast_path` can deal with new_val == tmp2). That would save a move instruction in some cases. >> I haven't looked into the aarch64 code. I'll leave the decision to you. > > Thanks, I prototyped the refactored version for both x64 and aarch64 [here](https://github.com/robcasloz/jdk/commit/c1ae871eadac0d44981b7892ac8f7b64e8734283). I do not have a strong opinion for or against this refactoring. @albertnetymk @kimbarrett what do you think about it? (asking since you have recently looked at this code).
I find the use of default-arg-value `bool decode_new_val = false` a bit confusing. (I tend to think default-arg-value makes the code less readable in general.) If not using default-arg-value, I suspect the diff will be larger, and I don't see the immediate benefit of this refactoring. Maybe this can be deferred to its own PR if it's really desirable? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1735806805 From rcastanedalo at openjdk.org Thu Aug 29 09:11:23 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Thu, 29 Aug 2024 09:11:23 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v2] In-Reply-To: References: <4c-MLXwKcNcSnloSkYkuk3gnv3ux5i5beS51Fd9Z8MQ=.cd0a7eba-ff26-4855-a01c-d1ae5182100b@github.com> <8fuUEkswt05x0IuT4PrNQuYgLd49g4EpZWOPPQog4PQ=.70b5edb6-98d0-4276-8578-f7a496b7f2a7@github.com> <3H3rBSKDnpg5fmYqcZ5hT9yH2EAxCocycRompQJJCOo=.1b30fd89-09e9-4708-bd20-cdea00e809a7@github.com> <7Bjcf6MF4aTBuk4DmTnGzP0WwCWqJx_sv5k2sGMt9No=.ca426f04-4fa5-4ab1-a414-8c5e6a4e0dce@github.com> <6PF-kgezzOb9Ed7j-BbrwaURnLJH5aFgOouFwTYiFrE=.670092b8-6980-43f3-a091-25312cfa0f1b@github.com> <1vQH6zpEgjhIO_mq9DCnpwxgDXmnqdx0owlvjJq4Fcw=.78e60c6e-2b23-4e94-a998-e7ba9eafcb6a@github.com> <6rvTU-KY2DpLx3sK7zmkaZCBaNELr3FLGItGwUJzNUM=.0e75c7a7-4309-49c5-b48b-5aa642bcbe43@github.com> Message-ID: On Thu, 29 Aug 2024 08:37:24 GMT, Albert Mingkun Yang wrote: >> Thanks, I prototyped the refactored version for both x64 and aarch64 [here](https://github.com/robcasloz/jdk/commit/c1ae871eadac0d44981b7892ac8f7b64e8734283). I do not have a strong opinion for or against this refactoring. @albertnetymk @kimbarrett what do you think about it? (asking since you have recently looked at this code). > > I find the use of default-arg-value `bool decode_new_val = false` a bit confusing. (I tend to think default-arg-value makes the code less readable in general.)
> > If not using default-arg-value, I suspect the diff will be larger, and I don't see the immediate benefit of this refactoring. Maybe this can be deferred to its own PR if it's really desirable? Thanks for looking at it, Albert! Since there is no clear consensus, let's postpone the refactoring. We can come back to it after the JEP is integrated if there is renewed interest. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1735852561 From duke at openjdk.org Thu Aug 29 09:44:45 2024 From: duke at openjdk.org (Yagmur Eren) Date: Thu, 29 Aug 2024 09:44:45 GMT Subject: RFR: 8330159: [C2] Remove or clarify Compile::init_start [v2] In-Reply-To: <7Q9VAVqBUDDnqEcCOWeRh5N-YC0dsg9lyQsOv6xbO80=.6d55dc58-0281-45fd-a92f-d0f6ddd910cc@github.com> References: <7Q9VAVqBUDDnqEcCOWeRh5N-YC0dsg9lyQsOv6xbO80=.6d55dc58-0281-45fd-a92f-d0f6ddd910cc@github.com> Message-ID: > The Compile::init_start method contained only an assertion. To clean up, this method is removed and each place where it is called is replaced with the corresponding assertion. See issue: [JDK-8330159](https://bugs.openjdk.org/browse/JDK-8330159) Yagmur Eren has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains one commit: Rename Compile::init_start as Compile::verify_start and make it debug only ------------- Changes: https://git.openjdk.org/jdk/pull/20715/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=20715&range=01 Stats: 12 lines in 3 files changed: 3 ins; 3 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/20715.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20715/head:pull/20715 PR: https://git.openjdk.org/jdk/pull/20715 From duke at openjdk.org Thu Aug 29 09:47:18 2024 From: duke at openjdk.org (Yagmur Eren) Date: Thu, 29 Aug 2024 09:47:18 GMT Subject: RFR: 8330159: [C2] Remove or clarify Compile::init_start In-Reply-To: References: <7Q9VAVqBUDDnqEcCOWeRh5N-YC0dsg9lyQsOv6xbO80=.6d55dc58-0281-45fd-a92f-d0f6ddd910cc@github.com> Message-ID: On Tue, 27 Aug 2024 07:36:21 GMT, Christian Hagedorn wrote: >> The Compile::init_start method contained only an assertion. To clean up, this method is removed and each place where it is called is replaced with the corresponding assertion. See issue: [JDK-8330159](https://bugs.openjdk.org/browse/JDK-8330159) > > Hi @nelanbu, I don't think this is correct. In `Compile::start()`, we have the following code: > https://github.com/openjdk/jdk/blob/b8e8e965e541881605f9dbcd4d9871d4952b9232/src/hotspot/share/opto/compile.cpp#L1121-L1131 > > It asserts that `failing()` is false. Therefore, `init_start()` bails out before checking the assert with `start()`, which your refactoring no longer does. > > What you could do instead: > - Simplify the code in `init_start()` to the following and add an assertion message: > > assert(failing() || s == start(), "should be StartNode"); > > - Rename `init_start_node()` to a more meaningful name like `verify_start()`, as we are not actually initializing anything but rather sanity checking the start node. > - Guard the method with `DEBUG_ONLY/ifdef ASSERT` since it only calls an assert in the debug VM and does nothing in the product VM.
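The three review suggestions above (short-circuit on `failing()` before touching `start()`, a `verify_` name, and a debug-only guard) can be sketched in standalone C++ as follows. `Compile` and `Node` are hypothetical minimal stand-ins, and `ASSERT`/`DEBUG_ONLY` are redefined locally (tied to `NDEBUG`, an assumption) only to imitate HotSpot's convention. HotSpot's `assert` takes a separate message argument; standard `assert` is used here with the `&& "message"` idiom instead.

```cpp
#include <cassert>

// Local stand-ins for HotSpot's ASSERT / DEBUG_ONLY convention, defined here
// only so the sketch compiles on its own: DEBUG_ONLY(code) expands to `code`
// when ASSERT is defined and to nothing otherwise.
#ifndef NDEBUG
#define ASSERT 1
#endif
#ifdef ASSERT
#define DEBUG_ONLY(code) code
#else
#define DEBUG_ONLY(code)
#endif

struct Node {};

// Hypothetical, heavily simplified Compile: only what the check needs.
class Compile {
 public:
  Compile(Node* start, bool failing) : _start(start), _failing(failing) {}
  bool failing() const { return _failing; }
  // Mirrors the point made in the review: start() itself asserts !failing().
  Node* start() const {
    assert(!failing());
    return _start;
  }
  // Nothing is initialized, so "verify" describes the method better.
  // failing() is tested first, so start() is never reached after a bailout.
  DEBUG_ONLY(void verify_start(Node* s) const {
    assert((failing() || s == start()) && "should be StartNode");
  })

 private:
  Node* _start;
  bool _failing;
};
```

In a product-style build (here: with `NDEBUG`), `verify_start` does not exist at all, so call sites would need the same `DEBUG_ONLY` wrapping.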
Hello @chhagedorn, thanks a lot for your feedback! I updated accordingly. Hope it looks better now. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20715#issuecomment-2317179737 From fgao at openjdk.org Thu Aug 29 09:58:21 2024 From: fgao at openjdk.org (Fei Gao) Date: Thu, 29 Aug 2024 09:58:21 GMT Subject: RFR: 8339063: [aarch64] Skip verify_sve_vector_length after native calls if SVE supports 128 bits VL only [v2] In-Reply-To: References: Message-ID: On Wed, 28 Aug 2024 08:51:11 GMT, Joshua Zhu wrote: >> Please review this minor enhancement that skips verify_sve_vector_length after native calls. >> It works on SVE micro-architecture that only supports 128-bit vector length. > > Joshua Zhu has updated the pull request incrementally with one additional commit since the last revision: > > Fix compilation failure with --disable-precompiled-headers src/hotspot/cpu/aarch64/aarch64_vector.ad line 158: > 156: > 157: int length_in_bytes = vlen * type2aelembytes(bt); > 158: if (UseSVE == 0 && length_in_bytes > FloatRegister::neon_vl) { Should we also update `aarch64_vector_ad.m4` to avoid any mismatch :) ? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20724#discussion_r1735911352 From jzhu at openjdk.org Thu Aug 29 10:54:19 2024 From: jzhu at openjdk.org (Joshua Zhu) Date: Thu, 29 Aug 2024 10:54:19 GMT Subject: RFR: 8339063: [aarch64] Skip verify_sve_vector_length after native calls if SVE supports 128 bits VL only [v2] In-Reply-To: References: Message-ID: On Thu, 29 Aug 2024 09:50:33 GMT, Fei Gao wrote: >> Joshua Zhu has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix compilation failure with --disable-precompiled-headers > > src/hotspot/cpu/aarch64/aarch64_vector.ad line 158: > >> 156: >> 157: int length_in_bytes = vlen * type2aelembytes(bt); >> 158: if (UseSVE == 0 && length_in_bytes > FloatRegister::neon_vl) { > > Should we also update `aarch64_vector_ad.m4` to avoid any mismatch :) ? 
Nice catch! I overlooked this place. Thanks for your reminder! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20724#discussion_r1735987350 From gcao at openjdk.org Thu Aug 29 11:00:44 2024 From: gcao at openjdk.org (Gui Cao) Date: Thu, 29 Aug 2024 11:00:44 GMT Subject: RFR: 8339237: RISC-V: Builds fail after JDK-8339120 Message-ID: Encountered this build warning/error when doing a native build on linux-riscv64 platform with GCC-13. We can simply remove float_regs_as_doubles_size_in_slots in file c1_Runtime1_riscv.cpp as it is not used anywhere. ### Testing - [x] release & fastdebug build OK on linux-riscv64 after this change. ------------- Commit messages: - 8339237: RISC-V: Builds fail after JDK-8339120 Changes: https://git.openjdk.org/jdk/pull/20765/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=20765&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8339237 Stats: 2 lines in 1 file changed: 0 ins; 2 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/20765.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20765/head:pull/20765 PR: https://git.openjdk.org/jdk/pull/20765 From fyang at openjdk.org Thu Aug 29 11:00:44 2024 From: fyang at openjdk.org (Fei Yang) Date: Thu, 29 Aug 2024 11:00:44 GMT Subject: RFR: 8339237: RISC-V: Builds fail after JDK-8339120 In-Reply-To: References: Message-ID: On Thu, 29 Aug 2024 10:56:01 GMT, Gui Cao wrote: > Encountered this build warning/error when doing a native build on linux-riscv64 platform with GCC-13. We can simply remove float_regs_as_doubles_size_in_slots in file c1_Runtime1_riscv.cpp as it is not used anywhere. > > ### Testing > - [x] release & fastdebug build OK on linux-riscv64 after this change. Looks good and trivial. Thanks. ------------- Marked as reviewed by fyang (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/20765#pullrequestreview-2268380386 From chagedorn at openjdk.org Thu Aug 29 11:09:18 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 29 Aug 2024 11:09:18 GMT Subject: RFR: 8325495: C2: implement optimization for series of Add of unique value In-Reply-To: References: Message-ID: <9bufdoCabT8Xy6esVgnfU324AlnK7OOh9eXPVuGzjGs=.b73f651e-8820-4820-88a8-859320c9b33b@github.com> On Wed, 28 Aug 2024 19:27:29 GMT, Kangcheng Xu wrote: > This pull request resolves [JDK-8325495](https://bugs.openjdk.org/browse/JDK-8325495) by converting a series of additions of the same operand into a multiplication, i.e., `a + a + ... + a + a + a => n*a`. > > As an added benefit, it also converts `C * a + a` into `(C+1) * a` and `(a << C) + a` into `(2^C + 1) * a` (for a constant `C`). This is actually a side effect of IGVN being iterative: when converting the `i`-th addition, the previous `i-1` additions have already been optimized to a multiplication (and, further, into bit shifts and additions/subtractions if possible). > > Some notable examples of this transformation include: > - `a + a + a` => `a*3` => `(a<<1) + a` > - `a + a + a + a` => `a*4` => `a<<2` > - `a*3 + a` => `a*4` => `a<<2` > - `(a << 1) + a + a` => `a*2 + a + a` => `a*3 + a` => `a*4` => `a<<2` > > See the included IR unit tests for more. I gave your patch a quick spin in our testing. The existing test `compiler/c2/TestLargeTreeOfSubNodes.java` times out on linux-x64-debug with the following two flag combinations: 1. -XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=0 -XX:PerMethodTrapLimit=0 2. -XX:+UnlockDiagnosticVMOptions -XX:-TieredCompilation -XX:+StressArrayCopyMacroNode -XX:+StressLCM -XX:+StressGCM -XX:+StressIGVN -XX:+StressCCP -XX:+StressMacroExpansion -XX:+StressMethodHandleLinkerInlining -XX:+StressCompiledExceptionHandlers The test itself was created for an issue with `AddNode::IdealIL()` (see https://github.com/openjdk/jdk/pull/15923).
Maybe there is a similar problem with your patch. The stack trace at the timeout looks like this: #0 0x00007fb0b813f456 in AddNode::find_repeated_operand_in_chained_addition(PhaseGVN*, Node*, Node**, int*) () from /opt/mach5/mesos/work_dir/jib-master/install/2024-08-29-0545095.christian.hagedorn.jdk-test/linux-x64-debug.jdk/jdk-24/fastdebug/lib/server/libjvm.so #1 0x00007fb0b813f4dc in AddNode::find_repeated_operand_in_chained_addition(PhaseGVN*, Node*, Node**, int*) () from /opt/mach5/mesos/work_dir/jib-master/install/2024-08-29-0545095.christian.hagedorn.jdk-test/linux-x64-debug.jdk/jdk-24/fastdebug/lib/server/libjvm.so ... ... #54 0x00007fb0b813f4dc in AddNode::find_repeated_operand_in_chained_addition(PhaseGVN*, Node*, Node**, int*) () from /opt/mach5/mesos/work_dir/jib-master/install/2024-08-29-0545095.christian.hagedorn.jdk-test/linux-x64-debug.jdk/jdk-24/fastdebug/lib/server/libjvm.so #55 0x00007fb0b813fc03 in AddNode::IdealIL(PhaseGVN*, bool, BasicType) () from /opt/mach5/mesos/work_dir/jib-master/install/2024-08-29-0545095.christian.hagedorn.jdk-test/linux-x64-debug.jdk/jdk-24/fastdebug/lib/server/libjvm.so #56 0x00007fb0b90d5b3d in PhaseIterGVN::transform_old(Node*) () from /opt/mach5/mesos/work_dir/jib-master/install/2024-08-29-0545095.christian.hagedorn.jdk-test/linux-x64-debug.jdk/jdk-24/fastdebug/lib/server/libjvm.so #57 0x00007fb0b92c93a6 in SubINode::Ideal(PhaseGVN*, bool) () from /opt/mach5/mesos/work_dir/jib-master/install/2024-08-29-0545095.christian.hagedorn.jdk-test/linux-x64-debug.jdk/jdk-24/fastdebug/lib/server/libjvm.so #58 0x00007fb0b90d5b3d in PhaseIterGVN::transform_old(Node*) () from /opt/mach5/mesos/work_dir/jib-master/install/2024-08-29-0545095.christian.hagedorn.jdk-test/linux-x64-debug.jdk/jdk-24/fastdebug/lib/server/libjvm.so #59 0x00007fb0b90cc31c in PhaseIterGVN::optimize() () from /opt/mach5/mesos/work_dir/jib-master/install/2024-08-29-0545095.christian.hagedorn.jdk-test/linux-x64-debug.jdk/jdk-24/fastdebug/lib/server/libjvm.so 
#60 0x00007fb0b857aed2 in Compile::process_for_post_loop_opts_igvn(PhaseIterGVN&) () from /opt/mach5/mesos/work_dir/jib-master/install/2024-08-29-0545095.christian.hagedorn.jdk-test/linux-x64-debug.jdk/jdk-24/fastdebug/lib/server/libjvm.so #61 0x00007fb0b8580735 in Compile::Optimize() () from /opt/mach5/mesos/work_dir/jib-master/install/2024-08-29-0545095.christian.hagedorn.jdk-test/linux-x64-debug.jdk/jdk-24/fastdebug/lib/server/libjvm.so ------------- PR Comment: https://git.openjdk.org/jdk/pull/20754#issuecomment-2317336911 From duke at openjdk.org Thu Aug 29 11:23:50 2024 From: duke at openjdk.org (Yagmur Eren) Date: Thu, 29 Aug 2024 11:23:50 GMT Subject: RFR: 8330159: [C2] Remove or clarify Compile::init_start [v3] In-Reply-To: <7Q9VAVqBUDDnqEcCOWeRh5N-YC0dsg9lyQsOv6xbO80=.6d55dc58-0281-45fd-a92f-d0f6ddd910cc@github.com> References: <7Q9VAVqBUDDnqEcCOWeRh5N-YC0dsg9lyQsOv6xbO80=.6d55dc58-0281-45fd-a92f-d0f6ddd910cc@github.com> Message-ID: <4tc9doDDziDT16yv9jghJoWtiPO3AEWOuO0wfPk1QGs=.9ff1f3c3-82e0-4ab1-8437-bc12ea34820d@github.com> > The Compile::init_start method contained only an assertion. To clean up, this method is removed and each place where it is called is replaced with the corresponding assertion.
See issue: [JDK-8330159](https://bugs.openjdk.org/browse/JDK-8330159) Yagmur Eren has updated the pull request incrementally with one additional commit since the last revision: remove method header ------------- Changes: - all: https://git.openjdk.org/jdk/pull/20715/files - new: https://git.openjdk.org/jdk/pull/20715/files/fb5fd187..192eaaae Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20715&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20715&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 1 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/20715.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20715/head:pull/20715 PR: https://git.openjdk.org/jdk/pull/20715 From gcao at openjdk.org Thu Aug 29 14:21:20 2024 From: gcao at openjdk.org (Gui Cao) Date: Thu, 29 Aug 2024 14:21:20 GMT Subject: RFR: 8339237: RISC-V: Builds fail after JDK-8339120 In-Reply-To: References: Message-ID: On Thu, 29 Aug 2024 10:56:01 GMT, Gui Cao wrote: > Encountered this build warning/error when doing a native build on linux-riscv64 platform with GCC-13. We can simply remove float_regs_as_doubles_size_in_slots in file c1_Runtime1_riscv.cpp as it is not used anywhere. > > ### Testing > - [x] release & fastdebug build OK on linux-riscv64 after this change. Thanks for the review. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20765#issuecomment-2317816390 From duke at openjdk.org Thu Aug 29 14:21:20 2024 From: duke at openjdk.org (duke) Date: Thu, 29 Aug 2024 14:21:20 GMT Subject: RFR: 8339237: RISC-V: Builds fail after JDK-8339120 In-Reply-To: References: Message-ID: On Thu, 29 Aug 2024 10:56:01 GMT, Gui Cao wrote: > Encountered this build warning/error when doing a native build on linux-riscv64 platform with GCC-13. We can simply remove float_regs_as_doubles_size_in_slots in file c1_Runtime1_riscv.cpp as it is not used anywhere. > > ### Testing > - [x] release & fastdebug build OK on linux-riscv64 after this change. 
@zifeihan Your change (at version edadbddf3ee653eebec4d63cbe0dbc78638f1108) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20765#issuecomment-2317824436 From mli at openjdk.org Thu Aug 29 15:17:33 2024 From: mli at openjdk.org (Hamlin Li) Date: Thu, 29 Aug 2024 15:17:33 GMT Subject: RFR: 8338407: Support grouping several of existing regs into a new one Message-ID: Hi, Can you help to review this patch to add `group` support to operand? ### Some background about this pr In some platforms, there is some concept like a group of registers, for example on riscv there is vector group, which is a group of other single vectors. For example, m2 could be v2+v3, or v4+v5, m4 could be v4+v5+v6+v7, or v8+v9+v10+v11. And, it's helpful to represent these vector group explicitly, otherwise it's tedious and error-prone. For example, in existing code, there's some like below: instruct vstring_compareUL(iRegP_R11 str1, iRegI_R12 cnt1, iRegP_R13 str2, iRegI_R14 cnt2, iRegI_R10 result, vReg_V4 v4, vReg_V5 v5, vReg_V6 v6, vReg_V7 v7, vReg_V8 v8, vReg_V9 v9, vReg_V10 v10, vReg_V11 v11, iRegP_R28 tmp1, iRegL_R29 tmp2) // ... effect(KILL tmp1, KILL tmp2, USE_KILL str1, USE_KILL str2, USE_KILL cnt1, USE_KILL cnt2, TEMP v4, TEMP v5, TEMP v6, TEMP v7, TEMP v8, TEMP v9, TEMP v10, TEMP v11); // ... __ string_compare_v($str1$$Register, $str2$$Register, $cnt1$$Register, $cnt2$$Register, $result$$Register, $tmp1$$Register, $tmp2$$Register, StrIntrinsicNode::UL); The potential problems of the above code are that we need to 1. write v4~v11 explicitly in its `instruct` and its `effect`, it's tedious; 2. vector group are represented implicitly, which is not clear and error-prone; 3. in its encoding `string_compare_v`, we need to specify m4, and v4/v8 explicitly. 4. if some day we need to adjust from m4 to m2 or m8, it's really tedious and error-prone to make that change in both ad file and macro assembler files. 
### This PR The proposed solution is to represent a group of vector registers with a real vector group, e.g. replacing `vReg_V4 v4, vReg_V5 v5, vReg_V6 v6, vReg_V7 v7` with `vReg_V4M4 v4m4` and `TEMP v4, TEMP v5, TEMP v6, TEMP v7` with `TEMP v4m4`; in the `string_compare_v` implementation, we can then query the length of the vector group (i.e. m4 in this case) and set its vtype automatically. This solution solves the issues listed above, especially the last one: if in the future we need to change m4 to m2 or m8, we only need to change the code in the ad file, the change is simpler, and no change to string_compare_v is needed. ### What it looks like For more usage details, please check [here](https://github.com/openjdk/jdk/compare/master...Hamlin-Li:jdk:explicit-v-reg-group-v3), the riscv part. Basically it looks like this: operand reg_x() %{ group(reg_y, reg_z) %} This reg_x can then be used in an instruct as an input operand, as a TEMP in the effect list, and as a parameter in ins_encode. Under the hood, a group operand is automatically ungrouped into separate operands. ### Alternatives I tried several solutions. 1. One of them was to just add some new reg classes and operands; it sort of worked, but it could only prevent other registers in an instruct from using one of the vector registers in a vector group. 2. Another was to modify C2's chaitin register allocator to support vector groups; in the end this turned out not to be practical, as it required too many changes to basic code in chaitin and the impact was too big to be acceptable. The current solution changes the adlc parser: it basically works like a macro expansion, replacing vector groups with individual vectors. This way the chaitin implementation does not need any changes, which also means it is simpler to implement and hopefully more acceptable to the community. Thanks!
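The ungrouping step is described as working "like a macro expansion". A minimal C++ sketch of that idea on a toy model follows: a hypothetical table maps each group operand to its members, and every grouped operand in a parameter or effect list is replaced, in order, by those members. This is not the adlc parser code from the PR; the table, the `ungroup` function, and its depth guard are assumptions for illustration.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical table from a group operand to the operands it stands for,
// e.g. a RISC-V LMUL=4 group "v4m4" covering v4..v7. In the PR this
// information comes from `group(...)` in an operand definition; here it is
// just a map so the expansion step can be shown in isolation.
using GroupTable = std::map<std::string, std::vector<std::string>>;

// Macro-expansion-style ungrouping: every grouped operand in `operands` is
// replaced, in order, by its members; other operands pass through unchanged.
// Nested groups are expanded recursively; `depth` guards against a cyclic
// table (group definitions are assumed to be acyclic).
std::vector<std::string> ungroup(const std::vector<std::string>& operands,
                                 const GroupTable& groups, int depth = 0) {
  assert(depth < 8 && "group definitions are assumed to be acyclic");
  std::vector<std::string> out;
  for (const std::string& op : operands) {
    auto it = groups.find(op);
    if (it == groups.end()) {
      out.push_back(op);
      continue;
    }
    std::vector<std::string> members = ungroup(it->second, groups, depth + 1);
    out.insert(out.end(), members.begin(), members.end());
  }
  return out;
}
```

Because expansion happens before register allocation sees the operands, the allocator only ever deals with ordinary single-register operands, which matches the PR's point that chaitin needs no changes.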
------------- Commit messages: - clean - Initial commit - rename from expand/unexpand to ungroup/group - simplify group - warning - assert -> syntax error - Merge branch 'master' into explicit-v-reg-group-v3 - fix assert - Merge branch 'master' into explicit-v-reg-group-v3 - fix warning - ... and 10 more: https://git.openjdk.org/jdk/compare/129f527f...f4be2c6c Changes: https://git.openjdk.org/jdk/pull/20775/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=20775&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8338407 Stats: 398 lines in 8 files changed: 320 ins; 30 del; 48 mod Patch: https://git.openjdk.org/jdk/pull/20775.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20775/head:pull/20775 PR: https://git.openjdk.org/jdk/pull/20775 From kvn at openjdk.org Thu Aug 29 16:25:26 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 29 Aug 2024 16:25:26 GMT Subject: RFR: 8326615: C1/C2 don't handle allocation failure properly during initialization (RuntimeStub::new_runtime_stub fatal crash) [v9] In-Reply-To: References: Message-ID: On Thu, 29 Aug 2024 06:37:34 GMT, Damon Fenacci wrote: >> # Issue >> >> The test `compiler/startup/StartupOutput.java` fails intermittently due to a crash after correctly printing the error `Initial size of CodeCache is too small` (the test limits the code cache using `-XX:InitialCodeCacheSize=1024K -XX:ReservedCodeCacheSize=1200k`). >> The appearance of the issue is very dependent on thread scheduling. The original report happens during C1 initialization but C2 initialization is affected as well. >> >> # Causes >> >> There is one occurrence during C1 initialization and one during C2 initialization where a call to `RuntimeStub::new_runtime_stub` can fail fatally if there is not enough space left. >> For C1: `Compiler::init_c1_runtime` -> `Runtime1::initialize` -> `Runtime1::generate_blob_for` -> `Runtime1::generate_blob` -> `RuntimeStub::new_runtime_stub`.
>> For C2: `C2Compiler::initialize` -> `OptoRuntime::generate` -> `OptoRuntime::generate_stub` -> `Compile::Compile` -> `Compile::Code_Gen` -> `PhaseOutput::install` -> `PhaseOutput::install_stub` -> `RuntimeStub::new_runtime_stub`. >> >> # Solution >> >> https://github.com/openjdk/jdk/pull/15970 introduced an optional argument to `RuntimeStub::new_runtime_stub` to determine if it fails fatally or not. We can take advantage of it to avoid crashing and instead pass the information about the success or failure of the allocation up the (C1 and C2 initialization) call stack to the point where we can mark the compilations as failed. > > Damon Fenacci has updated the pull request incrementally with one additional commit since the last revision: > > JDK-8326615: update copyright year src/hotspot/share/compiler/compilerDefinitions.inline.hpp line 29: > 27: > 28: #include "c1/c1_Compiler.hpp" > 29: #include "opto/c2compiler.hpp" These should also be under COMPILER*_PRESENT macros. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19280#discussion_r1736635992 From kxu at openjdk.org Thu Aug 29 16:49:19 2024 From: kxu at openjdk.org (Kangcheng Xu) Date: Thu, 29 Aug 2024 16:49:19 GMT Subject: RFR: 8325495: C2: implement optimization for series of Add of unique value In-Reply-To: <9bufdoCabT8Xy6esVgnfU324AlnK7OOh9eXPVuGzjGs=.b73f651e-8820-4820-88a8-859320c9b33b@github.com> References: <9bufdoCabT8Xy6esVgnfU324AlnK7OOh9eXPVuGzjGs=.b73f651e-8820-4820-88a8-859320c9b33b@github.com> Message-ID: On Thu, 29 Aug 2024 11:06:21 GMT, Christian Hagedorn wrote: > The existing test compiler/c2/TestLargeTreeOfSubNodes.java times out... I'm currently investigating this. The test is unrolled and produces a chain of ~180 subtraction nodes, which causes a performance regression. This will take a little bit of time.
------------- PR Comment: https://git.openjdk.org/jdk/pull/20754#issuecomment-2318347858 From matsaave at openjdk.org Thu Aug 29 17:20:27 2024 From: matsaave at openjdk.org (Matias Saavedra Silva) Date: Thu, 29 Aug 2024 17:20:27 GMT Subject: RFR: 8338924: C1: assert(0 <= i && i < _len) failed: illegal index 5 for length 5 Message-ID: The change in [JDK-8335664](https://bugs.openjdk.org/browse/JDK-8335664) caused a crash when initializing basic blocks with `-Xcomp`. This change introduces a check to see if JSR is the last bytecode in its method so that expected behavior matches the previous patch. Verified with tier 1-6 tests. ------------- Commit messages: - Copyright date - Removed from Problemlist-Xcomp - Merge branch 'master' into jsr_test_8338924 - Dean patch - Moved conditional up - Removed test from problem list - Corrected previous change - Merge branch 'master' into jsr_test_8338924 - Removed repeated code - 8338924: runtime/interpreter/LastJsrTest.java fails assert(0 <= i && i < _len) failed: illegal index 5 for length 5 Changes: https://git.openjdk.org/jdk/pull/20732/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=20732&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8338924 Stats: 25 lines in 4 files changed: 19 ins; 2 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/20732.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20732/head:pull/20732 PR: https://git.openjdk.org/jdk/pull/20732 From dlong at openjdk.org Thu Aug 29 17:20:27 2024 From: dlong at openjdk.org (Dean Long) Date: Thu, 29 Aug 2024 17:20:27 GMT Subject: RFR: 8338924: C1: assert(0 <= i && i < _len) failed: illegal index 5 for length 5 In-Reply-To: References: Message-ID: On Tue, 27 Aug 2024 18:01:16 GMT, Matias Saavedra Silva wrote: > The change in [JDK-8335664](https://bugs.openjdk.org/browse/JDK-8335664) caused a crash when initializing basic blocks with `-Xcomp`. 
This change introduces a check to see if JSR is the last bytecode in its method so that expected behavior matches the previous patch. Verified with tier 1-6 tests. Please add a loop to the test so it triggers this issue without -Xcomp. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20732#issuecomment-2316469683 From dlong at openjdk.org Thu Aug 29 19:32:18 2024 From: dlong at openjdk.org (Dean Long) Date: Thu, 29 Aug 2024 19:32:18 GMT Subject: RFR: 8325495: C2: implement optimization for series of Add of unique value In-Reply-To: References: Message-ID: On Wed, 28 Aug 2024 19:27:29 GMT, Kangcheng Xu wrote: > This pull request resolves [JDK-8325495](https://bugs.openjdk.org/browse/JDK-8325495) by converting a series of additions of the same operand into multiplications. I.e., `a + a + ... + a + a + a => n*a`. > > As an added benefit, it also converts `C * a + a` into `(C+1) * a` and `(a << C) + a` into `(2^C + 1) * a` (with respect to constant `C`). This is actually a side effect of IGVN being iterative: when converting the `i`-th addition, the previous `i-1` additions would have already been optimized to multiplication (and thus, further into bit shifts and additions/subtractions if possible). > > Some notable examples of this transformation include: > - `a + a + a` => `a*3` => `(a<<1) + a` > - `a + a + a + a` => `a*4` => `a<<2` > - `a*3 + a` => `a*4` => `a<<2` > - `(a << 1) + a + a` => `a*2 + a + a` => `a*3 + a` => `a*4` => `a<<2` > > See the included IR unit tests for more. Can we redistribute the repeated terms and turn `a + x + a + y + a` into `a*3 + x + y`?
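The Java-level identities behind this transformation can be checked directly. The sketch below relies on Java `int` arithmetic wrapping modulo 2^32. Under that rule, rewriting repeated additions as a multiplication or shift, and regrouping repeated terms as Dean asks, preserves the value even when intermediate results overflow. The class and method names are hypothetical and for illustration only. The sketch checks the arithmetic, not C2's IGVN rewriting itself.

```java
public class AddChainIdentities {

    // Literal chains of additions, as they might appear in source code.
    static int threeAdds(int a) { return a + a + a; }
    static int fourAdds(int a) { return a + a + a + a; }

    // Forms the PR describes the chains being rewritten into.
    static int timesThreeAsShiftAdd(int a) { return (a << 1) + a; }
    static int timesFourAsShift(int a) { return a << 2; }

    // C * a + a == (C + 1) * a, and (a << C) + a == (2^C + 1) * a.
    static boolean constantFoldingHolds(int a, int c, int shift) {
        return c * a + a == (c + 1) * a
            && (a << shift) + a == ((1 << shift) + 1) * a;
    }

    // Regrouping repeated terms: a + x + a + y + a == a * 3 + x + y.
    static boolean regroupingHolds(int a, int x, int y) {
        return a + x + a + y + a == a * 3 + x + y;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 7, 123456789, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int a : samples) {
            if (threeAdds(a) != timesThreeAsShiftAdd(a) || fourAdds(a) != timesFourAsShift(a)) {
                throw new AssertionError("shift/add mismatch for " + a);
            }
            if (!constantFoldingHolds(a, 5, 3) || !regroupingHolds(a, 42, -17)) {
                throw new AssertionError("identity mismatch for " + a);
            }
        }
        System.out.println("all identities hold, including on overflow");
    }
}
```

The sample values deliberately include `Integer.MAX_VALUE` and `Integer.MIN_VALUE`, where every chain overflows. Because overflow wraps identically on both sides of each identity, the rewritten forms still match.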
------------- PR Comment: https://git.openjdk.org/jdk/pull/20754#issuecomment-2318719149 From sviswanathan at openjdk.org Thu Aug 29 23:41:22 2024 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Thu, 29 Aug 2024 23:41:22 GMT Subject: RFR: 8338021: Support saturating vector operators in VectorAPI [v4] In-Reply-To: References: Message-ID: On Mon, 19 Aug 2024 07:19:30 GMT, Jatin Bhateja wrote: >> Hi All, >> >> As per the discussion on the panama-dev mailing list[1], this patch adds support for the following new vector operators. >> >> >> . SUADD : Saturating unsigned addition. >> . SADD : Saturating signed addition. >> . SUSUB : Saturating unsigned subtraction. >> . SSUB : Saturating signed subtraction. >> . UMAX : Unsigned max. >> . UMIN : Unsigned min. >> >> >> The new vector operators apply only to integral types, since their values wrap around on overflow or underflow after setting the appropriate status flags. For floating-point types, the IEEE 754 specification allows multiple schemes to handle underflow; one of them is gradual underflow, which transitions the value to the subnormal range. Similarly, overflow implicitly saturates the floating-point value to infinity. >> >> As the name suggests, these are saturating operations, i.e. the result of the computation is strictly capped by the lower and upper bounds of the result type and does not wrap around on underflow or overflow. >> >> Summary of changes: >> - Java-side implementation of the new vector operators. >> - Add new scalar saturating APIs for each of the above saturating vector operators in the corresponding primitive box classes; the fallback implementation of the vector operators is based on them. >> - C2 compiler IR and inline expander changes. >> - Optimized x86 backend implementation for the new vector operators and their predicated counterparts. >> - Extend the existing VectorAPI Jtreg test suite to cover the new operations. >> >> Kindly review and share your feedback.
>> >> Best Regards, >> PS: Intrinsification and auto-vectorization of the new core-lib APIs will be addressed separately in a follow-up patch. >> >> [1] https://mail.openjdk.org/pipermail/panama-dev/2024-May/020408.html > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Review comments resolutions. src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6674: > 6672: // Res = Mask ? Zero : Res > 6673: evmovdqu(etype, ktmp, dst, dst, false, vlen_enc); > 6674: } We could directly do a masked evpsubd/evpsubq here with merge as false. src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6698: > 6696: // Unsigned values ranges comprise of only +ve numbers, thus there exist only an upper bound saturation. > 6697: // overflow = ((UMAX - MAX(SRC1 & SRC2)) >> 31 == 1 > 6698: // Res = Signed Add INP1, INP2 The >>> 31 is not coded, so the comment could be improved to match the code. The comment also mixes the SRC1 and INP1 terms. Also, could overflow not be implemented based on the much simpler Java scalar algorithm: Overflow = Res 6714: // > 6715: // Adaptation of unsigned addition overflow detection from hacker's delight > 6716: // section 2-13 : overflow = ((a & b) | ((a | b) & ~(s))) >>> 31 == 1 It is not clear what s is here; I think it is s = a + b. Could you please update the comments to indicate this? src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6738: > 6736: XMMRegister xtmp1, XMMRegister xtmp2, XMMRegister xtmp3, > 6737: XMMRegister xtmp4, int vlen_enc) { > 6738: // Res = Signed Add INP1, INP2 Wondering if we could also implement overflow here based on the much simpler Java scalar algorithm: Overflow = Res 6743: vpcmpeqd(xtmp3, xtmp3, xtmp3, vlen_enc); > 6744: // T2 = ~Res > 6745: vpxor(xtmp2, xtmp3, dst, vlen_enc); Did you mean this to be T3 = ~Res? src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6749: > 6747: vpor(xtmp2, xtmp2, src2, vlen_enc); > 6748: // Compute mask for muxing T1 with T3 using SRC1.
> 6749: vpsign_extend_dq(etype, xtmp4, src1, vlen_enc); I don't think we need to do the sign extension. The blend instruction uses the most significant bit to do the blend. src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6932: > 6930: > 6931: // Sign-extend to compute overflow detection mask. > 6932: vpsign_extend_dq(etype, xtmp3, xtmp2, vlen_enc); Sign extension to the lower bits is not needed, as the blend uses only the most significant bit. src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6939: > 6937: > 6938: // Compose saturating min/max vector using first input polarity mask. > 6939: vpsign_extend_dq(etype, xtmp4, src1, vlen_enc); Sign extension to the lower bits is not needed, as the blend uses only the most significant bit. src/hotspot/cpu/x86/x86.ad line 10656: > 10654: match(Set dst (SaturatingSubVI src1 src2)); > 10655: match(Set dst (SaturatingSubVL src1 src2)); > 10656: effect(TEMP ktmp); This needs TEMP dst as well. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1737116841 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1737272705 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1737306541 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1737307396 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1737325898 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1737338765 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1737467234 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1737467902 PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1737489758 From svkamath at openjdk.org Fri Aug 30 00:07:39 2024 From: svkamath at openjdk.org (Smita Kamath) Date: Fri, 30 Aug 2024 00:07:39 GMT Subject: RFR: 8337632: AES-GCM Algorithm optimization for x86_64 [v3] In-Reply-To: References: Message-ID: > Hi, > I want to submit an AES-GCM algorithm optimization. This implementation uses AVX512/VAES instructions.
Additionally, it reduces PARALLEL_LEN from 7680 to 512 bytes. The performance numbers are below. Kindly review the code. Thank you.
>
> Benchmark | Datasize (bytes) | BaseJDK (ops/s) | Patch (ops/s) | %Gain
> -- | -- | -- | -- | --
> full.AESGCMBench.decrypt | 512 | 2928259.197 | 3269964.387 | 11.67
> full.AESGCMBench.decrypt | 1024 | 2494254.611 | 3010987.731 | 20.72
> full.AESGCMBench.decrypt | 1500 | 1883453.546 | 1934915.846 | 2.73
> full.AESGCMBench.decrypt | 2048 | 1825780.711 | 2452861.368 | 34.34
> full.AESGCMBench.decrypt | 4096 | 1275108.345 | 1806329.066 | 41.66
> full.AESGCMBench.decrypt | 8192 | 1033936.634 | 1196836.052 | 15.75
> full.AESGCMBench.decrypt | 16384 | 681494.768 | 711630.498 | 4.42
> full.AESGCMBench.decrypt | 32768 | 385026.017 | 395043.193 | 2.60
> full.AESGCMBench.decrypt | 65536 | 207373.924 | 214723.588 | 3.54
> full.AESGCMBench.encrypt | 512 | 2658008.476 | 2882496.94 | 8.45
> full.AESGCMBench.encrypt | 1024 | 2283709.63 | 2589534.403 | 13.39
> full.AESGCMBench.encrypt | 1500 | 1794993.519 | 1817669.531 | 1.26
> full.AESGCMBench.encrypt | 2048 | 1745532.435 | 2191097.29 | 25.52
> full.AESGCMBench.encrypt | 4096 | 1203301.174 | 1649593.953 | 37.08
> full.AESGCMBench.encrypt | 8192 | 985174.988 | 1132407.54 | 14.94
> full.AESGCMBench.encrypt | 16384 | 658980.441 | 684765.771 | 3.91
> full.AESGCMBench.encrypt | 32768 | 373543.798 | 391518.837 | 4.81
> full.AESGCMBench.encrypt | 65536 | 202532.315 | 205084.833 | 1.26

Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: Updated copyright dates and addressed review comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17515/files - new: https://git.openjdk.org/jdk/pull/17515/files/6b21983c..ed10bcca Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17515&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17515&range=01-02 Stats: 48 lines in 4 files changed: 28 ins; 2 del; 18 mod
Patch: https://git.openjdk.org/jdk/pull/17515.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17515/head:pull/17515 PR: https://git.openjdk.org/jdk/pull/17515 From gcao at openjdk.org Fri Aug 30 01:08:30 2024 From: gcao at openjdk.org (Gui Cao) Date: Fri, 30 Aug 2024 01:08:30 GMT Subject: Integrated: 8339237: RISC-V: Builds fail after JDK-8339120 In-Reply-To: References: Message-ID: On Thu, 29 Aug 2024 10:56:01 GMT, Gui Cao wrote: > I encountered this build warning/error when doing a native build on the linux-riscv64 platform with GCC 13. We can simply remove float_regs_as_doubles_size_in_slots from c1_Runtime1_riscv.cpp, as it is not used anywhere. > > ### Testing > - [x] release & fastdebug build OK on linux-riscv64 after this change. This pull request has now been integrated. Changeset: 4675913e Author: Gui Cao Committer: Fei Yang URL: https://git.openjdk.org/jdk/commit/4675913edb16ec1dde5f0ba2dfcfada134ce17f1 Stats: 2 lines in 1 file changed: 0 ins; 2 del; 0 mod 8339237: RISC-V: Builds fail after JDK-8339120 Reviewed-by: fyang ------------- PR: https://git.openjdk.org/jdk/pull/20765 From sviswanathan at openjdk.org Fri Aug 30 02:11:27 2024 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Fri, 30 Aug 2024 02:11:27 GMT Subject: RFR: 8337632: AES-GCM Algorithm optimization for x86_64 [v3] In-Reply-To: References: Message-ID: On Fri, 30 Aug 2024 00:07:39 GMT, Smita Kamath wrote: >> Hi, >> I want to submit an AES-GCM algorithm optimization. This implementation uses AVX512/VAES instructions. Additionally, it reduces PARALLEL_LEN from 7680 to 512 bytes. The performance numbers are below. Kindly review the code. Thank you.
>>
>> Benchmark | Datasize (bytes) | BaseJDK (ops/s) | Patch (ops/s) | %Gain
>> -- | -- | -- | -- | --
>> full.AESGCMBench.decrypt | 512 | 2928259.197 | 3269964.387 | 11.67
>> full.AESGCMBench.decrypt | 1024 | 2494254.611 | 3010987.731 | 20.72
>> full.AESGCMBench.decrypt | 1500 | 1883453.546 | 1934915.846 | 2.73
>> full.AESGCMBench.decrypt | 2048 | 1825780.711 | 2452861.368 | 34.34
>> full.AESGCMBench.decrypt | 4096 | 1275108.345 | 1806329.066 | 41.66
>> full.AESGCMBench.decrypt | 8192 | 1033936.634 | 1196836.052 | 15.75
>> full.AESGCMBench.decrypt | 16384 | 681494.768 | 711630.498 | 4.42
>> full.AESGCMBench.decrypt | 32768 | 385026.017 | 395043.193 | 2.60
>> full.AESGCMBench.decrypt | 65536 | 207373.924 | 214723.588 | 3.54
>> full.AESGCMBench.encrypt | 512 | 2658008.476 | 2882496.94 | 8.45
>> full.AESGCMBench.encrypt | 1024 | 2283709.63 | 2589534.403 | 13.39
>> full.AESGCMBench.encrypt | 1500 | 1794993.519 | 1817669.531 | 1.26
>> full.AESGCMBench.encrypt | 2048 | 1745532.435 | 2191097.29 | 25.52
>> full.AESGCMBench.encrypt | 4096 | 1203301.174 | 1649593.953 | 37.08
>> full.AESGCMBench.encrypt | 8192 | 985174.988 | 1132407.54 | 14.94
>> full.AESGCMBench.encrypt | 16384 | 658980.441 | 684765.771 | 3.91
>> full.AESGCMBench.encrypt | 32768 | 373543.798 | 391518.837 | 4.81
>> full.AESGCMBench.encrypt | 65536 | 202532.315 | 205084.833 | 1.26

> > Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: > > Updated copyright dates and addressed review comments Looks good to me now. ------------- Marked as reviewed by sviswanathan (Reviewer).
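To reproduce the kind of workload these numbers measure, the sketch below runs one AES-GCM encrypt/decrypt round trip through the standard JCA `Cipher` API with the `"AES/GCM/NoPadding"` transformation. It assumes that `AESGCMBench` drives this same transformation, which is not confirmed here. Whether the new AVX512/VAES stub is actually used depends on the CPU and JVM flags, and this code does not check that. The class and method names are hypothetical.

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

public class AesGcmRoundTrip {
    static final int TAG_BITS = 128; // 16-byte authentication tag
    static final int IV_BYTES = 12;  // commonly recommended GCM nonce length

    // Encrypts and then decrypts `size` random bytes and returns the ciphertext length.
    // Throws AssertionError if the decrypted bytes differ from the input.
    static int roundTrip(int size) {
        try {
            KeyGenerator kg = KeyGenerator.getInstance("AES");
            kg.init(128);
            SecretKey key = kg.generateKey();

            SecureRandom rnd = new SecureRandom();
            byte[] iv = new byte[IV_BYTES]; // a (key, IV) pair must never be reused for encryption
            rnd.nextBytes(iv);
            byte[] plaintext = new byte[size];
            rnd.nextBytes(plaintext);

            Cipher enc = Cipher.getInstance("AES/GCM/NoPadding");
            enc.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
            byte[] ciphertext = enc.doFinal(plaintext); // ciphertext followed by the tag

            Cipher dec = Cipher.getInstance("AES/GCM/NoPadding");
            dec.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
            byte[] decrypted = dec.doFinal(ciphertext); // verifies the tag before returning

            if (!Arrays.equals(plaintext, decrypted)) {
                throw new AssertionError("round trip mismatch for size " + size);
            }
            return ciphertext.length;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (int size : new int[] {512, 1024, 1500, 2048, 4096}) {
            System.out.println(size + " bytes -> " + roundTrip(size) + " bytes of ciphertext");
        }
    }
}
```

With `NoPadding`, GCM output is the same length as the input plus the 16-byte tag. This makes the data sizes in the table directly comparable across encrypt and decrypt.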
PR Review: https://git.openjdk.org/jdk/pull/17515#pullrequestreview-2270800917 From jzhu at openjdk.org Fri Aug 30 03:03:53 2024 From: jzhu at openjdk.org (Joshua Zhu) Date: Fri, 30 Aug 2024 03:03:53 GMT Subject: RFR: 8339063: [aarch64] Skip verify_sve_vector_length after native calls if SVE supports 128 bits VL only [v3] In-Reply-To: References: Message-ID: > Please review this minor enhancement that skips verify_sve_vector_length after native calls. > It applies to SVE microarchitectures that support only a 128-bit vector length. Joshua Zhu has updated the pull request incrementally with one additional commit since the last revision: Fix mismatch issue in ad m4 file ------------- Changes: - all: https://git.openjdk.org/jdk/pull/20724/files - new: https://git.openjdk.org/jdk/pull/20724/files/c0ec5499..d1910858 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20724&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20724&range=01-02 Stats: 4 lines in 2 files changed: 0 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/20724.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20724/head:pull/20724 PR: https://git.openjdk.org/jdk/pull/20724 From gcao at openjdk.org Fri Aug 30 07:08:46 2024 From: gcao at openjdk.org (Gui Cao) Date: Fri, 30 Aug 2024 07:08:46 GMT Subject: RFR: 8339298: Remove unused function declaration poll_for_safepoint Message-ID: Hi, I noticed that there are two unused function declarations here. In older versions they were used when UseCompilerSafepoints was disabled; UseCompilerSafepoints has since been removed, but these function declarations appear to have been left behind.
### Testing - [x] release & fastdebug build OK on linux-aarch64 - [x] release & fastdebug build OK on linux-riscv64 ------------- Commit messages: - 8339298: Remove unused function declaration poll_for_safepoint Changes: https://git.openjdk.org/jdk/pull/20785/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=20785&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8339298 Stats: 4 lines in 2 files changed: 0 ins; 4 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/20785.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20785/head:pull/20785 PR: https://git.openjdk.org/jdk/pull/20785 From duke at openjdk.org Fri Aug 30 07:31:51 2024 From: duke at openjdk.org (kuaiwei) Date: Fri, 30 Aug 2024 07:31:51 GMT Subject: RFR: 8339299: C1 will miss type profile when inline final method Message-ID: I found that C1 sometimes misses the type profile, and I wrote a test to demonstrate it. It happens under these conditions: 1. It's a virtual call and the callee is a final method, so C1 treats it as statically bound. 2. The interpreter never touches the call site, so the interpreter does not add a type profile. 3. During C1 compilation, the call is inlined based on CHA. 4. During C2 compilation, the CHA assumption no longer holds, but the type profile is missing, so C2 cannot inline the call.
------------- Commit messages: - 8339299: C1 will miss type profile when inline final method Changes: https://git.openjdk.org/jdk/pull/20786/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=20786&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8339299 Stats: 149 lines in 3 files changed: 147 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/20786.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20786/head:pull/20786 PR: https://git.openjdk.org/jdk/pull/20786 From fyang at openjdk.org Fri Aug 30 07:36:18 2024 From: fyang at openjdk.org (Fei Yang) Date: Fri, 30 Aug 2024 07:36:18 GMT Subject: RFR: 8339298: Remove unused function declaration poll_for_safepoint In-Reply-To: References: Message-ID: On Fri, 30 Aug 2024 07:03:57 GMT, Gui Cao wrote: > Hi, I noticed that there are two unused function declarations here. In older versions they were used when UseCompilerSafepoints was disabled; UseCompilerSafepoints has since been removed, but these function declarations appear to have been left behind. > > ### Testing > - [x] release & fastdebug build OK on linux-aarch64 > - [x] release & fastdebug build OK on linux-riscv64 Looks good. I found that this is a leftover from [8189596: AArch64: implementation for Thread-local handshakes](https://bugs.openjdk.org/browse/JDK-8189596) ------------- Marked as reviewed by fyang (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/20785#pullrequestreview-2271441289 From rcastanedalo at openjdk.org Fri Aug 30 08:22:43 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Fri, 30 Aug 2024 08:22:43 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v11] In-Reply-To: References: Message-ID: > This changeset implements JEP 475 (Late Barrier Expansion for G1), including support for the x64 and aarch64 platforms. See the [JEP description](https://openjdk.org/jeps/475) for further detail. > > We aim to integrate this work in JDK 24.
The purpose of this pull request is twofold: > > - to allow maintainers of the arm (32-bit), ppc, riscv, s390, and x86 (32-bit) ports to contribute a port of these platforms in time for JDK 24; and > - to allow reviewers to review the platform-independent, x64 and aarch64, and test changes in parallel with the porting work. > > ## Summary of the Changes > > ### Platform-Independent Changes (`src/hotspot/share`) > > These consist mainly of: > > - a complete rewrite of `G1BarrierSetC2`, to instruct C2 to expand G1 barriers late instead of early; > - a few minor changes to C2 itself, to support removal of redundant decompression operations and to address an OopMap construction issue triggered by this JEP's increased usage of ADL `TEMP` operands; and > - temporary support for porting the JEP to the remaining platforms. > > The temporary support code (guarded by the pre-processor flag `G1_LATE_BARRIER_MIGRATION_SUPPORT`) will **not** be part of the final pull request, and hence does not need to be reviewed. > > ### Platform-Dependent Changes (`src/hotspot/cpu`) > > These include changes to the ADL instruction definitions and the `G1BarrierSetAssembler` class of the x64 and aarch64 platforms. > > #### ADL Changes > > The changeset uses ADL predicates to force C2 to implement memory accesses tagged with barrier information using G1-specific, barrier-aware instruction versions (e.g. `g1StoreP` instead of the GC-agnostic `storeP`). These new instruction versions generate machine code according to the corresponding tagged barrier information, relying on the G1 barrier implementations provided by the `G1BarrierSetAssembler` class. On the aarch64 platform, the bulk of the ADL code is generated from a higher-level version using m4, to reduce redundancy. > > #### `G1BarrierSetAssembler` Changes > > Both platforms basically reuse the barrier implementation for the bytecode interpreter, with the different barrier tests and operations refactored into dedicated functions.
Besides this, `G1BarrierSetAssembler` is extended with assembly-stub routines that implement the out-of-line, slow path of the barriers. These routines include calls from the barrier into the JVM, which require support for saving and restoring live registers, provided by the `SaveLiveRegisters` class. This c... Roberto Castañeda Lozano has updated the pull request incrementally with six additional commits since the last revision: - Add test to motivate compile-time null checks in 'refine_barrier_by_new_val_type' - Remark relation between compiler optimization and barrier filter - Make 'refine_barrier_by_new_val_type' static and its input argument 'const' - Replace 'the null' with 'null' in comment - Remove redundant redefinitions of '__' - Replace 'already dirty' with 'young' in post-barrier fast path comment ------------- Changes: - all: https://git.openjdk.org/jdk/pull/19746/files - new: https://git.openjdk.org/jdk/pull/19746/files/daf38d3f..57adcfb0 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=19746&range=10 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=19746&range=09-10 Stats: 39 lines in 4 files changed: 27 ins; 6 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/19746.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/19746/head:pull/19746 PR: https://git.openjdk.org/jdk/pull/19746 From rcastanedalo at openjdk.org Fri Aug 30 08:22:44 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Fri, 30 Aug 2024 08:22:44 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v10] In-Reply-To: <5aSkkYnXN8xLzsZy4OSEVNrIG1rv6dOPESBb4I-nfYE=.7027cc32-4dcf-4674-a9af-c960d8a2d95e@github.com> References: <5aSkkYnXN8xLzsZy4OSEVNrIG1rv6dOPESBb4I-nfYE=.7027cc32-4dcf-4674-a9af-c960d8a2d95e@github.com> Message-ID: On Wed, 28 Aug 2024 07:50:11 GMT, Kim Barrett wrote: >> Roberto Castañeda Lozano has updated the pull request incrementally with one additional commit since the last revision: >>
Rename g1XChgX to g1GetAndSetX for consistency with Ideal operation names > > src/hotspot/share/gc/g1/c2/g1BarrierSetC2.cpp line 160: > >> 158: * To reduce the number of updates to the remembered set, the post-barrier >> 159: * filters out updates to fields in objects located in the Young Generation, the >> 160: * same region as the reference, when the null is being written, or if the card > > s/the null/null/ Done (commit [d1a2349](https://github.com/openjdk/jdk/pull/19746/commits/d1a2349068194ee598cec2b6afe7aa972781b491)), thanks. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1738183062 From rcastanedalo at openjdk.org Fri Aug 30 08:22:44 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Fri, 30 Aug 2024 08:22:44 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v10] In-Reply-To: References: <5aSkkYnXN8xLzsZy4OSEVNrIG1rv6dOPESBb4I-nfYE=.7027cc32-4dcf-4674-a9af-c960d8a2d95e@github.com> Message-ID: On Wed, 28 Aug 2024 08:12:36 GMT, Kim Barrett wrote: >> src/hotspot/share/gc/g1/c2/g1BarrierSetC2.cpp line 166: >> >>> 164: * post-barrier completely, if it is possible during compile time to prove the >>> 165: * object is newly allocated and that no safepoint exists between the allocation >>> 166: * and the store. >> >> It might be worth saying explicitly that this is a compile-time version of the above mentioned young >> generation filter. > > We can similarly elide the post-barrier if we can prove at compile-time that the value being written > is null. That case isn't handled here though. Instead that's checked for in > `refine_barrier_by_new_val_type` and in `get_store_barrier`. I'm not sure why it's structured > that way. > It might be worth saying explicitly that this is a compile-time version of the above mentioned young generation filter.
Done (commit [72a04c4](https://github.com/openjdk/jdk/pull/19746/commits/72a04c4e8046256ee7e811d66934d5d9e24f4c7c)), thanks. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1738184612 From rcastanedalo at openjdk.org Fri Aug 30 08:22:44 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Fri, 30 Aug 2024 08:22:44 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v9] In-Reply-To: References: Message-ID: On Sun, 25 Aug 2024 01:53:30 GMT, Kim Barrett wrote: >> Roberto Castañeda Lozano has updated the pull request incrementally with one additional commit since the last revision: >> >> Pass oldval to the pre-barrier in g1CompareAndExchange/SwapP > > src/hotspot/cpu/aarch64/gc/g1/g1BarrierSetAssembler_aarch64.cpp line 218: > >> 216: __ cbz(new_val, done); >> 217: } >> 218: // Storing region crossing non-null, is card already dirty? > > s/already dirty/young/ Done (commit [70c2771](https://github.com/openjdk/jdk/pull/19746/commits/70c2771818834a74a12f8a61de3c77bb69e3e531)), thanks. > src/hotspot/cpu/aarch64/gc/g1/g1BarrierSetAssembler_aarch64.cpp line 280: > >> 278: >> 279: #undef __ >> 280: #define __ masm-> > > These "changes" to `__` are unnecessary and confusing. We have the same define near the top of > the file, unconditionally. This one is conditional on COMPILER2, but is left in place at the end of the > conditional block, affecting following unconditional code. Removed now (commit [2dc688b](https://github.com/openjdk/jdk/pull/19746/commits/2dc688baf2a8f446c7579fafce7eab3a953e623a)), thanks.
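The post-barrier filters discussed in this thread are the cross-region test, the null test, the young-card test, and the already-dirty test. They can be modeled in a few lines. The Java sketch below is a simplified, single-threaded model for illustration only. The region size, card size, and card encodings are hypothetical placeholders rather than HotSpot's configured values, and the class and member names are invented. The StoreLoad fence that the real barrier issues between the young-card and dirty-card checks is only noted in a comment. The order of checks follows the fast-path comment quoted from `g1BarrierSetAssembler_aarch64.cpp` above: cross-region, then null (`cbz(new_val, done)`), then the card checks.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

public class G1PostBarrierModel {
    // Hypothetical geometry: 1 MiB regions and 512-byte cards (illustrative only).
    static final int LOG_REGION_BYTES = 20;
    static final int LOG_CARD_BYTES = 9;

    // Hypothetical card encodings (illustrative only, not HotSpot's actual values).
    static final byte CLEAN_CARD = (byte) 0xff;
    static final byte DIRTY_CARD = 0;
    static final byte YOUNG_CARD = 2;

    enum Outcome { SAME_REGION, NULL_VALUE, YOUNG, ALREADY_DIRTY, ENQUEUED }

    final long heapBase;
    final byte[] cards;
    final Deque<Integer> dirtyCardQueue = new ArrayDeque<>();

    G1PostBarrierModel(long heapBase, long heapBytes) {
        this.heapBase = heapBase;
        this.cards = new byte[(int) (heapBytes >>> LOG_CARD_BYTES)];
        Arrays.fill(cards, CLEAN_CARD);
    }

    // Models the post-barrier for storing the reference `newVal` into the field at `fieldAddr`.
    Outcome postBarrier(long fieldAddr, long newVal) {
        // 1. A store whose field and value lie in the same region needs no remembered-set update.
        if (((fieldAddr ^ newVal) >>> LOG_REGION_BYTES) == 0) {
            return Outcome.SAME_REGION;
        }
        // 2. Storing null creates no cross-region reference.
        if (newVal == 0) {
            return Outcome.NULL_VALUE;
        }
        int card = (int) ((fieldAddr - heapBase) >>> LOG_CARD_BYTES);
        // 3. Stores into young objects are filtered out by the young-card check.
        if (cards[card] == YOUNG_CARD) {
            return Outcome.YOUNG;
        }
        // (The real barrier issues a StoreLoad fence here before re-reading the card.)
        // 4. An already-dirty card is already pending refinement.
        if (cards[card] == DIRTY_CARD) {
            return Outcome.ALREADY_DIRTY;
        }
        cards[card] = DIRTY_CARD;
        dirtyCardQueue.add(card);
        return Outcome.ENQUEUED;
    }
}
```

The compile-time elisions discussed above can be read as deciding statically that one of these early exits will be taken. A provably null stored value corresponds to step 2. A provably newly allocated object with no intervening safepoint corresponds to the young filter in step 3.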
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1738181093 PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1738182128 From rcastanedalo at openjdk.org Fri Aug 30 08:27:22 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Fri, 30 Aug 2024 08:27:22 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v10] In-Reply-To: References: <5aSkkYnXN8xLzsZy4OSEVNrIG1rv6dOPESBb4I-nfYE=.7027cc32-4dcf-4674-a9af-c960d8a2d95e@github.com> Message-ID: On Fri, 30 Aug 2024 08:19:50 GMT, Roberto Castañeda Lozano wrote: >> We can similarly elide the post-barrier if we can prove at compile-time that the value being written >> is null. That case isn't handled here though. Instead that's checked for in >> `refine_barrier_by_new_val_type` and in `get_store_barrier`. I'm not sure why it's structured >> that way. > >> It might be worth saying explicitly that this is a compile-time version of the above mentioned young > generation filter. > > Done (commit [72a04c4](https://github.com/openjdk/jdk/pull/19746/commits/72a04c4e8046256ee7e811d66934d5d9e24f4c7c)), thanks. > We can similarly elide the post-barrier if we can prove at compile-time that the value being written is null. That case isn't handled here though. Instead that's checked for in refine_barrier_by_new_val_type and in get_store_barrier. I'm not sure why it's structured that way. The compile-time null check is performed outside of `g1_can_remove_post_barrier` for consistency with the [current mainline code](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/gc/g1/c2/g1BarrierSetC2.cpp#L382-L388). The difference between the current `g1_can_remove_post_barrier` function and this changeset's version is minimal, but the patch unfortunately obscures it behind the temporary `G1_LATE_BARRIER_MIGRATION_SUPPORT`-guarded code.
`refine_barrier_by_new_val_type` performs a compile-time null check again at the end of C2's platform-independent optimizations (see https://bugs.openjdk.org/secure/attachment/107747/late-expansion.png) to exploit potentially stronger type information that might be revealed only after applying some optimizations. I have added a new test case that illustrates this scenario (commit [57adcfb](https://github.com/openjdk/jdk/pull/19746/commits/57adcfb04b163ba6744389d6258efe4b2ace534d)). I will study whether the check in `get_store_barrier` is superseded by that in `refine_barrier_by_new_val_type`. If I can convince myself that this is the case, I will consider removing the former. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1738191022 From rcastanedalo at openjdk.org Fri Aug 30 08:27:23 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Fri, 30 Aug 2024 08:27:23 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v10] In-Reply-To: <5aSkkYnXN8xLzsZy4OSEVNrIG1rv6dOPESBb4I-nfYE=.7027cc32-4dcf-4674-a9af-c960d8a2d95e@github.com> References: <5aSkkYnXN8xLzsZy4OSEVNrIG1rv6dOPESBb4I-nfYE=.7027cc32-4dcf-4674-a9af-c960d8a2d95e@github.com> Message-ID: On Wed, 28 Aug 2024 08:17:14 GMT, Kim Barrett wrote: >> Roberto Castañeda Lozano has updated the pull request incrementally with one additional commit since the last revision: >> >> Rename g1XChgX to g1GetAndSetX for consistency with Ideal operation names > > src/hotspot/share/gc/g1/c2/g1BarrierSetC2.cpp line 229: > >> 227: } >> 228: >> 229: void refine_barrier_by_new_val_type(Node* n) { > > This function should probably be `static`. Done, thanks (I also made its argument `const`, see commit [29d8a89](https://github.com/openjdk/jdk/pull/19746/commits/29d8a89a9a7fd0c1717330609c6d7cb36b0ff174)).
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1738192979 From mdoerr at openjdk.org Fri Aug 30 08:33:23 2024 From: mdoerr at openjdk.org (Martin Doerr) Date: Fri, 30 Aug 2024 08:33:23 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v11] In-Reply-To: References: Message-ID: On Fri, 30 Aug 2024 08:22:43 GMT, Roberto Castañeda Lozano wrote: >> This changeset implements JEP 475 (Late Barrier Expansion for G1), including support for the x64 and aarch64 platforms. See the [JEP description](https://openjdk.org/jeps/475) for further detail. >> >> We aim to integrate this work in JDK 24. The purpose of this pull request is twofold: >> >> - to allow maintainers of the arm (32-bit), ppc, riscv, s390, and x86 (32-bit) ports to contribute a port of these platforms in time for JDK 24; and >> - to allow reviewers to review the platform-independent, x64 and aarch64, and test changes in parallel with the porting work. >> >> ## Summary of the Changes >> >> ### Platform-Independent Changes (`src/hotspot/share`) >> >> These consist mainly of: >> >> - a complete rewrite of `G1BarrierSetC2`, to instruct C2 to expand G1 barriers late instead of early; >> - a few minor changes to C2 itself, to support removal of redundant decompression operations and to address an OopMap construction issue triggered by this JEP's increased usage of ADL `TEMP` operands; and >> - temporary support for porting the JEP to the remaining platforms. >> >> The temporary support code (guarded by the pre-processor flag `G1_LATE_BARRIER_MIGRATION_SUPPORT`) will **not** be part of the final pull request, and hence does not need to be reviewed. >> >> ### Platform-Dependent Changes (`src/hotspot/cpu`) >> >> These include changes to the ADL instruction definitions and the `G1BarrierSetAssembler` class of the x64 and aarch64 platforms.
>> >> #### ADL Changes >> >> The changeset uses ADL predicates to force C2 to implement memory accesses tagged with barrier information using G1-specific, barrier-aware instruction versions (e.g. `g1StoreP` instead of the GC-agnostic `storeP`). These new instruction versions generate machine code according to the corresponding tagged barrier information, relying on the G1 barrier implementations provided by the `G1BarrierSetAssembler` class. On the aarch64 platform, the bulk of the ADL code is generated from a higher-level version using m4, to reduce redundancy. >> >> #### `G1BarrierSetAssembler` Changes >> >> Both platforms basically reuse the barrier implementation for the bytecode interpreter, with the different barrier tests and operations refactored into dedicated functions. Besides this, `G1BarrierSetAssembler` is extended with assembly-stub routines that implement the out-of-line, slow path of the barriers. These routines include calls from the barrier into the JVM, which require support for saving and restoring live ... > > Roberto Castañeda Lozano has updated the pull request incrementally with six additional commits since the last revision: > > - Add test to motivate compile-time null checks in 'refine_barrier_by_new_val_type' > - Remark relation between compiler optimization and barrier filter > - Make 'refine_barrier_by_new_val_type' static and its input argument 'const' > - Replace 'the null' with 'null' in comment > - Remove redundant redefinitions of '__' > - Replace 'already dirty' with 'young' in post-barrier fast path comment Are you planning to merge jdk-24+13? It has a known test bug on PPC64, but that's not a problem. It looks good otherwise. I'll have to rebase the PPC64 implementation after it is merged, and I should then be able to provide a stable version for this PR. So I'd appreciate the update unless @feilongjiang @offamitkumar @snazarkin see any issues on their platforms.
------------- PR Comment: https://git.openjdk.org/jdk/pull/19746#issuecomment-2320483973 From amitkumar at openjdk.org Fri Aug 30 08:53:25 2024 From: amitkumar at openjdk.org (Amit Kumar) Date: Fri, 30 Aug 2024 08:53:25 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v11] In-Reply-To: References: Message-ID: <5FM9bNeaaI0Lcsto0kfzrcrY4u6SODtf3wqDwmlninw=.367c8d65-c059-4726-a10a-6dd616b643af@github.com> On Fri, 30 Aug 2024 08:22:43 GMT, Roberto Castañeda Lozano wrote: >> This changeset implements JEP 475 (Late Barrier Expansion for G1), including support for the x64 and aarch64 platforms. See the [JEP description](https://openjdk.org/jeps/475) for further detail. >> >> We aim to integrate this work in JDK 24. The purpose of this pull request is twofold: >> >> - to allow maintainers of the arm (32-bit), ppc, riscv, s390, and x86 (32-bit) ports to contribute a port of these platforms in time for JDK 24; and >> - to allow reviewers to review the platform-independent, x64 and aarch64, and test changes in parallel with the porting work. >> >> ## Summary of the Changes >> >> ### Platform-Independent Changes (`src/hotspot/share`) >> >> These consist mainly of: >> >> - a complete rewrite of `G1BarrierSetC2`, to instruct C2 to expand G1 barriers late instead of early; >> - a few minor changes to C2 itself, to support removal of redundant decompression operations and to address an OopMap construction issue triggered by this JEP's increased usage of ADL `TEMP` operands; and >> - temporary support for porting the JEP to the remaining platforms. >> >> The temporary support code (guarded by the pre-processor flag `G1_LATE_BARRIER_MIGRATION_SUPPORT`) will **not** be part of the final pull request, and hence does not need to be reviewed. >> >> ### Platform-Dependent Changes (`src/hotspot/cpu`) >> >> These include changes to the ADL instruction definitions and the `G1BarrierSetAssembler` class of the x64 and aarch64 platforms.
>> >> #### ADL Changes >> >> The changeset uses ADL predicates to force C2 to implement memory accesses tagged with barrier information using G1-specific, barrier-aware instruction versions (e.g. `g1StoreP` instead of the GC-agnostic `storeP`). These new instruction versions generate machine code according to the corresponding tagged barrier information, relying on the G1 barrier implementations provided by the `G1BarrierSetAssembler` class. On the aarch64 platform, the bulk of the ADL code is generated from a higher-level version using m4, to reduce redundancy. >> >> #### `G1BarrierSetAssembler` Changes >> >> Both platforms basically reuse the barrier implementation for the bytecode interpreter, with the different barrier tests and operations refactored into dedicated functions. Besides this, `G1BarrierSetAssembler` is extended with assembly-stub routines that implement the out-of-line, slow path of the barriers. These routines include calls from the barrier into the JVM, which require support for saving and restoring live ... > > Roberto Castañeda Lozano has updated the pull request incrementally with six additional commits since the last revision: > > - Add test to motivate compile-time null checks in 'refine_barrier_by_new_val_type' > - Remark relation between compiler optimization and barrier filter > - Make 'refine_barrier_by_new_val_type' static and its input argument 'const' > - Replace 'the null' with 'null' in comment > - Remove redundant redefinitions of '__' > - Replace 'already dirty' with 'young' in post-barrier fast path comment On the s390x side, we are good, so I have no issue with merging jdk-24+13.
------------- PR Comment: https://git.openjdk.org/jdk/pull/19746#issuecomment-2320533252 From duke at openjdk.org Fri Aug 30 08:54:49 2024 From: duke at openjdk.org (Casper Norrbin) Date: Fri, 30 Aug 2024 08:54:49 GMT Subject: RFR: 8339242: Fix overflow issues in AdlArena Message-ID: Hi everyone, This PR addresses an issue in `adlArena` where some allocations lack checks for overflow. As a result, allocations requested with unrealistically large values could incorrectly succeed. The fix includes: - Adding assertions to check for potential overflow. - Reordering some operations to guard against overflow. ------------- Commit messages: - overflow checks in adlArena Changes: https://git.openjdk.org/jdk/pull/20774/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=20774&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8339242 Stats: 9 lines in 2 files changed: 5 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/20774.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20774/head:pull/20774 PR: https://git.openjdk.org/jdk/pull/20774 From chagedorn at openjdk.org Fri Aug 30 09:01:22 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 30 Aug 2024 09:01:22 GMT Subject: RFR: 8339298: Remove unused function declaration poll_for_safepoint In-Reply-To: References: Message-ID: On Fri, 30 Aug 2024 07:03:57 GMT, Gui Cao wrote: > Hi, I noticed that there are two unused function declarations here. Historically they were used when UseCompilerSafepoints was disabled; that flag has since been removed, but these declarations appear to have been left behind. > > ### Testing > - [x] release & fastdebug build OK on linux-aarch64 > - [x] release & fastdebug build OK on linux-riscv64 Looks good! ------------- Marked as reviewed by chagedorn (Reviewer).
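For the AdlArena overflow fix (8339242) described above, the general "reorder the arithmetic so it cannot overflow" idea can be illustrated in Java. This is an analogy only: the class and method names are hypothetical, and the actual patch is C++ code in `adlArena`.

```java
/** Illustrative bump-pointer arena bounds checks; names are hypothetical, not AdlArena's. */
final class ArenaOverflowSketch {
    /** Naive check: hwm + size can wrap around and wrongly report that a huge request fits. */
    static boolean fitsNaive(long hwm, long size, long max) {
        return hwm + size <= max;
    }

    /** Reordered check: subtract first so no intermediate value can overflow (assumes 0 <= hwm <= max). */
    static boolean fitsSafe(long hwm, long size, long max) {
        return size >= 0 && size <= max - hwm;
    }

    /** Element count times element size, failing loudly instead of silently wrapping. */
    static long totalBytes(long count, long elemSize) {
        return Math.multiplyExact(count, elemSize); // throws ArithmeticException on overflow
    }
}
```

With `hwm = 100`, `max = 1000` and a request of `Long.MAX_VALUE` bytes, the naive sum wraps to a negative number and "fits", while the reordered check correctly rejects it.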
PR Review: https://git.openjdk.org/jdk/pull/20785#pullrequestreview-2271655731 From rcastanedalo at openjdk.org Fri Aug 30 09:23:23 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Fri, 30 Aug 2024 09:23:23 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v5] In-Reply-To: References: Message-ID: On Wed, 14 Aug 2024 09:10:10 GMT, Feilong Jiang wrote: >>> Are you planning to rebase it on master? Nothing important, but there were a couple of failures that have been fixed since this PR was opened, so it would make the test results a bit cleaner for us. >> >> Actually, I have refrained from updating to the latest mainline changes to avoid interfering with the porting work while it is in progress, but if there is consensus among the port maintainers I would be happy to update the changeset regularly. @TheRealMDoerr @feilongjiang @offamitkumar @snazarkin what do you prefer? > >> > Are you planning to rebase it on master? Nothing important, but there were a couple of failures that have been fixed since this PR was opened, so it would make the test results a bit cleaner for us. >> >> Actually, I have refrained from updating to the latest mainline changes to avoid interfering with the porting work while it is in progress, but if there is consensus among the port maintainers I would be happy to update the changeset regularly. @TheRealMDoerr @feilongjiang @offamitkumar @snazarkin what do you prefer? > > I have already merged upstream commits on my local branch, so I'm fine with regular updates. > So, I'd appreciate the update unless @feilongjiang @offamitkumar @snazarkin see any issue on their platforms. OK, if there are no objections from @feilongjiang or @snazarkin within a couple of days I will prepare an update to jdk-24+13.
------------- PR Comment: https://git.openjdk.org/jdk/pull/19746#issuecomment-2320618425 From epeter at openjdk.org Fri Aug 30 12:53:22 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 30 Aug 2024 12:53:22 GMT Subject: RFR: 8336860: x86: Change integer src operand for CMoveL of 0 and 1 to long [v3] In-Reply-To: References: Message-ID: On Mon, 5 Aug 2024 16:48:52 GMT, Jasmine Karthikeyan wrote: >> Hi all, >> This patch fixes `cmovL_imm_01*` instructions matching against an integer immediate 1 instead of a long immediate 1. I noticed while looking at the backend implementation of CMove that the rules specify `immI_1` instead of `immL1`, which means that the instructions can't be matched and instead fall through to the base case. I added a small benchmark and got these results:
>>
>>                                          Baseline                  Patch
>> Benchmark                Mode  Cnt    Score   Error  Units    Score   Error  Units  Improvement
>> BasicRules.cmovL_imm_01  avgt   15  259.073 ± 5.806  ns/op  231.108 ± 2.730  ns/op  (+ 11.41%)
>>
>> Thoughts and reviews would be appreciated! > > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Add IR test for codegen Otherwise, it looks good to me! test/hotspot/jtreg/compiler/c2/irTests/CMoveLConstants.java line 33: > 31: * bug 8336860 > 32: * @summary Verify codegen for CMoveL with constants 0 and 1 > 33: * @requires os.simpleArch == "x64" Hmm. How about only requiring `x64` in the IR rules, but running the test on all platforms? That way we can at least check for correct results. ------------- Marked as reviewed by epeter (Reviewer).
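Emanuel's suggestion is to keep the codegen (IR) checks x64-specific while still running the test everywhere to validate results. A minimal sketch of the kind of method such a test could exercise is below. It assumes C2 may turn these ternaries into a CMoveL selecting between long constants 0 and 1, which depends on profiling and platform; the class and method names are hypothetical, and the result check itself is platform-independent.

```java
/** Shapes that C2 may compile to a CMoveL choosing between the long constants 0 and 1. */
final class CMoveLConstantsSketch {
    static long lessThan(int a, int b) {
        return a < b ? 1L : 0L;   // candidate for cmovL_imm_01-style matching on x86_64
    }

    static long greaterOrEqual(long a, long b) {
        return a >= b ? 0L : 1L;  // same selection with the constants swapped
    }
}
```

A correctness check over edge values (`Integer.MIN_VALUE`, `-1`, `0`, `1`, `Integer.MAX_VALUE`) runs identically on every architecture, whichever instruction the backend chooses.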
PR Review: https://git.openjdk.org/jdk/pull/20275#pullrequestreview-2272221017 PR Review Comment: https://git.openjdk.org/jdk/pull/20275#discussion_r1738593057 From fjiang at openjdk.org Fri Aug 30 13:26:23 2024 From: fjiang at openjdk.org (Feilong Jiang) Date: Fri, 30 Aug 2024 13:26:23 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v11] In-Reply-To: References: Message-ID: On Fri, 30 Aug 2024 08:22:43 GMT, Roberto Castañeda Lozano wrote: >> This changeset implements JEP 475 (Late Barrier Expansion for G1), including support for the x64 and aarch64 platforms. See the [JEP description](https://openjdk.org/jeps/475) for further detail. >> >> We aim to integrate this work in JDK 24. The purpose of this pull request is twofold: >> >> - to allow maintainers of the arm (32-bit), ppc, riscv, s390, and x86 (32-bit) ports to contribute a port of these platforms in time for JDK 24; and >> - to allow reviewers to review the platform-independent, x64 and aarch64, and test changes in parallel with the porting work. >> >> ## Summary of the Changes >> >> ### Platform-Independent Changes (`src/hotspot/share`) >> >> These consist mainly of: >> >> - a complete rewrite of `G1BarrierSetC2`, to instruct C2 to expand G1 barriers late instead of early; >> - a few minor changes to C2 itself, to support removal of redundant decompression operations and to address an OopMap construction issue triggered by this JEP's increased usage of ADL `TEMP` operands; and >> - temporary support for porting the JEP to the remaining platforms. >> >> The temporary support code (guarded by the pre-processor flag `G1_LATE_BARRIER_MIGRATION_SUPPORT`) will **not** be part of the final pull request, and hence does not need to be reviewed. >> >> ### Platform-Dependent Changes (`src/hotspot/cpu`) >> >> These include changes to the ADL instruction definitions and the `G1BarrierSetAssembler` class of the x64 and aarch64 platforms.
>> >> #### ADL Changes >> >> The changeset uses ADL predicates to force C2 to implement memory accesses tagged with barrier information using G1-specific, barrier-aware instruction versions (e.g. `g1StoreP` instead of the GC-agnostic `storeP`). These new instruction versions generate machine code according to the corresponding tagged barrier information, relying on the G1 barrier implementations provided by the `G1BarrierSetAssembler` class. On the aarch64 platform, the bulk of the ADL code is generated from a higher-level version using m4, to reduce redundancy. >> >> #### `G1BarrierSetAssembler` Changes >> >> Both platforms basically reuse the barrier implementation for the bytecode interpreter, with the different barrier tests and operations refactored into dedicated functions. Besides this, `G1BarrierSetAssembler` is extended with assembly-stub routines that implement the out-of-line, slow path of the barriers. These routines include calls from the barrier into the JVM, which require support for saving and restoring live ... > > Roberto Castañeda Lozano has updated the pull request incrementally with six additional commits since the last revision: > > - Add test to motivate compile-time null checks in 'refine_barrier_by_new_val_type' > - Remark relation between compiler optimization and barrier filter > - Make 'refine_barrier_by_new_val_type' static and its input argument 'const' > - Replace 'the null' with 'null' in comment > - Remove redundant redefinitions of '__' > - Replace 'already dirty' with 'young' in post-barrier fast path comment The RISC-V port looks good too.
------------- PR Comment: https://git.openjdk.org/jdk/pull/19746#issuecomment-2321247648 From rcastanedalo at openjdk.org Fri Aug 30 13:43:24 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Fri, 30 Aug 2024 13:43:24 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v9] In-Reply-To: References: Message-ID: On Sun, 25 Aug 2024 06:15:20 GMT, Kim Barrett wrote: >> Roberto Castañeda Lozano has updated the pull request incrementally with one additional commit since the last revision: >> >> Pass oldval to the pre-barrier in g1CompareAndExchange/SwapP > > src/hotspot/share/opto/memnode.cpp line 3468: > >> 3466: // Capture an unaliased, unconditional, simple store into an initializer. >> 3467: // Or, if it is independent of the allocation, hoist it above the allocation. >> 3468: if (ReduceFieldZeroing && ReduceInitialCardMarks && /*can_reshape &&*/ > > It's not obvious to me how this is related to the late barrier changes. I agree this change is not obvious and deserves an explanation. With `ReduceInitialCardMarks` disabled, a store to a newly allocated object requires a post-barrier. In current mainline code, the post-barrier is expanded early, which allows the store-capturing transformation (a first step to avoid needless zeroing in object initialization) to move the store and its post-barrier apart: the store goes into the initialization sequence of the recently allocated object, whereas the post-barrier itself remains outside. Here is an example in pseudo-code of this transformation for early-expanded GC barriers:

(before store capturing):

    allocate object o
    start initialization of o
    ...
    o.f <- 0
    ...
    end initialization of o
    memory barrier (store-store)
    o.f <- new-val
    post-barrier of o.f <- new-val

(after store capturing):

    allocate object o
    start initialization of o
    ...
    o.f <- new-val
    ...
    end initialization of o
    memory barrier (store-store)
    post-barrier of o.f <- new-val

In late barrier expansion, however, the post-barrier is an implicit, inseparable part of the store, so stores with post-barriers have no choice but to stay outside the initialization section. To enforce this, the change simply disables store-capturing analysis in the `!ReduceInitialCardMarks` case, which is the only case where we might find stores with post-barriers on recently allocated objects. A perhaps more principled solution might be extending store-capturing analysis to reject stores with late-expanded barriers. I will give it a try. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1738693695 From rcastanedalo at openjdk.org Fri Aug 30 13:51:23 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Fri, 30 Aug 2024 13:51:23 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v10] In-Reply-To: <5aSkkYnXN8xLzsZy4OSEVNrIG1rv6dOPESBb4I-nfYE=.7027cc32-4dcf-4674-a9af-c960d8a2d95e@github.com> References: <5aSkkYnXN8xLzsZy4OSEVNrIG1rv6dOPESBb4I-nfYE=.7027cc32-4dcf-4674-a9af-c960d8a2d95e@github.com> Message-ID: On Wed, 28 Aug 2024 08:25:06 GMT, Kim Barrett wrote: > I've only looked at the changes in gc directories (shared and cpu-specific). Thanks for your suggestions and comments, Kim! I have addressed them now (modulo a couple of follow-up experiments that I will try out in the coming days); please let me know if you have any further questions.
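To make the store-capturing discussion concrete, here is the kind of Java source pattern it concerns: a reference store into an object immediately after its allocation. This is a hedged illustration; whether C2 actually captures the store depends on `ReduceFieldZeroing`, inlining, and the surrounding code, and the class names are hypothetical. With G1 and `-XX:-ReduceInitialCardMarks`, the store below is the kind that needs a post-barrier, which is why late expansion keeps it out of the initialization section.

```java
/**
 * A store to a freshly allocated object: the pattern C2's store-capturing
 * analysis may fold into the object's initialization (replacing the zeroing).
 */
final class StoreCaptureSketch {
    static final class Node {
        Object payload;   // reference field: zeroed at allocation, then overwritten below
    }

    static Node make(Object value) {
        Node n = new Node();   // allocation plus initialization (payload = null)
        n.payload = value;     // candidate store for capture into the initialization sequence
        return n;
    }
}
```

Calling `make` in a hot loop under, for example, `java -XX:+UseG1GC -XX:-ReduceInitialCardMarks` would exercise the configuration discussed above; the Java-level result is the same either way.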
------------- PR Comment: https://git.openjdk.org/jdk/pull/19746#issuecomment-2321323461 From dfenacci at openjdk.org Fri Aug 30 14:12:36 2024 From: dfenacci at openjdk.org (Damon Fenacci) Date: Fri, 30 Aug 2024 14:12:36 GMT Subject: RFR: 8326615: C1/C2 don't handle allocation failure properly during initialization (RuntimeStub::new_runtime_stub fatal crash) [v10] In-Reply-To: References: Message-ID: > # Issue > > The test `compiler/startup/StartupOutput.java` fails intermittently due to a crash after correctly printing the error `Initial size of CodeCache is too small` (the test limits the code cache using `-XX:InitialCodeCacheSize=1024K -XX:ReservedCodeCacheSize=1200k`). > Whether the issue appears depends heavily on thread scheduling. The original report was for C1 initialization, but C2 initialization is affected as well. > > # Causes > > There is one occurrence during C1 initialization and one during C2 initialization where a call to `RuntimeStub::new_runtime_stub` can fail fatally if there is not enough space left. > For C1: `Compiler::init_c1_runtime` -> `Runtime1::initialize` -> `Runtime1::generate_blob_for` -> `Runtime1::generate_blob` -> `RuntimeStub::new_runtime_stub`. > For C2: `C2Compiler::initialize` -> `OptoRuntime::generate` -> `OptoRuntime::generate_stub` -> `Compile::Compile` -> `Compile::Code_Gen` -> `PhaseOutput::install` -> `PhaseOutput::install_stub` -> `RuntimeStub::new_runtime_stub`. > > # Solution > > https://github.com/openjdk/jdk/pull/15970 introduced an optional argument to `RuntimeStub::new_runtime_stub` that determines whether it fails fatally. We can take advantage of it to avoid crashing and instead pass the information about the success or failure of the allocation up the (C1 and C2 initialization) call stack to the point where we can mark the compilations as failed. Damon Fenacci has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains 29 commits: - JDK-8326615: don't remove empty line between includes - Merge tag 'jdk-24+13' into JDK-8326615 Added tag jdk-24+13 for changeset ff59532d - JDK-8326615: add compiler present macros to includes - JDK-8326615: update copyright year - JDK-8326615: fix min code cache calculation - JDK-8326615: remove empty line from problemlist - Merge tag 'jdk-24+7' into JDK-8326615 Added tag jdk-24+7 for changeset 21a6cf84 - JDK-8326615: calculate minimum code cache size based on initial compiler buffer sizes - JDK-8326615 add forgotten problemlisted configuration after revert - JDK-8326615 add forgotten problemlisted test after revert - ... and 19 more: https://git.openjdk.org/jdk/compare/ff59532d...e7d977e2 ------------- Changes: https://git.openjdk.org/jdk/pull/19280/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=19280&range=09 Stats: 36 lines in 7 files changed: 26 ins; 3 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/19280.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/19280/head:pull/19280 PR: https://git.openjdk.org/jdk/pull/19280 From kxu at openjdk.org Fri Aug 30 14:28:21 2024 From: kxu at openjdk.org (Kangcheng Xu) Date: Fri, 30 Aug 2024 14:28:21 GMT Subject: RFR: 8325495: C2: implement optimization for series of Add of unique value In-Reply-To: References: Message-ID: On Thu, 29 Aug 2024 19:29:30 GMT, Dean Long wrote: >> This pull request resolves [JDK-8325495](https://bugs.openjdk.org/browse/JDK-8325495) by converting a series of additions of the same operand into a multiplication, i.e., `a + a + ... + a + a + a => n*a`. >> >> As an added benefit, it also converts `C * a + a` into `(C+1) * a` and `(a << C) + a` into `(2^C + 1) * a` (for a constant `C`). This is actually a side effect of IGVN being iterative: when converting the `i`-th addition, the previous `i-1` additions will already have been optimized into a multiplication (and, where possible, further into bit shifts and additions/subtractions).
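One way to convince oneself that these rewrites are safe for Java `int` even when the sums overflow is to compare both forms under wrapping two's-complement arithmetic, where addition, multiplication, and left shift all agree modulo 2^32. The Java sketch below does exactly that; the method names are illustrative and unrelated to the patch's C2 code.

```java
/** Compares the original and rewritten forms described for 8325495 under Java's wrapping int arithmetic. */
final class AddSeriesSketch {
    static int threeAdds(int a)  { return a + a + a; }       // a*3
    static int threeShift(int a) { return (a << 1) + a; }    // a*3 after strength reduction
    static int fourAdds(int a)   { return a + a + a + a; }   // a*4
    static int fourShift(int a)  { return a << 2; }

    /** (a << c) + a, for a shift amount 0 <= c < 32. */
    static int shiftPlus(int a, int c)    { return (a << c) + a; }
    /** (2^c + 1) * a, which agrees with shiftPlus modulo 2^32. */
    static int shiftPlusMul(int a, int c) { return ((1 << c) + 1) * a; }
}
```

Because every operation wraps modulo 2^32, the equalities hold for all inputs, including `Integer.MIN_VALUE` and `Integer.MAX_VALUE`.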
>> >> Some notable examples of this transformation include: >> - `a + a + a` => `a*3` => `(a<<1) + a` >> - `a + a + a + a` => `a*4` => `a<<2` >> - `a*3 + a` => `a*4` => `a<<2` >> - `(a << 1) + a + a` => `a*2 + a + a` => `a*3 + a` => `a*4` => `a<<2` >> >> See the included IR unit tests for more. > > Can we redistribute the repeated terms and turn `a + x + a + y + a` into `a*3 + x + y`? @dean-long No, this patch cannot do that, although I agree it is a fairly tempting feature to have. Maybe this can be looked into in a separate issue, but I believe reordering additions is out of scope for this one. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20754#issuecomment-2321430562 From dfenacci at openjdk.org Fri Aug 30 14:38:23 2024 From: dfenacci at openjdk.org (Damon Fenacci) Date: Fri, 30 Aug 2024 14:38:23 GMT Subject: RFR: 8326615: C1/C2 don't handle allocation failure properly during initialization (RuntimeStub::new_runtime_stub fatal crash) [v9] In-Reply-To: References: Message-ID: On Thu, 29 Aug 2024 16:22:01 GMT, Vladimir Kozlov wrote: >> Damon Fenacci has updated the pull request incrementally with one additional commit since the last revision: >> >> JDK-8326615: update copyright year > > src/hotspot/share/compiler/compilerDefinitions.inline.hpp line 29: > >> 27: >> 28: #include "c1/c1_Compiler.hpp" >> 29: #include "opto/c2compiler.hpp" > > These should also be under COMPILER*_PRESENT macros. Right! Fixed. Thanks @vnkozlov. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19280#discussion_r1738803009 From jkarthikeyan at openjdk.org Fri Aug 30 14:58:01 2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Fri, 30 Aug 2024 14:58:01 GMT Subject: RFR: 8336860: x86: Change integer src operand for CMoveL of 0 and 1 to long [v4] In-Reply-To: References: Message-ID: > Hi all, > This patch fixes `cmovL_imm_01*` instructions matching against an integer immediate 1 instead of a long immediate 1.
I noticed while looking at the backend implementation of CMove that the rules specify `immI_1` instead of `immL1`, which means that the instructions can't be matched and instead fall through to the base case. I added a small benchmark and got these results:
>
>                                          Baseline                  Patch
> Benchmark                Mode  Cnt    Score   Error  Units    Score   Error  Units  Improvement
> BasicRules.cmovL_imm_01  avgt   15  259.073 ± 5.806  ns/op  231.108 ± 2.730  ns/op  (+ 11.41%)
>
> Thoughts and reviews would be appreciated! Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: Move architecture checks into IR ------------- Changes: - all: https://git.openjdk.org/jdk/pull/20275/files - new: https://git.openjdk.org/jdk/pull/20275/files/1b926ecf..2d5dc243 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20275&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20275&range=02-03 Stats: 4 lines in 1 file changed: 0 ins; 1 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/20275.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20275/head:pull/20275 PR: https://git.openjdk.org/jdk/pull/20275 From jkarthikeyan at openjdk.org Fri Aug 30 14:58:01 2024 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Fri, 30 Aug 2024 14:58:01 GMT Subject: RFR: 8336860: x86: Change integer src operand for CMoveL of 0 and 1 to long [v2] In-Reply-To: <6NP28KkoRw8wbXPcby4lWfvQS2CLEomS3Pkt0-z3A-U=.172dca0b-5730-4c6f-8f47-7c90370963b0@github.com> References: <6NP28KkoRw8wbXPcby4lWfvQS2CLEomS3Pkt0-z3A-U=.172dca0b-5730-4c6f-8f47-7c90370963b0@github.com> Message-ID: <7QevkgI68p1dKDtP42ahgq3NuEI3422jDcsk5wGfHj4=.baaa08f1-d278-4584-ba05-bac82822d884@github.com> On Thu, 25 Jul 2024 14:28:35 GMT, Emanuel Peter wrote: >> @eme64 or @TobiHartmann might take a look too, I guess? > > @liach @jaskarth I'll run some testing. > > Can you point me to the "base case" you mention in your PR description? Thanks for taking another look @eme64!
I've pushed a commit that moves the architecture checks to the IR annotations. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20275#issuecomment-2321516830 From epeter at openjdk.org Fri Aug 30 15:02:26 2024 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 30 Aug 2024 15:02:26 GMT Subject: RFR: 8338023: Support two vector selectFrom API [v7] In-Reply-To: References: <28KQHru1heR-YOVsRVo8Ffj_4D29IV8vD2tombvTHdI=.dba80ac3-9804-4074-ac0f-8acb9b042a08@github.com> Message-ID: On Thu, 29 Aug 2024 05:42:58 GMT, Jatin Bhateja wrote: >> Hi All, >> >> As per the discussion on the panama-dev mailing list[1], this patch adds support for the following new two-vector permutation API. >> >> >> Declaration:- >> Vector.selectFrom(Vector v1, Vector v2) >> >> >> Semantics:- >> Using the index values stored in the lanes of "this" vector, assemble the values stored in the first (v1) and second (v2) vector arguments. Thus, the first and second vectors serve as a table whose elements are selected by the index vector. The API is applicable to all integral and floating-point types. The result of this operation is semantically equivalent to the expression v1.rearrange(this.toShuffle(), v2). Values held in the index vector lanes must lie within the valid two-vector index range [0, 2*VLEN); otherwise an IndexOutOfBoundsException is thrown. >> >> Summary of changes: >> - Java-side implementation of the new selectFrom API. >> - C2 compiler IR and inline expander changes. >> - In the absence of a direct two-vector permutation instruction in the target ISA, a lowering transformation breaks the new IR down into constituent IR supported by the target platform. >> - Optimized x86 backend implementation for AVX512 and legacy targets. >> - Functional tests covering the new API.
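The index semantics stated in this revision of the PR description can be captured by a scalar reference model. The Java sketch below uses `int[]` arrays in place of vectors; the class and method names are hypothetical. It follows the description's rule that out-of-range indices throw, and the API that is eventually integrated may behave differently.

```java
/** Scalar reference model of two-vector selectFrom, using int[] arrays as "vectors". */
final class SelectFromSketch {
    /** result[i] = index[i] < VLEN ? v1[index[i]] : v2[index[i] - VLEN]; indices must lie in [0, 2*VLEN). */
    static int[] selectFrom(int[] index, int[] v1, int[] v2) {
        int vlen = v1.length;
        if (index.length != vlen || v2.length != vlen) {
            throw new IllegalArgumentException("all vectors must have the same length");
        }
        int[] result = new int[vlen];
        for (int i = 0; i < vlen; i++) {
            int idx = index[i];
            if (idx < 0 || idx >= 2 * vlen) {
                throw new IndexOutOfBoundsException("index " + idx + " outside [0, " + 2 * vlen + ")");
            }
            // The two sources act as one table of 2*VLEN entries.
            result[i] = idx < vlen ? v1[idx] : v2[idx - vlen];
        }
        return result;
    }
}
```

For example, with `v1 = {10, 11, 12, 13}`, `v2 = {20, 21, 22, 23}` and indices `{0, 5, 3, 6}`, the model returns `{10, 21, 13, 22}`.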
>> >> The JMH micro included with this patch shows around a 10-15x gain over the existing rearrange API:- >> Test System: Intel(R) Xeon(R) Platinum 8480+ [ Sapphire Rapids Server] >>
>>
>> Benchmark                                     (size)   Mode  Cnt      Score  Error   Units
>> SelectFromBenchmark.rearrangeFromByteVector     1024  thrpt    2   2041.762         ops/ms
>> SelectFromBenchmark.rearrangeFromByteVector     2048  thrpt    2   1028.550         ops/ms
>> SelectFromBenchmark.rearrangeFromIntVector      1024  thrpt    2    962.605         ops/ms
>> SelectFromBenchmark.rearrangeFromIntVector      2048  thrpt    2    479.004         ops/ms
>> SelectFromBenchmark.rearrangeFromLongVector     1024  thrpt    2    359.758         ops/ms
>> SelectFromBenchmark.rearrangeFromLongVector     2048  thrpt    2    178.192         ops/ms
>> SelectFromBenchmark.rearrangeFromShortVector    1024  thrpt    2   1463.459         ops/ms
>> SelectFromBenchmark.rearrangeFromShortVector    2048  thrpt    2    727.556         ops/ms
>> SelectFromBenchmark.selectFromByteVector        1024  thrpt    2  33254.830         ops/ms
>> SelectFromBenchmark.selectFromByteVector        2048  thrpt    2  17313.174         ops/ms
>> SelectFromBenchmark.selectFromIntVector         1024  thrpt    2  10756.804         ops/ms
>> S...
> > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Adding descriptive comments I left a few comments; hopefully I can spend some more time on this next week. src/hotspot/cpu/x86/matcher_x86.hpp line 215: > 213: } > 214: > 215: static bool vector_indexes_needs_massaging(BasicType ety, int vlen) { The name "massaging" sounds quite vague. Can we have something more expressive / descriptive? Is it the vector that "needs" massaging or the indices that "need" massaging? Why `ety` and not `bt`? Is that not the name we use most often? src/hotspot/cpu/x86/x86.ad line 10490: > 10488: > 10489: > 10490: instruct selectFromTwoVec_evex(vec dst, vec src1, vec src2) You could rename `dst` -> `mask_and_dst`. That might help the reader see more quickly that it serves as both the input mask and the output dst.
src/hotspot/share/opto/vectorIntrinsics.cpp line 2716: > 2714: C->set_max_vector_size(MAX2(C->max_vector_size(), (uint)(num_elem * type2aelembytes(elem_bt)))); > 2715: return true; > 2716: } The code in these methods is heavily duplicated: there is unboxing and boxing in every method around here. Maybe not your problem in this PR. BTW: your error logging used `v1` in all 3 cases `op1-3`; you probably want to give them useful names, `v1-3` probably? All this copy-pasting makes it easy to miss updating some cases... like it happened here. src/hotspot/share/opto/vectornode.cpp line 2090: > 2088: int num_elem = vect_type()->length(); > 2089: BasicType elem_bt = vect_type()->element_basic_type(); > 2090: if (Matcher::match_rule_supported_vector(Op_SelectFromTwoVector, num_elem, elem_bt)) { Suggestion: // Keep the node if it is supported, else lower it to other nodes. if (Matcher::match_rule_supported_vector(Op_SelectFromTwoVector, num_elem, elem_bt)) { src/hotspot/share/opto/vectornode.cpp line 2095: > 2093: Node* index_vec = in(1); > 2094: Node* src1 = in(2); > 2095: Node* src2 = in(3); Suggestion: Node* src1 = in(2); Node* src2 = in(3); unnecessary spaces src/hotspot/share/opto/vectornode.cpp line 2101: > 2099: // (VectorBlend > 2100: // (VectorRearrange SRC1, INDEX) > 2101: // (VectorRearrange SRC2, NORM_INDEX) Suggestion: // (VectorRearrange SRC1 INDEX) // (VectorRearrange SRC2 NORM_INDEX) Either consistently use commas or none at all ;) src/hotspot/share/opto/vectornode.cpp line 2104: > 2102: // MASK) > 2103: // This shall prevent an intrinsification failure and associated argument > 2104: // boxing penalties. A quick comment about how the mask is computed could be nice. `MASK = INDEX < num_elem` src/hotspot/share/opto/vectornode.cpp line 2126: > 2124: case T_FLOAT: > 2125: return phase->transform(new VectorCastF2XNode(index_vec, TypeVect::make(T_INT, num_elem))); > 2126: break; `break` after `return`?
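The lowering discussed in these comments (two single-vector rearranges combined by a blend whose mask is derived from comparing INDEX against the lane count) can be modeled with plain arrays. This is a hedged sketch: it assumes a power-of-two lane count, uses `index & (vlen - 1)` as a stand-in for NORM_INDEX, and takes a lane from the second source whenever its index is not below VLEN. The exact operand order and mask polarity in the patch are not visible here, and all names are hypothetical.

```java
/** Scalar model of lowering two-vector selectFrom into two rearranges plus a blend. */
final class SelectFromLoweringSketch {
    /** Single-vector rearrange: out[i] = src[shuffle[i]]. */
    static int[] rearrange(int[] src, int[] shuffle) {
        int[] out = new int[src.length];
        for (int i = 0; i < src.length; i++) out[i] = src[shuffle[i]];
        return out;
    }

    /** Blend: lanes with mask set come from onTrue, the others from onFalse. */
    static int[] blend(int[] onFalse, int[] onTrue, boolean[] mask) {
        int[] out = new int[onFalse.length];
        for (int i = 0; i < out.length; i++) out[i] = mask[i] ? onTrue[i] : onFalse[i];
        return out;
    }

    /** Assumes vlen is a power of two and every index lies in [0, 2*vlen). */
    static int[] lowered(int[] index, int[] v1, int[] v2) {
        int vlen = v1.length;
        int[] normIndex = new int[vlen];
        boolean[] fromSecond = new boolean[vlen];
        for (int i = 0; i < vlen; i++) {
            normIndex[i] = index[i] & (vlen - 1); // lane position within one source vector
            fromSecond[i] = !(index[i] < vlen);   // complement of INDEX < VLEN
        }
        return blend(rearrange(v1, normIndex), rearrange(v2, normIndex), fromSecond);
    }
}
```

Both rearranges run unconditionally on every lane and the blend discards the unwanted half, which is what makes this lowering branch-free and expressible with existing single-vector IR nodes.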
src/hotspot/share/opto/vectornode.cpp line 2141: > 2139: default: return elem_bt; > 2140: } > 2141: }; This is definitely a style question, but it might be nice to make these functions member functions. They now kinda disrupt the flow of the `::Ideal` method. And in some cases you use the captured variables, and in other cases you pass them in explicitly, even though they already exist in the captured scope... consistency would be nice. src/hotspot/share/opto/vectornode.cpp line 2148: > 2146: > 2147: BoolTest::mask pred = BoolTest::lt; > 2148: ConINode* pred_node = (ConINode*)phase->makecon(TypeInt::make(pred)); Would `as_ConI()` be a better alternative to the `(ConINode*)` cast? src/hotspot/share/opto/vectornode.cpp line 2149: > 2147: BoolTest::mask pred = BoolTest::lt; > 2148: ConINode* pred_node = (ConINode*)phase->makecon(TypeInt::make(pred)); > 2149: Node* lane_cnt = phase->makecon(lane_count_type()); Hmm. I don't like having different names for the same thing: `num_elem`, `lane_count`, and `lane_cnt`. What about a method `make_num_elem_node` that returns a `ConNode*`? Then you pass it around as `num_elem_scalar`, and broadcast it to `num_elem_vector`. src/hotspot/share/opto/vectornode.cpp line 2159: > 2157: > 2158: vmask_type = TypeVect::makemask(elem_bt, num_elem); > 2159: mask = phase->transform(new VectorMaskCastNode(mask, vmask_type)); I would just have two variables instead of overwriting one: `integral_vmask_type` and `vmask_type`. Maybe `mask` could also be split into two variables? src/hotspot/share/opto/vectornode.cpp line 2181: > 2179: default: return elem_bt; > 2180: } > 2181: }; You are now using this twice. Is there not some method that already does this? src/hotspot/share/opto/vectornode.cpp line 2183: > 2181: }; > 2182: // Targets emulating unsupported permutation for certain vector types > 2183: // may need to message the indexes to match the users intent. Suggestion: // may need to massage the indexes to match the users intent.
src/hotspot/share/opto/vectornode.hpp line 1272: > 1270: }; > 1271: > 1272: spurious newline src/hotspot/share/opto/vectornode.hpp line 1621: > 1619: public: > 1620: SelectFromTwoVectorNode(Node* in1, Node* in2, Node* in3, const TypeVect* vt) > 1621: : VectorNode(in1, in2, in3, vt) {} I would prefer more expressive variable names and a short specification of what the node does. Otherwise one always has to reverse-engineer which inputs are acceptable, etc. I mean, you could even require `VectorNode*` as inputs. ------------- PR Review: https://git.openjdk.org/jdk/pull/20508#pullrequestreview-2272308274 PR Review Comment: https://git.openjdk.org/jdk/pull/20508#discussion_r1738648483 PR Review Comment: https://git.openjdk.org/jdk/pull/20508#discussion_r1738738172 PR Review Comment: https://git.openjdk.org/jdk/pull/20508#discussion_r1738759799 PR Review Comment: https://git.openjdk.org/jdk/pull/20508#discussion_r1738767466 PR Review Comment: https://git.openjdk.org/jdk/pull/20508#discussion_r1738768205 PR Review Comment: https://git.openjdk.org/jdk/pull/20508#discussion_r1738765017 PR Review Comment: https://git.openjdk.org/jdk/pull/20508#discussion_r1738823939 PR Review Comment: https://git.openjdk.org/jdk/pull/20508#discussion_r1738808199 PR Review Comment: https://git.openjdk.org/jdk/pull/20508#discussion_r1738781635 PR Review Comment: https://git.openjdk.org/jdk/pull/20508#discussion_r1738787762 PR Review Comment: https://git.openjdk.org/jdk/pull/20508#discussion_r1738806978 PR Review Comment: https://git.openjdk.org/jdk/pull/20508#discussion_r1738814420 PR Review Comment: https://git.openjdk.org/jdk/pull/20508#discussion_r1738838073 PR Review Comment: https://git.openjdk.org/jdk/pull/20508#discussion_r1738835911 PR Review Comment: https://git.openjdk.org/jdk/pull/20508#discussion_r1738729168 PR Review Comment: https://git.openjdk.org/jdk/pull/20508#discussion_r1738866021 From epeter at openjdk.org Fri Aug 30 15:02:27 2024 From: epeter at openjdk.org (Emanuel Peter)
Date: Fri, 30 Aug 2024 15:02:27 GMT Subject: RFR: 8338023: Support two vector selectFrom API [v7] In-Reply-To: References: <28KQHru1heR-YOVsRVo8Ffj_4D29IV8vD2tombvTHdI=.dba80ac3-9804-4074-ac0f-8acb9b042a08@github.com> Message-ID: On Fri, 30 Aug 2024 13:17:26 GMT, Emanuel Peter wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Adding descriptive comments > > src/hotspot/cpu/x86/matcher_x86.hpp line 215: > >> 213: } >> 214: >> 215: static bool vector_indexes_needs_massaging(BasicType ety, int vlen) { > > The name "massaging" sounds quite vague. Can we have something more expressive / descriptive? Is it the vector that "needs" massaging or the indices that "need" massaging? > > Why `ety` and not `bt`? Is that not the name we use most often? Hmm, I see that `ety` is used in other places here. What does it stand for? > src/hotspot/share/opto/vectornode.cpp line 2183: > >> 2181: }; >> 2182: // Targets emulating unsupported permutation for certain vector types >> 2183: // may need to message the indexes to match the users intent. > > Suggestion: > > // may need to massage the indexes to match the users intent. This optimization for now seems quite specific to your `SelectFromTwoVectorNode::Ideal` lowering code. Can this conversion not be done there already? What is the semantics of `VectorRearrangeNode`? Should its shuffle vector always be bytes, and we now violated that "for a quick second"? Or is it going to be generally the idea to create all sorts of shuffle types and then fix that up? But then why do we need the `vector_indexes_needs_massaging`? Can you help me understand the concept/strategy behind this? 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20508#discussion_r1738714401 PR Review Comment: https://git.openjdk.org/jdk/pull/20508#discussion_r1738862138 From kvn at openjdk.org Fri Aug 30 15:42:22 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 30 Aug 2024 15:42:22 GMT Subject: RFR: 8326615: C1/C2 don't handle allocation failure properly during initialization (RuntimeStub::new_runtime_stub fatal crash) [v10] In-Reply-To: References: Message-ID: On Fri, 30 Aug 2024 14:12:36 GMT, Damon Fenacci wrote: >> # Issue >> >> The test `compiler/startup/StartupOutput.java` fails intermittently due to a crash after correctly printing the error `Initial size of CodeCache is too small` (the test limits the code cache using `-XX:InitialCodeCacheSize=1024K -XX:ReservedCodeCacheSize=1200k`). >> The appearance of the issue is very dependent on thread scheduling. The original report happens during C1 initialization but C2 initialization is affected as well. >> >> # Causes >> >> There is one occurrence during C1 initialization and one during C2 initialization where a call to `RuntimeStub::new_runtime_stub` can fail fatally if there is not enough space left. >> For C1: `Compiler::init_c1_runtime` -> `Runtime1::initialize` -> `Runtime1::generate_blob_for` -> `Runtime1::generate_blob` -> `RuntimeStub::new_runtime_stub`. >> For C2: `C2Compiler::initialize` -> `OptoRuntime::generate` -> `OptoRuntime::generate_stub` -> `Compile::Compile` -> `Compile::Code_Gen` -> `PhaseOutput::install` -> `PhaseOutput::install_stub` -> `RuntimeStub::new_runtime_stub`. >> >> # Solution >> >> https://github.com/openjdk/jdk/pull/15970 introduced an optional argument to `RuntimeStub::new_runtime_stub` to determine if it fails fatally or not. We can take advantage of it to avoid crashing and instead pass the information about the success or failure of the allocation up the (C1 and C2 initialization) call stack up to where we can set the compilations as failed.
> > Damon Fenacci has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 29 commits: > > - JDK-8326615: don't remove empty line between includes > - Merge tag 'jdk-24+13' into JDK-8326615 > > Added tag jdk-24+13 for changeset ff59532d > - JDK-8326615: add compiler present macros to includes > - JDK-8326615: update copyright year > - JDK-8326615: fix min code cache calculation > - JDK-8326615: remove empty line from problemlist > - Merge tag 'jdk-24+7' into JDK-8326615 > > Added tag jdk-24+7 for changeset 21a6cf84 > - JDK-8326615: calculate minimum code cache size based on initial compiler buffer sizes > - JDK-8326615 add forgotten problemlisted configuration after revert > - JDK-8326615 add forgotten problemlisted test after revert > - ... and 19 more: https://git.openjdk.org/jdk/compare/ff59532d...e7d977e2 Good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/19280#pullrequestreview-2272808291 From kvn at openjdk.org Fri Aug 30 15:49:19 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 30 Aug 2024 15:49:19 GMT Subject: RFR: 8339299: C1 will miss type profile when inline final method In-Reply-To: References: Message-ID: On Fri, 30 Aug 2024 07:25:59 GMT, kuaiwei wrote: > In c2 compilation, the CHA is broken What do you mean by that? ------------- PR Comment: https://git.openjdk.org/jdk/pull/20786#issuecomment-2321720058 From roland at openjdk.org Fri Aug 30 16:05:10 2024 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 30 Aug 2024 16:05:10 GMT Subject: RFR: 8338100: C2: assert(!n_loop->is_member(get_loop(lca))) failed: control must not be back in the loop Message-ID: The crash occurs because a `Store` is sunk out of a loop that's an inner loop of an infinite loop. The infinite loop was just found to be infinite in the current round of loop opts. When that happens the infinite loop is not properly attached to the rest of the loop tree. 
As a consequence, the `IdealLoopTree` instance for the infinite loop and its children are only partially initialized (`_nest` is not set) and the structure is in an inconsistent state. When the `Store` is sunk it's reported as belonging to a loop but the `IdealLoopTree` for that loop is only half populated. As a consequence a call to `is_dominator` for that loop hits an inconsistency, returns an incorrect result and the assert fires. A possible fix would be a point fix that skips that optimization for a loop that's part of an infinite loop nest. But given basic methods of loop opts can't be trusted to work in the infinite loop nest, I suppose similar issues can surface elsewhere. It's not the first time we have had issues with an infinite loop that's not properly attached to the loop tree the first time it is encountered (a NeverBranch is then added and on the next loop passes, the infinite loop is properly attached to the loop tree). For instance on a loop opts round, C2 can see that it has no loops and on the next that it has some. I propose fixing this by properly attaching the infinite loop to the loop tree when it's first discovered. A comment in the code seems to hint that it requires going over the graph again after the `NeverBranch` is added but I don't think that's the case. I changed the assert in `loopnode.cpp` because it was there to work around the inconsistency I mentioned above (no loop in a round, some loops on the next one). The change in `parse1.cpp` fixes an issue I ran into when testing the fix. The existing logic doesn't properly detect an exception backedge. I added the test case from 8336478 to this. The problem there is that an infinite loop contains a long counted loop. The long counted loop is transformed into a loop nest which is a 2 step process that requires 2 rounds of loop opts. But c2 finds an infinite loop in the middle of the process which causes it to see no more loops and to not attempt another round of loop opts.
The assert fires because it finds a long counted loop nest that's half transformed. The change I propose here fixes this too. If we go with this fix, I'll close 8336478 as duplicate of this one. ------------- Commit messages: - comment - test fix - remove verification code - test & fix Changes: https://git.openjdk.org/jdk/pull/20797/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=20797&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8338100 Stats: 287 lines in 7 files changed: 257 ins; 11 del; 19 mod Patch: https://git.openjdk.org/jdk/pull/20797.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20797/head:pull/20797 PR: https://git.openjdk.org/jdk/pull/20797 From chagedorn at openjdk.org Fri Aug 30 17:31:25 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 30 Aug 2024 17:31:25 GMT Subject: RFR: 8330159: [C2] Remove or clarify Compile::init_start [v3] In-Reply-To: <4tc9doDDziDT16yv9jghJoWtiPO3AEWOuO0wfPk1QGs=.9ff1f3c3-82e0-4ab1-8437-bc12ea34820d@github.com> References: <7Q9VAVqBUDDnqEcCOWeRh5N-YC0dsg9lyQsOv6xbO80=.6d55dc58-0281-45fd-a92f-d0f6ddd910cc@github.com> <4tc9doDDziDT16yv9jghJoWtiPO3AEWOuO0wfPk1QGs=.9ff1f3c3-82e0-4ab1-8437-bc12ea34820d@github.com> Message-ID: On Thu, 29 Aug 2024 11:23:50 GMT, Yagmur Eren wrote: >> Compile::init_start method contained only an assertion. To cleanup, this method is removed and the locations where this method is called are replaced with the corresponding assertion. See issue: [JDK-8330159](https://bugs.openjdk.org/browse/JDK-8330159) > > Yagmur Eren has updated the pull request incrementally with one additional commit since the last revision: > > remove method header Looks good, thanks for the update. ------------- Marked as reviewed by chagedorn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/20715#pullrequestreview-2273144775 From coleenp at openjdk.org Fri Aug 30 17:31:25 2024 From: coleenp at openjdk.org (Coleen Phillimore) Date: Fri, 30 Aug 2024 17:31:25 GMT Subject: RFR: 8338924: C1: assert(0 <= i && i < _len) failed: illegal index 5 for length 5 In-Reply-To: References: Message-ID: On Tue, 27 Aug 2024 18:01:16 GMT, Matias Saavedra Silva wrote: > The change in [JDK-8335664](https://bugs.openjdk.org/browse/JDK-8335664) caused a crash when initializing basic blocks with `-Xcomp`. This change introduces a check to see if JSR is the last bytecode in its method so that expected behavior matches the previous patch. Verified with tier 1-6 tests. Just a comment. test/hotspot/jtreg/runtime/interpreter/LastJsrTest.java line 39: > 37: public class LastJsrTest { > 38: public static void main(String[] args) { > 39: for (int i = 0; i < 1000; ++i) { Don't you need 10,000 in your loop to trigger compilation? ------------- PR Review: https://git.openjdk.org/jdk/pull/20732#pullrequestreview-2273143025 PR Review Comment: https://git.openjdk.org/jdk/pull/20732#discussion_r1739172439 From chagedorn at openjdk.org Fri Aug 30 17:33:25 2024 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 30 Aug 2024 17:33:25 GMT Subject: RFR: 8334724: C2: remove PhaseIdealLoop::cast_incr_before_loop() [v3] In-Reply-To: <_J7rxi43QXdtD_aJuhnbnoxDFUIi7HMFsHUutkpijw8=.324943d4-001a-4833-a106-ee46f7fbc6e6@github.com> References: <1w8L64U3JeY8qKskSNm4OlTablOucHHThfW-9nakz4E=.e52b0193-9048-4382-8242-f5296483898a@github.com> <_J7rxi43QXdtD_aJuhnbnoxDFUIi7HMFsHUutkpijw8=.324943d4-001a-4833-a106-ee46f7fbc6e6@github.com> Message-ID: On Mon, 26 Aug 2024 08:46:41 GMT, Roland Westrelin wrote: >> I propose removing `PhaseIdealLoop::cast_incr_before_loop()` and the >> `CastII` nodes that it adds at counted loop heads. >> >> They were added to prevent nodes to float above the zero trip guard >> when the backedge of a counted loop is removed. 
In particular, when a >> range check is hoisted by predication, pre/main/post loops are created >> and if one of the main or post loops lose its backedge, an array load >> that's control dependent on a predicate above the pre loop could float >> above the zero trip guard of the main or post loop. That can no longer >> happen AFAICT with changes related to assert predicates. The array >> load is now updated to have a control dependency that's below the zero >> trip guard. >> >> The reason I'm revisiting this is that I noticed that >> `PhaseIdealLoop::cast_incr_before_loop()` has a bug. When it adds the >> `CastII`, it looks for the loop phi and picks input 1 of the phi it >> finds as input to the `CastII`. To find the loop phi, it starts from >> the loop incremement and loop for a use that's a phi and has the loop >> head as control. It never checks that the phi it finds is the loop >> phi. There can be more than one phi as uses of the increment at the >> loop head and it can pick the wrong one. I tried to write a test case >> where this would cause a bug but couldn't actually find any use for >> the `CastII` anymore. >> >> In my testing, the only issue when the `CastII` are not added is that >> some IR tests for vectorization fails: >> >> compiler/vectorization/TestPopulateIndex.java >> compiler/vectorization/runner/ArrayShiftOpTest.java >> compiler/vectorization/runner/LoopArrayIndexComputeTest.java >> >> because removing the `CastII` causes split if to occur with some nodes >> that take the loop phi as input. That then causes pattern matching >> during superword to break. I added logic to prevent split if for those >> cases. > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Merge branch 'master' into JDK-8334724 > - review > - fix Testing passed! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/19831#issuecomment-2322030739 From dlong at openjdk.org Fri Aug 30 21:38:20 2024 From: dlong at openjdk.org (Dean Long) Date: Fri, 30 Aug 2024 21:38:20 GMT Subject: RFR: 8338407: Support grouping several of existing regs into a new one In-Reply-To: References: Message-ID: On Thu, 29 Aug 2024 15:11:32 GMT, Hamlin Li wrote: > Hi, > Can you help to review this patch to add `group` support to operand? > > ### Some background about this pr > > In some platforms, there is some concept like a group of registers, for example on riscv there is vector group, which is a group of other single vectors. For example, m2 could be v2+v3, or v4+v5, m4 could be v4+v5+v6+v7, or v8+v9+v10+v11. > And, it's helpful to represent these vector group explicitly, otherwise it's tedious and error-prone. For example, in existing code, there's some like below: > > instruct vstring_compareUL(iRegP_R11 str1, iRegI_R12 cnt1, iRegP_R13 str2, iRegI_R14 cnt2, > iRegI_R10 result, vReg_V4 v4, vReg_V5 v5, vReg_V6 v6, vReg_V7 v7, > vReg_V8 v8, vReg_V9 v9, vReg_V10 v10, vReg_V11 v11, > iRegP_R28 tmp1, iRegL_R29 tmp2) > // ... > effect(KILL tmp1, KILL tmp2, USE_KILL str1, USE_KILL str2, USE_KILL cnt1, USE_KILL cnt2, > TEMP v4, TEMP v5, TEMP v6, TEMP v7, TEMP v8, TEMP v9, TEMP v10, TEMP v11); > // ... > __ string_compare_v($str1$$Register, $str2$$Register, > $cnt1$$Register, $cnt2$$Register, $result$$Register, > $tmp1$$Register, $tmp2$$Register, > StrIntrinsicNode::UL); > > The potential problems of the above code are that we need to > 1. write v4~v11 explicitly in its `instruct` and its `effect`, it's tedious; > 2. vector group are represented implicitly, which is not clear and error-prone; > 3. in its encoding `string_compare_v`, we need to specify m4, and v4/v8 explicitly. > 4. if some day we need to adjust from m4 to m2 or m8, it's really tedious and error-prone to make that change in both ad file and macro assembler files. 
> > > ### This PR > > The proposed solution is to represent a group of vector registers with a real vector group, e.g. `vReg_V4 v4, vReg_V5 v5, vReg_V6 v6, vReg_V7 v7` with `vReg_V4M4 v4m4`, `TEMP v4, TEMP v5, TEMP v6, TEMP v7` with `TEMP v4m4` and in `string_compare_v` implementation, we could query the length of the vector group (i.e. m4 in this case) and set its vtype automatically. > This solution solves the above listed issues, especially the last issue, that means in the future if we need to adjust m4 to m2 or m8, we only need to change the code in ad file and the change is simpler, and no change in string_compare_v is needed. > > ### What it looks like > > For more usage details, please check [here](https://github.com/openjdk/jdk/compare/master...Hamlin-Li:jdk:explicit-v-reg-gro... Are you trying to support arbitrary groups of vectors, or only aligned and sequential groups, like on riscv? > 1. One of them is to just add some new reg class and operand, it kind of worked, but can only prevent other regs in an instruct using the one of the vector regs in a vector group. I don't see why this wouldn't work. As long as the register mask is correct, it should prevent/exclude all the vector regs in a vector group. I believe this is how arm32 implements vecD on the dflt_low_reg registers. 64-bit "D" vectors are composed of two adjacent 32-bit "S" registers. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20775#issuecomment-2322370471 From dlong at openjdk.org Fri Aug 30 21:41:24 2024 From: dlong at openjdk.org (Dean Long) Date: Fri, 30 Aug 2024 21:41:24 GMT Subject: RFR: 8339299: C1 will miss type profile when inline final method In-Reply-To: References: Message-ID: On Fri, 30 Aug 2024 07:25:59 GMT, kuaiwei wrote: > I found sometimes C1 will miss type profile and I tried to write a test to demonstrate it. > It will happen with these conditions > 1 It's a virtual call and the callee is a final method. So c1 will think it's static bound.
> 2 The interpreter never touch the callsite, so interpreter does not add type profile. > 3 In c1 compilation, it will be inlined based on CHA. > 4 In c2 compilation, the CHA is broken, but type profile is missing, so c2 can not inline it. src/hotspot/share/c1/c1_LIR.hpp line 2037: > 2035: bool callee_is_static = _profiled_callee->is_loaded() && _profiled_callee->is_static(); > 2036: Bytecodes::Code bc = _profiled_method->java_code_at_bci(_profiled_bci); > 2037: bool call_is_virtual = (bc == Bytecodes::_invokevirtual && (UseCHA || !_profiled_callee->can_be_statically_bound())) || bc == Bytecodes::_invokeinterface; UseCHA is on by default, so this effectively always turns on profiling. I think it would be better if we set a flag that says if CHA was used or not. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20786#discussion_r1739431991 From dlong at openjdk.org Fri Aug 30 22:26:23 2024 From: dlong at openjdk.org (Dean Long) Date: Fri, 30 Aug 2024 22:26:23 GMT Subject: RFR: 8339299: C1 will miss type profile when inline final method In-Reply-To: References: Message-ID: On Fri, 30 Aug 2024 15:47:01 GMT, Vladimir Kozlov wrote: > > In c2 compilation, the CHA is broken > > What do you mean by that? It looks like the test invalidates the initial CHA optimization that was done when there was only one subclass by loading a 2nd subclass. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20786#issuecomment-2322460942 From vlivanov at openjdk.org Fri Aug 30 22:26:24 2024 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Fri, 30 Aug 2024 22:26:24 GMT Subject: RFR: 8339299: C1 will miss type profile when inline final method In-Reply-To: References: Message-ID: On Fri, 30 Aug 2024 22:19:36 GMT, Dean Long wrote: > It looks like the test invalidates the initial CHA optimization that was done when there was only one subclass by loading a 2nd subclass.
Moreover, CHA has to discover a final method in order to satisfy `can_be_statically_bound()` predicate. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20786#issuecomment-2322484125 From vlivanov at openjdk.org Fri Aug 30 22:26:25 2024 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Fri, 30 Aug 2024 22:26:25 GMT Subject: RFR: 8339299: C1 will miss type profile when inline final method In-Reply-To: References: Message-ID: On Fri, 30 Aug 2024 21:38:33 GMT, Dean Long wrote: >> I found sometimes C1 will miss type profile and I tried to write a test to demonstrate it. >> It will happen with these conditions >> 1 It's a virtual call and the callee is a final method. So c1 will think it's static bound. >> 2 The interpreter never touch the callsite, so interpreter does not add type profile. >> 3 In c1 compilation, it will be inlined based on CHA. >> 4 In c2 compilation, the CHA is broken, but type profile is missing, so c2 can not inline it. > > src/hotspot/share/c1/c1_LIR.hpp line 2037: > >> 2035: bool callee_is_static = _profiled_callee->is_loaded() && _profiled_callee->is_static(); >> 2036: Bytecodes::Code bc = _profiled_method->java_code_at_bci(_profiled_bci); >> 2037: bool call_is_virtual = (bc == Bytecodes::_invokevirtual && (UseCHA || !_profiled_callee->can_be_statically_bound())) || bc == Bytecodes::_invokeinterface; > > UseCHA is on by default, so this effectively always turns on profiling. I think it would be better if we set a flag that says if CHA was used or not. IMO a better option is to simply drop `can_be_statically_bound()` check here. As this bug demonstrates, selected method at call site (represented by `_profiled_callee`) is not an accurate representation of call site behavior. Instead, to keep things simple, a check for private methods (which don't participate in virtual dispatch) can be introduced. 
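To make the two policies in this exchange concrete, here is a deliberately simplified model. The `CallSite` struct and both functions are hypothetical and are not C1 code. The model also ignores cases such as final holder classes. Without the `UseCHA ||` term, an `invokevirtual` whose resolved callee can be statically bound (for example a `final` method) gets no receiver-type profile. Under the suggestion above, only calls to private methods would be excluded.

```cpp
// Hypothetical, simplified model of the decision around c1_LIR.hpp line 2037.
enum class Invoke { Virtual, Interface, Static, Special };

struct CallSite {
  Invoke bc;
  bool callee_is_final;    // resolved callee is declared final
  bool callee_is_private;  // private methods take no part in virtual dispatch
};

// Check without the UseCHA term (simplified): a final or private callee
// counts as statically bound, so no receiver profile is collected.
bool profiles_receiver_current(const CallSite& s) {
  bool statically_bound = s.callee_is_final || s.callee_is_private;
  return (s.bc == Invoke::Virtual && !statically_bound) ||
         s.bc == Invoke::Interface;
}

// Suggested check: keep profiling every invokevirtual except private callees.
bool profiles_receiver_suggested(const CallSite& s) {
  return (s.bc == Invoke::Virtual && !s.callee_is_private) ||
         s.bc == Invoke::Interface;
}
```

Under the suggested rule, a call site like the `final`-method one in the test keeps its receiver profile. That profile is what C2 would need once the class hierarchy changes.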
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20786#discussion_r1739466411 From vlivanov at openjdk.org Fri Aug 30 22:31:24 2024 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Fri, 30 Aug 2024 22:31:24 GMT Subject: RFR: 8339299: C1 will miss type profile when inline final method In-Reply-To: References: Message-ID: On Fri, 30 Aug 2024 07:25:59 GMT, kuaiwei wrote: > I found sometimes C1 will miss type profile and I tried to write a test to demonstrate it. > It will happen with these conditions > 1 It's a virtual call and the callee is a final method. So c1 will think it's static bound. > 2 The interpreter never touch the callsite, so interpreter does not add type profile. > 3 In c1 compilation, it will be inlined based on CHA. > 4 In c2 compilation, the CHA is broken, but type profile is missing, so c2 can not inline it. test/hotspot/jtreg/compiler/cha/TypeProfileFinalMethod.java line 43: > 41: public class TypeProfileFinalMethod { > 42: public static void main(String[] args) throws Exception { > 43: if (args.length == 1 && args[0].equals("Run")) { Instead of a check at runtime, you can introduce a separate class which drives test logic. Take a look at `compiler/jsr292/MHInlineTest.java` for an example (or grep for `class Launcher` under `test/hotspot/jtreg`). 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20786#discussion_r1739478622 From dlong at openjdk.org Fri Aug 30 22:37:32 2024 From: dlong at openjdk.org (Dean Long) Date: Fri, 30 Aug 2024 22:37:32 GMT Subject: RFR: 8339299: C1 will miss type profile when inline final method In-Reply-To: References: Message-ID: On Fri, 30 Aug 2024 22:21:10 GMT, Vladimir Ivanov wrote: >> src/hotspot/share/c1/c1_LIR.hpp line 2037: >> >>> 2035: bool callee_is_static = _profiled_callee->is_loaded() && _profiled_callee->is_static(); >>> 2036: Bytecodes::Code bc = _profiled_method->java_code_at_bci(_profiled_bci); >>> 2037: bool call_is_virtual = (bc == Bytecodes::_invokevirtual && (UseCHA || !_profiled_callee->can_be_statically_bound())) || bc == Bytecodes::_invokeinterface; >> >> UseCHA is on by default, so this effectively always turns on profiling. I think it would be better if we set a flag that says if CHA was used or not. > > IMO a better option is to simply drop `can_be_statically_bound()` check here. As this bug demonstrates, selected method at call site (represented by `_profiled_callee`) is not an accurate representation of call site behavior. > > Instead, to keep things simple, a check for private methods (which don't participate in virtual dispatch) can be introduced. Would it be more appropriate to use can_be_statically_bound(ciInstanceKlass* context) here, where in this case _profiled_callee is Child1 and context is Parent? 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20786#discussion_r1739484828 From vlivanov at openjdk.org Fri Aug 30 22:41:18 2024 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Fri, 30 Aug 2024 22:41:18 GMT Subject: RFR: 8339299: C1 will miss type profile when inline final method In-Reply-To: References: Message-ID: <-b1JTFIJbwhOHz4a5fasgVmF-aeaOuUuD-UWvNr_XSs=.1e9fa592-c984-40e7-b317-21a072e583b9@github.com> On Fri, 30 Aug 2024 07:25:59 GMT, kuaiwei wrote: > I found sometimes C1 will miss type profile and I tried to write a test to demonstrate it. > It will happen with these conditions > 1 It's a virtual call and the callee is a final method. So c1 will think it's static bound. > 2 The interpreter never touch the callsite, so interpreter does not add type profile. > 3 In c1 compilation, it will be inlined based on CHA. > 4 In c2 compilation, the CHA is broken, but type profile is missing, so c2 can not inline it. test/hotspot/jtreg/compiler/cha/cha_control.txt line 1: > 1: [ Currently, the prevalent way to specify compiler directives is through WhiteBox API at runtime (through `WhiteBox.addCompilerDirective(String directive)`). Please, follow the same pattern here. I find it more convenient to reason about test logic when all the pieces are present in a single place. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20786#discussion_r1739488135 From vlivanov at openjdk.org Fri Aug 30 23:06:22 2024 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Fri, 30 Aug 2024 23:06:22 GMT Subject: RFR: 8339299: C1 will miss type profile when inline final method In-Reply-To: References: Message-ID: On Fri, 30 Aug 2024 22:34:35 GMT, Dean Long wrote: >> IMO a better option is to simply drop `can_be_statically_bound()` check here. As this bug demonstrates, selected method at call site (represented by `_profiled_callee`) is not an accurate representation of call site behavior. 
>> >> Instead, to keep things simple, a check for private methods (which don't participate in virtual dispatch) can be introduced. > > Would it be more appropriate to use can_be_statically_bound(ciInstanceKlass* context) here, where in this case _profiled_callee is Child1 and context is Parent? That's definitely an option, but then you need to determine the context. Do you propose to use declared method holder as a context here? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20786#discussion_r1739499938 From dlong at openjdk.org Fri Aug 30 23:38:28 2024 From: dlong at openjdk.org (Dean Long) Date: Fri, 30 Aug 2024 23:38:28 GMT Subject: RFR: 8339242: Fix overflow issues in AdlArena In-Reply-To: References: Message-ID: On Thu, 29 Aug 2024 15:07:46 GMT, Casper Norrbin wrote: > Hi everyone, > > This PR addresses an issue in `adlArena` where some allocations lack checks for overflow. This could potentially result in successful allocations when called with unrealistic values. > > The fix includes: > > - Adding assertions to check for potential overflow. > - Reordering some operations to guard against overflow. src/hotspot/share/adlc/adlArena.cpp line 154: > 152: if( (c_old+old_size == _hwm) && // Adjusting recent thing > 153: ((size_t)(_max-c_old) >= new_size) ) { // Still fits where it sits, safe from overflow > 154: This code appears to be a copy of Arena::Arealloc, so we should probably fix both at the same time. 
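The reordered condition matters because `c_old + new_size` can wrap around for an unrealistic `new_size` and then appear to fit. By contrast, `_max - c_old` cannot wrap as long as `c_old <= _max`. Below is a minimal sketch with hypothetical helper names. It models the pointers as `uintptr_t` values so the unsigned arithmetic is well defined.

```cpp
#include <cstddef>
#include <cstdint>

// Naive form: c_old + new_size may wrap around and wrongly report a fit.
bool fits_naive(std::uintptr_t c_old, std::uintptr_t max, std::size_t new_size) {
  return c_old + new_size <= max;
}

// Overflow-safe form, mirroring (size_t)(_max - c_old) >= new_size.
// Assumes c_old <= max, so the subtraction itself cannot wrap.
bool fits_safe(std::uintptr_t c_old, std::uintptr_t max, std::size_t new_size) {
  return static_cast<std::size_t>(max - c_old) >= new_size;
}
```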
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20774#discussion_r1739516231 From dlong at openjdk.org Fri Aug 30 23:48:28 2024 From: dlong at openjdk.org (Dean Long) Date: Fri, 30 Aug 2024 23:48:28 GMT Subject: RFR: 8339299: C1 will miss type profile when inline final method In-Reply-To: References: Message-ID: On Fri, 30 Aug 2024 23:03:38 GMT, Vladimir Ivanov wrote: >> Would it be more appropriate to use can_be_statically_bound(ciInstanceKlass* context) here, where in this case _profiled_callee is Child1 and context is Parent? > > That's definitely an option, but then you need to determine the context. Do you propose to use declared method holder as a context here? Yes, exactly. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20786#discussion_r1739524607 From dlong at openjdk.org Fri Aug 30 23:54:18 2024 From: dlong at openjdk.org (Dean Long) Date: Fri, 30 Aug 2024 23:54:18 GMT Subject: RFR: 8338924: C1: assert(0 <= i && i < _len) failed: illegal index 5 for length 5 In-Reply-To: References: Message-ID: On Fri, 30 Aug 2024 17:28:16 GMT, Coleen Phillimore wrote: >> The change in [JDK-8335664](https://bugs.openjdk.org/browse/JDK-8335664) caused a crash when initializing basic blocks with `-Xcomp`. This change introduces a check to see if JSR is the last bytecode in its method so that expected behavior matches the previous patch. Verified with tier 1-6 tests. > > test/hotspot/jtreg/runtime/interpreter/LastJsrTest.java line 39: > >> 37: public class LastJsrTest { >> 38: public static void main(String[] args) { >> 39: for (int i = 0; i < 1000; ++i) { > > Don't you need 10,000 in your loop to trigger compilation? Yes for C2, but this is enough for C1, the only compiler that needs this fix. I wanted to make sure C1 compilation was triggered by default without -Xcomp. Testing tiers that use -Xcomp will make sure it passes with C2. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20732#discussion_r1739532956 From dlong at openjdk.org Sat Aug 31 00:04:20 2024 From: dlong at openjdk.org (Dean Long) Date: Sat, 31 Aug 2024 00:04:20 GMT Subject: RFR: 8330159: [C2] Remove or clarify Compile::init_start [v3] In-Reply-To: <4tc9doDDziDT16yv9jghJoWtiPO3AEWOuO0wfPk1QGs=.9ff1f3c3-82e0-4ab1-8437-bc12ea34820d@github.com> References: <7Q9VAVqBUDDnqEcCOWeRh5N-YC0dsg9lyQsOv6xbO80=.6d55dc58-0281-45fd-a92f-d0f6ddd910cc@github.com> <4tc9doDDziDT16yv9jghJoWtiPO3AEWOuO0wfPk1QGs=.9ff1f3c3-82e0-4ab1-8437-bc12ea34820d@github.com> Message-ID: On Thu, 29 Aug 2024 11:23:50 GMT, Yagmur Eren wrote: >> Compile::init_start method contained only an assertion. To cleanup, this method is removed and the locations where this method is called are replaced with the corresponding assertion. See issue: [JDK-8330159](https://bugs.openjdk.org/browse/JDK-8330159) > > Yagmur Eren has updated the pull request incrementally with one additional commit since the last revision: > > remove method header src/hotspot/share/opto/generateOptoStub.cpp line 264: > 262: returnadr()); > 263: root()->add_req(_gvn.transform(to_exc)); // bind to root to keep live > 264: DEBUG_ONLY(C->verify_start(start);) This looks fine, but instead of marking every call site with DEBUG_ONLY, how about adding NOT_DEBUG_RETURN to the declaration of verify_start(), so it is a no-op in non-debug builds? For an example, see check_no_dead_use(). 
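Below is a minimal, self-contained sketch of the suggested pattern. It uses hypothetical `MY_`-prefixed names instead of HotSpot's definitions, and it assumes HotSpot's `NOT_DEBUG_RETURN` has the same shape. In debug builds the macro expands to nothing, so the declaration needs an out-of-line definition. In product builds it expands to an empty body `{}`, so the call site needs no `DEBUG_ONLY(...)` wrapper.

```cpp
// Hypothetical stand-ins for HotSpot's ASSERT / NOT_DEBUG_RETURN pair.
#ifdef MY_ASSERT
#define MY_NOT_DEBUG_RETURN /* declaration only; defined out of line below */
#else
#define MY_NOT_DEBUG_RETURN {}
#endif

struct StartNode {};

struct Compile {
  int verify_calls = 0;
  // Debug build: plain declaration. Product build: empty inline body.
  void verify_start(StartNode* start) MY_NOT_DEBUG_RETURN;
};

#ifdef MY_ASSERT
void Compile::verify_start(StartNode* start) {
  // Real debug-only checks would go here; this sketch only records the call.
  (void)start;
  verify_calls++;
}
#endif

// The call site is identical in both builds; no DEBUG_ONLY(...) wrapper.
void generate_stub(Compile* C, StartNode* start) {
  C->verify_start(start);
}
```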
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20715#discussion_r1739548132 From kvn at openjdk.org Sat Aug 31 00:15:19 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sat, 31 Aug 2024 00:15:19 GMT Subject: RFR: 8339299: C1 will miss type profile when inline final method In-Reply-To: References: Message-ID: On Fri, 30 Aug 2024 22:23:22 GMT, Vladimir Ivanov wrote: > > It looks like the test invalidates the initial CHA optimization that was done when there was only one subclass by loading a 2nd subclass. > > Moreover, CHA has to discover a final method in order to satisfy `can_be_statically_bound()` predicate. Got it. I assume C1 compiled code will be deoptimized in this case (CHA changed). But C2 compilation could be triggered before that and new CH will be used by it. This looks like a rare case. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20786#issuecomment-2322625237 From kvn at openjdk.org Sat Aug 31 00:19:19 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sat, 31 Aug 2024 00:19:19 GMT Subject: RFR: 8339299: C1 will miss type profile when inline final method In-Reply-To: References: Message-ID: On Fri, 30 Aug 2024 23:45:39 GMT, Dean Long wrote: >> That's definitely an option, but then you need to determine the context. Do you propose to use declared method holder as a context here? > > Yes, exactly. I am not sure how Dean's proposal will help. I agree with Vladimir's suggestion - C1 should not optimize call sites in Level 3 compilation.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20786#discussion_r1739553478 From kvn at openjdk.org Sat Aug 31 00:23:19 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sat, 31 Aug 2024 00:23:19 GMT Subject: RFR: 8338924: C1: assert(0 <= i && i < _len) failed: illegal index 5 for length 5 In-Reply-To: References: Message-ID: On Tue, 27 Aug 2024 18:01:16 GMT, Matias Saavedra Silva wrote: > The change in [JDK-8335664](https://bugs.openjdk.org/browse/JDK-8335664) caused a crash when initializing basic blocks with `-Xcomp`. This change introduces a check to see if JSR is the last bytecode in its method so that expected behavior matches the previous patch. Verified with tier 1-6 tests. test/hotspot/jtreg/runtime/interpreter/LastJsrTest.java line 34: > 32: * @run main/othervm > 33: * -Xbatch > 34: * LastJsrTest I suggest keeping the old `command` too. And I don't like splitting the new command into several lines when it is this short. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20732#discussion_r1739554211 From kvn at openjdk.org Sat Aug 31 00:26:18 2024 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sat, 31 Aug 2024 00:26:18 GMT Subject: RFR: 8338924: C1: assert(0 <= i && i < _len) failed: illegal index 5 for length 5 In-Reply-To: References: Message-ID: On Tue, 27 Aug 2024 18:01:16 GMT, Matias Saavedra Silva wrote: > The change in [JDK-8335664](https://bugs.openjdk.org/browse/JDK-8335664) caused a crash when initializing basic blocks with `-Xcomp`. This change introduces a check to see if JSR is the last bytecode in its method so that expected behavior matches the previous patch. Verified with tier 1-6 tests. src/hotspot/share/compiler/methodLiveness.cpp line 227: > 225: > 226: if (bci + Bytecodes::length_for(code) >= method_len) break; > 227: Please use the code style that uses `{}` and puts the body on a separate line, as you did in `c1_GraphBuilder.cpp`. Then you will not need the empty lines around it.
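Kozlov is asking for the braced form of the last-bytecode guard. The sketch below shows that form inside a toy bytecode walk that marks the bci after each `jsr` (its return point) as a block start; `ToyCode`, `Insn`, `toy_length_for`, and `block_starts` are hypothetical stand-ins, not the actual `MethodLiveness` code or `Bytecodes::length_for`. In a 5-byte method whose last bytecode is a `jsr` at bci 2, the return point would be bci 5, one past the end, which is the same kind of index as the `illegal index 5 for length 5` in the assert message; the guard skips it.

```cpp
#include <vector>

// Hypothetical toy opcodes standing in for real Bytecodes::Code values.
enum ToyCode { kNop, kJsr };

// Hypothetical stand-in for Bytecodes::length_for(): jsr is 3 bytes
// (opcode plus a 2-byte branch offset); the toy nop is 1 byte.
int toy_length_for(ToyCode code) {
  return code == kJsr ? 3 : 1;
}

struct Insn {
  int bci;
  ToyCode code;
};

// Marks the bci following each jsr as a block start.
// A jsr that is the last bytecode has no return point inside the method.
std::vector<bool> block_starts(const std::vector<Insn>& insns, int method_len) {
  std::vector<bool> starts(method_len, false);
  for (const Insn& insn : insns) {
    switch (insn.code) {
      case kJsr: {
        int next_bci = insn.bci + toy_length_for(insn.code);
        if (next_bci >= method_len) {
          break;  // leaves the switch: nothing follows this jsr
        }
        starts[next_bci] = true;
        break;
      }
      default:
        break;
    }
  }
  return starts;
}
```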
src/hotspot/share/compiler/methodLiveness.cpp line 240: > 238: > 239: if (bci + Bytecodes::length_for(code) >= method_len) break; > 240: Same here. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20732#discussion_r1739554607 PR Review Comment: https://git.openjdk.org/jdk/pull/20732#discussion_r1739554749 From stuefe at openjdk.org Sat Aug 31 04:52:20 2024 From: stuefe at openjdk.org (Thomas Stuefe) Date: Sat, 31 Aug 2024 04:52:20 GMT Subject: RFR: 8339242: Fix overflow issues in AdlArena In-Reply-To: References: Message-ID: On Thu, 29 Aug 2024 15:07:46 GMT, Casper Norrbin wrote: > Hi everyone, > > This PR addresses an issue in `adlArena` where some allocations lack checks for overflow. This could potentially result in successful allocations when called with unrealistic values. > > The fix includes: > > - Adding assertions to check for potential overflow. > - Reordering some operations to guard against overflow. If the aim is to increase security, would it not make more sense to test against hardcoded "reasonable max" values? Anything larger than a few MB is likely to be an error anyway, no? ------------- PR Comment: https://git.openjdk.org/jdk/pull/20774#issuecomment-2322770725 From mli at openjdk.org Sat Aug 31 10:48:18 2024 From: mli at openjdk.org (Hamlin Li) Date: Sat, 31 Aug 2024 10:48:18 GMT Subject: RFR: 8338407: Support grouping several of existing regs into a new one In-Reply-To: References: Message-ID: On Thu, 29 Aug 2024 15:11:32 GMT, Hamlin Li wrote: > Hi, > Could you help review this patch, which adds `group` support to operands? > > ### Some background about this PR > > On some platforms there is the concept of a group of registers; for example, on riscv there are vector groups, each made up of several single vector registers. For example, m2 could be v2+v3 or v4+v5, and m4 could be v4+v5+v6+v7 or v8+v9+v10+v11. > It is helpful to represent these vector groups explicitly; otherwise the code is tedious and error-prone.
For example, the existing code contains something like the following: > > instruct vstring_compareUL(iRegP_R11 str1, iRegI_R12 cnt1, iRegP_R13 str2, iRegI_R14 cnt2, > iRegI_R10 result, vReg_V4 v4, vReg_V5 v5, vReg_V6 v6, vReg_V7 v7, > vReg_V8 v8, vReg_V9 v9, vReg_V10 v10, vReg_V11 v11, > iRegP_R28 tmp1, iRegL_R29 tmp2) > // ... > effect(KILL tmp1, KILL tmp2, USE_KILL str1, USE_KILL str2, USE_KILL cnt1, USE_KILL cnt2, > TEMP v4, TEMP v5, TEMP v6, TEMP v7, TEMP v8, TEMP v9, TEMP v10, TEMP v11); > // ... > __ string_compare_v($str1$$Register, $str2$$Register, > $cnt1$$Register, $cnt2$$Register, $result$$Register, > $tmp1$$Register, $tmp2$$Register, > StrIntrinsicNode::UL); > > The potential problems with the above code are: > 1. we need to write v4~v11 explicitly in both its `instruct` and its `effect`, which is tedious; > 2. vector groups are represented implicitly, which is unclear and error-prone; > 3. in its encoding `string_compare_v`, we need to specify m4 and v4/v8 explicitly; > 4. if some day we need to change from m4 to m2 or m8, it is really tedious and error-prone to make that change in both the ad file and the macro assembler files. > > > ### This PR > > The proposed solution is to represent a group of vector registers with a real vector group, e.g. replacing `vReg_V4 v4, vReg_V5 v5, vReg_V6 v6, vReg_V7 v7` with `vReg_V4M4 v4m4` and `TEMP v4, TEMP v5, TEMP v6, TEMP v7` with `TEMP v4m4`; the `string_compare_v` implementation can then query the length of the vector group (i.e. m4 in this case) and set its vtype automatically. > This solution solves the issues listed above, especially the last one: if in the future we need to change m4 to m2 or m8, we only need to change the code in the ad file, the change is simpler, and no change to string_compare_v is needed. > > ### What it looks like > > For more usage details, please check [here](https://github.com/openjdk/jdk/compare/master...Hamlin-Li:jdk:explicit-v-reg-gro... Thanks for taking a look and for the suggestions!
> Are you trying to support arbitrary groups of vectors, or only aligned and sequential groups, like on riscv? I'm trying to support the latter. Some more background: on riscv, vectors are only supported as scalable vectors (i.e. VecA in HotSpot: the length of a single vector register is dynamic and could be 128/256/512... bits). The first vector register in a vector group must be aligned to the group length; e.g. if the group length is 8, the group can only be one of (v0-v7), (v8-v15), (v16-v23), (v24-v31). > > > 1. One of them is to just add some new reg class and operand; it kind of worked, but it can only prevent other regs in an instruct from using one of the vector regs in a vector group. > > I don't see why this wouldn't work. As long as the register mask is correct, it should prevent/exclude all the vector regs in a vector group. I believe this is how arm32 implements vecD on the dflt_low_reg registers. 64-bit "D" vectors are composed of two adjacent 32-bit "S" registers. (I could be wrong about VecD on arm below.) [Here](https://github.com/openjdk/jdk/compare/master...Hamlin-Li:jdk:explicit-v-reg-group-v1) is the alternative implementation of "`just add some new reg class and operand`" mentioned above; please check the changes in src/hotspot/cpu/riscv/riscv.ad and the other related changes. I think the reasons why this change (or the approach you suggested above) does not work as expected are: * vecD and vecA are different ways to control vector register allocation in HotSpot: vecD is fixed-size and vecA is dynamic, so the fact that vecD works does not necessarily mean vecA works the same way. (riscv only supports vecA.) * with the new reg_class and operand introduced (take v2m2_reg and vReg_V2_m2 as an example), the current code base only ensures that vReg_V2_m2 takes one vector register out of v2 and v3, but does not ensure that both v2 and v3 are allocated to it at the same time. So if an instruct in the ad file has (..., vReg_V2_m2 v2m2, vReg vx, ...)
as its inputs (and suppose both are `TEMP`), there is a chance that vx ends up allocated to v2 or v3, which is unexpected. On the other hand, if an instruct in the ad file has (..., vReg vx, vReg vy, ...) as its inputs (again both `TEMP`), there is no chance that vx will be the same as vy. I hope the above answers your question, but it is also possible that I'm doing something wrong in [that implementation](https://github.com/openjdk/jdk/compare/master...Hamlin-Li:jdk:explicit-v-reg-group-v1). ------------- PR Comment: https://git.openjdk.org/jdk/pull/20775#issuecomment-2322858439
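The alignment rule Hamlin Li describes, together with Dean's point that a correct register mask should exclude every register in a group, can be sketched with a 32-bit mask holding one bit per riscv vector register (v0 to v31). `VRegMask`, `group_mask`, and `overlaps` are hypothetical helpers for illustration, not the ADLC or register allocator API.

```cpp
#include <cstdint>

// One bit per riscv vector register v0..v31 (illustrative only).
using VRegMask = std::uint32_t;

// Returns the mask covering a vector group of `lmul` registers starting at
// `base` (lmul = 1, 2, 4 or 8), or 0 if the group would be misaligned or
// would run past v31.
VRegMask group_mask(int base, int lmul) {
  bool lmul_ok = lmul == 1 || lmul == 2 || lmul == 4 || lmul == 8;
  if (!lmul_ok || base < 0 || base % lmul != 0 || base + lmul > 32) {
    return 0;
  }
  VRegMask bits = (VRegMask(1) << lmul) - 1;  // lmul consecutive bits
  return bits << base;                        // moved to start at v<base>
}

// A register (or group) conflicts with a group if it shares any member with it.
bool overlaps(VRegMask a, VRegMask b) {
  return (a & b) != 0;
}
```

With `group_mask(2, 2)` standing for v2+v3, a `TEMP vReg vx` is only kept off the group if it is checked against both bits, which is what `overlaps` does here; according to Hamlin Li, the plain reg_class approach only guaranteed one of the two registers.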