From dfenacci at openjdk.org Mon Sep 1 06:50:28 2025 From: dfenacci at openjdk.org (Damon Fenacci) Date: Mon, 1 Sep 2025 06:50:28 GMT Subject: RFR: 8355354: C2 crashed: assert(_callee == nullptr || _callee == m) failed: repeated inline attempt with different callee [v3] In-Reply-To: <_eAERVexsTQc_Acje4IUJ9yqqE98dB4-hz_fJ0jrUhs=.b2194a63-2599-42f7-a65f-41c29bb37bc3@github.com> References: <_eAERVexsTQc_Acje4IUJ9yqqE98dB4-hz_fJ0jrUhs=.b2194a63-2599-42f7-a65f-41c29bb37bc3@github.com> Message-ID: > # Issue > The CTW test `applications/ctw/modules/java_xml.java` crashes when trying to repeat late inlining of a virtual method (after IGVN passes through the method's call node again). The failure originates [here](https://github.com/openjdk/jdk/blob/e2ae50d877b13b121912e2496af4b5209b315a05/src/hotspot/share/opto/callGenerator.cpp#L473) because `_callee != m`. Apparently when running IGVN a second time after a first late inline failure and [setting the callee in the call generator](https://github.com/openjdk/jdk/blob/e2ae50d877b13b121912e2496af4b5209b315a05/src/hotspot/share/opto/callnode.cpp#L1240) we notice that the previous callee is not the same as the current one. > In this specific instance it seems that the issue happens when CTW is compiling Apache Xalan. > > # Cause > The root of the issue has to do with repeated late inlining, class hierarchy analysis and dynamic class loading. > > For this particular issue the two differing methods are `org.apache.xalan.xsltc.compiler.LocationPathPattern::translate` first and `org.apache.xalan.xsltc.compiler.AncestorPattern::translate` the second time. `LocationPathPattern` is an abstract class but has a concrete `translate` method. `AncestorPattern` is a concrete class that extends another abstract class `RelativePathPattern` that extends `LocationPathPattern`. `AncestorPattern` overrides the translate method. 
> What seems to be happening is the following: we compile a virtual call `RelativePathPattern::translate`, and at compile time only the abstract classes `RelativePathPattern` <: `LocationPathPattern` are loaded. CHA then concludes that the call must always target `LocationPathPattern::translate` because the method is not overridden anywhere else. However, there is still no non-abstract class in the entire class hierarchy, i.e. as soon as `AncestorPattern` is loaded, it becomes the only non-abstract class in the hierarchy and therefore the receiver type must be `AncestorPattern`. > > More generally, when late inlining is repeated and classes are loaded dynamically, the resolved method can differ between one late inlining attempt and the next. > > # Fix > > This looks like a real edge case. If CHA is affected by class loading, the originally recorded dependency becomes invalid. So, we change the assert to **check for invalid dependencies if the current callee and the previous one don't match**. > > # Testing > > This issue is very, very intermittent and depends on a number of factors. This ... 
Damon Fenacci has updated the pull request incrementally with one additional commit since the last revision: JDK-8355354: avoid resetting callee in call node ideal ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26441/files - new: https://git.openjdk.org/jdk/pull/26441/files/15bcb65e..ce807553 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26441&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26441&range=01-02 Stats: 38 lines in 1 file changed: 4 ins; 12 del; 22 mod Patch: https://git.openjdk.org/jdk/pull/26441.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26441/head:pull/26441 PR: https://git.openjdk.org/jdk/pull/26441 From epeter at openjdk.org Mon Sep 1 06:59:53 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 1 Sep 2025 06:59:53 GMT Subject: RFR: 8366357: C2 SuperWord: refactor VTransformNode::apply with VTransformApplyState [v3] In-Reply-To: References: <-URf_iP7rH-Ev5PzEhDseBTqTTCuHiMEYkTdeksxP_0=.14d9721e-b5f9-4d0e-932f-78ca4a6ad12b@github.com> Message-ID: On Thu, 28 Aug 2025 14:47:43 GMT, Manuel Hässig wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> For Christian: use phase->intcon instead > > Marked as reviewed by mhaessig (Committer). @mhaessig @vnkozlov @chhagedorn Thanks for the reviews! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26987#issuecomment-3241080195 From epeter at openjdk.org Mon Sep 1 06:59:53 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 1 Sep 2025 06:59:53 GMT Subject: Integrated: 8366357: C2 SuperWord: refactor VTransformNode::apply with VTransformApplyState In-Reply-To: References: Message-ID: On Thu, 28 Aug 2025 12:57:44 GMT, Emanuel Peter wrote: > I'm working on **cost-modeling**, and am integrating some smaller changes from this proof-of-concept PR: https://github.com/openjdk/jdk/pull/20964 > [See plan overview.](https://bugs.openjdk.org/browse/JDK-8340093) > > This is a **pure refactoring** - no change in behaviour. I'm presenting it like this because it will make reviews easier. > > The goal here is that `VTransformNode::apply` only needs a single argument. This is important, as we will soon add more components that need to be updated during apply. That way, we can simply add more parts to `VTransformApplyState`, and do not need to add more arguments to `VTransformNode::apply`. > > And yes: I have considered passing the `apply_state` as `const`. While this may be possible with the current code state, the upcoming changes from https://github.com/openjdk/jdk/pull/20964 will require non-const access to the `apply_state` (e.g. for `set_memory_state`). > > Also: Christian asked me to squeeze in some other change: `igvn.intcon` -> `phase->intcon`, so that we also set the control to root. It's not been strictly necessary, but probably better to do it. This pull request has now been integrated. 
Changeset: dbac620b Author: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/dbac620b996713087f0d1b1189e543e51a0bb09f Stats: 131 lines in 3 files changed: 31 ins; 26 del; 74 mod 8366357: C2 SuperWord: refactor VTransformNode::apply with VTransformApplyState Reviewed-by: chagedorn, kvn, mhaessig ------------- PR: https://git.openjdk.org/jdk/pull/26987 From epeter at openjdk.org Mon Sep 1 07:00:43 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 1 Sep 2025 07:00:43 GMT Subject: RFR: 8366427: C2 SuperWord: refactor VTransform scalar nodes [v2] In-Reply-To: <0BaZ4QsDU5cQnZpcb3WzmX8UDIaomZOKkg0_BjuzLJY=.1d891297-dc22-4c79-a951-5d7456bac0cd@github.com> References: <0BaZ4QsDU5cQnZpcb3WzmX8UDIaomZOKkg0_BjuzLJY=.1d891297-dc22-4c79-a951-5d7456bac0cd@github.com> Message-ID: > I'm working on cost-modeling, and am integrating some smaller changes from this proof-of-concept PR: > https://github.com/openjdk/jdk/pull/20964 > [See plan overview.](https://bugs.openjdk.org/browse/JDK-8340093) > > This is a pure refactoring - no change in behaviour. I'm presenting it like this because it will make reviews easier. > > The goal is to split up some cases that are currently treated the same, but will later have different behavior. There may be a little bit of code duplication, but the code will soon be made different ;) > > We split the `VTransformScalarNode`: > - `VTransformMemopScalarNode` > - Uses that only want scalar mem nodes can now directly check for `isa_MemopScalar`. > - We can directly store the `_vpointer` in a field, that way we don't need to do a lookup via `vloop_analyzer`. This could also be helpful later on if we ever do widening (unrolling during auto vectorization): we could then do the necessary modifications to the `vpointer`. > - `VTransformLoopPhiNode` > - Later on, they will play a more special role, they will give us easy access to the beginning state of the loop body and the backedges. 
> - `VTransformCFGNode` > - Calling them scalar nodes is not 100% accurate. We'll probably have to further refine them later on. But splitting them off now seems like a reasonable choice. Once we do if-conversion we'll have to do more work on CFG. > - `VTransformDataScalarNode` > - These represent all the normal "calculation" nodes in the loop. > - `VTransformInputScalarNode` -> `VTransformOuterNode`: > - For now, we are still just tracking input nodes, but soon we will need to track input and output nodes: basically just the 1-hop neighbourhood of nodes outside the loop. I'm already renaming them now, so it will be less noise later. > > I decided to rather split up more, and avoid the `VTransformScalarNode` altogether, avoiding having to override overrides - that can be really confusing (e.g. what I had with `is_load_in_loop`). Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: Update src/hotspot/share/opto/vtransform.hpp Co-authored-by: Manuel Hässig ------------- Changes: - all: https://git.openjdk.org/jdk/pull/27002/files - new: https://git.openjdk.org/jdk/pull/27002/files/197d0896..86dac36b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=27002&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27002&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/27002.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27002/head:pull/27002 PR: https://git.openjdk.org/jdk/pull/27002 From mhaessig at openjdk.org Mon Sep 1 07:00:44 2025 From: mhaessig at openjdk.org (Manuel Hässig) Date: Mon, 1 Sep 2025 07:00:44 GMT Subject: RFR: 8366427: C2 SuperWord: refactor VTransform scalar nodes [v2] In-Reply-To: References: <0BaZ4QsDU5cQnZpcb3WzmX8UDIaomZOKkg0_BjuzLJY=.1d891297-dc22-4c79-a951-5d7456bac0cd@github.com> Message-ID: On Mon, 1 Sep 2025 06:56:51 GMT, Emanuel Peter wrote: >> I'm working on cost-modeling, and am integrating 
some smaller changes from this proof-of-concept PR: >> https://github.com/openjdk/jdk/pull/20964 >> [See plan overview.](https://bugs.openjdk.org/browse/JDK-8340093) >> >> This is a pure refactoring - no change in behaviour. I'm presenting it like this because it will make reviews easier. >> >> The goal is to split up some cases that are currently treated the same, but will later have different behavior. There may be a little bit of code duplication, but the code will soon be made different ;) >> >> We split the `VTransformScalarNode`: >> - `VTransformMemopScalarNode` >> - Uses that only want scalar mem nodes can now directly check for `isa_MemopScalar`. >> - We can directly store the `_vpointer` in a field, that way we don't need to do a lookup via `vloop_analyzer`. This could also be helpful later on if we ever do widening (unrolling during auto vectorization): we could then do the necessary modifications to the `vpointer`. >> - `VTransformLoopPhiNode` >> - Later on, they will play a more special role, they will give us easy access to the beginning state of the loop body and the backedges. >> - `VTransformCFGNode` >> - Calling them scalar nodes is not 100% accurate. We'll probably have to further refine them later on. But splitting them off now seems like a reasonable choice. Once we do if-conversion we'll have to do more work on CFG. >> - `VTransformDataScalarNode` >> - These represent all the normal "calculation" nodes in the loop. >> - `VTransformInputScalarNode` -> `VTransformOuterNode`: >> - For now, we are still just tracking input nodes, but soon we will need to track input and output nodes: basically just the 1-hop neighbourhood of nodes outside the loop. I'm already renaming them now, so it will be less noise later. >> >> I decided to rather split up more, and avoid the `VTransformScalarNode` altogether, avoiding having to override overrides - that can be really confusing (e.g. what I had with `is_load_in_loop`). 
> > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > Update src/hotspot/share/opto/vtransform.hpp > > Co-authored-by: Manuel Hässig Thank you for your continued efforts, @eme64. The suspense is building for your big change... This looks good to me, bar one typo. Marked as reviewed by mhaessig (Committer). src/hotspot/share/opto/vtransform.hpp line 454: > 452: }; > 453: > 454: // Identity ransform for scalar loads and stores. Suggestion: // Identity transform for scalar loads and stores. ------------- Marked as reviewed by mhaessig (Committer). PR Review: https://git.openjdk.org/jdk/pull/27002#pullrequestreview-3172282140 PR Review: https://git.openjdk.org/jdk/pull/27002#pullrequestreview-3172310479 PR Review Comment: https://git.openjdk.org/jdk/pull/27002#discussion_r2313027649 From epeter at openjdk.org Mon Sep 1 07:08:56 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 1 Sep 2025 07:08:56 GMT Subject: RFR: 8366361: C2 SuperWord: rename VTransformNode::set_req -> init_req, analogue to Node::init_req [v2] In-Reply-To: References: Message-ID: > I'm working on cost-modeling, and am integrating some smaller changes from this proof-of-concept PR: > https://github.com/openjdk/jdk/pull/20964 > [See plan overview.](https://bugs.openjdk.org/browse/JDK-8340093) > > This is a pure refactoring - no change in behaviour. I'm presenting it like this because it will make reviews easier. > > The current implementation of `VTransformNode::set_req` has `init_req` semantics, it verifies that the corresponding input is still nullptr. We should thus rename it. It will also free up the "set_req" name for later use in VTransform optimizations, where we want to modify the graph. > > See `VTransformReductionVectorNode::optimize_move_non_strict_order_reductions_out_of_loop` in the proof-of-concept PR. > > FYI: this PR is dependent on https://github.com/openjdk/jdk/pull/26987. I'll rebase once that one is integrated. 
We can still already review, so that the process is a little faster later on. (I have more small changes coming, but separating makes them more reviewable.) Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains five commits: - Merge branch 'master' into JDK-8366361-vtn-init_req - JDK-8366361 - For Christian: use phase->intcon instead - Update src/hotspot/share/opto/vtransform.hpp Co-authored-by: Christian Hagedorn - JDK-8366357 ------------- Changes: https://git.openjdk.org/jdk/pull/26991/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26991&range=01 Stats: 26 lines in 3 files changed: 0 ins; 0 del; 26 mod Patch: https://git.openjdk.org/jdk/pull/26991.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26991/head:pull/26991 PR: https://git.openjdk.org/jdk/pull/26991 From dskantz at openjdk.org Mon Sep 1 07:11:25 2025 From: dskantz at openjdk.org (Daniel Skantz) Date: Mon, 1 Sep 2025 07:11:25 GMT Subject: RFR: 8362117: C2: compiler/stringopts/TestStackedConcatsAppendUncommonTrap.java fails with a wrong result due to invalidated liveness assumptions for data phis Message-ID: This PR addresses a wrong compilation during string optimizations. During stacked string concatenation of two StringBuilder links SB1 and SB2, the pattern "append -> Phi -> Region -> (True, False) -> If -> Bool -> CmpP -> Proj (Result) -> toString" may be observed, where toString is the end of SB1, and the simple diamond is part of SB2. After JDK-8291775, the Bool test to the diamond If is set to a constant zero to allow for folding the simple diamond away during IGVN, while not letting the top() value from the result projection of SB1 propagate through the graph too quickly. The assumption was that any data Phi of the Region would go away during PhaseRemoveUseless as they are no longer live -- I think that in the case of JDK-8291775, the user of phi was the constructor of SB2. 
However, in the attached test case, the Phi stays live as it's a parameter (input to an append) of SB2 and will be used during the transformation in `copy_string`. When the diamond region is later folded, the Phi's user picks up the wrong input corresponding to the false branch. The proposed solution is to disable the stacked concatenation optimization for this specific pattern. This might be pragmatic as it's an edge case and there's already a bug tail: JDK-8271341 -> JDK-8291775 -> JDK-8362117. Testing: T1-3 (aed5952). Extra testing: ran T1-3 on Linux with an instrumented build and verified that the pattern I am excluding in this PR is not seen during any other compilation than that of the proposed regression test. ------------- Commit messages: - ws - add an assert - revert to unfolded version of is_diamond - fix Changes: https://git.openjdk.org/jdk/pull/27028/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27028&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8362117 Stats: 85 lines in 2 files changed: 85 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/27028.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27028/head:pull/27028 PR: https://git.openjdk.org/jdk/pull/27028 From chagedorn at openjdk.org Mon Sep 1 07:14:44 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 1 Sep 2025 07:14:44 GMT Subject: RFR: 8366361: C2 SuperWord: rename VTransformNode::set_req -> init_req, analogue to Node::init_req [v2] In-Reply-To: References: Message-ID: <-xtJXkBZ8TsKvj1zsyDeaAlXECBhIju5TZzfxc3iuYg=.dd473b6c-ff01-4fd5-90d7-701e0407f9bc@github.com> On Mon, 1 Sep 2025 07:08:56 GMT, Emanuel Peter wrote: >> I'm working on cost-modeling, and am integrating some smaller changes from this proof-of-concept PR: >> https://github.com/openjdk/jdk/pull/20964 >> [See plan overview.](https://bugs.openjdk.org/browse/JDK-8340093) >> >> This is a pure refactoring - no change in behaviour. 
I'm presenting it like this because it will make reviews easier. >> >> The current implementation of `VTransformNode::set_req` has `init_req` semantics, it verifies that the corresponding input is still nullptr. We should thus rename it. It will also free up the "set_req" name for later use in VTransform optimizations, where we want to modify the graph. >> >> See `VTransformReductionVectorNode::optimize_move_non_strict_order_reductions_out_of_loop` in the proof-of-concept PR. >> >> FYI: this PR is dependent on https://github.com/openjdk/jdk/pull/26987. I'll rebase once that one is integrated. We can still already review, so that the process is a little faster later on. (I have more small changes coming, but separating makes them more reviewable.) > > Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains five commits: > > - Merge branch 'master' into JDK-8366361-vtn-init_req > - JDK-8366361 > - For Christian: use phase->intcon instead > - Update src/hotspot/share/opto/vtransform.hpp > > Co-authored-by: Christian Hagedorn > - JDK-8366357 Marked as reviewed by chagedorn (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26991#pullrequestreview-3172354981 From dfenacci at openjdk.org Mon Sep 1 07:17:43 2025 From: dfenacci at openjdk.org (Damon Fenacci) Date: Mon, 1 Sep 2025 07:17:43 GMT Subject: RFR: 8355354: C2 crashed: assert(_callee == nullptr || _callee == m) failed: repeated inline attempt with different callee [v3] In-Reply-To: References: <_eAERVexsTQc_Acje4IUJ9yqqE98dB4-hz_fJ0jrUhs=.b2194a63-2599-42f7-a65f-41c29bb37bc3@github.com> Message-ID: On Fri, 29 Aug 2025 16:50:26 GMT, Damon Fenacci wrote: >> I second that. And it aligns with our effort to make CI queries report stable results. >> >> (FTR here's what I proposed to Damon privately: "Another alternative is to cache and reuse cg->callee_method() when it becomes non-null. 
And turn repeated CHA requests (Compile::optimize_inlining) into verification logic.") > >> I'm wondering if there might be other reasons that the callee might change, like JVMTI class redefinition > > I guess there could be. For JVMTI we could possibly check for `Method::is_old` or `Method::is_obsolete`? But still, it might not be the only reason... > >> so the easiest fix for class redefinition and CHA would be to ignore the new callee and keep the old one here. > > I'm tempted to set the callee only if it is null and just remove the original assert, but @iwanowww suggested moving the assert to the `Ideal` function. I've just pushed a change that should be doing that. > so the easiest fix for class redefinition and CHA would be to ignore the new callee and keep the old one here. > Another alternative is to cache and reuse cg->callee_method() when it becomes non-null. Actually, I changed my mind after looking at Vladimir's advice: this alternative (ignoring the new callee if it is already set) is cleaner and simpler. Thanks @iwanowww and @dean-long for the suggestions. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26441#discussion_r2313085032 From epeter at openjdk.org Mon Sep 1 07:37:00 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 1 Sep 2025 07:37:00 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v24] In-Reply-To: References: Message-ID: On Fri, 29 Aug 2025 09:38:58 GMT, Daniel Lundén wrote: >> If a method has a large number of parameters, we currently bail out from C2 compilation. >> >> ### Changeset >> >> Allowing C2 compilation of methods with a large number of parameters requires fundamental changes to the register mask data structure, used in many places in C2. In particular, register masks currently have a statically determined size and cannot represent arbitrary numbers of stack slots. This is needed if we want to compile methods with arbitrary numbers of parameters. 
Register mask operations are present in performance-sensitive parts of C2, which further complicates changes. >> >> Changes: >> - Add functionality to dynamically grow/extend register masks. I experimented with a number of design choices to achieve this. To keep the common case (normal number of method parameters) quick and also to avoid more intrusive changes to the current `RegMask` interface, I decided to leave the "base" statically allocated memory for masks unchanged and only use dynamically allocated memory in the rare cases where it is needed. >> - Generalize the "chunk"-logic from `PhaseChaitin::Select()` to allow arbitrary-sized chunks, and also move most of the logic into register mask methods to separate concerns and to make the `PhaseChaitin::Select()` code more readable. >> - Remove all `can_represent` checks and bailouts. >> - Performance tuning. A particularly important change is the early-exit optimization in `RegMask::overlap`, used in the performance-sensitive method `PhaseChaitin::interfere_with_live`. >> - Add a new test case `TestManyMethodArguments.java` and extend an old test `TestNestedSynchronize.java`. >> >> ### Testing >> >> - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/10178060450) >> - `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. >> - Standard performance benchmarking. No observed conclusive overall performance degradation/improvement. >> - Specific benchmarking of C2 compilation time. The changes increase C2 compilation time by, approximately and on average, 1% for methods that could also be compiled before this changeset (see the figure below). The reason for the degradation is further checks required in performance-sensitive code (in particular `PhaseChaitin::remove_bound_register_from_interfering_live_ranges`). 
I have tried optimizing in various ways, but changes I found that lead to improvement also lead to less readable code (and are, in my opinion, no... > > Daniel Lundén has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 35 commits: > > - Restore modified java/lang/invoke tests > - Sort includes (new requirement) > - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates > - Add clarifying comments at definitions of register mask sizes > - Fix implicit zero and nullptr checks > - Add deep copy comment > - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates > - Fix typo > - Updates after Emanuel's comments > - Refactor and improve TestNestedSynchronize.java > - ... and 25 more: https://git.openjdk.org/jdk/compare/b39c7369...80c6cf47 Nice, looks like the old test issues are now gone. Great to see that. I was looking for tests that verify what your PR title promises: that we successfully compile methods with many arguments. The test you have looks like a good start: `TestMaxMethodArguments.java` Do you think it would make sense to have more tests? I'm imagining something like this: - Generate tests with 0-255 arguments. You could use the template framework. - Take different types (e.g. various primitive types, also those that take 2 stack slots like long and double). You could use the template library `PrimitiveType` if you want. - Test that we actually get the method compiled. Maybe an IR rule could be used here? - And do some rudimentary result verification. - Make sure it does not just work with `-Xcomp` but also under "normal" circumstances (tiered, profiling, etc). 
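To make the first and fourth bullets above concrete, here is a minimal, self-contained sketch of generating a method with `n` int arguments together with the expected result for verification. This is only an illustration under stated assumptions: it does not use the HotSpot template framework or `PrimitiveType`, and the names `ManyArgsSketch`, `generateMethod` and `expectedSum` are hypothetical.

```java
import java.util.StringJoiner;

public class ManyArgsSketch {
    // Hypothetical helper: builds the source text of a static method that
    // takes n int parameters (a0 .. a(n-1)) and returns their sum.
    static String generateMethod(int n) {
        StringJoiner params = new StringJoiner(", ");
        StringJoiner sum = new StringJoiner(" + ");
        for (int i = 0; i < n; i++) {
            params.add("int a" + i);
            sum.add("a" + i);
        }
        // With zero parameters there is nothing to add up, so return 0.
        String body = (n == 0) ? "0" : sum.toString();
        return "static int test" + n + "(" + params + ") { return " + body + "; }";
    }

    // Expected result when the generated method is called with 1, 2, ..., n.
    static int expectedSum(int n) {
        return n * (n + 1) / 2;
    }

    public static void main(String[] args) {
        // Print the small cases in full; 255 is the largest count suggested above.
        System.out.println(generateMethod(0));
        System.out.println(generateMethod(3));
        System.out.println("expected sum for 255 arguments: " + expectedSum(255));
    }
}
```

A real test would still have to compile and run the generated source and check that C2 actually compiled it; the sketch covers only the generation and the expected value.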
I'll look a bit at your VM changes now ;) test/hotspot/jtreg/compiler/arguments/TestMaxMethodArguments.java line 57: > 55: try { > 56: test(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217 , 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255); > 57: } catch (TestException e) { This seems to be the only test that actually tests what your PR title promises: it has a method with many arguments. ------------- Changes requested by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/20404#pullrequestreview-3172429642 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2313120394 From fyang at openjdk.org Mon Sep 1 07:40:43 2025 From: fyang at openjdk.org (Fei Yang) Date: Mon, 1 Sep 2025 07:40:43 GMT Subject: RFR: 8365926: RISC-V: Performance regression in renaissance (chi-square) In-Reply-To: References: Message-ID: On Tue, 26 Aug 2025 14:43:05 GMT, Robbin Ehn wrote: > Hey, please consider! 
> > A bunch of info in JBS entry, please read that also. > > I narrowed this issue down to the old jal optimization, making direct calls when in reach. > This patch restores them and removes this regression. > > In essence we turn "jalr ra,0(t1)" into a "jal ra," if reachable, and restore the jalr if a new destination is not reachable. > > Please test on your hardware! > > > Chi Square (100 runs each, 10 fastest iterations of each run, P550) > JDK-23 (last version with trampoline calls) > Mean: 3189.5827 > Standard Deviation: 284.6478 > > JDK-25 > Mean: 3424.8905 > Standard Deviation: 222.2208 > > Patch: > Mean: 3144.8535 > Standard Deviation: 229.2577 > > > No issues found in t1, running t2 also. Stress tested on vf2, bpi-f3, p550. That looks fine to me. I don't have other concerns modulo two minor typos. FYI: My local hs:tier1-hs:tier3 test with fastdebug build is good. src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 125: > 123: set_stub_address_destination_at(stub_addr, dest); > 124: > 125: // patches jalr -> jal/jal -> jalr depeding on dest Suggestion: s/depeding/depending/ src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 146: > 144: > 145: address dest = stub_address_destination_at(stub_addr); > 146: optimize_call(dest, false); // patches jalr -> jal/jal -> jalr depeding on dest Suggestion: s/depeding/depending/ ------------- Marked as reviewed by fyang (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/26944#pullrequestreview-3172332378 PR Review Comment: https://git.openjdk.org/jdk/pull/26944#discussion_r2313065100 PR Review Comment: https://git.openjdk.org/jdk/pull/26944#discussion_r2313095316 From jbhateja at openjdk.org Mon Sep 1 07:54:50 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 1 Sep 2025 07:54:50 GMT Subject: RFR: 8364305: Support AVX10 saturating floating point conversion instructions [v3] In-Reply-To: References: <_Wv0Roo5xUHjswP_JUy6yzoU5KCwNpIoX3S2QBceUbE=.05b5bbbd-840b-4162-a454-94a9ddc2a69f@github.com> Message-ID: On Mon, 1 Sep 2025 07:50:57 GMT, Jatin Bhateja wrote: >> Mohamed Issa has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix input size enum values for AVX 10.2 conversion instructions that take memory as the source > > src/hotspot/cpu/x86/x86.ad line 7804: > >> 7802: predicate(VM_Version::supports_avx10_2() && >> 7803: is_integral_type(Matcher::vector_element_basic_type(n))); >> 7804: match(Set dst (VectorCastD2X src)); > > I assume your intent here is to feed the memory operand to the vector cast IR. A memory operand is first loaded into a register using LoadVector IR, so a CISC / memory variant of the pattern should consume the Load IR such that the operand is directly exposed to the instruction. Check out https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L8986 Make a similar change in all the newly added memory patterns. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2313167705 From jbhateja at openjdk.org Mon Sep 1 07:54:50 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 1 Sep 2025 07:54:50 GMT Subject: RFR: 8364305: Support AVX10 saturating floating point conversion instructions [v3] In-Reply-To: <_Wv0Roo5xUHjswP_JUy6yzoU5KCwNpIoX3S2QBceUbE=.05b5bbbd-840b-4162-a454-94a9ddc2a69f@github.com> References: <_Wv0Roo5xUHjswP_JUy6yzoU5KCwNpIoX3S2QBceUbE=.05b5bbbd-840b-4162-a454-94a9ddc2a69f@github.com> Message-ID: On Fri, 29 Aug 2025 23:46:18 GMT, Mohamed Issa wrote: >> Intel® AVX10 ISA [1] extensions added new saturating floating point conversion instructions which comply with definitions in section 5.8 of the 2019 IEEE-754 standard. They can compute floating point to integral type conversions while also handling special inputs such as NaN, +Infinity, and -Infinity. >> >> Without AVX10.2, the current approach starts by converting the floating point value(s) in the source register to the desired integral value(s) in the destination register. In the scalar case, the CVTTSS2SI (single precision) or CVTTSD2SI (double precision) instruction is used. In the vector case, the CVTTPS2DQ (single precision) or CVTTPD2DQ (double precision) instruction is used. However, if the source contains a special value (NaN, -Infinity, +Infinity, <= Integer.MIN_VALUE, or >= Integer.MAX_VALUE), extra handling is required. The specific sequence of instructions involved depends on the source (single precision vs double precision), destination (long, integer, short, or byte), level of parallelization (scalar vs vector), and supported AVX extension type. Essentially though, the special values are mapped to values (NaN -> 0, -Infinity, <= Integer.MIN_VALUE -> Integer.MIN_VALUE, +Infinity, >= Integer.MAX_VALUE -> Integer.MAX_VALUE) in the integer range with the help of a few temporary registers to store intermediate results. 
>> >> This change uses the new AVX10.2 scalar (VCVTTSS2SIS or VCVTTSD2SIS) and vector (VCVTTPS2QQS, VCVTTPS2DQS, VCVTTPD2QQS, and VCVTTPD2DQS) instructions on supported platforms to avoid the extra handling described above. Also, the JTREG tests listed below were used to verify correctness with `-XX:-UseSuperWord` / `-XX:+UseSuperWord` options to exercise both scalar and vector paths. The baseline build used is [OpenJDK v26-b11](https://github.com/openjdk/jdk/releases/tag/jdk-26%2B11). >> >> 1. `jtreg:test/hotspot/jtreg/compiler/codegen/TestByteDoubleVect.java` >> 2. `jtreg:test/hotspot/jtreg/compiler/codegen/TestByteFloatVect.java` >> 3. `jtreg:test/hotspot/jtreg/compiler/codegen/TestIntDoubleVect.java` >> 4. `jtreg:test/hotspot/jtreg/compiler/codegen/TestIntFloatVect.java` >> 5. `jtreg:test/hotspot/jtreg/compiler/codegen/TestLongDoubleVect.java` >> 6. `jtreg:test/hotspot/jtreg/compiler/codegen/TestLongFloatVect.java` >> 7. `jtreg:test/hotspot/jtreg/compiler/codegen/TestShortDoubleVect.java` >> 8. `jtreg:test/hotspot/jtreg/compiler/codegen/TestShortFloatVect.java` >> >> [1] https://www.intel.com/content/www/us/en/content-details/856721/intel-adv... > > Mohamed Issa has updated the pull request incrementally with one additional commit since the last revision: > > Fix input size enum values for AVX 10.2 conversion instructions that take memory as the source src/hotspot/cpu/x86/x86.ad line 7804: > 7802: predicate(VM_Version::supports_avx10_2() && > 7803: is_integral_type(Matcher::vector_element_basic_type(n))); > 7804: match(Set dst (VectorCastD2X src)); I assume your intent here is to feed the memory operand to the vector cast IR, a memory operand is first loaded into register using LoadVector IR, so a CISC / memory variant of pattern should consume the Load IR such that the operand is directly exposed to the instruction. 
Check out https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L8986

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2313165676

From galder at openjdk.org Mon Sep 1 08:19:44 2025
From: galder at openjdk.org (Galder Zamarreño)
Date: Mon, 1 Sep 2025 08:19:44 GMT
Subject: RFR: 8329077: C2 SuperWord: Add MoveD2L, MoveL2D, MoveF2I, MoveI2F [v4]
In-Reply-To: <0VA9QnuPSb55PbioO1XWtSmrAC-sQet0hb_ldRgKdFQ=.95f56a0b-3b08-4654-8f1e-7217cd9bcabe@github.com>
References: <0bYYOS5AYvN4ZD1xAGBRqV_xasw-np3JWKXC7WcGhyc=.74d97456-f406-4dbe-be09-77ed3b9a66fd@github.com> <0VA9QnuPSb55PbioO1XWtSmrAC-sQet0hb_ldRgKdFQ=.95f56a0b-3b08-4654-8f1e-7217cd9bcabe@github.com>
Message-ID: <5xrZ-TcQ9OaMFIAMGIMTDCwGdexIMs0eJd6Li-T1aQc=.fc863cb9-0ce2-488f-a7d6-3aa211248798@github.com>

On Wed, 27 Aug 2025 09:56:29 GMT, Emanuel Peter wrote:

> Can you find out why we don't vectorize with AVX1 here?

This was a fun little rabbit hole. The explanation below is for `test6` but I think the same logic applies to `test9`:

The problem comes from the IR node definition, what JTreg does with that, and what the HotSpot code actually does.

The annotation definition is:

@IR(counts = {IRNode.LOAD_VECTOR_F, "> 0",

So JTreg assumes that the regex should match a vector size of 8. With `UseAVX=1` and floats, `IRNode.getMaxElementsForTypeOnX86` returns 8 and so that's how the constraint is set:

* Constraint 1: "(\d+(\s){2}(LoadVector.*)+(\s){2}===.*vector[A-Za-z])"

But the issue is that at runtime the vector size is 4:

844 LoadVector === ... #vectorx

HotSpot logic is more nuanced, with the key being what happens in `SuperWord::unrolling_analysis`. The thing that JTreg doesn't know is that there are 2 types involved in the loop, float **and** int:

for (int i = 0; i < a.length; i++) {
  a[i] = Float.floatToRawIntBits(b[i]);
}

With `UseAVX=1`, the max vector size for floats is 8, but for ints is 4. So the JVM picks the minimum value and uses that.
Hence the unrolling factor is 4, and that carries all the way through to the load vector size, which is also 4.

IMO the right thing to do would be to fix the annotation to be:

@IR(counts = {IRNode.LOAD_VECTOR_F, IRNode.VECTOR_SIZE_4, "> 0",

And explain in the javadoc why the expected size is 4.

The same applies to `test9`.

WDYT @eme64?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/26457#issuecomment-3241348514

From epeter at openjdk.org Mon Sep 1 08:37:07 2025
From: epeter at openjdk.org (Emanuel Peter)
Date: Mon, 1 Sep 2025 08:37:07 GMT
Subject: RFR: 8325467: Support methods with many arguments in C2 [v24]
In-Reply-To:
References:
Message-ID:

On Fri, 29 Aug 2025 09:38:58 GMT, Daniel Lundén wrote:

>> If a method has a large number of parameters, we currently bail out from C2 compilation.
>>
>> ### Changeset
>>
>> Allowing C2 compilation of methods with a large number of parameters requires fundamental changes to the register mask data structure, used in many places in C2. In particular, register masks currently have a statically determined size and cannot represent arbitrary numbers of stack slots. This is needed if we want to compile methods with arbitrary numbers of parameters. Register mask operations are present in performance-sensitive parts of C2, which further complicates changes.
>>
>> Changes:
>> - Add functionality to dynamically grow/extend register masks. I experimented with a number of design choices to achieve this. To keep the common case (normal number of method parameters) quick and also to avoid more intrusive changes to the current `RegMask` interface, I decided to leave the "base" statically allocated memory for masks unchanged and only use dynamically allocated memory in the rare cases where it is needed.
>> - Generalize the "chunk"-logic from `PhaseChaitin::Select()` to allow arbitrary-sized chunks, and also move most of the logic into register mask methods to separate concerns and to make the `PhaseChaitin::Select()` code more readable.
>> - Remove all `can_represent` checks and bailouts. >> - Performance tuning. A particularly important change is the early-exit optimization in `RegMask::overlap`, used in the performance-sensitive method `PhaseChaitin::interfere_with_live`. >> - Add a new test case `TestManyMethodArguments.java` and extend an old test `TestNestedSynchronize.java`. >> >> ### Testing >> >> - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/10178060450) >> - `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. >> - Standard performance benchmarking. No observed conclusive overall performance degradation/improvement. >> - Specific benchmarking of C2 compilation time. The changes increase C2 compilation time by, approximately and on average, 1% for methods that could also be compiled before this changeset (see the figure below). The reason for the degradation is further checks required in performance-sensitive code (in particular `PhaseChaitin::remove_bound_register_from_interfering_live_ranges`). I have tried optimizing in various ways, but changes I found that lead to improvement also lead to less readable code (and are, in my opinion, no... > > Daniel Lund?n has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 35 commits: > > - Restore modified java/lang/invoke tests > - Sort includes (new requirement) > - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates > - Add clarifying comments at definitions of register mask sizes > - Fix implicit zero and nullptr checks > - Add deep copy comment > - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates > - Fix typo > - Updates after Emanuel's comments > - Refactor and improve TestNestedSynchronize.java > - ... and 25 more: https://git.openjdk.org/jdk/compare/b39c7369...80c6cf47 I thought I'd dive straight back into `regmask.hpp`. 
I'm remembering some of what we discussed, but I'll need your help to fill in the picture ;)
I wonder if we could do some renamings in a prior PR, just to make this a little easier to review.

src/hotspot/share/opto/regmask.hpp line 44:

> 42: // statements in Java.
> 43: const int BoxLockNode_SLOT_LIMIT = 200;
> 44:

Even before this constant, it would be nice to have an introductory comment that lays out what the regmask is for and what its basic design is.

src/hotspot/share/opto/regmask.hpp line 63:

> 61: // RM_SIZE is the base size of a register mask in 32-bit words.
> 62: // RM_SIZE_MIN is the theoretical minimum size of a register mask in 32-bit
> 63: // words.

It seems this is a bad pattern that was already here before you. But it really makes me a little scared here.

Having two variable names differ in just an underscore `_` but with different semantics is a bit confusing to me. It is hard for the reader to keep track of what is what going forward. It would be really easy for someone to confuse the two in the future and have bugs creep in that way (just because of an underscore). It may be more useful to use the units in at least one of the two names.

I would love to see names like `RM_SIZE` and `RM_SIZE_IN_LONGS`, rather than `RM_SIZE` and `_RM_SIZE`.
Even better would be `RM_SIZE_IN_INTS` and `RM_SIZE_IN_LONGS`. That way, you could save a lot of comments. Maybe you could come up with even better names. "slots" and "words"?
You could consider doing a renaming PR first before the patch here. Maybe you can even automate the renaming with a command/script, and then apply the same renaming to the changes here?

src/hotspot/share/opto/regmask.hpp line 96:

> 94: (((RM_SIZE_MIN << 5) + // Slots for machine registers
> 95: (max_method_parameter_length * 2) + // Slots for incoming arguments
> 96: (max_method_parameter_length * 2) + // Slots for outgoing arguments

What's the meaning of incoming vs outgoing arguments? Like this?
Incoming = from caller (outer nesting)
Outgoing = to nested call (inner nesting)

src/hotspot/share/opto/regmask.hpp line 122:

> 120:
> 121: // Viewed as an array of machine words
> 122: uintptr_t _RM_UP[_RM_SIZE];

Do you know what `UP` stands for? Could we rename it maybe? Would be nice if we could have the same "units" for these arrays as for the sizes above.

src/hotspot/share/opto/regmask.hpp line 128:

> 126: // extend the register mask with dynamically allocated memory. We keep the
> 127: // base statically allocated _RM_UP, and arena allocate the extended mask
> 128: // (RM_UP_EXT) separately. Another, perhaps more elegant, option would be to

Suggestion:

// (_RM_UP_EXT) separately. Another, perhaps more elegant, option would be to

Underscore for consistency? Or does it reference something else?

src/hotspot/share/opto/regmask.hpp line 161:

> 159: // cases, we can allow read-only sharing.
> 160: bool _read_only = false;
> 161: #endif

Can you explain why this happens? Is this something we could clean up? It smells a bit like tech debt. But maybe it is a really necessary performance optimization. Would be nice if there was an explanation of which one it is ;)

src/hotspot/share/opto/regmask.hpp line 170:

> 168: // variable indicates how many words we offset with. We consider all
> 169: // registers before the offset to not be included in the register mask.
> 170: unsigned int _offset;

Does that mean we make different slices of the mask?

src/hotspot/share/opto/regmask.hpp line 175:

> 173: // mask can currently represent to be included. If _all_stack = false, we
> 174: // consider the registers not included.
> 175: bool _all_stack = false;

I'd prefer to have some kind of `_is_...` name here. Because when I read `all_stack` and see it is a bool, I wonder what it means - it does not tell me quickly.

Does it mean that all registers are on the stack? Is everything that is beyond the register mask purely on the stack?
Is everything from the stack always beyond the register mask? I'm confused :face_with_peeking_eye: src/hotspot/share/opto/regmask.hpp line 179: > 177: // The low and high watermarks represent the lowest and highest word that > 178: // might contain set register mask bits, respectively. We guarantee that > 179: // there are no bits in words outside this range, but any word at and between In the example below, you have 1 bits above the `_hwm`. Is that intentional? Are those bits to be ignored? Can you please add some extra info to the example about that? src/hotspot/share/opto/regmask.hpp line 217: > 215: // necessarily representing stack locations) to 1. Here is how the above > 216: // register mask looks like after clearing, setting _all_stack to true, and > 217: // successfully rolling over: I'm still struggling to follow here. Maybe `_offset` is not clear to me yet. What is the value here for it? How is it changed with the `rollover`? src/hotspot/share/opto/regmask.hpp line 230: > 228: // \_______________________________________________________________________________/ > 229: // | > 230: // _rm_size Ah, I remember this now. Really helpful. Maybe we could link to this layout explanation from the comment at the very top of the file? 
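
Editorial aside, not part of the original mail: the watermark convention quoted in the review above (no set bits outside the range between the low and high watermark) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the `RegMask` implementation: the class `WatermarkMask`, its fields, and its methods are hypothetical names, and 32-bit words are assumed. It shows why an overlap check only needs to scan the words where both ranges intersect.

```java
// Hypothetical sketch of a bit mask with low/high watermarks.
// Names and layout are assumptions for illustration, not HotSpot's RegMask.
public class WatermarkMask {
    private final int[] words; // mask bits, 32 per word
    private int lwm;           // lowest word index that may contain a set bit
    private int hwm;           // highest word index that may contain a set bit

    public WatermarkMask(int sizeInWords) {
        words = new int[sizeInWords];
        lwm = sizeInWords;     // empty mask: lwm > hwm, so scans do no work
        hwm = -1;
    }

    // Set one bit and widen the watermark range to cover its word.
    public void insert(int bit) {
        int w = bit >>> 5;             // word index (bit / 32)
        words[w] |= 1 << (bit & 31);   // position within the word (bit % 32)
        lwm = Math.min(lwm, w);
        hwm = Math.max(hwm, w);
    }

    // Only words inside both watermark ranges can share a set bit.
    public boolean overlaps(WatermarkMask other) {
        int lo = Math.max(lwm, other.lwm);
        int hi = Math.min(hwm, other.hwm);
        for (int i = lo; i <= hi; i++) {
            if ((words[i] & other.words[i]) != 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        WatermarkMask a = new WatermarkMask(4);
        WatermarkMask b = new WatermarkMask(4);
        a.insert(3);   // word 0
        b.insert(70);  // word 2
        System.out.println(a.overlaps(b)); // ranges do not intersect
        b.insert(3);   // now both masks have bit 3
        System.out.println(a.overlaps(b));
    }
}
```

When the two watermark ranges do not intersect, the loop body never runs, which is the kind of early exit the PR description mentions for `RegMask::overlap`. Whether the real code is structured this way is not something this sketch claims.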
------------- PR Review: https://git.openjdk.org/jdk/pull/20404#pullrequestreview-3172500942 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2313199061 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2313162130 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2313223912 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2313184547 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2313195111 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2313207232 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2313263478 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2313219662 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2313253475 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2313264670 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2313256455 From epeter at openjdk.org Mon Sep 1 08:37:07 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 1 Sep 2025 08:37:07 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v24] In-Reply-To: References: Message-ID: On Mon, 1 Sep 2025 07:49:26 GMT, Emanuel Peter wrote: >> Daniel Lund?n has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 35 commits: >> >> - Restore modified java/lang/invoke tests >> - Sort includes (new requirement) >> - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates >> - Add clarifying comments at definitions of register mask sizes >> - Fix implicit zero and nullptr checks >> - Add deep copy comment >> - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates >> - Fix typo >> - Updates after Emanuel's comments >> - Refactor and improve TestNestedSynchronize.java >> - ... 
and 25 more: https://git.openjdk.org/jdk/compare/b39c7369...80c6cf47
>
> src/hotspot/share/opto/regmask.hpp line 63:
>
>> 61: // RM_SIZE is the base size of a register mask in 32-bit words.
>> 62: // RM_SIZE_MIN is the theoretical minimum size of a register mask in 32-bit
>> 63: // words.
>
> It seems this is a bad pattern that was already here before you. But it really makes me a little scared here.
>
> Having two variable names differ in just an underscore `_` but with different semantics is a bit confusing to me. It is hard for the reader to keep track of what is what going forward. It would be really easy for someone to confuse the two in the future and have bugs creep in that way (just because of an underscore). It may be more useful to use the units in at least one of the two names.
>
> I would love to see names like `RM_SIZE` and `RM_SIZE_IN_LONGS`, rather than `RM_SIZE` and `_RM_SIZE`.
> Even better would be `RM_SIZE_IN_INTS` and `RM_SIZE_IN_LONGS`. That way, you could save a lot of comments. Maybe you could come up with even better names. "slots" and "words"?
> You could consider doing a renaming PR first before the patch here. Maybe you can even automate the renaming with a command/script, and then apply the same renaming to the changes here?

Oh gosh, I just realized: the machine word size of course depends on whether the architecture is 32-bit or 64-bit. Yikes. So maybe the names need to be stack-slots vs words? And there should probably be a quick reminder somewhere that words can be different sizes.
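
Editorial aside, not part of the original mail: the point about word sizes can be made concrete with a little arithmetic. The sketch below assumes a mask size given in 32-bit units and computes how many machine words are needed to hold the same bits on a 32-bit and on a 64-bit platform. The class `MaskWordCount`, the method `machineWords`, and the size value 10 are hypothetical and chosen only for illustration; they are not taken from `regmask.hpp`.

```java
// Hypothetical sketch: the same mask needs a different number of machine
// words depending on the platform word size. Names and values are assumptions.
public class MaskWordCount {
    // Number of machine words needed to hold sizeIn32BitWords * 32 bits,
    // rounded up to a whole word.
    static int machineWords(int sizeIn32BitWords, int machineWordBits) {
        int totalBits = sizeIn32BitWords * 32;
        return (totalBits + machineWordBits - 1) / machineWordBits;
    }

    public static void main(String[] args) {
        int sizeInInts = 10; // made-up mask size in 32-bit units
        System.out.println("32-bit platform: " + machineWords(sizeInInts, 32) + " words");
        System.out.println("64-bit platform: " + machineWords(sizeInInts, 64) + " words");
    }
}
```

With names that carry the unit, as suggested in the review, a reader would not have to redo this conversion mentally every time the two sizes appear side by side.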
-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2313237509

From epeter at openjdk.org Mon Sep 1 08:43:44 2025
From: epeter at openjdk.org (Emanuel Peter)
Date: Mon, 1 Sep 2025 08:43:44 GMT
Subject: RFR: 8329077: C2 SuperWord: Add MoveD2L, MoveL2D, MoveF2I, MoveI2F [v4]
In-Reply-To: <5xrZ-TcQ9OaMFIAMGIMTDCwGdexIMs0eJd6Li-T1aQc=.fc863cb9-0ce2-488f-a7d6-3aa211248798@github.com>
References: <0bYYOS5AYvN4ZD1xAGBRqV_xasw-np3JWKXC7WcGhyc=.74d97456-f406-4dbe-be09-77ed3b9a66fd@github.com> <0VA9QnuPSb55PbioO1XWtSmrAC-sQet0hb_ldRgKdFQ=.95f56a0b-3b08-4654-8f1e-7217cd9bcabe@github.com> <5xrZ-TcQ9OaMFIAMGIMTDCwGdexIMs0eJd6Li-T1aQc=.fc863cb9-0ce2-488f-a7d6-3aa211248798@github.com>
Message-ID:

On Mon, 1 Sep 2025 08:17:08 GMT, Galder Zamarreño wrote:

>> @galderz I got a failure in our testing:
>>
>> With VM flag: `-XX:UseAVX=1`.
>>
>>
>> Failed IR Rules (2) of Methods (2)
>> ----------------------------------
>> 1) Method "static java.lang.Object[] compiler.loopopts.superword.TestCompatibleUseDefTypeSize.test6(int[],float[])" - [Failed IR rules: 1]:
>> * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={"sse4.1", "true", "asimd", "true", "rvv", "true"}, counts={"_#V#LOAD_VECTOR_F#_", "> 0", "_#STORE_VECTOR#_", "> 0", "_#VECTOR_REINTERPRET#_", "> 0"}, applyIfPlatformOr={}, applyIfPlatform={"64-bit", "true"}, failOn={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
>> > Phase "PrintIdeal":
>> - counts: Graph contains wrong number of nodes:
>> * Constraint 1: "(\\d+(\\s){2}(LoadVector.*)+(\\s){2}===.*vector[A-Za-z])"
>> - Failed comparison: [found] 0 > 0 [given]
>> - No nodes matched!
>> >> 2) Method "static java.lang.Object[] compiler.loopopts.superword.TestCompatibleUseDefTypeSize.test9(long[],double[])" - [Failed IR rules: 1]: >> * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={"sse4.1", "true", "asimd", "true", "rvv", "true"}, counts={"_#V#LOAD_VECTOR_D#_", "> 0", "_#STORE_VECTOR#_", "> 0", "_#VECTOR_REINTERPRET#_", "> 0"}, applyIfPlatformOr={}, applyIfPlatform={"64-bit", "true"}, failOn={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})" >> > Phase "PrintIdeal": >> - counts: Graph contains wrong number of nodes: >> * Constraint 1: "(\\d+(\\s){2}(LoadVector.*)+(\\s){2}===.*vector[A-Za-z])" >> - Failed comparison: [found] 0 > 0 [given] >> - No nodes matched! >> >> >> I suspect that `test6` with `floatToRawIntBits` and `test9` with `doubleToRawLongBits` are only supported with `AVX2`. Question is if that is really supposed to be like that, or if we should even file an RFE to extend support for `AVX1` and lower. >> >> Can you find out why we don't vectorize with `AVX1` here? > >> Can you find out why we don't vectorize with AVX1 here? > > This was a fun little rabbit hole. The explanation below is for `test6` but I think the same logic applies to `test9`: > > The problem comes from the IR node definition, what JTreg does with that, and the what HotSpot code actually does. > > The annotation definition is: > > @IR(counts = {IRNode.LOAD_VECTOR_F, "> 0", > > > So JTreg assumes that the regex should match a vector size of 8. With `UseAVX=1` and floats, `IRNode.getMaxElementsForTypeOnX86` returns 8 and so that's how the constraint is set: > > > * Constraint 1: "(\d+(\s){2}(LoadVector.*)+(\s){2}===.*vector[A-Za-z])" > > > But the issue is that at runtime the vector size is 4: > > 844 LoadVector === ... #vectorx > > > HotSpot logic is more nuanced, with the key being what happens in `SuperWord::unrolling_analysis`. 
The thing that JTreg doesn't know is that there are 2 types involved in the loop, float **and** int: > > > for (int i = 0; i < a.length; i++) { > a[i] = Float.floatToRawIntBits(b[i]); > } > > > With `UseAVX=1`, the max vector size for floats is 8, but for ints is 4. So the JVM picks the minimum value and uses that. Hence that is how unrolling is 4... all the way to the load vector size which is 4. > > IMO the right thing to do would be to fix the annotation to be: > > > @IR(counts = {IRNode.LOAD_VECTOR_F, IRNode.VECTOR_SIZE_4, "> 0", > > > And explain it in javadoc why the expected size is 4. > > The same with `test9` > > WDYT @eme64? @galderz Ah, maybe we just need to do it like here then: `test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java:192:50: counts = {IRNode.VECTOR_CAST_I2F, IRNode.VECTOR_SIZE + "min(max_int, max_float)", ">0"})` When doing cast/reinterpret/move between types this always happens ;) I think this should generalize over all platforms. Does that work? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26457#issuecomment-3241438142 From epeter at openjdk.org Mon Sep 1 08:47:51 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 1 Sep 2025 08:47:51 GMT Subject: RFR: 8329077: C2 SuperWord: Add MoveD2L, MoveL2D, MoveF2I, MoveI2F [v5] In-Reply-To: References: <0bYYOS5AYvN4ZD1xAGBRqV_xasw-np3JWKXC7WcGhyc=.74d97456-f406-4dbe-be09-77ed3b9a66fd@github.com> Message-ID: <8_JXUPiLQNWEmDTbAnwB1jdYu6mTE3_NbETZkQabPwU=.78227d3e-8312-47da-bb2b-0a84017fc724@github.com> On Mon, 25 Aug 2025 07:13:43 GMT, Galder Zamarre?o wrote: >> I've added support to vectorize `MoveD2L`, `MoveL2D`, `MoveF2I` and `MoveI2F` nodes. The implementation follows a similar pattern to what is done with conversion (`Conv*`) nodes. The tests in `TestCompatibleUseDefTypeSize` have been updated with the new expectations. >> >> Also added a JMH benchmark which measures throughput (the higher the number the better) for methods that exercise these nodes. 
On darwin/aarch64 it shows: >> >> >> Benchmark (seed) (size) Mode Cnt Base Patch Units Diff >> VectorBitConversion.doubleToLongBits 0 2048 thrpt 8 1168.782 1157.717 ops/ms -1% >> VectorBitConversion.doubleToRawLongBits 0 2048 thrpt 8 3999.387 7353.936 ops/ms +83% >> VectorBitConversion.floatToIntBits 0 2048 thrpt 8 1200.338 1188.206 ops/ms -1% >> VectorBitConversion.floatToRawIntBits 0 2048 thrpt 8 4058.248 14792.474 ops/ms +264% >> VectorBitConversion.intBitsToFloat 0 2048 thrpt 8 3050.313 14984.246 ops/ms +391% >> VectorBitConversion.longBitsToDouble 0 2048 thrpt 8 3022.691 7379.360 ops/ms +144% >> >> >> The improvements observed are a result of vectorization. The lack of vectorization in `doubleToLongBits` and `floatToIntBits` demonstrates that these changes do not affect their performance. These methods do not vectorize because of flow control. >> >> I've run the tier1-3 tests on linux/aarch64 and didn't observe any regressions. > > Galder Zamarre?o has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 22 additional commits since the last revision: > > - Merge branch 'master' into topic.fp-bits-vector > - Add more IR node positive assertions > - Fix source of data for benchmarks > - Refactor benchmarks to TypeVectorOperations > - Check at the very least that auto vectorization is supported > - Avoid VectorReinterpret::implemented > - Refactor and add copyright header > - Rephrase comment > - Removed unnecessary assert methods > - Adjust IR test after adding Move* vector support > - ... 
and 12 more: https://git.openjdk.org/jdk/compare/fc6e0b6f...e7e4d801 test/hotspot/jtreg/compiler/loopopts/superword/TestCompatibleUseDefTypeSize.java line 460: > 458: @IR(counts = {IRNode.LOAD_VECTOR_L, "> 0", > 459: IRNode.STORE_VECTOR, "> 0", > 460: IRNode.VECTOR_REINTERPRET, "> 0"}, Ah, I just saw that `VECTOR_REINTERPRET` is no `vectorNode`, so we don't check the size for it. Would it have a type and size though? If so, we could consider making it more precise, like all the vector casts. Would be a little bit of work, but it would make the rules more precise. Could also be a separate RFE. 2458 public static final String VECTOR_REINTERPRET = PREFIX + "VECTOR_REINTERPRET" + POSTFIX; 2459 static { 2460 beforeMatchingNameRegex(VECTOR_REINTERPRET, "VectorReinterpret"); 2461 } 2462 2463 public static final String VECTOR_UCAST_B2S = VECTOR_PREFIX + "VECTOR_UCAST_B2S" + POSTFIX; 2464 static { 2465 vectorNode(VECTOR_UCAST_B2S, "VectorUCastB2X", TYPE_SHORT); 2466 } Depending on the dump, it may not be so easy though. Not sure. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26457#discussion_r2313298675 From epeter at openjdk.org Mon Sep 1 08:50:58 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 1 Sep 2025 08:50:58 GMT Subject: RFR: 8366361: C2 SuperWord: rename VTransformNode::set_req -> init_req, analogue to Node::init_req [v2] In-Reply-To: <-xtJXkBZ8TsKvj1zsyDeaAlXECBhIju5TZzfxc3iuYg=.dd473b6c-ff01-4fd5-90d7-701e0407f9bc@github.com> References: <-xtJXkBZ8TsKvj1zsyDeaAlXECBhIju5TZzfxc3iuYg=.dd473b6c-ff01-4fd5-90d7-701e0407f9bc@github.com> Message-ID: <8zauGgBGFELEkml3ODhhsoVJDGJUyKNhg3cQbxF60RU=.820949e0-a455-4a5d-8c35-63af12a24e97@github.com> On Mon, 1 Sep 2025 07:12:08 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains five commits:
>>
>> - Merge branch 'master' into JDK-8366361-vtn-init_req
>> - JDK-8366361
>> - For Christian: use phase->intcon instead
>> - Update src/hotspot/share/opto/vtransform.hpp
>>
>> Co-authored-by: Christian Hagedorn
>> - JDK-8366357
>
> Marked as reviewed by chagedorn (Reviewer).

@chhagedorn @vnkozlov Thanks for the reviews!

-------------

PR Comment: https://git.openjdk.org/jdk/pull/26991#issuecomment-3241456255

From epeter at openjdk.org Mon Sep 1 08:51:00 2025
From: epeter at openjdk.org (Emanuel Peter)
Date: Mon, 1 Sep 2025 08:51:00 GMT
Subject: Integrated: 8366361: C2 SuperWord: rename VTransformNode::set_req -> init_req, analogue to Node::init_req
In-Reply-To:
References:
Message-ID:

On Thu, 28 Aug 2025 15:30:31 GMT, Emanuel Peter wrote:

> I'm working on cost-modeling, and am integrating some smaller changes from this proof-of-concept PR:
> https://github.com/openjdk/jdk/pull/20964
> [See plan overview.](https://bugs.openjdk.org/browse/JDK-8340093)
>
> This is a pure refactoring - no change in behaviour. I'm presenting it like this because it will make reviews easier.
>
> The current implementation of `VTransformNode::set_req` has `init_req` semantics: it verifies that the corresponding input is still nullptr. We should thus rename it. It will also free up the "set_req" name for later use in VTransform optimizations, where we want to modify the graph.
>
> See `VTransformReductionVectorNode::optimize_move_non_strict_order_reductions_out_of_loop` in the proof-of-concept PR.
>
> FYI: this PR is dependent on https://github.com/openjdk/jdk/pull/26987. I'll rebase once that one is integrated. We can still already review, so that the process is a little faster later on. (I have more small changes coming, but separating makes them more reviewable.)

This pull request has now been integrated.
Changeset: 56713817 Author: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/56713817c0fd060f7106a538b0e795081f4f9d4b Stats: 26 lines in 3 files changed: 0 ins; 0 del; 26 mod 8366361: C2 SuperWord: rename VTransformNode::set_req -> init_req, analogue to Node::init_req Reviewed-by: kvn, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/26991 From galder at openjdk.org Mon Sep 1 08:51:51 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Mon, 1 Sep 2025 08:51:51 GMT Subject: RFR: 8329077: C2 SuperWord: Add MoveD2L, MoveL2D, MoveF2I, MoveI2F [v5] In-Reply-To: References: <0bYYOS5AYvN4ZD1xAGBRqV_xasw-np3JWKXC7WcGhyc=.74d97456-f406-4dbe-be09-77ed3b9a66fd@github.com> Message-ID: On Mon, 25 Aug 2025 07:13:43 GMT, Galder Zamarre?o wrote: >> I've added support to vectorize `MoveD2L`, `MoveL2D`, `MoveF2I` and `MoveI2F` nodes. The implementation follows a similar pattern to what is done with conversion (`Conv*`) nodes. The tests in `TestCompatibleUseDefTypeSize` have been updated with the new expectations. >> >> Also added a JMH benchmark which measures throughput (the higher the number the better) for methods that exercise these nodes. On darwin/aarch64 it shows: >> >> >> Benchmark (seed) (size) Mode Cnt Base Patch Units Diff >> VectorBitConversion.doubleToLongBits 0 2048 thrpt 8 1168.782 1157.717 ops/ms -1% >> VectorBitConversion.doubleToRawLongBits 0 2048 thrpt 8 3999.387 7353.936 ops/ms +83% >> VectorBitConversion.floatToIntBits 0 2048 thrpt 8 1200.338 1188.206 ops/ms -1% >> VectorBitConversion.floatToRawIntBits 0 2048 thrpt 8 4058.248 14792.474 ops/ms +264% >> VectorBitConversion.intBitsToFloat 0 2048 thrpt 8 3050.313 14984.246 ops/ms +391% >> VectorBitConversion.longBitsToDouble 0 2048 thrpt 8 3022.691 7379.360 ops/ms +144% >> >> >> The improvements observed are a result of vectorization. The lack of vectorization in `doubleToLongBits` and `floatToIntBits` demonstrates that these changes do not affect their performance. 
These methods do not vectorize because of flow control.
>>
>> I've run the tier1-3 tests on linux/aarch64 and didn't observe any regressions.
>
> Galder Zamarreño has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 22 additional commits since the last revision:
>
> - Merge branch 'master' into topic.fp-bits-vector
> - Add more IR node positive assertions
> - Fix source of data for benchmarks
> - Refactor benchmarks to TypeVectorOperations
> - Check at the very least that auto vectorization is supported
> - Avoid VectorReinterpret::implemented
> - Refactor and add copyright header
> - Rephrase comment
> - Removed unnecessary assert methods
> - Adjust IR test after adding Move* vector support
> - ... and 12 more: https://git.openjdk.org/jdk/compare/54d7c4b3...e7e4d801

One correction about my suggested fix above:

This one would work for `UseAVX=1` but would fail with other `UseAVX` values.

@IR(counts = {IRNode.LOAD_VECTOR_F, IRNode.VECTOR_SIZE_4, "> 0",

It would need to be something like this to work in all cases:

@IR(counts = {IRNode.LOAD_VECTOR_F, IRNode.VECTOR_SIZE_ANY, "> 0",

-------------

PR Comment: https://git.openjdk.org/jdk/pull/26457#issuecomment-3241465858

From rehn at openjdk.org Mon Sep 1 08:56:25 2025
From: rehn at openjdk.org (Robbin Ehn)
Date: Mon, 1 Sep 2025 08:56:25 GMT
Subject: RFR: 8365926: RISC-V: Performance regression in renaissance (chi-square) [v2]
In-Reply-To:
References:
Message-ID:

> Hey, please consider!
>
> A bunch of info in JBS entry, please read that also.
>
> I narrowed this issue down to the old jal optimization, making direct calls when in reach.
> This patch restores them and removes this regression.
>
> In essence we turn "jalr ra,0(t1)" into a "jal ra," if reachable, and restore the jalr if a new destination is not reachable.
>
> Please test on your hardware!
> > > Chi Square (100 runs each, 10 fastest iterations of each run, P550) > JDK-23 (last version with trampoline calls) > Mean: 3189.5827 > Standard Deviation: 284.6478 > > JDK-25 > Mean: 3424.8905 > Standard Deviation: 222.2208 > > Patch: > Mean: 3144.8535 > Standard Deviation: 229.2577 > > > No issues found in t1, running t2 also. Stress tested on vf2, bpi-f3, p550. Robbin Ehn has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: - Merge branch 'master' into 8365926 - Spelling - Merge branch 'master' into 8365926 - draft jal<->jalr ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26944/files - new: https://git.openjdk.org/jdk/pull/26944/files/03505f8d..b81779cb Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26944&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26944&range=00-01 Stats: 10832 lines in 300 files changed: 8871 ins; 705 del; 1256 mod Patch: https://git.openjdk.org/jdk/pull/26944.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26944/head:pull/26944 PR: https://git.openjdk.org/jdk/pull/26944 From rehn at openjdk.org Mon Sep 1 08:56:25 2025 From: rehn at openjdk.org (Robbin Ehn) Date: Mon, 1 Sep 2025 08:56:25 GMT Subject: RFR: 8365926: RISC-V: Performance regression in renaissance (chi-square) [v2] In-Reply-To: References: Message-ID: On Mon, 1 Sep 2025 07:04:20 GMT, Fei Yang wrote: >> Robbin Ehn has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains four additional commits since the last revision: >> >> - Merge branch 'master' into 8365926 >> - Spelling >> - Merge branch 'master' into 8365926 >> - draft jal<->jalr > > src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 125: > >> 123: set_stub_address_destination_at(stub_addr, dest); >> 124: >> 125: // patches jalr -> jal/jal -> jalr depeding on dest > > Suggestion: s/depeding/depending/ Fixed > src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 146: > >> 144: >> 145: address dest = stub_address_destination_at(stub_addr); >> 146: optimize_call(dest, false); // patches jalr -> jal/jal -> jalr depeding on dest > > Suggestion: s/depeding/depending/ Fixed ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26944#discussion_r2313319741 PR Review Comment: https://git.openjdk.org/jdk/pull/26944#discussion_r2313319894 From rcastanedalo at openjdk.org Mon Sep 1 08:58:50 2025 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Mon, 1 Sep 2025 08:58:50 GMT Subject: RFR: 8365791: IGV: Update build dependencies In-Reply-To: References: Message-ID: On Fri, 29 Aug 2025 06:37:30 GMT, Roberto Castañeda Lozano wrote: > This changeset updates IGV's Apache Batik dependency, which is used for exporting graphs into SVG files (`File -> Export current graph...`), to its latest version. > > **Testing:** checked manually that a few graphs are correctly exported as SVG files. Thanks for reviewing, Christian and Albert! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/27000#issuecomment-3241484828 From rcastanedalo at openjdk.org Mon Sep 1 08:58:51 2025 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Mon, 1 Sep 2025 08:58:51 GMT Subject: Integrated: 8365791: IGV: Update build dependencies In-Reply-To: References: Message-ID: On Fri, 29 Aug 2025 06:37:30 GMT, Roberto Castañeda Lozano wrote: > This changeset updates IGV's Apache Batik dependency, which is used for exporting graphs into SVG files (`File -> Export current graph...`), to its latest version. > > **Testing:** checked manually that a few graphs are correctly exported as SVG files. This pull request has now been integrated. Changeset: fc77e760 Author: Roberto Castañeda Lozano URL: https://git.openjdk.org/jdk/commit/fc77e7600f217cc91c24d4e512c685e176a66e4a Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod 8365791: IGV: Update build dependencies Reviewed-by: chagedorn, ayang ------------- PR: https://git.openjdk.org/jdk/pull/27000 From epeter at openjdk.org Mon Sep 1 08:58:57 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 1 Sep 2025 08:58:57 GMT Subject: RFR: 8366427: C2 SuperWord: refactor VTransform scalar nodes [v3] In-Reply-To: <0BaZ4QsDU5cQnZpcb3WzmX8UDIaomZOKkg0_BjuzLJY=.1d891297-dc22-4c79-a951-5d7456bac0cd@github.com> References: <0BaZ4QsDU5cQnZpcb3WzmX8UDIaomZOKkg0_BjuzLJY=.1d891297-dc22-4c79-a951-5d7456bac0cd@github.com> Message-ID: > I'm working on cost-modeling, and am integrating some smaller changes from this proof-of-concept PR: > https://github.com/openjdk/jdk/pull/20964 > [See plan overview.](https://bugs.openjdk.org/browse/JDK-8340093) > > This is a pure refactoring - no change in behaviour. I'm presenting it like this because it will make reviews easier. > > The goal is to split up some cases that are currently treated the same, but will later have different behavior. 
There may be a little bit of code duplication, but the code will soon be made different ;) > > We split the `VTransformScalarNode`: > - `VTransformMemopScalarNode` > - Uses that only wanted scalar mem nodes can now directly check for `isa_MemopScalar`. > - We can directly store the `_vpointer` in a field, that way we don't need to do a lookup via `vloop_analyzer`. This could also be helpful later on if we ever do widening (unrolling during auto vectorization): we could then do the necessary modifications to the `vpointer`. > - `VTransformLoopPhiNode` > - Later on, they will play a more special role, they will give us easy access to the beginning state of the loop body and the backedges. > - `VTransformCFGNode` > - Calling them scalar nodes is not 100% accurate. We'll probably have to further refine them later on. But splitting them off now seems like a reasonable choice. Once we do if-conversion we'll have to do more work on CFG. > - `VTransformDataScalarNode` > - These represent all the normal "calculation" nodes in the loop. > - `VTransformInputScalarNode` -> `VTransformOuterNode`: > - For now, we are still just tracking input nodes, but soon we will need to track input and output nodes: basically just the 1-hop neighbourhood of nodes outside the loop. I'm already renaming them now, so it will be less noise later. > > I decided to rather split up more, and avoid the `VTransformScalarNode` together, avoiding having to override overrides - that can be really confusing (e.g. what I had with `is_load_in_loop`). Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 13 commits: - Merge branch 'JDK-8366427-VTransform-scalar-node-refactor' of https://github.com/eme64/jdk into JDK-8366427-VTransform-scalar-node-refactor - Update src/hotspot/share/opto/vtransform.hpp Co-authored-by: Manuel H?ssig - manual merge - improve print_spec - rm comment - InputScalar -> Outer renaming - rm useless methods - rm vloop_analyzer from vpointer method - JDK-8366427 - JDK-8366361 - ... and 3 more: https://git.openjdk.org/jdk/compare/56713817...86e88f43 ------------- Changes: https://git.openjdk.org/jdk/pull/27002/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27002&range=02 Stats: 157 lines in 4 files changed: 114 ins; 0 del; 43 mod Patch: https://git.openjdk.org/jdk/pull/27002.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27002/head:pull/27002 PR: https://git.openjdk.org/jdk/pull/27002 From dfenacci at openjdk.org Mon Sep 1 09:02:46 2025 From: dfenacci at openjdk.org (Damon Fenacci) Date: Mon, 1 Sep 2025 09:02:46 GMT Subject: RFR: 8360031: C2 compilation asserts in MemBarNode::remove [v2] In-Reply-To: References: <5CGrcWjFZ7Zqj_Tm0LO6Tqg9cUA-xxvcaa2J-yWW8BE=.af4dea7c-e39d-491d-b924-c89fa82e757a@github.com> Message-ID: On Thu, 21 Aug 2025 00:27:02 GMT, Dean Long wrote: > This look OK on the surface, but isn't handling MemBarStoreStore and MemBarRelease differently asking for trouble? Is there a reason why they need to be handled in different passes? I'm not sure of the reason why EA handles `MemBarStoreStore` separately. Maybe @vnkozlov can shed some light... BTW the original assert with condition `Opcode() == Op_Initialize` seems to have been added because that was the case of the [JDK-8269771](https://bugs.openjdk.org/browse/JDK-8269771) bug ([PR](https://github.com/openjdk/jdk17/pull/193)). I'm not sure that there couldn't be any other additional case (apart from the current two) that makes the membar node have only one out edge. 
-------------

PR Comment: https://git.openjdk.org/jdk/pull/26556#issuecomment-3241505556

From galder at openjdk.org Mon Sep 1 09:03:24 2025
From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=)
Date: Mon, 1 Sep 2025 09:03:24 GMT
Subject: RFR: 8329077: C2 SuperWord: Add MoveD2L, MoveL2D, MoveF2I, MoveI2F [v6]
In-Reply-To: <0bYYOS5AYvN4ZD1xAGBRqV_xasw-np3JWKXC7WcGhyc=.74d97456-f406-4dbe-be09-77ed3b9a66fd@github.com>
References: <0bYYOS5AYvN4ZD1xAGBRqV_xasw-np3JWKXC7WcGhyc=.74d97456-f406-4dbe-be09-77ed3b9a66fd@github.com>
Message-ID:

> I've added support to vectorize `MoveD2L`, `MoveL2D`, `MoveF2I` and `MoveI2F` nodes. The implementation follows a similar pattern to what is done with conversion (`Conv*`) nodes. The tests in `TestCompatibleUseDefTypeSize` have been updated with the new expectations.
>
> Also added a JMH benchmark which measures throughput (the higher the number the better) for methods that exercise these nodes. On darwin/aarch64 it shows:
>
>
> Benchmark                                (seed)  (size)   Mode  Cnt      Base      Patch   Units   Diff
> VectorBitConversion.doubleToLongBits          0    2048  thrpt    8  1168.782   1157.717  ops/ms    -1%
> VectorBitConversion.doubleToRawLongBits       0    2048  thrpt    8  3999.387   7353.936  ops/ms   +83%
> VectorBitConversion.floatToIntBits            0    2048  thrpt    8  1200.338   1188.206  ops/ms    -1%
> VectorBitConversion.floatToRawIntBits         0    2048  thrpt    8  4058.248  14792.474  ops/ms  +264%
> VectorBitConversion.intBitsToFloat            0    2048  thrpt    8  3050.313  14984.246  ops/ms  +391%
> VectorBitConversion.longBitsToDouble          0    2048  thrpt    8  3022.691   7379.360  ops/ms  +144%
>
>
> The improvements observed are a result of vectorization. The lack of vectorization in `doubleToLongBits` and `floatToIntBits` demonstrates that these changes do not affect their performance. These methods do not vectorize because of flow control.
>
> I've run the tier1-3 tests on linux/aarch64 and didn't observe any regressions.
Galder Zamarreño has updated the pull request incrementally with one additional commit since the last revision: Adjust vector size expectations ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26457/files - new: https://git.openjdk.org/jdk/pull/26457/files/e7e4d801..632408ba Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26457&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26457&range=04-05 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/26457.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26457/head:pull/26457 PR: https://git.openjdk.org/jdk/pull/26457 From galder at openjdk.org Mon Sep 1 09:03:25 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Mon, 1 Sep 2025 09:03:25 GMT Subject: RFR: 8329077: C2 SuperWord: Add MoveD2L, MoveL2D, MoveF2I, MoveI2F [v4] In-Reply-To: References: <0bYYOS5AYvN4ZD1xAGBRqV_xasw-np3JWKXC7WcGhyc=.74d97456-f406-4dbe-be09-77ed3b9a66fd@github.com> <0VA9QnuPSb55PbioO1XWtSmrAC-sQet0hb_ldRgKdFQ=.95f56a0b-3b08-4654-8f1e-7217cd9bcabe@github.com> <5xrZ-TcQ9OaMFIAMGIMTDCwGdexIMs0eJd6Li-T1aQc=.fc863cb9-0ce2-488f-a7d6-3aa211248798@github.com> Message-ID: On Mon, 1 Sep 2025 08:40:52 GMT, Emanuel Peter wrote: > Does that work? Yeah that works, I'll push an update shortly @eme64 I've just pushed an update that fixes the vector size expectations. I didn't end up writing a javadoc since the proposed solution makes it clearer what the expected size should be. 
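For readers following the thread: the PR description notes that the raw conversions vectorize while `doubleToLongBits` and `floatToIntBits` do not, because of flow control. The following is a minimal sketch of the two loop shapes, not code from the PR; the class and method names are hypothetical. It relies only on the documented behaviour that `Float.floatToIntBits` collapses every NaN to the canonical value `0x7fc00000`, whereas `Float.floatToRawIntBits` is a plain reinterpretation. Whether a particular loop is actually vectorized still depends on the platform and the JVM flags in use.

```java
public class BitConversionLoops {
    // Straight-line reinterpretation per element: no branch in the loop body.
    public static void rawBits(float[] in, int[] out) {
        for (int i = 0; i < in.length; i++) {
            out[i] = Float.floatToRawIntBits(in[i]);
        }
    }

    // floatToIntBits canonicalizes NaN inputs, which implies a per-element
    // check; this is the kind of flow control the thread refers to.
    public static void canonicalBits(float[] in, int[] out) {
        for (int i = 0; i < in.length; i++) {
            out[i] = Float.floatToIntBits(in[i]);
        }
    }

    public static void main(String[] args) {
        float[] in = {1.0f, -0.0f, Float.NaN};
        int[] raw = new int[in.length];
        int[] canonical = new int[in.length];
        rawBits(in, raw);
        canonicalBits(in, canonical);
        for (int i = 0; i < in.length; i++) {
            System.out.println(Integer.toHexString(raw[i]) + " "
                    + Integer.toHexString(canonical[i]));
        }
    }
}
```

For non-NaN inputs both methods return identical bits, so the two loops differ only in how NaN values are handled.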
------------- PR Comment: https://git.openjdk.org/jdk/pull/26457#issuecomment-3241495486 PR Comment: https://git.openjdk.org/jdk/pull/26457#issuecomment-3241505284 From galder at openjdk.org Mon Sep 1 09:03:28 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Mon, 1 Sep 2025 09:03:28 GMT Subject: RFR: 8329077: C2 SuperWord: Add MoveD2L, MoveL2D, MoveF2I, MoveI2F [v5] In-Reply-To: <8_JXUPiLQNWEmDTbAnwB1jdYu6mTE3_NbETZkQabPwU=.78227d3e-8312-47da-bb2b-0a84017fc724@github.com> References: <0bYYOS5AYvN4ZD1xAGBRqV_xasw-np3JWKXC7WcGhyc=.74d97456-f406-4dbe-be09-77ed3b9a66fd@github.com> <8_JXUPiLQNWEmDTbAnwB1jdYu6mTE3_NbETZkQabPwU=.78227d3e-8312-47da-bb2b-0a84017fc724@github.com> Message-ID: On Mon, 1 Sep 2025 08:44:10 GMT, Emanuel Peter wrote: >> Galder Zamarre?o has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 22 additional commits since the last revision: >> >> - Merge branch 'master' into topic.fp-bits-vector >> - Add more IR node positive assertions >> - Fix source of data for benchmarks >> - Refactor benchmarks to TypeVectorOperations >> - Check at the very least that auto vectorization is supported >> - Avoid VectorReinterpret::implemented >> - Refactor and add copyright header >> - Rephrase comment >> - Removed unnecessary assert methods >> - Adjust IR test after adding Move* vector support >> - ... and 12 more: https://git.openjdk.org/jdk/compare/57cf332d...e7e4d801 > > test/hotspot/jtreg/compiler/loopopts/superword/TestCompatibleUseDefTypeSize.java line 460: > >> 458: @IR(counts = {IRNode.LOAD_VECTOR_L, "> 0", >> 459: IRNode.STORE_VECTOR, "> 0", >> 460: IRNode.VECTOR_REINTERPRET, "> 0"}, > > Ah, I just saw that `VECTOR_REINTERPRET` is no `vectorNode`, so we don't check the size for it. Would it have a type and size though? 
> > If so, we could consider making it more precise, like all the vector casts. > Would be a little bit of work, but it would make the rules more precise. > Could also be a separate RFE. > > > 2458 public static final String VECTOR_REINTERPRET = PREFIX + "VECTOR_REINTERPRET" + POSTFIX; > 2459 static { > 2460 beforeMatchingNameRegex(VECTOR_REINTERPRET, "VectorReinterpret"); > 2461 } > 2462 > 2463 public static final String VECTOR_UCAST_B2S = VECTOR_PREFIX + "VECTOR_UCAST_B2S" + POSTFIX; > 2464 static { > 2465 vectorNode(VECTOR_UCAST_B2S, "VectorUCastB2X", TYPE_SHORT); > 2466 } > > > Depending on the dump, it may not be so easy though. Not sure. That makes sense, I'll create a separate RFE for that ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26457#discussion_r2313333399 From mhaessig at openjdk.org Mon Sep 1 09:08:44 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Mon, 1 Sep 2025 09:08:44 GMT Subject: RFR: 8366427: C2 SuperWord: refactor VTransform scalar nodes [v3] In-Reply-To: References: <0BaZ4QsDU5cQnZpcb3WzmX8UDIaomZOKkg0_BjuzLJY=.1d891297-dc22-4c79-a951-5d7456bac0cd@github.com> Message-ID: <5n2PYjLiIaiBx20TGMaRL-nWU-eRDKs-mYGVAnHQMQc=.df9923d2-963f-4589-a21a-5cd56c0467c3@github.com> On Mon, 1 Sep 2025 08:58:57 GMT, Emanuel Peter wrote: >> I'm working on cost-modeling, and am integrating some smaller changes from this proof-of-concept PR: >> https://github.com/openjdk/jdk/pull/20964 >> [See plan overfiew.](https://bugs.openjdk.org/browse/JDK-8340093) >> >> This is a pure refactoring - no change in behaviour. I'm presenting it like this because it will make reviews easier. >> >> The goal is to split up some cases that are currently treated the same, but will alter have different behavior. 
There may be a little bit of code duplication, but the code will soon be made different ;) >> >> We split the `VTransformScalarNode`: >> - `VTransformMemopScalarNode` >> - Uses that only wanted scalar mem nodes can now directly check for `isa_MemopScalar`. >> - We can directly store the `_vpointer` in a field, that way we don't need to do a lookup via `vloop_analyzer`. This could also be helpful later on if we ever do widening (unrolling during auto vectorization): we could then do the necessary modifications to the `vpointer`. >> - `VTransformLoopPhiNode` >> - Later on, they will play a more special role, they will give us easy access to the beginning state of the loop body and the backedges. >> - `VTransformCFGNode` >> - Calling them scalar nodes is not 100% accurate. We'll probably have to further refine them later on. But splitting them off now seems like a reasonable choice. Once we do if-conversion we'll have to do more work on CFG. >> - `VTransformDataScalarNode` >> - These represent all the normal "calculation" nodes in the loop. >> - `VTransformInputScalarNode` -> `VTransformOuterNode`: >> - For now, we are still just tracking input nodes, but soon we will need to track input and output nodes: basically just the 1-hop neighbourhood of nodes outside the loop. I'm already renaming them now, so it will be less noise later. >> >> I decided to rather split up more, and avoid the `VTransformScalarNode` together, avoiding having to override overrides - that can be really confusing (e.g. what I had with `is_load_in_loop`). > > Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 13 commits: > > - Merge branch 'JDK-8366427-VTransform-scalar-node-refactor' of https://github.com/eme64/jdk into JDK-8366427-VTransform-scalar-node-refactor > - Update src/hotspot/share/opto/vtransform.hpp > > Co-authored-by: Manuel H?ssig > - manual merge > - improve print_spec > - rm comment > - InputScalar -> Outer renaming > - rm useless methods > - rm vloop_analyzer from vpointer method > - JDK-8366427 > - JDK-8366361 > - ... and 3 more: https://git.openjdk.org/jdk/compare/56713817...86e88f43 Still good. ------------- Marked as reviewed by mhaessig (Committer). PR Review: https://git.openjdk.org/jdk/pull/27002#pullrequestreview-3172793634 From epeter at openjdk.org Mon Sep 1 09:12:44 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 1 Sep 2025 09:12:44 GMT Subject: RFR: 8329077: C2 SuperWord: Add MoveD2L, MoveL2D, MoveF2I, MoveI2F [v6] In-Reply-To: References: <0bYYOS5AYvN4ZD1xAGBRqV_xasw-np3JWKXC7WcGhyc=.74d97456-f406-4dbe-be09-77ed3b9a66fd@github.com> Message-ID: On Mon, 1 Sep 2025 09:03:24 GMT, Galder Zamarre?o wrote: >> I've added support to vectorize `MoveD2L`, `MoveL2D`, `MoveF2I` and `MoveI2F` nodes. The implementation follows a similar pattern to what is done with conversion (`Conv*`) nodes. The tests in `TestCompatibleUseDefTypeSize` have been updated with the new expectations. >> >> Also added a JMH benchmark which measures throughput (the higher the number the better) for methods that exercise these nodes. 
On darwin/aarch64 it shows:
>>
>>
>> Benchmark                                (seed)  (size)   Mode  Cnt      Base      Patch   Units   Diff
>> VectorBitConversion.doubleToLongBits          0    2048  thrpt    8  1168.782   1157.717  ops/ms    -1%
>> VectorBitConversion.doubleToRawLongBits       0    2048  thrpt    8  3999.387   7353.936  ops/ms   +83%
>> VectorBitConversion.floatToIntBits            0    2048  thrpt    8  1200.338   1188.206  ops/ms    -1%
>> VectorBitConversion.floatToRawIntBits         0    2048  thrpt    8  4058.248  14792.474  ops/ms  +264%
>> VectorBitConversion.intBitsToFloat            0    2048  thrpt    8  3050.313  14984.246  ops/ms  +391%
>> VectorBitConversion.longBitsToDouble          0    2048  thrpt    8  3022.691   7379.360  ops/ms  +144%
>>
>>
>> The improvements observed are a result of vectorization. The lack of vectorization in `doubleToLongBits` and `floatToIntBits` demonstrates that these changes do not affect their performance. These methods do not vectorize because of flow control.
>>
>> I've run the tier1-3 tests on linux/aarch64 and didn't observe any regressions.
>
> Galder Zamarreño has updated the pull request incrementally with one additional commit since the last revision:
>
> Adjust vector size expectations

Perfect, thanks for the update! I'll submit testing again :)

-------------

PR Comment: https://git.openjdk.org/jdk/pull/26457#issuecomment-3241529779

From epeter at openjdk.org Mon Sep 1 09:12:46 2025
From: epeter at openjdk.org (Emanuel Peter)
Date: Mon, 1 Sep 2025 09:12:46 GMT
Subject: RFR: 8329077: C2 SuperWord: Add MoveD2L, MoveL2D, MoveF2I, MoveI2F [v5]
In-Reply-To:
References: <0bYYOS5AYvN4ZD1xAGBRqV_xasw-np3JWKXC7WcGhyc=.74d97456-f406-4dbe-be09-77ed3b9a66fd@github.com> <8_JXUPiLQNWEmDTbAnwB1jdYu6mTE3_NbETZkQabPwU=.78227d3e-8312-47da-bb2b-0a84017fc724@github.com>
Message-ID:

On Mon, 1 Sep 2025 09:07:46 GMT, Galder Zamarreño wrote:

>> That makes sense, I'll create a separate RFE for that
>
> Ideal output for `VectorReinterpret` seems to follow a similar pattern to `LoadVector`...etc with regards to the vector size.
So seems like a similar solution could be implemented: > > > 1306 VectorReinterpret === _ 1307 [[ 1286 ]] #vectory !orig=1179,979,[846],[738],[646],[145] !jvms: TestCompatibleUseDefTypeSize::test7 @ bci:13 (line 427) Very nice. That would be a good follow-up RFE. Do you want to work on that one? Otherwise you could tag it as a `starter` task, and we'll eventually find someone to do it ;) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26457#discussion_r2313364841 From galder at openjdk.org Mon Sep 1 09:12:45 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Mon, 1 Sep 2025 09:12:45 GMT Subject: RFR: 8329077: C2 SuperWord: Add MoveD2L, MoveL2D, MoveF2I, MoveI2F [v5] In-Reply-To: References: <0bYYOS5AYvN4ZD1xAGBRqV_xasw-np3JWKXC7WcGhyc=.74d97456-f406-4dbe-be09-77ed3b9a66fd@github.com> <8_JXUPiLQNWEmDTbAnwB1jdYu6mTE3_NbETZkQabPwU=.78227d3e-8312-47da-bb2b-0a84017fc724@github.com> Message-ID: On Mon, 1 Sep 2025 08:57:28 GMT, Galder Zamarre?o wrote: >> test/hotspot/jtreg/compiler/loopopts/superword/TestCompatibleUseDefTypeSize.java line 460: >> >>> 458: @IR(counts = {IRNode.LOAD_VECTOR_L, "> 0", >>> 459: IRNode.STORE_VECTOR, "> 0", >>> 460: IRNode.VECTOR_REINTERPRET, "> 0"}, >> >> Ah, I just saw that `VECTOR_REINTERPRET` is no `vectorNode`, so we don't check the size for it. Would it have a type and size though? >> >> If so, we could consider making it more precise, like all the vector casts. >> Would be a little bit of work, but it would make the rules more precise. >> Could also be a separate RFE. 
>> >> >> 2458 public static final String VECTOR_REINTERPRET = PREFIX + "VECTOR_REINTERPRET" + POSTFIX; >> 2459 static { >> 2460 beforeMatchingNameRegex(VECTOR_REINTERPRET, "VectorReinterpret"); >> 2461 } >> 2462 >> 2463 public static final String VECTOR_UCAST_B2S = VECTOR_PREFIX + "VECTOR_UCAST_B2S" + POSTFIX; >> 2464 static { >> 2465 vectorNode(VECTOR_UCAST_B2S, "VectorUCastB2X", TYPE_SHORT); >> 2466 } >> >> >> Depending on the dump, it may not be so easy though. Not sure. > > That makes sense, I'll create a separate RFE for that Ideal output for `VectorReinterpret` seems to follow a similar pattern to `LoadVector`...etc with regards to the vector size. So seems like a similar solution could be implemented: 1306 VectorReinterpret === _ 1307 [[ 1286 ]] #vectory !orig=1179,979,[846],[738],[646],[145] !jvms: TestCompatibleUseDefTypeSize::test7 @ bci:13 (line 427) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26457#discussion_r2313358195 From galder at openjdk.org Mon Sep 1 09:19:51 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Mon, 1 Sep 2025 09:19:51 GMT Subject: RFR: 8329077: C2 SuperWord: Add MoveD2L, MoveL2D, MoveF2I, MoveI2F [v5] In-Reply-To: References: <0bYYOS5AYvN4ZD1xAGBRqV_xasw-np3JWKXC7WcGhyc=.74d97456-f406-4dbe-be09-77ed3b9a66fd@github.com> <8_JXUPiLQNWEmDTbAnwB1jdYu6mTE3_NbETZkQabPwU=.78227d3e-8312-47da-bb2b-0a84017fc724@github.com> Message-ID: On Mon, 1 Sep 2025 09:10:29 GMT, Emanuel Peter wrote: >> Ideal output for `VectorReinterpret` seems to follow a similar pattern to `LoadVector`...etc with regards to the vector size. So seems like a similar solution could be implemented: >> >> >> 1306 VectorReinterpret === _ 1307 [[ 1286 ]] #vectory !orig=1179,979,[846],[738],[646],[145] !jvms: TestCompatibleUseDefTypeSize::test7 @ bci:13 (line 427) > > Very nice. That would be a good follow-up RFE. Do you want to work on that one? 
Otherwise you could tag it as a `starter` task, and we'll eventually find someone to do it ;) Yeah I'd like to work on it. Seems like a good one to work in between bigger tasks. I had a question about it though. I noticed that `STORE_VECTOR` is also in a similar situation. Any specific reason to leave that one as is? Or was it just an oversight? If an oversight, a second RFE could be added for that one? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26457#discussion_r2313379507 From bkilambi at openjdk.org Mon Sep 1 09:21:54 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Mon, 1 Sep 2025 09:21:54 GMT Subject: RFR: 8361582: AArch64: Some ConH values cannot be replicated with SVE [v8] In-Reply-To: References: Message-ID: On Thu, 28 Aug 2025 15:01:29 GMT, Aleksey Shipilev wrote: >> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: >> >> Modified JTREG testcase to address review comments > > Let's go, we need this patch in JDK 25, which requires some soak time in mainline :) @shipilev Could I please ask you to sponsor this patch? Thanks! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26589#issuecomment-3236177433 From shade at openjdk.org Mon Sep 1 09:21:55 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 1 Sep 2025 09:21:55 GMT Subject: RFR: 8361582: AArch64: Some ConH values cannot be replicated with SVE [v8] In-Reply-To: References: Message-ID: On Tue, 26 Aug 2025 13:07:21 GMT, Bhavana Kilambi wrote: >> After this commit - https://github.com/openjdk/jdk/commit/a49ecb26c5ff2f949851937f3bb036d7946a103e, the JTREG test - >> `test/hotspot/jtreg/compiler/vectorization/TestFloat16VectorOperations.java` fails for some of the tests which contain constant values such as - >> >> >> public void vectorAddConstInputFloat16() { >> for (int i = 0; i < LEN; ++i) { >> output[i] = float16ToRawShortBits(add(shortBitsToFloat16(input1[i]), FP16_CONST)); >> } >> } >> >> >> >> >> >> The current code in the JDK results in the generation of sve_dup instruction for every 16-bit immediate while the acceptable range is [-128, 127] for 8-bit immediates and [-127 << 8, 128 << 8] with a multiple of 256 for 16-bit signed immediates. >> >> This patch allows the generation of sve_dup instruction for only those 16-bit values which are within the limits as specified above and for the values which are out of range, the immediate half float value is loaded from the constant pool into a register ("loadConH" mach node) which is then replicated or broadcasted to an SVE register ("replicateHF" mach node). >> >> Both the tests - `test/hotspot/jtreg/compiler/vectorization/TestFloat16VectorOperations.java` and `test/hotspot/jtreg/compiler/c2/aarch64/TestFloat16Replicate.java` pass on 256-bit SVE machine. JTREG tests - hotspot (hotspot_all), langtools (tier1) and jdk(tier 1-3) pass on the same machine. 
> > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Modified JTREG testcase to address review comments Huh, I don't see the integration message from bot. Let's see if this message gets the PR on bot notification queue. There it is. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26589#issuecomment-3241560416 PR Comment: https://git.openjdk.org/jdk/pull/26589#issuecomment-3241563354 From bkilambi at openjdk.org Mon Sep 1 09:21:56 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Mon, 1 Sep 2025 09:21:56 GMT Subject: Integrated: 8361582: AArch64: Some ConH values cannot be replicated with SVE In-Reply-To: References: Message-ID: On Fri, 1 Aug 2025 09:31:40 GMT, Bhavana Kilambi wrote: > After this commit - https://github.com/openjdk/jdk/commit/a49ecb26c5ff2f949851937f3bb036d7946a103e, the JTREG test - > `test/hotspot/jtreg/compiler/vectorization/TestFloat16VectorOperations.java` fails for some of the tests which contain constant values such as - > > > public void vectorAddConstInputFloat16() { > for (int i = 0; i < LEN; ++i) { > output[i] = float16ToRawShortBits(add(shortBitsToFloat16(input1[i]), FP16_CONST)); > } > } > > > > > > The current code in the JDK results in the generation of sve_dup instruction for every 16-bit immediate while the acceptable range is [-128, 127] for 8-bit immediates and [-127 << 8, 128 << 8] with a multiple of 256 for 16-bit signed immediates. > > This patch allows the generation of sve_dup instruction for only those 16-bit values which are within the limits as specified above and for the values which are out of range, the immediate half float value is loaded from the constant pool into a register ("loadConH" mach node) which is then replicated or broadcasted to an SVE register ("replicateHF" mach node). 
> > Both the tests - `test/hotspot/jtreg/compiler/vectorization/TestFloat16VectorOperations.java` and `test/hotspot/jtreg/compiler/c2/aarch64/TestFloat16Replicate.java` pass on 256-bit SVE machine. JTREG tests - hotspot (hotspot_all), langtools (tier1) and jdk(tier 1-3) pass on the same machine. This pull request has now been integrated. Changeset: 7f0cd648 Author: Bhavana Kilambi Committer: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/7f0cd6488ba969d5cffe8ebe9b95e4ad70982188 Stats: 220 lines in 7 files changed: 182 ins; 7 del; 31 mod 8361582: AArch64: Some ConH values cannot be replicated with SVE Reviewed-by: shade, epeter, aph ------------- PR: https://git.openjdk.org/jdk/pull/26589 From epeter at openjdk.org Mon Sep 1 09:27:45 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 1 Sep 2025 09:27:45 GMT Subject: RFR: 8329077: C2 SuperWord: Add MoveD2L, MoveL2D, MoveF2I, MoveI2F [v5] In-Reply-To: References: <0bYYOS5AYvN4ZD1xAGBRqV_xasw-np3JWKXC7WcGhyc=.74d97456-f406-4dbe-be09-77ed3b9a66fd@github.com> <8_JXUPiLQNWEmDTbAnwB1jdYu6mTE3_NbETZkQabPwU=.78227d3e-8312-47da-bb2b-0a84017fc724@github.com> Message-ID: On Mon, 1 Sep 2025 09:16:39 GMT, Galder Zamarre?o wrote: >> Very nice. That would be a good follow-up RFE. Do you want to work on that one? Otherwise you could tag it as a `starter` task, and we'll eventually find someone to do it ;) > > Yeah I'd like to work on it. Seems like a good one to work in between bigger tasks. > > I had a question about it though. I noticed that `STORE_VECTOR` is also in a similar situation. Any specific reason to leave that one as is? Or was it just an oversight? If an oversight, a second RFE could be added for that one? It would probably also be good if we did stores as well, yes. But you'll touch many many tests, having to specify the type of the store. Still I would say it is worth it ? 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26457#discussion_r2313398780 From galder at openjdk.org Mon Sep 1 09:46:46 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Mon, 1 Sep 2025 09:46:46 GMT Subject: RFR: 8329077: C2 SuperWord: Add MoveD2L, MoveL2D, MoveF2I, MoveI2F [v5] In-Reply-To: References: <0bYYOS5AYvN4ZD1xAGBRqV_xasw-np3JWKXC7WcGhyc=.74d97456-f406-4dbe-be09-77ed3b9a66fd@github.com> <8_JXUPiLQNWEmDTbAnwB1jdYu6mTE3_NbETZkQabPwU=.78227d3e-8312-47da-bb2b-0a84017fc724@github.com> Message-ID: On Mon, 1 Sep 2025 09:24:45 GMT, Emanuel Peter wrote: >> Yeah I'd like to work on it. Seems like a good one to work in between bigger tasks. >> >> I had a question about it though. I noticed that `STORE_VECTOR` is also in a similar situation. Any specific reason to leave that one as is? Or was it just an oversight? If an oversight, a second RFE could be added for that one? > > It would probably also be good if we did stores as well, yes. But you'll touch many many tests, having to specify the type of the store. Still I would say it is worth it ? I've created [JDK-8366531](https://bugs.openjdk.org/browse/JDK-8366531) for `VectorReinterpret` and [JDK-8366532](https://bugs.openjdk.org/browse/JDK-8366532) for `StoreVector`. I've assigned `VectorReinterpret` one to myself and I left the other one unassigned for someone else to maybe pick it in the future? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26457#discussion_r2313446621 From mli at openjdk.org Mon Sep 1 09:53:44 2025 From: mli at openjdk.org (Hamlin Li) Date: Mon, 1 Sep 2025 09:53:44 GMT Subject: RFR: 8365926: RISC-V: Performance regression in renaissance (chi-square) [v2] In-Reply-To: References: Message-ID: On Mon, 1 Sep 2025 08:56:25 GMT, Robbin Ehn wrote: >> Hey, please consider! >> >> A bunch of info in JBS entry, please read that also. >> >> I narrowed this issue down to the old jal optimization, making direct calls when in reach. 
>> This patch restores them and removes this regression. >> >> In essence we turn "jalr ra,0(t1)" into a "jal ra," if reachable, and restore the jalr if a new destination is not reachable. >> >> Please test on your hardware! >> >> >> Chi Square (100 runs each, 10 fastest iterations of each run, P550) >> JDK-23 (last version with trampoline calls) >> Mean: 3189.5827 >> Standard Deviation: 284.6478 >> >> JDK-25 >> Mean: 3424.8905 >> Standard Deviation: 222.2208 >> >> Patch: >> Mean: 3144.8535 >> Standard Deviation: 229.2577 >> >> >> No issues found in t1, running t2 also. Stress tested on vf2, bpi-f3, p550. > > Robbin Ehn has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: > > - Merge branch 'master' into 8365926 > - Spelling > - Merge branch 'master' into 8365926 > - draft jal<->jalr src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 60: > 58: assert(cb != nullptr && cb->is_nmethod(), "nmethod expected"); > 59: nmethod *nm = (nmethod *)cb; > 60: assert(nm != nullptr, "Sanity"); This line can be removed. src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 62: > 60: assert(nm != nullptr, "Sanity"); > 61: assert(nm->stub_contains(stub_addr), "Sanity"); > 62: assert(stub_addr!= nullptr, "Sanity"); Suggestion: assert(stub_addr != nullptr, "Sanity"); src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 95: > 93: // Skip over auipc + ld > 94: address jal_pc = instruction_address() + 2 * NativeInstruction::instruction_size; > 95: uint32_t *jal_pos = (uint32_t *)jal_pc; Is it possible to lose some data in this conversion? If not, maybe an assert here? 
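For context on the jal<->jalr patching under review: the description says a `jalr ra,0(t1)` is turned into a `jal ra,` when the destination is reachable, and restored otherwise. The sketch below is not the HotSpot implementation; the class and method names are hypothetical and Java is used only for illustration. It assumes nothing beyond the base RISC-V encoding, in which `jal` carries a signed 21-bit, 2-byte-aligned offset (about +/- 1 MiB) and both instructions are a single 32-bit word.

```java
public class JalReach {
    // jal offsets are signed 21-bit values with bit 0 implied zero.
    static final long JAL_MIN = -(1L << 20);
    static final long JAL_MAX = (1L << 20) - 2;

    // Fixed encoding of "jalr ra, 0(t1)": rs1 = x6 (t1), rd = x1 (ra).
    public static final int JALR_RA_0_T1 = (6 << 15) | (1 << 7) | 0x67;

    // True if dest can be reached from pc with a single jal.
    public static boolean isJalReachable(long pc, long dest) {
        long offset = dest - pc;
        return offset >= JAL_MIN && offset <= JAL_MAX && (offset & 1) == 0;
    }

    // Encodes "jal rd, offset" using the J-type immediate layout
    // imm[20|10:1|11|19:12].
    public static int encodeJal(int rd, long offset) {
        if (!isJalReachable(0, offset)) {
            throw new IllegalArgumentException("offset out of jal range: " + offset);
        }
        int imm = (int) offset;
        int bit20 = (imm >> 20) & 0x1;
        int bits10To1 = (imm >> 1) & 0x3ff;
        int bit11 = (imm >> 11) & 0x1;
        int bits19To12 = (imm >> 12) & 0xff;
        return (bit20 << 31) | (bits10To1 << 21) | (bit11 << 20)
                | (bits19To12 << 12) | (rd << 7) | 0x6f;
    }

    // Picks the 32-bit word a patcher would store for the call slot.
    public static int chooseCallWord(long pc, long dest) {
        return isJalReachable(pc, dest) ? encodeJal(1, dest - pc) : JALR_RA_0_T1;
    }

    public static void main(String[] args) {
        System.out.printf("%08x%n", chooseCallWord(0x1000, 0x1008));
        System.out.printf("%08x%n", chooseCallWord(0x1000, 0x1000 + (4L << 20)));
    }
}
```

Because either choice fits in one aligned 32-bit word, the sketch is consistent with the single `Atomic::store` of a `uint32_t` quoted in the review comments.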
src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 103: > 101: } else if (!MacroAssembler::is_jalr_at(jal_pc)) { // The jalr is always identical: jalr ra, 0(t1) > 102: uint32_t new_jal = Assembler::encode_jalr(ra, t1, 0); > 103: Atomic::store(jal_pos, new_jal); Suggestion: uint32_t new_jalr = Assembler::encode_jalr(ra, t1, 0); Atomic::store(jal_pos, new_jalr); ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26944#discussion_r2313464396 PR Review Comment: https://git.openjdk.org/jdk/pull/26944#discussion_r2313464121 PR Review Comment: https://git.openjdk.org/jdk/pull/26944#discussion_r2313463904 PR Review Comment: https://git.openjdk.org/jdk/pull/26944#discussion_r2313463847 From mli at openjdk.org Mon Sep 1 10:06:44 2025 From: mli at openjdk.org (Hamlin Li) Date: Mon, 1 Sep 2025 10:06:44 GMT Subject: RFR: 8365926: RISC-V: Performance regression in renaissance (chi-square) [v2] In-Reply-To: References: Message-ID: On Mon, 1 Sep 2025 08:56:25 GMT, Robbin Ehn wrote: >> Hey, please consider! >> >> A bunch of info in JBS entry, please read that also. >> >> I narrowed this issue down to the old jal optimization, making direct calls when in reach. >> This patch restores them and removes this regression. >> >> In essence we turn "jalr ra,0(t1)" into a "jal ra," if reachable, and restore the jalr if a new destination is not reachable. >> >> Please test on your hardware! >> >> >> Chi Square (100 runs each, 10 fastest iterations of each run, P550) >> JDK-23 (last version with trampoline calls) >> Mean: 3189.5827 >> Standard Deviation: 284.6478 >> >> JDK-25 >> Mean: 3424.8905 >> Standard Deviation: 222.2208 >> >> Patch: >> Mean: 3144.8535 >> Standard Deviation: 229.2577 >> >> >> No issues found in t1, running t2 also. Stress tested on vf2, bpi-f3, p550. > > Robbin Ehn has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains four additional commits since the last revision: > > - Merge branch 'master' into 8365926 > - Spelling > - Merge branch 'master' into 8365926 > - draft jal<->jalr Nice fix! Thanks! Got some questions. src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 110: > 108: // We changed instruction stream > 109: if (mt_safe) { > 110: OrderAccess::release(); If we have release here, do we still need the release in `set_stub_address_destination_at`? src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 111: > 109: if (mt_safe) { > 110: OrderAccess::release(); > 111: ICache::invalidate_range(jal_pc, NativeInstruction::instruction_size); Should `jal_pc` be `instruction_address()`? ------------- PR Review: https://git.openjdk.org/jdk/pull/26944#pullrequestreview-3173008459 PR Review Comment: https://git.openjdk.org/jdk/pull/26944#discussion_r2313495692 PR Review Comment: https://git.openjdk.org/jdk/pull/26944#discussion_r2313495802 From mli at openjdk.org Mon Sep 1 10:12:43 2025 From: mli at openjdk.org (Hamlin Li) Date: Mon, 1 Sep 2025 10:12:43 GMT Subject: RFR: 8365926: RISC-V: Performance regression in renaissance (chi-square) [v2] In-Reply-To: References: Message-ID: On Mon, 1 Sep 2025 08:56:25 GMT, Robbin Ehn wrote: >> Hey, please consider! >> >> A bunch of info in JBS entry, please read that also. >> >> I narrowed this issue down to the old jal optimization, making direct calls when in reach. >> This patch restores them and removes this regression. >> >> In essence we turn "jalr ra,0(t1)" into a "jal ra," if reachable, and restore the jalr if a new destination is not reachable. >> >> Please test on your hardware! >> >> >> Chi Square (100 runs each, 10 fastest iterations of each run, P550) >> JDK-23 (last version with trampoline calls) >> Mean: 3189.5827 >> Standard Deviation: 284.6478 >> >> JDK-25 >> Mean: 3424.8905 >> Standard Deviation: 222.2208 >> >> Patch: >> Mean: 3144.8535 >> Standard Deviation: 229.2577 >> >> >> No issues found in t1, running t2 also. 
Stress tested on vf2, bpi-f3, p550. > > Robbin Ehn has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: > > - Merge branch 'master' into 8365926 > - Spelling > - Merge branch 'master' into 8365926 > - draft jal<->jalr JDK-23 (last version with trampoline calls) Mean: 3189.5827 Standard Deviation: 284.6478 JDK-25 Mean: 3424.8905 Standard Deviation: 222.2208 Patch: Mean: 3144.8535 Standard Deviation: 229.2577 For the performance data, do you have some data for applying this fix on top of the next commit after`JDK-23 (last version with trampoline calls)`? I think this data might be more helpful to understand the performance comparison. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26944#issuecomment-3241745418 From jbhateja at openjdk.org Mon Sep 1 13:05:42 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 1 Sep 2025 13:05:42 GMT Subject: RFR: 8354348: Enable Extended EVEX to REX2/REX demotion for commutative operations with same dst and src2 In-Reply-To: References: Message-ID: On Thu, 28 Aug 2025 21:09:03 GMT, Srinivas Vamsi Parasa wrote: > This change extends Extended EVEX (EEVEX) to REX2/REX demotion for Intel APX NDD instructions to handle commutative operations when the destination register and the second source register (src2) are the same. > > Currently, EEVEX to REX2/REX demotion is only enabled when the first source (src1) and the destination are the same. This enhancement allows additional cases of valid demotion for commutative instructions (add, imul, and, or, xor). 
> > For example: > `eaddl r18, r25, r18` can be encoded as `addl r18, r25` using APX REX2 encoding > `eaddl r2, r7, r2` can be encoded as `addl r2, r7` using non-APX legacy encoding Hi @vamsi-parasa , thanks for working on this. I am in the process of validating https://github.com/openjdk/jdk/pull/26283 and found that additional RA biasing will enable demotion for more cases. With a minimal test case I see the following results: Test point: [image] Baseline: [image] With this patch: [image] With additional RA biasing: [image] ------------- PR Comment: https://git.openjdk.org/jdk/pull/26997#issuecomment-3242279829 From thartmann at openjdk.org Mon Sep 1 13:14:50 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 1 Sep 2025 13:14:50 GMT Subject: RFR: 8366118: DontCompileHugeMethods is not respected with -XX:-TieredCompilation [v5] In-Reply-To: References: Message-ID: On Fri, 29 Aug 2025 23:12:18 GMT, Man Cao wrote: >> Hi, >> >> Could anyone review this change that fixes https://bugs.openjdk.org/browse/JDK-8366118? When this bug happens, it is difficult or almost impossible to debug due to the lack of stack trace, hs-err log or core dump. Fortunately we are also experimenting with sigaltstack for https://bugs.openjdk.org/browse/JDK-8364654, and it helped immensely to identify the root cause. >> >> I will also try adding a test case for DontCompileHugeMethod under -XX:-TieredCompilation. >> >> -Man > > Man Cao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: > > - Merge branch 'master' into JDK-8366118-DontCompileHugeMethods > - Add -Xbatch to test > - Use List.of in test > - Add a jtreg test > - 8366118: DontCompileHugeMethods is not respected with -XX:-TieredCompilation Thanks! Testing all passed now on our side. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26932#issuecomment-3242321745 From epeter at openjdk.org Mon Sep 1 13:50:52 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 1 Sep 2025 13:50:52 GMT Subject: Integrated: 8366427: C2 SuperWord: refactor VTransform scalar nodes In-Reply-To: <0BaZ4QsDU5cQnZpcb3WzmX8UDIaomZOKkg0_BjuzLJY=.1d891297-dc22-4c79-a951-5d7456bac0cd@github.com> References: <0BaZ4QsDU5cQnZpcb3WzmX8UDIaomZOKkg0_BjuzLJY=.1d891297-dc22-4c79-a951-5d7456bac0cd@github.com> Message-ID: On Fri, 29 Aug 2025 09:49:46 GMT, Emanuel Peter wrote: > I'm working on cost-modeling, and am integrating some smaller changes from this proof-of-concept PR: > https://github.com/openjdk/jdk/pull/20964 > [See plan overview.](https://bugs.openjdk.org/browse/JDK-8340093) > > This is a pure refactoring - no change in behaviour. I'm presenting it like this because it will make reviews easier. > > The goal is to split up some cases that are currently treated the same, but will later have different behavior. There may be a little bit of code duplication, but the code will soon be made different ;) > > We split the `VTransformScalarNode`: > - `VTransformMemopScalarNode` > - Uses that only wanted scalar mem nodes can now directly check for `isa_MemopScalar`. > - We can directly store the `_vpointer` in a field, that way we don't need to do a lookup via `vloop_analyzer`. This could also be helpful later on if we ever do widening (unrolling during auto vectorization): we could then do the necessary modifications to the `vpointer`. > - `VTransformLoopPhiNode` > - Later on, they will play a more special role, they will give us easy access to the beginning state of the loop body and the backedges. > - `VTransformCFGNode` > - Calling them scalar nodes is not 100% accurate. We'll probably have to further refine them later on. But splitting them off now seems like a reasonable choice. Once we do if-conversion we'll have to do more work on CFG. 
> - `VTransformDataScalarNode` > - These represent all the normal "calculation" nodes in the loop. > - `VTransformInputScalarNode` -> `VTransformOuterNode`: > - For now, we are still just tracking input nodes, but soon we will need to track input and output nodes: basically just the 1-hop neighbourhood of nodes outside the loop. I'm already renaming them now, so it will be less noise later. > > I decided to rather split up more, and avoid the `VTransformScalarNode` together, avoiding having to override overrides - that can be really confusing (e.g. what I had with `is_load_in_loop`). This pull request has now been integrated. Changeset: 99223eea Author: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/99223eea03e2ed714f7a5408c356fdf06efc9200 Stats: 157 lines in 4 files changed: 114 ins; 0 del; 43 mod 8366427: C2 SuperWord: refactor VTransform scalar nodes Reviewed-by: mhaessig, chagedorn, kvn ------------- PR: https://git.openjdk.org/jdk/pull/27002 From chagedorn at openjdk.org Mon Sep 1 13:50:51 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 1 Sep 2025 13:50:51 GMT Subject: RFR: 8366427: C2 SuperWord: refactor VTransform scalar nodes [v3] In-Reply-To: References: <0BaZ4QsDU5cQnZpcb3WzmX8UDIaomZOKkg0_BjuzLJY=.1d891297-dc22-4c79-a951-5d7456bac0cd@github.com> Message-ID: <1sMQoalYAvK2WnOtHEzQP6PFAvYM9BvF6cLfDCai7Xc=.b298d47f-fcc8-413f-bd3b-2dcebd3f099b@github.com> On Mon, 1 Sep 2025 08:58:57 GMT, Emanuel Peter wrote: >> I'm working on cost-modeling, and am integrating some smaller changes from this proof-of-concept PR: >> https://github.com/openjdk/jdk/pull/20964 >> [See plan overview.](https://bugs.openjdk.org/browse/JDK-8340093) >> >> This is a pure refactoring - no change in behaviour. I'm presenting it like this because it will make reviews easier. >> >> The goal is to split up some cases that are currently treated the same, but will later have different behavior. 
There may be a little bit of code duplication, but the code will soon be made different ;) >> >> We split the `VTransformScalarNode`: >> - `VTransformMemopScalarNode` >> - Uses that only wanted scalar mem nodes can now directly check for `isa_MemopScalar`. >> - We can directly store the `_vpointer` in a field, that way we don't need to do a lookup via `vloop_analyzer`. This could also be helpful later on if we ever do widening (unrolling during auto vectorization): we could then do the necessary modifications to the `vpointer`. >> - `VTransformLoopPhiNode` >> - Later on, they will play a more special role, they will give us easy access to the beginning state of the loop body and the backedges. >> - `VTransformCFGNode` >> - Calling them scalar nodes is not 100% accurate. We'll probably have to further refine them later on. But splitting them off now seems like a reasonable choice. Once we do if-conversion we'll have to do more work on CFG. >> - `VTransformDataScalarNode` >> - These represent all the normal "calculation" nodes in the loop. >> - `VTransformInputScalarNode` -> `VTransformOuterNode`: >> - For now, we are still just tracking input nodes, but soon we will need to track input and output nodes: basically just the 1-hop neighbourhood of nodes outside the loop. I'm already renaming them now, so it will be less noise later. >> >> I decided to rather split up more, and avoid the `VTransformScalarNode` together, avoiding having to override overrides - that can be really confusing (e.g. what I had with `is_load_in_loop`). > > Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 13 commits: > > - Merge branch 'JDK-8366427-VTransform-scalar-node-refactor' of https://github.com/eme64/jdk into JDK-8366427-VTransform-scalar-node-refactor > - Update src/hotspot/share/opto/vtransform.hpp > > Co-authored-by: Manuel Hässig > - manual merge > - improve print_spec > - rm comment > - InputScalar -> Outer renaming > - rm useless methods > - rm vloop_analyzer from vpointer method > - JDK-8366427 > - JDK-8366361 > - ... and 3 more: https://git.openjdk.org/jdk/compare/56713817...86e88f43 Looks good to me, too! ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/27002#pullrequestreview-3173752945 From epeter at openjdk.org Mon Sep 1 13:50:51 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 1 Sep 2025 13:50:51 GMT Subject: RFR: 8366427: C2 SuperWord: refactor VTransform scalar nodes [v3] In-Reply-To: <1sMQoalYAvK2WnOtHEzQP6PFAvYM9BvF6cLfDCai7Xc=.b298d47f-fcc8-413f-bd3b-2dcebd3f099b@github.com> References: <0BaZ4QsDU5cQnZpcb3WzmX8UDIaomZOKkg0_BjuzLJY=.1d891297-dc22-4c79-a951-5d7456bac0cd@github.com> <1sMQoalYAvK2WnOtHEzQP6PFAvYM9BvF6cLfDCai7Xc=.b298d47f-fcc8-413f-bd3b-2dcebd3f099b@github.com> Message-ID: On Mon, 1 Sep 2025 13:46:23 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 13 commits: >> >> - Merge branch 'JDK-8366427-VTransform-scalar-node-refactor' of https://github.com/eme64/jdk into JDK-8366427-VTransform-scalar-node-refactor >> - Update src/hotspot/share/opto/vtransform.hpp >> >> Co-authored-by: Manuel Hässig >> - manual merge >> - improve print_spec >> - rm comment >> - InputScalar -> Outer renaming >> - rm useless methods >> - rm vloop_analyzer from vpointer method >> - JDK-8366427 >> - JDK-8366361 >> - ... and 3 more: https://git.openjdk.org/jdk/compare/56713817...86e88f43 > > Looks good to me, too! 
@chhagedorn @mhaessig @vnkozlov Thanks a lot for all the reviews! ------------- PR Comment: https://git.openjdk.org/jdk/pull/27002#issuecomment-3242452006 From epeter at openjdk.org Mon Sep 1 14:08:43 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 1 Sep 2025 14:08:43 GMT Subject: RFR: 8362117: C2: compiler/stringopts/TestStackedConcatsAppendUncommonTrap.java fails with a wrong result due to invalidated liveness assumptions for data phis In-Reply-To: References: Message-ID: On Mon, 1 Sep 2025 07:04:25 GMT, Daniel Skantz wrote: > This PR addresses a wrong compilation during string optimizations. > > During stacked string concatenation of two StringBuilder links SB1 and SB2, the pattern "append -> Phi -> Region -> (True, False) -> If -> Bool -> CmpP -> Proj (Result) -> toString" may be observed, where toString is the end of SB1, and the simple diamond is part of SB2. > > After JDK-8291775, the Bool test to the diamond If is set to a constant zero to allow for folding the simple diamond away during IGVN, while not letting the top() value from the result projection of SB1 propagate through the graph too quickly. The assumption was that any data Phi of the Region would go away during PhaseRemoveUseless as they are no longer live -- I think that in the case of JDK-8291775, the user of phi was the constructor of SB2. However, in the attached test case, the Phi stays live as it's a parameter (input to an append) of SB2 and will be used during the transformation in `copy_string`. When the diamond region is later folded, the Phi's user picks up the wrong input corresponding to the false branch. > > The proposed solution is to disable the stacked concatenation optimization for this specific pattern. This might be pragmatic as it's an edge case and there's already a bug tail: JDK-8271341-> JDK-8291775 -> JDK-8362117. > > Testing: T1-3 (aed5952). 
> > Extra testing: ran T1-3 on Linux with an instrumented build and verified that the pattern I am excluding in this PR is not seen during any other compilation than that of the proposed regression test. src/hotspot/share/opto/stringopts.cpp line 1072: > 1070: > 1071: // First exclude the following pattern: > 1072: // append -> Phi -> Region -> (True, False) -> If -> Bool -> CmpP -> Proj (Result) -> toString; 2 things: - I was a bit confused about the `->` directionality. Just to confirm: `toString` happens first, then the if-diamond, then the append, right? If yes: I would have reversed the order here. Then again, I'm not super familiar with string opts, so maybe the convention is different here than elsewhere. - Are you sure this can only happen with diamonds? What about nested diamonds? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27028#discussion_r2314059400 From epeter at openjdk.org Mon Sep 1 14:13:42 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 1 Sep 2025 14:13:42 GMT Subject: RFR: 8362117: C2: compiler/stringopts/TestStackedConcatsAppendUncommonTrap.java fails with a wrong result due to invalidated liveness assumptions for data phis In-Reply-To: References: Message-ID: On Mon, 1 Sep 2025 07:04:25 GMT, Daniel Skantz wrote: > This PR addresses a wrong compilation during string optimizations. > > During stacked string concatenation of two StringBuilder links SB1 and SB2, the pattern "append -> Phi -> Region -> (True, False) -> If -> Bool -> CmpP -> Proj (Result) -> toString" may be observed, where toString is the end of SB1, and the simple diamond is part of SB2. > > After JDK-8291775, the Bool test to the diamond If is set to a constant zero to allow for folding the simple diamond away during IGVN, while not letting the top() value from the result projection of SB1 propagate through the graph too quickly. 
The assumption was that any data Phi of the Region would go away during PhaseRemoveUseless as they are no longer live -- I think that in the case of JDK-8291775, the user of phi was the constructor of SB2. However, in the attached test case, the Phi stays live as it's a parameter (input to an append) of SB2 and will be used during the transformation in `copy_string`. When the diamond region is later folded, the Phi's user picks up the wrong input corresponding to the false branch. > > The proposed solution is to disable the stacked concatenation optimization for this specific pattern. This might be pragmatic as it's an edge case and there's already a bug tail: JDK-8271341-> JDK-8291775 -> JDK-8362117. > > Testing: T1-3 (aed5952). > > Extra testing: ran T1-3 on Linux with an instrumented build and verified that the pattern I am excluding in this PR is not seen during any other compilation than that of the proposed regression test. test/hotspot/jtreg/compiler/stringopts/TestStackedConcatsPhiUseOfDiamondRegion.java line 57: > 55: return s; > 56: } > 57: } I wonder if we could write some kind of `StringBuilder` fuzzer. Not saying it has to happen as part of this fix. But it seems we have issues with very similar patterns. And they seem quite basic: chains, diamonds, etc. Would probably not be too hard to use the template framework to generate some random shapes, and verify the result the compiled code gives vs the interpreter. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27028#discussion_r2314076685 From epeter at openjdk.org Mon Sep 1 14:17:47 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 1 Sep 2025 14:17:47 GMT Subject: RFR: 8362117: C2: compiler/stringopts/TestStackedConcatsAppendUncommonTrap.java fails with a wrong result due to invalidated liveness assumptions for data phis In-Reply-To: References: Message-ID: On Mon, 1 Sep 2025 07:04:25 GMT, Daniel Skantz wrote: > This PR addresses a wrong compilation during string optimizations. > > During stacked string concatenation of two StringBuilder links SB1 and SB2, the pattern "append -> Phi -> Region -> (True, False) -> If -> Bool -> CmpP -> Proj (Result) -> toString" may be observed, where toString is the end of SB1, and the simple diamond is part of SB2. > > After JDK-8291775, the Bool test to the diamond If is set to a constant zero to allow for folding the simple diamond away during IGVN, while not letting the top() value from the result projection of SB1 propagate through the graph too quickly. The assumption was that any data Phi of the Region would go away during PhaseRemoveUseless as they are no longer live -- I think that in the case of JDK-8291775, the user of phi was the constructor of SB2. However, in the attached test case, the Phi stays live as it's a parameter (input to an append) of SB2 and will be used during the transformation in `copy_string`. When the diamond region is later folded, the Phi's user picks up the wrong input corresponding to the false branch. > > The proposed solution is to disable the stacked concatenation optimization for this specific pattern. This might be pragmatic as it's an edge case and there's already a bug tail: JDK-8271341-> JDK-8291775 -> JDK-8362117. > > Testing: T1-3 (aed5952). 
> > Extra testing: ran T1-3 on Linux with an instrumented build and verified that the pattern I am excluding in this PR is not seen during any other compilation than that of the proposed regression test. @danielogh Thanks for working on this! I'd love to review, but I'm not very familiar with string opts. Would you mind explaining in a bit more detail what would have gone wrong here? > Phi stays live as it's a parameter (input to an append) of SB2 and will be used during the transformation in copy_string. When the diamond region is later folded, the Phi's user picks up the wrong input corresponding to the false branch. ------------- PR Comment: https://git.openjdk.org/jdk/pull/27028#issuecomment-3242540283 From epeter at openjdk.org Mon Sep 1 14:29:42 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 1 Sep 2025 14:29:42 GMT Subject: RFR: 8362117: C2: compiler/stringopts/TestStackedConcatsAppendUncommonTrap.java fails with a wrong result due to invalidated liveness assumptions for data phis In-Reply-To: References: Message-ID: On Mon, 1 Sep 2025 14:05:58 GMT, Emanuel Peter wrote: >> This PR addresses a wrong compilation during string optimizations. >> >> During stacked string concatenation of two StringBuilder links SB1 and SB2, the pattern "append -> Phi -> Region -> (True, False) -> If -> Bool -> CmpP -> Proj (Result) -> toString" may be observed, where toString is the end of SB1, and the simple diamond is part of SB2. >> >> After JDK-8291775, the Bool test to the diamond If is set to a constant zero to allow for folding the simple diamond away during IGVN, while not letting the top() value from the result projection of SB1 propagate through the graph too quickly. The assumption was that any data Phi of the Region would go away during PhaseRemoveUseless as they are no longer live -- I think that in the case of JDK-8291775, the user of phi was the constructor of SB2. 
However, in the attached test case, the Phi stays live as it's a parameter (input to an append) of SB2 and will be used during the transformation in `copy_string`. When the diamond region is later folded, the Phi's user picks up the wrong input corresponding to the false branch. >> >> The proposed solution is to disable the stacked concatenation optimization for this specific pattern. This might be pragmatic as it's an edge case and there's already a bug tail: JDK-8271341-> JDK-8291775 -> JDK-8362117. >> >> Testing: T1-3 (aed5952). >> >> Extra testing: ran T1-3 on Linux with an instrumented build and verified that the pattern I am excluding in this PR is not seen during any other compilation than that of the proposed regression test. > > src/hotspot/share/opto/stringopts.cpp line 1072: > >> 1070: >> 1071: // First exclude the following pattern: >> 1072: // append -> Phi -> Region -> (True, False) -> If -> Bool -> CmpP -> Proj (Result) -> toString; > > 2 things: > - I was a bit confused about the `->` directionality. Just to confirm: `toString` happens first, then the if-diamond, then the append, right? If yes: I would have reversed the order here. Then again, I'm not super familiar with string opts, so maybe the convention is different here than elsewhere. > - Are you sure this can only happen with diamonds? What about nested diamonds? Ah I see the condition above already checks that it can only be a diamond. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27028#discussion_r2314106464 From epeter at openjdk.org Mon Sep 1 14:29:43 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 1 Sep 2025 14:29:43 GMT Subject: RFR: 8362117: C2: compiler/stringopts/TestStackedConcatsAppendUncommonTrap.java fails with a wrong result due to invalidated liveness assumptions for data phis In-Reply-To: References: Message-ID: On Mon, 1 Sep 2025 07:04:25 GMT, Daniel Skantz wrote: > This PR addresses a wrong compilation during string optimizations. 
> > During stacked string concatenation of two StringBuilder links SB1 and SB2, the pattern "append -> Phi -> Region -> (True, False) -> If -> Bool -> CmpP -> Proj (Result) -> toString" may be observed, where toString is the end of SB1, and the simple diamond is part of SB2. > > After JDK-8291775, the Bool test to the diamond If is set to a constant zero to allow for folding the simple diamond away during IGVN, while not letting the top() value from the result projection of SB1 propagate through the graph too quickly. The assumption was that any data Phi of the Region would go away during PhaseRemoveUseless as they are no longer live -- I think that in the case of JDK-8291775, the user of phi was the constructor of SB2. However, in the attached test case, the Phi stays live as it's a parameter (input to an append) of SB2 and will be used during the transformation in `copy_string`. When the diamond region is later folded, the Phi's user picks up the wrong input corresponding to the false branch. > > The proposed solution is to disable the stacked concatenation optimization for this specific pattern. This might be pragmatic as it's an edge case and there's already a bug tail: JDK-8271341-> JDK-8291775 -> JDK-8362117. > > Testing: T1-3 (aed5952). > > Extra testing: ran T1-3 on Linux with an instrumented build and verified that the pattern I am excluding in this PR is not seen during any other compilation than that of the proposed regression test. src/hotspot/share/opto/stringopts.cpp line 1078: > 1076: assert(ptr->in(1)->in(0)->in(1)->is_Bool(), "unexpected if shape"); > 1077: Node* v1 = ptr->in(1)->in(0)->in(1)->in(1)->in(1); > 1078: Node* v2 = ptr->in(1)->in(0)->in(1)->in(1)->in(2); You may want to use some intermediate results and give them names. For example: `Node* iff = ptr->in(1)->in(0)` You seem to make an assumption that the input of the bool is a cmp, right? Did you check that? Or is it somehow guaranteed? 
What if in some edge-case of an edge-case it is something else that has only one input? Could that happen? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27028#discussion_r2314106612 From aph at openjdk.org Mon Sep 1 14:38:41 2025 From: aph at openjdk.org (Andrew Haley) Date: Mon, 1 Sep 2025 14:38:41 GMT Subject: RFR: 8365911: AArch64: Fix encoding error in sve_cpy for negative floats In-Reply-To: References: Message-ID: On Wed, 27 Aug 2025 01:34:25 GMT, erifan wrote: > The sve_cpy instruction is not correctly implemented for negative floating-point values. The issues include: > > 1. When a negative floating-point number (e.g. `-1.0`) is passed, the `checked_cast(pack(d))` check fails. For example, assume `d = -1.0`: > - `pack(-1.0)` returns an unsigned int with the 7th bit set, i.e., `0xf0`. > - `checked_cast(0xf0)` casts `0xf0` to an int8_t value, which is `-16`. > - Casting this int8_t `-16` back to unsigned int results in `0xfffffff0`. > - The check compares `0xf0` to `0xfffffff0`, which obviously fails. > > 2. Additionally, the encoding of the negative floating-point number is incorrect: > - The imm8 field can fall outside the valid range of **[-128, 127]**. > - Bit **13** should be encoded as **0** for floating-point numbers. > > This PR fixes these issues and renames floating-point `sve_cpy` as `sve_fcpy`. > > Some test cases are added to aarch64-asmtest.py, and all tests passed. Thanks. I'm not convinced that the refactoring is necessary. Why not write a replacement for `checked_cast(pack(d))` that does the right thing and fix the first `sve_cpy()` so that it does the right thing for float args? 
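The failing round trip described in the quoted PR text can be reproduced in isolation. The sketch below is a minimal illustration, not HotSpot code: `round_trips_through_int8` is a hypothetical stand-in for the narrowing check, and `fits_in_imm8` is only one assumed shape that a replacement check could take. The value `0xf0` is the packed encoding of `-1.0` cited above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the narrowing check: cast to int8_t, widen back
// to unsigned, and compare against the original packed value.
inline bool round_trips_through_int8(uint32_t packed) {
  int8_t narrowed = static_cast<int8_t>(packed);       // 0xf0 becomes -16
  uint32_t widened = static_cast<uint32_t>(narrowed);  // -16 becomes 0xfffffff0
  return widened == packed;
}

// Assumed alternative (not taken from the PR): accept any packed value that
// fits in 8 bits, regardless of the sign bit.
inline bool fits_in_imm8(uint32_t packed) {
  return packed <= 0xffu;
}
```

With these definitions, `round_trips_through_int8(0xf0)` is false because the sign bit is set, whereas the same pattern with the sign bit clear (`0x70`) passes; `fits_in_imm8` accepts both.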
------------- PR Comment: https://git.openjdk.org/jdk/pull/26951#issuecomment-3242601638 From rehn at openjdk.org Mon Sep 1 14:38:49 2025 From: rehn at openjdk.org (Robbin Ehn) Date: Mon, 1 Sep 2025 14:38:49 GMT Subject: RFR: 8365926: RISC-V: Performance regression in renaissance (chi-square) [v2] In-Reply-To: References: Message-ID: On Mon, 1 Sep 2025 10:01:58 GMT, Hamlin Li wrote: >> Robbin Ehn has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: >> >> - Merge branch 'master' into 8365926 >> - Spelling >> - Merge branch 'master' into 8365926 >> - draft jal<->jalr > > src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 110: > >> 108: // We changed instruction stream >> 109: if (mt_safe) { >> 110: OrderAccess::release(); > > If we have release here, do we still need the release in `set_stub_address_destination_at`? From the JBS entry, the point is to do it in a sane order: the release in make_jal_opt is there to make sure the store to the instruction stream happens before the I-cache flush. 
1: store destination to stub 2: release 3: store destination to instruction stream 4: release 5: i-cache flush ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26944#discussion_r2314129918 From rehn at openjdk.org Mon Sep 1 14:47:42 2025 From: rehn at openjdk.org (Robbin Ehn) Date: Mon, 1 Sep 2025 14:47:42 GMT Subject: RFR: 8365926: RISC-V: Performance regression in renaissance (chi-square) [v2] In-Reply-To: References: Message-ID: On Mon, 1 Sep 2025 10:10:31 GMT, Hamlin Li wrote: > ``` > JDK-23 (last version with trampoline calls) > Mean: 3189.5827 > Standard Deviation: 284.6478 > > JDK-25 > Mean: 3424.8905 > Standard Deviation: 222.2208 > > Patch: > Mean: 3144.8535 > Standard Deviation: 229.2577 > ``` > > For the performance data, do you have some data for applying this fix on top of the next commit after`JDK-23 (last version with trampoline calls)`? I think this data might be more helpful to understand the performance comparison between old trampoline, stub and this pr. JDK-23 is the last released version with trampoline calls. JDK-24 is the first released version with load calls. What I can do is run a ~jdk-24 pre-release version, which has both, and backport to it... running that now... > src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 111: > >> 109: if (mt_safe) { >> 110: OrderAccess::release(); >> 111: ICache::invalidate_range(jal_pc, NativeInstruction::instruction_size); > > should `jal_pc` be `instruction_address()`? We have: auipc // instruction_address() # Never changed ld // instruction_address() + NativeInstruction::instruction_size # Never changed jal(r) // instruction_address() + 2 * NativeInstruction::instruction_size (jal_pc) # jal<->jalr We only change the instruction at "instruction_address() + 2 * NativeInstruction::instruction_size". Note that jal_pos and jal_pc mean a "jump and link instruction", not specifically jal or jalr. Make sense? 
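The three-instruction layout above can be sketched as follows. This is only an illustration under stated assumptions: `kInstructionSize`, `jump_slot` and `in_jal_reach` are hypothetical names rather than HotSpot functions, and the reach check uses the RISC-V base ISA range for `jal` (a signed, 2-byte-aligned offset in [-2^20, 2^20), roughly ±1 MiB).

```cpp
#include <cassert>
#include <cstdint>

// Assumption: uncompressed 32-bit instructions, as in the auipc/ld/jal(r) sequence.
constexpr uint64_t kInstructionSize = 4;

// auipc sits at call_start, ld at +4, and the jal/jalr slot at +8.
// Only that last slot is ever patched.
constexpr uint64_t jump_slot(uint64_t call_start) {
  return call_start + 2 * kInstructionSize;
}

// True if a jal placed at jal_pc could jump directly to dest.
constexpr bool in_jal_reach(uint64_t jal_pc, uint64_t dest) {
  const int64_t offset = static_cast<int64_t>(dest - jal_pc);
  return offset >= -(int64_t{1} << 20) &&
         offset <   (int64_t{1} << 20) &&
         (offset & 1) == 0;  // jal offsets are multiples of 2 bytes
}
```

When the destination falls outside that range, the quoted patch restores `jalr ra, 0(t1)`, which presumably jumps through the address loaded into `t1` by the preceding `ld`.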
------------- PR Comment: https://git.openjdk.org/jdk/pull/26944#issuecomment-3242627359 PR Review Comment: https://git.openjdk.org/jdk/pull/26944#discussion_r2314145607 From dskantz at openjdk.org Mon Sep 1 15:31:47 2025 From: dskantz at openjdk.org (Daniel Skantz) Date: Mon, 1 Sep 2025 15:31:47 GMT Subject: RFR: 8362117: C2: compiler/stringopts/TestStackedConcatsAppendUncommonTrap.java fails with a wrong result due to invalidated liveness assumptions for data phis In-Reply-To: References: Message-ID: On Mon, 1 Sep 2025 14:24:51 GMT, Emanuel Peter wrote: >> This PR addresses a wrong compilation during string optimizations. >> >> During stacked string concatenation of two StringBuilder links SB1 and SB2, the pattern "append -> Phi -> Region -> (True, False) -> If -> Bool -> CmpP -> Proj (Result) -> toString" may be observed, where toString is the end of SB1, and the simple diamond is part of SB2. >> >> After JDK-8291775, the Bool test to the diamond If is set to a constant zero to allow for folding the simple diamond away during IGVN, while not letting the top() value from the result projection of SB1 propagate through the graph too quickly. The assumption was that any data Phi of the Region would go away during PhaseRemoveUseless as they are no longer live -- I think that in the case of JDK-8291775, the user of phi was the constructor of SB2. However, in the attached test case, the Phi stays live as it's a parameter (input to an append) of SB2 and will be used during the transformation in `copy_string`. When the diamond region is later folded, the Phi's user picks up the wrong input corresponding to the false branch. >> >> The proposed solution is to disable the stacked concatenation optimization for this specific pattern. This might be pragmatic as it's an edge case and there's already a bug tail: JDK-8271341-> JDK-8291775 -> JDK-8362117. >> >> Testing: T1-3 (aed5952). 
>> >> Extra testing: ran T1-3 on Linux with an instrumented build and verified that the pattern I am excluding in this PR is not seen during any other compilation than that of the proposed regression test. > > src/hotspot/share/opto/stringopts.cpp line 1078: > >> 1076: assert(ptr->in(1)->in(0)->in(1)->is_Bool(), "unexpected if shape"); >> 1077: Node* v1 = ptr->in(1)->in(0)->in(1)->in(1)->in(1); >> 1078: Node* v2 = ptr->in(1)->in(0)->in(1)->in(1)->in(2); > > You may want to use some intermediate results and give them names. > For example: > `Node* iff = ptr->in(1)->in(0)` > You seem to make an assumption that the input of the bool is a cmp, right? Did you check that? Or is it somehow guaranteed? What if in some edge-case of an edge-case it is something else that has only one input? Could that happen? I'm not sure if there is a guarantee, but it appears to be a pre-existing assumption that is asserted later in `eliminate_unneeded_control`: https://github.com/openjdk/jdk/blob/b06459d3a83c13c0fbc7a0a7698435f17265982e/src/hotspot/share/opto/stringopts.cpp#L268 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27028#discussion_r2314233957 From dskantz at openjdk.org Mon Sep 1 15:25:43 2025 From: dskantz at openjdk.org (Daniel Skantz) Date: Mon, 1 Sep 2025 15:25:43 GMT Subject: RFR: 8362117: C2: compiler/stringopts/TestStackedConcatsAppendUncommonTrap.java fails with a wrong result due to invalidated liveness assumptions for data phis In-Reply-To: References: Message-ID: <76Nc115yY4tjDPTNDrfY6LrtPvFevss4IVo6D-0abOg=.4bffc940-f1ac-40b0-a892-7a1d5bbd39ca@github.com> On Mon, 1 Sep 2025 14:10:56 GMT, Emanuel Peter wrote: >> This PR addresses a wrong compilation during string optimizations. 
>> >> During stacked string concatenation of two StringBuilder links SB1 and SB2, the pattern "append -> Phi -> Region -> (True, False) -> If -> Bool -> CmpP -> Proj (Result) -> toString" may be observed, where toString is the end of SB1, and the simple diamond is part of SB2. >> >> After JDK-8291775, the Bool test to the diamond If is set to a constant zero to allow for folding the simple diamond away during IGVN, while not letting the top() value from the result projection of SB1 propagate through the graph too quickly. The assumption was that any data Phi of the Region would go away during PhaseRemoveUseless as they are no longer live -- I think that in the case of JDK-8291775, the user of phi was the constructor of SB2. However, in the attached test case, the Phi stays live as it's a parameter (input to an append) of SB2 and will be used during the transformation in `copy_string`. When the diamond region is later folded, the Phi's user picks up the wrong input corresponding to the false branch. >> >> The proposed solution is to disable the stacked concatenation optimization for this specific pattern. This might be pragmatic as it's an edge case and there's already a bug tail: JDK-8271341-> JDK-8291775 -> JDK-8362117. >> >> Testing: T1-3 (aed5952). >> >> Extra testing: ran T1-3 on Linux with an instrumented build and verified that the pattern I am excluding in this PR is not seen during any other compilation than that of the proposed regression test. > > test/hotspot/jtreg/compiler/stringopts/TestStackedConcatsPhiUseOfDiamondRegion.java line 57: > >> 55: return s; >> 56: } >> 57: } > > I wonder if we could write some kind of `StringBuilder` fuzzer. Not saying it has to happen as part of this fix. But it seems we have issues with very similar patterns. And they seem quite basic: chains, diamonds, etc. > > Would probably not be too hard to use the template framework to generate some random shapes, and verify the result the compiled code gives vs the interpreter. 
I think this is a good idea for sure.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/27028#discussion_r2314224094

From dlunden at openjdk.org Mon Sep 1 15:58:01 2025
From: dlunden at openjdk.org (Daniel Lundén)
Date: Mon, 1 Sep 2025 15:58:01 GMT
Subject: RFR: 8325467: Support methods with many arguments in C2 [v23]
In-Reply-To:
References:
Message-ID:

On Wed, 27 Aug 2025 09:08:09 GMT, Emanuel Peter wrote:

>> Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Add clarifying comments at definitions of register mask sizes
>
>> For reference, here is now the changeset adding an IFG bailout: #26118
>
> Since that is now integrated: do we need to make any changes to the patch here? I thought the goal was to use the bailouts instead of increasing `MaxNodeLimit`.
>
> Because looking at the discussions above: we were worried that there could be compile-time regressions - even if quite rare. But they were in the range of 40s which is quite scary. Are these now gone?

Thanks @eme64!

> Do you think it would make sense to have more tests? I'm imagining something like this:
>
> * Generate tests with 0-255 arguments. You could use the template framework.
> * Take different types (e.g. various primitive types, also those that take 2 stack slots like long and double). You could use the template library `PrimitiveType` if you want.
> * Test that we actually get the method compiled. Maybe an IR rule could be used here?
> * And do some rudimentary result verification
> * Make sure it does not just work with `Xcomp` but also under "normal" circumstances (tiered, profiling, etc).

Sure, I can expand upon the testing. It's also a good opportunity to have a look at the template framework. Note that for `TestMaxMethodArguments.java`, I do already check that it compiles via `-XX:+AbortVMOnCompilationFailure`.
> I'll look a bit at your VM changes now ;)

Thanks, I'll have a look and respond in the individual threads.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20404#issuecomment-3242808695

From dlunden at openjdk.org Mon Sep 1 16:03:57 2025
From: dlunden at openjdk.org (Daniel Lundén)
Date: Mon, 1 Sep 2025 16:03:57 GMT
Subject: RFR: 8325467: Support methods with many arguments in C2 [v24]
In-Reply-To:
References:
Message-ID: <8L3IGg5YYgi2EjlC-v5U3FkkWvK1swESQFAMwX02I84=.d597910f-0aca-4eb2-b68c-fbe565e73291@github.com>

On Mon, 1 Sep 2025 08:20:42 GMT, Emanuel Peter wrote:

>> src/hotspot/share/opto/regmask.hpp line 63:
>>
>>> 61: // RM_SIZE is the base size of a register mask in 32-bit words.
>>> 62: // RM_SIZE_MIN is the theoretical minimum size of a register mask in 32-bit
>>> 63: // words.
>>
>> It seems this is a bad pattern that was already here before you. But it really makes me a little scared here.
>>
>> Having two variable names differ in just an underscore `_` but with different semantics is a bit confusing to me. It is hard for the reader to keep track of what is what going forward. It would be really easy for someone to confuse the two in the future and have bugs creep in that way (just because of an underscore). It may be more useful to use the units in at least one of the two names.
>>
>> I would love to see names like `RM_SIZE` and `RM_SIZE_IN_LONGS`, rather than `RM_SIZE` and `_RM_SIZE`.
>> Even better would be `RM_SIZE_IN_INTS` and `RM_SIZE_IN_LONGS`. That way, you could save a lot of comments. Maybe you could come up with even better names. "slots" and "words"?
>> You could consider doing a renaming PR first before the patch here. Maybe you can even automate the renaming with a command/script, and then apply the same renaming to the changes here?
>
> Oh gosh, I just realized: machine word of course depends on 32bit vs 64bit architecture. Yikes.
> So maybe the names need to be stack-slots vs words?
And there should probably be a quick reminder somewhere that words can be different sizes. Sure, we can rename them. I think `RM_SIZE_IN_INTS` and `RM_SIZE_IN_WORDS` would be most suitable. I avoided such a change in this changeset to not make it bigger than it already is. Isn't it easier to do the renaming in a follow-up RFE though, instead of before this PR? I'm fine with both though, not that much extra work to do it before. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2314286169 From rehn at openjdk.org Mon Sep 1 16:11:34 2025 From: rehn at openjdk.org (Robbin Ehn) Date: Mon, 1 Sep 2025 16:11:34 GMT Subject: RFR: 8365926: RISC-V: Performance regression in renaissance (chi-square) [v3] In-Reply-To: References: Message-ID: > Hey, please consider! > > A bunch of info in JBS entry, please read that also. > > I narrowed this issue down to the old jal optimization, making direct calls when in reach. > This patch restores them and removes this regression. > > In essence we turn "jalr ra,0(t1)" into a "jal ra," if reachable, and restore the jalr if a new destination is not reachable. > > Please test on your hardware! > > > Chi Square (100 runs each, 10 fastest iterations of each run, P550) > JDK-23 (last version with trampoline calls) > Mean: 3189.5827 > Standard Deviation: 284.6478 > > JDK-25 > Mean: 3424.8905 > Standard Deviation: 222.2208 > > Patch: > Mean: 3144.8535 > Standard Deviation: 229.2577 > > > No issues found in t1, running t2 also. Stress tested on vf2, bpi-f3, p550. 
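
As an editorial aside, the "if reachable" condition in the description above can be illustrated with a small check. This is a sketch under stated assumptions, not the code from the patch: it relies only on the RISC-V `jal` encoding, which carries a signed, 2-byte-aligned offset covering roughly ±1 MiB, and the function names are hypothetical.

```cpp
#include <cstdint>

// jal encodes a signed 21-bit byte offset whose lowest bit is implicitly zero,
// so the reachable range is [-1 MiB, +1 MiB - 2] in steps of 2 bytes.
constexpr std::int64_t kJalMinOffset = -(std::int64_t{1} << 20);
constexpr std::int64_t kJalMaxOffset = (std::int64_t{1} << 20) - 2;

// Hypothetical helper: does this byte offset fit into a jal immediate?
constexpr bool offset_fits_jal(std::int64_t offset) {
  return offset >= kJalMinOffset && offset <= kJalMaxOffset && (offset % 2) == 0;
}

// Hypothetical helper: can a jal located at jump_pc reach destination directly?
// If not, the indirect jalr form has to stay in place.
constexpr bool jal_reachable(std::int64_t jump_pc, std::int64_t destination) {
  return offset_fits_jal(destination - jump_pc);
}

static_assert(jal_reachable(0x200000, 0x200000 + 4096), "nearby target is reachable");
static_assert(!jal_reachable(0x200000, 0x200000 + (1 << 20)), "+1 MiB is just out of range");
```

A patching routine along the lines described would use such a predicate to choose between writing a direct `jal` and restoring the `jalr`; how the actual patch performs the check is not shown in this thread.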
Robbin Ehn has updated the pull request incrementally with one additional commit since the last revision: Review comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26944/files - new: https://git.openjdk.org/jdk/pull/26944/files/b81779cb..f0f7f20e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26944&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26944&range=01-02 Stats: 9 lines in 1 file changed: 3 ins; 1 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/26944.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26944/head:pull/26944 PR: https://git.openjdk.org/jdk/pull/26944 From dlunden at openjdk.org Mon Sep 1 16:11:44 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Mon, 1 Sep 2025 16:11:44 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v25] In-Reply-To: References: Message-ID: > If a method has a large number of parameters, we currently bail out from C2 compilation. > > ### Changeset > > Allowing C2 compilation of methods with a large number of parameters requires fundamental changes to the register mask data structure, used in many places in C2. In particular, register masks currently have a statically determined size and cannot represent arbitrary numbers of stack slots. This is needed if we want to compile methods with arbitrary numbers of parameters. Register mask operations are present in performance-sensitive parts of C2, which further complicates changes. > > Changes: > - Add functionality to dynamically grow/extend register masks. I experimented with a number of design choices to achieve this. To keep the common case (normal number of method parameters) quick and also to avoid more intrusive changes to the current `RegMask` interface, I decided to leave the "base" statically allocated memory for masks unchanged and only use dynamically allocated memory in the rare cases where it is needed. 
> - Generalize the "chunk"-logic from `PhaseChaitin::Select()` to allow arbitrary-sized chunks, and also move most of the logic into register mask methods to separate concerns and to make the `PhaseChaitin::Select()` code more readable. > - Remove all `can_represent` checks and bailouts. > - Performance tuning. A particularly important change is the early-exit optimization in `RegMask::overlap`, used in the performance-sensitive method `PhaseChaitin::interfere_with_live`. > - Add a new test case `TestManyMethodArguments.java` and extend an old test `TestNestedSynchronize.java`. > > ### Testing > > - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/10178060450) > - `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. > - Standard performance benchmarking. No observed conclusive overall performance degradation/improvement. > - Specific benchmarking of C2 compilation time. The changes increase C2 compilation time by, approximately and on average, 1% for methods that could also be compiled before this changeset (see the figure below). The reason for the degradation is further checks required in performance-sensitive code (in particular `PhaseChaitin::remove_bound_register_from_interfering_live_ranges`). I have tried optimizing in various ways, but changes I found that lead to improvement also lead to less readable code (and are, in my opinion, not worth it). > > ![c2-regression](https:/... 
Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision:

  Update src/hotspot/share/opto/regmask.hpp

  Co-authored-by: Emanuel Peter

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/20404/files
  - new: https://git.openjdk.org/jdk/pull/20404/files/80c6cf47..c4a706b5

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=20404&range=24
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20404&range=23-24

  Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
  Patch: https://git.openjdk.org/jdk/pull/20404.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/20404/head:pull/20404

PR: https://git.openjdk.org/jdk/pull/20404

From dlunden at openjdk.org Mon Sep 1 16:11:49 2025
From: dlunden at openjdk.org (Daniel Lundén)
Date: Mon, 1 Sep 2025 16:11:49 GMT
Subject: RFR: 8325467: Support methods with many arguments in C2 [v24]
In-Reply-To:
References:
Message-ID:

On Mon, 1 Sep 2025 08:05:04 GMT, Emanuel Peter wrote:

>> Daniel Lundén has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 35 commits:
>>
>>  - Restore modified java/lang/invoke tests
>>  - Sort includes (new requirement)
>>  - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates
>>  - Add clarifying comments at definitions of register mask sizes
>>  - Fix implicit zero and nullptr checks
>>  - Add deep copy comment
>>  - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates
>>  - Fix typo
>>  - Updates after Emanuel's comments
>>  - Refactor and improve TestNestedSynchronize.java
>>  - ... and 25 more: https://git.openjdk.org/jdk/compare/b39c7369...80c6cf47
>
> src/hotspot/share/opto/regmask.hpp line 44:
>
>> 42: // statements in Java.
>> 43: const int BoxLockNode_SLOT_LIMIT = 200;
>> 44:
>
> Even before this constant, it would be nice to have an introductory comment that lays out what the regmask is for, and what its basic design is.

Yes, you are right. I'll add it!

> src/hotspot/share/opto/regmask.hpp line 122:
>
>> 120:
>> 121:   // Viewed as an array of machine words
>> 122:   uintptr_t _RM_UP[_RM_SIZE];
>
> Do you know what `UP` stands for? Could we rename it maybe?
> Would be nice if we could have the same "units" for these arrays as for the sizes above.

I would guess it stands for **u**int**p**tr, and the `I` in `_RM_I` is for **i**nteger. Maybe `_RM_INT` and `_RM_WORD`?

> src/hotspot/share/opto/regmask.hpp line 128:
>
>> 126:   // extend the register mask with dynamically allocated memory. We keep the
>> 127:   // base statically allocated _RM_UP, and arena allocate the extended mask
>> 128:   // (RM_UP_EXT) separately. Another, perhaps more elegant, option would be to
>
> Suggestion:
>
>   // (_RM_UP_EXT) separately. Another, perhaps more elegant, option would be to
>
> Underscore for consistency? Or does it reference something else?

Yes, thanks (typo).

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2314295280
PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2314290338
PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2314292191

From dlunden at openjdk.org Mon Sep 1 16:18:01 2025
From: dlunden at openjdk.org (Daniel Lundén)
Date: Mon, 1 Sep 2025 16:18:01 GMT
Subject: RFR: 8325467: Support methods with many arguments in C2 [v24]
In-Reply-To:
References:
Message-ID:

On Mon, 1 Sep 2025 08:08:27 GMT, Emanuel Peter wrote:

>> Daniel Lundén has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains 35 commits:
>>
>>  - Restore modified java/lang/invoke tests
>>  - Sort includes (new requirement)
>>  - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates
>>  - Add clarifying comments at definitions of register mask sizes
>>  - Fix implicit zero and nullptr checks
>>  - Add deep copy comment
>>  - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates
>>  - Fix typo
>>  - Updates after Emanuel's comments
>>  - Refactor and improve TestNestedSynchronize.java
>>  - ... and 25 more: https://git.openjdk.org/jdk/compare/b39c7369...80c6cf47
>
> src/hotspot/share/opto/regmask.hpp line 161:
>
>> 159:   // cases, we can allow read-only sharing.
>> 160:   bool _read_only = false;
>> 161: #endif
>
> Can you explain why this happens? Is this something we could clean up? It smells a bit like tech-debt. But maybe it is a really necessary performance optimization. Would be nice if there was an explanation which one it is ;)

The main issue is that register masks are stored as part of certain nodes, and nodes get copied by `Node::clone`. If someone in the future decides to add a register mask to some type of node, and forgets to add a special case (like what I've now added for `MachProj`) in `Node::clone` for the node type, this safeguard will catch it and complain. Register masks are used in peculiar ways throughout C2, and there may be other unexpected cases as well that this safeguard catches. I doubt the `_read_only` part has a measurable performance effect; I only added it because it was easy and couldn't hurt.
-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2314305615

From dlunden at openjdk.org Mon Sep 1 16:30:00 2025
From: dlunden at openjdk.org (Daniel Lundén)
Date: Mon, 1 Sep 2025 16:30:00 GMT
Subject: RFR: 8325467: Support methods with many arguments in C2 [v24]
In-Reply-To:
References:
Message-ID:

On Mon, 1 Sep 2025 08:15:53 GMT, Emanuel Peter wrote:

>> Daniel Lundén has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 35 commits:
>>
>>  - Restore modified java/lang/invoke tests
>>  - Sort includes (new requirement)
>>  - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates
>>  - Add clarifying comments at definitions of register mask sizes
>>  - Fix implicit zero and nullptr checks
>>  - Add deep copy comment
>>  - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates
>>  - Fix typo
>>  - Updates after Emanuel's comments
>>  - Refactor and improve TestNestedSynchronize.java
>>  - ... and 25 more: https://git.openjdk.org/jdk/compare/b39c7369...80c6cf47
>
> src/hotspot/share/opto/regmask.hpp line 96:
>
>> 94:   (((RM_SIZE_MIN << 5) +               // Slots for machine registers
>> 95:     (max_method_parameter_length * 2) + // Slots for incoming arguments
>> 96:     (max_method_parameter_length * 2) + // Slots for outgoing arguments
>
> What's the meaning of incoming vs outgoing arguments? Like this?
>
> Incoming = from caller (outer nesting)
> Outgoing = to nested call (inner nesting)

Yes, you are correct. There is a detailed explanation in `x86_64.ad` ("Definition of frame structure and management information").

> src/hotspot/share/opto/regmask.hpp line 175:
>
>> 173:   // mask can currently represent to be included. If _all_stack = false, we
>> 174:   // consider the registers not included.
>> 175:   bool _all_stack = false;
>
> I'd prefer to have some kind of `_is_...` name here.
Because when I read `all_stack` and see it is a bool, I wonder what it means - it does not tell me quickly. Does it mean that all registers are on the stack?
>
> Is everything that is beyond the register mask purely on the stack? Is everything from the stack always beyond the register mask? I'm confused :face_with_peeking_eye:

Right, we should probably update this terminology as well. It comes from the fact that register masks can always represent all registers (+ a few stack slots), and anything beyond the mask is necessarily additional stack slots. So, if `_all_stack` is set, it means the register mask includes all of the stack slots. Any suggestion for a better name?

> src/hotspot/share/opto/regmask.hpp line 179:
>
>> 177:   // The low and high watermarks represent the lowest and highest word that
>> 178:   // might contain set register mask bits, respectively. We guarantee that
>> 179:   // there are no bits in words outside this range, but any word at and between
>
> In the example below, you have 1 bits above the `_hwm`. Is that intentional? Are those bits to be ignored? Can you please add some extra info to the example about that?

Right, `_lwm` and `_hwm` do not apply to `_all_stack` bits. I'll clarify!

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2314315615
PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2314312930
PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2314317882

From dlunden at openjdk.org Mon Sep 1 16:35:00 2025
From: dlunden at openjdk.org (Daniel Lundén)
Date: Mon, 1 Sep 2025 16:35:00 GMT
Subject: RFR: 8325467: Support methods with many arguments in C2 [v24]
In-Reply-To:
References:
Message-ID: <_a6JVBA326t8l1U3ZI8C-J3Ju5jm-RklBFGtnR7fbyY=.70638135-7577-44dc-a212-fe5e39b1f5fa@github.com>

On Mon, 1 Sep 2025 08:30:19 GMT, Emanuel Peter wrote:

>> Daniel Lundén has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains 35 commits:
>>
>>  - Restore modified java/lang/invoke tests
>>  - Sort includes (new requirement)
>>  - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates
>>  - Add clarifying comments at definitions of register mask sizes
>>  - Fix implicit zero and nullptr checks
>>  - Add deep copy comment
>>  - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates
>>  - Fix typo
>>  - Updates after Emanuel's comments
>>  - Refactor and improve TestNestedSynchronize.java
>>  - ... and 25 more: https://git.openjdk.org/jdk/compare/b39c7369...80c6cf47
>
> src/hotspot/share/opto/regmask.hpp line 170:
>
>> 168:   // variable indicates how many words we offset with. We consider all
>> 169:   // registers before the offset to not be included in the register mask.
>> 170:   unsigned int _offset;
>
> Does that mean we make different slices of the mask?

I don't quite understand the question, can you please elaborate? The `_offset` means we shift the register mask to the right, so that the first bit of the first `_RM_UP` element no longer represents `OptoReg` 0 (but rather `OptoReg` `_offset * BitsPerWord`).

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2314325238

From dlunden at openjdk.org Mon Sep 1 16:40:01 2025
From: dlunden at openjdk.org (Daniel Lundén)
Date: Mon, 1 Sep 2025 16:40:01 GMT
Subject: RFR: 8325467: Support methods with many arguments in C2 [v24]
In-Reply-To:
References:
Message-ID: <4qDpmwc0x5HCjDTzH_JUV3YtxNAFUremZZhu6G1usgM=.bc9b101a-d1f9-4ba5-bcc1-0b1afdb9d2a0@github.com>

On Mon, 1 Sep 2025 08:30:47 GMT, Emanuel Peter wrote:

>> Daniel Lundén has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains 35 commits:
>>
>>  - Restore modified java/lang/invoke tests
>>  - Sort includes (new requirement)
>>  - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates
>>  - Add clarifying comments at definitions of register mask sizes
>>  - Fix implicit zero and nullptr checks
>>  - Add deep copy comment
>>  - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates
>>  - Fix typo
>>  - Updates after Emanuel's comments
>>  - Refactor and improve TestNestedSynchronize.java
>>  - ... and 25 more: https://git.openjdk.org/jdk/compare/b39c7369...80c6cf47
>
> src/hotspot/share/opto/regmask.hpp line 217:
>
>> 215:   // necessarily representing stack locations) to 1. Here is how the above
>> 216:   // register mask looks like after clearing, setting _all_stack to true, and
>> 217:   // successfully rolling over:
>
> I'm still struggling to follow here. Maybe `_offset` is not clear to me yet. What is the value here for it? How is it changed with the `rollover`?

This `_offset` stuff is really only for a very specific use case in `PhaseChaitin::Select`, so I understand it can be hard to follow. The value for `_offset` in the example after rollover is 5 = `_rm_size`, since we have rolled over once. When we roll over the next time, the `_offset` is 10, and so on.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2314330646

From djelinski1 at gmail.com Mon Sep 1 18:50:50 2025
From: djelinski1 at gmail.com (Daniel Jeliński)
Date: Mon, 1 Sep 2025 20:50:50 +0200
Subject: Delay slot handling
Message-ID:

Hi all,

Does anyone still use the delay slot handling code? Can we remove it?

The code was used by the SPARC port, which was removed in JDK 15. Looking at the list of architectures that use delay slots [1], the removal of delay slot support could possibly affect the MIPS port.
The arm (32-bit) AD file mentions delay slots in a few places, but as far as I can tell, that's a copy-paste error that can be easily corrected.

The cleanup would involve at least:
- removing the LIR_OpDelay class (C1)
- removing support for ADL "branch_has_delay_slot", "one_instruction_with_delay_slot", "single_instruction_with_delay_slot", and "has_delay_slot"

Thoughts?

[1] https://en.wikipedia.org/wiki/Delay_slot#Implementations

From fyang at openjdk.org Tue Sep 2 02:08:49 2025
From: fyang at openjdk.org (Fei Yang)
Date: Tue, 2 Sep 2025 02:08:49 GMT
Subject: RFR: 8365926: RISC-V: Performance regression in renaissance (chi-square) [v2]
In-Reply-To:
References:
Message-ID:

On Mon, 1 Sep 2025 14:42:47 GMT, Robbin Ehn wrote:

>> src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 111:
>>
>>> 109:   if (mt_safe) {
>>> 110:     OrderAccess::release();
>>> 111:     ICache::invalidate_range(jal_pc, NativeInstruction::instruction_size);
>>
>> should `jal_pc` be `instruction_address()`?
>
> We have:
>
> auipc  // instruction_address()                                                    # Never changed
> ld     // instruction_address() + NativeInstruction::instruction_size              # Never changed
> jal(r) // instruction_address() + 2 * NativeInstruction::instruction_size (jal_pc) # jal<->jalr
>
> We only change the instruction at "instruction_address() + 2 * NativeInstruction::instruction_size".
>
> Note that jal_pos and jal_pc mean a "jump and link instruction", not specifically jal or jalr.
>
> Make sense?

Maybe we can give it a new name to avoid possible confusion? `jmp_pc` or simply `pc`?
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26944#discussion_r2314762084 From jbhateja at openjdk.org Tue Sep 2 02:55:45 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 2 Sep 2025 02:55:45 GMT Subject: RFR: 8354348: Enable Extended EVEX to REX2/REX demotion for commutative operations with same dst and src2 In-Reply-To: References: Message-ID: On Thu, 28 Aug 2025 21:09:03 GMT, Srinivas Vamsi Parasa wrote: > This change extends Extended EVEX (EEVEX) to REX2/REX demotion for Intel APX NDD instructions to handle commutative operations when the destination register and the second source register (src2) are the same. > > Currently, EEVEX to REX2/REX demotion is only enabled when the first source (src1) and the destination are the same. This enhancement allows additional cases of valid demotion for commutative instructions (add, imul, and, or, xor). > > For example: > `eaddl r18, r25, r18` can be encoded as `addl r18, r25` using APX REX2 encoding > `eaddl r2, r7, r2` can be encoded as `addl r2, r7` using non-APX legacy encoding src/hotspot/cpu/x86/assembler_x86.cpp line 12932: > 12930: if (is_commutative && is_demotable(no_flags, dst->encoding(), src2->encoding())) { > 12931: if (size == EVEX_64bit) { > 12932: emit_prefix_and_int8(get_prefixq(src1, dst, is_map1), opcode_byte + 2); It will be good to write a comment on top of opcode_byte adjustment on account of opcode mismatch b/w NDD and equivalent demotable variant. EVEX.LLZ.NP.MAP4.SCALABLE 21 /r AND {NF} {ND=1} rv, rv/mv, rv `REX.W + 23 /r AND r64, r/m64 | RM | Valid | N.E. | r64 AND r/m64 ` src/hotspot/cpu/x86/assembler_x86.cpp line 13055: > 13053: bool is_prefixq = (size == EVEX_64bit) ? 
true : false;
> 13054:   bool normal_demotion = is_demotable(no_flags, dst_enc, nds_enc);
> 13055:   bool commutative_demotion = is_commutative && is_demotable(no_flags, dst_enc, src_enc);

Nomenclature change: instead of normal_demotion and commutative_demotion, it will be more appropriate to use first/second_operand_demotable.

src/hotspot/cpu/x86/x86_64.ad line 7121:

> 7119: %{
> 7120:   predicate(UseAPX);
> 7121:   match(Set dst (AddI (LoadI src1) src2));

Will this not be covered by the pattern at line 7103, since ADLC automatically generates a DFA to handle both cases?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/26997#discussion_r2314775483
PR Review Comment: https://git.openjdk.org/jdk/pull/26997#discussion_r2313941101
PR Review Comment: https://git.openjdk.org/jdk/pull/26997#discussion_r2314792264

From duke at openjdk.org Tue Sep 2 03:04:46 2025
From: duke at openjdk.org (erifan)
Date: Tue, 2 Sep 2025 03:04:46 GMT
Subject: RFR: 8365911: AArch64: Fix encoding error in sve_cpy for negative floats
In-Reply-To:
References:
Message-ID: <2R6O7Jhv3catwxc6rXJdh7Uiq-NFBp7beCmP49CLTqU=.7ba72e39-6efd-47fe-8ad9-6df54a45c99b@github.com>

On Mon, 1 Sep 2025 14:35:40 GMT, Andrew Haley wrote:

>> The sve_cpy instruction is not correctly implemented for negative floating-point values. The issues include:
>>
>> 1. When a negative floating-point number (e.g. `-1.0`) is passed, the `checked_cast(pack(d))` check fails. For example, assume `d = -1.0`:
>>    - `pack(-1.0)` returns an unsigned int with the 7th bit set, i.e., `0xf0`.
>>    - `checked_cast(0xf0)` casts `0xf0` to an int8_t value, which is `-16`.
>>    - Casting this int8_t `-16` back to unsigned int results in `0xfffffff0`.
>>    - The check compares `0xf0` to `0xfffffff0`, which obviously fails.
>>
>> 2. Additionally, the encoding of the negative floating-point number is incorrect:
>>    - The imm8 field can fall outside the valid range of **[-128, 127]**.
>> - Bit **13** should be encoded as **0** for floating-point numbers. >> >> This PR fixes these issues and renames floating-point `sve_cpy` as `sve_fcpy`. >> >> Some test cases are added to aarch64-asmtest.py, and all tests passed. > > Thanks. > > I'm not convinced that the refactoring is necessary. Why not write a replacement for `checked_cast(pack(d))` that does the right thing and fix the first `sve_cpy()` so that it does the right thing for float args? Thanks @theRealAph. I've indeed considered and implemented your idea. The code diff:
diff --git a/src/hotspot/cpu/aarch64/assembler_aarch64.hpp b/src/hotspot/cpu/aarch64/assembler_aarch64.hpp
index 11d302e9026..841d24f516b 100644
--- a/src/hotspot/cpu/aarch64/assembler_aarch64.hpp
+++ b/src/hotspot/cpu/aarch64/assembler_aarch64.hpp
@@ -3813,8 +3813,9 @@ template
                bool isMerge, bool isFloat) {
     starti;
     assert(T != Q, "invalid size");
+    assert((!isFloat) || (isFloat && T != B), "invalid size");
     int sh = 0;
-    if (imm8 <= 127 && imm8 >= -128) {
+    if ((imm8 <= 127 && imm8 >= -128) || (isFloat && (imm8 >> 8) == 0)) {
       sh = 0;
     } else if (T != B && imm8 <= 32512 && imm8 >= -32768 && (imm8 & 0xff) == 0) {
       sh = 1;
@@ -3824,7 +3825,7 @@ template
     }
     int m = isMerge ? 1 : 0;
     f(0b00000101, 31, 24), f(T, 23, 22), f(0b01, 21, 20);
-    prf(Pg, 16), f(isFloat ? 1 : 0, 15), f(m, 14), f(sh, 13), sf(imm8, 12, 5), rf(Zd, 0);
+    prf(Pg, 16), f(isFloat ? 1 : 0, 15), f(m, 14), f(sh, 13), f(imm8&0xff, 12, 5), rf(Zd, 0);
   }
 public:
@@ -3834,7 +3835,7 @@ template
   }
   // SVE copy floating-point immediate to vector elements (predicated)
   void sve_cpy(FloatRegister Zd, SIMD_RegVariant T, PRegister Pg, double d) {
-    sve_cpy(Zd, T, Pg, checked_cast(pack(d)), /*isMerge*/true, /*isFloat*/true);
+    sve_cpy(Zd, T, Pg, checked_cast(pack(d)), /*isMerge*/true, /*isFloat*/true);
   }
   // SVE conditionally select elements from two vectors
However, some of my colleagues have differing opinions: 1.
sve `cpy` and `fcpy` are actually two different instructions, and distinguishing them might be clearer. 2. sve `cpy` 's imm8 is an **int** , while `fcpy` 's imm8 is an **fp8** . While some encoding code can be reused, separating the encodings makes the code clearer. I think both implementations are fine. If you think it's better to not refactor, I'll revert. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26951#issuecomment-3243633607 From hgreule at openjdk.org Tue Sep 2 06:04:43 2025 From: hgreule at openjdk.org (Hannes Greule) Date: Tue, 2 Sep 2025 06:04:43 GMT Subject: RFR: 8356813: Improve Mod(I|L)Node::Value [v2] In-Reply-To: References: <2Jf_gfvRlKcmCFoQHp5T0WW_fU_yK5-0Z3z41f00-YU=.164be9f0-fae1-44bb-84c3-846d8c2c0db2@github.com> <3BJWLK3FukQCp2FHGcyBDTZtbc5aS8VreNKYKAaQrdU=.43a7e821-8d56-4161-850a-9137d17d44de@github.com> Message-ID: On Mon, 25 Aug 2025 13:20:32 GMT, Emanuel Peter wrote: >>> @eme64 I merged master and hopefully addressed your latest comments. Now that we have #17508 integrated, I could also directly update the unsigned variant, but I'm also fine with doing that separately. WDYT? >>> >>> I also checked the constant folding part again (or generally whenever the RHS is a constant), these code paths are indeed not used by PhaseGVN directly (but by PhaseCCP and PhaseIdealLoop). That makes it a bit difficult to test that part properly. >> >> Let's keep the patch as it is. With #17508 we will have to also probably refactor and add more tests, if we want to do any unsigned and known-bit optimizations. >> >> ---------------- >> >> @SirYwell Thanks for the updates, I had a few more comments, but we are getting there :) > >> @eme64 I addressed your latest comments now, please re-review :) >> >> Regarding my previous observation >> >> > * If the divisor is a constant, we will directly replace the `Mod(I|L)Node` with more but less expensive nodes in `::Ideal()`. 
Type analysis for these nodes combined is less precise, which means we miss potential cases where this would help, e.g., removing range checks. Would it make sense to delay the replacement? >> >> should I open a new RFE for that? Or generally, what's your opinion on this? > > Can you show some examples? Filing an RFE would surely not be wrong. @eme64 gentle ping in case you missed my latest changes :) Please let me know if there is more to do. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25254#issuecomment-3243887966 From aph at openjdk.org Tue Sep 2 08:12:42 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 2 Sep 2025 08:12:42 GMT Subject: RFR: 8365911: AArch64: Fix encoding error in sve_cpy for negative floats In-Reply-To: <2R6O7Jhv3catwxc6rXJdh7Uiq-NFBp7beCmP49CLTqU=.7ba72e39-6efd-47fe-8ad9-6df54a45c99b@github.com> References: <2R6O7Jhv3catwxc6rXJdh7Uiq-NFBp7beCmP49CLTqU=.7ba72e39-6efd-47fe-8ad9-6df54a45c99b@github.com> Message-ID: <-G8GwIflOhFjOL-PAG6_oylu0Fa9c8iNUB57EC6oo4s=.a0126087-2a97-4542-a555-27c12578fccf@github.com> On Tue, 2 Sep 2025 03:01:36 GMT, erifan wrote: > 1. sve `cpy` and `fcpy` are actually two different instructions, and distinguishing them might be clearer. That's a fair point, but the AArch64 name for all four instructions is CPY, and they are distinguished by their operands. Deviation from the names in the Reference Manual is occasionally necessary, but it makes life painful for maintainers when they have to search for what we've called an instruction they want to use. > 2. sve `cpy` 's imm8 is an **int** , while `fcpy` 's imm8 is an **fp8** . Yes, that's right. > While some encoding code can be reused, separating the encodings makes the code clearer. I don't agree that it makes the code clearer. In fact, tight factoring emphasizes the fact that these instructions are similar, and explicitly shows where they are different. It is true that I have a strong bias against copy-and-paste programming.
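For readers following this thread, the `checked_cast(pack(d))` failure described in the original report can be reproduced in isolation. The following is a minimal standalone sketch, not HotSpot code: `round_trips` is a hypothetical stand-in for the value-preservation test that a checked cast performs, `kPackedMinusOne` is a made-up name for the `0xf0` immediate the report gives for `-1.0`, and a 32-bit `unsigned` is assumed.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a checked cast's validity test (not the HotSpot
// implementation): narrow to To, widen back to From, compare with the input.
template <typename To, typename From>
constexpr bool round_trips(From v) {
  return static_cast<From>(static_cast<To>(v)) == v;
}

// Packed 8-bit immediate for -1.0, as given in the report.
constexpr unsigned kPackedMinusOne = 0xf0u;

// Narrowing to int8_t yields -16, which widens back to 0xfffffff0: mismatch.
static_assert(!round_trips<std::int8_t>(kPackedMinusOne),
              "signed round trip is expected to fail");
// Narrowing to uint8_t keeps 0xf0, so the round trip is value-preserving.
static_assert(round_trips<std::uint8_t>(kPackedMinusOne),
              "unsigned round trip is expected to pass");
```

The sketch only illustrates why a signed 8-bit round trip rejects an fp8 bit pattern with the top bit set; it says nothing about which fix the PR should adopt.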
> I think both implementations are fine. If you think it's better to not refactor, I'll revert. I do. Thank you. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26951#issuecomment-3244259237 From shade at openjdk.org Tue Sep 2 08:50:09 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 2 Sep 2025 08:50:09 GMT Subject: RFR: 8231269: CompileTask::is_unloaded is slow due to JNIHandles type checks [v24] In-Reply-To: References: Message-ID: > [JDK-8163511](https://bugs.openjdk.org/browse/JDK-8163511) made the `CompileTask` improvement to avoid blocking class unloading if a relevant compile task is in queue. Current code does a sleight-of-hand to make sure the `method*` in `CompileTask` are still valid before using them. Still a noble goal, so we keep trying to do this. > > The code tries to switch weak JNI handle with a strong one when it wants to capture the holder to block unloading. Since we are reusing the same field, we have to do type checks like `JNIHandles::is_weak_global_handle(_method_holder)`. Unfortunately, that type-check goes all the way to `OopStorage` allocation code to verify the handle is really allocated in the relevant `OopStorage`. This takes internal `OopStorage` locks, and thus is slow. > > This issue is clearly visible in Leyden, when there are lots of `CompileTask`-s in the queue, dumped by AOT code loader. It also does not help that `CompileTask::select_task` is effectively quadratic in number of methods in queue, so we end up calling `CompileTask::is_unloaded` very often. > > It is possible to mitigate this issue by splitting the related fields into weak and strong ones. But as Kim mentions in the bug, we should not be using JNI handles here at all, and instead go directly for relevant `OopStorage`-s. This is what this PR does, among other things that should hopefully make the whole mechanics clearer.
> > Additional testing: > - [x] Linux x86_64 server fastdebug, `compiler/classUnloading`, 100x still passes; these tests are sensitive to bugs in this code > - [x] Linux x86_64 server fastdebug, `all` > - [x] Linux AArch64 server fastdebug, `all` Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 45 commits: - Fix build failure - Merge branch 'master' into JDK-8231269-compile-task-weaks - Docs touchup - Use enum class - Further simplify the API - Tune up for release builds - Move release() to destructor - Deal with things without spinlocks - Merge branch 'master' into JDK-8231269-compile-task-weaks - Merge branch 'master' into JDK-8231269-compile-task-weaks - ... and 35 more: https://git.openjdk.org/jdk/compare/af532cc1...ed7aef7e ------------- Changes: https://git.openjdk.org/jdk/pull/24018/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=24018&range=23 Stats: 376 lines in 14 files changed: 332 ins; 23 del; 21 mod Patch: https://git.openjdk.org/jdk/pull/24018.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24018/head:pull/24018 PR: https://git.openjdk.org/jdk/pull/24018 From epeter at openjdk.org Tue Sep 2 12:33:33 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 2 Sep 2025 12:33:33 GMT Subject: RFR: 8366490: C2 SuperWord: wrong result because CastP2X is missing ctrl and floats over SafePoint creating stale oops Message-ID: <7guNwHJ6tuJXGG-X9aACAWAHjsneD4uryM-ZazES_Uc=.fe831ae6-c8a1-446d-b63e-5b7a1a1f8704@github.com> **Analysis** A `CastP2X` without ctrl can float. If it floats over a `SafePoint` (or call), we may GC and move the oop. But the `CastP2X` value does not end up on the oop-map, and so the pointer is stale (old). With `StressGCM`, the aliasing runtime check has one `CastP2X` that floats over the SafePoint, and another that stays after the SafePoint. 
Both read the oop of the same array, so instead of getting the same address, we now get the old and the new oop. And so the aliasing runtime check passes (thinks there is no aliasing), even though there is aliasing. We end up vectorizing, which reorders the loads/stores and would only be safe if there is no aliasing. **Fix:** add control to the `CastP2X` so that it cannot float too far. **Details** rbp = Allocate array spill <- rbp + 0x20 call to allocateArrays -> allocates a lot, and triggers GC. That moves the allocated array behind rbp -> rbp is oop-mapped, so it is updated automatically to the new oop -> spill value remains based on the old oop We now compute the aliasing runtime check: -> one side of the comparison is computed from rbp (new oop) -> the other side is computed from the spill value (old oop) -> the cmp returns a nonsensical value, and we take the wrong branch -> vectorize even though we have aliasing! ------------- Commit messages: - fix test flags - the fix - JDK-8366490 Changes: https://git.openjdk.org/jdk/pull/27045/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27045&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8366490 Stats: 152 lines in 5 files changed: 139 ins; 1 del; 12 mod Patch: https://git.openjdk.org/jdk/pull/27045.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27045/head:pull/27045 PR: https://git.openjdk.org/jdk/pull/27045 From mli at openjdk.org Tue Sep 2 12:50:46 2025 From: mli at openjdk.org (Hamlin Li) Date: Tue, 2 Sep 2025 12:50:46 GMT Subject: RFR: 8365926: RISC-V: Performance regression in renaissance (chi-square) [v2] In-Reply-To: References: Message-ID: On Mon, 1 Sep 2025 14:35:38 GMT, Robbin Ehn wrote: >> src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 110: >> >>> 108: // We changed instruction stream >>> 109: if (mt_safe) { >>> 110: OrderAccess::release(); >> >> If we have release here, do we still need the release in `set_stub_address_destination_at`?
> > From JBS entry, the point is to do it in a sane order: > > The release in make_jal_opt is there to make sure the store to instruction stream happens before I-cache flush. > > 1: store destination to stub > 2: release > 3: store destination to instruction stream > 4: release > 5: i-cache flush I don't see a detailed discussion of why two `release`s are needed. It seems the `2: release` is redundant: does a single release (step 4) after step 3 work as well? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26944#discussion_r2315984966 From thartmann at openjdk.org Tue Sep 2 12:50:45 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 2 Sep 2025 12:50:45 GMT Subject: RFR: 8366490: C2 SuperWord: wrong result because CastP2X is missing ctrl and floats over SafePoint creating stale oops In-Reply-To: <7guNwHJ6tuJXGG-X9aACAWAHjsneD4uryM-ZazES_Uc=.fe831ae6-c8a1-446d-b63e-5b7a1a1f8704@github.com> References: <7guNwHJ6tuJXGG-X9aACAWAHjsneD4uryM-ZazES_Uc=.fe831ae6-c8a1-446d-b63e-5b7a1a1f8704@github.com> Message-ID: On Tue, 2 Sep 2025 10:45:33 GMT, Emanuel Peter wrote: > **Analysis** > > A `CastP2X` without ctrl can float. If it floats over a `SafePoint` (or call), we may GC and move the oop. But the `CastP2X` value does not end up on the oop-map, and so the pointer is stale (old). > > With `StressGCM`, the aliasing runtime check has one `CastP2X` that floats over the SafePoint, and another that stays after the SafePoint. Both read the oop of the same array, so instead of getting the same address, we now get the old and the new oop. And so the aliasing runtime check passes (thinks there is no aliasing), even though there is aliasing. We end up vectorizing, which reorders the loads/stores and would only be safe if there is no aliasing. > > **Fix:** add control to the `CastP2X` so that it cannot float too far. > > **Details** > > > rbp = Allocate array > spill <- rbp + 0x20 > > call to allocateArrays > -> allocates a lot, and triggers GC.
That moves the allocated array behind rbp > -> rbp is oop-mapped, so it is updated automatically to the new oop > -> spill value remains based on the old oop > > We now compute the aliasing runtime check: > -> one side of the comparison is computed from rbp (new oop) > -> the other side is computed from the the spill value (old oop) > -> the cmp returns a nonsensical value, and we take the wrong branch > -> vectorize even though we have aliasing! Nice analysis! Looks good to me. src/hotspot/share/opto/vectorization.cpp line 1128: > 1126: Node* variable = (s.variable() == iv) ? iv_value : s.variable(); > 1127: if (variable->bottom_type()->isa_ptr() != nullptr) { > 1128: // Make sure that ctlr is late enough, so that we do not Suggestion: // Make sure that ctrl is late enough, so that we do not test/hotspot/jtreg/compiler/loopopts/superword/TestAliasingCastP2XCtrl.java line 2: > 1: /* > 2: * Copyright (c) 2024, 2025, Oracle and/or its affiliates. All rights reserved. Suggestion: * Copyright (c) 2025, Oracle and/or its affiliates. All rights reserved. ------------- Marked as reviewed by thartmann (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/27045#pullrequestreview-3176374409 PR Review Comment: https://git.openjdk.org/jdk/pull/27045#discussion_r2315976205 PR Review Comment: https://git.openjdk.org/jdk/pull/27045#discussion_r2315967541 From mli at openjdk.org Tue Sep 2 12:50:46 2025 From: mli at openjdk.org (Hamlin Li) Date: Tue, 2 Sep 2025 12:50:46 GMT Subject: RFR: 8365926: RISC-V: Performance regression in renaissance (chi-square) [v2] In-Reply-To: References: Message-ID: On Tue, 2 Sep 2025 02:06:16 GMT, Fei Yang wrote: >> We have: >> >> auipc // instruction_address() # Never changed >> ld // instruction_address() + NativeInstruction::instruction_size # Never changed >> jal(r) // instruction_address() + 2 * NativeInstruction::instruction_size (jal_pc) # jal<->jalr >> >> We only change the instruction at "instruction_address() + 2 * NativeInstruction::instruction_size". >> >> Note that jal_pos and jal_pc means a "jump and link instruction", not specifically jal or jalr. >> >> Make sense? > > Maybe we can give it a new name to avoid possible confusion? `jmp_pc` or simply `pc`? > We only change the instruction at "instruction_address() + 2 * NativeInstruction::instruction_size". Right! > Note that jal_pos and jal_pc means a "jump and link instruction", not specifically jal or jalr. As we're patching either `jal` or `jalr` instruction, so jal is misleading, I agree `jmp_xxx` is a better name. 
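The call-site layout quoted above can be sketched as plain address arithmetic. This is a minimal illustration, not the HotSpot code: `kInstructionSize` stands in for `NativeInstruction::instruction_size` and is assumed here to be 4 bytes, and `jmp_pc` is a hypothetical helper that borrows the name proposed in this thread.

```cpp
#include <cassert>
#include <cstdint>

// Assumed size of one uncompressed RISC-V instruction, standing in for
// NativeInstruction::instruction_size.
constexpr std::uintptr_t kInstructionSize = 4;

// Call-site layout as described in the quoted comment:
//   +0                      auipc   (never changed)
//   +kInstructionSize       ld      (never changed)
//   +2 * kInstructionSize   jal(r)  (the only instruction that is patched)
constexpr std::uintptr_t jmp_pc(std::uintptr_t instruction_address) {
  return instruction_address + 2 * kInstructionSize;
}

static_assert(jmp_pc(0x1000) == 0x1008,
              "the patched slot is the third instruction of the sequence");
```

Naming the helper after the jump rather than after `jal` reflects the point made above: the slot may hold either `jal` or `jalr`.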
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26944#discussion_r2315982823 From epeter at openjdk.org Tue Sep 2 12:58:37 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 2 Sep 2025 12:58:37 GMT Subject: RFR: 8366490: C2 SuperWord: wrong result because CastP2X is missing ctrl and floats over SafePoint creating stale oops [v2] In-Reply-To: <7guNwHJ6tuJXGG-X9aACAWAHjsneD4uryM-ZazES_Uc=.fe831ae6-c8a1-446d-b63e-5b7a1a1f8704@github.com> References: <7guNwHJ6tuJXGG-X9aACAWAHjsneD4uryM-ZazES_Uc=.fe831ae6-c8a1-446d-b63e-5b7a1a1f8704@github.com> Message-ID: > **Analysis** > > A `CastP2X` without ctrl can float. If it floats over a `SafePoint` (or call), we may GC and move the oop. But the `CastP2X` value does not end up on the oop-map, and so the pointer is stale (old). > > With `StressGCM`, the aliasing runtime check has one `CastP2X` that floats over the SafePoint, and another that stays after the SafePoint. Both read the oop of the same array, so instead of getting the same address, we now get the old and the new oop. And so the aliasing runtime check passes (thinks there is no aliasing), even though there is aliasing. We end up vectorizing, which reorders the loads/stores and would only be safe if there is no aliasing. > > **Fix:** add control to the `CastP2X` so that it cannot float too far. > > **Details** > > > rbp = Allcoate array > spill <- rbp + 0x20 > > call to allocateArrays > -> allocates a lot, and triggers GC. That moves the allocated array behind rbp > -> rbp is oop-mapped, so it is updated automatically to the new oop > -> spill value remains based on the old oop > > We now compute the aliasing runtime check: > -> one side of the comparison is computed from rbp (new oop) > -> the other side is computed from the the spill value (old oop) > -> the cmp returns a nonsensical value, and we take the wrong branch > -> vectorize even though we have aliasing! 
Emanuel Peter has updated the pull request incrementally with two additional commits since the last revision: - fix test requires - Apply suggestions from code review Co-authored-by: Tobias Hartmann ------------- Changes: - all: https://git.openjdk.org/jdk/pull/27045/files - new: https://git.openjdk.org/jdk/pull/27045/files/91652115..13f70d31 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=27045&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27045&range=00-01 Stats: 3 lines in 2 files changed: 1 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/27045.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27045/head:pull/27045 PR: https://git.openjdk.org/jdk/pull/27045 From chagedorn at openjdk.org Tue Sep 2 13:02:46 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 2 Sep 2025 13:02:46 GMT Subject: RFR: 8366490: C2 SuperWord: wrong result because CastP2X is missing ctrl and floats over SafePoint creating stale oops [v2] In-Reply-To: References: <7guNwHJ6tuJXGG-X9aACAWAHjsneD4uryM-ZazES_Uc=.fe831ae6-c8a1-446d-b63e-5b7a1a1f8704@github.com> Message-ID: <_WggL2SoHOitvvIsgLxItXT9Tr8vk-gcfray1GOTvsw=.bf03d0f6-bafb-4010-b4d9-f927e0cbe944@github.com> On Tue, 2 Sep 2025 12:58:37 GMT, Emanuel Peter wrote: >> **Analysis** >> >> A `CastP2X` without ctrl can float. If it floats over a `SafePoint` (or call), we may GC and move the oop. But the `CastP2X` value does not end up on the oop-map, and so the pointer is stale (old). >> >> With `StressGCM`, the aliasing runtime check has one `CastP2X` that floats over the SafePoint, and another that stays after the SafePoint. Both read the oop of the same array, so instead of getting the same address, we now get the old and the new oop. And so the aliasing runtime check passes (thinks there is no aliasing), even though there is aliasing. We end up vectorizing, which reorders the loads/stores and would only be safe if there is no aliasing. 
>> >> **Fix:** add control to the `CastP2X` so that it cannot float too far. >> >> **Details** >> >> >> rbp = Allcoate array >> spill <- rbp + 0x20 >> >> call to allocateArrays >> -> allocates a lot, and triggers GC. That moves the allocated array behind rbp >> -> rbp is oop-mapped, so it is updated automatically to the new oop >> -> spill value remains based on the old oop >> >> We now compute the aliasing runtime check: >> -> one side of the comparison is computed from rbp (new oop) >> -> the other side is computed from the the spill value (old oop) >> -> the cmp returns a nonsensical value, and we take the wrong branch >> -> vectorize even though we have aliasing! > > Emanuel Peter has updated the pull request incrementally with two additional commits since the last revision: > > - fix test requires > - Apply suggestions from code review > > Co-authored-by: Tobias Hartmann Otherwise, looks good to me, too! test/hotspot/jtreg/compiler/loopopts/superword/TestAliasingCastP2XCtrl.java line 59: > 57: * @test id=vanilla > 58: * @bug 8366490 > 59: * @run driver compiler.loopopts.superword.TestAliasingCastP2XCtrl Should be `main` to allow to run with passed in flags Suggestion: * @run main compiler.loopopts.superword.TestAliasingCastP2XCtrl ------------- Marked as reviewed by chagedorn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/27045#pullrequestreview-3176446699 PR Review Comment: https://git.openjdk.org/jdk/pull/27045#discussion_r2316014972 From mhaessig at openjdk.org Tue Sep 2 13:02:45 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Tue, 2 Sep 2025 13:02:45 GMT Subject: RFR: 8366490: C2 SuperWord: wrong result because CastP2X is missing ctrl and floats over SafePoint creating stale oops [v2] In-Reply-To: References: <7guNwHJ6tuJXGG-X9aACAWAHjsneD4uryM-ZazES_Uc=.fe831ae6-c8a1-446d-b63e-5b7a1a1f8704@github.com> Message-ID: On Tue, 2 Sep 2025 12:58:37 GMT, Emanuel Peter wrote: >> **Analysis** >> >> A `CastP2X` without ctrl can float. If it floats over a `SafePoint` (or call), we may GC and move the oop. But the `CastP2X` value does not end up on the oop-map, and so the pointer is stale (old). >> >> With `StressGCM`, the aliasing runtime check has one `CastP2X` that floats over the SafePoint, and another that stays after the SafePoint. Both read the oop of the same array, so instead of getting the same address, we now get the old and the new oop. And so the aliasing runtime check passes (thinks there is no aliasing), even though there is aliasing. We end up vectorizing, which reorders the loads/stores and would only be safe if there is no aliasing. >> >> **Fix:** add control to the `CastP2X` so that it cannot float too far. >> >> **Details** >> >> >> rbp = Allcoate array >> spill <- rbp + 0x20 >> >> call to allocateArrays >> -> allocates a lot, and triggers GC. That moves the allocated array behind rbp >> -> rbp is oop-mapped, so it is updated automatically to the new oop >> -> spill value remains based on the old oop >> >> We now compute the aliasing runtime check: >> -> one side of the comparison is computed from rbp (new oop) >> -> the other side is computed from the the spill value (old oop) >> -> the cmp returns a nonsensical value, and we take the wrong branch >> -> vectorize even though we have aliasing! 
> > Emanuel Peter has updated the pull request incrementally with two additional commits since the last revision: > > - fix test requires > - Apply suggestions from code review > > Co-authored-by: Tobias Hartmann Thank you for the fix and the easy-to-follow analysis, @eme64. I just have a few minor comments. Otherwise, this looks good. src/hotspot/share/opto/vectorization.cpp line 1128: > 1126: Node* variable = (s.variable() == iv) ? iv_value : s.variable(); > 1127: if (variable->bottom_type()->isa_ptr() != nullptr) { > 1128: // Make sure that ctrl is late enough, so that we do not Suggestion: // Use a ctrl that is late enough, so that we do not At first, I read this as "we need to make sure here that the `ctrl` is late enough" when we really use a `ctrl` that is passed and we cannot really affect its place anymore. But feel free to ignore. test/hotspot/jtreg/compiler/loopopts/superword/TestAliasingCastP2XCtrl.java line 31: > 29: * from floating over a SafePoint that could move the oop, > 30: * and render the cast value stale. > 31: * Suggestion: Nit: superfluous empty line test/hotspot/jtreg/compiler/loopopts/superword/TestAliasingCastP2XCtrl.java line 71: > 69: int[] a = new int[N]; > 70: } > 71: // Makes GC more likely. No clue if this is the right use case, but maybe this would be a good use of `-XX:+GCALot`? ------------- Marked as reviewed by mhaessig (Committer).
PR Review: https://git.openjdk.org/jdk/pull/27045#pullrequestreview-3176427125 PR Review Comment: https://git.openjdk.org/jdk/pull/27045#discussion_r2316012817 PR Review Comment: https://git.openjdk.org/jdk/pull/27045#discussion_r2316005451 PR Review Comment: https://git.openjdk.org/jdk/pull/27045#discussion_r2316002575 From epeter at openjdk.org Tue Sep 2 13:09:32 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 2 Sep 2025 13:09:32 GMT Subject: RFR: 8366490: C2 SuperWord: wrong result because CastP2X is missing ctrl and floats over SafePoint creating stale oops [v3] In-Reply-To: <7guNwHJ6tuJXGG-X9aACAWAHjsneD4uryM-ZazES_Uc=.fe831ae6-c8a1-446d-b63e-5b7a1a1f8704@github.com> References: <7guNwHJ6tuJXGG-X9aACAWAHjsneD4uryM-ZazES_Uc=.fe831ae6-c8a1-446d-b63e-5b7a1a1f8704@github.com> Message-ID: > **Analysis** > > A `CastP2X` without ctrl can float. If it floats over a `SafePoint` (or call), we may GC and move the oop. But the `CastP2X` value does not end up on the oop-map, and so the pointer is stale (old). > > With `StressGCM`, the aliasing runtime check has one `CastP2X` that floats over the SafePoint, and another that stays after the SafePoint. Both read the oop of the same array, so instead of getting the same address, we now get the old and the new oop. And so the aliasing runtime check passes (thinks there is no aliasing), even though there is aliasing. We end up vectorizing, which reorders the loads/stores and would only be safe if there is no aliasing. > > **Fix:** add control to the `CastP2X` so that it cannot float too far. > > **Details** > > > rbp = Allcoate array > spill <- rbp + 0x20 > > call to allocateArrays > -> allocates a lot, and triggers GC. 
That moves the allocated array behind rbp > -> rbp is oop-mapped, so it is updated automatically to the new oop > -> spill value remains based on the old oop > > We now compute the aliasing runtime check: > -> one side of the comparison is computed from rbp (new oop) > -> the other side is computed from the spill value (old oop) > -> the cmp returns a nonsensical value, and we take the wrong branch > -> vectorize even though we have aliasing! Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: Apply suggestions from code review Co-authored-by: Manuel Hässig Co-authored-by: Christian Hagedorn ------------- Changes: - all: https://git.openjdk.org/jdk/pull/27045/files - new: https://git.openjdk.org/jdk/pull/27045/files/13f70d31..d1a35d12 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=27045&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27045&range=01-02 Stats: 3 lines in 2 files changed: 0 ins; 1 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/27045.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27045/head:pull/27045 PR: https://git.openjdk.org/jdk/pull/27045 From epeter at openjdk.org Tue Sep 2 13:09:32 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 2 Sep 2025 13:09:32 GMT Subject: RFR: 8366490: C2 SuperWord: wrong result because CastP2X is missing ctrl and floats over SafePoint creating stale oops [v3] In-Reply-To: References: <7guNwHJ6tuJXGG-X9aACAWAHjsneD4uryM-ZazES_Uc=.fe831ae6-c8a1-446d-b63e-5b7a1a1f8704@github.com> Message-ID: <_e00ByDtLuvLzyrtZJik2E5wVLYm70Oj0d_7f3zi2oU=.af116453-1cf1-4c4c-bd2e-8fc33d6be943@github.com> On Tue, 2 Sep 2025 12:54:23 GMT, Manuel Hässig wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> Apply suggestions from code review >> >> Co-authored-by: Manuel Hässig >> Co-authored-by: Christian Hagedorn > >
test/hotspot/jtreg/compiler/loopopts/superword/TestAliasingCastP2XCtrl.java line 71: > >> 69: int[] a = new int[N]; >> 70: } >> 71: // Makes GC more likely. > > No clue if this is the right use case, but maybe this would be a good use of `-XX:+GCALot`? Maybe, you could be right! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27045#discussion_r2316033051 From chagedorn at openjdk.org Tue Sep 2 13:22:41 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 2 Sep 2025 13:22:41 GMT Subject: RFR: 8366490: C2 SuperWord: wrong result because CastP2X is missing ctrl and floats over SafePoint creating stale oops [v3] In-Reply-To: References: <7guNwHJ6tuJXGG-X9aACAWAHjsneD4uryM-ZazES_Uc=.fe831ae6-c8a1-446d-b63e-5b7a1a1f8704@github.com> Message-ID: On Tue, 2 Sep 2025 13:09:32 GMT, Emanuel Peter wrote: >> **Analysis** >> >> A `CastP2X` without ctrl can float. If it floats over a `SafePoint` (or call), we may GC and move the oop. But the `CastP2X` value does not end up on the oop-map, and so the pointer is stale (old). >> >> With `StressGCM`, the aliasing runtime check has one `CastP2X` that floats over the SafePoint, and another that stays after the SafePoint. Both read the oop of the same array, so instead of getting the same address, we now get the old and the new oop. And so the aliasing runtime check passes (thinks there is no aliasing), even though there is aliasing. We end up vectorizing, which reorders the loads/stores and would only be safe if there is no aliasing. >> >> **Fix:** add control to the `CastP2X` so that it cannot float too far. >> >> **Details** >> >> >> rbp = Allcoate array >> spill <- rbp + 0x20 >> >> call to allocateArrays >> -> allocates a lot, and triggers GC. 
That moves the allocated array behind rbp >> -> rbp is oop-mapped, so it is updated automatically to the new oop >> -> spill value remains based on the old oop >> >> We now compute the aliasing runtime check: >> -> one side of the comparison is computed from rbp (new oop) >> -> the other side is computed from the spill value (old oop) >> -> the cmp returns a nonsensical value, and we take the wrong branch >> -> vectorize even though we have aliasing! > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > Apply suggestions from code review > > Co-authored-by: Manuel Hässig > Co-authored-by: Christian Hagedorn Marked as reviewed by chagedorn (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/27045#pullrequestreview-3176542659 From epeter at openjdk.org Tue Sep 2 13:23:49 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 2 Sep 2025 13:23:49 GMT Subject: RFR: 8329077: C2 SuperWord: Add MoveD2L, MoveL2D, MoveF2I, MoveI2F [v6] In-Reply-To: References: <0bYYOS5AYvN4ZD1xAGBRqV_xasw-np3JWKXC7WcGhyc=.74d97456-f406-4dbe-be09-77ed3b9a66fd@github.com> Message-ID: On Mon, 1 Sep 2025 09:03:24 GMT, Galder Zamarreño wrote: >> I've added support to vectorize `MoveD2L`, `MoveL2D`, `MoveF2I` and `MoveI2F` nodes. The implementation follows a similar pattern to what is done with conversion (`Conv*`) nodes. The tests in `TestCompatibleUseDefTypeSize` have been updated with the new expectations. >> >> Also added a JMH benchmark which measures throughput (the higher the number the better) for methods that exercise these nodes.
On darwin/aarch64 it shows: >> >>
>> Benchmark                                (seed)  (size)   Mode  Cnt      Base      Patch   Units   Diff
>> VectorBitConversion.doubleToLongBits          0    2048  thrpt    8  1168.782   1157.717  ops/ms    -1%
>> VectorBitConversion.doubleToRawLongBits       0    2048  thrpt    8  3999.387   7353.936  ops/ms   +83%
>> VectorBitConversion.floatToIntBits            0    2048  thrpt    8  1200.338   1188.206  ops/ms    -1%
>> VectorBitConversion.floatToRawIntBits         0    2048  thrpt    8  4058.248  14792.474  ops/ms  +264%
>> VectorBitConversion.intBitsToFloat            0    2048  thrpt    8  3050.313  14984.246  ops/ms  +391%
>> VectorBitConversion.longBitsToDouble          0    2048  thrpt    8  3022.691   7379.360  ops/ms  +144%
>> >> The improvements observed are a result of vectorization. The lack of vectorization in `doubleToLongBits` and `floatToIntBits` demonstrates that these changes do not affect their performance. These methods do not vectorize because of flow control. >> >> I've run the tier1-3 tests on linux/aarch64 and didn't observe any regressions. > > Galder Zamarreño has updated the pull request incrementally with one additional commit since the last revision: > > Adjust vector size expectations Testing passed, Approved! @galderz Thanks for working on this :) ------------- PR Review: https://git.openjdk.org/jdk/pull/26457#pullrequestreview-3176546977 From thartmann at openjdk.org Tue Sep 2 14:00:53 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 2 Sep 2025 14:00:53 GMT Subject: RFR: 8366490: C2 SuperWord: wrong result because CastP2X is missing ctrl and floats over SafePoint creating stale oops [v3] In-Reply-To: References: <7guNwHJ6tuJXGG-X9aACAWAHjsneD4uryM-ZazES_Uc=.fe831ae6-c8a1-446d-b63e-5b7a1a1f8704@github.com> Message-ID: On Tue, 2 Sep 2025 13:09:32 GMT, Emanuel Peter wrote: >> **Analysis** >> >> A `CastP2X` without ctrl can float. If it floats over a `SafePoint` (or call), we may GC and move the oop. But the `CastP2X` value does not end up on the oop-map, and so the pointer is stale (old).
>>
>> With `StressGCM`, the aliasing runtime check has one `CastP2X` that floats over the SafePoint, and another that stays after the SafePoint. Both read the oop of the same array, so instead of getting the same address, we now get the old and the new oop. And so the aliasing runtime check passes (thinks there is no aliasing), even though there is aliasing. We end up vectorizing, which reorders the loads/stores and would only be safe if there is no aliasing.
>>
>> **Fix:** add control to the `CastP2X` so that it cannot float too far.
>>
>> **Details**
>>
>>
>> rbp = Allocate array
>> spill <- rbp + 0x20
>>
>> call to allocateArrays
>> -> allocates a lot, and triggers GC. That moves the allocated array behind rbp
>> -> rbp is oop-mapped, so it is updated automatically to the new oop
>> -> spill value remains based on the old oop
>>
>> We now compute the aliasing runtime check:
>> -> one side of the comparison is computed from rbp (new oop)
>> -> the other side is computed from the spill value (old oop)
>> -> the cmp returns a nonsensical value, and we take the wrong branch
>> -> vectorize even though we have aliasing!
>
> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:
>
>   Apply suggestions from code review
>
>   Co-authored-by: Manuel Hässig
>   Co-authored-by: Christian Hagedorn

Marked as reviewed by thartmann (Reviewer).
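
The failure mode in the quoted analysis can be illustrated with plain address arithmetic. The following is only a sketch under stated assumptions: the class name, the `noAlias` helper, the range-disjointness form of the check, and all addresses are hypothetical and are not taken from C2. It shows how comparing an address derived from a stale (pre-GC) oop against one derived from the current oop can make two views of the same array look disjoint.

```java
public class StaleAddressSketch {
    // Hypothetical aliasing check: two byte ranges of equal length are
    // treated as disjoint if one ends before the other begins.
    static boolean noAlias(long aStart, long bStart, long len) {
        return aStart + len <= bStart || bStart + len <= aStart;
    }

    public static void main(String[] args) {
        long oldBase = 0x1000L; // array address before the GC moved it (made up)
        long newBase = 0x8000L; // address of the same array after the move (made up)
        long len = 0x100L;

        // Both sides derived from the current address: same range, so aliasing is detected.
        System.out.println(noAlias(newBase + 0x20, newBase + 0x20, len));

        // One side derived from the stale address: the check wrongly reports "no aliasing".
        System.out.println(noAlias(oldBase + 0x20, newBase + 0x20, len));
    }
}
```

With these made-up numbers the first call reports aliasing and the second does not, which mirrors the "wrong branch" described above.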
-------------

PR Review: https://git.openjdk.org/jdk/pull/27045#pullrequestreview-3176700213

From epeter at openjdk.org Tue Sep 2 14:08:05 2025
From: epeter at openjdk.org (Emanuel Peter)
Date: Tue, 2 Sep 2025 14:08:05 GMT
Subject: RFR: 8325467: Support methods with many arguments in C2 [v24]
In-Reply-To: <8L3IGg5YYgi2EjlC-v5U3FkkWvK1swESQFAMwX02I84=.d597910f-0aca-4eb2-b68c-fbe565e73291@github.com>
References: <8L3IGg5YYgi2EjlC-v5U3FkkWvK1swESQFAMwX02I84=.d597910f-0aca-4eb2-b68c-fbe565e73291@github.com>
Message-ID:

On Mon, 1 Sep 2025 16:01:23 GMT, Daniel Lundén wrote:

>> Oh gosh, I just realized: machine word of course depends on 32bit vs 64bit architecture. Yikes.
>> So maybe the names need to be stack-slots vs words? And there should probably be a quick reminder somewhere that words can be different sizes.
>
> Sure, we can rename them. I think `RM_SIZE_IN_INTS` and `RM_SIZE_IN_WORDS` would be most suitable. I avoided such a change in this changeset to not make it bigger than it already is. Isn't it easier to do the renaming in a follow-up RFE though, instead of before this PR? I'm fine with both though, not that much extra work to do it before.

I think it would be easier to review if you do it first. That PR won't be super controversial, and just makes the code nicer. And then when we come back here, we may even be able to drop some comments, or be able to catch bugs just because the reviewers understand better what's going on ;)

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2316205052

From epeter at openjdk.org Tue Sep 2 14:11:07 2025
From: epeter at openjdk.org (Emanuel Peter)
Date: Tue, 2 Sep 2025 14:11:07 GMT
Subject: RFR: 8325467: Support methods with many arguments in C2 [v24]
In-Reply-To:
References:
Message-ID:

On Mon, 1 Sep 2025 16:15:28 GMT, Daniel Lundén wrote:

> The main issue is that register masks are stored as part of certain nodes, and nodes get copied by Node::clone

Ok, that answers it for me.
Maybe you can expand the comment a little where you mention that masks are `shallowly copied`

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2316213034

From epeter at openjdk.org Tue Sep 2 14:19:02 2025
From: epeter at openjdk.org (Emanuel Peter)
Date: Tue, 2 Sep 2025 14:19:02 GMT
Subject: RFR: 8325467: Support methods with many arguments in C2 [v24]
In-Reply-To:
References:
Message-ID:

On Mon, 1 Sep 2025 16:23:57 GMT, Daniel Lundén wrote:

>> src/hotspot/share/opto/regmask.hpp line 96:
>>
>>> 94:   (((RM_SIZE_MIN << 5) +                // Slots for machine registers
>>> 95:     (max_method_parameter_length * 2) + // Slots for incoming arguments
>>> 96:     (max_method_parameter_length * 2) + // Slots for outgoing arguments
>>
>> What's the meaning of incoming vs outgoing arguments? Like this?
>>
>> Incoming = from caller (outer nesting)
>> Outgoing = to nested call (inner nesting)
>
> Yes, you are correct. There is a detailed explanation in `x86_64.ad` ("Definition of frame structure and management information").

Ok. But that's not immediately apparent here. If you already have a comment, why not mention caller/callee or inner/outer scope?

>> src/hotspot/share/opto/regmask.hpp line 175:
>>
>>> 173:   // mask can currently represent to be included. If _all_stack = false, we
>>> 174:   // consider the registers not included.
>>> 175:   bool _all_stack = false;
>>
>> I'd prefer to have some kind of `_is_...` name here. Because when I read `all_stack` and see it is a bool, I wonder what it means - it does not tell me quickly. Does it mean that all registers are on the stack?
>>
>> Is everything that is beyond the register mask purely on the stack? Is everything from the stack always beyond the register mask? I'm confused :face_with_peeking_eye:
>
> Right, we should probably update this terminology as well.
> It comes from the fact that register masks can always represent all registers (+ a few stack slots), and anything beyond the mask is necessarily additional stack slots. So, if `_all_stack` is set, it means the register mask includes all of the stack slots. Any suggestion for a better name?

So that could mean that we have stack slots that are in the mask, and that are off, but we still have `_all_stack = true`, right? That sounds a little contradictory to me.

Some ideas:
- `_value_of_bits_above_mask` - though strictly speaking the mask also represents those bits, and so they are not really "above" the mask.
- `_value_of_bits_above_...` ah it is above the register mask `size`, right? Of course it is a bit suboptimal that the `size` is only for those that we explicitly represent, and does not capture that we implicitly represent.

Maybe you can think about naming here too. Optional.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2316237483
PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2316234008

From epeter at openjdk.org Tue Sep 2 14:25:14 2025
From: epeter at openjdk.org (Emanuel Peter)
Date: Tue, 2 Sep 2025 14:25:14 GMT
Subject: RFR: 8325467: Support methods with many arguments in C2 [v24]
In-Reply-To: <_a6JVBA326t8l1U3ZI8C-J3Ju5jm-RklBFGtnR7fbyY=.70638135-7577-44dc-a212-fe5e39b1f5fa@github.com>
References: <_a6JVBA326t8l1U3ZI8C-J3Ju5jm-RklBFGtnR7fbyY=.70638135-7577-44dc-a212-fe5e39b1f5fa@github.com>
Message-ID:

On Mon, 1 Sep 2025 16:31:58 GMT, Daniel Lundén wrote:

>> src/hotspot/share/opto/regmask.hpp line 170:
>>
>>> 168:   // variable indicates how many words we offset with. We consider all
>>> 169:   // registers before the offset to not be included in the register mask.
>>> 170:   unsigned int _offset;
>>
>> Does that mean we make different slices of the mask?
>
> I don't quite understand the question, can you please elaborate?
> The `_offset` means we shift the register mask to the right, so that the first bit of the first `_RM_UP` element no longer represents `OptoReg` 0 (but rather `OptoReg` `_offset * BitsPerWord`).

Hmm ok. Now I went to `rm_up` and thought that you would do `i - _offset`. But that's not what happens.

Hmm but then here there is a subtraction:

    bool Member(OptoReg::Name reg) const {
      reg = reg - offset_bits();

Is that consistent? I hope you understand why I'm confused?

>> src/hotspot/share/opto/regmask.hpp line 217:
>>
>>> 215:   // necessarily representing stack locations) to 1. Here is how the above
>>> 216:   // register mask looks like after clearing, setting _all_stack to true, and
>>> 217:   // successfully rolling over:
>>
>> I'm still struggling to follow here. Maybe `_offset` is not clear to me yet. What is the value here for it? How is it changed with the `rollover`?
>
> This `_offset` stuff is really only for a very specific use case in `PhaseChaitin::Select`, so I understand it can be hard to follow. The value for `_offset` in the example after rollover is 5 = `_rm_size`, since we have rolled over once. When we roll over the next time, the `_offset` is 10, and so on.

Ok, just make sure you document it in the example :)

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2316251262
PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2316252858

From dlunden at openjdk.org Tue Sep 2 14:41:58 2025
From: dlunden at openjdk.org (Daniel Lundén)
Date: Tue, 2 Sep 2025 14:41:58 GMT
Subject: RFR: 8325467: Support methods with many arguments in C2 [v24]
In-Reply-To:
References: <_a6JVBA326t8l1U3ZI8C-J3Ju5jm-RklBFGtnR7fbyY=.70638135-7577-44dc-a212-fe5e39b1f5fa@github.com>
Message-ID:

On Tue, 2 Sep 2025 14:20:58 GMT, Emanuel Peter wrote:

>> I don't quite understand the question, can you please elaborate?
>> The `_offset` means we shift the register mask to the right, so that the first bit of the first `_RM_UP` element no longer represents `OptoReg` 0 (but rather `OptoReg` `_offset * BitsPerWord`).
>
> Hmm ok. Now I went to `rm_up` and thought that you would do `i - _offset`. But that's not what happens.
>
> Hmm but then here there is a subtraction:
>
>     bool Member(OptoReg::Name reg) const {
>       reg = reg - offset_bits();
>
>
> Is that consistent? I hope you understand why I'm confused?

Yes, the subtraction is consistent, because if the register mask is offset, we can no longer use the OptoReg to directly index the mask. Small simplified example: register mask with 5 bits, offset by 10. First bit (index 0) represents OptoReg 10, second bit (index 1) represents OptoReg 11, etc. If we call `Member(15)`, we need to subtract the offset so we look at the correct index in the register mask (index 5).

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2316301804

From epeter at openjdk.org Tue Sep 2 14:56:46 2025
From: epeter at openjdk.org (Emanuel Peter)
Date: Tue, 2 Sep 2025 14:56:46 GMT
Subject: RFR: 8329077: C2 SuperWord: Add MoveD2L, MoveL2D, MoveF2I, MoveI2F [v6]
In-Reply-To:
References: <0bYYOS5AYvN4ZD1xAGBRqV_xasw-np3JWKXC7WcGhyc=.74d97456-f406-4dbe-be09-77ed3b9a66fd@github.com>
Message-ID:

On Mon, 1 Sep 2025 09:03:24 GMT, Galder Zamarreño wrote:

>> I've added support to vectorize `MoveD2L`, `MoveL2D`, `MoveF2I` and `MoveI2F` nodes. The implementation follows a similar pattern to what is done with conversion (`Conv*`) nodes. The tests in `TestCompatibleUseDefTypeSize` have been updated with the new expectations.
>>
>> Also added a JMH benchmark which measures throughput (the higher the number the better) for methods that exercise these nodes.
>> On darwin/aarch64 it shows:
>>
>>
>> Benchmark                                (seed)  (size)   Mode  Cnt      Base      Patch   Units   Diff
>> VectorBitConversion.doubleToLongBits          0    2048  thrpt    8  1168.782   1157.717  ops/ms    -1%
>> VectorBitConversion.doubleToRawLongBits       0    2048  thrpt    8  3999.387   7353.936  ops/ms   +83%
>> VectorBitConversion.floatToIntBits            0    2048  thrpt    8  1200.338   1188.206  ops/ms    -1%
>> VectorBitConversion.floatToRawIntBits         0    2048  thrpt    8  4058.248  14792.474  ops/ms  +264%
>> VectorBitConversion.intBitsToFloat            0    2048  thrpt    8  3050.313  14984.246  ops/ms  +391%
>> VectorBitConversion.longBitsToDouble          0    2048  thrpt    8  3022.691   7379.360  ops/ms  +144%
>>
>>
>> The improvements observed are a result of vectorization. The lack of vectorization in `doubleToLongBits` and `floatToIntBits` demonstrates that these changes do not affect their performance. These methods do not vectorize because of flow control.
>>
>> I've run the tier1-3 tests on linux/aarch64 and didn't observe any regressions.
>
> Galder Zamarreño has updated the pull request incrementally with one additional commit since the last revision:
>
>   Adjust vector size expectations

Marked as reviewed by epeter (Reviewer).

-------------

PR Review: https://git.openjdk.org/jdk/pull/26457#pullrequestreview-3176939920

From galder at openjdk.org Tue Sep 2 14:56:47 2025
From: galder at openjdk.org (Galder Zamarreño)
Date: Tue, 2 Sep 2025 14:56:47 GMT
Subject: RFR: 8329077: C2 SuperWord: Add MoveD2L, MoveL2D, MoveF2I, MoveI2F [v6]
In-Reply-To:
References: <0bYYOS5AYvN4ZD1xAGBRqV_xasw-np3JWKXC7WcGhyc=.74d97456-f406-4dbe-be09-77ed3b9a66fd@github.com>
Message-ID:

On Mon, 1 Sep 2025 09:07:07 GMT, Emanuel Peter wrote:

>> Galder Zamarreño has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Adjust vector size expectations
>
> Perfect, thanks for the update! I'll submit testing again :)

@eme64 thanks for running the tests! Did you actually mark the review as approved?
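
For readers wondering why the `*Raw*` variants in the quoted benchmark speed up while `floatToIntBits` and `doubleToLongBits` stay flat: the non-raw methods collapse every NaN input to one canonical bit pattern, which implies a per-element check, while the raw methods are pure bit reinterpretations. A minimal Java sketch of the two loop shapes follows; the class and method names are illustrative and are not taken from the PR's `VectorBitConversion` benchmark.

```java
public class BitConversionSketch {
    // Pure bit reinterpretation per element, with no NaN special case.
    static void rawBits(float[] in, int[] out) {
        for (int i = 0; i < in.length; i++) {
            out[i] = Float.floatToRawIntBits(in[i]);
        }
    }

    // Canonicalizing variant: every NaN maps to 0x7fc00000,
    // which requires a NaN check on each element.
    static void canonicalBits(float[] in, int[] out) {
        for (int i = 0; i < in.length; i++) {
            out[i] = Float.floatToIntBits(in[i]);
        }
    }

    public static void main(String[] args) {
        // A quiet NaN with a non-zero payload bit.
        float[] in = {1.0f, Float.intBitsToFloat(0x7fc00001)};
        int[] raw = new int[in.length];
        int[] canonical = new int[in.length];
        rawBits(in, raw);
        canonicalBits(in, canonical);
        System.out.println(Integer.toHexString(raw[0]) + " " + Integer.toHexString(raw[1]));
        System.out.println(Integer.toHexString(canonical[0]) + " " + Integer.toHexString(canonical[1]));
    }
}
```

Both methods agree on ordinary values such as `1.0f`; they can differ only on NaN inputs, which is where the extra control flow in the non-raw methods comes from.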
-------------

PR Comment: https://git.openjdk.org/jdk/pull/26457#issuecomment-3245692537

From duke at openjdk.org Tue Sep 2 15:03:45 2025
From: duke at openjdk.org (duke)
Date: Tue, 2 Sep 2025 15:03:45 GMT
Subject: RFR: 8329077: C2 SuperWord: Add MoveD2L, MoveL2D, MoveF2I, MoveI2F [v6]
In-Reply-To:
References: <0bYYOS5AYvN4ZD1xAGBRqV_xasw-np3JWKXC7WcGhyc=.74d97456-f406-4dbe-be09-77ed3b9a66fd@github.com>
Message-ID:

On Mon, 1 Sep 2025 09:03:24 GMT, Galder Zamarreño wrote:

>> I've added support to vectorize `MoveD2L`, `MoveL2D`, `MoveF2I` and `MoveI2F` nodes. The implementation follows a similar pattern to what is done with conversion (`Conv*`) nodes. The tests in `TestCompatibleUseDefTypeSize` have been updated with the new expectations.
>>
>> Also added a JMH benchmark which measures throughput (the higher the number the better) for methods that exercise these nodes. On darwin/aarch64 it shows:
>>
>>
>> Benchmark                                (seed)  (size)   Mode  Cnt      Base      Patch   Units   Diff
>> VectorBitConversion.doubleToLongBits          0    2048  thrpt    8  1168.782   1157.717  ops/ms    -1%
>> VectorBitConversion.doubleToRawLongBits       0    2048  thrpt    8  3999.387   7353.936  ops/ms   +83%
>> VectorBitConversion.floatToIntBits            0    2048  thrpt    8  1200.338   1188.206  ops/ms    -1%
>> VectorBitConversion.floatToRawIntBits         0    2048  thrpt    8  4058.248  14792.474  ops/ms  +264%
>> VectorBitConversion.intBitsToFloat            0    2048  thrpt    8  3050.313  14984.246  ops/ms  +391%
>> VectorBitConversion.longBitsToDouble          0    2048  thrpt    8  3022.691   7379.360  ops/ms  +144%
>>
>>
>> The improvements observed are a result of vectorization. The lack of vectorization in `doubleToLongBits` and `floatToIntBits` demonstrates that these changes do not affect their performance. These methods do not vectorize because of flow control.
>>
>> I've run the tier1-3 tests on linux/aarch64 and didn't observe any regressions.
>
> Galder Zamarreño has updated the pull request incrementally with one additional commit since the last revision:
>
>   Adjust vector size expectations

@galderz
Your change (at version 632408ba2adf8f3bffe226a9c2bb0db022d4e8d1) is now ready to be sponsored by a Committer.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/26457#issuecomment-3245723721

From vlivanov at openjdk.org Tue Sep 2 16:00:48 2025
From: vlivanov at openjdk.org (Vladimir Ivanov)
Date: Tue, 2 Sep 2025 16:00:48 GMT
Subject: RFR: 8365407: Race condition in MethodTrainingData::verify() [v8]
In-Reply-To:
References: <0795hrryFZveQb4GgjNhdGJSYwIz98RHoJx3JX8LSDY=.dae4d10e-5a1f-49da-bec7-e77360f8026e@github.com>
Message-ID:

On Tue, 26 Aug 2025 22:59:54 GMT, Igor Veresov wrote:

>> This change fixes multiple issues with training data verification. While the current state of things in the mainline will not cause any issues (because of the absence of the call to `TD::verify()` during the shutdown) it does cause problems in the leyden repo. This change strengthens verification in the mainline (by adding the shutdown verify call), and fixes the problems that prevent it from working reliably.
>
> Igor Veresov has updated the pull request incrementally with one additional commit since the last revision:
>
>   Relax verification invariant

src/hotspot/share/oops/trainingData.cpp line 635:

> 633:     int init_deps_left2 = compute_init_deps_left();
> 634:
> 635:     bool invariant = (init_deps_left1 >= init_deps_left2);

I assume this check takes concurrent class initialization into account and init notification events are processed on a dedicated thread. Can we strengthen the check by repeatedly performing it and ensuring the value converges? Also, maybe take event queue into account?
-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/26866#discussion_r2316527758

From iveresov at openjdk.org Tue Sep 2 16:19:44 2025
From: iveresov at openjdk.org (Igor Veresov)
Date: Tue, 2 Sep 2025 16:19:44 GMT
Subject: RFR: 8365407: Race condition in MethodTrainingData::verify() [v8]
In-Reply-To:
References: <0795hrryFZveQb4GgjNhdGJSYwIz98RHoJx3JX8LSDY=.dae4d10e-5a1f-49da-bec7-e77360f8026e@github.com>
Message-ID:

On Tue, 2 Sep 2025 15:57:42 GMT, Vladimir Ivanov wrote:

>> Igor Veresov has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Relax verification invariant
>
> src/hotspot/share/oops/trainingData.cpp line 635:
>
>> 633:     int init_deps_left2 = compute_init_deps_left();
>> 634:
>> 635:     bool invariant = (init_deps_left1 >= init_deps_left2);
>
> I assume this check takes concurrent class initialization into account and init notification events are processed on a dedicated thread. Can we strengthen the check by repeatedly performing it and ensuring the value converges? Also, maybe take event queue into account?

It's very hard to do reliably given the way the vm shutdown currently works. There is no way to ensure that all the java threads are stopped, so checking the convergence is problematic. So, the best I can do right now is prove the `>=` property.
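
The relaxed `>=` invariant discussed here can be sketched outside HotSpot. The Java example below is a minimal illustration under one assumption: the outstanding-dependency count can only shrink while other threads keep running. The class name, the counter, and `verifyOnce` are hypothetical stand-ins and do not mirror `compute_init_deps_left()` or any other code in `trainingData.cpp`.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class MonotonicInvariantSketch {
    // Hypothetical stand-in for the number of class-initialization
    // dependencies that are still outstanding.
    static final AtomicInteger initDepsLeft = new AtomicInteger(1000);

    // Take two back-to-back samples. Concurrent initializers may lower the
    // count in between, so only "first >= second" can be asserted safely.
    static boolean verifyOnce() {
        int first = initDepsLeft.get();
        int second = initDepsLeft.get();
        return first >= second;
    }

    public static void main(String[] args) throws InterruptedException {
        // A single worker that keeps decrementing until the count reaches zero.
        Thread initializer = new Thread(() -> {
            while (initDepsLeft.get() > 0) {
                initDepsLeft.decrementAndGet();
            }
        });
        initializer.start();

        boolean held = true;
        for (int i = 0; i < 10_000; i++) {
            held &= verifyOnce();
        }
        initializer.join();
        System.out.println("invariant held: " + held);
    }
}
```

A convergence check (waiting until two samples are equal) would additionally need a guarantee that the writers have stopped, which is the part described above as hard to ensure during VM shutdown.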
-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/26866#discussion_r2316575038

From iveresov at openjdk.org Tue Sep 2 16:19:45 2025
From: iveresov at openjdk.org (Igor Veresov)
Date: Tue, 2 Sep 2025 16:19:45 GMT
Subject: RFR: 8365407: Race condition in MethodTrainingData::verify() [v8]
In-Reply-To:
References: <0795hrryFZveQb4GgjNhdGJSYwIz98RHoJx3JX8LSDY=.dae4d10e-5a1f-49da-bec7-e77360f8026e@github.com>
Message-ID:

On Tue, 2 Sep 2025 16:15:42 GMT, Igor Veresov wrote:

>> src/hotspot/share/oops/trainingData.cpp line 635:
>>
>>> 633:     int init_deps_left2 = compute_init_deps_left();
>>> 634:
>>> 635:     bool invariant = (init_deps_left1 >= init_deps_left2);
>>
>> I assume this check takes concurrent class initialization into account and init notification events are processed on a dedicated thread. Can we strengthen the check by repeatedly performing it and ensuring the value converges? Also, maybe take event queue into account?
>
> It's very hard to do reliably given the way the vm shutdown currently works. There is no way to ensure that all the java threads are stopped, so checking the convergence is problematic. So, the best I can do right now is prove the `>=` property.

I mean, I tried, but gave up on the convergence for now. Perhaps we'd make a stab at it another time.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/26866#discussion_r2316577781

From vlivanov at openjdk.org Tue Sep 2 16:59:43 2025
From: vlivanov at openjdk.org (Vladimir Ivanov)
Date: Tue, 2 Sep 2025 16:59:43 GMT
Subject: RFR: 8365407: Race condition in MethodTrainingData::verify() [v8]
In-Reply-To:
References: <0795hrryFZveQb4GgjNhdGJSYwIz98RHoJx3JX8LSDY=.dae4d10e-5a1f-49da-bec7-e77360f8026e@github.com>
Message-ID:

On Tue, 26 Aug 2025 22:59:54 GMT, Igor Veresov wrote:

>> This change fixes multiple issues with training data verification.
>> While the current state of things in the mainline will not cause any issues (because of the absence of the call to `TD::verify()` during the shutdown) it does cause problems in the leyden repo. This change strengthens verification in the mainline (by adding the shutdown verify call), and fixes the problems that prevent it from working reliably.
>
> Igor Veresov has updated the pull request incrementally with one additional commit since the last revision:
>
>   Relax verification invariant

Looks good.

-------------

Marked as reviewed by vlivanov (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/26866#pullrequestreview-3177412360

From vlivanov at openjdk.org Tue Sep 2 16:59:44 2025
From: vlivanov at openjdk.org (Vladimir Ivanov)
Date: Tue, 2 Sep 2025 16:59:44 GMT
Subject: RFR: 8365407: Race condition in MethodTrainingData::verify() [v8]
In-Reply-To:
References: <0795hrryFZveQb4GgjNhdGJSYwIz98RHoJx3JX8LSDY=.dae4d10e-5a1f-49da-bec7-e77360f8026e@github.com>
Message-ID:

On Tue, 2 Sep 2025 16:16:41 GMT, Igor Veresov wrote:

>> It's very hard to do reliably given the way the vm shutdown currently works. There is no way to ensure that all the java threads are stopped, so checking the convergence is problematic. So, the best I can do right now is prove the `>=` property.
>
> I mean, I tried, but gave up on the convergence for now. Perhaps we'd make a stab at it another time.

Ok, sounds good. Thanks for the clarifications.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/26866#discussion_r2316665607

From vlivanov at openjdk.org Tue Sep 2 17:58:41 2025
From: vlivanov at openjdk.org (Vladimir Ivanov)
Date: Tue, 2 Sep 2025 17:58:41 GMT
Subject: RFR: 8358751: C2: Recursive inlining check for compiled lambda forms is broken
In-Reply-To:
References:
Message-ID:

On Fri, 22 Aug 2025 01:24:52 GMT, Vladimir Ivanov wrote:

> Recursive inlining checks are relaxed for compiled LambdaForms.
> Since LambdaForms are heavily reused, the check is performed on `MethodHandle` receivers instead.
>
> Unfortunately, the current implementation is broken. JVMState doesn't guarantee presence of receivers for caller frames.
> An attempt to fetch pruned receiver reports unrelated info, but, in the worst case, it ends up as an out-of-bounds access into node's input array and crashes the JVM.
>
> Proposed fix captures receiver information as part of inlining and preserves it on `JVMState` for every compiled LambdaForm frame, so it can be reliably recovered during subsequent inlining attempts.
>
> Testing: hs-tier1 - hs-tier8
>
> (Special thanks to @mroth23 who prepared a reproducer of the bug.)

Thanks for the reviews, Dean and Roland.

> What about a regression test?

I wasn't able to extract a regression test from the failing program. I added additional asserts to catch problematic accesses, so a similar bug should be easier to catch in the future.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/26891#issuecomment-3246286042

From iklam at openjdk.org Tue Sep 2 18:10:43 2025
From: iklam at openjdk.org (Ioi Lam)
Date: Tue, 2 Sep 2025 18:10:43 GMT
Subject: RFR: 8365407: Race condition in MethodTrainingData::verify() [v8]
In-Reply-To:
References: <0795hrryFZveQb4GgjNhdGJSYwIz98RHoJx3JX8LSDY=.dae4d10e-5a1f-49da-bec7-e77360f8026e@github.com>
Message-ID:

On Tue, 26 Aug 2025 22:59:54 GMT, Igor Veresov wrote:

>> This change fixes multiple issues with training data verification. While the current state of things in the mainline will not cause any issues (because of the absence of the call to `TD::verify()` during the shutdown) it does cause problems in the leyden repo. This change strengthens verification in the mainline (by adding the shutdown verify call), and fixes the problems that prevent it from working reliably.
>
> Igor Veresov has updated the pull request incrementally with one additional commit since the last revision:
>
>   Relax verification invariant

LGTM

-------------

Marked as reviewed by iklam (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/26866#pullrequestreview-3177620970

From vlivanov at openjdk.org Tue Sep 2 18:26:28 2025
From: vlivanov at openjdk.org (Vladimir Ivanov)
Date: Tue, 2 Sep 2025 18:26:28 GMT
Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v6]
In-Reply-To:
References:
Message-ID:

> This PR introduces C2 support for `Reference.reachabilityFence()`.
>
> After [JDK-8199462](https://bugs.openjdk.org/browse/JDK-8199462) went in, it was discovered that C2 may break the invariant the fix relied upon [1]. So, this is an attempt to introduce proper support for `Reference.reachabilityFence()` in C2. C1 is left intact for now, because there are no signs yet it is affected.
>
> `Reference.reachabilityFence()` can be used in performance critical code, so the primary goal for C2 is to reduce its runtime overhead as much as possible. The ultimate goal is to ensure liveness information is attached to interfering safepoints, but it takes multiple steps to properly propagate the information through compilation pipeline without negatively affecting generated code quality.
>
> Also, I don't consider this fix as complete. It does fix the reported problem, but it doesn't provide any strong guarantees yet. In particular, since `ReachabilityFence` is a CFG-only node, nothing explicitly forbids memory operations to float past `Reference.reachabilityFence()` and potentially reach some other safepoints that the current analysis treats as non-interfering. Representing `ReachabilityFence` as a memory barrier (e.g., `MemBarCPUOrder`) would solve the issue, but performance costs are prohibitively high.
> Alternatively, the optimization proposed in this PR can be improved to conservatively extend referent's live range beyond `ReachabilityFence` nodes associated with it. It would meet performance criteria, but I prefer to implement it as a followup fix.
>
> Another known issue relates to reachability fences on constant oops. If such constant is GCed (most likely, due to a bug in Java code), similar reachability issues may arise. For now, RFs on constants are treated as no-ops, but there's a diagnostic flag `PreserveReachabilityFencesOnConstants` to keep the fences. I plan to address it separately.
>
> [1] https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/ref/Reference.java#L667
> "HotSpot JVM retains the ref and does not GC it before a call to this method, because the JIT-compilers do not have GC-only safepoints."
>
> Testing:
> - [x] hs-tier1 - hs-tier8
> - [x] hs-tier1 - hs-tier6 w/ -XX:+StressReachabilityFences -XX:+VerifyLoopOptimizations
> - [x] java/lang/foreign microbenchmarks

Vladimir Ivanov has updated the pull request incrementally with one additional commit since the last revision:

  Unconditionally schedule RF nodes for IGVN

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/25315/files
  - new: https://git.openjdk.org/jdk/pull/25315/files/8b1c6dff..0762dda9

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=25315&range=05
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25315&range=04-05

  Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod
  Patch: https://git.openjdk.org/jdk/pull/25315.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/25315/head:pull/25315

PR: https://git.openjdk.org/jdk/pull/25315

From manc at openjdk.org Tue Sep 2 18:51:48 2025
From: manc at openjdk.org (Man Cao)
Date: Tue, 2 Sep 2025 18:51:48 GMT
Subject: RFR: 8366118: DontCompileHugeMethods is not respected with -XX:-TieredCompilation [v5]
In-Reply-To:
References:
Message-ID:
<1F0kx1dn34B_lxtY-AMLLkLR3PFXeB3kVUudkbQNuS4=.58673aae-839a-40b2-8814-2972e27d85cb@github.com>

On Fri, 29 Aug 2025 23:12:18 GMT, Man Cao wrote:

>> Hi,
>>
>> Could anyone review this change that fixes https://bugs.openjdk.org/browse/JDK-8366118? When this bug happens, it is difficult or almost impossible to debug due to the lack of stack trace, hs-err log or core dump. Fortunately we are also experimenting with sigaltstack for https://bugs.openjdk.org/browse/JDK-8364654, and it helped immensely to identify the root cause.
>>
>> I will also try adding a test case for DontCompileHugeMethods under -XX:-TieredCompilation.
>>
>> -Man
>
> Man Cao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision:
>
>  - Merge branch 'master' into JDK-8366118-DontCompileHugeMethods
>  - Add -Xbatch to test
>  - Use List.of in test
>  - Add a jtreg test
>  - 8366118: DontCompileHugeMethods is not respected with -XX:-TieredCompilation

Could anyone give another approval on the latest change?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/26932#issuecomment-3246433933

From fferrari at openjdk.org Tue Sep 2 19:05:45 2025
From: fferrari at openjdk.org (Francisco Ferrari Bihurriet)
Date: Tue, 2 Sep 2025 19:05:45 GMT
Subject: RFR: 8364970: Redo JDK-8327381 by updating the CmpU type instead of the Bool type [v3]
In-Reply-To: <_EN6o6Jwu73CNwvSXYt2cHSHu6Yglkp86f1t7lywwi4=.a84b6fac-327a-48a5-8f1e-772b31d8da10@github.com>
References: <_EN6o6Jwu73CNwvSXYt2cHSHu6Yglkp86f1t7lywwi4=.a84b6fac-327a-48a5-8f1e-772b31d8da10@github.com>
Message-ID:

On Fri, 29 Aug 2025 13:17:48 GMT, Christian Hagedorn wrote:

>> # Absence note
>>
>> Today is the last day before a ~2 weeks vacation, so my next working day is Monday, September 1st.
>>
>> Please feel free to keep giving feedback and/or reviews, and I will continue when I'm back.
>>
>> Cheers,
>> Francisco
>
> Hi @franferrax, hope you had a good vacation!
>
>> Hi @chhagedorn,
>>
>> I added the new tests in [e6b1cb8](https://github.com/openjdk/jdk/commit/e6b1cb897d9c75b34744c7d24f72abcec9986b0b). One problem I'm facing is that I'm unable to generate `Bool` nodes with arbitrary `BoolTest` values. Even if I try the assert inversions I removed in [10e1e3f](https://github.com/openjdk/jdk/commit/10e1e3f4f796d05dcd5c56bc2365d5d564d93952), C2 has preference for `BoolTest::ne`, `BoolTest::le` and `BoolTest::lt`. Instead of using `BoolTest::eq`, `BoolTest::gt` or `BoolTest::ge`, it swaps what is put in `IfTrue` and `IfFalse`.
>>
>> Even if `javac` generates an `ifeq` and an `ifne` with the same inputs, instead of a single `CmpU` with two `Bool`s (`BoolTest::eq` and `BoolTest::ne`), I get a single `Bool` (`BoolTest::ne`) with two `If` (one of them swapping `IfTrue` with `IfFalse`). I guess this is some sort of canonicalization to enable further optimizations.
>>
>> Do you know a way to influence the `Bool`'s `BoolTest` value? Or @rwestrel do you?
>>
>> This means the following 8 cases are not really testing what they claim, but repeating other cases with `IfTrue` and `IfFalse` swapped:
>>
>> * `testCase1aOptimizeAsFalseForGT(xm|mx)` (they should use `BoolTest::gt`, but use `BoolTest::le`)
>> * `testCase1bOptimizeAsFalseForEQ(xm|mx)` (they should use `BoolTest::eq`, but use `BoolTest::ne`)
>> * `testCase1bOptimizeAsFalseForGE(xm|mx)` (they should use `BoolTest::ge`, but use `BoolTest::lt`)
>> * `testCase1bOptimizeAsFalseForGT(xm|mx)` (they should use `BoolTest::gt`, but use `BoolTest::le`)
>>
>> Even if we don't find a way to influence the `BoolTest`, the cases are still valid and can be kept (just in case the described behaviour changes).
>
> Hm, that's a good point. `Parse::do_if()` indeed always canonicalizes the `Bool` nodes...
> But I was sure we can still somehow end up with non-canonicalized versions again with some tricks. I was curious and played around with some examples and could indeed find test cases for `gt`, `ge`, and `eq`.
>
> I was then also thinking about notification code in IGVN. We already concluded further up that it's not needed for CCP because `CmpU` nodes below `AddI` nodes are put to the worklist again. However, with IGVN, we could modify the graph above the `AndI` as well. We miss notification code for `CmpU` below `AndI`. I changed my test cases further to also run into such a missing optimization case. When run with `-XX:VerifyIterativeGVN=1110`, we indeed get su...

Hi @chhagedorn, thank you for the additional work and your insights. This is much appreciated from a learner perspective.

I didn't fully analyze the `Test.java` you provided yet, but wanted to check if you are aiming to include the missing IGVN notification code as part of this issue (and its corresponding test). Or are you working on an independent issue?

My availability will be limited as the October CPU approaches, but I will try to find some timeboxes to make `TestBoolNodeGVN.java` emit the right test cases for `gt`, `ge`, and `eq`.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/26666#issuecomment-3246471074

From rehn at openjdk.org Tue Sep 2 19:31:45 2025
From: rehn at openjdk.org (Robbin Ehn)
Date: Tue, 2 Sep 2025 19:31:45 GMT
Subject: RFR: 8365926: RISC-V: Performance regression in renaissance (chi-square) [v2]
In-Reply-To:
References:
Message-ID:

On Mon, 1 Sep 2025 10:10:31 GMT, Hamlin Li wrote:

>> Robbin Ehn has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains four additional commits since the last revision: >> >> - Merge branch 'master' into 8365926 >> - Spelling >> - Merge branch 'master' into 8365926 >> - draft jal<->jalr > > JDK-23 (last version with trampoline calls) > Mean: 3189.5827 > Standard Deviation: 284.6478 > > JDK-25 > Mean: 3424.8905 > Standard Deviation: 222.2208 > > Patch: > Mean: 3144.8535 > Standard Deviation: 229.2577 > > > For the performance data, do you have some data for applying this fix on top of the next commit after `JDK-23 (last version with trampoline calls)`? I think this data might be more helpful to understand the performance comparison between old trampoline, stub and this pr. @Hamlin-Li This is the first version which had the new auipc+ld+jalr, i.e. we could toggle with UseTrampolines. I backported optimize_call to it. This version is still using t0/x5 for calls, thus return predictions are all messed up. An even better comparison would be to also back-port use t1/x6 for calls. Anyways here are the numbers, from 400 benchmark runs each using the last iteration: Using trampolines: ############## --- Statistical Analysis --- Average (Mean): 3610.00 Median: 3645.09 Standard Deviation: 297.11 -------------------------- Using load calls: ############## --- Statistical Analysis --- Average (Mean): 3691.09 Median: 3793.11 Standard Deviation: 403.80 -------------------------- ------------- PR Comment: https://git.openjdk.org/jdk/pull/26944#issuecomment-3246536967 From dlong at openjdk.org Tue Sep 2 20:52:32 2025 From: dlong at openjdk.org (Dean Long) Date: Tue, 2 Sep 2025 20:52:32 GMT Subject: RFR: 8366461: Remove obsolete method handle invoke logic [v3] In-Reply-To: References: Message-ID: <_pqvEs0LIlAc7RjFUwg-bpxS3D2v5U7c6In2sG8XLhQ=.57e3aead-6ac4-4a42-89d2-385d7e6ecedf@github.com> > At one time, JSR292 support needed special logic to save and restore SP across method handle intrinsic calls, but that is no longer the case. 
The only platform that still does the save/restore is arm32, which is no longer necessary. The save/restore can be removed along with related APIs and logic. Note that the arm32 port is largely based on the x86 port, which stopped doing the save/restore in jdk9 ([JDK-8068945](https://bugs.openjdk.org/browse/JDK-8068945)). Dean Long has updated the pull request incrementally with three additional commits since the last revision: - revert whitespace change - undo debug changes - cleanup ------------- Changes: - all: https://git.openjdk.org/jdk/pull/27059/files - new: https://git.openjdk.org/jdk/pull/27059/files/303305ae..eac482a5 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=27059&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27059&range=01-02 Stats: 7 lines in 4 files changed: 1 ins; 6 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/27059.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27059/head:pull/27059 PR: https://git.openjdk.org/jdk/pull/27059 From vlivanov at openjdk.org Tue Sep 2 21:11:46 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Tue, 2 Sep 2025 21:11:46 GMT Subject: RFR: 8366461: Remove obsolete method handle invoke logic [v3] In-Reply-To: <_pqvEs0LIlAc7RjFUwg-bpxS3D2v5U7c6In2sG8XLhQ=.57e3aead-6ac4-4a42-89d2-385d7e6ecedf@github.com> References: <_pqvEs0LIlAc7RjFUwg-bpxS3D2v5U7c6In2sG8XLhQ=.57e3aead-6ac4-4a42-89d2-385d7e6ecedf@github.com> Message-ID: On Tue, 2 Sep 2025 20:52:32 GMT, Dean Long wrote: >> At one time, JSR292 support needed special logic to save and restore SP across method handle intrinsic calls, but that is no longer the case. The only platform that still does the save/restore is arm32, which is no longer necessary. The save/restore can be removed along with related APIs and logic. Note that the arm32 port is largely based on the x86 port, which stopped doing the save/restore in jdk9 ([JDK-8068945](https://bugs.openjdk.org/browse/JDK-8068945)). 
> > Dean Long has updated the pull request incrementally with three additional commits since the last revision: > > - revert whitespace change > - undo debug changes > - cleanup Nice cleanup! Looks good. ------------- Marked as reviewed by vlivanov (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/27059#pullrequestreview-3178139499 From iveresov at openjdk.org Tue Sep 2 21:30:52 2025 From: iveresov at openjdk.org (Igor Veresov) Date: Tue, 2 Sep 2025 21:30:52 GMT Subject: Integrated: 8365407: Race condition in MethodTrainingData::verify() In-Reply-To: <0795hrryFZveQb4GgjNhdGJSYwIz98RHoJx3JX8LSDY=.dae4d10e-5a1f-49da-bec7-e77360f8026e@github.com> References: <0795hrryFZveQb4GgjNhdGJSYwIz98RHoJx3JX8LSDY=.dae4d10e-5a1f-49da-bec7-e77360f8026e@github.com> Message-ID: On Wed, 20 Aug 2025 18:19:25 GMT, Igor Veresov wrote: > This change fixes multiple issues with training data verification. While the current state of things in the mainline will not cause any issues (because of the absence of the call to `TD::verify()` during the shutdown), it does cause problems in the leyden repo. This change strengthens verification in the mainline (by adding the shutdown verify call), and fixes the problems that prevent it from working reliably. This pull request has now been integrated. 
Changeset: 991ac9e6 Author: Igor Veresov URL: https://git.openjdk.org/jdk/commit/991ac9e6168b2573f78772e2d7936792a43fe336 Stats: 90 lines in 6 files changed: 32 ins; 17 del; 41 mod 8365407: Race condition in MethodTrainingData::verify() Reviewed-by: kvn, vlivanov, iklam ------------- PR: https://git.openjdk.org/jdk/pull/26866 From dlong at openjdk.org Tue Sep 2 21:54:45 2025 From: dlong at openjdk.org (Dean Long) Date: Tue, 2 Sep 2025 21:54:45 GMT Subject: RFR: 8355354: C2 crashed: assert(_callee == nullptr || _callee == m) failed: repeated inline attempt with different callee [v3] In-Reply-To: References: <_eAERVexsTQc_Acje4IUJ9yqqE98dB4-hz_fJ0jrUhs=.b2194a63-2599-42f7-a65f-41c29bb37bc3@github.com> Message-ID: On Mon, 1 Sep 2025 06:50:28 GMT, Damon Fenacci wrote: >> # Issue >> The CTW test `applications/ctw/modules/java_xml.java` crashes when trying to repeat late inlining of a virtual method (after IGVN passes through the method's call node again). The failure originates [here](https://github.com/openjdk/jdk/blob/e2ae50d877b13b121912e2496af4b5209b315a05/src/hotspot/share/opto/callGenerator.cpp#L473) because `_callee != m`. Apparently when running IGVN a second time after a first late inline failure and [setting the callee in the call generator](https://github.com/openjdk/jdk/blob/e2ae50d877b13b121912e2496af4b5209b315a05/src/hotspot/share/opto/callnode.cpp#L1240) we notice that the previous callee is not the same as the current one. >> In this specific instance it seems that the issue happens when CTW is compiling Apache Xalan. >> >> # Cause >> The root of the issue has to do with repeated late inlining, class hierarchy analysis and dynamic class loading. >> >> For this particular issue the two differing methods are `org.apache.xalan.xsltc.compiler.LocationPathPattern::translate` first and `org.apache.xalan.xsltc.compiler.AncestorPattern::translate` the second time. `LocationPathPattern` is an abstract class but has a concrete `translate` method. 
`AncestorPattern` is a concrete class that extends another abstract class `RelativePathPattern` that extends `LocationPathPattern`. `AncestorPattern` overrides the translate method. >> What seems to be happening is the following: we compile a virtual call `RelativePathPattern::translate` and at compile time. Only the abstract classes `RelativePathPattern` <: `LocationPathPattern` are loaded. CHA then finds out that the call must always call `LocationPathPattern::translate` because the method is not overwritten anywhere else. However, there is still no non-abstract class in the entire class hierarchy, i.e. as soon as `AncestorPattern` is loaded, this class is then the only non-abstract class in the class hierarchy and therefore the receiver type must be `AncestorPattern`. >> >> More in general, when late inlining is repeated and classes are loaded dynamically, it is possible that the resolved method between a late inlining attempt and the next one is not the same. >> >> # Fix >> >> This looks like a very edge-case. If CHA is affected by class loading the original recorded dependency becomes invalid. This can possibly happen in other situations (e.g JVMTI class redefinition). So, instead of modifying the assert (to check for invalid dependencies) we avoid re-setting the callee method ... > > Damon Fenacci has updated the pull request incrementally with one additional commit since the last revision: > > JDK-8355354: avoid resetting callee in call node ideal src/hotspot/share/opto/compile.cpp line 2117: > 2115: cg->call_node()->set_generator(cg); > 2116: C->igvn_worklist()->push(cg->call_node()); > 2117: should_stress = true; I have a guess what this stress code is doing, but a good comment would help. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26441#discussion_r2317250134 From dlong at openjdk.org Tue Sep 2 22:16:44 2025 From: dlong at openjdk.org (Dean Long) Date: Tue, 2 Sep 2025 22:16:44 GMT Subject: RFR: 8366461: Remove obsolete method handle invoke logic [v3] In-Reply-To: References: <_pqvEs0LIlAc7RjFUwg-bpxS3D2v5U7c6In2sG8XLhQ=.57e3aead-6ac4-4a42-89d2-385d7e6ecedf@github.com> Message-ID: On Tue, 2 Sep 2025 21:09:27 GMT, Vladimir Ivanov wrote: >> Dean Long has updated the pull request incrementally with three additional commits since the last revision: >> >> - revert whitespace change >> - undo debug changes >> - cleanup > > Nice cleanup! Looks good. Thanks @iwanowww ! ------------- PR Comment: https://git.openjdk.org/jdk/pull/27059#issuecomment-3246957430 From wenanjian at openjdk.org Wed Sep 3 01:30:56 2025 From: wenanjian at openjdk.org (Anjian Wen) Date: Wed, 3 Sep 2025 01:30:56 GMT Subject: RFR: 8366747: RISC-V: Improve VerifyMethodHandles for method handle linkers Message-ID: According to JDK-8353216, add extra verification logic into MethodHandle::invokeBasic/linkTo* to ensure that holder classes are properly initialized on riscv platform. 
------------- Commit messages: - RISC-V: Improve VerifyMethodHandles for method handle linkers Changes: https://git.openjdk.org/jdk/pull/26938/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26938&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8366747 Stats: 52 lines in 2 files changed: 46 ins; 1 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/26938.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26938/head:pull/26938 PR: https://git.openjdk.org/jdk/pull/26938 From fyang at openjdk.org Wed Sep 3 02:04:43 2025 From: fyang at openjdk.org (Fei Yang) Date: Wed, 3 Sep 2025 02:04:43 GMT Subject: RFR: 8366747: RISC-V: Improve VerifyMethodHandles for method handle linkers In-Reply-To: References: Message-ID: On Tue, 26 Aug 2025 09:18:14 GMT, Anjian Wen wrote: > According to JDK-8353216, add extra verification logic into MethodHandle::invokeBasic/linkTo* to ensure that holder classes are properly initialized on riscv platform. Seems fine to me. Two minor comments. src/hotspot/cpu/riscv/methodHandles_riscv.cpp line 100: > 98: __ verify_method_ptr(method); > 99: if (VerifyMethodHandles) { > 100: Label L_ok; Can you add an assertion here about the registers? Like: `assert_different_registers(method, t0, t1);` src/hotspot/cpu/riscv/methodHandles_riscv.cpp line 102: > 100: Label L_ok; > 101: const Register method_holder = t1; > 102: __ load_method_holder(method_holder, method); Please leave a new line before the switch-case structure. 
------------- PR Review: https://git.openjdk.org/jdk/pull/26938#pullrequestreview-3178689768 PR Review Comment: https://git.openjdk.org/jdk/pull/26938#discussion_r2317554909 PR Review Comment: https://git.openjdk.org/jdk/pull/26938#discussion_r2317575345 From dzhang at openjdk.org Wed Sep 3 02:13:50 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Wed, 3 Sep 2025 02:13:50 GMT Subject: RFR: 8366747: RISC-V: Improve VerifyMethodHandles for method handle linkers In-Reply-To: References: Message-ID: On Tue, 26 Aug 2025 09:18:14 GMT, Anjian Wen wrote: > According to JDK-8353216, add extra verification logic into MethodHandle::invokeBasic/linkTo* to ensure that holder classes are properly initialized on riscv platform. LGTM, thanks! ------------- Marked as reviewed by dzhang (Author). PR Review: https://git.openjdk.org/jdk/pull/26938#pullrequestreview-3178726464 From wenanjian at openjdk.org Wed Sep 3 02:40:27 2025 From: wenanjian at openjdk.org (Anjian Wen) Date: Wed, 3 Sep 2025 02:40:27 GMT Subject: RFR: 8366747: RISC-V: Improve VerifyMethodHandles for method handle linkers [v2] In-Reply-To: References: Message-ID: > According to JDK-8353216, add extra verification logic into MethodHandle::invokeBasic/linkTo* to ensure that holder classes are properly initialized on riscv platform. 
Anjian Wen has updated the pull request incrementally with one additional commit since the last revision: Add assertion and modify format ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26938/files - new: https://git.openjdk.org/jdk/pull/26938/files/52f76be1..b5eb3bd1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26938&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26938&range=00-01 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26938.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26938/head:pull/26938 PR: https://git.openjdk.org/jdk/pull/26938 From wenanjian at openjdk.org Wed Sep 3 02:40:27 2025 From: wenanjian at openjdk.org (Anjian Wen) Date: Wed, 3 Sep 2025 02:40:27 GMT Subject: RFR: 8366747: RISC-V: Improve VerifyMethodHandles for method handle linkers [v2] In-Reply-To: References: Message-ID: On Wed, 3 Sep 2025 02:11:07 GMT, Dingli Zhang wrote: >> Anjian Wen has updated the pull request incrementally with one additional commit since the last revision: >> >> Add assertion and modify format > > LGTM, thanks! @DingliZhang Thanks for your review and approve! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26938#issuecomment-3247480195 From wenanjian at openjdk.org Wed Sep 3 02:40:28 2025 From: wenanjian at openjdk.org (Anjian Wen) Date: Wed, 3 Sep 2025 02:40:28 GMT Subject: RFR: 8366747: RISC-V: Improve VerifyMethodHandles for method handle linkers [v2] In-Reply-To: References: Message-ID: On Wed, 3 Sep 2025 01:40:43 GMT, Fei Yang wrote: >> Anjian Wen has updated the pull request incrementally with one additional commit since the last revision: >> >> Add assertion and modify format > > src/hotspot/cpu/riscv/methodHandles_riscv.cpp line 100: > >> 98: __ verify_method_ptr(method); >> 99: if (VerifyMethodHandles) { >> 100: Label L_ok; > > Can you add an assertion here about the registers? 
Like: `assert_different_registers(method, t0, t1);` Thanks for the review, I have added the assertion. > src/hotspot/cpu/riscv/methodHandles_riscv.cpp line 102: > >> 100: Label L_ok; >> 101: const Register method_holder = t1; >> 102: __ load_method_holder(method_holder, method); > > Please leave a new line before the switch-case structure. done ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26938#discussion_r2317610255 PR Review Comment: https://git.openjdk.org/jdk/pull/26938#discussion_r2317610296 From fyang at openjdk.org Wed Sep 3 02:46:43 2025 From: fyang at openjdk.org (Fei Yang) Date: Wed, 3 Sep 2025 02:46:43 GMT Subject: RFR: 8366747: RISC-V: Improve VerifyMethodHandles for method handle linkers [v2] In-Reply-To: References: Message-ID: <-VWFI0Aldis7LKGbm9HRzfokwCOnrjROtPgdE4Hqogw=.d3631ecd-02a6-413c-83ef-d7f2e877c216@github.com> On Wed, 3 Sep 2025 02:40:27 GMT, Anjian Wen wrote: >> According to JDK-8353216, add extra verification logic into MethodHandle::invokeBasic/linkTo* to ensure that holder classes are properly initialized on riscv platform. > > Anjian Wen has updated the pull request incrementally with one additional commit since the last revision: > > Add assertion and modify format Thanks for the update. ------------- Marked as reviewed by fyang (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/26938#pullrequestreview-3178764845 From galder at openjdk.org Wed Sep 3 06:40:49 2025 From: galder at openjdk.org (Galder Zamarreño) Date: Wed, 3 Sep 2025 06:40:49 GMT Subject: Integrated: 8329077: C2 SuperWord: Add MoveD2L, MoveL2D, MoveF2I, MoveI2F In-Reply-To: <0bYYOS5AYvN4ZD1xAGBRqV_xasw-np3JWKXC7WcGhyc=.74d97456-f406-4dbe-be09-77ed3b9a66fd@github.com> References: <0bYYOS5AYvN4ZD1xAGBRqV_xasw-np3JWKXC7WcGhyc=.74d97456-f406-4dbe-be09-77ed3b9a66fd@github.com> Message-ID: <28rS8Vqoyrc09J9cdn1tWXByIZUV9GL-_Hjcn3bMLBk=.ff9e5545-9f30-42cd-b2e1-56954bbdfbf2@github.com> On Thu, 24 Jul 2025 10:29:15 GMT, Galder Zamarreño wrote: > I've added support to vectorize `MoveD2L`, `MoveL2D`, `MoveF2I` and `MoveI2F` nodes. The implementation follows a similar pattern to what is done with conversion (`Conv*`) nodes. The tests in `TestCompatibleUseDefTypeSize` have been updated with the new expectations. > > Also added a JMH benchmark which measures throughput (the higher the number the better) for methods that exercise these nodes. On darwin/aarch64 it shows:
>
> Benchmark                                (seed)  (size)   Mode  Cnt       Base      Patch   Units   Diff
> VectorBitConversion.doubleToLongBits          0    2048  thrpt    8   1168.782   1157.717  ops/ms    -1%
> VectorBitConversion.doubleToRawLongBits       0    2048  thrpt    8   3999.387   7353.936  ops/ms   +83%
> VectorBitConversion.floatToIntBits            0    2048  thrpt    8   1200.338   1188.206  ops/ms    -1%
> VectorBitConversion.floatToRawIntBits         0    2048  thrpt    8   4058.248  14792.474  ops/ms  +264%
> VectorBitConversion.intBitsToFloat            0    2048  thrpt    8   3050.313  14984.246  ops/ms  +391%
> VectorBitConversion.longBitsToDouble          0    2048  thrpt    8   3022.691   7379.360  ops/ms  +144%
>
> The improvements observed are a result of vectorization. The lack of vectorization in `doubleToLongBits` and `floatToIntBits` demonstrates that these changes do not affect their performance. These methods do not vectorize because of flow control. 
> > I've run the tier1-3 tests on linux/aarch64 and didn't observe any regressions. This pull request has now been integrated. Changeset: 8c4090c2 Author: Galder Zamarreño Committer: Roland Westrelin URL: https://git.openjdk.org/jdk/commit/8c4090c2cfa00f9c3550669a0726a785b30ac1d5 Stats: 67 lines in 4 files changed: 57 ins; 4 del; 6 mod 8329077: C2 SuperWord: Add MoveD2L, MoveL2D, MoveF2I, MoveI2F Reviewed-by: epeter, qamai ------------- PR: https://git.openjdk.org/jdk/pull/26457 From dfenacci at openjdk.org Wed Sep 3 06:50:26 2025 From: dfenacci at openjdk.org (Damon Fenacci) Date: Wed, 3 Sep 2025 06:50:26 GMT Subject: RFR: 8355354: C2 crashed: assert(_callee == nullptr || _callee == m) failed: repeated inline attempt with different callee [v4] In-Reply-To: <_eAERVexsTQc_Acje4IUJ9yqqE98dB4-hz_fJ0jrUhs=.b2194a63-2599-42f7-a65f-41c29bb37bc3@github.com> References: <_eAERVexsTQc_Acje4IUJ9yqqE98dB4-hz_fJ0jrUhs=.b2194a63-2599-42f7-a65f-41c29bb37bc3@github.com> Message-ID: > # Issue > The CTW test `applications/ctw/modules/java_xml.java` crashes when trying to repeat late inlining of a virtual method (after IGVN passes through the method's call node again). The failure originates [here](https://github.com/openjdk/jdk/blob/e2ae50d877b13b121912e2496af4b5209b315a05/src/hotspot/share/opto/callGenerator.cpp#L473) because `_callee != m`. Apparently when running IGVN a second time after a first late inline failure and [setting the callee in the call generator](https://github.com/openjdk/jdk/blob/e2ae50d877b13b121912e2496af4b5209b315a05/src/hotspot/share/opto/callnode.cpp#L1240) we notice that the previous callee is not the same as the current one. > In this specific instance it seems that the issue happens when CTW is compiling Apache Xalan. > > # Cause > The root of the issue has to do with repeated late inlining, class hierarchy analysis and dynamic class loading. 
> > For this particular issue the two differing methods are `org.apache.xalan.xsltc.compiler.LocationPathPattern::translate` first and `org.apache.xalan.xsltc.compiler.AncestorPattern::translate` the second time. `LocationPathPattern` is an abstract class but has a concrete `translate` method. `AncestorPattern` is a concrete class that extends another abstract class `RelativePathPattern` that extends `LocationPathPattern`. `AncestorPattern` overrides the translate method. > What seems to be happening is the following: we compile a virtual call `RelativePathPattern::translate` and at compile time. Only the abstract classes `RelativePathPattern` <: `LocationPathPattern` are loaded. CHA then finds out that the call must always call `LocationPathPattern::translate` because the method is not overwritten anywhere else. However, there is still no non-abstract class in the entire class hierarchy, i.e. as soon as `AncestorPattern` is loaded, this class is then the only non-abstract class in the class hierarchy and therefore the receiver type must be `AncestorPattern`. > > More in general, when late inlining is repeated and classes are loaded dynamically, it is possible that the resolved method between a late inlining attempt and the next one is not the same. > > # Fix > > This looks like a very edge-case. If CHA is affected by class loading the original recorded dependency becomes invalid. This can possibly happen in other situations (e.g JVMTI class redefinition). So, instead of modifying the assert (to check for invalid dependencies) we avoid re-setting the callee method if it is already defined. > > # T... 
Damon Fenacci has updated the pull request incrementally with one additional commit since the last revision: JDK-8355354: add stress comment ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26441/files - new: https://git.openjdk.org/jdk/pull/26441/files/ce807553..bf92e244 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26441&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26441&range=02-03 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26441.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26441/head:pull/26441 PR: https://git.openjdk.org/jdk/pull/26441 From dfenacci at openjdk.org Wed Sep 3 06:50:27 2025 From: dfenacci at openjdk.org (Damon Fenacci) Date: Wed, 3 Sep 2025 06:50:27 GMT Subject: RFR: 8355354: C2 crashed: assert(_callee == nullptr || _callee == m) failed: repeated inline attempt with different callee [v3] In-Reply-To: References: <_eAERVexsTQc_Acje4IUJ9yqqE98dB4-hz_fJ0jrUhs=.b2194a63-2599-42f7-a65f-41c29bb37bc3@github.com> Message-ID: On Tue, 2 Sep 2025 21:52:27 GMT, Dean Long wrote: >> Damon Fenacci has updated the pull request incrementally with one additional commit since the last revision: >> >> JDK-8355354: avoid resetting callee in call node ideal > > src/hotspot/share/opto/compile.cpp line 2117: > >> 2115: cg->call_node()->set_generator(cg); >> 2116: C->igvn_worklist()->push(cg->call_node()); >> 2117: should_stress = true; > > I have a guess what this stress code is doing, but a good comment would help. Sure! Comment added. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26441#discussion_r2317935692 From rehn at openjdk.org Wed Sep 3 06:54:29 2025 From: rehn at openjdk.org (Robbin Ehn) Date: Wed, 3 Sep 2025 06:54:29 GMT Subject: RFR: 8365926: RISC-V: Performance regression in renaissance (chi-square) [v4] In-Reply-To: References: Message-ID: > Hey, please consider! > > A bunch of info in JBS entry, please read that also. 
> > I narrowed this issue down to the old jal optimization, making direct calls when in reach. > This patch restores them and removes this regression. > > In essence we turn "jalr ra,0(t1)" into a "jal ra," if reachable, and restore the jalr if a new destination is not reachable. > > Please test on your hardware! > > > Chi Square (100 runs each, 10 fastest iterations of each run, P550) > JDK-23 (last version with trampoline calls) > Mean: 3189.5827 > Standard Deviation: 284.6478 > > JDK-25 > Mean: 3424.8905 > Standard Deviation: 222.2208 > > Patch: > Mean: 3144.8535 > Standard Deviation: 229.2577 > > > No issues found in t1, running t2 also. Stress tested on vf2, bpi-f3, p550. Robbin Ehn has updated the pull request incrementally with one additional commit since the last revision: Review comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26944/files - new: https://git.openjdk.org/jdk/pull/26944/files/f0f7f20e..72e3ba6a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26944&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26944&range=02-03 Stats: 10 lines in 1 file changed: 1 ins; 0 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/26944.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26944/head:pull/26944 PR: https://git.openjdk.org/jdk/pull/26944 From mhaessig at openjdk.org Wed Sep 3 07:17:44 2025 From: mhaessig at openjdk.org (Manuel Hässig) Date: Wed, 3 Sep 2025 07:17:44 GMT Subject: RFR: 8366461: Remove obsolete method handle invoke logic [v3] In-Reply-To: <_pqvEs0LIlAc7RjFUwg-bpxS3D2v5U7c6In2sG8XLhQ=.57e3aead-6ac4-4a42-89d2-385d7e6ecedf@github.com> References: <_pqvEs0LIlAc7RjFUwg-bpxS3D2v5U7c6In2sG8XLhQ=.57e3aead-6ac4-4a42-89d2-385d7e6ecedf@github.com> Message-ID: On Tue, 2 Sep 2025 20:52:32 GMT, Dean Long wrote: >> At one time, JSR292 support needed special logic to save and restore SP across method handle intrinsic calls, but that is no longer the case. 
The only platform that still does the save/restore is arm32, which is no longer necessary. The save/restore can be removed along with related APIs and logic. Note that the arm32 port is largely based on the x86 port, which stopped doing the save/restore in jdk9 ([JDK-8068945](https://bugs.openjdk.org/browse/JDK-8068945)). > > Dean Long has updated the pull request incrementally with three additional commits since the last revision: > > - revert whitespace change > - undo debug changes > - cleanup Thank you for cleaning this up, @dean-long. I just have a drive-by comment. src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/runtime/aarch64/AARCH64Frame.java line 372: > 370: // DEBUG_ONLY(verifyDeoptriginalPc(senderNm, raw_unextendedSp)); > 371: } > 372: } `Frame.java adjustUnextendedSP()` do not seem to do anything? Perhaps these could be cleaned up as well? ------------- PR Review: https://git.openjdk.org/jdk/pull/27059#pullrequestreview-3179245014 PR Review Comment: https://git.openjdk.org/jdk/pull/27059#discussion_r2317990499 From duke at openjdk.org Wed Sep 3 07:22:27 2025 From: duke at openjdk.org (erifan) Date: Wed, 3 Sep 2025 07:22:27 GMT Subject: RFR: 8365911: AArch64: Fix encoding error in sve_cpy for negative floats [v2] In-Reply-To: References: Message-ID: > The sve_cpy instruction is not correctly implemented for negative floating-point values. The issues include: > > 1. When a negative floating-point number (e.g. `-1.0`) is passed, the `checked_cast(pack(d))` check fails. For example, assume `d = -1.0`: > - `pack(-1.0)` returns an unsigned int with the 7th bit set, i.e., `0xf0`. > - `checked_cast(0xf0)` casts `0xf0` to an int8_t value, which is `-16`. > - Casting this int8_t `-16` back to unsigned int results in `0xfffffff0`. > - The check compares `0xf0` to `0xfffffff0`, which obviously fails. > > 2. Additionally, the encoding of the negative floating-point number is incorrect: > - The imm8 field can fall outside the valid range of **[-128, 127]**. 
> - Bit **13** should be encoded as **0** for floating-point numbers. > > This PR fixes these issues and renames floating-point `sve_cpy` as `sve_fcpy`. > > Some test cases are added to aarch64-asmtest.py, and all tests passed. erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: - Don't rename sve_cpy as sve_fcpy - Merge branch 'master' into JDK-8365911 - 8365911: AArch64: Fix encoding error in sve_cpy for negative floats The sve_cpy instruction is not correctly implemented for negative floating-point values. The issues include: 1. When a negative floating-point number (e.g. `-1.0`) is passed, the `checked_cast(pack(d))` check fails. For example, assume `d = -1.0`: - `pack(-1.0)` returns an unsigned int with the 7th bit set, i.e., `0xf0`. - `checked_cast(0xf0)` casts `0xf0` to an int8_t value, which is `-16`. - Casting this int8_t `-16` back to unsigned int results in `0xfffffff0`. - The check compares `0xf0` to `0xfffffff0`, which obviously fails. 2. Additionally, the encoding of the negative floating-point number is incorrect: - The imm8 field can fall outside the valid range of **[-128, 127]**. - Bit **13** should be encoded as **0** for floating-point numbers. This PR fixes these issues and renames floating-point `sve_cpy` as `sve_fcpy`. Some test cases are added to aarch64-asmtest.py, and all tests passed. 
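As a side illustration of item 1 above (not part of the patch), the round trip that trips the check can be reproduced with plain integer casts. This is a minimal sketch under stated assumptions: `round_trip_through_int8` and `check_round_trip_examples` are hypothetical helpers, not HotSpot code, and they only mimic the narrowing and widening that the description attributes to the failing check. `0xf0` is the packed value quoted above for `-1.0`; `0x70` is just an arbitrary value with bit 7 clear.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not HotSpot code): narrow to int8_t, then widen back
// to unsigned int, as the description says the failing check effectively does.
constexpr uint32_t round_trip_through_int8(uint32_t v) {
  return static_cast<uint32_t>(static_cast<int8_t>(v));
}

// 0xf0 has bit 7 set: as int8_t it is -16, which sign-extends to 0xfffffff0,
// so comparing the round-tripped value against the original 0xf0 fails.
static_assert(round_trip_through_int8(0xf0u) == 0xfffffff0u, "sign extension");
static_assert(round_trip_through_int8(0xf0u) != 0xf0u, "equality check fails");

// Runtime spelling of the same observation, plus a value with bit 7 clear
// that survives the round trip unchanged.
inline void check_round_trip_examples() {
  assert(round_trip_through_int8(0xf0u) == 0xfffffff0u);
  assert(round_trip_through_int8(0x70u) == 0x70u);
}
```

How the actual fix avoids the mismatch is not shown in this message; the sketch only demonstrates why an 8-bit pattern with the top bit set cannot pass an equality check after being routed through a signed 8-bit type.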
------------- Changes: - all: https://git.openjdk.org/jdk/pull/26951/files - new: https://git.openjdk.org/jdk/pull/26951/files/dad0e011..16a06948 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26951&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26951&range=00-01 Stats: 16389 lines in 782 files changed: 11717 ins; 2213 del; 2459 mod Patch: https://git.openjdk.org/jdk/pull/26951.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26951/head:pull/26951 PR: https://git.openjdk.org/jdk/pull/26951 From duke at openjdk.org Wed Sep 3 07:22:27 2025 From: duke at openjdk.org (erifan) Date: Wed, 3 Sep 2025 07:22:27 GMT Subject: RFR: 8365911: AArch64: Fix encoding error in sve_cpy for negative floats In-Reply-To: <-G8GwIflOhFjOL-PAG6_oylu0Fa9c8iNUB57EC6oo4s=.a0126087-2a97-4542-a555-27c12578fccf@github.com> References: <2R6O7Jhv3catwxc6rXJdh7Uiq-NFBp7beCmP49CLTqU=.7ba72e39-6efd-47fe-8ad9-6df54a45c99b@github.com> <-G8GwIflOhFjOL-PAG6_oylu0Fa9c8iNUB57EC6oo4s=.a0126087-2a97-4542-a555-27c12578fccf@github.com> Message-ID: On Tue, 2 Sep 2025 08:10:02 GMT, Andrew Haley wrote: > I do. Thank you. Ok, I have reverted the refactoring. Please help take another look, thanks~ ------------- PR Comment: https://git.openjdk.org/jdk/pull/26951#issuecomment-3247994550 From rehn at openjdk.org Wed Sep 3 07:56:43 2025 From: rehn at openjdk.org (Robbin Ehn) Date: Wed, 3 Sep 2025 07:56:43 GMT Subject: RFR: 8365926: RISC-V: Performance regression in renaissance (chi-square) [v2] In-Reply-To: References: Message-ID: On Tue, 2 Sep 2025 12:48:11 GMT, Hamlin Li wrote: >> From JBS entry, the point is to do it in a sane order: >> >> The release in make_jal_opt so to make sure the store to instruction stream happens before I-cache flush. >> >> 1: store destination to stub >> 2: release >> 3: store destination to instruction stream >> 4: release >> 5: i-cache flush > > I don't see a detailed discussion about why there needs to be 2 `release`. 
> Seems the `2: release` is redundant? does a single release (step 4) after step 3 work as well? Regarding 4: From this code's perspective, the i-cache invalidate is a bit opaque. We do know that we don't want the store to happen after the flush. The RISC-V implementation does emit a full fence before the flush, as stores may be reordered over fence.i. But AbstractICache::invalidate_range is not documented to guarantee this effect. Regarding 2: If someone executes the new instruction once it has been changed to jalr (3), we want them to call the new location we stored to the stub (1). By saying 1 happens before 3, we convey our intent. AArch64 also has this. So none of the releases (2, 4) is truly needed AFAICT, as this code must support calling both the old dest and the new dest. E.g. if you are context switched out after loading the old dest, then switched back in and execute the jalr, you will be calling the old dest, which is fine as that method is marked not-entrant. That causes you to resolve this call, and then you will see the new dest. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26944#discussion_r2318105086 From rehn at openjdk.org Wed Sep 3 07:56:44 2025 From: rehn at openjdk.org (Robbin Ehn) Date: Wed, 3 Sep 2025 07:56:44 GMT Subject: RFR: 8365926: RISC-V: Performance regression in renaissance (chi-square) [v2] In-Reply-To: References: Message-ID: On Tue, 2 Sep 2025 12:47:23 GMT, Hamlin Li wrote: >> Maybe we can give it a new name to avoid possible confusion? `jmp_pc` or simply `pc`? > >> We only change the instruction at "instruction_address() + 2 * NativeInstruction::instruction_size". > > Right! > >> Note that jal_pos and jal_pc means a "jump and link instruction", not specifically jal or jalr. > > As we're patching either `jal` or `jalr` instruction, so jal is misleading, I agree `jmp_xxx` is a better name. 
fixed ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26944#discussion_r2318105646 From dskantz at openjdk.org Wed Sep 3 08:02:04 2025 From: dskantz at openjdk.org (Daniel Skantz) Date: Wed, 3 Sep 2025 08:02:04 GMT Subject: RFR: 8362117: C2: compiler/stringopts/TestStackedConcatsAppendUncommonTrap.java fails with a wrong result due to invalidated liveness assumptions for data phis [v2] In-Reply-To: References: Message-ID: <11lcsXkMGpKMQr60NCKofzldqpnJka1XZtrGRrUai3o=.c2201234-bbf2-465a-b237-cd9fe8505491@github.com> > This PR addresses a wrong compilation during string optimizations. > > During stacked string concatenation of two StringBuilder links SB1 and SB2, the pattern "append -> Phi -> Region -> (True, False) -> If -> Bool -> CmpP -> Proj (Result) -> toString" may be observed, where toString is the end of SB1, and the simple diamond is part of SB2. > > After JDK-8291775, the Bool test to the diamond If is set to a constant zero to allow for folding the simple diamond away during IGVN, while not letting the top() value from the result projection of SB1 propagate through the graph too quickly. The assumption was that any data Phi of the Region would go away during PhaseRemoveUseless as they are no longer live -- I think that in the case of JDK-8291775, the user of phi was the constructor of SB2. However, in the attached test case, the Phi stays live as it's a parameter (input to an append) of SB2 and will be used during the transformation in `copy_string`. When the diamond region is later folded, the Phi's user picks up the wrong input corresponding to the false branch. > > The proposed solution is to disable the stacked concatenation optimization for this specific pattern. This might be pragmatic as it's an edge case and there's already a bug tail: JDK-8271341-> JDK-8291775 -> JDK-8362117. > > Testing: T1-3 (aed5952). 
> > Extra testing: ran T1-3 on Linux with an instrumented build and verified that the pattern I am excluding in this PR is not seen during any other compilation than that of the proposed regression test. Daniel Skantz has updated the pull request incrementally with two additional commits since the last revision: - store intermediate calculations - direction convention ------------- Changes: - all: https://git.openjdk.org/jdk/pull/27028/files - new: https://git.openjdk.org/jdk/pull/27028/files/8e93056d..5638f221 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=27028&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27028&range=00-01 Stats: 8 lines in 1 file changed: 3 ins; 0 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/27028.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27028/head:pull/27028 PR: https://git.openjdk.org/jdk/pull/27028 From aph at openjdk.org Wed Sep 3 08:14:45 2025 From: aph at openjdk.org (Andrew Haley) Date: Wed, 3 Sep 2025 08:14:45 GMT Subject: RFR: 8365911: AArch64: Fix encoding error in sve_cpy for negative floats [v2] In-Reply-To: References: Message-ID: On Wed, 3 Sep 2025 07:22:27 GMT, erifan wrote: >> The sve_cpy instruction is not correctly implemented for negative floating-point values. The issues include: >> >> 1. When a negative floating-point number (e.g. `-1.0`) is passed, the `checked_cast<int8_t>(pack(d))` check fails. For example, assume `d = -1.0`: >> - `pack(-1.0)` returns an unsigned int with the 7th bit set, i.e., `0xf0`. >> - `checked_cast<int8_t>(0xf0)` casts `0xf0` to an int8_t value, which is `-16`. >> - Casting this int8_t `-16` back to unsigned int results in `0xfffffff0`. >> - The check compares `0xf0` to `0xfffffff0`, which obviously fails. >> >> 2. Additionally, the encoding of the negative floating-point number is incorrect: >> - The imm8 field can fall outside the valid range of **[-128, 127]**. >> - Bit **13** should be encoded as **0** for floating-point numbers.
>> >> This PR fixes these issues and renames floating-point `sve_cpy` as `sve_fcpy`. >> >> Some test cases are added to aarch64-asmtest.py, and all tests passed. > > erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Don't rename sve_cpy as sve_fcpy > - Merge branch 'master' into JDK-8365911 > - 8365911: AArch64: Fix encoding error in sve_cpy for negative floats > > The sve_cpy instruction is not correctly implemented for negative > floating-point values. The issues include: > > 1. When a negative floating-point number (e.g. `-1.0`) is passed, the > `checked_cast<int8_t>(pack(d))` check fails. For example, assume `d = -1.0`: > - `pack(-1.0)` returns an unsigned int with the 7th bit set, i.e., `0xf0`. > - `checked_cast<int8_t>(0xf0)` casts `0xf0` to an int8_t value, which is `-16`. > - Casting this int8_t `-16` back to unsigned int results in `0xfffffff0`. > - The check compares `0xf0` to `0xfffffff0`, which obviously fails. > > 2. Additionally, the encoding of the negative floating-point number is incorrect: > - The imm8 field can fall outside the valid range of **[-128, 127]**. > - Bit **13** should be encoded as **0** for floating-point numbers. > > This PR fixes these issues and renames floating-point `sve_cpy` as `sve_fcpy`. > > Some test cases are added to aarch64-asmtest.py, and all tests passed. This looks good, modulo the minor style fixes. src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 3819: > 3817: if (isFloat) { > 3818: assert(T != B, "invalid size"); > 3819: assert((imm8 >> 8) == 0, "invalid immediate"); Suggestion: assert((imm8 & 0xff) == 0, "invalid immediate"); To match line 3819. src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 3831: > 3829: int m = isMerge ? 1 : 0; > 3830: f(0b00000101, 31, 24), f(T, 23, 22), f(0b01, 21, 20); > 3831: prf(Pg, 16), f(isFloat ?
1 : 0, 15), f(m, 14), f(sh, 13), f(imm8&0xff, 12, 5), rf(Zd, 0); Suggestion: prf(Pg, 16), f(isFloat ? 1 : 0, 15), f(m, 14), f(sh, 13), f(imm8 & 0xff, 12, 5), rf(Zd, 0); General HotSpot style. ------------- PR Review: https://git.openjdk.org/jdk/pull/26951#pullrequestreview-3179466006 PR Review Comment: https://git.openjdk.org/jdk/pull/26951#discussion_r2318148242 PR Review Comment: https://git.openjdk.org/jdk/pull/26951#discussion_r2318149316 From mli at openjdk.org Wed Sep 3 08:15:46 2025 From: mli at openjdk.org (Hamlin Li) Date: Wed, 3 Sep 2025 08:15:46 GMT Subject: RFR: 8365926: RISC-V: Performance regression in renaissance (chi-square) [v2] In-Reply-To: References: Message-ID: On Tue, 2 Sep 2025 19:28:51 GMT, Robbin Ehn wrote: > ``` > Using trampolines: > ############## > --- Statistical Analysis --- > Average (Mean): 3610.00 > Median: 3645.09 > Standard Deviation: 297.11 > -------------------------- > > Using load calls: > ############## > --- Statistical Analysis --- > Average (Mean): 3691.09 > Median: 3793.11 > Standard Deviation: 403.80 > -------------------------- > ``` Not sure if I understood the data right. Does this mean old trampolines perform better than new implementation related to benchmark chi-square? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26944#issuecomment-3248164717 From duke at openjdk.org Wed Sep 3 08:25:58 2025 From: duke at openjdk.org (erifan) Date: Wed, 3 Sep 2025 08:25:58 GMT Subject: RFR: 8365911: AArch64: Fix encoding error in sve_cpy for negative floats [v2] In-Reply-To: References: Message-ID: On Wed, 3 Sep 2025 08:11:27 GMT, Andrew Haley wrote: >> erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains three additional commits since the last revision: >> >> - Don't rename sve_cpy as sve_fcpy >> - Merge branch 'master' into JDK-8365911 >> - 8365911: AArch64: Fix encoding error in sve_cpy for negative floats >> >> The sve_cpy instruction is not correctly implemented for negative >> floating-point values. The issues include: >> >> 1. When a negative floating-point number (e.g. `-1.0`) is passed, the >> `checked_cast<int8_t>(pack(d))` check fails. For example, assume `d = -1.0`: >> - `pack(-1.0)` returns an unsigned int with the 7th bit set, i.e., `0xf0`. >> - `checked_cast<int8_t>(0xf0)` casts `0xf0` to an int8_t value, which is `-16`. >> - Casting this int8_t `-16` back to unsigned int results in `0xfffffff0`. >> - The check compares `0xf0` to `0xfffffff0`, which obviously fails. >> >> 2. Additionally, the encoding of the negative floating-point number is incorrect: >> - The imm8 field can fall outside the valid range of **[-128, 127]**. >> - Bit **13** should be encoded as **0** for floating-point numbers. >> >> This PR fixes these issues and renames floating-point `sve_cpy` as `sve_fcpy`. >> >> Some test cases are added to aarch64-asmtest.py, and all tests passed. > > src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 3819: > >> 3817: if (isFloat) { >> 3818: assert(T != B, "invalid size"); >> 3819: assert((imm8 >> 8) == 0, "invalid immediate"); > > Suggestion: > > assert((imm8 & 0xff) == 0, "invalid immediate"); > > To match line 3819.
This may not be the case: `imm8 >> 8` is not equivalent to `imm8 & 0xff`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26951#discussion_r2318181658 From rehn at openjdk.org Wed Sep 3 08:28:42 2025 From: rehn at openjdk.org (Robbin Ehn) Date: Wed, 3 Sep 2025 08:28:42 GMT Subject: RFR: 8365926: RISC-V: Performance regression in renaissance (chi-square) [v2] In-Reply-To: References: Message-ID: On Wed, 3 Sep 2025 08:13:24 GMT, Hamlin Li wrote: > > ``` > > Using trampolines: > > ############## > > --- Statistical Analysis --- > > Average (Mean): 3610.00 > > Median: 3645.09 > > Standard Deviation: 297.11 > > -------------------------- > > > > Using load calls: > > ############## > > --- Statistical Analysis --- > > Average (Mean): 3691.09 > > Median: 3793.11 > > Standard Deviation: 403.80 > > -------------------------- > > ``` > > Not sure if I understood the data right. Does this mean old trampolines perform better than new implementation related to benchmark chi-square? As the averages are within one standard deviation of each other, it's not statistically certain. It does indicate that, but as I said, without also backporting "8340241: RISC-V: Returns mispredicted", it's not so clear, to me at least. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26944#issuecomment-3248205920 From aph at openjdk.org Wed Sep 3 08:30:43 2025 From: aph at openjdk.org (Andrew Haley) Date: Wed, 3 Sep 2025 08:30:43 GMT Subject: RFR: 8365911: AArch64: Fix encoding error in sve_cpy for negative floats [v2] In-Reply-To: References: Message-ID: On Wed, 3 Sep 2025 08:23:32 GMT, erifan wrote: >> src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 3819: >> >>> 3817: if (isFloat) { >>> 3818: assert(T != B, "invalid size"); >>> 3819: assert((imm8 >> 8) == 0, "invalid immediate"); >> >> Suggestion: >> >> assert((imm8 & 0xff) == 0, "invalid immediate"); >> >> To match line 3819.
> > This may not be the case, `imm8 >> 8` doesn't equal to `imm8 & 0xff` What is the range of values you're trying to test? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26951#discussion_r2318194266 From epeter at openjdk.org Wed Sep 3 08:33:47 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 3 Sep 2025 08:33:47 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v2] In-Reply-To: References: <0WKwHjzEn5dxYLkonrk4h9yfMI3r3bKDdqgG06J69N4=.e19e9441-6197-4d53-a4f4-b196a81f69d8@github.com> Message-ID: On Fri, 23 May 2025 04:42:08 GMT, Vladimir Ivanov wrote: >> Vladimir Ivanov has updated the pull request incrementally with one additional commit since the last revision: >> >> review feedback > >>> Representing ReachabilityFence as memory barrier (e.g., MemBarCPUOrder) would solve the issue, but performance costs are prohibitively high. > >> How bad is it? MemBarCPUOrder pinches all memory, so I assume this breaks a lot of optimizations when RF is sitting in the hot loop? I remember we went through a similar exercise with Blackholes: [JDK-8296545](https://bugs.openjdk.org/browse/JDK-8296545) -- and decided to pinch only the control. I guessing this is not enough to fix RF, or is it? > > Yes, if a barrier stays inside loop body, it breaks a lot of important optimizations. It may end up almost as bad as a full-blown call (except a barrier can be moved around while a call can't). And moving a node when it depends both on control and memory is more complicated than just a CFG node. Moreover, as you can see in the proposed solution, even CFG-only representation is problematic for loop opts, so additional care is needed to ensure RFs are moved out of loops. > > As an alternative approach, I thought about reifying RF as a data node (think of `CastPP`) and then linking its referent to all safepoints it dominates after loop opts are over. But that would only affect `optimize_reachability_fences()`. 
Everything else would stay the same. So, I decided to stay with CFG-only representation for now. @iwanowww Let me know whenever this is ready to review again ? ------------- PR Comment: https://git.openjdk.org/jdk/pull/25315#issuecomment-3248221299 From duke at openjdk.org Wed Sep 3 08:36:42 2025 From: duke at openjdk.org (erifan) Date: Wed, 3 Sep 2025 08:36:42 GMT Subject: RFR: 8365911: AArch64: Fix encoding error in sve_cpy for negative floats [v2] In-Reply-To: References: Message-ID: On Wed, 3 Sep 2025 08:28:30 GMT, Andrew Haley wrote: >> This may not be the case, `imm8 >> 8` doesn't equal to `imm8 & 0xff` > > What is the range of values you're trying to test? It's hard to say, because it is actually the value bits of a fp8. Simply put, the lower 8 bits are valid values. The remaining bits must be 0. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26951#discussion_r2318210500 From aph at openjdk.org Wed Sep 3 08:53:46 2025 From: aph at openjdk.org (Andrew Haley) Date: Wed, 3 Sep 2025 08:53:46 GMT Subject: RFR: 8365911: AArch64: Fix encoding error in sve_cpy for negative floats [v2] In-Reply-To: References: Message-ID: On Wed, 3 Sep 2025 08:33:40 GMT, erifan wrote: >> What is the range of values you're trying to test? > > It's hard to say, because it is actually the value bits of a fp8. > > Simply put, the lower 8 bits are valid values. The remaining bits must be 0. Sorry, thinko. I meant to say `imm8 & ~0xff` but never mind, let it stand. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26951#discussion_r2318269552 From mli at openjdk.org Wed Sep 3 09:19:42 2025 From: mli at openjdk.org (Hamlin Li) Date: Wed, 3 Sep 2025 09:19:42 GMT Subject: RFR: 8365926: RISC-V: Performance regression in renaissance (chi-square) [v2] In-Reply-To: References: Message-ID: On Wed, 3 Sep 2025 07:54:10 GMT, Robbin Ehn wrote: > But the AbstractICache::invalidate_range is not documented to guarantee to have this effect. 
What does "not documented" mean here? From reading the code, it seems `AbstractICache::invalidate_range` delegates to `icache_flush` on RISC-V, which does the fence and the flush. BTW, here are some comments from hotspot/share/runtime/icache.hpp, // Default implementation is in icache.cpp, and can be hidden per-platform. // Most platforms must provide only ICacheStubGenerator::generate_icache_flush(). > If someone executes the new instruction when changed to jalr(3), we did want them to call the new location we stored to the stub(1). By saying 1 happens before 3, we convey our intent. > Aarch64 also have this. Makes sense! In the worst case, what would happen if we removed the two releases here and just relied on the `fence rw, rw` in `AbstractICache::invalidate_range`? It seems we'd be fine, based on your latter comment. I suppose these two extra releases bring some performance penalty? If so, I'm not sure it's worth handling such a rare condition this rigorously. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26944#discussion_r2318341469 From rehn at openjdk.org Wed Sep 3 09:49:49 2025 From: rehn at openjdk.org (Robbin Ehn) Date: Wed, 3 Sep 2025 09:49:49 GMT Subject: RFR: 8365926: RISC-V: Performance regression in renaissance (chi-square) [v2] In-Reply-To: References: Message-ID: On Wed, 3 Sep 2025 09:17:14 GMT, Hamlin Li wrote: > > But the AbstractICache::invalidate_range is not documented to guarantee to have this effect. > > What does "not documented" mean here? From reading the code, it seems `AbstractICache::invalidate_range` delegates to `icache_flush` on RISC-V, which does the fence and the flush. > > BTW, here are some comments from hotspot/share/runtime/icache.hpp, > > ``` > // Default implementation is in icache.cpp, and can be hidden per-platform. > // Most platforms must provide only ICacheStubGenerator::generate_icache_flush(). > ``` Yes, and it doesn't say this method also provides a release fence or anything like that.
In other general code we seem to need it; I can replace release (4) with a comment if you like. > > > If someone executes the new instruction when changed to jalr(3), we did want them to call the new location we stored to the stub(1). By saying 1 happens before 3, we convey our intent. > > Aarch64 also have this. > > Makes sense! In the worst case, what would happen if we removed the two releases here and just relied on the `fence rw, rw` in `AbstractICache::invalidate_range`? It seems we'd be fine, based on your latter comment. I suppose these two extra releases bring some performance penalty? If so, I'm not sure it's worth handling such a rare condition this rigorously. Yes, we should be fine, but there is no reason not to store them in the desired order. No, there is no performance difference: this code is not executed often, and the call to invalidate_range is so slow that nothing else matters. You are talking about removing a few cycles from something that takes tens of thousands of cycles. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26944#discussion_r2318417867 From duke at openjdk.org Wed Sep 3 10:02:24 2025 From: duke at openjdk.org (erifan) Date: Wed, 3 Sep 2025 10:02:24 GMT Subject: RFR: 8365911: AArch64: Fix encoding error in sve_cpy for negative floats [v3] In-Reply-To: References: Message-ID: > The sve_cpy instruction is not correctly implemented for negative floating-point values. The issues include: > > 1. When a negative floating-point number (e.g. `-1.0`) is passed, the `checked_cast<int8_t>(pack(d))` check fails. For example, assume `d = -1.0`: > - `pack(-1.0)` returns an unsigned int with the 7th bit set, i.e., `0xf0`. > - `checked_cast<int8_t>(0xf0)` casts `0xf0` to an int8_t value, which is `-16`. > - Casting this int8_t `-16` back to unsigned int results in `0xfffffff0`. > - The check compares `0xf0` to `0xfffffff0`, which obviously fails. > > 2.
Additionally, the encoding of the negative floating-point number is incorrect: > - The imm8 field can fall outside the valid range of **[-128, 127]**. > - Bit **13** should be encoded as **0** for floating-point numbers. > > This PR fixes these issues and renames floating-point `sve_cpy` as `sve_fcpy`. > > Some test cases are added to aarch64-asmtest.py, and all tests passed. erifan has updated the pull request incrementally with one additional commit since the last revision: Code style fixes ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26951/files - new: https://git.openjdk.org/jdk/pull/26951/files/16a06948..66ba6570 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26951&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26951&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26951.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26951/head:pull/26951 PR: https://git.openjdk.org/jdk/pull/26951 From duke at openjdk.org Wed Sep 3 10:06:46 2025 From: duke at openjdk.org (erifan) Date: Wed, 3 Sep 2025 10:06:46 GMT Subject: RFR: 8365911: AArch64: Fix encoding error in sve_cpy for negative floats [v3] In-Reply-To: References: Message-ID: On Wed, 3 Sep 2025 10:02:24 GMT, erifan wrote: >> The sve_cpy instruction is not correctly implemented for negative floating-point values. The issues include: >> >> 1. When a negative floating-point number (e.g. `-1.0`) is passed, the `checked_cast<int8_t>(pack(d))` check fails. For example, assume `d = -1.0`: >> - `pack(-1.0)` returns an unsigned int with the 7th bit set, i.e., `0xf0`. >> - `checked_cast<int8_t>(0xf0)` casts `0xf0` to an int8_t value, which is `-16`. >> - Casting this int8_t `-16` back to unsigned int results in `0xfffffff0`. >> - The check compares `0xf0` to `0xfffffff0`, which obviously fails. >> >> 2.
Additionally, the encoding of the negative floating-point number is incorrect: >> - The imm8 field can fall outside the valid range of **[-128, 127]**. >> - Bit **13** should be encoded as **0** for floating-point numbers. >> >> This PR fixes these issues and renames floating-point `sve_cpy` as `sve_fcpy`. >> >> Some test cases are added to aarch64-asmtest.py, and all tests passed. > > erifan has updated the pull request incrementally with one additional commit since the last revision: > > Code style fixes Thanks @theRealAph, I have addressed your suggested changes. ------------- PR Review: https://git.openjdk.org/jdk/pull/26951#pullrequestreview-3179900424 From duke at openjdk.org Wed Sep 3 10:06:47 2025 From: duke at openjdk.org (erifan) Date: Wed, 3 Sep 2025 10:06:47 GMT Subject: RFR: 8365911: AArch64: Fix encoding error in sve_cpy for negative floats [v2] In-Reply-To: References: Message-ID: On Wed, 3 Sep 2025 08:50:57 GMT, Andrew Haley wrote: >> It's hard to say, because it is actually the value bits of a fp8. >> >> Simply put, the lower 8 bits are valid values. The remaining bits must be 0. > > Sorry, thinko. I meant to say > > `imm8 & ~0xff` > > but never mind, let it stand. Ok, thanks! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26951#discussion_r2318451958 From duke at openjdk.org Wed Sep 3 10:06:50 2025 From: duke at openjdk.org (erifan) Date: Wed, 3 Sep 2025 10:06:50 GMT Subject: RFR: 8365911: AArch64: Fix encoding error in sve_cpy for negative floats [v2] In-Reply-To: References: Message-ID: On Wed, 3 Sep 2025 08:11:55 GMT, Andrew Haley wrote: >> erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains three additional commits since the last revision: >> >> - Don't rename sve_cpy as sve_fcpy >> - Merge branch 'master' into JDK-8365911 >> - 8365911: AArch64: Fix encoding error in sve_cpy for negative floats >> >> The sve_cpy instruction is not correctly implemented for negative >> floating-point values. The issues include: >> >> 1. When a negative floating-point number (e.g. `-1.0`) is passed, the >> `checked_cast<int8_t>(pack(d))` check fails. For example, assume `d = -1.0`: >> - `pack(-1.0)` returns an unsigned int with the 7th bit set, i.e., `0xf0`. >> - `checked_cast<int8_t>(0xf0)` casts `0xf0` to an int8_t value, which is `-16`. >> - Casting this int8_t `-16` back to unsigned int results in `0xfffffff0`. >> - The check compares `0xf0` to `0xfffffff0`, which obviously fails. >> >> 2. Additionally, the encoding of the negative floating-point number is incorrect: >> - The imm8 field can fall outside the valid range of **[-128, 127]**. >> - Bit **13** should be encoded as **0** for floating-point numbers. >> >> This PR fixes these issues and renames floating-point `sve_cpy` as `sve_fcpy`. >> >> Some test cases are added to aarch64-asmtest.py, and all tests passed. > > src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 3831: > >> 3829: int m = isMerge ? 1 : 0; >> 3830: f(0b00000101, 31, 24), f(T, 23, 22), f(0b01, 21, 20); >> 3831: prf(Pg, 16), f(isFloat ? 1 : 0, 15), f(m, 14), f(sh, 13), f(imm8&0xff, 12, 5), rf(Zd, 0); > > Suggestion: > > prf(Pg, 16), f(isFloat ? 1 : 0, 15), f(m, 14), f(sh, 13), f(imm8 & 0xff, 12, 5), rf(Zd, 0); > > General HotSpot style. Done, thanks.
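For readers following the immediate checks discussed in this thread, here is a minimal sketch of why the round-trip check described in the PR summary rejects `0xf0`, and how the three mask expressions from the review differ. The helper names are hypothetical and are not HotSpot functions; `round_trips_through_int8` only imitates the narrow-and-compare behaviour that the summary attributes to the checked cast, and it assumes the usual two's-complement narrowing and a 32-bit `unsigned int`.

```cpp
#include <cstdint>

// Hypothetical helper, not HotSpot code: narrow to int8_t, widen back to
// unsigned int, and compare with the input, as the PR summary describes.
constexpr bool round_trips_through_int8(unsigned int v) {
  return static_cast<unsigned int>(static_cast<std::int8_t>(v)) == v;
}

// The three immediate checks mentioned in the review (hypothetical names).
constexpr bool high_bits_clear_shift(unsigned int v) { return (v >> 8) == 0; }
constexpr bool high_bits_clear_mask(unsigned int v)  { return (v & ~0xffu) == 0; }
constexpr bool low_byte_clear(unsigned int v)        { return (v & 0xffu) == 0; }

// 0xf0 is the packed form of -1.0 quoted in the PR summary.
static_assert(!round_trips_through_int8(0xf0u), "-16 widens to 0xfffffff0");
static_assert(round_trips_through_int8(0x70u), "values below 0x80 survive");
static_assert(high_bits_clear_shift(0xf0u) && high_bits_clear_mask(0xf0u),
              "both forms accept a value that fits in the low 8 bits");
static_assert(!high_bits_clear_shift(0x1f0u) && !high_bits_clear_mask(0x1f0u),
              "both forms reject a value with bits set above bit 7");
static_assert(!low_byte_clear(0xf0u), "& 0xff tests the low byte instead");
```

Only the shift form and the `& ~0xff` form express "the upper bits must be zero", which is consistent with the follow-up in this thread that `imm8 & ~0xff` was the intended suggestion.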
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26951#discussion_r2318454592 From duke at openjdk.org Wed Sep 3 10:12:54 2025 From: duke at openjdk.org (erifan) Date: Wed, 3 Sep 2025 10:12:54 GMT Subject: RFR: 8354242: VectorAPI: combine vector not operation with compare [v7] In-Reply-To: <15TW6hiffz65NhHevPefL_6swSC07UD-GwiJ4tPDtFs=.b83081df-8abd-4756-b4e0-1d969678a0d2@github.com> References: <15TW6hiffz65NhHevPefL_6swSC07UD-GwiJ4tPDtFs=.b83081df-8abd-4756-b4e0-1d969678a0d2@github.com> Message-ID: On Thu, 5 Jun 2025 11:05:48 GMT, Emanuel Peter wrote: >>> > FYI: `BoolTest::negate` already does what you want: `mask negate( ) const { return mask(_test^4); }` I think you should use that instead :) >>> >>> Indeed, I hadn't noticed that, thank you. >> >> Oh I think we still cannot use `BoolTest::negate`, because we cannot instantiate a `BoolTest` object with **unsigned** comparison. `BoolTest::negate` is a non-static function. > >> Oh I think we still cannot use `BoolTest::negate`, because we cannot instantiate a `BoolTest` object with **unsigned** comparison. `BoolTest::negate` is a non-static function. > > I see. Ok. Hmm. I still think that the logic should be in `BoolTest`, because that is where the exact implementation of the enum values is. In that context it is easier to see why `^4` does the negation. And imagine we were ever to change the enum values, then it would be harder to find your code and fix it. > > Maybe it could be called `BoolTest::negate_mask(mask btm)` and explain in a comment that both signed and unsigned are supported.
Hi @eme64 @theRealAph @XiaohongGong @fg1417 @shqking , could you help take a look at this PR, thanks ------------- PR Comment: https://git.openjdk.org/jdk/pull/24674#issuecomment-3248596662 From duke at openjdk.org Wed Sep 3 10:14:56 2025 From: duke at openjdk.org (erifan) Date: Wed, 3 Sep 2025 10:14:56 GMT Subject: RFR: 8363989: AArch64: Add missing backend support of VectorAPI expand operation In-Reply-To: References: Message-ID: On Wed, 20 Aug 2025 11:27:59 GMT, Andrew Haley wrote: >> Currently, on AArch64, the VectorAPI `expand` operation is intrinsified for 32-bit and 64-bit types only when SVE2 is available. In the following cases, `expand` has not yet been intrinsified: >> 1. **Subword types** on SVE2-capable hardware. >> 2. **All types** on NEON and SVE1 environments. >> >> As a result, `expand` API performance is very poor in these scenarios. This patch intrinsifies the `expand` operation in the above environments. >> >> Since there are no native instructions directly corresponding to `expand` in these cases, this patch mainly leverages the `TBL` instruction to implement `expand`. To compute the index input for `TBL`, the prefix sum algorithm (see https://en.wikipedia.org/wiki/Prefix_sum) is used. Take a 128-bit byte vector on SVE2 as an example: >> >> To compute: dst = src.expand(mask) >> Data direction: high <== low >> Input: >> src = p o n m l k j i h g f e d c b a >> mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 >> Expected result: >> dst = 0 0 h g 0 0 f e 0 0 d c 0 0 b a >> >> Step 1: calculate the index input of the TBL instruction. >> >> // Set tmp1 as all 0 vector. >> tmp1 = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 >> >> // Move the mask bits from the predicate register to a vector register. >> // **1-bit** mask lane of P register to **8-bit** mask lane of V register. >> tmp2 = mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 >> >> // Shift the entire register. Prefix sum algorithm. 
>> dst = tmp2 << 8 = 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 >> tmp2 += dst = 0 1 2 1 0 1 2 1 0 1 2 1 0 1 2 1 >> >> dst = tmp2 << 16 = 2 1 0 1 2 1 0 1 2 1 0 1 2 1 0 0 >> tmp2 += dst = 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 >> >> dst = tmp2 << 32 = 2 2 2 2 2 2 2 2 2 2 2 1 0 0 0 0 >> tmp2 += dst = 4 4 4 4 4 4 4 4 4 4 4 3 2 2 2 1 >> >> dst = tmp2 << 64 = 4 4 4 3 2 2 2 1 0 0 0 0 0 0 0 0 >> tmp2 += dst = 8 8 8 7 6 6 6 5 4 4 4 3 2 2 2 1 >> >> // Clear inactive elements. >> dst = sel(mask, tmp2, tmp1) = 0 0 8 7 0 0 6 5 0 0 4 3 0 0 2 1 >> >> // Set the inactive lane value to -1 and set the active lane to the target index. >> dst -= 1 = -1 -1 7 6 -1 -1 5 4 -1 -1 3 2 -1 -1 1 0 >> >> Step 2: shuffle the source vector elements to the target vector >> >> tbl(dst, src, dst) = 0 0 h g 0 0 f e 0 0 d c 0 0 b a >> >> >> The same algorithm is used for NEON and... > > The algorithm description here is great. Please paste all of it from "Since there are" to "but with different instructions where appropriate." into this PR, before the vector expand implementation. @theRealAph @e1iu @XiaohongGong @fg1417 @shqking, could you help take a look at this PR, thanks~ ------------- PR Comment: https://git.openjdk.org/jdk/pull/26740#issuecomment-3248603705 From aph at openjdk.org Wed Sep 3 10:16:44 2025 From: aph at openjdk.org (Andrew Haley) Date: Wed, 3 Sep 2025 10:16:44 GMT Subject: RFR: 8365911: AArch64: Fix encoding error in sve_cpy for negative floats [v3] In-Reply-To: References: Message-ID: On Wed, 3 Sep 2025 10:02:24 GMT, erifan wrote: >> The sve_cpy instruction is not correctly implemented for negative floating-point values. The issues include: >> >> 1. When a negative floating-point number (e.g. `-1.0`) is passed, the `checked_cast<int8_t>(pack(d))` check fails. For example, assume `d = -1.0`: >> - `pack(-1.0)` returns an unsigned int with the 7th bit set, i.e., `0xf0`. >> - `checked_cast<int8_t>(0xf0)` casts `0xf0` to an int8_t value, which is `-16`.
>> - Casting this int8_t `-16` back to unsigned int results in `0xfffffff0`. >> - The check compares `0xf0` to `0xfffffff0`, which obviously fails. >> >> 2. Additionally, the encoding of the negative floating-point number is incorrect: >> - The imm8 field can fall outside the valid range of **[-128, 127]**. >> - Bit **13** should be encoded as **0** for floating-point numbers. >> >> This PR fixes these issues and renames floating-point `sve_cpy` as `sve_fcpy`. >> >> Some test cases are added to aarch64-asmtest.py, and all tests passed. > > erifan has updated the pull request incrementally with one additional commit since the last revision: > > Code style fixes Good. ------------- Marked as reviewed by aph (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26951#pullrequestreview-3179955144 From epeter at openjdk.org Wed Sep 3 12:39:44 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 3 Sep 2025 12:39:44 GMT Subject: RFR: 8365911: AArch64: Fix encoding error in sve_cpy for negative floats In-Reply-To: References: <2R6O7Jhv3catwxc6rXJdh7Uiq-NFBp7beCmP49CLTqU=.7ba72e39-6efd-47fe-8ad9-6df54a45c99b@github.com> <-G8GwIflOhFjOL-PAG6_oylu0Fa9c8iNUB57EC6oo4s=.a0126087-2a97-4542-a555-27c12578fccf@github.com> Message-ID: On Wed, 3 Sep 2025 07:19:06 GMT, erifan wrote: >>> 1. sve `cpy` and `fcpy` are actually two different instructions, and distinguishing them might be clearer. >> >> That's a fair point, but the AArch64 name for all four instructions is CPY, and they are distinguished by their operands. Deviation from the names in the Reference Manual is occasionally necessary, but it makes life painful for maintainers when they have to search for what we've called an instruction they want to use. >> >>> 2. sve `cpy`'s imm8 is an **int**, while `fcpy`'s imm8 is an **fp8**. >> >> Yes, that's right. >> >>> While some encoding code can be reused, separating the encodings makes the code clearer. >> >> I don't agree that it makes the code clearer.
In fact, tight factoring emphasizes the fact that these instructions are similar, and explicitly shows where they are different. >> >> It is true that I have a strong bias against copy-and-paste programming. >> >>> I think both implementations are fine. If you think it's better to not refactor, I'll revert. >> >> I do. Thank you. > >> I do. Thank you. > > Ok, I have reverted the refactoring. Please help take another look, thanks~ @erifan I'm running some internal testing - though we don't have SVE machines so you are responsible to make sure it is adequately tested for that ;) ------------- PR Comment: https://git.openjdk.org/jdk/pull/26951#issuecomment-3249077329 From epeter at openjdk.org Wed Sep 3 12:47:46 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 3 Sep 2025 12:47:46 GMT Subject: RFR: 8363989: AArch64: Add missing backend support of VectorAPI expand operation [v2] In-Reply-To: <_YDJIkwt0sdsOAMfNNn1fHTVwH0SHDpJv5NpQoxnfiA=.a0ddb5f3-00f1-47e2-93da-f47cb3f62288@github.com> References: <_YDJIkwt0sdsOAMfNNn1fHTVwH0SHDpJv5NpQoxnfiA=.a0ddb5f3-00f1-47e2-93da-f47cb3f62288@github.com> Message-ID: On Thu, 21 Aug 2025 07:00:35 GMT, erifan wrote: >> Currently, on AArch64, the VectorAPI `expand` operation is intrinsified for 32-bit and 64-bit types only when SVE2 is available. In the following cases, `expand` has not yet been intrinsified: >> 1. **Subword types** on SVE2-capable hardware. >> 2. **All types** on NEON and SVE1 environments. >> >> As a result, `expand` API performance is very poor in these scenarios. This patch intrinsifies the `expand` operation in the above environments. >> >> Since there are no native instructions directly corresponding to `expand` in these cases, this patch mainly leverages the `TBL` instruction to implement `expand`. To compute the index input for `TBL`, the prefix sum algorithm (see https://en.wikipedia.org/wiki/Prefix_sum) is used. 
Take a 128-bit byte vector on SVE2 as an example: >> >> To compute: dst = src.expand(mask) >> Data direction: high <== low >> Input: >> src = p o n m l k j i h g f e d c b a >> mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 >> Expected result: >> dst = 0 0 h g 0 0 f e 0 0 d c 0 0 b a >> >> Step 1: calculate the index input of the TBL instruction. >> >> // Set tmp1 as all 0 vector. >> tmp1 = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 >> >> // Move the mask bits from the predicate register to a vector register. >> // **1-bit** mask lane of P register to **8-bit** mask lane of V register. >> tmp2 = mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 >> >> // Shift the entire register. Prefix sum algorithm. >> dst = tmp2 << 8 = 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 >> tmp2 += dst = 0 1 2 1 0 1 2 1 0 1 2 1 0 1 2 1 >> >> dst = tmp2 << 16 = 2 1 0 1 2 1 0 1 2 1 0 1 2 1 0 0 >> tmp2 += dst = 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 >> >> dst = tmp2 << 32 = 2 2 2 2 2 2 2 2 2 2 2 1 0 0 0 0 >> tmp2 += dst = 4 4 4 4 4 4 4 4 4 4 4 3 2 2 2 1 >> >> dst = tmp2 << 64 = 4 4 4 3 2 2 2 1 0 0 0 0 0 0 0 0 >> tmp2 += dst = 8 8 8 7 6 6 6 5 4 4 4 3 2 2 2 1 >> >> // Clear inactive elements. >> dst = sel(mask, tmp2, tmp1) = 0 0 8 7 0 0 6 5 0 0 4 3 0 0 2 1 >> >> // Set the inactive lane value to -1 and set the active lane to the target index. >> dst -= 1 = -1 -1 7 6 -1 -1 5 4 -1 -1 3 2 -1 -1 1 0 >> >> Step 2: shuffle the source vector elements to the target vector >> >> tbl(dst, src, dst) = 0 0 h g 0 0 f e 0 0 d c 0 0 b a >> >> >> The same algorithm is used for NEON and... > > erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
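For readers following the walk-through above, here is a minimal scalar sketch of the same idea in Java. It is only an illustration, not the code from the patch: the class name `ExpandViaPrefixSum` and the array-based lanes are assumptions, lane 0 is the lowest lane, and shifts are counted in byte lanes (1, 2, 4, 8) rather than in bits (8, 16, 32, 64).

```java
import java.nio.charset.StandardCharsets;

// Illustrative model only; the names here are not taken from the patch.
final class ExpandViaPrefixSum {

    // Scalar model of dst = src.expand(mask) for byte lanes; index 0 is the lowest lane.
    static byte[] expand(byte[] src, boolean[] mask) {
        int lanes = src.length;

        // tmp2: one counter per lane, 1 where the mask is set, else 0.
        int[] tmp2 = new int[lanes];
        for (int i = 0; i < lanes; i++) {
            tmp2[i] = mask[i] ? 1 : 0;
        }

        // Inclusive prefix sum by doubling: shift towards the high lanes, then add.
        for (int shift = 1; shift < lanes; shift <<= 1) {
            int[] shifted = new int[lanes];
            for (int i = shift; i < lanes; i++) {
                shifted[i] = tmp2[i - shift];
            }
            for (int i = 0; i < lanes; i++) {
                tmp2[i] += shifted[i];
            }
        }

        // Active lanes get (prefix sum - 1) as source index, inactive lanes get -1.
        // An out-of-range index yields 0, which models the table lookup in the example.
        byte[] dst = new byte[lanes];
        for (int i = 0; i < lanes; i++) {
            int index = mask[i] ? tmp2[i] - 1 : -1;
            dst[i] = (index >= 0 && index < lanes) ? src[index] : 0;
        }
        return dst;
    }

    public static void main(String[] args) {
        byte[] src = "abcdefghijklmnop".getBytes(StandardCharsets.US_ASCII);
        boolean[] mask = new boolean[src.length];
        for (int i = 0; i < mask.length; i++) {
            mask[i] = (i % 4) < 2;   // lanes 0,1 active, lanes 2,3 inactive, and so on
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : expand(src, mask)) {
            sb.append(b == 0 ? '.' : (char) b);
        }
        System.out.println(sb);      // lanes printed low to high: ab..cd..ef..gh..
    }
}
```

The doubling loop needs log2(lanes) shift-and-add rounds, which matches the four shift steps shown for the 16-lane example.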
The pull request contains three additional commits since the last revision: > > - Improve the comment of the vector expand implementation > - Merge branch 'master' into JDK-8363989 > - 8363989: AArch64: Add missing backend support of VectorAPI expand operation > > Currently, on AArch64, the VectorAPI `expand` operation is intrinsified > for 32-bit and 64-bit types only when SVE2 is available. In the following > cases, `expand` has not yet been intrinsified: > 1. **Subword types** on SVE2-capable hardware. > 2. **All types** on NEON and SVE1 environments. > > As a result, `expand` API performance is very poor in these scenarios. > This patch intrinsifies the `expand` operation in the above environments. > > Since there are no native instructions directly corresponding to `expand` > in these cases, this patch mainly leverages the `TBL` instruction to > implement `expand`. To compute the index input for `TBL`, the prefix sum > algorithm (see https://en.wikipedia.org/wiki/Prefix_sum) is used. > Take a 128-bit byte vector on SVE2 as an example: > ``` > To compute: dst = src.expand(mask) > Data direction: high <== low > Input: > src = p o n m l k j i h g f e d c b a > mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 > Expected result: > dst = 0 0 h g 0 0 f e 0 0 d c 0 0 b a > ``` > Step 1: calculate the index input of the TBL instruction. > ``` > // Set tmp1 as all 0 vector. > tmp1 = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 > > // Move the mask bits from the predicate register to a vector register. > // **1-bit** mask lane of P register to **8-bit** mask lane of V register. > tmp2 = mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 > > // Shift the entire register. Prefix sum algorithm. 
> dst = tmp2 << 8 = 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 > tmp2 += dst = 0 1 2 1 0 1 2 1 0 1 2 1 0 1 2 1 > > dst = tmp2 << 16 = 2 1 0 1 2 1 0 1 2 1 0 1 2 1 0 0 > tmp2 += dst = 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 > > dst = tmp2 << 32 = 2 2 2 2 2 2 2 2 2 2 2 1 0 0 0 0 > tmp2 += dst = 4 4 4 4 4 4 4 4 4 4 4 3 2 2 2 1 > > dst = tmp2 << 64 = 4 4 4 3 2 2 2 1 0 0 0 0 0 0 0 0 > tmp2 += ... test/hotspot/jtreg/compiler/vectorapi/VectorExpandTest.java line 48: > 46: static final VectorSpecies F_SPECIES = FloatVector.SPECIES_MAX; > 47: static final VectorSpecies L_SPECIES = LongVector.SPECIES_MAX; > 48: static final VectorSpecies D_SPECIES = DoubleVector.SPECIES_MAX; Would it make sense to run these tests with various vector sizes? Because it seems your algorithm depends on `vector_length_in_bytes` in the prefix sum algo. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26740#discussion_r2318862195 From epeter at openjdk.org Wed Sep 3 12:52:49 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 3 Sep 2025 12:52:49 GMT Subject: RFR: 8363989: AArch64: Add missing backend support of VectorAPI expand operation [v2] In-Reply-To: <_YDJIkwt0sdsOAMfNNn1fHTVwH0SHDpJv5NpQoxnfiA=.a0ddb5f3-00f1-47e2-93da-f47cb3f62288@github.com> References: <_YDJIkwt0sdsOAMfNNn1fHTVwH0SHDpJv5NpQoxnfiA=.a0ddb5f3-00f1-47e2-93da-f47cb3f62288@github.com> Message-ID: On Thu, 21 Aug 2025 07:00:35 GMT, erifan wrote: >> Currently, on AArch64, the VectorAPI `expand` operation is intrinsified for 32-bit and 64-bit types only when SVE2 is available. In the following cases, `expand` has not yet been intrinsified: >> 1. **Subword types** on SVE2-capable hardware. >> 2. **All types** on NEON and SVE1 environments. >> >> As a result, `expand` API performance is very poor in these scenarios. This patch intrinsifies the `expand` operation in the above environments. 
Looks like a nice improvement! src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2819: > 2817: subv(dst, size, tmp2, tmp1); > 2818: // dst = 0 0 8 7 0 0 6 5 0 0 4 3 0 0 2 1 > 2819: tbl(dst, size, src, 1, dst); It would make it a little easier to read the example if the numbers were aligned. Now the minus sign disrupts that a little. Maybe leave 2 spaces if the number is positive? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26740#issuecomment-3249121442 PR Review Comment: https://git.openjdk.org/jdk/pull/26740#discussion_r2318874112 From hgreule at openjdk.org Wed Sep 3 15:22:51 2025 From: hgreule at openjdk.org (Hannes Greule) Date: Wed, 3 Sep 2025 15:22:51 GMT Subject: RFR: 8356813: Improve Mod(I|L)Node::Value [v7] In-Reply-To: References: <2Jf_gfvRlKcmCFoQHp5T0WW_fU_yK5-0Z3z41f00-YU=.164be9f0-fae1-44bb-84c3-846d8c2c0db2@github.com> Message-ID: On Tue, 26 Aug 2025 12:46:31 GMT, Hannes Greule wrote: >> This change improves the precision of the `Mod(I|L)Node::Value()` functions. >> >> I reordered the structure a bit. First, we handle constants, afterwards, we handle ranges. The bottom checks seem to be excessive (`Type::BOTTOM` is covered by using `isa_(int|long)()`, the local bottom is just the full range). Given we can even give reasonable bounds if only one input has any bounds, we don't want to return early. >> The changes after that are commented. Please let me know if the explanations are good, or if you have any suggestions. >> >> ### Monotonicity >> >> Before, a 0 divisor resulted in `Type(Int|Long)::POS`.
Initially I wanted to keep it this way, but that violates monotonicity during PhaseCCP. As an example, if we see a 0 divisor first and a 3 afterwards, we might try to go from `>=0` to `-2..2`, but the meet of these would be `>=-2` rather than `-2..2`. I now use `Type(Int|Long)::ZERO` instead (zero is always part of the result whenever we cover a range). >> >> ### Testing >> >> I added tests for cases around the relevant bounds. I also ran tier1, tier2, and tier3 and didn't see any related failures after addressing the monotonicity problem described above (I'm having a few unrelated failures on my system currently, so separate testing would be appreciated in case I missed something). >> >> Please review and let me know what you think. >> >> ### Other >> >> The `UMod(I|L)Node`s were adjusted to be more in line with their signed variants. This change makes them diverge again, but similar improvements could be made after #17508. >> >> While experimenting with these changes, I stumbled upon a few things that aren't directly related to this change, but might be worth looking into further: >> - If the divisor is a constant, we will directly replace the `Mod(I|L)Node` with a larger number of cheaper nodes in `::Ideal()`. Type analysis for these nodes combined is less precise, which means we miss cases where this would help, e.g., removing range checks. Would it make sense to delay the replacement? >> - To force non-negative ranges, I'm using `char`. I noticed that method parameters of sub-int integer types all fall back to `TypeInt::INT`. This seems to be an intentional change made in https://github.com/openjdk/jdk/commit/200784d505dd98444c48c9ccb7f2e4df36dcbb6a. The bug report is private, so I can't really judge if that part is necessary, but it seems odd.
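To make the monotonicity argument concrete, here is a small Java sketch. It is an illustration under stated assumptions, not the logic in `Mod(I|L)Node::Value()`: `Range`, `meet` and `modRange` are hypothetical helpers, and the bound used is just the standard Java fact that `x % y` has the sign of `x` (or is zero) and a magnitude smaller than `|y|`.

```java
// Hypothetical helpers for illustration; not HotSpot code.
final class ModRangeSketch {

    static final class Range {
        final long lo;
        final long hi;

        Range(long lo, long hi) {
            this.lo = lo;
            this.hi = hi;
        }

        // Smallest range covering both inputs (how the thread uses "meet" for integer ranges).
        Range meet(Range other) {
            return new Range(Math.min(lo, other.lo), Math.max(hi, other.hi));
        }

        @Override
        public String toString() {
            return "[" + lo + ", " + hi + "]";
        }
    }

    // Conservative bounds for x % y; int-valued ranges are held in longs to avoid overflow.
    static Range modRange(Range x, Range y) {
        long maxAbsDivisor = Math.max(Math.abs(y.lo), Math.abs(y.hi));
        if (maxAbsDivisor == 0) {
            // Divisor is always 0: the operation throws, so model the result as the constant 0.
            return new Range(0, 0);
        }
        long limit = maxAbsDivisor - 1;                   // |x % y| <= |y| - 1
        long lo = x.lo < 0 ? Math.max(x.lo, -limit) : 0;  // negative results need a negative dividend
        long hi = x.hi > 0 ? Math.min(x.hi, limit) : 0;   // positive results need a positive dividend
        return new Range(lo, hi);
    }

    public static void main(String[] args) {
        Range anyInt = new Range(Integer.MIN_VALUE, Integer.MAX_VALUE);
        Range byThree = modRange(anyInt, new Range(3, 3));
        Range oldZeroDivisor = new Range(0, Integer.MAX_VALUE);   // stands in for POS
        Range newZeroDivisor = modRange(anyInt, new Range(0, 0)); // stands in for ZERO

        System.out.println(byThree);                       // [-2, 2]
        System.out.println(oldZeroDivisor.meet(byThree));  // [-2, 2147483647]
        System.out.println(newZeroDivisor.meet(byThree));  // [-2, 2]
    }
}
```

With the old starting point the merged range is wider than the range computed for divisor 3, which is the mismatch described above; starting from the constant 0 the merged range and the newly computed range coincide.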
> > Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: > > review I also filed https://bugs.openjdk.org/browse/JDK-8366815 now regarding the early transformation of div/mod by constants. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25254#issuecomment-3249705608 From fgao at openjdk.org Wed Sep 3 16:55:45 2025 From: fgao at openjdk.org (Fei Gao) Date: Wed, 3 Sep 2025 16:55:45 GMT Subject: RFR: 8307084: C2: Vectorized drain loop is not executed for some small trip counts [v2] In-Reply-To: <3upl3uiPM5gnO1HCV7vb1C7CFyV3HQ2ztGXVJkss-AM=.09da8cb2-e384-420a-91d1-f3bb8d8cfc6a@github.com> References: <3upl3uiPM5gnO1HCV7vb1C7CFyV3HQ2ztGXVJkss-AM=.09da8cb2-e384-420a-91d1-f3bb8d8cfc6a@github.com> Message-ID: > In C2's loop optimization, for a counted loop, if any of these conditions (RCE, unrolling) is met, we switch to the `pre-main-post-loop` model. The counted loop is then split into `pre-main-post` loops. Meanwhile, C2 inserts minimum trip guards (a.k.a. zero-trip guards) before the main loop and the post loop. These guards test whether the remaining trip count is less than the loop stride (after unrolling). If so, execution jumps over the loop code to avoid over-running the loop. For example, if a main loop is unrolled `8x`, the main loop guard tests whether the loop has fewer than `8` iterations left and decides which way to go. > > Usually, the vectorized main loop will be super-unrolled after vectorization. In such cases, the main loop's stride is multiplied further. After the main loop is super-unrolled, the minimum trip guard test is updated. Assuming one vector covers `8` iterations and the super-unrolling count is `4`, the trip guard of the main loop tests whether the remaining trip count is less than `8 * 4 = 32`. > > To avoid the scalar post loop running too many iterations after super-unrolling, C2 clones the main loop before super-unrolling to create a vectorized drain loop.
The newly inserted post loop also has a minimum trip guard. Both trip guards, that of the main loop and that of the vectorized drain loop, jump to the scalar post loop. > > The problem is that if the remaining trip count when exiting from the pre-loop is relatively small but larger than the vector length, the vectorized drain loop will never be executed. Because the minimum trip guard test of the main loop fails, execution jumps over both the main loop and the vectorized drain loop. For example, in the above case, if a loop still has `25` iterations after the pre-loop, we could run `3` rounds of the vectorized drain loop, but that is currently impossible. It would be better if a failing minimum trip guard test of the main loop did not jump over the vectorized drain loop. > > This patch improves this by modifying the control flow when the minimum trip guard test of the main loop fails. Obviously, we need to sync all data uses and control uses to adjust to the change of control flow. > > The whole process is done by the function `insert_post_loop()`. > > We introduce a new `CloneLoopMode`, `InsertVectorizedDrain`. When we're cloning the vector main loop to the vectorized drain loop with mode `InsertVectorizedDrain`: > > 1. The fall-in control flow to the vectorized drain loop comes from a `RegionNode` merging exits ... Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains nine commits: - Merge branch 'master' into optimize-atomic-post - Clean up comments for consistency and add spacing for readability - Fix some corner case failures and refined part of code - Merge branch 'master' into optimize-atomic-post - Refine ascii art, rename some variables and resolve conflicts - Merge branch 'master' into optimize-atomic-post - Add necessary ASCII art, refactor insert_post_loop() and rename "atomic post loop" with "vectorized drain loop.
- Merge branch 'master' into optimize-atomic-post - 8307084: C2: Vector atomic post loop is not executed for some small trip counts In C2's loop optimization, for a counted loop, if we have any of these conditions (RCE, unrolling) met, we switch to the pre-main-post-loop model. Then a counted loop could be split into pre-main-post loops. Meanwhile, C2 inserts minimum trip guards (a.k.a. zero-trip guards) before the main loop and the post loop. These guards test if the remaining trip count is less than the loop stride (after unrolling). If yes, The execution jumps over the loop code to avoid loop over-running. For example, if a main loop is unrolled to 8x, the main loop guard tests if the loop has less than 8 iterations and then decide which way to go. Usually, the vectorized main loop will be super-unrolled after vectorization. In such cases, the main loop's stride is going to be further multiplied. After the main loop is super-unrolled, the minimum trip guard test will be updated. Assuming one vector can operate 8 iterations and the super-unrolling count is 4, the trip guard of the main loop will test if remaining trip is less than 8 * 4 = 32. To avoid the scalar post loop running too many iterations after super-unrolling, C2 clones the main loop before super-unrolling to create a vector drain loop, i.e. atomic post loop. The newly inserted post loop also has a minimum trip guard. And, both trip guards of the main loop and vector post loop jump to the scalar post loop. The problem here is, if the remaining trip count when exiting from the pre-loop is relatively small but larger than the vector length, the vector atomic post loop will never be executed. Because the minimum trip guard test of main loop fails, the execution will jump over both the main loop and the atomic post loop. For example, in the above case, a loop still has 25 iterations after the pre-loop, we may run 3 rounds of the atomic post loop but it's impossible. 
It would be better if the minimum trip guard test of the main loop does not jump over the atomic post loop. This patch is to improve it by modifying the control flow when the minimum trip guard test of the main loop fails. Obviously, we need to sync all data uses and control uses to adjust to the change of control flow. The whole process is done by the function insert_atomic_post_loop_impl(). We introduce a new CloneLoopMode, InsertAtomicPost. When we're cloning vector main loop to atomic post loop with mode InsertAtomicPost: 1. The fall-in control flow to the atomic post-loop comes from a RegionNode merging exits from pre-loop and main-loop, implemented in insert_atomic_post_loop_impl(). 2. All fall-in values to the atomic post-loop come from (one or more) phis merging exit values from pre-loop and main-loop, implemented by clone_up_atomic_post_backedge_goo(). 3. All control uses of exits from old-loop now should use new RegionNodes that merge RegionNodes which merge exits from pre-loop and main-loop and exits from the new-loop (atomic post loop) equivalents, implemented by fix_ctrl_uses_for_atomic_post() 4. All data uses of values from old-loop now should use new Phis that merge Phis which merge values from pre-loop and main-loop and values from the new-loop (atomic post loop) equivalents, implemented by handle_data_uses_for_atomic_post_loop(). We also add a new micro-benchmark to test the performance gain. Here are the performance results from different vector-length machines. Tier 1- 3 passed on aarch64 and x86. There are still a few fuzzer test failures. 
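The 25-iteration example can be modelled with a few lines of Java. This is a deliberately simplified sketch, not the C2 implementation: `DrainLoopSketch`, `run` and the `guardSkipsDrain` flag are made up for illustration, the loops are modelled as plain counters, and the vector length (8) and super-unrolling factor (4) are the numbers used in the description.

```java
// Simplified control-flow model; all names are hypothetical.
final class DrainLoopSketch {

    static final int VECTOR_LANES = 8;                        // iterations covered by one vector
    static final int SUPER_UNROLL = 4;                        // super-unrolling count
    static final int MAIN_STRIDE = VECTOR_LANES * SUPER_UNROLL; // 32

    // Returns {main loop rounds, drain loop rounds, scalar post loop iterations}.
    // guardSkipsDrain == true models the behaviour described as the problem:
    // a failing main-loop guard jumps straight to the scalar post loop.
    static int[] run(int remaining, boolean guardSkipsDrain) {
        int mainRounds = 0;
        int drainRounds = 0;
        boolean enterDrain = true;

        if (remaining >= MAIN_STRIDE) {
            // Main loop guard passes: run the super-unrolled vector loop.
            while (remaining >= MAIN_STRIDE) {
                remaining -= MAIN_STRIDE;
                mainRounds++;
            }
        } else if (guardSkipsDrain) {
            // Main loop guard fails and control bypasses the drain loop as well.
            enterDrain = false;
        }

        if (enterDrain) {
            // Drain loop guard: at least one full vector must be left.
            while (remaining >= VECTOR_LANES) {
                remaining -= VECTOR_LANES;
                drainRounds++;
            }
        }

        // Whatever is left is handled one iteration at a time.
        return new int[] {mainRounds, drainRounds, remaining};
    }

    public static void main(String[] args) {
        int[] before = run(25, true);
        int[] after = run(25, false);
        System.out.println("before: main=" + before[0] + " drain=" + before[1] + " scalar=" + before[2]);
        System.out.println("after:  main=" + after[0] + " drain=" + after[1] + " scalar=" + after[2]);
    }
}
```

In this model the bypassing guard leaves all 25 iterations to the scalar loop, while routing the failed guard to the drain loop leaves 3 vector rounds and a single scalar iteration.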
------------- Changes: https://git.openjdk.org/jdk/pull/22629/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=22629&range=01 Stats: 1542 lines in 8 files changed: 1358 ins; 59 del; 125 mod Patch: https://git.openjdk.org/jdk/pull/22629.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/22629/head:pull/22629 PR: https://git.openjdk.org/jdk/pull/22629 From fgao at openjdk.org Wed Sep 3 17:10:47 2025 From: fgao at openjdk.org (Fei Gao) Date: Wed, 3 Sep 2025 17:10:47 GMT Subject: RFR: 8307084: C2: Vectorized drain loop is not executed for some small trip counts In-Reply-To: References: <3upl3uiPM5gnO1HCV7vb1C7CFyV3HQ2ztGXVJkss-AM=.09da8cb2-e384-420a-91d1-f3bb8d8cfc6a@github.com> Message-ID: On Thu, 28 Aug 2025 14:58:25 GMT, Emanuel Peter wrote: > BTW: I just integrated https://github.com/openjdk/jdk/pull/24278 which may have silent merge conflicts, so it would be good if you merged and tested again. Hi @eme64, I've rebased the patch onto the latest JDK, and all tier1 to tier3 tests have passed on my local AArch64 and x86 machines. > It would be good if you re-ran the benchmarks. It seems the last ones you did in December of 2024. We should see that we have various benchmarks, both for array and MemorySegment. You could look at the array benchmarks from here: https://github.com/openjdk/jdk/pull/22070 I also re-verified the benchmark from [PR #22070](https://github.com/openjdk/jdk/pull/22070) on 128-bit, 256-bit, and 512-bit vector machines. The results show no significant regressions and performance changes are consistent with the previous round described in [perf results](https://bugs.openjdk.org/browse/JDK-8307084?focusedId=14729524&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14729524). > Once you do that I could also run some internal testing, if you like :) I'd really appreciate it if you could run some internal testing at a time you think is suitable.
Thanks :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/22629#issuecomment-3250077476 From cslucas at openjdk.org Wed Sep 3 17:13:22 2025 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Wed, 3 Sep 2025 17:13:22 GMT Subject: RFR: 8361699: C2: assert(can_reduce_phi(n->as_Phi())) failed: Sanity: previous reducible Phi is no longer reducible before SUT. Message-ID: Please review this patch, which fixes an issue that may occur when reducing an allocation merge. As the assert message describes, the problem is that a `Phi` considered reducible during one invocation of `adjust_scalar_replaceable_state` later turned out to be non-reducible. This situation can happen if a subsequent invocation of the same method causes all inputs to the phi to be NSR (not scalar replaceable); in that case there is no point in reducing the Phi. It can also happen during the propagation of NSR state done by `find_scalar_replaceable_allocs`. The change in `revisit_reducible_phi_status` is just a clean-up. The real fix is in `find_scalar_replaceable_allocs`. Tested on Linux x64/AArch64 release/fastdebug with JTREG tier1-3. ------------- Commit messages: - Fix for RAM not reducible before SUT & Test.
Changes: https://git.openjdk.org/jdk/pull/27063/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27063&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361699 Stats: 87 lines in 2 files changed: 73 ins; 13 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/27063.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27063/head:pull/27063 PR: https://git.openjdk.org/jdk/pull/27063 From vlivanov at openjdk.org Wed Sep 3 17:34:42 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 3 Sep 2025 17:34:42 GMT Subject: RFR: 8355354: C2 crashed: assert(_callee == nullptr || _callee == m) failed: repeated inline attempt with different callee [v4] In-Reply-To: References: <_eAERVexsTQc_Acje4IUJ9yqqE98dB4-hz_fJ0jrUhs=.b2194a63-2599-42f7-a65f-41c29bb37bc3@github.com> Message-ID: <_XB8zA3RZEgWvAwKe8DsB3Udb7gaIqBHiEPHw_28t6Y=.4a54aa09-b496-4818-a4cd-7e7013970c72@github.com> On Wed, 3 Sep 2025 06:50:26 GMT, Damon Fenacci wrote: >> # Issue >> The CTW test `applications/ctw/modules/java_xml.java` crashes when trying to repeat late inlining of a virtual method (after IGVN passes through the method's call node again). The failure originates [here](https://github.com/openjdk/jdk/blob/e2ae50d877b13b121912e2496af4b5209b315a05/src/hotspot/share/opto/callGenerator.cpp#L473) because `_callee != m`. Apparently when running IGVN a second time after a first late inline failure and [setting the callee in the call generator](https://github.com/openjdk/jdk/blob/e2ae50d877b13b121912e2496af4b5209b315a05/src/hotspot/share/opto/callnode.cpp#L1240) we notice that the previous callee is not the same as the current one. >> In this specific instance it seems that the issue happens when CTW is compiling Apache Xalan. >> >> # Cause >> The root of the issue has to do with repeated late inlining, class hierarchy analysis and dynamic class loading. 
>> >> For this particular issue the two differing methods are `org.apache.xalan.xsltc.compiler.LocationPathPattern::translate` the first time and `org.apache.xalan.xsltc.compiler.AncestorPattern::translate` the second time. `LocationPathPattern` is an abstract class but has a concrete `translate` method. `AncestorPattern` is a concrete class that extends another abstract class, `RelativePathPattern`, which extends `LocationPathPattern`. `AncestorPattern` overrides the `translate` method. >> What seems to be happening is the following: we compile a virtual call to `RelativePathPattern::translate`, and at compile time only the abstract classes `RelativePathPattern` <: `LocationPathPattern` are loaded. CHA then finds that the call must always invoke `LocationPathPattern::translate` because the method is not overridden anywhere else. However, there is still no non-abstract class in the entire class hierarchy, i.e. as soon as `AncestorPattern` is loaded, it is the only non-abstract class in the class hierarchy and therefore the receiver type must be `AncestorPattern`. >> >> More generally, when late inlining is repeated and classes are loaded dynamically, the resolved method may differ between one late inlining attempt and the next. >> >> # Fix >> >> This looks like a real edge case. If CHA is affected by class loading, the originally recorded dependency becomes invalid. This can possibly happen in other situations as well (e.g. JVMTI class redefinition). So, instead of modifying the assert (to check for invalid dependencies) we avoid re-setting the callee method ... > > Damon Fenacci has updated the pull request incrementally with one additional commit since the last revision: > > JDK-8355354: add stress comment Looks fine. ------------- Marked as reviewed by vlivanov (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/26441#pullrequestreview-3181755233 From vlivanov at openjdk.org Wed Sep 3 20:43:52 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 3 Sep 2025 20:43:52 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v3] In-Reply-To: References: Message-ID: On Tue, 3 Jun 2025 17:20:38 GMT, Emanuel Peter wrote: >> Vladimir Ivanov has updated the pull request incrementally with one additional commit since the last revision: >> >> renaming > > src/hotspot/share/opto/c2_globals.hpp line 83: > >> 81: \ >> 82: product(bool, StressReachabilityFences, false, DIAGNOSTIC, \ >> 83: "Randomly insert ReachabilityFence nodes") \ > > Drive-by sniping: what about a hello-world test where you test out these flags? Good idea. Added one. > src/hotspot/share/opto/callGenerator.cpp line 617: > >> 615: uint endoff = call->jvms()->endoff(); >> 616: if (C->inlining_incrementally()) { >> 617: assert(endoff == call->req(), ""); // assert in SafePointNode::grow_stack > > What exactly are you asserting here? And what is the comment for? The assert ensures there are no reachability edges present when incremental inlining takes place. Inlining logic doesn't expect any extra edges past debug info and the comment refers to the assert which fires the first. > src/hotspot/share/opto/callnode.hpp line 497: > >> 495: // Are we guaranteed that this node is a safepoint? Not true for leaf calls and >> 496: // for some macro nodes whose expansion does not have a safepoint on the fast path. >> 497: virtual bool guaranteed_safepoint() { return true; } > > I see you only copied it. It makes me a little nervous when we call the "default" case safe. Because when you add more cases, you just assume it is safe... and if it is not we first have to discover that through a bug. What do you think? Well, it's a SafePointNode class after all. 
I lifted it from `CallNode` subclass to avoid elaborate check on SafePoint nodes (!is_Call() || as_Call() && guaranteed_safepoint()`)). If some node extends SafePointNode, but doesn't keep JVM state, it has to communicate it to users one way or another. And changing the default doesn't improve the situation IMO: reporting a safepoint node as a non-safepoint is still a bug. > src/hotspot/share/opto/compile.cpp line 3958: > >> 3956: Node* rf = C->reachability_fence(i); >> 3957: Node* in = rf->in(1); >> 3958: if (in->is_DecodeN()) { > > Why not: > Suggestion: > > ReachabilityFence* rf = C->reachability_fence(i); > DecodeNNode* dn = rf->in(1)->isa_DecodeN(); > if (dn != nullptr) { Ok, reshaped as you suggested. > src/hotspot/share/opto/compile.hpp line 381: > >> 379: GrowableArray _template_assertion_predicate_opaques; >> 380: GrowableArray _expensive_nodes; // List of nodes that are expensive to compute and that we'd better not let the GVN freely common >> 381: GrowableArray _reachability_fences; // List of reachability fences > > Why not: > Suggestion: > > GrowableArray _reachability_fences; // List of all reachability fences Ok, done. > src/hotspot/share/opto/compile.hpp line 741: > >> 739: void remove_reachability_fence(Node* n) { >> 740: _reachability_fences.remove_if_existing(n); >> 741: } > > You could also add the type `ReachabilityFenceNode*` here. Done. > src/hotspot/share/opto/loopTransform.cpp line 78: > >> 76: } >> 77: return unique_loop_exit; >> 78: } > > `proj_out_or_null` returns a `ProjNode` (it is probably a `IfTrue` or `IfFalse`, right?) and `outer_loop_exit` returns a `IfFalseNode`. So we should be able to return a `IfProjNode` from this method. What do you think? > > What is the benefit of the `unique_loop_exit` variable here? Why not return immediately? It was easier to inspect it in the debugger. Reshaped as you suggested. 
> src/hotspot/share/opto/macro.cpp line 983: > >> 981: _igvn._worklist.push(ac); >> 982: } else if (use->is_ReachabilityFence() && OptimizeReachabilityFences) { >> 983: _igvn.replace_input_of(use, 1, _igvn.makecon(TypePtr::NULL_PTR)); // reset; redundant fence > > Can you quickly explain in a code comment how this does a "reset"? What happens with it next? Turned it into `ReachabilityFenceNode::clear_referent()`. Hope it makes it clearer. > src/hotspot/share/opto/node.hpp line 701: > >> 699: DEFINE_CLASS_ID(MemBar, Multi, 3) >> 700: DEFINE_CLASS_ID(Initialize, MemBar, 0) >> 701: DEFINE_CLASS_ID(MemBarStoreStore, MemBar, 1) > > Suggestion: > > DEFINE_CLASS_ID(Initialize, MemBar, 0) > DEFINE_CLASS_ID(MemBarStoreStore, MemBar, 1) > > I don't think you needed to touch the lines above, right? Fixed. > src/hotspot/share/opto/parse.hpp line 361: > >> 359: bool _wrote_fields; // Did we write any field? >> 360: Node* _alloc_with_final_or_stable; // An allocation node with final or @Stable field >> 361: Node* _stress_rf_hook; // StressReachabilityFences support > > You could write out the `rf` I'd like to avoid that. `_stress_reachability_fence_hook` is way too verbose IMO. The declaration and all the accesses are accompanied by `StressReachabilityFences` which should make it clear what `rf` refers to. > src/hotspot/share/opto/parse1.cpp line 379: > >> 377: _stress_rf_hook->add_req(loc); >> 378: } >> 379: } > > Can you add a short code comment describing what you are doing here, please? Done. > src/hotspot/share/opto/parse1.cpp line 394: > >> 392: _stress_rf_hook->add_req(stk); >> 393: } >> 394: } > > A short code comment would be helpful Done. > src/hotspot/share/opto/parse1.cpp line 2222: > >> 2220: >> 2221: if (StressReachabilityFences) { >> 2222: // Keep all oop arguments alive until method return. > > Why? Can you extend the comment a little? Done. Does it look better now? 

> src/hotspot/share/opto/reachability.cpp line 44:
>
>> 42: * (0) initial set of RFs is materialized during parsing;
>> 43: * (1) optimization pass during loop opts which eliminates redundant nodes and
>> 44: * moves loop-invariant ones outside loops;
>
> Suggestion:
>
> * (1) optimization pass during loop opts which eliminates redundant nodes and
> * moves loop-invariant ones outside loops;
>
> I'd prefer consistent indentation, but optional/question of taste

Fixed.

> src/hotspot/share/opto/reachability.cpp line 51:
>
>> 49: *
>> 50: * It looks attractive to get rid of RF nodes early and transfer to safepoint-attached representation,
>> 51: * but it is not correct until loop opts are done.
>
> Why is it not correct? What could go wrong? Why is it safe to do it after loop opts?

Live ranges of values are routinely extended during loop opts. And it can break the invariant that all interfering safepoints contain the referent in their oop map. (If an interfering safepoint doesn't keep the referent alive, then it becomes possible for the referent to be prematurely GCed.) After loop opts are over, it becomes possible to reliably enumerate all interfering safepoints and ensure the referent is present in their oop maps.

> test/hotspot/jtreg/compiler/c2/TestReachabilityFence.java line 38:
>
>> 36: * @summary Tests to ensure that reachabilityFence() correctly keeps objects from being collected prematurely.
>> 37: * @modules java.base/jdk.internal.misc
>> 38: * @run main/othervm -Xbatch compiler.c2.TestReachabilityFence
>
> What about some extra runs where you use your new flags?

This particular test is carefully crafted to provoke a failure when reachability fence effects aren't properly modeled. Stressing RF implementation doesn't help here.

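For readers less familiar with the API this review is about, here is a minimal sketch of how `Reference.reachabilityFence()` is typically used on the Java side. The `Resource` class and its `handle` field are hypothetical and are not taken from the PR or from `TestReachabilityFence.java`; only `java.lang.ref.Reference.reachabilityFence(Object)` is a real JDK method.

```java
import java.lang.ref.Reference;

final class ReachabilityFenceSketch {
    // Hypothetical stand-in for an object that owns some external resource.
    static final class Resource {
        private final long[] handle = new long[] { 42L };

        long read() {
            // After this load, compiled code no longer needs 'this' itself,
            // so without a fence the object could be treated as unreachable
            // while 'h' is still in use.
            long[] h = handle;
            try {
                return h[0];
            } finally {
                // Keeps 'this' strongly reachable at least until this point.
                Reference.reachabilityFence(this);
            }
        }
    }

    static long demo() {
        return new Resource().read();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The `try`/`finally` shape follows the pattern shown in the method's Javadoc; the safepoints between the field load and the fence are the "interfering" ones the replies above refer to.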
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2320090697 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2320120466 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2320062127 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2320121852 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2320122602 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2320123063 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2320123818 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2320135080 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2320135683 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2320066556 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2320136667 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2320137310 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2320138235 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2320138496 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2320080872 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2320087314 From vlivanov at openjdk.org Wed Sep 3 20:43:54 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 3 Sep 2025 20:43:54 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v3] In-Reply-To: References: Message-ID: On Mon, 16 Jun 2025 09:28:59 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/callnode.cpp line 950: >> >>> 948: case CatchProjNode::catch_all_index: projs->catchall_catchproj = cpn; break; >>> 949: default: { >>> 950: assert(cpn->_con > 1, ""); // exception table; rethrow case >> >> Can we please turn this into a helpful assert message? > > Can you quickly comment why you changed this? 
Some call nodes inspected during `expand_reachability_fences` demonstrate this IR shape where some exception table projections are directly attached to the call node. Looks like a missed case in `CallNode::extract_projections` we simply never hit before. >> src/hotspot/share/opto/loopnode.hpp line 1485: >> >>> 1483: void remove_rf(Node* rf); >>> 1484: #ifdef ASSERT >>> 1485: bool has_redundant_rfs(Unique_Node_List& ignored_rfs, bool rf_only); >> >> I would prefer if all the method names spelled out `reachability_fences` instead of `rf / rfs`. > > The arguments are less important for me. There are 2 types of methods here: internal ones (used solely in `reachability.cpp`) and those which are called from loop optimization code (`optimize_reachability_fences` and `eliminate_reachability_fences`). IMO it's counter-productive to repeatedly spell out what "RF" means inside `reachability.cpp`, so I kept the names intact. I split the declarations into public and private ones to stress the distinction. >> src/hotspot/share/opto/reachability.cpp line 46: >> >>> 44: * moves loop-invariant ones outside loops; >>> 45: * (2) reachability information is transferred to safepoint nodes (appended as edges after debug info); >>> 46: * (3) reachability information from safepoints materialized as RF nodes attached to the safepoint node. >> >> Can you expand the explanation a little, please? I don't really understand. Why do you do this? What does it achieve? > > It could be helpful if you wrote a paragraph (maybe at the top), about the interaction of SafePoint and ReachabilityFence. And you should also define "reachability information", I don't yet understand what that entails. I elaborated the description a bit and added more details. Let me know how it reads now. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2320108310 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2320132768 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2320139929 From vlivanov at openjdk.org Wed Sep 3 20:43:54 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 3 Sep 2025 20:43:54 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v3] In-Reply-To: <_yDpYorDH_2ox5RaGm_JdCk4uYbiUYanemuUGR2LCp4=.33c1414a-7c61-45bb-9632-dbff88711fde@github.com> References: <_yDpYorDH_2ox5RaGm_JdCk4uYbiUYanemuUGR2LCp4=.33c1414a-7c61-45bb-9632-dbff88711fde@github.com> Message-ID: On Mon, 16 Jun 2025 09:40:30 GMT, Emanuel Peter wrote: >> Might be helpful if you write in a comment if this eliminates all or just some of the reachability fences. > > Can we limit it to cases where we actually have reachability fences? Good point. Done. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2320121142 From vlivanov at openjdk.org Wed Sep 3 20:43:55 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 3 Sep 2025 20:43:55 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v6] In-Reply-To: References: Message-ID: On Mon, 16 Jun 2025 09:44:48 GMT, Emanuel Peter wrote: >> Are you asking specifically about `ReachabilityFence -> DecodeN -> LoadN` shape? Yes, it's common, especially after inlining. > > @iwanowww Can you add a code comment why this is safe to look through the ReachabilityFence? Done. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2320062861 From vlivanov at openjdk.org Wed Sep 3 21:18:06 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 3 Sep 2025 21:18:06 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v7] In-Reply-To: References: Message-ID: <9zI6zFF3tzgRMp6RidkEIIIYg_qMVU3tfdhQMVG84d4=.1c4e2c34-d8be-40af-b160-0f0542934bae@github.com> > This PR introduces C2 support for `Reference.reachabilityFence()`. > > After [JDK-8199462](https://bugs.openjdk.org/browse/JDK-8199462) went in, it was discovered that C2 may break the invariant the fix relied upon [1]. So, this is an attempt to introduce proper support for `Reference.reachabilityFence()` in C2. C1 is left intact for now, because there are no signs yet it is affected. > > `Reference.reachabilityFence()` can be used in performance critical code, so the primary goal for C2 is to reduce its runtime overhead as much as possible. The ultimate goal is to ensure liveness information is attached to interfering safepoints, but it takes multiple steps to properly propagate the information through compilation pipeline without negatively affecting generated code quality. > > Also, I don't consider this fix as complete. It does fix the reported problem, but it doesn't provide any strong guarantees yet. In particular, since `ReachabilityFence` is CFG-only node, nothing explicitly forbids memory operations to float past `Reference.reachabilityFence()` and potentially reaching some other safepoints current analysis treats as non-interfering. Representing `ReachabilityFence` as memory barrier (e.g., `MemBarCPUOrder`) would solve the issue, but performance costs are prohibitively high. Alternatively, the optimization proposed in this PR can be improved to conservatively extend referent's live range beyond `ReachabilityFence` nodes associated with it. It would meet performance criteria, but I prefer to implement it as a followup fix. 
> > Another known issue relates to reachability fences on constant oops. If such constant is GCed (most likely, due to a bug in Java code), similar reachability issues may arise. For now, RFs on constants are treated as no-ops, but there's a diagnostic flag `PreserveReachabilityFencesOnConstants` to keep the fences. I plan to address it separately. > > [1] https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/ref/Reference.java#L667 > "HotSpot JVM retains the ref and does not GC it before a call to this method, because the JIT-compilers do not have GC-only safepoints." > > Testing: > - [x] hs-tier1 - hs-tier8 > - [x] hs-tier1 - hs-tier6 w/ -XX:+StressReachabilityFences -XX:+VerifyLoopOptimizations > - [x] java/lang/foreign microbenchmarks Vladimir Ivanov has updated the pull request incrementally with one additional commit since the last revision: Update ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25315/files - new: https://git.openjdk.org/jdk/pull/25315/files/0762dda9..bdf1b396 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25315&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25315&range=05-06 Stats: 55 lines in 3 files changed: 52 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/25315.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25315/head:pull/25315 PR: https://git.openjdk.org/jdk/pull/25315 From vlivanov at openjdk.org Wed Sep 3 21:24:46 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 3 Sep 2025 21:24:46 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v2] In-Reply-To: References: <0WKwHjzEn5dxYLkonrk4h9yfMI3r3bKDdqgG06J69N4=.e19e9441-6197-4d53-a4f4-b196a81f69d8@github.com> Message-ID: On Wed, 3 Sep 2025 08:30:47 GMT, Emanuel Peter wrote: >>>> Representing ReachabilityFence as memory barrier (e.g., MemBarCPUOrder) would solve the issue, but performance costs are prohibitively high. >> >>> How bad is it? 
MemBarCPUOrder pinches all memory, so I assume this breaks a lot of optimizations when RF is sitting in the hot loop? I remember we went through a similar exercise with Blackholes: [JDK-8296545](https://bugs.openjdk.org/browse/JDK-8296545) -- and decided to pinch only the control. I'm guessing this is not enough to fix RF, or is it?
>>
>> Yes, if a barrier stays inside loop body, it breaks a lot of important optimizations. It may end up almost as bad as a full-blown call (except a barrier can be moved around while a call can't). And moving a node when it depends both on control and memory is more complicated than just a CFG node. Moreover, as you can see in the proposed solution, even CFG-only representation is problematic for loop opts, so additional care is needed to ensure RFs are moved out of loops.
>>
>> As an alternative approach, I thought about reifying RF as a data node (think of `CastPP`) and then linking its referent to all safepoints it dominates after loop opts are over. But that would only affect `optimize_reachability_fences()`. Everything else would stay the same. So, I decided to stay with CFG-only representation for now.
>
> @iwanowww Let me know whenever this is ready to review again.

@eme64 please, take another look. Thanks!

-------------

PR Comment: https://git.openjdk.org/jdk/pull/25315#issuecomment-3250854323

From vlivanov at openjdk.org Wed Sep 3 21:29:43 2025
From: vlivanov at openjdk.org (Vladimir Ivanov)
Date: Wed, 3 Sep 2025 21:29:43 GMT
Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v8]
In-Reply-To:
References:
Message-ID:

> This PR introduces C2 support for `Reference.reachabilityFence()`.
>
> After [JDK-8199462](https://bugs.openjdk.org/browse/JDK-8199462) went in, it was discovered that C2 may break the invariant the fix relied upon [1]. So, this is an attempt to introduce proper support for `Reference.reachabilityFence()` in C2. C1 is left intact for now, because there are no signs yet it is affected.
> > `Reference.reachabilityFence()` can be used in performance critical code, so the primary goal for C2 is to reduce its runtime overhead as much as possible. The ultimate goal is to ensure liveness information is attached to interfering safepoints, but it takes multiple steps to properly propagate the information through compilation pipeline without negatively affecting generated code quality. > > Also, I don't consider this fix as complete. It does fix the reported problem, but it doesn't provide any strong guarantees yet. In particular, since `ReachabilityFence` is CFG-only node, nothing explicitly forbids memory operations to float past `Reference.reachabilityFence()` and potentially reaching some other safepoints current analysis treats as non-interfering. Representing `ReachabilityFence` as memory barrier (e.g., `MemBarCPUOrder`) would solve the issue, but performance costs are prohibitively high. Alternatively, the optimization proposed in this PR can be improved to conservatively extend referent's live range beyond `ReachabilityFence` nodes associated with it. It would meet performance criteria, but I prefer to implement it as a followup fix. > > Another known issue relates to reachability fences on constant oops. If such constant is GCed (most likely, due to a bug in Java code), similar reachability issues may arise. For now, RFs on constants are treated as no-ops, but there's a diagnostic flag `PreserveReachabilityFencesOnConstants` to keep the fences. I plan to address it separately. > > [1] https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/ref/Reference.java#L667 > "HotSpot JVM retains the ref and does not GC it before a call to this method, because the JIT-compilers do not have GC-only safepoints." 
> > Testing: > - [x] hs-tier1 - hs-tier8 > - [x] hs-tier1 - hs-tier6 w/ -XX:+StressReachabilityFences -XX:+VerifyLoopOptimizations > - [x] java/lang/foreign microbenchmarks Vladimir Ivanov has updated the pull request incrementally with one additional commit since the last revision: whitespaces ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25315/files - new: https://git.openjdk.org/jdk/pull/25315/files/bdf1b396..e95d4eb9 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25315&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25315&range=06-07 Stats: 9 lines in 1 file changed: 0 ins; 0 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/25315.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25315/head:pull/25315 PR: https://git.openjdk.org/jdk/pull/25315 From missa at openjdk.org Thu Sep 4 00:22:58 2025 From: missa at openjdk.org (Mohamed Issa) Date: Thu, 4 Sep 2025 00:22:58 GMT Subject: RFR: 8364305: Support AVX10 saturating floating point conversion instructions [v4] In-Reply-To: References: Message-ID: > Intel® AVX10 ISA [1] extensions added new saturating floating point conversion instructions which comply with definitions in section 5.8 of the 2019 IEEE-754 standard. They can compute floating point to integral type conversions while also handling special inputs such as NaN, +Infinity, and -Infinity. > > Without AVX10.2, the current approach starts by converting the floating point value(s) in the source register to the desired integral value(s) in the destination register. In the scalar case, the CVTTSS2SI (single precision) or CVTTSD2SI (double precision) instruction is used. In the vector case, the CVTTPS2DQ (single precision) or CVTTPD2DQ (double precision) is used. However, if the source contains a special value (NaN, -Infinity, +Infinity, <= Integer.MIN_VALUE, or >= Integer.MAX_VALUE), extra handling is required. 
The specific sequence of instructions involved depends on the source (single precision vs double precision), destination (long, integer, short, or byte), level of parallelization (scalar vs vector), and supported AVX extension type. Essentially though, the special values are mapped to values in the integer range (NaN -> 0; -Infinity and values <= Integer.MIN_VALUE -> Integer.MIN_VALUE; +Infinity and values >= Integer.MAX_VALUE -> Integer.MAX_VALUE) with the help of a few temporary registers to store intermediate results.
>
> This change uses the new AVX10.2 scalar (VCVTTSS2SIS or VCVTTSD2SIS) and vector (VCVTTPS2QQS, VCVTTPS2DQS, VCVTTPD2QQS, and VCVTTPD2DQS) instructions on supported platforms to avoid the extra handling described above. Also, the JTREG tests listed below were used to verify correctness with `-XX:-UseSuperWord` / `-XX:+UseSuperWord` options to exercise both scalar and vector paths. The baseline build used is [OpenJDK v26-b11](https://github.com/openjdk/jdk/releases/tag/jdk-26%2B11).
>
> 1. `jtreg:test/hotspot/jtreg/compiler/codegen/TestByteDoubleVect.java`
> 2. `jtreg:test/hotspot/jtreg/compiler/codegen/TestByteFloatVect.java`
> 3. `jtreg:test/hotspot/jtreg/compiler/codegen/TestIntDoubleVect.java`
> 4. `jtreg:test/hotspot/jtreg/compiler/codegen/TestIntFloatVect.java`
> 5. `jtreg:test/hotspot/jtreg/compiler/codegen/TestLongDoubleVect.java`
> 6. `jtreg:test/hotspot/jtreg/compiler/codegen/TestLongFloatVect.java`
> 7. `jtreg:test/hotspot/jtreg/compiler/codegen/TestShortDoubleVect.java`
> 8. `jtreg:test/hotspot/jtreg/compiler/codegen/TestShortFloatVect.java`
> 9. `jtreg:test/hotspot/jtreg/compiler/vectorapi/VectorFPtoIntCastTest.java`
> 10. `jtreg:test/hotspot/jtreg/com...

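The special-value mapping in the description above matches what plain Java casts already do for `int` and `long` targets (narrowing primitive conversion, JLS 5.1.3). The following is only a small sketch of those observable semantics, with hypothetical class and method names; it says nothing about which instructions the JIT actually selects.

```java
final class SaturatingCastSketch {
    // float -> int using the language-level narrowing conversion.
    static int f2i(float f) {
        return (int) f;
    }

    // double -> long using the language-level narrowing conversion.
    static long d2l(double d) {
        return (long) d;
    }

    public static void main(String[] args) {
        float[] inputs = {
            Float.NaN, Float.NEGATIVE_INFINITY, Float.POSITIVE_INFINITY,
            -1e20f, 1e20f, -1.9f
        };
        // Print each input next to its converted value.
        for (float f : inputs) {
            System.out.println(f + " -> " + f2i(f));
        }
    }
}
```

Whatever code sequence is emitted, with or without AVX10.2, has to agree with these results, which is what the listed JTREG tests check on both the scalar and vector paths.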
Mohamed Issa has updated the pull request incrementally with two additional commits since the last revision: - Update floating point conversion tests to check for AVX 10.2 CPU feature ID - Correct matching rules for AVX 10.2 floating point conversion instructions that involve memory ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26919/files - new: https://git.openjdk.org/jdk/pull/26919/files/be5c0b4e..07ac817a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26919&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26919&range=02-03 Stats: 22 lines in 4 files changed: 0 ins; 0 del; 22 mod Patch: https://git.openjdk.org/jdk/pull/26919.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26919/head:pull/26919 PR: https://git.openjdk.org/jdk/pull/26919 From dlong at openjdk.org Thu Sep 4 00:31:42 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 4 Sep 2025 00:31:42 GMT Subject: RFR: 8366461: Remove obsolete method handle invoke logic [v3] In-Reply-To: References: <_pqvEs0LIlAc7RjFUwg-bpxS3D2v5U7c6In2sG8XLhQ=.57e3aead-6ac4-4a42-89d2-385d7e6ecedf@github.com> Message-ID: On Wed, 3 Sep 2025 07:12:20 GMT, Manuel H?ssig wrote: >> Dean Long has updated the pull request incrementally with three additional commits since the last revision: >> >> - revert whitespace change >> - undo debug changes >> - cleanup > > src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/runtime/aarch64/AARCH64Frame.java line 372: > >> 370: // DEBUG_ONLY(verifyDeoptriginalPc(senderNm, raw_unextendedSp)); >> 371: } >> 372: } > > `Frame.java adjustUnextendedSP()` do not seem to do anything? Perhaps these could be cleaned up as well? Yes, it's tempting to want to clean these up, but I noticed that SA code really tries to mirror the C++ code, so I'm inclined to leave it. Is there a Serviceability expert that would like to see this code cleaned up further? @plummercj , what do you think? 

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/27059#discussion_r2320526360

From xgong at openjdk.org Thu Sep 4 02:15:45 2025
From: xgong at openjdk.org (Xiaohong Gong)
Date: Thu, 4 Sep 2025 02:15:45 GMT
Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation [v5]
In-Reply-To:
References:
Message-ID:

On Mon, 4 Aug 2025 02:31:08 GMT, Xiaohong Gong wrote:

>> This is a follow-up patch of [1], which aims at implementing the subword gather load APIs for AArch64 SVE platform.
>>
>> ### Background
>> Vector gather load APIs load values from memory addresses calculated by adding a base pointer to integer indices. SVE provides native gather load instructions for `byte`/`short` types using `int` vectors for indices. The vector size for a gather-load instruction is determined by the index vector (i.e. `int` elements). Hence, the total size is `32 * elem_num` bits, where `elem_num` is the number of loaded elements in the vector register.
>>
>> ### Implementation
>>
>> #### Challenges
>> Due to size differences between `int` indices (32-bit) and `byte`/`short` data (8/16-bit), operations must be split across multiple vector registers based on the target SVE vector register size constraints.
>>
>> For a 512-bit SVE machine, loading a `byte` vector with different vector species requires different approaches:
>> - SPECIES_64: Single operation with mask (8 elements, 256-bit)
>> - SPECIES_128: Single operation, full register (16 elements, 512-bit)
>> - SPECIES_256: Two operations + merge (32 elements, 1024-bit)
>> - SPECIES_512/MAX: Four operations + merge (64 elements, 2048-bit)
>>
>> Use `ByteVector.SPECIES_512` as an example:
>> - It contains 64 elements. So the index vector size should be `64 * 32` bits, which is 4 times the SVE vector register size.
>> - It requires 4 vector gather-loads to finish the whole operation.
>>
>>
>> byte[] arr = [a, a, a, a, ..., a, b, b, b, b, ..., b, c, c, c, c, ..., c, d, d, d, d, ..., d, ...]
>> int[] idx = [0, 1, 2, 3, ..., 63, ...]
>>
>> 4 gather-load:
>> idx_v1 = [15 14 13 ... 1 0]    gather_v1 = [... 0000 0000 0000 0000 aaaa aaaa aaaa aaaa]
>> idx_v2 = [31 30 29 ... 17 16]  gather_v2 = [... 0000 0000 0000 0000 bbbb bbbb bbbb bbbb]
>> idx_v3 = [47 46 45 ... 33 32]  gather_v3 = [... 0000 0000 0000 0000 cccc cccc cccc cccc]
>> idx_v4 = [63 62 61 ... 49 48]  gather_v4 = [... 0000 0000 0000 0000 dddd dddd dddd dddd]
>> merge: v = [dddd dddd dddd dddd cccc cccc cccc cccc bbbb bbbb bbbb bbbb aaaa aaaa aaaa aaaa]
>>
>>
>> #### Solution
>> The implementation simplifies backend complexity by defining each gather load IR to handle one vector gather-load operation, with multiple IRs generated in the compiler mid-end.
>>
>> Here are the main changes:
>> - Enhanced IR generation with architecture-specific patterns based on `gather_scatter_needs_vector_index()` matcher.
>> - Added `VectorSliceNode` for result mer...
>
> Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains six commits:
>
> - Merge 'jdk:master' into JDK-8351623-sve
> - Address review comments
> - Refine IR pattern and clean backend rules
> - Fix indentation issue and move the helper matcher method to header files
> - Merge branch jdk:master into JDK-8351623-sve
> - 8351623: VectorAPI: Add SVE implementation of subword gather load operation

Hi @eme64, could you please help take a look at this PR? Thanks a lot in advance!

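The split-and-merge scheme in the quoted description can be emulated in scalar Java to make the operation counts concrete. This is a sketch under stated assumptions (a 512-bit register and 32-bit indices, so 16 lanes per gather); the class, constants, and method names are hypothetical and do not correspond to C2 IR nodes or to the Vector API.

```java
final class SubwordGatherSketch {
    // Assumed for illustration: 512-bit vector register, 32-bit int indices.
    static final int REGISTER_BITS = 512;
    static final int LANES_PER_GATHER = REGISTER_BITS / Integer.SIZE; // 16 lanes

    // Number of gather-load operations needed for a byte vector of elemNum lanes.
    static int gatherCount(int elemNum) {
        return (elemNum + LANES_PER_GATHER - 1) / LANES_PER_GATHER;
    }

    // One emulated gather-load: reads 'count' bytes through idx[from .. from+count).
    static byte[] gatherPart(byte[] arr, int[] idx, int from, int count) {
        byte[] part = new byte[count];
        for (int lane = 0; lane < count; lane++) {
            part[lane] = arr[idx[from + lane]];
        }
        return part;
    }

    // Full gather: one partial gather per 16 indices, merged into the result.
    static byte[] gather(byte[] arr, int[] idx, int elemNum) {
        byte[] result = new byte[elemNum];
        for (int from = 0; from < elemNum; from += LANES_PER_GATHER) {
            int count = Math.min(LANES_PER_GATHER, elemNum - from);
            byte[] part = gatherPart(arr, idx, from, count);
            System.arraycopy(part, 0, result, from, count); // merge step
        }
        return result;
    }

    public static void main(String[] args) {
        for (int elemNum : new int[] { 8, 16, 32, 64 }) {
            System.out.println(elemNum + " byte lanes: " + gatherCount(elemNum) + " gather-load(s)");
        }
    }
}
```

With these assumptions the counts line up with the species list above: one operation for 8 or 16 lanes, two for 32, and four for 64.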
------------- PR Comment: https://git.openjdk.org/jdk/pull/26236#issuecomment-3251469639 From xgong at openjdk.org Thu Sep 4 02:18:47 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 4 Sep 2025 02:18:47 GMT Subject: RFR: 8363989: AArch64: Add missing backend support of VectorAPI expand operation [v2] In-Reply-To: <_YDJIkwt0sdsOAMfNNn1fHTVwH0SHDpJv5NpQoxnfiA=.a0ddb5f3-00f1-47e2-93da-f47cb3f62288@github.com> References: <_YDJIkwt0sdsOAMfNNn1fHTVwH0SHDpJv5NpQoxnfiA=.a0ddb5f3-00f1-47e2-93da-f47cb3f62288@github.com> Message-ID: On Thu, 21 Aug 2025 07:00:35 GMT, erifan wrote: >> Currently, on AArch64, the VectorAPI `expand` operation is intrinsified for 32-bit and 64-bit types only when SVE2 is available. In the following cases, `expand` has not yet been intrinsified: >> 1. **Subword types** on SVE2-capable hardware. >> 2. **All types** on NEON and SVE1 environments. >> >> As a result, `expand` API performance is very poor in these scenarios. This patch intrinsifies the `expand` operation in the above environments. >> >> Since there are no native instructions directly corresponding to `expand` in these cases, this patch mainly leverages the `TBL` instruction to implement `expand`. To compute the index input for `TBL`, the prefix sum algorithm (see https://en.wikipedia.org/wiki/Prefix_sum) is used. Take a 128-bit byte vector on SVE2 as an example: >> >> To compute: dst = src.expand(mask) >> Data direction: high <== low >> Input: >> src = p o n m l k j i h g f e d c b a >> mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 >> Expected result: >> dst = 0 0 h g 0 0 f e 0 0 d c 0 0 b a >> >> Step 1: calculate the index input of the TBL instruction. >> >> // Set tmp1 as all 0 vector. >> tmp1 = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 >> >> // Move the mask bits from the predicate register to a vector register. >> // **1-bit** mask lane of P register to **8-bit** mask lane of V register. >> tmp2 = mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 >> >> // Shift the entire register. 
Prefix sum algorithm. >> dst = tmp2 << 8 = 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 >> tmp2 += dst = 0 1 2 1 0 1 2 1 0 1 2 1 0 1 2 1 >> >> dst = tmp2 << 16 = 2 1 0 1 2 1 0 1 2 1 0 1 2 1 0 0 >> tmp2 += dst = 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 >> >> dst = tmp2 << 32 = 2 2 2 2 2 2 2 2 2 2 2 1 0 0 0 0 >> tmp2 += dst = 4 4 4 4 4 4 4 4 4 4 4 3 2 2 2 1 >> >> dst = tmp2 << 64 = 4 4 4 3 2 2 2 1 0 0 0 0 0 0 0 0 >> tmp2 += dst = 8 8 8 7 6 6 6 5 4 4 4 3 2 2 2 1 >> >> // Clear inactive elements. >> dst = sel(mask, tmp2, tmp1) = 0 0 8 7 0 0 6 5 0 0 4 3 0 0 2 1 >> >> // Set the inactive lane value to -1 and set the active lane to the target index. >> dst -= 1 = -1 -1 7 6 -1 -1 5 4 -1 -1 3 2 -1 -1 1 0 >> >> Step 2: shuffle the source vector elements to the target vector >> >> tbl(dst, src, dst) = 0 0 h g 0 0 f e 0 0 d c 0 0 b a >> >> >> The same algorithm is used for NEON and... > > erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Improve the comment of the vector expand implementation > - Merge branch 'master' into JDK-8363989 > - 8363989: AArch64: Add missing backend support of VectorAPI expand operation > > Currently, on AArch64, the VectorAPI `expand` operation is intrinsified > for 32-bit and 64-bit types only when SVE2 is available. In the following > cases, `expand` has not yet been intrinsified: > 1. **Subword types** on SVE2-capable hardware. > 2. **All types** on NEON and SVE1 environments. > > As a result, `expand` API performance is very poor in these scenarios. > This patch intrinsifies the `expand` operation in the above environments. > > Since there are no native instructions directly corresponding to `expand` > in these cases, this patch mainly leverages the `TBL` instruction to > implement `expand`. 
To compute the index input for `TBL`, the prefix sum > algorithm (see https://en.wikipedia.org/wiki/Prefix_sum) is used. > Take a 128-bit byte vector on SVE2 as an example: > ``` > To compute: dst = src.expand(mask) > Data direction: high <== low > Input: > src = p o n m l k j i h g f e d c b a > mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 > Expected result: > dst = 0 0 h g 0 0 f e 0 0 d c 0 0 b a > ``` > Step 1: calculate the index input of the TBL instruction. > ``` > // Set tmp1 as all 0 vector. > tmp1 = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 > > // Move the mask bits from the predicate register to a vector register. > // **1-bit** mask lane of P register to **8-bit** mask lane of V register. > tmp2 = mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 > > // Shift the entire register. Prefix sum algorithm. > dst = tmp2 << 8 = 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 > tmp2 += dst = 0 1 2 1 0 1 2 1 0 1 2 1 0 1 2 1 > > dst = tmp2 << 16 = 2 1 0 1 2 1 0 1 2 1 0 1 2 1 0 0 > tmp2 += dst = 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 > > dst = tmp2 << 32 = 2 2 2 2 2 2 2 2 2 2 2 1 0 0 0 0 > tmp2 += dst = 4 4 4 4 4 4 4 4 4 4 4 3 2 2 2 1 > > dst = tmp2 << 64 = 4 4 4 3 2 2 2 1 0 0 0 0 0 0 0 0 > tmp2 += ... Reviewed internally. So LGTM! ------------- Marked as reviewed by xgong (Committer). PR Review: https://git.openjdk.org/jdk/pull/26740#pullrequestreview-3183180082 From haosun at openjdk.org Thu Sep 4 02:47:45 2025 From: haosun at openjdk.org (Hao Sun) Date: Thu, 4 Sep 2025 02:47:45 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation [v5] In-Reply-To: References: Message-ID: On Mon, 4 Aug 2025 02:31:08 GMT, Xiaohong Gong wrote: >> This is a follow-up patch of [1], which aims at implementing the subword gather load APIs for AArch64 SVE platform. >> >> ### Background >> Vector gather load APIs load values from memory addresses calculated by adding a base pointer to integer indices. 
SVE provides native gather load instructions for `byte`/`short` types using `int` vectors for indices. The vector size for a gather-load instruction is determined by the index vector (i.e. `int` elements). Hence, the total size is `32 * elem_num` bits, where `elem_num` is the number of loaded elements in the vector register.
>>
>> ### Implementation
>>
>> #### Challenges
>> Due to size differences between `int` indices (32-bit) and `byte`/`short` data (8/16-bit), operations must be split across multiple vector registers based on the target SVE vector register size constraints.
>>
>> For a 512-bit SVE machine, loading a `byte` vector with different vector species requires different approaches:
>> - SPECIES_64: Single operation with mask (8 elements, 256-bit)
>> - SPECIES_128: Single operation, full register (16 elements, 512-bit)
>> - SPECIES_256: Two operations + merge (32 elements, 1024-bit)
>> - SPECIES_512/MAX: Four operations + merge (64 elements, 2048-bit)
>>
>> Use `ByteVector.SPECIES_512` as an example:
>> - It contains 64 elements. So the index vector size should be `64 * 32` bits, which is 4 times the SVE vector register size.
>> - It requires 4 vector gather-loads to finish the whole operation.
>>
>>
>> byte[] arr = [a, a, a, a, ..., a, b, b, b, b, ..., b, c, c, c, c, ..., c, d, d, d, d, ..., d, ...]
>> int[] idx = [0, 1, 2, 3, ..., 63, ...]
>>
>> 4 gather-load:
>> idx_v1 = [15 14 13 ... 1 0]    gather_v1 = [... 0000 0000 0000 0000 aaaa aaaa aaaa aaaa]
>> idx_v2 = [31 30 29 ... 17 16]  gather_v2 = [... 0000 0000 0000 0000 bbbb bbbb bbbb bbbb]
>> idx_v3 = [47 46 45 ... 33 32]  gather_v3 = [... 0000 0000 0000 0000 cccc cccc cccc cccc]
>> idx_v4 = [63 62 61 ... 49 48]  gather_v4 = [... 0000 0000 0000 0000 dddd dddd dddd dddd]
>> merge: v = [dddd dddd dddd dddd cccc cccc cccc cccc bbbb bbbb bbbb bbbb aaaa aaaa aaaa aaaa]
>>
>>
>> #### Solution
>> The implementation simplifies backend complexity by defining each gather load IR to handle one vector gather-load operation, with multiple IRs generated in the compiler mid-end.
>>
>> Here are the main changes:
>> - Enhanced IR generation with architecture-specific patterns based on `gather_scatter_needs_vector_index()` matcher.
>> - Added `VectorSliceNode` for result mer...
>
> Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains six commits:
>
> - Merge 'jdk:master' into JDK-8351623-sve
> - Address review comments
> - Refine IR pattern and clean backend rules
> - Fix indentation issue and move the helper matcher method to header files
> - Merge branch jdk:master into JDK-8351623-sve
> - 8351623: VectorAPI: Add SVE implementation of subword gather load operation

LGTM

-------------

Marked as reviewed by haosun (Committer).

PR Review: https://git.openjdk.org/jdk/pull/26236#pullrequestreview-3183215589

From missa at openjdk.org Thu Sep 4 05:20:30 2025
From: missa at openjdk.org (Mohamed Issa)
Date: Thu, 4 Sep 2025 05:20:30 GMT
Subject: RFR: 8364305: Support AVX10 saturating floating point conversion instructions [v5]
In-Reply-To:
References:
Message-ID:

> Intel® AVX10 ISA [1] extensions added new saturating floating point conversion instructions which comply with definitions in section 5.8 of the 2019 IEEE-754 standard. They can compute floating point to integral type conversions while also handling special inputs such as NaN, +Infinity, and -Infinity.
>
> Without AVX10.2, the current approach starts by converting the floating point value(s) in the source register to the desired integral value(s) in the destination register. In the scalar case, the CVTTSS2SI (single precision) or CVTTSD2SI (double precision) instruction is used.
In the vector case, the CVTTPS2DQ (single precision) or CVTTPD2DQ (double precision) instruction is used. However, if the source contains a special value (NaN, -Infinity, +Infinity, <= Integer.MIN_VALUE, or >= Integer.MAX_VALUE), extra handling is required. The specific sequence of instructions involved depends on the source (single precision vs double precision), destination (long, integer, short, or byte), level of parallelization (scalar vs vector), and supported AVX extension type. Essentially though, the special values are mapped to values (NaN -> 0, -Infinity, <= Integer.MIN_VALUE -> Integer.MIN_VALUE, +Infinity, >= Integer.MAX_VALUE -> Integer.MAX_VALUE) in the integer range with the help of a few temporary registers to store intermediate results. > > This change uses the new AVX10.2 scalar (VCVTTSS2SIS or VCVTTSD2SIS) and vector (VCVTTPS2QQS, VCVTTPS2DQS, VCVTTPD2QQS, and VCVTTPD2DQS) instructions on supported platforms to avoid the extra handling described above. Also, the JTREG tests listed below were used to verify correctness with `-XX:-UseSuperWord` / `-XX:+UseSuperWord` options to exercise both scalar and vector paths. The baseline build used is [OpenJDK v26-b11](https://github.com/openjdk/jdk/releases/tag/jdk-26%2B11). > > 1. `jtreg:test/hotspot/jtreg/compiler/codegen/TestByteDoubleVect.java` > 2. `jtreg:test/hotspot/jtreg/compiler/codegen/TestByteFloatVect.java` > 3. `jtreg:test/hotspot/jtreg/compiler/codegen/TestIntDoubleVect.java` > 4. `jtreg:test/hotspot/jtreg/compiler/codegen/TestIntFloatVect.java` > 5. `jtreg:test/hotspot/jtreg/compiler/codegen/TestLongDoubleVect.java` > 6. `jtreg:test/hotspot/jtreg/compiler/codegen/TestLongFloatVect.java` > 7. `jtreg:test/hotspot/jtreg/compiler/codegen/TestShortDoubleVect.java` > 8. `jtreg:test/hotspot/jtreg/compiler/codegen/TestShortFloatVect.java` > 9. `jtreg:test/hotspot/jtreg/compiler/vectorapi/VectorFPtoIntCastTest.java` > 10. `jtreg:test/hotspot/jtreg/com... 
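
The special-value mapping described in this message is the behavior Java specifies for the narrowing `float` to `int` cast, so it can be spelled out at the source level. The sketch below is only an illustration of that mapping, not code from the patch; the class and method names (`SaturatingCastDemo`, `f2i`) are made up for the example.

```java
// Reference model of Java's float -> int narrowing conversion.
// Names are hypothetical; this is not part of the PR under review.
final class SaturatingCastDemo {
    // Explicit version of what the cast (int) f is specified to do.
    static int f2i(float f) {
        if (Float.isNaN(f)) {
            return 0;                     // NaN -> 0
        }
        if (f >= (float) Integer.MAX_VALUE) {
            return Integer.MAX_VALUE;     // +Infinity and large values saturate
        }
        if (f <= (float) Integer.MIN_VALUE) {
            return Integer.MIN_VALUE;     // -Infinity and small values saturate
        }
        return (int) f;                   // in range: truncate toward zero
    }

    public static void main(String[] args) {
        float[] inputs = {
            Float.NaN, Float.NEGATIVE_INFINITY, Float.POSITIVE_INFINITY,
            1e20f, -1e20f, -3.9f, 42.5f
        };
        for (float f : inputs) {
            // The explicit mapping must agree with the language-level cast.
            if (f2i(f) != (int) f) {
                throw new AssertionError("mismatch for " + f);
            }
        }
        System.out.println("explicit mapping matches (int) cast");
    }
}
```

Whichever instruction sequence the JIT selects, the result has to agree with this mapping; per the description, the saturating instructions produce it directly, without the fix-up code.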
Mohamed Issa has updated the pull request incrementally with one additional commit since the last revision: Add AVX 10.2 CPU feature flag to list of verified ones ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26919/files - new: https://git.openjdk.org/jdk/pull/26919/files/07ac817a..e0c84f69 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26919&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26919&range=03-04 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26919.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26919/head:pull/26919 PR: https://git.openjdk.org/jdk/pull/26919 From jbhateja at openjdk.org Thu Sep 4 05:44:45 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 4 Sep 2025 05:44:45 GMT Subject: RFR: 8364305: Support AVX10 saturating floating point conversion instructions [v5] In-Reply-To: References: Message-ID: On Thu, 4 Sep 2025 05:20:30 GMT, Mohamed Issa wrote: >> Intel® AVX10 ISA [1] extensions added new saturating floating point conversion instructions which comply with definitions in section 5.8 of the 2019 IEEE-754 standard. They can compute floating point to integral type conversions while also handling special inputs such as NaN, +Infinity, and -Infinity. >> >> Without AVX10.2, the current approach starts by converting the floating point value(s) in the source register to the desired integral value(s) in the destination register. In the scalar case, the CVTTSS2SI (single precision) or CVTTSD2SI (double precision) instruction is used. In the vector case, the CVTTPS2DQ (single precision) or CVTTPD2DQ (double precision) is used. However, if the source contains a special value (NaN, -Infinity, +Infinity, <= Integer.MIN_VALUE, or >= Integer.MAX_VALUE), extra handling is required. 
The specific sequence of instructions involved depends on the source (single precision vs double precision), destination (long, integer, short, or byte), level of parallelization (scalar vs vector), and supported AVX extension type. Essentially though, the special values are mapped to values (NaN -> 0, -Infinity, <= Integer.MIN_VALUE -> Integer.MIN_VALUE, +Infinity, >= Integer.MAX_VALUE -> Integer.MAX_VALUE) in the integer range with the help of a few temporary registers to store intermediate results. >> >> This change uses the new AVX10.2 scalar (VCVTTSS2SIS or VCVTTSD2SIS) and vector (VCVTTPS2QQS, VCVTTPS2DQS, VCVTTPD2QQS, and VCVTTPD2DQS) instructions on supported platforms to avoid the extra handling described above. Also, the JTREG tests listed below were used to verify correctness with `-XX:-UseSuperWord` / `-XX:+UseSuperWord` options to exercise both scalar and vector paths. The baseline build used is [OpenJDK v26-b11](https://github.com/openjdk/jdk/releases/tag/jdk-26%2B11). >> >> 1. `jtreg:test/hotspot/jtreg/compiler/codegen/TestByteDoubleVect.java` >> 2. `jtreg:test/hotspot/jtreg/compiler/codegen/TestByteFloatVect.java` >> 3. `jtreg:test/hotspot/jtreg/compiler/codegen/TestIntDoubleVect.java` >> 4. `jtreg:test/hotspot/jtreg/compiler/codegen/TestIntFloatVect.java` >> 5. `jtreg:test/hotspot/jtreg/compiler/codegen/TestLongDoubleVect.java` >> 6. `jtreg:test/hotspot/jtreg/compiler/codegen/TestLongFloatVect.java` >> 7. `jtreg:test/hotspot/jtreg/compiler/codegen/TestShortDoubleVect.java` >> 8. `jtreg:test/hotspot/jtreg/compiler/codegen/TestShortFloatVect.java` >> 9. `jtreg:test/hotspot/jtreg/compiler/vectorapi/VectorFPtoIntCastTest.java` >> 1... 
> > Mohamed Issa has updated the pull request incrementally with one additional commit since the last revision: > > Add AVX 10.2 CPU feature flag to list of verified ones test/hotspot/jtreg/compiler/vectorapi/VectorFPtoIntCastTest.java line 90: > 88: @Test > 89: @IR(counts = {IRNode.VECTOR_CAST_F2I, IRNode.VECTOR_SIZE_16, "> 0"}, > 90: applyIfCPUFeatureOr = {"avx512f", "true", "avx10_2", "true"}) You should check for the target-specific machine IR that is selected on AVX10_2 targets. test/hotspot/jtreg/compiler/vectorapi/VectorFPtoIntCastTest.java line 108: > 106: @Test > 107: @IR(counts = {IRNode.VECTOR_CAST_F2L, IRNode.VECTOR_SIZE_8, "> 0"}, > 108: applyIfCPUFeatureOr = {"avx512dq", "true", "avx10_2", "true"}) avx10_2 is a superset of AVX512DQ; we enable all AVX512 features during VM initialization, and the IR framework relies on the same. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2320889420 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2320891875 From jbhateja at openjdk.org Thu Sep 4 05:47:25 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 4 Sep 2025 05:47:25 GMT Subject: RFR: 8365205: C2: Optimize popcount value computation using knownbits Message-ID: This patch optimizes PopCount value transforms using KnownBits information. 
Following are the results of the micro-benchmark included with the patch System: INTEL(R) XEON(R) PLATINUM 8581C CPU @ 2.30GHz (Emerald Rapids) Baseline:- Benchmark Mode Cnt Score Error Units PopCountValueTransform.LogicFoldingKerenLong thrpt 2 151997.051 ops/s PopCountValueTransform.LogicFoldingKerenlInt thrpt 2 161261.825 ops/s PopCountValueTransform.StockKernelInt thrpt 2 194680.419 ops/s PopCountValueTransform.StockKernelLong thrpt 2 216580.319 ops/s Withopt:- Benchmark Mode Cnt Score Error Units PopCountValueTransform.LogicFoldingKerenLong thrpt 2 216502.647 ops/s PopCountValueTransform.LogicFoldingKerenlInt thrpt 2 193400.575 ops/s PopCountValueTransform.StockKernelInt thrpt 2 195595.989 ops/s PopCountValueTransform.StockKernelLong thrpt 2 217776.426 ops/s Kindly review and share your feedback. Best Regards, Jatin ------------- Commit messages: - 8365205: C2: Optimize popcount value computation using knownbits Changes: https://git.openjdk.org/jdk/pull/27075/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27075&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8365205 Stats: 137 lines in 3 files changed: 137 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/27075.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27075/head:pull/27075 PR: https://git.openjdk.org/jdk/pull/27075 From hgreule at openjdk.org Thu Sep 4 06:29:40 2025 From: hgreule at openjdk.org (Hannes Greule) Date: Thu, 4 Sep 2025 06:29:40 GMT Subject: RFR: 8365205: C2: Optimize popcount value computation using knownbits In-Reply-To: References: Message-ID: On Wed, 3 Sep 2025 16:10:43 GMT, Jatin Bhateja wrote: > This patch optimizes PopCount value transforms using KnownBits information. 
> Following are the results of the micro-benchmark included with the patch > > System: INTEL(R) XEON(R) PLATINUM 8581C CPU @ 2.30GHz (Emerald Rapids) > > > Baseline:- > Benchmark Mode Cnt Score Error Units > PopCountValueTransform.LogicFoldingKerenLong thrpt 2 151997.051 ops/s > PopCountValueTransform.LogicFoldingKerenlInt thrpt 2 161261.825 ops/s > PopCountValueTransform.StockKernelInt thrpt 2 194680.419 ops/s > PopCountValueTransform.StockKernelLong thrpt 2 216580.319 ops/s > > Withopt:- > Benchmark Mode Cnt Score Error Units > PopCountValueTransform.LogicFoldingKerenLong thrpt 2 216502.647 ops/s > PopCountValueTransform.LogicFoldingKerenlInt thrpt 2 193400.575 ops/s > PopCountValueTransform.StockKernelInt thrpt 2 195595.989 ops/s > PopCountValueTransform.StockKernelLong thrpt 2 217776.426 ops/s > > > Kindly review and share your feedback. > > Best Regards, > Jatin The change looks good, but I wonder: - if it makes sense to have some kind of IR tests (i.e., it's folded away when unneeded, when the input is a constant, ...)? - whether the explanation could be simplified: Assuming a correct implementation of the KnownBits canonicalization, we can argue - `_zeroes` has the bits set that are known to be always 0. So `BitsPer - popCount(x)` gives you an upper limit of how many bits *might* be 1. And `BitsPer - popCount(_zeroes)` is equivalent to `popCount(~_zeroes)`. - `_ones` has the bits set that are known to be always 1. Trivially, `popCount(_ones)` is a valid lower bound. - The rest repeats how `adjust_bits_from_unsigned_bounds` works, but that's not specific to the popcount nodes. 
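
The simplified argument above can be checked with a few lines of Java. This is a sketch of the reasoning only, not the C2 implementation: `zeroes` and `ones` stand in for the known-bits masks mentioned in the comment, and the class and method names are invented for the example.

```java
// Toy model of popcount bounds derived from known bits (32-bit case).
// zeroes: mask of bits known to be 0; ones: mask of bits known to be 1.
final class PopCountBounds {
    // Every known-one bit contributes to the count, so this is a lower bound.
    static int lowerBound(int ones) {
        return Integer.bitCount(ones);
    }

    // Only bits not known to be zero can be set; equals bitCount(~zeroes).
    static int upperBound(int zeroes) {
        return Integer.SIZE - Integer.bitCount(zeroes);
    }

    public static void main(String[] args) {
        // For x = (y & 0xFF) | 0x3: bits 8..31 are known zero, bits 0..1 known one.
        int zeroes = 0xFFFFFF00;
        int ones = 0x00000003;
        int lo = lowerBound(ones);
        int hi = upperBound(zeroes);
        for (int y = -1000; y <= 1000; y++) {
            int x = (y & 0xFF) | 0x3;
            int pc = Integer.bitCount(x);
            if (pc < lo || pc > hi) {
                throw new AssertionError("bound violated for y=" + y);
            }
        }
        System.out.println("popcount range: [" + lo + ", " + hi + "]");
    }
}
```

For the masks in `main`, the bounds come out as 2 and 8, and the loop confirms that no sampled input escapes that range.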
------------- PR Comment: https://git.openjdk.org/jdk/pull/27075#issuecomment-3252114288 From dskantz at openjdk.org Thu Sep 4 06:34:44 2025 From: dskantz at openjdk.org (Daniel Skantz) Date: Thu, 4 Sep 2025 06:34:44 GMT Subject: RFR: 8362394: C2: Repeated stacked string concatenation fails with "Hit MemLimit" and other resourcing errors [v4] In-Reply-To: References: Message-ID: On Thu, 21 Aug 2025 07:41:32 GMT, Daniel Skantz wrote: >> This PR addresses a bug in the stringopts phase. During string concatenation, repeated stacking of concatenations can lead to excessive compilation resource use and generation of questionable code as the merging of two StringBuilder-append-toString links sc1 and sc2 can result in a new StringBuilder with the size sc1->num_arguments() * sc2->num_arguments(). >> >> In the attached test, the size of the successively merged StringBuilder doubles on each merge -- there's 24 of them -- as the toString result of the first component is used twice in the second component [1], etc. Not only does the compiler hang on this test case, but the string concat optimization seems to give an arbitrary amount of back-to-back stores in the generated code depending on the number of stacked concatenations. >> >> The proposed solution is to put an upper bound on the size of a merged concatenation, which guards against this case of repeated concatenations on the same string variable, and potentially other edge cases. 100 seems like a generous limit, and higher limits could be insufficient as each argument corresponds to about 20 new nodes later in replace_string_concat [2]. >> >> [1] https://github.com/openjdk/jdk/blob/0ceb366dc26e2e4f6252da9dd8930b016a5d46ba/src/hotspot/share/opto/stringopts.cpp#L303 >> >> [2] https://github.com/openjdk/jdk/blob/0ceb366dc26e2e4f6252da9dd8930b016a5d46ba/src/hotspot/share/opto/stringopts.cpp#L1806 >> >> Testing: T1-4. 
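
To get a feel for the growth described in this message, the doubling can be modelled with plain arithmetic. The sketch below is a toy model under a stated assumption (the chain starts from a single argument and each merge doubles the count); it does not reproduce the stringopts logic, and the names are hypothetical.

```java
// Toy model: argument count of a merged concatenation that doubles per merge.
// Assumption (not from the source): the chain starts with one argument.
final class ConcatGrowthModel {
    static long argsAfterMerges(int merges) {
        long args = 1;
        for (int i = 0; i < merges; i++) {
            args *= 2;  // the previous result is used twice in the next link
        }
        return args;
    }

    // Smallest number of merges whose argument count exceeds the given limit.
    static int firstMergeExceeding(long limit) {
        int merges = 0;
        while (argsAfterMerges(merges) <= limit) {
            merges++;
        }
        return merges;
    }

    public static void main(String[] args) {
        System.out.println("arguments after 24 merges: " + argsAfterMerges(24));
        System.out.println("merges needed to exceed 100: " + firstMergeExceeding(100));
    }
}
```

Under this model a limit of 100 arguments is crossed after a handful of merges, long before the 24 merges in the test, which is consistent with the bound cutting off the blow-up early.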
>> >> Extra testing: verified that no method in T1-4 is being compiled with a merged concat candidate exceeding the suggested limit of 100 arguments, regardless of whether or not the later checks verify_control_flow() and verify_mem_flow pass. > > Daniel Skantz has updated the pull request incrementally with one additional commit since the last revision: > > compare order A comment to keep PR active. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26685#issuecomment-3252128277 From epeter at openjdk.org Thu Sep 4 06:56:52 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 4 Sep 2025 06:56:52 GMT Subject: RFR: 8366490: C2 SuperWord: wrong result because CastP2X is missing ctrl and floats over SafePoint creating stale oops [v3] In-Reply-To: References: <7guNwHJ6tuJXGG-X9aACAWAHjsneD4uryM-ZazES_Uc=.fe831ae6-c8a1-446d-b63e-5b7a1a1f8704@github.com> Message-ID: On Tue, 2 Sep 2025 13:19:40 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> Apply suggestions from code review >> >> Co-authored-by: Manuel Hässig >> Co-authored-by: Christian Hagedorn > > Marked as reviewed by chagedorn (Reviewer). @chhagedorn @TobiHartmann @mhaessig Thanks for the reviews! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/27045#issuecomment-3252182731 From epeter at openjdk.org Thu Sep 4 06:56:53 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 4 Sep 2025 06:56:53 GMT Subject: Integrated: 8366490: C2 SuperWord: wrong result because CastP2X is missing ctrl and floats over SafePoint creating stale oops In-Reply-To: <7guNwHJ6tuJXGG-X9aACAWAHjsneD4uryM-ZazES_Uc=.fe831ae6-c8a1-446d-b63e-5b7a1a1f8704@github.com> References: <7guNwHJ6tuJXGG-X9aACAWAHjsneD4uryM-ZazES_Uc=.fe831ae6-c8a1-446d-b63e-5b7a1a1f8704@github.com> Message-ID: <6K1D9UzhzSh8gyGh3FefsMHXkABL_nKWlJkHkopRahE=.2ca357ec-0e05-4f22-bb3a-08e8e8b630ba@github.com> On Tue, 2 Sep 2025 10:45:33 GMT, Emanuel Peter wrote: > **Analysis** > > A `CastP2X` without ctrl can float. If it floats over a `SafePoint` (or call), we may GC and move the oop. But the `CastP2X` value does not end up on the oop-map, and so the pointer is stale (old). > > With `StressGCM`, the aliasing runtime check has one `CastP2X` that floats over the SafePoint, and another that stays after the SafePoint. Both read the oop of the same array, so instead of getting the same address, we now get the old and the new oop. And so the aliasing runtime check passes (thinks there is no aliasing), even though there is aliasing. We end up vectorizing, which reorders the loads/stores and would only be safe if there is no aliasing. > > **Fix:** add control to the `CastP2X` so that it cannot float too far. > > **Details** > > > rbp = Allocate array > spill <- rbp + 0x20 > > call to allocateArrays > -> allocates a lot, and triggers GC. 
That moves the allocated array behind rbp > -> rbp is oop-mapped, so it is updated automatically to the new oop > -> spill value remains based on the old oop > > We now compute the aliasing runtime check: > -> one side of the comparison is computed from rbp (new oop) > -> the other side is computed from the spill value (old oop) > -> the cmp returns a nonsensical value, and we take the wrong branch > -> vectorize even though we have aliasing! This pull request has now been integrated. Changeset: 2527e9e5 Author: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/2527e9e58d770c50e6d807bf1483c6bb07dd3de7 Stats: 152 lines in 5 files changed: 139 ins; 1 del; 12 mod 8366490: C2 SuperWord: wrong result because CastP2X is missing ctrl and floats over SafePoint creating stale oops Reviewed-by: thartmann, chagedorn, mhaessig ------------- PR: https://git.openjdk.org/jdk/pull/27045 From rcastanedalo at openjdk.org Thu Sep 4 07:47:41 2025 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Thu, 4 Sep 2025 07:47:41 GMT Subject: RFR: 8361699: C2: assert(can_reduce_phi(n->as_Phi())) failed: Sanity: previous reducible Phi is no longer reducible before SUT. In-Reply-To: References: Message-ID: <1uDOe3Oe-hihmDHea2h8vcvRZsKKBeNp0J9lKYUujxk=.abd111bc-3625-4c71-bfa2-0a4c1f4d3875@github.com> On Wed, 3 Sep 2025 00:53:59 GMT, Cesar Soares Lucas wrote: > Please review this patch to fix an issue that may occur when reducing an allocation merge. > > As the assert message describes, the problem is that a `Phi` considered reducible during one invocation of `adjust_scalar_replaceable_state` later turned out to be non-reducible. This situation can happen if a subsequent invocation of the same method causes all inputs to the phi to be NSR; therefore there is no point in reducing the Phi. It can also happen during the propagation of NSR state done by `find_scalar_replaceable_allocs`. > > The change in `revisit_reducible_phi_status` is just a clean-up. 
> The real fix is in `find_scalar_replaceable_allocs`. > > Tested on Linux x64/Aarch64 release/fastdebug with JTREG tier1-3. Hi Cesar, thanks for addressing this issue. I will run some more comprehensive testing and have a look at it in the next days. ------------- PR Comment: https://git.openjdk.org/jdk/pull/27063#issuecomment-3252350744 From duke at openjdk.org Thu Sep 4 08:04:46 2025 From: duke at openjdk.org (erifan) Date: Thu, 4 Sep 2025 08:04:46 GMT Subject: RFR: 8363989: AArch64: Add missing backend support of VectorAPI expand operation [v2] In-Reply-To: References: <_YDJIkwt0sdsOAMfNNn1fHTVwH0SHDpJv5NpQoxnfiA=.a0ddb5f3-00f1-47e2-93da-f47cb3f62288@github.com> Message-ID: <_VZ4L0DTdTxRz1XzG4QIyYY7TyCHzroEOeOV21N17_Y=.e92ad3fd-3e94-4bf5-a570-dc8cc8c9e9ed@github.com> On Wed, 3 Sep 2025 12:49:32 GMT, Emanuel Peter wrote: >> erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: >> >> - Improve the comment of the vector expand implementation >> - Merge branch 'master' into JDK-8363989 >> - 8363989: AArch64: Add missing backend support of VectorAPI expand operation >> >> Currently, on AArch64, the VectorAPI `expand` operation is intrinsified >> for 32-bit and 64-bit types only when SVE2 is available. In the following >> cases, `expand` has not yet been intrinsified: >> 1. **Subword types** on SVE2-capable hardware. >> 2. **All types** on NEON and SVE1 environments. >> >> As a result, `expand` API performance is very poor in these scenarios. >> This patch intrinsifies the `expand` operation in the above environments. >> >> Since there are no native instructions directly corresponding to `expand` >> in these cases, this patch mainly leverages the `TBL` instruction to >> implement `expand`. 
To compute the index input for `TBL`, the prefix sum >> algorithm (see https://en.wikipedia.org/wiki/Prefix_sum) is used. >> Take a 128-bit byte vector on SVE2 as an example: >> ``` >> To compute: dst = src.expand(mask) >> Data direction: high <== low >> Input: >> src = p o n m l k j i h g f e d c b a >> mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 >> Expected result: >> dst = 0 0 h g 0 0 f e 0 0 d c 0 0 b a >> ``` >> Step 1: calculate the index input of the TBL instruction. >> ``` >> // Set tmp1 as all 0 vector. >> tmp1 = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 >> >> // Move the mask bits from the predicate register to a vector register. >> // **1-bit** mask lane of P register to **8-bit** mask lane of V register. >> tmp2 = mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 >> >> // Shift the entire register. Prefix sum algorithm. >> dst = tmp2 << 8 = 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 >> tmp2 += dst = 0 1 2 1 0 1 2 1 0 1 2 1 0 1 2 1 >> >> dst = tmp2 << 16 = 2 1 0 1 2 1 0 1 2 1 0 1 2 1 0 0 >> tmp2 += dst = 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 >> >> dst = tmp2 << 32 = 2 2 2 2 2 2 2 2 2 2 2 1 0 0 0 0 >> tmp2 += dst = 4 4 4 4 4 4 4 4 4 4 ... > > src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2819: > >> 2817: subv(dst, size, tmp2, tmp1); >> 2818: // dst = 0 0 8 7 0 0 6 5 0 0 4 3 0 0 2 1 >> 2819: tbl(dst, size, src, 1, dst); > > It would make it a little easier to read the example if the numbers were aligned. > Now the minus sign disrupts that a little. Maybe leave 2 spaces if the number is positive? Make sense, I'll update it in the following commit. > test/hotspot/jtreg/compiler/vectorapi/VectorExpandTest.java line 48: > >> 46: static final VectorSpecies F_SPECIES = FloatVector.SPECIES_MAX; >> 47: static final VectorSpecies L_SPECIES = LongVector.SPECIES_MAX; >> 48: static final VectorSpecies D_SPECIES = DoubleVector.SPECIES_MAX; > > Would it make sense to run these tests with various vector sizes? 
> Because it seems your algorithm depends on `vector_length_in_bytes` in the prefix sum algo. Since we already have correctness tests for `expand` on **all vector types** under `test/jdk/jdk/incubator/vector/`, such as https://github.com/openjdk/jdk/blob/986ecff5f9b16f1b41ff15ad94774d65f3a4631d/test/jdk/jdk/incubator/vector/Byte128VectorTests.java#L5375, this test primarily verifies that the expected IR is generated. So, I think this is sufficient? I've tested this PR locally on a 128-bit SVE2 machine, a 256-bit SVE machine, and a 512-bit QEMU environment, and all tests passed. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26740#discussion_r2321198368 PR Review Comment: https://git.openjdk.org/jdk/pull/26740#discussion_r2321194040 From duke at openjdk.org Thu Sep 4 08:13:42 2025 From: duke at openjdk.org (erifan) Date: Thu, 4 Sep 2025 08:13:42 GMT Subject: RFR: 8365911: AArch64: Fix encoding error in sve_cpy for negative floats In-Reply-To: References: <2R6O7Jhv3catwxc6rXJdh7Uiq-NFBp7beCmP49CLTqU=.7ba72e39-6efd-47fe-8ad9-6df54a45c99b@github.com> <-G8GwIflOhFjOL-PAG6_oylu0Fa9c8iNUB57EC6oo4s=.a0126087-2a97-4542-a555-27c12578fccf@github.com> Message-ID: On Wed, 3 Sep 2025 07:19:06 GMT, erifan wrote: >>> 1. sve `cpy` and `fcpy` are actually two different instructions, and distinguishing them might be clearer. >> >> That's a fair point, but the Arch64 name for all four instructions is CPY, and they are distinguished by their operands. Deviation from the names in the Reference Manual is occasionally necessary, but it makes life painful for maintainers when they have to search for what we've called an instruction they want to use. >> >>> 2. sve `cpy` 's imm8 is an **int** , while `fcpy` 's imm8 is an **fp8** . >> >> Yes, that's right. >> >>> While some encoding code can be reused, separating the encodings makes the code clearer. >> >> I don't agree that it makes the code clearer. 
In fact, tight factoring emphasizes the fact that these instructions are similar, and explicitly shows where they are different. >> >> It is true that I have a strong bias against copy-and-paste programming. >> >>> I think both implementations are fine. If you think it's better to not refactor, I'll revert. >> >> I do. Thank you. > >> I do. Thank you. > > Ok, I have reverted the refactoring. Please help take another look, thanks~ > @erifan I'm running some internal testing - though we don't have SVE machines so you are responsible to make sure it is adequately tested for that ;) Yeah, I have tested the PR on a 128-bit sve2 machine, 512-bit and 256-bit qemu environments. All tests passed. A test timed out on macOS, which I believe is unrelated to the PR. I retriggered the test to see what was happening. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26951#issuecomment-3252432122 From duke at openjdk.org Thu Sep 4 08:16:46 2025 From: duke at openjdk.org (erifan) Date: Thu, 4 Sep 2025 08:16:46 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation [v5] In-Reply-To: References: Message-ID: On Mon, 4 Aug 2025 02:31:08 GMT, Xiaohong Gong wrote: >> This is a follow-up patch of [1], which aims at implementing the subword gather load APIs for AArch64 SVE platform. >> >> ### Background >> Vector gather load APIs load values from memory addresses calculated by adding a base pointer to integer indices. SVE provides native gather load instructions for `byte`/`short` types using `int` vectors for indices. The vector size for a gather-load instruction is determined by the index vector (i.e. `int` elements). Hence, the total size is `32 * elem_num` bits, where `elem_num` is the number of loaded elements in the vector register. 
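
The sizing rule in the Background paragraph lends itself to a small calculation. The sketch below is only a model of that arithmetic (a 32-bit index per element, split across fixed-size registers); it is not the C2 or SVE backend code, and `GatherSplitModel` and `gatherOps` are names made up for the example.

```java
// Toy model: how many gather-load operations a subword gather needs when
// the index vector (32 bits per element) must fit the SVE register size.
final class GatherSplitModel {
    static final int INDEX_BITS = 32;  // indices are int elements

    static int gatherOps(int elemNum, int registerBits) {
        int indexVectorBits = INDEX_BITS * elemNum;
        // Round up: a partially filled register still costs one operation.
        return (indexVectorBits + registerBits - 1) / registerBits;
    }

    public static void main(String[] args) {
        int registerBits = 512;  // example machine: 512-bit SVE
        for (int elemNum : new int[] {8, 16, 32, 64}) {
            System.out.println(elemNum + " byte elements -> "
                + gatherOps(elemNum, registerBits) + " gather-load(s)");
        }
    }
}
```

For a 512-bit register this gives one operation for 8 or 16 byte elements, two for 32 and four for 64.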
>> >> ### Implementation >> >> #### Challenges >> Due to size differences between `int` indices (32-bit) and `byte`/`short` data (8/16-bit), operations must be split across multiple vector registers based on the target SVE vector register size constraints. >> >> For a 512-bit SVE machine, loading a `byte` vector with different vector species require different approaches: >> - SPECIES_64: Single operation with mask (8 elements, 256-bit) >> - SPECIES_128: Single operation, full register (16 elements, 512-bit) >> - SPECIES_256: Two operations + merge (32 elements, 1024-bit) >> - SPECIES_512/MAX: Four operations + merge (64 elements, 2048-bit) >> >> Use `ByteVector.SPECIES_512` as an example: >> - It contains 64 elements. So the index vector size should be `64 * 32` bits, which is 4 times of the SVE vector register size. >> - It requires 4 times of vector gather-loads to finish the whole operation. >> >> >> byte[] arr = [a, a, a, a, ..., a, b, b, b, b, ..., b, c, c, c, c, ..., c, d, d, d, d, ..., d, ...] >> int[] idx = [0, 1, 2, 3, ..., 63, ...] >> >> 4 gather-load: >> idx_v1 = [15 14 13 ... 1 0] gather_v1 = [... 0000 0000 0000 0000 aaaa aaaa aaaa aaaa] >> idx_v2 = [31 30 29 ... 17 16] gather_v2 = [... 0000 0000 0000 0000 bbbb bbbb bbbb bbbb] >> idx_v3 = [47 46 45 ... 33 32] gather_v3 = [... 0000 0000 0000 0000 cccc cccc cccc cccc] >> idx_v4 = [63 62 61 ... 49 48] gather_v4 = [... 0000 0000 0000 0000 dddd dddd dddd dddd] >> merge: v = [dddd dddd dddd dddd cccc cccc cccc cccc bbbb bbbb bbbb bbbb aaaa aaaa aaaa aaaa] >> >> >> #### Solution >> The implementation simplifies backend complexity by defining each gather load IR to handle one vector gather-load operation, with multiple IRs generated in the compiler mid-end. >> >> Here is the main changes: >> - Enhanced IR generation with architecture-specific patterns based on `gather_scatter_needs_vector_index()` matcher. >> - Added `VectorSliceNode` for result mer... 
> > Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains six commits: > > - Merge 'jdk:master' into JDK-8351623-sve > - Address review comments > - Refine IR pattern and clean backend rules > - Fix indentation issue and move the helper matcher method to header files > - Merge branch jdk:master into JDK-8351623-sve > - 8351623: VectorAPI: Add SVE implementation of subword gather load operation LGTM ------------- Marked as reviewed by erifan at github.com (no known OpenJDK username). PR Review: https://git.openjdk.org/jdk/pull/26236#pullrequestreview-3183982402 From mchevalier at openjdk.org Thu Sep 4 08:49:44 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 4 Sep 2025 08:49:44 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v3] In-Reply-To: References: <3-ZWJMEYL6eWaILQXqX4RskVroCjpFlNdGkmTQMt8Jc=.b09b689a-981f-4f95-83fa-015f0bd698cf@github.com> Message-ID: On Mon, 25 Aug 2025 13:44:27 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/phaseX.hpp line 615: >> >>> 613: Node* _verify_window[_verify_window_size]; >>> 614: void verify_step(Node* n); >>> 615: GraphInvariantChecker* _invariant_checker; >> >> Why do you allocate it separately, and not have it in-place? > > Is there only a single PhaseIterGVN per compilation? I forgot. An alternative would be to allocate it at the level of the compilation. > Why do you allocate it separately, and not have it in-place? So that I can forward declare `GraphInvariantChecker` so I won't leak a non-trivial header everywhere through a widely included header. > Is there only a single PhaseIterGVN per compilation? I forgot. An alternative would be to allocate it at the level of the compilation. Not quite, indeed. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2321310336 From mchevalier at openjdk.org Thu Sep 4 08:53:43 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 4 Sep 2025 08:53:43 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v3] In-Reply-To: <3-ZWJMEYL6eWaILQXqX4RskVroCjpFlNdGkmTQMt8Jc=.b09b689a-981f-4f95-83fa-015f0bd698cf@github.com> References: <3-ZWJMEYL6eWaILQXqX4RskVroCjpFlNdGkmTQMt8Jc=.b09b689a-981f-4f95-83fa-015f0bd698cf@github.com> Message-ID: On Mon, 25 Aug 2025 13:46:55 GMT, Emanuel Peter wrote: >> Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: >> >> Benoît's comments > > src/hotspot/share/opto/graphInvariants.cpp line 32: > >> 30: >> 31: void LocalGraphInvariant::LazyReachableCFGNodes::fill() { >> 32: precond(live_nodes.size() == 0); > > Maybe I missed something here: where do the `precond` and `postcond` come from? `debug.hpp` just next to `assert`. They are "standard", but not very widely used. I think they are good as they clearly state what is a precondition or a postcondition. There is no message (or rather a default one), but it's better (or not worse) than giving a not very inspired one, like "fail", which one can find often. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2321320549 From mchevalier at openjdk.org Thu Sep 4 09:04:43 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 4 Sep 2025 09:04:43 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v3] In-Reply-To: <3-ZWJMEYL6eWaILQXqX4RskVroCjpFlNdGkmTQMt8Jc=.b09b689a-981f-4f95-83fa-015f0bd698cf@github.com> References: <3-ZWJMEYL6eWaILQXqX4RskVroCjpFlNdGkmTQMt8Jc=.b09b689a-981f-4f95-83fa-015f0bd698cf@github.com> Message-ID: On Mon, 25 Aug 2025 13:50:24 GMT, Emanuel Peter wrote: >> Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: >> >> Benoît's comments > > src/hotspot/share/opto/graphInvariants.hpp line 37: > >> 35: static constexpr int OutputStep = -1; >> 36: >> 37: struct LazyReachableCFGNodes { > > You could add a comment here. What I was surprised by: that you do a whole graph traversal the first time we call `is_node_dead`. I thought you would just visit a subgraph every time, and fill out the `live_nodes` gradually. > > You could also give an explanation why it needs to be lazy. Is it possible that we never call `is_node_dead`? I don't think it makes much sense to visit a subgraph: I want a proof that a node is dead. I could climb from the node, trying to reach the root, and traverse a lot of things for which I can't yet say whether they are dead or not. Once I have saturated or reached the root, I can say for those, but it seems that the logic is more tricky: I would have 3 states per node: dead, alive, not decided yet. I don't think it's worth the complexity. Also, let's not exaggerate: it's a traversal only of the control sub-graph, not the whole graph. It is much smaller. I think I give an explanation in the comment of `LocalGraphInvariant::check`: > The parameter [live_nodes] is used to share the lazily computed set of CFG nodes reachable from root. 
This is because some checks don't apply to dead code, suppress their error if a violation is detected in dead code. So we would call `is_node_dead` only if there is a violation of a check that we should suppress in dead code. If there is no violation of such a check (no violation at all, or only of checks that we still want to see from dead code), we won't call `is_node_dead`. I'll improve the comment and point to it. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2321348168 From mchevalier at openjdk.org Thu Sep 4 09:19:43 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 4 Sep 2025 09:19:43 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v3] In-Reply-To: <3-ZWJMEYL6eWaILQXqX4RskVroCjpFlNdGkmTQMt8Jc=.b09b689a-981f-4f95-83fa-015f0bd698cf@github.com> References: <3-ZWJMEYL6eWaILQXqX4RskVroCjpFlNdGkmTQMt8Jc=.b09b689a-981f-4f95-83fa-015f0bd698cf@github.com> Message-ID: On Mon, 25 Aug 2025 14:03:58 GMT, Emanuel Peter wrote: >> Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: >> >> Benoît's comments > > src/hotspot/share/opto/graphInvariants.cpp line 207: > >> 205: } >> 206: bool (Node::*_type_check)() const; >> 207: }; > > You could probably generalize this with a callback approach. And then one concrete implementation is the one that does the type check. Just an idea. Seems overengineered to me. The callback version would be about as long as this. The user code that must provide the callback would also be about as long. It makes the logic unnecessarily complicated to me. Of course, everything boils down to a function that takes a node and performs a specific check, but then, this generalized version does nothing significant but calling the callback. The concrete implementation will just have all the same logic, but in a callback passed to another method instead of having it as a first class method... 
If I don't have an adapter class that only checks the type, and instead leave that to instantiation time, the code would look like NodeCallback([](const Node* n) { return n->is_Region(); }), which is unreadable, instead of NodeClass(&Node::is_Region). That's the point of patterns: they make the shape easy to understand. Otherwise, one can just write a normal, manual traversal, which is all-powerful. It was also discussed above that something like the `NodeCallback` could exist for when we need something that can't be expressed simply, but: - will it ever happen? - NodeCallback doesn't even provide a useful error message, we would also need a callback to craft it (or make the one callback more complicated, which would be pretty much the content of `NodeClass::check`) - I'm not willing to make the common kind of patterns ugly for a rare use case. And as for implementing `NodeClass` from a hypothetical `NodeCallback`, what would be the concrete benefits? (kinda the first paragraph again: all the logic in the callback, and NodeCallback doing nothing). ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2321388003 From chagedorn at openjdk.org Thu Sep 4 09:33:47 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 4 Sep 2025 09:33:47 GMT Subject: RFR: 8364970: Redo JDK-8327381 by updating the CmpU type instead of the Bool type [v3] In-Reply-To: <_EN6o6Jwu73CNwvSXYt2cHSHu6Yglkp86f1t7lywwi4=.a84b6fac-327a-48a5-8f1e-772b31d8da10@github.com> References: <_EN6o6Jwu73CNwvSXYt2cHSHu6Yglkp86f1t7lywwi4=.a84b6fac-327a-48a5-8f1e-772b31d8da10@github.com> Message-ID: On Fri, 29 Aug 2025 13:17:48 GMT, Christian Hagedorn wrote: >> # Absence note >> >> Today is the last day before a ~2 weeks vacation, so my next working day is Monday, September 1st. >> >> Please feel free to keep giving feedback and/or reviews, and I will continue when I'm back. >> >> Cheers, >> Francisco > > Hi @franferrax, hope you had a good vacation!
> >> Hi @chhagedorn, >> >> I added the new tests in [e6b1cb8](https://github.com/openjdk/jdk/commit/e6b1cb897d9c75b34744c7d24f72abcec9986b0b). One problem I'm facing is that I'm unable to generate `Bool` nodes with arbitrary `BoolTest` values. Even if I try the assert inversions I removed in [10e1e3f](https://github.com/openjdk/jdk/commit/10e1e3f4f796d05dcd5c56bc2365d5d564d93952), C2 has preference for `BoolTest::ne`, `BoolTest::le` and `BoolTest::lt`. Instead of using `BoolTest::eq`, `BoolTest::gt` or `BoolTest::ge`, it swaps what is put in `IfTrue` and `IfFalse`. >> >> Even if `javac` generates an `ifeq` and an `ifne` with the same inputs, instead of a single `CmpU` with two `Bool`s (`BoolTest::eq` and `BoolTest::ne`), I get a single `Bool` (`BoolTest::ne`) with two `If` (one of them swapping `IfTrue` with `IfFalse`). I guess this is some sort of canonicalization to enable further optimizations. >> >> Do you know a way to influence the `Bool`'s `BoolTest` value? Or @rwestrel do you? >> >> This means the following 8 cases are not really testing what they claim, but repeating other cases with `IfTrue` and `IfFalse` swapped: >> >> * `testCase1aOptimizeAsFalseForGT(xm|mx)` (they should use `BoolTest::gt`, but use `BoolTest::le`) >> * `testCase1bOptimizeAsFalseForEQ(xm|mx)` (they should use `BoolTest::eq`, but use `BoolTest::ne`) >> * `testCase1bOptimizeAsFalseForGE(xm|mx)` (they should use `BoolTest::ge`, but use `BoolTest::lt`) >> * `testCase1bOptimizeAsFalseForGT(xm|mx)` (they should use `BoolTest::gt`, but use `BoolTest::le`) >> >> Even if we don't find a way to influence the `BoolTest`, the cases are still valid and can be kept (just in case the described behaviour changes). > > Hm, that's a good point. `Parse::do_if()` indeed always canonicalizes the `Bool` nodes... But I was sure we can still somehow end up with non-canonicalized versions again with some tricks. 
I was curious and played around with some examples and could indeed find test cases for `gt`, `ge` , and `eq`. > > I was then also thinking about notification code in IGVN. We already concluded further up that it's not needed for CCP because `CmpU` nodes below `AddI` nodes are put to the worklist again. However, with IGVN, we could modify the graph above the `AndI` as well. We miss notification code for `CmpU` below `AndI`. I changed my test cases further to also run into such a missing optimization case. When run with `-XX:VerifyIterativeGVN=1110`, we indeed get su... > Hi @chhagedorn, thank you for the additional work and your insights. This is much appreciated from a learner perspective. Sure, you're welcome :-) > I didn't fully analyze the Test.java you provided yet, but wanted to check if you are aiming to include the missing IGVN notification code as part of this issue (and its corresponding test). Or are you working on an independent issue? I think you could squeeze that in here as well. With mainline, you probably need a different notification code because we need to add the `Bool` node instead of the `CmpU` node. But with this patch, we only require the `CmpU`. So, I guess it's not worth to fix it separately only to update it again with this patch. > My availability will be limited as the October CPU approaches, but it will try to find some timeboxes to make TestBoolNodeGVN.java emit the right test cases for gt, ge , and eq Sounds good, no hurry. Thanks for taking another look to improve the test! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26666#issuecomment-3252790567 From duke at openjdk.org Thu Sep 4 09:42:49 2025 From: duke at openjdk.org (erifan) Date: Thu, 4 Sep 2025 09:42:49 GMT Subject: RFR: 8365911: AArch64: Fix encoding error in sve_cpy for negative floats [v3] In-Reply-To: References: Message-ID: On Wed, 3 Sep 2025 10:02:24 GMT, erifan wrote: >> The sve_cpy instruction is not correctly implemented for negative floating-point values. The issues include: >> >> 1. When a negative floating-point number (e.g. `-1.0`) is passed, the `checked_cast(pack(d))` check fails. For example, assume `d = -1.0`: >> - `pack(-1.0)` returns an unsigned int with the 7th bit set, i.e., `0xf0`. >> - `checked_cast(0xf0)` casts `0xf0` to an int8_t value, which is `-16`. >> - Casting this int8_t `-16` back to unsigned int results in `0xfffffff0`. >> - The check compares `0xf0` to `0xfffffff0`, which obviously fails. >> >> 2. Additionally, the encoding of the negative floating-point number is incorrect: >> - The imm8 field can fall outside the valid range of **[-128, 127]**. >> - Bit **13** should be encoded as **0** for floating-point numbers. >> >> This PR fixes these issues and renames floating-point `sve_cpy` as `sve_fcpy`. >> >> Some test cases are added to aarch64-asmtest.py, and all tests passed.
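As a rough illustration of the first failure in the quoted description above, the following C++ sketch reproduces the narrowing round trip on the value `0xf0`. It is not the HotSpot `checked_cast` or `pack` code; the helper names `narrow_and_widen` and `survives_round_trip` are hypothetical, and a two's complement target with the usual modular narrowing is assumed.

```cpp
#include <cstdint>

// Hypothetical helper, not HotSpot code: narrow a packed value to int8_t and
// widen it back to 32 bits, mirroring the round trip described in the PR.
inline uint32_t narrow_and_widen(uint32_t packed) {
  int8_t narrow = static_cast<int8_t>(packed);  // 0xf0 becomes -16
  return static_cast<uint32_t>(narrow);         // -16 sign-extends to 0xfffffff0
}

// Hypothetical helper: true if the value is unchanged by the round trip,
// which is the kind of comparison the description says fails.
inline bool survives_round_trip(uint32_t packed) {
  return narrow_and_widen(packed) == packed;
}
```

Under these assumptions, a value with bit 7 clear (such as `0x70`) survives the round trip, while `0xf0` does not, because the sign extension sets the upper 24 bits.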
> > erifan has updated the pull request incrementally with one additional commit since the last revision: > > Code style fixes The test failure should be irrelevant to this PR, I can see it in other PR's test results, like https://github.com/egahlin/jdk/actions/runs/17436633376/job/49510579213 ------------- PR Comment: https://git.openjdk.org/jdk/pull/26951#issuecomment-3252830637 From mchevalier at openjdk.org Thu Sep 4 10:53:43 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 4 Sep 2025 10:53:43 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v3] In-Reply-To: References: <3-ZWJMEYL6eWaILQXqX4RskVroCjpFlNdGkmTQMt8Jc=.b09b689a-981f-4f95-83fa-015f0bd698cf@github.com> Message-ID: <_-lh_iD6ya5G9_ODqDXbfa1aTrC6J1DP5hUM4RHQUKo=.1b66fd8a-ee5c-4844-bb81-01b7758ba5dc@github.com> On Mon, 25 Aug 2025 14:09:09 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/graphInvariants.cpp line 270: >> >>> 268: new HasNOutputs(2), >>> 269: new AtSingleOutputOfType(&Node::is_IfTrue, new True()), >>> 270: new AtSingleOutputOfType(&Node::is_IfFalse, new True()))) { >> >> I would suggest that you append the word `Pattern` to all `Patterns` - at least in most cases this will make it a bit easier to see what you have at the use-site. I'm looking at `new True()` and wonder what might be passed here... if it was called `TruePattern`, it would be immediately clear. > > You could leave a comment at `True(Pattern)` that it is (often) used as the terminal pattern, at the end of a branch / search. > I would suggest that you append the word Pattern to all Patterns - at least in most cases this will make it a bit easier to see what you have at the use-site. I'm looking at new True() and wonder what might be passed here... if it was called TruePattern, it would be immediately clear. For `True` alright, no strong opinion. Could also be `TrivialPattern` or so. 
For everything else, that looks very verbose and hurts readability a lot, and I think readability is very important. I think the other patterns are pretty understandable: for instance, I don't see how `HasAtLeastNInputsPattern` would really help compared to `HasAtLeastNInputs`, it seems just like bloat my brain will have to strip. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2321622684 From mchevalier at openjdk.org Thu Sep 4 11:10:46 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 4 Sep 2025 11:10:46 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v3] In-Reply-To: <3-ZWJMEYL6eWaILQXqX4RskVroCjpFlNdGkmTQMt8Jc=.b09b689a-981f-4f95-83fa-015f0bd698cf@github.com> References: <3-ZWJMEYL6eWaILQXqX4RskVroCjpFlNdGkmTQMt8Jc=.b09b689a-981f-4f95-83fa-015f0bd698cf@github.com> Message-ID: On Mon, 25 Aug 2025 14:14:41 GMT, Emanuel Peter wrote: >> Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: >> >> Beno?t's comments > > src/hotspot/share/opto/graphInvariants.cpp line 279: > >> 277: return CheckResult::NOT_APPLICABLE; >> 278: } >> 279: CheckResult r = PatternBasedCheck::check(center, reachable_cfg_nodes, steps, path, ss); > > Could this not be solved with a `OrPattern`? > > Or::make( > ) > > Not sure that's worth it... I understand that OrPatterns are tempting! I also thought about it, it's naturally the dual of `And`. At this point, they are not actually a good idea. First, they cannot provide good reporting. When an `And` is failing, we can at least blame the first thing that fails: "I followed this path, I expected to find 5 inputs (for instance), there are only 2!". With `Or` we would get that and... maybe it's fine? Maybe not? Depends on the next branches, and if it ends up failing, how to provide a good message? Also, they cause a mess with binding. 
If a branch contains a `Bind`, one cannot know which branch matched and whether the content of the `Node` pointer given to `Bind` is trustworthy. We can't even rely on testing whether the pointer was set, because the execution of a branch might find a `Bind` first, run it, assign the pointer and later fail, and then the binding must not be used. This is a common problem with pattern matching in functional programming: the same bindings must appear (with the same types) on each branch of or-patterns. But we have no mechanism to enforce that yet, and it seems like setting a trap for future us. There are also relatively few use cases, and they would not profit much from an `Or` pattern. Maybe in the future, we will have more interesting use cases and we will see how to address these issues. But for now, I think we should leave it out rather than make a bad choice. By the way, I think something that has more of a future than `Or` is a case analysis: `IfThenElse(ConditionPattern, TrueBranchPattern, FalseBranchPattern)`. If ConditionPattern matches, then we try to match TrueBranchPattern, otherwise FalseBranchPattern. This is better for reporting since we know which branch we expect to match, and so which one to blame (assuming we don't blame ConditionPattern, but we can possibly include that in the message). This still has the binding consistency issue, but more boilerplate could help (querying the set of pointers that would be set in each branch with helper methods...). Yet, let's wait and see.
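The binding hazard described in this reply can be sketched in a few lines of C++. This is only a toy model under stated assumptions: `Node`, `Pattern`, `bind_node`, `has_kind`, `match_all` and `match_any` are hypothetical stand-ins, not the classes from `graphInvariants.cpp`.

```cpp
#include <functional>
#include <vector>

// Hypothetical stand-ins; not the pattern classes from the PR.
struct Node { int kind; };
using Pattern = std::function<bool(const Node*)>;

// Always matches and records the visited node in `slot` (a toy Bind).
Pattern bind_node(const Node*& slot) {
  return [&slot](const Node* n) { slot = n; return true; };
}

// Matches nodes of the given kind (a toy NodeClass).
Pattern has_kind(int kind) {
  return [kind](const Node* n) { return n->kind == kind; };
}

// Matches if every sub-pattern matches, stopping at the first failure (a toy And).
Pattern match_all(std::vector<Pattern> ps) {
  return [ps](const Node* n) {
    for (const Pattern& p : ps) { if (!p(n)) return false; }
    return true;
  };
}

// Matches if any sub-pattern matches, trying branches in order (a toy Or).
Pattern match_any(std::vector<Pattern> ps) {
  return [ps](const Node* n) {
    for (const Pattern& p : ps) { if (p(n)) return true; }
    return false;
  };
}

// Returns true when the overall match succeeded through the second branch
// while `bound` was still assigned by the first branch, which failed.
bool stale_binding_after_or() {
  const Node* bound = nullptr;
  Node n{2};
  Pattern p = match_any({match_all({bind_node(bound), has_kind(1)}), has_kind(2)});
  bool matched = p(&n);
  return matched && bound == &n;
}
```

In this toy model the first branch binds the node and then fails its kind check, the second branch matches, and the caller is left with a non-null `bound` that no successful branch produced. A non-null pointer therefore says nothing about which branch matched.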
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2321677987 From mchevalier at openjdk.org Thu Sep 4 11:13:43 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 4 Sep 2025 11:13:43 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v3] In-Reply-To: References: <3-ZWJMEYL6eWaILQXqX4RskVroCjpFlNdGkmTQMt8Jc=.b09b689a-981f-4f95-83fa-015f0bd698cf@github.com> Message-ID: On Mon, 25 Aug 2025 14:16:42 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/graphInvariants.cpp line 287: >> >>> 285: } >>> 286: } >>> 287: return r; >> >> Also this could probably be handled with a pattern wrapping mechanism, right? >> `FailOnlyForLiveNodes( )` > > I'm just suggesting it in case you need to do this sort of special-casing elsewhere too ;) That would be possible. It's still rare, and I'm not convinced we should make so specialized such patterns for one usecase. If it gets more usage, sure, that would be something to do. The only other usage is not so easy to phrase as template. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2321687572 From mchevalier at openjdk.org Thu Sep 4 11:16:42 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 4 Sep 2025 11:16:42 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v3] In-Reply-To: References: <3-ZWJMEYL6eWaILQXqX4RskVroCjpFlNdGkmTQMt8Jc=.b09b689a-981f-4f95-83fa-015f0bd698cf@github.com> Message-ID: On Mon, 25 Aug 2025 14:20:40 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/graphInvariants.cpp line 301: >> >>> 299: And::make( >>> 300: new NodeClass(&Node::is_Region), >>> 301: new Bind(region_node))))) { >> >> This sort of binding is kinda cool! Never thought of it before. Could be really cool for general pattern matching. >> We would have to find a solution if there would be multiple bindings though ... I think that's not possible with your patterns, right? 
Is that a fundamental constraint? > > What would be extra cool / funky: > If we could somehow already cast the `Bind` variable to `Region`. Could be tricky. > Doing this `is_Region and bind` could be a very common idiom, so very useful. > We would have to find a solution if there would be multiple bindings though ... I think that's not possible with your patterns, right? Is that a fundamental constraint? Not sure what you mean? `And::make(new Bind(bla), AtInput(1, new Bind(bli)))`? You probably mean something else. > If we could somehow already cast the Bind variable to Region. Could be tricky. > Doing this is_Region and bind could be a very common idiom, so very useful. Interesting... Not sure how with some template magic we don't have (like `Node::is`) but probably doable with macros. I'll give it a try. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2321698421 From mchevalier at openjdk.org Thu Sep 4 11:21:50 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 4 Sep 2025 11:21:50 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v3] In-Reply-To: <3-ZWJMEYL6eWaILQXqX4RskVroCjpFlNdGkmTQMt8Jc=.b09b689a-981f-4f95-83fa-015f0bd698cf@github.com> References: <3-ZWJMEYL6eWaILQXqX4RskVroCjpFlNdGkmTQMt8Jc=.b09b689a-981f-4f95-83fa-015f0bd698cf@github.com> Message-ID: On Mon, 25 Aug 2025 14:23:12 GMT, Emanuel Peter wrote: >> Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: >> >> Beno?t's comments > > src/hotspot/share/opto/graphInvariants.cpp line 319: > >> 317: return CheckResult::FAILED; >> 318: } >> 319: return CheckResult::VALID; > > Another funky idea: could probably be handled with some callback, some "terminal" check you do on the bound variable. Not sure if worth it. It's difficult if we want to speak about more than one node. It cannot be part of the pattern since it'd be very non-local. 
Also with only one node, it must be executed at the end, and not when still traversing. I think it'd get even messier when we have a few bindings and we want to do things with them in a couple of different ways... Not sure how to express that much more nicely. > src/hotspot/share/opto/graphInvariants.cpp line 332: > >> 330: } >> 331: >> 332: Node_List ctrl_succ; > > Do we need a `ResourceMark` for this? Everything will run under `GraphInvariantChecker::run()` that has a `ResourceMark`. I'm not sure, but my guess is that it's not worth repeatedly entering and leaving resource marks for relatively short lists? At the very least, everything will be released at the end of the whole check. I can still add one here if you think it's better. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2321705602 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2321711388 From mchevalier at openjdk.org Thu Sep 4 11:32:49 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 4 Sep 2025 11:32:49 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v3] In-Reply-To: <3-ZWJMEYL6eWaILQXqX4RskVroCjpFlNdGkmTQMt8Jc=.b09b689a-981f-4f95-83fa-015f0bd698cf@github.com> References: <3-ZWJMEYL6eWaILQXqX4RskVroCjpFlNdGkmTQMt8Jc=.b09b689a-981f-4f95-83fa-015f0bd698cf@github.com> Message-ID: On Mon, 25 Aug 2025 14:27:27 GMT, Emanuel Peter wrote: >> Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: >> >> Benoît's comments > > src/hotspot/share/opto/graphInvariants.cpp line 338: > >> 336: if (out->is_CFG()) { >> 337: cfg_out++; >> 338: ctrl_succ.push(out); > > Seems you do these in a pair. So why do you need `cfg_out` at all? Can you not take the length/size of `ctrl_succ`? After all, it counts duplicates too (hope that is intended). True. And yes, duplicated input must still be counted!
> src/hotspot/share/opto/graphInvariants.cpp line 413: > >> 411: ss.print_cr("%s nodes' 0-th input must be itself or nullptr (for a copy Region).", center->Name()); >> 412: return CheckResult::FAILED; >> 413: } > > Absolutely subjective: checking `self != center` is more about `self`, checking `center != self` is more about `center`. So I would use `self != center` :rofl: > Suggestion: > > if (self != center || (center->is_Region() && self == nullptr)) { > ss.print_cr("%s nodes' 0-th input must be itself or nullptr (for a copy Region).", center->Name()); > return CheckResult::FAILED; > } yes > src/hotspot/share/opto/graphInvariants.cpp line 447: > >> 445: And::make( >> 446: new NodeClass(&Node::is_IfTrue), >> 447: new HasAtLeastNInputs(1), > > Can an `IfTrue` have more than 1 input? I surely hope not! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2321724188 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2321730065 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2321743991 From mchevalier at openjdk.org Thu Sep 4 11:32:52 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 4 Sep 2025 11:32:52 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v3] In-Reply-To: References: <3-ZWJMEYL6eWaILQXqX4RskVroCjpFlNdGkmTQMt8Jc=.b09b689a-981f-4f95-83fa-015f0bd698cf@github.com> Message-ID: <6AiBxTxm_R4n0IV0yrqX0qT6nHhmg_-QcYcrJ8c3XNA=.ff226726-3312-4a42-8bef-fb577da92782@github.com> On Mon, 25 Aug 2025 14:36:41 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/graphInvariants.cpp line 417: >> >>> 415: if (self == nullptr) { >>> 416: // Must be a copy Region >>> 417: Node_List non_null_inputs; >> >> ResouceMark? > > Is it worth it to do the allocation, if in most cases we just expect 1 non-null? > Why not count non-nulls, and if we find more than one, traverse again over the Region, and filter and dump them? True. > And I would call it `counted_loop_end`. 
Right > Ah, another check and Bind! Why not allow Bind, so we can bind it with the cast? I'll try something, but that would be the rather disappointing drawback (since it won't check the type at the same time). Let's see what I can do. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2321735660 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2321740949 From mchevalier at openjdk.org Thu Sep 4 11:36:46 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 4 Sep 2025 11:36:46 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v3] In-Reply-To: References: <3-ZWJMEYL6eWaILQXqX4RskVroCjpFlNdGkmTQMt8Jc=.b09b689a-981f-4f95-83fa-015f0bd698cf@github.com> Message-ID: On Mon, 25 Aug 2025 14:43:26 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/graphInvariants.cpp line 469: >> >>> 467: assert(counted_loop != nullptr, "sanity"); >>> 468: if (is_long) { >>> 469: if (counted_loop->is_CountedLoopEnd()) { >> >> Sounds like head/tail confusion here. Call it `counted_loop_end`. > > Also: I would invert the check to `!counted_loop_end->is_LongCountedLoopEnd()`. Because you expect it to be a long end here. Subjective. If you want. I don't think it's perfect because then the message might be less accurate: I don't know that > A CountedLoopEnd is the backedge of a LongCountedLoop. I rather know that > The backedge of a LongCountedLoop is not a LongCountedLoopEnd ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2321755027 From chagedorn at openjdk.org Thu Sep 4 12:49:56 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 4 Sep 2025 12:49:56 GMT Subject: RFR: 8366890: C2: Split through phi printing with TraceLoopOpts misses line break Message-ID: [JDK-8356176](https://bugs.openjdk.org/browse/JDK-8356176) added new printing code for `TraceLoopOpts` when splitting nodes through a phi but missed a line break. 
This will result in: Split 974 CmpI through 1465 Phi in 953 RegionSplit 474 Bool through 1468 Phi in 953 RegionSplit-If instead of Split 974 CmpI through 1465 Phi in 953 RegionSplit 474 Bool through 1468 Phi in 953 Region Split-If This patch fixes this. Thanks, Christian ------------- Commit messages: - C2: Split through phi printing with TraceLoopOpts misses line break Changes: https://git.openjdk.org/jdk/pull/27092/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27092&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8366890 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/27092.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27092/head:pull/27092 PR: https://git.openjdk.org/jdk/pull/27092 From rcastanedalo at openjdk.org Thu Sep 4 13:25:43 2025 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Thu, 4 Sep 2025 13:25:43 GMT Subject: RFR: 8366890: C2: Split through phi printing with TraceLoopOpts misses line break In-Reply-To: References: Message-ID: On Thu, 4 Sep 2025 12:44:43 GMT, Christian Hagedorn wrote: > [JDK-8356176](https://bugs.openjdk.org/browse/JDK-8356176) added new printing code for `TraceLoopOpts` when splitting nodes through a phi but missed a line break. This will result in: > > Split 974 CmpI through 1465 Phi in 953 RegionSplit 474 Bool through 1468 Phi in 953 RegionSplit-If > > instead of > > Split 974 CmpI through 1465 Phi in 953 RegionSplit 474 Bool through 1468 Phi in 953 Region > Split-If > > This patch fixes this. > > Thanks, > Christian Looks good, and trivial. ------------- Marked as reviewed by rcastanedalo (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/27092#pullrequestreview-3185272793 From mhaessig at openjdk.org Thu Sep 4 13:30:42 2025 From: mhaessig at openjdk.org (Manuel Hässig) Date: Thu, 4 Sep 2025 13:30:42 GMT Subject: RFR: 8366890: C2: Split through phi printing with TraceLoopOpts misses line break In-Reply-To: References: Message-ID: On Thu, 4 Sep 2025 12:44:43 GMT, Christian Hagedorn wrote: > [JDK-8356176](https://bugs.openjdk.org/browse/JDK-8356176) added new printing code for `TraceLoopOpts` when splitting nodes through a phi but missed a line break. This will result in: > > Split 974 CmpI through 1465 Phi in 953 RegionSplit 474 Bool through 1468 Phi in 953 RegionSplit-If > > instead of > > Split 974 CmpI through 1465 Phi in 953 RegionSplit 474 Bool through 1468 Phi in 953 Region > Split-If > > This patch fixes this. > > Thanks, > Christian Thank you for fixing my silly mistake, @chhagedorn! Looks good to me as well. ------------- Marked as reviewed by mhaessig (Committer). PR Review: https://git.openjdk.org/jdk/pull/27092#pullrequestreview-3185304718 From mhaessig at openjdk.org Thu Sep 4 13:31:15 2025 From: mhaessig at openjdk.org (Manuel Hässig) Date: Thu, 4 Sep 2025 13:31:15 GMT Subject: RFR: 8366775: TestCompileTaskTimeout should use timeoutFactor Message-ID: `TestCompileTaskTimeout.java` employs a timeout to test that methods compile faster than a specified `CompileTaskTimeout`. However, it does not make use of the jtreg timeout factor, which led to #26963 increasing the timeout to 2 s. This PR remedies this by using the timeout factor and reducing the default timeout to 500 ms.
Testing: - [ ] Github Actions - [ ] tier1, tier2 linux-x64-debug, linux-x64, linux-aarch64-debug, linux-aarch64 ------------- Commit messages: - Use timeuot factor Changes: https://git.openjdk.org/jdk/pull/27094/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27094&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8366775 Stats: 7 lines in 1 file changed: 6 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/27094.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27094/head:pull/27094 PR: https://git.openjdk.org/jdk/pull/27094 From rehn at openjdk.org Thu Sep 4 13:32:34 2025 From: rehn at openjdk.org (Robbin Ehn) Date: Thu, 4 Sep 2025 13:32:34 GMT Subject: RFR: 8365926: RISC-V: Performance regression in renaissance (chi-square) [v5] In-Reply-To: References: Message-ID: <64z-PlrnxAISLzKBq-RZz7CXkQirGTvOgTGMJQl833o=.73ea3239-dfb6-4e32-b20f-8398334f2759@github.com> > Hey, please consider! > > A bunch of info in JBS entry, please read that also. > > I narrowed this issue down to the old jal optimization, making direct calls when in reach. > This patch restores them and removes this regression. > > In essence we turn "jalr ra,0(t1)" into a "jal ra," if reachable, and restore the jalr if a new destination is not reachable. > > Please test on your hardware! > > > Chi Square (100 runs each, 10 fastest iterations of each run, P550) > JDK-23 (last version with trampoline calls) > Mean: 3189.5827 > Standard Deviation: 284.6478 > > JDK-25 > Mean: 3424.8905 > Standard Deviation: 222.2208 > > Patch: > Mean: 3144.8535 > Standard Deviation: 229.2577 > > > No issues found in t1, running t2 also. Stress tested on vf2, bpi-f3, p550. Robbin Ehn has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains seven additional commits since the last revision: - Merge branch 'master' into 8365926 - Review comments - Review comments - Merge branch 'master' into 8365926 - Spelling - Merge branch 'master' into 8365926 - draft jal<->jalr ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26944/files - new: https://git.openjdk.org/jdk/pull/26944/files/72e3ba6a..da18e6b6 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26944&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26944&range=03-04 Stats: 6217 lines in 654 files changed: 3237 ins; 1282 del; 1698 mod Patch: https://git.openjdk.org/jdk/pull/26944.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26944/head:pull/26944 PR: https://git.openjdk.org/jdk/pull/26944 From chagedorn at openjdk.org Thu Sep 4 14:06:45 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 4 Sep 2025 14:06:45 GMT Subject: RFR: 8366775: TestCompileTaskTimeout should use timeoutFactor In-Reply-To: References: Message-ID: On Thu, 4 Sep 2025 13:26:22 GMT, Manuel Hässig wrote: > `TestCompileTaskTimeout.java` employs a timeout to test that methods compile faster than a specified `CompileTaskTimeout`. However, it does not make use of the jtreg timeout factor, which led to #26963 increasing the timeout to 2 s. This PR remedies this by using the timeout factor and reducing the default timeout to 500 ms. > > Testing: > - [ ] Github Actions > - [ ] tier1, tier2 linux-x64-debug, linux-x64, linux-aarch64-debug, linux-aarch64 Looks reasonable, thanks for adjusting it again! ------------- Marked as reviewed by chagedorn (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/27094#pullrequestreview-3185494488 From rcastanedalo at openjdk.org Thu Sep 4 14:06:45 2025 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Thu, 4 Sep 2025 14:06:45 GMT Subject: RFR: 8366775: TestCompileTaskTimeout should use timeoutFactor In-Reply-To: References: Message-ID: On Thu, 4 Sep 2025 13:26:22 GMT, Manuel Hässig wrote: > `TestCompileTaskTimeout.java` employs a timeout to test that methods compile faster than a specified `CompileTaskTimeout`. However, it does not make use of the jtreg timeout factor, which led to #26963 increasing the timeout to 2 s. This PR remedies this by using the timeout factor and reducing the default timeout to 500 ms. > > Testing: > - [ ] Github Actions > - [ ] tier1, tier2 linux-x64-debug, linux-x64, linux-aarch64-debug, linux-aarch64 Looks good to me! Please check with the PPC port maintainers (and perhaps [the maintainers of RISC-V, s390, and ARM32](https://wiki.openjdk.org/display/HotSpot/Ports)?) that this works in their environment. ------------- Marked as reviewed by rcastanedalo (Reviewer).
> > Testing: > - [ ] Github Actions > - [ ] tier1, tier2 linux-x64-debug, linux-x64, linux-aarch64-debug, linux-aarch64 @MBaesken, could you please have a look, since you filed the issue? Is the reduced default a problem on your side? ------------- PR Comment: https://git.openjdk.org/jdk/pull/27094#issuecomment-3254177977 From cslucas at openjdk.org Thu Sep 4 17:17:44 2025 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Thu, 4 Sep 2025 17:17:44 GMT Subject: RFR: 8361699: C2: assert(can_reduce_phi(n->as_Phi())) failed: Sanity: previous reducible Phi is no longer reducible before SUT. In-Reply-To: <1uDOe3Oe-hihmDHea2h8vcvRZsKKBeNp0J9lKYUujxk=.abd111bc-3625-4c71-bfa2-0a4c1f4d3875@github.com> References: <1uDOe3Oe-hihmDHea2h8vcvRZsKKBeNp0J9lKYUujxk=.abd111bc-3625-4c71-bfa2-0a4c1f4d3875@github.com> Message-ID: On Thu, 4 Sep 2025 07:44:52 GMT, Roberto Casta?eda Lozano wrote: >> Please, review this patch to fix issue that may occur when reducing allocation merge. >> >> As the assert message describe, the problem is a `Phi` considered reducible during one invocation of `adjust_scalar_replaceable_state` turned out to be later non-reducible. This situation can happen if a subsequent invocation of the same method causes all inputs to the phi to be NSR; therefore there is no point in reducing the Phi. It can also happen during the propagation of NSR state done by `find_scalar_replaceable_allocs`. >> >> The change in `revisit_reducible_phi_status` is just a clean-up. >> The real fix is in `find_scalar_replaceable_allocs`. >> >> Tested on Linux x64/Aarch64 release/fastdebug with JTREG tier1-3. > > Hi Cesar, thanks for addressing this issue. I will run some more comprehensive testing and have a look at it in the next days. 
Thank you @robcasloz ------------- PR Comment: https://git.openjdk.org/jdk/pull/27063#issuecomment-3254677197 From snatarajan at openjdk.org Thu Sep 4 19:54:19 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Thu, 4 Sep 2025 19:54:19 GMT Subject: RFR: 8356779: IGV: dump the index of the SafePointNode containing the current JVMS during parsing Message-ID: This PR prints index of the SafePointNode containing the current JVMS during parsing in IGV. As stated in JBS the reason for this is that there are a lot of nodes during parsing, it would be nice to know what are the current nodes in the local slots or in the stack when looking at a graph. ------------- Commit messages: - initial fix Changes: https://git.openjdk.org/jdk/pull/27083/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27083&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8356779 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/27083.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27083/head:pull/27083 PR: https://git.openjdk.org/jdk/pull/27083 From sparasa at openjdk.org Thu Sep 4 20:11:28 2025 From: sparasa at openjdk.org (Srinivas Vamsi Parasa) Date: Thu, 4 Sep 2025 20:11:28 GMT Subject: RFR: 8354348: Enable Extended EVEX to REX2/REX demotion for commutative operations with same dst and src2 [v2] In-Reply-To: References: Message-ID: > This change extends Extended EVEX (EEVEX) to REX2/REX demotion for Intel APX NDD instructions to handle commutative operations when the destination register and the second source register (src2) are the same. > > Currently, EEVEX to REX2/REX demotion is only enabled when the first source (src1) and the destination are the same. This enhancement allows additional cases of valid demotion for commutative instructions (add, imul, and, or, xor). 
> > For example: > `eaddl r18, r25, r18` can be encoded as `addl r18, r25` using APX REX2 encoding > `eaddl r2, r7, r2` can be encoded as `addl r2, r7` using non-APX legacy encoding Srinivas Vamsi Parasa has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: - nomenclature change - Merge branch 'master' of https://git.openjdk.java.net/jdk into cdemotion - remove trailing whitespaces - remove unused instructions - 8354348: Enable Extended EVEX to REX2/REX demotion for commutative operations with same dst and src2 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26997/files - new: https://git.openjdk.org/jdk/pull/26997/files/bd14470a..91962f4f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26997&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26997&range=00-01 Stats: 26115 lines in 1121 files changed: 16613 ins; 5592 del; 3910 mod Patch: https://git.openjdk.org/jdk/pull/26997.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26997/head:pull/26997 PR: https://git.openjdk.org/jdk/pull/26997 From sparasa at openjdk.org Thu Sep 4 20:15:52 2025 From: sparasa at openjdk.org (Srinivas Vamsi Parasa) Date: Thu, 4 Sep 2025 20:15:52 GMT Subject: RFR: 8354348: Enable Extended EVEX to REX2/REX demotion for commutative operations with same dst and src2 [v2] In-Reply-To: References: Message-ID: On Thu, 4 Sep 2025 20:11:28 GMT, Srinivas Vamsi Parasa wrote: >> This change extends Extended EVEX (EEVEX) to REX2/REX demotion for Intel APX NDD instructions to handle commutative operations when the destination register and the second source register (src2) are the same. >> >> Currently, EEVEX to REX2/REX demotion is only enabled when the first source (src1) and the destination are the same. 
This enhancement allows additional cases of valid demotion for commutative instructions (add, imul, and, or, xor). >> >> For example: >> `eaddl r18, r25, r18` can be encoded as `addl r18, r25` using APX REX2 encoding >> `eaddl r2, r7, r2` can be encoded as `addl r2, r7` using non-APX legacy encoding > > Srinivas Vamsi Parasa has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: > > - nomenclature change > - Merge branch 'master' of https://git.openjdk.java.net/jdk into cdemotion > - remove trailing whitespaces > - remove unused instructions > - 8354348: Enable Extended EVEX to REX2/REX demotion for commutative operations with same dst and src2 > Hi @vamsi-parasa, thanks for working on this. I am in the process of validating #26283 and find that additional RA biasing will enable demotion for more cases; with a minimal test case I see the following results > Hi Jatin (@jatin-bhateja), thank you for sharing the information about the register allocation biasing PR you're working on that will improve demotion. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26997#issuecomment-3255441849 From sparasa at openjdk.org Thu Sep 4 20:15:53 2025 From: sparasa at openjdk.org (Srinivas Vamsi Parasa) Date: Thu, 4 Sep 2025 20:15:53 GMT Subject: RFR: 8354348: Enable Extended EVEX to REX2/REX demotion for commutative operations with same dst and src2 [v2] In-Reply-To: References: Message-ID: On Mon, 1 Sep 2025 13:17:23 GMT, Jatin Bhateja wrote: >> Srinivas Vamsi Parasa has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains five additional commits since the last revision: >> >> - nomenclature change >> - Merge branch 'master' of https://git.openjdk.java.net/jdk into cdemotion >> - remove trailing whitespaces >> - remove unused instructions >> - 8354348: Enable Extended EVEX to REX2/REX demotion for commutative operations with same dst and src2 > > src/hotspot/cpu/x86/assembler_x86.cpp line 13055: > >> 13053: bool is_prefixq = (size == EVEX_64bit) ? true : false; >> 13054: bool normal_demotion = is_demotable(no_flags, dst_enc, nds_enc); >> 13055: bool commutative_demotion = is_commutative && is_demotable(no_flags, dst_enc, src_enc); > > Nomenclature change: instead of normal_demotion and commutative demotion, it will be more appropriate to use first/second_operand_demotable. Please see the updated nomenclature changed in the updated code as suggested. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26997#discussion_r2323411449 From sparasa at openjdk.org Thu Sep 4 20:18:43 2025 From: sparasa at openjdk.org (Srinivas Vamsi Parasa) Date: Thu, 4 Sep 2025 20:18:43 GMT Subject: RFR: 8354348: Enable Extended EVEX to REX2/REX demotion for commutative operations with same dst and src2 [v2] In-Reply-To: References: Message-ID: <0X5cvpQZxb1l5Q_8f-iU0K4WtdyFW8ehdPXR2zsnSzo=.7f4f3d03-94db-4482-b5ee-c5f1362d84b5@github.com> On Tue, 2 Sep 2025 02:40:59 GMT, Jatin Bhateja wrote: >> Srinivas Vamsi Parasa has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains five additional commits since the last revision: >> >> - nomenclature change >> - Merge branch 'master' of https://git.openjdk.java.net/jdk into cdemotion >> - remove trailing whitespaces >> - remove unused instructions >> - 8354348: Enable Extended EVEX to REX2/REX demotion for commutative operations with same dst and src2 > > src/hotspot/cpu/x86/x86_64.ad line 7121: > >> 7119: %{ >> 7120: predicate(UseAPX); >> 7121: match(Set dst (AddI (LoadI src1) src2)); > > Will this not be covered by the pattern at line 7103, since ADLC automatically generates a DFA to handle both cases? Will run experiments to make sure that the RegRegMem pattern also applies to RegMemReg case and remove the newly added match rules if they're redundant. Will update you soon. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26997#discussion_r2323424398 From dlong at openjdk.org Fri Sep 5 01:42:19 2025 From: dlong at openjdk.org (Dean Long) Date: Fri, 5 Sep 2025 01:42:19 GMT Subject: RFR: 8362117: C2: compiler/stringopts/TestStackedConcatsAppendUncommonTrap.java fails with a wrong result due to invalidated liveness assumptions for data phis [v2] In-Reply-To: <11lcsXkMGpKMQr60NCKofzldqpnJka1XZtrGRrUai3o=.c2201234-bbf2-465a-b237-cd9fe8505491@github.com> References: <11lcsXkMGpKMQr60NCKofzldqpnJka1XZtrGRrUai3o=.c2201234-bbf2-465a-b237-cd9fe8505491@github.com> Message-ID: On Wed, 3 Sep 2025 08:02:04 GMT, Daniel Skantz wrote: >> This PR addresses a wrong compilation during string optimizations. >> >> During stacked string concatenation of two StringBuilder links SB1 and SB2, the pattern "append -> Phi -> Region -> (True, False) -> If -> Bool -> CmpP -> Proj (Result) -> toString" may be observed, where toString is the end of SB1, and the simple diamond is part of SB2. 
>> >> After JDK-8291775, the Bool test to the diamond If is set to a constant zero to allow for folding the simple diamond away during IGVN, while not letting the top() value from the result projection of SB1 propagate through the graph too quickly. The assumption was that any data Phi of the Region would go away during PhaseRemoveUseless as they are no longer live -- I think that in the case of JDK-8291775, the user of phi was the constructor of SB2. However, in the attached test case, the Phi stays live as it's a parameter (input to an append) of SB2 and will be used during the transformation in `copy_string`. When the diamond region is later folded, the Phi's user picks up the wrong input corresponding to the false branch. >> >> The proposed solution is to disable the stacked concatenation optimization for this specific pattern. This might be pragmatic as it's an edge case and there's already a bug tail: JDK-8271341-> JDK-8291775 -> JDK-8362117. >> >> Testing: T1-3 (aed5952). >> >> Extra testing: ran T1-3 on Linux with an instrumented build and verified that the pattern I am excluding in this PR is not seen during any other compilation than that of the proposed regression test. > > Daniel Skantz has updated the pull request incrementally with two additional commits since the last revision: > > - store intermediate calculations > - direction convention This seems to be missing the root cause of the problem. From what I can tell, we have two string concats here, with the 2nd dependent on the first. But we incorrectly decide to coalesce them into a single concat, which then causes havoc when eliminate_unneeded_control() starts nuking edges without regard for the dependency. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/27028#issuecomment-3256794151 From dlong at openjdk.org Fri Sep 5 01:59:09 2025 From: dlong at openjdk.org (Dean Long) Date: Fri, 5 Sep 2025 01:59:09 GMT Subject: RFR: 8362117: C2: compiler/stringopts/TestStackedConcatsAppendUncommonTrap.java fails with a wrong result due to invalidated liveness assumptions for data phis [v2] In-Reply-To: <11lcsXkMGpKMQr60NCKofzldqpnJka1XZtrGRrUai3o=.c2201234-bbf2-465a-b237-cd9fe8505491@github.com> References: <11lcsXkMGpKMQr60NCKofzldqpnJka1XZtrGRrUai3o=.c2201234-bbf2-465a-b237-cd9fe8505491@github.com> Message-ID: On Wed, 3 Sep 2025 08:02:04 GMT, Daniel Skantz wrote: >> This PR addresses a wrong compilation during string optimizations. >> >> During stacked string concatenation of two StringBuilder links SB1 and SB2, the pattern "append -> Phi -> Region -> (True, False) -> If -> Bool -> CmpP -> Proj (Result) -> toString" may be observed, where toString is the end of SB1, and the simple diamond is part of SB2. >> >> After JDK-8291775, the Bool test to the diamond If is set to a constant zero to allow for folding the simple diamond away during IGVN, while not letting the top() value from the result projection of SB1 propagate through the graph too quickly. The assumption was that any data Phi of the Region would go away during PhaseRemoveUseless as they are no longer live -- I think that in the case of JDK-8291775, the user of phi was the constructor of SB2. However, in the attached test case, the Phi stays live as it's a parameter (input to an append) of SB2 and will be used during the transformation in `copy_string`. When the diamond region is later folded, the Phi's user picks up the wrong input corresponding to the false branch. >> >> The proposed solution is to disable the stacked concatenation optimization for this specific pattern. This might be pragmatic as it's an edge case and there's already a bug tail: JDK-8271341-> JDK-8291775 -> JDK-8362117. 
>> >> Testing: T1-3 (aed5952). >> >> Extra testing: ran T1-3 on Linux with an instrumented build and verified that the pattern I am excluding in this PR is not seen during any other compilation than that of the proposed regression test. > > Daniel Skantz has updated the pull request incrementally with two additional commits since the last revision: > > - store intermediate calculations > - direction convention Hmm, I see now that validate_control_flow() does limit coalescing, but I'm worried that the pattern matching may not catch all problematic cases. ------------- PR Comment: https://git.openjdk.org/jdk/pull/27028#issuecomment-3256815844 From epeter at openjdk.org Fri Sep 5 06:06:41 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 5 Sep 2025 06:06:41 GMT Subject: RFR: 8366845: C2 SuperWord: wrong VectorCast after VectorReinterpret with swapped src/dst type Message-ID: I have seen 3 manifestations of this bug: 1. assert # Internal Error (.../src/hotspot/cpu/x86/x86.ad:7640), pid=84140, tid=28419 # assert(UseAVX > 2 && VM_Version::supports_avx512dq()) failed: require 2. assert # Internal Error (.../src/hotspot/share/opto/vectornode.cpp:1601), pid=4022154, tid=4022168 # Error: assert(bt == T_FLOAT) failed 3. Wrong result When the feature was available but we used the wrong CastVector It seems that [JDK-8346236](https://bugs.openjdk.org/browse/JDK-8346236) introduced reinterpret nodes to SuperWord: } else if (VectorNode::is_reinterpret_opcode(opc)) { assert(first->req() == 2 && req() == 2, "only one input expected"); const TypeVect* vt = TypeVect::make(bt, vlen); vn = new VectorReinterpretNode(in1, vt, in1->bottom_type()->is_vect()); Sadly, the `src` and `dst` type are swapped. For JDK25 [JDK-8346236](https://bugs.openjdk.org/browse/JDK-8346236) this had no bad effect yet, since we only cast between HF and short, which are both based on short. 
But with [JDK-8329077](https://bugs.openjdk.org/browse/JDK-8329077) we can now do reinterpret between I/F and between D/L. Here swapping has an effect, especially if it is followed by a cast: The cast determines its input type from the output type of the input node. If that was a reinterpret node with the wrong output type, **we would get a cast with the wrong src type**. We might do a double -> int cast instead of a long -> int cast. That leads to all sorts of issues. The fuzzer test was only just recently added with [JDK-8324751](https://bugs.openjdk.org/browse/JDK-8324751). It uses MemorySegment, where unaligned float/double access gets handled with long/int memory access and then reinterpret (e.g. `MoveI2F`). But I was able to find examples that just work with `Float.intBitsToFloat` etc. ------------- Commit messages: - fix whitespace - fix test vector api visibility - fix copyright - IR rules - JDK-8366845 Changes: https://git.openjdk.org/jdk/pull/27100/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27100&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8366845 Stats: 226 lines in 2 files changed: 225 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/27100.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27100/head:pull/27100 PR: https://git.openjdk.org/jdk/pull/27100 From galder at openjdk.org Fri Sep 5 06:06:42 2025 From: galder at openjdk.org (Galder Zamarreño) Date: Fri, 5 Sep 2025 06:06:42 GMT Subject: RFR: 8366845: C2 SuperWord: wrong VectorCast after VectorReinterpret with swapped src/dst type In-Reply-To: References: Message-ID: On Thu, 4 Sep 2025 14:42:46 GMT, Emanuel Peter wrote: > I have seen 3 manifestations of this bug: > > 1. assert > > # Internal Error (.../src/hotspot/cpu/x86/x86.ad:7640), pid=84140, tid=28419 > # assert(UseAVX > 2 && VM_Version::supports_avx512dq()) failed: require > > > 2. 
assert > > # Internal Error (.../src/hotspot/share/opto/vectornode.cpp:1601), pid=4022154, tid=4022168 > # Error: assert(bt == T_FLOAT) failed > > > 3. Wrong result > When the feature was available but we used the wrong CastVector > > It seems that [JDK-8346236](https://bugs.openjdk.org/browse/JDK-8346236) introduced reinterpret nodes to SuperWord: > > > } else if (VectorNode::is_reinterpret_opcode(opc)) { > assert(first->req() == 2 && req() == 2, "only one input expected"); > const TypeVect* vt = TypeVect::make(bt, vlen); > vn = new VectorReinterpretNode(in1, vt, in1->bottom_type()->is_vect()); > > > Sadly, the `src` and `dst` type are swapped. For JDK25 [JDK-8346236](https://bugs.openjdk.org/browse/JDK-8346236) this had no bad effect yet, since we only cast between HF and short, which are both based on short. > > But with [JDK-8329077](https://bugs.openjdk.org/browse/JDK-8329077) we can now do reinterpret between I/F and between D/L. Here swapping has an effect, especially if it is followed by a cast: > The cast deterines its input type from the output type of the input node. If that was a reinterpret node with the wrong output type, **we would get a cast with the wrong src type**. We might do a double -> int cast instead of a long -> int cast. That leads to all sorts of issues. > > The fuzzer test was only just recently added with [JDK-8324751](https://bugs.openjdk.org/browse/JDK-8324751). It uses MemorySegment, where unaligned float/double access gets handled with long/int memory access and then reinterpret (eg `MoveI2F`). But I was able to find examples that just work with `Float.intBitsToFloat` etc. Great catch @eme64! Sorry for introducing this issue :$ I was wondering if we'd need more cases being tested? Reversed ones? E.g. `test1 ` goes from long -> double -> float -> int, do we need something that does int -> float -> double -> long? Does that make sense? Makes sense @eme64. 
Happy with the fix and tests :) ------------- PR Review: https://git.openjdk.org/jdk/pull/27100#pullrequestreview-3185798345 Marked as reviewed by galder (Author). PR Review: https://git.openjdk.org/jdk/pull/27100#pullrequestreview-3185920801 From vlivanov at openjdk.org Fri Sep 5 06:06:42 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Fri, 5 Sep 2025 06:06:42 GMT Subject: RFR: 8366845: C2 SuperWord: wrong VectorCast after VectorReinterpret with swapped src/dst type In-Reply-To: References: Message-ID: <_favZs4uLmC9KBKZiZekexIi8GRq66w1s0tgqZ5gOiw=.abb71cf4-6d27-4343-a9d8-6bcab85125cb@github.com> On Thu, 4 Sep 2025 14:42:46 GMT, Emanuel Peter wrote: > I have seen 3 manifestations of this bug: > > 1. assert > > # Internal Error (.../src/hotspot/cpu/x86/x86.ad:7640), pid=84140, tid=28419 > # assert(UseAVX > 2 && VM_Version::supports_avx512dq()) failed: require > > > 2. assert > > # Internal Error (.../src/hotspot/share/opto/vectornode.cpp:1601), pid=4022154, tid=4022168 > # Error: assert(bt == T_FLOAT) failed > > > 3. Wrong result > When the feature was available but we used the wrong CastVector > > It seems that [JDK-8346236](https://bugs.openjdk.org/browse/JDK-8346236) introduced reinterpret nodes to SuperWord: > > > } else if (VectorNode::is_reinterpret_opcode(opc)) { > assert(first->req() == 2 && req() == 2, "only one input expected"); > const TypeVect* vt = TypeVect::make(bt, vlen); > vn = new VectorReinterpretNode(in1, vt, in1->bottom_type()->is_vect()); > > > Sadly, the `src` and `dst` type are swapped. For JDK25 [JDK-8346236](https://bugs.openjdk.org/browse/JDK-8346236) this had no bad effect yet, since we only cast between HF and short, which are both based on short. > > But with [JDK-8329077](https://bugs.openjdk.org/browse/JDK-8329077) we can now do reinterpret between I/F and between D/L. Here swapping has an effect, especially if it is followed by a cast: > The cast determines its input type from the output type of the input node. 
If that was a reinterpret node with the wrong output type, **we would get a cast with the wrong src type**. We might do a double -> int cast instead of a long -> int cast. That leads to all sorts of issues. > > The fuzzer test was only just recently added with [JDK-8324751](https://bugs.openjdk.org/browse/JDK-8324751). It uses MemorySegment, where unaligned float/double access gets handled with long/int memory access and then reinterpret (eg `MoveI2F`). But I was able to find examples that just work with `Float.intBitsToFloat` etc. Looks good. ------------- Marked as reviewed by vlivanov (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/27100#pullrequestreview-3186086231 From thartmann at openjdk.org Fri Sep 5 06:06:42 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 5 Sep 2025 06:06:42 GMT Subject: RFR: 8366845: C2 SuperWord: wrong VectorCast after VectorReinterpret with swapped src/dst type In-Reply-To: References: Message-ID: <20PTiZWdvMZQNCgMBJSOzD7f7uWC-J8t0bWoXT6NV7Q=.ed065007-326d-4228-b23a-e0964fc8940f@github.com> On Thu, 4 Sep 2025 14:42:46 GMT, Emanuel Peter wrote: > I have seen 3 manifestations of this bug: > > 1. assert > > # Internal Error (.../src/hotspot/cpu/x86/x86.ad:7640), pid=84140, tid=28419 > # assert(UseAVX > 2 && VM_Version::supports_avx512dq()) failed: require > > > 2. assert > > # Internal Error (.../src/hotspot/share/opto/vectornode.cpp:1601), pid=4022154, tid=4022168 > # Error: assert(bt == T_FLOAT) failed > > > 3. Wrong result > When the feature was available but we used the wrong CastVector > > It seems that [JDK-8346236](https://bugs.openjdk.org/browse/JDK-8346236) introduced reinterpret nodes to SuperWord: > > > } else if (VectorNode::is_reinterpret_opcode(opc)) { > assert(first->req() == 2 && req() == 2, "only one input expected"); > const TypeVect* vt = TypeVect::make(bt, vlen); > vn = new VectorReinterpretNode(in1, vt, in1->bottom_type()->is_vect()); > > > Sadly, the `src` and `dst` type are swapped. 
For JDK25 [JDK-8346236](https://bugs.openjdk.org/browse/JDK-8346236) this had no bad effect yet, since we only cast between HF and short, which are both based on short. > > But with [JDK-8329077](https://bugs.openjdk.org/browse/JDK-8329077) we can now do reinterpret between I/F and between D/L. Here swapping has an effect, especially if it is followed by a cast: > The cast determines its input type from the output type of the input node. If that was a reinterpret node with the wrong output type, **we would get a cast with the wrong src type**. We might do a double -> int cast instead of a long -> int cast. That leads to all sorts of issues. > > The fuzzer test was only just recently added with [JDK-8324751](https://bugs.openjdk.org/browse/JDK-8324751). It uses MemorySegment, where unaligned float/double access gets handled with long/int memory access and then reinterpret (eg `MoveI2F`). But I was able to find examples that just work with `Float.intBitsToFloat` etc. Looks good to me otherwise. Nice test! test/hotspot/jtreg/compiler/loopopts/superword/TestReinterpretAndCast.java line 170: > 168: int v0 = a[i]; > 169: float v1 = Float.intBitsToFloat(v0); > 170: // Reinterpret: int -> float Same here. test/hotspot/jtreg/compiler/loopopts/superword/TestReinterpretAndCast.java line 212: > 210: float v2 = v1.floatValue(); > 211: int v3 = Float.floatToRawIntBits(v2); > 212: // Reinterpret: float -> int The indentation is off here. Please also fix the whitespace errors. ------------- Marked as reviewed by thartmann (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/27100#pullrequestreview-3188100502 PR Review Comment: https://git.openjdk.org/jdk/pull/27100#discussion_r2324177886 PR Review Comment: https://git.openjdk.org/jdk/pull/27100#discussion_r2324177346 From epeter at openjdk.org Fri Sep 5 06:06:42 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 5 Sep 2025 06:06:42 GMT Subject: RFR: 8366845: C2 SuperWord: wrong VectorCast after VectorReinterpret with swapped src/dst type In-Reply-To: References: Message-ID: On Thu, 4 Sep 2025 15:08:04 GMT, Galder Zamarreño wrote: >> I have seen 3 manifestations of this bug: >> >> 1. assert >> >> # Internal Error (.../src/hotspot/cpu/x86/x86.ad:7640), pid=84140, tid=28419 >> # assert(UseAVX > 2 && VM_Version::supports_avx512dq()) failed: require >> >> >> 2. assert >> >> # Internal Error (.../src/hotspot/share/opto/vectornode.cpp:1601), pid=4022154, tid=4022168 >> # Error: assert(bt == T_FLOAT) failed >> >> >> 3. Wrong result >> When the feature was available but we used the wrong CastVector >> >> It seems that [JDK-8346236](https://bugs.openjdk.org/browse/JDK-8346236) introduced reinterpret nodes to SuperWord: >> >> >> } else if (VectorNode::is_reinterpret_opcode(opc)) { >> assert(first->req() == 2 && req() == 2, "only one input expected"); >> const TypeVect* vt = TypeVect::make(bt, vlen); >> vn = new VectorReinterpretNode(in1, vt, in1->bottom_type()->is_vect()); >> >> >> Sadly, the `src` and `dst` type are swapped. For JDK25 [JDK-8346236](https://bugs.openjdk.org/browse/JDK-8346236) this had no bad effect yet, since we only cast between HF and short, which are both based on short. >> >> But with [JDK-8329077](https://bugs.openjdk.org/browse/JDK-8329077) we can now do reinterpret between I/F and between D/L. Here swapping has an effect, especially if it is followed by a cast: >> The cast determines its input type from the output type of the input node. 
If that was a reinterpret node with the wrong output type, **we would get a cast with the wrong src type**. We might do a double -> int cast instead of a long -> int cast. That leads to all sorts of issues. >> >> The fuzzer test was only just recently added with [JDK-8324751](https://bugs.openjdk.org/browse/JDK-8324751). It uses MemorySegment, where unaligned float/double access gets handled with long/int memory access and then reinterpret (eg `MoveI2F`). But I was able to find examples that just work with `Float.intBitsToFloat` etc. > > Great catch @eme64! Sorry for introducing this issue :$ > > I was wondering if we'd need more cases being tested? Reversed ones? E.g. `test1 ` goes from long -> double -> float -> int, do we need something that does int -> float -> double -> long? Does that make sense? @galderz Thanks for having a look. We could add more cases, but I'd also like to integrate rather quickly since this is failing 10x or more on our CI daily. If it takes too long we would have to back out [JDK-8329077](https://bugs.openjdk.org/browse/JDK-8329077) instead. So I'd suggest this: We can file a follow-up RFE that covers more cases. Because we basically need to cover: all Reinterpret (I2F, F2I, L2D, D2L, HF2S, S2HF) with all compatible casts after it. That is a lot of cases. We can consider using a templated test for it, or just generate them ahead. Generally, it is quite difficult to test the "moves" well because of the way that different NaN bits are handled. I'd like to develop generally more templated tests. But it is difficult to do arbitrary expressions, because if you have some float expression that can generate a NaN, and then you "move" it to int with `Float.floatToRawIntBits`, you can get different results if you are in the interpreter or in compiled code. The I2F, F2I, L2D, D2L "moves" are currently also tested with unaligned memory accesses via MemorySegment - that is how we found this bug in the first place. 
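To make the "double -> int cast instead of a long -> int cast" failure mode concrete, here is a minimal scalar sketch in plain Java. It is not taken from the PR or its test: the class and method names are hypothetical, and it only models what the two differently typed casts would compute on the same 64 bits, not the C2 vector nodes themselves.

```java
public class ReinterpretVsCast {
    // Reinterpret (a "move"): keep the 64 bits, change only the type (double -> long).
    static long reinterpretD2L(double d) {
        return Double.doubleToRawLongBits(d);
    }

    // Intended chain: reinterpret double -> long, then convert long -> int
    // (keeps the low 32 bits of the bit pattern).
    static int reinterpretThenCast(double d) {
        return (int) reinterpretD2L(d);
    }

    // What a cast that believes its input is still a double would compute:
    // a numeric double -> int conversion of the value.
    static int castWithWrongSourceType(double d) {
        return (int) d;
    }

    public static void main(String[] args) {
        double d = 1.0;
        System.out.println(Long.toHexString(reinterpretD2L(d))); // bit pattern of 1.0
        System.out.println(reinterpretThenCast(d));               // low 32 bits of that pattern
        System.out.println(castWithWrongSourceType(d));           // numeric conversion
    }
}
```

For `1.0` the two chains disagree (low 32 bits of `0x3ff0000000000000` versus the numeric value), which is the kind of wrong result a cast with a mistaken source type can produce.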
For now, I think the fix is quite simple and clear, so I'd think it is ok to defer the tests a little. ------------- PR Comment: https://git.openjdk.org/jdk/pull/27100#issuecomment-3254226006 From duke at openjdk.org Fri Sep 5 06:08:20 2025 From: duke at openjdk.org (duke) Date: Fri, 5 Sep 2025 06:08:20 GMT Subject: RFR: 8366747: RISC-V: Improve VerifyMethodHandles for method handle linkers [v2] In-Reply-To: References: Message-ID: <_EIRPDGLVZ9QPEc95OcNBQgvga1GmohBC7QyniOVM-w=.c56ef657-cc77-4f28-9202-8d99a61e7e37@github.com> On Wed, 3 Sep 2025 02:40:27 GMT, Anjian Wen wrote: >> According to JDK-8353216?Add extra verification logic into MethodHandle::invokeBasic/linkTo* to ensure that holder classes are properly initialized on riscv platform. > > Anjian Wen has updated the pull request incrementally with one additional commit since the last revision: > > Add assertion and modify format @Anjian-Wen Your change (at version b5eb3bd13bef6bb886e4bd8e0b91a8fe67f64354) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26938#issuecomment-3257172995 From thartmann at openjdk.org Fri Sep 5 06:08:11 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 5 Sep 2025 06:08:11 GMT Subject: RFR: 8366845: C2 SuperWord: wrong VectorCast after VectorReinterpret with swapped src/dst type In-Reply-To: References: Message-ID: On Thu, 4 Sep 2025 14:42:46 GMT, Emanuel Peter wrote: > I have seen 3 manifestations of this bug: > > 1. assert > > # Internal Error (.../src/hotspot/cpu/x86/x86.ad:7640), pid=84140, tid=28419 > # assert(UseAVX > 2 && VM_Version::supports_avx512dq()) failed: require > > > 2. assert > > # Internal Error (.../src/hotspot/share/opto/vectornode.cpp:1601), pid=4022154, tid=4022168 > # Error: assert(bt == T_FLOAT) failed > > > 3. 
Wrong result > When the feature was available but we used the wrong CastVector > > It seems that [JDK-8346236](https://bugs.openjdk.org/browse/JDK-8346236) introduced reinterpret nodes to SuperWord: > > > } else if (VectorNode::is_reinterpret_opcode(opc)) { > assert(first->req() == 2 && req() == 2, "only one input expected"); > const TypeVect* vt = TypeVect::make(bt, vlen); > vn = new VectorReinterpretNode(in1, vt, in1->bottom_type()->is_vect()); > > > Sadly, the `src` and `dst` type are swapped. For JDK25 [JDK-8346236](https://bugs.openjdk.org/browse/JDK-8346236) this had no bad effect yet, since we only cast between HF and short, which are both based on short. > > But with [JDK-8329077](https://bugs.openjdk.org/browse/JDK-8329077) we can now do reinterpret between I/F and between D/L. Here swapping has an effect, especially if it is followed by a cast: > The cast determines its input type from the output type of the input node. If that was a reinterpret node with the wrong output type, **we would get a cast with the wrong src type**. We might do a double -> int cast instead of a long -> int cast. That leads to all sorts of issues. > > The fuzzer test was only just recently added with [JDK-8324751](https://bugs.openjdk.org/browse/JDK-8324751). It uses MemorySegment, where unaligned float/double access gets handled with long/int memory access and then reinterpret (eg `MoveI2F`). But I was able to find examples that just work with `Float.intBitsToFloat` etc. Looks good! ------------- Marked as reviewed by thartmann (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/27100#pullrequestreview-3188108067 From wenanjian at openjdk.org Fri Sep 5 06:16:16 2025 From: wenanjian at openjdk.org (Anjian Wen) Date: Fri, 5 Sep 2025 06:16:16 GMT Subject: Integrated: 8366747: RISC-V: Improve VerifyMethodHandles for method handle linkers In-Reply-To: References: Message-ID: On Tue, 26 Aug 2025 09:18:14 GMT, Anjian Wen wrote: > According to JDK-8353216, add extra verification logic into MethodHandle::invokeBasic/linkTo* to ensure that holder classes are properly initialized on the RISC-V platform. This pull request has now been integrated. Changeset: 0d7f8f83 Author: Anjian Wen Committer: Fei Yang URL: https://git.openjdk.org/jdk/commit/0d7f8f83c7a674f5da4b93d66a24f9ce5ba46011 Stats: 54 lines in 2 files changed: 48 ins; 1 del; 5 mod 8366747: RISC-V: Improve VerifyMethodHandles for method handle linkers Reviewed-by: fyang, dzhang ------------- PR: https://git.openjdk.org/jdk/pull/26938 From duke at openjdk.org Fri Sep 5 06:30:34 2025 From: duke at openjdk.org (erifan) Date: Fri, 5 Sep 2025 06:30:34 GMT Subject: RFR: 8363989: AArch64: Add missing backend support of VectorAPI expand operation [v3] In-Reply-To: References: Message-ID: > Currently, on AArch64, the VectorAPI `expand` operation is intrinsified for 32-bit and 64-bit types only when SVE2 is available. In the following cases, `expand` has not yet been intrinsified: > 1. **Subword types** on SVE2-capable hardware. > 2. **All types** on NEON and SVE1 environments. > > As a result, `expand` API performance is very poor in these scenarios. This patch intrinsifies the `expand` operation in the above environments. > > Since there are no native instructions directly corresponding to `expand` in these cases, this patch mainly leverages the `TBL` instruction to implement `expand`. To compute the index input for `TBL`, the prefix sum algorithm (see https://en.wikipedia.org/wiki/Prefix_sum) is used. 
Take a 128-bit byte vector on SVE2 as an example: > > To compute: dst = src.expand(mask) > Data direction: high <== low > Input: > src = p o n m l k j i h g f e d c b a > mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 > Expected result: > dst = 0 0 h g 0 0 f e 0 0 d c 0 0 b a > > Step 1: calculate the index input of the TBL instruction. > > // Set tmp1 as all 0 vector. > tmp1 = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 > > // Move the mask bits from the predicate register to a vector register. > // **1-bit** mask lane of P register to **8-bit** mask lane of V register. > tmp2 = mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 > > // Shift the entire register. Prefix sum algorithm. > dst = tmp2 << 8 = 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 > tmp2 += dst = 0 1 2 1 0 1 2 1 0 1 2 1 0 1 2 1 > > dst = tmp2 << 16 = 2 1 0 1 2 1 0 1 2 1 0 1 2 1 0 0 > tmp2 += dst = 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 > > dst = tmp2 << 32 = 2 2 2 2 2 2 2 2 2 2 2 1 0 0 0 0 > tmp2 += dst = 4 4 4 4 4 4 4 4 4 4 4 3 2 2 2 1 > > dst = tmp2 << 64 = 4 4 4 3 2 2 2 1 0 0 0 0 0 0 0 0 > tmp2 += dst = 8 8 8 7 6 6 6 5 4 4 4 3 2 2 2 1 > > // Clear inactive elements. > dst = sel(mask, tmp2, tmp1) = 0 0 8 7 0 0 6 5 0 0 4 3 0 0 2 1 > > // Set the inactive lane value to -1 and set the active lane to the target index. > dst -= 1 = -1 -1 7 6 -1 -1 5 4 -1 -1 3 2 -1 -1 1 0 > > Step 2: shuffle the source vector elements to the target vector > > tbl(dst, src, dst) = 0 0 h g 0 0 f e 0 0 d c 0 0 b a > > > The same algorithm is used for NEON and SVE1, but with different instructions where appropriate. > > The following benchmarks are from panama-... erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains five additional commits since the last revision:

 - Align code example data for better reading
 - Merge branch 'master' into JDK-8363989
 - Improve the comment of the vector expand implementation
 - Merge branch 'master' into JDK-8363989
 - 8363989: AArch64: Add missing backend support of VectorAPI expand operation

   Currently, on AArch64, the VectorAPI `expand` operation is intrinsified for 32-bit and 64-bit types only when SVE2 is available. In the following cases, `expand` has not yet been intrinsified:

   1. **Subword types** on SVE2-capable hardware.
   2. **All types** on NEON and SVE1 environments.

   As a result, `expand` API performance is very poor in these scenarios. This patch intrinsifies the `expand` operation in the above environments.

   Since there are no native instructions directly corresponding to `expand` in these cases, this patch mainly leverages the `TBL` instruction to implement `expand`. To compute the index input for `TBL`, the prefix sum algorithm (see https://en.wikipedia.org/wiki/Prefix_sum) is used. Take a 128-bit byte vector on SVE2 as an example:

   ```
   To compute: dst = src.expand(mask)
   Data direction: high <== low
   Input:
     src  = p o n m l k j i h g f e d c b a
     mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
   Expected result:
     dst  = 0 0 h g 0 0 f e 0 0 d c 0 0 b a
   ```

   Step 1: calculate the index input of the TBL instruction.

   ```
   // Set tmp1 as all 0 vector.
   tmp1                        =  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

   // Move the mask bits from the predicate register to a vector register.
   // **1-bit** mask lane of P register to **8-bit** mask lane of V register.
   tmp2 = mask                 =  0  0  1  1  0  0  1  1  0  0  1  1  0  0  1  1

   // Shift the entire register. Prefix sum algorithm.
   dst = tmp2 << 8             =  0  1  1  0  0  1  1  0  0  1  1  0  0  1  1  0
   tmp2 += dst                 =  0  1  2  1  0  1  2  1  0  1  2  1  0  1  2  1

   dst = tmp2 << 16            =  2  1  0  1  2  1  0  1  2  1  0  1  2  1  0  0
   tmp2 += dst                 =  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  1

   dst = tmp2 << 32            =  2  2  2  2  2  2  2  2  2  2  2  1  0  0  0  0
   tmp2 += dst                 =  4  4  4  4  4  4  4  4  4  4  4  3  2  2  2  1

   dst = tmp2 << 64            =  4  4  4  3  2  2  2  1  0  0  0  0  0  0  0  0
   tmp2 += dst                 =  8  8  8  7  6  6  6  5  4  4  4  3  2  2  2  1

   // Clear inactive elements.
   dst = sel(mask, tmp2, tmp1) =  0  0  8  7  0  0  6  5  0  0  4  3  0  0  2  1

   // Set the inactive lane value to -1 and set the active lane to the target index.
   dst -= 1                    = -1 -1  7  6 -1 -1  5  4 -1 -1  3  2 -1 -1  1  0
   ```

   Step 2: shuffle the source vector elements to the target vector

   ```
   tbl(dst, src, dst) = 0 0 h g 0 0 f e 0 0 d c 0 0 b a
   ```

   The same algorithm is used for NEON and SVE1, but with different instructions where appropriate.

   The following benchmarks are from panama-vector/vectorIntrinsics.

   On Nvidia Grace machine with option `-XX:UseSVE=2`:

   ```
   Benchmark               Unit    Before Score Error      After Score  Error      Uplift
   Byte128Vector.expand    ops/ms  1791.022366  5.619883   9633.388683  1.968788   5.37
   Double128Vector.expand  ops/ms  4489.255846  0.48485    4488.772949  0.491596   0.99
   Float128Vector.expand   ops/ms  8863.02424   6.888087   8908.352235  51.487453  1
   Int128Vector.expand     ops/ms  8873.485683  3.275682   8879.635643  1.243863   1
   Long128Vector.expand    ops/ms  4485.1149    4.458073   4489.365269  0.851093   1
   Short128Vector.expand   ops/ms  792.068834   2.640398   5880.811288  6.40683    7.42
   Byte64Vector.expand     ops/ms  854.455002   8.548982   5999.046295  37.209987  7.02
   Double64Vector.expand   ops/ms  46.49763     0.104773   46.526043    0.102451   1
   Float64Vector.expand    ops/ms  4510.596811  0.504477   4509.984244  1.519178   0.99
   Int64Vector.expand      ops/ms  4508.778322  1.664461   4535.216611  26.742484  1
   Long64Vector.expand     ops/ms  45.665462    0.705485   46.496232    0.075648   1.01
   Short64Vector.expand    ops/ms  394.527324   1.284691   3860.199621  0.720015   9.78
   ```

   On Nvidia Grace machine with option `-XX:UseSVE=1`:

   ```
   Benchmark               Unit    Before Score Error      After Score  Error      Uplift
   Byte128Vector.expand    ops/ms  1767.314171  12.431526  9630.892248  1.478813   5.44
   Double128Vector.expand  ops/ms  197.614381   0.945541   2416.075281  2.664325   12.22
   Float128Vector.expand   ops/ms  390.878183   2.089234   3844.011978  3.792751   9.83
   Int128Vector.expand     ops/ms  394.550044   2.025371   3843.280133  3.528017   9.74
   Long128Vector.expand    ops/ms  198.366863   0.651726   2423.234639  4.911434   12.21
   Short128Vector.expand   ops/ms  790.044704   3.339363   5885.595035  1.440598   7.44
   Byte64Vector.expand     ops/ms  853.479119   7.158898   5942.750116  1.054905   6.96
   Double64Vector.expand   ops/ms  46.550458    0.079191   46.423053    0.057554   0.99
   Float64Vector.expand    ops/ms  197.977215   1.156535   2445.010767  1.992358   12.34
   Int64Vector.expand      ops/ms  198.326857   1.02785    2444.211583  2.5432     12.32
   Long64Vector.expand     ops/ms  46.526513    0.25779    45.984253    0.566691   0.98
   Short64Vector.expand    ops/ms  398.649412   1.87764    3837.495773  3.528926   9.62
   ```

   On Nvidia Grace machine with option `-XX:UseSVE=0`:

   ```
   Benchmark               Unit    Before Score Error      After Score  Error      Uplift
   Byte128Vector.expand    ops/ms  1802.98702   6.906394   9427.491602  2.067934   5.22
   Double128Vector.expand  ops/ms  198.498191   0.429071   1190.476326  0.247358   5.99
   Float128Vector.expand   ops/ms  392.849005   2.034676   2373.195574  2.006566   6.04
   Int128Vector.expand     ops/ms  395.69179    2.194773   2372.084745  2.058303   5.99
   Long128Vector.expand    ops/ms  198.191673   1.476362   1189.712301  1.006821   6
   Short128Vector.expand   ops/ms  795.785831   5.62611    4731.514053  2.365213   5.94
   Byte64Vector.expand     ops/ms  843.549268   7.174254   5865.556155  37.639415  6.95
   Double64Vector.expand   ops/ms  45.943599    0.484743   46.529755    0.111551   1.01
   Float64Vector.expand    ops/ms  193.945993   0.943338   1463.836772  0.618393   7.54
   Int64Vector.expand      ops/ms  194.168021   0.492286   1473.004575  8.802656   7.58
   Long64Vector.expand     ops/ms  46.570488    0.076372   46.696353    0.078649   1
   Short64Vector.expand    ops/ms  387.973334   2.367312   2920.428114  0.863635   7.52
   ```

   Some JTReg test cases are added for the above changes.
And the patch was tested on both aarch64 and x64, all of tier1 tier2 and tier3 tests passed. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26740/files - new: https://git.openjdk.org/jdk/pull/26740/files/a1777974..8f1f8aaf Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26740&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26740&range=01-02 Stats: 22892 lines in 964 files changed: 15292 ins; 4162 del; 3438 mod Patch: https://git.openjdk.org/jdk/pull/26740.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26740/head:pull/26740 PR: https://git.openjdk.org/jdk/pull/26740 From duke at openjdk.org Fri Sep 5 06:30:34 2025 From: duke at openjdk.org (erifan) Date: Fri, 5 Sep 2025 06:30:34 GMT Subject: RFR: 8363989: AArch64: Add missing backend support of VectorAPI expand operation [v2] In-Reply-To: <_VZ4L0DTdTxRz1XzG4QIyYY7TyCHzroEOeOV21N17_Y=.e92ad3fd-3e94-4bf5-a570-dc8cc8c9e9ed@github.com> References: <_YDJIkwt0sdsOAMfNNn1fHTVwH0SHDpJv5NpQoxnfiA=.a0ddb5f3-00f1-47e2-93da-f47cb3f62288@github.com> <_VZ4L0DTdTxRz1XzG4QIyYY7TyCHzroEOeOV21N17_Y=.e92ad3fd-3e94-4bf5-a570-dc8cc8c9e9ed@github.com> Message-ID: On Thu, 4 Sep 2025 08:01:40 GMT, erifan wrote: >> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2819: >> >>> 2817: subv(dst, size, tmp2, tmp1); >>> 2818: // dst = 0 0 8 7 0 0 6 5 0 0 4 3 0 0 2 1 >>> 2819: tbl(dst, size, src, 1, dst); >> >> It would make it a little easier to read the example if the numbers were aligned. >> Now the minus sign disrupts that a little. Maybe leave 2 spaces if the number is positive? > > Make sense, I'll update it in the following commit. Done, thanks! 
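
The prefix-sum and `TBL` steps described for 8363989 can be modelled lane by lane in scalar C++. This is only a sketch under stated assumptions: a 128-bit vector of 16 byte lanes, with index 0 as the lowest lane. The names `Vec`, `shift_up`, `tbl` and `expand` are illustrative and do not appear in the patch, which emits NEON/SVE instructions rather than loops. The one hardware property relied on is that `TBL` writes 0 for an out-of-range index.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLanes = 16;             // 128-bit vector of byte lanes
using Vec = std::array<std::uint8_t, kLanes>;  // index 0 is the lowest lane

// Whole-register shift towards the high lanes, filling the low lanes with 0.
Vec shift_up(const Vec& v, std::size_t lanes) {
  Vec r{};
  for (std::size_t i = lanes; i < kLanes; ++i) {
    r[i] = v[i - lanes];
  }
  return r;
}

// Model of a one-register table lookup: an out-of-range index yields 0.
Vec tbl(const Vec& table, const Vec& idx) {
  Vec r{};
  for (std::size_t i = 0; i < kLanes; ++i) {
    r[i] = idx[i] < kLanes ? table[idx[i]] : std::uint8_t{0};
  }
  return r;
}

// mask holds 0 or 1 per lane (the "8-bit mask lane" form from the mail).
Vec expand(const Vec& src, const Vec& mask) {
  // Step 1a: inclusive prefix sum of the mask with shifts of 1, 2, 4, 8 lanes.
  Vec sum = mask;
  for (std::size_t step = 1; step < kLanes; step *= 2) {
    const Vec shifted = shift_up(sum, step);
    for (std::size_t i = 0; i < kLanes; ++i) {
      sum[i] = static_cast<std::uint8_t>(sum[i] + shifted[i]);
    }
  }
  // Step 1b: active lanes get (prefix sum - 1); inactive lanes get 0xFF (-1),
  // which is out of range for the table lookup. This merges the select and
  // the subtraction that the mail shows as two separate steps.
  Vec idx{};
  for (std::size_t i = 0; i < kLanes; ++i) {
    idx[i] = mask[i] != 0 ? static_cast<std::uint8_t>(sum[i] - 1)
                          : std::uint8_t{0xFF};
  }
  // Step 2: shuffle the source elements into place.
  return tbl(src, idx);
}
```

With `src` holding `a` to `p` from the lowest lane upwards and the mask from the example, `expand` places `a b` in lanes 0 and 1, `c d` in lanes 4 and 5, and so on, matching the expected `dst` in the description.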
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26740#discussion_r2324218817 From epeter at openjdk.org Fri Sep 5 06:33:16 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 5 Sep 2025 06:33:16 GMT Subject: RFR: 8365911: AArch64: Fix encoding error in sve_cpy for negative floats [v3] In-Reply-To: References: Message-ID: On Thu, 4 Sep 2025 09:39:54 GMT, erifan wrote: >> erifan has updated the pull request incrementally with one additional commit since the last revision: >> >> Code style fixes > > The test failure should be irrelevant to this PR, I can see it in other PR's test results, like https://github.com/egahlin/jdk/actions/runs/17436633376/job/49510579213 @erifan There are only unrelated test failures, so good on testing front. The patch looks reasonable, though I'm not a aarch64 expert. Is the issue at all observable from Java? With the wrong encoding, could there be a wrong result that we could test in a jtreg test? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26951#issuecomment-3257224069 From dskantz at openjdk.org Fri Sep 5 06:33:11 2025 From: dskantz at openjdk.org (Daniel Skantz) Date: Fri, 5 Sep 2025 06:33:11 GMT Subject: RFR: 8362117: C2: compiler/stringopts/TestStackedConcatsAppendUncommonTrap.java fails with a wrong result due to invalidated liveness assumptions for data phis [v2] In-Reply-To: <11lcsXkMGpKMQr60NCKofzldqpnJka1XZtrGRrUai3o=.c2201234-bbf2-465a-b237-cd9fe8505491@github.com> References: <11lcsXkMGpKMQr60NCKofzldqpnJka1XZtrGRrUai3o=.c2201234-bbf2-465a-b237-cd9fe8505491@github.com> Message-ID: On Wed, 3 Sep 2025 08:02:04 GMT, Daniel Skantz wrote: >> This PR addresses a wrong compilation during string optimizations. >> >> During stacked string concatenation of two StringBuilder links SB1 and SB2, the pattern "append -> Phi -> Region -> (True, False) -> If -> Bool -> CmpP -> Proj (Result) -> toString" may be observed, where toString is the end of SB1, and the simple diamond is part of SB2. 
>> >> After JDK-8291775, the Bool test to the diamond If is set to a constant zero to allow for folding the simple diamond away during IGVN, while not letting the top() value from the result projection of SB1 propagate through the graph too quickly. The assumption was that any data Phi of the Region would go away during PhaseRemoveUseless as they are no longer live -- I think that in the case of JDK-8291775, the user of phi was the constructor of SB2. However, in the attached test case, the Phi stays live as it's a parameter (input to an append) of SB2 and will be used during the transformation in `copy_string`. When the diamond region is later folded, the Phi's user picks up the wrong input corresponding to the false branch. >> >> The proposed solution is to disable the stacked concatenation optimization for this specific pattern. This might be pragmatic as it's an edge case and there's already a bug tail: JDK-8271341-> JDK-8291775 -> JDK-8362117. >> >> Testing: T1-3 (aed5952). >> >> Extra testing: ran T1-3 on Linux with an instrumented build and verified that the pattern I am excluding in this PR is not seen during any other compilation than that of the proposed regression test. > > Daniel Skantz has updated the pull request incrementally with two additional commits since the last revision: > > - store intermediate calculations > - direction convention Yes, `validate_control_flow()` is used for individual and coalesced concatenations, but just re-using those same checks for coalesced concatenations has been shown to not be sufficient in recent bugs -- in particular when the result of SB1 is used in unexpected ways in SB2. I am not convinced that we have covered all the cases yet. Would it be an idea to fix this issue and then go for the fuzzing approach next to cover more patterns (follow-up RFE), or is there a more general pattern we could prevent here already? 
------------- PR Comment: https://git.openjdk.org/jdk/pull/27028#issuecomment-3257224173 From duke at openjdk.org Fri Sep 5 06:33:12 2025 From: duke at openjdk.org (erifan) Date: Fri, 5 Sep 2025 06:33:12 GMT Subject: RFR: 8363989: AArch64: Add missing backend support of VectorAPI expand operation [v2] In-Reply-To: <_VZ4L0DTdTxRz1XzG4QIyYY7TyCHzroEOeOV21N17_Y=.e92ad3fd-3e94-4bf5-a570-dc8cc8c9e9ed@github.com> References: <_YDJIkwt0sdsOAMfNNn1fHTVwH0SHDpJv5NpQoxnfiA=.a0ddb5f3-00f1-47e2-93da-f47cb3f62288@github.com> <_VZ4L0DTdTxRz1XzG4QIyYY7TyCHzroEOeOV21N17_Y=.e92ad3fd-3e94-4bf5-a570-dc8cc8c9e9ed@github.com> Message-ID: <5_oA0GhSFquOBfMBsQ7atQZBOR8R14Qc1GiDMS7Xbsc=.491d977f-8272-4415-9db1-7e8a12d41a6b@github.com> On Thu, 4 Sep 2025 08:00:14 GMT, erifan wrote: >> test/hotspot/jtreg/compiler/vectorapi/VectorExpandTest.java line 48: >> >>> 46: static final VectorSpecies F_SPECIES = FloatVector.SPECIES_MAX; >>> 47: static final VectorSpecies L_SPECIES = LongVector.SPECIES_MAX; >>> 48: static final VectorSpecies D_SPECIES = DoubleVector.SPECIES_MAX; >> >> Would it make sense to run these tests with various vector sizes? >> Because it seems your algorithm depends on `vector_length_in_bytes` in the prefix sum algo. > > Since we already have correctness tests for `expand` on **all vector types** under `test/jdk/jdk/incubator/vector/`, such as https://github.com/openjdk/jdk/blob/986ecff5f9b16f1b41ff15ad94774d65f3a4631d/test/jdk/jdk/incubator/vector/Byte128VectorTests.java#L5375, this test primarily verifies that the expected IR is generated. So, I think this is sufficient? > > I've tested this PR locally on a 128-bit SVE2 machine, a 256-bit SVE machine, and a 512-bit QEMU environment, and all tests passed. By the way, `vector_length_in_bytes` doesn't affect the IR generation. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26740#discussion_r2324223475 From duke at openjdk.org Fri Sep 5 06:41:14 2025 From: duke at openjdk.org (erifan) Date: Fri, 5 Sep 2025 06:41:14 GMT Subject: RFR: 8365911: AArch64: Fix encoding error in sve_cpy for negative floats [v3] In-Reply-To: References: Message-ID: <_C1JCBLGEQNTp2YPZOvu403adPXceeh1Dg6MYqWiqdw=.e94c5fdf-f2e6-4f95-b390-2cb3106673c7@github.com> On Fri, 5 Sep 2025 06:30:16 GMT, Emanuel Peter wrote: > Is the issue at all observable from Java? With the wrong encoding, could there be a wrong result that we could test in a jtreg test? No this is not observable from java because the JVM currently doesn't use `sve_cpy` to copy negative floating-point numbers, only positive floating-point numbers. I discovered this issue while trying to use this instruction to optimize `VectorMask.toVector()` , which needs to do `sve_cpy(-1.0)`. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26951#issuecomment-3257238951 From epeter at openjdk.org Fri Sep 5 06:46:13 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 5 Sep 2025 06:46:13 GMT Subject: RFR: 8363989: AArch64: Add missing backend support of VectorAPI expand operation [v2] In-Reply-To: <5_oA0GhSFquOBfMBsQ7atQZBOR8R14Qc1GiDMS7Xbsc=.491d977f-8272-4415-9db1-7e8a12d41a6b@github.com> References: <_YDJIkwt0sdsOAMfNNn1fHTVwH0SHDpJv5NpQoxnfiA=.a0ddb5f3-00f1-47e2-93da-f47cb3f62288@github.com> <_VZ4L0DTdTxRz1XzG4QIyYY7TyCHzroEOeOV21N17_Y=.e92ad3fd-3e94-4bf5-a570-dc8cc8c9e9ed@github.com> <5_oA0GhSFquOBfMBsQ7atQZBOR8R14Qc1GiDMS7Xbsc=.491d977f-8272-4415-9db1-7e8a12d41a6b@github.com> Message-ID: On Fri, 5 Sep 2025 06:30:30 GMT, erifan wrote: >> Since we already have correctness tests for `expand` on **all vector types** under `test/jdk/jdk/incubator/vector/`, such as https://github.com/openjdk/jdk/blob/986ecff5f9b16f1b41ff15ad94774d65f3a4631d/test/jdk/jdk/incubator/vector/Byte128VectorTests.java#L5375, this test primarily verifies that 
the expected IR is generated. So, I think this is sufficient? >> >> I've tested this PR locally on a 128-bit SVE2 machine, a 256-bit SVE machine, and a 512-bit QEMU environment, and all tests passed. > > By the way, `vector_length_in_bytes` doesn't affect the IR generation. Ok, that sounds good, as long as we test all vector types elsewhere already :) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26740#discussion_r2324243461 From epeter at openjdk.org Fri Sep 5 07:17:10 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 5 Sep 2025 07:17:10 GMT Subject: RFR: 8365911: AArch64: Fix encoding error in sve_cpy for negative floats [v3] In-Reply-To: References: Message-ID: <7Qekyc5L6ZJS4G9DqSp6Ur68K-Jqv-EgPYcUMK0CrOc=.4a4331ea-ed91-4cd9-92a6-fd84b175dc0c@github.com> On Wed, 3 Sep 2025 10:02:24 GMT, erifan wrote: >> The sve_cpy instruction is not correctly implemented for negative floating-point values. The issues include: >> >> 1. When a negative floating-point number (e.g. `-1.0`) is passed, the `checked_cast(pack(d))` check fails. For example, assume `d = -1.0`: >> - `pack(-1.0)` returns an unsigned int with the 7th bit set, i.e., `0xf0`. >> - `checked_cast(0xf0)` casts `0xf0` to an int8_t value, which is `-16`. >> - Casting this int8_t `-16` back to unsigned int results in `0xfffffff0`. >> - The check compares `0xf0` to `0xfffffff0`, which obviously fails. >> >> 2. Additionally, the encoding of the negative floating-point number is incorrect: >> - The imm8 field can fall outside the valid range of **[-128, 127]**. >> - Bit **13** should be encoded as **0** for floating-point numbers. >> >> This PR fixes these issues and renames floating-point `sve_cpy` as `sve_fcpy`. >> >> Some test cases are added to aarch64-asmtest.py, and all tests passed. > > erifan has updated the pull request incrementally with one additional commit since the last revision: > > Code style fixes Alright, let me rubber stamp it then.
Looks reasonable and tests are passing on our side. Thanks for fixing this :) ------------- Marked as reviewed by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26951#pullrequestreview-3188266800 From epeter at openjdk.org Fri Sep 5 07:27:10 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 5 Sep 2025 07:27:10 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v3] In-Reply-To: References: <3-ZWJMEYL6eWaILQXqX4RskVroCjpFlNdGkmTQMt8Jc=.b09b689a-981f-4f95-83fa-015f0bd698cf@github.com> Message-ID: On Thu, 4 Sep 2025 08:50:59 GMT, Marc Chevalier wrote: >> src/hotspot/share/opto/graphInvariants.cpp line 32: >> >>> 30: >>> 31: void LocalGraphInvariant::LazyReachableCFGNodes::fill() { >>> 32: precond(live_nodes.size() == 0); >> >> Maybe I missed something here: where do the `precond` and `postcond` come from? > > `debug.hpp` just next to `assert`. They are "standard", but not very widely used. I think they are good as they clearly state what is a precondition or a postcondition. There is no message (or rather a default one), but it's better (or not worse) than giving a not very inspired one, like "fail", which one can find often. Nice, did not know that :) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2324319149 From epeter at openjdk.org Fri Sep 5 07:34:11 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 5 Sep 2025 07:34:11 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v3] In-Reply-To: References: <3-ZWJMEYL6eWaILQXqX4RskVroCjpFlNdGkmTQMt8Jc=.b09b689a-981f-4f95-83fa-015f0bd698cf@github.com> Message-ID: On Thu, 4 Sep 2025 09:16:52 GMT, Marc Chevalier wrote: >> src/hotspot/share/opto/graphInvariants.cpp line 207: >> >>> 205: } >>> 206: bool (Node::*_type_check)() const; >>> 207: }; >> >> You could probably generalize this with a callback approach. And then one concrete implementation is the one that does the type check. Just an idea.
> > Seems overengineered to me. The callback version would be about as long as this. The user code that must provide the callback would also be about as long. It makes the logic unnecessarily complicated to me. Of course, everything boils down to a function that takes a node and performs a specific check, but then, this generalized version does nothing significant but call the callback. The concrete implementation will just have all the same logic, but in a callback passed to another method instead of having it as a first class method... > > If I don't have an adapter class that would only check type but I leave that at instantiation time, the code would look like > > NodeCallback([](const Node* n) { return n->is_Region(); }) > > instead of > > NodeClass(&Node::is_Region) > > which is unreadable. That's the point of patterns: it makes it easy to understand the shape, otherwise, one can just write normal, manual traversal, which is all powerful. > > It was also discussed above that something like the `NodeCallback` could exist for when we need something that can't be expressed simply, but: > - will it ever happen? > - NodeCallback doesn't even provide a useful error message, we would also need a callback to craft it (or make the one callback more complicated, that would be pretty much the content of `NodeClass::check`) > - I'm not willing to make the common kind of patterns ugly for a rare use case. > > And as for implementing `NodeClass` from a hypothetical `NodeCallback`, what would be the concrete benefits? (kinda the first paragraph again: all the logic in the callback, and NodeCallback doing nothing). Sounds good :) It was just an idea, and it is also a bit of a question of taste. But you are right: callbacks can also look ugly and hard to read.
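
The two styles compared in this exchange can be put side by side in a small standalone sketch. Everything here is a simplified assumption: `Node` is a stand-in with a single predicate, and the two classes keep only the `_type_check` member and the `check` idea quoted above, not the real `graphInvariants.cpp` code (which also handles error reporting and paths).

```cpp
#include <functional>
#include <utility>

// Hypothetical stand-in for HotSpot's Node; only what the sketch needs.
struct Node {
  bool region = false;
  bool is_Region() const { return region; }
};

// Type check stored as a pointer to a const member function,
// in the spirit of the quoted `bool (Node::*_type_check)() const;`.
struct NodeClass {
  bool (Node::*_type_check)() const;
  explicit NodeClass(bool (Node::*type_check)() const)
      : _type_check(type_check) {}
  bool check(const Node* n) const { return (n->*_type_check)(); }
};

// The generalized variant from the discussion: an arbitrary callback.
struct NodeCallback {
  std::function<bool(const Node*)> _callback;
  explicit NodeCallback(std::function<bool(const Node*)> callback)
      : _callback(std::move(callback)) {}
  bool check(const Node* n) const { return _callback(n); }
};
```

Both `NodeClass(&Node::is_Region)` and `NodeCallback([](const Node* n) { return n->is_Region(); })` accept the same nodes; the difference is what the call site looks like and how much of the logic lives in the caller.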
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2324333231 From mchevalier at openjdk.org Fri Sep 5 07:42:07 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Fri, 5 Sep 2025 07:42:07 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v4] In-Reply-To: References: Message-ID: > Some crashes are consequences of earlier misshaped ideal graphs, which could be detected earlier, closer to the source, before the possibly many transformations that lead to the crash. > > Let's verify that the ideal graph is well-shaped earlier then! I propose here such a feature. This runs after IGVN, because at this point, the graph, should be cleaned up for any weirdness happening earlier or during IGVN. > > This feature is enabled with the develop flag `VerifyIdealStructuralInvariants`. Open to renaming. No problem with me! This feature is only available in debug builds, and most of the code is even not compiled in product, since it uses some debug-only functions, such as `Node::dump` or `Node::Name`. > > For now, only local checks are implemented: they are checks that only look at a node and its neighborhood, wherever it happens in the graph. Typically: under a `If` node, we have a `IfTrue` and a `IfFalse`. To ease development, each check is implemented in its own class, independently of the others. Nevertheless, one needs to do always the same kind of things: checking there is an output of such type, checking there is N inputs, that the k-th input has such type... To ease writing such checks, in a readable way, and in a less error-prone way than pile of copy-pasted code that manually traverse the graph, I propose a set of compositional helpers to write patterns that can be matched against the ideal graph. Since these patterns are... patterns, so not related to a specific graph, they can be allocated once and forever. 
When used, one provides the node (called center) around which one want to check if the pattern holds. > > On top of making the description of pattern easier, these helpers allows nice printing in case of error, by showing the path from the center to the violating node. For instance (made up for the purpose of showing the formatting), a violation with a path climbing only inputs: > > 1 failure for node > 211 OuterStripMinedLoopEnd === 215 39 [[ 212 198 ]] P=0,948966, C=23799,000000 > At node > 209 CountedLoopEnd === 182 208 [[ 210 197 ]] [lt] P=0,948966, C=23799,000000 !orig=[196] !jvms: StringLatin1::equals @ bci:12 (line 100) > From path: > [center] 211 OuterStripMinedLoopEnd === 215 39 [[ 212 198 ]] P=0,948966, C=23799,000000 > <-(0)- 215 SafePoint === 210 1 7 1 1 216 37 54 185 [[ 211 ]] SafePoint !orig=186 !jvms: StringLatin1::equals @ bci:29 (line 100) > <-(0)- 210 IfFalse === 209 [[ 215 216 ]] #0 !orig=198 !jvms: StringL... Marc Chevalier has updated the pull request incrementally with two additional commits since the last revision: - With typed binding - Review ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26362/files - new: https://git.openjdk.org/jdk/pull/26362/files/700310e1..3c33fac9 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26362&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26362&range=02-03 Stats: 334 lines in 7 files changed: 211 ins; 64 del; 59 mod Patch: https://git.openjdk.org/jdk/pull/26362.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26362/head:pull/26362 PR: https://git.openjdk.org/jdk/pull/26362 From epeter at openjdk.org Fri Sep 5 07:42:08 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 5 Sep 2025 07:42:08 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v3] In-Reply-To: References: <3-ZWJMEYL6eWaILQXqX4RskVroCjpFlNdGkmTQMt8Jc=.b09b689a-981f-4f95-83fa-015f0bd698cf@github.com> Message-ID: On Thu, 4 Sep 2025 11:08:20 GMT, Marc Chevalier wrote: 
>> src/hotspot/share/opto/graphInvariants.cpp line 279: >> >>> 277: return CheckResult::NOT_APPLICABLE; >>> 278: } >>> 279: CheckResult r = PatternBasedCheck::check(center, reachable_cfg_nodes, steps, path, ss); >> >> Could this not be solved with a `OrPattern`? >> >> Or::make( > >> ) >> >> Not sure that's worth it... > > I understand that OrPatterns are tempting! I also thought about it, it's naturally the dual of `And`. At this point, they are not actually a good idea. > > First, they cannot provide good reporting. When an `And` is failing, we can at least blame the first thing that fails: "I followed this path, I expected to find 5 inputs (for instance), there are only 2!". With `Or` we would get that and... maybe it's fine? Maybe not? Depends on the next branches, and if it ends up failing, how to provide a good message? > > Also, they cause a mess with binding. If a branch contains a `Bind`, one cannot know which branch matched and whether the content of the `Node` pointer given to `Bind` is trustworthy. We can't even rely on a test whether the pointer was set because the execution of a branch might find a `Bind` first, run it, assign the pointer and later fail, and then the `Bind` is not to use. This is a common problem with pattern matching in functional programming: the same bindings must appear (with same types) on each branch of or-patterns. But we have no such mechanisms to enforce that yet, and it seems like setting a trap for future us. > > There is also relatively few use cases, and that would not profit a lot from a `Or` pattern. Maybe in the future, we will have more interesting usecases and we will see how to address these issues. But for now, I think we should not include it for now rather than making a bad choice. 
> > By the way, I think something that has more future than `Or` is rather a case analysis: `IfThenElse(ConditionPattern, TrueBranchPattern, FalseBranchPattern)` if ConditionPattern is true, then we try to match TrueBranchPattern, otherwise FalseBranchPattern. This is better for reporting since we know which branch we expect to be true, and so to blame (assuming we don't blame ConditionPattern, but we can include that in the message possibly). This still has the binding consistency issue, but more boilerplate could help (querying the set of pointers that would be set in each branch with helping methods...). Yet, let's wait and see. Hmm, yes. I did later on think about binding. If we ever use pattern matching for IGVN optimizations, we need to be able to do or-like patterns, maybe even over 2, 3, ...n many branches. And then bind to something. And you are right, reporting could also be an issue. Maybe there could be some kind of reporting still though: we could evaluate both branches and report where each fails. I saw multiple uses, so maybe at some point an Or could be justified. But maybe not yet. I think this is an interesting thread, so a shame you closed it as "resolved" ;) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2324343167 From mchevalier at openjdk.org Fri Sep 5 07:53:15 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Fri, 5 Sep 2025 07:53:15 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v3] In-Reply-To: References: <3-ZWJMEYL6eWaILQXqX4RskVroCjpFlNdGkmTQMt8Jc=.b09b689a-981f-4f95-83fa-015f0bd698cf@github.com> Message-ID: On Fri, 5 Sep 2025 07:36:34 GMT, Emanuel Peter wrote: >> I understand that OrPatterns are tempting! I also thought about it, it's naturally the dual of `And`. At this point, they are not actually a good idea. >> >> First, they cannot provide good reporting.
When an `And` is failing, we can at least blame the first thing that fails: "I followed this path, I expected to find 5 inputs (for instance), there are only 2!". With `Or` we would get that and... maybe it's fine? Maybe not? Depends on the next branches, and if it ends up failing, how to provide a good message? >> >> Also, they cause a mess with binding. If a branch contains a `Bind`, one cannot know which branch matched and whether the content of the `Node` pointer given to `Bind` is trustworthy. We can't even rely on a test whether the pointer was set because the execution of a branch might find a `Bind` first, run it, assign the pointer and later fail, and then the `Bind` is not to use. This is a common problem with pattern matching in functional programming: the same bindings must appear (with same types) on each branch of or-patterns. But we have no such mechanisms to enforce that yet, and it seems like setting a trap for future us. >> >> There is also relatively few use cases, and that would not profit a lot from a `Or` pattern. Maybe in the future, we will have more interesting usecases and we will see how to address these issues. But for now, I think we should not include it for now rather than making a bad choice. >> >> By the way, I think something that has more future than `Or` is rather a case analysis: `IfThenElse(ConditionPattern, TrueBranchPattern, FalseBranchPattern)` if ConditionPattern is true, then we try to match TrueBranchPattern, otherwise FalseBranchPattern. This is better for reporting since we know which branch we expect to be true, and so to blame (assuming we don't blame ConditionPattern, but we can include that in the message possibly). This still has the binding consistency issue, but more boilerplate could help (querying the set of pointers that would be set in each branch with helping methods...). Yet, let's wait and see. > > Hmm, yes. I did later on think about binding.
> If we ever use pattern matching for IGVN optimizations, we need to be able to do or-like patterns, maybe even over 2, 3, ...n many branches. And then bind to something. > > And you are right, reporting could also be an issue. > Maybe there could be some kind of reporting still though: we could evaluate both branches and report where each fails. > > I saw multiple uses, so maybe at some point an Or could be justified. But maybe not yet. > > I think this is an interesting thread, so a shame you closed it as "resolved" ;) I think using Or patterns for recognizing patterns, and not for enforcing them, is nicer since there is no reporting problem. We can add that when we are there. >> Everything will run under `GraphInvariantChecker::run()` that has a `ResourceMark`. I'm not sure, but my guess is that it's not worth keeping entering and leaving resource marks for relatively short lists? At the very least, everything will be released at the end of the whole check. I can still add one here if you think it's better. > > I would do it defensively. Might save us from out-of-memory later on with higher tiers, and it could also make things faster: i.e. we might avoid timeouts, just because we need less memory. I don't have the overview how large these are and how many you'd create, so maybe it is unnecessary. Up to you. I will then! I've run this flag to tier3 + stress without OOM, so it's not too terrible, but yeah, we never know what kind of memory usage will come! >> I surely hope not! > > Then I would assert that it has exactly 1 input instead ;) Yes, I've changed it already (is GH laggy?...)
-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2324352658
PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2324363672
PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2324367309

From epeter at openjdk.org Fri Sep 5 07:53:16 2025
From: epeter at openjdk.org (Emanuel Peter)
Date: Fri, 5 Sep 2025 07:53:16 GMT
Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v3]
In-Reply-To:
References: <3-ZWJMEYL6eWaILQXqX4RskVroCjpFlNdGkmTQMt8Jc=.b09b689a-981f-4f95-83fa-015f0bd698cf@github.com>
Message-ID:

On Thu, 4 Sep 2025 11:14:27 GMT, Marc Chevalier wrote:

> We would have to find a solution if there would be multiple bindings though ... I think that's not possible with your patterns, right? Is that a fundamental constraint?

Sorry, that was not very clear. Yes, you can bind multiple variables already. But you cannot do a disjunction (or) with binding. That would be helpful if you wanted to match patterns like:

((x + a) + a)
or
(a + (x + a))

We do that sort of thing a lot in IGVN optimizations: we need to be prepared to iterate over all associative reorderings.

>> Also: I would invert the check to `!counted_loop_end->is_LongCountedLoopEnd()`. Because you expect it to be a long end here. Subjective.
>
> If you want. I don't think it's perfect because then the message might be less accurate: I don't know that
>> A CountedLoopEnd is the backedge of a LongCountedLoop.
>
> I rather know that
>> The backedge of a LongCountedLoop is not a LongCountedLoopEnd

As far as I know, CountedLoopEnd is always the backedge of LongCountedLoop. Same for int. If not, I'd like to see a counter example ;)

At least this should be true after IGVN.
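As an illustration of the two shapes mentioned above, here is a minimal sketch of a disjunctive match with bindings that are only handed out on success. `Expr`, `Match`, `match_x_plus_a` and `match_either_shape` are hypothetical names for this example and have nothing to do with the actual IGVN code; only the two orderings named in the mail are tried.

```cpp
#include <cassert>
#include <optional>

// Toy expression node: an addition when both children are set, a leaf otherwise.
// Hypothetical, not HotSpot's AddNode.
struct Expr {
  const Expr* lhs = nullptr;
  const Expr* rhs = nullptr;
  bool is_add() const { return lhs != nullptr && rhs != nullptr; }
};

// Bindings produced by a successful match.
struct Match {
  const Expr* x;
  const Expr* a;
};

// Matches (x + a) and returns both operands, or nothing if n is not an add.
std::optional<Match> match_x_plus_a(const Expr* n) {
  if (!n->is_add()) return std::nullopt;
  return Match{n->lhs, n->rhs};
}

// Matches ((x + a) + a) or (a + (x + a)), requiring the same `a` in both places.
std::optional<Match> match_either_shape(const Expr* n) {
  if (!n->is_add()) return std::nullopt;
  // Shape 1: ((x + a) + a)
  if (auto inner = match_x_plus_a(n->lhs); inner && inner->a == n->rhs) {
    return inner;
  }
  // Shape 2: (a + (x + a))
  if (auto inner = match_x_plus_a(n->rhs); inner && inner->a == n->lhs) {
    return inner;
  }
  return std::nullopt;
}
```

Returning the bindings as a value, instead of writing them through pointers while matching, sidesteps the stale-binding problem: a failed branch simply produces no `Match`. Whether that style fits the pattern helpers in the PR is an open question.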
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2324352933 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2324364567 From mchevalier at openjdk.org Fri Sep 5 07:53:16 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Fri, 5 Sep 2025 07:53:16 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v3] In-Reply-To: References: <3-ZWJMEYL6eWaILQXqX4RskVroCjpFlNdGkmTQMt8Jc=.b09b689a-981f-4f95-83fa-015f0bd698cf@github.com> Message-ID: On Fri, 5 Sep 2025 07:41:24 GMT, Emanuel Peter wrote: >>> We would have to find a solution if there would be multiple bindings though ... I think that's not possible with your patterns, right? Is that a fundamental constraint? >> >> Not sure what you mean? `And::make(new Bind(bla), AtInput(1, new Bind(bli)))`? You probably mean something else. >> >> >>> If we could somehow already cast the Bind variable to Region. Could be tricky. >>> Doing this is_Region and bind could be a very common idiom, so very useful. >> >> Interesting... Not sure how with some template magic we don't have (like `Node::is`) but probably doable with macros. I'll give it a try. > >> We would have to find a solution if there would be multiple bindings though ... I think that's not possible with your patterns, right? Is that a fundamental constraint? > > Sorry, that was not very clear. Yes you can bind multiple variables already. But you cannot do a disjuction (or) with binding. That would be helpful if you wanted to match patterns like: > > ((x + a) + a) > or > (a + (x + a)) > > We do that sort of thing a lot in IGVN optimizations: we need to be prepared to iterate over all associative reorderings. True. There is also no notion of "everyway this pattern can be matched around this center" (even tho, I tried to make patterns deterministic with numbering inputs and picking output of given type). I think that can fit with Or pattern, and rather for a IGVN use than a checking one. 
Let's see when we make use of that actually in this context. Then, we will not make stupid blind guesses on how to do it. >> If you want. I don't think it's perfect because then the message might be less accurate: I don't know that >>> A CountedLoopEnd is the backedge of a LongCountedLoop. >> >> I rather know that >>> The backedge of a LongCountedLoop is not a LongCountedLoopEnd > > As far as I know, CountedLoopEnd is always the backedge of LongCountedLoop. Same for int. If not, I'd like to see a counter example ;) > > At least this should be true after IGVN. Don't you mean "LongCountedLoopEnd is always the backedge of LongCountedLoop"? But I rather meant "what if we add another derived class of `CounterLoopEnd`? But I think the new assert should do the trick. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2324360017 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2324371124 From epeter at openjdk.org Fri Sep 5 07:53:18 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 5 Sep 2025 07:53:18 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v3] In-Reply-To: References: <3-ZWJMEYL6eWaILQXqX4RskVroCjpFlNdGkmTQMt8Jc=.b09b689a-981f-4f95-83fa-015f0bd698cf@github.com> Message-ID: On Thu, 4 Sep 2025 11:19:13 GMT, Marc Chevalier wrote: >> src/hotspot/share/opto/graphInvariants.cpp line 332: >> >>> 330: } >>> 331: >>> 332: Node_List ctrl_succ; >> >> Do we need a `ResouceMark` for this? > > Everything will run under `GraphInvariantChecker::run()` that has a `ResouceMark`. I'm not sure, but my guess is that it's not worth keeping entering and leaving resource marks for relatively short lists? At the very list, everything will be released at the end of the whole check. I can still add one here if you think it's better. I would do it defensively. Might save us from out-of-memory later on with higher tiers, and it could also make things faster: i.e. 
we might avoid timeouts, just because we need less memory. I don't have the overview how large these are and how many you'd create, so maybe it is unnecessary. Up to you. >> src/hotspot/share/opto/graphInvariants.cpp line 447: >> >>> 445: And::make( >>> 446: new NodeClass(&Node::is_IfTrue), >>> 447: new HasAtLeastNInputs(1), >> >> Can an `IfTrue` have more than 1 input? > > I surely hope not! Then I would assert that it has exactly 1 input instead ;) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2324357889 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2324360409 From mchevalier at openjdk.org Fri Sep 5 08:10:17 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Fri, 5 Sep 2025 08:10:17 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v4] In-Reply-To: References: Message-ID: On Fri, 5 Sep 2025 07:42:07 GMT, Marc Chevalier wrote: >> Some crashes are consequences of earlier misshaped ideal graphs, which could be detected earlier, closer to the source, before the possibly many transformations that lead to the crash. >> >> Let's verify that the ideal graph is well-shaped earlier then! I propose here such a feature. This runs after IGVN, because at this point, the graph, should be cleaned up for any weirdness happening earlier or during IGVN. >> >> This feature is enabled with the develop flag `VerifyIdealStructuralInvariants`. Open to renaming. No problem with me! This feature is only available in debug builds, and most of the code is even not compiled in product, since it uses some debug-only functions, such as `Node::dump` or `Node::Name`. >> >> For now, only local checks are implemented: they are checks that only look at a node and its neighborhood, wherever it happens in the graph. Typically: under a `If` node, we have a `IfTrue` and a `IfFalse`. To ease development, each check is implemented in its own class, independently of the others. 
Nevertheless, one needs to do always the same kind of things: checking there is an output of such type, checking there is N inputs, that the k-th input has such type... To ease writing such checks, in a readable way, and in a less error-prone way than pile of copy-pasted code that manually traverse the graph, I propose a set of compositional helpers to write patterns that can be matched against the ideal graph. Since these patterns are... patterns, so not related to a specific graph, they can be allocated once and forever. When used, one provides the node (called center) around which one want to check if the pattern holds. >> >> On top of making the description of pattern easier, these helpers allows nice printing in case of error, by showing the path from the center to the violating node. For instance (made up for the purpose of showing the formatting), a violation with a path climbing only inputs: >> >> 1 failure for node >> 211 OuterStripMinedLoopEnd === 215 39 [[ 212 198 ]] P=0,948966, C=23799,000000 >> At node >> 209 CountedLoopEnd === 182 208 [[ 210 197 ]] [lt] P=0,948966, C=23799,000000 !orig=[196] !jvms: StringLatin1::equals @ bci:12 (line 100) >> From path: >> [center] 211 OuterStripMinedLoopEnd === 215 39 [[ 212 198 ]] P=0,948966, C=23799,000000 >> <-(0)- 215 SafePoint === 210 1 7 1 1 216 37 54 185 [[ 211 ]] SafePoint !orig=186 !jvms: StringLatin1::equals @ bci:29 (line 100) >> <-(0)- 210 IfFalse === 209 [[ 21... > > Marc Chevalier has updated the pull request incrementally with two additional commits since the last revision: > > - With typed binding > - Review I've fixed a lot, but notably added a basic test, and gave the typed binding a try. I would have liked it without macro, but I think it's ok to use. I sometime dream we had `node->is()`, that would ease a few of these things (subjective). 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26362#issuecomment-3257455746 From mchevalier at openjdk.org Fri Sep 5 08:13:35 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Fri, 5 Sep 2025 08:13:35 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v5] In-Reply-To: References: Message-ID: > Some crashes are consequences of earlier misshaped ideal graphs, which could be detected earlier, closer to the source, before the possibly many transformations that lead to the crash. > > Let's verify that the ideal graph is well-shaped earlier then! I propose here such a feature. This runs after IGVN, because at this point, the graph, should be cleaned up for any weirdness happening earlier or during IGVN. > > This feature is enabled with the develop flag `VerifyIdealStructuralInvariants`. Open to renaming. No problem with me! This feature is only available in debug builds, and most of the code is even not compiled in product, since it uses some debug-only functions, such as `Node::dump` or `Node::Name`. > > For now, only local checks are implemented: they are checks that only look at a node and its neighborhood, wherever it happens in the graph. Typically: under a `If` node, we have a `IfTrue` and a `IfFalse`. To ease development, each check is implemented in its own class, independently of the others. Nevertheless, one needs to do always the same kind of things: checking there is an output of such type, checking there is N inputs, that the k-th input has such type... To ease writing such checks, in a readable way, and in a less error-prone way than pile of copy-pasted code that manually traverse the graph, I propose a set of compositional helpers to write patterns that can be matched against the ideal graph. Since these patterns are... patterns, so not related to a specific graph, they can be allocated once and forever. When used, one provides the node (called center) around which one want to check if the pattern holds. 
> > On top of making the description of pattern easier, these helpers allows nice printing in case of error, by showing the path from the center to the violating node. For instance (made up for the purpose of showing the formatting), a violation with a path climbing only inputs: > > 1 failure for node > 211 OuterStripMinedLoopEnd === 215 39 [[ 212 198 ]] P=0,948966, C=23799,000000 > At node > 209 CountedLoopEnd === 182 208 [[ 210 197 ]] [lt] P=0,948966, C=23799,000000 !orig=[196] !jvms: StringLatin1::equals @ bci:12 (line 100) > From path: > [center] 211 OuterStripMinedLoopEnd === 215 39 [[ 212 198 ]] P=0,948966, C=23799,000000 > <-(0)- 215 SafePoint === 210 1 7 1 1 216 37 54 185 [[ 211 ]] SafePoint !orig=186 !jvms: StringLatin1::equals @ bci:29 (line 100) > <-(0)- 210 IfFalse === 209 [[ 215 216 ]] #0 !orig=198 !jvms: StringL... Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: One more ResourceMark ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26362/files - new: https://git.openjdk.org/jdk/pull/26362/files/3c33fac9..ea78a5a3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26362&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26362&range=03-04 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26362.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26362/head:pull/26362 PR: https://git.openjdk.org/jdk/pull/26362 From epeter at openjdk.org Fri Sep 5 08:13:35 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 5 Sep 2025 08:13:35 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v4] In-Reply-To: References: Message-ID: On Fri, 5 Sep 2025 08:07:21 GMT, Marc Chevalier wrote: > I sometime dream we had node->is(), that would ease a few of these things (subjective). Is that something we could do, in a separate RFE? 
I wonder if we could generate it with the same kind of macros as with which we define `is_Region`... we would just forward from the instantiation `is` to `is_Region`. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26362#issuecomment-3257466360 From mchevalier at openjdk.org Fri Sep 5 08:17:18 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Fri, 5 Sep 2025 08:17:18 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v5] In-Reply-To: References: Message-ID: On Fri, 5 Sep 2025 08:13:35 GMT, Marc Chevalier wrote: >> Some crashes are consequences of earlier misshaped ideal graphs, which could be detected earlier, closer to the source, before the possibly many transformations that lead to the crash. >> >> Let's verify that the ideal graph is well-shaped earlier then! I propose here such a feature. This runs after IGVN, because at this point, the graph, should be cleaned up for any weirdness happening earlier or during IGVN. >> >> This feature is enabled with the develop flag `VerifyIdealStructuralInvariants`. Open to renaming. No problem with me! This feature is only available in debug builds, and most of the code is even not compiled in product, since it uses some debug-only functions, such as `Node::dump` or `Node::Name`. >> >> For now, only local checks are implemented: they are checks that only look at a node and its neighborhood, wherever it happens in the graph. Typically: under a `If` node, we have a `IfTrue` and a `IfFalse`. To ease development, each check is implemented in its own class, independently of the others. Nevertheless, one needs to do always the same kind of things: checking there is an output of such type, checking there is N inputs, that the k-th input has such type... 
To ease writing such checks, in a readable way, and in a less error-prone way than pile of copy-pasted code that manually traverse the graph, I propose a set of compositional helpers to write patterns that can be matched against the ideal graph. Since these patterns are... patterns, so not related to a specific graph, they can be allocated once and forever. When used, one provides the node (called center) around which one want to check if the pattern holds. >> >> On top of making the description of pattern easier, these helpers allows nice printing in case of error, by showing the path from the center to the violating node. For instance (made up for the purpose of showing the formatting), a violation with a path climbing only inputs: >> >> 1 failure for node >> 211 OuterStripMinedLoopEnd === 215 39 [[ 212 198 ]] P=0,948966, C=23799,000000 >> At node >> 209 CountedLoopEnd === 182 208 [[ 210 197 ]] [lt] P=0,948966, C=23799,000000 !orig=[196] !jvms: StringLatin1::equals @ bci:12 (line 100) >> From path: >> [center] 211 OuterStripMinedLoopEnd === 215 39 [[ 212 198 ]] P=0,948966, C=23799,000000 >> <-(0)- 215 SafePoint === 210 1 7 1 1 216 37 54 185 [[ 211 ]] SafePoint !orig=186 !jvms: StringLatin1::equals @ bci:29 (line 100) >> <-(0)- 210 IfFalse === 209 [[ 21... > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > One more ResourceMark Totally not in this change, yes. And indeed, we could just use the macro to define a bit more. But I fear it will be a controversial topic. 
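For readers wondering what a macro-generated `is<T>()` forwarding to `is_Region()`-style queries might look like, here is a self-contained sketch. Everything in it is an assumption made for illustration: the class list macro `TOY_NODE_CLASSES`, the `Kind` enum and the two node classes are invented, and this is not how HotSpot's `Node` is actually laid out.

```cpp
#include <cassert>

// Hypothetical class list, in the spirit of generating one query per node class.
#define TOY_NODE_CLASSES(f) f(Region) f(Phi)

struct Node {
  enum Kind { Kind_Region, Kind_Phi };
  explicit Node(Kind k) : _kind(k) {}

  // Generates is_Region(), is_Phi(), ...
  #define DEFINE_IS(type) \
    bool is_##type() const { return _kind == Kind_##type; }
  TOY_NODE_CLASSES(DEFINE_IS)
  #undef DEFINE_IS

  // Generic query; only the specializations generated below are defined.
  template <typename T> bool is() const;

 private:
  Kind _kind;
};

struct RegionNode : Node { RegionNode() : Node(Kind_Region) {} };
struct PhiNode : Node { PhiNode() : Node(Kind_Phi) {} };

// Generates Node::is<RegionNode>(), Node::is<PhiNode>(), ...
// Each one just forwards to the matching is_Xxx() query.
#define DEFINE_IS_FORWARD(type) \
  template <> inline bool Node::is<type##Node>() const { return is_##type(); }
TOY_NODE_CLASSES(DEFINE_IS_FORWARD)
#undef DEFINE_IS_FORWARD
```

With this, `n->is<RegionNode>()` and `n->is_Region()` answer the same question, and a typed `Bind` could take the class as a template argument instead of a member-function pointer. Asking for a class outside the list fails at link time because the primary template has no definition.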
-------------

PR Comment: https://git.openjdk.org/jdk/pull/26362#issuecomment-3257476856

From duke at openjdk.org Fri Sep 5 08:19:35 2025
From: duke at openjdk.org (erifan)
Date: Fri, 5 Sep 2025 08:19:35 GMT
Subject: RFR: 8366588: VectorAPI: Re-intrinsify VectorMask.laneIsSet where the input index is a variable
Message-ID:

Intrinsic support for `VectorMask.laneIsSet` with a **variable** input index was introduced in PR #14200, but was inadvertently broken by PR #25673. This PR restores the intrinsic functionality and adds some JTReg tests.

Benchmarks on Nvidia Grace machine with 128-bit SVE:

Benchmark                        Unit    Before Score  Error       After Score  Error       Uplift
microMaskLaneIsSetByte128_var    ops/ms  21702.14415   91.902159   103472.9391  36.057447   4.767867
microMaskLaneIsSetByte64_var     ops/ms  21468.51868   107.94177   103365.6561  69.47736    4.814754
microMaskLaneIsSetDouble128_var  ops/ms  77489.32791   153.242699  413499.4127  311.854079  5.336211
microMaskLaneIsSetFloat128_var   ops/ms  41034.95204   399.421823  206840.0988  74.702234   5.040583
microMaskLaneIsSetFloat64_var    ops/ms  77607.40268   175.938921  413745.3001  149.716794  5.33126
microMaskLaneIsSetInt128_var     ops/ms  41452.48893   76.143208   206845.9754  59.371129   4.989953
microMaskLaneIsSetInt64_var      ops/ms  77726.2542    173.180518  413427.8838  363.575023  5.319024
microMaskLaneIsSetLong128_var    ops/ms  77646.11218   177.496587  413403.4404  236.609314  5.3242
microMaskLaneIsSetShort128_var   ops/ms  21374.93265   48.13101    103417.4618  34.827021   4.838259
microMaskLaneIsSetShort64_var    ops/ms  41066.19395   353.320621  206801.109   106.408938  5.035799

Benchmarks on Intel 6444y machine with 512-bit avx3:

Benchmark                        Unit    Before Score  Error       After Score  Error       Uplift
microMaskLaneIsSetByte128_var    ops/ms  57658.45497   240.209309  211643.8406  29.214532   3.670647
microMaskLaneIsSetByte256_var    ops/ms  57451.68169   116.994128  211609.4652  160.48513   3.683259
microMaskLaneIsSetByte512_var    ops/ms  57530.22411   311.63868   199802.8084  408.144015  3.473005
microMaskLaneIsSetByte64_var     ops/ms  57642.2672    161.406221  205252.4464  196.86852   3.560797
microMaskLaneIsSetDouble256_var  ops/ms  114401.3789   231.797375  361400.344   565.593984  3.159055
microMaskLaneIsSetDouble512_var  ops/ms  57379.27882   159.699503  211476.1138  136.980026  3.685583
microMaskLaneIsSetFloat128_var   ops/ms  113943.9512   141.062663  360855.3915  494.471996  3.166955
microMaskLaneIsSetFloat256_var   ops/ms  57682.78182   138.142053  211659.5098  30.167972   3.66937
microMaskLaneIsSetFloat512_var   ops/ms  57617.66405   301.748599  211246.8588  597.18949   3.666355
microMaskLaneIsSetInt128_var     ops/ms  113914.5062   118.681382  360856.4465  555.097397  3.167783
microMaskLaneIsSetInt256_var     ops/ms  57681.79883   112.391639  211555.6742  217.556981  3.667633
microMaskLaneIsSetInt512_var     ops/ms  57350.20346   206.146723  211657.7207  68.461571   3.690618
microMaskLaneIsSetLong256_var    ops/ms  113838.3822   415.784529  360782.0645  710.076899  3.169247
microMaskLaneIsSetLong512_var    ops/ms  57314.02695   190.1762    211690.8492  26.47233    3.693526
microMaskLaneIsSetShort128_var   ops/ms  57675.58965   65.940976   211549.9551  276.57545   3.667928
microMaskLaneIsSetShort256_var   ops/ms  57628.8642    91.957833   211694.0864  16.559412   3.673403
microMaskLaneIsSetShort512_var   ops/ms  57845.35211   160.537421  211358.872   660.777147  3.65386
microMaskLaneIsSetShort64_var    ops/ms  113848.8846   222.787418  360294.6295  491.425656  3.164674

-------------

Commit messages:
 - 8366588: VectorAPI: Re-intrinsify VectorMask.laneIsSet where the input index is a variable

Changes: https://git.openjdk.org/jdk/pull/27113/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27113&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8366588
  Stats: 170 lines in 4 files changed: 168 ins; 0 del; 2 mod
  Patch: https://git.openjdk.org/jdk/pull/27113.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/27113/head:pull/27113

PR: https://git.openjdk.org/jdk/pull/27113

From duke at openjdk.org Fri Sep 5 08:21:11 2025
From: duke at openjdk.org (erifan)
Date: Fri, 5 Sep 2025 08:21:11 GMT
Subject: RFR: 8365911: AArch64: Fix
encoding error in sve_cpy for negative floats In-Reply-To: <-G8GwIflOhFjOL-PAG6_oylu0Fa9c8iNUB57EC6oo4s=.a0126087-2a97-4542-a555-27c12578fccf@github.com> References: <2R6O7Jhv3catwxc6rXJdh7Uiq-NFBp7beCmP49CLTqU=.7ba72e39-6efd-47fe-8ad9-6df54a45c99b@github.com> <-G8GwIflOhFjOL-PAG6_oylu0Fa9c8iNUB57EC6oo4s=.a0126087-2a97-4542-a555-27c12578fccf@github.com> Message-ID: On Tue, 2 Sep 2025 08:10:02 GMT, Andrew Haley wrote: >> Thanks @theRealAph . >> >> I've indeed considered and implemented your idea. The code diff: >> >> diff --git a/src/hotspot/cpu/aarch64/assembler_aarch64.hpp b/src/hotspot/cpu/aarch64/assembler_aarch64.hpp >> index 11d302e9026..841d24f516b 100644 >> --- a/src/hotspot/cpu/aarch64/assembler_aarch64.hpp >> +++ b/src/hotspot/cpu/aarch64/assembler_aarch64.hpp >> @@ -3813,8 +3813,9 @@ template >> bool isMerge, bool isFloat) { >> starti; >> assert(T != Q, "invalid size"); >> + assert((!isFloat) || (isFloat && T != B), "invalid size"); >> int sh = 0; >> - if (imm8 <= 127 && imm8 >= -128) { >> + if ((imm8 <= 127 && imm8 >= -128) || (isFloat && (imm8 >> 8) == 0)) { >> sh = 0; >> } else if (T != B && imm8 <= 32512 && imm8 >= -32768 && (imm8 & 0xff) == 0) { >> sh = 1; >> @@ -3824,7 +3825,7 @@ template >> } >> int m = isMerge ? 1 : 0; >> f(0b00000101, 31, 24), f(T, 23, 22), f(0b01, 21, 20); >> - prf(Pg, 16), f(isFloat ? 1 : 0, 15), f(m, 14), f(sh, 13), sf(imm8, 12, 5), rf(Zd, 0); >> + prf(Pg, 16), f(isFloat ? 
1 : 0, 15), f(m, 14), f(sh, 13), f(imm8&0xff, 12, 5), rf(Zd, 0); >> } >> >> public: >> @@ -3834,7 +3835,7 @@ template >> } >> // SVE copy floating-point immediate to vector elements (predicated) >> void sve_cpy(FloatRegister Zd, SIMD_RegVariant T, PRegister Pg, double d) { >> - sve_cpy(Zd, T, Pg, checked_cast(pack(d)), /*isMerge*/true, /*isFloat*/true); >> + sve_cpy(Zd, T, Pg, checked_cast(pack(d)), /*isMerge*/true, /*isFloat*/true); >> } >> >> // SVE conditionally select elements from two vectors >> >> >> However, some of my colleagues have differing opinions: >> 1. sve `cpy` and `fcpy` are actually two different instructions, and distinguishing them might be clearer. >> 2. sve `cpy` 's imm8 is an **int** , while `fcpy` 's imm8 is an **fp8** . While some encoding code can be reused, separating the encodings makes the code clearer. >> >> I think both implementations are fine. If you think it's better to not refactor, I'll revert. > >> 1. sve `cpy` and `fcpy` are actually two different instructions, and distinguishing them might be clearer. > > That's a fair point, but the Arch64 name for all four instructions is CPY, and they are distinguished by their operands. Deviation from the names in the Reference Manual is occasionally necessary, but it makes life painful for maintainers when they have to search for what we've called an instruction they want to use. > >> 2. sve `cpy` 's imm8 is an **int** , while `fcpy` 's imm8 is an **fp8** . > > Yes, that's right. > >> While some encoding code can be reused, separating the encodings makes the code clearer. > > I don't agree that it makes the code clearer. In fact, tight factoring emphasizes the fact that these instructions are similar, and explicitly shows where they are different. > > It is true that I have a strong bias against copy-and-paste programming. > >> I think both implementations are fine. If you think it's better to not refactor, I'll revert. > > I do. Thank you. 
Thanks for your review @theRealAph @eme64

-------------

PR Comment: https://git.openjdk.org/jdk/pull/26951#issuecomment-3257481227

From duke at openjdk.org Fri Sep 5 08:21:13 2025
From: duke at openjdk.org (duke)
Date: Fri, 5 Sep 2025 08:21:13 GMT
Subject: RFR: 8365911: AArch64: Fix encoding error in sve_cpy for negative floats [v3]
In-Reply-To:
References:
Message-ID:

On Wed, 3 Sep 2025 10:02:24 GMT, erifan wrote:

>> The sve_cpy instruction is not correctly implemented for negative floating-point values. The issues include:
>>
>> 1. When a negative floating-point number (e.g. `-1.0`) is passed, the `checked_cast(pack(d))` check fails. For example, assume `d = -1.0`:
>> - `pack(-1.0)` returns an unsigned int with the 7th bit set, i.e., `0xf0`.
>> - `checked_cast(0xf0)` casts `0xf0` to an int8_t value, which is `-16`.
>> - Casting this int8_t `-16` back to unsigned int results in `0xfffffff0`.
>> - The check compares `0xf0` to `0xfffffff0`, which obviously fails.
>>
>> 2. Additionally, the encoding of the negative floating-point number is incorrect:
>> - The imm8 field can fall outside the valid range of **[-128, 127]**.
>> - Bit **13** should be encoded as **0** for floating-point numbers.
>>
>> This PR fixes these issues and renames floating-point `sve_cpy` as `sve_fcpy`.
>>
>> Some test cases are added to aarch64-asmtest.py, and all tests passed.
>
> erifan has updated the pull request incrementally with one additional commit since the last revision:
>
> Code style fixes

@erifan
Your change (at version 66ba6570fd3a6f1a8faa794ed019e7aa768ac38e) is now ready to be sponsored by a Committer.
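The cast round trip described in the quoted PR text can be reproduced in isolation. The sketch below is a hypothetical stand-in for a checked narrowing (`round_trips` and `narrow_and_widen` are names made up here, not HotSpot functions), and the value `0xf0` for the packed `-1.0` is taken from the description rather than computed.

```cpp
#include <cstdint>

// Hypothetical stand-in for the check a checked narrowing cast performs:
// narrow the value, widen it back, and compare with the original.
template <typename To, typename From>
bool round_trips(From value) {
  To narrowed = static_cast<To>(value);
  return static_cast<From>(narrowed) == value;
}

// Narrows to int8_t and widens back to unsigned, as in the failing case.
unsigned narrow_and_widen(unsigned value) {
  return static_cast<unsigned>(static_cast<int8_t>(value));
}
```

With a 32-bit `unsigned`, `narrow_and_widen(0xf0u)` yields `0xfffffff0`, so `round_trips<int8_t>(0xf0u)` is false, while `round_trips<uint8_t>(0xf0u)` is true because the 8-bit pattern survives when it is treated as unsigned.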
-------------

PR Comment: https://git.openjdk.org/jdk/pull/26951#issuecomment-3257486195

From epeter at openjdk.org Fri Sep 5 08:50:20 2025
From: epeter at openjdk.org (Emanuel Peter)
Date: Fri, 5 Sep 2025 08:50:20 GMT
Subject: RFR: 8366845: C2 SuperWord: wrong VectorCast after VectorReinterpret with swapped src/dst type
In-Reply-To:
References:
Message-ID:

On Thu, 4 Sep 2025 15:39:20 GMT, Galder Zamarreño wrote:

>> I have seen 3 manifestations of this bug:
>>
>> 1. assert
>>
>> # Internal Error (.../src/hotspot/cpu/x86/x86.ad:7640), pid=84140, tid=28419
>> # assert(UseAVX > 2 && VM_Version::supports_avx512dq()) failed: require
>>
>>
>> 2. assert
>>
>> # Internal Error (.../src/hotspot/share/opto/vectornode.cpp:1601), pid=4022154, tid=4022168
>> # Error: assert(bt == T_FLOAT) failed
>>
>>
>> 3. Wrong result
>> When the feature was available but we used the wrong CastVector
>>
>> It seems that [JDK-8346236](https://bugs.openjdk.org/browse/JDK-8346236) introduced reinterpret nodes to SuperWord:
>>
>>
>> } else if (VectorNode::is_reinterpret_opcode(opc)) {
>>   assert(first->req() == 2 && req() == 2, "only one input expected");
>>   const TypeVect* vt = TypeVect::make(bt, vlen);
>>   vn = new VectorReinterpretNode(in1, vt, in1->bottom_type()->is_vect());
>>
>>
>> Sadly, the `src` and `dst` types are swapped. For JDK25 [JDK-8346236](https://bugs.openjdk.org/browse/JDK-8346236) this had no bad effect yet, since we only cast between HF and short, which are both based on short.
>>
>> But with [JDK-8329077](https://bugs.openjdk.org/browse/JDK-8329077) we can now do reinterpret between I/F and between D/L. Here swapping has an effect, especially if it is followed by a cast:
>> The cast determines its input type from the output type of the input node. If that was a reinterpret node with the wrong output type, **we would get a cast with the wrong src type**. We might do a double -> int cast instead of a long -> int cast. That leads to all sorts of issues.
>> >> The fuzzer test was only just recently added with [JDK-8324751](https://bugs.openjdk.org/browse/JDK-8324751). It uses MemorySegment, where unaligned float/double access gets handled with long/int memory access and then reinterpret (eg `MoveI2F`). But I was able to find examples that just work with `Float.intBitsToFloat` etc. > > Makes sense @eme64. Happy with the fix and tests :) @galderz @iwanowww @TobiHartmann Thanks for the reviews! ------------- PR Comment: https://git.openjdk.org/jdk/pull/27100#issuecomment-3257571484 From epeter at openjdk.org Fri Sep 5 08:50:22 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 5 Sep 2025 08:50:22 GMT Subject: Integrated: 8366845: C2 SuperWord: wrong VectorCast after VectorReinterpret with swapped src/dst type In-Reply-To: References: Message-ID: On Thu, 4 Sep 2025 14:42:46 GMT, Emanuel Peter wrote: > I have seen 3 manifestations of this bug: > > 1. assert > > # Internal Error (.../src/hotspot/cpu/x86/x86.ad:7640), pid=84140, tid=28419 > # assert(UseAVX > 2 && VM_Version::supports_avx512dq()) failed: require > > > 2. assert > > # Internal Error (.../src/hotspot/share/opto/vectornode.cpp:1601), pid=4022154, tid=4022168 > # Error: assert(bt == T_FLOAT) failed > > > 3. Wrong result > When the feature was available but we used the wrong CastVector > > It seems that [JDK-8346236](https://bugs.openjdk.org/browse/JDK-8346236) introduced reinterpret nodes to SuperWord: > > > } else if (VectorNode::is_reinterpret_opcode(opc)) { > assert(first->req() == 2 && req() == 2, "only one input expected"); > const TypeVect* vt = TypeVect::make(bt, vlen); > vn = new VectorReinterpretNode(in1, vt, in1->bottom_type()->is_vect()); > > > Sadly, the `src` and `dst` type are swapped. For JDK25 [JDK-8346236](https://bugs.openjdk.org/browse/JDK-8346236) this had no bad effect yet, since we only cast between HF and short, which are both based on short. 
> > But with [JDK-8329077](https://bugs.openjdk.org/browse/JDK-8329077) we can now do reinterpret between I/F and between D/L. Here swapping has an effect, especially if it is followed by a cast: > The cast deterines its input type from the output type of the input node. If that was a reinterpret node with the wrong output type, **we would get a cast with the wrong src type**. We might do a double -> int cast instead of a long -> int cast. That leads to all sorts of issues. > > The fuzzer test was only just recently added with [JDK-8324751](https://bugs.openjdk.org/browse/JDK-8324751). It uses MemorySegment, where unaligned float/double access gets handled with long/int memory access and then reinterpret (eg `MoveI2F`). But I was able to find examples that just work with `Float.intBitsToFloat` etc. This pull request has now been integrated. Changeset: e6fa8aae Author: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/e6fa8aae6168ea5a8579cd0a38209ca71c32e704 Stats: 226 lines in 2 files changed: 225 ins; 0 del; 1 mod 8366845: C2 SuperWord: wrong VectorCast after VectorReinterpret with swapped src/dst type Reviewed-by: thartmann, galder, vlivanov ------------- PR: https://git.openjdk.org/jdk/pull/27100 From epeter at openjdk.org Fri Sep 5 09:03:21 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 5 Sep 2025 09:03:21 GMT Subject: RFR: 8366845: C2 SuperWord: wrong VectorCast after VectorReinterpret with swapped src/dst type In-Reply-To: References: Message-ID: On Thu, 4 Sep 2025 15:39:20 GMT, Galder Zamarre?o wrote: >> I have seen 3 manifestations of this bug: >> >> 1. assert >> >> # Internal Error (.../src/hotspot/cpu/x86/x86.ad:7640), pid=84140, tid=28419 >> # assert(UseAVX > 2 && VM_Version::supports_avx512dq()) failed: require >> >> >> 2. assert >> >> # Internal Error (.../src/hotspot/share/opto/vectornode.cpp:1601), pid=4022154, tid=4022168 >> # Error: assert(bt == T_FLOAT) failed >> >> >> 3. 
Wrong result >> When the feature was available but we used the wrong CastVector >> >> It seems that [JDK-8346236](https://bugs.openjdk.org/browse/JDK-8346236) introduced reinterpret nodes to SuperWord: >> >> >> } else if (VectorNode::is_reinterpret_opcode(opc)) { >> assert(first->req() == 2 && req() == 2, "only one input expected"); >> const TypeVect* vt = TypeVect::make(bt, vlen); >> vn = new VectorReinterpretNode(in1, vt, in1->bottom_type()->is_vect()); >> >> >> Sadly, the `src` and `dst` type are swapped. For JDK25 [JDK-8346236](https://bugs.openjdk.org/browse/JDK-8346236) this had no bad effect yet, since we only cast between HF and short, which are both based on short. >> >> But with [JDK-8329077](https://bugs.openjdk.org/browse/JDK-8329077) we can now do reinterpret between I/F and between D/L. Here swapping has an effect, especially if it is followed by a cast: >> The cast deterines its input type from the output type of the input node. If that was a reinterpret node with the wrong output type, **we would get a cast with the wrong src type**. We might do a double -> int cast instead of a long -> int cast. That leads to all sorts of issues. >> >> The fuzzer test was only just recently added with [JDK-8324751](https://bugs.openjdk.org/browse/JDK-8324751). It uses MemorySegment, where unaligned float/double access gets handled with long/int memory access and then reinterpret (eg `MoveI2F`). But I was able to find examples that just work with `Float.intBitsToFloat` etc. > > Makes sense @eme64. 
Happy with the fix and tests :) @galderz @iwanowww @TobiHartmann FYI, I filed: [JDK-8366965](https://bugs.openjdk.org/browse/JDK-8366965) C2 SuperWord: add more tests for MoveF2I / Float.floatToRawIntBits and friends ------------- PR Comment: https://git.openjdk.org/jdk/pull/27100#issuecomment-3257611988 From duke at openjdk.org Fri Sep 5 09:21:18 2025 From: duke at openjdk.org (Yuri Gaevsky) Date: Fri, 5 Sep 2025 09:21:18 GMT Subject: RFR: 8322174: RISC-V: C2 VectorizedHashCode RVV Version [v8] In-Reply-To: References: <5e1o1xtN0ZdQZGJi2aVmgCEApW625koeE9F53VhDi5E=.2390045d-844e-4800-8d4b-075a2a3a8793@github.com> Message-ID: <0xBTvjLjmNJpFoSlXulP1kaiNo97ld-fYPqsfLBzZXQ=.0b0baf4d-2373-4ad2-8bc2-47b68cc8d24f@github.com> On Wed, 4 Jun 2025 06:04:46 GMT, Robbin Ehn wrote: >> As you can expect I am trying to implement the following code with RVV: >> >> for (; i + (N-1) < cnt; i += N) { >> h = 31^^N * h >> + 31^^(N-1) * val[i + 0] >> + 31^^(N-2) * val[i + 1] >> ... >> + 31^^1 * val[i + (N-2)] >> + 31^^0 * val[i + (N-1)]; >> } >> for (; i < cnt; i++) { >> h = 31 * h + val[i]; >> } >> >> where `N` is the number of array elements processed per "chunk". >> IIUC, the main issue with your approach is the "reverse" order of array elements versus preloaded `31^^X` coeffs WHEN the remaining number of elems is less than `N`, say `M=N-1`. >> >> h = 31^^M * h >> + 31^^(M-1) * val[i + 0] >> + 31^^(M-2) * val[i + 1] >> ... >> + 31^^1 * val[i + (M-2)] >> + 31^^0 * val[i + (M-1)]; >> >> or returning to our `N` for clarity >> >> h = 31^^(N-1) * h >> + 31^^(N-2) * val[i + 0] >> + 31^^(N-3) * val[i + 1] >> ... >> + 31^^1 * val[i + (N-3)] >> + 31^^0 * val[i + (N-2)]; >> >> Now we need to "slide down" the preloaded multiplier coeffs in the designated vector register by one (as `M=N-1`) to be in "sync" with `val[i + X]` (maybe move them into a temporary VR in the process), and moreover, DO this operation IFF the remaining `cnt` is less than `N` (==>an additional check on every iteration).
That's probably acceptable only at tail phase as a one-time operation but NOT inside of the main loop... > > @ygaevsky @RealFYang how can we proceed? Hi @robehn, could you please take a look at the latest updates? Thanks... ------------- PR Comment: https://git.openjdk.org/jdk/pull/17413#issuecomment-3257668418 From epeter at openjdk.org Fri Sep 5 09:38:09 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 5 Sep 2025 09:38:09 GMT Subject: RFR: 8366890: C2: Split through phi printing with TraceLoopOpts misses line break In-Reply-To: References: Message-ID: On Thu, 4 Sep 2025 12:44:43 GMT, Christian Hagedorn wrote: > [JDK-8356176](https://bugs.openjdk.org/browse/JDK-8356176) added new printing code for `TraceLoopOpts` when splitting nodes through a phi but missed a line break. This will result in: > > Split 974 CmpI through 1465 Phi in 953 RegionSplit 474 Bool through 1468 Phi in 953 RegionSplit-If > > instead of > > Split 974 CmpI through 1465 Phi in 953 RegionSplit 474 Bool through 1468 Phi in 953 Region > Split-If > > This patch fixes this. > > Thanks, > Christian Marked as reviewed by epeter (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/27092#pullrequestreview-3188663001 From epeter at openjdk.org Fri Sep 5 09:42:10 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 5 Sep 2025 09:42:10 GMT Subject: RFR: 8361699: C2: assert(can_reduce_phi(n->as_Phi())) failed: Sanity: previous reducible Phi is no longer reducible before SUT. In-Reply-To: References: Message-ID: On Wed, 3 Sep 2025 00:53:59 GMT, Cesar Soares Lucas wrote: > Please, review this patch to fix an issue that may occur when reducing allocation merge. > > As the assert message describes, the problem is a `Phi` considered reducible during one invocation of `adjust_scalar_replaceable_state` turned out to be later non-reducible.
This situation can happen if a subsequent invocation of the same method causes all inputs to the phi to be NSR; therefore there is no point in reducing the Phi. It can also happen during the propagation of NSR state done by `find_scalar_replaceable_allocs`. > > The change in `revisit_reducible_phi_status` is just a clean-up. > The real fix is in `find_scalar_replaceable_allocs`. > > Tested on Linux x64/Aarch64 release/fastdebug with JTREG tier1-3. test/hotspot/jtreg/compiler/escapeAnalysis/TestReduceAllocationNotReducibleAnymore.java line 58: > 56: } > 57: } > 58: } Could we make the catch exception matching more precise? I'd just like to avoid a case where we miscompile and throw the wrong exception and that gets caught silently. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27063#discussion_r2324608898 From dlong at openjdk.org Fri Sep 5 09:48:12 2025 From: dlong at openjdk.org (Dean Long) Date: Fri, 5 Sep 2025 09:48:12 GMT Subject: RFR: 8360031: C2 compilation asserts in MemBarNode::remove [v2] In-Reply-To: <5CGrcWjFZ7Zqj_Tm0LO6Tqg9cUA-xxvcaa2J-yWW8BE=.af4dea7c-e39d-491d-b924-c89fa82e757a@github.com> References: <5CGrcWjFZ7Zqj_Tm0LO6Tqg9cUA-xxvcaa2J-yWW8BE=.af4dea7c-e39d-491d-b924-c89fa82e757a@github.com> Message-ID: On Thu, 14 Aug 2025 10:54:08 GMT, Damon Fenacci wrote: >> # Issue >> While compiling `java.util.zip.ZipFile` in C2 this assert is triggered >> https://github.com/openjdk/jdk/blob/a2e86ff3c56209a14c6e9730781eecd12c81d170/src/hotspot/share/opto/memnode.cpp#L4235 >> >> # Cause >> While compiling the constructor of java.util.zip.ZipFile$CleanableResource the following happens: >> * we insert a trailing `MemBarStoreStore` in the constructor >> before_folding >> >> * during IGVN we completely fold the memory subtree of the `MemBarStoreStore` node. The node still has a control output attached. 
>> after_folding >> >> * later during the same IGVN run the `MemBarStoreStore` node is handled and we try to remove it (because the `Allocate` node of the `MemBar` is not escaping the thread) https://github.com/openjdk/jdk/blob/7b7136b4eca15693cfcd46ae63d644efc8a88d2c/src/hotspot/share/opto/memnode.cpp#L4301-L4302 >> * the assert https://github.com/openjdk/jdk/blob/7b7136b4eca15693cfcd46ae63d644efc8a88d2c/src/hotspot/share/opto/memnode.cpp#L4235 >> triggers because the barrier has only 1 (control) output and is a `MemBarStoreStore` (not `Initialize`) barrier >> >> The issue happens only when the `UseStoreStoreForCtor` is set (default as well), which makes C2 use `MemBarStoreStore` instead of `MemBarRelease` at the end of constructors. `MemBarStoreStore` are processed separately by EA and this happens after the IGVN pass that folds the memory subtree. `MemBarRelease` on the other hand are handled during the same IGVN pass before the memory subtree gets removed and it's still got 2 outputs (assert skipped). >> >> # Fix >> Adapting the assert to accept that `MemBarStoreStore` can also have `!= 2` outputs (when `+UseStoreStoreForCtor` is used) seems to be an OK solution as this seems like a perfectly plausible situation. >> >> # Testing >> Unfortunately reproducing the issue with a simple regression test has proven very hard. The test seems to rely on very peculiar profiling and IGVN worklist sequence. JBS replay compilation passes. Running JCK's `api/java_util` 100 times triggers the assert a couple of times on average before the fix, none after. >> Tier 1-3+ tests passed. > > Damon Fenacci has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains six additional commits since the last revision: > > - Merge branch 'master' into JDK-8360031 > - JDK-8360031: update assert message > - Merge branch 'master' into JDK-8360031 > - JDK-8360031: remove unnecessary include > - JDK-8360031: remove UseNewCode > - JDK-8360031: compilation asserts in MemBarNode::remove I stepped through the crash with the replay file, and I'm not convinced that the problem is only with MemBarStoreStore and not MemBarRelease. What happens in the replay crash is the MemBarStoreStore gets onto the worklist through an indirect route in ConnectionGraph::split_unique_types() because of its memory edge. I think this explains why it is intermittent and hard to reproduce. A MemBarRelease on the other hand would get added to the worklist directly in compute_escape() if it has a Precedent edge. The different handling of MemBarStoreStore vs MemBarRelease in this code is confusing. The MemBarRelease code came from JDK-6934604. It adds the node to the worklist, and lets MemBarNode::Ideal remove it based on does_not_escape_thread() on the alloc node. Contrast that with the MemBarStoreStore handling, which came from JDK-7121140, and instead of removing the node, it replaces it with a MemBarCPUOrder based on not_global_escape() on the alloc node. This MemBarStoreStore handling is for "MemBarStoreStore nodes added in library_call.cpp" and seems to fail to work for MemBarStoreStore nodes added in the ctor, which means MemBarStoreStore nodes added in the ctor only get on the worklist by accident, as mentioned above. I think the conservative fix is to have compute_escape() always add the MemBarStoreStore to the worklist if it has a Precedent edge. Because of StressIGVN randomizing the worklist, I think the outcnt() can be 1 for either MemBarStoreStore or MemBarRelease, so we should relax the assert accordingly. I'm not sure how useful the assert will be after that. It might be better to remove it. 
Longer-term, it might be nice to get rid of the separate handling of "MemBarStoreStore nodes added in library_call.cpp" if the MemBarCPUOrder is not really needed. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26556#issuecomment-3257743488 From epeter at openjdk.org Fri Sep 5 09:49:10 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 5 Sep 2025 09:49:10 GMT Subject: RFR: 8361699: C2: assert(can_reduce_phi(n->as_Phi())) failed: Sanity: previous reducible Phi is no longer reducible before SUT. In-Reply-To: References: Message-ID: On Wed, 3 Sep 2025 00:53:59 GMT, Cesar Soares Lucas wrote: > Please, review this patch to fix an issue that may occur when reducing allocation merge. > > As the assert message describes, the problem is a `Phi` considered reducible during one invocation of `adjust_scalar_replaceable_state` turned out to be later non-reducible. This situation can happen if a subsequent invocation of the same method causes all inputs to the phi to be NSR; therefore there is no point in reducing the Phi. It can also happen during the propagation of NSR state done by `find_scalar_replaceable_allocs`. > > The change in `revisit_reducible_phi_status` is just a clean-up. > The real fix is in `find_scalar_replaceable_allocs`. > > Tested on Linux x64/Aarch64 release/fastdebug with JTREG tier1-3. Just a drive-by comment. You also have a broken title ;) src/hotspot/share/opto/escape.cpp line 3078: > 3076: Node* phi = reducible_merges.at(i); > 3077: > 3078: if (!can_reduce_phi(phi->as_Phi())) { You say this is a pure cleanup? There are some slight differences in the code though, right? This method call checks `PhaseMacroExpand::can_eliminate_allocation`, and has a side effect with `ptn->set_scalar_replaceable(false)`. Just pointing it out, not an EA expert.
------------- PR Review: https://git.openjdk.org/jdk/pull/27063#pullrequestreview-3188687569 PR Review Comment: https://git.openjdk.org/jdk/pull/27063#discussion_r2324618370 From shade at openjdk.org Fri Sep 5 10:16:12 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 5 Sep 2025 10:16:12 GMT Subject: RFR: 8366588: VectorAPI: Re-intrinsify VectorMask.laneIsSet where the input index is a variable In-Reply-To: References: Message-ID: On Fri, 5 Sep 2025 08:13:28 GMT, erifan wrote: > Intrinsic support for `VectorMask.laneIsSet` with a **variable** input index was introduced in PR #14200, but was inadvertently broken by PR #25673. This PR restores the intrinsic functionality and adds some JTReg tests. > > Benchmarks on Nvidia Grace machine with 128-bit SVE: > > Benchmark Unit Before Score Error After Score Error Uplift > microMaskLaneIsSetByte128_var ops/ms 21702.14415 91.902159 103472.9391 36.057447 4.767867 > microMaskLaneIsSetByte64_var ops/ms 21468.51868 107.94177 103365.6561 69.47736 4.814754 > microMaskLaneIsSetDouble128_var ops/ms 77489.32791 153.242699 413499.4127 311.854079 5.336211 > microMaskLaneIsSetFloat128_var ops/ms 41034.95204 399.421823 206840.0988 74.702234 5.040583 > microMaskLaneIsSetFloat64_var ops/ms 77607.40268 175.938921 413745.3001 149.716794 5.33126 > microMaskLaneIsSetInt128_var ops/ms 41452.48893 76.143208 206845.9754 59.371129 4.989953 > microMaskLaneIsSetInt64_var ops/ms 77726.2542 173.180518 413427.8838 363.575023 5.319024 > microMaskLaneIsSetLong128_var ops/ms 77646.11218 177.496587 413403.4404 236.609314 5.3242 > microMaskLaneIsSetShort128_var ops/ms 21374.93265 48.13101 103417.4618 34.827021 4.838259 > microMaskLaneIsSetShort64_var ops/ms 41066.19395 353.320621 206801.109 106.408938 5.035799 > > > Benchmarks on Intel 6444y machine with 512-bit avx3: > > Benchmark Unit Before Score Error After Score Error Uplift > microMaskLaneIsSetByte128_var ops/ms 57658.45497 240.209309 211643.8406 29.214532 3.670647 > 
microMaskLaneIsSetByte256_var ops/ms 57451.68169 116.994128 211609.4652 160.48513 3.683259 > microMaskLaneIsSetByte512_var ops/ms 57530.22411 311.63868 199802.8084 408.144015 3.473005 > microMaskLaneIsSetByte64_var ops/ms 57642.2672 161.406221 205252.4464 196.86852 3.560797 > microMaskLaneIsSetDouble256_var ops/ms 114401.3789 231.797375 361400.344 565.593984 3.159055 > microMaskLaneIsSetDouble512_var ops/ms 57379.27882 159.699503 211476.1138 136.980026 3.685583 > microMaskLaneIsSetFloat128_var ops/ms 113943.9512 141.062663 360855.3915 494.471996 3.166955 > microMaskLaneIsSetFloat256_var ops/ms 57682.78182 138.142053 211659.5098 30.167972 3.66937 > microMaskLaneIsSetFloat512_var ops/ms 57617.66405 301.748599 211246.8588 597.18949 3.666355 > microMaskLaneIsSetInt128_var ops/ms 113914.5062 118.681382 360856.4465 555.097397 3.167783 > microMaskLaneIsSetInt256_var ops/ms 57681.79883 112.391639 211555.6742 217.556981 3.667633 > microMaskLaneIsSetInt512_var ops/ms 57350.20346 206.146723 211657.7207 68.461571 3.690618 > microMaskLane... This looks fine to me. I took another look at [JDK-8358749](https://bugs.openjdk.org/browse/JDK-8358749), and I think this is the only place where we can really accept the non-constant input. In all other cases, we either pull `is_con()` or `const_oop()` out of the input. I think we will bikeshed about the tests a bit. test/micro/org/openjdk/bench/jdk/incubator/vector/VectorExtractBenchmark.java line 34: > 32: @Warmup(iterations = 5, time = 1) > 33: @Measurement(iterations = 5, time = 1) > 34: @Fork(value = 1, jvmArgs = {"--add-modules=jdk.incubator.vector"}) Don't do 1 fork, do at least 3. ------------- Marked as reviewed by shade (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/27113#pullrequestreview-3188769547 PR Review Comment: https://git.openjdk.org/jdk/pull/27113#discussion_r2324679427 From epeter at openjdk.org Fri Sep 5 10:52:16 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 5 Sep 2025 10:52:16 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation [v5] In-Reply-To: References: Message-ID: On Mon, 4 Aug 2025 02:31:08 GMT, Xiaohong Gong wrote: >> This is a follow-up patch of [1], which aims at implementing the subword gather load APIs for AArch64 SVE platform. >> >> ### Background >> Vector gather load APIs load values from memory addresses calculated by adding a base pointer to integer indices. SVE provides native gather load instructions for `byte`/`short` types using `int` vectors for indices. The vector size for a gather-load instruction is determined by the index vector (i.e. `int` elements). Hence, the total size is `32 * elem_num` bits, where `elem_num` is the number of loaded elements in the vector register. >> >> ### Implementation >> >> #### Challenges >> Due to size differences between `int` indices (32-bit) and `byte`/`short` data (8/16-bit), operations must be split across multiple vector registers based on the target SVE vector register size constraints. >> >> For a 512-bit SVE machine, loading a `byte` vector with different vector species require different approaches: >> - SPECIES_64: Single operation with mask (8 elements, 256-bit) >> - SPECIES_128: Single operation, full register (16 elements, 512-bit) >> - SPECIES_256: Two operations + merge (32 elements, 1024-bit) >> - SPECIES_512/MAX: Four operations + merge (64 elements, 2048-bit) >> >> Use `ByteVector.SPECIES_512` as an example: >> - It contains 64 elements. So the index vector size should be `64 * 32` bits, which is 4 times of the SVE vector register size. >> - It requires 4 times of vector gather-loads to finish the whole operation. 
>> >> >> byte[] arr = [a, a, a, a, ..., a, b, b, b, b, ..., b, c, c, c, c, ..., c, d, d, d, d, ..., d, ...] >> int[] idx = [0, 1, 2, 3, ..., 63, ...] >> >> 4 gather-load: >> idx_v1 = [15 14 13 ... 1 0] gather_v1 = [... 0000 0000 0000 0000 aaaa aaaa aaaa aaaa] >> idx_v2 = [31 30 29 ... 17 16] gather_v2 = [... 0000 0000 0000 0000 bbbb bbbb bbbb bbbb] >> idx_v3 = [47 46 45 ... 33 32] gather_v3 = [... 0000 0000 0000 0000 cccc cccc cccc cccc] >> idx_v4 = [63 62 61 ... 49 48] gather_v4 = [... 0000 0000 0000 0000 dddd dddd dddd dddd] >> merge: v = [dddd dddd dddd dddd cccc cccc cccc cccc bbbb bbbb bbbb bbbb aaaa aaaa aaaa aaaa] >> >> >> #### Solution >> The implementation simplifies backend complexity by defining each gather load IR to handle one vector gather-load operation, with multiple IRs generated in the compiler mid-end. >> >> Here are the main changes: >> - Enhanced IR generation with architecture-specific patterns based on `gather_scatter_needs_vector_index()` matcher. >> - Added `VectorSliceNode` for result mer... > > Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains six commits: > > - Merge 'jdk:master' into JDK-8351623-sve > - Address review comments > - Refine IR pattern and clean backend rules > - Fix indentation issue and move the helper matcher method to header files > - Merge branch jdk:master into JDK-8351623-sve > - 8351623: VectorAPI: Add SVE implementation of subword gather load operation Looks very interesting. I have a first series of questions / comments :) There is definitely a tradeoff between complexity in the backend and in the C2 IR. So I'm still trying to wrap my head around that decision. I'm just afraid that adding more very specific C2 IR nodes makes it more complicated to do optimizations in the C2 IR. src/hotspot/cpu/aarch64/aarch64_vector.ad line 6008: > 6006: // predicate and place in elements of twice their size within > 6007: // the destination predicate.
> 6008: Suggestion: unnecessary empty line src/hotspot/share/opto/vectornode.hpp line 1123: > 1121: // The basic type of memory, which might be different with the vector element type > 1122: // when it is a subword type loading. > 1123: BasicType _mem_bt; Can you make an example and add it to the comment? Can you please also add some comment at the node about what we expect the index map to be? What basic type does it have? src/hotspot/share/opto/vectornode.hpp line 1769: > 1767: // dst = [h g f e d c b a] > 1768: // > 1769: class VectorConcatenateNode : public VectorNode { That semantic is not quite what I would expect from `Concatenate`. Maybe we can call it something else? `VectorConcatenateAndNarrowNode`? src/hotspot/share/opto/vectornode.hpp line 1774: > 1772: : VectorNode(vec1, vec2, vt) { > 1773: assert(type2aelembytes(vec1->bottom_type()->is_vect()->element_basic_type()) == > 1774: type2aelembytes(vt->element_basic_type()) * 2, "must be half size"); What about asserting that `vec1` and `vec2` have the same `vect`? src/hotspot/share/opto/vectornode.hpp line 1841: > 1839: > 1840: // Unpack the elements to twice size. > 1841: class VectorMaskWidenNode : public VectorNode { Can you add a visual example like above for `VectorConcatenateNode`, please? 
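The per-species operation counts in the quoted PR description follow from the index width: every gathered element needs one 32-bit `int` index, so a species with `n` lanes needs `32 * n` bits of index vector, split across registers of the machine's vector width. The following is only a minimal Java sketch of that arithmetic, assuming the 512-bit SVE register from the quoted example; the class `GatherOpCount` and method `gatherOps` are hypothetical names for illustration and are not part of the patch.

```java
public class GatherOpCount {
    // Hypothetical helper: how many int-indexed gather-loads are needed to
    // fill `lanes` subword elements on a machine with `registerBits`-wide
    // vector registers. Each lane consumes one 32-bit index.
    static int gatherOps(int lanes, int registerBits) {
        int indexBits = lanes * Integer.SIZE;                   // 32 bits per index
        return (indexBits + registerBits - 1) / registerBits;   // round up
    }

    public static void main(String[] args) {
        int registerBits = 512;                 // assumption: 512-bit SVE
        int[] byteLanes = {8, 16, 32, 64};      // byte SPECIES_64/128/256/512
        for (int lanes : byteLanes) {
            System.out.println(lanes + " byte lanes -> "
                    + gatherOps(lanes, registerBits) + " gather-load(s)");
        }
    }
}
```

With these assumptions the four byte species need 1, 1, 2 and 4 gather-loads, matching the list in the PR description; the 8-lane case only fills half a register, which is why the description mentions a mask for it.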
------------- PR Review: https://git.openjdk.org/jdk/pull/26236#pullrequestreview-3188813972 PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2324710079 PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2324736345 PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2324740007 PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2324741462 PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2324744990 From epeter at openjdk.org Fri Sep 5 10:52:17 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 5 Sep 2025 10:52:17 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation [v3] In-Reply-To: References: <1XFXtkTlDshGtoxEdLVg0f2J2rtn4wz7CdUB9pb9N2g=.25e7e0b5-8468-4d91-adb9-c459bda40933@github.com> Message-ID: On Fri, 1 Aug 2025 01:48:51 GMT, Xiaohong Gong wrote: >> src/hotspot/cpu/arm/matcher_arm.hpp line 160: >> >>> 158: static const bool supports_encode_ascii_array = false; >>> 159: >>> 160: // Return true if vector gather-load/scatter-store needs vector index as input. >> >> If the function returns `false`, does it indicate one of the following cases? >> - Vector gather-load or scatter-store does not accept a vector index for the current use case on this platform. >> - The current platform does not support vector gather-load or scatter-store at all. > > Yes, I think so. To me a `false` means this: If we support gather/scatter, then we do not need a vector index, we can do without it. Is that correct? But that would contradict @fg1417 's statement: If we support gather/scatter, then we do not permit a vector index. Can you clarify?
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2324726476 From epeter at openjdk.org Fri Sep 5 10:52:18 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 5 Sep 2025 10:52:18 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation [v5] In-Reply-To: References: Message-ID: On Fri, 5 Sep 2025 10:37:39 GMT, Emanuel Peter wrote: >> Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains six commits: >> >> - Merge 'jdk:master' into JDK-8351623-sve >> - Address review comments >> - Refine IR pattern and clean backend rules >> - Fix indentation issue and move the helper matcher method to header files >> - Merge branch jdk:master into JDK-8351623-sve >> - 8351623: VectorAPI: Add SVE implementation of subword gather load operation > > src/hotspot/share/opto/vectornode.hpp line 1123: > >> 1121: // The basic type of memory, which might be different with the vector element type >> 1122: // when it is a subword type loading. >> 1123: BasicType _mem_bt; > > Can you make an example and add it to the comment? > Can you please also add some comment at the node about what we expect the index map to be? What basic type does it have? Same for the scatter. > src/hotspot/share/opto/vectornode.hpp line 1769: > >> 1767: // dst = [h g f e d c b a] >> 1768: // >> 1769: class VectorConcatenateNode : public VectorNode { > > That semantic is not quite what I would expect from `Concatenate`. Maybe we can call it something else? > `VectorConcatenateAndNarrowNode`? Have you considered using `2x Cast + Concatenate` instead, and just matching that in the backend? I don't remember how to do the mere Concat, but it should be possible via the `unslice` or some other operation that concatenates two vectors. 
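To make the naming question concrete, here is a scalar Java model of what a "concatenate and narrow" operation would compute. It is a sketch under assumptions, not the node's actual definition: the quoted header only shows the result `dst = [h g f e d c b a]` and the assert that the input element size is twice the output element size, so the lane order (first input in the low lanes) and the choice of `int` to `short` are assumptions, and `ConcatenateAndNarrow` is a hypothetical name.

```java
import java.util.Arrays;

public class ConcatenateAndNarrow {
    // Scalar model: narrow every int lane of both inputs to short and place
    // vec1 in the low lanes and vec2 in the high lanes of the result.
    // Assumption: both inputs have the same number of lanes.
    static short[] concatenateAndNarrow(int[] vec1, int[] vec2) {
        if (vec1.length != vec2.length) {
            throw new IllegalArgumentException("inputs must have the same lane count");
        }
        short[] dst = new short[vec1.length + vec2.length];
        for (int i = 0; i < vec1.length; i++) {
            dst[i] = (short) vec1[i];                 // narrow lane of first input
            dst[vec1.length + i] = (short) vec2[i];   // narrow lane of second input
        }
        return dst;
    }

    public static void main(String[] args) {
        int[] lo = {1, 2, 3, 4};
        int[] hi = {5, 6, 7, 8};
        System.out.println(Arrays.toString(concatenateAndNarrow(lo, hi)));
    }
}
```

In this model the result has the same total bit width as each input, which is the property that distinguishes it from a plain concatenation; a `2x Cast + Concatenate` formulation would compute the same values in two separate steps.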
> src/hotspot/share/opto/vectornode.hpp line 1774: > >> 1772: : VectorNode(vec1, vec2, vt) { >> 1773: assert(type2aelembytes(vec1->bottom_type()->is_vect()->element_basic_type()) == >> 1774: type2aelembytes(vt->element_basic_type()) * 2, "must be half size"); > > What about asserting that `vec1` and `vec2` have the same `vect`? And what about the vector length being consistent between `vec1`, `vec2` and `vt`? > src/hotspot/share/opto/vectornode.hpp line 1841: > >> 1839: >> 1840: // Unpack the elements to twice size. >> 1841: class VectorMaskWidenNode : public VectorNode { > > Can you add a visual example like above for `VectorConcatenateNode`, please? Did you consider the alternative of `Extract` + `Cast`? Not sure if that would be better, you know more about the code complexity. It would just allow us to have one fewer nodes. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2324737096 PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2324754984 PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2324742727 PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2324748722 From shade at openjdk.org Fri Sep 5 11:45:20 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 5 Sep 2025 11:45:20 GMT Subject: RFR: 8357258: x86: Improve receiver type profiling reliability Message-ID: See the bug for discussion what issues current machinery has. This PR executes the plan outlined in the bug: 1. Common the receiver type profiling code in interpreter and C1 2. Rewrite receiver type profiling code to only do atomic receiver slot installations 3. Trim `C1OptimizeVirtualCallProfiling` to only claim slots when receiver is installed This PR does _not_ do atomic counter updates themselves, as it may have much wider performance implications, including regressions. This PR should be at least performance neutral. 
Additional testing: - [x] Linux x86_64 server fastdebug, `compiler/` - [ ] Linux x86_64 server fastdebug, `all` ------------- Commit messages: - Drop atomic counters - Initial version Changes: https://git.openjdk.org/jdk/pull/25305/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=25305&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8357258 Stats: 350 lines in 7 files changed: 135 ins; 196 del; 19 mod Patch: https://git.openjdk.org/jdk/pull/25305.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25305/head:pull/25305 PR: https://git.openjdk.org/jdk/pull/25305 From shade at openjdk.org Fri Sep 5 11:45:20 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 5 Sep 2025 11:45:20 GMT Subject: RFR: 8357258: x86: Improve receiver type profiling reliability In-Reply-To: References: Message-ID: <2TI8gwmSmpcW8-UCscGYU_5qijJhfmmetVox0yDDkOU=.2bf74836-67c3-4b21-92b8-1780f2e03582@github.com> On Mon, 19 May 2025 14:59:36 GMT, Aleksey Shipilev wrote: > See the bug for discussion what issues current machinery has. > > This PR executes the plan outlined in the bug: > 1. Common the receiver type profiling code in interpreter and C1 > 2. Rewrite receiver type profiling code to only do atomic receiver slot installations > 3. Trim `C1OptimizeVirtualCallProfiling` to only claim slots when receiver is installed > > This PR does _not_ do atomic counter updates themselves, as it may have much wider performance implications, including regressions. This PR should be at least performance neutral. > > Additional testing: > - [x] Linux x86_64 server fastdebug, `compiler/` > - [ ] Linux x86_64 server fastdebug, `all` In addition to reliability improvements, doing a denser loop allows to significantly optimize tier3 code density. With larger `TypeProfileWidth`, type profile checks are the significant part of generated code. This density improvement allows us to do the CAS without increasing the code size. 
It also allows us to store (more) tier3 code in AOTCache going forward. If/when folks (looking at @theRealAph, really) start doing probabilistic profiling counters, this budget increase would also help to cram in more code. $ for I in 1 2 3 4; do build/linux-x86_64-server-release/images/jdk/bin/java -XX:TieredStopAtLevel=${I} \ -Xcomp -XX:+CITime -Xmx2g Hello.java 2>&1 | grep "Tier${I}" | cut -d' ' -f 3,23-; done === -XX:TypeProfileWidth=2 (default) # Baseline Tier1 nmethods_code_size: 7091616 bytes Tier2 nmethods_code_size: 7579424 bytes Tier3 nmethods_code_size: 17494984 bytes Tier4 nmethods_code_size: 6058128 bytes # Patched Tier1 nmethods_code_size: 7091648 bytes Tier2 nmethods_code_size: 7581808 bytes Tier3 nmethods_code_size: 16806440 bytes (-4.1%) Tier4 nmethods_code_size: 6057920 bytes === -XX:TypeProfileWidth=8 (default with +UseJVMCICompiler) # Baseline Tier1 nmethods_code_size: 7091672 bytes Tier2 nmethods_code_size: 7580576 bytes Tier3 nmethods_code_size: 28096448 bytes Tier4 nmethods_code_size: 6061280 bytes # Patched Tier1 nmethods_code_size: 7090760 bytes Tier2 nmethods_code_size: 7579432 bytes Tier3 nmethods_code_size: 16837688 bytes (-66.7% !!!) Tier4 nmethods_code_size: 6058104 bytes ------------- PR Comment: https://git.openjdk.org/jdk/pull/25305#issuecomment-3258049226 From djelinski at openjdk.org Fri Sep 5 13:37:41 2025 From: djelinski at openjdk.org (Daniel =?UTF-8?B?SmVsacWEc2tp?=) Date: Fri, 5 Sep 2025 13:37:41 GMT Subject: RFR: 8366971: C2: Remove unused nop_list from PhaseOutput::init_buffer Message-ID: <2hBEO9Zpoy2wo_pgTXE9v8KG5u1HNdKp3RgQE-4HYcE=.e86088d1-1e25-49b5-9b3c-c2498ec6ca48@github.com> The nop list has never been used in the history of OpenJDK. Let's clean it up. Tested with Mach5 tier 1-5, no related failures. 
------------- Commit messages: - Update copyright - Remove outdated comment - Remove nop list Changes: https://git.openjdk.org/jdk/pull/27117/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27117&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8366971 Stats: 83 lines in 11 files changed: 1 ins; 77 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/27117.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27117/head:pull/27117 PR: https://git.openjdk.org/jdk/pull/27117 From epeter at openjdk.org Fri Sep 5 13:51:21 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 5 Sep 2025 13:51:21 GMT Subject: RFR: 8366702: C2 SuperWord: refactor VTransform vector nodes Message-ID: I'm working on cost-modeling, and am integrating some smaller changes from this proof-of-concept PR: https://github.com/openjdk/jdk/pull/20964 [See plan overview.](https://bugs.openjdk.org/browse/JDK-8340093) This is a pure refactoring - no change in behaviour. I'm presenting it like this because it will make reviews easier. --------------------------------- I have to say: I'm very sorry for this refactoring. I took some decisions in https://github.com/openjdk/jdk/pull/19719 that I'm now partially undoing. I moved too much logic from `SuperWord::output` (now called `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`) to the `VTransform...Node::apply`. https://github.com/openjdk/jdk/pull/19719 was a roughly 1.5k line change, and I took about a 0.3k misstep that I'm now correcting here ;) I had accidentally made the `VTransformGraph` too close to the `PackSet`, and not close enough to the future vectorized C2 Graph. And that makes some future changes hard. My vision: - VLoop / VLoopAnalyzer look at the scalar loop and prepare it for SuperWord - SuperWord creates the `PackSet`: some nodes are packed, all others are scalar.
- `SuperWordVTransformBuilder` converts the `PackSet` into the `VTransformGraph` - The `VTransformGraph` very closely represents the C2 vectorized loop after vectorization - It does not need to know which `nodes` it packs, it rather just needs to know how to generate the new vector nodes - That means it is straight-forward to compute cost - And it also makes optimizations on that graph easier - And the `apply` methods are simpler too ---------------------------------- So therefore, the main goal was to make the `VTransform...Node::apply` calls simpler again. And move the logic back to `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`. One important step to making the the `VTransformGraph` less of a `PackSet` is to remove reliance on `nodes` for the vector nodes. What I did: - Moving a lot of the logic in `VTransformElementWiseVectorNode::apply` to `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`. - Will make it easier to optimize and compute cost in future RFE's. - `VTransformVectorNodePrototype`: packs a lot of the info for `VTransformVectorNode`. - pass info about `bt`, `vlen`, `sopc` instead of the `pack` -> allows us to eventually remove the dependency on `nodes`. - New vector nodes, they are special cases I split away from `VTransformElementWiseVectorNode`: - `VTransformReinterpretVectorNode` - `VTransformElementWiseLongOpWithCastToIntVectorNode` - `VTransformCmpVectorNode` - Rename `set_all_req_with_vectors` -> `init_all_req_with_vectors` (forgot it in #26991) - A few smaller changes / refactorings. ------------- Commit messages: - fix merge - manual merge conflict resolution - flatten - cleanup - adr_type refactor - hide prototype - wip x1 - wip continued 2 - wip continued - wip cleanup - ... 
and 13 more: https://git.openjdk.org/jdk/compare/0dad3f1a...05ee2800 Changes: https://git.openjdk.org/jdk/pull/27056/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27056&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8366702 Stats: 327 lines in 4 files changed: 169 ins; 55 del; 103 mod Patch: https://git.openjdk.org/jdk/pull/27056.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27056/head:pull/27056 PR: https://git.openjdk.org/jdk/pull/27056 From epeter at openjdk.org Fri Sep 5 13:51:25 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 5 Sep 2025 13:51:25 GMT Subject: RFR: 8366702: C2 SuperWord: refactor VTransform vector nodes In-Reply-To: References: Message-ID: On Tue, 2 Sep 2025 15:30:06 GMT, Emanuel Peter wrote: > I'm working on cost-modeling, and am integrating some smaller changes from this proof-of-concept PR: > https://github.com/openjdk/jdk/pull/20964 > [See plan overview.](https://bugs.openjdk.org/browse/JDK-8340093) > > This is a pure refactoring - no change in behaviour. I'm presenting it like this because it will make reviews easier. > > --------------------------------- > > I have to say: I'm very sorry for this refactoring. I took some decisions in https://github.com/openjdk/jdk/pull/19719 that I'm now partially undoing. I moved too much logic from `SuperWord::output` (now called `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`) to the `VTransform...Node::apply`. https://github.com/openjdk/jdk/pull/19719 was a roughly 1.5k line change, and I took about a 0.3k misstep that I'm now correcting here ;) > > I had accidentally made the `VTransformGraph` too close to the `PackSet`, and not close enough to the future vectorized C2 Graph. And that makes some future changes hard. > > My vision: > - VLoop / VLoopAnalyzer look at the scalar loop and prepare it for SuperWord > - SuperWord creates the `PackSet`: some nodes are packed, all others are scalar. 
> - `SuperWordVTransformBuilder` converts the `PackSet` into the `VTransformGraph` > - The `VTransformGraph` very closely represents the C2 vectorized loop after vectorization > - It does not need to know which `nodes` it packs, it rather just needs to know how to generate the new vector nodes > - That means it is straight-forward to compute cost > - And it also makes optimizations on that graph easier > - And the `apply` methods are simpler too > > ---------------------------------- > > So therefore, the main goal was to make the `VTransform...Node::apply` calls simpler again. And move the logic back to `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`. > > One important step to making the `VTransformGraph` less of a `PackSet` is to remove reliance on `nodes` for the vector nodes. > > What I did: > - Moving a lot of the logic in `VTransformElementWiseVectorNode::apply` to `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`. > - Will make it easier to optimize and compute cost in future RFE's. > - `VTransformVectorNodePrototype`: packs a lot of the info for `VTransformVectorNode`. > - pass info about `bt`, `vlen`, `sopc` instead of the `pack` -> allows us to eventually remove the dependency on `nodes`. > - New vector nodes, they are special cases I split away from `VTransformElementWiseVectorNode`: > - `VTransformReinterpretVectorN... src/hotspot/share/opto/superwordVTransformBuilder.cpp line 91: > 89: init_req_with_scalar(p0, vtn, MemNode::Control); > 90: init_req_with_scalar(p0, vtn, MemNode::Address); > 91: init_req_with_vector(pack, vtn, MemNode::ValueIn); I'm also adding control to the load/store vectors. 
That allows us to load control without access to the `nodes` in `VTransformLoadVectorNode::apply` and `VTransformStoreVectorNode::apply`: https://github.com/openjdk/jdk/blob/05ee280048757e6ac095bf7e28708dce258635bf/src/hotspot/share/opto/vtransform.cpp#L877 https://github.com/openjdk/jdk/blob/05ee280048757e6ac095bf7e28708dce258635bf/src/hotspot/share/opto/vtransform.cpp#L906 src/hotspot/share/opto/superwordVTransformBuilder.cpp line 119: > 117: } else { > 118: init_all_req_with_vectors(pack, vtn); > 119: } I'm mostly flattening the control flow here. There is also a new else case that just does `init_all_req_with_vectors(pack, vtn);`. This applies to the new nodes that I split away from `ElementWiseVector`: - `VTransformReinterpretVectorNode` - `VTransformElementWiseLongOpWithCastToIntVectorNode` - `VTransformCmpVectorNode` I also adapted the logic for `CMove`, to integrate the special handling logic from `VTransformElementWiseVectorNode::apply`, so the inputs are now permuted at this stage already, in the same order that the generated `BlendVector` will eventually have them. src/hotspot/share/opto/superwordVTransformBuilder.cpp line 196: > 194: vtn = new (_vtransform.arena()) VTransformElementWiseVectorNode(_vtransform, p0->req(), prototype, vopc); > 195: } else if (VectorNode::is_scalar_op_that_returns_int_but_vector_op_returns_long(sopc)) { > 196: vtn = new (_vtransform.arena()) VTransformElementWiseLongOpWithCastToIntVectorNode(_vtransform, prototype); Cases moved from `VTransformElementWiseVectorNode::apply`. src/hotspot/share/opto/vtransform.cpp line 108: > 106: #ifndef PRODUCT > 107: if (_trace._info) { > 108: print_schedule(); Verbose is often too much, but it is nice to see which `VTransformNode`s are generated, and to see their order after scheduling. 
src/hotspot/share/opto/vtransform.cpp line 163: > 161: VTransformMemVectorNode* vtn = vtnodes.at(i)->isa_MemVector(); > 162: if (vtn == nullptr) { continue; } > 163: const VPointer& vp = vtn->vpointer(); We can check for `MemVector` directly, and then we know that they all represent `Mem` nodes and they all have a `vpointer`. src/hotspot/share/opto/vtransform.cpp line 798: > 796: // Handled by Bool / VTransformBoolVectorNode, so we do not generate any nodes here. > 797: return VTransformApplyResult::make_empty(); > 798: } Moved to `VTransformCmpVectorNode` -> has empty apply. src/hotspot/share/opto/vtransform.cpp line 801: > 799: vn = VectorNode::make(vopc, in1, in2, vt); // unary and binary > 800: } else { > 801: vn = VectorNode::make(vopc, in1, in2, in3, vt); // ternary Moved to `SuperWordVTransformBuilder::build_inputs_for_vector_vtnodes`, to simplify the logic here. src/hotspot/share/opto/vtransform.cpp line 880: > 878: // first has the correct memory state, determined by VTransformGraph::apply_memops_reordering_with_schedule > 879: Node* mem = first->in(MemNode::Memory); > 880: Node* adr = apply_state.transformed_node(in_req(MemNode::Address)); There is still minimal reliance on `nodes` / `first`: but only for `mem` state. And currently, we cannot remove that yet, because we rely on the memory graph being reordered before vectorization, see `VTransformGraph::apply_memops_reordering_with_schedule`. In a future RFE, I will refactor scheduling, so that we build the memory graph during apply. See step 3 in [plan overview.](https://bugs.openjdk.org/browse/JDK-8340093) src/hotspot/share/opto/vtransform.cpp line 909: > 907: // first has the correct memory state, determined by VTransformGraph::apply_memops_reordering_with_schedule > 908: Node* mem = first->in(MemNode::Memory); > 909: Node* adr = apply_state.transformed_node(in_req(MemNode::Address)); There is still minimal reliance on nodes / first: but only for mem state. 
And currently, we cannot remove that yet, because we rely on the memory graph being reordered before vectorization, see VTransformGraph::apply_memops_reordering_with_schedule. In a future RFE, I will refactor scheduling, so that we build the memory graph during apply. See step 3 in [plan overview.](https://bugs.openjdk.org/browse/JDK-8340093) src/hotspot/share/opto/vtransform.cpp line 933: > 931: phase->register_new_node(vn, apply_state.vloop().cl()); > 932: phase->igvn()._worklist.push(vn); > 933: VectorNode::trace_new_vector(vn, "AutoVectorization"); Removing the argument here allows us yet another removal of dependency on the old scalar graph. We only needed it for using the same control as the old graph - but that is not necessary, we can just use the CountedLoop as control, which is good enough. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27056#discussion_r2325037712 PR Review Comment: https://git.openjdk.org/jdk/pull/27056#discussion_r2325051629 PR Review Comment: https://git.openjdk.org/jdk/pull/27056#discussion_r2325054470 PR Review Comment: https://git.openjdk.org/jdk/pull/27056#discussion_r2325058939 PR Review Comment: https://git.openjdk.org/jdk/pull/27056#discussion_r2325065437 PR Review Comment: https://git.openjdk.org/jdk/pull/27056#discussion_r2325066973 PR Review Comment: https://git.openjdk.org/jdk/pull/27056#discussion_r2325068600 PR Review Comment: https://git.openjdk.org/jdk/pull/27056#discussion_r2325076487 PR Review Comment: https://git.openjdk.org/jdk/pull/27056#discussion_r2325077256 PR Review Comment: https://git.openjdk.org/jdk/pull/27056#discussion_r2325079910 From epeter at openjdk.org Fri Sep 5 13:51:26 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 5 Sep 2025 13:51:26 GMT Subject: RFR: 8366702: C2 SuperWord: refactor VTransform vector nodes In-Reply-To: References: Message-ID: On Fri, 5 Sep 2025 13:13:02 GMT, Emanuel Peter wrote: >> I'm working on cost-modeling, and am integrating some smaller changes 
from this proof-of-concept PR: >> https://github.com/openjdk/jdk/pull/20964 >> [See plan overview.](https://bugs.openjdk.org/browse/JDK-8340093) >> >> This is a pure refactoring - no change in behaviour. I'm presenting it like this because it will make reviews easier. >> >> --------------------------------- >> >> I have to say: I'm very sorry for this refactoring. I took some decisions in https://github.com/openjdk/jdk/pull/19719 that I'm now partially undoing. I moved too much logic from `SuperWord::output` (now called `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`) to the `VTransform...Node::apply`. https://github.com/openjdk/jdk/pull/19719 was a roughly 1.5k line change, and I took about a 0.3k misstep that I'm now correcting here ;) >> >> I had accidentally made the `VTransformGraph` too close to the `PackSet`, and not close enough to the future vectorized C2 Graph. And that makes some future changes hard. >> >> My vision: >> - VLoop / VLoopAnalyzer look at the scalar loop and prepare it for SuperWord >> - SuperWord creates the `PackSet`: some nodes are packed, all others are scalar. >> - `SuperWordVTransformBuilder` converts the `PackSet` into the `VTransformGraph` >> - The `VTransformGraph` very closely represents the C2 vectorized loop after vectorization >> - It does not need to know which `nodes` it packs, it rather just needs to know how to generate the new vector nodes >> - That means it is straight-forward to compute cost >> - And it also makes optimizations on that graph easier >> - And the `apply` methods are simpler too >> >> ---------------------------------- >> >> So therefore, the main goal was to make the `VTransform...Node::apply` calls simpler again. And move the logic back to `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`. >> >> One important step to making the `VTransformGraph` less of a `PackSet` is to remove reliance on `nodes` for the vector nodes. 
>> >> What I did: >> - Moving a lot of the logic in `VTransformElementWiseVectorNode::apply` to `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`. >> - Will make it easier to optimize and compute cost in future RFE's. >> - `VTransformVectorNodePrototype`: packs a lot of the info for `VTransformVectorNode`. >> - pass info about `bt`, `vlen`, `sopc` instead of the `pack` -> allows us to eventually remove the dependency on `nodes`. >> - New vector nodes, they are special cases I split away from ... > > src/hotspot/share/opto/vtransform.cpp line 801: > >> 799: vn = VectorNode::make(vopc, in1, in2, vt); // unary and binary >> 800: } else { >> 801: vn = VectorNode::make(vopc, in1, in2, in3, vt); // ternary > > Moved to `SuperWordVTransformBuilder::build_inputs_for_vector_vtnodes`, to simplify the logic here. `is_scalar_op_that_returns_int_but_vector_op_returns_long` moved down to `VTransformElementWiseLongOpWithCastToIntVectorNode`. `is_reinterpret_opcode` moved down to `VTransformReinterpretVectorNode`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27056#discussion_r2325071420 From chagedorn at openjdk.org Fri Sep 5 14:56:18 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 5 Sep 2025 14:56:18 GMT Subject: RFR: 8353290: C2: Refactor PhaseIdealLoop::is_counted_loop() [v8] In-Reply-To: <7qgNsgKbFFtzVwuDG2yM_vIczHbzMj6ZUKh_7sz1qow=.d9aeab55-f647-43bc-af2a-48f23d5bbcca@github.com> References: <7qgNsgKbFFtzVwuDG2yM_vIczHbzMj6ZUKh_7sz1qow=.d9aeab55-f647-43bc-af2a-48f23d5bbcca@github.com> Message-ID: <1PU55shmn1ijfzU6eeVUqZ4aAMd1szDOfSN24J1wfKE=.475818f9-7120-42bc-9717-54358fd4e855@github.com> On Tue, 26 Aug 2025 14:47:00 GMT, Kangcheng Xu wrote: >> This PR refactors `PhaseIdealLoop::is_counted_loop()` into (mostly) `CountedLoopConverter::is_counted_loop()` and `CountedLoopConverter::convert()` to decouple the detection and conversion code. 
This enables us to try different loop configurations easily and finally convert once a counted loop is found. >> >> A nested `PhaseIdealLoop::CountedLoopConverter` class is created to handle the context, but I'm not sure if this is the best name or place for it. Please let me know what you think. >> >> Blocks [JDK-8336759](https://bugs.openjdk.org/browse/JDK-8336759). > > Kangcheng Xu has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 24 commits: > > - Merge branch 'openjdk:master' into counted-loop-refactor > - Merge remote-tracking branch 'origin/master' into counted-loop-refactor > > # Conflicts: > # src/hotspot/share/opto/loopnode.cpp > # src/hotspot/share/opto/loopnode.hpp > - Merge branch 'master' into counted-loop-refactor > > # Conflicts: > # src/hotspot/share/opto/loopnode.cpp > # src/hotspot/share/opto/loopnode.hpp > # src/hotspot/share/opto/loopopts.cpp > - Merge remote-tracking branch 'origin/master' into counted-loop-refactor > - further refactor is_counted_loop() by extracting functions > - WIP: refactor is_counted_loop() > - WIP: refactor is_counted_loop() > - WIP: review followups > - reviewer suggested changes > - line break > - ... and 14 more: https://git.openjdk.org/jdk/compare/173dedfb...763adeda Hi @tabjy, sorry for keeping you waiting! I was very busy with other things. Thanks for coming back with an improved version! This looks much better already. I took a first high-level look. Will dive into it more again next week. I've just left some thoughts/comments here and there. Generally, I think we could improve the classes so that they are not just pure data holders with public access, but instead offer methods for interacting with the data, which we could then hide to prevent modification. 
Let me know what you think :-) src/hotspot/share/opto/loopnode.cpp line 442: > 440: } > 441: > 442: PhaseIdealLoop::LoopExitTest PhaseIdealLoop::loop_exit_test(const Node* back_control, const IdealLoopTree* loop) { Just an idea here: Could this also be part of `LoopExitTest` instead? Then a user could do something like: LoopExitTest loop_exit_test(...); // i.e. = PhaseIdealLoop::loop_exit_test() but with a `_is_valid` flag. Then at the end you // can also check for the right Cmp opcode and that it's not null which the current callers of // loop_exit_test() are all doing. If that's off, you set `_is_valid` to false accordingly. loop_exit_test.build(); if (loop_exit_test.is_not_valid()) { ... } The same also applies to the other classes like `LoopIVIncr`, `LoopIVStride` etc. src/hotspot/share/opto/loopnode.cpp line 1881: > 1879: PhaseIterGVN* igvn = &_phase->igvn(); > 1880: > 1881: LoopStructure structure{}; I think you can remove the `{}`: Suggestion: LoopStructure structure; src/hotspot/share/opto/loopnode.cpp line 2258: > 2256: } > 2257: > 2258: bool CountedLoopConverter::build_loop_structure(CountedLoopConverter::LoopStructure& structure) { Suggestion: bool CountedLoopConverter::build_loop_structure(LoopStructure& structure) { src/hotspot/share/opto/loopnode.cpp line 2259: > 2257: > 2258: bool CountedLoopConverter::build_loop_structure(CountedLoopConverter::LoopStructure& structure) { > 2259: PhaseIterGVN* igvn = &_phase->igvn(); Not used anymore and can be removed. src/hotspot/share/opto/loopnode.cpp line 2266: > 2264: } > 2265: > 2266: PhaseIdealLoop::LoopExitTest exit_test = _phase->loop_exit_test(back_control, _loop); Some thoughts/suggestions here: - The method is still big and you need a moment to figure out what's going on/what checks we do. - It looks like you are only initializing fields of `LoopStructure`. Couldn't you move the method to this class? - You could have a separate field `_is_valid` in `LoopStructure`, then you could remove the `bool` return. 
I.e. this would then look something like this: LoopStructure loop_structure; loop_structure.build(); if (loop_structure.is_not_valid()) { return false; } You might need to pass in some additional info like `phase` to `LoopStructure` but I think that's okay. - When doing the thing above, then you can just step by step assign the fields as you go and as soon as something is off (i.e. not a counted loop anymore), you set `_is_valid` to false and stop parsing further. This would allow you to further split the method up which also improves documentation and moves field specific things to separate initializer methods: back_control = _phase->loop_exit_control(_head, _loop); if (back_control == nullptr) { _is_valid = false; return false; } exit_test = exit_test(); if (exit_test.is_not_valid()) { _is_valid = false; return; } incr = incr(); iv_incr = PhaseIdealLoop::loop_iv_incr(incr, _head, _loop); .... - Btw, you should use a `_` prefix for the fields. src/hotspot/share/opto/loopnode.cpp line 2329: > 2327: structure.phi = phi; > 2328: > 2329: structure.sfpt = sfpt; Are you later really going to use all these fields? I haven't double-checked. Another note here: I think it would be better to hide the fields and provide accessor methods. Otherwise, everyone can update them. 
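The two suggestions above (a `_is_valid` flag that is set while building, and hidden fields behind accessor methods) can be sketched in isolation. This is only a minimal illustration under stated assumptions, not HotSpot code: `Node`, `kCmpOpcode`, `LoopStructureSketch`, and its two fields are hypothetical stand-ins, and only the `build()` / `is_not_valid()` shape and the `_` field prefix are taken from the comments above.

```cpp
#include <cassert>

// Hypothetical stand-in for a C2 IR node; this is not HotSpot's Node class.
struct Node {
  int opcode;
};

// Hypothetical opcode value, used only to make the sketch checkable.
constexpr int kCmpOpcode = 1;

// A structure that validates itself while it is built, so callers query a
// flag instead of threading a bool return value through the builder.
class LoopStructureSketch {
  const Node* _back_control;
  const Node* _exit_cmp;
  bool _is_valid = false;

 public:
  LoopStructureSketch(const Node* back_control, const Node* exit_cmp)
      : _back_control(back_control), _exit_cmp(exit_cmp) {}

  // Check the fields step by step and stop at the first failed check.
  // _is_valid only becomes true if every check passed.
  void build() {
    _is_valid = false;
    if (_back_control == nullptr) { return; }
    if (_exit_cmp == nullptr || _exit_cmp->opcode != kCmpOpcode) { return; }
    _is_valid = true;
  }

  bool is_valid() const { return _is_valid; }
  bool is_not_valid() const { return !_is_valid; }

  // Read-only accessors: users can inspect the fields but not update them.
  const Node* back_control() const { return _back_control; }
  const Node* exit_cmp() const { return _exit_cmp; }
};
```

A caller would construct the object, call `build()`, and bail out if `is_not_valid()` returns true, which mirrors the usage pattern proposed in the review comment.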
src/hotspot/share/opto/loopnode.hpp line 1327: > 1325: static PhiNode* loop_iv_phi(const Node* xphi, const Node* phi_incr, const Node* head); > 1326: > 1327: bool try_convert_to_counted_loop(Node* head, IdealLoopTree*&loop, const BasicType iv_bt); Suggestion: bool try_convert_to_counted_loop(Node* head, IdealLoopTree*& loop, const BasicType iv_bt); ------------- PR Review: https://git.openjdk.org/jdk/pull/24458#pullrequestreview-3189442899 PR Review Comment: https://git.openjdk.org/jdk/pull/24458#discussion_r2325295897 PR Review Comment: https://git.openjdk.org/jdk/pull/24458#discussion_r2325148526 PR Review Comment: https://git.openjdk.org/jdk/pull/24458#discussion_r2325150664 PR Review Comment: https://git.openjdk.org/jdk/pull/24458#discussion_r2325150111 PR Review Comment: https://git.openjdk.org/jdk/pull/24458#discussion_r2325253187 PR Review Comment: https://git.openjdk.org/jdk/pull/24458#discussion_r2325256932 PR Review Comment: https://git.openjdk.org/jdk/pull/24458#discussion_r2325264509 From chagedorn at openjdk.org Fri Sep 5 15:29:16 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 5 Sep 2025 15:29:16 GMT Subject: RFR: 8366890: C2: Split through phi printing with TraceLoopOpts misses line break In-Reply-To: References: Message-ID: On Thu, 4 Sep 2025 13:23:24 GMT, Roberto Castañeda Lozano wrote: >> [JDK-8356176](https://bugs.openjdk.org/browse/JDK-8356176) added new printing code for `TraceLoopOpts` when splitting nodes through a phi but missed a line break. This will result in: >> >> Split 974 CmpI through 1465 Phi in 953 RegionSplit 474 Bool through 1468 Phi in 953 RegionSplit-If >> >> instead of >> >> Split 974 CmpI through 1465 Phi in 953 RegionSplit 474 Bool through 1468 Phi in 953 Region >> Split-If >> >> This patch fixes this. >> >> Thanks, >> Christian > > Looks good, and trivial. Thanks @robcasloz, @mhaessig and @eme64 for your reviews! 
And no worries @mhaessig, was easy to overlook :-) ------------- PR Comment: https://git.openjdk.org/jdk/pull/27092#issuecomment-3258771939 From chagedorn at openjdk.org Fri Sep 5 15:29:17 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 5 Sep 2025 15:29:17 GMT Subject: Integrated: 8366890: C2: Split through phi printing with TraceLoopOpts misses line break In-Reply-To: References: Message-ID: On Thu, 4 Sep 2025 12:44:43 GMT, Christian Hagedorn wrote: > [JDK-8356176](https://bugs.openjdk.org/browse/JDK-8356176) added new printing code for `TraceLoopOpts` when splitting nodes through a phi but missed a line break. This will result in: > > Split 974 CmpI through 1465 Phi in 953 RegionSplit 474 Bool through 1468 Phi in 953 RegionSplit-If > > instead of > > Split 974 CmpI through 1465 Phi in 953 RegionSplit 474 Bool through 1468 Phi in 953 Region > Split-If > > This patch fixes this. > > Thanks, > Christian This pull request has now been integrated. Changeset: ceacf6f7 Author: Christian Hagedorn URL: https://git.openjdk.org/jdk/commit/ceacf6f7852514dc9877cfe284f9550c179d913a Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod 8366890: C2: Split through phi printing with TraceLoopOpts misses line break Reviewed-by: rcastanedalo, mhaessig, epeter ------------- PR: https://git.openjdk.org/jdk/pull/27092 From vlivanov at openjdk.org Fri Sep 5 16:47:23 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Fri, 5 Sep 2025 16:47:23 GMT Subject: Integrated: 8358751: C2: Recursive inlining check for compiled lambda forms is broken In-Reply-To: References: Message-ID: On Fri, 22 Aug 2025 01:24:52 GMT, Vladimir Ivanov wrote: > Recursive inlining checks are relaxed for compiled LambdaForms. Since LambdaForms are heavily reused, the check is performed on `MethodHandle` receivers instead. > > Unfortunately, the current implementation is broken. JVMState doesn't guarantee presence of receivers for caller frames. 
> An attempt to fetch pruned receiver reports unrelated info, but, in the worst case, it ends up as an out-of-bounds access into node's input array and crashes the JVM. > > Proposed fix captures receiver information as part of inlining and preserves it on `JVMState` for every compiled LambdaForm frame, so it can be reliably recovered during subsequent inlining attempts. > > Testing: hs-tier1 - hs-tier8 > > (Special thanks to @mroth23 who prepared a reproducer of the bug.) This pull request has now been integrated. Changeset: 9cca4f7c Author: Vladimir Ivanov URL: https://git.openjdk.org/jdk/commit/9cca4f7c760bea9bf79f7c03f37a70449acad51e Stats: 76 lines in 4 files changed: 42 ins; 1 del; 33 mod 8358751: C2: Recursive inlining check for compiled lambda forms is broken Reviewed-by: dlong, roland ------------- PR: https://git.openjdk.org/jdk/pull/26891 From mhaessig at openjdk.org Fri Sep 5 16:52:20 2025 From: mhaessig at openjdk.org (Manuel Hässig) Date: Fri, 5 Sep 2025 16:52:20 GMT Subject: RFR: 8366878: Improve flags of compiler/loopopts/superword/TestAlignVectorFuzzer.java Message-ID: The test definitions of `TestAlignVectorFuzzer.java` all contain `printcompilation` directives. These are redundant and slow down the test execution of a test that already often times out. @eme64 also suggested adding a `compileonly` directive to one of the four tests. 
Testing: - [ ] Github Actions - [ ] tier1 and stress testing (features `TestAlignVectorFuzzer.java`) ------------- Commit messages: - Fix flags Changes: https://git.openjdk.org/jdk/pull/27122/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27122&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8366878 Stats: 5 lines in 1 file changed: 0 ins; 3 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/27122.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27122/head:pull/27122 PR: https://git.openjdk.org/jdk/pull/27122 From jbhateja at openjdk.org Fri Sep 5 17:14:27 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 5 Sep 2025 17:14:27 GMT Subject: RFR: 8365205: C2: Optimize popcount value computation using knownbits [v2] In-Reply-To: References: Message-ID:
> This patch optimizes PopCount value transforms using KnownBits information.
> Following are the results of the micro-benchmark included with the patch
>
> System: 13th Gen Intel(R) Core(TM) i3-1315U
>
> Baseline:
> Benchmark                                      Mode  Cnt       Score   Error  Units
> PopCountValueTransform.LogicFoldingKerenLong  thrpt    2  215460.670          ops/s
> PopCountValueTransform.LogicFoldingKerenlInt  thrpt    2  294014.826          ops/s
> PopCountValueTransform.StockKernelInt         thrpt    2  409295.875          ops/s
> PopCountValueTransform.StockKernelLong        thrpt    2  368025.608          ops/s
>
> Withopt:
> Benchmark                                      Mode  Cnt       Score   Error  Units
> PopCountValueTransform.LogicFoldingKerenLong  thrpt    2  389978.082          ops/s
> PopCountValueTransform.LogicFoldingKerenlInt  thrpt    2  417261.583          ops/s
> PopCountValueTransform.StockKernelInt         thrpt    2  418649.269          ops/s
> PopCountValueTransform.StockKernelLong        thrpt    2  381330.221          ops/s
>
> Kindly review and share your feedback.
> > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: New IR test addition and review resolutions ------------- Changes: - all: https://git.openjdk.org/jdk/pull/27075/files - new: https://git.openjdk.org/jdk/pull/27075/files/c83be331..a68fbc08 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=27075&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27075&range=00-01 Stats: 170 lines in 4 files changed: 155 ins; 4 del; 11 mod Patch: https://git.openjdk.org/jdk/pull/27075.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27075/head:pull/27075 PR: https://git.openjdk.org/jdk/pull/27075 From mhaessig at openjdk.org Fri Sep 5 17:15:47 2025 From: mhaessig at openjdk.org (Manuel Hässig) Date: Fri, 5 Sep 2025 17:15:47 GMT Subject: RFR: 8366569: Disable CompileTaskTimeout for known long-running test cases Message-ID: This PR deliberately disables compile task timeouts using `-XX:CompileTaskTimeout=0` on some tests that are known to have long compilation times due to their construction. Disabling the timeouts in the task description enables running all other tests in the test suite in a CI with a lower timeout and thus a higher chance of discovering degenerate compilations. In a perfect world, timeout values passed from the command line would be increased by some factor to also have timeouts on these tests when requested. However, I am working with the tools I know and have... Testing: - [ ] Github Actions - [x] tier1,tier2,tier3 and stress testing with fastdebug on Oracle supported platforms. 
------------- Commit messages: - Disable timeouts Changes: https://git.openjdk.org/jdk/pull/27123/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27123&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8366569 Stats: 21 lines in 5 files changed: 16 ins; 0 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/27123.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27123/head:pull/27123 PR: https://git.openjdk.org/jdk/pull/27123 From jbhateja at openjdk.org Fri Sep 5 17:17:52 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 5 Sep 2025 17:17:52 GMT Subject: RFR: 8365205: C2: Optimize popcount value computation using knownbits [v3] In-Reply-To: References: Message-ID:
> This patch optimizes PopCount value transforms using KnownBits information.
> Following are the results of the micro-benchmark included with the patch
>
> System: 13th Gen Intel(R) Core(TM) i3-1315U
>
> Baseline:
> Benchmark                                      Mode  Cnt       Score   Error  Units
> PopCountValueTransform.LogicFoldingKerenLong  thrpt    2  215460.670          ops/s
> PopCountValueTransform.LogicFoldingKerenlInt  thrpt    2  294014.826          ops/s
> PopCountValueTransform.StockKernelInt         thrpt    2  409295.875          ops/s
> PopCountValueTransform.StockKernelLong        thrpt    2  368025.608          ops/s
>
> Withopt:
> Benchmark                                      Mode  Cnt       Score   Error  Units
> PopCountValueTransform.LogicFoldingKerenLong  thrpt    2  389978.082          ops/s
> PopCountValueTransform.LogicFoldingKerenlInt  thrpt    2  417261.583          ops/s
> PopCountValueTransform.StockKernelInt         thrpt    2  418649.269          ops/s
> PopCountValueTransform.StockKernelLong        thrpt    2  381330.221          ops/s
>
> Kindly review and share your feedback.
> > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Update countbitsnode.cpp ------------- Changes: - all: https://git.openjdk.org/jdk/pull/27075/files - new: https://git.openjdk.org/jdk/pull/27075/files/a68fbc08..52ae6bc8 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=27075&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27075&range=01-02 Stats: 9 lines in 1 file changed: 0 ins; 1 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/27075.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27075/head:pull/27075 PR: https://git.openjdk.org/jdk/pull/27075 From cslucas at openjdk.org Fri Sep 5 17:19:11 2025 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Fri, 5 Sep 2025 17:19:11 GMT Subject: RFR: 8361699: C2: assert(can_reduce_phi(n->as_Phi())) failed: Sanity: previous reducible Phi is no longer reducible before SUT In-Reply-To: References: Message-ID: On Fri, 5 Sep 2025 09:39:54 GMT, Emanuel Peter wrote: >> Please, review this patch to fix an issue that may occur when reducing an allocation merge. >> >> As the assert message describes, the problem is that a `Phi` considered reducible during one invocation of `adjust_scalar_replaceable_state` later turned out to be non-reducible. This situation can happen if a subsequent invocation of the same method causes all inputs to the phi to be NSR; therefore there is no point in reducing the Phi. It can also happen during the propagation of NSR state done by `find_scalar_replaceable_allocs`. >> >> The change in `revisit_reducible_phi_status` is just a clean-up. >> The real fix is in `find_scalar_replaceable_allocs`. >> >> Tested on Linux x64/Aarch64 release/fastdebug with JTREG tier1-3. > > test/hotspot/jtreg/compiler/escapeAnalysis/TestReduceAllocationNotReducibleAnymore.java line 58: > >> 56: } >> 57: } >> 58: } > > Could we make the catch exception matching more precise? 
I'd just like to avoid a case where we miscompile and throw the wrong exception and that gets caught silently. I'll do a best effort attempt to minimize this test again. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27063#discussion_r2325647857 From cslucas at openjdk.org Fri Sep 5 17:24:15 2025 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Fri, 5 Sep 2025 17:24:15 GMT Subject: RFR: 8361699: C2: assert(can_reduce_phi(n->as_Phi())) failed: Sanity: previous reducible Phi is no longer reducible before SUT In-Reply-To: References: Message-ID: On Fri, 5 Sep 2025 09:44:22 GMT, Emanuel Peter wrote: >> Please, review this patch to fix an issue that may occur when reducing an allocation merge. >> >> As the assert message describes, the problem is that a `Phi` considered reducible during one invocation of `adjust_scalar_replaceable_state` later turned out to be non-reducible. This situation can happen if a subsequent invocation of the same method causes all inputs to the phi to be NSR; therefore there is no point in reducing the Phi. It can also happen during the propagation of NSR state done by `find_scalar_replaceable_allocs`. >> >> The change in `revisit_reducible_phi_status` is just a clean-up. >> The real fix is in `find_scalar_replaceable_allocs`. >> >> Tested on Linux x64/Aarch64 release/fastdebug with JTREG tier1-3. > > src/hotspot/share/opto/escape.cpp line 3078: > >> 3076: Node* phi = reducible_merges.at(i); >> 3077: >> 3078: if (!can_reduce_phi(phi->as_Phi())) { > > You say this is a pure cleanup? There are some slight differences in the code though, right? > This method call checks `PhaseMacroExpand::can_eliminate_allocation`, and has a side effect with `ptn->set_scalar_replaceable(false)`. > > Just pointing it out, not an EA expert. That shouldn't make a difference at this point in the analysis. 
I mentioned this is just a clean up because the verification that needs to be done at this point is essentially what is already performed in `can_reduce_phi` and this change doesn't have anything to do with the original issue. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27063#discussion_r2325657330 From mhaessig at openjdk.org Fri Sep 5 17:51:14 2025 From: mhaessig at openjdk.org (Manuel Hässig) Date: Fri, 5 Sep 2025 17:51:14 GMT Subject: RFR: 8366702: C2 SuperWord: refactor VTransform vector nodes In-Reply-To: References: Message-ID: On Tue, 2 Sep 2025 15:30:06 GMT, Emanuel Peter wrote: > I'm working on cost-modeling, and am integrating some smaller changes from this proof-of-concept PR: > https://github.com/openjdk/jdk/pull/20964 > [See plan overview.](https://bugs.openjdk.org/browse/JDK-8340093) > > This is a pure refactoring - no change in behaviour. I'm presenting it like this because it will make reviews easier. > > --------------------------------- > > I have to say: I'm very sorry for this refactoring. I took some decisions in https://github.com/openjdk/jdk/pull/19719 that I'm now partially undoing. I moved too much logic from `SuperWord::output` (now called `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`) to the `VTransform...Node::apply`. https://github.com/openjdk/jdk/pull/19719 was a roughly 1.5k line change, and I took about a 0.3k misstep that I'm now correcting here ;) > > I had accidentally made the `VTransformGraph` too close to the `PackSet`, and not close enough to the future vectorized C2 Graph. And that makes some future changes hard. > > My vision: > - VLoop / VLoopAnalyzer look at the scalar loop and prepare it for SuperWord > - SuperWord creates the `PackSet`: some nodes are packed, all others are scalar.
> - `SuperWordVTransformBuilder` converts the `PackSet` into the `VTransformGraph` > - The `VTransformGraph` very closely represents the C2 vectorized loop after vectorization > - It does not need to know which `nodes` it packs, it rather just needs to know how to generate the new vector nodes > - That means it is straight-forward to compute cost > - And it also makes optimizations on that graph easier > - And the `apply` methods are simpler too > > ---------------------------------- > > So therefore, the main goal was to make the `VTransform...Node::apply` calls simpler again. And move the logic back to `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`. > > One important step to making the `VTransformGraph` less of a `PackSet` is to remove reliance on `nodes` for the vector nodes. > > What I did: > - Moving a lot of the logic in `VTransformElementWiseVectorNode::apply` to `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`. > - Will make it easier to optimize and compute cost in future RFEs. > - `VTransformVectorNodePrototype`: packs a lot of the info for `VTransformVectorNode`. > - pass info about `bt`, `vlen`, `sopc` instead of the `pack` -> allows us to eventually remove the dependency on `nodes`. > - New vector nodes, they are special cases I split away from `VTransformElementWiseVectorNode`: > - `VTransformReinterpretVectorN... Thank you for your continued effort on cost modelling, @eme64! I have some minor style comments and questions, but this mostly looks good to me. Regarding style, I find the alignment of local variables to be a bit distracting, especially when the aligned "things" are different operations and things are sometimes aligned and sometimes not. However, I do not know the style of the rest of the SuperWord code.
src/hotspot/share/opto/superwordVTransformBuilder.cpp line 115: > 113: VTransformBoolVectorNode* vtn_mask_cmp = vtn->in_req(3)->isa_BoolVector(); > 114: if (vtn_mask_cmp->test()._is_negated) { > 115: vtn->swap_req(1, 2); // swap if test was negated. Suggestion: // Inputs must be permuted from (mask, blend1, blend2) -> (blend1, blend2, mask) // Or, if the test was negated: (blend1, blend2, mask) -> (blend2, blend1, mask) vtn->swap_req(1, 3); // Now, the reqs are negated. VTransformBoolVectorNode* vtn_mask_cmp = vtn->in_req(3)->isa_BoolVector(); if (!vtn_mask_cmp->test()._is_negated) { vtn->swap_req(1, 2); // Swap if test was not negated. This would save a swap, but I am unsure if this is also more readable. src/hotspot/share/opto/superwordVTransformBuilder.cpp line 154: > 152: Node* p0 = pack->at(0); > 153: const VTransformVectorNodePrototype prototype = VTransformVectorNodePrototype::make_from_pack(pack, _vloop_analyzer); > 154: const int sopc = prototype.scalar_opcode(); Suggestion: const int sopc = prototype.scalar_opcode(); Nit: whitespace Or were you trying to align with the line below? Personally, I find this a bit too much, but up to you. src/hotspot/share/opto/superwordVTransformBuilder.cpp line 155: > 153: const VTransformVectorNodePrototype prototype = VTransformVectorNodePrototype::make_from_pack(pack, _vloop_analyzer); > 154: const int sopc = prototype.scalar_opcode(); > 155: const uint vlen = prototype.vector_length(); As someone that is not familiar with the Superword code: are "pack size" and "vector length" often used interchangeably? If not, then I would keep the name. ------------- Changes requested by mhaessig (Committer).
PR Review: https://git.openjdk.org/jdk/pull/27056#pullrequestreview-3190237084 PR Review Comment: https://git.openjdk.org/jdk/pull/27056#discussion_r2325655080 PR Review Comment: https://git.openjdk.org/jdk/pull/27056#discussion_r2325656896 PR Review Comment: https://git.openjdk.org/jdk/pull/27056#discussion_r2325662867 From mhaessig at openjdk.org Fri Sep 5 17:51:14 2025 From: mhaessig at openjdk.org (Manuel Hässig) Date: Fri, 5 Sep 2025 17:51:14 GMT Subject: RFR: 8366702: C2 SuperWord: refactor VTransform vector nodes In-Reply-To: References: Message-ID: <-ntIikGlF88WGhDkEeTA2rU7xWHBuSRlaLUVwDvDzUY=.b482c2ed-ddc5-46e8-b7ad-2ee8e9f8fd67@github.com> On Fri, 5 Sep 2025 13:18:07 GMT, Emanuel Peter wrote: > we can just use the CountedLoop as control, which is good enough For my understanding: this is because we can only vectorize if there are no other control dependencies in the loop? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27056#discussion_r2325684207 From dlong at openjdk.org Fri Sep 5 19:58:09 2025 From: dlong at openjdk.org (Dean Long) Date: Fri, 5 Sep 2025 19:58:09 GMT Subject: RFR: 8366569: Disable CompileTaskTimeout for known long-running test cases In-Reply-To: References: Message-ID: On Fri, 5 Sep 2025 16:59:18 GMT, Manuel Hässig wrote: > This PR deliberately disables compile task timeouts using `-XX:CompileTaskTimeout=0` on some tests that are known to have long compilation times due to their construction. Disabling the timeouts in the task description enables running all other tests in the test suite in a CI with a lower timeout and thus a higher chance of discovering degenerate compilations. > > In a perfect world, timeout values passed from the commandline would be increased by some factor to also have timeouts on these tests when requested. However, I am working with the tools I know and have...
> > Testing: > - [ ] Github Actions > - [x] tier1,tier2,tier3 and stress testing with fastdebug on Oracle supported platforms. Looks good, and trivial. ------------- Marked as reviewed by dlong (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/27123#pullrequestreview-3190614580 From mhaessig at openjdk.org Fri Sep 5 20:01:16 2025 From: mhaessig at openjdk.org (Manuel Hässig) Date: Fri, 5 Sep 2025 20:01:16 GMT Subject: RFR: 8366569: Disable CompileTaskTimeout for known long-running test cases In-Reply-To: References: Message-ID: On Fri, 5 Sep 2025 19:55:20 GMT, Dean Long wrote: >> This PR deliberately disables compile task timeouts using `-XX:CompileTaskTimeout=0` on some tests that are known to have long compilation times due to their construction. Disabling the timeouts in the task description enables running all other tests in the test suite in a CI with a lower timeout and thus a higher chance of discovering degenerate compilations. >> >> In a perfect world, timeout values passed from the commandline would be increased by some factor to also have timeouts on these tests when requested. However, I am working with the tools I know and have... >> >> Testing: >> - [x] Github Actions >> - [x] tier1,tier2,tier3 and stress testing with fastdebug on Oracle supported platforms. > > Looks good, and trivial. Thank you for reviewing, @dean-long! ------------- PR Comment: https://git.openjdk.org/jdk/pull/27123#issuecomment-3259592016 From mhaessig at openjdk.org Fri Sep 5 20:01:17 2025 From: mhaessig at openjdk.org (Manuel Hässig) Date: Fri, 5 Sep 2025 20:01:17 GMT Subject: Integrated: 8366569: Disable CompileTaskTimeout for known long-running test cases In-Reply-To: References: Message-ID: On Fri, 5 Sep 2025 16:59:18 GMT, Manuel Hässig wrote: > This PR deliberately disables compile task timeouts using `-XX:CompileTaskTimeout=0` on some tests that are known to have long compilation times due to their construction.
Disabling the timeouts in the task description enables running all other tests in the test suite in a CI with a lower timeout and thus a higher chance of discovering degenerate compilations. > > In a perfect world, timeout values passed from the commandline would be increased by some factor to also have timeouts on these tests when requested. However, I am working with the tools I know and have... > > Testing: > - [x] Github Actions > - [x] tier1,tier2,tier3 and stress testing with fastdebug on Oracle supported platforms. This pull request has now been integrated. Changeset: 4ab2b5bd Author: Manuel Hässig URL: https://git.openjdk.org/jdk/commit/4ab2b5bdb4e6d40a747d4088a25f7c6351131759 Stats: 21 lines in 5 files changed: 16 ins; 0 del; 5 mod 8366569: Disable CompileTaskTimeout for known long-running test cases Reviewed-by: dlong ------------- PR: https://git.openjdk.org/jdk/pull/27123 From sviswanathan at openjdk.org Fri Sep 5 22:08:14 2025 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Fri, 5 Sep 2025 22:08:14 GMT Subject: RFR: 8354348: Enable Extended EVEX to REX2/REX demotion for commutative operations with same dst and src2 [v2] In-Reply-To: References: Message-ID: On Thu, 4 Sep 2025 20:11:28 GMT, Srinivas Vamsi Parasa wrote: >> This change extends Extended EVEX (EEVEX) to REX2/REX demotion for Intel APX NDD instructions to handle commutative operations when the destination register and the second source register (src2) are the same. >> >> Currently, EEVEX to REX2/REX demotion is only enabled when the first source (src1) and the destination are the same. This enhancement allows additional cases of valid demotion for commutative instructions (add, imul, and, or, xor).
>> >> For example: >> `eaddl r18, r25, r18` can be encoded as `addl r18, r25` using APX REX2 encoding >> `eaddl r2, r7, r2` can be encoded as `addl r2, r7` using non-APX legacy encoding > > Srinivas Vamsi Parasa has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: > > - nomenclature change > - Merge branch 'master' of https://git.openjdk.java.net/jdk into cdemotion > - remove trailing whitespaces > - remove unused instructions > - 8354348: Enable Extended EVEX to REX2/REX demotion for commutative operations with same dst and src2 src/hotspot/cpu/x86/assembler_x86.cpp line 13125: > 13123: emit_arith(op1, op2, src1, src2, second_operand_demotable); > 13124: } > 13125: This could be written something like below: void Assembler::emit_eevex_prefix_or_demote_arith_ndd(Register dst, Register src1, Register src2, VexSimdPrefix pre, VexOpcode opc, InstructionAttr *attributes, int op1, int op2, bool no_flags, bool use_prefixq, bool is_commutative) { bool demotable = is_demotable(no_flags, dst->encoding(), src1->encoding()); if (!demotable && is_commutative) { if (is_demotable(no_flags, dst->encoding(), src2->encoding())) { demotable = true; // swap src1 and src2 Register tmp = src1; src1 = src2; src2 = tmp; } } (void)emit_eevex_prefix_or_demote_ndd(src1->encoding(), dst->encoding(), src2->encoding(), pre, opc, attributes, no_flags, use_prefixq); emit_arith(op1, op2, src1, src2); } Then we don't need extra argument in emit_arith() and emit_eevex_prefix_or_demote_ndd. 
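For readers following the review, the operand-selection step in the suggestion above can be sketched in isolation. This is a minimal standalone sketch, not HotSpot code: `Reg`, `Operands`, `is_demotable_sketch`, and `pick_operands` are hypothetical names, and the demotion rule used here (flags are written and `dst` has the same encoding as the first source) is an assumption about what `is_demotable` checks.

```cpp
#include <utility>

// Hypothetical stand-in for a register; only the encoding matters here.
struct Reg { int enc; };

// Assumed demotion rule: the three-operand NDD form can drop to the
// two-operand form when flags are written and dst aliases the first source.
inline bool is_demotable_sketch(bool no_flags, Reg dst, Reg src) {
  return !no_flags && dst.enc == src.enc;
}

// Result of operand selection: possibly swapped sources plus the decision.
struct Operands { Reg src1; Reg src2; bool demotable; };

// Try src1 first; for commutative operations fall back to src2 and swap,
// so later emission code only ever sees the "dst == src1" shape.
inline Operands pick_operands(Reg dst, Reg src1, Reg src2,
                              bool no_flags, bool is_commutative) {
  bool demotable = is_demotable_sketch(no_flags, dst, src1);
  if (!demotable && is_commutative &&
      is_demotable_sketch(no_flags, dst, src2)) {
    std::swap(src1, src2);  // dst now aliases the first source
    demotable = true;
  }
  return Operands{src1, src2, demotable};
}
```

With the `eaddl r18, r25, r18` example from the PR description, `pick_operands(Reg{18}, Reg{25}, Reg{18}, false, true)` reports the operation as demotable with the sources swapped, which is why the extra `second_operand_demotable` argument would no longer be needed downstream under this scheme.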
src/hotspot/cpu/x86/assembler_x86.hpp line 812: > 810: void emit_eevex_prefix_or_demote_arith_ndd(Register dst, Register src1, Register src2, VexSimdPrefix pre, VexOpcode opc, > 811: InstructionAttr *attributes, int op1, int op2, bool no_flags = false, bool use_prefixq = false, bool is_commutative = false); > 812: The attributes parameter could be replaced by int size and the attributes computed inside the emit_eevex_prefix_or_demote_arith_ndd. Also then no need to have use_prefixq as a separate parameter, (size == EVEX_64bit) implies use_prefixq. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26997#discussion_r2326128354 PR Review Comment: https://git.openjdk.org/jdk/pull/26997#discussion_r2325781623 From missa at openjdk.org Sat Sep 6 00:31:36 2025 From: missa at openjdk.org (Mohamed Issa) Date: Sat, 6 Sep 2025 00:31:36 GMT Subject: RFR: 8364305: Support AVX10 saturating floating point conversion instructions [v6] In-Reply-To: References: Message-ID: > Intel® AVX10 ISA [1] extensions added new saturating floating point conversion instructions which comply with definitions in section 5.8 of the 2019 IEEE-754 standard. They can compute floating point to integral type conversions while also handling special inputs such as NaN, +Infinity, and -Infinity. > > Without AVX10.2, the current approach starts by converting the floating point value(s) in the source register to the desired integral value(s) in the destination register. In the scalar case, the CVTTSS2SI (single precision) or CVTTSD2SI (double precision) instruction is used. In the vector case, the CVTTPS2DQ (single precision) or CVTTPD2DQ (double precision) is used. However, if the source contains a special value (NaN, -Infinity, +Infinity, <= Integer.MIN_VALUE, or >= Integer.MAX_VALUE), extra handling is required. 
The specific sequence of instructions involved depends on the source (single precision vs double precision), destination (long, integer, short, or byte), level of parallelization (scalar vs vector), and supported AVX extension type. Essentially though, the special values are mapped to values (NaN -> 0, -Infinity, <= Integer.MIN_VALUE -> Integer.MIN_VALUE, +Infinity, >= Integer.MAX_VALUE -> Integer.MAX_VALUE) in the integer range with the help of a few temporary registers to store intermediate results. > > This change uses the new AVX10.2 scalar (VCVTTSS2SIS or VCVTTSD2SIS) and vector (VCVTTPS2QQS, VCVTTPS2DQS, VCVTTPD2QQS, and VCVTTPD2DQS) instructions on supported platforms to avoid the extra handling described above. Also, the JTREG tests listed below were used to verify correctness with `-XX:-UseSuperWord` / `-XX:+UseSuperWord` options to exercise both scalar and vector paths. The baseline build used is [OpenJDK v26-b11](https://github.com/openjdk/jdk/releases/tag/jdk-26%2B11). > > 1. `jtreg:test/hotspot/jtreg/compiler/codegen/TestByteDoubleVect.java` > 2. `jtreg:test/hotspot/jtreg/compiler/codegen/TestByteFloatVect.java` > 3. `jtreg:test/hotspot/jtreg/compiler/codegen/TestIntDoubleVect.java` > 4. `jtreg:test/hotspot/jtreg/compiler/codegen/TestIntFloatVect.java` > 5. `jtreg:test/hotspot/jtreg/compiler/codegen/TestLongDoubleVect.java` > 6. `jtreg:test/hotspot/jtreg/compiler/codegen/TestLongFloatVect.java` > 7. `jtreg:test/hotspot/jtreg/compiler/codegen/TestShortDoubleVect.java` > 8. `jtreg:test/hotspot/jtreg/compiler/codegen/TestShortFloatVect.java` > 9. `jtreg:test/hotspot/jtreg/compiler/vectorapi/VectorFPtoIntCastTest.java` > 10. `jtreg:test/hotspot/jtreg/com...
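The special-value mapping in the description above (NaN -> 0, out-of-range and infinite inputs clamped to the integer limits) can be sketched as a scalar reference function. This is only an illustration of the intended result for the float-to-int case; `sat_f2i` is a hypothetical name, and the sketch says nothing about how the AVX10.2 instructions or the existing fix-up sequences are implemented.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Reference semantics for a saturating float -> int32 conversion:
// NaN maps to 0, values at or beyond the int32 range clamp to the
// nearest limit, everything else truncates toward zero.
inline int32_t sat_f2i(float f) {
  if (std::isnan(f)) {
    return 0;
  }
  // 2^31 is exactly representable as a float; INT32_MAX itself is not,
  // so the upper bound is compared against 2^31.
  if (f >= 2147483648.0f) {
    return std::numeric_limits<int32_t>::max();
  }
  if (f <= -2147483648.0f) {
    return std::numeric_limits<int32_t>::min();
  }
  return static_cast<int32_t>(f);  // in range: truncation toward zero
}
```

A double-to-long variant would follow the same shape with 2^63 as the bound.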
Mohamed Issa has updated the pull request incrementally with two additional commits since the last revision: - Add new IR rules to vector float to integer conversion tests - Fix match rule for AVX 10.2 double to long scalar conversion ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26919/files - new: https://git.openjdk.org/jdk/pull/26919/files/e0c84f69..6407cc48 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26919&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26919&range=04-05 Stats: 81 lines in 3 files changed: 64 ins; 0 del; 17 mod Patch: https://git.openjdk.org/jdk/pull/26919.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26919/head:pull/26919 PR: https://git.openjdk.org/jdk/pull/26919 From missa at openjdk.org Sat Sep 6 00:31:36 2025 From: missa at openjdk.org (Mohamed Issa) Date: Sat, 6 Sep 2025 00:31:36 GMT Subject: RFR: 8364305: Support AVX10 saturating floating point conversion instructions [v2] In-Reply-To: References: Message-ID: On Fri, 29 Aug 2025 23:35:37 GMT, Mohamed Issa wrote: >> @missa-prime Looks like an interesting patch! Do you think you could add some sort of IR test here, to verify that the correct code is generated on AVX10 vs lower AVX? > >> @missa-prime Looks like an interesting patch! Do you think you could add some sort of IR test here, to verify that the correct code is generated on AVX10 vs lower AVX? > > @eme64 Thanks for the suggestion. This patch doesn't modify any IR though, so I'm not sure what IR test(s) to add. I could modify existing tests (`test/hotspot/jtreg/compiler/vectorapi/VectorFPtoIntCastTest.java`, `test/hotspot/jtreg/compiler/vectorization/TestFloatConversionsVector.java`, `test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java`) that use IR nodes as dependencies though. Would that be sufficient? Or did you have something else in mind? > @missa-prime Could you not match on the mach graph? 
See example: `test/hotspot/jtreg/compiler/vectorapi/VectorMultiplyOpt.java` with `CompilePhase.FINAL_CODE`. > > Maybe another `CompilePhase` is better. I have never matched on the mach graph myself, but I wonder if it may be useful here. I modified existing vector conversion tests, and I'll add some matching scalar tests to get full coverage. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26919#issuecomment-3260121943 From dlong at openjdk.org Sat Sep 6 00:34:11 2025 From: dlong at openjdk.org (Dean Long) Date: Sat, 6 Sep 2025 00:34:11 GMT Subject: RFR: 8366875: CompileTaskTimeout should be reset for each iteration of RepeatCompilation In-Reply-To: <4TbOkAMu-KU_tgQPg1sK0L8oto_0nD4mQo7yc0hJPm4=.8d87b900-a614-4c13-a4c6-6fe11e206482@github.com> References: <4TbOkAMu-KU_tgQPg1sK0L8oto_0nD4mQo7yc0hJPm4=.8d87b900-a614-4c13-a4c6-6fe11e206482@github.com> Message-ID: On Fri, 5 Sep 2025 15:27:22 GMT, Manuel Hässig wrote: > When running a debug JVM on Linux with a compile task timeout and repeated compilation, the execution will time out almost always because the timeout does not reset for repetitions of a compilation. The core of the compile task timeout is to limit the amount of time a single compilation can take. Thus, this PR resets the `CompileTaskTimeout` for every compilation when running with `-XX:RepeatCompilation=` for n > 1. > > This PR is stacked on top of #27094. > > Testing: > - [x] Github Actions (failures are unrelated) > - [x] tier1, tier2, tier3 plus some additional internal testing Looks good! ------------- Marked as reviewed by dlong (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/27120#pullrequestreview-3191180852 From missa at openjdk.org Sat Sep 6 00:42:15 2025 From: missa at openjdk.org (Mohamed Issa) Date: Sat, 6 Sep 2025 00:42:15 GMT Subject: RFR: 8364305: Support AVX10 saturating floating point conversion instructions [v3] In-Reply-To: References: <_Wv0Roo5xUHjswP_JUy6yzoU5KCwNpIoX3S2QBceUbE=.05b5bbbd-840b-4162-a454-94a9ddc2a69f@github.com> Message-ID: <1Yb2TM9_3y_98k508MooB4a5amzOh8hhuhVV4HjjmTI=.a19114e4-1eec-4c70-8f50-d6b3941b7f24@github.com> On Mon, 1 Sep 2025 07:51:49 GMT, Jatin Bhateja wrote: >> src/hotspot/cpu/x86/x86.ad line 7804: >> >>> 7802: predicate(VM_Version::supports_avx10_2() && >>> 7803: is_integral_type(Matcher::vector_element_basic_type(n))); >>> 7804: match(Set dst (VectorCastD2X src)); >> >> I assume your intent here is to feed the memory operand to the vector cast IR, a memory operand is first loaded into register using LoadVector IR, so a CISC / memory variant of pattern should consume the Load IR such that the operand is directly exposed to the instruction. Checkout https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L8986 > > Make a similar change in all the newly added memory patterns. I updated the scalar and vector memory patterns. I'm not completely sure about the vector ones though, so I'll try and test further. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2326312130 From missa at openjdk.org Sat Sep 6 00:42:17 2025 From: missa at openjdk.org (Mohamed Issa) Date: Sat, 6 Sep 2025 00:42:17 GMT Subject: RFR: 8364305: Support AVX10 saturating floating point conversion instructions [v5] In-Reply-To: References: Message-ID: On Thu, 4 Sep 2025 05:40:17 GMT, Jatin Bhateja wrote: >> Mohamed Issa has updated the pull request incrementally with one additional commit since the last revision: >> >> Add AVX 10.2 CPU feature flag to list of verified ones > > test/hotspot/jtreg/compiler/vectorapi/VectorFPtoIntCastTest.java line 90: > >> 88: @Test >> 89: @IR(counts = {IRNode.VECTOR_CAST_F2I, IRNode.VECTOR_SIZE_16, "> 0"}, >> 90: applyIfCPUFeatureOr = {"avx512f", "true", "avx10_2", "true"}) > > You should check for target specific Machine IR which is selected on AVX10_2 targets. New checks are added. > test/hotspot/jtreg/compiler/vectorapi/VectorFPtoIntCastTest.java line 108: > >> 106: @Test >> 107: @IR(counts = {IRNode.VECTOR_CAST_F2L, IRNode.VECTOR_SIZE_8, "> 0"}, >> 108: applyIfCPUFeatureOr = {"avx512dq", "true", "avx10_2", "true"}) > > avx10_2 is a superset of AVX512DQ; we enable all AVX512 features during VM initialization and the IR framework relies on the same. I updated the checks to account for this. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2326312302 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2326312430 From missa at openjdk.org Sat Sep 6 04:09:17 2025 From: missa at openjdk.org (Mohamed Issa) Date: Sat, 6 Sep 2025 04:09:17 GMT Subject: RFR: 8364305: Support AVX10 saturating floating point conversion instructions [v7] In-Reply-To: References: Message-ID: > Intel® AVX10 ISA [1] extensions added new saturating floating point conversion instructions which comply with definitions in section 5.8 of the 2019 IEEE-754 standard.
They can compute floating point to integral type conversions while also handling special inputs such as NaN, +Infinity, and -Infinity. > > Without AVX10.2, the current approach starts by converting the floating point value(s) in the source register to the desired integral value(s) in the destination register. In the scalar case, the CVTTSS2SI (single precision) or CVTTSD2SI (double precision) instruction is used. In the vector case, the CVTTPS2DQ (single precision) or CVTTPD2DQ (double precision) is used. However, if the source contains a special value (NaN, -Infinity, +Infinity, <= Integer.MIN_VALUE, or >= Integer.MAX_VALUE), extra handling is required. The specific sequence of instructions involved depends on the source (single precision vs double precision), destination (long, integer, short, or byte), level of parallelization (scalar vs vector), and supported AVX extension type. Essentially though, the special values are mapped to values (NaN -> 0, -Infinity, <= Integer.MIN_VALUE -> Integer.MIN_VALUE, +Infinity, >= Integer.MAX_VALUE -> Integer.MAX_VALUE) in the integer range with the help of a few temporary regist ers to store intermediate results. > > This change uses the new AVX10.2 scalar (VCVTTSS2SIS or VCVTTSD2SIS) and vector (VCVTTPS2QQS, VCVTTPS2DQS, VCVTTPD2QQS, and VCVTTPD2DQS) instructions on supported platforms to avoid the extra handling described above. Also, the JTREG tests listed below were used to verify correctness with `-XX:-UseSuperWord` / `-XX:+UseSuperWord` options to exercise both scalar and vector paths. The baseline build used is [OpenJDK v26-b11](https://github.com/openjdk/jdk/releases/tag/jdk-26%2B11). > > 1. `jtreg:test/hotspot/jtreg/compiler/codegen/TestByteDoubleVect.java` > 2. `jtreg:test/hotspot/jtreg/compiler/codegen/TestByteFloatVect.java` > 3. `jtreg:test/hotspot/jtreg/compiler/codegen/TestIntDoubleVect.java` > 4. `jtreg:test/hotspot/jtreg/compiler/codegen/TestIntFloatVect.java` > 5. 
`jtreg:test/hotspot/jtreg/compiler/codegen/TestLongDoubleVect.java` > 6. `jtreg:test/hotspot/jtreg/compiler/codegen/TestLongFloatVect.java` > 7. `jtreg:test/hotspot/jtreg/compiler/codegen/TestShortDoubleVect.java` > 8. `jtreg:test/hotspot/jtreg/compiler/codegen/TestShortFloatVect.java` > 9. `jtreg:test/hotspot/jtreg/compiler/vectorapi/VectorFPtoIntCastTest.java` > 10. `jtreg:test/hotspot/jtreg/com... Mohamed Issa has updated the pull request incrementally with one additional commit since the last revision: Avoid machine instruction searches in IR rules for non-AVX10.2 platforms during Vector floating point to integer conversion tests ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26919/files - new: https://git.openjdk.org/jdk/pull/26919/files/6407cc48..709b4439 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26919&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26919&range=05-06 Stats: 24 lines in 2 files changed: 0 ins; 0 del; 24 mod Patch: https://git.openjdk.org/jdk/pull/26919.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26919/head:pull/26919 PR: https://git.openjdk.org/jdk/pull/26919 From duke at openjdk.org Sat Sep 6 05:44:27 2025 From: duke at openjdk.org (duke) Date: Sat, 6 Sep 2025 05:44:27 GMT Subject: Withdrawn: 8327963: C2: fix construction of memory graph around Initialize node to prevent incorrect execution if allocation is removed In-Reply-To: <3jUFOPYDIqmzEywhzf58guwS0qZGBUCMZ3lXeltlS3c=.5c82601f-cf4d-4b2a-a525-1f8f4c7c4a3b@github.com> References: <3jUFOPYDIqmzEywhzf58guwS0qZGBUCMZ3lXeltlS3c=.5c82601f-cf4d-4b2a-a525-1f8f4c7c4a3b@github.com> Message-ID: <-zRTU72Jahqjyst5pleNDuMXos0x4i0S5H_YDZsVTzQ=.60ec04de-ca7e-4639-9545-7105fd199a09@github.com> On Thu, 10 Apr 2025 11:39:36 GMT, Roland Westrelin wrote: > An `Initialize` node for an `Allocate` node is created with a memory > `Proj` of adr type raw memory. 
In order for stores to be captured, the > memory state out of the allocation is a `MergeMem` with slices for the > various object fields/array element set to the raw memory `Proj` of > the `Initialize` node. If `Phi`s need to be created during later > transformations from this memory state, the `Phi` for a particular > slice gets its adr type from the type of the `Proj` which is raw > memory. If during macro expansion, the `Allocate` is found to have no > use and so can be removed, the `Proj` out of the `Initialize` is > replaced by the memory state on input to the `Allocate`. A `Phi` for > some slice for a field of an object will end up with the raw memory > state on input to the `Allocate` node. As a result, memory state at > the `Phi` is incorrect and incorrect execution can happen. > > The fix I propose is, rather than have a single `Proj` for the memory > state out of the `Initialize` with adr type raw memory, to use one > `Proj` per slice added to the memory state after the `Initialize`. Each > of the `Proj` should return the right adr type for its slice. For that > I propose having a new type of `Proj`: `NarrowMemProj` that captures > the right adr type. > > Logic for the construction of the `Allocate`/`Initialize` subgraph is > tweaked so the right adr type captured in its own `NarrowMemProj` is > added to the memory subgraph. Code that removes an allocation or moves > it also has to be changed so it correctly takes the multiple memory > projections out of the `Initialize` node into account. > > One tricky issue is that when EA split types for a scalar replaceable > `Allocate` node: > > 1- the adr type captured in the `NarrowMemProj` becomes out of sync > with the type of the slices for the allocation > > 2- before EA, the memory state for one particular field out of the > `Initialize` node can be used for a `Store` to the just allocated > object or some other.
So we can have a chain of `Store`s, some to > the newly allocated object, some to some other objects, all of them > using the state of `NarrowMemProj` out of the `Initialize`. After > split unique types, the `NarrowMemProj` is for the slice of a > particular allocation. So `Store`s to some other objects shouldn't > use that memory state but the memory state before the `Allocate`. > > For that, I added logic to update the adr type of `NarrowMemProj` > during split unique types and update the memory input of `Store`s that > don't depend on the memory state ... This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/24570 From missa at openjdk.org Sat Sep 6 06:38:20 2025 From: missa at openjdk.org (Mohamed Issa) Date: Sat, 6 Sep 2025 06:38:20 GMT Subject: RFR: 8364305: Support AVX10 saturating floating point conversion instructions [v8] In-Reply-To: References: Message-ID: > Intel® AVX10 ISA [1] extensions added new saturating floating point conversion instructions which comply with definitions in section 5.8 of the 2019 IEEE-754 standard. They can compute floating point to integral type conversions while also handling special inputs such as NaN, +Infinity, and -Infinity. > > Without AVX10.2, the current approach starts by converting the floating point value(s) in the source register to the desired integral value(s) in the destination register. In the scalar case, the CVTTSS2SI (single precision) or CVTTSD2SI (double precision) instruction is used. In the vector case, the CVTTPS2DQ (single precision) or CVTTPD2DQ (double precision) is used. However, if the source contains a special value (NaN, -Infinity, +Infinity, <= Integer.MIN_VALUE, or >= Integer.MAX_VALUE), extra handling is required. 
The specific sequence of instructions involved depends on the source (single precision vs double precision), destination (long, integer, short, or byte), level of parallelization (scalar vs vector), and supported AVX extension type. Essentially though, the special values are mapped to values (NaN -> 0, -Infinity, <= Integer.MIN_VALUE -> Integer.MIN_VALUE, +Infinity, >= Integer.MAX_VALUE -> Integer.MAX_VALUE) in the integer range with the help of a few temporary registers to store intermediate results. > > This change uses the new AVX10.2 scalar (VCVTTSS2SIS or VCVTTSD2SIS) and vector (VCVTTPS2QQS, VCVTTPS2DQS, VCVTTPD2QQS, and VCVTTPD2DQS) instructions on supported platforms to avoid the extra handling described above. Also, the JTREG tests listed below were used to verify correctness with `-XX:-UseSuperWord` / `-XX:+UseSuperWord` options to exercise both scalar and vector paths. The baseline build used is [OpenJDK v26-b11](https://github.com/openjdk/jdk/releases/tag/jdk-26%2B11). > > 1. `jtreg:test/hotspot/jtreg/compiler/codegen/TestByteDoubleVect.java` > 2. `jtreg:test/hotspot/jtreg/compiler/codegen/TestByteFloatVect.java` > 3. `jtreg:test/hotspot/jtreg/compiler/codegen/TestIntDoubleVect.java` > 4. `jtreg:test/hotspot/jtreg/compiler/codegen/TestIntFloatVect.java` > 5. `jtreg:test/hotspot/jtreg/compiler/codegen/TestLongDoubleVect.java` > 6. `jtreg:test/hotspot/jtreg/compiler/codegen/TestLongFloatVect.java` > 7. `jtreg:test/hotspot/jtreg/compiler/codegen/TestShortDoubleVect.java` > 8. `jtreg:test/hotspot/jtreg/compiler/codegen/TestShortFloatVect.java` > 9. `jtreg:test/hotspot/jtreg/compiler/vectorapi/VectorFPtoIntCastTest.java` > 10. `jtreg:test/hotspot/jtreg/com...
Mohamed Issa has updated the pull request incrementally with one additional commit since the last revision: Use applyIfCPUFeatureAnd to check multiple CPU feature pairs in tests ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26919/files - new: https://git.openjdk.org/jdk/pull/26919/files/709b4439..b7d3ae34 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26919&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26919&range=06-07 Stats: 16 lines in 2 files changed: 0 ins; 0 del; 16 mod Patch: https://git.openjdk.org/jdk/pull/26919.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26919/head:pull/26919 PR: https://git.openjdk.org/jdk/pull/26919 From missa at openjdk.org Sat Sep 6 09:44:56 2025 From: missa at openjdk.org (Mohamed Issa) Date: Sat, 6 Sep 2025 09:44:56 GMT Subject: RFR: 8364305: Support AVX10 saturating floating point conversion instructions [v9] In-Reply-To: References: Message-ID: > Intel® AVX10 ISA [1] extensions added new saturating floating point conversion instructions which comply with definitions in section 5.8 of the 2019 IEEE-754 standard. They can compute floating point to integral type conversions while also handling special inputs such as NaN, +Infinity, and -Infinity. > > Without AVX10.2, the current approach starts by converting the floating point value(s) in the source register to the desired integral value(s) in the destination register. In the scalar case, the CVTTSS2SI (single precision) or CVTTSD2SI (double precision) instruction is used. In the vector case, the CVTTPS2DQ (single precision) or CVTTPD2DQ (double precision) is used. However, if the source contains a special value (NaN, -Infinity, +Infinity, <= Integer.MIN_VALUE, or >= Integer.MAX_VALUE), extra handling is required. 
The specific sequence of instructions involved depends on the source (single precision vs double precision), destination (long, integer, short, or byte), level of parallelization (scalar vs vector), and supported AVX extension type. Essentially though, the special values are mapped to values (NaN -> 0, -Infinity, <= Integer.MIN_VALUE -> Integer.MIN_VALUE, +Infinity, >= Integer.MAX_VALUE -> Integer.MAX_VALUE) in the integer range with the help of a few temporary registers to store intermediate results. > > This change uses the new AVX10.2 scalar (VCVTTSS2SIS or VCVTTSD2SIS) and vector (VCVTTPS2QQS, VCVTTPS2DQS, VCVTTPD2QQS, and VCVTTPD2DQS) instructions on supported platforms to avoid the extra handling described above. Also, the JTREG tests listed below were used to verify correctness with `-XX:-UseSuperWord` / `-XX:+UseSuperWord` options to exercise both scalar and vector paths. The baseline build used is [OpenJDK v26-b11](https://github.com/openjdk/jdk/releases/tag/jdk-26%2B11). > > 1. `jtreg:test/hotspot/jtreg/compiler/codegen/TestByteDoubleVect.java` > 2. `jtreg:test/hotspot/jtreg/compiler/codegen/TestByteFloatVect.java` > 3. `jtreg:test/hotspot/jtreg/compiler/codegen/TestIntDoubleVect.java` > 4. `jtreg:test/hotspot/jtreg/compiler/codegen/TestIntFloatVect.java` > 5. `jtreg:test/hotspot/jtreg/compiler/codegen/TestLongDoubleVect.java` > 6. `jtreg:test/hotspot/jtreg/compiler/codegen/TestLongFloatVect.java` > 7. `jtreg:test/hotspot/jtreg/compiler/codegen/TestShortDoubleVect.java` > 8. `jtreg:test/hotspot/jtreg/compiler/codegen/TestShortFloatVect.java` > 9. `jtreg:test/hotspot/jtreg/compiler/vectorapi/VectorFPtoIntCastTest.java` > 10. `jtreg:test/hotspot/jtreg/com...
Mohamed Issa has updated the pull request incrementally with one additional commit since the last revision: Check for scalar casting instead of vector casting in tests when disabling vector alignment or compact object headers ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26919/files - new: https://git.openjdk.org/jdk/pull/26919/files/b7d3ae34..4d8f3ab6 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26919&range=08 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26919&range=07-08 Stats: 4 lines in 1 file changed: 0 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/26919.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26919/head:pull/26919 PR: https://git.openjdk.org/jdk/pull/26919 From duke at openjdk.org Mon Sep 8 01:25:18 2025 From: duke at openjdk.org (erifan) Date: Mon, 8 Sep 2025 01:25:18 GMT Subject: RFR: 8366588: VectorAPI: Re-intrinsify VectorMask.laneIsSet where the input index is a variable In-Reply-To: References: Message-ID: On Fri, 5 Sep 2025 10:12:35 GMT, Aleksey Shipilev wrote: >> Intrinsic support for `VectorMask.laneIsSet` with a **variable** input index was introduced in PR #14200, but was inadvertently broken by PR #25673. This PR restores the intrinsic functionality and adds some JTReg tests. 
>> >> Benchmarks on Nvidia Grace machine with 128-bit SVE: >> >> Benchmark Unit Before Score Error After Score Error Uplift >> microMaskLaneIsSetByte128_var ops/ms 21702.14415 91.902159 103472.9391 36.057447 4.767867 >> microMaskLaneIsSetByte64_var ops/ms 21468.51868 107.94177 103365.6561 69.47736 4.814754 >> microMaskLaneIsSetDouble128_var ops/ms 77489.32791 153.242699 413499.4127 311.854079 5.336211 >> microMaskLaneIsSetFloat128_var ops/ms 41034.95204 399.421823 206840.0988 74.702234 5.040583 >> microMaskLaneIsSetFloat64_var ops/ms 77607.40268 175.938921 413745.3001 149.716794 5.33126 >> microMaskLaneIsSetInt128_var ops/ms 41452.48893 76.143208 206845.9754 59.371129 4.989953 >> microMaskLaneIsSetInt64_var ops/ms 77726.2542 173.180518 413427.8838 363.575023 5.319024 >> microMaskLaneIsSetLong128_var ops/ms 77646.11218 177.496587 413403.4404 236.609314 5.3242 >> microMaskLaneIsSetShort128_var ops/ms 21374.93265 48.13101 103417.4618 34.827021 4.838259 >> microMaskLaneIsSetShort64_var ops/ms 41066.19395 353.320621 206801.109 106.408938 5.035799 >> >> >> Benchmarks on Intel 6444y machine with 512-bit avx3: >> >> Benchmark Unit Before Score Error After Score Error Uplift >> microMaskLaneIsSetByte128_var ops/ms 57658.45497 240.209309 211643.8406 29.214532 3.670647 >> microMaskLaneIsSetByte256_var ops/ms 57451.68169 116.994128 211609.4652 160.48513 3.683259 >> microMaskLaneIsSetByte512_var ops/ms 57530.22411 311.63868 199802.8084 408.144015 3.473005 >> microMaskLaneIsSetByte64_var ops/ms 57642.2672 161.406221 205252.4464 196.86852 3.560797 >> microMaskLaneIsSetDouble256_var ops/ms 114401.3789 231.797375 361400.344 565.593984 3.159055 >> microMaskLaneIsSetDouble512_var ops/ms 57379.27882 159.699503 211476.1138 136.980026 3.685583 >> microMaskLaneIsSetFloat128_var ops/ms 113943.9512 141.062663 360855.3915 494.471996 3.166955 >> microMaskLaneIsSetFloat256_var ops/ms 57682.78182 138.142053 211659.5098 30.167972 3.66937 >> microMaskLaneIsSetFloat512_var ops/ms 57617.66405 
301.748599 211246.8588 597.18949 3.666355 >> microMaskLaneIsSetInt128_var ops/ms 113914.5062 118.681382 360856.4465 555.097397 3.167783 >> microMaskLaneIsSetInt256_var ops/ms 57681.79883 112.391639 211555.6742 217.556981 3.667633 >> microMaskLaneIsSetInt512_var ops/ms 573... > > test/micro/org/openjdk/bench/jdk/incubator/vector/VectorExtractBenchmark.java line 34: > >> 32: @Warmup(iterations = 5, time = 1) >> 33: @Measurement(iterations = 5, time = 1) >> 34: @Fork(value = 1, jvmArgs = {"--add-modules=jdk.incubator.vector"}) > > Don't do 1 fork, do at least 3. The test results show that this test is stable, so I think forking once is enough? We have many JMH benchmarks that fork once. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27113#discussion_r2328949227 From xgong at openjdk.org Mon Sep 8 02:33:14 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 8 Sep 2025 02:33:14 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation [v3] In-Reply-To: References: <1XFXtkTlDshGtoxEdLVg0f2J2rtn4wz7CdUB9pb9N2g=.25e7e0b5-8468-4d91-adb9-c459bda40933@github.com> Message-ID: On Fri, 5 Sep 2025 10:32:58 GMT, Emanuel Peter wrote: > To me a `false` means this: If we support gater/scalter, then we do not need a vector index, we can do without it. > > Is that correct? Thanks for your review! Actually gather/scatter always need an index input. What this function want to decide is how the index elements are passed to the operations. It doesn't take an assumption whether vector gather_load/scatter_store is supported or not in backend. It just checks whether the `index` input of such operations requires a vector register or an address which stores the indexes. Currently, on x86, it passes an array address for subword types (the indexes are then will be loaded one-by-one in backend codegen). 
However, on AArch64, we require a vector for all types instead (the indexes have already been loaded into vector registers at the IR level). > The current platform does not support vector gather-load or scatter-store at all. I'm sorry that I wasn't clear about @fg1417's second statement. Whether the current platform supports vector gather-load/scatter-store is still decided by `Matcher::match_rule_supported_vector()` like other operations. It returns `false` here just because arm doesn't support any vector operations. Assuming it wanted to support a vector gather/scatter, the index input must not be a vector, right? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2328999842 From xgong at openjdk.org Mon Sep 8 02:33:15 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 8 Sep 2025 02:33:15 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation [v5] In-Reply-To: References: Message-ID: <0lnaxN7YsQEddGZfWLgFi2YOl_XtXntDoHRr57Bjp7k=.946b3e40-04c1-4eb5-a205-53347cdc91eb@github.com> On Fri, 5 Sep 2025 10:47:58 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/vectornode.hpp line 1769: >> >>> 1767: // dst = [h g f e d c b a] >>> 1768: // >>> 1769: class VectorConcatenateNode : public VectorNode { >> >> That semantic is not quite what I would expect from `Concatenate`. Maybe we can call it something else? >> `VectorConcatenateAndNarrowNode`? > > Have you considered using `2x Cast + Concatenate` instead, and just matching that in the backend? I don't remember how to do the mere Concat, but it should be possible via the `unslice` or some other operation that concatenates two vectors. > That semantic is not quite what I would expect from `Concatenate`. Maybe we can call it something else? `VectorConcatenateAndNarrowNode`? Yeah, `VectorConcatenateAndNarrowNode` would be a much better match. I just thought the name would be too long. I will change it in the next commit.
Thanks for your suggestion! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2329001531 From xgong at openjdk.org Mon Sep 8 03:03:13 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 8 Sep 2025 03:03:13 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation [v5] In-Reply-To: <0lnaxN7YsQEddGZfWLgFi2YOl_XtXntDoHRr57Bjp7k=.946b3e40-04c1-4eb5-a205-53347cdc91eb@github.com> References: <0lnaxN7YsQEddGZfWLgFi2YOl_XtXntDoHRr57Bjp7k=.946b3e40-04c1-4eb5-a205-53347cdc91eb@github.com> Message-ID: On Mon, 8 Sep 2025 02:30:20 GMT, Xiaohong Gong wrote: >> Have you considered using `2x Cast + Concatenate` instead, and just matching that in the backend? I don't remember how to do the mere Concat, but it should be possible via the `unslice` or some other operation that concatenates two vectors. > >> That semantic is not quite what I would expect from `Concatenate`. Maybe we can call it something else? `VectorConcatenateAndNarrowNode`? > > Yeah, `VectorConcatenateAndNarrowNode` would be a much better match. I just thought the name would be too long. I will change it in the next commit. Thanks for your suggestion! > Have you considered using `2x Cast + Concatenate` instead, and just matching that in the backend? I don't remember how to do the mere Concat, but it should be possible via the `unslice` or some other operation that concatenates two vectors. Would using `2x Cast + Concatenate` make the IR and the match rules more complex? A plain concatenate would be something like `vector slice` in the Vector API. It concatenates two vectors into one, with an index denoting the merging position. And it requires that the vector types be the same for the two input vectors and the dst vector.
Hence, if we want to split this operation into cast and concatenate, the IR would be as follows (assuming the original type of `v1/v2` is `4-int` and the result type is `8-short`): 1) Narrow the two input vectors: `v1 = VectorCast(v1) (4-short); v2 = VectorCast(v2) (4-short)`. The number of lanes is unchanged while the element size is halved, so the vector length in bytes is halved as well. 2) Resize `v1` and `v2` to double the vector length, with the higher bits cleared: `v1 = VectorReinterpret(v1) (8-short); v2 = VectorReinterpret(v2) (8-short)`. 3) Concatenate `v1` and `v2` like a slice, with the position at the middle of the vector length: `v = VectorSlice(v1, v2, 4) (8-short)`. If we want to merge these IR nodes in the backend, would the match rule be more complex? I will think about it. >> And what about the vector length being consistent between `vec1`, `vec2` and `vt`? > >> What about asserting that `vec1` and `vec2` have the same `vect`? > > That would be fine. Thanks! I will add it in next commit. > And what about the vector length being consistent between `vec1`, `vec2` and `vt`? Yes, I think the vector length in bytes must be consistent. I will add the assertion as well.
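To make the three steps listed in this reply concrete, here is a small scalar model in Java. It is only a sketch under stated assumptions, not C2 IR: the class and the helpers `narrow`, `resize` and `mergeAt` are hypothetical stand-ins for the `VectorCast`, `VectorReinterpret` and "slice-like" concatenate steps, and `mergeAt` encodes my reading of "concatenate at the middle position" (low lanes from the first input, remaining lanes from the start of the second), which is not necessarily the exact lane semantics of `Vector.slice` in the Vector API. The `fused` method models the single concatenate-and-narrow node for comparison.

```java
import java.util.Arrays;

final class ConcatNarrowSketch {
    // Step 1 (models VectorCast): narrow each int lane to short, same lane count.
    static short[] narrow(int[] v) {
        short[] r = new short[v.length];
        for (int i = 0; i < v.length; i++) {
            r[i] = (short) v[i];          // Java narrowing: keeps the low 16 bits
        }
        return r;
    }

    // Step 2 (models VectorReinterpret): double the lane count, upper lanes are zero.
    static short[] resize(short[] v) {
        return Arrays.copyOf(v, v.length * 2);
    }

    // Step 3 (assumed "slice-like" merge): lanes [0, pos) come from lo,
    // lanes [pos, length) come from the first lanes of hi.
    static short[] mergeAt(short[] lo, short[] hi, int pos) {
        short[] r = Arrays.copyOf(lo, lo.length);
        System.arraycopy(hi, 0, r, pos, r.length - pos);
        return r;
    }

    // The three separate steps chained together.
    static short[] threeStep(int[] v1, int[] v2) {
        short[] a = resize(narrow(v1));
        short[] b = resize(narrow(v2));
        return mergeAt(a, b, v1.length);
    }

    // Model of a single fused concatenate-and-narrow operation.
    static short[] fused(int[] v1, int[] v2) {
        short[] r = new short[v1.length + v2.length];
        for (int i = 0; i < v1.length; i++) {
            r[i] = (short) v1[i];
        }
        for (int i = 0; i < v2.length; i++) {
            r[v1.length + i] = (short) v2[i];
        }
        return r;
    }

    public static void main(String[] args) {
        int[] v1 = {1, 2, 3, 70000};      // 70000 does not fit in a short
        int[] v2 = {5, 6, 7, -1};
        System.out.println(Arrays.toString(threeStep(v1, v2)));
        System.out.println(Arrays.equals(threeStep(v1, v2), fused(v1, v2)));
    }
}
```

In this model the fused form and the three-step form produce the same lanes, so the open question in the thread is about IR and match-rule complexity rather than about the result.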
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2329024826 PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2329027242 From xgong at openjdk.org Mon Sep 8 03:03:14 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 8 Sep 2025 03:03:14 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation [v5] In-Reply-To: References: Message-ID: On Fri, 5 Sep 2025 10:41:15 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/vectornode.hpp line 1774: >> >>> 1772: : VectorNode(vec1, vec2, vt) { >>> 1773: assert(type2aelembytes(vec1->bottom_type()->is_vect()->element_basic_type()) == >>> 1774: type2aelembytes(vt->element_basic_type()) * 2, "must be half size"); >> >> What about asserting that `vec1` and `vec2` have the same `vect`? > > And what about the vector length being consistent between `vec1`, `vec2` and `vt`? > What about asserting that `vec1` and `vec2` have the same `vect`? That would be fine. Thanks! I will add it in next commit. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2329026579 From xgong at openjdk.org Mon Sep 8 03:15:22 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 8 Sep 2025 03:15:22 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation [v5] In-Reply-To: References: Message-ID: On Fri, 5 Sep 2025 10:44:28 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/vectornode.hpp line 1841: >> >>> 1839: >>> 1840: // Unpack the elements to twice size. >>> 1841: class VectorMaskWidenNode : public VectorNode { >> >> Can you add a visual example like above for `VectorConcatenateNode`, please? > > Did you consider the alternative of `Extract` + `Cast`? Not sure if that would be better, you know more about the code complexity. It would just allow us to have one fewer nodes. It just has the `Extract` node to extract an element from vector in C2, right? 
Extracting the lowest part can be implemented with `VectorReinterpret` easily. But how about the higher parts? Maybe this can also be implemented with operations like `slice`? But it seems this would also make the IR more complex. For `Cast`, we have `VectorCastMask` now, but it assumes the vector length is the same for input and output. So the `VectorReinterpret` or a `VectorExtract` is still needed. I can try separating the IR, but I guess an additional new node is still necessary. > It would just allow us to have one fewer nodes. This is also what I expect, really. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2329040437 From epeter at openjdk.org Mon Sep 8 05:56:19 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 8 Sep 2025 05:56:19 GMT Subject: RFR: 8366878: Improve flags of compiler/loopopts/superword/TestAlignVectorFuzzer.java In-Reply-To: References: Message-ID: <5PWmoHhlhYHDD7WBje51yGzGHr1Dq3QCDRNApA64MmY=.ed2e0b11-e144-4e24-97dd-7a7ccdd208c0@github.com> On Fri, 5 Sep 2025 16:46:09 GMT, Manuel Hässig wrote: > The test definitions of `TestAlignVectorFuzzer.java` all contain `printcompilation` directives. These are redundant and slow down the test execution of a test that already often times out. @eme64 also suggested adding a `compileonly` directive to one of the four tests. > > Testing: > - [ ] Github Actions > - [ ] tier1 and stress testing (features `TestAlignVectorFuzzer.java`) test/hotspot/jtreg/compiler/loopopts/superword/TestAlignVectorFuzzer.java line 35: > 33: * -XX:CompileCommand=compileonly,compiler.loopopts.superword.TestAlignVectorFuzzer::* > 34: * compiler.loopopts.superword.TestAlignVectorFuzzer > 35: */ I think it would be good if we also had the same run but without the compileonly.
That's what I meant by duplication ;) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27122#discussion_r2329202898 From epeter at openjdk.org Mon Sep 8 06:01:11 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 8 Sep 2025 06:01:11 GMT Subject: RFR: 8366702: C2 SuperWord: refactor VTransform vector nodes In-Reply-To: References: Message-ID: <4IAtrzmFmfx0HgchA6HNgqifFCbTFxAmfJyQgym5O3w=.c0ed236e-cd04-447e-ac6f-1e4cd14ebdb8@github.com> On Fri, 5 Sep 2025 17:25:01 GMT, Manuel Hässig wrote: >> I'm working on cost-modeling, and am integrating some smaller changes from this proof-of-concept PR: >> https://github.com/openjdk/jdk/pull/20964 >> [See plan overview.](https://bugs.openjdk.org/browse/JDK-8340093) >> >> This is a pure refactoring - no change in behaviour. I'm presenting it like this because it will make reviews easier. >> >> --------------------------------- >> >> I have to say: I'm very sorry for this refactoring. I took some decisions in https://github.com/openjdk/jdk/pull/19719 that I'm now partially undoing. I moved too much logic from `SuperWord::output` (now called `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`) to the `VTransform...Node::apply`. https://github.com/openjdk/jdk/pull/19719 was a roughly 1.5k line change, and I took about a 0.3k misstep that I'm now correcting here ;) >> >> I had accidentally made the `VTransformGraph` too close to the `PackSet`, and not close enough to the future vectorized C2 Graph. And that makes some future changes hard. >> >> My vision: >> - VLoop / VLoopAnalyzer look at the scalar loop and prepare it for SuperWord >> - SuperWord creates the `PackSet`: some nodes are packed, all others are scalar.
>> - `SuperWordVTransformBuilder` converts the `PackSet` into the `VTransformGraph` >> - The `VTransformGraph` very closely represents the C2 vectorized loop after vectorization >> - It does not need to know which `nodes` it packs, it rather just needs to know how to generate the new vector nodes >> - That means it is straight-forward to compute cost >> - And it also makes optimizations on that graph easier >> - And the `apply` methods are simpler too >> >> ---------------------------------- >> >> So therefore, the main goal was to make the `VTransform...Node::apply` calls simpler again. And move the logic back to `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`. >> >> One important step to making the the `VTransformGraph` less of a `PackSet` is to remove reliance on `nodes` for the vector nodes. >> >> What I did: >> - Moving a lot of the logic in `VTransformElementWiseVectorNode::apply` to `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`. >> - Will make it easier to optimize and compute cost in future RFE's. >> - `VTransformVectorNodePrototype`: packs a lot of the info for `VTransformVectorNode`. >> - pass info about `bt`, `vlen`, `sopc` instead of the `pack` -> allows us to eventually remove the dependency on `nodes`. >> - New vector nodes, they are special cases I split away from ... > > src/hotspot/share/opto/superwordVTransformBuilder.cpp line 155: > >> 153: const VTransformVectorNodePrototype prototype = VTransformVectorNodePrototype::make_from_pack(pack, _vloop_analyzer); >> 154: const int sopc = prototype.scalar_opcode(); >> 155: const uint vlen = prototype.vector_length(); > > As someone that is not familiar with the Superword code: is "pack size" and "vector length" often used interchangeably? if not, then I would keep the name. `SuperWord` works with packs, i.e. packing scalars. So then we can measure the size of a pack, that measures how many scalars we have packed. But `VTransform` is a "preview" of the new vectorized C2 IR. 
And there, we talk about the length of vectors. So I think this is exactly the right place to do the transition ;) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27056#discussion_r2329211537 From epeter at openjdk.org Mon Sep 8 06:04:11 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 8 Sep 2025 06:04:11 GMT Subject: RFR: 8366702: C2 SuperWord: refactor VTransform vector nodes In-Reply-To: <-ntIikGlF88WGhDkEeTA2rU7xWHBuSRlaLUVwDvDzUY=.b482c2ed-ddc5-46e8-b7ad-2ee8e9f8fd67@github.com> References: <-ntIikGlF88WGhDkEeTA2rU7xWHBuSRlaLUVwDvDzUY=.b482c2ed-ddc5-46e8-b7ad-2ee8e9f8fd67@github.com> Message-ID: On Fri, 5 Sep 2025 17:37:21 GMT, Manuel Hässig wrote: >> src/hotspot/share/opto/vtransform.cpp line 933: >> >>> 931: phase->register_new_node(vn, apply_state.vloop().cl()); >>> 932: phase->igvn()._worklist.push(vn); >>> 933: VectorNode::trace_new_vector(vn, "AutoVectorization"); >> >> Removing the argument here removes yet another dependency on the old scalar graph. We only needed it for using the same control as the old graph - but that is not necessary, we can just use the CountedLoop as control, which is good enough. > >> we can just use the CountedLoop as control, which is good enough > > For my understanding: this is because we can only vectorize if there are no other control dependencies in the loop? Setting control is only for PhaseIdealLoop. It sets the internal ctrl that other loop-opts would rely on if we kept on optimizing the loop in the same PhaseIdealLoop. But if we set major progress, that means that we have messed up the graph so much that the state of PhaseIdealLoop may no longer be correct/accurate enough. So if we set major progress, we are essentially allowed to mess up the PhaseIdealLoop state.
I still have to set ctrl, but it does not matter if it is not correct ;) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27056#discussion_r2329215458 From epeter at openjdk.org Mon Sep 8 06:07:12 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 8 Sep 2025 06:07:12 GMT Subject: RFR: 8366702: C2 SuperWord: refactor VTransform vector nodes In-Reply-To: References: Message-ID: On Fri, 5 Sep 2025 17:20:36 GMT, Manuel Hässig wrote: >> I'm working on cost-modeling, and am integrating some smaller changes from this proof-of-concept PR: >> https://github.com/openjdk/jdk/pull/20964 >> [See plan overview.](https://bugs.openjdk.org/browse/JDK-8340093) >> >> This is a pure refactoring - no change in behaviour. I'm presenting it like this because it will make reviews easier. >> >> --------------------------------- >> >> I have to say: I'm very sorry for this refactoring. I took some decisions in https://github.com/openjdk/jdk/pull/19719 that I'm now partially undoing. I moved too much logic from `SuperWord::output` (now called `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`) to the `VTransform...Node::apply`. https://github.com/openjdk/jdk/pull/19719 was a roughly 1.5k line change, and I took about a 0.3k misstep that I'm now correcting here ;) >> >> I had accidentally made the `VTransformGraph` too close to the `PackSet`, and not close enough to the future vectorized C2 Graph. And that makes some future changes hard. >> >> My vision: >> - VLoop / VLoopAnalyzer look at the scalar loop and prepare it for SuperWord >> - SuperWord creates the `PackSet`: some nodes are packed, all others are scalar.
>> - `SuperWordVTransformBuilder` converts the `PackSet` into the `VTransformGraph` >> - The `VTransformGraph` very closely represents the C2 vectorized loop after vectorization >> - It does not need to know which `nodes` it packs, it rather just needs to know how to generate the new vector nodes >> - That means it is straight-forward to compute cost >> - And it also makes optimizations on that graph easier >> - And the `apply` methods are simpler too >> >> ---------------------------------- >> >> So therefore, the main goal was to make the `VTransform...Node::apply` calls simpler again. And move the logic back to `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`. >> >> One important step to making the the `VTransformGraph` less of a `PackSet` is to remove reliance on `nodes` for the vector nodes. >> >> What I did: >> - Moving a lot of the logic in `VTransformElementWiseVectorNode::apply` to `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`. >> - Will make it easier to optimize and compute cost in future RFE's. >> - `VTransformVectorNodePrototype`: packs a lot of the info for `VTransformVectorNode`. >> - pass info about `bt`, `vlen`, `sopc` instead of the `pack` -> allows us to eventually remove the dependency on `nodes`. >> - New vector nodes, they are special cases I split away from ... > > src/hotspot/share/opto/superwordVTransformBuilder.cpp line 115: > >> 113: VTransformBoolVectorNode* vtn_mask_cmp = vtn->in_req(3)->isa_BoolVector(); >> 114: if (vtn_mask_cmp->test()._is_negated) { >> 115: vtn->swap_req(1, 2); // swap if test was negated. > > Suggestion: > > // Inputs must be permuted from (mask, blend1, blend2) -> (blend1, blend2, mask) > // Or, if the test was negated: (blend1, blend2, mask) -> (blend2, blend1, mask) > vtn->swap_req(1, 3); // Now, the reqs are negated. > VTransformBoolVectorNode* vtn_mask_cmp = vtn->in_req(3)->isa_BoolVector(); > if (!vtn_mask_cmp->test()._is_negated) { > vtn->swap_req(1, 2); // Swap if test was not negated. 
> > This would save to a swap, but I am unsure if this is also more readable. It would be a nice optimization if swap was expensive. But it is not really. I think I prefer the more readable solution here. But it's a bit of a toss-up. If another reviewer has a preference I'm willing to go with the majority ;) > src/hotspot/share/opto/superwordVTransformBuilder.cpp line 154: > >> 152: Node* p0 = pack->at(0); >> 153: const VTransformVectorNodePrototype prototype = VTransformVectorNodePrototype::make_from_pack(pack, _vloop_analyzer); >> 154: const int sopc = prototype.scalar_opcode(); > > Suggestion: > > const int sopc = prototype.scalar_opcode(); > > Nit: whitespace > Or were you trying to align with the line below? Personally, I find this a bit too much, but up to you. Yes, I was trying to get alignment. I'll try some alternatives. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27056#discussion_r2329219093 PR Review Comment: https://git.openjdk.org/jdk/pull/27056#discussion_r2329220235 From dzhang at openjdk.org Mon Sep 8 06:08:10 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Mon, 8 Sep 2025 06:08:10 GMT Subject: RFR: 8367048: RISC-V: Correct pipeline descriptions of the architecture Message-ID: Hi, Can you help to review this patch? Thanks! This patch updates the RISC-V pipeline attributes to variable_size_instructions to properly account for the 2-byte compressed instructions from the C extension. Furthermore, it increases the max_instructions_per_bundle to 4 and adjusts the instruction_unit_size to match 4-issue RISC-V hardware like the UR-CP100. 
### Test - [x] Run tier1 and tier2 on sg2042 ------------- Commit messages: - 8367048: RISC-V: Correct pipeline descriptions of the architecture Changes: https://git.openjdk.org/jdk/pull/27134/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27134&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8367048 Stats: 12 lines in 1 file changed: 5 ins; 0 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/27134.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27134/head:pull/27134 PR: https://git.openjdk.org/jdk/pull/27134 From epeter at openjdk.org Mon Sep 8 06:16:13 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 8 Sep 2025 06:16:13 GMT Subject: RFR: 8366702: C2 SuperWord: refactor VTransform vector nodes In-Reply-To: References: Message-ID: <9l9xVu-ih86ETpJvp7_L85Jlzzv-lOpiMGR2C004T4E=.48f81a2d-d964-445b-a448-5b2184e4b6d6@github.com> On Mon, 8 Sep 2025 06:05:02 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/superwordVTransformBuilder.cpp line 154: >> >>> 152: Node* p0 = pack->at(0); >>> 153: const VTransformVectorNodePrototype prototype = VTransformVectorNodePrototype::make_from_pack(pack, _vloop_analyzer); >>> 154: const int sopc = prototype.scalar_opcode(); >> >> Suggestion: >> >> const int sopc = prototype.scalar_opcode(); >> >> Nit: whitespace >> Or were you trying to align with the line below? Personally, I find this a bit too much, but up to you. > > Yes, I was trying to get alignment. I'll try some alternatives. Variant 1: no alignment ~ 787 uint vlen = vector_length(); ~ 788 int vopc = _vector_opcode; 789 BasicType bt = element_basic_type(); 790 const TypeVect* vt = TypeVect::make(bt, vlen); It looks a bit noisy to me. Variant 2: align on assignment operator ~ 787 uint vlen = vector_length(); ~ 788 int vopc = _vector_opcode; ~ 789 BasicType bt = element_basic_type(); 790 const TypeVect* vt = TypeVect::make(bt, vlen); Better. But somehow I'd still prefer if the names were also aligned. 
Question if left or right aligned looks better. 3a ~ 787 uint vlen = vector_length(); ~ 788 int vopc = _vector_opcode; ~ 789 BasicType bt = element_basic_type(); ~ 790 const TypeVect* vt = TypeVect::make(bt, vlen); 3b ~ 787 uint vlen = vector_length(); ~ 788 int vopc = _vector_opcode; ~ 789 BasicType bt = element_basic_type(); 790 const TypeVect* vt = TypeVect::make(bt, vlen); Personally the last one looks the calmest to me. But it's a bit of a funky choice compared to the rest of hotspot, so it is probably just my own brain that thinks it is good. I think I'm going with variant 2, since that is a little less controversial I think, and still a bit better than no alignment at all. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27056#discussion_r2329234097 From epeter at openjdk.org Mon Sep 8 06:21:11 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 8 Sep 2025 06:21:11 GMT Subject: RFR: 8366702: C2 SuperWord: refactor VTransform vector nodes In-Reply-To: <9l9xVu-ih86ETpJvp7_L85Jlzzv-lOpiMGR2C004T4E=.48f81a2d-d964-445b-a448-5b2184e4b6d6@github.com> References: <9l9xVu-ih86ETpJvp7_L85Jlzzv-lOpiMGR2C004T4E=.48f81a2d-d964-445b-a448-5b2184e4b6d6@github.com> Message-ID: On Mon, 8 Sep 2025 06:13:21 GMT, Emanuel Peter wrote: >> Yes, I was trying to get alignment. I'll try some alternatives. > > Variant 1: no alignment > > ~ 787 uint vlen = vector_length(); > ~ 788 int vopc = _vector_opcode; > 789 BasicType bt = element_basic_type(); > 790 const TypeVect* vt = TypeVect::make(bt, vlen); > > It looks a bit noisy to me. > > Variant 2: align on assignment operator > > ~ 787 uint vlen = vector_length(); > ~ 788 int vopc = _vector_opcode; > ~ 789 BasicType bt = element_basic_type(); > 790 const TypeVect* vt = TypeVect::make(bt, vlen); > > Better. But somehow I'd still prefer if the names were also aligned. Question if left or right aligned looks better. 
> > 3a > > ~ 787 uint vlen = vector_length(); > ~ 788 int vopc = _vector_opcode; > ~ 789 BasicType bt = element_basic_type(); > ~ 790 const TypeVect* vt = TypeVect::make(bt, vlen); > > 3b > > ~ 787 uint vlen = vector_length(); > ~ 788 int vopc = _vector_opcode; > ~ 789 BasicType bt = element_basic_type(); > 790 const TypeVect* vt = TypeVect::make(bt, vlen); > > > Personally the last one looks the calmest to me. But it's a bit of a funky choice compared to the rest of hotspot, so it is probably just my own brain that thinks it is good. > I think I'm going with variant 2, since that is a little less controversial I think, and still a bit better than no alignment at all. I'm also refactoring away some of the assignments, and put the value directly at the use-site. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27056#discussion_r2329242463 From fyang at openjdk.org Mon Sep 8 06:39:12 2025 From: fyang at openjdk.org (Fei Yang) Date: Mon, 8 Sep 2025 06:39:12 GMT Subject: RFR: 8367048: RISC-V: Correct pipeline descriptions of the architecture In-Reply-To: References: Message-ID: On Mon, 8 Sep 2025 05:13:32 GMT, Dingli Zhang wrote: > Hi, > Can you help to review this patch? Thanks! > > This patch updates the RISC-V pipeline attributes to variable_size_instructions to properly account for the 2-byte compressed instructions from the C extension. > Furthermore, it increases the max_instructions_per_bundle to 4 and adjusts the instruction_unit_size to match 4-issue RISC-V hardware like the UR-CP100. > > ### Test > - [x] Run tier1 and tier2 on sg2042 Looks good. Seems a leftover when adding support for compressed instructions. ------------- Marked as reviewed by fyang (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/27134#pullrequestreview-3195142444 From chagedorn at openjdk.org Mon Sep 8 06:47:12 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 8 Sep 2025 06:47:12 GMT Subject: RFR: 8366702: C2 SuperWord: refactor VTransform vector nodes In-Reply-To: References: <-ntIikGlF88WGhDkEeTA2rU7xWHBuSRlaLUVwDvDzUY=.b482c2ed-ddc5-46e8-b7ad-2ee8e9f8fd67@github.com> Message-ID: On Mon, 8 Sep 2025 06:01:35 GMT, Emanuel Peter wrote: >>> we can just use the CountedLoop as control, which is good enough >> >> For my understanding: this is because we can only vectorize if there are no other control dependencies in the loop? > > Setting control is only for PhaseIdealLoop. It sets the internal ctrl that other loop-opts would rely on if we kept on optimizing the loop in the same PhaseIdealLoop. But if we set major progress, that means that we have messed up the graph so much that the state of PhaseIdealLoop may no longer be correct/accurate enough. > > So if we set major progress, we are essentially allowed to mess up the PhaseIdealLoop state. I still have to set ctrl, but it does not matter if it is not correct ;) I think it should generally still be correct but might not need to be as accurate as possible. You never know if some code will rely on the correctness later in the same loop opts - might not today but at some point, especially when trying to add some verification code. So, IIUC, you are just more conservative/less accurate now while still being correct. Maybe you can tweak the comment to express that more clearly since "not always correct" could also imply some actual illegal control. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27056#discussion_r2329284434 From fjiang at openjdk.org Mon Sep 8 06:55:10 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Mon, 8 Sep 2025 06:55:10 GMT Subject: RFR: 8367048: RISC-V: Correct pipeline descriptions of the architecture In-Reply-To: References: Message-ID: On Mon, 8 Sep 2025 05:13:32 GMT, Dingli Zhang wrote: > Hi, > Can you help to review this patch? Thanks! > > This patch updates the RISC-V pipeline attributes to variable_size_instructions to properly account for the 2-byte compressed instructions from the C extension. > Furthermore, it increases the max_instructions_per_bundle to 4 and adjusts the instruction_unit_size to match 4-issue RISC-V hardware like the UR-CP100. > > ### Test > - [x] Run tier1 and tier2 on sg2042 Looks good. Thanks for finding this! ------------- Marked as reviewed by fjiang (Committer). PR Review: https://git.openjdk.org/jdk/pull/27134#pullrequestreview-3195188030 From epeter at openjdk.org Mon Sep 8 07:00:54 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 8 Sep 2025 07:00:54 GMT Subject: RFR: 8366702: C2 SuperWord: refactor VTransform vector nodes [v2] In-Reply-To: References: Message-ID: > I'm working on cost-modeling, and am integrating some smaller changes from this proof-of-concept PR: > https://github.com/openjdk/jdk/pull/20964 > [See plan overview.](https://bugs.openjdk.org/browse/JDK-8340093) > > This is a pure refactoring - no change in behaviour. I'm presenting it like this because it will make reviews easier. > > --------------------------------- > > I have to say: I'm very sorry for this refactoring. I took some decisions in https://github.com/openjdk/jdk/pull/19719 that I'm now partially undoing. I moved too much logic from `SuperWord::output` (now called `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`) to the `VTransform...Node::apply`. 
https://github.com/openjdk/jdk/pull/19719 was a roughly 1.5k line change, and I took about a 0.3k misstep that I'm now correcting here ;) > > I had accidentally made the `VTransformGraph` too close to the `PackSet`, and not close enough to the future vectorized C2 Graph. And that makes some future changes hard. > > My vision: > - VLoop / VLoopAnalyzer look at the scalar loop and prepare it for SuperWord > - SuperWord creates the `PackSet`: some nodes are packed, all others are scalar. > - `SuperWordVTransformBuilder` converts the `PackSet` into the `VTransformGraph` > - The `VTransformGraph` very closely represents the C2 vectorized loop after vectorization > - It does not need to know which `nodes` it packs, it rather just needs to know how to generate the new vector nodes > - That means it is straightforward to compute cost > - And it also makes optimizations on that graph easier > - And the `apply` methods are simpler too > > ---------------------------------- > > So therefore, the main goal was to make the `VTransform...Node::apply` calls simpler again. And move the logic back to `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`. > > One important step to making the `VTransformGraph` less of a `PackSet` is to remove reliance on `nodes` for the vector nodes. > > What I did: > - Moving a lot of the logic in `VTransformElementWiseVectorNode::apply` to `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`. > - Will make it easier to optimize and compute cost in future RFEs. > - `VTransformVectorNodePrototype`: packs a lot of the info for `VTransformVectorNode`. > - pass info about `bt`, `vlen`, `sopc` instead of the `pack` -> allows us to eventually remove the dependency on `nodes`. > - New vector nodes, they are special cases I split away from `VTransformElementWiseVectorNode`: > - `VTransformReinterpretVectorN... 
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: review comment implemented ------------- Changes: - all: https://git.openjdk.org/jdk/pull/27056/files - new: https://git.openjdk.org/jdk/pull/27056/files/05ee2800..9bc510e4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=27056&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27056&range=00-01 Stats: 49 lines in 3 files changed: 7 ins; 12 del; 30 mod Patch: https://git.openjdk.org/jdk/pull/27056.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27056/head:pull/27056 PR: https://git.openjdk.org/jdk/pull/27056 From epeter at openjdk.org Mon Sep 8 07:00:54 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 8 Sep 2025 07:00:54 GMT Subject: RFR: 8366702: C2 SuperWord: refactor VTransform vector nodes [v2] In-Reply-To: References: <-ntIikGlF88WGhDkEeTA2rU7xWHBuSRlaLUVwDvDzUY=.b482c2ed-ddc5-46e8-b7ad-2ee8e9f8fd67@github.com> Message-ID: On Mon, 8 Sep 2025 06:43:20 GMT, Christian Hagedorn wrote: >> Setting control is only for PhaseIdealLoop. It sets the internal ctrl that other loop-opts would rely on if we kept on optimizing the loop in the same PhaseIdealLoop. But if we set major progress, that means that we have messed up the graph so much that the state of PhaseIdealLoop may no longer be correct/accurate enough. >> >> So if we set major progress, we are essentially allowed to mess up the PhaseIdealLoop state. I still have to set ctrl, but it does not matter if it is not correct ;) > > I think it should generally still be correct but might not need to be as accurate as possible. You never know if some code will rely on the correctness later in the same loop opts - might not today but at some point, especially when trying to add some verification code. So, IIUC, you are just more conservative/less accurate now while still being correct. 
Maybe you can tweak the comment to express that more clearly since "not always correct" could also imply some actual illegal control. Adjusted the comment. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27056#discussion_r2329312853 From chagedorn at openjdk.org Mon Sep 8 08:04:14 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 8 Sep 2025 08:04:14 GMT Subject: RFR: 8366702: C2 SuperWord: refactor VTransform vector nodes [v2] In-Reply-To: References: Message-ID: On Mon, 8 Sep 2025 07:00:54 GMT, Emanuel Peter wrote: >> I'm working on cost-modeling, and am integrating some smaller changes from this proof-of-concept PR: >> https://github.com/openjdk/jdk/pull/20964 >> [See plan overview.](https://bugs.openjdk.org/browse/JDK-8340093) >> >> This is a pure refactoring - no change in behaviour. I'm presenting it like this because it will make reviews easier. >> >> --------------------------------- >> >> I have to say: I'm very sorry for this refactoring. I took some decisions in https://github.com/openjdk/jdk/pull/19719 that I'm now partially undoing. I moved too much logic from `SuperWord::output` (now called `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`) to the `VTransform...Node::apply`. https://github.com/openjdk/jdk/pull/19719 was a roughly 1.5k line change, and I took about a 0.3k misstep that I'm now correcting here ;) >> >> I had accidentally made the `VTransformGraph` too close to the `PackSet`, and not close enough to the future vectorized C2 Graph. And that makes some future changes hard. >> >> My vision: >> - VLoop / VLoopAnalyzer look at the scalar loop and prepare it for SuperWord >> - SuperWord creates the `PackSet`: some nodes are packed, all others are scalar. 
>> - `SuperWordVTransformBuilder` converts the `PackSet` into the `VTransformGraph` >> - The `VTransformGraph` very closely represents the C2 vectorized loop after vectorization >> - It does not need to know which `nodes` it packs, it rather just needs to know how to generate the new vector nodes >> - That means it is straightforward to compute cost >> - And it also makes optimizations on that graph easier >> - And the `apply` methods are simpler too >> >> ---------------------------------- >> >> So therefore, the main goal was to make the `VTransform...Node::apply` calls simpler again. And move the logic back to `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`. >> >> One important step to making the `VTransformGraph` less of a `PackSet` is to remove reliance on `nodes` for the vector nodes. >> >> What I did: >> - Moving a lot of the logic in `VTransformElementWiseVectorNode::apply` to `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`. >> - Will make it easier to optimize and compute cost in future RFEs. >> - `VTransformVectorNodePrototype`: packs a lot of the info for `VTransformVectorNode`. >> - pass info about `bt`, `vlen`, `sopc` instead of the `pack` -> allows us to eventually remove the dependency on `nodes`. >> - New vector nodes, they are special cases I split away from ... > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > review comment implemented Nice refactoring! Just some small suggestions, otherwise, it looks good to me. src/hotspot/share/opto/superwordVTransformBuilder.cpp line 154: > 152: Node* p0 = pack->at(0); > 153: const VTransformVectorNodePrototype prototype = VTransformVectorNodePrototype::make_from_pack(pack, _vloop_analyzer); > 154: const int sopc = prototype.scalar_opcode(); You use this in other places already, but it could be more readable if renamed to `scalar_opc` or `scalar_opcode`. 
src/hotspot/share/opto/superwordVTransformBuilder.cpp line 173: > 171: vtn = new (_vtransform.arena()) VTransformBoolVectorNode(_vtransform, prototype, kind); > 172: } else if (p0->is_CMove()) { > 173: vtn = new (_vtransform.arena()) VTransformElementWiseVectorNode(_vtransform, p0->req(), prototype, Op_VectorBlend); You also seem to use `p0->req()` a lot. Should we create a `const` above for easier access? Could we also have a better name than `p0`? But again, you are using `p0` a lot at other places already and it might be evidently clear in this context. src/hotspot/share/opto/vtransform.hpp line 600: > 598: > 599: // Bundle the information needed for vector nodes. > 600: class VTransformVectorNodePrototype : public StackObj { Prototype sounds like actually having something concrete, not fully set up or just something to copy/clone from as a starting point. But IIUC, this class just serves as a holder class for some information. How about naming it `Prototype` -> `Properties`? src/hotspot/share/opto/vtransform.hpp line 617: > 615: > 616: public: > 617: static VTransformVectorNodePrototype make_from_pack(const Node_List* pack, const VLoopAnalyzer& vloop_analyzer) { When switching to "Properties", you could also rename this to something like "fetch_from_pack" since `make` also suggests actually creating a dummy-kind node when it's only trying to fetch useful information. 
------------- PR Review: https://git.openjdk.org/jdk/pull/27056#pullrequestreview-3195283374 PR Review Comment: https://git.openjdk.org/jdk/pull/27056#discussion_r2329386959 PR Review Comment: https://git.openjdk.org/jdk/pull/27056#discussion_r2329409899 PR Review Comment: https://git.openjdk.org/jdk/pull/27056#discussion_r2329374048 PR Review Comment: https://git.openjdk.org/jdk/pull/27056#discussion_r2329392637 From shade at openjdk.org Mon Sep 8 08:10:10 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 8 Sep 2025 08:10:10 GMT Subject: RFR: 8366588: VectorAPI: Re-intrinsify VectorMask.laneIsSet where the input index is a variable In-Reply-To: References: Message-ID: On Mon, 8 Sep 2025 01:22:46 GMT, erifan wrote: >> test/micro/org/openjdk/bench/jdk/incubator/vector/VectorExtractBenchmark.java line 34: >> >>> 32: @Warmup(iterations = 5, time = 1) >>> 33: @Measurement(iterations = 5, time = 1) >>> 34: @Fork(value = 1, jvmArgs = {"--add-modules=jdk.incubator.vector"}) >> >> Don't do 1 fork, do at least 3. > > The test results show that this test is stable, so I think forking once is enough? We have many JMH benchmarks that fork once. OK then. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27113#discussion_r2329468832 From epeter at openjdk.org Mon Sep 8 08:14:54 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 8 Sep 2025 08:14:54 GMT Subject: RFR: 8366702: C2 SuperWord: refactor VTransform vector nodes [v3] In-Reply-To: References: Message-ID: > I'm working on cost-modeling, and am integrating some smaller changes from this proof-of-concept PR: > https://github.com/openjdk/jdk/pull/20964 > [See plan overview.](https://bugs.openjdk.org/browse/JDK-8340093) > > This is a pure refactoring - no change in behaviour. I'm presenting it like this because it will make reviews easier. > > --------------------------------- > > I have to say: I'm very sorry for this refactoring. 
I took some decisions in https://github.com/openjdk/jdk/pull/19719 that I'm now partially undoing. I moved too much logic from `SuperWord::output` (now called `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`) to the `VTransform...Node::apply`. https://github.com/openjdk/jdk/pull/19719 was a roughly 1.5k line change, and I took about a 0.3k misstep that I'm now correcting here ;) > > I had accidentally made the `VTransformGraph` too close to the `PackSet`, and not close enough to the future vectorized C2 Graph. And that makes some future changes hard. > > My vision: > - VLoop / VLoopAnalyzer look at the scalar loop and prepare it for SuperWord > - SuperWord creates the `PackSet`: some nodes are packed, all others are scalar. > - `SuperWordVTransformBuilder` converts the `PackSet` into the `VTransformGraph` > - The `VTransformGraph` very closely represents the C2 vectorized loop after vectorization > - It does not need to know which `nodes` it packs, it rather just needs to know how to generate the new vector nodes > - That means it is straightforward to compute cost > - And it also makes optimizations on that graph easier > - And the `apply` methods are simpler too > > ---------------------------------- > > So therefore, the main goal was to make the `VTransform...Node::apply` calls simpler again. And move the logic back to `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`. > > One important step to making the `VTransformGraph` less of a `PackSet` is to remove reliance on `nodes` for the vector nodes. > > What I did: > - Moving a lot of the logic in `VTransformElementWiseVectorNode::apply` to `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`. > - Will make it easier to optimize and compute cost in future RFEs. > - `VTransformVectorNodePrototype`: packs a lot of the info for `VTransformVectorNode`. > - pass info about `bt`, `vlen`, `sopc` instead of the `pack` -> allows us to eventually remove the dependency on `nodes`. 
> - New vector nodes, they are special cases I split away from `VTransformElementWiseVectorNode`: > - `VTransformReinterpretVectorN... Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: prototype -> properties ------------- Changes: - all: https://git.openjdk.org/jdk/pull/27056/files - new: https://git.openjdk.org/jdk/pull/27056/files/9bc510e4..8a63899a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=27056&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27056&range=01-02 Stats: 51 lines in 3 files changed: 0 ins; 0 del; 51 mod Patch: https://git.openjdk.org/jdk/pull/27056.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27056/head:pull/27056 PR: https://git.openjdk.org/jdk/pull/27056 From epeter at openjdk.org Mon Sep 8 08:14:54 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 8 Sep 2025 08:14:54 GMT Subject: RFR: 8366702: C2 SuperWord: refactor VTransform vector nodes [v2] In-Reply-To: References: Message-ID: On Mon, 8 Sep 2025 07:25:47 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> review comment implemented > > src/hotspot/share/opto/vtransform.hpp line 600: > >> 598: >> 599: // Bundle the information needed for vector nodes. >> 600: class VTransformVectorNodePrototype : public StackObj { > > Prototype sounds like actually having something concrete, not fully set up or just something to copy/clone from as a starting point. But IIUC, this class just serves as a holder class for some information. How about naming it `Prototype` -> `Properties`? 
I like the idea :) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27056#discussion_r2329474779 From epeter at openjdk.org Mon Sep 8 08:18:15 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 8 Sep 2025 08:18:15 GMT Subject: RFR: 8366702: C2 SuperWord: refactor VTransform vector nodes [v2] In-Reply-To: References: Message-ID: On Mon, 8 Sep 2025 07:31:40 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> review comment implemented > > src/hotspot/share/opto/superwordVTransformBuilder.cpp line 154: > >> 152: Node* p0 = pack->at(0); >> 153: const VTransformVectorNodePrototype prototype = VTransformVectorNodePrototype::make_from_pack(pack, _vloop_analyzer); >> 154: const int sopc = prototype.scalar_opcode(); > > You use this in other places already, but it could be more readable if renamed to `scalar_opc` or `scalar_opcode`. We use `sopc` and `vopc` in many places in vectorization code - just grep for it ;) Otherwise some lines just get much longer, and I think that makes the code less readable generally. > src/hotspot/share/opto/vtransform.hpp line 617: > >> 615: >> 616: public: >> 617: static VTransformVectorNodePrototype make_from_pack(const Node_List* pack, const VLoopAnalyzer& vloop_analyzer) { > > When switching to "Properties", you could also rename this to something like "fetch_from_pack" since `make` also suggests actually creating a dummy-kind node when it's only trying to fetch useful information. I prefer the `make_...` naming. I think it is generally used as a "factory" prefix all over the place. `fetch` means we would be "loading" it from somewhere, and that's not what we do here - rather we just construct the `Properties` given the pack. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27056#discussion_r2329483851 PR Review Comment: https://git.openjdk.org/jdk/pull/27056#discussion_r2329487912 From epeter at openjdk.org Mon Sep 8 08:28:51 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 8 Sep 2025 08:28:51 GMT Subject: RFR: 8366702: C2 SuperWord: refactor VTransform vector nodes [v4] In-Reply-To: References: Message-ID: > I'm working on cost-modeling, and am integrating some smaller changes from this proof-of-concept PR: > https://github.com/openjdk/jdk/pull/20964 > [See plan overview.](https://bugs.openjdk.org/browse/JDK-8340093) > > This is a pure refactoring - no change in behaviour. I'm presenting it like this because it will make reviews easier. > > --------------------------------- > > I have to say: I'm very sorry for this refactoring. I took some decisions in https://github.com/openjdk/jdk/pull/19719 that I'm now partially undoing. I moved too much logic from `SuperWord::output` (now called `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`) to the `VTransform...Node::apply`. https://github.com/openjdk/jdk/pull/19719 was a roughly 1.5k line change, and I took about a 0.3k misstep that I'm now correcting here ;) > > I had accidentally made the `VTransformGraph` too close to the `PackSet`, and not close enough to the future vectorized C2 Graph. And that makes some future changes hard. > > My vision: > - VLoop / VLoopAnalyzer look at the scalar loop and prepare it for SuperWord > - SuperWord creates the `PackSet`: some nodes are packed, all others are scalar. 
> - `SuperWordVTransformBuilder` converts the `PackSet` into the `VTransformGraph` > - The `VTransformGraph` very closely represents the C2 vectorized loop after vectorization > - It does not need to know which `nodes` it packs, it rather just needs to know how to generate the new vector nodes > - That means it is straightforward to compute cost > - And it also makes optimizations on that graph easier > - And the `apply` methods are simpler too > > ---------------------------------- > > So therefore, the main goal was to make the `VTransform...Node::apply` calls simpler again. And move the logic back to `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`. > > One important step to making the `VTransformGraph` less of a `PackSet` is to remove reliance on `nodes` for the vector nodes. > > What I did: > - Moving a lot of the logic in `VTransformElementWiseVectorNode::apply` to `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`. > - Will make it easier to optimize and compute cost in future RFEs. > - `VTransformVectorNodePrototype`: packs a lot of the info for `VTransformVectorNode`. > - pass info about `bt`, `vlen`, `sopc` instead of the `pack` -> allows us to eventually remove the dependency on `nodes`. > - New vector nodes, they are special cases I split away from `VTransformElementWiseVectorNode`: > - `VTransformReinterpretVectorN... 
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: fix typo ------------- Changes: - all: https://git.openjdk.org/jdk/pull/27056/files - new: https://git.openjdk.org/jdk/pull/27056/files/8a63899a..e3fe36ee Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=27056&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27056&range=02-03 Stats: 5 lines in 1 file changed: 0 ins; 0 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/27056.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27056/head:pull/27056 PR: https://git.openjdk.org/jdk/pull/27056 From epeter at openjdk.org Mon Sep 8 08:28:51 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 8 Sep 2025 08:28:51 GMT Subject: RFR: 8366702: C2 SuperWord: refactor VTransform vector nodes [v2] In-Reply-To: References: Message-ID: On Mon, 8 Sep 2025 08:01:52 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> review comment implemented > > Nice refactoring! Just some small suggestions, otherwise, it looks good to me. @chhagedorn Thanks for reviewing! I responded to all your suggestions :) > src/hotspot/share/opto/superwordVTransformBuilder.cpp line 173: > >> 171: vtn = new (_vtransform.arena()) VTransformBoolVectorNode(_vtransform, prototype, kind); >> 172: } else if (p0->is_CMove()) { >> 173: vtn = new (_vtransform.arena()) VTransformElementWiseVectorNode(_vtransform, p0->req(), prototype, Op_VectorBlend); > > You also seem to use `p0->req()` a lot. Should we create a `const` above for easier access? Could we also have a better name than `p0`? But again, you are using `p0` a lot at other places already and it might be evidently clear in this context. Personally, I'd like to keep `p0`. An alternative is `first`. Or something even much longer that just inflates the code and does not make it more readable either. 
We also have `t0` and `s0` all over the SuperWord code. And honestly we do the same in all sorts of IGVN code as well, right? We could make a `uint req = p0->req()` but I don't think that is more helpful. `req` is not a very great name but we are stuck with it because of the definition in `Node`. Detaching it from `p0` would probably not help but rather make it harder to read. All of this is rather subjective though :/ If a second reviewer wants to see the change, I propose we do that in a separate RFE, and then consistently over the SuperWord code at large. ------------- PR Comment: https://git.openjdk.org/jdk/pull/27056#issuecomment-3265144091 PR Review Comment: https://git.openjdk.org/jdk/pull/27056#discussion_r2329502044 From qxing at openjdk.org Mon Sep 8 08:32:11 2025 From: qxing at openjdk.org (Qizheng Xing) Date: Mon, 8 Sep 2025 08:32:11 GMT Subject: RFR: 8360192: C2: Make the type of count leading/trailing zero nodes more precise [v11] In-Reply-To: References: Message-ID: > The result of count leading/trailing zeros is always non-negative, and the maximum value is integer type's size in bits. In previous versions, when C2 can not know the operand value of a CLZ/CTZ node at compile time, it will generate a full-width integer type for its result. This can significantly affect the efficiency of code in some cases. > > This patch makes the type of CLZ/CTZ nodes more precise, to make C2 generate better code. 
For example, the following implementation runs ~115% faster on x86-64 with this patch: > > > public static int numberOfNibbles(int i) { > int mag = Integer.SIZE - Integer.numberOfLeadingZeros(i); > return Math.max((mag + 3) / 4, 1); > } > > > Testing: tier1, IR test Qizheng Xing has updated the pull request incrementally with one additional commit since the last revision: Add proof of correstness comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25928/files - new: https://git.openjdk.org/jdk/pull/25928/files/f1c0b45a..5cfe39b6 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25928&range=10 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25928&range=09-10 Stats: 36 lines in 1 file changed: 36 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/25928.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25928/head:pull/25928 PR: https://git.openjdk.org/jdk/pull/25928 From qxing at openjdk.org Mon Sep 8 08:32:13 2025 From: qxing at openjdk.org (Qizheng Xing) Date: Mon, 8 Sep 2025 08:32:13 GMT Subject: RFR: 8360192: C2: Make the type of count leading/trailing zero nodes more precise [v10] In-Reply-To: <8cq6Lhw9sc_Fd7adnL0t1F10UowOHDr8eEgZSD9MFUc=.d6b189a1-ac3e-4175-8e15-5e16691b6422@github.com> References: <8cq6Lhw9sc_Fd7adnL0t1F10UowOHDr8eEgZSD9MFUc=.d6b189a1-ac3e-4175-8e15-5e16691b6422@github.com> Message-ID: <2GMwFeMbO1iD2MoDsbJs0mAc6ayBANcl_XNS2c9lm4I=.dc9cc10b-ae8a-490c-9513-8eaca2890ab4@github.com> On Tue, 19 Aug 2025 13:51:36 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/countbitsnode.cpp line 57: >> >>> 55: const TypeInt* ti = t->is_int(); >>> 56: return TypeInt::make(count_leading_zeros_int(~ti->_bits._zeros), >>> 57: count_leading_zeros_int(ti->_bits._ones), >> >> I think this is correct, but I would like to see a short comment why it is correct. > > Same in other cases below Updated. 
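A minimal sketch of why the bounds in the quoted `countbitsnode.cpp` snippet hold, written in Java rather than HotSpot C++. It assumes a known-bits pair in which `zeros` marks the bits known to be 0 and `ones` marks the bits known to be 1; the class `ClzRangeSketch` and all of its methods are hypothetical illustrations and not part of the patch.

```java
final class ClzRangeSketch {
    /** Smallest possible numberOfLeadingZeros(v), given the bits known to be 0. */
    static int minLeadingZeros(int zeros) {
        // Setting every bit that is not known-zero gives the largest unsigned
        // value, and therefore the fewest leading zeros.
        return Integer.numberOfLeadingZeros(~zeros);
    }

    /** Largest possible numberOfLeadingZeros(v), given the bits known to be 1. */
    static int maxLeadingZeros(int ones) {
        // Setting only the known-one bits gives the smallest unsigned value,
        // and therefore the most leading zeros.
        return Integer.numberOfLeadingZeros(ones);
    }

    /** True if v is consistent with the known bits. */
    static boolean matches(int v, int zeros, int ones) {
        return (v & zeros) == 0 && (v & ones) == ones;
    }

    /** Brute-force check of the bounds over all values in [0, 0xFFFF]. */
    static boolean boundHolds(int zeros, int ones) {
        int lo = minLeadingZeros(zeros);
        int hi = maxLeadingZeros(ones);
        for (int v = 0; v <= 0xFFFF; v++) {
            if (!matches(v, zeros, ones)) {
                continue;
            }
            int clz = Integer.numberOfLeadingZeros(v);
            if (clz < lo || clz > hi) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int zeros = 0xFFFF0000; // upper 16 bits known to be 0
        int ones  = 0x00000100; // bit 8 known to be 1
        System.out.println(minLeadingZeros(zeros) + ".." + maxLeadingZeros(ones));
        System.out.println(boundHolds(zeros, ones));
    }
}
```

With no known bits at all (`zeros == 0`, `ones == 0`) the same two expressions give the range 0..32, which matches the statement that the result is non-negative and at most the type's size in bits.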
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25928#discussion_r2329519407 From chagedorn at openjdk.org Mon Sep 8 08:45:17 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 8 Sep 2025 08:45:17 GMT Subject: RFR: 8366702: C2 SuperWord: refactor VTransform vector nodes [v4] In-Reply-To: References: Message-ID: On Mon, 8 Sep 2025 08:28:51 GMT, Emanuel Peter wrote: >> I'm working on cost-modeling, and am integrating some smaller changes from this proof-of-concept PR: >> https://github.com/openjdk/jdk/pull/20964 >> [See plan overview.](https://bugs.openjdk.org/browse/JDK-8340093) >> >> This is a pure refactoring - no change in behaviour. I'm presenting it like this because it will make reviews easier. >> >> --------------------------------- >> >> I have to say: I'm very sorry for this refactoring. I took some decisions in https://github.com/openjdk/jdk/pull/19719 that I'm now partially undoing. I moved too much logic from `SuperWord::output` (now called `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`) to the `VTransform...Node::apply`. https://github.com/openjdk/jdk/pull/19719 was a roughly 1.5k line change, and I took about a 0.3k misstep that I'm now correcting here ;) >> >> I had accidentally made the `VTransformGraph` too close to the `PackSet`, and not close enough to the future vectorized C2 Graph. And that makes some future changes hard. >> >> My vision: >> - VLoop / VLoopAnalyzer look at the scalar loop and prepare it for SuperWord >> - SuperWord creates the `PackSet`: some nodes are packed, all others are scalar. 
>> - `SuperWordVTransformBuilder` converts the `PackSet` into the `VTransformGraph` >> - The `VTransformGraph` very closely represents the C2 vectorized loop after vectorization >> - It does not need to know which `nodes` it packs, it rather just needs to know how to generate the new vector nodes >> - That means it is straightforward to compute cost >> - And it also makes optimizations on that graph easier >> - And the `apply` methods are simpler too >> >> ---------------------------------- >> >> So therefore, the main goal was to make the `VTransform...Node::apply` calls simpler again. And move the logic back to `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`. >> >> One important step to making the `VTransformGraph` less of a `PackSet` is to remove reliance on `nodes` for the vector nodes. >> >> What I did: >> - Moving a lot of the logic in `VTransformElementWiseVectorNode::apply` to `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`. >> - Will make it easier to optimize and compute cost in future RFEs. >> - `VTransformVectorNodePrototype`: packs a lot of the info for `VTransformVectorNode`. >> - pass info about `bt`, `vlen`, `sopc` instead of the `pack` -> allows us to eventually remove the dependency on `nodes`. >> - New vector nodes, they are special cases I split away from ... > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > fix typo Looks good, thanks for the updates! ------------- Marked as reviewed by chagedorn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/27056#pullrequestreview-3195544178 From chagedorn at openjdk.org Mon Sep 8 08:45:19 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 8 Sep 2025 08:45:19 GMT Subject: RFR: 8366702: C2 SuperWord: refactor VTransform vector nodes [v2] In-Reply-To: References: Message-ID: <8mjKHo07OuMaW0hIahWBm_N-5RhK4umUzrOpV0wr8cs=.bca1780b-9091-4560-a3ab-01cf46d3ed1f@github.com> On Mon, 8 Sep 2025 08:21:29 GMT, Emanuel Peter wrote: > Personally, I'd like to keep p0. An alternative is first. Or something even much longer that just inflates the code and does not make it more readable either. "first_node_in_pack" would be more understandable I think. But it's much longer than `p0` indeed. Does it matter here if we pick the first, second or just any other node? If not, then maybe "pack_node" would just be expressive enough? But anyways, as you point out, we already use `p0` all over the place. And an extensive renaming should be done in a separate task in one go and more people should agree to it before doing it. > We also have t0 and s0 all over the SuperWord code. And honestly we do the same in all sorts of IGVN code as well, right? Yes, I personally would prefer to have more names than abbreviations. But that's subjective again :-) > We could make a uint req = p0->req() but I don't think that is more helpful. req is not a very great name but we are stuck with it because of the definition in Node. Detaching it from p0 would probably not help but rather make it harder to read. What if you just name it `p0_req`? It was more about sharing and making it `const` since `p0` does not change. But feel free to leave it as it is. > All of this is rather subjective though :/ Indeed... 
>> src/hotspot/share/opto/vtransform.hpp line 617: >> >>> 615: >>> 616: public: >>> 617: static VTransformVectorNodePrototype make_from_pack(const Node_List* pack, const VLoopAnalyzer& vloop_analyzer) { >> >> When switching to "Properties", you could also rename this to something like "fetch_from_pack" since `make` also suggests actually creating a dummy-kind node when it's only trying to fetch useful information. > > I prefer the `make_...` naming. I think it is generally used as a "factory" prefix all over the place. `fetch` means we would be "loading" it from somewhere, and that's not what we do here - rather we just construct the `Properties` given the pack. I guess with `Properties` in the name, it's more clear now? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27056#discussion_r2329554430 PR Review Comment: https://git.openjdk.org/jdk/pull/27056#discussion_r2329553671 From epeter at openjdk.org Mon Sep 8 09:06:33 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 8 Sep 2025 09:06:33 GMT Subject: RFR: 8366702: C2 SuperWord: refactor VTransform vector nodes [v4] In-Reply-To: References: Message-ID: On Mon, 8 Sep 2025 08:42:51 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> fix typo > > Looks good, thanks for the updates! @chhagedorn Thanks for the approval! 
@mhaessig is on vacation - so I'm hoping someone else can help review here ;) ------------- PR Comment: https://git.openjdk.org/jdk/pull/27056#issuecomment-3265317349 From dfenacci at openjdk.org Mon Sep 8 09:29:26 2025 From: dfenacci at openjdk.org (Damon Fenacci) Date: Mon, 8 Sep 2025 09:29:26 GMT Subject: RFR: 8360031: C2 compilation asserts in MemBarNode::remove [v3] In-Reply-To: References: Message-ID: > # Issue > While compiling `java.util.zip.ZipFile` in C2 this assert is triggered > https://github.com/openjdk/jdk/blob/a2e86ff3c56209a14c6e9730781eecd12c81d170/src/hotspot/share/opto/memnode.cpp#L4235 > > # Cause > While compiling the constructor of java.util.zip.ZipFile$CleanableResource the following happens: > * we insert a trailing `MemBarStoreStore` in the constructor > [image: before_folding] > > * during IGVN we completely fold the memory subtree of the `MemBarStoreStore` node. The node still has a control output attached. > [image: after_folding] > > * later during the same IGVN run the `MemBarStoreStore` node is handled and we try to remove it (because the `Allocate` node of the `MemBar` is not escaping the thread) https://github.com/openjdk/jdk/blob/7b7136b4eca15693cfcd46ae63d644efc8a88d2c/src/hotspot/share/opto/memnode.cpp#L4301-L4302 > * the assert https://github.com/openjdk/jdk/blob/7b7136b4eca15693cfcd46ae63d644efc8a88d2c/src/hotspot/share/opto/memnode.cpp#L4235 > triggers because the barrier has only 1 (control) output and is a `MemBarStoreStore` (not `Initialize`) barrier > > The issue happens only when `UseStoreStoreForCtor` is set (which is also the default), which makes C2 use `MemBarStoreStore` instead of `MemBarRelease` at the end of constructors. `MemBarStoreStore` nodes are processed separately by EA and this happens after the IGVN pass that folds the memory subtree. `MemBarRelease` nodes on the other hand are handled during the same IGVN pass before the memory subtree gets removed and it's still got 2 outputs (assert skipped). 
> > # Fix > Adapting the assert to accept that `MemBarStoreStore` can also have `!= 2` outputs (when `+UseStoreStoreForCtor` is used) seems to be an OK solution as this seems like a perfectly plausible situation. > > # Testing > Unfortunately reproducing the issue with a simple regression test has proven very hard. The test seems to rely on very peculiar profiling and IGVN worklist sequence. JBS replay compilation passes. Running JCK's `api/java_util` 100 times triggers the assert a couple of times on average before the fix, none after. > Tier 1-3+ tests passed. Damon Fenacci has updated the pull request incrementally with one additional commit since the last revision: JDK-8360031: add MemBarStoreStore node to worklist during escape analysis/adapt remove assert ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26556/files - new: https://git.openjdk.org/jdk/pull/26556/files/f7bc08c9..57073b96 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26556&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26556&range=01-02 Stats: 6 lines in 2 files changed: 0 ins; 4 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/26556.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26556/head:pull/26556 PR: https://git.openjdk.org/jdk/pull/26556 From dfenacci at openjdk.org Mon Sep 8 09:29:27 2025 From: dfenacci at openjdk.org (Damon Fenacci) Date: Mon, 8 Sep 2025 09:29:27 GMT Subject: RFR: 8360031: C2 compilation asserts in MemBarNode::remove [v2] In-Reply-To: References: <5CGrcWjFZ7Zqj_Tm0LO6Tqg9cUA-xxvcaa2J-yWW8BE=.af4dea7c-e39d-491d-b924-c89fa82e757a@github.com> Message-ID: On Fri, 5 Sep 2025 09:45:38 GMT, Dean Long wrote: > What happens in the replay crash is the MemBarStoreStore gets onto the worklist through an indirect route in ConnectionGraph::split_unique_types() because of its memory edge. Oh I see! Thanks @dean-long! I noticed that `MemBarStoreStore` was added later on but didn't really figure out where/why. 
> I think the conservative fix is to have compute_escape() always add the MemBarStoreStore to the worklist if it has a Precedent edge. Because of StressIGVN randomizing the worklist, I think the outcnt() can be 1 for either MemBarStoreStore or MemBarRelease, so we should relax the assert accordingly. I'm not sure how useful the assert will be after that. It might be better to remove it. I made `compute_escape` add `MemBarStoreStore` to the worklist. By doing so the assert doesn't trigger anymore with the reproducer but, as you wrote, there seems to be no reason why `outcnt()` couldn't be 1 for `MemBarStoreStore` or `MemBarRelease`. So I modified the assert to only leave the `outcnt() <=2` part. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26556#issuecomment-3265399663 From qxing at openjdk.org Mon Sep 8 09:32:58 2025 From: qxing at openjdk.org (Qizheng Xing) Date: Mon, 8 Sep 2025 09:32:58 GMT Subject: RFR: 8360192: C2: Make the type of count leading/trailing zero nodes more precise [v12] In-Reply-To: References: Message-ID: > The result of count leading/trailing zeros is always non-negative, and the maximum value is integer type's size in bits. In previous versions, when C2 can not know the operand value of a CLZ/CTZ node at compile time, it will generate a full-width integer type for its result. This can significantly affect the efficiency of code in some cases. > > This patch makes the type of CLZ/CTZ nodes more precise, to make C2 generate better code. 
For example, the following implementation runs ~115% faster on x86-64 with this patch: > > > public static int numberOfNibbles(int i) { > int mag = Integer.SIZE - Integer.numberOfLeadingZeros(i); > return Math.max((mag + 3) / 4, 1); > } > > > Testing: tier1, IR test Qizheng Xing has updated the pull request incrementally with one additional commit since the last revision: Add more constant folding tests for CLZ/CTZ ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25928/files - new: https://git.openjdk.org/jdk/pull/25928/files/5cfe39b6..d09d4cb0 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25928&range=11 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25928&range=10-11 Stats: 279 lines in 1 file changed: 223 ins; 0 del; 56 mod Patch: https://git.openjdk.org/jdk/pull/25928.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25928/head:pull/25928 PR: https://git.openjdk.org/jdk/pull/25928 From qxing at openjdk.org Mon Sep 8 09:32:59 2025 From: qxing at openjdk.org (Qizheng Xing) Date: Mon, 8 Sep 2025 09:32:59 GMT Subject: RFR: 8360192: C2: Make the type of count leading/trailing zero nodes more precise [v12] In-Reply-To: References: Message-ID: On Tue, 19 Aug 2025 13:54:31 GMT, Emanuel Peter wrote: >> Qizheng Xing has updated the pull request incrementally with one additional commit since the last revision: >> >> Add more constant folding tests for CLZ/CTZ > > src/hotspot/share/opto/countbitsnode.cpp line 47: > >> 45: if (x >> 30 == 0) { n += 2; x <<= 2; } >> 46: n -= x >> 31; >> 47: return TypeInt::make(n); > > Is there already a test that covers all the cases that constant fold here? Just to make sure we do not get regressions here. Added in IR test `TestCountBitsRange.java`. 
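As an illustration of the result range discussed in this thread, here is a minimal Java sketch (not the HotSpot C++ code). It assumes the quoted constant-folding snippet follows the usual branch-based leading-zero count; the class name `ClzRangeSketch` and the helper `nlz` are hypothetical names chosen for this example. It checks the helper against `Integer.numberOfLeadingZeros` and confirms that every result lies in `[0, 32]`, which is the bound the more precise node type relies on.

```java
public class ClzRangeSketch {
    // Hypothetical helper: branch-based count of leading zeros for a 32-bit int.
    // Each step halves the search window; unsigned shifts (>>>) avoid sign extension.
    static int nlz(int x) {
        if (x == 0) { return 32; }
        int n = 1;
        if (x >>> 16 == 0) { n += 16; x <<= 16; }
        if (x >>> 24 == 0) { n += 8;  x <<= 8; }
        if (x >>> 28 == 0) { n += 4;  x <<= 4; }
        if (x >>> 30 == 0) { n += 2;  x <<= 2; }
        n -= x >>> 31; // subtract 1 if the top bit is set
        return n;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, 2, 255, 256, 65535, 65536,
                         Integer.MAX_VALUE, Integer.MIN_VALUE, -1};
        for (int v : samples) {
            int r = nlz(v);
            // The result must match the JDK method and stay within [0, 32].
            if (r != Integer.numberOfLeadingZeros(v) || r < 0 || r > 32) {
                throw new AssertionError("mismatch for " + v);
            }
        }
        System.out.println("all samples within [0, 32]");
    }
}
```

With the result bounded to `[0, 32]`, an expression such as `Integer.SIZE - Integer.numberOfLeadingZeros(i)` is itself bounded to `[0, 32]`, which is the kind of range information a compiler can use to simplify the arithmetic that follows.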
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25928#discussion_r2329682681 From mbaesken at openjdk.org Mon Sep 8 10:26:09 2025 From: mbaesken at openjdk.org (Matthias Baesken) Date: Mon, 8 Sep 2025 10:26:09 GMT Subject: RFR: 8366775: TestCompileTaskTimeout should use timeoutFactor In-Reply-To: References: Message-ID: On Thu, 4 Sep 2025 15:11:14 GMT, Manuel Hässig wrote: > Is the reduced default a problem on your side? I added the PR to our build/test queue. The reduced default should be okay for us. ------------- PR Comment: https://git.openjdk.org/jdk/pull/27094#issuecomment-3265626746 From epeter at openjdk.org Mon Sep 8 11:03:32 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 8 Sep 2025 11:03:32 GMT Subject: RFR: 8307084: C2: Vectorized drain loop is not executed for some small trip counts [v2] In-Reply-To: References: <3upl3uiPM5gnO1HCV7vb1C7CFyV3HQ2ztGXVJkss-AM=.09da8cb2-e384-420a-91d1-f3bb8d8cfc6a@github.com> Message-ID: On Wed, 3 Sep 2025 16:55:45 GMT, Fei Gao wrote: >> In C2's loop optimization, for a counted loop, if we have any of these conditions (RCE, unrolling) met, we switch to the >> `pre-main-post-loop` model. Then a counted loop could be split into `pre-main-post` loops. Meanwhile, C2 inserts minimum trip guards (a.k.a. zero-trip guards) before the main loop and the post loop. These guards test if the remaining trip count is less than the loop stride (after unrolling). If yes, the execution jumps over the loop code to avoid loop over-running. For example, if a main loop is unrolled to `8x`, the main loop guard tests if the loop has less than `8` iterations and then decide which way to go. >> >> Usually, the vectorized main loop will be super-unrolled after vectorization. In such cases, the main loop's stride is going to be further multiplied. After the main loop is super-unrolled, the minimum trip guard test will be updated. 
Assuming one vector can operate `8` iterations and the super-unrolling count is `4`, the trip guard of the main loop will test if remaining trip is less than `8 * 4 = 32`. >> >> To avoid the scalar post loop running too many iterations after super-unrolling, C2 clones the main loop before super-unrolling to create a vectorized drain loop. The newly inserted post loop also has a minimum trip guard. And, both trip guards of the main loop and the vectorized drain loop jump to the scalar post loop. >> >> The problem here is, if the remaining trip count when exiting from the pre-loop is relatively small but larger than the vector length, the vectorized drain loop will never be executed. Because the minimum trip guard test of main loop fails, the execution will jump over both the main loop and the vectorized drain loop. For example, in the above case, a loop still has `25` iterations after the pre-loop, we may run `3` rounds of the vectorized drain loop but it's impossible. It would be better if the minimum trip guard test of the main loop does not jump over the vectorized drain loop. >> >> This patch is to improve it by modifying the control flow when the minimum trip guard test of the main loop fails. Obviously, we need to sync all data uses and control uses to adjust to the change of control flow. >> >> The whole process is done by the function `insert_post_loop()`. >> >> We introduce a new `CloneLoopMode`, `InsertVectorizedDrain`. When we're cloning the vector main loop to vectorized drain loop with mode `InsertVectorizedDrain`: >> >> 1. The fall-in control flow to the vectorized drain loop comes fr... > > Fei Gao has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains nine commits: > > - Merge branch 'master' into optimize-atomic-post > - Clean up comments for consistency and add spacing for readability > - Fix some corner case failures and refined part of code > - Merge branch 'master' into optimize-atomic-post > - Refine ascii art, rename some variables and resolve conflicts > - Merge branch 'master' into optimize-atomic-post > - Add necessary ASCII art, refactor insert_post_loop() and rename > "atomic post loop" with "vectorized drain loop. > - Merge branch 'master' into optimize-atomic-post > - 8307084: C2: Vector atomic post loop is not executed for some small trip counts > > In C2's loop optimization, for a counted loop, if we have any of > these conditions (RCE, unrolling) met, we switch to the > pre-main-post-loop model. Then a counted loop could be split into > pre-main-post loops. Meanwhile, C2 inserts minimum trip guards > (a.k.a. zero-trip guards) before the main loop and the post loop. > These guards test if the remaining trip count is less than the > loop stride (after unrolling). If yes, The execution jumps over > the loop code to avoid loop over-running. For example, if a main > loop is unrolled to 8x, the main loop guard tests if the loop has > less than 8 iterations and then decide which way to go. > > Usually, the vectorized main loop will be super-unrolled after > vectorization. In such cases, the main loop's stride is going to > be further multiplied. After the main loop is super-unrolled, the > minimum trip guard test will be updated. Assuming one vector can > operate 8 iterations and the super-unrolling count is 4, the trip > guard of the main loop will test if remaining trip is less than > 8 * 4 = 32. > > To avoid the scalar post loop running too many iterations after > super-unrolling, C2 clones the main loop before super-unrolling to > create a vector drain loop, i.e. atomic post loop. The newly > inserted post loop also has a minimum trip guard. 
And, both trip > guards of the main loop and vector post loop jump to the scalar > post loop. > > The problem here is, if the remaining trip count when exiting from > the pre-loop is relatively small but larger than the vector length, > the vector atomic post loop will never be executed. Because the > minimum trip guard test of main loop fails, the execution will > jump over both the main loop and the atomic p... I'm really impressed by this change. This was a lot of work @fg1417 ! Please don't be discouraged by the many comments / suggestions. A lot of them are minor code style issues, so should be quick to address. I have not made it through every detail yet, but I'll get there in the next cycle. I have one concern: We now have changed the branches. There is now a long sequence of branches if we have very few iterations, so that we only go through pre and post loop. It would be interesting to see what the performance difference is between master and patch. It would also be interesting to see a case where the SIZE of the array is not constant, and so the branches become impossible to predict, and there are a lot of branch misses. What do you think? I would also suggest that @chhagedorn or @rwestrel should review this patch, since they are much more familiar with loop-opts structures than I. I'm super happy that you are putting the time in for this. I think it is a really important task that closes a gap in the small-iteration space. And that is actually quite important :) src/hotspot/share/opto/loopTransform.cpp line 1325: > 1323: // - Clone 'n' into 'preheader_ctrl' if its block does not strictly dominate 'preheader_ctrl'. > 1324: // - Otherwise, return 'n'. > 1325: Node *PhaseIdealLoop::clone_up_backedge_goo(Node *back_ctrl, Node *preheader_ctrl, Node *n, VectorSet &visited, Node_Stack &clones) { Could you please add a general comment about what this does at the top? The name is a bit funny with `goo`, but that's not your fault. 
If you have a better name feel free to rename ;) src/hotspot/share/opto/loopTransform.cpp line 1332: > 1330: if (!requires_clone_from_preloop_exit) return n; > 1331: } else { > 1332: if (get_ctrl(n) != back_ctrl) return n; Suggestion: if (!requires_clone_from_preloop_exit) { return n; } } else { if (get_ctrl(n) != back_ctrl) { return n; } We generally like to be explicit with the brackets src/hotspot/share/opto/loopTransform.cpp line 1394: > 1392: // now we need to make the fall-in values to the vectorized drain > 1393: // loop come from phis merging exit values from the pre loop and > 1394: // the main loop. Suggestion: // After inserting zero trip guard for the vectorized drain loop, // we now need to make the fall-in values to the vectorized drain // loop come from phis merging exit values from the pre loop and // the main loop. src/hotspot/share/opto/loopTransform.cpp line 1437: > 1435: // We look for an existing Phi node 'drain_input' among the uses of 'main_incr'. > 1436: // If no valid Phi is found, we create a new Phi that merges output data edges > 1437: // from both the pre-loop and main loop. Why can that happen? Do you have a small example? src/hotspot/share/opto/loopTransform.cpp line 1848: > 1846: // / / > 1847: // after loop > 1848: Node* PhaseIdealLoop::insert_post_loop(IdealLoopTree* loop, Node_List& old_new, Consider renaming to `insert_post_or_drain_loop` src/hotspot/share/opto/loopTransform.cpp line 1865: > 1863: int dd_main_exit = dom_depth(main_exit); > 1864: > 1865: // Step 1: Clone the loop body of main loop. The clone becomes the new loop. Suggestion: // Step 1: Clone the loop body of main loop. The clone becomes the new loop (post or drain). src/hotspot/share/opto/loopTransform.cpp line 1887: > 1885: // from the main loop and the pre loop. > 1886: zero_ctrl = main_exit->unique_ctrl_out_or_null(); > 1887: assert(zero_ctrl, "if zero_ctrl doesn't exist, pre-main-post model fails."); Style guide forbids implicit null / zero checks. 
Suggestion: assert(zero_ctrl != nullptr, "if zero_ctrl doesn't exist, pre-main-post model fails."); src/hotspot/share/opto/loopTransform.cpp line 1910: > 1908: // Step 2.2: Find 'exit_point', which is taken when zero trip guard fails. > 1909: Node* exit_point = nullptr; > 1910: uint replace_idx = 0; Why not name it `exit_ctrl` and `exit_ctrl_idx`? Maybe you have a better name, but I'd make sure that they have a parallel name so it is obvious that they belong together. src/hotspot/share/opto/loopTransform.cpp line 1927: > 1925: assert(exit_point->in(replace_idx) == zero_ctrl, > 1926: "The zero_ctrl should be the second input"); > 1927: ) Here, `exit_point` is a region, right? Why not assert that? Also: there is no need to wrap an `assert` in a `DEBUG_ONLY` - it only makes sense if you define local variables ;) src/hotspot/share/opto/loopTransform.cpp line 1934: > 1932: > 1933: // Step 3: Find a 'new_phi' which is the input trip count of the zero trip guard. > 1934: Node* new_incr = nullptr; Is it called `new_phi` or `new_incr`? src/hotspot/share/opto/loopTransform.cpp line 1951: > 1949: Node* cmp = main_guard_opaq->unique_out(); > 1950: Node* pre_incr = cmp->in(1); > 1951: assert(new_incr && new_incr->in(1) == pre_incr && new_incr->in(2) == main_incr, ""); Suggestion: assert(new_incr != nullptr && new_incr->in(1) == pre_incr && new_incr->in(2) == main_incr, ""); No implicit null check src/hotspot/share/opto/loopTransform.cpp line 1965: > 1963: // trip guard until all unrolling is done. > 1964: // For example, when we're inserting vectorized drain loop, after several steps above, > 1965: // the loop structure is showed in the comments for handle_data_uses_for_vectorized_drain(). Which "several steps" are you referencing here? - The steps 1-3 from above? - Or several steps further down, to the point we draw in `handle_data_uses_for_vectorized_drain`? Can you please reformulate a bit? 
src/hotspot/share/opto/loopTransform.cpp line 2090: > 2088: _igvn.hash_delete(post_phi); > 2089: post_phi->set_req(LoopNode::EntryControl, fallnew); > 2090: } Looks like a bit much code duplication. But maybe that is justified here. Up to you. src/hotspot/share/opto/loopnode.hpp line 1359: > 1357: // from old-loop now should use new Phis that merges Phis which merges > 1358: // values from pre-loop and main-loop and values from the new-loop > 1359: // (vectorized drain loop) equivalents. I'm struggling with reading this. "x that merges y that merges z and w and v" - where do I have to place the brackets? `x that merges y that (merges z and w) and v` Probably this? src/hotspot/share/opto/loopnode.hpp line 1410: > 1408: > 1409: // Add post loop after the given loop. > 1410: Node* insert_post_loop(IdealLoopTree* loop, Node_List& old_new, Consider renaming to `insert_post_or_drain_loop`, and adjust comment above. src/hotspot/share/opto/loopnode.hpp line 1434: > 1432: Node* get_vectorized_drain_input(Node* main_backedge_ctrl, VectorSet& visited, > 1433: Node_Stack& clones, Node* main_merge_region, > 1434: Node* main_phi); We don't just do this for the trip-counter though, right? Because the `main_incr` suggests that a bit here. Could you rephrase to make it more accurate? Do you think that could be worth it? It is also nice to have the analogy to the trip-counter, so I like that in the example ASCII art. src/hotspot/share/opto/loopopts.cpp line 2341: > 2339: // Take the loop increment "i" as an example. > 2340: // Now the data uses about "i" are like: > 2341: Nit: I would do the `//` continuously, like elsewhere. src/hotspot/share/opto/loopopts.cpp line 2351: > 2349: // | / > 2350: // | / > 2351: // main zero-trip guard Kinda subjective, but I'd prefer if the corners were the other way around ;) Suggestion: // -----> pre loop head ... 
// | | \ / // IfTrue | -----> PhiNode // | v | | // loop end ------ addI('pre_incr') // | / // IfFalse / // | / // | / // main zero-trip guard Otherwise I'm wondering if the line may continue further up and just be cropped? Of course not. Putting the IfTrue above the `loop end` can also be a little confusing. But it does save space. But not much. You could just extend the picture a little further to the right. Sorry, this is very much a nit, so feel free to ignore ;) src/hotspot/share/opto/loopopts.cpp line 2356: > 2354: // / | \_____________________________ > 2355: // / | \ > 2356: // | |--> main loop head |---> vectorized drain loop head Is there some IfNode here that decides between main and drain? Or does that come later? Suggestion: // IfFalse IfTrue // / | ________(moved later)________ // / | \ // | |--> main loop head |---> vectorized drain loop head src/hotspot/share/opto/loopopts.cpp line 2394: > 2392: // The data uses will become: > 2393: // (new edges are marked with "*/*" or "*\*".) > 2394: Again: use trip-counter phi instead of `i` src/hotspot/share/opto/loopopts.cpp line 2435: > 2433: void PhaseIdealLoop::handle_data_uses_for_vectorized_drain(Node* main_old, Node_List &old_new, > 2434: IdealLoopTree* loop, IdealLoopTree* outer_loop, > 2435: Node_List& worklist, uint new_counter) { `handle` is a very generic verb. `fix` is already used elsewhere for the same purpose, so why not use that instead? Maybe `fix_data_uses_with_drain_merge_phis`? I'll have to read the code below to make sure that makes sense now. src/hotspot/share/opto/loopopts.cpp line 2452: > 2450: _igvn.replace_node(use, hit); > 2451: } > 2452: }; The existing code style is to avoid lambdas and use helper methods instead. Would that be possible here? Probably just requires a few more arguments, right? `new_counter` for example. 
src/hotspot/share/opto/loopopts.cpp line 2455: > 2453: > 2454: for (DUIterator_Fast jmax, j = main_old->fast_outs(jmax); j < jmax; j++) > 2455: worklist.push(main_old->fast_out(j)); Please use explicit {} everywhere :) src/hotspot/share/opto/loopopts.cpp line 2458: > 2456: > 2457: while (worklist.size()) { > 2458: Node* use = worklist.pop(); Can you add a quick comment on what kind of traversal this is? BFS? Over what nodes? src/hotspot/share/opto/loopopts.cpp line 2461: > 2459: if (!has_node(use)) continue; // Ignore dead nodes > 2460: if (use->in(0) == C->top()) continue; > 2461: IdealLoopTree* use_loop = get_loop(has_ctrl(use) ? get_ctrl(use) : use); Could you do this with `ctrl_or_self` instead? src/hotspot/share/opto/loopopts.cpp line 2466: > 2464: // Find the phi node merging the data from pre-loop and vector main-loop. > 2465: Node_List visit_list; > 2466: Node_List phi_list; You are doing this in a loop. And you set no `ResourceMark`. I'm afraid this could end up allocating a lot of memory. What do you think? src/hotspot/share/opto/loopopts.cpp line 2475: > 2473: // Use BFS to clone all necessary nodes starting from the 'use' node, which exits the main loop, > 2474: // until reaching a merge point with a path from the pre-loop. > 2475: while (visit_list.size()) { Suggestion: while (visit_list.size() != 0) { src/hotspot/share/opto/loopopts.cpp line 2477: > 2475: while (visit_list.size()) { > 2476: Node* curr = visit_list.at(0); > 2477: visit_list.remove(0); That `remove` ends up calling `Node_Array::remove`, which copies all upper entries. Generally not very performant. Not sure if it matters here, just noticed it. 
src/hotspot/share/opto/loopopts.cpp line 2481: > 2479: if (newcurr) { > 2480: continue; > 2481: } Suggestion: if (newcurr != nullptr) { continue; } src/hotspot/share/opto/loopopts.cpp line 2514: > 2512: assert(!has_ctrl(outn) || !has_ctrl(curr) || is_dominator(get_ctrl(curr), get_ctrl(outn)), > 2513: "Only these nodes controlled by loop exit edge need to be cloned"); > 2514: visit_list.push(outn); Might we visit nodes more than once? Or is that already prevented? src/hotspot/share/opto/loopopts.cpp line 2518: > 2516: } > 2517: > 2518: // 'use' may have more than one valid "Phi" uses. Example? src/hotspot/share/opto/loopopts.cpp line 2844: > 2842: // from old-loop now should use new Phis that merges Phis which merges > 2843: // values from pre-loop and main-loop and values from the new-loop > 2844: // (vectorized drain loop) equivalents. Same issue as above. Nit: Language should also just declare what the new form "is", not what it "should" be. Nit: use space after period `.All` -> `. All` src/hotspot/share/opto/loopopts.cpp line 2921: > 2919: worklist, new_counter); > 2920: } > 2921: break; Do we need to do both `fix_data_uses` and `handle_data_uses_for_vectorized_drain`? Ah, they do it once for the old and once for the new loop? It is kinda funny that we do a loop here for the `old` loop, but then do the loop inside `fix_data_uses` for the other loop - did I understand this right? Is there a good way to refactor this a little? We can also do that in a separate RFE first maybe? Because now with the large switch case here things are harder to read and get an overview quickly. What do you think? test/hotspot/jtreg/compiler/loopopts/superword/TestMultiversionRemoveUselessSlowLoop.java line 86: > 84: "multiversion_delayed_slow", "= 0", // The second loop's multiversion_if was also not used, so it is constant folded after loop opts. 
> 85: "multiversion", ">= 5", // nothing unexpected > 86: "multiversion", "<= 7", // nothing unexpected Can you please also add a lower bound for `"post .* multiversion_fast", ">= 3",` That should be correct, right? Ah ok, now we also vectorize the smaller (first) loop. But we still fully unroll the main-loop, because its stride becomes too large compared to the SIZE, right? But the post-vectorized loop is still reachable. Correct? I'm a little bit unsure where the `On platforms (> 32 bytes)` is coming from. Does this IR rule fail with a smaller `MaxVectorSize=32`? I'm wondering if it would make sense to have a few extra IR tests, with various constant SIZEs, and see which ones constant fold which loops, and if that happens as expected. I think that would be worth it. You could even automate this to some degree with the template framework. We could also make this a follow-up RFE. test/hotspot/jtreg/compiler/loopopts/superword/TestVectorizedDrainLoop.java line 31: > 29: * generated by fuzzer. > 30: * > 31: * @run main/othervm -Xint compiler.loopopts.superword.TestVectorizedDrainLoop What is the interpreter run good for? Why not just have a run without any flags instead? test/micro/org/openjdk/bench/vm/compiler/VectorizedDrainLoopPerf.java line 51: > 49: @CompilerControl(CompilerControl.Mode.DONT_INLINE) > 50: > 51: public class VectorizedDrainLoopPerf { Can you add some comments to this benchmark and to `test/micro/org/openjdk/bench/vm/compiler/VectorThroughputForIterationCount.java`, making sure that people are aware of both if they look at one? I'm also wondering if we really need to add `VectorizedDrainLoopPerf.java`, since the other benchmark does the same and even more. I have not compared them in super detail now, so maybe there are reasons. In the end, I would prefer to have one benchmark that is really good, rather than multiple ones that do similar things. 
So feel free to modify `VectorThroughputForIterationCount.java` if it does not do everything you need it to do. ------------- Changes requested by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/22629#pullrequestreview-3195298102 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329800122 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329802649 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329806392 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329812997 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329687426 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329692663 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329698477 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329732109 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329722628 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329736777 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329753240 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329763943 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329780926 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329517111 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329690592 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329789991 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329568979 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329604565 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329613322 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329640978 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329634882 PR Review Comment: 
https://git.openjdk.org/jdk/pull/22629#discussion_r2329829785 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329842181 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329850866 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329846345 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329849286 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329858268 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329864514 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329867796 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329881209 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329877189 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329554220 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329661057 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329435125 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329384829 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329401879 From epeter at openjdk.org Mon Sep 8 11:03:33 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 8 Sep 2025 11:03:33 GMT Subject: RFR: 8307084: C2: Vectorized drain loop is not executed for some small trip counts [v2] In-Reply-To: References: <3upl3uiPM5gnO1HCV7vb1C7CFyV3HQ2ztGXVJkss-AM=.09da8cb2-e384-420a-91d1-f3bb8d8cfc6a@github.com> Message-ID: On Mon, 8 Sep 2025 10:22:14 GMT, Emanuel Peter wrote: >> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains nine commits: >> >> - Merge branch 'master' into optimize-atomic-post >> - Clean up comments for consistency and add spacing for readability >> - Fix some corner case failures and refined part of code >> - Merge branch 'master' into optimize-atomic-post >> - Refine ascii art, rename some variables and resolve conflicts >> - Merge branch 'master' into optimize-atomic-post >> - Add necessary ASCII art, refactor insert_post_loop() and rename >> "atomic post loop" with "vectorized drain loop. >> - Merge branch 'master' into optimize-atomic-post >> - 8307084: C2: Vector atomic post loop is not executed for some small trip counts >> >> In C2's loop optimization, for a counted loop, if we have any of >> these conditions (RCE, unrolling) met, we switch to the >> pre-main-post-loop model. Then a counted loop could be split into >> pre-main-post loops. Meanwhile, C2 inserts minimum trip guards >> (a.k.a. zero-trip guards) before the main loop and the post loop. >> These guards test if the remaining trip count is less than the >> loop stride (after unrolling). If yes, The execution jumps over >> the loop code to avoid loop over-running. For example, if a main >> loop is unrolled to 8x, the main loop guard tests if the loop has >> less than 8 iterations and then decide which way to go. >> >> Usually, the vectorized main loop will be super-unrolled after >> vectorization. In such cases, the main loop's stride is going to >> be further multiplied. After the main loop is super-unrolled, the >> minimum trip guard test will be updated. Assuming one vector can >> operate 8 iterations and the super-unrolling count is 4, the trip >> guard of the main loop will test if remaining trip is less than >> 8 * 4 = 32. >> >> To avoid the scalar post loop running too many iterations after >> super-unrolling, C2 clones the main loop before super-unrolling to >> create a vector drain loop, i.e. atomic post loop. 
The newly >> inserted post loop also has a minimum trip guard. And, both trip >> guards of the main loop and vector post loop jump to the scalar >> post loop. >> >> The problem here is, if the remaining trip count when exiting from >> the pre-loop is relatively small but larger than the vector length, >> the vector atomic post loop will never be executed. Because the >> minimum trip guard test o... > > src/hotspot/share/opto/loopTransform.cpp line 1437: > >> 1435: // We look for an existing Phi node 'drain_input' among the uses of 'main_incr'. >> 1436: // If no valid Phi is found, we create a new Phi that merges output data edges >> 1437: // from both the pre-loop and main loop. > > Why can that happen? Do you have a small example? The solution looks a little complex, so I just want to understand why we need it ;) > src/hotspot/share/opto/loopTransform.cpp line 1848: > >> 1846: // / / >> 1847: // after loop >> 1848: Node* PhaseIdealLoop::insert_post_loop(IdealLoopTree* loop, Node_List& old_new, > > Consider renaming to `insert_post_or_drain_loop` Should we have an assert, that `mode` can only be - `ControlAroundStripMined` -> pre - `InsertVectorizedDrain` -> drain That might also help the reader understand the options here. > src/hotspot/share/opto/loopTransform.cpp line 1887: > >> 1885: // from the main loop and the pre loop. >> 1886: zero_ctrl = main_exit->unique_ctrl_out_or_null(); >> 1887: assert(zero_ctrl, "if zero_ctrl doesn't exist, pre-main-post model fails."); > > Style guide forbids implicit null / zero checks. > Suggestion: > > assert(zero_ctrl != nullptr, "if zero_ctrl doesn't exist, pre-main-post model fails."); What do you mean by `pre-main-post model fails`? PreMainPost has presumably already succeeded. Can you reformulate? > src/hotspot/share/opto/loopTransform.cpp line 1934: > >> 1932: >> 1933: // Step 3: Find a 'new_phi' which is the input trip count of the zero trip guard. 
>> 1934: Node* new_incr = nullptr; > > Is it called `new_phi` or `new_incr`? `phi` is usually the `PhiNode`, and `incr` is the `AddINode`, right? > src/hotspot/share/opto/loopnode.hpp line 1359: > >> 1357: // from old-loop now should use new Phis that merges Phis which merges >> 1358: // values from pre-loop and main-loop and values from the new-loop >> 1359: // (vectorized drain loop) equivalents. > > I'm struggling with reading this. "x that merges y that merges z and w and v" - where do I have to place the brackets? > `x that merges y that (merges z and w) and v` > Probably this? Maybe you can just write it like this instead: Before: r_old = Region(pre_loop_exit ... or zero-trip-guard?, main_loop_exit) phi_old = Phi(pre_loop_outputs, main_loop_outputs) After: r_old = Region(pre_loop_exit, main_loop_exit) phi_old = Phi(....) r_new = Region(r_old, drain_loop_exit) phi_new = Phi(phi_old, drain_loop_outputs) Or you just say that we first merge pre-loop and main-loop, and then merge that with drain-loop? So a more high level comment, and then refer for more details elsewhere? > src/hotspot/share/opto/loopopts.cpp line 2341: > >> 2339: // Take the loop increment "i" as an example. >> 2340: // Now the data uses about "i" are like: >> 2341: > > Nit: I would do the `//` continuously, like elsewhere. With `i` do you mean the trip-counter phi? You don't really use `i` below anyway, so I'd just drop it. Suggestion: // This function is going to fix all data uses of the new loop body. // // Let us look at the trip-counter phi, as an example to understand the data uses: // > src/hotspot/share/opto/loopopts.cpp line 2356: > >> 2354: // / | \_____________________________ >> 2355: // / | \ >> 2356: // | |--> main loop head |---> vectorized drain loop head > > Is there some IfNode here that decides between main and drain? Or does that come later? 
> Suggestion: > > // IfFalse IfTrue > // / | ________(moved later)________ > // / | \ > // | |--> main loop head |---> vectorized drain loop head Ah, also the input to the PhiNode below is not yet fixed, right? Maybe it's not worth mentioning any of it at all then... not sure. > src/hotspot/share/opto/loopopts.cpp line 2435: > >> 2433: void PhaseIdealLoop::handle_data_uses_for_vectorized_drain(Node* main_old, Node_List &old_new, >> 2434: IdealLoopTree* loop, IdealLoopTree* outer_loop, >> 2435: Node_List& worklist, uint new_counter) { > > `handle` is a very generic verb. `fix` is already used elsewhere for the same purpose, so why not use that instead? > > Maybe `fix_data_uses_with_drain_merge_phis`? > I'll have to read the code below to make sure that makes sense now. It would probably make sense to have things matching with: `fix_ctrl_uses_for_vectorized_drain` Suggestion alternatives: - `fix_ctrl_uses_for_vectorized_drain` and `fix_data_uses_for_vectorized_drain` - `fix_ctrl_uses_with_drain_merge_region` and `fix_data_uses_with_drain_merge_phis` > src/hotspot/share/opto/loopopts.cpp line 2452: > >> 2450: _igvn.replace_node(use, hit); >> 2451: } >> 2452: }; > > The existing code style is to avoid lambdas and use helper methods instead. Would that be possible here? Probably just requires a few more arguments, right? `new_counter` for example. The comment is a little hard to read. Maybe say this instead: `For the 'use' node, replace all input occurrences of 'old_in' with 'new_in'.` You also do more in the method than the name/comment promises: you replace use with hit. > src/hotspot/share/opto/loopopts.cpp line 2458: > >> 2456: >> 2457: while (worklist.size()) { >> 2458: Node* use = worklist.pop(); > > Can you add a quick comment what kind of traversal this is? BFS? Over what nodes? Ah, are we only removing nodes?
> src/hotspot/share/opto/loopopts.cpp line 2477: > >> 2475: while (visit_list.size()) { >> 2476: Node* curr = visit_list.at(0); >> 2477: visit_list.remove(0); > > That `remove` ends up calling `Node_Array::remove`, which copies all upper entries. Generally not very performant. Not sure if it matters here, just noticed it. Maybe you can construct some graph where this really visits a lot of nodes, then this could blow up quadratically. > src/hotspot/share/opto/loopopts.cpp line 2481: > >> 2479: if (newcurr) { >> 2480: continue; >> 2481: } > > Suggestion: > > if (newcurr != nullptr) { continue; } You have more implicit zero/null checks below. > src/hotspot/share/opto/loopopts.cpp line 2518: > >> 2516: } >> 2517: >> 2518: // 'use' may have more than one valid "Phi" uses. > > Example? Can you quickly say what this loop does with each phi? > test/hotspot/jtreg/compiler/loopopts/superword/TestMultiversionRemoveUselessSlowLoop.java line 86: > >> 84: "multiversion_delayed_slow", "= 0", // The second loop's multiversion_if was also not used, so it is constant folded after loop opts. >> 85: "multiversion", ">= 5", // nothing unexpected >> 86: "multiversion", "<= 7", // nothing unexpected > > Can you please also add a lower bound for > `"post .* multiversion_fast", ">= 3",` > That should be correct, right? > > Ah ok, now we also vectorize the smaller (first) loop. But we still fully unroll the main-loop, because its stride becomes too large compared to the SIZE, right? But the post-vectorized loop is still reachable. Correct? > > > I'm a little bit unsure where the `On platforms (> 32 bytes)` is coming from. Does this IR rule fail with a smaller `MaxVectorSize=32`? > > I'm wondering if it would make sense to have a few extra IR tests, with various constant SIZEs, and see which ones constant fold which loops, and if that happens as expected. I think that would be worth it. > > You could even automate this to some degree with the template framework. 
We could also make this a follow-up RFE. I'm also wondering if it would not be nicer to have a different tag for the vectorized drain loop, instead of `post`. Could we call it `vector_drain` maybe? That would make it easier to spot it correctly and to write more expressive IR rules. > test/hotspot/jtreg/compiler/loopopts/superword/TestVectorizedDrainLoop.java line 31: > >> 29: * generated by fuzzer. >> 30: * >> 31: * @run main/othervm -Xint compiler.loopopts.superword.TestVectorizedDrainLoop > > What is the interpreter run good for? Why not just have a run without any flags instead? Ah, you have exact constant results that you compare with. Could be good to state this here as a comment, so that nobody removes this in the future. You are just making sure that the interpreter would have produced the same results. Still: why not add a run without any flags? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329815918 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329691247 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329702206 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329743606 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329531126 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329591377 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329624643 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329646048 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329841458 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329853895 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329870109 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329870887 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329884145 PR Review Comment: 
https://git.openjdk.org/jdk/pull/22629#discussion_r2329438824 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329388742 From epeter at openjdk.org Mon Sep 8 11:03:33 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 8 Sep 2025 11:03:33 GMT Subject: RFR: 8307084: C2: Vectorized drain loop is not executed for some small trip counts [v2] In-Reply-To: References: <3upl3uiPM5gnO1HCV7vb1C7CFyV3HQ2ztGXVJkss-AM=.09da8cb2-e384-420a-91d1-f3bb8d8cfc6a@github.com> Message-ID: On Mon, 8 Sep 2025 09:38:11 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/loopTransform.cpp line 1887: >> >>> 1885: // from the main loop and the pre loop. >>> 1886: zero_ctrl = main_exit->unique_ctrl_out_or_null(); >>> 1887: assert(zero_ctrl, "if zero_ctrl doesn't exist, pre-main-post model fails."); >> >> Style guide forbids implicit null / zero checks. >> Suggestion: >> >> assert(zero_ctrl != nullptr, "if zero_ctrl doesn't exist, pre-main-post model fails."); > > What do you mean by `pre-main-post model fails`? PreMainPost has presumably already succeeded. Can you reformulate? Why not add the `Region` check already here? >> src/hotspot/share/opto/loopTransform.cpp line 1934: >> >>> 1932: >>> 1933: // Step 3: Find a 'new_phi' which is the input trip count of the zero trip guard. >>> 1934: Node* new_incr = nullptr; >> >> Is it called `new_phi` or `new_incr`? > > `phi` is usually the `PhiNode`, and `incr` is the `AddINode`, right? So you could actually make the type more precise than `Node*` :) >> src/hotspot/share/opto/loopopts.cpp line 2458: >> >>> 2456: >>> 2457: while (worklist.size()) { >>> 2458: Node* use = worklist.pop(); >> >> Can you add a quick comment what kind of traversal this is? BFS? Over what nodes? > > Ah, are we only removing nodes? Oh, you have another implicit zero check here. 
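Editorial illustration of the `remove(0)` concern raised in the review above: removing the front element of an array-backed list shifts every remaining entry, so a breadth-first traversal written that way can degrade quadratically. The following is a minimal sketch in Java, not HotSpot code; the class `WorklistBfs`, the method `bfsOrder`, and the adjacency-list graph are hypothetical names chosen for the example. It keeps BFS order by advancing a read index over an append-only worklist instead of removing from the front.

```java
import java.util.ArrayList;
import java.util.List;

public class WorklistBfs {
    // Breadth-first traversal over an adjacency list, using an append-only
    // worklist plus a read index instead of removing the front element.
    static List<Integer> bfsOrder(List<List<Integer>> successors, int start) {
        boolean[] visited = new boolean[successors.size()];
        List<Integer> worklist = new ArrayList<>();
        worklist.add(start);
        visited[start] = true;
        // 'head' plays the role of the queue front; advancing it is O(1),
        // whereas removing index 0 would shift every remaining entry.
        for (int head = 0; head < worklist.size(); head++) {
            int curr = worklist.get(head);
            for (int next : successors.get(curr)) {
                if (!visited[next]) {
                    visited[next] = true;
                    worklist.add(next);
                }
            }
        }
        return worklist; // entries appear in BFS visitation order
    }

    public static void main(String[] args) {
        List<List<Integer>> g = List.of(List.of(1, 2), List.of(3), List.of(3), List.of());
        System.out.println(bfsOrder(g, 0));
    }
}
```

Taking entries from the end of the list would also be cheap, but it would turn the traversal into a depth-first one; the read-index variant is one way to keep the breadth-first order without the front removal.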
>> src/hotspot/share/opto/loopopts.cpp line 2477: >> >>> 2475: while (visit_list.size()) { >>> 2476: Node* curr = visit_list.at(0); >>> 2477: visit_list.remove(0); >> >> That `remove` ends up calling `Node_Array::remove`, which copies all upper entries. Generally not very performant. Not sure if it matters here, just noticed it. > > Maybe you can construct some graph where this really visits a lot of nodes, then this could blow up quadratically. `pop` is more efficient, because it just takes it from the end. But then you'd get a DFS and not BFS. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329711083 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329746816 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329856226 PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329879100 From epeter at openjdk.org Mon Sep 8 11:03:33 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 8 Sep 2025 11:03:33 GMT Subject: RFR: 8307084: C2: Vectorized drain loop is not executed for some small trip counts [v2] In-Reply-To: References: <3upl3uiPM5gnO1HCV7vb1C7CFyV3HQ2ztGXVJkss-AM=.09da8cb2-e384-420a-91d1-f3bb8d8cfc6a@github.com> Message-ID: On Mon, 8 Sep 2025 09:53:49 GMT, Emanuel Peter wrote: >> `phi` is usually the `PhiNode`, and `incr` is the `AddINode`, right? > > So you could actually make the type more precise than `Node*` :) Or do we have to somehow support `long` loops too here? then we could just make it an `AddNode*`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329751232 From mli at openjdk.org Mon Sep 8 12:49:13 2025 From: mli at openjdk.org (Hamlin Li) Date: Mon, 8 Sep 2025 12:49:13 GMT Subject: RFR: 8367048: RISC-V: Correct pipeline descriptions of the architecture In-Reply-To: References: Message-ID: On Mon, 8 Sep 2025 05:13:32 GMT, Dingli Zhang wrote: > Hi, > Can you help to review this patch? Thanks! 
> > This patch updates the RISC-V pipeline attributes to variable_size_instructions to properly account for the 2-byte compressed instructions from the C extension. > Furthermore, it increases the max_instructions_per_bundle to 4 and adjusts the instruction_unit_size to match 4-issue RISC-V hardware like the UR-CP100. > > ### Test > - [x] Run tier1 and tier2 on sg2042 Marked as reviewed by mli (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/27134#pullrequestreview-3196372198 From epeter at openjdk.org Mon Sep 8 13:34:26 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 8 Sep 2025 13:34:26 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v8] In-Reply-To: References: Message-ID: On Wed, 3 Sep 2025 21:29:43 GMT, Vladimir Ivanov wrote: >> This PR introduces C2 support for `Reference.reachabilityFence()`. >> >> After [JDK-8199462](https://bugs.openjdk.org/browse/JDK-8199462) went in, it was discovered that C2 may break the invariant the fix relied upon [1]. So, this is an attempt to introduce proper support for `Reference.reachabilityFence()` in C2. C1 is left intact for now, because there are no signs yet it is affected. >> >> `Reference.reachabilityFence()` can be used in performance critical code, so the primary goal for C2 is to reduce its runtime overhead as much as possible. The ultimate goal is to ensure liveness information is attached to interfering safepoints, but it takes multiple steps to properly propagate the information through compilation pipeline without negatively affecting generated code quality. >> >> Also, I don't consider this fix as complete. It does fix the reported problem, but it doesn't provide any strong guarantees yet. In particular, since `ReachabilityFence` is CFG-only node, nothing explicitly forbids memory operations to float past `Reference.reachabilityFence()` and potentially reaching some other safepoints current analysis treats as non-interfering. 
Representing `ReachabilityFence` as memory barrier (e.g., `MemBarCPUOrder`) would solve the issue, but performance costs are prohibitively high. Alternatively, the optimization proposed in this PR can be improved to conservatively extend referent's live range beyond `ReachabilityFence` nodes associated with it. It would meet performance criteria, but I prefer to implement it as a followup fix. >> >> Another known issue relates to reachability fences on constant oops. If such constant is GCed (most likely, due to a bug in Java code), similar reachability issues may arise. For now, RFs on constants are treated as no-ops, but there's a diagnostic flag `PreserveReachabilityFencesOnConstants` to keep the fences. I plan to address it separately. >> >> [1] https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/ref/Reference.java#L667 >> "HotSpot JVM retains the ref and does not GC it before a call to this method, because the JIT-compilers do not have GC-only safepoints." >> >> Testing: >> - [x] hs-tier1 - hs-tier8 >> - [x] hs-tier1 - hs-tier6 w/ -XX:+StressReachabilityFences -XX:+VerifyLoopOptimizations >> - [x] java/lang/foreign microbenchmarks > > Vladimir Ivanov has updated the pull request incrementally with one additional commit since the last revision: > > whitespaces Thanks for all the updates. I went over all your responses quickly, but still need to read through the description in `reachability.cpp` now. I'll do another pass over everything once you address my responses ;) src/hotspot/share/opto/callGenerator.cpp line 623: > 621: return; // keep the original call node as the holder of reachability info > 622: } > 623: } Maybe that's just me. But people use the assert messages both in positive and negative ways, and so this is a bit ambiguous. Maybe you can write: `no reachability edge should be present` I'm still a bit unsure what the `SafePointNode::grow_stack` comment means. 
In the previous comment https://github.com/openjdk/jdk/pull/25315#discussion_r2320120466 you explained more. Why not add that here instead? src/hotspot/share/opto/compile.cpp line 2522: > 2520: if (failing()) return; > 2521: assert(_reachability_fences.length() == 0, "no RF nodes allowed"); > 2522: } Looks better than before :) I'm still wondering: do we need to do a whole loop-opts phase here? It probably has a performance impact, right? Have you measured that? If it is measurable: could we just go through `_reachability_fences`, and hack the graph and clean up with IGVN? Or do we really need the loop state to do this successfully? src/hotspot/share/opto/loopTransform.cpp line 66: > 64: //------------------------------unique_loop_exit_or_null---------------------- > 65: // Return the loop-exit projection if it is unique. > 66: Node* IdealLoopTree::unique_loop_exit_or_null() { I suggested it here: https://github.com/openjdk/jdk/pull/25315#discussion_r2149677594 Can we change the return type to `IfProjNode`? Also: when is it possible that there are none or multiple loop exits? Can you add a comment below where you return nullptr? src/hotspot/share/opto/macro.cpp line 973: > 971: _igvn._worklist.push(ac); > 972: } else if (use->is_ReachabilityFence() && OptimizeReachabilityFences) { > 973: use->as_ReachabilityFence()->clear_referent(_igvn); // redundant fence Thanks for refactoring a bit here :) Is this rf guaranteed to belong to the Allocation somehow? src/hotspot/share/opto/parse1.cpp line 2233: > 2231: insert_reachability_fence(referent); > 2232: } > 2233: } Comments look better, thanks :) But `StressReachabilityFences` seems to promise that it should happen randomly. Did you want to do that or adjust the flag comment? src/hotspot/share/opto/reachability.cpp line 136: > 134: return true; > 135: } > 136: } Nit: `an no-op` -> `a no-op` Also: do you need the return value? The only use case does not do anything with it. 
src/hotspot/share/opto/reachability.cpp line 438: > 436: if (!OptimizeReachabilityFences) { > 437: return false; > 438: } Can this ever fail? Could it be an assert? src/hotspot/share/opto/reachability.cpp line 441: > 439: > 440: Unique_Node_List redundant_rfs; > 441: Node_List worklist; Not sure if necessary, but maybe good practice anyway: add `ResourceMark`. src/hotspot/share/opto/reachability.cpp line 453: > 451: SafePointNode* sfpt = safepoints.pop()->as_SafePoint(); > 452: assert(is_dominator(get_ctrl(referent), sfpt), ""); > 453: assert(sfpt->req() == rf_start_offset(sfpt), ""); Is this the only reason we need this to happen during LoopOpts - i.e. that we can call `get_ctrl` and `is_dominator`? Because it is potentially a lot of overhead to create the whole loop-opts structures just for this. ------------- PR Review: https://git.openjdk.org/jdk/pull/25315#pullrequestreview-3196301873 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2330095168 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2330176841 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2330209593 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2330230044 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2330256973 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2330221500 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2330181204 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2330188708 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2330192891 From epeter at openjdk.org Mon Sep 8 13:34:28 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 8 Sep 2025 13:34:28 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v3] In-Reply-To: References: Message-ID: On Wed, 3 Sep 2025 20:19:38 GMT, Vladimir Ivanov wrote: >> src/hotspot/share/opto/c2_globals.hpp line 83: >> >>> 81: \ >>> 82: 
product(bool, StressReachabilityFences, false, DIAGNOSTIC, \ >>> 83: "Randomly insert ReachabilityFence nodes") \ >> >> Drive-by sniping: what about a hello-world test where you test out these flags? > > Good idea. Added one. Also: you promise that it happens randomly. But it seems to be added deterministically everywhere. Did I miss something? >> src/hotspot/share/opto/callnode.hpp line 497: >> >>> 495: // Are we guaranteed that this node is a safepoint? Not true for leaf calls and >>> 496: // for some macro nodes whose expansion does not have a safepoint on the fast path. >>> 497: virtual bool guaranteed_safepoint() { return true; } >> >> I see you only copied it. It makes me a little nervous when we call the "default" case safe. Because when you add more cases, you just assume it is safe... and if it is not we first have to discover that through a bug. What do you think? > > Well, it's a SafePointNode class after all. I lifted it from `CallNode` subclass to avoid elaborate check on SafePoint nodes (`!is_Call() || as_Call() && guaranteed_safepoint()`). > > If some node extends SafePointNode, but doesn't keep JVM state, it has to communicate it to users one way or another. And changing the default doesn't improve the situation IMO: reporting a safepoint node as a non-safepoint is still a bug. Hmm. The way it is formulated it sounds more like: - `true` -> we are guaranteed that it is a safepoint. - `false` -> it may or may not be a safepoint - no guarantees. Am I understanding this right? If yes, then it would make more sense to have a default that is `no guarantee`. But maybe that makes things more complicated in other ways. All I'm saying is that it makes me nervous ;) >> src/hotspot/share/opto/parse.hpp line 361: >> >>> 359: bool _wrote_fields; // Did we write any field? 
>>> 360: Node* _alloc_with_final_or_stable; // An allocation node with final or @Stable field >>> 361: Node* _stress_rf_hook; // StressReachabilityFences support >> >> You could write out the `rf` > > I'd like to avoid that. `_stress_reachability_fence_hook` is way too verbose IMO. The declaration and all the accesses are accompanied by `StressReachabilityFences` which should make it clear what `rf` refers to. Fair enough. It's always a trade-off. Works here because of `StressReachabilityFences` :) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2330253854 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2330166192 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2330245481 From epeter at openjdk.org Mon Sep 8 13:34:29 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 8 Sep 2025 13:34:29 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v8] In-Reply-To: References: Message-ID: <_n3uP_Dkl3RNq3MFoRDXsS28SM8CcQHaR6vdUJF9U8s=.dcfab97b-be28-4244-93df-c8a23d6d66b8@github.com> On Mon, 8 Sep 2025 12:29:15 GMT, Emanuel Peter wrote: >> Vladimir Ivanov has updated the pull request incrementally with one additional commit since the last revision: >> >> whitespaces > > src/hotspot/share/opto/callGenerator.cpp line 623: > >> 621: return; // keep the original call node as the holder of reachability info >> 622: } >> 623: } > > Maybe that's just me. But people use the assert messages both in positive and negative ways, and so this is a bit ambiguous. Maybe you can write: > `no reachability edge should be present` > > I'm still a bit unsure what the `SafePointNode::grow_stack` comment means. > In the previous comment https://github.com/openjdk/jdk/pull/25315#discussion_r2320120466 you explained more. Why not add that here instead? I'm also not sure yet why there is a difference between incremental inlining and regular inlining. 
Do you think it would make sense to explain that here, or is it explained elsewhere? > src/hotspot/share/opto/macro.cpp line 973: > >> 971: _igvn._worklist.push(ac); >> 972: } else if (use->is_ReachabilityFence() && OptimizeReachabilityFences) { >> 973: use->as_ReachabilityFence()->clear_referent(_igvn); // redundant fence > > Thanks for refactoring a bit here :) > > Is this rf guaranteed to belong to the Allocation somehow? Ah, you could mention that later `ReachabilityFenceNode::Identity` removes the rf. > src/hotspot/share/opto/reachability.cpp line 136: > >> 134: return true; >> 135: } >> 136: } > > Nit: `an no-op` -> `a no-op` > > Also: do you need the return value? The only use case does not do anything with it. You could mention that `Identity` will remove the node later. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2330138204 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2330236031 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2330237394 From epeter at openjdk.org Mon Sep 8 13:34:30 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 8 Sep 2025 13:34:30 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v3] In-Reply-To: References: Message-ID: On Wed, 3 Sep 2025 20:28:40 GMT, Vladimir Ivanov wrote: >> Can you quickly comment why you changed this? > > Some call nodes inspected during `expand_reachability_fences` demonstrate this IR shape where some exception table projections are directly attached to the call node. > > Looks like a missed case in `CallNode::extract_projections` we simply never hit before. Alright, sounds good! Do you think this could have happened somehow, i.e. was this a bug that we could somehow reproduce? >> The arguments are less important for me. 
> > There are 2 types of methods here: internal ones (used solely in `reachability.cpp`) and those which are called from loop optimization code (`optimize_reachability_fences` and `eliminate_reachability_fences`). > > IMO it's counter-productive to repeatedly spell out what "RF" means inside `reachability.cpp`, so I kept the names intact. I split the declarations into public and private ones to stress the distinction. Great, the private/public split works for me :) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2330151757 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2330213859 From epeter at openjdk.org Mon Sep 8 14:52:18 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 8 Sep 2025 14:52:18 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v8] In-Reply-To: References: Message-ID: On Wed, 3 Sep 2025 21:29:43 GMT, Vladimir Ivanov wrote: >> This PR introduces C2 support for `Reference.reachabilityFence()`. >> >> After [JDK-8199462](https://bugs.openjdk.org/browse/JDK-8199462) went in, it was discovered that C2 may break the invariant the fix relied upon [1]. So, this is an attempt to introduce proper support for `Reference.reachabilityFence()` in C2. C1 is left intact for now, because there are no signs yet it is affected. >> >> `Reference.reachabilityFence()` can be used in performance critical code, so the primary goal for C2 is to reduce its runtime overhead as much as possible. The ultimate goal is to ensure liveness information is attached to interfering safepoints, but it takes multiple steps to properly propagate the information through compilation pipeline without negatively affecting generated code quality. >> >> Also, I don't consider this fix as complete. It does fix the reported problem, but it doesn't provide any strong guarantees yet. 
In particular, since `ReachabilityFence` is CFG-only node, nothing explicitly forbids memory operations to float past `Reference.reachabilityFence()` and potentially reaching some other safepoints current analysis treats as non-interfering. Representing `ReachabilityFence` as memory barrier (e.g., `MemBarCPUOrder`) would solve the issue, but performance costs are prohibitively high. Alternatively, the optimization proposed in this PR can be improved to conservatively extend referent's live range beyond `ReachabilityFence` nodes associated with it. It would meet performance criteria, but I prefer to implement it as a followup fix. >> >> Another known issue relates to reachability fences on constant oops. If such constant is GCed (most likely, due to a bug in Java code), similar reachability issues may arise. For now, RFs on constants are treated as no-ops, but there's a diagnostic flag `PreserveReachabilityFencesOnConstants` to keep the fences. I plan to address it separately. >> >> [1] https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/ref/Reference.java#L667 >> "HotSpot JVM retains the ref and does not GC it before a call to this method, because the JIT-compilers do not have GC-only safepoints." >> >> Testing: >> - [x] hs-tier1 - hs-tier8 >> - [x] hs-tier1 - hs-tier6 w/ -XX:+StressReachabilityFences -XX:+VerifyLoopOptimizations >> - [x] java/lang/foreign microbenchmarks > > Vladimir Ivanov has updated the pull request incrementally with one additional commit since the last revision: > > whitespaces A few comments about the `reachability.cpp` intro. I think we are on a good way here :) src/hotspot/share/opto/reachability.cpp line 49: > 47: * > 48: * It is tempting to directly attach referents to interfering safepoints right from the beginning, but it > 49: * doesn't play well with some optimizations C2 does. Do you have an example for such optimizations? 
src/hotspot/share/opto/reachability.cpp line 67: > 65: * RF nodes may interfere with RA, so stand-alone RF nodes are eliminated and their referents are > 66: * transferred to corresponding safepoints (phase #2). When safepoints are pruned during macro expansion, > 67: * corresponding reachability edges also go away. Spell out RA on first use. Make it clearer that this is why we eliminate RF before RA. Suggestion: * RF nodes may interfere with register allocation (RA), hence we eliminate RF nodes and transfer their * referents to corresponding safepoints (phase #2). When safepoints are pruned during macro expansion, * corresponding reachability edges also go away. `reachability edges also go away` ... and that is ok why? Sketch of what you could write, is it correct? - reachability only needs to be correct at SafePoints. If all the SafePoints are removed for a referent, then we don't need to ensure its reachability. src/hotspot/share/opto/reachability.cpp line 71: > 69: * Unfortunately, it's not straightforward to stay with safepoint-attached representation till the very end, > 70: * because information about derived oops is attached to safepoints the very same similar way. So, for now RFs are > 71: * rematerialized at safepoints before RA (phase #3). `the very same similar way` sounds a little funny. I'm also not quite seeing the problem yet. What is the issue with the edges being attached to safepoints here?
------------- PR Review: https://git.openjdk.org/jdk/pull/25315#pullrequestreview-3196820681 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2330441117 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2330487392 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2330491632 From epeter at openjdk.org Mon Sep 8 14:52:19 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 8 Sep 2025 14:52:19 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v3] In-Reply-To: References: Message-ID: On Wed, 3 Sep 2025 20:14:56 GMT, Vladimir Ivanov wrote: >> src/hotspot/share/opto/reachability.cpp line 51: >> >>> 49: * >>> 50: * It looks attractive to get rid of RF nodes early and transfer to safepoint-attached representation, >>> 51: * but it is not correct until loop opts are done. >> >> Why is it not correct? What could go wrong? Why is it safe to do it after loop opts? > > Live ranges of values are routinely extended during loop opts. And it can break the invariant that all interfering safepoints contain the referent in their oop map. (If an interfering safepoint doesn't keep the referent alive, then it becomes possible for the referent to be prematurely GCed.) > > After loop opts are over, it becomes possible to reliably enumerate all interfering safe points and ensure the referent present in their oop maps. 
Can you make sure this explanation is in the comment ;) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2330449580 From rcastanedalo at openjdk.org Mon Sep 8 15:43:10 2025 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Mon, 8 Sep 2025 15:43:10 GMT Subject: RFR: 8361699: C2: assert(can_reduce_phi(n->as_Phi())) failed: Sanity: previous reducible Phi is no longer reducible before SUT In-Reply-To: <1uDOe3Oe-hihmDHea2h8vcvRZsKKBeNp0J9lKYUujxk=.abd111bc-3625-4c71-bfa2-0a4c1f4d3875@github.com> References: <1uDOe3Oe-hihmDHea2h8vcvRZsKKBeNp0J9lKYUujxk=.abd111bc-3625-4c71-bfa2-0a4c1f4d3875@github.com> Message-ID: On Thu, 4 Sep 2025 07:44:52 GMT, Roberto Castañeda Lozano wrote: > Hi Cesar, thanks for addressing this issue. I will run some more comprehensive testing and have a look at it in the next days. Testing did not reveal any issue. I have, however, a high-level question: could the current two-step design ([SR state adjustment loop](https://github.com/openjdk/jdk/blob/166ef5e7b1c6d6a9f0f1f29fedb7f65b94f53119/src/hotspot/share/opto/escape.cpp#L300-L315) followed by a [NSR propagation loop](https://github.com/openjdk/jdk/blob/166ef5e7b1c6d6a9f0f1f29fedb7f65b94f53119/src/hotspot/share/opto/escape.cpp#L318-L320)) miss marking allocations as NSR in more complex scenarios, e.g. involving longer points-to/merge chains? Wouldn't it be more principled to re-run the SR state adjustment loop until a fixed point is reached, keeping `reducible_merges` consistent as new allocations are discovered to be NSR? (e.g. by calling `revisit_reducible_phi_status` - with your clean-up applied - every time [an allocation is marked as NSR due to non-removable merges](https://github.com/openjdk/jdk/blob/166ef5e7b1c6d6a9f0f1f29fedb7f65b94f53119/src/hotspot/share/opto/escape.cpp#L2962-L2964)).
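To make the fixed-point question concrete, here is a minimal C++ sketch on a toy model. It is not the logic in `escape.cpp`: the types `Allocation` and `Merge`, the function `propagate_nsr_to_fixed_point`, and the rule that a merge stays reducible only while all of its inputs are scalar replaceable (SR) are assumptions made purely for illustration.

```cpp
#include <cstddef>
#include <vector>

// Toy model, not HotSpot code: an allocation is either scalar replaceable (SR) or not (NSR).
struct Allocation {
  bool scalar_replaceable;
};

// A merge (Phi) lists the indices of the allocations that flow into it.
struct Merge {
  std::vector<std::size_t> inputs;
  bool reducible;
};

// Assumed rules for this sketch:
//  - a merge is reducible only while all of its inputs are SR;
//  - every allocation feeding a non-reducible merge becomes NSR.
// The loop repeats until a full pass changes nothing and returns the number of passes.
int propagate_nsr_to_fixed_point(std::vector<Allocation>& allocs, std::vector<Merge>& merges) {
  int passes = 0;
  bool changed = true;
  while (changed) {
    changed = false;
    ++passes;
    for (Merge& m : merges) {
      bool all_sr = true;
      for (std::size_t i : m.inputs) {
        all_sr = all_sr && allocs[i].scalar_replaceable;
      }
      if (!all_sr && m.reducible) {
        m.reducible = false;
        changed = true;
      }
      if (!m.reducible) {
        for (std::size_t i : m.inputs) {
          if (allocs[i].scalar_replaceable) {
            allocs[i].scalar_replaceable = false;
            changed = true;
          }
        }
      }
    }
  }
  return passes;
}
```

In this toy model, with allocations {NSR, SR, SR} and merges over {1, 2} and {0, 1} visited in that order, the first pass marks allocation 1 as NSR only after the merge over {1, 2} has already been visited. A second pass is needed before allocation 2 is marked as well, and a third pass confirms that nothing changes anymore.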
------------- PR Comment: https://git.openjdk.org/jdk/pull/27063#issuecomment-3266887455 From dlunden at openjdk.org Mon Sep 8 15:49:27 2025 From: dlunden at openjdk.org (Daniel Lundén) Date: Mon, 8 Sep 2025 15:49:27 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v24] In-Reply-To: References: Message-ID: On Tue, 2 Sep 2025 14:08:11 GMT, Emanuel Peter wrote: >> The main issue is that register masks are stored as part of certain nodes, and nodes get copied by `Node::clone`. If someone in the future decides to add a register mask to some type of node, and forgets to add a special case (like what I've now added for `MachProj`) in `Node::clone` for the node type, this safeguard will catch it and complain. >> >> Register masks are used in peculiar ways throughout C2, and there may be other unexpected cases as well that this safeguard catches. I doubt the `_read_only` part has a measurable performance effect, I only added it because it was easy and couldn't hurt. > >> The main issue is that register masks are stored as part of certain nodes, and nodes get copied by Node::clone > > Ok, that answers it for me. Maybe you can expand the comment a little where you mention that masks are `shallowly copied` Sure, will do. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2330650060 From dlunden at openjdk.org Mon Sep 8 15:56:28 2025 From: dlunden at openjdk.org (Daniel Lundén) Date: Mon, 8 Sep 2025 15:56:28 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v24] In-Reply-To: References: Message-ID: <7oAZrBdRb6r_63mYjkvgPVjc_eTbqVwtD0SSp33MOzo=.54688223-6b53-4811-89b4-e1a1eac60355@github.com> On Tue, 2 Sep 2025 14:16:18 GMT, Emanuel Peter wrote: >> Yes, you are correct. There is a detailed explanation in `x86_64.ad` ("Definition of frame structure and management information"). > > Ok. But that's not immediately apparent here.
If you already have a comment, why not mention caller/callee or inner/outer scope? Sure, I'll add that. >> Right, we should probably update this terminology as well. It comes from the fact that register masks can always represent all registers (+ a few stack slots), and anything beyond the mask is necessarily additional stack slots. So, if `_all_stack` is set, it means the register mask includes all of the stack slots. Any suggestion for a better name? > > So that could mean that we have stack slots that are in the mask, and that are off, but we still have `_all_stack = true`, right? That sounds a little contradictory to me. > > Some ideas: > - `_value_of_bits_above_mask` - though strictly speaking the mask also represents those bits, and so they are not really "above" the mask. > - `_value_of_bits_above_...` ah it is above the register mask `size`, right? Of course it is a bit suboptimal that the `size` is only for those that we explicitly represent, and does not capture those that we implicitly represent. Maybe you can think about naming here too. Optional. I agree that the current naming is a bit contradictory, but I'm not sure how to rename it. I'll think a bit and propose something in the renaming-PR. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2330673189 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2330663867 From dlunden at openjdk.org Mon Sep 8 16:23:29 2025 From: dlunden at openjdk.org (Daniel Lundén) Date: Mon, 8 Sep 2025 16:23:29 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v24] In-Reply-To: References: <_a6JVBA326t8l1U3ZI8C-J3Ju5jm-RklBFGtnR7fbyY=.70638135-7577-44dc-a212-fe5e39b1f5fa@github.com> Message-ID: On Tue, 2 Sep 2025 14:38:45 GMT, Daniel Lundén wrote: >> Hmm ok. Now I went to `rm_up` and thought that you would do `i - _offset`. But that's not what happens.
>> >> Hmm but then here there is a subtraction: >> >> bool Member(OptoReg::Name reg) const { >> reg = reg - offset_bits(); >> >> >> Is that consistent? I hope you understand why I'm confused. > > Yes, the subtraction is consistent, because if the register mask is offset, we can no longer use the OptoReg to directly index the mask. Small simplified example: register mask with 5 bits, offset by 10. First bit (index 0) represents OptoReg 10, second bit (index 1) represents OptoReg 11, etc. If we call `Member(15)`, we need to subtract the offset so we look at the correct index in the register mask (index 5). Ah, I think I now better understand your question. `rm_up` is a low-level method for internal use in `regmask.hpp` and `regmask.cpp` only (perhaps I should prepend it with an underscore?). It basically makes it so that we can regard the backing storage (`_RM_UP` and `_RM_UP_EXT`) as one contiguous array. `Member` is exposed externally and so needs the offset logic. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2330742374 From epeter at openjdk.org Mon Sep 8 17:07:31 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 8 Sep 2025 17:07:31 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v5] In-Reply-To: References: Message-ID: On Fri, 5 Sep 2025 08:13:35 GMT, Marc Chevalier wrote: >> Some crashes are consequences of earlier misshaped ideal graphs, which could be detected earlier, closer to the source, before the possibly many transformations that lead to the crash. >> >> Let's verify that the ideal graph is well-shaped earlier then! I propose here such a feature. This runs after IGVN, because at this point, the graph should be cleaned up for any weirdness happening earlier or during IGVN. >> >> This feature is enabled with the develop flag `VerifyIdealStructuralInvariants`. Open to renaming. No problem with me!
This feature is only available in debug builds, and most of the code is even not compiled in product, since it uses some debug-only functions, such as `Node::dump` or `Node::Name`. >> >> For now, only local checks are implemented: they are checks that only look at a node and its neighborhood, wherever it happens in the graph. Typically: under a `If` node, we have a `IfTrue` and a `IfFalse`. To ease development, each check is implemented in its own class, independently of the others. Nevertheless, one needs to do always the same kind of things: checking there is an output of such type, checking there is N inputs, that the k-th input has such type... To ease writing such checks, in a readable way, and in a less error-prone way than pile of copy-pasted code that manually traverse the graph, I propose a set of compositional helpers to write patterns that can be matched against the ideal graph. Since these patterns are... patterns, so not related to a specific graph, they can be allocated once and forever. When used, one provides the node (called center) around which one want to check if the pattern holds. >> >> On top of making the description of pattern easier, these helpers allows nice printing in case of error, by showing the path from the center to the violating node. For instance (made up for the purpose of showing the formatting), a violation with a path climbing only inputs: >> >> 1 failure for node >> 211 OuterStripMinedLoopEnd === 215 39 [[ 212 198 ]] P=0,948966, C=23799,000000 >> At node >> 209 CountedLoopEnd === 182 208 [[ 210 197 ]] [lt] P=0,948966, C=23799,000000 !orig=[196] !jvms: StringLatin1::equals @ bci:12 (line 100) >> From path: >> [center] 211 OuterStripMinedLoopEnd === 215 39 [[ 212 198 ]] P=0,948966, C=23799,000000 >> <-(0)- 215 SafePoint === 210 1 7 1 1 216 37 54 185 [[ 211 ]] SafePoint !orig=186 !jvms: StringLatin1::equals @ bci:29 (line 100) >> <-(0)- 210 IfFalse === 209 [[ 21... 
> > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > One more ResourceMark Alright, I have another barrage of comments. Things are much better already. Though it would be good to discuss a bit more how the patterns now look, especially if this becomes something that we do more widely eventually. I would like to know what are the advantages and disadvantages, and what alternatives we would have ;) src/hotspot/share/opto/compile.cpp line 702: > 700: , > 701: _in_dump_cnt(0), > 702: _invariant_checker(GraphInvariantChecker::make_default()) How does this interface with `ResourceMark`s? Because it is now resource allocated. And so is the `_checks`. How does this not trip the nesting asserts of allocation there? I'm probably missing something here. I would have expected that we need to allocate it from the `_comp_arena`. src/hotspot/share/opto/graphInvariants.cpp line 32: > 30: constexpr int LocalGraphInvariant::OutputStep; > 31: > 32: void LocalGraphInvariant::LazyReachableCFGNodes::fill() { Nit: I would call it `compute`. `fill` sounds like you are going to fill the nodes themselves or something. And: I know they are synonyms. But why do you use both `reachable` (class name) and `live` (array)? src/hotspot/share/opto/graphInvariants.cpp line 45: > 43: } > 44: } > 45: } It seems you are assuming that all CFG nodes are reachable "from below". That is true in most cases... but: Have we not had this pesky case where we have an "infinite loop", where there is really no reachability from below, but from above it is reachable? See `_root_and_safepoints` in `PhaseCCP`. I'm not sure we need to worry about this, but I'd like to be sure that we have considered infinite loops here. The risk is that otherwise you just call those nodes dead, and do not verify them, right? Or you would just ignore failures there.
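The infinite-loop concern can be illustrated with a small C++ sketch on a toy control-flow graph. This is not the code under review and not how `PhaseCCP` is implemented: `ToyCfg` and `collect_reachable_from` are hypothetical names, and the idea of seeding the walk with safepoints in addition to the root is only my reading of what `_root_and_safepoints` is for.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Toy CFG, not HotSpot code: cfg_inputs[n] lists the control predecessors of node n.
using ToyCfg = std::vector<std::vector<std::size_t>>;

// Walks control inputs upwards, starting at the seed nodes, and collects every node reached.
std::set<std::size_t> collect_reachable_from(const ToyCfg& cfg_inputs,
                                             const std::vector<std::size_t>& seeds) {
  std::set<std::size_t> visited;
  std::vector<std::size_t> worklist(seeds);
  while (!worklist.empty()) {
    const std::size_t n = worklist.back();
    worklist.pop_back();
    if (!visited.insert(n).second) {
      continue;  // already visited
    }
    for (std::size_t pred : cfg_inputs[n]) {
      worklist.push_back(pred);
    }
  }
  return visited;
}
```

Take node 0 as the root, 1 as the start, 2 as a return, and nodes 3 and 4 as a loop head and a safepoint that form a loop without any exit towards the root. Walking from the root alone reaches only nodes 0, 1 and 2, so the loop would be treated as dead. Adding the safepoint (node 4) as a second seed also reaches nodes 3 and 4.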
src/hotspot/share/opto/graphInvariants.cpp line 61: > 59: * and compositional to express complex structures from simple properties. > 60: * For instance, we have a pattern for saying "the first input of the center match P" where P is another > 61: * Pattern. We end up with trees of patterns matching the graph. `the first input of the center match P` does not sound like a proper assertion. Some alternatives I could think of: - `match P on the first input of center` ok - `the first input of center must match P` ok - `match the first input of center with P` meh src/hotspot/share/opto/graphInvariants.cpp line 63: > 61: * Pattern. We end up with trees of patterns matching the graph. > 62: */ > 63: struct Pattern : ResourceObj { Why not move the `Pattern` classes to a separate `pattern.hpp/cpp`? If we did ever use them for `IGVN`, then it would not make so much sense to have them in `graphInvariants.cpp`, right? src/hotspot/share/opto/graphInvariants.cpp line 64: > 62: */ > 63: struct Pattern : ResourceObj { > 64: virtual bool check(const Node* center, Node_List& steps, GrowableArray& path, stringStream&) const = 0; Since this is the abstract class, it could make sense to define all inputs, as well as their invariants: precondition / postcondition. Why do all args have a name except the stream? src/hotspot/share/opto/graphInvariants.cpp line 67: > 65: }; > 66: > 67: /* This pattern just accepts any node. This is convenient mostly as leaves in a pattern tree. Suggestion: /* This pattern just accepts any node. This is convenient mostly as leaf in a pattern tree. I think this is a bit more consistently singular? Optional. src/hotspot/share/opto/graphInvariants.cpp line 116: > 114: private: > 115: const N*& _binding; > 116: }; Would it not make sense to move it a bit closer to the related code? Do you need it much before `NodeClassIsAndBind`? 
src/hotspot/share/opto/graphInvariants.cpp line 128: > 126: * new AtInput(1, P1), > 127: * new AtInput(2, P2), > 128: * ) `In particular, check a node has enough inputs`: At first it is not clear if the code already does that, or if the user is supposed to do it. Why "in particular", does the statement make more clear what you just said? Ah no you are saying that it is best practice to do #input checking first for good reporting :) Suggestion: * Evaluation order is guaranteed to be left-to-right. * Good practice: * To get better reporting, the number of inputs should be checked first, before checking concrete inputs. * If you know a node has 3 inputs and want patterns to be applied to each input, it would look like * And::make( * new HasExactlyNInputs(3), * new AtInput(0, P0), * new AtInput(1, P1), * new AtInput(2, P2), * ) src/hotspot/share/opto/graphInvariants.cpp line 159: > 157: if (!_checks.at(i)->check(center, steps, path, ss)) { > 158: return false; > 159: } Why do you not update steps and path here? If there is a reason, add a comment ;) I suppose it is because you don't step to another `center`? src/hotspot/share/opto/graphInvariants.cpp line 175: > 173: } > 174: } > 175: } `make` suggests that this is a factory pattern. But it rather just prints / dumps. I'd suggest `print_list_of_inputs`. src/hotspot/share/opto/graphInvariants.cpp line 181: > 179: bool check(const Node* center, Node_List& steps, GrowableArray& path, stringStream& ss) const override { > 180: if (center->req() != _expect_req) { > 181: ss.print_cr("Unexpected number of input. Expected: %d. Found: %d", _expect_req, center->req()); Suggestion: ss.print_cr("Unexpected number of inputs. Expected exactly: %d. Found: %d", _expect_req, center->req()); Something should say that the expected number was exact. 
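As a rough illustration of the compositional idea discussed in these comments (check the number of inputs first, then individual inputs, evaluated left to right), here is a self-contained C++ sketch. It is a toy model, not the code in `graphInvariants.cpp`: `ToyNode`, `MatchState`, `NameIs` and the use of `std::shared_ptr` are assumptions for illustration, and `MatchState` merely stands in for passing the path and the message stream as one object instead of separate arguments.

```cpp
#include <cstddef>
#include <memory>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for an Ideal graph node; this is not HotSpot's Node class.
struct ToyNode {
  std::string name;
  std::vector<const ToyNode*> inputs;
};

// One object for the failure path and the error message.
struct MatchState {
  std::vector<const ToyNode*> failure_path;  // failing node first, center last
  std::ostringstream message;
};

struct Pattern {
  virtual ~Pattern() = default;
  // Returns true on success and leaves `state` untouched in that case.
  virtual bool check(const ToyNode* center, MatchState& state) const = 0;
};
using PatternPtr = std::shared_ptr<const Pattern>;

struct NameIs : Pattern {
  explicit NameIs(std::string expected) : _expected(std::move(expected)) {}
  bool check(const ToyNode* center, MatchState& state) const override {
    if (center->name == _expected) return true;
    state.message << "Unexpected type: " << center->name << ". Expected: " << _expected;
    state.failure_path.push_back(center);
    return false;
  }
 private:
  const std::string _expected;
};

struct HasExactlyNInputs : Pattern {
  explicit HasExactlyNInputs(std::size_t expected) : _expected(expected) {}
  bool check(const ToyNode* center, MatchState& state) const override {
    if (center->inputs.size() == _expected) return true;
    state.message << "Unexpected number of inputs. Expected exactly: " << _expected
                  << ". Found: " << center->inputs.size();
    state.failure_path.push_back(center);
    return false;
  }
 private:
  const std::size_t _expected;
};

struct AtInput : Pattern {
  AtInput(std::size_t which_input, PatternPtr pattern)
      : _which_input(which_input), _pattern(std::move(pattern)) {}
  bool check(const ToyNode* center, MatchState& state) const override {
    if (_which_input >= center->inputs.size() || center->inputs[_which_input] == nullptr) {
      state.message << "No input at index " << _which_input;
      state.failure_path.push_back(center);
      return false;
    }
    const bool success = _pattern->check(center->inputs[_which_input], state);
    if (!success) state.failure_path.push_back(center);  // record the step back to the center
    return success;
  }
 private:
  const std::size_t _which_input;
  const PatternPtr _pattern;
};

struct And : Pattern {
  explicit And(std::vector<PatternPtr> checks) : _checks(std::move(checks)) {}
  bool check(const ToyNode* center, MatchState& state) const override {
    for (const PatternPtr& c : _checks) {  // left to right, stop at the first failure
      if (!c->check(center, state)) return false;
    }
    return true;
  }
 private:
  const std::vector<PatternPtr> _checks;
};
```

Because the arity pattern comes first in the `And`, a node with too few inputs is reported with the "Expected exactly" message rather than with a less informative out-of-range failure from `AtInput`.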
src/hotspot/share/opto/graphInvariants.cpp line 187: > 185: return true; > 186: } > 187: const uint _expect_req; Suggestion: private: const uint _expect_req; src/hotspot/share/opto/graphInvariants.cpp line 194: > 192: bool check(const Node* center, Node_List& steps, GrowableArray& path, stringStream& ss) const override { > 193: if (center->req() < _expect_req) { > 194: ss.print_cr("Too small number of input. Expected: %d. Found: %d", _expect_req, center->req()); Grammar: Either "Too few inputs" or "Number of inputs too small". src/hotspot/share/opto/graphInvariants.cpp line 200: > 198: return true; > 199: } > 200: const uint _expect_req; Suggestion: private: const uint _expect_req; src/hotspot/share/opto/graphInvariants.cpp line 211: > 209: AtInput(uint which_input, const Pattern* pattern) : _which_input(which_input), _pattern(pattern) {} > 210: bool check(const Node* center, Node_List& steps, GrowableArray& path, stringStream& ss) const override { > 211: assert(_which_input < center->req(), "Input number is out of range"); Hmm. Could still be nice if we did our best here, and responded nicely. Just in case someone messes up the pattern, and then we get an assert here. Maybe the bug is hard to reproduce, and having the printed statements would have helped a little? src/hotspot/share/opto/graphInvariants.cpp line 215: > 213: ss.print_cr("Input at index %d is nullptr.", _which_input); > 214: return false; > 215: } So we would never do `AtInput(0, ExpectNullptr())` for example? Fine with me, just an idea to consider ;) src/hotspot/share/opto/graphInvariants.cpp line 222: > 220: } > 221: return result; > 222: } Would this not read better? 
Suggestion: bool success = _pattern->check(center->in(_which_input), state); if (!success) { state.trace_failure_path(center, _which_input); } return success; } src/hotspot/share/opto/graphInvariants.cpp line 224: > 222: } > 223: const uint _which_input; > 224: const Pattern* const _pattern; Suggestion: private: const uint _which_input; const Pattern* const _pattern; src/hotspot/share/opto/graphInvariants.cpp line 234: > 232: bool check(const Node* center, Node_List& steps, GrowableArray& path, stringStream& ss) const override { > 233: if (!(center->*_type_check)()) { > 234: ss.print_cr("Unexpected type: %s.", center->Name()); Is there a way we could say what we actually do expect? Not really, right? We'd need to do it via macro again. src/hotspot/share/opto/graphInvariants.cpp line 239: > 237: return true; > 238: } > 239: bool (Node::*_type_check)() const; Suggestion: private: bool (Node::*_type_check)() const; I would also suggest that you use a `typedef` here. Something like: `typedef bool (Node::*TypeCheckMethod)() const;` Then you can write Suggestion: public: const TypeCheckMethod _type_check; src/hotspot/share/opto/graphInvariants.cpp line 282: > 280: > 281: bool check(const Node* center, Node_List& steps, GrowableArray& path, stringStream& ss) const override { > 282: Node_List outputs_of_correct_type; You should probably have a `ResourceMark` here. Or just avoid the allocation by first only holding a pointer, and then if you find multiple you just traverse again. src/hotspot/share/opto/graphInvariants.cpp line 304: > 302: } > 303: bool (Node::*_type_check)() const; > 304: const Pattern* const _pattern; Suggestion: private: bool (Node::*_type_check)() const; const Pattern* const _pattern; src/hotspot/share/opto/graphInvariants.cpp line 307: > 305: }; > 306: > 307: /* A LocalGraphInvariant that mostly use a Pattern for checking. Suggestion: /* A LocalGraphInvariant that mostly uses a Pattern for checking. 
src/hotspot/share/opto/graphInvariants.cpp line 312: > 310: */ > 311: struct PatternBasedCheck : LocalGraphInvariant { > 312: const Pattern* const _pattern; Suggestion: private: const Pattern* const _pattern; public: src/hotspot/share/opto/graphInvariants.cpp line 336: > 334: return CheckResult::NOT_APPLICABLE; > 335: } > 336: CheckResult r = PatternBasedCheck::check(center, reachable_cfg_nodes, steps, path, ss); Suggestion: CheckResult result = PatternBasedCheck::check(center, reachable_cfg_nodes, state); Packing the 3 args would give us some extra space to write out a name for `r` ;) src/hotspot/share/opto/graphInvariants.cpp line 351: > 349: */ > 350: struct PhiArity : PatternBasedCheck { > 351: const RegionNode* region_node = nullptr; Suggestion: private: const RegionNode* _region_node = nullptr; You've been giving fields the `_` consistently up to now, as we usually do in HotSpot ;) src/hotspot/share/opto/graphInvariants.cpp line 359: > 357: 0, > 358: NodeClassIsAndBind(Region, region_node)))) { > 359: } Are there Phis that only have the ctrl input? I'd be quite surprised if they did not at least have a single data input. What do you think? src/hotspot/share/opto/graphInvariants.cpp line 378: > 376: return CheckResult::VALID; > 377: } > 378: }; I am wondering if it is really worth it to do the whole pattern matching approach, if we still have to write so much code. There is a lot of boilerplate now, that has replaced the procedural code. I'm just wondering if we are there yet, or if we need to find some way to make it more concise. Maybe we can do something like this: return .applies_if(&Node::is_Phi) .check([&]() { return PatternBasedCheck::check(center, reachable_cfg_nodes, steps, path, ss); }) .require(...) .finish(); Just an idea. It would probably be lambda based again, which has its disadvantages. Maybe you have an even better idea.
I'd just like to understand why the Pattern based approach is really super desirable, what are the advantages and disadvantages? src/hotspot/share/opto/graphInvariants.cpp line 400: > 398: } > 399: > 400: uint cfg_out = ctrl_succ.size(); Suggestion: const uint cfg_out = ctrl_succ.size(); Though you could also use `ctrl_succ.size()` directly. Matter of taste. src/hotspot/share/opto/graphInvariants.cpp line 421: > 419: ss.print(" "); > 420: ctrl_succ.at(i)->dump("\n", false, &ss); > 421: } You repeat this 4x. Can we do something reasonable about that? src/hotspot/share/opto/graphInvariants.cpp line 438: > 436: ss.print_cr("%s node must have at least one control successors. Found %d.", center->Name(), cfg_out); > 437: return CheckResult::FAILED; > 438: } Is there some upper bound? src/hotspot/share/opto/graphInvariants.cpp line 454: > 452: }; > 453: > 454: /* Checks that Region Start and Root nodes' first input is a self loop, except for copy regions, which then must have only one non null input. Suggestion: /* Checks that Region, Start and Root nodes' first input is a self loop, except for copy regions, which then must have only one non null input. src/hotspot/share/opto/graphInvariants.cpp line 487: > 485: if (non_null_inputs_count != 1) { > 486: // Should be a rare case, hence the second (but more expensive) traversal. > 487: Node_List non_null_inputs; `ResourceMark`? src/hotspot/share/opto/graphInvariants.cpp line 509: > 507: // CountedLoopEnd -> IfTrue -> CountedLoop > 508: struct CountedLoopInvariants : PatternBasedCheck { > 509: const BaseCountedLoopEndNode* counted_loop_end = nullptr; Suggestion: private: const BaseCountedLoopEndNode* _counted_loop_end = nullptr; public: src/hotspot/share/opto/graphInvariants.cpp line 528: > 526: if (!center->is_CountedLoop() && !center->is_LongCountedLoop()) { > 527: return CheckResult::NOT_APPLICABLE; > 528: } Actually: why not apply that to `OuterStripMinedLoop` as well? Or any `BaseCountedLoop`?
Are there more than these 3 cases? If there are ever more, they should probably also adhere to this backedge pattern, we'll just need an extension. But it would be nice to trip over something here if we ever do extend. src/hotspot/share/opto/graphInvariants.cpp line 547: > 545: return CheckResult::FAILED; > 546: } > 547: } If you do add `OuterStripMinedLoop`, make it a switch, and assert in the default case ;) src/hotspot/share/opto/graphInvariants.cpp line 552: > 550: }; > 551: > 552: // CountedLoopEnd -> IfFalse -> SafePoint -> OuterStripMinedLoopEnd[center] -> IfTrue -> OuterStripMinedLoop -> CountedLoop Could we close the loop, and check that the CountedLoop match via their backedge? src/hotspot/share/opto/graphInvariants.hpp line 73: > 71: * In addition, if the check fails, it must write its error message in [ss]. > 72: * > 73: * If the check succeeds or is not applicable, [steps], [path] and [ss] must be untouched. I wonder if we should not have some object that represents these 3 args. You pass them everywhere, and they seem to be a unit. And they have invariants that we may want to check. You could for example enforce that steps and path are in sync just by only providing the access methods that allow it. What do you think? ------------- Changes requested by epeter (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/26362#pullrequestreview-3196932276 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2330519851 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2330535043 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2330549479 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2330572874 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2330560727 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2330581064 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2330584310 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2330600482 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2330620646 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2330682903 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2330647056 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2330634803 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2330699234 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2330643795 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2330699535 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2330652147 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2330654896 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2330691118 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2330699783 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2330709824 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2330706819 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2330732097 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2330734834 PR Review Comment: 
https://git.openjdk.org/jdk/pull/26362#discussion_r2330736070 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2330737470 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2330747146 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2330754630 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2330750840 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2330786828 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2330798028 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2330801396 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2330803456 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2330805158 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2330814454 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2330815461 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2330820754 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2330822466 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2330832982 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2330674639 From epeter at openjdk.org Mon Sep 8 17:07:32 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 8 Sep 2025 17:07:32 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v5] In-Reply-To: References: Message-ID: On Mon, 8 Sep 2025 15:17:43 GMT, Emanuel Peter wrote: >> Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: >> >> One more ResourceMark > > src/hotspot/share/opto/graphInvariants.cpp line 61: > >> 59: * and compositional to express complex structures from simple properties. >> 60: * For instance, we have a pattern for saying "the first input of the center match P" where P is another >> 61: * Pattern. 
We end up with trees of patterns matching the graph. > > `the first input of the center match P` does not sound like a proper assertion. > Some alternatives I could think of: > - `match P on the first input of center` ok > - `the first input of center must match P` ok > - `match the first input of center with P` meh Also: are we `check`ing or `match`ing? I would pick one consistently. > src/hotspot/share/opto/graphInvariants.cpp line 234: > >> 232: bool check(const Node* center, Node_List& steps, GrowableArray& path, stringStream& ss) const override { >> 233: if (!(center->*_type_check)()) { >> 234: ss.print_cr("Unexpected type: %s.", center->Name()); > > Is there a way we could say what we actually do expect? Not really, right? We'd need to do it via macro again. Or we pass a string... not nice but would work with the macro for `NodeClassIsAndBind`. Not sure what's best here. > src/hotspot/share/opto/graphInvariants.cpp line 378: > >> 376: return CheckResult::VALID; >> 377: } >> 378: }; > > I am wondering if it is really worth it to do the whole pattern matching approach, if we still have to write so much code. > > There is a lot of boilerplate now, that has replaced the procedural code. > > I'm just wondering if we are there yet, or if we need to find some way to make it more concise. > Maybe we can do something like this: > > return > .applies_if(&Node::is_Phi) > .check([&]() { return PatternBasedCheck::check(center, reachable_cfg_nodes, steps, path, ss); }) > .require(...) > .finish(); > > Just an idea. It would probably be lambda based again, which has its disadvantages. > Maybe you have an even better idea. > I'd just like to understand why the Pattern based approach is really super desirable, what are the advantages and disadvantages? One advantage is definitely reporting. And it is still reasonably debuggable I think, my solution may be a little trickier that way.
I think there are multiple factors: - Simple: fewer abstractions can be easier to read/debug. - Concise: few lines of code. - Reporting: nice output when rules fail. > src/hotspot/share/opto/graphInvariants.cpp line 547: > >> 545: return CheckResult::FAILED; >> 546: } >> 547: } > > If you do add `OuterStripMinedLoop`, make it a switch, and assert in the default case ;) I just saw that you do the `OuterStripMinedLoop` below. But to capture the parallel structure it may still be good. And to capture possible future extension. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2330614759 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2330713629 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2330793156 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2330826144 From sparasa at openjdk.org Mon Sep 8 21:44:52 2025 From: sparasa at openjdk.org (Srinivas Vamsi Parasa) Date: Mon, 8 Sep 2025 21:44:52 GMT Subject: RFR: 8354348: Enable Extended EVEX to REX2/REX demotion for commutative operations with same dst and src2 [v3] In-Reply-To: References: Message-ID: > This change extends Extended EVEX (EEVEX) to REX2/REX demotion for Intel APX NDD instructions to handle commutative operations when the destination register and the second source register (src2) are the same. > > Currently, EEVEX to REX2/REX demotion is only enabled when the first source (src1) and the destination are the same. This enhancement allows additional cases of valid demotion for commutative instructions (add, imul, and, or, xor).
> > For example: > `eaddl r18, r25, r18` can be encoded as `addl r18, r25` using APX REX2 encoding > `eaddl r2, r7, r2` can be encoded as `addl r2, r7` using non-APX legacy encoding Srinivas Vamsi Parasa has updated the pull request incrementally with two additional commits since the last revision: - refactor emit_eevex_prefix_or_demote_arith_ndd to use size instead of passing attribute - undo swap in emit_arith and refactor accordinly ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26997/files - new: https://git.openjdk.org/jdk/pull/26997/files/91962f4f..83a22e1c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26997&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26997&range=01-02 Stats: 52 lines in 2 files changed: 14 ins; 15 del; 23 mod Patch: https://git.openjdk.org/jdk/pull/26997.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26997/head:pull/26997 PR: https://git.openjdk.org/jdk/pull/26997 From sparasa at openjdk.org Mon Sep 8 21:44:52 2025 From: sparasa at openjdk.org (Srinivas Vamsi Parasa) Date: Mon, 8 Sep 2025 21:44:52 GMT Subject: RFR: 8354348: Enable Extended EVEX to REX2/REX demotion for commutative operations with same dst and src2 [v3] In-Reply-To: References: Message-ID: On Tue, 2 Sep 2025 02:21:44 GMT, Jatin Bhateja wrote: >> Srinivas Vamsi Parasa has updated the pull request incrementally with two additional commits since the last revision: >> >> - refactor emit_eevex_prefix_or_demote_arith_ndd to use size instead of passing attribute >> - undo swap in emit_arith and refactor accordinly > > src/hotspot/cpu/x86/assembler_x86.cpp line 12932: > >> 12930: if (is_commutative && is_demotable(no_flags, dst->encoding(), src2->encoding())) { >> 12931: if (size == EVEX_64bit) { >> 12932: emit_prefix_and_int8(get_prefixq(src1, dst, is_map1), opcode_byte + 2); > > It will be good to write a comment on top of opcode_byte adjustment on account of opcode mismatch b/w NDD and equivalent demotable variant. 
> > > EVEX.LLZ.NP.MAP4.SCALABLE 21 /r AND {NF} {ND=1} rv, rv/mv, rv > > > `REX.W + 23 /r AND r64, r/m64 | RM | Valid | N.E. | r64 AND r/m64 > ` Please see a comment added as suggested. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26997#discussion_r2331439482 From sparasa at openjdk.org Mon Sep 8 21:44:55 2025 From: sparasa at openjdk.org (Srinivas Vamsi Parasa) Date: Mon, 8 Sep 2025 21:44:55 GMT Subject: RFR: 8354348: Enable Extended EVEX to REX2/REX demotion for commutative operations with same dst and src2 [v2] In-Reply-To: References: Message-ID: On Fri, 5 Sep 2025 22:03:59 GMT, Sandhya Viswanathan wrote: >> Srinivas Vamsi Parasa has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: >> >> - nomenclature change >> - Merge branch 'master' of https://git.openjdk.java.net/jdk into cdemotion >> - remove trailing whitespaces >> - remove unused instructions >> - 8354348: Enable Extended EVEX to REX2/REX demotion for commutative operations with same dst and src2 > > src/hotspot/cpu/x86/assembler_x86.cpp line 13125: > >> 13123: emit_arith(op1, op2, src1, src2, second_operand_demotable); >> 13124: } >> 13125: > > This could be written something like below: > > void Assembler::emit_eevex_prefix_or_demote_arith_ndd(Register dst, Register src1, Register src2, VexSimdPrefix pre, VexOpcode opc, > InstructionAttr *attributes, int op1, int op2, bool no_flags, bool use_prefixq, bool is_commutative) { > bool demotable = is_demotable(no_flags, dst->encoding(), src1->encoding()); > if (!demotable && is_commutative) { > if (is_demotable(no_flags, dst->encoding(), src2->encoding())) { > demotable = true; > // swap src1 and src2 > Register tmp = src1; > src1 = src2; > src2 = tmp; > } > } > (void)emit_eevex_prefix_or_demote_ndd(src1->encoding(), dst->encoding(), src2->encoding(), 
pre, opc, attributes, no_flags, use_prefixq); > emit_arith(op1, op2, src1, src2); > } > > > Then we don't need extra argument in emit_arith() and emit_eevex_prefix_or_demote_ndd. Please see the updated code with the suggestion incorporated. > src/hotspot/cpu/x86/assembler_x86.hpp line 812: > >> 810: void emit_eevex_prefix_or_demote_arith_ndd(Register dst, Register src1, Register src2, VexSimdPrefix pre, VexOpcode opc, >> 811: InstructionAttr *attributes, int op1, int op2, bool no_flags = false, bool use_prefixq = false, bool is_commutative = false); >> 812: > > The attributes parameter could be replaced by int size and the attributes computed inside the emit_eevex_prefix_or_demote_arith_ndd. Also then no need to have use_prefixq as a separate parameter, (size == EVEX_64bit) implies use_prefixq. Please see the updated code to pass size and attributes computed inside the `emit_eevex_prefix_or_demote_arith_ndd`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26997#discussion_r2331441375 PR Review Comment: https://git.openjdk.org/jdk/pull/26997#discussion_r2331440906 From cslucas at openjdk.org Mon Sep 8 22:12:16 2025 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Mon, 8 Sep 2025 22:12:16 GMT Subject: RFR: 8361699: C2: assert(can_reduce_phi(n->as_Phi())) failed: Sanity: previous reducible Phi is no longer reducible before SUT In-Reply-To: References: <1uDOe3Oe-hihmDHea2h8vcvRZsKKBeNp0J9lKYUujxk=.abd111bc-3625-4c71-bfa2-0a4c1f4d3875@github.com> Message-ID: On Mon, 8 Sep 2025 15:38:52 GMT, Roberto Castañeda Lozano wrote: >> Hi Cesar, thanks for addressing this issue. I will run some more comprehensive testing and have a look at it in the next days. > >> Hi Cesar, thanks for addressing this issue. I will run some more comprehensive testing and have a look at it in the next days. > > Testing did not reveal any issue.
I have, however, a high-level question: could the current two-step design ([SR state adjustment loop](https://github.com/openjdk/jdk/blob/166ef5e7b1c6d6a9f0f1f29fedb7f65b94f53119/src/hotspot/share/opto/escape.cpp#L300-L315) followed by a [NSR propagation loop](https://github.com/openjdk/jdk/blob/166ef5e7b1c6d6a9f0f1f29fedb7f65b94f53119/src/hotspot/share/opto/escape.cpp#L318-L320)) miss marking allocations as NSR in more complex scenarios, e.g. involving longer points-to/merge chains? Wouldn't it be more principled to re-run the SR state adjustment loop until a fixed point is reached, keeping `reducible_merges` consistent as new allocations are discovered to be NSR? (e.g. by calling `revisit_reducible_phi_status` - with your clean-up applied - every time [an allocation is marked as NSR due to non-removable merges](https://github.com/openjdk/jdk/blob/166ef5e7b1c6d6a9f0f1f29fedb7f65b94f53119/src/hotspot/share/opto/escape.cpp#L2962-L2964)). @robcasloz - are you thinking that the "fixed point" loops on `find_scalar_replaceable_allocs` aren't sufficient? At first glance yes, I think that the code would be cleaner if done that way. If the code had been written like that in the first place we wouldn't have seen the current issue. But I don't think this is a correctness issue. As long as we call `revisit_reducible_phi_status` when an object is marked as NSR, the eventual call to `unique_java_object` should find that NSR object if it's used by a reducible phi. I propose that we move forward with the current patch and work on this refactoring as a separate issue.
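The fixed-point idea raised in this exchange can be illustrated with a toy model. This is only a sketch under assumptions: the `Merge` struct, the `propagate_nsr` function, and the rule that a merge stops being reducible as soon as one of its inputs is NSR are hypothetical simplifications for illustration, not the logic in `escape.cpp`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a Phi that merges several allocations (hypothetical type).
struct Merge {
  std::vector<std::size_t> inputs;  // indices of the merged allocations
  bool reducible;                   // current belief about this merge
};

// Repeats the adjustment pass until nothing changes and returns the
// number of passes that were needed (assumed rules, see lead-in).
inline int propagate_nsr(std::vector<bool>& scalar_replaceable,
                         std::vector<Merge>& merges) {
  int passes = 0;
  bool changed = true;
  while (changed) {
    changed = false;
    ++passes;
    for (Merge& m : merges) {
      // A merge loses its reducible status once any input is NSR.
      if (m.reducible) {
        for (std::size_t a : m.inputs) {
          if (!scalar_replaceable[a]) {
            m.reducible = false;
            changed = true;
            break;
          }
        }
      }
      // A non-reducible merge forces all of its inputs to NSR.
      if (!m.reducible) {
        for (std::size_t a : m.inputs) {
          if (scalar_replaceable[a]) {
            scalar_replaceable[a] = false;
            changed = true;
          }
        }
      }
    }
  }
  return passes;
}
```

With three allocations and merges visited in the order `{1, 2}`, `{0, 1}`, `{0}`, where only the last merge starts out non-reducible, the first pass marks only allocation 0. Further passes are needed before the effect reaches allocations 1 and 2 through the chain, which is the kind of scenario the question about longer merge chains is concerned with.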
------------- PR Comment: https://git.openjdk.org/jdk/pull/27063#issuecomment-3268175631 From dlong at openjdk.org Mon Sep 8 23:27:32 2025 From: dlong at openjdk.org (Dean Long) Date: Mon, 8 Sep 2025 23:27:32 GMT Subject: RFR: 8366971: C2: Remove unused nop_list from PhaseOutput::init_buffer In-Reply-To: <2hBEO9Zpoy2wo_pgTXE9v8KG5u1HNdKp3RgQE-4HYcE=.e86088d1-1e25-49b5-9b3c-c2498ec6ca48@github.com> References: <2hBEO9Zpoy2wo_pgTXE9v8KG5u1HNdKp3RgQE-4HYcE=.e86088d1-1e25-49b5-9b3c-c2498ec6ca48@github.com> Message-ID: On Fri, 5 Sep 2025 13:02:00 GMT, Daniel Jeliński wrote: > The nop list has never been used in the history of OpenJDK. Let's clean it up. > > Tested with Mach5 tier 1-5, no related failures. Looks good. ------------- Marked as reviewed by dlong (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/27117#pullrequestreview-3198456672 From dlong at openjdk.org Mon Sep 8 23:27:48 2025 From: dlong at openjdk.org (Dean Long) Date: Mon, 8 Sep 2025 23:27:48 GMT Subject: RFR: 8361376: Regressions 1-6% in several Renaissance in 26-b4 only MacOSX aarch64 [v5] In-Reply-To: References: Message-ID: On Mon, 4 Aug 2025 21:26:22 GMT, Dean Long wrote: >> This PR removes the recently added lock around set_guard_value, using instead Atomic::cmpxchg to atomically update bit-fields of the guard value. Further, it takes a fast-path that uses the previous direct store when at a safepoint. Combined, these changes should get us back to almost where we were before in terms of overhead. If necessary, we could go even further and allow make_not_entrant() to perform a direct byte store, leaving 24 bits for the guard value. > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > one unconditional release should be enough I need another review for this. Any volunteers?
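The lock-free update of a bit-field inside a guard word, as described in the quoted PR text, could look roughly like the following. This is a minimal sketch and not the HotSpot code: it uses `std::atomic` rather than HotSpot's `Atomic::cmpxchg`, and the layout (bit 0 as a "not entrant" flag, the remaining 31 bits as the guard value) as well as all function names are assumptions made for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical layout: bit 0 is a "not entrant" flag, bits 1..31 hold the value.
constexpr uint32_t kNotEntrantBit = 1u;
constexpr uint32_t kValueShift = 1;

// Replaces only the value bits and preserves whatever flag bit is present,
// retrying if another thread changed the word in the meantime.
inline void set_guard_value(std::atomic<uint32_t>& guard, uint32_t value) {
  uint32_t old_word = guard.load(std::memory_order_relaxed);
  uint32_t new_word;
  do {
    new_word = (value << kValueShift) | (old_word & kNotEntrantBit);
  } while (!guard.compare_exchange_weak(old_word, new_word,
                                        std::memory_order_release,
                                        std::memory_order_relaxed));
}

// Sets the flag bit without disturbing the value bits.
inline void make_not_entrant(std::atomic<uint32_t>& guard) {
  guard.fetch_or(kNotEntrantBit, std::memory_order_release);
}

inline uint32_t guard_value(const std::atomic<uint32_t>& guard) {
  return guard.load(std::memory_order_acquire) >> kValueShift;
}

inline bool is_not_entrant(const std::atomic<uint32_t>& guard) {
  return (guard.load(std::memory_order_acquire) & kNotEntrantBit) != 0;
}
```

The retry loop is what removes the need for a lock: a concurrent `make_not_entrant` between the load and the exchange makes the exchange fail, `old_word` is refreshed, and the flag is carried over on the next attempt.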
------------- PR Comment: https://git.openjdk.org/jdk/pull/26399#issuecomment-3268323422 From sparasa at openjdk.org Mon Sep 8 23:30:02 2025 From: sparasa at openjdk.org (Srinivas Vamsi Parasa) Date: Mon, 8 Sep 2025 23:30:02 GMT Subject: RFR: 8354348: Enable Extended EVEX to REX2/REX demotion for commutative operations with same dst and src2 [v4] In-Reply-To: References: Message-ID: > This change extends Extended EVEX (EEVEX) to REX2/REX demotion for Intel APX NDD instructions to handle commutative operations when the destination register and the second source register (src2) are the same. > > Currently, EEVEX to REX2/REX demotion is only enabled when the first source (src1) and the destination are the same. This enhancement allows additional cases of valid demotion for commutative instructions (add, imul, and, or, xor). > > For example: > `eaddl r18, r25, r18` can be encoded as `addl r18, r25` using APX REX2 encoding > `eaddl r2, r7, r2` can be encoded as `addl r2, r7` using non-APX legacy encoding Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: undo the passing of demotable flag ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26997/files - new: https://git.openjdk.org/jdk/pull/26997/files/83a22e1c..9714a9b1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26997&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26997&range=02-03 Stats: 5 lines in 2 files changed: 0 ins; 1 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/26997.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26997/head:pull/26997 PR: https://git.openjdk.org/jdk/pull/26997 From dlong at openjdk.org Mon Sep 8 23:56:16 2025 From: dlong at openjdk.org (Dean Long) Date: Mon, 8 Sep 2025 23:56:16 GMT Subject: RFR: 8360031: C2 compilation asserts in MemBarNode::remove [v3] In-Reply-To: References: Message-ID: On Mon, 8 Sep 2025 09:29:26 GMT, Damon Fenacci wrote: >> # Issue >> While compiling 
`java.util.zip.ZipFile` in C2 this assert is triggered >> https://github.com/openjdk/jdk/blob/a2e86ff3c56209a14c6e9730781eecd12c81d170/src/hotspot/share/opto/memnode.cpp#L4235 >> >> # Cause >> While compiling the constructor of java.util.zip.ZipFile$CleanableResource the following happens: >> * we insert a trailing `MemBarStoreStore` in the constructor >> before_folding >> >> * during IGVN we completely fold the memory subtree of the `MemBarStoreStore` node. The node still has a control output attached. >> after_folding >> >> * later during the same IGVN run the `MemBarStoreStore` node is handled and we try to remove it (because the `Allocate` node of the `MemBar` is not escaping the thread) https://github.com/openjdk/jdk/blob/7b7136b4eca15693cfcd46ae63d644efc8a88d2c/src/hotspot/share/opto/memnode.cpp#L4301-L4302 >> * the assert https://github.com/openjdk/jdk/blob/7b7136b4eca15693cfcd46ae63d644efc8a88d2c/src/hotspot/share/opto/memnode.cpp#L4235 >> triggers because the barrier has only 1 (control) output and is a `MemBarStoreStore` (not `Initialize`) barrier >> >> The issue happens only when the `UseStoreStoreForCtor` is set (default as well), which makes C2 use `MemBarStoreStore` instead of `MemBarRelease` at the end of constructors. `MemBarStoreStore` are processed separately by EA and this happens after the IGVN pass that folds the memory subtree. `MemBarRelease` on the other hand are handled during same IGVN pass before the memory subtree gets removed and it's still got 2 outputs (assert skipped). >> >> # Fix >> Adapting the assert to accept that `MemBarStoreStore` can also have `!= 2` outputs (when `+UseStoreStoreForCtor` is used) seems to be an OK solution as this seems like a perfectly plausible situation. >> >> # Testing >> Unfortunately reproducing the issue with a simple regression test has proven very hard. The test seems to rely on very peculiar profiling and IGVN worklist sequence. JBS replay compilation passes.
Running JCK's `api/java_util` 100 times triggers the assert a couple of times on average before the fix, none after. >> Tier 1-3+ tests passed. > > Damon Fenacci has updated the pull request incrementally with one additional commit since the last revision: > > JDK-8360031: add MemBarStoreStore node to worklist during escape analysis/adapt remove assert src/hotspot/share/opto/memnode.cpp line 4232: > 4230: > 4231: void MemBarNode::remove(PhaseIterGVN *igvn) { > 4232: if (outcnt() != 2) { By itself, this allows outcnt() == 0, so maybe we need to continue to fail if that happens. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26556#discussion_r2331612715 From sviswanathan at openjdk.org Tue Sep 9 00:17:42 2025 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Tue, 9 Sep 2025 00:17:42 GMT Subject: RFR: 8354348: Enable Extended EVEX to REX2/REX demotion for commutative operations with same dst and src2 [v4] In-Reply-To: References: Message-ID: On Mon, 8 Sep 2025 23:30:02 GMT, Srinivas Vamsi Parasa wrote: >> This change extends Extended EVEX (EEVEX) to REX2/REX demotion for Intel APX NDD instructions to handle commutative operations when the destination register and the second source register (src2) are the same. >> >> Currently, EEVEX to REX2/REX demotion is only enabled when the first source (src1) and the destination are the same. This enhancement allows additional cases of valid demotion for commutative instructions (add, imul, and, or, xor). >> >> For example: >> `eaddl r18, r25, r18` can be encoded as `addl r18, r25` using APX REX2 encoding >> `eaddl r2, r7, r2` can be encoded as `addl r2, r7` using non-APX legacy encoding > > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > undo the passing of demotable flag Looks good to me. ------------- Marked as reviewed by sviswanathan (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/26997#pullrequestreview-3198574139 From dzhang at openjdk.org Tue Sep 9 00:29:58 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Tue, 9 Sep 2025 00:29:58 GMT Subject: RFR: 8367048: RISC-V: Correct pipeline descriptions of the architecture [v2] In-Reply-To: References: Message-ID: > Hi, > Can you help to review this patch? Thanks! > > This patch updates the RISC-V pipeline attributes to variable_size_instructions to properly account for the 2-byte compressed instructions from the C extension. > Furthermore, it increases the max_instructions_per_bundle to 4 and adjusts the instruction_unit_size to match 4-issue RISC-V hardware like the UR-CP100. > > ### Test > - [x] Run tier1 and tier2 on sg2042 Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: fix typo ------------- Changes: - all: https://git.openjdk.org/jdk/pull/27134/files - new: https://git.openjdk.org/jdk/pull/27134/files/99846cd6..b5d87735 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=27134&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27134&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/27134.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27134/head:pull/27134 PR: https://git.openjdk.org/jdk/pull/27134 From fyang at openjdk.org Tue Sep 9 00:29:59 2025 From: fyang at openjdk.org (Fei Yang) Date: Tue, 9 Sep 2025 00:29:59 GMT Subject: RFR: 8367048: RISC-V: Correct pipeline descriptions of the architecture [v2] In-Reply-To: References: Message-ID: On Tue, 9 Sep 2025 00:24:38 GMT, Dingli Zhang wrote: >> Hi, >> Can you help to review this patch? Thanks! >> >> This patch updates the RISC-V pipeline attributes to variable_size_instructions to properly account for the 2-byte compressed instructions from the C extension. 
>> Furthermore, it increases the max_instructions_per_bundle to 4 and adjusts the instruction_unit_size to match 4-issue RISC-V hardware like the UR-CP100. >> >> ### Test >> - [x] Run tier1 and tier2 on sg2042 > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > fix typo Marked as reviewed by fyang (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/27134#pullrequestreview-3198583207 From dzhang at openjdk.org Tue Sep 9 00:30:00 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Tue, 9 Sep 2025 00:30:00 GMT Subject: RFR: 8367048: RISC-V: Correct pipeline descriptions of the architecture In-Reply-To: References: Message-ID: On Mon, 8 Sep 2025 05:13:32 GMT, Dingli Zhang wrote: > Hi, > Can you help to review this patch? Thanks! > > This patch updates the RISC-V pipeline attributes to variable_size_instructions to properly account for the 2-byte compressed instructions from the C extension. > Furthermore, it increases the max_instructions_per_bundle to 4 and adjusts the instruction_unit_size to match 4-issue RISC-V hardware like the UR-CP100. > > ### Test > - [x] Run tier1 and tier2 on sg2042 Thanks all for the review! ------------- PR Comment: https://git.openjdk.org/jdk/pull/27134#issuecomment-3268434597 From duke at openjdk.org Tue Sep 9 00:30:01 2025 From: duke at openjdk.org (duke) Date: Tue, 9 Sep 2025 00:30:01 GMT Subject: RFR: 8367048: RISC-V: Correct pipeline descriptions of the architecture In-Reply-To: References: Message-ID: On Mon, 8 Sep 2025 05:13:32 GMT, Dingli Zhang wrote: > Hi, > Can you help to review this patch? Thanks! > > This patch updates the RISC-V pipeline attributes to variable_size_instructions to properly account for the 2-byte compressed instructions from the C extension. > Furthermore, it increases the max_instructions_per_bundle to 4 and adjusts the instruction_unit_size to match 4-issue RISC-V hardware like the UR-CP100. 
> > ### Test > - [x] Run tier1 and tier2 on sg2042 @DingliZhang Your change (at version b5d87735bc2f6a1540676722c0befcca95557fa9) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/27134#issuecomment-3268436960 From dzhang at openjdk.org Tue Sep 9 00:41:35 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Tue, 9 Sep 2025 00:41:35 GMT Subject: Integrated: 8367048: RISC-V: Correct pipeline descriptions of the architecture In-Reply-To: References: Message-ID: On Mon, 8 Sep 2025 05:13:32 GMT, Dingli Zhang wrote: > Hi, > Can you help to review this patch? Thanks! > > This patch updates the RISC-V pipeline attributes to variable_size_instructions to properly account for the 2-byte compressed instructions from the C extension. > Furthermore, it increases the max_instructions_per_bundle to 4 and adjusts the instruction_unit_size to match 4-issue RISC-V hardware like the UR-CP100. > > ### Test > - [x] Run tier1 and tier2 on sg2042 This pull request has now been integrated. Changeset: 0aee7bf2 Author: Dingli Zhang Committer: Fei Yang URL: https://git.openjdk.org/jdk/commit/0aee7bf24d7f2578d3867bcfa25646cb0bd06d9a Stats: 12 lines in 1 file changed: 5 ins; 0 del; 7 mod 8367048: RISC-V: Correct pipeline descriptions of the architecture Reviewed-by: fyang, fjiang, mli ------------- PR: https://git.openjdk.org/jdk/pull/27134 From jbhateja at openjdk.org Tue Sep 9 02:12:11 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 9 Sep 2025 02:12:11 GMT Subject: RFR: 8365205: C2: Optimize popcount value computation using knownbits [v3] In-Reply-To: References: Message-ID: On Fri, 5 Sep 2025 17:17:52 GMT, Jatin Bhateja wrote: >> This patch optimizes PopCount value transforms using KnownBits information. 
>> Following are the results of the micro-benchmark included with the patch
>>
>> System: 13th Gen Intel(R) Core(TM) i3-1315U
>>
>> Baseline:
>> Benchmark                                      Mode  Cnt       Score  Error  Units
>> PopCountValueTransform.LogicFoldingKerenLong  thrpt    2  215460.670         ops/s
>> PopCountValueTransform.LogicFoldingKerenlInt  thrpt    2  294014.826         ops/s
>> PopCountValueTransform.StockKernelInt         thrpt    2  409295.875         ops/s
>> PopCountValueTransform.StockKernelLong        thrpt    2  368025.608         ops/s
>>
>> Withopt:
>> Benchmark                                      Mode  Cnt       Score  Error  Units
>> PopCountValueTransform.LogicFoldingKerenLong  thrpt    2  389978.082         ops/s
>> PopCountValueTransform.LogicFoldingKerenlInt  thrpt    2  417261.583         ops/s
>> PopCountValueTransform.StockKernelInt         thrpt    2  418649.269         ops/s
>> PopCountValueTransform.StockKernelLong        thrpt    2  381330.221         ops/s
>>
>> Kindly review and share your feedback.
>>
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>
>   Update countbitsnode.cpp

Hi @TobiHartmann , @SirYwell , @eme64 , can you kindly verify the changes in the latest patch?
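For readers unfamiliar with the idea behind a known-bits based PopCount transform, the bounds it can derive are simple to state: every bit known to be one must be counted, and every bit known to be zero can never be counted. The following is a minimal sketch of that arithmetic only. The `KnownBits32` and `Range` types and the `popcount_range` function are hypothetical names for illustration and are not the types used in `countbitsnode.cpp`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical known-bits summary of a 32-bit value:
// a bit set in `zeros` is known to be 0, a bit set in `ones` is known to be 1.
struct KnownBits32 {
  uint32_t zeros;
  uint32_t ones;
};

struct Range {
  int lo;
  int hi;
};

// Portable population count (clears the lowest set bit each iteration).
inline int popcount32(uint32_t x) {
  int n = 0;
  while (x != 0) {
    x &= x - 1;
    ++n;
  }
  return n;
}

// Lower bound: all known-one bits are set.
// Upper bound: every bit not known to be zero might be set.
inline Range popcount_range(KnownBits32 kb) {
  assert((kb.zeros & kb.ones) == 0 && "a bit cannot be known 0 and known 1");
  return { popcount32(kb.ones), 32 - popcount32(kb.zeros) };
}
```

When the upper 24 bits are known to be zero and bit 0 is known to be one, the result lies in [1, 8]; when every bit is known, the range collapses to a single constant, which is the case where the count could be folded.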
------------- PR Comment: https://git.openjdk.org/jdk/pull/27075#issuecomment-3268608172 From jbhateja at openjdk.org Tue Sep 9 02:21:11 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 9 Sep 2025 02:21:11 GMT Subject: RFR: 8354348: Enable Extended EVEX to REX2/REX demotion for commutative operations with same dst and src2 [v4] In-Reply-To: <0X5cvpQZxb1l5Q_8f-iU0K4WtdyFW8ehdPXR2zsnSzo=.7f4f3d03-94db-4482-b5ee-c5f1362d84b5@github.com> References: <0X5cvpQZxb1l5Q_8f-iU0K4WtdyFW8ehdPXR2zsnSzo=.7f4f3d03-94db-4482-b5ee-c5f1362d84b5@github.com> Message-ID: On Thu, 4 Sep 2025 20:16:30 GMT, Srinivas Vamsi Parasa wrote: >> src/hotspot/cpu/x86/x86_64.ad line 7121: >> >>> 7119: %{ >>> 7120: predicate(UseAPX); >>> 7121: match(Set dst (AddI (LoadI src1) src2)); >> >> Will this not be covered by the pattern at line 7103, since ADLC automatically generates a DFA to handle both cases? > > Will run experiments to make sure that the RegRegMem pattern also applies to RegMemReg case and remove the newly added match rules if they're redundant. Will update you soon. Hi @vamsi-parasa, your latest patch does not address this. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26997#discussion_r2331827360 From jbhateja at openjdk.org Tue Sep 9 02:31:18 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 9 Sep 2025 02:31:18 GMT Subject: RFR: 8364305: Support AVX10 saturating floating point conversion instructions [v9] In-Reply-To: References: Message-ID: On Sat, 6 Sep 2025 09:44:56 GMT, Mohamed Issa wrote: >> Intel® AVX10 ISA [1] extensions added new saturating floating point conversion instructions which comply with definitions in section 5.8 of the 2019 IEEE-754 standard. They can compute floating point to integral type conversions while also handling special inputs such as NaN, +Infinity, and -Infinity. 
>> >> Without AVX10.2, the current approach starts by converting the floating point value(s) in the source register to the desired integral value(s) in the destination register. In the scalar case, the CVTTSS2SI (single precision) or CVTTSD2SI (double precision) instruction is used. In the vector case, the CVTTPS2DQ (single precision) or CVTTPD2DQ (double precision) is used. However, if the source contains a special value (NaN, -Infinity, +Infinity, <= Integer.MIN_VALUE, or >= Integer.MAX_VALUE), extra handling is required. The specific sequence of instructions involved depends on the source (single precision vs double precision), destination (long, integer, short, or byte), level of parallelization (scalar vs vector), and supported AVX extension type. Essentially though, the special values are mapped to values (NaN -> 0, -Infinity, <= Integer.MIN_VALUE -> Integer.MIN_VALUE, +Infinity, >= Integer.MAX_VALUE -> Integer.MAX_VALUE) in the integer range with the help of a few temporary registers to store intermediate results. >> >> This change uses the new AVX10.2 scalar (VCVTTSS2SIS or VCVTTSD2SIS) and vector (VCVTTPS2QQS, VCVTTPS2DQS, VCVTTPD2QQS, and VCVTTPD2DQS) instructions on supported platforms to avoid the extra handling described above. Also, the JTREG tests listed below were used to verify correctness with `-XX:-UseSuperWord` / `-XX:+UseSuperWord` options to exercise both scalar and vector paths. The baseline build used is [OpenJDK v26-b11](https://github.com/openjdk/jdk/releases/tag/jdk-26%2B11). >> >> 1. `jtreg:test/hotspot/jtreg/compiler/codegen/TestByteDoubleVect.java` >> 2. `jtreg:test/hotspot/jtreg/compiler/codegen/TestByteFloatVect.java` >> 3. `jtreg:test/hotspot/jtreg/compiler/codegen/TestIntDoubleVect.java` >> 4. `jtreg:test/hotspot/jtreg/compiler/codegen/TestIntFloatVect.java` >> 5. `jtreg:test/hotspot/jtreg/compiler/codegen/TestLongDoubleVect.java` >> 6. `jtreg:test/hotspot/jtreg/compiler/codegen/TestLongFloatVect.java` >> 7.
`jtreg:test/hotspot/jtreg/compiler/codegen/TestShortDoubleVect.java` >> 8. `jtreg:test/hotspot/jtreg/compiler/codegen/TestShortFloatVect.java` >> 9. `jtreg:test/hotspot/jtreg/compiler/vectorapi/VectorFPtoIntCastTest.java` >> 1... > > Mohamed Issa has updated the pull request incrementally with one additional commit since the last revision: > > Check for scalar casting instead of vector casting in tests when disabling vector alignment or compact object headers test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java line 432: > 430: applyIfCPUFeatureAnd = {"avx", "true", "avx10_2", "false"}) > 431: @IR(counts = {"cast2DtoX", " >0 "}, phase = CompilePhase.FINAL_CODE, > 432: applyIfCPUFeature = {"avx10_2", "true"}) Please refer to https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/lib/ir_framework/IRNode.java#L2638 for adding MachNode IR node based checks ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2331837526 From duke at openjdk.org Tue Sep 9 05:39:14 2025 From: duke at openjdk.org (erifan) Date: Tue, 9 Sep 2025 05:39:14 GMT Subject: RFR: 8365911: AArch64: Fix encoding error in sve_cpy for negative floats In-Reply-To: <-G8GwIflOhFjOL-PAG6_oylu0Fa9c8iNUB57EC6oo4s=.a0126087-2a97-4542-a555-27c12578fccf@github.com> References: <2R6O7Jhv3catwxc6rXJdh7Uiq-NFBp7beCmP49CLTqU=.7ba72e39-6efd-47fe-8ad9-6df54a45c99b@github.com> <-G8GwIflOhFjOL-PAG6_oylu0Fa9c8iNUB57EC6oo4s=.a0126087-2a97-4542-a555-27c12578fccf@github.com> Message-ID: On Tue, 2 Sep 2025 08:10:02 GMT, Andrew Haley wrote: >> Thanks @theRealAph . >> >> I've indeed considered and implemented your idea. 
The code diff: >> >> diff --git a/src/hotspot/cpu/aarch64/assembler_aarch64.hpp b/src/hotspot/cpu/aarch64/assembler_aarch64.hpp >> index 11d302e9026..841d24f516b 100644 >> --- a/src/hotspot/cpu/aarch64/assembler_aarch64.hpp >> +++ b/src/hotspot/cpu/aarch64/assembler_aarch64.hpp >> @@ -3813,8 +3813,9 @@ template >> bool isMerge, bool isFloat) { >> starti; >> assert(T != Q, "invalid size"); >> + assert((!isFloat) || (isFloat && T != B), "invalid size"); >> int sh = 0; >> - if (imm8 <= 127 && imm8 >= -128) { >> + if ((imm8 <= 127 && imm8 >= -128) || (isFloat && (imm8 >> 8) == 0)) { >> sh = 0; >> } else if (T != B && imm8 <= 32512 && imm8 >= -32768 && (imm8 & 0xff) == 0) { >> sh = 1; >> @@ -3824,7 +3825,7 @@ template >> } >> int m = isMerge ? 1 : 0; >> f(0b00000101, 31, 24), f(T, 23, 22), f(0b01, 21, 20); >> - prf(Pg, 16), f(isFloat ? 1 : 0, 15), f(m, 14), f(sh, 13), sf(imm8, 12, 5), rf(Zd, 0); >> + prf(Pg, 16), f(isFloat ? 1 : 0, 15), f(m, 14), f(sh, 13), f(imm8&0xff, 12, 5), rf(Zd, 0); >> } >> >> public: >> @@ -3834,7 +3835,7 @@ template >> } >> // SVE copy floating-point immediate to vector elements (predicated) >> void sve_cpy(FloatRegister Zd, SIMD_RegVariant T, PRegister Pg, double d) { >> - sve_cpy(Zd, T, Pg, checked_cast(pack(d)), /*isMerge*/true, /*isFloat*/true); >> + sve_cpy(Zd, T, Pg, checked_cast(pack(d)), /*isMerge*/true, /*isFloat*/true); >> } >> >> // SVE conditionally select elements from two vectors >> >> >> However, some of my colleagues have differing opinions: >> 1. sve `cpy` and `fcpy` are actually two different instructions, and distinguishing them might be clearer. >> 2. sve `cpy` 's imm8 is an **int** , while `fcpy` 's imm8 is an **fp8** . While some encoding code can be reused, separating the encodings makes the code clearer. >> >> I think both implementations are fine. If you think it's better to not refactor, I'll revert. > >> 1. sve `cpy` and `fcpy` are actually two different instructions, and distinguishing them might be clearer. 
> > That's a fair point, but the AArch64 name for all four instructions is CPY, and they are distinguished by their operands. Deviation from the names in the Reference Manual is occasionally necessary, but it makes life painful for maintainers when they have to search for what we've called an instruction they want to use. > >> 2. sve `cpy` 's imm8 is an **int** , while `fcpy` 's imm8 is an **fp8** . > > Yes, that's right. > >> While some encoding code can be reused, separating the encodings makes the code clearer. > > I don't agree that it makes the code clearer. In fact, tight factoring emphasizes the fact that these instructions are similar, and explicitly shows where they are different. > > It is true that I have a strong bias against copy-and-paste programming. > >> I think both implementations are fine. If you think it's better to not refactor, I'll revert. > > I do. Thank you. Hi @theRealAph @eme64 , would you mind sponsoring this PR? Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26951#issuecomment-3268944989 From xgong at openjdk.org Tue Sep 9 06:53:31 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 9 Sep 2025 06:53:31 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v11] In-Reply-To: References: Message-ID: <8T7swIJ17tLLg4FO_N5UZ0HsMYrz31ywBiMZohefGTE=.386eeb0d-8541-4c35-8a68-6caf31ea867e@github.com> On Thu, 14 Aug 2025 14:01:13 GMT, Mikhail Ablakatov wrote: >> Add a reduce_mul intrinsic SVE specialization for >= 256-bit long vectors. It multiplies halves of the source vector using SVE instructions to get to a 128-bit long vector that fits into a SIMD&FP register. After that point, existing ASIMD implementation is used. >> >> Nothing changes for <= 128-bit long vectors as for those the existing ASIMD implementation is used directly still.
>>
>> The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing VectorAPI reduction micro-benchmarks.
>>
>> Benchmark results:
>>
>> Neoverse-V1 (SVE 256-bit)
>>
>> Benchmark                 (size)  Mode     master         PR  Units
>> ByteMaxVector.MULLanes      1024  thrpt  5447.643  11455.535  ops/ms
>> ShortMaxVector.MULLanes     1024  thrpt  3388.183   7144.301  ops/ms
>> IntMaxVector.MULLanes       1024  thrpt  3010.974   4911.485  ops/ms
>> LongMaxVector.MULLanes      1024  thrpt  1539.137   2562.835  ops/ms
>> FloatMaxVector.MULLanes     1024  thrpt  1355.551   4158.128  ops/ms
>> DoubleMaxVector.MULLanes    1024  thrpt  1715.854   3284.189  ops/ms
>>
>>
>> Fujitsu A64FX (SVE 512-bit):
>>
>> Benchmark                 (size)  Mode     master         PR  Units
>> ByteMaxVector.MULLanes      1024  thrpt  1091.692   2887.798  ops/ms
>> ShortMaxVector.MULLanes     1024  thrpt   597.008   1863.338  ops/ms
>> IntMaxVector.MULLanes       1024  thrpt   510.642   1348.651  ops/ms
>> LongMaxVector.MULLanes      1024  thrpt   468.878    878.620  ops/ms
>> FloatMaxVector.MULLanes     1024  thrpt   376.284   2237.564  ops/ms
>> DoubleMaxVector.MULLanes    1024  thrpt   431.343   1646.792  ops/ms
>
> Mikhail Ablakatov has updated the pull request incrementally with one additional commit since the last revision:
>
>   cleanup: start the SVE Integer Misc - Unpredicated section

Do you intend to ignore ops with >32B vector size? May I ask the reason? If so, maybe a title like `AArch64: Implement MulReduction for 256-bit SVE` is more accurate?

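As a reading aid for the "multiply halves until 128 bits remain" strategy in the quoted PR description, here is a minimal scalar C++ sketch. It is only an illustration under stated assumptions: `mul_reduce_by_halves` and `lanes_in_128b` are hypothetical names, the lane count is assumed to be a power of two, and the final loop merely stands in for the existing 128-bit ASIMD path. It is not the code from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar model of a halving multiply-reduction.
// While the vector is wider than what fits in 128 bits, multiply the upper
// half into the lower half. Unsigned lanes are used so that overflow wraps
// instead of being undefined behaviour.
uint32_t mul_reduce_by_halves(std::vector<uint32_t> v,
                              std::size_t lanes_in_128b = 4) {
  std::size_t n = v.size();  // assumed to be a power of two
  while (n > lanes_in_128b) {
    n /= 2;
    for (std::size_t i = 0; i < n; ++i) {
      v[i] *= v[i + n];      // lower half *= upper half
    }
  }
  // Stand-in for the pre-existing 128-bit reduction.
  uint32_t acc = 1;
  for (std::size_t i = 0; i < n; ++i) {
    acc *= v[i];
  }
  return acc;
}
```

For eight 32-bit lanes `{1, ..., 8}` (a 256-bit vector) one halving step leaves `{5, 12, 21, 32}`, and the final reduction gives 40320. Integer multiplication is associative, so the reordering does not change the result; for floating-point lanes the same reordering can change rounding.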
src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 2199: > 2197: > 2198: instruct reduce_non_strict_order_mulF_256b(vRegF dst, vRegF fsrc, vReg vsrc, vReg tmp1, vReg tmp2) %{ > 2199: predicate(Matcher::vector_length_in_bytes(n->in(2)) == 32 && !n->as_Reduction()->requires_strict_order()); Suggestion: predicate(Matcher::vector_length_in_bytes(n->in(2)) == 32 && !n->as_Reduction()->requires_strict_order()); src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2119: > 2117: assert(false, "unsupported"); > 2118: ShouldNotReachHere(); > 2119: } Can we just add a type assertion at the start of the method and remove the switch-case? src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2165: > 2163: FloatRegister vtmp1, > 2164: FloatRegister vtmp2) { > 2165: assert(vector_length_in_bytes > FloatRegister::neon_vl, "ASIMD impl should be used instead"); Is it better to assert `vector_length_in_bytes == 32` or `vector_length_in_bytes == 2 * FloatRegister::neon_vl`? ------------- PR Review: https://git.openjdk.org/jdk/pull/23181#pullrequestreview-3199499604 PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2332130585 PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2332153670 PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2332197936 From xgong at openjdk.org Tue Sep 9 06:53:32 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 9 Sep 2025 06:53:32 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v8] In-Reply-To: References: <6H9X-NXKOGd9BZVhTDiKNf7OO2KQTciRKGnXY-5C9yA=.e25f9e69-44c2-48d1-b4e3-cb8f1af79546@github.com> <_gHaFQTNq2bApeWAE88cWxcNULRDqndSSo3hrY31FgI=.132b7c24-7205-4877-9b95-3d9d13ac7ec8@github.com> <-SwJHROQB4jO9nlICIWSwNGXZDIQUy8O54baR-Xe80o=.f7c4fd43-330d-4870-ae4b-316ab7507b06@github.com> Message-ID: On Fri, 11 Jul 2025 09:32:14 GMT, Xiaohong Gong wrote: >> @XiaohongGong , JIC, you've referenced the PR you left this comment in. 
Did you intend to post it somewhere else?
>
> Oh, sorry, my bad. I intended to post this one: https://github.com/openjdk/jdk/pull/21895/files#diff-7b82624b78127158abbce6835eeba196bd062aee59512ec2d4e4c8c7d681573b

So do you intend to change this still? Either is fine to me. But I still prefer not touching code unrelated to the PR.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2332201601

From duke at openjdk.org Tue Sep 9 07:02:31 2025
From: duke at openjdk.org (erifan)
Date: Tue, 9 Sep 2025 07:02:31 GMT
Subject: RFR: 8365911: AArch64: Fix encoding error in sve_cpy for negative floats
In-Reply-To:
References: <2R6O7Jhv3catwxc6rXJdh7Uiq-NFBp7beCmP49CLTqU=.7ba72e39-6efd-47fe-8ad9-6df54a45c99b@github.com> <-G8GwIflOhFjOL-PAG6_oylu0Fa9c8iNUB57EC6oo4s=.a0126087-2a97-4542-a555-27c12578fccf@github.com>
Message-ID:

On Tue, 9 Sep 2025 05:36:35 GMT, erifan wrote:

> /sponsor

Thank you very much! @eme64

-------------

PR Comment: https://git.openjdk.org/jdk/pull/26951#issuecomment-3269160855

From duke at openjdk.org Tue Sep 9 07:02:32 2025
From: duke at openjdk.org (erifan)
Date: Tue, 9 Sep 2025 07:02:32 GMT
Subject: Integrated: 8365911: AArch64: Fix encoding error in sve_cpy for negative floats
In-Reply-To:
References:
Message-ID:

On Wed, 27 Aug 2025 01:34:25 GMT, erifan wrote:

> The sve_cpy instruction is not correctly implemented for negative floating-point values. The issues include:
>
> 1. When a negative floating-point number (e.g. `-1.0`) is passed, the `checked_cast(pack(d))` check fails. For example, assume `d = -1.0`:
>    - `pack(-1.0)` returns an unsigned int with the 7th bit set, i.e., `0xf0`.
>    - `checked_cast(0xf0)` casts `0xf0` to an int8_t value, which is `-16`.
>    - Casting this int8_t `-16` back to unsigned int results in `0xfffffff0`.
>    - The check compares `0xf0` to `0xfffffff0`, which obviously fails.
>
> 2. Additionally, the encoding of the negative floating-point number is incorrect:
>    - The imm8 field can fall outside the valid range of **[-128, 127]**.
>    - Bit **13** should be encoded as **0** for floating-point numbers.
>
> This PR fixes these issues and renames floating-point `sve_cpy` as `sve_fcpy`.
>
> Some test cases are added to aarch64-asmtest.py, and all tests passed.

This pull request has now been integrated.

Changeset: 680bf758
Author: erifan
Committer: Emanuel Peter
URL: https://git.openjdk.org/jdk/commit/680bf758980452511ea72224066358e5fd38f060
Stats: 136 lines in 3 files changed: 9 ins; 0 del; 127 mod

8365911: AArch64: Fix encoding error in sve_cpy for negative floats

Reviewed-by: aph, epeter

-------------

PR: https://git.openjdk.org/jdk/pull/26951

From rcastanedalo at openjdk.org Tue Sep 9 07:04:20 2025
From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano)
Date: Tue, 9 Sep 2025 07:04:20 GMT
Subject: RFR: 8361699: C2: assert(can_reduce_phi(n->as_Phi())) failed: Sanity: previous reducible Phi is no longer reducible before SUT
In-Reply-To:
References: <1uDOe3Oe-hihmDHea2h8vcvRZsKKBeNp0J9lKYUujxk=.abd111bc-3625-4c71-bfa2-0a4c1f4d3875@github.com>
Message-ID: <2brDXuLmbVBVRaeSyCdKokA706v3t6VsZfGvj_QceJ4=.4483390e-c726-4d82-b220-f1dbdf4efef0@github.com>

On Mon, 8 Sep 2025 15:38:52 GMT, Roberto Castañeda Lozano wrote:

>> Hi Cesar, thanks for addressing this issue. I will run some more comprehensive testing and have a look at it in the next days.
>
>> Hi Cesar, thanks for addressing this issue. I will run some more comprehensive testing and have a look at it in the next days.
>
> Testing did not reveal any issue.
I have, however, a high-level question: could the current two-step design ([SR state adjustment loop](https://github.com/openjdk/jdk/blob/166ef5e7b1c6d6a9f0f1f29fedb7f65b94f53119/src/hotspot/share/opto/escape.cpp#L300-L315) followed by a [NSR propagation loop](https://github.com/openjdk/jdk/blob/166ef5e7b1c6d6a9f0f1f29fedb7f65b94f53119/src/hotspot/share/opto/escape.cpp#L318-L320) miss marking allocations as NSR in more complex scenarios, e.g. involving longer points-to/merge chains? Wouldn't it be more principled to re-run the SR state adjustment loop until a fixed point is reached, keeping `reducible_merges` consistent as new allocations are discovered to be NSR? (e.g. by calling `revisit_reducible_phi_status` - with your clean-up applied - every time [an allocation is marked as NSR due to non-removable merges](https://github.com/openjdk/jdk/blob/166ef5e7b1c6d6a9f0f1f29fedb7f65b94f53119/src/hotspot/share/opto/escape.cpp#L2962-L2964)). > @robcasloz - are you thinking that the "fixed point" loops on `find_scalar_replaceable_allocs` aren't sufficient? You're right, that should do. > At first glance yes, I think that the code would be more cleaned up if done that way. If the code had been written like that in the first place we wouldn't have seen the current issue. (...) Agree, a single fixed point loop combining NSR detection and propagation would be ideal for clarity and maintainability. > I propose that we move forward with the current patch and work on this refactoring as a separate issue. Sounds good, please file a RFE for that. I would suggest then to postpone the clean-up in `revisit_reducible_phi_status` to that RFE. 
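
For readers following the fixed-point discussion above, here is a minimal standalone C++ sketch of the idea. It is not HotSpot code: `PtNode`, `make_merge`, `make_chain_example` and `propagate_nsr_to_fixed_point` are hypothetical names, and the rule applied (a merge with an NSR input stops being reducible, and a non-reducible merge makes all of its inputs NSR) is a deliberate simplification assumed for illustration, not the actual logic in `escape.cpp`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an allocation or a merge (Phi-like) node.
struct PtNode {
  bool is_merge = false;            // true for a merge node
  bool reducible = false;           // only meaningful for merges
  bool nsr = false;                 // "not scalar replaceable"
  std::vector<std::size_t> inputs;  // indices of the nodes this merge combines
};

PtNode make_merge(std::size_t a, std::size_t b) {
  PtNode m;
  m.is_merge = true;
  m.reducible = true;
  m.inputs = {a, b};
  return m;
}

// Allocations A0..A3 at indices 0..3; merges are stored so that a single
// pass sees them in the "wrong" order: M3(A2,A3), M2(A1,A2), M1(A0,A1).
std::vector<PtNode> make_chain_example() {
  std::vector<PtNode> g(4);
  g[0].nsr = true;  // A0 is already known to be NSR
  g.push_back(make_merge(2, 3));
  g.push_back(make_merge(1, 2));
  g.push_back(make_merge(0, 1));
  return g;
}

// Repeat the adjustment until nothing changes, so that NSR-ness travels
// along arbitrarily long merge chains. Returns the number of passes.
int propagate_nsr_to_fixed_point(std::vector<PtNode>& g) {
  int rounds = 0;
  bool changed = true;
  while (changed) {
    changed = false;
    ++rounds;
    for (PtNode& n : g) {
      if (!n.is_merge) continue;
      // A merge with an NSR input is no longer reducible.
      if (n.reducible) {
        for (std::size_t i : n.inputs) {
          if (g[i].nsr) { n.reducible = false; changed = true; break; }
        }
      }
      // A non-reducible merge forces all of its inputs to be NSR.
      if (!n.reducible) {
        for (std::size_t i : n.inputs) {
          if (!g[i].nsr) { g[i].nsr = true; changed = true; }
        }
      }
    }
  }
  return rounds;
}
```

With this node ordering a single pass only reaches A1; the loop needs three changing passes plus one confirming pass before every allocation is NSR and no merge is left marked reducible.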

-------------

PR Comment: https://git.openjdk.org/jdk/pull/27063#issuecomment-3269166743

From mchevalier at openjdk.org Tue Sep 9 07:31:30 2025
From: mchevalier at openjdk.org (Marc Chevalier)
Date: Tue, 9 Sep 2025 07:31:30 GMT
Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v5]
In-Reply-To:
References:
Message-ID:

On Mon, 8 Sep 2025 15:08:54 GMT, Emanuel Peter wrote:

>> Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   One more ResourceMark
>
> src/hotspot/share/opto/graphInvariants.cpp line 45:
>
>> 43: }
>> 44: }
>> 45: }
>
> It seems you are assuming that all CFG nodes are reachable "from below".
> That is true in most cases... but:
> Have we not had this pesky case where we have an "infinite loop", where there is really no reachability from below, but from above it is reachable.
>
> See `_root_and_safepoints` in `PhaseCCP`. I'm not sure we need to worry about this, but I'd like to be sure that we have considered infinite loops here.
>
> The risk is that otherwise you just call those nodes dead, and do not verify them, right? Or you would just ignore failures there.

I guess that is the risk, but I'm going from root, and follow the outputs, I'm checking reachability from above, no?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2332296280

From epeter at openjdk.org Tue Sep 9 07:32:36 2025
From: epeter at openjdk.org (Emanuel Peter)
Date: Tue, 9 Sep 2025 07:32:36 GMT
Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation [v3]
In-Reply-To:
References: <1XFXtkTlDshGtoxEdLVg0f2J2rtn4wz7CdUB9pb9N2g=.25e7e0b5-8468-4d91-adb9-c459bda40933@github.com>
Message-ID:

On Mon, 8 Sep 2025 02:28:16 GMT, Xiaohong Gong wrote:

>> To me a `false` means this:
>> If we support gather/scatter, then we do not need a vector index, we can do without it.
>>
>> Is that correct?
>>
>> But that would contradict @fg1417 's statement:
>> If we support gather/scatter, then we do not permit a vector index.
>>
>> Can you clarify?
>
>> To me a `false` means this: If we support gather/scatter, then we do not need a vector index, we can do without it.
>>
>> Is that correct?
>
> Thanks for your review! Actually gather/scatter always need an index input. What this function wants to decide is how the index elements are passed to the operations.
>
> It doesn't take an assumption whether vector gather_load/scatter_store is supported or not in backend. It just checks whether the `index` input of such operations requires a vector register or an address which stores the indexes. Currently, on x86, it passes an array address for subword types (the indexes are then loaded one-by-one in backend codegen). However, on AArch64, we require a vector type for all types instead (the indexes have been loaded and saved into vector registers in IR level).
>
>> The current platform does not support vector gather-load or scatter-store at all.
>
> I'm sorry that I didn't clarify very clearly about @fg1417 's second statement. Whether the current platform supports vector gather-load/scatter-store is still decided by `Matcher::match_rule_supported_vector()` like other operations. It returns `false` here just because arm doesn't support any vector operations. Assuming it wants to support a vector gather/scatter, the index input must not be a vector, right?

Thanks for all the explanations, that was very helpful!

Can you please adjust the comment so that all the relevant information is there?
We could also make the name of the method more precise / informative?
Maybe you could write something like this:

// true -> if gather/scatter supported: require index in vector register
// false -> if gather/scatter supported: allows both index in vector register AND array address holding indices

Then give more information about platform specific things that you mentioned about aarch64 and x86 in the relevant files ;)

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2332295242

From xgong at openjdk.org Tue Sep 9 07:32:37 2025
From: xgong at openjdk.org (Xiaohong Gong)
Date: Tue, 9 Sep 2025 07:32:37 GMT
Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation [v3]
In-Reply-To:
References: <1XFXtkTlDshGtoxEdLVg0f2J2rtn4wz7CdUB9pb9N2g=.25e7e0b5-8468-4d91-adb9-c459bda40933@github.com>
Message-ID:

On Tue, 9 Sep 2025 07:27:46 GMT, Emanuel Peter wrote:

>>> To me a `false` means this: If we support gather/scatter, then we do not need a vector index, we can do without it.
>>>
>>> Is that correct?
>>
>> Thanks for your review! Actually gather/scatter always need an index input. What this function wants to decide is how the index elements are passed to the operations.
>>
>> It doesn't take an assumption whether vector gather_load/scatter_store is supported or not in backend. It just checks whether the `index` input of such operations requires a vector register or an address which stores the indexes. Currently, on x86, it passes an array address for subword types (the indexes are then loaded one-by-one in backend codegen). However, on AArch64, we require a vector type for all types instead (the indexes have been loaded and saved into vector registers in IR level).
>>
>>> The current platform does not support vector gather-load or scatter-store at all.
>>
>> I'm sorry that I didn't clarify very clearly about @fg1417 's second statement.
Whether the current platform supports vector gather-load/scatter-store is still decided by `Matcher::match_rule_supported_vector()` like other operations. It return `false` here just because arm doesn't support any vector operations. Assume if it want to support a vector gather/scatter, the index input must not be a vector, right? > > Thanks for all the explanations, that was very helpful! > > Can you please adjust the comment so that all the relevant information is there? > We could also make the name of the method more precise / informative? > Maybe you could write something like this: > > // true -> if gather/scatter supported: require index in vector register > // false -> if gather/scatter supported: allows both index in vector register AND array address holding indices > > Then give more information about platform specific things that you mentioned about aarch64 and x86 in the relevant files ;) Sure, I will do that in next commit. Thanks for your suggestion! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2332301225 From epeter at openjdk.org Tue Sep 9 07:32:39 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 9 Sep 2025 07:32:39 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation [v5] In-Reply-To: References: <0lnaxN7YsQEddGZfWLgFi2YOl_XtXntDoHRr57Bjp7k=.946b3e40-04c1-4eb5-a205-53347cdc91eb@github.com> Message-ID: On Mon, 8 Sep 2025 02:57:55 GMT, Xiaohong Gong wrote: >>> That semantic is not quite what I would expect from `Concatenate`. Maybe we can call it something else? `VectorConcatenateAndNarrowNode`? >> >> Yeah, `VectorConcatenateAndNarrowNode` would be much match. I just thought the name would be too long. I will change it in next commit. Thanks for your suggestion! > >> Have you considered using `2x Cast + Concatenate` instead, and just matching that in the backend? 
I don't remember how to do the mere Concat, but it should be possible via the `unslice` or some other operation that concatenates two vectors. > > Would using `2x Cast + Concatenate` make the IRs and match rule more complex? Mere concatenate would be something like `vector slice` in Vector API. It concatenates two vectors into one with an index denoting the merging position. And it requires the vector types are the same for two input vectors and the dst vector. Hence, if we want to separate this operation with cast and concatenate, the IRs would be (assume original type of `v1/v2` is `4-int`, the result type should be `8-short`): > 1) Narrow two input vectors: > `v1 = VectorCast(v1) (4-short); v2 = VectorCast(v2) (4-short)`. > The vector length are not changed while the element size is half size. Hence the vector length in bytes is half size as well. > 2) Resize `v1` and `v2` to double vector length. The higher bits are cleared: > `v1 = VectorReinterpret(v1) (8-short); v2 = VectorReinterpret(v2) (8-short)`. > 3) Concatenate `v1` and `v2` like slice. The position is the middle of the vector length. > `v = VectorSlice(v1, v2, 4) (8-short)`. > > If we want to merging these IRs in backend, would the match rule be more complex? I will take a considering. I'm not saying I know that this alternative would be better. I'm just worried about having extra IR nodes, and then optimizations are more complex / just don't work because we don't handle all nodes. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2332301985 From epeter at openjdk.org Tue Sep 9 07:36:52 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 9 Sep 2025 07:36:52 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation [v5] In-Reply-To: References: Message-ID: On Mon, 8 Sep 2025 03:12:18 GMT, Xiaohong Gong wrote: >> Did you consider the alternative of `Extract` + `Cast`? 
Not sure if that would be better, you know more about the code complexity. It would just allow us to have one fewer nodes. > > It just has the `Extract` node to extract an element from vector in C2, right? Extracting the lowest part can be implemented with `VectorReinterpret` easily. But how about the higher parts? Maybe this can also be implemented with operations like `slice` ? But, seems this will also make the IR more complex? For `Cast`, we have `VectorCastMask` now, but it assumes the vector length should be the same for input and output. So the `VectorReinterpret` or an `VectorExtract` is sill needed. > > I can have a try with separating the IR. But I guess an additional new node is still necessary. > >> It would just allow us to have one fewer nodes. > > This is also what I expect really. It would just be nice to build on "simple" building blocks and not have too many complex nodes, that have very special semantics (widen + split into two). It just means that the IR optimizations have to take care of more special cases, rather than following simple rules/optimizations because every IR node does a relatively simple thing. Maybe you find out that we really need a complex node, and can provide good arguments. 
Looking forward to what you find :) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2332311631 From mchevalier at openjdk.org Tue Sep 9 07:38:59 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Tue, 9 Sep 2025 07:38:59 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v5] In-Reply-To: References: Message-ID: On Mon, 8 Sep 2025 15:48:48 GMT, Emanuel Peter wrote: >> Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: >> >> One more ResourceMark > > src/hotspot/share/opto/graphInvariants.cpp line 215: > >> 213: ss.print_cr("Input at index %d is nullptr.", _which_input); >> 214: return false; >> 215: } > > So we would never do `AtInput(0, ExpectNullptr())` for example? > Fine with me, just an idea to consider ;) No, we can't do that because every pattern must be applied on a center. `AtInput` moves the center. We cannot use a parametric pattern to check that a node around is not there: there would be no place to apply the parameter pattern. We can make `InputIsNull(int)` for that. > src/hotspot/share/opto/graphInvariants.hpp line 73: > >> 71: * In addition, if the check fails, it must write its error message in [ss]. >> 72: * >> 73: * If the check succeeds or is not applicable, [steps], [path] and [ss] must be untouched. > > I wonder if we should not have some object that represents these 3 args. You pass them everywhere, and they seem to be a unit. And they have invariants that we may want to check. > You could for example enforce that steps and path are in synch just by only providing the access methods that allow it. > What do you think? `steps` and `path` can make sense. I don't think it makes sense for `ss` because we just fill it from `steps` and `path` at some point, it doesn't really evolve with. If you like it, I won't fight, but is it worth it? 
It seems like more ad-hoc types to be aware of for simplifying the code a little and real benefits but not big benefits imo. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2332310623 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2332318192 From epeter at openjdk.org Tue Sep 9 07:38:59 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 9 Sep 2025 07:38:59 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v5] In-Reply-To: References: Message-ID: On Tue, 9 Sep 2025 07:33:25 GMT, Marc Chevalier wrote: >> src/hotspot/share/opto/graphInvariants.cpp line 215: >> >>> 213: ss.print_cr("Input at index %d is nullptr.", _which_input); >>> 214: return false; >>> 215: } >> >> So we would never do `AtInput(0, ExpectNullptr())` for example? >> Fine with me, just an idea to consider ;) > > No, we can't do that because every pattern must be applied on a center. `AtInput` moves the center. We cannot use a parametric pattern to check that a node around is not there: there would be no place to apply the parameter pattern. We can make `InputIsNull(int)` for that. Sounds good, we can do that when we need it :) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2332316770 From mchevalier at openjdk.org Tue Sep 9 07:45:15 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Tue, 9 Sep 2025 07:45:15 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v5] In-Reply-To: References: Message-ID: On Mon, 8 Sep 2025 15:59:58 GMT, Emanuel Peter wrote: >> Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: >> >> One more ResourceMark > > src/hotspot/share/opto/graphInvariants.cpp line 222: > >> 220: } >> 221: return result; >> 222: } > > Would this not read better? 
> Suggestion: > > bool success = _pattern->check(center->in(_which_input), state); > if (!success) { > state.trace_failure_path(center, _which_input); > } > return success; > } I don't think it's terrible, but I don't think it's much better. If I know the code and that we want to add the new points in the path, and I'll read it as such either way. Or I don't know the code, but I know the types of `steps` and `path`, and I know what `push` does, while a custom type with custom methods has an higher learning cost. So to me, it's pretty equivalent. > src/hotspot/share/opto/graphInvariants.cpp line 239: > >> 237: return true; >> 238: } >> 239: bool (Node::*_type_check)() const; > > Suggestion: > > private: > bool (Node::*_type_check)() const; > > I would also suggest that you use a `typedef` here. > Something like: > `typedef bool (Node::*TypeCheckMethod)() const;` > Then you can write > Suggestion: > > public: > const TypeCheckMethod _type_check; Again, is it really better? If I know the code, I know what it needs, however I express it. If I don't know the code, looking at the signature won't be enough, I'll need to look up one level deeper the definition. Not sure it's a win. 
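
To make the two spellings discussed above concrete, here is a small standalone C++ sketch of a pattern that stores a pointer to a `const` member function, once via the `typedef` alias and once (in a comment) with the raw declarator. The miniature `Node`/`PhiNode` classes and the `TypeCheckPattern` name are hypothetical and only assumed for illustration; in particular, the queries are made virtual here purely to keep the example self-contained, which is not a claim about how HotSpot's `Node` implements them.

```cpp
#include <cassert>

// Hypothetical miniature node hierarchy with is_Xxx() style queries.
struct Node {
  virtual ~Node() = default;
  virtual bool is_Phi() const { return false; }
  virtual bool is_Region() const { return false; }
};

struct PhiNode : Node {
  bool is_Phi() const override { return true; }
};

// Alias for "pointer to a const member function of Node returning bool".
typedef bool (Node::*TypeCheckMethod)() const;

class TypeCheckPattern {
 public:
  explicit TypeCheckPattern(TypeCheckMethod type_check)
      : _type_check(type_check) {}

  // Invoke the stored member function on the node under inspection.
  bool check(const Node* center) const {
    return (center->*_type_check)();
  }

 private:
  // Without the alias this member would be declared as:
  //   bool (Node::*_type_check)() const;
  const TypeCheckMethod _type_check;
};
```

Both declarations denote the same type; the alias only moves the declarator syntax to one place, which is the trade-off being debated in the review.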
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2332328768 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2332333010 From mchevalier at openjdk.org Tue Sep 9 07:48:26 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Tue, 9 Sep 2025 07:48:26 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v5] In-Reply-To: References: Message-ID: On Mon, 8 Sep 2025 16:08:37 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/graphInvariants.cpp line 234: >> >>> 232: bool check(const Node* center, Node_List& steps, GrowableArray& path, stringStream& ss) const override { >>> 233: if (!(center->*_type_check)()) { >>> 234: ss.print_cr("Unexpected type: %s.", center->Name()); >> >> Is there a way we could say what we actually do expect? Not really, right? We'd need to do it via macro again. > > Or we pass a string .. not nice but would work with the macro for `NodeClassIsAndBind`. Not sure what's best here. I thought about that and I think the current situation is ok. The pattern is not something highly mutable, it's mostly some hardcoded thing. I don't think it's hard to figure out what you're expecting. I'm very reluctant to add some ugliness to the patterns who must stay readable, to be easy to verify by a human. It could be solved with more templates tho. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2332342466 From aph at openjdk.org Tue Sep 9 07:58:11 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 9 Sep 2025 07:58:11 GMT Subject: RFR: 8361376: Regressions 1-6% in several Renaissance in 26-b4 only MacOSX aarch64 [v5] In-Reply-To: References: Message-ID: On Mon, 4 Aug 2025 21:26:22 GMT, Dean Long wrote: >> This PR removes the recently added lock around set_guard_value, using instead Atomic::cmpxchg to atomically update bit-fields of the guard value. Further, it takes a fast-path that uses the previous direct store when at a safepoint. 
Combined, these changes should get us back to almost where we were before in terms of overhead. If necessary, we could go even further and allow make_not_entrant() to perform a direct byte store, leaving 24 bits for the guard value. > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > one unconditional release should be enough That looks like a nice improvement. Thanks. ------------- Marked as reviewed by aph (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26399#pullrequestreview-3199895559 From mchevalier at openjdk.org Tue Sep 9 08:01:24 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Tue, 9 Sep 2025 08:01:24 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v5] In-Reply-To: References: Message-ID: <4jB_I2sHD7IfzhR7ojHfsFPlvZFCOWaHf8aS0AZshj0=.d0162feb-10b8-488d-82fa-eb816ce5dda9@github.com> On Mon, 8 Sep 2025 16:42:32 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/graphInvariants.cpp line 378: >> >>> 376: return CheckResult::VALID; >>> 377: } >>> 378: }; >> >> I am wondering if it is really worth it to do the whole pattern matching approach, if we still have to write so much code. >> >> There is a lot of boiler plate now, that has replaced the procedural code. >> >> I'm just wondering if we are there yet, or if we need to find some way to make it more concise. >> Maybe we can do something like this: >> >> return >> .applies_if(&Node::is_Phi) >> .check([&]() { return PatternBasedCheck::check(center, reachable_cfg_nodes, steps, path, ss); }) >> .require(...) >> .finish(); >> >> Just an idea. It would probably be lambda based again, which has its disadvantages. >> Maybe you have an even better idea. >> I'd just like to understand why the Pattern based approach is really super desirable, what are the advantages and disadvantages? > > One advantage is definitively reporting. 
And it is still reasonably debuggable I think, my solution may be a little trickier that way. > > I think there are multiple factors: > - Simple: fewer abstractions can be easier to read/debug. > - Concise: few lines of code. > - Reporting: nice output when rules fail. I could have wrote this without pattern at all, but I also want to make more example of differently complex usage of patterns. Writing it without patterns at all would be pretty similar to me. I think the boilerplate has to exist somewhere. It's not nice to read, it's long, but it's (actually) simple. If we hide it somewhere, it's nicer to read and gives an impression of easier to understand, but harder to actually understand when something goes wrong. No strong opinion. >> src/hotspot/share/opto/graphInvariants.cpp line 547: >> >>> 545: return CheckResult::FAILED; >>> 546: } >>> 547: } >> >> If you do add `OuterStripMinedLoop`, make it a swich, and assert in the default case ;) > > I just saw that you do the `OuterStripMinedLoop` below. But to capture the parallel structure it may still be good. And to capture possible future extension. I don't understand. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2332373118 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2332384303 From mchevalier at openjdk.org Tue Sep 9 08:01:27 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Tue, 9 Sep 2025 08:01:27 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v5] In-Reply-To: References: Message-ID: On Mon, 8 Sep 2025 16:47:23 GMT, Emanuel Peter wrote: >> Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: >> >> One more ResourceMark > > src/hotspot/share/opto/graphInvariants.cpp line 438: > >> 436: ss.print_cr("%s node must have at least one control successors. 
Found %d.", center->Name(), cfg_out); >> 437: return CheckResult::FAILED; >> 438: } > > Is there some upper bound? I don't think so. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2332377717 From epeter at openjdk.org Tue Sep 9 08:08:36 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 9 Sep 2025 08:08:36 GMT Subject: RFR: 8366971: C2: Remove unused nop_list from PhaseOutput::init_buffer In-Reply-To: <2hBEO9Zpoy2wo_pgTXE9v8KG5u1HNdKp3RgQE-4HYcE=.e86088d1-1e25-49b5-9b3c-c2498ec6ca48@github.com> References: <2hBEO9Zpoy2wo_pgTXE9v8KG5u1HNdKp3RgQE-4HYcE=.e86088d1-1e25-49b5-9b3c-c2498ec6ca48@github.com> Message-ID: On Fri, 5 Sep 2025 13:02:00 GMT, Daniel Jeli?ski wrote: > The nop list has never been used in the history of OpenJDK. Let's clean it up. > > Tested with Mach5 tier 1-5, no related failures. Looks quite reasonable. Thanks for cleaning the code :) src/hotspot/cpu/ppc/ppc.ad line 4926: > 4924: // Unused, list one so that array generated by adlc is not empty. > 4925: // Aix compiler chokes if _nop_count = 0. > 4926: nops(fxNop); There seems to be some justification here why we needed to have the list. Can you quickly say why we should not be worried about that now? ;) ------------- PR Review: https://git.openjdk.org/jdk/pull/27117#pullrequestreview-3199928796 PR Review Comment: https://git.openjdk.org/jdk/pull/27117#discussion_r2332395694 From epeter at openjdk.org Tue Sep 9 08:13:24 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 9 Sep 2025 08:13:24 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v5] In-Reply-To: References: Message-ID: On Tue, 9 Sep 2025 07:28:14 GMT, Marc Chevalier wrote: >> src/hotspot/share/opto/graphInvariants.cpp line 45: >> >>> 43: } >>> 44: } >>> 45: } >> >> It seems you are assuming that all CFG nodes are reachable "from below". >> That is true in most cases... 
but: >> Have we not had this pesky case where we have an "infinite loop", where there is really no reachability from below, but from above it is reachable. >> >> See `_root_and_safepoints` in `PhaseCCP`. I'm not sure we need to worry about this, but I'd like to be sure that we have considered infinite loops here. >> >> The risk is that otherwise you just call those nodes dead, and do not verify them, right? Or you would just ignore failures there. > > I guess that is the risk, but I'm starting from root and following the outputs, so I'm checking reachability from above, no? Never mind, I somehow did not look at this right. Sorry. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2332431787 From epeter at openjdk.org Tue Sep 9 08:18:03 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 9 Sep 2025 08:18:03 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v5] In-Reply-To: <4jB_I2sHD7IfzhR7ojHfsFPlvZFCOWaHf8aS0AZshj0=.d0162feb-10b8-488d-82fa-eb816ce5dda9@github.com> References: <4jB_I2sHD7IfzhR7ojHfsFPlvZFCOWaHf8aS0AZshj0=.d0162feb-10b8-488d-82fa-eb816ce5dda9@github.com> Message-ID: On Tue, 9 Sep 2025 07:58:45 GMT, Marc Chevalier wrote: >> I just saw that you do the `OuterStripMinedLoop` below. But to capture the parallel structure it may still be good. And to capture possible future extension. > > I don't understand. I would still consider adding `OuterStripMinedLoop` here, to capture that it has a similar structure, even if you also verify specific things for `OuterStripMinedLoop` below. Just to check that all these loop structures have the same kind of backedge shape. And then make a switch out of it, with a default case that fails. In case we add yet another `Loop` shape, we would then catch that and add the logic for it. But actually: do not all `Loop` shapes have this backedge pattern? Or are there some that have an `IfFalse` on the backedge? 
Because then you could also add `LoopNode` with `LoopEndNode`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2332447521 From epeter at openjdk.org Tue Sep 9 08:25:23 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 9 Sep 2025 08:25:23 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v5] In-Reply-To: References: Message-ID: On Tue, 9 Sep 2025 07:57:08 GMT, Marc Chevalier wrote: >> src/hotspot/share/opto/graphInvariants.cpp line 438: >> >>> 436: ss.print_cr("%s node must have at least one control successors. Found %d.", center->Name(), cfg_out); >>> 437: return CheckResult::FAILED; >>> 438: } >> >> Is there some upper bound? > > I don't think so. Can you add a comment, why it can be arbitrarily large? Do you have an example where we have very many ctrl uses? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2332468642 From epeter at openjdk.org Tue Sep 9 08:25:24 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 9 Sep 2025 08:25:24 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v5] In-Reply-To: References: Message-ID: On Tue, 9 Sep 2025 08:21:28 GMT, Emanuel Peter wrote: >> I don't think so. > > Can you add a comment, why it can be arbitrarily large? > Do you have an example where we have very many ctrl uses? Also: are these all supposed to be projections of a specific kind? We could also test for that. You can also add that to a future RFE. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2332472835 From djelinski at openjdk.org Tue Sep 9 08:28:44 2025 From: djelinski at openjdk.org (Daniel Jeliński) Date: Tue, 9 Sep 2025 08:28:44 GMT Subject: RFR: 8366971: C2: Remove unused nop_list from PhaseOutput::init_buffer In-Reply-To: References: <2hBEO9Zpoy2wo_pgTXE9v8KG5u1HNdKp3RgQE-4HYcE=.e86088d1-1e25-49b5-9b3c-c2498ec6ca48@github.com> Message-ID: On Tue, 9 Sep 2025 08:01:40 GMT, Emanuel Peter wrote: >> The nop list has never been used in the history of OpenJDK. Let's clean it up. >> >> Tested with Mach5 tier 1-5, no related failures. > > src/hotspot/cpu/ppc/ppc.ad line 4926: > >> 4924: // Unused, list one so that array generated by adlc is not empty. >> 4925: // Aix compiler chokes if _nop_count = 0. >> 4926: nops(fxNop); > > There seems to be some justification here why we needed to have the list. > Can you quickly say why we should not be worried about that now? ;) I don't have the AIX compiler at hand, but based on the comment I'd guess that the AIX compiler errored out either on [this](https://github.com/openjdk/jdk/blob/b1fa1ecc988fb07f191892a459625c2c8f2de3b5/src/hotspot/share/opto/output.cpp#L1403) or on [this](https://github.com/openjdk/jdk/blob/91f12600d2b188ca98c5c575a34b85f5835399a0/src/hotspot/share/adlc/output_h.cpp#L1122). Both these lines are removed in this PR. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27117#discussion_r2332483194 From roland at openjdk.org Tue Sep 9 08:35:14 2025 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 9 Sep 2025 08:35:14 GMT Subject: RFR: 8361702: C2: assert(is_dominator(compute_early_ctrl(limit, limit_ctrl), pre_end)) failed: node pinned on loop exit test? 
[v6] In-Reply-To: <1-3MDixhdwZEgDMpoAZckhK5_lFjygsKl4q1__tsCKs=.dffa9c0e-8ea1-4465-a1fc-6ad2dbcfe5db@github.com> References: <1-3MDixhdwZEgDMpoAZckhK5_lFjygsKl4q1__tsCKs=.dffa9c0e-8ea1-4465-a1fc-6ad2dbcfe5db@github.com> Message-ID: > A node in a pre loop only has uses out of the loop dominated by the > loop exit. `PhaseIdealLoop::try_sink_out_of_loop()` sets its control > to the loop exit projection. A range check in the main loop has this > node as input (through a chain of some other nodes). Range check > elimination needs to update the exit condition of the pre loop with an > expression that depends on the node pinned on its exit: that's > impossible and the assert fires. This is a variant of 8314024 (this > one was for a node with uses out of the pre loop on multiple paths). I > propose the same fix: leave the node with control in the pre loop in > this case. Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 10 additional commits since the last revision: - Merge branch 'master' into JDK-8361702 - review - Merge branch 'master' into JDK-8361702 - Update src/hotspot/share/opto/loopopts.cpp Co-authored-by: Christian Hagedorn - Update test/hotspot/jtreg/compiler/rangechecks/TestSunkRangeFromPreLoopRCE3.java Co-authored-by: Christian Hagedorn - Update test/hotspot/jtreg/compiler/rangechecks/TestSunkRangeFromPreLoopRCE2.java Co-authored-by: Christian Hagedorn - Update src/hotspot/share/opto/loopopts.cpp Co-authored-by: Christian Hagedorn - Update test/hotspot/jtreg/compiler/rangechecks/TestSunkRangeFromPreLoopRCE2.java Co-authored-by: Christian Hagedorn - tests - fix ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26424/files - new: https://git.openjdk.org/jdk/pull/26424/files/6da75e9d..b220867d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26424&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26424&range=04-05 Stats: 34368 lines in 1338 files changed: 21178 ins; 7472 del; 5718 mod Patch: https://git.openjdk.org/jdk/pull/26424.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26424/head:pull/26424 PR: https://git.openjdk.org/jdk/pull/26424 From roland at openjdk.org Tue Sep 9 08:35:18 2025 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 9 Sep 2025 08:35:18 GMT Subject: RFR: 8361702: C2: assert(is_dominator(compute_early_ctrl(limit, limit_ctrl), pre_end)) failed: node pinned on loop exit test? [v4] In-Reply-To: References: <1-3MDixhdwZEgDMpoAZckhK5_lFjygsKl4q1__tsCKs=.dffa9c0e-8ea1-4465-a1fc-6ad2dbcfe5db@github.com> Message-ID: On Mon, 28 Jul 2025 06:34:46 GMT, Christian Hagedorn wrote: >> Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains eight additional commits since the last revision: >> >> - Merge branch 'master' into JDK-8361702 >> - Update src/hotspot/share/opto/loopopts.cpp >> >> Co-authored-by: Christian Hagedorn >> - Update test/hotspot/jtreg/compiler/rangechecks/TestSunkRangeFromPreLoopRCE3.java >> >> Co-authored-by: Christian Hagedorn >> - Update test/hotspot/jtreg/compiler/rangechecks/TestSunkRangeFromPreLoopRCE2.java >> >> Co-authored-by: Christian Hagedorn >> - Update src/hotspot/share/opto/loopopts.cpp >> >> Co-authored-by: Christian Hagedorn >> - Update test/hotspot/jtreg/compiler/rangechecks/TestSunkRangeFromPreLoopRCE2.java >> >> Co-authored-by: Christian Hagedorn >> - tests >> - fix > > Marked as reviewed by chagedorn (Reviewer). @chhagedorn would you mind re-approving this change now that I added the run without flags and merged with latest? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26424#issuecomment-3269515987 From epeter at openjdk.org Tue Sep 9 08:35:50 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 9 Sep 2025 08:35:50 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v5] In-Reply-To: References: Message-ID: On Tue, 9 Sep 2025 07:40:23 GMT, Marc Chevalier wrote: >> src/hotspot/share/opto/graphInvariants.cpp line 222: >> >>> 220: } >>> 221: return result; >>> 222: } >> >> Would this not read better? >> Suggestion: >> >> bool success = _pattern->check(center->in(_which_input), state); >> if (!success) { >> state.trace_failure_path(center, _which_input); >> } >> return success; >> } > > I don't think it's terrible, but I don't think it's much better. If I know the code and that we want to add the new points in the path, and I'll read it as such either way. Or I don't know the code, but I know the types of `steps` and `path`, and I know what `push` does, while a custom type with custom methods has an higher learning cost. So to me, it's pretty equivalent. 
To me, it was a high overhead having to find out where the `steps` `path` and `ss` were defined. If I know it is some state, I can quickly go to the definition, and see what it is all about. You can also call the method `state.push_to_paths_and_steps(center, _which_input)`. >> src/hotspot/share/opto/graphInvariants.cpp line 239: >> >>> 237: return true; >>> 238: } >>> 239: bool (Node::*_type_check)() const; >> >> Suggestion: >> >> private: >> bool (Node::*_type_check)() const; >> >> I would also suggest that you use a `typedef` here. >> Something like: >> `typedef bool (Node::*TypeCheckMethod)() const;` >> Then you can write >> Suggestion: >> >> public: >> const TypeCheckMethod _type_check; > > Again, is it really better? If I know the code, I know what it needs, however I express it. If I don't know the code, looking at the signature won't be enough, I'll need to look up one level deeper the definition. Not sure it's a win. I think it is a matter of taste. I don't personally like the C++ way of expressing pointer types. But I can get used to it. >> src/hotspot/share/opto/graphInvariants.hpp line 73: >> >>> 71: * In addition, if the check fails, it must write its error message in [ss]. >>> 72: * >>> 73: * If the check succeeds or is not applicable, [steps], [path] and [ss] must be untouched. >> >> I wonder if we should not have some object that represents these 3 args. You pass them everywhere, and they seem to be a unit. And they have invariants that we may want to check. >> You could for example enforce that steps and path are in synch just by only providing the access methods that allow it. >> What do you think? > > `steps` and `path` can make sense. I don't think it makes sense for `ss` because we just fill it from `steps` and `path` at some point, it doesn't really evolve with. If you like it, I won't fight, but is it worth it? It seems like more ad-hoc types to be aware of for simplifying the code a little and real benefits but not big benefits imo. 
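For readers following this sub-thread, the idea of bundling `steps` and `path` into a single object could be sketched roughly as below. This is a minimal illustration and not code from the PR: `FailureTrace`, its methods, and the stand-in `Node` struct are hypothetical names, and standard-library containers replace whatever HotSpot types the real checker uses. The point is only that a single `push` keeps the two sequences in sync by construction.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for HotSpot's Node; only a name is needed here.
struct Node {
  std::string name;
};

// Hypothetical bundle of the two sequences that are always updated together.
class FailureTrace {
 public:
  // Record one hop: the node we were at and the input index we followed.
  // This is the only mutator, so _steps and _path cannot diverge.
  void push(const Node* node, int input_index) {
    _steps.push_back(node);
    _path.push_back(input_index);
  }

  std::size_t length() const {
    assert(_steps.size() == _path.size());  // invariant: always in sync
    return _steps.size();
  }

  bool is_empty() const { return length() == 0; }

  // Render the recorded hops in the order they were pushed.
  std::string to_string() const {
    std::string out;
    for (std::size_t i = 0; i < length(); ++i) {
      if (i > 0) out += "; ";
      out += _steps[i]->name + "->in(" + std::to_string(_path[i]) + ")";
    }
    return out;
  }

 private:
  std::vector<const Node*> _steps;
  std::vector<int> _path;
};
```

Under these assumptions, a failing check would call something like `trace.push(center, _which_input)` instead of updating two containers by hand.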
Let's ask @chhagedorn. He might have a good idea here too ;) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2332506238 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2332490006 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2332509836 From epeter at openjdk.org Tue Sep 9 08:35:51 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 9 Sep 2025 08:35:51 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v5] In-Reply-To: References: Message-ID: <8F_IhYAZ2XxKl9SzWYNYkGvXzKEj1rl8GsRFrORBWaE=.4bd4bd61-f01d-480b-86b1-e65bbf61b065@github.com> On Tue, 9 Sep 2025 07:45:18 GMT, Marc Chevalier wrote: >> Or we pass a string .. not nice but would work with the macro for `NodeClassIsAndBind`. Not sure what's best here. > > I thought about that and I think the current situation is ok. The pattern is not something highly mutable, it's mostly some hardcoded thing. I don't think it's hard to figure out what you're expecting. I'm very reluctant to add some ugliness to the patterns, which must stay readable to be easy to verify by a human. It could be solved with more templates tho. Once we have more complex patterns, will it really be that easy to see what was expected? All you will see is what we actually got. You are already all about good reporting, so I just noticed a hole here. You know the code better, so I'll leave it up to you in the end ;) >> One advantage is definitely reporting. And it is still reasonably debuggable, I think; my solution may be a little trickier in that respect. >> >> I think there are multiple factors: >> - Simple: fewer abstractions can be easier to read/debug. >> - Concise: few lines of code. >> - Reporting: nice output when rules fail. > > I could have written this without patterns at all, but I also want to provide more examples of pattern usage at different levels of complexity. Writing it without patterns at all would be pretty similar to me. 
> > I think the boilerplate has to exist somewhere. It's not nice to read, it's long, but it's (actually) simple. If we hide it somewhere, it's nicer to read and gives the impression of being easier to understand, but it is harder to actually understand when something goes wrong. No strong opinion. Yes, these are the trade-offs. Maybe we can discuss in the office, and pull in some others to discuss the pros and cons. Because if we are going to use Patterns more in other places, we should not shy away from doing some design brainstorming together. I really appreciate the new approach, and I can see a lot of benefits, including for IGVN. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2332497359 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2332483756 From epeter at openjdk.org Tue Sep 9 08:38:32 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 9 Sep 2025 08:38:32 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v24] In-Reply-To: References: <_a6JVBA326t8l1U3ZI8C-J3Ju5jm-RklBFGtnR7fbyY=.70638135-7577-44dc-a212-fe5e39b1f5fa@github.com> Message-ID: On Mon, 8 Sep 2025 16:20:20 GMT, Daniel Lundén wrote: >> Yes, the subtraction is consistent, because if the register mask is offset, we can no longer use the OptoReg to directly index the mask. Small simplified example: register mask with 5 bits, offset by 10. First bit (index 0) represents OptoReg 10, second bit (index 1) represents OptoReg 11, etc. If we call `Member(15)`, we need to subtract the offset so we look at the correct index in the register mask (index 5). > > Ah, I think I now better understand your question. `rm_up` is a low-level method for internal use in `regmask.hpp` and `regmask.cpp` only (perhaps I should prepend it with an underscore?). It basically makes it so that we can regard the backing storage (`_RM_UP` and `_RM_UP_EXT`) as one contiguous array. `Member` is exposed externally and so needs the offset logic. Makes sense. 
Maybe we can make that a bit more clear in the renaming. Maybe we can make a clear distinction between the two mappings somehow? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2332518245 From roland at openjdk.org Tue Sep 9 08:39:37 2025 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 9 Sep 2025 08:39:37 GMT Subject: RFR: 8361702: C2: assert(is_dominator(compute_early_ctrl(limit, limit_ctrl), pre_end)) failed: node pinned on loop exit test? [v7] In-Reply-To: <1-3MDixhdwZEgDMpoAZckhK5_lFjygsKl4q1__tsCKs=.dffa9c0e-8ea1-4465-a1fc-6ad2dbcfe5db@github.com> References: <1-3MDixhdwZEgDMpoAZckhK5_lFjygsKl4q1__tsCKs=.dffa9c0e-8ea1-4465-a1fc-6ad2dbcfe5db@github.com> Message-ID: > A node in a pre loop only has uses out of the loop dominated by the > loop exit. `PhaseIdealLoop::try_sink_out_of_loop()` sets its control > to the loop exit projection. A range check in the main loop has this > node as input (through a chain of some other nodes). Range check > elimination needs to update the exit condition of the pre loop with an > expression that depends on the node pinned on its exit: that's > impossible and the assert fires. This is a variant of 8314024 (this > one was for a node with uses out of the pre loop on multiple paths). I > propose the same fix: leave the node with control in the pre loop in > this case. Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 11 additional commits since the last revision: - Merge branch 'master' into JDK-8361702 - Merge branch 'master' into JDK-8361702 - review - Merge branch 'master' into JDK-8361702 - Update src/hotspot/share/opto/loopopts.cpp Co-authored-by: Christian Hagedorn - Update test/hotspot/jtreg/compiler/rangechecks/TestSunkRangeFromPreLoopRCE3.java Co-authored-by: Christian Hagedorn - Update test/hotspot/jtreg/compiler/rangechecks/TestSunkRangeFromPreLoopRCE2.java Co-authored-by: Christian Hagedorn - Update src/hotspot/share/opto/loopopts.cpp Co-authored-by: Christian Hagedorn - Update test/hotspot/jtreg/compiler/rangechecks/TestSunkRangeFromPreLoopRCE2.java Co-authored-by: Christian Hagedorn - tests - ... and 1 more: https://git.openjdk.org/jdk/compare/e3d13e64...91a7d73c ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26424/files - new: https://git.openjdk.org/jdk/pull/26424/files/b220867d..91a7d73c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26424&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26424&range=05-06 Stats: 228 lines in 12 files changed: 43 ins; 163 del; 22 mod Patch: https://git.openjdk.org/jdk/pull/26424.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26424/head:pull/26424 PR: https://git.openjdk.org/jdk/pull/26424 From epeter at openjdk.org Tue Sep 9 08:43:32 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 9 Sep 2025 08:43:32 GMT Subject: RFR: 8360192: C2: Make the type of count leading/trailing zero nodes more precise [v9] In-Reply-To: References: Message-ID: <9xCpJGY6CFKPAt4VtDY23_Tr3SE9tUebdMF3pAYWhFA=.281e0b84-bfad-466b-b290-918cf1fa83d1@github.com> On Fri, 8 Aug 2025 08:21:56 GMT, Qizheng Xing wrote: >> Qizheng Xing has updated the pull request incrementally with two additional commits since the last revision: >> >> - Add microbench >> - Add missing test method declarations > > Hi @jatin-bhateja, I've added a micro benchmark that includes the `numberOfNibbles` 
implementation from this PR description and your micro kernel.
>
> Here are my test results on an Intel(R) Xeon(R) Platinum:
>
> # Baseline:
> Benchmark                                  Mode  Cnt     Score   Error  Units
> CountLeadingZeros.benchClzLongConstrained  avgt   15  1517.888 ± 5.691  ns/op
> CountLeadingZeros.benchNumberOfNibbles     avgt   15  1094.422 ± 1.753  ns/op
>
> # This patch:
> Benchmark                                  Mode  Cnt     Score   Error  Units
> CountLeadingZeros.benchClzLongConstrained  avgt   15     0.948 ± 0.002  ns/op
> CountLeadingZeros.benchNumberOfNibbles     avgt   15   942.438 ± 1.742  ns/op

@MaxXSoft Feel free to just ping me again when you want another review :) FYI: I'll be on a longer vacation starting in about a week, so don't expect me to respond then. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25928#issuecomment-3269553729 From epeter at openjdk.org Tue Sep 9 08:45:30 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 9 Sep 2025 08:45:30 GMT Subject: RFR: 8366588: VectorAPI: Re-intrinsify VectorMask.laneIsSet where the input index is a variable In-Reply-To: References: Message-ID: On Fri, 5 Sep 2025 08:13:28 GMT, erifan wrote: > Intrinsic support for `VectorMask.laneIsSet` with a **variable** input index was introduced in PR #14200, but was inadvertently broken by PR #25673. This PR restores the intrinsic functionality and adds some JTReg tests. 
>
> Benchmarks on Nvidia Grace machine with 128-bit SVE:
>
> Benchmark                        Unit    Before Score  Before Error  After Score   After Error   Uplift
> microMaskLaneIsSetByte128_var    ops/ms  21702.14415   91.902159     103472.9391   36.057447     4.767867
> microMaskLaneIsSetByte64_var     ops/ms  21468.51868   107.94177     103365.6561   69.47736      4.814754
> microMaskLaneIsSetDouble128_var  ops/ms  77489.32791   153.242699    413499.4127   311.854079    5.336211
> microMaskLaneIsSetFloat128_var   ops/ms  41034.95204   399.421823    206840.0988   74.702234     5.040583
> microMaskLaneIsSetFloat64_var    ops/ms  77607.40268   175.938921    413745.3001   149.716794    5.33126
> microMaskLaneIsSetInt128_var     ops/ms  41452.48893   76.143208     206845.9754   59.371129     4.989953
> microMaskLaneIsSetInt64_var      ops/ms  77726.2542    173.180518    413427.8838   363.575023    5.319024
> microMaskLaneIsSetLong128_var    ops/ms  77646.11218   177.496587    413403.4404   236.609314    5.3242
> microMaskLaneIsSetShort128_var   ops/ms  21374.93265   48.13101      103417.4618   34.827021     4.838259
> microMaskLaneIsSetShort64_var    ops/ms  41066.19395   353.320621    206801.109    106.408938    5.035799
>
> Benchmarks on Intel 6444y machine with 512-bit avx3:
>
> Benchmark                        Unit    Before Score  Before Error  After Score   After Error   Uplift
> microMaskLaneIsSetByte128_var    ops/ms  57658.45497   240.209309    211643.8406   29.214532     3.670647
> microMaskLaneIsSetByte256_var    ops/ms  57451.68169   116.994128    211609.4652   160.48513     3.683259
> microMaskLaneIsSetByte512_var    ops/ms  57530.22411   311.63868     199802.8084   408.144015    3.473005
> microMaskLaneIsSetByte64_var     ops/ms  57642.2672    161.406221    205252.4464   196.86852     3.560797
> microMaskLaneIsSetDouble256_var  ops/ms  114401.3789   231.797375    361400.344    565.593984    3.159055
> microMaskLaneIsSetDouble512_var  ops/ms  57379.27882   159.699503    211476.1138   136.980026    3.685583
> microMaskLaneIsSetFloat128_var   ops/ms  113943.9512   141.062663    360855.3915   494.471996    3.166955
> microMaskLaneIsSetFloat256_var   ops/ms  57682.78182   138.142053    211659.5098   30.167972     3.66937
> microMaskLaneIsSetFloat512_var   ops/ms  57617.66405   301.748599    211246.8588   597.18949     3.666355
> microMaskLaneIsSetInt128_var     ops/ms  113914.5062   118.681382    360856.4465   555.097397    3.167783
> microMaskLaneIsSetInt256_var     ops/ms  57681.79883   112.391639    211555.6742   217.556981    3.667633
> microMaskLaneIsSetInt512_var     ops/ms  57350.20346   206.146723    211657.7207   68.461571     3.690618
> microMaskLane...

@erifan This is a regression / bug fix for https://github.com/openjdk/jdk/pull/25673, right? If so, please convert the JBS issue into a bug. ------------- PR Comment: https://git.openjdk.org/jdk/pull/27113#issuecomment-3269564806 From epeter at openjdk.org Tue Sep 9 08:51:21 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 9 Sep 2025 08:51:21 GMT Subject: RFR: 8366588: VectorAPI: Re-intrinsify VectorMask.laneIsSet where the input index is a variable In-Reply-To: References: Message-ID: On Fri, 5 Sep 2025 08:13:28 GMT, erifan wrote: > Intrinsic support for `VectorMask.laneIsSet` with a **variable** input index was introduced in PR #14200, but was inadvertently broken by PR #25673. This PR restores the intrinsic functionality and adds some JTReg tests. 
>
> Benchmarks on Nvidia Grace machine with 128-bit SVE:
>
> Benchmark                        Unit    Before Score  Before Error  After Score   After Error   Uplift
> microMaskLaneIsSetByte128_var    ops/ms  21702.14415   91.902159     103472.9391   36.057447     4.767867
> microMaskLaneIsSetByte64_var     ops/ms  21468.51868   107.94177     103365.6561   69.47736      4.814754
> microMaskLaneIsSetDouble128_var  ops/ms  77489.32791   153.242699    413499.4127   311.854079    5.336211
> microMaskLaneIsSetFloat128_var   ops/ms  41034.95204   399.421823    206840.0988   74.702234     5.040583
> microMaskLaneIsSetFloat64_var    ops/ms  77607.40268   175.938921    413745.3001   149.716794    5.33126
> microMaskLaneIsSetInt128_var     ops/ms  41452.48893   76.143208     206845.9754   59.371129     4.989953
> microMaskLaneIsSetInt64_var      ops/ms  77726.2542    173.180518    413427.8838   363.575023    5.319024
> microMaskLaneIsSetLong128_var    ops/ms  77646.11218   177.496587    413403.4404   236.609314    5.3242
> microMaskLaneIsSetShort128_var   ops/ms  21374.93265   48.13101      103417.4618   34.827021     4.838259
> microMaskLaneIsSetShort64_var    ops/ms  41066.19395   353.320621    206801.109    106.408938    5.035799
>
> Benchmarks on Intel 6444y machine with 512-bit avx3:
>
> Benchmark                        Unit    Before Score  Before Error  After Score   After Error   Uplift
> microMaskLaneIsSetByte128_var    ops/ms  57658.45497   240.209309    211643.8406   29.214532     3.670647
> microMaskLaneIsSetByte256_var    ops/ms  57451.68169   116.994128    211609.4652   160.48513     3.683259
> microMaskLaneIsSetByte512_var    ops/ms  57530.22411   311.63868     199802.8084   408.144015    3.473005
> microMaskLaneIsSetByte64_var     ops/ms  57642.2672    161.406221    205252.4464   196.86852     3.560797
> microMaskLaneIsSetDouble256_var  ops/ms  114401.3789   231.797375    361400.344    565.593984    3.159055
> microMaskLaneIsSetDouble512_var  ops/ms  57379.27882   159.699503    211476.1138   136.980026    3.685583
> microMaskLaneIsSetFloat128_var   ops/ms  113943.9512   141.062663    360855.3915   494.471996    3.166955
> microMaskLaneIsSetFloat256_var   ops/ms  57682.78182   138.142053    211659.5098   30.167972     3.66937
> microMaskLaneIsSetFloat512_var   ops/ms  57617.66405   301.748599    211246.8588   597.18949     3.666355
> microMaskLaneIsSetInt128_var     ops/ms  113914.5062   118.681382    360856.4465   555.097397    3.167783
> microMaskLaneIsSetInt256_var     ops/ms  57681.79883   112.391639    211555.6742   217.556981    3.667633
> microMaskLaneIsSetInt512_var     ops/ms  57350.20346   206.146723    211657.7207   68.461571     3.690618
> microMaskLane...

The patch looks reasonable, thanks for fixing this and writing an IR test! I'm launching some internal testing now, should hopefully not take much more than 24h. ------------- PR Comment: https://git.openjdk.org/jdk/pull/27113#issuecomment-3269585613 From epeter at openjdk.org Tue Sep 9 08:56:40 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 9 Sep 2025 08:56:40 GMT Subject: RFR: 8366875: CompileTaskTimeout should be reset for each iteration of RepeatCompilation In-Reply-To: <4TbOkAMu-KU_tgQPg1sK0L8oto_0nD4mQo7yc0hJPm4=.8d87b900-a614-4c13-a4c6-6fe11e206482@github.com> References: <4TbOkAMu-KU_tgQPg1sK0L8oto_0nD4mQo7yc0hJPm4=.8d87b900-a614-4c13-a4c6-6fe11e206482@github.com> Message-ID: On Fri, 5 Sep 2025 15:27:22 GMT, Manuel Hässig wrote: > When running a debug JVM on Linux with a compile task timeout and repeated compilation, the execution will almost always time out because the timeout does not reset for repetitions of a compilation. The core of the compile task timeout is to limit the amount of time a single compilation can take. Thus, this PR resets the `CompileTaskTimeout` for every compilation when running with `-XX:RepeatCompilation=n` for n > 1. > > This PR is stacked on top of #27094. > > Testing: > - [x] Github Actions (failures are unrelated) > - [x] tier1, tier2, tier3 plus some additional internal testing Looks reasonable :) ------------- Marked as reviewed by epeter (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/27120#pullrequestreview-3200195978 From epeter at openjdk.org Tue Sep 9 09:02:00 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 9 Sep 2025 09:02:00 GMT Subject: RFR: 8366971: C2: Remove unused nop_list from PhaseOutput::init_buffer In-Reply-To: <2hBEO9Zpoy2wo_pgTXE9v8KG5u1HNdKp3RgQE-4HYcE=.e86088d1-1e25-49b5-9b3c-c2498ec6ca48@github.com> References: <2hBEO9Zpoy2wo_pgTXE9v8KG5u1HNdKp3RgQE-4HYcE=.e86088d1-1e25-49b5-9b3c-c2498ec6ca48@github.com> Message-ID: On Fri, 5 Sep 2025 13:02:00 GMT, Daniel Jeliński wrote: > The nop list has never been used in the history of OpenJDK. Let's clean it up. > > Tested with Mach5 tier 1-5, no related failures. Approved. (assuming you run the additional stress testing I asked for over slack) ------------- Marked as reviewed by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/27117#pullrequestreview-3200229837 From epeter at openjdk.org Tue Sep 9 09:02:03 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 9 Sep 2025 09:02:03 GMT Subject: RFR: 8366971: C2: Remove unused nop_list from PhaseOutput::init_buffer In-Reply-To: References: <2hBEO9Zpoy2wo_pgTXE9v8KG5u1HNdKp3RgQE-4HYcE=.e86088d1-1e25-49b5-9b3c-c2498ec6ca48@github.com> Message-ID: On Tue, 9 Sep 2025 08:25:41 GMT, Daniel Jeliński wrote: >> src/hotspot/cpu/ppc/ppc.ad line 4926: >> >>> 4924: // Unused, list one so that array generated by adlc is not empty. >>> 4925: // Aix compiler chokes if _nop_count = 0. >>> 4926: nops(fxNop); >> >> There seems to be some justification here why we needed to have the list. >> Can you quickly say why we should not be worried about that now? 
;) > > I don't have the AIX compiler at hand, but based on the comment I'd guess that the AIX compiler errored out either on [this](https://github.com/openjdk/jdk/blob/b1fa1ecc988fb07f191892a459625c2c8f2de3b5/src/hotspot/share/opto/output.cpp#L1403) or on [this](https://github.com/openjdk/jdk/blob/91f12600d2b188ca98c5c575a34b85f5835399a0/src/hotspot/share/adlc/output_h.cpp#L1122). Both these lines are removed in this PR. Ok, sounds good! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27117#discussion_r2332606173 From epeter at openjdk.org Tue Sep 9 09:14:05 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 9 Sep 2025 09:14:05 GMT Subject: RFR: 8366984: Remove delay slot support In-Reply-To: References: Message-ID: On Fri, 5 Sep 2025 14:24:50 GMT, Daniel Jeliński wrote: > SPARC was the only supported architecture that uses a delay slot. The SPARC port was removed in JDK 15, and the code is effectively dead. Let's remove it. > > The changes are no-op on all architectures that do not use delay slots. I still tested tier 1-5 on mach5, no related failures. Looks reasonable, thanks for doing the cleanup! I have 2 minor questions though. (please also run additional stress testing, see slack) src/hotspot/cpu/arm/arm.ad line 3383: > 3381: BR : R; > 3382: %} > 3383: Where was this used? Or is it an unrelated cleanup? src/hotspot/share/adlc/adlparse.cpp line 1394: > 1392: parse_err(SYNERR, "Using obsolete token, branch_has_delay_slot"); > 1393: break; > 1394: } I'm curious: why do you add that special warning? It would fail later anyway, right? Are we expecting anyone to parse things produced by different versions? ------------- Marked as reviewed by epeter (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/27119#pullrequestreview-3200246647 PR Review Comment: https://git.openjdk.org/jdk/pull/27119#discussion_r2332626258 PR Review Comment: https://git.openjdk.org/jdk/pull/27119#discussion_r2332620923 From epeter at openjdk.org Tue Sep 9 09:17:14 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 9 Sep 2025 09:17:14 GMT Subject: RFR: 8363989: AArch64: Add missing backend support of VectorAPI expand operation In-Reply-To: References: Message-ID: On Wed, 3 Sep 2025 10:11:38 GMT, erifan wrote: >> The algorithm description here is great. Please paste all of it from "Since there are" to "but with different instructions where appropriate." into this PR, before the vector expand implementation. > > @theRealAph @e1iu @XiaohongGong @fg1417 @shqking, could you help take a look at this PR, thanks~ @erifan Feel free to ping me again if I should re-review. I'm going on vacation in a week, so I'll be unresponsive for a while (feel free to contact other reviewers, especially for additional testing). ------------- PR Comment: https://git.openjdk.org/jdk/pull/26740#issuecomment-3269684547 From xgong at openjdk.org Tue Sep 9 09:17:14 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 9 Sep 2025 09:17:14 GMT Subject: RFR: 8366588: VectorAPI: Re-intrinsify VectorMask.laneIsSet where the input index is a variable In-Reply-To: References: Message-ID: On Fri, 5 Sep 2025 08:13:28 GMT, erifan wrote: > Intrinsic support for `VectorMask.laneIsSet` with a **variable** input index was introduced in PR #14200, but was inadvertently broken by PR #25673. This PR restores the intrinsic functionality and adds some JTReg tests. 
> > Benchmarks on Nvidia Grace machine with 128-bit SVE: > > Benchmark Unit Before Score Error After Score Error Uplift > microMaskLaneIsSetByte128_var ops/ms 21702.14415 91.902159 103472.9391 36.057447 4.767867 > microMaskLaneIsSetByte64_var ops/ms 21468.51868 107.94177 103365.6561 69.47736 4.814754 > microMaskLaneIsSetDouble128_var ops/ms 77489.32791 153.242699 413499.4127 311.854079 5.336211 > microMaskLaneIsSetFloat128_var ops/ms 41034.95204 399.421823 206840.0988 74.702234 5.040583 > microMaskLaneIsSetFloat64_var ops/ms 77607.40268 175.938921 413745.3001 149.716794 5.33126 > microMaskLaneIsSetInt128_var ops/ms 41452.48893 76.143208 206845.9754 59.371129 4.989953 > microMaskLaneIsSetInt64_var ops/ms 77726.2542 173.180518 413427.8838 363.575023 5.319024 > microMaskLaneIsSetLong128_var ops/ms 77646.11218 177.496587 413403.4404 236.609314 5.3242 > microMaskLaneIsSetShort128_var ops/ms 21374.93265 48.13101 103417.4618 34.827021 4.838259 > microMaskLaneIsSetShort64_var ops/ms 41066.19395 353.320621 206801.109 106.408938 5.035799 > > > Benchmarks on Intel 6444y machine with 512-bit avx3: > > Benchmark Unit Before Score Error After Score Error Uplift > microMaskLaneIsSetByte128_var ops/ms 57658.45497 240.209309 211643.8406 29.214532 3.670647 > microMaskLaneIsSetByte256_var ops/ms 57451.68169 116.994128 211609.4652 160.48513 3.683259 > microMaskLaneIsSetByte512_var ops/ms 57530.22411 311.63868 199802.8084 408.144015 3.473005 > microMaskLaneIsSetByte64_var ops/ms 57642.2672 161.406221 205252.4464 196.86852 3.560797 > microMaskLaneIsSetDouble256_var ops/ms 114401.3789 231.797375 361400.344 565.593984 3.159055 > microMaskLaneIsSetDouble512_var ops/ms 57379.27882 159.699503 211476.1138 136.980026 3.685583 > microMaskLaneIsSetFloat128_var ops/ms 113943.9512 141.062663 360855.3915 494.471996 3.166955 > microMaskLaneIsSetFloat256_var ops/ms 57682.78182 138.142053 211659.5098 30.167972 3.66937 > microMaskLaneIsSetFloat512_var ops/ms 57617.66405 301.748599 211246.8588 597.18949 
3.666355 > microMaskLaneIsSetInt128_var ops/ms 113914.5062 118.681382 360856.4465 555.097397 3.167783 > microMaskLaneIsSetInt256_var ops/ms 57681.79883 112.391639 211555.6742 217.556981 3.667633 > microMaskLaneIsSetInt512_var ops/ms 57350.20346 206.146723 211657.7207 68.461571 3.690618 > microMaskLane... > @erifan This is a regression / bug fix for #25673, right? If so, please convert the JBS issue into a bug. Thanks for your review! I've changed the JBS type to bug. ------------- PR Comment: https://git.openjdk.org/jdk/pull/27113#issuecomment-3269685052 From djelinski at openjdk.org Tue Sep 9 09:18:21 2025 From: djelinski at openjdk.org (Daniel Jeliński) Date: Tue, 9 Sep 2025 09:18:21 GMT Subject: RFR: 8366984: Remove delay slot support In-Reply-To: References: Message-ID: On Tue, 9 Sep 2025 09:03:18 GMT, Emanuel Peter wrote: >> SPARC was the only supported architecture that uses a delay slot. The SPARC port was removed in JDK 15, and the code is effectively dead. Let's remove it. >> >> The changes are no-op on all architectures that do not use delay slots. I still tested tier 1-5 on mach5, no related failures. > > src/hotspot/share/adlc/adlparse.cpp line 1394: > >> 1392: parse_err(SYNERR, "Using obsolete token, branch_has_delay_slot"); >> 1393: break; >> 1394: } > > I'm curious: why do you add that special warning? It would fail later anyway, right? Are we expecting anyone to parse things produced by different versions? I took my inspiration from earlier work on adlc (see 6e35bcbf038cec0210c38428a8e1c233e102911a or 3f9c8a39201644952c6d07b97695a5a7ef918622), but I don't mind removing these warnings and the related code block entirely.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27119#discussion_r2332667364 From epeter at openjdk.org Tue Sep 9 09:24:45 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 9 Sep 2025 09:24:45 GMT Subject: RFR: 8356779: IGV: dump the index of the SafePointNode containing the current JVMS during parsing In-Reply-To: References: Message-ID: On Thu, 4 Sep 2025 05:22:00 GMT, Saranya Natarajan wrote: > This PR prints index of the SafePointNode containing the current JVMS during parsing in IGV. As stated in JBS the reason for this is that there are a lot of nodes during parsing, it would be nice to know what are the current nodes in the local slots or in the stack when looking at a graph. Looks reasonable. @merykitty first proposed this, so would be good if he took a look too :) Just out of curiosity: could you show a before/after igv screenshot? ------------- Marked as reviewed by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/27083#pullrequestreview-3200350110 PR Comment: https://git.openjdk.org/jdk/pull/27083#issuecomment-3269709657 From duke at openjdk.org Tue Sep 9 09:26:04 2025 From: duke at openjdk.org (erifan) Date: Tue, 9 Sep 2025 09:26:04 GMT Subject: RFR: 8363989: AArch64: Add missing backend support of VectorAPI expand operation In-Reply-To: References: Message-ID: On Wed, 3 Sep 2025 10:11:38 GMT, erifan wrote: >> The algorithm description here is great. Please paste all of it from "Since there are" to "but with different instructions where appropriate." into this PR, before the vector expand implementation. > > @theRealAph @e1iu @XiaohongGong @fg1417 @shqking, could you help take a look at this PR, thanks~ > @erifan Feel free to ping me again if I should re-review. I'm going on vacation in a week, so I'll be unresponsive for a while (feel free to contact other reviewers, especially for additional testing). Thanks @eme64 , I have made the corresponding changes according to your suggestions. 
Please help take another look. Thank you! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26740#issuecomment-3269712905 From djelinski at openjdk.org Tue Sep 9 09:26:04 2025 From: djelinski at openjdk.org (Daniel Jeliński) Date: Tue, 9 Sep 2025 09:26:04 GMT Subject: RFR: 8366984: Remove delay slot support In-Reply-To: References: Message-ID: On Tue, 9 Sep 2025 09:04:48 GMT, Emanuel Peter wrote: >> SPARC was the only supported architecture that uses a delay slot. The SPARC port was removed in JDK 15, and the code is effectively dead. Let's remove it. >> >> The changes are no-op on all architectures that do not use delay slots. I still tested tier 1-5 on mach5, no related failures. > > src/hotspot/cpu/arm/arm.ad line 3383: > >> 3381: BR : R; >> 3382: %} >> 3383: > > Where was this used? Or is it an unrelated cleanup? Removing the comment alone didn't feel quite right, so I removed the following block as well. The block appears to be unused. It was copy-pasted from [SPARC](https://github.com/openjdk/jdk/blob/8153779ad32d1e8ddd37ced826c76c7aafc61894/hotspot/src/cpu/sparc/vm/sparc.ad#L4984), where it was also unused. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27119#discussion_r2332702728 From duke at openjdk.org Tue Sep 9 09:29:24 2025 From: duke at openjdk.org (erifan) Date: Tue, 9 Sep 2025 09:29:24 GMT Subject: RFR: 8363989: AArch64: Add missing backend support of VectorAPI expand operation In-Reply-To: References: Message-ID: On Wed, 20 Aug 2025 11:27:59 GMT, Andrew Haley wrote: >> Currently, on AArch64, the VectorAPI `expand` operation is intrinsified for 32-bit and 64-bit types only when SVE2 is available. In the following cases, `expand` has not yet been intrinsified: >> 1. **Subword types** on SVE2-capable hardware. >> 2. **All types** on NEON and SVE1 environments. >> >> As a result, `expand` API performance is very poor in these scenarios. This patch intrinsifies the `expand` operation in the above environments.
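For readers who want to trace the prefix-sum plus `TBL` scheme that the quoted description goes on to walk through, here is a minimal scalar sketch in Java. It is only a model of the data flow, not the AArch64 code from the PR: the class `ExpandSketch` and the helpers `shiftLanesUp`, `expandIndices` and `expand` are hypothetical names introduced for illustration, lanes are modeled as array elements with index 0 as the lowest lane, and a zero lane is shown as the character `'0'` to match the notation in the description.

```java
// Scalar model of "expand via prefix sum + table lookup" (illustrative only).
class ExpandSketch {
    // Models a whole-register left shift by `lanes` lanes: each value moves
    // toward higher lane indices and the vacated low lanes are zero-filled.
    static int[] shiftLanesUp(int[] v, int lanes) {
        int[] r = new int[v.length];
        for (int i = lanes; i < v.length; i++) {
            r[i] = v[i - lanes];
        }
        return r;
    }

    // Step 1: compute the table-lookup index for every lane.
    // Active lanes get (inclusive prefix sum of the mask) - 1, inactive lanes get -1.
    static int[] expandIndices(boolean[] mask) {
        int n = mask.length;
        int[] tmp2 = new int[n];
        for (int i = 0; i < n; i++) {
            tmp2[i] = mask[i] ? 1 : 0;
        }
        // Log-step prefix sum: shift by 1, 2, 4, ... lanes and accumulate.
        for (int shift = 1; shift < n; shift <<= 1) {
            int[] dst = shiftLanesUp(tmp2, shift);
            for (int i = 0; i < n; i++) {
                tmp2[i] += dst[i];
            }
        }
        int[] idx = new int[n];
        for (int i = 0; i < n; i++) {
            idx[i] = (mask[i] ? tmp2[i] : 0) - 1; // clear inactive lanes, then subtract 1
        }
        return idx;
    }

    // Step 2: table lookup. An out-of-range index (here -1) yields a zero lane,
    // which is how the description treats inactive lanes.
    static char[] expand(char[] src, boolean[] mask) {
        int[] idx = expandIndices(mask);
        char[] dst = new char[src.length];
        for (int i = 0; i < dst.length; i++) {
            boolean inRange = idx[i] >= 0 && idx[i] < src.length;
            dst[i] = inRange ? src[idx[i]] : '0';
        }
        return dst;
    }

    public static void main(String[] args) {
        char[] src = "abcdefghijklmnop".toCharArray(); // lane 0 = 'a' (lowest lane)
        boolean[] mask = new boolean[16];
        for (int i = 0; i < 16; i++) {
            mask[i] = (i % 4) < 2; // lanes 0,1 active, lanes 2,3 inactive, repeating
        }
        // Printed from low lane to high lane, i.e. reversed relative to the
        // "high <== low" layout used in the description.
        System.out.println(new String(expand(src, mask))); // prints "ab00cd00ef00gh00"
    }
}
```

The log-step loop needs only four shift-and-add rounds for sixteen byte lanes, which is why the description shifts by 8, 16, 32 and 64 bits rather than accumulating lane by lane.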
>> >> Since there are no native instructions directly corresponding to `expand` in these cases, this patch mainly leverages the `TBL` instruction to implement `expand`. To compute the index input for `TBL`, the prefix sum algorithm (see https://en.wikipedia.org/wiki/Prefix_sum) is used. Take a 128-bit byte vector on SVE2 as an example: >> >> To compute: dst = src.expand(mask) >> Data direction: high <== low >> Input: >> src = p o n m l k j i h g f e d c b a >> mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 >> Expected result: >> dst = 0 0 h g 0 0 f e 0 0 d c 0 0 b a >> >> Step 1: calculate the index input of the TBL instruction. >> >> // Set tmp1 as all 0 vector. >> tmp1 = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 >> >> // Move the mask bits from the predicate register to a vector register. >> // **1-bit** mask lane of P register to **8-bit** mask lane of V register. >> tmp2 = mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 >> >> // Shift the entire register. Prefix sum algorithm. >> dst = tmp2 << 8 = 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 >> tmp2 += dst = 0 1 2 1 0 1 2 1 0 1 2 1 0 1 2 1 >> >> dst = tmp2 << 16 = 2 1 0 1 2 1 0 1 2 1 0 1 2 1 0 0 >> tmp2 += dst = 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 >> >> dst = tmp2 << 32 = 2 2 2 2 2 2 2 2 2 2 2 1 0 0 0 0 >> tmp2 += dst = 4 4 4 4 4 4 4 4 4 4 4 3 2 2 2 1 >> >> dst = tmp2 << 64 = 4 4 4 3 2 2 2 1 0 0 0 0 0 0 0 0 >> tmp2 += dst = 8 8 8 7 6 6 6 5 4 4 4 3 2 2 2 1 >> >> // Clear inactive elements. >> dst = sel(mask, tmp2, tmp1) = 0 0 8 7 0 0 6 5 0 0 4 3 0 0 2 1 >> >> // Set the inactive lane value to -1 and set the active lane to the target index. >> dst -= 1 = -1 -1 7 6 -1 -1 5 4 -1 -1 3 2 -1 -1 1 0 >> >> Step 2: shuffle the source vector elements to the target vector >> >> tbl(dst, src, dst) = 0 0 h g 0 0 f e 0 0 d c 0 0 b a >> >> >> The same algorithm is used for NEON and... > > The algorithm description here is great. Please paste all of it from "Since there are" to "but with different instructions where appropriate." 
into this PR, before the vector expand implementation. @theRealAph @e1iu could you help take another look of this PR, thanks ! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26740#issuecomment-3269731241 From xgong at openjdk.org Tue Sep 9 09:33:34 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 9 Sep 2025 09:33:34 GMT Subject: RFR: 8366588: VectorAPI: Re-intrinsify VectorMask.laneIsSet where the input index is a variable In-Reply-To: References: Message-ID: On Fri, 5 Sep 2025 08:13:28 GMT, erifan wrote: > Intrinsic support for `VectorMask.laneIsSet` with a **variable** input index was introduced in PR #14200, but was inadvertently broken by PR #25673. This PR restores the intrinsic functionality and adds some JTReg tests. > > Benchmarks on Nvidia Grace machine with 128-bit SVE: > > Benchmark Unit Before Score Error After Score Error Uplift > microMaskLaneIsSetByte128_var ops/ms 21702.14415 91.902159 103472.9391 36.057447 4.767867 > microMaskLaneIsSetByte64_var ops/ms 21468.51868 107.94177 103365.6561 69.47736 4.814754 > microMaskLaneIsSetDouble128_var ops/ms 77489.32791 153.242699 413499.4127 311.854079 5.336211 > microMaskLaneIsSetFloat128_var ops/ms 41034.95204 399.421823 206840.0988 74.702234 5.040583 > microMaskLaneIsSetFloat64_var ops/ms 77607.40268 175.938921 413745.3001 149.716794 5.33126 > microMaskLaneIsSetInt128_var ops/ms 41452.48893 76.143208 206845.9754 59.371129 4.989953 > microMaskLaneIsSetInt64_var ops/ms 77726.2542 173.180518 413427.8838 363.575023 5.319024 > microMaskLaneIsSetLong128_var ops/ms 77646.11218 177.496587 413403.4404 236.609314 5.3242 > microMaskLaneIsSetShort128_var ops/ms 21374.93265 48.13101 103417.4618 34.827021 4.838259 > microMaskLaneIsSetShort64_var ops/ms 41066.19395 353.320621 206801.109 106.408938 5.035799 > > > Benchmarks on Intel 6444y machine with 512-bit avx3: > > Benchmark Unit Before Score Error After Score Error Uplift > microMaskLaneIsSetByte128_var ops/ms 57658.45497 240.209309 211643.8406 
29.214532 3.670647 > microMaskLaneIsSetByte256_var ops/ms 57451.68169 116.994128 211609.4652 160.48513 3.683259 > microMaskLaneIsSetByte512_var ops/ms 57530.22411 311.63868 199802.8084 408.144015 3.473005 > microMaskLaneIsSetByte64_var ops/ms 57642.2672 161.406221 205252.4464 196.86852 3.560797 > microMaskLaneIsSetDouble256_var ops/ms 114401.3789 231.797375 361400.344 565.593984 3.159055 > microMaskLaneIsSetDouble512_var ops/ms 57379.27882 159.699503 211476.1138 136.980026 3.685583 > microMaskLaneIsSetFloat128_var ops/ms 113943.9512 141.062663 360855.3915 494.471996 3.166955 > microMaskLaneIsSetFloat256_var ops/ms 57682.78182 138.142053 211659.5098 30.167972 3.66937 > microMaskLaneIsSetFloat512_var ops/ms 57617.66405 301.748599 211246.8588 597.18949 3.666355 > microMaskLaneIsSetInt128_var ops/ms 113914.5062 118.681382 360856.4465 555.097397 3.167783 > microMaskLaneIsSetInt256_var ops/ms 57681.79883 112.391639 211555.6742 217.556981 3.667633 > microMaskLaneIsSetInt512_var ops/ms 57350.20346 206.146723 211657.7207 68.461571 3.690618 > microMaskLane... LGTM! ------------- Marked as reviewed by xgong (Committer). PR Review: https://git.openjdk.org/jdk/pull/27113#pullrequestreview-3200414898 From duke at openjdk.org Tue Sep 9 09:33:35 2025 From: duke at openjdk.org (erifan) Date: Tue, 9 Sep 2025 09:33:35 GMT Subject: RFR: 8366588: VectorAPI: Re-intrinsify VectorMask.laneIsSet where the input index is a variable In-Reply-To: References: Message-ID: On Tue, 9 Sep 2025 08:48:42 GMT, Emanuel Peter wrote: >> Intrinsic support for `VectorMask.laneIsSet` with a **variable** input index was introduced in PR #14200, but was inadvertently broken by PR #25673. This PR restores the intrinsic functionality and adds some JTReg tests. 
>> >> Benchmarks on Nvidia Grace machine with 128-bit SVE: >> >> Benchmark Unit Before Score Error After Score Error Uplift >> microMaskLaneIsSetByte128_var ops/ms 21702.14415 91.902159 103472.9391 36.057447 4.767867 >> microMaskLaneIsSetByte64_var ops/ms 21468.51868 107.94177 103365.6561 69.47736 4.814754 >> microMaskLaneIsSetDouble128_var ops/ms 77489.32791 153.242699 413499.4127 311.854079 5.336211 >> microMaskLaneIsSetFloat128_var ops/ms 41034.95204 399.421823 206840.0988 74.702234 5.040583 >> microMaskLaneIsSetFloat64_var ops/ms 77607.40268 175.938921 413745.3001 149.716794 5.33126 >> microMaskLaneIsSetInt128_var ops/ms 41452.48893 76.143208 206845.9754 59.371129 4.989953 >> microMaskLaneIsSetInt64_var ops/ms 77726.2542 173.180518 413427.8838 363.575023 5.319024 >> microMaskLaneIsSetLong128_var ops/ms 77646.11218 177.496587 413403.4404 236.609314 5.3242 >> microMaskLaneIsSetShort128_var ops/ms 21374.93265 48.13101 103417.4618 34.827021 4.838259 >> microMaskLaneIsSetShort64_var ops/ms 41066.19395 353.320621 206801.109 106.408938 5.035799 >> >> >> Benchmarks on Intel 6444y machine with 512-bit avx3: >> >> Benchmark Unit Before Score Error After Score Error Uplift >> microMaskLaneIsSetByte128_var ops/ms 57658.45497 240.209309 211643.8406 29.214532 3.670647 >> microMaskLaneIsSetByte256_var ops/ms 57451.68169 116.994128 211609.4652 160.48513 3.683259 >> microMaskLaneIsSetByte512_var ops/ms 57530.22411 311.63868 199802.8084 408.144015 3.473005 >> microMaskLaneIsSetByte64_var ops/ms 57642.2672 161.406221 205252.4464 196.86852 3.560797 >> microMaskLaneIsSetDouble256_var ops/ms 114401.3789 231.797375 361400.344 565.593984 3.159055 >> microMaskLaneIsSetDouble512_var ops/ms 57379.27882 159.699503 211476.1138 136.980026 3.685583 >> microMaskLaneIsSetFloat128_var ops/ms 113943.9512 141.062663 360855.3915 494.471996 3.166955 >> microMaskLaneIsSetFloat256_var ops/ms 57682.78182 138.142053 211659.5098 30.167972 3.66937 >> microMaskLaneIsSetFloat512_var ops/ms 57617.66405 
301.748599 211246.8588 597.18949 3.666355 >> microMaskLaneIsSetInt128_var ops/ms 113914.5062 118.681382 360856.4465 555.097397 3.167783 >> microMaskLaneIsSetInt256_var ops/ms 57681.79883 112.391639 211555.6742 217.556981 3.667633 >> microMaskLaneIsSetInt512_var ops/ms 573... > > The patch looks reasonable, thanks for fixing this and writing an IR test! > I'm launching some internal testing now, should hopefully not take much more than 24h. Thanks for your help @eme64 @XiaohongGong @shipilev ------------- PR Comment: https://git.openjdk.org/jdk/pull/27113#issuecomment-3269744719 From mchevalier at openjdk.org Tue Sep 9 09:39:49 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Tue, 9 Sep 2025 09:39:49 GMT Subject: RFR: 8367135: Test compiler/loopstripmining/CheckLoopStripMining.java needs internal timeouts adjusted Message-ID: As described, adjust timeout to be as it implicitly used to be. Thanks, Marc ------------- Commit messages: - Explicit * 4, but literal Changes: https://git.openjdk.org/jdk/pull/27167/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27167&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8367135 Stats: 4 lines in 1 file changed: 0 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/27167.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27167/head:pull/27167 PR: https://git.openjdk.org/jdk/pull/27167 From mchevalier at openjdk.org Tue Sep 9 09:55:05 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Tue, 9 Sep 2025 09:55:05 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v5] In-Reply-To: References: Message-ID: On Mon, 8 Sep 2025 14:58:33 GMT, Emanuel Peter wrote: >> Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: >> >> One more ResourceMark > > src/hotspot/share/opto/compile.cpp line 702: > >> 700: , >> 701: _in_dump_cnt(0), >> 702: _invariant_checker(GraphInvariantChecker::make_default()) > > How does this 
interface with `ResourceMarks`? > Because it is now resource allocated. And so is the `_checks`. > How does this not trip the nesting asserts of allocation there? > I'm probably missing something here. > > I would have expected that we need to allocate it from the `_comp_arena`. I guess I can put it in the comp arena if it's better, but I don't see why there would be a problem. These things are under the ResourceMark in the block where the `Compile` object is created (in `C2Compiler::compile_method`), and deleted at the end of that. In between, all the other structures used by the invariant checkers have been created in nested ResourceMarks, and freed before the surrounding one is deleted. Nesting does indeed seem to be respected. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2332826049 From djelinski at openjdk.org Tue Sep 9 10:13:44 2025 From: djelinski at openjdk.org (Daniel Jeliński) Date: Tue, 9 Sep 2025 10:13:44 GMT Subject: RFR: 8366971: C2: Remove unused nop_list from PhaseOutput::init_buffer [v2] In-Reply-To: <2hBEO9Zpoy2wo_pgTXE9v8KG5u1HNdKp3RgQE-4HYcE=.e86088d1-1e25-49b5-9b3c-c2498ec6ca48@github.com> References: <2hBEO9Zpoy2wo_pgTXE9v8KG5u1HNdKp3RgQE-4HYcE=.e86088d1-1e25-49b5-9b3c-c2498ec6ca48@github.com> Message-ID: > The nop list has never been used in the history of OpenJDK. Let's clean it up. > > Tested with Mach5 tier 1-5, no related failures. Daniel Jeliński has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains four commits: - Merge remote-tracking branch 'origin/master' into nops-cleanup - Update copyright - Remove outdated comment - Remove nop list ------------- Changes: https://git.openjdk.org/jdk/pull/27117/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27117&range=01 Stats: 83 lines in 11 files changed: 1 ins; 77 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/27117.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27117/head:pull/27117 PR: https://git.openjdk.org/jdk/pull/27117 From thartmann at openjdk.org Tue Sep 9 10:23:53 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 9 Sep 2025 10:23:53 GMT Subject: RFR: 8367135: Test compiler/loopstripmining/CheckLoopStripMining.java needs internal timeouts adjusted In-Reply-To: References: Message-ID: On Tue, 9 Sep 2025 09:31:24 GMT, Marc Chevalier wrote: > As described, adjust timeout to be as it implicitly used to be. > > Thanks, > Marc Thanks for fixing this. Looks good and trivial. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/27167#pullrequestreview-3200720790 From mchevalier at openjdk.org Tue Sep 9 10:41:32 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Tue, 9 Sep 2025 10:41:32 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v5] In-Reply-To: <8F_IhYAZ2XxKl9SzWYNYkGvXzKEj1rl8GsRFrORBWaE=.4bd4bd61-f01d-480b-86b1-e65bbf61b065@github.com> References: <8F_IhYAZ2XxKl9SzWYNYkGvXzKEj1rl8GsRFrORBWaE=.4bd4bd61-f01d-480b-86b1-e65bbf61b065@github.com> Message-ID: On Tue, 9 Sep 2025 08:29:51 GMT, Emanuel Peter wrote: >> I thought about that and I think the current situation is ok. The pattern is not something highly mutable, it's mostly some hardcoded thing. I don't think it's hard to figure out what you're expecting. I'm very reluctant to add some ugliness to the patterns who must stay readable, to be easy to verify by a human. 
It could be solved with more templates though. > > Once we have more complex patterns, will it really be that easy to see what was expected? > All you will see is what we actually got. You are already all about good reporting, so I just noticed a hole here. > You know the code better, so I'll leave it up to you in the end ;) That is not quite true! We will also print the path from the center, so we know how we arrived at the point that has an unexpected type. We can either follow the pattern along that path, or use our general knowledge of the IR, to see that something looks wrong in the displayed part. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2333006921 From epeter at openjdk.org Tue Sep 9 10:58:41 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 9 Sep 2025 10:58:41 GMT Subject: RFR: 8366971: C2: Remove unused nop_list from PhaseOutput::init_buffer [v2] In-Reply-To: References: <2hBEO9Zpoy2wo_pgTXE9v8KG5u1HNdKp3RgQE-4HYcE=.e86088d1-1e25-49b5-9b3c-c2498ec6ca48@github.com> Message-ID: On Tue, 9 Sep 2025 10:13:44 GMT, Daniel Jeliński wrote: >> The nop list has never been used in the history of OpenJDK. Let's clean it up. >> >> Tested with Mach5 tier 1-5, no related failures. > > Daniel Jeliński has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits: > > - Merge remote-tracking branch 'origin/master' into nops-cleanup > - Update copyright > - Remove outdated comment > - Remove nop list Marked as reviewed by epeter (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/27117#pullrequestreview-3200894301 From epeter at openjdk.org Tue Sep 9 10:59:48 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 9 Sep 2025 10:59:48 GMT Subject: RFR: 8366984: Remove delay slot support In-Reply-To: References: Message-ID: On Fri, 5 Sep 2025 14:24:50 GMT, Daniel Jeliński wrote: > SPARC was the only supported architecture that uses a delay slot.
The SPARC port was removed in JDK 15, and the code is effectively dead. Let's remove it. > > The changes are no-op on all architectures that do not use delay slots. I still tested tier 1-5 on mach5, no related failures. Thanks for the answers! You'll of course have to merge the dependency, and get a second review :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/27119#issuecomment-3270142011 From epeter at openjdk.org Tue Sep 9 10:59:51 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 9 Sep 2025 10:59:51 GMT Subject: RFR: 8366984: Remove delay slot support In-Reply-To: References: Message-ID: <6dsuRmIveoGusgv0MnsHqIv87ZjbwXy3z9DoBvbUwVc=.9437cbe1-e0ad-43ce-bbda-86403a5971fc@github.com> On Tue, 9 Sep 2025 09:23:46 GMT, Daniel Jeliński wrote: >> src/hotspot/cpu/arm/arm.ad line 3383: >> >>> 3381: BR : R; >>> 3382: %} >>> 3383: >> >> Where was this used? Or is it an unrelated cleanup? > > Removing the comment alone didn't feel quite right, so I removed the following block as well. The block appears to be unused. It was copy-pasted from [SPARC](https://github.com/openjdk/jdk/blob/8153779ad32d1e8ddd37ced826c76c7aafc61894/hotspot/src/cpu/sparc/vm/sparc.ad#L4984), where it was also unused. Thanks for the explanation :) >> src/hotspot/share/adlc/adlparse.cpp line 1394: >> >>> 1392: parse_err(SYNERR, "Using obsolete token, branch_has_delay_slot"); >>> 1393: break; >>> 1394: } >> >> I'm curious: why do you add that special warning? It would fail later anyway, right? Are we expecting anyone to parse things produced by different versions? > > I took my inspiration from earlier work on adlc (see 6e35bcbf038cec0210c38428a8e1c233e102911a or 3f9c8a39201644952c6d07b97695a5a7ef918622), but I don't mind removing these warnings and the related code block entirely.
Sounds good, just keep the "obsolete" error :) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27119#discussion_r2333063403 PR Review Comment: https://git.openjdk.org/jdk/pull/27119#discussion_r2333062954 From chagedorn at openjdk.org Tue Sep 9 11:16:52 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 9 Sep 2025 11:16:52 GMT Subject: RFR: 8367135: Test compiler/loopstripmining/CheckLoopStripMining.java needs internal timeouts adjusted In-Reply-To: References: Message-ID: On Tue, 9 Sep 2025 09:31:24 GMT, Marc Chevalier wrote: > As described, adjust timeout to be as it implicitly used to be. > > Thanks, > Marc Looks good! ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/27167#pullrequestreview-3200976783 From chagedorn at openjdk.org Tue Sep 9 11:17:57 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 9 Sep 2025 11:17:57 GMT Subject: RFR: 8361702: C2: assert(is_dominator(compute_early_ctrl(limit, limit_ctrl), pre_end)) failed: node pinned on loop exit test? [v7] In-Reply-To: References: <1-3MDixhdwZEgDMpoAZckhK5_lFjygsKl4q1__tsCKs=.dffa9c0e-8ea1-4465-a1fc-6ad2dbcfe5db@github.com> Message-ID: On Tue, 9 Sep 2025 08:39:37 GMT, Roland Westrelin wrote: >> A node in a pre loop only has uses out of the loop dominated by the >> loop exit. `PhaseIdealLoop::try_sink_out_of_loop()` sets its control >> to the loop exit projection. A range check in the main loop has this >> node as input (through a chain of some other nodes). Range check >> elimination needs to update the exit condition of the pre loop with an >> expression that depends on the node pinned on its exit: that's >> impossible and the assert fires. This is a variant of 8314024 (this >> one was for a node with uses out of the pre loop on multiple paths). I >> propose the same fix: leave the node with control in the pre loop in >> this case. 
> > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 11 additional commits since the last revision: > > - Merge branch 'master' into JDK-8361702 > - Merge branch 'master' into JDK-8361702 > - review > - Merge branch 'master' into JDK-8361702 > - Update src/hotspot/share/opto/loopopts.cpp > > Co-authored-by: Christian Hagedorn > - Update test/hotspot/jtreg/compiler/rangechecks/TestSunkRangeFromPreLoopRCE3.java > > Co-authored-by: Christian Hagedorn > - Update test/hotspot/jtreg/compiler/rangechecks/TestSunkRangeFromPreLoopRCE2.java > > Co-authored-by: Christian Hagedorn > - Update src/hotspot/share/opto/loopopts.cpp > > Co-authored-by: Christian Hagedorn > - Update test/hotspot/jtreg/compiler/rangechecks/TestSunkRangeFromPreLoopRCE2.java > > Co-authored-by: Christian Hagedorn > - tests > - ... and 1 more: https://git.openjdk.org/jdk/compare/e2575a25...91a7d73c Still good! Since the last testing is quite a while back, let me rerun it. ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26424#pullrequestreview-3200982359 From mchevalier at openjdk.org Tue Sep 9 11:20:11 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Tue, 9 Sep 2025 11:20:11 GMT Subject: RFR: 8367135: Test compiler/loopstripmining/CheckLoopStripMining.java needs internal timeouts adjusted In-Reply-To: References: Message-ID: <6Hq93VTf74iXoQqAltS6qNEpiZmB2TXK2mvhZgtiFtc=.c66ef126-4656-4565-9586-463c888f44a4@github.com> On Tue, 9 Sep 2025 09:31:24 GMT, Marc Chevalier wrote: > As described, adjust timeout to be as it implicitly used to be. > > Thanks, > Marc Thanks @TobiHartmann & @chhagedorn! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/27167#issuecomment-3270222653 From mchevalier at openjdk.org Tue Sep 9 11:20:12 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Tue, 9 Sep 2025 11:20:12 GMT Subject: Integrated: 8367135: Test compiler/loopstripmining/CheckLoopStripMining.java needs internal timeouts adjusted In-Reply-To: References: Message-ID: On Tue, 9 Sep 2025 09:31:24 GMT, Marc Chevalier wrote: > As described, adjust timeout to be as it implicitly used to be. > > Thanks, > Marc This pull request has now been integrated. Changeset: 06326176 Author: Marc Chevalier URL: https://git.openjdk.org/jdk/commit/0632617670f991da23c3892d357e8d1f051d29a0 Stats: 4 lines in 1 file changed: 0 ins; 0 del; 4 mod 8367135: Test compiler/loopstripmining/CheckLoopStripMining.java needs internal timeouts adjusted Reviewed-by: thartmann, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/27167 From roland at openjdk.org Tue Sep 9 11:27:50 2025 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 9 Sep 2025 11:27:50 GMT Subject: RFR: 8327963: C2: fix construction of memory graph around Initialize node to prevent incorrect execution if allocation is removed [v12] In-Reply-To: <3jUFOPYDIqmzEywhzf58guwS0qZGBUCMZ3lXeltlS3c=.5c82601f-cf4d-4b2a-a525-1f8f4c7c4a3b@github.com> References: <3jUFOPYDIqmzEywhzf58guwS0qZGBUCMZ3lXeltlS3c=.5c82601f-cf4d-4b2a-a525-1f8f4c7c4a3b@github.com> Message-ID: > An `Initialize` node for an `Allocate` node is created with a memory > `Proj` of adr type raw memory. In order for stores to be captured, the > memory state out of the allocation is a `MergeMem` with slices for the > various object fields/array element set to the raw memory `Proj` of > the `Initialize` node. If `Phi`s need to be created during later > transformations from this memory state, The `Phi` for a particular > slice gets its adr type from the type of the `Proj` which is raw > memory. 
If during macro expansion, the `Allocate` is found to have no > use and so can be removed, the `Proj` out of the `Initialize` is > replaced by the memory state on input to the `Allocate`. A `Phi` for > some slice for a field of an object will end up with the raw memory > state on input to the `Allocate` node. As a result, memory state at > the `Phi` is incorrect and incorrect execution can happen. > > The fix I propose is, rather than have a single `Proj` for the memory > state out of the `Initialize` with adr type raw memory, to use one > `Proj` per slice added to the memory state after the `Initialize`. Each > of the `Proj` should return the right adr type for its slice. For that > I propose having a new type of `Proj`: `NarrowMemProj` that captures > the right adr type. > > Logic for the construction of the `Allocate`/`Initialize` subgraph is > tweaked so the right adr type captured in its own `NarrowMemProj` is > added to the memory subgraph. Code that removes an allocation or moves > it also has to be changed so it correctly takes the multiple memory > projections out of the `Initialize` node into account. > > One tricky issue is that when EA splits types for a scalar replaceable > `Allocate` node: > > 1- the adr type captured in the `NarrowMemProj` becomes out of sync > with the type of the slices for the allocation > > 2- before EA, the memory state for one particular field out of the > `Initialize` node can be used for a `Store` to the just allocated > object or some other. So we can have a chain of `Store`s, some to > the newly allocated object, some to some other objects, all of them > using the state of `NarrowMemProj` out of the `Initialize`. After > split unique types, the `NarrowMemProj` is for the slice of a > particular allocation. So `Store`s to some other objects shouldn't > use that memory state but the memory state before the `Allocate`.
> > For that, I added logic to update the adr type of `NarrowMemProj` > during split unique types and update the memory input of `Store`s that > don't depend on the memory state ... Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 45 commits: - more - Merge branch 'master' into JDK-8327963 - more - more - Merge branch 'master' into JDK-8327963 - more - more - lambda return - lambda clean up - Merge branch 'master' into JDK-8327963 - ... and 35 more: https://git.openjdk.org/jdk/compare/e16c5100...b701d03e ------------- Changes: https://git.openjdk.org/jdk/pull/24570/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=24570&range=11 Stats: 932 lines in 20 files changed: 845 ins; 25 del; 62 mod Patch: https://git.openjdk.org/jdk/pull/24570.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24570/head:pull/24570 PR: https://git.openjdk.org/jdk/pull/24570 From roland at openjdk.org Tue Sep 9 11:30:13 2025 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 9 Sep 2025 11:30:13 GMT Subject: RFR: 8327963: C2: fix construction of memory graph around Initialize node to prevent incorrect execution if allocation is removed [v8] In-Reply-To: References: <3jUFOPYDIqmzEywhzf58guwS0qZGBUCMZ3lXeltlS3c=.5c82601f-cf4d-4b2a-a525-1f8f4c7c4a3b@github.com> <1gdeBnZ7YuIf9CgQW2bCXkDDBWPjUgRnickHts-fvzE=.e6e901ba-3e9f-41a2-9c68-167a879e9655@github.com> <2m1_XtiSsW_LaBRrkX4qv7AKtLOjNgnl4mUp3zisasE=.dda62164-7aa0-4c1a-b83f-fa40ba7902e5@github.com> <4374L3lkQK90wLxxOA7POBmIKNX2DFK-4pO4vj1bkuQ=.5b8d7825-a7f1-497f-ab66-02a85a266659@github.com> Message-ID: On Fri, 11 Jul 2025 18:20:19 GMT, John R Rose wrote: >>> I think it would be good (although not necessarily in the context of this PR) to establish the "no duplicate memory projection" invariant in the back-end, for sanity and to make sure we do not break any logic that might be implicitly relying on it. 
If you agree, could you file a follow-up RFE, ideally with a reproducer where the current logic fails to remove `NarrowMemProj`s? >> >> One way would be to simply assert that there's no `NarrowMemProj`s left during final graph reshape. Is that what you'd like? >> Stepping back, what's the concern here? The new projections should mostly be harmless. > >> I think it would be good (although not necessarily in the context of this PR) to establish the "no duplicate memory projection" invariant in the back-end, for sanity and to make sure we do not break any logic that might be implicitly relying on it. If you agree, could you file a follow-up RFE, ideally with a reproducer where the current logic fails to remove `NarrowMemProj`s? > > I see this as a request for a better "normal form" for the graph. The trick here is that, if we are allowing temporary "abnormal" forms of the graph, in order to give various transforms some "working room" to rearrange things, we need to decide when are the moments when the graph must be settled back down into a normal form. > > We sometimes check for some kinds of IR normality, and/or enforce some normality, in the "final graph reshape" phase. The problem with loading up too many ad hoc operations at that point is, it may create a completely new kind of graph with new invariants. (Don't like the current standard? Create a new one, and see how that goes! Same for global IR contracts.) > > Having two kinds of IR with two sets of invariants (one set more restrictive) has an obvious objection: We fragment our ability to enforce the rules; we need to write enforcement logic which says "which phase are we in?" before checking the right set of rules. And if the editing sessions are rare, we don't get much benefit from the rules that are enforced by that editing session. By definition "final graph reshape" is rare. 
It's worth it since we are going to a lower IR, which really must have different rules, but it's not a light thing to add to the design. > > In any case, adding a normalization requirement seems to need a "wash pass" of some sort over the whole graph, to do necessary cleanups. We do this sometimes, I think, after loop opts or EA, maybe other places, and at "final graph reshape". This is going to be a runtime expense, I think, unless it can be piggybacked on some other pass we already do. Maybe a hallmark of these "post-operative" cleanups is that the operation itself required some side data structure, created just for the operation (loop nest or connection graph) and discarded later in order to unleash unconstrained downstream transforms. During the operation, transforms are specialized just to keep the side data structure relevant. Afterwards, the graph "opens up" to unconstrained changes. But in all cases, local updates should be as free as possible, even if their ord... @rose00 @robcasloz I updated the change with a new way to avoid redundant projections. At matching time, before a `NarrowMemProj` is matched into a `MachProj`, new logic checks whether a `MachProj` already exists. That guarantees that no redundant `MachProj` are ever added. It also performs the new normalization at a major cut-point. What do you think? ------------- PR Comment: https://git.openjdk.org/jdk/pull/24570#issuecomment-3270256703 From epeter at openjdk.org Tue Sep 9 11:51:00 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 9 Sep 2025 11:51:00 GMT Subject: RFR: 8365205: C2: Optimize popcount value computation using knownbits [v3] In-Reply-To: References: Message-ID: On Fri, 5 Sep 2025 17:17:52 GMT, Jatin Bhateja wrote: >> This patch optimizes PopCount value transforms using KnownBits information. 
>> Following are the results of the micro-benchmark included with the patch >> >> >> >> System: 13th Gen Intel(R) Core(TM) i3-1315U >> >> Baseline: >> Benchmark Mode Cnt Score Error Units >> PopCountValueTransform.LogicFoldingKerenLong thrpt 2 215460.670 ops/s >> PopCountValueTransform.LogicFoldingKerenlInt thrpt 2 294014.826 ops/s >> PopCountValueTransform.StockKernelInt thrpt 2 409295.875 ops/s >> PopCountValueTransform.StockKernelLong thrpt 2 368025.608 ops/s >> >> Withopt: >> Benchmark Mode Cnt Score Error Units >> PopCountValueTransform.LogicFoldingKerenLong thrpt 2 389978.082 ops/s >> PopCountValueTransform.LogicFoldingKerenlInt thrpt 2 417261.583 ops/s >> PopCountValueTransform.StockKernelInt thrpt 2 418649.269 ops/s >> PopCountValueTransform.StockKernelLong thrpt 2 381330.221 ops/s >> >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Update countbitsnode.cpp Very nice improvement @jatin-bhateja , thanks for working on it :) src/hotspot/share/opto/countbitsnode.cpp line 125: > 123: range is computed using the following formulas:- > 124: - _hi = ~ZEROS > 125: - _lo = ONES Is there going to be some other Lemma here, that gives rise to the numbering? I'd just remove the numbering. src/hotspot/share/opto/countbitsnode.cpp line 128: > 126: Proof:- > 127: - KnownBits.ZEROS and KnownBits.ONES are inferred out of the common prefix of the value range > 128: delimiting bounds. It could come from the range. But it could also come from individual bits being and-ed or or-ed to 1 or 0. I'll give an alternative suggestion below. src/hotspot/share/opto/countbitsnode.cpp line 145: > 143: B) Now, transform the computed knownbits back to the value range. > 144: _new_lo = _known_bits.ones = 0b11000100 > 145: _new_hi = ~known_bits.zeros = 0b11000111 This kinda duplicates all the descriptions that we have in KnownBits. I would drop it. 
Or maybe just refer to something over there. src/hotspot/share/opto/countbitsnode.cpp line 149: > 147: - We now know that ~KnownBits.ZEROS >= UB >= LB >= KnownBits.ONES > 148: - Therefore, popcount(ONES) and popcount(~ZEROS) can safely be assumed as the upper and lower > 149: bounds of the result value range. I don't quite see how that follows from the proof. And I'm also worried about the correctness. You are using the signed `_lo` and `_hi`. But the zeros and ones are unsigned. So it is a bit unclear what your comparisons prove here - you should probably cast one to signed or the other to unsigned to make things explicit. One crucial step here is also the linearity assumption of `popcount`. You'd need to show or at least assert that: ~KnownBits.ZEROS >= UB >= t >= LB >= KnownBits.ONES implies popcount(~KnownBits.ZEROS) >= popcount(UB) >= popcount(t) >= popcount(LB) >= popcount(KnownBits.ONES) It all sounds a bit complicated, and I think I would prefer something along the lines of what @SirYwell suggested. src/hotspot/share/opto/countbitsnode.cpp line 150: > 148: - Therefore, popcount(ONES) and popcount(~ZEROS) can safely be assumed as the upper and lower > 149: bounds of the result value range. > 150: */ Suggestion: // We use the KnownBits information from the integer types to derive how many one bits // we have at least and at most. 
// From the definition of KnownBits, we know: // zeros: Indicates which bits must be 0: zeros[i]=1 -> t[i]=0 // ones: Indicates which bits must be 1: ones[i]=1 -> t[i]=1 // // From this, we derive: // number_of_zeros_in_t >= pop_count(zeros) // -> number_of_ones_in_t <= bits_per_type - pop_count(zeros) = pop_count(~zeros) // number_of_ones_in_t >= pop_count(ones) // // By definition: // pop_count(t) = number_of_ones_in_t // // It follows: // pop_count(ones) <= pop_count(t) <= pop_count(~zeros) // // Note: signed _lo and _hi, as well as unsigned _ulo and _uhi bounds of the integer types // are already reflected in the KnownBits information, see TypeInt / TypeLong definitions. Feel free to adjust the formulation :) test/hotspot/jtreg/compiler/intrinsics/TestPopCountValueTransforms.java line 74: > 72: } > 73: return 1; > 74: } Can we not assert that there is exactly one popcount? The two should fold to one, no? test/hotspot/jtreg/compiler/intrinsics/TestPopCountValueTransforms.java line 114: > 112: } > 113: return 1; > 114: } Thanks for the tests! I think it would be quite valuable to have some tests that do not just clamp the range, but also create random `KnownBits`, i.e. with random and/or masks. For example: `num = (num | ONES) & ZEROS;` And then you generate `ONES` and `ZEROS` randomly, maybe even using `Generators`? Then round it off with some random range comparisons at the end: ` if (Integer.bitCount(num) >= CON1 && Integer.bitCount(num) <= CON2) {` test/hotspot/jtreg/compiler/intrinsics/TestPopCountValueTransforms.java line 148: > 146: > 147: public static void main(String[] args) { > 148: TestFramework.runWithFlags("-XX:-TieredCompilation", "-XX:CompileThresholdScaling=0.2"); Can you explain the need for these flags? The TestFramework eventually enqueues for compilation anyway. Or is there something about profiling?
test/micro/org/openjdk/bench/java/lang/PopCountValueTransform.java line 79: > 77: } > 78: return res; > 79: } I assume the `stock` kernels are there to show performance if there is no op, the `folding` kernels you hope have the same performance. It would be nice to have one where the `bitCount` does not fold away, just to keep that comparison :) ------------- Changes requested by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/27075#pullrequestreview-3200918107 PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2333099461 PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2333106910 PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2333117632 PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2333147248 PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2333189002 PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2333222066 PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2333199688 PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2333087776 PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2333210925 From epeter at openjdk.org Tue Sep 9 11:51:02 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 9 Sep 2025 11:51:02 GMT Subject: RFR: 8365205: C2: Optimize popcount value computation using knownbits [v3] In-Reply-To: References: Message-ID: <88lK21UPhkqWYMU-PNUCMYYH1QWrjiUfftspxZB7GFM=.99f8aa26-0075-4932-a427-054f088d8068@github.com> On Tue, 9 Sep 2025 11:03:26 GMT, Emanuel Peter wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Update countbitsnode.cpp > > src/hotspot/share/opto/countbitsnode.cpp line 125: > >> 123: range is computed using the following formulas:- >> 124: - _hi = ~ZEROS >> 125: - _lo = ONES > > Is there going to be some other Lemma here, that gives rise to the numbering? 
I'd just remove the numbering. Also: this is not really a mathematical statement that can be proven, rather some sort of high-level intention. > src/hotspot/share/opto/countbitsnode.cpp line 150: > >> 148: - Therefore, popcount(ONES) and popcount(~ZEROS) can safely be assumed as the upper and lower >> 149: bounds of the result value range. >> 150: */ > > Suggestion: > > // We use the KnownBits information from the integer types to derive how many one bits > // we have at least and at most. > // From the definition of KnownBits, we know: > // zeros: Indicates which bits must be 0: zeros[i]=1 -> t[i]=0 > // ones: Indicates which bits must be 1: ones[i]=1 -> t[i]=1 > // > // From this, we derive: > // number_of_zeros_in_t >= pop_count(zeros) > // -> number_of_ones_in_t <= bits_per_type - pop_count(zeros) = pop_count(~zeros) > // number_of_ones_in_t >= pop_count(ones) > // > // By definition: > // pop_count(t) = number_of_ones_in_t > // > // It follows: > // pop_count(ones) <= pop_count(t) <= pop_count(~zeros) > // > // Note: signed _lo and _hi, as well as unsigned _ulo and _uhi bounds of the integer types > // are already reflected in the KnownBits information, see TypeInt / TypeLong definitions. > > Feel free to adjust the formulation :) It goes along the lines of what @SirYwell proposed. > test/hotspot/jtreg/compiler/intrinsics/TestPopCountValueTransforms.java line 114: > >> 112: } >> 113: return 1; >> 114: } > > Thanks for the tests! > > I think it would be quite valuable to have some tests that do not just clamp the range, but also create random `KnownBits`, i.e. with random and/or masks. > > For example: > `num = (num | ONES) & ZEROS;` > > And then you generate `ONES` and `ZEROS` randomly, maybe even using `Generators`? > Then round it off with some random range comparisons at the end: > ` if (Integer.bitCount(num) >= CON1 && Integer.bitCount(num) <= CON2) {` Also: how many popcount instructions are left? Should it not at most be 1?
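For illustration only, here is a minimal standalone Java sketch of the bound discussed above, `pop_count(ones) <= pop_count(t) <= pop_count(~zeros)`, checked against randomly generated masks. It is not the jtreg/IR-framework test being requested and not C2 code: the class name and the helpers `applyKnownBits` and `boundsHold` are hypothetical. It also assumes the KnownBits convention that a set bit in `zeros` means "this bit of `t` is known to be 0", so the clamp uses `& ~zeros` rather than the and-mask written as `ZEROS` in the comment above.

```java
import java.util.Random;

class PopCountKnownBitsSketch {
    // Hypothetical helper: force every bit set in `ones` to 1 and every bit
    // set in `zeros` to 0. Assumes `ones` and `zeros` are disjoint.
    static int applyKnownBits(int num, int ones, int zeros) {
        return (num | ones) & ~zeros;
    }

    // Checks pop_count(ones) <= pop_count(t) <= pop_count(~zeros).
    static boolean boundsHold(int t, int ones, int zeros) {
        int lo = Integer.bitCount(ones);   // bits that must be 1
        int hi = Integer.bitCount(~zeros); // bits that may be 1
        int pc = Integer.bitCount(t);
        return lo <= pc && pc <= hi;
    }

    public static void main(String[] args) {
        Random random = new Random(42);
        for (int i = 0; i < 100_000; i++) {
            int ones = random.nextInt();
            int zeros = random.nextInt() & ~ones; // keep the two masks disjoint
            int t = applyKnownBits(random.nextInt(), ones, zeros);
            if (!boundsHold(t, ones, zeros)) {
                throw new AssertionError("bound violated for t=" + Integer.toBinaryString(t));
            }
        }
        System.out.println("popcount bounds held for all sampled values");
    }
}
```

The masks are kept disjoint because a bit cannot be known-one and known-zero at once; without that, `& ~zeros` could clear a bit that `ones` requires and the lower bound would no longer be meaningful.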
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2333112033 PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2333226941 PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2333218115 From epeter at openjdk.org Tue Sep 9 12:00:02 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 9 Sep 2025 12:00:02 GMT Subject: RFR: 8361702: C2: assert(is_dominator(compute_early_ctrl(limit, limit_ctrl), pre_end)) failed: node pinned on loop exit test? [v7] In-Reply-To: References: <1-3MDixhdwZEgDMpoAZckhK5_lFjygsKl4q1__tsCKs=.dffa9c0e-8ea1-4465-a1fc-6ad2dbcfe5db@github.com> Message-ID: On Tue, 9 Sep 2025 08:39:37 GMT, Roland Westrelin wrote: >> A node in a pre loop only has uses out of the loop dominated by the >> loop exit. `PhaseIdealLoop::try_sink_out_of_loop()` sets its control >> to the loop exit projection. A range check in the main loop has this >> node as input (through a chain of some other nodes). Range check >> elimination needs to update the exit condition of the pre loop with an >> expression that depends on the node pinned on its exit: that's >> impossible and the assert fires. This is a variant of 8314024 (this >> one was for a node with uses out of the pre loop on multiple paths). I >> propose the same fix: leave the node with control in the pre loop in >> this case. > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 11 additional commits since the last revision: > > - Merge branch 'master' into JDK-8361702 > - Merge branch 'master' into JDK-8361702 > - review > - Merge branch 'master' into JDK-8361702 > - Update src/hotspot/share/opto/loopopts.cpp > > Co-authored-by: Christian Hagedorn > - Update test/hotspot/jtreg/compiler/rangechecks/TestSunkRangeFromPreLoopRCE3.java > > Co-authored-by: Christian Hagedorn > - Update test/hotspot/jtreg/compiler/rangechecks/TestSunkRangeFromPreLoopRCE2.java > > Co-authored-by: Christian Hagedorn > - Update src/hotspot/share/opto/loopopts.cpp > > Co-authored-by: Christian Hagedorn > - Update test/hotspot/jtreg/compiler/rangechecks/TestSunkRangeFromPreLoopRCE2.java > > Co-authored-by: Christian Hagedorn > - tests > - ... and 1 more: https://git.openjdk.org/jdk/compare/2676c5f4...91a7d73c src/hotspot/share/opto/loopopts.cpp line 1936: > 1934: // Sinking a node from a pre loop to its main loop pins the node between the pre and main loops. If that node is input > 1935: // to a check that's eliminated by range check elimination, it becomes input to an expression that feeds into the exit > 1936: // test of the pre loop above the point in the graph where it's pinned. I guess the alternative would have been not to do that RC elimination, right? If yes: you could finish the thought and say that we prefer to have a chance at RC elimination, rather than sinking the node out of the pre-loop. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26424#discussion_r2333262985 From epeter at openjdk.org Tue Sep 9 12:18:23 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 9 Sep 2025 12:18:23 GMT Subject: RFR: 8363989: AArch64: Add missing backend support of VectorAPI expand operation [v3] In-Reply-To: References: Message-ID: On Fri, 5 Sep 2025 06:30:34 GMT, erifan wrote: >> Currently, on AArch64, the VectorAPI `expand` operation is intrinsified for 32-bit and 64-bit types only when SVE2 is available. 
In the following cases, `expand` has not yet been intrinsified: >> 1. **Subword types** on SVE2-capable hardware. >> 2. **All types** on NEON and SVE1 environments. >> >> As a result, `expand` API performance is very poor in these scenarios. This patch intrinsifies the `expand` operation in the above environments. >> >> Since there are no native instructions directly corresponding to `expand` in these cases, this patch mainly leverages the `TBL` instruction to implement `expand`. To compute the index input for `TBL`, the prefix sum algorithm (see https://en.wikipedia.org/wiki/Prefix_sum) is used. Take a 128-bit byte vector on SVE2 as an example: >> >> To compute: dst = src.expand(mask) >> Data direction: high <== low >> Input: >> src = p o n m l k j i h g f e d c b a >> mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 >> Expected result: >> dst = 0 0 h g 0 0 f e 0 0 d c 0 0 b a >> >> Step 1: calculate the index input of the TBL instruction. >> >> // Set tmp1 as all 0 vector. >> tmp1 = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 >> >> // Move the mask bits from the predicate register to a vector register. >> // **1-bit** mask lane of P register to **8-bit** mask lane of V register. >> tmp2 = mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 >> >> // Shift the entire register. Prefix sum algorithm. >> dst = tmp2 << 8 = 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 >> tmp2 += dst = 0 1 2 1 0 1 2 1 0 1 2 1 0 1 2 1 >> >> dst = tmp2 << 16 = 2 1 0 1 2 1 0 1 2 1 0 1 2 1 0 0 >> tmp2 += dst = 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 >> >> dst = tmp2 << 32 = 2 2 2 2 2 2 2 2 2 2 2 1 0 0 0 0 >> tmp2 += dst = 4 4 4 4 4 4 4 4 4 4 4 3 2 2 2 1 >> >> dst = tmp2 << 64 = 4 4 4 3 2 2 2 1 0 0 0 0 0 0 0 0 >> tmp2 += dst = 8 8 8 7 6 6 6 5 4 4 4 3 2 2 2 1 >> >> // Clear inactive elements. >> dst = sel(mask, tmp2, tmp1) = 0 0 8 7 0 0 6 5 0 0 4 3 0 0 2 1 >> >> // Set the inactive lane value to -1 and set the active lane to the target index. 
>> dst -= 1 = -1 -1 7 6 -1 -1 5 4 -1 -1 3 2 -1 -1 1 0 >> >> Step 2: shuffle the source vector elements to the target vector >> >> tbl(dst, src, dst) = 0 0 h g 0 0 f e 0 0 d c 0 0 b a >> >> >> The same algorithm is used for NEON and... > > erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: > > - Align code example data for better reading > - Merge branch 'master' into JDK-8363989 > - Improve the comment of the vector expand implementation > - Merge branch 'master' into JDK-8363989 > - 8363989: AArch64: Add missing backend support of VectorAPI expand operation > > Currently, on AArch64, the VectorAPI `expand` operation is intrinsified > for 32-bit and 64-bit types only when SVE2 is available. In the following > cases, `expand` has not yet been intrinsified: > 1. **Subword types** on SVE2-capable hardware. > 2. **All types** on NEON and SVE1 environments. > > As a result, `expand` API performance is very poor in these scenarios. > This patch intrinsifies the `expand` operation in the above environments. > > Since there are no native instructions directly corresponding to `expand` > in these cases, this patch mainly leverages the `TBL` instruction to > implement `expand`. To compute the index input for `TBL`, the prefix sum > algorithm (see https://en.wikipedia.org/wiki/Prefix_sum) is used. > Take a 128-bit byte vector on SVE2 as an example: > ``` > To compute: dst = src.expand(mask) > Data direction: high <== low > Input: > src = p o n m l k j i h g f e d c b a > mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 > Expected result: > dst = 0 0 h g 0 0 f e 0 0 d c 0 0 b a > ``` > Step 1: calculate the index input of the TBL instruction. > ``` > // Set tmp1 as all 0 vector. > tmp1 = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 > > // Move the mask bits from the predicate register to a vector register. 
> // **1-bit** mask lane of P register to **8-bit** mask lane of V register. > tmp2 = mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 > > // Shift the entire register. Prefix sum algorithm. > dst = tmp2 << 8 = 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 > tmp2 += dst = 0 1 2 1 0 1 2 1 0 1 2 1 0 1 2 1 > > dst = tmp2 << 16 = 2 1 0 1 2 1 0 1 2 1 0 1 2 1 0 0 > tmp2 += dst = 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 > > dst = tmp2 << 32 = 2 2 2 2 2 2 2 2 2 2 2 1 0 0 0 0 > tmp2 += dst = 4 4 4 4 4 4 4 4 4 4 4 3 2 2... Thanks for the updates! The patch looks good to me now. I'll run some testing now, should take about 24h :) ------------- PR Review: https://git.openjdk.org/jdk/pull/26740#pullrequestreview-3201222772 From epeter at openjdk.org Tue Sep 9 13:16:59 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 9 Sep 2025 13:16:59 GMT Subject: RFR: 8354242: VectorAPI: combine vector not operation with compare [v11] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 06:08:33 GMT, erifan wrote: >> This patch optimizes the following patterns: >> For integer types: >> >> (XorV (VectorMaskCmp src1 src2 cond) (Replicate -1)) >> => (VectorMaskCmp src1 src2 ncond) >> (XorVMask (VectorMaskCmp src1 src2 cond) (MaskAll m1)) >> => (VectorMaskCmp src1 src2 ncond) >> >> cond can be eq, ne, le, ge, lt, gt, ule, uge, ult and ugt, ncond is the negative comparison of cond. >> >> For float and double types: >> >> (XorV (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (Replicate -1)) >> => (VectorMaskCast (VectorMaskCmp src1 src2 ncond)) >> (XorVMask (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (MaskAll m1)) >> => (VectorMaskCast (VectorMaskCmp src1 src2 ncond)) >> >> cond can be eq or ne. 
>> >> Benchmarks on Nvidia Grace machine with 128-bit SVE2: With option `-XX:UseSVE=2`: >> >> Benchmark Unit Before Score Error After Score Error Uplift >> testCompareEQMaskNotByte ops/s 7912127.225 2677.289518 10266136.26 8955.008548 1.29 >> testCompareEQMaskNotDouble ops/s 884737.6799 446.963779 1179760.772 448.031844 1.33 >> testCompareEQMaskNotFloat ops/s 1765045.787 682.332214 2359520.803 896.305743 1.33 >> testCompareEQMaskNotInt ops/s 1787221.411 977.743935 2353952.519 960.069976 1.31 >> testCompareEQMaskNotLong ops/s 895297.1974 673.44808 1178449.02 323.804205 1.31 >> testCompareEQMaskNotShort ops/s 3339987.002 3415.2226 4712761.965 2110.862053 1.41 >> testCompareGEMaskNotByte ops/s 7907615.16 4094.243652 10251646.9 9486.699831 1.29 >> testCompareGEMaskNotInt ops/s 1683738.958 4233.813092 2352855.205 1251.952546 1.39 >> testCompareGEMaskNotLong ops/s 854496.1561 8594.598885 1177811.493 521.1229 1.37 >> testCompareGEMaskNotShort ops/s 3341860.309 1578.975338 4714008.434 1681.10365 1.41 >> testCompareGTMaskNotByte ops/s 7910823.674 2993.367032 10245063.58 9774.75138 1.29 >> testCompareGTMaskNotInt ops/s 1673393.928 3153.099431 2353654.521 1190.848583 1.4 >> testCompareGTMaskNotLong ops/s 849405.9159 2432.858159 1177952.041 359.96413 1.38 >> testCompareGTMaskNotShort ops/s 3339509.141 3339.976585 4711442.496 2673.364893 1.41 >> testCompareLEMaskNotByte ops/s 7911340.004 3114.69191 10231626.5 27134.20035 1.29 >> testCompareLEMaskNotInt ops/s 1675812.113 1340.969885 2353255.341 1452.4522 1.4 >> testCompareLEMaskNotLong ops/s 848862.8036 6564.841731 1177763.623 539.290106 1.38 >> testCompareLEMaskNotShort ops/s 3324951.54 2380.29473 4712116.251 1544.559684 1.41 >> testCompareLTMaskNotByte ops/s 7910390.844 2630.861436 10239567.69 6487.441672 1.29 >> testCompareLTMaskNotInt ops/s 16721... > > erifan has updated the pull request incrementally with one additional commit since the last revision: > > Update the code comment Looks much better, thanks for the updates! 
I have another small list of suggestions :) src/hotspot/share/opto/vectornode.cpp line 2243: > 2241: if (in1->Opcode() != Op_VectorMaskCmp || > 2242: in1->outcnt() != 1 || > 2243: !(in1->as_VectorMaskCmp())->predicate_can_be_negated() || Suggestion: !in1->as_VectorMaskCmp()->predicate_can_be_negated() || Brackets are unnecessary, and rather make it harder to read. src/hotspot/share/opto/vectornode.cpp line 2277: > 2275: res = VectorNode::Ideal(phase, can_reshape); > 2276: } > 2277: return res; What if someone comes and wants to add yet another optimization before `VectorNode::Ideal`? Your code layout would give us deeper and deeper nesting. I suggest flattening it like this: Suggestion: Node* res = Ideal_XorV_VectorMaskCmp(phase, can_reshape); if (res != nullptr) { return res; } return VectorNode::Ideal(phase, can_reshape); test/hotspot/jtreg/compiler/vectorapi/VectorMaskCompareNotTest.java line 911: > 909: testCompareMaskNotLong(L_SPECIES_FOR_CAST, VectorOperators.UGE, (m) -> { return m.cast(I_SPECIES_FOR_CAST).not(); }); > 910: verifyResultsLong(L_SPECIES_FOR_CAST, VectorOperators.UGE); > 911: } You have some cast in here, and in similar tests. Can you add an IR rule to check if we do or do not have the expected casts? test/hotspot/jtreg/compiler/vectorapi/VectorMaskCompareNotTest.java line 1007: > 1005: testCompareMaskNotFloat(F_SPECIES, VectorOperators.NE, fa, fninf, (m) -> { return F_SPECIES.maskAll(true).xor(m); }); > 1006: verifyResultsFloat(F_SPECIES, VectorOperators.NE, fa, fninf); > 1007: } Do you have test cases for the cases other than `EQ` and `NE`? After all, we don't want someone to accidentally break the logic you implemented later without us noticing the bug ;) test/micro/org/openjdk/bench/jdk/incubator/vector/MaskCompareNotBenchmark.java line 351: > 349: public void testCompareULEMaskNotLong() { > 350: testCompareMaskNotLong(VectorOperators.ULE); > 351: } You could consider making the operator a `@Param` next time.
There are multiple tricks to do that: - `test/micro/org/openjdk/bench/vm/compiler/VectorStoreToLoadForwarding.java` using `MethodHandles.constant` - Some inner class that has a static final, which is initialized from the non-final `@Param` value. - Probably even `StableValue` would work, but I have not yet experimented with it. It would be nice if we could do the same with the primitive types, but that's probably not going to work as easily. Really just an idea for next time. test/micro/org/openjdk/bench/jdk/incubator/vector/MaskCompareNotBenchmark.java line 366: > 364: public void testCompareNEMaskNotFloat() { > 365: testCompareMaskNotFloat(VectorOperators.NE); > 366: } You could still add the other comparisons as well, so we can see the performance difference. Very optional, feel free to ignore this suggestion. ------------- Changes requested by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/24674#pullrequestreview-3201347660 PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2333480061 PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2333418237 PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2333510278 PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2333503735 PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2333545924 PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2333516350 From epeter at openjdk.org Tue Sep 9 13:28:22 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 9 Sep 2025 13:28:22 GMT Subject: RFR: 8356813: Improve Mod(I|L)Node::Value [v7] In-Reply-To: References: <2Jf_gfvRlKcmCFoQHp5T0WW_fU_yK5-0Z3z41f00-YU=.164be9f0-fae1-44bb-84c3-846d8c2c0db2@github.com> Message-ID: On Tue, 26 Aug 2025 12:46:31 GMT, Hannes Greule wrote: >> This change improves the precision of the `Mod(I|L)Node::Value()` functions. >> >> I reordered the structure a bit. First, we handle constants, afterwards, we handle ranges. 
The bottom checks seem to be excessive (`Type::BOTTOM` is covered by using `isa_(int|long)()`, the local bottom is just the full range). Given we can even give reasonable bounds if only one input has any bounds, we don't want to return early. >> The changes after that are commented. Please let me know if the explanations are good, or if you have any suggestions. >> >> ### Monotonicity >> >> Before, a 0 divisor resulted in `Type(Int|Long)::POS`. Initially I wanted to keep it this way, but that violates monotonicity during PhaseCCP. As an example, if we see a 0 divisor first and a 3 afterwards, we might try to go from `>=0` to `-2..2`, but the meet of these would be `>=-2` rather than `-2..2`. We now use `Type(Int|Long)::ZERO` instead (zero is always contained in the result if we cover a range). >> >> ### Testing >> >> I added tests for cases around the relevant bounds. I also ran tier1, tier2, and tier3 but didn't see any related failures after addressing the monotonicity problem described above (I'm having a few unrelated failures on my system currently, so separate testing would be appreciated in case I missed something). >> >> Please review and let me know what you think. >> >> ### Other >> >> The `UMod(I|L)Node`s were adjusted to be more in line with their signed variants. This change diverges them again, but similar improvements could be made after #17508. >> >> While experimenting with these changes, I stumbled upon a few things that aren't directly related to this change, but might be worth looking into further: >> - If the divisor is a constant, we will directly replace the `Mod(I|L)Node` with more but less expensive nodes in `::Ideal()`. Type analysis for these nodes combined is less precise, which means we miss potential cases where this would help, e.g., removing range checks. Would it make sense to delay the replacement? >> - To force non-negative ranges, I'm using `char`. I noticed that method parameters of sub-int integer types all fall back to `TypeInt::INT`.
This seems to be an intentional change of https://github.com/openjdk/jdk/commit/200784d505dd98444c48c9ccb7f2e4df36dcbb6a. The bug report is private, so I can't really judge if that part is necessary, but it seems odd. > > Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: > > review Thanks for filing the issue! I left some comments there. We could delay div/mod by constants to after loop opts. And we could even optimize div/mod in loops that have loop-invariant divisor ;) ------------- PR Comment: https://git.openjdk.org/jdk/pull/25254#issuecomment-3270740800 From epeter at openjdk.org Tue Sep 9 13:35:27 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 9 Sep 2025 13:35:27 GMT Subject: RFR: 8356813: Improve Mod(I|L)Node::Value [v7] In-Reply-To: References: <2Jf_gfvRlKcmCFoQHp5T0WW_fU_yK5-0Z3z41f00-YU=.164be9f0-fae1-44bb-84c3-846d8c2c0db2@github.com> Message-ID: On Wed, 3 Sep 2025 15:20:31 GMT, Hannes Greule wrote: >> Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: >> >> review > > I also filed https://bugs.openjdk.org/browse/JDK-8366815 now regarding the early transformation of div/mod by constants. @SirYwell The changes look good to me, thanks for working on this! I'll now run some internal testing, before approving. Please ping me again in 24h if I don't report back by then :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/25254#issuecomment-3270770381 From epeter at openjdk.org Tue Sep 9 13:37:27 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 9 Sep 2025 13:37:27 GMT Subject: RFR: 8366588: VectorAPI: Re-intrinsify VectorMask.laneIsSet where the input index is a variable In-Reply-To: References: Message-ID: On Fri, 5 Sep 2025 08:13:28 GMT, erifan wrote: > Intrinsic support for `VectorMask.laneIsSet` with a **variable** input index was introduced in PR #14200, but was inadvertently broken by PR #25673. 
This PR restores the intrinsic functionality and adds some JTReg tests. > > Benchmarks on Nvidia Grace machine with 128-bit SVE: > > Benchmark Unit Before Score Error After Score Error Uplift > microMaskLaneIsSetByte128_var ops/ms 21702.14415 91.902159 103472.9391 36.057447 4.767867 > microMaskLaneIsSetByte64_var ops/ms 21468.51868 107.94177 103365.6561 69.47736 4.814754 > microMaskLaneIsSetDouble128_var ops/ms 77489.32791 153.242699 413499.4127 311.854079 5.336211 > microMaskLaneIsSetFloat128_var ops/ms 41034.95204 399.421823 206840.0988 74.702234 5.040583 > microMaskLaneIsSetFloat64_var ops/ms 77607.40268 175.938921 413745.3001 149.716794 5.33126 > microMaskLaneIsSetInt128_var ops/ms 41452.48893 76.143208 206845.9754 59.371129 4.989953 > microMaskLaneIsSetInt64_var ops/ms 77726.2542 173.180518 413427.8838 363.575023 5.319024 > microMaskLaneIsSetLong128_var ops/ms 77646.11218 177.496587 413403.4404 236.609314 5.3242 > microMaskLaneIsSetShort128_var ops/ms 21374.93265 48.13101 103417.4618 34.827021 4.838259 > microMaskLaneIsSetShort64_var ops/ms 41066.19395 353.320621 206801.109 106.408938 5.035799 > > > Benchmarks on Intel 6444y machine with 512-bit avx3: > > Benchmark Unit Before Score Error After Score Error Uplift > microMaskLaneIsSetByte128_var ops/ms 57658.45497 240.209309 211643.8406 29.214532 3.670647 > microMaskLaneIsSetByte256_var ops/ms 57451.68169 116.994128 211609.4652 160.48513 3.683259 > microMaskLaneIsSetByte512_var ops/ms 57530.22411 311.63868 199802.8084 408.144015 3.473005 > microMaskLaneIsSetByte64_var ops/ms 57642.2672 161.406221 205252.4464 196.86852 3.560797 > microMaskLaneIsSetDouble256_var ops/ms 114401.3789 231.797375 361400.344 565.593984 3.159055 > microMaskLaneIsSetDouble512_var ops/ms 57379.27882 159.699503 211476.1138 136.980026 3.685583 > microMaskLaneIsSetFloat128_var ops/ms 113943.9512 141.062663 360855.3915 494.471996 3.166955 > microMaskLaneIsSetFloat256_var ops/ms 57682.78182 138.142053 211659.5098 30.167972 3.66937 > 
microMaskLaneIsSetFloat512_var ops/ms 57617.66405 301.748599 211246.8588 597.18949 3.666355 > microMaskLaneIsSetInt128_var ops/ms 113914.5062 118.681382 360856.4465 555.097397 3.167783 > microMaskLaneIsSetInt256_var ops/ms 57681.79883 112.391639 211555.6742 217.556981 3.667633 > microMaskLaneIsSetInt512_var ops/ms 57350.20346 206.146723 211657.7207 68.461571 3.690618 > microMaskLane... Patch looks good to me, testing passed :) Thanks for working on this @erifan ! ------------- Marked as reviewed by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/27113#pullrequestreview-3201642369 From mchevalier at openjdk.org Tue Sep 9 13:40:41 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Tue, 9 Sep 2025 13:40:41 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v5] In-Reply-To: References: Message-ID: On Mon, 8 Sep 2025 15:27:36 GMT, Emanuel Peter wrote: >> Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: >> >> One more ResourceMark > > src/hotspot/share/opto/graphInvariants.cpp line 116: > >> 114: private: >> 115: const N*& _binding; >> 116: }; > > Would it not make sense to move it a bit closer to the related code? Do you need it much before `NodeClassIsAndBind`? `TypedBind` is like `Bind`: they both match the same nodes as `TruePattern` just before. I think the grouping makes more sense than splitting `Bind` (whose comment refers to `TruePattern`) and `TypedBind`. It makes more sense to hoist `NodeClass` here, even if their relation is rather light: used in the same macro. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2333658743 From mbaesken at openjdk.org Tue Sep 9 14:04:57 2025 From: mbaesken at openjdk.org (Matthias Baesken) Date: Tue, 9 Sep 2025 14:04:57 GMT Subject: RFR: 8366775: TestCompileTaskTimeout should use timeoutFactor In-Reply-To: References: Message-ID: On Thu, 4 Sep 2025 13:26:22 GMT, Manuel Hässig wrote: > `TestCompileTaskTimeout.java` employs a timeout to test that methods are compiled faster than a specified `CompileTaskTimeout`. However, it does not make use of the jtreg timeout factor, which led to #26963 increasing the timeout to 2 s. This PR remedies this by using the timeout factor and reducing the default timeout to 500 ms. > > Testing: > - [x] Github Actions > - [x] tier1, tier2 linux-x64-debug, linux-x64, linux-aarch64-debug, linux-aarch64 Marked as reviewed by mbaesken (Reviewer). Looks good, the adjustments seem to work for us. ------------- PR Review: https://git.openjdk.org/jdk/pull/27094#pullrequestreview-3201794914 PR Comment: https://git.openjdk.org/jdk/pull/27094#issuecomment-3270887814 From epeter at openjdk.org Tue Sep 9 14:07:19 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 9 Sep 2025 14:07:19 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v5] In-Reply-To: References: Message-ID: <19sn8mAJlmJgsBYmEyI-9PfMbDDbUiQrpxrkrVb9Q4M=.4114119a-d86a-4488-9249-884652601972@github.com> On Tue, 9 Sep 2025 13:38:00 GMT, Marc Chevalier wrote: >> src/hotspot/share/opto/graphInvariants.cpp line 116: >> >>> 114: private: >>> 115: const N*& _binding; >>> 116: }; >> >> Would it not make sense to move it a bit closer to the related code? Do you need it much before `NodeClassIsAndBind`? > > `TypedBind` is like `Bind`: they both match the same nodes as `TruePattern` just before. I think the grouping makes more sense than splitting `Bind` (whose comment refers to `TruePattern`) and `TypedBind`. 
It makes more sense to hoist `NodeClass` here, even if their relation is rather light: used in the same macro. Up to you :) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2333768687 From mchevalier at openjdk.org Tue Sep 9 14:10:35 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Tue, 9 Sep 2025 14:10:35 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v5] In-Reply-To: References: Message-ID: <0y1rlPAYPA8mfhOM22UtuR96ztkUjDwFDzEntzK2_ag=.360369c4-f0cb-4dfe-9773-530482a9c551@github.com> On Mon, 8 Sep 2025 15:47:44 GMT, Emanuel Peter wrote: > having the printed statements which statements? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2333779354 From epeter at openjdk.org Tue Sep 9 14:14:46 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 9 Sep 2025 14:14:46 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v5] In-Reply-To: <0y1rlPAYPA8mfhOM22UtuR96ztkUjDwFDzEntzK2_ag=.360369c4-f0cb-4dfe-9773-530482a9c551@github.com> References: <0y1rlPAYPA8mfhOM22UtuR96ztkUjDwFDzEntzK2_ag=.360369c4-f0cb-4dfe-9773-530482a9c551@github.com> Message-ID: On Tue, 9 Sep 2025 14:08:04 GMT, Marc Chevalier wrote: >> src/hotspot/share/opto/graphInvariants.cpp line 211: >> >>> 209: AtInput(uint which_input, const Pattern* pattern) : _which_input(which_input), _pattern(pattern) {} >>> 210: bool check(const Node* center, Node_List& steps, GrowableArray& path, stringStream& ss) const override { >>> 211: assert(_which_input < center->req(), "Input number is out of range"); >> >> Hmm. Could still be nice if we did our best here, and responded nicely. >> Just in case someone messes up the pattern, and then we get an assert here. >> Maybe the bug is hard to reproduce, and having the printed statements would have helped a little? > >> having the printed statements > > which statements? 
I meant your error messages that you put to the `ss` :) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2333793502 From mchevalier at openjdk.org Tue Sep 9 14:42:23 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Tue, 9 Sep 2025 14:42:23 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v5] In-Reply-To: References: Message-ID: On Tue, 9 Sep 2025 08:22:41 GMT, Emanuel Peter wrote: >> Can you add a comment, why it can be arbitrarily large? >> Do you have an example where we have very many ctrl uses? > > Also: are these all supposed to be projections of a specific kind? We could also test for that. You can also add that to a future RFE. > Can you add a comment, why it can be arbitrarily large? Maybe I'm very wrong about what is a CatchNode, but: try { ... } catch ( ... ) { ... } catch ( ... ) { ... } catch ( ... ) { ... } 4 outputs: 3 handlers + 1 fallthrough. > Also: are these all supposed to be projections of a specific kind? We could also test for that. You can also add that to a future RFE. I'd rather do it separately. We can always check more things, but we need to draw the line and that is safe to add later. 
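As a concrete illustration of the handler count described above, here is a minimal Java sketch. The class and method names (`CatchShapes`, `mayThrow`, `classify`) are made up for this example and do not come from the PR. The single call inside the `try` can continue in four ways: normally, or into one of three handlers. How C2 actually shapes the `Catch` projections for such code is an assumption the sketch does not verify.

```java
public class CatchShapes {
    // Hypothetical callee: throws a different exception depending on the input.
    static int mayThrow(int kind) {
        switch (kind) {
            case 1: throw new IllegalArgumentException("kind 1");
            case 2: throw new IllegalStateException("kind 2");
            case 3: throw new ArithmeticException("kind 3");
            default: return kind;
        }
    }

    // One call site, three handlers, plus the normal (fall-through) continuation.
    static String classify(int kind) {
        try {
            int v = mayThrow(kind);
            return "fallthrough:" + v;
        } catch (IllegalArgumentException e) {
            return "handler1";
        } catch (IllegalStateException e) {
            return "handler2";
        } catch (ArithmeticException e) {
            return "handler3";
        }
    }

    public static void main(String[] args) {
        for (int kind = 0; kind <= 3; kind++) {
            System.out.println(kind + " -> " + classify(kind));
        }
    }
}
```

Adding a fourth `catch` clause would add a fifth continuation, which is why an upper bound on the number of control outputs is hard to state.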
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2333873366 From mchevalier at openjdk.org Tue Sep 9 14:50:54 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Tue, 9 Sep 2025 14:50:54 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v5] In-Reply-To: References: Message-ID: On Mon, 8 Sep 2025 16:55:38 GMT, Emanuel Peter wrote: >> Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: >> >> One more ResourceMark > > src/hotspot/share/opto/graphInvariants.cpp line 528: > >> 526: if (!center->is_CountedLoop() && !center->is_LongCountedLoop()) { >> 527: return CheckResult::NOT_APPLICABLE; >> 528: } > > Actually: why not apply that to `OuterStripMinedLoop` as well? Or any `BaseCountedLoop`? Are there more than these 3 cases? If there are ever more, they should probably also adhere to this backedge pattern, we'll just need an extension. But it would be nice to trip over something here if we ever do extend. I'm going to push back on that. I would rather keep this one about counted loops, which have more structure that is HEAVILY relied on; I haven't enumerated all of it, but that can be done. One can make another check for the few things that hold for other flavors. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2333891737 From mchevalier at openjdk.org Tue Sep 9 14:50:55 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Tue, 9 Sep 2025 14:50:55 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v5] In-Reply-To: References: <4jB_I2sHD7IfzhR7ojHfsFPlvZFCOWaHf8aS0AZshj0=.d0162feb-10b8-488d-82fa-eb816ce5dda9@github.com> Message-ID: On Tue, 9 Sep 2025 08:15:26 GMT, Emanuel Peter wrote: >> I don't understand. > > I would still consider adding `OuterStripMinedLoop` here, to capture that it has a similar structure. 
Even if you also verify specific things for `OuterStripMinedLoop` below. Just to check that all these loop structures have the same kind of backedge shape. > And then make a switch out of it, with a default case that fails. In case we add yet another `Loop` shape, we would then catch that and add the logic for it. > > But actually: do not all `Loop` shapes have this backedge pattern? Or are there some that have an `IfFalse` on the backedge? Because then you could also add `LoopNode` with `LoopEndNode`. Same as before: we extend the checks later. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2333895804 From djelinski at openjdk.org Tue Sep 9 15:11:31 2025 From: djelinski at openjdk.org (Daniel Jeliński) Date: Tue, 9 Sep 2025 15:11:31 GMT Subject: RFR: 8366971: C2: Remove unused nop_list from PhaseOutput::init_buffer [v2] In-Reply-To: References: <2hBEO9Zpoy2wo_pgTXE9v8KG5u1HNdKp3RgQE-4HYcE=.e86088d1-1e25-49b5-9b3c-c2498ec6ca48@github.com> Message-ID: On Mon, 8 Sep 2025 23:15:52 GMT, Dean Long wrote: >> Daniel Jeliński has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits: >> >> - Merge remote-tracking branch 'origin/master' into nops-cleanup >> - Update copyright >> - Remove outdated comment >> - Remove nop list > > Looks good. Thanks @dean-long @eme64 for the review and re-review. The additional tests came back clean. Given that the merge conflict resolution did not change the diff, I'm going to integrate this now. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/27117#issuecomment-3271146356 From djelinski at openjdk.org Tue Sep 9 15:11:33 2025 From: djelinski at openjdk.org (Daniel Jeliński) Date: Tue, 9 Sep 2025 15:11:33 GMT Subject: Integrated: 8366971: C2: Remove unused nop_list from PhaseOutput::init_buffer In-Reply-To: <2hBEO9Zpoy2wo_pgTXE9v8KG5u1HNdKp3RgQE-4HYcE=.e86088d1-1e25-49b5-9b3c-c2498ec6ca48@github.com> References: <2hBEO9Zpoy2wo_pgTXE9v8KG5u1HNdKp3RgQE-4HYcE=.e86088d1-1e25-49b5-9b3c-c2498ec6ca48@github.com> Message-ID: <4nmZRAStXqQVsqqb7t9AyH_xhcplf0YCUW4nJ2nMf9E=.0b820a97-cdc7-4139-89d4-856a59ed2cef@github.com> On Fri, 5 Sep 2025 13:02:00 GMT, Daniel Jeliński wrote: > The nop list has never been used in the history of OpenJDK. Let's clean it up. > > Tested with Mach5 tier 1-5, no related failures. This pull request has now been integrated. Changeset: cc6d34b2 Author: Daniel Jeliński URL: https://git.openjdk.org/jdk/commit/cc6d34b2fa299a68a05e65e25c1f41dffa67c118 Stats: 83 lines in 11 files changed: 1 ins; 77 del; 5 mod 8366971: C2: Remove unused nop_list from PhaseOutput::init_buffer Reviewed-by: epeter, dlong ------------- PR: https://git.openjdk.org/jdk/pull/27117 From djelinski at openjdk.org Tue Sep 9 15:25:25 2025 From: djelinski at openjdk.org (Daniel Jeliński) Date: Tue, 9 Sep 2025 15:25:25 GMT Subject: RFR: 8366984: Remove delay slot support [v2] In-Reply-To: References: Message-ID: > SPARC was the only supported architecture that uses a delay slot. The SPARC port was removed in JDK 15, and the code is effectively dead. Let's remove it. > > The changes are no-op on all architectures that do not use delay slots. I still tested tier 1-5 on mach5, no related failures. Daniel Jeliński has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
------------- Changes: - all: https://git.openjdk.org/jdk/pull/27119/files - new: https://git.openjdk.org/jdk/pull/27119/files/330d5ad1..330d5ad1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=27119&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27119&range=00-01 Stats: 0 lines in 0 files changed: 0 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/27119.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27119/head:pull/27119 PR: https://git.openjdk.org/jdk/pull/27119 From dfenacci at openjdk.org Tue Sep 9 15:37:50 2025 From: dfenacci at openjdk.org (Damon Fenacci) Date: Tue, 9 Sep 2025 15:37:50 GMT Subject: RFR: 8360031: C2 compilation asserts in MemBarNode::remove [v4] In-Reply-To: References: Message-ID: On Mon, 8 Sep 2025 23:53:49 GMT, Dean Long wrote: >> Damon Fenacci has updated the pull request incrementally with one additional commit since the last revision: >> >> JDK-8360031: add assert condition and make consume method argument escape > > src/hotspot/share/opto/memnode.cpp line 4232: > >> 4230: >> 4231: void MemBarNode::remove(PhaseIterGVN *igvn) { >> 4232: if (outcnt() != 2) { > > By itself, this allows outcnt() == 0, so maybe we need to continue to fail if that happens. I added the condition to the assert. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26556#discussion_r2334037886 From dfenacci at openjdk.org Tue Sep 9 15:37:48 2025 From: dfenacci at openjdk.org (Damon Fenacci) Date: Tue, 9 Sep 2025 15:37:48 GMT Subject: RFR: 8360031: C2 compilation asserts in MemBarNode::remove [v4] In-Reply-To: References: Message-ID: > # Issue > While compiling `java.util.zip.ZipFile` in C2 this assert is triggered > https://github.com/openjdk/jdk/blob/a2e86ff3c56209a14c6e9730781eecd12c81d170/src/hotspot/share/opto/memnode.cpp#L4235 > > # Cause > While compiling the constructor of java.util.zip.ZipFile$CleanableResource the following happens: > * we insert a trailing `MemBarStoreStore` in the constructor > before_folding > > * during IGVN we completely fold the memory subtree of the `MemBarStoreStore` node. The node still has a control output attached. > after_folding > > * later during the same IGVN run the `MemBarStoreStore` node is handled and we try to remove it (because the `Allocate` node of the `MemBar` is not escaping the thread) https://github.com/openjdk/jdk/blob/7b7136b4eca15693cfcd46ae63d644efc8a88d2c/src/hotspot/share/opto/memnode.cpp#L4301-L4302 > * the assert https://github.com/openjdk/jdk/blob/7b7136b4eca15693cfcd46ae63d644efc8a88d2c/src/hotspot/share/opto/memnode.cpp#L4235 > triggers because the barrier has only 1 (control) output and is a `MemBarStoreStore` (not `Initialize`) barrier > > The issue happens only when `UseStoreStoreForCtor` is set (which is also the default), which makes C2 use `MemBarStoreStore` instead of `MemBarRelease` at the end of constructors. `MemBarStoreStore` are processed separately by EA and this happens after the IGVN pass that folds the memory subtree. `MemBarRelease` on the other hand are handled during the same IGVN pass before the memory subtree gets removed and it's still got 2 outputs (assert skipped). 
> > # Fix > Adapting the assert to accept that `MemBarStoreStore` can also have `!= 2` outputs (when `+UseStoreStoreForCtor` is used) seems to be an OK solution as this seems like a perfectly plausible situation. > > # Testing > Unfortunately reproducing the issue with a simple regression test has proven very hard. The test seems to rely on very peculiar profiling and IGVN worklist sequence. JBS replay compilation passes. Running JCK's `api/java_util` 100 times triggers the assert a couple of times on average before the fix, none after. > Tier 1-3+ tests passed. Damon Fenacci has updated the pull request incrementally with one additional commit since the last revision: JDK-8360031: add assert condition and make consume method argument escape ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26556/files - new: https://git.openjdk.org/jdk/pull/26556/files/57073b96..f5406f30 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26556&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26556&range=02-03 Stats: 5 lines in 2 files changed: 3 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/26556.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26556/head:pull/26556 PR: https://git.openjdk.org/jdk/pull/26556 From dfenacci at openjdk.org Tue Sep 9 15:44:27 2025 From: dfenacci at openjdk.org (Damon Fenacci) Date: Tue, 9 Sep 2025 15:44:27 GMT Subject: RFR: 8360031: C2 compilation asserts in MemBarNode::remove [v4] In-Reply-To: References: Message-ID: On Tue, 9 Sep 2025 15:37:48 GMT, Damon Fenacci wrote: >> # Issue >> While compiling `java.util.zip.ZipFile` in C2 this assert is triggered >> https://github.com/openjdk/jdk/blob/a2e86ff3c56209a14c6e9730781eecd12c81d170/src/hotspot/share/opto/memnode.cpp#L4235 >> >> # Cause >> While compiling the constructor of java.util.zip.ZipFile$CleanableResource the following happens: >> * we insert a trailing `MemBarStoreStore` in the constructor >> before_folding >> >> * during IGVN we completely 
fold the memory subtree of the `MemBarStoreStore` node. The node still has a control output attached. >> after_folding >> >> * later during the same IGVN run the `MemBarStoreStore` node is handled and we try to remove it (because the `Allocate` node of the `MemBar` is not escaping the thread) https://github.com/openjdk/jdk/blob/7b7136b4eca15693cfcd46ae63d644efc8a88d2c/src/hotspot/share/opto/memnode.cpp#L4301-L4302 >> * the assert https://github.com/openjdk/jdk/blob/7b7136b4eca15693cfcd46ae63d644efc8a88d2c/src/hotspot/share/opto/memnode.cpp#L4235 >> triggers because the barrier has only 1 (control) output and is a `MemBarStoreStore` (not `Initialize`) barrier >> >> The issue happens only when `UseStoreStoreForCtor` is set (which is also the default), which makes C2 use `MemBarStoreStore` instead of `MemBarRelease` at the end of constructors. `MemBarStoreStore` are processed separately by EA and this happens after the IGVN pass that folds the memory subtree. `MemBarRelease` on the other hand are handled during the same IGVN pass before the memory subtree gets removed and it's still got 2 outputs (assert skipped). >> >> # Fix >> Adapting the assert to accept that `MemBarStoreStore` can also have `!= 2` outputs (when `+UseStoreStoreForCtor` is used) seems to be an OK solution as this seems like a perfectly plausible situation. >> >> # Testing >> Unfortunately reproducing the issue with a simple regression test has proven very hard. The test seems to rely on very peculiar profiling and IGVN worklist sequence. JBS replay compilation passes. Running JCK's `api/java_util` 100 times triggers the assert a couple of times on average before the fix, none after. >> Tier 1-3+ tests passed. 
> > Damon Fenacci has updated the pull request incrementally with one additional commit since the last revision: > > JDK-8360031: add assert condition and make consume method argument escape The fix made the `ConstructorBarrier.java` JTREG test fail because the argument of the `consume` method wasn't actually escaping (and IGVN was removing the MemBar). So I added an assignment to a volatile field to make it escape. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26556#issuecomment-3271291568 From epeter at openjdk.org Tue Sep 9 16:03:30 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 9 Sep 2025 16:03:30 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v5] In-Reply-To: References: Message-ID: On Tue, 9 Sep 2025 14:39:51 GMT, Marc Chevalier wrote: >> Also: are these all supposed to be projections of a specific kind? We could also test for that. You can also add that to a future RFE. > >> Can you add a comment, why it can be arbitrarily large? > > Maybe I'm very wrong about what is a CatchNode, but: > > try { > ... > } > catch ( ... ) { ... } > catch ( ... ) { ... } > catch ( ... ) { ... } > > 4 outputs: 3 handlers + 1 fallthrough. > >> Also: are these all supposed to be projections of a specific kind? We could also test for that. You can also add that to a future RFE. > > I'd rather do it separately. We can always check more things, but we need to draw the line and that is safe to add later. 
Fair enough, thanks for the explanations :) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2334110704 From epeter at openjdk.org Tue Sep 9 16:06:43 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 9 Sep 2025 16:06:43 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v5] In-Reply-To: References: Message-ID: On Tue, 9 Sep 2025 14:46:14 GMT, Marc Chevalier wrote: >> src/hotspot/share/opto/graphInvariants.cpp line 528: >> >>> 526: if (!center->is_CountedLoop() && !center->is_LongCountedLoop()) { >>> 527: return CheckResult::NOT_APPLICABLE; >>> 528: } >> >> Actually: why not apply that to `OuterStripMinedLoop` as well? Or any `BaseCountedLoop`? Are there more than these 3 cases? If there are ever more, they should probably also adhere to this backedge pattern, we'll just need an extension. But it would be nice to trip over something here if we ever do extend. > > I'm going to push back on that. I would rather keep this one about counted loops, which have more structure that is HEAVILY relied on; I haven't enumerated all of it, but that can be done. > > One can make another check for the few things that hold for other flavors. My understanding is this: Any kind of loop has to have a matching end node, and a backedge. That is essentially the structure you are checking for int and long loops, but it also holds for the other loops. 
If you don't want to do it now, then note it down and consider it in a future RFE ;) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2334116311 From mchevalier at openjdk.org Tue Sep 9 16:17:29 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Tue, 9 Sep 2025 16:17:29 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v6] In-Reply-To: References: Message-ID: > Some crashes are consequences of earlier misshaped ideal graphs, which could be detected earlier, closer to the source, before the possibly many transformations that lead to the crash. > > Let's verify that the ideal graph is well-shaped earlier then! I propose here such a feature. This runs after IGVN, because at this point, the graph, should be cleaned up for any weirdness happening earlier or during IGVN. > > This feature is enabled with the develop flag `VerifyIdealStructuralInvariants`. Open to renaming. No problem with me! This feature is only available in debug builds, and most of the code is even not compiled in product, since it uses some debug-only functions, such as `Node::dump` or `Node::Name`. > > For now, only local checks are implemented: they are checks that only look at a node and its neighborhood, wherever it happens in the graph. Typically: under a `If` node, we have a `IfTrue` and a `IfFalse`. To ease development, each check is implemented in its own class, independently of the others. Nevertheless, one needs to do always the same kind of things: checking there is an output of such type, checking there is N inputs, that the k-th input has such type... To ease writing such checks, in a readable way, and in a less error-prone way than pile of copy-pasted code that manually traverse the graph, I propose a set of compositional helpers to write patterns that can be matched against the ideal graph. Since these patterns are... patterns, so not related to a specific graph, they can be allocated once and forever. 
When used, one provides the node (called center) around which one want to check if the pattern holds. > > On top of making the description of pattern easier, these helpers allows nice printing in case of error, by showing the path from the center to the violating node. For instance (made up for the purpose of showing the formatting), a violation with a path climbing only inputs: > > 1 failure for node > 211 OuterStripMinedLoopEnd === 215 39 [[ 212 198 ]] P=0,948966, C=23799,000000 > At node > 209 CountedLoopEnd === 182 208 [[ 210 197 ]] [lt] P=0,948966, C=23799,000000 !orig=[196] !jvms: StringLatin1::equals @ bci:12 (line 100) > From path: > [center] 211 OuterStripMinedLoopEnd === 215 39 [[ 212 198 ]] P=0,948966, C=23799,000000 > <-(0)- 215 SafePoint === 210 1 7 1 1 216 37 54 185 [[ 211 ]] SafePoint !orig=186 !jvms: StringLatin1::equals @ bci:29 (line 100) > <-(0)- 210 IfFalse === 209 [[ 215 216 ]] #0 !orig=198 !jvms: StringL... Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: lot of fixes, porting patterns in other files ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26362/files - new: https://git.openjdk.org/jdk/pull/26362/files/ea78a5a3..99040b8e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26362&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26362&range=04-05 Stats: 893 lines in 4 files changed: 498 ins; 234 del; 161 mod Patch: https://git.openjdk.org/jdk/pull/26362.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26362/head:pull/26362 PR: https://git.openjdk.org/jdk/pull/26362 From epeter at openjdk.org Tue Sep 9 16:27:26 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 9 Sep 2025 16:27:26 GMT Subject: RFR: 8367243: Format issues with dist dump debug output in PhaseGVN::dead_loop_check Message-ID: The `#` option adds color to the terminal. But that only usually works on people's terminals, and not if it is piped to a file on the server. 
Hence, `#` is only really a debugging feature, and not one to report with in connection with `assert`s. Simply removed the `#`, and fixed some braces and spaces. ------------- Commit messages: - JDK-8367243 Changes: https://git.openjdk.org/jdk/pull/27175/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27175&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8367243 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/27175.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27175/head:pull/27175 PR: https://git.openjdk.org/jdk/pull/27175 From thartmann at openjdk.org Tue Sep 9 16:31:18 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 9 Sep 2025 16:31:18 GMT Subject: RFR: 8367243: Format issues with dist dump debug output in PhaseGVN::dead_loop_check In-Reply-To: References: Message-ID: On Tue, 9 Sep 2025 16:20:35 GMT, Emanuel Peter wrote: > The `#` option adds color to the terminal. But that only usually works on people's terminals, and not if it is piped to a file on the server. Hence, `#` is only really a debugging feature, and not one to report with in connection with `assert`s. > > Simply removed the `#`, and fixed some braces and spaces. Looks good and trivial! ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/27175#pullrequestreview-3202398872 From djelinski at openjdk.org Tue Sep 9 16:59:31 2025 From: djelinski at openjdk.org (Daniel =?UTF-8?B?SmVsacWEc2tp?=) Date: Tue, 9 Sep 2025 16:59:31 GMT Subject: RFR: 8366984: Remove delay slot support [v3] In-Reply-To: References: Message-ID: > SPARC was the only supported architecture that uses a delay slot. The SPARC port was removed in JDK 15, and the code is effectively dead. Let's remove it. > > The changes are no-op on all architectures that do not use delay slots. I still tested tier 1-5 on mach5, no related failures. 
Daniel Jeliński has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 14 commits: - Merge remote-tracking branch 'origin/master' into delay-slot - Revert scope_desc change, breaks macos-aarch64 - Remove remaining comments - Update copyright - Remove commented out code - Remove unused variables - Comment out unused _unconditional_delay_slot - Remove bundle flags - Remove delay slot support from ADL - Clean up delay slot remnants from arm32 code - ... and 4 more: https://git.openjdk.org/jdk/compare/cc6d34b2...fb68b5a8 ------------- Changes: https://git.openjdk.org/jdk/pull/27119/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27119&range=02 Stats: 456 lines in 19 files changed: 1 ins; 407 del; 48 mod Patch: https://git.openjdk.org/jdk/pull/27119.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27119/head:pull/27119 PR: https://git.openjdk.org/jdk/pull/27119 From mchevalier at openjdk.org Tue Sep 9 17:07:40 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Tue, 9 Sep 2025 17:07:40 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v7] In-Reply-To: References: Message-ID: > Some crashes are consequences of earlier misshaped ideal graphs, which could be detected earlier, closer to the source, before the possibly many transformations that lead to the crash. > > Let's verify that the ideal graph is well-shaped earlier then! I propose here such a feature. This runs after IGVN, because at this point, the graph should be cleaned up for any weirdness happening earlier or during IGVN. > > This feature is enabled with the develop flag `VerifyIdealStructuralInvariants`. Open to renaming. No problem with me! This feature is only available in debug builds, and most of the code is not even compiled in product, since it uses some debug-only functions, such as `Node::dump` or `Node::Name`. 
> > For now, only local checks are implemented: they are checks that only look at a node and its neighborhood, wherever it happens in the graph. Typically: under a `If` node, we have a `IfTrue` and a `IfFalse`. To ease development, each check is implemented in its own class, independently of the others. Nevertheless, one needs to do always the same kind of things: checking there is an output of such type, checking there is N inputs, that the k-th input has such type... To ease writing such checks, in a readable way, and in a less error-prone way than pile of copy-pasted code that manually traverse the graph, I propose a set of compositional helpers to write patterns that can be matched against the ideal graph. Since these patterns are... patterns, so not related to a specific graph, they can be allocated once and forever. When used, one provides the node (called center) around which one want to check if the pattern holds. > > On top of making the description of pattern easier, these helpers allows nice printing in case of error, by showing the path from the center to the violating node. For instance (made up for the purpose of showing the formatting), a violation with a path climbing only inputs: > > 1 failure for node > 211 OuterStripMinedLoopEnd === 215 39 [[ 212 198 ]] P=0,948966, C=23799,000000 > At node > 209 CountedLoopEnd === 182 208 [[ 210 197 ]] [lt] P=0,948966, C=23799,000000 !orig=[196] !jvms: StringLatin1::equals @ bci:12 (line 100) > From path: > [center] 211 OuterStripMinedLoopEnd === 215 39 [[ 212 198 ]] P=0,948966, C=23799,000000 > <-(0)- 215 SafePoint === 210 1 7 1 1 216 37 54 185 [[ 211 ]] SafePoint !orig=186 !jvms: StringLatin1::equals @ bci:29 (line 100) > <-(0)- 210 IfFalse === 209 [[ 215 216 ]] #0 !orig=198 !jvms: StringL... 
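The compositional idea described above (small graph-independent patterns, checked around a center node, with the path to the violating node reported on failure) can be sketched in a few lines. The following is a toy model in Java, not the PR's C++ code: `PatternSketch`, `Node`, `kindIs`, `atInput`, `and`, and `EXAMPLE_PATTERN` are all hypothetical names, and the node ids and kinds are borrowed from the made-up dump above purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class PatternSketch {
    // Toy graph node: an id, a kind name, and ordered inputs. Not HotSpot's Node.
    static final class Node {
        final int id;
        final String kind;
        final List<Node> in = new ArrayList<>();
        Node(int id, String kind, Node... inputs) {
            this.id = id;
            this.kind = kind;
            for (Node n : inputs) in.add(n);
        }
        @Override public String toString() { return id + " " + kind; }
    }

    // A pattern is graph-independent; the node to check around is passed at use time.
    interface Pattern {
        boolean check(Node center, List<String> path, StringBuilder err);
    }

    // Matches when the node has the expected kind.
    static Pattern kindIs(String kind) {
        return (center, path, err) -> {
            if (center.kind.equals(kind)) return true;
            err.append("expected ").append(kind).append(" but found ").append(center);
            return false;
        };
    }

    // Applies a sub-pattern to the k-th input, recording the step taken.
    static Pattern atInput(int k, Pattern sub) {
        return (center, path, err) -> {
            if (k >= center.in.size()) {
                err.append("input ").append(k).append(" out of range at ").append(center);
                return false;
            }
            Node next = center.in.get(k);
            path.add("<-(" + k + ")- " + next);
            boolean ok = sub.check(next, path, err);
            if (ok) path.remove(path.size() - 1); // keep the step only on failure
            return ok;
        };
    }

    // All sub-patterns must hold on the same node.
    static Pattern and(Pattern... subs) {
        return (center, path, err) -> {
            for (Pattern p : subs) {
                if (!p.check(center, path, err)) return false;
            }
            return true;
        };
    }

    // Built once, reusable for any graph (an illustrative shape, not a real invariant).
    static final Pattern EXAMPLE_PATTERN =
        and(kindIs("OuterStripMinedLoopEnd"),
            atInput(0, and(kindIs("SafePoint"),
                           atInput(0, and(kindIs("IfFalse"),
                                          atInput(0, kindIs("CountedLoopEnd")))))));

    // Returns "ok", or the error message followed by the path from the center.
    static String verify(Node center) {
        List<String> path = new ArrayList<>();
        StringBuilder err = new StringBuilder();
        if (EXAMPLE_PATTERN.check(center, path, err)) return "ok";
        return err + " via [center] " + center + " " + String.join(" ", path);
    }

    public static void main(String[] args) {
        Node good = new Node(211, "OuterStripMinedLoopEnd",
            new Node(215, "SafePoint", new Node(210, "IfFalse", new Node(209, "CountedLoopEnd"))));
        Node bad = new Node(211, "OuterStripMinedLoopEnd",
            new Node(215, "SafePoint", new Node(210, "IfFalse", new Node(209, "If"))));
        System.out.println(verify(good));
        System.out.println(verify(bad));
    }
}
```

In this sketch an out-of-range input index is reported through the error buffer rather than by an assert, which is one possible reading of the "respond nicely" suggestion made earlier in the thread; the PR itself may handle that case differently.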
Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: A better way to make them not debug-only, without very ad-hoc hacking ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26362/files - new: https://git.openjdk.org/jdk/pull/26362/files/99040b8e..a69b9677 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26362&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26362&range=05-06 Stats: 132 lines in 3 files changed: 106 ins; 13 del; 13 mod Patch: https://git.openjdk.org/jdk/pull/26362.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26362/head:pull/26362 PR: https://git.openjdk.org/jdk/pull/26362 From eosterlund at openjdk.org Tue Sep 9 19:37:31 2025 From: eosterlund at openjdk.org (Erik Österlund) Date: Tue, 9 Sep 2025 19:37:31 GMT Subject: RFR: 8361376: Regressions 1-6% in several Renaissance in 26-b4 only MacOSX aarch64 [v5] In-Reply-To: References: Message-ID: On Mon, 4 Aug 2025 21:26:22 GMT, Dean Long wrote: >> This PR removes the recently added lock around set_guard_value, using instead Atomic::cmpxchg to atomically update bit-fields of the guard value. Further, it takes a fast-path that uses the previous direct store when at a safepoint. Combined, these changes should get us back to almost where we were before in terms of overhead. If necessary, we could go even further and allow make_not_entrant() to perform a direct byte store, leaving 24 bits for the guard value. > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > one unconditional release should be enough Sorry for the delay. Looks good. ------------- Marked as reviewed by eosterlund (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/26399#pullrequestreview-3203004583 From dlong at openjdk.org Tue Sep 9 22:50:06 2025 From: dlong at openjdk.org (Dean Long) Date: Tue, 9 Sep 2025 22:50:06 GMT Subject: RFR: 8360031: C2 compilation asserts in MemBarNode::remove [v4] In-Reply-To: References: Message-ID: On Tue, 9 Sep 2025 15:37:48 GMT, Damon Fenacci wrote: >> # Issue >> While compiling `java.util.zip.ZipFile` in C2 this assert is triggered >> https://github.com/openjdk/jdk/blob/a2e86ff3c56209a14c6e9730781eecd12c81d170/src/hotspot/share/opto/memnode.cpp#L4235 >> >> # Cause >> While compiling the constructor of java.util.zip.ZipFile$CleanableResource the following happens: >> * we insert a trailing `MemBarStoreStore` in the constructor >> before_folding >> >> * during IGVN we completely fold the memory subtree of the `MemBarStoreStore` node. The node still has a control output attached. >> after_folding >> >> * later during the same IGVN run the `MemBarStoreStore` node is handled and we try to remove it (because the `Allocate` node of the `MemBar` is not escaping the thread) https://github.com/openjdk/jdk/blob/7b7136b4eca15693cfcd46ae63d644efc8a88d2c/src/hotspot/share/opto/memnode.cpp#L4301-L4302 >> * the assert https://github.com/openjdk/jdk/blob/7b7136b4eca15693cfcd46ae63d644efc8a88d2c/src/hotspot/share/opto/memnode.cpp#L4235 >> triggers because the barrier has only 1 (control) output and is a `MemBarStoreStore` (not `Initialize`) barrier >> >> The issue happens only when `UseStoreStoreForCtor` is set (which is the default), which makes C2 use `MemBarStoreStore` instead of `MemBarRelease` at the end of constructors. `MemBarStoreStore` are processed separately by EA and this happens after the IGVN pass that folds the memory subtree. `MemBarRelease` on the other hand are handled during the same IGVN pass before the memory subtree gets removed and it's still got 2 outputs (assert skipped).
>> >> # Fix >> Adapting the assert to accept that `MemBarStoreStore` can also have `!= 2` outputs (when `+UseStoreStoreForCtor` is used) seems to be an OK solution as this seems like a perfectly plausible situation. >> >> # Testing >> Unfortunately reproducing the issue with a simple regression test has proven very hard. The test seems to rely on very peculiar profiling and IGVN worklist sequence. JBS replay compilation passes. Running JCK's `api/java_util` 100 times triggers the assert a couple of times on average before the fix, none after. >> Tier 1-3+ tests passed. > > Damon Fenacci has updated the pull request incrementally with one additional commit since the last revision: > > JDK-8360031: add assert condition and make consume method argument escape LGTM, but let's wait for @vnkozlov to approve it. ------------- Marked as reviewed by dlong (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26556#pullrequestreview-3203613679 From duke at openjdk.org Tue Sep 9 23:04:08 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Tue, 9 Sep 2025 23:04:08 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v47] In-Reply-To: References: Message-ID: > This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). > > When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. > > This does not modify existing code paths and therefore does not benefit much from existing tests. 
New tests were created to test the new functionality > > Additional Testing: > - [x] Linux x64 fastdebug tier 1/2/3/4 > - [x] Linux aarch64 fastdebug tier 1/2/3/4 Chad Rakoczy has updated the pull request incrementally with one additional commit since the last revision: Fix race when not installed nmethod is deoptimized ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23573/files - new: https://git.openjdk.org/jdk/pull/23573/files/a2051637..bf18a4c8 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23573&range=46 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23573&range=45-46 Stats: 8 lines in 4 files changed: 2 ins; 2 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/23573.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23573/head:pull/23573 PR: https://git.openjdk.org/jdk/pull/23573 From duke at openjdk.org Tue Sep 9 23:22:52 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Tue, 9 Sep 2025 23:22:52 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v46] In-Reply-To: References: Message-ID: On Sat, 30 Aug 2025 00:32:02 GMT, Vladimir Kozlov wrote: > It failed on linux-x64 and linux-aarch64. I tried locally on linux-x64 but it passed. Sorry for the late response, I have been on vacation. The test failed due to a race condition involving the de-optimization of a `not_installed` nmethod. `CompiledICLocker` uses `CompiledICProtectionBehaviour::is_safe(nm)` to determine whether it needs to acquire the `CompiledIC_lock`. If the nmethod is `not_installed` at the time of the check, the lock is not acquired. However, if the nmethod is de-optimized and its state transitions to `not_entrant`, the next evaluation of `is_safe(nm)` will return false because the nmethod is no longer `not_installed`. The fix is to ensure that the `NMethodState_lock` is held when checking `nmethod::is_not_installed`, to prevent concurrent state changes that could lead to this race.
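[Editorial illustration, not part of the original message.] The race described above is a check-then-act problem: a state check decides whether locking is needed, and the state can change between that check and the work that relies on it. The following is a minimal Java sketch of the general pattern only, assuming a simplified three-state object; `StateCheckSketch`, `stateLock`, `makeNotEntrant` and `runIfNotInstalled` are hypothetical names and this is not the HotSpot code or the actual patch.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical, simplified model of an object whose state may change concurrently.
final class StateCheckSketch {
    enum State { NOT_INSTALLED, IN_USE, NOT_ENTRANT }

    // Assumption: a single lock guards every state transition.
    private final ReentrantLock stateLock = new ReentrantLock();
    private State state = State.NOT_INSTALLED;

    // A state transition, performed only while holding stateLock.
    void makeNotEntrant() {
        stateLock.lock();
        try {
            state = State.NOT_ENTRANT;
        } finally {
            stateLock.unlock();
        }
    }

    // Runs the action only if the state is NOT_INSTALLED, and keeps holding
    // stateLock while the action runs, so the state cannot change between
    // the check and the work that depends on it.
    boolean runIfNotInstalled(Runnable action) {
        stateLock.lock();
        try {
            if (state != State.NOT_INSTALLED) {
                return false;
            }
            action.run();
            return true;
        } finally {
            stateLock.unlock();
        }
    }
}
```

In this sketch a caller that gets `false` back would have to fall back to whatever stronger locking the other states require; how HotSpot does that is not shown here.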
------------- PR Comment: https://git.openjdk.org/jdk/pull/23573#issuecomment-3272584418 From dlong at openjdk.org Tue Sep 9 23:31:01 2025 From: dlong at openjdk.org (Dean Long) Date: Tue, 9 Sep 2025 23:31:01 GMT Subject: RFR: 8361376: Regressions 1-6% in several Renaissance in 26-b4 only MacOSX aarch64 [v5] In-Reply-To: <-MqvO74Up2R0qmEDtgyGY-yScxZ-v6ZQWxDtSxpKO_g=.56d4eeca-670d-41e4-9e96-ba20b1b44100@github.com> References: <-MqvO74Up2R0qmEDtgyGY-yScxZ-v6ZQWxDtSxpKO_g=.56d4eeca-670d-41e4-9e96-ba20b1b44100@github.com> Message-ID: On Wed, 27 Aug 2025 20:17:07 GMT, Erik Österlund wrote: >> @fisk , can I get you to review this? > >> @fisk , can I get you to review this? > > Sure! Based on the symptoms you described, my main comment is that we might be looking at the wrong places. I don't know if this is really about lock contention. Perhaps it is indirectly. But you mention there is still some regression with ZGC. > > My hypothesis would be that it is the unnecessary incrementing of the global patching epoch that causes the regression when using ZGC. It is only really needed when disarming the nmethod - in other words when the guard value is set to the good value. > > The point of incrementing the patching epoch is to protect other threads from entering the nmethod without executing an instruction cross modification fence. And all other threads will have to do that. > > Only ZGC uses the mode of nmethod entry barriers that does this due to being the only GC that updates instructions in a concurrent phase on AArch64. We are conservative on AArch64 and ensure the use of appropriate synchronous cross modifying code. But that's not needed when arming, which is what we do when making the nmethod not entrant. Thanks @fisk and @theRealAph .
------------- PR Comment: https://git.openjdk.org/jdk/pull/26399#issuecomment-3272594352 From dlong at openjdk.org Tue Sep 9 23:31:03 2025 From: dlong at openjdk.org (Dean Long) Date: Tue, 9 Sep 2025 23:31:03 GMT Subject: Integrated: 8361376: Regressions 1-6% in several Renaissance in 26-b4 only MacOSX aarch64 In-Reply-To: References: Message-ID: On Sat, 19 Jul 2025 01:39:12 GMT, Dean Long wrote: > This PR removes the recently added lock around set_guard_value, using instead Atomic::cmpxchg to atomically update bit-fields of the guard value. Further, it takes a fast-path that uses the previous direct store when at a safepoint. Combined, these changes should get us back to almost where we were before in terms of overhead. If necessary, we could go even further and allow make_not_entrant() to perform a direct byte store, leaving 24 bits for the guard value. This pull request has now been integrated. Changeset: f9640398 Author: Dean Long URL: https://git.openjdk.org/jdk/commit/f96403986b99008593e025c4991ee865fce59bb1 Stats: 240 lines in 15 files changed: 128 ins; 71 del; 41 mod 8361376: Regressions 1-6% in several Renaissance in 26-b4 only MacOSX aarch64 Co-authored-by: Martin Doerr Reviewed-by: mdoerr, aph, eosterlund ------------- PR: https://git.openjdk.org/jdk/pull/26399 From missa at openjdk.org Wed Sep 10 01:00:52 2025 From: missa at openjdk.org (Mohamed Issa) Date: Wed, 10 Sep 2025 01:00:52 GMT Subject: RFR: 8364305: Support AVX10 saturating floating point conversion instructions [v10] In-Reply-To: References: Message-ID: > Intel® AVX10 ISA [1] extensions added new saturating floating point conversion instructions which comply with definitions in section 5.8 of the 2019 IEEE-754 standard. They can compute floating point to integral type conversions while also handling special inputs such as NaN, +Infinity, and -Infinity. 
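[Editorial illustration, not part of the original message.] For readers unfamiliar with the special inputs mentioned above: Java's own narrowing casts from floating point to `int` and `long` are defined to saturate (JLS 5.1.3), which is the behaviour the generated code has to provide. The sketch below only demonstrates those language-level semantics; the class name `SaturatingCastSketch` is hypothetical and nothing here shows the instructions C2 emits.

```java
// Demonstrates the saturating semantics of Java floating-point to integral casts.
final class SaturatingCastSketch {
    static int toInt(float f) {
        return (int) f;   // NaN -> 0, out-of-range values clamp to MIN_VALUE / MAX_VALUE
    }

    static long toLong(double d) {
        return (long) d;  // same rules, with the long range
    }

    public static void main(String[] args) {
        System.out.println(toInt(Float.NaN));                // 0
        System.out.println(toInt(Float.POSITIVE_INFINITY));  // Integer.MAX_VALUE
        System.out.println(toInt(Float.NEGATIVE_INFINITY));  // Integer.MIN_VALUE
        System.out.println(toInt(1.0e20f));                  // Integer.MAX_VALUE
        System.out.println(toLong(Double.NaN));              // 0
        System.out.println(toLong(-1.0e300));                // Long.MIN_VALUE
    }
}
```

Casts to `short` or `byte` go through `int` first and then truncate, so they are not covered by this sketch.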
> > Without AVX10.2, the current approach starts by converting the floating point value(s) in the source register to the desired integral value(s) in the destination register. In the scalar case, the CVTTSS2SI (single precision) or CVTTSD2SI (double precision) instruction is used. In the vector case, the CVTTPS2DQ (single precision) or CVTTPD2DQ (double precision) is used. However, if the source contains a special value (NaN, -Infinity, +Infinity, <= Integer.MIN_VALUE, or >= Integer.MAX_VALUE), extra handling is required. The specific sequence of instructions involved depends on the source (single precision vs double precision), destination (long, integer, short, or byte), level of parallelization (scalar vs vector), and supported AVX extension type. Essentially though, the special values are mapped to values (NaN -> 0, -Infinity, <= Integer.MIN_VALUE -> Integer.MIN_VALUE, +Infinity, >= Integer.MAX_VALUE -> Integer.MAX_VALUE) in the integer range with the help of a few temporary registers to store intermediate results. > > This change uses the new AVX10.2 scalar (VCVTTSS2SIS or VCVTTSD2SIS) and vector (VCVTTPS2QQS, VCVTTPS2DQS, VCVTTPD2QQS, and VCVTTPD2DQS) instructions on supported platforms to avoid the extra handling described above. Also, the JTREG tests listed below were used to verify correctness with `-XX:-UseSuperWord` / `-XX:+UseSuperWord` options to exercise both scalar and vector paths. The baseline build used is [OpenJDK v26-b11](https://github.com/openjdk/jdk/releases/tag/jdk-26%2B11). > > 1. `jtreg:test/hotspot/jtreg/compiler/codegen/TestByteDoubleVect.java` > 2. `jtreg:test/hotspot/jtreg/compiler/codegen/TestByteFloatVect.java` > 3. `jtreg:test/hotspot/jtreg/compiler/codegen/TestIntDoubleVect.java` > 4. `jtreg:test/hotspot/jtreg/compiler/codegen/TestIntFloatVect.java` > 5. `jtreg:test/hotspot/jtreg/compiler/codegen/TestLongDoubleVect.java` > 6. `jtreg:test/hotspot/jtreg/compiler/codegen/TestLongFloatVect.java` > 7.
`jtreg:test/hotspot/jtreg/compiler/codegen/TestShortDoubleVect.java` > 8. `jtreg:test/hotspot/jtreg/compiler/codegen/TestShortFloatVect.java` > 9. `jtreg:test/hotspot/jtreg/compiler/vectorapi/VectorFPtoIntCastTest.java` > 10. `jtreg:test/hotspot/jtreg/com... Mohamed Issa has updated the pull request incrementally with one additional commit since the last revision: Add new IR nodes covering x86 floating point conversion instructions ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26919/files - new: https://git.openjdk.org/jdk/pull/26919/files/4d8f3ab6..bc59e4d2 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26919&range=09 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26919&range=08-09 Stats: 121 lines in 3 files changed: 60 ins; 0 del; 61 mod Patch: https://git.openjdk.org/jdk/pull/26919.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26919/head:pull/26919 PR: https://git.openjdk.org/jdk/pull/26919 From missa at openjdk.org Wed Sep 10 01:00:55 2025 From: missa at openjdk.org (Mohamed Issa) Date: Wed, 10 Sep 2025 01:00:55 GMT Subject: RFR: 8364305: Support AVX10 saturating floating point conversion instructions [v9] In-Reply-To: References: Message-ID: <0JotX9md-fjgXvgjODrvDQuHSHQQOI_TW-1U4qNDGz4=.25b4ff36-ea2e-4eba-8a5f-0a2bfe405064@github.com> On Tue, 9 Sep 2025 02:28:45 GMT, Jatin Bhateja wrote: >> Mohamed Issa has updated the pull request incrementally with one additional commit since the last revision: >> >> Check for scalar casting instead of vector casting in tests when disabling vector alignment or compact object headers > > test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java line 432: > >> 430: applyIfCPUFeatureAnd = {"avx", "true", "avx10_2", "false"}) >> 431: @IR(counts = {"cast2DtoX", " >0 "}, phase = CompilePhase.FINAL_CODE, >> 432: applyIfCPUFeature = {"avx10_2", "true"}) > > Please refer to 
https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/lib/ir_framework/IRNode.java#L2638 for adding MachNode IR node based checks Thanks, I added some new nodes. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2335208252 From dlong at openjdk.org Wed Sep 10 01:07:22 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 10 Sep 2025 01:07:22 GMT Subject: RFR: 8366984: Remove delay slot support [v3] In-Reply-To: References: Message-ID: On Tue, 9 Sep 2025 16:59:31 GMT, Daniel Jeliński wrote: >> SPARC was the only supported architecture that uses a delay slot. The SPARC port was removed in JDK 15, and the code is effectively dead. Let's remove it. >> >> The changes are no-op on all architectures that do not use delay slots. I still tested tier 1-5 on mach5, no related failures. > > Daniel Jeliński has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 14 commits: > > - Merge remote-tracking branch 'origin/master' into delay-slot > - Revert scope_desc change, breaks macos-aarch64 > - Remove remaining comments > - Update copyright > - Remove commented out code > - Remove unused variables > - Comment out unused _unconditional_delay_slot > - Remove bundle flags > - Remove delay slot support from ADL > - Clean up delay slot remnants from arm32 code > - ... and 4 more: https://git.openjdk.org/jdk/compare/cc6d34b2...fb68b5a8 Marked as reviewed by dlong (Reviewer). src/hotspot/share/runtime/sharedRuntime.cpp line 3505: > 3503: nm = cb->as_nmethod(); > 3504: method = nm->method(); > 3505: for (ScopeDesc *sd = nm->scope_desc_near(fr.pc()); sd != nullptr; sd = sd->sender()) { It's tempting to try to change this to scope_desc_at(), but also slightly risky if SPARC isn't the only reason it was needed. Should we investigate this in a separate RFE?
------------- PR Review: https://git.openjdk.org/jdk/pull/27119#pullrequestreview-3203943528 PR Review Comment: https://git.openjdk.org/jdk/pull/27119#discussion_r2335213805 From dlong at openjdk.org Wed Sep 10 01:09:16 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 10 Sep 2025 01:09:16 GMT Subject: RFR: 8366461: Remove obsolete method handle invoke logic [v3] In-Reply-To: <_pqvEs0LIlAc7RjFUwg-bpxS3D2v5U7c6In2sG8XLhQ=.57e3aead-6ac4-4a42-89d2-385d7e6ecedf@github.com> References: <_pqvEs0LIlAc7RjFUwg-bpxS3D2v5U7c6In2sG8XLhQ=.57e3aead-6ac4-4a42-89d2-385d7e6ecedf@github.com> Message-ID: On Tue, 2 Sep 2025 20:52:32 GMT, Dean Long wrote: >> At one time, JSR292 support needed special logic to save and restore SP across method handle intrinsic calls, but that is no longer the case. The only platform that still does the save/restore is arm32, which is no longer necessary. The save/restore can be removed along with related APIs and logic. Note that the arm32 port is largely based on the x86 port, which stopped doing the save/restore in jdk9 ([JDK-8068945](https://bugs.openjdk.org/browse/JDK-8068945)). > > Dean Long has updated the pull request incrementally with three additional commits since the last revision: > > - revert whitespace change > - undo debug changes > - cleanup I need one more review for this. ------------- PR Comment: https://git.openjdk.org/jdk/pull/27059#issuecomment-3272844515 From duke at openjdk.org Wed Sep 10 01:49:14 2025 From: duke at openjdk.org (duke) Date: Wed, 10 Sep 2025 01:49:14 GMT Subject: RFR: 8366588: VectorAPI: Re-intrinsify VectorMask.laneIsSet where the input index is a variable In-Reply-To: References: Message-ID: On Fri, 5 Sep 2025 08:13:28 GMT, erifan wrote: > Intrinsic support for `VectorMask.laneIsSet` with a **variable** input index was introduced in PR #14200, but was inadvertently broken by PR #25673. This PR restores the intrinsic functionality and adds some JTReg tests.
> > Benchmarks on Nvidia Grace machine with 128-bit SVE: > > Benchmark Unit Before Score Error After Score Error Uplift > microMaskLaneIsSetByte128_var ops/ms 21702.14415 91.902159 103472.9391 36.057447 4.767867 > microMaskLaneIsSetByte64_var ops/ms 21468.51868 107.94177 103365.6561 69.47736 4.814754 > microMaskLaneIsSetDouble128_var ops/ms 77489.32791 153.242699 413499.4127 311.854079 5.336211 > microMaskLaneIsSetFloat128_var ops/ms 41034.95204 399.421823 206840.0988 74.702234 5.040583 > microMaskLaneIsSetFloat64_var ops/ms 77607.40268 175.938921 413745.3001 149.716794 5.33126 > microMaskLaneIsSetInt128_var ops/ms 41452.48893 76.143208 206845.9754 59.371129 4.989953 > microMaskLaneIsSetInt64_var ops/ms 77726.2542 173.180518 413427.8838 363.575023 5.319024 > microMaskLaneIsSetLong128_var ops/ms 77646.11218 177.496587 413403.4404 236.609314 5.3242 > microMaskLaneIsSetShort128_var ops/ms 21374.93265 48.13101 103417.4618 34.827021 4.838259 > microMaskLaneIsSetShort64_var ops/ms 41066.19395 353.320621 206801.109 106.408938 5.035799 > > > Benchmarks on Intel 6444y machine with 512-bit avx3: > > Benchmark Unit Before Score Error After Score Error Uplift > microMaskLaneIsSetByte128_var ops/ms 57658.45497 240.209309 211643.8406 29.214532 3.670647 > microMaskLaneIsSetByte256_var ops/ms 57451.68169 116.994128 211609.4652 160.48513 3.683259 > microMaskLaneIsSetByte512_var ops/ms 57530.22411 311.63868 199802.8084 408.144015 3.473005 > microMaskLaneIsSetByte64_var ops/ms 57642.2672 161.406221 205252.4464 196.86852 3.560797 > microMaskLaneIsSetDouble256_var ops/ms 114401.3789 231.797375 361400.344 565.593984 3.159055 > microMaskLaneIsSetDouble512_var ops/ms 57379.27882 159.699503 211476.1138 136.980026 3.685583 > microMaskLaneIsSetFloat128_var ops/ms 113943.9512 141.062663 360855.3915 494.471996 3.166955 > microMaskLaneIsSetFloat256_var ops/ms 57682.78182 138.142053 211659.5098 30.167972 3.66937 > microMaskLaneIsSetFloat512_var ops/ms 57617.66405 301.748599 211246.8588 597.18949 
3.666355 > microMaskLaneIsSetInt128_var ops/ms 113914.5062 118.681382 360856.4465 555.097397 3.167783 > microMaskLaneIsSetInt256_var ops/ms 57681.79883 112.391639 211555.6742 217.556981 3.667633 > microMaskLaneIsSetInt512_var ops/ms 57350.20346 206.146723 211657.7207 68.461571 3.690618 > microMaskLane... @erifan Your change (at version a672dd26c6c7547bca260815ae2e1d7c3652c929) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/27113#issuecomment-3272907016 From duke at openjdk.org Wed Sep 10 01:53:21 2025 From: duke at openjdk.org (erifan) Date: Wed, 10 Sep 2025 01:53:21 GMT Subject: Integrated: 8366588: VectorAPI: Re-intrinsify VectorMask.laneIsSet where the input index is a variable In-Reply-To: References: Message-ID: On Fri, 5 Sep 2025 08:13:28 GMT, erifan wrote: > Intrinsic support for `VectorMask.laneIsSet` with a **variable** input index was introduced in PR #14200, but was inadvertently broken by PR #25673. This PR restores the intrinsic functionality and adds some JTReg tests. 
> > Benchmarks on Nvidia Grace machine with 128-bit SVE: > > Benchmark Unit Before Score Error After Score Error Uplift > microMaskLaneIsSetByte128_var ops/ms 21702.14415 91.902159 103472.9391 36.057447 4.767867 > microMaskLaneIsSetByte64_var ops/ms 21468.51868 107.94177 103365.6561 69.47736 4.814754 > microMaskLaneIsSetDouble128_var ops/ms 77489.32791 153.242699 413499.4127 311.854079 5.336211 > microMaskLaneIsSetFloat128_var ops/ms 41034.95204 399.421823 206840.0988 74.702234 5.040583 > microMaskLaneIsSetFloat64_var ops/ms 77607.40268 175.938921 413745.3001 149.716794 5.33126 > microMaskLaneIsSetInt128_var ops/ms 41452.48893 76.143208 206845.9754 59.371129 4.989953 > microMaskLaneIsSetInt64_var ops/ms 77726.2542 173.180518 413427.8838 363.575023 5.319024 > microMaskLaneIsSetLong128_var ops/ms 77646.11218 177.496587 413403.4404 236.609314 5.3242 > microMaskLaneIsSetShort128_var ops/ms 21374.93265 48.13101 103417.4618 34.827021 4.838259 > microMaskLaneIsSetShort64_var ops/ms 41066.19395 353.320621 206801.109 106.408938 5.035799 > > > Benchmarks on Intel 6444y machine with 512-bit avx3: > > Benchmark Unit Before Score Error After Score Error Uplift > microMaskLaneIsSetByte128_var ops/ms 57658.45497 240.209309 211643.8406 29.214532 3.670647 > microMaskLaneIsSetByte256_var ops/ms 57451.68169 116.994128 211609.4652 160.48513 3.683259 > microMaskLaneIsSetByte512_var ops/ms 57530.22411 311.63868 199802.8084 408.144015 3.473005 > microMaskLaneIsSetByte64_var ops/ms 57642.2672 161.406221 205252.4464 196.86852 3.560797 > microMaskLaneIsSetDouble256_var ops/ms 114401.3789 231.797375 361400.344 565.593984 3.159055 > microMaskLaneIsSetDouble512_var ops/ms 57379.27882 159.699503 211476.1138 136.980026 3.685583 > microMaskLaneIsSetFloat128_var ops/ms 113943.9512 141.062663 360855.3915 494.471996 3.166955 > microMaskLaneIsSetFloat256_var ops/ms 57682.78182 138.142053 211659.5098 30.167972 3.66937 > microMaskLaneIsSetFloat512_var ops/ms 57617.66405 301.748599 211246.8588 597.18949 
3.666355 > microMaskLaneIsSetInt128_var ops/ms 113914.5062 118.681382 360856.4465 555.097397 3.167783 > microMaskLaneIsSetInt256_var ops/ms 57681.79883 112.391639 211555.6742 217.556981 3.667633 > microMaskLaneIsSetInt512_var ops/ms 57350.20346 206.146723 211657.7207 68.461571 3.690618 > microMaskLane... This pull request has now been integrated. Changeset: 53b3e056 Author: erifan Committer: Xiaohong Gong URL: https://git.openjdk.org/jdk/commit/53b3e0567d2801ddf62c5849b219324ddfcb264a Stats: 170 lines in 4 files changed: 168 ins; 0 del; 2 mod 8366588: VectorAPI: Re-intrinsify VectorMask.laneIsSet where the input index is a variable Reviewed-by: shade, xgong, epeter ------------- PR: https://git.openjdk.org/jdk/pull/27113 From dzhang at openjdk.org Wed Sep 10 03:16:41 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Wed, 10 Sep 2025 03:16:41 GMT Subject: RFR: 8367293: RISC-V: enable vectorapi test for VectorMask.laneIsSet Message-ID: Hi, Can you help to review this patch? Thanks! [JDK-8366588](https://bugs.openjdk.org/browse/JDK-8366588) adds a vectorapi test for VectorMask.laneIsSet, which we can also enable on RISC-V. 
### Test (fastdebug) - [x] Run compiler/vectorapi/VectorMaskLaneIsSetTest.java on k1, k230 and sg2042 ------------- Commit messages: - 8367293: RISC-V: enable vectorapi test for VectorMask.laneIsSet Changes: https://git.openjdk.org/jdk/pull/27181/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27181&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8367293 Stats: 7 lines in 1 file changed: 0 ins; 0 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/27181.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27181/head:pull/27181 PR: https://git.openjdk.org/jdk/pull/27181 From fyang at openjdk.org Wed Sep 10 04:06:11 2025 From: fyang at openjdk.org (Fei Yang) Date: Wed, 10 Sep 2025 04:06:11 GMT Subject: RFR: 8367293: RISC-V: enable vectorapi test for VectorMask.laneIsSet In-Reply-To: References: Message-ID: On Wed, 10 Sep 2025 03:10:02 GMT, Dingli Zhang wrote: > Hi, > Can you help to review this patch? Thanks! > > [JDK-8366588](https://bugs.openjdk.org/browse/JDK-8366588) adds a vectorapi test for VectorMask.laneIsSet, which we can also enable on RISC-V. > > ### Test (fastdebug) > - [x] Run compiler/vectorapi/VectorMaskLaneIsSetTest.java on k1, k230 and sg2042 Thanks! ------------- Marked as reviewed by fyang (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/27181#pullrequestreview-3204337742 From galder at openjdk.org Wed Sep 10 04:30:21 2025 From: galder at openjdk.org (Galder Zamarreño) Date: Wed, 10 Sep 2025 04:30:21 GMT Subject: RFR: 8366845: C2 SuperWord: wrong VectorCast after VectorReinterpret with swapped src/dst type In-Reply-To: References: Message-ID: On Fri, 5 Sep 2025 09:00:45 GMT, Emanuel Peter wrote: >> Makes sense @eme64. Happy with the fix and tests :) > > @galderz @iwanowww @TobiHartmann FYI, I filed: > [JDK-8366965](https://bugs.openjdk.org/browse/JDK-8366965) C2 SuperWord: add more tests for MoveF2I / Float.floatToRawIntBits and friends Thanks for the quick turnaround @eme64 on this!
------------- PR Comment: https://git.openjdk.org/jdk/pull/27100#issuecomment-3273275432 From epeter at openjdk.org Wed Sep 10 05:12:12 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 10 Sep 2025 05:12:12 GMT Subject: RFR: 8356813: Improve Mod(I|L)Node::Value [v7] In-Reply-To: References: <2Jf_gfvRlKcmCFoQHp5T0WW_fU_yK5-0Z3z41f00-YU=.164be9f0-fae1-44bb-84c3-846d8c2c0db2@github.com> Message-ID: On Tue, 26 Aug 2025 12:46:31 GMT, Hannes Greule wrote: >> This change improves the precision of the `Mod(I|L)Node::Value()` functions. >> >> I reordered the structure a bit. First, we handle constants, afterwards, we handle ranges. The bottom checks seem to be excessive (`Type::BOTTOM` is covered by using `isa_(int|long)()`, the local bottom is just the full range). Given we can even give reasonable bounds if only one input has any bounds, we don't want to return early. >> The changes after that are commented. Please let me know if the explanations are good, or if you have any suggestions. >> >> ### Monotonicity >> >> Before, a 0 divisor resulted in `Type(Int|Long)::POS`. Initially I wanted to keep it this way, but that violates monotonicity during PhaseCCP. As an example, if we see a 0 divisor first and a 3 afterwards, we might try to go from `>=0` to `-2..2`, but the meet of these would be `>=-2` rather than `-2..2`. Using `Type(Int|Long)::ZERO` instead (zero is always in the resulting value if we cover a range). >> >> ### Testing >> >> I added tests for cases around the relevant bounds. I also ran tier1, tier2, and tier3 but didn't see any related failures after addressing the monotonicity problem described above (I'm having a few unrelated failures on my system currently, so separate testing would be appreciated in case I missed something). >> >> Please review and let me know what you think. >> >> ### Other >> >> The `UMod(I|L)Node`s were adjusted to be more in line with its signed variants. 
This change diverges them again, but similar improvements could be made after #17508. >> >> During experimenting with these changes, I stumbled upon a few things that aren't directly related to this change, but might be worth looking into further: >> - If the divisor is a constant, we will directly replace the `Mod(I|L)Node` with more but less expensive nodes in `::Ideal()`. Type analysis for these nodes combined is less precise, which means we miss potential cases where this would help e.g., removing range checks. Would it make sense to delay the replacement? >> - To force non-negative ranges, I'm using `char`. I noticed that method parameters of sub-int integer types all fall back to `TypeInt::INT`. This seems to be an intentional change of https://github.com/openjdk/jdk/commit/200784d505dd98444c48c9ccb7f2e4df36dcbb6a. The bug report is private, so I can't really judge if that part is necessary, but it seems odd. > > Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: > > review Tests pass, approved. @merykitty @mhaessig your turn. ------------- Marked as reviewed by epeter (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/27181#pullrequestreview-3204446793 From djelinski at openjdk.org Wed Sep 10 06:08:21 2025 From: djelinski at openjdk.org (Daniel =?UTF-8?B?SmVsacWEc2tp?=) Date: Wed, 10 Sep 2025 06:08:21 GMT Subject: RFR: 8366984: Remove delay slot support [v3] In-Reply-To: References: Message-ID: <_tx-ASKQdoHNnXSOi30eyjBgbDtsMY_WaoRPNuqrX80=.f65d62bd-dd4a-448b-b466-f76ca8d01112@github.com> On Wed, 10 Sep 2025 01:03:57 GMT, Dean Long wrote: >> Daniel Jeli?ski has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 14 commits: >> >> - Merge remote-tracking branch 'origin/master' into delay-slot >> - Revert scope_desc change, breaks macos-aarch64 >> - Remove remaining comments >> - Update copyright >> - Remove commented out code >> - Remove unused variables >> - Comment out unused _unconditional_delay_slot >> - Remove bundle flags >> - Remove delay slot support from ADL >> - Clean up delay slot remnants from arm32 code >> - ... and 4 more: https://git.openjdk.org/jdk/compare/cc6d34b2...fb68b5a8 > > src/hotspot/share/runtime/sharedRuntime.cpp line 3505: > >> 3503: nm = cb->as_nmethod(); >> 3504: method = nm->method(); >> 3505: for (ScopeDesc *sd = nm->scope_desc_near(fr.pc()); sd != nullptr; sd = sd->sender()) { > > It's tempting to try to change this to scope_desc_at(), but also slightly risky if SPARC isn't the only reason it was needed. Should we investigate this in a separate RFE? [I tried](https://github.com/djelinski/jdk/actions/runs/17495967429) before I posted this PR. 
Apparently scope_desc_near is also needed on macosx-aarch64:

# Internal Error (nmethod.cpp:668), pid=11090, tid=43267
# guarantee(pd != nullptr) failed: scope must be present

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/27119#discussion_r2335651132

From djelinski at openjdk.org Wed Sep 10 06:19:29 2025
From: djelinski at openjdk.org (Daniel Jeliński)
Date: Wed, 10 Sep 2025 06:19:29 GMT
Subject: RFR: 8366984: Remove delay slot support
In-Reply-To:
References:
Message-ID:

On Tue, 9 Sep 2025 10:56:04 GMT, Emanuel Peter wrote:

>> SPARC was the only supported architecture that uses a delay slot. The SPARC port was removed in JDK 15, and the code is effectively dead. Let's remove it.
>>
>> The changes are no-op on all architectures that do not use delay slots. I still tested tier 1-5 on mach5, no related failures.
>
> Thanks for the answers!
>
> You'll of course have to merge the dependency, and get a second review :)

Thanks @eme64 and @dean-long for the reviews!

-------------

PR Comment: https://git.openjdk.org/jdk/pull/27119#issuecomment-3273459678

From djelinski at openjdk.org Wed Sep 10 06:19:30 2025
From: djelinski at openjdk.org (Daniel Jeliński)
Date: Wed, 10 Sep 2025 06:19:30 GMT
Subject: Integrated: 8366984: Remove delay slot support
In-Reply-To:
References:
Message-ID:

On Fri, 5 Sep 2025 14:24:50 GMT, Daniel Jeliński wrote:

> SPARC was the only supported architecture that uses a delay slot. The SPARC port was removed in JDK 15, and the code is effectively dead. Let's remove it.
>
> The changes are no-op on all architectures that do not use delay slots. I still tested tier 1-5 on mach5, no related failures.

This pull request has now been integrated.
Changeset: b7b01d6f
Author: Daniel Jeliński
URL: https://git.openjdk.org/jdk/commit/b7b01d6f564ae34e913ae51bd2f8243a32807136
Stats: 456 lines in 19 files changed: 1 ins; 407 del; 48 mod

8366984: Remove delay slot support

Reviewed-by: dlong, epeter

-------------

PR: https://git.openjdk.org/jdk/pull/27119

From qxing at openjdk.org Wed Sep 10 06:58:33 2025
From: qxing at openjdk.org (Qizheng Xing)
Date: Wed, 10 Sep 2025 06:58:33 GMT
Subject: RFR: 8360192: C2: Make the type of count leading/trailing zero nodes more precise [v13]
In-Reply-To:
References:
Message-ID:

> The result of count leading/trailing zeros is always non-negative, and the maximum value is integer type's size in bits. In previous versions, when C2 can not know the operand value of a CLZ/CTZ node at compile time, it will generate a full-width integer type for its result. This can significantly affect the efficiency of code in some cases.
>
> This patch makes the type of CLZ/CTZ nodes more precise, to make C2 generate better code.
For example, the following implementation runs ~115% faster on x86-64 with this patch: > > > public static int numberOfNibbles(int i) { > int mag = Integer.SIZE - Integer.numberOfLeadingZeros(i); > return Math.max((mag + 3) / 4, 1); > } > > > Testing: tier1, IR test Qizheng Xing has updated the pull request incrementally with two additional commits since the last revision: - Add random range tests - Add more comments to IR test ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25928/files - new: https://git.openjdk.org/jdk/pull/25928/files/d09d4cb0..79394a25 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25928&range=12 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25928&range=11-12 Stats: 180 lines in 1 file changed: 154 ins; 16 del; 10 mod Patch: https://git.openjdk.org/jdk/pull/25928.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25928/head:pull/25928 PR: https://git.openjdk.org/jdk/pull/25928 From qxing at openjdk.org Wed Sep 10 06:58:35 2025 From: qxing at openjdk.org (Qizheng Xing) Date: Wed, 10 Sep 2025 06:58:35 GMT Subject: RFR: 8360192: C2: Make the type of count leading/trailing zero nodes more precise [v10] In-Reply-To: References: Message-ID: On Tue, 19 Aug 2025 14:00:37 GMT, Emanuel Peter wrote: >> Qizheng Xing has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove redundant `@require` in IR test > > test/hotspot/jtreg/compiler/c2/gvn/TestCountBitsRange.java line 164: > >> 162: return Long.numberOfTrailingZeros(l) / 8; >> 163: } >> 164: } > > Nice examples! Could you please add a short description to most of them, explaining what you are testing with each? It would help me as a reviewer to see if you cover enough cases. > > I'm also missing some cases where you have non-trivial input ranges. And then verification that the output range is correct. 
> > You could look at this example: > https://github.com/openjdk/jdk/pull/25254/files#diff-0e3d89ac8cf0548b69d9bdb0859380bc31de0a772fa7ff211f446a4a5abd4197R220-R248 Added comments for unit tests and random ranges tests like the example. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25928#discussion_r2335744000 From qxing at openjdk.org Wed Sep 10 07:03:02 2025 From: qxing at openjdk.org (Qizheng Xing) Date: Wed, 10 Sep 2025 07:03:02 GMT Subject: RFR: 8360192: C2: Make the type of count leading/trailing zero nodes more precise [v14] In-Reply-To: References: Message-ID: > The result of count leading/trailing zeros is always non-negative, and the maximum value is integer type's size in bits. In previous versions, when C2 can not know the operand value of a CLZ/CTZ node at compile time, it will generate a full-width integer type for its result. This can significantly affect the efficiency of code in some cases. > > This patch makes the type of CLZ/CTZ nodes more precise, to make C2 generate better code. 
For example, the following implementation runs ~115% faster on x86-64 with this patch:
>
>
> public static int numberOfNibbles(int i) {
>   int mag = Integer.SIZE - Integer.numberOfLeadingZeros(i);
>   return Math.max((mag + 3) / 4, 1);
> }
>
>
> Testing: tier1, IR test

Qizheng Xing has updated the pull request incrementally with one additional commit since the last revision:

  Remove redundant import

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/25928/files
  - new: https://git.openjdk.org/jdk/pull/25928/files/79394a25..f5d1e53d

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=25928&range=13
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25928&range=12-13

  Stats: 1 line in 1 file changed: 0 ins; 1 del; 0 mod
  Patch: https://git.openjdk.org/jdk/pull/25928.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/25928/head:pull/25928

PR: https://git.openjdk.org/jdk/pull/25928

From qxing at openjdk.org Wed Sep 10 07:18:16 2025
From: qxing at openjdk.org (Qizheng Xing)
Date: Wed, 10 Sep 2025 07:18:16 GMT
Subject: RFR: 8360192: C2: Make the type of count leading/trailing zero nodes more precise [v9]
In-Reply-To: <9xCpJGY6CFKPAt4VtDY23_Tr3SE9tUebdMF3pAYWhFA=.281e0b84-bfad-466b-b290-918cf1fa83d1@github.com>
References: <9xCpJGY6CFKPAt4VtDY23_Tr3SE9tUebdMF3pAYWhFA=.281e0b84-bfad-466b-b290-918cf1fa83d1@github.com>
Message-ID:

On Tue, 9 Sep 2025 08:40:35 GMT, Emanuel Peter wrote:

>> Hi @jatin-bhateja, I've added a micro benchmark that includes the `numberOfNibbles` implementation from this PR description and your micro kernel.
>>
>> Here's my test results on an Intel(R) Xeon(R) Platinum:
>>
>>
>> # Baseline:
>> Benchmark                                  Mode  Cnt     Score   Error  Units
>> CountLeadingZeros.benchClzLongConstrained  avgt   15  1517.888 ± 5.691  ns/op
>> CountLeadingZeros.benchNumberOfNibbles     avgt   15  1094.422 ± 1.753  ns/op
>>
>> # This patch:
>> Benchmark                                  Mode  Cnt     Score   Error  Units
>> CountLeadingZeros.benchClzLongConstrained  avgt   15     0.948 ± 0.002  ns/op
>> CountLeadingZeros.benchNumberOfNibbles     avgt   15   942.438 ± 1.742  ns/op
>
> @MaxXSoft Feel free to just ping me again when you want another review :)
> FYI: I'll be on a longer vacation starting in about a week, so don't expect me to respond then.

@eme64 Thank you for your patience and kind reviews! I've updated this patch based on your suggestions. This patch is now ready for further review.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/25928#issuecomment-3273636620

From duke at openjdk.org Wed Sep 10 07:34:58 2025
From: duke at openjdk.org (erifan)
Date: Wed, 10 Sep 2025 07:34:58 GMT
Subject: RFR: 8354242: VectorAPI: combine vector not operation with compare [v11]
In-Reply-To:
References:
Message-ID: <_ZKvuU_IqxgtXTVqz8yS2XOnItp0mtlemk2CR2p551s=.5c2ce4d5-f851-4acc-9994-adc76813d640@github.com>

On Wed, 9 Jul 2025 06:08:33 GMT, erifan wrote:

>> This patch optimizes the following patterns:
>> For integer types:
>>
>> (XorV (VectorMaskCmp src1 src2 cond) (Replicate -1))
>>    => (VectorMaskCmp src1 src2 ncond)
>> (XorVMask (VectorMaskCmp src1 src2 cond) (MaskAll m1))
>>    => (VectorMaskCmp src1 src2 ncond)
>>
>> cond can be eq, ne, le, ge, lt, gt, ule, uge, ult and ugt, ncond is the negative comparison of cond.
>>
>> For float and double types:
>>
>> (XorV (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (Replicate -1))
>>    => (VectorMaskCast (VectorMaskCmp src1 src2 ncond))
>> (XorVMask (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (MaskAll m1))
>>    => (VectorMaskCast (VectorMaskCmp src1 src2 ncond))
>>
>> cond can be eq or ne.
>>
>> Benchmarks on Nvidia Grace machine with 128-bit SVE2: With option `-XX:UseSVE=2`:
>>
>> Benchmark                   Unit   Before Score  Error        After Score  Error        Uplift
>> testCompareEQMaskNotByte    ops/s  7912127.225   2677.289518  10266136.26  8955.008548  1.29
>> testCompareEQMaskNotDouble  ops/s  884737.6799   446.963779   1179760.772  448.031844   1.33
>> testCompareEQMaskNotFloat   ops/s  1765045.787   682.332214   2359520.803  896.305743   1.33
>> testCompareEQMaskNotInt     ops/s  1787221.411   977.743935   2353952.519  960.069976   1.31
>> testCompareEQMaskNotLong    ops/s  895297.1974   673.44808    1178449.02   323.804205   1.31
>> testCompareEQMaskNotShort   ops/s  3339987.002   3415.2226    4712761.965  2110.862053  1.41
>> testCompareGEMaskNotByte    ops/s  7907615.16    4094.243652  10251646.9   9486.699831  1.29
>> testCompareGEMaskNotInt     ops/s  1683738.958   4233.813092  2352855.205  1251.952546  1.39
>> testCompareGEMaskNotLong    ops/s  854496.1561   8594.598885  1177811.493  521.1229     1.37
>> testCompareGEMaskNotShort   ops/s  3341860.309   1578.975338  4714008.434  1681.10365   1.41
>> testCompareGTMaskNotByte    ops/s  7910823.674   2993.367032  10245063.58  9774.75138   1.29
>> testCompareGTMaskNotInt     ops/s  1673393.928   3153.099431  2353654.521  1190.848583  1.4
>> testCompareGTMaskNotLong    ops/s  849405.9159   2432.858159  1177952.041  359.96413    1.38
>> testCompareGTMaskNotShort   ops/s  3339509.141   3339.976585  4711442.496  2673.364893  1.41
>> testCompareLEMaskNotByte    ops/s  7911340.004   3114.69191   10231626.5   27134.20035  1.29
>> testCompareLEMaskNotInt     ops/s  1675812.113   1340.969885  2353255.341  1452.4522    1.4
>> testCompareLEMaskNotLong    ops/s  848862.8036   6564.841731  1177763.623  539.290106   1.38
>> testCompareLEMaskNotShort   ops/s  3324951.54    2380.29473   4712116.251  1544.559684  1.41
>> testCompareLTMaskNotByte    ops/s  7910390.844   2630.861436  10239567.69  6487.441672  1.29
>> testCompareLTMaskNotInt     ops/s  16721...
> > erifan has updated the pull request incrementally with one additional commit since the last revision: > > Update the code comment @eme64 Thank you for your patience in reviewing this PR. I'm doing some internal testing and expect to push a new commit next week. I'll be on vacation for the next two days. Thank you! ------------- PR Review: https://git.openjdk.org/jdk/pull/24674#pullrequestreview-3204218047 From duke at openjdk.org Wed Sep 10 07:35:03 2025 From: duke at openjdk.org (erifan) Date: Wed, 10 Sep 2025 07:35:03 GMT Subject: RFR: 8354242: VectorAPI: combine vector not operation with compare [v11] In-Reply-To: References: Message-ID: On Tue, 9 Sep 2025 12:56:40 GMT, Emanuel Peter wrote: >> erifan has updated the pull request incrementally with one additional commit since the last revision: >> >> Update the code comment > > src/hotspot/share/opto/vectornode.cpp line 2243: > >> 2241: if (in1->Opcode() != Op_VectorMaskCmp || >> 2242: in1->outcnt() != 1 || >> 2243: !(in1->as_VectorMaskCmp())->predicate_can_be_negated() || > > Suggestion: > > !in1->as_VectorMaskCmp()->predicate_can_be_negated() || > > Brackets are unnecessary, and rather make it harder to read. Good catch, done. > src/hotspot/share/opto/vectornode.cpp line 2277: > >> 2275: res = VectorNode::Ideal(phase, can_reshape); >> 2276: } >> 2277: return res; > > What if someone comes and wants to add yet another optimization before `VectorNode::Ideal`? Your code layout would give us deeper and deeper nesting. I suggest flattening it like this: > Suggestion: > > > Node* res = Ideal_XorV_VectorMaskCmp(phase, can_reshape); > if (res != nullptr) { return res; } > > return VectorNode::Ideal(phase, can_reshape); Make sense, done. > test/micro/org/openjdk/bench/jdk/incubator/vector/MaskCompareNotBenchmark.java line 351: > >> 349: public void testCompareULEMaskNotLong() { >> 350: testCompareMaskNotLong(VectorOperators.ULE); >> 351: } > > You could consider making the operator a `@Param` next time. 
> > There are multiple tricks to do that: > - `test/micro/org/openjdk/bench/vm/compiler/VectorStoreToLoadForwarding.java` using `MethodHandles.constant` > - Some inner class that has a static final, which is initialized from the non-final `@Param` value. > - Probably even `StableValue` would work, but I have not yet experimented with it. > > It would be nice if we could do the same with the primitive types, but that's probably not going to work as easily. > > Really just an idea for next time. Good point, I didn't know about these methods before. I will submit this change in my next commit, thank you. > test/micro/org/openjdk/bench/jdk/incubator/vector/MaskCompareNotBenchmark.java line 366: > >> 364: public void testCompareNEMaskNotFloat() { >> 365: testCompareMaskNotFloat(VectorOperators.NE); >> 366: } > > You could still add the other comparisons as well, so we can see the performance difference. Very optional, feel free to ignore this suggestion. Sounds good, this will be added with the above change. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2335413222 PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2335421260 PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2335825557 PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2335827904 From epeter at openjdk.org Wed Sep 10 07:45:53 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 10 Sep 2025 07:45:53 GMT Subject: RFR: 8354242: VectorAPI: combine vector not operation with compare [v7] In-Reply-To: References: <15TW6hiffz65NhHevPefL_6swSC07UD-GwiJ4tPDtFs=.b83081df-8abd-4756-b4e0-1d969678a0d2@github.com> Message-ID: On Wed, 3 Sep 2025 10:09:58 GMT, erifan wrote: >>> Oh I think we still cannot use `BoolTest::negate`, because we cannot instantiate a `BoolTest` object with **unsigned** comparison. `BoolTest::negate` is a non-static function. >> >> I see. Ok. Hmm. 
I still think that the logic should be in `BoolTest`, because that is where the exact implementation of the enum values is. In that context it is easier to see why `^4` does the negation. And imagine we were ever to change the enum values, then it would be harder to find your code and fix it. >> >> Maybe it could be called `BoolTest::negate_mask(mast btm)` and explain in a comment that both signed and unsigned is supported. > > Hi @eme64 @theRealAph @XiaohongGong @fg1417 @shqking , could you help take a look at this PR, thanks @erifan Sounds good. No rush, it takes as long as it takes. I'll soon be on vacation too and may not respond until mid of October. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24674#issuecomment-3273732881 From galder at openjdk.org Wed Sep 10 08:16:22 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Wed, 10 Sep 2025 08:16:22 GMT Subject: RFR: 8366702: C2 SuperWord: refactor VTransform vector nodes [v4] In-Reply-To: References: Message-ID: On Mon, 8 Sep 2025 08:28:51 GMT, Emanuel Peter wrote: >> I'm working on cost-modeling, and am integrating some smaller changes from this proof-of-concept PR: >> https://github.com/openjdk/jdk/pull/20964 >> [See plan overfiew.](https://bugs.openjdk.org/browse/JDK-8340093) >> >> This is a pure refactoring - no change in behaviour. I'm presenting it like this because it will make reviews easier. >> >> --------------------------------- >> >> I have to say: I'm very sorry for this refactoring. I took some decisions in https://github.com/openjdk/jdk/pull/19719 that I'm now partially undoing. I moved too much logic from `SuperWord::output` (now called `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`) to the `VTransform...Node::apply`. 
https://github.com/openjdk/jdk/pull/19719 was a roughly 1.5k line change, and I took about a 0.3k misstep that I'm now correcting here ;) >> >> I had accidentially made the `VTransformGraph` too close to the `PackSet`, and not close enough to the future vectorized C2 Graph. And that makes some future changes hard. >> >> My vision: >> - VLoop / VLoopAnalyzer look at the scalar loop and prepare it for SuperWord >> - SuperWord creates the `PackSet`: some nodes are packed, all others are scalar. >> - `SuperWordVTransformBuilder` converts the `PackSet` into the `VTransformGraph` >> - The `VTransformGraph` very closely represents the C2 vectorized loop after vectorization >> - It does not need to know which `nodes` it packs, it rather just needs to know how to generate the new vector nodes >> - That means it is straight-forward to compute cost >> - And it also makes optimizations on that graph easier >> - And the `apply` methods are simpler too >> >> ---------------------------------- >> >> So therefore, the main goal was to make the `VTransform...Node::apply` calls simpler again. And move the logic back to `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`. >> >> One important step to making the the `VTransformGraph` less of a `PackSet` is to remove reliance on `nodes` for the vector nodes. >> >> What I did: >> - Moving a lot of the logic in `VTransformElementWiseVectorNode::apply` to `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`. >> - Will make it easier to optimize and compute cost in future RFE's. >> - `VTransformVectorNodePrototype`: packs a lot of the info for `VTransformVectorNode`. >> - pass info about `bt`, `vlen`, `sopc` instead of the `pack` -> allows us to eventually remove the dependency on `nodes`. >> - New vector nodes, they are special cases I split away from ... > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > fix typo Changes requested by galder (Author). 
src/hotspot/share/opto/vtransform.cpp line 795:

> 793:
> 794: VectorNode* vn = nullptr;
> 795: if (req() <= 3) {

I'm wondering if, with this change, the `assert(2 <= req() && req() <= 4, "Must have 1-3 inputs");` call could be moved and made more specific for the two sides of the condition.

For example, we know that if we go down the `req() <= 3` route, then we have 1-2 inputs? And if we're in the other one, we have at least 3 inputs.

Then, with that in mind, I wonder if we couldn't move ` Node* in3 = (req() >= 4) ? apply_state.transformed_node(in_req(3)) : nullptr;` to be computed only in the `else` and convert it to `Node* in3 = apply_state.transformed_node(in_req(3))`?

-------------

PR Review: https://git.openjdk.org/jdk/pull/27056#pullrequestreview-3204977764
PR Review Comment: https://git.openjdk.org/jdk/pull/27056#discussion_r2335935271

From shade at openjdk.org Wed Sep 10 08:20:33 2025
From: shade at openjdk.org (Aleksey Shipilev)
Date: Wed, 10 Sep 2025 08:20:33 GMT
Subject: RFR: 8367313: CTW: Execute in AWT headless mode
Message-ID:

I have been doing CTW parallelization improvements, and noticed that some of the AWT clinits run and initialize the graphics stack. This is awkward for a few reasons:
 1. We might be running in a headless environment and these clinits could fail, shrinking the CTW testing scope.
 2. There are dependencies in graphics stack initialization that break -- in one case in my parallelization tests, I have seen the VM crash due to an uninitialized AWT lock, because the randomized CTW runner managed to execute clinits in an unusual order. Running in headless mode avoids dealing with that path altogether.

I think we should be running CTW tests in AWT headless mode to begin with.
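[Editorial aside, not part of the original message.] For readers unfamiliar with AWT headless mode, here is a minimal sketch of what it means from the Java side. It assumes only the standard `java.awt.headless` system property and `GraphicsEnvironment.isHeadless()`; the class and method names (`HeadlessCheck`, `enableHeadless`) are hypothetical, and the actual two-line CTW change in the PR is not shown in this message and may well set the property differently (for example as a `-Djava.awt.headless=true` launcher flag).

```java
import java.awt.GraphicsEnvironment;

// Hypothetical illustration; not the code from PR 27187.
public class HeadlessCheck {
    // Sets the standard headless property, then asks AWT whether it
    // considers itself headless. The property has to be set before AWT
    // first reads it, because the answer is cached afterwards.
    static boolean enableHeadless() {
        System.setProperty("java.awt.headless", "true");
        return GraphicsEnvironment.isHeadless();
    }

    public static void main(String[] args) {
        System.out.println("headless = " + enableHeadless());
    }
}
```

In headless mode, code paths that need a display are expected to fail fast with `HeadlessException` rather than initialize the native graphics stack, which is presumably why it sidesteps the clinit ordering problem described above.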
Additional testing: - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` ------------- Commit messages: - Fix Changes: https://git.openjdk.org/jdk/pull/27187/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27187&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8367313 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/27187.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27187/head:pull/27187 PR: https://git.openjdk.org/jdk/pull/27187 From rcastanedalo at openjdk.org Wed Sep 10 08:31:48 2025 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Wed, 10 Sep 2025 08:31:48 GMT Subject: RFR: 8327963: C2: fix construction of memory graph around Initialize node to prevent incorrect execution if allocation is removed [v8] In-Reply-To: References: <3jUFOPYDIqmzEywhzf58guwS0qZGBUCMZ3lXeltlS3c=.5c82601f-cf4d-4b2a-a525-1f8f4c7c4a3b@github.com> <1gdeBnZ7YuIf9CgQW2bCXkDDBWPjUgRnickHts-fvzE=.e6e901ba-3e9f-41a2-9c68-167a879e9655@github.com> <2m1_XtiSsW_LaBRrkX4qv7AKtLOjNgnl4mUp3zisasE=.dda62164-7aa0-4c1a-b83f-fa40ba7902e5@github.com> <4374L3lkQK90wLxxOA7POBmIKNX2DFK-4pO4vj1bkuQ=.5b8d7825-a7f1-497f-ab66-02a85a266659@github.com> Message-ID: On Fri, 11 Jul 2025 18:20:19 GMT, John R Rose wrote: > Specifically, if we are using narrow memory projections sometimes, we should be prepared to respect them always. @rose00 I fully agree with your general argument, but note that my request refers to avoiding redundant MachProj memory projections arising after matching narrow memory projections (such as nodes 56-39 in B6 in the following CFG: https://github.com/user-attachments/files/20477560/after-gcm.pdf), not narrow memory projections per se. I do not see any use in allowing redundant MachProj memory projections in the IR, while due to their ambiguity they increase the risk of introducing new bugs or unveiling latent bugs, e.g. in anti-dependency analysis. 
So I am happy that Roland has found a cheap way to prevent them from ever appearing in the IR.

> @rose00 @robcasloz I updated the change with a new way to avoid redundant projections. At matching time, before a `NarrowMemProj` is matched into a `MachProj`, new logic checks whether a `MachProj` already exists. That guarantees that no redundant `MachProj` are ever added. It also performs the new normalization at a major cut-point. What do you think?

That sounds good to me, thank you for enforcing this Roland! I will re-run testing and have a new look at the changeset within the next days.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/24570#issuecomment-3273887804
PR Comment: https://git.openjdk.org/jdk/pull/24570#issuecomment-3273891349

From duke at openjdk.org Wed Sep 10 08:48:59 2025
From: duke at openjdk.org (erifan)
Date: Wed, 10 Sep 2025 08:48:59 GMT
Subject: RFR: 8366333: AArch64: Enhance SVE subword type implementation of vector compress
Message-ID:

The AArch64 SVE and SVE2 architectures lack an instruction suitable for subword-type `compress` operations. Therefore, the current implementation uses the 32-bit SVE `compact` instruction to compress subword types by first widening the high and low parts to 32 bits, compressing them, and then narrowing them back to their original type. Finally, the high and low parts are merged using the `index + tbl` instructions.

This approach is significantly slower compared to architectures with native support. After evaluating all available AArch64 SVE instructions and experimenting with various implementations (such as looping over the active elements, extraction, and insertion), I confirmed that the existing algorithm is optimal given the instruction set. However, there is still room for optimization in the following two aspects:
1. Merging with `index + tbl` is suboptimal due to the high latency of the `index` instruction.
2. For partial subword types, operations to the highest half are unnecessary because those bits are invalid.

This pull request introduces the following changes:
1. Replaces `index + tbl` with the `whilelt + splice` instructions, which offer lower latency and higher throughput.
2. Eliminates unnecessary compress operations for partial subword type cases.
3. For `sve_compress_byte`, one less temporary register is used to alleviate potential register pressure.

Benchmark results demonstrate that these changes significantly improve performance.

Benchmarks on Nvidia Grace machine with 128-bit SVE:

Benchmark                Unit    Before   Error  After    Error  Uplift
Byte128Vector.compress   ops/ms  4846.97  26.23  6638.56  31.60  1.36
Byte64Vector.compress    ops/ms  2447.69  12.95  7167.68  34.49  2.92
Short128Vector.compress  ops/ms  7174.88  40.94  8398.45  9.48   1.17
Short64Vector.compress   ops/ms  3618.72  3.04   8618.22  10.91  2.38

This PR was tested on 128-bit, 256-bit, and 512-bit SVE environments, and all tests passed.

-------------

Commit messages:
 - 8366333: AArch64: Enhance SVE subword type implementation of vector compress

Changes: https://git.openjdk.org/jdk/pull/27188/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27188&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8366333
  Stats: 414 lines in 9 files changed: 297 ins; 24 del; 93 mod
  Patch: https://git.openjdk.org/jdk/pull/27188.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/27188/head:pull/27188

PR: https://git.openjdk.org/jdk/pull/27188

From epeter at openjdk.org Wed Sep 10 08:49:09 2025
From: epeter at openjdk.org (Emanuel Peter)
Date: Wed, 10 Sep 2025 08:49:09 GMT
Subject: RFR: 8366702: C2 SuperWord: refactor VTransform vector nodes [v4]
In-Reply-To:
References:
Message-ID:

On Wed, 10 Sep 2025 08:10:07 GMT, Galder Zamarreño wrote:

>> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   fix typo
>
> src/hotspot/share/opto/vtransform.cpp line 795:
>
>> 793:
>>
>> 794:
VectorNode* vn = nullptr; >> 795: if (req() <= 3) { > > I'm wondering if with this change, the `assert(2 <= req() && req() <= 4, "Must have 1-3 inputs");` call could moved and be made more specific for these 2 sides of the conditon. > > For example, we know that if we go down the `req() <= 3` route, then we're in the 1-2 inputs? And if if we're in the other one we're at least 3 inputs. > > Then, with that in mind, I wonder if we couldn't move ` Node* in3 = (req() >= 4) ? apply_state.transformed_node(in_req(3)) : nullptr;` to be computed only in the `else` and convert it to `Node* in3 = apply_state.transformed_node(in_req(3))`? We could. But I'd prefer to do the req assert before I access any inputs, to avoid failing in the input access. And I also like the parallel pattern of fetching the inputs, moving it inside the if/else would in my opinion make it harder to read. We could also just drop the assert and rely on the asserts in the input fetch. Personally, I would leave it as I have it now, but I'm open to a majority vote ;) @chhagedorn What would you prefer? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27056#discussion_r2336034124 From xgong at openjdk.org Wed Sep 10 08:57:40 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 10 Sep 2025 08:57:40 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v11] In-Reply-To: References: Message-ID: On Thu, 14 Aug 2025 14:01:13 GMT, Mikhail Ablakatov wrote: >> Add a reduce_mul intrinsic SVE specialization for >= 256-bit long vectors. It multiplies halves of the source vector using SVE instructions to get to a 128-bit long vector that fits into a SIMD&FP register. After that point, existing ASIMD implementation is used. >> >> Nothing changes for <= 128-bit long vectors as for those the existing ASIMD implementation is used directly still. 
>>
>> The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing VectorAPI reduction micro-benchmarks.
>>
>> Benchmarks results:
>>
>> Neoverse-V1 (SVE 256-bit)
>>
>> Benchmark                 (size)  Mode   master    PR         Units
>> ByteMaxVector.MULLanes    1024    thrpt  5447.643  11455.535  ops/ms
>> ShortMaxVector.MULLanes   1024    thrpt  3388.183  7144.301   ops/ms
>> IntMaxVector.MULLanes     1024    thrpt  3010.974  4911.485   ops/ms
>> LongMaxVector.MULLanes    1024    thrpt  1539.137  2562.835   ops/ms
>> FloatMaxVector.MULLanes   1024    thrpt  1355.551  4158.128   ops/ms
>> DoubleMaxVector.MULLanes  1024    thrpt  1715.854  3284.189   ops/ms
>>
>>
>> Fujitsu A64FX (SVE 512-bit):
>>
>> Benchmark                 (size)  Mode   master    PR         Units
>> ByteMaxVector.MULLanes    1024    thrpt  1091.692  2887.798   ops/ms
>> ShortMaxVector.MULLanes   1024    thrpt  597.008   1863.338   ops/ms
>> IntMaxVector.MULLanes     1024    thrpt  510.642   1348.651   ops/ms
>> LongMaxVector.MULLanes    1024    thrpt  468.878   878.620    ops/ms
>> FloatMaxVector.MULLanes   1024    thrpt  376.284   2237.564   ops/ms
>> DoubleMaxVector.MULLanes  1024    thrpt  431.343   1646.792   ops/ms
>
> Mikhail Ablakatov has updated the pull request incrementally with one additional commit since the last revision:
>
>   cleanup: start the SVE Integer Misc - Unpredicated section

The following issues are reported when I run this test on an SVE 512-bit vector length simulator.
test Byte256VectorTests.MULByte256VectorTestsMasked(byte[-i * 5], byte[cornerCaseValue(i)], mask[false]): success [15ms] # # A fatal error has been detected by the Java Runtime Environment: # # Internal Error (/tmp/ci-scripts/jdk-src/src/hotspot/cpu/aarch64/aarch64_vector.ad:3522), pid=299515, tid=299551 # assert(length_in_bytes == MaxVectorSize) failed: invalid vector length # Same failures happens on following tests: jdk/incubator/vector/Byte256VectorTests.java jdk/incubator/vector/Int256VectorTests.java jdk/incubator/vector/Long256VectorTests.java jdk/incubator/vector/Short256VectorTests.java ------------- PR Comment: https://git.openjdk.org/jdk/pull/23181#issuecomment-3273984366 From qamai at openjdk.org Wed Sep 10 10:29:37 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Wed, 10 Sep 2025 10:29:37 GMT Subject: RFR: 8356813: Improve Mod(I|L)Node::Value [v7] In-Reply-To: References: <2Jf_gfvRlKcmCFoQHp5T0WW_fU_yK5-0Z3z41f00-YU=.164be9f0-fae1-44bb-84c3-846d8c2c0db2@github.com> Message-ID: On Tue, 26 Aug 2025 12:46:31 GMT, Hannes Greule wrote: >> This change improves the precision of the `Mod(I|L)Node::Value()` functions. >> >> I reordered the structure a bit. First, we handle constants, afterwards, we handle ranges. The bottom checks seem to be excessive (`Type::BOTTOM` is covered by using `isa_(int|long)()`, the local bottom is just the full range). Given we can even give reasonable bounds if only one input has any bounds, we don't want to return early. >> The changes after that are commented. Please let me know if the explanations are good, or if you have any suggestions. >> >> ### Monotonicity >> >> Before, a 0 divisor resulted in `Type(Int|Long)::POS`. Initially I wanted to keep it this way, but that violates monotonicity during PhaseCCP. As an example, if we see a 0 divisor first and a 3 afterwards, we might try to go from `>=0` to `-2..2`, but the meet of these would be `>=-2` rather than `-2..2`. 
Using `Type(Int|Long)::ZERO` instead (zero is always in the resulting value if we cover a range). >> >> ### Testing >> >> I added tests for cases around the relevant bounds. I also ran tier1, tier2, and tier3 but didn't see any related failures after addressing the monotonicity problem described above (I'm having a few unrelated failures on my system currently, so separate testing would be appreciated in case I missed something). >> >> Please review and let me know what you think. >> >> ### Other >> >> The `UMod(I|L)Node`s were adjusted to be more in line with its signed variants. This change diverges them again, but similar improvements could be made after #17508. >> >> During experimenting with these changes, I stumbled upon a few things that aren't directly related to this change, but might be worth to further look into: >> - If the divisor is a constant, we will directly replace the `Mod(I|L)Node` with more but less expensive nodes in `::Ideal()`. Type analysis for these nodes combined is less precise, means we miss potential cases were this would help e.g., removing range checks. Would it make sense to delay the replacement? >> - To force non-negative ranges, I'm using `char`. I noticed that method parameters of sub-int integer types all fall back to `TypeInt::INT`. This seems to be an intentional change of https://github.com/openjdk/jdk/commit/200784d505dd98444c48c9ccb7f2e4df36dcbb6a. The bug report is private, so I can't really judge if that part is necessary, but it seems odd. > > Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: > > review Nice consolidation also. I have only some small style suggestion. src/hotspot/share/opto/divnode.cpp line 1220: > 1218: // Mod by zero? Throw exception at runtime! 
> 1219: if (t2 == TypeInteger::zero(bt)) { > 1220: return TypeInt::TOP; `TypeInt::TOP` is actually `Type::TOP` src/hotspot/share/opto/divnode.cpp line 1225: > 1223: const TypeInteger* i1 = t1->isa_integer(bt); > 1224: const TypeInteger* i2 = t2->isa_integer(bt); > 1225: if (i1 == nullptr || i2 == nullptr) { If they are not `TOP` here, `isa_integer` should never return `nullptr`, it's better to do an `assert` here. src/hotspot/share/opto/divnode.cpp line 1269: > 1267: hi = MIN2(hi, i1->hi_as_long()); > 1268: } > 1269: return TypeInteger::make(lo, hi, MAX2(i1->_widen,i2->_widen), bt); Small style: space after comma. ------------- Marked as reviewed by qamai (Committer). PR Review: https://git.openjdk.org/jdk/pull/25254#pullrequestreview-3205479330 PR Review Comment: https://git.openjdk.org/jdk/pull/25254#discussion_r2336282089 PR Review Comment: https://git.openjdk.org/jdk/pull/25254#discussion_r2336297089 PR Review Comment: https://git.openjdk.org/jdk/pull/25254#discussion_r2336288184 From epeter at openjdk.org Wed Sep 10 11:35:32 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 10 Sep 2025 11:35:32 GMT Subject: RFR: 8366702: C2 SuperWord: refactor VTransform vector nodes [v4] In-Reply-To: References: Message-ID: On Wed, 10 Sep 2025 08:46:44 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/vtransform.cpp line 795: >> >>> 793: >>> 794: VectorNode* vn = nullptr; >>> 795: if (req() <= 3) { >> >> I'm wondering if with this change, the `assert(2 <= req() && req() <= 4, "Must have 1-3 inputs");` call could moved and be made more specific for these 2 sides of the conditon. >> >> For example, we know that if we go down the `req() <= 3` route, then we're in the 1-2 inputs? And if if we're in the other one we're at least 3 inputs. >> >> Then, with that in mind, I wonder if we couldn't move ` Node* in3 = (req() >= 4) ? 
apply_state.transformed_node(in_req(3)) : nullptr;` to be computed only in the `else` and convert it to `Node* in3 = apply_state.transformed_node(in_req(3))`? > > We could. But I'd prefer to do the req assert before I access any inputs, to avoid failing in the input access. > > And I also like the parallel pattern of fetching the inputs, moving it inside the if/else would in my opinion make it harder to read. > > We could also just drop the assert and rely on the asserts in the input fetch. > > Personally, I would leave it as I have it now, but I'm open to a majority vote ;) > > @chhagedorn What would you prefer? I discussed a bit with @chhagedorn . He thought I could move down the `Node* in3 = apply_state.transformed_node(in_req(3))`. Maybe if we extend the element wise ops to cases with yet another input it will have to be moved up again, but it's fine to move down for now. The assert we'll leave where it is, it makes more sense as a precondition. As such, I'll move it to the top of the method. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27056#discussion_r2336454623 From epeter at openjdk.org Wed Sep 10 11:42:01 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 10 Sep 2025 11:42:01 GMT Subject: RFR: 8366702: C2 SuperWord: refactor VTransform vector nodes [v5] In-Reply-To: References: Message-ID: > I'm working on cost-modeling, and am integrating some smaller changes from this proof-of-concept PR: > https://github.com/openjdk/jdk/pull/20964 > [See plan overfiew.](https://bugs.openjdk.org/browse/JDK-8340093) > > This is a pure refactoring - no change in behaviour. I'm presenting it like this because it will make reviews easier. > > --------------------------------- > > I have to say: I'm very sorry for this refactoring. I took some decisions in https://github.com/openjdk/jdk/pull/19719 that I'm now partially undoing. 
I moved too much logic from `SuperWord::output` (now called `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`) to the `VTransform...Node::apply`. https://github.com/openjdk/jdk/pull/19719 was a roughly 1.5k line change, and I took about a 0.3k misstep that I'm now correcting here ;) > > I had accidentally made the `VTransformGraph` too close to the `PackSet`, and not close enough to the future vectorized C2 Graph. And that makes some future changes hard. > > My vision: > - VLoop / VLoopAnalyzer look at the scalar loop and prepare it for SuperWord > - SuperWord creates the `PackSet`: some nodes are packed, all others are scalar. > - `SuperWordVTransformBuilder` converts the `PackSet` into the `VTransformGraph` > - The `VTransformGraph` very closely represents the C2 vectorized loop after vectorization > - It does not need to know which `nodes` it packs, it rather just needs to know how to generate the new vector nodes > - That means it is straightforward to compute cost > - And it also makes optimizations on that graph easier > - And the `apply` methods are simpler too > > ---------------------------------- > > So therefore, the main goal was to make the `VTransform...Node::apply` calls simpler again. And move the logic back to `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`. > > One important step to making the `VTransformGraph` less of a `PackSet` is to remove reliance on `nodes` for the vector nodes. > > What I did: > - Moving a lot of the logic in `VTransformElementWiseVectorNode::apply` to `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`. > - Will make it easier to optimize and compute cost in future RFEs. > - `VTransformVectorNodePrototype`: packs a lot of the info for `VTransformVectorNode`. > - pass info about `bt`, `vlen`, `sopc` instead of the `pack` -> allows us to eventually remove the dependency on `nodes`.
> - New vector nodes, they are special cases I split away from `VTransformElementWiseVectorNode`: > - `VTransformReinterpretVectorN... Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: for Galder ------------- Changes: - all: https://git.openjdk.org/jdk/pull/27056/files - new: https://git.openjdk.org/jdk/pull/27056/files/e3fe36ee..f346e69f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=27056&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27056&range=03-04 Stats: 5 lines in 1 file changed: 2 ins; 3 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/27056.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27056/head:pull/27056 PR: https://git.openjdk.org/jdk/pull/27056 From jbhateja at openjdk.org Wed Sep 10 12:25:27 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 10 Sep 2025 12:25:27 GMT Subject: RFR: 8365205: C2: Optimize popcount value computation using knownbits [v4] In-Reply-To: References: Message-ID: > This patch optimizes PopCount value transforms using KnownBits information. > Following are the results of the micro-benchmark included with the patch > > > > System: 13th Gen Intel(R) Core(TM) i3-1315U > > Baseline: > Benchmark Mode Cnt Score Error Units > PopCountValueTransform.LogicFoldingKerenLong thrpt 2 215460.670 ops/s > PopCountValueTransform.LogicFoldingKerenlInt thrpt 2 294014.826 ops/s > PopCountValueTransform.StockKernelInt thrpt 2 409295.875 ops/s > PopCountValueTransform.StockKernelLong thrpt 2 368025.608 ops/s > > Withopt: > Benchmark Mode Cnt Score Error Units > PopCountValueTransform.LogicFoldingKerenLong thrpt 2 389978.082 ops/s > PopCountValueTransform.LogicFoldingKerenlInt thrpt 2 417261.583 ops/s > PopCountValueTransform.StockKernelInt thrpt 2 418649.269 ops/s > PopCountValueTransform.StockKernelLong thrpt 2 381330.221 ops/s > > > Kindly review and share your feedback. 
> > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Update src/hotspot/share/opto/countbitsnode.cpp Co-authored-by: Emanuel Peter ------------- Changes: - all: https://git.openjdk.org/jdk/pull/27075/files - new: https://git.openjdk.org/jdk/pull/27075/files/52ae6bc8..36ecb5d1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=27075&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27075&range=02-03 Stats: 31 lines in 1 file changed: 0 ins; 12 del; 19 mod Patch: https://git.openjdk.org/jdk/pull/27075.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27075/head:pull/27075 PR: https://git.openjdk.org/jdk/pull/27075 From jbhateja at openjdk.org Wed Sep 10 12:30:21 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 10 Sep 2025 12:30:21 GMT Subject: RFR: 8365205: C2: Optimize popcount value computation using knownbits [v3] In-Reply-To: <88lK21UPhkqWYMU-PNUCMYYH1QWrjiUfftspxZB7GFM=.99f8aa26-0075-4932-a427-054f088d8068@github.com> References: <88lK21UPhkqWYMU-PNUCMYYH1QWrjiUfftspxZB7GFM=.99f8aa26-0075-4932-a427-054f088d8068@github.com> Message-ID: <7GYE4B_fk2sz0pxSjPgYxpTWz1v4T0-V-oMmBcS0tpY=.658141db-9bac-4f19-876f-f859ae00984b@github.com> On Tue, 9 Sep 2025 11:46:03 GMT, Emanuel Peter wrote: >> test/hotspot/jtreg/compiler/intrinsics/TestPopCountValueTransforms.java line 114: >> >>> 112: } >>> 113: return 1; >>> 114: } >> >> Thanks for the tests! >> >> I think it would be quite valuable to have some tests that do not just clamp the range, but also create random `KnownBits`, i.e. with random and/or masks. >> >> For example: >> `num = (num | ONES) & ZEROS;` >> >> And then you generate `ONES` and `ZEROS` randomly, maybe even using `Generators`? >> Then round it off with some random range comparisons at the end: >> ` if (Integer.bitCount(num) >= CON1 && Integer.bitCount(num) <= CON2) {` > > Also: how many popcount instructions are left? 
Should it not at most be 1? > Thanks for the tests! > > I think it would be quite valuable to have some tests that do not just clamp the range, but also create random `KnownBits`, i.e. with random and/or masks. > > For example: `num = (num | ONES) & ZEROS;` > > And then you generate `ONES` and `ZEROS` randomly, maybe even using `Generators`? Then round it off with some random range comparisons at the end: ` if (Integer.bitCount(num) >= CON1 && Integer.bitCount(num) <= CON2) {` With Random Ranges, we will not be able to ascertain the count of PopCountI IR node, which is why I created different tests for complete logic sweeping, and the one which retains PopCountIR. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2336597616 From epeter at openjdk.org Wed Sep 10 12:38:18 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 10 Sep 2025 12:38:18 GMT Subject: RFR: 8366702: C2 SuperWord: refactor VTransform vector nodes [v6] In-Reply-To: References: Message-ID: > I'm working on cost-modeling, and am integrating some smaller changes from this proof-of-concept PR: > https://github.com/openjdk/jdk/pull/20964 > [See plan overfiew.](https://bugs.openjdk.org/browse/JDK-8340093) > > This is a pure refactoring - no change in behaviour. I'm presenting it like this because it will make reviews easier. > > --------------------------------- > > I have to say: I'm very sorry for this refactoring. I took some decisions in https://github.com/openjdk/jdk/pull/19719 that I'm now partially undoing. I moved too much logic from `SuperWord::output` (now called `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`) to the `VTransform...Node::apply`. https://github.com/openjdk/jdk/pull/19719 was a roughly 1.5k line change, and I took about a 0.3k misstep that I'm now correcting here ;) > > I had accidentially made the `VTransformGraph` too close to the `PackSet`, and not close enough to the future vectorized C2 Graph. 
And that makes some future changes hard. > > My vision: > - VLoop / VLoopAnalyzer look at the scalar loop and prepare it for SuperWord > - SuperWord creates the `PackSet`: some nodes are packed, all others are scalar. > - `SuperWordVTransformBuilder` converts the `PackSet` into the `VTransformGraph` > - The `VTransformGraph` very closely represents the C2 vectorized loop after vectorization > - It does not need to know which `nodes` it packs, it rather just needs to know how to generate the new vector nodes > - That means it is straight-forward to compute cost > - And it also makes optimizations on that graph easier > - And the `apply` methods are simpler too > > ---------------------------------- > > So therefore, the main goal was to make the `VTransform...Node::apply` calls simpler again. And move the logic back to `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`. > > One important step to making the the `VTransformGraph` less of a `PackSet` is to remove reliance on `nodes` for the vector nodes. > > What I did: > - Moving a lot of the logic in `VTransformElementWiseVectorNode::apply` to `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`. > - Will make it easier to optimize and compute cost in future RFE's. > - `VTransformVectorNodePrototype`: packs a lot of the info for `VTransformVectorNode`. > - pass info about `bt`, `vlen`, `sopc` instead of the `pack` -> allows us to eventually remove the dependency on `nodes`. > - New vector nodes, they are special cases I split away from `VTransformElementWiseVectorNode`: > - `VTransformReinterpretVectorN... 
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: fix include order ------------- Changes: - all: https://git.openjdk.org/jdk/pull/27056/files - new: https://git.openjdk.org/jdk/pull/27056/files/f346e69f..afd716e3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=27056&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27056&range=04-05 Stats: 2 lines in 1 file changed: 1 ins; 1 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/27056.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27056/head:pull/27056 PR: https://git.openjdk.org/jdk/pull/27056 From epeter at openjdk.org Wed Sep 10 12:50:00 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 10 Sep 2025 12:50:00 GMT Subject: RFR: 8365205: C2: Optimize popcount value computation using knownbits [v3] In-Reply-To: <7GYE4B_fk2sz0pxSjPgYxpTWz1v4T0-V-oMmBcS0tpY=.658141db-9bac-4f19-876f-f859ae00984b@github.com> References: <88lK21UPhkqWYMU-PNUCMYYH1QWrjiUfftspxZB7GFM=.99f8aa26-0075-4932-a427-054f088d8068@github.com> <7GYE4B_fk2sz0pxSjPgYxpTWz1v4T0-V-oMmBcS0tpY=.658141db-9bac-4f19-876f-f859ae00984b@github.com> Message-ID: On Wed, 10 Sep 2025 12:27:51 GMT, Jatin Bhateja wrote: >> Also: how many popcount instructions are left? Should it not at most be 1? > >> Thanks for the tests! >> >> I think it would be quite valuable to have some tests that do not just clamp the range, but also create random `KnownBits`, i.e. with random and/or masks. >> >> For example: `num = (num | ONES) & ZEROS;` >> >> And then you generate `ONES` and `ZEROS` randomly, maybe even using `Generators`? Then round it off with some random range comparisons at the end: ` if (Integer.bitCount(num) >= CON1 && Integer.bitCount(num) <= CON2) {` > > With Random Ranges, we will not be able to ascertain the count of PopCountI IR node, which is why I created different tests for complete logic sweeping, and the one which retains PopCountIR. 
Oh, maybe I missed those "complete logic sweeping tests". Can you please point me to them? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2336651060 From epeter at openjdk.org Wed Sep 10 12:53:14 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 10 Sep 2025 12:53:14 GMT Subject: RFR: 8366702: C2 SuperWord: refactor VTransform vector nodes [v4] In-Reply-To: References: Message-ID: On Wed, 10 Sep 2025 08:14:04 GMT, Galder Zamarreño wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> fix typo > > Changes requested by galder (Author). @galderz I addressed your comment, would you mind having another look? ------------- PR Comment: https://git.openjdk.org/jdk/pull/27056#issuecomment-3274845574 From galder at openjdk.org Wed Sep 10 13:51:47 2025 From: galder at openjdk.org (Galder Zamarreño) Date: Wed, 10 Sep 2025 13:51:47 GMT Subject: RFR: 8366702: C2 SuperWord: refactor VTransform vector nodes [v6] In-Reply-To: References: Message-ID: On Wed, 10 Sep 2025 12:38:18 GMT, Emanuel Peter wrote: >> I'm working on cost-modeling, and am integrating some smaller changes from this proof-of-concept PR: >> https://github.com/openjdk/jdk/pull/20964 >> [See plan overview.](https://bugs.openjdk.org/browse/JDK-8340093) >> >> This is a pure refactoring - no change in behaviour. I'm presenting it like this because it will make reviews easier. >> >> --------------------------------- >> >> I have to say: I'm very sorry for this refactoring. I took some decisions in https://github.com/openjdk/jdk/pull/19719 that I'm now partially undoing. I moved too much logic from `SuperWord::output` (now called `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`) to the `VTransform...Node::apply`.
https://github.com/openjdk/jdk/pull/19719 was a roughly 1.5k line change, and I took about a 0.3k misstep that I'm now correcting here ;) >> >> I had accidentially made the `VTransformGraph` too close to the `PackSet`, and not close enough to the future vectorized C2 Graph. And that makes some future changes hard. >> >> My vision: >> - VLoop / VLoopAnalyzer look at the scalar loop and prepare it for SuperWord >> - SuperWord creates the `PackSet`: some nodes are packed, all others are scalar. >> - `SuperWordVTransformBuilder` converts the `PackSet` into the `VTransformGraph` >> - The `VTransformGraph` very closely represents the C2 vectorized loop after vectorization >> - It does not need to know which `nodes` it packs, it rather just needs to know how to generate the new vector nodes >> - That means it is straight-forward to compute cost >> - And it also makes optimizations on that graph easier >> - And the `apply` methods are simpler too >> >> ---------------------------------- >> >> So therefore, the main goal was to make the `VTransform...Node::apply` calls simpler again. And move the logic back to `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`. >> >> One important step to making the the `VTransformGraph` less of a `PackSet` is to remove reliance on `nodes` for the vector nodes. >> >> What I did: >> - Moving a lot of the logic in `VTransformElementWiseVectorNode::apply` to `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`. >> - Will make it easier to optimize and compute cost in future RFE's. >> - `VTransformVectorNodePrototype`: packs a lot of the info for `VTransformVectorNode`. >> - pass info about `bt`, `vlen`, `sopc` instead of the `pack` -> allows us to eventually remove the dependency on `nodes`. >> - New vector nodes, they are special cases I split away from ... > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > fix include order Nice tidy up @eme64! 
------------- Marked as reviewed by galder (Author). PR Review: https://git.openjdk.org/jdk/pull/27056#pullrequestreview-3206262311 From galder at openjdk.org Wed Sep 10 13:51:49 2025 From: galder at openjdk.org (Galder Zamarreño) Date: Wed, 10 Sep 2025 13:51:49 GMT Subject: RFR: 8366702: C2 SuperWord: refactor VTransform vector nodes [v4] In-Reply-To: References: Message-ID: <9tZVJNOOTP7iLuJZP4csjmyNXD_bOSyy1rINDUJscwU=.551a8cf7-be56-43d6-b066-3cd481bc1186@github.com> On Wed, 10 Sep 2025 11:32:21 GMT, Emanuel Peter wrote: >> We could. But I'd prefer to do the req assert before I access any inputs, to avoid failing in the input access. >> >> And I also like the parallel pattern of fetching the inputs, moving it inside the if/else would in my opinion make it harder to read. >> >> We could also just drop the assert and rely on the asserts in the input fetch. >> >> Personally, I would leave it as I have it now, but I'm open to a majority vote ;) >> >> @chhagedorn What would you prefer? > > I discussed a bit with @chhagedorn . > > He thought I could move down the `Node* in3 = apply_state.transformed_node(in_req(3))`. > > Maybe if we extend the element wise ops to cases with yet another input it will have to be moved up again, but it's fine to move down for now. > > The assert we'll leave where it is, it makes more sense as a precondition. As such, I'll move it to the top of the method.
Sounds good, thanks @eme64 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27056#discussion_r2336833959 From jbhateja at openjdk.org Wed Sep 10 14:21:00 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 10 Sep 2025 14:21:00 GMT Subject: RFR: 8365205: C2: Optimize popcount value computation using knownbits [v3] In-Reply-To: References: Message-ID: On Tue, 9 Sep 2025 11:00:24 GMT, Emanuel Peter wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Update countbitsnode.cpp > > test/hotspot/jtreg/compiler/intrinsics/TestPopCountValueTransforms.java line 148: > >> 146: >> 147: public static void main(String[] args) { >> 148: TestFramework.runWithFlags("-XX:-TieredCompilation", "-XX:CompileThresholdScaling=0.2"); > > Can you explain the need for these flags? > The TestFramework eventually enqueues for compilation anyway. Or is there something about profiling? Thanks for triggering an IR framework refresher :-), these options are only pertinent with Standalone run mode. > test/micro/org/openjdk/bench/java/lang/PopCountValueTransform.java line 79: > >> 77: } >> 78: return res; >> 79: } > > I assume the `stock` kernels are there to show performance if there is no op, the `folding` kernels you hope have the same performance. It would be nice to have one where the `bitCount` does not fold away, just to keep that comparison :) I see your point, on a second thought, since any benchmarks compare the performance of kernels with and without optimization it's better to do away with the stock variants and only retain folding kernels. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2336929724 PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2336929500 From jbhateja at openjdk.org Wed Sep 10 14:22:26 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 10 Sep 2025 14:22:26 GMT Subject: RFR: 8365205: C2: Optimize popcount value computation using knownbits [v5] In-Reply-To: References: Message-ID: > This patch optimizes PopCount value transforms using KnownBits information. > Following are the results of the micro-benchmark included with the patch > > > > System: 13th Gen Intel(R) Core(TM) i3-1315U > > Baseline: > Benchmark Mode Cnt Score Error Units > PopCountValueTransform.LogicFoldingKerenLong thrpt 2 215460.670 ops/s > PopCountValueTransform.LogicFoldingKerenlInt thrpt 2 294014.826 ops/s > > Withopt: > Benchmark Mode Cnt Score Error Units > PopCountValueTransform.LogicFoldingKerenLong thrpt 2 389978.082 ops/s > PopCountValueTransform.LogicFoldingKerenlInt thrpt 2 417261.583 ops/s > > > Kindly review and share your feedback. 
> > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: review resoultions ------------- Changes: - all: https://git.openjdk.org/jdk/pull/27075/files - new: https://git.openjdk.org/jdk/pull/27075/files/36ecb5d1..f1095b58 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=27075&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27075&range=03-04 Stats: 29 lines in 2 files changed: 2 ins; 20 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/27075.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27075/head:pull/27075 PR: https://git.openjdk.org/jdk/pull/27075 From jbhateja at openjdk.org Wed Sep 10 14:24:28 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 10 Sep 2025 14:24:28 GMT Subject: RFR: 8365205: C2: Optimize popcount value computation using knownbits [v3] In-Reply-To: References: <88lK21UPhkqWYMU-PNUCMYYH1QWrjiUfftspxZB7GFM=.99f8aa26-0075-4932-a427-054f088d8068@github.com> <7GYE4B_fk2sz0pxSjPgYxpTWz1v4T0-V-oMmBcS0tpY=.658141db-9bac-4f19-876f-f859ae00984b@github.com> Message-ID: On Wed, 10 Sep 2025 12:47:42 GMT, Emanuel Peter wrote: >>> Thanks for the tests! >>> >>> I think it would be quite valuable to have some tests that do not just clamp the range, but also create random `KnownBits`, i.e. with random and/or masks. >>> >>> For example: `num = (num | ONES) & ZEROS;` >>> >>> And then you generate `ONES` and `ZEROS` randomly, maybe even using `Generators`? Then round it off with some random range comparisons at the end: ` if (Integer.bitCount(num) >= CON1 && Integer.bitCount(num) <= CON2) {` >> >> With Random Ranges, we will not be able to ascertain the count of PopCountI IR node, which is why I created different tests for complete logic sweeping, and the one which retains PopCountIR. > > Oh, maybe I missed those "complete logic sweeping tests". Can you please point me to them? 
testPopCountElisionInt1 and testPopCountElisionLong1 check for absence of PopCount IR nodes. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2336945012 From jbhateja at openjdk.org Wed Sep 10 14:30:10 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 10 Sep 2025 14:30:10 GMT Subject: RFR: 8365205: C2: Optimize popcount value computation using knownbits [v6] In-Reply-To: References: Message-ID: > This patch optimizes PopCount value transforms using KnownBits information. > Following are the results of the micro-benchmark included with the patch > > > > System: 13th Gen Intel(R) Core(TM) i3-1315U > > Baseline: > Benchmark Mode Cnt Score Error Units > PopCountValueTransform.LogicFoldingKerenLong thrpt 2 215460.670 ops/s > PopCountValueTransform.LogicFoldingKerenlInt thrpt 2 294014.826 ops/s > > Withopt: > Benchmark Mode Cnt Score Error Units > PopCountValueTransform.LogicFoldingKerenLong thrpt 2 389978.082 ops/s > PopCountValueTransform.LogicFoldingKerenlInt thrpt 2 417261.583 ops/s > > > Kindly review and share your feedback. 
> > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Update TestPopCountValueTransforms.java ------------- Changes: - all: https://git.openjdk.org/jdk/pull/27075/files - new: https://git.openjdk.org/jdk/pull/27075/files/f1095b58..9e3957de Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=27075&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27075&range=04-05 Stats: 2 lines in 1 file changed: 0 ins; 2 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/27075.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27075/head:pull/27075 PR: https://git.openjdk.org/jdk/pull/27075 From hgreule at openjdk.org Wed Sep 10 14:30:11 2025 From: hgreule at openjdk.org (Hannes Greule) Date: Wed, 10 Sep 2025 14:30:11 GMT Subject: RFR: 8365205: C2: Optimize popcount value computation using knownbits [v3] In-Reply-To: References: <88lK21UPhkqWYMU-PNUCMYYH1QWrjiUfftspxZB7GFM=.99f8aa26-0075-4932-a427-054f088d8068@github.com> <7GYE4B_fk2sz0pxSjPgYxpTWz1v4T0-V-oMmBcS0tpY=.658141db-9bac-4f19-876f-f859ae00984b@github.com> Message-ID: <4PssROFHsUv9rYCp9KlszXVzJV4jIxbHSOWKmQ8VA0k=.db8cac2d-23fd-4ad0-ac1e-7c6e2f3c7b8e@github.com> On Wed, 10 Sep 2025 14:22:10 GMT, Jatin Bhateja wrote: >> Oh, maybe I missed those "complete logic sweeping tests". Can you please point me to them? > > testPopCountElisionInt1 and testPopCountElisionLong1 check for absence of PopCount IR nodes. I think Or and And nodes aren't updated to make use if KnownBits themselves (that generally makes testing based on KnownBits a bit difficult). 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2336959670 From epeter at openjdk.org Wed Sep 10 14:58:05 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 10 Sep 2025 14:58:05 GMT Subject: RFR: 8365205: C2: Optimize popcount value computation using knownbits [v3] In-Reply-To: <4PssROFHsUv9rYCp9KlszXVzJV4jIxbHSOWKmQ8VA0k=.db8cac2d-23fd-4ad0-ac1e-7c6e2f3c7b8e@github.com> References: <88lK21UPhkqWYMU-PNUCMYYH1QWrjiUfftspxZB7GFM=.99f8aa26-0075-4932-a427-054f088d8068@github.com> <7GYE4B_fk2sz0pxSjPgYxpTWz1v4T0-V-oMmBcS0tpY=.658141db-9bac-4f19-876f-f859ae00984b@github.com> <4PssROFHsUv9rYCp9KlszXVzJV4jIxbHSOWKmQ8VA0k=.db8cac2d-23fd-4ad0-ac1e-7c6e2f3c7b8e@github.com> Message-ID: <1J6deqVB9WWQRxzc3oLxXLyIxam61rqx5u5KxZPCtqE=.db2c8f33-4297-47b1-9180-2a732eec8ac1@github.com> On Wed, 10 Sep 2025 14:26:31 GMT, Hannes Greule wrote: >> testPopCountElisionInt1 and testPopCountElisionLong1 check for absence of PopCount IR nodes. > > I think Or and And nodes aren't updated to make use if KnownBits themselves (that generally makes testing based on KnownBits a bit difficult). Ah I see. We should do that soon, it would give us a good way to do this kind of verification. So you can decide if you want to do the bits thing already in anticipation, or not yet. Personally, I would add it so that we catch the bugs in the future. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2337043857 From epeter at openjdk.org Wed Sep 10 14:58:07 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 10 Sep 2025 14:58:07 GMT Subject: RFR: 8365205: C2: Optimize popcount value computation using knownbits [v3] In-Reply-To: <1J6deqVB9WWQRxzc3oLxXLyIxam61rqx5u5KxZPCtqE=.db2c8f33-4297-47b1-9180-2a732eec8ac1@github.com> References: <88lK21UPhkqWYMU-PNUCMYYH1QWrjiUfftspxZB7GFM=.99f8aa26-0075-4932-a427-054f088d8068@github.com> <7GYE4B_fk2sz0pxSjPgYxpTWz1v4T0-V-oMmBcS0tpY=.658141db-9bac-4f19-876f-f859ae00984b@github.com> <4PssROFHsUv9rYCp9KlszXVzJV4jIxbHSOWKmQ8VA0k=.db8cac2d-23fd-4ad0-ac1e-7c6e2f3c7b8e@github.com> <1J6deqVB9WWQRxzc3oLxXLyIxam61rqx5u5KxZPCtqE=.db2c8f33-4297-47b1-9180-2a732eec8ac1@github.com> Message-ID: On Wed, 10 Sep 2025 14:54:37 GMT, Emanuel Peter wrote: >> I think Or and And nodes aren't updated to make use if KnownBits themselves (that generally makes testing based on KnownBits a bit difficult). > > Ah I see. We should do that soon, it would give us a good way to do this kind of verification. > So you can decide if you want to do the bits thing already in anticipation, or not yet. Personally, I would add it so that we catch the bugs in the future. > testPopCountElisionInt1 and testPopCountElisionLong1 check for absence of PopCount IR nodes. @jatin-bhateja But there the clamps are with fixed constants. It would be nice if we also had some tests with randomized constants. We don't need IR tests for those, just result verification. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2337048167 From chagedorn at openjdk.org Wed Sep 10 15:02:37 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 10 Sep 2025 15:02:37 GMT Subject: RFR: 8366702: C2 SuperWord: refactor VTransform vector nodes [v6] In-Reply-To: References: Message-ID: On Wed, 10 Sep 2025 12:38:18 GMT, Emanuel Peter wrote: >> I'm working on cost-modeling, and am integrating some smaller changes from this proof-of-concept PR: >> https://github.com/openjdk/jdk/pull/20964 >> [See plan overview.](https://bugs.openjdk.org/browse/JDK-8340093) >> >> This is a pure refactoring - no change in behaviour. I'm presenting it like this because it will make reviews easier. >> >> --------------------------------- >> >> I have to say: I'm very sorry for this refactoring. I took some decisions in https://github.com/openjdk/jdk/pull/19719 that I'm now partially undoing. I moved too much logic from `SuperWord::output` (now called `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`) to the `VTransform...Node::apply`. https://github.com/openjdk/jdk/pull/19719 was a roughly 1.5k line change, and I took about a 0.3k misstep that I'm now correcting here ;) >> >> I had accidentally made the `VTransformGraph` too close to the `PackSet`, and not close enough to the future vectorized C2 Graph. And that makes some future changes hard. >> >> My vision: >> - VLoop / VLoopAnalyzer look at the scalar loop and prepare it for SuperWord >> - SuperWord creates the `PackSet`: some nodes are packed, all others are scalar.
>> - `SuperWordVTransformBuilder` converts the `PackSet` into the `VTransformGraph` >> - The `VTransformGraph` very closely represents the C2 vectorized loop after vectorization >> - It does not need to know which `nodes` it packs, it rather just needs to know how to generate the new vector nodes >> - That means it is straight-forward to compute cost >> - And it also makes optimizations on that graph easier >> - And the `apply` methods are simpler too >> >> ---------------------------------- >> >> So therefore, the main goal was to make the `VTransform...Node::apply` calls simpler again. And move the logic back to `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`. >> >> One important step to making the `VTransformGraph` less of a `PackSet` is to remove reliance on `nodes` for the vector nodes. >> >> What I did: >> - Moving a lot of the logic in `VTransformElementWiseVectorNode::apply` to `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`. >> - Will make it easier to optimize and compute cost in future RFE's. >> - `VTransformVectorNodePrototype`: packs a lot of the info for `VTransformVectorNode`. >> - pass info about `bt`, `vlen`, `sopc` instead of the `pack` -> allows us to eventually remove the dependency on `nodes`. >> - New vector nodes, they are special cases I split away from ... > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > fix include order Still good! ------------- Marked as reviewed by chagedorn (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/27056#pullrequestreview-3206586599 From epeter at openjdk.org Wed Sep 10 15:17:09 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 10 Sep 2025 15:17:09 GMT Subject: RFR: 8365205: C2: Optimize popcount value computation using knownbits [v3] In-Reply-To: <4PssROFHsUv9rYCp9KlszXVzJV4jIxbHSOWKmQ8VA0k=.db8cac2d-23fd-4ad0-ac1e-7c6e2f3c7b8e@github.com> References: <88lK21UPhkqWYMU-PNUCMYYH1QWrjiUfftspxZB7GFM=.99f8aa26-0075-4932-a427-054f088d8068@github.com> <7GYE4B_fk2sz0pxSjPgYxpTWz1v4T0-V-oMmBcS0tpY=.658141db-9bac-4f19-876f-f859ae00984b@github.com> <4PssROFHsUv9rYCp9KlszXVzJV4jIxbHSOWKmQ8VA0k=.db8cac2d-23fd-4ad0-ac1e-7c6e2f3c7b8e@github.com> Message-ID: On Wed, 10 Sep 2025 14:26:31 GMT, Hannes Greule wrote: >> testPopCountElisionInt1 and testPopCountElisionLong1 check for absence of PopCount IR nodes. > > I think Or and And nodes aren't updated to make use of KnownBits themselves (that generally makes testing based on KnownBits a bit difficult). @SirYwell @jatin-bhateja I filed an RFE for And / Or. I think these would be really important to do soon, because any other KnownBits optimization relies on those working for verification (generating inputs and verifying outputs). https://bugs.openjdk.org/browse/JDK-8367341 @SirYwell @jatin-bhateja @merykitty I linked this issue here to the KnownBits RFE, to make sure we keep track of all KnownBits extensions. Can you please help me with linking any other RFEs that have already been filed or come up in the future? It would help track progress and avoid duplicated work.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2337091504 PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2337099580 From rasbold at openjdk.org Wed Sep 10 15:52:54 2025 From: rasbold at openjdk.org (Chuck Rasbold) Date: Wed, 10 Sep 2025 15:52:54 GMT Subject: RFR: 8366118: DontCompileHugeMethods is not respected with -XX:-TieredCompilation [v5] In-Reply-To: References: Message-ID: On Fri, 29 Aug 2025 23:12:18 GMT, Man Cao wrote: >> Hi, >> >> Could anyone review this change that fixes https://bugs.openjdk.org/browse/JDK-8366118? When this bug happens, it is difficult or almost impossible to debug due to the lack of stack trace, hs-err log or core dump. Fortunately we are also experimenting with sigaltstack for https://bugs.openjdk.org/browse/JDK-8364654, and it helped immensely to identify the root cause. >> >> I will also try adding a test case for DontCompileHugeMethod under -XX:-TieredCompilation. >> >> -Man > > Man Cao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: > > - Merge branch 'master' into JDK-8366118-DontCompileHugeMethods > - Add -Xbatch to test > - Use List.of in test > - Add a jtreg test > - 8366118: DontCompileHugeMethods is not respected with -XX:-TieredCompilation Marked as reviewed by rasbold (Reviewer). 
------------- PR Review: https://git.openjdk.org/jdk/pull/26932#pullrequestreview-3206801879 From mablakatov at openjdk.org Wed Sep 10 15:57:54 2025 From: mablakatov at openjdk.org (Mikhail Ablakatov) Date: Wed, 10 Sep 2025 15:57:54 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v11] In-Reply-To: <8T7swIJ17tLLg4FO_N5UZ0HsMYrz31ywBiMZohefGTE=.386eeb0d-8541-4c35-8a68-6caf31ea867e@github.com> References: <8T7swIJ17tLLg4FO_N5UZ0HsMYrz31ywBiMZohefGTE=.386eeb0d-8541-4c35-8a68-6caf31ea867e@github.com> Message-ID: On Tue, 9 Sep 2025 06:51:00 GMT, Xiaohong Gong wrote: > Do you intend to ignore ops with >32B vector size? May I ask the reason? The reason is the lack of relevant hardware. The only publicly available platform that implements 512b SVE I'm aware of is Fujitsu A64FX. I used to have access to that platform but no longer which makes it difficult to test and benchmark changes for 512b SVE. Stripping that functionality and keeping the implementation in bounds of 256b SVE reduces complexity of this patch. > If so, maybe the title like AArch64: Implement MulReduction for 256-bit SVE is more accurate? Given the state of the PR it might be. Thank you for the suggestion, I'll consider it. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23181#issuecomment-3275566957 From jbhateja at openjdk.org Wed Sep 10 16:00:37 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 10 Sep 2025 16:00:37 GMT Subject: RFR: 8365205: C2: Optimize popcount value computation using knownbits [v3] In-Reply-To: <4PssROFHsUv9rYCp9KlszXVzJV4jIxbHSOWKmQ8VA0k=.db8cac2d-23fd-4ad0-ac1e-7c6e2f3c7b8e@github.com> References: <88lK21UPhkqWYMU-PNUCMYYH1QWrjiUfftspxZB7GFM=.99f8aa26-0075-4932-a427-054f088d8068@github.com> <7GYE4B_fk2sz0pxSjPgYxpTWz1v4T0-V-oMmBcS0tpY=.658141db-9bac-4f19-876f-f859ae00984b@github.com> <4PssROFHsUv9rYCp9KlszXVzJV4jIxbHSOWKmQ8VA0k=.db8cac2d-23fd-4ad0-ac1e-7c6e2f3c7b8e@github.com> Message-ID: <6eJGadjOxt_uInDZmiRc5MZNefslQT3-bOcsTp2tEe0=.0bdc2d27-d09a-4872-9633-51c2b55d1c18@github.com> On Wed, 10 Sep 2025 14:26:31 GMT, Hannes Greule wrote: >> testPopCountElisionInt1 and testPopCountElisionLong1 check for absence of PopCount IR nodes. > > I think Or and And nodes aren't updated to make use of KnownBits themselves (that generally makes testing based on KnownBits a bit difficult). > @SirYwell @jatin-bhateja @merykitty I linked this issue here to the KnownBits RFE, to make sure we keep track of all KnownBits extensions. Can you please help me with linking any other RFEs that have already been filed or come up in the future? It would help track progress and avoid duplicate work.
Current And Value Transforms :
- Constant folds - both inputs
- There are four possible cases for known bits extraction :
    _lo   _hi
    <0    <0   : Possibility of finding common prefix and known ZERO and ONE bits among the common portion.
    >=0   <0   : Not applicable scenario, since lower is greater than the upper bound.
    <0    >=0  : No possibility of finding a common prefix between hi and lo bounds, thus no known bits exist.
    >=0   >=0  : Possibility of finding common prefix and known ZERO and ONE bits among the common portion.
Existing value transforms and canonicalization should furnish known bits in applicable scenarios. For a full solution, we can add another rule to directly AND the known ZERO and ONE bits of participating inputs, and let canonicalization compute the resultant type and clean up existing handling in Value transforms and explicit constant folding ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2337215333 From epeter at openjdk.org Wed Sep 10 16:06:04 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 10 Sep 2025 16:06:04 GMT Subject: RFR: 8365205: C2: Optimize popcount value computation using knownbits [v3] In-Reply-To: <6eJGadjOxt_uInDZmiRc5MZNefslQT3-bOcsTp2tEe0=.0bdc2d27-d09a-4872-9633-51c2b55d1c18@github.com> References: <88lK21UPhkqWYMU-PNUCMYYH1QWrjiUfftspxZB7GFM=.99f8aa26-0075-4932-a427-054f088d8068@github.com> <7GYE4B_fk2sz0pxSjPgYxpTWz1v4T0-V-oMmBcS0tpY=.658141db-9bac-4f19-876f-f859ae00984b@github.com> <4PssROFHsUv9rYCp9KlszXVzJV4jIxbHSOWKmQ8VA0k=.db8cac2d-23fd-4ad0-ac1e-7c6e2f3c7b8e@github.com> <6eJGadjOxt_uInDZmiRc5MZNefslQT3-bOcsTp2tEe0=.0bdc2d27-d09a-4872-9633-51c2b55d1c18@github.com> Message-ID: On Wed, 10 Sep 2025 15:55:42 GMT, Jatin Bhateja wrote: > For a full solution, we can add another rule to directly AND the known ZERO and ONE bits of participating inputs, and let canonicalization compute the resultant type and clean up existing handling in Value transforms and explicit constant folding Yes, this is what we would end up with after https://bugs.openjdk.org/browse/JDK-8367341 . But I think currently, there is no good way to set / get bits directly. Using signed comparisons as you mentioned is only of limited help. But it is what we have for now.
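The four signed-range cases discussed in this exchange can be sketched in plain Java as follows. This is only an illustration of the common-prefix idea and of the usual known-bits rule for a bitwise AND; it is not HotSpot's implementation, and the class name `KnownBitsFromRangeSketch` and its method names are hypothetical.

```java
// Hypothetical illustration: deriving known bits of an int from a signed
// range [lo, hi] via the common leading-bit prefix of lo and hi.
public class KnownBitsFromRangeSketch {

    // Mask of the leading bits shared by every value in [lo, hi].
    // If lo < 0 <= hi the sign bits differ, so the mask is 0 (no known bits).
    public static int prefixMask(int lo, int hi) {
        if (lo > hi) {
            // Corresponds to the ">=0 <0" style case: an empty range.
            throw new IllegalArgumentException("empty range: lo > hi");
        }
        int common = Integer.numberOfLeadingZeros(lo ^ hi); // 0..32
        // Note: -1 << 32 equals -1 in Java, so common == 0 needs its own branch.
        return common == 0 ? 0 : (-1 << (32 - common));
    }

    // Bits that are 1 in every value of [lo, hi].
    public static int knownOnes(int lo, int hi) {
        return lo & prefixMask(lo, hi);
    }

    // Bits that are 0 in every value of [lo, hi].
    public static int knownZeros(int lo, int hi) {
        return ~lo & prefixMask(lo, hi);
    }

    // Known-bits rule for (a & b): a result bit is known 1 only if it is
    // known 1 in both inputs, and known 0 if it is known 0 in either input.
    public static int andKnownOnes(int onesA, int onesB) {
        return onesA & onesB;
    }

    public static int andKnownZeros(int zerosA, int zerosB) {
        return zerosA | zerosB;
    }

    public static void main(String[] args) {
        // [8, 11] = 0b1000..0b1011: bit 3 is known 1, bits 0 and 1 are unknown.
        System.out.println(Integer.toBinaryString(knownOnes(8, 11)));
        System.out.println(Integer.toHexString(knownZeros(8, 11)));
    }
}
```

The sketch relies on the fact that, for two bounds of the same sign, signed and unsigned ordering agree, so every value between them shares their common leading bits.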
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2337233209 From manc at openjdk.org Wed Sep 10 17:45:27 2025 From: manc at openjdk.org (Man Cao) Date: Wed, 10 Sep 2025 17:45:27 GMT Subject: RFR: 8366118: DontCompileHugeMethods is not respected with -XX:-TieredCompilation [v5] In-Reply-To: References: Message-ID: On Fri, 29 Aug 2025 23:12:18 GMT, Man Cao wrote: >> Hi, >> >> Could anyone review this change that fixes https://bugs.openjdk.org/browse/JDK-8366118? When this bug happens, it is difficult or almost impossible to debug due to the lack of stack trace, hs-err log or core dump. Fortunately we are also experimenting with sigaltstack for https://bugs.openjdk.org/browse/JDK-8364654, and it helped immensely to identify the root cause. >> >> I will also try adding a test case for DontCompileHugeMethod under -XX:-TieredCompilation. >> >> -Man > > Man Cao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: > > - Merge branch 'master' into JDK-8366118-DontCompileHugeMethods > - Add -Xbatch to test > - Use List.of in test > - Add a jtreg test > - 8366118: DontCompileHugeMethods is not respected with -XX:-TieredCompilation Thanks for the review! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26932#issuecomment-3275922256 From manc at openjdk.org Wed Sep 10 17:45:29 2025 From: manc at openjdk.org (Man Cao) Date: Wed, 10 Sep 2025 17:45:29 GMT Subject: Integrated: 8366118: DontCompileHugeMethods is not respected with -XX:-TieredCompilation In-Reply-To: References: Message-ID: On Mon, 25 Aug 2025 19:38:23 GMT, Man Cao wrote: > Hi, > > Could anyone review this change that fixes https://bugs.openjdk.org/browse/JDK-8366118? 
When this bug happens, it is difficult or almost impossible to debug due to the lack of stack trace, hs-err log or core dump. Fortunately we are also experimenting with sigaltstack for https://bugs.openjdk.org/browse/JDK-8364654, and it helped immensely to identify the root cause. > > I will also try adding a test case for DontCompileHugeMethod under -XX:-TieredCompilation. > > -Man This pull request has now been integrated. Changeset: 4e2a85f7 Author: Man Cao URL: https://git.openjdk.org/jdk/commit/4e2a85f7500876d65c36aeaf54f5361a1549e7f5 Stats: 143 lines in 2 files changed: 123 ins; 0 del; 20 mod 8366118: DontCompileHugeMethods is not respected with -XX:-TieredCompilation Co-authored-by: Chuck Rasbold Co-authored-by: Justin King Reviewed-by: rasbold, iveresov, jiangli ------------- PR: https://git.openjdk.org/jdk/pull/26932 From jbhateja at openjdk.org Wed Sep 10 18:15:18 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 10 Sep 2025 18:15:18 GMT Subject: RFR: 8364305: Support AVX10 saturating floating point conversion instructions [v2] In-Reply-To: References: Message-ID: <0KIGO9Uk5uIHhyFupqt0KvRbLPz_YmTxnR0Q4Bpzakw=.df5451ef-2235-46e0-a472-5909a999976b@github.com> On Sat, 6 Sep 2025 00:28:18 GMT, Mohamed Issa wrote: >>> @missa-prime Looks like an interesting patch! Do you think you could add some sort of IR test here, to verify that the correct code is generated on AVX10 vs lower AVX? >> >> @eme64 Thanks for the suggestion. This patch doesn't modify any IR though, so I'm not sure what IR test(s) to add. I could modify existing tests (`test/hotspot/jtreg/compiler/vectorapi/VectorFPtoIntCastTest.java`, `test/hotspot/jtreg/compiler/vectorization/TestFloatConversionsVector.java`, `test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java`) that use IR nodes as dependencies though. Would that be sufficient? Or did you have something else in mind? > >> @missa-prime Could you not match on the mach graph? 
See example: `test/hotspot/jtreg/compiler/vectorapi/VectorMultiplyOpt.java` with `CompilePhase.FINAL_CODE`. >> >> Maybe another `CompilePhase` is better. I have never matched on the mach graph myself, but I wonder if it may be useful here. > > I modified existing vector conversion tests, and I'll add some matching scalar tests to get full coverage. @missa-prime , please have a look at the following failure with the current patch 2025-09-10T02:04:00.8424130Z 2025-09-10T02:04:00.8424221Z Failed IR Rules (4) of Methods (4) 2025-09-10T02:04:00.8424462Z ---------------------------------- 2025-09-10T02:04:00.8425011Z 1) Method "public char[] compiler.vectorization.runner.ArrayTypeConvertTest.convertDoubleToChar()" - [Failed IR rules: 1]: 2025-09-10T02:04:00.8426351Z * @IR rule 3: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#CAST_D2X#_", "> 0"}, failOn={}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, applyIfCPUFeatureAnd={"avx", "true", "avx10_2", "false"}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})" 2025-09-10T02:04:00.8427967Z > Phase "Final Code": 2025-09-10T02:04:00.8428412Z - counts: Graph contains wrong number of nodes: 2025-09-10T02:04:00.8429037Z * Constraint 1: "(\d+(\s){2}(castD2X_reg_(av|eve)x.*)+(\s){2}===.*)" 2025-09-10T02:04:00.8429650Z - Failed comparison: [found] 0 > 0 [given] 2025-09-10T02:04:00.8430146Z - No nodes matched! 
2025-09-10T02:04:00.8430409Z 2025-09-10T02:04:00.8431165Z 2) Method "public int[] compiler.vectorization.runner.ArrayTypeConvertTest.convertDoubleToInt()" - [Failed IR rules: 1]: 2025-09-10T02:04:00.8433283Z * @IR rule 2: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#CAST_D2X#_", "> 0"}, failOn={}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, applyIfCPUFeatureAnd={"avx", "true", "avx10_2", "false"}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})" 2025-09-10T02:04:00.8434546Z > Phase "Final Code": 2025-09-10T02:04:00.8434810Z - counts: Graph contains wrong number of nodes: 2025-09-10T02:04:00.8435174Z * Constraint 1: "(\d+(\s){2}(castD2X_reg_(av|eve)x.*)+(\s){2}===.*)" 2025-09-10T02:04:00.8435523Z - Failed comparison: [found] 0 > 0 [given] 2025-09-10T02:04:00.8435792Z - No nodes matched! 2025-09-10T02:04:00.8435937Z 2025-09-10T02:04:00.8436340Z 3) Method "public short[] compiler.vectorization.runner.ArrayTypeConvertTest.convertDoubleToShort()" - [Failed IR rules: 1]: 2025-09-10T02:04:00.8437688Z * @IR rule 3: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#CAST_D2X#_", "> 0"}, failOn={}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, applyIfCPUFeatureAnd={"avx", "true", "avx10_2", "false"}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})" 2025-09-10T02:04:00.8438703Z > Phase "Final Code": 2025-09-10T02:04:00.8438956Z - counts: Graph contains wrong number of nodes: 2025-09-10T02:04:00.8439301Z * Constraint 1: "(\d+(\s){2}(castD2X_reg_(av|eve)x.*)+(\s){2}===.*)" 2025-09-10T02:04:00.8439647Z - Failed comparison: [found] 0 > 0 [given] 2025-09-10T02:04:00.8439913Z - No nodes matched! 
2025-09-10T02:04:00.8440063Z 2025-09-10T02:04:00.8440436Z 4) Method "public int[] compiler.vectorization.runner.ArrayTypeConvertTest.convertFloatToInt()" - [Failed IR rules: 1]: 2025-09-10T02:04:00.8441891Z * @IR rule 2: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#CAST_F2X#_", "> 0"}, failOn={}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, applyIfCPUFeatureAnd={"avx", "true", "avx10_2", "false"}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})" 2025-09-10T02:04:00.8442912Z > Phase "Final Code": 2025-09-10T02:04:00.8443155Z - counts: Graph contains wrong number of nodes: 2025-09-10T02:04:00.8443492Z * Constraint 1: "(\d+(\s){2}(castF2X_reg_(av|eve)x.*)+(\s){2}===.*)" 2025-09-10T02:04:00.8443951Z - Failed comparison: [found] 0 > 0 [given] 2025-09-10T02:04:00.8444210Z - No nodes matched! 2025-09-10T02:04:00.8444355Z 2025-09-10T02:04:00.8444498Z >>> Check stdout for compilation output of the failed methods ------------- PR Comment: https://git.openjdk.org/jdk/pull/26919#issuecomment-3276020517 From cslucas at openjdk.org Wed Sep 10 18:39:06 2025 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Wed, 10 Sep 2025 18:39:06 GMT Subject: RFR: 8361699: C2: assert(can_reduce_phi(n->as_Phi())) failed: Sanity: previous reducible Phi is no longer reducible before SUT [v2] In-Reply-To: References: Message-ID: > Please, review this patch to fix issue that may occur when reducing allocation merge. > > As the assert message describe, the problem is a `Phi` considered reducible during one invocation of `adjust_scalar_replaceable_state` turned out to be later non-reducible. This situation can happen if a subsequent invocation of the same method causes all inputs to the phi to be NSR; therefore there is no point in reducing the Phi. It can also happen during the propagation of NSR state done by `find_scalar_replaceable_allocs`. 
> > The change in `revisit_reducible_phi_status` is just a clean-up. > The real fix is in `find_scalar_replaceable_allocs`. > > Tested on Linux x64/Aarch64 release/fastdebug with JTREG tier1-3. Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision: Revert clean-up in EA. Make catch statements more specific in test case. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/27063/files - new: https://git.openjdk.org/jdk/pull/27063/files/7ebd687f..17d5ab22 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=27063&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27063&range=00-01 Stats: 16 lines in 2 files changed: 13 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/27063.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27063/head:pull/27063 PR: https://git.openjdk.org/jdk/pull/27063 From cslucas at openjdk.org Wed Sep 10 18:39:07 2025 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Wed, 10 Sep 2025 18:39:07 GMT Subject: RFR: 8361699: C2: assert(can_reduce_phi(n->as_Phi())) failed: Sanity: previous reducible Phi is no longer reducible before SUT In-Reply-To: <2brDXuLmbVBVRaeSyCdKokA706v3t6VsZfGvj_QceJ4=.4483390e-c726-4d82-b220-f1dbdf4efef0@github.com> References: <1uDOe3Oe-hihmDHea2h8vcvRZsKKBeNp0J9lKYUujxk=.abd111bc-3625-4c71-bfa2-0a4c1f4d3875@github.com> <2brDXuLmbVBVRaeSyCdKokA706v3t6VsZfGvj_QceJ4=.4483390e-c726-4d82-b220-f1dbdf4efef0@github.com> Message-ID: <3-CCR9TA1nRh8rYDO8BEs-H6qP-Xa42r2kSjckUcdLw=.87585c25-a3dd-4bb7-825a-db58ddae7abb@github.com> On Tue, 9 Sep 2025 07:01:15 GMT, Roberto Castañeda Lozano wrote: > Sounds good, please file a RFE for that. I would suggest then to postpone the clean-up in `revisit_reducible_phi_status` to that RFE. I created this RFE to track that: https://bugs.openjdk.org/browse/JDK-8367367 @robcasloz - I pushed some changes addressing your and @eme64's comments. Could you please re-run your internal tests?
------------- PR Comment: https://git.openjdk.org/jdk/pull/27063#issuecomment-3276082582 PR Comment: https://git.openjdk.org/jdk/pull/27063#issuecomment-3276091930 From rehn at openjdk.org Wed Sep 10 18:46:52 2025 From: rehn at openjdk.org (Robbin Ehn) Date: Wed, 10 Sep 2025 18:46:52 GMT Subject: RFR: 8365926: RISC-V: Performance regression in renaissance (chi-square) [v5] In-Reply-To: <64z-PlrnxAISLzKBq-RZz7CXkQirGTvOgTGMJQl833o=.73ea3239-dfb6-4e32-b20f-8398334f2759@github.com> References: <64z-PlrnxAISLzKBq-RZz7CXkQirGTvOgTGMJQl833o=.73ea3239-dfb6-4e32-b20f-8398334f2759@github.com> Message-ID: On Thu, 4 Sep 2025 13:32:34 GMT, Robbin Ehn wrote: >> Hey, please consider! >> >> A bunch of info in JBS entry, please read that also. >> >> I narrowed this issue down to the old jal optimization, making direct calls when in reach. >> This patch restores them and removes this regression. >> >> In essence we turn "jalr ra,0(t1)" into a "jal ra," if reachable, and restore the jalr if a new destination is not reachable. >> >> Please test on your hardware! >> >> >> Chi Square (100 runs each, 10 fastest iterations of each run, P550) >> JDK-23 (last version with trampoline calls) >> Mean: 3189.5827 >> Standard Deviation: 284.6478 >> >> JDK-25 >> Mean: 3424.8905 >> Standard Deviation: 222.2208 >> >> Patch: >> Mean: 3144.8535 >> Standard Deviation: 229.2577 >> >> >> No issues found in t1, running t2 also. Stress tested on vf2, bpi-f3, p550. > > Robbin Ehn has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains seven additional commits since the last revision: > > - Merge branch 'master' into 8365926 > - Review comments > - Review comments > - Merge branch 'master' into 8365926 > - Spelling > - Merge branch 'master' into 8365926 > - draft jal<->jalr Hamlin had some offline Q so I gather this data for him: Benchmark Results: Base: JDK24* +UseTrampoline JAL OPT: JDK24* +UseTrampoline + JAL OPT
+-----------------+-------------+------------+----------------+----------------+--------------+-------------------+------------+--------------+-----------------+--------------------+
| Benchmark       | Mean (Base) | SD (Base)  | Fastest (Base) | Mean (JAL OPT) | SD (JAL OPT) | Fastest (JAL OPT) | Diff Mean  | Diff Fastest | Mean Diff Ratio | Fastest Diff Ratio |
+-----------------+-------------+------------+----------------+----------------+--------------+-------------------+------------+--------------+-----------------+--------------------+
| future-genetic  | 8317.8449   | 925.0775   | 7824.59        | 8421.137       | 1870.3916    | 7955.19           | 103.2922   | 130.6        | 1.012418145     | 1.01669097         |
| akka-uct        | 54775.8037  | 5220.7361  | 49614.46       | 54149.9939     | 4730.3662    | 48736.7           | -625.8097  | -877.76      | 0.9885750686    | 0.9823083835       |
| movie-lens      | 44859.3268  | 107.8713   | 38160.64       | 43043.6965     | 7932.6525    | 36807.2           | -1815.6295 | -1353.44     | 0.9595261529    | 0.9645330896       |
| scala-doku      | 10792.4933  | 3004.9348  | 970.34         | 10739.0164     | 2692.6155    | 9226.94           | -53.4766   | 256.59       | 0.9950450188    | 1.028605382        |
| chi-square      | 4740.1812   | 3552.9489  | 2579.09        | 4749.0893      | 3484.3178    | 2498.04           | 8.9081     | -81.05       | 1.001879274     | 0.968574187        |
| fj-kmeans       | 18597.656   | 2481.4036  | 17994.43       | 18588.154      | 4458.6089    | 18019.15          | -9.5018    | 24.72        | 0.9994890862    | 1.001373758        |
| db-shootout     | 26529.8048  | 3163.9087  | 21270.43       | 25101.5681     | 2483.0698    | 21419.11          | -1428.2367 | 148.67       | 0.9461648244    | 1.006989986        |
| finagle-http    | 20646.1713  | 1635.9154  | 14898.97       | 20250.4966     | 1046.1738    | 14735.66          | -395.6747  | -163.31      | 0.9808354443    | 0.9890388396       |
| reactors        | 52051.8872  | 2023.7865  | 49188.65       | 51625.9497     | 2150.598     | 48874.49          | -425.9376  | -314.16      | 0.9918170594    | 0.9936131608       |
| dec-tree        | 7532.9295   | 756.8107   | 4076.4         | 7441.0578      | 750.30926    | 4089.08           | -91.8717   | 12.68        | 0.9878039878    | 1.003110588        |
| naive-bayes     | 38973.8684  | 16828.5555 | 31479.37       | 38484.4577     | 16640.458    | 31576.24          | -489.4106  | 96.87        | 0.9874425937    | 1.003077253        |
| als             | 20116.2896  | 42.9005    | 14593.64       | 19553.929      | 947.1711     | 14599.15          | -562.3509  | 5.52         | 0.9720449855    | 1.000377562        |
| par-mnemonics   | 17564.7499  | 744.1041   | 16654.08       | 17239.074      | 1100.0016    | 15942.67          | -325.676   | -711.41      | 0.9814585518    | 0.9572831402       |
| scala-kmeans    | 1201.4918   | 180.6982   | 845            | 1173.5701      | 205.5769     | 791.32            | -27.9217   | -53.68       | 0.9767608069    | 0.9364733728       |
| philosophers    | 4780.9081   | 417.8337   | 3656.22        | 4828.5436      | 1372.1029    | 3926.02           | 47.6356    | 269.8        | 1.009963714     | 1.073792058        |
| log-regression  | 7403.8792   | 8743.3328  | 3675.79        | 7275.2818      | 715.8207     | 3578.2            | -128.5983  | -97.6        | 0.98263097      | 0.9734506052       |
| gauss-mix       | 35128.1145  | 8364.2843  | 27585.27       | 33996.7118     | 7896.5377    | 26810.99          | -1131.4027 | -774.27      | 0.9677921028    | 0.9719313967       |
| mnemonics       | 21426.0608  | 537.9065   | 20202.69       | 20956.9427     | 610.3026     | 19568.55          | -469.1181  | -634.14      | 0.9781052568    | 0.9686111107       |
| dotty           | 16674.7994  | 13824.23   | 12773.145      | 16098.8288     | 13498.268    | 7484.09           | -575.9706  | -247.36      | 0.965458619     | 0.9680060015       |
| finagle-chirper | 20949.0206  | 10776.0049 | 15527.08       | 20286.9623     | 10038.7242   | 15212.05          | -662.0582  | -315.03      | 0.9683966944    | 0.9797109308       |
+-----------------+-------------+------------+----------------+----------------+--------------+-------------------+------------+--------------+-----------------+--------------------+
------------- PR Comment: https://git.openjdk.org/jdk/pull/26944#issuecomment-3276121311 From hgreule at openjdk.org Wed Sep 10 19:32:58 2025 From:
hgreule at openjdk.org (Hannes Greule) Date: Wed, 10 Sep 2025 19:32:58 GMT Subject: RFR: 8356813: Improve Mod(I|L)Node::Value [v7] In-Reply-To: References: <2Jf_gfvRlKcmCFoQHp5T0WW_fU_yK5-0Z3z41f00-YU=.164be9f0-fae1-44bb-84c3-846d8c2c0db2@github.com> Message-ID: On Wed, 10 Sep 2025 10:25:00 GMT, Quan Anh Mai wrote: >> Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: >> >> review > > src/hotspot/share/opto/divnode.cpp line 1225: > >> 1223: const TypeInteger* i1 = t1->isa_integer(bt); >> 1224: const TypeInteger* i2 = t2->isa_integer(bt); >> 1225: if (i1 == nullptr || i2 == nullptr) { > > If they are not `TOP` here, `isa_integer` should never return `nullptr`, it's better to do an `assert` here. I guess using `is_integer` directly might make sense then? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25254#discussion_r2337712152 From vlivanov at openjdk.org Wed Sep 10 22:05:49 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 10 Sep 2025 22:05:49 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v9] In-Reply-To: References: Message-ID: > This PR introduces C2 support for `Reference.reachabilityFence()`. > > After [JDK-8199462](https://bugs.openjdk.org/browse/JDK-8199462) went in, it was discovered that C2 may break the invariant the fix relied upon [1]. So, this is an attempt to introduce proper support for `Reference.reachabilityFence()` in C2. C1 is left intact for now, because there are no signs yet it is affected. > > `Reference.reachabilityFence()` can be used in performance critical code, so the primary goal for C2 is to reduce its runtime overhead as much as possible. The ultimate goal is to ensure liveness information is attached to interfering safepoints, but it takes multiple steps to properly propagate the information through compilation pipeline without negatively affecting generated code quality. > > Also, I don't consider this fix as complete. 
It does fix the reported problem, but it doesn't provide any strong guarantees yet. In particular, since `ReachabilityFence` is CFG-only node, nothing explicitly forbids memory operations to float past `Reference.reachabilityFence()` and potentially reaching some other safepoints current analysis treats as non-interfering. Representing `ReachabilityFence` as memory barrier (e.g., `MemBarCPUOrder`) would solve the issue, but performance costs are prohibitively high. Alternatively, the optimization proposed in this PR can be improved to conservatively extend referent's live range beyond `ReachabilityFence` nodes associated with it. It would meet performance criteria, but I prefer to implement it as a followup fix. > > Another known issue relates to reachability fences on constant oops. If such constant is GCed (most likely, due to a bug in Java code), similar reachability issues may arise. For now, RFs on constants are treated as no-ops, but there's a diagnostic flag `PreserveReachabilityFencesOnConstants` to keep the fences. I plan to address it separately. > > [1] https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/ref/Reference.java#L667 > "HotSpot JVM retains the ref and does not GC it before a call to this method, because the JIT-compilers do not have GC-only safepoints." 
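For readers unfamiliar with the API under discussion, here is a minimal sketch of the usage pattern that `Reference.reachabilityFence()` is meant for: keeping an object strongly reachable while code works on state derived from it. The class `ReachabilityFenceSketch`, its handle table, and the omitted cleanup logic are hypothetical and exist only for illustration; the sketch says nothing about how C2 compiles the fence.

```java
import java.lang.ref.Reference;

// Hypothetical illustration of the try/finally fence pattern.
public class ReachabilityFenceSketch {
    // Stand-in for an external resource table indexed by a handle.
    // In a real class, a Cleaner or similar mechanism (omitted here) would
    // release the slot once the owning object becomes unreachable.
    private static final long[] TABLE = new long[16];

    private final int handle;

    public ReachabilityFenceSketch(int handle, long value) {
        this.handle = handle;
        TABLE[handle] = value;
    }

    public long read() {
        try {
            // After the field load, nothing below uses 'this' directly, so
            // without a fence the object could be treated as unreachable
            // while the handle is still in use.
            int h = this.handle;
            return TABLE[h];
        } finally {
            // Keeps 'this' strongly reachable until this point.
            Reference.reachabilityFence(this);
        }
    }

    public static void main(String[] args) {
        System.out.println(new ReachabilityFenceSketch(3, 42L).read());
    }
}
```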
> > Testing: > - [x] hs-tier1 - hs-tier8 > - [x] hs-tier1 - hs-tier6 w/ -XX:+StressReachabilityFences -XX:+VerifyLoopOptimizations > - [x] java/lang/foreign microbenchmarks Vladimir Ivanov has updated the pull request incrementally with four additional commits since the last revision: - update - update - update - MultiNode -> Node ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25315/files - new: https://git.openjdk.org/jdk/pull/25315/files/e95d4eb9..6981bd18 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25315&range=08 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25315&range=07-08 Stats: 68 lines in 12 files changed: 40 ins; 2 del; 26 mod Patch: https://git.openjdk.org/jdk/pull/25315.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25315/head:pull/25315 PR: https://git.openjdk.org/jdk/pull/25315 From vlivanov at openjdk.org Wed Sep 10 22:05:51 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 10 Sep 2025 22:05:51 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v3] In-Reply-To: References: Message-ID: On Mon, 8 Sep 2025 13:28:10 GMT, Emanuel Peter wrote: >> Good idea. Added one. > > Also: you promise that it happens randomly. But it seems to be added deterministically everywhere. Did I miss something? Sorry for the confusion. Reworded the comment. I didn't intend to make it truly random. The idea was to automatically insert RF nodes during parsing to stress the implementation. It doesn't slow down compilation times that much, so aggressive insertion just works. >> Live ranges of values are routinely extended during loop opts. And it can break the invariant that all interfering safepoints contain the referent in their oop map. (If an interfering safepoint doesn't keep the referent alive, then it becomes possible for the referent to be prematurely GCed.) 
>> >> After loop opts are over, it becomes possible to reliably enumerate all interfering safe points and ensure the referent present in their oop maps. > > Can you make sure this explanation is in the comment ;) Done. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2334889253 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2334855917 From vlivanov at openjdk.org Wed Sep 10 22:05:56 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 10 Sep 2025 22:05:56 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v8] In-Reply-To: <_n3uP_Dkl3RNq3MFoRDXsS28SM8CcQHaR6vdUJF9U8s=.dcfab97b-be28-4244-93df-c8a23d6d66b8@github.com> References: <_n3uP_Dkl3RNq3MFoRDXsS28SM8CcQHaR6vdUJF9U8s=.dcfab97b-be28-4244-93df-c8a23d6d66b8@github.com> Message-ID: On Mon, 8 Sep 2025 12:45:56 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/callGenerator.cpp line 623: >> >>> 621: return; // keep the original call node as the holder of reachability info >>> 622: } >>> 623: } >> >> Maybe that's just me. But people use the assert messages both in positive and negative ways, and so this is a bit ambiguous. Maybe you can write: >> `no reachability edge should be present` >> >> I'm still a bit unsure what the `SafePointNode::grow_stack` comment means. >> In the previous comment https://github.com/openjdk/jdk/pull/25315#discussion_r2320120466 you explained more. Why not add that here instead? > > I'm also not sure yet why there is a difference between incremental inlining and regular inlining. > Do you think it would make sense to explain that here, or is it explained elsewhere? There are no safepoint-attached reachability edges present during normal parsing. For incremental inlining, JVMS from the original call is taken and extended with callee state. If there are reachability edges present, they have to be treated specially and carried over to all safepoints produced during incremental inlining attempt. 
There's no such support in place yet. >> src/hotspot/share/opto/macro.cpp line 973: >> >>> 971: _igvn._worklist.push(ac); >>> 972: } else if (use->is_ReachabilityFence() && OptimizeReachabilityFences) { >>> 973: use->as_ReachabilityFence()->clear_referent(_igvn); // redundant fence >> >> Thanks for refactoring a bit here :) >> >> Is this rf guaranteed to belong to the Allocation somehow? > > Ah, you could mention that later `ReachabilityFenceNode::Identity` removes the rf. > Is this rf guaranteed to belong to the Allocation somehow? I don't get your question. The code iterates over users of an allocation which is being eliminated. Semantically, RF is a no-op on a scalarizable referent and has to be removed in order to let the scalarization happen. > Ah, you could mention that later ReachabilityFenceNode::Identity removes the rf. Done. >> src/hotspot/share/opto/reachability.cpp line 136: >> >>> 134: return true; >>> 135: } >>> 136: } >> >> Nit: `an no-op` -> `a no-op` >> >> Also: do you need the return value? The only use case does not do anything with it. > > You could mention that `Identity` will remove the node later. > Also: do you need the return value? The only use case does not do anything with it. I decided to keep it for diagnostic purposes even though no existing callers care about it. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2334899185 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2337978454 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2337989028 From vlivanov at openjdk.org Wed Sep 10 22:06:04 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 10 Sep 2025 22:06:04 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v8] In-Reply-To: References: Message-ID: On Mon, 8 Sep 2025 12:59:48 GMT, Emanuel Peter wrote: > could we just go through _reachability_fences, and hack the graph and clean up with IGVN? 
Or do we really need the loop state to do this successfully? RF elimination needs control for referent to enumerate all interfering safepoints. Theoretically, it's possible to use a conservative estimate, but then: (1) it can worsen the result (by enumerating more interfering safepoints than needed); and (2) build an unschedulable graph if referent doesn't dominate safepoint node (if estimate is way too conservative). IMO it's safer to build full dominator tree here. > It probably has a performance impact, right? Have you measured that? It does have a noticeable cost. On my laptop it bumps the time spent doing RF processing from 170ms to 210ms $ java -Xcomp -XX:-TieredCompilation -XX:+CITime -XX:+UnlockDiagnosticVMOptions -XX:-StressReachabilityFences IdealLoop: 0.173 s ReachabilityFence: 0.000 s Optimize: 0.000 s Eliminate: 0.000 s ``` vs $ java -Xcomp -XX:-TieredCompilation -XX:+CITime -XX:+UnlockDiagnosticVMOptions -XX:+StressReachabilityFences IdealLoop: 0.212 s ReachabilityFence: 0.030 s Optimize: 0.004 s Eliminate: 0.004 s ``` I reimplemented it to piggyback on the last loop optimization attempt if there's any and it drastically improves the situation: $ java -Xcomp -XX:-TieredCompilation -XX:+CITime -XX:+UnlockDiagnosticVMOptions -XX:+StressReachabilityFences IdealLoop: 0.193 s ReachabilityFence: 0.009 s Optimize: 0.003 s Eliminate: 0.004 s > src/hotspot/share/opto/loopTransform.cpp line 66: > >> 64: //------------------------------unique_loop_exit_or_null---------------------- >> 65: // Return the loop-exit projection if it is unique. >> 66: Node* IdealLoopTree::unique_loop_exit_or_null() { > > I suggested it here: > https://github.com/openjdk/jdk/pull/25315#discussion_r2149677594 > Can we change the return type to `IfProjNode`? > > Also: when is it possible that there are none or multiple loop exits? > Can you add a comment below where you return nullptr? Done. 
> src/hotspot/share/opto/parse1.cpp line 2233: > >> 2231: insert_reachability_fence(referent); >> 2232: } >> 2233: } > > Comments look better, thanks :) > > But `StressReachabilityFences` seems to promise that it should happen randomly. Did you want to do that or adjust the flag comment? I adjusted flag comment. > src/hotspot/share/opto/reachability.cpp line 49: > >> 47: * >> 48: * It is tempting to directly attach referents to interfering safepoints right from the beginning, but it >> 49: * doesn't play well with some optimizations C2 does. > > Do you have an example for such optimizations? Loop-invariant code motion is one example. Do you want me to add it to the comment? After parsing is over, the IR is in valid state, but loop optimizations are the primary reason why it can be broken later. > src/hotspot/share/opto/reachability.cpp line 67: > >> 65: * RF nodes may interfere with RA, so stand-alone RF nodes are eliminated and their referents are >> 66: * transferred to corresponding safepoints (phase #2). When safepoints are pruned during macro expansion, >> 67: * corresponding reachability edges also go away. > > Spell our RA on first use. Make more clear that this is why we eliminate RF before RA. > Suggestion: > > * RF nodes may interfere with register allocation (RA), hence we eliminate RF nodes and transfer their > * referents to corresponding safepoints (phase #2). When safepoints are pruned during macro expansion, > * corresponding reachability edges also go away. > > `reachability edges also go away` ... and that is ok why? Sketch of what you could write, is it correct? > - reachability only needs to be correct at SafePoints. If all the SafePoints are removed for a referent, then we don't need to ensure its reachablility. Applied your suggested change and elaborated the comment. > the very same similar way sounds a little funny. I Fixed. > What is the issue with the edges being attached to safepoints here? 
The issue is safepoint-attached representation conflicts with derived oops representation. There's no way to distinguish between them. As of now, VM treats post-debug info edges as representing derived oops which is completely wrong when there are reachability edges present. More work is needed to support both cases. > src/hotspot/share/opto/reachability.cpp line 438: > >> 436: if (!OptimizeReachabilityFences) { >> 437: return false; >> 438: } > > Can this ever fail? Could it be an assert? Done. > src/hotspot/share/opto/reachability.cpp line 441: > >> 439: >> 440: Unique_Node_List redundant_rfs; >> 441: Node_List worklist; > > Not sure if necessary, but maybe good practice anyway: add `ResourceMark`. Done. > src/hotspot/share/opto/reachability.cpp line 453: > >> 451: SafePointNode* sfpt = safepoints.pop()->as_SafePoint(); >> 452: assert(is_dominator(get_ctrl(referent), sfpt), ""); >> 453: assert(sfpt->req() == rf_start_offset(sfpt), ""); > > Is this the only reason we need this to happend during LoopOpts - i.e. that we can call `get_ctrl` and `is_dominator`? > > Because it is potentially a lot of overhead to create the whole loop-opts structures just for this. It's solely for `get_ctrl(referent)` call in `enumerate_interfering_sfpts()`. 
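For readers following the thread without the Java-level picture in mind, here is a minimal sketch of the kind of code that relies on `Reference.reachabilityFence`. The `ExternalResource` class, its `TABLE` array, and the method names are hypothetical and exist only for illustration; the only real JDK API used is `java.lang.ref.Reference.reachabilityFence(Object)`. It makes no claim about how C2 represents the fence internally.

```java
import java.lang.ref.Reference;

// Hypothetical example class: an object that owns a slot in some external table.
final class ExternalResource {
    // Stand-in for off-heap or native state that a cleaner could reclaim
    // once the owning object becomes unreachable.
    private static final int[] TABLE = new int[16];

    private final int slot;

    ExternalResource(int slot, int value) {
        this.slot = slot;
        TABLE[slot] = value;
    }

    int read() {
        // After this load, the code below no longer needs 'this', so without
        // a fence the object could be treated as unreachable while the table
        // access is still in progress.
        int index = this.slot;
        try {
            return TABLE[index];
        } finally {
            // Keeps 'this' strongly reachable at least until this point.
            Reference.reachabilityFence(this);
        }
    }
}
```

The `try`/`finally` shape follows the usual recommendation for placing the fence so that it also covers exceptional exits.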
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2337971541 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2337972022 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2337978893 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2334848196 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2334876169 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2337985581 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2337997889 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2337998906 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2337994039 From vlivanov at openjdk.org Wed Sep 10 22:06:05 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 10 Sep 2025 22:06:05 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v8] In-Reply-To: References: <_n3uP_Dkl3RNq3MFoRDXsS28SM8CcQHaR6vdUJF9U8s=.dcfab97b-be28-4244-93df-c8a23d6d66b8@github.com> Message-ID: On Wed, 10 Sep 2025 21:46:17 GMT, Vladimir Ivanov wrote: >> You could mention that `Identity` will remove the node later. > >> Also: do you need the return value? The only use case does not do anything with it. > > I decided to keep it for diagnostic purposes even though no existing callers care about it. > You could mention that Identity will remove the node later. Done. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2337989216 From dzhang at openjdk.org Wed Sep 10 23:55:10 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Wed, 10 Sep 2025 23:55:10 GMT Subject: RFR: 8367293: RISC-V: enable vectorapi test for VectorMask.laneIsSet In-Reply-To: References: Message-ID: <4K4PN7Fl15s3VmsLEI4_u7RZqRTBU4m8KUtt7kYSUfc=.eb73d339-bfff-4a96-9c26-c43eacc66314@github.com> On Wed, 10 Sep 2025 03:10:02 GMT, Dingli Zhang wrote: > Hi, > Can you help to review this patch? Thanks! 
> > [JDK-8366588](https://bugs.openjdk.org/browse/JDK-8366588) adds a vectorapi test for VectorMask.laneIsSet, which we can also enable on RISC-V. > > ### Test (fastdebug) > - [x] Run compiler/vectorapi/VectorMaskLaneIsSetTest.java on k1, k230 and sg2042 Thanks all for the review! ------------- PR Comment: https://git.openjdk.org/jdk/pull/27181#issuecomment-3276890571 From duke at openjdk.org Wed Sep 10 23:55:10 2025 From: duke at openjdk.org (duke) Date: Wed, 10 Sep 2025 23:55:10 GMT Subject: RFR: 8367293: RISC-V: enable vectorapi test for VectorMask.laneIsSet In-Reply-To: References: Message-ID: <4ve1VKFLupxZj5WpLeVbQ98wjoC7hXFlmgL0p5kwbW8=.0b819012-bdf7-400f-9ef7-95e35fd4dfd7@github.com> On Wed, 10 Sep 2025 03:10:02 GMT, Dingli Zhang wrote: > Hi, > Can you help to review this patch? Thanks! > > [JDK-8366588](https://bugs.openjdk.org/browse/JDK-8366588) adds a vectorapi test for VectorMask.laneIsSet, which we can also enable on RISC-V. > > ### Test (fastdebug) > - [x] Run compiler/vectorapi/VectorMaskLaneIsSetTest.java on k1, k230 and sg2042 @DingliZhang Your change (at version c7a5e95ad5f7b84333509375db249c0797b480c4) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/27181#issuecomment-3276892219 From dzhang at openjdk.org Thu Sep 11 00:07:20 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Thu, 11 Sep 2025 00:07:20 GMT Subject: Integrated: 8367293: RISC-V: enable vectorapi test for VectorMask.laneIsSet In-Reply-To: References: Message-ID: <0GdMbtdpbOdvhk1eNjHjaeuRgHSinMk-oG95r_uiwWc=.e58f778a-3527-40a6-977d-a0c44164d54b@github.com> On Wed, 10 Sep 2025 03:10:02 GMT, Dingli Zhang wrote: > Hi, > Can you help to review this patch? Thanks! > > [JDK-8366588](https://bugs.openjdk.org/browse/JDK-8366588) adds a vectorapi test for VectorMask.laneIsSet, which we can also enable on RISC-V. 
> > ### Test (fastdebug) > - [x] Run compiler/vectorapi/VectorMaskLaneIsSetTest.java on k1, k230 and sg2042 This pull request has now been integrated. Changeset: 134c3ef4 Author: Dingli Zhang Committer: Fei Yang URL: https://git.openjdk.org/jdk/commit/134c3ef41e774b483bcce32ce2fe0ef416017728 Stats: 7 lines in 1 file changed: 0 ins; 0 del; 7 mod 8367293: RISC-V: enable vectorapi test for VectorMask.laneIsSet Reviewed-by: fyang, epeter ------------- PR: https://git.openjdk.org/jdk/pull/27181 From sparasa at openjdk.org Thu Sep 11 00:45:45 2025 From: sparasa at openjdk.org (Srinivas Vamsi Parasa) Date: Thu, 11 Sep 2025 00:45:45 GMT Subject: RFR: 8354348: Enable Extended EVEX to REX2/REX demotion for commutative operations with same dst and src2 [v5] In-Reply-To: References: Message-ID: > This change extends Extended EVEX (EEVEX) to REX2/REX demotion for Intel APX NDD instructions to handle commutative operations when the destination register and the second source register (src2) are the same. > > Currently, EEVEX to REX2/REX demotion is only enabled when the first source (src1) and the destination are the same. This enhancement allows additional cases of valid demotion for commutative instructions (add, imul, and, or, xor). 
> > For example: > `eaddl r18, r25, r18` can be encoded as `addl r18, r25` using APX REX2 encoding > `eaddl r2, r7, r2` can be encoded as `addl r2, r7` using non-APX legacy encoding Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: undo new match rules for RegMemReg for commutative operations ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26997/files - new: https://git.openjdk.org/jdk/pull/26997/files/9714a9b1..012511ab Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26997&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26997&range=03-04 Stats: 120 lines in 1 file changed: 0 ins; 120 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26997.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26997/head:pull/26997 PR: https://git.openjdk.org/jdk/pull/26997 From sparasa at openjdk.org Thu Sep 11 00:45:45 2025 From: sparasa at openjdk.org (Srinivas Vamsi Parasa) Date: Thu, 11 Sep 2025 00:45:45 GMT Subject: RFR: 8354348: Enable Extended EVEX to REX2/REX demotion for commutative operations with same dst and src2 [v5] In-Reply-To: References: <0X5cvpQZxb1l5Q_8f-iU0K4WtdyFW8ehdPXR2zsnSzo=.7f4f3d03-94db-4482-b5ee-c5f1362d84b5@github.com> Message-ID: On Tue, 9 Sep 2025 02:18:32 GMT, Jatin Bhateja wrote: >> Will run experiments to make sure that the RegRegMem pattern also applies to RegMemReg case and remove the newly added match rules if they're redundant. Will update you soon. > > Hi @vamsi-parasa, your latest patch does not address this. Hi Jatin (@jatin-bhateja), please see the latest update which removed the unnecessary match rules for RegMemReg case. 
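The demotion rule described in the PR summary above can be sketched roughly as follows. This is a simplified model, not HotSpot's assembler code: the class `NddDemotion`, the method `demote`, and the string output are all hypothetical, and the sketch assumes that registers r16 to r31 are the APX extended registers that require a REX2 prefix. Other conditions the real encoder may check are ignored here.

```java
// Hypothetical model of the demotion decision for "dst = src1 op src2".
final class NddDemotion {
    // Assumption: r16..r31 are the APX extended GPRs and need a REX2 prefix.
    static boolean isExtended(int reg) {
        return reg >= 16;
    }

    // Returns the demoted two-operand form, or null if the three-operand
    // NDD (EEVEX) encoding has to be kept.
    static String demote(String op, boolean commutative, int dst, int src1, int src2) {
        int other;
        if (dst == src1) {
            other = src2;              // existing case: dst aliases src1
        } else if (commutative && dst == src2) {
            other = src1;              // new case: operands may be swapped
        } else {
            return null;               // no aliasing that allows demotion
        }
        String encoding = (isExtended(dst) || isExtended(other)) ? "REX2" : "legacy";
        return op + " r" + dst + ", r" + other + " (" + encoding + ")";
    }
}
```

With this model, `demote("addl", true, 18, 25, 18)` yields `addl r18, r25 (REX2)` and `demote("addl", true, 2, 7, 2)` yields `addl r2, r7 (legacy)`, mirroring the two examples in the description, while a non-commutative operation with `dst == src2` is left in NDD form.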
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26997#discussion_r2338238097 From missa at openjdk.org Thu Sep 11 02:19:54 2025 From: missa at openjdk.org (Mohamed Issa) Date: Thu, 11 Sep 2025 02:19:54 GMT Subject: RFR: 8364305: Support AVX10 saturating floating point conversion instructions [v11] In-Reply-To: References: Message-ID: > Intel® AVX10 ISA [1] extensions added new saturating floating point conversion instructions which comply with definitions in section 5.8 of the 2019 IEEE-754 standard. They can compute floating point to integral type conversions while also handling special inputs such as NaN, +Infinity, and -Infinity. > > Without AVX10.2, the current approach starts by converting the floating point value(s) in the source register to the desired integral value(s) in the destination register. In the scalar case, the CVTTSS2SI (single precision) or CVTTSD2SI (double precision) instruction is used. In the vector case, the CVTTPS2DQ (single precision) or CVTTPD2DQ (double precision) is used. However, if the source contains a special value (NaN, -Infinity, +Infinity, <= Integer.MIN_VALUE, or >= Integer.MAX_VALUE), extra handling is required. The specific sequence of instructions involved depends on the source (single precision vs double precision), destination (long, integer, short, or byte), level of parallelization (scalar vs vector), and supported AVX extension type. Essentially though, the special values are mapped to values (NaN -> 0, -Infinity, <= Integer.MIN_VALUE -> Integer.MIN_VALUE, +Infinity, >= Integer.MAX_VALUE -> Integer.MAX_VALUE) in the integer range with the help of a few temporary regist ers to store intermediate results. > > This change uses the new AVX10.2 scalar (VCVTTSS2SIS or VCVTTSD2SIS) and vector (VCVTTPS2QQS, VCVTTPS2DQS, VCVTTPD2QQS, and VCVTTPD2DQS) instructions on supported platforms to avoid the extra handling described above. 
Also, the JTREG tests listed below were used to verify correctness with `-XX:-UseSuperWord` / `-XX:+UseSuperWord` options to exercise both scalar and vector paths. The baseline build used is [OpenJDK v26-b11](https://github.com/openjdk/jdk/releases/tag/jdk-26%2B11). > > 1. `jtreg:test/hotspot/jtreg/compiler/codegen/TestByteDoubleVect.java` > 2. `jtreg:test/hotspot/jtreg/compiler/codegen/TestByteFloatVect.java` > 3. `jtreg:test/hotspot/jtreg/compiler/codegen/TestIntDoubleVect.java` > 4. `jtreg:test/hotspot/jtreg/compiler/codegen/TestIntFloatVect.java` > 5. `jtreg:test/hotspot/jtreg/compiler/codegen/TestLongDoubleVect.java` > 6. `jtreg:test/hotspot/jtreg/compiler/codegen/TestLongFloatVect.java` > 7. `jtreg:test/hotspot/jtreg/compiler/codegen/TestShortDoubleVect.java` > 8. `jtreg:test/hotspot/jtreg/compiler/codegen/TestShortFloatVect.java` > 9. `jtreg:test/hotspot/jtreg/compiler/vectorapi/VectorFPtoIntCastTest.java` > 10. `jtreg:test/hotspot/jtreg/com... Mohamed Issa has updated the pull request incrementally with two additional commits since the last revision: - Check for instructions that shouldn't appear in vector floating point conversion tests - Correctly calculate vector lengths and don't rely on VectorReinterpret in cast2F2X and cast2D2X memory instructions ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26919/files - new: https://git.openjdk.org/jdk/pull/26919/files/bc59e4d2..8587952d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26919&range=10 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26919&range=09-10 Stats: 38 lines in 3 files changed: 2 ins; 0 del; 36 mod Patch: https://git.openjdk.org/jdk/pull/26919.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26919/head:pull/26919 PR: https://git.openjdk.org/jdk/pull/26919 From epeter at openjdk.org Thu Sep 11 05:07:31 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 11 Sep 2025 05:07:31 GMT Subject: RFR: 8366702: C2 SuperWord: refactor VTransform vector nodes 
[v6] In-Reply-To: References: Message-ID: On Fri, 5 Sep 2025 17:48:21 GMT, Manuel Hässig wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> fix include order > > Thank you for your continued effort on cost modelling, @eme64! I have some minor style comments and questions, but this mostly looks good to me. > > Regarding style, I find the alignment of local variables to be a bit distracting, especially when the aligned "things" are different operations and things are sometimes aligned and sometimes not. However, I do not know the style of the rest of the SuperWord code. @mhaessig @galderz @chhagedorn Thanks for reviewing and all the helpful suggestions :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/27056#issuecomment-3277619564 From epeter at openjdk.org Thu Sep 11 05:07:33 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 11 Sep 2025 05:07:33 GMT Subject: Integrated: 8366702: C2 SuperWord: refactor VTransform vector nodes In-Reply-To: References: Message-ID: <_4wlIBKArnJ0dC8M_Mfoa3I1JQ77CkOLDtSTP3KYPns=.eb956bae-c733-41c4-aa4a-d997feca2a80@github.com> On Tue, 2 Sep 2025 15:30:06 GMT, Emanuel Peter wrote: > I'm working on cost-modeling, and am integrating some smaller changes from this proof-of-concept PR: > https://github.com/openjdk/jdk/pull/20964 > [See plan overview.](https://bugs.openjdk.org/browse/JDK-8340093) > > This is a pure refactoring - no change in behaviour. I'm presenting it like this because it will make reviews easier. > > --------------------------------- > > I have to say: I'm very sorry for this refactoring. I took some decisions in https://github.com/openjdk/jdk/pull/19719 that I'm now partially undoing. I moved too much logic from `SuperWord::output` (now called `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`) to the `VTransform...Node::apply`. 
https://github.com/openjdk/jdk/pull/19719 was a roughly 1.5k line change, and I took about a 0.3k misstep that I'm now correcting here ;) > > I had accidentally made the `VTransformGraph` too close to the `PackSet`, and not close enough to the future vectorized C2 Graph. And that makes some future changes hard. > > My vision: > - VLoop / VLoopAnalyzer look at the scalar loop and prepare it for SuperWord > - SuperWord creates the `PackSet`: some nodes are packed, all others are scalar. > - `SuperWordVTransformBuilder` converts the `PackSet` into the `VTransformGraph` > - The `VTransformGraph` very closely represents the C2 vectorized loop after vectorization > - It does not need to know which `nodes` it packs, it rather just needs to know how to generate the new vector nodes > - That means it is straightforward to compute cost > - And it also makes optimizations on that graph easier > - And the `apply` methods are simpler too > > ---------------------------------- > > So therefore, the main goal was to make the `VTransform...Node::apply` calls simpler again. And move the logic back to `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`. > > One important step to making the `VTransformGraph` less of a `PackSet` is to remove reliance on `nodes` for the vector nodes. > > What I did: > - Moving a lot of the logic in `VTransformElementWiseVectorNode::apply` to `SuperWordVTransformBuilder::make_vector_vtnode_for_pack`. > - Will make it easier to optimize and compute cost in future RFEs. > - `VTransformVectorNodePrototype`: packs a lot of the info for `VTransformVectorNode`. > - pass info about `bt`, `vlen`, `sopc` instead of the `pack` -> allows us to eventually remove the dependency on `nodes`. > - New vector nodes, they are special cases I split away from `VTransformElementWiseVectorNode`: > - `VTransformReinterpretVectorN... This pull request has now been integrated. 
Changeset: 4cc75be8 Author: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/4cc75be80e6a89e0ed293e2f8bbb6d0f94189468 Stats: 352 lines in 4 files changed: 173 ins; 65 del; 114 mod 8366702: C2 SuperWord: refactor VTransform vector nodes Reviewed-by: chagedorn, galder ------------- PR: https://git.openjdk.org/jdk/pull/27056 From epeter at openjdk.org Thu Sep 11 05:08:27 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 11 Sep 2025 05:08:27 GMT Subject: RFR: 8367243: Format issues with dist dump debug output in PhaseGVN::dead_loop_check In-Reply-To: References: Message-ID: <5csypyowSL57JQ2SIkrH7CwktQ4nXeN7eNQSUS9nghQ=.5857a143-c420-48d9-aac9-9658211e1966@github.com> On Tue, 9 Sep 2025 16:28:24 GMT, Tobias Hartmann wrote: >> The `#` option adds color to the terminal. But that only usually works on people's terminals, and not if it is piped to a file on the server. Hence, `#` is only really a debugging feature, and not one to report with in connection with `assert`s. >> >> Simply removed the `#`, and fixed some braces and spaces. > > Looks good and trivial! @TobiHartmann Thanks for the review! I agree it is trivial. ------------- PR Comment: https://git.openjdk.org/jdk/pull/27175#issuecomment-3277641678 From epeter at openjdk.org Thu Sep 11 05:08:29 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 11 Sep 2025 05:08:29 GMT Subject: Integrated: 8367243: Format issues with dist dump debug output in PhaseGVN::dead_loop_check In-Reply-To: References: Message-ID: On Tue, 9 Sep 2025 16:20:35 GMT, Emanuel Peter wrote: > The `#` option adds color to the terminal. But that only usually works on people's terminals, and not if it is piped to a file on the server. Hence, `#` is only really a debugging feature, and not one to report with in connection with `assert`s. > > Simply removed the `#`, and fixed some braces and spaces. This pull request has now been integrated. 
Changeset: 2826d170 Author: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/2826d1702534783023802ac5c8d8ea575558f09f Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod 8367243: Format issues with dist dump debug output in PhaseGVN::dead_loop_check Reviewed-by: thartmann ------------- PR: https://git.openjdk.org/jdk/pull/27175 From epeter at openjdk.org Thu Sep 11 05:22:18 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 11 Sep 2025 05:22:18 GMT Subject: RFR: 8359412: Template-Framework Library: Operations and Expressions Message-ID: <6Bm5VrrqCOzdOooIU-wud7c3aCSuv_7GNZe7pe7D7Jk=.c99a9df1-e6bb-4c8d-94e9-029978fae6ab@github.com> Implementing ideas from original draft PR: https://github.com/openjdk/jdk/pull/23418 ([Exceptions](https://github.com/openjdk/jdk/pull/23418/files#diff-77e7db8cc0c5e02786e1c993362f98fabe219042eb342fdaffc09fd11380259dR41), [ExpressionFuzzer](https://github.com/openjdk/jdk/pull/23418/files#diff-01844ca5cb007f5eab5fa4195f2f1378d4e7c64ba477fba64626c98ff4054038R66)). Specifically, I'm extending the Template Library with `Expression`s, and lists of `Operations` (some basic Expressions). These Expressions can easily be nested and then filled with arguments, and applied in a `Template`. Details, in the **order you should review**: - `Operations.java`: maps lots of primitive operators as Expressions. - `Expression.java`: the fundamental engine behind Expressions. - `examples/TestExpressions.java`: basic example using Expressions, filling them with random constants. - `tests/TestExpression.java`: correctness test of Expression machinery. - `compiler/igvn/ExpressionFuzzer.java`: expression fuzzer for primitive type expressions, including input range/bits constraints and output range/bits verification. - `PrimitiveType.java`: added `LibraryRNG` facility. We already had `type.con()` which gave us random constants. 
But we also want to have `type.callLibraryRNG()` so that we can insert a call to a random number generator of the corresponding primitive type. I use this facility in the `ExpressionFuzzer.java` to generate random arguments for the expressions. - `examples/TestPrimitiveTypes.java`: added a `LibraryRNG` example with a weak test for randomness: we should see at least 2 different values in 1000 calls. If the reviewers absolutely insist, I could split out `LibraryRNG` into a separate RFE. But it's really not that much code, and has direct use in the `Expression` examples. **Future Work**: - Use `Expression`s in a loop over arrays / MemorySegment: fuzz auto-vectorization. - Use `Expression`s to model more operations: - `Vector API`, more arithmetic operations like from `Math` classes etc. - Ensure that the constraints / checksum mechanic in `compiler/igvn/ExpressionFuzzer.java` work, using IR rules. We may even need to add new IGVN optimizations. Add unsigned constraints. - Find a way to delay IGVN optimizations to test worklist notification: For example, we could add a new testing operator called `TestUtils.delay(x) -> x`, which is intrinsified as some new `DelayNode` that in normal circumstances just folds away, but under `StressIGVN` and `StressCCP` it can at arbitrary times in CCP and IGVN narrow its type or fold away. Initially, it outputs the `bottom_type`, no matter the input type. Eventually, we can progressively update the output to be narrower, as long as it still contains the input type. And at some point fold it away. Each time, this should trigger worklist notification, and could trigger optimizations. If there is a bug, IGVN / CCP verification could catch it. ------------- Commit messages: - fix whitespaces - LibraryRNG example - fix bug - documentation - improve expression fuzzer - wip constraints - add more comments - wip test cmp - test refactoring - handle non-deterministic results - ... 
and 15 more: https://git.openjdk.org/jdk/compare/02fe095d...0709731a Changes: https://git.openjdk.org/jdk/pull/26885/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26885&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8359412 Stats: 1702 lines in 7 files changed: 1702 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26885.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26885/head:pull/26885 PR: https://git.openjdk.org/jdk/pull/26885 From galder at openjdk.org Thu Sep 11 06:11:11 2025 From: galder at openjdk.org (Galder Zamarreño) Date: Thu, 11 Sep 2025 06:11:11 GMT Subject: RFR: 8366333: AArch64: Enhance SVE subword type implementation of vector compress In-Reply-To: References: Message-ID: On Wed, 10 Sep 2025 08:41:51 GMT, erifan wrote: > The AArch64 SVE and SVE2 architectures lack an instruction suitable for subword-type `compress` operations. Therefore, the current implementation uses the 32-bit SVE `compact` instruction to compress subword types by first widening the high and low parts to 32 bits, compressing them, and then narrowing them back to their original type. Finally, the high and low parts are merged using the `index + tbl` instructions. > > This approach is significantly slower compared to architectures with native support. After evaluating all available AArch64 SVE instructions and experimenting with various implementations (such as looping over the active elements, extraction, and insertion), I confirmed that the existing algorithm is optimal given the instruction set. However, there is still room for optimization in the following two aspects: > 1. Merging with `index + tbl` is suboptimal due to the high latency of the `index` instruction. > 2. For partial subword types, operations to the highest half are unnecessary because those bits are invalid. > > This pull request introduces the following changes: > 1. 
Replaces `index + tbl` with the `whilelt + splice` instructions, which offer lower latency and higher throughput. > 2. Eliminates unnecessary compress operations for partial subword type cases. > 3. For `sve_compress_byte`, one less temporary register is used to alleviate potential register pressure. > > Benchmark results demonstrate that these changes significantly improve performance. > > Benchmarks on Nvidia Grace machine with 128-bit SVE: > > Benchmark Unit Before Error After Error Uplift > Byte128Vector.compress ops/ms 4846.97 26.23 6638.56 31.60 1.36 > Byte64Vector.compress ops/ms 2447.69 12.95 7167.68 34.49 2.92 > Short128Vector.compress ops/ms 7174.88 40.94 8398.45 9.48 1.17 > Short64Vector.compress ops/ms 3618.72 3.04 8618.22 10.91 2.38 > > > This PR was tested on 128-bit, 256-bit, and 512-bit SVE environments, and all tests passed. Would it make sense to additionally run the relevant benchmarks on other popular aarch64 platforms such as Graviton, to make sure the improvements are seen there as well? ------------- PR Comment: https://git.openjdk.org/jdk/pull/27188#issuecomment-3278225500 From galder at openjdk.org Thu Sep 11 06:15:14 2025 From: galder at openjdk.org (Galder Zamarreño) Date: Thu, 11 Sep 2025 06:15:14 GMT Subject: RFR: 8366333: AArch64: Enhance SVE subword type implementation of vector compress In-Reply-To: References: Message-ID: On Wed, 10 Sep 2025 08:41:51 GMT, erifan wrote: > The AArch64 SVE and SVE2 architectures lack an instruction suitable for subword-type `compress` operations. Therefore, the current implementation uses the 32-bit SVE `compact` instruction to compress subword types by first widening the high and low parts to 32 bits, compressing them, and then narrowing them back to their original type. Finally, the high and low parts are merged using the `index + tbl` instructions. > > This approach is significantly slower compared to architectures with native support. 
After evaluating all available AArch64 SVE instructions and experimenting with various implementations (such as looping over the active elements, extraction, and insertion), I confirmed that the existing algorithm is optimal given the instruction set. However, there is still room for optimization in the following two aspects: > 1. Merging with `index + tbl` is suboptimal due to the high latency of the `index` instruction. > 2. For partial subword types, operations to the highest half are unnecessary because those bits are invalid. > > This pull request introduces the following changes: > 1. Replaces `index + tbl` with the `whilelt + splice` instructions, which offer lower latency and higher throughput. > 2. Eliminates unnecessary compress operations for partial subword type cases. > 3. For `sve_compress_byte`, one less temporary register is used to alleviate potential register pressure. > > Benchmark results demonstrate that these changes significantly improve performance. > > Benchmarks on Nvidia Grace machine with 128-bit SVE: > > Benchmark Unit Before Error After Error Uplift > Byte128Vector.compress ops/ms 4846.97 26.23 6638.56 31.60 1.36 > Byte64Vector.compress ops/ms 2447.69 12.95 7167.68 34.49 2.92 > Short128Vector.compress ops/ms 7174.88 40.94 8398.45 9.48 1.17 > Short64Vector.compress ops/ms 3618.72 3.04 8618.22 10.91 2.38 > > > This PR was tested on 128-bit, 256-bit, and 512-bit SVE environments, and all tests passed. Changes requested by galder (Author). src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2292: > 2290: // Return if the vector length is no more than MaxVectorSize/2, since the > 2291: // highest half is invalid. > 2292: if (vector_length_in_bytes <= (MaxVectorSize >> 1)) { Couldn't this check be done first thing when the function is called? Then you would avoid unnecessary work? I also wonder if this check should be done before `sve_compress_byte` is called, but I think at the very least it should be done first thing in this function. 
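As a reference point for the operation being optimized in this thread, the following is a scalar sketch of what a lane-wise `compress` computes, as I understand the Vector API semantics: selected lanes are packed contiguously into the lowest lanes and the remaining lanes are zero. The class and method names are hypothetical, and the sketch says nothing about the SVE instruction sequence itself.

```java
// Hypothetical scalar reference for a byte-lane compress.
final class ScalarCompress {
    // Packs the lanes selected by 'mask' into the low end of the result;
    // unselected positions at the high end stay zero.
    static byte[] compress(byte[] lanes, boolean[] mask) {
        byte[] out = new byte[lanes.length]; // zero-filled by default
        int next = 0;                        // next free lane in the result
        for (int i = 0; i < lanes.length; i++) {
            if (mask[i]) {
                out[next++] = lanes[i];
            }
        }
        return out;
    }
}
```

For instance, compressing lanes `{1, 2, 3, 4}` under the mask `{false, true, false, true}` gives `{2, 4, 0, 0}`.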
------------- PR Review: https://git.openjdk.org/jdk/pull/27188#pullrequestreview-3209040850 PR Review Comment: https://git.openjdk.org/jdk/pull/27188#discussion_r2338760542 From rcastanedalo at openjdk.org Thu Sep 11 07:45:20 2025 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Thu, 11 Sep 2025 07:45:20 GMT Subject: RFR: 8361699: C2: assert(can_reduce_phi(n->as_Phi())) failed: Sanity: previous reducible Phi is no longer reducible before SUT [v2] In-Reply-To: References: Message-ID: <8gKEdtd0n1SEUAGX-1Q41O0ZkCLNw2jUmXzDo1tWpyk=.30e53d54-d958-422e-8206-60fd56b9e412@github.com> On Wed, 10 Sep 2025 18:39:06 GMT, Cesar Soares Lucas wrote: >> Please, review this patch to fix issue that may occur when reducing allocation merge. >> >> As the assert message describe, the problem is a `Phi` considered reducible during one invocation of `adjust_scalar_replaceable_state` turned out to be later non-reducible. This situation can happen if a subsequent invocation of the same method causes all inputs to the phi to be NSR; therefore there is no point in reducing the Phi. It can also happen during the propagation of NSR state done by `find_scalar_replaceable_allocs`. >> >> The change in `revisit_reducible_phi_status` is just a clean-up. >> The real fix is in `find_scalar_replaceable_allocs`. >> >> Tested on Linux x64/Aarch64 release/fastdebug with JTREG tier1-3. > > Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision: > > Revert clean-up in EA. Make catch statements more specific in test case. Changes requested by rcastanedalo (Reviewer). 
test/hotspot/jtreg/compiler/escapeAnalysis/TestReduceAllocationNotReducibleAnymore.java line 38: > 36: > 37: public class TestReduceAllocationNotReducibleAnymore { > 38: public static void main(String[] args) { Suggestion: public static void main (String[] args) { test/hotspot/jtreg/compiler/escapeAnalysis/TestReduceAllocationNotReducibleAnymore.java line 39: > 37: public class TestReduceAllocationNotReducibleAnymore { > 38: public static void main(String[] args) { > 39: for (int i =0; i< 100; i++) { Suggestion: for (int i = 0; i < 100; i++) { ------------- PR Review: https://git.openjdk.org/jdk/pull/27063#pullrequestreview-3209508720 PR Review Comment: https://git.openjdk.org/jdk/pull/27063#discussion_r2339170016 PR Review Comment: https://git.openjdk.org/jdk/pull/27063#discussion_r2339171222 From rcastanedalo at openjdk.org Thu Sep 11 07:45:22 2025 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Thu, 11 Sep 2025 07:45:22 GMT Subject: RFR: 8361699: C2: assert(can_reduce_phi(n->as_Phi())) failed: Sanity: previous reducible Phi is no longer reducible before SUT In-Reply-To: <2brDXuLmbVBVRaeSyCdKokA706v3t6VsZfGvj_QceJ4=.4483390e-c726-4d82-b220-f1dbdf4efef0@github.com> References: <1uDOe3Oe-hihmDHea2h8vcvRZsKKBeNp0J9lKYUujxk=.abd111bc-3625-4c71-bfa2-0a4c1f4d3875@github.com> <2brDXuLmbVBVRaeSyCdKokA706v3t6VsZfGvj_QceJ4=.4483390e-c726-4d82-b220-f1dbdf4efef0@github.com> Message-ID: On Tue, 9 Sep 2025 07:01:15 GMT, Roberto Castañeda Lozano wrote: >>> Hi Cesar, thanks for addressing this issue. I will run some more comprehensive testing and have a look at it in the next days. >> >> Testing did not reveal any issue.
I have, however, a high-level question: could the current two-step design ([SR state adjustment loop](https://github.com/openjdk/jdk/blob/166ef5e7b1c6d6a9f0f1f29fedb7f65b94f53119/src/hotspot/share/opto/escape.cpp#L300-L315) followed by an [NSR propagation loop](https://github.com/openjdk/jdk/blob/166ef5e7b1c6d6a9f0f1f29fedb7f65b94f53119/src/hotspot/share/opto/escape.cpp#L318-L320)) miss marking allocations as NSR in more complex scenarios, e.g. involving longer points-to/merge chains? Wouldn't it be more principled to re-run the SR state adjustment loop until a fixed point is reached, keeping `reducible_merges` consistent as new allocations are discovered to be NSR? (e.g. by calling `revisit_reducible_phi_status` - with your clean-up applied - every time [an allocation is marked as NSR due to non-removable merges](https://github.com/openjdk/jdk/blob/166ef5e7b1c6d6a9f0f1f29fedb7f65b94f53119/src/hotspot/share/opto/escape.cpp#L2962-L2964)). > >> @robcasloz - are you thinking that the "fixed point" loops on `find_scalar_replaceable_allocs` aren't sufficient? > > You're right, that should do. > >> At first glance yes, I think that the code would be more cleaned up if done that way. If the code had been written like that in the first place we wouldn't have seen the current issue. (...) > > Agree, a single fixed point loop combining NSR detection and propagation would be ideal for clarity and maintainability. > >> I propose that we move forward with the current patch and work on this refactoring as a separate issue. > > Sounds good, please file an RFE for that. I would suggest then to postpone the clean-up in `revisit_reducible_phi_status` to that RFE. > @robcasloz - I pushed some changes addressing your and @eme64's comments. Could you please re-run your internal tests? Thanks, I will report back within a couple of days.
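
The "single fixed point loop" idea discussed in this exchange can be pictured with a toy model. The sketch below is not C2's connection graph code: the class name, the string node labels, and the `edges` map (meaning "once this node is non-scalar-replaceable, these nodes must become so too") are all hypothetical, and the direction in which HotSpot actually propagates NSR state is not modelled. It only illustrates iterating until nothing changes, so that arbitrarily long chains are fully marked.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class NsrFixedPoint {
    // Toy propagation: edges.get(a) lists the nodes that must be marked NSR
    // once 'a' is NSR. Repeats whole passes until a pass adds nothing new.
    public static Set<String> propagate(Map<String, List<String>> edges,
                                        Set<String> initiallyNsr) {
        Set<String> nsr = new TreeSet<>(initiallyNsr);
        boolean changed = true;
        while (changed) {
            changed = false;
            // Iterate over a snapshot so the set can grow during the pass.
            for (String node : new ArrayList<>(nsr)) {
                for (String succ : edges.getOrDefault(node, List.of())) {
                    if (nsr.add(succ)) {
                        changed = true;
                    }
                }
            }
        }
        return nsr;
    }

    public static void main(String[] args) {
        Map<String, List<String>> edges = Map.of(
            "a", List.of("b"),
            "b", List.of("c"),
            "c", List.of("d"));
        // A chain a -> b -> c -> d is fully marked; "e" is unrelated.
        System.out.println(propagate(edges, Set.of("a")));
    }
}
```

A design note on the sketch: a single combined loop makes it impossible for a bookkeeping structure (such as the `reducible_merges` set mentioned above) to be consulted between two passes that disagree, which is the consistency property the reviewers were asking about.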
------------- PR Comment: https://git.openjdk.org/jdk/pull/27063#issuecomment-3278934984 From rcastanedalo at openjdk.org Thu Sep 11 07:50:33 2025 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Thu, 11 Sep 2025 07:50:33 GMT Subject: RFR: 8327963: C2: fix construction of memory graph around Initialize node to prevent incorrect execution if allocation is removed [v8] In-Reply-To: References: <3jUFOPYDIqmzEywhzf58guwS0qZGBUCMZ3lXeltlS3c=.5c82601f-cf4d-4b2a-a525-1f8f4c7c4a3b@github.com> <1gdeBnZ7YuIf9CgQW2bCXkDDBWPjUgRnickHts-fvzE=.e6e901ba-3e9f-41a2-9c68-167a879e9655@github.com> <2m1_XtiSsW_LaBRrkX4qv7AKtLOjNgnl4mUp3zisasE=.dda62164-7aa0-4c1a-b83f-fa40ba7902e5@github.com> <4374L3lkQK90wLxxOA7POBmIKNX2DFK-4pO4vj1bkuQ=.5b8d7825-a7f1-497f-ab66-02a85a266659@github.com> Message-ID: On Wed, 10 Sep 2025 08:29:19 GMT, Roberto Castañeda Lozano wrote: > That sounds good to me, thank you for enforcing this Roland! I will re-run testing and have a new look at the changeset within the next days. Test results of b701d03ed335286587c4d2539dde715b091d30bd on top of jdk-26+14 look good. Will have a look at the code within the next days.
------------- PR Comment: https://git.openjdk.org/jdk/pull/24570#issuecomment-3278966805 From rcastanedalo at openjdk.org Thu Sep 11 08:40:15 2025 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Thu, 11 Sep 2025 08:40:15 GMT Subject: RFR: 8361699: C2: assert(can_reduce_phi(n->as_Phi())) failed: Sanity: previous reducible Phi is no longer reducible before SUT [v2] In-Reply-To: <8gKEdtd0n1SEUAGX-1Q41O0ZkCLNw2jUmXzDo1tWpyk=.30e53d54-d958-422e-8206-60fd56b9e412@github.com> References: <8gKEdtd0n1SEUAGX-1Q41O0ZkCLNw2jUmXzDo1tWpyk=.30e53d54-d958-422e-8206-60fd56b9e412@github.com> Message-ID: On Thu, 11 Sep 2025 07:40:03 GMT, Roberto Castañeda Lozano wrote: >> Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision: >> >> Revert clean-up in EA. Make catch statements more specific in test case. > > test/hotspot/jtreg/compiler/escapeAnalysis/TestReduceAllocationNotReducibleAnymore.java line 38: > >> 36: >> 37: public class TestReduceAllocationNotReducibleAnymore { >> 38: public static void main(String[] args) { > > Suggestion: > > public static void main (String[] args) { @JohnTortugo Please disregard this style suggestion, I had not had my morning coffee yet. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27063#discussion_r2339463846 From mli at openjdk.org Thu Sep 11 09:03:56 2025 From: mli at openjdk.org (Hamlin Li) Date: Thu, 11 Sep 2025 09:03:56 GMT Subject: RFR: 8365926: RISC-V: Performance regression in renaissance (chi-square) [v5] In-Reply-To: References: <64z-PlrnxAISLzKBq-RZz7CXkQirGTvOgTGMJQl833o=.73ea3239-dfb6-4e32-b20f-8398334f2759@github.com> Message-ID: On Wed, 10 Sep 2025 18:43:46 GMT, Robbin Ehn wrote: > Hamlin had some offline Q so I gather this data for him: Thanks Robbin for collecting the data! > So on average using auipc+ld+jalr + JAL opt is 1.73% faster than the old trampolines. This looks great!
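
As background for the "JAL opt" mentioned in this message: a RISC-V `jal` encodes a signed, 2-byte-aligned, pc-relative offset of 21 bits, so it can only reach targets within roughly 1 MiB of the call site. The sketch below illustrates that ISA-level reach test only. The class and method names are hypothetical, and the actual reach check used by the HotSpot patch is not shown in this thread, so nothing here should be read as its implementation.

```java
public class JalReach {
    // JAL's immediate covers offsets in [-2^20, 2^20 - 2], in steps of 2 bytes.
    static final long JAL_MIN_OFFSET = -(1L << 20);
    static final long JAL_MAX_OFFSET = (1L << 20) - 2;

    // True if a jal placed at 'pc' could branch directly to 'target'.
    public static boolean reachableByJal(long pc, long target) {
        long offset = target - pc;
        return (offset & 1) == 0
            && offset >= JAL_MIN_OFFSET
            && offset <= JAL_MAX_OFFSET;
    }

    // Hypothetical selector: a direct call when in reach, otherwise the
    // longer load-and-jump sequence named in the thread.
    public static String chooseCall(long pc, long target) {
        return reachableByJal(pc, target) ? "jal" : "auipc+ld+jalr";
    }

    public static void main(String[] args) {
        long pc = 0x10000000L;
        System.out.println(chooseCall(pc, pc + 4096));       // near target
        System.out.println(chooseCall(pc, pc + (1L << 21))); // far target
    }
}
```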
------------- PR Comment: https://git.openjdk.org/jdk/pull/26944#issuecomment-3279324038 From mli at openjdk.org Thu Sep 11 09:03:58 2025 From: mli at openjdk.org (Hamlin Li) Date: Thu, 11 Sep 2025 09:03:58 GMT Subject: RFR: 8365926: RISC-V: Performance regression in renaissance (chi-square) [v2] In-Reply-To: References: Message-ID: On Wed, 3 Sep 2025 09:47:07 GMT, Robbin Ehn wrote: >>> But the AbstractICache::invalidate_range is not documented to guarantee to have this effect. >> >> what "not documented" here mean? By reading the code, seems `AbstractICache::invalidate_range` will delegate to `icache_flush` in riscv which will do the fence and flush. >> >> BTW, here are some comments from hotspot/share/runtime/icache.hpp, >> >> // Default implementation is in icache.cpp, and can be hidden per-platform. >> // Most platforms must provide only ICacheStubGenerator::generate_icache_flush(). >> >> >>> If someone executes the new instruction when changed to jalr(3), we did want them to call the new location we stored to the stub(1). By saying 1 happens before 3, we convey our intent. >>> Aarch64 also have this. >> >> Make sense! >> In worst condition, what will happen if we remove the 2 release here and just count on `fence rw, rw` in `AbstractICache::invalidate_range`? Seems we're fine based on your latter comment. >> I suppose these extra 2 releases bring some performance penalty? If this is true, I'm not sure if it's worth to treat such a rare condition in such a proper way. > >> > But the AbstractICache::invalidate_range is not documented to guarantee to have this effect. >> >> what "not documented" here mean? By reading the code, seems `AbstractICache::invalidate_range` will delegate to `icache_flush` in riscv which will do the fence and flush. >> >> BTW, here are some comments from hotspot/share/runtime/icache.hpp, >> >> ``` >> // Default implementation is in icache.cpp, and can be hidden per-platform. 
>> // Most platforms must provide only ICacheStubGenerator::generate_icache_flush(). >> ``` > > Yes, and it doesn't say this method also provides a release fence or anything like that. > In other general code we seem to need it; I can replace release(4) with a comment if you like. > >> >> > If someone executes the new instruction when changed to jalr(3), we did want them to call the new location we stored to the stub(1). By saying 1 happens before 3, we convey our intent. >> > Aarch64 also have this. >> >> Make sense! In worst condition, what will happen if we remove the 2 release here and just count on `fence rw, rw` in `AbstractICache::invalidate_range`? Seems we're fine based on your latter comment. I suppose these extra 2 releases bring some performance penalty? If this is true, I'm not sure if it's worth to treat such a rare condition in such a proper way. > > Yes, we should be fine, but there is no reason to not store them in 'wish' order. > No, there is no performance difference; this code is not executed often, and the call to invalidate_range is so slow that nothing else matters. You are talking about removing a few cycles from something that takes tens of thousands of cycles. I think we'd better remove the code that is not necessary, for the sake of performance and readability. If needed, we can add some comments here instead. Otherwise the change looks good to me. Thanks! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26944#discussion_r2339594340 From rehn at openjdk.org Thu Sep 11 09:19:59 2025 From: rehn at openjdk.org (Robbin Ehn) Date: Thu, 11 Sep 2025 09:19:59 GMT Subject: RFR: 8365926: RISC-V: Performance regression in renaissance (chi-square) [v6] In-Reply-To: References: Message-ID: > Hey, please consider! > > A bunch of info in JBS entry, please read that also. > > I narrowed this issue down to the old jal optimization, making direct calls when in reach. > This patch restores them and removes this regression.
> > In essence we turn "jalr ra,0(t1)" into a "jal ra," if reachable, and restore the jalr if a new destination is not reachable. > > Please test on your hardware! > > > Chi Square (100 runs each, 10 fastest iterations of each run, P550) > JDK-23 (last version with trampoline calls) > Mean: 3189.5827 > Standard Deviation: 284.6478 > > JDK-25 > Mean: 3424.8905 > Standard Deviation: 222.2208 > > Patch: > Mean: 3144.8535 > Standard Deviation: 229.2577 > > > No issues found in t1, running t2 also. Stress tested on vf2, bpi-f3, p550. Robbin Ehn has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains nine additional commits since the last revision: - Review fix - Merge branch 'master' into 8365926 - Merge branch 'master' into 8365926 - Review comments - Review comments - Merge branch 'master' into 8365926 - Spelling - Merge branch 'master' into 8365926 - draft jal<->jalr ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26944/files - new: https://git.openjdk.org/jdk/pull/26944/files/da18e6b6..b4e6c579 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26944&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26944&range=04-05 Stats: 16185 lines in 515 files changed: 7189 ins; 6335 del; 2661 mod Patch: https://git.openjdk.org/jdk/pull/26944.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26944/head:pull/26944 PR: https://git.openjdk.org/jdk/pull/26944 From mli at openjdk.org Thu Sep 11 09:25:16 2025 From: mli at openjdk.org (Hamlin Li) Date: Thu, 11 Sep 2025 09:25:16 GMT Subject: RFR: 8365926: RISC-V: Performance regression in renaissance (chi-square) [v6] In-Reply-To: References: Message-ID: On Thu, 11 Sep 2025 09:19:59 GMT, Robbin Ehn wrote: >> Hey, please consider! >> >> A bunch of info in JBS entry, please read that also. 
>> >> I narrowed this issue down to the old jal optimization, making direct calls when in reach. >> This patch restores them and removes this regression. >> >> In essence we turn "jalr ra,0(t1)" into a "jal ra," if reachable, and restore the jalr if a new destination is not reachable. >> >> Please test on your hardware! >> >> >> Chi Square (100 runs each, 10 fastest iterations of each run, P550) >> JDK-23 (last version with trampoline calls) >> Mean: 3189.5827 >> Standard Deviation: 284.6478 >> >> JDK-25 >> Mean: 3424.8905 >> Standard Deviation: 222.2208 >> >> Patch: >> Mean: 3144.8535 >> Standard Deviation: 229.2577 >> >> >> No issues found in t1, running t2 also. Stress tested on vf2, bpi-f3, p550. > > Robbin Ehn has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains nine additional commits since the last revision: > > - Review fix > - Merge branch 'master' into 8365926 > - Merge branch 'master' into 8365926 > - Review comments > - Review comments > - Merge branch 'master' into 8365926 > - Spelling > - Merge branch 'master' into 8365926 > - draft jal<->jalr Looks good. Thanks! ------------- Marked as reviewed by mli (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26944#pullrequestreview-3210160509 From thartmann at openjdk.org Thu Sep 11 09:45:29 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 11 Sep 2025 09:45:29 GMT Subject: RFR: 8361699: C2: assert(can_reduce_phi(n->as_Phi())) failed: Sanity: previous reducible Phi is no longer reducible before SUT [v2] In-Reply-To: References: Message-ID: On Wed, 10 Sep 2025 18:39:06 GMT, Cesar Soares Lucas wrote: >> Please, review this patch to fix issue that may occur when reducing allocation merge. 
>> >> As the assert message describe, the problem is a `Phi` considered reducible during one invocation of `adjust_scalar_replaceable_state` turned out to be later non-reducible. This situation can happen if a subsequent invocation of the same method causes all inputs to the phi to be NSR; therefore there is no point in reducing the Phi. It can also happen during the propagation of NSR state done by `find_scalar_replaceable_allocs`. >> >> The change in `revisit_reducible_phi_status` is just a clean-up. >> The real fix is in `find_scalar_replaceable_allocs`. >> >> Tested on Linux x64/Aarch64 release/fastdebug with JTREG tier1-3. > > Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision: > > Revert clean-up in EA. Make catch statements more specific in test case. src/hotspot/share/opto/escape.cpp line 3135: > 3133: Node* phi = use->ideal_node(); > 3134: if (phi->Opcode() == Op_Phi && reducible_merges.member(phi)) { > 3135: if (!can_reduce_phi(phi->as_Phi())) { Drive-by comment: I think the ifs should be merged ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27063#discussion_r2339804735 From dlunden at openjdk.org Thu Sep 11 10:02:33 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Thu, 11 Sep 2025 10:02:33 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v26] In-Reply-To: References: Message-ID: > If a method has a large number of parameters, we currently bail out from C2 compilation. > > ### Changeset > > Allowing C2 compilation of methods with a large number of parameters requires fundamental changes to the register mask data structure, used in many places in C2. In particular, register masks currently have a statically determined size and cannot represent arbitrary numbers of stack slots. This is needed if we want to compile methods with arbitrary numbers of parameters. 
Register mask operations are present in performance-sensitive parts of C2, which further complicates changes. > > Changes: > - Add functionality to dynamically grow/extend register masks. I experimented with a number of design choices to achieve this. To keep the common case (normal number of method parameters) quick and also to avoid more intrusive changes to the current `RegMask` interface, I decided to leave the "base" statically allocated memory for masks unchanged and only use dynamically allocated memory in the rare cases where it is needed. > - Generalize the "chunk"-logic from `PhaseChaitin::Select()` to allow arbitrary-sized chunks, and also move most of the logic into register mask methods to separate concerns and to make the `PhaseChaitin::Select()` code more readable. > - Remove all `can_represent` checks and bailouts. > - Performance tuning. A particularly important change is the early-exit optimization in `RegMask::overlap`, used in the performance-sensitive method `PhaseChaitin::interfere_with_live`. > - Add a new test case `TestManyMethodArguments.java` and extend an old test `TestNestedSynchronize.java`. > > ### Testing > > - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/10178060450) > - `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. > - Standard performance benchmarking. No observed conclusive overall performance degradation/improvement. > - Specific benchmarking of C2 compilation time. The changes increase C2 compilation time by, approximately and on average, 1% for methods that could also be compiled before this changeset (see the figure below). The reason for the degradation is further checks required in performance-sensitive code (in particular `PhaseChaitin::remove_bound_register_from_interfering_live_ranges`). 
I have tried optimizing in various ways, but changes I found that lead to improvement also lead to less readable code (and are, in my opinion, not worth it). > > ![c2-regression](https:/... Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: Address review comments (renaming on the way in a separate PR) ------------- Changes: - all: https://git.openjdk.org/jdk/pull/20404/files - new: https://git.openjdk.org/jdk/pull/20404/files/c4a706b5..f250a061 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20404&range=25 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20404&range=24-25 Stats: 203 lines in 2 files changed: 183 ins; 14 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/20404.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20404/head:pull/20404 PR: https://git.openjdk.org/jdk/pull/20404 From dlunden at openjdk.org Thu Sep 11 10:02:38 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Thu, 11 Sep 2025 10:02:38 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v24] In-Reply-To: References: Message-ID: On Mon, 1 Sep 2025 07:30:49 GMT, Emanuel Peter wrote: >> Daniel Lundén has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 35 commits: >> >> - Restore modified java/lang/invoke tests >> - Sort includes (new requirement) >> - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates >> - Add clarifying comments at definitions of register mask sizes >> - Fix implicit zero and nullptr checks >> - Add deep copy comment >> - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates >> - Fix typo >> - Updates after Emanuel's comments >> - Refactor and improve TestNestedSynchronize.java >> - ...
and 25 more: https://git.openjdk.org/jdk/compare/b39c7369...80c6cf47 > > test/hotspot/jtreg/compiler/arguments/TestMaxMethodArguments.java line 57: > >> 55: try { >> 56: test(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255); >> 57: } catch (TestException e) { > > This seems to be the only test that actually tests what your PR title promises: it has a method with many arguments. I have now pushed a new template framework test `TestMethodArguments.java`.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2339895018 From dlunden at openjdk.org Thu Sep 11 10:13:11 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Thu, 11 Sep 2025 10:13:11 GMT Subject: RFR: 8367397: Improve naming and terminology in regmask.hpp and regmask.cpp Message-ID: Some names in `regmask.hpp` and `regmask.cpp` are unclear and should be improved. ### Changeset - Rename `RM_SIZE` to `RM_SIZE_IN_INTS` and `_RM_I` to `_RM_INT` to make it clear that these refer to integer-sized (32-bit) array elements. - Rename `_RM_SIZE` to `_RM_SIZE_IN_WORDS` and `_RM_UP` to `_RM_WORD` to make it clear that these refer to machine-word-sized (32 or 64 bits depending on platform) array elements. - Rename `_RM_MAX` to `_RM_WORD_MAX_INDEX` for clarity. - Rename `is_AllStack` to `is_infinite` (and related resulting changes in comments and local variables). The old terminology "all-stack", referring to the infinite register mask bits, is misleading (as pointed out by @eme64 in https://github.com/openjdk/jdk/pull/20404#discussion_r2316234008). The reason is that the infinite bits do not represent *all* stack bits. Some stack bits are instead part of the non-infinite bits of the register mask. ### Testing - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/17638365968) - `tier1` and HotSpot parts of `tier2` and `tier3` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. 
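
One way to picture the "infinite" terminology introduced by this renaming is the toy model below. It is an assumption-laden sketch, not HotSpot's `RegMask`: the class `ToyRegMask`, its fields, and its methods are hypothetical. It models a mask as a finite, explicitly stored part plus a flag saying that every index beyond the finite part is a member. Under that reading, an index in the finite part may or may not be a member regardless of the flag, which is consistent with the point made above that the infinite bits are not the same thing as "all stack bits".

```java
import java.util.BitSet;

public class ToyRegMask {
    private final BitSet finiteBits = new BitSet(); // explicitly stored members
    private final int finiteSize;                   // indices [0, finiteSize) are explicit
    private final boolean infinite;                 // if true, every index >= finiteSize is a member

    public ToyRegMask(int finiteSize, boolean infinite) {
        this.finiteSize = finiteSize;
        this.infinite = infinite;
    }

    // Adds an index to the finite, explicitly stored part of the mask.
    public void insert(int index) {
        if (index < 0 || index >= finiteSize) {
            throw new IllegalArgumentException("index outside finite part: " + index);
        }
        finiteBits.set(index);
    }

    // Membership test: explicit lookup inside the finite part, otherwise
    // decided solely by the 'infinite' flag.
    public boolean member(int index) {
        return index < finiteSize ? finiteBits.get(index) : infinite;
    }

    public boolean isInfinite() {
        return infinite;
    }

    public static void main(String[] args) {
        ToyRegMask mask = new ToyRegMask(8, true);
        mask.insert(6);
        System.out.println(mask.member(6) + " " + mask.member(7) + " " + mask.member(1000));
    }
}
```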
------------- Commit messages: - Fix issue Changes: https://git.openjdk.org/jdk/pull/27215/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27215&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8367397 Stats: 108 lines in 12 files changed: 1 ins; 0 del; 107 mod Patch: https://git.openjdk.org/jdk/pull/27215.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27215/head:pull/27215 PR: https://git.openjdk.org/jdk/pull/27215 From dlunden at openjdk.org Thu Sep 11 10:13:11 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Thu, 11 Sep 2025 10:13:11 GMT Subject: RFR: 8367397: Improve naming and terminology in regmask.hpp and regmask.cpp In-Reply-To: References: Message-ID: On Thu, 11 Sep 2025 10:04:47 GMT, Daniel Lundén wrote: > Some names in `regmask.hpp` and `regmask.cpp` are unclear and should be improved. > > ### Changeset > > - Rename `RM_SIZE` to `RM_SIZE_IN_INTS` and `_RM_I` to `_RM_INT` to make it clear that these refer to integer-sized (32-bit) array elements. > - Rename `_RM_SIZE` to `_RM_SIZE_IN_WORDS` and `_RM_UP` to `_RM_WORD` to make it clear that these refer to machine-word-sized (32 or 64 bits depending on platform) array elements. > - Rename `_RM_MAX` to `_RM_WORD_MAX_INDEX` for clarity. > - Rename `is_AllStack` to `is_infinite` (and related resulting changes in comments and local variables). The old terminology "all-stack", referring to the infinite register mask bits, is misleading (as pointed out by @eme64 in https://github.com/openjdk/jdk/pull/20404#discussion_r2316234008). The reason is that the infinite bits do not represent *all* stack bits. Some stack bits are instead part of the non-infinite bits of the register mask. > > ### Testing > > - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/17638365968) > - `tier1` and HotSpot parts of `tier2` and `tier3` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64.
Suggesting @robcasloz and @eme64 for reviewing, as you are already familiar with the code. ------------- PR Comment: https://git.openjdk.org/jdk/pull/27215#issuecomment-3279707428 From dlunden at openjdk.org Thu Sep 11 10:16:33 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Thu, 11 Sep 2025 10:16:33 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v23] In-Reply-To: References: Message-ID: On Wed, 27 Aug 2025 09:08:09 GMT, Emanuel Peter wrote: >> Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: >> >> Add clarifying comments at definitions of register mask sizes > >> For reference, here is now the changeset adding an IFG bailout: #26118 > > Since that is now integrated: do we need to make any changes to the patch here? I thought the goal was to use the bailouts instead of increasing `MaxNodeLimit`. > > Because looking at the discussions above: we were worried that there could be compile-time regressions - even if quite rare. But they were in the range of 40s which is quite scary. Are these now gone? @eme64 I have now addressed your comments (the renaming is in https://github.com/openjdk/jdk/pull/27215, as requested). Please have a look and let me know if I've missed something. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20404#issuecomment-3279719061 From mli at openjdk.org Thu Sep 11 10:49:22 2025 From: mli at openjdk.org (Hamlin Li) Date: Thu, 11 Sep 2025 10:49:22 GMT Subject: RFR: 8367406: Simple refactoring AOTCodeAddressTable::id_for_address Message-ID: <-15yudSPZOyKnpwNY9mTVKdXDu4hcjxqZxI2AXodi5Q=.9a3db977-a9d8-4061-8a04-39ce967eb550@github.com> Hi, Can you help to review this simple refactoring? AOTCodeAddressTable::id_for_address is currently implemented in a way that introduces too many nested if/else; it seems we could make the code more readable by removing these nested if/else.
But it's quite subjective, so I'll let you tell if the patch is helpful. Run tests (test/hotspot/jtreg/runtime/cds/appcds/aot*), no new failures on x64. Thanks! ------------- Commit messages: - initial commit Changes: https://git.openjdk.org/jdk/pull/27217/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27217&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8367406 Stats: 60 lines in 1 file changed: 9 ins; 12 del; 39 mod Patch: https://git.openjdk.org/jdk/pull/27217.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27217/head:pull/27217 PR: https://git.openjdk.org/jdk/pull/27217 From fandreuzzi at openjdk.org Thu Sep 11 11:07:34 2025 From: fandreuzzi at openjdk.org (Francesco Andreuzzi) Date: Thu, 11 Sep 2025 11:07:34 GMT Subject: RFR: 8367406: Simple refactoring AOTCodeAddressTable::id_for_address In-Reply-To: <-15yudSPZOyKnpwNY9mTVKdXDu4hcjxqZxI2AXodi5Q=.9a3db977-a9d8-4061-8a04-39ce967eb550@github.com> References: <-15yudSPZOyKnpwNY9mTVKdXDu4hcjxqZxI2AXodi5Q=.9a3db977-a9d8-4061-8a04-39ce967eb550@github.com> Message-ID: <99qr4pl_RhUPNfhtAirWoeY5fUDaYdEgOYXetw59yGw=.f91f9b32-7a3f-4e1a-848c-c12cf65394a6@github.com> On Thu, 11 Sep 2025 10:42:48 GMT, Hamlin Li wrote: > Hi, > Can you help to review this simple refactoring? > > AOTCodeAddressTable::id_for_address currently is implemented in a way that introduce too many nested if/else, seems we could make the code more readable by removing these nested if/else. But it's quite subjective, so I'll let you tell if the patch is helpful. > > Run tests (test/hotspot/jtreg/runtime/cds/appcds/aot*), no new failures on x64. > > Thanks! src/hotspot/share/code/aotCodeCache.cpp line 1685: > 1683: desc = StubCodeDesc::desc_for(addr + frame::pc_return_offset); > 1684: } > 1685: const char* sub_name = (desc != nullptr) ? desc->name() : ""; This seems to be used only in the assertion, maybe it could be hidden behind a `#ifdef ASSERT`? 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27217#discussion_r2340199165 From jbhateja at openjdk.org Thu Sep 11 12:16:47 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 11 Sep 2025 12:16:47 GMT Subject: RFR: 8365205: C2: Optimize popcount value computation using knownbits [v7] In-Reply-To: References: Message-ID: > This patch optimizes PopCount value transforms using KnownBits information. > Following are the results of the micro-benchmark included with the patch > > > > System: 13th Gen Intel(R) Core(TM) i3-1315U > > Baseline: > Benchmark Mode Cnt Score Error Units > PopCountValueTransform.LogicFoldingKerenLong thrpt 2 215460.670 ops/s > PopCountValueTransform.LogicFoldingKerenlInt thrpt 2 294014.826 ops/s > > Withopt: > Benchmark Mode Cnt Score Error Units > PopCountValueTransform.LogicFoldingKerenLong thrpt 2 389978.082 ops/s > PopCountValueTransform.LogicFoldingKerenlInt thrpt 2 417261.583 ops/s > > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Adding random bound test point ------------- Changes: - all: https://git.openjdk.org/jdk/pull/27075/files - new: https://git.openjdk.org/jdk/pull/27075/files/9e3957de..a7f9b79c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=27075&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27075&range=05-06 Stats: 60 lines in 1 file changed: 58 ins; 1 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/27075.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27075/head:pull/27075 PR: https://git.openjdk.org/jdk/pull/27075 From jbhateja at openjdk.org Thu Sep 11 12:16:49 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 11 Sep 2025 12:16:49 GMT Subject: RFR: 8365205: C2: Optimize popcount value computation using knownbits [v3] In-Reply-To: References: 
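
The four-case table above can be made concrete with a small sketch. It assumes (this is an illustration, not the PR's code) that known bits are derived from a signed range `[lo, hi]` by taking the common binary prefix of the two bounds when they have the same sign, and that a population count is then bounded below by the number of known-one bits and above by the width minus the number of known-zero bits. The class and method names are hypothetical.

```java
public class KnownBitsFromRange {
    // Returns {knownZeros, knownOnes} valid for every int x with lo <= x <= hi.
    public static int[] knownBits(int lo, int hi) {
        if (lo > hi) {
            throw new IllegalArgumentException("empty range");
        }
        if ((lo < 0) != (hi < 0)) {
            // The range crosses zero: even the sign bit differs, so nothing is known.
            return new int[] {0, 0};
        }
        int diff = lo ^ hi;
        // Mask of the bit positions strictly above the highest differing bit.
        // Same-sign bounds guarantee bit 31 of diff is clear, so the shift
        // count below is at most 31.
        int prefixMask = (diff == 0) ? -1 : (-1 << (32 - Integer.numberOfLeadingZeros(diff)));
        int knownOnes = lo & prefixMask;
        int knownZeros = ~lo & prefixMask;
        return new int[] {knownZeros, knownOnes};
    }

    // Returns {min, max} bounds on Integer.bitCount(x) for lo <= x <= hi.
    public static int[] popCountBounds(int lo, int hi) {
        int[] kb = knownBits(lo, hi);
        int min = Integer.bitCount(kb[1]);
        int max = 32 - Integer.bitCount(kb[0]);
        return new int[] {min, max};
    }

    public static void main(String[] args) {
        int[] b = popCountBounds(8, 11); // 1000b .. 1011b share the prefix "10"
        System.out.println(b[0] + " .. " + b[1]);
    }
}
```

For the range 8 to 11, bit 3 is known to be one and every bit above bit 1 other than bit 3 is known to be zero, so the bounds are 1 and 3, matching the actual bit counts of 8, 9, 10, and 11 (1, 2, 2, 3).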
<88lK21UPhkqWYMU-PNUCMYYH1QWrjiUfftspxZB7GFM=.99f8aa26-0075-4932-a427-054f088d8068@github.com> <7GYE4B_fk2sz0pxSjPgYxpTWz1v4T0-V-oMmBcS0tpY=.658141db-9bac-4f19-876f-f859ae00984b@github.com> <4PssROFHsUv9rYCp9KlszXVzJV4jIxbHSOWKmQ8VA0k=.db8cac2d-23fd-4ad0-ac1e-7c6e2f3c7b8e@github.com> <6eJGadjOxt_uInDZmiRc5MZNefslQT3-bOcsTp2tEe0=.0bdc2d27-d09a-4872-9633-51c2b55d1c18@github.com> Message-ID: <4c4bRGY4F63wqzDBO5mcg5K6o51PkRPlfH65EsIYqXI=.b0a45206-5c7a-4635-8fbe-2f97cd6c6463@github.com> On Wed, 10 Sep 2025 16:03:10 GMT, Emanuel Peter wrote: >>> @SirYwell @jatin-bhateja @merykitty I linked this issue here to the KnownBits RFE, to make sure we keep track of all KnownBits extensions. Can you please help me with linking any other RFEs that have already been filed or come up in the future? It would help track progress and avoid duplicate work. >> >> Current And Value Transforms : >> - Constant folds - both inputs >> - There are four possible cases for known bits extraction : - >> >> _lo _hi >> <0 <0 : Possibility of finding common prefix and known ZERO and ONE bits among the common portion. >> >=0 <0 : Not applicable scenario, since lower is greater than the upper bound. >> <0 >=0 : No possibility of finding a common prefix b/w hi and lo bounds, thus no known bits exist. >> >=0 >=0 : Possibility of finding common prefix and known ZERO and ONE bits among the common portion. >> >> >> Existing value transforms and canonicalization should furnish known bits in applicable scenarios. 
>> >> For a full solution, we can add another rule to directly AND the known ZERO and ONE bits of participating inputs, and let canonicalization compute the resultant type and clean up existing handling in Value transforms and explicit constant folding > >> For a full solution, we can add another rule to directly AND the known ZERO and ONE bits of participating inputs, and let canonicalization compute the resultant type and clean up existing handling in Value transforms and explicit constant folding > > Yes, this is what we would end up with after https://bugs.openjdk.org/browse/JDK-8367341 . But I think currently, there is no good way to set / get bits directly. > > Using signed comparisons as you mentioned is only of limited help. But it is what we have for now. > > testPopCountElisionInt1 and testPopCountElisionLong1 check for absence of PopCount IR nodes. > > @jatin-bhateja But there the clamps are with fixed constants. It would be nice if we also had some tests with randomized constants. We don't need IR tests for those, just result verification. Hi @eme64, added a random bound test point.- ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2340528585 From mli at openjdk.org Thu Sep 11 12:29:48 2025 From: mli at openjdk.org (Hamlin Li) Date: Thu, 11 Sep 2025 12:29:48 GMT Subject: RFR: 8367406: Simple refactoring AOTCodeAddressTable::id_for_address In-Reply-To: <99qr4pl_RhUPNfhtAirWoeY5fUDaYdEgOYXetw59yGw=.f91f9b32-7a3f-4e1a-848c-c12cf65394a6@github.com> References: <-15yudSPZOyKnpwNY9mTVKdXDu4hcjxqZxI2AXodi5Q=.9a3db977-a9d8-4061-8a04-39ce967eb550@github.com> <99qr4pl_RhUPNfhtAirWoeY5fUDaYdEgOYXetw59yGw=.f91f9b32-7a3f-4e1a-848c-c12cf65394a6@github.com> Message-ID: On Thu, 11 Sep 2025 11:04:54 GMT, Francesco Andreuzzi wrote: >> Hi, >> Can you help to review this simple refactoring? 
>> >> AOTCodeAddressTable::id_for_address currently is implemented in a way that introduce too many nested if/else, seems we could make the code more readable by removing these nested if/else. But it's quite subjective, so I'll let you tell if the patch is helpful. >> >> Run tests (test/hotspot/jtreg/runtime/cds/appcds/aot*), no new failures on x64. >> >> Thanks! > > src/hotspot/share/code/aotCodeCache.cpp line 1685: > >> 1683: desc = StubCodeDesc::desc_for(addr + frame::pc_return_offset); >> 1684: } >> 1685: const char* sub_name = (desc != nullptr) ? desc->name() : ""; > > This seems to be used only in the assertion, maybe it could be hidden behind a `#ifdef ASSERT`? I assume the compiler will remove it in product version, as the sub_name is not used anywhere else, and there is no side effect of its generation. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27217#discussion_r2340594415 From epeter at openjdk.org Thu Sep 11 12:56:45 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 11 Sep 2025 12:56:45 GMT Subject: RFR: 8367397: Improve naming and terminology in regmask.hpp and regmask.cpp In-Reply-To: References: Message-ID: <2unG-RdDR2e1mI-veaR3AdDGGs1q4XFdITrnQtBGOw8=.47f7d565-b722-434b-96ef-b51ed733b241@github.com> On Thu, 11 Sep 2025 10:04:47 GMT, Daniel Lundén wrote: > Some names in `regmask.hpp` and `regmask.cpp` are unclear and should be improved. > > ### Changeset > > - Rename `RM_SIZE` to `RM_SIZE_IN_INTS` and `_RM_I` to `_RM_INT` to make it clear that these refer to integer-sized (32-bit) array elements. > - Rename `_RM_SIZE` to `_RM_SIZE_IN_WORDS` and `_RM_UP` to `_RM_WORD` to make it clear that these refer to machine-word-sized (32 or 64 bits depending on platform) array elements. > - Rename `_RM_MAX` to `_RM_WORD_MAX_INDEX` for clarity. > - Rename `is_AllStack` to `is_infinite` (and related resulting changes in comments and local variables).
The old terminology "all-stack", referring to the infinite register mask bits, is misleading (as pointed out by @eme64 in https://github.com/openjdk/jdk/pull/20404#discussion_r2316234008). The reason is that the infinite bits do not represent *all* stack bits. Some stack bits are instead part of the non-infinite bits of the register mask. > > ### Testing > > - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/17638365968) > - `tier1` and HotSpot parts of `tier2` and `tier3` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. Thanks for doing this! I think it is a step in the right direction - though I have not checked if it renames everything we should. I think we can just do this one now, integrate it to your other PR, and see if we need to do another round of renamings. src/hotspot/share/opto/regmask.hpp line 78: > 76: // is something like 90+ parameters. > 77: int _RM_INT[RM_SIZE_IN_INTS]; > 78: uintptr_t _RM_WORD[_RM_SIZE_IN_WORDS]; Is there now still a reason to have `_` for the words and not for the ints? ------------- PR Review: https://git.openjdk.org/jdk/pull/27215#pullrequestreview-3211313012 PR Review Comment: https://git.openjdk.org/jdk/pull/27215#discussion_r2340652830 From epeter at openjdk.org Thu Sep 11 12:56:46 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 11 Sep 2025 12:56:46 GMT Subject: RFR: 8367397: Improve naming and terminology in regmask.hpp and regmask.cpp In-Reply-To: <2unG-RdDR2e1mI-veaR3AdDGGs1q4XFdITrnQtBGOw8=.47f7d565-b722-434b-96ef-b51ed733b241@github.com> References: <2unG-RdDR2e1mI-veaR3AdDGGs1q4XFdITrnQtBGOw8=.47f7d565-b722-434b-96ef-b51ed733b241@github.com> Message-ID: On Thu, 11 Sep 2025 12:40:02 GMT, Emanuel Peter wrote: >> Some names in `regmask.hpp` and `regmask.cpp` are unclear and should be improved. 
>> >> ### Changeset >> >> - Rename `RM_SIZE` to `RM_SIZE_IN_INTS` and `_RM_I` to `_RM_INT` to make it clear that these refer to integer-sized (32-bit) array elements. >> - Rename `_RM_SIZE` to `_RM_SIZE_IN_WORDS` and `_RM_UP` to `_RM_WORD` to make it clear that these refer to machine-word-sized (32 or 64 bits depending on platform) array elements. >> - Rename `_RM_MAX` to `_RM_WORD_MAX_INDEX` for clarity. >> - Rename `is_AllStack` to `is_infinite` (and related resulting changes in comments and local variables). The old terminology "all-stack", referring to the infinite register mask bits, is misleading (as pointed out by @eme64 in https://github.com/openjdk/jdk/pull/20404#discussion_r2316234008). The reason is that the infinite bits do not represent *all* stack bits. Some stack bits are instead part of the non-infinite bits of the register mask. >> >> ### Testing >> >> - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/17638365968) >> - `tier1` and HotSpot parts of `tier2` and `tier3` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. > > src/hotspot/share/opto/regmask.hpp line 78: > >> 76: // is something like 90+ parameters. >> 77: int _RM_INT[RM_SIZE_IN_INTS]; >> 78: uintptr_t _RM_WORD[_RM_SIZE_IN_WORDS]; > > Is there now still a reason to have `_` for the words and not for the ints? Generally, we use `_` for fields, but not for constants. Also: fields should be lower-case, so maybe `_RM_INT` -> `_rm_int`? 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27215#discussion_r2340660689 From adinn at openjdk.org Thu Sep 11 13:13:27 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Thu, 11 Sep 2025 13:13:27 GMT Subject: RFR: 8367406: Simple refactoring AOTCodeAddressTable::id_for_address In-Reply-To: <-15yudSPZOyKnpwNY9mTVKdXDu4hcjxqZxI2AXodi5Q=.9a3db977-a9d8-4061-8a04-39ce967eb550@github.com> References: <-15yudSPZOyKnpwNY9mTVKdXDu4hcjxqZxI2AXodi5Q=.9a3db977-a9d8-4061-8a04-39ce967eb550@github.com> Message-ID: <1TN7O_LAUKJXrMDQFCQWVbqK_z1EGwv19lIKZkIRA9U=.c7edd78b-a253-4284-afbc-a775e1983b0b@github.com> On Thu, 11 Sep 2025 10:42:48 GMT, Hamlin Li wrote: > Hi, > Can you help to review this simple refactoring? > > AOTCodeAddressTable::id_for_address currently is implemented in a way that introduce too many nested if/else, seems we could make the code more readable by removing these nested if/else. But it's quite subjective, so I'll let you tell if the patch is helpful. > > Run tests (test/hotspot/jtreg/runtime/cds/appcds/aot*), no new failures on x64. > > Thanks! I'm not convinced this is making anything simpler. Also, it is diverging from the code we have in the Leyden repo which caters for further cases. If this code does merit a cleanup (which I agree is the case) then that should really wait until we have 1. folded in cases currently catered for in Leyden premain that deal with translation of stub addresses 2. worked out a better way of managing addresses than the current use of several ad hoc bucket lists ------------- PR Comment: https://git.openjdk.org/jdk/pull/27217#issuecomment-3280585779 From chagedorn at openjdk.org Thu Sep 11 13:20:58 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 11 Sep 2025 13:20:58 GMT Subject: RFR: 8361702: C2: assert(is_dominator(compute_early_ctrl(limit, limit_ctrl), pre_end)) failed: node pinned on loop exit test?
[v7] In-Reply-To: References: <1-3MDixhdwZEgDMpoAZckhK5_lFjygsKl4q1__tsCKs=.dffa9c0e-8ea1-4465-a1fc-6ad2dbcfe5db@github.com> Message-ID: On Tue, 9 Sep 2025 08:39:37 GMT, Roland Westrelin wrote: >> A node in a pre loop only has uses out of the loop dominated by the >> loop exit. `PhaseIdealLoop::try_sink_out_of_loop()` sets its control >> to the loop exit projection. A range check in the main loop has this >> node as input (through a chain of some other nodes). Range check >> elimination needs to update the exit condition of the pre loop with an >> expression that depends on the node pinned on its exit: that's >> impossible and the assert fires. This is a variant of 8314024 (this >> one was for a node with uses out of the pre loop on multiple paths). I >> propose the same fix: leave the node with control in the pre loop in >> this case. > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 11 additional commits since the last revision: > > - Merge branch 'master' into JDK-8361702 > - Merge branch 'master' into JDK-8361702 > - review > - Merge branch 'master' into JDK-8361702 > - Update src/hotspot/share/opto/loopopts.cpp > > Co-authored-by: Christian Hagedorn > - Update test/hotspot/jtreg/compiler/rangechecks/TestSunkRangeFromPreLoopRCE3.java > > Co-authored-by: Christian Hagedorn > - Update test/hotspot/jtreg/compiler/rangechecks/TestSunkRangeFromPreLoopRCE2.java > > Co-authored-by: Christian Hagedorn > - Update src/hotspot/share/opto/loopopts.cpp > > Co-authored-by: Christian Hagedorn > - Update test/hotspot/jtreg/compiler/rangechecks/TestSunkRangeFromPreLoopRCE2.java > > Co-authored-by: Christian Hagedorn > - tests > - ... and 1 more: https://git.openjdk.org/jdk/compare/3ba2cf5f...91a7d73c Testing looked good! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26424#issuecomment-3280625737 From mli at openjdk.org Thu Sep 11 13:26:07 2025 From: mli at openjdk.org (Hamlin Li) Date: Thu, 11 Sep 2025 13:26:07 GMT Subject: RFR: 8367406: Simple refactoring AOTCodeAddressTable::id_for_address In-Reply-To: <1TN7O_LAUKJXrMDQFCQWVbqK_z1EGwv19lIKZkIRA9U=.c7edd78b-a253-4284-afbc-a775e1983b0b@github.com> References: <-15yudSPZOyKnpwNY9mTVKdXDu4hcjxqZxI2AXodi5Q=.9a3db977-a9d8-4061-8a04-39ce967eb550@github.com> <1TN7O_LAUKJXrMDQFCQWVbqK_z1EGwv19lIKZkIRA9U=.c7edd78b-a253-4284-afbc-a775e1983b0b@github.com> Message-ID: On Thu, 11 Sep 2025 13:10:29 GMT, Andrew Dinn wrote: > I'm not convinced this is making anything simpler. Also, it is diverging from the code we have in the Leyden repo which caters for further cases. > > If this code does merit a cleanup (which I agree is the case) then that should really wait until we have > > 1. folded in cases currently catered for in Leyden premain that deal with translation of stub addresses > 2. worked out a better way of managing addresses than the current use of several ad hoc bucket lists I see, thanks for the information. ------------- PR Comment: https://git.openjdk.org/jdk/pull/27217#issuecomment-3280665372 From dlunden at openjdk.org Thu Sep 11 13:43:26 2025 From: dlunden at openjdk.org (Daniel Lundén) Date: Thu, 11 Sep 2025 13:43:26 GMT Subject: RFR: 8367397: Improve naming and terminology in regmask.hpp and regmask.cpp In-Reply-To: References: <2unG-RdDR2e1mI-veaR3AdDGGs1q4XFdITrnQtBGOw8=.47f7d565-b722-434b-96ef-b51ed733b241@github.com> Message-ID: <1xDonJ67G3hUAWTdngutIb7LBboWxHRviCHXKDCSoN4=.2617f8e9-206b-424d-a1ab-501b182717bb@github.com> On Thu, 11 Sep 2025 12:41:50 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/regmask.hpp line 78: >> >>> 76: // is something like 90+ parameters.
>>> 77: int _RM_INT[RM_SIZE_IN_INTS]; >>> 78: uintptr_t _RM_WORD[_RM_SIZE_IN_WORDS]; >> >> Is there now still a reason to have `_` for the words and not for the ints? > > Generally, we use `_` for fields, but not for constants. > Also: fields should be lower-case, so maybe `_RM_INT` -> `_rm_int`? Thanks, I agree that it seems more consistent to use `_rm_int` and `_rm_word` instead. The missing leading underscore for `RM_SIZE_IN_INTS` highlights that it is a macro, unlike `_RM_SIZE_IN_WORDS`. Maybe this is just for historical reasons and not up to date with today's conventions? Do we classify constant static fields such as `_RM_SIZE_IN_WORDS` as constants or fields? I.e., do we use upper or lower case? I guess it would be `_rm_size_in_words` if considered a field and `RM_SIZE_IN_WORDS` (without the leading underscore) if considered a constant. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27215#discussion_r2340921637 From dlunden at openjdk.org Thu Sep 11 14:01:43 2025 From: dlunden at openjdk.org (Daniel Lundén) Date: Thu, 11 Sep 2025 14:01:43 GMT Subject: RFR: 8367397: Improve naming and terminology in regmask.hpp and regmask.cpp [v2] In-Reply-To: References: Message-ID: > Some names in `regmask.hpp` and `regmask.cpp` are unclear and should be improved. > > ### Changeset > > - Rename `RM_SIZE` to `RM_SIZE_IN_INTS` and `_RM_I` to `_RM_INT` to make it clear that these refer to integer-sized (32-bit) array elements. > - Rename `_RM_SIZE` to `_RM_SIZE_IN_WORDS` and `_RM_UP` to `_RM_WORD` to make it clear that these refer to machine-word-sized (32 or 64 bits depending on platform) array elements. > - Rename `_RM_MAX` to `_RM_WORD_MAX_INDEX` for clarity. > - Rename `is_AllStack` to `is_infinite` (and related resulting changes in comments and local variables).
The old terminology "all-stack", referring to the infinite register mask bits, is misleading (as pointed out by @eme64 in https://github.com/openjdk/jdk/pull/20404#discussion_r2316234008). The reason is that the infinite bits do not represent *all* stack bits. Some stack bits are instead part of the non-infinite bits of the register mask. > > ### Testing > > - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/17638365968) > - `tier1` and HotSpot parts of `tier2` and `tier3` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: Lowercase _RM_INT and _RM_WORD ------------- Changes: - all: https://git.openjdk.org/jdk/pull/27215/files - new: https://git.openjdk.org/jdk/pull/27215/files/67381d34..61ff4f8c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=27215&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27215&range=00-01 Stats: 50 lines in 2 files changed: 0 ins; 0 del; 50 mod Patch: https://git.openjdk.org/jdk/pull/27215.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27215/head:pull/27215 PR: https://git.openjdk.org/jdk/pull/27215 From sparasa at openjdk.org Thu Sep 11 16:28:21 2025 From: sparasa at openjdk.org (Srinivas Vamsi Parasa) Date: Thu, 11 Sep 2025 16:28:21 GMT Subject: RFR: 8354348: Enable Extended EVEX to REX2/REX demotion for commutative operations with same dst and src2 [v5] In-Reply-To: References: Message-ID: On Thu, 11 Sep 2025 00:45:45 GMT, Srinivas Vamsi Parasa wrote: >> This change extends Extended EVEX (EEVEX) to REX2/REX demotion for Intel APX NDD instructions to handle commutative operations when the destination register and the second source register (src2) are the same. >> >> Currently, EEVEX to REX2/REX demotion is only enabled when the first source (src1) and the destination are the same.
This enhancement allows additional cases of valid demotion for commutative instructions (add, imul, and, or, xor). >> >> For example: >> `eaddl r18, r25, r18` can be encoded as `addl r18, r25` using APX REX2 encoding >> `eaddl r2, r7, r2` can be encoded as `addl r2, r7` using non-APX legacy encoding > > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > undo new match rules for RegMemReg for commutative operations Hi Emanuel (@eme64), Could you please run the tests for this PR? Thanks, Vamsi ------------- PR Comment: https://git.openjdk.org/jdk/pull/26997#issuecomment-3281736456 From sviswanathan at openjdk.org Thu Sep 11 17:02:40 2025 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Thu, 11 Sep 2025 17:02:40 GMT Subject: RFR: 8364305: Support AVX10 saturating floating point conversion instructions [v11] In-Reply-To: References: Message-ID: On Thu, 11 Sep 2025 02:19:54 GMT, Mohamed Issa wrote: >> Intel® AVX10 ISA [1] extensions added new saturating floating point conversion instructions which comply with definitions in section 5.8 of the 2019 IEEE-754 standard. They can compute floating point to integral type conversions while also handling special inputs such as NaN, +Infinity, and -Infinity. >> >> Without AVX10.2, the current approach starts by converting the floating point value(s) in the source register to the desired integral value(s) in the destination register. In the scalar case, the CVTTSS2SI (single precision) or CVTTSD2SI (double precision) instruction is used. In the vector case, the CVTTPS2DQ (single precision) or CVTTPD2DQ (double precision) is used. However, if the source contains a special value (NaN, -Infinity, +Infinity, <= Integer.MIN_VALUE, or >= Integer.MAX_VALUE), extra handling is required. 
The specific sequence of instructions involved depends on the source (single precision vs double precision), destination (long, integer, short, or byte), level of parallelization (scalar vs vector), and supported AVX extension type. Essentially though, the special values are mapped to values (NaN -> 0, -Infinity, <= Integer.MIN_VALUE -> Integer.MIN_VALUE, +Infinity, >= Integer.MAX_VALUE -> Integer.MAX_VALUE) in the integer range with the help of a few temporary registers to store intermediate results. >> >> This change uses the new AVX10.2 scalar (VCVTTSS2SIS or VCVTTSD2SIS) and vector (VCVTTPS2QQS, VCVTTPS2DQS, VCVTTPD2QQS, and VCVTTPD2DQS) instructions on supported platforms to avoid the extra handling described above. Also, the JTREG tests listed below were used to verify correctness with `-XX:-UseSuperWord` / `-XX:+UseSuperWord` options to exercise both scalar and vector paths. The baseline build used is [OpenJDK v26-b11](https://github.com/openjdk/jdk/releases/tag/jdk-26%2B11). >> >> 1. `jtreg:test/hotspot/jtreg/compiler/codegen/TestByteDoubleVect.java` >> 2. `jtreg:test/hotspot/jtreg/compiler/codegen/TestByteFloatVect.java` >> 3. `jtreg:test/hotspot/jtreg/compiler/codegen/TestIntDoubleVect.java` >> 4. `jtreg:test/hotspot/jtreg/compiler/codegen/TestIntFloatVect.java` >> 5. `jtreg:test/hotspot/jtreg/compiler/codegen/TestLongDoubleVect.java` >> 6. `jtreg:test/hotspot/jtreg/compiler/codegen/TestLongFloatVect.java` >> 7. `jtreg:test/hotspot/jtreg/compiler/codegen/TestShortDoubleVect.java` >> 8. `jtreg:test/hotspot/jtreg/compiler/codegen/TestShortFloatVect.java` >> 9. `jtreg:test/hotspot/jtreg/compiler/vectorapi/VectorFPtoIntCastTest.java` >> 1...
> > Mohamed Issa has updated the pull request incrementally with two additional commits since the last revision: > > - Check for instructions that shouldn't appear in vector floating point conversion tests > - Correctly calculate vector lengths and don't rely on VectorReinterpret in cast2F2X and cast2D2X memory instructions test/hotspot/jtreg/compiler/lib/ir_framework/IRNode.java line 496: > 494: public static final String CAST_F2X = PREFIX + "CAST_F2X" + POSTFIX; > 495: static { > 496: machOnlyNameRegex(CAST_F2X, "castF2X_reg_(av|eve)x"); This should be "castFtoX_reg_(av|eve)x". test/hotspot/jtreg/compiler/lib/ir_framework/IRNode.java line 501: > 499: public static final String CAST_D2X = PREFIX + "CAST_D2X" + POSTFIX; > 500: static { > 501: machOnlyNameRegex(CAST_D2X, "castD2X_reg_(av|eve)x"); This should be "castDtoX_reg_(av|eve)x". test/hotspot/jtreg/compiler/lib/ir_framework/IRNode.java line 506: > 504: public static final String CAST2_F2X = PREFIX + "CAST2_F2X" + POSTFIX; > 505: static { > 506: machOnlyNameRegex(CAST2_F2X, "cast2F2X_(reg|mem)_evex"); This should be "cast2FtoX_(reg|mem)_evex" test/hotspot/jtreg/compiler/lib/ir_framework/IRNode.java line 511: > 509: public static final String CAST2_D2X = PREFIX + "CAST2_D2X" + POSTFIX; > 510: static { > 511: machOnlyNameRegex(CAST2_D2X, "cast2D2X_(reg|mem)_evex"); This should be "cast2DtoX_(reg|mem)_evex". 
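A minimal sketch of why the patterns need the `to` spelling, under the assumption (implied by the review comments, not checked against `x86.ad` here) that the matcher rules are named along the lines of `castFtoX_reg_avx` and `cast2DtoX_mem_evex`. It uses `std::regex` as a simplified stand-in for the IR framework's matching; `matches_rule` and `check_patterns` are hypothetical helpers.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns true if the whole rule name matches the given pattern.
bool matches_rule(const std::string& pattern, const std::string& rule_name) {
  return std::regex_match(rule_name, std::regex(pattern));
}

// Exercises the patterns quoted in the review against assumed rule names.
void check_patterns() {
  // Suggested patterns match the assumed "...toX..." rule names.
  assert(matches_rule("castFtoX_reg_(av|eve)x", "castFtoX_reg_avx"));
  assert(matches_rule("castFtoX_reg_(av|eve)x", "castFtoX_reg_evex"));
  assert(matches_rule("cast2DtoX_(reg|mem)_evex", "cast2DtoX_mem_evex"));
  // The original "...2X..." patterns would never match those names.
  assert(!matches_rule("castF2X_reg_(av|eve)x", "castFtoX_reg_avx"));
  assert(!matches_rule("cast2D2X_(reg|mem)_evex", "cast2DtoX_reg_evex"));
}
```

A pattern that can never match makes a "must be present" IR check fail, while a "must be absent" check would pass without testing anything.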
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2341775174 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2341782092 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2341784055 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2341787294 From cslucas at openjdk.org Thu Sep 11 17:09:35 2025 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Thu, 11 Sep 2025 17:09:35 GMT Subject: RFR: 8361699: C2: assert(can_reduce_phi(n->as_Phi())) failed: Sanity: previous reducible Phi is no longer reducible before SUT [v3] In-Reply-To: References: Message-ID: <9prBDcDkholUOVv1rNRDNQyrjCzn6FCESaQofSTwLN0=.c13a6b10-82ef-479b-b1d5-3102d7ea0165@github.com> > Please, review this patch to fix issue that may occur when reducing allocation merge. > > As the assert message describe, the problem is a `Phi` considered reducible during one invocation of `adjust_scalar_replaceable_state` turned out to be later non-reducible. This situation can happen if a subsequent invocation of the same method causes all inputs to the phi to be NSR; therefore there is no point in reducing the Phi. It can also happen during the propagation of NSR state done by `find_scalar_replaceable_allocs`. > > The change in `revisit_reducible_phi_status` is just a clean-up. > The real fix is in `find_scalar_replaceable_allocs`. > > Tested on Linux x64/Aarch64 release/fastdebug with JTREG tier1-3. 
Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision: Update test/hotspot/jtreg/compiler/escapeAnalysis/TestReduceAllocationNotReducibleAnymore.java Co-authored-by: Roberto Castañeda Lozano ------------- Changes: - all: https://git.openjdk.org/jdk/pull/27063/files - new: https://git.openjdk.org/jdk/pull/27063/files/17d5ab22..28d9432e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=27063&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27063&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/27063.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27063/head:pull/27063 PR: https://git.openjdk.org/jdk/pull/27063 From jbhateja at openjdk.org Thu Sep 11 17:18:51 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 11 Sep 2025 17:18:51 GMT Subject: RFR: 8354348: Enable Extended EVEX to REX2/REX demotion for commutative operations with same dst and src2 [v5] In-Reply-To: References: Message-ID: On Thu, 11 Sep 2025 00:45:45 GMT, Srinivas Vamsi Parasa wrote: >> This change extends Extended EVEX (EEVEX) to REX2/REX demotion for Intel APX NDD instructions to handle commutative operations when the destination register and the second source register (src2) are the same. >> >> Currently, EEVEX to REX2/REX demotion is only enabled when the first source (src1) and the destination are the same. This enhancement allows additional cases of valid demotion for commutative instructions (add, imul, and, or, xor). >> >> For example: >> `eaddl r18, r25, r18` can be encoded as `addl r18, r25` using APX REX2 encoding >> `eaddl r2, r7, r2` can be encoded as `addl r2, r7` using non-APX legacy encoding > > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > undo new match rules for RegMemReg for commutative operations Hi @vamsi-parasa , Thanks for addressing my comments.
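The demotion rule described in the quoted PR text can be sketched as follows. This is a simplified model, not the assembler code from the patch: registers are plain numbers 0..31, `TwoOperandForm` and `try_demote` are hypothetical names, and other conditions the real encoder has to check are ignored.

```cpp
#include <cassert>

// Hypothetical description of a demoted two-operand instruction.
struct TwoOperandForm {
  int dst_and_src1;  // register that is both destination and first source
  int src2;          // remaining source register
  bool needs_rex2;   // true if an extended register (r16..r31) is involved
};

// Tries to rewrite "dst = src1 op src2" as "dst = dst op other".
// Returns false when the destination aliases neither usable source.
bool try_demote(int dst, int src1, int src2, bool commutative,
                TwoOperandForm* out) {
  assert(dst >= 0 && dst < 32 && src1 >= 0 && src1 < 32 &&
         src2 >= 0 && src2 < 32);
  int other;
  if (dst == src1) {
    other = src2;  // existing case: dst aliases src1
  } else if (commutative && dst == src2) {
    other = src1;  // new case: swap the sources of a commutative op
  } else {
    return false;
  }
  out->dst_and_src1 = dst;
  out->src2 = other;
  out->needs_rex2 = (dst >= 16) || (other >= 16);
  return true;
}
```

For the two examples in the description, `try_demote(18, 25, 18, true, &f)` yields `addl r18, r25` with REX2 required, and `try_demote(2, 7, 2, true, &f)` yields `addl r2, r7` without it.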
------------- Marked as reviewed by jbhateja (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26997#pullrequestreview-3212908454 From jbhateja at openjdk.org Thu Sep 11 17:31:14 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 11 Sep 2025 17:31:14 GMT Subject: RFR: 8364305: Support AVX10 saturating floating point conversion instructions [v11] In-Reply-To: References: Message-ID: <8x67EDx2mHmRygqECi1m3BJ8kmBOpogaVvy-V_NnsUU=.f41f9f34-a559-491b-8d9d-8ae05a6890d3@github.com> On Thu, 11 Sep 2025 02:19:54 GMT, Mohamed Issa wrote: >> Intel® AVX10 ISA [1] extensions added new saturating floating point conversion instructions which comply with definitions in section 5.8 of the 2019 IEEE-754 standard. They can compute floating point to integral type conversions while also handling special inputs such as NaN, +Infinity, and -Infinity. >> >> Without AVX10.2, the current approach starts by converting the floating point value(s) in the source register to the desired integral value(s) in the destination register. In the scalar case, the CVTTSS2SI (single precision) or CVTTSD2SI (double precision) instruction is used. In the vector case, the CVTTPS2DQ (single precision) or CVTTPD2DQ (double precision) is used. However, if the source contains a special value (NaN, -Infinity, +Infinity, <= Integer.MIN_VALUE, or >= Integer.MAX_VALUE), extra handling is required. The specific sequence of instructions involved depends on the source (single precision vs double precision), destination (long, integer, short, or byte), level of parallelization (scalar vs vector), and supported AVX extension type. Essentially though, the special values are mapped to values (NaN -> 0, -Infinity, <= Integer.MIN_VALUE -> Integer.MIN_VALUE, +Infinity, >= Integer.MAX_VALUE -> Integer.MAX_VALUE) in the integer range with the help of a few temporary registers to store intermediate results.
>> >> This change uses the new AVX10.2 scalar (VCVTTSS2SIS or VCVTTSD2SIS) and vector (VCVTTPS2QQS, VCVTTPS2DQS, VCVTTPD2QQS, and VCVTTPD2DQS) instructions on supported platforms to avoid the extra handling described above. Also, the JTREG tests listed below were used to verify correctness with `-XX:-UseSuperWord` / `-XX:+UseSuperWord` options to exercise both scalar and vector paths. The baseline build used is [OpenJDK v26-b11](https://github.com/openjdk/jdk/releases/tag/jdk-26%2B11). >> >> 1. `jtreg:test/hotspot/jtreg/compiler/codegen/TestByteDoubleVect.java` >> 2. `jtreg:test/hotspot/jtreg/compiler/codegen/TestByteFloatVect.java` >> 3. `jtreg:test/hotspot/jtreg/compiler/codegen/TestIntDoubleVect.java` >> 4. `jtreg:test/hotspot/jtreg/compiler/codegen/TestIntFloatVect.java` >> 5. `jtreg:test/hotspot/jtreg/compiler/codegen/TestLongDoubleVect.java` >> 6. `jtreg:test/hotspot/jtreg/compiler/codegen/TestLongFloatVect.java` >> 7. `jtreg:test/hotspot/jtreg/compiler/codegen/TestShortDoubleVect.java` >> 8. `jtreg:test/hotspot/jtreg/compiler/codegen/TestShortFloatVect.java` >> 9. `jtreg:test/hotspot/jtreg/compiler/vectorapi/VectorFPtoIntCastTest.java` >> 1... > > Mohamed Issa has updated the pull request incrementally with two additional commits since the last revision: > > - Check for instructions that shouldn't appear in vector floating point conversion tests > - Correctly calculate vector lengths and don't rely on VectorReinterpret in cast2F2X and cast2D2X memory instructions src/hotspot/cpu/x86/x86.ad line 7719: > 7717: is_integral_type(Matcher::vector_element_basic_type(n))); > 7718: match(Set dst (VectorCastF2X src)); > 7719: format %{ "vector_cast2r_f2x $dst, $src\t!" %} Suggestion: format %{ "vector_cast_f2x_saturating $dst, $src\t!" %} src/hotspot/cpu/x86/x86.ad line 7732: > 7730: is_integral_type(Matcher::vector_element_basic_type(n))); > 7731: match(Set dst (VectorCastF2X (LoadVector src))); > 7732: format %{ "vector_cast2m_f2x $dst, $src\t!" 
%} Suggestion: format %{ "vector_cast_f2x_saturating $dst, $src\t!" %} src will be represented by appropriate addressing scheme for the memory operand src/hotspot/cpu/x86/x86.ad line 7793: > 7791: is_integral_type(Matcher::vector_element_basic_type(n))); > 7792: match(Set dst (VectorCastD2X src)); > 7793: format %{ "vector_cast2r_d2x $dst, $src\t!" %} Suggestion: format %{ "vector_cast_d2x_saturating $dst, $src\t!" %} src/hotspot/cpu/x86/x86.ad line 7806: > 7804: is_integral_type(Matcher::vector_element_basic_type(n))); > 7805: match(Set dst (VectorCastD2X (LoadVector src))); > 7806: format %{ "vector_cast2m_d2x $dst, $src\t!" %} Suggestion: format %{ "vector_cast_d2x_saturating $dst, $src\t!" %} ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2341851882 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2341859872 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2341861234 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2341861814 From hgreule at openjdk.org Thu Sep 11 17:42:46 2025 From: hgreule at openjdk.org (Hannes Greule) Date: Thu, 11 Sep 2025 17:42:46 GMT Subject: RFR: 8356813: Improve Mod(I|L)Node::Value [v8] In-Reply-To: <2Jf_gfvRlKcmCFoQHp5T0WW_fU_yK5-0Z3z41f00-YU=.164be9f0-fae1-44bb-84c3-846d8c2c0db2@github.com> References: <2Jf_gfvRlKcmCFoQHp5T0WW_fU_yK5-0Z3z41f00-YU=.164be9f0-fae1-44bb-84c3-846d8c2c0db2@github.com> Message-ID: <19JdaOkvM92QSjXvYVr1CNSXD5hkXINl1gh6qj-DCMQ=.6b268ebd-6c9a-4b33-b355-1dc41de53454@github.com> > This change improves the precision of the `Mod(I|L)Node::Value()` functions. > > I reordered the structure a bit. First, we handle constants, afterwards, we handle ranges. The bottom checks seem to be excessive (`Type::BOTTOM` is covered by using `isa_(int|long)()`, the local bottom is just the full range). Given we can even give reasonable bounds if only one input has any bounds, we don't want to return early. 
> The changes after that are commented. Please let me know if the explanations are good, or if you have any suggestions. > > ### Monotonicity > > Before, a 0 divisor resulted in `Type(Int|Long)::POS`. Initially I wanted to keep it this way, but that violates monotonicity during PhaseCCP. As an example, if we see a 0 divisor first and a 3 afterwards, we might try to go from `>=0` to `-2..2`, but the meet of these would be `>=-2` rather than `-2..2`. Using `Type(Int|Long)::ZERO` instead (zero is always in the resulting value if we cover a range). > > ### Testing > > I added tests for cases around the relevant bounds. I also ran tier1, tier2, and tier3 but didn't see any related failures after addressing the monotonicity problem described above (I'm having a few unrelated failures on my system currently, so separate testing would be appreciated in case I missed something). > > Please review and let me know what you think. > > ### Other > > The `UMod(I|L)Node`s were adjusted to be more in line with their signed variants. This change diverges them again, but similar improvements could be made after #17508. > > While experimenting with these changes, I stumbled upon a few things that aren't directly related to this change, but might be worth looking into further: > - If the divisor is a constant, we will directly replace the `Mod(I|L)Node` with more but less expensive nodes in `::Ideal()`. Type analysis for these nodes combined is less precise, which means we miss potential cases where this would help, e.g., removing range checks. Would it make sense to delay the replacement? > - To force non-negative ranges, I'm using `char`. I noticed that method parameters of sub-int integer types all fall back to `TypeInt::INT`. This seems to be an intentional change in https://github.com/openjdk/jdk/commit/200784d505dd98444c48c9ccb7f2e4df36dcbb6a. The bug report is private, so I can't really judge if that part is necessary, but it seems odd.
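The `-2..2` range quoted above for a divisor of 3 follows from Java's remainder semantics: the result takes the sign of the dividend and its magnitude is smaller than that of the divisor. The following is a minimal Java sketch of that reasoning, not the C2 code from the PR; the class `ModRangeSketch` and the method `modBounds` are hypothetical names introduced only for illustration, and the bounds are deliberately conservative.

```java
import java.util.Arrays;

// Hypothetical illustration only; this is not the Mod(I|L)Node::Value() implementation.
class ModRangeSketch {
    // Returns {lo, hi}, conservative bounds for (x % d) where x is in [xLo, xHi]
    // and d is a non-zero constant divisor.
    static int[] modBounds(int xLo, int xHi, int d) {
        if (d == 0) {
            throw new IllegalArgumentException("divisor must be non-zero");
        }
        if (xLo > xHi) {
            throw new IllegalArgumentException("empty dividend range");
        }
        // |d| - 1, computed in long so that Integer.MIN_VALUE does not overflow.
        long maxAbs = Math.abs((long) d) - 1;
        // The remainder has the sign of the dividend, so a non-negative
        // dividend range can never produce a negative result (and vice versa).
        // Its magnitude is also never larger than the dividend's magnitude.
        long lo = xLo < 0 ? Math.max(-maxAbs, (long) xLo) : 0;
        long hi = xHi > 0 ? Math.min(maxAbs, (long) xHi) : 0;
        return new int[] {(int) lo, (int) hi};
    }

    public static void main(String[] args) {
        // Full int dividend range, divisor 3.
        System.out.println(Arrays.toString(
            modBounds(Integer.MIN_VALUE, Integer.MAX_VALUE, 3)));
        // A char dividend is never negative, so the lower bound is 0.
        System.out.println(Arrays.toString(
            modBounds(Character.MIN_VALUE, Character.MAX_VALUE, 3)));
    }
}
```

The second call mirrors the `char` trick mentioned in the description: a non-negative dividend range removes the negative half of the result range.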
Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: address comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25254/files - new: https://git.openjdk.org/jdk/pull/25254/files/5c74919a..41d0e2c7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25254&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25254&range=06-07 Stats: 7 lines in 1 file changed: 0 ins; 3 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/25254.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25254/head:pull/25254 PR: https://git.openjdk.org/jdk/pull/25254 From hgreule at openjdk.org Thu Sep 11 17:42:47 2025 From: hgreule at openjdk.org (Hannes Greule) Date: Thu, 11 Sep 2025 17:42:47 GMT Subject: RFR: 8356813: Improve Mod(I|L)Node::Value In-Reply-To: References: <2Jf_gfvRlKcmCFoQHp5T0WW_fU_yK5-0Z3z41f00-YU=.164be9f0-fae1-44bb-84c3-846d8c2c0db2@github.com> Message-ID: On Thu, 15 May 2025 17:47:16 GMT, Quan Anh Mai wrote: >> This change improves the precision of the `Mod(I|L)Node::Value()` functions. >> >> I reordered the structure a bit. First, we handle constants, afterwards, we handle ranges. The bottom checks seem to be excessive (`Type::BOTTOM` is covered by using `isa_(int|long)()`, the local bottom is just the full range). Given we can even give reasonable bounds if only one input has any bounds, we don't want to return early. >> The changes after that are commented. Please let me know if the explanations are good, or if you have any suggestions. >> >> ### Monotonicity >> >> Before, a 0 divisor resulted in `Type(Int|Long)::POS`. Initially I wanted to keep it this way, but that violates monotonicity during PhaseCCP. As an example, if we see a 0 divisor first and a 3 afterwards, we might try to go from `>=0` to `-2..2`, but the meet of these would be `>=-2` rather than `-2..2`. Using `Type(Int|Long)::ZERO` instead (zero is always in the resulting value if we cover a range). 
>> >> ### Testing >> >> I added tests for cases around the relevant bounds. I also ran tier1, tier2, and tier3 but didn't see any related failures after addressing the monotonicity problem described above (I'm having a few unrelated failures on my system currently, so separate testing would be appreciated in case I missed something). >> >> Please review and let me know what you think. >> >> ### Other >> >> The `UMod(I|L)Node`s were adjusted to be more in line with their signed variants. This change diverges them again, but similar improvements could be made after #17508. >> >> While experimenting with these changes, I stumbled upon a few things that aren't directly related to this change, but might be worth looking into further: >> - If the divisor is a constant, we will directly replace the `Mod(I|L)Node` with more but less expensive nodes in `::Ideal()`. Type analysis for these nodes combined is less precise, which means we miss potential cases where this would help, e.g., removing range checks. Would it make sense to delay the replacement? >> - To force non-negative ranges, I'm using `char`. I noticed that method parameters of sub-int integer types all fall back to `TypeInt::INT`. This seems to be an intentional change in https://github.com/openjdk/jdk/commit/200784d505dd98444c48c9ccb7f2e4df36dcbb6a. The bug report is private, so I can't really judge if that part is necessary, but it seems odd. > >> Using `Type(Int|Long)::ZERO` instead (zero is always in the resulting value if we cover a range). > > Can we return `Type::TOP` instead? > > Besides, #17508 should be merged right after the JDK 25 fork; do you want to wait for it first? @merykitty thanks, I hopefully addressed your comments :) @eme64 do you want to re-run the tests once again?
------------- PR Comment: https://git.openjdk.org/jdk/pull/25254#issuecomment-3282030670 From vlivanov at openjdk.org Thu Sep 11 18:13:33 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Thu, 11 Sep 2025 18:13:33 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v10] In-Reply-To: References: Message-ID: > This PR introduces C2 support for `Reference.reachabilityFence()`. > > After [JDK-8199462](https://bugs.openjdk.org/browse/JDK-8199462) went in, it was discovered that C2 may break the invariant the fix relied upon [1]. So, this is an attempt to introduce proper support for `Reference.reachabilityFence()` in C2. C1 is left intact for now, because there are no signs yet it is affected. > > `Reference.reachabilityFence()` can be used in performance critical code, so the primary goal for C2 is to reduce its runtime overhead as much as possible. The ultimate goal is to ensure liveness information is attached to interfering safepoints, but it takes multiple steps to properly propagate the information through compilation pipeline without negatively affecting generated code quality. > > Also, I don't consider this fix as complete. It does fix the reported problem, but it doesn't provide any strong guarantees yet. In particular, since `ReachabilityFence` is CFG-only node, nothing explicitly forbids memory operations to float past `Reference.reachabilityFence()` and potentially reaching some other safepoints current analysis treats as non-interfering. Representing `ReachabilityFence` as memory barrier (e.g., `MemBarCPUOrder`) would solve the issue, but performance costs are prohibitively high. Alternatively, the optimization proposed in this PR can be improved to conservatively extend referent's live range beyond `ReachabilityFence` nodes associated with it. It would meet performance criteria, but I prefer to implement it as a followup fix. > > Another known issue relates to reachability fences on constant oops. 
If such constant is GCed (most likely, due to a bug in Java code), similar reachability issues may arise. For now, RFs on constants are treated as no-ops, but there's a diagnostic flag `PreserveReachabilityFencesOnConstants` to keep the fences. I plan to address it separately. > > [1] https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/ref/Reference.java#L667 > "HotSpot JVM retains the ref and does not GC it before a call to this method, because the JIT-compilers do not have GC-only safepoints." > > Testing: > - [x] hs-tier1 - hs-tier8 > - [x] hs-tier1 - hs-tier6 w/ -XX:+StressReachabilityFences -XX:+VerifyLoopOptimizations > - [x] java/lang/foreign microbenchmarks Vladimir Ivanov has updated the pull request incrementally with two additional commits since the last revision: - minor fixes - Fix guaranteed_safepoint usage ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25315/files - new: https://git.openjdk.org/jdk/pull/25315/files/6981bd18..267995ce Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25315&range=09 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25315&range=08-09 Stats: 54 lines in 5 files changed: 33 ins; 14 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/25315.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25315/head:pull/25315 PR: https://git.openjdk.org/jdk/pull/25315 From vlivanov at openjdk.org Thu Sep 11 18:18:13 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Thu, 11 Sep 2025 18:18:13 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v11] In-Reply-To: References: Message-ID: <1pShdyn-7-wwwiuY1DdMt5iiZ2qc9l_x2F-3AKqkg60=.dd260953-05cc-4b84-b6d1-7f684e74084c@github.com> > This PR introduces C2 support for `Reference.reachabilityFence()`. > > After [JDK-8199462](https://bugs.openjdk.org/browse/JDK-8199462) went in, it was discovered that C2 may break the invariant the fix relied upon [1]. 
So, this is an attempt to introduce proper support for `Reference.reachabilityFence()` in C2. C1 is left intact for now, because there are no signs yet it is affected. > > `Reference.reachabilityFence()` can be used in performance critical code, so the primary goal for C2 is to reduce its runtime overhead as much as possible. The ultimate goal is to ensure liveness information is attached to interfering safepoints, but it takes multiple steps to properly propagate the information through compilation pipeline without negatively affecting generated code quality. > > Also, I don't consider this fix as complete. It does fix the reported problem, but it doesn't provide any strong guarantees yet. In particular, since `ReachabilityFence` is CFG-only node, nothing explicitly forbids memory operations to float past `Reference.reachabilityFence()` and potentially reaching some other safepoints current analysis treats as non-interfering. Representing `ReachabilityFence` as memory barrier (e.g., `MemBarCPUOrder`) would solve the issue, but performance costs are prohibitively high. Alternatively, the optimization proposed in this PR can be improved to conservatively extend referent's live range beyond `ReachabilityFence` nodes associated with it. It would meet performance criteria, but I prefer to implement it as a followup fix. > > Another known issue relates to reachability fences on constant oops. If such constant is GCed (most likely, due to a bug in Java code), similar reachability issues may arise. For now, RFs on constants are treated as no-ops, but there's a diagnostic flag `PreserveReachabilityFencesOnConstants` to keep the fences. I plan to address it separately. > > [1] https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/ref/Reference.java#L667 > "HotSpot JVM retains the ref and does not GC it before a call to this method, because the JIT-compilers do not have GC-only safepoints." 
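For readers unfamiliar with the API under discussion, here is a minimal Java sketch of how `Reference.reachabilityFence()` is typically used; it follows the try/finally pattern from the method's Javadoc. The class `ExternalSlot` and its static `EXTERNAL` table are hypothetical stand-ins for an object that guards an off-heap or native resource, and no cleaner is actually registered in this sketch.

```java
import java.lang.ref.Reference;

// Hypothetical example: an object whose state lives outside the object itself.
class ExternalSlot {
    // Stand-in for an external resource that a cleaner could release
    // once the owning ExternalSlot becomes unreachable.
    private static final long[] EXTERNAL = new long[8];

    private final int slot;

    ExternalSlot(int slot, long value) {
        this.slot = slot;
        EXTERNAL[slot] = value;
    }

    long read() {
        // After this field load the code below no longer needs `this`,
        // so without a fence the object could be treated as unreachable
        // while the external resource is still being accessed.
        int s = slot;
        try {
            return EXTERNAL[s];
        } finally {
            // Keeps `this` strongly reachable until this point.
            Reference.reachabilityFence(this);
        }
    }
}
```

Calling `new ExternalSlot(3, 42L).read()` simply returns the stored value; the fence has no visible effect on the result and only constrains when the object may be considered unreachable.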
> > Testing: > - [x] hs-tier1 - hs-tier8 > - [x] hs-tier1 - hs-tier6 w/ -XX:+StressReachabilityFences -XX:+VerifyLoopOptimizations > - [x] java/lang/foreign microbenchmarks Vladimir Ivanov has updated the pull request incrementally with one additional commit since the last revision: Minor fix ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25315/files - new: https://git.openjdk.org/jdk/pull/25315/files/267995ce..01eaf64f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25315&range=10 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25315&range=09-10 Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/25315.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25315/head:pull/25315 PR: https://git.openjdk.org/jdk/pull/25315 From vlivanov at openjdk.org Thu Sep 11 18:28:12 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Thu, 11 Sep 2025 18:28:12 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v2] In-Reply-To: References: <0WKwHjzEn5dxYLkonrk4h9yfMI3r3bKDdqgG06J69N4=.e19e9441-6197-4d53-a4f4-b196a81f69d8@github.com> Message-ID: <1FgOFS7aAlEbvVUez6iTfzgf2l7qUbL9C4wfSGmmfo0=.406c10f1-63d5-4333-af6d-525e46203182@github.com> On Wed, 3 Sep 2025 08:30:47 GMT, Emanuel Peter wrote: >>>> Representing ReachabilityFence as memory barrier (e.g., MemBarCPUOrder) would solve the issue, but performance costs are prohibitively high. >> >>> How bad is it? MemBarCPUOrder pinches all memory, so I assume this breaks a lot of optimizations when RF is sitting in the hot loop? I remember we went through a similar exercise with Blackholes: [JDK-8296545](https://bugs.openjdk.org/browse/JDK-8296545) -- and decided to pinch only the control. I guessing this is not enough to fix RF, or is it? >> >> Yes, if a barrier stays inside loop body, it breaks a lot of important optimizations. It may end up almost as bad as a full-blown call (except a barrier can be moved around while a call can't). 
And moving a node when it depends both on control and memory is more complicated than just a CFG node. Moreover, as you can see in the proposed solution, even CFG-only representation is problematic for loop opts, so additional care is needed to ensure RFs are moved out of loops. >> >> As an alternative approach, I thought about reifying RF as a data node (think of `CastPP`) and then linking its referent to all safepoints it dominates after loop opts are over. But that would only affect `optimize_reachability_fences()`. Everything else would stay the same. So, I decided to stay with CFG-only representation for now. > > @iwanowww Let me know whenever this is ready to review again ? @eme64 I think I addressed/answered all your suggestions/questions. Please, take another look. Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/25315#issuecomment-3282162627 From vlivanov at openjdk.org Thu Sep 11 18:28:14 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Thu, 11 Sep 2025 18:28:14 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v3] In-Reply-To: References: Message-ID: On Mon, 8 Sep 2025 12:55:36 GMT, Emanuel Peter wrote: >> Well, it's a SafePointNode class after all. I lifted it from `CallNode` subclass to avoid elaborate check on SafePoint nodes (!is_Call() || as_Call() && guaranteed_safepoint()`)). >> >> If some node extends SafePointNode, but doesn't keep JVM state, it has to communicate it to users one way or another. And changing the default doesn't improve the situation IMO: reporting a safepoint node as a non-safepoint is still a bug. > > Hmm. The way it is formulated it sounds more like: > - `true` -> we are guaranteed that it is a safepoint. > - `false` -> it may or may not be a safepoint - no guarantees. > Am I understanding this right? > > If yes, then it would make more sense to have a default that is `no guarantee`. But maybe that makes things more complicated in other ways. 
All I'm saying it makes me nervous ;) You are right. I studied the code and `guaranteed_safepoint()` behaves as you described. It doesn't work for RF purposes, so I migrated the code to `sfpt->jvms() != nullptr` check and fixed a bug along the way. The changes related to `guaranteed_safepoint()` are reverted. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2341997278 From sviswanathan at openjdk.org Thu Sep 11 21:02:24 2025 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Thu, 11 Sep 2025 21:02:24 GMT Subject: RFR: 8364305: Support AVX10 saturating floating point conversion instructions [v11] In-Reply-To: References: Message-ID: On Thu, 11 Sep 2025 02:19:54 GMT, Mohamed Issa wrote: >> Intel® AVX10 ISA [1] extensions added new saturating floating point conversion instructions which comply with definitions in section 5.8 of the 2019 IEEE-754 standard. They can compute floating point to integral type conversions while also handling special inputs such as NaN, +Infinity, and -Infinity. >> >> Without AVX10.2, the current approach starts by converting the floating point value(s) in the source register to the desired integral value(s) in the destination register. In the scalar case, the CVTTSS2SI (single precision) or CVTTSD2SI (double precision) instruction is used. In the vector case, the CVTTPS2DQ (single precision) or CVTTPD2DQ (double precision) is used. However, if the source contains a special value (NaN, -Infinity, +Infinity, <= Integer.MIN_VALUE, or >= Integer.MAX_VALUE), extra handling is required. The specific sequence of instructions involved depends on the source (single precision vs double precision), destination (long, integer, short, or byte), level of parallelization (scalar vs vector), and supported AVX extension type. 
Essentially though, the special values are mapped to values (NaN -> 0, -Infinity, <= Integer.MIN_VALUE -> Integer.MIN_VALUE, +Infinity, >= Integer.MAX_VALUE -> Integer.MAX_VALUE) in the integer range with the help of a few temporary registers to store intermediate results. >> >> This change uses the new AVX10.2 scalar (VCVTTSS2SIS or VCVTTSD2SIS) and vector (VCVTTPS2QQS, VCVTTPS2DQS, VCVTTPD2QQS, and VCVTTPD2DQS) instructions on supported platforms to avoid the extra handling described above. Also, the JTREG tests listed below were used to verify correctness with `-XX:-UseSuperWord` / `-XX:+UseSuperWord` options to exercise both scalar and vector paths. The baseline build used is [OpenJDK v26-b11](https://github.com/openjdk/jdk/releases/tag/jdk-26%2B11). >> >> 1. `jtreg:test/hotspot/jtreg/compiler/codegen/TestByteDoubleVect.java` >> 2. `jtreg:test/hotspot/jtreg/compiler/codegen/TestByteFloatVect.java` >> 3. `jtreg:test/hotspot/jtreg/compiler/codegen/TestIntDoubleVect.java` >> 4. `jtreg:test/hotspot/jtreg/compiler/codegen/TestIntFloatVect.java` >> 5. `jtreg:test/hotspot/jtreg/compiler/codegen/TestLongDoubleVect.java` >> 6. `jtreg:test/hotspot/jtreg/compiler/codegen/TestLongFloatVect.java` >> 7. `jtreg:test/hotspot/jtreg/compiler/codegen/TestShortDoubleVect.java` >> 8. `jtreg:test/hotspot/jtreg/compiler/codegen/TestShortFloatVect.java` >> 9. `jtreg:test/hotspot/jtreg/compiler/vectorapi/VectorFPtoIntCastTest.java` >> 1... > > Mohamed Issa has updated the pull request incrementally with two additional commits since the last revision: > > - Check for instructions that shouldn't appear in vector floating point conversion tests > - Correctly calculate vector lengths and don't rely on VectorReinterpret in cast2F2X and cast2D2X memory instructions src/hotspot/cpu/x86/x86.ad line 7715: > 7713: %} > 7714: > 7715: instruct cast2FtoX_reg_evex(vec dst, vec src) %{ Could be named as castFtoX_reg_avx10.
src/hotspot/cpu/x86/x86.ad line 7728: > 7726: %} > 7727: > 7728: instruct cast2FtoX_mem_evex(vec dst, memory src) %{ Could be named as castFtoX_mem_avx10. src/hotspot/cpu/x86/x86.ad line 7789: > 7787: %} > 7788: > 7789: instruct cast2DtoX_reg_evex(vec dst, vec src) %{ Could be named as castDtoX_reg_avx10. src/hotspot/cpu/x86/x86.ad line 7802: > 7800: %} > 7801: > 7802: instruct cast2DtoX_mem_evex(vec dst, memory src) %{ Could be named as castDtoX_mem_avx10. src/hotspot/cpu/x86/x86_64.ad line 11728: > 11726: %} > 11727: > 11728: instruct conv2F2I_reg_reg(rRegI dst, regF src) Could be named as convF2I_reg_reg_avx10. src/hotspot/cpu/x86/x86_64.ad line 11739: > 11737: %} > 11738: > 11739: instruct conv2F2I_reg_mem(rRegI dst, memory src) Could be named as convF2I_reg_mem_avx10. src/hotspot/cpu/x86/x86_64.ad line 11762: > 11760: %} > 11761: > 11762: instruct conv2F2L_reg_reg(rRegL dst, regF src) Could be named as convF2L_reg_reg_avx10 src/hotspot/cpu/x86/x86_64.ad line 11773: > 11771: %} > 11772: > 11773: instruct conv2F2L_reg_mem(rRegL dst, memory src) Could be named as convF2L_reg_mem_avx10 src/hotspot/cpu/x86/x86_64.ad line 11796: > 11794: %} > 11795: > 11796: instruct conv2D2I_reg_reg(rRegI dst, regD src) Could be named as convD2I_reg_reg_avx10. src/hotspot/cpu/x86/x86_64.ad line 11807: > 11805: %} > 11806: > 11807: instruct conv2D2I_reg_mem(rRegI dst, memory src) Could be named as convD2I_reg_mem_avx10. src/hotspot/cpu/x86/x86_64.ad line 11830: > 11828: %} > 11829: > 11830: instruct conv2D2L_reg_reg(rRegL dst, regD src) Could be named as convD2L_reg_reg_avx10. src/hotspot/cpu/x86/x86_64.ad line 11841: > 11839: %} > 11840: > 11841: instruct conv2D2L_reg_mem(rRegL dst, memory src) Could be named as convD2L_reg_mem_avx10. test/hotspot/jtreg/compiler/lib/ir_framework/IRNode.java line 494: > 492: } > 493: > 494: public static final String CAST_F2X = PREFIX + "CAST_F2X" + POSTFIX; May be we can name CAST_F2X as X86_VCAST_F2X and CAST2_F2X as X86_VCAST_F2X_AVX10. 
Then we can use the similar theme for other names below as well. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2342294949 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2342295729 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2342298160 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2342300778 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2342302012 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2342304990 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2342305755 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2342306364 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2342307288 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2342308076 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2342309412 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2342310391 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2342326871 From sviswanathan at openjdk.org Thu Sep 11 21:02:26 2025 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Thu, 11 Sep 2025 21:02:26 GMT Subject: RFR: 8364305: Support AVX10 saturating floating point conversion instructions [v11] In-Reply-To: References: Message-ID: On Thu, 11 Sep 2025 20:50:12 GMT, Sandhya Viswanathan wrote: >> Mohamed Issa has updated the pull request incrementally with two additional commits since the last revision: >> >> - Check for instructions that shouldn't appear in vector floating point conversion tests >> - Correctly calculate vector lengths and don't rely on VectorReinterpret in cast2F2X and cast2D2X memory instructions > > src/hotspot/cpu/x86/x86_64.ad line 11841: > >> 11839: %} >> 11840: >> 11841: instruct conv2D2L_reg_mem(rRegL dst, memory src) > > Could be named as convD2L_reg_mem_avx10. 
IRNode.java will need name regex changes accordingly. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2342314412 From missa at openjdk.org Thu Sep 11 23:10:44 2025 From: missa at openjdk.org (Mohamed Issa) Date: Thu, 11 Sep 2025 23:10:44 GMT Subject: RFR: 8364305: Support AVX10 saturating floating point conversion instructions [v12] In-Reply-To: References: Message-ID: > Intel® AVX10 ISA [1] extensions added new saturating floating point conversion instructions which comply with definitions in section 5.8 of the 2019 IEEE-754 standard. They can compute floating point to integral type conversions while also handling special inputs such as NaN, +Infinity, and -Infinity. > > Without AVX10.2, the current approach starts by converting the floating point value(s) in the source register to the desired integral value(s) in the destination register. In the scalar case, the CVTTSS2SI (single precision) or CVTTSD2SI (double precision) instruction is used. In the vector case, the CVTTPS2DQ (single precision) or CVTTPD2DQ (double precision) is used. However, if the source contains a special value (NaN, -Infinity, +Infinity, <= Integer.MIN_VALUE, or >= Integer.MAX_VALUE), extra handling is required. The specific sequence of instructions involved depends on the source (single precision vs double precision), destination (long, integer, short, or byte), level of parallelization (scalar vs vector), and supported AVX extension type. Essentially though, the special values are mapped to values (NaN -> 0, -Infinity, <= Integer.MIN_VALUE -> Integer.MIN_VALUE, +Infinity, >= Integer.MAX_VALUE -> Integer.MAX_VALUE) in the integer range with the help of a few temporary registers to store intermediate results. > > This change uses the new AVX10.2 scalar (VCVTTSS2SIS or VCVTTSD2SIS) and vector (VCVTTPS2QQS, VCVTTPS2DQS, VCVTTPD2QQS, and VCVTTPD2DQS) instructions on supported platforms to avoid the extra handling described above.
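The special-value mapping described above is the behavior that Java's narrowing casts already require at the language level, which is why the generated code must saturate. As a small Java sketch of those semantics only (it says nothing about which instructions the JIT selects), the hypothetical class `SaturatingCastSketch` below wraps the plain casts so the results can be checked.

```java
// Hypothetical illustration of Java's floating-point to integral cast semantics.
class SaturatingCastSketch {
    // float -> int: NaN becomes 0, out-of-range values saturate.
    static int f2i(float f) {
        return (int) f;
    }

    // double -> long: same rules, saturating at the long bounds.
    static long d2l(double d) {
        return (long) d;
    }

    public static void main(String[] args) {
        System.out.println(f2i(Float.NaN));                // NaN maps to 0
        System.out.println(f2i(Float.POSITIVE_INFINITY));  // saturates to Integer.MAX_VALUE
        System.out.println(f2i(Float.NEGATIVE_INFINITY));  // saturates to Integer.MIN_VALUE
        System.out.println(f2i(-1.9f));                    // in-range values truncate toward zero
        System.out.println(d2l(1e300));                    // saturates to Long.MAX_VALUE
    }
}
```

Running the tests with both `-XX:-UseSuperWord` and `-XX:+UseSuperWord`, as the description does, checks that the scalar and vectorized code paths agree on these results.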
Also, the JTREG tests listed below were used to verify correctness with `-XX:-UseSuperWord` / `-XX:+UseSuperWord` options to exercise both scalar and vector paths. The baseline build used is [OpenJDK v26-b11](https://github.com/openjdk/jdk/releases/tag/jdk-26%2B11). > > 1. `jtreg:test/hotspot/jtreg/compiler/codegen/TestByteDoubleVect.java` > 2. `jtreg:test/hotspot/jtreg/compiler/codegen/TestByteFloatVect.java` > 3. `jtreg:test/hotspot/jtreg/compiler/codegen/TestIntDoubleVect.java` > 4. `jtreg:test/hotspot/jtreg/compiler/codegen/TestIntFloatVect.java` > 5. `jtreg:test/hotspot/jtreg/compiler/codegen/TestLongDoubleVect.java` > 6. `jtreg:test/hotspot/jtreg/compiler/codegen/TestLongFloatVect.java` > 7. `jtreg:test/hotspot/jtreg/compiler/codegen/TestShortDoubleVect.java` > 8. `jtreg:test/hotspot/jtreg/compiler/codegen/TestShortFloatVect.java` > 9. `jtreg:test/hotspot/jtreg/compiler/vectorapi/VectorFPtoIntCastTest.java` > 10. `jtreg:test/hotspot/jtreg/com... Mohamed Issa has updated the pull request incrementally with two additional commits since the last revision: - Change the floating point conversion instruction, IR nodes, and test rules to make them clearer - Change debug text format of AVX 10.2 vector conversion instructions ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26919/files - new: https://git.openjdk.org/jdk/pull/26919/files/8587952d..df175756 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26919&range=11 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26919&range=10-11 Stats: 180 lines in 7 files changed: 60 ins; 60 del; 60 mod Patch: https://git.openjdk.org/jdk/pull/26919.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26919/head:pull/26919 PR: https://git.openjdk.org/jdk/pull/26919 From missa at openjdk.org Thu Sep 11 23:10:54 2025 From: missa at openjdk.org (Mohamed Issa) Date: Thu, 11 Sep 2025 23:10:54 GMT Subject: RFR: 8364305: Support AVX10 saturating floating point conversion instructions [v11] In-Reply-To: 
References: Message-ID: <0wYVVPSsr5S3QTcGPkM0dmXLwJq_ff1yOZCOpqlyMMo=.c0585691-bf34-4da8-9dda-3d7bd2c9339f@github.com> On Thu, 11 Sep 2025 20:42:16 GMT, Sandhya Viswanathan wrote: >> Mohamed Issa has updated the pull request incrementally with two additional commits since the last revision: >> >> - Check for instructions that shouldn't appear in vector floating point conversion tests >> - Correctly calculate vector lengths and don't rely on VectorReinterpret in cast2F2X and cast2D2X memory instructions > > src/hotspot/cpu/x86/x86.ad line 7715: > >> 7713: %} >> 7714: >> 7715: instruct cast2FtoX_reg_evex(vec dst, vec src) %{ > > Could be named as castFtoX_reg_avx10. Renamed > src/hotspot/cpu/x86/x86.ad line 7728: > >> 7726: %} >> 7727: >> 7728: instruct cast2FtoX_mem_evex(vec dst, memory src) %{ > > Could be named as castFtoX_mem_avx10. Renamed > src/hotspot/cpu/x86/x86.ad line 7789: > >> 7787: %} >> 7788: >> 7789: instruct cast2DtoX_reg_evex(vec dst, vec src) %{ > > Could be named as castDtoX_reg_avx10. Renamed > src/hotspot/cpu/x86/x86.ad line 7802: > >> 7800: %} >> 7801: >> 7802: instruct cast2DtoX_mem_evex(vec dst, memory src) %{ > > Could be named as castDtoX_mem_avx10. Renamed > src/hotspot/cpu/x86/x86_64.ad line 11728: > >> 11726: %} >> 11727: >> 11728: instruct conv2F2I_reg_reg(rRegI dst, regF src) > > Could be named as convF2I_reg_reg_avx10. Renamed > src/hotspot/cpu/x86/x86_64.ad line 11739: > >> 11737: %} >> 11738: >> 11739: instruct conv2F2I_reg_mem(rRegI dst, memory src) > > Could be named as convF2I_reg_mem_avx10. 
Renamed > src/hotspot/cpu/x86/x86_64.ad line 11762: > >> 11760: %} >> 11761: >> 11762: instruct conv2F2L_reg_reg(rRegL dst, regF src) > > Could be named as convF2L_reg_reg_avx10 Renamed > src/hotspot/cpu/x86/x86_64.ad line 11773: > >> 11771: %} >> 11772: >> 11773: instruct conv2F2L_reg_mem(rRegL dst, memory src) > > Could be named as convF2L_reg_mem_avx10 Renamed > src/hotspot/cpu/x86/x86_64.ad line 11796: > >> 11794: %} >> 11795: >> 11796: instruct conv2D2I_reg_reg(rRegI dst, regD src) > > Could be named as convD2I_reg_reg_avx10. Renamed > src/hotspot/cpu/x86/x86_64.ad line 11807: > >> 11805: %} >> 11806: >> 11807: instruct conv2D2I_reg_mem(rRegI dst, memory src) > > Could be named as convD2I_reg_mem_avx10. Renamed > src/hotspot/cpu/x86/x86_64.ad line 11830: > >> 11828: %} >> 11829: >> 11830: instruct conv2D2L_reg_reg(rRegL dst, regD src) > > Could be named as convD2L_reg_reg_avx10. Renamed > test/hotspot/jtreg/compiler/lib/ir_framework/IRNode.java line 494: > >> 492: } >> 493: >> 494: public static final String CAST_F2X = PREFIX + "CAST_F2X" + POSTFIX; > > May be we can name CAST_F2X as X86_VCAST_F2X and CAST2_F2X as X86_VCAST_F2X_AVX10. > Then we can use the similar theme for other names below as well. Renamed > test/hotspot/jtreg/compiler/lib/ir_framework/IRNode.java line 496: > >> 494: public static final String CAST_F2X = PREFIX + "CAST_F2X" + POSTFIX; >> 495: static { >> 496: machOnlyNameRegex(CAST_F2X, "castF2X_reg_(av|eve)x"); > > This should be "castFtoX_reg_(av|eve)x". Fixed > test/hotspot/jtreg/compiler/lib/ir_framework/IRNode.java line 501: > >> 499: public static final String CAST_D2X = PREFIX + "CAST_D2X" + POSTFIX; >> 500: static { >> 501: machOnlyNameRegex(CAST_D2X, "castD2X_reg_(av|eve)x"); > > This should be "castDtoX_reg_(av|eve)x". 
Fixed > test/hotspot/jtreg/compiler/lib/ir_framework/IRNode.java line 506: > >> 504: public static final String CAST2_F2X = PREFIX + "CAST2_F2X" + POSTFIX; >> 505: static { >> 506: machOnlyNameRegex(CAST2_F2X, "cast2F2X_(reg|mem)_evex"); > > This should be "cast2FtoX_(reg|mem)_evex" Fixed > test/hotspot/jtreg/compiler/lib/ir_framework/IRNode.java line 511: > >> 509: public static final String CAST2_D2X = PREFIX + "CAST2_D2X" + POSTFIX; >> 510: static { >> 511: machOnlyNameRegex(CAST2_D2X, "cast2D2X_(reg|mem)_evex"); > > This should be "cast2DtoX_(reg|mem)_evex". Fixed ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2342544829 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2342545092 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2342545345 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2342545547 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2342545846 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2342546207 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2342546448 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2342546884 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2342547335 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2342547642 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2342547891 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2342548440 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2342543063 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2342543272 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2342543522 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2342543785 From missa at openjdk.org Thu Sep 11 23:11:00 2025 From: missa at openjdk.org (Mohamed Issa) Date: Thu, 11 Sep 
2025 23:11:00 GMT Subject: RFR: 8364305: Support AVX10 saturating floating point conversion instructions [v11] In-Reply-To: References: Message-ID: On Thu, 11 Sep 2025 20:51:47 GMT, Sandhya Viswanathan wrote: >> src/hotspot/cpu/x86/x86_64.ad line 11841: >> >>> 11839: %} >>> 11840: >>> 11841: instruct conv2D2L_reg_mem(rRegL dst, memory src) >> >> Could be named as convD2L_reg_mem_avx10. > > IRNode.java will need name regex changes accordingly. Renamed ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2342548095 From missa at openjdk.org Thu Sep 11 23:10:58 2025 From: missa at openjdk.org (Mohamed Issa) Date: Thu, 11 Sep 2025 23:10:58 GMT Subject: RFR: 8364305: Support AVX10 saturating floating point conversion instructions [v11] In-Reply-To: <8x67EDx2mHmRygqECi1m3BJ8kmBOpogaVvy-V_NnsUU=.f41f9f34-a559-491b-8d9d-8ae05a6890d3@github.com> References: <8x67EDx2mHmRygqECi1m3BJ8kmBOpogaVvy-V_NnsUU=.f41f9f34-a559-491b-8d9d-8ae05a6890d3@github.com> Message-ID: On Thu, 11 Sep 2025 17:20:29 GMT, Jatin Bhateja wrote: >> Mohamed Issa has updated the pull request incrementally with two additional commits since the last revision: >> >> - Check for instructions that shouldn't appear in vector floating point conversion tests >> - Correctly calculate vector lengths and don't rely on VectorReinterpret in cast2F2X and cast2D2X memory instructions > > src/hotspot/cpu/x86/x86.ad line 7719: > >> 7717: is_integral_type(Matcher::vector_element_basic_type(n))); >> 7718: match(Set dst (VectorCastF2X src)); >> 7719: format %{ "vector_cast2r_f2x $dst, $src\t!" %} > > Suggestion: > > format %{ "vector_cast_f2x_saturating $dst, $src\t!" %} Updated > src/hotspot/cpu/x86/x86.ad line 7732: > >> 7730: is_integral_type(Matcher::vector_element_basic_type(n))); >> 7731: match(Set dst (VectorCastF2X (LoadVector src))); >> 7732: format %{ "vector_cast2m_f2x $dst, $src\t!" %} > > Suggestion: > > format %{ "vector_cast_f2x_saturating $dst, $src\t!" 
%} > > src will be represented by appropriate addressing scheme for the memory operand Updated > src/hotspot/cpu/x86/x86.ad line 7793: > >> 7791: is_integral_type(Matcher::vector_element_basic_type(n))); >> 7792: match(Set dst (VectorCastD2X src)); >> 7793: format %{ "vector_cast2r_d2x $dst, $src\t!" %} > > Suggestion: > > format %{ "vector_cast_d2x_saturating $dst, $src\t!" %} Updated > src/hotspot/cpu/x86/x86.ad line 7806: > >> 7804: is_integral_type(Matcher::vector_element_basic_type(n))); >> 7805: match(Set dst (VectorCastD2X (LoadVector src))); >> 7806: format %{ "vector_cast2m_d2x $dst, $src\t!" %} > > Suggestion: > > format %{ "vector_cast_d2x_saturating $dst, $src\t!" %} Updated ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2342543956 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2342544232 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2342544450 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2342544575 From dlong at openjdk.org Thu Sep 11 23:50:22 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 11 Sep 2025 23:50:22 GMT Subject: RFR: 8367397: Improve naming and terminology in regmask.hpp and regmask.cpp [v2] In-Reply-To: References: Message-ID: On Thu, 11 Sep 2025 14:01:43 GMT, Daniel Lundén wrote: >> Some names in `regmask.hpp` and `regmask.cpp` are unclear and should be improved. >> >> ### Changeset >> >> - Rename `RM_SIZE` to `RM_SIZE_IN_INTS` and `_RM_I` to `_RM_INT` to make it clear that these refer to integer-sized (32-bit) array elements. >> - Rename `_RM_SIZE` to `_RM_SIZE_IN_WORDS` and `_RM_UP` to `_RM_WORD` to make it clear that these refer to machine-word-sized (32 or 64 bits depending on platform) array elements. >> - Rename `_RM_MAX` to `_RM_WORD_MAX_INDEX` for clarity. >> - Rename `is_AllStack` to `is_infinite` (and related resulting changes in comments and local variables).
The old terminology "all-stack", referring to the infinite register mask bits, is misleading (as pointed out by @eme64 in https://github.com/openjdk/jdk/pull/20404#discussion_r2316234008). The reason is that the infinite bits do not represent *all* stack bits. Some stack bits are instead part of the non-infinite bits of the register mask. >> >> ### Testing >> >> - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/17638365968) >> - `tier1` and HotSpot parts of `tier2` and `tier3` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. > > Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: > > Lowercase _RM_INT and _RM_WORD src/hotspot/share/opto/chaitin.cpp line 1580: > 1578: _ifg->re_insert(lidx); > 1579: if( !lrg->alive() ) continue; > 1580: // capture allstackedness flag before mask is hacked allstackedness --> infiniteness? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27215#discussion_r2342594775 From dlong at openjdk.org Thu Sep 11 23:50:23 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 11 Sep 2025 23:50:23 GMT Subject: RFR: 8367397: Improve naming and terminology in regmask.hpp and regmask.cpp [v2] In-Reply-To: <1xDonJ67G3hUAWTdngutIb7LBboWxHRviCHXKDCSoN4=.2617f8e9-206b-424d-a1ab-501b182717bb@github.com> References: <2unG-RdDR2e1mI-veaR3AdDGGs1q4XFdITrnQtBGOw8=.47f7d565-b722-434b-96ef-b51ed733b241@github.com> <1xDonJ67G3hUAWTdngutIb7LBboWxHRviCHXKDCSoN4=.2617f8e9-206b-424d-a1ab-501b182717bb@github.com> Message-ID: On Thu, 11 Sep 2025 13:40:35 GMT, Daniel Lundén wrote: >> Generally, we use `_` for fields, but not for constants. >> Also: fields should be lower-case, so maybe `_RM_INT` -> `_rm_int`? > > Thanks, I agree that it seems more consistent to use `_rm_int` and `_rm_word` instead. The missing leading underscore for `RM_SIZE_IN_INTS` highlights that it is a macro, unlike `_RM_SIZE_IN_WORDS`.
Maybe this is just for historical reasons and not up to date with today's conventions? > > Do we classify constant static fields such as `_RM_SIZE_IN_WORDS` as constants or fields? I.e., do we use upper or lower case? I guess it would be `_rm_size_in_words` if considered a field and `RM_SIZE_IN_WORDS` (without the leading underscore) if considered a constant. I vote for `RM_SIZE_IN_WORDS` because it is a constant, the same as if it was a value from an enum. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27215#discussion_r2342592026 From dlong at openjdk.org Fri Sep 12 00:12:19 2025 From: dlong at openjdk.org (Dean Long) Date: Fri, 12 Sep 2025 00:12:19 GMT Subject: RFR: 8367397: Improve naming and terminology in regmask.hpp and regmask.cpp [v2] In-Reply-To: References: Message-ID: On Thu, 11 Sep 2025 14:01:43 GMT, Daniel Lundén wrote: >> Some names in `regmask.hpp` and `regmask.cpp` are unclear and should be improved. >> >> ### Changeset >> >> - Rename `RM_SIZE` to `RM_SIZE_IN_INTS` and `_RM_I` to `_RM_INT` to make it clear that these refer to integer-sized (32-bit) array elements. >> - Rename `_RM_SIZE` to `_RM_SIZE_IN_WORDS` and `_RM_UP` to `_RM_WORD` to make it clear that these refer to machine-word-sized (32 or 64 bits depending on platform) array elements. >> - Rename `_RM_MAX` to `_RM_WORD_MAX_INDEX` for clarity. >> - Rename `is_AllStack` to `is_infinite` (and related resulting changes in comments and local variables). The old terminology "all-stack", referring to the infinite register mask bits, is misleading (as pointed out by @eme64 in https://github.com/openjdk/jdk/pull/20404#discussion_r2316234008). The reason is that the infinite bits do not represent *all* stack bits. Some stack bits are instead part of the non-infinite bits of the register mask.
>> >> ### Testing >> >> - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/17638365968) >> - `tier1` and HotSpot parts of `tier2` and `tier3` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. > > Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: > > Lowercase _RM_INT and _RM_WORD src/hotspot/share/opto/regmask.hpp line 66: > 64: > 65: static const unsigned int _WordBitMask = BitsPerWord - 1U; > 66: static const unsigned int _LogWordBits = LogBitsPerWord; What about just replacing all uses of _LogWordBits with LogBitsPerWord? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27215#discussion_r2342617346 From dlong at openjdk.org Fri Sep 12 00:21:19 2025 From: dlong at openjdk.org (Dean Long) Date: Fri, 12 Sep 2025 00:21:19 GMT Subject: RFR: 8367397: Improve naming and terminology in regmask.hpp and regmask.cpp [v2] In-Reply-To: References: Message-ID: On Thu, 11 Sep 2025 14:01:43 GMT, Daniel Lundén wrote: >> Some names in `regmask.hpp` and `regmask.cpp` are unclear and should be improved. >> >> ### Changeset >> >> - Rename `RM_SIZE` to `RM_SIZE_IN_INTS` and `_RM_I` to `_RM_INT` to make it clear that these refer to integer-sized (32-bit) array elements. >> - Rename `_RM_SIZE` to `_RM_SIZE_IN_WORDS` and `_RM_UP` to `_RM_WORD` to make it clear that these refer to machine-word-sized (32 or 64 bits depending on platform) array elements. >> - Rename `_RM_MAX` to `_RM_WORD_MAX_INDEX` for clarity. >> - Rename `is_AllStack` to `is_infinite` (and related resulting changes in comments and local variables). The old terminology "all-stack", referring to the infinite register mask bits, is misleading (as pointed out by @eme64 in https://github.com/openjdk/jdk/pull/20404#discussion_r2316234008). The reason is that the infinite bits do not represent *all* stack bits.
Some stack bits are instead part of the non-infinite bits of the register mask. >> >> ### Testing >> >> - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/17638365968) >> - `tier1` and HotSpot parts of `tier2` and `tier3` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. > > Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: > > Lowercase _RM_INT and _RM_WORD src/hotspot/share/opto/regmask.hpp line 166: > 164: // indefinitely with ONE bits. Returns TRUE if mask is infinite or > 165: // unbounded in size. Returns FALSE if mask is finite size. > 166: bool is_infinite() const { "infinite" hides the fact that these unbounded bits are stack bits and not register bits, but `is_UnboundedStack` or `is_InfiniteStack` might be too verbose. How does `is_InfStack` sound? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27215#discussion_r2342628324 From sviswanathan at openjdk.org Fri Sep 12 00:24:20 2025 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Fri, 12 Sep 2025 00:24:20 GMT Subject: RFR: 8364305: Support AVX10 saturating floating point conversion instructions [v12] In-Reply-To: References: Message-ID: On Thu, 11 Sep 2025 23:10:44 GMT, Mohamed Issa wrote: >> Intel® AVX10 ISA [1] extensions added new saturating floating point conversion instructions which comply with definitions in section 5.8 of the 2019 IEEE-754 standard. They can compute floating point to integral type conversions while also handling special inputs such as NaN, +Infinity, and -Infinity. >> >> Without AVX10.2, the current approach starts by converting the floating point value(s) in the source register to the desired integral value(s) in the destination register. In the scalar case, the CVTTSS2SI (single precision) or CVTTSD2SI (double precision) instruction is used.
In the vector case, the CVTTPS2DQ (single precision) or CVTTPD2DQ (double precision) is used. However, if the source contains a special value (NaN, -Infinity, +Infinity, <= Integer.MIN_VALUE, or >= Integer.MAX_VALUE), extra handling is required. The specific sequence of instructions involved depends on the source (single precision vs double precision), destination (long, integer, short, or byte), level of parallelization (scalar vs vector), and supported AVX extension type. Essentially though, the special values are mapped to values (NaN -> 0, -Infinity, <= Integer.MIN_VALUE -> Integer.MIN_VALUE, +Infinity, >= Integer.MAX_VALUE -> Integer.MAX_VALUE) in the integer range with the help of a few temporary registers to store intermediate results. >> >> This change uses the new AVX10.2 scalar (VCVTTSS2SIS or VCVTTSD2SIS) and vector (VCVTTPS2QQS, VCVTTPS2DQS, VCVTTPD2QQS, and VCVTTPD2DQS) instructions on supported platforms to avoid the extra handling described above. Also, the JTREG tests listed below were used to verify correctness with `-XX:-UseSuperWord` / `-XX:+UseSuperWord` options to exercise both scalar and vector paths. The baseline build used is [OpenJDK v26-b11](https://github.com/openjdk/jdk/releases/tag/jdk-26%2B11). >> >> 1. `jtreg:test/hotspot/jtreg/compiler/codegen/TestByteDoubleVect.java` >> 2. `jtreg:test/hotspot/jtreg/compiler/codegen/TestByteFloatVect.java` >> 3. `jtreg:test/hotspot/jtreg/compiler/codegen/TestIntDoubleVect.java` >> 4. `jtreg:test/hotspot/jtreg/compiler/codegen/TestIntFloatVect.java` >> 5. `jtreg:test/hotspot/jtreg/compiler/codegen/TestLongDoubleVect.java` >> 6. `jtreg:test/hotspot/jtreg/compiler/codegen/TestLongFloatVect.java` >> 7. `jtreg:test/hotspot/jtreg/compiler/codegen/TestShortDoubleVect.java` >> 8. `jtreg:test/hotspot/jtreg/compiler/codegen/TestShortFloatVect.java` >> 9. `jtreg:test/hotspot/jtreg/compiler/vectorapi/VectorFPtoIntCastTest.java` >> 1...
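For readers less familiar with the Java semantics that the conversion code has to reproduce, the special-value mapping described above can be sketched for `int` and `long` targets as follows. The class and method names are made up for illustration; the behaviour shown is that of Java's narrowing primitive conversions, not of any HotSpot code.

```java
// Illustrative sketch only: class and method names are hypothetical.
// It demonstrates the Java-level results that any floating point to
// integral conversion sequence (with or without AVX10.2) must produce.
final class SaturatingCastSketch {
    // float -> int, as performed by a plain Java cast
    static int f2i(float f) {
        return (int) f;
    }

    // double -> long, as performed by a plain Java cast
    static long d2l(double d) {
        return (long) d;
    }

    public static void main(String[] args) {
        System.out.println(f2i(Float.NaN));                                   // 0
        System.out.println(f2i(Float.POSITIVE_INFINITY) == Integer.MAX_VALUE); // true
        System.out.println(f2i(Float.NEGATIVE_INFINITY) == Integer.MIN_VALUE); // true
        System.out.println(f2i(1e10f) == Integer.MAX_VALUE);                  // true (out of range saturates)
        System.out.println(f2i(-3.99f));                                      // -3 (rounds toward zero)
        System.out.println(d2l(Double.NaN));                                  // 0
        System.out.println(d2l(Double.NEGATIVE_INFINITY) == Long.MIN_VALUE);  // true
    }
}
```

Comparing these results between `-XX:-UseSuperWord` and `-XX:+UseSuperWord` runs is one simple way to sanity-check both the scalar and the vector path.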
> > Mohamed Issa has updated the pull request incrementally with two additional commits since the last revision: > > - Change the floating point conversion instruction, IR nodes, and test rules to make them clearer > - Change debug text format of AVX 10.2 vector conversion instructions src/hotspot/cpu/x86/x86.ad line 7669: > 7667: predicate(!VM_Version::supports_avx10_2() && > 7668: !VM_Version::supports_avx512vl() && > 7669: Matcher::vector_length_in_bytes(n->in(1)) < 64 && Good to add "is_integral_type(Matcher::vector_element_basic_type(n)) &&" here. test/hotspot/jtreg/compiler/vectorapi/VectorFPtoIntCastTest.java line 26: > 24: /** > 25: * @test > 26: * @bug 8287835 8320347 Did you mean 8364305 here? test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java line 364: > 362: applyIfCPUFeatureAnd = {"avx2", "true", "avx10_2", "false"}) > 363: @IR(counts = {IRNode.X86_VCAST_F2X_AVX10, "> 0"}, > 364: applyIfCPUFeature = {"avx10_2", "true"}) Need to add the following for X86_VCAST_F2X as well as X86_VCAST_F2X_AVX10. applyIfOr = {"AlignVector", "false", "UseCompactObjectHeaders", "false"}, test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java line 387: > 385: applyIfCPUFeatureAnd = {"avx2", "true", "avx10_2", "false"}) > 386: @IR(counts = {IRNode.X86_VCAST_F2X_AVX10, "> 0"}, > 387: applyIfCPUFeature = {"avx10_2", "true"}) Need to add the following for X86_VCAST_F2X as well as X86_VCAST_F2X_AVX10. 
applyIfOr = {"AlignVector", "false", "UseCompactObjectHeaders", "false"}, test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java line 413: > 411: applyIfCPUFeatureAnd = {"avx", "true", "avx10_2", "false"}) > 412: @IR(counts = {IRNode.X86_VCAST_D2X_AVX10, "> 0"}, > 413: applyIfCPUFeature = {"avx10_2", "true"}) Need to add the following for X86_VCAST_D2X and X86_VCAST_D2X_AVX10: applyIf = {"MaxVectorSize", ">=16"}, test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java line 432: > 430: applyIfCPUFeatureAnd = {"avx", "true", "avx10_2", "false"}) > 431: @IR(counts = {IRNode.X86_VCAST_D2X_AVX10, "> 0"}, > 432: applyIfCPUFeature = {"avx10_2", "true"}) Need to add the following for X86_VCAST_D2X and X86_VCAST_D2X_AVX10: applyIf = {"MaxVectorSize", ">=16"}, ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2342571300 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2342620816 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2342615073 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2342615727 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2342618205 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2342618455 From dlong at openjdk.org Fri Sep 12 01:04:21 2025 From: dlong at openjdk.org (Dean Long) Date: Fri, 12 Sep 2025 01:04:21 GMT Subject: RFR: 8327963: C2: fix construction of memory graph around Initialize node to prevent incorrect execution if allocation is removed [v12] In-Reply-To: References: <3jUFOPYDIqmzEywhzf58guwS0qZGBUCMZ3lXeltlS3c=.5c82601f-cf4d-4b2a-a525-1f8f4c7c4a3b@github.com> Message-ID: On Tue, 9 Sep 2025 11:27:50 GMT, Roland Westrelin wrote: >> An `Initialize` node for an `Allocate` node is created with a memory >> `Proj` of adr type raw memory. 
In order for stores to be captured, the >> memory state out of the allocation is a `MergeMem` with slices for the >> various object fields/array element set to the raw memory `Proj` of >> the `Initialize` node. If `Phi`s need to be created during later >> transformations from this memory state, the `Phi` for a particular >> slice gets its adr type from the type of the `Proj` which is raw >> memory. If during macro expansion, the `Allocate` is found to have no >> use and so can be removed, the `Proj` out of the `Initialize` is >> replaced by the memory state on input to the `Allocate`. A `Phi` for >> some slice for a field of an object will end up with the raw memory >> state on input to the `Allocate` node. As a result, memory state at >> the `Phi` is incorrect and incorrect execution can happen. >> >> The fix I propose is, rather than have a single `Proj` for the memory >> state out of the `Initialize` with adr type raw memory, to use one >> `Proj` per slice added to the memory state after the `Initialize`. Each >> of the `Proj` should return the right adr type for its slice. For that >> I propose having a new type of `Proj`: `NarrowMemProj` that captures >> the right adr type. >> >> Logic for the construction of the `Allocate`/`Initialize` subgraph is >> tweaked so the right adr type captured in its own `NarrowMemProj` is >> added to the memory subgraph. Code that removes an allocation or moves >> it also has to be changed so it correctly takes the multiple memory >> projections out of the `Initialize` node into account. >> >> One tricky issue is that when EA splits types for a scalar replaceable >> `Allocate` node: >> >> 1- the adr type captured in the `NarrowMemProj` becomes out of sync >> with the type of the slices for the allocation >> >> 2- before EA, the memory state for one particular field out of the >> `Initialize` node can be used for a `Store` to the just allocated >> object or some other.
So we can have a chain of `Store`s, some to >> the newly allocated object, some to some other objects, all of them >> using the state of `NarrowMemProj` out of the `Initialize`. After >> split unique types, the `NarrowMemProj` is for the slice of a >> particular allocation. So `Store`s to some other objects shouldn't >> use that memory state but the memory state before the `Allocate`. >> >> For that, I added logic to update the adr type of `NarrowMemProj` >> during split uni... > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 45 commits: > > - more > - Merge branch 'master' into JDK-8327963 > - more > - more > - Merge branch 'master' into JDK-8327963 > - more > - more > - lambda return > - lambda clean up > - Merge branch 'master' into JDK-8327963 > - ... and 35 more: https://git.openjdk.org/jdk/compare/e16c5100...b701d03e src/hotspot/share/opto/loopTransform.cpp line 3992: > 3990: Node* frame = new ParmNode(C->start(), TypeFunc::FramePtr); > 3991: _igvn.register_new_node_with_optimizer(frame); > 3992: call->init_req(TypeFunc::FramePtr, frame); This seems unrelated. Is it needed? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24570#discussion_r2342681526 From fyang at openjdk.org Fri Sep 12 01:49:16 2025 From: fyang at openjdk.org (Fei Yang) Date: Fri, 12 Sep 2025 01:49:16 GMT Subject: RFR: 8365926: RISC-V: Performance regression in renaissance (chi-square) [v6] In-Reply-To: References: Message-ID: On Thu, 11 Sep 2025 09:19:59 GMT, Robbin Ehn wrote: >> Hey, please consider! >> >> A bunch of info in JBS entry, please read that also. >> >> I narrowed this issue down to the old jal optimization, making direct calls when in reach. >> This patch restores them and removes this regression. >> >> In essence we turn "jalr ra,0(t1)" into a "jal ra," if reachable, and restore the jalr if a new destination is not reachable. >> >> Please test on your hardware! 
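The jal/jalr choice described above can be sketched roughly as follows, assuming the standard RISC-V `jal` encoding (a signed 21-bit byte offset that is a multiple of 2, so roughly ±1 MiB from the instruction). The class, the method names, and the addresses are hypothetical; this is not the HotSpot patching code.

```java
// Illustrative sketch only: names are hypothetical and the real patching
// logic lives in the RISC-V port of HotSpot.
final class JalReachSketch {
    // jal encodes a signed 21-bit byte offset whose lowest bit is always 0.
    static final long JAL_MIN_OFFSET = -(1L << 20);
    static final long JAL_MAX_OFFSET = (1L << 20) - 2;

    // True if a jal placed at callSitePc can branch directly to target.
    static boolean reachableByJal(long callSitePc, long target) {
        long offset = target - callSitePc;
        return (offset & 1L) == 0
            && offset >= JAL_MIN_OFFSET
            && offset <= JAL_MAX_OFFSET;
    }

    // Picks the direct form when the destination is in reach, otherwise
    // keeps the indirect form through the register holding the address.
    static String chooseCall(long callSitePc, long target) {
        return reachableByJal(callSitePc, target) ? "jal ra, <dest>" : "jalr ra, 0(t1)";
    }

    public static void main(String[] args) {
        long pc = 0x200000L;
        System.out.println(chooseCall(pc, pc + 0x1000L));      // in reach: direct call
        System.out.println(chooseCall(pc, pc + (1L << 20)));   // one step too far: indirect call
    }
}
```

When a call site is re-bound to a new destination, the same check would be repeated, which matches the "restore the jalr if a new destination is not reachable" part of the description.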
>> >> >> Chi Square (100 runs each, 10 fastest iterations of each run, P550) >> JDK-23 (last version with trampoline calls) >> Mean: 3189.5827 >> Standard Deviation: 284.6478 >> >> JDK-25 >> Mean: 3424.8905 >> Standard Deviation: 222.2208 >> >> Patch: >> Mean: 3144.8535 >> Standard Deviation: 229.2577 >> >> >> No issues found in t1, running t2 also. Stress tested on vf2, bpi-f3, p550. > > Robbin Ehn has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains nine additional commits since the last revision: > > - Review fix > - Merge branch 'master' into 8365926 > - Merge branch 'master' into 8365926 > - Review comments > - Review comments > - Merge branch 'master' into 8365926 > - Spelling > - Merge branch 'master' into 8365926 > - draft jal<->jalr Still good to me. ------------- Marked as reviewed by fyang (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26944#pullrequestreview-3214325209 From wenanjian at openjdk.org Fri Sep 12 03:15:27 2025 From: wenanjian at openjdk.org (Anjian Wen) Date: Fri, 12 Sep 2025 03:15:27 GMT Subject: RFR: 8365732: RISC-V: implement AES CTR intrinsics [v6] In-Reply-To: References: Message-ID: <8wqGgE5DEY1mQm5SP3g0Y_LEn8q9ptTtbjY5MEQOCHE=.10c6adb9-8dcd-4f10-bf5b-bd5d0be4f053@github.com> > Hi everyone, please help review this patch which Implement the _counterMode_AESCrypt with Zvkned. On my QEMU, with Zvkned extension enabled, the tests in test/hotspot/jtreg/compiler/codegen/aes/ Passed. 
Anjian Wen has updated the pull request incrementally with one additional commit since the last revision: fix the counter increase at limit and add test ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25281/files - new: https://git.openjdk.org/jdk/pull/25281/files/6bd22c4e..0769db02 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25281&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25281&range=04-05 Stats: 37 lines in 2 files changed: 29 ins; 1 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/25281.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25281/head:pull/25281 PR: https://git.openjdk.org/jdk/pull/25281 From wenanjian at openjdk.org Fri Sep 12 03:40:59 2025 From: wenanjian at openjdk.org (Anjian Wen) Date: Fri, 12 Sep 2025 03:40:59 GMT Subject: RFR: 8365732: RISC-V: implement AES CTR intrinsics [v7] In-Reply-To: References: Message-ID: > Hi everyone, please help review this patch which Implement the _counterMode_AESCrypt with Zvkned. On my QEMU, with Zvkned extension enabled, the tests in test/hotspot/jtreg/compiler/codegen/aes/ Passed. Anjian Wen has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains eight additional commits since the last revision: - Merge branch 'openjdk:master' into aes_ctr - fix the counter increase at limit and add test - change format - update reg use and instruction - change some name and format - delete useless Label, change L_judge_used to L_slow_loop - add Flags and fix the stubid name - RISC-V: implement AES-CTR mode intrinsics ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25281/files - new: https://git.openjdk.org/jdk/pull/25281/files/0769db02..ff513708 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25281&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25281&range=05-06 Stats: 82462 lines in 2415 files changed: 49550 ins; 22013 del; 10899 mod Patch: https://git.openjdk.org/jdk/pull/25281.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25281/head:pull/25281 PR: https://git.openjdk.org/jdk/pull/25281 From epeter at openjdk.org Fri Sep 12 05:52:12 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 12 Sep 2025 05:52:12 GMT Subject: RFR: 8367397: Improve naming and terminology in regmask.hpp and regmask.cpp [v2] In-Reply-To: References: Message-ID: On Fri, 12 Sep 2025 00:08:06 GMT, Dean Long wrote: >> Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: >> >> Lowercase _RM_INT and _RM_WORD > > src/hotspot/share/opto/regmask.hpp line 66: > >> 64: >> 65: static const unsigned int _WordBitMask = BitsPerWord - 1U; >> 66: static const unsigned int _LogWordBits = LogBitsPerWord; > > What about just replacing all uses of _LogWordBits with LogBitsPerWord? Yes, that would be a good step in the right direction.
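As background on why a "log bits per word" constant and a "word bit mask" constant appear together, here is a small Java analogue of indexing a bit mask stored in an array of words. Everything in it (class name, constant names, the choice of 64-bit words) is an assumption made for illustration and does not mirror the actual `RegMask` layout.

```java
// Illustrative sketch only: a hypothetical word-array bit mask, not RegMask.
final class WordMaskSketch {
    static final int LOG_BITS_PER_WORD = 6;                  // assumes 64-bit words
    static final int BITS_PER_WORD = 1 << LOG_BITS_PER_WORD; // 64
    static final int WORD_BIT_MASK = BITS_PER_WORD - 1;      // 63, selects the bit within a word

    private final long[] words;

    WordMaskSketch(int sizeInWords) {
        words = new long[sizeInWords];
    }

    // The shift picks the word, the mask picks the bit inside that word.
    void insert(int reg) {
        words[reg >>> LOG_BITS_PER_WORD] |= 1L << (reg & WORD_BIT_MASK);
    }

    boolean member(int reg) {
        return (words[reg >>> LOG_BITS_PER_WORD] & (1L << (reg & WORD_BIT_MASK))) != 0;
    }

    public static void main(String[] args) {
        WordMaskSketch mask = new WordMaskSketch(2);
        mask.insert(70);                     // word 1, bit 6
        System.out.println(mask.member(70)); // true
        System.out.println(mask.member(6));  // false: same bit position, but in word 0
    }
}
```

Since the shift amount is the same value under either name, using one shared constant directly removes an alias without changing behaviour.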
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27215#discussion_r2343046545 From rehn at openjdk.org Fri Sep 12 06:12:23 2025 From: rehn at openjdk.org (Robbin Ehn) Date: Fri, 12 Sep 2025 06:12:23 GMT Subject: RFR: 8365926: RISC-V: Performance regression in renaissance (chi-square) [v6] In-Reply-To: References: Message-ID: On Fri, 12 Sep 2025 01:46:57 GMT, Fei Yang wrote: >> Robbin Ehn has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains nine additional commits since the last revision: >> >> - Review fix >> - Merge branch 'master' into 8365926 >> - Merge branch 'master' into 8365926 >> - Review comments >> - Review comments >> - Merge branch 'master' into 8365926 >> - Spelling >> - Merge branch 'master' into 8365926 >> - draft jal<->jalr > > Still good to me. Thanks @RealFYang, @Hamlin-Li ! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26944#issuecomment-3283845949 From bmaillard at openjdk.org Fri Sep 12 07:25:06 2025 From: bmaillard at openjdk.org (Benoît Maillard) Date: Fri, 12 Sep 2025 07:25:06 GMT Subject: RFR: 8364757: Missing Store nodes caused by bad wiring in PhaseIdealLoop::insert_post_loop Message-ID: This PR introduces a fix for wrong results caused by missing `Store` nodes in C2 IR due to incorrect wiring in `PhaseIdealLoop::insert_post_loop`. ### Context The issue was initially found by the fuzzer. After some trial and error, and with the help of @chhagedorn I was able to reduce the reproducer to something very simple.
After being compiled by C2, the execution of the following method caused the last statement (`x = 0`) to be ignored: static public void test() { x = 0; for (int i = 0; i < 20000; i++) { x += i; } x = 0; } After some investigation and discussions with @robcasloz and @chhagedorn, it appeared that this issue is linked to how safepoints are inserted into long running loops, causing the loop to be transformed into a nested loop with an `OuterStripMinedLoop` node. `Store` nodes are moved out of the inner loop when encountering this pattern, and the associated `Phi` nodes are removed in order to avoid inhibiting loop optimizations taking place later. This was initially addressed in [JDK-8356708](https://bugs.openjdk.org/browse/JDK-8356708) by making the necessary corrections in macro expansion. As explained in the next section, this is not enough here as macro expansion happens too late. This PR aims at addressing the specific case of the wrong wiring of `Store` nodes in _post_ loops, but in the longer term further investigations into the missing `Phi` node issue are necessary, as they are likely to cause other issues (cf. related JBS issues). ### Detailed Analysis In `PhaseIdealLoop::create_outer_strip_mined_loop`, a simple `CountedLoop` is turned into a nested loop with an `OuterStripMinedLoop`. The body of the initial loop remains in the inner loop, but the safepoint is moved to the outer loop. Later, we attempt to move `Store` nodes after the inner loop in `PhaseIdealLoop::try_move_store_after_loop`. When the `Store` node is moved to the outer loop, we also get rid of its input `Phi` node in order not to confuse loop optimizations happening later. This only becomes a problem in `PhaseIdealLoop::insert_post_loop`, where we clone the body of the inner/outer loop for the iterations remaining after unrolling. There, we use `Phi` nodes to do the necessary rewiring between the original body and the cloned one.
Because we do not have `Phi` nodes for the moved `Store` nodes, their memory inputs may end up being incorrect. This is what the IR looks like after the creation of the post loop in our reproducer: [Image: IR graph after creation of the post loop] On the screenshot, node `118 StoreI` takes directly `24 StoreI` as memory input, even though it is obvious that `96 CountedLoopEnd` (to which `73 NodeI` is attached) is a predecessor of `114 CountedLoopEnd` in the CFG. After that, we observe a succession of IGVN optimizations that eventually lead to the generation of wrong code: - The `IfFalse` projection of `128 If` becomes dead, as the _post_ loop is always executed (number of iterations is known) - `121 Region` and `123 Phi` are subsequently eliminated (as a result of the dead path) - Because the `Phi` disappeared, `118 StoreI` becomes the memory input of `89 StoreI` - `118 StoreI` is eliminated because it is directly followed by a write at the same memory location - `89 StoreI` is replaced by `24 StoreI` by an `Identity` optimization because it stores the same value at the same location Node `89 StoreI` corresponds to the last `x = 0` assignment, and its elimination directly causes the wrong result (the store node from the `OuterStripMinedLoop` remains, as it is used by the safepoint). ### Proposed Fix As mentioned previously, the impact of the missing `Phi` nodes needs to be investigated further, as it is likely that this causes other bugs in the compilation process. This is a "local fix" for the specific issue of `Store` nodes moved out of the inner loop. The approach here is to do the wiring directly in `PhaseIdealLoop::insert_post_loop`, right after having done the usual rewiring based on the `Phi` nodes. As the conditions for moving `Store` nodes out of the loop are quite restrictive, the pattern is predictable: `Store` nodes are attached to the `false` projection of the inner `CountedLoopEnd`, right before the safepoint in the CFG.
In the simplest case, the memory input of the new version of the store node is outside of the loop body. In the cloned node, we change it to point to its original version instead (as the original store is always executed before). It may also be that the memory input of the new node points to another memory node in the loop body. This can happen in the case where we have: for (int i = 0; i < 20000; i++) { a1.field += i; a2.field += i; } Here, the second store has the first one as memory input, as `a1` and `a2` may be aliases. In this case, we only need to change the memory input of the first store in the chain, and it needs to point to the last memory node in the chain in the original version of the loop. ### Testing - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8364757) - [x] tier1-4, plus some internal testing Thank you for reviewing! ------------- Commit messages: - Fix bad test headers after remaining - Fix trailing whitespace - Add jtreg tests - Fix logic after failing TestStoresSunkInOuterStripMinedLoop - 8364757: First attempt at fixing the store node issue Changes: https://git.openjdk.org/jdk/pull/27225/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27225&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8364757 Stats: 141 lines in 3 files changed: 141 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/27225.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27225/head:pull/27225 PR: https://git.openjdk.org/jdk/pull/27225 From roland at openjdk.org Fri Sep 12 07:27:02 2025 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 12 Sep 2025 07:27:02 GMT Subject: RFR: 8361702: C2: assert(is_dominator(compute_early_ctrl(limit, limit_ctrl), pre_end)) failed: node pinned on loop exit test?
[v8] In-Reply-To: <1-3MDixhdwZEgDMpoAZckhK5_lFjygsKl4q1__tsCKs=.dffa9c0e-8ea1-4465-a1fc-6ad2dbcfe5db@github.com> References: <1-3MDixhdwZEgDMpoAZckhK5_lFjygsKl4q1__tsCKs=.dffa9c0e-8ea1-4465-a1fc-6ad2dbcfe5db@github.com> Message-ID: <4Yzeo6gJlk-Jq5zlh3P9HPCm57-7AwIqsywOWbawzcI=.13938c72-a9d4-463d-a54c-a08c70482a6b@github.com> > A node in a pre loop only has uses out of the loop dominated by the > loop exit. `PhaseIdealLoop::try_sink_out_of_loop()` sets its control > to the loop exit projection. A range check in the main loop has this > node as input (through a chain of some other nodes). Range check > elimination needs to update the exit condition of the pre loop with an > expression that depends on the node pinned on its exit: that's > impossible and the assert fires. This is a variant of 8314024 (this > one was for a node with uses out of the pre loop on multiple paths). I > propose the same fix: leave the node with control in the pre loop in > this case. Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: review ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26424/files - new: https://git.openjdk.org/jdk/pull/26424/files/91a7d73c..ec28714e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26424&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26424&range=06-07 Stats: 3 lines in 1 file changed: 2 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26424.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26424/head:pull/26424 PR: https://git.openjdk.org/jdk/pull/26424 From roland at openjdk.org Fri Sep 12 07:27:06 2025 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 12 Sep 2025 07:27:06 GMT Subject: RFR: 8361702: C2: assert(is_dominator(compute_early_ctrl(limit, limit_ctrl), pre_end)) failed: node pinned on loop exit test? 
[v7] In-Reply-To: References: <1-3MDixhdwZEgDMpoAZckhK5_lFjygsKl4q1__tsCKs=.dffa9c0e-8ea1-4465-a1fc-6ad2dbcfe5db@github.com> Message-ID: On Tue, 9 Sep 2025 11:56:56 GMT, Emanuel Peter wrote: >> Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 11 additional commits since the last revision: >> >> - Merge branch 'master' into JDK-8361702 >> - Merge branch 'master' into JDK-8361702 >> - review >> - Merge branch 'master' into JDK-8361702 >> - Update src/hotspot/share/opto/loopopts.cpp >> >> Co-authored-by: Christian Hagedorn >> - Update test/hotspot/jtreg/compiler/rangechecks/TestSunkRangeFromPreLoopRCE3.java >> >> Co-authored-by: Christian Hagedorn >> - Update test/hotspot/jtreg/compiler/rangechecks/TestSunkRangeFromPreLoopRCE2.java >> >> Co-authored-by: Christian Hagedorn >> - Update src/hotspot/share/opto/loopopts.cpp >> >> Co-authored-by: Christian Hagedorn >> - Update test/hotspot/jtreg/compiler/rangechecks/TestSunkRangeFromPreLoopRCE2.java >> >> Co-authored-by: Christian Hagedorn >> - tests >> - ... and 1 more: https://git.openjdk.org/jdk/compare/a6afe4cc...91a7d73c > > src/hotspot/share/opto/loopopts.cpp line 1936: > >> 1934: // Sinking a node from a pre loop to its main loop pins the node between the pre and main loops. If that node is input >> 1935: // to a check that's eliminated by range check elimination, it becomes input to an expression that feeds into the exit >> 1936: // test of the pre loop above the point in the graph where it's pinned. > > I guess the alternative would have been not to do that RC elimination, right? > If yes: you could finish the thought and say that we prefer to have a chance at RC elimination, rather than sinking the node out of the pre-loop. I updated the comment based on your suggestion. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26424#discussion_r2343254430 From roland at openjdk.org Fri Sep 12 07:30:29 2025 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 12 Sep 2025 07:30:29 GMT Subject: RFR: 8327963: C2: fix construction of memory graph around Initialize node to prevent incorrect execution if allocation is removed [v12] In-Reply-To: References: <3jUFOPYDIqmzEywhzf58guwS0qZGBUCMZ3lXeltlS3c=.5c82601f-cf4d-4b2a-a525-1f8f4c7c4a3b@github.com> Message-ID: On Fri, 12 Sep 2025 01:00:20 GMT, Dean Long wrote: >> Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 45 commits: >> >> - more >> - Merge branch 'master' into JDK-8327963 >> - more >> - more >> - Merge branch 'master' into JDK-8327963 >> - more >> - more >> - lambda return >> - lambda clean up >> - Merge branch 'master' into JDK-8327963 >> - ... and 35 more: https://git.openjdk.org/jdk/compare/e16c5100...b701d03e > > src/hotspot/share/opto/loopTransform.cpp line 3992: > >> 3990: Node* frame = new ParmNode(C->start(), TypeFunc::FramePtr); >> 3991: _igvn.register_new_node_with_optimizer(frame); >> 3992: call->init_req(TypeFunc::FramePtr, frame); > > This seems unrelated. Is it needed? It's one of the things mentioned in that comment: https://github.com/openjdk/jdk/pull/24570#issuecomment-2883651987 "I added asserts to catch cases where proj_out is called but the node has more than one matching projection. With those asserts, I caught some false positive/cases where we got lucky and worked around them by reworking the code so it doesn't use proj_out. That's the case in PhaseIdealLoop::intrinsify_fill(): we can end up there with more than one FramePtr projection because the code pattern used elsewhere is to add one more projection and let identical projections common during igvn."
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24570#discussion_r2343260134 From dlunden at openjdk.org Fri Sep 12 08:02:20 2025 From: dlunden at openjdk.org (Daniel Lundén) Date: Fri, 12 Sep 2025 08:02:20 GMT Subject: RFR: 8367397: Improve naming and terminology in regmask.hpp and regmask.cpp [v2] In-Reply-To: References: Message-ID: On Thu, 11 Sep 2025 23:47:39 GMT, Dean Long wrote: >> Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: >> >> Lowercase _RM_INT and _RM_WORD > > src/hotspot/share/opto/chaitin.cpp line 1580: > >> 1578: _ifg->re_insert(lidx); >> 1579: if( !lrg->alive() ) continue; >> 1580: // capture allstackedness flag before mask is hacked > > allstackedness --> infiniteness? Thanks, I did not think to `grep` for that one... > src/hotspot/share/opto/regmask.hpp line 166: > >> 164: // indefinitely with ONE bits. Returns TRUE if mask is infinite or >> 165: // unbounded in size. Returns FALSE if mask is finite size. >> 166: bool is_infinite() const { > > "infinite" hides the fact that these unbounded bits are stack bits and not register bits, but `is_UnboundedStack` or `is_InfiniteStack` might be too verbose. How does `is_InfStack` sound? I like the suggestion, but should we not make it `is_infinite_stack` (current convention according to the style guide)? Or do historic conventions in `regmask.hpp` take precedence? 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27215#discussion_r2343339147 PR Review Comment: https://git.openjdk.org/jdk/pull/27215#discussion_r2343345578 From dlunden at openjdk.org Fri Sep 12 08:02:22 2025 From: dlunden at openjdk.org (Daniel Lundén) Date: Fri, 12 Sep 2025 08:02:22 GMT Subject: RFR: 8367397: Improve naming and terminology in regmask.hpp and regmask.cpp [v2] In-Reply-To: References: Message-ID: On Fri, 12 Sep 2025 05:48:39 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/regmask.hpp line 66: >> >>> 64: >>> 65: static const unsigned int _WordBitMask = BitsPerWord - 1U; >>> 66: static const unsigned int _LogWordBits = LogBitsPerWord; >> >> What about just replacing all uses of _LogWordBits with LogBitsPerWord? > > Yes, that would be a good step in the right direction. Sure, sounds good ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27215#discussion_r2343336560 From rehn at openjdk.org Fri Sep 12 08:05:38 2025 From: rehn at openjdk.org (Robbin Ehn) Date: Fri, 12 Sep 2025 08:05:38 GMT Subject: Integrated: 8365926: RISC-V: Performance regression in renaissance (chi-square) In-Reply-To: References: Message-ID: On Tue, 26 Aug 2025 14:43:05 GMT, Robbin Ehn wrote: > Hey, please consider! > > A bunch of info in JBS entry, please read that also. > > I narrowed this issue down to the old jal optimization, making direct calls when in reach. > This patch restores them and removes this regression. > > In essence we turn "jalr ra,0(t1)" into a "jal ra," if reachable, and restore the jalr if a new destination is not reachable. > > Please test on your hardware! > > > Chi Square (100 runs each, 10 fastest iterations of each run, P550) > JDK-23 (last version with trampoline calls) > Mean: 3189.5827 > Standard Deviation: 284.6478 > > JDK-25 > Mean: 3424.8905 > Standard Deviation: 222.2208 > > Patch: > Mean: 3144.8535 > Standard Deviation: 229.2577 > > > No issues found in t1, running t2 also. 
Stress tested on vf2, bpi-f3, p550. This pull request has now been integrated. Changeset: 5c1865a4 Author: Robbin Ehn URL: https://git.openjdk.org/jdk/commit/5c1865a4fcd5da80ddcc506f4e41aada0fb93970 Stats: 86 lines in 3 files changed: 68 ins; 0 del; 18 mod 8365926: RISC-V: Performance regression in renaissance (chi-square) Reviewed-by: fyang, mli ------------- PR: https://git.openjdk.org/jdk/pull/26944 From wenanjian at openjdk.org Fri Sep 12 08:11:43 2025 From: wenanjian at openjdk.org (Anjian Wen) Date: Fri, 12 Sep 2025 08:11:43 GMT Subject: RFR: 8365732: RISC-V: implement AES CTR intrinsics [v7] In-Reply-To: References: Message-ID: On Fri, 25 Jul 2025 10:22:49 GMT, Anjian Wen wrote: >> src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 2745: >> >>> 2743: __ vsetivli(x0, 4, Assembler::e32, Assembler::m1); >>> 2744: __ vrev8_v(v31, v31, Assembler::VectorMask::v0_t); // convert big-endien to little-endian >>> 2745: __ vadd_vi(v31, v31, 1, Assembler::VectorMask::v0_t); >> >> Are you sure this is correct? See `com.sun.crypto.provider.CounterMode::increment`. > > Thanks for the review. I'm still developing it. > Regarding the growth of the counter array, it should use 8 bytes to store the count. I use 4 Byte here according to OpenSSL aes-ctr code, I will try to fix it later > https://github.com/openssl/openssl/blob/master/crypto/aes/asm/aes-riscv64-zvkb-zvkned.pl#L242 > Are you sure this is correct? See `com.sun.crypto.provider.CounterMode::increment`. Hi @theRealAph, following your advice and the code in `com.sun.crypto.provider.CounterMode::increment`, I have changed the counter increment in my patch to operate on two 8-byte halves. In most cases, incrementing the first 8-byte half (bytes 8 to 15) is enough; the next 8 bytes only need to be incremented when the first half overflows. I have also added a test for the limit case. Could you please review again? 
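The counter handling discussed in this message can be sketched in plain Java. This is an illustration under stated assumptions, not the intrinsic and not the JDK source: the class and method names are hypothetical, the counter is assumed to be a 16-byte big-endian block, and `incrementBytes` reflects my understanding of the byte-wise increment with carry that the reviewer points to. `incrementHalves` shows the two-step variant described above, where the low 8 bytes are incremented and the high 8 bytes are touched only when the low half wraps around.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Illustrative sketch only; names are hypothetical and not taken from the patch.
final class CtrCounterSketch {

    // Byte-wise big-endian increment with carry over the whole counter block.
    static void incrementBytes(byte[] counter) {
        int n = counter.length - 1;
        while (n >= 0 && ++counter[n] == 0) {
            n--; // this byte wrapped to 0, so carry into the next more significant byte
        }
    }

    // Same increment on two 64-bit halves of a 16-byte counter:
    // bump the low half, and bump the high half only when the low half wraps.
    static byte[] incrementHalves(byte[] counter) {
        ByteBuffer in = ByteBuffer.wrap(counter); // ByteBuffer is big-endian by default
        long hi = in.getLong();
        long lo = in.getLong();
        lo++;
        if (lo == 0) {
            hi++;
        }
        return ByteBuffer.allocate(16).putLong(hi).putLong(lo).array();
    }

    public static void main(String[] args) {
        // Limit case: the low 8 bytes are all 0xFF, so the carry must reach byte 7.
        byte[] counter = new byte[16];
        Arrays.fill(counter, 8, 16, (byte) 0xFF);
        byte[] viaHalves = incrementHalves(counter);
        incrementBytes(counter);
        System.out.println("variants agree: " + Arrays.equals(counter, viaHalves));
    }
}
```

The limit case in `main` is the one where a 4-byte or 8-byte-only increment would diverge from the byte-wise version, which is why a test for it is worth having.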
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25281#discussion_r2343365699 From epeter at openjdk.org Fri Sep 12 08:44:24 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 12 Sep 2025 08:44:24 GMT Subject: RFR: 8367483: C2 crash in PhaseValues::type: assert(t != nullptr) failed: must set before get - missing notification for CastX2P(SubL(x, y)) Message-ID: <_kMBdz-PsErEbxlHt7PDZTJmRqNEguaZS4GAgta9KtY=.2f177665-f392-4539-b490-b635a6afbe15@github.com> `CastX2PNode::Ideal` optimizes cases: CastX2P(AddX(x, y)) -> AddP(CastX2P(x), y) CastX2P(SubL(x, y)) -> AddP(CastX2P(x), SubL(0, y)) But the notification code `PhaseIterGVN::add_users_of_use_to_worklist` only adds `CastX2P` to the worklist for the `AddX` and not the `SubX` cases. --------------------------------------- A little brag: this is the second (unrelated, i.e. non aliasing) bug that `TestAliasingFuzzer.java` found. Fuzzing access to native MemorySegment seems to trigger new/rare patterns. ------------- Commit messages: - move test - Apply suggestions from code review - JDK-8367483 Changes: https://git.openjdk.org/jdk/pull/27249/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27249&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8367483 Stats: 63 lines in 2 files changed: 62 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/27249.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27249/head:pull/27249 PR: https://git.openjdk.org/jdk/pull/27249 From chagedorn at openjdk.org Fri Sep 12 08:44:26 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 12 Sep 2025 08:44:26 GMT Subject: RFR: 8367483: C2 crash in PhaseValues::type: assert(t != nullptr) failed: must set before get - missing notification for CastX2P(SubL(x, y)) In-Reply-To: <_kMBdz-PsErEbxlHt7PDZTJmRqNEguaZS4GAgta9KtY=.2f177665-f392-4539-b490-b635a6afbe15@github.com> References: <_kMBdz-PsErEbxlHt7PDZTJmRqNEguaZS4GAgta9KtY=.2f177665-f392-4539-b490-b635a6afbe15@github.com> Message-ID: On 
Fri, 12 Sep 2025 08:18:40 GMT, Emanuel Peter wrote: > `CastX2PNode::Ideal` optimizes cases: > > CastX2P(AddX(x, y)) -> AddP(CastX2P(x), y) > CastX2P(SubL(x, y)) -> AddP(CastX2P(x), SubL(0, y)) > > > But the notification code `PhaseIterGVN::add_users_of_use_to_worklist` only adds `CastX2P` to the worklist for the `AddX` and not the `SubX` cases. > > --------------------------------------- > > A little brag: this is the second (unrelated, i.e. non aliasing) bug that `TestAliasingFuzzer.java` found. Fuzzing access to native MemorySegment seems to trigger new/rare patterns. Otherwise, looks good! test/hotspot/jtreg/compiler/c2/gvn/MissedOptimizationWithCastX2PSubX.java line 1: > 1: /* There is a `compiler/igvn` test folder. I think this suits better than `c2/gvn`. test/hotspot/jtreg/compiler/c2/gvn/MissedOptimizationWithCastX2PSubX.java line 34: > 32: * -XX:-TieredCompilation > 33: * -XX:+IgnoreUnrecognizedVMOptions > 34: * -XX:+UnlockDiagnosticVMOptions These are not required: Suggestion: * -XX:CompileCommand=compileonly,compiler.c2.gvn.MissedOptimizationWithCastX2PSubX::test * -XX:-TieredCompilation * -XX:+IgnoreUnrecognizedVMOptions test/hotspot/jtreg/compiler/c2/gvn/MissedOptimizationWithCastX2PSubX.java line 37: > 35: * -XX:VerifyIterativeGVN=1110 > 36: * compiler.c2.gvn.MissedOptimizationWithCastX2PSubX > 37: * @run driver compiler.c2.gvn.MissedOptimizationWithCastX2PSubX Should be `main` to allow additional flags to be passed in. Suggestion: * @run main compiler.c2.gvn.MissedOptimizationWithCastX2PSubX ------------- Marked as reviewed by chagedorn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/27249#pullrequestreview-3215370623 PR Review Comment: https://git.openjdk.org/jdk/pull/27249#discussion_r2343421069 PR Review Comment: https://git.openjdk.org/jdk/pull/27249#discussion_r2343415661 PR Review Comment: https://git.openjdk.org/jdk/pull/27249#discussion_r2343416742 From bmaillard at openjdk.org Fri Sep 12 08:44:27 2025 From: bmaillard at openjdk.org (Benoît Maillard) Date: Fri, 12 Sep 2025 08:44:27 GMT Subject: RFR: 8367483: C2 crash in PhaseValues::type: assert(t != nullptr) failed: must set before get - missing notification for CastX2P(SubL(x, y)) In-Reply-To: <_kMBdz-PsErEbxlHt7PDZTJmRqNEguaZS4GAgta9KtY=.2f177665-f392-4539-b490-b635a6afbe15@github.com> References: <_kMBdz-PsErEbxlHt7PDZTJmRqNEguaZS4GAgta9KtY=.2f177665-f392-4539-b490-b635a6afbe15@github.com> Message-ID: <5-DC_hrp0sdE4QYHLP5ChTq2NlFyXm_xpB2NWiJUuuE=.47b98166-122e-43f9-87cb-cb6d992f84b6@github.com> On Fri, 12 Sep 2025 08:18:40 GMT, Emanuel Peter wrote: > `CastX2PNode::Ideal` optimizes cases: > > CastX2P(AddX(x, y)) -> AddP(CastX2P(x), y) > CastX2P(SubL(x, y)) -> AddP(CastX2P(x), SubL(0, y)) > > > But the notification code `PhaseIterGVN::add_users_of_use_to_worklist` only adds `CastX2P` to the worklist for the `AddX` and not the `SubX` cases. > > --------------------------------------- > > A little brag: this is the second (unrelated, i.e. non aliasing) bug that `TestAliasingFuzzer.java` found. Fuzzing access to native MemorySegment seems to trigger new/rare patterns. Looks good to me! ------------- Marked as reviewed by bmaillard (Author). 
PR Review: https://git.openjdk.org/jdk/pull/27249#pullrequestreview-3215415303 From epeter at openjdk.org Fri Sep 12 08:44:27 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 12 Sep 2025 08:44:27 GMT Subject: RFR: 8367483: C2 crash in PhaseValues::type: assert(t != nullptr) failed: must set before get - missing notification for CastX2P(SubL(x, y)) In-Reply-To: <5-DC_hrp0sdE4QYHLP5ChTq2NlFyXm_xpB2NWiJUuuE=.47b98166-122e-43f9-87cb-cb6d992f84b6@github.com> References: <_kMBdz-PsErEbxlHt7PDZTJmRqNEguaZS4GAgta9KtY=.2f177665-f392-4539-b490-b635a6afbe15@github.com> <5-DC_hrp0sdE4QYHLP5ChTq2NlFyXm_xpB2NWiJUuuE=.47b98166-122e-43f9-87cb-cb6d992f84b6@github.com> Message-ID: On Fri, 12 Sep 2025 08:37:55 GMT, Benoît Maillard wrote: >> `CastX2PNode::Ideal` optimizes cases: >> >> CastX2P(AddX(x, y)) -> AddP(CastX2P(x), y) >> CastX2P(SubL(x, y)) -> AddP(CastX2P(x), SubL(0, y)) >> >> >> But the notification code `PhaseIterGVN::add_users_of_use_to_worklist` only adds `CastX2P` to the worklist for the `AddX` and not the `SubX` cases. >> >> --------------------------------------- >> >> A little brag: this is the second (unrelated, i.e. non aliasing) bug that `TestAliasingFuzzer.java` found. Fuzzing access to native MemorySegment seems to trigger new/rare patterns. > > Looks good to me! @benoitmaillard Thanks for the review! @chhagedorn Thanks for the suggestions, can I have your re-approval? 
;) ------------- PR Comment: https://git.openjdk.org/jdk/pull/27249#issuecomment-3284329031 From bmaillard at openjdk.org Fri Sep 12 08:44:29 2025 From: bmaillard at openjdk.org (Benoît Maillard) Date: Fri, 12 Sep 2025 08:44:29 GMT Subject: RFR: 8367483: C2 crash in PhaseValues::type: assert(t != nullptr) failed: must set before get - missing notification for CastX2P(SubL(x, y)) In-Reply-To: References: <_kMBdz-PsErEbxlHt7PDZTJmRqNEguaZS4GAgta9KtY=.2f177665-f392-4539-b490-b635a6afbe15@github.com> Message-ID: On Fri, 12 Sep 2025 08:34:45 GMT, Emanuel Peter wrote: >> test/hotspot/jtreg/compiler/c2/gvn/MissedOptimizationWithCastX2PSubX.java line 1: >> >>> 1: /* >> >> There is a `compiler/igvn` test folder. I think this suits better than `c2/gvn`. > > Yes. We already have other missed optimization tests in `c2/gvn` though? > As always: quite a mess. This makes sense, but I remember that in the past similar tests (for example `test/hotspot/jtreg/compiler/c2/TestEliminateRedundantConversionSequences.java`) were simply put in `c2`. Not sure what is the policy here. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27249#discussion_r2343448150 From rehn at openjdk.org Fri Sep 12 08:45:39 2025 From: rehn at openjdk.org (Robbin Ehn) Date: Fri, 12 Sep 2025 08:45:39 GMT Subject: RFR: 8365926: RISC-V: Performance regression in renaissance (chi-square) [v5] In-Reply-To: References: <64z-PlrnxAISLzKBq-RZz7CXkQirGTvOgTGMJQl833o=.73ea3239-dfb6-4e32-b20f-8398334f2759@github.com> Message-ID: On Thu, 11 Sep 2025 09:01:23 GMT, Hamlin Li wrote:

>> Hamlin had some offline Q so I gather this data for him:
>>
>> Benchmark Results, doing 20 iterations and 20 runs of each benchmark for both options:
>> (using P550 where I saw the largest regression)
>>
>> Base: JDK24* +UseTrampoline
>> JAL OPT: JDK24* -UseTrampoline + JAL OPT
>>
>> Values are in ms, lower is better.
>>
>>
>> +-----------------+--------------+--------------+----------------+----------------+--------------+------------------+-------------+----------------+------------------+--------------------+
>> | Benchmark       | Mean (Base)  | SD (Base)    | Fastest (Base) | Mean (JAL OPT) | SD (JAL OPT) | Fastest (JAL OPT)| Diff Mean   | Diff Fastest   | Mean Diff Ratio  | Fastest Diff Ratio |
>> +-----------------+--------------+--------------+----------------+----------------+--------------+------------------+-------------+----------------+------------------+--------------------+
>> | future-genetic  | 8317.8449    | 925.0775     | 7824.59        | 8421.137       | 1870.3916    | 7955.19          | 103.2922    | 130.6          | 1.012418145      | 1.01669097         |
>> | akka-uct        | 54775.8037   | 5220.7361    | 49614.46       | 54149.9939     | 4730.3662    | 48736.7          | -625.8097   | -877.76        | 0.9885750686     | 0.9823083835       |
>> | movie-lens      | 44859.3268   | 107.8713     | 38160.64       | 43043.6965     | 7932.6525    | 36807.2          | -1815.6295  | -1353.44       | 0.9595261529     | 0.9645330896       |
>> | scala-doku      | 10792.4933   | 3004.9348    | 970.34         | 10739.0164     | 2692.6155    | 9226.94          | -53.4766    | 256.59         | 0.9950450188     | 1.028605382        |
>> | chi-square      | 4740.1812    | 3552.9489    | 2579.09        | 4749.0893      | 3484.3178    | 2498.04          | 8.9081      | -81.05         | 1.001879274      | 0.968574187        |
>> | fj-kmeans       | 18597.656    | 2481.4036    | 17994.43       | 18588.154      | 4458.6089    | 18019.15         | -9.5018     | 24.72          | 0.9994890862     | 1.001373758        |
>> | db-shootout     | 26529.8048   | 3163.9087    | 21270.43       | 25101.5681     | 2483.0698    | 21419.11         | -1428.2367  | 148.67         | 0.9461648244     | 1.006989986        |
>> | finagle-http    | 20646.1713   | 1635.9154    | 14898.97       | 20250.4966     | 1046.1738    | 14735.66         | -395.6747   | -163.31        | 0.9808354443     | 0.9890388396       |
>> | reactors        | 52051.8872   | 2023.7865    | 49188.65       | 51625.9...
>
>> Hamlin had some offline Q so I gather this data for him:
>
> Thanks Robbin for collecting the data!
>
>> So on average using auipc+ld+jalr + JAL opt is 1.73% faster than the old trampolines.
>
> This looks great!

@Hamlin-Li @RealFYang I broke release builds: src/hotspot/cpu/riscv/nativeInst_riscv.cpp is missing this include runtime/atomic.hpp https://bugs.openjdk.org/browse/JDK-8367498 If you can fix that for me (away for an hour) I would be very thankful! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26944#issuecomment-3284331961 From epeter at openjdk.org Fri Sep 12 08:44:29 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 12 Sep 2025 08:44:29 GMT Subject: RFR: 8367483: C2 crash in PhaseValues::type: assert(t != nullptr) failed: must set before get - missing notification for CastX2P(SubL(x, y)) In-Reply-To: References: <_kMBdz-PsErEbxlHt7PDZTJmRqNEguaZS4GAgta9KtY=.2f177665-f392-4539-b490-b635a6afbe15@github.com> Message-ID: On Fri, 12 Sep 2025 08:28:51 GMT, Christian Hagedorn wrote: >> `CastX2PNode::Ideal` optimizes cases: >> >> CastX2P(AddX(x, y)) -> AddP(CastX2P(x), y) >> CastX2P(SubL(x, y)) -> AddP(CastX2P(x), SubL(0, y)) >> >> >> But the notification code `PhaseIterGVN::add_users_of_use_to_worklist` only adds `CastX2P` to the worklist for the `AddX` and not the `SubX` cases. >> >> --------------------------------------- >> >> A little brag: this is the second (unrelated, i.e. non aliasing) bug that `TestAliasingFuzzer.java` found. Fuzzing access to native MemorySegment seems to trigger new/rare patterns. > > test/hotspot/jtreg/compiler/c2/gvn/MissedOptimizationWithCastX2PSubX.java line 1: > >> 1: /* > > There is a `compiler/igvn` test folder. I think this suits better than `c2/gvn`. Yes. We already have other missed optimization tests in `c2/gvn` though? As always: quite a mess. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27249#discussion_r2343441022 From epeter at openjdk.org Fri Sep 12 08:44:31 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 12 Sep 2025 08:44:31 GMT Subject: RFR: 8367483: C2 crash in PhaseValues::type: assert(t != nullptr) failed: must set before get - missing notification for CastX2P(SubL(x, y)) In-Reply-To: References: <_kMBdz-PsErEbxlHt7PDZTJmRqNEguaZS4GAgta9KtY=.2f177665-f392-4539-b490-b635a6afbe15@github.com> Message-ID: On Fri, 12 Sep 2025 08:37:12 GMT, Christian Hagedorn wrote: >> This makes sense, but I remember that in the past similar tests (for example `test/hotspot/jtreg/compiler/c2/TestEliminateRedundantConversionSequences.java`) were simply put in `c2`. Not sure what is the policy here. > > Yes, indeed. We should probably move them as well at some point. And we should probably stick more to the convention to name tests "TestXYZ" to distinguish between helper classes and actual tests. Yes. Maybe we just have to at some point move all tests around. Will hurt a bit for backports maybe. But it should be ok on the whole. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27249#discussion_r2343463923 From chagedorn at openjdk.org Fri Sep 12 08:44:30 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 12 Sep 2025 08:44:30 GMT Subject: RFR: 8367483: C2 crash in PhaseValues::type: assert(t != nullptr) failed: must set before get - missing notification for CastX2P(SubL(x, y)) In-Reply-To: References: <_kMBdz-PsErEbxlHt7PDZTJmRqNEguaZS4GAgta9KtY=.2f177665-f392-4539-b490-b635a6afbe15@github.com> Message-ID: On Fri, 12 Sep 2025 08:35:47 GMT, Benoît Maillard wrote: >> Yes. We already have other missed optimization tests in `c2/gvn` though? >> As always: quite a mess. > > This makes sense, but I remember that in the past similar tests (for example `test/hotspot/jtreg/compiler/c2/TestEliminateRedundantConversionSequences.java`) were simply put in `c2`. 
Not sure what is the policy here. Yes, indeed. We should probably move them as well at some point. And we should probably stick more to the convention to name tests "TestXYZ" to distinguish between helper classes and actual tests. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27249#discussion_r2343455094 From chagedorn at openjdk.org Fri Sep 12 08:47:09 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 12 Sep 2025 08:47:09 GMT Subject: RFR: 8367483: C2 crash in PhaseValues::type: assert(t != nullptr) failed: must set before get - missing notification for CastX2P(SubL(x, y)) In-Reply-To: <_kMBdz-PsErEbxlHt7PDZTJmRqNEguaZS4GAgta9KtY=.2f177665-f392-4539-b490-b635a6afbe15@github.com> References: <_kMBdz-PsErEbxlHt7PDZTJmRqNEguaZS4GAgta9KtY=.2f177665-f392-4539-b490-b635a6afbe15@github.com> Message-ID: <7LnPwg-I-_AG6uTrehfAWH_2eg94EzvX3aWdhtjpiBs=.663f707b-f6a4-4d83-b564-83fcc38d2744@github.com> On Fri, 12 Sep 2025 08:18:40 GMT, Emanuel Peter wrote: > `CastX2PNode::Ideal` optimizes cases: > > CastX2P(AddX(x, y)) -> AddP(CastX2P(x), y) > CastX2P(SubL(x, y)) -> AddP(CastX2P(x), SubL(0, y)) > > > But the notification code `PhaseIterGVN::add_users_of_use_to_worklist` only adds `CastX2P` to the worklist for the `AddX` and not the `SubX` cases. > > --------------------------------------- > > A little brag: this is the second (unrelated, i.e. non aliasing) bug that `TestAliasingFuzzer.java` found. Fuzzing access to native MemorySegment seems to trigger new/rare patterns. Looks good and trivial, thanks for the update ------------- Marked as reviewed by chagedorn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/27249#pullrequestreview-3215461040 From mli at openjdk.org Fri Sep 12 08:51:22 2025 From: mli at openjdk.org (Hamlin Li) Date: Fri, 12 Sep 2025 08:51:22 GMT Subject: RFR: 8365926: RISC-V: Performance regression in renaissance (chi-square) [v5] In-Reply-To: References: <64z-PlrnxAISLzKBq-RZz7CXkQirGTvOgTGMJQl833o=.73ea3239-dfb6-4e32-b20f-8398334f2759@github.com> Message-ID: On Thu, 11 Sep 2025 09:01:23 GMT, Hamlin Li wrote:

>> Hamlin had some offline Q so I gather this data for him:
>>
>> Benchmark Results, doing 20 iterations and 20 runs of each benchmark for both options:
>> (using P550 where I saw the largest regression)
>>
>> Base: JDK24* +UseTrampoline
>> JAL OPT: JDK24* -UseTrampoline + JAL OPT
>>
>> Values are in ms, lower is better.
>>
>>
>> +-----------------+--------------+--------------+----------------+----------------+--------------+------------------+-------------+----------------+------------------+--------------------+
>> | Benchmark       | Mean (Base)  | SD (Base)    | Fastest (Base) | Mean (JAL OPT) | SD (JAL OPT) | Fastest (JAL OPT)| Diff Mean   | Diff Fastest   | Mean Diff Ratio  | Fastest Diff Ratio |
>> +-----------------+--------------+--------------+----------------+----------------+--------------+------------------+-------------+----------------+------------------+--------------------+
>> | future-genetic  | 8317.8449    | 925.0775     | 7824.59        | 8421.137       | 1870.3916    | 7955.19          | 103.2922    | 130.6          | 1.012418145      | 1.01669097         |
>> | akka-uct        | 54775.8037   | 5220.7361    | 49614.46       | 54149.9939     | 4730.3662    | 48736.7          | -625.8097   | -877.76        | 0.9885750686     | 0.9823083835       |
>> | movie-lens      | 44859.3268   | 107.8713     | 38160.64       | 43043.6965     | 7932.6525    | 36807.2          | -1815.6295  | -1353.44       | 0.9595261529     | 0.9645330896       |
>> | scala-doku      | 10792.4933   | 3004.9348    | 970.34         | 10739.0164     | 2692.6155    | 9226.94          | -53.4766    | 256.59         | 0.9950450188     | 1.028605382        |
>> | chi-square      | 4740.1812    | 3552.9489    | 2579.09        | 4749.0893      | 3484.3178    | 2498.04          | 8.9081      | -81.05         | 1.001879274      | 0.968574187        |
>> | fj-kmeans       | 18597.656    | 2481.4036    | 17994.43       | 18588.154      | 4458.6089    | 18019.15         | -9.5018     | 24.72          | 0.9994890862     | 1.001373758        |
>> | db-shootout     | 26529.8048   | 3163.9087    | 21270.43       | 25101.5681     | 2483.0698    | 21419.11         | -1428.2367  | 148.67         | 0.9461648244     | 1.006989986        |
>> | finagle-http    | 20646.1713   | 1635.9154    | 14898.97       | 20250.4966     | 1046.1738    | 14735.66         | -395.6747   | -163.31        | 0.9808354443     | 0.9890388396       |
>> | reactors        | 52051.8872   | 2023.7865    | 49188.65       | 51625.9...
>
>> Hamlin had some offline Q so I gather this data for him:
>
> Thanks Robbin for collecting the data!
>
>> So on average using auipc+ld+jalr + JAL opt is 1.73% faster than the old trampolines.
>
> This looks great!

> @Hamlin-Li @RealFYang
>
> I broke release builds: src/hotspot/cpu/riscv/nativeInst_riscv.cpp is missing this include runtime/atomic.hpp https://bugs.openjdk.org/browse/JDK-8367498
>
> If you can fix that for me (away for an hour) I would be very thankful!

Sure, let me do it.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/26944#issuecomment-3284352576

From roland at openjdk.org Fri Sep 12 09:10:20 2025 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 12 Sep 2025 09:10:20 GMT Subject: RFR: 8366888: C2: incorrect assertion predicate with short running long counted loop Message-ID:

In:

    for (int i = 100; i < 1100; i++) {
        v += floatArray[i - 100];
        Objects.checkIndex(i, longRange);
    }

The int counted loop has both an int range check and a long range check. The int range check is optimized first. Assertion predicates are inserted above the loop. One predicate checks that: init - 100 References: Message-ID: On Fri, 12 Sep 2025 08:57:57 GMT, Roland Westrelin wrote:

> In:
>
>
> for (int i = 100; i < 1100; i++) {
> v += floatArray[i - 100];
> Objects.checkIndex(i, longRange);
> }
>
>
> The int counted loop has both an int range check and a long range check. The
> int range check is optimized first.
Assertion predicates are inserted
> above the loop. One predicate checks that:
>
>
> init - 100
>
> The loop is then transformed to enable the optimization of the long
> range check. The loop is short running, so there's no need to create a
> loop nest. The counted loop is mostly left as is, but the loop's
> bounds are changed from:
>
>
> for (int i = 100; i < 1100; i++) {
>
>
> to:
>
>
> for (int i = 0; i < 1000; i++) {
>
>
> The reason for that is that the long range check transformation expects the
> loop to start at 0.
>
> Pre/main/post loops are created. Template Assertion predicates are
> added above the main loop. The loop is unrolled. Initialized assertion
> predicates are created. The one created from the condition:
>
>
> init - 100
>
> checks the value of `i` out of the pre loop which is 1. That check fails.
>
> The root cause of the failure is that when the bounds of the counted loop
> are changed, template assertion predicates need to be updated with an
> adjusted init input.
>
> When the bounds of the loop are known, the assertion predicates can be
> updated in place. Otherwise, when the loop is speculated to be short
> running, the assertion predicates are updated when they are cloned.

Thanks @chhagedorn for the test case

-------------

PR Comment: https://git.openjdk.org/jdk/pull/27250#issuecomment-3284433245

From mli at openjdk.org Fri Sep 12 09:17:53 2025 From: mli at openjdk.org (Hamlin Li) Date: Fri, 12 Sep 2025 09:17:53 GMT Subject: RFR: 8367501: RISC-V: build broken after JDK-8365926 Message-ID: Hi, Can you help to review this patch? 
check https://github.com/openjdk/jdk/pull/26944, https://github.com/openjdk/jdk/pull/27135 Thanks ------------- Commit messages: - initial commit Changes: https://git.openjdk.org/jdk/pull/27251/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27251&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8367501 Stats: 3 lines in 1 file changed: 1 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/27251.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27251/head:pull/27251 PR: https://git.openjdk.org/jdk/pull/27251 From roland at openjdk.org Fri Sep 12 09:24:28 2025 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 12 Sep 2025 09:24:28 GMT Subject: RFR: 8364757: Missing Store nodes caused by bad wiring in PhaseIdealLoop::insert_post_loop In-Reply-To: References: Message-ID: <0sO2cPw0cvqc012qfyLQLLTukDO2q85ry3tGavZ3ZPM=.6d8c9b58-2eb5-46c0-ac53-d5041588d8ea@github.com> On Thu, 11 Sep 2025 13:05:21 GMT, Benoît Maillard wrote: > This PR introduces a fix for wrong results caused by missing `Store` nodes in C2 IR due to incorrect wiring in `PhaseIdealLoop::insert_post_loop`. > > ### Context > > The issue was initially found by the fuzzer. After some trial and error, and with the help of @chhagedorn I was able to reduce the reproducer to something very simple. After being compiled by C2, the execution of the following method led to the last statement (`x = 0`) being ignored: > > > static public void test() { > x = 0; > for (int i = 0; i < 20000; i++) { > x += i; > } > x = 0; > } > > > After some investigation and discussions with @robcasloz and @chhagedorn, it appeared that this issue is linked to how safepoints are inserted into long running loops, causing the loop to be transformed into a nested loop with an `OuterStripMinedLoop` node. `Store` nodes are moved out of the inner loop when encountering this pattern, and the associated `Phi` nodes are removed in order to avoid inhibiting loop optimizations taking place later. 
This was initially addressed in [JDK-8356708](https://bugs.openjdk.org/browse/JDK-8356708) by making the necessary corrections in macro expansion. As explained in the next section, this is not enough here as macro expansion happens too late. > > This PR aims at addressing the specific case of the wrong wiring of `Store` nodes in _post_ loops, but in the longer term further investigations into the missing `Phi` node issue are necessary, as they are likely to cause other issues (cf. related JBS issues). > > ### Detailed Analysis > > In `PhaseIdealLoop::create_outer_strip_mined_loop`, a simple `CountedLoop` is turned into a nested loop with an `OuterStripMinedLoop`. The body of the initial loop remains in the inner loop, but the safepoint is moved to the outer loop. Later, we attempt to move `Store` nodes after the inner loop in `PhaseIdealLoop::try_move_store_after_loop`. When the `Store` node is moved to the outer loop, we also get rid of its input `Phi` node in order not to confuse loop optimizations happening later. > > This only becomes a problem in `PhaseIdealLoop::insert_post_loop`, where we clone the body of the inner/outer loop for the iterations remaining after unrolling. There, we use `Phi` nodes to do the necessary rewiring between the original body and the cloned one. Because we do not have `Phi` nodes for the moved `Store` nodes, their memory inputs may end up being incorrect. > > This is what the IR looks like after the creation of the post lo... Not a review but a comment on the missing Phis. Your description makes it sound like if the `OuterStripMinedLoop` was created with `Phis` from the start, there would be no issue. That's not true AFAICT. The current logic for pre/main/post loops creation would simply not work because it doesn't expect the `Phis` and it would need to be extended so things are rewired correctly with the outer loop `Phis`. The inner loop would still have no `Phi` for the sunk store. 
So the existing logic, once fixed, would not find it either and you would need some new logic to find it maybe using the outer loop `Phis`. The current shape of the outer loop (without the Phis) is very simple and there's only one location where the Store can be (on the exit projection of the inner loop right above the safepoint which is right below the exit of the inner loop and can't be anywhere else). So you added logic to find the Store relying on the current shape of the outer loop. If the outer loop had `Phis`, some alternate version of that logic could be used. They seem like 2 ways of doing the same thing to me and nothing tells us one is better than the other. In short, I don't find this bug a good example of something that would work better if we had `Phi`s on the outer loop. I wouldn't say the root cause is that we don't have `Phi`s on the outer loop either. ------------- PR Comment: https://git.openjdk.org/jdk/pull/27225#issuecomment-3284472055 From rehn at openjdk.org Fri Sep 12 10:44:28 2025 From: rehn at openjdk.org (Robbin Ehn) Date: Fri, 12 Sep 2025 10:44:28 GMT Subject: RFR: 8367501: RISC-V: build broken after JDK-8365926 In-Reply-To: References: Message-ID: <1DzpLBgwYHLES28Ke04iAutvg8CFvfHrWI4hWV5YQng=.13dabf77-37ce-467c-a2cd-e91fe3517ebc@github.com> On Fri, 12 Sep 2025 09:09:43 GMT, Hamlin Li wrote: > Hi, > Can you help to review this patch? > > check https://github.com/openjdk/jdk/pull/26944, https://github.com/openjdk/jdk/pull/27135 > > Thanks Haha I did the same thing as @jdksjolen :) I should I have merge before. Locally test, thank you @Hamlin-Li! ------------- Marked as reviewed by rehn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/27251#pullrequestreview-3216035215 From mli at openjdk.org Fri Sep 12 10:44:28 2025 From: mli at openjdk.org (Hamlin Li) Date: Fri, 12 Sep 2025 10:44:28 GMT Subject: RFR: 8367501: RISC-V: build broken after JDK-8365926 In-Reply-To: <1DzpLBgwYHLES28Ke04iAutvg8CFvfHrWI4hWV5YQng=.13dabf77-37ce-467c-a2cd-e91fe3517ebc@github.com> References: <1DzpLBgwYHLES28Ke04iAutvg8CFvfHrWI4hWV5YQng=.13dabf77-37ce-467c-a2cd-e91fe3517ebc@github.com> Message-ID: On Fri, 12 Sep 2025 10:37:23 GMT, Robbin Ehn wrote: > Haha I did the same thing as @jdksjolen :) I should I have merge before. > > Locally test, thank you @Hamlin-Li! Trigger runtime tests, no failure found yet. ------------- PR Comment: https://git.openjdk.org/jdk/pull/27251#issuecomment-3284761602 From mli at openjdk.org Fri Sep 12 10:44:29 2025 From: mli at openjdk.org (Hamlin Li) Date: Fri, 12 Sep 2025 10:44:29 GMT Subject: Integrated: 8367501: RISC-V: build broken after JDK-8365926 In-Reply-To: References: Message-ID: On Fri, 12 Sep 2025 09:09:43 GMT, Hamlin Li wrote: > Hi, > Can you help to review this patch? > > check https://github.com/openjdk/jdk/pull/26944, https://github.com/openjdk/jdk/pull/27135 > > Thanks This pull request has now been integrated. Changeset: d13769d6 Author: Hamlin Li URL: https://git.openjdk.org/jdk/commit/d13769d6c12688edffb23965c23cac614a9e6926 Stats: 3 lines in 1 file changed: 1 ins; 0 del; 2 mod 8367501: RISC-V: build broken after JDK-8365926 Reviewed-by: rehn ------------- PR: https://git.openjdk.org/jdk/pull/27251 From dlunden at openjdk.org Fri Sep 12 11:31:00 2025 From: dlunden at openjdk.org (Daniel Lundén) Date: Fri, 12 Sep 2025 11:31:00 GMT Subject: RFR: 8367397: Improve naming and terminology in regmask.hpp and regmask.cpp [v3] In-Reply-To: References: Message-ID: > Some names in `regmask.hpp` and `regmask.cpp` are unclear and should be improved. 
> > ### Changeset > > - Rename `RM_SIZE` to `RM_SIZE_IN_INTS` and `_RM_I` to `_RM_INT` to make it clear that these refer to integer-sized (32-bit) array elements. > - Rename `_RM_SIZE` to `_RM_SIZE_IN_WORDS` and `_RM_UP` to `_RM_WORD` to make it clear that these refer to machine-word-sized (32 or 64 bits depending on platform) array elements. > - Rename `_RM_MAX` to `_RM_WORD_MAX_INDEX` for clarity. > - Rename `is_AllStack` to `is_infinite` (and related resulting changes in comments and local variables). The old terminology "all-stack", referring to the infinite register mask bits, is misleading (as pointed out by @eme64 in https://github.com/openjdk/jdk/pull/20404#discussion_r2316234008). The reason is that the infinite bits do not represent *all* stack bits. Some stack bits are instead part of the non-infinite bits of the register mask. > > ### Testing > > - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/17638365968) > - `tier1` and HotSpot parts of `tier2` and `tier3` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. 
Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: Fix remaining references to all-stack ------------- Changes: - all: https://git.openjdk.org/jdk/pull/27215/files - new: https://git.openjdk.org/jdk/pull/27215/files/61ff4f8c..82b85367 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=27215&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27215&range=01-02 Stats: 7 lines in 2 files changed: 0 ins; 0 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/27215.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27215/head:pull/27215 PR: https://git.openjdk.org/jdk/pull/27215 From dlunden at openjdk.org Fri Sep 12 11:36:27 2025 From: dlunden at openjdk.org (Daniel Lundén) Date: Fri, 12 Sep 2025 11:36:27 GMT Subject: RFR: 8367397: Improve naming and terminology in regmask.hpp and regmask.cpp [v4] In-Reply-To: References: Message-ID: > Some names in `regmask.hpp` and `regmask.cpp` are unclear and should be improved. > > ### Changeset > > - Rename `RM_SIZE` to `RM_SIZE_IN_INTS` and `_RM_I` to `_RM_INT` to make it clear that these refer to integer-sized (32-bit) array elements. > - Rename `_RM_SIZE` to `_RM_SIZE_IN_WORDS` and `_RM_UP` to `_RM_WORD` to make it clear that these refer to machine-word-sized (32 or 64 bits depending on platform) array elements. > - Rename `_RM_MAX` to `_RM_WORD_MAX_INDEX` for clarity. > - Rename `is_AllStack` to `is_infinite` (and related resulting changes in comments and local variables). The old terminology "all-stack", referring to the infinite register mask bits, is misleading (as pointed out by @eme64 in https://github.com/openjdk/jdk/pull/20404#discussion_r2316234008). The reason is that the infinite bits do not represent *all* stack bits. Some stack bits are instead part of the non-infinite bits of the register mask. 
> > ### Testing > > - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/17638365968) > - `tier1` and HotSpot parts of `tier2` and `tier3` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: Remove _LogWordBits ------------- Changes: - all: https://git.openjdk.org/jdk/pull/27215/files - new: https://git.openjdk.org/jdk/pull/27215/files/82b85367..47773ee9 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=27215&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27215&range=02-03 Stats: 9 lines in 3 files changed: 0 ins; 1 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/27215.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27215/head:pull/27215 PR: https://git.openjdk.org/jdk/pull/27215 From epeter at openjdk.org Fri Sep 12 11:39:11 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 12 Sep 2025 11:39:11 GMT Subject: RFR: 8367483: C2 crash in PhaseValues::type: assert(t != nullptr) failed: must set before get - missing notification for CastX2P(SubL(x, y)) In-Reply-To: <7LnPwg-I-_AG6uTrehfAWH_2eg94EzvX3aWdhtjpiBs=.663f707b-f6a4-4d83-b564-83fcc38d2744@github.com> References: <_kMBdz-PsErEbxlHt7PDZTJmRqNEguaZS4GAgta9KtY=.2f177665-f392-4539-b490-b635a6afbe15@github.com> <7LnPwg-I-_AG6uTrehfAWH_2eg94EzvX3aWdhtjpiBs=.663f707b-f6a4-4d83-b564-83fcc38d2744@github.com> Message-ID: On Fri, 12 Sep 2025 08:45:07 GMT, Christian Hagedorn wrote: >> `CastX2PNode::Ideal` optimizes cases: >> >> CastX2P(AddX(x, y)) -> AddP(CastX2P(x), y) >> CastX2P(SubL(x, y)) -> AddP(CastX2P(x), SubL(0, y)) >> >> >> But the notification code `PhaseIterGVN::add_users_of_use_to_worklist` only adds `CastX2P` to the worklist for the `AddX` and not the `SubX` cases. >> >> --------------------------------------- >> >> A little brag: this is the second (unrelated, i.e. 
non aliasing) bug that `TestAliasingFuzzer.java` found. Fuzzing access to native MemorySegment seems to trigger new/rare patterns. > > Looks good and trivial, thanks for the update @chhagedorn @benoitmaillard Thanks for the reviews! I agree, the patch is quite trivial. I'm risking a Friday afternoon integration, to make sure the CI does not fail on our stress job. ------------- PR Comment: https://git.openjdk.org/jdk/pull/27249#issuecomment-3284931070 From epeter at openjdk.org Fri Sep 12 12:09:32 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 12 Sep 2025 12:09:32 GMT Subject: RFR: 8366940: Test compiler/loopopts/superword/TestAliasingFuzzer.java timed out Message-ID: <-iszfG2luNYZtsMxsMnsDWoQIscvcY37XSpi8fDDcEE=.2d26cf70-ba2e-4967-a083-50867f291784@github.com> `TestAliasingFuzzer.java` generates 30 subtests for every run. They are randomized. Some vectorize and execute faster, some fail to vectorize and execute slower. Hence, some natural variance in the duration is expected. On most machines, it seems the variance in "Running Tests" is about 30-50sec (total test time about 35-70sec). But on some machines (macosx-x64-debug), the execution time is a bit slower: 60-100 in "Running Tests", with some outliers at 110+sec. These occasionally trip the 120sec timeout, and when they trip it, they somehow cause the harness to take an excessive 9+min to shut everything down. Solutions: - Option 1: generate fewer tests in `TestAliasingFuzzer.java`. Would be sad, the test has now found 2 real bugs within 2 weeks. - Option 2: increase test timeout. That is what I'll do. Because the "outliers" that caused the timeouts were not far from all other cases on the same platform, and so they are acceptable. 
------------- Commit messages: - JDK-8366940 Changes: https://git.openjdk.org/jdk/pull/27257/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27257&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8366940 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/27257.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27257/head:pull/27257 PR: https://git.openjdk.org/jdk/pull/27257 From epeter at openjdk.org Fri Sep 12 12:09:32 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 12 Sep 2025 12:09:32 GMT Subject: Integrated: 8367483: C2 crash in PhaseValues::type: assert(t != nullptr) failed: must set before get - missing notification for CastX2P(SubL(x, y)) In-Reply-To: <_kMBdz-PsErEbxlHt7PDZTJmRqNEguaZS4GAgta9KtY=.2f177665-f392-4539-b490-b635a6afbe15@github.com> References: <_kMBdz-PsErEbxlHt7PDZTJmRqNEguaZS4GAgta9KtY=.2f177665-f392-4539-b490-b635a6afbe15@github.com> Message-ID: On Fri, 12 Sep 2025 08:18:40 GMT, Emanuel Peter wrote: > `CastX2PNode::Ideal` optimizes cases: > > CastX2P(AddX(x, y)) -> AddP(CastX2P(x), y) > CastX2P(SubL(x, y)) -> AddP(CastX2P(x), SubL(0, y)) > > > But the notification code `PhaseIterGVN::add_users_of_use_to_worklist` only adds `CastX2P` to the worklist for the `AddX` and not the `SubX` cases. > > --------------------------------------- > > A little brag: this is the second (unrelated, i.e. non aliasing) bug that `TestAliasingFuzzer.java` found. Fuzzing access to native MemorySegment seems to trigger new/rare patterns. This pull request has now been integrated. 
Changeset: 02d7281b Author: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/02d7281b93296e7700e215804cb9e2f8341cab06 Stats: 63 lines in 2 files changed: 62 ins; 0 del; 1 mod 8367483: C2 crash in PhaseValues::type: assert(t != nullptr) failed: must set before get - missing notification for CastX2P(SubL(x, y)) Reviewed-by: chagedorn, bmaillard ------------- PR: https://git.openjdk.org/jdk/pull/27249 From epeter at openjdk.org Fri Sep 12 12:16:01 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 12 Sep 2025 12:16:01 GMT Subject: RFR: 8356813: Improve Mod(I|L)Node::Value In-Reply-To: References: <2Jf_gfvRlKcmCFoQHp5T0WW_fU_yK5-0Z3z41f00-YU=.164be9f0-fae1-44bb-84c3-846d8c2c0db2@github.com> Message-ID: On Thu, 11 Sep 2025 17:39:45 GMT, Hannes Greule wrote: >>> Using `Type(Int|Long)::ZERO` instead (zero is always in the resulting value if we cover a range). >> >> Can we return `Type::TOP` instead? >> >> Besides, #17508 should be merged right after JDK-25 folk, do you want to wait for it first? > > @merykitty thanks, I hopefully addressed your comments :) > > @eme64 do you want to re-run the tests once again? @SirYwell Launching tests ? ------------- PR Comment: https://git.openjdk.org/jdk/pull/25254#issuecomment-3285049047 From epeter at openjdk.org Fri Sep 12 12:19:33 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 12 Sep 2025 12:19:33 GMT Subject: RFR: 8354348: Enable Extended EVEX to REX2/REX demotion for commutative operations with same dst and src2 [v5] In-Reply-To: References: Message-ID: On Thu, 11 Sep 2025 16:25:32 GMT, Srinivas Vamsi Parasa wrote: >> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: >> >> undo new match rules for RegMemReg for commutative operations > > Hi Emanuel (@eme64), > > Could you please run the tests for this PR? > > Thanks, > Vamsi @vamsi-parasa Quickly scanned the patch, looks reasonable. Launching tests ? 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26997#issuecomment-3285064018 From epeter at openjdk.org Fri Sep 12 12:27:11 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 12 Sep 2025 12:27:11 GMT Subject: RFR: 8367397: Improve naming and terminology in regmask.hpp and regmask.cpp [v4] In-Reply-To: References: <2unG-RdDR2e1mI-veaR3AdDGGs1q4XFdITrnQtBGOw8=.47f7d565-b722-434b-96ef-b51ed733b241@github.com> <1xDonJ67G3hUAWTdngutIb7LBboWxHRviCHXKDCSoN4=.2617f8e9-206b-424d-a1ab-501b182717bb@github.com> Message-ID: <1LfsfA8tLxKr7hmkLM8-ZR49IEblYMEjTkcUPC0P5cs=.e73b68b8-04ec-4714-a56e-f3af91dce5bc@github.com> On Thu, 11 Sep 2025 23:44:45 GMT, Dean Long wrote: >> Thanks, I agree that it seems more consistent to use `_rm_int` and `_rm_word` instead. The missing leading underscore for `RM_SIZE_IN_INTS` highlights that it is a macro, unlike `_RM_SIZE_IN_WORDS`. Maybe this is just for historical reasons and not up to date with today's conventions? >> >> Do we classify constant static fields such as `_RM_SIZE_IN_WORDS` as constants or fields? I.e., do we use upper or lower case? I guess it would be `_rm_size_in_words` if considered a field and `RM_SIZE_IN_WORDS` (without the leading underscore) if considered a constant. > > I vote for `RM_SIZE_IN_WORDS` because it is a constant, the same as if it was a value from an enum. Same as @dean-long : constants and enum values are generally `RM_SIZE_IN_WORDS`. Sometimes we also do `CamelCase`. Like for `LogBitsPerWord`. Underscore `_` is really only for fields. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27215#discussion_r2344084682 From dlunden at openjdk.org Fri Sep 12 12:33:42 2025 From: dlunden at openjdk.org (Daniel Lundén) Date: Fri, 12 Sep 2025 12:33:42 GMT Subject: RFR: 8367397: Improve naming and terminology in regmask.hpp and regmask.cpp [v5] In-Reply-To: References: Message-ID: > Some names in `regmask.hpp` and `regmask.cpp` are unclear and should be improved. 
> > ### Changeset > > - Rename `RM_SIZE` to `RM_SIZE_IN_INTS` and `_RM_I` to `_RM_INT` to make it clear that these refer to integer-sized (32-bit) array elements. > - Rename `_RM_SIZE` to `_RM_SIZE_IN_WORDS` and `_RM_UP` to `_RM_WORD` to make it clear that these refer to machine-word-sized (32 or 64 bits depending on platform) array elements. > - Rename `_RM_MAX` to `_RM_WORD_MAX_INDEX` for clarity. > - Rename `is_AllStack` to `is_infinite` (and related resulting changes in comments and local variables). The old terminology "all-stack", referring to the infinite register mask bits, is misleading (as pointed out by @eme64 in https://github.com/openjdk/jdk/pull/20404#discussion_r2316234008). The reason is that the infinite bits do not represent *all* stack bits. Some stack bits are instead part of the non-infinite bits of the register mask. > > ### Testing > > - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/17638365968) > - `tier1` and HotSpot parts of `tier2` and `tier3` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. 
Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: Rename constants ------------- Changes: - all: https://git.openjdk.org/jdk/pull/27215/files - new: https://git.openjdk.org/jdk/pull/27215/files/47773ee9..31c78597 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=27215&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27215&range=03-04 Stats: 21 lines in 2 files changed: 0 ins; 0 del; 21 mod Patch: https://git.openjdk.org/jdk/pull/27215.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27215/head:pull/27215 PR: https://git.openjdk.org/jdk/pull/27215 From epeter at openjdk.org Fri Sep 12 12:33:44 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 12 Sep 2025 12:33:44 GMT Subject: RFR: 8367397: Improve naming and terminology in regmask.hpp and regmask.cpp [v4] In-Reply-To: References: Message-ID: <19P8X88PcVoh8x62iBz5baOyefoWgGp-aFd_Bli-vm0=.fbd446d3-0dad-41e8-bb13-ab113a3b9767@github.com> On Fri, 12 Sep 2025 11:36:27 GMT, Daniel Lundén wrote: >> Some names in `regmask.hpp` and `regmask.cpp` are unclear and should be improved. >> >> ### Changeset >> >> - Rename `RM_SIZE` to `RM_SIZE_IN_INTS` and `_RM_I` to `_RM_INT` to make it clear that these refer to integer-sized (32-bit) array elements. >> - Rename `_RM_SIZE` to `_RM_SIZE_IN_WORDS` and `_RM_UP` to `_RM_WORD` to make it clear that these refer to machine-word-sized (32 or 64 bits depending on platform) array elements. >> - Rename `_RM_MAX` to `_RM_WORD_MAX_INDEX` for clarity. >> - Rename `is_AllStack` to `is_infinite` (and related resulting changes in comments and local variables). The old terminology "all-stack", referring to the infinite register mask bits, is misleading (as pointed out by @eme64 in https://github.com/openjdk/jdk/pull/20404#discussion_r2316234008). The reason is that the infinite bits do not represent *all* stack bits. Some stack bits are instead part of the non-infinite bits of the register mask. 
>> >> ### Testing >> >> - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/17638365968) >> - `tier1` and HotSpot parts of `tier2` and `tier3` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. > > Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: > > Remove _LogWordBits Changes requested by epeter (Reviewer). src/hotspot/share/opto/regmask.hpp line 65: > 63: LP64_ONLY(STATIC_ASSERT(is_aligned(RM_SIZE_IN_INTS, 2))); > 64: > 65: static const unsigned int _WordBitMask = BitsPerWord - 1U; You could also remove the `_` here. I suppose we keep `CamelCase` here to keep it parallel to `BitsPerWord`. What do you think @dean-long ? src/hotspot/share/opto/regmask.hpp line 68: > 66: static const unsigned int _RM_SIZE_IN_WORDS = > 67: LP64_ONLY(RM_SIZE_IN_INTS >> 1) NOT_LP64(RM_SIZE_IN_INTS); > 68: static const unsigned int _RM_WORD_MAX_INDEX = _RM_SIZE_IN_WORDS - 1U; I would get rid of the `_` here. Constants should preferably be `UPPER_CASE` (most of the time), and occasionally `CamelCase` where we are already doing it (only do it if needed for consistency). Underscore is only used for fields, as far as I know. ------------- PR Review: https://git.openjdk.org/jdk/pull/27215#pullrequestreview-3216439510 PR Review Comment: https://git.openjdk.org/jdk/pull/27215#discussion_r2344087574 PR Review Comment: https://git.openjdk.org/jdk/pull/27215#discussion_r2344091072 From epeter at openjdk.org Fri Sep 12 12:33:46 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 12 Sep 2025 12:33:46 GMT Subject: RFR: 8367397: Improve naming and terminology in regmask.hpp and regmask.cpp [v2] In-Reply-To: References: Message-ID: On Fri, 12 Sep 2025 07:59:44 GMT, Daniel Lundén wrote: >> src/hotspot/share/opto/regmask.hpp line 166: >> >>> 164: // indefinitely with ONE bits. Returns TRUE if mask is infinite or >>> 165: // unbounded in size. 
Returns FALSE if mask is finite size. >>> 166: bool is_infinite() const { >> >> "infinite" hides the fact that these unbounded bits are stack bits and not register bits, but `is_UnboundedStack` or `is_InfiniteStack` might be too verbose. How does `is_InfStack` sound? > > I like the suggestion, but should we not make it `is_infinite_stack` (current convention according to the style guide)? Or do historic conventions in `regmask.hpp` take precedence? I would prefer `is_infinite_stack`. `is_InfiniteStack` and `is_InfStack` only make sense if `InfiniteStack` is a class / name that is used widely in CamelCase. Like for `CountedLoopNode` -> `is_CountedLoop`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27215#discussion_r2344098469 From dlunden at openjdk.org Fri Sep 12 12:33:45 2025 From: dlunden at openjdk.org (Daniel Lundén) Date: Fri, 12 Sep 2025 12:33:45 GMT Subject: RFR: 8367397: Improve naming and terminology in regmask.hpp and regmask.cpp [v5] In-Reply-To: <1LfsfA8tLxKr7hmkLM8-ZR49IEblYMEjTkcUPC0P5cs=.e73b68b8-04ec-4714-a56e-f3af91dce5bc@github.com> References: <2unG-RdDR2e1mI-veaR3AdDGGs1q4XFdITrnQtBGOw8=.47f7d565-b722-434b-96ef-b51ed733b241@github.com> <1xDonJ67G3hUAWTdngutIb7LBboWxHRviCHXKDCSoN4=.2617f8e9-206b-424d-a1ab-501b182717bb@github.com> <1LfsfA8tLxKr7hmkLM8-ZR49IEblYMEjTkcUPC0P5cs=.e73b68b8-04ec-4714-a56e-f3af91dce5bc@github.com> Message-ID: On Fri, 12 Sep 2025 12:23:29 GMT, Emanuel Peter wrote: >> I vote for `RM_SIZE_IN_WORDS` because it is a constant, the same as if it was a value from an enum. > > Same as @dean-long : constants and enum values are generally `RM_SIZE_IN_WORDS`. Sometimes we also do `CamelCase`. Like for `LogBitsPerWord`. Underscore `_` is really only for fields. 
Thanks, now updated ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27215#discussion_r2344100928 From syan at openjdk.org Fri Sep 12 12:34:10 2025 From: syan at openjdk.org (SendaoYan) Date: Fri, 12 Sep 2025 12:34:10 GMT Subject: RFR: 8366940: Test compiler/loopopts/superword/TestAliasingFuzzer.java timed out In-Reply-To: <-iszfG2luNYZtsMxsMnsDWoQIscvcY37XSpi8fDDcEE=.2d26cf70-ba2e-4967-a083-50867f291784@github.com> References: <-iszfG2luNYZtsMxsMnsDWoQIscvcY37XSpi8fDDcEE=.2d26cf70-ba2e-4967-a083-50867f291784@github.com> Message-ID: On Fri, 12 Sep 2025 12:01:25 GMT, Emanuel Peter wrote: > `TestAliasingFuzzer.java` generates 30 subtests for every run. They are randomized. Some vectorize and execute faster, some fail to vectorize and execute slower. > > Hence, some natural variance in the duration is expected. > On most machines, it seems the variance in "Running Tests" is about 30-50sec (total test time about 35-70sec). But on some machines (macosx-x64-debug), the execution time is a bit slower: 60-100 in "Running Tests", with some outliers at 110+sec. These occasionally trip the 120sec timeout, and when they trip it, they somehow cause the harness to take an excessive 9+min to shut everything down. > > Solutions: > - Option 1: generate fewer tests in `TestAliasingFuzzer.java`. Would be sad, the test has now found 2 real bugs within 2 weeks. > - Option 2: increase test timeout. That is what I'll do. Because the "outliers" that caused the timeouts were not far from all other cases on the same platform, and so they are acceptable. Marked as reviewed by syan (Committer). 
------------- PR Review: https://git.openjdk.org/jdk/pull/27257#pullrequestreview-3216462290 From epeter at openjdk.org Fri Sep 12 12:37:27 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 12 Sep 2025 12:37:27 GMT Subject: RFR: 8365205: C2: Optimize popcount value computation using knownbits [v7] In-Reply-To: References: Message-ID: <01wSitNTzH-39p7KkpV9C_aD-4pvob_CqtnreZGP9L8=.396f1a6a-84b4-465a-91f5-5ef2ecd074b7@github.com> On Thu, 11 Sep 2025 12:16:47 GMT, Jatin Bhateja wrote: >> This patch optimizes PopCount value transforms using KnownBits information. >> Following are the results of the micro-benchmark included with the patch >> >> >> >> System: 13th Gen Intel(R) Core(TM) i3-1315U >> >> Baseline: >> Benchmark Mode Cnt Score Error Units >> PopCountValueTransform.LogicFoldingKerenLong thrpt 2 215460.670 ops/s >> PopCountValueTransform.LogicFoldingKerenlInt thrpt 2 294014.826 ops/s >> >> Withopt: >> Benchmark Mode Cnt Score Error Units >> PopCountValueTransform.LogicFoldingKerenLong thrpt 2 389978.082 ops/s >> PopCountValueTransform.LogicFoldingKerenlInt thrpt 2 417261.583 ops/s >> >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Adding random bound test point test/hotspot/jtreg/compiler/intrinsics/TestPopCountValueTransforms.java line 56: > 54: static final long rand_bndL2 = G.uniformLongs(0xFFL, Long.MAX_VALUE).next(); > 55: static final long rand_popcL1 = G.uniformLongs(0, 3).next(); > 56: static final long rand_popcL2 = G.uniformLongs(20, 40).next(); What is the reason for limiting the range on all these values? For example, we now never generate negative values, i.e. the msb is never set. 
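[Editor's illustration of the point about the msb, not part of the original message.] A minimal Java sketch, assuming `ThreadLocalRandom` as a stand-in for the test library's `G.uniformLongs` generator; the class `PopCountRangeSketch` and its helpers `sample` and `msbSet` are hypothetical names that do not appear in the PR. It shows that a bound drawn from the range 0xFF up to `Long.MAX_VALUE` is never negative, so bit 63 is never set and the popcount of any value masked with it stays at or below 63.

```java
import java.util.concurrent.ThreadLocalRandom;

public class PopCountRangeSketch {
    // Hypothetical stand-in for G.uniformLongs(lo, hi).next().
    // Note: nextLong(origin, bound) excludes the upper bound, which is
    // good enough for this illustration.
    static long sample(long lo, long hi) {
        return ThreadLocalRandom.current().nextLong(lo, hi);
    }

    // True if the most significant bit (the sign bit) of v is set.
    static boolean msbSet(long v) {
        return (v >>> 63) != 0;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) {
            long bound = sample(0xFFL, Long.MAX_VALUE);
            if (msbSet(bound)) {
                throw new AssertionError("unexpected negative bound: " + bound);
            }
            // Masking with a non-negative bound clears bit 63, so at most
            // 63 bits can remain set.
            long x = ThreadLocalRandom.current().nextLong();
            if (Long.bitCount(x & bound) > 63) {
                throw new AssertionError("popcount above 63");
            }
        }
        // Only a negative value sets bit 63 and can reach a popcount of 64.
        System.out.println(msbSet(-1L));        // prints "true"
        System.out.println(Long.bitCount(-1L)); // prints "64"
    }
}
```

Widening the sampled range to include negative values would be one way to exercise inputs with the sign bit set.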
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2344107725 From dlunden at openjdk.org Fri Sep 12 12:37:27 2025 From: dlunden at openjdk.org (Daniel Lundén) Date: Fri, 12 Sep 2025 12:37:27 GMT Subject: RFR: 8367397: Improve naming and terminology in regmask.hpp and regmask.cpp [v4] In-Reply-To: <19P8X88PcVoh8x62iBz5baOyefoWgGp-aFd_Bli-vm0=.fbd446d3-0dad-41e8-bb13-ab113a3b9767@github.com> References: <19P8X88PcVoh8x62iBz5baOyefoWgGp-aFd_Bli-vm0=.fbd446d3-0dad-41e8-bb13-ab113a3b9767@github.com> Message-ID: <-hb0UM3-WL9h72oPsOvK5NA2-pYCyDQZS42R7FzPJ3s=.5dbaea39-391e-4e25-9e9e-f97940d3bc06@github.com> On Fri, 12 Sep 2025 12:24:45 GMT, Emanuel Peter wrote: >> Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove _LogWordBits > > src/hotspot/share/opto/regmask.hpp line 65: > >> 63: LP64_ONLY(STATIC_ASSERT(is_aligned(RM_SIZE_IN_INTS, 2))); >> 64: >> 65: static const unsigned int _WordBitMask = BitsPerWord - 1U; > > You could also remove the `_` here. I suppose we keep `CamelCase` here to keep it parallel to `BitsPerWord`. What do you think @dean-long ? I renamed this entire group of constants to use the same style (uppercase separated by `_`, without leading `_`). It is now `WORD_BIT_MASK`. I think it makes more sense to use the same style across `regmask.hpp`, rather than following styles in other files. > src/hotspot/share/opto/regmask.hpp line 68: > >> 66: static const unsigned int _RM_SIZE_IN_WORDS = >> 67: LP64_ONLY(RM_SIZE_IN_INTS >> 1) NOT_LP64(RM_SIZE_IN_INTS); >> 68: static const unsigned int _RM_WORD_MAX_INDEX = _RM_SIZE_IN_WORDS - 1U; > > I would get rid of the `_` here. Constants should preferably be `UPPER_CASE` (most of the time), and occasionally `CamelCase` where we are already doing it (only do it if needed for consistency). Underscore is only used for fields, as far as I know. 
Yes, I had the same thought (now updated) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27215#discussion_r2344107536 PR Review Comment: https://git.openjdk.org/jdk/pull/27215#discussion_r2344109467 From rcastanedalo at openjdk.org Fri Sep 12 12:40:17 2025 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Fri, 12 Sep 2025 12:40:17 GMT Subject: RFR: 8361699: C2: assert(can_reduce_phi(n->as_Phi())) failed: Sanity: previous reducible Phi is no longer reducible before SUT [v2] In-Reply-To: References: Message-ID: On Thu, 11 Sep 2025 09:42:18 GMT, Tobias Hartmann wrote: >> Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision: >> >> Revert clean-up in EA. Make catch statements more specific in test case. > > src/hotspot/share/opto/escape.cpp line 3135: > >> 3133: Node* phi = use->ideal_node(); >> 3134: if (phi->Opcode() == Op_Phi && reducible_merges.member(phi)) { >> 3135: if (!can_reduce_phi(phi->as_Phi())) { > > Drive-by comment: I think the ifs should be merged @JohnTortugo: this comment is marked as resolved in the PR but I cannot see any reply or actual code change, did you perhaps forget pushing the requested change? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27063#discussion_r2344117240 From epeter at openjdk.org Fri Sep 12 12:43:11 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 12 Sep 2025 12:43:11 GMT Subject: RFR: 8367397: Improve naming and terminology in regmask.hpp and regmask.cpp [v5] In-Reply-To: References: Message-ID: On Fri, 12 Sep 2025 12:33:42 GMT, Daniel Lundén wrote: >> Some names in `regmask.hpp` and `regmask.cpp` are unclear and should be improved. >> >> ### Changeset >> >> - Rename `RM_SIZE` to `RM_SIZE_IN_INTS` and `_RM_I` to `_RM_INT` to make it clear that these refer to integer-sized (32-bit) array elements. 
>> - Rename `_RM_SIZE` to `_RM_SIZE_IN_WORDS` and `_RM_UP` to `_RM_WORD` to make it clear that these refer to machine-word-sized (32 or 64 bits depending on platform) array elements. >> - Rename `_RM_MAX` to `_RM_WORD_MAX_INDEX` for clarity. >> - Rename `is_AllStack` to `is_infinite` (and related resulting changes in comments and local variables). The old terminology "all-stack", referring to the infinite register mask bits, is misleading (as pointed out by @eme64 in https://github.com/openjdk/jdk/pull/20404#discussion_r2316234008). The reason is that the infinite bits do not represent *all* stack bits. Some stack bits are instead part of the non-infinite bits of the register mask. >> >> ### Testing >> >> - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/17638365968) >> - `tier1` and HotSpot parts of `tier2` and `tier3` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. > > Daniel Lund?n has updated the pull request incrementally with one additional commit since the last revision: > > Rename constants Nice, thanks for the improvements. I feel like the fog is lifting slowly from this code, and we can see the sunshine :sun_behind_large_cloud: -> :sun_behind_small_cloud: -> ? :rofl: src/hotspot/share/opto/chaitin.hpp line 51: > 49: class LRG : public ResourceObj { > 50: public: > 51: static const uint InfiniteStack_size = 0xFFFFF; // This mask size is used to tell that the mask of this LRG supports stack positions We may want to prevent this from snowballing everywhere ... but this is also a constant and we probably want to call it `INFINITE_STACK_SIZE`, right? 
src/hotspot/share/opto/regmask.cpp line 249: > 247: if (_rm_word[i]) { // Found some bits > 248: // Convert to bit number, return hi bit in pair > 249: return OptoReg::Name((i< References: <19P8X88PcVoh8x62iBz5baOyefoWgGp-aFd_Bli-vm0=.fbd446d3-0dad-41e8-bb13-ab113a3b9767@github.com> <-hb0UM3-WL9h72oPsOvK5NA2-pYCyDQZS42R7FzPJ3s=.5dbaea39-391e-4e25-9e9e-f97940d3bc06@github.com> Message-ID: On Fri, 12 Sep 2025 12:33:34 GMT, Daniel Lund?n wrote: >> src/hotspot/share/opto/regmask.hpp line 65: >> >>> 63: LP64_ONLY(STATIC_ASSERT(is_aligned(RM_SIZE_IN_INTS, 2))); >>> 64: >>> 65: static const unsigned int _WordBitMask = BitsPerWord - 1U; >> >> You could also remove the `_` here. I suppose we keep `CamelCase` here because to keep it parallel to `BitsPerWord`. What do you think @dean-long ? > > I renamed this entire group of constants to use the same style (uppercase separated by `_`, without leading `_`). It is now `WORD_BIT_MASK`. I think it makes more sense to use the same style across `regmask.hpp`, rather than following styles in other files. Nice! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27215#discussion_r2344123876 From dlunden at openjdk.org Fri Sep 12 12:53:37 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Fri, 12 Sep 2025 12:53:37 GMT Subject: RFR: 8367397: Improve naming and terminology in regmask.hpp and regmask.cpp [v6] In-Reply-To: References: Message-ID: > Some names in `regmask.hpp` and `regmask.cpp` are unclear and should be improved. > > ### Changeset > > - Rename `RM_SIZE` to `RM_SIZE_IN_INTS` and `_RM_I` to `_RM_INT` to make it clear that these refer to integer-sized (32-bit) array elements. > - Rename `_RM_SIZE` to `_RM_SIZE_IN_WORDS` and `_RM_UP` to `_RM_WORD` to make it clear that these refer to machine-word-sized (32 or 64 bits depending on platform) array elements. > - Rename `_RM_MAX` to `_RM_WORD_MAX_INDEX` for clarity. 
> - Rename `is_AllStack` to `is_infinite` (and related resulting changes in comments and local variables). The old terminology "all-stack", referring to the infinite register mask bits, is misleading (as pointed out by @eme64 in https://github.com/openjdk/jdk/pull/20404#discussion_r2316234008). The reason is that the infinite bits do not represent *all* stack bits. Some stack bits are instead part of the non-infinite bits of the register mask. > > ### Testing > > - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/17638365968) > - `tier1` and HotSpot parts of `tier2` and `tier3` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. Daniel Lund?n has updated the pull request incrementally with one additional commit since the last revision: Change infinite to infinite_stack ------------- Changes: - all: https://git.openjdk.org/jdk/pull/27215/files - new: https://git.openjdk.org/jdk/pull/27215/files/31c78597..37f3cbd2 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=27215&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27215&range=04-05 Stats: 31 lines in 11 files changed: 0 ins; 0 del; 31 mod Patch: https://git.openjdk.org/jdk/pull/27215.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27215/head:pull/27215 PR: https://git.openjdk.org/jdk/pull/27215 From dlunden at openjdk.org Fri Sep 12 12:53:39 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Fri, 12 Sep 2025 12:53:39 GMT Subject: RFR: 8367397: Improve naming and terminology in regmask.hpp and regmask.cpp [v2] In-Reply-To: References: Message-ID: On Fri, 12 Sep 2025 00:18:59 GMT, Dean Long wrote: >> Daniel Lund?n has updated the pull request incrementally with one additional commit since the last revision: >> >> Lowercase _RM_INT and _RM_WORD > > src/hotspot/share/opto/regmask.hpp line 166: > >> 164: // indefinitely with ONE bits. 
Returns TRUE if mask is infinite or >> 165: // unbounded in size. Returns FALSE if mask is finite size. >> 166: bool is_infinite() const { > > "infinite" hides the fact that these unbounded bits are stack bits and not register bits, but `is_UnboundedStack` or `is_InfiniteStack` might be too verbose. How does `is_InfStack` sound? OK, I've now changed it to `is_infinite_stack`. @dean-long Let us know if you feel strongly about this and want to change it. I don't think it is too verbose. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27215#discussion_r2344154909 From epeter at openjdk.org Fri Sep 12 12:56:26 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 12 Sep 2025 12:56:26 GMT Subject: RFR: 8360192: C2: Make the type of count leading/trailing zero nodes more precise [v14] In-Reply-To: References: Message-ID: On Wed, 10 Sep 2025 07:03:02 GMT, Qizheng Xing wrote: >> The result of count leading/trailing zeros is always non-negative, and the maximum value is integer type's size in bits. In previous versions, when C2 can not know the operand value of a CLZ/CTZ node at compile time, it will generate a full-width integer type for its result. This can significantly affect the efficiency of code in some cases. >> >> This patch makes the type of CLZ/CTZ nodes more precise, to make C2 generate better code. For example, the following implementation runs ~115% faster on x86-64 with this patch: >> >> >> public static int numberOfNibbles(int i) { >> int mag = Integer.SIZE - Integer.numberOfLeadingZeros(i); >> return Math.max((mag + 3) / 4, 1); >> } >> >> >> Testing: tier1, IR test > > Qizheng Xing has updated the pull request incrementally with one additional commit since the last revision: > > Remove redundant import Thanks for the update. Though I think you modified the test example so far that it does not work any more, i.e. it would not constant fold if the output range was wrong. 
I've identified 2 sources that would prevent constant folding:
- It is unclear if `getResultChecksum` would get inlined. If it is not, the `result` loses the type information about the ranges.
- The comparisons themselves would not constant fold, because the values you compare with are not constants, but array element loads. You need to compare `result` with a compile time constant.

Maybe the idea is not 100% clear to you:
Imagine `result` should be in some range `2..10`. But with a bug, we now return `3..10`. This means the output of `numberOfLeadingZeros` is still variable, and it does not constant fold. But: if there is something below it, like a `result < 3` ... this would now constant fold to `false`, even though we could have had a `2` at runtime. But for this to work, it all needs to be in the same compilation unit, and the value we compare to `result` must be a compile time constant.

Does that make sense?

test/hotspot/jtreg/compiler/c2/gvn/TestCountBitsRange.java line 516:

> 514: }
> 515:
> 516: int getResultChecksum(int result, int[] LIMITS) {

I would put a `@ForceInline` before this. You are using it in many methods, and so it may not get inlined reliably. And if it does not get inlined, then the result verification would not constant-fold, and so it would be kind of useless. Because we rely on the fact that if the range is wrong, we could get bad constant folding ;)

test/hotspot/jtreg/compiler/c2/gvn/TestCountBitsRange.java line 521:

> 519: if (result < LIMITS[i]) sum += 1 << i;
> 520: if (result > LIMITS[i + 1]) sum += 1 << (i + 1);
> 521: }

I doubt that this works, because the test would not constant fold if the range was too narrow. I think you need to manually unroll the loop, and load the constants from `static final` values, or another method that allows it to be a compile time constant.

-------------

Changes requested by epeter (Reviewer).
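The two review points above (one compilation unit, comparisons against compile-time constants) can be sketched in plain Java. This is only an illustration under stated assumptions, not the code of `TestCountBitsRange.java`: the class name `ClzRangeCheckSketch`, the constants `MIN_RESULT` and `MAX_RESULT`, and the method `checksum` are hypothetical, and the sketch sidesteps the inlining question by keeping the operation and its checks in a single method instead of relying on a force-inline annotation.

```java
class ClzRangeCheckSketch {
    // Compile-time constants: comparisons against these can be folded by the JIT
    // when the type of the compared value proves the outcome. Values loaded from
    // an int[] element at runtime would not give the compiler that opportunity.
    static final int MIN_RESULT = 0;
    static final int MAX_RESULT = Integer.SIZE; // numberOfLeadingZeros(int) is in 0..32

    // The operation under test and the manually unrolled checks live in one method,
    // so no inlining decision is needed to keep them in the same compilation unit.
    static int checksum(int i) {
        int result = Integer.numberOfLeadingZeros(i);
        int sum = 0;
        if (result < MIN_RESULT)     sum += 1; // impossible for a correct range
        if (result > MAX_RESULT)     sum += 2; // impossible for a correct range
        if (result < MIN_RESULT + 1) sum += 4; // true only for result == 0
        if (result > MAX_RESULT - 1) sum += 8; // true only for result == 32
        return sum;
    }

    public static void main(String[] args) {
        // Boundary inputs: 0 has 32 leading zeros, -1 has none, 1 has 31.
        System.out.println(checksum(0));
        System.out.println(checksum(-1));
        System.out.println(checksum(1));
    }
}
```

If a compiler wrongly assumed a range such as `1..32` for the result, the third comparison could fold to `false`, and compiled code would then disagree with the interpreter for the input `-1`. Comparing compiled and interpreted results is the kind of mismatch such a test would aim to expose.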
PR Review: https://git.openjdk.org/jdk/pull/25928#pullrequestreview-3216519184
PR Review Comment: https://git.openjdk.org/jdk/pull/25928#discussion_r2344141663
PR Review Comment: https://git.openjdk.org/jdk/pull/25928#discussion_r2344145678

From epeter at openjdk.org Fri Sep 12 13:03:19 2025
From: epeter at openjdk.org (Emanuel Peter)
Date: Fri, 12 Sep 2025 13:03:19 GMT
Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v3]
In-Reply-To: References: Message-ID:

On Tue, 9 Sep 2025 21:45:00 GMT, Vladimir Ivanov wrote:

>> Also: you promise that it happens randomly. But it seems to be added deterministically everywhere. Did I miss something?
>
> Sorry for the confusion. Reworded the comment. I didn't intend to make it truly random. The idea was to automatically insert RF nodes during parsing to stress the implementation. It doesn't slow down compilation times that much, so aggressive insertion just works.

@iwanowww Ok, makes sense. I wonder though if we should consider random insertion.

It would also be nice to document somewhere that this only really tests the internal mechanism of ReachabilityFence. It does not really stress the case where we should have had a ReachabilityFence but fail to have one. Instead, we just insert more than we need. So we don't really expect this to trigger bugs with missing RFs.

>> Hmm. The way it is formulated it sounds more like:
>> - `true` -> we are guaranteed that it is a safepoint.
>> - `false` -> it may or may not be a safepoint - no guarantees.
>> Am I understanding this right?
>>
>> If yes, then it would make more sense to have a default that is `no guarantee`. But maybe that makes things more complicated in other ways. All I'm saying is it makes me nervous ;)
>
> You are right. I studied the code and `guaranteed_safepoint()` behaves as you described. It doesn't work for RF purposes, so I migrated the code to `sfpt->jvms() != nullptr` check and fixed a bug along the way.
The changes related to `guaranteed_safepoint()` are reverted. Maybe it could be worth putting some comments around `guaranteed_safepoint` that describe the logic of it, while you have the advantage of understanding what it means? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2344175842 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2344184330 From dlunden at openjdk.org Fri Sep 12 13:08:01 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Fri, 12 Sep 2025 13:08:01 GMT Subject: RFR: 8367397: Improve naming and terminology in regmask.hpp and regmask.cpp [v7] In-Reply-To: References: Message-ID: > Some names in `regmask.hpp` and `regmask.cpp` are unclear and should be improved. > > ### Changeset > > - Rename `RM_SIZE` to `RM_SIZE_IN_INTS` and `_RM_I` to `_RM_INT` to make it clear that these refer to integer-sized (32-bit) array elements. > - Rename `_RM_SIZE` to `_RM_SIZE_IN_WORDS` and `_RM_UP` to `_RM_WORD` to make it clear that these refer to machine-word-sized (32 or 64 bits depending on platform) array elements. > - Rename `_RM_MAX` to `_RM_WORD_MAX_INDEX` for clarity. > - Rename `is_AllStack` to `is_infinite` (and related resulting changes in comments and local variables). The old terminology "all-stack", referring to the infinite register mask bits, is misleading (as pointed out by @eme64 in https://github.com/openjdk/jdk/pull/20404#discussion_r2316234008). The reason is that the infinite bits do not represent *all* stack bits. Some stack bits are instead part of the non-infinite bits of the register mask. > > ### Testing > > - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/17638365968) > - `tier1` and HotSpot parts of `tier2` and `tier3` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. 
Daniel Lund?n has updated the pull request incrementally with one additional commit since the last revision: Rename InfiniteStack_size and fix style of touched code ------------- Changes: - all: https://git.openjdk.org/jdk/pull/27215/files - new: https://git.openjdk.org/jdk/pull/27215/files/37f3cbd2..cf247cd2 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=27215&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27215&range=05-06 Stats: 39 lines in 8 files changed: 15 ins; 0 del; 24 mod Patch: https://git.openjdk.org/jdk/pull/27215.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27215/head:pull/27215 PR: https://git.openjdk.org/jdk/pull/27215 From epeter at openjdk.org Fri Sep 12 13:08:20 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 12 Sep 2025 13:08:20 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v8] In-Reply-To: References: <_n3uP_Dkl3RNq3MFoRDXsS28SM8CcQHaR6vdUJF9U8s=.dcfab97b-be28-4244-93df-c8a23d6d66b8@github.com> Message-ID: <6XMjW5KmnMDigmEXRpEy4lDGEUpElgzTq2YDULaFAAk=.1241a58d-3143-473c-b78b-afd60be2ef4b@github.com> On Tue, 9 Sep 2025 21:51:31 GMT, Vladimir Ivanov wrote: >> I'm also not sure yet why there is a difference between incremental inlining and regular inlining. >> Do you think it would make sense to explain that here, or is it explained elsewhere? > > There are no safepoint-attached reachability edges present during normal parsing. For incremental inlining, JVMS from the original call is taken and extended with callee state. If there are reachability edges present, they have to be treated specially and carried over to all safepoints produced during incremental inlining attempt. There's no such support in place yet. @iwanowww Ok, sounds a bit complicated. Maybe that is what we have to do, at least for now. But please make sure that this is documented, maybe right here or elsewhere. Because it is only half-clear to me now. 
Ok, so if the outer scope has RF edges, we need to make sure the inner scope has those RF edges too, right? Ah, you are saying we are not doing that yet? Are you keeping track of that information for later? Could we now create a reproducer that would fail in incremental inlining with a missing RF edge? Probably tricky, but very valuable ;)

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2344201256

From dlunden at openjdk.org Fri Sep 12 13:08:06 2025
From: dlunden at openjdk.org (Daniel Lundén)
Date: Fri, 12 Sep 2025 13:08:06 GMT
Subject: RFR: 8367397: Improve naming and terminology in regmask.hpp and regmask.cpp [v5]
In-Reply-To: References: Message-ID:

On Fri, 12 Sep 2025 12:36:55 GMT, Emanuel Peter wrote:

>> Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Rename constants
>
> src/hotspot/share/opto/chaitin.hpp line 51:
>
>> 49: class LRG : public ResourceObj {
>> 50: public:
>> 51: static const uint InfiniteStack_size = 0xFFFFF; // This mask size is used to tell that the mask of this LRG supports stack positions
>
> We may want to prevent this from snowballing everywhere ... but this is also a constant and we probably want to call it `INFINITE_STACK_SIZE`, right?

I agree, now fixed (but yes, we need to stop snowballing at some point!)
> src/hotspot/share/opto/regmask.cpp line 249: > >> 247: if (_rm_word[i]) { // Found some bits >> 248: // Convert to bit number, return hi bit in pair >> 249: return OptoReg::Name((i< > Suggestion: > > return OptoReg::Name((i << LogBitsPerWord) + find_lowest_bit(_rm_word[i]) + (size - 1)); > > Might as well fix code style while we touch it ;) Sure, now fixed (and checked and updated all other touched code as well) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27215#discussion_r2344192294 PR Review Comment: https://git.openjdk.org/jdk/pull/27215#discussion_r2344194711 From epeter at openjdk.org Fri Sep 12 13:12:35 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 12 Sep 2025 13:12:35 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v8] In-Reply-To: References: Message-ID: On Fri, 12 Sep 2025 13:08:49 GMT, Emanuel Peter wrote: >>> could we just go through _reachability_fences, and hack the graph and clean up with IGVN? Or do we really need the loop state to do this successfully? >> >> RF elimination needs control for referent to enumerate all interfering safepoints. >> >> Theoretically, it's possible to use a conservative estimate, but then: >> (1) it can worsen the result (by enumerating more interfering safepoints than needed); and >> (2) build an unschedulable graph if referent doesn't dominate safepoint node (if estimate is way too conservative). >> >> IMO it's safer to build full dominator tree here. >> >>> It probably has a performance impact, right? Have you measured that? >> >> It does have a noticeable cost. 
On my laptop it bumps the time spent doing RF processing from 170ms to 210ms >> >> $ java -Xcomp -XX:-TieredCompilation -XX:+CITime -XX:+UnlockDiagnosticVMOptions -XX:-StressReachabilityFences >> >> IdealLoop: 0.173 s >> ReachabilityFence: 0.000 s >> Optimize: 0.000 s >> Eliminate: 0.000 s >> ``` >> vs >> >> $ java -Xcomp -XX:-TieredCompilation -XX:+CITime -XX:+UnlockDiagnosticVMOptions -XX:+StressReachabilityFences >> >> IdealLoop: 0.212 s >> ReachabilityFence: 0.030 s >> Optimize: 0.004 s >> Eliminate: 0.004 s >> ``` >> >> I reimplemented it to piggyback on the last loop optimization attempt if there's any and it drastically improves the situation: >> >> $ java -Xcomp -XX:-TieredCompilation -XX:+CITime -XX:+UnlockDiagnosticVMOptions -XX:+StressReachabilityFences >> >> IdealLoop: 0.193 s >> ReachabilityFence: 0.009 s >> Optimize: 0.003 s >> Eliminate: 0.004 s > > @iwanowww > Ok, thanks for measuring this. We really need to keep an eye on this, otherwise it will surely trip @robcasloz 's C2 compile time benchmarking eventualyl ;) > > Can you point me to the code where you are actually using the dominator information? I think I did not find it the last time I reviewed. Ah, you mentioned it somewhere else: > It's solely for get_ctrl(referent) call in enumerate_interfering_sfpts(). ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2344214962 From epeter at openjdk.org Fri Sep 12 13:12:34 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 12 Sep 2025 13:12:34 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v8] In-Reply-To: References: Message-ID: On Wed, 10 Sep 2025 21:34:37 GMT, Vladimir Ivanov wrote: >> src/hotspot/share/opto/compile.cpp line 2522: >> >>> 2520: if (failing()) return; >>> 2521: assert(_reachability_fences.length() == 0, "no RF nodes allowed"); >>> 2522: } >> >> Looks better than before :) >> >> I'm still wondering: do we need to do a whole loop-opts phase here? 
It probably has a performance impact, right? >> Have you measured that? >> >> If it is measurable: could we just go through `_reachability_fences`, and hack the graph and clean up with IGVN? Or do we really need the loop state to do this successfully? > >> could we just go through _reachability_fences, and hack the graph and clean up with IGVN? Or do we really need the loop state to do this successfully? > > RF elimination needs control for referent to enumerate all interfering safepoints. > > Theoretically, it's possible to use a conservative estimate, but then: > (1) it can worsen the result (by enumerating more interfering safepoints than needed); and > (2) build an unschedulable graph if referent doesn't dominate safepoint node (if estimate is way too conservative). > > IMO it's safer to build full dominator tree here. > >> It probably has a performance impact, right? Have you measured that? > > It does have a noticeable cost. On my laptop it bumps the time spent doing RF processing from 170ms to 210ms > > $ java -Xcomp -XX:-TieredCompilation -XX:+CITime -XX:+UnlockDiagnosticVMOptions -XX:-StressReachabilityFences > > IdealLoop: 0.173 s > ReachabilityFence: 0.000 s > Optimize: 0.000 s > Eliminate: 0.000 s > ``` > vs > > $ java -Xcomp -XX:-TieredCompilation -XX:+CITime -XX:+UnlockDiagnosticVMOptions -XX:+StressReachabilityFences > > IdealLoop: 0.212 s > ReachabilityFence: 0.030 s > Optimize: 0.004 s > Eliminate: 0.004 s > ``` > > I reimplemented it to piggyback on the last loop optimization attempt if there's any and it drastically improves the situation: > > $ java -Xcomp -XX:-TieredCompilation -XX:+CITime -XX:+UnlockDiagnosticVMOptions -XX:+StressReachabilityFences > > IdealLoop: 0.193 s > ReachabilityFence: 0.009 s > Optimize: 0.003 s > Eliminate: 0.004 s @iwanowww Ok, thanks for measuring this. 
We really need to keep an eye on this, otherwise it will surely trip @robcasloz 's C2 compile time benchmarking eventually ;)

Can you point me to the code where you are actually using the dominator information? I think I did not find it the last time I reviewed.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2344212858

From epeter at openjdk.org Fri Sep 12 13:21:41 2025
From: epeter at openjdk.org (Emanuel Peter)
Date: Fri, 12 Sep 2025 13:21:41 GMT
Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v8]
In-Reply-To: References: <_n3uP_Dkl3RNq3MFoRDXsS28SM8CcQHaR6vdUJF9U8s=.dcfab97b-be28-4244-93df-c8a23d6d66b8@github.com> Message-ID:

On Wed, 10 Sep 2025 21:39:02 GMT, Vladimir Ivanov wrote:

>> Ah, you could mention that later `ReachabilityFenceNode::Identity` removes the rf.
>
>> Is this rf guaranteed to belong to the Allocation somehow?
>
> I don't get your question. The code iterates over users of an allocation which is being eliminated. Semantically, RF is a no-op on a scalarizable referent and has to be removed in order to let the scalarization happen.
>
>> Ah, you could mention that later ReachabilityFenceNode::Identity removes the rf.
>
> Done.

But are we sure that the `ReachabilityFence` really belongs to the `Allocation` that is eliminated? Can we check if the referent matches?

Because what if there are multiple allocations:

    x = allocation;
    y = allocation; // -> eliminate
    ReachabilityFence(x); // is only ctrl use of Allocation for y, but belongs to Allocation of x.

Could there be such cases?
@iwanowww ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2344241787 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2344242372 From epeter at openjdk.org Fri Sep 12 13:29:28 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 12 Sep 2025 13:29:28 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v8] In-Reply-To: References: Message-ID: On Tue, 9 Sep 2025 21:27:12 GMT, Vladimir Ivanov wrote: >> src/hotspot/share/opto/reachability.cpp line 49: >> >>> 47: * >>> 48: * It is tempting to directly attach referents to interfering safepoints right from the beginning, but it >>> 49: * doesn't play well with some optimizations C2 does. >> >> Do you have an example for such optimizations? > > Loop-invariant code motion is one example. Do you want me to add it to the comment? > > After parsing is over, the IR is in valid state, but loop optimizations are the primary reason why it can be broken later. Just make sure that this information is in the code comments - I'm not just asking for myself here ;) >> src/hotspot/share/opto/reachability.cpp line 71: >> >>> 69: * Unfortunately, it's not straightforward to stay with safepoint-attached representation till the very end, >>> 70: * because information about derived oops is attached to safepoints the very same similar way. So, for now RFs are >>> 71: * rematerialized at safepoints before RA (phase #3). >> >> `the very same similar way` sounds a little funny. I'm also not quite seeing the problem yet. What is the issue with the edges being attached to safepoints here? > >> the very same similar way sounds a little funny. I > Fixed. > >> What is the issue with the edges being attached to safepoints here? > > The issue is safepoint-attached representation conflicts with derived oops representation. There's no way to distinguish between them. 
As of now, VM treats post-debug info edges as representing derived oops which is completely wrong when there are reachability edges present. More work is needed to support both cases. Ok, thanks for the explanation. Just make sure that information is in the code comments ;) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2344263650 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2344265719 From epeter at openjdk.org Fri Sep 12 14:12:42 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 12 Sep 2025 14:12:42 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v11] In-Reply-To: <1pShdyn-7-wwwiuY1DdMt5iiZ2qc9l_x2F-3AKqkg60=.dd260953-05cc-4b84-b6d1-7f684e74084c@github.com> References: <1pShdyn-7-wwwiuY1DdMt5iiZ2qc9l_x2F-3AKqkg60=.dd260953-05cc-4b84-b6d1-7f684e74084c@github.com> Message-ID: On Thu, 11 Sep 2025 18:18:13 GMT, Vladimir Ivanov wrote: >> This PR introduces C2 support for `Reference.reachabilityFence()`. >> >> After [JDK-8199462](https://bugs.openjdk.org/browse/JDK-8199462) went in, it was discovered that C2 may break the invariant the fix relied upon [1]. So, this is an attempt to introduce proper support for `Reference.reachabilityFence()` in C2. C1 is left intact for now, because there are no signs yet it is affected. >> >> `Reference.reachabilityFence()` can be used in performance critical code, so the primary goal for C2 is to reduce its runtime overhead as much as possible. The ultimate goal is to ensure liveness information is attached to interfering safepoints, but it takes multiple steps to properly propagate the information through compilation pipeline without negatively affecting generated code quality. >> >> Also, I don't consider this fix as complete. It does fix the reported problem, but it doesn't provide any strong guarantees yet. 
In particular, since `ReachabilityFence` is CFG-only node, nothing explicitly forbids memory operations to float past `Reference.reachabilityFence()` and potentially reaching some other safepoints current analysis treats as non-interfering. Representing `ReachabilityFence` as memory barrier (e.g., `MemBarCPUOrder`) would solve the issue, but performance costs are prohibitively high. Alternatively, the optimization proposed in this PR can be improved to conservatively extend referent's live range beyond `ReachabilityFence` nodes associated with it. It would meet performance criteria, but I prefer to implement it as a followup fix. >> >> Another known issue relates to reachability fences on constant oops. If such constant is GCed (most likely, due to a bug in Java code), similar reachability issues may arise. For now, RFs on constants are treated as no-ops, but there's a diagnostic flag `PreserveReachabilityFencesOnConstants` to keep the fences. I plan to address it separately. >> >> [1] https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/ref/Reference.java#L667 >> "HotSpot JVM retains the ref and does not GC it before a call to this method, because the JIT-compilers do not have GC-only safepoints." >> >> Testing: >> - [x] hs-tier1 - hs-tier8 >> - [x] hs-tier1 - hs-tier6 w/ -XX:+StressReachabilityFences -XX:+VerifyLoopOptimizations >> - [x] java/lang/foreign microbenchmarks > > Vladimir Ivanov has updated the pull request incrementally with one additional commit since the last revision: > > Minor fix @iwanowww Thanks for all the updates, and the presentation on Tuesday in our staff-meeting! src/hotspot/share/opto/callGenerator.cpp line 620: > 618: // Inlining logic doesn't expect any extra edges past debug info and fails with > 619: // an assert in SafePointNode::grow_stack. 
> 620: assert(endoff == call->req(), "reachability edges not supported"); Could we trip over this assert by modifying the reproducer, and add some method somewhere that gets inlined late? src/hotspot/share/opto/compile.hpp line 110: > 108: LoopOptsNone, > 109: LoopOptsMaxUnroll, > 110: LoopOptsEliminateRFs, With the additional flags, I think we now need some kind of documentation here. I'm losing a bit the overview - and maybe never really had it. src/hotspot/share/opto/loopnode.cpp line 5341: > 5339: C->print_method(PHASE_ELIMINATE_REACHABILITY_FENCES, 2); > 5340: assert(C->reachability_fences_count() == 0, "no RF nodes allowed"); > 5341: } Can we somehow assert that we now really will never do loop-opts again? Why are you checking for `_mode == LoopOptsDefaultFinal` and not for `LoopOptsEliminateRFs`? If that was a bug, then more verification would be extra justified ;) src/hotspot/share/opto/reachability.cpp line 52: > 50: * > 51: * Instead, reachability representation transitions through multiple phases: > 52: * (0) initial set of RFs is materialized during parsing; Suggestion: * (0) initial set of RFs is materialized during parsing, by intrinsifying calls to Reference.reachabilityFence; src/hotspot/share/opto/reachability.cpp line 54: > 52: * (0) initial set of RFs is materialized during parsing; > 53: * (1) optimization pass during loop opts eliminates redundant RF nodes and > 54: * moves the ones with loop-invariant referents outside loops; Suggestion: * (1) optimization pass during loop opts eliminates redundant RF nodes and * moves the ones with loop-invariant referents outside (after) loops; src/hotspot/share/opto/reachability.cpp line 67: > 65: * Live ranges of values are routinely extended during loop opts. And it can break the invariant that > 66: * all interfering safepoints contain the referent in their oop map. (If an interfering safepoint doesn't > 67: * keep the referent alive, then it becomes possible for the referent to be prematurely GCed.) 
Can we have a concrete example. I thought of a store that is sunk out of the loop. But of course that should not cross a SafePoint on the way either. So then that's not a good argument. Do you have one that works? src/hotspot/share/opto/reachability.cpp line 70: > 68: * > 69: * After loop opts are over, it becomes possible to reliably enumerate all interfering safe points and > 70: * ensure the referent present in their oop maps. Suggestion: * After loop opts are over, it becomes possible to reliably enumerate all interfering safe points and * to ensure that the referent is present in their oop maps. Grammar. Maybe you need to fix it in a different way if it does not match the intended semantics ;) src/hotspot/share/opto/reachability.cpp line 81: > 79: * (c) Unfortunately, it's not straightforward to stay with safepoint-attached representation till the very end, > 80: * because information about derived oops is attached to safepoints in a similar way. So, for now RFs are > 81: * rematerialized at safepoints before RA (phase #3). I still don't understand this. What is similar to what? And why is that a problem? src/hotspot/share/opto/reachability.hpp line 32: > 30: #include "opto/type.hpp" > 31: > 32: //------------------------ReachabilityFenceNode-------------------------- Suggestion: // Represents a Reference.reachabilityFence call // See documentation in reachability.cpp test/hotspot/jtreg/compiler/c2/TestReachabilityFence.java line 40: > 38: * @run main/othervm -Xbatch compiler.c2.TestReachabilityFence > 39: */ > 40: public class TestReachabilityFence { This test seems very important to me. Can you please add some extra code comments, about what goes wrong before the fix, i.e. if RF are not present? Maybe some explanation about what it took to write this test, so that we can build on that to extend the test later? 
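For readers without the test source at hand, the usage pattern that `Reference.reachabilityFence()` exists to support can be sketched as follows. This is a minimal illustration, not the content of `TestReachabilityFence.java`: the names `ReachabilityFenceSketch`, `Owner`, and `sum` are hypothetical, and the release of the underlying storage (for example by a `Cleaner` once the owner becomes unreachable) is deliberately not modeled, so the sketch runs deterministically and cannot reproduce the premature collection discussed in this thread.

```java
import java.lang.ref.Reference;

class ReachabilityFenceSketch {
    // Hypothetical owner of some storage. In a real program the storage might be
    // off-heap memory that is released once the owner becomes unreachable;
    // that release step is intentionally left out of this sketch.
    static final class Owner {
        final long[] storage = {1, 2, 3, 4};
    }

    static long sum(Owner owner) {
        long[] storage = owner.storage; // last ordinary use of 'owner'
        try {
            long total = 0;
            for (long v : storage) {
                total += v;
            }
            return total;
        } finally {
            // Keeps 'owner' strongly reachable until this point, regardless of
            // how the JIT would otherwise shorten its live range.
            Reference.reachabilityFence(owner);
        }
    }

    public static void main(String[] args) {
        System.out.println(sum(new Owner()));
    }
}
```

Without the fence, nothing in the loop uses `owner`, so a compiler is free to treat it as dead after the field load. A safepoint inside the loop could then allow the owner to be collected while its storage is still being read, which matters as soon as collection of the owner triggers a release of that storage.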
------------- PR Review: https://git.openjdk.org/jdk/pull/25315#pullrequestreview-3216761112 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2344292203 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2344381871 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2344313204 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2344334081 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2344337061 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2344345310 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2344349802 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2344355280 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2344359681 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2344374341 From epeter at openjdk.org Fri Sep 12 14:12:45 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 12 Sep 2025 14:12:45 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v11] In-Reply-To: References: <1pShdyn-7-wwwiuY1DdMt5iiZ2qc9l_x2F-3AKqkg60=.dd260953-05cc-4b84-b6d1-7f684e74084c@github.com> Message-ID: On Fri, 12 Sep 2025 13:38:08 GMT, Emanuel Peter wrote: >> Vladimir Ivanov has updated the pull request incrementally with one additional commit since the last revision: >> >> Minor fix > > src/hotspot/share/opto/callGenerator.cpp line 620: > >> 618: // Inlining logic doesn't expect any extra edges past debug info and fails with >> 619: // an assert in SafePointNode::grow_stack. >> 620: assert(endoff == call->req(), "reachability edges not supported"); > > Could we trip over this assert by modifying the reproducer, and add some method somewhere that gets inlined late? Could we also bail out here? Or what would happen now in production if there is a RF edge? 
> src/hotspot/share/opto/loopnode.cpp line 5341: > >> 5339: C->print_method(PHASE_ELIMINATE_REACHABILITY_FENCES, 2); >> 5340: assert(C->reachability_fences_count() == 0, "no RF nodes allowed"); >> 5341: } > > Can we somehow assert that we now really will never do loop-opts again? > Why are you checking for `_mode == LoopOptsDefaultFinal` and not for `LoopOptsEliminateRFs`? > If that was a bug, then more verification would be extra justified ;) Otherwise, please explain the meaning of `LoopOptsDefaultFinal`. Maybe it should be an OR here? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2344294495 PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2344320789 From epeter at openjdk.org Fri Sep 12 14:12:43 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 12 Sep 2025 14:12:43 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v2] In-Reply-To: <1FgOFS7aAlEbvVUez6iTfzgf2l7qUbL9C4wfSGmmfo0=.406c10f1-63d5-4333-af6d-525e46203182@github.com> References: <0WKwHjzEn5dxYLkonrk4h9yfMI3r3bKDdqgG06J69N4=.e19e9441-6197-4d53-a4f4-b196a81f69d8@github.com> <1FgOFS7aAlEbvVUez6iTfzgf2l7qUbL9C4wfSGmmfo0=.406c10f1-63d5-4333-af6d-525e46203182@github.com> Message-ID: On Thu, 11 Sep 2025 18:24:46 GMT, Vladimir Ivanov wrote: >> @iwanowww Let me know whenever this is ready to review again ? > > @eme64 I think I addressed/answered all your suggestions/questions. Please, take another look. Thanks! @iwanowww Thanks for the updates! I again only looked through most comments as well. These are the major topics for me: - `StressReachabilityFences` only inserts RF where they are not needed. So this allows us to test the consistency of the RF machinery, but not to test if we are missing RF where they are needed. That is much harder, and we should probably invest in writing more tests for those cases, even if it is really hard. Maybe we can even write fuzzing tests for it? 
- There seems to be missing support for carrying RF edges through incremental inlining, right? File an RFE, or track it elsewhere. Could we create a reproducer for this case / can we extend the existing one? https://github.com/openjdk/jdk/pull/25315#discussion_r2330095168 - Are we sure that we don't eliminate the RF for the wrong allocation? https://github.com/openjdk/jdk/pull/25315#discussion_r2330230044 - Extra compile-time due to extra loop-opts round. https://github.com/openjdk/jdk/pull/25315#discussion_r2330176841 . It used to be a 20% increase, now you managed to make it only 10%. Still considerable. All of it just to call `get_ctrl(referent)` in `enumerate_interfering_sfpts`. I think some of these issues should also be discussed in the PR description / JIRA description. It would be especially nice if you could summarize the scope of the problem of RF, and which parts are now fixed, and which parts you know are not yet fixed. Of course there may be even more we don't know, but best write everything down we already do know. ;) Other ideas: - You should file an RFE to add your stress flags to the stress job, and also the fuzzer. - I did not yet study the reproducer `TestReachabilityFence.java`. We should consider making a fuzzer style test out of it, maybe using the template framework. Feel free to just file an RFE for that, and assign it to me. @shipilev @TobiHartmann @chhagedorn I'm soon going on vacation (in a week), and so I'd like the other reviewers to be aware of these issues. I don't want to hold up the patch, so feel free to have someone else review. But I'm also happy to come back to this mid October. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/25315#issuecomment-3285447179 From epeter at openjdk.org Fri Sep 12 14:12:49 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 12 Sep 2025 14:12:49 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v8] In-Reply-To: References: Message-ID: On Fri, 12 Sep 2025 13:26:08 GMT, Emanuel Peter wrote: >> Loop-invariant code motion is one example. Do you want me to add it to the comment? >> >> After parsing is over, the IR is in valid state, but loop optimizations are the primary reason why it can be broken later. > > Just make sure that this information is in the code comments - I'm not just asking for myself here ;) Yes, maybe say what the general problem is, and make a concrete example. I'm currently a bit struggling to think of one that is relevant. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2344331771 From epeter at openjdk.org Fri Sep 12 14:12:49 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 12 Sep 2025 14:12:49 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v8] In-Reply-To: References: Message-ID: <4jTV6y9R_JfATA54LC7FK3DKdBX1srsU09DK1I25Uo0=.94233927-71f2-4f13-894d-206d00f5fdaa@github.com> On Fri, 12 Sep 2025 13:51:42 GMT, Emanuel Peter wrote: >> Just make sure that this information is in the code comments - I'm not just asking for myself here ;) > > Yes, maybe say what the general problem is, and make a concrete example. I'm currently a bit struggling to think of one that is relevant. Ah yes: we may for example move a store out (after) the loop. But wait. We can't move a store across a SafePoint, so that's not a good example. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2344342049 From epeter at openjdk.org Fri Sep 12 14:20:34 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 12 Sep 2025 14:20:34 GMT Subject: RFR: 8367389: C2 SuperWord: refactor VTransform to model the whole loop instead of just the basic block Message-ID: I'm working on cost-modeling, and am integrating some smaller changes from this proof-of-concept PR: https://github.com/openjdk/jdk/pull/20964 [See plan overview.](https://bugs.openjdk.org/browse/JDK-8340093) This is a pure refactoring - no change in behaviour. I'm presenting it like this because it will make reviews easier. ------------------------------ **Goals** - VTransform models **all nodes in the loop**, not just the basic block (enables later VTransform::optimize, like moving reductions out of the loop) - Remove `_nodes` from the vector vtnodes. **Details** - Remove: `AUTO_VECTORIZATION2_AFTER_REORDER`, `apply_memops_reordering_with_schedule`, `print_memops_schedule`. - Instead of reordering the scalar memops, we create the new memory graph during `VTransform::apply`. That is why the `VTransformApplyState` now needs to track the memory states. - Refactor `VLoopMemorySlices`: map not just memory slices with phis (have stores in loop), but also those with only loads (no phi). - Create vtnodes for all nodes in the loop (not just the basic block), as well as inputs (already) and outputs (new). Mapping also the output nodes means during `apply`, we naturally connect the uses after the loop to their inputs from the loop (which may be new nodes after the transformation). - `_mem_ref_for_main_loop_alignment` -> `_vpointer_for_main_loop_alignment`. Instead of tracking the memory node to later have access to its `VPointer`, we take it directly. That removes one more use of `_nodes` for vector vtnodes. I also made a lot of annotations in the code below, for easier review. 
**Suggested order for review** - Removal of `VTransformGraph::apply_memops_reordering_with_schedule` -> sets up need to build memory graph on the fly. - Old and new code for `VLoopMemorySlices` -> we now also track load-only slices. - `build_scalar_vtnodes_for_non_packed_nodes`, `build_inputs_for_scalar_vtnodes`, `build_uses_after_loop`, `apply_vtn_inputs_to_node` (use in `apply`), `apply_backedge`, `fix_memory_state_uses_after_loop` - `VTransformApplyState`: how it now tracks the memory state. - `VTransformVectorNode` -> removal of `_nodes` (Big Win!) - Then look at all the other details. ------------- Commit messages: - fix documentation - mem_ref -> vpointer - wip rm nodes - control dependency - phi cleanup - apply_backedge - hook inputs - apply - wip init memory state - small improvement - ... and 6 more: https://git.openjdk.org/jdk/compare/2826d170...3ec3ea2a Changes: https://git.openjdk.org/jdk/pull/27208/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27208&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8367389 Stats: 690 lines in 10 files changed: 363 ins; 243 del; 84 mod Patch: https://git.openjdk.org/jdk/pull/27208.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27208/head:pull/27208 PR: https://git.openjdk.org/jdk/pull/27208 From missa at openjdk.org Fri Sep 12 20:32:54 2025 From: missa at openjdk.org (Mohamed Issa) Date: Fri, 12 Sep 2025 20:32:54 GMT Subject: RFR: 8364305: Support AVX10 saturating floating point conversion instructions [v12] In-Reply-To: References: Message-ID: <_p9DjOv5DH3cP7WAD4Sf4f9pxil8WEx7cf7d-6Od1XI=.b28fbc4a-6632-47f7-a098-860daabd9ec8@github.com> On Thu, 11 Sep 2025 23:27:31 GMT, Sandhya Viswanathan wrote: >> Mohamed Issa has updated the pull request incrementally with two additional commits since the last revision: >> >> - Change the floating point conversion instruction, IR nodes, and test rules to make them clearer >> - Change debug text format of AVX 10.2 vector conversion instructions > > 
src/hotspot/cpu/x86/x86.ad line 7669: > >> 7667: predicate(!VM_Version::supports_avx10_2() && >> 7668: !VM_Version::supports_avx512vl() && >> 7669: Matcher::vector_length_in_bytes(n->in(1)) < 64 && > > Good to add "is_integral_type(Matcher::vector_element_basic_type(n)) &&" here. Added > test/hotspot/jtreg/compiler/vectorapi/VectorFPtoIntCastTest.java line 26: > >> 24: /** >> 25: * @test >> 26: * @bug 8287835 8320347 > > Did you mean 8364305 here? Yes, I was looking up a different one. I correct it now. Thanks. > test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java line 364: > >> 362: applyIfCPUFeatureAnd = {"avx2", "true", "avx10_2", "false"}) >> 363: @IR(counts = {IRNode.X86_VCAST_F2X_AVX10, "> 0"}, >> 364: applyIfCPUFeature = {"avx10_2", "true"}) > > Need to add the following for X86_VCAST_F2X as well as X86_VCAST_F2X_AVX10. > applyIfOr = {"AlignVector", "false", "UseCompactObjectHeaders", "false"}, Added > test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java line 387: > >> 385: applyIfCPUFeatureAnd = {"avx2", "true", "avx10_2", "false"}) >> 386: @IR(counts = {IRNode.X86_VCAST_F2X_AVX10, "> 0"}, >> 387: applyIfCPUFeature = {"avx10_2", "true"}) > > Need to add the following for X86_VCAST_F2X as well as X86_VCAST_F2X_AVX10. 
> applyIfOr = {"AlignVector", "false", "UseCompactObjectHeaders", "false"}, Added > test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java line 413: > >> 411: applyIfCPUFeatureAnd = {"avx", "true", "avx10_2", "false"}) >> 412: @IR(counts = {IRNode.X86_VCAST_D2X_AVX10, "> 0"}, >> 413: applyIfCPUFeature = {"avx10_2", "true"}) > > Need to add the following for X86_VCAST_D2X and X86_VCAST_D2X_AVX10: > applyIf = {"MaxVectorSize", ">=16"}, Added > test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java line 432: > >> 430: applyIfCPUFeatureAnd = {"avx", "true", "avx10_2", "false"}) >> 431: @IR(counts = {IRNode.X86_VCAST_D2X_AVX10, "> 0"}, >> 432: applyIfCPUFeature = {"avx10_2", "true"}) > > Need to add the following for X86_VCAST_D2X and X86_VCAST_D2X_AVX10: > applyIf = {"MaxVectorSize", ">=16"}, Added ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2345368353 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2345369890 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2345368764 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2345368997 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2345369217 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2345369350 From missa at openjdk.org Fri Sep 12 20:32:53 2025 From: missa at openjdk.org (Mohamed Issa) Date: Fri, 12 Sep 2025 20:32:53 GMT Subject: RFR: 8364305: Support AVX10 saturating floating point conversion instructions [v13] In-Reply-To: References: Message-ID: > Intel® AVX10 ISA [1] extensions added new saturating floating point conversion instructions which comply with definitions in section 5.8 of the 2019 IEEE-754 standard. They can compute floating point to integral type conversions while also handling special inputs such as NaN, +Infinity, and -Infinity. 
> > Without AVX10.2, the current approach starts by converting the floating point value(s) in the source register to the desired integral value(s) in the destination register. In the scalar case, the CVTTSS2SI (single precision) or CVTTSD2SI (double precision) instruction is used. In the vector case, the CVTTPS2DQ (single precision) or CVTTPD2DQ (double precision) instruction is used. However, if the source contains a special value (NaN, -Infinity, +Infinity, <= Integer.MIN_VALUE, or >= Integer.MAX_VALUE), extra handling is required. The specific sequence of instructions involved depends on the source (single precision vs double precision), destination (long, integer, short, or byte), level of parallelization (scalar vs vector), and supported AVX extension type. Essentially though, the special values are mapped to values in the integer range (NaN -> 0; -Infinity or <= Integer.MIN_VALUE -> Integer.MIN_VALUE; +Infinity or >= Integer.MAX_VALUE -> Integer.MAX_VALUE) with the help of a few temporary registers to store intermediate results. > > This change uses the new AVX10.2 scalar (VCVTTSS2SIS or VCVTTSD2SIS) and vector (VCVTTPS2QQS, VCVTTPS2DQS, VCVTTPD2QQS, and VCVTTPD2DQS) instructions on supported platforms to avoid the extra handling described above. Also, the JTREG tests listed below were used to verify correctness with `-XX:-UseSuperWord` / `-XX:+UseSuperWord` options to exercise both scalar and vector paths. The baseline build used is [OpenJDK v26-b11](https://github.com/openjdk/jdk/releases/tag/jdk-26%2B11). > > 1. `jtreg:test/hotspot/jtreg/compiler/codegen/TestByteDoubleVect.java` > 2. `jtreg:test/hotspot/jtreg/compiler/codegen/TestByteFloatVect.java` > 3. `jtreg:test/hotspot/jtreg/compiler/codegen/TestIntDoubleVect.java` > 4. `jtreg:test/hotspot/jtreg/compiler/codegen/TestIntFloatVect.java` > 5. `jtreg:test/hotspot/jtreg/compiler/codegen/TestLongDoubleVect.java` > 6. `jtreg:test/hotspot/jtreg/compiler/codegen/TestLongFloatVect.java` > 7. 
`jtreg:test/hotspot/jtreg/compiler/codegen/TestShortDoubleVect.java` > 8. `jtreg:test/hotspot/jtreg/compiler/codegen/TestShortFloatVect.java` > 9. `jtreg:test/hotspot/jtreg/compiler/vectorapi/VectorFPtoIntCastTest.java` > 10. `jtreg:test/hotspot/jtreg/com... Mohamed Issa has updated the pull request incrementally with one additional commit since the last revision: Add extra constraints to vector floating point conversion instruction predicates and tests ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26919/files - new: https://git.openjdk.org/jdk/pull/26919/files/df175756..025d815f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26919&range=12 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26919&range=11-12 Stats: 11 lines in 3 files changed: 9 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/26919.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26919/head:pull/26919 PR: https://git.openjdk.org/jdk/pull/26919 From vlivanov at openjdk.org Fri Sep 12 20:53:30 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Fri, 12 Sep 2025 20:53:30 GMT Subject: RFR: 8367333: C2: Vector math operation intrinsification failure Message-ID: As part of [JDK-8353786](https://bugs.openjdk.org/browse/JDK-8353786), C2 support for operations backed by the vector math library was completely removed. On JDK side, there is a special dispatching logic added to avoid intrinsic calls in `jdk.internal.vm.vector.VectorSupport`. But it's still possible to observe such paradoxical situations (intrinsic calls with obsolete operation IDs) when processing effectively dead code. Consider `FloatVector::lanewiseTemplate`: FloatVector lanewiseTemplate(VectorOperators.Unary op) { if (opKind(op, VO_SPECIAL)) { ... 
else if (opKind(op, VO_MATHLIB)) { return unaryMathOp(op); } } int opc = opCode(op); return VectorSupport.unaryOp(opc, ...); } At runtime, `unaryMathOp` is unconditionally invoked, but during compilation it's possible to end up with an intrinsification attempt of `VectorSupport.unaryOp()` before `opKind(op, VO_SPECIAL)` is inlined. It can be reliably reproduced with the `-XX:+StressIncrementalInlining` flag. The fix is to fail-fast intrinsification rather than crashing the VM. Testing: tier1 - tier4 ------------- Commit messages: - fix Changes: https://git.openjdk.org/jdk/pull/27263/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27263&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8367333 Stats: 168 lines in 3 files changed: 168 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/27263.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27263/head:pull/27263 PR: https://git.openjdk.org/jdk/pull/27263 From dlong at openjdk.org Fri Sep 12 22:15:12 2025 From: dlong at openjdk.org (Dean Long) Date: Fri, 12 Sep 2025 22:15:12 GMT Subject: RFR: 8367397: Improve naming and terminology in regmask.hpp and regmask.cpp [v7] In-Reply-To: References: Message-ID: On Fri, 12 Sep 2025 13:08:01 GMT, Daniel Lundén wrote: >> Some names in `regmask.hpp` and `regmask.cpp` are unclear and should be improved. >> >> ### Changeset >> >> - Rename `RM_SIZE` to `RM_SIZE_IN_INTS` and `_RM_I` to `_RM_INT` to make it clear that these refer to integer-sized (32-bit) array elements. >> - Rename `_RM_SIZE` to `_RM_SIZE_IN_WORDS` and `_RM_UP` to `_RM_WORD` to make it clear that these refer to machine-word-sized (32 or 64 bits depending on platform) array elements. >> - Rename `_RM_MAX` to `_RM_WORD_MAX_INDEX` for clarity. >> - Rename `is_AllStack` to `is_infinite` (and related resulting changes in comments and local variables). 
The old terminology "all-stack", referring to the infinite register mask bits, is misleading (as pointed out by @eme64 in https://github.com/openjdk/jdk/pull/20404#discussion_r2316234008). The reason is that the infinite bits do not represent *all* stack bits. Some stack bits are instead part of the non-infinite bits of the register mask. >> >> ### Testing >> >> - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/17638365968) >> - `tier1` and HotSpot parts of `tier2` and `tier3` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. > > Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: > > Rename InfiniteStack_size and fix style of touched code Marked as reviewed by dlong (Reviewer). src/hotspot/share/opto/regmask.hpp line 173: > 171: } > 172: > 173: void set_infinite() { Suggestion: void set_infinite_stack() { For consistency with `is_infinite_stack()`. ------------- PR Review: https://git.openjdk.org/jdk/pull/27215#pullrequestreview-3218936271 PR Review Comment: https://git.openjdk.org/jdk/pull/27215#discussion_r2345519216 From dlong at openjdk.org Fri Sep 12 22:25:18 2025 From: dlong at openjdk.org (Dean Long) Date: Fri, 12 Sep 2025 22:25:18 GMT Subject: RFR: 8367397: Improve naming and terminology in regmask.hpp and regmask.cpp [v4] In-Reply-To: References: <19P8X88PcVoh8x62iBz5baOyefoWgGp-aFd_Bli-vm0=.fbd446d3-0dad-41e8-bb13-ab113a3b9767@github.com> <-hb0UM3-WL9h72oPsOvK5NA2-pYCyDQZS42R7FzPJ3s=.5dbaea39-391e-4e25-9e9e-f97940d3bc06@github.com> Message-ID: On Fri, 12 Sep 2025 12:40:29 GMT, Emanuel Peter wrote: >> I renamed this entire group of constants to use the same style (uppercase separated by `_`, without leading `_`). It is now `WORD_BIT_MASK`. I think it makes more sense to use the same style across `regmask.hpp`, rather than following styles in other files. > > Nice! 
I would have been OK with CamelCase, but then I would have wondered why the constant wasn't defined in globalDefinitions.hpp instead :-) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27215#discussion_r2345530581 From dlong at openjdk.org Fri Sep 12 22:44:26 2025 From: dlong at openjdk.org (Dean Long) Date: Fri, 12 Sep 2025 22:44:26 GMT Subject: RFR: 8327963: C2: fix construction of memory graph around Initialize node to prevent incorrect execution if allocation is removed [v12] In-Reply-To: References: <3jUFOPYDIqmzEywhzf58guwS0qZGBUCMZ3lXeltlS3c=.5c82601f-cf4d-4b2a-a525-1f8f4c7c4a3b@github.com> Message-ID: <_YXE9yfxaouyeyMsdurEy_uEx0FJDbGcX8M8L7aDqm0=.770ff0aa-8ae3-46ac-8cc1-7d38710e859e@github.com> On Fri, 12 Sep 2025 07:26:20 GMT, Roland Westrelin wrote: >> src/hotspot/share/opto/loopTransform.cpp line 3992: >> >>> 3990: Node* frame = new ParmNode(C->start(), TypeFunc::FramePtr); >>> 3991: _igvn.register_new_node_with_optimizer(frame); >>> 3992: call->init_req(TypeFunc::FramePtr, frame); >> >> This seems unrelated. Is it needed? > > It's one of the things mentioned in that comment: > https://github.com/openjdk/jdk/pull/24570#issuecomment-2883651987 > > "I added asserts to catch cases where proj_out is called but the node has more than one matching projection. With those asserts, I caught some false positive/cases where we got lucky and worked around them by reworking the code so it doesn't use proj_out. That's the case in PhaseIdealLoop::intrinsify_fill(): we can end up there with more than one FramePtr projection because the code pattern used elsewhere is to add one more projection and let identical projections common during igvn. " Are we just lucky that we don't have the same problem with ReturnAdr here? 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24570#discussion_r2345548490 From missa at openjdk.org Sat Sep 13 01:21:12 2025 From: missa at openjdk.org (Mohamed Issa) Date: Sat, 13 Sep 2025 01:21:12 GMT Subject: RFR: 8364305: Support AVX10 saturating floating point conversion instructions [v14] In-Reply-To: References: Message-ID: <4Eui7URmA1Y5NPrrV4813qb7UUsNVSRP-JSnPdX0Ojg=.4db7c50e-18cd-47ec-ae8c-4ae17597b286@github.com> > Intel® AVX10 ISA [1] extensions added new saturating floating point conversion instructions which comply with definitions in section 5.8 of the 2019 IEEE-754 standard. They can compute floating point to integral type conversions while also handling special inputs such as NaN, +Infinity, and -Infinity. > > Without AVX10.2, the current approach starts by converting the floating point value(s) in the source register to the desired integral value(s) in the destination register. In the scalar case, the CVTTSS2SI (single precision) or CVTTSD2SI (double precision) instruction is used. In the vector case, the CVTTPS2DQ (single precision) or CVTTPD2DQ (double precision) instruction is used. However, if the source contains a special value (NaN, -Infinity, +Infinity, <= Integer.MIN_VALUE, or >= Integer.MAX_VALUE), extra handling is required. The specific sequence of instructions involved depends on the source (single precision vs double precision), destination (long, integer, short, or byte), level of parallelization (scalar vs vector), and supported AVX extension type. Essentially though, the special values are mapped to values in the integer range (NaN -> 0; -Infinity or <= Integer.MIN_VALUE -> Integer.MIN_VALUE; +Infinity or >= Integer.MAX_VALUE -> Integer.MAX_VALUE) with the help of a few temporary registers to store intermediate results. 
> > This change uses the new AVX10.2 scalar (VCVTTSS2SIS or VCVTTSD2SIS) and vector (VCVTTPS2QQS, VCVTTPS2DQS, VCVTTPD2QQS, and VCVTTPD2DQS) instructions on supported platforms to avoid the extra handling described above. Also, the JTREG tests listed below were used to verify correctness with `-XX:-UseSuperWord` / `-XX:+UseSuperWord` options to exercise both scalar and vector paths. The baseline build used is [OpenJDK v26-b11](https://github.com/openjdk/jdk/releases/tag/jdk-26%2B11). > > 1. `jtreg:test/hotspot/jtreg/compiler/codegen/TestByteDoubleVect.java` > 2. `jtreg:test/hotspot/jtreg/compiler/codegen/TestByteFloatVect.java` > 3. `jtreg:test/hotspot/jtreg/compiler/codegen/TestIntDoubleVect.java` > 4. `jtreg:test/hotspot/jtreg/compiler/codegen/TestIntFloatVect.java` > 5. `jtreg:test/hotspot/jtreg/compiler/codegen/TestLongDoubleVect.java` > 6. `jtreg:test/hotspot/jtreg/compiler/codegen/TestLongFloatVect.java` > 7. `jtreg:test/hotspot/jtreg/compiler/codegen/TestShortDoubleVect.java` > 8. `jtreg:test/hotspot/jtreg/compiler/codegen/TestShortFloatVect.java` > 9. `jtreg:test/hotspot/jtreg/compiler/floatingpoint/ScalarFPtoIntCastTest.java` > 10. `jtreg:test/hotspot/jtreg... 
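The special-value mapping in the PR description quoted above corresponds to Java's narrowing primitive conversion rules (JLS 5.1.3), which every code path, with or without AVX10.2, has to preserve. The following is a minimal sketch of those Java-level semantics only; the class name `FpToIntSaturation` and its helper methods are illustrative and not part of the PR or its tests.

```java
public class FpToIntSaturation {
    // float -> int: NaN becomes 0, out-of-range values saturate.
    static int f2i(float f) {
        return (int) f;
    }

    // double -> long: same rules, saturating at the long bounds.
    static long d2l(double d) {
        return (long) d;
    }

    // float -> short: Java first converts to int (saturating), then keeps
    // only the low 16 bits, so the result is not saturated to the short range.
    static short f2s(float f) {
        return (short) f;
    }

    public static void main(String[] args) {
        System.out.println(f2i(Float.NaN));
        System.out.println(f2i(Float.POSITIVE_INFINITY) == Integer.MAX_VALUE);
        System.out.println(f2i(Float.NEGATIVE_INFINITY) == Integer.MIN_VALUE);
        System.out.println(d2l(1e300) == Long.MAX_VALUE);
        System.out.println(f2s(Float.POSITIVE_INFINITY));
    }
}
```

Running the relevant jtreg tests with `-XX:-UseSuperWord` and `-XX:+UseSuperWord`, as described above, checks that the scalar and vector code paths both agree with these results.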
Mohamed Issa has updated the pull request incrementally with one additional commit since the last revision: Introduce scalar floating point conversion tests with IR rules ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26919/files - new: https://git.openjdk.org/jdk/pull/26919/files/025d815f..5d26ff48 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26919&range=13 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26919&range=12-13 Stats: 262 lines in 3 files changed: 252 ins; 0 del; 10 mod Patch: https://git.openjdk.org/jdk/pull/26919.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26919/head:pull/26919 PR: https://git.openjdk.org/jdk/pull/26919 From epeter at openjdk.org Sat Sep 13 04:52:10 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Sat, 13 Sep 2025 04:52:10 GMT Subject: RFR: 8367397: Improve naming and terminology in regmask.hpp and regmask.cpp [v7] In-Reply-To: References: Message-ID: On Fri, 12 Sep 2025 13:08:01 GMT, Daniel Lundén wrote: >> Some names in `regmask.hpp` and `regmask.cpp` are unclear and should be improved. >> >> ### Changeset >> >> - Rename `RM_SIZE` to `RM_SIZE_IN_INTS` and `_RM_I` to `_RM_INT` to make it clear that these refer to integer-sized (32-bit) array elements. >> - Rename `_RM_SIZE` to `_RM_SIZE_IN_WORDS` and `_RM_UP` to `_RM_WORD` to make it clear that these refer to machine-word-sized (32 or 64 bits depending on platform) array elements. >> - Rename `_RM_MAX` to `_RM_WORD_MAX_INDEX` for clarity. >> - Rename `is_AllStack` to `is_infinite` (and related resulting changes in comments and local variables). The old terminology "all-stack", referring to the infinite register mask bits, is misleading (as pointed out by @eme64 in https://github.com/openjdk/jdk/pull/20404#discussion_r2316234008). The reason is that the infinite bits do not represent *all* stack bits. Some stack bits are instead part of the non-infinite bits of the register mask. 
>> >> ### Testing >> >> - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/17638365968) >> - `tier1` and HotSpot parts of `tier2` and `tier3` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. > > Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: > > Rename InfiniteStack_size and fix style of touched code Changes requested by epeter (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/27215#pullrequestreview-3219509250 From epeter at openjdk.org Sat Sep 13 04:52:12 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Sat, 13 Sep 2025 04:52:12 GMT Subject: RFR: 8367397: Improve naming and terminology in regmask.hpp and regmask.cpp [v7] In-Reply-To: References: Message-ID: <4TEpZlUksghJKcxz5Vd0kxvlt13fr-Z-4xdHD91NFtQ=.1ab10a84-1bb7-4704-a86b-dadbda0e38f4@github.com> On Fri, 12 Sep 2025 22:12:03 GMT, Dean Long wrote: >> Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: >> >> Rename InfiniteStack_size and fix style of touched code > > src/hotspot/share/opto/regmask.hpp line 173: > >> 171: } >> 172: >> 173: void set_infinite() { > > Suggestion: > > void set_infinite_stack() { > > For consistency with `is_infinite_stack()`. Yes, it should be `set_infinite_stack` in parallel with `is_infinite_stack`, nice catch! 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27215#discussion_r2345896440 From jbhateja at openjdk.org Sat Sep 13 08:40:27 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Sat, 13 Sep 2025 08:40:27 GMT Subject: RFR: 8364305: Support AVX10 saturating floating point conversion instructions [v14] In-Reply-To: <4Eui7URmA1Y5NPrrV4813qb7UUsNVSRP-JSnPdX0Ojg=.4db7c50e-18cd-47ec-ae8c-4ae17597b286@github.com> References: <4Eui7URmA1Y5NPrrV4813qb7UUsNVSRP-JSnPdX0Ojg=.4db7c50e-18cd-47ec-ae8c-4ae17597b286@github.com> Message-ID: On Sat, 13 Sep 2025 01:21:12 GMT, Mohamed Issa wrote: >> Intel® AVX10 ISA [1] extensions added new saturating floating point conversion instructions which comply with definitions in section 5.8 of the 2019 IEEE-754 standard. They can compute floating point to integral type conversions while also handling special inputs such as NaN, +Infinity, and -Infinity. >> >> Without AVX10.2, the current approach starts by converting the floating point value(s) in the source register to the desired integral value(s) in the destination register. In the scalar case, the CVTTSS2SI (single precision) or CVTTSD2SI (double precision) instruction is used. In the vector case, the CVTTPS2DQ (single precision) or CVTTPD2DQ (double precision) instruction is used. However, if the source contains a special value (NaN, -Infinity, +Infinity, <= Integer.MIN_VALUE, or >= Integer.MAX_VALUE), extra handling is required. The specific sequence of instructions involved depends on the source (single precision vs double precision), destination (long, integer, short, or byte), level of parallelization (scalar vs vector), and supported AVX extension type. Essentially though, the special values are mapped to values in the integer range (NaN -> 0; -Infinity or <= Integer.MIN_VALUE -> Integer.MIN_VALUE; +Infinity or >= Integer.MAX_VALUE -> Integer.MAX_VALUE) with the help of a few temporary registers to store intermediate results. 
>> >> This change uses the new AVX10.2 scalar (VCVTTSS2SIS or VCVTTSD2SIS) and vector (VCVTTPS2QQS, VCVTTPS2DQS, VCVTTPD2QQS, and VCVTTPD2DQS) instructions on supported platforms to avoid the extra handling described above. Also, the JTREG tests listed below were used to verify correctness with `-XX:-UseSuperWord` / `-XX:+UseSuperWord` options to exercise both scalar and vector paths. The baseline build used is [OpenJDK v26-b11](https://github.com/openjdk/jdk/releases/tag/jdk-26%2B11). >> >> 1. `jtreg:test/hotspot/jtreg/compiler/codegen/TestByteDoubleVect.java` >> 2. `jtreg:test/hotspot/jtreg/compiler/codegen/TestByteFloatVect.java` >> 3. `jtreg:test/hotspot/jtreg/compiler/codegen/TestIntDoubleVect.java` >> 4. `jtreg:test/hotspot/jtreg/compiler/codegen/TestIntFloatVect.java` >> 5. `jtreg:test/hotspot/jtreg/compiler/codegen/TestLongDoubleVect.java` >> 6. `jtreg:test/hotspot/jtreg/compiler/codegen/TestLongFloatVect.java` >> 7. `jtreg:test/hotspot/jtreg/compiler/codegen/TestShortDoubleVect.java` >> 8. `jtreg:test/hotspot/jtreg/compiler/codegen/TestShortFloatVect.java` >> 9. `jtreg:test/hotspot/jtreg/compiler/floatingpoint/ScalarFPtoIntCastTest.java`... > > Mohamed Issa has updated the pull request incrementally with one additional commit since the last revision: > > Introduce scalar floating point conversion tests with IR rules test/hotspot/jtreg/compiler/floatingpoint/ScalarFPtoIntCastTest.java line 70: > 68: float_arr[i] = ran.nextFloat(floor_val, ceil_val); > 69: double_arr[i] = ran.nextDouble(floor_val, ceil_val); > 70: } Please use Generators instead of direct initialization. test/hotspot/jtreg/compiler/floatingpoint/ScalarFPtoIntCastTest.java line 89: > 87: if (int_arr[i] != expected) { > 88: throw new RuntimeException("Invalid result: int_arr[" + i + "] = " + int_arr[i] + " != " + expected); > 89: } Use Verify.checkEQ instead. 
test/hotspot/jtreg/compiler/floatingpoint/ScalarFPtoIntCastTest.java line 109: > 107: if (long_arr[i] != expected) { > 108: throw new RuntimeException("Invalid result: long_arr[" + i + "] = " + long_arr[i] + " != " + expected); > 109: } Use Verify.checkEQ, checkout relevant code in https://github.com/openjdk/jdk/tree/master/test/hotspot/jtreg/compiler/lib and their usages test/hotspot/jtreg/compiler/floatingpoint/ScalarFPtoIntCastTest.java line 122: > 120: checkf2short(); > 121: } > 122: What is the reason behind additional level of abstraction when now manually inline this code. test/hotspot/jtreg/compiler/floatingpoint/ScalarFPtoIntCastTest.java line 138: > 136: applyIfCPUFeature = {"avx10_2", "false"}) > 137: @IR(counts = {IRNode.X86_SCONV_F2I_AVX10, "> 0"}, > 138: applyIfCPUFeature = {"avx10_2", "true"}) These IR rules apply to the CompilePhase.MATCHING ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2346076729 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2346078144 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2346080343 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2346082356 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2346097798 From hgreule at openjdk.org Sun Sep 14 14:44:02 2025 From: hgreule at openjdk.org (Hannes Greule) Date: Sun, 14 Sep 2025 14:44:02 GMT Subject: RFR: 8356813: Improve Mod(I|L)Node::Value [v9] In-Reply-To: <2Jf_gfvRlKcmCFoQHp5T0WW_fU_yK5-0Z3z41f00-YU=.164be9f0-fae1-44bb-84c3-846d8c2c0db2@github.com> References: <2Jf_gfvRlKcmCFoQHp5T0WW_fU_yK5-0Z3z41f00-YU=.164be9f0-fae1-44bb-84c3-846d8c2c0db2@github.com> Message-ID: <1ZCEMsPvSQaLGWRuNtO89LNP_XUeaz-edeIUrKwRCZY=.9dad5a02-c739-4e24-8692-8941f31e5a49@github.com> > This change improves the precision of the `Mod(I|L)Node::Value()` functions. > > I reordered the structure a bit. First, we handle constants, afterwards, we handle ranges. 
The bottom checks seem to be excessive (`Type::BOTTOM` is covered by using `isa_(int|long)()`, the local bottom is just the full range). Given we can even give reasonable bounds if only one input has any bounds, we don't want to return early. > The changes after that are commented. Please let me know if the explanations are good, or if you have any suggestions. > > ### Monotonicity > > Before, a 0 divisor resulted in `Type(Int|Long)::POS`. Initially I wanted to keep it this way, but that violates monotonicity during PhaseCCP. As an example, if we see a 0 divisor first and a 3 afterwards, we might try to go from `>=0` to `-2..2`, but the meet of these would be `>=-2` rather than `-2..2`. I now use `Type(Int|Long)::ZERO` instead (zero is always in the resulting value if we cover a range). > > ### Testing > > I added tests for cases around the relevant bounds. I also ran tier1, tier2, and tier3 but didn't see any related failures after addressing the monotonicity problem described above (I'm having a few unrelated failures on my system currently, so separate testing would be appreciated in case I missed something). > > Please review and let me know what you think. > > ### Other > > The `UMod(I|L)Node`s were adjusted to be more in line with their signed variants. This change diverges them again, but similar improvements could be made after #17508. > > While experimenting with these changes, I stumbled upon a few things that aren't directly related to this change, but might be worth looking into further: > - If the divisor is a constant, we will directly replace the `Mod(I|L)Node` with more, but less expensive, nodes in `::Ideal()`. Type analysis for these nodes combined is less precise, which means we miss potential cases where this would help, e.g., removing range checks. Would it make sense to delay the replacement? > - To force non-negative ranges, I'm using `char`. I noticed that method parameters of sub-int integer types all fall back to `TypeInt::INT`.
This seems to be an intentional change of https://github.com/openjdk/jdk/commit/200784d505dd98444c48c9ccb7f2e4df36dcbb6a. The bug report is private, so I can't really judge if that part is necessary, but it seems odd. Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: remove unused parameter ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25254/files - new: https://git.openjdk.org/jdk/pull/25254/files/41d0e2c7..96602c67 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25254&range=08 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25254&range=07-08 Stats: 3 lines in 1 file changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/25254.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25254/head:pull/25254 PR: https://git.openjdk.org/jdk/pull/25254 From hgreule at openjdk.org Sun Sep 14 14:44:04 2025 From: hgreule at openjdk.org (Hannes Greule) Date: Sun, 14 Sep 2025 14:44:04 GMT Subject: RFR: 8356813: Improve Mod(I|L)Node::Value [v8] In-Reply-To: <19JdaOkvM92QSjXvYVr1CNSXD5hkXINl1gh6qj-DCMQ=.6b268ebd-6c9a-4b33-b355-1dc41de53454@github.com> References: <2Jf_gfvRlKcmCFoQHp5T0WW_fU_yK5-0Z3z41f00-YU=.164be9f0-fae1-44bb-84c3-846d8c2c0db2@github.com> <19JdaOkvM92QSjXvYVr1CNSXD5hkXINl1gh6qj-DCMQ=.6b268ebd-6c9a-4b33-b355-1dc41de53454@github.com> Message-ID: On Thu, 11 Sep 2025 17:42:46 GMT, Hannes Greule wrote: >> This change improves the precision of the `Mod(I|L)Node::Value()` functions. >> >> I reordered the structure a bit. First, we handle constants, afterwards, we handle ranges. The bottom checks seem to be excessive (`Type::BOTTOM` is covered by using `isa_(int|long)()`, the local bottom is just the full range). Given we can even give reasonable bounds if only one input has any bounds, we don't want to return early. >> The changes after that are commented. Please let me know if the explanations are good, or if you have any suggestions. 
>> >> ### Monotonicity >> >> Before, a 0 divisor resulted in `Type(Int|Long)::POS`. Initially I wanted to keep it this way, but that violates monotonicity during PhaseCCP. As an example, if we see a 0 divisor first and a 3 afterwards, we might try to go from `>=0` to `-2..2`, but the meet of these would be `>=-2` rather than `-2..2`. I now use `Type(Int|Long)::ZERO` instead (zero is always in the resulting value if we cover a range). >> >> ### Testing >> >> I added tests for cases around the relevant bounds. I also ran tier1, tier2, and tier3 but didn't see any related failures after addressing the monotonicity problem described above (I'm having a few unrelated failures on my system currently, so separate testing would be appreciated in case I missed something). >> >> Please review and let me know what you think. >> >> ### Other >> >> The `UMod(I|L)Node`s were adjusted to be more in line with their signed variants. This change diverges them again, but similar improvements could be made after #17508. >> >> While experimenting with these changes, I stumbled upon a few things that aren't directly related to this change, but might be worth looking into further: >> - If the divisor is a constant, we will directly replace the `Mod(I|L)Node` with more, but less expensive, nodes in `::Ideal()`. Type analysis for these nodes combined is less precise, which means we miss potential cases where this would help, e.g., removing range checks. Would it make sense to delay the replacement? >> - To force non-negative ranges, I'm using `char`. I noticed that method parameters of sub-int integer types all fall back to `TypeInt::INT`. This seems to be an intentional change of https://github.com/openjdk/jdk/commit/200784d505dd98444c48c9ccb7f2e4df36dcbb6a. The bug report is private, so I can't really judge if that part is necessary, but it seems odd.
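The `-2..2` range in the monotonicity example, and the use of `char` to force a non-negative dividend, both follow from how Java's `%` behaves: the result takes the sign of the dividend and its magnitude is smaller than the divisor's. A small brute-force sketch of that behavior (the class and method names are hypothetical, and this is not the `Value()` logic from the PR):

```java
// Illustrative sketch only; ModRangeSketch is a hypothetical name.
// It observes the range of x % divisor by brute force over a small interval.
final class ModRangeSketch {
    // Returns {min, max} of x % divisor for all x in [lo, hi].
    // Assumes hi < Integer.MAX_VALUE so the loop terminates, and divisor != 0.
    static int[] observedRange(int lo, int hi, int divisor) {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int x = lo; x <= hi; x++) {
            int r = x % divisor;
            min = Math.min(min, r);
            max = Math.max(max, r);
        }
        return new int[] {min, max};
    }

    public static void main(String[] args) {
        int[] signed = observedRange(-100, 100, 3);              // dividend of either sign
        int[] nonNegative = observedRange(0, Character.MAX_VALUE, 3); // char-like dividend
        System.out.println(signed[0] + ".." + signed[1]);           // -2..2
        System.out.println(nonNegative[0] + ".." + nonNegative[1]); // 0..2
    }
}
```

A non-negative dividend narrows the result to `0..2`, which is the kind of bound that can matter for range-check elimination.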
> > Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: > > address comments I noticed one parameter was unused, I removed it now. This shouldn't affect testing I guess. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25254#issuecomment-3289599649 From duke at openjdk.org Mon Sep 15 02:22:46 2025 From: duke at openjdk.org (erifan) Date: Mon, 15 Sep 2025 02:22:46 GMT Subject: RFR: 8354242: VectorAPI: combine vector not operation with compare [v12] In-Reply-To: References: Message-ID: > This patch optimizes the following patterns: > For integer types: > > (XorV (VectorMaskCmp src1 src2 cond) (Replicate -1)) > => (VectorMaskCmp src1 src2 ncond) > (XorVMask (VectorMaskCmp src1 src2 cond) (MaskAll m1)) > => (VectorMaskCmp src1 src2 ncond) > > cond can be eq, ne, le, ge, lt, gt, ule, uge, ult and ugt, ncond is the negative comparison of cond. > > For float and double types: > > (XorV (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (Replicate -1)) > => (VectorMaskCast (VectorMaskCmp src1 src2 ncond)) > (XorVMask (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (MaskAll m1)) > => (VectorMaskCast (VectorMaskCmp src1 src2 ncond)) > > cond can be eq or ne. 
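The restriction to eq and ne for floating-point types is not explained in the message; a likely reason (an assumption here, not something the author states) is NaN. For integers, negating a comparison always yields the opposite comparison, but for floats `!(a < b)` and `a >= b` disagree when an input is NaN, while `!(a == b)` and `a != b` still agree. A scalar sketch with hypothetical names:

```java
// Illustrative sketch only; NegatedCompareSketch is a hypothetical name and
// models the comparison identities on scalars, not on vectors.
final class NegatedCompareSketch {
    // For integers, NOT (a < b) is always the same as (a >= b).
    static boolean intLtNegationHolds(int a, int b) {
        return (!(a < b)) == (a >= b);
    }

    // For floats, NOT (a < b) and (a >= b) differ when either input is NaN.
    static boolean floatLtNegationHolds(float a, float b) {
        return (!(a < b)) == (a >= b);
    }

    // NOT (a == b) matches (a != b) even when an input is NaN.
    static boolean floatEqNegationHolds(float a, float b) {
        return (!(a == b)) == (a != b);
    }

    public static void main(String[] args) {
        System.out.println(intLtNegationHolds(1, 2));               // true
        System.out.println(floatLtNegationHolds(Float.NaN, 1.0f));  // false
        System.out.println(floatEqNegationHolds(Float.NaN, 1.0f));  // true
    }
}
```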
> > Benchmarks on Nvidia Grace machine with 128-bit SVE2: With option `-XX:UseSVE=2`: > > Benchmark Unit Before Score Error After Score Error Uplift > testCompareEQMaskNotByte ops/s 7912127.225 2677.289518 10266136.26 8955.008548 1.29 > testCompareEQMaskNotDouble ops/s 884737.6799 446.963779 1179760.772 448.031844 1.33 > testCompareEQMaskNotFloat ops/s 1765045.787 682.332214 2359520.803 896.305743 1.33 > testCompareEQMaskNotInt ops/s 1787221.411 977.743935 2353952.519 960.069976 1.31 > testCompareEQMaskNotLong ops/s 895297.1974 673.44808 1178449.02 323.804205 1.31 > testCompareEQMaskNotShort ops/s 3339987.002 3415.2226 4712761.965 2110.862053 1.41 > testCompareGEMaskNotByte ops/s 7907615.16 4094.243652 10251646.9 9486.699831 1.29 > testCompareGEMaskNotInt ops/s 1683738.958 4233.813092 2352855.205 1251.952546 1.39 > testCompareGEMaskNotLong ops/s 854496.1561 8594.598885 1177811.493 521.1229 1.37 > testCompareGEMaskNotShort ops/s 3341860.309 1578.975338 4714008.434 1681.10365 1.41 > testCompareGTMaskNotByte ops/s 7910823.674 2993.367032 10245063.58 9774.75138 1.29 > testCompareGTMaskNotInt ops/s 1673393.928 3153.099431 2353654.521 1190.848583 1.4 > testCompareGTMaskNotLong ops/s 849405.9159 2432.858159 1177952.041 359.96413 1.38 > testCompareGTMaskNotShort ops/s 3339509.141 3339.976585 4711442.496 2673.364893 1.41 > testCompareLEMaskNotByte ops/s 7911340.004 3114.69191 10231626.5 27134.20035 1.29 > testCompareLEMaskNotInt ops/s 1675812.113 1340.969885 2353255.341 1452.4522 1.4 > testCompareLEMaskNotLong ops/s 848862.8036 6564.841731 1177763.623 539.290106 1.38 > testCompareLEMaskNotShort ops/s 3324951.54 2380.29473 4712116.251 1544.559684 1.41 > testCompareLTMaskNotByte ops/s 7910390.844 2630.861436 10239567.69 6487.441672 1.29 > testCompareLTMaskNotInt ops/s 1672180.09 995.238142 2353757.863 853.774734 1.4 > testCompareLTMaskNotLong ops/s 856502.26... erifan has updated the pull request with a new target base due to a merge or a rebase. 
The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 20 additional commits since the last revision: - Simplify JMH testing - Merge branch 'master' into JDK-8354242 - Update the code comment - Align indentation - Merge branch 'master' into JDK-8354242 - Address more comments ATT. - Merge branch 'master' into JDK-8354242 - Support negating unsigned comparison for BoolTest::mask Added a static method `negate_mask(mask btm)` into BoolTest class to negate both signed and unsigned comparison. - Addressed some review comments - Merge branch 'master' into JDK-8354242 - ... and 10 more: https://git.openjdk.org/jdk/compare/4d660b21...52bbd3cd ------------- Changes: - all: https://git.openjdk.org/jdk/pull/24674/files - new: https://git.openjdk.org/jdk/pull/24674/files/04142a19..52bbd3cd Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=24674&range=11 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24674&range=10-11 Stats: 129948 lines in 3408 files changed: 76187 ins; 35380 del; 18381 mod Patch: https://git.openjdk.org/jdk/pull/24674.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24674/head:pull/24674 PR: https://git.openjdk.org/jdk/pull/24674 From duke at openjdk.org Mon Sep 15 02:30:21 2025 From: duke at openjdk.org (erifan) Date: Mon, 15 Sep 2025 02:30:21 GMT Subject: RFR: 8354242: VectorAPI: combine vector not operation with compare [v11] In-Reply-To: References: Message-ID: On Tue, 9 Sep 2025 13:03:03 GMT, Emanuel Peter wrote: >> erifan has updated the pull request incrementally with one additional commit since the last revision: >> >> Update the code comment > > test/hotspot/jtreg/compiler/vectorapi/VectorMaskCompareNotTest.java line 1007: > >> 1005: testCompareMaskNotFloat(F_SPECIES, VectorOperators.NE, fa, fninf, (m) -> { return F_SPECIES.maskAll(true).xor(m); }); >> 1006: verifyResultsFloat(F_SPECIES, VectorOperators.NE, fa, fninf); >> 1007: } > > Do you have test cases for the cases 
other than `EQ` and `NE`? After all, we don't want someone to accidentally mess with the logic you implemented later without us noticing the bug ;) For `float` and `double`, only `EQ` and `NE` are supported. So the positive test only includes these two OPs. And we have one negative test for other unsupported OPs, see `testCompareMaskNotFloatNegative`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2347761055 From duke at openjdk.org Mon Sep 15 03:34:19 2025 From: duke at openjdk.org (erifan) Date: Mon, 15 Sep 2025 03:34:19 GMT Subject: RFR: 8366333: AArch64: Enhance SVE subword type implementation of vector compress In-Reply-To: References: Message-ID: On Thu, 11 Sep 2025 06:10:59 GMT, Galder Zamarreño wrote: >> The AArch64 SVE and SVE2 architectures lack an instruction suitable for subword-type `compress` operations. Therefore, the current implementation uses the 32-bit SVE `compact` instruction to compress subword types by first widening the high and low parts to 32 bits, compressing them, and then narrowing them back to their original type. Finally, the high and low parts are merged using the `index + tbl` instructions. >> >> This approach is significantly slower compared to architectures with native support. After evaluating all available AArch64 SVE instructions and experimenting with various implementations (such as looping over the active elements, extraction, and insertion), I confirmed that the existing algorithm is optimal given the instruction set. However, there is still room for optimization in the following two aspects: >> 1. Merging with `index + tbl` is suboptimal due to the high latency of the `index` instruction. >> 2. For partial subword types, operations to the highest half are unnecessary because those bits are invalid. >> >> This pull request introduces the following changes: >> 1. Replaces `index + tbl` with the `whilelt + splice` instructions, which offer lower latency and higher throughput. >> 2.
Eliminates unnecessary compress operations for partial subword type cases. >> 3. For `sve_compress_byte`, one less temporary register is used to alleviate potential register pressure. >> >> Benchmark results demonstrate that these changes significantly improve performance. >> >> Benchmarks on Nvidia Grace machine with 128-bit SVE: >> >> Benchmark Unit Before Error After Error Uplift >> Byte128Vector.compress ops/ms 4846.97 26.23 6638.56 31.60 1.36 >> Byte64Vector.compress ops/ms 2447.69 12.95 7167.68 34.49 2.92 >> Short128Vector.compress ops/ms 7174.88 40.94 8398.45 9.48 1.17 >> Short64Vector.compress ops/ms 3618.72 3.04 8618.22 10.91 2.38 >> >> >> This PR was tested on 128-bit, 256-bit, and 512-bit SVE environments, and all tests passed. > > src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2292: > >> 2290: // Return if the vector length is no more than MaxVectorSize/2, since the >> 2291: // highest half is invalid. >> 2292: if (vector_length_in_bytes <= (MaxVectorSize >> 1)) { > > Couldn't this check be done first thing when the function is called? Then you would avoid unnecessary work? > > I also wonder if this check should be done before `sve_compress_byte` is called, but I think at the very least it should be done first thing in this function. We need to do the lower half, so I think there's no unnecessary work. 
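For context on the operation under discussion, the Vector API `compress` packs the mask-selected lanes into the lowest lanes of the result and zeroes the rest. A scalar model of that behavior, as a sketch only (the class name is hypothetical, and it says nothing about the SVE instruction sequence in the patch):

```java
import java.util.Arrays;

// Illustrative sketch only; CompressSketch is a hypothetical name that models
// the lane-level semantics of compress on plain arrays.
final class CompressSketch {
    // Copies src[i] for every set mask[i] into consecutive low positions of the
    // result; the remaining positions stay zero. Assumes equal array lengths.
    static byte[] compress(byte[] src, boolean[] mask) {
        byte[] dst = new byte[src.length];
        int next = 0;
        for (int i = 0; i < src.length; i++) {
            if (mask[i]) {
                dst[next++] = src[i];
            }
        }
        return dst;
    }

    public static void main(String[] args) {
        byte[] src = {1, 2, 3, 4};
        boolean[] mask = {true, false, true, false};
        System.out.println(Arrays.toString(compress(src, mask))); // [1, 3, 0, 0]
    }
}
```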
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27188#discussion_r2347805971 From duke at openjdk.org Mon Sep 15 05:43:11 2025 From: duke at openjdk.org (erifan) Date: Mon, 15 Sep 2025 05:43:11 GMT Subject: RFR: 8354242: VectorAPI: combine vector not operation with compare [v13] In-Reply-To: References: Message-ID: > This patch optimizes the following patterns: > For integer types: > > (XorV (VectorMaskCmp src1 src2 cond) (Replicate -1)) > => (VectorMaskCmp src1 src2 ncond) > (XorVMask (VectorMaskCmp src1 src2 cond) (MaskAll m1)) > => (VectorMaskCmp src1 src2 ncond) > > cond can be eq, ne, le, ge, lt, gt, ule, uge, ult and ugt, ncond is the negative comparison of cond. > > For float and double types: > > (XorV (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (Replicate -1)) > => (VectorMaskCast (VectorMaskCmp src1 src2 ncond)) > (XorVMask (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (MaskAll m1)) > => (VectorMaskCast (VectorMaskCmp src1 src2 ncond)) > > cond can be eq or ne. 
> > Benchmarks on Nvidia Grace machine with 128-bit SVE2: With option `-XX:UseSVE=2`: > > Benchmark Unit Before Score Error After Score Error Uplift > testCompareEQMaskNotByte ops/s 7912127.225 2677.289518 10266136.26 8955.008548 1.29 > testCompareEQMaskNotDouble ops/s 884737.6799 446.963779 1179760.772 448.031844 1.33 > testCompareEQMaskNotFloat ops/s 1765045.787 682.332214 2359520.803 896.305743 1.33 > testCompareEQMaskNotInt ops/s 1787221.411 977.743935 2353952.519 960.069976 1.31 > testCompareEQMaskNotLong ops/s 895297.1974 673.44808 1178449.02 323.804205 1.31 > testCompareEQMaskNotShort ops/s 3339987.002 3415.2226 4712761.965 2110.862053 1.41 > testCompareGEMaskNotByte ops/s 7907615.16 4094.243652 10251646.9 9486.699831 1.29 > testCompareGEMaskNotInt ops/s 1683738.958 4233.813092 2352855.205 1251.952546 1.39 > testCompareGEMaskNotLong ops/s 854496.1561 8594.598885 1177811.493 521.1229 1.37 > testCompareGEMaskNotShort ops/s 3341860.309 1578.975338 4714008.434 1681.10365 1.41 > testCompareGTMaskNotByte ops/s 7910823.674 2993.367032 10245063.58 9774.75138 1.29 > testCompareGTMaskNotInt ops/s 1673393.928 3153.099431 2353654.521 1190.848583 1.4 > testCompareGTMaskNotLong ops/s 849405.9159 2432.858159 1177952.041 359.96413 1.38 > testCompareGTMaskNotShort ops/s 3339509.141 3339.976585 4711442.496 2673.364893 1.41 > testCompareLEMaskNotByte ops/s 7911340.004 3114.69191 10231626.5 27134.20035 1.29 > testCompareLEMaskNotInt ops/s 1675812.113 1340.969885 2353255.341 1452.4522 1.4 > testCompareLEMaskNotLong ops/s 848862.8036 6564.841731 1177763.623 539.290106 1.38 > testCompareLEMaskNotShort ops/s 3324951.54 2380.29473 4712116.251 1544.559684 1.41 > testCompareLTMaskNotByte ops/s 7910390.844 2630.861436 10239567.69 6487.441672 1.29 > testCompareLTMaskNotInt ops/s 1672180.09 995.238142 2353757.863 853.774734 1.4 > testCompareLTMaskNotLong ops/s 856502.26... 
erifan has updated the pull request incrementally with one additional commit since the last revision: Add an IR rule for vector mask cast operation ------------- Changes: - all: https://git.openjdk.org/jdk/pull/24674/files - new: https://git.openjdk.org/jdk/pull/24674/files/52bbd3cd..56bb34ff Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=24674&range=12 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24674&range=11-12 Stats: 40 lines in 1 file changed: 40 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/24674.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24674/head:pull/24674 PR: https://git.openjdk.org/jdk/pull/24674 From duke at openjdk.org Mon Sep 15 05:43:14 2025 From: duke at openjdk.org (erifan) Date: Mon, 15 Sep 2025 05:43:14 GMT Subject: RFR: 8354242: VectorAPI: combine vector not operation with compare [v11] In-Reply-To: References: Message-ID: On Tue, 9 Sep 2025 13:04:48 GMT, Emanuel Peter wrote: >> erifan has updated the pull request incrementally with one additional commit since the last revision: >> >> Update the code comment > > test/hotspot/jtreg/compiler/vectorapi/VectorMaskCompareNotTest.java line 911: > >> 909: testCompareMaskNotLong(L_SPECIES_FOR_CAST, VectorOperators.UGE, (m) -> { return m.cast(I_SPECIES_FOR_CAST).not(); }); >> 910: verifyResultsLong(L_SPECIES_FOR_CAST, VectorOperators.UGE); >> 911: } > > You have some cast in here, and in similar tests. > Can you add an IR rule to check if we do or do not have the expected casts? Done. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2347920402 From duke at openjdk.org Mon Sep 15 05:46:36 2025 From: duke at openjdk.org (erifan) Date: Mon, 15 Sep 2025 05:46:36 GMT Subject: RFR: 8354242: VectorAPI: combine vector not operation with compare [v7] In-Reply-To: References: <15TW6hiffz65NhHevPefL_6swSC07UD-GwiJ4tPDtFs=.b83081df-8abd-4756-b4e0-1d969678a0d2@github.com> Message-ID: On Wed, 10 Sep 2025 07:43:20 GMT, Emanuel Peter wrote: >> Hi @eme64 @theRealAph @XiaohongGong @fg1417 @shqking , could you help take a look at this PR, thanks > > @erifan Sounds good. No rush, it takes as long as it takes. I'll soon be on vacation too and may not respond until mid of October. Hi @eme64 I have dealt with all of your suggestions except one that I think it has already been covered. Could you please have a look at this PR when you have a chance? Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/24674#issuecomment-3290572086 From duke at openjdk.org Mon Sep 15 05:55:43 2025 From: duke at openjdk.org (erifan) Date: Mon, 15 Sep 2025 05:55:43 GMT Subject: RFR: 8363989: AArch64: Add missing backend support of VectorAPI expand operation [v4] In-Reply-To: References: Message-ID: > Currently, on AArch64, the VectorAPI `expand` operation is intrinsified for 32-bit and 64-bit types only when SVE2 is available. In the following cases, `expand` has not yet been intrinsified: > 1. **Subword types** on SVE2-capable hardware. > 2. **All types** on NEON and SVE1 environments. > > As a result, `expand` API performance is very poor in these scenarios. This patch intrinsifies the `expand` operation in the above environments. > > Since there are no native instructions directly corresponding to `expand` in these cases, this patch mainly leverages the `TBL` instruction to implement `expand`. To compute the index input for `TBL`, the prefix sum algorithm (see https://en.wikipedia.org/wiki/Prefix_sum) is used. 
Take a 128-bit byte vector on SVE2 as an example: > > To compute: dst = src.expand(mask) > Data direction: high <== low > Input: > src = p o n m l k j i h g f e d c b a > mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 > Expected result: > dst = 0 0 h g 0 0 f e 0 0 d c 0 0 b a > > Step 1: calculate the index input of the TBL instruction. > > // Set tmp1 as all 0 vector. > tmp1 = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 > > // Move the mask bits from the predicate register to a vector register. > // **1-bit** mask lane of P register to **8-bit** mask lane of V register. > tmp2 = mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 > > // Shift the entire register. Prefix sum algorithm. > dst = tmp2 << 8 = 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 > tmp2 += dst = 0 1 2 1 0 1 2 1 0 1 2 1 0 1 2 1 > > dst = tmp2 << 16 = 2 1 0 1 2 1 0 1 2 1 0 1 2 1 0 0 > tmp2 += dst = 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 > > dst = tmp2 << 32 = 2 2 2 2 2 2 2 2 2 2 2 1 0 0 0 0 > tmp2 += dst = 4 4 4 4 4 4 4 4 4 4 4 3 2 2 2 1 > > dst = tmp2 << 64 = 4 4 4 3 2 2 2 1 0 0 0 0 0 0 0 0 > tmp2 += dst = 8 8 8 7 6 6 6 5 4 4 4 3 2 2 2 1 > > // Clear inactive elements. > dst = sel(mask, tmp2, tmp1) = 0 0 8 7 0 0 6 5 0 0 4 3 0 0 2 1 > > // Set the inactive lane value to -1 and set the active lane to the target index. > dst -= 1 = -1 -1 7 6 -1 -1 5 4 -1 -1 3 2 -1 -1 1 0 > > Step 2: shuffle the source vector elements to the target vector > > tbl(dst, src, dst) = 0 0 h g 0 0 f e 0 0 d c 0 0 b a > > > The same algorithm is used for NEON and SVE1, but with different instructions where appropriate. > > The following benchmarks are from panama-... erifan has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains six commits: - Merge branch 'master' into JDK-8363989 - Align code example data for better reading - Merge branch 'master' into JDK-8363989 - Improve the comment of the vector expand implementation - Merge branch 'master' into JDK-8363989 - 8363989: AArch64: Add missing backend support of VectorAPI expand operation Currently, on AArch64, the VectorAPI `expand` operation is intrinsified for 32-bit and 64-bit types only when SVE2 is available. In the following cases, `expand` has not yet been intrinsified: 1. **Subword types** on SVE2-capable hardware. 2. **All types** on NEON and SVE1 environments. As a result, `expand` API performance is very poor in these scenarios. This patch intrinsifies the `expand` operation in the above environments. Since there are no native instructions directly corresponding to `expand` in these cases, this patch mainly leverages the `TBL` instruction to implement `expand`. To compute the index input for `TBL`, the prefix sum algorithm (see https://en.wikipedia.org/wiki/Prefix_sum) is used. Take a 128-bit byte vector on SVE2 as an example: ``` To compute: dst = src.expand(mask) Data direction: high <== low Input: src = p o n m l k j i h g f e d c b a mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 Expected result: dst = 0 0 h g 0 0 f e 0 0 d c 0 0 b a ``` Step 1: calculate the index input of the TBL instruction. ``` // Set tmp1 as all 0 vector. tmp1 = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 // Move the mask bits from the predicate register to a vector register. // **1-bit** mask lane of P register to **8-bit** mask lane of V register. tmp2 = mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 // Shift the entire register. Prefix sum algorithm. 
dst = tmp2 << 8 = 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 tmp2 += dst = 0 1 2 1 0 1 2 1 0 1 2 1 0 1 2 1 dst = tmp2 << 16 = 2 1 0 1 2 1 0 1 2 1 0 1 2 1 0 0 tmp2 += dst = 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 dst = tmp2 << 32 = 2 2 2 2 2 2 2 2 2 2 2 1 0 0 0 0 tmp2 += dst = 4 4 4 4 4 4 4 4 4 4 4 3 2 2 2 1 dst = tmp2 << 64 = 4 4 4 3 2 2 2 1 0 0 0 0 0 0 0 0 tmp2 += dst = 8 8 8 7 6 6 6 5 4 4 4 3 2 2 2 1 // Clear inactive elements. dst = sel(mask, tmp2, tmp1) = 0 0 8 7 0 0 6 5 0 0 4 3 0 0 2 1 // Set the inactive lane value to -1 and set the active lane to the target index. dst -= 1 = -1 -1 7 6 -1 -1 5 4 -1 -1 3 2 -1 -1 1 0 ``` Step 2: shuffle the source vector elements to the target vector ``` tbl(dst, src, dst) = 0 0 h g 0 0 f e 0 0 d c 0 0 b a ``` The same algorithm is used for NEON and SVE1, but with different instructions where appropriate. The following benchmarks are from panama-vector/vectorIntrinsics. On Nvidia Grace machine with option `-XX:UseSVE=2`: ``` Benchmark Unit Before Score Error After Score Error Uplift Byte128Vector.expand ops/ms 1791.022366 5.619883 9633.388683 1.968788 5.37 Double128Vector.expand ops/ms 4489.255846 0.48485 4488.772949 0.491596 0.99 Float128Vector.expand ops/ms 8863.02424 6.888087 8908.352235 51.487453 1 Int128Vector.expand ops/ms 8873.485683 3.275682 8879.635643 1.243863 1 Long128Vector.expand ops/ms 4485.1149 4.458073 4489.365269 0.851093 1 Short128Vector.expand ops/ms 792.068834 2.640398 5880.811288 6.40683 7.42 Byte64Vector.expand ops/ms 854.455002 8.548982 5999.046295 37.209987 7.02 Double64Vector.expand ops/ms 46.49763 0.104773 46.526043 0.102451 1 Float64Vector.expand ops/ms 4510.596811 0.504477 4509.984244 1.519178 0.99 Int64Vector.expand ops/ms 4508.778322 1.664461 4535.216611 26.742484 1 Long64Vector.expand ops/ms 45.665462 0.705485 46.496232 0.075648 1.01 Short64Vector.expand ops/ms 394.527324 1.284691 3860.199621 0.720015 9.78 ``` On Nvidia Grace machine with option `-XX:UseSVE=1`: ``` Benchmark Unit Before Score Error After Score Error 
Uplift Byte128Vector.expand ops/ms 1767.314171 12.431526 9630.892248 1.478813 5.44 Double128Vector.expand ops/ms 197.614381 0.945541 2416.075281 2.664325 12.22 Float128Vector.expand ops/ms 390.878183 2.089234 3844.011978 3.792751 9.83 Int128Vector.expand ops/ms 394.550044 2.025371 3843.280133 3.528017 9.74 Long128Vector.expand ops/ms 198.366863 0.651726 2423.234639 4.911434 12.21 Short128Vector.expand ops/ms 790.044704 3.339363 5885.595035 1.440598 7.44 Byte64Vector.expand ops/ms 853.479119 7.158898 5942.750116 1.054905 6.96 Double64Vector.expand ops/ms 46.550458 0.079191 46.423053 0.057554 0.99 Float64Vector.expand ops/ms 197.977215 1.156535 2445.010767 1.992358 12.34 Int64Vector.expand ops/ms 198.326857 1.02785 2444.211583 2.5432 12.32 Long64Vector.expand ops/ms 46.526513 0.25779 45.984253 0.566691 0.98 Short64Vector.expand ops/ms 398.649412 1.87764 3837.495773 3.528926 9.62 ``` On Nvidia Grace machine with option `-XX:UseSVE=0`: ``` Benchmark Unit Before Score Error After Score Error Uplift Byte128Vector.expand ops/ms 1802.98702 6.906394 9427.491602 2.067934 5.22 Double128Vector.expand ops/ms 198.498191 0.429071 1190.476326 0.247358 5.99 Float128Vector.expand ops/ms 392.849005 2.034676 2373.195574 2.006566 6.04 Int128Vector.expand ops/ms 395.69179 2.194773 2372.084745 2.058303 5.99 Long128Vector.expand ops/ms 198.191673 1.476362 1189.712301 1.006821 6 Short128Vector.expand ops/ms 795.785831 5.62611 4731.514053 2.365213 5.94 Byte64Vector.expand ops/ms 843.549268 7.174254 5865.556155 37.639415 6.95 Double64Vector.expand ops/ms 45.943599 0.484743 46.529755 0.111551 1.01 Float64Vector.expand ops/ms 193.945993 0.943338 1463.836772 0.618393 7.54 Int64Vector.expand ops/ms 194.168021 0.492286 1473.004575 8.802656 7.58 Long64Vector.expand ops/ms 46.570488 0.076372 46.696353 0.078649 1 Short64Vector.expand ops/ms 387.973334 2.367312 2920.428114 0.863635 7.52 ``` Some JTReg test cases are added for the above changes. 
And the patch was tested on both aarch64 and x64, all of tier1 tier2 and tier3 tests passed. ------------- Changes: https://git.openjdk.org/jdk/pull/26740/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26740&range=03 Stats: 485 lines in 9 files changed: 388 ins; 12 del; 85 mod Patch: https://git.openjdk.org/jdk/pull/26740.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26740/head:pull/26740 PR: https://git.openjdk.org/jdk/pull/26740 From chagedorn at openjdk.org Mon Sep 15 07:02:22 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 15 Sep 2025 07:02:22 GMT Subject: RFR: 8366940: Test compiler/loopopts/superword/TestAliasingFuzzer.java timed out In-Reply-To: <-iszfG2luNYZtsMxsMnsDWoQIscvcY37XSpi8fDDcEE=.2d26cf70-ba2e-4967-a083-50867f291784@github.com> References: <-iszfG2luNYZtsMxsMnsDWoQIscvcY37XSpi8fDDcEE=.2d26cf70-ba2e-4967-a083-50867f291784@github.com> Message-ID: On Fri, 12 Sep 2025 12:01:25 GMT, Emanuel Peter wrote: > `TestAliasingFuzzer.java` generates 30 subtests for every run. They are randomized. Some vectorize and execute faster, some fail to vectorize and execute slower. > > Hence, some natural variance in the duration is expected. > On most machines, it seems the variance in "Running Tests" is about 30-50sec (total test time about 35-70sec). But on some machines (macosx-x64-debug), the execution time is a bit slower: 60-100 in "Running Tests", with some outliers at 110+sec. These occasionally trip the 120sec timeout, and when they trip it, they somehow cause the harness to take an excessive 9+min to shut everything down. > > Solutions: > - Option 1: generate fewer tests in `TestAliasingFuzzer.java`. Would be sad, the test has now found 2 real bugs within 2 weeks. > - Option 2: increase test timeout. That is what I'll do. Because the "outliers" that caused the timeouts were not far from all other cases on the same platform, and so they are acceptable. That looks reasonable! 
------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/27257#pullrequestreview-3223155288 From epeter at openjdk.org Mon Sep 15 07:02:23 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 15 Sep 2025 07:02:23 GMT Subject: RFR: 8366940: Test compiler/loopopts/superword/TestAliasingFuzzer.java timed out In-Reply-To: References: <-iszfG2luNYZtsMxsMnsDWoQIscvcY37XSpi8fDDcEE=.2d26cf70-ba2e-4967-a083-50867f291784@github.com> Message-ID: On Fri, 12 Sep 2025 12:32:00 GMT, SendaoYan wrote: >> `TestAliasingFuzzer.java` generates 30 subtests for every run. They are randomized. Some vectorize and execute faster, some fail to vectorize and execute slower. >> >> Hence, some natural variance in the duration is expected. >> On most machines, it seems the variance in "Running Tests" is about 30-50sec (total test time about 35-70sec). But on some machines (macosx-x64-debug), the execution time is a bit slower: 60-100 in "Running Tests", with some outliers at 110+sec. These occasionally trip the 120sec timeout, and when they trip it, they somehow cause the harness to take an excessive 9+min to shut everything down. >> >> Solutions: >> - Option 1: generate fewer tests in `TestAliasingFuzzer.java`. Would be sad, the test has now found 2 real bugs within 2 weeks. >> - Option 2: increase test timeout. That is what I'll do. Because the "outliers" that caused the timeouts were not far from all other cases on the same platform, and so they are acceptable. > > Marked as reviewed by syan (Committer). @sendaoYan @chhagedorn Thanks for the reviews! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/27257#issuecomment-3290745075 From epeter at openjdk.org Mon Sep 15 07:02:24 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 15 Sep 2025 07:02:24 GMT Subject: Integrated: 8366940: Test compiler/loopopts/superword/TestAliasingFuzzer.java timed out In-Reply-To: <-iszfG2luNYZtsMxsMnsDWoQIscvcY37XSpi8fDDcEE=.2d26cf70-ba2e-4967-a083-50867f291784@github.com> References: <-iszfG2luNYZtsMxsMnsDWoQIscvcY37XSpi8fDDcEE=.2d26cf70-ba2e-4967-a083-50867f291784@github.com> Message-ID: On Fri, 12 Sep 2025 12:01:25 GMT, Emanuel Peter wrote: > `TestAliasingFuzzer.java` generates 30 subtests for every run. They are randomized. Some vectorize and execute faster, some fail to vectorize and execute slower. > > Hence, some natural variance in the duration is expected. > On most machines, it seems the variance in "Running Tests" is about 30-50sec (total test time about 35-70sec). But on some machines (macosx-x64-debug), the execution time is a bit slower: 60-100 in "Running Tests", with some outliers at 110+sec. These occasionally trip the 120sec timeout, and when they trip it, they somehow cause the harness to take an excessive 9+min to shut everything down. > > Solutions: > - Option 1: generate fewer tests in `TestAliasingFuzzer.java`. Would be sad, the test has now found 2 real bugs within 2 weeks. > - Option 2: increase test timeout. That is what I'll do. Because the "outliers" that caused the timeouts were not far from all other cases on the same platform, and so they are acceptable. This pull request has now been integrated. 
Changeset: cf00f96f Author: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/cf00f96fd49ac7e6e04fdde74a3015531a0b59c8 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod 8366940: Test compiler/loopopts/superword/TestAliasingFuzzer.java timed out Reviewed-by: syan, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/27257 From jbhateja at openjdk.org Mon Sep 15 08:20:35 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 15 Sep 2025 08:20:35 GMT Subject: RFR: 8365205: C2: Optimize popcount value computation using knownbits [v8] In-Reply-To: References: Message-ID: > This patch optimizes PopCount value transforms using KnownBits information. > Following are the results of the micro-benchmark included with the patch > > > > System: 13th Gen Intel(R) Core(TM) i3-1315U > > Baseline: > Benchmark Mode Cnt Score Error Units > PopCountValueTransform.LogicFoldingKerenLong thrpt 2 215460.670 ops/s > PopCountValueTransform.LogicFoldingKerenlInt thrpt 2 294014.826 ops/s > > Withopt: > Benchmark Mode Cnt Score Error Units > PopCountValueTransform.LogicFoldingKerenLong thrpt 2 389978.082 ops/s > PopCountValueTransform.LogicFoldingKerenlInt thrpt 2 417261.583 ops/s > > > Kindly review and share your feedback. 
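For readers unfamiliar with the idea, here is a minimal sketch of how known-bits information can bound a population count. It is not the C2 code from the PR, and the class and method names are hypothetical. It assumes only that some bits of a value are known to be one and others known to be zero: the count is then at least the number of known-one bits and at most the bit width minus the number of known-zero bits.

```java
// Hypothetical sketch: bounding Long.bitCount(x) from known-bits information.
// This illustrates the general idea only; it is not taken from the patch.
final class PopCountKnownBitsSketch {
    // Lower bound: every bit known to be 1 always contributes to the count.
    static int minPopCount(long knownOnes) {
        return Long.bitCount(knownOnes);
    }

    // Upper bound: every bit known to be 0 can never contribute to the count.
    static int maxPopCount(long knownZeros) {
        return Long.SIZE - Long.bitCount(knownZeros);
    }

    public static void main(String[] args) {
        // For x = (y & 0xFF) | 1, bit 0 is known to be one
        // and bits 8..63 are known to be zero, whatever y is.
        long knownOnes = 1L;
        long knownZeros = ~0xFFL;
        System.out.println(minPopCount(knownOnes) + " <= bitCount(x) <= " + maxPopCount(knownZeros));
    }
}
```

When the two bounds coincide, the count is a compile-time constant, which is the kind of folding the benchmark names above suggest.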
> > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Extending the random ranges ------------- Changes: - all: https://git.openjdk.org/jdk/pull/27075/files - new: https://git.openjdk.org/jdk/pull/27075/files/a7f9b79c..278f1dc8 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=27075&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27075&range=06-07 Stats: 29 lines in 1 file changed: 2 ins; 6 del; 21 mod Patch: https://git.openjdk.org/jdk/pull/27075.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27075/head:pull/27075 PR: https://git.openjdk.org/jdk/pull/27075 From chagedorn at openjdk.org Mon Sep 15 09:14:13 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 15 Sep 2025 09:14:13 GMT Subject: RFR: 8361702: C2: assert(is_dominator(compute_early_ctrl(limit, limit_ctrl), pre_end)) failed: node pinned on loop exit test? [v8] In-Reply-To: <4Yzeo6gJlk-Jq5zlh3P9HPCm57-7AwIqsywOWbawzcI=.13938c72-a9d4-463d-a54c-a08c70482a6b@github.com> References: <1-3MDixhdwZEgDMpoAZckhK5_lFjygsKl4q1__tsCKs=.dffa9c0e-8ea1-4465-a1fc-6ad2dbcfe5db@github.com> <4Yzeo6gJlk-Jq5zlh3P9HPCm57-7AwIqsywOWbawzcI=.13938c72-a9d4-463d-a54c-a08c70482a6b@github.com> Message-ID: On Fri, 12 Sep 2025 07:27:02 GMT, Roland Westrelin wrote: >> A node in a pre loop only has uses out of the loop dominated by the >> loop exit. `PhaseIdealLoop::try_sink_out_of_loop()` sets its control >> to the loop exit projection. A range check in the main loop has this >> node as input (through a chain of some other nodes). Range check >> elimination needs to update the exit condition of the pre loop with an >> expression that depends on the node pinned on its exit: that's >> impossible and the assert fires. This is a variant of 8314024 (this >> one was for a node with uses out of the pre loop on multiple paths). I >> propose the same fix: leave the node with control in the pre loop in >> this case. 
> > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > review Marked as reviewed by chagedorn (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26424#pullrequestreview-3223618579 From shade at openjdk.org Mon Sep 15 09:21:44 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 15 Sep 2025 09:21:44 GMT Subject: RFR: 8367333: C2: Vector math operation intrinsification failure In-Reply-To: References: Message-ID: On Fri, 12 Sep 2025 19:14:18 GMT, Vladimir Ivanov wrote: > As part of [JDK-8353786](https://bugs.openjdk.org/browse/JDK-8353786), C2 support for operations backed by the vector math library was completely removed. On JDK side, there is a special dispatching logic added to avoid intrinsic calls in `jdk.internal.vm.vector.VectorSupport`. But it's still possible to observe such paradoxical situations (intrinsic calls with obsolete operation IDs) when processing effectively dead code. > > Consider `FloatVector::lanewiseTemplate`: > > FloatVector lanewiseTemplate(VectorOperators.Unary op) { > if (opKind(op, VO_SPECIAL)) { > ... > else if (opKind(op, VO_MATHLIB)) { > return unaryMathOp(op); > } > } > int opc = opCode(op); > return VectorSupport.unaryOp(opc, ...); > } > > > At runtime, `unaryMathOp` is unconditionally invoked, but during compilation it's possible to end up with an intrinsification attempt of `VectorSupport.unaryOp()` before `opKind(op, VO_SPECIAL)` is inlined. > > It can be reliably reproduced `-XX:+StressIncrementalInlining` flag. > > The fix is to fail-fast intrinsification rather than crashing the VM. > > Testing: tier1 - tier4 Looks reasonable! Thanks. ------------- Marked as reviewed by shade (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/27263#pullrequestreview-3223639697 From chagedorn at openjdk.org Mon Sep 15 09:30:23 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 15 Sep 2025 09:30:23 GMT Subject: RFR: 8362394: C2: Repeated stacked string concatenation fails with "Hit MemLimit" and other resourcing errors [v4] In-Reply-To: References: Message-ID: On Thu, 21 Aug 2025 07:41:32 GMT, Daniel Skantz wrote: >> This PR addresses a bug in the stringopts phase. During string concatenation, repeated stacking of concatenations can lead to excessive compilation resource use and generation of questionable code as the merging of two StringBuilder-append-toString links sc1 and sc2 can result in a new StringBuilder with the size sc1->num_arguments() * sc2->num_arguments(). >> >> In the attached test, the size of the successively merged StringBuilder doubles on each merge -- there's 24 of them -- as the toString result of the first component is used twice in the second component [1], etc. Not only does the compiler hang on this test case, but the string concat optimization seems to give an arbitrary amount of back-to-back stores in the generated code depending on the number of stacked concatenations. >> >> The proposed solution is to put an upper bound on the size of a merged concatenation, which guards against this case of repeated concatenations on the same string variable, and potentially other edge cases. 100 seems like a generous limit, and higher limits could be insufficient as each argument corresponds to about 20 new nodes later in replace_string_concat [2]. >> >> [1] https://github.com/openjdk/jdk/blob/0ceb366dc26e2e4f6252da9dd8930b016a5d46ba/src/hotspot/share/opto/stringopts.cpp#L303 >> >> [2] https://github.com/openjdk/jdk/blob/0ceb366dc26e2e4f6252da9dd8930b016a5d46ba/src/hotspot/share/opto/stringopts.cpp#L1806 >> >> Testing: T1-4. 
>>
>> Extra testing: verified that no method in T1-4 is being compiled with a merged concat candidate exceeding the suggested limit of 100 arguments, regardless of whether or not the later checks verify_control_flow() and verify_mem_flow pass.
>
> Daniel Skantz has updated the pull request incrementally with one additional commit since the last revision:
>
>   compare order

The fix looks good to me, too! A few small comments/suggestions.

src/hotspot/share/opto/stringopts.cpp line 56:

> 54: // to restart at the initial JVMState.
> 55:
> 56: static constexpr uint STACKED_CONCAT_UPPER_BOUND = 256; // argument limit for a merged concat.

Can you add a comment how we ended up with 256?

src/hotspot/share/opto/stringopts.cpp line 319:

> 317: // -- and bail out in that case.
> 318: if (arguments_appended > STACKED_CONCAT_UPPER_BOUND) {
> 319: return nullptr;

Should we also print an error message for `PrintOptimizeStringConcat` here?

test/hotspot/jtreg/compiler/stringopts/TestStackedConcatsMany.java line 26:

> 24: /*
> 25: * @test
> 26: * @bug 8357105

Wrong bug number:
Suggestion:

 * @bug 8362394

test/hotspot/jtreg/compiler/stringopts/TestStackedConcatsMany.java line 37:

> 35: */
> 36:
> 37: // The test uses -XX:-OptoScheduling to avoid the assert "too many D-U pinch points" on aarch64.

I assume this is due to JDK-8328078? Maybe you can also mention the bug number here for completeness.
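The growth described in the PR can be modelled with a short sketch. This is a hypothetical illustration, not code from stringopts.cpp or from the attached test; the names are invented, and the loop is only a compact way to show the shape (the real test presumably writes the concatenations out one after another). It assumes each level appends the previous `toString()` result twice, so the merged argument count doubles per level.

```java
// Hypothetical model of the growth described in the PR; not code from stringopts.cpp.
final class StackedConcatGrowthSketch {
    // Each level uses the previous result twice, so once the levels are merged
    // into a single concatenation the argument count doubles per level.
    static long mergedArgumentCount(int levels) {
        long arguments = 1;
        for (int i = 0; i < levels; i++) {
            arguments *= 2;
        }
        return arguments;
    }

    // The source-level shape of a stack of concatenations on the same variable.
    static String stackedConcat(String seed, int levels) {
        String s = seed;
        for (int i = 0; i < levels; i++) {
            s = new StringBuilder().append(s).append(s).toString();
        }
        return s;
    }

    public static void main(String[] args) {
        // 24 stacked levels, as in the PR description.
        System.out.println(mergedArgumentCount(24));
        System.out.println(stackedConcat("a", 3).length());
    }
}
```

Under this model, 24 levels give 2^24 arguments, far beyond any small fixed bound, which is why a cap on the merged size stops the blow-up after only a few merges.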
------------- PR Review: https://git.openjdk.org/jdk/pull/26685#pullrequestreview-3223638117 PR Review Comment: https://git.openjdk.org/jdk/pull/26685#discussion_r2348385681 PR Review Comment: https://git.openjdk.org/jdk/pull/26685#discussion_r2348379717 PR Review Comment: https://git.openjdk.org/jdk/pull/26685#discussion_r2348366153 PR Review Comment: https://git.openjdk.org/jdk/pull/26685#discussion_r2348364931 From chagedorn at openjdk.org Mon Sep 15 09:32:26 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 15 Sep 2025 09:32:26 GMT Subject: RFR: 8356779: IGV: dump the index of the SafePointNode containing the current JVMS during parsing In-Reply-To: References: Message-ID: On Thu, 4 Sep 2025 05:22:00 GMT, Saranya Natarajan wrote: > This PR prints index of the SafePointNode containing the current JVMS during parsing in IGV. As stated in JBS the reason for this is that there are a lot of nodes during parsing, it would be nice to know what are the current nodes in the local slots or in the stack when looking at a graph. That's a good addition, looks good to me, too! ------------- Marked as reviewed by chagedorn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/27083#pullrequestreview-3223687303 From jbhateja at openjdk.org Mon Sep 15 09:38:14 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 15 Sep 2025 09:38:14 GMT Subject: RFR: 8354242: VectorAPI: combine vector not operation with compare [v13] In-Reply-To: References: Message-ID: <4D63cqV0LkPYrSMSkfachZzoH_qpH9vhAbo57RRe1Js=.7a21d73b-7963-4e15-b013-8295b274d5d0@github.com> On Mon, 15 Sep 2025 05:43:11 GMT, erifan wrote: >> This patch optimizes the following patterns: >> For integer types: >> >> (XorV (VectorMaskCmp src1 src2 cond) (Replicate -1)) >> => (VectorMaskCmp src1 src2 ncond) >> (XorVMask (VectorMaskCmp src1 src2 cond) (MaskAll m1)) >> => (VectorMaskCmp src1 src2 ncond) >> >> cond can be eq, ne, le, ge, lt, gt, ule, uge, ult and ugt, ncond is the negative comparison of cond. >> >> For float and double types: >> >> (XorV (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (Replicate -1)) >> => (VectorMaskCast (VectorMaskCmp src1 src2 ncond)) >> (XorVMask (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (MaskAll m1)) >> => (VectorMaskCast (VectorMaskCmp src1 src2 ncond)) >> >> cond can be eq or ne. 
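A scalar sketch of why the float and double rules are limited to eq and ne, assuming the restriction is due to NaN handling (the PR text quoted here does not state the reason). This uses plain Java comparisons rather than the Vector API, and the class name is hypothetical. For integers, negating a comparison always equals the inverse comparison; for floating point, ordered comparisons are false when an operand is NaN, so `!(a < b)` and `a >= b` can differ, while `!(a == b)` and `a != b` always agree.

```java
// Hypothetical scalar illustration; the patch itself operates on vector mask nodes in C2.
final class NegatedCompareSketch {
    // Integers: NOT(a < b) is always the same as (a >= b).
    static boolean intNotLtMatchesGe(int a, int b) {
        return (!(a < b)) == (a >= b);
    }

    // Floats: NOT(a < b) differs from (a >= b) when either operand is NaN.
    static boolean floatNotLtMatchesGe(float a, float b) {
        return (!(a < b)) == (a >= b);
    }

    // Floats: NOT(a == b) is always the same as (a != b), even with NaN.
    static boolean floatNotEqMatchesNe(float a, float b) {
        return (!(a == b)) == (a != b);
    }

    public static void main(String[] args) {
        System.out.println(intNotLtMatchesGe(3, 7));
        System.out.println(floatNotLtMatchesGe(Float.NaN, 1.0f));
        System.out.println(floatNotEqMatchesNe(Float.NaN, Float.NaN));
    }
}
```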
>> >> Benchmarks on Nvidia Grace machine with 128-bit SVE2: With option `-XX:UseSVE=2`: >> >> Benchmark Unit Before Score Error After Score Error Uplift >> testCompareEQMaskNotByte ops/s 7912127.225 2677.289518 10266136.26 8955.008548 1.29 >> testCompareEQMaskNotDouble ops/s 884737.6799 446.963779 1179760.772 448.031844 1.33 >> testCompareEQMaskNotFloat ops/s 1765045.787 682.332214 2359520.803 896.305743 1.33 >> testCompareEQMaskNotInt ops/s 1787221.411 977.743935 2353952.519 960.069976 1.31 >> testCompareEQMaskNotLong ops/s 895297.1974 673.44808 1178449.02 323.804205 1.31 >> testCompareEQMaskNotShort ops/s 3339987.002 3415.2226 4712761.965 2110.862053 1.41 >> testCompareGEMaskNotByte ops/s 7907615.16 4094.243652 10251646.9 9486.699831 1.29 >> testCompareGEMaskNotInt ops/s 1683738.958 4233.813092 2352855.205 1251.952546 1.39 >> testCompareGEMaskNotLong ops/s 854496.1561 8594.598885 1177811.493 521.1229 1.37 >> testCompareGEMaskNotShort ops/s 3341860.309 1578.975338 4714008.434 1681.10365 1.41 >> testCompareGTMaskNotByte ops/s 7910823.674 2993.367032 10245063.58 9774.75138 1.29 >> testCompareGTMaskNotInt ops/s 1673393.928 3153.099431 2353654.521 1190.848583 1.4 >> testCompareGTMaskNotLong ops/s 849405.9159 2432.858159 1177952.041 359.96413 1.38 >> testCompareGTMaskNotShort ops/s 3339509.141 3339.976585 4711442.496 2673.364893 1.41 >> testCompareLEMaskNotByte ops/s 7911340.004 3114.69191 10231626.5 27134.20035 1.29 >> testCompareLEMaskNotInt ops/s 1675812.113 1340.969885 2353255.341 1452.4522 1.4 >> testCompareLEMaskNotLong ops/s 848862.8036 6564.841731 1177763.623 539.290106 1.38 >> testCompareLEMaskNotShort ops/s 3324951.54 2380.29473 4712116.251 1544.559684 1.41 >> testCompareLTMaskNotByte ops/s 7910390.844 2630.861436 10239567.69 6487.441672 1.29 >> testCompareLTMaskNotInt ops/s 16721... 
> > erifan has updated the pull request incrementally with one additional commit since the last revision: > > Add an IR rule for vector mask cast operation Your benchmark and code changes look good to me. Thanks for addressing my comments. ------------- Marked as reviewed by jbhateja (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/24674#pullrequestreview-3223705838 From eastigeevich at openjdk.org Mon Sep 15 09:38:33 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Mon, 15 Sep 2025 09:38:33 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v42] In-Reply-To: References: Message-ID: On Thu, 21 Aug 2025 14:56:30 GMT, Robbin Ehn wrote: > Hey! > > @fisk > > > Also, do you have any numbers showing if iTLB pressure improved? Or performance improved? Or in general that anything improved? I'm guessing so but I'd like to see some data. > > The issue is that some of the major Arm manufacturers seem to have missed appendix C in Intel opt manual - "OPTIMIZATION WITH LARGE CODE PAGES". > > E.g. running renaissance dotty on a G3 I saw 37% front-end stalls (G2 28%, they made significant improvements to the backend on G3, presumably not the front-end, hence more stalling). > > By using fewer iTLB entries we can significantly increase IPC on these CPUs. Simple testing with some earlier version of this got ~10% reduction in frontend stalls (take that number with a grain of salt). Whether this is the correct approach or not is still unclear to me. @robehn @fisk I added a microbenchmark demonstrating the performance impact of a sparse CodeCache: https://github.com/openjdk/jdk/pull/23831 It shows that code sparsity affects both Intel and Graviton CPUs. On Graviton CPUs you can measure the code sparsity: r11c counter per 1000 instructions. BTW, with code sparsity there might be no ITLB misses, or only minor ones. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23573#issuecomment-3291256979 From epeter at openjdk.org Mon Sep 15 09:42:37 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 15 Sep 2025 09:42:37 GMT Subject: RFR: 8361702: C2: assert(is_dominator(compute_early_ctrl(limit, limit_ctrl), pre_end)) failed: node pinned on loop exit test? [v4] In-Reply-To: References: <1-3MDixhdwZEgDMpoAZckhK5_lFjygsKl4q1__tsCKs=.dffa9c0e-8ea1-4465-a1fc-6ad2dbcfe5db@github.com> <6dWR-SxhuKd9-T3q313I6at4vTBcYlufyCBNjGGopv4=.cae3abea-0752-4191-ac08-890476489af3@github.com> Message-ID: On Tue, 26 Aug 2025 09:27:04 GMT, Roland Westrelin wrote: >> test/hotspot/jtreg/compiler/rangechecks/TestSunkRangeFromPreLoopRCE3.java line 28: >> >>> 26: * @bug 8361702 >>> 27: * @summary C2: assert(is_dominator(compute_early_ctrl(limit, limit_ctrl), pre_end)) failed: node pinned on loop exit test? >>> 28: * @requires vm.flavor == "server" >> >> Would this test fail without this requires? Or could we remove it, in the hopes of catching something else somewhere else? > > The `@requires` is there because the test run needs command line options that are c2 specific. Ok, but then you should make the run below without flags in a separate `@test` that does not have this restriction. 
@rwestrel ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26424#discussion_r2348419584 PR Review Comment: https://git.openjdk.org/jdk/pull/26424#discussion_r2348424709 From duke at openjdk.org Mon Sep 15 09:49:19 2025 From: duke at openjdk.org (erifan) Date: Mon, 15 Sep 2025 09:49:19 GMT Subject: RFR: 8354242: VectorAPI: combine vector not operation with compare [v13] In-Reply-To: <4D63cqV0LkPYrSMSkfachZzoH_qpH9vhAbo57RRe1Js=.7a21d73b-7963-4e15-b013-8295b274d5d0@github.com> References: <4D63cqV0LkPYrSMSkfachZzoH_qpH9vhAbo57RRe1Js=.7a21d73b-7963-4e15-b013-8295b274d5d0@github.com> Message-ID: <0SEFllVEITC_xA1OeWHnPC0S9-nbnicZOCKlAcbwH1M=.ecc56fb6-45fd-44c9-a9ca-4a5f5a391a34@github.com> On Mon, 15 Sep 2025 09:33:47 GMT, Jatin Bhateja wrote: >> erifan has updated the pull request incrementally with one additional commit since the last revision: >> >> Add an IR rule for vector mask cast operation > > Your benchmark and code changes look good to me. Thanks for addressing my comments. Thanks @jatin-bhateja. The updated benchmark results are as follows; not much has changed. 
On Nvidia Grace machine with 128-bit SVE2: With option `-XX:UseSVE=2`: Benchmark COMPARISON_OP Unit Before Score Error After Score Error Uplift testCompareMaskNotDouble EQ ops/s 908008.7644 827.699314 1175289.515 240.548861 1.294359 testCompareMaskNotDouble NE ops/s 872199.2489 131.090115 1175667.777 129.741515 1.347934 testCompareMaskNotDouble LT ops/s 880166.7559 1570.41653 882160.6889 4723.507639 1.002265 testCompareMaskNotDouble LE ops/s 878115.3293 2919.637497 879033.7895 5404.617017 1.001045 testCompareMaskNotDouble GT ops/s 877068.5325 9595.275981 865832.864 5054.26002 0.987189 testCompareMaskNotDouble GE ops/s 895695.0228 3276.687933 871153.7117 7714.572967 0.9726 testCompareMaskNotFloat EQ ops/s 1811841.295 278.140948 2350971.83 606.667654 1.297559 testCompareMaskNotFloat NE ops/s 1727124.634 1755.717051 2351789.019 269.531198 1.361678 testCompareMaskNotFloat LT ops/s 1735243.319 4912.343726 1726257.01 823.746765 0.994821 testCompareMaskNotFloat LE ops/s 1726151.367 1071.383328 1727029.339 960.336314 1.000508 testCompareMaskNotFloat GT ops/s 1729704.897 1646.026351 1726069.02 440.981281 0.997897 testCompareMaskNotFloat GE ops/s 1726515.227 2171.61643 1728365.682 1404.298156 1.001071 testCompareMaskNotByte EQ ops/s 8480574.694 1254.415788 10200329.86 8560.199493 1.202787 testCompareMaskNotByte NE ops/s 8480141.263 1437.762594 10207424.91 3664.106923 1.203685 testCompareMaskNotByte LT ops/s 8471471.384 7699.585554 10203300.19 4675.047416 1.20443 testCompareMaskNotByte LE ops/s 8476165.519 6045.944392 10204956.23 2174.866199 1.203959 testCompareMaskNotByte GT ops/s 8479397.377 1290.560961 10207032.3 5414.789178 1.203745 testCompareMaskNotByte GE ops/s 8479979.908 1094.823175 10203115.77 2909.433184 1.2032 testCompareMaskNotByte ULT ops/s 8480915.515 1420.30856 10213140.54 19628.56888 1.204249 testCompareMaskNotByte ULE ops/s 8481768.961 1806.086454 10191601.05 9537.089409 1.201589 testCompareMaskNotByte UGT ops/s 8477948.807 3652.437106 10208439.79 
8335.226416 1.204116 testCompareMaskNotByte UGE ops/s 8477320.065 2191.753237 10198589.9 5748.761942 1.203044 testCompareMaskNotInt EQ ops/s 1906386.393 208.045573 2346741.129 383.461819 1.230989 testCompareMaskNotInt NE ops/s 1674206.146 169.967081 2346609.602 652.964692 1.401625 testCompareMaskNotInt LT ops/s 1684755.085 4939.806653 2345939.728 738.842445 1.392451 testCompareMaskNotInt LE ops/s 1659985.83 2408.542766 2346929.8 192.550397 1.413825 testCompareMaskNotInt GT ops/s 1674460.437 447.120589 2347037.155 342.433085 1.401667 testCompareMaskNotInt GE ops/s 1658699.073 884.268891 2347411.827 281.885914 1.415212 testCompareMaskNotInt ULT ops/s 1677043.66 6215.834359 2347155.384 425.141786 1.399579 testCompareMaskNotInt ULE ops/s 1667049.76 9521.094204 2346815.213 316.03901 1.407765 testCompareMaskNotInt UGT ops/s 1661045.828 3669.548525 2346711.365 2808.608132 1.412791 testCompareMaskNotInt UGE ops/s 1663715.691 4570.73053 2347096.847 191.804359 1.410755 testCompareMaskNotLong EQ ops/s 885668.5947 203.053456 1174274.006 113.51354 1.325861 testCompareMaskNotLong NE ops/s 837449.9353 198.611966 1174330.269 106.514374 1.402269 testCompareMaskNotLong LT ops/s 846790.2128 7005.585657 1174290.879 93.56413 1.386755 testCompareMaskNotLong LE ops/s 851253.2346 7624.045467 1174162.355 179.854316 1.379333 testCompareMaskNotLong GT ops/s 837715.7563 4272.558281 1173797.819 289.311518 1.401188 testCompareMaskNotLong GE ops/s 883137.593 14804.63746 1174216.909 86.404559 1.329596 testCompareMaskNotLong ULT ops/s 872478.9017 4955.722542 1174341.995 124.656933 1.345983 testCompareMaskNotLong ULE ops/s 866570.738 12541.58528 1174185.197 594.850706 1.354979 testCompareMaskNotLong UGT ops/s 866389.0927 3971.492766 1174210.803 153.960084 1.355292 testCompareMaskNotLong UGE ops/s 848339.3876 4555.514721 1174060.638 240.326562 1.383951 testCompareMaskNotShort EQ ops/s 3336170.783 2286.717236 4684904.156 2134.72575 1.404275 testCompareMaskNotShort NE ops/s 3334775.472 717.588615 
4690264.12 3017.756867 1.40647 testCompareMaskNotShort LT ops/s 3334619.058 1138.901707 4685883.864 3808.321694 1.405223 testCompareMaskNotShort LE ops/s 3335538.353 538.676789 4688238.934 1029.406266 1.405541 testCompareMaskNotShort GT ops/s 3301425.217 694.060525 4689167.049 2845.363801 1.420346 testCompareMaskNotShort GE ops/s 3301580.972 317.042851 4688970.211 1292.83929 1.420219 testCompareMaskNotShort ULT ops/s 3336318.051 892.515034 4687549.384 1403.281648 1.405006 testCompareMaskNotShort ULE ops/s 3335188.292 972.230191 4684723.63 3937.599084 1.404635 testCompareMaskNotShort UGT ops/s 3334490.656 930.409628 4688058.378 1166.776081 1.405929 testCompareMaskNotShort UGE ops/s 3333050.033 3146.019596 4689197.9 456.439188 1.406878 With option `-XX:UseSVE=0`: Benchmark COMPARISON_OP Unit Before Score Error After Score Error Uplift testCompareMaskNotDouble EQ ops/s 788505.9464 579.254839 769969.5798 138.792325 0.976491 testCompareMaskNotDouble NE ops/s 655499.7935 471.970429 915086.3257 183.495964 1.396013 testCompareMaskNotDouble LT ops/s 788418.7889 574.263314 789271.7448 51.838991 1.001081 testCompareMaskNotDouble LE ops/s 789144.8431 45.334181 789326.1963 84.148011 1.000229 testCompareMaskNotDouble GT ops/s 788690.8485 662.950083 789246.9812 99.060588 1.000705 testCompareMaskNotDouble GE ops/s 789421.2387 94.012868 789166.4717 111.772533 0.999677 testCompareMaskNotFloat EQ ops/s 1816132.864 1298.2187 1816461.601 311.706275 1.000181 testCompareMaskNotFloat NE ops/s 1550767.697 1142.987761 2301429.148 159.71525 1.484057 testCompareMaskNotFloat LT ops/s 1815531.685 1370.868745 1817187.121 761.68401 1.000911 testCompareMaskNotFloat LE ops/s 1817937.722 484.638134 1817703.209 625.275639 0.999871 testCompareMaskNotFloat GT ops/s 1818618.89 724.324392 1817977.851 481.152488 0.999647 testCompareMaskNotFloat GE ops/s 1815118.411 1327.945736 1817476.414 510.712942 1.001299 testCompareMaskNotByte EQ ops/s 6489599.571 5127.815254 6535895.286 17029.15534 1.007133 
testCompareMaskNotByte NE ops/s 9089974.523 4069.346579 15945662.17 22867.48282 1.754203 testCompareMaskNotByte LT ops/s 6499040.898 1250.085336 15939338.57 17451.05939 2.452567 testCompareMaskNotByte LE ops/s 6493612.339 4928.466061 15926355.01 27249.57103 2.452618 testCompareMaskNotByte GT ops/s 6494486.565 5229.4598 15957497.14 6893.237334 2.457083 testCompareMaskNotByte GE ops/s 6499295.661 1030.044749 15903755.01 46454.70992 2.446996 testCompareMaskNotByte ULT ops/s 6494212.684 5194.712704 15944816.71 3467.818892 2.455234 testCompareMaskNotByte ULE ops/s 6493882.576 5092.839387 15936419.25 22755.34523 2.454066 testCompareMaskNotByte UGT ops/s 6493479.899 4678.096391 15958133.18 3483.353667 2.457562 testCompareMaskNotByte UGE ops/s 6500338.419 709.344957 15968155.27 14020.47085 2.456511 testCompareMaskNotInt EQ ops/s 1830787.273 237.597163 1878452.588 142.728192 1.026035 testCompareMaskNotInt NE ops/s 1615081.395 1219.871461 2360913.712 199.556675 1.461792 testCompareMaskNotInt LT ops/s 1827819.867 1360.728526 2360561.422 248.025925 1.291462 testCompareMaskNotInt LE ops/s 1830975.648 416.987529 2360703.924 194.958346 1.289314 testCompareMaskNotInt GT ops/s 1830633.964 301.849017 2360552.203 224.908655 1.289472 testCompareMaskNotInt GE ops/s 1829476.495 1348.361278 2360673.736 137.538696 1.290354 testCompareMaskNotInt ULT ops/s 1829137.773 1285.55232 2360615.95 162.876291 1.290562 testCompareMaskNotInt ULE ops/s 1828107.468 1360.867847 2360790.337 297.267481 1.291384 testCompareMaskNotInt UGT ops/s 1829659.222 1459.098806 2361025.107 266.158075 1.290417 testCompareMaskNotInt UGE ops/s 1829548.187 1427.266787 2360941.943 242.380469 1.29045 testCompareMaskNotLong EQ ops/s 810439.9121 82.577412 802287.4993 73.462086 0.98994 testCompareMaskNotLong NE ops/s 681643.6089 485.657471 932324.6973 158.28799 1.367759 testCompareMaskNotLong LT ops/s 809850.546 680.71673 931404.3219 685.591444 1.150094 testCompareMaskNotLong LE ops/s 810584.5191 115.234753 932234.2412 
105.451172 1.150076 testCompareMaskNotLong GT ops/s 810593.5376 117.947863 931879.1829 553.397713 1.149625 testCompareMaskNotLong GE ops/s 810435.8405 81.88737 931833.0348 177.765694 1.149792 testCompareMaskNotLong ULT ops/s 810429.8459 90.005329 932127.5278 74.443387 1.150164 testCompareMaskNotLong ULE ops/s 809740.842 411.655134 932231.6607 76.044104 1.151271 testCompareMaskNotLong UGT ops/s 810493.4369 52.024062 932239.1709 143.915229 1.150211 testCompareMaskNotLong UGE ops/s 810442.0661 64.064396 932361.567 119.570287 1.150435 testCompareMaskNotShort EQ ops/s 4786426.182 299.050738 4694123.013 482.608634 0.980715 testCompareMaskNotShort NE ops/s 3808932.807 2993.590606 5672255.469 6262.526335 1.489198 testCompareMaskNotShort LT ops/s 4782535.485 3699.104322 5668474.071 11101.86452 1.185244 testCompareMaskNotShort LE ops/s 4782896.891 3338.57484 5669188.434 6309.723399 1.185304 testCompareMaskNotShort GT ops/s 4778532.318 3571.547653 5680482.703 10427.66734 1.18875 testCompareMaskNotShort GE ops/s 4786150.851 794.769881 5664644.919 6542.434538 1.183549 testCompareMaskNotShort ULT ops/s 4783623.78 3582.962421 5668267.123 17841.44773 1.184931 testCompareMaskNotShort ULE ops/s 4782752.125 3610.296618 5666231.302 6964.505363 1.184721 testCompareMaskNotShort UGT ops/s 4782469.332 2913.37576 5655837.96 6494.608864 1.182618 testCompareMaskNotShort UGE ops/s 4782606.35 3491.774067 5667295.182 14176.96543 1.18498 On AMD EPYC 9124 16-Core Processor: With option `-XX:UseAVX=3`: Benchmark COMPARISON_OP Unit Before Score Error After Score Error Uplift testCompareMaskNotDouble EQ ops/s 2166357.886 27577.51358 2920183.192 38491.49083 1.347968 testCompareMaskNotDouble NE ops/s 2177325.341 32771.27023 2965747.932 39271.62615 1.362106 testCompareMaskNotDouble LT ops/s 2123834.711 22890.39919 2197099.169 29107.41329 1.034496 testCompareMaskNotDouble LE ops/s 2172931.681 32912.05647 2121686.057 34927.37781 0.976416 testCompareMaskNotDouble GT ops/s 2164924.662 30925.91899 
2124062.892 37135.0458 0.981125 testCompareMaskNotDouble GE ops/s 2150619.038 35515.09022 2192636.533 38672.85716 1.019537 testCompareMaskNotFloat EQ ops/s 4518378.764 74733.72389 6724589.409 50424.63568 1.488274 testCompareMaskNotFloat NE ops/s 4522823.224 78138.66727 6907565.257 203953.3299 1.527268 testCompareMaskNotFloat LT ops/s 4587473.545 62621.25938 4431658.918 52760.23989 0.966034 testCompareMaskNotFloat LE ops/s 4472078.986 79338.23304 4472390.043 66247.285 1.000069 testCompareMaskNotFloat GT ops/s 4451744.39 220787.9755 4440866.486 58674.19154 0.997556 testCompareMaskNotFloat GE ops/s 4459601.349 57873.05167 4481398.426 76819.69285 1.004887 testCompareMaskNotByte EQ ops/s 19415317.92 356367.4937 20649319.86 240515.9459 1.063558 testCompareMaskNotByte NE ops/s 19401162.58 362571.8103 21010358.2 71221.35255 1.082943 testCompareMaskNotByte LT ops/s 19175612.37 273080.6175 20235838.72 396190.6101 1.05529 testCompareMaskNotByte LE ops/s 19036831.33 121135.0491 20674528.84 248839.9471 1.086027 testCompareMaskNotByte GT ops/s 19008302.3 124633.9182 20671390.89 271644.5576 1.087492 testCompareMaskNotByte GE ops/s 19590753.42 429156.452 20491615.07 332912.82 1.045984 testCompareMaskNotByte ULT ops/s 19431604.06 421396.5487 20575805.9 248466.2368 1.058883 testCompareMaskNotByte ULE ops/s 19060425.47 98309.75469 20774930.43 206596.0422 1.089951 testCompareMaskNotByte UGT ops/s 19266788.04 362893.3051 20861521.87 106977.3707 1.082771 testCompareMaskNotByte UGE ops/s 19127964.33 447774.3747 20791221.56 254458.0132 1.086954 testCompareMaskNotInt EQ ops/s 4473402.48 84902.77154 7191777.028 94315.13878 1.607674 testCompareMaskNotInt NE ops/s 4583165.363 73491.79073 7249884.988 80028.31191 1.581851 testCompareMaskNotInt LT ops/s 4618634.192 81869.82512 7242567.732 71211.3697 1.568118 testCompareMaskNotInt LE ops/s 4650524.195 72302.56692 7154948.491 83057.90635 1.538525 testCompareMaskNotInt GT ops/s 4534752.486 94449.20198 7004428.251 38365.18576 1.54461 
testCompareMaskNotInt GE ops/s 4540777.389 86331.11847 7129527.341 74343.06996 1.570111 testCompareMaskNotInt ULT ops/s 4528175.644 114213.6504 7220013.98 82850.22587 1.594464 testCompareMaskNotInt ULE ops/s 4619335.448 74203.98889 7118543.128 54457.43284 1.541031 testCompareMaskNotInt UGT ops/s 4572521.254 122912.75 7154797.741 98858.3477 1.564737 testCompareMaskNotInt UGE ops/s 4579627.842 80558.04554 7179020.593 99239.23499 1.567599 testCompareMaskNotLong EQ ops/s 2103965.347 17059.28178 2997338.009 32388.42725 1.424613 testCompareMaskNotLong NE ops/s 2174434.633 36011.24708 2984460.593 29074.42994 1.372522 testCompareMaskNotLong LT ops/s 2110937.378 56642.0052 3020690.893 31167.62537 1.430971 testCompareMaskNotLong LE ops/s 2153414.166 31280.20562 2971696.162 31176.24605 1.379992 testCompareMaskNotLong GT ops/s 2166028.207 49432.18925 3008018.282 26534.78551 1.388725 testCompareMaskNotLong GE ops/s 2178206.136 35757.6799 2933186.687 19824.26727 1.346606 testCompareMaskNotLong ULT ops/s 2104344.728 31405.7728 2964354.007 26871.18289 1.408682 testCompareMaskNotLong ULE ops/s 2210232.578 21993.95777 3032635.261 25545.43656 1.372088 testCompareMaskNotLong UGT ops/s 2167177.931 44896.90807 2996245.236 34153.68941 1.382556 testCompareMaskNotLong UGE ops/s 2117175.328 26131.1893 2977492.164 23227.65519 1.406351 testCompareMaskNotShort EQ ops/s 8131234.179 185997.1777 12414378.38 122648.1579 1.526752 testCompareMaskNotShort NE ops/s 8506016.656 236481.383 12720442.64 322747.8776 1.495464 testCompareMaskNotShort LT ops/s 8487868.819 244943.6097 12150479.62 244300.5456 1.431511 testCompareMaskNotShort LE ops/s 8549184.557 286833.466 12358019.06 136683.2112 1.44552 testCompareMaskNotShort GT ops/s 8375447.45 221237.073 12602058.97 385690.3318 1.504643 testCompareMaskNotShort GE ops/s 8123474.548 127727.1461 12799747.64 197940.1001 1.575649 testCompareMaskNotShort ULT ops/s 8491650.422 313124.2425 12751186.59 255845.1653 1.501614 testCompareMaskNotShort ULE ops/s 
8363009.676 203670.1995 12675908.7 279496.9925 1.515711 testCompareMaskNotShort UGT ops/s 8332268.933 279787.2503 12279451.4 436971.6582 1.473722 testCompareMaskNotShort UGE ops/s 8931588.505 203962.9257 12324437.67 330723.3066 1.37987 ------------- PR Comment: https://git.openjdk.org/jdk/pull/24674#issuecomment-3291304777 From duke at openjdk.org Mon Sep 15 09:58:20 2025 From: duke at openjdk.org (erifan) Date: Mon, 15 Sep 2025 09:58:20 GMT Subject: RFR: 8366333: AArch64: Enhance SVE subword type implementation of vector compress [v2] In-Reply-To: References: Message-ID: > The AArch64 SVE and SVE2 architectures lack an instruction suitable for subword-type `compress` operations. Therefore, the current implementation uses the 32-bit SVE `compact` instruction to compress subword types by first widening the high and low parts to 32 bits, compressing them, and then narrowing them back to their original type. Finally, the high and low parts are merged using the `index + tbl` instructions. > > This approach is significantly slower compared to architectures with native support. After evaluating all available AArch64 SVE instructions and experimenting with various implementations (such as looping over the active elements, extraction, and insertion), I confirmed that the existing algorithm is optimal given the instruction set. However, there is still room for optimization in the following two aspects: > 1. Merging with `index + tbl` is suboptimal due to the high latency of the `index` instruction. > 2. For partial subword types, operations to the highest half are unnecessary because those bits are invalid. > > This pull request introduces the following changes: > 1. Replaces `index + tbl` with the `whilelt + splice` instructions, which offer lower latency and higher throughput. > 2. Eliminates unnecessary compress operations for partial subword type cases. > 3. For `sve_compress_byte`, one less temporary register is used to alleviate potential register pressure. 
> > Benchmark results demonstrate that these changes significantly improve performance. > > Benchmarks on Nvidia Grace machine with 128-bit SVE: > > Benchmark Unit Before Error After Error Uplift > Byte128Vector.compress ops/ms 4846.97 26.23 6638.56 31.60 1.36 > Byte64Vector.compress ops/ms 2447.69 12.95 7167.68 34.49 2.92 > Short128Vector.compress ops/ms 7174.88 40.94 8398.45 9.48 1.17 > Short64Vector.compress ops/ms 3618.72 3.04 8618.22 10.91 2.38 > > > This PR was tested on 128-bit, 256-bit, and 512-bit SVE environments, and all tests passed. erifan has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits: - Merge branch 'master' into JDK-8366333-compress - 8366333: AArch64: Enhance SVE subword type implementation of vector compress The AArch64 SVE and SVE2 architectures lack an instruction suitable for subword-type `compress` operations. Therefore, the current implementation uses the 32-bit SVE `compact` instruction to compress subword types by first widening the high and low parts to 32 bits, compressing them, and then narrowing them back to their original type. Finally, the high and low parts are merged using the `index + tbl` instructions. This approach is significantly slower compared to architectures with native support. After evaluating all available AArch64 SVE instructions and experimenting with various implementations (such as looping over the active elements, extraction, and insertion), I confirmed that the existing algorithm is optimal given the instruction set. However, there is still room for optimization in the following two aspects: 1. Merging with `index + tbl` is suboptimal due to the high latency of the `index` instruction. 2. For partial subword types, operations to the highest half are unnecessary because those bits are invalid. This pull request introduces the following changes: 1. 
Replaces `index + tbl` with the `whilelt + splice` instructions, which offer lower latency and higher throughput. 2. Eliminates unnecessary compress operations for partial subword type cases. 3. For `sve_compress_byte`, one less temporary register is used to alleviate potential register pressure. Benchmark results demonstrate that these changes significantly improve performance. Benchmarks on Nvidia Grace machine with 128-bit SVE: ``` Benchmark Unit Before Error After Error Uplift Byte128Vector.compress ops/ms 4846.97 26.23 6638.56 31.60 1.36 Byte64Vector.compress ops/ms 2447.69 12.95 7167.68 34.49 2.92 Short128Vector.compress ops/ms 7174.88 40.94 8398.45 9.48 1.17 Short64Vector.compress ops/ms 3618.72 3.04 8618.22 10.91 2.38 ``` This PR was tested on 128-bit, 256-bit, and 512-bit SVE environments, and all tests passed. ------------- Changes: https://git.openjdk.org/jdk/pull/27188/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27188&range=01 Stats: 414 lines in 9 files changed: 297 ins; 24 del; 93 mod Patch: https://git.openjdk.org/jdk/pull/27188.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27188/head:pull/27188 PR: https://git.openjdk.org/jdk/pull/27188 From duke at openjdk.org Mon Sep 15 10:01:18 2025 From: duke at openjdk.org (erifan) Date: Mon, 15 Sep 2025 10:01:18 GMT Subject: RFR: 8366333: AArch64: Enhance SVE subword type implementation of vector compress In-Reply-To: References: Message-ID: On Thu, 11 Sep 2025 06:07:42 GMT, Galder Zamarreño wrote: >> The AArch64 SVE and SVE2 architectures lack an instruction suitable for subword-type `compress` operations. Therefore, the current implementation uses the 32-bit SVE `compact` instruction to compress subword types by first widening the high and low parts to 32 bits, compressing them, and then narrowing them back to their original type. Finally, the high and low parts are merged using the `index + tbl` instructions. 
>> >> This approach is significantly slower compared to architectures with native support. After evaluating all available AArch64 SVE instructions and experimenting with various implementations (such as looping over the active elements, extraction, and insertion), I confirmed that the existing algorithm is optimal given the instruction set. However, there is still room for optimization in the following two aspects: >> 1. Merging with `index + tbl` is suboptimal due to the high latency of the `index` instruction. >> 2. For partial subword types, operations to the highest half are unnecessary because those bits are invalid. >> >> This pull request introduces the following changes: >> 1. Replaces `index + tbl` with the `whilelt + splice` instructions, which offer lower latency and higher throughput. >> 2. Eliminates unnecessary compress operations for partial subword type cases. >> 3. For `sve_compress_byte`, one less temporary register is used to alleviate potential register pressure. >> >> Benchmark results demonstrate that these changes significantly improve performance. >> >> Benchmarks on Nvidia Grace machine with 128-bit SVE: >> >> Benchmark Unit Before Error After Error Uplift >> Byte128Vector.compress ops/ms 4846.97 26.23 6638.56 31.60 1.36 >> Byte64Vector.compress ops/ms 2447.69 12.95 7167.68 34.49 2.92 >> Short128Vector.compress ops/ms 7174.88 40.94 8398.45 9.48 1.17 >> Short64Vector.compress ops/ms 3618.72 3.04 8618.22 10.91 2.38 >> >> >> This PR was tested on 128-bit, 256-bit, and 512-bit SVE environments, and all tests passed. > > Would it make sense to additionally run the relevant benchmarks on other popular aarch64 platforms such as Graviton, to make sure the improvements are seen there as well? @galderz Yeah, absolutely. These are the test results on an **AWS Graviton3 V1 machine**; we can see a similar performance gain. 
Benchmark | Units | Before | Error | After | Error | Uplift -- | -- | -- | -- | -- | -- | -- Byte128Vector.compress | ops/ms | 2405.511 | 0.763 | 6116.85 | 17.699 | 2.54284848 Byte64Vector.compress | ops/ms | 1151.662 | 11.262 | 5278.924 | 6.74 | 4.58374419 Double128Vector.compress | ops/ms | 4919.017 | 4.909 | 4940.232 | 20.143 | 1.00431285 Double64Vector.compress | ops/ms | 37.071 | 0.778 | 37.109 | 0.945 | 1.00102506 Float128Vector.compress | ops/ms | 9580.312 | 48.341 | 9586.499 | 74.934 | 1.0006458 Float64Vector.compress | ops/ms | 4943.728 | 7.361 | 4941.917 | 5.871 | 0.99963368 Int128Vector.compress | ops/ms | 9496.991 | 34.972 | 9515.122 | 29.204 | 1.00190913 Int64Vector.compress | ops/ms | 4940.23 | 7.141 | 4941.815 | 5.077 | 1.00032084 Long128Vector.compress | ops/ms | 4918.142 | 14.835 | 4917.148 | 9.05 | 0.99979789 Long64Vector.compress | ops/ms | 36.58 | 0.426 | 36.574 | 0.431 | 0.99983598 Short128Vector.compress | ops/ms | 3343.878 | 0.898 | 6813.421 | 4.143 | 2.03758062 Short64Vector.compress | ops/ms | 1595.358 | 3.37 | 3390.959 | 3.55 | 2.12551603 ------------- PR Comment: https://git.openjdk.org/jdk/pull/27188#issuecomment-3291355148 From qxing at openjdk.org Mon Sep 15 10:18:25 2025 From: qxing at openjdk.org (Qizheng Xing) Date: Mon, 15 Sep 2025 10:18:25 GMT Subject: RFR: 8347499: C2: Make `PhaseIdealLoop` eliminate more redundant safepoints in loops [v2] In-Reply-To: References: Message-ID: On Wed, 2 Apr 2025 07:22:13 GMT, Emanuel Peter wrote: >> The second question: >> >>> If we now removed safepoints in places where we would actually have needed them: how would we find out? I suppose we would get longer time to safepoint - higher latency in some cases. How would we catch this with our tests? >> >> I tried running tier1 tests with `JAVA_OPTIONS=-XX:+UnlockDiagnosticVMOptions -XX:+SafepointTimeout -XX:+AbortVMOnSafepointTimeout -XX:SafepointTimeoutDelay=1000`, and there were no failures. 
>> >> Running with `-XX:SafepointTimeoutDelay=500` caused 1 random JDK test case to fail. But then I tried to build a JDK without this patch, and it still had the random failure with this option. > > @MaxXSoft Would you mind improving the documentation comments, so that they are easier to understand? Maybe you can even add more comments around your code change, to "prove" why it is ok to do what we would do with your change? Hi @eme64, this PR is now ready for further reviews. Could you please continue to review this PR? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23057#issuecomment-3291428642 From galder at openjdk.org Mon Sep 15 10:32:14 2025 From: galder at openjdk.org (Galder Zamarreño) Date: Mon, 15 Sep 2025 10:32:14 GMT Subject: RFR: 8366333: AArch64: Enhance SVE subword type implementation of vector compress [v2] In-Reply-To: References: Message-ID: On Mon, 15 Sep 2025 09:58:20 GMT, erifan wrote: >> The AArch64 SVE and SVE2 architectures lack an instruction suitable for subword-type `compress` operations. Therefore, the current implementation uses the 32-bit SVE `compact` instruction to compress subword types by first widening the high and low parts to 32 bits, compressing them, and then narrowing them back to their original type. Finally, the high and low parts are merged using the `index + tbl` instructions. >> >> This approach is significantly slower compared to architectures with native support. After evaluating all available AArch64 SVE instructions and experimenting with various implementations (such as looping over the active elements, extraction, and insertion), I confirmed that the existing algorithm is optimal given the instruction set. However, there is still room for optimization in the following two aspects: >> 1. Merging with `index + tbl` is suboptimal due to the high latency of the `index` instruction. >> 2. For partial subword types, operations to the highest half are unnecessary because those bits are invalid. 
>> >> This pull request introduces the following changes: >> 1. Replaces `index + tbl` with the `whilelt + splice` instructions, which offer lower latency and higher throughput. >> 2. Eliminates unnecessary compress operations for partial subword type cases. >> 3. For `sve_compress_byte`, one less temporary register is used to alleviate potential register pressure. >> >> Benchmark results demonstrate that these changes significantly improve performance. >> >> Benchmarks on Nvidia Grace machine with 128-bit SVE: >> >> Benchmark Unit Before Error After Error Uplift >> Byte128Vector.compress ops/ms 4846.97 26.23 6638.56 31.60 1.36 >> Byte64Vector.compress ops/ms 2447.69 12.95 7167.68 34.49 2.92 >> Short128Vector.compress ops/ms 7174.88 40.94 8398.45 9.48 1.17 >> Short64Vector.compress ops/ms 3618.72 3.04 8618.22 10.91 2.38 >> >> >> This PR was tested on 128-bit, 256-bit, and 512-bit SVE environments, and all tests passed. > > erifan has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits: > > - Merge branch 'master' into JDK-8366333-compress > - 8366333: AArch64: Enhance SVE subword type implementation of vector compress > > The AArch64 SVE and SVE2 architectures lack an instruction suitable for > subword-type `compress` operations. Therefore, the current implementation > uses the 32-bit SVE `compact` instruction to compress subword types by > first widening the high and low parts to 32 bits, compressing them, and > then narrowing them back to their original type. Finally, the high and > low parts are merged using the `index + tbl` instructions. > > This approach is significantly slower compared to architectures with native > support. After evaluating all available AArch64 SVE instructions and > experimenting with various implementations (such as looping over the active > elements, extraction, and insertion), I confirmed that the existing algorithm > is optimal given the instruction set. 
However, there is still room for > optimization in the following two aspects: > 1. Merging with `index + tbl` is suboptimal due to the high latency of > the `index` instruction. > 2. For partial subword types, operations to the highest half are unnecessary > because those bits are invalid. > > This pull request introduces the following changes: > 1. Replaces `index + tbl` with the `whilelt + splice` instructions, which > offer lower latency and higher throughput. > 2. Eliminates unnecessary compress operations for partial subword type cases. > 3. For `sve_compress_byte`, one less temporary register is used to alleviate > potential register pressure. > > Benchmark results demonstrate that these changes significantly improve performance. > > Benchmarks on Nvidia Grace machine with 128-bit SVE: > ``` > Benchmark Unit Before Error After Error Uplift > Byte128Vector.compress ops/ms 4846.97 26.23 6638.56 31.60 1.36 > Byte64Vector.compress ops/ms 2447.69 12.95 7167.68 34.49 2.92 > Short128Vector.compress ops/ms 7174.88 40.94 8398.45 9.48 1.17 > Short64Vector.compress ops/ms 3618.72 3.04 8618.22 10.91 2.38 > ``` > > This PR was tested on 128-bit, 256-bit, and 512-bit SVE environments, > and all tests passed. Marked as reviewed by galder (Author). ------------- PR Review: https://git.openjdk.org/jdk/pull/27188#pullrequestreview-3223928525 From dlunden at openjdk.org Mon Sep 15 11:46:55 2025 From: dlunden at openjdk.org (Daniel Lundén) Date: Mon, 15 Sep 2025 11:46:55 GMT Subject: RFR: 8367397: Improve naming and terminology in regmask.hpp and regmask.cpp [v8] In-Reply-To: References: Message-ID: > Some names in `regmask.hpp` and `regmask.cpp` are unclear and should be improved. > > ### Changeset > > - Rename `RM_SIZE` to `RM_SIZE_IN_INTS` and `_RM_I` to `_RM_INT` to make it clear that these refer to integer-sized (32-bit) array elements. 
> - Rename `_RM_SIZE` to `_RM_SIZE_IN_WORDS` and `_RM_UP` to `_RM_WORD` to make it clear that these refer to machine-word-sized (32 or 64 bits depending on platform) array elements. > - Rename `_RM_MAX` to `_RM_WORD_MAX_INDEX` for clarity. > - Rename `is_AllStack` to `is_infinite` (and related resulting changes in comments and local variables). The old terminology "all-stack", referring to the infinite register mask bits, is misleading (as pointed out by @eme64 in https://github.com/openjdk/jdk/pull/20404#discussion_r2316234008). The reason is that the infinite bits do not represent *all* stack bits. Some stack bits are instead part of the non-infinite bits of the register mask. > > ### Testing > > - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/17638365968) > - `tier1` and HotSpot parts of `tier2` and `tier3` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: Rename set_infinite to set_infinite_stack ------------- Changes: - all: https://git.openjdk.org/jdk/pull/27215/files - new: https://git.openjdk.org/jdk/pull/27215/files/cf247cd2..79ebf2c3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=27215&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27215&range=06-07 Stats: 4 lines in 3 files changed: 0 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/27215.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27215/head:pull/27215 PR: https://git.openjdk.org/jdk/pull/27215 From dlunden at openjdk.org Mon Sep 15 11:46:57 2025 From: dlunden at openjdk.org (Daniel Lundén) Date: Mon, 15 Sep 2025 11:46:57 GMT Subject: RFR: 8367397: Improve naming and terminology in regmask.hpp and regmask.cpp [v7] In-Reply-To: <4TEpZlUksghJKcxz5Vd0kxvlt13fr-Z-4xdHD91NFtQ=.1ab10a84-1bb7-4704-a86b-dadbda0e38f4@github.com> References: 
<4TEpZlUksghJKcxz5Vd0kxvlt13fr-Z-4xdHD91NFtQ=.1ab10a84-1bb7-4704-a86b-dadbda0e38f4@github.com> Message-ID: On Sat, 13 Sep 2025 04:46:28 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/regmask.hpp line 173: >> >>> 171: } >>> 172: >>> 173: void set_infinite() { >> >> Suggestion: >> >> void set_infinite_stack() { >> >> For consistency with `is_infinite_stack()`. > > Yes, it should be `set_infinite_stack` in parallel with `is_infinite_stack`, nice catch! Good catch, now updated. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27215#discussion_r2348729018 From epeter at openjdk.org Mon Sep 15 12:12:25 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 15 Sep 2025 12:12:25 GMT Subject: RFR: 8367397: Improve naming and terminology in regmask.hpp and regmask.cpp [v8] In-Reply-To: References: Message-ID: <3cfLDG8pb2LJyTfZBsd4euG-I-MPmP7jFuj4cColG10=.54cdec43-ea4e-4302-b4e6-e652ba754e77@github.com> On Mon, 15 Sep 2025 11:46:55 GMT, Daniel Lundén wrote: >> Some names in `regmask.hpp` and `regmask.cpp` are unclear and should be improved. >> >> ### Changeset >> >> - Rename `RM_SIZE` to `RM_SIZE_IN_INTS` and `_RM_I` to `_RM_INT` to make it clear that these refer to integer-sized (32-bit) array elements. >> - Rename `_RM_SIZE` to `_RM_SIZE_IN_WORDS` and `_RM_UP` to `_RM_WORD` to make it clear that these refer to machine-word-sized (32 or 64 bits depending on platform) array elements. >> - Rename `_RM_MAX` to `_RM_WORD_MAX_INDEX` for clarity. >> - Rename `is_AllStack` to `is_infinite` (and related resulting changes in comments and local variables). The old terminology "all-stack", referring to the infinite register mask bits, is misleading (as pointed out by @eme64 in https://github.com/openjdk/jdk/pull/20404#discussion_r2316234008). The reason is that the infinite bits do not represent *all* stack bits. Some stack bits are instead part of the non-infinite bits of the register mask. 
>> >> ### Testing >> >> - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/17638365968) >> - `tier1` and HotSpot parts of `tier2` and `tier3` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. > > Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: > > Rename set_infinite to set_infinite_stack Marked as reviewed by epeter (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/27215#pullrequestreview-3224298424 From qamai at openjdk.org Mon Sep 15 12:55:12 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Mon, 15 Sep 2025 12:55:12 GMT Subject: RFR: 8356779: IGV: dump the index of the SafePointNode containing the current JVMS during parsing In-Reply-To: References: Message-ID: On Thu, 4 Sep 2025 05:22:00 GMT, Saranya Natarajan wrote: > This PR prints the index of the SafePointNode containing the current JVMS during parsing in IGV. As stated in JBS, the reason is that there are a lot of nodes during parsing, so it would be nice to know which nodes are currently in the local slots or on the stack when looking at a graph. > > IGV screenshot before the fix: [image: Screenshot 2025-09-15 at 11 56 54] > > IGV screenshot after the fix: [image: Screenshot 2025-09-15 at 11 54 55] Marked as reviewed by qamai (Committer). ------------- PR Review: https://git.openjdk.org/jdk/pull/27083#pullrequestreview-3224467839 From rcastanedalo at openjdk.org Mon Sep 15 14:01:28 2025 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Mon, 15 Sep 2025 14:01:28 GMT Subject: RFR: 8367397: Improve naming and terminology in regmask.hpp and regmask.cpp [v8] In-Reply-To: References: Message-ID: On Mon, 15 Sep 2025 11:46:55 GMT, Daniel Lundén wrote: >> Some names in `regmask.hpp` and `regmask.cpp` are unclear and should be improved. 
>> >> ### Changeset >> >> - Rename `RM_SIZE` to `RM_SIZE_IN_INTS` and `_RM_I` to `_RM_INT` to make it clear that these refer to integer-sized (32-bit) array elements. >> - Rename `_RM_SIZE` to `_RM_SIZE_IN_WORDS` and `_RM_UP` to `_RM_WORD` to make it clear that these refer to machine-word-sized (32 or 64 bits depending on platform) array elements. >> - Rename `_RM_MAX` to `_RM_WORD_MAX_INDEX` for clarity. >> - Rename `is_AllStack` to `is_infinite` (and related resulting changes in comments and local variables). The old terminology "all-stack", referring to the infinite register mask bits, is misleading (as pointed out by @eme64 in https://github.com/openjdk/jdk/pull/20404#discussion_r2316234008). The reason is that the infinite bits do not represent *all* stack bits. Some stack bits are instead part of the non-infinite bits of the register mask. >> >> ### Testing >> >> - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/17638365968) >> - `tier1` and HotSpot parts of `tier2` and `tier3` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. > > Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: > > Rename set_infinite to set_infinite_stack Looks good! ------------- Marked as reviewed by rcastanedalo (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/27215#pullrequestreview-3224780646 From shade at openjdk.org Mon Sep 15 14:08:57 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 15 Sep 2025 14:08:57 GMT Subject: RFR: 8367313: CTW: Execute in AWT headless mode [v2] In-Reply-To: References: Message-ID: > I have been doing CTW parallelization improvements, and noticed that some of the AWT clinits run and initialize graphics stack. This is awkward for a few reasons: > > 1. We might be running on headless environment and these clinits could fail, shrinking the CTW testing scope. > 2. 
There are dependencies in graphics stack initialization that break -- in one case in my parallelization tests, I have seen the VM crash due to uninitialized AWT lock, because randomized CTW runner managed to execute clinits in unusual order. Running in headless mode avoids dealing with that path altogether. > > I think we should be running CTW tests in AWT headless mode to begin with. > > Additional testing: > - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: - Merge branch 'master' into JDK-8367313-ctw-headless-mode - Fix ------------- Changes: - all: https://git.openjdk.org/jdk/pull/27187/files - new: https://git.openjdk.org/jdk/pull/27187/files/c4684176..75df3054 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=27187&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27187&range=00-01 Stats: 29954 lines in 1034 files changed: 14950 ins; 9185 del; 5819 mod Patch: https://git.openjdk.org/jdk/pull/27187.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27187/head:pull/27187 PR: https://git.openjdk.org/jdk/pull/27187 From shade at openjdk.org Mon Sep 15 14:08:58 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 15 Sep 2025 14:08:58 GMT Subject: RFR: 8367313: CTW: Execute in AWT headless mode In-Reply-To: References: Message-ID: On Wed, 10 Sep 2025 08:11:43 GMT, Aleksey Shipilev wrote: > I have been doing CTW parallelization improvements, and noticed that some of the AWT clinits run and initialize graphics stack. This is awkward for a few reasons: > > 1. We might be running on headless environment and these clinits could fail, shrinking the CTW testing scope. > 2. 
There are dependencies in graphics stack initialization that break -- in one case in my parallelization tests, I have seen the VM crash due to uninitialized AWT lock, because randomized CTW runner managed to execute clinits in unusual order. Running in headless mode avoids dealing with that path altogether. > > I think we should be running CTW tests in AWT headless mode to begin with. > > Additional testing: > - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` Friendly reminder; @TobiHartmann, maybe? ------------- PR Comment: https://git.openjdk.org/jdk/pull/27187#issuecomment-3292351557 From roland at openjdk.org Mon Sep 15 14:25:02 2025 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 15 Sep 2025 14:25:02 GMT Subject: RFR: 8361702: C2: assert(is_dominator(compute_early_ctrl(limit, limit_ctrl), pre_end)) failed: node pinned on loop exit test? [v9] In-Reply-To: <1-3MDixhdwZEgDMpoAZckhK5_lFjygsKl4q1__tsCKs=.dffa9c0e-8ea1-4465-a1fc-6ad2dbcfe5db@github.com> References: <1-3MDixhdwZEgDMpoAZckhK5_lFjygsKl4q1__tsCKs=.dffa9c0e-8ea1-4465-a1fc-6ad2dbcfe5db@github.com> Message-ID: > A node in a pre loop only has uses out of the loop dominated by the > loop exit. `PhaseIdealLoop::try_sink_out_of_loop()` sets its control > to the loop exit projection. A range check in the main loop has this > node as input (through a chain of some other nodes). Range check > elimination needs to update the exit condition of the pre loop with an > expression that depends on the node pinned on its exit: that's > impossible and the assert fires. This is a variant of 8314024 (this > one was for a node with uses out of the pre loop on multiple paths). I > propose the same fix: leave the node with control in the pre loop in > this case. 
Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: review ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26424/files - new: https://git.openjdk.org/jdk/pull/26424/files/ec28714e..ed103c23 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26424&range=08 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26424&range=07-08 Stats: 7 lines in 1 file changed: 6 ins; 1 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26424.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26424/head:pull/26424 PR: https://git.openjdk.org/jdk/pull/26424 From roland at openjdk.org Mon Sep 15 14:25:03 2025 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 15 Sep 2025 14:25:03 GMT Subject: RFR: 8361702: C2: assert(is_dominator(compute_early_ctrl(limit, limit_ctrl), pre_end)) failed: node pinned on loop exit test? [v4] In-Reply-To: References: <1-3MDixhdwZEgDMpoAZckhK5_lFjygsKl4q1__tsCKs=.dffa9c0e-8ea1-4465-a1fc-6ad2dbcfe5db@github.com> <6dWR-SxhuKd9-T3q313I6at4vTBcYlufyCBNjGGopv4=.cae3abea-0752-4191-ac08-890476489af3@github.com> Message-ID: On Mon, 15 Sep 2025 09:39:21 GMT, Emanuel Peter wrote: >> The `@requires` is there because the test run needs command line options that are c2 specific. > > @rwestrel @eme64 is the new commit what you had in mind? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26424#discussion_r2349176886 From shade at openjdk.org Mon Sep 15 14:27:28 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 15 Sep 2025 14:27:28 GMT Subject: RFR: 8357258: x86: Improve receiver type profiling reliability [v2] In-Reply-To: References: Message-ID: > See the bug for discussion what issues current machinery has. > > This PR executes the plan outlined in the bug: > 1. Common the receiver type profiling code in interpreter and C1 > 2. Rewrite receiver type profiling code to only do atomic receiver slot installations > 3. 
Trim `C1OptimizeVirtualCallProfiling` to only claim slots when receiver is installed > > This PR does _not_ do atomic counter updates themselves, as it may have much wider performance implications, including regressions. This PR should be at least performance neutral. > > Additional testing: > - [x] Linux x86_64 server fastdebug, `compiler/` > - [ ] Linux x86_64 server fastdebug, `all` Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: - Merge branch 'master' into JDK-8357258-x86-c1-optimize-virt-calls - Drop atomic counters - Initial version ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25305/files - new: https://git.openjdk.org/jdk/pull/25305/files/e078cfb1..f934435b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25305&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25305&range=00-01 Stats: 29954 lines in 1034 files changed: 14950 ins; 9185 del; 5819 mod Patch: https://git.openjdk.org/jdk/pull/25305.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25305/head:pull/25305 PR: https://git.openjdk.org/jdk/pull/25305 From epeter at openjdk.org Mon Sep 15 15:12:42 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 15 Sep 2025 15:12:42 GMT Subject: RFR: 8361702: C2: assert(is_dominator(compute_early_ctrl(limit, limit_ctrl), pre_end)) failed: node pinned on loop exit test? [v9] In-Reply-To: References: <1-3MDixhdwZEgDMpoAZckhK5_lFjygsKl4q1__tsCKs=.dffa9c0e-8ea1-4465-a1fc-6ad2dbcfe5db@github.com> Message-ID: On Mon, 15 Sep 2025 14:25:02 GMT, Roland Westrelin wrote: >> A node in a pre loop only has uses out of the loop dominated by the >> loop exit. `PhaseIdealLoop::try_sink_out_of_loop()` sets its control >> to the loop exit projection. 
A range check in the main loop has this >> node as input (through a chain of some other nodes). Range check >> elimination needs to update the exit condition of the pre loop with an >> expression that depends on the node pinned on its exit: that's >> impossible and the assert fires. This is a variant of 8314024 (this >> one was for a node with uses out of the pre loop on multiple paths). I >> propose the same fix: leave the node with control in the pre loop in >> this case. > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > review Yes, thanks for the updates. ------------- Marked as reviewed by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26424#pullrequestreview-3225140490 From jbhateja at openjdk.org Mon Sep 15 15:28:54 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 15 Sep 2025 15:28:54 GMT Subject: RFR: 8367333: C2: Vector math operation intrinsification failure In-Reply-To: References: Message-ID: <5xYVPgSuC3a9kqp_hRs3vgtBDoJzlmf9v6wgMa9XFJ4=.c8abf0f6-b563-4b3f-92c3-d902b6e59950@github.com> On Fri, 12 Sep 2025 19:14:18 GMT, Vladimir Ivanov wrote: > As part of [JDK-8353786](https://bugs.openjdk.org/browse/JDK-8353786), C2 support for operations backed by the vector math library was completely removed. On JDK side, there is a special dispatching logic added to avoid intrinsic calls in `jdk.internal.vm.vector.VectorSupport`. But it's still possible to observe such paradoxical situations (intrinsic calls with obsolete operation IDs) when processing effectively dead code. > > Consider `FloatVector::lanewiseTemplate`: > > FloatVector lanewiseTemplate(VectorOperators.Unary op) { > if (opKind(op, VO_SPECIAL)) { > ... 
> else if (opKind(op, VO_MATHLIB)) { > return unaryMathOp(op); > } > } > int opc = opCode(op); > return VectorSupport.unaryOp(opc, ...); > } > > > At runtime, `unaryMathOp` is unconditionally invoked, but during compilation it's possible to end up with an intrinsification attempt of `VectorSupport.unaryOp()` before `opKind(op, VO_SPECIAL)` is inlined. > > It can be reliably reproduced `-XX:+StressIncrementalInlining` flag. > > The fix is to fail-fast intrinsification rather than crashing the VM. > > Testing: tier1 - tier4 LGTM Best Regards test/hotspot/jtreg/compiler/vectorapi/TestVectorMathLib.java line 40: > 38: * -XX:CompileCommand=compileonly,compiler.vectorapi.TestVectorMathLib::test* > 39: * -XX:+StressIncrementalInlining > 40: * compiler.vectorapi.TestVectorMathLib Suggestion: * -XX:+StressIncrementalInlining compiler.vectorapi.TestVectorMathLib ------------- Marked as reviewed by jbhateja (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/27263#pullrequestreview-3225199836 PR Review Comment: https://git.openjdk.org/jdk/pull/27263#discussion_r2349369795 From epeter at openjdk.org Mon Sep 15 15:31:24 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 15 Sep 2025 15:31:24 GMT Subject: RFR: 8367657: C2 SuperWord: NormalMapping demo from JVMLS 2025 Message-ID: Demo from here: https://inside.java/2025/08/16/jvmls-hotspot-auto-vectorization/ Cleaned up and enhanced with a JTREG and IR test. I also added some additional "generated" normal maps from height functions. And I display the resulting image side-by-side with the normal map. I decided to put it in a new directory `compiler.galery`, anticipating other compiler tests that are both visually appealing (i.e. can be used for a "galery") and that we may want to back up with other tests like IR testing. 
Here some snapshots:
[four snapshot images]

-------------

Commit messages:
 - more prints
 - comments
 - update
 - more details
 - documentation
 - IR rule
 - simplify timeout
 - handle timeouts, maybe a bit cluncky
 - fix path issuesg
 - add stand-alone
 - ... and 1 more: https://git.openjdk.org/jdk/compare/2826d170...1bdaf5fc

Changes: https://git.openjdk.org/jdk/pull/27282/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27282&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8367657
  Stats: 659 lines in 4 files changed: 659 ins; 0 del; 0 mod
  Patch: https://git.openjdk.org/jdk/pull/27282.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/27282/head:pull/27282

PR: https://git.openjdk.org/jdk/pull/27282

From eastigeevich at openjdk.org Mon Sep 15 15:43:14 2025
From: eastigeevich at openjdk.org (Evgeny Astigeevich)
Date: Mon, 15 Sep 2025 15:43:14 GMT
Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v43]
In-Reply-To:
References:
Message-ID: <_gl77pppG0Zcwu5LuuEHISHJ27TyQuIgvkQ_fovBYJ0=.2c65934b-cab9-45da-9c34-bd45d68d0ef6@github.com>

On Tue, 26 Aug 2025 10:03:29 GMT, Robbin Ehn wrote:

>> Chad Rakoczy has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Fix WB_RelocateNMethodFromAddr to not use stale nmethod pointer
>
> A side comment, which I don't find it discussed in JEP or in the issues. (maybe I just missed it)
> There can also be a significant performance improvement using direct jumps versus using indirect jump and reduced memory pressure. E.g. a direct BL vs BL to LDR + BR + <8 byte address>.
>
> Hence it would be good to place hot methods within the hot area in "call sequences", as an application may have many hot methods totally unrelated to each other. This also means you really would like to have e.g. vtable stub in reach of BL in above case to get the most out of it.

@robehn

> A side comment, which I don't find it discussed in JEP or in the issues.
(maybe I just missed it) There can also be a significant performance improvement using direct jumps versus using indirect jump and reduced memory pressure. E.g. a direct BL vs BL to LDR + BR + <8 byte address>. Java calls, which are indirect in the original nmethod, will become direct if callees get close to the copy. Java calls, which are direct in the original nmethod, will become indirect if callees get far from the copy. We can do this because we use trampolines for Java calls. Runtime calls, which are indirect in the original nmethod, will stay indirect. Whether runtime calls are direct or indirect depends on the static configuration of CodeCache not on a placement of nmethod in CodeCache. See `target_needs_far_branch()` in `src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp`. We don't use trampolines for runtime calls. Maybe it's worth to switch to use trampolines for runtime calls as well. We have a mechanism of shared trampolines. Runtime calls are always direct for the default CodeCache configuration: 240MB, three code heaps. > > Hence it would be good to place hot methods within the hot area in "call sequences", as an application may have many hot methods totally unrelated to each other. We are working on an algorithm to identify candidates to be placed together in the hot code heap. It can consider putting together a caller and its callees in the hot code heap. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23573#issuecomment-3292817263 From duke at openjdk.org Mon Sep 15 15:44:21 2025 From: duke at openjdk.org (Thomas Zimmermann) Date: Mon, 15 Sep 2025 15:44:21 GMT Subject: RFR: 8367657: C2 SuperWord: NormalMapping demo from JVMLS 2025 In-Reply-To: References: Message-ID: On Mon, 15 Sep 2025 06:01:46 GMT, Emanuel Peter wrote: > Demo from here: > https://inside.java/2025/08/16/jvmls-hotspot-auto-vectorization/ > > Cleaned up and enhanced with a JTREG and IR test. > I also added some additional "generated" normal maps from height functions. 
> And I display the resulting image side-by-side with the normal map.
>
> I decided to put it in a new directory `compiler.galery`, anticipating other compiler tests that are both visually appealing (i.e. can be used for a "galery") and that we may want to back up with other tests like IR testing.
>
> There is a **stand-alone** way to run the demo:
> `java test/hotspot/jtreg/compiler/galery/NormalMapping.java`
> (though it may only run with JDK22+, probably due to some amber features)
>
> Here some snapshots, but **I really recommend pulling the diff and playing with it, it looks much better in motion**:
> [four snapshot images]

Shouldn't it be "gallery" or am I missing a joke?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/27282#issuecomment-3292820860

From cslucas at openjdk.org Mon Sep 15 16:48:42 2025
From: cslucas at openjdk.org (Cesar Soares Lucas)
Date: Mon, 15 Sep 2025 16:48:42 GMT
Subject: RFR: 8361699: C2: assert(can_reduce_phi(n->as_Phi())) failed: Sanity: previous reducible Phi is no longer reducible before SUT [v2]
In-Reply-To:
References:
Message-ID:

On Fri, 12 Sep 2025 12:37:54 GMT, Roberto Castañeda Lozano wrote:

>> src/hotspot/share/opto/escape.cpp line 3135:
>>
>>> 3133: Node* phi = use->ideal_node();
>>> 3134: if (phi->Opcode() == Op_Phi && reducible_merges.member(phi)) {
>>> 3135: if (!can_reduce_phi(phi->as_Phi())) {
>>
>> Drive-by comment: I think the ifs should be merged
>
> @JohnTortugo: this comment is marked as resolved in the PR but I cannot see any reply or actual code change, did you perhaps forget pushing the requested change?

Apologies, I clicked resolve and didn't see it later on. I'll push it as soon as I have some time.
-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/27063#discussion_r2349577410

From epeter at openjdk.org Mon Sep 15 17:27:58 2025
From: epeter at openjdk.org (Emanuel Peter)
Date: Mon, 15 Sep 2025 17:27:58 GMT
Subject: RFR: 8367657: C2 SuperWord: NormalMapping demo from JVMLS 2025 [v2]
In-Reply-To:
References:
Message-ID:

> Demo from here:
> https://inside.java/2025/08/16/jvmls-hotspot-auto-vectorization/
>
> Cleaned up and enhanced with a JTREG and IR test.
> I also added some additional "generated" normal maps from height functions.
> And I display the resulting image side-by-side with the normal map.
>
> I decided to put it in a new directory `compiler.gallery`, anticipating other compiler tests that are both visually appealing (i.e. can be used for a "gallery") and that we may want to back up with other tests like IR testing.
>
> There is a **stand-alone** way to run the demo:
> `java test/hotspot/jtreg/compiler/gallery/NormalMapping.java`
> (though it may only run with JDK22+, probably due to some amber features)
>
> Here some snapshots, but **I really recommend pulling the diff and playing with it, it looks much better in motion**:
> [four snapshot images]

Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:

  galery -> gallery

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/27282/files
  - new: https://git.openjdk.org/jdk/pull/27282/files/1bdaf5fc..40f1f38f

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=27282&range=01
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27282&range=00-01

  Stats: 5 lines in 3 files changed: 0 ins; 0 del; 5 mod
  Patch: https://git.openjdk.org/jdk/pull/27282.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/27282/head:pull/27282

PR: https://git.openjdk.org/jdk/pull/27282

From epeter at openjdk.org Mon Sep 15 17:41:34 2025
From: epeter at openjdk.org (Emanuel Peter)
Date: Mon, 15 Sep 2025 17:41:34 GMT
Subject: RFR: 8367657: C2 SuperWord: NormalMapping demo from JVMLS 2025 [v3]
In-Reply-To:
References:
Message-ID:

> Demo from here:
> https://inside.java/2025/08/16/jvmls-hotspot-auto-vectorization/
>
> Cleaned up and enhanced with a JTREG and IR test.
> I also added some additional "generated" normal maps from height functions.
> And I display the resulting image side-by-side with the normal map.
>
> I decided to put it in a new directory `compiler.gallery`, anticipating other compiler tests that are both visually appealing (i.e. can be used for a "gallery") and that we may want to back up with other tests like IR testing.
>
> There is a **stand-alone** way to run the demo:
> `java test/hotspot/jtreg/compiler/gallery/NormalMapping.java`
> (though it may only run with JDK22+, probably due to some amber features)
>
> Here some snapshots, but **I really recommend pulling the diff and playing with it, it looks much better in motion**:
> [four snapshot images]

Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:

  fix inlining

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/27282/files
  - new: https://git.openjdk.org/jdk/pull/27282/files/40f1f38f..47aa0c7d

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=27282&range=02
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27282&range=01-02

  Stats: 3 lines in 2 files changed: 1 ins; 0 del; 2 mod
  Patch: https://git.openjdk.org/jdk/pull/27282.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/27282/head:pull/27282

PR: https://git.openjdk.org/jdk/pull/27282

From dlunden at openjdk.org Mon Sep 15 17:47:34 2025
From: dlunden at openjdk.org (Daniel Lundén)
Date: Mon, 15 Sep 2025 17:47:34 GMT
Subject: RFR: 8367397: Improve naming and terminology in regmask.hpp and regmask.cpp [v8]
In-Reply-To:
References:
Message-ID:

On Mon, 15 Sep 2025 11:46:55 GMT, Daniel Lundén wrote:

>> Some names in `regmask.hpp` and `regmask.cpp` are unclear and should be improved.
>>
>> ### Changeset
>>
>> - Rename `RM_SIZE` to `RM_SIZE_IN_INTS` and `_RM_I` to `_RM_INT` to make it clear that these refer to integer-sized (32-bit) array elements.
>> - Rename `_RM_SIZE` to `_RM_SIZE_IN_WORDS` and `_RM_UP` to `_RM_WORD` to make it clear that these refer to machine-word-sized (32 or 64 bits depending on platform) array elements.
>> - Rename `_RM_MAX` to `_RM_WORD_MAX_INDEX` for clarity.
>> - Rename `is_AllStack` to `is_infinite` (and related resulting changes in comments and local variables). The old terminology "all-stack", referring to the infinite register mask bits, is misleading (as pointed out by @eme64 in https://github.com/openjdk/jdk/pull/20404#discussion_r2316234008). The reason is that the infinite bits do not represent *all* stack bits. Some stack bits are instead part of the non-infinite bits of the register mask.
>>
>> ### Testing
>>
>> - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/17638365968)
>> - `tier1` and HotSpot parts of `tier2` and `tier3` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64.
>
> Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision:
>
>   Rename set_infinite to set_infinite_stack

Thanks for the reviews everyone!

-------------

PR Comment: https://git.openjdk.org/jdk/pull/27215#issuecomment-3293261684

From dlunden at openjdk.org Mon Sep 15 17:47:36 2025
From: dlunden at openjdk.org (Daniel Lundén)
Date: Mon, 15 Sep 2025 17:47:36 GMT
Subject: Integrated: 8367397: Improve naming and terminology in regmask.hpp and regmask.cpp
In-Reply-To:
References:
Message-ID:

On Thu, 11 Sep 2025 10:04:47 GMT, Daniel Lundén wrote:

> Some names in `regmask.hpp` and `regmask.cpp` are unclear and should be improved.
>
> ### Changeset
>
> - Rename `RM_SIZE` to `RM_SIZE_IN_INTS` and `_RM_I` to `_RM_INT` to make it clear that these refer to integer-sized (32-bit) array elements.
> - Rename `_RM_SIZE` to `_RM_SIZE_IN_WORDS` and `_RM_UP` to `_RM_WORD` to make it clear that these refer to machine-word-sized (32 or 64 bits depending on platform) array elements.
> - Rename `_RM_MAX` to `_RM_WORD_MAX_INDEX` for clarity.
> - Rename `is_AllStack` to `is_infinite` (and related resulting changes in comments and local variables). The old terminology "all-stack", referring to the infinite register mask bits, is misleading (as pointed out by @eme64 in https://github.com/openjdk/jdk/pull/20404#discussion_r2316234008). The reason is that the infinite bits do not represent *all* stack bits. Some stack bits are instead part of the non-infinite bits of the register mask.
>
> ### Testing
>
> - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/17638365968)
> - `tier1` and HotSpot parts of `tier2` and `tier3` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64.

This pull request has now been integrated.

Changeset: 60930a3e
Author: Daniel Lundén
URL: https://git.openjdk.org/jdk/commit/60930a3e196088e239c902216de07e1cce8407e4
Stats: 135 lines in 12 files changed: 15 ins; 0 del; 120 mod

8367397: Improve naming and terminology in regmask.hpp and regmask.cpp

Reviewed-by: epeter, rcastanedalo, dlong

-------------

PR: https://git.openjdk.org/jdk/pull/27215

From sparasa at openjdk.org Mon Sep 15 17:54:31 2025
From: sparasa at openjdk.org (Srinivas Vamsi Parasa)
Date: Mon, 15 Sep 2025 17:54:31 GMT
Subject: RFR: 8354348: Enable Extended EVEX to REX2/REX demotion for commutative operations with same dst and src2 [v5]
In-Reply-To:
References:
Message-ID: <7RTXYjdRF7b27DdNVQHUQx0vmUhr-sqm8XU1cIAoLLo=.638d7f26-d065-47d3-ae08-1e36f75463d5@github.com>

On Thu, 11 Sep 2025 16:25:32 GMT, Srinivas Vamsi Parasa wrote:

>> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   undo new match rules for RegMemReg for commutative operations
>
> Hi Emanuel (@eme64),
>
> Could you please run the tests for this PR?
>
> Thanks,
> Vamsi

> @vamsi-parasa Quickly scanned the patch, looks reasonable. Launching tests.

Hi Emanuel (@eme64), could you please let me know if the tests passed?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/26997#issuecomment-3293301474

From roland at openjdk.org Mon Sep 15 18:16:05 2025
From: roland at openjdk.org (Roland Westrelin)
Date: Mon, 15 Sep 2025 18:16:05 GMT
Subject: RFR: 8361702: C2: assert(is_dominator(compute_early_ctrl(limit, limit_ctrl), pre_end)) failed: node pinned on loop exit test? [v4]
In-Reply-To:
References: <1-3MDixhdwZEgDMpoAZckhK5_lFjygsKl4q1__tsCKs=.dffa9c0e-8ea1-4465-a1fc-6ad2dbcfe5db@github.com>
Message-ID: <-9gD2_d_l2sz1el2iIeJc_GiWA0gmSwdeCygX5aqHbk=.938ce734-f23d-4275-b710-9635e8ce6e2e@github.com>

On Tue, 5 Aug 2025 09:43:35 GMT, Manuel Hässig wrote:

>> Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains eight additional commits since the last revision:
>>
>>  - Merge branch 'master' into JDK-8361702
>>  - Update src/hotspot/share/opto/loopopts.cpp
>>
>>    Co-authored-by: Christian Hagedorn
>>  - Update test/hotspot/jtreg/compiler/rangechecks/TestSunkRangeFromPreLoopRCE3.java
>>
>>    Co-authored-by: Christian Hagedorn
>>  - Update test/hotspot/jtreg/compiler/rangechecks/TestSunkRangeFromPreLoopRCE2.java
>>
>>    Co-authored-by: Christian Hagedorn
>>  - Update src/hotspot/share/opto/loopopts.cpp
>>
>>    Co-authored-by: Christian Hagedorn
>>  - Update test/hotspot/jtreg/compiler/rangechecks/TestSunkRangeFromPreLoopRCE2.java
>>
>>    Co-authored-by: Christian Hagedorn
>>  - tests
>>  - fix
>
> Thank you for working on this, @rwestrel. It looks good to me.
@mhaessig @chhagedorn @eme64 thanks for the reviews ------------- PR Comment: https://git.openjdk.org/jdk/pull/26424#issuecomment-3293366705 From roland at openjdk.org Mon Sep 15 18:16:07 2025 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 15 Sep 2025 18:16:07 GMT Subject: Integrated: 8361702: C2: assert(is_dominator(compute_early_ctrl(limit, limit_ctrl), pre_end)) failed: node pinned on loop exit test? In-Reply-To: <1-3MDixhdwZEgDMpoAZckhK5_lFjygsKl4q1__tsCKs=.dffa9c0e-8ea1-4465-a1fc-6ad2dbcfe5db@github.com> References: <1-3MDixhdwZEgDMpoAZckhK5_lFjygsKl4q1__tsCKs=.dffa9c0e-8ea1-4465-a1fc-6ad2dbcfe5db@github.com> Message-ID: On Tue, 22 Jul 2025 08:25:08 GMT, Roland Westrelin wrote: > A node in a pre loop only has uses out of the loop dominated by the > loop exit. `PhaseIdealLoop::try_sink_out_of_loop()` sets its control > to the loop exit projection. A range check in the main loop has this > node as input (through a chain of some other nodes). Range check > elimination needs to update the exit condition of the pre loop with an > expression that depends on the node pinned on its exit: that's > impossible and the assert fires. This is a variant of 8314024 (this > one was for a node with uses out of the pre loop on multiple paths). I > propose the same fix: leave the node with control in the pre loop in > this case. This pull request has now been integrated. Changeset: f8ba02f2 Author: Roland Westrelin URL: https://git.openjdk.org/jdk/commit/f8ba02f2296f0ef0227f90e0e1ed116121e68231 Stats: 184 lines in 4 files changed: 166 ins; 7 del; 11 mod 8361702: C2: assert(is_dominator(compute_early_ctrl(limit, limit_ctrl), pre_end)) failed: node pinned on loop exit test? 
Reviewed-by: epeter, chagedorn, mhaessig

-------------

PR: https://git.openjdk.org/jdk/pull/26424

From epeter at openjdk.org Mon Sep 15 20:25:11 2025
From: epeter at openjdk.org (Emanuel Peter)
Date: Mon, 15 Sep 2025 20:25:11 GMT
Subject: RFR: 8367657: C2 SuperWord: NormalMapping demo from JVMLS 2025 [v3]
In-Reply-To:
References:
Message-ID:

On Mon, 15 Sep 2025 17:41:34 GMT, Emanuel Peter wrote:

>> Demo from here:
>> https://inside.java/2025/08/16/jvmls-hotspot-auto-vectorization/
>>
>> Cleaned up and enhanced with a JTREG and IR test.
>> I also added some additional "generated" normal maps from height functions.
>> And I display the resulting image side-by-side with the normal map.
>>
>> I decided to put it in a new directory `compiler.gallery`, anticipating other compiler tests that are both visually appealing (i.e. can be used for a "gallery") and that we may want to back up with other tests like IR testing.
>>
>> There is a **stand-alone** way to run the demo:
>> `java test/hotspot/jtreg/compiler/gallery/NormalMapping.java`
>> (though it may only run with JDK22+, probably due to some amber features)
>>
>> Here some snapshots, but **I really recommend pulling the diff and playing with it, it looks much better in motion**:
>> [four snapshot images]
>
> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:
>
>   fix inlining

@grfrost In case you missed it.
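As a rough illustration of what "generated" normal maps from height functions (mentioned in the quoted description) can mean, here is a minimal sketch under stated assumptions: the surface is taken to be z = h(x, y), the slope is estimated with central differences, and every class, method, and parameter name is made up for this note. It is not the code from the PR.

```java
import java.util.function.DoubleBinaryOperator;

// Hypothetical sketch; not the implementation used by the NormalMapping demo.
final class HeightToNormalSketch {
    // Returns the unit normal {nx, ny, nz} of the surface z = h(x, y) at (x, y).
    static double[] normalAt(DoubleBinaryOperator h, double x, double y) {
        final double eps = 1e-3; // assumed finite-difference step
        double dhdx = (h.applyAsDouble(x + eps, y) - h.applyAsDouble(x - eps, y)) / (2 * eps);
        double dhdy = (h.applyAsDouble(x, y + eps) - h.applyAsDouble(x, y - eps)) / (2 * eps);
        // An unnormalized normal of z = h(x, y) is (-dh/dx, -dh/dy, 1).
        double len = Math.sqrt(dhdx * dhdx + dhdy * dhdy + 1.0);
        return new double[] { -dhdx / len, -dhdy / len, 1.0 / len };
    }

    public static void main(String[] args) {
        // Example height function, chosen arbitrarily for the sketch.
        double[] n = normalAt((x, y) -> Math.sin(x) * Math.cos(y), 0.5, 0.25);
        System.out.printf("%.4f %.4f %.4f%n", n[0], n[1], n[2]);
    }
}
```

A flat height function yields the normal (0, 0, 1), and a steeper slope tilts the normal away from the z axis. A renderer would typically store the three components as colour channels, but that encoding step is outside this sketch.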
------------- PR Comment: https://git.openjdk.org/jdk/pull/27282#issuecomment-3293800874 From dlong at openjdk.org Mon Sep 15 23:00:37 2025 From: dlong at openjdk.org (Dean Long) Date: Mon, 15 Sep 2025 23:00:37 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v2] In-Reply-To: <1FgOFS7aAlEbvVUez6iTfzgf2l7qUbL9C4wfSGmmfo0=.406c10f1-63d5-4333-af6d-525e46203182@github.com> References: <0WKwHjzEn5dxYLkonrk4h9yfMI3r3bKDdqgG06J69N4=.e19e9441-6197-4d53-a4f4-b196a81f69d8@github.com> <1FgOFS7aAlEbvVUez6iTfzgf2l7qUbL9C4wfSGmmfo0=.406c10f1-63d5-4333-af6d-525e46203182@github.com> Message-ID: On Thu, 11 Sep 2025 18:24:46 GMT, Vladimir Ivanov wrote: >> @iwanowww Let me know whenever this is ready to review again ? > > @eme64 I think I addressed/answered all your suggestions/questions. Please, take another look. Thanks! @iwanowww , do you have a test that shows constant oops are a problem? My initial impression is that PreserveReachabilityFencesOnConstants shouldn't be needed, because any oops referenced during the compile should go into the ciEnv metadata[] and then into the nmethod oops. So GC can't reclaim these oops because the nmethod keeps references to them. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25315#issuecomment-3294254199 From dlong at openjdk.org Tue Sep 16 01:27:22 2025 From: dlong at openjdk.org (Dean Long) Date: Tue, 16 Sep 2025 01:27:22 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v11] In-Reply-To: References: <1pShdyn-7-wwwiuY1DdMt5iiZ2qc9l_x2F-3AKqkg60=.dd260953-05cc-4b84-b6d1-7f684e74084c@github.com> Message-ID: On Fri, 12 Sep 2025 13:39:03 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/callGenerator.cpp line 620: >> >>> 618: // Inlining logic doesn't expect any extra edges past debug info and fails with >>> 619: // an assert in SafePointNode::grow_stack. 
>>> 620: assert(endoff == call->req(), "reachability edges not supported"); >> >> Could we trip over this assert by modifying the reproducer, and add some method somewhere that gets inlined late? > > Could we also bail out here? Or what would happen now in production if there is a RF edge? We also use this area past endoff() for storing the "ex_oop" (see for example GraphKit::has_saved_ex_oop()). Are ex_oop and reachability edges mutually exclusive? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2350439892 From cslucas at openjdk.org Tue Sep 16 02:35:01 2025 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Tue, 16 Sep 2025 02:35:01 GMT Subject: RFR: 8361699: C2: assert(can_reduce_phi(n->as_Phi())) failed: Sanity: previous reducible Phi is no longer reducible before SUT [v4] In-Reply-To: References: Message-ID: > Please, review this patch to fix issue that may occur when reducing allocation merge. > > As the assert message describe, the problem is a `Phi` considered reducible during one invocation of `adjust_scalar_replaceable_state` turned out to be later non-reducible. This situation can happen if a subsequent invocation of the same method causes all inputs to the phi to be NSR; therefore there is no point in reducing the Phi. It can also happen during the propagation of NSR state done by `find_scalar_replaceable_allocs`. > > The change in `revisit_reducible_phi_status` is just a clean-up. > The real fix is in `find_scalar_replaceable_allocs`. > > Tested on Linux x64/Aarch64 release/fastdebug with JTREG tier1-3. 
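For readers following the thread without the C2 background: an "allocation merge" is a `Phi` whose inputs are separate allocations, and the patch described above concerns when such a `Phi` is still worth reducing. The sketch below shows a source-level pattern that can give rise to this shape. The class and method names are hypothetical, and whether C2 actually builds and reduces such a `Phi` depends on inlining, escape analysis results, and VM flags, so this only illustrates the pattern, not the failing test case.

```java
// Hypothetical illustration; not taken from the PR or its tests.
final class AllocationMergeSketch {
    // Small value holder; neither instance escapes sum() below.
    static final class Point {
        final int x;
        final int y;

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }
    }

    // Two allocations on different branches flow into one local variable.
    // In the compiler's graph, 'p' after the branch is a Phi of the two
    // allocations: the "allocation merge" discussed in the thread.
    static int sum(boolean cond, int a, int b) {
        Point p;
        if (cond) {
            p = new Point(a, b);
        } else {
            p = new Point(b, a + 1);
        }
        return p.x + p.y;
    }

    public static void main(String[] args) {
        long total = 0;
        // Call the method often enough that a JIT compiler may compile it.
        for (int i = 0; i < 100_000; i++) {
            total += sum((i & 1) == 0, i, 1);
        }
        System.out.println(total);
    }
}
```

In the scenario the PR describes, a `Phi` like this was first judged reducible and later stopped being so, for example once all of its inputs had become NSR.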
Cesar Soares Lucas has updated the pull request incrementally with two additional commits since the last revision: - Merge remote-tracking branch 'refs/remotes/origin/ram-non-reducible' into ram-non-reducible - Merge consecutive ifs ------------- Changes: - all: https://git.openjdk.org/jdk/pull/27063/files - new: https://git.openjdk.org/jdk/pull/27063/files/28d9432e..2236348b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=27063&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27063&range=02-03 Stats: 7 lines in 1 file changed: 0 ins; 2 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/27063.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27063/head:pull/27063 PR: https://git.openjdk.org/jdk/pull/27063 From epeter at openjdk.org Tue Sep 16 05:38:13 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 16 Sep 2025 05:38:13 GMT Subject: RFR: 8354348: Enable Extended EVEX to REX2/REX demotion for commutative operations with same dst and src2 [v5] In-Reply-To: References: Message-ID: On Thu, 11 Sep 2025 00:45:45 GMT, Srinivas Vamsi Parasa wrote: >> This change extends Extended EVEX (EEVEX) to REX2/REX demotion for Intel APX NDD instructions to handle commutative operations when the destination register and the second source register (src2) are the same. >> >> Currently, EEVEX to REX2/REX demotion is only enabled when the first source (src1) and the destination are the same. This enhancement allows additional cases of valid demotion for commutative instructions (add, imul, and, or, xor). >> >> For example: >> `eaddl r18, r25, r18` can be encoded as `addl r18, r25` using APX REX2 encoding >> `eaddl r2, r7, r2` can be encoded as `addl r2, r7` using non-APX legacy encoding > > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > undo new match rules for RegMemReg for commutative operations Not reviewed in detail, but looks reasonable. 
Tests pass :)

-------------

Marked as reviewed by epeter (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/26997#pullrequestreview-3227427375

From epeter at openjdk.org Tue Sep 16 06:20:26 2025
From: epeter at openjdk.org (Emanuel Peter)
Date: Tue, 16 Sep 2025 06:20:26 GMT
Subject: RFR: 8347499: C2: Make `PhaseIdealLoop` eliminate more redundant safepoints in loops [v3]
In-Reply-To:
References:
Message-ID:

On Thu, 22 May 2025 07:53:39 GMT, Qizheng Xing wrote:

>> In `PhaseIdealLoop`, `IdealLoopTree::check_safepts` method checks if any call that is guaranteed to have a safepoint dominates the tail of the loop. In the previous implementation, `check_safepts` would stop if it found a local non-call safepoint. At this time, if there was a call before the safepoint in the dom-path, this safepoint would not be eliminated.
>>
>> [image: loop-safepoint]
>>
>> This patch changes the behavior of `check_safepts` to not stop when it finds a non-local safepoint. This makes simple loops with one method call ~3.8% faster (on aarch64).
>>
>>
>> Benchmark              Mode  Cnt       Score      Error  Units
>> LoopSafepoint.loopVar  avgt   15  208296.259 ± 1350.409  ns/op  # baseline
>> LoopSafepoint.loopVar  avgt   15  200692.874 ±  616.770  ns/op  # this patch
>>
>>
>> Testing: tier1-2 on x86_64 and aarch64.
>
> Qizheng Xing has updated the pull request incrementally with one additional commit since the last revision:
>
>   Improve documentation comments

Wow, this took me way too long to have a look at. But I feel like now I understand what's going on, so thanks for the additional changes on the documentation.

The approach seems very reasonable. I'd like to see a few more tests though. Maybe you can instead point me to tests that already exist, that would be fine too.

I'm soon going on vacation, so it may take me even more time to re-review. But I'd suggest that @rwestrel look at this PR, since he last worked on this code.

@MaxXSoft Can you please merge with master as well?
I think we should also run some larger benchmarking on this patch, just to see if there are any surprises (I'd expect mostly improvements, but we shall see). src/hotspot/share/opto/loopnode.cpp line 3818: > 3816: // / | | > 3817: // v +--+ > 3818: // exit 4 This drawing seems a bit confusing. There seem to be 3 edges coming out of 2. Do you think you could fix it too, just to create more clarity in the code? src/hotspot/share/opto/loopnode.cpp line 3830: > 3828: // > 3829: // The insights into the problem: > 3830: // A) Counted loops are okay What does it mean to be "okay"? Why are they "okay"? src/hotspot/share/opto/loopnode.cpp line 3832: > 3830: // A) Counted loops are okay > 3831: // B) Innermost loops are okay because there's no inner loops can delete > 3832: // their ncsfpts Suggestion: // B) Innermost loops are okay because there's no inner loops that can // delete their ncsfpts Missing `that`. I feel that we are now losing information. The previous comment made a promise that your comment does not make any more. Is that intentional? It seems the logic was: only outer loops need to mark safepoints for protection, because only loops further in can remove safepoints. Is that still correct? src/hotspot/share/opto/loopnode.cpp line 3840: > 3838: // inside any nested loop, then that loop is okay > 3839: // E) Otherwise, if an outer loop's ncsfpt on the idom-path is nested in > 3840: // an inner loop, we need to prevent the inner loop from deleting it Nice, that's indeed an improvement :) test/hotspot/jtreg/compiler/c2/irTests/TestLoopSafepoint.java line 24: > 22: */ > 23: > 24: package compiler.c2.irTests; We'd like to get away from putting all IR tests in `irTests`, and we'd rather put them into thematic directories. Proposal: `compiler/loopopts/TestRedundantSafePointElimination.java` test/hotspot/jtreg/compiler/c2/irTests/TestLoopSafepoint.java line 33: > 31: * @summary Tests that redundant safepoints can be eliminated in loops. 
> 32: * @library /test/lib / > 33: * @requires vm.compiler2.enabled Is this `@requires` strictly required? If not, remove it so we can run these tests also with C1 and other compilers. test/hotspot/jtreg/compiler/c2/irTests/TestLoopSafepoint.java line 66: > 64: empty(); > 65: } > 66: } So these do not end up being CountedLoop? test/hotspot/jtreg/compiler/c2/irTests/TestLoopSafepoint.java line 84: > 82: empty(); > 83: } > 84: } All of the cases here are only single loops, right? But is the algorithm not mostly dealing with nested loops, where we have to make sure that in some cases the `SafePoint` is not eliminated? Could you add some extra cases for that? test/micro/org/openjdk/bench/vm/compiler/LoopSafepoint.java line 76: > 74: } > 75: return sum; > 76: } I think it would be nice if you made the examples in the JMH and the JTREG as similar as possible. ------------- Changes requested by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23057#pullrequestreview-3227461101 PR Review Comment: https://git.openjdk.org/jdk/pull/23057#discussion_r2350903845 PR Review Comment: https://git.openjdk.org/jdk/pull/23057#discussion_r2350907448 PR Review Comment: https://git.openjdk.org/jdk/pull/23057#discussion_r2350916805 PR Review Comment: https://git.openjdk.org/jdk/pull/23057#discussion_r2350922211 PR Review Comment: https://git.openjdk.org/jdk/pull/23057#discussion_r2350951501 PR Review Comment: https://git.openjdk.org/jdk/pull/23057#discussion_r2350952358 PR Review Comment: https://git.openjdk.org/jdk/pull/23057#discussion_r2350961237 PR Review Comment: https://git.openjdk.org/jdk/pull/23057#discussion_r2350965087 PR Review Comment: https://git.openjdk.org/jdk/pull/23057#discussion_r2350967582 From epeter at openjdk.org Tue Sep 16 06:28:24 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 16 Sep 2025 06:28:24 GMT Subject: RFR: 8354242: VectorAPI: combine vector not operation with compare [v13] In-Reply-To: References: Message-ID: 
<2DmjQ5OVBp_bHyQl1gMIB6-vNn8AgXjHbK2Geu9pWr8=.c08a85a0-c63e-4d31-9140-64c44c7b8cd6@github.com>

On Mon, 15 Sep 2025 05:43:11 GMT, erifan wrote:

>> This patch optimizes the following patterns:
>> For integer types:
>>
>> (XorV (VectorMaskCmp src1 src2 cond) (Replicate -1))
>>     => (VectorMaskCmp src1 src2 ncond)
>> (XorVMask (VectorMaskCmp src1 src2 cond) (MaskAll m1))
>>     => (VectorMaskCmp src1 src2 ncond)
>>
>> cond can be eq, ne, le, ge, lt, gt, ule, uge, ult and ugt, ncond is the negative comparison of cond.
>>
>> For float and double types:
>>
>> (XorV (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (Replicate -1))
>>     => (VectorMaskCast (VectorMaskCmp src1 src2 ncond))
>> (XorVMask (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (MaskAll m1))
>>     => (VectorMaskCast (VectorMaskCmp src1 src2 ncond))
>>
>> cond can be eq or ne.
>>
>> Benchmarks on Nvidia Grace machine with 128-bit SVE2:
>> With option `-XX:UseSVE=2`:
>>
>> Benchmark                   Unit   Before Score  Error        After Score  Error        Uplift
>> testCompareEQMaskNotByte    ops/s  7912127.225   2677.289518  10266136.26  8955.008548  1.29
>> testCompareEQMaskNotDouble  ops/s  884737.6799   446.963779   1179760.772  448.031844   1.33
>> testCompareEQMaskNotFloat   ops/s  1765045.787   682.332214   2359520.803  896.305743   1.33
>> testCompareEQMaskNotInt     ops/s  1787221.411   977.743935   2353952.519  960.069976   1.31
>> testCompareEQMaskNotLong    ops/s  895297.1974   673.44808    1178449.02   323.804205   1.31
>> testCompareEQMaskNotShort   ops/s  3339987.002   3415.2226    4712761.965  2110.862053  1.41
>> testCompareGEMaskNotByte    ops/s  7907615.16    4094.243652  10251646.9   9486.699831  1.29
>> testCompareGEMaskNotInt     ops/s  1683738.958   4233.813092  2352855.205  1251.952546  1.39
>> testCompareGEMaskNotLong    ops/s  854496.1561   8594.598885  1177811.493  521.1229     1.37
>> testCompareGEMaskNotShort   ops/s  3341860.309   1578.975338  4714008.434  1681.10365   1.41
>> testCompareGTMaskNotByte    ops/s  7910823.674   2993.367032  10245063.58  9774.75138   1.29
>> testCompareGTMaskNotInt     ops/s  1673393.928   3153.099431  2353654.521  1190.848583  1.4
>> testCompareGTMaskNotLong    ops/s  849405.9159   2432.858159  1177952.041  359.96413    1.38
>> testCompareGTMaskNotShort   ops/s  3339509.141   3339.976585  4711442.496  2673.364893  1.41
>> testCompareLEMaskNotByte    ops/s  7911340.004   3114.69191   10231626.5   27134.20035  1.29
>> testCompareLEMaskNotInt     ops/s  1675812.113   1340.969885  2353255.341  1452.4522    1.4
>> testCompareLEMaskNotLong    ops/s  848862.8036   6564.841731  1177763.623  539.290106   1.38
>> testCompareLEMaskNotShort   ops/s  3324951.54    2380.29473   4712116.251  1544.559684  1.41
>> testCompareLTMaskNotByte    ops/s  7910390.844   2630.861436  10239567.69  6487.441672  1.29
>> testCompareLTMaskNotInt     ops/s  16721...
>
> erifan has updated the pull request incrementally with one additional commit since the last revision:
>
>   Add an IR rule for vector mask cast operation

@erifan Nice work on the benchmark refactor! And thanks for the other updates.

I'll run some testing now, should take about 24h.

-------------

PR Review: https://git.openjdk.org/jdk/pull/24674#pullrequestreview-3227664072

From epeter at openjdk.org Tue Sep 16 06:31:34 2025
From: epeter at openjdk.org (Emanuel Peter)
Date: Tue, 16 Sep 2025 06:31:34 GMT
Subject: RFR: 8356813: Improve Mod(I|L)Node::Value [v9]
In-Reply-To: <1ZCEMsPvSQaLGWRuNtO89LNP_XUeaz-edeIUrKwRCZY=.9dad5a02-c739-4e24-8692-8941f31e5a49@github.com>
References: <2Jf_gfvRlKcmCFoQHp5T0WW_fU_yK5-0Z3z41f00-YU=.164be9f0-fae1-44bb-84c3-846d8c2c0db2@github.com> <1ZCEMsPvSQaLGWRuNtO89LNP_XUeaz-edeIUrKwRCZY=.9dad5a02-c739-4e24-8692-8941f31e5a49@github.com>
Message-ID:

On Sun, 14 Sep 2025 14:44:02 GMT, Hannes Greule wrote:

>> This change improves the precision of the `Mod(I|L)Node::Value()` functions.
>>
>> I reordered the structure a bit. First, we handle constants, afterwards, we handle ranges. The bottom checks seem to be excessive (`Type::BOTTOM` is covered by using `isa_(int|long)()`, the local bottom is just the full range).
Given we can even give reasonable bounds if only one input has any bounds, we don't want to return early. >> The changes after that are commented. Please let me know if the explanations are good, or if you have any suggestions. >> >> ### Monotonicity >> >> Before, a 0 divisor resulted in `Type(Int|Long)::POS`. Initially I wanted to keep it this way, but that violates monotonicity during PhaseCCP. As an example, if we see a 0 divisor first and a 3 afterwards, we might try to go from `>=0` to `-2..2`, but the meet of these would be `>=-2` rather than `-2..2`. Using `Type(Int|Long)::ZERO` instead (zero is always in the resulting value if we cover a range). >> >> ### Testing >> >> I added tests for cases around the relevant bounds. I also ran tier1, tier2, and tier3 but didn't see any related failures after addressing the monotonicity problem described above (I'm having a few unrelated failures on my system currently, so separate testing would be appreciated in case I missed something). >> >> Please review and let me know what you think. >> >> ### Other >> >> The `UMod(I|L)Node`s were adjusted to be more in line with its signed variants. This change diverges them again, but similar improvements could be made after #17508. >> >> During experimenting with these changes, I stumbled upon a few things that aren't directly related to this change, but might be worth to further look into: >> - If the divisor is a constant, we will directly replace the `Mod(I|L)Node` with more but less expensive nodes in `::Ideal()`. Type analysis for these nodes combined is less precise, means we miss potential cases were this would help e.g., removing range checks. Would it make sense to delay the replacement? >> - To force non-negative ranges, I'm using `char`. I noticed that method parameters of sub-int integer types all fall back to `TypeInt::INT`. This seems to be an intentional change of https://github.com/openjdk/jdk/commit/200784d505dd98444c48c9ccb7f2e4df36dcbb6a. 
The bug report is private, so I can't really judge if that part is necessary, but it seems odd.
>
> Hannes Greule has updated the pull request incrementally with one additional commit since the last revision:
>
>   remove unused parameter

Testing looks good. Minor changes should be ok, as long as GitHub Actions passes.
Thanks for all the work @SirYwell !

-------------

Marked as reviewed by epeter (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/25254#pullrequestreview-3227698583

From epeter at openjdk.org Tue Sep 16 06:49:20 2025
From: epeter at openjdk.org (Emanuel Peter)
Date: Tue, 16 Sep 2025 06:49:20 GMT
Subject: RFR: 8367333: C2: Vector math operation intrinsification failure
In-Reply-To:
References:
Message-ID:

On Fri, 12 Sep 2025 19:14:18 GMT, Vladimir Ivanov wrote:

> As part of [JDK-8353786](https://bugs.openjdk.org/browse/JDK-8353786), C2 support for operations backed by the vector math library was completely removed. On the JDK side, special dispatching logic was added to avoid intrinsic calls in `jdk.internal.vm.vector.VectorSupport`. But it's still possible to observe such paradoxical situations (intrinsic calls with obsolete operation IDs) when processing effectively dead code.
>
> Consider `FloatVector::lanewiseTemplate`:
>
>     FloatVector lanewiseTemplate(VectorOperators.Unary op) {
>         if (opKind(op, VO_SPECIAL)) {
>             ...
>             else if (opKind(op, VO_MATHLIB)) {
>                 return unaryMathOp(op);
>             }
>         }
>         int opc = opCode(op);
>         return VectorSupport.unaryOp(opc, ...);
>     }
>
> At runtime, `unaryMathOp` is unconditionally invoked, but during compilation it's possible to end up with an intrinsification attempt of `VectorSupport.unaryOp()` before `opKind(op, VO_SPECIAL)` is inlined.
>
> It can be reliably reproduced with the `-XX:+StressIncrementalInlining` flag.
>
> The fix is to fail-fast intrinsification rather than crashing the VM.
>
> Testing: tier1 - tier4

Changes requested by epeter (Reviewer).
test/hotspot/jtreg/compiler/vectorapi/TestVectorMathLib.java line 33: > 31: * @test > 32: * @bug 8367333 > 33: * @requires vm.compiler2.enabled Do you need this `@requires`? It might be nice to be able to run this with other compilers too. test/hotspot/jtreg/compiler/vectorapi/TestVectorMathLib.java line 40: > 38: * -XX:CompileCommand=compileonly,compiler.vectorapi.TestVectorMathLib::test* > 39: * -XX:+StressIncrementalInlining > 40: * compiler.vectorapi.TestVectorMathLib Like @jatin-bhateja mentioned: alignment is off. I'd also like to see a run without flags, maybe with only `-XX:CompileCommand=compileonly,compiler.vectorapi.TestVectorMathLib::test*` ------------- PR Review: https://git.openjdk.org/jdk/pull/27263#pullrequestreview-3227803098 PR Review Comment: https://git.openjdk.org/jdk/pull/27263#discussion_r2351063466 PR Review Comment: https://git.openjdk.org/jdk/pull/27263#discussion_r2351069465 From hgreule at openjdk.org Tue Sep 16 06:50:28 2025 From: hgreule at openjdk.org (Hannes Greule) Date: Tue, 16 Sep 2025 06:50:28 GMT Subject: RFR: 8356813: Improve Mod(I|L)Node::Value In-Reply-To: References: <2Jf_gfvRlKcmCFoQHp5T0WW_fU_yK5-0Z3z41f00-YU=.164be9f0-fae1-44bb-84c3-846d8c2c0db2@github.com> Message-ID: On Fri, 12 Sep 2025 12:12:21 GMT, Emanuel Peter wrote: >> @merykitty thanks, I hopefully addressed your comments :) >> >> @eme64 do you want to re-run the tests once again? > > @SirYwell Launching tests ? Thanks @eme64! Do I need another re-approval from @merykitty or are we ready to integrate? 
------------- PR Comment: https://git.openjdk.org/jdk/pull/25254#issuecomment-3295932705 From epeter at openjdk.org Tue Sep 16 06:51:38 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 16 Sep 2025 06:51:38 GMT Subject: RFR: 8367313: CTW: Execute in AWT headless mode [v2] In-Reply-To: References: Message-ID: On Mon, 15 Sep 2025 14:08:57 GMT, Aleksey Shipilev wrote: >> I have been doing CTW parallelization improvements, and noticed that some of the AWT clinits run and initialize graphics stack. This is awkward for a few reasons: >> >> 1. We might be running on headless environment and these clinits could fail, shrinking the CTW testing scope. >> 2. There are dependencies in graphics stack initialization that break -- in one case in my parallelization tests, I have seen the VM crash due to uninitialized AWT lock, because randomized CTW runner managed to execute clinits in unusual order. Running in headless mode avoids dealing with that path altogether. >> >> I think we should be running CTW tests in AWT headless mode to begin with. >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: > > - Merge branch 'master' into JDK-8367313-ctw-headless-mode > - Fix @TobiHartmann is on vacation. Maybe @vnkozlov ? 
------------- PR Comment: https://git.openjdk.org/jdk/pull/27187#issuecomment-3295941040 From epeter at openjdk.org Tue Sep 16 06:54:35 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 16 Sep 2025 06:54:35 GMT Subject: RFR: 8367313: CTW: Execute in AWT headless mode [v2] In-Reply-To: References: Message-ID: <5OMky7jrDAsxrSC50xEfkVP1mFvoFQ0VB2trl46a7i8=.bda7d7f9-35b5-4553-b0ff-b26776cfef57@github.com> On Mon, 15 Sep 2025 14:08:57 GMT, Aleksey Shipilev wrote: >> I have been doing CTW parallelization improvements, and noticed that some of the AWT clinits run and initialize graphics stack. This is awkward for a few reasons: >> >> 1. We might be running on headless environment and these clinits could fail, shrinking the CTW testing scope. >> 2. There are dependencies in graphics stack initialization that break -- in one case in my parallelization tests, I have seen the VM crash due to uninitialized AWT lock, because randomized CTW runner managed to execute clinits in unusual order. Running in headless mode avoids dealing with that path altogether. >> >> I think we should be running CTW tests in AWT headless mode to begin with. >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: > > - Merge branch 'master' into JDK-8367313-ctw-headless-mode > - Fix Looks reasonable. I'll run some internal testing, takes about 24h. 
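For readers unfamiliar with AWT headless mode as discussed in the quoted description: the mail does not show how the CTW runner enables it, so the following is only a minimal sketch that assumes the standard `java.awt.headless` system property is used. The class name `HeadlessCheck` is hypothetical and not part of the PR.

```java
import java.awt.GraphicsEnvironment;

// Hypothetical helper, not taken from the PR: shows the effect of the
// standard java.awt.headless system property on AWT.
final class HeadlessCheck {
    private HeadlessCheck() {}

    // Requests headless mode and reports what AWT sees. The property must be
    // set before AWT first queries it, because the answer is cached afterwards.
    static boolean forceHeadless() {
        System.setProperty("java.awt.headless", "true");
        return GraphicsEnvironment.isHeadless();
    }

    public static void main(String[] args) {
        System.out.println("headless = " + forceHeadless());
    }
}
```

In a test setup the same effect is usually obtained by passing `-Djava.awt.headless=true` on the command line, which avoids any ordering concerns with class initializers.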
-------------

PR Review: https://git.openjdk.org/jdk/pull/27187#pullrequestreview-3227841960

From manc at openjdk.org Tue Sep 16 06:57:16 2025
From: manc at openjdk.org (Man Cao)
Date: Tue, 16 Sep 2025 06:57:16 GMT
Subject: RFR: 8367613: Test compiler/runtime/TestDontCompileHugeMethods.java failed
Message-ID:

Hi,

Could anyone approve this change that excludes this test when running with `-Xcomp`? This avoids the test failure reported in [JDK-8367613](https://bugs.openjdk.org/browse/JDK-8367613).

For reasons I don't yet understand, the `HugeSwitch::shortMethod` method is not compiled under `-Xcomp -XX:TieredStopAtLevel=1`. The method gets compiled with either `-Xcomp` or `-XX:TieredStopAtLevel=1`, but not both. I would appreciate it if anyone could provide insights on possible reasons.

-------------

Commit messages:
 - 8367613: Test compiler/runtime/TestDontCompileHugeMethods.java failed

Changes: https://git.openjdk.org/jdk/pull/27306/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27306&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8367613
  Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod
  Patch: https://git.openjdk.org/jdk/pull/27306.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/27306/head:pull/27306

PR: https://git.openjdk.org/jdk/pull/27306

From chagedorn at openjdk.org Tue Sep 16 07:09:03 2025
From: chagedorn at openjdk.org (Christian Hagedorn)
Date: Tue, 16 Sep 2025 07:09:03 GMT
Subject: RFR: 8367657: C2 SuperWord: NormalMapping demo from JVMLS 2025 [v3]
In-Reply-To:
References:
Message-ID:

On Mon, 15 Sep 2025 17:41:34 GMT, Emanuel Peter wrote:

>> Demo from here:
>> https://inside.java/2025/08/16/jvmls-hotspot-auto-vectorization/
>>
>> Cleaned up and enhanced with a JTREG and IR test.
>> I also added some additional "generated" normal maps from height functions.
>> And I display the resulting image side-by-side with the normal map.
>>
>> I decided to put it in a new directory `compiler.gallery`, anticipating other compiler tests that are both visually appealing (i.e. can be used for a "gallery") and that we may want to back up with other tests like IR testing.
>>
>> There is a **stand-alone** way to run the demo:
>> `java test/hotspot/jtreg/compiler/gallery/NormalMapping.java`
>> (though it may only run with JDK22+, probably due to some amber features)
>>
>> Here are some snapshots, but **I really recommend pulling the diff and playing with it, it looks much better in motion**:
>> [four images omitted]
>
> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:
>
>   fix inlining

Great work and thanks for sharing it! A few small suggestions, otherwise, it looks good to me!

test/hotspot/jtreg/compiler/gallery/NormalMapping.java line 40:

> 38: import java.awt.geom.Path2D;
> 39: import javax.swing.JPanel;
> 40: import java.awt.Font;

Some unused imports (double check again after removing):
Suggestion:

import java.awt.Graphics;
import java.awt.Graphics2D;
import java.awt.Color;
import java.awt.image.BufferedImage;
import java.awt.image.DataBufferInt;
import java.io.IOException;
import java.util.Random;
import javax.swing.JPanel;
import java.awt.Font;

test/hotspot/jtreg/compiler/gallery/NormalMapping.java line 88:

> 86: System.out.println("Welcome to the Normal Mapping Demo!");
> 87: // Create an applicateion state with 5 lights.
> 88: State state = new State(5);

I suggest putting `5` into a named constant. This invites playing around with a different number of lights.
test/hotspot/jtreg/compiler/gallery/NormalMapping.java line 93: > 91: System.out.println("Setting up Window..."); > 92: MyDrawingPanel panel = new MyDrawingPanel(state); > 93: JFrame frame = new JFrame("Normal Mapping Demo (Auto Vectorization)"); Suggestion: JFrame frame = new JFrame("Normal Mapping Demo (Auto-Vectorization)"); test/hotspot/jtreg/compiler/gallery/NormalMapping.java line 121: > 119: } > 120: > 121: public static File getLocalFile(String name) { Isn't `name` always constant (i.e. `normal_map.png`)? Then you could also extract that to a constant and use it here directly. test/hotspot/jtreg/compiler/gallery/NormalMapping.java line 149: > 147: } > 148: > 149: public static class Light { Maybe add a quick comment what this class does since it's a demo and one might want to better understand what's going on. Same for `State` class below. test/hotspot/jtreg/compiler/gallery/NormalMapping.java line 170: > 168: dy *= 0.99; > 169: dx += RANDOM.nextFloat() * 0.001 - 0.0005;; > 170: dy += RANDOM.nextFloat() * 0.001 - 0.0005;; Suggestion: dx += RANDOM.nextFloat() * 0.001 - 0.0005; dy += RANDOM.nextFloat() * 0.001 - 0.0005; test/hotspot/jtreg/compiler/gallery/NormalMapping.java line 244: > 242: > 243: public void nextNormals() { > 244: switch(nextNormalsId) { Suggestion: switch (nextNormalsId) { test/hotspot/jtreg/compiler/gallery/NormalMapping.java line 299: > 297: interface HeightFunction { > 298: // x and y should be in [0..1] > 299: public double call(double x, double y); Implicit: Suggestion: double call(double x, double y); test/hotspot/jtreg/compiler/gallery/NormalMapping.java line 310: > 308: > 309: // A selection of "height functions": > 310: return switch(name) { Suggestion: return switch (name) { test/hotspot/jtreg/compiler/gallery/NormalMapping.java line 314: > 312: case "heart" -> { > 313: double heart = Math.abs(Math.pow(x*x + y*y - 1, 3) - x*x * Math.pow(-y, 3)); > 314: double decay = Math.exp(-(x*x + y*y)); Suggestion: double heart = 
Math.abs(Math.pow(x * x + y * y - 1, 3) - x * x * Math.pow(-y, 3)); double decay = Math.exp(-(x * x + y * y)); test/hotspot/jtreg/compiler/gallery/NormalMapping.java line 318: > 316: } > 317: case "hill" -> 0.5 * Math.exp(-(x*x + y*y)); > 318: case "ripple" -> 0.01 * Math.sin(x*x + y*y); Suggestion: case "hill" -> 0.5 * Math.exp(-(x * x + y * y)); case "ripple" -> 0.01 * Math.sin(x * x + y * y); test/hotspot/jtreg/compiler/gallery/NormalMapping.java line 411: > 409: for (int i = 0; i < lights.length; i++) { > 410: lights[i].update(); > 411: } As below, you could use enhanced-for: Suggestion: for (Light light : lights) { light.update(); } test/hotspot/jtreg/compiler/gallery/NormalMapping.java line 417: > 415: Arrays.fill(outputArray, 0); > 416: > 417: // Add inn the contribution of each light Suggestion: // Add in the contribution of each light test/hotspot/jtreg/compiler/gallery/NormalMapping.java line 463: > 461: float luminosity = Math.max(0, dotProduct / d3) * luminosityCorrection; > 462: > 463: // Now we we compute the color values that hopefully end up in the range Suggestion: // Now we compute the color values that hopefully end up in the range test/hotspot/jtreg/compiler/gallery/NormalMapping.java line 480: > 478: > 479: // This is a bit of a horrible hack, but it mostly works. > 480: // Essencially, it tries to solve the "exposure" problem: Suggestion: // Essentially, it tries to solve the "exposure" problem: test/hotspot/jtreg/compiler/gallery/TestNormalMapping.java line 29: > 27: * @summary Visual example of auto vectorization: normal mapping. > 28: * @library /test/lib / > 29: * @run main compiler.gallery.TestNormalMapping ir This should be `driver` because otherwise, we will be stressing the driver VM when run with `Xcomp` etc. 
Suggestion: * @run driver compiler.gallery.TestNormalMapping ir ------------- PR Review: https://git.openjdk.org/jdk/pull/27282#pullrequestreview-3227811932 PR Review Comment: https://git.openjdk.org/jdk/pull/27282#discussion_r2351083668 PR Review Comment: https://git.openjdk.org/jdk/pull/27282#discussion_r2351084533 PR Review Comment: https://git.openjdk.org/jdk/pull/27282#discussion_r2351069089 PR Review Comment: https://git.openjdk.org/jdk/pull/27282#discussion_r2351078912 PR Review Comment: https://git.openjdk.org/jdk/pull/27282#discussion_r2351088352 PR Review Comment: https://git.openjdk.org/jdk/pull/27282#discussion_r2351085454 PR Review Comment: https://git.openjdk.org/jdk/pull/27282#discussion_r2351093619 PR Review Comment: https://git.openjdk.org/jdk/pull/27282#discussion_r2351098053 PR Review Comment: https://git.openjdk.org/jdk/pull/27282#discussion_r2351098915 PR Review Comment: https://git.openjdk.org/jdk/pull/27282#discussion_r2351101451 PR Review Comment: https://git.openjdk.org/jdk/pull/27282#discussion_r2351102209 PR Review Comment: https://git.openjdk.org/jdk/pull/27282#discussion_r2351110603 PR Review Comment: https://git.openjdk.org/jdk/pull/27282#discussion_r2351104989 PR Review Comment: https://git.openjdk.org/jdk/pull/27282#discussion_r2351112255 PR Review Comment: https://git.openjdk.org/jdk/pull/27282#discussion_r2351112981 PR Review Comment: https://git.openjdk.org/jdk/pull/27282#discussion_r2351131694 From epeter at openjdk.org Tue Sep 16 07:09:58 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 16 Sep 2025 07:09:58 GMT Subject: RFR: 8366333: AArch64: Enhance SVE subword type implementation of vector compress [v2] In-Reply-To: References: Message-ID: On Mon, 15 Sep 2025 09:58:20 GMT, erifan wrote: >> The AArch64 SVE and SVE2 architectures lack an instruction suitable for subword-type `compress` operations. 
Therefore, the current implementation uses the 32-bit SVE `compact` instruction to compress subword types by first widening the high and low parts to 32 bits, compressing them, and then narrowing them back to their original type. Finally, the high and low parts are merged using the `index + tbl` instructions.
>>
>> This approach is significantly slower compared to architectures with native support. After evaluating all available AArch64 SVE instructions and experimenting with various implementations (such as looping over the active elements, extraction, and insertion), I confirmed that the existing algorithm is optimal given the instruction set. However, there is still room for optimization in the following two aspects:
>> 1. Merging with `index + tbl` is suboptimal due to the high latency of the `index` instruction.
>> 2. For partial subword types, operations to the highest half are unnecessary because those bits are invalid.
>>
>> This pull request introduces the following changes:
>> 1. Replaces `index + tbl` with the `whilelt + splice` instructions, which offer lower latency and higher throughput.
>> 2. Eliminates unnecessary compress operations for partial subword type cases.
>> 3. For `sve_compress_byte`, one less temporary register is used to alleviate potential register pressure.
>>
>> Benchmark results demonstrate that these changes significantly improve performance.
>>
>> Benchmarks on Nvidia Grace machine with 128-bit SVE:
>>
>> Benchmark                Unit    Before   Error  After    Error  Uplift
>> Byte128Vector.compress   ops/ms  4846.97  26.23  6638.56  31.60  1.36
>> Byte64Vector.compress    ops/ms  2447.69  12.95  7167.68  34.49  2.92
>> Short128Vector.compress  ops/ms  7174.88  40.94  8398.45   9.48  1.17
>> Short64Vector.compress   ops/ms  3618.72   3.04  8618.22  10.91  2.38
>>
>>
>> This PR was tested on 128-bit, 256-bit, and 512-bit SVE environments, and all tests passed.
>
> erifan has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains two commits:
>
>  - Merge branch 'master' into JDK-8366333-compress
>  - 8366333: AArch64: Enhance SVE subword type implementation of vector compress
>
>    The AArch64 SVE and SVE2 architectures lack an instruction suitable for
>    subword-type `compress` operations. Therefore, the current implementation
>    uses the 32-bit SVE `compact` instruction to compress subword types by
>    first widening the high and low parts to 32 bits, compressing them, and
>    then narrowing them back to their original type. Finally, the high and
>    low parts are merged using the `index + tbl` instructions.
>
>    This approach is significantly slower compared to architectures with native
>    support. After evaluating all available AArch64 SVE instructions and
>    experimenting with various implementations (such as looping over the active
>    elements, extraction, and insertion), I confirmed that the existing algorithm
>    is optimal given the instruction set. However, there is still room for
>    optimization in the following two aspects:
>    1. Merging with `index + tbl` is suboptimal due to the high latency of
>    the `index` instruction.
>    2. For partial subword types, operations to the highest half are unnecessary
>    because those bits are invalid.
>
>    This pull request introduces the following changes:
>    1. Replaces `index + tbl` with the `whilelt + splice` instructions, which
>    offer lower latency and higher throughput.
>    2. Eliminates unnecessary compress operations for partial subword type cases.
>    3. For `sve_compress_byte`, one less temporary register is used to alleviate
>    potential register pressure.
>
>    Benchmark results demonstrate that these changes significantly improve performance.
> > Benchmarks on Nvidia Grace machine with 128-bit SVE: > ``` > Benchmark Unit Before Error After Error Uplift > Byte128Vector.compress ops/ms 4846.97 26.23 6638.56 31.60 1.36 > Byte64Vector.compress ops/ms 2447.69 12.95 7167.68 34.49 2.92 > Short128Vector.compress ops/ms 7174.88 40.94 8398.45 9.48 1.17 > Short64Vector.compress ops/ms 3618.72 3.04 8618.22 10.91 2.38 > ``` > > This PR was tested on 128-bit, 256-bit, and 512-bit SVE environments, > and all tests passed. Drive-by comments, going on vacation soon so don't depend on me fully reviewing this any time soon ;) src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2287: > 2285: sve_compress_short(dst, vtmp1, ptmp, vtmp2, vtmp3, pgtmp, extended_size > MaxVectorSize ? MaxVectorSize : extended_size); > 2286: // Narrow the result back to type BYTE. > 2287: // dst = 0 0 0 0 0 0 0 0 0 0 0 0 0 g c a Can you make sure that your examples are all nicely aligned? src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2315: > 2313: // Combine the compressed low with the compressed high. > 2314: // dst = 0 0 0 0 0 0 0 0 0 0 0 p i g c a > 2315: sve_splice(dst, B, ptmp, vtmp1); Alignment of examples would be nice test/hotspot/jtreg/compiler/vectorapi/VectorCompressTest.java line 36: > 34: * @key randomness > 35: * @library /test/lib / > 36: * @summary AArch64: Enhance SVE subword type implementation of vector compress I would change the summary to something a bit more generic, since the test is not only good for aarch64 / SVE. Suggestion: * @summary IR test for VectorAPI compress test/hotspot/jtreg/compiler/vectorapi/VectorCompressTest.java line 214: > 212: > 213: @Test > 214: @IR(counts = { IRNode.COMPRESS_VD, "= 1" }, applyIfCPUFeature = { "sve", "true" }) Could you please change this so that the `applyIfCPUFeature` is on a new line? 
That would make it easier to add more platforms later :) test/hotspot/jtreg/compiler/vectorapi/VectorCompressTest.java line 228: > 226: .start(); > 227: } > 228: } Question: is there already another test that checks `compress`? ------------- PR Review: https://git.openjdk.org/jdk/pull/27188#pullrequestreview-3227854355 PR Review Comment: https://git.openjdk.org/jdk/pull/27188#discussion_r2351095704 PR Review Comment: https://git.openjdk.org/jdk/pull/27188#discussion_r2351097303 PR Review Comment: https://git.openjdk.org/jdk/pull/27188#discussion_r2351125031 PR Review Comment: https://git.openjdk.org/jdk/pull/27188#discussion_r2351129802 PR Review Comment: https://git.openjdk.org/jdk/pull/27188#discussion_r2351138273 From epeter at openjdk.org Tue Sep 16 07:18:37 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 16 Sep 2025 07:18:37 GMT Subject: RFR: 8365205: C2: Optimize popcount value computation using knownbits [v8] In-Reply-To: References: Message-ID: On Mon, 15 Sep 2025 08:20:35 GMT, Jatin Bhateja wrote: >> This patch optimizes PopCount value transforms using KnownBits information. >> Following are the results of the micro-benchmark included with the patch >> >> >> >> System: 13th Gen Intel(R) Core(TM) i3-1315U >> >> Baseline: >> Benchmark Mode Cnt Score Error Units >> PopCountValueTransform.LogicFoldingKerenLong thrpt 2 215460.670 ops/s >> PopCountValueTransform.LogicFoldingKerenlInt thrpt 2 294014.826 ops/s >> >> Withopt: >> Benchmark Mode Cnt Score Error Units >> PopCountValueTransform.LogicFoldingKerenLong thrpt 2 389978.082 ops/s >> PopCountValueTransform.LogicFoldingKerenlInt thrpt 2 417261.583 ops/s >> >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Extending the random ranges Changes requested by epeter (Reviewer). 
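The quoted description does not show how the patch derives popcount bounds from KnownBits, so the following is only a minimal sketch of the general idea, under the assumption that known bits are represented as two masks. The class `PopCountBounds` and its method names are hypothetical and do not come from the PR or from C2.

```java
// Hypothetical illustration, not C2 code: bounds on Integer.bitCount(v) when
// some bits of v are known to be 0 (knownZeros) or known to be 1 (knownOnes).
final class PopCountBounds {
    private PopCountBounds() {}

    // Lower bound: every bit known to be 1 always contributes to the count.
    static int minPopCount(int knownOnes) {
        return Integer.bitCount(knownOnes);
    }

    // Upper bound: every bit known to be 0 can never contribute to the count.
    static int maxPopCount(int knownZeros) {
        return Integer.SIZE - Integer.bitCount(knownZeros);
    }

    public static void main(String[] args) {
        // For v = (x & 0xFF) | 0x3 the upper 24 bits are known zero and the
        // low 2 bits are known one, whatever x is.
        int knownZeros = 0xFFFFFF00;
        int knownOnes = 0x3;
        int lo = minPopCount(knownOnes);
        int hi = maxPopCount(knownZeros);
        for (int x = -1000; x <= 1000; x++) {
            int count = Integer.bitCount((x & 0xFF) | 0x3);
            if (count < lo || count > hi) {
                throw new AssertionError("bound violated for x = " + x);
            }
        }
        System.out.println("popcount in [" + lo + ", " + hi + "]");
    }
}
```

With bounds like these, a comparison of the popcount against a constant outside the computed range could be folded at compile time, which is presumably the kind of folding the benchmark names in the description refer to.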
test/hotspot/jtreg/compiler/intrinsics/TestPopCountValueTransforms.java line 46: > 44: static final int SIZE = 4096; > 45: > 46: static int rand_numI = G.uniformInts(Integer.MIN_VALUE, Integer.MAX_VALUE).next(); Why not just take `G.ints().next()`? test/hotspot/jtreg/compiler/intrinsics/TestPopCountValueTransforms.java line 56: > 54: static final long rand_bndL2 = G.uniformLongs(-0xFFFFFFL, 0xFFFFFF).next(); > 55: static final long rand_popcL1 = G.uniformLongs(0, 4).next(); > 56: static final long rand_popcL2 = G.uniformLongs(0, 32).next(); Can you please give us some code comments why you are doing: - only uniform distribution. Is that needed? Generators generates special values more often for a good reason: it creates interesting edge cases, especially for bit operations like this here. - Why are you restricting the ranges? There could always be surprises outside the ranges you pick, and it would be a shame to not generate those. Unless you are absolutely sure they are not needed. Or if extending the range would mean we would generate interesting cases with a probability that is too small, that could be another reason to restrict the ranges. ------------- PR Review: https://git.openjdk.org/jdk/pull/27075#pullrequestreview-3227945975 PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2351152244 PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2351166568 From epeter at openjdk.org Tue Sep 16 07:22:36 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 16 Sep 2025 07:22:36 GMT Subject: RFR: 8356813: Improve Mod(I|L)Node::Value In-Reply-To: References: <2Jf_gfvRlKcmCFoQHp5T0WW_fU_yK5-0Z3z41f00-YU=.164be9f0-fae1-44bb-84c3-846d8c2c0db2@github.com> Message-ID: On Tue, 16 Sep 2025 06:47:43 GMT, Hannes Greule wrote: >> @SirYwell Launching tests ? > > Thanks @eme64! Do I need another re-approval from @merykitty or are we ready to integrate? @SirYwell @merykitty Let's give him 24h. 
If he does not respond, you can integrate in my opinion.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/25254#issuecomment-3296167247

From epeter at openjdk.org Tue Sep 16 07:33:22 2025
From: epeter at openjdk.org (Emanuel Peter)
Date: Tue, 16 Sep 2025 07:33:22 GMT
Subject: RFR: 8363989: AArch64: Add missing backend support of VectorAPI expand operation
In-Reply-To:
References:
Message-ID:

On Tue, 9 Sep 2025 09:26:31 GMT, erifan wrote:

>> The algorithm description here is great. Please paste all of it from "Since there are" to "but with different instructions where appropriate." into this PR, before the vector expand implementation.
>
> @theRealAph @e1iu could you help take another look at this PR, thanks !

@erifan I'm seeing `gtest/GTestWrapper.java` fail on `aarch64` machines.

Looks like this:

[ RUN ] AssemblerAArch64.validate_vm
[12.324s][warning][os] Loading hsdis library failed
.../test/hotspot/gtest/aarch64/test_assembler_aarch64.cpp:49: Failure
Expected equality of these values:
  insns[i]
    Which is: 335545527
  insns1[i]
    Which is: 335545526
Ours:
Loading hsdis library failed, undisassembled code is shown in MachCode section
[MachCode]
  0x0000ffff97c20548: b604 0014
[/MachCode]
Theirs:
Loading hsdis library failed, undisassembled code is shown in MachCode section
[MachCode]
  0x0000ffff853eb0a8: b704 0014
[/MachCode]

Could this be related?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/26740#issuecomment-3296237445

From epeter at openjdk.org Tue Sep 16 07:41:47 2025
From: epeter at openjdk.org (Emanuel Peter)
Date: Tue, 16 Sep 2025 07:41:47 GMT
Subject: RFR: 8367657: C2 SuperWord: NormalMapping demo from JVMLS 2025 [v4]
In-Reply-To:
References:
Message-ID:

> Demo from here:
> https://inside.java/2025/08/16/jvmls-hotspot-auto-vectorization/
>
> Cleaned up and enhanced with a JTREG and IR test.
> I also added some additional "generated" normal maps from height functions.
> And I display the resulting image side-by-side with the normal map.
>
> I decided to put it in a new directory `compiler.gallery`, anticipating other compiler tests that are both visually appealing (i.e. can be used for a "gallery") and that we may want to back up with other tests like IR testing.
>
> There is a **stand-alone** way to run the demo:
> `java test/hotspot/jtreg/compiler/gallery/NormalMapping.java`
> (though it may only run with JDK22+, probably due to some amber features)
>
> Here are some snapshots, but **I really recommend pulling the diff and playing with it, it looks much better in motion**:
> [four images omitted]

Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:

  Apply suggestions from code review

  Co-authored-by: Christian Hagedorn

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/27282/files
  - new: https://git.openjdk.org/jdk/pull/27282/files/47aa0c7d..00416267

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=27282&range=03
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27282&range=02-03

  Stats: 19 lines in 2 files changed: 0 ins; 3 del; 16 mod
  Patch: https://git.openjdk.org/jdk/pull/27282.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/27282/head:pull/27282

PR: https://git.openjdk.org/jdk/pull/27282

From epeter at openjdk.org Tue Sep 16 07:41:50 2025
From: epeter at openjdk.org (Emanuel Peter)
Date: Tue, 16 Sep 2025 07:41:50 GMT
Subject: RFR: 8367657: C2 SuperWord: NormalMapping demo from JVMLS 2025 [v3]
In-Reply-To:
References:
Message-ID: <9WNJ1t-LUT2EmkBkPRLkQOjep3EkiMHFWovt9VnJUmA=.8da3417f-5781-4823-a2f5-35392cd2f8df@github.com>

On Tue, 16 Sep 2025 06:49:08 GMT, Christian Hagedorn wrote:

>> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   fix inlining
>
> test/hotspot/jtreg/compiler/gallery/NormalMapping.java line 121:
>
>> 119: }
>> 120:
>> 121: public static File getLocalFile(String name) {
>
> Isn't `name` always constant (i.e. `normal_map.png`)?
Then you could also extract that to a constant and use it here directly. I would like to allow the user to add their own images. I used to have multiple, but the file sizes are a bit of an issue. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27282#discussion_r2351248006 From epeter at openjdk.org Tue Sep 16 07:42:42 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 16 Sep 2025 07:42:42 GMT Subject: RFR: 8367657: C2 SuperWord: NormalMapping demo from JVMLS 2025 [v3] In-Reply-To: References: Message-ID: On Tue, 16 Sep 2025 06:50:50 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> fix inlining > > test/hotspot/jtreg/compiler/gallery/NormalMapping.java line 88: > >> 86: System.out.println("Welcome to the Normal Mapping Demo!"); >> 87: // Create an applicateion state with 5 lights. >> 88: State state = new State(5); > > I suggest to put `5` into a named constant. This invites to play around with different number of lights. Nice idea! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27282#discussion_r2351262820 From epeter at openjdk.org Tue Sep 16 07:45:51 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 16 Sep 2025 07:45:51 GMT Subject: RFR: 8367657: C2 SuperWord: NormalMapping demo from JVMLS 2025 [v3] In-Reply-To: References: Message-ID: On Tue, 16 Sep 2025 06:51:59 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> fix inlining > > test/hotspot/jtreg/compiler/gallery/NormalMapping.java line 149: > >> 147: } >> 148: >> 149: public static class Light { > > Maybe add a quick comment what this class does since it's a demo and one might want to better understand what's going on. Same for `State` class below. good idea! 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27282#discussion_r2351272790 From chagedorn at openjdk.org Tue Sep 16 07:51:42 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 16 Sep 2025 07:51:42 GMT Subject: RFR: 8365570: C2 fails assert(false) failed: Unexpected node in SuperWord truncation: CastII [v2] In-Reply-To: References: Message-ID: On Thu, 21 Aug 2025 15:21:48 GMT, Jasmine Karthikeyan wrote: >> Hi all, >> This is a quick patch for the assert failure in superword truncation with CastII. I've added a check for all constraint cast nodes, and attached a reduced version of the fuzzer test. Thanks! > > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Update comment for constraint casts The fix looks good to me, too! I only have one comment about the test. test/hotspot/jtreg/compiler/vectorization/TestSubwordTruncation.java line 431: > 429: } > 430: > 431: @Test(compLevel = CompLevel.C2) Any particular reason you've chosen `C2` here and not let the IR framework handle it? (by default it's `ANY` which will compile at the highest available tier). I'm also wondering if this test would fail if someone ran the test with a build without C2. ------------- PR Review: https://git.openjdk.org/jdk/pull/26827#pullrequestreview-3228138075 PR Review Comment: https://git.openjdk.org/jdk/pull/26827#discussion_r2351288090 From epeter at openjdk.org Tue Sep 16 07:57:38 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 16 Sep 2025 07:57:38 GMT Subject: RFR: 8367657: C2 SuperWord: NormalMapping demo from JVMLS 2025 [v5] In-Reply-To: References: Message-ID: <2gGUfvVlIaLGOd5iJUN3-oi9jlytrkULE3WZRUX1x78=.c0da1562-8cac-4215-9ae4-5cb248c89c0b@github.com> > Demo from here: > https://inside.java/2025/08/16/jvmls-hotspot-auto-vectorization/ > > Cleaned up and enhanced with a JTREG and IR test. > I also added some additional "generated" normal maps from height functions. 
> And I display the resulting image side-by-side with the normal map. > > I decided to put it in a new directory `compiler.gallery`, anticipating other compiler tests that are both visually appealing (i.e. can be used for a "gallery") and that we may want to back up with other tests like IR testing. > > There is a **stand-alone** way to run the demo: > `java test/hotspot/jtreg/compiler/gallery/NormalMapping.java` > (though it may only run with JDK22+, probably due to some amber features) > > Here are some snapshots, but **I really recommend pulling the diff and playing with it, it looks much better in motion**: > image > image > image > image Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: more for Christian ------------- Changes: - all: https://git.openjdk.org/jdk/pull/27282/files - new: https://git.openjdk.org/jdk/pull/27282/files/00416267..806c9379 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=27282&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27282&range=03-04 Stats: 20 lines in 1 file changed: 16 ins; 3 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/27282.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27282/head:pull/27282 PR: https://git.openjdk.org/jdk/pull/27282 From epeter at openjdk.org Tue Sep 16 08:00:47 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 16 Sep 2025 08:00:47 GMT Subject: RFR: 8367657: C2 SuperWord: NormalMapping demo from JVMLS 2025 [v3] In-Reply-To: References: Message-ID: On Tue, 16 Sep 2025 07:05:14 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> fix inlining > > Great work and thanks for sharing it! A few small suggestions, otherwise, it looks good to me! @chhagedorn Thanks a lot for reviewing! I addressed all your suggestions / comments.
------------- PR Comment: https://git.openjdk.org/jdk/pull/27282#issuecomment-3296399618 From chagedorn at openjdk.org Tue Sep 16 08:13:10 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 16 Sep 2025 08:13:10 GMT Subject: RFR: 8367613: Test compiler/runtime/TestDontCompileHugeMethods.java failed In-Reply-To: References: Message-ID: On Tue, 16 Sep 2025 06:48:23 GMT, Man Cao wrote: > Hi, > > Could anyone approve this change that exclude this test when running with `-Xcomp`? This avoids the test failure reported in [JDK-8367613](https://bugs.openjdk.org/browse/JDK-8367613). > > For reasons I don't yet understand, the `HugeSwitch::shortMethod` method is not compiled under `-Xcomp -XX:TieredStopAtLevel=1`. The method gets compiled with either `-Xcomp` or `-XX:TieredStopAtLevel=1`, but not both. I appreciate if anyone could provide insights on possible reasons. When looking at the test, it seems that we want to verify that `shortMethod()` is compiled while `hugeSwitch()` is not. When running with `-Xcomp`, we will immediately compile `main()` and directly inline `shortMethod()` with C1 (with C2 we fail to inline with "failed initial checks" and thus will compile `shortMethod()` separately when calling it the first time). Therefore, with C1, we will not compile `shortMethod()` separately and the test fails. Excluding `-Xcomp` looks reasonable. An alternative would be to exclude `main()` from compilation. But I think for the purpose of this test, excluding `-Xcomp` seems better. ------------- Marked as reviewed by chagedorn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/27306#pullrequestreview-3228249028 From duke at openjdk.org Tue Sep 16 08:27:40 2025 From: duke at openjdk.org (erifan) Date: Tue, 16 Sep 2025 08:27:40 GMT Subject: RFR: 8363989: AArch64: Add missing backend support of VectorAPI expand operation [v4] In-Reply-To: References: Message-ID: On Mon, 15 Sep 2025 05:55:43 GMT, erifan wrote: >> Currently, on AArch64, the VectorAPI `expand` operation is intrinsified for 32-bit and 64-bit types only when SVE2 is available. In the following cases, `expand` has not yet been intrinsified: >> 1. **Subword types** on SVE2-capable hardware. >> 2. **All types** on NEON and SVE1 environments. >> >> As a result, `expand` API performance is very poor in these scenarios. This patch intrinsifies the `expand` operation in the above environments. >> >> Since there are no native instructions directly corresponding to `expand` in these cases, this patch mainly leverages the `TBL` instruction to implement `expand`. To compute the index input for `TBL`, the prefix sum algorithm (see https://en.wikipedia.org/wiki/Prefix_sum) is used. Take a 128-bit byte vector on SVE2 as an example: >> >> To compute: dst = src.expand(mask) >> Data direction: high <== low >> Input: >> src = p o n m l k j i h g f e d c b a >> mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 >> Expected result: >> dst = 0 0 h g 0 0 f e 0 0 d c 0 0 b a >> >> Step 1: calculate the index input of the TBL instruction. >> >> // Set tmp1 as all 0 vector. >> tmp1 = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 >> >> // Move the mask bits from the predicate register to a vector register. >> // **1-bit** mask lane of P register to **8-bit** mask lane of V register. >> tmp2 = mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 >> >> // Shift the entire register. Prefix sum algorithm. 
>> dst = tmp2 << 8 = 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 >> tmp2 += dst = 0 1 2 1 0 1 2 1 0 1 2 1 0 1 2 1 >> >> dst = tmp2 << 16 = 2 1 0 1 2 1 0 1 2 1 0 1 2 1 0 0 >> tmp2 += dst = 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 >> >> dst = tmp2 << 32 = 2 2 2 2 2 2 2 2 2 2 2 1 0 0 0 0 >> tmp2 += dst = 4 4 4 4 4 4 4 4 4 4 4 3 2 2 2 1 >> >> dst = tmp2 << 64 = 4 4 4 3 2 2 2 1 0 0 0 0 0 0 0 0 >> tmp2 += dst = 8 8 8 7 6 6 6 5 4 4 4 3 2 2 2 1 >> >> // Clear inactive elements. >> dst = sel(mask, tmp2, tmp1) = 0 0 8 7 0 0 6 5 0 0 4 3 0 0 2 1 >> >> // Set the inactive lane value to -1 and set the active lane to the target index. >> dst -= 1 = -1 -1 7 6 -1 -1 5 4 -1 -1 3 2 -1 -1 1 0 >> >> Step 2: shuffle the source vector elements to the target vector >> >> tbl(dst, src, dst) = 0 0 h g 0 0 f e 0 0 d c 0 0 b a >> >> >> The same algorithm is used for NEON and... > > erifan has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains six commits: > > - Merge branch 'master' into JDK-8363989 > - Align code example data for better reading > - Merge branch 'master' into JDK-8363989 > - Improve the comment of the vector expand implementation > - Merge branch 'master' into JDK-8363989 > - 8363989: AArch64: Add missing backend support of VectorAPI expand operation > > Currently, on AArch64, the VectorAPI `expand` operation is intrinsified > for 32-bit and 64-bit types only when SVE2 is available. In the following > cases, `expand` has not yet been intrinsified: > 1. **Subword types** on SVE2-capable hardware. > 2. **All types** on NEON and SVE1 environments. > > As a result, `expand` API performance is very poor in these scenarios. > This patch intrinsifies the `expand` operation in the above environments. > > Since there are no native instructions directly corresponding to `expand` > in these cases, this patch mainly leverages the `TBL` instruction to > implement `expand`. 
To compute the index input for `TBL`, the prefix sum > algorithm (see https://en.wikipedia.org/wiki/Prefix_sum) is used. > Take a 128-bit byte vector on SVE2 as an example: > ``` > To compute: dst = src.expand(mask) > Data direction: high <== low > Input: > src = p o n m l k j i h g f e d c b a > mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 > Expected result: > dst = 0 0 h g 0 0 f e 0 0 d c 0 0 b a > ``` > Step 1: calculate the index input of the TBL instruction. > ``` > // Set tmp1 as all 0 vector. > tmp1 = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 > > // Move the mask bits from the predicate register to a vector register. > // **1-bit** mask lane of P register to **8-bit** mask lane of V register. > tmp2 = mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 > > // Shift the entire register. Prefix sum algorithm. > dst = tmp2 << 8 = 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 > tmp2 += dst = 0 1 2 1 0 1 2 1 0 1 2 1 0 1 2 1 > > dst = tmp2 << 16 = 2 1 0 1 2 1 0 1 2 1 0 1 2 1 0 0 > tmp2 += dst = 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 > > dst = tmp2 << 32 = 2 2 2 2 2 2 2 2 2 2 2 1 0 0 0 0 > tmp2 += dst = 4 4 4 4 4 4 4 4 4 4 4 3 2 2 2 1 > > dst = tmp2 << 64 = 4 4 4 3 2 2 2 1 0 0 0 0 0 0 0 0 > ... I'm not sure, I can pass all local tests, I'll take a look. Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26740#issuecomment-3296543668 From chagedorn at openjdk.org Tue Sep 16 08:30:54 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 16 Sep 2025 08:30:54 GMT Subject: RFR: 8367657: C2 SuperWord: NormalMapping demo from JVMLS 2025 [v5] In-Reply-To: <2gGUfvVlIaLGOd5iJUN3-oi9jlytrkULE3WZRUX1x78=.c0da1562-8cac-4215-9ae4-5cb248c89c0b@github.com> References: <2gGUfvVlIaLGOd5iJUN3-oi9jlytrkULE3WZRUX1x78=.c0da1562-8cac-4215-9ae4-5cb248c89c0b@github.com> Message-ID: On Tue, 16 Sep 2025 07:57:38 GMT, Emanuel Peter wrote: >> Demo from here: >> https://inside.java/2025/08/16/jvmls-hotspot-auto-vectorization/ >> >> Cleaned up and enhanced with a JTREG and IR test. 
>> I also added some additional "generated" normal maps from height functions. >> And I display the resulting image side-by-side with the normal map. >> >> I decided to put it in a new directory `compiler.gallery`, anticipating other compiler tests that are both visually appealing (i.e. can be used for a "gallery") and that we may want to back up with other tests like IR testing. >> >> There is a **stand-alone** way to run the demo: >> `java test/hotspot/jtreg/compiler/gallery/NormalMapping.java` >> (though it may only run with JDK22+, probably due some amber features) >> >> **Quick Perforance Numbers**, running on my avx512 laptop. >> default / AVX3: 105 FPS >> AVX2: 82 FPS >> AVX1: 50 FPS >> No vectorization: 19 FPS >> GraalJIT: 13 FPS (`jdk-26-ea+5` - probably issue with vectorization / inlining?) >> >> Here some snapshots, but **I really recommend pulling the diff and playing with it, it looks much better in motion**: >> image >> image >> image >> image > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > more for Christian Looks good (minus two typos), thanks for the updates! test/hotspot/jtreg/compiler/gallery/NormalMapping.java line 151: > 149: /** > 150: * This class represents the lights that are located on the normal map, > 151: * move around randomyl, and shine their color of light on the scene. Suggestion: * moved around randomly, and shine their color of light on the scene. ------------- Marked as reviewed by chagedorn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/27282#pullrequestreview-3228335413 PR Review Comment: https://git.openjdk.org/jdk/pull/27282#discussion_r2351429960 From chagedorn at openjdk.org Tue Sep 16 08:30:56 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 16 Sep 2025 08:30:56 GMT Subject: RFR: 8367657: C2 SuperWord: NormalMapping demo from JVMLS 2025 [v3] In-Reply-To: <9WNJ1t-LUT2EmkBkPRLkQOjep3EkiMHFWovt9VnJUmA=.8da3417f-5781-4823-a2f5-35392cd2f8df@github.com> References: <9WNJ1t-LUT2EmkBkPRLkQOjep3EkiMHFWovt9VnJUmA=.8da3417f-5781-4823-a2f5-35392cd2f8df@github.com> Message-ID: On Tue, 16 Sep 2025 07:36:34 GMT, Emanuel Peter wrote: >> test/hotspot/jtreg/compiler/gallery/NormalMapping.java line 121: >> >>> 119: } >>> 120: >>> 121: public static File getLocalFile(String name) { >> >> Isn't `name` always constant (i.e. `normal_map.png`)? Then you could also extract that to a constant and use it here directly. > > I would like to allow the user to add their own images. I used to have multiple, but the file sizes are a bit of an issue. I see, do you want to add a comment somewhere to suggest to play around with multiple image? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27282#discussion_r2351440198 From dfenacci at openjdk.org Tue Sep 16 08:49:24 2025 From: dfenacci at openjdk.org (Damon Fenacci) Date: Tue, 16 Sep 2025 08:49:24 GMT Subject: RFR: 8367613: Test compiler/runtime/TestDontCompileHugeMethods.java failed In-Reply-To: References: Message-ID: <4826eQblg2rlidW-mkVYXEDgccQNUBD0xbFpluJHlCA=.d2706893-10a9-4a62-b9c9-ae7407c70856@github.com> On Tue, 16 Sep 2025 06:48:23 GMT, Man Cao wrote: > Hi, > > Could anyone approve this change that exclude this test when running with `-Xcomp`? This avoids the test failure reported in [JDK-8367613](https://bugs.openjdk.org/browse/JDK-8367613). > > For reasons I don't yet understand, the `HugeSwitch::shortMethod` method is not compiled under `-Xcomp -XX:TieredStopAtLevel=1`. 
The method gets compiled with either `-Xcomp` or `-XX:TieredStopAtLevel=1`, but not both. I appreciate if anyone could provide insights on possible reasons. Marginal thing: since the issue happens with `-Xcomp` and `-XX:TieredStopAtLevel=1` it might be good to add the latter to `@requires` to restrict it as much as possible. Also you might want to add this bug to the `@bug` tag. ------------- PR Review: https://git.openjdk.org/jdk/pull/27306#pullrequestreview-3228463710 From dlunden at openjdk.org Tue Sep 16 08:52:57 2025 From: dlunden at openjdk.org (Daniel Lundén) Date: Tue, 16 Sep 2025 08:52:57 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v27] In-Reply-To: References: Message-ID: > If a method has a large number of parameters, we currently bail out from C2 compilation. > > ### Changeset > > Allowing C2 compilation of methods with a large number of parameters requires fundamental changes to the register mask data structure, used in many places in C2. In particular, register masks currently have a statically determined size and cannot represent arbitrary numbers of stack slots. This is needed if we want to compile methods with arbitrary numbers of parameters. Register mask operations are present in performance-sensitive parts of C2, which further complicates changes. > > Changes: > - Add functionality to dynamically grow/extend register masks. I experimented with a number of design choices to achieve this. To keep the common case (normal number of method parameters) quick and also to avoid more intrusive changes to the current `RegMask` interface, I decided to leave the "base" statically allocated memory for masks unchanged and only use dynamically allocated memory in the rare cases where it is needed.
> - Generalize the "chunk"-logic from `PhaseChaitin::Select()` to allow arbitrary-sized chunks, and also move most of the logic into register mask methods to separate concerns and to make the `PhaseChaitin::Select()` code more readable. > - Remove all `can_represent` checks and bailouts. > - Performance tuning. A particularly important change is the early-exit optimization in `RegMask::overlap`, used in the performance-sensitive method `PhaseChaitin::interfere_with_live`. > - Add a new test case `TestManyMethodArguments.java` and extend an old test `TestNestedSynchronize.java`. > > ### Testing > > - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/10178060450) > - `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. > - Standard performance benchmarking. No observed conclusive overall performance degradation/improvement. > - Specific benchmarking of C2 compilation time. The changes increase C2 compilation time by, approximately and on average, 1% for methods that could also be compiled before this changeset (see the figure below). The reason for the degradation is further checks required in performance-sensitive code (in particular `PhaseChaitin::remove_bound_register_from_interfering_live_ranges`). I have tried optimizing in various ways, but changes I found that lead to improvement also lead to less readable code (and are, in my opinion, not worth it). > > ![c2-regression](https:/... Daniel Lundén has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains 39 commits: - Clarify comments in regmask.hpp - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates - Address review comments (renaming on the way in a separate PR) - Update src/hotspot/share/opto/regmask.hpp Co-authored-by: Emanuel Peter - Restore modified java/lang/invoke tests - Sort includes (new requirement) - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates - Add clarifying comments at definitions of register mask sizes - Fix implicit zero and nullptr checks - Add deep copy comment - ... and 29 more: https://git.openjdk.org/jdk/compare/60930a3e...c1f41288 ------------- Changes: https://git.openjdk.org/jdk/pull/20404/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=20404&range=26 Stats: 2852 lines in 29 files changed: 2289 ins; 289 del; 274 mod Patch: https://git.openjdk.org/jdk/pull/20404.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20404/head:pull/20404 PR: https://git.openjdk.org/jdk/pull/20404 From dlunden at openjdk.org Tue Sep 16 09:09:30 2025 From: dlunden at openjdk.org (Daniel Lundén) Date: Tue, 16 Sep 2025 09:09:30 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v24] In-Reply-To: References: <8L3IGg5YYgi2EjlC-v5U3FkkWvK1swESQFAMwX02I84=.d597910f-0aca-4eb2-b68c-fbe565e73291@github.com> Message-ID: On Tue, 2 Sep 2025 14:05:25 GMT, Emanuel Peter wrote: >> Sure, we can rename them. I think `RM_SIZE_IN_INTS` and `RM_SIZE_IN_WORDS` would be most suitable. I avoided such a change in this changeset to not make it bigger than it already is. Isn't it easier to do the renaming in a follow-up RFE though, instead of before this PR? I'm fine with both though, not that much extra work to do it before. > > I think it would be easier to review if you do it first. > That PR won't be super controversial, and just makes the code nicer.
> And then when we come back here, we may even be able to drop some comments, or be able to catch bugs just because the reviewers understand better what's going on ;) Closing this thread as https://github.com/openjdk/jdk/pull/27215 is now integrated. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2351579319 From dlunden at openjdk.org Tue Sep 16 09:09:31 2025 From: dlunden at openjdk.org (Daniel Lundén) Date: Tue, 16 Sep 2025 09:09:31 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v24] In-Reply-To: References: <_a6JVBA326t8l1U3ZI8C-J3Ju5jm-RklBFGtnR7fbyY=.70638135-7577-44dc-a212-fe5e39b1f5fa@github.com> Message-ID: On Tue, 9 Sep 2025 08:35:37 GMT, Emanuel Peter wrote: >> Ah, I think I now better understand your question. `rm_up` is a low-level method for internal use in `regmask.hpp` and `regmask.cpp` only (perhaps I should prepend it with an underscore?). It basically makes it so that we can regard the backing storage (`_RM_UP` and `_RM_UP_EXT`) as one contiguous array. `Member` is exposed externally and so needs the offset logic. > > Makes sense. Maybe we can make that a bit more clear in the renaming. > Maybe we can make a clear distinction between the two mappings somehow? Do you think this is good enough now after the renaming? To me, the distinction is already quite clear (different argument types and method visibility). ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2351592165 From dlunden at openjdk.org Tue Sep 16 09:09:33 2025 From: dlunden at openjdk.org (Daniel Lundén) Date: Tue, 16 Sep 2025 09:09:33 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v17] In-Reply-To: References: <4JpU1sh_7wBfEZG3sJ8z-dWz-Wpk7osUjYZByvqetgc=.acf93145-620b-42e0-af57-ddf20875fd96@github.com> Message-ID: On Mon, 23 Jun 2025 14:27:48 GMT, Daniel Lundén wrote: >> Alright. Well sure, we don't have to do a full renaming now.
Though I do need to understand what is what to be able to review. Is there a good definition somewhere of what is what? > > I added comments at definition points of the various sizes. Let me know if something is still confusing. Resolving this thread as https://github.com/openjdk/jdk/pull/27215 is now integrated. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2351570887 From epeter at openjdk.org Tue Sep 16 09:14:50 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 16 Sep 2025 09:14:50 GMT Subject: RFR: 8364757: Missing Store nodes caused by bad wiring in PhaseIdealLoop::insert_post_loop In-Reply-To: References: Message-ID: <6G3nK9S5d9u3_esVm6W6hXK3QTDEzTscMlmWPmtp4yU=.21c1289b-83a5-485e-83ad-b30646dfbb89@github.com> On Thu, 11 Sep 2025 13:05:21 GMT, Benoît Maillard wrote: > This PR introduces a fix for wrong results caused by missing `Store` nodes in C2 IR due to incorrect wiring in `PhaseIdealLoop::insert_post_loop`. > > ### Context > > The issue was initially found by the fuzzer. After some trial and error, and with the help of @chhagedorn I was able to reduce the reproducer to something very simple. After being compiled by C2, the execution of the following method caused the last statement (`x = 0`) to be ignored: > > > static public void test() { > x = 0; > for (int i = 0; i < 20000; i++) { > x += i; > } > x = 0; > } > > > After some investigation and discussions with @robcasloz and @chhagedorn, it appeared that this issue is linked to how safepoints are inserted into long running loops, causing the loop to be transformed into a nested loop with an `OuterStripMinedLoop` node. `Store` nodes are moved out of the inner loop when encountering this pattern, and the associated `Phi` nodes are removed in order to avoid inhibiting loop optimizations taking place later. This was initially addressed in [JDK-8356708](https://bugs.openjdk.org/browse/JDK-8356708) by making the necessary corrections in macro expansion.
As explained in the next section, this is not enough here as macro expansion happens too late. > > This PR aims at addressing the specific case of the wrong wiring of `Store` nodes in _post_ loops, but in the longer term further investigations into the missing `Phi` node issue are necessary, as they are likely to cause other issues (cf. related JBS issues). > > ### Detailed Analysis > > In `PhaseIdealLoop::create_outer_strip_mined_loop`, a simple `CountedLoop` is turned into a nested loop with an `OuterStripMinedLoop`. The body of the initial loop remains in the inner loop, but the safepoint is moved to the outer loop. Later, we attempt to move `Store` nodes after the inner loop in `PhaseIdealLoop::try_move_store_after_loop`. When the `Store` node is moved to the outer loop, we also get rid of its input `Phi` node in order not to confuse loop optimizations happening later. > > This only becomes a problem in `PhaseIdealLoop::insert_post_loop`, where we clone the body of the inner/outer loop for the iterations remaining after unrolling. There, we use `Phi` nodes to do the necessary rewiring between the original body and the cloned one. Because we do not have `Phi` nodes for the moved `Store` nodes, their memory inputs may end up being incorrect. > > This is what the IR looks like after the creation of the post lo... Thanks for working on this @benoitmaillard! And thanks for all the explanations. It seems the missing Phi at the OuterStripMinedLoop is a decision that implies that Stores will just sort of "hang" between loop exit and SafePoint. That is now the new "invariant". Fine for now, but we may want to reconsider adding the Phi for the OuterStripMinedLoop eventually. I have read through the PR, and was a little confused about names, so bear with my comments. On the algo level I was wondering if it is possible to have a chain of stores between the exit and SafePoint? Do you have such examples?
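To make the question about a chain of stores concrete, here is a minimal sketch that extends the reproducer quoted in the PR description to two stores on the same field of two different objects. The names `StoreAfterLoopSketch`, `A`, `a1` and `a2` are made up for illustration (the PR's own test is not reproduced here), and whether C2 actually sinks both stores out of the inner loop, or wires them as a chain, is an assumption that is not verified. The sketch only pins down what Java semantics require: both fields are 0 after `test()` returns.

```java
class StoreAfterLoopSketch {
    // Hypothetical holder class: two instances share the same field,
    // so their stores would live on the same memory slice.
    static final class A {
        int field;
    }

    static final A a1 = new A();
    static final A a2 = new A();

    // Same shape as the reproducer from the PR description, with two stores.
    static void test() {
        a1.field = 0;
        a2.field = 0;
        for (int i = 0; i < 20000; i++) {
            a1.field += i;
            a2.field += i;
        }
        // These final stores must be visible after the method returns.
        a1.field = 0;
        a2.field = 0;
    }

    public static void main(String[] args) {
        // Repeat so that the JIT has a chance to compile test();
        // the repetition count is an arbitrary choice.
        for (int rep = 0; rep < 100; rep++) {
            test();
        }
        System.out.println("a1.field=" + a1.field + " a2.field=" + a2.field);
    }
}
```

Any non-zero value printed here would indicate a lost store, though a clean run says nothing about whether the problematic IR shape was ever created.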
src/hotspot/share/opto/loopTransform.cpp line 1679: > 1677: Node* next = out->fast_out(l); > 1678: if (next->is_Mem() && next->in(MemNode::Memory) == out) { > 1679: IdealLoopTree* output_loop = get_loop(get_ctrl(next)); I would keep the names for `next` and `output_loop` consistent. Maybe `next_loop`? Or just call them `use` and `use_loop`? src/hotspot/share/opto/loopTransform.cpp line 1692: > 1690: } > 1691: return out; > 1692: } Note from later me: I was quite confused here. I thought this was going to be some general function that should handle all sorts of memory flow in the loop, but that is not the case. I'll leave all my comments here just to show you what I as the reader thought when reading it ;) Below, in a code comment you say that this method does: `Find the last memory node in the loop when following memory usages` What happens here if we hit an if-diamond (or more complicated), where there can be multiple memory uses, that are then merged again by a memory phi? store | +--------+ | | store store | | +---+ +--+ | | phi | store -> the last one in the loop I wonder if this is somehow possible. There are surely some IGVN optimizations that would common the stores here, and so the graph would probably have to be even more complicated. But I'm simply wondering if it could be possible that we would have branches / phis in the memory graph. Or what guarantees us that the graph is really linear here? I'm also not sure how to parse the method name: `find_mem_out_outer_strip_mined` - find "mem out" outer-strip-mined - find mem outside of outer-strip-mined loop? src/hotspot/share/opto/loopTransform.cpp line 1788: > 1786: // right after the execution of the inner CountedLoop. > 1787: // We have to make sure that such stores in the post loop have the right memory inputs from the main loop > 1788: if (loop->tail()->in(0)->is_BaseCountedLoopEnd()) { Out of curiosity: when would this condition be false? 
src/hotspot/share/opto/loopTransform.cpp line 1793: > 1791: for (DUIterator j = if_false->outs(); if_false->has_out(j); j++) { > 1792: Node* store = if_false->out(j)->isa_Store(); > 1793: // We don't make changes if the memory input is in the loop body as well Why? I suppose that is because there must be a Phi in the loop then, right? Maybe state that in the comment here. src/hotspot/share/opto/loopTransform.cpp line 1794: > 1792: Node* store = if_false->out(j)->isa_Store(); > 1793: // We don't make changes if the memory input is in the loop body as well > 1794: if (store && !outer_loop->is_member(get_loop(get_ctrl(store->in(MemNode::Memory))))) { Suggestion: if (store != nullptr && !outer_loop->is_member(get_loop(get_ctrl(store->in(MemNode::Memory))))) { No implicit null or zero checks, see hotspot style guide ;) src/hotspot/share/opto/loopTransform.cpp line 1797: > 1795: Node* mem_out = find_mem_out_outer_strip_mined(store, outer_loop); > 1796: Node* store_new = old_new[store->_idx]; > 1797: store_new->set_req(MemNode::Memory, mem_out); Could it be that there are multiple stores in a chain after the loop exit and before the SafePoint? Loop Exit store1 store2 store3 SafePoint If so, they all have the same control, namely at the `if_false`. Their memory state should be ordered, where store2 depends on store1 and store3 on store2. Only store1 should then really have its memory input updated. Your code now finds the `store_new` for each of store1, store2 and store3, and sets all of their memory inputs to `mem_out`. But that means that the "new" stores all have the same memory input, and are not in a chain any more. Did I see this right? Is that ok? src/hotspot/share/opto/loopnode.hpp line 1384: > 1382: > 1383: // Find the last memory node in the loop when following memory usages > 1384: Node *find_mem_out_outer_strip_mined(Node* store, IdealLoopTree* outer_loop); The name of the method is a bit confusing. 
And the comment seems to suggest something different than what the code says. test/hotspot/jtreg/compiler/loopstripmining/MissingStoreAfterOuterStripMinedLoop.java line 77: > 75: a1.field = 0; > 76: a2.field = 0; > 77: } Do the field stores both float out of the loop, and end up in a chain between exit and safepoint? Might be nice to add some comments to these tests so we can see what examples you already cover and if we might need some more. ------------- PR Review: https://git.openjdk.org/jdk/pull/27225#pullrequestreview-3228308675 PR Review Comment: https://git.openjdk.org/jdk/pull/27225#discussion_r2351475787 PR Review Comment: https://git.openjdk.org/jdk/pull/27225#discussion_r2351447559 PR Review Comment: https://git.openjdk.org/jdk/pull/27225#discussion_r2351489419 PR Review Comment: https://git.openjdk.org/jdk/pull/27225#discussion_r2351520064 PR Review Comment: https://git.openjdk.org/jdk/pull/27225#discussion_r2351496858 PR Review Comment: https://git.openjdk.org/jdk/pull/27225#discussion_r2351551611 PR Review Comment: https://git.openjdk.org/jdk/pull/27225#discussion_r2351410690 PR Review Comment: https://git.openjdk.org/jdk/pull/27225#discussion_r2351608304 From epeter at openjdk.org Tue Sep 16 09:14:52 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 16 Sep 2025 09:14:52 GMT Subject: RFR: 8364757: Missing Store nodes caused by bad wiring in PhaseIdealLoop::insert_post_loop In-Reply-To: <6G3nK9S5d9u3_esVm6W6hXK3QTDEzTscMlmWPmtp4yU=.21c1289b-83a5-485e-83ad-b30646dfbb89@github.com> References: <6G3nK9S5d9u3_esVm6W6hXK3QTDEzTscMlmWPmtp4yU=.21c1289b-83a5-485e-83ad-b30646dfbb89@github.com> Message-ID: On Tue, 16 Sep 2025 08:30:05 GMT, Emanuel Peter wrote: >> This PR introduces a fix for wrong results caused by missing `Store` nodes in C2 IR due to incorrect wiring in `PhaseIdealLoop::insert_post_loop`. >> >> ### Context >> >> The issue was initially found by the fuzzer. 
After some trial and error, and with the help of @chhagedorn, I was able to reduce the reproducer to something very simple. After being compiled by C2, the execution of the following method led to the last statement (`x = 0`) being ignored: >> >> >> static public void test() { >> x = 0; >> for (int i = 0; i < 20000; i++) { >> x += i; >> } >> x = 0; >> } >> >> >> After some investigation and discussions with @robcasloz and @chhagedorn, it appeared that this issue is linked to how safepoints are inserted into long-running loops, causing the loop to be transformed into a nested loop with an `OuterStripMinedLoop` node. `Store` nodes are moved out of the inner loop when encountering this pattern, and the associated `Phi` nodes are removed in order to avoid inhibiting loop optimizations taking place later. This was initially addressed in [JDK-8356708](https://bugs.openjdk.org/browse/JDK-8356708) by making the necessary corrections in macro expansion. As explained in the next section, this is not enough here as macro expansion happens too late. >> >> This PR aims at addressing the specific case of the wrong wiring of `Store` nodes in _post_ loops, but in the longer term further investigations into the missing `Phi` node issue are necessary, as it is likely to cause other issues (cf. related JBS issues). >> >> ### Detailed Analysis >> >> In `PhaseIdealLoop::create_outer_strip_mined_loop`, a simple `CountedLoop` is turned into a nested loop with an `OuterStripMinedLoop`. The body of the initial loop remains in the inner loop, but the safepoint is moved to the outer loop. Later, we attempt to move `Store` nodes after the inner loop in `PhaseIdealLoop::try_move_store_after_loop`. When the `Store` node is moved to the outer loop, we also get rid of its input `Phi` node in order not to confuse loop optimizations happening later.
>> >> This only becomes a problem in `PhaseIdealLoop::insert_post_loop`, where we clone the body of the inner/outer loop for the iterations remaining after unrolling. There, we use `Phi` nodes to do the necessary rewiring between the original body and the cloned one. Because we do not have `Phi` nodes for the moved `Store` nodes, their memory inputs may end up being incorrect. >> >> This is wh... > > src/hotspot/share/opto/loopTransform.cpp line 1692: > >> 1690: } >> 1691: return out; >> 1692: } > > Note from later me: I was quite confused here. I thought this was going to be some general function that should handle all sorts of memory flow in the loop, but that is not the case. I'll leave all my comments here just to show you what I as the reader thought when reading it ;) > > Below, in a code comment you say that this method does: > `Find the last memory node in the loop when following memory usages` > > What happens here if we hit an if-diamond (or more complicated), where there can be multiple memory uses, that are then merged again by a memory phi? > > > store > | > +--------+ > | | > store store > | | > +---+ +--+ > | | > phi > | > store -> the last one in the loop > > I wonder if this is somehow possible. There are surely some IGVN optimizations that would common the stores here, and so the graph would probably have to be even more complicated. But I'm simply wondering if it could be possible that we would have branches / phis in the memory graph. Or what guarantees us that the graph is really linear here? > > I'm also not sure how to parse the method name: > `find_mem_out_outer_strip_mined` > - find "mem out" outer-strip-mined > - find mem outside of outer-strip-mined loop? I suppose we would trigger your assert if we found a branch: `assert(unique_next == nullptr, "memory node should only have one usage in the loop body");` Now we usually only do pre-main-post for relatively small loop bodies, see `LoopUnrollLimit`. 
But I wonder if we ever decided to increase this limit, would we then encounter such more complicated memory graphs? > src/hotspot/share/opto/loopTransform.cpp line 1794: > >> 1792: Node* store = if_false->out(j)->isa_Store(); >> 1793: // We don't make changes if the memory input is in the loop body as well >> 1794: if (store && !outer_loop->is_member(get_loop(get_ctrl(store->in(MemNode::Memory))))) { > > Suggestion: > > if (store != nullptr && !outer_loop->is_member(get_loop(get_ctrl(store->in(MemNode::Memory))))) { > > No implicit null or zero checks, see hotspot style guide ;) The loop nesting check looks a bit convoluted. Consider refactoring a little. Could you get rid of the `!` by swapping things around? `get_loop(get_ctrl(store->in(MemNode::Memory))))->is_member(outer_loop)` Does not look that much better either... hmm. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27225#discussion_r2351468893 PR Review Comment: https://git.openjdk.org/jdk/pull/27225#discussion_r2351511009 From epeter at openjdk.org Tue Sep 16 09:14:53 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 16 Sep 2025 09:14:53 GMT Subject: RFR: 8364757: Missing Store nodes caused by bad wiring in PhaseIdealLoop::insert_post_loop In-Reply-To: References: <6G3nK9S5d9u3_esVm6W6hXK3QTDEzTscMlmWPmtp4yU=.21c1289b-83a5-485e-83ad-b30646dfbb89@github.com> Message-ID: On Tue, 16 Sep 2025 08:34:48 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/loopTransform.cpp line 1692: >> >>> 1690: } >>> 1691: return out; >>> 1692: } >> >> Note from later me: I was quite confused here. I thought this was going to be some general function that should handle all sorts of memory flow in the loop, but that is not the case. 
I'll leave all my comments here just to show you what I as the reader thought when reading it ;) >> >> Below, in a code comment you say that this method does: >> `Find the last memory node in the loop when following memory usages` >> >> What happens here if we hit an if-diamond (or more complicated), where there can be multiple memory uses, that are then merged again by a memory phi? >> >> >> store >> | >> +--------+ >> | | >> store store >> | | >> +---+ +--+ >> | | >> phi >> | >> store -> the last one in the loop >> >> I wonder if this is somehow possible. There are surely some IGVN optimizations that would common the stores here, and so the graph would probably have to be even more complicated. But I'm simply wondering if it could be possible that we would have branches / phis in the memory graph. Or what guarantees us that the graph is really linear here? >> >> I'm also not sure how to parse the method name: >> `find_mem_out_outer_strip_mined` >> - find "mem out" outer-strip-mined >> - find mem outside of outer-strip-mined loop? > > I suppose we would trigger your assert if we found a branch: > `assert(unique_next == nullptr, "memory node should only have one usage in the loop body");` > > Now we usually only do pre-main-post for relatively small loop bodies, see `LoopUnrollLimit`. But I wonder if we ever decided to increase this limit, would we then encounter such more complicated memory graphs? Ok, I think I have been misled by the names / comments. You are really looking for the last store in the `outer_loop`. And we do have the guarantee of a linear memory graph because it is the one between `if_false` and SafePoint. 
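Editorial sketch of the conclusion reached in this exchange: if the memory graph between `if_false` and the SafePoint is linear, finding the last store is a plain chain walk. This is a minimal toy model, not the PR's code. `ToyNode`, its `in_loop` flag and `last_mem_in_chain` are hypothetical names; the `in_loop` flag merely stands in for a check such as `outer_loop->is_member(get_loop(get_ctrl(n)))`.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a C2 memory node; this is NOT HotSpot's Node class.
struct ToyNode {
  bool in_loop;                // assumed result of a loop-membership query
  std::vector<ToyNode*> outs;  // memory users of this node
};

// Follow memory users starting at 'store' for as long as they stay in the loop.
// Assumes a linear chain: each node has at most one in-loop memory user.
ToyNode* last_mem_in_chain(ToyNode* store) {
  ToyNode* out = store;
  while (true) {
    ToyNode* unique_next = nullptr;
    for (ToyNode* use : out->outs) {
      if (use->in_loop) {
        // A second in-loop user would mean a branch in the memory graph.
        assert(unique_next == nullptr && "memory node should only have one usage in the loop body");
        unique_next = use;
      }
    }
    if (unique_next == nullptr) {
      return out;  // no further in-loop user: this is the last one
    }
    out = unique_next;
  }
}
```

With a chain store1 -> store2 -> store3 followed by a node outside the loop, the walk started at store1 returns store3. An if-diamond in the chain would trip the assert instead of being handled, which matches the reviewer's reading of the original helper.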
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27225#discussion_r2351568383 From epeter at openjdk.org Tue Sep 16 09:14:53 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 16 Sep 2025 09:14:53 GMT Subject: RFR: 8364757: Missing Store nodes caused by bad wiring in PhaseIdealLoop::insert_post_loop In-Reply-To: References: <6G3nK9S5d9u3_esVm6W6hXK3QTDEzTscMlmWPmtp4yU=.21c1289b-83a5-485e-83ad-b30646dfbb89@github.com> Message-ID: On Tue, 16 Sep 2025 08:58:18 GMT, Emanuel Peter wrote: >> I suppose we would trigger your assert if we found a branch: >> `assert(unique_next == nullptr, "memory node should only have one usage in the loop body");` >> >> Now we usually only do pre-main-post for relatively small loop bodies, see `LoopUnrollLimit`. But I wonder if we ever decided to increase this limit, would we then encounter such more complicated memory graphs? > > Ok, I think I have been misled by the names / comments. > You are really looking for the last store in the `outer_loop`. And we do have the guarantee of a linear memory graph because it is the one between `if_false` and SafePoint. I think a better method name would help a lot ;) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27225#discussion_r2351569804 From duke at openjdk.org Tue Sep 16 09:43:54 2025 From: duke at openjdk.org (lusou-zhangquan) Date: Tue, 16 Sep 2025 09:43:54 GMT Subject: RFR: 8367706: Remove redundant register used by cmove in C1 LIR generation Message-ID: This PR removes redundant temp register used by cmove in C1 LIRGenerator::do_LookupSwitch and LIRGenerator::do_TableSwitch. The issue [8367706](https://bugs.openjdk.org/browse/JDK-8367706) is reported by me and it's my pleasure to fix it. 
------------- Commit messages: - 8367706: Remove redundant register used by cmove in C1 LIR generation Changes: https://git.openjdk.org/jdk/pull/27307/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27307&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8367706 Stats: 8 lines in 1 file changed: 2 ins; 6 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/27307.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27307/head:pull/27307 PR: https://git.openjdk.org/jdk/pull/27307 From qamai at openjdk.org Tue Sep 16 09:47:14 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Tue, 16 Sep 2025 09:47:14 GMT Subject: RFR: 8356813: Improve Mod(I|L)Node::Value [v9] In-Reply-To: <1ZCEMsPvSQaLGWRuNtO89LNP_XUeaz-edeIUrKwRCZY=.9dad5a02-c739-4e24-8692-8941f31e5a49@github.com> References: <2Jf_gfvRlKcmCFoQHp5T0WW_fU_yK5-0Z3z41f00-YU=.164be9f0-fae1-44bb-84c3-846d8c2c0db2@github.com> <1ZCEMsPvSQaLGWRuNtO89LNP_XUeaz-edeIUrKwRCZY=.9dad5a02-c739-4e24-8692-8941f31e5a49@github.com> Message-ID: On Sun, 14 Sep 2025 14:44:02 GMT, Hannes Greule wrote: >> This change improves the precision of the `Mod(I|L)Node::Value()` functions. >> >> I reordered the structure a bit. First, we handle constants, afterwards, we handle ranges. The bottom checks seem to be excessive (`Type::BOTTOM` is covered by using `isa_(int|long)()`, the local bottom is just the full range). Given we can even give reasonable bounds if only one input has any bounds, we don't want to return early. >> The changes after that are commented. Please let me know if the explanations are good, or if you have any suggestions. >> >> ### Monotonicity >> >> Before, a 0 divisor resulted in `Type(Int|Long)::POS`. Initially I wanted to keep it this way, but that violates monotonicity during PhaseCCP. As an example, if we see a 0 divisor first and a 3 afterwards, we might try to go from `>=0` to `-2..2`, but the meet of these would be `>=-2` rather than `-2..2`. 
Using `Type(Int|Long)::ZERO` instead avoids this (zero is always in the resulting value range if we cover a range). >> >> ### Testing >> >> I added tests for cases around the relevant bounds. I also ran tier1, tier2, and tier3 but didn't see any related failures after addressing the monotonicity problem described above (I'm having a few unrelated failures on my system currently, so separate testing would be appreciated in case I missed something). >> >> Please review and let me know what you think. >> >> ### Other >> >> The `UMod(I|L)Node`s were adjusted to be more in line with their signed variants. This change diverges them again, but similar improvements could be made after #17508. >> >> While experimenting with these changes, I stumbled upon a few things that aren't directly related to this change, but might be worth looking into further: >> - If the divisor is a constant, we will directly replace the `Mod(I|L)Node` with more but less expensive nodes in `::Ideal()`. Type analysis for these nodes combined is less precise, which means we miss potential cases where this would help, e.g., removing range checks. Would it make sense to delay the replacement? >> - To force non-negative ranges, I'm using `char`. I noticed that method parameters of sub-int integer types all fall back to `TypeInt::INT`. This seems to be an intentional change of https://github.com/openjdk/jdk/commit/200784d505dd98444c48c9ccb7f2e4df36dcbb6a. The bug report is private, so I can't really judge if that part is necessary, but it seems odd. > > Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: > > remove unused parameter Marked as reviewed by qamai (Committer).
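Editorial sketch of the monotonicity argument quoted above, under simplifying assumptions: types are modelled as plain closed integer ranges, the meet is the convex hull of two ranges, and the dividend is unbounded. `Range`, `meet`, `mod_range_unbounded_dividend` and `is_monotonic_step` are hypothetical names and not C2's `TypeInt` API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>

// Toy closed interval [lo, hi]; a stand-in for an integer type, not TypeInt.
struct Range {
  int64_t lo;
  int64_t hi;
};

bool operator==(const Range& a, const Range& b) {
  return a.lo == b.lo && a.hi == b.hi;
}

// Meet modelled as the smallest range containing both inputs.
Range meet(const Range& a, const Range& b) {
  assert(a.lo <= a.hi && b.lo <= b.hi);
  return {std::min(a.lo, b.lo), std::max(a.hi, b.hi)};
}

// Range of x % divisor for an arbitrary int x, with the sign following the
// dividend. A zero divisor maps to [0, 0], mirroring the ZERO choice above.
Range mod_range_unbounded_dividend(int32_t divisor) {
  if (divisor == 0) {
    return {0, 0};
  }
  int64_t d = divisor;
  int64_t m = (d < 0 ? -d : d) - 1;
  return {-m, m};
}

// A step is monotonic (types only widen) if 'after' already contains 'before'.
bool is_monotonic_step(const Range& before, const Range& after) {
  return meet(before, after) == after;
}

// The result previously used for a zero divisor (">= 0").
const Range kPos{0, std::numeric_limits<int32_t>::max()};
```

In this model, stepping from `kPos` to the range for divisor 3 is not monotonic, because the meet is `>= -2` rather than `-2..2`. Stepping from `[0, 0]` to `-2..2` is monotonic, which is the property the description relies on.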
------------- PR Review: https://git.openjdk.org/jdk/pull/25254#pullrequestreview-3228765828 From rcastanedalo at openjdk.org Tue Sep 16 09:55:04 2025 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Tue, 16 Sep 2025 09:55:04 GMT Subject: RFR: 8361699: C2: assert(can_reduce_phi(n->as_Phi())) failed: Sanity: previous reducible Phi is no longer reducible before SUT [v4] In-Reply-To: References: Message-ID: On Tue, 16 Sep 2025 02:35:01 GMT, Cesar Soares Lucas wrote: >> Please review this patch to fix an issue that may occur when reducing allocation merges. >> >> As the assert message describes, the problem is that a `Phi` considered reducible during one invocation of `adjust_scalar_replaceable_state` later turned out to be non-reducible. This situation can happen if a subsequent invocation of the same method causes all inputs to the phi to be NSR; therefore there is no point in reducing the Phi. It can also happen during the propagation of NSR state done by `find_scalar_replaceable_allocs`. >> >> The change in `revisit_reducible_phi_status` is just a clean-up. >> The real fix is in `find_scalar_replaceable_allocs`. >> >> Tested on Linux x64/Aarch64 release/fastdebug with JTREG tier1-3. > > Cesar Soares Lucas has updated the pull request incrementally with two additional commits since the last revision: > > - Merge remote-tracking branch 'refs/remotes/origin/ram-non-reducible' into ram-non-reducible > - Merge consecutive ifs Looks good, thanks! Please consider addressing [JDK-8367367](https://bugs.openjdk.org/browse/JDK-8367367) as follow-up work, while the context is still available in the higher levels of our memory hierarchy ;) ------------- Marked as reviewed by rcastanedalo (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/27063#pullrequestreview-3228800991 From duke at openjdk.org Tue Sep 16 10:14:53 2025 From: duke at openjdk.org (erifan) Date: Tue, 16 Sep 2025 10:14:53 GMT Subject: RFR: 8363989: AArch64: Add missing backend support of VectorAPI expand operation In-Reply-To: References: Message-ID: On Tue, 16 Sep 2025 07:30:20 GMT, Emanuel Peter wrote: >> @theRealAph @e1iu could you help take another look of this PR, thanks ! > > @erifan I'm seeing `gtest/GTestWrapper.java` fail on `aarch64` machines. > > Looks like this: > > [ RUN ] AssemblerAArch64.validate_vm > [12.324s][warning][os] Loading hsdis library failed > .../test/hotspot/gtest/aarch64/test_assembler_aarch64.cpp:49: Failure > Expected equality of these values: > insns[i] > Which is: 335545527 > insns1[i] > Which is: 335545526 > Ours: > > Loading hsdis library failed, undisassembled code is shown in MachCode section > [MachCode] > 0x0000ffff97c20548: b604 0014 > [/MachCode] > Theirs: > > Loading hsdis library failed, undisassembled code is shown in MachCode section > [MachCode] > 0x0000ffff853eb0a8: b704 0014 > [/MachCode] > > Could this be related? Hi @eme64 I can't reproduce the test failure on my local and Jenkins test environments. I see this from the above error log: Loading hsdis library failed, undisassembled code is shown in MachCode section Not sure if this is related. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26740#issuecomment-3297410878 From fandreuzzi at openjdk.org Tue Sep 16 10:23:28 2025 From: fandreuzzi at openjdk.org (Francesco Andreuzzi) Date: Tue, 16 Sep 2025 10:23:28 GMT Subject: RFR: 8367740: assembler_<cpu>.inline.hpp should not include assembler.inline.hpp Message-ID: This is the content of assembler.inline.hpp: https://github.com/openjdk/jdk/blob/ca89cd06d39ed3a6bbe16f60fea4d7382849edbd/src/hotspot/share/asm/assembler.inline.hpp#L28-L30 Most of the `assembler_<cpu>.inline.hpp` files include it: https://github.com/openjdk/jdk/blob/ca89cd06d39ed3a6bbe16f60fea4d7382849edbd/src/hotspot/cpu/zero/assembler_zero.inline.hpp#L29-L32 They should probably include `assembler.hpp` instead. Testing: tier1 in GHA ------------- Commit messages: - cc Changes: https://git.openjdk.org/jdk/pull/27311/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27311&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8367740 Stats: 5 lines in 5 files changed: 0 ins; 0 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/27311.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27311/head:pull/27311 PR: https://git.openjdk.org/jdk/pull/27311 From rcastanedalo at openjdk.org Tue Sep 16 10:23:23 2025 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Tue, 16 Sep 2025 10:23:23 GMT Subject: RFR: 8367728: IGV: dump node address type Message-ID: This changeset dumps the address type of each node (`Node::adr_type()`), when not null, into the IGV graphs. This should improve the visibility and diagnosability of C2 type inconsistencies, see e.g. [JDK-8367667](https://bugs.openjdk.org/browse/JDK-8367667). #### Testing - tier1 (windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64; release and debug mode). - Tested IGV manually on a few selected graphs.
Tested automatically that dumping thousands of graphs does not trigger any assertion failure (by running `java -Xcomp -XX:PrintIdealGraphLevel=1`). ------------- Commit messages: - Dump address type Changes: https://git.openjdk.org/jdk/pull/27310/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27310&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8367728 Stats: 5 lines in 1 file changed: 5 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/27310.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27310/head:pull/27310 PR: https://git.openjdk.org/jdk/pull/27310 From mchevalier at openjdk.org Tue Sep 16 10:42:21 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Tue, 16 Sep 2025 10:42:21 GMT Subject: RFR: 8367728: IGV: dump node address type In-Reply-To: References: Message-ID: On Tue, 16 Sep 2025 10:11:50 GMT, Roberto Castañeda Lozano wrote: > This changeset dumps the address type of each node (`Node::adr_type()`), when not null, into the IGV graphs. This should improve the visibility and diagnosability of C2 type inconsistencies, see e.g. [JDK-8367667](https://bugs.openjdk.org/browse/JDK-8367667). > > #### Testing > - tier1 (windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64; release and debug mode). > - Tested IGV manually on a few selected graphs. Tested automatically that dumping thousands of graphs does not trigger any assertion failure (by running `java -Xcomp -XX:PrintIdealGraphLevel=1`). I'm happy. src/hotspot/share/opto/idealGraphPrinter.cpp line 452: > 450: } > 451: if (n->adr_type() != nullptr) { > 452: stringStream adr_type_stream; Other `stringStream` instances nearby use a preallocated buffer. Would it be a good idea here too? ------------- Marked as reviewed by mchevalier (Committer).
PR Review: https://git.openjdk.org/jdk/pull/27310#pullrequestreview-3229121349 PR Review Comment: https://git.openjdk.org/jdk/pull/27310#discussion_r2351915650 From epeter at openjdk.org Tue Sep 16 11:07:57 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 16 Sep 2025 11:07:57 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v24] In-Reply-To: References: <_a6JVBA326t8l1U3ZI8C-J3Ju5jm-RklBFGtnR7fbyY=.70638135-7577-44dc-a212-fe5e39b1f5fa@github.com> Message-ID: <8EDUG032a2-wepy1MeWd6n3Gfxr3_sajeRf07BbI0Wk=.ee7db86d-fd41-4beb-9a68-79812187466e@github.com> On Tue, 16 Sep 2025 10:44:26 GMT, Emanuel Peter wrote: >> Do you think this is good enough now after the renaming? To me, the distinction it is already quite clear (different argument types and method visibility). > > @dlunde It could be helpful to see a small example to see what maps to what if there are multiple views. Why not move the field down to its explanation? Or move the explanation to the field? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2351952889 From epeter at openjdk.org Tue Sep 16 11:07:57 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 16 Sep 2025 11:07:57 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v27] In-Reply-To: References: Message-ID: On Tue, 16 Sep 2025 10:52:53 GMT, Emanuel Peter wrote: >> Daniel Lundén has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains 39 commits: >> >> - Clarify comments in regmask.hpp >> - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates >> - Address review comments (renaming on the way in a separate PR) >> - Update src/hotspot/share/opto/regmask.hpp >> >> Co-authored-by: Emanuel Peter >> - Restore modified java/lang/invoke tests >> - Sort includes (new requirement) >> - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates >> - Add clarifying comments at definitions of register mask sizes >> - Fix implicit zero and nullptr checks >> - Add deep copy comment >> - ... and 29 more: https://git.openjdk.org/jdk/compare/60930a3e...c1f41288 > > src/hotspot/share/opto/regmask.hpp line 267: > >> 265: >> 266: // Where to extend the register mask >> 267: Arena* _arena; > > Usually, we try to keep all fields at the top. Just to keep the overview. > src/hotspot/share/opto/regmask.hpp line 464: > >> 462: copy(rm); >> 463: return *this; >> 464: } > > You could also delete this one, and use the `copy` explicitly at the use site. That would make the allocations a bit more explicit. What do you think? > Whenever possible, it is nice to be able to declare a type `NONCOPYABLE`. Especially if it does allocations where copy is non-trivial. You already removed some assignments, like this one, which is good: https://github.com/openjdk/jdk/pull/20404/files#diff-344e52fd6be79f1d97a33d7ebbf131148df90bb52e3b33952340e8d37a3849d8L1501-R1512 Generally, there are a lot of constructors here. All of them are public, none explicit. Maybe that is just how it has to be, but maybe you can simplify a little. > src/hotspot/share/opto/regmask.hpp line 659: > >> 657: >> 658: // Fill a register mask with 1's starting from the given register. >> 659: void Set_All_From(OptoReg::Name reg) { > > Oh boy, we have a mixture of `lower_case` and `Strange_Case` method names. 
> We missed those in the renaming RFE :/ Or is there a particular logic behind it? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2351970696 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2351760214 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2351870907 From epeter at openjdk.org Tue Sep 16 11:07:56 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 16 Sep 2025 11:07:56 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v27] In-Reply-To: References: Message-ID: On Tue, 16 Sep 2025 08:52:57 GMT, Daniel Lundén wrote: >> If a method has a large number of parameters, we currently bail out from C2 compilation. >> >> ### Changeset >> >> Allowing C2 compilation of methods with a large number of parameters requires fundamental changes to the register mask data structure, used in many places in C2. In particular, register masks currently have a statically determined size and cannot represent arbitrary numbers of stack slots. This is needed if we want to compile methods with arbitrary numbers of parameters. Register mask operations are present in performance-sensitive parts of C2, which further complicates changes. >> >> Changes: >> - Add functionality to dynamically grow/extend register masks. I experimented with a number of design choices to achieve this. To keep the common case (normal number of method parameters) quick and also to avoid more intrusive changes to the current `RegMask` interface, I decided to leave the "base" statically allocated memory for masks unchanged and only use dynamically allocated memory in the rare cases where it is needed. >> - Generalize the "chunk"-logic from `PhaseChaitin::Select()` to allow arbitrary-sized chunks, and also move most of the logic into register mask methods to separate concerns and to make the `PhaseChaitin::Select()` code more readable. >> - Remove all `can_represent` checks and bailouts. >> - Performance tuning.
A particularly important change is the early-exit optimization in `RegMask::overlap`, used in the performance-sensitive method `PhaseChaitin::interfere_with_live`. >> - Add a new test case `TestManyMethodArguments.java` and extend an old test `TestNestedSynchronize.java`. >> >> ### Testing >> >> - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/10178060450) >> - `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. >> - Standard performance benchmarking. No observed conclusive overall performance degradation/improvement. >> - Specific benchmarking of C2 compilation time. The changes increase C2 compilation time by, approximately and on average, 1% for methods that could also be compiled before this changeset (see the figure below). The reason for the degradation is further checks required in performance-sensitive code (in particular `PhaseChaitin::remove_bound_register_from_interfering_live_ranges`). I have tried optimizing in various ways, but changes I found that lead to improvement also lead to less readable code (and are, in my opinion, no... > > Daniel Lundén has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 39 commits: > > - Clarify comments in regmask.hpp > - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates > - Address review comments (renaming on the way in a separate PR) > - Update src/hotspot/share/opto/regmask.hpp > > Co-authored-by: Emanuel Peter > - Restore modified java/lang/invoke tests > - Sort includes (new requirement) > - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates > - Add clarifying comments at definitions of register mask sizes > - Fix implicit zero and nullptr checks > - Add deep copy comment > - ...
and 29 more: https://git.openjdk.org/jdk/compare/60930a3e...c1f41288 A first batch of comments before lunch (made it halfway through `regmask.hpp`) src/hotspot/share/adlc/formsopt.cpp line 174: > 172: // The array of Register Mask bits should be large enough to cover all the > 173: // machine registers and usually all parameters that need to be passed on the > 174: // stack (stack registers) up to some interesting limit. On Intel, the limit What do you mean by `usually`? It could be misunderstood that sometimes it may not be large enough to even cover those "up to some interesting limit". Consider rephrasing for clarity ;) src/hotspot/share/opto/chaitin.cpp line 645: > 643: if (C->failing()) { > 644: return; > 645: } What can fail here? src/hotspot/share/opto/chaitin.hpp line 151: > 149: _mask_size = _mask.rm_size_in_bits(); > 150: return _mask.rollover(); > 151: } Subjective: I would have kept the one-liner approach consistently here, since that is what surrounding code does. src/hotspot/share/opto/ifg.cpp line 732: > 730: > 731: // Remove bound register(s) from 'l's choices > 732: old = interfering_lrg.mask(); Just checking: This is an implicit `copy` case, right? src/hotspot/share/opto/locknode.cpp line 43: > 41: > 42: BoxLockNode::BoxLockNode(int slot) > 43: : Node(Compile::current()->root()), _slot(slot), Suggestion: : Node(Compile::current()->root()), _slot(slot), I would put all on separate lines. Optional. src/hotspot/share/opto/locknode.cpp line 55: > 53: } > 54: init_class_id(Class_BoxLock); > 55: init_flags(Flag_rematerialize); Any reason why you moved these after the bailout? Maybe that's fine, but I don't know what the implications might be. Do you?
src/hotspot/share/opto/machnode.hpp line 758: > 756: public: > 757: MachProjNode(Node* multi, uint con, const RegMask& out, uint ideal_reg) > 758: : ProjNode(multi, con), _rout(out, Compile::current()->comp_arena()), Suggestion: : ProjNode(multi, con), _rout(out, Compile::current()->comp_arena()), Optional. Either list horizontally or vertically is my opinion ;) src/hotspot/share/opto/postaloc.cpp line 681: > 679: for (int l = 1; l < n_regs; l++) { > 680: OptoReg::Name ureg_lo = OptoReg::add(ureg,-l); > 681: bool is_reg = OptoReg::is_reg(ureg_lo); Only needed in assert. Do you really need to give it a separate name? Subjective, your choice. Does it have a side-effect? src/hotspot/share/opto/postaloc.cpp line 685: > 683: assert(is_adjacent || is_reg, > 684: "only registers can be non-adjacent"); > 685: if (!value[ureg_lo] && is_adjacent) { // Nearly always adjacent `value[ureg_lo]` returns a `Node*`, right? Then that would make this an implicit null check, not allowed by style guide ;) src/hotspot/share/opto/regmask.hpp line 122: > 120: // the machine registers and usually all parameters that need to be passed > 121: // on the stack (stack registers) up to some interesting limit. On Intel, > 122: // the limit is something like 90+ parameters. You may say that that in the "unusual" case, we have to use `_rm_word_ext`. Just so the reader knows what the ominous "usually" refers to ;) src/hotspot/share/opto/regmask.hpp line 217: > 215: // are included in the register mask. Depending on the value of > 216: // _infinite_stack (denoted with as), {s10, s11, ...} are all included (as=1) > 217: // or excluded (as=0). Note that all registers/stack locations under _lwm Do you want to rename `as` now that it does not refer to `all_stack` but `infinite_stack`? src/hotspot/share/opto/regmask.hpp line 267: > 265: > 266: // Where to extend the register mask > 267: Arena* _arena; Usually, we try to keep all fields at the top. 
src/hotspot/share/opto/regmask.hpp line 270: > 268: > 269: // Grow the register mask to ensure it can fit at least min_size words. > 270: void grow(unsigned int min_size, bool init = true) { Suggestion: void grow(unsigned int min_size, bool initialize_... = true) { I would spell out what it means. `init` could mean lots of things. src/hotspot/share/opto/regmask.hpp line 285: > 283: assert(_original_ext_address == &_rm_word_ext, "clone sanity check"); > 284: _rm_word_ext = REALLOC_ARENA_ARRAY(_arena, uintptr_t, _rm_word_ext, > 285: old_ext_size, new_ext_size); Suggestion: old_ext_size, new_ext_size); src/hotspot/share/opto/regmask.hpp line 450: > 448: Insert(reg); > 449: } > 450: RegMask(OptoReg::Name reg) : RegMask(reg, nullptr) {} You may want to add `explicit`, so nobody accidentally converts them ;) src/hotspot/share/opto/regmask.hpp line 458: > 456: } > 457: > 458: RegMask(const RegMask& rm) : RegMask(rm, nullptr) {} Do you want to add `explicit` here? This is a shallow copy, right? Maybe add a comment for that. src/hotspot/share/opto/regmask.hpp line 464: > 462: copy(rm); > 463: return *this; > 464: } You could also delete this one, and use the `copy` explicitly at the use site. That would make the allocations a bit more explicit. What do you think? Whenever possible, it is nice to be able to declare a type `NONCOPYABLE`. Especially if it does allocations where copy is non-trivial. src/hotspot/share/opto/regmask.hpp line 659: > 657: > 658: // Fill a register mask with 1's starting from the given register. > 659: void Set_All_From(OptoReg::Name reg) { Oh boy, we have a mixture of `lower_case` and `Strange_Case` method names. We missed those in the renaming RFE :/ ------------- Changes requested by epeter (Reviewer). 
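Editorial sketch of the design points raised in the `regmask.hpp` comments above: a fixed inline base that grows on demand, `explicit` single-argument constructors, and a deleted copy constructor and assignment so that copies are visible at the use site. This is a toy under stated assumptions, not `RegMask`. `ToyMask` and its members are hypothetical, and a `std::vector` replaces the arena-allocated extension used in the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical growable bit mask; NOT HotSpot's RegMask.
class ToyMask {
  static constexpr std::size_t kBaseWords = 2;     // statically sized part
  static constexpr std::size_t kBitsPerWord = 64;

  uint64_t _base[kBaseWords] = {0, 0};
  std::vector<uint64_t> _ext;  // used only when the base is too small

  // Ensure the mask can hold at least min_words words; new words are zeroed.
  void grow(std::size_t min_words) {
    if (min_words > kBaseWords + _ext.size()) {
      _ext.resize(min_words - kBaseWords, 0);
    }
  }

  uint64_t& word(std::size_t i) {
    return i < kBaseWords ? _base[i] : _ext[i - kBaseWords];
  }

public:
  ToyMask() = default;
  // 'explicit' prevents a bit index from silently converting into a mask.
  explicit ToyMask(std::size_t bit) { insert(bit); }

  // Copies may allocate, so they must be requested by name.
  ToyMask(const ToyMask&) = delete;
  ToyMask& operator=(const ToyMask&) = delete;

  void copy_from(const ToyMask& other) {
    for (std::size_t i = 0; i < kBaseWords; i++) {
      _base[i] = other._base[i];
    }
    _ext = other._ext;  // deep copy of the extension
  }

  void insert(std::size_t bit) {
    grow(bit / kBitsPerWord + 1);
    word(bit / kBitsPerWord) |= uint64_t{1} << (bit % kBitsPerWord);
  }

  bool member(std::size_t bit) const {
    std::size_t i = bit / kBitsPerWord;
    if (i >= kBaseWords + _ext.size()) {
      return false;  // beyond the allocated words: not set
    }
    uint64_t w = i < kBaseWords ? _base[i] : _ext[i - kBaseWords];
    return ((w >> (bit % kBitsPerWord)) & 1) != 0;
  }

  std::size_t size_in_words() const { return kBaseWords + _ext.size(); }
};
```

Small masks never touch the extension, which keeps the common case cheap. With the copy operations deleted, a statement like `old = lrg.mask();` would not compile, and the caller would have to write `old.copy_from(lrg.mask())` instead.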
PR Review: https://git.openjdk.org/jdk/pull/20404#pullrequestreview-3228765757 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2351720167 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2351729163 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2351797677 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2351813653 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2351821908 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2351831266 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2351835412 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2351891004 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2351885171 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2351923440 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2351946036 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2351970184 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2351976938 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2351972811 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2351741579 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2352007574 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2351754230 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2351864011 From epeter at openjdk.org Tue Sep 16 11:07:57 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 16 Sep 2025 11:07:57 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v24] In-Reply-To: References: <_a6JVBA326t8l1U3ZI8C-J3Ju5jm-RklBFGtnR7fbyY=.70638135-7577-44dc-a212-fe5e39b1f5fa@github.com> Message-ID: On Tue, 16 Sep 2025 09:05:06 GMT, Daniel Lund?n wrote: >> Makes sense. Maybe we can make that a bit more clear in the renaming. 
>> Maybe we can make a clear distinction between the two mappings somehow?
>
> Do you think this is good enough now after the renaming? To me, the distinction is already quite clear (different argument types and method visibility).

@dlunde It could be helpful to see a small example to see what maps to what if there are multiple views.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2351938661

From dlunden at openjdk.org Tue Sep 16 11:31:04 2025
From: dlunden at openjdk.org (Daniel Lundén)
Date: Tue, 16 Sep 2025 11:31:04 GMT
Subject: RFR: 8325467: Support methods with many arguments in C2 [v27]
In-Reply-To:
References:
Message-ID:

On Tue, 16 Sep 2025 09:45:07 GMT, Emanuel Peter wrote:

>> Daniel Lundén has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 39 commits:
>>
>>  - Clarify comments in regmask.hpp
>>  - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates
>>  - Address review comments (renaming on the way in a separate PR)
>>  - Update src/hotspot/share/opto/regmask.hpp
>>
>>    Co-authored-by: Emanuel Peter
>>  - Restore modified java/lang/invoke tests
>>  - Sort includes (new requirement)
>>  - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates
>>  - Add clarifying comments at definitions of register mask sizes
>>  - Fix implicit zero and nullptr checks
>>  - Add deep copy comment
>>  - ... and 29 more: https://git.openjdk.org/jdk/compare/60930a3e...c1f41288
>
> src/hotspot/share/adlc/formsopt.cpp line 174:
>
>> 172: // The array of Register Mask bits should be large enough to cover all the
>> 173: // machine registers and usually all parameters that need to be passed on the
>> 174: // stack (stack registers) up to some interesting limit. On Intel, the limit
>
> What do you mean by `usually`?
> It could be misunderstood that sometimes it may not be large enough to even cover those "up to some interesting limit". Consider rephrasing for clarity ;)

Sure, I'll rephrase it

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2352096087

From dlunden at openjdk.org Tue Sep 16 11:34:09 2025
From: dlunden at openjdk.org (Daniel Lundén)
Date: Tue, 16 Sep 2025 11:34:09 GMT
Subject: RFR: 8325467: Support methods with many arguments in C2 [v27]
In-Reply-To:
References:
Message-ID:

On Tue, 16 Sep 2025 09:48:15 GMT, Emanuel Peter wrote:

>> Daniel Lundén has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 39 commits:
>>
>>  - Clarify comments in regmask.hpp
>>  - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates
>>  - Address review comments (renaming on the way in a separate PR)
>>  - Update src/hotspot/share/opto/regmask.hpp
>>
>>    Co-authored-by: Emanuel Peter
>>  - Restore modified java/lang/invoke tests
>>  - Sort includes (new requirement)
>>  - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates
>>  - Add clarifying comments at definitions of register mask sizes
>>  - Fix implicit zero and nullptr checks
>>  - Add deep copy comment
>>  - ... and 29 more: https://git.openjdk.org/jdk/compare/60930a3e...c1f41288
>
> src/hotspot/share/opto/chaitin.cpp line 645:
>
>> 643: if (C->failing()) {
>> 644: return;
>> 645: }
>
> What can fail here?
This bailout, added in this changeset: https://github.com/openjdk/jdk/blob/c1f41288c7f75b5abd6055fbc032cf4447532548/src/hotspot/share/opto/chaitin.cpp#L1664-L1672

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2352106256

From rcastanedalo at openjdk.org Tue Sep 16 11:52:03 2025
From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano)
Date: Tue, 16 Sep 2025 11:52:03 GMT
Subject: RFR: 8367728: IGV: dump node address type
In-Reply-To:
References:
Message-ID: <6CdmgzpYWNBs__UtcZADKa23joZZS2ELePE8tkGCwAQ=.7cc5b8fe-c080-4c11-a3a8-1d8db6482633@github.com>

On Tue, 16 Sep 2025 10:38:13 GMT, Marc Chevalier wrote:

>> This changeset dumps the address type of each node (`Node::adr_type()`), when not null, into the IGV graphs. This should improve the visibility and diagnosability of C2 type inconsistencies, see e.g. [JDK-8367667](https://bugs.openjdk.org/browse/JDK-8367667).
>>
>> #### Testing
>> - tier1 (windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64; release and debug mode).
>> - Tested IGV manually on a few selected graphs. Tested automatically that dumping thousands of graphs does not trigger any assertion failure (by running `java -Xcomp -XX:PrintIdealGraphLevel=1`).
>
> src/hotspot/share/opto/idealGraphPrinter.cpp line 452:
>
>> 450: }
>> 451: if (n->adr_type() != nullptr) {
>> 452: stringStream adr_type_stream;
>
> Other `stringStream` uses around here rely on a preallocated buffer. Would it be a good idea here too?

Thanks for bringing this up. I did not use the pre-allocated buffer for simplicity, which I think is more important than efficiency in this code - as long as the efficiency is not bad enough to turn into a usability problem. We should probably investigate (separately) simplifying all other uses of `stringStream` in the IGV dumping logic.
-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/27310#discussion_r2352179908

From dlunden at openjdk.org Tue Sep 16 11:55:32 2025
From: dlunden at openjdk.org (Daniel Lundén)
Date: Tue, 16 Sep 2025 11:55:32 GMT
Subject: RFR: 8325467: Support methods with many arguments in C2 [v27]
In-Reply-To:
References:
Message-ID:

On Tue, 16 Sep 2025 09:52:21 GMT, Emanuel Peter wrote:

>> Daniel Lundén has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 39 commits:
>>
>>  - Clarify comments in regmask.hpp
>>  - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates
>>  - Address review comments (renaming on the way in a separate PR)
>>  - Update src/hotspot/share/opto/regmask.hpp
>>
>>    Co-authored-by: Emanuel Peter
>>  - Restore modified java/lang/invoke tests
>>  - Sort includes (new requirement)
>>  - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates
>>  - Add clarifying comments at definitions of register mask sizes
>>  - Fix implicit zero and nullptr checks
>>  - Add deep copy comment
>>  - ... and 29 more: https://git.openjdk.org/jdk/compare/60930a3e...c1f41288
>
> src/hotspot/share/opto/regmask.hpp line 450:
>
>> 448: Insert(reg);
>> 449: }
>> 450: RegMask(OptoReg::Name reg) : RegMask(reg, nullptr) {}
>
> You may want to add `explicit`, so nobody accidentally converts them ;)

Thanks, good point

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2352183022

From dlunden at openjdk.org Tue Sep 16 11:55:36 2025
From: dlunden at openjdk.org (Daniel Lundén)
Date: Tue, 16 Sep 2025 11:55:36 GMT
Subject: RFR: 8325467: Support methods with many arguments in C2 [v27]
In-Reply-To:
References:
Message-ID: <-NoNYKu9VID7gHzEAObPa7adchdpdL5CaNLclPBERVI=.3c469bcd-2497-4285-89f1-d62e5fdaf3d3@github.com>

On Tue, 16 Sep 2025 09:58:11 GMT, Emanuel Peter wrote:

>> src/hotspot/share/opto/regmask.hpp line 464:
>>
>>> 462: copy(rm);
>>> 463: return *this;
>>> 464: }
>>
>> You could also delete this one, and use the `copy` explicitly at the use site. That would make the allocations a bit more explicit. What do you think?
>> Whenever possible, it is nice to be able to declare a type `NONCOPYABLE`. Especially if it does allocations where copy is non-trivial.
>
> You already removed some assignments, like this one, which is good:
> https://github.com/openjdk/jdk/pull/20404/files#diff-344e52fd6be79f1d97a33d7ebbf131148df90bb52e3b33952340e8d37a3849d8L1501-R1512
>
> Generally, there are a lot of constructors here. All of them are public, none explicit. Maybe that is just how it has to be, but maybe you can simplify a little.

I agree with you in principle, but the copy constructor and assignment operator are heavily used by the ADLC-generated code. I prefer not touching it, at least in this PR.
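
Editorial illustration of the two suggestions discussed above (`explicit` single-argument constructors and copying only through a named call). This is a minimal sketch and not HotSpot code: `Mask`, `RegName`, `copy_from`, and `count_low` are hypothetical names, and the 64-bit storage is an assumption made to keep the example small.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a register name; not HotSpot's OptoReg::Name.
using RegName = int;

class Mask {
 public:
  Mask() = default;

  // 'explicit' blocks silent RegName -> Mask conversions.
  explicit Mask(RegName reg) { insert(reg); }

  // Copying is only possible through a named call, so every copy is
  // visible at the use site.
  Mask(const Mask&) = delete;
  Mask& operator=(const Mask&) = delete;
  void copy_from(const Mask& other) { _bits = other._bits; }

  void insert(RegName reg) {
    assert(reg >= 0 && reg < 64);
    _bits |= (std::uint64_t{1} << reg);
  }

  bool member(RegName reg) const {
    return reg >= 0 && reg < 64 && ((_bits >> reg) & 1) != 0;
  }

 private:
  std::uint64_t _bits = 0;
};

// Takes a Mask by const reference. With the explicit constructor above,
// 'count_low(3)' does not compile, while 'count_low(Mask(3))' does.
inline int count_low(const Mask& m) {
  int n = 0;
  for (RegName r = 0; r < 8; r++) {
    if (m.member(r)) {
      n++;
    }
  }
  return n;
}
```

The trade-off mentioned in the reply still applies: code that relies on implicit copies or conversions (such as generated code) would have to be updated before a type can be restricted this way.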
-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2352191833

From mchevalier at openjdk.org Tue Sep 16 12:00:38 2025
From: mchevalier at openjdk.org (Marc Chevalier)
Date: Tue, 16 Sep 2025 12:00:38 GMT
Subject: RFR: 8367728: IGV: dump node address type
In-Reply-To: <6CdmgzpYWNBs__UtcZADKa23joZZS2ELePE8tkGCwAQ=.7cc5b8fe-c080-4c11-a3a8-1d8db6482633@github.com>
References: <6CdmgzpYWNBs__UtcZADKa23joZZS2ELePE8tkGCwAQ=.7cc5b8fe-c080-4c11-a3a8-1d8db6482633@github.com>
Message-ID:

On Tue, 16 Sep 2025 11:49:11 GMT, Roberto Castañeda Lozano wrote:

>> src/hotspot/share/opto/idealGraphPrinter.cpp line 452:
>>
>>> 450: }
>>> 451: if (n->adr_type() != nullptr) {
>>> 452: stringStream adr_type_stream;
>>
>> Other `stringStream` uses around here rely on a preallocated buffer. Would it be a good idea here too?
>
> Thanks for bringing this up. I did not use the pre-allocated buffer for simplicity, which I think is more important than efficiency in this code - as long as the efficiency is not bad enough to turn into a usability problem. We should probably investigate (separately) simplifying all other uses of `stringStream` in the IGV dumping logic.

I think that makes sense. Thanks. No strong opinion whether we should change what is already there, as long as we don't add more.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/27310#discussion_r2352218330

From dlunden at openjdk.org Tue Sep 16 12:07:29 2025
From: dlunden at openjdk.org (Daniel Lundén)
Date: Tue, 16 Sep 2025 12:07:29 GMT
Subject: RFR: 8325467: Support methods with many arguments in C2 [v27]
In-Reply-To:
References:
Message-ID:

On Tue, 16 Sep 2025 10:07:52 GMT, Emanuel Peter wrote:

>> Daniel Lundén has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains 39 commits: >> >> - Clarify comments in regmask.hpp >> - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates >> - Address review comments (renaming on the way in a separate PR) >> - Update src/hotspot/share/opto/regmask.hpp >> >> Co-authored-by: Emanuel Peter >> - Restore modified java/lang/invoke tests >> - Sort includes (new requirement) >> - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates >> - Add clarifying comments at definitions of register mask sizes >> - Fix implicit zero and nullptr checks >> - Add deep copy comment >> - ... and 29 more: https://git.openjdk.org/jdk/compare/60930a3e...c1f41288 > > src/hotspot/share/opto/chaitin.hpp line 151: > >> 149: _mask_size = _mask.rm_size_in_bits(); >> 150: return _mask.rollover(); >> 151: } > > Subjective: I would have kept the one-liner approach consistently here, since that is what surrounding code does. Fair enough, I'll update! > src/hotspot/share/opto/ifg.cpp line 732: > >> 730: >> 731: // Remove bound register(s) from 'l's choices >> 732: old = interfering_lrg.mask(); > > Just checking: This is an implicit `copy` case, right? Right, it is equivalent to make `copy` public and call it directly. However, see my other comment for why I think we (for now) should keep the `operator=` around. > src/hotspot/share/opto/locknode.cpp line 43: > >> 41: >> 42: BoxLockNode::BoxLockNode(int slot) >> 43: : Node(Compile::current()->root()), _slot(slot), > > Suggestion: > > : Node(Compile::current()->root()), > _slot(slot), > > I would put all on separate lines. Optional. Sure, I'll update > src/hotspot/share/opto/locknode.cpp line 55: > >> 53: } >> 54: init_class_id(Class_BoxLock); >> 55: init_flags(Flag_rematerialize); > > Any reason why you moved these after the bailout? Maybe that's fine, but I don't know what the implications might be. Do you? No reason that I can remember. I'll move them before the bailout! 
> src/hotspot/share/opto/machnode.hpp line 758:
>
>> 756: public:
>> 757: MachProjNode(Node* multi, uint con, const RegMask& out, uint ideal_reg)
>> 758: : ProjNode(multi, con), _rout(out, Compile::current()->comp_arena()),
>
> Suggestion:
>
>     : ProjNode(multi, con),
>       _rout(out, Compile::current()->comp_arena()),
>
> Optional. In my opinion, the initializers should be listed either all horizontally or all vertically ;)

Sure, updated

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2352218757
PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2352225778
PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2352228859
PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2352236989
PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2352240162

From dlunden at openjdk.org Tue Sep 16 12:07:32 2025
From: dlunden at openjdk.org (Daniel Lundén)
Date: Tue, 16 Sep 2025 12:07:32 GMT
Subject: RFR: 8325467: Support methods with many arguments in C2 [v27]
In-Reply-To:
References:
Message-ID:

On Tue, 16 Sep 2025 10:26:07 GMT, Emanuel Peter wrote:

>> src/hotspot/share/opto/regmask.hpp line 659:
>>
>>> 657:
>>> 658: // Fill a register mask with 1's starting from the given register.
>>> 659: void Set_All_From(OptoReg::Name reg) {
>>
>> Oh boy, we have a mixture of `lower_case` and `Strange_Case` method names.
>> We missed those in the renaming RFE :/
>
> Or is there a particular logic behind it?

No other logic than keeping the same style as the surrounding old code. I can update it to use up-to-date style, but then we increase the scope of this PR. Is a follow-up PR OK?
-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2352247193

From dlunden at openjdk.org Tue Sep 16 12:11:51 2025
From: dlunden at openjdk.org (Daniel Lundén)
Date: Tue, 16 Sep 2025 12:11:51 GMT
Subject: RFR: 8325467: Support methods with many arguments in C2 [v27]
In-Reply-To:
References:
Message-ID:

On Tue, 16 Sep 2025 10:29:58 GMT, Emanuel Peter wrote:

>> Daniel Lundén has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 39 commits:
>>
>>  - Clarify comments in regmask.hpp
>>  - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates
>>  - Address review comments (renaming on the way in a separate PR)
>>  - Update src/hotspot/share/opto/regmask.hpp
>>
>>    Co-authored-by: Emanuel Peter
>>  - Restore modified java/lang/invoke tests
>>  - Sort includes (new requirement)
>>  - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates
>>  - Add clarifying comments at definitions of register mask sizes
>>  - Fix implicit zero and nullptr checks
>>  - Add deep copy comment
>>  - ... and 29 more: https://git.openjdk.org/jdk/compare/60930a3e...c1f41288
>
> src/hotspot/share/opto/postaloc.cpp line 685:
>
>> 683: assert(is_adjacent || is_reg,
>> 684: "only registers can be non-adjacent");
>> 685: if (!value[ureg_lo] && is_adjacent) { // Nearly always adjacent
>
> `value[ureg_lo]` returns a `Node*`, right? Then that would make this an implicit null check, not allowed by style guide ;)

Here I'll argue for not touching this in this PR (I did not introduce it), as this is the style of the surrounding code. It should be addressed in a follow-up PR though.
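
Editorial illustration of the implicit versus explicit null check mentioned in the review comment above. This is a minimal sketch with hypothetical names (`Node` here is an empty placeholder and `slot_is_free` does not exist in HotSpot); it only shows the explicit `== nullptr` comparison that the reviewer's remark points to.

```cpp
#include <cassert>
#include <cstddef>

struct Node {};  // hypothetical placeholder, not HotSpot's Node class

// Returns true if 'slot' holds no node and the caller reports it as adjacent.
inline bool slot_is_free(Node* const* value, std::size_t slot, bool is_adjacent) {
  // Explicit comparison against nullptr instead of the implicit '!value[slot]'.
  return value[slot] == nullptr && is_adjacent;
}
```

Both spellings behave the same; the explicit form only makes it visible at a glance that a pointer, not a boolean, is being tested.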
-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2352261538

From dlunden at openjdk.org Tue Sep 16 12:17:27 2025
From: dlunden at openjdk.org (Daniel Lundén)
Date: Tue, 16 Sep 2025 12:17:27 GMT
Subject: RFR: 8325467: Support methods with many arguments in C2 [v27]
In-Reply-To:
References:
Message-ID:

On Tue, 16 Sep 2025 10:31:33 GMT, Emanuel Peter wrote:

>> Daniel Lundén has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 39 commits:
>>
>>  - Clarify comments in regmask.hpp
>>  - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates
>>  - Address review comments (renaming on the way in a separate PR)
>>  - Update src/hotspot/share/opto/regmask.hpp
>>
>>    Co-authored-by: Emanuel Peter
>>  - Restore modified java/lang/invoke tests
>>  - Sort includes (new requirement)
>>  - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates
>>  - Add clarifying comments at definitions of register mask sizes
>>  - Fix implicit zero and nullptr checks
>>  - Add deep copy comment
>>  - ... and 29 more: https://git.openjdk.org/jdk/compare/60930a3e...c1f41288
>
> src/hotspot/share/opto/postaloc.cpp line 681:
>
>> 679: for (int l = 1; l < n_regs; l++) {
>> 680: OptoReg::Name ureg_lo = OptoReg::add(ureg,-l);
>> 681: bool is_reg = OptoReg::is_reg(ureg_lo);
>
> Only needed in assert. Do you really need to give it a separate name? Subjective, your choice.
>
> Does it have a side-effect?

Giving it a name is only for clarity, mirroring the style of `is_adjacent` in the `assert`. I'll inline it, no problem. No side-effect.

> src/hotspot/share/opto/regmask.hpp line 122:
>
>> 120: // the machine registers and usually all parameters that need to be passed
>> 121: // on the stack (stack registers) up to some interesting limit. On Intel,
>> 122: // the limit is something like 90+ parameters.
>
> You may say that in the "unusual" case, we have to use `_rm_word_ext`. Just so the reader knows what the ominous "usually" refers to ;)

Yes, thanks. I'll update this comment to reflect the new register mask features.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2352277426
PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2352285497

From dlunden at openjdk.org Tue Sep 16 12:25:02 2025
From: dlunden at openjdk.org (Daniel Lundén)
Date: Tue, 16 Sep 2025 12:25:02 GMT
Subject: RFR: 8325467: Support methods with many arguments in C2 [v24]
In-Reply-To: <8EDUG032a2-wepy1MeWd6n3Gfxr3_sajeRf07BbI0Wk=.ee7db86d-fd41-4beb-9a68-79812187466e@github.com>
References: <_a6JVBA326t8l1U3ZI8C-J3Ju5jm-RklBFGtnR7fbyY=.70638135-7577-44dc-a212-fe5e39b1f5fa@github.com> <8EDUG032a2-wepy1MeWd6n3Gfxr3_sajeRf07BbI0Wk=.ee7db86d-fd41-4beb-9a68-79812187466e@github.com>
Message-ID:

On Tue, 16 Sep 2025 10:48:14 GMT, Emanuel Peter wrote:

>> @dlunde It could be helpful to see a small example to see what maps to what if there are multiple views.
>
> Why not move the field down to its explanation? Or move the explanation to the field?

I think my comment about multiple views was misleading; I have rephrased it a bit now. I also moved the field down to just before the example illustrating it, good suggestion.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2352318746

From dlunden at openjdk.org Tue Sep 16 12:25:04 2025
From: dlunden at openjdk.org (Daniel Lundén)
Date: Tue, 16 Sep 2025 12:25:04 GMT
Subject: RFR: 8325467: Support methods with many arguments in C2 [v27]
In-Reply-To:
References:
Message-ID:

On Tue, 16 Sep 2025 10:53:04 GMT, Emanuel Peter wrote:

>> src/hotspot/share/opto/regmask.hpp line 267:
>>
>>> 265:
>>> 266: // Where to extend the register mask
>>> 267: Arena* _arena;
>>
>> Usually, we try to keep all fields at the top.
>
> Just to keep the overview.
Sure, moved next to `_rm_word_ext` now.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2352323443

From dlunden at openjdk.org Tue Sep 16 12:30:06 2025
From: dlunden at openjdk.org (Daniel Lundén)
Date: Tue, 16 Sep 2025 12:30:06 GMT
Subject: RFR: 8325467: Support methods with many arguments in C2 [v27]
In-Reply-To:
References:
Message-ID:

On Tue, 16 Sep 2025 11:03:28 GMT, Emanuel Peter wrote:

>> Daniel Lundén has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 39 commits:
>>
>>  - Clarify comments in regmask.hpp
>>  - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates
>>  - Address review comments (renaming on the way in a separate PR)
>>  - Update src/hotspot/share/opto/regmask.hpp
>>
>>    Co-authored-by: Emanuel Peter
>>  - Restore modified java/lang/invoke tests
>>  - Sort includes (new requirement)
>>  - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates
>>  - Add clarifying comments at definitions of register mask sizes
>>  - Fix implicit zero and nullptr checks
>>  - Add deep copy comment
>>  - ... and 29 more: https://git.openjdk.org/jdk/compare/60930a3e...c1f41288
>
> src/hotspot/share/opto/regmask.hpp line 458:
>
>> 456: }
>> 457:
>> 458: RegMask(const RegMask& rm) : RegMask(rm, nullptr) {}
>
> Do you want to add `explicit` here?
> This is a shallow copy, right? Maybe add a comment for that.

The ADLC-generated code relies on using the constructor implicitly, so I prefer not touching it, at least in this changeset. All the copies are deep; I have clarified this in the comments now.
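
Editorial illustration of what "deep copy" means for a mask with fixed inline storage plus an optional extension array. This is a minimal sketch under stated assumptions, not the `RegMask` implementation: `GrowableMask`, `kBaseWords`, and all member names are hypothetical, and heap allocation via `std::unique_ptr` stands in for the arena allocation used in the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <utility>

class GrowableMask {
 public:
  static constexpr std::size_t kBaseWords = 2;    // fixed inline part (assumed size)
  static constexpr std::size_t kBitsPerWord = 64;

  GrowableMask() = default;

  // Deep copy: the extension array is duplicated, never shared.
  GrowableMask(const GrowableMask& other) { copy_from(other); }
  GrowableMask& operator=(const GrowableMask& other) {
    if (this != &other) {
      copy_from(other);
    }
    return *this;
  }

  void insert(std::size_t bit) {
    std::size_t w = bit / kBitsPerWord;
    if (w >= kBaseWords + _ext_size) {
      grow(w + 1);  // only the rare large case allocates
    }
    word(w) |= std::uint64_t{1} << (bit % kBitsPerWord);
  }

  bool member(std::size_t bit) const {
    std::size_t w = bit / kBitsPerWord;
    if (w >= kBaseWords + _ext_size) {
      return false;
    }
    std::uint64_t v = (w < kBaseWords) ? _base[w] : _ext[w - kBaseWords];
    return ((v >> (bit % kBitsPerWord)) & 1) != 0;
  }

  bool uses_extension() const { return _ext_size != 0; }

 private:
  std::uint64_t& word(std::size_t w) {
    return (w < kBaseWords) ? _base[w] : _ext[w - kBaseWords];
  }

  // Grow the extension so that at least min_words words fit in total.
  void grow(std::size_t min_words) {
    std::size_t new_ext = min_words - kBaseWords;
    std::unique_ptr<std::uint64_t[]> p(new std::uint64_t[new_ext]());  // zeroed
    if (_ext_size != 0) {
      std::memcpy(p.get(), _ext.get(), _ext_size * sizeof(std::uint64_t));
    }
    _ext = std::move(p);
    _ext_size = new_ext;
  }

  void copy_from(const GrowableMask& other) {
    std::memcpy(_base, other._base, sizeof(_base));
    _ext.reset();
    _ext_size = other._ext_size;
    if (_ext_size != 0) {
      _ext.reset(new std::uint64_t[_ext_size]);
      std::memcpy(_ext.get(), other._ext.get(), _ext_size * sizeof(std::uint64_t));
    }
  }

  std::uint64_t _base[kBaseWords] = {0, 0};
  std::unique_ptr<std::uint64_t[]> _ext;
  std::size_t _ext_size = 0;
};
```

After `GrowableMask b = a;`, inserting into `b` leaves `a` unchanged even when both use the extension, which is the property a shallow copy (sharing the extension pointer) would break.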
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2352336557 From hgreule at openjdk.org Tue Sep 16 12:38:33 2025 From: hgreule at openjdk.org (Hannes Greule) Date: Tue, 16 Sep 2025 12:38:33 GMT Subject: RFR: 8356813: Improve Mod(I|L)Node::Value [v9] In-Reply-To: <1ZCEMsPvSQaLGWRuNtO89LNP_XUeaz-edeIUrKwRCZY=.9dad5a02-c739-4e24-8692-8941f31e5a49@github.com> References: <2Jf_gfvRlKcmCFoQHp5T0WW_fU_yK5-0Z3z41f00-YU=.164be9f0-fae1-44bb-84c3-846d8c2c0db2@github.com> <1ZCEMsPvSQaLGWRuNtO89LNP_XUeaz-edeIUrKwRCZY=.9dad5a02-c739-4e24-8692-8941f31e5a49@github.com> Message-ID: On Sun, 14 Sep 2025 14:44:02 GMT, Hannes Greule wrote: >> This change improves the precision of the `Mod(I|L)Node::Value()` functions. >> >> I reordered the structure a bit. First, we handle constants, afterwards, we handle ranges. The bottom checks seem to be excessive (`Type::BOTTOM` is covered by using `isa_(int|long)()`, the local bottom is just the full range). Given we can even give reasonable bounds if only one input has any bounds, we don't want to return early. >> The changes after that are commented. Please let me know if the explanations are good, or if you have any suggestions. >> >> ### Monotonicity >> >> Before, a 0 divisor resulted in `Type(Int|Long)::POS`. Initially I wanted to keep it this way, but that violates monotonicity during PhaseCCP. As an example, if we see a 0 divisor first and a 3 afterwards, we might try to go from `>=0` to `-2..2`, but the meet of these would be `>=-2` rather than `-2..2`. Using `Type(Int|Long)::ZERO` instead (zero is always in the resulting value if we cover a range). >> >> ### Testing >> >> I added tests for cases around the relevant bounds. I also ran tier1, tier2, and tier3 but didn't see any related failures after addressing the monotonicity problem described above (I'm having a few unrelated failures on my system currently, so separate testing would be appreciated in case I missed something). 
>> >> Please review and let me know what you think. >> >> ### Other >> >> The `UMod(I|L)Node`s were adjusted to be more in line with its signed variants. This change diverges them again, but similar improvements could be made after #17508. >> >> During experimenting with these changes, I stumbled upon a few things that aren't directly related to this change, but might be worth to further look into: >> - If the divisor is a constant, we will directly replace the `Mod(I|L)Node` with more but less expensive nodes in `::Ideal()`. Type analysis for these nodes combined is less precise, means we miss potential cases were this would help e.g., removing range checks. Would it make sense to delay the replacement? >> - To force non-negative ranges, I'm using `char`. I noticed that method parameters of sub-int integer types all fall back to `TypeInt::INT`. This seems to be an intentional change of https://github.com/openjdk/jdk/commit/200784d505dd98444c48c9ccb7f2e4df36dcbb6a. The bug report is private, so I can't really judge if that part is necessary, but it seems odd. 
> > Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: > > remove unused parameter Thanks everyone for the patience and the reviews :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/25254#issuecomment-3298520392 From hgreule at openjdk.org Tue Sep 16 12:38:35 2025 From: hgreule at openjdk.org (Hannes Greule) Date: Tue, 16 Sep 2025 12:38:35 GMT Subject: Integrated: 8356813: Improve Mod(I|L)Node::Value In-Reply-To: <2Jf_gfvRlKcmCFoQHp5T0WW_fU_yK5-0Z3z41f00-YU=.164be9f0-fae1-44bb-84c3-846d8c2c0db2@github.com> References: <2Jf_gfvRlKcmCFoQHp5T0WW_fU_yK5-0Z3z41f00-YU=.164be9f0-fae1-44bb-84c3-846d8c2c0db2@github.com> Message-ID: <8KSnYgDRvdBlvE0hx2hmWRZaKZ9_XfLHMqqKYFDFRmU=.fd29e8aa-2511-4f96-8976-2c3bcf6c2450@github.com> On Thu, 15 May 2025 15:13:18 GMT, Hannes Greule wrote: > This change improves the precision of the `Mod(I|L)Node::Value()` functions. > > I reordered the structure a bit. First, we handle constants, afterwards, we handle ranges. The bottom checks seem to be excessive (`Type::BOTTOM` is covered by using `isa_(int|long)()`, the local bottom is just the full range). Given we can even give reasonable bounds if only one input has any bounds, we don't want to return early. > The changes after that are commented. Please let me know if the explanations are good, or if you have any suggestions. > > ### Monotonicity > > Before, a 0 divisor resulted in `Type(Int|Long)::POS`. Initially I wanted to keep it this way, but that violates monotonicity during PhaseCCP. As an example, if we see a 0 divisor first and a 3 afterwards, we might try to go from `>=0` to `-2..2`, but the meet of these would be `>=-2` rather than `-2..2`. Using `Type(Int|Long)::ZERO` instead (zero is always in the resulting value if we cover a range). > > ### Testing > > I added tests for cases around the relevant bounds. 
I also ran tier1, tier2, and tier3 but didn't see any related failures after addressing the monotonicity problem described above (I'm having a few unrelated failures on my system currently, so separate testing would be appreciated in case I missed something). > > Please review and let me know what you think. > > ### Other > > The `UMod(I|L)Node`s were adjusted to be more in line with its signed variants. This change diverges them again, but similar improvements could be made after #17508. > > During experimenting with these changes, I stumbled upon a few things that aren't directly related to this change, but might be worth to further look into: > - If the divisor is a constant, we will directly replace the `Mod(I|L)Node` with more but less expensive nodes in `::Ideal()`. Type analysis for these nodes combined is less precise, means we miss potential cases were this would help e.g., removing range checks. Would it make sense to delay the replacement? > - To force non-negative ranges, I'm using `char`. I noticed that method parameters of sub-int integer types all fall back to `TypeInt::INT`. This seems to be an intentional change of https://github.com/openjdk/jdk/commit/200784d505dd98444c48c9ccb7f2e4df36dcbb6a. The bug report is private, so I can't really judge if that part is necessary, but it seems odd. This pull request has now been integrated. 
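
Editorial illustration of the monotonicity argument in the description above. This is a simplified sketch, not the code from `Mod(I|L)Node::Value()`: `Range`, `meet`, and `mod_range` are hypothetical names, the divisor is assumed to be a single constant, and the lattice meet is modelled as the convex hull of two ranges.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>

struct Range {
  std::int64_t lo;
  std::int64_t hi;
};

constexpr std::int64_t kIntMin = std::numeric_limits<std::int32_t>::min();
constexpr std::int64_t kIntMax = std::numeric_limits<std::int32_t>::max();

// Meet modelled as the smallest range containing both inputs.
inline Range meet(Range a, Range b) {
  return Range{std::min(a.lo, b.lo), std::max(a.hi, b.hi)};
}

// Conservative bounds for 'x % d' with x in 'dividend' and a constant
// 32-bit divisor d (Java semantics: the sign follows the dividend).
inline Range mod_range(Range dividend, std::int64_t d) {
  if (d == 0) {
    return Range{0, 0};  // the ZERO choice described in the PR text
  }
  std::int64_t m = (d < 0 ? -d : d) - 1;  // |x % d| <= |d| - 1
  std::int64_t lo = dividend.lo < 0 ? std::max(-m, dividend.lo) : 0;
  std::int64_t hi = dividend.hi > 0 ? std::min(m, dividend.hi) : 0;
  return Range{lo, hi};
}
```

With these definitions, `meet(Range{0, kIntMax}, mod_range(full, 3))` widens to `-2..kIntMax`, whereas starting from `mod_range(full, 0)`, i.e. `0..0`, the meet with `-2..2` stays `-2..2`, matching the reasoning given for switching from `POS` to `ZERO`.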
Changeset: c7f014ed Author: Hannes Greule URL: https://git.openjdk.org/jdk/commit/c7f014ed494409cdf9fc925fe98de08346606408 Stats: 695 lines in 3 files changed: 630 ins; 50 del; 15 mod 8356813: Improve Mod(I|L)Node::Value Reviewed-by: epeter, qamai ------------- PR: https://git.openjdk.org/jdk/pull/25254 From dlunden at openjdk.org Tue Sep 16 12:39:13 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Tue, 16 Sep 2025 12:39:13 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v28] In-Reply-To: References: Message-ID: > If a method has a large number of parameters, we currently bail out from C2 compilation. > > ### Changeset > > Allowing C2 compilation of methods with a large number of parameters requires fundamental changes to the register mask data structure, used in many places in C2. In particular, register masks currently have a statically determined size and cannot represent arbitrary numbers of stack slots. This is needed if we want to compile methods with arbitrary numbers of parameters. Register mask operations are present in performance-sensitive parts of C2, which further complicates changes. > > Changes: > - Add functionality to dynamically grow/extend register masks. I experimented with a number of design choices to achieve this. To keep the common case (normal number of method parameters) quick and also to avoid more intrusive changes to the current `RegMask` interface, I decided to leave the "base" statically allocated memory for masks unchanged and only use dynamically allocated memory in the rare cases where it is needed. > - Generalize the "chunk"-logic from `PhaseChaitin::Select()` to allow arbitrary-sized chunks, and also move most of the logic into register mask methods to separate concerns and to make the `PhaseChaitin::Select()` code more readable. > - Remove all `can_represent` checks and bailouts. > - Performance tuning. 
A particularly important change is the early-exit optimization in `RegMask::overlap`, used in the performance-sensitive method `PhaseChaitin::interfere_with_live`. > - Add a new test case `TestManyMethodArguments.java` and extend an old test `TestNestedSynchronize.java`. > > ### Testing > > - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/10178060450) > - `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. > - Standard performance benchmarking. No observed conclusive overall performance degradation/improvement. > - Specific benchmarking of C2 compilation time. The changes increase C2 compilation time by, approximately and on average, 1% for methods that could also be compiled before this changeset (see the figure below). The reason for the degradation is further checks required in performance-sensitive code (in particular `PhaseChaitin::remove_bound_register_from_interfering_live_ranges`). I have tried optimizing in various ways, but changes I found that lead to improvement also lead to less readable code (and are, in my opinion, not worth it). > > ![c2-regression](https:/... 
Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: Update after comments from Emanuel ------------- Changes: - all: https://git.openjdk.org/jdk/pull/20404/files - new: https://git.openjdk.org/jdk/pull/20404/files/c1f41288..fe69f5a3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20404&range=27 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20404&range=26-27 Stats: 84 lines in 7 files changed: 26 ins; 29 del; 29 mod Patch: https://git.openjdk.org/jdk/pull/20404.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20404/head:pull/20404 PR: https://git.openjdk.org/jdk/pull/20404 From ayang at openjdk.org Tue Sep 16 12:48:21 2025 From: ayang at openjdk.org (Albert Mingkun Yang) Date: Tue, 16 Sep 2025 12:48:21 GMT Subject: RFR: 8367740: assembler_.inline.hpp should not include assembler.inline.hpp In-Reply-To: References: Message-ID: <9nxdJBFv1YD2q97EhtgKjQaSWXLGiNOGnEuE-B4_q1w=.b0ca98e7-34b0-4e45-bd23-4e7d70e62b4d@github.com> On Tue, 16 Sep 2025 10:15:06 GMT, Francesco Andreuzzi wrote: > This is the content of assembler.inline.hpp: > https://github.com/openjdk/jdk/blob/ca89cd06d39ed3a6bbe16f60fea4d7382849edbd/src/hotspot/share/asm/assembler.inline.hpp#L28-L30 > > Most of the `assembler_.inline.hpp` include it: > https://github.com/openjdk/jdk/blob/ca89cd06d39ed3a6bbe16f60fea4d7382849edbd/src/hotspot/cpu/zero/assembler_zero.inline.hpp#L29-L32 > > They should probably include `assembler.hpp` instead. 
> > Testing: tier1 in GHA Some background on this: https://github.com/openjdk/jdk/pull/27189#discussion_r2344516463, just fyi for others ------------- PR Comment: https://git.openjdk.org/jdk/pull/27311#issuecomment-3298600193 From epeter at openjdk.org Tue Sep 16 12:49:00 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 16 Sep 2025 12:49:00 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v27] In-Reply-To: References: Message-ID: On Tue, 16 Sep 2025 08:52:57 GMT, Daniel Lund?n wrote: >> If a method has a large number of parameters, we currently bail out from C2 compilation. >> >> ### Changeset >> >> Allowing C2 compilation of methods with a large number of parameters requires fundamental changes to the register mask data structure, used in many places in C2. In particular, register masks currently have a statically determined size and cannot represent arbitrary numbers of stack slots. This is needed if we want to compile methods with arbitrary numbers of parameters. Register mask operations are present in performance-sensitive parts of C2, which further complicates changes. >> >> Changes: >> - Add functionality to dynamically grow/extend register masks. I experimented with a number of design choices to achieve this. To keep the common case (normal number of method parameters) quick and also to avoid more intrusive changes to the current `RegMask` interface, I decided to leave the "base" statically allocated memory for masks unchanged and only use dynamically allocated memory in the rare cases where it is needed. >> - Generalize the "chunk"-logic from `PhaseChaitin::Select()` to allow arbitrary-sized chunks, and also move most of the logic into register mask methods to separate concerns and to make the `PhaseChaitin::Select()` code more readable. >> - Remove all `can_represent` checks and bailouts. >> - Performance tuning. 
A particularly important change is the early-exit optimization in `RegMask::overlap`, used in the performance-sensitive method `PhaseChaitin::interfere_with_live`. >> - Add a new test case `TestManyMethodArguments.java` and extend an old test `TestNestedSynchronize.java`. >> >> ### Testing >> >> - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/10178060450) >> - `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. >> - Standard performance benchmarking. No observed conclusive overall performance degradation/improvement. >> - Specific benchmarking of C2 compilation time. The changes increase C2 compilation time by, approximately and on average, 1% for methods that could also be compiled before this changeset (see the figure below). The reason for the degradation is further checks required in performance-sensitive code (in particular `PhaseChaitin::remove_bound_register_from_interfering_live_ranges`). I have tried optimizing in various ways, but changes I found that lead to improvement also lead to less readable code (and are, in my opinion, no... > > Daniel Lund?n has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 39 commits: > > - Clarify comments in regmask.hpp > - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates > - Address review comments (renaming on the way in a separate PR) > - Update src/hotspot/share/opto/regmask.hpp > > Co-authored-by: Emanuel Peter > - Restore modified java/lang/invoke tests > - Sort includes (new requirement) > - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates > - Add clarifying comments at definitions of register mask sizes > - Fix implicit zero and nullptr checks > - Add deep copy comment > - ... and 29 more: https://git.openjdk.org/jdk/compare/60930a3e...c1f41288 Alright, sprinted through the end. 
I really appreciate that you added extensive `gtest`s, thanks for that ? And thanks for using the Template Framework, I'm curious to hear if you have any feedback on it :) src/hotspot/share/opto/chaitin.cpp line 1656: > 1654: > 1655: // Check if a color is available and if so pick the color > 1656: OptoReg::Name reg = choose_color(*lrg); Accidental find: why is this assert commented out? src/hotspot/share/opto/chaitin.cpp line 1663: > 1661: if (!OptoReg::is_valid(reg) && is_infinite_stack) { > 1662: // Bump register mask up to next stack chunk > 1663: bool success = lrg->rollover(); Can you add a comment that explains what this does / means? Do we start spilling to the stack slots instead of using registers? src/hotspot/share/opto/regmask.hpp line 241: > 239: // \_______________________________________________________________________________/ > 240: // | > 241: // _rm_size_in_words=_offset=5 Can you please add some concise comment why we need `rollover`? Does that happen during register allocation, and if we have rollover then we start spilling instead of keeping values in registers? src/hotspot/share/opto/regmask.hpp line 837: > 835: // ---------------------------------------------------------------------- > 836: // The methods below are only for testing purposes (see test_regmask.cpp) > 837: // ---------------------------------------------------------------------- I wonder if it could be solved with `friend` instead, so it does not have to be public and get accidentally used somehow. Or maybe some `gtest_` prefix? Not sure. test/hotspot/jtreg/compiler/arguments/TestMethodArguments.java line 51: > 49: static final int INPUT_SIZE = 100; > 50: > 51: public static Template.ZeroArgs generateTest(PrimitiveType t, int numberOfArguments) { You should write out `type` instead of `t`, would make it consistent with your `let` below. 
test/hotspot/jtreg/compiler/arguments/TestMethodArguments.java line 120: > 118: Template.let("classpath", comp.getEscapedClassPathOfCompiledClasses()), > 119: """ > 120: import java.util.Arrays; Personally, I would not indent this deeply. I know that the generated code will not have proper indentation, but that's no so bad. Readability of the Templates is more important I think. Subjective though. test/hotspot/jtreg/compiler/arguments/TestMethodArguments.java line 146: > 144: } > 145: return array; > 146: } Seems like we need to add some convenience "fill" methods to the template library. We'll get there eventually, just keep this for now. ------------- Changes requested by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/20404#pullrequestreview-3229588008 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2352217532 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2352235070 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2352231339 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2352309374 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2352361517 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2352382893 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2352378439 From epeter at openjdk.org Tue Sep 16 12:49:02 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 16 Sep 2025 12:49:02 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v27] In-Reply-To: References: Message-ID: On Tue, 16 Sep 2025 11:57:55 GMT, Emanuel Peter wrote: >> Daniel Lund?n has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 39 commits: >> >> - Clarify comments in regmask.hpp >> - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates >> - Address review comments (renaming on the way in a separate PR) >> - Update src/hotspot/share/opto/regmask.hpp >> >> Co-authored-by: Emanuel Peter >> - Restore modified java/lang/invoke tests >> - Sort includes (new requirement) >> - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates >> - Add clarifying comments at definitions of register mask sizes >> - Fix implicit zero and nullptr checks >> - Add deep copy comment >> - ... and 29 more: https://git.openjdk.org/jdk/compare/60930a3e...c1f41288 > > src/hotspot/share/opto/chaitin.cpp line 1656: > >> 1654: >> 1655: // Check if a color is available and if so pick the color >> 1656: OptoReg::Name reg = choose_color(*lrg); > > Accidental find: why is this assert commented out? `//assert(is_infinite_stack == lrg->mask().is_infinite_stack(), "nbrs must not change InfiniteStackedness");` ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2352218812 From dlunden at openjdk.org Tue Sep 16 12:57:52 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Tue, 16 Sep 2025 12:57:52 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v27] In-Reply-To: References: Message-ID: On Tue, 16 Sep 2025 11:58:15 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/chaitin.cpp line 1656: >> >>> 1654: >>> 1655: // Check if a color is available and if so pick the color >>> 1656: OptoReg::Name reg = choose_color(*lrg); >> >> Accidental find: why is this assert commented out? > > `//assert(is_infinite_stack == lrg->mask().is_infinite_stack(), "nbrs must not change InfiniteStackedness");` No idea, sorry (it has been that way since initial load). I just touched it to change from all_stack to infinite_stack. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2352417426 From dfenacci at openjdk.org Tue Sep 16 13:18:29 2025 From: dfenacci at openjdk.org (Damon Fenacci) Date: Tue, 16 Sep 2025 13:18:29 GMT Subject: RFR: 8367728: IGV: dump node address type In-Reply-To: References: Message-ID: On Tue, 16 Sep 2025 10:11:50 GMT, Roberto Castañeda Lozano wrote: > This changeset dumps the address type of each node (`Node::adr_type()`), when not null, into the IGV graphs. This should improve the visibility and diagnosability of C2 type inconsistencies, see e.g. [JDK-8367667](https://bugs.openjdk.org/browse/JDK-8367667). > > #### Testing > - tier1 (windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64; release and debug mode). > - Tested IGV manually on a few selected graphs. Tested automatically that dumping thousands of graphs does not trigger any assertion failure (by running `java -Xcomp -XX:PrintIdealGraphLevel=1`). Strange that it wasn't already printed. Thanks for adding this @robcasloz! LGTM ------------- Marked as reviewed by dfenacci (Committer). PR Review: https://git.openjdk.org/jdk/pull/27310#pullrequestreview-3229947806 From mhaessig at openjdk.org Tue Sep 16 13:37:14 2025 From: mhaessig at openjdk.org (Manuel Hässig) Date: Tue, 16 Sep 2025 13:37:14 GMT Subject: RFR: 8366775: TestCompileTaskTimeout should use timeoutFactor In-Reply-To: References: Message-ID: On Tue, 9 Sep 2025 14:02:01 GMT, Matthias Baesken wrote: >> `TestCompileTaskTimeout.java` employs a timeout to test that methods are compiled faster than a specified `CompileTaskTimeout`. However, it does not make use of the jtreg timeout factor, which led to #26963 increasing the timeout to 2 s. This PR remedies this by using the timeout factor and reducing the default timeout to 500 ms. 
>> >> Testing: >> - [x] Github Actions >> - [x] tier1, tier2 linux-x64-debug, linux-x64, linux-aarch64-debug, linux-aarch64 > > Looks good, the adjustments seem to work for us. Thank you for reviewing @MBaesken, @robcasloz, and @chhagedorn! ------------- PR Comment: https://git.openjdk.org/jdk/pull/27094#issuecomment-3298734770 From mhaessig at openjdk.org Tue Sep 16 13:37:28 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Tue, 16 Sep 2025 13:37:28 GMT Subject: Integrated: 8366775: TestCompileTaskTimeout should use timeoutFactor In-Reply-To: References: Message-ID: On Thu, 4 Sep 2025 13:26:22 GMT, Manuel H?ssig wrote: > `TestCompileTaskTimeout.java` employs a timeout to test that methods compiled faster than a specified `CompileTaskTimeout`. However, it does not make use of the jtreg timeout factor, which lead to #26963 increasing the timeout to 2 s. This PR remedies this, by using the timeout factor and reducing the default timeout to 500 ms. > > Testing: > - [x] Github Actions > - [x] tier1, tier2 linux-x64-debug, linux-x64, linux-aarch64-debug, linux-aarch64 This pull request has now been integrated. 
Changeset: c82070e6 Author: Manuel H?ssig URL: https://git.openjdk.org/jdk/commit/c82070e6357a1b49f2887ab22267393ba87d9352 Stats: 7 lines in 1 file changed: 6 ins; 0 del; 1 mod 8366775: TestCompileTaskTimeout should use timeoutFactor Reviewed-by: chagedorn, rcastanedalo, mbaesken ------------- PR: https://git.openjdk.org/jdk/pull/27094 From dfenacci at openjdk.org Tue Sep 16 13:56:26 2025 From: dfenacci at openjdk.org (Damon Fenacci) Date: Tue, 16 Sep 2025 13:56:26 GMT Subject: RFR: 8367740: assembler_.inline.hpp should not include assembler.inline.hpp In-Reply-To: References: Message-ID: On Tue, 16 Sep 2025 10:15:06 GMT, Francesco Andreuzzi wrote: > This is the content of assembler.inline.hpp: > https://github.com/openjdk/jdk/blob/ca89cd06d39ed3a6bbe16f60fea4d7382849edbd/src/hotspot/share/asm/assembler.inline.hpp#L28-L30 > > Most of the `assembler_.inline.hpp` include it: > https://github.com/openjdk/jdk/blob/ca89cd06d39ed3a6bbe16f60fea4d7382849edbd/src/hotspot/cpu/zero/assembler_zero.inline.hpp#L29-L32 > > They should probably include `assembler.hpp` instead. > > Testing: tier1 in GHA It looks like there were a few include cycles. Thanks for fixing this @fandreuz. Running tier1-3+ tests... ------------- PR Comment: https://git.openjdk.org/jdk/pull/27311#issuecomment-3298879857 From mhaessig at openjdk.org Tue Sep 16 13:57:41 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Tue, 16 Sep 2025 13:57:41 GMT Subject: RFR: 8366875: CompileTaskTimeout should be reset for each iteration of RepeatCompilation [v2] In-Reply-To: <4TbOkAMu-KU_tgQPg1sK0L8oto_0nD4mQo7yc0hJPm4=.8d87b900-a614-4c13-a4c6-6fe11e206482@github.com> References: <4TbOkAMu-KU_tgQPg1sK0L8oto_0nD4mQo7yc0hJPm4=.8d87b900-a614-4c13-a4c6-6fe11e206482@github.com> Message-ID: > When running a debug JVM on Linux with a compile task timeout and repeated compilation, the execution will time out almost always because the timeout does not reset for repetitions of a compilation. 
The core of the compile task timeout is to limit the amount of time a single compilation can take. Thus, this PR resets the `CompileTaskTimeout` for every compilation when running with `-XX:RepeatCompilation=` for n > 1. > > This PR is stacked on top of #27094. > > Testing: > - [x] Github Actions (failures are unrelated) > - [x] tier1, tier2, tier3 plus some additional internal testing Manuel H?ssig has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/27120/files - new: https://git.openjdk.org/jdk/pull/27120/files/cfe842c7..cfe842c7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=27120&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27120&range=00-01 Stats: 0 lines in 0 files changed: 0 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/27120.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27120/head:pull/27120 PR: https://git.openjdk.org/jdk/pull/27120 From epeter at openjdk.org Tue Sep 16 14:24:51 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 16 Sep 2025 14:24:51 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v27] In-Reply-To: References: Message-ID: On Tue, 16 Sep 2025 12:08:25 GMT, Daniel Lund?n wrote: >> src/hotspot/share/opto/postaloc.cpp line 685: >> >>> 683: assert(is_adjacent || is_reg, >>> 684: "only registers can be non-adjacent"); >>> 685: if (!value[ureg_lo] && is_adjacent) { // Nearly always adjacent >> >> `value[ureg_lo]` returns a `Node*`, right? Then that would make this an implicit null check, not allowed by style guide ;) > > Here I'll argue not touching this in this PR (I did not introduce this), as this is the style of the surrounding code. Should be addressed in a follow-up PR though. I'd say this is not just formatting/naming, but code style. 
We usually fix these cases when we touch the code ;) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2352685909 From epeter at openjdk.org Tue Sep 16 14:24:52 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 16 Sep 2025 14:24:52 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v27] In-Reply-To: References: Message-ID: On Tue, 16 Sep 2025 12:05:09 GMT, Daniel Lundén wrote: >> Or is there a particular logic behind it? > > No other logic than keeping the same style as the surrounding old code. I can update it to use up-to-date style, but then we increase the scope of this PR. Is a follow-up PR OK? Follow up is fine :) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2352682181 From epeter at openjdk.org Tue Sep 16 14:30:49 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 16 Sep 2025 14:30:49 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v27] In-Reply-To: References: Message-ID: On Tue, 16 Sep 2025 12:26:53 GMT, Daniel Lundén wrote: >> src/hotspot/share/opto/regmask.hpp line 458: >> >>> 456: } >>> 457: >>> 458: RegMask(const RegMask& rm) : RegMask(rm, nullptr) {} >> >> Do you want to add `explicit` here? >> This is a shallow copy, right? Maybe add a comment for that. > > The ADLC-generated code relies on using the constructor implicitly, so I prefer not touching it in this changeset at least. All the copies are deep, clarified now. Ok, I understand. Can you show me an example, so I can understand a little better? 
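As general C++ background for the `explicit` question (not taken from the thread, and not a claim about what the ADLC-generated code actually looks like), the sketch below shows two ordinary uses of a copy constructor that stop compiling once it is marked `explicit`: copy-initialization with `=` and passing an object by value. `ToyArena`, `ToyRegMask`, `has_bit`, and `copy_of` are hypothetical names used only for illustration.

```cpp
#include <cassert>

struct ToyArena {};  // hypothetical placeholder for an allocation arena

class ToyRegMask {
 public:
  ToyRegMask() : _bits(0), _arena(nullptr) {}

  // Two-argument copy, mirroring the shape quoted in the review comment.
  ToyRegMask(const ToyRegMask& other, ToyArena* arena)
      : _bits(other._bits), _arena(arena) {}

  // Non-explicit copy constructor that delegates to the one above.
  // Marking this `explicit` would reject the two uses shown below.
  ToyRegMask(const ToyRegMask& other) : ToyRegMask(other, nullptr) {}

  void set(unsigned i) { _bits |= (1ull << i); }
  bool test(unsigned i) const { return (_bits & (1ull << i)) != 0; }

 private:
  unsigned long long _bits;
  ToyArena* _arena;
};

// Pass-by-value: the argument is copy-initialized from the caller's object,
// so an explicit copy constructor would not be considered here.
inline bool has_bit(ToyRegMask mask, unsigned i) { return mask.test(i); }

inline ToyRegMask copy_of(const ToyRegMask& src) {
  // Copy-initialization with `=`: also ill-formed if the copy constructor
  // were explicit. Direct-initialization `ToyRegMask copy(src);` would
  // still compile in that case.
  ToyRegMask copy = src;
  return copy;
}
```

Generated code that writes `ToyRegMask m = other;` or passes masks by value would therefore need to be changed before `explicit` could be added, which is one plausible reading of the reply above.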
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2352706428 From epeter at openjdk.org Tue Sep 16 14:39:20 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 16 Sep 2025 14:39:20 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v28] In-Reply-To: References: Message-ID: On Tue, 16 Sep 2025 12:39:13 GMT, Daniel Lund?n wrote: >> If a method has a large number of parameters, we currently bail out from C2 compilation. >> >> ### Changeset >> >> Allowing C2 compilation of methods with a large number of parameters requires fundamental changes to the register mask data structure, used in many places in C2. In particular, register masks currently have a statically determined size and cannot represent arbitrary numbers of stack slots. This is needed if we want to compile methods with arbitrary numbers of parameters. Register mask operations are present in performance-sensitive parts of C2, which further complicates changes. >> >> Changes: >> - Add functionality to dynamically grow/extend register masks. I experimented with a number of design choices to achieve this. To keep the common case (normal number of method parameters) quick and also to avoid more intrusive changes to the current `RegMask` interface, I decided to leave the "base" statically allocated memory for masks unchanged and only use dynamically allocated memory in the rare cases where it is needed. >> - Generalize the "chunk"-logic from `PhaseChaitin::Select()` to allow arbitrary-sized chunks, and also move most of the logic into register mask methods to separate concerns and to make the `PhaseChaitin::Select()` code more readable. >> - Remove all `can_represent` checks and bailouts. >> - Performance tuning. A particularly important change is the early-exit optimization in `RegMask::overlap`, used in the performance-sensitive method `PhaseChaitin::interfere_with_live`. 
>> - Add a new test case `TestManyMethodArguments.java` and extend an old test `TestNestedSynchronize.java`. >> >> ### Testing >> >> - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/10178060450) >> - `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. >> - Standard performance benchmarking. No observed conclusive overall performance degradation/improvement. >> - Specific benchmarking of C2 compilation time. The changes increase C2 compilation time by, approximately and on average, 1% for methods that could also be compiled before this changeset (see the figure below). The reason for the degradation is further checks required in performance-sensitive code (in particular `PhaseChaitin::remove_bound_register_from_interfering_live_ranges`). I have tried optimizing in various ways, but changes I found that lead to improvement also lead to less readable code (and are, in my opinion, no... > > Daniel Lund?n has updated the pull request incrementally with one additional commit since the last revision: > > Update after comments from Emanuel You seem to have a build failure: In file included from /home/runner/work/jdk/jdk/src/hotspot/share/opto/compile.hpp:43, from /home/runner/work/jdk/jdk/src/hotspot/share/opto/type.hpp:29, from /home/runner/work/jdk/jdk/test/hotspot/gtest/opto/test_rangeinference.cpp:26: /home/runner/work/jdk/jdk/src/hotspot/share/opto/regmask.hpp: In constructor ?RegMask::RegMask(Arena*)?: /home/runner/work/jdk/jdk/src/hotspot/share/opto/regmask.hpp:441:53: error: class ?RegMask? does not have any field named ?_read_only? 441 | : _rm_word() DEBUG_ONLY(COMMA _arena(arena)), _read_only(read_only), | ^~~~~~~~~~ /home/runner/work/jdk/jdk/src/hotspot/share/opto/regmask.hpp:441:64: error: ?read_only? 
was not declared in this scope 441 | : _rm_word() DEBUG_ONLY(COMMA _arena(arena)), _read_only(read_only), | ------------- PR Comment: https://git.openjdk.org/jdk/pull/20404#issuecomment-3299064676 From epeter at openjdk.org Tue Sep 16 14:39:22 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 16 Sep 2025 14:39:22 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v23] In-Reply-To: References: Message-ID: On Thu, 11 Sep 2025 10:13:01 GMT, Daniel Lundén wrote: >>> For reference, here is now the changeset adding an IFG bailout: #26118 >> >> Since that is now integrated: do we need to make any changes to the patch here? I thought the goal was to use the bailouts instead of increasing `MaxNodeLimit`. >> >> Because looking at the discussions above: we were worried that there could be compile-time regressions - even if quite rare. But they were in the range of 40s which is quite scary. Are these now gone? > > @eme64 I have now addressed your comments (the renaming is in https://github.com/openjdk/jdk/pull/27215, as requested). Please have a look and let me know if I've missed something. @dlunde Thanks for the swift updates! 
I have in the meantime added some more comments, just making sure you don't miss them :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/20404#issuecomment-3299067317 From mhaessig at openjdk.org Tue Sep 16 15:38:12 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Tue, 16 Sep 2025 15:38:12 GMT Subject: RFR: 8366875: CompileTaskTimeout should be reset for each iteration of RepeatCompilation [v3] In-Reply-To: <4TbOkAMu-KU_tgQPg1sK0L8oto_0nD4mQo7yc0hJPm4=.8d87b900-a614-4c13-a4c6-6fe11e206482@github.com> References: <4TbOkAMu-KU_tgQPg1sK0L8oto_0nD4mQo7yc0hJPm4=.8d87b900-a614-4c13-a4c6-6fe11e206482@github.com> Message-ID: <6ijTgwXUpwm8C_U7oOsN7RScv-caCal0U67UXFZ6VmY=.5550cf2f-2c57-4fc0-a2cd-3df6627485a2@github.com> > When running a debug JVM on Linux with a compile task timeout and repeated compilation, the execution will time out almost always because the timeout does not reset for repetitions of a compilation. The core of the compile task timeout is to limit the amount of time a single compilation can take. Thus, this PR resets the `CompileTaskTimeout` for every compilation when running with `-XX:RepeatCompilation=` for n > 1. > > This PR is stacked on top of #27094. > > Testing: > - [x] Github Actions (failures are unrelated) > - [x] tier1, tier2, tier3 plus some additional internal testing Manuel H?ssig has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains four additional commits since the last revision: - Merge branch 'master' into JDK-8366875-repeat-comp-to - Reset timeout on repeated compilations - Add regression test - Use timeuot factor ------------- Changes: - all: https://git.openjdk.org/jdk/pull/27120/files - new: https://git.openjdk.org/jdk/pull/27120/files/cfe842c7..f9a170b6 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=27120&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27120&range=01-02 Stats: 31864 lines in 1079 files changed: 16371 ins; 9354 del; 6139 mod Patch: https://git.openjdk.org/jdk/pull/27120.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27120/head:pull/27120 PR: https://git.openjdk.org/jdk/pull/27120 From mhaessig at openjdk.org Tue Sep 16 15:38:14 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Tue, 16 Sep 2025 15:38:14 GMT Subject: RFR: 8366875: CompileTaskTimeout should be reset for each iteration of RepeatCompilation [v2] In-Reply-To: References: <4TbOkAMu-KU_tgQPg1sK0L8oto_0nD4mQo7yc0hJPm4=.8d87b900-a614-4c13-a4c6-6fe11e206482@github.com> Message-ID: On Tue, 16 Sep 2025 13:57:41 GMT, Manuel H?ssig wrote: >> When running a debug JVM on Linux with a compile task timeout and repeated compilation, the execution will time out almost always because the timeout does not reset for repetitions of a compilation. The core of the compile task timeout is to limit the amount of time a single compilation can take. Thus, this PR resets the `CompileTaskTimeout` for every compilation when running with `-XX:RepeatCompilation=` for n > 1. >> >> This PR is stacked on top of #27094. >> >> Testing: >> - [x] Github Actions (failures are unrelated) >> - [x] tier1, tier2, tier3 plus some additional internal testing > > Manuel H?ssig has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
Merged master and fixed conflicts. I am currently rerunning testing. ------------- PR Comment: https://git.openjdk.org/jdk/pull/27120#issuecomment-3299321385 From mhaessig at openjdk.org Tue Sep 16 15:42:39 2025 From: mhaessig at openjdk.org (Manuel Hässig) Date: Tue, 16 Sep 2025 15:42:39 GMT Subject: RFR: 8366878: Improve flags of compiler/loopopts/superword/TestAlignVectorFuzzer.java [v2] In-Reply-To: References: Message-ID: > The test definitions of `TestAlignVectorFuzzer.java` all contain `printcompilation` directives. These are redundant and slow down a test that already often times out. @eme64 also suggested adding a `compileonly` directive to one of the four tests. > > Testing: > - [ ] Github Actions > - [ ] tier1 and stress testing (features `TestAlignVectorFuzzer.java`) Manuel Hässig has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains three additional commits since the last revision: - Merge branch 'master' into JDK-8366878-align-fuzz-flags - Make compileonly a separate run - Fix flags ------------- Changes: - all: https://git.openjdk.org/jdk/pull/27122/files - new: https://git.openjdk.org/jdk/pull/27122/files/d2db1697..3aa62f9e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=27122&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27122&range=00-01 Stats: 31883 lines in 1081 files changed: 16389 ins; 9354 del; 6140 mod Patch: https://git.openjdk.org/jdk/pull/27122.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27122/head:pull/27122 PR: https://git.openjdk.org/jdk/pull/27122 From mhaessig at openjdk.org Tue Sep 16 15:42:40 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Tue, 16 Sep 2025 15:42:40 GMT Subject: RFR: 8366878: Improve flags of compiler/loopopts/superword/TestAlignVectorFuzzer.java In-Reply-To: References: Message-ID: <9G7_44gEH9_LlCOwXzcSvKgGaF_TibDsAb_anH9ot34=.caedf4a0-560a-4b60-ad56-29b3c9e35bd0@github.com> On Fri, 5 Sep 2025 16:46:09 GMT, Manuel H?ssig wrote: > The test definitions of `TestAlignVectorFuzzer.java` all contain `printcompilation` directives. These are redundant and slow down the test execution of a test that already often times out. @eme64 also suggested adding a `compileonly` directive to one of the four tests. > > Testing: > - [ ] Github Actions > - [ ] tier1 and stress testing (features `TestAlignVectorFuzzer.java`) Merged master and addressed @eme64's comment. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/27122#issuecomment-3299328398 From mhaessig at openjdk.org Tue Sep 16 15:42:44 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Tue, 16 Sep 2025 15:42:44 GMT Subject: RFR: 8366878: Improve flags of compiler/loopopts/superword/TestAlignVectorFuzzer.java [v2] In-Reply-To: <5PWmoHhlhYHDD7WBje51yGzGHr1Dq3QCDRNApA64MmY=.ed2e0b11-e144-4e24-97dd-7a7ccdd208c0@github.com> References: <5PWmoHhlhYHDD7WBje51yGzGHr1Dq3QCDRNApA64MmY=.ed2e0b11-e144-4e24-97dd-7a7ccdd208c0@github.com> Message-ID: On Mon, 8 Sep 2025 05:53:32 GMT, Emanuel Peter wrote: >> Manuel H?ssig has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: >> >> - Merge branch 'master' into JDK-8366878-align-fuzz-flags >> - Make compileonly a separate run >> - Fix flags > > test/hotspot/jtreg/compiler/loopopts/superword/TestAlignVectorFuzzer.java line 35: > >> 33: * -XX:CompileCommand=compileonly,compiler.loopopts.superword.TestAlignVectorFuzzer::* >> 34: * compiler.loopopts.superword.TestAlignVectorFuzzer >> 35: */ > > I think it would be good if we also had the same run but without the compileonly. That's what I meant by duplication ;) I added a separate run in a new commit. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27122#discussion_r2352913724 From dfenacci at openjdk.org Tue Sep 16 16:29:59 2025 From: dfenacci at openjdk.org (Damon Fenacci) Date: Tue, 16 Sep 2025 16:29:59 GMT Subject: RFR: 8367278: Test compiler/startup/StartupOutput.java timed out after completion on Windows Message-ID: ## Problem After [JDK-8260555](https://bugs.openjdk.org/browse/JDK-8260555) changed the default TIMEOUT_FACTOR from 4 to 1, the test compiler/startup/StartupOutput.java can occasionally slightly exceed the 2-minute timeout on Windows. 
## Change Rather than increasing the timeout, this change reduces the number of VM runs with randomly generated near-minimum code cache sizes from 200 to 50. This should still provide sufficient coverage while keeping execution well within the timeout. ## Testing: Tiers 1-3+ ------------- Commit messages: - JDK-8367278: reduce loop to 50 cycles - JDK-8367278: Test compiler/startup/StartupOutput.java timed out after completion on Windows Changes: https://git.openjdk.org/jdk/pull/27254/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27254&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8367278 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/27254.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27254/head:pull/27254 PR: https://git.openjdk.org/jdk/pull/27254 From sparasa at openjdk.org Tue Sep 16 17:42:18 2025 From: sparasa at openjdk.org (Srinivas Vamsi Parasa) Date: Tue, 16 Sep 2025 17:42:18 GMT Subject: RFR: 8354348: Enable Extended EVEX to REX2/REX demotion for commutative operations with same dst and src2 [v5] In-Reply-To: References: Message-ID: <5QOi2GBheGqa8c_Hc9yfuq0DTm8UsD1QshPkVdgdFDc=.079d9af8-0ccd-4540-ae10-3ae359a9a6d9@github.com> On Tue, 16 Sep 2025 05:35:44 GMT, Emanuel Peter wrote: > Not reviewed in detail, but looks reasonable. Tests pass :) Thank you Emanuel! 
:) ------------- PR Comment: https://git.openjdk.org/jdk/pull/26997#issuecomment-3299735699 From sparasa at openjdk.org Tue Sep 16 18:16:52 2025 From: sparasa at openjdk.org (Srinivas Vamsi Parasa) Date: Tue, 16 Sep 2025 18:16:52 GMT Subject: Integrated: 8354348: Enable Extended EVEX to REX2/REX demotion for commutative operations with same dst and src2 In-Reply-To: References: Message-ID: <5YUTgEkfPuRs8PZq3uH2XbK9KxOZIQLzuDZ4Lz9VYSg=.1f1e53a5-a1d7-4f56-8a9d-6490a07022ee@github.com> On Thu, 28 Aug 2025 21:09:03 GMT, Srinivas Vamsi Parasa wrote: > This change extends Extended EVEX (EEVEX) to REX2/REX demotion for Intel APX NDD instructions to handle commutative operations when the destination register and the second source register (src2) are the same. > > Currently, EEVEX to REX2/REX demotion is only enabled when the first source (src1) and the destination are the same. This enhancement allows additional cases of valid demotion for commutative instructions (add, imul, and, or, xor). > > For example: > `eaddl r18, r25, r18` can be encoded as `addl r18, r25` using APX REX2 encoding > `eaddl r2, r7, r2` can be encoded as `addl r2, r7` using non-APX legacy encoding This pull request has now been integrated. 
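The demotion rule described above can be sketched in plain Java. This is only an illustrative model, not the HotSpot assembler code: the class name, the method names, and the assumption that r16 to r31 are the registers that need REX2 are all invented for this example, and it ignores any further conditions the real assembler may check.

```java
// Hypothetical model of NDD demotion; not the actual HotSpot assembler logic.
final class DemotionSketch {
    enum Encoding { EEVEX, REX2, REX_OR_LEGACY }

    // Assumption for this sketch: r16..r31 are the APX extended registers.
    static boolean isExtended(int reg) {
        return reg >= 16;
    }

    // Pick an encoding for the three-operand form "op dst, src1, src2".
    static Encoding choose(int dst, int src1, int src2, boolean commutative) {
        // Existing case: dst == src1. New case from the PR: commutative op with dst == src2.
        boolean demotable = dst == src1 || (commutative && dst == src2);
        if (!demotable) {
            return Encoding.EEVEX; // keep the new-data-destination form
        }
        // The two-operand form uses dst plus whichever source is not folded into dst.
        int other = (dst == src1) ? src2 : src1;
        return (isExtended(dst) || isExtended(other)) ? Encoding.REX2 : Encoding.REX_OR_LEGACY;
    }

    public static void main(String[] args) {
        System.out.println(choose(18, 25, 18, true)); // eaddl r18, r25, r18
        System.out.println(choose(2, 7, 2, true));    // eaddl r2, r7, r2
        System.out.println(choose(2, 7, 2, false));   // same operands, non-commutative op
    }
}
```

The commutativity check is what allows the operands to be swapped so that the destination lines up with the first source of the two-operand form.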
Changeset: c41add8d Author: Srinivas Vamsi Parasa URL: https://git.openjdk.org/jdk/commit/c41add8d3e24be5f469f18cfbf0f476f2baf63a6 Stats: 3085 lines in 4 files changed: 518 ins; 169 del; 2398 mod 8354348: Enable Extended EVEX to REX2/REX demotion for commutative operations with same dst and src2 Reviewed-by: jbhateja, epeter, sviswanathan ------------- PR: https://git.openjdk.org/jdk/pull/26997 From kvn at openjdk.org Tue Sep 16 19:32:45 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 16 Sep 2025 19:32:45 GMT Subject: RFR: 8367313: CTW: Execute in AWT headless mode [v2] In-Reply-To: References: Message-ID: On Mon, 15 Sep 2025 14:08:57 GMT, Aleksey Shipilev wrote: >> I have been doing CTW parallelization improvements, and noticed that some of the AWT clinits run and initialize graphics stack. This is awkward for a few reasons: >> >> 1. We might be running on headless environment and these clinits could fail, shrinking the CTW testing scope. >> 2. There are dependencies in graphics stack initialization that break -- in one case in my parallelization tests, I have seen the VM crash due to uninitialized AWT lock, because randomized CTW runner managed to execute clinits in unusual order. Running in headless mode avoids dealing with that path altogether. >> >> I think we should be running CTW tests in AWT headless mode to begin with. >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: > > - Merge branch 'master' into JDK-8367313-ctw-headless-mode > - Fix Good. ------------- Marked as reviewed by kvn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/27187#pullrequestreview-3231402860 From vlivanov at openjdk.org Tue Sep 16 20:09:18 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Tue, 16 Sep 2025 20:09:18 GMT Subject: RFR: 8367333: C2: Vector math operation intrinsification failure [v2] In-Reply-To: References: Message-ID: > As part of [JDK-8353786](https://bugs.openjdk.org/browse/JDK-8353786), C2 support for operations backed by the vector math library was completely removed. On the JDK side, special dispatching logic was added to avoid intrinsic calls in `jdk.internal.vm.vector.VectorSupport`. But it's still possible to observe such paradoxical situations (intrinsic calls with obsolete operation IDs) when processing effectively dead code. > > Consider `FloatVector::lanewiseTemplate`: > > FloatVector lanewiseTemplate(VectorOperators.Unary op) { > if (opKind(op, VO_SPECIAL)) { > ... > else if (opKind(op, VO_MATHLIB)) { > return unaryMathOp(op); > } > } > int opc = opCode(op); > return VectorSupport.unaryOp(opc, ...); > } > > > At runtime, `unaryMathOp` is unconditionally invoked, but during compilation it's possible to end up with an intrinsification attempt of `VectorSupport.unaryOp()` before `opKind(op, VO_SPECIAL)` is inlined. > > It can be reliably reproduced with the `-XX:+StressIncrementalInlining` flag. > > The fix is to fail-fast intrinsification rather than crashing the VM. 
> > Testing: tier1 - tier4 Vladimir Ivanov has updated the pull request incrementally with one additional commit since the last revision: review feedback ------------- Changes: - all: https://git.openjdk.org/jdk/pull/27263/files - new: https://git.openjdk.org/jdk/pull/27263/files/66892f1d..f63e76ce Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=27263&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27263&range=00-01 Stats: 3 lines in 1 file changed: 0 ins; 1 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/27263.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27263/head:pull/27263 PR: https://git.openjdk.org/jdk/pull/27263 From vlivanov at openjdk.org Tue Sep 16 20:09:18 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Tue, 16 Sep 2025 20:09:18 GMT Subject: RFR: 8367333: C2: Vector math operation intrinsification failure [v2] In-Reply-To: References: Message-ID: On Tue, 16 Sep 2025 20:05:48 GMT, Vladimir Ivanov wrote: >> As part of [JDK-8353786](https://bugs.openjdk.org/browse/JDK-8353786), C2 support for operations backed by the vector math library was completely removed. On the JDK side, special dispatching logic was added to avoid intrinsic calls in `jdk.internal.vm.vector.VectorSupport`. But it's still possible to observe such paradoxical situations (intrinsic calls with obsolete operation IDs) when processing effectively dead code. >> >> Consider `FloatVector::lanewiseTemplate`: >> >> FloatVector lanewiseTemplate(VectorOperators.Unary op) { >> if (opKind(op, VO_SPECIAL)) { >> ... >> else if (opKind(op, VO_MATHLIB)) { >> return unaryMathOp(op); >> } >> } >> int opc = opCode(op); >> return VectorSupport.unaryOp(opc, ...); >> } >> >> >> At runtime, `unaryMathOp` is unconditionally invoked, but during compilation it's possible to end up with an intrinsification attempt of `VectorSupport.unaryOp()` before `opKind(op, VO_SPECIAL)` is inlined. >> >> It can be reliably reproduced with the `-XX:+StressIncrementalInlining` flag. 
>> >> The fix is to fail-fast intrinsification rather than crashing the VM. >> >> Testing: tier1 - tier4 > > Vladimir Ivanov has updated the pull request incrementally with one additional commit since the last revision: > > review feedback Thanks for the reviews. ------------- PR Review: https://git.openjdk.org/jdk/pull/27263#pullrequestreview-3231511953 From vlivanov at openjdk.org Tue Sep 16 20:09:22 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Tue, 16 Sep 2025 20:09:22 GMT Subject: RFR: 8367333: C2: Vector math operation intrinsification failure [v2] In-Reply-To: References: Message-ID: <3Cy6jhWxbaQeWwo22L9nxPnipY1-vHsGZEtk8IZUiq8=.bfefdef7-0137-422b-a7b0-e4fae2a5b282@github.com> On Tue, 16 Sep 2025 06:44:49 GMT, Emanuel Peter wrote: >> Vladimir Ivanov has updated the pull request incrementally with one additional commit since the last revision: >> >> review feedback > > test/hotspot/jtreg/compiler/vectorapi/TestVectorMathLib.java line 33: > >> 31: * @test >> 32: * @bug 8367333 >> 33: * @requires vm.compiler2.enabled > > Do you need this `@requires`? It might be nice to be able to run this with other compilers too. It's intended as C2-specific regression test and it relies on C2-specific VM flags. Vector API unit tests (under test/jdk/jdk/incubator/vector/) exercise the very same functionality, but don't specify flags required to trigger the bug. > test/hotspot/jtreg/compiler/vectorapi/TestVectorMathLib.java line 40: > >> 38: * -XX:CompileCommand=compileonly,compiler.vectorapi.TestVectorMathLib::test* >> 39: * -XX:+StressIncrementalInlining >> 40: * compiler.vectorapi.TestVectorMathLib > > Like @jatin-bhateja mentioned: alignment is off. > I'd also like to see a run without flags, maybe with only `-XX:CompileCommand=compileonly,compiler.vectorapi.TestVectorMathLib::test*` Again, IMO it doesn't make sense to run the regression test without stressing incremental inlining. Otherwise, it duplicates existing tests. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27263#discussion_r2353525672 PR Review Comment: https://git.openjdk.org/jdk/pull/27263#discussion_r2353527146 From vlivanov at openjdk.org Tue Sep 16 20:09:24 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Tue, 16 Sep 2025 20:09:24 GMT Subject: RFR: 8367333: C2: Vector math operation intrinsification failure [v2] In-Reply-To: <5xYVPgSuC3a9kqp_hRs3vgtBDoJzlmf9v6wgMa9XFJ4=.c8abf0f6-b563-4b3f-92c3-d902b6e59950@github.com> References: <5xYVPgSuC3a9kqp_hRs3vgtBDoJzlmf9v6wgMa9XFJ4=.c8abf0f6-b563-4b3f-92c3-d902b6e59950@github.com> Message-ID: On Mon, 15 Sep 2025 15:24:58 GMT, Jatin Bhateja wrote: >> Vladimir Ivanov has updated the pull request incrementally with one additional commit since the last revision: >> >> review feedback > > test/hotspot/jtreg/compiler/vectorapi/TestVectorMathLib.java line 40: > >> 38: * -XX:CompileCommand=compileonly,compiler.vectorapi.TestVectorMathLib::test* >> 39: * -XX:+StressIncrementalInlining >> 40: * compiler.vectorapi.TestVectorMathLib > > Suggestion: > > * -XX:+StressIncrementalInlining compiler.vectorapi.TestVectorMathLib Ok, fixed. I prefer to keep test class name on a separate line. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27263#discussion_r2353534340 From jcking at openjdk.org Tue Sep 16 21:40:04 2025 From: jcking at openjdk.org (Justin King) Date: Tue, 16 Sep 2025 21:40:04 GMT Subject: RFR: 8367789: AArch64 missing acquire in JNI_FastGetField::generate_fast_get_int_field0 Message-ID: Use a load-acquire to match the store-release used by C++ to update `safepoint_counter` during arming. 
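For readers less familiar with the pairing, the general release/acquire idiom can be sketched in Java with `AtomicInteger.setRelease` and `AtomicInteger.getAcquire`. This is a generic message-passing illustration, not the AArch64 stub touched by the patch; the class, the field names, and the value 42 are made up for the example.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Generic release/acquire pairing; not the JNI fast-path stub from the patch.
final class AcquireReleaseSketch {
    static int payload;                                     // plain, non-volatile data
    static final AtomicInteger counter = new AtomicInteger(0);

    // Writer: update the data first, then publish with a store-release.
    static void publish(int value) {
        payload = value;
        counter.setRelease(1);
    }

    // Reader: once the load-acquire observes the release, the earlier write is visible too.
    static int awaitPayload() {
        while (counter.getAcquire() == 0) {
            Thread.onSpinWait();
        }
        return payload;
    }

    // Runs one writer thread against the calling thread acting as reader.
    static int runOnce() {
        payload = 0;
        counter.set(0);
        Thread writer = new Thread(() -> publish(42));
        writer.start();
        int seen = awaitPayload();
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(runOnce());
    }
}
```

If the reader used a plain load of the counter instead, nothing would order its later read of `payload` after the counter read, which is the kind of gap a load-acquire closes.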
------------- Commit messages: - JDK-8367789: Use load-acquire instead of load Changes: https://git.openjdk.org/jdk/pull/27325/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27325&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8367789 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/27325.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27325/head:pull/27325 PR: https://git.openjdk.org/jdk/pull/27325 From duke at openjdk.org Tue Sep 16 21:54:04 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Tue, 16 Sep 2025 21:54:04 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v46] In-Reply-To: References: Message-ID: <1mM9usJWy-ZWYMEm1qxiHfxbO1jn6zpBS_t16Xr9i64=.5f4a3243-87e4-439d-b315-fcc4be60fcca@github.com> On Sat, 30 Aug 2025 00:32:02 GMT, Vladimir Kozlov wrote: >> Chad Rakoczy has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix NMethodRelocationTest.java logging race > > It failed on linux-x64 and linux-aarch64. > I tried locally on linux-x64 but it passed. @vnkozlov The bug you discovered has been fixed. Can you rerun your testing to confirm on your end? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23573#issuecomment-3300462043 From manc at openjdk.org Tue Sep 16 21:59:10 2025 From: manc at openjdk.org (Man Cao) Date: Tue, 16 Sep 2025 21:59:10 GMT Subject: RFR: 8367613: Test compiler/runtime/TestDontCompileHugeMethods.java failed [v2] In-Reply-To: References: Message-ID: <5eWiPUhybQOdBZAfm8LnEGLQ8ZwXHcqatCQEf8PVlgo=.ffcd2f7e-f734-49de-a2e4-1099bfb544f5@github.com> > Hi, > > Could anyone approve this change that exclude this test when running with `-Xcomp`? This avoids the test failure reported in [JDK-8367613](https://bugs.openjdk.org/browse/JDK-8367613). > > For reasons I don't yet understand, the `HugeSwitch::shortMethod` method is not compiled under `-Xcomp -XX:TieredStopAtLevel=1`. 
The method gets compiled with either `-Xcomp` or `-XX:TieredStopAtLevel=1`, but not both. I appreciate if anyone could provide insights on possible reasons. Man Cao has updated the pull request incrementally with one additional commit since the last revision: Switch to disable inlining for shortMethod ------------- Changes: - all: https://git.openjdk.org/jdk/pull/27306/files - new: https://git.openjdk.org/jdk/pull/27306/files/f460dc4d..93540e05 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=27306&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27306&range=00-01 Stats: 6 lines in 1 file changed: 4 ins; 1 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/27306.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27306/head:pull/27306 PR: https://git.openjdk.org/jdk/pull/27306 From manc at openjdk.org Tue Sep 16 22:09:24 2025 From: manc at openjdk.org (Man Cao) Date: Tue, 16 Sep 2025 22:09:24 GMT Subject: RFR: 8367613: Test compiler/runtime/TestDontCompileHugeMethods.java failed [v2] In-Reply-To: <5eWiPUhybQOdBZAfm8LnEGLQ8ZwXHcqatCQEf8PVlgo=.ffcd2f7e-f734-49de-a2e4-1099bfb544f5@github.com> References: <5eWiPUhybQOdBZAfm8LnEGLQ8ZwXHcqatCQEf8PVlgo=.ffcd2f7e-f734-49de-a2e4-1099bfb544f5@github.com> Message-ID: On Tue, 16 Sep 2025 21:59:10 GMT, Man Cao wrote: >> Hi, >> >> Could anyone approve this change that exclude this test when running with `-Xcomp`? This avoids the test failure reported in [JDK-8367613](https://bugs.openjdk.org/browse/JDK-8367613). >> >> For reasons I don't yet understand, the `HugeSwitch::shortMethod` method is not compiled under `-Xcomp -XX:TieredStopAtLevel=1`. The method gets compiled with either `-Xcomp` or `-XX:TieredStopAtLevel=1`, but not both. I appreciate if anyone could provide insights on possible reasons. > > Man Cao has updated the pull request incrementally with one additional commit since the last revision: > > Switch to disable inlining for shortMethod Thank you for the review and suggestions. 
@chhagedorn Thank you for the explanation. I switched to `-XX:CompileCommand=dontinline` for `shortMethod()`. It works for `-Xcomp -XX:TieredStopAtLevel=1`. The benefit of `dontinline` approach is that it allows the test run under `-Xcomp`, esp. `-XX:-TieredCompilation` with `-Xcomp`. It is also future-proof, in case C2 manages to inline `shortMethod()` into `main()` under `-Xcomp` in the future. Also added bug number. And excluding `-XX:TieredStopAtLevel=1` is no longer needed. ------------- PR Comment: https://git.openjdk.org/jdk/pull/27306#issuecomment-3300497116 From dlong at openjdk.org Wed Sep 17 01:09:39 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 17 Sep 2025 01:09:39 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v11] In-Reply-To: References: <1pShdyn-7-wwwiuY1DdMt5iiZ2qc9l_x2F-3AKqkg60=.dd260953-05cc-4b84-b6d1-7f684e74084c@github.com> Message-ID: On Fri, 12 Sep 2025 14:00:53 GMT, Emanuel Peter wrote: >> Vladimir Ivanov has updated the pull request incrementally with one additional commit since the last revision: >> >> Minor fix > > src/hotspot/share/opto/reachability.cpp line 81: > >> 79: * (c) Unfortunately, it's not straightforward to stay with safepoint-attached representation till the very end, >> 80: * because information about derived oops is attached to safepoints in a similar way. So, for now RFs are >> 81: * rematerialized at safepoints before RA (phase #3). > > I still don't understand this. What is similar to what? And why is that a problem? Why don't we put RF edges somewhere else, so they don't look like derived oops? I was thinking they could go in the monitor area, or if that causes problems, we introduce a new area. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2353965620 From dlong at openjdk.org Wed Sep 17 02:14:42 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 17 Sep 2025 02:14:42 GMT Subject: RFR: 8367706: Remove redundant register used by cmove in C1 LIR generation In-Reply-To: References: Message-ID: On Tue, 16 Sep 2025 09:35:03 GMT, lusou-zhangquan wrote: > This PR removes redundant temp register used by cmove in C1 LIRGenerator::do_LookupSwitch and LIRGenerator::do_TableSwitch. The issue [8367706](https://bugs.openjdk.org/browse/JDK-8367706) is reported by me and it's my pleasure to fix it. Reversing the order of the two source arguments seems wrong. Please explain. ------------- Changes requested by dlong (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/27307#pullrequestreview-3232251609 From xgong at openjdk.org Wed Sep 17 02:21:42 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 17 Sep 2025 02:21:42 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation [v3] In-Reply-To: References: <1XFXtkTlDshGtoxEdLVg0f2J2rtn4wz7CdUB9pb9N2g=.25e7e0b5-8468-4d91-adb9-c459bda40933@github.com> Message-ID: On Tue, 9 Sep 2025 07:27:46 GMT, Emanuel Peter wrote: >>> To me a `false` means this: If we support gather/scatter, then we do not need a vector index, we can do without it. >>> >>> Is that correct? >> >> Thanks for your review! Actually gather/scatter always needs an index input. What this function wants to decide is how the index elements are passed to the operations. >> >> It makes no assumption about whether vector gather_load/scatter_store is supported in the backend. It just checks whether the `index` input of such operations requires a vector register or an address which stores the indexes. Currently, on x86, it passes an array address for subword types (the indexes are then loaded one by one in backend codegen). 
However, on AArch64, we require a vector for all types instead (the indexes have already been loaded into vector registers at the IR level). >> >>> The current platform does not support vector gather-load or scatter-store at all. >> >> I'm sorry that I didn't explain @fg1417 's second statement clearly. Whether the current platform supports vector gather-load/scatter-store is still decided by `Matcher::match_rule_supported_vector()` like other operations. It returns `false` here just because arm doesn't support any vector operations. Assuming it wanted to support a vector gather/scatter, the index input must not be a vector, right? > > Thanks for all the explanations, that was very helpful! > > Can you please adjust the comment so that all the relevant information is there? > We could also make the name of the method more precise / informative? > Maybe you could write something like this: > > // true -> if gather/scatter supported: require index in vector register > // false -> if gather/scatter supported: allows both index in vector register AND array address holding indices > > Then give more information about platform specific things that you mentioned about aarch64 and x86 in the relevant files ;) Hi @eme64 , regarding the method name, is `gather_scatter_requires_index_in_vector()` fine with you? If so, I think I can change the name to it. Or please let me know if you have a better one. Thanks! 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2354066760 From duke at openjdk.org Wed Sep 17 02:47:21 2025 From: duke at openjdk.org (lusou-zhangquan) Date: Wed, 17 Sep 2025 02:47:21 GMT Subject: RFR: 8367706: Remove redundant register used by cmove in C1 LIR generation [v2] In-Reply-To: References: Message-ID: <58MAR1O9tGfnVcoCfbv17BI-IP2qC2BuYDYc3GZQ30Q=.3a60b666-cb80-405a-9a98-d46bf724f7c0@github.com> > This PR removes redundant temp register used by cmove in C1 LIRGenerator::do_LookupSwitch and LIRGenerator::do_TableSwitch. The issue [8367706](https://bugs.openjdk.org/browse/JDK-8367706) is reported by me and it's my pleasure to fix it. lusou-zhangquan has updated the pull request incrementally with one additional commit since the last revision: Fix wrong source register order ------------- Changes: - all: https://git.openjdk.org/jdk/pull/27307/files - new: https://git.openjdk.org/jdk/pull/27307/files/233e7681..aeb9cfc4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=27307&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27307&range=00-01 Stats: 4 lines in 1 file changed: 2 ins; 2 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/27307.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27307/head:pull/27307 PR: https://git.openjdk.org/jdk/pull/27307 From duke at openjdk.org Wed Sep 17 03:16:35 2025 From: duke at openjdk.org (erifan) Date: Wed, 17 Sep 2025 03:16:35 GMT Subject: RFR: 8366333: AArch64: Enhance SVE subword type implementation of vector compress [v2] In-Reply-To: References: Message-ID: On Tue, 16 Sep 2025 07:02:23 GMT, Emanuel Peter wrote: >> erifan has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains two commits: >> >> - Merge branch 'master' into JDK-8366333-compress >> - 8366333: AArch64: Enhance SVE subword type implementation of vector compress >> >> The AArch64 SVE and SVE2 architectures lack an instruction suitable for >> subword-type `compress` operations. Therefore, the current implementation >> uses the 32-bit SVE `compact` instruction to compress subword types by >> first widening the high and low parts to 32 bits, compressing them, and >> then narrowing them back to their original type. Finally, the high and >> low parts are merged using the `index + tbl` instructions. >> >> This approach is significantly slower compared to architectures with native >> support. After evaluating all available AArch64 SVE instructions and >> experimenting with various implementations (such as looping over the active >> elements, extraction, and insertion), I confirmed that the existing algorithm >> is optimal given the instruction set. However, there is still room for >> optimization in the following two aspects: >> 1. Merging with `index + tbl` is suboptimal due to the high latency of >> the `index` instruction. >> 2. For partial subword types, operations to the highest half are unnecessary >> because those bits are invalid. >> >> This pull request introduces the following changes: >> 1. Replaces `index + tbl` with the `whilelt + splice` instructions, which >> offer lower latency and higher throughput. >> 2. Eliminates unnecessary compress operations for partial subword type cases. >> 3. For `sve_compress_byte`, one less temporary register is used to alleviate >> potential register pressure. >> >> Benchmark results demonstrate that these changes significantly improve performance. 
>> >> Benchmarks on Nvidia Grace machine with 128-bit SVE: >> ``` >> Benchmark Unit Before Error After Error Uplift >> Byte128Vector.compress ops/ms 4846.97 26.23 6638.56 31.60 1.36 >> Byte64Vector.compress ops/ms 2447.69 12.95 7167.68 34.49 2.92 >> Short128Vector.compress ops/ms 7174.88 40.94 8398.45 9.48 1.17 >> Short64Vector.compress ops/ms 3618.72 3.04 8618.22 10.91 2.38 >> ``` >> >> This PR was tested on 128-bit, 256-bit, and 512-bit SVE environments, >> and all tests passed. > > test/hotspot/jtreg/compiler/vectorapi/VectorCompressTest.java line 36: > >> 34: * @key randomness >> 35: * @library /test/lib / >> 36: * @summary AArch64: Enhance SVE subword type implementation of vector compress > > I would change the summary to something a bit more generic, since the test is not only good for aarch64 / SVE. > Suggestion: > > * @summary IR test for VectorAPI compress It seems that the summary and the PR title are usually consistent. Is there any convention or rule for this? > test/hotspot/jtreg/compiler/vectorapi/VectorCompressTest.java line 228: > >> 226: .start(); >> 227: } >> 228: } > > Question: is there already another test that checks `compress`? Yes, just like `expand`, it's here https://github.com/openjdk/jdk/blob/986ecff5f9b16f1b41ff15ad94774d65f3a4631d/test/jdk/jdk/incubator/vector/Byte128VectorTests.java#L5357 This test file is mainly for IR test. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27188#discussion_r2354169473 PR Review Comment: https://git.openjdk.org/jdk/pull/27188#discussion_r2354167428 From dlong at openjdk.org Wed Sep 17 03:33:34 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 17 Sep 2025 03:33:34 GMT Subject: RFR: 8367706: Remove redundant register used by cmove in C1 LIR generation [v2] In-Reply-To: <58MAR1O9tGfnVcoCfbv17BI-IP2qC2BuYDYc3GZQ30Q=.3a60b666-cb80-405a-9a98-d46bf724f7c0@github.com> References: <58MAR1O9tGfnVcoCfbv17BI-IP2qC2BuYDYc3GZQ30Q=.3a60b666-cb80-405a-9a98-d46bf724f7c0@github.com> Message-ID: <6ZCa4q2sQbr59eHejCBQgdek27IHOuPQkdqln0OiFW8=.4251ab15-a8fa-4c49-8f03-8209be1a787d@github.com> On Wed, 17 Sep 2025 02:47:21 GMT, lusou-zhangquan wrote: >> This PR removes redundant temp register used by cmove in C1 LIRGenerator::do_LookupSwitch and LIRGenerator::do_TableSwitch. The issue [8367706](https://bugs.openjdk.org/browse/JDK-8367706) is reported by me and it's my pleasure to fix it. > > lusou-zhangquan has updated the pull request incrementally with one additional commit since the last revision: > > Fix wrong source register order Marked as reviewed by dlong (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/27307#pullrequestreview-3232502452 From epeter at openjdk.org Wed Sep 17 06:01:42 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 17 Sep 2025 06:01:42 GMT Subject: RFR: 8366878: Improve flags of compiler/loopopts/superword/TestAlignVectorFuzzer.java [v2] In-Reply-To: References: Message-ID: On Tue, 16 Sep 2025 15:42:39 GMT, Manuel Hässig wrote: >> The test definitions of `TestAlignVectorFuzzer.java` all contain `printcompilation` directives. These are redundant and slow down the test execution of a test that already often times out. @eme64 also suggested adding a `compileonly` directive to one of the four tests. 
>> >> Testing: >> - [ ] Github Actions >> - [ ] tier1 and stress testing (features `TestAlignVectorFuzzer.java`) > > Manuel Hässig has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Merge branch 'master' into JDK-8366878-align-fuzz-flags > - Make compileonly a separate run > - Fix flags Looks good :) ------------- Marked as reviewed by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/27122#pullrequestreview-3232783938 From jwaters at openjdk.org Wed Sep 17 06:02:38 2025 From: jwaters at openjdk.org (Julian Waters) Date: Wed, 17 Sep 2025 06:02:38 GMT Subject: RFR: 8367706: Remove redundant register used by cmove in C1 LIR generation [v2] In-Reply-To: <58MAR1O9tGfnVcoCfbv17BI-IP2qC2BuYDYc3GZQ30Q=.3a60b666-cb80-405a-9a98-d46bf724f7c0@github.com> References: <58MAR1O9tGfnVcoCfbv17BI-IP2qC2BuYDYc3GZQ30Q=.3a60b666-cb80-405a-9a98-d46bf724f7c0@github.com> Message-ID: On Wed, 17 Sep 2025 02:47:21 GMT, lusou-zhangquan wrote: >> This PR removes redundant temp register used by cmove in C1 LIRGenerator::do_LookupSwitch and LIRGenerator::do_TableSwitch. The issue [8367706](https://bugs.openjdk.org/browse/JDK-8367706) is reported by me and it's my pleasure to fix it. 
> > lusou-zhangquan has updated the pull request incrementally with one additional commit since the last revision: > > Fix wrong source register order This appears to be causing all x64 JDKs to behave wrongly according to Actions ------------- PR Comment: https://git.openjdk.org/jdk/pull/27307#issuecomment-3301423528 From epeter at openjdk.org Wed Sep 17 06:07:45 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 17 Sep 2025 06:07:45 GMT Subject: RFR: 8367333: C2: Vector math operation intrinsification failure [v2] In-Reply-To: <3Cy6jhWxbaQeWwo22L9nxPnipY1-vHsGZEtk8IZUiq8=.bfefdef7-0137-422b-a7b0-e4fae2a5b282@github.com> References: <3Cy6jhWxbaQeWwo22L9nxPnipY1-vHsGZEtk8IZUiq8=.bfefdef7-0137-422b-a7b0-e4fae2a5b282@github.com> Message-ID: On Tue, 16 Sep 2025 20:00:10 GMT, Vladimir Ivanov wrote: >> test/hotspot/jtreg/compiler/vectorapi/TestVectorMathLib.java line 33: >> >>> 31: * @test >>> 32: * @bug 8367333 >>> 33: * @requires vm.compiler2.enabled >> >> Do you need this `@requires`? It might be nice to be able to run this with other compilers too. > > It's intended as C2-specific regression test and it relies on C2-specific VM flags. Vector API unit tests (under test/jdk/jdk/incubator/vector/) exercise the very same functionality, but don't specify flags required to trigger the bug. I leave it up to you. You can always ignore unrecognized flags. And tests tend to diverge over time, so maybe a little duplication would not hurt. But up to you. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27263#discussion_r2354404792 From epeter at openjdk.org Wed Sep 17 06:11:42 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 17 Sep 2025 06:11:42 GMT Subject: RFR: 8367333: C2: Vector math operation intrinsification failure [v2] In-Reply-To: References: Message-ID: <8j6oTk-ZdlV7VH7N9gfyWYaBLPezIdi11dy5r9892c8=.bb57a43e-2823-4c0b-8cde-c96d8fd3df4f@github.com> On Tue, 16 Sep 2025 20:09:18 GMT, Vladimir Ivanov wrote: >> As part of [JDK-8353786](https://bugs.openjdk.org/browse/JDK-8353786), C2 support for operations backed by the vector math library was completely removed. On the JDK side, special dispatching logic was added to avoid intrinsic calls in `jdk.internal.vm.vector.VectorSupport`. But it's still possible to observe such paradoxical situations (intrinsic calls with obsolete operation IDs) when processing effectively dead code. >> >> Consider `FloatVector::lanewiseTemplate`: >> >> FloatVector lanewiseTemplate(VectorOperators.Unary op) { >> if (opKind(op, VO_SPECIAL)) { >> ... >> else if (opKind(op, VO_MATHLIB)) { >> return unaryMathOp(op); >> } >> } >> int opc = opCode(op); >> return VectorSupport.unaryOp(opc, ...); >> } >> >> >> At runtime, `unaryMathOp` is unconditionally invoked, but during compilation it's possible to end up with an intrinsification attempt of `VectorSupport.unaryOp()` before `opKind(op, VO_SPECIAL)` is inlined. >> >> It can be reliably reproduced with the `-XX:+StressIncrementalInlining` flag. >> >> The fix is to fail-fast intrinsification rather than crashing the VM. >> >> Testing: tier1 - tier4 > > Vladimir Ivanov has updated the pull request incrementally with one additional commit since the last revision: > > review feedback Looks reasonable. I leave it up to you with the `@requires`. I was wondering why not just add the extra run with special flags to the original test? 
But I don't want to hold up the PR, so up to you ;) ------------- Marked as reviewed by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/27263#pullrequestreview-3232804061 From epeter at openjdk.org Wed Sep 17 06:11:44 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 17 Sep 2025 06:11:44 GMT Subject: RFR: 8367333: C2: Vector math operation intrinsification failure [v2] In-Reply-To: References: <3Cy6jhWxbaQeWwo22L9nxPnipY1-vHsGZEtk8IZUiq8=.bfefdef7-0137-422b-a7b0-e4fae2a5b282@github.com> Message-ID: On Wed, 17 Sep 2025 06:05:11 GMT, Emanuel Peter wrote: >> It's intended as C2-specific regression test and it relies on C2-specific VM flags. Vector API unit tests (under test/jdk/jdk/incubator/vector/) exercise the very same functionality, but don't specify flags required to trigger the bug. > > I leave it up to you. You can always ignore unrecognized flags. And tests tend to diverge over time, so maybe a little duplication would not hurt. But up to you. And if it is a duplication, you should probably leave a comment linking it to the other one. Also: why not just add the extra run over at the original test? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27263#discussion_r2354410385 From epeter at openjdk.org Wed Sep 17 06:21:35 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 17 Sep 2025 06:21:35 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation [v3] In-Reply-To: References: <1XFXtkTlDshGtoxEdLVg0f2J2rtn4wz7CdUB9pb9N2g=.25e7e0b5-8468-4d91-adb9-c459bda40933@github.com> Message-ID: <8QfxQ-oTWaWPUIHkOODfYCHUyAzxhGksLwX56sKva10=.764e4cbe-4416-47c6-8c11-5d282249b017@github.com> On Wed, 17 Sep 2025 02:18:17 GMT, Xiaohong Gong wrote: >> Thanks for all the explanations, that was very helpful! >> >> Can you please adjust the comment so that all the relevant information is there? >> We could also make the name of the method more precise / informative? 
>> Maybe you could write something like this: >> >> // true -> if gather/scatter supported: require index in vector register >> // false -> if gather/scatter supported: allows both index in vector register AND array address holding indices >> >> Then give more information about platform specific things that you mentioned about aarch64 and x86 in the relevant files ;) > > Hi @eme64 , regarding to the method name, is `gather_scatter_requires_index_in_vector()` fine to you? If so, I think I can change the name to it. Or please let me know if you have a better one. Thanks! > > BTW, do you think it's better if I reverse the function of this method, such as `gather_scatter_requires_index_in_addr()`. Because gather/scatter is a vector operation. By default, accepting a vector input usually make sense. And this is true for all word and double-word types. The subword type loading which requires the indexes saved in an address on X86 is a corner case to me. If I understood right, some platforms only support addr, some only index, right? Are there any that support both? You could also have 2 methods, that say `allows` or maybe more idiomatically for hotspot C2 `implements / implemented`. Yet another alternative: `enum` with the different states. These are just some ideas. But from what you are telling me, it would really make sense to go with `gather_scatter_requires_index_in_addr`, since the addr case is indeed a corner-case. 
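The naming options discussed above (a single boolean query versus an `enum` with the different states) can be compared with a small sketch. Everything here is hypothetical and for illustration only: `Platform`, `ElemSize`, `GatherScatterIndex` and `gather_scatter_index` are not HotSpot names, and the platform behaviour is only what the thread states (vector indices by default, with subword gather on x86 taking indices from an address).

```cpp
#include <cassert>

// Hypothetical illustration only: these enums and functions are not HotSpot
// APIs. They model the "enum with the different states" idea from the review.
enum class Platform { AArch64_SVE, X86 };
enum class ElemSize { Subword, WordOrLarger };

enum class GatherScatterIndex {
  InVector,   // indices are passed in a vector register
  InAddress   // indices are read from an array address
};

// Per the thread: vector indices are the default; subword gather on x86,
// which takes its indices from an address, is the corner case.
inline GatherScatterIndex gather_scatter_index(Platform p, ElemSize e) {
  if (p == Platform::X86 && e == ElemSize::Subword) {
    return GatherScatterIndex::InAddress;
  }
  return GatherScatterIndex::InVector;
}

// The boolean query the thread settles on, expressed via the enum.
inline bool gather_scatter_requires_index_in_addr(Platform p, ElemSize e) {
  return gather_scatter_index(p, e) == GatherScatterIndex::InAddress;
}
```

Since no platform is reported to support both styles, the boolean form loses no information compared with the enum; the enum would only pay off if a third state appeared.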
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2354427486 From xgong at openjdk.org Wed Sep 17 06:25:42 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 17 Sep 2025 06:25:42 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation [v3] In-Reply-To: <8QfxQ-oTWaWPUIHkOODfYCHUyAzxhGksLwX56sKva10=.764e4cbe-4416-47c6-8c11-5d282249b017@github.com> References: <1XFXtkTlDshGtoxEdLVg0f2J2rtn4wz7CdUB9pb9N2g=.25e7e0b5-8468-4d91-adb9-c459bda40933@github.com> <8QfxQ-oTWaWPUIHkOODfYCHUyAzxhGksLwX56sKva10=.764e4cbe-4416-47c6-8c11-5d282249b017@github.com> Message-ID: On Wed, 17 Sep 2025 06:18:30 GMT, Emanuel Peter wrote: > If I understood right, some platforms only support addr, some only index, right? Are there any that support both? Right. I don't think any arch support both the style. Either a vector index or an array address is enough. Besides, C2 has the helper `Matcher::match_rule_supported_vector()` which can check whether an op is implemented yet or not. > These are just some ideas. But from what you are telling me, it would really make sense to go with gather_scatter_requires_index_in_addr, since the addr case is indeed a corner-case. Yes, I think this would be better. Thanks! 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2354434754 From epeter at openjdk.org Wed Sep 17 06:27:36 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 17 Sep 2025 06:27:36 GMT Subject: RFR: 8366333: AArch64: Enhance SVE subword type implementation of vector compress [v2] In-Reply-To: References: Message-ID: On Wed, 17 Sep 2025 03:14:02 GMT, erifan wrote: >> test/hotspot/jtreg/compiler/vectorapi/VectorCompressTest.java line 36: >> >>> 34: * @key randomness >>> 35: * @library /test/lib / >>> 36: * @summary AArch64: Enhance SVE subword type implementation of vector compress >> >> I would change the summary to something a bit more generic, since the test is not only good for aarch64 / SVE. >> Suggestion: >> >> * @summary IR test for VectorAPI compress > > It seems that the summary and the PR title are usually consistent. Is there any convention or rule for this? I think that people often just do whatever they feel like. But I think the summary should summarize the content of the test, give maybe a reason for the test. Sometimes the PR title captures the intent of the test, then I'm fine with that. But sometimes the PR title is not quite adequate, maybe too narrow like here. But it is not a big deal, just a little nit ;) >> test/hotspot/jtreg/compiler/vectorapi/VectorCompressTest.java line 228: >> >>> 226: .start(); >>> 227: } >>> 228: } >> >> Question: is there already another test that checks `compress`? > > Yes, just like `expand`, it's here https://github.com/openjdk/jdk/blob/986ecff5f9b16f1b41ff15ad94774d65f3a4631d/test/jdk/jdk/incubator/vector/Byte128VectorTests.java#L5357 > This test file is mainly for IR test. Nice, thanks! I forgot to search over there.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27188#discussion_r2354439774 PR Review Comment: https://git.openjdk.org/jdk/pull/27188#discussion_r2354435654 From epeter at openjdk.org Wed Sep 17 07:02:43 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 17 Sep 2025 07:02:43 GMT Subject: RFR: 8354242: VectorAPI: combine vector not operation with compare [v13] In-Reply-To: References: Message-ID: On Mon, 15 Sep 2025 05:43:11 GMT, erifan wrote: >> This patch optimizes the following patterns: >> For integer types: >> >> (XorV (VectorMaskCmp src1 src2 cond) (Replicate -1)) >> => (VectorMaskCmp src1 src2 ncond) >> (XorVMask (VectorMaskCmp src1 src2 cond) (MaskAll m1)) >> => (VectorMaskCmp src1 src2 ncond) >> >> cond can be eq, ne, le, ge, lt, gt, ule, uge, ult and ugt, ncond is the negative comparison of cond. >> >> For float and double types: >> >> (XorV (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (Replicate -1)) >> => (VectorMaskCast (VectorMaskCmp src1 src2 ncond)) >> (XorVMask (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (MaskAll m1)) >> => (VectorMaskCast (VectorMaskCmp src1 src2 ncond)) >> >> cond can be eq or ne. 
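The rewrite quoted above folds a NOT into the comparison by replacing `cond` with its negation `ncond`. A scalar sketch of that identity follows; it is an editorial illustration under stated assumptions, not the C2 implementation. `Cond`, `negate`, `compare` and `fold_is_valid` are hypothetical names, the unsigned conditions are left out, and the NaN explanation for the floating-point restriction is an assumption that the quoted description does not spell out.

```cpp
#include <cassert>
#include <limits>

// Hypothetical scalar model of the fold; not C2 code.
enum class Cond { eq, ne, lt, ge, gt, le };

// ncond: the comparison that is true exactly when cond is false (for integers).
inline Cond negate(Cond c) {
  switch (c) {
    case Cond::eq: return Cond::ne;
    case Cond::ne: return Cond::eq;
    case Cond::lt: return Cond::ge;
    case Cond::ge: return Cond::lt;
    case Cond::gt: return Cond::le;
    case Cond::le: return Cond::gt;
  }
  return c;  // not reached for valid enumerators
}

template <typename T>
bool compare(T a, T b, Cond c) {
  switch (c) {
    case Cond::eq: return a == b;
    case Cond::ne: return a != b;
    case Cond::lt: return a < b;
    case Cond::ge: return a >= b;
    case Cond::gt: return a > b;
    case Cond::le: return a <= b;
  }
  return false;  // not reached for valid enumerators
}

// True if NOT(a cond b) equals (a ncond b) for this input pair.
template <typename T>
bool fold_is_valid(T a, T b, Cond c) {
  return !compare(a, b, c) == compare(a, b, negate(c));
}

// With a NaN operand, ordered comparisons are all false, so !(NaN < 1.0) is
// true while (NaN >= 1.0) is false; eq/ne still negate each other.
inline bool nan_breaks_ordered_fold() {
  const double nan = std::numeric_limits<double>::quiet_NaN();
  return !fold_is_valid(nan, 1.0, Cond::lt) && fold_is_valid(nan, 1.0, Cond::eq);
}
```

For integer inputs `fold_is_valid` holds for every condition, which matches the full list of conditions allowed for integer types; for `double` it holds only for `eq` and `ne` once a NaN is involved, which is consistent with the restriction stated for float and double types.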
>> >> Benchmarks on Nvidia Grace machine with 128-bit SVE2: With option `-XX:UseSVE=2`: >> >> Benchmark Unit Before Score Error After Score Error Uplift >> testCompareEQMaskNotByte ops/s 7912127.225 2677.289518 10266136.26 8955.008548 1.29 >> testCompareEQMaskNotDouble ops/s 884737.6799 446.963779 1179760.772 448.031844 1.33 >> testCompareEQMaskNotFloat ops/s 1765045.787 682.332214 2359520.803 896.305743 1.33 >> testCompareEQMaskNotInt ops/s 1787221.411 977.743935 2353952.519 960.069976 1.31 >> testCompareEQMaskNotLong ops/s 895297.1974 673.44808 1178449.02 323.804205 1.31 >> testCompareEQMaskNotShort ops/s 3339987.002 3415.2226 4712761.965 2110.862053 1.41 >> testCompareGEMaskNotByte ops/s 7907615.16 4094.243652 10251646.9 9486.699831 1.29 >> testCompareGEMaskNotInt ops/s 1683738.958 4233.813092 2352855.205 1251.952546 1.39 >> testCompareGEMaskNotLong ops/s 854496.1561 8594.598885 1177811.493 521.1229 1.37 >> testCompareGEMaskNotShort ops/s 3341860.309 1578.975338 4714008.434 1681.10365 1.41 >> testCompareGTMaskNotByte ops/s 7910823.674 2993.367032 10245063.58 9774.75138 1.29 >> testCompareGTMaskNotInt ops/s 1673393.928 3153.099431 2353654.521 1190.848583 1.4 >> testCompareGTMaskNotLong ops/s 849405.9159 2432.858159 1177952.041 359.96413 1.38 >> testCompareGTMaskNotShort ops/s 3339509.141 3339.976585 4711442.496 2673.364893 1.41 >> testCompareLEMaskNotByte ops/s 7911340.004 3114.69191 10231626.5 27134.20035 1.29 >> testCompareLEMaskNotInt ops/s 1675812.113 1340.969885 2353255.341 1452.4522 1.4 >> testCompareLEMaskNotLong ops/s 848862.8036 6564.841731 1177763.623 539.290106 1.38 >> testCompareLEMaskNotShort ops/s 3324951.54 2380.29473 4712116.251 1544.559684 1.41 >> testCompareLTMaskNotByte ops/s 7910390.844 2630.861436 10239567.69 6487.441672 1.29 >> testCompareLTMaskNotInt ops/s 16721... 
> > erifan has updated the pull request incrementally with one additional commit since the last revision: > > Add an IR rule for vector mask cast operation @erifan Thanks for the work! All tests pass on my side, patch looks good to me too :) ------------- Marked as reviewed by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/24674#pullrequestreview-3232943779 From mhaessig at openjdk.org Wed Sep 17 07:04:57 2025 From: mhaessig at openjdk.org (Manuel Hässig) Date: Wed, 17 Sep 2025 07:04:57 GMT Subject: RFR: 8367721: Test compiler/arguments/TestCompileTaskTimeout.java crashed: SIGSEGV Message-ID: The test `TestCompileTaskTimeout.java` runs `java -Xcomp -XX:CompileTaskTimeout=1 --version` to demonstrate that the timeout works. Part of the timeout working involves it printing the method of the compile task. Inspecting the core file of the execution that failed with a `SIGSEGV` in the compile task timeout signal handler, the backtrace looks as follows: #n #n+1 CompilerThreadTimeoutLinux::signal_handler() #n+2 #n+3 timer_settime() #n+4 CompilerThreadTimeoutLinux::disarm() #n+5 CompileTaskWrapper::~CompileTaskWrapper() So, the compile task hit the timeout during destruction of the underlying `CompileTaskWrapper`. Since the timeout was disarmed only after setting the task to null in the destructor, the signal handler segfaulted when trying to access the method of the compile task to print it out. This PR addresses this issue by moving up the disarmament of the timeout to the top of the destructor. Because this issue can only be triggered with bad --- or good, depending on your view --- luck on timing, I could not devise a regression test. But this is not too big of an issue, since the CI already caught this issue.
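The ordering problem described above can be modelled in a few lines. This is a simplified, single-threaded sketch and not HotSpot code: `FakeCompileTask`, `FakeCompilerThread`, `simulate_timeout_signal` and `teardown_is_safe` are hypothetical stand-ins, and the "signal" is just a function call placed in the window between the two teardown steps.

```cpp
#include <cassert>

// Hypothetical stand-ins for the compile task and the compiler thread state.
struct FakeCompileTask { const char* method; };

struct FakeCompilerThread {
  FakeCompileTask* task = nullptr;
  bool timeout_armed = false;
  bool handler_crashed = false;

  // Models the timeout signal handler: it only runs while the timer is armed
  // and it reads the task's method in order to print it.
  void simulate_timeout_signal() {
    if (!timeout_armed) return;        // a disarmed timer never fires
    if (task == nullptr) {             // the real handler would dereference null here
      handler_crashed = true;
      return;
    }
    (void)task->method;                // safe: the task is still set
  }
};

enum class Order { ClearTaskFirst, DisarmFirst };

// Runs the two teardown steps in the given order and delivers one simulated
// signal in the window between them. Returns true if the handler survived.
inline bool teardown_is_safe(Order order) {
  FakeCompileTask task{"Foo::bar"};
  FakeCompilerThread thread;
  thread.task = &task;
  thread.timeout_armed = true;

  if (order == Order::ClearTaskFirst) {
    thread.task = nullptr;             // old order: task cleared while still armed
    thread.simulate_timeout_signal();
    thread.timeout_armed = false;
  } else {
    thread.timeout_armed = false;      // new order: disarm at the very top
    thread.simulate_timeout_signal();
    thread.task = nullptr;
  }
  return !thread.handler_crashed;
}
```

In this model `teardown_is_safe(Order::ClearTaskFirst)` reports a crash and `teardown_is_safe(Order::DisarmFirst)` does not, which mirrors why the fix moves the disarm call to the start of the destructor.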
Testing: - [ ] Github Actions - [ ] tier1,tier2,tier3 plus stress testing on Oracle supported platforms ------------- Commit messages: - Move disarmament of timeout to the very beginning of destuctor Changes: https://git.openjdk.org/jdk/pull/27331/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27331&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8367721 Stats: 5 lines in 1 file changed: 4 ins; 1 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/27331.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27331/head:pull/27331 PR: https://git.openjdk.org/jdk/pull/27331 From duke at openjdk.org Wed Sep 17 07:26:52 2025 From: duke at openjdk.org (erifan) Date: Wed, 17 Sep 2025 07:26:52 GMT Subject: RFR: 8354242: VectorAPI: combine vector not operation with compare [v13] In-Reply-To: References: Message-ID: <27SoQ3ZhkmDXmpLXeRiBu3eJychQuq-BgZ9VEE5Ab_U=.82d70745-599b-4edf-ba8e-54c4956ea166@github.com> On Mon, 15 Sep 2025 05:43:11 GMT, erifan wrote: >> This patch optimizes the following patterns: >> For integer types: >> >> (XorV (VectorMaskCmp src1 src2 cond) (Replicate -1)) >> => (VectorMaskCmp src1 src2 ncond) >> (XorVMask (VectorMaskCmp src1 src2 cond) (MaskAll m1)) >> => (VectorMaskCmp src1 src2 ncond) >> >> cond can be eq, ne, le, ge, lt, gt, ule, uge, ult and ugt, ncond is the negative comparison of cond. >> >> For float and double types: >> >> (XorV (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (Replicate -1)) >> => (VectorMaskCast (VectorMaskCmp src1 src2 ncond)) >> (XorVMask (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (MaskAll m1)) >> => (VectorMaskCast (VectorMaskCmp src1 src2 ncond)) >> >> cond can be eq or ne. 
>> >> Benchmarks on Nvidia Grace machine with 128-bit SVE2: With option `-XX:UseSVE=2`: >> >> Benchmark Unit Before Score Error After Score Error Uplift >> testCompareEQMaskNotByte ops/s 7912127.225 2677.289518 10266136.26 8955.008548 1.29 >> testCompareEQMaskNotDouble ops/s 884737.6799 446.963779 1179760.772 448.031844 1.33 >> testCompareEQMaskNotFloat ops/s 1765045.787 682.332214 2359520.803 896.305743 1.33 >> testCompareEQMaskNotInt ops/s 1787221.411 977.743935 2353952.519 960.069976 1.31 >> testCompareEQMaskNotLong ops/s 895297.1974 673.44808 1178449.02 323.804205 1.31 >> testCompareEQMaskNotShort ops/s 3339987.002 3415.2226 4712761.965 2110.862053 1.41 >> testCompareGEMaskNotByte ops/s 7907615.16 4094.243652 10251646.9 9486.699831 1.29 >> testCompareGEMaskNotInt ops/s 1683738.958 4233.813092 2352855.205 1251.952546 1.39 >> testCompareGEMaskNotLong ops/s 854496.1561 8594.598885 1177811.493 521.1229 1.37 >> testCompareGEMaskNotShort ops/s 3341860.309 1578.975338 4714008.434 1681.10365 1.41 >> testCompareGTMaskNotByte ops/s 7910823.674 2993.367032 10245063.58 9774.75138 1.29 >> testCompareGTMaskNotInt ops/s 1673393.928 3153.099431 2353654.521 1190.848583 1.4 >> testCompareGTMaskNotLong ops/s 849405.9159 2432.858159 1177952.041 359.96413 1.38 >> testCompareGTMaskNotShort ops/s 3339509.141 3339.976585 4711442.496 2673.364893 1.41 >> testCompareLEMaskNotByte ops/s 7911340.004 3114.69191 10231626.5 27134.20035 1.29 >> testCompareLEMaskNotInt ops/s 1675812.113 1340.969885 2353255.341 1452.4522 1.4 >> testCompareLEMaskNotLong ops/s 848862.8036 6564.841731 1177763.623 539.290106 1.38 >> testCompareLEMaskNotShort ops/s 3324951.54 2380.29473 4712116.251 1544.559684 1.41 >> testCompareLTMaskNotByte ops/s 7910390.844 2630.861436 10239567.69 6487.441672 1.29 >> testCompareLTMaskNotInt ops/s 16721... 
> > erifan has updated the pull request incrementally with one additional commit since the last revision: > > Add an IR rule for vector mask cast operation Thanks all for your help, I'll integrate the PR. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24674#issuecomment-3301646116 From duke at openjdk.org Wed Sep 17 07:26:53 2025 From: duke at openjdk.org (duke) Date: Wed, 17 Sep 2025 07:26:53 GMT Subject: RFR: 8354242: VectorAPI: combine vector not operation with compare [v13] In-Reply-To: References: Message-ID: On Mon, 15 Sep 2025 05:43:11 GMT, erifan wrote: >> This patch optimizes the following patterns: >> For integer types: >> >> (XorV (VectorMaskCmp src1 src2 cond) (Replicate -1)) >> => (VectorMaskCmp src1 src2 ncond) >> (XorVMask (VectorMaskCmp src1 src2 cond) (MaskAll m1)) >> => (VectorMaskCmp src1 src2 ncond) >> >> cond can be eq, ne, le, ge, lt, gt, ule, uge, ult and ugt, ncond is the negative comparison of cond. >> >> For float and double types: >> >> (XorV (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (Replicate -1)) >> => (VectorMaskCast (VectorMaskCmp src1 src2 ncond)) >> (XorVMask (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (MaskAll m1)) >> => (VectorMaskCast (VectorMaskCmp src1 src2 ncond)) >> >> cond can be eq or ne. 
>> >> Benchmarks on Nvidia Grace machine with 128-bit SVE2: With option `-XX:UseSVE=2`: >> >> Benchmark Unit Before Score Error After Score Error Uplift >> testCompareEQMaskNotByte ops/s 7912127.225 2677.289518 10266136.26 8955.008548 1.29 >> testCompareEQMaskNotDouble ops/s 884737.6799 446.963779 1179760.772 448.031844 1.33 >> testCompareEQMaskNotFloat ops/s 1765045.787 682.332214 2359520.803 896.305743 1.33 >> testCompareEQMaskNotInt ops/s 1787221.411 977.743935 2353952.519 960.069976 1.31 >> testCompareEQMaskNotLong ops/s 895297.1974 673.44808 1178449.02 323.804205 1.31 >> testCompareEQMaskNotShort ops/s 3339987.002 3415.2226 4712761.965 2110.862053 1.41 >> testCompareGEMaskNotByte ops/s 7907615.16 4094.243652 10251646.9 9486.699831 1.29 >> testCompareGEMaskNotInt ops/s 1683738.958 4233.813092 2352855.205 1251.952546 1.39 >> testCompareGEMaskNotLong ops/s 854496.1561 8594.598885 1177811.493 521.1229 1.37 >> testCompareGEMaskNotShort ops/s 3341860.309 1578.975338 4714008.434 1681.10365 1.41 >> testCompareGTMaskNotByte ops/s 7910823.674 2993.367032 10245063.58 9774.75138 1.29 >> testCompareGTMaskNotInt ops/s 1673393.928 3153.099431 2353654.521 1190.848583 1.4 >> testCompareGTMaskNotLong ops/s 849405.9159 2432.858159 1177952.041 359.96413 1.38 >> testCompareGTMaskNotShort ops/s 3339509.141 3339.976585 4711442.496 2673.364893 1.41 >> testCompareLEMaskNotByte ops/s 7911340.004 3114.69191 10231626.5 27134.20035 1.29 >> testCompareLEMaskNotInt ops/s 1675812.113 1340.969885 2353255.341 1452.4522 1.4 >> testCompareLEMaskNotLong ops/s 848862.8036 6564.841731 1177763.623 539.290106 1.38 >> testCompareLEMaskNotShort ops/s 3324951.54 2380.29473 4712116.251 1544.559684 1.41 >> testCompareLTMaskNotByte ops/s 7910390.844 2630.861436 10239567.69 6487.441672 1.29 >> testCompareLTMaskNotInt ops/s 16721... 
> > erifan has updated the pull request incrementally with one additional commit since the last revision: > > Add an IR rule for vector mask cast operation @erifan Your change (at version 56bb34ffe3ca104c8f838a41f33b1d90bb10b68b) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24674#issuecomment-3301651892 From duke at openjdk.org Wed Sep 17 07:35:04 2025 From: duke at openjdk.org (erifan) Date: Wed, 17 Sep 2025 07:35:04 GMT Subject: Integrated: 8354242: VectorAPI: combine vector not operation with compare In-Reply-To: References: Message-ID: <9jXNL4s-eyJLY6-tYH6-4B5AFrZi-Kr_-J-S2H88Lmc=.8fe08e80-6ed1-4971-b23a-9e1a5b8a4916@github.com> On Wed, 16 Apr 2025 06:39:33 GMT, erifan wrote: > This patch optimizes the following patterns: > For integer types: > > (XorV (VectorMaskCmp src1 src2 cond) (Replicate -1)) > => (VectorMaskCmp src1 src2 ncond) > (XorVMask (VectorMaskCmp src1 src2 cond) (MaskAll m1)) > => (VectorMaskCmp src1 src2 ncond) > > cond can be eq, ne, le, ge, lt, gt, ule, uge, ult and ugt, ncond is the negative comparison of cond. > > For float and double types: > > (XorV (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (Replicate -1)) > => (VectorMaskCast (VectorMaskCmp src1 src2 ncond)) > (XorVMask (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (MaskAll m1)) > => (VectorMaskCast (VectorMaskCmp src1 src2 ncond)) > > cond can be eq or ne. 
> > Benchmarks on Nvidia Grace machine with 128-bit SVE2: With option `-XX:UseSVE=2`: > > Benchmark Unit Before Score Error After Score Error Uplift > testCompareEQMaskNotByte ops/s 7912127.225 2677.289518 10266136.26 8955.008548 1.29 > testCompareEQMaskNotDouble ops/s 884737.6799 446.963779 1179760.772 448.031844 1.33 > testCompareEQMaskNotFloat ops/s 1765045.787 682.332214 2359520.803 896.305743 1.33 > testCompareEQMaskNotInt ops/s 1787221.411 977.743935 2353952.519 960.069976 1.31 > testCompareEQMaskNotLong ops/s 895297.1974 673.44808 1178449.02 323.804205 1.31 > testCompareEQMaskNotShort ops/s 3339987.002 3415.2226 4712761.965 2110.862053 1.41 > testCompareGEMaskNotByte ops/s 7907615.16 4094.243652 10251646.9 9486.699831 1.29 > testCompareGEMaskNotInt ops/s 1683738.958 4233.813092 2352855.205 1251.952546 1.39 > testCompareGEMaskNotLong ops/s 854496.1561 8594.598885 1177811.493 521.1229 1.37 > testCompareGEMaskNotShort ops/s 3341860.309 1578.975338 4714008.434 1681.10365 1.41 > testCompareGTMaskNotByte ops/s 7910823.674 2993.367032 10245063.58 9774.75138 1.29 > testCompareGTMaskNotInt ops/s 1673393.928 3153.099431 2353654.521 1190.848583 1.4 > testCompareGTMaskNotLong ops/s 849405.9159 2432.858159 1177952.041 359.96413 1.38 > testCompareGTMaskNotShort ops/s 3339509.141 3339.976585 4711442.496 2673.364893 1.41 > testCompareLEMaskNotByte ops/s 7911340.004 3114.69191 10231626.5 27134.20035 1.29 > testCompareLEMaskNotInt ops/s 1675812.113 1340.969885 2353255.341 1452.4522 1.4 > testCompareLEMaskNotLong ops/s 848862.8036 6564.841731 1177763.623 539.290106 1.38 > testCompareLEMaskNotShort ops/s 3324951.54 2380.29473 4712116.251 1544.559684 1.41 > testCompareLTMaskNotByte ops/s 7910390.844 2630.861436 10239567.69 6487.441672 1.29 > testCompareLTMaskNotInt ops/s 1672180.09 995.238142 2353757.863 853.774734 1.4 > testCompareLTMaskNotLong ops/s 856502.26... This pull request has now been integrated. 
Changeset: 45cc515f Author: erifan Committer: Xiaohong Gong URL: https://git.openjdk.org/jdk/commit/45cc515f451accfd1a0a36d17ccb38d428a5d035 Stats: 1635 lines in 7 files changed: 1634 ins; 0 del; 1 mod 8354242: VectorAPI: combine vector not operation with compare Reviewed-by: epeter, jbhateja, xgong ------------- PR: https://git.openjdk.org/jdk/pull/24674 From snatarajan at openjdk.org Wed Sep 17 07:50:37 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Wed, 17 Sep 2025 07:50:37 GMT Subject: RFR: 8356779: IGV: dump the index of the SafePointNode containing the current JVMS during parsing In-Reply-To: References: Message-ID: On Thu, 4 Sep 2025 05:22:00 GMT, Saranya Natarajan wrote: > This PR prints index of the SafePointNode containing the current JVMS during parsing in IGV. As stated in JBS the reason for this is that there are a lot of nodes during parsing, it would be nice to know what are the current nodes in the local slots or in the stack when looking at a graph. > > IGV screenshot of before fix > Screenshot 2025-09-15 at 11 56 54 > > IGV screenshot of after fix > Screenshot 2025-09-15 at 11 54 55 Thanks for the review. Please sponsor. ------------- PR Comment: https://git.openjdk.org/jdk/pull/27083#issuecomment-3301740477 From mchevalier at openjdk.org Wed Sep 17 07:52:33 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Wed, 17 Sep 2025 07:52:33 GMT Subject: RFR: 8367721: Test compiler/arguments/TestCompileTaskTimeout.java crashed: SIGSEGV In-Reply-To: References: Message-ID: On Wed, 17 Sep 2025 06:57:29 GMT, Manuel Hässig wrote: > The test `TestCompileTaskTimeout.java` runs `java -Xcomp -XX:CompileTaskTimeout=1 --version` to demonstrate that the timeout works. Part of the timeout working involves it printing the method of the compile task.
Inspecting the core file of the execution that failed with a `SIGSEGV` in the compile task timeout signal handler, the backtrace looks as follows: > > #n > #n+1 CompilerThreadTimeoutLinux::signal_handler() > #n+2 > #n+3 timer_settime() > #n+4 CompilerThreadTimeoutLinux::disarm() > #n+5 CompileTaskWrapper::~CompileTaskWrapper() > > So, the compile task hit the timeout during destruction of the underlying `CompileTaskWrapper`. Since the timeout was disarmed only after setting the task to null in the destructor, the signal handler segfaulted when trying to access the method of the compile task to print it out. This PR addresses this issue by moving up the disarmament of the timeout to the top of the destructor. > > Because this issue can only be triggered with bad --- or good, depending on your view --- luck on timing, I could not devise a regression test. But this is not too big of an issue, since the CI already caught this issue. > > Testing: > - [ ] Github Actions > - [ ] tier1,tier2,tier3 plus stress testing on Oracle supported platforms That makes a lot of sense to me. If I understand well, it happens when the compile task is naturally terminating, when compilation is done in pretty much the delay granted by the timeout. It doesn't come from the concurrent handling of the timeout and some other kind of error? ------------- Marked as reviewed by mchevalier (Committer). PR Review: https://git.openjdk.org/jdk/pull/27331#pullrequestreview-3233140061 From shade at openjdk.org Wed Sep 17 08:02:39 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 17 Sep 2025 08:02:39 GMT Subject: RFR: 8367313: CTW: Execute in AWT headless mode [v2] In-Reply-To: References: Message-ID: On Tue, 16 Sep 2025 06:48:54 GMT, Emanuel Peter wrote: >> Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains two additional commits since the last revision: >> >> - Merge branch 'master' into JDK-8367313-ctw-headless-mode >> - Fix > > @TobiHartmann is on vacation. Maybe @vnkozlov ? Thanks. @eme64 -- I assume testing came back clean? (There seems to be nothing to break here...) ------------- PR Comment: https://git.openjdk.org/jdk/pull/27187#issuecomment-3301790464 From xgong at openjdk.org Wed Sep 17 08:48:16 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 17 Sep 2025 08:48:16 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation [v6] In-Reply-To: References: Message-ID: > This is a follow-up patch of [1], which aims at implementing the subword gather load APIs for AArch64 SVE platform. > > ### Background > Vector gather load APIs load values from memory addresses calculated by adding a base pointer to integer indices. SVE provides native gather load instructions for `byte`/`short` types using `int` vectors for indices. The vector size for a gather-load instruction is determined by the index vector (i.e. `int` elements). Hence, the total size is `32 * elem_num` bits, where `elem_num` is the number of loaded elements in the vector register. > > ### Implementation > > #### Challenges > Due to size differences between `int` indices (32-bit) and `byte`/`short` data (8/16-bit), operations must be split across multiple vector registers based on the target SVE vector register size constraints. > > For a 512-bit SVE machine, loading a `byte` vector with different vector species require different approaches: > - SPECIES_64: Single operation with mask (8 elements, 256-bit) > - SPECIES_128: Single operation, full register (16 elements, 512-bit) > - SPECIES_256: Two operations + merge (32 elements, 1024-bit) > - SPECIES_512/MAX: Four operations + merge (64 elements, 2048-bit) > > Use `ByteVector.SPECIES_512` as an example: > - It contains 64 elements. 
So the index vector size should be `64 * 32` bits, which is 4 times of the SVE vector register size. > - It requires 4 times of vector gather-loads to finish the whole operation. > > > byte[] arr = [a, a, a, a, ..., a, b, b, b, b, ..., b, c, c, c, c, ..., c, d, d, d, d, ..., d, ...] > int[] idx = [0, 1, 2, 3, ..., 63, ...] > > 4 gather-load: > idx_v1 = [15 14 13 ... 1 0] gather_v1 = [... 0000 0000 0000 0000 aaaa aaaa aaaa aaaa] > idx_v2 = [31 30 29 ... 17 16] gather_v2 = [... 0000 0000 0000 0000 bbbb bbbb bbbb bbbb] > idx_v3 = [47 46 45 ... 33 32] gather_v3 = [... 0000 0000 0000 0000 cccc cccc cccc cccc] > idx_v4 = [63 62 61 ... 49 48] gather_v4 = [... 0000 0000 0000 0000 dddd dddd dddd dddd] > merge: v = [dddd dddd dddd dddd cccc cccc cccc cccc bbbb bbbb bbbb bbbb aaaa aaaa aaaa aaaa] > > > #### Solution > The implementation simplifies backend complexity by defining each gather load IR to handle one vector gather-load operation, with multiple IRs generated in the compiler mid-end. > > Here is the main changes: > - Enhanced IR generation with architecture-specific patterns based on `gather_scatter_needs_vector_index()` matcher. > - Added `VectorSliceNode` for result merging. > - Added `VectorMaskWidenNode` for mask spliting and type conversion fo... Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. 
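The split counts listed in the description above follow from simple arithmetic: each loaded element needs a 32-bit index lane, so the index data is `32 * elem_num` bits and has to be divided across vector registers. A minimal sketch of that calculation, assuming hypothetical helper names (`gather_ops_needed`, `needs_partial_mask`) that do not appear in the patch:

```cpp
#include <cassert>

// Gather indices are 32-bit ints, regardless of the loaded element type.
const int kIndexBits = 32;

// Number of gather-load operations needed to load elem_num subword elements
// on a machine with vector_bits-wide registers (ceiling division).
inline int gather_ops_needed(int elem_num, int vector_bits) {
  return (elem_num * kIndexBits + vector_bits - 1) / vector_bits;
}

// True if a single operation does not fill the whole index register, so the
// load has to be restricted with a mask.
inline bool needs_partial_mask(int elem_num, int vector_bits) {
  return elem_num * kIndexBits < vector_bits;
}
```

For a 512-bit register and `byte` vectors this reproduces the cases from the description: 8 elements give one masked operation (256 bits), 16 elements give one full operation (512 bits), 32 elements give two operations (1024 bits), and 64 elements give four operations (2048 bits).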
The pull request now contains eight commits: - Add more comments for IRs and added method - Merge branch 'jdk:master' into JDK-8351623-sve - Merge 'jdk:master' into JDK-8351623-sve - Address review comments - Refine IR pattern and clean backend rules - Fix indentation issue and move the helper matcher method to header files - Merge branch jdk:master into JDK-8351623-sve - 8351623: VectorAPI: Add SVE implementation of subword gather load operation ------------- Changes: https://git.openjdk.org/jdk/pull/26236/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26236&range=05 Stats: 1070 lines in 20 files changed: 907 ins; 24 del; 139 mod Patch: https://git.openjdk.org/jdk/pull/26236.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26236/head:pull/26236 PR: https://git.openjdk.org/jdk/pull/26236 From xgong at openjdk.org Wed Sep 17 08:48:21 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 17 Sep 2025 08:48:21 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation [v5] In-Reply-To: References: Message-ID: On Fri, 5 Sep 2025 10:49:44 GMT, Emanuel Peter wrote: >> Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains six commits: >> >> - Merge 'jdk:master' into JDK-8351623-sve >> - Address review comments >> - Refine IR pattern and clean backend rules >> - Fix indentation issue and move the helper matcher method to header files >> - Merge branch jdk:master into JDK-8351623-sve >> - 8351623: VectorAPI: Add SVE implementation of subword gather load operation > > Looks very interesting. I have a first series of questions / comments :) > > There is definitively a tradeoff between complexity in the backend and in the C2 IR. So I'm yet trying to wrap my head around that decision. I'm just afraid that adding more very specific C2 IR nodes makes things more complicated to do optimizations in the C2 IR. 
Hi @eme64, I just pushed a commit that adds more comments and assertions to the code. This is only a simple fix for part of your comments. Regarding the IR refinement, I need more time to take a look. So could you please take another look at the changes related to the method rename, comments, and assertions? Thanks a lot in advance! > src/hotspot/cpu/aarch64/aarch64_vector.ad line 6008: > >> 6006: // predicate and place in elements of twice their size within >> 6007: // the destination predicate. >> 6008: > > Suggestion: > > > unnecessary empty line This empty line is auto-generated by the m4 file. I tried several ways to remove it, but they all failed. So I have to keep it as it is. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26236#issuecomment-3301972528 PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2354773033 From dlunden at openjdk.org Wed Sep 17 08:56:51 2025 From: dlunden at openjdk.org (Daniel Lundén) Date: Wed, 17 Sep 2025 08:56:51 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v27] In-Reply-To: References: Message-ID: On Tue, 16 Sep 2025 14:21:34 GMT, Emanuel Peter wrote: >> Here I'll argue not touching this in this PR (I did not introduce this), as this is the style of the surrounding code. Should be addressed in a follow-up PR though. > > I'd say this is not just formatting/naming, but code style. We usually fix these cases when we touch the code ;) All right, I'll fix the two local occurrences for `value[ureg_lo]`. I'm sure there are more in `postaloc.cpp` though.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2354804179 From mhaessig at openjdk.org Wed Sep 17 09:06:25 2025 From: mhaessig at openjdk.org (Manuel Hässig) Date: Wed, 17 Sep 2025 09:06:25 GMT Subject: RFR: 8367389: C2 SuperWord: refactor VTransform to model the whole loop instead of just the basic block In-Reply-To: References: Message-ID: On Thu, 11 Sep 2025 06:52:19 GMT, Emanuel Peter wrote: > I'm working on cost-modeling, and am integrating some smaller changes from this proof-of-concept PR: > https://github.com/openjdk/jdk/pull/20964 > [See plan overview.](https://bugs.openjdk.org/browse/JDK-8340093) > > This is a pure refactoring - no change in behaviour. I'm presenting it like this because it will make reviews easier. > > ------------------------------ > > **Goals** > - VTransform models **all nodes in the loop**, not just the basic block (enables later VTransform::optimize, like moving reductions out of the loop) > - Remove `_nodes` from the vector vtnodes. > > **Details** > - Remove: `AUTO_VECTORIZATION2_AFTER_REORDER`, `apply_memops_reordering_with_schedule`, `print_memops_schedule`. > - Instead of reordering the scalar memops, we create the new memory graph during `VTransform::apply`. That is why the `VTransformApplyState` now needs to track the memory states. > - Refactor `VLoopMemorySlices`: map not just memory slices with phis (have stores in loop), but also those with only loads (no phi). > - Create vtnodes for all nodes in the loop (not just the basic block), as well as inputs (already) and outputs (new). Mapping also the output nodes means during `apply`, we naturally connect the uses after the loop to their inputs from the loop (which may be new nodes after the transformation). > - `_mem_ref_for_main_loop_alignment` -> `_vpointer_for_main_loop_alignment`. Instead of tracking the memory node to later have access to its `VPointer`, we take it directly.
That removes one more use of `_nodes` for vector vtnodes. > > I also made a lot of annotations in the code below, for easier review. > > **Suggested order for review** > - Removal of `VTransformGraph::apply_memops_reordering_with_schedule` -> sets up need to build memory graph on the fly. > - Old and new code for `VLoopMemorySlices` -> we now also track load-only slices. > - `build_scalar_vtnodes_for_non_packed_nodes`, `build_inputs_for_scalar_vtnodes`, `build_uses_after_loop`, `apply_vtn_inputs_to_node` (use in `apply`), `apply_backedge`, `fix_memory_state_uses_after_loop` > - `VTransformApplyState`: how it now tracks the memory state. > - `VTransformVectorNode` -> removal of `_nodes` (Big Win!) > - Then look at all the other details. Thank you for your continued effort on this, @eme64! The overall change looks good to me, but I have a few minor suggestions and questions. src/hotspot/share/opto/superwordVTransformBuilder.cpp line 143: > 141: init_req_with_scalar(n, vtn, MemNode::ValueIn); > 142: add_memory_dependencies_of_node_to_vtnode(n, vtn, vtn_memory_dependencies); > 143: } else if (n->isa_CountedLoop()) { Suggestion: } else if (n->is_CountedLoop()) { This is an implicit `!= nullptr` otherwise. src/hotspot/share/opto/vectorization.cpp line 228: > 226: PhiNode* head = _heads.at(alias_idx); > 227: if (head == nullptr) { > 228: // We did not find a phi on this slice yet -> must be a slice with only loads. Could you elaborate for my understanding why this is? Could this not find the load before the phi? src/hotspot/share/opto/vtransform.hpp line 30: > 28: #include "opto/vectorization.hpp" > 29: #include "opto/vectornode.hpp" > 30: #include "utilities/debug.hpp" Am I missing something, because I cannot make out the use? ------------- Changes requested by mhaessig (Committer). 
PR Review: https://git.openjdk.org/jdk/pull/27208#pullrequestreview-3233203020 PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2354695284 PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2354674224 PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2354824440 From dlunden at openjdk.org Wed Sep 17 09:17:18 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Wed, 17 Sep 2025 09:17:18 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v27] In-Reply-To: References: Message-ID: On Tue, 16 Sep 2025 12:01:22 GMT, Emanuel Peter wrote: >> Daniel Lund?n has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 39 commits: >> >> - Clarify comments in regmask.hpp >> - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates >> - Address review comments (renaming on the way in a separate PR) >> - Update src/hotspot/share/opto/regmask.hpp >> >> Co-authored-by: Emanuel Peter >> - Restore modified java/lang/invoke tests >> - Sort includes (new requirement) >> - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates >> - Add clarifying comments at definitions of register mask sizes >> - Fix implicit zero and nullptr checks >> - Add deep copy comment >> - ... and 29 more: https://git.openjdk.org/jdk/compare/60930a3e...c1f41288 > > src/hotspot/share/opto/regmask.hpp line 241: > >> 239: // \_______________________________________________________________________________/ >> 240: // | >> 241: // _rm_size_in_words=_offset=5 > > Can you please add some concise comment why we need `rollover`? Does that happen during register allocation, and if we have rollover then we start spilling instead of keeping values in registers? I'll update the comment above the definition of `_offset` (which I'll move just above this) to hopefully make this clearer. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2354855077 From dlunden at openjdk.org Wed Sep 17 09:17:19 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Wed, 17 Sep 2025 09:17:19 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v27] In-Reply-To: References: Message-ID: On Tue, 16 Sep 2025 14:27:51 GMT, Emanuel Peter wrote: >> The ADLC-generated code relies on using the constructor implicitly, so I prefer not touching it in this changeset at least. All the copies are deep, clarified now. > > Ok, I understand. Can you show me an example, so I can understand a little better? Here is an example (compiled from different files). The constructor gets called when converting from reference to value at the return of `divI_proj_mask`. We could fix this by making the return type of `Matcher::divI_proj_mask` `RegMask&` and updating `UDivModINode::match` accordingly, but I regard this as out of scope for this already large PR. I'd be happy to have a look at this in a follow-up PR. 

const RegMask _INT_RAX_REG_mask( 0x100000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, false );
inline const RegMask &INT_RAX_REG_mask() { return _INT_RAX_REG_mask; }

// Register for DIVI projection of divmodI
RegMask Matcher::divI_proj_mask() {
  return INT_RAX_REG_mask();
}

//------------------------------match------------------------------------------
// return result(s) along with their RegMask info
Node* UDivModINode::match( const ProjNode *proj, const Matcher *match ) {
  uint ideal_reg = proj->ideal_reg();
  RegMask rm;
  if (proj->_con == div_proj_num) {
    rm = match->divI_proj_mask();
  } else {
    assert(proj->_con == mod_proj_num, "must be div or mod projection");
    rm = match->modI_proj_mask();
  }
  return new MachProjNode(this, proj->_con, rm, ideal_reg);
}

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2354848956

From mhaessig at openjdk.org Wed Sep 17 09:38:02 2025
From: mhaessig at openjdk.org (Manuel Hässig)
Date: Wed, 17 Sep 2025 09:38:02 GMT
Subject: RFR: 8367721: Test compiler/arguments/TestCompileTaskTimeout.java crashed: SIGSEGV
In-Reply-To: References: Message-ID:

On Wed, 17 Sep 2025 07:49:47 GMT, Marc Chevalier wrote:

>> The test `TestCompileTaskTimeout.java` runs `java -Xcomp -XX:CompileTaskTimeout=1 --version` to demonstrate that the timeout works. Part of the timeout working involves it printing the method of the compile task. Inspecting the core file of the execution that failed with a `SIGSEGV` in the compile task timeout signal handler, the backtrace looks as follows:
>>
>> #n
>> #n+1 CompilerThreadTimeoutLinux::signal_handler()
>> #n+2
>> #n+3 timer_settime()
>> #n+4 CompilerThreadTimeoutLinux::disarm()
>> #n+5 CompileTaskWrapper::~CompileTaskWrapper()
>>
>> So, the compile task hit the timeout during destruction of the underlying `CompileTaskWrapper`. 
Since the timeout was disarmed only after setting the task to null in the destructor, the signal handler segfaulted when trying to access the method of the compile task to print it out. This PR addresses this issue by moving up the disarmament of the timeout to the top of the destructor. >> >> Because this issue can only be triggered with bad --- or good, depending on your view --- luck on timing, I could not devise a regression test. But this is not too big of an issue, since the CI already caught this issue. >> >> Testing: >> - [ ] Github Actions >> - [ ] tier1,tier2,tier3 plus stress testing on Oracle supported platforms > > That makes a lot of sense to me. > > If I understand well, it happens when the compile task is naturally terminating, when compilation is done in pretty much the delay granted by the timeout. It doesn't come from the concurrent handling of the timeout and some other kind of error? Thank you for taking a look, @marc-chevalier. > It doesn't come from the concurrent handling of the timeout and some other kind of error? There is only one timeout going on for every compile task and thus for every compiler thread and they are delivered directly to the native thread running the compiler thread with the timed out compile task. Since `CompileTask`s are thread local, this cannot come from concurrent timeouts. I can successfully exclude other kinds of errors, because the timeout signal handler does not have any other ways to segfault other than on accessing the task, because it does not use any of the pointers passed to it. 
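
To make the ordering problem concrete, here is a minimal sketch of the two teardown orders. It is not HotSpot code: `FakeTask`, `FakeCompilerThread`, `handler_would_be_safe`, and both `teardown_*` functions are hypothetical stand-ins, and the asynchronous signal is simulated by calling the handler check between the two teardown steps, which is where the backtrace above suggests the timeout fired.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a compile task; the handler prints its method.
struct FakeTask {
  std::string method_name;
};

// Hypothetical stand-in for the per-thread state the timeout handler reads.
struct FakeCompilerThread {
  FakeTask* task = nullptr;    // dereferenced by the timeout handler
  bool timeout_armed = false;  // a timeout can only fire while this is true
};

// Simulates the timeout handler: returns false if it would dereference null.
bool handler_would_be_safe(const FakeCompilerThread& thread) {
  if (!thread.timeout_armed) {
    return true;  // disarmed: the handler never runs
  }
  return thread.task != nullptr;  // armed: the handler reads thread.task
}

// Old order: clear the task first, disarm afterwards.
// A timeout delivered between the two steps sees an armed timer and no task.
bool teardown_clear_then_disarm(FakeCompilerThread& thread) {
  assert(thread.timeout_armed && thread.task != nullptr);
  thread.task = nullptr;
  bool safe = handler_would_be_safe(thread);  // simulated timeout fires here
  thread.timeout_armed = false;
  return safe;
}

// New order: disarm first, then clear the task.
// After disarming, no timeout can observe the null task.
bool teardown_disarm_then_clear(FakeCompilerThread& thread) {
  assert(thread.timeout_armed && thread.task != nullptr);
  thread.timeout_armed = false;
  bool safe = handler_would_be_safe(thread);  // simulated timeout fires here
  thread.task = nullptr;
  return safe;
}
```

With an armed thread that holds a task, `teardown_clear_then_disarm` reports an unsafe window while `teardown_disarm_then_clear` does not, which is the effect of moving the disarm call to the top of the destructor.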
------------- PR Comment: https://git.openjdk.org/jdk/pull/27331#issuecomment-3302151009 From snatarajan at openjdk.org Wed Sep 17 09:48:36 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Wed, 17 Sep 2025 09:48:36 GMT Subject: Integrated: 8356779: IGV: dump the index of the SafePointNode containing the current JVMS during parsing In-Reply-To: References: Message-ID: On Thu, 4 Sep 2025 05:22:00 GMT, Saranya Natarajan wrote: > This PR prints the index of the SafePointNode containing the current JVMS during parsing in IGV. As stated in JBS, there are a lot of nodes during parsing, so it would be nice to know which nodes are currently in the local slots or on the stack when looking at a graph. > > IGV screenshot before the fix: [image: Screenshot 2025-09-15 at 11 56 54] > > IGV screenshot after the fix: [image: Screenshot 2025-09-15 at 11 54 55] This pull request has now been integrated. Changeset: 6df01178 Author: Saranya Natarajan Committer: Manuel Hässig URL: https://git.openjdk.org/jdk/commit/6df01178c03968bee7994eddd187f790c74ba541 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod 8356779: IGV: dump the index of the SafePointNode containing the current JVMS during parsing Reviewed-by: epeter, chagedorn, qamai ------------- PR: https://git.openjdk.org/jdk/pull/27083 From dlunden at openjdk.org Wed Sep 17 09:52:03 2025 From: dlunden at openjdk.org (Daniel Lundén) Date: Wed, 17 Sep 2025 09:52:03 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v27] In-Reply-To: References: Message-ID: On Tue, 16 Sep 2025 12:02:13 GMT, Emanuel Peter wrote: >> Daniel Lundén has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 39 commits: >> >> - Clarify comments in regmask.hpp >> - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates >> - Address review comments (renaming on the way in a separate PR) >> - Update src/hotspot/share/opto/regmask.hpp >> >> Co-authored-by: Emanuel Peter >> - Restore modified java/lang/invoke tests >> - Sort includes (new requirement) >> - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates >> - Add clarifying comments at definitions of register mask sizes >> - Fix implicit zero and nullptr checks >> - Add deep copy comment >> - ... and 29 more: https://git.openjdk.org/jdk/compare/60930a3e...c1f41288 > > src/hotspot/share/opto/chaitin.cpp line 1663: > >> 1661: if (!OptoReg::is_valid(reg) && is_infinite_stack) { >> 1662: // Bump register mask up to next stack chunk >> 1663: bool success = lrg->rollover(); > > Can you add a comment that explains what this does / means? Do we start spilling to the stack slots instead of using registers? Sure, I'll expand on the existing comment. > src/hotspot/share/opto/regmask.hpp line 837: > >> 835: // ---------------------------------------------------------------------- >> 836: // The methods below are only for testing purposes (see test_regmask.cpp) >> 837: // ---------------------------------------------------------------------- > > I wonder if it could be solved with `friend` instead, so it does not have to be public and get accidentally used somehow. > > Or maybe some `gtest_` prefix? Not sure. I like adding a `gtest_` prefix, I'll do that. Not sure how to make gtests work with `friend`. > test/hotspot/jtreg/compiler/arguments/TestMethodArguments.java line 51: > >> 49: static final int INPUT_SIZE = 100; >> 50: >> 51: public static Template.ZeroArgs generateTest(PrimitiveType t, int numberOfArguments) { > > You should write out `type` instead of `t`, would make it consistent with your `let` below. 
Thanks, I'll fix it. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2354938220 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2354945861 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2354946296 From dlunden at openjdk.org Wed Sep 17 09:55:35 2025 From: dlunden at openjdk.org (Daniel Lundén) Date: Wed, 17 Sep 2025 09:55:35 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v27] In-Reply-To: References: Message-ID: On Tue, 16 Sep 2025 12:42:06 GMT, Emanuel Peter wrote: >> Daniel Lundén has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 39 commits: >> >> - Clarify comments in regmask.hpp >> - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates >> - Address review comments (renaming on the way in a separate PR) >> - Update src/hotspot/share/opto/regmask.hpp >> >> Co-authored-by: Emanuel Peter >> - Restore modified java/lang/invoke tests >> - Sort includes (new requirement) >> - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates >> - Add clarifying comments at definitions of register mask sizes >> - Fix implicit zero and nullptr checks >> - Add deep copy comment >> - ... and 29 more: https://git.openjdk.org/jdk/compare/60930a3e...c1f41288 > > test/hotspot/jtreg/compiler/arguments/TestMethodArguments.java line 120: > >> 118: Template.let("classpath", comp.getEscapedClassPathOfCompiledClasses()), >> 119: """ >> 120: import java.util.Arrays; > > Personally, I would not indent this deeply. I know that the generated code will not have proper indentation, but that's not so bad. Readability of the Templates is more important, I think. Subjective though. No strong opinion here, I just went with the eclipse-jdtls autoformatter defaults. 
The generated code does have fairly OK indentation (the indentation in the code does not add any actual indentation in the generated code). Let me know what you prefer and I'll update it. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2354957995 From shade at openjdk.org Wed Sep 17 09:56:23 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 17 Sep 2025 09:56:23 GMT Subject: RFR: 8357258: x86: Improve receiver type profiling reliability [v2] In-Reply-To: References: Message-ID: On Mon, 15 Sep 2025 14:27:28 GMT, Aleksey Shipilev wrote: >> See the bug for discussion what issues current machinery has. >> >> This PR executes the plan outlined in the bug: >> 1. Common the receiver type profiling code in interpreter and C1 >> 2. Rewrite receiver type profiling code to only do atomic receiver slot installations >> 3. Trim `C1OptimizeVirtualCallProfiling` to only claim slots when receiver is installed >> >> This PR does _not_ do atomic counter updates themselves, as it may have much wider performance implications, including regressions. This PR should be at least performance neutral. >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `compiler/` >> - [ ] Linux x86_64 server fastdebug, `all` > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Merge branch 'master' into JDK-8357258-x86-c1-optimize-virt-calls > - Drop atomic counters > - Initial version Looking for reviews! @dean-long, @vnkozlov, @veresov -- you would probably be interested in this. 

-------------

PR Comment: https://git.openjdk.org/jdk/pull/25305#issuecomment-3302217390

From dlunden at openjdk.org Wed Sep 17 09:58:56 2025
From: dlunden at openjdk.org (Daniel Lundén)
Date: Wed, 17 Sep 2025 09:58:56 GMT
Subject: RFR: 8325467: Support methods with many arguments in C2 [v23]
In-Reply-To: References: Message-ID:

On Tue, 16 Sep 2025 14:36:59 GMT, Emanuel Peter wrote:

>> @eme64 I have now addressed your comments (the renaming is in https://github.com/openjdk/jdk/pull/27215, as requested). Please have a look and let me know if I've missed something.
>
> @dlunde Thanks for the swift updates! I have in the meantime added some more comments, just making sure you don't miss them :)

@eme64

> You seem to have a build failure:
>
> ```
> In file included from /home/runner/work/jdk/jdk/src/hotspot/share/opto/compile.hpp:43,
> from /home/runner/work/jdk/jdk/src/hotspot/share/opto/type.hpp:29,
> from /home/runner/work/jdk/jdk/test/hotspot/gtest/opto/test_rangeinference.cpp:26:
> /home/runner/work/jdk/jdk/src/hotspot/share/opto/regmask.hpp: In constructor ‘RegMask::RegMask(Arena*)’:
> /home/runner/work/jdk/jdk/src/hotspot/share/opto/regmask.hpp:441:53: error: class ‘RegMask’ does not have any field named ‘_read_only’
> 441 | : _rm_word() DEBUG_ONLY(COMMA _arena(arena)), _read_only(read_only),
> | ^~~~~~~~~~
> /home/runner/work/jdk/jdk/src/hotspot/share/opto/regmask.hpp:441:64: error: ‘read_only’ was not declared in this scope
> 441 | : _rm_word() DEBUG_ONLY(COMMA _arena(arena)), _read_only(read_only),
> |
> ```

Thanks, it only failed on release so I didn't notice. Will fix.

> I really appreciate that you added extensive `gtest`s, thanks for that

@robcasloz contributed 90% of that, so the credit goes to him!

> And thanks for using the Template Framework, I'm curious to hear if you have any feedback on it :)

Sure, it was quite convenient. Happy to talk about the experience offline.

------------- PR Comment: https://git.openjdk.org/jdk/pull/20404#issuecomment-3302233210 From dlunden at openjdk.org Wed Sep 17 10:07:58 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Wed, 17 Sep 2025 10:07:58 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v29] In-Reply-To: References: Message-ID: > If a method has a large number of parameters, we currently bail out from C2 compilation. > > ### Changeset > > Allowing C2 compilation of methods with a large number of parameters requires fundamental changes to the register mask data structure, used in many places in C2. In particular, register masks currently have a statically determined size and cannot represent arbitrary numbers of stack slots. This is needed if we want to compile methods with arbitrary numbers of parameters. Register mask operations are present in performance-sensitive parts of C2, which further complicates changes. > > Changes: > - Add functionality to dynamically grow/extend register masks. I experimented with a number of design choices to achieve this. To keep the common case (normal number of method parameters) quick and also to avoid more intrusive changes to the current `RegMask` interface, I decided to leave the "base" statically allocated memory for masks unchanged and only use dynamically allocated memory in the rare cases where it is needed. > - Generalize the "chunk"-logic from `PhaseChaitin::Select()` to allow arbitrary-sized chunks, and also move most of the logic into register mask methods to separate concerns and to make the `PhaseChaitin::Select()` code more readable. > - Remove all `can_represent` checks and bailouts. > - Performance tuning. A particularly important change is the early-exit optimization in `RegMask::overlap`, used in the performance-sensitive method `PhaseChaitin::interfere_with_live`. > - Add a new test case `TestManyMethodArguments.java` and extend an old test `TestNestedSynchronize.java`. 
> > ### Testing > > - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/10178060450) > - `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. > - Standard performance benchmarking. No observed conclusive overall performance degradation/improvement. > - Specific benchmarking of C2 compilation time. The changes increase C2 compilation time by, approximately and on average, 1% for methods that could also be compiled before this changeset (see the figure below). The reason for the degradation is further checks required in performance-sensitive code (in particular `PhaseChaitin::remove_bound_register_from_interfering_live_ranges`). I have tried optimizing in various ways, but changes I found that lead to improvement also lead to less readable code (and are, in my opinion, not worth it). > > ![c2-regression](https:/... Daniel Lund?n has updated the pull request incrementally with one additional commit since the last revision: Update after comments from Emanuel ------------- Changes: - all: https://git.openjdk.org/jdk/pull/20404/files - new: https://git.openjdk.org/jdk/pull/20404/files/fe69f5a3..9b04b5a7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20404&range=28 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20404&range=27-28 Stats: 35 lines in 5 files changed: 6 ins; 0 del; 29 mod Patch: https://git.openjdk.org/jdk/pull/20404.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20404/head:pull/20404 PR: https://git.openjdk.org/jdk/pull/20404 From chagedorn at openjdk.org Wed Sep 17 10:42:03 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 17 Sep 2025 10:42:03 GMT Subject: RFR: 8367728: IGV: dump node address type In-Reply-To: References: Message-ID: <5-sK87ityBleluOy9_tQuCywbTT5b03KyEUMe9Yk7LQ=.6cb6ca6c-b0b0-417f-b3ad-732d422908b2@github.com> On Tue, 16 Sep 2025 10:11:50 GMT, Roberto Casta?eda Lozano wrote: > This changeset dumps the address 
type of each node (`Node::adr_type()`), when not null, into the IGV graphs. This should improve the visibility and diagnosability of C2 type inconsistencies, see e.g. [JDK-8367667](https://bugs.openjdk.org/browse/JDK-8367667). > > #### Testing > - tier1 (windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64; release and debug mode). > - Tested IGV manually on a few selected graphs. Tested automatically that dumping thousands of graphs does not trigger any assertion failure (by running `java -Xcomp -XX:PrintIdealGraphLevel=1`). Looks good! ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/27310#pullrequestreview-3233795276 From epeter at openjdk.org Wed Sep 17 11:32:01 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 17 Sep 2025 11:32:01 GMT Subject: RFR: 8367313: CTW: Execute in AWT headless mode [v2] In-Reply-To: References: Message-ID: On Mon, 15 Sep 2025 14:08:57 GMT, Aleksey Shipilev wrote: >> I have been doing CTW parallelization improvements, and noticed that some of the AWT clinits run and initialize graphics stack. This is awkward for a few reasons: >> >> 1. We might be running on headless environment and these clinits could fail, shrinking the CTW testing scope. >> 2. There are dependencies in graphics stack initialization that break -- in one case in my parallelization tests, I have seen the VM crash due to uninitialized AWT lock, because randomized CTW runner managed to execute clinits in unusual order. Running in headless mode avoids dealing with that path altogether. >> >> I think we should be running CTW tests in AWT headless mode to begin with. >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains two additional commits since the last revision: > > - Merge branch 'master' into JDK-8367313-ctw-headless-mode > - Fix Tests passed, yes :) Approved! ------------- Marked as reviewed by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/27187#pullrequestreview-3233980111 From shade at openjdk.org Wed Sep 17 11:39:21 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 17 Sep 2025 11:39:21 GMT Subject: RFR: 8367313: CTW: Execute in AWT headless mode [v2] In-Reply-To: References: Message-ID: On Mon, 15 Sep 2025 14:08:57 GMT, Aleksey Shipilev wrote: >> I have been doing CTW parallelization improvements, and noticed that some of the AWT clinits run and initialize graphics stack. This is awkward for a few reasons: >> >> 1. We might be running on headless environment and these clinits could fail, shrinking the CTW testing scope. >> 2. There are dependencies in graphics stack initialization that break -- in one case in my parallelization tests, I have seen the VM crash due to uninitialized AWT lock, because randomized CTW runner managed to execute clinits in unusual order. Running in headless mode avoids dealing with that path altogether. >> >> I think we should be running CTW tests in AWT headless mode to begin with. >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: > > - Merge branch 'master' into JDK-8367313-ctw-headless-mode > - Fix Thanks! Let's go. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/27187#issuecomment-3302573510 From shade at openjdk.org Wed Sep 17 11:39:22 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 17 Sep 2025 11:39:22 GMT Subject: Integrated: 8367313: CTW: Execute in AWT headless mode In-Reply-To: References: Message-ID: On Wed, 10 Sep 2025 08:11:43 GMT, Aleksey Shipilev wrote: > I have been doing CTW parallelization improvements, and noticed that some of the AWT clinits run and initialize graphics stack. This is awkward for a few reasons: > > 1. We might be running on headless environment and these clinits could fail, shrinking the CTW testing scope. > 2. There are dependencies in graphics stack initialization that break -- in one case in my parallelization tests, I have seen the VM crash due to uninitialized AWT lock, because randomized CTW runner managed to execute clinits in unusual order. Running in headless mode avoids dealing with that path altogether. > > I think we should be running CTW tests in AWT headless mode to begin with. > > Additional testing: > - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` This pull request has now been integrated. 
Changeset: 7e738f0d Author: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/7e738f0d906e574706a277fabbc2cc1df6f11f19 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod 8367313: CTW: Execute in AWT headless mode Reviewed-by: epeter, kvn ------------- PR: https://git.openjdk.org/jdk/pull/27187 From epeter at openjdk.org Wed Sep 17 11:45:37 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 17 Sep 2025 11:45:37 GMT Subject: RFR: 8367389: C2 SuperWord: refactor VTransform to model the whole loop instead of just the basic block [v2] In-Reply-To: References: Message-ID: On Wed, 17 Sep 2025 09:03:48 GMT, Manuel H?ssig wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> for Manuel > > Thank you for your continued effort on this, @eme64! The overall change looks good to me, but I have a few minor suggestions and questions. @mhaessig Thanks for the comments! I realized I had some extra code comments "pending" on github, so I added them now. @mhaessig Ready for re-review ;) > src/hotspot/share/opto/superwordVTransformBuilder.cpp line 143: > >> 141: init_req_with_scalar(n, vtn, MemNode::ValueIn); >> 142: add_memory_dependencies_of_node_to_vtnode(n, vtn, vtn_memory_dependencies); >> 143: } else if (n->isa_CountedLoop()) { > > Suggestion: > > } else if (n->is_CountedLoop()) { > > This is an implicit `!= nullptr` otherwise. Good catch! > src/hotspot/share/opto/vectorization.cpp line 228: > >> 226: PhiNode* head = _heads.at(alias_idx); >> 227: if (head == nullptr) { >> 228: // We did not find a phi on this slice yet -> must be a slice with only loads. > > Could you elaborate for my understanding why this is? Could this not find the load before the phi? We loop over `_body.body()`, which is already topologically ordered. So if the `load` depends on the `phi` on the memory graph, then the `phi` must already have been found. I'll add a comment in the code. 
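
The topological-order argument can be sketched in isolation. The following is a simplified model, not the code in `vectorization.cpp`: `SketchKind`, `SketchNode`, `SliceKind`, and `classify_slices` are hypothetical names, nodes are reduced to a kind plus an alias index, and the input vector is assumed to already be in topological order (as `_body.body()` is). It also assumes, for this sketch only, that a store in the loop always belongs to a slice with a memory phi.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified node model: only what the argument needs.
enum class SketchKind { MemPhi, Load, Store };

struct SketchNode {
  SketchKind kind;
  int alias_idx;  // which memory slice the node belongs to
};

enum class SliceKind { Unseen, LoadOnly, WithPhi };

// Classify each slice by walking the body once, in topological order.
// If a load depended on a memory phi of its slice, that phi would come
// earlier in the order, so a load on a slice with no phi seen so far
// can only belong to a load-only slice.
std::vector<SliceKind> classify_slices(const std::vector<SketchNode>& body,
                                       int num_alias_classes) {
  std::vector<SliceKind> slices(num_alias_classes, SliceKind::Unseen);
  for (const SketchNode& n : body) {
    SliceKind& slice = slices[n.alias_idx];
    switch (n.kind) {
      case SketchKind::MemPhi:
        // The phi heads its slice, so nothing on the slice precedes it.
        assert(slice == SliceKind::Unseen);
        slice = SliceKind::WithPhi;
        break;
      case SketchKind::Load:
        if (slice == SliceKind::Unseen) {
          slice = SliceKind::LoadOnly;  // no phi found yet -> only loads
        }
        break;
      case SketchKind::Store:
        // Sketch assumption: stores only occur on slices with a phi.
        assert(slice == SliceKind::WithPhi);
        break;
    }
  }
  return slices;
}
```

For a body consisting of a phi, a load, and a store on slice 0 followed by a lone load on slice 1, this classifies slice 0 as `WithPhi` and slice 1 as `LoadOnly`.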
> src/hotspot/share/opto/vtransform.hpp line 30: > >> 28: #include "opto/vectorization.hpp" >> 29: #include "opto/vectornode.hpp" >> 30: #include "utilities/debug.hpp" > > Am I missing something, because I cannot make out the use? Good catch! ------------- PR Comment: https://git.openjdk.org/jdk/pull/27208#issuecomment-3302599208 PR Comment: https://git.openjdk.org/jdk/pull/27208#issuecomment-3302600873 PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2355221323 PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2355213338 PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2355228785 From epeter at openjdk.org Wed Sep 17 11:45:57 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 17 Sep 2025 11:45:57 GMT Subject: RFR: 8367389: C2 SuperWord: refactor VTransform to model the whole loop instead of just the basic block [v2] In-Reply-To: References: Message-ID: On Wed, 17 Sep 2025 11:42:10 GMT, Emanuel Peter wrote: >> I'm working on cost-modeling, and am integrating some smaller changes from this proof-of-concept PR: >> https://github.com/openjdk/jdk/pull/20964 >> [See plan overfiew.](https://bugs.openjdk.org/browse/JDK-8340093) >> >> This is a pure refactoring - no change in behaviour. I'm presenting it like this because it will make reviews easier. >> >> ------------------------------ >> >> **Goals** >> - VTransform models **all nodes in the loop**, not just the basic block (enables later VTransform::optimize, like moving reductions out of the loop) >> - Remove `_nodes` from the vector vtnodes. >> >> **Details** >> - Remove: `AUTO_VECTORIZATION2_AFTER_REORDER`, `apply_memops_reordering_with_schedule`, `print_memops_schedule`. >> - Instead of reordering the scalar memops, we create the new memory graph during `VTransform::apply`. That is why the `VTransformApplyState` now needs to track the memory states. 
>> - Refactor `VLoopMemorySlices`: map not just memory slices with phis (have stores in loop), but also those with only loads (no phi). >> - Create vtnodes for all nodes in the loop (not just the basic block), as well as inputs (already) and outputs (new). Mapping also the output nodes means during `apply`, we naturally connect the uses after the loop to their inputs from the loop (which may be new nodes after the transformation). >> - `_mem_ref_for_main_loop_alignment` -> `_vpointer_for_main_loop_alignment`. Instead of tracking the memory node to later have access to its `VPointer`, we take it directly. That removes one more use of `_nodes` for vector vtnodes. >> >> I also made a lot of annotations in the code below, for easier review. >> >> **Suggested order for review** >> - Removal of `VTransformGraph::apply_memops_reordering_with_schedule` -> sets up need to build memory graph on the fly. >> - Old and new code for `VLoopMemorySlices` -> we now also track load-only slices. >> - `build_scalar_vtnodes_for_non_packed_nodes`, `build_inputs_for_scalar_vtnodes`, `build_uses_after_loop`, `apply_vtn_inputs_to_node` (use in `apply`), `apply_backedge`, `fix_memory_state_uses_after_loop` >> - `VTransformApplyState`: how it now tracks the memory state. >> - `VTransformVectorNode` -> removal of `_nodes` (Big Win!) >> - Then look at all the other details. > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > for Manuel src/hotspot/share/opto/phasetype.hpp line 95: > 93: flags(AUTO_VECTORIZATION4_AFTER_SPECULATIVE_RUNTIME_CHECKS, "AutoVectorization 3, after Adding Speculative Runtime Checks") \ > 94: flags(AUTO_VECTORIZATION5_AFTER_APPLY, "AutoVectorization 4, after Apply") \ > 95: flags(BEFORE_CCP1, "Before PhaseCCP 1") \ Removing `apply_memops_reordering_with_schedule`. 
src/hotspot/share/opto/superword.cpp line 668:

> 666: }
> 667:
> 668: // Get all memory nodes of a slice, in reverse order

Refactored and moved to `vectorization.hpp`, where it belongs.

src/hotspot/share/opto/superword.cpp line 670:

> 668: // Iterate over all memory phis
> 669: for (DUIterator_Fast imax, i = cl->fast_outs(imax); i < imax; i++) {
> 670: PhiNode* phi = cl->fast_out(i)->isa_Phi();

Note: the old way only tracked memory slices that have a phi (i.e. slices that have stores). But we now also need to track slices that only have loads, and hence no phi.

src/hotspot/share/opto/superword.cpp line 1555:

> 1553: assert(pack != nullptr, "memop of final solution must still be packed");
> 1554: _vpointer_for_main_loop_alignment = &vpointer(mem);
> 1555: _aw_for_main_loop_alignment = pack->size() * mem->memory_size();

Later, we only need the `VPointer`, and not the `mem` node itself. This removes the dependency on `_nodes` for vtnodes.

src/hotspot/share/opto/superword.cpp line 1994:

> 1992: }
> 1993:
> 1994: void VTransformGraph::apply_vectorization_for_each_vtnode(uint& max_vector_length, uint& max_vector_width) const {

We now create the memory graph from scratch, during `apply`, `apply_backedge` and `apply_state.fix_memory_state_uses_after_loop`. The `VTransformApplyState` keeps track of the memory states.

src/hotspot/share/opto/superword.cpp line 2675:

> 2673: for (uint i = 0; i < pack->size(); i++) {
> 2674: Node* n = pack->at(i);
> 2675: assert(n->is_Load(), "only meaningful for loads");

We can use the `pack` to access the nodes during construction of the `VTransform`, and we do not need to keep the `pack` nodes in the `_nodes` any more.

src/hotspot/share/opto/superwordVTransformBuilder.cpp line 59:

> 57: for (uint i = 0; i < _vloop.lpt()->_body.size(); i++) {
> 58: Node* n = _vloop.lpt()->_body.at(i);
> 59: if (_packset.get_pack(n) != nullptr) { continue; }

Create nodes for all nodes in the loop, not just the basic block.
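As a rough illustration of the note above on memory slices, here is a minimal Java sketch of the distinction between slices that need a memory phi (at least one store in the loop) and load-only slices (no phi). This is not HotSpot code: `MemorySliceSketch`, `MemOp`, `Kind` and `classify` are hypothetical names, and the integer slice id merely stands in for an alias index.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class MemorySliceSketch {
    // Hypothetical stand-in for a memory operation in the loop body.
    static final class MemOp {
        final String name;
        final int slice;       // assumed stand-in for the alias index
        final boolean isStore;

        MemOp(String name, int slice, boolean isStore) {
            this.name = name;
            this.slice = slice;
            this.isStore = isStore;
        }
    }

    enum Kind { LOAD_ONLY, WITH_PHI }

    // A slice is WITH_PHI as soon as one store in the loop writes to it.
    // A slice that is only read stays LOAD_ONLY: all its loads share the
    // memory state from before the loop.
    static Map<Integer, Kind> classify(List<MemOp> body) {
        Map<Integer, Kind> kinds = new LinkedHashMap<>();
        for (MemOp op : body) {
            Kind seen = kinds.getOrDefault(op.slice, Kind.LOAD_ONLY);
            boolean needsPhi = op.isStore || seen == Kind.WITH_PHI;
            kinds.put(op.slice, needsPhi ? Kind.WITH_PHI : Kind.LOAD_ONLY);
        }
        return kinds;
    }

    public static void main(String[] args) {
        List<MemOp> body = List.of(
            new MemOp("load a[i]", 1, false),
            new MemOp("load b[i]", 2, false),
            new MemOp("store b[i]", 2, true));
        System.out.println(classify(body)); // {1=LOAD_ONLY, 2=WITH_PHI}
    }
}
```

In this toy model, the old behaviour described in the note corresponds to keeping only the `WITH_PHI` entries, while the refactoring also keeps the `LOAD_ONLY` ones.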
src/hotspot/share/opto/superwordVTransformBuilder.cpp line 71: > 69: vtn = new (_vtransform.arena()) VTransformCountedLoopNode(_vtransform, n->as_CountedLoop()); > 70: } else if (n->is_CFG()) { > 71: vtn = new (_vtransform.arena()) VTransformCFGNode(_vtransform, n); `CountedLoop` is special case of `CFG` src/hotspot/share/opto/superwordVTransformBuilder.cpp line 147: > 145: init_req_with_scalar(n, vtn, LoopNode::EntryControl); > 146: init_req_with_scalar(n, vtn, LoopNode::LoopBackControl); > 147: } else { Also map the backedges of `Phi` and `CountedLoop` - we are mapping the whole loop! src/hotspot/share/opto/superwordVTransformBuilder.cpp line 178: > 176: } > 177: } > 178: We also create `Outer` vtnodes for all uses after the loop. Mapping also the output nodes means during `apply`, we naturally connect the uses after the loop to their inputs from the loop (which may be new nodes after the transformation). src/hotspot/share/opto/superwordVTransformBuilder.cpp line 212: > 210: vtn = new (_vtransform.arena()) VTransformElementWiseVectorNode(_vtransform, p0->req(), properties, vopc); > 211: } > 212: vtn->set_nodes(pack); We don't need `_nodes` any more! src/hotspot/share/opto/vectorization.cpp line 190: > 188: } > 189: > 190: _memory_slices.find_memory_slices(); `VLoopMemorySlices` needs the body as input, so compute it earlier! src/hotspot/share/opto/vectorization.cpp line 212: > 210: // - No memory phi: only loads. All have the same input memory state from before the loop. > 211: // - With memory phi. Chain of memory operations inside the loop. > 212: void VLoopMemorySlices::find_memory_slices() { See `VLoopMemorySlices` for more documentation on the cases. src/hotspot/share/opto/vectorization.hpp line 382: > 380: }; > 381: > 382: // Submodule of VLoopAnalyzer. Refactored and moved down. 
src/hotspot/share/opto/vectorization.hpp line 474: > 472: const VLoopBody& _body; > 473: > 474: GrowableArray _inputs; We used to only track slices with phis (store in the loop), and not those with only loads (no phi needed). But now we need to also know the input memory slice for loads during `apply`, when we call `apply_state.memory_state`. src/hotspot/share/opto/vtransform.cpp line 83: > 81: > 82: // Skip LoopPhi backedge. > 83: if ((use->isa_LoopPhi() != nullptr || use->isa_CountedLoop() != nullptr) && use->in_req(2) == vtn) { continue; } We now also map the `Phi` and `CountedLoop` backedges, but for scheduling we need to ignore them to get a DAG. src/hotspot/share/opto/vtransform.cpp line 778: > 776: } > 777: } > 778: } We now systematically use the edges of the vtnodes when building the graph. Before we just relied on the old C2 node edges still being correct, but we need to get away from this to allow more graph reshaping on the vtnodes later. src/hotspot/share/opto/vtransform.cpp line 787: > 785: if (_node->is_Store()) { > 786: apply_state.set_memory_state(_node->adr_type(), _node); > 787: } We build the memory graph on the fly, instead of first reordering the scalar mem nodes with `apply_memops_reordering_with_schedule`. src/hotspot/share/opto/vtransform.cpp line 914: > 912: Node* n = _nodes.at(i); > 913: phase->igvn().replace_node(n, vn); > 914: } We don't need to replace the old nodes any more: since we now systematically use the vtnode edges, the old nodes simply get disconnected. This is also why we need to map all use nodes after the loop with `Outer` vtnodes, so that they then automatically change the edges to the new nodes during `apply`. See `VTransformOuterNode::apply` uses `apply_vtn_inputs_to_node`. src/hotspot/share/opto/vtransform.cpp line 955: > 953: }); > 954: } > 955: Obsolete after removal of `apply_memops_reordering_with_schedule`. 
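One of the comments above notes that the `Phi` and `CountedLoop` backedges are now mapped, but must be ignored during scheduling so that the graph is a DAG. A minimal Java sketch of that idea follows, using a plain topological sort. It is an assumption-laden toy, not the `VTransform` scheduler: `ScheduleSketch`, `VNode`, `backedgeIndex` and `loopExample` are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ScheduleSketch {
    // Hypothetical stand-in for a vtnode: ordered inputs, one of which may be a loop backedge.
    static final class VNode {
        final String name;
        final List<VNode> inputs = new ArrayList<>();
        int backedgeIndex = -1; // index into inputs of the backedge, or -1 if none

        VNode(String name) { this.name = name; }
    }

    // Topological sort over the def-use edges, skipping backedge inputs.
    // Assumes every input of a node is itself contained in 'nodes'.
    static List<String> schedule(List<VNode> nodes) {
        Map<VNode, Integer> pending = new HashMap<>();
        Map<VNode, List<VNode>> uses = new HashMap<>();
        for (VNode n : nodes) {
            pending.put(n, 0);
            uses.put(n, new ArrayList<>());
        }
        for (VNode use : nodes) {
            for (int i = 0; i < use.inputs.size(); i++) {
                if (i == use.backedgeIndex) { continue; } // ignore the backedge to get a DAG
                uses.get(use.inputs.get(i)).add(use);
                pending.merge(use, 1, Integer::sum);
            }
        }
        ArrayDeque<VNode> ready = new ArrayDeque<>();
        for (VNode n : nodes) {
            if (pending.get(n) == 0) { ready.add(n); }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            VNode n = ready.poll();
            order.add(n.name);
            for (VNode use : uses.get(n)) {
                if (pending.merge(use, -1, Integer::sum) == 0) { ready.add(use); }
            }
        }
        if (order.size() != nodes.size()) {
            throw new IllegalStateException("cycle remains, no schedule possible");
        }
        return order;
    }

    // Tiny loop: phi = Phi(entry, add), add = phi + one. The add is the phi's backedge input.
    static List<VNode> loopExample(boolean markBackedge) {
        VNode entry = new VNode("entry");
        VNode one = new VNode("one");
        VNode phi = new VNode("phi");
        VNode add = new VNode("add");
        phi.inputs.add(entry);
        phi.inputs.add(add);
        if (markBackedge) { phi.backedgeIndex = 1; }
        add.inputs.add(phi);
        add.inputs.add(one);
        return List.of(entry, one, phi, add);
    }

    public static void main(String[] args) {
        System.out.println(schedule(loopExample(true))); // [entry, one, phi, add]
    }
}
```

If the backedge is not skipped (`loopExample(false)`), the phi and the add wait on each other and the sketch throws, which is the cycle the review comment refers to.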
src/hotspot/share/opto/vtransform.hpp line 191:

> 189:
> 190: template
> 191: void for_each_memop_in_schedule(Callback callback) const;

Obsolete after removal of `apply_memops_reordering_with_schedule`.

src/hotspot/share/opto/vtransform.hpp line 293:

> 291: // loop. If there is a memory phi, this is initially the memory phi, and each time
> 292: // a store is processed, it is updated to that store.
> 293: GrowableArray _memory_states;

Needed to build the memory graph on the fly during `apply`.

src/hotspot/share/opto/vtransform.hpp line 452:

> 450: virtual VTransformApplyResult apply(VTransformApplyState& apply_state) const = 0;
> 451:
> 452: Node* find_transformed_input(int i, const GrowableArray& vnode_idx_to_transformed_node) const;

Missed the removal in an earlier refactoring. Let's do it now.

src/hotspot/share/opto/vtransform.hpp line 636:

> 634: const VTransformVectorNodeProperties _properties;
> 635: protected:
> 636: GrowableArray _nodes;

Big win! Saves us some memory per node, and means the vector nodes are no longer tied to scalar nodes. We will soon be able to optimize the graph with vector nodes that have no scalar equivalent. For example, shuffle.
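The `_memory_states` comment quoted above (initially the memory phi, then updated to each processed store) can be illustrated with a small Java sketch. It only models the bookkeeping and assumes nodes are visited in schedule order. `MemoryStateSketch`, `applyLoad` and `applyStore` are hypothetical names, not the `VTransformApplyState` API.

```java
import java.util.HashMap;
import java.util.Map;

final class MemoryStateSketch {
    // Current memory state per slice, keyed by a hypothetical slice id.
    private final Map<Integer, String> state = new HashMap<>();

    // 'initial' holds, per slice, the memory phi (if the slice has stores)
    // or the memory state from before the loop (load-only slice).
    MemoryStateSketch(Map<Integer, String> initial) {
        state.putAll(initial);
    }

    // A load takes the current state of its slice as memory input and leaves it unchanged.
    String applyLoad(int slice) {
        return state.get(slice);
    }

    // A store takes the current state as memory input and then becomes the new state.
    String applyStore(int slice, String storeName) {
        String memoryInput = state.get(slice);
        state.put(slice, storeName);
        return memoryInput;
    }

    // After the last node, this is what the backedge of the memory phi would be wired to.
    String finalState(int slice) {
        return state.get(slice);
    }

    public static void main(String[] args) {
        MemoryStateSketch s = new MemoryStateSketch(Map.of(1, "phi1"));
        System.out.println(s.applyLoad(1));        // phi1
        System.out.println(s.applyStore(1, "S1")); // phi1
        System.out.println(s.applyLoad(1));        // S1
        System.out.println(s.applyStore(1, "S2")); // S1
        System.out.println(s.finalState(1));       // S2
    }
}
```

Because each memory operation reads its input from this table at the moment it is applied, the memory graph follows the schedule directly and no separate reordering pass over the scalar memops is needed in this model.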
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343516365 PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343519154 PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343562260 PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343521196 PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343524759 PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343515369 PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343529810 PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343527996 PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343533310 PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343540827 PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343541422 PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343544731 PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343546719 PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343548855 PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343570394 PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343553989 PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343577532 PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343580534 PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343593023 PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343595455 PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343598080 PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343600818 PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343602701 PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343608818 From epeter at openjdk.org Wed Sep 17 
11:45:34 2025
From: epeter at openjdk.org (Emanuel Peter)
Date: Wed, 17 Sep 2025 11:45:34 GMT
Subject: RFR: 8367389: C2 SuperWord: refactor VTransform to model the whole loop instead of just the basic block [v2]
In-Reply-To:
References:
Message-ID:

> I'm working on cost-modeling, and am integrating some smaller changes from this proof-of-concept PR:
> https://github.com/openjdk/jdk/pull/20964
> [See plan overview.](https://bugs.openjdk.org/browse/JDK-8340093)
>
> This is a pure refactoring - no change in behaviour. I'm presenting it like this because it will make reviews easier.
>
> ------------------------------
>
> **Goals**
> - VTransform models **all nodes in the loop**, not just the basic block (enables later VTransform::optimize, like moving reductions out of the loop)
> - Remove `_nodes` from the vector vtnodes.
>
> **Details**
> - Remove: `AUTO_VECTORIZATION2_AFTER_REORDER`, `apply_memops_reordering_with_schedule`, `print_memops_schedule`.
> - Instead of reordering the scalar memops, we create the new memory graph during `VTransform::apply`. That is why the `VTransformApplyState` now needs to track the memory states.
> - Refactor `VLoopMemorySlices`: map not just memory slices with phis (have stores in loop), but also those with only loads (no phi).
> - Create vtnodes for all nodes in the loop (not just the basic block), as well as inputs (already) and outputs (new). Mapping also the output nodes means during `apply`, we naturally connect the uses after the loop to their inputs from the loop (which may be new nodes after the transformation).
> - `_mem_ref_for_main_loop_alignment` -> `_vpointer_for_main_loop_alignment`. Instead of tracking the memory node to later have access to its `VPointer`, we take it directly. That removes one more use of `_nodes` for vector vtnodes.
>
> I also made a lot of annotations in the code below, for easier review.
> > **Suggested order for review** > - Removal of `VTransformGraph::apply_memops_reordering_with_schedule` -> sets up need to build memory graph on the fly. > - Old and new code for `VLoopMemorySlices` -> we now also track load-only slices. > - `build_scalar_vtnodes_for_non_packed_nodes`, `build_inputs_for_scalar_vtnodes`, `build_uses_after_loop`, `apply_vtn_inputs_to_node` (use in `apply`), `apply_backedge`, `fix_memory_state_uses_after_loop` > - `VTransformApplyState`: how it now tracks the memory state. > - `VTransformVectorNode` -> removal of `_nodes` (Big Win!) > - Then look at all the other details. Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: for Manuel ------------- Changes: - all: https://git.openjdk.org/jdk/pull/27208/files - new: https://git.openjdk.org/jdk/pull/27208/files/3ec3ea2a..469426a7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=27208&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27208&range=00-01 Stats: 4 lines in 3 files changed: 2 ins; 1 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/27208.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27208/head:pull/27208 PR: https://git.openjdk.org/jdk/pull/27208 From epeter at openjdk.org Wed Sep 17 11:45:58 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 17 Sep 2025 11:45:58 GMT Subject: RFR: 8367389: C2 SuperWord: refactor VTransform to model the whole loop instead of just the basic block [v2] In-Reply-To: References: Message-ID: On Fri, 12 Sep 2025 09:09:29 GMT, Emanuel Peter wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> for Manuel > > src/hotspot/share/opto/vectorization.cpp line 212: > >> 210: // - No memory phi: only loads. All have the same input memory state from before the loop. >> 211: // - With memory phi. Chain of memory operations inside the loop. 
>> 212: void VLoopMemorySlices::find_memory_slices() {
>
> See `VLoopMemorySlices` for more documentation on the cases.

Note: we used to only track slices with phis (i.e. with stores on the slice), and not those that have only loads (and hence no phi).

> src/hotspot/share/opto/vectorization.hpp line 382:
>
>> 380: };
>> 381:
>> 382: // Submodule of VLoopAnalyzer.
>
> Refactored and moved down.

Needs to be after `VLoopBody`, because `VLoopMemorySlices` now needs `VLoopBody` as input.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343565551
PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343551782

From chagedorn at openjdk.org Wed Sep 17 12:00:38 2025
From: chagedorn at openjdk.org (Christian Hagedorn)
Date: Wed, 17 Sep 2025 12:00:38 GMT
Subject: RFR: 8367721: Test compiler/arguments/TestCompileTaskTimeout.java crashed: SIGSEGV
In-Reply-To:
References:
Message-ID:

On Wed, 17 Sep 2025 06:57:29 GMT, Manuel Hässig wrote:

> The test `TestCompileTaskTimeout.java` runs `java -Xcomp -XX:CompileTaskTimeout=1 --version` to demonstrate that the timeout works. Part of the timeout working involves it printing the method of the compile task. Inspecting the core file of the execution that failed with a `SIGSEGV` in the compile task timeout signal handler, the backtrace looks as follows:
>
> #n
> #n+1 CompilerThreadTimeoutLinux::signal_handler()
> #n+2
> #n+3 timer_settime()
> #n+4 CompilerThreadTimeoutLinux::disarm()
> #n+5 CompileTaskWrapper::~CompileTaskWrapper()
>
> So, the compile task hit the timeout during destruction of the underlying `CompileTaskWrapper`. Since the timeout was disarmed only after setting the task to null in the destructor, the signal handler segfaulted when trying to access the method of the compile task to print it out. This PR addresses this issue by moving up the disarmament of the timeout to the top of the destructor.
>
> Because this issue can only be triggered with bad --- or good, depending on your view --- luck on timing, I could not devise a regression test. But this is not too big of an issue, since the CI already caught this issue.
>
> Testing:
> - [ ] Github Actions
> - [ ] tier1,tier2,tier3 plus stress testing on Oracle supported platforms

Looks good to me, too!

-------------

Marked as reviewed by chagedorn (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/27331#pullrequestreview-3234082978

From syan at openjdk.org Wed Sep 17 12:34:06 2025
From: syan at openjdk.org (SendaoYan)
Date: Wed, 17 Sep 2025 12:34:06 GMT
Subject: RFR: 8367278: Test compiler/startup/StartupOutput.java timed out after completion on Windows
In-Reply-To:
References:
Message-ID:

On Fri, 12 Sep 2025 09:56:24 GMT, Damon Fenacci wrote:

> ## Problem
> After [JDK-8260555](https://bugs.openjdk.org/browse/JDK-8260555) changed the default TIMEOUT_FACTOR from 4 to 1, the test compiler/startup/StartupOutput.java can occasionally slightly exceed the 2-minute timeout on Windows.
>
> ## Change
> Rather than increasing the timeout, this change reduces the number of VM runs with randomly generated near-minimum code cache sizes from 200 to 50. This should still provide sufficient coverage while keeping execution well within the timeout.
>
> ## Testing:
> Tiers 1-3+

LGTM

-------------

Marked as reviewed by syan (Committer).

PR Review: https://git.openjdk.org/jdk/pull/27254#pullrequestreview-3234207633

From rcastanedalo at openjdk.org Wed Sep 17 12:51:19 2025
From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano)
Date: Wed, 17 Sep 2025 12:51:19 GMT
Subject: RFR: 8367728: IGV: dump node address type
In-Reply-To:
References:
Message-ID:

On Tue, 16 Sep 2025 10:11:50 GMT, Roberto Castañeda Lozano wrote:

> This changeset dumps the address type of each node (`Node::adr_type()`), when not null, into the IGV graphs.
This should improve the visibility and diagnosability of C2 type inconsistencies, see e.g. [JDK-8367667](https://bugs.openjdk.org/browse/JDK-8367667).
>
> #### Testing
> - tier1 (windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64; release and debug mode).
> - Tested IGV manually on a few selected graphs. Tested automatically that dumping thousands of graphs does not trigger any assertion failure (by running `java -Xcomp -XX:PrintIdealGraphLevel=1`).

Thanks for reviewing Marc, Damon, and Christian!

-------------

PR Comment: https://git.openjdk.org/jdk/pull/27310#issuecomment-3302823069

From rcastanedalo at openjdk.org Wed Sep 17 12:51:20 2025
From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano)
Date: Wed, 17 Sep 2025 12:51:20 GMT
Subject: Integrated: 8367728: IGV: dump node address type
In-Reply-To:
References:
Message-ID: <_Dj-64Fqwn5MpGBIUoIikNcv1gldIO93cgJr15py53U=.274b31d7-f5a8-47ba-9999-8847bb5d95d4@github.com>

On Tue, 16 Sep 2025 10:11:50 GMT, Roberto Castañeda Lozano wrote:

> This changeset dumps the address type of each node (`Node::adr_type()`), when not null, into the IGV graphs. This should improve the visibility and diagnosability of C2 type inconsistencies, see e.g. [JDK-8367667](https://bugs.openjdk.org/browse/JDK-8367667).
>
> #### Testing
> - tier1 (windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64; release and debug mode).
> - Tested IGV manually on a few selected graphs. Tested automatically that dumping thousands of graphs does not trigger any assertion failure (by running `java -Xcomp -XX:PrintIdealGraphLevel=1`).

This pull request has now been integrated.

Changeset: b00e0dae
Author: Roberto Castañeda Lozano
URL: https://git.openjdk.org/jdk/commit/b00e0dae9bbd4bd88f8e7307b7c96688fa3194fe
Stats: 5 lines in 1 file changed: 5 ins; 0 del; 0 mod

8367728: IGV: dump node address type

Reviewed-by: mchevalier, dfenacci, chagedorn

-------------

PR: https://git.openjdk.org/jdk/pull/27310

From syan at openjdk.org Wed Sep 17 13:13:14 2025
From: syan at openjdk.org (SendaoYan)
Date: Wed, 17 Sep 2025 13:13:14 GMT
Subject: RFR: 8367278: Test compiler/startup/StartupOutput.java timed out after completion on Windows
In-Reply-To:
References:
Message-ID:

On Fri, 12 Sep 2025 09:56:24 GMT, Damon Fenacci wrote:

> ## Problem
> After [JDK-8260555](https://bugs.openjdk.org/browse/JDK-8260555) changed the default TIMEOUT_FACTOR from 4 to 1, the test compiler/startup/StartupOutput.java can occasionally slightly exceed the 2-minute timeout on Windows.
>
> ## Change
> Rather than increasing the timeout, this change reduces the number of VM runs with randomly generated near-minimum code cache sizes from 200 to 50. This should still provide sufficient coverage while keeping execution well within the timeout.
>
> ## Testing:
> Tiers 1-3+

GHA shows GetStackTraceALotWhenPinned.java timed out on macos. The failure has been fixed by [JDK-8366893](https://bugs.openjdk.org/browse/JDK-8366893). I think you can merge the master first.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/27254#issuecomment-3302952057

From aturbanov at openjdk.org Wed Sep 17 14:06:49 2025
From: aturbanov at openjdk.org (Andrey Turbanov)
Date: Wed, 17 Sep 2025 14:06:49 GMT
Subject: RFR: 8367657: C2 SuperWord: NormalMapping demo from JVMLS 2025 [v5]
In-Reply-To: <2gGUfvVlIaLGOd5iJUN3-oi9jlytrkULE3WZRUX1x78=.c0da1562-8cac-4215-9ae4-5cb248c89c0b@github.com>
References: <2gGUfvVlIaLGOd5iJUN3-oi9jlytrkULE3WZRUX1x78=.c0da1562-8cac-4215-9ae4-5cb248c89c0b@github.com>
Message-ID:

On Tue, 16 Sep 2025 07:57:38 GMT, Emanuel Peter wrote:

>> Demo from here:
>> https://inside.java/2025/08/16/jvmls-hotspot-auto-vectorization/
>>
>> Cleaned up and enhanced with a JTREG and IR test.
>> I also added some additional "generated" normal maps from height functions.
>> And I display the resulting image side-by-side with the normal map.
>>
>> I decided to put it in a new directory `compiler.gallery`, anticipating other compiler tests that are both visually appealing (i.e. can be used for a "gallery") and that we may want to back up with other tests like IR testing.
>>
>> There is a **stand-alone** way to run the demo:
>> `java test/hotspot/jtreg/compiler/gallery/NormalMapping.java`
>> (though it may only run with JDK22+, probably due to some amber features)
>>
>> **Quick Performance Numbers**, running on my avx512 laptop.
>> default / AVX3: 105 FPS
>> AVX2: 82 FPS
>> AVX1: 50 FPS
>> No vectorization: 19 FPS
>> GraalJIT: 13 FPS (`jdk-26-ea+5` - probably issue with vectorization / inlining?)
>> >> Here some snapshots, but **I really recommend pulling the diff and playing with it, it looks much better in motion**: >> image >> image >> image >> image > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > more for Christian test/hotspot/jtreg/compiler/gallery/TestNormalMapping.java line 58: > 56: System.out.println("Running JTREG test in mode: " + mode); > 57: > 58: switch(mode) { nit Suggestion: switch (mode) { ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27282#discussion_r2355646822 From roland at openjdk.org Wed Sep 17 14:14:19 2025 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 17 Sep 2025 14:14:19 GMT Subject: RFR: 8327963: C2: fix construction of memory graph around Initialize node to prevent incorrect execution if allocation is removed [v12] In-Reply-To: <_YXE9yfxaouyeyMsdurEy_uEx0FJDbGcX8M8L7aDqm0=.770ff0aa-8ae3-46ac-8cc1-7d38710e859e@github.com> References: <3jUFOPYDIqmzEywhzf58guwS0qZGBUCMZ3lXeltlS3c=.5c82601f-cf4d-4b2a-a525-1f8f4c7c4a3b@github.com> <_YXE9yfxaouyeyMsdurEy_uEx0FJDbGcX8M8L7aDqm0=.770ff0aa-8ae3-46ac-8cc1-7d38710e859e@github.com> Message-ID: <_vAArE_XdQUT4nJdyLfvzOzkK87h4e3BtV_KhET-Uuk=.36074582-9265-41f6-a686-d607facb915c@github.com> On Fri, 12 Sep 2025 22:42:08 GMT, Dean Long wrote: >> It's one of the things mentioned in that comment: >> https://github.com/openjdk/jdk/pull/24570#issuecomment-2883651987 >> >> "I added asserts to catch cases where proj_out is called but the node has more than one matching projection. With those asserts, I caught some false positive/cases where we got lucky and worked around them by reworking the code so it doesn't use proj_out. That's the case in PhaseIdealLoop::intrinsify_fill(): we can end up there with more than one FramePtr projection because the code pattern used elsewhere is to add one more projection and let identical projections common during igvn. 
" > > Are we just lucky that we don't have the same problem with ReturnAdr here? Yes, most likely. This is also a pretty harmless corner case: if there is more than one `Parm` projection, the assert in `proj_out` catches it even though it does no harm to have more than one projection in this particular case. So this change is here, not to fix some broken code, but to make it possible to have a strict assert in `proj_out`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24570#discussion_r2355669590 From dfenacci at openjdk.org Wed Sep 17 14:17:13 2025 From: dfenacci at openjdk.org (Damon Fenacci) Date: Wed, 17 Sep 2025 14:17:13 GMT Subject: RFR: 8367278: Test compiler/startup/StartupOutput.java timed out after completion on Windows [v2] In-Reply-To: References: Message-ID: > ## Problem > After [JDK-8260555](https://bugs.openjdk.org/browse/JDK-8260555) changed the default TIMEOUT_FACTOR from 4 to 1, the test compiler/startup/StartupOutput.java can occasionally slightly exceed the 2-minute timeout on Windows. > > ## Change > Rather than increasing the timeout, this change reduces the number of VM runs with randomly generated near-minimum code cache sizes from 200 to 50. This should still provide sufficient coverage while keeping execution well within the timeout. > > ## Testing: > Tiers 1-3+ Damon Fenacci has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains three additional commits since the last revision:

 - Merge branch 'master' into JDK-8367278
 - JDK-8367278: reduce loop to 50 cycles
 - JDK-8367278: Test compiler/startup/StartupOutput.java timed out after completion on Windows

-------------

Changes:
 - all: https://git.openjdk.org/jdk/pull/27254/files
 - new: https://git.openjdk.org/jdk/pull/27254/files/1b4149c8..8643e5fd

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=27254&range=01
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27254&range=00-01

Stats: 38573 lines in 1133 files changed: 19462 ins; 10155 del; 8956 mod
Patch: https://git.openjdk.org/jdk/pull/27254.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/27254/head:pull/27254

PR: https://git.openjdk.org/jdk/pull/27254

From mhaessig at openjdk.org Wed Sep 17 14:32:37 2025
From: mhaessig at openjdk.org (Manuel Hässig)
Date: Wed, 17 Sep 2025 14:32:37 GMT
Subject: RFR: 8359412: Template-Framework Library: Operations and Expressions
In-Reply-To: <6Bm5VrrqCOzdOooIU-wud7c3aCSuv_7GNZe7pe7D7Jk=.c99a9df1-e6bb-4c8d-94e9-029978fae6ab@github.com>
References: <6Bm5VrrqCOzdOooIU-wud7c3aCSuv_7GNZe7pe7D7Jk=.c99a9df1-e6bb-4c8d-94e9-029978fae6ab@github.com>
Message-ID: <4YVAopGtxnlkh39pp0TaW4kNpBuSIXfbz40UDW_We1w=.308dd279-514f-4fa4-b361-ab36f165caf6@github.com>

On Thu, 21 Aug 2025 15:03:57 GMT, Emanuel Peter wrote:

> Implementing ideas from original draft PR: https://github.com/openjdk/jdk/pull/23418 ([Exceptions](https://github.com/openjdk/jdk/pull/23418/files#diff-77e7db8cc0c5e02786e1c993362f98fabe219042eb342fdaffc09fd11380259dR41), [ExpressionFuzzer](https://github.com/openjdk/jdk/pull/23418/files#diff-01844ca5cb007f5eab5fa4195f2f1378d4e7c64ba477fba64626c98ff4054038R66)).
>
> Specifically, I'm extending the Template Library with `Expression`s, and lists of `Operations` (some basic Expressions). These Expressions can easily be nested and then filled with arguments, and applied in a `Template`.
> > Details, in **order you should review**: > - `Operations.java`: maps lots of primitive operators as Expressions. > - `Expression.java`: the fundamental engine behind Expressions. > - `examples/TestExpressions.java`: basic example using Expressions, filling them with random constants. > - `tests/TestExpression.java`: correctness test of Expression machinery. > - `compiler/igvn/ExpressionFuzzer.java`: expression fuzzer for primitive type expressions, including input range/bits constraints and output range/bits verification. > - `PrimitiveType.java`: added `LibraryRNG` facility. We already had `type.con()` which gave us random constants. But we also want to have `type.callLibraryRNG()` so that we can insert a call to a random number generator of the corresponding primitive type. I use this facility in the `ExpressionFuzzer.java` to generate random arguments for the expressions. > - `examples/TestPrimitiveTypes.java`: added a `LibraryRNG` example, that tests that has a weak test for randomness: we should have at least 2 different value in 1000 calls. > > If the reviewers absolutely insist, I could split out `LibraryRNG` into a separate RFE. But it's really not that much code, and has direct use in the `Expression` examples. > > **Future Work**: > - Use `Expression`s in a loop over arrays / MemorySegment: fuzz auto-vectorization. > - Use `Expression`s to model more operations: > - `Vector API`, more arithmetic operations like from `Math` classes etc. > - Ensure that the constraints / checksum mechanic in `compiler/igvn/ExpressionFuzzer.java` work, using IR rules. We may even need to add new IGVN optimizations. Add unsigned constraints. > - Find a way to delay IGVN optimizations to test worklist notification: For example, we could add a new testing operator call `TestUtils.delay(x) -> x`, which is intrinsified as some new `DelayNode` that in normal circumstances just folds away, but under `StressIGVN` and `Stres... Thank you for this enhancement, @eme64! 
It is nice to see the template framework library evolving. The changes look good. I mostly have nits. test/hotspot/jtreg/compiler/igvn/ExpressionFuzzer.java line 31: > 29: * @library /test/lib / > 30: * @compile ../lib/verify/Verify.java > 31: * @run main compiler.igvn.ExpressionFuzzer Since you are fuzzing, you might want to consider adding a compile task timeout in case the random methods cause degenerate compilations. Below is a suggestion for a timeout of 10 seconds, which should be plenty. Suggestion: * @run main -XX:+IgnoreUnrecognizedVMOptions -XX:CompileTaskTimeout=10000 compiler.igvn.ExpressionFuzzer test/hotspot/jtreg/compiler/igvn/ExpressionFuzzer.java line 204: > 202: // once, and pass the same values into both the compiled and reference method. > 203: var valueTemplate = Template.make("name", "type", (String name, CodeGenerationDataNameType type) -> body( > 204: //"#type #name = ", type.con(), ";\n" Suggestion: test/hotspot/jtreg/compiler/lib/template_framework/library/Expression.java line 40: > 38: /** > 39: * {@link Expression}s model Java expressions, that have a list of arguments with specified > 40: * argument types, and an result with a specified result type. Once can {@link #make} a new Suggestion: * argument types, and a result with a specified result type. Once can {@link #make} a new Nit: typo test/hotspot/jtreg/compiler/lib/template_framework/library/Expression.java line 152: > 150: > 151: /** > 152: * Creates a new Espression with 1 arguments. For every make(): s/Espression/Expression/ test/hotspot/jtreg/compiler/lib/template_framework/library/Expression.java line 164: > 162: CodeGenerationDataNameType t0, > 163: String s1) { > 164: return new Expression(returnType, List.of(t0), List.of(s0, s1), new Info()); To reduce code duplication, the methods without an additional info should probably use the ones with. 
Suggestion: return make(returnType, s0, t0, s1, new Info()); test/hotspot/jtreg/compiler/lib/template_framework/library/Expression.java line 358: > 356: tokens.add(arguments.get(i)); > 357: } > 358: tokens.add(strings.get(strings.size()-1)); Suggestion: tokens.add(strings.getLast()); A wee bit easier to read. test/hotspot/jtreg/compiler/lib/template_framework/library/Expression.java line 380: > 378: } > 379: sb.append("\""); > 380: sb.append(this.strings.get(this.strings.size()-1)); Suggestion: sb.append(this.strings.getLast()); test/hotspot/jtreg/compiler/lib/template_framework/library/Expression.java line 465: > 463: newArgumentTypes.add(nestingExpression.argumentTypes.get(i)); > 464: } > 465: newStrings.add(nestingExpression.strings.get(nestingExpression.strings.size() - 1) + Suggestion: newStrings.add(nestingExpression.strings.getLast() + test/hotspot/jtreg/compiler/lib/template_framework/library/Operations.java line 1: > 1: /* I gave it my best shot to suggest a reasonable and reasonably consistent alignment. 
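The delegation suggested above for the `make` overloads (the variant without an info object forwarding to the variant with one) might look roughly like the following Java sketch. It is a simplified stand-in and not the real `Expression` class from the template framework: `ExprSketch`, its `String`-based types, the `mayThrow` flag on `Info` and the `render` helper are all assumptions made for illustration.

```java
import java.util.List;

final class ExprSketch {
    // Hypothetical, heavily simplified counterpart of the Info object mentioned in the review.
    static final class Info {
        final boolean mayThrow;

        Info() { this(false); }
        Info(boolean mayThrow) { this.mayThrow = mayThrow; }
    }

    final String returnType;
    final List<String> argumentTypes;
    final List<String> strings; // code fragments around the arguments
    final Info info;

    private ExprSketch(String returnType, List<String> argumentTypes, List<String> strings, Info info) {
        this.returnType = returnType;
        this.argumentTypes = argumentTypes;
        this.strings = strings;
        this.info = info;
    }

    // Overload without Info: delegates, so the construction logic lives in one place only.
    static ExprSketch make(String returnType, String s0, String t0, String s1) {
        return make(returnType, s0, t0, s1, new Info());
    }

    // Overload with Info: the single place that actually builds the object.
    static ExprSketch make(String returnType, String s0, String t0, String s1, Info info) {
        return new ExprSketch(returnType, List.of(t0), List.of(s0, s1), info);
    }

    // Fills the single argument slot between the two string fragments.
    String render(String argument) {
        return strings.get(0) + argument + strings.get(1);
    }

    public static void main(String[] args) {
        ExprSketch neg = ExprSketch.make("int", "(-(", "int", "))");
        System.out.println(neg.render("x")); // (-(x))
    }
}
```

The benefit is the usual one for overload families: a later change to how the argument and string lists are assembled only has to be made in the overload that takes the `Info`.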
test/hotspot/jtreg/compiler/lib/template_framework/library/Operations.java line 67: > 65: ops.add(Expression.make(BYTES, "(byte)(", LONGS, ")")); > 66: ops.add(Expression.make(BYTES, "(byte)(", FLOATS, ")")); > 67: ops.add(Expression.make(BYTES, "(byte)(", DOUBLES, ")")); Suggestion: ops.add(Expression.make(BYTES, "(byte)(", BYTES, ")")); ops.add(Expression.make(BYTES, "(byte)(", SHORTS, ")")); ops.add(Expression.make(BYTES, "(byte)(", CHARS, ")")); ops.add(Expression.make(BYTES, "(byte)(", INTS, ")")); ops.add(Expression.make(BYTES, "(byte)(", LONGS, ")")); ops.add(Expression.make(BYTES, "(byte)(", FLOATS, ")")); ops.add(Expression.make(BYTES, "(byte)(", DOUBLES, ")")); Whitespace example test/hotspot/jtreg/compiler/lib/template_framework/library/Operations.java line 78: > 76: ops.add(Expression.make(INTS, "Byte.compareUnsigned(", BYTES, ", ", BYTES, ")")); > 77: ops.add(Expression.make(INTS, "Byte.toUnsignedInt(", BYTES, ")")); > 78: ops.add(Expression.make(LONGS, "Byte.toUnsignedLong(", BYTES, ")")); Suggestion: ops.add(Expression.make(INTS, "Byte.compare(", BYTES, ", ", BYTES, ")")); ops.add(Expression.make(INTS, "Byte.compareUnsigned(", BYTES, ", ", BYTES, ")")); ops.add(Expression.make(INTS, "Byte.toUnsignedInt(", BYTES, ")")); ops.add(Expression.make(LONGS, "Byte.toUnsignedLong(", BYTES, ")")); Alignment test/hotspot/jtreg/compiler/lib/template_framework/library/Operations.java line 87: > 85: ops.add(Expression.make(CHARS, "(char)(", LONGS, ")")); > 86: ops.add(Expression.make(CHARS, "(char)(", FLOATS, ")")); > 87: ops.add(Expression.make(CHARS, "(char)(", DOUBLES, ")")); Suggestion: ops.add(Expression.make(CHARS, "(char)(", BYTES, ")")); ops.add(Expression.make(CHARS, "(char)(", SHORTS, ")")); ops.add(Expression.make(CHARS, "(char)(", CHARS, ")")); ops.add(Expression.make(CHARS, "(char)(", INTS, ")")); ops.add(Expression.make(CHARS, "(char)(", LONGS, ")")); ops.add(Expression.make(CHARS, "(char)(", FLOATS, ")")); ops.add(Expression.make(CHARS, "(char)(", 
DOUBLES, ")")); test/hotspot/jtreg/compiler/lib/template_framework/library/Operations.java line 96: > 94: // ------------ Character ------------- > 95: ops.add(Expression.make(INTS, "Character.compare(", CHARS, ", ", CHARS, ")")); > 96: ops.add(Expression.make(CHARS, "Character.reverseBytes(", CHARS, ")")); Suggestion: ops.add(Expression.make(INTS, "Character.compare(", CHARS, ", ", CHARS, ")")); ops.add(Expression.make(CHARS, "Character.reverseBytes(", CHARS, ")")); test/hotspot/jtreg/compiler/lib/template_framework/library/Operations.java line 105: > 103: ops.add(Expression.make(SHORTS, "(short)(", LONGS, ")")); > 104: ops.add(Expression.make(SHORTS, "(short)(", FLOATS, ")")); > 105: ops.add(Expression.make(SHORTS, "(short)(", DOUBLES, ")")); Suggestion: ops.add(Expression.make(SHORTS, "(short)(", BYTES, ")")); ops.add(Expression.make(SHORTS, "(short)(", SHORTS, ")")); ops.add(Expression.make(SHORTS, "(short)(", CHARS, ")")); ops.add(Expression.make(SHORTS, "(short)(", INTS, ")")); ops.add(Expression.make(SHORTS, "(short)(", LONGS, ")")); ops.add(Expression.make(SHORTS, "(short)(", FLOATS, ")")); ops.add(Expression.make(SHORTS, "(short)(", DOUBLES, ")")); test/hotspot/jtreg/compiler/lib/template_framework/library/Operations.java line 117: > 115: ops.add(Expression.make(SHORTS, "Short.reverseBytes(", SHORTS, ")")); > 116: ops.add(Expression.make(INTS, "Short.toUnsignedInt(", SHORTS, ")")); > 117: ops.add(Expression.make(LONGS, "Short.toUnsignedLong(", SHORTS, ")")); Suggestion: ops.add(Expression.make(INTS, "Short.compare(", SHORTS, ", ", SHORTS, ")")); ops.add(Expression.make(INTS, "Short.compareUnsigned(", SHORTS, ", ", SHORTS, ")")); ops.add(Expression.make(SHORTS, "Short.reverseBytes(", SHORTS, ")")); ops.add(Expression.make(INTS, "Short.toUnsignedInt(", SHORTS, ")")); ops.add(Expression.make(LONGS, "Short.toUnsignedLong(", SHORTS, ")")); test/hotspot/jtreg/compiler/lib/template_framework/library/Operations.java line 126: > 124: ops.add(Expression.make(INTS, 
"(int)(", LONGS, ")")); > 125: ops.add(Expression.make(INTS, "(int)(", FLOATS, ")")); > 126: ops.add(Expression.make(INTS, "(int)(", DOUBLES, ")")); Suggestion: ops.add(Expression.make(INTS, "(int)(", BYTES, ")")); ops.add(Expression.make(INTS, "(int)(", SHORTS, ")")); ops.add(Expression.make(INTS, "(int)(", CHARS, ")")); ops.add(Expression.make(INTS, "(int)(", INTS, ")")); ops.add(Expression.make(INTS, "(int)(", LONGS, ")")); ops.add(Expression.make(INTS, "(int)(", FLOATS, ")")); ops.add(Expression.make(INTS, "(int)(", DOUBLES, ")")); test/hotspot/jtreg/compiler/lib/template_framework/library/Operations.java line 137: > 135: ops.add(Expression.make(INTS, "(", INTS, " * ", INTS, ")")); > 136: ops.add(Expression.make(INTS, "(", INTS, " / ", INTS, ")", withArithmeticException)); > 137: ops.add(Expression.make(INTS, "(", INTS, " % ", INTS, ")", withArithmeticException)); Suggestion: ops.add(Expression.make(INTS, "(-(", INTS, "))")); ops.add(Expression.make(INTS, "(", INTS, " + ", INTS, ")")); ops.add(Expression.make(INTS, "(", INTS, " - ", INTS, ")")); ops.add(Expression.make(INTS, "(", INTS, " * ", INTS, ")")); ops.add(Expression.make(INTS, "(", INTS, " / ", INTS, ")", withArithmeticException)); ops.add(Expression.make(INTS, "(", INTS, " % ", INTS, ")", withArithmeticException)); test/hotspot/jtreg/compiler/lib/template_framework/library/Operations.java line 154: > 152: ops.add(Expression.make(BOOLEANS, "(", INTS, " < ", INTS, ")")); > 153: ops.add(Expression.make(BOOLEANS, "(", INTS, " >= ", INTS, ")")); > 154: ops.add(Expression.make(BOOLEANS, "(", INTS, " <= ", INTS, ")")); Suggestion: ops.add(Expression.make(BOOLEANS, "(", INTS, " == ", INTS, ")")); ops.add(Expression.make(BOOLEANS, "(", INTS, " != ", INTS, ")")); ops.add(Expression.make(BOOLEANS, "(", INTS, " > ", INTS, ")")); ops.add(Expression.make(BOOLEANS, "(", INTS, " < ", INTS, ")")); ops.add(Expression.make(BOOLEANS, "(", INTS, " >= ", INTS, ")")); ops.add(Expression.make(BOOLEANS, "(", INTS, " <= ", 
INTS, ")")); test/hotspot/jtreg/compiler/lib/template_framework/library/Operations.java line 176: > 174: ops.add(Expression.make(INTS, "Integer.signum(", INTS, ")")); > 175: ops.add(Expression.make(INTS, "Integer.sum(", INTS, ", ", INTS, ")")); > 176: ops.add(Expression.make(LONGS, "Integer.toUnsignedLong(", INTS, ")")); Suggestion: ops.add(Expression.make(INTS, "Integer.bitCount(", INTS, ")")); ops.add(Expression.make(INTS, "Integer.compare(", INTS, ", ", INTS, ")")); ops.add(Expression.make(INTS, "Integer.compareUnsigned(", INTS, ", ", INTS, ")")); ops.add(Expression.make(INTS, "Integer.compress(", INTS, ", ", INTS, ")")); ops.add(Expression.make(INTS, "Integer.divideUnsigned(", INTS, ", ", INTS, ")", withArithmeticException)); ops.add(Expression.make(INTS, "Integer.expand(", INTS, ", ", INTS, ")")); ops.add(Expression.make(INTS, "Integer.highestOneBit(", INTS, ")")); ops.add(Expression.make(INTS, "Integer.lowestOneBit(", INTS, ")")); ops.add(Expression.make(INTS, "Integer.max(", INTS, ", ", INTS, ")")); ops.add(Expression.make(INTS, "Integer.min(", INTS, ", ", INTS, ")")); ops.add(Expression.make(INTS, "Integer.numberOfLeadingZeros(", INTS, ")")); ops.add(Expression.make(INTS, "Integer.numberOfTrailingZeros(", INTS, ")")); ops.add(Expression.make(INTS, "Integer.remainderUnsigned(", INTS, ", ", INTS, ")", withArithmeticException)); ops.add(Expression.make(INTS, "Integer.reverse(", INTS, ")")); ops.add(Expression.make(INTS, "Integer.reverseBytes(", INTS, ")")); ops.add(Expression.make(INTS, "Integer.rotateLeft(", INTS, ", ", INTS, ")")); ops.add(Expression.make(INTS, "Integer.rotateRight(", INTS, ", ", INTS, ")")); ops.add(Expression.make(INTS, "Integer.signum(", INTS, ")")); ops.add(Expression.make(INTS, "Integer.sum(", INTS, ", ", INTS, ")")); ops.add(Expression.make(LONGS, "Integer.toUnsignedLong(", INTS, ")")); Also aligning the arguments might be a bit much... 
test/hotspot/jtreg/compiler/lib/template_framework/library/Operations.java line 185: > 183: ops.add(Expression.make(LONGS, "(long)(", LONGS, ")")); > 184: ops.add(Expression.make(LONGS, "(long)(", FLOATS, ")")); > 185: ops.add(Expression.make(LONGS, "(long)(", DOUBLES, ")")); Suggestion: ops.add(Expression.make(LONGS, "(long)(", BYTES, ")")); ops.add(Expression.make(LONGS, "(long)(", SHORTS, ")")); ops.add(Expression.make(LONGS, "(long)(", CHARS, ")")); ops.add(Expression.make(LONGS, "(long)(", INTS, ")")); ops.add(Expression.make(LONGS, "(long)(", LONGS, ")")); ops.add(Expression.make(LONGS, "(long)(", FLOATS, ")")); ops.add(Expression.make(LONGS, "(long)(", DOUBLES, ")")); test/hotspot/jtreg/compiler/lib/template_framework/library/Operations.java line 196: > 194: ops.add(Expression.make(LONGS, "(", LONGS, " * ", LONGS, ")")); > 195: ops.add(Expression.make(LONGS, "(", LONGS, " / ", LONGS, ")", withArithmeticException)); > 196: ops.add(Expression.make(LONGS, "(", LONGS, " % ", LONGS, ")", withArithmeticException)); Suggestion: ops.add(Expression.make(LONGS, "(-(", LONGS, "))")); ops.add(Expression.make(LONGS, "(", LONGS, " + ", LONGS, ")")); ops.add(Expression.make(LONGS, "(", LONGS, " - ", LONGS, ")")); ops.add(Expression.make(LONGS, "(", LONGS, " * ", LONGS, ")")); ops.add(Expression.make(LONGS, "(", LONGS, " / ", LONGS, ")", withArithmeticException)); ops.add(Expression.make(LONGS, "(", LONGS, " % ", LONGS, ")", withArithmeticException)); test/hotspot/jtreg/compiler/lib/template_framework/library/Operations.java line 213: > 211: ops.add(Expression.make(BOOLEANS, "(", LONGS, " < ", LONGS, ")")); > 212: ops.add(Expression.make(BOOLEANS, "(", LONGS, " >= ", LONGS, ")")); > 213: ops.add(Expression.make(BOOLEANS, "(", LONGS, " <= ", LONGS, ")")); Suggestion: ops.add(Expression.make(BOOLEANS, "(", LONGS, " == ", LONGS, ")")); ops.add(Expression.make(BOOLEANS, "(", LONGS, " != ", LONGS, ")")); ops.add(Expression.make(BOOLEANS, "(", LONGS, " > ", LONGS, ")")); 
ops.add(Expression.make(BOOLEANS, "(", LONGS, " < ", LONGS, ")")); ops.add(Expression.make(BOOLEANS, "(", LONGS, " >= ", LONGS, ")")); ops.add(Expression.make(BOOLEANS, "(", LONGS, " <= ", LONGS, ")")); test/hotspot/jtreg/compiler/lib/template_framework/library/Operations.java line 234: > 232: ops.add(Expression.make(LONGS, "Long.rotateRight(", LONGS, ", ", INTS, ")")); > 233: ops.add(Expression.make(INTS, "Long.signum(", LONGS, ")")); > 234: ops.add(Expression.make(LONGS, "Long.sum(", LONGS, ", ", LONGS, ")")); Suggestion: ops.add(Expression.make(INTS, "Long.bitCount(", LONGS, ")")); ops.add(Expression.make(INTS, "Long.compare(", LONGS, ", ", LONGS, ")")); ops.add(Expression.make(INTS, "Long.compareUnsigned(", LONGS, ", ", LONGS, ")")); ops.add(Expression.make(LONGS, "Long.compress(", LONGS, ", ", LONGS, ")")); ops.add(Expression.make(LONGS, "Long.divideUnsigned(", LONGS, ", ", LONGS, ")", withArithmeticException)); ops.add(Expression.make(LONGS, "Long.expand(", LONGS, ", ", LONGS, ")")); ops.add(Expression.make(LONGS, "Long.highestOneBit(", LONGS, ")")); ops.add(Expression.make(LONGS, "Long.lowestOneBit(", LONGS, ")")); ops.add(Expression.make(LONGS, "Long.max(", LONGS, ", ", LONGS, ")")); ops.add(Expression.make(LONGS, "Long.min(", LONGS, ", ", LONGS, ")")); ops.add(Expression.make(INTS, "Long.numberOfLeadingZeros(", LONGS, ")")); ops.add(Expression.make(INTS, "Long.numberOfTrailingZeros(", LONGS, ")")); ops.add(Expression.make(LONGS, "Long.remainderUnsigned(", LONGS, ", ", LONGS, ")", withArithmeticException)); ops.add(Expression.make(LONGS, "Long.reverse(", LONGS, ")")); ops.add(Expression.make(LONGS, "Long.reverseBytes(", LONGS, ")")); ops.add(Expression.make(LONGS, "Long.rotateLeft(", LONGS, ", ", INTS, ")")); ops.add(Expression.make(LONGS, "Long.rotateRight(", LONGS, ", ", INTS, ")")); ops.add(Expression.make(INTS, "Long.signum(", LONGS, ")")); ops.add(Expression.make(LONGS, "Long.sum(", LONGS, ", ", LONGS, ")")); 
test/hotspot/jtreg/compiler/lib/template_framework/library/Operations.java line 243: > 241: ops.add(Expression.make(FLOATS, "(float)(", LONGS, ")")); > 242: ops.add(Expression.make(FLOATS, "(float)(", FLOATS, ")")); > 243: ops.add(Expression.make(FLOATS, "(float)(", DOUBLES, ")")); Suggestion: ops.add(Expression.make(FLOATS, "(float)(", BYTES, ")")); ops.add(Expression.make(FLOATS, "(float)(", SHORTS, ")")); ops.add(Expression.make(FLOATS, "(float)(", CHARS, ")")); ops.add(Expression.make(FLOATS, "(float)(", INTS, ")")); ops.add(Expression.make(FLOATS, "(float)(", LONGS, ")")); ops.add(Expression.make(FLOATS, "(float)(", FLOATS, ")")); ops.add(Expression.make(FLOATS, "(float)(", DOUBLES, ")")); test/hotspot/jtreg/compiler/lib/template_framework/library/Operations.java line 254: > 252: ops.add(Expression.make(FLOATS, "(", FLOATS, " * ", FLOATS, ")")); > 253: ops.add(Expression.make(FLOATS, "(", FLOATS, " / ", FLOATS, ")")); > 254: ops.add(Expression.make(FLOATS, "(", FLOATS, " % ", FLOATS, ")")); Suggestion: ops.add(Expression.make(FLOATS, "(", FLOATS, " + ", FLOATS, ")")); ops.add(Expression.make(FLOATS, "(", FLOATS, " - ", FLOATS, ")")); ops.add(Expression.make(FLOATS, "(", FLOATS, " * ", FLOATS, ")")); ops.add(Expression.make(FLOATS, "(", FLOATS, " / ", FLOATS, ")")); ops.add(Expression.make(FLOATS, "(", FLOATS, " % ", FLOATS, ")")); test/hotspot/jtreg/compiler/lib/template_framework/library/Operations.java line 286: > 284: ops.add(Expression.make(DOUBLES, "(double)(", LONGS, ")")); > 285: ops.add(Expression.make(DOUBLES, "(double)(", FLOATS, ")")); > 286: ops.add(Expression.make(DOUBLES, "(double)(", DOUBLES, ")")); Suggestion: ops.add(Expression.make(DOUBLES, "(double)(", BYTES, ")")); ops.add(Expression.make(DOUBLES, "(double)(", SHORTS, ")")); ops.add(Expression.make(DOUBLES, "(double)(", CHARS, ")")); ops.add(Expression.make(DOUBLES, "(double)(", INTS, ")")); ops.add(Expression.make(DOUBLES, "(double)(", LONGS, ")")); ops.add(Expression.make(DOUBLES, 
"(double)(", FLOATS, ")")); ops.add(Expression.make(DOUBLES, "(double)(", DOUBLES, ")")); test/hotspot/jtreg/compiler/lib/template_framework/library/Operations.java line 297: > 295: ops.add(Expression.make(DOUBLES, "(", DOUBLES, " * ", DOUBLES, ")")); > 296: ops.add(Expression.make(DOUBLES, "(", DOUBLES, " / ", DOUBLES, ")")); > 297: ops.add(Expression.make(DOUBLES, "(", DOUBLES, " % ", DOUBLES, ")")); Suggestion: ops.add(Expression.make(DOUBLES, "(-(", DOUBLES, "))")); ops.add(Expression.make(DOUBLES, "(", DOUBLES, " + ", DOUBLES, ")")); ops.add(Expression.make(DOUBLES, "(", DOUBLES, " - ", DOUBLES, ")")); ops.add(Expression.make(DOUBLES, "(", DOUBLES, " * ", DOUBLES, ")")); ops.add(Expression.make(DOUBLES, "(", DOUBLES, " / ", DOUBLES, ")")); ops.add(Expression.make(DOUBLES, "(", DOUBLES, " % ", DOUBLES, ")")); test/hotspot/jtreg/compiler/lib/template_framework/library/Operations.java line 318: > 316: ops.add(Expression.make(DOUBLES, "Double.max(", DOUBLES, ", ", DOUBLES, ")")); > 317: ops.add(Expression.make(DOUBLES, "Double.min(", DOUBLES, ", ", DOUBLES, ")")); > 318: ops.add(Expression.make(DOUBLES, "Double.sum(", DOUBLES, ", ", DOUBLES, ")")); Suggestion: ops.add(Expression.make(INTS, "Double.compare(", DOUBLES, ", ", DOUBLES, ")")); ops.add(Expression.make(LONGS, "Double.doubleToLongBits(", DOUBLES, ")")); // Note: there are multiple NaN values with different bit representations. 
ops.add(Expression.make(LONGS, "Double.doubleToRawLongBits(", DOUBLES, ")", withNondeterministicResult)); ops.add(Expression.make(DOUBLES, "Double.longBitsToDouble(", LONGS, ")")); ops.add(Expression.make(BOOLEANS, "Double.isFinite(", DOUBLES, ")")); ops.add(Expression.make(BOOLEANS, "Double.isInfinite(", DOUBLES, ")")); ops.add(Expression.make(BOOLEANS, "Double.isNaN(", DOUBLES, ")")); ops.add(Expression.make(DOUBLES, "Double.max(", DOUBLES, ", ", DOUBLES, ")")); ops.add(Expression.make(DOUBLES, "Double.min(", DOUBLES, ", ", DOUBLES, ")")); ops.add(Expression.make(DOUBLES, "Double.sum(", DOUBLES, ", ", DOUBLES, ")")); test/hotspot/jtreg/compiler/lib/template_framework/library/Operations.java line 331: > 329: ops.add(Expression.make(BOOLEANS, "(", BOOLEANS, " || ", BOOLEANS, ")")); > 330: ops.add(Expression.make(BOOLEANS, "(", BOOLEANS, " && ", BOOLEANS, ")")); > 331: ops.add(Expression.make(BOOLEANS, "(", BOOLEANS, " ^ ", BOOLEANS, ")")); Suggestion: ops.add(Expression.make(BOOLEANS, "(!(", BOOLEANS, "))")); ops.add(Expression.make(BOOLEANS, "(", BOOLEANS, " || ", BOOLEANS, ")")); ops.add(Expression.make(BOOLEANS, "(", BOOLEANS, " && ", BOOLEANS, ")")); ops.add(Expression.make(BOOLEANS, "(", BOOLEANS, " ^ ", BOOLEANS, ")")); test/hotspot/jtreg/compiler/lib/template_framework/library/Operations.java line 337: > 335: ops.add(Expression.make(BOOLEANS, "Boolean.logicalAnd(", BOOLEANS, ", ", BOOLEANS, ")")); > 336: ops.add(Expression.make(BOOLEANS, "Boolean.logicalOr(", BOOLEANS, ", ", BOOLEANS, ")")); > 337: ops.add(Expression.make(BOOLEANS, "Boolean.logicalXor(", BOOLEANS, ", ", BOOLEANS, ")")); Suggestion: ops.add(Expression.make(INTS, "Boolean.compare(", BOOLEANS, ", ", BOOLEANS, ")")); ops.add(Expression.make(BOOLEANS, "Boolean.logicalAnd(", BOOLEANS, ", ", BOOLEANS, ")")); ops.add(Expression.make(BOOLEANS, "Boolean.logicalOr(", BOOLEANS, ", ", BOOLEANS, ")")); ops.add(Expression.make(BOOLEANS, "Boolean.logicalXor(", BOOLEANS, ", ", BOOLEANS, ")")); 
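The note in the `Double` suggestion above, that there are multiple NaN values with different bit representations, can be illustrated with a small Java sketch. The class name and the chosen bit pattern are illustrative assumptions, not part of the PR. `Double.doubleToLongBits` collapses every NaN to the canonical pattern `0x7ff8000000000000L`, while `Double.doubleToRawLongBits` may return whatever payload the NaN carries, which is presumably why only the raw variant is tagged `withNondeterministicResult`:

```java
// Illustrative only: NaNBitsDemo is a hypothetical name, not code from the PR.
class NaNBitsDemo {
    // A quiet NaN with a non-zero payload in the low bits (arbitrary illustrative choice).
    static final long NAN_WITH_PAYLOAD = 0x7ff8000000000001L;

    // Always yields the canonical NaN pattern for any NaN input.
    static long canonicalBits(double d) {
        return Double.doubleToLongBits(d);
    }

    // May preserve the NaN payload; the exact bits can be platform dependent.
    static long rawBits(double d) {
        return Double.doubleToRawLongBits(d);
    }

    public static void main(String[] args) {
        double nan = Double.longBitsToDouble(NAN_WITH_PAYLOAD);
        System.out.println("isNaN:     " + Double.isNaN(nan));
        System.out.println("canonical: " + Long.toHexString(canonicalBits(nan)));
        System.out.println("raw:       " + Long.toHexString(rawBits(nan)));
    }
}
```

A generated test that compares the raw bits between two runs or two compilation modes could therefore see differing values even when both results are NaN, whereas the canonical bits are stable.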
test/hotspot/jtreg/testlibrary_tests/template_framework/examples/TestExpressions.java line 27: > 25: * @test > 26: * @bug 8359412 > 27: * @summary Demonstrate the use of Expressions form the Template Library. Suggestion: * @summary Demonstrate the use of Expressions from the Template Library. Typo ------------- Changes requested by mhaessig (Committer). PR Review: https://git.openjdk.org/jdk/pull/26885#pullrequestreview-3233825268 PR Review Comment: https://git.openjdk.org/jdk/pull/26885#discussion_r2355431170 PR Review Comment: https://git.openjdk.org/jdk/pull/26885#discussion_r2355670847 PR Review Comment: https://git.openjdk.org/jdk/pull/26885#discussion_r2355097557 PR Review Comment: https://git.openjdk.org/jdk/pull/26885#discussion_r2355135345 PR Review Comment: https://git.openjdk.org/jdk/pull/26885#discussion_r2355130769 PR Review Comment: https://git.openjdk.org/jdk/pull/26885#discussion_r2355157237 PR Review Comment: https://git.openjdk.org/jdk/pull/26885#discussion_r2355161790 PR Review Comment: https://git.openjdk.org/jdk/pull/26885#discussion_r2355198561 PR Review Comment: https://git.openjdk.org/jdk/pull/26885#discussion_r2355213439 PR Review Comment: https://git.openjdk.org/jdk/pull/26885#discussion_r2355218584 PR Review Comment: https://git.openjdk.org/jdk/pull/26885#discussion_r2355223397 PR Review Comment: https://git.openjdk.org/jdk/pull/26885#discussion_r2355224657 PR Review Comment: https://git.openjdk.org/jdk/pull/26885#discussion_r2355226625 PR Review Comment: https://git.openjdk.org/jdk/pull/26885#discussion_r2355227306 PR Review Comment: https://git.openjdk.org/jdk/pull/26885#discussion_r2355231339 PR Review Comment: https://git.openjdk.org/jdk/pull/26885#discussion_r2355232551 PR Review Comment: https://git.openjdk.org/jdk/pull/26885#discussion_r2355236084 PR Review Comment: https://git.openjdk.org/jdk/pull/26885#discussion_r2355238196 PR Review Comment: https://git.openjdk.org/jdk/pull/26885#discussion_r2355246768 PR Review Comment: 
https://git.openjdk.org/jdk/pull/26885#discussion_r2355239424 PR Review Comment: https://git.openjdk.org/jdk/pull/26885#discussion_r2355240287 PR Review Comment: https://git.openjdk.org/jdk/pull/26885#discussion_r2355241763 PR Review Comment: https://git.openjdk.org/jdk/pull/26885#discussion_r2355248929 PR Review Comment: https://git.openjdk.org/jdk/pull/26885#discussion_r2355249864 PR Review Comment: https://git.openjdk.org/jdk/pull/26885#discussion_r2355251161 PR Review Comment: https://git.openjdk.org/jdk/pull/26885#discussion_r2355253694 PR Review Comment: https://git.openjdk.org/jdk/pull/26885#discussion_r2355254950 PR Review Comment: https://git.openjdk.org/jdk/pull/26885#discussion_r2355258668 PR Review Comment: https://git.openjdk.org/jdk/pull/26885#discussion_r2355259835 PR Review Comment: https://git.openjdk.org/jdk/pull/26885#discussion_r2355261549 PR Review Comment: https://git.openjdk.org/jdk/pull/26885#discussion_r2355263778 From epeter at openjdk.org Wed Sep 17 14:35:23 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 17 Sep 2025 14:35:23 GMT Subject: RFR: 8367657: C2 SuperWord: NormalMapping demo from JVMLS 2025 [v6] In-Reply-To: References: Message-ID: > Demo from here: > https://inside.java/2025/08/16/jvmls-hotspot-auto-vectorization/ > > Cleaned up and enhanced with a JTREG and IR test. > I also added some additional "generated" normal maps from height functions. > And I display the resulting image side-by-side with the normal map. > > I decided to put it in a new directory `compiler.gallery`, anticipating other compiler tests that are both visually appealing (i.e. can be used for a "gallery") and that we may want to back up with other tests like IR testing. > > There is a **stand-alone** way to run the demo: > `java test/hotspot/jtreg/compiler/gallery/NormalMapping.java` > (though it may only run with JDK22+, probably due to some Amber features) > > **Quick Performance Numbers**, running on my avx512 laptop.
> default / AVX3: 105 FPS > AVX2: 82 FPS > AVX1: 50 FPS > No vectorization: 19 FPS > GraalJIT: 13 FPS (`jdk-26-ea+5` - probably an issue with vectorization / inlining?) > > Here are some snapshots, but **I really recommend pulling the diff and playing with it, it looks much better in motion**: > [four images] Emanuel Peter has updated the pull request incrementally with two additional commits since the last revision: - Update test/hotspot/jtreg/compiler/gallery/TestNormalMapping.java Co-authored-by: Andrey Turbanov - Update test/hotspot/jtreg/compiler/gallery/NormalMapping.java Co-authored-by: Christian Hagedorn ------------- Changes: - all: https://git.openjdk.org/jdk/pull/27282/files - new: https://git.openjdk.org/jdk/pull/27282/files/806c9379..68ac841a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=27282&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27282&range=04-05 Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/27282.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27282/head:pull/27282 PR: https://git.openjdk.org/jdk/pull/27282 From mhaessig at openjdk.org Wed Sep 17 14:40:16 2025 From: mhaessig at openjdk.org (Manuel Hässig) Date: Wed, 17 Sep 2025 14:40:16 GMT Subject: RFR: 8367389: C2 SuperWord: refactor VTransform to model the whole loop instead of just the basic block [v2] In-Reply-To: References: Message-ID: <5zLWoCC7_s5VBF435fL1hk_m9vsk5JQrdZ1tEipatFo=.bc502b75-7074-4923-8dce-d367eb1b71af@github.com> On Wed, 17 Sep 2025 11:45:34 GMT, Emanuel Peter wrote: >> I'm working on cost-modeling, and am integrating some smaller changes from this proof-of-concept PR: >> https://github.com/openjdk/jdk/pull/20964 >> [See plan overview.](https://bugs.openjdk.org/browse/JDK-8340093) >> >> This is a pure refactoring - no change in behaviour. I'm presenting it like this because it will make reviews easier.
>> >> ------------------------------ >> >> **Goals** >> - VTransform models **all nodes in the loop**, not just the basic block (enables later VTransform::optimize, like moving reductions out of the loop) >> - Remove `_nodes` from the vector vtnodes. >> >> **Details** >> - Remove: `AUTO_VECTORIZATION2_AFTER_REORDER`, `apply_memops_reordering_with_schedule`, `print_memops_schedule`. >> - Instead of reordering the scalar memops, we create the new memory graph during `VTransform::apply`. That is why the `VTransformApplyState` now needs to track the memory states. >> - Refactor `VLoopMemorySlices`: map not just memory slices with phis (have stores in loop), but also those with only loads (no phi). >> - Create vtnodes for all nodes in the loop (not just the basic block), as well as inputs (already) and outputs (new). Mapping also the output nodes means during `apply`, we naturally connect the uses after the loop to their inputs from the loop (which may be new nodes after the transformation). >> - `_mem_ref_for_main_loop_alignment` -> `_vpointer_for_main_loop_alignment`. Instead of tracking the memory node to later have access to its `VPointer`, we take it directly. That removes one more use of `_nodes` for vector vtnodes. >> >> I also made a lot of annotations in the code below, for easier review. >> >> **Suggested order for review** >> - Removal of `VTransformGraph::apply_memops_reordering_with_schedule` -> sets up need to build memory graph on the fly. >> - Old and new code for `VLoopMemorySlices` -> we now also track load-only slices. >> - `build_scalar_vtnodes_for_non_packed_nodes`, `build_inputs_for_scalar_vtnodes`, `build_uses_after_loop`, `apply_vtn_inputs_to_node` (use in `apply`), `apply_backedge`, `fix_memory_state_uses_after_loop` >> - `VTransformApplyState`: how it now tracks the memory state. >> - `VTransformVectorNode` -> removal of `_nodes` (Big Win!) >> - Then look at all the other details. 
> > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > for Manuel Thank you for addressing my comments and answering my question. Bar the new typo, this looks good to me. src/hotspot/share/opto/vectorization.cpp line 215: > 213: Compile* C = _vloop.phase()->C; > 214: // We iterate over the body, which is topologically sorted. Hence, if there is a phi > 215: // in a slice, we will find it first, and the loads and stres afterwards. Suggestion: // in a slice, we will find it first, and the loads and stores afterwards. ------------- Marked as reviewed by mhaessig (Committer). PR Review: https://git.openjdk.org/jdk/pull/27208#pullrequestreview-3234807557 PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2355753434 From chagedorn at openjdk.org Wed Sep 17 15:04:50 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 17 Sep 2025 15:04:50 GMT Subject: RFR: 8367278: Test compiler/startup/StartupOutput.java timed out after completion on Windows [v2] In-Reply-To: References: Message-ID: On Wed, 17 Sep 2025 14:17:13 GMT, Damon Fenacci wrote: >> ## Problem >> After [JDK-8260555](https://bugs.openjdk.org/browse/JDK-8260555) changed the default TIMEOUT_FACTOR from 4 to 1, the test compiler/startup/StartupOutput.java can occasionally slightly exceed the 2-minute timeout on Windows. >> >> ## Change >> Rather than increasing the timeout, this change reduces the number of VM runs with randomly generated near-minimum code cache sizes from 200 to 50. This should still provide sufficient coverage while keeping execution well within the timeout. >> >> ## Testing: >> Tiers 1-3+ > > Damon Fenacci has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains three additional commits since the last revision: > > - Merge branch 'master' into JDK-8367278 > - JDK-8367278: reduce loop to 50 cycles > - JDK-8367278: Test compiler/startup/StartupOutput.java timed out after completion on Windows That looks reasonable to me, thanks for fixing it! ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/27254#pullrequestreview-3234935299 From epeter at openjdk.org Wed Sep 17 15:22:38 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 17 Sep 2025 15:22:38 GMT Subject: RFR: 8347555: [REDO] C2: implement optimization for series of Add of unique value [v18] In-Reply-To: <53Ado9oN1yU5hgOPU2feecxsArD5yoycn09ZWPNK4AQ=.69035bde-9bec-442e-8dc2-ddd268df9d07@github.com> References: <53Ado9oN1yU5hgOPU2feecxsArD5yoycn09ZWPNK4AQ=.69035bde-9bec-442e-8dc2-ddd268df9d07@github.com> Message-ID: On Tue, 26 Aug 2025 14:47:31 GMT, Kangcheng Xu wrote: >> [JDK-8347555](https://bugs.openjdk.org/browse/JDK-8347555) is a redo of [JDK-8325495](https://bugs.openjdk.org/browse/JDK-8325495), which was [first merged](https://git.openjdk.org/jdk/pull/20754) and then backed out due to a regression. This patch redoes the feature and fixes the bit shift overflow problem. For more information please refer to the previous PR. >> >> When constantizing multiplications (possibly in the form of `lshifts`), the multiplier is upgraded to long and then later narrowed to int if needed. However, when a `lshift` operand is exactly `32`, overflowing an int, using long has an unexpected result (i.e., `(1 << 32) = 1` and `(int) (1L << 32) = 0`). >> >> The following was implemented to address this issue. >> >> if (UseNewCode2) { >> *multiplier = bt == T_INT >> ? (jlong) (1 << con->get_int()) // loss of precision is expected for int as it overflows >> : ((jlong) 1) << con->get_int(); >> } else { >> *multiplier = ((jlong) 1 << con->get_int()); >> } >> >> >> Two new bitshift overflow tests were added.
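The two shift results quoted in the PR description above follow from Java's shift semantics: an `int` shift uses only the low five bits of the shift distance, whereas a `long` shift uses the low six. A minimal Java sketch (the class and method names are made up for illustration and are not part of the patch) shows why widening the constant to `long` before shifting changes the result once the distance reaches 32:

```java
// Illustrative only: ShiftOverflowDemo and its methods are hypothetical names,
// not code from the PR.
class ShiftOverflowDemo {
    // int shift: the distance is masked with 0x1f, so a distance of 32 behaves like 0.
    static int shiftAsInt(int distance) {
        return 1 << distance;
    }

    // long shift narrowed to int: the distance is masked with 0x3f,
    // so bit 32 is set and then discarded by the narrowing cast.
    static int shiftAsLongThenNarrow(int distance) {
        return (int) (1L << distance);
    }

    public static void main(String[] args) {
        for (int distance : new int[] {4, 31, 32, 33}) {
            System.out.println(distance + ": int=" + shiftAsInt(distance)
                    + " long-narrowed=" + shiftAsLongThenNarrow(distance));
        }
    }
}
```

For distances below 32 the two methods agree; from 32 onwards they diverge, which appears to be the situation the new overflow tests are meant to cover.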
> > Kangcheng Xu has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 67 commits: > > - Merge branch 'openjdk:master' into arithmetic-canonicalization > - Merge remote-tracking branch 'origin/master' into arithmetic-canonicalization > - Allow swapping LHS/RHS in case not matched > - Merge branch 'refs/heads/master' into arithmetic-canonicalization > - improve comment readability and struct helper functions > - remove asserts, add more documentation > - fix typo: lhs->rhs > - update comments > - use java_add to avoid cpp overflow UB > - add assertion for MulLNode too > - ... and 57 more: https://git.openjdk.org/jdk/compare/173dedfb...7bb7e645 @tabjy Thanks for the ping. Sorry I did not respond earlier. I was hoping others would continue the review, but it seems it got stuck on me here, a classic though unfortunate pattern ;) @rwestrel asked me if I wanted to continue reviewing. I'm going on a 3-week vacation, so feel free to ask others to review. -------------------------------- I'll summarize my thoughts now, so others can review the PR in my absence: - The PR looks much better now, we have made good progress. - I'm still sad that we are not covering cases like `a * CON1 + a * CON2`, or other patterns that could be collapsed to `a * CON`. But I do understand that this would require some recursive approach, and that could be a little more difficult. ----------------------------------- I'll leave it at this, and hope that others will review. src/hotspot/share/opto/addnode.cpp line 424: > 422: // Note this also converts, for example, original expression `(a*3) + a` into `4*a` and `(a<<2) + a` into `5*a`. A more > 423: // generalized pattern `(a*b) + (a*c)` into `a*(b + c)` is handled by AddNode::IdealIL(). > 424: Node* AddNode::convert_serial_additions(PhaseGVN* phase, BasicType bt) { The name `convert_serial_additions` now seems a bit off. Because we really cover a lot of other cases too.
Really you cover `a + pattern` and `pattern + a`, where `pattern` is one of the cases from `find_serial_addition_patterns`. Maybe it could be called `AddNode::Ideal_collapse_variable_times_con`. Because in the end you want to find cases that are equivalent to `a * some_con`. Lead the documentation with that as well, rather than the series of additions. Because the series of additions is not the pattern you actually match here. The series of additions is only one of the use-cases, and there are others. src/hotspot/share/opto/addnode.cpp line 442: > 440: return nullptr; > 441: } > 442: } Nice, thanks for adding it! I think it would be nice if we renamed `find_serial_addition_patterns` so that it is clear that we are looking for `a + a * con` or `con*a + a`. Because currently it is not directly clear why we need the swapping from the method name. src/hotspot/share/opto/addnode.cpp line 456: > 454: // - (3) Simple multiplication: LHS = CON * a > 455: // - (4) Power-of-two addition: LHS = (a << CON1) + (a << CON2) > 456: AddNode::Multiplication AddNode::find_serial_addition_patterns(const Node* lhs, const Node* rhs, BasicType bt) { Here, we have `rhs = a`, right? I'd suggest just renaming the method arguments `rhs`->`a` and `lhs`->`pattern`. Because you already call (1) - (4) patterns in the documentation. That would be a good fit :) src/hotspot/share/opto/addnode.cpp line 544: > 542: // - (2) AddNode(LShiftNode(a, CON), a) > 543: // - (3) AddNode(a, LShiftNode(a, CON)) > 544: // - (4) AddNode(a, a) You could drop the `Node` part from the cases here, to make it a bit more concise. test/hotspot/jtreg/compiler/c2/TestSerialAdditions.java line 24: > 22: */ > 23: > 24: package compiler.c2; I would put the test in a more specific directory. 
I think the `igvn` directory would be a good candidate, because `Ideal` is part of IGVN ;) test/hotspot/jtreg/compiler/c2/TestSerialAdditions.java line 38: > 36: * @test > 37: * @bug 8325495 8347555 > 38: * @summary C2 should optimize for series of Add of unique value. e.g., a + a + ... + a => a*n You may want to change the summary here, and also the PR summary. Because you really do not just do these series of additions, but lots of other cases as well. The examples below suggest that too ;) test/hotspot/jtreg/compiler/c2/TestSerialAdditions.java line 334: > 332: private static long randomPowerOfTwoAdditionL(long a) { > 333: return a * CON1_L + a * CON2_L + a * CON3_L + a * CON4_L; > 334: } Nice, thanks for these :) ------------- Changes requested by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23506#pullrequestreview-3234938073 PR Review Comment: https://git.openjdk.org/jdk/pull/23506#discussion_r2355842882 PR Review Comment: https://git.openjdk.org/jdk/pull/23506#discussion_r2355851866 PR Review Comment: https://git.openjdk.org/jdk/pull/23506#discussion_r2355855659 PR Review Comment: https://git.openjdk.org/jdk/pull/23506#discussion_r2355859306 PR Review Comment: https://git.openjdk.org/jdk/pull/23506#discussion_r2355868028 PR Review Comment: https://git.openjdk.org/jdk/pull/23506#discussion_r2355865694 PR Review Comment: https://git.openjdk.org/jdk/pull/23506#discussion_r2355871718 From epeter at openjdk.org Wed Sep 17 15:22:40 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 17 Sep 2025 15:22:40 GMT Subject: RFR: 8347555: [REDO] C2: implement optimization for series of Add of unique value [v18] In-Reply-To: References: <53Ado9oN1yU5hgOPU2feecxsArD5yoycn09ZWPNK4AQ=.69035bde-9bec-442e-8dc2-ddd268df9d07@github.com> Message-ID: On Wed, 17 Sep 2025 15:08:43 GMT, Emanuel Peter wrote: >> Kangcheng Xu has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains 67 commits: >> >> - Merge branch 'openjdk:master' into arithmetic-canonicalization >> - Merge remote-tracking branch 'origin/master' into arithmetic-canonicalization >> - Allow swapping LHS/RHS in case not matched >> - Merge branch 'refs/heads/master' into arithmetic-canonicalization >> - improve comment readability and struct helper functions >> - remove asserts, add more documentation >> - fix typo: lhs->rhs >> - update comments >> - use java_add to avoid cpp overflow UB >> - add assertion for MulLNode too >> - ... and 57 more: https://git.openjdk.org/jdk/compare/173dedfb...7bb7e645 > > src/hotspot/share/opto/addnode.cpp line 544: > >> 542: // - (2) AddNode(LShiftNode(a, CON), a) >> 543: // - (3) AddNode(a, LShiftNode(a, CON)) >> 544: // - (4) AddNode(a, a) > > You could drop the `Node` part from the cases here, to make it a bit more concise. Alternatively, you could do it with the `<<` operator like you did in `find_serial_addition_patterns`. I think that would be more consistent. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23506#discussion_r2355861249 From cslucas at openjdk.org Wed Sep 17 16:54:51 2025 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Wed, 17 Sep 2025 16:54:51 GMT Subject: RFR: 8361699: C2: assert(can_reduce_phi(n->as_Phi())) failed: Sanity: previous reducible Phi is no longer reducible before SUT In-Reply-To: References: <1uDOe3Oe-hihmDHea2h8vcvRZsKKBeNp0J9lKYUujxk=.abd111bc-3625-4c71-bfa2-0a4c1f4d3875@github.com> <2brDXuLmbVBVRaeSyCdKokA706v3t6VsZfGvj_QceJ4=.4483390e-c726-4d82-b220-f1dbdf4efef0@github.com> Message-ID: On Thu, 11 Sep 2025 07:41:55 GMT, Roberto Castañeda Lozano wrote: >>> @robcasloz - are you thinking that the "fixed point" loops on `find_scalar_replaceable_allocs` aren't sufficient? >> >> You're right, that should do. >> >>> At first glance yes, I think that the code would be more cleaned up if done that way. 
If the code had been written like that in the first place we wouldn't have seen the current issue. (...) >> >> Agree, a single fixed point loop combining NSR detection and propagation would be ideal for clarity and maintainability. >> >>> I propose that we move forward with the current patch and work on this refactoring as a separate issue. >> >> Sounds good, please file an RFE for that. I would suggest then to postpone the clean-up in `revisit_reducible_phi_status` to that RFE. > >> @robcasloz - I pushed some changes addressing your and @eme64's comments. Could you please re-run your internal tests? > > Thanks, I will report back within a couple of days. Thank you @robcasloz; I'll start working on that early next week. ------------- PR Comment: https://git.openjdk.org/jdk/pull/27063#issuecomment-3303827082 From cslucas at openjdk.org Wed Sep 17 16:54:53 2025 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Wed, 17 Sep 2025 16:54:53 GMT Subject: Integrated: 8361699: C2: assert(can_reduce_phi(n->as_Phi())) failed: Sanity: previous reducible Phi is no longer reducible before SUT In-Reply-To: References: Message-ID: <8WNEENPja-dRlOVY3Bchz8n_eN-3brvuNzauem5SWIU=.0c69796f-5775-4f39-8309-9bc7d3b917eb@github.com> On Wed, 3 Sep 2025 00:53:59 GMT, Cesar Soares Lucas wrote: > Please review this patch to fix an issue that may occur when reducing allocation merges. > > As the assert message describes, the problem is that a `Phi` considered reducible during one invocation of `adjust_scalar_replaceable_state` later turned out to be non-reducible. This situation can happen if a subsequent invocation of the same method causes all inputs to the phi to be NSR; therefore there is no point in reducing the Phi. It can also happen during the propagation of NSR state done by `find_scalar_replaceable_allocs`. > > The change in `revisit_reducible_phi_status` is just a clean-up. > The real fix is in `find_scalar_replaceable_allocs`. 
> > Tested on Linux x64/Aarch64 release/fastdebug with JTREG tier1-3. This pull request has now been integrated. Changeset: 6f493b4d Author: Cesar Soares Lucas URL: https://git.openjdk.org/jdk/commit/6f493b4d2e7120cbe34fb70d595f7626655b47a9 Stats: 71 lines in 2 files changed: 71 ins; 0 del; 0 mod 8361699: C2: assert(can_reduce_phi(n->as_Phi())) failed: Sanity: previous reducible Phi is no longer reducible before SUT Reviewed-by: rcastanedalo ------------- PR: https://git.openjdk.org/jdk/pull/27063 From bulasevich at openjdk.org Wed Sep 17 18:08:58 2025 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Wed, 17 Sep 2025 18:08:58 GMT Subject: RFR: 8338197: ubsan: ad_x86.hpp:6417:11: runtime error: shift exponent 100 is too large for 32-bit type 'unsigned int' [v4] In-Reply-To: References: Message-ID: > This reworks the recent update https://github.com/openjdk/jdk/pull/24696 to fix a UBSan issue on aarch64. The problem now reproduces on x86_64 as well, which suggests the previous update was not optimal. > > The issue reproduces with a HeapByteBufferTest jtreg test on a UBSan-enabled build. Actually the trigger is `XX:+OptoScheduling` option used by test (by default OptoScheduling is disabled on most x86 CPUs). With the option enabled, the failure can be reproduced with a simple `java -version` run. > > This fix is in ADLC-generated code. For simplicity, the examples below show the generated fragments. > > The problems is that shift count `n` may be too large here: > > class Pipeline_Use_Cycle_Mask { > protected: > uint _mask; > .. > Pipeline_Use_Cycle_Mask& operator<<=(int n) { > _mask <<= n; > return *this; > } > }; > > The recent change attempted to cap the shift amount at one call site: > > class Pipeline_Use_Element { > protected: > .. > // Mask of specific used cycles > Pipeline_Use_Cycle_Mask _mask; > .. > void step(uint cycles) { > _used = 0; > uint max_shift = 8 * sizeof(_mask) - 1; > _mask <<= (cycles < max_shift) ? 
cycles : max_shift; > } > } > > However, there is another site where `Pipeline_Use_Cycle_Mask::operator<<=` can be called with a too-large shift count: > > // The following two routines assume that the root Pipeline_Use entity > // consists of exactly 1 element for each functional unit > // start is relative to the current cycle; used for latency-based info > uint Pipeline_Use::full_latency(uint delay, const Pipeline_Use &pred) const { > for (uint i = 0; i < pred._count; i++) { > const Pipeline_Use_Element *predUse = pred.element(i); > if (predUse->_multiple) { > uint min_delay = 7; > // Multiple possible functional units, choose first unused one > for (uint j = predUse->_lb; j <= predUse->_ub; j++) { > const Pipeline_Use_Element *currUse = element(j); > uint curr_delay = delay; > if (predUse->_used & currUse->_used) { > Pipeline_Use_Cycle_Mask x = predUse->_mask; > Pipeline_Use_Cycle_Mask y = currUse->_mask; > > for ( y <<= curr_delay; x.overlaps(y); curr_delay++ ) > y <<= 1; > } > if (min_delay > curr_delay) > min_delay = curr_delay; > } > if (delay < min_delay) > delay = min_delay; > } > else { > for (uint j = predUse->_lb; j <= predUse->_ub; j++) { > const Pipeline_Use_Element *currUse = element(j); > if (predUse->_used & currUse->_used) { > ... 
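A minimal sketch of the hazard described in the quoted description, assuming a 32-bit mask: in C++, shifting a 32-bit unsigned value by 32 or more is undefined behavior, so a count such as 100 has to be handled before the shift executes. The class below is a hypothetical stand-in for the ADLC-generated `Pipeline_Use_Cycle_Mask`; the names `CycleMask`, `bits`, `first_free_delay` and `self_check`, and the saturate-to-zero policy, are illustrative assumptions and not the code in this PR. Guarding inside the operator, rather than at one call site, covers every caller, including loops shaped like the one in `full_latency`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the generated Pipeline_Use_Cycle_Mask (not the PR's code).
class CycleMask {
 private:
  uint32_t _mask;

 public:
  explicit CycleMask(uint32_t mask) : _mask(mask) {}

  uint32_t bits() const { return _mask; }

  // Shifting a 32-bit value by 32 or more is undefined behavior in C++,
  // so an over-wide shift is treated as "every bit shifted out".
  CycleMask& operator<<=(uint32_t n) {
    _mask = (n < 32u) ? (_mask << n) : 0u;
    return *this;
  }

  bool overlaps(const CycleMask& other) const {
    return (_mask & other._mask) != 0;
  }
};

// Same shape as the loop quoted above: shift y until it no longer overlaps x.
// Terminates because y becomes zero after at most 32 single-bit shifts.
inline uint32_t first_free_delay(CycleMask x, CycleMask y, uint32_t delay) {
  for (y <<= delay; x.overlaps(y); delay++) {
    y <<= 1;
  }
  return delay;
}

// Small self-check: a shift count of 100 is well defined with the guard in place.
inline void self_check() {
  CycleMask m(0x1u);
  m <<= 100;  // would be undefined behavior on a raw uint32_t
  assert(m.bits() == 0u);
}
```

Saturating to zero is only one possible policy; capping the shift count at the call site, or keeping latencies below the mask width so that large counts never reach the operator, are alternatives with different trade-offs.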
Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: reduce fixed_latency(100) to fixed_latency(30) for calls/traps on ARM, PPC, RISC-V, X86 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26890/files - new: https://git.openjdk.org/jdk/pull/26890/files/e3ac8703..16d28c6d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26890&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26890&range=02-03 Stats: 8 lines in 4 files changed: 0 ins; 0 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/26890.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26890/head:pull/26890 PR: https://git.openjdk.org/jdk/pull/26890 From bulasevich at openjdk.org Wed Sep 17 18:32:10 2025 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Wed, 17 Sep 2025 18:32:10 GMT Subject: RFR: 8338197: ubsan: ad_x86.hpp:6417:11: runtime error: shift exponent 100 is too large for 32-bit type 'unsigned int' [v5] In-Reply-To: References: Message-ID: > This reworks the recent update https://github.com/openjdk/jdk/pull/24696 to fix a UBSan issue on aarch64. The problem now reproduces on x86_64 as well, which suggests the previous update was not optimal. > > The issue reproduces with a HeapByteBufferTest jtreg test on a UBSan-enabled build. Actually the trigger is `XX:+OptoScheduling` option used by test (by default OptoScheduling is disabled on most x86 CPUs). With the option enabled, the failure can be reproduced with a simple `java -version` run. > > This fix is in ADLC-generated code. For simplicity, the examples below show the generated fragments. > > The problems is that shift count `n` may be too large here: > > class Pipeline_Use_Cycle_Mask { > protected: > uint _mask; > .. > Pipeline_Use_Cycle_Mask& operator<<=(int n) { > _mask <<= n; > return *this; > } > }; > > The recent change attempted to cap the shift amount at one call site: > > class Pipeline_Use_Element { > protected: > .. 
> // Mask of specific used cycles > Pipeline_Use_Cycle_Mask _mask; > .. > void step(uint cycles) { > _used = 0; > uint max_shift = 8 * sizeof(_mask) - 1; > _mask <<= (cycles < max_shift) ? cycles : max_shift; > } > } > > However, there is another site where `Pipeline_Use_Cycle_Mask::operator<<=` can be called with a too-large shift count: > > // The following two routines assume that the root Pipeline_Use entity > // consists of exactly 1 element for each functional unit > // start is relative to the current cycle; used for latency-based info > uint Pipeline_Use::full_latency(uint delay, const Pipeline_Use &pred) const { > for (uint i = 0; i < pred._count; i++) { > const Pipeline_Use_Element *predUse = pred.element(i); > if (predUse->_multiple) { > uint min_delay = 7; > // Multiple possible functional units, choose first unused one > for (uint j = predUse->_lb; j <= predUse->_ub; j++) { > const Pipeline_Use_Element *currUse = element(j); > uint curr_delay = delay; > if (predUse->_used & currUse->_used) { > Pipeline_Use_Cycle_Mask x = predUse->_mask; > Pipeline_Use_Cycle_Mask y = currUse->_mask; > > for ( y <<= curr_delay; x.overlaps(y); curr_delay++ ) > y <<= 1; > } > if (min_delay > curr_delay) > min_delay = curr_delay; > } > if (delay < min_delay) > delay = min_delay; > } > else { > for (uint j = predUse->_lb; j <= predUse->_ub; j++) { > const Pipeline_Use_Element *currUse = element(j); > if (predUse->_used & currUse->_used) { > ... Boris Ulasevich has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains four commits: - reduce fixed_latency(100) to fixed_latency(30) for calls/traps on ARM, PPC, RISC-V, X86 - use uint32_t for _mask - remove redundant code - 8338197: ubsan: ad_x86.hpp:6417:11: runtime error: shift exponent 100 is too large for 32-bit type 'unsigned int' ------------- Changes: https://git.openjdk.org/jdk/pull/26890/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26890&range=04 Stats: 25 lines in 5 files changed: 0 ins; 6 del; 19 mod Patch: https://git.openjdk.org/jdk/pull/26890.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26890/head:pull/26890 PR: https://git.openjdk.org/jdk/pull/26890 From vlivanov at openjdk.org Wed Sep 17 19:34:46 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 17 Sep 2025 19:34:46 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v11] In-Reply-To: References: <1pShdyn-7-wwwiuY1DdMt5iiZ2qc9l_x2F-3AKqkg60=.dd260953-05cc-4b84-b6d1-7f684e74084c@github.com> Message-ID: On Tue, 16 Sep 2025 01:24:35 GMT, Dean Long wrote: >> Could we also bail out here? Or what would happen now in production if there is a RF edge? > > We also use this area past endoff() for storing the "ex_oop" (see for example GraphKit::has_saved_ex_oop()). Are ex_oop and reachability edges mutually exclusive? Yes, ex_oop and reachability edges are mutually exclusive, but there's no conflict. ex_oop is kept during parsing while reachability edges stay attached to RF nodes until loop optimizations are over (and no inlining can happen anymore). 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2356550410 From vlivanov at openjdk.org Wed Sep 17 19:41:40 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 17 Sep 2025 19:41:40 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v11] In-Reply-To: References: <1pShdyn-7-wwwiuY1DdMt5iiZ2qc9l_x2F-3AKqkg60=.dd260953-05cc-4b84-b6d1-7f684e74084c@github.com> Message-ID: <7s8qppZ6lzq5iN-inRFkFuXgElo46UmYyIrvExOLA3A=.cf76da61-89ee-4d29-9b5a-0b6e7b3bac2b@github.com> On Fri, 12 Sep 2025 13:47:49 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/loopnode.cpp line 5341: >> >>> 5339: C->print_method(PHASE_ELIMINATE_REACHABILITY_FENCES, 2); >>> 5340: assert(C->reachability_fences_count() == 0, "no RF nodes allowed"); >>> 5341: } >> >> Can we somehow assert that we now really will never do loop-opts again? >> Why are you checking for `_mode == LoopOptsDefaultFinal` and not for `LoopOptsEliminateRFs`? >> If that was a bug, then more verification would be extra justified ;) > > Otherwise, please explain the meaning of `LoopOptsDefaultFinal`. Maybe it should be an OR here? > Why are you checking for _mode == LoopOptsDefaultFinal and not for LoopOptsEliminateRFs? The intention is to avoid an extra `PhaseIdealLoop` construction pass solely for `LoopOptsEliminateRFs` purposes when there's an empty pass during normal flow of loop optimizations. `LoopOptsEliminateRFs` is performed as the last resort when there was no previous pass to piggyback on. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2356569319 From vlivanov at openjdk.org Wed Sep 17 19:47:21 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 17 Sep 2025 19:47:21 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v11] In-Reply-To: <7s8qppZ6lzq5iN-inRFkFuXgElo46UmYyIrvExOLA3A=.cf76da61-89ee-4d29-9b5a-0b6e7b3bac2b@github.com> References: <1pShdyn-7-wwwiuY1DdMt5iiZ2qc9l_x2F-3AKqkg60=.dd260953-05cc-4b84-b6d1-7f684e74084c@github.com> <7s8qppZ6lzq5iN-inRFkFuXgElo46UmYyIrvExOLA3A=.cf76da61-89ee-4d29-9b5a-0b6e7b3bac2b@github.com> Message-ID: On Wed, 17 Sep 2025 19:38:29 GMT, Vladimir Ivanov wrote: >> Otherwise, please explain the meaning of `LoopOptsDefaultFinal`. Maybe it should be an OR here? > >> Why are you checking for _mode == LoopOptsDefaultFinal and not for LoopOptsEliminateRFs? > > The intention is to avoid an extra `PhaseIdealLoop` construction pass solely for `LoopOptsEliminateRFs` purposes when there's an empty pass during normal flow of loop optimizations. > > `LoopOptsEliminateRFs` is performed as the last resort when there was no previous pass to piggyback on. Maybe `LoopOptsEliminateRFs` should stress that it is intended to happen as the very last step in the flow of loop optimizations. Or, something happening after all other loop optimizations are over. I'll think more about it. From a code perspective, what makes things more complicated is that the `PhaseIdealLoop` instance is hidden in `PhaseIdealLoop::optimize()`, so shaping it as a step in the loop opts pipeline feels like the most appropriate thing to do. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2356597895 From vlivanov at openjdk.org Wed Sep 17 19:51:25 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 17 Sep 2025 19:51:25 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v8] In-Reply-To: <4jTV6y9R_JfATA54LC7FK3DKdBX1srsU09DK1I25Uo0=.94233927-71f2-4f13-894d-206d00f5fdaa@github.com> References: <4jTV6y9R_JfATA54LC7FK3DKdBX1srsU09DK1I25Uo0=.94233927-71f2-4f13-894d-206d00f5fdaa@github.com> Message-ID: On Fri, 12 Sep 2025 13:55:38 GMT, Emanuel Peter wrote: >> Yes, maybe say what the general problem is, and make a concrete example. I'm currently a bit struggling to think of one that is relevant. > > Ah yes: we may for example move a store out (after) the loop. But wait. We can't move a store across a SafePoint, so that's not a good example. For example, loads suffer from the same problems as stores, but constraints on them are more lax. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2356613585 From vlivanov at openjdk.org Wed Sep 17 19:56:19 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 17 Sep 2025 19:56:19 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v11] In-Reply-To: References: <1pShdyn-7-wwwiuY1DdMt5iiZ2qc9l_x2F-3AKqkg60=.dd260953-05cc-4b84-b6d1-7f684e74084c@github.com> Message-ID: On Wed, 17 Sep 2025 01:06:50 GMT, Dean Long wrote: >> src/hotspot/share/opto/reachability.cpp line 81: >> >>> 79: * (c) Unfortunately, it's not straightforward to stay with safepoint-attached representation till the very end, >>> 80: * because information about derived oops is attached to safepoints in a similar way. So, for now RFs are >>> 81: * rematerialized at safepoints before RA (phase #3). >> >> I still don't understand this. What is similar to what? And why is that a problem? > > Why don't we put RF edges somewhere else, so they don't look like derived oops? 
I was thinking they could go in the monitor area, or if that causes problems, we introduce a new area. It's solely an implementation limitation. As of now, the only structure imposed on safepoint inputs relates to debug info (represented as JVMState). The rest is ad hoc and there are many conflicting use cases introduced over time. The proper way to address it is to introduce a proper structure for non-debug inputs, but it requires significant engineering effort to properly handle it across the whole compilation pipeline. For now, I just work around it by performing an additional transformation to avoid conflicts with existing functionality. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2356629144 From vlivanov at openjdk.org Wed Sep 17 20:22:20 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 17 Sep 2025 20:22:20 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v12] In-Reply-To: References: Message-ID: > This PR introduces C2 support for `Reference.reachabilityFence()`. > > After [JDK-8199462](https://bugs.openjdk.org/browse/JDK-8199462) went in, it was discovered that C2 may break the invariant the fix relied upon [1]. So, this is an attempt to introduce proper support for `Reference.reachabilityFence()` in C2. C1 is left intact for now, because there are no signs yet it is affected. > > `Reference.reachabilityFence()` can be used in performance-critical code, so the primary goal for C2 is to reduce its runtime overhead as much as possible. The ultimate goal is to ensure liveness information is attached to interfering safepoints, but it takes multiple steps to properly propagate the information through the compilation pipeline without negatively affecting generated code quality. > > Also, I don't consider this fix as complete. It does fix the reported problem, but it doesn't provide any strong guarantees yet. 
In particular, since `ReachabilityFence` is CFG-only node, nothing explicitly forbids memory operations to float past `Reference.reachabilityFence()` and potentially reaching some other safepoints current analysis treats as non-interfering. Representing `ReachabilityFence` as memory barrier (e.g., `MemBarCPUOrder`) would solve the issue, but performance costs are prohibitively high. Alternatively, the optimization proposed in this PR can be improved to conservatively extend referent's live range beyond `ReachabilityFence` nodes associated with it. It would meet performance criteria, but I prefer to implement it as a followup fix. > > Another known issue relates to reachability fences on constant oops. If such constant is GCed (most likely, due to a bug in Java code), similar reachability issues may arise. For now, RFs on constants are treated as no-ops, but there's a diagnostic flag `PreserveReachabilityFencesOnConstants` to keep the fences. I plan to address it separately. > > [1] https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/ref/Reference.java#L667 > "HotSpot JVM retains the ref and does not GC it before a call to this method, because the JIT-compilers do not have GC-only safepoints." 
> > Testing: > - [x] hs-tier1 - hs-tier8 > - [x] hs-tier1 - hs-tier6 w/ -XX:+StressReachabilityFences -XX:+VerifyLoopOptimizations > - [x] java/lang/foreign microbenchmarks Vladimir Ivanov has updated the pull request incrementally with one additional commit since the last revision: Add PreserveReachabilityFencesOnConstants test ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25315/files - new: https://git.openjdk.org/jdk/pull/25315/files/01eaf64f..dc37ccad Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25315&range=11 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25315&range=10-11 Stats: 134 lines in 5 files changed: 130 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/25315.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25315/head:pull/25315 PR: https://git.openjdk.org/jdk/pull/25315 From vlivanov at openjdk.org Wed Sep 17 20:22:21 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 17 Sep 2025 20:22:21 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v2] In-Reply-To: References: <0WKwHjzEn5dxYLkonrk4h9yfMI3r3bKDdqgG06J69N4=.e19e9441-6197-4d53-a4f4-b196a81f69d8@github.com> <1FgOFS7aAlEbvVUez6iTfzgf2l7qUbL9C4wfSGmmfo0=.406c10f1-63d5-4333-af6d-525e46203182@github.com> Message-ID: On Mon, 15 Sep 2025 22:57:51 GMT, Dean Long wrote: >> @eme64 I think I addressed/answered all your suggestions/questions. Please, take another look. Thanks! > > @iwanowww , do you have a test that shows constant oops are a problem? My initial impression is that PreserveReachabilityFencesOnConstants shouldn't be needed, because any oops referenced during the compile should go into the ciEnv metadata[] and then into the nmethod oops. So GC can't reclaim these oops because the nmethod keeps references to them. @dean-long > because any oops referenced during the compile should go into the ciEnv metadata[] and then into the nmethod oop That's not how it behaves in practice. 
OOPs observed during compilation don't necessarily end up in nmethod metadata unless there're explicit usages. > do you have a test that shows constant oops are a problem? I do. Just pushed one example as `test/hotspot/jtreg/compiler/c2/TestReachabilityFenceOnConstant.java`. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25315#issuecomment-3304444302 From vlivanov at openjdk.org Wed Sep 17 20:31:11 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 17 Sep 2025 20:31:11 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v2] In-Reply-To: References: <0WKwHjzEn5dxYLkonrk4h9yfMI3r3bKDdqgG06J69N4=.e19e9441-6197-4d53-a4f4-b196a81f69d8@github.com> <1FgOFS7aAlEbvVUez6iTfzgf2l7qUbL9C4wfSGmmfo0=.406c10f1-63d5-4333-af6d-525e46203182@github.com> Message-ID: On Fri, 12 Sep 2025 14:09:52 GMT, Emanuel Peter wrote: >> @eme64 I think I addressed/answered all your suggestions/questions. Please, take another look. Thanks! > > @iwanowww Thanks for the updates! I again only looked through most comments as well. > > These are the major topics for me: > - `StressReachabilityFences` only inserts RF where they are not needed. So this allows us to test the consistency of the RF machinery, but not to test if we are missing RF where they are needed. That is much harder, and we should probably invest in writing more tests for those cases, even if it is really hard. Maybe we can even write fuzzing tests for it? > - There seems to be missing support for carrying RF edges through incremental inlining, right? File an RFE, or track it elsewhere. Could we create a reproducer for this case / can we extend the existing one? https://github.com/openjdk/jdk/pull/25315#discussion_r2330095168 > - Are we sure that we don't eliminate the RF for the wrong allocation? https://github.com/openjdk/jdk/pull/25315#discussion_r2330230044 > - Extra compile-time due to extra loop-opts round. https://github.com/openjdk/jdk/pull/25315#discussion_r2330176841 . 
It used to be a 20% increase, now you managed to make it only 10%. Still considerable. All of it just to call `get_ctrl(referent)` in `enumerate_interfering_sfpts`. > > I think some of these issues should also be discussed in the PR description / JIRA description. > It would be especially nice if you could summarize the scope of the problem of RF, and which parts are now fixed, and which parts you know are not yet fixed. Of course there may be even more we don't know, but best write everything down we already do know. ;) > > Other ideas: > - You should file an RFE to add your stress flags to the stress job, and also the fuzzer. > - I did not yet study the reproducer `TestReachabilityFence.java`. We should consider making a fuzzer style test out of it, maybe using the template framework. Feel free to just file an RFE for that, and assign it to me. > > @shipilev @TobiHartmann @chhagedorn > I'm soon going on vacation (in a week), and so I'd like the other reviewers to be aware of these issues. > I don't want to hold up the patch, so feel free to have someone else review. But I'm also happy to come back to this mid October. @eme64 > There seems to be missing support for carrying RF edges through incremental inlining, right? File an RFE, or track it elsewhere. Could we create a reproducer for this case / can we extend the existing one? https://github.com/openjdk/jdk/pull/25315#discussion_r2330095168 There's no problem there. Safepoint-attached reachability edges are introduced when no inlining is allowed. (There's one case when virtual calls can be strength-reduced to direct calls very late -- `Compile::process_late_inline_calls_no_inline()`, but such transformation is simply disabled for now when reachability edges are present.) 
------------- PR Comment: https://git.openjdk.org/jdk/pull/25315#issuecomment-3304467597 From vlivanov at openjdk.org Wed Sep 17 21:32:06 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 17 Sep 2025 21:32:06 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v2] In-Reply-To: References: <0WKwHjzEn5dxYLkonrk4h9yfMI3r3bKDdqgG06J69N4=.e19e9441-6197-4d53-a4f4-b196a81f69d8@github.com> <1FgOFS7aAlEbvVUez6iTfzgf2l7qUbL9C4wfSGmmfo0=.406c10f1-63d5-4333-af6d-525e46203182@github.com> Message-ID: On Fri, 12 Sep 2025 14:09:52 GMT, Emanuel Peter wrote: > Extra compile-time due to extra loop-opts round. https://github.com/openjdk/jdk/pull/25315#discussion_r2330176841 . It used to be a 20% increase, now you managed to make it only 10%. Still considerable. FTR, the 10% increase in loop opts time is observed with `-XX:+StressReachabilityFences`. > All of it just to call get_ctrl(referent) in enumerate_interfering_sfpts. Well, I wouldn't frame it in such a way. The RF elimination transformation relies on dominance information computed by `PhaseIdealLoop` to produce a control input for each referent. And there are many other transformations under `PhaseIdealLoop` which "just" rely on the dominance info it produces. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25315#issuecomment-3304621503 From vlivanov at openjdk.org Wed Sep 17 21:39:45 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 17 Sep 2025 21:39:45 GMT Subject: RFR: 8367333: C2: Vector math operation intrinsification failure [v2] In-Reply-To: References: Message-ID: On Tue, 16 Sep 2025 20:09:18 GMT, Vladimir Ivanov wrote: >> As part of [JDK-8353786](https://bugs.openjdk.org/browse/JDK-8353786), C2 support for operations backed by the vector math library was completely removed. On the JDK side, there is special dispatching logic added to avoid intrinsic calls in `jdk.internal.vm.vector.VectorSupport`. 
But it's still possible to observe such paradoxical situations (intrinsic calls with obsolete operation IDs) when processing effectively dead code. >> >> Consider `FloatVector::lanewiseTemplate`: >> >> FloatVector lanewiseTemplate(VectorOperators.Unary op) { >> if (opKind(op, VO_SPECIAL)) { >> ... >> else if (opKind(op, VO_MATHLIB)) { >> return unaryMathOp(op); >> } >> } >> int opc = opCode(op); >> return VectorSupport.unaryOp(opc, ...); >> } >> >> >> At runtime, `unaryMathOp` is unconditionally invoked, but during compilation it's possible to end up with an intrinsification attempt of `VectorSupport.unaryOp()` before `opKind(op, VO_SPECIAL)` is inlined. >> >> It can be reliably reproduced `-XX:+StressIncrementalInlining` flag. >> >> The fix is to fail-fast intrinsification rather than crashing the VM. >> >> Testing: tier1 - tier4 > > Vladimir Ivanov has updated the pull request incrementally with one additional commit since the last revision: > > review feedback Thanks for the reviews, Aleksey, Jatin, and Emanuel. ------------- PR Comment: https://git.openjdk.org/jdk/pull/27263#issuecomment-3304631783 From vlivanov at openjdk.org Wed Sep 17 21:39:46 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 17 Sep 2025 21:39:46 GMT Subject: RFR: 8367333: C2: Vector math operation intrinsification failure [v2] In-Reply-To: References: <3Cy6jhWxbaQeWwo22L9nxPnipY1-vHsGZEtk8IZUiq8=.bfefdef7-0137-422b-a7b0-e4fae2a5b282@github.com> Message-ID: On Wed, 17 Sep 2025 06:08:40 GMT, Emanuel Peter wrote: > Also: why not just add the extra run over at the original test? `test/jdk/jdk/incubator/vector/*VectorTests.java` are huge and already override default timeout setting. But running them with `-XX:+StressIncrementalInlining` does make some sense. (Maybe not by default, but as part of some stress testing configuration we have.) 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27263#discussion_r2356845600 From vlivanov at openjdk.org Wed Sep 17 21:39:48 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 17 Sep 2025 21:39:48 GMT Subject: Integrated: 8367333: C2: Vector math operation intrinsification failure In-Reply-To: References: Message-ID: On Fri, 12 Sep 2025 19:14:18 GMT, Vladimir Ivanov wrote: > As part of [JDK-8353786](https://bugs.openjdk.org/browse/JDK-8353786), C2 support for operations backed by the vector math library was completely removed. On JDK side, there is a special dispatching logic added to avoid intrinsic calls in `jdk.internal.vm.vector.VectorSupport`. But it's still possible to observe such paradoxical situations (intrinsic calls with obsolete operation IDs) when processing effectively dead code. > > Consider `FloatVector::lanewiseTemplate`: > > FloatVector lanewiseTemplate(VectorOperators.Unary op) { > if (opKind(op, VO_SPECIAL)) { > ... > else if (opKind(op, VO_MATHLIB)) { > return unaryMathOp(op); > } > } > int opc = opCode(op); > return VectorSupport.unaryOp(opc, ...); > } > > > At runtime, `unaryMathOp` is unconditionally invoked, but during compilation it's possible to end up with an intrinsification attempt of `VectorSupport.unaryOp()` before `opKind(op, VO_SPECIAL)` is inlined. > > It can be reliably reproduced `-XX:+StressIncrementalInlining` flag. > > The fix is to fail-fast intrinsification rather than crashing the VM. > > Testing: tier1 - tier4 This pull request has now been integrated. 
Changeset: aa36799a Author: Vladimir Ivanov URL: https://git.openjdk.org/jdk/commit/aa36799acb5834d730400fb073a9a3a8ee3c28ef Stats: 167 lines in 3 files changed: 167 ins; 0 del; 0 mod 8367333: C2: Vector math operation intrinsification failure Reviewed-by: epeter, shade, jbhateja ------------- PR: https://git.openjdk.org/jdk/pull/27263 From vlivanov at openjdk.org Wed Sep 17 21:44:29 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 17 Sep 2025 21:44:29 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v2] In-Reply-To: References: <0WKwHjzEn5dxYLkonrk4h9yfMI3r3bKDdqgG06J69N4=.e19e9441-6197-4d53-a4f4-b196a81f69d8@github.com> <1FgOFS7aAlEbvVUez6iTfzgf2l7qUbL9C4wfSGmmfo0=.406c10f1-63d5-4333-af6d-525e46203182@github.com> Message-ID: On Fri, 12 Sep 2025 14:09:52 GMT, Emanuel Peter wrote: > StressReachabilityFences only inserts RF where they are not needed. So this allows us to test the consistency of the RF machinery, but not to test if we are missing RF where they are needed. That is much harder, and we should probably invest in writing more tests for those cases, even if it is really hard. Maybe we can even write fuzzing tests for it? That's a fair point. I'll think more about ways to automatically test RF invariants in positive/negative ways and file RFEs. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25315#issuecomment-3304649404 From vlivanov at openjdk.org Wed Sep 17 22:29:26 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 17 Sep 2025 22:29:26 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v8] In-Reply-To: References: <_n3uP_Dkl3RNq3MFoRDXsS28SM8CcQHaR6vdUJF9U8s=.dcfab97b-be28-4244-93df-c8a23d6d66b8@github.com> Message-ID: On Fri, 12 Sep 2025 13:18:33 GMT, Emanuel Peter wrote: >>> Is this rf guaranteed to belong to the Allocation somehow? >> >> I don't get your question. The code iterates over users of an allocation which is being eliminated. 
Semantically, RF is a no-op on a scalarizable referent and has to be removed in order to let the scalarization happen. >> >>> Ah, you could mention that later ReachabilityFenceNode::Identity removes the rf. >> >> Done. > > @iwanowww The code in `PhaseMacroExpand::process_users_of_allocation` iterates over direct users of result cast from Allocation nodes. And RF is not special there. Any other case in `PhaseMacroExpand::process_users_of_allocation()` would be affected. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2356922289 From dlong at openjdk.org Wed Sep 17 23:27:06 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 17 Sep 2025 23:27:06 GMT Subject: RFR: 8357258: x86: Improve receiver type profiling reliability [v2] In-Reply-To: References: Message-ID: On Mon, 15 Sep 2025 14:27:28 GMT, Aleksey Shipilev wrote: >> See the bug for discussion what issues current machinery has. >> >> This PR executes the plan outlined in the bug: >> 1. Common the receiver type profiling code in interpreter and C1 >> 2. Rewrite receiver type profiling code to only do atomic receiver slot installations >> 3. Trim `C1OptimizeVirtualCallProfiling` to only claim slots when receiver is installed >> >> This PR does _not_ do atomic counter updates themselves, as it may have much wider performance implications, including regressions. This PR should be at least performance neutral. >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `compiler/` >> - [ ] Linux x86_64 server fastdebug, `all` > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains three additional commits since the last revision: > > - Merge branch 'master' into JDK-8357258-x86-c1-optimize-virt-calls > - Drop atomic counters > - Initial version src/hotspot/cpu/x86/macroAssembler_x86.cpp line 4853: > 4851: } else { > 4852: // Nothing to do, just go with defaults. > 4853: assert_different_registers(rax, mdp, recv, offset); Can't we do all register shuffling and push/pop outside the loop? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25305#discussion_r2356988910 From dlong at openjdk.org Wed Sep 17 23:42:41 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 17 Sep 2025 23:42:41 GMT Subject: RFR: 8357258: x86: Improve receiver type profiling reliability [v2] In-Reply-To: References: Message-ID: On Mon, 15 Sep 2025 14:27:28 GMT, Aleksey Shipilev wrote: >> See the bug for discussion what issues current machinery has. >> >> This PR executes the plan outlined in the bug: >> 1. Common the receiver type profiling code in interpreter and C1 >> 2. Rewrite receiver type profiling code to only do atomic receiver slot installations >> 3. Trim `C1OptimizeVirtualCallProfiling` to only claim slots when receiver is installed >> >> This PR does _not_ do atomic counter updates themselves, as it may have much wider performance implications, including regressions. This PR should be at least performance neutral. >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `compiler/` >> - [ ] Linux x86_64 server fastdebug, `all` > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Merge branch 'master' into JDK-8357258-x86-c1-optimize-virt-calls > - Drop atomic counters > - Initial version src/hotspot/cpu/x86/interp_masm_x86.cpp line 1342: > 1340: > 1341: // Record the receiver type. 
> 1342: type_profile(receiver, mdp, 0); Why is 0 the correct offset? The C1 helper uses md->byte_offset_of_slot(). src/hotspot/cpu/x86/interp_masm_x86.cpp line 1553: > 1551: > 1552: // Record the object type. > 1553: record_klass_in_profile(klass, mdp, reg2, false); Same question as above about the 0 offset. Is this because `mdp` has already been adjusted? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25305#discussion_r2357007843 PR Review Comment: https://git.openjdk.org/jdk/pull/25305#discussion_r2357010342 From dfenacci at openjdk.org Thu Sep 18 06:27:45 2025 From: dfenacci at openjdk.org (Damon Fenacci) Date: Thu, 18 Sep 2025 06:27:45 GMT Subject: RFR: 8367278: Test compiler/startup/StartupOutput.java timed out after completion on Windows In-Reply-To: References: Message-ID: <5E1DUrHS_zkhw6H1ivQak0rhqtxfEivrwhJkkpf2swE=.6dac4e41-4ab4-4fc8-bf89-7af81d78a0b5@github.com> On Wed, 17 Sep 2025 13:10:33 GMT, SendaoYan wrote: >> ## Problem >> After [JDK-8260555](https://bugs.openjdk.org/browse/JDK-8260555) changed the default TIMEOUT_FACTOR from 4 to 1, the test compiler/startup/StartupOutput.java can occasionally slightly exceed the 2-minute timeout on Windows. >> >> ## Change >> Rather than increasing the timeout, this change reduces the number of VM runs with randomly generated near-minimum code cache sizes from 200 to 50. This should still provide sufficient coverage while keeping execution well within the timeout. >> >> ## Testing: >> Tiers 1-3+ > > GHA shows GetStackTraceALotWhenPinned.java timed out on macos. The failure has been fixed by [JDK-8366893](https://bugs.openjdk.org/browse/JDK-8366893). I think you can merge the master first. Thanks for your reviews @sendaoYan @chhagedorn. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/27254#issuecomment-3305596171 From dfenacci at openjdk.org Thu Sep 18 06:27:46 2025 From: dfenacci at openjdk.org (Damon Fenacci) Date: Thu, 18 Sep 2025 06:27:46 GMT Subject: Integrated: 8367278: Test compiler/startup/StartupOutput.java timed out after completion on Windows In-Reply-To: References: Message-ID: On Fri, 12 Sep 2025 09:56:24 GMT, Damon Fenacci wrote: > ## Problem > After [JDK-8260555](https://bugs.openjdk.org/browse/JDK-8260555) changed the default TIMEOUT_FACTOR from 4 to 1, the test compiler/startup/StartupOutput.java can occasionally slightly exceed the 2-minute timeout on Windows. > > ## Change > Rather than increasing the timeout, this change reduces the number of VM runs with randomly generated near-minimum code cache sizes from 200 to 50. This should still provide sufficient coverage while keeping execution well within the timeout. > > ## Testing: > Tiers 1-3+ This pull request has now been integrated. Changeset: a355edbb Author: Damon Fenacci URL: https://git.openjdk.org/jdk/commit/a355edbbe43f7356f9439ecabf0ab8218fc9e3e1 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod 8367278: Test compiler/startup/StartupOutput.java timed out after completion on Windows Reviewed-by: syan, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/27254 From epeter at openjdk.org Thu Sep 18 06:40:12 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 18 Sep 2025 06:40:12 GMT Subject: RFR: 8367389: C2 SuperWord: refactor VTransform to model the whole loop instead of just the basic block [v3] In-Reply-To: References: Message-ID: > I'm working on cost-modeling, and am integrating some smaller changes from this proof-of-concept PR: > https://github.com/openjdk/jdk/pull/20964 > [See plan overview.](https://bugs.openjdk.org/browse/JDK-8340093) > > This is a pure refactoring - no change in behaviour. I'm presenting it like this because it will make reviews easier.
> > ------------------------------ > > **Goals** > - VTransform models **all nodes in the loop**, not just the basic block (enables later VTransform::optimize, like moving reductions out of the loop) > - Remove `_nodes` from the vector vtnodes. > > **Details** > - Remove: `AUTO_VECTORIZATION2_AFTER_REORDER`, `apply_memops_reordering_with_schedule`, `print_memops_schedule`. > - Instead of reordering the scalar memops, we create the new memory graph during `VTransform::apply`. That is why the `VTransformApplyState` now needs to track the memory states. > - Refactor `VLoopMemorySlices`: map not just memory slices with phis (have stores in loop), but also those with only loads (no phi). > - Create vtnodes for all nodes in the loop (not just the basic block), as well as inputs (already) and outputs (new). Mapping also the output nodes means during `apply`, we naturally connect the uses after the loop to their inputs from the loop (which may be new nodes after the transformation). > - `_mem_ref_for_main_loop_alignment` -> `_vpointer_for_main_loop_alignment`. Instead of tracking the memory node to later have access to its `VPointer`, we take it directly. That removes one more use of `_nodes` for vector vtnodes. > > I also made a lot of annotations in the code below, for easier review. > > **Suggested order for review** > - Removal of `VTransformGraph::apply_memops_reordering_with_schedule` -> sets up need to build memory graph on the fly. > - Old and new code for `VLoopMemorySlices` -> we now also track load-only slices. > - `build_scalar_vtnodes_for_non_packed_nodes`, `build_inputs_for_scalar_vtnodes`, `build_uses_after_loop`, `apply_vtn_inputs_to_node` (use in `apply`), `apply_backedge`, `fix_memory_state_uses_after_loop` > - `VTransformApplyState`: how it now tracks the memory state. > - `VTransformVectorNode` -> removal of `_nodes` (Big Win!) > - Then look at all the other details. 
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: Update src/hotspot/share/opto/vectorization.cpp Co-authored-by: Manuel Hässig ------------- Changes: - all: https://git.openjdk.org/jdk/pull/27208/files - new: https://git.openjdk.org/jdk/pull/27208/files/469426a7..9af66755 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=27208&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27208&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/27208.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27208/head:pull/27208 PR: https://git.openjdk.org/jdk/pull/27208 From epeter at openjdk.org Thu Sep 18 06:40:14 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 18 Sep 2025 06:40:14 GMT Subject: RFR: 8367389: C2 SuperWord: refactor VTransform to model the whole loop instead of just the basic block [v2] In-Reply-To: <5zLWoCC7_s5VBF435fL1hk_m9vsk5JQrdZ1tEipatFo=.bc502b75-7074-4923-8dce-d367eb1b71af@github.com> References: <5zLWoCC7_s5VBF435fL1hk_m9vsk5JQrdZ1tEipatFo=.bc502b75-7074-4923-8dce-d367eb1b71af@github.com> Message-ID: On Wed, 17 Sep 2025 14:37:00 GMT, Manuel Hässig wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> for Manuel > > Thank you for addressing my comments and answering my question. Bar the new typo, this looks good to me.
@mhaessig Thanks a lot for the review, suggestions and approval :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/27208#issuecomment-3305626314 From epeter at openjdk.org Thu Sep 18 06:48:36 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 18 Sep 2025 06:48:36 GMT Subject: RFR: 8359412: Template-Framework Library: Operations and Expressions In-Reply-To: <4YVAopGtxnlkh39pp0TaW4kNpBuSIXfbz40UDW_We1w=.308dd279-514f-4fa4-b361-ab36f165caf6@github.com> References: <6Bm5VrrqCOzdOooIU-wud7c3aCSuv_7GNZe7pe7D7Jk=.c99a9df1-e6bb-4c8d-94e9-029978fae6ab@github.com> <4YVAopGtxnlkh39pp0TaW4kNpBuSIXfbz40UDW_We1w=.308dd279-514f-4fa4-b361-ab36f165caf6@github.com> Message-ID: On Wed, 17 Sep 2025 11:10:09 GMT, Manuel Hässig wrote: >> Implementing ideas from original draft PR: https://github.com/openjdk/jdk/pull/23418 ([Exceptions](https://github.com/openjdk/jdk/pull/23418/files#diff-77e7db8cc0c5e02786e1c993362f98fabe219042eb342fdaffc09fd11380259dR41), [ExpressionFuzzer](https://github.com/openjdk/jdk/pull/23418/files#diff-01844ca5cb007f5eab5fa4195f2f1378d4e7c64ba477fba64626c98ff4054038R66)). >> >> Specifically, I'm extending the Template Library with `Expression`s, and lists of `Operations` (some basic Expressions). These Expressions can easily be nested and then filled with arguments, and applied in a `Template`. >> >> Details, in **order you should review**: >> - `Operations.java`: maps lots of primitive operators as Expressions. >> - `Expression.java`: the fundamental engine behind Expressions. >> - `examples/TestExpressions.java`: basic example using Expressions, filling them with random constants. >> - `tests/TestExpression.java`: correctness test of Expression machinery. >> - `compiler/igvn/ExpressionFuzzer.java`: expression fuzzer for primitive type expressions, including input range/bits constraints and output range/bits verification. >> - `PrimitiveType.java`: added `LibraryRNG` facility. We already had `type.con()` which gave us random constants.
But we also want to have `type.callLibraryRNG()` so that we can insert a call to a random number generator of the corresponding primitive type. I use this facility in the `ExpressionFuzzer.java` to generate random arguments for the expressions. >> - `examples/TestPrimitiveTypes.java`: added a `LibraryRNG` example, that tests that has a weak test for randomness: we should have at least 2 different value in 1000 calls. >> >> If the reviewers absolutely insist, I could split out `LibraryRNG` into a separate RFE. But it's really not that much code, and has direct use in the `Expression` examples. >> >> **Future Work**: >> - Use `Expression`s in a loop over arrays / MemorySegment: fuzz auto-vectorization. >> - Use `Expression`s to model more operations: >> - `Vector API`, more arithmetic operations like from `Math` classes etc. >> - Ensure that the constraints / checksum mechanic in `compiler/igvn/ExpressionFuzzer.java` work, using IR rules. We may even need to add new IGVN optimizations. Add unsigned constraints. >> - Find a way to delay IGVN optimizations to test worklist notification: For example, we could add a new testing operator call `TestUtils.delay(x) -> x`, which is intrinsified as some new `DelayNode` that in normal circumstances just fol... > > test/hotspot/jtreg/compiler/lib/template_framework/library/Expression.java line 358: > >> 356: tokens.add(arguments.get(i)); >> 357: } >> 358: tokens.add(strings.get(strings.size()-1)); > > Suggestion: > > tokens.add(strings.getLast()); > > A wee bit easier to read. 
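For readers unfamiliar with the suggestion quoted above: `List.getLast()` comes from `SequencedCollection`, which was added in JDK 21, and for a non-empty list it returns the same element as `get(size() - 1)`. The following is only a minimal sketch; the class name `LastElementDemo`, both helper methods, and the list contents are hypothetical and are not taken from `Expression.java`.

```java
import java.util.List;

public class LastElementDemo {
    // Index-based access, in the style of the line under review.
    static String lastByIndex(List<String> strings) {
        return strings.get(strings.size() - 1);
    }

    // SequencedCollection-style access (requires JDK 21 or newer),
    // in the style of the reviewer's suggestion.
    static String lastBySequenced(List<String> strings) {
        return strings.getLast();
    }

    public static void main(String[] args) {
        // Made-up token strings, purely for illustration.
        List<String> strings = List.of("(", " + ", ")");
        // Both helpers select the same final element of a non-empty list.
        System.out.println(lastByIndex(strings).equals(lastBySequenced(strings)));
    }
}
```

The two forms differ only on an empty list: `getLast()` throws `NoSuchElementException`, while the index-based form throws an `IndexOutOfBoundsException`.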
Did not know this was a thing, nice :) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26885#discussion_r2357700924 From epeter at openjdk.org Thu Sep 18 06:52:34 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 18 Sep 2025 06:52:34 GMT Subject: RFR: 8359412: Template-Framework Library: Operations and Expressions [v2] In-Reply-To: <6Bm5VrrqCOzdOooIU-wud7c3aCSuv_7GNZe7pe7D7Jk=.c99a9df1-e6bb-4c8d-94e9-029978fae6ab@github.com> References: <6Bm5VrrqCOzdOooIU-wud7c3aCSuv_7GNZe7pe7D7Jk=.c99a9df1-e6bb-4c8d-94e9-029978fae6ab@github.com> Message-ID: <5YknYR1eLr-C-b-XIo863vtjkT9F8Aej2DYEGMaCodQ=.fb7f55bd-1fc5-426d-a974-c4770d9a2981@github.com> > Impliementing ideas from original draft PR: https://github.com/openjdk/jdk/pull/23418 ([Exceptions](https://github.com/openjdk/jdk/pull/23418/files#diff-77e7db8cc0c5e02786e1c993362f98fabe219042eb342fdaffc09fd11380259dR41), [ExpressionFuzzer](https://github.com/openjdk/jdk/pull/23418/files#diff-01844ca5cb007f5eab5fa4195f2f1378d4e7c64ba477fba64626c98ff4054038R66)). > > Specifically, I'm extending the Template Library with `Expression`s, and lists of `Operations` (some basic Expressions). These Expressions can easily be nested and then filled with arguments, and applied in a `Template`. > > Details, in **order you should review**: > - `Operations.java`: maps lots of primitive operators as Expressions. > - `Expression.java`: the fundamental engine behind Expressions. > - `examples/TestExpressions.java`: basic example using Expressions, filling them with random constants. > - `tests/TestExpression.java`: correctness test of Expression machinery. > - `compiler/igvn/ExpressionFuzzer.java`: expression fuzzer for primitive type expressions, including input range/bits constraints and output range/bits verification. > - `PrimitiveType.java`: added `LibraryRNG` facility. We already had `type.con()` which gave us random constants. 
But we also want to have `type.callLibraryRNG()` so that we can insert a call to a random number generator of the corresponding primitive type. I use this facility in the `ExpressionFuzzer.java` to generate random arguments for the expressions. > - `examples/TestPrimitiveTypes.java`: added a `LibraryRNG` example, that tests that has a weak test for randomness: we should have at least 2 different value in 1000 calls. > > If the reviewers absolutely insist, I could split out `LibraryRNG` into a separate RFE. But it's really not that much code, and has direct use in the `Expression` examples. > > **Future Work**: > - Use `Expression`s in a loop over arrays / MemorySegment: fuzz auto-vectorization. > - Use `Expression`s to model more operations: > - `Vector API`, more arithmetic operations like from `Math` classes etc. > - Ensure that the constraints / checksum mechanic in `compiler/igvn/ExpressionFuzzer.java` work, using IR rules. We may even need to add new IGVN optimizations. Add unsigned constraints. > - Find a way to delay IGVN optimizations to test worklist notification: For example, we could add a new testing operator call `TestUtils.delay(x) -> x`, which is intrinsified as some new `DelayNode` that in normal circumstances just folds away, but under `StressIGVN` and `Stres... 
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: Apply Manuel's suggestions part 1 Co-authored-by: Manuel Hässig ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26885/files - new: https://git.openjdk.org/jdk/pull/26885/files/0709731a..d66aa985 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26885&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26885&range=00-01 Stats: 134 lines in 3 files changed: 1 ins; 2 del; 131 mod Patch: https://git.openjdk.org/jdk/pull/26885.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26885/head:pull/26885 PR: https://git.openjdk.org/jdk/pull/26885 From epeter at openjdk.org Thu Sep 18 06:52:36 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 18 Sep 2025 06:52:36 GMT Subject: RFR: 8359412: Template-Framework Library: Operations and Expressions [v2] In-Reply-To: <4YVAopGtxnlkh39pp0TaW4kNpBuSIXfbz40UDW_We1w=.308dd279-514f-4fa4-b361-ab36f165caf6@github.com> References: <6Bm5VrrqCOzdOooIU-wud7c3aCSuv_7GNZe7pe7D7Jk=.c99a9df1-e6bb-4c8d-94e9-029978fae6ab@github.com> <4YVAopGtxnlkh39pp0TaW4kNpBuSIXfbz40UDW_We1w=.308dd279-514f-4fa4-b361-ab36f165caf6@github.com> Message-ID: <84_nAzM5h9uSyvzRquE4x9EhrnfmNls0Btzts0zSPFw=.8b10e909-122b-421b-b148-57304b32d68c@github.com> On Wed, 17 Sep 2025 11:33:28 GMT, Manuel Hässig wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> Apply Manuel's suggestions part 1 >> >> Co-authored-by: Manuel Hässig > > test/hotspot/jtreg/compiler/lib/template_framework/library/Operations.java line 1: > >> 1: /* > > I gave it my best shot to suggest a reasonable and reasonably consistent alignment. Oh wow, nice.
Thanks for the work :) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26885#discussion_r2357708984 From dfenacci at openjdk.org Thu Sep 18 07:00:38 2025 From: dfenacci at openjdk.org (Damon Fenacci) Date: Thu, 18 Sep 2025 07:00:38 GMT Subject: RFR: 8367740: assembler_.inline.hpp should not include assembler.inline.hpp In-Reply-To: References: Message-ID: On Tue, 16 Sep 2025 10:15:06 GMT, Francesco Andreuzzi wrote: > This is the content of assembler.inline.hpp: > https://github.com/openjdk/jdk/blob/ca89cd06d39ed3a6bbe16f60fea4d7382849edbd/src/hotspot/share/asm/assembler.inline.hpp#L28-L30 > > Most of the `assembler_.inline.hpp` include it: > https://github.com/openjdk/jdk/blob/ca89cd06d39ed3a6bbe16f60fea4d7382849edbd/src/hotspot/cpu/zero/assembler_zero.inline.hpp#L29-L32 > > They should probably include `assembler.hpp` instead. > > Testing: tier1 in GHA Tests passed. Thanks @fandreuz. LGTM ------------- Marked as reviewed by dfenacci (Committer). PR Review: https://git.openjdk.org/jdk/pull/27311#pullrequestreview-3237615976 From rcastanedalo at openjdk.org Thu Sep 18 07:06:36 2025 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Thu, 18 Sep 2025 07:06:36 GMT Subject: RFR: 8359378: aarch64: crash when using -XX:+UseFPUForSpilling In-Reply-To: <0MbLaZYRoz-raa9x1eZycxeHE7mvPiGJujQpg8vBdek=.e3483c47-ad65-4d09-baf5-2db9b780669d@github.com> References: <0MbLaZYRoz-raa9x1eZycxeHE7mvPiGJujQpg8vBdek=.e3483c47-ad65-4d09-baf5-2db9b780669d@github.com> Message-ID: On Wed, 17 Sep 2025 16:19:12 GMT, Boris Ulasevich wrote: > The AArch64 BarrierSetAssembler path assumes only FP/vector ideal regs reach the FP spill/restore encoding. With -XX:+UseFPUForSpilling, the register allocator may allocate scalar values in FP registers.
When such values (Op_RegI/Op_RegN/Op_RegL/Op_RegP) hit `BarrierSetAssembler::encode_float_vector_register_size`, we trip ShouldNotReachHere in release builds and the **"unexpected ideal register"** assertion in debug builds. > > Fix: teach the encoder to handle scalar ideal regs when they physically live in FP regs: > - treat Op_RegI / Op_RegN as 32-bit (single slot) - same class as Op_RegF > - treat Op_RegL / Op_RegP as 64-bit (two slots) - same class as Op_RegD > > Related: > - reproduced since #19746 > - spilling logic: > - #18967 > - #17977 > > Testing: tier1-3 with javaoptions -Xcomp -Xbatch -XX:+UseFPUForSpilling on AARCH Hi @bulasevich, thanks for working on this issue, but please note that it was already assigned to me ([JDK-8359378](https://bugs.openjdk.org/browse/JDK-8359378)). I am fine with re-assigning it to you, but [next time please ask first, to avoid work duplication](https://openjdk.org/guide/#i-found-an-issue-in-jbs-that-i-want-to-fix). ------------- PR Comment: https://git.openjdk.org/jdk/pull/27350#issuecomment-3305714305 From epeter at openjdk.org Thu Sep 18 07:17:50 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 18 Sep 2025 07:17:50 GMT Subject: RFR: 8359412: Template-Framework Library: Operations and Expressions [v3] In-Reply-To: <4YVAopGtxnlkh39pp0TaW4kNpBuSIXfbz40UDW_We1w=.308dd279-514f-4fa4-b361-ab36f165caf6@github.com> References: <6Bm5VrrqCOzdOooIU-wud7c3aCSuv_7GNZe7pe7D7Jk=.c99a9df1-e6bb-4c8d-94e9-029978fae6ab@github.com> <4YVAopGtxnlkh39pp0TaW4kNpBuSIXfbz40UDW_We1w=.308dd279-514f-4fa4-b361-ab36f165caf6@github.com> Message-ID: On Wed, 17 Sep 2025 11:02:12 GMT, Manuel Hässig wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> Apply Manuel's suggestions part 2 > > test/hotspot/jtreg/compiler/lib/template_framework/library/Expression.java line 152: > >> 150: >> 151: /** >> 152: * Creates a new Espression with 1 arguments.
> > For every make(): s/Espression/Expression/ Nice catch! > test/hotspot/jtreg/compiler/lib/template_framework/library/Expression.java line 164: > >> 162: CodeGenerationDataNameType t0, >> 163: String s1) { >> 164: return new Expression(returnType, List.of(t0), List.of(s0, s1), new Info()); > > To reduce code duplication, the methods without an additional info should probably use the ones with. > Suggestion: > > return make(returnType, s0, t0, s1, new Info()); Nice idea :) > test/hotspot/jtreg/testlibrary_tests/template_framework/examples/TestExpressions.java line 27: > >> 25: * @test >> 26: * @bug 8359412 >> 27: * @summary Demonstrate the use of Expressions form the Template Library. > > Suggestion: > > * @summary Demonstrate the use of Expressions from the Template Library. > > Typo Done :) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26885#discussion_r2357779934 PR Review Comment: https://git.openjdk.org/jdk/pull/26885#discussion_r2357777770 PR Review Comment: https://git.openjdk.org/jdk/pull/26885#discussion_r2357782258 From epeter at openjdk.org Thu Sep 18 07:17:47 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 18 Sep 2025 07:17:47 GMT Subject: RFR: 8359412: Template-Framework Library: Operations and Expressions [v3] In-Reply-To: <6Bm5VrrqCOzdOooIU-wud7c3aCSuv_7GNZe7pe7D7Jk=.c99a9df1-e6bb-4c8d-94e9-029978fae6ab@github.com> References: <6Bm5VrrqCOzdOooIU-wud7c3aCSuv_7GNZe7pe7D7Jk=.c99a9df1-e6bb-4c8d-94e9-029978fae6ab@github.com> Message-ID: <4DQH2DopQ0lMjj78iaff4d1qwotbvZYLgmtq36Hb_MQ=.d0bc6aef-1821-4a97-887b-0cf054667a7f@github.com> > Impliementing ideas from original draft PR: https://github.com/openjdk/jdk/pull/23418 ([Exceptions](https://github.com/openjdk/jdk/pull/23418/files#diff-77e7db8cc0c5e02786e1c993362f98fabe219042eb342fdaffc09fd11380259dR41), [ExpressionFuzzer](https://github.com/openjdk/jdk/pull/23418/files#diff-01844ca5cb007f5eab5fa4195f2f1378d4e7c64ba477fba64626c98ff4054038R66)). 
> > Specifically, I'm extending the Template Library with `Expression`s, and lists of `Operations` (some basic Expressions). These Expressions can easily be nested and then filled with arguments, and applied in a `Template`. > > Details, in **order you should review**: > - `Operations.java`: maps lots of primitive operators as Expressions. > - `Expression.java`: the fundamental engine behind Expressions. > - `examples/TestExpressions.java`: basic example using Expressions, filling them with random constants. > - `tests/TestExpression.java`: correctness test of Expression machinery. > - `compiler/igvn/ExpressionFuzzer.java`: expression fuzzer for primitive type expressions, including input range/bits constraints and output range/bits verification. > - `PrimitiveType.java`: added `LibraryRNG` facility. We already had `type.con()` which gave us random constants. But we also want to have `type.callLibraryRNG()` so that we can insert a call to a random number generator of the corresponding primitive type. I use this facility in the `ExpressionFuzzer.java` to generate random arguments for the expressions. > - `examples/TestPrimitiveTypes.java`: added a `LibraryRNG` example that has a weak test for randomness: we should see at least 2 different values in 1000 calls. > > If the reviewers absolutely insist, I could split out `LibraryRNG` into a separate RFE. But it's really not that much code, and has direct use in the `Expression` examples. > > **Future Work**: > - Use `Expression`s in a loop over arrays / MemorySegment: fuzz auto-vectorization. > - Use `Expression`s to model more operations: > - `Vector API`, more arithmetic operations like from `Math` classes etc. > - Ensure that the constraints / checksum mechanic in `compiler/igvn/ExpressionFuzzer.java` work, using IR rules. We may even need to add new IGVN optimizations. Add unsigned constraints.
> - Find a way to delay IGVN optimizations to test worklist notification: For example, we could add a new testing operator call `TestUtils.delay(x) -> x`, which is intrinsified as some new `DelayNode` that in normal circumstances just folds away, but under `StressIGVN` and `Stres... Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: Apply Manuel's suggestions part 2 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26885/files - new: https://git.openjdk.org/jdk/pull/26885/files/d66aa985..05fb63c4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26885&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26885&range=01-02 Stats: 13 lines in 2 files changed: 0 ins; 0 del; 13 mod Patch: https://git.openjdk.org/jdk/pull/26885.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26885/head:pull/26885 PR: https://git.openjdk.org/jdk/pull/26885 From rcastanedalo at openjdk.org Thu Sep 18 07:19:07 2025 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Thu, 18 Sep 2025 07:19:07 GMT Subject: RFR: 8361699: C2: assert(can_reduce_phi(n->as_Phi())) failed: Sanity: previous reducible Phi is no longer reducible before SUT In-Reply-To: References: <1uDOe3Oe-hihmDHea2h8vcvRZsKKBeNp0J9lKYUujxk=.abd111bc-3625-4c71-bfa2-0a4c1f4d3875@github.com> <2brDXuLmbVBVRaeSyCdKokA706v3t6VsZfGvj_QceJ4=.4483390e-c726-4d82-b220-f1dbdf4efef0@github.com> Message-ID: On Thu, 11 Sep 2025 07:41:55 GMT, Roberto Castañeda Lozano wrote: >>> @robcasloz - are you thinking that the "fixed point" loops on `find_scalar_replaceable_allocs` aren't sufficient? >> >> You're right, that should do. >> >>> At first glance yes, I think that the code would be more cleaned up if done that way. If the code had been written like that in the first place we wouldn't have seen the current issue. (...)
>> >> Agree, a single fixed point loop combining NSR detection and propagation would be ideal for clarity and maintainability. >> >>> I propose that we move forward with the current patch and work on this refactoring as a separate issue. >> >> Sounds good, please file a RFE for that. I would suggest then to postpone the clean-up in `revisit_reducible_phi_status` to that RFE. > >> @robcasloz - I pushed some changes addressing yours and @eme64 comments. Could you please re-run your internal tests? > > Thanks, I will report back within a couple of days. > Thank you @robcasloz ; I'll start working on that early next week. @JohnTortugo thanks. Please, keep in mind that [HotSpot requires two approvals for non-trivial changes like this](https://openjdk.org/guide/#hotspot-development) (apologies if my previous comment somehow could be interpreted as an invitation to integrate, that was not the intention). ------------- PR Comment: https://git.openjdk.org/jdk/pull/27063#issuecomment-3305759603 From epeter at openjdk.org Thu Sep 18 07:20:24 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 18 Sep 2025 07:20:24 GMT Subject: RFR: 8359412: Template-Framework Library: Operations and Expressions [v4] In-Reply-To: <6Bm5VrrqCOzdOooIU-wud7c3aCSuv_7GNZe7pe7D7Jk=.c99a9df1-e6bb-4c8d-94e9-029978fae6ab@github.com> References: <6Bm5VrrqCOzdOooIU-wud7c3aCSuv_7GNZe7pe7D7Jk=.c99a9df1-e6bb-4c8d-94e9-029978fae6ab@github.com> Message-ID: <2zAgocXxD80XNyB_HLyO9JSmsqjJfRGYE-FmFmatuYk=.d42bf76a-39d0-49ba-88c9-df9eebc5aa0f@github.com> > Impliementing ideas from original draft PR: https://github.com/openjdk/jdk/pull/23418 ([Exceptions](https://github.com/openjdk/jdk/pull/23418/files#diff-77e7db8cc0c5e02786e1c993362f98fabe219042eb342fdaffc09fd11380259dR41), [ExpressionFuzzer](https://github.com/openjdk/jdk/pull/23418/files#diff-01844ca5cb007f5eab5fa4195f2f1378d4e7c64ba477fba64626c98ff4054038R66)). 
> > Specifically, I'm extending the Template Library with `Expression`s, and lists of `Operations` (some basic Expressions). These Expressions can easily be nested and then filled with arguments, and applied in a `Template`. > > Details, in **order you should review**: > - `Operations.java`: maps lots of primitive operators as Expressions. > - `Expression.java`: the fundamental engine behind Expressions. > - `examples/TestExpressions.java`: basic example using Expressions, filling them with random constants. > - `tests/TestExpression.java`: correctness test of Expression machinery. > - `compiler/igvn/ExpressionFuzzer.java`: expression fuzzer for primitive type expressions, including input range/bits constraints and output range/bits verification. > - `PrimitiveType.java`: added `LibraryRNG` facility. We already had `type.con()` which gave us random constants. But we also want to have `type.callLibraryRNG()` so that we can insert a call to a random number generator of the corresponding primitive type. I use this facility in the `ExpressionFuzzer.java` to generate random arguments for the expressions. > - `examples/TestPrimitiveTypes.java`: added a `LibraryRNG` example that has a weak test for randomness: we should see at least 2 different values in 1000 calls. > > If the reviewers absolutely insist, I could split out `LibraryRNG` into a separate RFE. But it's really not that much code, and has direct use in the `Expression` examples. > > **Future Work**: > - Use `Expression`s in a loop over arrays / MemorySegment: fuzz auto-vectorization. > - Use `Expression`s to model more operations: > - `Vector API`, more arithmetic operations like from `Math` classes etc. > - Ensure that the constraints / checksum mechanic in `compiler/igvn/ExpressionFuzzer.java` work, using IR rules. We may even need to add new IGVN optimizations. Add unsigned constraints.
> - Find a way to delay IGVN optimizations to test worklist notification: For example, we could add a new testing operator call `TestUtils.delay(x) -> x`, which is intrinsified as some new `DelayNode` that in normal circumstances just folds away, but under `StressIGVN` and `Stres... Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: Apply Manuel's suggestions part 3 Co-authored-by: Manuel Hässig ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26885/files - new: https://git.openjdk.org/jdk/pull/26885/files/05fb63c4..0a269c3b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26885&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26885&range=02-03 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26885.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26885/head:pull/26885 PR: https://git.openjdk.org/jdk/pull/26885 From epeter at openjdk.org Thu Sep 18 07:34:43 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 18 Sep 2025 07:34:43 GMT Subject: RFR: 8359412: Template-Framework Library: Operations and Expressions [v5] In-Reply-To: <6Bm5VrrqCOzdOooIU-wud7c3aCSuv_7GNZe7pe7D7Jk=.c99a9df1-e6bb-4c8d-94e9-029978fae6ab@github.com> References: <6Bm5VrrqCOzdOooIU-wud7c3aCSuv_7GNZe7pe7D7Jk=.c99a9df1-e6bb-4c8d-94e9-029978fae6ab@github.com> Message-ID: > Implementing ideas from original draft PR: https://github.com/openjdk/jdk/pull/23418 ([Exceptions](https://github.com/openjdk/jdk/pull/23418/files#diff-77e7db8cc0c5e02786e1c993362f98fabe219042eb342fdaffc09fd11380259dR41), [ExpressionFuzzer](https://github.com/openjdk/jdk/pull/23418/files#diff-01844ca5cb007f5eab5fa4195f2f1378d4e7c64ba477fba64626c98ff4054038R66)). > > Specifically, I'm extending the Template Library with `Expression`s, and lists of `Operations` (some basic Expressions).
These Expressions can easily be nested and then filled with arguments, and applied in a `Template`. > > Details, in **order you should review**: > - `Operations.java`: maps lots of primitive operators as Expressions. > - `Expression.java`: the fundamental engine behind Expressions. > - `examples/TestExpressions.java`: basic example using Expressions, filling them with random constants. > - `tests/TestExpression.java`: correctness test of Expression machinery. > - `compiler/igvn/ExpressionFuzzer.java`: expression fuzzer for primitive type expressions, including input range/bits constraints and output range/bits verification. > - `PrimitiveType.java`: added `LibraryRNG` facility. We already had `type.con()` which gave us random constants. But we also want to have `type.callLibraryRNG()` so that we can insert a call to a random number generator of the corresponding primitive type. I use this facility in the `ExpressionFuzzer.java` to generate random arguments for the expressions. > - `examples/TestPrimitiveTypes.java`: added a `LibraryRNG` example, that tests that has a weak test for randomness: we should have at least 2 different value in 1000 calls. > > If the reviewers absolutely insist, I could split out `LibraryRNG` into a separate RFE. But it's really not that much code, and has direct use in the `Expression` examples. > > **Future Work**: > - Use `Expression`s in a loop over arrays / MemorySegment: fuzz auto-vectorization. > - Use `Expression`s to model more operations: > - `Vector API`, more arithmetic operations like from `Math` classes etc. > - Ensure that the constraints / checksum mechanic in `compiler/igvn/ExpressionFuzzer.java` work, using IR rules. We may even need to add new IGVN optimizations. Add unsigned constraints. 
> - Find a way to delay IGVN optimizations to test worklist notification: For example, we could add a new testing operator call `TestUtils.delay(x) -> x`, which is intrinsified as some new `DelayNode` that in normal circumstances just folds away, but under `StressIGVN` and `Stres... Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 29 additional commits since the last revision: - Merge branch 'master' into JDK-8359412-Template-Framework-Expressions - Apply Manuel's suggestions part 3 Co-authored-by: Manuel H?ssig - Apply Manuel's suggestions part 2 - Apply Manuel's suggestions part 1 Co-authored-by: Manuel H?ssig - fix whitespaces - LibraryRNG example - fix bug - documentation - improve expression fuzzer - wip constraints - ... and 19 more: https://git.openjdk.org/jdk/compare/06680b79...a6f83b5a ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26885/files - new: https://git.openjdk.org/jdk/pull/26885/files/0a269c3b..a6f83b5a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26885&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26885&range=03-04 Stats: 69057 lines in 2033 files changed: 39891 ins; 16903 del; 12263 mod Patch: https://git.openjdk.org/jdk/pull/26885.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26885/head:pull/26885 PR: https://git.openjdk.org/jdk/pull/26885 From missa at openjdk.org Thu Sep 18 07:40:46 2025 From: missa at openjdk.org (Mohamed Issa) Date: Thu, 18 Sep 2025 07:40:46 GMT Subject: RFR: 8364305: Support AVX10 saturating floating point conversion instructions [v15] In-Reply-To: References: Message-ID: > Intel® AVX10 ISA [1] extensions added new saturating floating point conversion instructions which comply with definitions in section 5.8 of the 2019 IEEE-754 standard. 
They can compute floating point to integral type conversions while also handling special inputs such as NaN, +Infinity, and -Infinity. > > Without AVX10.2, the current approach starts by converting the floating point value(s) in the source register to the desired integral value(s) in the destination register. In the scalar case, the CVTTSS2SI (single precision) or CVTTSD2SI (double precision) instruction is used. In the vector case, the CVTTPS2DQ (single precision) or CVTTPD2DQ (double precision) instruction is used. However, if the source contains a special value (NaN, -Infinity, +Infinity, <= Integer.MIN_VALUE, or >= Integer.MAX_VALUE), extra handling is required. The specific sequence of instructions involved depends on the source (single precision vs double precision), destination (long, integer, short, or byte), level of parallelization (scalar vs vector), and supported AVX extension type. Essentially though, the special values are mapped to values in the integer range (NaN -> 0; -Infinity and values <= Integer.MIN_VALUE -> Integer.MIN_VALUE; +Infinity and values >= Integer.MAX_VALUE -> Integer.MAX_VALUE) with the help of a few temporary registers that store intermediate results. > > This change uses the new AVX10.2 scalar (VCVTTSS2SIS or VCVTTSD2SIS) and vector (VCVTTPS2QQS, VCVTTPS2DQS, VCVTTPD2QQS, and VCVTTPD2DQS) instructions on supported platforms to avoid the extra handling described above. Also, the JTREG tests listed below were used to verify correctness with `-XX:-UseSuperWord` / `-XX:+UseSuperWord` options to exercise both scalar and vector paths. The baseline build used is [OpenJDK v26-b11](https://github.com/openjdk/jdk/releases/tag/jdk-26%2B11). > > 1. `jtreg:test/hotspot/jtreg/compiler/codegen/TestByteDoubleVect.java` > 2. `jtreg:test/hotspot/jtreg/compiler/codegen/TestByteFloatVect.java` > 3. `jtreg:test/hotspot/jtreg/compiler/codegen/TestIntDoubleVect.java` > 4. `jtreg:test/hotspot/jtreg/compiler/codegen/TestIntFloatVect.java` > 5. 
`jtreg:test/hotspot/jtreg/compiler/codegen/TestLongDoubleVect.java` > 6. `jtreg:test/hotspot/jtreg/compiler/codegen/TestLongFloatVect.java` > 7. `jtreg:test/hotspot/jtreg/compiler/codegen/TestShortDoubleVect.java` > 8. `jtreg:test/hotspot/jtreg/compiler/codegen/TestShortFloatVect.java` > 9. `jtreg:test/hotspot/jtreg/compiler/floatingpoint/ScalarFPtoIntCastTest.java` > 10. `jtreg:test/hotspot/jtreg... Mohamed Issa has updated the pull request incrementally with one additional commit since the last revision: Clean up scalar floating point conversion tests ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26919/files - new: https://git.openjdk.org/jdk/pull/26919/files/5d26ff48..a7940ee0 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26919&range=14 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26919&range=13-14 Stats: 83 lines in 1 file changed: 10 ins; 44 del; 29 mod Patch: https://git.openjdk.org/jdk/pull/26919.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26919/head:pull/26919 PR: https://git.openjdk.org/jdk/pull/26919 From missa at openjdk.org Thu Sep 18 07:40:50 2025 From: missa at openjdk.org (Mohamed Issa) Date: Thu, 18 Sep 2025 07:40:50 GMT Subject: RFR: 8364305: Support AVX10 saturating floating point conversion instructions [v14] In-Reply-To: References: <4Eui7URmA1Y5NPrrV4813qb7UUsNVSRP-JSnPdX0Ojg=.4db7c50e-18cd-47ec-ae8c-4ae17597b286@github.com> Message-ID: <9qebb_d7KLK6ge1CPFO_5009kTCNbsg4xCZhj3v-H0w=.76ebacee-b653-415c-99c8-aae76bd830a8@github.com> On Sat, 13 Sep 2025 08:26:13 GMT, Jatin Bhateja wrote: >> Mohamed Issa has updated the pull request incrementally with one additional commit since the last revision: >> >> Introduce scalar floating point conversion tests with IR rules > > test/hotspot/jtreg/compiler/floatingpoint/ScalarFPtoIntCastTest.java line 70: > >> 68: float_arr[i] = ran.nextFloat(floor_val, ceil_val); >> 69: double_arr[i] = ran.nextDouble(floor_val, ceil_val); >> 70: } > > Please use Generators 
instead of direct initialization. I could do it for int and long. If there's a compact way to do it for the other types, please let me know. > test/hotspot/jtreg/compiler/floatingpoint/ScalarFPtoIntCastTest.java line 89: > >> 87: if (int_arr[i] != expected) { >> 88: throw new RuntimeException("Invalid result: int_arr[" + i + "] = " + int_arr[i] + " != " + expected); >> 89: } > > Use Verify.checkEQ instead. Ok, I'm using Verify.checkEQ instead. > test/hotspot/jtreg/compiler/floatingpoint/ScalarFPtoIntCastTest.java line 109: > >> 107: if (long_arr[i] != expected) { >> 108: throw new RuntimeException("Invalid result: long_arr[" + i + "] = " + long_arr[i] + " != " + expected); >> 109: } > > Use Verify.checkEQ, checkout relevant code in https://github.com/openjdk/jdk/tree/master/test/hotspot/jtreg/compiler/lib and their usages I modified this. Should I do this for VectorFPtoIntCastTest.java as well? Also, using Verify.checkEQ removes the custom error message unless I use try + catch. > test/hotspot/jtreg/compiler/floatingpoint/ScalarFPtoIntCastTest.java line 122: > >> 120: checkf2short(); >> 121: } >> 122: > > What is the reason behind additional level of abstraction when now manually inline this code. No reason other than I migrated the code from VectorFPtoIntCastTest.java, so it's gone now. 
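For illustration only, the trade-off mentioned above (a library equality check versus a custom failure message) can be sketched as follows. This is a minimal sketch and not code from the PR: `checkEQ` here is a hypothetical local stand-in, not the test library's `Verify.checkEQ`, and `FpCastCheck` and `checkF2I` are made-up names. The expected values rely only on Java's own narrowing rules for `(int)` casts of `float` (NaN becomes 0, out-of-range values saturate at the `int` bounds).

```java
public class FpCastCheck {
    // Hypothetical stand-in for a library equality check (NOT Verify.checkEQ):
    // throws a generic exception when the two values differ.
    static void checkEQ(int actual, int expected) {
        if (actual != expected) {
            throw new RuntimeException("values differ: " + actual + " vs " + expected);
        }
    }

    // Compares each result against the plain Java cast of the input. On a
    // mismatch, rethrows with an index-specific message and keeps the
    // original exception as the cause.
    static void checkF2I(float[] in, int[] out) {
        for (int i = 0; i < in.length; i++) {
            int expected = (int) in[i]; // Java narrowing: NaN -> 0, saturates at int bounds
            try {
                checkEQ(out[i], expected);
            } catch (RuntimeException e) {
                throw new RuntimeException(
                    "Invalid result: int_arr[" + i + "] = " + out[i] + " != " + expected, e);
            }
        }
    }

    public static void main(String[] args) {
        float[] in = { Float.NaN, Float.POSITIVE_INFINITY, Float.NEGATIVE_INFINITY, 1e20f, -1.5f };
        int[] out = new int[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = (int) in[i];
        }
        checkF2I(in, out); // passes: out was produced by the same cast
        System.out.println(java.util.Arrays.toString(out));
    }
}
```

Wrapping the check keeps the per-element message at the cost of a few extra lines per verification site, which is presumably the overhead the question above is weighing.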
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2357853486 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2357856425 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2357863945 PR Review Comment: https://git.openjdk.org/jdk/pull/26919#discussion_r2357866615 From dfenacci at openjdk.org Thu Sep 18 08:06:26 2025 From: dfenacci at openjdk.org (Damon Fenacci) Date: Thu, 18 Sep 2025 08:06:26 GMT Subject: RFR: 8367613: Test compiler/runtime/TestDontCompileHugeMethods.java failed [v2] In-Reply-To: <5eWiPUhybQOdBZAfm8LnEGLQ8ZwXHcqatCQEf8PVlgo=.ffcd2f7e-f734-49de-a2e4-1099bfb544f5@github.com> References: <5eWiPUhybQOdBZAfm8LnEGLQ8ZwXHcqatCQEf8PVlgo=.ffcd2f7e-f734-49de-a2e4-1099bfb544f5@github.com> Message-ID: On Tue, 16 Sep 2025 21:59:10 GMT, Man Cao wrote: >> Hi, >> >> Could anyone approve this change that excludes this test when running with `-Xcomp`? This avoids the test failure reported in [JDK-8367613](https://bugs.openjdk.org/browse/JDK-8367613). >> >> For reasons I don't yet understand, the `HugeSwitch::shortMethod` method is not compiled under `-Xcomp -XX:TieredStopAtLevel=1`. The method gets compiled with either `-Xcomp` or `-XX:TieredStopAtLevel=1` alone, but not with both. I would appreciate it if anyone could provide insights on possible reasons. > > Man Cao has updated the pull request incrementally with one additional commit since the last revision: > > Switch to disable inlining for shortMethod Thanks @caoman. LGTM ------------- Marked as reviewed by dfenacci (Committer). PR Review: https://git.openjdk.org/jdk/pull/27306#pullrequestreview-3237998259 From epeter at openjdk.org Thu Sep 18 08:07:17 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 18 Sep 2025 08:07:17 GMT Subject: RFR: 8367969: C2: compiler/vectorapi/TestVectorMathLib.java fails without UnlockDiagnosticVMOptions Message-ID: Adding missing `-XX:+UnlockDiagnosticVMOptions`.
Seems the test from https://github.com/openjdk/jdk/pull/27263 was not tested with the product build before integration. ------------- Commit messages: - JDK-8367333 Changes: https://git.openjdk.org/jdk/pull/27359/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27359&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8367969 Stats: 3 lines in 1 file changed: 2 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/27359.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27359/head:pull/27359 PR: https://git.openjdk.org/jdk/pull/27359 From shade at openjdk.org Thu Sep 18 08:07:17 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 18 Sep 2025 08:07:17 GMT Subject: RFR: 8367969: C2: compiler/vectorapi/TestVectorMathLib.java fails without UnlockDiagnosticVMOptions In-Reply-To: References: Message-ID: On Thu, 18 Sep 2025 07:55:10 GMT, Emanuel Peter wrote: > Adding missing `-XX:+UnlockDiagnosticVMOptions`. > > Seems the test from https://github.com/openjdk/jdk/pull/27263 was not tested with the product build before integration. Ah, oops. Looks fine and trivial. ------------- Marked as reviewed by shade (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/27359#pullrequestreview-3237952363 From epeter at openjdk.org Thu Sep 18 08:12:50 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 18 Sep 2025 08:12:50 GMT Subject: RFR: 8359412: Template-Framework Library: Operations and Expressions [v6] In-Reply-To: <6Bm5VrrqCOzdOooIU-wud7c3aCSuv_7GNZe7pe7D7Jk=.c99a9df1-e6bb-4c8d-94e9-029978fae6ab@github.com> References: <6Bm5VrrqCOzdOooIU-wud7c3aCSuv_7GNZe7pe7D7Jk=.c99a9df1-e6bb-4c8d-94e9-029978fae6ab@github.com> Message-ID: > Impliementing ideas from original draft PR: https://github.com/openjdk/jdk/pull/23418 ([Exceptions](https://github.com/openjdk/jdk/pull/23418/files#diff-77e7db8cc0c5e02786e1c993362f98fabe219042eb342fdaffc09fd11380259dR41), [ExpressionFuzzer](https://github.com/openjdk/jdk/pull/23418/files#diff-01844ca5cb007f5eab5fa4195f2f1378d4e7c64ba477fba64626c98ff4054038R66)). > > Specifically, I'm extending the Template Library with `Expression`s, and lists of `Operations` (some basic Expressions). These Expressions can easily be nested and then filled with arguments, and applied in a `Template`. > > Details, in **order you should review**: > - `Operations.java`: maps lots of primitive operators as Expressions. > - `Expression.java`: the fundamental engine behind Expressions. > - `examples/TestExpressions.java`: basic example using Expressions, filling them with random constants. > - `tests/TestExpression.java`: correctness test of Expression machinery. > - `compiler/igvn/ExpressionFuzzer.java`: expression fuzzer for primitive type expressions, including input range/bits constraints and output range/bits verification. > - `PrimitiveType.java`: added `LibraryRNG` facility. We already had `type.con()` which gave us random constants. But we also want to have `type.callLibraryRNG()` so that we can insert a call to a random number generator of the corresponding primitive type. 
I use this facility in the `ExpressionFuzzer.java` to generate random arguments for the expressions. > - `examples/TestPrimitiveTypes.java`: added a `LibraryRNG` example, that tests that has a weak test for randomness: we should have at least 2 different value in 1000 calls. > > If the reviewers absolutely insist, I could split out `LibraryRNG` into a separate RFE. But it's really not that much code, and has direct use in the `Expression` examples. > > **Future Work**: > - Use `Expression`s in a loop over arrays / MemorySegment: fuzz auto-vectorization. > - Use `Expression`s to model more operations: > - `Vector API`, more arithmetic operations like from `Math` classes etc. > - Ensure that the constraints / checksum mechanic in `compiler/igvn/ExpressionFuzzer.java` work, using IR rules. We may even need to add new IGVN optimizations. Add unsigned constraints. > - Find a way to delay IGVN optimizations to test worklist notification: For example, we could add a new testing operator call `TestUtils.delay(x) -> x`, which is intrinsified as some new `DelayNode` that in normal circumstances just folds away, but under `StressIGVN` and `Stres... 
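The weak randomness check described for `examples/TestPrimitiveTypes.java` (at least 2 different values in 1000 calls) could look roughly like the sketch below. It is an illustration under assumptions: `WeakRandomnessCheck` and `hasAtLeastTwoDistinct` are hypothetical names, and `java.util.Random` stands in for the generated `LibraryRNG` call, whose real API is not shown here.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;
import java.util.function.IntSupplier;

public class WeakRandomnessCheck {
    // Returns true as soon as the supplier has produced two different values
    // within the given number of calls, false if every call returned the same value.
    static boolean hasAtLeastTwoDistinct(IntSupplier rng, int calls) {
        Set<Integer> seen = new HashSet<>();
        for (int i = 0; i < calls; i++) {
            seen.add(rng.getAsInt());
            if (seen.size() >= 2) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // java.util.Random is only a stand-in for the RNG under test.
        Random random = new Random();
        if (!hasAtLeastTwoDistinct(random::nextInt, 1000)) {
            throw new RuntimeException("RNG returned the same value in all 1000 calls");
        }
        System.out.println("weak randomness check passed");
    }
}
```

The check is deliberately weak: it only rules out a generator that is stuck on a single value, so a spurious failure with a working generator is extremely unlikely.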
Emanuel Peter has updated the pull request incrementally with two additional commits since the last revision: - more comments - add othervm to test ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26885/files - new: https://git.openjdk.org/jdk/pull/26885/files/a6f83b5a..c04c879c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26885&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26885&range=04-05 Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26885.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26885/head:pull/26885 PR: https://git.openjdk.org/jdk/pull/26885 From epeter at openjdk.org Thu Sep 18 08:16:07 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 18 Sep 2025 08:16:07 GMT Subject: RFR: 8359412: Template-Framework Library: Operations and Expressions [v6] In-Reply-To: <4YVAopGtxnlkh39pp0TaW4kNpBuSIXfbz40UDW_We1w=.308dd279-514f-4fa4-b361-ab36f165caf6@github.com> References: <6Bm5VrrqCOzdOooIU-wud7c3aCSuv_7GNZe7pe7D7Jk=.c99a9df1-e6bb-4c8d-94e9-029978fae6ab@github.com> <4YVAopGtxnlkh39pp0TaW4kNpBuSIXfbz40UDW_We1w=.308dd279-514f-4fa4-b361-ab36f165caf6@github.com> Message-ID: On Wed, 17 Sep 2025 14:29:02 GMT, Manuel Hässig wrote: >> Emanuel Peter has updated the pull request incrementally with two additional commits since the last revision: >> >> - more comments >> - add othervm to test > > Thank you for this enhancement, @eme64! It is nice to see the template framework library evolving. > > The changes look good. I mostly have nits.
@mhaessig Thanks for the review, and the many good suggestions :) I've applied them all, and the PR is ready for re-review :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/26885#issuecomment-3306055839 From fandreuzzi at openjdk.org Thu Sep 18 08:17:12 2025 From: fandreuzzi at openjdk.org (Francesco Andreuzzi) Date: Thu, 18 Sep 2025 08:17:12 GMT Subject: RFR: 8367740: assembler_<cpu>.inline.hpp should not include assembler.inline.hpp In-Reply-To: References: Message-ID: On Tue, 16 Sep 2025 13:52:33 GMT, Damon Fenacci wrote: >> This is the content of assembler.inline.hpp: >> https://github.com/openjdk/jdk/blob/ca89cd06d39ed3a6bbe16f60fea4d7382849edbd/src/hotspot/share/asm/assembler.inline.hpp#L28-L30 >> >> Most of the `assembler_<cpu>.inline.hpp` files include it: >> https://github.com/openjdk/jdk/blob/ca89cd06d39ed3a6bbe16f60fea4d7382849edbd/src/hotspot/cpu/zero/assembler_zero.inline.hpp#L29-L32 >> >> They should probably include `assembler.hpp` instead. >> >> Testing: tier1 in GHA > > It looks like there were a few include cycles. Thanks for fixing this @fandreuz. > Running tier1-3+ tests... 
Thanks for running the tests @dafedafe ------------- PR Comment: https://git.openjdk.org/jdk/pull/27311#issuecomment-3306064151 From mhaessig at openjdk.org Thu Sep 18 08:28:29 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Thu, 18 Sep 2025 08:28:29 GMT Subject: RFR: 8359412: Template-Framework Library: Operations and Expressions [v6] In-Reply-To: References: <6Bm5VrrqCOzdOooIU-wud7c3aCSuv_7GNZe7pe7D7Jk=.c99a9df1-e6bb-4c8d-94e9-029978fae6ab@github.com> Message-ID: On Thu, 18 Sep 2025 08:12:50 GMT, Emanuel Peter wrote: >> Impliementing ideas from original draft PR: https://github.com/openjdk/jdk/pull/23418 ([Exceptions](https://github.com/openjdk/jdk/pull/23418/files#diff-77e7db8cc0c5e02786e1c993362f98fabe219042eb342fdaffc09fd11380259dR41), [ExpressionFuzzer](https://github.com/openjdk/jdk/pull/23418/files#diff-01844ca5cb007f5eab5fa4195f2f1378d4e7c64ba477fba64626c98ff4054038R66)). >> >> Specifically, I'm extending the Template Library with `Expression`s, and lists of `Operations` (some basic Expressions). These Expressions can easily be nested and then filled with arguments, and applied in a `Template`. >> >> Details, in **order you should review**: >> - `Operations.java`: maps lots of primitive operators as Expressions. >> - `Expression.java`: the fundamental engine behind Expressions. >> - `examples/TestExpressions.java`: basic example using Expressions, filling them with random constants. >> - `tests/TestExpression.java`: correctness test of Expression machinery. >> - `compiler/igvn/ExpressionFuzzer.java`: expression fuzzer for primitive type expressions, including input range/bits constraints and output range/bits verification. >> - `PrimitiveType.java`: added `LibraryRNG` facility. We already had `type.con()` which gave us random constants. But we also want to have `type.callLibraryRNG()` so that we can insert a call to a random number generator of the corresponding primitive type. 
I use this facility in the `ExpressionFuzzer.java` to generate random arguments for the expressions. >> - `examples/TestPrimitiveTypes.java`: added a `LibraryRNG` example, that tests that has a weak test for randomness: we should have at least 2 different value in 1000 calls. >> >> If the reviewers absolutely insist, I could split out `LibraryRNG` into a separate RFE. But it's really not that much code, and has direct use in the `Expression` examples. >> >> **Future Work**: >> - Use `Expression`s in a loop over arrays / MemorySegment: fuzz auto-vectorization. >> - Use `Expression`s to model more operations: >> - `Vector API`, more arithmetic operations like from `Math` classes etc. >> - Ensure that the constraints / checksum mechanic in `compiler/igvn/ExpressionFuzzer.java` work, using IR rules. We may even need to add new IGVN optimizations. Add unsigned constraints. >> - Find a way to delay IGVN optimizations to test worklist notification: For example, we could add a new testing operator call `TestUtils.delay(x) -> x`, which is intrinsified as some new `DelayNode` that in normal circumstances just fol... > > Emanuel Peter has updated the pull request incrementally with two additional commits since the last revision: > > - more comments > - add othervm to test Thank you for addressing my comments. ------------- Marked as reviewed by mhaessig (Committer). PR Review: https://git.openjdk.org/jdk/pull/26885#pullrequestreview-3238136469 From mhaessig at openjdk.org Thu Sep 18 08:29:34 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Thu, 18 Sep 2025 08:29:34 GMT Subject: RFR: 8366875: CompileTaskTimeout should be reset for each iteration of RepeatCompilation [v2] In-Reply-To: References: <4TbOkAMu-KU_tgQPg1sK0L8oto_0nD4mQo7yc0hJPm4=.8d87b900-a614-4c13-a4c6-6fe11e206482@github.com> Message-ID: On Sat, 6 Sep 2025 00:31:56 GMT, Dean Long wrote: >> Manuel H?ssig has updated the pull request with a new target base due to a merge or a rebase. 
The incremental webrev excludes the unrelated changes brought in by the merge/rebase. > > Looks good! Testing passed. Could you please rereview @dean-long, @eme64 ? ------------- PR Comment: https://git.openjdk.org/jdk/pull/27120#issuecomment-3306160497 From mhaessig at openjdk.org Thu Sep 18 08:32:57 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Thu, 18 Sep 2025 08:32:57 GMT Subject: RFR: 8367721: Test compiler/arguments/TestCompileTaskTimeout.java crashed: SIGSEGV In-Reply-To: References: Message-ID: On Wed, 17 Sep 2025 11:57:54 GMT, Christian Hagedorn wrote: >> The test `TestCompileTaskTimeout.java` runs `java -Xcomp -XX:CompileTaskTimeout=1 --version` to demonstrate that the timeout works. Part of the timeout working involves it printing the method of the compile task. Inspecting the core file of the execution that failed with a `SIGSEGV` in the compile task timeout signal handler, the backtrace looks as follows: >> >> #n >> #n+1 CompilerThreadTimeoutLinux::signal_handler() >> #n+2 >> #n+3 timer_settime() >> #n+4 CompilerThreadTimeoutLinux::disarm() >> #n+5 CompileTaskWrapper::~CompileTaskWrapper() >> >> So, the compile task hit the timeout during destruction of the underlying `CompileTaskWrapper`. Since the timeout was disarmed only after setting the task to null in the destructor, the signal handler segfaulted when trying to access the method of the compile task to print it out. This PR addresses this issue by moving up the disarmament of the timeout to the top of the destructor. >> >> Because this issue can only be triggered with bad --- or good, depending on your view --- luck on timing, I could not devise a regression test. But this is not too big of an issue, since the CI already caught this issue. >> >> Testing: >> - [x] Github Actions >> - [x] tier1,tier2,tier3 plus stress testing on Oracle supported platforms > > Looks good to me, too! Thank you for your reviews, @chhagedorn and @marc-chevalier! 
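As a rough illustration of the ordering fix described in the quoted text (disarm the timeout before the task is cleared), here is a small Java analogy. It is not the HotSpot C++ code: `DisarmBeforeClear`, `TaskScope`, `Task`, and `onTimeout` are hypothetical names, and the handler is invoked by hand rather than by a real timer signal, so the model is single-threaded.

```java
public class DisarmBeforeClear {
    // Hypothetical stand-in for a compile task; only the method name matters here.
    static final class Task {
        final String method;

        Task(String method) {
            this.method = method;
        }
    }

    // Hypothetical scope object that owns a task and an armed timeout.
    static final class TaskScope implements AutoCloseable {
        private volatile Task task;
        private volatile boolean armed = true;

        TaskScope(Task task) {
            this.task = task;
        }

        // Models the timeout handler: it reads the task to report the method,
        // so it would throw a NullPointerException if it ran while still armed
        // after the task had been cleared.
        String onTimeout() {
            if (!armed) {
                return "ignored";
            }
            return "timeout while compiling " + task.method;
        }

        @Override
        public void close() {
            armed = false; // disarm first, so the handler no longer touches the task
            task = null;   // only then drop the task
        }
    }

    public static void main(String[] args) {
        TaskScope scope = new TaskScope(new Task("Foo::bar"));
        System.out.println(scope.onTimeout()); // handler fires while the task is live
        scope.close();
        System.out.println(scope.onTimeout()); // handler fires late and is ignored
    }
}
```

In a concurrent setting, swapping the two statements in `close()` would reopen a window in which the handler is still armed but the task is already gone, which is the situation the backtrace above points to.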
------------- PR Comment: https://git.openjdk.org/jdk/pull/27331#issuecomment-3306167108 From mhaessig at openjdk.org Thu Sep 18 08:32:58 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Thu, 18 Sep 2025 08:32:58 GMT Subject: Integrated: 8367721: Test compiler/arguments/TestCompileTaskTimeout.java crashed: SIGSEGV In-Reply-To: References: Message-ID: On Wed, 17 Sep 2025 06:57:29 GMT, Manuel H?ssig wrote: > The test `TestCompileTaskTimeout.java` runs `java -Xcomp -XX:CompileTaskTimeout=1 --version` to demonstrate that the timeout works. Part of the timeout working involves it printing the method of the compile task. Inspecting the core file of the execution that failed with a `SIGSEGV` in the compile task timeout signal handler, the backtrace looks as follows: > > #n > #n+1 CompilerThreadTimeoutLinux::signal_handler() > #n+2 > #n+3 timer_settime() > #n+4 CompilerThreadTimeoutLinux::disarm() > #n+5 CompileTaskWrapper::~CompileTaskWrapper() > > So, the compile task hit the timeout during destruction of the underlying `CompileTaskWrapper`. Since the timeout was disarmed only after setting the task to null in the destructor, the signal handler segfaulted when trying to access the method of the compile task to print it out. This PR addresses this issue by moving up the disarmament of the timeout to the top of the destructor. > > Because this issue can only be triggered with bad --- or good, depending on your view --- luck on timing, I could not devise a regression test. But this is not too big of an issue, since the CI already caught this issue. > > Testing: > - [x] Github Actions > - [x] tier1,tier2,tier3 plus stress testing on Oracle supported platforms This pull request has now been integrated. 
Changeset: 04dcaa34 Author: Manuel Hässig URL: https://git.openjdk.org/jdk/commit/04dcaa3412d07c407aed604874095acaf81d7309 Stats: 5 lines in 1 file changed: 4 ins; 1 del; 0 mod 8367721: Test compiler/arguments/TestCompileTaskTimeout.java crashed: SIGSEGV Reviewed-by: mchevalier, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/27331 From mhaessig at openjdk.org Thu Sep 18 08:34:58 2025 From: mhaessig at openjdk.org (Manuel Hässig) Date: Thu, 18 Sep 2025 08:34:58 GMT Subject: RFR: 8367969: C2: compiler/vectorapi/TestVectorMathLib.java fails without UnlockDiagnosticVMOptions In-Reply-To: References: Message-ID: On Thu, 18 Sep 2025 07:55:10 GMT, Emanuel Peter wrote: > Adding missing `-XX:+UnlockDiagnosticVMOptions`. > > Seems the test from https://github.com/openjdk/jdk/pull/27263 was not tested with the product build before integration. Thank you for this fix, @eme64. It looks good to me. ------------- Marked as reviewed by mhaessig (Committer). PR Review: https://git.openjdk.org/jdk/pull/27359#pullrequestreview-3238186192 From ayang at openjdk.org Thu Sep 18 09:04:08 2025 From: ayang at openjdk.org (Albert Mingkun Yang) Date: Thu, 18 Sep 2025 09:04:08 GMT Subject: RFR: 8367740: assembler_<cpu>.inline.hpp should not include assembler.inline.hpp In-Reply-To: References: Message-ID: On Tue, 16 Sep 2025 10:15:06 GMT, Francesco Andreuzzi wrote: > This is the content of assembler.inline.hpp: > https://github.com/openjdk/jdk/blob/ca89cd06d39ed3a6bbe16f60fea4d7382849edbd/src/hotspot/share/asm/assembler.inline.hpp#L28-L30 > > Most of the `assembler_<cpu>.inline.hpp` files include it: > https://github.com/openjdk/jdk/blob/ca89cd06d39ed3a6bbe16f60fea4d7382849edbd/src/hotspot/cpu/zero/assembler_zero.inline.hpp#L29-L32 > > They should probably include `assembler.hpp` instead. > > Testing: tier1 in GHA Marked as reviewed by ayang (Reviewer). 
------------- PR Review: https://git.openjdk.org/jdk/pull/27311#pullrequestreview-3238321532 From duke at openjdk.org Thu Sep 18 09:07:20 2025 From: duke at openjdk.org (duke) Date: Thu, 18 Sep 2025 09:07:20 GMT Subject: RFR: 8367740: assembler_<cpu>.inline.hpp should not include assembler.inline.hpp In-Reply-To: References: Message-ID: On Tue, 16 Sep 2025 10:15:06 GMT, Francesco Andreuzzi wrote: > This is the content of assembler.inline.hpp: > https://github.com/openjdk/jdk/blob/ca89cd06d39ed3a6bbe16f60fea4d7382849edbd/src/hotspot/share/asm/assembler.inline.hpp#L28-L30 > > Most of the `assembler_<cpu>.inline.hpp` files include it: > https://github.com/openjdk/jdk/blob/ca89cd06d39ed3a6bbe16f60fea4d7382849edbd/src/hotspot/cpu/zero/assembler_zero.inline.hpp#L29-L32 > > They should probably include `assembler.hpp` instead. > > Testing: tier1 in GHA @fandreuz Your change (at version ce90f21fb1b61d82f14bd24381914caa81ff2a1f) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/27311#issuecomment-3306378862 From fandreuzzi at openjdk.org Thu Sep 18 09:12:45 2025 From: fandreuzzi at openjdk.org (Francesco Andreuzzi) Date: Thu, 18 Sep 2025 09:12:45 GMT Subject: Integrated: 8367740: assembler_<cpu>.inline.hpp should not include assembler.inline.hpp In-Reply-To: References: Message-ID: On Tue, 16 Sep 2025 10:15:06 GMT, Francesco Andreuzzi wrote: > This is the content of assembler.inline.hpp: > https://github.com/openjdk/jdk/blob/ca89cd06d39ed3a6bbe16f60fea4d7382849edbd/src/hotspot/share/asm/assembler.inline.hpp#L28-L30 > > Most of the `assembler_<cpu>.inline.hpp` files include it: > https://github.com/openjdk/jdk/blob/ca89cd06d39ed3a6bbe16f60fea4d7382849edbd/src/hotspot/cpu/zero/assembler_zero.inline.hpp#L29-L32 > > They should probably include `assembler.hpp` instead. > > Testing: tier1 in GHA This pull request has now been integrated. 
Changeset: 4c7c009d Author: Francesco Andreuzzi Committer: Damon Fenacci URL: https://git.openjdk.org/jdk/commit/4c7c009dd6aa2ce1f65f05c05d7376240f3c01cd Stats: 5 lines in 5 files changed: 0 ins; 0 del; 5 mod 8367740: assembler_<cpu>.inline.hpp should not include assembler.inline.hpp Reviewed-by: dfenacci, ayang ------------- PR: https://git.openjdk.org/jdk/pull/27311 From duke at openjdk.org Thu Sep 18 09:41:40 2025 From: duke at openjdk.org (Don Phelix) Date: Thu, 18 Sep 2025 09:41:40 GMT Subject: RFR: 8367969: C2: compiler/vectorapi/TestVectorMathLib.java fails without UnlockDiagnosticVMOptions In-Reply-To: References: Message-ID: On Thu, 18 Sep 2025 07:55:10 GMT, Emanuel Peter wrote: > Adding missing `-XX:+UnlockDiagnosticVMOptions`. > > Seems the test from https://github.com/openjdk/jdk/pull/27263 was not tested with the product build before integration. LGTM :) ------------- Marked as reviewed by donphelix at github.com (no known OpenJDK username). PR Review: https://git.openjdk.org/jdk/pull/27359#pullrequestreview-3238384691 From bulasevich at openjdk.org Thu Sep 18 10:47:15 2025 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Thu, 18 Sep 2025 10:47:15 GMT Subject: RFR: 8359378: aarch64: crash when using -XX:+UseFPUForSpilling In-Reply-To: References: <0MbLaZYRoz-raa9x1eZycxeHE7mvPiGJujQpg8vBdek=.e3483c47-ad65-4d09-baf5-2db9b780669d@github.com> Message-ID: <0hDCgQdHVY_yIY00TsLYZlcI7aKnw992z_x0DhqvhIY=.a5ef47d5-cfa8-4117-85b2-9e4c45d50975@github.com> On Thu, 18 Sep 2025 07:03:52 GMT, Roberto Castañeda Lozano wrote: >> The AArch64 BarrierSetAssembler path assumes only FP/vector ideal regs reach the FP spill/restore encoding. With -XX:+UseFPUForSpilling, the register allocator may allocate scalar values in FP registers. When such values (Op_RegI/Op_RegN/Op_RegL/Op_RegP) hit `BarrierSetAssembler::encode_float_vector_register_size`, we trip ShouldNotReachHere in release builds and the **"unexpected ideal register"** assertion in debug builds. 
>> >> Fix: teach the encoder to handle scalar ideal regs when they physically live in FP regs: >> - treat Op_RegI / Op_RegN as 32-bit (single slot) - same class as Op_RegF >> - treat Op_RegL / Op_RegP as 64-bit (two slots) - same class as Op_RegD >> >> Related: >> - reproduced since #19746 >> - spilling logic: >> - #18967 >> - #17977 >> >> Testing: tier1-3 with javaoptions -Xcomp -Xbatch -XX:+UseFPUForSpilling on AARCH > > Hi @bulasevich, thanks for working on this issue, but please note that it was already assigned to me ([JDK-8359378](https://bugs.openjdk.org/browse/JDK-8359378)). I am fine with re-assigning it to you, but [next time please ask first, to avoid work duplication](https://openjdk.org/guide/#i-found-an-issue-in-jbs-that-i-want-to-fix). Right, @robcasloz, I started investigating this issue thinking it was something wrong in my own code. Once I realized it was a common issue already assigned, I decided to propose a fix since it looked a bit abandoned. I didn't mean to bypass your work -- you're right, I should have contacted you first. Anyway, I'd appreciate your review. Do you think my change is reasonable? If not, let me close this PR and leave it to you. ------------- PR Comment: https://git.openjdk.org/jdk/pull/27350#issuecomment-3306764696 From aph at openjdk.org Thu Sep 18 10:57:17 2025 From: aph at openjdk.org (Andrew Haley) Date: Thu, 18 Sep 2025 10:57:17 GMT Subject: RFR: 8359378: aarch64: crash when using -XX:+UseFPUForSpilling In-Reply-To: <0MbLaZYRoz-raa9x1eZycxeHE7mvPiGJujQpg8vBdek=.e3483c47-ad65-4d09-baf5-2db9b780669d@github.com> References: <0MbLaZYRoz-raa9x1eZycxeHE7mvPiGJujQpg8vBdek=.e3483c47-ad65-4d09-baf5-2db9b780669d@github.com> Message-ID: On Wed, 17 Sep 2025 16:19:12 GMT, Boris Ulasevich wrote: > AArch64 BarrierSetAssembler path assumes only FP/vector ideal regs reach the FP spill/restore encoding. With -XX:+UseFPUForSpilling Register Allocator may allocate scalar values in FP registers. 
When such values (Op_RegI/Op_RegN/Op_RegL/Op_RegP) hit `BarrierSetAssembler::encode_float_vector_register_size`, we trip ShouldNotReachHere in release build and **"unexpected ideal register"** assertion in debug build. > > Fix: teach the encoder to handle scalar ideal regs when they physically live in FP regs: > - treat Op_RegI / Op_RegN as 32-bit (single slot) - same class as Op_RegF > - treat Op_RegL / Op_RegP as 64-bit (two slots) - same class as Op_RegD > > Related: > - reproduced since #19746 > - spilling logic: > - #18967 > - #17977 > > Testing: tier1-3 with javaoptions -Xcomp -Xbatch -XX:+UseFPUForSpilling on AARCH Given that you're looking at this, I'd appreciate it if you could form an opinion about whether this option is of any use. `UseFPUForSpilling` on AArch64 is showing signs of code rot. If it has advantages on some machine we should turn it on by default; if it does not, why support it at all? ------------- PR Comment: https://git.openjdk.org/jdk/pull/27350#issuecomment-3306800418 From epeter at openjdk.org Thu Sep 18 11:12:24 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 18 Sep 2025 11:12:24 GMT Subject: RFR: 8367969: C2: compiler/vectorapi/TestVectorMathLib.java fails without UnlockDiagnosticVMOptions In-Reply-To: References: Message-ID: On Thu, 18 Sep 2025 08:32:23 GMT, Manuel Hässig wrote: >> Adding missing `-XX:+UnlockDiagnosticVMOptions`. >> >> Seems the test from https://github.com/openjdk/jdk/pull/27263 was not tested with the product build before integration. > > Thank you for this fix, @eme64. It looks good to me. @mhaessig @shipilev Thanks for the reviews! I agree that it is trivial, so I'm integrating before the 24h mark to quiet the CI. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/27359#issuecomment-3306860478 From epeter at openjdk.org Thu Sep 18 11:12:26 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 18 Sep 2025 11:12:26 GMT Subject: Integrated: 8367969: C2: compiler/vectorapi/TestVectorMathLib.java fails without UnlockDiagnosticVMOptions In-Reply-To: References: Message-ID: On Thu, 18 Sep 2025 07:55:10 GMT, Emanuel Peter wrote: > Adding missing `-XX:+UnlockDiagnosticVMOptions`. > > Seems the test from https://github.com/openjdk/jdk/pull/27263 was not tested with the product build before integration. This pull request has now been integrated. Changeset: a49856bb Author: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/a49856bb044057a738ffc4186e1e5e3916c0254c Stats: 3 lines in 1 file changed: 2 ins; 0 del; 1 mod 8367969: C2: compiler/vectorapi/TestVectorMathLib.java fails without UnlockDiagnosticVMOptions Reviewed-by: shade, mhaessig ------------- PR: https://git.openjdk.org/jdk/pull/27359 From bulasevich at openjdk.org Thu Sep 18 11:57:18 2025 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Thu, 18 Sep 2025 11:57:18 GMT Subject: RFR: 8359378: aarch64: crash when using -XX:+UseFPUForSpilling In-Reply-To: References: <0MbLaZYRoz-raa9x1eZycxeHE7mvPiGJujQpg8vBdek=.e3483c47-ad65-4d09-baf5-2db9b780669d@github.com> Message-ID: <6rmFXFPH5a9AnoLqSQa5XBplICEu961cu-6HuXh4EX4=.30793abf-cfb4-49dc-8733-e5e17357f1a8@github.com> On Thu, 18 Sep 2025 10:54:58 GMT, Andrew Haley wrote: >> AArch64 BarrierSetAssembler path assumes only FP/vector ideal regs reach the FP spill/restore encoding. With -XX:+UseFPUForSpilling Register Allocator may allocate scalar values in FP registers. When such values (Op_RegI/Op_RegN/Op_RegL/Op_RegP) hit `BarrierSetAssembler::encode_float_vector_register_size`, we trip ShouldNotReachHere in release build and **"unexpected ideal register"** assertion in debug build. 
>> >> Fix: teach the encoder to handle scalar ideal regs when they physically live in FP regs: >> - treat Op_RegI / Op_RegN as 32-bit (single slot) - same class as Op_RegF >> - treat Op_RegL / Op_RegP as 64-bit (two slots) - same class as Op_RegD >> >> Related: >> - reproduced since #19746 >> - spilling logic: >> - #18967 >> - #17977 >> >> Testing: tier1-3 with javaoptions -Xcomp -Xbatch -XX:+UseFPUForSpilling on AARCH > > Given that you're looking at this, I'd appreciate it if you could form an opinion about whether this option is of any use. > > `UseFPUForSpilling` on AArch64 is showing signs of code rot. If it has advantages on some machine we should turn it on by default; if it does not, why support it at all? @theRealAph Andrew, I agree with you. From my experience it is useless on Cortex-A72, Neoverse N1, Neoverse V1. I have now also checked on Neoverse V2 and Apple M4 - in both cases UseFPUForSpilling shows a clear performance degradation. ------------- PR Comment: https://git.openjdk.org/jdk/pull/27350#issuecomment-3307050077 From epeter at openjdk.org Thu Sep 18 11:58:25 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 18 Sep 2025 11:58:25 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v11] In-Reply-To: References: <1pShdyn-7-wwwiuY1DdMt5iiZ2qc9l_x2F-3AKqkg60=.dd260953-05cc-4b84-b6d1-7f684e74084c@github.com> Message-ID: On Tue, 16 Sep 2025 01:24:35 GMT, Dean Long wrote: >> Could we also bail out here? Or what would happen now in production if there is a RF edge? > > We also use this area past endoff() for storing the "ex_oop" (see for example GraphKit::has_saved_ex_oop()). Are ex_oop and reachability edges mutually exclusive? @dean-long @iwanowww Ok, but probably there will at some point be a conflict. And if RF are rather rare, we will not notice so fast. Or would your stress flag catch the conflict? Is there not a way to make it clear/explicit which edges are there for what reason? 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2358840330 From epeter at openjdk.org Thu Sep 18 11:58:26 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 18 Sep 2025 11:58:26 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v8] In-Reply-To: References: <_n3uP_Dkl3RNq3MFoRDXsS28SM8CcQHaR6vdUJF9U8s=.dcfab97b-be28-4244-93df-c8a23d6d66b8@github.com> Message-ID: On Wed, 17 Sep 2025 22:26:57 GMT, Vladimir Ivanov wrote: >> @iwanowww > > The code in `PhaseMacroExpand::process_users_of_allocation` iterates over direct users of result cast from Allocation nodes. And RF is not special there. Any other case in `PhaseMacroExpand::process_users_of_allocation()` would be affected. Ah ok. As long as it only iterates over the result cast, that is good :) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2358818812 From epeter at openjdk.org Thu Sep 18 11:58:28 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 18 Sep 2025 11:58:28 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v8] In-Reply-To: References: <4jTV6y9R_JfATA54LC7FK3DKdBX1srsU09DK1I25Uo0=.94233927-71f2-4f13-894d-206d00f5fdaa@github.com> Message-ID: On Wed, 17 Sep 2025 19:48:44 GMT, Vladimir Ivanov wrote: >> Ah yes: we may for example move a store out (after) the loop. But wait. We can't move a store across a SafePoint, so that's not a good example. > > For example, loads suffer from the same problems as stores, but constraints on them are more lax. Are you saying we are allowed to move loads across SafePoints, but not across RF? 
If yes, please add such an example to the code comments ;) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2358830331 From rcastanedalo at openjdk.org Thu Sep 18 12:00:16 2025 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Thu, 18 Sep 2025 12:00:16 GMT Subject: RFR: 8359378: aarch64: crash when using -XX:+UseFPUForSpilling In-Reply-To: References: <0MbLaZYRoz-raa9x1eZycxeHE7mvPiGJujQpg8vBdek=.e3483c47-ad65-4d09-baf5-2db9b780669d@github.com> Message-ID: On Thu, 18 Sep 2025 07:03:52 GMT, Roberto Castañeda Lozano wrote: >> AArch64 BarrierSetAssembler path assumes only FP/vector ideal regs reach the FP spill/restore encoding. With -XX:+UseFPUForSpilling Register Allocator may allocate scalar values in FP registers. When such values (Op_RegI/Op_RegN/Op_RegL/Op_RegP) hit `BarrierSetAssembler::encode_float_vector_register_size`, we trip ShouldNotReachHere in release build and **"unexpected ideal register"** assertion in debug build. >> >> Fix: teach the encoder to handle scalar ideal regs when they physically live in FP regs: >> - treat Op_RegI / Op_RegN as 32-bit (single slot) - same class as Op_RegF >> - treat Op_RegL / Op_RegP as 64-bit (two slots) - same class as Op_RegD >> >> Related: >> - reproduced since #19746 >> - spilling logic: >> - #18967 >> - #17977 >> >> Testing: tier1-3 with javaoptions -Xcomp -Xbatch -XX:+UseFPUForSpilling on AARCH > > Hi @bulasevich, thanks for working on this issue, but please note that it was already assigned to me ([JDK-8359378](https://bugs.openjdk.org/browse/JDK-8359378)). I am fine with re-assigning it to you, but [next time please ask first, to avoid work duplication](https://openjdk.org/guide/#i-found-an-issue-in-jbs-that-i-want-to-fix). > Right, @robcasloz, I started investigating this issue thinking it was something wrong in my own code. Once I realized it was a common issue already assigned, I decided to propose a fix since it looked a bit abandoned. 
I didn't mean to bypass your work -- you're right, I should have contacted you first. Anyway, I'd appreciate your review. Do you think my change is reasonable? If not, let me close this PR and leave it to you. Thanks, I had planned to look at this in the upcoming weeks but have not started yet. I just reassigned the issue to you, and will have a look at your fix within the next few days. ------------- PR Comment: https://git.openjdk.org/jdk/pull/27350#issuecomment-3307058150 From epeter at openjdk.org Thu Sep 18 12:04:20 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 18 Sep 2025 12:04:20 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v11] In-Reply-To: References: <1pShdyn-7-wwwiuY1DdMt5iiZ2qc9l_x2F-3AKqkg60=.dd260953-05cc-4b84-b6d1-7f684e74084c@github.com> <7s8qppZ6lzq5iN-inRFkFuXgElo46UmYyIrvExOLA3A=.cf76da61-89ee-4d29-9b5a-0b6e7b3bac2b@github.com> Message-ID: On Wed, 17 Sep 2025 19:44:52 GMT, Vladimir Ivanov wrote: >>> Why are you checking for _mode == LoopOptsDefaultFinal and not for LoopOptsEliminateRFs? >> >> The intention is to avoid an extra `PhaseIdealLoop` construction pass solely for `LoopOptsEliminateRFs` purposes when there's an empty pass during normal flow of loop optimizations. >> >> `LoopOptsEliminateRFs` is performed as the last resort when there was no previous pass to piggyback on. > > Maybe `LoopOptsEliminateRFs` should stress that it is intended to happen as the very last step in the flow of loop optimizations. Or, something happening after all other loop optimizations are over. I'll think more about it. > > From code perspective, what makes things more complicated is that `PhaseIdealLoop` instance is hidden in `PhaseIdealLoop::optimize()`, so shaping it as a step in loop opts pipeline feels like the most appropriate thing to do. @iwanowww It is the last step in the pipeline, but the pipeline could get executed again, right? 
So then you may think that you have reached the last step in the pipeline, but then in the next pipeline execution, you might have already eliminated the RF, and now you would do some loop-opts that you should not. That's what I'm worried about. Can we have some assert for that? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2358850185 From epeter at openjdk.org Thu Sep 18 12:04:22 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 18 Sep 2025 12:04:22 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v11] In-Reply-To: References: <1pShdyn-7-wwwiuY1DdMt5iiZ2qc9l_x2F-3AKqkg60=.dd260953-05cc-4b84-b6d1-7f684e74084c@github.com> Message-ID: On Wed, 17 Sep 2025 19:53:18 GMT, Vladimir Ivanov wrote: >> Why don't we put RF edges somewhere else, so they don't look like derived oops? I was thinking they could go in the monitor area, or if that causes problems, we introduce a new area. > > It's solely an implementation limitation. As of now, the only structure imposed on safepoint inputs relates to debug info (represented as JVMState). The rest is adhoc and there are many conflicting use cases introduced over time. The proper way to address it is to introduce proper structure for non-debug inputs, but it requires significant engineering effort to properly handle it across the whole compilation pipeline. For now, I just work-around it by performing additional transformation to avoid conflicts with existing functionality. 
Maybe we should do that effort soon, otherwise we just keep heaping up tech debt :/ ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2358853253 From epeter at openjdk.org Thu Sep 18 12:04:23 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 18 Sep 2025 12:04:23 GMT Subject: RFR: 8290892: C2: Intrinsify Reference.reachabilityFence [v11] In-Reply-To: References: <1pShdyn-7-wwwiuY1DdMt5iiZ2qc9l_x2F-3AKqkg60=.dd260953-05cc-4b84-b6d1-7f684e74084c@github.com> Message-ID: On Thu, 18 Sep 2025 11:59:52 GMT, Emanuel Peter wrote: >> It's solely an implementation limitation. As of now, the only structure imposed on safepoint inputs relates to debug info (represented as JVMState). The rest is adhoc and there are many conflicting use cases introduced over time. The proper way to address it is to introduce proper structure for non-debug inputs, but it requires significant engineering effort to properly handle it across the whole compilation pipeline. For now, I just work-around it by performing additional transformation to avoid conflicts with existing functionality. > > Maybe we should do that effort soon, otherwise we just keep heaping up tech debt :/ And who knows, maybe conflicts are only avoided by accident, and maybe just because we did not encounter cases where the different features actually overlap and conflict. Or are we confident that we generated sufficient cases with overlaps of the different features that use the safepoint inputs? 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25315#discussion_r2358857194 From epeter at openjdk.org Thu Sep 18 12:17:27 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 18 Sep 2025 12:17:27 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation [v6] In-Reply-To: References: Message-ID: <3LhOW_sYJcS3zgNB2PLXAQ393WU73hdgjSqmsmoy7VQ=.3cbc1e66-c59e-41b1-80c8-24373797259a@github.com> On Wed, 17 Sep 2025 08:48:16 GMT, Xiaohong Gong wrote: >> This is a follow-up patch of [1], which aims at implementing the subword gather load APIs for AArch64 SVE platform. >> >> ### Background >> Vector gather load APIs load values from memory addresses calculated by adding a base pointer to integer indices. SVE provides native gather load instructions for `byte`/`short` types using `int` vectors for indices. The vector size for a gather-load instruction is determined by the index vector (i.e. `int` elements). Hence, the total size is `32 * elem_num` bits, where `elem_num` is the number of loaded elements in the vector register. >> >> ### Implementation >> >> #### Challenges >> Due to size differences between `int` indices (32-bit) and `byte`/`short` data (8/16-bit), operations must be split across multiple vector registers based on the target SVE vector register size constraints. >> >> For a 512-bit SVE machine, loading a `byte` vector with different vector species require different approaches: >> - SPECIES_64: Single operation with mask (8 elements, 256-bit) >> - SPECIES_128: Single operation, full register (16 elements, 512-bit) >> - SPECIES_256: Two operations + merge (32 elements, 1024-bit) >> - SPECIES_512/MAX: Four operations + merge (64 elements, 2048-bit) >> >> Use `ByteVector.SPECIES_512` as an example: >> - It contains 64 elements. So the index vector size should be `64 * 32` bits, which is 4 times of the SVE vector register size. 
>> - It requires 4 times of vector gather-loads to finish the whole operation. >> >> >> byte[] arr = [a, a, a, a, ..., a, b, b, b, b, ..., b, c, c, c, c, ..., c, d, d, d, d, ..., d, ...] >> int[] idx = [0, 1, 2, 3, ..., 63, ...] >> >> 4 gather-load: >> idx_v1 = [15 14 13 ... 1 0] gather_v1 = [... 0000 0000 0000 0000 aaaa aaaa aaaa aaaa] >> idx_v2 = [31 30 29 ... 17 16] gather_v2 = [... 0000 0000 0000 0000 bbbb bbbb bbbb bbbb] >> idx_v3 = [47 46 45 ... 33 32] gather_v3 = [... 0000 0000 0000 0000 cccc cccc cccc cccc] >> idx_v4 = [63 62 61 ... 49 48] gather_v4 = [... 0000 0000 0000 0000 dddd dddd dddd dddd] >> merge: v = [dddd dddd dddd dddd cccc cccc cccc cccc bbbb bbbb bbbb bbbb aaaa aaaa aaaa aaaa] >> >> >> #### Solution >> The implementation simplifies backend complexity by defining each gather load IR to handle one vector gather-load operation, with multiple IRs generated in the compiler mid-end. >> >> Here is the main changes: >> - Enhanced IR generation with architecture-specific patterns based on `gather_scatter_needs_vector_index()` matcher. >> - Added `VectorSliceNode` for result mer... > > Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains eight commits: > > - Add more comments for IRs and added method > - Merge branch 'jdk:master' into JDK-8351623-sve > - Merge 'jdk:master' into JDK-8351623-sve > - Address review comments > - Refine IR pattern and clean backend rules > - Fix indentation issue and move the helper matcher method to header files > - Merge branch jdk:master into JDK-8351623-sve > - 8351623: VectorAPI: Add SVE implementation of subword gather load operation src/hotspot/cpu/aarch64/matcher_aarch64.hpp line 173: > 171: // SVE requires vector indices for gather-load/scatter-store operations on all > 172: // data types. 
> 173: static bool gather_scatter_requires_index_in_address(BasicType bt) { I know I agreed to this naming, but I looked at the signature of `Gather` again: `LoadVectorGatherNode(Node* c, Node* mem, Node* adr, const TypePtr* at, const TypeVect* vt, Node* indices)` I'm a little confused now what is the `address` that your name references. Is it the `adr`? I think not, because that is the base address, right? Can you clarify a little more? Maybe add to the documentation of the gather and scatter node as well, if you think that helps? src/hotspot/share/opto/vectornode.hpp line 1121: > 1119: // that has the same vector type as the node's bottom type. For non-subword types, it must > 1120: // be. However, for subword types, the basic type of index is int. Hence, the index map > 1121: // can be either a vector with int elements or an address which saves the int indices. Very nice, that helps! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2358918581 PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2358924085 From epeter at openjdk.org Thu Sep 18 12:22:32 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 18 Sep 2025 12:22:32 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation [v6] In-Reply-To: <3LhOW_sYJcS3zgNB2PLXAQ393WU73hdgjSqmsmoy7VQ=.3cbc1e66-c59e-41b1-80c8-24373797259a@github.com> References: <3LhOW_sYJcS3zgNB2PLXAQ393WU73hdgjSqmsmoy7VQ=.3cbc1e66-c59e-41b1-80c8-24373797259a@github.com> Message-ID: <--dYtit2PWnrw8fxiHum8BLdxnRAWBNfNAz4eGWYI8E=.ac6c9739-e926-47fe-8c5f-db6ef04b906c@github.com> On Thu, 18 Sep 2025 12:13:55 GMT, Emanuel Peter wrote: >> Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains eight commits: >> >> - Add more comments for IRs and added method >> - Merge branch 'jdk:master' into JDK-8351623-sve >> - Merge 'jdk:master' into JDK-8351623-sve >> - Address review comments >> - Refine IR pattern and clean backend rules >> - Fix indentation issue and move the helper matcher method to header files >> - Merge branch jdk:master into JDK-8351623-sve >> - 8351623: VectorAPI: Add SVE implementation of subword gather load operation > > src/hotspot/cpu/aarch64/matcher_aarch64.hpp line 173: > >> 171: // SVE requires vector indices for gather-load/scatter-store operations on all >> 172: // data types. >> 173: static bool gather_scatter_requires_index_in_address(BasicType bt) { > > I know I agreed to this naming, but I looked at the signature of `Gather` again: > `LoadVectorGatherNode(Node* c, Node* mem, Node* adr, const TypePtr* at, const TypeVect* vt, Node* indices)` > > I'm a little confused now what is the `address` that your name references. Is it the `adr`? I think not, because that is the base address, right? Can you clarify a little more? Maybe add to the documentation of the gather and scatter node as well, if you think that helps? Actually, you already did add documentation to the gather / scatter nodes now. And based on your explanation there, I suggest you rename the method here to: `gather_scatter_requires_indices_from_array` This would say that the indices come from an array, rather than a vector register. Your current name we had agreed on confuses me because it suggests that the index maybe already in the address `adr`, but that does not make much sense. 
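As a reading aid for the subword gather discussion, the operation counts in the quoted PR description (1, 1, 2 and 4 gathers for the 64-, 128-, 256- and 512-bit byte species on a 512-bit SVE machine) follow from the index vector needing `32 * elem_num` bits. The following is only a scalar sketch under that assumption; `gather_ops_needed` and `gather_bytes` are hypothetical names for illustration and are not taken from the PR or from HotSpot.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: number of hardware gather operations needed when the
// indices are 32-bit ints and one operation consumes a full vector register
// of indices (vector_bits wide).
std::size_t gather_ops_needed(std::size_t elem_num, std::size_t vector_bits) {
    assert(vector_bits >= 32);
    const std::size_t index_bits = 32 * elem_num;
    return (index_bits + vector_bits - 1) / vector_bits;  // round up
}

// Hypothetical scalar model of a byte gather that is split into several
// partial gathers, each handling vector_bits / 32 indices. Appending the
// partial results in order models the final merge step.
std::vector<std::uint8_t> gather_bytes(const std::vector<std::uint8_t>& arr,
                                       const std::vector<std::int32_t>& idx,
                                       std::size_t vector_bits) {
    assert(vector_bits >= 32);
    const std::size_t per_op = vector_bits / 32;
    std::vector<std::uint8_t> result;
    result.reserve(idx.size());
    for (std::size_t base = 0; base < idx.size(); base += per_op) {
        // One partial gather: load the bytes selected by this chunk of indices.
        const std::size_t end = std::min(idx.size(), base + per_op);
        for (std::size_t i = base; i < end; ++i) {
            result.push_back(arr.at(static_cast<std::size_t>(idx[i])));
        }
    }
    return result;
}
```

With `vector_bits = 512`, `gather_ops_needed(64, 512)` gives 4, which matches the four `idx_v1` to `idx_v4` gathers in the quoted example.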
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2358946032 From epeter at openjdk.org Thu Sep 18 12:30:52 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 18 Sep 2025 12:30:52 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation [v6] In-Reply-To: References: Message-ID: On Wed, 17 Sep 2025 08:48:16 GMT, Xiaohong Gong wrote: >> This is a follow-up patch of [1], which aims at implementing the subword gather load APIs for AArch64 SVE platform. >> >> ### Background >> Vector gather load APIs load values from memory addresses calculated by adding a base pointer to integer indices. SVE provides native gather load instructions for `byte`/`short` types using `int` vectors for indices. The vector size for a gather-load instruction is determined by the index vector (i.e. `int` elements). Hence, the total size is `32 * elem_num` bits, where `elem_num` is the number of loaded elements in the vector register. >> >> ### Implementation >> >> #### Challenges >> Due to size differences between `int` indices (32-bit) and `byte`/`short` data (8/16-bit), operations must be split across multiple vector registers based on the target SVE vector register size constraints. >> >> For a 512-bit SVE machine, loading a `byte` vector with different vector species require different approaches: >> - SPECIES_64: Single operation with mask (8 elements, 256-bit) >> - SPECIES_128: Single operation, full register (16 elements, 512-bit) >> - SPECIES_256: Two operations + merge (32 elements, 1024-bit) >> - SPECIES_512/MAX: Four operations + merge (64 elements, 2048-bit) >> >> Use `ByteVector.SPECIES_512` as an example: >> - It contains 64 elements. So the index vector size should be `64 * 32` bits, which is 4 times of the SVE vector register size. >> - It requires 4 times of vector gather-loads to finish the whole operation. 
>> >> >> byte[] arr = [a, a, a, a, ..., a, b, b, b, b, ..., b, c, c, c, c, ..., c, d, d, d, d, ..., d, ...] >> int[] idx = [0, 1, 2, 3, ..., 63, ...] >> >> 4 gather-load: >> idx_v1 = [15 14 13 ... 1 0] gather_v1 = [... 0000 0000 0000 0000 aaaa aaaa aaaa aaaa] >> idx_v2 = [31 30 29 ... 17 16] gather_v2 = [... 0000 0000 0000 0000 bbbb bbbb bbbb bbbb] >> idx_v3 = [47 46 45 ... 33 32] gather_v3 = [... 0000 0000 0000 0000 cccc cccc cccc cccc] >> idx_v4 = [63 62 61 ... 49 48] gather_v4 = [... 0000 0000 0000 0000 dddd dddd dddd dddd] >> merge: v = [dddd dddd dddd dddd cccc cccc cccc cccc bbbb bbbb bbbb bbbb aaaa aaaa aaaa aaaa] >> >> >> #### Solution >> The implementation simplifies backend complexity by defining each gather load IR to handle one vector gather-load operation, with multiple IRs generated in the compiler mid-end. >> >> Here is the main changes: >> - Enhanced IR generation with architecture-specific patterns based on `gather_scatter_needs_vector_index()` matcher. >> - Added `VectorSliceNode` for result mer... > > Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains eight commits: > > - Add more comments for IRs and added method > - Merge branch 'jdk:master' into JDK-8351623-sve > - Merge 'jdk:master' into JDK-8351623-sve > - Address review comments > - Refine IR pattern and clean backend rules > - Fix indentation issue and move the helper matcher method to header files > - Merge branch jdk:master into JDK-8351623-sve > - 8351623: VectorAPI: Add SVE implementation of subword gather load operation @XiaohongGong I'm going to be away on vacation for about 3 weeks now. So I won't be able to continue with the review until I'm back. Maybe @vnkozlov or @iwanowww can review instead. Maybe @PaulSandoz or @jatin-bhateja would like to look at it too. 
If they do, I would want them to consider if the approach with the special vector nodes `VectorConcatenateAndNarrow` and `VectorMaskWiden` are really desirable. The complexity needs to go somewhere, but I'm not sure if it is better in the C2 IR or in the backend. In this PR, there are already a thread [here](https://github.com/openjdk/jdk/pull/26236#discussion_r2324740007) and [here](https://github.com/openjdk/jdk/pull/26236#discussion_r2324744990). ------------- PR Review: https://git.openjdk.org/jdk/pull/26236#pullrequestreview-3239353455 From epeter at openjdk.org Thu Sep 18 12:57:48 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 18 Sep 2025 12:57:48 GMT Subject: RFR: 8365205: C2: Optimize popcount value computation using knownbits [v3] In-Reply-To: References: Message-ID: <1YTjbiOmc3OUXZlJ_Pg4W6En5hjU0wd_JBHERbVLDWc=.11ddbe0f-685b-463e-87b7-fcdd14ad4bb2@github.com> On Tue, 9 Sep 2025 02:09:53 GMT, Jatin Bhateja wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Update countbitsnode.cpp > > Hi @TobiHartmann , @SirYwell , @eme64 , can you kindly verify the changes in the latest patch? @jatin-bhateja I'm going to be out of the office for about 3 weeks, so feel free to ask others for reviews! ------------- PR Comment: https://git.openjdk.org/jdk/pull/27075#issuecomment-3307301622 From epeter at openjdk.org Thu Sep 18 12:57:49 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 18 Sep 2025 12:57:49 GMT Subject: RFR: 8363989: AArch64: Add missing backend support of VectorAPI expand operation [v4] In-Reply-To: References: Message-ID: On Mon, 15 Sep 2025 05:55:43 GMT, erifan wrote: >> Currently, on AArch64, the VectorAPI `expand` operation is intrinsified for 32-bit and 64-bit types only when SVE2 is available. In the following cases, `expand` has not yet been intrinsified: >> 1. **Subword types** on SVE2-capable hardware. >> 2. **All types** on NEON and SVE1 environments. 
>> >> As a result, `expand` API performance is very poor in these scenarios. This patch intrinsifies the `expand` operation in the above environments. >> >> Since there are no native instructions directly corresponding to `expand` in these cases, this patch mainly leverages the `TBL` instruction to implement `expand`. To compute the index input for `TBL`, the prefix sum algorithm (see https://en.wikipedia.org/wiki/Prefix_sum) is used. Take a 128-bit byte vector on SVE2 as an example: >> >> To compute: dst = src.expand(mask) >> Data direction: high <== low >> Input: >> src = p o n m l k j i h g f e d c b a >> mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 >> Expected result: >> dst = 0 0 h g 0 0 f e 0 0 d c 0 0 b a >> >> Step 1: calculate the index input of the TBL instruction. >> >> // Set tmp1 as all 0 vector. >> tmp1 = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 >> >> // Move the mask bits from the predicate register to a vector register. >> // **1-bit** mask lane of P register to **8-bit** mask lane of V register. >> tmp2 = mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 >> >> // Shift the entire register. Prefix sum algorithm. >> dst = tmp2 << 8 = 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 >> tmp2 += dst = 0 1 2 1 0 1 2 1 0 1 2 1 0 1 2 1 >> >> dst = tmp2 << 16 = 2 1 0 1 2 1 0 1 2 1 0 1 2 1 0 0 >> tmp2 += dst = 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 >> >> dst = tmp2 << 32 = 2 2 2 2 2 2 2 2 2 2 2 1 0 0 0 0 >> tmp2 += dst = 4 4 4 4 4 4 4 4 4 4 4 3 2 2 2 1 >> >> dst = tmp2 << 64 = 4 4 4 3 2 2 2 1 0 0 0 0 0 0 0 0 >> tmp2 += dst = 8 8 8 7 6 6 6 5 4 4 4 3 2 2 2 1 >> >> // Clear inactive elements. >> dst = sel(mask, tmp2, tmp1) = 0 0 8 7 0 0 6 5 0 0 4 3 0 0 2 1 >> >> // Set the inactive lane value to -1 and set the active lane to the target index. >> dst -= 1 = -1 -1 7 6 -1 -1 5 4 -1 -1 3 2 -1 -1 1 0 >> >> Step 2: shuffle the source vector elements to the target vector >> >> tbl(dst, src, dst) = 0 0 h g 0 0 f e 0 0 d c 0 0 b a >> >> >> The same algorithm is used for NEON and... 
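As a reading aid for the walk-through quoted above, the prefix-sum and `TBL` steps can be modelled in scalar C++. This is only a sketch, not the HotSpot implementation: the names `shift_lanes_up`, `tbl` and `expand` are hypothetical, lane 0 is the lowest lane, and the model assumes (as on AArch64) that a `TBL` index outside the table yields 0, which is what makes the inactive lanes with index -1 come out as zero.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLanes = 16;  // 128-bit vector of byte lanes
using Vec = std::array<std::uint8_t, kLanes>;

// Shift the whole register towards the high lanes by `lanes` byte lanes,
// filling the vacated low lanes with zeros (models the "<< 8/16/32/64" steps).
Vec shift_lanes_up(const Vec& v, std::size_t lanes) {
    assert(lanes <= kLanes);
    Vec out{};
    for (std::size_t i = lanes; i < kLanes; ++i) {
        out[i] = v[i - lanes];
    }
    return out;
}

// Table lookup; an out-of-range index produces 0 (assumed TBL behaviour).
Vec tbl(const Vec& table, const Vec& idx) {
    Vec out{};
    for (std::size_t i = 0; i < kLanes; ++i) {
        out[i] = idx[i] < kLanes ? table[idx[i]] : 0;
    }
    return out;
}

// dst = src.expand(mask), where each mask lane holds 0 or 1.
Vec expand(const Vec& src, const Vec& mask) {
    // Inclusive prefix sum of the mask via log2(kLanes) shift-and-add steps.
    Vec sum = mask;
    for (std::size_t s = 1; s < kLanes; s *= 2) {
        const Vec shifted = shift_lanes_up(sum, s);
        for (std::size_t i = 0; i < kLanes; ++i) {
            sum[i] = static_cast<std::uint8_t>(sum[i] + shifted[i]);
        }
    }
    // Clear inactive lanes, then subtract 1: active lanes get their source
    // index, inactive lanes wrap to 255 (the "-1" in the walk-through).
    Vec idx{};
    for (std::size_t i = 0; i < kLanes; ++i) {
        idx[i] = static_cast<std::uint8_t>((mask[i] ? sum[i] : 0) - 1);
    }
    return tbl(src, idx);
}
```

Feeding it the `src` and `mask` from the quoted example (low lanes first: `a b c d ...` and `1 1 0 0 1 1 0 0 ...`) reproduces the expected `a b 0 0 c d 0 0 ...` layout.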
> > erifan has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains six commits: > > - Merge branch 'master' into JDK-8363989 > - Align code example data for better reading > - Merge branch 'master' into JDK-8363989 > - Improve the comment of the vector expand implementation > - Merge branch 'master' into JDK-8363989 > - 8363989: AArch64: Add missing backend support of VectorAPI expand operation > > Currently, on AArch64, the VectorAPI `expand` operation is intrinsified > for 32-bit and 64-bit types only when SVE2 is available. In the following > cases, `expand` has not yet been intrinsified: > 1. **Subword types** on SVE2-capable hardware. > 2. **All types** on NEON and SVE1 environments. > > As a result, `expand` API performance is very poor in these scenarios. > This patch intrinsifies the `expand` operation in the above environments. > > Since there are no native instructions directly corresponding to `expand` > in these cases, this patch mainly leverages the `TBL` instruction to > implement `expand`. To compute the index input for `TBL`, the prefix sum > algorithm (see https://en.wikipedia.org/wiki/Prefix_sum) is used. > Take a 128-bit byte vector on SVE2 as an example: > ``` > To compute: dst = src.expand(mask) > Data direction: high <== low > Input: > src = p o n m l k j i h g f e d c b a > mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 > Expected result: > dst = 0 0 h g 0 0 f e 0 0 d c 0 0 b a > ``` > Step 1: calculate the index input of the TBL instruction. > ``` > // Set tmp1 as all 0 vector. > tmp1 = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 > > // Move the mask bits from the predicate register to a vector register. > // **1-bit** mask lane of P register to **8-bit** mask lane of V register. > tmp2 = mask = 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 > > // Shift the entire register. Prefix sum algorithm. 
> dst = tmp2 << 8 = 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 > tmp2 += dst = 0 1 2 1 0 1 2 1 0 1 2 1 0 1 2 1 > > dst = tmp2 << 16 = 2 1 0 1 2 1 0 1 2 1 0 1 2 1 0 0 > tmp2 += dst = 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 > > dst = tmp2 << 32 = 2 2 2 2 2 2 2 2 2 2 2 1 0 0 0 0 > tmp2 += dst = 4 4 4 4 4 4 4 4 4 4 4 3 2 2 2 1 > > dst = tmp2 << 64 = 4 4 4 3 2 2 2 1 0 0 0 0 0 0 0 0 > ... I ran testing again, and it passed now. Sorry, must have been an infra issue. Approved! :) ------------- Marked as reviewed by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26740#pullrequestreview-3239515013 From epeter at openjdk.org Thu Sep 18 12:58:55 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 18 Sep 2025 12:58:55 GMT Subject: RFR: 8366333: AArch64: Enhance SVE subword type implementation of vector compress In-Reply-To: References: Message-ID: On Mon, 15 Sep 2025 09:58:19 GMT, erifan wrote: >> Would it make sense to additionally run the relevant benchmarks on other popular aarch64 platforms such as Graviton, to make sure the improvements are seen there as well? > > @galderz Yeah, absolutely. This is the test results on an **AWS graviton3 V1 machine**, we can see similar performance gain. 
>
> Benchmark | Units | Before | Error | After | Error | Uplift
> -- | -- | -- | -- | -- | -- | --
> Byte128Vector.compress | ops/ms | 2405.511 | 0.763 | 6116.85 | 17.699 | 2.54284848
> Byte64Vector.compress | ops/ms | 1151.662 | 11.262 | 5278.924 | 6.74 | 4.58374419
> Double128Vector.compress | ops/ms | 4919.017 | 4.909 | 4940.232 | 20.143 | 1.00431285
> Double64Vector.compress | ops/ms | 37.071 | 0.778 | 37.109 | 0.945 | 1.00102506
> Float128Vector.compress | ops/ms | 9580.312 | 48.341 | 9586.499 | 74.934 | 1.0006458
> Float64Vector.compress | ops/ms | 4943.728 | 7.361 | 4941.917 | 5.871 | 0.99963368
> Int128Vector.compress | ops/ms | 9496.991 | 34.972 | 9515.122 | 29.204 | 1.00190913
> Int64Vector.compress | ops/ms | 4940.23 | 7.141 | 4941.815 | 5.077 | 1.00032084
> Long128Vector.compress | ops/ms | 4918.142 | 14.835 | 4917.148 | 9.05 | 0.99979789
> Long64Vector.compress | ops/ms | 36.58 | 0.426 | 36.574 | 0.431 | 0.99983598
> Short128Vector.compress | ops/ms | 3343.878 | 0.898 | 6813.421 | 4.143 | 2.03758062
> Short64Vector.compress | ops/ms | 1595.358 | 3.37 | 3390.959 | 3.55 | 2.12551603

@erifan I'm going to be out of the office for 3 weeks, so feel free to ask others for reviews :)

-------------

PR Comment: https://git.openjdk.org/jdk/pull/27188#issuecomment-3307304775

From adinn at openjdk.org Thu Sep 18 15:01:41 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Thu, 18 Sep 2025 15:01:41 GMT Subject: RFR: 8359378: aarch64: crash when using -XX:+UseFPUForSpilling In-Reply-To:
<0MbLaZYRoz-raa9x1eZycxeHE7mvPiGJujQpg8vBdek=.e3483c47-ad65-4d09-baf5-2db9b780669d@github.com> References: <0MbLaZYRoz-raa9x1eZycxeHE7mvPiGJujQpg8vBdek=.e3483c47-ad65-4d09-baf5-2db9b780669d@github.com> Message-ID: On Wed, 17 Sep 2025 16:19:12 GMT, Boris Ulasevich wrote: > AArch64 BarrierSetAssembler path assumes only FP/vector ideal regs reach the FP spill/restore encoding. With -XX:+UseFPUForSpilling Register Allocator may allocate scalar values in FP registers. When such values (Op_RegI/Op_RegN/Op_RegL/Op_RegP) hit `BarrierSetAssembler::encode_float_vector_register_size`, we trip ShouldNotReachHere in release build and **"unexpected ideal register"** assertion in debug build. > > Fix: teach the encoder to handle scalar ideal regs when they physically live in FP regs: > - treat Op_RegI / Op_RegN as 32-bit (single slot) - same class as Op_RegF > - treat Op_RegL / Op_RegP as 64-bit (two slots) - same class as Op_RegD > > Related: > - reproduced since #19746 > - spilling logic: > - #18967 > - #17977 > > Testing: tier1-3 with javaoptions -Xcomp -Xbatch -XX:+UseFPUForSpilling on AARCH I was wondering about that. So, perhaps a better fix is to change the command line ergonomics so that AArch64 either 1) refuses to run with it set to true or 2) prints a warning and resets it to false. ------------- PR Comment: https://git.openjdk.org/jdk/pull/27350#issuecomment-3307970063 From vlivanov at openjdk.org Thu Sep 18 15:52:06 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Thu, 18 Sep 2025 15:52:06 GMT Subject: RFR: 8367969: C2: compiler/vectorapi/TestVectorMathLib.java fails without UnlockDiagnosticVMOptions In-Reply-To: References: Message-ID: On Thu, 18 Sep 2025 07:55:10 GMT, Emanuel Peter wrote: > Adding missing `-XX:+UnlockDiagnosticVMOptions`. > > Seems the test from https://github.com/openjdk/jdk/pull/27263 was not tested with the product build before integration. Thanks for taking care of it, Emanuel. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/27359#issuecomment-3308240304 From galder at openjdk.org Thu Sep 18 17:35:26 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Thu, 18 Sep 2025 17:35:26 GMT Subject: RFR: 8359412: Template-Framework Library: Operations and Expressions [v6] In-Reply-To: References: <6Bm5VrrqCOzdOooIU-wud7c3aCSuv_7GNZe7pe7D7Jk=.c99a9df1-e6bb-4c8d-94e9-029978fae6ab@github.com> Message-ID: On Thu, 18 Sep 2025 08:12:50 GMT, Emanuel Peter wrote: >> Impliementing ideas from original draft PR: https://github.com/openjdk/jdk/pull/23418 ([Exceptions](https://github.com/openjdk/jdk/pull/23418/files#diff-77e7db8cc0c5e02786e1c993362f98fabe219042eb342fdaffc09fd11380259dR41), [ExpressionFuzzer](https://github.com/openjdk/jdk/pull/23418/files#diff-01844ca5cb007f5eab5fa4195f2f1378d4e7c64ba477fba64626c98ff4054038R66)). >> >> Specifically, I'm extending the Template Library with `Expression`s, and lists of `Operations` (some basic Expressions). These Expressions can easily be nested and then filled with arguments, and applied in a `Template`. >> >> Details, in **order you should review**: >> - `Operations.java`: maps lots of primitive operators as Expressions. >> - `Expression.java`: the fundamental engine behind Expressions. >> - `examples/TestExpressions.java`: basic example using Expressions, filling them with random constants. >> - `tests/TestExpression.java`: correctness test of Expression machinery. >> - `compiler/igvn/ExpressionFuzzer.java`: expression fuzzer for primitive type expressions, including input range/bits constraints and output range/bits verification. >> - `PrimitiveType.java`: added `LibraryRNG` facility. We already had `type.con()` which gave us random constants. But we also want to have `type.callLibraryRNG()` so that we can insert a call to a random number generator of the corresponding primitive type. 
I use this facility in the `ExpressionFuzzer.java` to generate random arguments for the expressions. >> - `examples/TestPrimitiveTypes.java`: added a `LibraryRNG` example, which includes a weak test for randomness: we should have at least 2 different values in 1000 calls. >> >> If the reviewers absolutely insist, I could split out `LibraryRNG` into a separate RFE. But it's really not that much code, and has direct use in the `Expression` examples. >> >> **Future Work**: >> - Use `Expression`s in a loop over arrays / MemorySegment: fuzz auto-vectorization. >> - Use `Expression`s to model more operations: >> - `Vector API`, more arithmetic operations such as those from the `Math` classes etc. >> - Ensure that the constraints / checksum mechanics in `compiler/igvn/ExpressionFuzzer.java` work, using IR rules. We may even need to add new IGVN optimizations. Add unsigned constraints. >> - Find a way to delay IGVN optimizations to test worklist notification: For example, we could add a new testing operator called `TestUtils.delay(x) -> x`, which is intrinsified as some new `DelayNode` that in normal circumstances just fol... > > Emanuel Peter has updated the pull request incrementally with two additional commits since the last revision: > > - more comments > - add othervm to test Nice additions @eme64! I would have liked to see an example of a real use case of this in action included in the PR, e.g. some kind of IR test that takes advantage of this. E.g. a companion version (and/or replacement) for `VectorReduction2`? A follow-up RFE would of course be fine for this. ------------- Marked as reviewed by galder (Author).
PR Review: https://git.openjdk.org/jdk/pull/26885#pullrequestreview-3241161843 From galder at openjdk.org Thu Sep 18 18:09:40 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Thu, 18 Sep 2025 18:09:40 GMT Subject: RFR: 8367389: C2 SuperWord: refactor VTransform to model the whole loop instead of just the basic block [v3] In-Reply-To: References: Message-ID: On Thu, 18 Sep 2025 06:40:12 GMT, Emanuel Peter wrote: >> I'm working on cost-modeling, and am integrating some smaller changes from this proof-of-concept PR: >> https://github.com/openjdk/jdk/pull/20964 >> [See plan overfiew.](https://bugs.openjdk.org/browse/JDK-8340093) >> >> This is a pure refactoring - no change in behaviour. I'm presenting it like this because it will make reviews easier. >> >> ------------------------------ >> >> **Goals** >> - VTransform models **all nodes in the loop**, not just the basic block (enables later VTransform::optimize, like moving reductions out of the loop) >> - Remove `_nodes` from the vector vtnodes. >> >> **Details** >> - Remove: `AUTO_VECTORIZATION2_AFTER_REORDER`, `apply_memops_reordering_with_schedule`, `print_memops_schedule`. >> - Instead of reordering the scalar memops, we create the new memory graph during `VTransform::apply`. That is why the `VTransformApplyState` now needs to track the memory states. >> - Refactor `VLoopMemorySlices`: map not just memory slices with phis (have stores in loop), but also those with only loads (no phi). >> - Create vtnodes for all nodes in the loop (not just the basic block), as well as inputs (already) and outputs (new). Mapping also the output nodes means during `apply`, we naturally connect the uses after the loop to their inputs from the loop (which may be new nodes after the transformation). >> - `_mem_ref_for_main_loop_alignment` -> `_vpointer_for_main_loop_alignment`. Instead of tracking the memory node to later have access to its `VPointer`, we take it directly. 
That removes one more use of `_nodes` for vector vtnodes. >> >> I also made a lot of annotations in the code below, for easier review. >> >> **Suggested order for review** >> - Removal of `VTransformGraph::apply_memops_reordering_with_schedule` -> sets up need to build memory graph on the fly. >> - Old and new code for `VLoopMemorySlices` -> we now also track load-only slices. >> - `build_scalar_vtnodes_for_non_packed_nodes`, `build_inputs_for_scalar_vtnodes`, `build_uses_after_loop`, `apply_vtn_inputs_to_node` (use in `apply`), `apply_backedge`, `fix_memory_state_uses_after_loop` >> - `VTransformApplyState`: how it now tracks the memory state. >> - `VTransformVectorNode` -> removal of `_nodes` (Big Win!) >> - Then look at all the other details. > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > Update src/hotspot/share/opto/vectorization.cpp > > Co-authored-by: Manuel H?ssig Small nitpick, the rest looks fine as far as I can understand it :) src/hotspot/share/opto/vtransform.cpp line 760: > 758: // We may have reordered the scalar stores, or replaced them with vectors. Now > 759: // the last memory state in the loop may have changed. Thus, we need to change > 760: // the uses of the old last memory state the the new last memory state. Suggestion: // the uses of the old last memory state the new last memory state. ------------- Changes requested by galder (Author). 
PR Review: https://git.openjdk.org/jdk/pull/27208#pullrequestreview-3241304926 PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2360559524 From psandoz at openjdk.org Thu Sep 18 20:00:53 2025 From: psandoz at openjdk.org (Paul Sandoz) Date: Thu, 18 Sep 2025 20:00:53 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation [v6] In-Reply-To: References: Message-ID: On Wed, 17 Sep 2025 08:48:16 GMT, Xiaohong Gong wrote: >> This is a follow-up patch of [1], which aims at implementing the subword gather load APIs for AArch64 SVE platform. >> >> ### Background >> Vector gather load APIs load values from memory addresses calculated by adding a base pointer to integer indices. SVE provides native gather load instructions for `byte`/`short` types using `int` vectors for indices. The vector size for a gather-load instruction is determined by the index vector (i.e. `int` elements). Hence, the total size is `32 * elem_num` bits, where `elem_num` is the number of loaded elements in the vector register. >> >> ### Implementation >> >> #### Challenges >> Due to size differences between `int` indices (32-bit) and `byte`/`short` data (8/16-bit), operations must be split across multiple vector registers based on the target SVE vector register size constraints. >> >> For a 512-bit SVE machine, loading a `byte` vector with different vector species require different approaches: >> - SPECIES_64: Single operation with mask (8 elements, 256-bit) >> - SPECIES_128: Single operation, full register (16 elements, 512-bit) >> - SPECIES_256: Two operations + merge (32 elements, 1024-bit) >> - SPECIES_512/MAX: Four operations + merge (64 elements, 2048-bit) >> >> Use `ByteVector.SPECIES_512` as an example: >> - It contains 64 elements. So the index vector size should be `64 * 32` bits, which is 4 times of the SVE vector register size. >> - It requires 4 times of vector gather-loads to finish the whole operation. 
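The per-species operation counts listed above follow from one piece of arithmetic: the index vector needs 32 bits per loaded element, and it has to be cut into register-sized pieces. A minimal sketch of that calculation, assuming a 512-bit SVE register as in the description (the class and method names are hypothetical, not taken from the patch):

```java
// Illustrative arithmetic only: how many gather-loads a byte species needs.
class GatherSplitSketch {
    static final int INDEX_BITS = 32;   // gather indices are int lanes

    // Number of gather-load operations for elemCount subword elements
    // when one vector register holds registerBits bits.
    static int gatherOps(int elemCount, int registerBits) {
        int indexVectorBits = elemCount * INDEX_BITS;
        return (indexVectorBits + registerBits - 1) / registerBits;   // ceiling division
    }

    public static void main(String[] args) {
        int registerBits = 512;
        for (int speciesBits : new int[] {64, 128, 256, 512}) {
            int elemCount = speciesBits / Byte.SIZE;   // byte lanes in the species
            System.out.println("SPECIES_" + speciesBits + ": " + elemCount + " elements, "
                    + (elemCount * INDEX_BITS) + " index bits, "
                    + gatherOps(elemCount, registerBits) + " gather-load(s)");
        }
    }
}
```

For `SPECIES_64` the single operation covers only part of the register, which is why the description pairs it with a mask.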
>> >> >> byte[] arr = [a, a, a, a, ..., a, b, b, b, b, ..., b, c, c, c, c, ..., c, d, d, d, d, ..., d, ...] >> int[] idx = [0, 1, 2, 3, ..., 63, ...] >> >> 4 gather-load: >> idx_v1 = [15 14 13 ... 1 0] gather_v1 = [... 0000 0000 0000 0000 aaaa aaaa aaaa aaaa] >> idx_v2 = [31 30 29 ... 17 16] gather_v2 = [... 0000 0000 0000 0000 bbbb bbbb bbbb bbbb] >> idx_v3 = [47 46 45 ... 33 32] gather_v3 = [... 0000 0000 0000 0000 cccc cccc cccc cccc] >> idx_v4 = [63 62 61 ... 49 48] gather_v4 = [... 0000 0000 0000 0000 dddd dddd dddd dddd] >> merge: v = [dddd dddd dddd dddd cccc cccc cccc cccc bbbb bbbb bbbb bbbb aaaa aaaa aaaa aaaa] >> >> >> #### Solution >> The implementation simplifies backend complexity by defining each gather load IR to handle one vector gather-load operation, with multiple IRs generated in the compiler mid-end. >> >> Here is the main changes: >> - Enhanced IR generation with architecture-specific patterns based on `gather_scatter_needs_vector_index()` matcher. >> - Added `VectorSliceNode` for result mer... > > Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains eight commits: > > - Add more comments for IRs and added method > - Merge branch 'jdk:master' into JDK-8351623-sve > - Merge 'jdk:master' into JDK-8351623-sve > - Address review comments > - Refine IR pattern and clean backend rules > - Fix indentation issue and move the helper matcher method to header files > - Merge branch jdk:master into JDK-8351623-sve > - 8351623: VectorAPI: Add SVE implementation of subword gather load operation > I would want them to consider if the approach with the special vector nodes `VectorConcatenateAndNarrow` and `VectorMaskWiden` are really desirable. The complexity needs to go somewhere, but I'm not sure if it is better in the C2 IR or in the backend. 
> > It would just be nice to build on "simple" building blocks and not have too many complex nodes, that have very special semantics (widen + split into two) Intuitively this seems like the right way to think about it, although I don't have a proposed solution, i am really just agreeing with the above sentiment - a compositional solution, if possible, with the right primitive building blocks will likely be superior. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26236#issuecomment-3309447725 From dlong at openjdk.org Thu Sep 18 23:12:00 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 18 Sep 2025 23:12:00 GMT Subject: RFR: 8366875: CompileTaskTimeout should be reset for each iteration of RepeatCompilation [v3] In-Reply-To: <6ijTgwXUpwm8C_U7oOsN7RScv-caCal0U67UXFZ6VmY=.5550cf2f-2c57-4fc0-a2cd-3df6627485a2@github.com> References: <4TbOkAMu-KU_tgQPg1sK0L8oto_0nD4mQo7yc0hJPm4=.8d87b900-a614-4c13-a4c6-6fe11e206482@github.com> <6ijTgwXUpwm8C_U7oOsN7RScv-caCal0U67UXFZ6VmY=.5550cf2f-2c57-4fc0-a2cd-3df6627485a2@github.com> Message-ID: On Tue, 16 Sep 2025 15:38:12 GMT, Manuel H?ssig wrote: >> When running a debug JVM on Linux with a compile task timeout and repeated compilation, the execution will time out almost always because the timeout does not reset for repetitions of a compilation. The core of the compile task timeout is to limit the amount of time a single compilation can take. Thus, this PR resets the `CompileTaskTimeout` for every compilation when running with `-XX:RepeatCompilation=` for n > 1. >> >> This PR is stacked on top of #27094. >> >> Testing: >> - [x] Github Actions (failures are unrelated) >> - [x] tier1, tier2, tier3 plus some additional internal testing > > Manuel H?ssig has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains four additional commits since the last revision: > > - Merge branch 'master' into JDK-8366875-repeat-comp-to > - Reset timeout on repeated compilations > - Add regression test > - Use timeuot factor Still good. ------------- Marked as reviewed by dlong (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/27120#pullrequestreview-3242316009 From duke at openjdk.org Fri Sep 19 03:24:56 2025 From: duke at openjdk.org (duke) Date: Fri, 19 Sep 2025 03:24:56 GMT Subject: Withdrawn: 8359963: compiler/c2/aarch64/TestStaticCallStub.java fails with for code cache > 250MB the static call stub is expected to be implemented using far branch In-Reply-To: References: Message-ID: On Mon, 30 Jun 2025 15:24:42 GMT, Mikhail Ablakatov wrote: > The test assumed that hsdis is always available which is not the case. Make the test accept and scan either real or pseudo disassembly. This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/26047 From galder at openjdk.org Fri Sep 19 04:08:05 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Fri, 19 Sep 2025 04:08:05 GMT Subject: RFR: 8367657: C2 SuperWord: NormalMapping demo from JVMLS 2025 [v6] In-Reply-To: References: Message-ID: On Wed, 17 Sep 2025 14:35:23 GMT, Emanuel Peter wrote: >> Demo from here: >> https://inside.java/2025/08/16/jvmls-hotspot-auto-vectorization/ >> >> Cleaned up and enhanced with a JTREG and IR test. >> I also added some additional "generated" normal maps from height functions. >> And I display the resulting image side-by-side with the normal map. >> >> I decided to put it in a new directory `compiler.gallery`, anticipating other compiler tests that are both visually appealing (i.e. can be used for a "gallery") and that we may want to back up with other tests like IR testing. 
>> >> There is a **stand-alone** way to run the demo: >> `java test/hotspot/jtreg/compiler/gallery/NormalMapping.java` >> (though it may only run with JDK22+, probably due to some amber features) >> >> **Quick Performance Numbers**, running on my avx512 laptop. >> default / AVX3: 105 FPS >> AVX2: 82 FPS >> AVX1: 50 FPS >> No vectorization: 19 FPS >> GraalJIT: 13 FPS (`jdk-26-ea+5` - probably issue with vectorization / inlining?) >> >> Here are some snapshots, but **I really recommend pulling the diff and playing with it, it looks much better in motion**: >> [four snapshot images] > > Emanuel Peter has updated the pull request incrementally with two additional commits since the last revision: > > - Update test/hotspot/jtreg/compiler/gallery/TestNormalMapping.java > > Co-authored-by: Andrey Turbanov > - Update test/hotspot/jtreg/compiler/gallery/NormalMapping.java > > Co-authored-by: Christian Hagedorn Great demo! I ran it on my M4 Pro at 220 FPS with default flags. ------------- Marked as reviewed by galder (Author). PR Review: https://git.openjdk.org/jdk/pull/27282#pullrequestreview-3242796809 From epeter at openjdk.org Fri Sep 19 05:51:25 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 19 Sep 2025 05:51:25 GMT Subject: RFR: 8359412: Template-Framework Library: Operations and Expressions [v6] In-Reply-To: References: <6Bm5VrrqCOzdOooIU-wud7c3aCSuv_7GNZe7pe7D7Jk=.c99a9df1-e6bb-4c8d-94e9-029978fae6ab@github.com> Message-ID: On Thu, 18 Sep 2025 17:32:22 GMT, Galder Zamarreño wrote: >> Emanuel Peter has updated the pull request incrementally with two additional commits since the last revision: >> >> - more comments >> - add othervm to test > > Nice additions @eme64! > > I would have liked to see an example of a real use case of this in action included in the PR, e.g. some kind of IR test that takes advantage of this. E.g. a companion version (and/or replacement) for `VectorReduction2`? A follow-up RFE would of course be fine for this.
@galderz Thanks for reviewing! Can you spell out a little more what you would like to see? For me, the `compiler/igvn/ExpressionFuzzer.java` is already "an example of real use" for me. And I have a lot still planned in future RFE's, see the "future work" section in the PR description ;) ------------- PR Comment: https://git.openjdk.org/jdk/pull/26885#issuecomment-3310686831 From epeter at openjdk.org Fri Sep 19 05:56:20 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 19 Sep 2025 05:56:20 GMT Subject: RFR: 8367389: C2 SuperWord: refactor VTransform to model the whole loop instead of just the basic block [v4] In-Reply-To: References: Message-ID: > I'm working on cost-modeling, and am integrating some smaller changes from this proof-of-concept PR: > https://github.com/openjdk/jdk/pull/20964 > [See plan overfiew.](https://bugs.openjdk.org/browse/JDK-8340093) > > This is a pure refactoring - no change in behaviour. I'm presenting it like this because it will make reviews easier. > > ------------------------------ > > **Goals** > - VTransform models **all nodes in the loop**, not just the basic block (enables later VTransform::optimize, like moving reductions out of the loop) > - Remove `_nodes` from the vector vtnodes. > > **Details** > - Remove: `AUTO_VECTORIZATION2_AFTER_REORDER`, `apply_memops_reordering_with_schedule`, `print_memops_schedule`. > - Instead of reordering the scalar memops, we create the new memory graph during `VTransform::apply`. That is why the `VTransformApplyState` now needs to track the memory states. > - Refactor `VLoopMemorySlices`: map not just memory slices with phis (have stores in loop), but also those with only loads (no phi). > - Create vtnodes for all nodes in the loop (not just the basic block), as well as inputs (already) and outputs (new). Mapping also the output nodes means during `apply`, we naturally connect the uses after the loop to their inputs from the loop (which may be new nodes after the transformation). 
> - `_mem_ref_for_main_loop_alignment` -> `_vpointer_for_main_loop_alignment`. Instead of tracking the memory node to later have access to its `VPointer`, we take it directly. That removes one more use of `_nodes` for vector vtnodes. > > I also made a lot of annotations in the code below, for easier review. > > **Suggested order for review** > - Removal of `VTransformGraph::apply_memops_reordering_with_schedule` -> sets up need to build memory graph on the fly. > - Old and new code for `VLoopMemorySlices` -> we now also track load-only slices. > - `build_scalar_vtnodes_for_non_packed_nodes`, `build_inputs_for_scalar_vtnodes`, `build_uses_after_loop`, `apply_vtn_inputs_to_node` (use in `apply`), `apply_backedge`, `fix_memory_state_uses_after_loop` > - `VTransformApplyState`: how it now tracks the memory state. > - `VTransformVectorNode` -> removal of `_nodes` (Big Win!) > - Then look at all the other details. Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: Update src/hotspot/share/opto/vtransform.cpp Co-authored-by: Galder Zamarre?o ------------- Changes: - all: https://git.openjdk.org/jdk/pull/27208/files - new: https://git.openjdk.org/jdk/pull/27208/files/9af66755..99fd1c99 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=27208&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27208&range=02-03 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/27208.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27208/head:pull/27208 PR: https://git.openjdk.org/jdk/pull/27208 From epeter at openjdk.org Fri Sep 19 05:56:23 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 19 Sep 2025 05:56:23 GMT Subject: RFR: 8367389: C2 SuperWord: refactor VTransform to model the whole loop instead of just the basic block [v3] In-Reply-To: References: Message-ID: On Thu, 18 Sep 2025 18:06:33 GMT, Galder Zamarre?o wrote: >> Emanuel Peter has updated the 
pull request incrementally with one additional commit since the last revision: >> >> Update src/hotspot/share/opto/vectorization.cpp >> >> Co-authored-by: Manuel H?ssig > > Small nitpick, the rest looks fine as far as I can understand it :) @galderz Thanks for having a look, I applied the suggestion :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/27208#issuecomment-3310689125 From chagedorn at openjdk.org Fri Sep 19 06:37:27 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 19 Sep 2025 06:37:27 GMT Subject: RFR: 8367657: C2 SuperWord: NormalMapping demo from JVMLS 2025 [v6] In-Reply-To: References: Message-ID: On Wed, 17 Sep 2025 14:35:23 GMT, Emanuel Peter wrote: >> Demo from here: >> https://inside.java/2025/08/16/jvmls-hotspot-auto-vectorization/ >> >> Cleaned up and enhanced with a JTREG and IR test. >> I also added some additional "generated" normal maps from height functions. >> And I display the resulting image side-by-side with the normal map. >> >> I decided to put it in a new directory `compiler.gallery`, anticipating other compiler tests that are both visually appealing (i.e. can be used for a "gallery") and that we may want to back up with other tests like IR testing. >> >> There is a **stand-alone** way to run the demo: >> `java test/hotspot/jtreg/compiler/gallery/NormalMapping.java` >> (though it may only run with JDK22+, probably due some amber features) >> >> **Quick Perforance Numbers**, running on my avx512 laptop. >> default / AVX3: 105 FPS >> AVX2: 82 FPS >> AVX1: 50 FPS >> No vectorization: 19 FPS >> GraalJIT: 13 FPS (`jdk-26-ea+5` - probably issue with vectorization / inlining?) 
>> >> Here some snapshots, but **I really recommend pulling the diff and playing with it, it looks much better in motion**: >> image >> image >> image >> image > > Emanuel Peter has updated the pull request incrementally with two additional commits since the last revision: > > - Update test/hotspot/jtreg/compiler/gallery/TestNormalMapping.java > > Co-authored-by: Andrey Turbanov > - Update test/hotspot/jtreg/compiler/gallery/NormalMapping.java > > Co-authored-by: Christian Hagedorn Update looks good, thanks! ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/27282#pullrequestreview-3243376838 From fyang at openjdk.org Fri Sep 19 07:12:25 2025 From: fyang at openjdk.org (Fei Yang) Date: Fri, 19 Sep 2025 07:12:25 GMT Subject: RFR: 8365732: RISC-V: implement AES CTR intrinsics [v7] In-Reply-To: References: Message-ID: On Fri, 12 Sep 2025 03:40:59 GMT, Anjian Wen wrote: >> Hi everyone, please help review this patch which Implement the _counterMode_AESCrypt with Zvkned. On my QEMU, with Zvkned extension enabled, the tests in test/hotspot/jtreg/compiler/codegen/aes/ Passed. > > Anjian Wen has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains eight additional commits since the last revision: > > - Merge branch 'openjdk:master' into aes_ctr > - fix the counter increase at limit and add test > - change format > - update reg use and instruction > - change some name and format > - delete useless Label, change L_judge_used to L_slow_loop > - add Flags and fix the stubid name > - RISC-V: implement AES-CTR mode intrinsics src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 2667: > 2665: __ addi(t0, counter, 8); > 2666: __ ld(tmp, Address(t0)); > 2667: __ rev8(tmp, tmp); Note that `rev8` is only available under `UseZbb`. 
Maybe you should use `revb/revbw` instead, which take the availability of the Zbb extension into account. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25281#discussion_r2361999055 From manc at openjdk.org Fri Sep 19 08:00:40 2025 From: manc at openjdk.org (Man Cao) Date: Fri, 19 Sep 2025 08:00:40 GMT Subject: RFR: 8368071: Compilation throughput regressed 2X-8X after JDK-8355003 Message-ID: Hi all, Could anyone review this change that fixes a severe startup performance regression for `-XX:+TieredCompilation`? See https://bugs.openjdk.org/browse/JDK-8368071 for more details. -Man ------------- Commit messages: - 8368071: Compilation throughput regressed 2X-8X after JDK-8355003 Changes: https://git.openjdk.org/jdk/pull/27383/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27383&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8368071 Stats: 13 lines in 1 file changed: 8 ins; 2 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/27383.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27383/head:pull/27383 PR: https://git.openjdk.org/jdk/pull/27383 From manc at openjdk.org Fri Sep 19 08:04:11 2025 From: manc at openjdk.org (Man Cao) Date: Fri, 19 Sep 2025 08:04:11 GMT Subject: RFR: 8367613: Test compiler/runtime/TestDontCompileHugeMethods.java failed [v2] In-Reply-To: References: Message-ID: <1rFKLR9URrdZDzT2kXZMXkhzjWMzgGZyK9CLJXB0Q_A=.ff5535f2-7851-4d17-81a4-664148a3d1fa@github.com> On Tue, 16 Sep 2025 08:10:11 GMT, Christian Hagedorn wrote: >> Man Cao has updated the pull request incrementally with one additional commit since the last revision: >> >> Switch to disable inlining for shortMethod > > When looking at the test, it seems that we want to verify that `shortMethod()` is compiled while `hugeSwitch()` is not.
When running with `-Xcomp`, we will immediately compile `main()` and directly inline `shortMethod()` with C1 (with C2 we fail to inline with "failed initial checks" and thus will compile `shortMethod()` separately when calling it the first time). Therefore, with C1, we will not compile `shortMethod()` separately and the test fails. > > Excluding `-Xcomp` looks reasonable. An alternative would be to exclude `main()` from compilation. But I think for the purpose of this test, excluding `-Xcomp` seems better. @chhagedorn Could you also approve the latest commit? ------------- PR Comment: https://git.openjdk.org/jdk/pull/27306#issuecomment-3311081698 From wenanjian at openjdk.org Fri Sep 19 08:13:20 2025 From: wenanjian at openjdk.org (Anjian Wen) Date: Fri, 19 Sep 2025 08:13:20 GMT Subject: RFR: 8365732: RISC-V: implement AES CTR intrinsics [v8] In-Reply-To: References: Message-ID: <2lhB_2BCsW-SIBFxtc7KKPRZ2SGoleG41SR_d6IAAzI=.86cbf242-bb10-4d95-9424-f2bbe4cfc7ca@github.com> > Hi everyone, please help review this patch which Implement the _counterMode_AESCrypt with Zvkned. On my QEMU, with Zvkned extension enabled, the tests in test/hotspot/jtreg/compiler/codegen/aes/ Passed. Anjian Wen has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains nine additional commits since the last revision: - Merge branch 'openjdk:master' into aes_ctr - Merge branch 'openjdk:master' into aes_ctr - fix the counter increase at limit and add test - change format - update reg use and instruction - change some name and format - delete useless Label, change L_judge_used to L_slow_loop - add Flags and fix the stubid name - RISC-V: implement AES-CTR mode intrinsics ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25281/files - new: https://git.openjdk.org/jdk/pull/25281/files/ff513708..35f82e0a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25281&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25281&range=06-07 Stats: 26416 lines in 742 files changed: 13049 ins; 7667 del; 5700 mod Patch: https://git.openjdk.org/jdk/pull/25281.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25281/head:pull/25281 PR: https://git.openjdk.org/jdk/pull/25281 From shade at openjdk.org Fri Sep 19 08:23:53 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 19 Sep 2025 08:23:53 GMT Subject: RFR: 8368071: Compilation throughput regressed 2X-8X after JDK-8355003 In-Reply-To: References: Message-ID: <5vFQJAcARfEjdesFIAbz1F9-xSoEv1IkAt4gfSATgC8=.b0e2ceb9-8deb-48fe-87dc-3364637698f9@github.com> On Fri, 19 Sep 2025 07:52:16 GMT, Man Cao wrote: > Hi all, > > Could anyone review this change that fixes a severe startup performance regression for `-XX:+TieredCompilation`? See https://bugs.openjdk.org/browse/JDK-8368071 for more details. 
> > -Man This one is for @veresov :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/27383#issuecomment-3311139236 From jbhateja at openjdk.org Fri Sep 19 08:23:50 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 19 Sep 2025 08:23:50 GMT Subject: RFR: 8365205: C2: Optimize popcount value computation using knownbits [v9] In-Reply-To: References: Message-ID: <-cXUJ-9Nbp1h9REUqjyCpkrlDm4WzeiE0-6mx_QuWs4=.dd56a6a1-a626-4b03-b556-19b2b954a08b@github.com> > This patch optimizes PopCount value transforms using KnownBits information. > Following are the results of the micro-benchmark included with the patch > > > > System: 13th Gen Intel(R) Core(TM) i3-1315U > > Baseline: > Benchmark Mode Cnt Score Error Units > PopCountValueTransform.LogicFoldingKerenLong thrpt 2 215460.670 ops/s > PopCountValueTransform.LogicFoldingKerenlInt thrpt 2 294014.826 ops/s > > Withopt: > Benchmark Mode Cnt Score Error Units > PopCountValueTransform.LogicFoldingKerenLong thrpt 2 389978.082 ops/s > PopCountValueTransform.LogicFoldingKerenlInt thrpt 2 417261.583 ops/s > > > Kindly review and share your feedback. 
> > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Review comments resolutions ------------- Changes: - all: https://git.openjdk.org/jdk/pull/27075/files - new: https://git.openjdk.org/jdk/pull/27075/files/278f1dc8..367622bf Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=27075&range=08 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27075&range=07-08 Stats: 33 lines in 1 file changed: 15 ins; 4 del; 14 mod Patch: https://git.openjdk.org/jdk/pull/27075.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27075/head:pull/27075 PR: https://git.openjdk.org/jdk/pull/27075 From jbhateja at openjdk.org Fri Sep 19 08:23:53 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 19 Sep 2025 08:23:53 GMT Subject: RFR: 8365205: C2: Optimize popcount value computation using knownbits [v3] In-Reply-To: <1YTjbiOmc3OUXZlJ_Pg4W6En5hjU0wd_JBHERbVLDWc=.11ddbe0f-685b-463e-87b7-fcdd14ad4bb2@github.com> References: <1YTjbiOmc3OUXZlJ_Pg4W6En5hjU0wd_JBHERbVLDWc=.11ddbe0f-685b-463e-87b7-fcdd14ad4bb2@github.com> Message-ID: <4VQ6YYLFGU3tscZXp3lYhMPDsRvjUlagiJlMe6xiOMc=.20bf3ee6-0822-42f3-8417-1296f5076456@github.com> On Thu, 18 Sep 2025 12:55:16 GMT, Emanuel Peter wrote: >> Hi @TobiHartmann , @SirYwell , @eme64 , can you kindly verify the changes in the latest patch? > > @jatin-bhateja I'm going to be out of the office for about 3 weeks, so feel free to ask others for reviews! Hi @eme64 , @chhagedorn , @SirYwell , let me know if it's good to land now.
------------- PR Comment: https://git.openjdk.org/jdk/pull/27075#issuecomment-3311136518 From jbhateja at openjdk.org Fri Sep 19 08:23:57 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 19 Sep 2025 08:23:57 GMT Subject: RFR: 8365205: C2: Optimize popcount value computation using knownbits [v8] In-Reply-To: References: Message-ID: <_s42gZC5DcP4WobqtuohrzRqET6hpmeaLYwE6BEzcu0=.f24978f2-8ce0-4926-9f5b-0bc2cab57727@github.com> On Tue, 16 Sep 2025 07:15:00 GMT, Emanuel Peter wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Extending the random ranges > > test/hotspot/jtreg/compiler/intrinsics/TestPopCountValueTransforms.java line 56: > >> 54: static final long rand_bndL2 = G.uniformLongs(-0xFFFFFFL, 0xFFFFFF).next(); >> 55: static final long rand_popcL1 = G.uniformLongs(0, 4).next(); >> 56: static final long rand_popcL2 = G.uniformLongs(0, 32).next(); > > Can you please give us some code comments why you are doing: > - only uniform distribution. Is that needed? Generators generates special values more often for a good reason: it creates interesting edge cases, especially for bit operations like this here. > - Why are you restricting the ranges? There could always be surprises outside the ranges you pick, and it would be a shame to not generate those. Unless you are absolutely sure they are not needed. Or if extending the range would mean we would generate interesting cases with a probability that is too small, that could be another reason to restrict the ranges. Thanks @eme64!, comment addressed. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2362138729 From mhaessig at openjdk.org Fri Sep 19 09:11:34 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Fri, 19 Sep 2025 09:11:34 GMT Subject: RFR: 8366875: CompileTaskTimeout should be reset for each iteration of RepeatCompilation [v3] In-Reply-To: References: <4TbOkAMu-KU_tgQPg1sK0L8oto_0nD4mQo7yc0hJPm4=.8d87b900-a614-4c13-a4c6-6fe11e206482@github.com> <6ijTgwXUpwm8C_U7oOsN7RScv-caCal0U67UXFZ6VmY=.5550cf2f-2c57-4fc0-a2cd-3df6627485a2@github.com> Message-ID: On Thu, 18 Sep 2025 23:09:32 GMT, Dean Long wrote: >> Manuel Hässig has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: >> >> - Merge branch 'master' into JDK-8366875-repeat-comp-to >> - Reset timeout on repeated compilations >> - Add regression test >> - Use timeuot factor > > Still good. Thank you both for your reviews, @dean-long and @eme64! ------------- PR Comment: https://git.openjdk.org/jdk/pull/27120#issuecomment-3311368487 From mhaessig at openjdk.org Fri Sep 19 09:11:35 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Fri, 19 Sep 2025 09:11:35 GMT Subject: Integrated: 8366875: CompileTaskTimeout should be reset for each iteration of RepeatCompilation In-Reply-To: <4TbOkAMu-KU_tgQPg1sK0L8oto_0nD4mQo7yc0hJPm4=.8d87b900-a614-4c13-a4c6-6fe11e206482@github.com> References: <4TbOkAMu-KU_tgQPg1sK0L8oto_0nD4mQo7yc0hJPm4=.8d87b900-a614-4c13-a4c6-6fe11e206482@github.com> Message-ID: On Fri, 5 Sep 2025 15:27:22 GMT, Manuel Hässig wrote: > When running a debug JVM on Linux with a compile task timeout and repeated compilation, the execution will time out almost always because the timeout does not reset for repetitions of a compilation.
The core of the compile task timeout is to limit the amount of time a single compilation can take. Thus, this PR resets the `CompileTaskTimeout` for every compilation when running with `-XX:RepeatCompilation=<n>` for n > 1. > > This PR is stacked on top of #27094. > > Testing: > - [x] Github Actions (failures are unrelated) > - [x] tier1, tier2, tier3 plus some additional internal testing This pull request has now been integrated. Changeset: 94a301a7 Author: Manuel Hässig URL: https://git.openjdk.org/jdk/commit/94a301a70e19be284f406ebb6d8b94b6f96e1a24 Stats: 17 lines in 4 files changed: 16 ins; 0 del; 1 mod 8366875: CompileTaskTimeout should be reset for each iteration of RepeatCompilation Reviewed-by: dlong, epeter ------------- PR: https://git.openjdk.org/jdk/pull/27120 From epeter at openjdk.org Fri Sep 19 09:41:00 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 19 Sep 2025 09:41:00 GMT Subject: RFR: 8365205: C2: Optimize popcount value computation using knownbits [v9] In-Reply-To: <-cXUJ-9Nbp1h9REUqjyCpkrlDm4WzeiE0-6mx_QuWs4=.dd56a6a1-a626-4b03-b556-19b2b954a08b@github.com> References: <-cXUJ-9Nbp1h9REUqjyCpkrlDm4WzeiE0-6mx_QuWs4=.dd56a6a1-a626-4b03-b556-19b2b954a08b@github.com> Message-ID: On Fri, 19 Sep 2025 08:23:50 GMT, Jatin Bhateja wrote: >> This patch optimizes PopCount value transforms using KnownBits information. >> Following are the results of the micro-benchmark included with the patch >> >> >> >> System: 13th Gen Intel(R) Core(TM) i3-1315U >> >> Baseline: >> Benchmark Mode Cnt Score Error Units >> PopCountValueTransform.LogicFoldingKerenLong thrpt 2 215460.670 ops/s >> PopCountValueTransform.LogicFoldingKerenlInt thrpt 2 294014.826 ops/s >> >> Withopt: >> Benchmark Mode Cnt Score Error Units >> PopCountValueTransform.LogicFoldingKerenLong thrpt 2 389978.082 ops/s >> PopCountValueTransform.LogicFoldingKerenlInt thrpt 2 417261.583 ops/s >> >> >> Kindly review and share your feedback.
>> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Review comments resolutions As I'm about to board my plane for a 3-week vacation, I'm leaving a last little **note for the reviewers**. I think this is a really nice addition, so thanks for doing it @jatin-bhateja. Though it will only reach its full potential once we implement more "basic" KnownBits optimizations such as [JDK-8367341](https://bugs.openjdk.org/browse/JDK-8367341). Please make sure you **test** it, and make sure the random values generated with the Generators in `test/hotspot/jtreg/compiler/intrinsics/TestPopCountValueTransforms.java` make sense. Currently, there is for example a 32 bit range for a 64 bit long value, which is not correct, I think. By default, my recommendation is to **not** constrain the Generators ranges, unless there is a really good reason. Generators are already built to produce values close to zero at an over-proportional rate. But by not restricting we may at some point also hit cases that we did not anticipate, and catch bugs that way. test/hotspot/jtreg/compiler/intrinsics/TestPopCountValueTransforms.java line 54: > 52: static final long rand_bndL2 = G.longs().next(); > 53: static final long rand_popcL1 = G.uniformLongs(0, 32).next(); > 54: static final long rand_popcL2 = G.uniformLongs(0, 32).next(); Why did you limit the range for longs to 32? Can it not go up to 64? I asked for an explanation (in a code comment) of those that you restrict here, which you have not done, and just "resolved" it instead: https://github.com/openjdk/jdk/pull/27075#discussion_r2351166568 ------------- Changes requested by epeter (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/27075#pullrequestreview-3244008016 PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2362301238 From epeter at openjdk.org Fri Sep 19 09:41:04 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 19 Sep 2025 09:41:04 GMT Subject: RFR: 8365205: C2: Optimize popcount value computation using knownbits [v9] In-Reply-To: References: <-cXUJ-9Nbp1h9REUqjyCpkrlDm4WzeiE0-6mx_QuWs4=.dd56a6a1-a626-4b03-b556-19b2b954a08b@github.com> Message-ID: On Fri, 19 Sep 2025 09:25:56 GMT, Emanuel Peter wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Review comments resolutions > > test/hotspot/jtreg/compiler/intrinsics/TestPopCountValueTransforms.java line 54: > >> 52: static final long rand_bndL2 = G.longs().next(); >> 53: static final long rand_popcL1 = G.uniformLongs(0, 32).next(); >> 54: static final long rand_popcL2 = G.uniformLongs(0, 32).next(); > > Why did you limit the range for longs to 32? Can it not go up to 64? > I asked for an explanation (in a code comment) of those that you restrict here, which you have not done, and just "resolved" it instead: > https://github.com/openjdk/jdk/pull/27075#discussion_r2351166568 If you do restrict it, then at least go over the range a little bit. Why? You check `Integer.bitCount(num) < rand_popcI2`. The max value you get here is 32, so we could never get a constant folding case for the range `0..32`. Maybe that is ok, but we potentially miss a chance to find something we did not even anticipate. That is why I would recommend **not** to constrain the values, unless you really have a good reason and write it down in a code comment. 
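A minimal sketch of the range argument above, assuming plain Java outside the test framework. The class `PopCountBounds` and its helper `alwaysTrue` are illustrative names and not part of the PR. It only shows that a bit count of an `int` lies in 0..32 and of a `long` in 0..64, so a comparison `bitCount(x) < bound` can only hold for every input once `bound` exceeds the bit width:

```java
class PopCountBounds {
    // Hypothetical helper: true if `bitCount(x) < bound` holds for every
    // value x of a type with `bits` bits, i.e. the comparison could in
    // principle be folded to a constant `true`.
    static boolean alwaysTrue(long bound, int bits) {
        // The largest possible bit count equals the bit width (all ones).
        return bound > bits;
    }

    public static void main(String[] args) {
        // All-ones inputs reach the maximum bit count.
        System.out.println(Integer.bitCount(-1)); // 32
        System.out.println(Long.bitCount(-1L));   // 64

        // A bound drawn from 0..32 never makes the int comparison always true,
        // and the same range covers only half of the possible long bit counts.
        System.out.println(alwaysTrue(32, Integer.SIZE)); // false
        System.out.println(alwaysTrue(33, Integer.SIZE)); // true
        System.out.println(alwaysTrue(64, Long.SIZE));    // false
        System.out.println(alwaysTrue(65, Long.SIZE));    // true
    }
}
```

Under that reading, letting the bound go slightly past the bit width (or not constraining it at all) is what makes the always-true case reachable.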
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2362316509 From epeter at openjdk.org Fri Sep 19 09:42:37 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 19 Sep 2025 09:42:37 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v27] In-Reply-To: References: Message-ID: On Wed, 17 Sep 2025 09:52:34 GMT, Daniel Lundén wrote: >> test/hotspot/jtreg/compiler/arguments/TestMethodArguments.java line 120: >> >>> 118: Template.let("classpath", comp.getEscapedClassPathOfCompiledClasses()), >>> 119: """ >>> 120: import java.util.Arrays; >> >> Personally, I would not indent this deeply. I know that the generated code will not have proper indentation, but that's not so bad. Readability of the Templates is more important I think. Subjective though. > > No strong opinion here, I just went with the eclipse-jdtls autoformatter defaults. The generated code does have fairly OK indentation (the indentation in the code does not add any actual indentation in the generated code). Let me know what you prefer and I'll update it. I would prefer readability of the test, not the generated code. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2362332715 From jbhateja at openjdk.org Fri Sep 19 09:49:16 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 19 Sep 2025 09:49:16 GMT Subject: RFR: 8365205: C2: Optimize popcount value computation using knownbits [v10] In-Reply-To: References: Message-ID: > This patch optimizes PopCount value transforms using KnownBits information.
> Following are the results of the micro-benchmark included with the patch > > > > System: 13th Gen Intel(R) Core(TM) i3-1315U > > Baseline: > Benchmark Mode Cnt Score Error Units > PopCountValueTransform.LogicFoldingKerenLong thrpt 2 215460.670 ops/s > PopCountValueTransform.LogicFoldingKerenlInt thrpt 2 294014.826 ops/s > > Withopt: > Benchmark Mode Cnt Score Error Units > PopCountValueTransform.LogicFoldingKerenLong thrpt 2 389978.082 ops/s > PopCountValueTransform.LogicFoldingKerenlInt thrpt 2 417261.583 ops/s > > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Update TestPopCountValueTransforms.java ------------- Changes: - all: https://git.openjdk.org/jdk/pull/27075/files - new: https://git.openjdk.org/jdk/pull/27075/files/367622bf..92cf2fad Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=27075&range=09 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27075&range=08-09 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/27075.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27075/head:pull/27075 PR: https://git.openjdk.org/jdk/pull/27075 From jbhateja at openjdk.org Fri Sep 19 09:49:18 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 19 Sep 2025 09:49:18 GMT Subject: RFR: 8365205: C2: Optimize popcount value computation using knownbits [v9] In-Reply-To: References: <-cXUJ-9Nbp1h9REUqjyCpkrlDm4WzeiE0-6mx_QuWs4=.dd56a6a1-a626-4b03-b556-19b2b954a08b@github.com> Message-ID: On Fri, 19 Sep 2025 09:32:30 GMT, Emanuel Peter wrote: >> test/hotspot/jtreg/compiler/intrinsics/TestPopCountValueTransforms.java line 54: >> >>> 52: static final long rand_bndL2 = G.longs().next(); >>> 53: static final long rand_popcL1 = G.uniformLongs(0, 32).next(); >>> 54: static final long rand_popcL2 = G.uniformLongs(0, 32).next(); >> >> Why did you limit the range for longs to 32? 
Can it not go up to 64? >> I asked for an explanation (in a code comment) of those that you restrict here, which you have not done, and just "resolved" it instead: >> https://github.com/openjdk/jdk/pull/27075#discussion_r2351166568 > > If you do restrict it, then at least go over the range a little bit. Why? > You check `Integer.bitCount(num) < rand_popcI2`. The max value you get here is 32, so we could never get a constant folding case for the range `0..32`. Maybe that is ok, but we potentially miss a chance to find something we did not even anticipate. > > That is why I would recommend **not** to constrain the values, unless you really have a good reason and write it down in a code comment. > Why did you limit the range for longs to 32? Can it not go up to 64? I asked for an explanation (in a code comment) of those that you restrict here, which you have not done, and just "resolved" it instead: [#27075 (comment)](https://github.com/openjdk/jdk/pull/27075#discussion_r2351166568) A silly typo, so no explanation :-) enjoy your break :-) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2362345323 From jbhateja at openjdk.org Fri Sep 19 09:53:06 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 19 Sep 2025 09:53:06 GMT Subject: RFR: 8365205: C2: Optimize popcount value computation using knownbits [v9] In-Reply-To: References: <-cXUJ-9Nbp1h9REUqjyCpkrlDm4WzeiE0-6mx_QuWs4=.dd56a6a1-a626-4b03-b556-19b2b954a08b@github.com> Message-ID: On Fri, 19 Sep 2025 09:37:45 GMT, Emanuel Peter wrote: > As I'm about to board my plane for a 3-week vacation, I'm leaving a last little **note for the reviewers**. > > I think this is a really nice addition, so thanks for doing it @jatin-bhateja. Though it will only reach its full potential once we implement more "basic" KnownBits optimizations such as [JDK-8367341](https://bugs.openjdk.org/browse/JDK-8367341).
> Correct, currently KnownBits information is constrained as they are generated for limited value ranges, as discussed in https://github.com/openjdk/jdk/pull/27075#discussion_r2337215333 ------------- PR Comment: https://git.openjdk.org/jdk/pull/27075#issuecomment-3311500441 From epeter at openjdk.org Fri Sep 19 10:01:46 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 19 Sep 2025 10:01:46 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v29] In-Reply-To: References: Message-ID: On Wed, 17 Sep 2025 10:07:58 GMT, Daniel Lundén wrote: >> If a method has a large number of parameters, we currently bail out from C2 compilation. >> >> ### Changeset >> >> Allowing C2 compilation of methods with a large number of parameters requires fundamental changes to the register mask data structure, used in many places in C2. In particular, register masks currently have a statically determined size and cannot represent arbitrary numbers of stack slots. This is needed if we want to compile methods with arbitrary numbers of parameters. Register mask operations are present in performance-sensitive parts of C2, which further complicates changes. >> >> Changes: >> - Add functionality to dynamically grow/extend register masks. I experimented with a number of design choices to achieve this. To keep the common case (normal number of method parameters) quick and also to avoid more intrusive changes to the current `RegMask` interface, I decided to leave the "base" statically allocated memory for masks unchanged and only use dynamically allocated memory in the rare cases where it is needed. >> - Generalize the "chunk"-logic from `PhaseChaitin::Select()` to allow arbitrary-sized chunks, and also move most of the logic into register mask methods to separate concerns and to make the `PhaseChaitin::Select()` code more readable. >> - Remove all `can_represent` checks and bailouts. >> - Performance tuning.
A particularly important change is the early-exit optimization in `RegMask::overlap`, used in the performance-sensitive method `PhaseChaitin::interfere_with_live`. >> - Add a new test case `TestManyMethodArguments.java` and extend an old test `TestNestedSynchronize.java`. >> >> ### Testing >> >> - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/10178060450) >> - `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. >> - Standard performance benchmarking. No observed conclusive overall performance degradation/improvement. >> - Specific benchmarking of C2 compilation time. The changes increase C2 compilation time by, approximately and on average, 1% for methods that could also be compiled before this changeset (see the figure below). The reason for the degradation is further checks required in performance-sensitive code (in particular `PhaseChaitin::remove_bound_register_from_interfering_live_ranges`). I have tried optimizing in various ways, but changes I found that lead to improvement also lead to less readable code (and are, in my opinion, no... > > Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: > > Update after comments from Emanuel @dlunde I just had another look through the whole code change. And I'm very happy with it now. Especially the additional code comments around `rollover`/`offset` really helped to bring it together for me :) Thanks for bearing with me through the many comments / suggestions. I would suggest that either @vnkozlov or @robcasloz have another quick look over the changes, just to see if they agree with what we have been doing ;) test/hotspot/gtest/opto/test_regmask.cpp line 1222: > 1220: } > 1221: > 1222: #endif // !PRODUCT Optional: You could add some tests that expect a vm assert. You can do that with `TEST_VM_ASSERT_MSG`.
Example: https://github.com/openjdk/jdk/blob/9b04b5a74cc09b64098fb9940aa224f529ff1a01/test/hotspot/gtest/utilities/test_growableArray.cpp ------------- Marked as reviewed by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/20404#pullrequestreview-3244069001 PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2362363610 From epeter at openjdk.org Fri Sep 19 10:01:47 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 19 Sep 2025 10:01:47 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v27] In-Reply-To: References: Message-ID: On Tue, 16 Sep 2025 12:54:37 GMT, Daniel Lund?n wrote: >> `//assert(is_infinite_stack == lrg->mask().is_infinite_stack(), "nbrs must not change InfiniteStackedness");` > > No idea, sorry (it has been that way since initial load). I just touched it to change from all_stack to infinite_stack. @dlunde Would you mind investigating in a follow-up RFE? I would just enable the assert and see if it triggers. If not, add the assert back in, otherwise see why the assert fails, and if that looks reasonable. If yes -> just remove it. If it is not reasonable .... we then investigate more I suppose ;) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2362347841 From jbhateja at openjdk.org Fri Sep 19 11:10:20 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 19 Sep 2025 11:10:20 GMT Subject: RFR: 8351016: RA support for EVEX to REX/REX2 demotion to optimize NDD instructions [v2] In-Reply-To: References: Message-ID: > Currently, while choosing the colour (register) for a definition live range during the select phase of register allocation, we pick the first available colour that does not match with already allocated neighboring live ranges. 
> > With the Intel APX NDD ISA extension, several existing two-address arithmetic instructions can now have an explicit non-destructive destination operand; this, in general, saves additional spills for two-address instructions where the destination is also the first source operand, and where the source live range surpasses the current instruction. > > All NDD instructions mandate extended EVEX encoding with a bulky 4-byte prefix, [JDK-8351994](https://github.com/openjdk/jdk/pull/24431) added logic for NDD to REX/REX2 demotion in the assembler layer, but due to the existing first color selection register allocation policy, the demotions are rare. This patch biases the allocation of NDD definition to the first source operand or the second source operand for the commutative class of operations. > > Biasing is a compile-time hint to the allocator and is different from live range coalescing (aggressive/conservative), which merges the two live ranges using the union find algorithm. Given that REX encoding needs a 1-byte prefix and REX2 encoding needs a 2-byte prefix, demotion saves considerable JIT code size. > > The patch shows around 5-20% improvement in code size by facilitating NDD demotion. > > For the following micro, the method JIT code size reduced from 136 to 120 bytes, which is around a 13% reduction in code size footprint. > > **Micro:-** (image not preserved) > > **Baseline :-** (image not preserved) > > **With opt:-** (image not preserved) > > Thorough validations are underway using the latest [Intel Software Development Emulator version 9.58](https://www.intel.com/content/www/us/en/download/684897/intel-software-development-emulator.html). > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains four commits: - Updating as per reivew suggestions - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8351016 - Some refactoring - 8351016: RA support for EVEX to REX/REX2 demotion to optimize NDD instructions ------------- Changes: https://git.openjdk.org/jdk/pull/26283/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26283&range=01 Stats: 87 lines in 2 files changed: 70 ins; 6 del; 11 mod Patch: https://git.openjdk.org/jdk/pull/26283.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26283/head:pull/26283 PR: https://git.openjdk.org/jdk/pull/26283 From hgreule at openjdk.org Fri Sep 19 11:19:13 2025 From: hgreule at openjdk.org (Hannes Greule) Date: Fri, 19 Sep 2025 11:19:13 GMT Subject: RFR: 8365205: C2: Optimize popcount value computation using knownbits [v10] In-Reply-To: References: Message-ID: On Fri, 19 Sep 2025 09:49:16 GMT, Jatin Bhateja wrote: >> This patch optimizes PopCount value transforms using KnownBits information. >> Following are the results of the micro-benchmark included with the patch >> >> >> >> System: 13th Gen Intel(R) Core(TM) i3-1315U >> >> Baseline: >> Benchmark Mode Cnt Score Error Units >> PopCountValueTransform.LogicFoldingKerenLong thrpt 2 215460.670 ops/s >> PopCountValueTransform.LogicFoldingKerenlInt thrpt 2 294014.826 ops/s >> >> Withopt: >> Benchmark Mode Cnt Score Error Units >> PopCountValueTransform.LogicFoldingKerenLong thrpt 2 389978.082 ops/s >> PopCountValueTransform.LogicFoldingKerenlInt thrpt 2 417261.583 ops/s >> >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Update TestPopCountValueTransforms.java src/hotspot/share/opto/countbitsnode.cpp line 123: > 121: // we have at least and at most. 
> 122: // From the definition of KnownBits, we know: > 123: // zeros: Indicates which bits must be 0: ones[i] =1 -> t[i]=0 I'm a bit confused by this, is ones[i] mixed up with zeros[i]? I.e., t[i]=0 if zeros[i]=1 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2362569002 From dlunden at openjdk.org Fri Sep 19 11:33:39 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Fri, 19 Sep 2025 11:33:39 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v27] In-Reply-To: References: Message-ID: <27TyCS4slIg2kY1yfm1niF8Rr6jr8BGfXaHAhT11VEA=.cdca78a4-ea0f-4e9d-9da3-c1bab5f5c0e4@github.com> On Fri, 19 Sep 2025 09:46:29 GMT, Emanuel Peter wrote: >> No idea, sorry (it has been that way since initial load). I just touched it to change from all_stack to infinite_stack. > > @dlunde Would you mind investigating in a follow-up RFE? I would just enable the assert and see if it triggers. If not, add the assert back in, otherwise see why the assert fails, and if that looks reasonable. If yes -> just remove it. If it is not reasonable .... we then investigate more I suppose ;) Sure thing, I'll add it to the list. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2362598084 From dlunden at openjdk.org Fri Sep 19 11:33:40 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Fri, 19 Sep 2025 11:33:40 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v27] In-Reply-To: References: Message-ID: On Fri, 19 Sep 2025 09:39:44 GMT, Emanuel Peter wrote: >> No strong opinion here, I just went with the eclipse-jdtls autoformatter defaults. The generated code does have fairly OK indentation (the indentation in the code does not add any actual indentation in the generated code). Let me know what you prefer and I'll update it. > > I would prefer readability of the test, not the generated code. 
OK, what I meant is that I did not understand how exactly you wanted me to make the test more readable. But, I had a look at the template framework example and will update to use the same style (align the `"""`)! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2362593056 From dlunden at openjdk.org Fri Sep 19 12:43:06 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Fri, 19 Sep 2025 12:43:06 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v29] In-Reply-To: References: Message-ID: On Fri, 19 Sep 2025 09:53:43 GMT, Emanuel Peter wrote: >> Daniel Lund?n has updated the pull request incrementally with one additional commit since the last revision: >> >> Update after comments from Emanuel > > test/hotspot/gtest/opto/test_regmask.cpp line 1222: > >> 1220: } >> 1221: >> 1222: #endif // !PRODUCT > > Optional: > You could add some tests that expect a vm assert. You can do that with `TEST_VM_ASSERT_MSG`. Example: > https://github.com/openjdk/jdk/blob/9b04b5a74cc09b64098fb9940aa224f529ff1a01/test/hotspot/gtest/utilities/test_growableArray.cpp Thanks, added a few obvious such tests! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20404#discussion_r2362753532 From dlunden at openjdk.org Fri Sep 19 12:43:01 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Fri, 19 Sep 2025 12:43:01 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v30] In-Reply-To: References: Message-ID: > If a method has a large number of parameters, we currently bail out from C2 compilation. > > ### Changeset > > Allowing C2 compilation of methods with a large number of parameters requires fundamental changes to the register mask data structure, used in many places in C2. In particular, register masks currently have a statically determined size and cannot represent arbitrary numbers of stack slots. 
This is needed if we want to compile methods with arbitrary numbers of parameters. Register mask operations are present in performance-sensitive parts of C2, which further complicates changes. > > Changes: > - Add functionality to dynamically grow/extend register masks. I experimented with a number of design choices to achieve this. To keep the common case (normal number of method parameters) quick and also to avoid more intrusive changes to the current `RegMask` interface, I decided to leave the "base" statically allocated memory for masks unchanged and only use dynamically allocated memory in the rare cases where it is needed. > - Generalize the "chunk"-logic from `PhaseChaitin::Select()` to allow arbitrary-sized chunks, and also move most of the logic into register mask methods to separate concerns and to make the `PhaseChaitin::Select()` code more readable. > - Remove all `can_represent` checks and bailouts. > - Performance tuning. A particularly important change is the early-exit optimization in `RegMask::overlap`, used in the performance-sensitive method `PhaseChaitin::interfere_with_live`. > - Add a new test case `TestManyMethodArguments.java` and extend an old test `TestNestedSynchronize.java`. > > ### Testing > > - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/10178060450) > - `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. > - Standard performance benchmarking. No observed conclusive overall performance degradation/improvement. > - Specific benchmarking of C2 compilation time. The changes increase C2 compilation time by, approximately and on average, 1% for methods that could also be compiled before this changeset (see the figure below). The reason for the degradation is further checks required in performance-sensitive code (in particular `PhaseChaitin::remove_bound_register_from_interfering_live_ranges`). 
I have tried optimizing in various ways, but changes I found that lead to improvement also lead to less readable code (and are, in my opinion, not worth it). > > ![c2-regression](https:/... Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: Add vm-assert tests and improve template framework test indentation ------------- Changes: - all: https://git.openjdk.org/jdk/pull/20404/files - new: https://git.openjdk.org/jdk/pull/20404/files/9b04b5a7..e165c961 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20404&range=29 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20404&range=28-29 Stats: 99 lines in 2 files changed: 37 ins; 3 del; 59 mod Patch: https://git.openjdk.org/jdk/pull/20404.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20404/head:pull/20404 PR: https://git.openjdk.org/jdk/pull/20404 From dlunden at openjdk.org Fri Sep 19 12:46:42 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Fri, 19 Sep 2025 12:46:42 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v23] In-Reply-To: References: Message-ID: <5Sarw16IDhzy-qXC9hrXwflQ549w5lcDQRSJNTWJwv0=.944fbf98-dd34-4607-b08c-591935df359c@github.com> On Wed, 17 Sep 2025 09:56:16 GMT, Daniel Lundén wrote: >> @dlunde Thanks for the swift updates! I have in the meantime added some more comments, just making sure you don't miss them :) > > @eme64 > >> You seem to have a build failure: >> >> ``` >> In file included from /home/runner/work/jdk/jdk/src/hotspot/share/opto/compile.hpp:43, >> from /home/runner/work/jdk/jdk/src/hotspot/share/opto/type.hpp:29, >> from /home/runner/work/jdk/jdk/test/hotspot/gtest/opto/test_rangeinference.cpp:26: >> /home/runner/work/jdk/jdk/src/hotspot/share/opto/regmask.hpp: In constructor ‘RegMask::RegMask(Arena*)’: >> /home/runner/work/jdk/jdk/src/hotspot/share/opto/regmask.hpp:441:53: error: class ‘RegMask’ does not have any field named ‘_read_only’ 
>> 441 | : _rm_word() DEBUG_ONLY(COMMA _arena(arena)), _read_only(read_only), >> | ^~~~~~~~~~ >> /home/runner/work/jdk/jdk/src/hotspot/share/opto/regmask.hpp:441:64: error: ‘read_only’ was not declared in this scope >> 441 | : _rm_word() DEBUG_ONLY(COMMA _arena(arena)), _read_only(read_only), >> | >> ``` > > Thanks, it only failed on release builds, so I didn't notice. Will fix. > >> I really appreciate that you added extensive `gtest`s, thanks for that. > > @robcasloz contributed 90% of that, so the credit goes to him! > >> And thanks for using the Template Framework, I'm curious to hear if you have any feedback on it :) > > Sure, it was quite convenient. Happy to talk about the experience offline. > @dlunde I just had another look through the whole code change. And I'm very happy with it now. Especially the additional code comments around `rollover`/`offset` really helped to bring it together for me :) > > Thanks for bearing with me through the many comments / suggestions. > > I would suggest that either @vnkozlov or @robcasloz have another quick look over the changes, just to see if they agree with what we have been doing ;) Thank you @eme64, much appreciated! I agree we have improved the changeset a lot from the initial version. Yes, @robcasloz has let me know he will have a look at the changes soon. @vnkozlov is also welcome to review again of course, but his previous review is for a very much out-of-date version of the changeset. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20404#issuecomment-3312059287 From jbhateja at openjdk.org Fri Sep 19 12:55:42 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 19 Sep 2025 12:55:42 GMT Subject: RFR: 8351016: RA support for EVEX to REX/REX2 demotion to optimize NDD instructions [v2] In-Reply-To: References: Message-ID: On Tue, 26 Aug 2025 23:37:01 GMT, Sandhya Viswanathan wrote: >> Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains four commits: >> >> - Updating as per reivew suggestions >> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8351016 >> - Some refactoring >> - 8351016: RA support for EVEX to REX/REX2 demotion to optimize NDD instructions > > src/hotspot/share/opto/chaitin.cpp line 1655: > >> 1653: }; >> 1654: >> 1655: if (X86_ONLY(UseAPX) NOT_X86(false)) { > > The change looks to be generically applicable and not APX or X86 specific. Hi @sviswa7, I have generalized the fix by lifting the X86/APX checks as per the suggestion. However, our intent here is to facilitate the demotion of NDD instructions, which carry a 4-byte extended EVEX prefix; in other scenarios involving 3-operand instructions, we may not see any benefit from biasing. If a use's live range (LRG) extends beyond its user's LRG, RA automatically prevents sharing of the register; otherwise **it may** assign the same register to the definition as per the first-colour selection policy. Thus, biasing is only beneficial for the APX NDD use case, where the assembler layer is equipped to perform demotion. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26283#discussion_r2362786986 From dlunden at openjdk.org Fri Sep 19 12:58:40 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Fri, 19 Sep 2025 12:58:40 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v31] In-Reply-To: References: Message-ID: > If a method has a large number of parameters, we currently bail out from C2 compilation. > > ### Changeset > > Allowing C2 compilation of methods with a large number of parameters requires fundamental changes to the register mask data structure, used in many places in C2. In particular, register masks currently have a statically determined size and cannot represent arbitrary numbers of stack slots. This is needed if we want to compile methods with arbitrary numbers of parameters. 
Register mask operations are present in performance-sensitive parts of C2, which further complicates changes. > > Changes: > - Add functionality to dynamically grow/extend register masks. I experimented with a number of design choices to achieve this. To keep the common case (normal number of method parameters) quick and also to avoid more intrusive changes to the current `RegMask` interface, I decided to leave the "base" statically allocated memory for masks unchanged and only use dynamically allocated memory in the rare cases where it is needed. > - Generalize the "chunk"-logic from `PhaseChaitin::Select()` to allow arbitrary-sized chunks, and also move most of the logic into register mask methods to separate concerns and to make the `PhaseChaitin::Select()` code more readable. > - Remove all `can_represent` checks and bailouts. > - Performance tuning. A particularly important change is the early-exit optimization in `RegMask::overlap`, used in the performance-sensitive method `PhaseChaitin::interfere_with_live`. > - Add a new test case `TestManyMethodArguments.java` and extend an old test `TestNestedSynchronize.java`. > > ### Testing > > - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/10178060450) > - `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. > - Standard performance benchmarking. No observed conclusive overall performance degradation/improvement. > - Specific benchmarking of C2 compilation time. The changes increase C2 compilation time by, approximately and on average, 1% for methods that could also be compiled before this changeset (see the figure below). The reason for the degradation is further checks required in performance-sensitive code (in particular `PhaseChaitin::remove_bound_register_from_interfering_live_ranges`). 
I have tried optimizing in various ways, but changes I found that lead to improvement also lead to less readable code (and are, in my opinion, not worth it). > > ![c2-regression](https:/... Daniel Lundén has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 43 commits: - Merge tag 'jdk-26+16' into many-arguments-8325467+pr-updates Added tag jdk-26+16 for changeset a49856bb - Add vm-assert tests and improve template framework test indentation - Update after comments from Emanuel - Update after comments from Emanuel - Clarify comments in regmask.hpp - Merge remote-tracking branch 'upstream/master' into many-arguments-8325467+pr-updates - Address review comments (renaming on the way in a separate PR) - Update src/hotspot/share/opto/regmask.hpp Co-authored-by: Emanuel Peter - Restore modified java/lang/invoke tests - Sort includes (new requirement) - ... and 33 more: https://git.openjdk.org/jdk/compare/a49856bb...84efc2db ------------- Changes: https://git.openjdk.org/jdk/pull/20404/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=20404&range=30 Stats: 2890 lines in 29 files changed: 2325 ins; 288 del; 277 mod Patch: https://git.openjdk.org/jdk/pull/20404.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20404/head:pull/20404 PR: https://git.openjdk.org/jdk/pull/20404 From liach at openjdk.org Fri Sep 19 13:10:33 2025 From: liach at openjdk.org (Chen Liang) Date: Fri, 19 Sep 2025 13:10:33 GMT Subject: RFR: 8355223: Improve documentation on @IntrinsicCandidate [v8] In-Reply-To: References: Message-ID: > In offline discussion, we noted that the documentation on this annotation does not recommend minimizing the intrinsified section and moving whatever can be done in Java to Java; thus I prepared this documentation update, to shrink a "TLDR" essay to something concise for readers, such as pointing to that list at `vmIntrinsics.hpp` instead of "a list". 
Chen Liang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 12 additional commits since the last revision: - Separate design doc - Merge branch 'master' of https://github.com/openjdk/jdk into doc/intrinsic-candidate - More review updates - Merge branch 'master' of https://github.com/openjdk/jdk into doc/intrinsic-candidate - Move intrinsic to be a subsection; just one most common function of the annotation - Merge branch 'master' of https://github.com/openjdk/jdk into doc/intrinsic-candidate - Merge branch 'master' of https://github.com/openjdk/jdk into doc/intrinsic-candidate - Update src/java.base/share/classes/jdk/internal/vm/annotation/IntrinsicCandidate.java Co-authored-by: Raffaello Giulietti - Shorter first sentence - Updates, thanks to John - ... and 2 more: https://git.openjdk.org/jdk/compare/0d5ea5a0...e4afa49d ------------- Changes: - all: https://git.openjdk.org/jdk/pull/24777/files - new: https://git.openjdk.org/jdk/pull/24777/files/a312d92b..e4afa49d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=24777&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24777&range=06-07 Stats: 348197 lines in 6043 files changed: 206814 ins; 98457 del; 42926 mod Patch: https://git.openjdk.org/jdk/pull/24777.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24777/head:pull/24777 PR: https://git.openjdk.org/jdk/pull/24777 From liach at openjdk.org Fri Sep 19 13:10:37 2025 From: liach at openjdk.org (Chen Liang) Date: Fri, 19 Sep 2025 13:10:37 GMT Subject: RFR: 8355223: Improve documentation on @IntrinsicCandidate [v7] In-Reply-To: References: Message-ID: On Wed, 21 May 2025 21:31:16 GMT, Chen Liang wrote: >> In offline discussion, we noted that the documentation on this annotation does not recommend minimizing the intrinsified section and moving whatever can be done in Java to Java; thus I prepared this 
documentation update, to shrink a "TLDR" essay to something concise for readers, such as pointing to that list at `vmIntrinsics.hpp` instead of "a list". > > Chen Liang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 10 additional commits since the last revision: > > - More review updates > - Merge branch 'master' of https://github.com/openjdk/jdk into doc/intrinsic-candidate > - Move intrinsic to be a subsection; just one most common function of the annotation > - Merge branch 'master' of https://github.com/openjdk/jdk into doc/intrinsic-candidate > - Merge branch 'master' of https://github.com/openjdk/jdk into doc/intrinsic-candidate > - Update src/java.base/share/classes/jdk/internal/vm/annotation/IntrinsicCandidate.java > > Co-authored-by: Raffaello Giulietti > - Shorter first sentence > - Updates, thanks to John > - Refine validation and defensive copying > - 8355223: Improve documentation on @IntrinsicCandidate Let's continue. I've moved the majority of check and stuff into a standalone design doc. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24777#issuecomment-3312125701 From rcastanedalo at openjdk.org Fri Sep 19 13:12:36 2025 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Fri, 19 Sep 2025 13:12:36 GMT Subject: RFR: 8327963: C2: fix construction of memory graph around Initialize node to prevent incorrect execution if allocation is removed [v12] In-Reply-To: References: <3jUFOPYDIqmzEywhzf58guwS0qZGBUCMZ3lXeltlS3c=.5c82601f-cf4d-4b2a-a525-1f8f4c7c4a3b@github.com> Message-ID: On Tue, 9 Sep 2025 11:27:50 GMT, Roland Westrelin wrote: >> An `Initialize` node for an `Allocate` node is created with a memory >> `Proj` of adr type raw memory. 
In order for stores to be captured, the >> memory state out of the allocation is a `MergeMem` with slices for the >> various object fields/array element set to the raw memory `Proj` of >> the `Initialize` node. If `Phi`s need to be created during later >> transformations from this memory state, the `Phi` for a particular >> slice gets its adr type from the type of the `Proj` which is raw >> memory. If during macro expansion, the `Allocate` is found to have no >> use and so can be removed, the `Proj` out of the `Initialize` is >> replaced by the memory state on input to the `Allocate`. A `Phi` for >> some slice for a field of an object will end up with the raw memory >> state on input to the `Allocate` node. As a result, memory state at >> the `Phi` is incorrect and incorrect execution can happen. >> >> The fix I propose is, rather than have a single `Proj` for the memory >> state out of the `Initialize` with adr type raw memory, to use one >> `Proj` per slice added to the memory state after the `Initialize`. Each >> of the `Proj` should return the right adr type for its slice. For that >> I propose having a new type of `Proj`: `NarrowMemProj` that captures >> the right adr type. >> >> Logic for the construction of the `Allocate`/`Initialize` subgraph is >> tweaked so the right adr type captured in its own `NarrowMemProj` is >> added to the memory subgraph. Code that removes an allocation or moves >> it also has to be changed so it correctly takes the multiple memory >> projections out of the `Initialize` node into account. >> >> One tricky issue is that when EA splits types for a scalar replaceable >> `Allocate` node: >> >> 1- the adr type captured in the `NarrowMemProj` becomes out of sync >> with the type of the slices for the allocation >> >> 2- before EA, the memory state for one particular field out of the >> `Initialize` node can be used for a `Store` to the just allocated >> object or some other. 
So we can have a chain of `Store`s, some to >> the newly allocated object, some to some other objects, all of them >> using the state of `NarrowMemProj` out of the `Initialize`. After >> split unique types, the `NarrowMemProj` is for the slice of a >> particular allocation. So `Store`s to some other objects shouldn't >> use that memory state but the memory state before the `Allocate`. >> >> For that, I added logic to update the adr type of `NarrowMemProj` >> during split uni... > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 45 commits: > > - more > - Merge branch 'master' into JDK-8327963 > - more > - more > - Merge branch 'master' into JDK-8327963 > - more > - more > - lambda return > - lambda clean up > - Merge branch 'master' into JDK-8327963 > - ... and 35 more: https://git.openjdk.org/jdk/compare/e16c5100...b701d03e Changes requested by rcastanedalo (Reviewer). src/hotspot/share/opto/escape.hpp line 567: > 565: // MemNode - new memory input for this node > 566: // CheckCastPP - allocation that this is a cast of > 567: // allocation - CheckCastPP of the allocation Please add a new entry here explaining how `_node_map` is used for `NarrowMemProjNode` nodes. src/hotspot/share/opto/graphKit.cpp line 3645: > 3643: assert(minit_out->is_Proj() && minit_out->in(0) == init, ""); > 3644: int mark_idx = C->get_alias_index(oop_type->add_offset(oopDesc::mark_offset_in_bytes())); > 3645: // Add an edge in the MergeMem for the header fields so an access to one of those has correct memory state Suggestion: // Add an edge in the MergeMem for the header fields so an access to one of those has correct memory state. src/hotspot/share/opto/graphKit.cpp line 3647: > 3645: // Add an edge in the MergeMem for the header fields so an access to one of those has correct memory state > 3646: // Use one NarrowMemProjNode per slice to properly record the adr type of each slice. 
The Initialize node will have > 3647: // multiple projection as a result. Suggestion: // multiple projections as a result. src/hotspot/share/opto/macro.cpp line 1606: > 1604: // elimination. Simply add the MemBarStoreStore after object > 1605: // initialization. > 1606: MemBarNode* mb = MemBarNode::make(C, Op_MemBarStoreStore, Compile::AliasIdxRaw); Does the same argument as below apply for relaxing the scope of this memory barrier? Please clarify in a similar comment for this case (if the same argument applies, a reference to the comment below would be enough). src/hotspot/share/opto/macro.cpp line 1623: > 1621: Node* init_ctrl = init->proj_out_or_null(TypeFunc::Control); > 1622: > 1623: // What we want is to prevent the compiler and the cpu from re-ordering the stores that initialize this object Suggestion: // What we want is to prevent the compiler and the CPU from re-ordering the stores that initialize this object src/hotspot/share/opto/macro.cpp line 1628: > 1626: // only captures/produces a partial memory state making it complicated to insert such a MemBar. Because > 1627: // re-ordering by the compiler can't happen by construction (a later Store that publishes the just allocated > 1628: // object reference is indirectly control dependent on the Initialize node), preventing reordering by the cpu is Suggestion: // object reference is indirectly control dependent on the Initialize node), preventing reordering by the CPU is src/hotspot/share/opto/memnode.hpp line 1383: > 1381: bool already_has_narrow_mem_proj_with_adr_type(const TypePtr* adr_type) const; > 1382: > 1383: MachProjNode* mem_mach_proj() const; Please add a brief comment above this function, possibly clarifying that we do not expect to find more than one Mach memory projection. 
src/hotspot/share/opto/multnode.cpp line 73: > 71: }; > 72: return apply_to_projs(filter, which_proj); > 73: } Consider moving this implementation to `multnode.hpp`, perhaps next to that of `MultiNode::apply_to_projs(DUIterator_Fast& imax, DUIterator_Fast& i, Callback callback, uint which_proj)`, for consistency. src/hotspot/share/opto/multnode.cpp line 279: > 277: void NarrowMemProjNode::dump_spec(outputStream *st) const { > 278: ProjNode::dump_spec(st); > 279: dump_adr_type(st); Do we need to define a special version of `NarrowMemProjNode::dump_adr_type` or could we just have the same effect calling `MemNode::dump_adr_type(this, _adr_type, st)` here? src/hotspot/share/opto/multnode.cpp line 284: > 282: void NarrowMemProjNode::dump_compact_spec(outputStream *st) const { > 283: ProjNode::dump_compact_spec(st); > 284: dump_adr_type(st); Same here. src/hotspot/share/opto/multnode.hpp line 71: > 69: } > 70: Node* current() { > 71: return _node->fast_out(_i);; Suggestion: return _node->fast_out(_i); src/hotspot/share/opto/multnode.hpp line 90: > 88: } > 89: Node* current() { > 90: return _node->out(_i);; Suggestion: return _node->out(_i); src/hotspot/share/opto/phaseX.cpp line 2621: > 2619: add_users_to_worklist0(proj, worklist); > 2620: return MultiNode::CONTINUE; > 2621: }; Consider defining `enqueue` only once and reusing it in both cases. test/hotspot/jtreg/compiler/escapeAnalysis/TestIterativeEA.java line 53: > 51: analyzer.shouldContain("++++ Eliminated: 26 Allocate"); > 52: analyzer.shouldContain("++++ Eliminated: 51 Allocate"); > 53: analyzer.shouldContain("++++ Eliminated: 84 Allocate"); Did you analyze why there are more allocations removed than before in this test case? I did not expect this changeset to have an effect on the number of removed allocations. 
test/hotspot/jtreg/compiler/macronodes/TestEarlyEliminationOfAllocationWithoutUse.java line 1: > 1: /* Please add a package declaration (and make the corresponding class names fully qualified in the `@run` directives). test/hotspot/jtreg/compiler/macronodes/TestEliminationOfAllocationWithoutUse.java line 30: > 28: * Now that array slice depends on the rawslice. And then when the Initialize MemBar gets > 29: * removed in expand_allocate_common, the rawslice sees that it has now no effect, looks > 30: * through the MergeMem and sees the initial stae. That way, also the linked array slice Suggestion: * through the MergeMem and sees the initial state. That way, also the linked array slice ------------- PR Review: https://git.openjdk.org/jdk/pull/24570#pullrequestreview-3244667543 PR Review Comment: https://git.openjdk.org/jdk/pull/24570#discussion_r2362830370 PR Review Comment: https://git.openjdk.org/jdk/pull/24570#discussion_r2362759304 PR Review Comment: https://git.openjdk.org/jdk/pull/24570#discussion_r2362760441 PR Review Comment: https://git.openjdk.org/jdk/pull/24570#discussion_r2362798596 PR Review Comment: https://git.openjdk.org/jdk/pull/24570#discussion_r2362800147 PR Review Comment: https://git.openjdk.org/jdk/pull/24570#discussion_r2362800934 PR Review Comment: https://git.openjdk.org/jdk/pull/24570#discussion_r2362782847 PR Review Comment: https://git.openjdk.org/jdk/pull/24570#discussion_r2362757140 PR Review Comment: https://git.openjdk.org/jdk/pull/24570#discussion_r2362743051 PR Review Comment: https://git.openjdk.org/jdk/pull/24570#discussion_r2362743403 PR Review Comment: https://git.openjdk.org/jdk/pull/24570#discussion_r2362746650 PR Review Comment: https://git.openjdk.org/jdk/pull/24570#discussion_r2362750245 PR Review Comment: https://git.openjdk.org/jdk/pull/24570#discussion_r2362767659 PR Review Comment: https://git.openjdk.org/jdk/pull/24570#discussion_r2362816473 PR Review Comment: 
https://git.openjdk.org/jdk/pull/24570#discussion_r2362810978 PR Review Comment: https://git.openjdk.org/jdk/pull/24570#discussion_r2362745517 From jbhateja at openjdk.org Fri Sep 19 13:17:04 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 19 Sep 2025 13:17:04 GMT Subject: RFR: 8351016: RA support for EVEX to REX/REX2 demotion to optimize NDD instructions [v3] In-Reply-To: References: Message-ID: <52RpYM-r-1EZcYjbaNllAEPHQP1nYhQcs-GfydIzP08=.0bfb8185-78a7-4dfb-9700-f4a36a1d0e99@github.com> > Currently, while choosing the colour (register) for a definition live range during the select phase of register allocation, we pick the first available colour that does not match with already allocated neighboring live ranges. > > With Intel APX NDD ISA extension, several existing two-address arithmetic instructions can now have an explicit non-destructive destination operand; this, in general, saves additional spills for two-address instructions where the destination is also the first source operand, and where the source live range surpasses the current instruction. > > All NDD instructions mandate extended EVEX encoding with a bulky 4-byte prefix, [JDK-8351994](https://github.com/openjdk/jdk/pull/24431) added logic for NDD to REX/REX2 demotion in the assembler layer, but due to the existing first color selection register allocation policy, the demotions are rare. This patch biases the allocation of NDD definition to the first source operand or the second source operand for the commutative class of operations. > > Biasing is a compile-time hint to the allocator and is different from live range coalescing (aggressive/conservative), which merges the two live ranges using the union find algorithm. Given that REX encoding needs a 1-byte prefix and REX2 encoding needs a 2-byte prefix, domotion saves considerable JIT code size. > > The patch shows around 5-20% improvement in code size by facilitating NDD demotion. 
> > For the following micro, the method JIT code size reduced from 136 to 120 bytes, which is around a 13% reduction in code size footprint. > > **Micro:-** > image > > > **Baseline :-** > image > > **With opt:-** > image > > Thorough validations are underway using the latest [Intel Software Development Emulator version 9.58](https://www.intel.com/content/www/us/en/download/684897/intel-software-development-emulator.html). > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Fix jtreg, one less spill ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26283/files - new: https://git.openjdk.org/jdk/pull/26283/files/cd13fe60..3ebe52fa Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26283&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26283&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26283.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26283/head:pull/26283 PR: https://git.openjdk.org/jdk/pull/26283 From qamai at openjdk.org Fri Sep 19 13:52:21 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Fri, 19 Sep 2025 13:52:21 GMT Subject: RFR: 8356813: Improve Mod(I|L)Node::Value [v9] In-Reply-To: <1ZCEMsPvSQaLGWRuNtO89LNP_XUeaz-edeIUrKwRCZY=.9dad5a02-c739-4e24-8692-8941f31e5a49@github.com> References: <2Jf_gfvRlKcmCFoQHp5T0WW_fU_yK5-0Z3z41f00-YU=.164be9f0-fae1-44bb-84c3-846d8c2c0db2@github.com> <1ZCEMsPvSQaLGWRuNtO89LNP_XUeaz-edeIUrKwRCZY=.9dad5a02-c739-4e24-8692-8941f31e5a49@github.com> Message-ID: On Sun, 14 Sep 2025 14:44:02 GMT, Hannes Greule wrote: >> This change improves the precision of the `Mod(I|L)Node::Value()` functions. >> >> I reordered the structure a bit. First, we handle constants, afterwards, we handle ranges. 
The bottom checks seem to be excessive (`Type::BOTTOM` is covered by using `isa_(int|long)()`, the local bottom is just the full range). Given we can even give reasonable bounds if only one input has any bounds, we don't want to return early. >> The changes after that are commented. Please let me know if the explanations are good, or if you have any suggestions. >> >> ### Monotonicity >> >> Before, a 0 divisor resulted in `Type(Int|Long)::POS`. Initially I wanted to keep it this way, but that violates monotonicity during PhaseCCP. As an example, if we see a 0 divisor first and a 3 afterwards, we might try to go from `>=0` to `-2..2`, but the meet of these would be `>=-2` rather than `-2..2`. Using `Type(Int|Long)::ZERO` instead (zero is always in the resulting value if we cover a range). >> >> ### Testing >> >> I added tests for cases around the relevant bounds. I also ran tier1, tier2, and tier3 but didn't see any related failures after addressing the monotonicity problem described above (I'm having a few unrelated failures on my system currently, so separate testing would be appreciated in case I missed something). >> >> Please review and let me know what you think. >> >> ### Other >> >> The `UMod(I|L)Node`s were adjusted to be more in line with its signed variants. This change diverges them again, but similar improvements could be made after #17508. >> >> During experimenting with these changes, I stumbled upon a few things that aren't directly related to this change, but might be worth to further look into: >> - If the divisor is a constant, we will directly replace the `Mod(I|L)Node` with more but less expensive nodes in `::Ideal()`. Type analysis for these nodes combined is less precise, means we miss potential cases were this would help e.g., removing range checks. Would it make sense to delay the replacement? >> - To force non-negative ranges, I'm using `char`. I noticed that method parameters of sub-int integer types all fall back to `TypeInt::INT`. 
This seems to be an intentional change of https://github.com/openjdk/jdk/commit/200784d505dd98444c48c9ccb7f2e4df36dcbb6a. The bug report is private, so I can't really judge if that part is necessary, but it seems odd. > > Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: > > remove unused parameter src/hotspot/share/opto/divnode.cpp line 1211: > 1209: // We always generate the dynamic check for 0. > 1210: // 0 MOD X is 0 > 1211: if (t1 == TypeInteger::zero(bt)) { return t1; } I think the culprit for [JDK-8356813](https://bugs.openjdk.org/browse/JDK-8356813) is this place. We need to check for the divisor being a constant 0 and return `Type::TOP` before this check and the check below. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25254#discussion_r2362960218 From hgreule at openjdk.org Fri Sep 19 14:17:26 2025 From: hgreule at openjdk.org (Hannes Greule) Date: Fri, 19 Sep 2025 14:17:26 GMT Subject: RFR: 8356813: Improve Mod(I|L)Node::Value [v9] In-Reply-To: References: <2Jf_gfvRlKcmCFoQHp5T0WW_fU_yK5-0Z3z41f00-YU=.164be9f0-fae1-44bb-84c3-846d8c2c0db2@github.com> <1ZCEMsPvSQaLGWRuNtO89LNP_XUeaz-edeIUrKwRCZY=.9dad5a02-c739-4e24-8692-8941f31e5a49@github.com> Message-ID: On Fri, 19 Sep 2025 13:49:08 GMT, Quan Anh Mai wrote: >> Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: >> >> remove unused parameter > > src/hotspot/share/opto/divnode.cpp line 1211: > >> 1209: // We always generate the dynamic check for 0. >> 1210: // 0 MOD X is 0 >> 1211: if (t1 == TypeInteger::zero(bt)) { return t1; } > > I think the culprit for [JDK-8356813](https://bugs.openjdk.org/browse/JDK-8356813) is this place. We need to check for the divisor being a constant 0 and return `Type::TOP` before this check and the check below. 
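To make the range reasoning in this thread concrete, here is a minimal, self-contained sketch of interval bounds for a truncating signed remainder. It is not the `Mod(I|L)Node::Value()` code from the PR: the `Range` struct, the `mod_range` name, and the use of `std::nullopt` as a stand-in for `Type::TOP` are all assumptions made for illustration. It only relies on two facts about `a % b`: the result has the sign of `a` (or is zero), and its magnitude is smaller than `|b|` and no larger than `|a|`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <optional>

// Closed interval [lo, hi] of possible 32-bit values, stored in 64 bits so that
// taking absolute values cannot overflow. Hypothetical stand-in for a C2 integer type.
struct Range {
  std::int64_t lo;
  std::int64_t hi;
};

// Conservative bounds for `dividend % divisor` with truncating semantics.
// Returns std::nullopt when the divisor is the constant 0: no execution produces
// a value in that case, which plays the role of TOP in this sketch.
std::optional<Range> mod_range(Range dividend, Range divisor) {
  assert(dividend.lo <= dividend.hi && divisor.lo <= divisor.hi);
  if (divisor.lo == 0 && divisor.hi == 0) {
    return std::nullopt;
  }
  // |result| is strictly smaller than the largest possible |divisor|.
  const std::int64_t max_abs = std::max(std::abs(divisor.lo), std::abs(divisor.hi));
  const std::int64_t bound = max_abs - 1;
  // The result takes the sign of the dividend and never exceeds it in magnitude.
  const std::int64_t lo = dividend.lo >= 0 ? 0 : std::max(-bound, dividend.lo);
  const std::int64_t hi = dividend.hi <= 0 ? 0 : std::min(bound, dividend.hi);
  return Range{lo, hi};
}
```

With a dividend in `-10..10` and a constant divisor of 3, this yields the `-2..2` range mentioned in the PR description; with a constant 0 divisor it yields no range at all, so a later narrowing of the divisor can only move the result in one direction.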
Yes, I already worked a bit on it, see https://github.com/SirYwell/jdk/tree/fix/mod-not-monotonic but I didn't have time to create a PR yet. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25254#discussion_r2363030436 From dlunden at openjdk.org Fri Sep 19 16:02:35 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Fri, 19 Sep 2025 16:02:35 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v32] In-Reply-To: References: Message-ID: <_3SEByIuKhkAQvZ9gvMOHYMH2y_Xh9F4UM1lS2ixzpw=.f572fe77-31f0-4724-9611-9f53231d6bec@github.com> > If a method has a large number of parameters, we currently bail out from C2 compilation. > > ### Changeset > > Allowing C2 compilation of methods with a large number of parameters requires fundamental changes to the register mask data structure, used in many places in C2. In particular, register masks currently have a statically determined size and cannot represent arbitrary numbers of stack slots. This is needed if we want to compile methods with arbitrary numbers of parameters. Register mask operations are present in performance-sensitive parts of C2, which further complicates changes. > > Changes: > - Add functionality to dynamically grow/extend register masks. I experimented with a number of design choices to achieve this. To keep the common case (normal number of method parameters) quick and also to avoid more intrusive changes to the current `RegMask` interface, I decided to leave the "base" statically allocated memory for masks unchanged and only use dynamically allocated memory in the rare cases where it is needed. > - Generalize the "chunk"-logic from `PhaseChaitin::Select()` to allow arbitrary-sized chunks, and also move most of the logic into register mask methods to separate concerns and to make the `PhaseChaitin::Select()` code more readable. > - Remove all `can_represent` checks and bailouts. > - Performance tuning. 
A particularly important change is the early-exit optimization in `RegMask::overlap`, used in the performance-sensitive method `PhaseChaitin::interfere_with_live`. > - Add a new test case `TestManyMethodArguments.java` and extend an old test `TestNestedSynchronize.java`. > > ### Testing > > - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/10178060450) > - `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. > - Standard performance benchmarking. No observed conclusive overall performance degradation/improvement. > - Specific benchmarking of C2 compilation time. The changes increase C2 compilation time by, approximately and on average, 1% for methods that could also be compiled before this changeset (see the figure below). The reason for the degradation is further checks required in performance-sensitive code (in particular `PhaseChaitin::remove_bound_register_from_interfering_live_ranges`). I have tried optimizing in various ways, but changes I found that lead to improvement also lead to less readable code (and are, in my opinion, not worth it). > > ![c2-regression](https:/... 
Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: Increase timeout for TestMethodArguments.java ------------- Changes: - all: https://git.openjdk.org/jdk/pull/20404/files - new: https://git.openjdk.org/jdk/pull/20404/files/84efc2db..1dd5084f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20404&range=31 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20404&range=30-31 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/20404.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20404/head:pull/20404 PR: https://git.openjdk.org/jdk/pull/20404 From sviswanathan at openjdk.org Fri Sep 19 16:23:56 2025 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Fri, 19 Sep 2025 16:23:56 GMT Subject: RFR: 8351016: RA support for EVEX to REX/REX2 demotion to optimize NDD instructions [v3] In-Reply-To: <52RpYM-r-1EZcYjbaNllAEPHQP1nYhQcs-GfydIzP08=.0bfb8185-78a7-4dfb-9700-f4a36a1d0e99@github.com> References: <52RpYM-r-1EZcYjbaNllAEPHQP1nYhQcs-GfydIzP08=.0bfb8185-78a7-4dfb-9700-f4a36a1d0e99@github.com> Message-ID: On Fri, 19 Sep 2025 13:17:04 GMT, Jatin Bhateja wrote: >> Currently, while choosing the colour (register) for a definition live range during the select phase of register allocation, we pick the first available colour that does not match with already allocated neighboring live ranges. >> >> With Intel APX NDD ISA extension, several existing two-address arithmetic instructions can now have an explicit non-destructive destination operand; this, in general, saves additional spills for two-address instructions where the destination is also the first source operand, and where the source live range surpasses the current instruction.
>> >> All NDD instructions mandate extended EVEX encoding with a bulky 4-byte prefix. [JDK-8351994](https://github.com/openjdk/jdk/pull/24431) added logic for NDD to REX/REX2 demotion in the assembler layer, but due to the existing first color selection register allocation policy, the demotions are rare. This patch biases the allocation of the NDD definition to the first source operand, or to the second source operand for the commutative class of operations. >> >> Biasing is a compile-time hint to the allocator and is different from live range coalescing (aggressive/conservative), which merges the two live ranges using the union-find algorithm. Given that REX encoding needs a 1-byte prefix and REX2 encoding needs a 2-byte prefix, demotion saves considerable JIT code size. >> >> The patch shows around 5-20% improvement in code size by facilitating NDD demotion. >> >> For the following micro, the method JIT code size reduced from 136 to 120 bytes, which is around a 13% reduction in code size footprint. >> >> **Micro:-** >> [image] >> >> >> **Baseline :-** >> [image] >> >> **With opt:-** >> [image] >> >> Thorough validations are underway using the latest [Intel Software Development Emulator version 9.58](https://www.intel.com/content/www/us/en/download/684897/intel-software-development-emulator.html). >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Fix jtreg, one less spill Looks good to me. ------------- Marked as reviewed by sviswanathan (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/26283#pullrequestreview-3245738167 From iveresov at openjdk.org Fri Sep 19 16:32:40 2025 From: iveresov at openjdk.org (Igor Veresov) Date: Fri, 19 Sep 2025 16:32:40 GMT Subject: RFR: 8368071: Compilation throughput regressed 2X-8X after JDK-8355003 In-Reply-To: References: Message-ID: On Fri, 19 Sep 2025 07:52:16 GMT, Man Cao wrote: > Hi all, > > Could anyone review this change that fixes a severe startup performance regression for `-XX:+TieredCompilation`? See https://bugs.openjdk.org/browse/JDK-8368071 for more details. > > -Man Good catch! I need to refactor some of it in the future but for now it's a good conservative fix. Let me run some internal testing on it first and I'll get back to you. ------------- PR Comment: https://git.openjdk.org/jdk/pull/27383#issuecomment-3312866388 From shade at openjdk.org Fri Sep 19 16:33:46 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 19 Sep 2025 16:33:46 GMT Subject: RFR: 8357258: x86: Improve receiver type profiling reliability [v2] In-Reply-To: References: Message-ID: <_P0NOFGGa3uTaZJ17X8jYvgtbbOU90SD6LJ-mM4-P-U=.5910d123-6993-49b2-8d65-4776f0333d4c@github.com> On Wed, 17 Sep 2025 23:24:16 GMT, Dean Long wrote: >> Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: >> >> - Merge branch 'master' into JDK-8357258-x86-c1-optimize-virt-calls >> - Drop atomic counters >> - Initial version > > src/hotspot/cpu/x86/macroAssembler_x86.cpp line 4853: > >> 4851: } else { >> 4852: // Nothing to do, just go with defaults. >> 4853: assert_different_registers(rax, mdp, recv, offset); > > Can't we do all register shuffling and push/pop outside the loop? 
I remember having an initial version that did it, but the code ended up even hairier and less efficient, because: a) there are different exits from the loop; b) in the majority of cases we do not need to do any shuffling (e.g. when none of the registers in question is `rax`); c) it also caused some branches to become un-shortened. For this profiling stencil, every instruction counts :) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25305#discussion_r2363548849 From shade at openjdk.org Fri Sep 19 16:51:14 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 19 Sep 2025 16:51:14 GMT Subject: RFR: 8357258: x86: Improve receiver type profiling reliability [v2] In-Reply-To: References: Message-ID: On Wed, 17 Sep 2025 23:38:39 GMT, Dean Long wrote: >> Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: >> >> - Merge branch 'master' into JDK-8357258-x86-c1-optimize-virt-calls >> - Drop atomic counters >> - Initial version > > src/hotspot/cpu/x86/interp_masm_x86.cpp line 1342: > >> 1340: >> 1341: // Record the receiver type. >> 1342: type_profile(receiver, mdp, 0); > > Why is 0 the correct offset? The C1 helper uses md->byte_offset_of_slot(). Interpreter and C1 do profiling data offsets a bit differently. Interpreter tracks MDP as BCI changes. It has to, because it does not really know statically where it is. Take a look at `InterpreterMacroAssembler::update_mdp_*` family of methods, and one of its uses: void InterpreterMacroAssembler::profile_taken_branch(Register mdp) { if (ProfileInterpreter) { Label profile_continue; // If no method data exists, go to profile_continue. test_method_data_pointer(mdp, profile_continue); // We are taking a branch. Increment the taken count.
increment_mdp_data_at(mdp, in_bytes(JumpData::taken_offset())); // The method data pointer needs to be updated to reflect the new target. update_mdp_by_offset(mdp, in_bytes(JumpData::displacement_offset())); bind(profile_continue); } } It is fairly confusing in interpreter code that `mdp` does not point to the head of `MethodData*`, but is actually an _interior_ pointer somewhere inside the method data. Profiling code is woven in such that the MDP at the current point refers to the area that belongs to the current BCI. Compilers are able to compute the mapping from BCI to MDP to data slot directly, since they have a good view of the whole method and can ask the VM questions about the slot addresses. C1 commonly does this: ciProfileData* data = md->bci_to_data(bci); md->byte_offset_of_slot(data, ); Anyhow, I did most of the interface changes mechanically, so the `0` slot offset naturally appeared in these places through refactoring, which gives me additional confidence about its correctness. > src/hotspot/cpu/x86/interp_masm_x86.cpp line 1553: > >> 1551: >> 1552: // Record the object type. >> 1553: record_klass_in_profile(klass, mdp, reg2, false); > > Same question as above about the 0 offset. Is this because `mdp` has already been adjusted? Same answer as above.
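The difference between the two addressing styles can be sketched with a toy model. This is only an illustration, not HotSpot code: `ToyProfile`, `ToyInterpreter`, `record_offset`, `compiler_style_count` and the fixed 8-byte record size are all hypothetical, and real profile records vary in size. The interpreter-like side carries an interior pointer that has already been advanced to the current record, so its slot offset is relative to that record (hence `0` for the first slot), while the compiler-like side computes the full offset from the start of the profile.

```cpp
#include <cstddef>
#include <cstdint>

// Toy layout (assumption): fixed-size records, one per profiled bytecode.
constexpr std::size_t kRecordSize = 8;
constexpr std::size_t kRecords    = 4;

struct ToyProfile {
  std::uint8_t bytes[kRecordSize * kRecords] = {};
};

// Compiler style: the whole method is visible, so the offset of any record
// can be computed statically from the start of the profile.
inline std::size_t record_offset(std::size_t record_index) {
  return record_index * kRecordSize;
}

inline void compiler_style_count(ToyProfile& p, std::size_t record_index,
                                 std::size_t slot) {
  p.bytes[record_offset(record_index) + slot]++;
}

// Interpreter style: an interior pointer is moved along as execution
// proceeds, so slots are addressed relative to the current record.
struct ToyInterpreter {
  std::uint8_t* mdp;                      // points at the current record
  void advance() { mdp += kRecordSize; }  // move to the next record
  void count(std::size_t slot) { mdp[slot]++; }
};
```

In this model, `interp.count(0)` after two calls to `advance()` touches the same byte as `compiler_style_count(p, 2, 0)`.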
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25305#discussion_r2363591249 PR Review Comment: https://git.openjdk.org/jdk/pull/25305#discussion_r2363591882 From chagedorn at openjdk.org Fri Sep 19 19:51:15 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 19 Sep 2025 19:51:15 GMT Subject: RFR: 8367613: Test compiler/runtime/TestDontCompileHugeMethods.java failed [v2] In-Reply-To: <5eWiPUhybQOdBZAfm8LnEGLQ8ZwXHcqatCQEf8PVlgo=.ffcd2f7e-f734-49de-a2e4-1099bfb544f5@github.com> References: <5eWiPUhybQOdBZAfm8LnEGLQ8ZwXHcqatCQEf8PVlgo=.ffcd2f7e-f734-49de-a2e4-1099bfb544f5@github.com> Message-ID: On Tue, 16 Sep 2025 21:59:10 GMT, Man Cao wrote: >> Hi, >> >> Could anyone approve this change that excludes this test when running with `-Xcomp`? This avoids the test failure reported in [JDK-8367613](https://bugs.openjdk.org/browse/JDK-8367613). >> >> For reasons I don't yet understand, the `HugeSwitch::shortMethod` method is not compiled under `-Xcomp -XX:TieredStopAtLevel=1`. The method gets compiled with either `-Xcomp` or `-XX:TieredStopAtLevel=1`, but not both. I would appreciate it if anyone could provide insights on possible reasons. > > Man Cao has updated the pull request incrementally with one additional commit since the last revision: > > Switch to disable inlining for shortMethod Looks good, thanks! ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/27306#pullrequestreview-3246876138 From manc at openjdk.org Fri Sep 19 19:56:55 2025 From: manc at openjdk.org (Man Cao) Date: Fri, 19 Sep 2025 19:56:55 GMT Subject: Integrated: 8367613: Test compiler/runtime/TestDontCompileHugeMethods.java failed In-Reply-To: References: Message-ID: On Tue, 16 Sep 2025 06:48:23 GMT, Man Cao wrote: > Hi, > > Could anyone approve this change that excludes this test when running with `-Xcomp`? This avoids the test failure reported in [JDK-8367613](https://bugs.openjdk.org/browse/JDK-8367613).
> > For reasons I don't yet understand, the `HugeSwitch::shortMethod` method is not compiled under `-Xcomp -XX:TieredStopAtLevel=1`. The method gets compiled with either `-Xcomp` or `-XX:TieredStopAtLevel=1`, but not both. I would appreciate it if anyone could provide insights on possible reasons. This pull request has now been integrated. Changeset: 25a4e263 Author: Man Cao URL: https://git.openjdk.org/jdk/commit/25a4e26320340cdda082cd45639e73b137ce45a2 Stats: 5 lines in 1 file changed: 4 ins; 0 del; 1 mod 8367613: Test compiler/runtime/TestDontCompileHugeMethods.java failed Reviewed-by: chagedorn, dfenacci ------------- PR: https://git.openjdk.org/jdk/pull/27306 From jbhateja at openjdk.org Fri Sep 19 20:44:54 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 19 Sep 2025 20:44:54 GMT Subject: RFR: 8365205: C2: Optimize popcount value computation using knownbits [v11] In-Reply-To: References: Message-ID: > This patch optimizes PopCount value transforms using KnownBits information. > Following are the results of the micro-benchmark included with the patch: > > > > System: 13th Gen Intel(R) Core(TM) i3-1315U > > Baseline: > Benchmark Mode Cnt Score Error Units > PopCountValueTransform.LogicFoldingKerenLong thrpt 2 215460.670 ops/s > PopCountValueTransform.LogicFoldingKerenlInt thrpt 2 294014.826 ops/s > > With opt: > Benchmark Mode Cnt Score Error Units > PopCountValueTransform.LogicFoldingKerenLong thrpt 2 389978.082 ops/s > PopCountValueTransform.LogicFoldingKerenlInt thrpt 2 417261.583 ops/s > > > Kindly review and share your feedback.
> > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Update countbitsnode.cpp ------------- Changes: - all: https://git.openjdk.org/jdk/pull/27075/files - new: https://git.openjdk.org/jdk/pull/27075/files/92cf2fad..e206ccc3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=27075&range=10 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=27075&range=09-10 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/27075.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/27075/head:pull/27075 PR: https://git.openjdk.org/jdk/pull/27075 From jbhateja at openjdk.org Fri Sep 19 20:52:32 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 19 Sep 2025 20:52:32 GMT Subject: RFR: 8365205: C2: Optimize popcount value computation using knownbits [v10] In-Reply-To: References: Message-ID: <3w_iOULgRgaA1kCHlXrprLPdQfMYcvo1kXqvE7VaaQk=.ab753d0d-1184-4865-bace-564a4938d6d5@github.com> On Fri, 19 Sep 2025 11:16:26 GMT, Hannes Greule wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Update TestPopCountValueTransforms.java > > src/hotspot/share/opto/countbitsnode.cpp line 123: > >> 121: // we have at least and at most. >> 122: // From the definition of KnownBits, we know: >> 123: // zeros: Indicates which bits must be 0: ones[i] =1 -> t[i]=0 > > I'm a bit confused by this, is ones[i] mixed up with zeros[i]? I.e., t[i]=0 if zeros[i]=1 @SirYwell , comment updated. 
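For readers following the comment under discussion, the bound it describes can be sketched as follows. This is a standalone illustration, not the code in `countbitsnode.cpp`: `KnownBits32`, `PopCountRange`, `count_set_bits` and `popcount_range` are hypothetical names, and the sketch assumes the two masks are consistent (no bit is claimed to be both 0 and 1). Every bit known to be 1 must be counted, and every bit known to be 0 can never be counted, which gives a lower and an upper bound on the population count.

```cpp
#include <cstdint>

// Hypothetical stand-in for known-bits information about a 32-bit value.
struct KnownBits32 {
  std::uint32_t zeros;  // bit i set => bit i of the value is known to be 0
  std::uint32_t ones;   // bit i set => bit i of the value is known to be 1
};

struct PopCountRange {
  int lo;  // at least this many bits are set
  int hi;  // at most this many bits are set
};

// Portable bit count (Kernighan's method), to avoid compiler builtins.
inline int count_set_bits(std::uint32_t v) {
  int n = 0;
  while (v != 0) {
    v &= v - 1;  // clears the lowest set bit
    ++n;
  }
  return n;
}

// lo: all known-one bits are set; hi: all bits except known-zero bits
// might be set.
inline PopCountRange popcount_range(KnownBits32 kb) {
  return PopCountRange{count_set_bits(kb.ones),
                       32 - count_set_bits(kb.zeros)};
}
```

When every bit is known (`zeros == ~ones`), both bounds coincide and the population count is a constant.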
Links to formal z3 proofs for this:- https://github.com/openjdk/jdk/pull/25928#discussion_r2256750507 https://bugs.openjdk.org/browse/JDK-8365205?focusedId=14807707&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14807707:~:text=C%3A%5CGithub%5Csoftwares%5Cz3%5Cz3%2D4.15.2%2Dx64%2Dwin%5Cbin%5Cpython%3Epython3%20known_bits_popcount.py%0AMain%20constraints%20satisfiable.%0AConstraints%20are%20valid%20(negation%20unsatisfiable). ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/27075#discussion_r2364498821 From snatarajan at openjdk.org Fri Sep 19 20:58:21 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Fri, 19 Sep 2025 20:58:21 GMT Subject: RFR: 8349835: C2: simplify IGV property printing Message-ID: <16DzrQX_urXyKeFfY1FlaEM8Q9QjYgc0CHa25EWQV84=.ecec4b5d-c7d9-4060-9b76-0a4e4e0786e3@github.com> The code that prints node properties and live range properties is very verbose and repetitive and could be simplified by applying a refactoring suggested [here](https://github.com/openjdk/jdk/pull/23558#discussion_r1950785708). ### Fix Implemented the suggested refactoring. 
### Testing Github Actions, Tier 1-3 ------------- Commit messages: - changing int to bool in a struct - fix to failing test - initial fix Changes: https://git.openjdk.org/jdk/pull/26902/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26902&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8349835 Stats: 117 lines in 2 files changed: 20 ins; 54 del; 43 mod Patch: https://git.openjdk.org/jdk/pull/26902.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26902/head:pull/26902 PR: https://git.openjdk.org/jdk/pull/26902 From rriggs at openjdk.org Fri Sep 19 21:58:22 2025 From: rriggs at openjdk.org (Roger Riggs) Date: Fri, 19 Sep 2025 21:58:22 GMT Subject: RFR: 8355223: Improve documentation on @IntrinsicCandidate [v8] In-Reply-To: References: Message-ID: On Fri, 19 Sep 2025 13:10:33 GMT, Chen Liang wrote: >> In offline discussion, we noted that the documentation on this annotation does not recommend minimizing the intrinsified section and moving whatever can be done in Java to Java; thus I prepared this documentation update, to shrink a "TLDR" essay to something concise for readers, such as pointing to that list at `vmIntrinsics.hpp` instead of "a list". > > Chen Liang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 12 additional commits since the last revision: > > - Separate design doc > - Merge branch 'master' of https://github.com/openjdk/jdk into doc/intrinsic-candidate > - More review updates > - Merge branch 'master' of https://github.com/openjdk/jdk into doc/intrinsic-candidate > - Move intrinsic to be a subsection; just one most common function of the annotation > - Merge branch 'master' of https://github.com/openjdk/jdk into doc/intrinsic-candidate > - Merge branch 'master' of https://github.com/openjdk/jdk into doc/intrinsic-candidate > - Update src/java.base/share/classes/jdk/internal/vm/annotation/IntrinsicCandidate.java > > Co-authored-by: Raffaello Giulietti > - Shorter first sentence > - Updates, thanks to John > - ... and 2 more: https://git.openjdk.org/jdk/compare/380c643a...e4afa49d This seems more like guidance for people writing intrinsics and should be in the HotSpot part of the source tree. The annotation can link there. src/java.base/share/classes/jdk/internal/vm/annotation/IntrinsicCandidate.java line 42: > 40: /// what intrinsics are and cautions for working with annotated methods. > 41: /// > 42: /// @since 16 Let's stick to the javadoc /** ... */ markup. src/java.base/share/classes/jdk/internal/vm/annotation/intrinsics.md line 1: > 1: