From xliu at openjdk.org Sat Apr 1 00:47:23 2023
From: xliu at openjdk.org (Xin Liu)
Date: Sat, 1 Apr 2023 00:47:23 GMT
Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v5]
In-Reply-To:
References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com>
Message-ID:

On Thu, 30 Mar 2023 23:36:20 GMT, Cesar Soares Lucas wrote:

>> Can I please get reviews for this PR?
>>
>> The most common and frequent use of NonEscaping Phis merging object allocations is for debugging information. The two graphs below show numbers for Renaissance and DaCapo benchmarks - similar results are obtained for all other applications that I tested.
>>
>> With what frequency does each IR node type occur as an allocation merge user? I.e., if the same node type uses a Phi N times the counter is incremented by N:
>>
>> ![image](https://user-images.githubusercontent.com/2249648/222280517-4dcf5871-2564-4207-b49e-22aee47fa49d.png)
>>
>> What are the most common users of allocation merges? I.e., if the same node type uses a Phi N times the counter is incremented by 1:
>>
>> ![image](https://user-images.githubusercontent.com/2249648/222280608-ca742a4e-1622-4e69-a778-e4db6805ea02.png)
>>
>> This PR adds support for scalar replacing allocations participating in merges that are used as debug information OR as a base for field loads. I plan to create subsequent PRs to enable scalar replacement of merges used by other node types (CmpP is next on the list).
>>
>> The approach I used for _rematerialization_ is pretty straightforward.
It basically consists of: 1) Extend SafePointScalarObjectNode to represent multiple SR objects; 2) Add a new class to support rematerialization of SR objects that are part of merges; 3) Patch HotSpot to be able to serialize and deserialize debug information related to allocation merges; 4) Patch C2 to generate unique types for SR objects participating in some allocation merges.
>>
>> The approach I used for _enabling the scalar replacement of some of the inputs of the allocation merge_ is also pretty straightforward: call `MemNode::split_through_phi` to, well, split AddP->Load* through the merge, which will render the Phi useless.
>>
>> I tested this with JTREG tests tier 1-4 (Windows, Linux, and Mac) and didn't see any regressions. I also tested with several applications and didn't see any failures. I also ran tests with "-ea -esa -Xbatch -Xcomp -XX:+UnlockExperimentalVMOptions -XX:-TieredCompilation -server -XX:+IgnoreUnrecognizedVMOptions -XX:+UnlockDiagnosticVMOptions -XX:+StressLCM -XX:+StressGCM -XX:+StressCCP" and didn't observe any related failures.
>
> Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision:
>
>   Address PR feedback 1: make ObjectMergeValue subclass of ObjectValue & create new IR class to represent scalarized merges.

src/hotspot/share/opto/escape.cpp line 639:

> 637:       call->add_req(selector);
> 638:
> 639:       for (uint i = 1; i < ophi->req(); i++) {

Compared to new_phi and selector, I think this is the heavy-lifting work. You "replace" all appearances of ptn with SPSO. This logic largely overlaps the 'scalar replacement' part in MacroExpand. Have you considered performing the transformation in MacroExpand? Your prior changes have already removed the NSR marks, so ME/SR will consider 'ptn'.
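
For readers less familiar with the term, an allocation merge arises when two non-escaping allocations flow into the same Phi and a field load then uses that Phi as its base. The Java sketch below is only an illustration: the class and method names are hypothetical and not taken from the PR, and whether C2 actually scalar-replaces this shape depends on inlining and on the changes under review. `scalarized` shows, written by hand, the result that splitting the load through the Phi aims for.

```java
class AllocationMergeExample {
    // Hypothetical value class; small enough that C2 may inline its constructor.
    static final class Point {
        final int x;
        final int y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    // Two allocations are merged at the end of the conditional (a Phi in C2's IR),
    // and the load of p.x uses that merge as its base address.
    static int merged(boolean cond, int a, int b) {
        Point p = cond ? new Point(a, b) : new Point(b, a);
        return p.x;
    }

    // Hand-written equivalent with no allocation at all: the field load has been
    // pushed into each branch and each branch reads the value it just stored.
    static int scalarized(boolean cond, int a, int b) {
        return cond ? a : b;
    }

    public static void main(String[] args) {
        System.out.println(merged(true, 1, 2) == scalarized(true, 1, 2));
        System.out.println(merged(false, 1, 2) == scalarized(false, 1, 2));
    }
}
```

Both methods compute the same value for every input; only the first one allocates.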
-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1155028545

From qamai at openjdk.org Sat Apr 1 07:44:25 2023
From: qamai at openjdk.org (Quan Anh Mai)
Date: Sat, 1 Apr 2023 07:44:25 GMT
Subject: RFR: 8303762: [vectorapi] Intrinsification of Vector.slice [v4]
In-Reply-To:
References:
Message-ID:

> `Vector::slice` is a method at the top-level class of the Vector API that concatenates the 2 inputs into an intermediate composite and extracts a window equal to the size of the inputs into the result. It is used in vector conversion methods where the part number is not 0 to slice the parts to the correct positions. Slicing is also used in text processing such as utf8 and utf16 validation. x86 starting from SSSE3 has `palignr`, which does vector slicing very efficiently. As a result, I think it is beneficial to add a C2 node for this operation and to intrinsify the `Vector::slice` method.
>
> A slice is currently implemented as `v2.rearrange(iota).blend(v1.rearrange(iota), blendMask)`, which requires preparation of the index vector and the blending mask. Even with the preparations hoisted out of the loops, microbenchmarks show an improvement when using the slice intrinsics. Some have tremendous increases in throughput due to the limitation that a mask of length 2 cannot currently be intrinsified, which forces a fallback to the Java implementations.
>
> Please take a look and leave a review. Thank you very much.

Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains ten commits:

 - instruction asserts
 - Merge branch 'master' into sliceIntrinsics
 - add comments explaining anonymous classes
 - address reviews
 - sse2, increase warmup
 - aesthetic
 - optimise 64B
 - add jmh
 - vector slice intrinsics

-------------

Changes: https://git.openjdk.org/jdk/pull/12909/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12909&range=03
  Stats: 1603 lines in 58 files changed: 1277 ins; 257 del; 69 mod
  Patch: https://git.openjdk.org/jdk/pull/12909.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/12909/head:pull/12909

PR: https://git.openjdk.org/jdk/pull/12909

From jbhateja at openjdk.org Sat Apr 1 10:10:30 2023
From: jbhateja at openjdk.org (Jatin Bhateja)
Date: Sat, 1 Apr 2023 10:10:30 GMT
Subject: Withdrawn: 8302673: [SuperWord] MaxReduction and MinReduction should vectorize for int
In-Reply-To:
References:
Message-ID:

On Fri, 31 Mar 2023 07:30:30 GMT, Jatin Bhateja wrote:

> This bugfix patch bypasses a couple of canonicalizing ideal transformations for MaxI/MinI IR nodes to prevent breaking the reduction chain.
>
> Kindly review.
>
> Best Regards,
> Jatin

This pull request has been closed without being integrated.

-------------

PR: https://git.openjdk.org/jdk/pull/13260

From jbhateja at openjdk.org Sun Apr 2 05:56:35 2023
From: jbhateja at openjdk.org (Jatin Bhateja)
Date: Sun, 2 Apr 2023 05:56:35 GMT
Subject: RFR: 8287087: C2: perform SLP reduction analysis on-demand [v2]
In-Reply-To:
References:
Message-ID:

On Fri, 24 Mar 2023 15:21:45 GMT, Roberto Castañeda Lozano wrote:

>> Reduction analysis finds cycles of reduction operations within loops. The result of this analysis is used by SLP auto-vectorization (to vectorize reductions if deemed profitable) and by x64 instruction matching (to select specialized scalar floating-point `Math.min()/max()` implementations).
Currently, reduction analysis is applied early (before loop unrolling), and the result is propagated through loop unrolling by marking nodes and loops with special reduction flags. Applying reduction analysis early is efficient, but propagating the results correctly through loop unrolling and arbitrary graph transformations is challenging and often leads to inconsistent node-loop reduction flag states, some of which have led to actual miscompilations in the past (see [JDK-8261147](https://bugs.openjdk.org/browse/JDK-8261147) and [JDK-8279622](https://bugs.openjdk.org/browse/JDK-8279622)). >> >> This changeset postpones reduction analysis to the point where its results are actually used. To do so, it generalizes the analysis to find reduction cycles on unrolled loops: >> >> ![reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226725587-b7d68509-3717-4bbe-8d54-f9a105853fda.png) >> >> The generalized analysis precludes the need to maintain and propagate node and loop reduction flags through arbitrary IR transformations, reducing the risk of miscompilations due to invalidation of the analysis results. The generalization is slightly more costly than the current analysis, but still negligible in micro- and general benchmarks. >> >> ## Performance Benefits >> >> As a side benefit, the proposed generalization is able to find more reductions, increasing the scope of auto-vectorization and the performance of x64 floating-point `Math.min()/max()` in multiple scenarios. >> >> ### Increased Auto-Vectorization Scope >> >> There are two main scenarios in which the proposed changeset enables further auto-vectorization: >> >> #### Reductions Using Global Accumulators >> >> >> public class Foo { >> int acc = 0; >> (..) >> void reduce(int[] array) { >> for (int i = 0; i < array.length; i++) { >> acc += array[i]; >> } >> } >> } >> >> Initially, such reductions are wrapped by load and store nodes, which defeats the current reduction analysis. 
However, after unrolling and other optimizations are applied, the reduction becomes recognizable by the proposed analysis:
>>
>> ![global-reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226745351-33494e40-7c07-4a8b-8bf6-d3a96e84b1c2.png)
>>
>> #### Reductions of partially unrolled loops
>>
>>
>>     (..)
>>     for (int i = 0; i < array.length / 2; i++) {
>>         acc += array[2*i];
>>         acc += array[2*i + 1];
>>     }
>>     (..)
>>
>>
>> These reductions are manually unrolled from the beginning, so the current reduction analysis fails to find them, while the proposed analysis is able to detect them as if they were unrolled automatically.
>>
>> ### Increased Performance of x64 Floating-Point `Math.min()/max()`
>>
>> Besides the above scenarios, the proposed generalization allows the x64 matcher to select specialized floating-point `Math.min()/max()` implementations for reductions in non-counted and outer loops (see the new micro-benchmarks in `FpMinMaxIntrinsics.java` for more details).
>>
>> ## Implementation details
>>
>> The generalized reduction analysis finds reductions in a loop by looking for chains of reduction operators of the same node type starting and finishing on each phi node in the loop. To avoid a combinatorial explosion, the analysis assumes that all nodes in a chain are connected via the same edge index, which is realistic because chains usually consist of identical nodes cloned by loop unrolling. This assumption allows the analysis to test only two paths for each examined phi node. A failure of this assumption (e.g. as illustrated in test case `testReductionOnPartiallyUnrolledLoopWithSwappedInputs` from `TestGeneralizedReductions.java`) results in missed vectorization but does not affect correctness. Note that the same-index assumption can only fail in cases where current auto-vectorization would also fail to vectorize (manually unrolled loops).
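
The same-edge-index search described above can be sketched on a toy graph. This is not the code from `superword.cpp`: the `Node` class, the convention that `in[2]` of a phi is the loop-carried value, and the `maxLength` bound are all assumptions made for illustration. For each phi the sketch tries input index 1 and then input index 2, walking backward through nodes of one opcode until it returns to the phi.

```java
import java.util.ArrayList;
import java.util.List;

class ReductionChainSketch {
    /** Toy IR node: in[1] and in[2] are the data inputs, in[0] is unused. */
    static final class Node {
        final String op;
        final Node[] in = new Node[3];
        Node(String op) { this.op = op; }
        Node(String op, Node in1, Node in2) { this(op); in[1] = in1; in[2] = in2; }
    }

    // Walks backward from the phi's loop-carried input (assumed to be in[2]),
    // always following input 'index'. Returns the chain if the walk gets back
    // to the phi through nodes of a single opcode, and an empty list otherwise.
    static List<Node> findChain(Node phi, int index, int maxLength) {
        Node first = phi.in[2];
        List<Node> chain = new ArrayList<>();
        Node current = first;
        while (current != phi) {
            if (current == null || !current.op.equals(first.op) || chain.size() == maxLength) {
                return List.of();
            }
            chain.add(current);
            current = current.in[index];
        }
        return chain;
    }

    // Only two candidate paths are tested per phi: all-index-1 and all-index-2.
    static List<Node> findReduction(Node phi, int maxLength) {
        for (int index = 1; index <= 2; index++) {
            List<Node> chain = findChain(phi, index, maxLength);
            if (!chain.isEmpty()) {
                return chain;
            }
        }
        return List.of();
    }

    public static void main(String[] args) {
        // acc = (phi + x) + y: both adds reach the accumulator through input 1.
        Node phi = new Node("Phi");
        Node add1 = new Node("AddI", phi, new Node("LoadI"));
        Node add2 = new Node("AddI", add1, new Node("LoadI"));
        phi.in[2] = add2;
        System.out.println("same index: chain length " + findReduction(phi, 8).size());

        // acc = (x + phi) + y: the first add has swapped inputs, so neither
        // single-index walk returns to the phi and the chain is not found.
        Node phi2 = new Node("Phi");
        Node swapped = new Node("AddI", new Node("LoadI"), phi2);
        Node add3 = new Node("AddI", swapped, new Node("LoadI"));
        phi2.in[2] = add3;
        System.out.println("swapped inputs: chain length " + findReduction(phi2, 8).size());
    }
}
```

The second case corresponds to the missed-vectorization outcome mentioned above: the result is merely conservative, never wrong.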
>> >> A complication results from edge swapping in the nodes cloned by loop unrolling (see [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/addnode.cpp#L123) and [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/mulnode.cpp#L113)), which can lead to reduction chains connected via different input indices. This is addressed by tracking whether nodes have swapped edges and adjusting the explored input indices in the reduction analysis accordingly. An alternative (proposed by @eme64) is to replace this changeset's linear chain finding approach with a more general shortest path-finding algorithm. This alternative might preclude the need for tracking edge swapping at a potentially higher computational cost. Since the trade-off is not obvious, I propose to investigate it in a follow-up RFE. >> >> The changeset implements a more relaxed version of the reduction analysis for x64 matching, suitable for queries on single nodes. This analysis is run only in the presence of `[Min|Max][F|D]` nodes. >> >> ## Testing >> >> ### Functionality >> >> - tier1-5 (linux-x64, linux-aarch64, windows-x64, macosx-x64, and macosx-aarch64). >> - fuzzing (12 h. on linux-x64 and linux-aarch64). >> >> ##### TestGeneralizedReductions.java >> >> Tests the new scenarios in which vectorization occurs. These tests are restricted to 64-bits platforms, since I do not have access to 32-bits ones. `testReductionOnPartiallyUnrolledLoop` has been observed to fail on [linux-x86](https://github.com/robcasloz/jdk/actions/runs/4478959520/jobs/7873827856#logs) due to missing vectorization. If anyone wants to have a look and derive the necessary IR test framework preconditions for the test to pass on linux-x86, I am happy to lift the 64-bits restriction. >> >> ##### TestFpMinMaxReductions.java >> >> Tests the matching of floating-point max/min implementations in x64. 
>> >> ##### TestSuperwordFailsUnrolling.java >> >> This test file is updated to ensure auto-vectorization is never triggered, because this changeset would otherwise enable it and defeat the purpose of the test. >> >> ### Performance >> >> #### General Benchmarks >> >> The changeset does not cause any performance regression on the DaCapo, SPECjvm 2008, and SPECjbb2015 benchmark suites for linux-x64 and linux-aarch64. >> >> #### Micro-benchmarks >> >> The changeset extends two existing files with additional micro-benchmarks that show the benefit of the generalized reduction analysis ([full results](https://github.com/openjdk/jdk/files/11039207/microbenchmark-results.ods)). >> >> >> ##### VectorReduction.java >> >> These micro-benchmarks are first adjusted to actually vectorize in the mainline approach, since they suffered from the global-accumulator limitation. Two micro-benchmarks are added to exercise vectorization in the presence of global accumulators and partially unrolled loops. Running `VectorReduction.java` on an x64 (Cascade Lake) machine confirms the expectations: compared to mainline (with the adjustment mentioned above), this changeset yields similar performance results except for `andRedIOnGlobalAccumulator` and `andRedIPartiallyUnrolled`, where the changeset improves performance by 2.4x in both cases. >> >> ##### MaxIntrinsics.java >> >> This file is extended with four new micro-benchmarks. 
Running it on the same machine as above shows that the changeset does not affect the performance of the existing micro-benchmarks, and moderately to substantially improves the performance of the new ones (because it allows the x64 matcher to select a floating-point `Math.min()` implementation that is specialized for reduction min operations):
>>
>> | micro-benchmark | speedup compared to mainline |
>> | --- | --- |
>> | `fMinReduceInOuterLoop` | 1.1x |
>> | `fMinReduceNonCounted` | 2.3x |
>> | `fMinReduceGlobalAccumulator` | 2.4x |
>> | `fMinReducePartiallyUnrolled` | 3.9x |
>>
>> ## Acknowledgments
>>
>> Thanks to @danielogh for making it possible to test this improvement with confidence ([JDK-8294715](https://bugs.openjdk.org/browse/JDK-8294715)) and to @TobiHartmann, @chhagedorn, @vnkozlov and @eme64 for discussions and useful feedback.
>
> Roberto Castañeda Lozano has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 28 additional commits since the last revision:
>
>  - Merge master
>  - Relax the reduction cycle search bound
>  - Remove redundant IR check precondition
>  - Use SuperWord members in reduction marking
>  - Remove redundant opcode checks
>  - Do not run test in x86-32
>  - Update existing test instead of removing it
>  - Add negative vectorization test
>  - Update copyright headers
>  - Add two more reduction vectorization microbenchmarks
>  - ... and 18 more: https://git.openjdk.org/jdk/compare/a8c9a58e...95f6cc33

src/hotspot/share/opto/superword.cpp line 504:

> 502:       // to the phi node following edge index 'input'.
> 503:       PathEnd path =
> 504:         find_in_path(

Hi @robcasloz,
find_in_path expects reduction nodes to be present at the same edge indices in the reduction chain; it also honors the has_swapped_edge flag during backward traversal.
However, there are still some ideal transforms, like the following, which may break the reduction chain, and this will prevent Min/Max reductions for the test case mentioned in JDK-8302673.
https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/addnode.cpp#L1147
https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/addnode.cpp#L1230

src/hotspot/share/opto/superword.cpp line 539:

> 537:           pred = current;
> 538:           current = original_input(current, reduction_input);
> 539:         }

If we bookkeep the nodes in the reduction chain path during the initial backward traversal, we could simplify this check and also avoid another call to _original_input_ while populating the _loop_reductions set on [#L547](https://github.com/openjdk/jdk/pull/13120/files#diff-8f29dd005a0f949d108687dabb7379c73dfd85cd782da453509dc9b6cb8c9f81R547)

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/13120#discussion_r1155239685
PR Review Comment: https://git.openjdk.org/jdk/pull/13120#discussion_r1155240919

From jbhateja at openjdk.org Sun Apr 2 05:56:35 2023
From: jbhateja at openjdk.org (Jatin Bhateja)
Date: Sun, 2 Apr 2023 05:56:35 GMT
Subject: RFR: 8287087: C2: perform SLP reduction analysis on-demand [v2]
In-Reply-To:
References:
Message-ID:

On Sun, 2 Apr 2023 04:57:34 GMT, Jatin Bhateja wrote:

>> Roberto Castañeda Lozano has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 28 additional commits since the last revision:
>>
>>  - Merge master
>>  - Relax the reduction cycle search bound
>>  - Remove redundant IR check precondition
>>  - Use SuperWord members in reduction marking
>>  - Remove redundant opcode checks
>>  - Do not run test in x86-32
>>  - Update existing test instead of removing it
>>  - Add negative vectorization test
>>  - Update copyright headers
>>  - Add two more reduction vectorization microbenchmarks
>>  - ...
and 18 more: https://git.openjdk.org/jdk/compare/a8c9a58e...95f6cc33 > > src/hotspot/share/opto/superword.cpp line 504: > >> 502: // to the phi node following edge index 'input'. >> 503: PathEnd path = >> 504: find_in_path( > > Hi @robcasloz, > find_in_path expects reduction nodes to be present at same edge indices in the reduction chain, it also honors has_swapped_edge flag during backward traversal. > However, there are still some ideal transforms like following which may break the reduction chain and this will prevent Min/Max reductions for test case mentioned in JDK-8302673. > https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/addnode.cpp#L1147 > https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/addnode.cpp#L1230 One way to add fault-tolerance to find_in_path could be to follow strict DFS semantics where an alternate path is taken if node's predicates are not satisfied, currently we are starting all over again from the first node of chain with a different reduction_input which prevents inferring reduction chain even though all the nodes in the chain are commutative isomorphic operations. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13120#discussion_r1155244291 From xgong at openjdk.org Mon Apr 3 01:57:25 2023 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 3 Apr 2023 01:57:25 GMT Subject: RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v6] In-Reply-To: References: Message-ID: On Fri, 31 Mar 2023 12:25:16 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch reimplements `VectorShuffle` implementations to be a vector of the bit type. Currently, VectorShuffle is stored as a byte array, and would be expanded upon usage. This poses several drawbacks: >> >> 1. Inefficient conversions between a shuffle and its corresponding vector. This hinders the performance when the shuffle indices are not constant and are loaded or computed dynamically. >> 2. Redundant expansions in `rearrange` operations. 
On all platforms, it seems that a shuffle index vector is always expanded to the correct type before executing the `rearrange` operations.
>> 3. Some redundant intrinsics are needed to support this handling as well as special considerations in the C2 compiler.
>> 4. Range checks are performed using `VectorShuffle::toVector`, which is inefficient for FP types since both FP conversions and FP comparisons are more expensive than the integral ones.
>>
>> Upon these changes, a `rearrange` can emit more efficient code:
>>
>>     var species = IntVector.SPECIES_128;
>>     var v1 = IntVector.fromArray(species, SRC1, 0);
>>     var v2 = IntVector.fromArray(species, SRC2, 0);
>>     v1.rearrange(v2.toShuffle()).intoArray(DST, 0);
>>
>> Before:
>>     movabs $0x751589fa8,%r10   ;   {oop([I{0x0000000751589fa8})}
>>     vmovdqu 0x10(%r10),%xmm2
>>     movabs $0x7515a0d08,%r10   ;   {oop([I{0x00000007515a0d08})}
>>     vmovdqu 0x10(%r10),%xmm1
>>     movabs $0x75158afb8,%r10   ;   {oop([I{0x000000075158afb8})}
>>     vmovdqu 0x10(%r10),%xmm0
>>     vpand  -0xddc12(%rip),%xmm0,%xmm0        # Stub::vector_int_to_byte_mask
>>                                              ;   {external_word}
>>     vpackusdw %xmm0,%xmm0,%xmm0
>>     vpackuswb %xmm0,%xmm0,%xmm0
>>     vpmovsxbd %xmm0,%xmm3
>>     vpcmpgtd %xmm3,%xmm1,%xmm3
>>     vtestps %xmm3,%xmm3
>>     jne    0x00007fc2acb4e0d8
>>     vpmovzxbd %xmm0,%xmm0
>>     vpermd %ymm2,%ymm0,%ymm0
>>     movabs $0x751588f98,%r10   ;   {oop([I{0x0000000751588f98})}
>>     vmovdqu %xmm0,0x10(%r10)
>>
>> After:
>>     movabs $0x751589c78,%r10   ;   {oop([I{0x0000000751589c78})}
>>     vmovdqu 0x10(%r10),%xmm1
>>     movabs $0x75158ac88,%r10   ;   {oop([I{0x000000075158ac88})}
>>     vmovdqu 0x10(%r10),%xmm2
>>     vpxor  %xmm0,%xmm0,%xmm0
>>     vpcmpgtd %xmm2,%xmm0,%xmm3
>>     vtestps %xmm3,%xmm3
>>     jne    0x00007fa818b27cb1
>>     vpermd %ymm1,%ymm2,%ymm0
>>     movabs $0x751588c68,%r10   ;   {oop([I{0x0000000751588c68})}
>>     vmovdqu %xmm0,0x10(%r10)
>>
>> Please take a look and leave reviews. Thanks a lot.
>
> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision:
>
>   small cosmetics

Looks good to me! Thanks!

-------------

Marked as reviewed by xgong (Committer).

PR Review: https://git.openjdk.org/jdk/pull/13093#pullrequestreview-1368199580

From thartmann at openjdk.org Mon Apr 3 05:27:21 2023
From: thartmann at openjdk.org (Tobias Hartmann)
Date: Mon, 3 Apr 2023 05:27:21 GMT
Subject: RFR: 8305203: Simplify trimming operation in Region::Ideal
In-Reply-To:
References:
Message-ID:

On Thu, 30 Mar 2023 05:26:08 GMT, Xin Liu wrote:

> This patch improves how Region::Ideal trims unreachable paths.
>
> 1. Don't restart from the beginning. Trimming doesn't change the DU-chain.
> 2. Replace DFIterator with DFIterator_Fast. The latter is a raw pointer in release builds.
> 3. Don't call add_users_to_worklist(this) repeatedly.
> 4. Reduce its strength from add_users_to_worklist to add_users_to_worklist0 because RegionNode has no special logic.
>
> This patch also includes a cosmetic change: rename n to 'use' inside the loop. Otherwise, we would shadow Node* n = in(i). Nothing wrong, but harder to read.

Looks good to me.

src/hotspot/share/opto/cfgnode.cpp line 575:

> 573:         Node* use = fast_out(j);
> 574:
> 575:         if(use->req() != req() && use->is_Phi()) {

Suggestion:

        if (use->req() != req() && use->is_Phi()) {

src/hotspot/share/opto/cfgnode.cpp line 576:

> 574:
> 575:         if(use->req() != req() && use->is_Phi()) {
> 576:           assert(use->in(0) == this, "");

Suggestion:

          assert(use->in(0) == this, "unexpected control input");

-------------

Marked as reviewed by thartmann (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/13238#pullrequestreview-1368304294 PR Review Comment: https://git.openjdk.org/jdk/pull/13238#discussion_r1155492344 PR Review Comment: https://git.openjdk.org/jdk/pull/13238#discussion_r1155492682 From thartmann at openjdk.org Mon Apr 3 05:30:23 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 3 Apr 2023 05:30:23 GMT Subject: RFR: 8303278: Imprecise bottom type of ExtractB/UB [v2] In-Reply-To: References: Message-ID: On Wed, 29 Mar 2023 01:22:10 GMT, Eric Liu wrote: >> This is a trivial patch, which fixes the bottom type of ExtractB/UB nodes. >> >> ExtractNode can be generated by Vector API Vector.lane(int), which gets the lane element at the given index. A more precise type of range can help to optimize out unnecessary type conversion in some cases. >> >> Below shows a typical case used ExtractBNode >> >> >> public static byte byteLt16() { >> ByteVector vecb = ByteVector.broadcast(ByteVector.SPECIES_128, 1); >> return vecb.lane(1); >> } >> >> >> In this case, c2 constructs IR graph like: >> >> ExtractB ConI(24) >> | __| >> | / | >> LShiftI __| >> | / >> RShiftI >> >> which generates AArch64 code: >> >> movi v16.16b, #0x1 >> smov x11, v16.b[1] >> sxtb w0, w11 >> >> with this patch, this shift pair can be optimized out by RShiftI's identity [1]. The code is optimized to: >> >> movi v16.16b, #0x1 >> smov x0, v16.b[1] >> >> [TEST] >> >> Full jtreg passed except 4 files on x86: >> >> jdk/incubator/vector/Byte128VectorTests.java >> jdk/incubator/vector/Byte256VectorTests.java >> jdk/incubator/vector/Byte512VectorTests.java >> jdk/incubator/vector/Byte64VectorTests.java >> >> They are caused by a known issue on x86 [2]. >> >> [1] https://github.com/openjdk/jdk/blob/742bc041eaba1ff9beb7f5b6d896e4f382b030ea/src/hotspot/share/opto/mulnode.cpp#L1052 >> [2] https://bugs.openjdk.org/browse/JDK-8303508 > > Eric Liu has updated the pull request with a new target base due to a merge or a rebase. 
The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: > > - Merge jdk:master > > Change-Id: I40cce803da09bae31cd74b86bf93607a08219545 > - 8303278: Imprecise bottom type of ExtractB/UB > > This is a trivial patch, which fixes the bottom type of ExtractB/UB > nodes. > > ExtractNode can be generated by Vector API Vector.lane(int), which gets > the lane element at the given index. A more precise type of range can > help to optimize out unnecessary type conversion in some cases. > > Below shows a typical case used ExtractBNode > > ``` > public static byte byteLt16() { > ByteVector vecb = ByteVector.broadcast(ByteVector.SPECIES_128, 1); > return vecb.lane(1); > } > > ``` > In this case, c2 constructs IR graph like: > > ExtractB ConI(24) > | __| > | / | > LShiftI __| > | / > RShiftI > > which generates AArch64 code: > > movi v16.16b, #0x1 > smov x11, v16.b[1] > sxtb w0, w11 > > with this patch, this shift pair can be optimized out by RShiftI's > identity [1]. The code is optimized to: > > movi v16.16b, #0x1 > smov x0, v16.b[1] > > [TEST] > > Full jtreg passed except 4 files on x86: > > jdk/incubator/vector/Byte128VectorTests.java > jdk/incubator/vector/Byte256VectorTests.java > jdk/incubator/vector/Byte512VectorTests.java > jdk/incubator/vector/Byte64VectorTests.java > > They are caused by a known issue on x86 [2]. > > [1] https://github.com/openjdk/jdk/blob/742bc041eaba1ff9beb7f5b6d896e4f382b030ea/src/hotspot/share/opto/mulnode.cpp#L1052 > [2] https://bugs.openjdk.org/browse/JDK-8303508 > > Change-Id: Ibea9aeacb41b4d1c5b2621c7a97494429394b599 Looks good. All tests passed. ------------- Marked as reviewed by thartmann (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/13070#pullrequestreview-1368306807 From thartmann at openjdk.org Mon Apr 3 05:51:22 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 3 Apr 2023 05:51:22 GMT Subject: RFR: 8173709: Fix VerifyLoopOptimizations - step 1 - minimal infrastructure [v6] In-Reply-To: References: Message-ID: On Thu, 30 Mar 2023 07:26:54 GMT, Emanuel Peter wrote: >> I am reviving `-XX:+VerifyLoopOptimizations` after many years of abandonment. There were many bugs filed, but so far it has not been addressed. >> >> The hope is that this work will allow us to catch ctrl / idom / loop body bugs quicker, and fix many of the existing ones along the way. >> >> **The Idea of VerifyLoopOptimizations** >> Before loop-opts, we build many data-structures for dominance (idom), control, and loop membership. Then, loop-opts use this data to transform the graph. At the same time, they must maintain the correctness of the data-structures, so that other optimizations can be made, without needing to re-compute the data-structures every time. >> `VerifyLoopOptimizations` was implemented to verify correctness of the data-structures. After some loop-opts, we re-compute a verification data-structure and compare it to the one we created before the loop-opts and maintained during loopopts. >> >> **My Approach** >> I soon realized that there were many reasons why `VerifyLoopOptimizations` was broken. It seemed infeasible to fix all of them at once. I decided to first remove any part that was failing, until I have a minimal set that is working. I will leave many parts commented out. In follow-up RFE's, I will then iteratively improve the verification by re-enabling some verification and fixing the corresponding bugs. >> >> **What I fixed** >> >> - `verify_compare` >> - Renamed it to `verify_idom_and_nodes`, since it does verification node-by-node (vs `verify_tree`, which verifies the loop-tree). 
>>   - Previously, it was implemented as a BFS with recursion, which led to stack overflow. I flattened the BFS into a loop.
>>   - The BFS calls `verify_idom` and `verify_nodes` on every node. I refactored `verify_nodes` a bit, so that it is more readable.
>>   - I now report all failures before asserting.
>> - `verify_tree`
>>   - I corrected the style and improved comments.
>>   - I removed the broken verification for `Opaque` nodes. I added some rudimentary verification for `CountedLoop`. I leave more of this work for follow-up RFEs.
>>   - I also converted the asserts to reporting failures, just like in `verify_idom_and_nodes`.
>>
>> **Disabled Verifications**
>> I commented out the following verifications:
>>
>>     (A) data nodes should have same ctrl
>>     (B) ctrl node should belong to same loop
>>     (C) ctrl node should have same idom
>>     (D) loop should have same tail
>>     (E) loop should have same body (list of nodes)
>>     (F) broken verification in PhaseIdealLoop::build_loop_late_post, because ctrl was set wrong
>>
>>
>> Note: verifying `idom`, `ctrl` and `_body` is the central goal of `VerifyLoopOptimizations`. But all of them are broken in many parts of the VM, as we have not verified them for many years.
>>
>> **Follow-Up Work**
>>
>> I filed a first follow-up RFE [JDK-8305073](https://bugs.openjdk.org/browse/JDK-8305073). The following tasks should be addressed in it, or in subsequent follow-up RFEs.
>>
>> I propose the following order:
>>
>> - idom (C): The dominance structure is at the base of everything else.
>> - ctrl / loop (A, B): Once dominance is fixed, we can ensure every node is assigned to the correct ctrl/loop.
>> - tail (D): ensure the tail of a loop is updated correctly.
>> - body (E): nodes are assigned to the `_body` of a loop, according to the node ctrl.
>> - other issues like (F)
>> - Add more verification to IdealLoopTree::verify_tree. For example zero-trip-guard, etc.
>> - Evaluate from where else we should call `PhaseIdealLoop::verify`.
Maybe we are missing some cases.
>>
>> **Testing**
>> I am running `tier1-tier6` and stress testing.
>> Preliminary results are all good.
>>
>> **Conclusion**
>> With this fix, I have the basic infrastructure of the verification working.
>> However, all of the substantial verifications are still disabled, because there are too many places in the VM that do not maintain the data-structures properly.
>> Follow-up RFEs will have to address these one by one.
>
> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:
>
>   fix typo

Looks good to me.

src/hotspot/share/opto/loopnode.cpp line 4450:

> 4448:     return;
> 4449:   }
> 4450:   DEBUG_ONLY( if(VerifyLoopOptimizations) { verify(); } );

Suggestion:

  DEBUG_ONLY( if (VerifyLoopOptimizations) { verify(); } );

src/hotspot/share/opto/loopnode.cpp line 4522:

> 4520:     visited.clear();
> 4521:     split_if_with_blocks( visited, nstack);
> 4522:     DEBUG_ONLY( if(VerifyLoopOptimizations) { verify(); } );

Suggestion:

    DEBUG_ONLY( if (VerifyLoopOptimizations) { verify(); } );

-------------

Marked as reviewed by thartmann (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/13207#pullrequestreview-1368314113
PR Review Comment: https://git.openjdk.org/jdk/pull/13207#discussion_r1155498928
PR Review Comment: https://git.openjdk.org/jdk/pull/13207#discussion_r1155499010

From thartmann at openjdk.org Mon Apr 3 05:51:23 2023
From: thartmann at openjdk.org (Tobias Hartmann)
Date: Mon, 3 Apr 2023 05:51:23 GMT
Subject: RFR: 8173709: Fix VerifyLoopOptimizations - step 1 - minimal infrastructure [v6]
In-Reply-To:
References:
Message-ID:

On Mon, 3 Apr 2023 05:37:50 GMT, Tobias Hartmann wrote:

>> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   fix typo
>
> src/hotspot/share/opto/loopnode.cpp line 4522:
>
>> 4520:   visited.clear();
>> 4521:   split_if_with_blocks( visited, nstack);
>> 4522:   DEBUG_ONLY( if(VerifyLoopOptimizations) { verify(); } );
>
> Suggestion:
>
>   DEBUG_ONLY( if (VerifyLoopOptimizations) { verify(); } );

There are more of these in the code.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/13207#discussion_r1155499750

From thartmann at openjdk.org Mon Apr 3 06:04:29 2023
From: thartmann at openjdk.org (Tobias Hartmann)
Date: Mon, 3 Apr 2023 06:04:29 GMT
Subject: RFR: 8304042: C2 SuperWord: schedule must remove packs with cyclic dependencies [v2]
In-Reply-To:
References: <9VgAQeNZfUZJXO8llozcZZuRftv6kk43jw0YIrBIdck=.b5c89436-608b-4ed1-816d-b3514374eaeb@github.com>
Message-ID:

On Wed, 29 Mar 2023 10:30:52 GMT, Emanuel Peter wrote:

>> I discovered this bug during the bug fix of [JDK-8298935](https://bugs.openjdk.org/browse/JDK-8298935) [PR](https://git.openjdk.org/jdk/pull/12350).
>>
>> Currently, the SuperWord algorithm only ensures that all `packs` are `isomorphic` and `independent` (additionally memops are `adjacent`).
>>
>> This is **not sufficient**. We need to ensure that the `packs` do not introduce `cycles` into the graph.
>> Example:
>>
>> https://github.com/openjdk/jdk/blob/ad580d18dbbf074c8a3692e2836839505b574326/test/hotspot/jtreg/compiler/loopopts/superword/TestIndependentPacksWithCyclicDependency.java#L217-L231
>>
>> This is also mentioned in the [SuperWord Paper](https://groups.csail.mit.edu/cag/slp/SLP-PLDI-2000.pdf) (2000, Samuel Larsen and Saman Amarasinghe, Exploiting Superword Level Parallelism with Multimedia Instruction Sets):
>>
>>
>> 3.7 Scheduling
>> Dependence analysis before packing ensures that statements within a group can be executed
>> safely in parallel. However, it may be the case that executing two groups produces a dependence
>> violation. An example of this is shown in Figure 6. Here, dependence edges are drawn between
>> groups if a statement in one group is dependent on a statement in the other. As long as there
>> are no cycles in this dependence graph, all groups can be scheduled such that no violations
>> occur. However, a cycle indicates that the set of chosen groups is invalid and at least one group
>> will need to be eliminated. Although experimental data has shown this case to be extremely rare,
>> care must be taken to ensure correctness.
>>
>>
>> **Solution**
>>
>> Just before scheduling, I introduced `SuperWord::remove_cycles`. It creates a `PacksetGraph`, based on nodes in the `packs`, and scalar-nodes which are not in a pack. The edges are taken from `DepPreds`. We check if the graph can be scheduled without cycles (via topological sort).
>>
>> **FYI**
>>
>> I found a further bug; this time I think it happens during scheduling. See [JDK-8304720](https://bugs.openjdk.org/browse/JDK-8304720). Because of that, I had to disable a test case (`TestIndependentPacksWithCyclicDependency::test5`). I also had to require 64 bit, and either `avx2` or `asimd`. I hope we can lift that again once we fix the other bug. The issue is this: the cyclic dependency examples can degenerate into non-cyclic ones that need to reorder the non-vectorized memory operations.
>
> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:
>
>   review feedback implemented

Nice analysis and test coverage! The fix looks good to me.

test/hotspot/jtreg/compiler/loopopts/superword/TestIndependentPacksWithCyclicDependency.java line 28:

> 26:  * @test
> 27:  * @bug 8304042
> 28:  * @summary Test some examples with indepenednet packs with cyclic dependency

Suggestion:

 * @summary Test some examples with independent packs with cyclic dependency

test/hotspot/jtreg/compiler/loopopts/superword/TestIndependentPacksWithCyclicDependency2.java line 28:

> 26:  * @test
> 27:  * @bug 8304042
> 28:  * @summary Test some examples with indepenednet packs with cyclic dependency

Suggestion:

 * @summary Test some examples with independent packs with cyclic dependency

-------------

Marked as reviewed by thartmann (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/13078#pullrequestreview-1368325513
PR Review Comment: https://git.openjdk.org/jdk/pull/13078#discussion_r1155506558
PR Review Comment: https://git.openjdk.org/jdk/pull/13078#discussion_r1155506717

From epeter at openjdk.org Mon Apr 3 06:13:22 2023
From: epeter at openjdk.org (Emanuel Peter)
Date: Mon, 3 Apr 2023 06:13:22 GMT
Subject: RFR: 8304042: C2 SuperWord: schedule must remove packs with cyclic dependencies [v3]
In-Reply-To: <9VgAQeNZfUZJXO8llozcZZuRftv6kk43jw0YIrBIdck=.b5c89436-608b-4ed1-816d-b3514374eaeb@github.com>
References: <9VgAQeNZfUZJXO8llozcZZuRftv6kk43jw0YIrBIdck=.b5c89436-608b-4ed1-816d-b3514374eaeb@github.com>
Message-ID:

> I discovered this bug during the bug fix of [JDK-8298935](https://bugs.openjdk.org/browse/JDK-8298935) [PR](https://git.openjdk.org/jdk/pull/12350).
>
> Currently, the SuperWord algorithm only ensures that all `packs` are `isomorphic` and `independent` (additionally memops are `adjacent`).
>
> This is **not sufficient**. We need to ensure that the `packs` do not introduce `cycles` into the graph.
> Example:
>
> https://github.com/openjdk/jdk/blob/ad580d18dbbf074c8a3692e2836839505b574326/test/hotspot/jtreg/compiler/loopopts/superword/TestIndependentPacksWithCyclicDependency.java#L217-L231
>
> This is also mentioned in the [SuperWord Paper](https://groups.csail.mit.edu/cag/slp/SLP-PLDI-2000.pdf) (2000, Samuel Larsen and Saman Amarasinghe, Exploiting Superword Level Parallelism with Multimedia Instruction Sets):
>
>
> 3.7 Scheduling
> Dependence analysis before packing ensures that statements within a group can be executed
> safely in parallel. However, it may be the case that executing two groups produces a dependence
> violation. An example of this is shown in Figure 6. Here, dependence edges are drawn between
> groups if a statement in one group is dependent on a statement in the other. As long as there
> are no cycles in this dependence graph, all groups can be scheduled such that no violations
> occur. However, a cycle indicates that the set of chosen groups is invalid and at least one group
> will need to be eliminated. Although experimental data has shown this case to be extremely rare,
> care must be taken to ensure correctness.
>
>
> **Solution**
>
> Just before scheduling, I introduced `SuperWord::remove_cycles`. It creates a `PacksetGraph`, based on nodes in the `packs`, and scalar-nodes which are not in a pack. The edges are taken from `DepPreds`. We check if the graph can be scheduled without cycles (via topological sort).
>
> **FYI**
>
> I found a further bug; this time I think it happens during scheduling. See [JDK-8304720](https://bugs.openjdk.org/browse/JDK-8304720). Because of that, I had to disable a test case (`TestIndependentPacksWithCyclicDependency::test5`). I also had to require 64 bit, and either `avx2` or `asimd`. I hope we can lift that again once we fix the other bug. The issue is this: the cyclic dependency examples can degenerate into non-cyclic ones that need to reorder the non-vectorized memory operations.
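
The cycle check described under **Solution** can be illustrated with a small standalone sketch. This is not the `PacksetGraph` code from the PR: `DependencyGraph` and `can_schedule_without_cycles` are hypothetical names, and the sketch assumes one vertex per pack (or per scalar node outside any pack) and one edge per dependency. It uses Kahn's algorithm, which is one common way to do a topological sort; that the PR uses this exact variant is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Hypothetical stand-in for the graph described above: one vertex per pack or
// per scalar node outside any pack, one edge per dependency between vertices.
struct DependencyGraph {
  std::vector<std::vector<std::size_t>> successors;

  explicit DependencyGraph(std::size_t vertex_count) : successors(vertex_count) {}

  // Records that vertex `to` depends on vertex `from`.
  void add_edge(std::size_t from, std::size_t to) { successors[from].push_back(to); }
};

// Kahn's algorithm: repeatedly schedule vertices whose predecessors have all
// been scheduled. The graph is acyclic exactly when every vertex is scheduled.
bool can_schedule_without_cycles(const DependencyGraph& graph) {
  const std::size_t n = graph.successors.size();

  // Count the not-yet-scheduled predecessors of every vertex.
  std::vector<std::size_t> pending_preds(n, 0);
  for (const auto& succs : graph.successors) {
    for (std::size_t s : succs) {
      ++pending_preds[s];
    }
  }

  // Vertices without predecessors can be scheduled immediately.
  std::queue<std::size_t> ready;
  for (std::size_t v = 0; v < n; ++v) {
    if (pending_preds[v] == 0) {
      ready.push(v);
    }
  }

  std::size_t scheduled = 0;
  while (!ready.empty()) {
    const std::size_t v = ready.front();
    ready.pop();
    ++scheduled;
    for (std::size_t s : graph.successors[v]) {
      if (--pending_preds[s] == 0) {
        ready.push(s);
      }
    }
  }

  // Vertices on a cycle never reach zero pending predecessors.
  return scheduled == n;
}
```

In this toy model, two packs that are each internally independent but depend on each other form a two-vertex cycle, so the function returns `false` and at least one of the packs would have to be dropped.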

Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:

  Typo fix by Tobias

  Co-authored-by: Tobias Hartmann

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/13078/files
  - new: https://git.openjdk.org/jdk/pull/13078/files/adc297e4..8e7814c4

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=13078&range=02
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13078&range=01-02

  Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod
  Patch: https://git.openjdk.org/jdk/pull/13078.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/13078/head:pull/13078

PR: https://git.openjdk.org/jdk/pull/13078

From epeter at openjdk.org Mon Apr 3 06:21:15 2023
From: epeter at openjdk.org (Emanuel Peter)
Date: Mon, 3 Apr 2023 06:21:15 GMT
Subject: RFR: 8173709: Fix VerifyLoopOptimizations - step 1 - minimal infrastructure [v7]
In-Reply-To:
References:
Message-ID:

> I am reviving `-XX:+VerifyLoopOptimizations` after many years of abandonment. There were many bugs filed, but so far they have not been addressed.
>
> The hope is that this work will allow us to catch ctrl / idom / loop body bugs more quickly, and fix many of the existing ones along the way.
>
> **The Idea of VerifyLoopOptimizations**
> Before loop-opts, we build many data-structures for dominance (idom), control, and loop membership. Then, loop-opts use this data to transform the graph. At the same time, they must maintain the correctness of the data-structures, so that other optimizations can be made, without needing to re-compute the data-structures every time.
> `VerifyLoopOptimizations` was implemented to verify correctness of the data-structures. After some loop-opts, we re-compute a verification data-structure and compare it to the one we created before the loop-opts and maintained during loop-opts.
>
> **My Approach**
> I soon realized that there were many reasons why `VerifyLoopOptimizations` was broken. It seemed infeasible to fix all of them at once.
> I decided to first remove any part that was failing, until I had a minimal set that works. I will leave many parts commented out. In follow-up RFEs, I will then iteratively improve the verification by re-enabling some verification and fixing the corresponding bugs.
>
> **What I fixed**
>
> - `verify_compare`
>   - Renamed it to `verify_idom_and_nodes`, since it does verification node-by-node (vs `verify_tree`, which verifies the loop-tree).
>   - Previously, it was implemented as a BFS with recursion, which led to a stack overflow. I flattened the BFS into a loop.
>   - The BFS calls `verify_idom` and `verify_nodes` on every node. I refactored `verify_nodes` a bit, so that it is more readable.
>   - I now report all failures, before asserting.
> - `verify_tree`
>   - I corrected the style and improved comments.
>   - I removed the broken verification for `Opaque` nodes. I added some rudimentary verification for `CountedLoop`. I leave more of this work for follow-up RFEs.
>   - I also converted the asserts to reporting failures, just like in `verify_idom_and_nodes`.
>
> **Disabled Verifications**
> I commented out the following verifications:
>
> (A) data nodes should have same ctrl
> (B) ctrl node should belong to same loop
> (C) ctrl node should have same idom
> (D) loop should have same tail
> (E) loop should have same body (list of nodes)
> (F) broken verification in PhaseIdealLoop::build_loop_late_post, because ctrl was set wrong
>
>
> Note: verifying `idom`, `ctrl` and `_body` is the central goal of `VerifyLoopOptimizations`. But all of them are broken in many parts of the VM, as we have not verified them for many years.
>
> **Follow-Up Work**
>
> I filed a first follow-up RFE [JDK-8305073](https://bugs.openjdk.org/browse/JDK-8305073). The following tasks should be addressed in it, or in subsequent follow-up RFEs.
>
> I propose the following order:
>
> - idom (C): The dominance structure is at the base of everything else.
> - ctrl / loop (A, B): Once dominance is fixed, we can ensure every node is assigned to the correct ctrl/loop.
> - tail (D): ensure the tail of a loop is updated correctly
> - body (E): nodes are assigned to the `_body` of a loop, according to the node ctrl.
> - other issues like (F)
> - Add more verification to IdealLoopTree::verify_tree. For example zero-trip-guard, etc.
> - Evaluate from where else we should call `PhaseIdealLoop::verify`. Maybe we are missing some cases.
>
> **Testing**
> I am running `tier1-tier6` and stress testing.
> Preliminary results are all good.
>
> **Conclusion**
> With this fix, I have the basic infrastructure of the verification working.
> However, all of the substantial verifications are still disabled, because there are too many places in the VM that do not maintain the data-structures properly.
> Follow-up RFEs will have to address these one-by-one.

Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:

  Style fix by Tobias

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/13207/files
  - new: https://git.openjdk.org/jdk/pull/13207/files/d01296d8..39fb94da

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=13207&range=06
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13207&range=05-06

  Stats: 5 lines in 3 files changed: 0 ins; 0 del; 5 mod
  Patch: https://git.openjdk.org/jdk/pull/13207.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/13207/head:pull/13207

PR: https://git.openjdk.org/jdk/pull/13207

From chagedorn at openjdk.org Mon Apr 3 07:44:28 2023
From: chagedorn at openjdk.org (Christian Hagedorn)
Date: Mon, 3 Apr 2023 07:44:28 GMT
Subject: RFR: 8173709: Fix VerifyLoopOptimizations - step 1 - minimal infrastructure [v7]
In-Reply-To:
References:
Message-ID:

On Mon, 3 Apr 2023 06:21:15 GMT, Emanuel Peter wrote:

>> I am reviving `-XX:+VerifyLoopOptimizations` after many years of abandonment.
>> There were many bugs filed, but so far they have not been addressed.
>>
>> The hope is that this work will allow us to catch ctrl / idom / loop body bugs more quickly, and fix many of the existing ones along the way.
>>
>> **The Idea of VerifyLoopOptimizations**
>> Before loop-opts, we build many data-structures for dominance (idom), control, and loop membership. Then, loop-opts use this data to transform the graph. At the same time, they must maintain the correctness of the data-structures, so that other optimizations can be made, without needing to re-compute the data-structures every time.
>> `VerifyLoopOptimizations` was implemented to verify correctness of the data-structures. After some loop-opts, we re-compute a verification data-structure and compare it to the one we created before the loop-opts and maintained during loop-opts.
>>
>> **My Approach**
>> I soon realized that there were many reasons why `VerifyLoopOptimizations` was broken. It seemed infeasible to fix all of them at once. I decided to first remove any part that was failing, until I had a minimal set that works. I will leave many parts commented out. In follow-up RFEs, I will then iteratively improve the verification by re-enabling some verification and fixing the corresponding bugs.
>>
>> **What I fixed**
>>
>> - `verify_compare`
>>   - Renamed it to `verify_idom_and_nodes`, since it does verification node-by-node (vs `verify_tree`, which verifies the loop-tree).
>>   - Previously, it was implemented as a BFS with recursion, which led to a stack overflow. I flattened the BFS into a loop.
>>   - The BFS calls `verify_idom` and `verify_nodes` on every node. I refactored `verify_nodes` a bit, so that it is more readable.
>>   - I now report all failures, before asserting.
>> - `verify_tree`
>>   - I corrected the style and improved comments.
>>   - I removed the broken verification for `Opaque` nodes. I added some rudimentary verification for `CountedLoop`. I leave more of this work for follow-up RFEs.
>>   - I also converted the asserts to reporting failures, just like in `verify_idom_and_nodes`.
>>
>> **Disabled Verifications**
>> I commented out the following verifications:
>>
>> (A) data nodes should have same ctrl
>> (B) ctrl node should belong to same loop
>> (C) ctrl node should have same idom
>> (D) loop should have same tail
>> (E) loop should have same body (list of nodes)
>> (F) broken verification in PhaseIdealLoop::build_loop_late_post, because ctrl was set wrong
>>
>>
>> Note: verifying `idom`, `ctrl` and `_body` is the central goal of `VerifyLoopOptimizations`. But all of them are broken in many parts of the VM, as we have not verified them for many years.
>>
>> **Follow-Up Work**
>>
>> I filed a first follow-up RFE [JDK-8305073](https://bugs.openjdk.org/browse/JDK-8305073). The following tasks should be addressed in it, or in subsequent follow-up RFEs.
>>
>> I propose the following order:
>>
>> - idom (C): The dominance structure is at the base of everything else.
>> - ctrl / loop (A, B): Once dominance is fixed, we can ensure every node is assigned to the correct ctrl/loop.
>> - tail (D): ensure the tail of a loop is updated correctly
>> - body (E): nodes are assigned to the `_body` of a loop, according to the node ctrl.
>> - other issues like (F)
>> - Add more verification to IdealLoopTree::verify_tree. For example zero-trip-guard, etc.
>> - Evaluate from where else we should call `PhaseIdealLoop::verify`. Maybe we are missing some cases.
>>
>> **Testing**
>> I am running `tier1-tier6` and stress testing.
>> Preliminary results are all good.
>>
>> **Conclusion**
>> With this fix, I have the basic infrastructure of the verification working.
>> However, all of the substantial verifications are still disabled, because there are too many places in the VM that do not maintain the data-structures properly.
>> Follow-up RFEs will have to address these one-by-one.
>
> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:
>
>   Style fix by Tobias

Thanks for doing the updates! Looks much cleaner now. I have some more comments.

src/hotspot/share/opto/loopnode.cpp line 4823:

> 4821:   while (child != nullptr) {
> 4822:     assert(child->_parent == this, "all must be children of this");
> 4823:     children.push(child);

You could directly use:

    children.insert_sorted(child);

instead of `push` + `sort()` afterwards. This would simplify `compare_tree()` to:

    int compare_tree(IdealLoopTree* const& a, IdealLoopTree* const& b) {
      assert(a != nullptr && b != nullptr, "must be");
      return a->_head->_idx - b->_head->_idx;
    }

src/hotspot/share/opto/loopnode.cpp line 4873:

> 4871:     // Process the two children, or potentially log the failure if we only found one.
> 4872:     if (child_verify == nullptr) {
> 4873:       if (child_verify->_irreducible && Compile::current()->major_progress()) {

Copy-paste error:

Suggestion:

      if (child->_irreducible && Compile::current()->major_progress()) {

src/hotspot/share/opto/loopnode.cpp line 4876:

> 4874:         // Irreducible loops can pick a different header (one of its entries).
> 4875:       } else {
> 4876:         tty->print_cr("We have loop that verify does not have");

Suggestion:

        tty->print_cr("We have a loop that verify does not have");

src/hotspot/share/opto/loopnode.cpp line 4890:

> 4888:         // mean that we lost it, which is not ok.
> 4889:       } else {
> 4890:         tty->print_cr("Verify has loop that we do not have");

Suggestion:

        tty->print_cr("Verify has a loop that we do not have");

src/hotspot/share/opto/loopnode.cpp line 4922:

> 4920:     assert(ctrl != nullptr && ctrl->is_CFG(), "sane loop in-ctrl");
> 4921:     assert(back != nullptr && back->is_CFG(), "sane loop backedge");
> 4922:     Node* loopexit = cl->loopexit(); // assert implied

Local variable assignment can be removed.
Suggestion:

    cl->loopexit(); // assert implied

src/hotspot/share/opto/loopnode.hpp line 797:

> 795:
> 796: #ifdef ASSERT
> 797:   void collect_children(GrowableArray &children) const;

Suggestion:

  void collect_children(GrowableArray& children) const;

-------------

PR Review: https://git.openjdk.org/jdk/pull/13207#pullrequestreview-1368343134
PR Review Comment: https://git.openjdk.org/jdk/pull/13207#discussion_r1155557413
PR Review Comment: https://git.openjdk.org/jdk/pull/13207#discussion_r1155569153
PR Review Comment: https://git.openjdk.org/jdk/pull/13207#discussion_r1155569752
PR Review Comment: https://git.openjdk.org/jdk/pull/13207#discussion_r1155572492
PR Review Comment: https://git.openjdk.org/jdk/pull/13207#discussion_r1155573608
PR Review Comment: https://git.openjdk.org/jdk/pull/13207#discussion_r1155574891

From chagedorn at openjdk.org Mon Apr 3 07:44:33 2023
From: chagedorn at openjdk.org (Christian Hagedorn)
Date: Mon, 3 Apr 2023 07:44:33 GMT
Subject: RFR: 8173709: Fix VerifyLoopOptimizations - step 1 - minimal infrastructure [v6]
In-Reply-To:
References:
Message-ID:

On Thu, 30 Mar 2023 07:26:54 GMT, Emanuel Peter wrote:

>> I am reviving `-XX:+VerifyLoopOptimizations` after many years of abandonment. There were many bugs filed, but so far they have not been addressed.
>>
>> The hope is that this work will allow us to catch ctrl / idom / loop body bugs more quickly, and fix many of the existing ones along the way.
>>
>> **The Idea of VerifyLoopOptimizations**
>> Before loop-opts, we build many data-structures for dominance (idom), control, and loop membership. Then, loop-opts use this data to transform the graph. At the same time, they must maintain the correctness of the data-structures, so that other optimizations can be made, without needing to re-compute the data-structures every time.
>> `VerifyLoopOptimizations` was implemented to verify correctness of the data-structures.
>> After some loop-opts, we re-compute a verification data-structure and compare it to the one we created before the loop-opts and maintained during loop-opts.
>>
>> **My Approach**
>> I soon realized that there were many reasons why `VerifyLoopOptimizations` was broken. It seemed infeasible to fix all of them at once. I decided to first remove any part that was failing, until I had a minimal set that works. I will leave many parts commented out. In follow-up RFEs, I will then iteratively improve the verification by re-enabling some verification and fixing the corresponding bugs.
>>
>> **What I fixed**
>>
>> - `verify_compare`
>>   - Renamed it to `verify_idom_and_nodes`, since it does verification node-by-node (vs `verify_tree`, which verifies the loop-tree).
>>   - Previously, it was implemented as a BFS with recursion, which led to a stack overflow. I flattened the BFS into a loop.
>>   - The BFS calls `verify_idom` and `verify_nodes` on every node. I refactored `verify_nodes` a bit, so that it is more readable.
>>   - I now report all failures, before asserting.
>> - `verify_tree`
>>   - I corrected the style and improved comments.
>>   - I removed the broken verification for `Opaque` nodes. I added some rudimentary verification for `CountedLoop`. I leave more of this work for follow-up RFEs.
>>   - I also converted the asserts to reporting failures, just like in `verify_idom_and_nodes`.
>>
>> **Disabled Verifications**
>> I commented out the following verifications:
>>
>> (A) data nodes should have same ctrl
>> (B) ctrl node should belong to same loop
>> (C) ctrl node should have same idom
>> (D) loop should have same tail
>> (E) loop should have same body (list of nodes)
>> (F) broken verification in PhaseIdealLoop::build_loop_late_post, because ctrl was set wrong
>>
>>
>> Note: verifying `idom`, `ctrl` and `_body` is the central goal of `VerifyLoopOptimizations`. But all of them are broken in many parts of the VM, as we have not verified them for many years.
>>
>> **Follow-Up Work**
>>
>> I filed a first follow-up RFE [JDK-8305073](https://bugs.openjdk.org/browse/JDK-8305073). The following tasks should be addressed in it, or in subsequent follow-up RFEs.
>>
>> I propose the following order:
>>
>> - idom (C): The dominance structure is at the base of everything else.
>> - ctrl / loop (A, B): Once dominance is fixed, we can ensure every node is assigned to the correct ctrl/loop.
>> - tail (D): ensure the tail of a loop is updated correctly
>> - body (E): nodes are assigned to the `_body` of a loop, according to the node ctrl.
>> - other issues like (F)
>> - Add more verification to IdealLoopTree::verify_tree. For example zero-trip-guard, etc.
>> - Evaluate from where else we should call `PhaseIdealLoop::verify`. Maybe we are missing some cases.
>>
>> **Testing**
>> I am running `tier1-tier6` and stress testing.
>> Preliminary results are all good.
>>
>> **Conclusion**
>> With this fix, I have the basic infrastructure of the verification working.
>> However, all of the substantial verifications are still disabled, because there are too many places in the VM that do not maintain the data-structures properly.
>> Follow-up RFEs will have to address these one-by-one.
>
> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:
>
>   fix typo

src/hotspot/share/opto/loopnode.cpp line 4669:

> 4667: }
> 4668:
> 4669: //------------------------------verify_idom_and_nodes-----------------------------

You can directly remove these `//----...` comments as currently suggested in [JDK-8304034](https://bugs.openjdk.org/browse/JDK-8304034).

src/hotspot/share/opto/loopnode.cpp line 4696:

> 4694:   // Verify IDOM for all CFG nodes (except root).
> 4695:   if (!n->is_CFG() || n->is_Root()) {
> 4696:     return false; // pass

I suggest using `true` to indicate success.
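
To make the convention in the comment above concrete, here is a minimal standalone sketch, not the HotSpot code: each check returns `true` on success, and the caller keeps going after a failure so that every problem is reported before a single final assert. `ToyCheck`, `verify_one` and `verify_all` are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical check input: a value we maintained and the freshly recomputed one.
struct ToyCheck {
  int maintained;
  int recomputed;
};

// Returns true on success. On a mismatch, reports it and returns false.
bool verify_one(const ToyCheck& check) {
  if (check.maintained == check.recomputed) {
    return true;
  }
  std::fprintf(stderr, "mismatch: maintained %d, recomputed %d\n",
               check.maintained, check.recomputed);
  return false;
}

// Runs every check, even after a failure, and counts the failures.
// Returns true only if all checks passed.
bool verify_all(const std::vector<ToyCheck>& checks, std::size_t& failures) {
  failures = 0;
  bool success = true;
  for (const ToyCheck& check : checks) {
    // Evaluate verify_one first so an earlier failure cannot short-circuit it.
    const bool ok = verify_one(check);
    if (!ok) {
      ++failures;
    }
    success = ok && success;
  }
  return success;
}
```

A caller would then do the equivalent of `assert(verify_all(checks, failures))` once at the end, after all mismatches have been printed.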

src/hotspot/share/opto/loopnode.cpp line 4819:

> 4817:
> 4818: void IdealLoopTree::collect_children(GrowableArray &children) const {
> 4819:   children.clear();

As you directly pass a new list to this method from `verify_tree()`, you do not need to clear it (you could theoretically also create the list here and return it by utilizing RVO).

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/13207#discussion_r1155517517
PR Review Comment: https://git.openjdk.org/jdk/pull/13207#discussion_r1155540586
PR Review Comment: https://git.openjdk.org/jdk/pull/13207#discussion_r1155542450

From epeter at openjdk.org Mon Apr 3 08:36:57 2023
From: epeter at openjdk.org (Emanuel Peter)
Date: Mon, 3 Apr 2023 08:36:57 GMT
Subject: RFR: 8173709: Fix VerifyLoopOptimizations - step 1 - minimal infrastructure [v8]
In-Reply-To:
References:
Message-ID:

> I am reviving `-XX:+VerifyLoopOptimizations` after many years of abandonment. There were many bugs filed, but so far they have not been addressed.
>
> The hope is that this work will allow us to catch ctrl / idom / loop body bugs more quickly, and fix many of the existing ones along the way.
>
> **The Idea of VerifyLoopOptimizations**
> Before loop-opts, we build many data-structures for dominance (idom), control, and loop membership. Then, loop-opts use this data to transform the graph. At the same time, they must maintain the correctness of the data-structures, so that other optimizations can be made, without needing to re-compute the data-structures every time.
> `VerifyLoopOptimizations` was implemented to verify correctness of the data-structures. After some loop-opts, we re-compute a verification data-structure and compare it to the one we created before the loop-opts and maintained during loop-opts.
>
> **My Approach**
> I soon realized that there were many reasons why `VerifyLoopOptimizations` was broken. It seemed infeasible to fix all of them at once.
> I decided to first remove any part that was failing, until I had a minimal set that works. I will leave many parts commented out. In follow-up RFEs, I will then iteratively improve the verification by re-enabling some verification and fixing the corresponding bugs.
>
> **What I fixed**
>
> - `verify_compare`
>   - Renamed it to `verify_idom_and_nodes`, since it does verification node-by-node (vs `verify_tree`, which verifies the loop-tree).
>   - Previously, it was implemented as a BFS with recursion, which led to a stack overflow. I flattened the BFS into a loop.
>   - The BFS calls `verify_idom` and `verify_nodes` on every node. I refactored `verify_nodes` a bit, so that it is more readable.
>   - I now report all failures, before asserting.
> - `verify_tree`
>   - I corrected the style and improved comments.
>   - I removed the broken verification for `Opaque` nodes. I added some rudimentary verification for `CountedLoop`. I leave more of this work for follow-up RFEs.
>   - I also converted the asserts to reporting failures, just like in `verify_idom_and_nodes`.
>
> **Disabled Verifications**
> I commented out the following verifications:
>
> (A) data nodes should have same ctrl
> (B) ctrl node should belong to same loop
> (C) ctrl node should have same idom
> (D) loop should have same tail
> (E) loop should have same body (list of nodes)
> (F) broken verification in PhaseIdealLoop::build_loop_late_post, because ctrl was set wrong
>
>
> Note: verifying `idom`, `ctrl` and `_body` is the central goal of `VerifyLoopOptimizations`. But all of them are broken in many parts of the VM, as we have not verified them for many years.
>
> **Follow-Up Work**
>
> I filed a first follow-up RFE [JDK-8305073](https://bugs.openjdk.org/browse/JDK-8305073). The following tasks should be addressed in it, or in subsequent follow-up RFEs.
>
> I propose the following order:
>
> - idom (C): The dominance structure is at the base of everything else.
> - ctrl / loop (A, B): Once dominance is fixed, we can ensure every node is assigned to the correct ctrl/loop.
> - tail (D): ensure the tail of a loop is updated correctly
> - body (E): nodes are assigned to the `_body` of a loop, according to the node ctrl.
> - other issues like (F)
> - Add more verification to IdealLoopTree::verify_tree. For example zero-trip-guard, etc.
> - Evaluate from where else we should call `PhaseIdealLoop::verify`. Maybe we are missing some cases.
>
> **Testing**
> I am running `tier1-tier6` and stress testing.
> Preliminary results are all good.
>
> **Conclusion**
> With this fix, I have the basic infrastructure of the verification working.
> However, all of the substantial verifications are still disabled, because there are too many places in the VM that do not maintain the data-structures properly.
> Follow-up RFEs will have to address these one-by-one.

Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:

  Christian's review suggestions

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/13207/files
  - new: https://git.openjdk.org/jdk/pull/13207/files/39fb94da..3e5aa600

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=13207&range=07
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13207&range=06-07

  Stats: 52 lines in 2 files changed: 6 ins; 14 del; 32 mod
  Patch: https://git.openjdk.org/jdk/pull/13207.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/13207/head:pull/13207

PR: https://git.openjdk.org/jdk/pull/13207

From epeter at openjdk.org Mon Apr 3 09:27:21 2023
From: epeter at openjdk.org (Emanuel Peter)
Date: Mon, 3 Apr 2023 09:27:21 GMT
Subject: RFR: 8173709: Fix VerifyLoopOptimizations - step 1 - minimal infrastructure [v9]
In-Reply-To:
References:
Message-ID:

> I am reviving `-XX:+VerifyLoopOptimizations` after many years of abandonment. There were many bugs filed, but so far they have not been addressed.
>
> The hope is that this work will allow us to catch ctrl / idom / loop body bugs more quickly, and fix many of the existing ones along the way.
>
> **The Idea of VerifyLoopOptimizations**
> Before loop-opts, we build many data-structures for dominance (idom), control, and loop membership. Then, loop-opts use this data to transform the graph. At the same time, they must maintain the correctness of the data-structures, so that other optimizations can be made, without needing to re-compute the data-structures every time.
> `VerifyLoopOptimizations` was implemented to verify correctness of the data-structures. After some loop-opts, we re-compute a verification data-structure and compare it to the one we created before the loop-opts and maintained during loop-opts.
>
> **My Approach**
> I soon realized that there were many reasons why `VerifyLoopOptimizations` was broken. It seemed infeasible to fix all of them at once. I decided to first remove any part that was failing, until I had a minimal set that works. I will leave many parts commented out. In follow-up RFEs, I will then iteratively improve the verification by re-enabling some verification and fixing the corresponding bugs.
>
> **What I fixed**
>
> - `verify_compare`
>   - Renamed it to `verify_idom_and_nodes`, since it does verification node-by-node (vs `verify_tree`, which verifies the loop-tree).
>   - Previously, it was implemented as a BFS with recursion, which led to a stack overflow. I flattened the BFS into a loop.
>   - The BFS calls `verify_idom` and `verify_nodes` on every node. I refactored `verify_nodes` a bit, so that it is more readable.
>   - I now report all failures, before asserting.
> - `verify_tree`
>   - I corrected the style and improved comments.
>   - I removed the broken verification for `Opaque` nodes. I added some rudimentary verification for `CountedLoop`. I leave more of this work for follow-up RFEs.
>   - I also converted the asserts to reporting failures, just like in `verify_idom_and_nodes`.
> > **Disabled Verifications** > I commented out the following verifications: > > (A) data nodes should have same ctrl > (B) ctrl node should belong to same loop > (C) ctrl node should have same idom > (D) loop should have same tail > (E) loop should have same body (list of nodes) > (F) broken verification in PhaseIdealLoop::build_loop_late_post, because ctrl was set wrong > > > Note: verifying `idom`, `ctrl` and `_body` is the central goal of `VerifyLoopOptimizations`. But all of them are broken in many parts of the VM, as we have not verified them for many years. > > **Follow-Up Work** > > I filed a first follow-up RFE [JDK-8305073](https://bugs.openjdk.org/browse/JDK-8305073). The following tasks should be addressed in it, or in subsequent follow-up RFE's. > > I propose the following order: > > - idom (C): The dominance structure is at the base of everything else. > - ctrl / loop (A, B): Once dominance is fixed, we can ensure every node is assigned to the correct ctrl/loop. > - tail (D): ensure the tail of a loop is updated correctly > - body (E): nodes are assigned to the `_body` of a loop, according to the node ctrl. > - other issues like (F) > - Add more verification to IdealLoopTree::verify_tree. For example zero-trip-guard, etc. > - Evaluate from where else we should call `PhaseIdealLoop::verify`. Maybe we are missing some cases. > > **Testing** > I am running `tier1-tier6` and stress testing. > Preliminary results are all good. > > **Conclusion** > With this fix, I have the basic infrastructure of the verification working. > However, all of the substantial verifications are still disabled, because there are too many places in the VM that do not maintain the data-structures properly. > Follow-up RFE's will have to address these one-by-one.
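The `verify_compare` change described in the quoted text (flattening a recursive BFS into a loop, and reporting all failures before asserting) can be sketched in isolation. What follows is only an illustration in Java, not the HotSpot C++ code: the `Node` class, its `recordedIdom` / `recomputedIdom` fields, and `verifyAll` are hypothetical stand-ins for the C2 node graph and for `verify_idom_and_nodes`, chosen here to show the worklist pattern.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class WorklistVerify {
    /** Hypothetical stand-in for a graph node; not a HotSpot type. */
    static final class Node {
        final int id;
        final int recordedIdom;   // value maintained while the graph was transformed
        final int recomputedIdom; // value from a fresh recomputation
        final List<Node> outs = new ArrayList<>();

        Node(int id, int recordedIdom, int recomputedIdom) {
            this.id = id;
            this.recordedIdom = recordedIdom;
            this.recomputedIdom = recomputedIdom;
        }
    }

    /** Visits every reachable node with an explicit worklist and returns all mismatches. */
    static List<String> verifyAll(Node root) {
        List<String> failures = new ArrayList<>();
        Set<Node> visited = new HashSet<>();
        ArrayDeque<Node> worklist = new ArrayDeque<>();
        visited.add(root);
        worklist.add(root);
        while (!worklist.isEmpty()) {
            Node n = worklist.poll(); // FIFO order gives a breadth-first walk
            if (n.recordedIdom != n.recomputedIdom) {
                failures.add("node " + n.id + ": recorded idom " + n.recordedIdom
                        + ", recomputed idom " + n.recomputedIdom);
            }
            for (Node out : n.outs) {
                if (visited.add(out)) { // add() returns false for already-seen nodes
                    worklist.add(out);
                }
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // A chain deep enough that a naive recursive walk could exhaust the stack.
        Node root = new Node(0, 0, 0);
        Node prev = root;
        for (int i = 1; i <= 200_000; i++) {
            int recomputed = (i == 7 || i == 150_000) ? -1 : i - 1; // two injected mismatches
            Node n = new Node(i, i - 1, recomputed);
            prev.outs.add(n);
            prev = n;
        }
        List<String> failures = verifyAll(root);
        failures.forEach(System.out::println); // report everything first
        System.out.println(failures.size() + " mismatches found");
    }
}
```

Collecting the mismatches in a list and checking it only once at the end is what lets a single run show every failure instead of stopping at the first one.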
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: remove unnecessary ampersand ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13207/files - new: https://git.openjdk.org/jdk/pull/13207/files/3e5aa600..18709c3e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13207&range=08 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13207&range=07-08 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/13207.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13207/head:pull/13207 PR: https://git.openjdk.org/jdk/pull/13207 From chagedorn at openjdk.org Mon Apr 3 09:28:18 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 3 Apr 2023 09:28:18 GMT Subject: RFR: 8173709: Fix VerifyLoopOptimizations - step 1 - minimal infrastructure [v9] In-Reply-To: References: Message-ID: On Mon, 3 Apr 2023 09:27:21 GMT, Emanuel Peter wrote: >> I am reviving `-XX:+VerifyLoopOptimizations` after many years of abandonment. There were many bugs filed, but so far it has not been addressed. >> >> The hope is that this work will allow us to catch ctrl / idom / loop body bugs quicker, and fix many of the existing ones along the way. >> >> **The Idea of VerifyLoopOptimizations** >> Before loop-opts, we build many data-structures for dominance (idom), control, and loop membership. Then, loop-opts use this data to transform the graph. At the same time, they must maintain the correctness of the data-structures, so that other optimizations can be made, without needing to re-compute the data-structures every time. >> `VerifyLoopOptimizations` was implemented to verify correctness of the data-structures. After some loop-opts, we re-compute a verification data-structure and compare it to the one we created before the loop-opts and maintained during loopopts. 
>> >> **My Approach** >> I soon realized that there were many reasons why `VerifyLoopOptimizations` was broken. It seemed infeasible to fix all of them at once. I decided to first remove any part that was failing, until I have a minimal set that is working. I will leave many parts commented out. In follow-up RFE's, I will then iteratively improve the verification by re-enabling some verification and fixing the corresponding bugs. >> >> **What I fixed** >> >> - `verify_compare` >> - Renamed it to `verify_idom_and_nodes`, since it does verification node-by-node (vs `verify_tree`, which verifies the loop-tree). >> - Previously, it was implemented as a BFS with recursion, which led to stack-overflow. I flattened the BFS into a loop. >> - The BFS calls `verify_idom` and `verify_nodes` on every node. I refactored `verify_nodes` a bit, so that it is more readable. >> - I now report all failures, before asserting. >> - `verify_tree` >> - I corrected the style and improved comments. >> - I removed the broken verification for `Opaque` nodes. I added some rudimentary verification for `CountedLoop`. I leave more of this work for follow-up RFE's. >> - I also converted the asserts to reporting failures, just like in `verify_idom_and_nodes`. >> >> **Disabled Verifications** >> I commented out the following verifications: >> >> (A) data nodes should have same ctrl >> (B) ctrl node should belong to same loop >> (C) ctrl node should have same idom >> (D) loop should have same tail >> (E) loop should have same body (list of nodes) >> (F) broken verification in PhaseIdealLoop::build_loop_late_post, because ctrl was set wrong >> >> >> Note: verifying `idom`, `ctrl` and `_body` is the central goal of `VerifyLoopOptimizations`. But all of them are broken in many parts of the VM, as we have not verified them for many years. >> >> **Follow-Up Work** >> >> I filed a first follow-up RFE [JDK-8305073](https://bugs.openjdk.org/browse/JDK-8305073).
The following tasks should be addressed in it, or in subsequent follow-up RFE's. >> >> I propose the following order: >> >> - idom (C): The dominance structure is at the base of everything else. >> - ctrl / loop (A, B): Once dominance is fixed, we can ensure every node is assigned to the correct ctrl/loop. >> - tail (D): ensure the tail of a loop is updated correctly >> - body (E): nodes are assigned to the `_body` of a loop, according to the node ctrl. >> - other issues like (F) >> - Add more verification to IdealLoopTree::verify_tree. For example zero-trip-guard, etc. >> - Evaluate from where else we should call `PhaseIdealLoop::verify`. Maybe we are missing some cases. >> >> **Testing** >> I am running `tier1-tier6` and stress testing. >> Preliminary results are all good. >> >> **Conclusion** >> With this fix, I have the basic infrastructure of the verification working. >> However, all of the substantial verifications are still disabled, because there are too many places in the VM that do not maintain the data-structures properly. >> Follow-up RFE's will have to address these one-by-one. > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > remove unnecessary ampersand Thanks for all the updates, looks good! ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13207#pullrequestreview-1368640593 From thartmann at openjdk.org Mon Apr 3 09:31:18 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 3 Apr 2023 09:31:18 GMT Subject: RFR: 8173709: Fix VerifyLoopOptimizations - step 1 - minimal infrastructure [v9] In-Reply-To: References: Message-ID: On Mon, 3 Apr 2023 09:27:21 GMT, Emanuel Peter wrote: >> I am reviving `-XX:+VerifyLoopOptimizations` after many years of abandonment. There were many bugs filed, but so far it has not been addressed.
> > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > remove unnecessary ampersand Marked as reviewed by thartmann (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/13207#pullrequestreview-1368645412 From tholenstein at openjdk.org Mon Apr 3 12:42:06 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 3 Apr 2023 12:42:06 GMT Subject: RFR: JDK-8305356: Fix ignored bad CompileCommands in tests Message-ID: <_jdVLVxHge11PN9t-j4CHRZPEWmJcP4yeaQiuKXDJYM=.b61be8f6-507f-448a-b79f-290b7f7b01fd@github.com> The following tests have a wrong `CompileCommand` and print an error message in `CompilerOracle::print_parse_error`: * missing `::*`: - `test/hotspot/jtreg/compiler/loopopts/TestPeelingRemoveDominatedTest.java` * used unsupported quotation marks `""`: - `test/hotspot/jtreg/compiler/integerArithmetic/TestNegMultiply.java` - `test/hotspot/jtreg/compiler/integerArithmetic/TestNegAnd.java` * Use of the pattern `CompileCommand=option,Klass::method,type,option,value` but `DisableIntrinsic` has type option type `ccstrlist` and not `ccstr`: - `test/hotspot/jtreg/compiler/intrinsics/bigInteger/TestMulAdd.java` - `test/hotspot/jtreg/compiler/intrinsics/bigInteger/TestMultiplyToLen.java` - `test/hotspot/jtreg/compiler/intrinsics/bigInteger/TestShift.java` - `test/hotspot/jtreg/compiler/intrinsics/bigInteger/TestSquareToLen.java` ------------- Commit messages: - JDK-8305356: Fix ignored bad CompileCommands in tests and add an assertion - TestPeelingRemoveDominatedTest - ccstrlist,DisableIntrinsic - fixed TestMulAdd.java - JDK-8282797: CompileCommand parsing errors should exit VM Changes: https://git.openjdk.org/jdk/pull/13297/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13297&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8305356 Stats: 18 lines in 7 files changed: 0 ins; 0 del; 18 mod Patch: https://git.openjdk.org/jdk/pull/13297.diff Fetch: git fetch 
https://git.openjdk.org/jdk.git pull/13297/head:pull/13297 PR: https://git.openjdk.org/jdk/pull/13297 From tholenstein at openjdk.org Mon Apr 3 12:53:52 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Mon, 3 Apr 2023 12:53:52 GMT Subject: RFR: JDK-8305356: Fix ignored bad CompileCommands in tests [v2] In-Reply-To: <_jdVLVxHge11PN9t-j4CHRZPEWmJcP4yeaQiuKXDJYM=.b61be8f6-507f-448a-b79f-290b7f7b01fd@github.com> References: <_jdVLVxHge11PN9t-j4CHRZPEWmJcP4yeaQiuKXDJYM=.b61be8f6-507f-448a-b79f-290b7f7b01fd@github.com> Message-ID: > The following tests have a wrong `CompileCommand` and print an error message in `CompilerOracle::print_parse_error`: > > * missing `::*`: > - `test/hotspot/jtreg/compiler/loopopts/TestPeelingRemoveDominatedTest.java` > > * used unsupported quotation marks `""`: > - `test/hotspot/jtreg/compiler/integerArithmetic/TestNegMultiply.java` > - `test/hotspot/jtreg/compiler/integerArithmetic/TestNegAnd.java` > > * Use of the pattern `CompileCommand=option,Klass::method,type,option,value` but `DisableIntrinsic` has type option type `ccstrlist` and not `ccstr`: > - `test/hotspot/jtreg/compiler/intrinsics/bigInteger/TestMulAdd.java` > - `test/hotspot/jtreg/compiler/intrinsics/bigInteger/TestMultiplyToLen.java` > - `test/hotspot/jtreg/compiler/intrinsics/bigInteger/TestShift.java` > - `test/hotspot/jtreg/compiler/intrinsics/bigInteger/TestSquareToLen.java` Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: copyright year ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13297/files - new: https://git.openjdk.org/jdk/pull/13297/files/a9943e26..c86155aa Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13297&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13297&range=00-01 Stats: 7 lines in 7 files changed: 2 ins; 0 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/13297.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13297/head:pull/13297 
PR: https://git.openjdk.org/jdk/pull/13297 From thartmann at openjdk.org Mon Apr 3 14:04:59 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 3 Apr 2023 14:04:59 GMT Subject: RFR: JDK-8305356: Fix ignored bad CompileCommands in tests [v2] In-Reply-To: References: <_jdVLVxHge11PN9t-j4CHRZPEWmJcP4yeaQiuKXDJYM=.b61be8f6-507f-448a-b79f-290b7f7b01fd@github.com> Message-ID: On Mon, 3 Apr 2023 12:53:52 GMT, Tobias Holenstein wrote: >> The following tests have a wrong `CompileCommand` and print an error message in `CompilerOracle::print_parse_error`: >> >> * missing `::*`: >> - `test/hotspot/jtreg/compiler/loopopts/TestPeelingRemoveDominatedTest.java` >> >> * used unsupported quotation marks `""`: >> - `test/hotspot/jtreg/compiler/integerArithmetic/TestNegMultiply.java` >> - `test/hotspot/jtreg/compiler/integerArithmetic/TestNegAnd.java` >> >> * Use of the pattern `CompileCommand=option,Klass::method,type,option,value` but `DisableIntrinsic` has type option type `ccstrlist` and not `ccstr`: >> - `test/hotspot/jtreg/compiler/intrinsics/bigInteger/TestMulAdd.java` >> - `test/hotspot/jtreg/compiler/intrinsics/bigInteger/TestMultiplyToLen.java` >> - `test/hotspot/jtreg/compiler/intrinsics/bigInteger/TestShift.java` >> - `test/hotspot/jtreg/compiler/intrinsics/bigInteger/TestSquareToLen.java` > > Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: > > copyright year Looks good to me. FTR, [JDK-8282797](https://bugs.openjdk.org/browse/JDK-8282797) will then enforce this. ------------- Marked as reviewed by thartmann (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/13297#pullrequestreview-1369106878 From chagedorn at openjdk.org Mon Apr 3 14:22:58 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 3 Apr 2023 14:22:58 GMT Subject: RFR: JDK-8305356: Fix ignored bad CompileCommands in tests [v2] In-Reply-To: References: <_jdVLVxHge11PN9t-j4CHRZPEWmJcP4yeaQiuKXDJYM=.b61be8f6-507f-448a-b79f-290b7f7b01fd@github.com> Message-ID: On Mon, 3 Apr 2023 12:53:52 GMT, Tobias Holenstein wrote: >> The following tests have a wrong `CompileCommand` and print an error message in `CompilerOracle::print_parse_error`: >> >> * missing `::*`: >> - `test/hotspot/jtreg/compiler/loopopts/TestPeelingRemoveDominatedTest.java` >> >> * used unsupported quotation marks `""`: >> - `test/hotspot/jtreg/compiler/integerArithmetic/TestNegMultiply.java` >> - `test/hotspot/jtreg/compiler/integerArithmetic/TestNegAnd.java` >> >> * Use of the pattern `CompileCommand=option,Klass::method,type,option,value` but `DisableIntrinsic` has type option type `ccstrlist` and not `ccstr`: >> - `test/hotspot/jtreg/compiler/intrinsics/bigInteger/TestMulAdd.java` >> - `test/hotspot/jtreg/compiler/intrinsics/bigInteger/TestMultiplyToLen.java` >> - `test/hotspot/jtreg/compiler/intrinsics/bigInteger/TestShift.java` >> - `test/hotspot/jtreg/compiler/intrinsics/bigInteger/TestSquareToLen.java` > > Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: > > copyright year Looks good! ------------- Marked as reviewed by chagedorn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/13297#pullrequestreview-1369146072 From psandoz at openjdk.org Mon Apr 3 15:02:07 2023 From: psandoz at openjdk.org (Paul Sandoz) Date: Mon, 3 Apr 2023 15:02:07 GMT Subject: RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v6] In-Reply-To: References: Message-ID: On Fri, 31 Mar 2023 12:25:16 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch reimplements `VectorShuffle` implementations to be a vector of the bit type. Currently, VectorShuffle is stored as a byte array, and would be expanded upon usage. This poses several drawbacks: >> >> 1. Inefficient conversions between a shuffle and its corresponding vector. This hinders the performance when the shuffle indices are not constant and are loaded or computed dynamically. >> 2. Redundant expansions in `rearrange` operations. On all platforms, it seems that a shuffle index vector is always expanded to the correct type before executing the `rearrange` operations. >> 3. Some redundant intrinsics are needed to support this handling as well as special considerations in the C2 compiler. >> 4. Range checks are performed using `VectorShuffle::toVector`, which is inefficient for FP types since both FP conversions and FP comparisons are more expensive than the integral ones. 
>> >> Upon these changes, a `rearrange` can emit more efficient code: >> >> var species = IntVector.SPECIES_128; >> var v1 = IntVector.fromArray(species, SRC1, 0); >> var v2 = IntVector.fromArray(species, SRC2, 0); >> v1.rearrange(v2.toShuffle()).intoArray(DST, 0); >> >> Before: >> movabs $0x751589fa8,%r10 ; {oop([I{0x0000000751589fa8})} >> vmovdqu 0x10(%r10),%xmm2 >> movabs $0x7515a0d08,%r10 ; {oop([I{0x00000007515a0d08})} >> vmovdqu 0x10(%r10),%xmm1 >> movabs $0x75158afb8,%r10 ; {oop([I{0x000000075158afb8})} >> vmovdqu 0x10(%r10),%xmm0 >> vpand -0xddc12(%rip),%xmm0,%xmm0 # Stub::vector_int_to_byte_mask >> ; {external_word} >> vpackusdw %xmm0,%xmm0,%xmm0 >> vpackuswb %xmm0,%xmm0,%xmm0 >> vpmovsxbd %xmm0,%xmm3 >> vpcmpgtd %xmm3,%xmm1,%xmm3 >> vtestps %xmm3,%xmm3 >> jne 0x00007fc2acb4e0d8 >> vpmovzxbd %xmm0,%xmm0 >> vpermd %ymm2,%ymm0,%ymm0 >> movabs $0x751588f98,%r10 ; {oop([I{0x0000000751588f98})} >> vmovdqu %xmm0,0x10(%r10) >> >> After: >> movabs $0x751589c78,%r10 ; {oop([I{0x0000000751589c78})} >> vmovdqu 0x10(%r10),%xmm1 >> movabs $0x75158ac88,%r10 ; {oop([I{0x000000075158ac88})} >> vmovdqu 0x10(%r10),%xmm2 >> vpxor %xmm0,%xmm0,%xmm0 >> vpcmpgtd %xmm2,%xmm0,%xmm3 >> vtestps %xmm3,%xmm3 >> jne 0x00007fa818b27cb1 >> vpermd %ymm1,%ymm2,%ymm0 >> movabs $0x751588c68,%r10 ; {oop([I{0x0000000751588c68})} >> vmovdqu %xmm0,0x10(%r10) >> >> Please take a look and leave reviews. Thanks a lot. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > small cosmetics Tier 2/3 tests passed. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/13093#issuecomment-1494486846 From psandoz at openjdk.org Mon Apr 3 15:20:21 2023 From: psandoz at openjdk.org (Paul Sandoz) Date: Mon, 3 Apr 2023 15:20:21 GMT Subject: RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v6] In-Reply-To: References: Message-ID: On Fri, 31 Mar 2023 12:25:16 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch reimplements `VectorShuffle` implementations to be a vector of the bit type. Currently, VectorShuffle is stored as a byte array, and would be expanded upon usage. This poses several drawbacks: >> >> 1. Inefficient conversions between a shuffle and its corresponding vector. This hinders the performance when the shuffle indices are not constant and are loaded or computed dynamically. >> 2. Redundant expansions in `rearrange` operations. On all platforms, it seems that a shuffle index vector is always expanded to the correct type before executing the `rearrange` operations. >> 3. Some redundant intrinsics are needed to support this handling as well as special considerations in the C2 compiler. >> 4. Range checks are performed using `VectorShuffle::toVector`, which is inefficient for FP types since both FP conversions and FP comparisons are more expensive than the integral ones. 
>> >> Upon these changes, a `rearrange` can emit more efficient code: >> >> var species = IntVector.SPECIES_128; >> var v1 = IntVector.fromArray(species, SRC1, 0); >> var v2 = IntVector.fromArray(species, SRC2, 0); >> v1.rearrange(v2.toShuffle()).intoArray(DST, 0); >> >> Before: >> movabs $0x751589fa8,%r10 ; {oop([I{0x0000000751589fa8})} >> vmovdqu 0x10(%r10),%xmm2 >> movabs $0x7515a0d08,%r10 ; {oop([I{0x00000007515a0d08})} >> vmovdqu 0x10(%r10),%xmm1 >> movabs $0x75158afb8,%r10 ; {oop([I{0x000000075158afb8})} >> vmovdqu 0x10(%r10),%xmm0 >> vpand -0xddc12(%rip),%xmm0,%xmm0 # Stub::vector_int_to_byte_mask >> ; {external_word} >> vpackusdw %xmm0,%xmm0,%xmm0 >> vpackuswb %xmm0,%xmm0,%xmm0 >> vpmovsxbd %xmm0,%xmm3 >> vpcmpgtd %xmm3,%xmm1,%xmm3 >> vtestps %xmm3,%xmm3 >> jne 0x00007fc2acb4e0d8 >> vpmovzxbd %xmm0,%xmm0 >> vpermd %ymm2,%ymm0,%ymm0 >> movabs $0x751588f98,%r10 ; {oop([I{0x0000000751588f98})} >> vmovdqu %xmm0,0x10(%r10) >> >> After: >> movabs $0x751589c78,%r10 ; {oop([I{0x0000000751589c78})} >> vmovdqu 0x10(%r10),%xmm1 >> movabs $0x75158ac88,%r10 ; {oop([I{0x000000075158ac88})} >> vmovdqu 0x10(%r10),%xmm2 >> vpxor %xmm0,%xmm0,%xmm0 >> vpcmpgtd %xmm2,%xmm0,%xmm3 >> vtestps %xmm3,%xmm3 >> jne 0x00007fa818b27cb1 >> vpermd %ymm1,%ymm2,%ymm0 >> movabs $0x751588c68,%r10 ; {oop([I{0x0000000751588c68})} >> vmovdqu %xmm0,0x10(%r10) >> >> Please take a look and leave reviews. Thanks a lot. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > small cosmetics Marked as reviewed by psandoz (Reviewer). 
------------- PR Review: https://git.openjdk.org/jdk/pull/13093#pullrequestreview-1369262809 From jcking at openjdk.org Mon Apr 3 15:31:07 2023 From: jcking at openjdk.org (Justin King) Date: Mon, 3 Apr 2023 15:31:07 GMT Subject: RFR: JDK-8305484: Compiler::init_c1_runtime unnecessarily uses an Arena that lives for the lifetime of the process Message-ID: Remove unnecessary usage of Arena for globals that live the lifetime of the process and are initialized once. ------------- Commit messages: - Remove unnecessary use of an Arena in C1 Changes: https://git.openjdk.org/jdk/pull/13300/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13300&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8305484 Stats: 50 lines in 5 files changed: 20 ins; 9 del; 21 mod Patch: https://git.openjdk.org/jdk/pull/13300.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13300/head:pull/13300 PR: https://git.openjdk.org/jdk/pull/13300 From jcking at openjdk.org Mon Apr 3 15:32:58 2023 From: jcking at openjdk.org (Justin King) Date: Mon, 3 Apr 2023 15:32:58 GMT Subject: RFR: JDK-8305484: Compiler::init_c1_runtime unnecessarily uses an Arena that lives for the lifetime of the process In-Reply-To: References: Message-ID: On Mon, 3 Apr 2023 15:23:43 GMT, Justin King wrote: > Remove unnecessary usage of Arena for globals that live the lifetime of the process and are initialized once. Actually, let me double check this isn't per-thread. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/13300#issuecomment-1494539665 From jcking at openjdk.org Mon Apr 3 15:41:56 2023 From: jcking at openjdk.org (Justin King) Date: Mon, 3 Apr 2023 15:41:56 GMT Subject: RFR: JDK-8305484: Compiler::init_c1_runtime unnecessarily uses an Arena that lives for the lifetime of the process [v2] In-Reply-To: References: Message-ID: <6fLO1cz_N50WWJ_qEPVtAyXdVwrwKuzKNI_hkEr-3kg=.9743f6fb-0eda-4ad7-bce6-3c5c78c1a129@github.com> > Remove unnecessary usage of Arena for globals that live the lifetime of the process and are initialized once. Justin King has updated the pull request incrementally with three additional commits since the last revision: - Remove now unused include Signed-off-by: Justin King - Remove incorrect comment Signed-off-by: Justin King - Fix typo Signed-off-by: Justin King ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13300/files - new: https://git.openjdk.org/jdk/pull/13300/files/09c38fd1..d1663998 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13300&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13300&range=00-01 Stats: 5 lines in 2 files changed: 0 ins; 4 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/13300.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13300/head:pull/13300 PR: https://git.openjdk.org/jdk/pull/13300 From jcking at openjdk.org Mon Apr 3 15:41:59 2023 From: jcking at openjdk.org (Justin King) Date: Mon, 3 Apr 2023 15:41:59 GMT Subject: RFR: JDK-8305484: Compiler::init_c1_runtime unnecessarily uses an Arena that lives for the lifetime of the process In-Reply-To: References: Message-ID: <-cI8TZcod-Q0kxDCkLWeh5QsWbhPs7z0v1crN8gHpEE=.927c76d8-24c6-436c-a6f9-347d8f98f571@github.com> On Mon, 3 Apr 2023 15:23:43 GMT, Justin King wrote: > Remove unnecessary usage of Arena for globals that live the lifetime of the process and are initialized once. Looks like it's only ever executed by a single thread.
------------- PR Comment: https://git.openjdk.org/jdk/pull/13300#issuecomment-1494547212 From psandoz at openjdk.org Mon Apr 3 16:39:01 2023 From: psandoz at openjdk.org (Paul Sandoz) Date: Mon, 3 Apr 2023 16:39:01 GMT Subject: RFR: 8303762: [vectorapi] Intrinsification of Vector.slice [v4] In-Reply-To: References: Message-ID: On Sat, 1 Apr 2023 07:44:25 GMT, Quan Anh Mai wrote: >> `Vector::slice` is a method at the top-level class of the Vector API that concatenates the 2 inputs into an intermediate composite and extracts a window equal to the size of the inputs into the result. It is used in vector conversion methods where the part number is not 0 to slice the parts to the correct positions. Slicing is also used in text processing such as UTF-8 and UTF-16 validation. x86 starting from SSSE3 has `palignr` which does vector slicing very efficiently. As a result, I think it is beneficial to add a C2 node for this operation as well as to intrinsify the `Vector::slice` method. >> >> A slice is currently implemented as `v2.rearrange(iota).blend(v1.rearrange(iota), blendMask)` which requires preparation of the index vector and the blending mask. Even with the preparations being hoisted out of the loops, microbenchmarks show improvement using the slice intrinsics. Some have tremendous increases in throughput due to the limitation that a mask of length 2 cannot currently be intrinsified, leading to falling back to the Java implementations. >> >> Please take a look and have some reviews. Thank you very much. > > Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains ten commits: > > - instruction asserts > - Merge branch 'master' into sliceIntrinsics > - add comments explaining anonymous classes > - address reviews > - sse2, increase warmup > - aesthetic > - optimise 64B > - add jmh > - vector slice intrinsics With the latest PR I am observing failures with debug builds for test compiler/vectorapi/TestVectorSlice.java on both AVX512 machines and aarch64 machines. On AVX512 machines the test fails with JVM args `-XX:UseAVX=3` and `-XX:UseAVX=3 -XX:+UnlockDiagnosticVMOptions -XX:+UseKNLSetting` and results in a test assertion failure e.g., Caused by: java.lang.RuntimeException: assertEquals: expected 70 to equal 0 at jdk.test.lib.Asserts.fail(Asserts.java:594) at jdk.test.lib.Asserts.assertEquals(Asserts.java:205) at jdk.test.lib.Asserts.assertEquals(Asserts.java:189) at compiler.vectorapi.TestVectorSlice.lambda$testInts$2(TestVectorSlice.java:163) at compiler.vectorapi.TestVectorSlice.testInts(TestVectorSlice.java:181) at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103) ... 7 more CPU flags are: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant tsc arch perfmon rep good nopl xtopology cpuid tsc known freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4 1 sse4 2 x2apic movbe popcnt tsc deadline timer aes xsave avx f16c rdrand hypervisor lahf lm abm 3dnowprefetch cpuid fault invpcid single ssbd ibrs ibpb stibp ibrs enhanced tpr shadow vnmi flexpriority ept vpid ept ad fsgsbase tsc adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves nt good wbnoinvd arat avx512vbmi umip pku ospke avx512 vbmi2 gfni vaes vpclmulqdq avx512 vnni avx512 bitalg avx512 vpopcntdq la57 rdpid md clear arch capabilities On aarch64 there is an IR rule failure. 
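For readers following the failure report, the slice semantics quoted in the description (concatenate the two inputs, then extract a window as wide as one input) can be written as a scalar model. This is a sketch under that stated reading only; the class `SliceReference` and its `slice` method are hypothetical names and are not taken from the PR or from `TestVectorSlice.java`.

```java
import java.util.Arrays;

class SliceReference {
    /**
     * Scalar model: lane i of the result is lane (origin + i) of the
     * concatenation of v1 followed by v2.
     */
    static int[] slice(int[] v1, int[] v2, int origin) {
        if (v1.length != v2.length) {
            throw new IllegalArgumentException("inputs must have the same length");
        }
        int n = v1.length;
        if (origin < 0 || origin > n) {
            throw new IndexOutOfBoundsException("origin " + origin);
        }
        int[] result = new int[n];
        for (int i = 0; i < n; i++) {
            int j = origin + i;
            // Lanes past the end of v1 continue into v2.
            result[i] = (j < n) ? v1[j] : v2[j - n];
        }
        return result;
    }

    public static void main(String[] args) {
        int[] v1 = {0, 1, 2, 3};
        int[] v2 = {4, 5, 6, 7};
        System.out.println(Arrays.toString(slice(v1, v2, 1))); // prints [1, 2, 3, 4]
    }
}
```

A model like this makes it easier to tell whether a mismatch such as the `assertEquals` failure above comes from the wrong lanes being selected or from lanes being left at zero.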
------------- PR Comment: https://git.openjdk.org/jdk/pull/12909#issuecomment-1494641261 From xliu at openjdk.org Mon Apr 3 19:42:42 2023 From: xliu at openjdk.org (Xin Liu) Date: Mon, 3 Apr 2023 19:42:42 GMT Subject: RFR: 8305203: Simplify trimming operation in Region::Ideal [v2] In-Reply-To: References: Message-ID: > This patch improves how Region::Ideal trims unreachable paths. > > 1. Don't restart from the beginning. Trimming doesn't change the DU-chain. > 2. Replace DUIterator with DUIterator_Fast. The latter is a raw pointer in release builds. > 3. Don't call add_users_to_worklist(this) repeatedly. > 4. Reduce its strength from add_users_to_worklist to > add_users_to_worklist0 because RegionNode has no special logic. > > This patch also includes a cosmetic change: rename n to 'use' inside of the loop. > Otherwise, we would shadow Node* n = in(i). Nothing wrong but harder to read. Xin Liu has updated the pull request incrementally with one additional commit since the last revision: Update coding style according to reviewer's feedback.
------------- Changes: - all: https://git.openjdk.org/jdk/pull/13238/files - new: https://git.openjdk.org/jdk/pull/13238/files/40510a42..8732ee71 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13238&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13238&range=00-01 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/13238.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13238/head:pull/13238 PR: https://git.openjdk.org/jdk/pull/13238 From xliu at openjdk.org Mon Apr 3 19:42:46 2023 From: xliu at openjdk.org (Xin Liu) Date: Mon, 3 Apr 2023 19:42:46 GMT Subject: RFR: 8305203: Simplify trimming operation in Region::Ideal [v2] In-Reply-To: References: Message-ID: On Mon, 3 Apr 2023 05:24:19 GMT, Tobias Hartmann wrote: >> Xin Liu has updated the pull request incrementally with one additional commit since the last revision: >> >> Update coding style according to reviewer's feedback. > > src/hotspot/share/opto/cfgnode.cpp line 576: > >> 574: >> 575: if(use->req() != req() && use->is_Phi()) { >> 576: assert(use->in(0) == this, ""); > > Suggestion: > > assert(use->in(0) == this, "unexpected control input"); updated. Could you help me submit this to your CI validation? TBH, I don't know why the original author starts over the loop. In my understanding, it doesn't delete any phi node but just put them into worklist. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13238#discussion_r1156381846 From kvn at openjdk.org Mon Apr 3 19:44:05 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 3 Apr 2023 19:44:05 GMT Subject: RFR: 8173709: Fix VerifyLoopOptimizations - step 1 - minimal infrastructure [v9] In-Reply-To: References: Message-ID: On Mon, 3 Apr 2023 09:27:21 GMT, Emanuel Peter wrote: >> I am reviving `-XX:+VerifyLoopOptimizations` after many years of abandonment. There were many bugs filed, but so far it has not been addressed. 
>>
>> The hope is that this work will allow us to catch ctrl / idom / loop body bugs quicker, and fix many of the existing ones along the way.
>>
>> **The Idea of VerifyLoopOptimizations**
>> Before loop-opts, we build many data-structures for dominance (idom), control, and loop membership. Then, loop-opts use this data to transform the graph. At the same time, they must maintain the correctness of the data-structures, so that other optimizations can be made, without needing to re-compute the data-structures every time.
>> `VerifyLoopOptimizations` was implemented to verify correctness of the data-structures. After some loop-opts, we re-compute a verification data-structure and compare it to the one we created before the loop-opts and maintained during loop-opts.
>>
>> **My Approach**
>> I soon realized that there were many reasons why `VerifyLoopOptimizations` was broken. It seemed infeasible to fix all of them at once. I decided to first remove any part that was failing, until I have a minimal set that is working. I will leave many parts commented out. In follow-up RFEs, I will then iteratively improve the verification by re-enabling some verification and fixing the corresponding bugs.
>>
>> **What I fixed**
>>
>> - `verify_compare`
>>   - Renamed it to `verify_idom_and_nodes`, since it does verification node-by-node (vs `verify_tree`, which verifies the loop-tree).
>>   - Previously, it was implemented as a BFS with recursion, which led to stack overflow. I flattened the BFS into a loop.
>>   - The BFS calls `verify_idom` and `verify_nodes` on every node. I refactored `verify_nodes` a bit, so that it is more readable.
>>   - I now report all failures, before asserting.
>> - `verify_tree`
>>   - I corrected the style and improved comments.
>>   - I removed the broken verification for `Opaque` nodes. I added some rudimentary verification for `CountedLoop`. I leave more of this work for follow-up RFEs.
>> - I also converted the asserts to reporting failures, just like in `verify_idom_and_nodes`. >> >> **Disabled Verifications** >> I commented out the following verifications: >> >> (A) data nodes should have same ctrl >> (B) ctrl node should belong to same loop >> (C) ctrl node should have same idom >> (D) loop should have same tail >> (E) loop should have same body (list of nodes) >> (F) broken verification in PhaseIdealLoop::build_loop_late_post, because ctrl was set wrong >> >> >> Note: verifying `idom`, `ctrl` and `_body` is the central goal of `VerifyLoopOptimizations`. But all of them are broken in many parts of the VM, as we have now not verified them for many years. >> >> **Follow-Up Work** >> >> I filed a first follow-up RFE [JDK-8305073](https://bugs.openjdk.org/browse/JDK-8305073). The following tasks should be addressed in it, or in subsequent follow-up RFE's. >> >> I propose the following order: >> >> - idom (C): The dominance structure is at the base of everything else. >> - ctrl / loop (A, B): Once dominance is fixed, we can ensure every node is assigned to the correct ctrl/loop. >> - tail (D): ensure the tail of a loop is updated correctly >> - body (E): nodes are assigned to the `_body` of a loop, according to the node ctrl. >> - other issues like (F) >> - Add more verification to IdealLoopTree::verify_tree. For example zero-trip-guard, etc. >> - Evaluate from where else we should call `PhaseIdealLoop::verify`. Maybe we are missing some cases. >> >> **Testing** >> I am running `tier1-tier6` and stress testing. >> Preliminary results are all good. >> >> **Conclusion** >> With this fix, I have the basic infrastructure of the verification working. >> However, all of the substantial verification are now still disabled, because there are too many places in the VM that do not maintain the data-structures properly. >> Follow-up RFE's will have to address these one-by-one. 
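
[Editorial note] The quoted description mentions flattening a recursive BFS into a loop and reporting all failures before asserting. A minimal sketch of that general pattern follows. It is not the HotSpot code: `WorklistVerify`, `verifyAll`, the integer node ids, and the adjacency map are all hypothetical stand-ins chosen for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.IntPredicate;

public class WorklistVerify {
    // Visit every node reachable from root using an explicit queue (no recursion,
    // so deep graphs cannot overflow the call stack), run the check on each node,
    // and collect all failing nodes instead of stopping at the first one.
    static List<Integer> verifyAll(Map<Integer, List<Integer>> edges, int root, IntPredicate check) {
        List<Integer> failures = new ArrayList<>();
        Set<Integer> visited = new HashSet<>();
        ArrayDeque<Integer> worklist = new ArrayDeque<>();
        visited.add(root);
        worklist.add(root);
        while (!worklist.isEmpty()) {
            int n = worklist.poll();
            if (!check.test(n)) {
                failures.add(n); // remember the failure and keep going
            }
            for (int m : edges.getOrDefault(n, List.of())) {
                if (visited.add(m)) { // add() returns false if m was already seen
                    worklist.add(m);
                }
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // Hypothetical 4-node graph with a cycle back to the root.
        Map<Integer, List<Integer>> g =
            Map.of(0, List.of(1, 2), 1, List.of(3), 2, List.of(3), 3, List.of(0));
        // Hypothetical check: even node ids pass, odd ones fail.
        List<Integer> bad = verifyAll(g, 0, n -> n % 2 == 0);
        System.out.println(bad);
        // Only after the whole traversal would a caller assert that the list is empty.
    }
}
```

A caller would print or log every entry of the returned list and only then assert that it is empty, which mirrors the "report all failures, before asserting" behaviour described in the quoted text.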
> > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > remove unnecessary ampersand Update is good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13207#pullrequestreview-1369694519 From qamai at openjdk.org Tue Apr 4 00:59:21 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Tue, 4 Apr 2023 00:59:21 GMT Subject: RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v6] In-Reply-To: References: Message-ID: On Fri, 31 Mar 2023 12:25:16 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch reimplements `VectorShuffle` implementations to be a vector of the bit type. Currently, VectorShuffle is stored as a byte array, and would be expanded upon usage. This poses several drawbacks: >> >> 1. Inefficient conversions between a shuffle and its corresponding vector. This hinders the performance when the shuffle indices are not constant and are loaded or computed dynamically. >> 2. Redundant expansions in `rearrange` operations. On all platforms, it seems that a shuffle index vector is always expanded to the correct type before executing the `rearrange` operations. >> 3. Some redundant intrinsics are needed to support this handling as well as special considerations in the C2 compiler. >> 4. Range checks are performed using `VectorShuffle::toVector`, which is inefficient for FP types since both FP conversions and FP comparisons are more expensive than the integral ones. 
>>
>> Upon these changes, a `rearrange` can emit more efficient code:
>>
>>     var species = IntVector.SPECIES_128;
>>     var v1 = IntVector.fromArray(species, SRC1, 0);
>>     var v2 = IntVector.fromArray(species, SRC2, 0);
>>     v1.rearrange(v2.toShuffle()).intoArray(DST, 0);
>>
>> Before:
>>     movabs $0x751589fa8,%r10 ; {oop([I{0x0000000751589fa8})}
>>     vmovdqu 0x10(%r10),%xmm2
>>     movabs $0x7515a0d08,%r10 ; {oop([I{0x00000007515a0d08})}
>>     vmovdqu 0x10(%r10),%xmm1
>>     movabs $0x75158afb8,%r10 ; {oop([I{0x000000075158afb8})}
>>     vmovdqu 0x10(%r10),%xmm0
>>     vpand -0xddc12(%rip),%xmm0,%xmm0 # Stub::vector_int_to_byte_mask
>>                                      ; {external_word}
>>     vpackusdw %xmm0,%xmm0,%xmm0
>>     vpackuswb %xmm0,%xmm0,%xmm0
>>     vpmovsxbd %xmm0,%xmm3
>>     vpcmpgtd %xmm3,%xmm1,%xmm3
>>     vtestps %xmm3,%xmm3
>>     jne 0x00007fc2acb4e0d8
>>     vpmovzxbd %xmm0,%xmm0
>>     vpermd %ymm2,%ymm0,%ymm0
>>     movabs $0x751588f98,%r10 ; {oop([I{0x0000000751588f98})}
>>     vmovdqu %xmm0,0x10(%r10)
>>
>> After:
>>     movabs $0x751589c78,%r10 ; {oop([I{0x0000000751589c78})}
>>     vmovdqu 0x10(%r10),%xmm1
>>     movabs $0x75158ac88,%r10 ; {oop([I{0x000000075158ac88})}
>>     vmovdqu 0x10(%r10),%xmm2
>>     vpxor %xmm0,%xmm0,%xmm0
>>     vpcmpgtd %xmm2,%xmm0,%xmm3
>>     vtestps %xmm3,%xmm3
>>     jne 0x00007fa818b27cb1
>>     vpermd %ymm1,%ymm2,%ymm0
>>     movabs $0x751588c68,%r10 ; {oop([I{0x0000000751588c68})}
>>     vmovdqu %xmm0,0x10(%r10)
>>
>> Please take a look and leave reviews. Thanks a lot.
>
> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision:
>
>   small cosmetics

Thanks, may I integrate the changes now?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/13093#issuecomment-1495191197

From psandoz at openjdk.org Tue Apr 4 01:10:18 2023
From: psandoz at openjdk.org (Paul Sandoz)
Date: Tue, 4 Apr 2023 01:10:18 GMT
Subject: RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v6]
In-Reply-To:
References:
Message-ID:

On Tue, 4 Apr 2023 00:56:09 GMT, Quan Anh Mai wrote:

> Thanks, may I integrate the changes now?

You might need another HotSpot reviewer? @vnkozlov is that correct?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/13093#issuecomment-1495198225

From eliu at openjdk.org Tue Apr 4 01:21:24 2023
From: eliu at openjdk.org (Eric Liu)
Date: Tue, 4 Apr 2023 01:21:24 GMT
Subject: Integrated: 8303278: Imprecise bottom type of ExtractB/UB
In-Reply-To:
References:
Message-ID:

On Fri, 17 Mar 2023 06:14:00 GMT, Eric Liu wrote:

> This is a trivial patch, which fixes the bottom type of ExtractB/UB nodes.
>
> ExtractNode can be generated by the Vector API method Vector.lane(int), which gets the lane element at the given index. A more precise type range can help to optimize out unnecessary type conversions in some cases.
>
> The following shows a typical case that uses ExtractBNode:
>
>
>     public static byte byteLt16() {
>         ByteVector vecb = ByteVector.broadcast(ByteVector.SPECIES_128, 1);
>         return vecb.lane(1);
>     }
>
>
> In this case, C2 constructs an IR graph like:
>
>     ExtractB  ConI(24)
>        |     __|
>        |    /  |
>     LShiftI  __|
>        |    /
>     RShiftI
>
> which generates AArch64 code:
>
>     movi v16.16b, #0x1
>     smov x11, v16.b[1]
>     sxtb w0, w11
>
> With this patch, this shift pair can be optimized out by RShiftI's identity [1]. The code is optimized to:
>
>     movi v16.16b, #0x1
>     smov x0, v16.b[1]
>
> [TEST]
>
> Full jtreg passed except 4 files on x86:
>
>     jdk/incubator/vector/Byte128VectorTests.java
>     jdk/incubator/vector/Byte256VectorTests.java
>     jdk/incubator/vector/Byte512VectorTests.java
>     jdk/incubator/vector/Byte64VectorTests.java
>
> They are caused by a known issue on x86 [2].
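
[Editorial note] The shift pair in the quoted IR graph, `(x << 24) >> 24`, sign-extends the low 8 bits of an int. When the input is already known to lie in the byte range, the pair returns its input unchanged, which is why a more precise type lets the compiler drop it. The scalar sketch below checks that property; `ByteShiftIdentity` and `signExtendLowByte` are hypothetical names, and this is not the C2 code.

```java
public class ByteShiftIdentity {
    // Shift the low 8 bits to the top of the 32-bit word, then shift back
    // arithmetically so that bit 7 is replicated into the upper 24 bits.
    static int signExtendLowByte(int x) {
        return (x << 24) >> 24;
    }

    public static void main(String[] args) {
        // For every value in the byte range [-128, 127] the shift pair is a no-op.
        for (int b = Byte.MIN_VALUE; b <= Byte.MAX_VALUE; b++) {
            if (signExtendLowByte(b) != b) {
                throw new AssertionError("mismatch at " + b);
            }
        }
        // Outside the byte range the pair does change the value:
        // 0x1FF keeps only its low byte 0xFF, which sign-extends to -1.
        System.out.println(signExtendLowByte(0x1FF));
    }
}
```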
>
> [1] https://github.com/openjdk/jdk/blob/742bc041eaba1ff9beb7f5b6d896e4f382b030ea/src/hotspot/share/opto/mulnode.cpp#L1052
> [2] https://bugs.openjdk.org/browse/JDK-8303508

This pull request has now been integrated.

Changeset: ac898e90
Author: Eric Liu
URL: https://git.openjdk.org/jdk/commit/ac898e90517b08d846a940ae58966905ef5f1aa6
Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod

8303278: Imprecise bottom type of ExtractB/UB

Reviewed-by: qamai, thartmann

-------------

PR: https://git.openjdk.org/jdk/pull/13070

From xgong at openjdk.org Tue Apr 4 02:40:12 2023
From: xgong at openjdk.org (Xiaohong Gong)
Date: Tue, 4 Apr 2023 02:40:12 GMT
Subject: RFR: 8303762: [vectorapi] Intrinsification of Vector.slice [v4]
In-Reply-To:
References:
Message-ID:

On Sat, 1 Apr 2023 07:44:25 GMT, Quan Anh Mai wrote:

>> `Vector::slice` is a method at the top-level class of the Vector API that concatenates the 2 inputs into an intermediate composite and extracts a window equal to the size of the inputs into the result. It is used in vector conversion methods where the part number is not 0 to slice the parts to the correct positions. Slicing is also used in text processing such as utf8 and utf16 validation. x86 starting from SSSE3 has `palignr` which does vector slicing very efficiently. As a result, I think it is beneficial to add a C2 node for this operation as well as intrinsify `Vector::slice` method.
>>
>> A slice is currently implemented as `v2.rearrange(iota).blend(v1.rearrange(iota), blendMask)` which requires preparation of the index vector and the blending mask. Even with the preparations being hoisted out of the loops, microbenchmarks show improvement using the slice intrinsics. Some have tremendous increases in throughput due to the limitation that a mask of length 2 cannot currently be intrinsified, leading to falling back to the Java implementations.
>>
>> Please take a look and have some reviews. Thank you very much.
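
[Editorial note] For readers less familiar with the operation, here is a scalar sketch of what the quoted description says `slice` computes: concatenate the two inputs and take a window of the input length starting at a given origin lane. It works on plain `int[]` arrays and is only a model of the semantics. `SliceSketch` and its `slice` method are hypothetical names, and this is neither the Vector API implementation nor the proposed intrinsic.

```java
import java.util.Arrays;

public class SliceSketch {
    // Conceptually forms the 2n-lane composite [v1 | v2] and returns lanes
    // origin .. origin + n - 1 of it, where n is the common input length.
    static int[] slice(int[] v1, int[] v2, int origin) {
        int n = v1.length;
        if (v2.length != n || origin < 0 || origin > n) {
            throw new IllegalArgumentException("inputs must have equal length and 0 <= origin <= length");
        }
        int[] result = new int[n];
        for (int i = 0; i < n; i++) {
            int j = origin + i;                        // lane index in the composite
            result[i] = (j < n) ? v1[j] : v2[j - n];   // first half comes from v1, second from v2
        }
        return result;
    }

    public static void main(String[] args) {
        int[] a = {0, 1, 2, 3};
        int[] b = {4, 5, 6, 7};
        System.out.println(Arrays.toString(slice(a, b, 1)));
    }
}
```

With `origin` equal to 0 the result is a copy of the first input, and with `origin` equal to the length it is a copy of the second, which matches the "window over the concatenation" reading of the description.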
>
> Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains ten commits:
>
> - instruction asserts
> - Merge branch 'master' into sliceIntrinsics
> - add comments explaining anonymous classes
> - address reviews
> - sse2, increase warmup
> - aesthetic
> - optimise 64B
> - add jmh
> - vector slice intrinsics

src/hotspot/share/opto/vectorIntrinsics.cpp line 1935:

> 1933:     return false; // should be primitive type
> 1934:   }
> 1935:   BasicType elem_bt = elem_type->basic_type();

Code style: It's better to add a blank line between different blocks.

src/hotspot/share/opto/vectorIntrinsics.cpp line 1941:

> 1939:     if (C->print_intrinsics()) {
> 1940:       tty->print_cr(" ** not supported: arity=2 op=slice vlen=%d etype=%s ismask=notused",
> 1941:                     num_elem, type2name(elem_bt));

`ismask=notused` could be removed. We use `ismask` in other intrinsics to print whether the operation is on a vector mask rather than a vector class.

src/hotspot/share/opto/vectorIntrinsics.cpp line 1954:

> 1952:   if (v1 == NULL || v2 == NULL) {
> 1953:     return false; // operand unboxing failed
> 1954:   }

I suggest reordering line 1950 and the if-statement at line 1952. Then we don't need the extra spaces in the variable definition `Node* v1 = unbox_vector(xxx)`.

Besides, could we rename variable `o` to `index` or `origin`? I know you've used `origin` at the beginning; maybe we can rename that one to `origin_type`. I see a similar naming style in `inline_vector_frombits_coerced`.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/12909#discussion_r1156645453
PR Review Comment: https://git.openjdk.org/jdk/pull/12909#discussion_r1156646311
PR Review Comment: https://git.openjdk.org/jdk/pull/12909#discussion_r1156653402

From amitkumar at openjdk.org Tue Apr 4 03:42:54 2023
From: amitkumar at openjdk.org (Amit Kumar)
Date: Tue, 4 Apr 2023 03:42:54 GMT
Subject: RFR: 8303147: [s390x] fast & slow debug builds are broken [v3]
In-Reply-To: <_ZdU0STu6HC-2e2UuIXJPe7RslDoKVDPLYXSGJFAwkA=.7639982c-26d7-4a93-8e6f-2d0171b71477@github.com>
References: <_ZdU0STu6HC-2e2UuIXJPe7RslDoKVDPLYXSGJFAwkA=.7639982c-26d7-4a93-8e6f-2d0171b71477@github.com>
Message-ID:

> This PR fixes the broken fast debug and slow debug builds for s390x. tier1 tests have completed and the results are not affected by this patch.

Amit Kumar has updated the pull request incrementally with three additional commits since the last revision:

 - updates assert condition
 - Revert "build fix for s390x"

   This reverts commit bb0ae5d8340b8fd99c4c8d7ab5623739e4d2fa7a.
 - Revert "use constant instead of enum"

   This reverts commit 820e288422e86e14323872dbe21550915e28c7e9.

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/12825/files
  - new: https://git.openjdk.org/jdk/pull/12825/files/820e2884..4e2c8112

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=12825&range=02
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12825&range=01-02

  Stats: 47 lines in 14 files changed: 15 ins; 22 del; 10 mod
  Patch: https://git.openjdk.org/jdk/pull/12825.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/12825/head:pull/12825

PR: https://git.openjdk.org/jdk/pull/12825

From amitkumar at openjdk.org Tue Apr 4 03:49:07 2023
From: amitkumar at openjdk.org (Amit Kumar)
Date: Tue, 4 Apr 2023 03:49:07 GMT
Subject: RFR: 8303147: [s390x] fast & slow debug builds are broken [v3]
In-Reply-To:
References: <_ZdU0STu6HC-2e2UuIXJPe7RslDoKVDPLYXSGJFAwkA=.7639982c-26d7-4a93-8e6f-2d0171b71477@github.com>
Message-ID:

On Tue, 4 Apr 2023 03:42:54 GMT, Amit Kumar wrote:

>> This PR fixes the broken fast debug and slow debug builds for s390x. tier1 tests have completed and the results are not affected by this patch.
>
> Amit Kumar has updated the pull request incrementally with three additional commits since the last revision:
>
>  - updates assert condition
>  - Revert "build fix for s390x"
>
>    This reverts commit bb0ae5d8340b8fd99c4c8d7ab5623739e4d2fa7a.
>  - Revert "use constant instead of enum"
>
>    This reverts commit 820e288422e86e14323872dbe21550915e28c7e9.

`FrameMap::first_available_sp_in_frame` was non-zero for s390x and PPC, and set to 0 for all other architectures. Initially the factor was `4`, so things were fine, but it has since changed to `2`, which became problematic for us. We have now modified the asserts.

Thanks to @RealLucy for the suggestion and help.

@TheRealMDoerr or @reinrich, would one of you review this PR as well? Thanks.
------------- PR Comment: https://git.openjdk.org/jdk/pull/12825#issuecomment-1495304927 From duke at openjdk.org Tue Apr 4 03:56:06 2023 From: duke at openjdk.org (SUN Guoyun) Date: Tue, 4 Apr 2023 03:56:06 GMT Subject: RFR: 8303147: [s390x] fast & slow debug builds are broken [v3] In-Reply-To: References: <_ZdU0STu6HC-2e2UuIXJPe7RslDoKVDPLYXSGJFAwkA=.7639982c-26d7-4a93-8e6f-2d0171b71477@github.com> Message-ID: On Tue, 4 Apr 2023 03:42:54 GMT, Amit Kumar wrote: >> This PR fixes broken fast debug and slow debug build for s390x-arch. tier1 test are completed and results are not affect after this patch. > > Amit Kumar has updated the pull request incrementally with three additional commits since the last revision: > > - updates assert condition > - Revert "build fix for s390x" > > This reverts commit bb0ae5d8340b8fd99c4c8d7ab5623739e4d2fa7a. > - Revert "use constant instead of enum" > > This reverts commit 820e288422e86e14323872dbe21550915e28c7e9. Marked as reviewed by sunny868 at github.com (no known OpenJDK username). ------------- PR Review: https://git.openjdk.org/jdk/pull/12825#pullrequestreview-1370153611 From epeter at openjdk.org Tue Apr 4 06:32:46 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 4 Apr 2023 06:32:46 GMT Subject: RFR: 8173709: Fix VerifyLoopOptimizations - step 1 - minimal infrastructure In-Reply-To: References: Message-ID: On Wed, 29 Mar 2023 16:38:41 GMT, Vladimir Kozlov wrote: >>> I had testing run, for tier1-tier6 and stress testing. >>> The non-verification run finished with 27d 3h machine time. >>> The verification run is still running, with at least 27d 7h machine time. >> >> Thanks for the data! If the verification run does not take much longer (say <1% on top of what the non-verification run takes), it might be a good trade-off to have it enabled by default. Not just to prevent the verification code from rotting but to actually get more value from it (better chances to find bugs earlier). 
> >> > I had testing run, for tier1-tier6 and stress testing. >> > The non-verification run finished with 27d 3h machine time. >> > The verification run is still running, with at least 27d 7h machine time. >> >> Thanks for the data! If the verification run does not take much longer (say <1% on top of what the non-verification run takes), it might be a good trade-off to have it enabled by default. Not just to prevent the verification code from rotting but to actually get more value from it (better chances to find bugs earlier). > > I assume @eme64 tested it with current limited verification. With adding/restoring more code the time will increase. > I suggest to enable it only for `stress` testing now so we always use it for pre-integration testing and later tiers. > After enabling of all verification code we will check time again and can decide if we can enable it by default always. > So after pushing this fix we should add it to stress testing - we need that for pre-integration testing. Thanks @vnkozlov @chhagedorn @TobiHartmann for the reviews! ------------- PR Comment: https://git.openjdk.org/jdk/pull/13207#issuecomment-1495418474 From epeter at openjdk.org Tue Apr 4 06:32:49 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 4 Apr 2023 06:32:49 GMT Subject: Integrated: 8173709: Fix VerifyLoopOptimizations - step 1 - minimal infrastructure In-Reply-To: References: Message-ID: On Tue, 28 Mar 2023 12:49:57 GMT, Emanuel Peter wrote: > I am reviving `-XX:+VerifyLoopOptimizations` after many years of abandonment. There were many bugs filed, but so far it has not been addressed. > > The hope is that this work will allow us to catch ctrl / idom / loop body bugs quicker, and fix many of the existing ones along the way. > > **The Idea of VerifyLoopOptimizations** > Before loop-opts, we build many data-structures for dominance (idom), control, and loop membership. Then, loop-opts use this data to transform the graph. 
At the same time, they must maintain the correctness of the data-structures, so that other optimizations can be made, without needing to re-compute the data-structures every time.
> `VerifyLoopOptimizations` was implemented to verify correctness of the data-structures. After some loop-opts, we re-compute a verification data-structure and compare it to the one we created before the loop-opts and maintained during loop-opts.
>
> **My Approach**
> I soon realized that there were many reasons why `VerifyLoopOptimizations` was broken. It seemed infeasible to fix all of them at once. I decided to first remove any part that was failing, until I have a minimal set that is working. I will leave many parts commented out. In follow-up RFEs, I will then iteratively improve the verification by re-enabling some verification and fixing the corresponding bugs.
>
> **What I fixed**
>
> - `verify_compare`
>   - Renamed it to `verify_idom_and_nodes`, since it does verification node-by-node (vs `verify_tree`, which verifies the loop-tree).
>   - Previously, it was implemented as a BFS with recursion, which led to stack overflow. I flattened the BFS into a loop.
>   - The BFS calls `verify_idom` and `verify_nodes` on every node. I refactored `verify_nodes` a bit, so that it is more readable.
>   - I now report all failures, before asserting.
> - `verify_tree`
>   - I corrected the style and improved comments.
>   - I removed the broken verification for `Opaque` nodes. I added some rudimentary verification for `CountedLoop`. I leave more of this work for follow-up RFEs.
>   - I also converted the asserts to reporting failures, just like in `verify_idom_and_nodes`.
> > **Disabled Verifications** > I commented out the following verifications: > > (A) data nodes should have same ctrl > (B) ctrl node should belong to same loop > (C) ctrl node should have same idom > (D) loop should have same tail > (E) loop should have same body (list of nodes) > (F) broken verification in PhaseIdealLoop::build_loop_late_post, because ctrl was set wrong > > > Note: verifying `idom`, `ctrl` and `_body` is the central goal of `VerifyLoopOptimizations`. But all of them are broken in many parts of the VM, as we have now not verified them for many years. > > **Follow-Up Work** > > I filed a first follow-up RFE [JDK-8305073](https://bugs.openjdk.org/browse/JDK-8305073). The following tasks should be addressed in it, or in subsequent follow-up RFE's. > > I propose the following order: > > - idom (C): The dominance structure is at the base of everything else. > - ctrl / loop (A, B): Once dominance is fixed, we can ensure every node is assigned to the correct ctrl/loop. > - tail (D): ensure the tail of a loop is updated correctly > - body (E): nodes are assigned to the `_body` of a loop, according to the node ctrl. > - other issues like (F) > - Add more verification to IdealLoopTree::verify_tree. For example zero-trip-guard, etc. > - Evaluate from where else we should call `PhaseIdealLoop::verify`. Maybe we are missing some cases. > > **Testing** > I am running `tier1-tier6` and stress testing. > Preliminary results are all good. > > **Conclusion** > With this fix, I have the basic infrastructure of the verification working. > However, all of the substantial verification are now still disabled, because there are too many places in the VM that do not maintain the data-structures properly. > Follow-up RFE's will have to address these one-by-one. This pull request has now been integrated. 
Changeset: 24c6af06 Author: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/24c6af0637631153707615932f1f10ced4e5c0e8 Stats: 285 lines in 4 files changed: 153 ins; 39 del; 93 mod 8173709: Fix VerifyLoopOptimizations - step 1 - minimal infrastructure Reviewed-by: kvn, chagedorn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/13207 From thartmann at openjdk.org Tue Apr 4 07:27:08 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 4 Apr 2023 07:27:08 GMT Subject: RFR: 8305203: Simplify trimming operation in Region::Ideal [v2] In-Reply-To: References: Message-ID: On Mon, 3 Apr 2023 19:38:00 GMT, Xin Liu wrote: >> src/hotspot/share/opto/cfgnode.cpp line 576: >> >>> 574: >>> 575: if(use->req() != req() && use->is_Phi()) { >>> 576: assert(use->in(0) == this, ""); >> >> Suggestion: >> >> assert(use->in(0) == this, "unexpected control input"); > > updated. > > Could you help me submit this to your CI validation? TBH, I don't know why the original author starts over the loop. In my understanding, it doesn't delete any phi node but just put them into worklist. I did, all tests passed. I'm also not sure why that loop was there. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13238#discussion_r1156835291 From duke at openjdk.org Tue Apr 4 07:30:15 2023 From: duke at openjdk.org (SUN Guoyun) Date: Tue, 4 Apr 2023 07:30:15 GMT Subject: RFR: 8305523: Some StoreStore barriers in the interpreter are unnecessary after JDK-8205683 Message-ID: After JDK-8205683, InterpreterRuntime::_new() and InterpreterRuntime::newarray() eventually calls Atomic::release_store() to prevent reordering of stores for object initialization with stores that publish the new object. 
The call stack is as follows:

    Atomic::release_store((Klass**)raw_mem, k)
    oopDesc::release_set_klass(HeapWord* mem, Klass* k)
    oop MemAllocator::finish(HeapWord* mem)
    ArrayKlass::cast(klass)->allocate_arrayArray(1, length, THREAD);
    oopFactory::new_objArray(klass, size, CHECK);
    InterpreterRuntime::anewarray(JavaThread* current, ConstantPool* pool, int index, jint size)

So the StoreStore barriers trailing the bytecodes _new, newarray and anewarray are unnecessary. aarch64, riscv and ppc have the same problem.

-------------

Commit messages:
 - 8305523: Some StoreStore barriers in the interpreter are unnecessary after JDK-8205683

Changes: https://git.openjdk.org/jdk/pull/13320/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13320&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8305523
  Stats: 24 lines in 3 files changed: 0 ins; 24 del; 0 mod
  Patch: https://git.openjdk.org/jdk/pull/13320.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/13320/head:pull/13320

PR: https://git.openjdk.org/jdk/pull/13320

From mdoerr at openjdk.org Tue Apr 4 10:21:12 2023
From: mdoerr at openjdk.org (Martin Doerr)
Date: Tue, 4 Apr 2023 10:21:12 GMT
Subject: RFR: 8303147: [s390x] fast & slow debug builds are broken [v3]
In-Reply-To:
References: <_ZdU0STu6HC-2e2UuIXJPe7RslDoKVDPLYXSGJFAwkA=.7639982c-26d7-4a93-8e6f-2d0171b71477@github.com>
Message-ID:

On Tue, 4 Apr 2023 03:42:54 GMT, Amit Kumar wrote:

>> This PR fixes the broken fast debug and slow debug builds for s390x. tier1 tests have completed and the results are not affected by this patch.
>
> Amit Kumar has updated the pull request incrementally with three additional commits since the last revision:
>
>  - updates assert condition
>  - Revert "build fix for s390x"
>
>    This reverts commit bb0ae5d8340b8fd99c4c8d7ab5623739e4d2fa7a.
>  - Revert "use constant instead of enum"
>
>    This reverts commit 820e288422e86e14323872dbe21550915e28c7e9.

This makes sense because `_reserved_argument_area_size` gets initialized with `hir()->max_stack()` which doesn't include the frame header.

-------------

Marked as reviewed by mdoerr (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/12825#pullrequestreview-1370654552

From eosterlund at openjdk.org Tue Apr 4 12:48:16 2023
From: eosterlund at openjdk.org (Erik Österlund)
Date: Tue, 4 Apr 2023 12:48:16 GMT
Subject: RFR: 8305351: C2 setScopedValueCache intrinsic doesn't use access API
Message-ID:

The setScopedValueCache intrinsic for C2 doesn't use the access API. Instead, we store into an OopHandle with a raw store. That doesn't necessarily play well with all GCs, for example Shenandoah and generational ZGC. We should use the access API to ensure the right barriers are emitted.

-------------

Commit messages:
 - 8305351: C2 setScopedValueCache intrinsic doesn't use access API

Changes: https://git.openjdk.org/jdk/pull/13324/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13324&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8305351
  Stats: 2 lines in 1 file changed: 0 ins; 1 del; 1 mod
  Patch: https://git.openjdk.org/jdk/pull/13324.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/13324/head:pull/13324

PR: https://git.openjdk.org/jdk/pull/13324

From eosterlund at openjdk.org Tue Apr 4 13:12:18 2023
From: eosterlund at openjdk.org (Erik Österlund)
Date: Tue, 4 Apr 2023 13:12:18 GMT
Subject: RFR: 8305543: Ensure GC barriers for arraycopy on AArch64 use caller saved neon temp registers
Message-ID: <4RyYObqmSswKoIdIw59VFyr6m47nN0rsiPu4ebsrlC0=.5ab0d5e2-aa03-4bf0-9b77-237a74d16670@github.com>

The arraycopy stubs on AArch64 now allow the GC to vectorize arraycopy barriers. That's great! But the gct3 register we hand to the GC is v8 today, which is callee saved (well at least the lower 64 bits).
Therefore, if the GC clobbers this temp register, it can have unexpected side effects on the caller's float/double registers. We should use a caller saved register instead. This is only used by generational ZGC, so isn't a mainline bug yet. We should fix it before it becomes one.

-------------

Commit messages:
 - 8305543: Ensure GC barriers for arraycopy on AArch64 use caller saved neon temp registers

Changes: https://git.openjdk.org/jdk/pull/13325/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13325&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8305543
  Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod
  Patch: https://git.openjdk.org/jdk/pull/13325.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/13325/head:pull/13325

PR: https://git.openjdk.org/jdk/pull/13325

From qamai at openjdk.org Tue Apr 4 13:24:18 2023
From: qamai at openjdk.org (Quan Anh Mai)
Date: Tue, 4 Apr 2023 13:24:18 GMT
Subject: RFR: 8303762: [vectorapi] Intrinsification of Vector.slice [v5]
In-Reply-To:
References:
Message-ID:

> `Vector::slice` is a method at the top-level class of the Vector API that concatenates the 2 inputs into an intermediate composite and extracts a window equal to the size of the inputs into the result. It is used in vector conversion methods where the part number is not 0 to slice the parts to the correct positions. Slicing is also used in text processing such as utf8 and utf16 validation. x86 starting from SSSE3 has `palignr` which does vector slicing very efficiently. As a result, I think it is beneficial to add a C2 node for this operation as well as intrinsify `Vector::slice` method.
>
> A slice is currently implemented as `v2.rearrange(iota).blend(v1.rearrange(iota), blendMask)` which requires preparation of the index vector and the blending mask. Even with the preparations being hoisted out of the loops, microbenchmarks show improvement using the slice intrinsics.
Some have tremendous increases in throughput due to the limitation that a mask of length 2 cannot currently be intrinsified, leading to falling back to the Java implementations. > > Please take a look and have some reviews. Thank you very much. Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: add identity, fix flags ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12909/files - new: https://git.openjdk.org/jdk/pull/12909/files/bedb73bd..e68e215d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12909&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12909&range=03-04 Stats: 42 lines in 4 files changed: 19 ins; 0 del; 23 mod Patch: https://git.openjdk.org/jdk/pull/12909.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12909/head:pull/12909 PR: https://git.openjdk.org/jdk/pull/12909 From qamai at openjdk.org Tue Apr 4 13:46:12 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Tue, 4 Apr 2023 13:46:12 GMT Subject: RFR: 8303762: [vectorapi] Intrinsification of Vector.slice [v6] In-Reply-To: References: Message-ID: > `Vector::slice` is a method at the top-level class of the Vector API that concatenates the 2 inputs into an intermediate composite and extracts a window equal to the size of the inputs into the result. It is used in vector conversion methods where the part number is not 0 to slice the parts to the correct positions. Slicing is also used in text processing such as utf8 and utf16 validation. x86 starting from SSSE3 has `palignr` which does vector slicing very efficiently. As a result, I think it is beneficial to add a C2 node for this operation as well as intrinsify `Vector::slice` method. > > A slice is currently implemented as `v2.rearrange(iota).blend(v1.rearrange(iota), blendMask)` which requires preparation of the index vector and the blending mask. 
Even with the preparations being hoisted out of the loops, microbenchmarks show improvement using the slice instrinsics. Some have tremendous increases in throughput due to the limitation that a mask of length 2 cannot currently be intrinsified, leading to falling back to the Java implementations. > > Please take a look and have some reviews. Thank you very much. Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: style ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12909/files - new: https://git.openjdk.org/jdk/pull/12909/files/e68e215d..a17942f5 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12909&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12909&range=04-05 Stats: 13 lines in 1 file changed: 4 ins; 2 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/12909.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12909/head:pull/12909 PR: https://git.openjdk.org/jdk/pull/12909 From amitkumar at openjdk.org Tue Apr 4 14:31:25 2023 From: amitkumar at openjdk.org (Amit Kumar) Date: Tue, 4 Apr 2023 14:31:25 GMT Subject: Integrated: 8303147: [s390x] fast & slow debug builds are broken In-Reply-To: <_ZdU0STu6HC-2e2UuIXJPe7RslDoKVDPLYXSGJFAwkA=.7639982c-26d7-4a93-8e6f-2d0171b71477@github.com> References: <_ZdU0STu6HC-2e2UuIXJPe7RslDoKVDPLYXSGJFAwkA=.7639982c-26d7-4a93-8e6f-2d0171b71477@github.com> Message-ID: <4ZsyVU-OUufxY-t_gQBFP-agK6cYZwRdYCwz0tOmeJo=.75669b4a-fc32-4e58-a095-fb2a4adb9c12@github.com> On Thu, 2 Mar 2023 10:06:17 GMT, Amit Kumar wrote: > This PR fixes broken fast debug and slow debug build for s390x-arch. tier1 test are completed and results are not affect after this patch. This pull request has now been integrated. 
Changeset: 62bd2eba Author: Amit Kumar Committer: Martin Doerr URL: https://git.openjdk.org/jdk/commit/62bd2ebac4dd11ceecafd7f988485fe2aaea1a5e Stats: 8 lines in 2 files changed: 2 ins; 1 del; 5 mod 8303147: [s390x] fast & slow debug builds are broken Reviewed-by: mdoerr ------------- PR: https://git.openjdk.org/jdk/pull/12825 From mdoerr at openjdk.org Tue Apr 4 14:31:23 2023 From: mdoerr at openjdk.org (Martin Doerr) Date: Tue, 4 Apr 2023 14:31:23 GMT Subject: RFR: 8303147: [s390x] fast & slow debug builds are broken [v3] In-Reply-To: References: <_ZdU0STu6HC-2e2UuIXJPe7RslDoKVDPLYXSGJFAwkA=.7639982c-26d7-4a93-8e6f-2d0171b71477@github.com> Message-ID: On Tue, 4 Apr 2023 03:42:54 GMT, Amit Kumar wrote: >> This PR fixes broken fast debug and slow debug build for s390x-arch. tier1 test are completed and results are not affect after this patch. > > Amit Kumar has updated the pull request incrementally with three additional commits since the last revision: > > - updates assert condition > - Revert "build fix for s390x" > > This reverts commit bb0ae5d8340b8fd99c4c8d7ab5623739e4d2fa7a. > - Revert "use constant instead of enum" > > This reverts commit 820e288422e86e14323872dbe21550915e28c7e9. This change has high priority because it fixes the build. So, let's get it integrated. ------------- PR Comment: https://git.openjdk.org/jdk/pull/12825#issuecomment-1496070919
From qamai at openjdk.org Tue Apr 4 14:57:18 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Tue, 4 Apr 2023 14:57:18 GMT Subject: RFR: 8303762: [vectorapi] Intrinsification of Vector.slice [v4] In-Reply-To: References: Message-ID: On Mon, 3 Apr 2023 16:36:08 GMT, Paul Sandoz wrote: >> Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains ten commits: >> >> - instruction asserts >> - Merge branch 'master' into sliceIntrinsics >> - add comments explaining anonymous classes >> - address reviews >> - sse2, increase warmup >> - aesthetic >> - optimise 64B >> - add jmh >> - vector slice intrinsics > > With the latest PR I am observing failures with debug builds for test compiler/vectorapi/TestVectorSlice.java on both AVX512 machines and aarch64 machines.
> > On AVX512 machines the test fails with JVM args `-XX:UseAVX=3` and `-XX:UseAVX=3 -XX:+UnlockDiagnosticVMOptions -XX:+UseKNLSetting` and results in a test assertion failure e.g., > > Caused by: java.lang.RuntimeException: assertEquals: expected 70 to equal 0 > at jdk.test.lib.Asserts.fail(Asserts.java:594) > at jdk.test.lib.Asserts.assertEquals(Asserts.java:205) > at jdk.test.lib.Asserts.assertEquals(Asserts.java:189) > at compiler.vectorapi.TestVectorSlice.lambda$testInts$2(TestVectorSlice.java:163) > at compiler.vectorapi.TestVectorSlice.testInts(TestVectorSlice.java:181) > at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103) > ... 7 more > > > CPU flags are: > > fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant tsc arch perfmon rep good nopl xtopology cpuid tsc known freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4 1 sse4 2 x2apic movbe popcnt tsc deadline timer aes xsave avx f16c rdrand hypervisor lahf lm abm 3dnowprefetch cpuid fault invpcid single ssbd ibrs ibpb stibp ibrs enhanced tpr shadow vnmi flexpriority ept vpid ept ad fsgsbase tsc adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves nt good wbnoinvd arat avx512vbmi umip pku ospke avx512 vbmi2 gfni vaes vpclmulqdq avx512 vnni avx512 bitalg avx512 vpopcntdq la57 rdpid md clear arch capabilities > > > On aarch64 there is an IR rule failure. 
@PaulSandoz I have fixed the error in AVX512 and added feature predicates to not do IR check on AArch64. @XiaohongGong Thanks for your reviews, I have addressed them. ------------- PR Comment: https://git.openjdk.org/jdk/pull/12909#issuecomment-1496115432 From kvn at openjdk.org Tue Apr 4 16:25:08 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 4 Apr 2023 16:25:08 GMT Subject: RFR: 8305351: C2 setScopedValueCache intrinsic doesn't use access API In-Reply-To: References: Message-ID: <3xuoHe0z8GMT2hefuTXRvHWsG09A8JlsNlcnKkiQ04I=.7f19788b-a7cf-455e-ac8f-95de6840f3a5@github.com> On Tue, 4 Apr 2023 12:40:14 GMT, Erik Österlund wrote: > The setScopedValueCache intrinsic for C2 doesn't use the access API. Instead, we store into an OopHandle with a raw store. That doesn't necessarily play well with all GCs, for example Shenandoah and generational ZGC. We should use the access API to ensure the right barriers are emitted. Good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13324#pullrequestreview-1371351980 From rcastanedalo at openjdk.org Tue Apr 4 18:30:07 2023 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Tue, 4 Apr 2023 18:30:07 GMT Subject: RFR: 8305351: C2 setScopedValueCache intrinsic doesn't use access API In-Reply-To: References: Message-ID: On Tue, 4 Apr 2023 12:40:14 GMT, Erik Österlund wrote: > The setScopedValueCache intrinsic for C2 doesn't use the access API. Instead, we store into an OopHandle with a raw store. That doesn't necessarily play well with all GCs, for example Shenandoah and generational ZGC. We should use the access API to ensure the right barriers are emitted. Looks good. ------------- Marked as reviewed by rcastanedalo (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/13324#pullrequestreview-1371552209 From psandoz at openjdk.org Tue Apr 4 23:57:06 2023 From: psandoz at openjdk.org (Paul Sandoz) Date: Tue, 4 Apr 2023 23:57:06 GMT Subject: RFR: 8303762: [vectorapi] Intrinsification of Vector.slice [v6] In-Reply-To: References: Message-ID: On Tue, 4 Apr 2023 13:46:12 GMT, Quan Anh Mai wrote: >> `Vector::slice` is a method at the top-level class of the Vector API that concatenates the 2 inputs into an intermediate composite and extracts a window equal to the size of the inputs into the result. It is used in vector conversion methods where the part number is not 0 to slice the parts to the correct positions. Slicing is also used in text processing such as utf8 and utf16 validation. x86 starting from SSSE3 has `palignr` which does vector slicing very efficiently. As a result, I think it is beneficial to add a C2 node for this operation as well as intrinsify `Vector::slice` method. >> >> A slice is currently implemented as `v2.rearrange(iota).blend(v1.rearrange(iota), blendMask)` which requires preparation of the index vector and the blending mask. Even with the preparations being hoisted out of the loops, microbenchmarks show improvement using the slice instrinsics. Some have tremendous increases in throughput due to the limitation that a mask of length 2 cannot currently be intrinsified, leading to falling back to the Java implementations. >> >> Please take a look and have some reviews. Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > style Tier 1-3 tests now pass. ------------- PR Comment: https://git.openjdk.org/jdk/pull/12909#issuecomment-1496737872 From duke at openjdk.org Wed Apr 5 04:54:05 2023 From: duke at openjdk.org (Joshua Cao) Date: Wed, 5 Apr 2023 04:54:05 GMT Subject: RFR: 8300829: Make CtwRunner available as an independent tool Message-ID: 1. 
Create an independent jar file with CtwRunner as the main class to make it easier to run 2. Output the class files directly into the destination directory. Currently, CTW expects a `wb.jar`, but the jtreg tests that use CTWRunner has class files outside of a jar. 3. Introduce `sun.hotspot.tools.ctwrunner.ctw_extra_args` option to pass extra arguments to CTW. Arguments are comma separated because working with spaces in bash can be kind of awkward, but I'm open to changing this part. ### Motivation CTWRunner is a wrapper around CTW that will continue compiling beyond failure. It can be useful for testing compilation with certain flags. For example, one could run JAVA_OPTIONS="-Dsun.hotspot.tools.ctwrunner.ctw_extra_args=-XX:+StressLCM,-XX:+StressGCM" ./ctwrunner.sh modules:java.base To test compiling the java.base module with `-XX:+StressLCM -XX:+StressGCM`. This is advantageous over uses CTW because we can see the full list of crashes for the entire module. ------------- Commit messages: - 8300829: Make CtwRunner available as an independent tool Changes: https://git.openjdk.org/jdk/pull/13344/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13344&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8300829 Stats: 109 lines in 3 files changed: 68 ins; 35 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/13344.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13344/head:pull/13344 PR: https://git.openjdk.org/jdk/pull/13344 From epeter at openjdk.org Wed Apr 5 04:55:18 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 5 Apr 2023 04:55:18 GMT Subject: Integrated: 8304042: C2 SuperWord: schedule must remove packs with cyclic dependencies In-Reply-To: <9VgAQeNZfUZJXO8llozcZZuRftv6kk43jw0YIrBIdck=.b5c89436-608b-4ed1-816d-b3514374eaeb@github.com> References: <9VgAQeNZfUZJXO8llozcZZuRftv6kk43jw0YIrBIdck=.b5c89436-608b-4ed1-816d-b3514374eaeb@github.com> Message-ID: On Fri, 17 Mar 2023 14:34:26 GMT, Emanuel Peter wrote: > I discovered this bug during 
the bug fix of [JDK-8298935](https://bugs.openjdk.org/browse/JDK-8298935) [PR](https://git.openjdk.org/jdk/pull/12350). > > Currently, the SuperWord algorithm only ensures that all `packs` are `isomorphic` and `independent` (additionally memops are `adjacent`). > > This is **not sufficient**. We need to ensure that the `packs` do not introduce `cycles` into the graph. Example: > > https://github.com/openjdk/jdk/blob/ad580d18dbbf074c8a3692e2836839505b574326/test/hotspot/jtreg/compiler/loopopts/superword/TestIndependentPacksWithCyclicDependency.java#L217-L231 > > This is also mentioned in the [SuperWord Paper](https://groups.csail.mit.edu/cag/slp/SLP-PLDI-2000.pdf) (2000, Samuel Larsen and Saman Amarasinghe, Exploiting Superword Level Parallelism with Multimedia Instruction Sets): > > > 3.7 Scheduling > Dependence analysis before packing ensures that statements within a group can be executed > safely in parallel. However, it may be the case that executing two groups produces a dependence > violation. An example of this is shown in Figure 6. Here, dependence edges are drawn between > groups if a statement in one group is dependent on a statement in the other. As long as there > are no cycles in this dependence graph, all groups can be scheduled such that no violations > occur. However, a cycle indicates that the set of chosen groups is invalid and at least one group > will need to be eliminated. Although experimental data has shown this case to be extremely rare, > care must be taken to ensure correctness. > > > **Solution** > > Just before scheduling, I introduced `SuperWord::remove_cycles`. It creates a `PacksetGraph`, based on nodes in the `packs`, and scalar-nodes which are not in a pack. The edges are taken from `DepPreds`. We check if the graph can be scheduled without cycles (via topological sort). > > **FYI** > > I found a further bug, this time I think it happens during scheduling. See [JDK-8304720](https://bugs.openjdk.org/browse/JDK-8304720). 
Because of that, I had to disable a test case (`TestIndependentPacksWithCyclicDependency::test5`). I also had to require 64 bit, and either `avx2` or `asimd`. I hope we can lift that again once we fix the other bug. The issue is this: the cyclic dependency example can degenerate to non-cyclic ones, that need to reorder the non-vectorized memory operations. This pull request has now been integrated. Changeset: 83a924a1 Author: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/83a924a1008853dee2ead8f6c3a82f9e3abc6125 Stats: 862 lines in 6 files changed: 849 ins; 1 del; 12 mod 8304042: C2 SuperWord: schedule must remove packs with cyclic dependencies Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/13078 From epeter at openjdk.org Wed Apr 5 04:55:16 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 5 Apr 2023 04:55:16 GMT Subject: RFR: 8304042: C2 SuperWord: schedule must remove packs with cyclic dependencies [v2] In-Reply-To: References: <9VgAQeNZfUZJXO8llozcZZuRftv6kk43jw0YIrBIdck=.b5c89436-608b-4ed1-816d-b3514374eaeb@github.com> Message-ID: On Thu, 30 Mar 2023 16:29:10 GMT, Vladimir Kozlov wrote: >>> Do you know if this affect any our existing vector tests? >> >> @vladimir Thanks for the review. >> Yes. I had a run where I assert if I find cycles. I ran it up to tier5 and stress testing. And the assert was never triggered, except in the two regression tests that I added (there it triggered a lot). So I think it really has no effect, except the extra runtime. > >> > Do you know if this affect any our existing vector tests? >> >> @vladimir Thanks for the review. Yes. I had a run where I assert if I find cycles. I ran it up to tier5 and stress testing. And the assert was never triggered, except in the two regression tests that I added (there it triggered a lot). So I think it really has no effect, except the extra runtime. > > Perfect! Thank you for doing it. 
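As an illustration of the cycle check described in the 8304042 pull request (build a graph whose nodes are the packs plus the scalar operations outside any pack, then test whether it can be ordered topologically), here is a minimal Java sketch. It is not HotSpot code: the class `PackCycleCheck`, the method `isSchedulable`, and the integer node/edge encoding are hypothetical names chosen for this example, and the real `SuperWord::remove_cycles` and `PacksetGraph` in C2 are written in C++ and may differ in detail. The sketch uses Kahn's algorithm, one common way to perform a topological sort.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration only; not taken from the HotSpot sources.
public class PackCycleCheck {

    /**
     * Returns true if the dependence graph has a topological order.
     * Nodes are numbered 0..nodeCount-1; each node stands for one pack or
     * one scalar operation (an assumption of this sketch).
     * Each edge {from, to} means "to depends on from".
     */
    public static boolean isSchedulable(int nodeCount, int[][] edges) {
        List<List<Integer>> successors = new ArrayList<>();
        for (int i = 0; i < nodeCount; i++) {
            successors.add(new ArrayList<>());
        }
        int[] pendingPreds = new int[nodeCount];
        for (int[] edge : edges) {
            successors.get(edge[0]).add(edge[1]);
            pendingPreds[edge[1]]++;
        }

        // Start with every node that has no unscheduled predecessor.
        Deque<Integer> ready = new ArrayDeque<>();
        for (int i = 0; i < nodeCount; i++) {
            if (pendingPreds[i] == 0) {
                ready.add(i);
            }
        }

        int scheduled = 0;
        while (!ready.isEmpty()) {
            int node = ready.poll();
            scheduled++;
            for (int succ : successors.get(node)) {
                if (--pendingPreds[succ] == 0) {
                    ready.add(succ);
                }
            }
        }
        // Nodes on a cycle never reach zero pending predecessors.
        return scheduled == nodeCount;
    }

    public static void main(String[] args) {
        // A chain 0 -> 1 -> 2 can be scheduled.
        System.out.println(isSchedulable(3, new int[][] {{0, 1}, {1, 2}})); // prints true
        // Two nodes that depend on each other form a cycle.
        System.out.println(isSchedulable(2, new int[][] {{0, 1}, {1, 0}})); // prints false
    }
}
```

The first graph is a simple chain and can be scheduled; in the second, the two nodes depend on each other, so no topological order exists and, in the scheme described in the pull request, at least one pack would have to be dropped.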
Thanks @vnkozlov @fg1417 @TobiHartmann for the reviews and suggestions! ------------- PR Comment: https://git.openjdk.org/jdk/pull/13078#issuecomment-1496913196 From duke at openjdk.org Wed Apr 5 05:08:10 2023 From: duke at openjdk.org (Jasmine Karthikeyan) Date: Wed, 5 Apr 2023 05:08:10 GMT Subject: RFR: 8051725: Questionable if-conversion involving SETNE Message-ID: Hi, I've created optimizations for the x86 lowering of `Conv2B` nodes, when followed immediately by an xor of 1. This pattern is fairly common, and can arise from both [cmov idealization](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/movenode.cpp#L241) and [diamond-phi optimization](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L1571). The optimization here is using the `sete` instruction instead of always using `setne` and flipping the bit with xor afterwards. According to the Intel optimization guide (pages 3-26 and 3-27), this sequence is preferred over `cmp $0, %src` as it prevents the need to encode the constant in the assembly sequence. A similar rule exists in the PPC backend, here: https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/ppc/ppc.ad#L10462. I've attached some performance testing but I think the real world improvements will be less significant; the motivation is primarily to decrease the number of instructions that are generated, as that can help in cases where applications are I-Cache bound.

                                           Baseline                     Patch                Improvement
Benchmark                      Mode  Cnt   Score    Error  Units        Score    Error  Units
Conv2BRules.testEquals0        avgt   10  47.566 ± 0.346  ns/op   /   37.904 ± 1.856  ns/op   + 22.6%
Conv2BRules.testNotEquals0     avgt   10  37.167 ± 0.211  ns/op   /   37.352 ± 1.529  ns/op   (unchanged)
Conv2BRules.testEquals1        avgt   10  35.059 ± 0.280  ns/op   /   34.847 ± 0.160  ns/op   (unchanged)
Conv2BRules.testEqualsNull     avgt   10  56.768 ± 2.600  ns/op   /   46.916 ± 0.308  ns/op   + 19.0%
Conv2BRules.testNotEqualsNull  avgt   10  47.447 ± 1.193  ns/op   /   46.974 ± 0.218  ns/op   (unchanged)

This change also cleans up some code relating to `Assembler::set_byte_if_not_zero`, as that function duplicates behavior with `Assembler::setne`. The 32-bit only version of that method is never called as the only other usage is in the C1 LIR assembler, which is also guarded behind a 64-bit check so I opted to remove it entirely and replace usages with `Assembler::setne`. Reviews would be greatly appreciated! Testing: tier1-2 on linux x64, GHA ------------- Commit messages: - Fix whitespace and add bug tag to IR test - Merge branch 'master' into conv2b-x86-lowering - Merge branch 'master' into conv2b-x86-lowering - Merge branch 'master' into conv2b-x86-lowering - Merge branch 'master' into conv2b-x86-lowering - Add flipped versions of Conv2B rules to decrease generated code complexity Changes: https://git.openjdk.org/jdk/pull/13345/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13345&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8051725 Stats: 265 lines in 7 files changed: 251 ins; 11 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/13345.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13345/head:pull/13345 PR: https://git.openjdk.org/jdk/pull/13345 From xliu at openjdk.org Wed Apr 5 05:08:12 2023 From: xliu at openjdk.org (Xin Liu) Date: Wed, 5 Apr 2023 05:08:12 GMT Subject: RFR: 8300829: Make CtwRunner available as an independent tool In-Reply-To: References: Message-ID: On Wed, 5 Apr 2023 04:46:21 GMT, Joshua Cao wrote: > 1. Create an independent jar file with CtwRunner as the main class to make it easier to run > 2. Output the class files directly into the destination directory. Currently, CTW expects a `wb.jar`, but the jtreg tests that use CTWRunner has class files outside of a jar. > 3. Introduce `sun.hotspot.tools.ctwrunner.ctw_extra_args` option to pass extra arguments to CTW. Arguments are comma separated because working with spaces in bash can be kind of awkward, but I'm open to changing this part.
> > ### Motivation > CTWRunner is a wrapper around CTW that will continue compiling beyond failure. It can be useful for testing compilation with certain flags. For example, one could run > > > JAVA_OPTIONS="-Dsun.hotspot.tools.ctwrunner.ctw_extra_args=-XX:+StressLCM,-XX:+StressGCM" ./ctwrunner.sh modules:java.base > > > To test compiling the java.base module with `-XX:+StressLCM -XX:+StressGCM`. This is advantageous over uses CTW because we can see the full list of crashes for the entire module. test/hotspot/jtreg/testlibrary/ctw/Makefile line 94: > 92: @rm -rf $(OUTPUT_DIR) > 93: > 94: $(DST_DIR)/ctwrunner.jar: filelist $(DST_DIR)/wb.jar Theoretically, we can avoid generating ctwrunner.jar. The classes are the same as in 'ctw.jar' and the only difference is the entry point. To launch CTWRunner, you can do something like this in ctwrunner.sh: echo '$${JAVA_HOME}/bin/java $${JAVA_OPTIONS} -Dtest.jdk=$${JAVA_HOME} -cp ctw.jar $CTWRUNNER_MAIN_CLASS $$@' > $@ It's up to you. If we generate ctwrunner.jar, we will have a "standalone" jar. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13344#discussion_r1158027103 From xliu at openjdk.org Wed Apr 5 05:14:05 2023 From: xliu at openjdk.org (Xin Liu) Date: Wed, 5 Apr 2023 05:14:05 GMT Subject: RFR: 8300829: Make CtwRunner available as an independent tool In-Reply-To: References: Message-ID: On Wed, 5 Apr 2023 04:46:21 GMT, Joshua Cao wrote: > 1. Create an independent jar file with CtwRunner as the main class to make it easier to run > 2. Output the class files directly into the destination directory. Currently, CTW expects a `wb.jar`, but the jtreg tests that use CTWRunner has class files outside of a jar. > 3. Introduce `sun.hotspot.tools.ctwrunner.ctw_extra_args` option to pass extra arguments to CTW. Arguments are comma separated because working with spaces in bash can be kind of awkward, but I'm open to changing this part.
> > ### Motivation > CTWRunner is a wrapper around CTW that will continue compiling beyond failure. It can be useful for testing compilation with certain flags. For example, one could run > > > JAVA_OPTIONS="-Dsun.hotspot.tools.ctwrunner.ctw_extra_args=-XX:+StressLCM,-XX:+StressGCM" ./ctwrunner.sh modules:java.base > > > To test compiling the java.base module with `-XX:+StressLCM -XX:+StressGCM`. This is advantageous over uses CTW because we can see the full list of crashes for the entire module. test/hotspot/jtreg/testlibrary/ctw/src/sun/hotspot/tools/ctw/CtwRunner.java line 60: > 58: * comma-separated arguments to pass to CTW subprocesses. > 59: */ > 60: public static final String CTW_EXTRA_ARGS How about using `getProperty("sun.hotspot.tools.ctwrunner.ctw_extra_args", "")`? By giving it an empty string as the default value, you can take out the `if (null != CTW_EXTRA_ARGS)` check below. By the way, you may also need to update the year in the copyright header. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13344#discussion_r1158030134 From rcastanedalo at openjdk.org Wed Apr 5 07:19:06 2023 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Wed, 5 Apr 2023 07:19:06 GMT Subject: RFR: 8305543: Ensure GC barriers for arraycopy on AArch64 use caller saved neon temp registers In-Reply-To: <4RyYObqmSswKoIdIw59VFyr6m47nN0rsiPu4ebsrlC0=.5ab0d5e2-aa03-4bf0-9b77-237a74d16670@github.com> References: <4RyYObqmSswKoIdIw59VFyr6m47nN0rsiPu4ebsrlC0=.5ab0d5e2-aa03-4bf0-9b77-237a74d16670@github.com> Message-ID: On Tue, 4 Apr 2023 13:02:05 GMT, Erik Österlund wrote: > The arraycopy stubs on AArch64 now allows the GC to vectorize arraycopy barriers. That's great! But the gct3 registers we hand to the GC is v8 today, which is callee saved (well at least the lower 64 bits). Therefore, if the GC clobbers this temp registers, it can have unexpected side effects on the caller float/double registers. We should use a caller saved register instead.
> This is only used by generational ZGC, so isn't a mainline bug yet. We should fix it before it becomes one. Looks good! We should improve our test coverage in this area. ------------- Marked as reviewed by rcastanedalo (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13325#pullrequestreview-1372290277 From aph at openjdk.org Wed Apr 5 08:46:14 2023 From: aph at openjdk.org (Andrew Haley) Date: Wed, 5 Apr 2023 08:46:14 GMT Subject: RFR: 8305543: Ensure GC barriers for arraycopy on AArch64 use caller saved neon temp registers In-Reply-To: <4RyYObqmSswKoIdIw59VFyr6m47nN0rsiPu4ebsrlC0=.5ab0d5e2-aa03-4bf0-9b77-237a74d16670@github.com> References: <4RyYObqmSswKoIdIw59VFyr6m47nN0rsiPu4ebsrlC0=.5ab0d5e2-aa03-4bf0-9b77-237a74d16670@github.com> Message-ID: On Tue, 4 Apr 2023 13:02:05 GMT, Erik Österlund wrote: > The arraycopy stubs on AArch64 now allows the GC to vectorize arraycopy barriers. That's great! But the gct3 registers we hand to the GC is v8 today, which is callee saved (well at least the lower 64 bits). Therefore, if the GC clobbers this temp registers, it can have unexpected side effects on the caller float/double registers. We should use a caller saved register instead. > This is only used by generational ZGC, so isn't a mainline bug yet. We should fix it before it becomes one. Great catch, thanks. Shouldn't this say "callee-saved" in the title? ------------- Marked as reviewed by aph (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13325#pullrequestreview-1372439604 From aph at openjdk.org Wed Apr 5 08:49:15 2023 From: aph at openjdk.org (Andrew Haley) Date: Wed, 5 Apr 2023 08:49:15 GMT Subject: RFR: 8305543: Ensure GC barriers for arraycopy on AArch64 use caller saved neon temp registers In-Reply-To: References: <4RyYObqmSswKoIdIw59VFyr6m47nN0rsiPu4ebsrlC0=.5ab0d5e2-aa03-4bf0-9b77-237a74d16670@github.com> Message-ID: On Wed, 5 Apr 2023 07:16:43 GMT, Roberto Castañeda Lozano wrote: > Looks good! We should improve our test coverage in this area.
In debug mode we could (fairly easily?) check for corruption. We could also clobber all of the call-clobbered registers. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13325#issuecomment-1497135976 From aph at openjdk.org Wed Apr 5 09:05:15 2023 From: aph at openjdk.org (Andrew Haley) Date: Wed, 5 Apr 2023 09:05:15 GMT Subject: RFR: 8305351: C2 setScopedValueCache intrinsic doesn't use access API In-Reply-To: References: Message-ID: On Tue, 4 Apr 2023 12:40:14 GMT, Erik Österlund wrote: > The setScopedValueCache intrinsic for C2 doesn't use the access API. Instead, we store into an OopHandle with a raw store. That doesn't necessarily play well with all GCs, for example Shenandoah and generational ZGC. We should use the access API to ensure the right barriers are emitted. Marked as reviewed by aph (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/13324#pullrequestreview-1372471590 From rcastanedalo at openjdk.org Wed Apr 5 09:16:19 2023 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Wed, 5 Apr 2023 09:16:19 GMT Subject: RFR: 8287087: C2: perform SLP reduction analysis on-demand [v2] In-Reply-To: References: Message-ID: <0LOm4bHAMpi3f86vxVovim3MbGPgV8YERm7LezcYa0g=.26cb83cc-2d5d-434a-a449-d42b50d97d2c@github.com> On Sun, 2 Apr 2023 05:52:17 GMT, Jatin Bhateja wrote: >> src/hotspot/share/opto/superword.cpp line 504: >> >>> 502: // to the phi node following edge index 'input'. >>> 503: PathEnd path = >>> 504: find_in_path( >> >> Hi @robcasloz, >> find_in_path expects reduction nodes to be present at same edge indices in the reduction chain, it also honors has_swapped_edge flag during backward traversal. >> However, there are still some ideal transforms like following which may break the reduction chain and this will prevent Min/Max reductions for test case mentioned in JDK-8302673.
>> https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/addnode.cpp#L1147 >> https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/addnode.cpp#L1230 > > One way to add fault-tolerance to find_in_path could be to follow strict DFS semantics where an alternate path is taken if node's predicates are not satisfied, currently we are starting all over again from the first node of chain with a different reduction_input which prevents inferring reduction chain even though all the nodes in the chain are commutative isomorphic operations. Thanks for the observation and the suggestion! This changeset proposes swapped edge tracking for efficiency and better worst-case behavior of the analysis, but as you observe this is done at the expense of robustness. I will investigate whether the swapped edge tracking approach can be extended to deal with the `MinI`/`MaxI` transformations that you mention. If not, I will re-consider using a generic search approach like you and @eme64 suggest. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13120#discussion_r1158247329 From rcastanedalo at openjdk.org Wed Apr 5 09:16:25 2023 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Wed, 5 Apr 2023 09:16:25 GMT Subject: RFR: 8287087: C2: perform SLP reduction analysis on-demand [v2] In-Reply-To: References: Message-ID: On Sun, 2 Apr 2023 05:14:37 GMT, Jatin Bhateja wrote: >> Roberto Castañeda Lozano has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains 28 additional commits since the last revision: >> >> - Merge master >> - Relax the reduction cycle search bound >> - Remove redundant IR check precondition >> - Use SuperWord members in reduction marking >> - Remove redundant opcode checks >> - Do not run test in x86-32 >> - Update existing test instead of removing it >> - Add negative vectorization test >> - Update copyright headers >> - Add two more reduction vectorization microbenchmarks >> - ... and 18 more: https://git.openjdk.org/jdk/compare/dd0f65e9...95f6cc33 > > src/hotspot/share/opto/superword.cpp line 539: > >> 537: pred = current; >> 538: current = original_input(current, reduction_input); >> 539: } > > If we bookkeep the nodes in the reduction chain path during initial backward traversal we may simplify this checking and also another call to _original_input_ while populating _loop_reductions set on [#L547 ](https://github.com/openjdk/jdk/pull/13120/files#diff-8f29dd005a0f949d108687dabb7379c73dfd85cd782da453509dc9b6cb8c9f81R547) Thanks, will consider this together with your earlier comments. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13120#discussion_r1158248611 From eosterlund at openjdk.org Wed Apr 5 09:27:15 2023 From: eosterlund at openjdk.org (Erik Österlund) Date: Wed, 5 Apr 2023 09:27:15 GMT Subject: RFR: 8305351: C2 setScopedValueCache intrinsic doesn't use access API In-Reply-To: References: Message-ID: On Wed, 5 Apr 2023 09:01:56 GMT, Andrew Haley wrote: >> The setScopedValueCache intrinsic for C2 doesn't use the access API. Instead, we store into an OopHandle with a raw store. That doesn't necessarily play well with all GCs, for example Shenandoah and generational ZGC. We should use the access API to ensure the right barriers are emitted. > > Marked as reviewed by aph (Reviewer). Thanks for the reviews @theRealAph, @vnkozlov and @robcasloz!
------------- PR Comment: https://git.openjdk.org/jdk/pull/13324#issuecomment-1497184043 From rcastanedalo at openjdk.org Wed Apr 5 09:31:13 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Wed, 5 Apr 2023 09:31:13 GMT Subject: RFR: 8287087: C2: perform SLP reduction analysis on-demand [v2] In-Reply-To: References: Message-ID: On Fri, 24 Mar 2023 15:21:45 GMT, Roberto Casta?eda Lozano wrote: >> Reduction analysis finds cycles of reduction operations within loops. The result of this analysis is used by SLP auto-vectorization (to vectorize reductions if deemed profitable) and by x64 instruction matching (to select specialized scalar floating-point `Math.min()/max()` implementations). Currently, reduction analysis is applied early (before loop unrolling), and the result is propagated through loop unrolling by marking nodes and loops with special reduction flags. Applying reduction analysis early is efficient, but propagating the results correctly through loop unrolling and arbitrary graph transformations is challenging and often leads to inconsistent node-loop reduction flag states, some of which have led to actual miscompilations in the past (see [JDK-8261147](https://bugs.openjdk.org/browse/JDK-8261147) and [JDK-8279622](https://bugs.openjdk.org/browse/JDK-8279622)). >> >> This changeset postpones reduction analysis to the point where its results are actually used. To do so, it generalizes the analysis to find reduction cycles on unrolled loops: >> >> ![reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226725587-b7d68509-3717-4bbe-8d54-f9a105853fda.png) >> >> The generalized analysis precludes the need to maintain and propagate node and loop reduction flags through arbitrary IR transformations, reducing the risk of miscompilations due to invalidation of the analysis results. 
The generalization is slightly more costly than the current analysis, but still negligible in micro- and general benchmarks. >> >> ## Performance Benefits >> >> As a side benefit, the proposed generalization is able to find more reductions, increasing the scope of auto-vectorization and the performance of x64 floating-point `Math.min()/max()` in multiple scenarios. >> >> ### Increased Auto-Vectorization Scope >> >> There are two main scenarios in which the proposed changeset enables further auto-vectorization: >> >> #### Reductions Using Global Accumulators >> >> >> public class Foo { >> int acc = 0; >> (..) >> void reduce(int[] array) { >> for (int i = 0; i < array.length; i++) { >> acc += array[i]; >> } >> } >> } >> >> Initially, such reductions are wrapped by load and store nodes, which defeats the current reduction analysis. However, after unrolling and other optimizations are applied, the reduction becomes recognizable by the proposed analysis: >> >> ![global-reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226745351-33494e40-7c07-4a8b-8bf6-d3a96e84b1c2.png) >> >> #### Reductions of partially unrolled loops >> >> >> (..) >> for (int i = 0; i < array.length / 2; i++) { >> acc += array[2*i]; >> acc += array[2*i + 1]; >> } >> (..) >> >> >> These reductions are manually unrolled from the beginning, so the current reduction analysis fails to find them, while the proposed analysis is able to detect them as if they were unrolled automatically. >> >> ### Increased Performance of x64 Floating-Point `Math.min()/max()` >> >> Besides the above scenarios, the proposed generalization allows the x64 matcher to select specialized floating-point `Math.min()/max()` implementations for reductions in non-counted and outer loops (see the new micro-benchmarks in `FpMinMaxIntrinsics.java` for more details). 
>> >> ## Implementation details >> >> The generalized reduction analysis finds reductions in a loop by looking for chains of reduction operators of the same node type starting and finishing on each phi node in the loop. To avoid a combinatorial explosion, the analysis assumes that all nodes in a chain are connected via the same edge index, which is realistic because chains usually consist of identical nodes cloned by loop unrolling. This assumption allows the analysis to test only two paths for each examined phi node. A failure of this assumption (e.g. as illustrated in test case `testReductionOnPartiallyUnrolledLoopWithSwappedInputs` from `TestGeneralizedReductions.java`) results in missing vectorization but does not affect correctness. Note that the same-index assumption can only fail in cases where current auto-vectorization would also fail to vectorize (manually unrolled loops). >> >> A complication results from edge swapping in the nodes cloned by loop unrolling (see [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/addnode.cpp#L123) and [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/mulnode.cpp#L113)), which can lead to reduction chains connected via different input indices. This is addressed by tracking whether nodes have swapped edges and adjusting the explored input indices in the reduction analysis accordingly. An alternative (proposed by @eme64) is to replace this changeset's linear chain finding approach with a more general shortest path-finding algorithm. This alternative might preclude the need for tracking edge swapping at a potentially higher computational cost. Since the trade-off is not obvious, I propose to investigate it in a follow-up RFE. >> >> The changeset implements a more relaxed version of the reduction analysis for x64 matching, suitable for queries on single nodes.
This analysis is run only in the presence of `[Min|Max][F|D]` nodes. >> >> ## Testing >> >> ### Functionality >> >> - tier1-5 (linux-x64, linux-aarch64, windows-x64, macosx-x64, and macosx-aarch64). >> - fuzzing (12 h. on linux-x64 and linux-aarch64). >> >> ##### TestGeneralizedReductions.java >> >> Tests the new scenarios in which vectorization occurs. These tests are restricted to 64-bits platforms, since I do not have access to 32-bits ones. `testReductionOnPartiallyUnrolledLoop` has been observed to fail on [linux-x86](https://github.com/robcasloz/jdk/actions/runs/4478959520/jobs/7873827856#logs) due to missing vectorization. If anyone wants to have a look and derive the necessary IR test framework preconditions for the test to pass on linux-x86, I am happy to lift the 64-bits restriction. >> >> ##### TestFpMinMaxReductions.java >> >> Tests the matching of floating-point max/min implementations in x64. >> >> ##### TestSuperwordFailsUnrolling.java >> >> This test file is updated to ensure auto-vectorization is never triggered, because this changeset would otherwise enable it and defeat the purpose of the test. >> >> ### Performance >> >> #### General Benchmarks >> >> The changeset does not cause any performance regression on the DaCapo, SPECjvm 2008, and SPECjbb2015 benchmark suites for linux-x64 and linux-aarch64. >> >> #### Micro-benchmarks >> >> The changeset extends two existing files with additional micro-benchmarks that show the benefit of the generalized reduction analysis ([full results](https://github.com/openjdk/jdk/files/11039207/microbenchmark-results.ods)). >> >> >> ##### VectorReduction.java >> >> These micro-benchmarks are first adjusted to actually vectorize in the mainline approach, since they suffered from the global-accumulator limitation. Two micro-benchmarks are added to exercise vectorization in the presence of global accumulators and partially unrolled loops. 
Running `VectorReduction.java` on an x64 (Cascade Lake) machine confirms the expectations: compared to mainline (with the adjustment mentioned above), this changeset yields similar performance results except for `andRedIOnGlobalAccumulator` and `andRedIPartiallyUnrolled`, where the changeset improves performance by 2.4x in both cases. >> >> ##### MaxIntrinsics.java >> >> This file is extended with four new micro-benchmarks. Running it on the same machine as above shows that the changeset does not affect the performance of the existing micro-benchmarks, and improves moderately to substantially the performance of the new ones (because it allows the x64 matcher to select a floating-point `Math.min()` implementation that is specialized for reduction min operations): >> >> | micro-benchmark | speedup compared to mainline | >> | --- | --- | >> | `fMinReduceInOuterLoop` | 1.1x | >> | `fMinReduceNonCounted` | 2.3x | >> | `fMinReduceGlobalAccumulator` | 2.4x | >> | `fMinReducePartiallyUnrolled` | 3.9x | >> >> ## Acknowledgments >> >> Thanks to @danielogh for making it possible to test this improvement with confidence ([JDK-8294715](https://bugs.openjdk.org/browse/JDK-8294715)) and to @TobiHartmann, @chhagedorn, @vnkozlov and @eme64 for discussions and useful feedback. > > Roberto Casta?eda Lozano has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 28 additional commits since the last revision: > > - Merge master > - Relax the reduction cycle search bound > - Remove redundant IR check precondition > - Use SuperWord members in reduction marking > - Remove redundant opcode checks > - Do not run test in x86-32 > - Update existing test instead of removing it > - Add negative vectorization test > - Update copyright headers > - Add two more reduction vectorization microbenchmarks > - ... 
and 18 more: https://git.openjdk.org/jdk/compare/7e320e32...95f6cc33 Moving back to draft mode for further investigation and conflict resolution. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13120#issuecomment-1497188979 From eosterlund at openjdk.org Wed Apr 5 09:32:20 2023 From: eosterlund at openjdk.org (Erik =?UTF-8?B?w5ZzdGVybHVuZA==?=) Date: Wed, 5 Apr 2023 09:32:20 GMT Subject: RFR: 8305543: Ensure GC barriers for arraycopy on AArch64 use caller saved neon temp registers In-Reply-To: References: <4RyYObqmSswKoIdIw59VFyr6m47nN0rsiPu4ebsrlC0=.5ab0d5e2-aa03-4bf0-9b77-237a74d16670@github.com> Message-ID: On Wed, 5 Apr 2023 08:46:43 GMT, Andrew Haley wrote: >> Looks good! We should improve our test coverage in this area. > >> Looks good! We should improve our test coverage in this area. > > In debug mode we could (fairly easily?) check for corruption. We could also clobber all of the call-clobbered registers. Thanks for the reviews @theRealAph and @robcasloz! > > Looks good! We should improve our test coverage in this area. > > > > In debug mode we could (fairly easily?) check for corruption. We could also clobber all of the call-clobbered registers. Yes @xmas92 had a similar idea to check for clobbering, but I think we should do that for all these stubs in the stub generator, rather than make it an arraycopy thing, IMO. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/13325#issuecomment-1497188269 PR Comment: https://git.openjdk.org/jdk/pull/13325#issuecomment-1497190687 From tholenstein at openjdk.org Wed Apr 5 09:39:24 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Wed, 5 Apr 2023 09:39:24 GMT Subject: RFR: JDK-8305356: Fix ignored bad CompileCommands in tests [v2] In-Reply-To: References: <_jdVLVxHge11PN9t-j4CHRZPEWmJcP4yeaQiuKXDJYM=.b61be8f6-507f-448a-b79f-290b7f7b01fd@github.com> Message-ID: On Mon, 3 Apr 2023 14:01:52 GMT, Tobias Hartmann wrote: >> Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: >> >> copyright year > > Looks good to me. FTR, [JDK-8282797](https://bugs.openjdk.org/browse/JDK-8282797) will then enforce this. Thanks @TobiHartmann and @chhagedorn for the reviews! ------------- PR Comment: https://git.openjdk.org/jdk/pull/13297#issuecomment-1497196527 From tholenstein at openjdk.org Wed Apr 5 09:39:24 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Wed, 5 Apr 2023 09:39:24 GMT Subject: Integrated: JDK-8305356: Fix ignored bad CompileCommands in tests In-Reply-To: <_jdVLVxHge11PN9t-j4CHRZPEWmJcP4yeaQiuKXDJYM=.b61be8f6-507f-448a-b79f-290b7f7b01fd@github.com> References: <_jdVLVxHge11PN9t-j4CHRZPEWmJcP4yeaQiuKXDJYM=.b61be8f6-507f-448a-b79f-290b7f7b01fd@github.com> Message-ID: On Mon, 3 Apr 2023 12:31:41 GMT, Tobias Holenstein wrote: > The following tests have a wrong `CompileCommand` and print an error message in `CompilerOracle::print_parse_error`: > > * missing `::*`: > - `test/hotspot/jtreg/compiler/loopopts/TestPeelingRemoveDominatedTest.java` > > * used unsupported quotation marks `""`: > - `test/hotspot/jtreg/compiler/integerArithmetic/TestNegMultiply.java` > - `test/hotspot/jtreg/compiler/integerArithmetic/TestNegAnd.java` > > * Use of the pattern `CompileCommand=option,Klass::method,type,option,value` but `DisableIntrinsic` has option type
`ccstrlist` and not `ccstr`: > - `test/hotspot/jtreg/compiler/intrinsics/bigInteger/TestMulAdd.java` > - `test/hotspot/jtreg/compiler/intrinsics/bigInteger/TestMultiplyToLen.java` > - `test/hotspot/jtreg/compiler/intrinsics/bigInteger/TestShift.java` > - `test/hotspot/jtreg/compiler/intrinsics/bigInteger/TestSquareToLen.java` This pull request has now been integrated. Changeset: 0e0c022b Author: Tobias Holenstein URL: https://git.openjdk.org/jdk/commit/0e0c022b1f870806963789afdef9298851719498 Stats: 25 lines in 7 files changed: 2 ins; 0 del; 23 mod 8305356: Fix ignored bad CompileCommands in tests Reviewed-by: thartmann, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/13297 From tholenstein at openjdk.org Wed Apr 5 09:41:20 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Wed, 5 Apr 2023 09:41:20 GMT Subject: RFR: JDK-8305644: IGV: Node text not updated when switching from/to CFG view Message-ID: <-dti5UPgMIKgdtePCqm0OtVlE957l_DGyd2mOrgDX6M=.23cfb32d-1f13-4b54-bdc1-c7a39a27c84a@github.com> When switching the layouting mode to or from CFG the node text is not updated. 
- Switching to `CFG view` gave the wrong node text: fail_to - with this fix it looks like this fix_to - Switching from `CFG view` gave the wrong node text: fail_from - with this fix it looks like this fix-from ------------- Commit messages: - JDK-8305644: IGV: Node text not updated when switching from/to CFG view Changes: https://git.openjdk.org/jdk/pull/13348/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13348&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8305644 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/13348.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13348/head:pull/13348 PR: https://git.openjdk.org/jdk/pull/13348 From epeter at openjdk.org Wed Apr 5 11:03:31 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 5 Apr 2023 11:03:31 GMT Subject: RFR: 8303466: C2: failed: malformed control flow. Limit type made precise with MaxL/MinL Message-ID: **Context** During `PhaseIdealLoop::do_unroll`, we hack the loop-limit, and subtract `stride` from it. We have to prevent underflow on that subtract. Currently, we do this with a `CMoveI`. The problem with this: `CMoveI` is not smart enough to generate a precise type. For example, there are many cases where the input types get better, and underflow is not possible anymore. But the `CMoveI` does not detect this, and still has type `min_jint..hi`. We have the same issue in `PhaseIdealLoop::adjust_limit`, where we use `CMoveL` to implement long max/min. The types are not as precise as they could and should be. **Problem** The imprecise type is used for the zero-trip-guard. It does not fold to false, even though the data-path into the post loop does constant fold to `TOP`. The graph breaks, and assert `malformed control flow` triggers. Details: In these cases, we have the super-unrolled main-loop (SuperWord'ed, then further unrolled) directly leading to a vectorized post-loop. 
The effect is that there is no `region/phi` merging main-exit and main-zero-trip-guard. So the types are already narrower here. It may be possible that the values are such that we find out that we should never enter the vectorized post-loop. But if data finds out and control does not, we get a broken graph. Note: we have a pre-loop. Then a main-loop and vectorized post loop. Then we merge the main-zero-trip-guard. And at the end we have the scalar post loop. I have already recently fixed a bug around this `CMoveI`. https://github.com/openjdk/jdk/commit/5a4945c0d95423d0ab07762c915e9cb4d3c66abb I would now like to have a more satisfactory fix that properly propagates the types. **Solution** `PhaseIdealLoop::adjust_limit` already converts the limit from int to long, and does all computations in long, including taking max/min with a `CMoveL`. I now use the so far unused `MaxL/MinL`. I implemented some missing `Value/Identity` components for it. Since `MaxL/MinL` is not implemented in the backend, I just expand it in macro-expansion to a `CMoveL`. At that point the loop-opts are over, and it is most likely ok that we do not make the types more precise after this. I take the same approach for `PhaseIdealLoop::do_unroll`: convert limits to long, do subtraction in long, take `MinL/MaxL` to clamp it to the int-range (prevent subtraction underflow). **Discussion** This solution seems much cleaner to me, and I hope that we will see fewer bugs because of imprecise types in the limit computation, which were often due to the `CMove` not being smart enough to analyze all inputs (it would have to recognize a multitude of patterns, for the Cmp inputs and the direct inputs to the CMove - we currently do not do that, but just take the union of the input types - this is very imprecise). There is a bit of an overhead here: We use longs even though we only want to have int values. But I think we should prefer a clean implementation here, with correct type computation.
The performance impact is probably non-existent on 64-bit machines anyway. **Caveat** I found some cases with the same assert `malformed control flow` that are most likely skeleton/assertion predicate bugs [JDK-8288981](https://bugs.openjdk.org/browse/JDK-8288981). Some of those cases were new patterns, for example where we PreMainPost a main loop. I hope that this fix here at least reduces the frequency of failures significantly. **Future Work** We should implement `MaxL/MinL` in the backend. We should also use them during parsing. This would also allow to `SuperWord` the instruction, on the platforms that support it. Should we add such an assert during IGVN? I think after IGVN, we should never have a `MultiBranchNode` that does not have the required number of outputs, right? We could add it to `VerifyIterativeGVN`. ------------- Commit messages: - Renamed regression test, removed old test2 - remove spurious empty line - Use the MaxL/MinL consistently in PhaseIdealLoop::adjust_limit - do_unroll with MaxL/MinL - revert SubINoUnderflow idea - typo: flipped compare - added assert for test mask - 8303466: C2: failed: malformed control flow. 
Introducing SubINoUnderflowNode Changes: https://git.openjdk.org/jdk/pull/13269/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13269&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8303466 Stats: 269 lines in 5 files changed: 178 ins; 69 del; 22 mod Patch: https://git.openjdk.org/jdk/pull/13269.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13269/head:pull/13269 PR: https://git.openjdk.org/jdk/pull/13269 From rcastanedalo at openjdk.org Wed Apr 5 11:23:06 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Wed, 5 Apr 2023 11:23:06 GMT Subject: RFR: JDK-8305644: IGV: Node text not updated when switching from/to CFG view In-Reply-To: <-dti5UPgMIKgdtePCqm0OtVlE957l_DGyd2mOrgDX6M=.23cfb32d-1f13-4b54-bdc1-c7a39a27c84a@github.com> References: <-dti5UPgMIKgdtePCqm0OtVlE957l_DGyd2mOrgDX6M=.23cfb32d-1f13-4b54-bdc1-c7a39a27c84a@github.com> Message-ID: On Wed, 5 Apr 2023 09:33:21 GMT, Tobias Holenstein wrote: > When switching the layouting mode to or from CFG the node text is not updated. > > - Switching to `CFG view` gave the wrong node text: > fail_to > - with this fix it looks like this > fix_to > > - Switching from `CFG view` gave the wrong node text: > fail_from > - with this fix it looks like this > fix-from Looks good, thanks for fixing this! I must have missed testing this scenario when merging the changes from JDK-8302644 into JDK-8302738. ------------- Marked as reviewed by rcastanedalo (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/13348#pullrequestreview-1372694070 From chagedorn at openjdk.org Wed Apr 5 11:37:03 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 5 Apr 2023 11:37:03 GMT Subject: RFR: JDK-8305644: IGV: Node text not updated when switching from/to CFG view In-Reply-To: <-dti5UPgMIKgdtePCqm0OtVlE957l_DGyd2mOrgDX6M=.23cfb32d-1f13-4b54-bdc1-c7a39a27c84a@github.com> References: <-dti5UPgMIKgdtePCqm0OtVlE957l_DGyd2mOrgDX6M=.23cfb32d-1f13-4b54-bdc1-c7a39a27c84a@github.com> Message-ID: On Wed, 5 Apr 2023 09:33:21 GMT, Tobias Holenstein wrote: > When switching the layouting mode to or from CFG the node text is not updated. > > - Switching to `CFG view` gave the wrong node text: > fail_to > - with this fix it looks like this > fix_to > > - Switching from `CFG view` gave the wrong node text: > fail_from > - with this fix it looks like this > fix-from Given that this line was dropped by mistake, the fix to add the line again looks good and trivial. Nevertheless, setting a field in a getter method like that seems wrong/unexpected. ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13348#pullrequestreview-1372712215 From tholenstein at openjdk.org Wed Apr 5 11:53:05 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Wed, 5 Apr 2023 11:53:05 GMT Subject: RFR: JDK-8305644: IGV: Node text not updated when switching from/to CFG view [v2] In-Reply-To: <-dti5UPgMIKgdtePCqm0OtVlE957l_DGyd2mOrgDX6M=.23cfb32d-1f13-4b54-bdc1-c7a39a27c84a@github.com> References: <-dti5UPgMIKgdtePCqm0OtVlE957l_DGyd2mOrgDX6M=.23cfb32d-1f13-4b54-bdc1-c7a39a27c84a@github.com> Message-ID: <0geWftbRkANPFckOXHNBcfS6GZDdWBj0Oa0Xv5ZA0fU=.d240400c-7e4c-4152-9e06-28a0848a3789@github.com> > When switching the layouting mode to or from CFG the node text is not updated. 
> > - Switching to `CFG view` gave the wrong node text: > fail_to > - with this fix it looks like this > fix_to > > - Switching from `CFG view` gave the wrong node text: > fail_from > - with this fix it looks like this > fix-from Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: fix the bug properly ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13348/files - new: https://git.openjdk.org/jdk/pull/13348/files/06756d71..8b69ac9b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13348&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13348&range=00-01 Stats: 2 lines in 1 file changed: 1 ins; 1 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/13348.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13348/head:pull/13348 PR: https://git.openjdk.org/jdk/pull/13348 From tholenstein at openjdk.org Wed Apr 5 11:53:06 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Wed, 5 Apr 2023 11:53:06 GMT Subject: RFR: JDK-8305644: IGV: Node text not updated when switching from/to CFG view [v2] In-Reply-To: References: <-dti5UPgMIKgdtePCqm0OtVlE957l_DGyd2mOrgDX6M=.23cfb32d-1f13-4b54-bdc1-c7a39a27c84a@github.com> Message-ID: On Wed, 5 Apr 2023 11:33:53 GMT, Christian Hagedorn wrote: > Given that this line was dropped by mistake, the fix to add the line again looks good and trivial. Nevertheless, setting a field in a getter method like that seems wrong/unexpected. You are right. Even though re-adding the dropped line fixes the issue, it is still bad practice. I have now fixed the issue properly: `showCFG` is set in `setShowCFG(boolean enable)`, which is the right place to also call `diagram.setCFG(enable);`, and NOT in `getDiagram()`, since getters should be free of side effects. Thanks for the input @chhagedorn!
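The pattern described in this message (update the diagram in the setter, keep the getter free of side effects) can be sketched in Java roughly as follows. This is only an illustration and not the actual IGV code: `showCFG`, `setShowCFG(boolean enable)`, `getDiagram()` and `diagram.setCFG(enable)` are the names used above, while the classes `DiagramSketch` and `DiagramViewModelSketch` and the accessors `isCFG()` and `getShowCFG()` are hypothetical.

```java
/** Hypothetical stand-in for the IGV diagram; only setCFG is named in the thread. */
class DiagramSketch {
    private boolean cfg;

    void setCFG(boolean enable) {
        cfg = enable;
    }

    // Hypothetical accessor, added so the effect of setCFG can be observed.
    boolean isCFG() {
        return cfg;
    }
}

/** Hypothetical stand-in for the view model that owns the diagram. */
class DiagramViewModelSketch {
    private final DiagramSketch diagram = new DiagramSketch();
    private boolean showCFG;

    // The setter is the single place where the flag and the diagram change together.
    void setShowCFG(boolean enable) {
        showCFG = enable;
        diagram.setCFG(enable);
    }

    boolean getShowCFG() {
        return showCFG;
    }

    // The getter only returns state; it never modifies the diagram.
    DiagramSketch getDiagram() {
        return diagram;
    }
}
```

With this shape, calling `getDiagram()` any number of times leaves the diagram unchanged, and the layout mode seen by the diagram changes only when `setShowCFG` is called.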
------------- PR Comment: https://git.openjdk.org/jdk/pull/13348#issuecomment-1497352443 From chagedorn at openjdk.org Wed Apr 5 12:00:10 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 5 Apr 2023 12:00:10 GMT Subject: RFR: JDK-8305644: IGV: Node text not updated when switching from/to CFG view [v2] In-Reply-To: <0geWftbRkANPFckOXHNBcfS6GZDdWBj0Oa0Xv5ZA0fU=.d240400c-7e4c-4152-9e06-28a0848a3789@github.com> References: <-dti5UPgMIKgdtePCqm0OtVlE957l_DGyd2mOrgDX6M=.23cfb32d-1f13-4b54-bdc1-c7a39a27c84a@github.com> <0geWftbRkANPFckOXHNBcfS6GZDdWBj0Oa0Xv5ZA0fU=.d240400c-7e4c-4152-9e06-28a0848a3789@github.com> Message-ID: On Wed, 5 Apr 2023 11:53:05 GMT, Tobias Holenstein wrote: >> When switching the layouting mode to or from CFG the node text is not updated. >> >> - Switching to `CFG view` gave the wrong node text: >> fail_to >> - with this fix it looks like this >> fix_to >> >> - Switching from `CFG view` gave the wrong node text: >> fail_from >> - with this fix it looks like this >> fix-from > > Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: > > fix the bug properly That looks much better! Thanks for investigating further how to properly fix this :-) ------------- Marked as reviewed by chagedorn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/13348#pullrequestreview-1372753641 From rcastanedalo at openjdk.org Wed Apr 5 12:00:12 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Wed, 5 Apr 2023 12:00:12 GMT Subject: RFR: JDK-8305644: IGV: Node text not updated when switching from/to CFG view [v2] In-Reply-To: <0geWftbRkANPFckOXHNBcfS6GZDdWBj0Oa0Xv5ZA0fU=.d240400c-7e4c-4152-9e06-28a0848a3789@github.com> References: <-dti5UPgMIKgdtePCqm0OtVlE957l_DGyd2mOrgDX6M=.23cfb32d-1f13-4b54-bdc1-c7a39a27c84a@github.com> <0geWftbRkANPFckOXHNBcfS6GZDdWBj0Oa0Xv5ZA0fU=.d240400c-7e4c-4152-9e06-28a0848a3789@github.com> Message-ID: On Wed, 5 Apr 2023 11:53:05 GMT, Tobias Holenstein wrote: >> When switching the layouting mode to or from CFG the node text is not updated. >> >> - Switching to `CFG view` gave the wrong node text: >> fail_to >> - with this fix it looks like this >> fix_to >> >> - Switching from `CFG view` gave the wrong node text: >> fail_from >> - with this fix it looks like this >> fix-from > > Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: > > fix the bug properly Marked as reviewed by rcastanedalo (Reviewer). 
------------- PR Review: https://git.openjdk.org/jdk/pull/13348#pullrequestreview-1372754084 From tholenstein at openjdk.org Wed Apr 5 12:13:26 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Wed, 5 Apr 2023 12:13:26 GMT Subject: RFR: JDK-8305644: IGV: Node text not updated when switching from/to CFG view [v2] In-Reply-To: References: <-dti5UPgMIKgdtePCqm0OtVlE957l_DGyd2mOrgDX6M=.23cfb32d-1f13-4b54-bdc1-c7a39a27c84a@github.com> <0geWftbRkANPFckOXHNBcfS6GZDdWBj0Oa0Xv5ZA0fU=.d240400c-7e4c-4152-9e06-28a0848a3789@github.com> Message-ID: On Wed, 5 Apr 2023 11:56:37 GMT, Christian Hagedorn wrote: >> Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: >> >> fix the bug properly > > That looks much better! Thanks for investigating further how to properly fix this :-) Thanks @chhagedorn and @robcasloz for the reviews! ------------- PR Comment: https://git.openjdk.org/jdk/pull/13348#issuecomment-1497381511 From tholenstein at openjdk.org Wed Apr 5 12:13:26 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Wed, 5 Apr 2023 12:13:26 GMT Subject: Integrated: JDK-8305644: IGV: Node text not updated when switching from/to CFG view In-Reply-To: <-dti5UPgMIKgdtePCqm0OtVlE957l_DGyd2mOrgDX6M=.23cfb32d-1f13-4b54-bdc1-c7a39a27c84a@github.com> References: <-dti5UPgMIKgdtePCqm0OtVlE957l_DGyd2mOrgDX6M=.23cfb32d-1f13-4b54-bdc1-c7a39a27c84a@github.com> Message-ID: On Wed, 5 Apr 2023 09:33:21 GMT, Tobias Holenstein wrote: > When switching the layouting mode to or from CFG the node text is not updated. > > - Switching to `CFG view` gave the wrong node text: > fail_to > - with this fix it looks like this > fix_to > > - Switching from `CFG view` gave the wrong node text: > fail_from > - with this fix it looks like this > fix-from This pull request has now been integrated. 
Changeset: 9f587d27 Author: Tobias Holenstein URL: https://git.openjdk.org/jdk/commit/9f587d272fe7097b330d8d81b7ae43149ff92485 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod 8305644: IGV: Node text not updated when switching from/to CFG view Reviewed-by: rcastanedalo, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/13348 From cslucas at openjdk.org Wed Apr 5 15:52:29 2023 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Wed, 5 Apr 2023 15:52:29 GMT Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v6] In-Reply-To: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> Message-ID: > Can I please get reviews for this PR? > > The most common and frequent use of NonEscaping Phis merging object allocations is for debugging information. The two graphs below show numbers for Renaissance and DaCapo benchmarks - similar results are obtained for all other applications that I tested. > > With what frequency does each IR node type occurs as an allocation merge user? I.e., if the same node type uses a Phi N times the counter is incremented by N: > > ![image](https://user-images.githubusercontent.com/2249648/222280517-4dcf5871-2564-4207-b49e-22aee47fa49d.png) > > What are the most common users of allocation merges? I.e., if the same node type uses a Phi N times the counter is incremented by 1: > > ![image](https://user-images.githubusercontent.com/2249648/222280608-ca742a4e-1622-4e69-a778-e4db6805ea02.png) > > This PR adds support scalar replacing allocations participating in merges that are used as debug information OR as a base for field loads. I plan to create subsequent PRs to enable scalar replacement of merges used by other node types (CmpP is next on the list) subsequently. 
> > The approach I used for _rematerialization_ is pretty straightforward. It consists basically in: 1) Extend SafePointScalarObjectNode to represent multiple SR objects; 2) Add a new Class to support rematerialization of SR objects part of merges; 3) Patch HotSpot to be able to serialize and deserialize debug information related to allocation merges; 4) Patch C2 to generate unique types for SR objects participating in some allocation merges. > > The approach I used for _enabling the scalar replacement of some of the inputs of the allocation merge_ is also pretty straight forward: call `MemNode::split_through_phi` to, well, split AddP->Load* through the merge which will render the Phi useless. > > I tested this with JTREG tests tier 1-4 (Windows, Linux, and Mac) and didn't see regression. I also tested with several applications and didn't see any failure. I also ran tests with "-ea -esa -Xbatch -Xcomp -XX:+UnlockExperimentalVMOptions -XX:-TieredCompilation -server -XX:+IgnoreUnrecognizedVMOptions -XX:+UnlockDiagnosticVMOptions -XX:+StressLCM -XX:+StressGCM -XX:+StressCCP" and didn't observe any related failures. Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision: Addressing PR review 2: refactor & reuse MacroExpand::scalar_replacement method. 
------------- Changes: - all: https://git.openjdk.org/jdk/pull/12897/files - new: https://git.openjdk.org/jdk/pull/12897/files/5ef86371..3752b21a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12897&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12897&range=04-05 Stats: 346 lines in 3 files changed: 113 ins; 106 del; 127 mod Patch: https://git.openjdk.org/jdk/pull/12897.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12897/head:pull/12897 PR: https://git.openjdk.org/jdk/pull/12897 From cslucas at openjdk.org Wed Apr 5 15:52:33 2023 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Wed, 5 Apr 2023 15:52:33 GMT Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v4] In-Reply-To: References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> <6NDwZSpjSrokmglncPRp4tM7_Hiq4b26dXukhXODpKo=.8ba7efd0-bc44-4f1e-beb8-c1c68bc33515@github.com> Message-ID: On Fri, 24 Mar 2023 19:02:57 GMT, Vladimir Kozlov wrote: >> Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision: >> >> Add support for SR'ing some inputs of merges used for field loads > > src/hotspot/share/code/debugInfo.hpp line 199: > >> 197: // ObjectValue describing an object that was scalar replaced. >> 198: >> 199: class ObjectMergeValue: public ScopeValue { > > Why did you not make it a subclass of ObjectValue? You would need to check `sv->is_object_merge()` before `sv->is_object()` in a few places. But on the other hand you would not need to duplicate `ObjectValue`'s fields and asserts. Hi @vnkozlov, just FYI. I made the changes that you suggested. Please let me know what you think.
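The ordering constraint raised in the review (test `is_object_merge()` before `is_object()` once `ObjectMergeValue` is a subclass of `ObjectValue`) can be illustrated with a small Java analogy. This is a sketch under stated assumptions, not the HotSpot code: the real classes are C++, and the Java classes and the `describe` helper below are hypothetical, with names chosen only to mirror those in the thread.

```java
/** Hypothetical Java analogue of the C++ ScopeValue base class. */
class ScopeValueSketch {
    boolean isObject() { return false; }
    boolean isObjectMerge() { return false; }
}

/** Hypothetical analogue of ObjectValue. */
class ObjectValueSketch extends ScopeValueSketch {
    @Override
    boolean isObject() { return true; }
}

/** Hypothetical analogue of ObjectMergeValue as a subclass of ObjectValue. */
class ObjectMergeValueSketch extends ObjectValueSketch {
    // isObject() is inherited and still returns true for a merge value.
    @Override
    boolean isObjectMerge() { return true; }
}

class ScopeValueDispatchSketch {
    // The more specific predicate is tested first; otherwise a merge value
    // would be handled by the plain-object branch.
    static String describe(ScopeValueSketch sv) {
        if (sv.isObjectMerge()) {
            return "object merge";
        }
        if (sv.isObject()) {
            return "object";
        }
        return "other";
    }
}
```

Swapping the two `if` statements in `describe` would classify every merge value as a plain object, which is the situation the review comment warns about; in exchange, the subclass inherits the parent's fields instead of duplicating them.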
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1158700435 From cslucas at openjdk.org Wed Apr 5 15:52:34 2023 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Wed, 5 Apr 2023 15:52:34 GMT Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v4] In-Reply-To: References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> <6NDwZSpjSrokmglncPRp4tM7_Hiq4b26dXukhXODpKo=.8ba7efd0-bc44-4f1e-beb8-c1c68bc33515@github.com> <7xRwVRVapKbqiVQMDMZUh3ILhfaYub_brXWVopFhJ8M=.28289c04-0ff0-4f19-b764-03af4d3155d6@github.com> Message-ID: On Sat, 25 Mar 2023 00:07:20 GMT, Vladimir Kozlov wrote: >> I had considered that but decided not to do it to prevent adding a new IR node. I'll give that a shot and update this thread with how it goes. > > It **will** complicate your DebugInfo code (packing/unpacking) information. But I think it is right thing to do to avoid duplicated re-allocations during deoptimization - you should have only one new object. Hi @vnkozlov, just FYI. I made the changes that you suggested. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1158699338 From cslucas at openjdk.org Wed Apr 5 15:52:38 2023 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Wed, 5 Apr 2023 15:52:38 GMT Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v5] In-Reply-To: References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> Message-ID: On Fri, 31 Mar 2023 18:24:45 GMT, Xin Liu wrote: > It looks like we can use (safepoints == nullptr) instead? Yeap. Thanks. I don't know how I missed that. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1157909570 From cslucas at openjdk.org Wed Apr 5 15:52:39 2023 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Wed, 5 Apr 2023 15:52:39 GMT Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v5] In-Reply-To: References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> Message-ID: On Wed, 5 Apr 2023 00:59:29 GMT, Cesar Soares Lucas wrote: >> Do you really need the boolean parameter ignore_merges here? >> It looks like we can use (safepoints == nullptr) instead? > >> It looks like we can use (safepoints == nullptr) instead? > > Yeap. Thanks. I don't know how I missed that. > With ignore_merges, why we also skip EncodeP or MemBarRelease here? The EncodeP shouldn't prevent the reduction of Phi because I check how the Phi is used. The MemBarRelease node shouldn't prevent the reduction because once the Allocate input to the Phi is set to SR the MemBarRelease node will be removed as part of Ideal transformations after EA. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1157910405 From cslucas at openjdk.org Wed Apr 5 15:52:37 2023 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Wed, 5 Apr 2023 15:52:37 GMT Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v5] In-Reply-To: References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> Message-ID: On Sat, 1 Apr 2023 00:44:55 GMT, Xin Liu wrote: > Do you consider to perform the transformation in MacroExpand? Your prior changes have already removed NSR marks, ME/SR will consider 'ptn'. Yes, I actually did. However, that makes the changes much more complicated. I patched this method to reuse the scalar replacement method in MacroExpand so that we don't have code duplication. 
I hope that's sufficient as a first implementation. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1157914272 From cslucas at openjdk.org Wed Apr 5 15:52:40 2023 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Wed, 5 Apr 2023 15:52:40 GMT Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v4] In-Reply-To: References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> <6NDwZSpjSrokmglncPRp4tM7_Hiq4b26dXukhXODpKo=.8ba7efd0-bc44-4f1e-beb8-c1c68bc33515@github.com> <0UbMqMHtVIayPdJMmfDF6YTadWe4YTlSW6mZc5P3IU8=.c4b1a292-e434-4c57-a5cd-015edca2ec95@github.com> Message-ID: On Fri, 31 Mar 2023 18:38:43 GMT, Xin Liu wrote: >> I see, you use it in escape.cpp. Okay. I need to review changes there too. > > or you could construct a temporary PhaseMacroExpand object in EA. > > I see that you convert many member function to static so you can query in EA. the only blocker is _igvn. That seems a good idea. Together with some other refactoring I decided to revert making the methods static and instead use them through an instance of ME. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1158698606 From cslucas at openjdk.org Wed Apr 5 16:31:20 2023 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Wed, 5 Apr 2023 16:31:20 GMT Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v7] In-Reply-To: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> Message-ID: > Can I please get reviews for this PR? > > The most common and frequent use of NonEscaping Phis merging object allocations is for debugging information. 
The two graphs below show numbers for Renaissance and DaCapo benchmarks - similar results are obtained for all other applications that I tested. > > With what frequency does each IR node type occurs as an allocation merge user? I.e., if the same node type uses a Phi N times the counter is incremented by N: > > ![image](https://user-images.githubusercontent.com/2249648/222280517-4dcf5871-2564-4207-b49e-22aee47fa49d.png) > > What are the most common users of allocation merges? I.e., if the same node type uses a Phi N times the counter is incremented by 1: > > ![image](https://user-images.githubusercontent.com/2249648/222280608-ca742a4e-1622-4e69-a778-e4db6805ea02.png) > > This PR adds support scalar replacing allocations participating in merges that are used as debug information OR as a base for field loads. I plan to create subsequent PRs to enable scalar replacement of merges used by other node types (CmpP is next on the list) subsequently. > > The approach I used for _rematerialization_ is pretty straightforward. It consists basically in: 1) Extend SafePointScalarObjectNode to represent multiple SR objects; 2) Add a new Class to support rematerialization of SR objects part of merges; 3) Patch HotSpot to be able to serialize and deserialize debug information related to allocation merges; 4) Patch C2 to generate unique types for SR objects participating in some allocation merges. > > The approach I used for _enabling the scalar replacement of some of the inputs of the allocation merge_ is also pretty straight forward: call `MemNode::split_through_phi` to, well, split AddP->Load* through the merge which will render the Phi useless. > > I tested this with JTREG tests tier 1-4 (Windows, Linux, and Mac) and didn't see regression. I also tested with several applications and didn't see any failure. 
I also ran tests with "-ea -esa -Xbatch -Xcomp -XX:+UnlockExperimentalVMOptions -XX:-TieredCompilation -server -XX:+IgnoreUnrecognizedVMOptions -XX:+UnlockDiagnosticVMOptions -XX:+StressLCM -XX:+StressGCM -XX:+StressCCP" and didn't observe any related failures. Cesar Soares Lucas has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains seven commits: - Merge with Master - Addressing PR review 2: refactor & reuse MacroExpand::scalar_replacement method. - Address PR feeedback 1: make ObjectMergeValue subclass of ObjectValue & create new IR class to represent scalarized merges. - Add support for SR'ing some inputs of merges used for field loads - Fix some typos and do some small refactorings. - Merge master - Add support for rematerializing scalar replaced objects participating in allocation merges ------------- Changes: https://git.openjdk.org/jdk/pull/12897/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12897&range=06 Stats: 2193 lines in 22 files changed: 1939 ins; 107 del; 147 mod Patch: https://git.openjdk.org/jdk/pull/12897.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12897/head:pull/12897 PR: https://git.openjdk.org/jdk/pull/12897 From duke at openjdk.org Wed Apr 5 17:01:10 2023 From: duke at openjdk.org (Joshua Cao) Date: Wed, 5 Apr 2023 17:01:10 GMT Subject: RFR: 8300829: Make CtwRunner available as an independent tool [v2] In-Reply-To: References: Message-ID: > 1. Create an independent jar file with CtwRunner as the main class to make it easier to run > 2. Output the class files directly into the destination directory. Currently, CTW expects a `wb.jar`, but the jtreg tests that use CTWRunner has class files outside of a jar. > 3. Introduce `sun.hotspot.tools.ctwrunner.ctw_extra_args` option to pass extra arguments to CTW. Arguments are comma separated because working with spaces in bash can be kind of awkward, but I'm open to changing this part. 
> > ### Motivation > CTWRunner is a wrapper around CTW that will continue compiling beyond failure. It can be useful for testing compilation with certain flags. For example, one could run > > > JAVA_OPTIONS="-Dsun.hotspot.tools.ctwrunner.ctw_extra_args=-XX:+StressLCM,-XX:+StressGCM" ./ctwrunner.sh modules:java.base > > > To test compiling the java.base module with `-XX:+StressLCM -XX:+StressGCM`. This is advantageous over uses CTW because we can see the full list of crashes for the entire module. Joshua Cao has updated the pull request incrementally with two additional commits since the last revision: - Upgrade CTWRunner.java copyright header - Remove ctwrunner.jar and default value for CTW extra args ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13344/files - new: https://git.openjdk.org/jdk/pull/13344/files/32037378..35979be8 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13344&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13344&range=00-01 Stats: 14 lines in 2 files changed: 0 ins; 8 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/13344.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13344/head:pull/13344 PR: https://git.openjdk.org/jdk/pull/13344 From duke at openjdk.org Wed Apr 5 17:01:14 2023 From: duke at openjdk.org (Joshua Cao) Date: Wed, 5 Apr 2023 17:01:14 GMT Subject: RFR: 8300829: Make CtwRunner available as an independent tool [v2] In-Reply-To: References: Message-ID: On Wed, 5 Apr 2023 05:05:33 GMT, Xin Liu wrote: >> Joshua Cao has updated the pull request incrementally with two additional commits since the last revision: >> >> - Upgrade CTWRunner.java copyright header >> - Remove ctwrunner.jar and default value for CTW extra args > > test/hotspot/jtreg/testlibrary/ctw/Makefile line 94: > >> 92: @rm -rf $(OUTPUT_DIR) >> 93: >> 94: $(DST_DIR)/ctwrunner.jar: filelist $(DST_DIR)/wb.jar > > Theoretically, we can avoid from generating ctwrunner.jar. 
The classes are the same as 'ctw.jar' and the only difference is the entry point.
>
> to launch CTWRunner, you can do something like this in ctwrunner.sh
>
> echo '$${JAVA_HOME}/bin/java $${JAVA_OPTIONS} -Dtest.jdk=$${JAVA_HOME} -cp ctw.jar $CTWRUNNER_MAIN_CLASS $$@' > $@
>
>
> it's up to you. if we generate ctwrunner.jar, we will have a "standalone" jar.

I removed ctwrunner.jar in the newest commit. I also realized I was javac'ing twice. We avoid this entirely by removing the extra jar.

> test/hotspot/jtreg/testlibrary/ctw/src/sun/hotspot/tools/ctw/CtwRunner.java line 60:
>
>> 58: * comma-separated arguments to pass to CTW subprocesses.
>> 59: */
>> 60: public static final String CTW_EXTRA_ARGS
>
> how about you use getProperty("sun.hotspot.tools.ctwrunner.ctw_extra_args", "").
> by giving it an empty string as default value, you can take out if (null != CTW_EXTRA_ARGS) below.
>
> btw, you may also need to update the year in the copyright header.

Yup. Made these changes.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/13344#discussion_r1158781138
PR Review Comment: https://git.openjdk.org/jdk/pull/13344#discussion_r1158782366

From claes.redestad at oracle.com Wed Apr 5 21:20:40 2023
From: claes.redestad at oracle.com (Claes Redestad)
Date: Wed, 5 Apr 2023 21:20:40 +0000
Subject: Bimodal compilation
In-Reply-To: <1621877955.30061527.1680728754364.JavaMail.zimbra@univ-eiffel.fr>
References: <795367426.30059295.1680728164210.JavaMail.zimbra@univ-eiffel.fr> <1621877955.30061527.1680728754364.JavaMail.zimbra@univ-eiffel.fr>
Message-ID: <11A15FDE-D99B-4779-A06E-F2E8231D5649@oracle.com>

Hi,

I'd assume this is an issue with the HotSpot JIT rather than with how javac happens to compile this particular program. Can you please provide the full program (or is @ZeroDefault some new OpenJDK construct everyone needs to be aware of?)

/Claes

> On 5 Apr 2023, at 23:05, Remi Forax wrote:
>
> oops, please do not take into account this message !
>
> I've not read the LogCompilation file correctly; there is a bimodal issue, but it is due to "sumPopulationOf(Populated[] populateds)" being inlined or not, not to how the switch is compiled.
>
> regards,
> Rémi
>
> ----- Original Message -----
>> From: "Remi Forax"
>> To: "amber-dev"
>> Cc: "jan lahoda"
>> Sent: Wednesday, April 5, 2023 10:56:04 PM
>> Subject: Bimodal compilation
>
>> Hi all,
>> for Devoxx France, José Paumard and I are giving a talk about Valhalla and
>> Amber with several benchmarks mixing the two.
>>
>> One problem we have is that the way pattern matching is compiled by javac and
>> later JIT compiled leads to bimodal performance.
>> Depending on the day (exactly, depending on the JIT threads scheduling), either
>> the method containing the switch is compiled as a whole method (good day) or
>> only the method handle (corresponding to the invokedynamic) is compiled and
>> when the whole method is compiled, the assembly code corresponding to the
>> method handle is considered too big and thus not inlined (bad day).
>>
>> The example uses arrays of non-null instances of value classes with a default
>> instance (the kind of value classes that are flattened in memory) and a sealed
>> interface; if the method is fully inlined (good day) performance is very
>> good, and if not, performance is terrible (bad day).
>>
>> A cascade of instanceof, while slightly slower, does not exhibit that issue.
>>
>> In our example, this bimodal performance issue is hidden when using identity
>> classes, because the cache misses become the bottleneck.
>>
>> I'm a little worried here because I do not see how to fix this bug without
>> changing the way switches on types are compiled by javac, so this problem has to
>> be tackled before JDK 21 is released; otherwise, people will have to
>> recompile their application to fix that bug.
>>
>> regards,
>> Rémi
>>
>> ---
>>
>> public @ZeroDefault @Value record Population(int amount) {
>>   public static Population zero() {
>>     return new Population(0);
>>   }
>>   public Population add(Population other) {
>>     return new Population(this.amount + other.amount);
>>   }
>> }
>> public sealed interface Populated permits City, Department, Region {}
>> public @ZeroDefault @Value record City(String name, @NonNull Population population) implements Populated {}
>> public @ZeroDefault @Value record Department(String name, City[] cities) implements Populated {
>>   public Department {
>>     cities = Arrays.stream(cities).toArray(size -> RT.newNonNullArray(City.class, size));
>>   }
>> }
>> public @ZeroDefault @Value record Region(String name, Department[] departments) implements Populated {
>>   public Region {
>>     departments = Arrays.stream(departments).toArray(size -> RT.newNonNullArray(Department.class, size));
>>   }
>> }
>>
>> public static Population sumPopulationOf(Populated populated) {
>>   return switch (populated) {
>>     case City(var name, var population) -> population;
>>     case Department(var name, var cities) -> {
>>       var sum = Population.zero();
>>       for(var city: cities) {
>>         sum = sum.add(sumPopulationOf(city));
>>       }
>>       yield sum;
>>     }
>>     case Region(var name, var departments) -> {
>>       var sum = Population.zero();
>>       for(var department: departments) {
>>         sum = sum.add(sumPopulationOf(department));
>>       }
>>       yield sum;
>>     }
>>   };
>> }
>>
>> public static Population sumPopulationOf(Populated[] populateds) {
>>   var sum = Population.zero();
>>   for(var populated: populateds) {
>>     sum = sum.add(sumPopulationOf(populated));
>>   }
>>   return sum;
>> }
>>
>> ---
>>
>> public class BenchDOP {
>>   private Region[] regions;
>>
>>   @Setup
>>   public void init() {
>>     var data = Data.readCities();
>>     regions = data.regions().toArray(size -> RT.newNonNullArray(Region.class, size));
>>   }
>>
>>   @Benchmark
>>   public Population sumPopulations() {
>>     return Data.sumPopulationOf(regions);
>>   }
>> }
>>
>>
>> ---
>>
>>
>>
>> # Benchmark: org.paumard.amber.model.cityvaluenonnullablearraydrecords.BenchDOP.sumPopulations
>>
>> # Run progress: 0.00% complete, ETA 00:02:06
>> # Fork: 1 of 3
>> # Warmup Iteration 1: 47.845 us/op
>> # Warmup Iteration 2: 38.199 us/op
>> # Warmup Iteration 3: 38.130 us/op
>> # Warmup Iteration 4: 37.909 us/op
>> # Warmup Iteration 5: 38.345 us/op
>> Iteration 1: 38.581 us/op
>> Iteration 2: 37.946 us/op
>> Iteration 3: 37.837 us/op
>> Iteration 4: 38.013 us/op
>> Iteration 5: 37.885 us/op
>> Iteration 6: 37.853 us/op
>> Iteration 7: 37.931 us/op
>> Iteration 8: 37.874 us/op
>> Iteration 9: 37.828 us/op
>> Iteration 10: 37.925 us/op <--- good day
>>
>> # Run progress: 8.33% complete, ETA 00:02:00
>> # Fork: 2 of 3
>> # Warmup Iteration 1: 2871.011 us/op
>> # Warmup Iteration 2: 2761.856 us/op
>> # Warmup Iteration 3: 2759.977 us/op
>> # Warmup Iteration 4: 2761.045 us/op
>> # Warmup Iteration 5: 2756.167 us/op
>> Iteration 1: 2755.180 us/op
>> Iteration 2: 2781.178 us/op
>> Iteration 3: 2759.068 us/op
>> Iteration 4: 2755.737 us/op
>> Iteration 5: 2755.112 us/op
>> Iteration 6: 2754.553 us/op
>> Iteration 7: 2761.759 us/op
>> Iteration 8: 2750.829 us/op
>> Iteration 9: 2751.265 us/op
>> Iteration 10: 2749.668 us/op <--- bad day
>>
>> # Run progress: 16.67% complete, ETA 00:01:48
>> # Fork: 3 of 3
>> # Warmup Iteration 1: 42.359 us/op
>> # Warmup Iteration 2: 38.322 us/op
>> # Warmup Iteration 3: 38.311 us/op
>> # Warmup Iteration 4: 37.990 us/op
>> # Warmup Iteration 5: 37.988 us/op
>> Iteration 1: 38.139 us/op
>> Iteration 2: 38.052 us/op
>> Iteration 3: 37.959 us/op
>> Iteration 4: 38.037 us/op
>> Iteration 5: 37.997 us/op
>> Iteration 6: 37.957 us/op
>> Iteration 7: 37.977 us/op
>> Iteration 8: 37.905 us/op
>> Iteration 9: 37.913 us/op
>> Iteration 10: 37.976 us/op <--- good day

From xgong at openjdk.org Thu Apr 6 01:51:17 2023
From: xgong at openjdk.org (Xiaohong Gong)
Date: Thu, 6 Apr
2023 01:51:17 GMT Subject: RFR: 8303762: [vectorapi] Intrinsification of Vector.slice [v6] In-Reply-To: References: Message-ID: On Tue, 4 Apr 2023 13:46:12 GMT, Quan Anh Mai wrote: >> `Vector::slice` is a method at the top-level class of the Vector API that concatenates the 2 inputs into an intermediate composite and extracts a window equal to the size of the inputs into the result. It is used in vector conversion methods where the part number is not 0 to slice the parts to the correct positions. Slicing is also used in text processing such as utf8 and utf16 validation. x86 starting from SSSE3 has `palignr` which does vector slicing very efficiently. As a result, I think it is beneficial to add a C2 node for this operation as well as intrinsify `Vector::slice` method. >> >> A slice is currently implemented as `v2.rearrange(iota).blend(v1.rearrange(iota), blendMask)` which requires preparation of the index vector and the blending mask. Even with the preparations being hoisted out of the loops, microbenchmarks show improvement using the slice instrinsics. Some have tremendous increases in throughput due to the limitation that a mask of length 2 cannot currently be intrinsified, leading to falling back to the Java implementations. >> >> Please take a look and have some reviews. Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > style test/hotspot/jtreg/compiler/vectorapi/TestVectorSlice.java line 466: > 464: @IR(counts = {IRNode.VECTOR_SLICE, "17"}) > 465: static void testB128(byte[][] dst, byte[] src1, byte[] src2) { > 466: var species = ByteVector.SPECIES_128; Suggest to define the species as a "`private static final`" field of this test class. It may make the intrinsification fail if the species is not a constant to the compiler. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12909#discussion_r1159206009 From xgong at openjdk.org Thu Apr 6 01:56:16 2023 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 6 Apr 2023 01:56:16 GMT Subject: RFR: 8303762: [vectorapi] Intrinsification of Vector.slice [v5] In-Reply-To: References: Message-ID: On Tue, 4 Apr 2023 13:24:18 GMT, Quan Anh Mai wrote: >> `Vector::slice` is a method at the top-level class of the Vector API that concatenates the 2 inputs into an intermediate composite and extracts a window equal to the size of the inputs into the result. It is used in vector conversion methods where the part number is not 0 to slice the parts to the correct positions. Slicing is also used in text processing such as utf8 and utf16 validation. x86 starting from SSSE3 has `palignr` which does vector slicing very efficiently. As a result, I think it is beneficial to add a C2 node for this operation as well as intrinsify `Vector::slice` method. >> >> A slice is currently implemented as `v2.rearrange(iota).blend(v1.rearrange(iota), blendMask)` which requires preparation of the index vector and the blending mask. Even with the preparations being hoisted out of the loops, microbenchmarks show improvement using the slice instrinsics. Some have tremendous increases in throughput due to the limitation that a mask of length 2 cannot currently be intrinsified, leading to falling back to the Java implementations. >> >> Please take a look and have some reviews. Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > add identity, fix flags test/hotspot/jtreg/compiler/vectorapi/TestVectorSlice.java line 327: > 325: > 326: @Test > 327: @IR(counts = {IRNode.VECTOR_SLICE, "7"}, applyIfCPUFeature = {"sse2", "true"}) How about separating the special cases (i.e. origin is `0/VLENGTH`), and using the `FailOn` check instead on them? Tests are more accurate. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12909#discussion_r1159208577 From xliu at openjdk.org Thu Apr 6 02:48:14 2023 From: xliu at openjdk.org (Xin Liu) Date: Thu, 6 Apr 2023 02:48:14 GMT Subject: RFR: 8300829: Make CtwRunner available as an independent tool [v2] In-Reply-To: References: Message-ID: On Wed, 5 Apr 2023 17:01:10 GMT, Joshua Cao wrote: >> 1. Create an independent jar file with CtwRunner as the main class to make it easier to run >> 2. Output the class files directly into the destination directory. Currently, CTW expects a `wb.jar`, but the jtreg tests that use CTWRunner has class files outside of a jar. >> 3. Introduce `sun.hotspot.tools.ctwrunner.ctw_extra_args` option to pass extra arguments to CTW. Arguments are comma separated because working with spaces in bash can be kind of awkward, but I'm open to changing this part. >> >> ### Motivation >> CTWRunner is a wrapper around CTW that will continue compiling beyond failure. It can be useful for testing compilation with certain flags. For example, one could run >> >> >> JAVA_OPTIONS="-Dsun.hotspot.tools.ctwrunner.ctw_extra_args=-XX:+StressLCM,-XX:+StressGCM" ./ctwrunner.sh modules:java.base >> >> >> To test compiling the java.base module with `-XX:+StressLCM -XX:+StressGCM`. This is advantageous over uses CTW because we can see the full list of crashes for the entire module. > > Joshua Cao has updated the pull request incrementally with two additional commits since the last revision: > > - Upgrade CTWRunner.java copyright header > - Remove ctwrunner.jar and default value for CTW extra args LGTM. I am not a reviewer. need other reviewers to approve this. ------------- Marked as reviewed by xliu (Committer). 
PR Review: https://git.openjdk.org/jdk/pull/13344#pullrequestreview-1373977066

From yzhu at openjdk.org Thu Apr 6 04:08:16 2023
From: yzhu at openjdk.org (Yanhong Zhu)
Date: Thu, 6 Apr 2023 04:08:16 GMT
Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v12]
In-Reply-To:
References:
Message-ID:

On Thu, 30 Mar 2023 09:16:06 GMT, Dingli Zhang wrote:

>> Hi,
>>
>> We have added support for vector add mask instructions, please take a look and have some reviews. Thanks a lot!
>> This patch adds support for the masked versions of vector add/sub/mul/div. It was implemented by referring to RVV v1.0 [1].
>>
>> ## Load/Store/Cmp Mask
>> `VectorLoadMask, VectorMaskCmp, VectorStoreMask` will implement the mask datapath. We can see where the data is passed in the compilation log with `jdk/incubator/vector/Byte128VectorTests.java`:
>>
>>     218 loadV V1, [R7] # vector (rvv)
>>     220 vloadmask V0, V1
>>     ...
>>     23c vmaskcmp_rvv_masked V0, V4, V5, V0, V1, #0
>>     24c vstoremask V1, V0
>>     258 storeV [R7], V1 # vector (rvv)
>>
>>
>> The corresponding generated jit assembly:
>>
>>     # loadV
>>     0x000000400c8ef958: vsetvli t0,zero,e8,m1,tu,mu
>>     0x000000400c8ef95c: vle8.v v1,(t2)
>>
>>     # vloadmask
>>     0x000000400c8ef960: vsetvli t0,zero,e8,m1,tu,
>>     0x000000400c8ef964: vmsne.vx v0,v1,zero
>>
>>     # vmaskcmp_rvv_masked
>>     0x000000400c8ef97c: vsetvli t0,zero,e8,m1,tu,mu
>>     0x000000400c8ef980: vmclr.m v1
>>     0x000000400c8ef984: vmseq.vv v1,v4,v5,v0.t
>>     0x000000400c8ef988: vmv1r.v v0,v1
>>
>>     # vstoremask
>>     0x000000400c8ef98c: vsetvli t0,zero,e8,m1,tu,mu
>>     0x000000400c8ef990: vmv.v.x v1,zero
>>     0x000000400c8ef994: vmerge.vim v1,v1,1,v0
>>
>>
>> ## Masked vector arithmetic instructions (e.g.
vadd)
>> AddMaskTestMerge case:
>>
>>     import jdk.incubator.vector.IntVector;
>>     import jdk.incubator.vector.VectorMask;
>>     import jdk.incubator.vector.VectorOperators;
>>     import jdk.incubator.vector.VectorSpecies;
>>
>>     public class AddMaskTestMerge {
>>
>>         static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128;
>>         static final int SIZE = 1024;
>>         static int[] a = new int[SIZE];
>>         static int[] b = new int[SIZE];
>>         static int[] r = new int[SIZE];
>>         static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false};
>>         static {
>>             for (int i = 0; i < SIZE; i++) {
>>                 a[i] = i;
>>                 b[i] = i;
>>             }
>>         }
>>
>>         static void workload(int idx) {
>>             VectorMask<Integer> vmask = VectorMask.fromArray(SPECIES, c, 0);
>>             IntVector av = IntVector.fromArray(SPECIES, a, idx);
>>             IntVector bv = IntVector.fromArray(SPECIES, b, idx);
>>             av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx);
>>         }
>>
>>         public static void main(String[] args) {
>>             for (int i = 0; i < 30_0000; i++) {
>>                 for (int j = 0; j < SIZE; j += SPECIES.length()) {
>>                     workload(j);
>>                 }
>>             }
>>         }
>>     }
>>
>>
>> This test case is reduced from existing jtreg vector tests Int128VectorTests.java[2]. This test case corresponds to the add instruction of the vector mask version and other instructions are similar.
>>
>> Before this patch, the compilation log will not print RVV-related instructions. Now the compilation log is as follows:
>>
>>
>>     0ae B10: # out( B25 B11 ) <- in( B9 ) Freq: 0.999991
>>     0ae loadV V1, [R31] # vector (rvv)
>>     0b6 vloadmask V0, V2
>>     0be vadd.vv V3, V1, V0 #@vaddI_masked
>>     0c6 lwu R28, [R7, #124] # loadN, compressed ptr, #@loadN !
Field: AddMaskTestMerge.r
>>     0ca decode_heap_oop R28, R28 #@decodeHeapOop
>>     0cc lwu R7, [R28, #12] # range, #@loadRange
>>     0d0 NullCheck R28
>>
>>
>> And the jit code is as follows:
>>
>>
>>     0x000000400c823cee: vsetvli t0,zero,e32,m1,tu,mu
>>     0x000000400c823cf2: vle32.v v1,(t6) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0}
>>                         ; - jdk.incubator.vector.IntVector::intoArray@43 (line 3228)
>>                         ; - AddMaskTestMerge::workload@46 (line 25)
>>     0x000000400c823cf6: vsetvli t0,zero,e8,m1,tu,mu
>>     0x000000400c823cfa: vmsne.vx v0,v2,zero ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0}
>>                         ; - jdk.incubator.vector.VectorMask::fromArray@47 (line 208)
>>                         ; - AddMaskTestMerge::workload@7 (line 22)
>>     0x000000400c823cfe: vsetvli t0,zero,e32,m1,tu,mu
>>     0x000000400c823d02: vadd.vv v3,v3,v1,v0.t ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0}
>>                         ; - jdk.incubator.vector.IntVector::lanewiseTemplate@192 (line 834)
>>                         ; - jdk.incubator.vector.Int128Vector::lanewise@9 (line 291)
>>                         ; - jdk.incubator.vector.Int128Vector::lanewise@4 (line 41)
>>                         ; - AddMaskTestMerge::workload@39 (line 25)
>>
>>
>> ## Mask register allocation & mask bit operation
>> Since v0 is to be used as a mask register in spec[1], sometimes we need two vmask to do the vector mask logical ops like `AndVMask, OrVMask, XorVMask`. And if only v0 and v31 mask registers are defined, the corresponding c2 nodes will not be generated correctly because of the register pressure[3].
>> When we use only v0 and v31 as mask registers, jtreg testing of Byte128VectorTests.java[4] with `-XX:+PrintAssembly` and `-XX:LogFile` will not emit the expected rvv mask instruction.
As opposed to this, the following compilation failure log[3] is generated:
>>
>>
>>
>> So define v30 and v31 as mask registers too, and `AndVMask` will emit C2 JIT code like:
>>
>>     vloadmask V0, V1
>>     vloadmask V30, V2
>>     vmask_and V0, V30, V0
>>
>> We also modified the implementation of `spill_copy_vector_stack_to_stack` so that it no longer occupies the v0 register. In addition to that, we changed some nodes such as `vasr/vlsl/vlsr/vstring_x/varray_x/vclearArray_x`, which use v0 internally, so that C2 knows they use v0.
>>
>> By the way, the current implementation of `VectorMaskCast` is for the case of equal width of the parameter data; other cases depend on the subsequent cast node.
>>
>> [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc
>> [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java
>> [3] https://github.com/openjdk/jdk/blob/0deb648985b018653ccdaf193dc13b3cf21c088a/src/hotspot/share/opto/chaitin.cpp#L526
>> [4] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Byte128VectorTests.java
>>
>> ### Testing:
>>
>> qemu with UseRVV:
>> - [x] Tier1 tests (release)
>> - [x] Tier2 tests (release)
>> - [ ] Tier3 tests (release)
>> - [x] test/jdk/jdk/incubator/vector (release/fastdebug)
>
> Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision:
>
>   Fix comment

src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1731:

> 1729: if (bt == T_FLOAT || bt == T_DOUBLE) {
> 1730: switch (cond) {
> 1731: case BoolTest::eq: vmfeq_vv(vd, src1, src2, vm); break;

`BoolTest::ge` and `BoolTest::gt` are implemented with `BoolTest::le` and `BoolTest::lt` by exchanging the operands. When one of the operands is NaN, will the results of the comparisons be wrong?
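
A minimal scalar sketch of the question above, in plain Java rather than the RVV assembler: under IEEE 754 every ordered comparison (`<`, `<=`, `>`, `>=`) is false when either operand is NaN, so rewriting `a > b` as `b < a` should give the same answer, whereas rewriting it as `!(a <= b)` would not. The class and method names here are hypothetical, and it is an assumption that the vector compare instructions follow the same rule; this does not show what `c2_MacroAssembler_riscv.cpp` actually emits.

```java
// Scalar illustration only: plain Java, not the RVV code under review.
// All names in this class are hypothetical.
final class NanCompareSketch {
    // a > b rewritten as b < a (operands exchanged, condition mirrored).
    static boolean gtBySwap(double a, double b) {
        return b < a;
    }

    // a >= b rewritten as b <= a (operands exchanged, condition mirrored).
    static boolean geBySwap(double a, double b) {
        return b <= a;
    }

    // a > b rewritten as !(a <= b); this disagrees with a > b when an operand is NaN.
    static boolean gtByNegation(double a, double b) {
        return !(a <= b);
    }

    public static void main(String[] args) {
        double[] samples = {Double.NaN, -1.0, 0.0, 1.0, Double.POSITIVE_INFINITY};
        for (double a : samples) {
            for (double b : samples) {
                // Compare each rewrite against the direct comparison.
                boolean swapMatches = (a > b) == gtBySwap(a, b) && (a >= b) == geBySwap(a, b);
                boolean negationMatches = (a > b) == gtByNegation(a, b);
                System.out.printf("a=%s b=%s swap matches=%b negation matches=%b%n",
                        a, b, swapMatches, negationMatches);
            }
        }
    }
}
```

If the vector compares behave like the scalar ones, only a rewrite that negates the condition would need extra care for NaN inputs; the operand exchange alone would not.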
-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1159260641

From kvn at openjdk.org Thu Apr 6 04:46:25 2023
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Thu, 6 Apr 2023 04:46:25 GMT
Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v7]
In-Reply-To:
References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com>
Message-ID: <4uPGi8Ulap_QoQpkL1zTZUdP-jdL_WDEkpdP7asLow4=.9047ce21-688f-4d29-a643-f9acfd4344c7@github.com>

On Wed, 5 Apr 2023 16:31:20 GMT, Cesar Soares Lucas wrote:

> Cesar Soares Lucas has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains seven commits:
>
>  - Merge with Master
>  - Addressing PR review 2: refactor & reuse MacroExpand::scalar_replacement method.
>  - Address PR feeedback 1: make ObjectMergeValue subclass of ObjectValue & create new IR class to represent scalarized merges.
>  - Add support for SR'ing some inputs of merges used for field loads
>  - Fix some typos and do some small refactorings.
>  - Merge master
>  - Add support for rematerializing scalar replaced objects participating in allocation merges

Thank you for adding new node - it is more clear now.

src/hotspot/share/opto/callnode.hpp line 540:

> 538:
> 539: bool is_only_merge_sr_candidate() { return _only_merge_sr_candidate; }
> 540: void set_only_merge_sr_candidate(bool only) { _only_merge_sr_candidate = only; }

May be drop `_sr` from names.
`SafePointScalarObjectNode` already represents a scalarized object. src/hotspot/share/opto/escape.cpp line 633: > 631: > 632: SafePointScalarMergeNode* smerge = new SafePointScalarMergeNode(merge_t, merge_idx); > 633: smerge->init_req(0, _compile->root()); Maybe use ophi's control here; it should stay below the merge point. Was there a reason you use `root`? src/hotspot/share/opto/escape.cpp line 640: > 638: > 639: // Add the selector so we know which direction the execution took > 640: sfpt->add_req(selector); Maybe add a comment that we are adding debug info for the merge point here (2 values described in the comment for `_merge_pointer_idx`). src/hotspot/share/opto/escape.cpp line 655: > 653: SafePointScalarObjectNode* sobj = mexp.create_scalarized_object_description(alloc, sfpt); > 654: if (sobj == nullptr) { > 655: fatal("Failed to create SafePointScalarObjectNode!"); This is brutal! Maybe exit this compilation and recompile without `ReduceAllocationMerges`. src/hotspot/share/opto/escape.cpp line 658: > 656: } > 657: > 658: jvms->set_endoff(sfpt->req()); Add a comment explaining this line. src/hotspot/share/opto/escape.cpp line 677: > 675: > 676: // Replaces debug information references to "ophi" in "sfpt" with references to "smerge" > 677: int debug_end = jvms->debug_end(); Maybe add a comment that the debug info (and `debug_end`) changed due to the added scalarized objects info. src/hotspot/share/opto/escape.cpp line 679: > 677: int debug_end = jvms->debug_end(); > 678: sfpt->replace_edges_in_range(ophi, smerge, debug_start, debug_end, _igvn); > 679: sfpt->set_req(smerge->merge_pointer_idx(jvms), ophi); So you are trying to restore `ophi` in the debug info, which was added at line 637 but may then have been replaced with `smerge` by the previous line. Maybe add a comment explaining that. src/hotspot/share/opto/output.cpp line 755: > 753: ciKlass* cik = t->is_oopptr()->exact_klass(); > 754: assert(cik->is_instance_klass() || > 755: cik->is_array_klass(), "Not supported allocation."); Why did the spacing change?
src/hotspot/share/opto/output.cpp line 789: > 787: > 788: for (uint i = 1; i < smerge->req(); i++) { > 789: Node* fld_node = smerge->in(i); It is not `fld_node` but `obj_node`. ------------- PR Review: https://git.openjdk.org/jdk/pull/12897#pullrequestreview-1374000788 PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1159249159 PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1159245961 PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1159246463 PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1159255417 PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1159253457 PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1159256643 PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1159270793 PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1159272308 PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1159271887 From gli at openjdk.org Thu Apr 6 05:44:20 2023 From: gli at openjdk.org (Guoxiong Li) Date: Thu, 6 Apr 2023 05:44:20 GMT Subject: RFR: 8305690: [X86] Do not emit two REX prefixes in Assembler::prefix Message-ID: Hi all, This patch prevents `Assembler::prefix` from emitting two `REX prefixes`. The current code in mainline works well because the corresponding code path is not triggered. Thanks for the review. 
Best Regards, -- Guoxiong ------------- Commit messages: - 8305690: [X86] Do not emit two REX prefixes in Assembler::prefix Changes: https://git.openjdk.org/jdk/pull/13369/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13369&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8305690 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/13369.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13369/head:pull/13369 PR: https://git.openjdk.org/jdk/pull/13369 From dzhang at openjdk.org Thu Apr 6 08:53:17 2023 From: dzhang at openjdk.org (Dingli Zhang) Date: Thu, 6 Apr 2023 08:53:17 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v12] In-Reply-To: References: Message-ID: On Thu, 6 Apr 2023 04:04:50 GMT, Yanhong Zhu wrote: >> Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix comment > > src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1731: > >> 1729: if (bt == T_FLOAT || bt == T_DOUBLE) { >> 1730: switch (cond) { >> 1731: case BoolTest::eq: vmfeq_vv(vd, src1, src2, vm); break; > > `BoolTest::ge` and `BoolTest::gt` are implemented with `BoolTest::le` and `BoolTest::lt` by exchanging the operands, when one of the operands is NAN, will the results of comparisons be wrong? Thanks for the review! I think there should be no problem here. The floating-point compare instructions in RVV follow the semantics of the scalar floating-point compare instructions[1]. For all three instructions (FEQ.S, FLT.S, FLE.S), the result is 0 if either operand is NaN[2]. So when one of the operands is NaN, `BoolTest::ge`, `BoolTest::gt`, `BoolTest::le` and `BoolTest::lt` will all generate a 0 on the corresponding bit. Also a jtreg test case[3] proves that our current logic is fine. `GTFloat512VectorTests` covers the case where the input is NaN.
The test will pass properly and generate the following compilation log which contains `vmaskcmp_rvv`: 1ac B20: # out( B49 B21 ) <- in( B48 B19 ) Freq: 4188.06 1ac vmaskcmp_rvv V0, V4, V5, #3 1b8 1b8 MEMBAR-store-store #@membar_storestore 1bc # checkcastPP of R11, #@checkCastPP 1bc vstoremask V1, V0 1c8 addi R7, R11, #16 # ptr, #@addP_reg_imm 1cc spill R11 -> [sp, #104] # spill size = 64 1ce storeV [R7], V1 # vector (rvv) 1d6 ld R19, [R23, #264] # ptr, #@loadP 1da ld R7, [R23, #280] # ptr, #@loadP 1de addi R28, R19, #16 # ptr, #@addP_reg_imm 1e2 bgeu R28, R7, B49 #@cmpP_branch P=0.000100 C=-1.000000 [1] https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#1313-vector-floating-point-compare-instructions [2] https://github.com/riscv/riscv-isa-manual/releases/download/draft-20230131-c0b298a/riscv-spec.pdf [3] https://github.com/openjdk/jdk/tree/master/test/jdk/jdk/incubator/vector/Float512VectorTests.java ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1159483619 From duke at openjdk.org Thu Apr 6 09:22:15 2023 From: duke at openjdk.org (Chang Peng) Date: Thu, 6 Apr 2023 09:22:15 GMT Subject: RFR: 8301739: AArch64: Add optimized rules for vector compare with immediate for SVE In-Reply-To: References: Message-ID: On Tue, 28 Mar 2023 02:06:41 GMT, Chang Peng wrote: > We can use SVE compare-with-integer-immediate instructions like cmpgt(immediate)[1] to avoid the extra scalar2vector operations. > > The following instruction sequence > > > movi v17.16b, #12 > cmpgt p0.b, p7/z, z16.b, z17.b > > > can be optimized to: > > > cmpgt p0.b, p7/z, z16.b, #12 > > > This patch does the following: > 1. Add SVE compare-with-7bit-unsigned-immediate instructions to C2's backend. > SVE cmp(immediate) instructions can support vector comparing with 7bit unsigned integer immediate (range from 0 to > 127)or 5bit signed integer immediate (range from -16 to 15). > > 2. Add optimized match rules to generate the compare-with-immediate instructions. 
> > [1]: https://developer.arm.com/documentation/ddi0596/2021-12/SVE-Instructions/CMP-cc---immediate---Compare-vector-to-immediate- @theRealAph Hello. Could you please help review this patch? Thanks. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13200#issuecomment-1498748889 From pli at openjdk.org Thu Apr 6 10:04:23 2023 From: pli at openjdk.org (Pengfei Li) Date: Thu, 6 Apr 2023 10:04:23 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v12] In-Reply-To: References: Message-ID: On Thu, 30 Mar 2023 09:16:06 GMT, Dingli Zhang wrote: >> Hi, >> >> We have added support for masked vector add instructions; please take a look and review. Thanks a lot! >> This patch adds support for the masked versions of vector add/sub/mul/div. It was implemented by referring to RVV v1.0 [1]. >> >> ## Load/Store/Cmp Mask >> `VectorLoadMask, VectorMaskCmp, VectorStoreMask` will implement the mask datapath. We can see where the data is passed in the compilation log with `jdk/incubator/vector/Byte128VectorTests.java`: >> >> 218 loadV V1, [R7] # vector (rvv) >> 220 vloadmask V0, V1 >> ... >> 23c vmaskcmp_rvv_masked V0, V4, V5, V0, V1, #0 >> 24c vstoremask V1, V0 >> 258 storeV [R7], V1 # vector (rvv) >> >> >> The corresponding generated jit assembly: >> >> # loadV >> 0x000000400c8ef958: vsetvli t0,zero,e8,m1,tu,mu >> 0x000000400c8ef95c: vle8.v v1,(t2) >> >> # vloadmask >> 0x000000400c8ef960: vsetvli t0,zero,e8,m1,tu, >> 0x000000400c8ef964: vmsne.vx v0,v1,zero >> >> # vmaskcmp_rvv_masked >> 0x000000400c8ef97c: vsetvli t0,zero,e8,m1,tu,mu >> 0x000000400c8ef980: vmclr.m v1 >> 0x000000400c8ef984: vmseq.vv v1,v4,v5,v0.t >> 0x000000400c8ef988: vmv1r.v v0,v1 >> >> # vstoremask >> 0x000000400c8ef98c: vsetvli t0,zero,e8,m1,tu,mu >> 0x000000400c8ef990: vmv.v.x v1,zero >> 0x000000400c8ef994: vmerge.vim v1,v1,1,v0 >> >> >> ## Masked vector arithmetic instructions (e.g.
vadd) >> AddMaskTestMerge case: >> >> import jdk.incubator.vector.IntVector; >> import jdk.incubator.vector.VectorMask; >> import jdk.incubator.vector.VectorOperators; >> import jdk.incubator.vector.VectorSpecies; >> >> public class AddMaskTestMerge { >> >> static final VectorSpecies SPECIES = IntVector.SPECIES_128; >> static final int SIZE = 1024; >> static int[] a = new int[SIZE]; >> static int[] b = new int[SIZE]; >> static int[] r = new int[SIZE]; >> static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false}; >> static { >> for (int i = 0; i < SIZE; i++) { >> a[i] = i; >> b[i] = i; >> } >> } >> >> static void workload(int idx) { >> VectorMask vmask = VectorMask.fromArray(SPECIES, c, 0); >> IntVector av = IntVector.fromArray(SPECIES, a, idx); >> IntVector bv = IntVector.fromArray(SPECIES, b, idx); >> av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx); >> } >> >> public static void main(String[] args) { >> for (int i = 0; i < 30_0000; i++) { >> for (int j = 0; j < SIZE; j += SPECIES.length()) { >> workload(j); >> } >> } >> } >> } >> >> >> This test case is reduced from existing jtreg vector tests Int128VectorTests.java[2]. This test case corresponds to the add instruction of the vector mask version and other instructions are similar. >> >> Before this patch, the compilation log will not print RVV-related instructions. Now the compilation log is as follows: >> >> >> 0ae B10: # out( B25 B11 ) <- in( B9 ) Freq: 0.999991 >> 0ae loadV V1, [R31] # vector (rvv) >> 0b6 vloadmask V0, V2 >> 0be vadd.vv V3, V1, V0 #@vaddI_masked >> 0c6 lwu R28, [R7, #124] # loadN, compressed ptr, #@loadN ! 
Field: AddMaskTestMerge.r >> 0ca decode_heap_oop R28, R28 #@decodeHeapOop >> 0cc lwu R7, [R28, #12] # range, #@loadRange >> 0d0 NullCheck R28 >> >> >> And the jit code is as follows: >> >> >> 0x000000400c823cee: vsetvli t0,zero,e32,m1,tu,mu >> 0x000000400c823cf2: vle32.v v1,(t6) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0} >> ; - jdk.incubator.vector.IntVector::intoArray@43 (line 3228) >> ; - AddMaskTestMerge::workload@46 (line 25) >> 0x000000400c823cf6: vsetvli t0,zero,e8,m1,tu,mu >> 0x000000400c823cfa: vmsne.vx v0,v2,zero ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0} >> ; - jdk.incubator.vector.VectorMask::fromArray@47 (line 208) >> ; - AddMaskTestMerge::workload@7 (line 22) >> 0x000000400c823cfe: vsetvli t0,zero,e32,m1,tu,mu >> 0x000000400c823d02: vadd.vv v3,v3,v1,v0.t ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0} >> ; - jdk.incubator.vector.IntVector::lanewiseTemplate@192 (line 834) >> ; - jdk.incubator.vector.Int128Vector::lanewise@9 (line 291) >> ; - jdk.incubator.vector.Int128Vector::lanewise@4 (line 41) >> ; - AddMaskTestMerge::workload@39 (line 25) >> >> >> ## Mask register allocation & mask bit operation >> Since v0 is to be used as a mask register in spec[1], sometimes we need two vmasks to do the vector mask logical ops like `AndVMask, OrVMask, XorVMask`. And if only v0 and v31 mask registers are defined, the corresponding C2 nodes will not be generated correctly because of register pressure[3]. >> When we use only v0 and v31 as mask registers, jtreg testing of Byte128VectorTests.java[4] with `-XX:+PrintAssembly` and `-XX:LogFile` will not emit the expected RVV mask instruction.
As opposed to this, the following compilation failure log[3] is generated: >> >> >> >> >> >> >> >> >> So define v30 and v31 as mask register too and `AndVMask` will emit the C2 JIT code like: >> >> vloadmask V0, V1 >> vloadmask V30, V2 >> vmask_and V0, V30, V0 >> >> We also modified the implementation of `spill_copy_vector_stack_to_stack ` so that it no longer occupies the v0 register. In addition to that, we change some node like `vasr/vlsl/vlsr/vstring_x/varray_x/vclearArray_x`, which use v0 internally, to make C2 to sense that they used v0. >> >> By the way, the current implementation of `VectorMaskCast` is for the case of equal width of the parameter data, other cases depend on the subsequent cast node. >> >> [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc >> [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java >> [3] https://github.com/openjdk/jdk/blob/0deb648985b018653ccdaf193dc13b3cf21c088a/src/hotspot/share/opto/chaitin.cpp#L526 >> [4] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Byte128VectorTests.java >> >> ### Testing: >> >> qemu with UseRVV: >> - [x] Tier1 tests (release) >> - [x] Tier2 tests (release) >> - [ ] Tier3 tests (release) >> - [x] test/jdk/jdk/incubator/vector (release/fastdebug) > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Fix comment Hi, I'm not familiar with RISC-V but have some comments based on my experience on AArch64. src/hotspot/cpu/riscv/riscv.ad line 1936: > 1934: case Op_XorVMask: > 1935: case Op_OrVMask: > 1936: return true; I'm not quite familiar with RISC-V but my perception is that, for any `opcode, vlen, bt` combination, if the non-masked support check `Matcher::match_rule_supported_vector()` returns false, this masked version should return false as well. So I suggest calling `Matcher::match_rule_supported_vector()` instead of directly returning true here. 
You may refer the code in `aarch64_vector.ad`. src/hotspot/cpu/riscv/riscv_v.ad line 348: > 346: // vector add - predicated > 347: > 348: instruct vadd_masked(vReg dst_src1, vReg src2, vRegMask_V0 vmask) %{ Should this be named "vaddB_masked"? src/hotspot/cpu/riscv/riscv_v.ad line 501: > 499: // vector float div - predicated > 500: > 501: instruct vdivF_masked(vReg dst_src1, vReg src2, vRegMask_V0 vmask) %{ Perhaps adding a predicate of UseRVV, for all these matching rules? src/hotspot/cpu/riscv/riscv_v.ad line 2422: > 2420: %} > 2421: > 2422: instruct vmask_gen_I(vRegMask dst, iRegI src) %{ Just a reminder that your following new rules will enable array operations partial inlining. Have you tested this feature on RISC-V? ------------- PR Review: https://git.openjdk.org/jdk/pull/12682#pullrequestreview-1374459844 PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1159546863 PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1159554836 PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1159560228 PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1159568904 From roland at openjdk.org Thu Apr 6 11:56:23 2023 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 6 Apr 2023 11:56:23 GMT Subject: RFR: 8303466: C2: failed: malformed control flow. Limit type made precise with MaxL/MinL In-Reply-To: References: Message-ID: On Fri, 31 Mar 2023 12:44:17 GMT, Emanuel Peter wrote: > **Context** > > During `PhaseIdealLoop::do_unroll`, we hack the loop-limit, and subtract `stride` from it. We have to prevent underflow on that subtract. Currently, we do this with a `CMoveI`. The problem with this: `CMoveI` is not smart enough to generate a precise type. For example, there are many cases where the input types get better, and underflow is not possible anymore. But the `CMoveI` does not detect this, and still has type `min_jint..hi`. 
> > We have the same issue in `PhaseIdealLoop::adjust_limit`, where we use `CMoveL` to implement long max/min. The types are not as precise as they could and should be. > > **Problem** > > The imprecise type is used for the zero-trip-guard. It does not fold to false, even though the data-path into the post loop does constant fold to `TOP`. The graph breaks, and assert `malformed control flow` triggers. > > Details: In these cases, we have the super-unrolled main-loop (SuperWord'ed, then further unrolled) directly leading to a vectorized post-loop. The effect is that there is no `region/phi` merging main-exit and main-zero-trip-guard. So the types are already more narrow here. It may be possible that the values are such that we find out that we should never enter the vectorized post-loop. But if data finds out and control does not, we get a broken graph. > Note: we have pre-loop. Then a main-loop and vectorized post loop. Then we merge the main-zero-trip-guard. And at the end we have the scalar post loop. > > I have already recently fixed a bug around this `CMoveI`. https://github.com/openjdk/jdk/commit/5a4945c0d95423d0ab07762c915e9cb4d3c66abb I would now like to have a more satisfactory fix, that properly propagates the types. > > **Solution** > > `PhaseIdealLoop::adjust_limit` already converts the limit from int to long, and does all computations in long, including taking max/min with a `CMoveL`. I now use the so far unused `MaxL/MinL`. I implemented some missing `Value/Identity` components for it. Since `MaxL/MinL` is not implemented in the backend, I just expand it in macro-expansion to a `CMoveL`. At that point the loop-opts are over, and it is most likely ok that we do not make the types more precise after this. > > I take the same approach for `PhaseIdealLoop::do_unroll`: convert limits to long, do subtraction in long, take `MinL/MaxL` to clamp it to the int-range (prevent subtraction underflow). 
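The "convert to long, subtract in long, clamp to the int range" idea described in the Solution section above can be sketched on plain Java values. This is only a scalar analogy, not C2 IR: `Math.max`/`Math.min` on `long` stand in for `MaxL`/`MinL`, and the class and method names are hypothetical.

```java
// Scalar analogy only; names are hypothetical and this is not the C2 code.
final class LimitClampDemo {
    // Plain int subtraction: silently wraps around on underflow.
    static int naiveAdjust(int limit, int stride) {
        return limit - stride;
    }

    // Subtract in long (cannot overflow for int inputs), then clamp
    // the result back into the int range with max/min.
    static int clampedAdjust(int limit, int stride) {
        long adjusted = (long) limit - (long) stride;
        long clamped = Math.max((long) Integer.MIN_VALUE,
                                Math.min((long) Integer.MAX_VALUE, adjusted));
        return (int) clamped;
    }

    public static void main(String[] args) {
        int limit = Integer.MIN_VALUE + 1;
        int stride = 4;
        System.out.println("naive:   " + naiveAdjust(limit, stride));
        System.out.println("clamped: " + clampedAdjust(limit, stride));
    }
}
```

With the clamp, the adjusted limit never lands on the wrong side of the int range, which is the property the narrower type of the max/min result is meant to capture.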
> > **Discussion** > > This solution seems much cleaner to me, and I hope that we will see fewer bugs because of imprecise types in the limit computation, which were often due to the `CMove` not being smart enough to analyze all inputs (it would have to recognize a multitude of patterns, for the Cmp inputs and the direct inputs to the CMove - we currently do not do that, but just take the union of the input types - this is very imprecise). > > There is a bit of an overhead here: We use longs even though we only want to have int values. But I think we should prefer a clean implementation here, with correct type computation. The performance impact is probably non-existent on 64-bit machines anyway. > > **Caveat** > > I found some cases with the same assert `malformed control flow` that are most likely skeleton/assertion predicate bugs [JDK-8288981](https://bugs.openjdk.org/browse/JDK-8288981). Some of those cases were new patterns, for example where we PreMainPost a main loop. > > I hope that this fix here at least reduces the frequency of failures significantly. > > **Future Work** > > We should implement `MaxL/MinL` in the backend. We should also use them during parsing. This would also allow us to `SuperWord` the instruction, on the platforms that support it. > > Should we add such an assert during IGVN? I think after IGVN, we should never have a `MultiBranchNode` that does not have the required number of outputs, right? We could add it to `VerifyIterativeGVN`. That looks reasonable to me. Is the PhaseIdealLoop::adjust_limit() change required or is it some cleanup? Have you run performance testing to be safe? ------------- PR Comment: https://git.openjdk.org/jdk/pull/13269#issuecomment-1498943284 From epeter at openjdk.org Thu Apr 6 12:02:23 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 6 Apr 2023 12:02:23 GMT Subject: RFR: 8303466: C2: failed: malformed control flow.
Limit type made precise with MaxL/MinL In-Reply-To: References: Message-ID: <8eIXpBQ2mNJySoC4kx_XEpMfanQUJC9VUjp7QKBxfNk=.7dd0a03c-7a1d-414d-a6e9-cbdd537abe7a@github.com> On Thu, 6 Apr 2023 11:53:18 GMT, Roland Westrelin wrote: >> **Context** >> >> During `PhaseIdealLoop::do_unroll`, we hack the loop-limit, and subtract `stride` from it. We have to prevent underflow on that subtract. Currently, we do this with a `CMoveI`. The problem with this: `CMoveI` is not smart enough to generate a precise type. For example, there are many cases where the input types get better, and underflow is not possible anymore. But the `CMoveI` does not detect this, and still has type `min_jint..hi`. >> >> We have the same issue in `PhaseIdealLoop::adjust_limit`, where we use `CMoveL` to implement long max/min. The types are not as precise as they could and should be. >> >> **Problem** >> >> The imprecise type is used for the zero-trip-guard. It does not fold to false, even though the data-path into the post loop does constant fold to `TOP`. The graph breaks, and assert `malformed control flow` triggers. >> >> Details: In these cases, we have the super-unrolled main-loop (SuperWord'ed, then further unrolled) directly leading to a vectorized post-loop. The effect is that there is no `region/phi` merging main-exit and main-zero-trip-guard. So the types are already more narrow here. It may be possible that the values are such that we find out that we should never enter the vectorized post-loop. But if data finds out and control does not, we get a broken graph. >> Note: we have pre-loop. Then a main-loop and vectorized post loop. Then we merge the main-zero-trip-guard. And at the end we have the scalar post loop. >> >> I have already recently fixed a bug around this `CMoveI`. https://github.com/openjdk/jdk/commit/5a4945c0d95423d0ab07762c915e9cb4d3c66abb I would now like to have a more satisfactory fix, that properly propagates the types. 
>> >> **Solution** >> >> `PhaseIdealLoop::adjust_limit` already converts the limit from int to long, and does all computations in long, including taking max/min with a `CMoveL`. I now use the so far unused `MaxL/MinL`. I implemented some missing `Value/Identity` components for it. Since `MaxL/MinL` is not implemented in the backend, I just expand it in macro-expansion to a `CMoveL`. At that point the loop-opts are over, and it is most likely ok that we do not make the types more precise after this. >> >> I take the same approach for `PhaseIdealLoop::do_unroll`: convert limits to long, do subtraction in long, take `MinL/MaxL` to clamp it to the int-range (prevent subtraction underflow). >> >> **Discussion** >> >> This solution seems much cleaner to me, and I hope that we will see fewer bugs because of imprecise types in the limit computation, which were often due to the `CMove` not being smart enough to analyze all inputs (it would have to recognize a multitude of patterns, for the Cmp inputs and the direct inputs to the CMove - we currently do not do that, but just take the union of the input types - this is very imprecise). >> >> There is a bit of an overhead here: We use longs even though we only want to have int values. But I think we should prefer a clean implementation here, with correct type computation. The performance impact is probably non-existent on 64-bit machines anyway. >> >> **Caveat** >> >> I found some cases with the same assert `malformed control flow` that are most likely skeleton/assertion predicate bugs [JDK-8288981](https://bugs.openjdk.org/browse/JDK-8288981). Some of those cases were new patterns, for example where we PreMainPost a main loop. >> >> I hope that this fix here at least reduces the frequency of failures significantly. >> >> **Future Work** >> >> We should implement `MaxL/MinL` in the backend. We should also use them during parsing. This would also allow us to `SuperWord` the instruction, on the platforms that support it.
>> >> Should we add such an assert during IGVN? I think after IGVN, we should never have a `MultiBranchNode` that does not have the required number of outputs, right? We could add it to `VerifyIterativeGVN`. > > That looks reasonable to me. > Is the PhaseIdealLoop::adjust_limit() change required or is it some cleanup? > Have you run performance testing to be safe? @rwestrel > Is the PhaseIdealLoop::adjust_limit() change required or is it some cleanup? At first I only fixed `PhaseIdealLoop::do_unroll`. That fixed my regression test examples. But, once that fix was introduced, another test failed, and that was because now the type of the `CMove` in `PhaseIdealLoop::adjust_limit` was not precise enough. So it seemed the best solution to fix them together, since they both have issues with `CMove`, and change the `limit`. > Have you run performance testing to be safe? I'll do that, will report back. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13269#issuecomment-1498949924 From roland at openjdk.org Thu Apr 6 12:08:06 2023 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 6 Apr 2023 12:08:06 GMT Subject: RFR: 8303466: C2: failed: malformed control flow. Limit type made precise with MaxL/MinL In-Reply-To: References: Message-ID: On Fri, 31 Mar 2023 12:44:17 GMT, Emanuel Peter wrote: > **Context** > > During `PhaseIdealLoop::do_unroll`, we hack the loop-limit, and subtract `stride` from it. We have to prevent underflow on that subtract. Currently, we do this with a `CMoveI`. The problem with this: `CMoveI` is not smart enough to generate a precise type. For example, there are many cases where the input types get better, and underflow is not possible anymore. But the `CMoveI` does not detect this, and still has type `min_jint..hi`. > > We have the same issue in `PhaseIdealLoop::adjust_limit`, where we use `CMoveL` to implement long max/min. The types are not as precise as they could and should be.
> > **Problem** > > The imprecise type is used for the zero-trip-guard. It does not fold to false, even though the data-path into the post loop does constant fold to `TOP`. The graph breaks, and assert `malformed control flow` triggers. > > Details: In these cases, we have the super-unrolled main-loop (SuperWord'ed, then further unrolled) directly leading to a vectorized post-loop. The effect is that there is no `region/phi` merging main-exit and main-zero-trip-guard. So the types are already more narrow here. It may be possible that the values are such that we find out that we should never enter the vectorized post-loop. But if data finds out and control does not, we get a broken graph. > Note: we have pre-loop. Then a main-loop and vectorized post loop. Then we merge the main-zero-trip-guard. And at the end we have the scalar post loop. > > I have already recently fixed a bug around this `CMoveI`. https://github.com/openjdk/jdk/commit/5a4945c0d95423d0ab07762c915e9cb4d3c66abb I would now like to have a more satisfactory fix, that properly propagates the types. > > **Solution** > > `PhaseIdealLoop::adjust_limit` already converts the limit from int to long, and does all computations in long, including taking max/min with a `CMoveL`. I now use the so far unused `MaxL/MinL`. I implemented some missing `Value/Identity` components for it. Since `MaxL/MinL` is not implemented in the backend, I just expand it in macro-expansion to a `CMoveL`. At that point the loop-opts are over, and it is most likely ok that we do not make the types more precise after this. > > I take the same approach for `PhaseIdealLoop::do_unroll`: convert limits to long, do subtraction in long, take `MinL/MaxL` to clamp it to the int-range (prevent subtraction underflow). 
> > **Discussion** > > This solution seems much cleaner to me, and I hope that we will see fewer bugs because of imprecise types in the limit computation, which were often due to the `CMove` not being smart enough to analyze all inputs (it would have to recognize a multitude of patterns, for the Cmp inputs and the direct inputs to the CMove - we currently do not do that, but just take the union of the input types - this is very imprecise). > > There is a bit of an overhead here: We use longs even though we only want to have int values. But I think we should prefer a clean implementation here, with correct type computation. The performance impact is probably non-existent on 64-bit machines anyway. > > **Caveat** > > I found some cases with the same assert `malformed control flow` that are most likely skeleton/assertion predicate bugs [JDK-8288981](https://bugs.openjdk.org/browse/JDK-8288981). Some of those cases were new patterns, for example where we PreMainPost a main loop. > > I hope that this fix here at least reduces the frequency of failures significantly. > > **Future Work** > > We should implement `MaxL/MinL` in the backend. We should also use them during parsing. This would also allow us to `SuperWord` the instruction, on the platforms that support it. > > Should we add such an assert during IGVN? I think after IGVN, we should never have a `MultiBranchNode` that does not have the required number of outputs, right? We could add it to `VerifyIterativeGVN`. Looks good to me. ------------- Marked as reviewed by roland (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13269#pullrequestreview-1374722921 From rrich at openjdk.org Thu Apr 6 13:41:14 2023 From: rrich at openjdk.org (Richard Reingruber) Date: Thu, 6 Apr 2023 13:41:14 GMT Subject: RFR: 8305668: PPC: Non-Top Interpreted frames should be independent of ABI_ELFv2 Message-ID: This PR makes parent interpreted Java frames independent of `ABI_ELFv2`.
With the changes `test/jdk/jdk/internal/vm/Continuation/BasicExt.java#COMP_ALL` succeeds on PPC64 Big Endian Linux. Before: * `parent_ijava_frame_abi` was derived from `abi_minframe` which depends on `ABI_ELFv2` * jit_abi is independent of `abi_minframe` * `frame::metadata_words` is wrong for `parent_ijava_frame_abi` if `ABI_ELFv2` is not defined (big endian) After changes: * prefixed structs that depend on `ABI_ELFv2` with `native_` * introduced `java_abi` which is independent of `ABI_ELFv2` * `frame::metadata_words` is the size in words of `java_abi` This is still a little imprecise since `top_ijava_frame_abi` is larger than `java_abi` but the top frame is never frozen as it is always `vmIntrinsics::_Continuation_doYield` Testing: PPC64le: most JCK and JTREG tiers 1-4, also in Xcomp mode. PPC64be Linux: hotspot tier1 tests ------------- Commit messages: - Update comments - Rename abi_reg_args_spill -> native_abi_reg_args_spill - Use correct abi definitions - Rename native abi size enum elements - Introduce common_abi - Derive parent_ijava_frame_abi from java_abi - java_abi - Native abi structs Changes: https://git.openjdk.org/jdk/pull/13372/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13372&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8305668 Stats: 146 lines in 21 files changed: 13 ins; 13 del; 120 mod Patch: https://git.openjdk.org/jdk/pull/13372.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13372/head:pull/13372 PR: https://git.openjdk.org/jdk/pull/13372 From rrich at openjdk.org Thu Apr 6 13:41:15 2023 From: rrich at openjdk.org (Richard Reingruber) Date: Thu, 6 Apr 2023 13:41:15 GMT Subject: RFR: 8305668: PPC: Non-Top Interpreted frames should be independent of ABI_ELFv2 In-Reply-To: References: Message-ID: On Thu, 6 Apr 2023 13:22:49 GMT, Richard Reingruber wrote: > This PR makes parent interpreted Java frames independent of `ABI_ELFv2`. 
> With the changes `test/jdk/jdk/internal/vm/Continuation/BasicExt.java#COMP_ALL` succeeds on PPC64 Big Endian Linux. > > Before: > > * `parent_ijava_frame_abi` was derived from `abi_minframe` which depends on `ABI_ELFv2` > * jit_abi is independent of `abi_minframe` > * `frame::metadata_words` is wrong for `parent_ijava_frame_abi` if `ABI_ELFv2` is not defined (big endian) > > After changes: > > * prefixed structs that depend on `ABI_ELFv2` with `native_` > * introduced `java_abi` which is independent of `ABI_ELFv2` > * `frame::metadata_words` is the size in words of `java_abi` > > This is still a little imprecise since `top_ijava_frame_abi` is larger than `java_abi` but the top frame is never frozen as it is always `vmIntrinsics::_Continuation_doYield` > > Testing: > > PPC64le: most JCK and JTREG tiers 1-4, also in Xcomp mode. > PPC64be Linux: hotspot tier1 tests Hi Tyler, @backwaterred, hope this will help with continuations on AIX... :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/13372#issuecomment-1499078364 From phh at openjdk.org Thu Apr 6 14:08:15 2023 From: phh at openjdk.org (Paul Hohensee) Date: Thu, 6 Apr 2023 14:08:15 GMT Subject: RFR: 8300829: Make CtwRunner available as an independent tool [v2] In-Reply-To: References: Message-ID: On Wed, 5 Apr 2023 16:55:59 GMT, Joshua Cao wrote: >> test/hotspot/jtreg/testlibrary/ctw/src/sun/hotspot/tools/ctw/CtwRunner.java line 60: >> >>> 58: * comma-separated arguments to pass to CTW subprocesses. >>> 59: */ >>> 60: public static final String CTW_EXTRA_ARGS >> >> how about you use getProperty("sun.hotspot.tools.ctwrunner.ctw_extra_args", ""). >> by giving it an empty string as default value, you can take iout if (null != CTW_EXTRA_ARGS) below. >> >> btw, you may also need to update the year in the copyrights header. > > Yup. Made these changes. Better to make CTW_EXTRA_ARGS private. If it's needed outside this class, add a public accessor. Don't think it's needed outside though. 
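The two suggestions in this thread (an empty-string default for the property, and a private field) can be sketched roughly as follows. This is only an illustration, not the actual `CtwRunner` code: the class `CtwExtraArgsSketch` and the helper `splitArgs` are hypothetical names, and only the property name is taken from the discussion above.

```java
import java.util.ArrayList;
import java.util.List;

public class CtwExtraArgsSketch {
    // Property name as quoted in the review; the "" default means the
    // field is never null, so no null check is needed later.
    private static final String CTW_EXTRA_ARGS =
            System.getProperty("sun.hotspot.tools.ctwrunner.ctw_extra_args", "");

    // Hypothetical helper: split a comma-separated argument string,
    // dropping empty entries (so "" yields an empty list).
    static List<String> splitArgs(String raw) {
        List<String> result = new ArrayList<>();
        for (String arg : raw.split(",")) {
            if (!arg.isEmpty()) {
                result.add(arg);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Arguments taken from the system property, if any was set.
        System.out.println(splitArgs(CTW_EXTRA_ARGS));
        // The flags used as an example in the PR description.
        System.out.println(splitArgs("-XX:+StressLCM,-XX:+StressGCM"));
    }
}
```

Keeping the constant private means other classes cannot depend on it; if it were ever needed elsewhere, an accessor could be added without changing the field.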
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13344#discussion_r1159848680 From dzhang at openjdk.org Thu Apr 6 14:14:23 2023 From: dzhang at openjdk.org (Dingli Zhang) Date: Thu, 6 Apr 2023 14:14:23 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v12] In-Reply-To: References: Message-ID: On Thu, 6 Apr 2023 09:41:03 GMT, Pengfei Li wrote: >> Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix comment > > src/hotspot/cpu/riscv/riscv.ad line 1936: > >> 1934: case Op_XorVMask: >> 1935: case Op_OrVMask: >> 1936: return true; > > I'm not quite familiar with RISC-V but my perception is that, for any `opcode, vlen, bt` combination, if the non-masked support check `Matcher::match_rule_supported_vector()` returns false, this masked version should return false as well. So I suggest calling `Matcher::match_rule_supported_vector()` instead of directly returning true here. You may refer the code in `aarch64_vector.ad`. Thanks for the review! Fixed. > src/hotspot/cpu/riscv/riscv_v.ad line 348: > >> 346: // vector add - predicated >> 347: >> 348: instruct vadd_masked(vReg dst_src1, vReg src2, vRegMask_V0 vmask) %{ > > Should this be named "vaddB_masked"? Fixed. > src/hotspot/cpu/riscv/riscv_v.ad line 501: > >> 499: // vector float div - predicated >> 500: >> 501: instruct vdivF_masked(vReg dst_src1, vReg src2, vRegMask_V0 vmask) %{ > > Perhaps adding a predicate of UseRVV, for all these matching rules? There are two cases of vector nodes in aarch64, `neon` and `sve`, but there doesn't seem to be two cases in riscv. `match_rule_supported_vector -> op_vec_supported` affects whether to generate a vector c2 node, where the default return value is `UseRVV` except for false. So I think even if the vector c2 node in riscv is a composite node (e.g. `vdivF_masked`), maybe there is no need to add the predicate `UseRVV` to each matching rule. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1159853171 PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1159853422 PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1159854187 From dzhang at openjdk.org Thu Apr 6 14:38:00 2023 From: dzhang at openjdk.org (Dingli Zhang) Date: Thu, 6 Apr 2023 14:38:00 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v13] In-Reply-To: References: Message-ID: > HI, > > We have added support for vector add mask instructions, please take a look and have some reviews. Thanks a lot! > This patch will add support of vector add/sub/mul/div mask version. It was implemented by referring to RVV v1.0 [1]. > > ## Load/Store/Cmp Mask > `VectorLoadMask, VectorMaskCmp, VectorStoreMask` will implement the mask datapath. We can see where the data is passed in the compilation log with `jdk/incubator/vector/Byte128VectorTests.java`? > > 218 loadV V1, [R7] # vector (rvv) > 220 vloadmask V0, V1 > ... > 23c vmaskcmp_rvv_masked V0, V4, V5, V0, V1, #0 > 24c vstoremask V1, V0 > 258 storeV [R7], V1 # vector (rvv) > > > The corresponding generated jit assembly? > > # loadV > 0x000000400c8ef958: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef95c: vle8.v v1,(t2) > > # vloadmask > 0x000000400c8ef960: vsetvli t0,zero,e8,m1,tu, > 0x000000400c8ef964: vmsne.vx v0,v1,zero > > # vmaskcmp_rvv_masked > 0x000000400c8ef97c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef980: vmclr.m v1 > 0x000000400c8ef984: vmseq.vv v1,v4,v5,v0.t > 0x000000400c8ef988: vmv1r.v v0,v1 > > # vstoremask > 0x000000400c8ef98c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef990: vmv.v.x v1,zero > 0x000000400c8ef994: vmerge.vim v1,v1,1,v0 > > > ## Masked vector arithmetic instructions (e.g. 
vadd)
> AddMaskTestMerge case:
>
> import jdk.incubator.vector.IntVector;
> import jdk.incubator.vector.VectorMask;
> import jdk.incubator.vector.VectorOperators;
> import jdk.incubator.vector.VectorSpecies;
>
> public class AddMaskTestMerge {
>
>     static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128;
>     static final int SIZE = 1024;
>     static int[] a = new int[SIZE];
>     static int[] b = new int[SIZE];
>     static int[] r = new int[SIZE];
>     static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false};
>     static {
>         for (int i = 0; i < SIZE; i++) {
>             a[i] = i;
>             b[i] = i;
>         }
>     }
>
>     static void workload(int idx) {
>         VectorMask<Integer> vmask = VectorMask.fromArray(SPECIES, c, 0);
>         IntVector av = IntVector.fromArray(SPECIES, a, idx);
>         IntVector bv = IntVector.fromArray(SPECIES, b, idx);
>         av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx);
>     }
>
>     public static void main(String[] args) {
>         for (int i = 0; i < 30_0000; i++) {
>             for (int j = 0; j < SIZE; j += SPECIES.length()) {
>                 workload(j);
>             }
>         }
>     }
> }
>
> This test case is reduced from the existing jtreg vector test Int128VectorTests.java[2]. It corresponds to the masked version of the vector add instruction; the other instructions are similar.
>
> Before this patch, the compilation log did not print RVV-related instructions. Now the compilation log is as follows:
>
> 0ae B10: # out( B25 B11 ) <- in( B9 ) Freq: 0.999991
> 0ae loadV V1, [R31] # vector (rvv)
> 0b6 vloadmask V0, V2
> 0be vadd.vv V3, V1, V0 #@vaddI_masked
> 0c6 lwu R28, [R7, #124] # loadN, compressed ptr, #@loadN !
Field: AddMaskTestMerge.r
> 0ca decode_heap_oop R28, R28 #@decodeHeapOop
> 0cc lwu R7, [R28, #12] # range, #@loadRange
> 0d0 NullCheck R28
>
> And the jit code is as follows:
>
> 0x000000400c823cee: vsetvli t0,zero,e32,m1,tu,mu
> 0x000000400c823cf2: vle32.v v1,(t6) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0}
> ; - jdk.incubator.vector.IntVector::intoArray at 43 (line 3228)
> ; - AddMaskTestMerge::workload at 46 (line 25)
> 0x000000400c823cf6: vsetvli t0,zero,e8,m1,tu,mu
> 0x000000400c823cfa: vmsne.vx v0,v2,zero ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0}
> ; - jdk.incubator.vector.VectorMask::fromArray at 47 (line 208)
> ; - AddMaskTestMerge::workload at 7 (line 22)
> 0x000000400c823cfe: vsetvli t0,zero,e32,m1,tu,mu
> 0x000000400c823d02: vadd.vv v3,v3,v1,v0.t ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0}
> ; - jdk.incubator.vector.IntVector::lanewiseTemplate at 192 (line 834)
> ; - jdk.incubator.vector.Int128Vector::lanewise at 9 (line 291)
> ; - jdk.incubator.vector.Int128Vector::lanewise at 4 (line 41)
> ; - AddMaskTestMerge::workload at 39 (line 25)
>
> ## Mask register allocation & mask bit operation
> Since v0 is to be used as a mask register in spec[1], sometimes we need two vmask registers to do vector mask logical ops like `AndVMask, OrVMask, XorVMask`. And if only v0 and v31 are defined as mask registers, the corresponding c2 nodes will not be generated correctly because of the register pressure[3].
> When we use only v0 and v31 as mask registers, jtreg testing of Byte128VectorTests.java[4] with `-XX:+PrintAssembly` and `-XX:LogFile` will not emit the expected rvv mask instruction.
As opposed to this, the following compilation failure log[3] is generated: > > > > > > > > > So define v30 and v31 as mask register too and `AndVMask` will emit the C2 JIT code like: > > vloadmask V0, V1 > vloadmask V30, V2 > vmask_and V0, V30, V0 > > We also modified the implementation of `spill_copy_vector_stack_to_stack ` so that it no longer occupies the v0 register. In addition to that, we change some node like `vasr/vlsl/vlsr/vstring_x/varray_x/vclearArray_x`, which use v0 internally, to make C2 to sense that they used v0. > > By the way, the current implementation of `VectorMaskCast` is for the case of equal width of the parameter data, other cases depend on the subsequent cast node. > > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc > [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java > [3] https://github.com/openjdk/jdk/blob/0deb648985b018653ccdaf193dc13b3cf21c088a/src/hotspot/share/opto/chaitin.cpp#L526 > [4] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Byte128VectorTests.java > > ### Testing: > > qemu with UseRVV: > - [x] Tier1 tests (release) > - [x] Tier2 tests (release) > - [ ] Tier3 tests (release) > - [x] test/jdk/jdk/incubator/vector (release/fastdebug) Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: Fix typo and use match_rule_supported_vector instead of true ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12682/files - new: https://git.openjdk.org/jdk/pull/12682/files/8083ede3..447ea6c9 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=12 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=11-12 Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/12682.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12682/head:pull/12682 PR: https://git.openjdk.org/jdk/pull/12682 From stuefe at openjdk.org Thu Apr 6 
16:37:11 2023
From: stuefe at openjdk.org (Thomas Stuefe)
Date: Thu, 6 Apr 2023 16:37:11 GMT
Subject: RFR: JDK-8305711: Arm: C2 always enters slowpath for monitorexit
Message-ID:

A small bug in the C2 implementation of monitorexit for thin locks causes us to always enter the slow path.

This seems to be a day zero bug of the arm port, since JEP 297: "Unified arm32/arm64 Port". It has a significant effect on locking performance, but its effect had been hidden until JDK 15 by biased locking. Biased locking removal made the bug apparent.

With this patch, @rkennke's artificial microbenchmark that does nothing but uncontended locking improves greatly (see https://github.com/rkennke/fastlockbench):

Benchmark                      (backoff)  Mode  Cnt    Score   Error  Units
FastLockingBenchmark.testSync          0  avgt    2  110.600          ns/op
FastLockingBenchmark.testSync          1  avgt    2  105.725          ns/op
FastLockingBenchmark.testSync          2  avgt    2  122.780          ns/op
FastLockingBenchmark.testSync          4  avgt    2  125.133          ns/op
FastLockingBenchmark.testSync          8  avgt    2  151.915          ns/op
FastLockingBenchmark.testSync         16  avgt    2  206.458          ns/op
FastLockingBenchmark.testSync         32  avgt    2  313.980          ns/op
FastLockingBenchmark.testSync         64  avgt    2  522.206          ns/op

New:

Benchmark                      (backoff)  Mode  Cnt    Score   Error  Units
FastLockingBenchmark.testSync          0  avgt    2   60.102          ns/op
FastLockingBenchmark.testSync          1  avgt    2   61.667          ns/op
FastLockingBenchmark.testSync          2  avgt    2   74.950          ns/op
FastLockingBenchmark.testSync          4  avgt    2   85.480          ns/op
FastLockingBenchmark.testSync          8  avgt    2  115.019          ns/op
FastLockingBenchmark.testSync         16  avgt    2  178.046          ns/op
FastLockingBenchmark.testSync         32  avgt    2  273.376          ns/op
FastLockingBenchmark.testSync         64  avgt    2  500.287          ns/op

Please note that Arm remains broken since JDK-8301995; I based and tested this patch on the parent of that change.
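For readers unfamiliar with the workload, the following is a rough sketch of what an uncontended `synchronized` loop looks like. It is a hypothetical stand-in (the class `UncontendedLockSketch` is made up for illustration), not the JMH code from fastlockbench, and it has no warm-up or backoff parameter, so its timings are not comparable to the tables above.

```java
public class UncontendedLockSketch {
    private final Object lock = new Object();
    private long counter;

    // Each call performs one monitorenter/monitorexit pair on a lock
    // that no other thread ever touches (the uncontended case).
    long incrementLocked() {
        synchronized (lock) {
            return ++counter;
        }
    }

    public static void main(String[] args) {
        UncontendedLockSketch sketch = new UncontendedLockSketch();
        int iterations = 1_000_000;
        long last = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            last = sketch.incrementLocked();
        }
        long elapsed = System.nanoTime() - start;
        // Crude average; a real measurement should use a harness such as JMH.
        System.out.println("final counter = " + last);
        System.out.println("approx ns/op  = " + (double) elapsed / iterations);
    }
}
```

With the bug described above, every exit from such a `synchronized` block in C2-compiled code would go through the slow path, which is why even a loop this simple is sensitive to the fix. Note that a JIT may optimize a toy loop like this in other ways, which is another reason to treat the numbers it prints as indicative only.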
------------- Commit messages: - fix Changes: https://git.openjdk.org/jdk/pull/13376/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13376&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8305711 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/13376.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13376/head:pull/13376 PR: https://git.openjdk.org/jdk/pull/13376 From dzhang at openjdk.org Thu Apr 6 16:45:16 2023 From: dzhang at openjdk.org (Dingli Zhang) Date: Thu, 6 Apr 2023 16:45:16 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v12] In-Reply-To: References: Message-ID: On Thu, 6 Apr 2023 09:56:49 GMT, Pengfei Li wrote: >> Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix comment > > src/hotspot/cpu/riscv/riscv_v.ad line 2422: > >> 2420: %} >> 2421: >> 2422: instruct vmask_gen_I(vRegMask dst, iRegI src) %{ > > Just a reminder that your following new rules will enable array operations partial inlining. Have you tested this feature on RISC-V? Thanks for the reminder. From the tests so far, `ByteMaxVectorTests`[1] can pass and generate `VectorMaskGen` node as follows: 0b2e B70: # out( B103 B71 ) <- in( B68 ) Freq: 9.13565 0b2e spill R7 -> [sp, #320] # spill size = 64 0b30 vmask_gen_L V0, R7 0b38 vstoremask V6, V0 0b44 ld R30, [R23, #264] # ptr, #@loadP 0b48 ld R7, [R23, #280] # ptr, #@loadP 0b4c addi R28, R30, #80 # ptr, #@addP_reg_imm 0b50 bgeu R28, R7, B103 #@cmpP_branch P=0.000100 C=-1.000000 We will evaluate this feature next. 
[1] https://github.com/openjdk/jdk/tree/master/test/jdk/jdk/incubator/vector/ByteMaxVectorTests.java ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1160033082 From shade at openjdk.org Thu Apr 6 16:59:12 2023 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 6 Apr 2023 16:59:12 GMT Subject: RFR: JDK-8305711: Arm: C2 always enters slowpath for monitorexit In-Reply-To: References: Message-ID: On Thu, 6 Apr 2023 16:29:57 GMT, Thomas Stuefe wrote: > A small bug in the C2 implementation of monitorexit for thin locks causes us to always enter the slow path. > > This seems to be a day zero bug of the arm port, since JEP 297: "Unified arm32/arm64 Port". It has a significant effect on locking performance, but its effect had been hidden until JDK 15 by biased locking. Biased locking removal made the bug appearant. > > With this patch, @rkennke's artificial microbenchmark that does nothing but uncontended locking improves greatly (see https://github.com/rkennke/fastlockbench): > > > Benchmark (backoff) Mode Cnt Score Error Units > FastLockingBenchmark.testSync 0 avgt 2 110.600 ns/op > FastLockingBenchmark.testSync 1 avgt 2 105.725 ns/op > FastLockingBenchmark.testSync 2 avgt 2 122.780 ns/op > FastLockingBenchmark.testSync 4 avgt 2 125.133 ns/op > FastLockingBenchmark.testSync 8 avgt 2 151.915 ns/op > FastLockingBenchmark.testSync 16 avgt 2 206.458 ns/op > FastLockingBenchmark.testSync 32 avgt 2 313.980 ns/op > FastLockingBenchmark.testSync 64 avgt 2 522.206 ns/op > > > New: > > Benchmark (backoff) Mode Cnt Score Error Units > FastLockingBenchmark.testSync 0 avgt 2 60.102 ns/op > FastLockingBenchmark.testSync 1 avgt 2 61.667 ns/op > FastLockingBenchmark.testSync 2 avgt 2 74.950 ns/op > FastLockingBenchmark.testSync 4 avgt 2 85.480 ns/op > FastLockingBenchmark.testSync 8 avgt 2 115.019 ns/op > FastLockingBenchmark.testSync 16 avgt 2 178.046 ns/op > FastLockingBenchmark.testSync 32 avgt 2 273.376 ns/op > FastLockingBenchmark.testSync 
64 avgt 2 500.287 ns/op
>
> Please note that Arm remains broken since JDK-8301995; I based and tested this patch on the parent of that change.

Ouch! Good thing this does not blow up correctness-wise? The object header would almost never (famous last words) look like a displaced header when locked. `InterpreterMacroAssembler::unlock_object` does it correctly, and this code now matches the interpreter.

-------------

Marked as reviewed by shade (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/13376#pullrequestreview-1375254094

From shade at openjdk.org Thu Apr 6 17:03:13 2023
From: shade at openjdk.org (Aleksey Shipilev)
Date: Thu, 6 Apr 2023 17:03:13 GMT
Subject: RFR: JDK-8305711: Arm: C2 always enters slowpath for monitorexit
In-Reply-To:
References:
Message-ID:

On Thu, 6 Apr 2023 16:29:57 GMT, Thomas Stuefe wrote:

> A small bug in the C2 implementation of monitorexit for thin locks causes us to always enter the slow path.
>
> This seems to be a day zero bug of the arm port, since JEP 297: "Unified arm32/arm64 Port". It has a significant effect on locking performance, but its effect had been hidden until JDK 15 by biased locking. Biased locking removal made the bug apparent.
> > With this patch, @rkennke's artificial microbenchmark that does nothing but uncontended locking improves greatly (see https://github.com/rkennke/fastlockbench): > > > Benchmark (backoff) Mode Cnt Score Error Units > FastLockingBenchmark.testSync 0 avgt 2 110.600 ns/op > FastLockingBenchmark.testSync 1 avgt 2 105.725 ns/op > FastLockingBenchmark.testSync 2 avgt 2 122.780 ns/op > FastLockingBenchmark.testSync 4 avgt 2 125.133 ns/op > FastLockingBenchmark.testSync 8 avgt 2 151.915 ns/op > FastLockingBenchmark.testSync 16 avgt 2 206.458 ns/op > FastLockingBenchmark.testSync 32 avgt 2 313.980 ns/op > FastLockingBenchmark.testSync 64 avgt 2 522.206 ns/op > > > New: > > Benchmark (backoff) Mode Cnt Score Error Units > FastLockingBenchmark.testSync 0 avgt 2 60.102 ns/op > FastLockingBenchmark.testSync 1 avgt 2 61.667 ns/op > FastLockingBenchmark.testSync 2 avgt 2 74.950 ns/op > FastLockingBenchmark.testSync 4 avgt 2 85.480 ns/op > FastLockingBenchmark.testSync 8 avgt 2 115.019 ns/op > FastLockingBenchmark.testSync 16 avgt 2 178.046 ns/op > FastLockingBenchmark.testSync 32 avgt 2 273.376 ns/op > FastLockingBenchmark.testSync 64 avgt 2 500.287 ns/op > > > Please note that Arm remains broken since JDK-8301995; I based and tested this patch on the parent of that change. Usual synopsis for ARM32 bugs is "ARM32: C2 always...", I think :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/13376#issuecomment-1499361397 From duke at openjdk.org Thu Apr 6 17:27:19 2023 From: duke at openjdk.org (Joshua Cao) Date: Thu, 6 Apr 2023 17:27:19 GMT Subject: RFR: 8300829: Make CtwRunner available as an independent tool [v3] In-Reply-To: References: Message-ID: > 1. ~~Create an independent jar file with CtwRunner as the main class to make it easier to run~~. Not needed anymore thanks to @navyxliu 's comments. We can use the original `ctw.jar` because it already has ctwrunner class. > 2. Output the class files directly into the destination directory. 
Currently, CTW expects a `wb.jar`, but the jtreg tests that use CTWRunner has class files outside of a jar. > 3. Introduce `sun.hotspot.tools.ctwrunner.ctw_extra_args` option to pass extra arguments to CTW. Arguments are comma separated because working with spaces in bash can be kind of awkward, but I'm open to changing this part. > > ### Motivation > CTWRunner is a wrapper around CTW that will continue compiling beyond failure. It can be useful for testing compilation with certain flags. For example, one could run > > > JAVA_OPTIONS="-Dsun.hotspot.tools.ctwrunner.ctw_extra_args=-XX:+StressLCM,-XX:+StressGCM" ./ctwrunner.sh modules:java.base > > > To test compiling the java.base module with `-XX:+StressLCM -XX:+StressGCM`. This is advantageous over uses CTW because we can see the full list of crashes for the entire module. Joshua Cao has updated the pull request incrementally with one additional commit since the last revision: Make CTW_EXTRA_ARGS private ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13344/files - new: https://git.openjdk.org/jdk/pull/13344/files/35979be8..76f8434c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13344&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13344&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/13344.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13344/head:pull/13344 PR: https://git.openjdk.org/jdk/pull/13344 From duke at openjdk.org Thu Apr 6 17:27:21 2023 From: duke at openjdk.org (Joshua Cao) Date: Thu, 6 Apr 2023 17:27:21 GMT Subject: RFR: 8300829: Make CtwRunner available as an independent tool [v3] In-Reply-To: References: Message-ID: <8kuAsPk-741UUTV2vbwDuIvxzBXIZX7N1NC5nyk0KJg=.ed2f37b2-1f72-4bb4-9f4d-57c778ae441d@github.com> On Thu, 6 Apr 2023 14:05:10 GMT, Paul Hohensee wrote: >> Yup. Made these changes. > > Better to make CTW_EXTRA_ARGS private. If it's needed outside this class, add a public accessor. 
Don't think it's needed outside though.

Made it private. Not needed outside.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/13344#discussion_r1160067458

From duke at openjdk.org Thu Apr 6 17:55:24 2023
From: duke at openjdk.org (Joshua Cao)
Date: Thu, 6 Apr 2023 17:55:24 GMT
Subject: RFR: 8300829: Make CtwRunner available as an independent tool [v4]
In-Reply-To:
References:
Message-ID:

> 1. ~~Create an independent jar file with CtwRunner as the main class to make it easier to run~~. Not needed anymore thanks to @navyxliu's comments. We can use the original `ctw.jar` because it already has the CtwRunner class.
> 2. Output the class files directly into the destination directory. Currently, CTW expects a `wb.jar`, but the jtreg tests that use CTWRunner have class files outside of a jar.
> 3. Introduce `sun.hotspot.tools.ctwrunner.ctw_extra_args` option to pass extra arguments to CTW. Arguments are comma separated because working with spaces in bash can be kind of awkward, but I'm open to changing this part.
>
> ### Motivation
> CTWRunner is a wrapper around CTW that will continue compiling beyond failure. It can be useful for testing compilation with certain flags. For example, one could run
>
> JAVA_OPTIONS="-Dsun.hotspot.tools.ctwrunner.ctw_extra_args=-XX:+StressLCM,-XX:+StressGCM" ./ctwrunner.sh modules:java.base
>
> to test compiling the java.base module with `-XX:+StressLCM -XX:+StressGCM`. This is advantageous over using CTW because we can see the full list of crashes for the entire module.
Joshua Cao has updated the pull request incrementally with one additional commit since the last revision: Don't create unnecessary output dir ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13344/files - new: https://git.openjdk.org/jdk/pull/13344/files/76f8434c..bed0c81c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13344&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13344&range=02-03 Stats: 3 lines in 1 file changed: 0 ins; 2 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/13344.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13344/head:pull/13344 PR: https://git.openjdk.org/jdk/pull/13344 From phh at openjdk.org Thu Apr 6 18:12:06 2023 From: phh at openjdk.org (Paul Hohensee) Date: Thu, 6 Apr 2023 18:12:06 GMT Subject: RFR: 8300829: Make CtwRunner available as an independent tool [v4] In-Reply-To: References: Message-ID: On Thu, 6 Apr 2023 17:55:24 GMT, Joshua Cao wrote: >> 1. ~~Create an independent jar file with CtwRunner as the main class to make it easier to run~~. Not needed anymore thanks to @navyxliu 's comments. We can use the original `ctw.jar` because it already has ctwrunner class. >> 2. Output the class files directly into the destination directory. Currently, CTW expects a `wb.jar`, but the jtreg tests that use CTWRunner has class files outside of a jar. >> 3. Introduce `sun.hotspot.tools.ctwrunner.ctw_extra_args` option to pass extra arguments to CTW. Arguments are comma separated because working with spaces in bash can be kind of awkward, but I'm open to changing this part. >> >> ### Motivation >> CTWRunner is a wrapper around CTW that will continue compiling beyond failure. It can be useful for testing compilation with certain flags. For example, one could run >> >> >> JAVA_OPTIONS="-Dsun.hotspot.tools.ctwrunner.ctw_extra_args=-XX:+StressLCM,-XX:+StressGCM" ./ctwrunner.sh modules:java.base >> >> >> To test compiling the java.base module with `-XX:+StressLCM -XX:+StressGCM`. 
This is advantageous over using CTW because we can see the full list of crashes for the entire module.
>
> Joshua Cao has updated the pull request incrementally with one additional commit since the last revision:
>
> Don't create unnecessary output dir

Lgtm.

-------------

Marked as reviewed by phh (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/13344#pullrequestreview-1375352014

From duke at openjdk.org Thu Apr 6 18:41:25 2023
From: duke at openjdk.org (duke)
Date: Thu, 6 Apr 2023 18:41:25 GMT
Subject: [jdk20] Withdrawn: 8298176: remove OpaqueZeroTripGuardPostLoop once main-loop disappears
In-Reply-To:
References:
Message-ID:

On Tue, 13 Dec 2022 07:08:59 GMT, Emanuel Peter wrote:

> **Working on new fix... Will update this later**
>
> We recently removed `Opaque2` nodes in [JDK-8294540](https://bugs.openjdk.org/browse/JDK-8294540). `Opaque2` nodes prevented some optimizations during loop-opts. The original idea was to prevent the use of both the un-incremented and incremented value of a loop phi after the loop, to reduce register pressure. But `Opaque2` also had the effect that the limit of the loop would not be optimized, which meant that the iv-value (entry value of phi) in the post loop would never collapse (either to constant or TOP), but always remain a range. Now that `Opaque2` is gone, it can happen that when the main-loop disappears, the limit collapses. The zero-trip guard of the post-loop would be false, but does not collapse because of the `OpaqueZeroTripGuard`. The post-loop can half-collapse, leaving an inconsistent graph below the zero-trip guard if.
>
> **Solution**
> Have `OpaqueZeroTripGuardMainLoop` for main loop zero-trip guard, and `OpaqueZeroTripGuardPostLoop` for post-loop zero trip guard. Let `OpaqueZeroTripGuardPostLoop` remove itself once it cannot find the main-loop above it.
We have these opaque nodes there to prevent collapsing of the zero-trip guards as long as the limits may still change, but after the main-loop is removed, no unrolling is done anymore, so the limit of the post-loop cannot change anymore, hence it is safe to remove the opaque node there.
>
> An alternative approach was to let the main-loop remove the opaque node of the post-loop's zero-trip guard. But that does not work reliably, as the main-loop may get removed during PhaseCCP, and the main-loop is simply removed as "useless". Hence the LoopNode of the main-loop does not have a chance to detect its death during IGVN.

This pull request has been closed without being integrated.

-------------

PR: https://git.openjdk.org/jdk20/pull/22

From kvn at openjdk.org Thu Apr 6 19:06:08 2023
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Thu, 6 Apr 2023 19:06:08 GMT
Subject: RFR: JDK-8305711: Arm: C2 always enters slowpath for monitorexit
In-Reply-To:
References:
Message-ID:

On Thu, 6 Apr 2023 16:29:57 GMT, Thomas Stuefe wrote:

> A small bug in the C2 implementation of monitorexit for thin locks causes us to always enter the slow path.
>
> This seems to be a day zero bug of the arm port, since JEP 297: "Unified arm32/arm64 Port". It has a significant effect on locking performance, but its effect had been hidden until JDK 15 by biased locking. Biased locking removal made the bug apparent.
> > With this patch, @rkennke's artificial microbenchmark that does nothing but uncontended locking improves greatly (see https://github.com/rkennke/fastlockbench): > > > Benchmark (backoff) Mode Cnt Score Error Units > FastLockingBenchmark.testSync 0 avgt 2 110.600 ns/op > FastLockingBenchmark.testSync 1 avgt 2 105.725 ns/op > FastLockingBenchmark.testSync 2 avgt 2 122.780 ns/op > FastLockingBenchmark.testSync 4 avgt 2 125.133 ns/op > FastLockingBenchmark.testSync 8 avgt 2 151.915 ns/op > FastLockingBenchmark.testSync 16 avgt 2 206.458 ns/op > FastLockingBenchmark.testSync 32 avgt 2 313.980 ns/op > FastLockingBenchmark.testSync 64 avgt 2 522.206 ns/op > > > New: > > Benchmark (backoff) Mode Cnt Score Error Units > FastLockingBenchmark.testSync 0 avgt 2 60.102 ns/op > FastLockingBenchmark.testSync 1 avgt 2 61.667 ns/op > FastLockingBenchmark.testSync 2 avgt 2 74.950 ns/op > FastLockingBenchmark.testSync 4 avgt 2 85.480 ns/op > FastLockingBenchmark.testSync 8 avgt 2 115.019 ns/op > FastLockingBenchmark.testSync 16 avgt 2 178.046 ns/op > FastLockingBenchmark.testSync 32 avgt 2 273.376 ns/op > FastLockingBenchmark.testSync 64 avgt 2 500.287 ns/op > > > Please note that Arm remains broken since JDK-8301995; I based and tested this patch on the parent of that change. Good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13376#pullrequestreview-1375421016 From duke at openjdk.org Fri Apr 7 00:30:57 2023 From: duke at openjdk.org (Joshua Cao) Date: Fri, 7 Apr 2023 00:30:57 GMT Subject: Integrated: 8300829: Make CtwRunner available as an independent tool In-Reply-To: References: Message-ID: <_VXj2wYmJ2K6pX8c222rIDT4m0N3YIIvFIaI1EAVU8M=.e9150efd-8587-4f37-b8ce-c00c1ae62da6@github.com> On Wed, 5 Apr 2023 04:46:21 GMT, Joshua Cao wrote: > 1. ~~Create an independent jar file with CtwRunner as the main class to make it easier to run~~. Not needed anymore thanks to @navyxliu 's comments. 
We can use the original `ctw.jar` because it already has ctwrunner class. > 2. Output the class files directly into the destination directory. Currently, CTW expects a `wb.jar`, but the jtreg tests that use CTWRunner has class files outside of a jar. > 3. Introduce `sun.hotspot.tools.ctwrunner.ctw_extra_args` option to pass extra arguments to CTW. Arguments are comma separated because working with spaces in bash can be kind of awkward, but I'm open to changing this part. > > ### Motivation > CTWRunner is a wrapper around CTW that will continue compiling beyond failure. It can be useful for testing compilation with certain flags. For example, one could run > > > JAVA_OPTIONS="-Dsun.hotspot.tools.ctwrunner.ctw_extra_args=-XX:+StressLCM,-XX:+StressGCM" ./ctwrunner.sh modules:java.base > > > To test compiling the java.base module with `-XX:+StressLCM -XX:+StressGCM`. This is advantageous over uses CTW because we can see the full list of crashes for the entire module. This pull request has now been integrated. Changeset: 314e9b3d Author: Joshua Cao Committer: Paul Hohensee URL: https://git.openjdk.org/jdk/commit/314e9b3dcca16d84cf85851cb6f8f7af76ae88db Stats: 104 lines in 3 files changed: 60 ins; 37 del; 7 mod 8300829: Make CtwRunner available as an independent tool Reviewed-by: xliu, phh ------------- PR: https://git.openjdk.org/jdk/pull/13344 From pli at openjdk.org Fri Apr 7 01:31:59 2023 From: pli at openjdk.org (Pengfei Li) Date: Fri, 7 Apr 2023 01:31:59 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v13] In-Reply-To: References: Message-ID: On Thu, 6 Apr 2023 14:38:00 GMT, Dingli Zhang wrote: >> HI, >> >> We have added support for vector add mask instructions, please take a look and have some reviews. Thanks a lot! >> This patch will add support of vector add/sub/mul/div mask version. It was implemented by referring to RVV v1.0 [1]. 
>> >> ## Load/Store/Cmp Mask >> `VectorLoadMask, VectorMaskCmp, VectorStoreMask` will implement the mask datapath. We can see where the data is passed in the compilation log with `jdk/incubator/vector/Byte128VectorTests.java`? >> >> 218 loadV V1, [R7] # vector (rvv) >> 220 vloadmask V0, V1 >> ... >> 23c vmaskcmp_rvv_masked V0, V4, V5, V0, V1, #0 >> 24c vstoremask V1, V0 >> 258 storeV [R7], V1 # vector (rvv) >> >> >> The corresponding generated jit assembly? >> >> # loadV >> 0x000000400c8ef958: vsetvli t0,zero,e8,m1,tu,mu >> 0x000000400c8ef95c: vle8.v v1,(t2) >> >> # vloadmask >> 0x000000400c8ef960: vsetvli t0,zero,e8,m1,tu, >> 0x000000400c8ef964: vmsne.vx v0,v1,zero >> >> # vmaskcmp_rvv_masked >> 0x000000400c8ef97c: vsetvli t0,zero,e8,m1,tu,mu >> 0x000000400c8ef980: vmclr.m v1 >> 0x000000400c8ef984: vmseq.vv v1,v4,v5,v0.t >> 0x000000400c8ef988: vmv1r.v v0,v1 >> >> # vstoremask >> 0x000000400c8ef98c: vsetvli t0,zero,e8,m1,tu,mu >> 0x000000400c8ef990: vmv.v.x v1,zero >> 0x000000400c8ef994: vmerge.vim v1,v1,1,v0 >> >> >> ## Masked vector arithmetic instructions (e.g. 
vadd) >> AddMaskTestMerge case: >> >> import jdk.incubator.vector.IntVector; >> import jdk.incubator.vector.VectorMask; >> import jdk.incubator.vector.VectorOperators; >> import jdk.incubator.vector.VectorSpecies; >> >> public class AddMaskTestMerge { >> >> static final VectorSpecies SPECIES = IntVector.SPECIES_128; >> static final int SIZE = 1024; >> static int[] a = new int[SIZE]; >> static int[] b = new int[SIZE]; >> static int[] r = new int[SIZE]; >> static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false}; >> static { >> for (int i = 0; i < SIZE; i++) { >> a[i] = i; >> b[i] = i; >> } >> } >> >> static void workload(int idx) { >> VectorMask vmask = VectorMask.fromArray(SPECIES, c, 0); >> IntVector av = IntVector.fromArray(SPECIES, a, idx); >> IntVector bv = IntVector.fromArray(SPECIES, b, idx); >> av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx); >> } >> >> public static void main(String[] args) { >> for (int i = 0; i < 30_0000; i++) { >> for (int j = 0; j < SIZE; j += SPECIES.length()) { >> workload(j); >> } >> } >> } >> } >> >> >> This test case is reduced from existing jtreg vector tests Int128VectorTests.java[2]. This test case corresponds to the add instruction of the vector mask version and other instructions are similar. >> >> Before this patch, the compilation log will not print RVV-related instructions. Now the compilation log is as follows: >> >> >> 0ae B10: # out( B25 B11 ) <- in( B9 ) Freq: 0.999991 >> 0ae loadV V1, [R31] # vector (rvv) >> 0b6 vloadmask V0, V2 >> 0be vadd.vv V3, V1, V0 #@vaddI_masked >> 0c6 lwu R28, [R7, #124] # loadN, compressed ptr, #@loadN ! 
Field: AddMaskTestMerge.r >> 0ca decode_heap_oop R28, R28 #@decodeHeapOop >> 0cc lwu R7, [R28, #12] # range, #@loadRange >> 0d0 NullCheck R28 >> >> >> And the jit code is as follows: >> >> >> 0x000000400c823cee: vsetvli t0,zero,e32,m1,tu,mu >> 0x000000400c823cf2: vle32.v v1,(t6) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0} >> ; - jdk.incubator.vector.IntVector::intoArray at 43 (line 3228) >> ; - AddMaskTestMerge::workload at 46 (line 25) >> 0x000000400c823cf6: vsetvli t0,zero,e8,m1,tu,mu >> 0x000000400c823cfa: vmsne.vx v0,v2,zero ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0} >> ; - jdk.incubator.vector.VectorMask::fromArray at 47 (line 208) >> ; - AddMaskTestMerge::workload at 7 (line 22) >> 0x000000400c823cfe: vsetvli t0,zero,e32,m1,tu,mu >> 0x000000400c823d02: vadd.vv v3,v3,v1,v0.t ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0} >> ; - jdk.incubator.vector.IntVector::lanewiseTemplate at 192 (line 834) >> ; - jdk.incubator.vector.Int128Vector::lanewise at 9 (line 291) >> ; - jdk.incubator.vector.Int128Vector::lanewise at 4 (line 41) >> ; - AddMaskTestMerge::workload at 39 (line 25) >> >> >> ## Mask register allocation & mask bit opreation >> Since v0 is to be used as a mask register in spec[1], sometimes we need two vmask to do the vector mask logical ops like `AndVMask, OrVMask, XorVMask`. And if only v0 and v31 mask registers are defined, the corresponding c2 nodes will not be generated correctly because of the register pressure[3]. >> When we use only v0 and v31 as mask registers, jtreg testing of Byte128VectorTests.java[4] with `-XX:+PrintAssembly` and `-XX:LogFile` will not emit the expected rvv mask instruction. 
As opposed to this, the following compilation failure log[3] is generated: >> >> >> >> >> >> >> >> >> So define v30 and v31 as mask register too and `AndVMask` will emit the C2 JIT code like: >> >> vloadmask V0, V1 >> vloadmask V30, V2 >> vmask_and V0, V30, V0 >> >> We also modified the implementation of `spill_copy_vector_stack_to_stack ` so that it no longer occupies the v0 register. In addition to that, we change some node like `vasr/vlsl/vlsr/vstring_x/varray_x/vclearArray_x`, which use v0 internally, to make C2 to sense that they used v0. >> >> By the way, the current implementation of `VectorMaskCast` is for the case of equal width of the parameter data, other cases depend on the subsequent cast node. >> >> [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc >> [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java >> [3] https://github.com/openjdk/jdk/blob/0deb648985b018653ccdaf193dc13b3cf21c088a/src/hotspot/share/opto/chaitin.cpp#L526 >> [4] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Byte128VectorTests.java >> >> ### Testing: >> >> qemu with UseRVV: >> - [x] Tier1 tests (release) >> - [x] Tier2 tests (release) >> - [ ] Tier3 tests (release) >> - [x] test/jdk/jdk/incubator/vector (release/fastdebug) > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Fix typo and use match_rule_supported_vector instead of true src/hotspot/cpu/riscv/riscv_v.ad line 2457: > 2455: %} > 2456: > 2457: instruct vmask_gen_sub(vRegMask dst, iRegL src1, iRegL src2) %{ (One more comment) This rule looks redundant for RISC-V. We have this on AArch64 because we can match the nodes to SVE `whileXX` to save a `sub` instruction. But it looks no instruction is saved according to your RISC-V implementation. Actually, `SubL` and `VectorMaskGen` can be matched separately without this. 
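
For illustration only (this is not code from the patch): a minimal scalar sketch of why the fused rule pays off on SVE. It assumes that `VectorMaskGen len` activates the first `len` lanes, and it models a signed SVE-style `whileXX` as "lane i is active while start + i < limit". The class and method names are hypothetical, and overflow corner cases are ignored.

```java
import java.util.Arrays;

public class WhileMaskSketch {
    // Hypothetical scalar model of VectorMaskGen: the first `len` lanes are active.
    static boolean[] maskGen(long len, int lanes) {
        boolean[] mask = new boolean[lanes];
        for (int i = 0; i < lanes; i++) {
            mask[i] = i < len;
        }
        return mask;
    }

    // Hypothetical scalar model of a signed SVE-style "while less-than":
    // lane i is active while (start + i) < limit.
    static boolean[] whileLt(long start, long limit, int lanes) {
        boolean[] mask = new boolean[lanes];
        for (int i = 0; i < lanes; i++) {
            mask[i] = start + i < limit;
        }
        return mask;
    }

    public static void main(String[] args) {
        long src1 = 13, src2 = 8;
        int lanes = 8;
        // Separate matching: an explicit subtraction feeds the mask generator.
        boolean[] separate = maskGen(src1 - src2, lanes);
        // Fused matching: the compare-style instruction takes both operands directly.
        boolean[] fused = whileLt(src2, src1, lanes);
        System.out.println(Arrays.equals(separate, fused)); // prints "true"
    }
}
```

Under these assumptions both paths produce the same mask, so a two-operand instruction makes the `sub` unnecessary. In the RISC-V rule under review the subtraction is still emitted, per the comment above, so the fused rule saves nothing there.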
-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1160370730

From gcao at openjdk.org Fri Apr 7 02:46:46 2023
From: gcao at openjdk.org (Gui Cao)
Date: Fri, 7 Apr 2023 02:46:46 GMT
Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v13]
In-Reply-To:
References:
Message-ID:

On Fri, 7 Apr 2023 01:29:18 GMT, Pengfei Li wrote:

>> Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Fix typo and use match_rule_supported_vector instead of true
>
> src/hotspot/cpu/riscv/riscv_v.ad line 2457:
>
>> 2455: %}
>> 2456:
>> 2457: instruct vmask_gen_sub(vRegMask dst, iRegL src1, iRegL src2) %{
>
> (One more comment) This rule looks redundant for RISC-V. We have this on AArch64 because we can match the nodes to SVE `whileXX` to save a `sub` instruction. But it looks no instruction is saved according to your RISC-V implementation. Actually, `SubL` and `VectorMaskGen` can be matched separately without this.

Hi, thanks for the review. This node is indeed not needed here; `SubL` and `VectorMaskGen` can be matched separately without it. After removing it, the compilation log is as follows:

    148 B11: # out( B21 B12 ) <- in( B9 ) Freq: 0.00838607
    148 addw R7, R29, zr #@convI2L_reg_reg
    14c addw R28, R28, zr #@convI2L_reg_reg
    150 sub R7, R7, R28 #@subL_reg_reg
    154 vmask_gen_L V0, R7
    15c vstoremask V5, V0
    168 ld R30, [R23, #264] # ptr, #@loadP
    16c ld R7, [R23, #280] # ptr, #@loadP
    170 addi R28, R30, #48 # ptr, #@addP_reg_imm
    174 bgeu R28, R7, B21 #@cmpP_branch P=0.000100 C=-1.000000

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1160396416

From stuefe at openjdk.org Fri Apr 7 06:00:55 2023
From: stuefe at openjdk.org (Thomas Stuefe)
Date: Fri, 7 Apr 2023 06:00:55 GMT
Subject: Integrated: JDK-8305711: Arm: C2 always enters slowpath for monitorexit
In-Reply-To:
References:
Message-ID:

On Thu, 6 Apr 2023 16:29:57 GMT, Thomas Stuefe wrote:

> A small bug in the C2 implementation of monitorexit for thin locks causes us to always enter the slow path.
>
> This seems to be a day zero bug of the arm port, since JEP 297: "Unified arm32/arm64 Port". It has a significant effect on locking performance, but its effect had been hidden until JDK 15 by biased locking. Biased locking removal made the bug apparent.
> > With this patch, @rkennke's artificial microbenchmark that does nothing but uncontended locking improves greatly (see https://github.com/rkennke/fastlockbench): > > > Benchmark (backoff) Mode Cnt Score Error Units > FastLockingBenchmark.testSync 0 avgt 2 110.600 ns/op > FastLockingBenchmark.testSync 1 avgt 2 105.725 ns/op > FastLockingBenchmark.testSync 2 avgt 2 122.780 ns/op > FastLockingBenchmark.testSync 4 avgt 2 125.133 ns/op > FastLockingBenchmark.testSync 8 avgt 2 151.915 ns/op > FastLockingBenchmark.testSync 16 avgt 2 206.458 ns/op > FastLockingBenchmark.testSync 32 avgt 2 313.980 ns/op > FastLockingBenchmark.testSync 64 avgt 2 522.206 ns/op > > > New: > > Benchmark (backoff) Mode Cnt Score Error Units > FastLockingBenchmark.testSync 0 avgt 2 60.102 ns/op > FastLockingBenchmark.testSync 1 avgt 2 61.667 ns/op > FastLockingBenchmark.testSync 2 avgt 2 74.950 ns/op > FastLockingBenchmark.testSync 4 avgt 2 85.480 ns/op > FastLockingBenchmark.testSync 8 avgt 2 115.019 ns/op > FastLockingBenchmark.testSync 16 avgt 2 178.046 ns/op > FastLockingBenchmark.testSync 32 avgt 2 273.376 ns/op > FastLockingBenchmark.testSync 64 avgt 2 500.287 ns/op > > > Please note that Arm remains broken since JDK-8301995; I based and tested this patch on the parent of that change. This pull request has now been integrated. 
Changeset: c67bbcea Author: Thomas Stuefe URL: https://git.openjdk.org/jdk/commit/c67bbcea92919fea9b6f7bbcde8ba4488289d174 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod 8305711: Arm: C2 always enters slowpath for monitorexit Reviewed-by: shade, kvn ------------- PR: https://git.openjdk.org/jdk/pull/13376 From stuefe at openjdk.org Fri Apr 7 06:00:53 2023 From: stuefe at openjdk.org (Thomas Stuefe) Date: Fri, 7 Apr 2023 06:00:53 GMT Subject: RFR: JDK-8305711: Arm: C2 always enters slowpath for monitorexit In-Reply-To: References: Message-ID: On Thu, 6 Apr 2023 17:00:31 GMT, Aleksey Shipilev wrote: >> A small bug in the C2 implementation of monitorexit for thin locks causes us to always enter the slow path. >> >> This seems to be a day zero bug of the arm port, since JEP 297: "Unified arm32/arm64 Port". It has a significant effect on locking performance, but its effect had been hidden until JDK 15 by biased locking. Biased locking removal made the bug appearant. >> >> With this patch, @rkennke's artificial microbenchmark that does nothing but uncontended locking improves greatly (see https://github.com/rkennke/fastlockbench): >> >> >> Benchmark (backoff) Mode Cnt Score Error Units >> FastLockingBenchmark.testSync 0 avgt 2 110.600 ns/op >> FastLockingBenchmark.testSync 1 avgt 2 105.725 ns/op >> FastLockingBenchmark.testSync 2 avgt 2 122.780 ns/op >> FastLockingBenchmark.testSync 4 avgt 2 125.133 ns/op >> FastLockingBenchmark.testSync 8 avgt 2 151.915 ns/op >> FastLockingBenchmark.testSync 16 avgt 2 206.458 ns/op >> FastLockingBenchmark.testSync 32 avgt 2 313.980 ns/op >> FastLockingBenchmark.testSync 64 avgt 2 522.206 ns/op >> >> >> New: >> >> Benchmark (backoff) Mode Cnt Score Error Units >> FastLockingBenchmark.testSync 0 avgt 2 60.102 ns/op >> FastLockingBenchmark.testSync 1 avgt 2 61.667 ns/op >> FastLockingBenchmark.testSync 2 avgt 2 74.950 ns/op >> FastLockingBenchmark.testSync 4 avgt 2 85.480 ns/op >> FastLockingBenchmark.testSync 8 avgt 2 115.019 ns/op 
>> FastLockingBenchmark.testSync 16 avgt 2 178.046 ns/op
>> FastLockingBenchmark.testSync 32 avgt 2 273.376 ns/op
>> FastLockingBenchmark.testSync 64 avgt 2 500.287 ns/op
>>
>>
>> Please note that Arm remains broken since JDK-8301995; I based and tested this patch on the parent of that change.
>
> Usual synopsis for ARM32 bugs is "ARM32: C2 always...", I think :)

Thanks @shipilev and @vnkozlov!

-------------

PR Comment: https://git.openjdk.org/jdk/pull/13376#issuecomment-1499964632

From pli at openjdk.org Fri Apr 7 07:28:55 2023
From: pli at openjdk.org (Pengfei Li)
Date: Fri, 7 Apr 2023 07:28:55 GMT
Subject: RFR: 8305524: AArch64: Fix arraycopy issue on SVE caused by matching rule vmask_gen_sub
Message-ID:

From recent tests, we found that a `System.arraycopy()` call with a negated variable as its length argument does not perform the copy. The issue can be reproduced with the test case below on AArch64 platforms with SVE.

    public class Test {
        static char[] src = {'A', 'A', 'A', 'A', 'A'};
        static char[] dst = {'B', 'B', 'B', 'B', 'B'};

        static void copy(int nlen) {
            System.arraycopy(src, 0, dst, 0, -nlen);
        }

        public static void main(String[] args) {
            for (int i = 0; i < 25000; i++) {
                copy(0);
            }
            copy(-5);
            for (char c : dst) {
                if (c != 'A') {
                    throw new RuntimeException("Wrong value!");
                }
            }
            System.out.println("PASS");
        }
    }

    /*
    $ java -Xint Test
    PASS

    $ java -Xbatch Test
    Exception in thread "main" java.lang.RuntimeException: Wrong value!
            at Test.main(Test.java:16)
    */

The cause is a new AArch64 matching rule, `vmask_gen_sub`, introduced by JDK-8293198. It matches `VectorMaskGen (SubL src1 src2)` on AArch64 platforms with SVE and generates SVE `whilelo` instructions. The current C2 compiler uses a technique called "partial inlining" to vectorize small array copy operations by generating vector masks. In the test case above, a negated variable `-nlen` is used as the length argument of the call and `-nlen` has a small positive value, so it is a "partial inlining" case. C2 transforms the ideal graph to `VectorMaskGen (SubL 0 nlen)` and eventually emits `whilelo p0, nlen, zr`, which always generates an all-false vector mask because no unsigned value is lower than zero. That is why arraycopy does nothing.

The problem with that matching rule is that it treats the inputs `src1` and `src2` as unsigned long integers, but they can be signed in arraycopy use cases. To fix the issue, this patch replaces the `whilelo` instruction with `whilelt` in that rule as well as in some other places.

We tested tier1~3 on SVE and found no new failures. A jtreg math library test, jdk/internal/math/FloatingDecimal/TestFloatingDecimal.java, which failed on SVE before, now passes.

-------------

Commit messages:
 - 8305524: AArch64: Fix arraycopy issue on SVE caused by matching rule vmask_gen_sub

Changes: https://git.openjdk.org/jdk/pull/13382/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13382&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8305524
Stats: 62 lines in 4 files changed: 52 ins; 0 del; 10 mod
Patch: https://git.openjdk.org/jdk/pull/13382.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/13382/head:pull/13382

PR: https://git.openjdk.org/jdk/pull/13382

From roland at openjdk.org Fri Apr 7 08:58:45 2023
From: roland at openjdk.org (Roland Westrelin)
Date: Fri, 7 Apr 2023 08:58:45 GMT
Subject: RFR: 8300257: C2: vectorization fails on some simple Memory Segment loops [v5]
In-Reply-To:
References:
Message-ID: <-gMu3Sj29DRnyRQ3CWQmFUPkcwGgiNBwpy98pTn1hvo=.3197ec6e-e9ae-47b5-8722-8000b5dda2bf@github.com>

> In the test case `testByteLong1` (that's extracted from a memory
> segment micro benchmark), the address of the store is initially:
>
>
> (AddP#204 base#195 base#195 (AddL#164 (ConvI2L#158 (CastII#157 (LshiftI#107 iv#101))) invar#163))
>
>
> (#numbers are node numbers to help the discussion).
>
> `iv#101` is the `Phi` of a counted loop. `invar#163` is the
> `baseOffset` load.
> > To eliminate the range check, the loop is transformed into a loop nest > and as a consequence the address above becomes: > > > (AddP#204 base#195 base#195 (AddL#164 (ConvI2L#158 (CastII#157 (LShiftI#107 (AddI#326 invar#308 iv#321)))) invar#163)) > > > `invar#308` is some expression from a `Phi` of the outer loop. > > That `AddP` is transformed multiple times to push the invariants out of loop: > > > (AddP#568 base#195 (AddP#556 base#195 base#195 invar#163) (ConvI2L#158 (CastII#157 (AddI#566 (LShiftI#565 iv#321) invar#577)))) > > > then: > > > (AddP#568 base#195 (AddP#847 (AddP#556 base#195 base#195 invar#163) (AddL#838 (ConvI2L#793 (LShiftL#760 iv#767)) (ConvI2L#818 (CastII#779 invar#577))))) > > > and finally: > > > (AddP#568 base#195 (AddP#949 base#195 (AddP#855 base#195 (AddP#556 base#195 base#195 invar#163) (ConvI2L#818 (CastII#809 invar#577))) (ConvI2L#938 (LShiftI#896 iv#908)))) > > > `AddP#855` is out of the inner loop. > > This doesn't vectorize because: > > - there are 2 invariants in the address expression but superword only > support one (tracked by `_invar` in `SWPointer`) > > - there are more levels of `AddP` (4) than superword supports (3) > > To fix that, I propose to no longer track the address elements in > `_invar`, `_negate_invar` and `_invar_scale` but instead to have a > single `_invar` which is an expression built by superword as it > follows chains of `addP` nodes. I kept the previous `_invar`, > `_negate_invar` and `_invar_scale` as debugging and use them to check > that what vectorized with the previous scheme still does. > > I also propose lifting the restriction on 3 levels of `AddP` entirely. Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains ten commits: - Merge branch 'master' into JDK-8300257 - review - Merge branch 'master' into JDK-8300257 - Update test/hotspot/jtreg/compiler/c2/irTests/TestVectorizationMultiInvar.java Co-authored-by: Tobias Hartmann - Update test/hotspot/jtreg/compiler/c2/irTests/TestVectorizationMultiInvar.java Co-authored-by: Tobias Hartmann - Update src/hotspot/share/opto/superword.hpp Co-authored-by: Tobias Hartmann - NULL -> nullptr - Merge branch 'master' into JDK-8300257 - fix & test ------------- Changes: https://git.openjdk.org/jdk/pull/12942/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12942&range=04 Stats: 273 lines in 3 files changed: 211 ins; 23 del; 39 mod Patch: https://git.openjdk.org/jdk/pull/12942.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12942/head:pull/12942 PR: https://git.openjdk.org/jdk/pull/12942 From roland at openjdk.org Fri Apr 7 08:58:46 2023 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 7 Apr 2023 08:58:46 GMT Subject: RFR: 8300257: C2: vectorization fails on some simple Memory Segment loops [v4] In-Reply-To: References: <1qWMOwI1VZiAf-DY2tRw8LqFe93CS7omKZam2Ro5fJ8=.a3c51e8d-c2b5-4d89-bfdb-3ed20beaa9d6@github.com> Message-ID: On Mon, 20 Mar 2023 13:16:19 GMT, Tobias Hartmann wrote: > Thanks for making these changes. `SWPointer::maybe_negate_invar` could now be removed as it has only one user but I'm also fine with leaving it as is. Thanks for looking at this again. Because it has a single use? The method is fairly simple but non trivial so I'll leave it as a separate method. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/12942#issuecomment-1500084286 From dzhang at openjdk.org Fri Apr 7 12:16:44 2023 From: dzhang at openjdk.org (Dingli Zhang) Date: Fri, 7 Apr 2023 12:16:44 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v14] In-Reply-To: References: Message-ID: <7-vzGGq80CYZvbB9N5qrn5mrhVhxxh97oo7AkgYRn1k=.8b9fe40f-d858-43d0-bdb8-4b050009876f@github.com> > HI, > > We have added support for vector add mask instructions, please take a look and have some reviews. Thanks a lot! > This patch will add support of vector add/sub/mul/div mask version. It was implemented by referring to RVV v1.0 [1]. > > ## Load/Store/Cmp Mask > `VectorLoadMask, VectorMaskCmp, VectorStoreMask` will implement the mask datapath. We can see where the data is passed in the compilation log with `jdk/incubator/vector/Byte128VectorTests.java`? > > 218 loadV V1, [R7] # vector (rvv) > 220 vloadmask V0, V1 > ... > 23c vmaskcmp_rvv_masked V0, V4, V5, V0, V1, #0 > 24c vstoremask V1, V0 > 258 storeV [R7], V1 # vector (rvv) > > > The corresponding generated jit assembly? > > # loadV > 0x000000400c8ef958: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef95c: vle8.v v1,(t2) > > # vloadmask > 0x000000400c8ef960: vsetvli t0,zero,e8,m1,tu, > 0x000000400c8ef964: vmsne.vx v0,v1,zero > > # vmaskcmp_rvv_masked > 0x000000400c8ef97c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef980: vmclr.m v1 > 0x000000400c8ef984: vmseq.vv v1,v4,v5,v0.t > 0x000000400c8ef988: vmv1r.v v0,v1 > > # vstoremask > 0x000000400c8ef98c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef990: vmv.v.x v1,zero > 0x000000400c8ef994: vmerge.vim v1,v1,1,v0 > > > ## Masked vector arithmetic instructions (e.g. 
vadd) > AddMaskTestMerge case: > > import jdk.incubator.vector.IntVector; > import jdk.incubator.vector.VectorMask; > import jdk.incubator.vector.VectorOperators; > import jdk.incubator.vector.VectorSpecies; > > public class AddMaskTestMerge { > > static final VectorSpecies SPECIES = IntVector.SPECIES_128; > static final int SIZE = 1024; > static int[] a = new int[SIZE]; > static int[] b = new int[SIZE]; > static int[] r = new int[SIZE]; > static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false}; > static { > for (int i = 0; i < SIZE; i++) { > a[i] = i; > b[i] = i; > } > } > > static void workload(int idx) { > VectorMask vmask = VectorMask.fromArray(SPECIES, c, 0); > IntVector av = IntVector.fromArray(SPECIES, a, idx); > IntVector bv = IntVector.fromArray(SPECIES, b, idx); > av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx); > } > > public static void main(String[] args) { > for (int i = 0; i < 30_0000; i++) { > for (int j = 0; j < SIZE; j += SPECIES.length()) { > workload(j); > } > } > } > } > > > This test case is reduced from existing jtreg vector tests Int128VectorTests.java[2]. This test case corresponds to the add instruction of the vector mask version and other instructions are similar. > > Before this patch, the compilation log will not print RVV-related instructions. Now the compilation log is as follows: > > > 0ae B10: # out( B25 B11 ) <- in( B9 ) Freq: 0.999991 > 0ae loadV V1, [R31] # vector (rvv) > 0b6 vloadmask V0, V2 > 0be vadd.vv V3, V1, V0 #@vaddI_masked > 0c6 lwu R28, [R7, #124] # loadN, compressed ptr, #@loadN ! 
Field: AddMaskTestMerge.r > 0ca decode_heap_oop R28, R28 #@decodeHeapOop > 0cc lwu R7, [R28, #12] # range, #@loadRange > 0d0 NullCheck R28 > > > And the jit code is as follows: > > > 0x000000400c823cee: vsetvli t0,zero,e32,m1,tu,mu > 0x000000400c823cf2: vle32.v v1,(t6) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::intoArray at 43 (line 3228) > ; - AddMaskTestMerge::workload at 46 (line 25) > 0x000000400c823cf6: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c823cfa: vmsne.vx v0,v2,zero ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.VectorMask::fromArray at 47 (line 208) > ; - AddMaskTestMerge::workload at 7 (line 22) > 0x000000400c823cfe: vsetvli t0,zero,e32,m1,tu,mu > 0x000000400c823d02: vadd.vv v3,v3,v1,v0.t ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::lanewiseTemplate at 192 (line 834) > ; - jdk.incubator.vector.Int128Vector::lanewise at 9 (line 291) > ; - jdk.incubator.vector.Int128Vector::lanewise at 4 (line 41) > ; - AddMaskTestMerge::workload at 39 (line 25) > > > ## Mask register allocation & mask bit opreation > Since v0 is to be used as a mask register in spec[1], sometimes we need two vmask to do the vector mask logical ops like `AndVMask, OrVMask, XorVMask`. And if only v0 and v31 mask registers are defined, the corresponding c2 nodes will not be generated correctly because of the register pressure[3]. > When we use only v0 and v31 as mask registers, jtreg testing of Byte128VectorTests.java[4] with `-XX:+PrintAssembly` and `-XX:LogFile` will not emit the expected rvv mask instruction. 
As opposed to this, the following compilation failure log[3] is generated: > > > > > > > > > So define v30 and v31 as mask register too and `AndVMask` will emit the C2 JIT code like: > > vloadmask V0, V1 > vloadmask V30, V2 > vmask_and V0, V30, V0 > > We also modified the implementation of `spill_copy_vector_stack_to_stack ` so that it no longer occupies the v0 register. In addition to that, we change some node like `vasr/vlsl/vlsr/vstring_x/varray_x/vclearArray_x`, which use v0 internally, to make C2 to sense that they used v0. > > By the way, the current implementation of `VectorMaskCast` is for the case of equal width of the parameter data, other cases depend on the subsequent cast node. > > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc > [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java > [3] https://github.com/openjdk/jdk/blob/0deb648985b018653ccdaf193dc13b3cf21c088a/src/hotspot/share/opto/chaitin.cpp#L526 > [4] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Byte128VectorTests.java > > ### Testing: > > qemu with UseRVV: > - [x] Tier1 tests (release) > - [x] Tier2 tests (release) > - [ ] Tier3 tests (release) > - [x] test/jdk/jdk/incubator/vector (release/fastdebug) Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: Remove unneeded combination nodes ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12682/files - new: https://git.openjdk.org/jdk/pull/12682/files/447ea6c9..2ef39c07 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=13 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=12-13 Stats: 13 lines in 1 file changed: 0 ins; 13 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/12682.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12682/head:pull/12682 PR: https://git.openjdk.org/jdk/pull/12682 From dzhang at openjdk.org Fri Apr 7 12:16:46 2023 From: 
dzhang at openjdk.org (Dingli Zhang) Date: Fri, 7 Apr 2023 12:16:46 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v13] In-Reply-To: References: Message-ID: On Fri, 7 Apr 2023 02:43:52 GMT, Gui Cao wrote: >> src/hotspot/cpu/riscv/riscv_v.ad line 2457: >> >>> 2455: %} >>> 2456: >>> 2457: instruct vmask_gen_sub(vRegMask dst, iRegL src1, iRegL src2) %{ >> >> (One more comment) This rule looks redundant for RISC-V. We have this on AArch64 because we can match the nodes to SVE `whileXX` to save a `sub` instruction. But it looks no instruction is saved according to your RISC-V implementation. Actually, `SubL` and `VectorMaskGen` can be matched separately without this. > > Hi, thanks for the review. this node is really not needed here,`SubL` and `VectorMaskGen` can be matched separately without this. > After removing this node, `SubL` and `VectorMaskGen` can be matched separately, The compilation log is as follows: > > 148 B11: # out( B21 B12 ) <- in( B9 ) Freq: 0.00838607 > 148 addw R7, R29, zr #@convI2L_reg_reg > 14c addw R28, R28, zr #@convI2L_reg_reg > 150 sub R7, R7, R28 #@subL_reg_reg > 154 vmask_gen_L V0, R7 > 15c vstoremask V5, V0 > 168 ld R30, [R23, #264] # ptr, #@loadP > 16c ld R7, [R23, #280] # ptr, #@loadP > 170 addi R28, R30, #48 # ptr, #@addP_reg_imm > 174 bgeu R28, R7, B21 #@cmpP_branch P=0.000100 C=-1.000000 Thanks and fixed. The 75 tests under `test/jdk/jdk/incubator/vector` passed as usual. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1160660600 From roland at openjdk.org Fri Apr 7 12:55:00 2023 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 7 Apr 2023 12:55:00 GMT Subject: Integrated: 8300257: C2: vectorization fails on some simple Memory Segment loops In-Reply-To: References: Message-ID: On Thu, 9 Mar 2023 10:48:06 GMT, Roland Westrelin wrote: > In the test case `testByteLong1` (that's extracted from a memory > segment micro benchmark), the address of the store is initially: > > > (AddP#204 base#195 base#195 (AddL#164 (ConvI2L#158 (CastII#157 (LshiftI#107 iv#101))) invar#163)) > > > (#numbers are node numbers to help the discussion). > > `iv#101` is the `Phi` of a counted loop. `invar#163` is the > `baseOffset` load. > > To eliminate the range check, the loop is transformed into a loop nest > and as a consequence the address above becomes: > > > (AddP#204 base#195 base#195 (AddL#164 (ConvI2L#158 (CastII#157 (LShiftI#107 (AddI#326 invar#308 iv#321)))) invar#163)) > > > `invar#308` is some expression from a `Phi` of the outer loop. > > That `AddP` is transformed multiple times to push the invariants out of loop: > > > (AddP#568 base#195 (AddP#556 base#195 base#195 invar#163) (ConvI2L#158 (CastII#157 (AddI#566 (LShiftI#565 iv#321) invar#577)))) > > > then: > > > (AddP#568 base#195 (AddP#847 (AddP#556 base#195 base#195 invar#163) (AddL#838 (ConvI2L#793 (LShiftL#760 iv#767)) (ConvI2L#818 (CastII#779 invar#577))))) > > > and finally: > > > (AddP#568 base#195 (AddP#949 base#195 (AddP#855 base#195 (AddP#556 base#195 base#195 invar#163) (ConvI2L#818 (CastII#809 invar#577))) (ConvI2L#938 (LShiftI#896 iv#908)))) > > > `AddP#855` is out of the inner loop. 
> > This doesn't vectorize because: > > - there are 2 invariants in the address expression but superword only > support one (tracked by `_invar` in `SWPointer`) > > - there are more levels of `AddP` (4) than superword supports (3) > > To fix that, I propose to no longer track the address elements in > `_invar`, `_negate_invar` and `_invar_scale` but instead to have a > single `_invar` which is an expression built by superword as it > follows chains of `addP` nodes. I kept the previous `_invar`, > `_negate_invar` and `_invar_scale` as debugging and use them to check > that what vectorized with the previous scheme still does. > > I also propose lifting the restriction on 3 levels of `AddP` entirely. This pull request has now been integrated. Changeset: 6b2a86a6 Author: Roland Westrelin URL: https://git.openjdk.org/jdk/commit/6b2a86a65ef530002aea35ded45d75e04c223802 Stats: 273 lines in 3 files changed: 211 ins; 23 del; 39 mod 8300257: C2: vectorization fails on some simple Memory Segment loops Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/12942 From jbhateja at openjdk.org Fri Apr 7 14:22:56 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 7 Apr 2023 14:22:56 GMT Subject: RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v6] In-Reply-To: References: Message-ID: On Fri, 31 Mar 2023 12:25:16 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch reimplements `VectorShuffle` implementations to be a vector of the bit type. Currently, VectorShuffle is stored as a byte array, and would be expanded upon usage. This poses several drawbacks: >> >> 1. Inefficient conversions between a shuffle and its corresponding vector. This hinders the performance when the shuffle indices are not constant and are loaded or computed dynamically. >> 2. Redundant expansions in `rearrange` operations. On all platforms, it seems that a shuffle index vector is always expanded to the correct type before executing the `rearrange` operations. 
>> 3. Some redundant intrinsics are needed to support this handling as well as special considerations in the C2 compiler. >> 4. Range checks are performed using `VectorShuffle::toVector`, which is inefficient for FP types since both FP conversions and FP comparisons are more expensive than the integral ones. >> >> Upon these changes, a `rearrange` can emit more efficient code: >> >> var species = IntVector.SPECIES_128; >> var v1 = IntVector.fromArray(species, SRC1, 0); >> var v2 = IntVector.fromArray(species, SRC2, 0); >> v1.rearrange(v2.toShuffle()).intoArray(DST, 0); >> >> Before: >> movabs $0x751589fa8,%r10 ; {oop([I{0x0000000751589fa8})} >> vmovdqu 0x10(%r10),%xmm2 >> movabs $0x7515a0d08,%r10 ; {oop([I{0x00000007515a0d08})} >> vmovdqu 0x10(%r10),%xmm1 >> movabs $0x75158afb8,%r10 ; {oop([I{0x000000075158afb8})} >> vmovdqu 0x10(%r10),%xmm0 >> vpand -0xddc12(%rip),%xmm0,%xmm0 # Stub::vector_int_to_byte_mask >> ; {external_word} >> vpackusdw %xmm0,%xmm0,%xmm0 >> vpackuswb %xmm0,%xmm0,%xmm0 >> vpmovsxbd %xmm0,%xmm3 >> vpcmpgtd %xmm3,%xmm1,%xmm3 >> vtestps %xmm3,%xmm3 >> jne 0x00007fc2acb4e0d8 >> vpmovzxbd %xmm0,%xmm0 >> vpermd %ymm2,%ymm0,%ymm0 >> movabs $0x751588f98,%r10 ; {oop([I{0x0000000751588f98})} >> vmovdqu %xmm0,0x10(%r10) >> >> After: >> movabs $0x751589c78,%r10 ; {oop([I{0x0000000751589c78})} >> vmovdqu 0x10(%r10),%xmm1 >> movabs $0x75158ac88,%r10 ; {oop([I{0x000000075158ac88})} >> vmovdqu 0x10(%r10),%xmm2 >> vpxor %xmm0,%xmm0,%xmm0 >> vpcmpgtd %xmm2,%xmm0,%xmm3 >> vtestps %xmm3,%xmm3 >> jne 0x00007fa818b27cb1 >> vpermd %ymm1,%ymm2,%ymm0 >> movabs $0x751588c68,%r10 ; {oop([I{0x0000000751588c68})} >> vmovdqu %xmm0,0x10(%r10) >> >> Please take a look and leave reviews. Thanks a lot. 
> > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > small cosmetics src/jdk.incubator.vector/share/classes/jdk/incubator/vector/AbstractShuffle.java line 96: > 94: } > 95: Vector shufvec = this.toBitsVector(); > 96: VectorMask vecmask = shufvec.compare(VectorOperators.LT, 0); This may affect intrinsification on AVX1 targets for floating-point shuffles, since the bits vector is an integral vector and AVX1 supports 32-byte float vectors but not 32-byte integral vectors. src/jdk.incubator.vector/share/classes/jdk/incubator/vector/AbstractVector.java line 226: > 224: > 225: AbstractSpecies species = vspecies().asIntegral(); > 226: Vector iota = species.iota(); We can do an early exit by returning species.iota() if start = 0 and step = 1. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13093#discussion_r1160650526 PR Review Comment: https://git.openjdk.org/jdk/pull/13093#discussion_r1160672743 From jbhateja at openjdk.org Fri Apr 7 14:22:57 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 7 Apr 2023 14:22:57 GMT Subject: RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v6] In-Reply-To: References: Message-ID: On Fri, 7 Apr 2023 12:36:04 GMT, Jatin Bhateja wrote: >> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: >> >> small cosmetics > > src/jdk.incubator.vector/share/classes/jdk/incubator/vector/AbstractVector.java line 226: > >> 224: >> 225: AbstractSpecies species = vspecies().asIntegral(); >> 226: Vector iota = species.iota(); > > We can do an early exit by returning species.iota() if start = 0 and step = 1. Multiplication by a power-of-two step may be replaced by a shift. But special handling may impact the generic path; currently the C2 inline expander handles these special cases.
Alternatively, we can keep this implementation as is and enhance vector idealizations to handle identity scenarios (multiply by 1, addition of 0, shift replacement for power-of-two multiply), since their scalar counterparts already handle these cases and SLP-generated code benefits from that. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13093#discussion_r1160706670 From qamai at openjdk.org Fri Apr 7 17:13:50 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Fri, 7 Apr 2023 17:13:50 GMT Subject: RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v7] In-Reply-To: References: Message-ID: > Hi, > > This patch reimplements `VectorShuffle` implementations to be a vector of the bit type. Currently, VectorShuffle is stored as a byte array, and would be expanded upon usage. This poses several drawbacks: > > 1. Inefficient conversions between a shuffle and its corresponding vector. This hinders the performance when the shuffle indices are not constant and are loaded or computed dynamically. > 2. Redundant expansions in `rearrange` operations. On all platforms, it seems that a shuffle index vector is always expanded to the correct type before executing the `rearrange` operations. > 3. Some redundant intrinsics are needed to support this handling as well as special considerations in the C2 compiler. > 4. Range checks are performed using `VectorShuffle::toVector`, which is inefficient for FP types since both FP conversions and FP comparisons are more expensive than the integral ones.
> > Upon these changes, a `rearrange` can emit more efficient code: > > var species = IntVector.SPECIES_128; > var v1 = IntVector.fromArray(species, SRC1, 0); > var v2 = IntVector.fromArray(species, SRC2, 0); > v1.rearrange(v2.toShuffle()).intoArray(DST, 0); > > Before: > movabs $0x751589fa8,%r10 ; {oop([I{0x0000000751589fa8})} > vmovdqu 0x10(%r10),%xmm2 > movabs $0x7515a0d08,%r10 ; {oop([I{0x00000007515a0d08})} > vmovdqu 0x10(%r10),%xmm1 > movabs $0x75158afb8,%r10 ; {oop([I{0x000000075158afb8})} > vmovdqu 0x10(%r10),%xmm0 > vpand -0xddc12(%rip),%xmm0,%xmm0 # Stub::vector_int_to_byte_mask > ; {external_word} > vpackusdw %xmm0,%xmm0,%xmm0 > vpackuswb %xmm0,%xmm0,%xmm0 > vpmovsxbd %xmm0,%xmm3 > vpcmpgtd %xmm3,%xmm1,%xmm3 > vtestps %xmm3,%xmm3 > jne 0x00007fc2acb4e0d8 > vpmovzxbd %xmm0,%xmm0 > vpermd %ymm2,%ymm0,%ymm0 > movabs $0x751588f98,%r10 ; {oop([I{0x0000000751588f98})} > vmovdqu %xmm0,0x10(%r10) > > After: > movabs $0x751589c78,%r10 ; {oop([I{0x0000000751589c78})} > vmovdqu 0x10(%r10),%xmm1 > movabs $0x75158ac88,%r10 ; {oop([I{0x000000075158ac88})} > vmovdqu 0x10(%r10),%xmm2 > vpxor %xmm0,%xmm0,%xmm0 > vpcmpgtd %xmm2,%xmm0,%xmm3 > vtestps %xmm3,%xmm3 > jne 0x00007fa818b27cb1 > vpermd %ymm1,%ymm2,%ymm0 > movabs $0x751588c68,%r10 ; {oop([I{0x0000000751588c68})} > vmovdqu %xmm0,0x10(%r10) > > Please take a look and leave reviews. Thanks a lot. 
Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: special case iotaShuffle ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13093/files - new: https://git.openjdk.org/jdk/pull/13093/files/97c8fabf..079a6b5f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13093&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13093&range=05-06 Stats: 4 lines in 1 file changed: 4 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/13093.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13093/head:pull/13093 PR: https://git.openjdk.org/jdk/pull/13093 From qamai at openjdk.org Fri Apr 7 17:14:14 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Fri, 7 Apr 2023 17:14:14 GMT Subject: RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v6] In-Reply-To: References: Message-ID: On Fri, 7 Apr 2023 13:36:22 GMT, Jatin Bhateja wrote: >> src/jdk.incubator.vector/share/classes/jdk/incubator/vector/AbstractVector.java line 226: >> >>> 224: >>> 225: AbstractSpecies species = vspecies().asIntegral(); >>> 226: Vector iota = species.iota(); >> >> we can do an early exist by returning species..iota() if start = 0 and step = 1 > > Power of two step count may be replaced by logical right shifts. But special handling may impact generic path > , currently c2 inline expander handles these special cases. > > Alternatively we can keep this implementation at its and enhance vector idealizations to handle identity scenarios, multiply by 1, addition by 0, shift replacement for power of two multiply, since their scalar counterparts do handle these cases and SLP generated code gets a benefit of that. Thanks a lot for your review, I think that transforming a multiplication by a power of 2 into a shift can be done by the C2 compiler. I have added the special case for `start = 0 && step == 1` since it may be more common and can be optimised away when the arguments are constants. 
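The special case discussed above can be pictured with a minimal scalar sketch. It is only an illustration, not the code from the PR: `IotaSketch` and `iotaIndices` are hypothetical names, plain `int[]` arrays stand in for shuffle lanes, and the sketch assumes a power-of-two lane count with wrapping indices.

```java
import java.util.Arrays;

public class IotaSketch {
    // Hypothetical helper, not part of the Vector API: returns the lane
    // indices start, start + step, start + 2*step, ... wrapped into
    // [0, length). Assumes length is a power of two.
    static int[] iotaIndices(int length, int start, int step) {
        int[] idx = new int[length];
        if (start == 0 && step == 1) {
            // Special case: the result is the plain identity sequence,
            // so no multiply, add, or wrap is needed.
            for (int i = 0; i < length; i++) {
                idx[i] = i;
            }
            return idx;
        }
        for (int i = 0; i < length; i++) {
            // Generic path: scale, offset, then wrap with a mask.
            idx[i] = (start + i * step) & (length - 1);
        }
        return idx;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(iotaIndices(4, 0, 1))); // [0, 1, 2, 3]
        System.out.println(Arrays.toString(iotaIndices(4, 1, 2))); // [1, 3, 1, 3]
    }
}
```

When both arguments are compile-time constants, a check like this can fold away entirely, which is the motivation given above for special-casing it.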
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13093#discussion_r1160841447 From never at openjdk.org Fri Apr 7 17:37:36 2023 From: never at openjdk.org (Tom Rodriguez) Date: Fri, 7 Apr 2023 17:37:36 GMT Subject: RFR: 8305755: [JVMCI] missing barriers in CompilerToVM.readFieldValue for Reference.referent Message-ID: <0B1MuPO_KSgWyfCDYo3vsX4KTXV3o0x2NENQDTBzgWI=.f638f9e0-0168-4be7-b047-2e3f08a3864f@github.com> Add missing GC barrier for reflective read. I'm not sure the idiom I've chosen is the correct one, so please correct me if there's a better way to write this. In testing, this resolved the issue. ------------- Commit messages: - Fix benign assertion failure - 8305755: [JVMCI] missing barriers in CompilerToVM.readFieldValue for Reference.referent Changes: https://git.openjdk.org/jdk/pull/13389/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13389&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8305755 Stats: 6 lines in 2 files changed: 4 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/13389.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13389/head:pull/13389 PR: https://git.openjdk.org/jdk/pull/13389 From qamai at openjdk.org Fri Apr 7 18:06:53 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Fri, 7 Apr 2023 18:06:53 GMT Subject: RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v6] In-Reply-To: References: Message-ID: On Fri, 7 Apr 2023 11:51:21 GMT, Jatin Bhateja wrote: >> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: >> >> small cosmetics > > src/jdk.incubator.vector/share/classes/jdk/incubator/vector/AbstractShuffle.java line 96: > >> 94: } >> 95: Vector shufvec = this.toBitsVector(); >> 96: VectorMask vecmask = shufvec.compare(VectorOperators.LT, 0); > > This may impact the intrinsification over AVX1 targets for floating point shuffles.
Since bits vector is an integral vector and AVX1 does support 32 byte floats but not 32 byte integral vectors. Yes I think it is a drawback of this approach, however currently we do not support shuffling for 256-bit vectors on AVX1 machines either, and AVX1 seems to be a special case in this regard. This species of float and double may also be less common in the usage of Vector API since it is larger than SPECIES_PREFERRED. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13093#discussion_r1160868954 From eosterlund at openjdk.org Fri Apr 7 19:25:42 2023 From: eosterlund at openjdk.org (Erik =?UTF-8?B?w5ZzdGVybHVuZA==?=) Date: Fri, 7 Apr 2023 19:25:42 GMT Subject: RFR: 8305755: [JVMCI] missing barriers in CompilerToVM.readFieldValue for Reference.referent In-Reply-To: <0B1MuPO_KSgWyfCDYo3vsX4KTXV3o0x2NENQDTBzgWI=.f638f9e0-0168-4be7-b047-2e3f08a3864f@github.com> References: <0B1MuPO_KSgWyfCDYo3vsX4KTXV3o0x2NENQDTBzgWI=.f638f9e0-0168-4be7-b047-2e3f08a3864f@github.com> Message-ID: On Fri, 7 Apr 2023 17:30:39 GMT, Tom Rodriguez wrote: > Add missing GC barrier for reflective read. I'm not sure the idiom I've chosen it the correct one so please correct me if there's a better way to write this. In testing, this resolved the issue. There are comments suggesting we need to read with Java volatile semantics. It was acquire before, which isn't always strong enough for Java volatiles. With the acquire dropped, I suppose it's even less okay. Looks like you want to use MO_SEQ_CST on these accesses, which precisely matches Java volatiles. ------------- Changes requested by eosterlund (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/13389#pullrequestreview-1376513021 From never at openjdk.org Fri Apr 7 19:48:38 2023 From: never at openjdk.org (Tom Rodriguez) Date: Fri, 7 Apr 2023 19:48:38 GMT Subject: RFR: 8305419: JDK-8301995 broke building libgraal Message-ID: There were some minor mismatches in where the indy index was passed and where the real constant pool index was used. There's no general test of `loadReferencedType` but I modified TestDynamicConstant to exercise it. The arguments to the invokedynamic in that test were mildly broken as well, which was only exposed once we tried to resolve it. ------------- Commit messages: - Fix invokedynamic index handling - Add test to exercise loadReferencedType for invokedynamic Changes: https://git.openjdk.org/jdk/pull/13392/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13392&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8305419 Stats: 53 lines in 4 files changed: 24 ins; 17 del; 12 mod Patch: https://git.openjdk.org/jdk/pull/13392.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13392/head:pull/13392 PR: https://git.openjdk.org/jdk/pull/13392 From never at openjdk.org Fri Apr 7 20:16:43 2023 From: never at openjdk.org (Tom Rodriguez) Date: Fri, 7 Apr 2023 20:16:43 GMT Subject: RFR: 8305755: [JVMCI] missing barriers in CompilerToVM.readFieldValue for Reference.referent In-Reply-To: <0B1MuPO_KSgWyfCDYo3vsX4KTXV3o0x2NENQDTBzgWI=.f638f9e0-0168-4be7-b047-2e3f08a3864f@github.com> References: <0B1MuPO_KSgWyfCDYo3vsX4KTXV3o0x2NENQDTBzgWI=.f638f9e0-0168-4be7-b047-2e3f08a3864f@github.com> Message-ID: On Fri, 7 Apr 2023 17:30:39 GMT, Tom Rodriguez wrote: > Add missing GC barrier for reflective read. I'm not sure the idiom I've chosen is the correct one, so please correct me if there's a better way to write this. In testing, this resolved the issue. And that will include the G1 barrier required for `Reference.get()` semantics as well?
------------- PR Comment: https://git.openjdk.org/jdk/pull/13389#issuecomment-1500607688 From matsaave at openjdk.org Fri Apr 7 21:13:43 2023 From: matsaave at openjdk.org (Matias Saavedra Silva) Date: Fri, 7 Apr 2023 21:13:43 GMT Subject: RFR: 8305419: JDK-8301995 broke building libgraal In-Reply-To: References: Message-ID: On Fri, 7 Apr 2023 19:41:45 GMT, Tom Rodriguez wrote: > There were some minor mismatches is where the indy index was passed and where the real constant pool index was used. There's no general test of `loadReferencedType` but I modified TestDynamicConstant to exercise it. The arguments to the invokedynamic in that test were mildly broken as well which was only exposed once we tried to resolve it. LGTM! Thanks Tom! ------------- Marked as reviewed by matsaave (Committer). PR Review: https://git.openjdk.org/jdk/pull/13392#pullrequestreview-1376587202 From eosterlund at openjdk.org Sat Apr 8 07:47:42 2023 From: eosterlund at openjdk.org (Erik =?UTF-8?B?w5ZzdGVybHVuZA==?=) Date: Sat, 8 Apr 2023 07:47:42 GMT Subject: RFR: 8305755: [JVMCI] missing barriers in CompilerToVM.readFieldValue for Reference.referent In-Reply-To: References: <0B1MuPO_KSgWyfCDYo3vsX4KTXV3o0x2NENQDTBzgWI=.f638f9e0-0168-4be7-b047-2e3f08a3864f@github.com> Message-ID: On Fri, 7 Apr 2023 20:13:55 GMT, Tom Rodriguez wrote: > And that will include the G1 barrier required for `Reference.get()` semantics as well? And a second question is whether we can safely perform this read for all Reference subtypes. Yes, as long as you keep ON_UNKNOWN_OOP_REF. And yes, you can safely do this for all reference types. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/13389#issuecomment-1500816879 From qamai at openjdk.org Sun Apr 9 09:48:49 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Sun, 9 Apr 2023 09:48:49 GMT Subject: RFR: 8305783: x86_64: Optimize AbsI and AbsL Message-ID: <54TM3HtMQkkbfqhjZQ1FbbYZnAP6gMfWP019Ps7NjVU=.96b1f5d6-9673-45f0-8742-3d9ff8739cb8@github.com> Hi, This patch optimizes the sequence emitted by `AbsINode` and `AbsLNode` to save some instructions and 1 temp register. Please take a look and kindly leave your reviews. Thanks a lot. ------------- Commit messages: - more efficient abs implementations Changes: https://git.openjdk.org/jdk/pull/13402/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13402&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8305783 Stats: 28 lines in 1 file changed: 0 ins; 10 del; 18 mod Patch: https://git.openjdk.org/jdk/pull/13402.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13402/head:pull/13402 PR: https://git.openjdk.org/jdk/pull/13402 From qamai at openjdk.org Sun Apr 9 09:48:49 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Sun, 9 Apr 2023 09:48:49 GMT Subject: RFR: 8305783: x86_64: Optimize AbsI and AbsL In-Reply-To: <54TM3HtMQkkbfqhjZQ1FbbYZnAP6gMfWP019Ps7NjVU=.96b1f5d6-9673-45f0-8742-3d9ff8739cb8@github.com> References: <54TM3HtMQkkbfqhjZQ1FbbYZnAP6gMfWP019Ps7NjVU=.96b1f5d6-9673-45f0-8742-3d9ff8739cb8@github.com> Message-ID: On Sun, 9 Apr 2023 09:41:08 GMT, Quan Anh Mai wrote: > Hi, > > This patch optimizes the sequence emitted by `AbsINode` and `AbsLNode` to save some instructions and 1 temp register. Please take a look and kindly leave your reviews. > > Thanks a lot. Running the patch on `org.openjdk.bench.java.lang.MathBench` shows noticeable improvement:

                                      Before                      After
    Benchmark          Mode   Cnt        Score        Error        Score        Error   Units   Change
    MathBench.absInt   thrpt    8  2848176.368 ±  18552.810  3442310.540 ±  65838.237  ops/ms  +20.86%
    MathBench.absLong  thrpt    8  2851886.891 ±  17565.736  3389983.543 ±  36466.553  ops/ms  +18.87%

------------- PR Comment: https://git.openjdk.org/jdk/pull/13402#issuecomment-1501088764 From stuefe at openjdk.org Sun Apr 9 19:25:59 2023 From: stuefe at openjdk.org (Thomas Stuefe) Date: Sun, 9 Apr 2023 19:25:59 GMT Subject: RFR: JDK-8305782: Provide MacroAssembler::breakpoint on aarch64 Message-ID: The ability to emit debug traps was useful for me on arm, and I miss it on aarch64. Tested manually on Linux aarch64 in gdb with various values for hint covering the whole 16-bit range set. Hint gets encoded in the instruction (gdb decodes instruction as "BRK xxx" with xxx being the hint). According to documentation the hint ends up in ESR.ELx.ISS after the trap hit, but gdb refused to display the ESR register, so I could not verify that. ------------- Commit messages: - implement breakpoint Changes: https://git.openjdk.org/jdk/pull/13401/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13401&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8305782 Stats: 10 lines in 3 files changed: 9 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/13401.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13401/head:pull/13401 PR: https://git.openjdk.org/jdk/pull/13401 From jkarthikeyan at openjdk.org Mon Apr 10 02:55:53 2023 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Mon, 10 Apr 2023 02:55:53 GMT Subject: RFR: 8305783: x86_64: Optimize AbsI and AbsL In-Reply-To: <54TM3HtMQkkbfqhjZQ1FbbYZnAP6gMfWP019Ps7NjVU=.96b1f5d6-9673-45f0-8742-3d9ff8739cb8@github.com> References: <54TM3HtMQkkbfqhjZQ1FbbYZnAP6gMfWP019Ps7NjVU=.96b1f5d6-9673-45f0-8742-3d9ff8739cb8@github.com> Message-ID: <2Kj9pZx14cO6sk7eatJUrNQaZhWOq6NRQNvSGkQPX3Y=.c2623675-c582-4163-be39-d5eeefe44b3f@github.com> On Sun, 9 Apr 2023 09:41:08 GMT, Quan Anh Mai wrote: > Hi, > > This patch optimizes the sequence emitted by `AbsINode` and `AbsLNode` to save some instructions and 1 temp register. Please take a look and kindly leave your reviews. > > Thanks a lot.
Patch looks nice! (I am not a reviewer.) I have one comment with the code. I was also thinking that the unconditional usage of cmov here may lead to performance issues if the abs is part of a loop-carried dependency, as I've seen that happen before with MinI/MaxI nodes before. I did a [small test](https://gist.github.com/jaskarth/d502adfe3a0e82c30b885ce660262161) to see if that was the case here and it seems there's no regression, so very nice! src/hotspot/cpu/x86/x86_64.ad line 8469: > 8467: match(Set dst (AbsI src)); > 8468: effect(TEMP dst, KILL cr); > 8469: format %{ "xorl $dst, $dst\n\t" Suggestion: format %{ "xorl $dst, $dst\t # int abs\n\t" It would be nice to add a comment like this to indicate in the OptoAssembly where the code is coming from, same with the long version. ------------- PR Review: https://git.openjdk.org/jdk/pull/13402#pullrequestreview-1376995455 PR Review Comment: https://git.openjdk.org/jdk/pull/13402#discussion_r1161375772 From fyang at openjdk.org Mon Apr 10 03:23:43 2023 From: fyang at openjdk.org (Fei Yang) Date: Mon, 10 Apr 2023 03:23:43 GMT Subject: RFR: 8305728: RISC-V: Use bexti instruction to do single-bit testing In-Reply-To: References: Message-ID: <0jZ-rpWX0ISXgblJsfIXiktoNbVZPFis4NcT4r5_sUk=.5affa6b7-80cd-48b7-a9f7-abacc85b6f3d@github.com> On Thu, 6 Apr 2023 00:52:15 GMT, Feilong Jiang wrote: > Current RISC-V port tests bit masks with `andi` instruction. But for those mask values not in the range of `simm12` (`andi` > only accepts sign-extended 12-bit immediate [1]), we need an extra temp register (`t0` as default for `andi`) to store the mask value [2]. > Since we now support Zbs extension of Bit-Manipulation, we have a more convenient way to test power-of-two bit > masks with the single instruction `bexti` [3] without any temp register. > > 1. https://github.com/riscv/riscv-isa-manual/blob/f6b8d5c7d2dcd935b48689a337c8f5bc2be4b5e5/src/rv32.tex#L519-L521 > 2. 
https://github.com/openjdk/jdk/blob/ce6e7461dc5ac56459a79e75d5de76929d1be0a3/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp#L1852-L1860 > 3. https://github.com/riscv/riscv-bitmanip/blob/main/bitmanip/insns/bexti.adoc > > Testing: > - [x] `hotspot_tier1`, `jdk_tier1` on QEMU-User w/ and w/o `UseZbs` (release build) > - [x] tier1 tests on unmatched board w/o `UseZbs` (release build) Looks good. Thanks. ------------- Marked as reviewed by fyang (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13368#pullrequestreview-1377039293 From jkarthikeyan at openjdk.org Mon Apr 10 03:49:41 2023 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Mon, 10 Apr 2023 03:49:41 GMT Subject: RFR: 8305787: Wrong debugging information printed with TraceOptoOutput Message-ID: This patch fixes a minor bug in adlc where the wrong resource names are printed when the flag TraceOptoOutput is enabled to debug instruction scheduling. As an example, the output: *** Bundle: 1 instr, resources: D0 BR 126 salI_rReg_imm === _ 240 |271 [[ 127 125 ]] #5/0x00000005 states that the bundle is using resources D0 and BR, but the second resource used is actually ALU0. The issue arises because `pipeline->_rescount` is only incremented for discrete resources [(here)](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/adlc/adlparse.cpp#L1612), i.e. resources specified without `=`. However, names are appended to the list for *all* resources [(here)](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/adlc/adlparse.cpp#L1652), so using `_rescount` to index the names causes it to go out of sync. The fix follows [output_h.cpp](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/adlc/output_h.cpp#L2231), which uses the iterator to go through all the resources and uses only the ones that are discrete. I applied that fix to this case, and also fixed the other instances of this bug. Reviews on this fix would be appreciated!
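The index mismatch described above can be illustrated with a small, self-contained sketch. It is not adlc code: the resource table, the `ResourceNames` class, and both helper methods are hypothetical, chosen only to show how a counter over discrete resources drifts away from a name list that also holds composite entries.

```java
import java.util.ArrayList;
import java.util.List;

public class ResourceNames {
    // Hypothetical resource table. "DECODE" stands for a composite resource
    // (one declared with '='); the others are discrete.
    static final String[] NAMES = {"D0", "D1", "DECODE", "ALU0", "BR"};
    static final boolean[] DISCRETE = {true, true, false, true, true};

    // Buggy lookup: bit i of the mask is used directly as an index into the
    // list of ALL names, although mask bits only number discrete resources.
    static List<String> namesByRawIndex(int mask, int discreteCount) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < discreteCount; i++) {
            if ((mask & (1 << i)) != 0) {
                out.add(NAMES[i]);
            }
        }
        return out;
    }

    // Fixed lookup: walk every resource, but advance the bit position only
    // for discrete ones, so names and mask bits stay in sync.
    static List<String> namesByDiscreteIndex(int mask) {
        List<String> out = new ArrayList<>();
        int bit = 0;
        for (int i = 0; i < NAMES.length; i++) {
            if (!DISCRETE[i]) {
                continue;
            }
            if ((mask & (1 << bit)) != 0) {
                out.add(NAMES[i]);
            }
            bit++;
        }
        return out;
    }

    public static void main(String[] args) {
        int mask = 0b0101; // discrete resources 0 and 2, i.e. D0 and ALU0
        System.out.println(namesByRawIndex(mask, 4));   // [D0, DECODE]
        System.out.println(namesByDiscreteIndex(mask)); // [D0, ALU0]
    }
}
```

In this toy table the raw-index lookup reports a composite name for the second resource, which is the same kind of wrong label as in the trace output quoted above.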
------------- Commit messages: - Ensure correct pipeline resources are used in adlc Changes: https://git.openjdk.org/jdk/pull/13403/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13403&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8305787 Stats: 73 lines in 3 files changed: 37 ins; 2 del; 34 mod Patch: https://git.openjdk.org/jdk/pull/13403.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13403/head:pull/13403 PR: https://git.openjdk.org/jdk/pull/13403 From fjiang at openjdk.org Mon Apr 10 06:43:43 2023 From: fjiang at openjdk.org (Feilong Jiang) Date: Mon, 10 Apr 2023 06:43:43 GMT Subject: RFR: JDK-8305782: Provide MacroAssembler::breakpoint on aarch64 In-Reply-To: References: Message-ID: On Sun, 9 Apr 2023 07:45:46 GMT, Thomas Stuefe wrote: > The ability to emit debug traps was useful for me on arm, and I miss it on aarch64. > > Tested manually on Linux aarch64 in gdb with various values for hint covering the whole 16-bit range set. Hint gets encoded in the instruction (gdb decodes instruction as "BRK xxx" with xxx being the hint). According to documentation the hint ends up in ESR.ELx.ISS after the trap hit, but gdb refused to display the ESR register, so I could not verify that. src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp line 2525: > 2523: void MacroAssembler::breakpoint(uint16_t hint_imm16) { > 2524: // BRK with hint > 2525: const uint32_t c = 0xD4200000 | (((uint32_t)hint_imm16) << 5); aarch64 already provides `brk` at assembler [1], for readability, maybe we can just simply use `brk(hint_imm16)` here. 1. https://github.com/openjdk/jdk/blob/969a6b9fd7f7afc60250309f3ada205c1473cf8e/src/hotspot/cpu/aarch64/assembler_aarch64.hpp#L1036 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13401#discussion_r1161478302 From epeter at openjdk.org Mon Apr 10 09:10:46 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 10 Apr 2023 09:10:46 GMT Subject: RFR: 8303466: C2: failed: malformed control flow. 
Limit type made precise with MaxL/MinL In-Reply-To: References: Message-ID: On Thu, 6 Apr 2023 11:53:18 GMT, Roland Westrelin wrote: >> **Context** >> >> During `PhaseIdealLoop::do_unroll`, we hack the loop-limit, and subtract `stride` from it. We have to prevent underflow on that subtract. Currently, we do this with a `CMoveI`. The problem with this: `CMoveI` is not smart enough to generate a precise type. For example, there are many cases where the input types get better, and underflow is not possible anymore. But the `CMoveI` does not detect this, and still has type `min_jint..hi`. >> >> We have the same issue in `PhaseIdealLoop::adjust_limit`, where we use `CMoveL` to implement long max/min. The types are not as precise as they could and should be. >> >> **Problem** >> >> The imprecise type is used for the zero-trip-guard. It does not fold to false, even though the data-path into the post loop does constant fold to `TOP`. The graph breaks, and assert `malformed control flow` triggers. >> >> Details: In these cases, we have the super-unrolled main-loop (SuperWord'ed, then further unrolled) directly leading to a vectorized post-loop. The effect is that there is no `region/phi` merging main-exit and main-zero-trip-guard. So the types are already more narrow here. It may be possible that the values are such that we find out that we should never enter the vectorized post-loop. But if data finds out and control does not, we get a broken graph. >> Note: we have pre-loop. Then a main-loop and vectorized post loop. Then we merge the main-zero-trip-guard. And at the end we have the scalar post loop. >> >> I have already recently fixed a bug around this `CMoveI`. https://github.com/openjdk/jdk/commit/5a4945c0d95423d0ab07762c915e9cb4d3c66abb I would now like to have a more satisfactory fix, that properly propagates the types. 
>> >> **Solution** >> >> `PhaseIdealLoop::adjust_limit` already converts the limit from int to long, and does all computations in long, including taking max/min with a `CMoveL`. I now use the so far unused `MaxL/MinL`. I implemented some missing `Value/Identity` components for it. Since `MaxL/MinL` is not implemented in the backend, I just expand it in macro-expansion to a `CMoveL`. At that point the loop-opts are over, and it is most likely ok that we do not make the types more precise after this. >> >> I take the same approach for `PhaseIdealLoop::do_unroll`: convert limits to long, do subtraction in long, take `MinL/MaxL` to clamp it to the int-range (prevent subtraction underflow). >> >> **Discussion** >> >> This solution seems much cleaner to me, and I hope that we will see fewer bugs because of imprecise types in the limit computation, which were often due to the `CMove` not being smart enough to analyze all inputs (it would have to recognize a multitude of patterns, for the Cmp inputs and the direct inputs to the CMove - we currently do not do that, but just take the union of the input types - this is very imprecise). >> >> There is a bit of an overhead here: We use longs even though we only want to have int values. But I think we should prefer a clean implementation here, with correct type computation. The performance impact is probably non-existent on 64-bit machines anyway. >> >> **Caveat** >> >> I found some cases with the same assert `malformed control flow` that are most likely skeleton/assertion predicate bugs [JDK-8288981](https://bugs.openjdk.org/browse/JDK-8288981). Some of those cases were new patterns, for example where we PreMainPost a main loop. >> >> I hope that this fix here at least reduces the frequency of failures significantly. >> >> **Testing** >> >> I added 2 regression tests. Our fuzzer seems to spit out examples regularly, so that gives us extra coverage. >> >> Tested up to `tier5` and stress testing.
Performance testing **running...** >> >> **Future Work** >> >> We should implement `MaxL/MinL` in the backend. We should also use them during parsing. This would also allow to `SuperWord` the instruction, on the platforms that support it. >> >> Should we add such an assert during IGVN? I think after IGVN, we should never have a `MultiBranchNode` that does not have the required number of outputs, right? We could add it to `VerifyIterativeGVN`. > > That looks reasonable to me. > Is the PhaseIdealLoop::adjust_limit() change required or is it some cleanup? > Have you run performance testing to be safe? @rwestrel thanks for the review! The performance testing looks ok, I cannot see a significant runtime change. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13269#issuecomment-1501585583 From epeter at openjdk.org Mon Apr 10 09:21:46 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 10 Apr 2023 09:21:46 GMT Subject: RFR: 8305740: C2: add print statements to assert: Can't determine return type. Message-ID: I added this assert before the bailout, because it probably hides bugs. Now we have failure reports with [JDK-8305185](https://bugs.openjdk.org/browse/JDK-8305185). It is difficult to reproduce, so I'd like to add some print statements to get at least a bit of info. Passed tests up to tier5 and stress testing. ------------- Commit messages: - 8305740: C2: add print statements to assert: Can't determine return type. 
Changes: https://git.openjdk.org/jdk/pull/13385/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13385&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8305740 Stats: 9 lines in 1 file changed: 9 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/13385.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13385/head:pull/13385 PR: https://git.openjdk.org/jdk/pull/13385 From duke at openjdk.org Mon Apr 10 10:55:39 2023 From: duke at openjdk.org (Daohan Qu) Date: Mon, 10 Apr 2023 10:55:39 GMT Subject: RFR: 8305324: C2: Wrong execution of vectorizing Interger.reverseBytes Message-ID: This patch should fix [JDK-8305324](https://bugs.openjdk.org/browse/JDK-8305324). `SuperWord::compute_vector_element_type()` propagates backward a narrower integer type when the upper bits of the value are not needed. However, `Integer.reverseBytes()` depends on higher order bits of an integer and should be prevented from being vectorized like `Math.abs()`( which is `Op_AbsI` in the following code). https://github.com/openjdk/jdk/blob/0243da2e4adc1b7ab6fcd5b10778532101158dce/src/hotspot/share/opto/superword.cpp#L3935-L3945 I have tested this patch for tier 1-3 on x86-64. 
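To see why `Integer.reverseBytes()` cannot tolerate a narrowed input type, here is a minimal scalar sketch. The class and method names are hypothetical, and `narrowThenReverse` is only one plausible model of what a wrongly narrowed operation could compute; the exact miscompiled result in the bug report may differ.

```java
public class ReverseBytesNarrowing {
    // Correct scalar semantics: reverse all four bytes of the int, then
    // truncate the result to a short. The low 16 bits of the result come
    // from the HIGH 16 bits of the input.
    static short reverseThenNarrow(int x) {
        return (short) Integer.reverseBytes(x);
    }

    // Assumed model of the narrowed form: truncate to short first, then
    // reverse only the two bytes that are left.
    static short narrowThenReverse(int x) {
        return Short.reverseBytes((short) x);
    }

    public static void main(String[] args) {
        int x = 0x11223344;
        System.out.printf("%04x%n", reverseThenNarrow(x) & 0xFFFF); // 2211
        System.out.printf("%04x%n", narrowThenReverse(x) & 0xFFFF); // 4433
    }
}
```

The two results differ because the bytes that end up in the low half of the reversed int are exactly the ones a narrowing step would have discarded, so the upper bits of the input are needed.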
------------- Commit messages: - Add a jtreg test for 8305324 - Prevent integer narrowed type backward propagation from passing through Op_ReverseBytesI operation, which should fix 8305324 Changes: https://git.openjdk.org/jdk/pull/13404/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13404&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8305324 Stats: 57 lines in 2 files changed: 56 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/13404.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13404/head:pull/13404 PR: https://git.openjdk.org/jdk/pull/13404 From duke at openjdk.org Mon Apr 10 12:59:51 2023 From: duke at openjdk.org (Daohan Qu) Date: Mon, 10 Apr 2023 12:59:51 GMT Subject: RFR: 8305324: C2: Wrong execution of vectorizing Interger.reverseBytes In-Reply-To: References: Message-ID: On Mon, 10 Apr 2023 10:49:24 GMT, Daohan Qu wrote: > This patch should fix [JDK-8305324](https://bugs.openjdk.org/browse/JDK-8305324). > > `SuperWord::compute_vector_element_type()` propagates backward a narrower integer type when the upper bits of the value are not needed. However, `Integer.reverseBytes()` depends on higher order bits of an integer and should be prevented from being vectorized like `Math.abs()`( which is `Op_AbsI` in the following code). > > https://github.com/openjdk/jdk/blob/0243da2e4adc1b7ab6fcd5b10778532101158dce/src/hotspot/share/opto/superword.cpp#L3935-L3945 > > I have tested this patch for tier 1-3 on x86-64. Sorry, I forgot to enable the pre-submit test. I will close this PR and create a new one. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/13404#issuecomment-1501782098 From duke at openjdk.org Mon Apr 10 12:59:52 2023 From: duke at openjdk.org (Daohan Qu) Date: Mon, 10 Apr 2023 12:59:52 GMT Subject: Withdrawn: 8305324: C2: Wrong execution of vectorizing Interger.reverseBytes In-Reply-To: References: Message-ID: On Mon, 10 Apr 2023 10:49:24 GMT, Daohan Qu wrote: > This patch should fix [JDK-8305324](https://bugs.openjdk.org/browse/JDK-8305324). > > `SuperWord::compute_vector_element_type()` propagates backward a narrower integer type when the upper bits of the value are not needed. However, `Integer.reverseBytes()` depends on higher order bits of an integer and should be prevented from being vectorized like `Math.abs()`( which is `Op_AbsI` in the following code). > > https://github.com/openjdk/jdk/blob/0243da2e4adc1b7ab6fcd5b10778532101158dce/src/hotspot/share/opto/superword.cpp#L3935-L3945 > > I have tested this patch for tier 1-3 on x86-64. This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/13404 From qamai at openjdk.org Mon Apr 10 13:04:43 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Mon, 10 Apr 2023 13:04:43 GMT Subject: RFR: 8305783: x86_64: Optimize AbsI and AbsL [v2] In-Reply-To: <54TM3HtMQkkbfqhjZQ1FbbYZnAP6gMfWP019Ps7NjVU=.96b1f5d6-9673-45f0-8742-3d9ff8739cb8@github.com> References: <54TM3HtMQkkbfqhjZQ1FbbYZnAP6gMfWP019Ps7NjVU=.96b1f5d6-9673-45f0-8742-3d9ff8739cb8@github.com> Message-ID: > Hi, > > This patch optimizes the sequence emitted by `AbsINode` and `AbsLNode` to save some instructions and 1 temp register. Please take a look and kindly leave your reviews. > > Thanks a lot. 
Quan Anh Mai has updated the pull request incrementally with two additional commits since the last revision: - description - description ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13402/files - new: https://git.openjdk.org/jdk/pull/13402/files/723eb6ce..4767979d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13402&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13402&range=00-01 Stats: 4 lines in 1 file changed: 0 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/13402.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13402/head:pull/13402 PR: https://git.openjdk.org/jdk/pull/13402 From qamai at openjdk.org Mon Apr 10 13:09:44 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Mon, 10 Apr 2023 13:09:44 GMT Subject: RFR: 8305783: x86_64: Optimize AbsI and AbsL [v2] In-Reply-To: <2Kj9pZx14cO6sk7eatJUrNQaZhWOq6NRQNvSGkQPX3Y=.c2623675-c582-4163-be39-d5eeefe44b3f@github.com> References: <54TM3HtMQkkbfqhjZQ1FbbYZnAP6gMfWP019Ps7NjVU=.96b1f5d6-9673-45f0-8742-3d9ff8739cb8@github.com> <2Kj9pZx14cO6sk7eatJUrNQaZhWOq6NRQNvSGkQPX3Y=.c2623675-c582-4163-be39-d5eeefe44b3f@github.com> Message-ID: On Mon, 10 Apr 2023 02:53:25 GMT, Jasmine Karthikeyan wrote: >> Quan Anh Mai has updated the pull request incrementally with two additional commits since the last revision: >> >> - description >> - description > > Patch looks nice! (I am not a reviewer.) I have one comment with the code. > > I was also thinking that the unconditional usage of cmov here may lead to performance issues if the abs is part of a loop-carried dependency, as I've seen that happen before with MinI/MaxI nodes before. I did a [small test](https://gist.github.com/jaskarth/d502adfe3a0e82c30b885ce660262161) to see if that was the case here and it seems there's no regression, so very nice! @jaskarth Thanks a lot for your review, I think this cannot introduce more dependency into the operation since it depends on `src` already. 
I also added more description to the `format` sections ------------- PR Comment: https://git.openjdk.org/jdk/pull/13402#issuecomment-1501794690 From duke at openjdk.org Mon Apr 10 13:26:52 2023 From: duke at openjdk.org (Daohan Qu) Date: Mon, 10 Apr 2023 13:26:52 GMT Subject: RFR: 8305324: C2: Wrong execution of vectorizing Interger.reverseBytes In-Reply-To: References: Message-ID: <2C1JXvFQ46-0uZ7vh9OcxR_6AemiPganvcXIncJb8RQ=.d40ae92e-848b-4cc3-807c-957bff42caff@github.com> On Mon, 10 Apr 2023 10:49:24 GMT, Daohan Qu wrote: > This patch should fix [JDK-8305324](https://bugs.openjdk.org/browse/JDK-8305324). > > `SuperWord::compute_vector_element_type()` propagates backward a narrower integer type when the upper bits of the value are not needed. However, `Integer.reverseBytes()` depends on higher order bits of an integer and should be prevented from being vectorized like `Math.abs()`( which is `Op_AbsI` in the following code). > > https://github.com/openjdk/jdk/blob/0243da2e4adc1b7ab6fcd5b10778532101158dce/src/hotspot/share/opto/superword.cpp#L3935-L3945 > > I have tested this patch for tier 1-3 on x86-64. Please refer to the new PR #13406. Thanks. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13404#issuecomment-1501811713 From duke at openjdk.org Mon Apr 10 13:29:54 2023 From: duke at openjdk.org (Daohan Qu) Date: Mon, 10 Apr 2023 13:29:54 GMT Subject: RFR: 8305324: C2: Wrong execution of vectorizing Interger.reverseBytes Message-ID: This patch should fix [JDK-8305324](https://bugs.openjdk.org/browse/JDK-8305324). `SuperWord::compute_vector_element_type()` implemented in `jdk/src/hotspot/share/opto/superword.cpp` propagates backward a narrower integer type when the upper bits of the value are not needed. However, `Integer.reverseBytes()` depends on higher-order bits of an integer and should be prevented from being narrowed and vectorized. Instead, it needs to be treated like `Math.abs()`( which is represented by `Op_AbsI` in the following code). 
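
As a small standalone illustration of why the upper bits matter here (a hypothetical sketch, not the reproducer from the bug report and not part of the patch; the class and method names are made up), compare reversing the full 32-bit value and then narrowing against narrowing first and reversing only 16 bits. The `superword.cpp` snippet that the description refers to is the one linked right after this sketch.

```java
public class ReverseBytesNarrowing {
    // Reference semantics: reverse all four bytes of the int, then narrow to short.
    // The low 16 bits of the result come from the HIGH 16 bits of the input.
    static short full(int x) {
        return (short) Integer.reverseBytes(x);
    }

    // What a computation narrowed to 16 bits would effectively do:
    // drop the upper bits first, then reverse only the two remaining bytes.
    static short narrowedFirst(int x) {
        return Short.reverseBytes((short) x);
    }

    public static void main(String[] args) {
        int x = 0x11223344;
        System.out.printf("full:     0x%04x%n", full(x) & 0xffff);          // prints 0x2211
        System.out.printf("narrowed: 0x%04x%n", narrowedFirst(x) & 0xffff); // prints 0x4433
    }
}
```

The two results differ, which is the reason a narrower element type should not be propagated backward through `Op_ReverseBytesI`.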
https://github.com/openjdk/jdk/blob/0243da2e4adc1b7ab6fcd5b10778532101158dce/src/hotspot/share/opto/superword.cpp#L3935-L3945 I have tested this patch for tier 1-3 on x86-64. ------------- Commit messages: - Add a jtreg test for 8305324 - Prevent integer narrowed type backward propagation from passing through Op_ReverseBytesI operation, which should fix 8305324 Changes: https://git.openjdk.org/jdk/pull/13406/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13406&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8305324 Stats: 57 lines in 2 files changed: 56 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/13406.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13406/head:pull/13406 PR: https://git.openjdk.org/jdk/pull/13406 From jbhateja at openjdk.org Mon Apr 10 15:14:51 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 10 Apr 2023 15:14:51 GMT Subject: RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v6] In-Reply-To: References: Message-ID: On Fri, 7 Apr 2023 17:12:08 GMT, Quan Anh Mai wrote: >> Power of two step count may be replaced by logical right shifts. But special handling may impact generic path >> , currently c2 inline expander handles these special cases. >> >> Alternatively we can keep this implementation at its and enhance vector idealizations to handle identity scenarios, multiply by 1, addition by 0, shift replacement for power of two multiply, since their scalar counterparts do handle these cases and SLP generated code gets a benefit of that. > > Thanks a lot for your review, I think that transforming a multiplication by a power of 2 into a shift can be done by the C2 compiler. I have added the special case for `start = 0 && step == 1` since it may be more common and can be optimised away when the arguments are constants. 
For x86 byte vector multiplication is done at granularity of short lanes, this case shows regression with power of two multiplications which are strength reduced to shifts currently. please file a follow up bug report for this. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13093#discussion_r1161806381 From xliu at openjdk.org Mon Apr 10 15:54:44 2023 From: xliu at openjdk.org (Xin Liu) Date: Mon, 10 Apr 2023 15:54:44 GMT Subject: RFR: 8305203: Simplify trimming operation in Region::Ideal [v2] In-Reply-To: References: Message-ID: On Mon, 3 Apr 2023 19:42:42 GMT, Xin Liu wrote: >> This patch improves how Region::Ideal trims unreachable paths. >> >> 1. Don't restart from beginning. Trimming doesn't change the DU-chain. >> 2. Replace DFIterator with DFIterator_Fast. The later is a raw pointer in release build. >> 3. Don't call add_users_to_worklist(this) repeatly. >> 4. Reduce its strength from add_users_to_worklist to >> add_users_to_worklist0 because RegionNode has no special logic. >> >> This patch also includes a cosmetic change: rename n to 'use' inside of the loop. >> Otherwise, we would overshadow Node* n = in(i). Nothing wrong but harder to read. > > Xin Liu has updated the pull request incrementally with one additional commit since the last revision: > > Update coding style according to reviewer's feedback. hi, @vnkozlov Could you take a look this PR? thanks. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/13238#issuecomment-1501979313 From jkarthikeyan at openjdk.org Mon Apr 10 17:06:46 2023 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Mon, 10 Apr 2023 17:06:46 GMT Subject: RFR: 8305783: x86_64: Optimize AbsI and AbsL [v2] In-Reply-To: References: <54TM3HtMQkkbfqhjZQ1FbbYZnAP6gMfWP019Ps7NjVU=.96b1f5d6-9673-45f0-8742-3d9ff8739cb8@github.com> Message-ID: On Mon, 10 Apr 2023 13:04:43 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch optimizes the sequence emitted by `AbsINode` and `AbsLNode` to save some instructions and 1 temp register. Please take a look and kindly leave your reviews. >> >> Thanks a lot. > > Quan Anh Mai has updated the pull request incrementally with two additional commits since the last revision: > > - description > - description Marked as reviewed by jkarthikeyan (Author). Great, thank you! ------------- PR Review: https://git.openjdk.org/jdk/pull/13402#pullrequestreview-1377836409 PR Comment: https://git.openjdk.org/jdk/pull/13402#issuecomment-1502063193 From jbhateja at openjdk.org Mon Apr 10 17:24:54 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 10 Apr 2023 17:24:54 GMT Subject: RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v6] In-Reply-To: References: Message-ID: On Fri, 7 Apr 2023 18:04:16 GMT, Quan Anh Mai wrote: >> src/jdk.incubator.vector/share/classes/jdk/incubator/vector/AbstractShuffle.java line 96: >> >>> 94: } >>> 95: Vector shufvec = this.toBitsVector(); >>> 96: VectorMask vecmask = shufvec.compare(VectorOperators.LT, 0); >> >> This may impact the intrinsification over AVX1 targets for floating point shuffles. Since bits vector is an integral vector and AVX1 does support 32 byte floats but not 32 byte integral vectors. > > Yes I think it is a drawback of this approach, however currently we do not support shuffling for 256-bit vectors on AVX1 machines either, and AVX1 seems to be a special case in this regard. 
This species of float and double may also be less common in the usage of Vector API since it is larger than SPECIES_PREFERRED.

Hi @merykitty , Agree with you that SPECIES_PREFERRED is preferred for vector algorithms intercepting both integral and floating point vectors.

FTR, we see a perf regression with Float256 based micro now on AVX=1 targets,


public static short micro() {
    VectorShuffle iota = FloatVector.SPECIES_256.iotaShuffle(0, 1, true);
    return iota.cast(ShortVector.SPECIES_128).toVector().reinterpretAsShorts().lane(1);
}

CPROMPT>javad --add-modules=jdk.incubator.vector -XX:UseAVX=1 -XX:+PrintIntrinsics -XX:CompileCommand=compileonly,shufflef::micro -cp . shufflef
CompileCommand: compileonly shufflef.micro bool compileonly = true
** not supported: arity=1 op=reinterpret/1 vlen1=8 etype1=int ismask=0
** not supported: arity=1 op=cast/1 vlen1=8 etype1=int ismask=0
@ 17 java.lang.Object::getClass (0 bytes) (intrinsic)
@ 24 java.lang.Object::getClass (0 bytes) (intrinsic)
@ 45 jdk.internal.vm.vector.VectorSupport::convert (36 bytes) failed to inline (intrinsic)
@ 34 java.lang.Object::getClass (0 bytes) (intrinsic)
@ 54 jdk.internal.vm.vector.VectorSupport::convert (36 bytes) failed to inline (intrinsic)
@ 17 java.lang.Object::getClass (0 bytes) (intrinsic)
@ 24 java.lang.Object::getClass (0 bytes) (intrinsic)
@ 45 jdk.internal.vm.vector.VectorSupport::convert (36 bytes) (intrinsic)
@ 292 java.lang.Object::getClass (0 bytes) (intrinsic)
@ 298 java.lang.Object::getClass (0 bytes) (intrinsic)
@ 322 jdk.internal.vm.vector.VectorSupport::convert (36 bytes) (intrinsic)
@ 292 java.lang.Object::getClass (0 bytes) (intrinsic)
@ 298 java.lang.Object::getClass (0 bytes) (intrinsic)
@ 322 jdk.internal.vm.vector.VectorSupport::convert (36 bytes) (intrinsic)
@ 16 jdk.internal.vm.vector.VectorSupport::extract (35 bytes) (intrinsic)
[time] 386ms [res]3392
CPROMPT>export JAVA_HOME=/home/jatinbha/softwares/jdk-20/
CPROMPT>export PATH=$JAVA_HOME/bin:$PATH
CPROMPT>javad --add-modules=jdk.incubator.vector -XX:UseAVX=1 -XX:+PrintIntrinsics -XX:CompileCommand=compileonly,shufflef::micro -cp . shufflef
CompileCommand: compileonly shufflef.micro bool compileonly = true
WARNING: Using incubator modules: jdk.incubator.vector
@ 3 jdk.internal.misc.Unsafe::loadFence (5 bytes) (intrinsic)
@ 3 jdk.internal.misc.Unsafe::loadFence (5 bytes) (intrinsic)
@ 17 jdk.internal.vm.vector.VectorSupport::shuffleToVector (33 bytes) (intrinsic)
@ 292 java.lang.Object::getClass (0 bytes) (intrinsic)
@ 298 java.lang.Object::getClass (0 bytes) (intrinsic)
@ 322 jdk.internal.vm.vector.VectorSupport::convert (36 bytes) (intrinsic)
@ 16 jdk.internal.vm.vector.VectorSupport::extract (35 bytes) (intrinsic)
[time] 7ms [res]3392

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/13093#discussion_r1161810585

From qamai at openjdk.org Mon Apr 10 18:43:25 2023
From: qamai at openjdk.org (Quan Anh Mai)
Date: Mon, 10 Apr 2023 18:43:25 GMT
Subject: RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v6]
In-Reply-To:
References:
Message-ID:

On Mon, 10 Apr 2023 15:11:55 GMT, Jatin Bhateja wrote:

>> Thanks a lot for your review, I think that transforming a multiplication by a power of 2 into a shift can be done by the C2 compiler. I have added the special case for `start = 0 && step == 1` since it may be more common and can be optimised away when the arguments are constants.
>
> For x86 byte vector multiplication is done at granularity of short lanes, this case shows regression with power of two multiplications which are strength reduced to shifts currently. please file a follow up bug report for this.
@jatin-bhateja I have created [JDK-8305810](https://bugs.openjdk.org/browse/JDK-8305810) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13093#discussion_r1161976583 From kvn at openjdk.org Mon Apr 10 18:45:38 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 10 Apr 2023 18:45:38 GMT Subject: RFR: 8303466: C2: failed: malformed control flow. Limit type made precise with MaxL/MinL In-Reply-To: References: Message-ID: On Fri, 31 Mar 2023 12:44:17 GMT, Emanuel Peter wrote: > **Context** > > During `PhaseIdealLoop::do_unroll`, we hack the loop-limit, and subtract `stride` from it. We have to prevent underflow on that subtract. Currently, we do this with a `CMoveI`. The problem with this: `CMoveI` is not smart enough to generate a precise type. For example, there are many cases where the input types get better, and underflow is not possible anymore. But the `CMoveI` does not detect this, and still has type `min_jint..hi`. > > We have the same issue in `PhaseIdealLoop::adjust_limit`, where we use `CMoveL` to implement long max/min. The types are not as precise as they could and should be. > > **Problem** > > The imprecise type is used for the zero-trip-guard. It does not fold to false, even though the data-path into the post loop does constant fold to `TOP`. The graph breaks, and assert `malformed control flow` triggers. > > Details: In these cases, we have the super-unrolled main-loop (SuperWord'ed, then further unrolled) directly leading to a vectorized post-loop. The effect is that there is no `region/phi` merging main-exit and main-zero-trip-guard. So the types are already more narrow here. It may be possible that the values are such that we find out that we should never enter the vectorized post-loop. But if data finds out and control does not, we get a broken graph. > Note: we have pre-loop. Then a main-loop and vectorized post loop. Then we merge the main-zero-trip-guard. And at the end we have the scalar post loop. 
> > I have already recently fixed a bug around this `CMoveI`. https://github.com/openjdk/jdk/commit/5a4945c0d95423d0ab07762c915e9cb4d3c66abb I would now like to have a more satisfactory fix, that properly propagates the types. > > **Solution** > > `PhaseIdealLoop::adjust_limit` already converts the limit from int to long, and does all computations in long, including taking max/min with a `CMoveL`. I now use the so far unused `MaxL/MinL`. I implemented some missing `Value/Identity` components for it. Since `MaxL/MinL` is not implemented in the backend, I just expand it in macro-expansion to a `CMoveL`. At that point the loop-opts are over, and it is most likely ok that we do not make the types more precise after this. > > I take the same approach for `PhaseIdealLoop::do_unroll`: convert limits to long, do subtraction in long, take `MinL/MaxL` to clamp it to the int-range (prevent subtraction underflow). > > **Discussion** > > This solution seems much cleaner to me, and I hope that we will see less bugs because of imprecise types in the limit computation, which were often due to the `CMove` not being smart enough to analyze all inputs (it would have to recognize a multitude of patterns, for the Cmp inputs and the direct inputs to the CMove - we currently do not do that, but just take the union of the input types - this is very inprecise). > > There is a bit of an overhead here: We use longs even though we only want to have int values. But I think we should prefer a clean implementation here, with correct type computation. The performance impact is probably non-existent on 64-bit machines anyway. > > **Caveat** > > I found some cases with the same assert `malformed control flow` that are most likely skeleton/assertion predicate bugs [JDK-8288981](https://bugs.openjdk.org/browse/JDK-8288981). Some of those cases were new patterns, for example where we PreMainPost a main loop. > > I hope that this fix here at least reduces the frequency of failures significantly. 
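
The clamping idea described under **Solution** above can be modelled in plain Java arithmetic. This is only a sketch of the arithmetic, assuming a positive stride: it does not build or use any C2 IR nodes, and `LimitClampSketch`, `naiveAdjust` and `clampedAdjust` are made-up names used for illustration.

```java
public class LimitClampSketch {
    // Plain int subtraction: wraps around when limit is close to Integer.MIN_VALUE.
    static int naiveAdjust(int limit, int stride) {
        return limit - stride;
    }

    // Widen both inputs to long, subtract without overflow, then clamp the
    // result back into the int range with max/min (the role the description
    // assigns to MaxL/MinL).
    static int clampedAdjust(int limit, int stride) {
        long adjusted = (long) limit - (long) stride;
        long clamped = Math.min(Math.max(adjusted, Integer.MIN_VALUE), Integer.MAX_VALUE);
        return (int) clamped;
    }

    public static void main(String[] args) {
        int limit = Integer.MIN_VALUE + 1;
        int stride = 4;
        System.out.println(naiveAdjust(limit, stride));   // prints 2147483645 (wrapped)
        System.out.println(clampedAdjust(limit, stride)); // prints -2147483648 (clamped)
        System.out.println(clampedAdjust(100, stride));   // prints 96 (unchanged in the common case)
    }
}
```

Because max and min are monotone in their inputs, a type analysis can derive a tight range for the clamped value directly from the input ranges, which is harder to do for a generic conditional move.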
> > **Testing** > > I added 2 regression tests. Our fuzzer seems to spit out examples regularly, so that gives us extra coverage. > > Tested up to `tier5` and stress testing. Performance testing **running...** > > **Future Work** > > We should implement `MaxL/MinL` in the backend. We should also use them during parsing. This would also allow to `SuperWord` the instruction, on the platforms that support it. > > Should we add such an assert during IGVN? I think after IGVN, we should never have a `MultiBranchNode` that does not have the required number of outputs, right? We could add it to `VerifyIterativeGVN`. If we have several unrolls we will have chain of `MaxL/MinL` nodes. Will the chain be folded by IGVN? src/hotspot/share/opto/loopTransform.cpp line 2296: > 2294: register_new_node(limit_l, get_ctrl(limit)); > 2295: Node* stride_l = new ConvI2LNode(stride); > 2296: register_new_node(stride_l, get_ctrl(limit)); Can we make Long constant (`_igvn.longcon(stride)`) instead since `stride` is constant? Similar to `underflow_clamp_l`. My concern is you set control to constant which is not `Root`. ------------- PR Review: https://git.openjdk.org/jdk/pull/13269#pullrequestreview-1377935449 PR Review Comment: https://git.openjdk.org/jdk/pull/13269#discussion_r1161965268 From kvn at openjdk.org Mon Apr 10 18:48:00 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 10 Apr 2023 18:48:00 GMT Subject: RFR: 8305740: C2: add print statements to assert: Can't determine return type. In-Reply-To: References: Message-ID: <7w453XpFls8FMawBknRZOjVtfySrYr5wmeDJl5jUMZM=.8d7eeeaf-2c00-483d-9f60-1f65d7f02c21@github.com> On Fri, 7 Apr 2023 09:08:55 GMT, Emanuel Peter wrote: > I added this assert before the bailout, because it probably hides bugs. Now we have failure reports with [JDK-8305185](https://bugs.openjdk.org/browse/JDK-8305185). > > It is difficult to reproduce, so I'd like to add some print statements to get at least a bit of info. 
> > Passed tests up to tier5 and stress testing. Yes. When you are adding assert always consider what information you will need to debug it. Marked as reviewed by kvn (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/13385#pullrequestreview-1377958239 PR Review: https://git.openjdk.org/jdk/pull/13385#pullrequestreview-1377958553 From kvn at openjdk.org Mon Apr 10 18:50:29 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 10 Apr 2023 18:50:29 GMT Subject: RFR: 8305324: C2: Wrong execution of vectorizing Interger.reverseBytes In-Reply-To: References: Message-ID: <-44ykJ9X7C-62lsILtrcer9h34EPBANNgOcnJb-RiQ4=.95176a9b-672b-4e11-a3ce-11b19931dd2d@github.com> On Mon, 10 Apr 2023 13:21:29 GMT, Daohan Qu wrote: > This patch should fix [JDK-8305324](https://bugs.openjdk.org/browse/JDK-8305324). > > `SuperWord::compute_vector_element_type()` implemented in `jdk/src/hotspot/share/opto/superword.cpp` propagates backward a narrower integer type when the upper bits of the value are not needed. However, `Integer.reverseBytes()` depends on higher-order bits of an integer and should be prevented from being narrowed and vectorized. Instead, it needs to be treated like `Math.abs()` (which is represented by `Op_AbsI` in the following code). > > https://github.com/openjdk/jdk/blob/0243da2e4adc1b7ab6fcd5b10778532101158dce/src/hotspot/share/opto/superword.cpp#L3935-L3945 > > I have tested this patch for tier 1-3 on x86-64. Good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13406#pullrequestreview-1377961189 From kvn at openjdk.org Mon Apr 10 18:53:52 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 10 Apr 2023 18:53:52 GMT Subject: RFR: 8305324: C2: Wrong execution of vectorizing Interger.reverseBytes In-Reply-To: References: Message-ID: On Mon, 10 Apr 2023 13:21:29 GMT, Daohan Qu wrote: > This patch should fix [JDK-8305324](https://bugs.openjdk.org/browse/JDK-8305324). 
> > `SuperWord::compute_vector_element_type()` implemented in `jdk/src/hotspot/share/opto/superword.cpp` propagates backward a narrower integer type when the upper bits of the value are not needed. However, `Integer.reverseBytes()` depends on higher-order bits of an integer and should be prevented from being narrowed and vectorized. Instead, it needs to be treated like `Math.abs()` (which is represented by `Op_AbsI` in the following code). > > https://github.com/openjdk/jdk/blob/0243da2e4adc1b7ab6fcd5b10778532101158dce/src/hotspot/share/opto/superword.cpp#L3935-L3945 > > I have tested this patch for tier 1-3 on x86-64. One comment. We are not using bug id in test names mostly now. Please rename it to something meaningful and place test into `compiler/vectorization/` or `compiler/loopopts/superword/` directory. ------------- PR Review: https://git.openjdk.org/jdk/pull/13406#pullrequestreview-1377965263 From qamai at openjdk.org Mon Apr 10 19:05:35 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Mon, 10 Apr 2023 19:05:35 GMT Subject: RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v6] In-Reply-To: References: Message-ID: On Mon, 10 Apr 2023 15:16:59 GMT, Jatin Bhateja wrote: >> Yes I think it is a drawback of this approach, however currently we do not support shuffling for 256-bit vectors on AVX1 machines either, and AVX1 seems to be a special case in this regard. This species of float and double may also be less common in the usage of Vector API since it is larger than SPECIES_PREFERRED. > > Hi @merykitty , Agree with you that SPECIES_PREFERRED is preferred for vector algorithms intercepting both integral and floating point vectors. 
>
> FTR, we see a perf regression with Float256 based micro now on AVX=1 targets,
>
>
> public static short micro() {
>     VectorShuffle iota = FloatVector.SPECIES_256.iotaShuffle(0, 1, true);
>     return iota.cast(ShortVector.SPECIES_128).toVector().reinterpretAsShorts().lane(1);
> }
>
> CPROMPT>javad --add-modules=jdk.incubator.vector -XX:UseAVX=1 -XX:+PrintIntrinsics -XX:CompileCommand=compileonly,shufflef::micro -cp . shufflef
> CompileCommand: compileonly shufflef.micro bool compileonly = true
> ** not supported: arity=1 op=reinterpret/1 vlen1=8 etype1=int ismask=0
> ** not supported: arity=1 op=cast/1 vlen1=8 etype1=int ismask=0
> @ 17 java.lang.Object::getClass (0 bytes) (intrinsic)
> @ 24 java.lang.Object::getClass (0 bytes) (intrinsic)
> @ 45 jdk.internal.vm.vector.VectorSupport::convert (36 bytes) failed to inline (intrinsic)
> @ 34 java.lang.Object::getClass (0 bytes) (intrinsic)
> @ 54 jdk.internal.vm.vector.VectorSupport::convert (36 bytes) failed to inline (intrinsic)
> @ 17 java.lang.Object::getClass (0 bytes) (intrinsic)
> @ 24 java.lang.Object::getClass (0 bytes) (intrinsic)
> @ 45 jdk.internal.vm.vector.VectorSupport::convert (36 bytes) (intrinsic)
> @ 292 java.lang.Object::getClass (0 bytes) (intrinsic)
> @ 298 java.lang.Object::getClass (0 bytes) (intrinsic)
> @ 322 jdk.internal.vm.vector.VectorSupport::convert (36 bytes) (intrinsic)
> @ 292 java.lang.Object::getClass (0 bytes) (intrinsic)
> @ 298 java.lang.Object::getClass (0 bytes) (intrinsic)
> @ 322 jdk.internal.vm.vector.VectorSupport::convert (36 bytes) (intrinsic)
> @ 16 jdk.internal.vm.vector.VectorSupport::extract (35 bytes) (intrinsic)
> [time] 386ms [res]3392
> CPROMPT>export JAVA_HOME=/home/jatinbha/softwares/jdk-20/
> CPROMPT>export PATH=$JAVA_HOME/bin:$PATH
> CPROMPT>javad --add-modules=jdk.incubator.vector -XX:UseAVX=1 -XX:+PrintIntrinsics -XX:CompileCommand=compileonly,shufflef::micro -cp . shufflef
> CompileCommand: compileonly shufflef.micro bool compileonly = true
> WARNING: Using incubator modules: jdk.incubator.vector
> @ 3 jdk.internal.misc.Unsafe::loadFence (5 bytes) (intrinsic)
> @ 3 jdk.internal.misc.Unsafe::loadFence (5 bytes) (intrinsic)
> @ 17 jdk.internal.vm.vector.VectorSupport::shuffleToVector (33 bytes) (intrinsic)
> @ 292 java.lang.Object::getClass (0 bytes) (intrinsic)
> @ 298 java.lang.Object::getClass (0 bytes) (intrinsic)
> @ 322 jdk.internal.vm.vector.VectorSupport::convert (36 bytes) (intrinsic)
> @ 16 jdk.internal.vm.vector.VectorSupport::extract (35 bytes) (intrinsic)
> [time] 7ms [res]3392

I see, what do you think?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/13093#discussion_r1161994748

From kvn at openjdk.org Mon Apr 10 19:49:32 2023
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Mon, 10 Apr 2023 19:49:32 GMT
Subject: RFR: 8305203: Simplify trimming operation in Region::Ideal [v2]
In-Reply-To:
References:
Message-ID:

On Mon, 3 Apr 2023 19:42:42 GMT, Xin Liu wrote:

>> This patch improves how Region::Ideal trims unreachable paths.
>>
>> 1. Don't restart from beginning. Trimming doesn't change the DU-chain.
>> 2. Replace DFIterator with DFIterator_Fast. The later is a raw pointer in release build.
>> 3. Don't call add_users_to_worklist(this) repeatly.
>> 4. Reduce its strength from add_users_to_worklist to
>> add_users_to_worklist0 because RegionNode has no special logic.
>>
>> This patch also includes a cosmetic change: rename n to 'use' inside of the loop.
>> Otherwise, we would overshadow Node* n = in(i). Nothing wrong but harder to read.
>
> Xin Liu has updated the pull request incrementally with one additional commit since the last revision:
>
>   Update coding style according to reviewer's feedback.

This looks reasonable - the code does not remove uses so we should not expect number of them is changing. May be add assert for that (check outcnt() before and after your loop).
[JDK-8284358](https://bugs.openjdk.org/browse/JDK-8284358) changed logic for `add_users_to_worklist` call. Your change for that is correct. ------------- PR Review: https://git.openjdk.org/jdk/pull/13238#pullrequestreview-1378041125 From never at openjdk.org Tue Apr 11 00:15:41 2023 From: never at openjdk.org (Tom Rodriguez) Date: Tue, 11 Apr 2023 00:15:41 GMT Subject: RFR: 8305755: [JVMCI] missing barriers in CompilerToVM.readFieldValue for Reference.referent [v2] In-Reply-To: <0B1MuPO_KSgWyfCDYo3vsX4KTXV3o0x2NENQDTBzgWI=.f638f9e0-0168-4be7-b047-2e3f08a3864f@github.com> References: <0B1MuPO_KSgWyfCDYo3vsX4KTXV3o0x2NENQDTBzgWI=.f638f9e0-0168-4be7-b047-2e3f08a3864f@github.com> Message-ID: > Add missing GC barrier for reflective read. I'm not sure the idiom I've chosen it the correct one so please correct me if there's a better way to write this. In testing, this resolved the issue. Tom Rodriguez has updated the pull request incrementally with one additional commit since the last revision: Add MO_SEQ_CST ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13389/files - new: https://git.openjdk.org/jdk/pull/13389/files/5236ddd6..3cd12b39 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13389&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13389&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/13389.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13389/head:pull/13389 PR: https://git.openjdk.org/jdk/pull/13389 From pli at openjdk.org Tue Apr 11 02:27:37 2023 From: pli at openjdk.org (Pengfei Li) Date: Tue, 11 Apr 2023 02:27:37 GMT Subject: RFR: 8305324: C2: Wrong execution of vectorizing Interger.reverseBytes In-Reply-To: References: Message-ID: On Mon, 10 Apr 2023 13:21:29 GMT, Daohan Qu wrote: > This patch should fix [JDK-8305324](https://bugs.openjdk.org/browse/JDK-8305324). 
> > `SuperWord::compute_vector_element_type()` implemented in `jdk/src/hotspot/share/opto/superword.cpp` propagates backward a narrower integer type when the upper bits of the value are not needed. However, `Integer.reverseBytes()` depends on higher-order bits of an integer and should be prevented from being narrowed and vectorized. Instead, it needs to be treated like `Math.abs()` (which is represented by `Op_AbsI` in the following code). > > https://github.com/openjdk/jdk/blob/0243da2e4adc1b7ab6fcd5b10778532101158dce/src/hotspot/share/opto/superword.cpp#L3935-L3945 > > I have tested this patch for tier 1-3 on x86-64. src/hotspot/share/opto/superword.cpp line 3941: > 3939: // RShiftI or AbsI operations, the compiler has to know the precise > 3940: // signedness info of the 1st operand. These operations shouldn't be > 3941: // vectorized if the signedness info is imprecise. Could you update the comments I wrote before? src/hotspot/share/opto/superword.cpp line 3944: > 3942: const Type* vt = vtn; > 3943: int op = in->Opcode(); > 3944: if (VectorNode::is_shift_opcode(op) || op == Op_AbsI || op == Op_ReverseBytesI) { (another suggestion) This list may be longer and longer as we vectorize more operations. Shall we move this check into a static function of `VectorNode`, like `VectorNode::requires_higher_order_bits(op)`, and put the comments inside? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13406#discussion_r1162232224 PR Review Comment: https://git.openjdk.org/jdk/pull/13406#discussion_r1162236801 From yzhu at openjdk.org Tue Apr 11 02:36:33 2023 From: yzhu at openjdk.org (Yanhong Zhu) Date: Tue, 11 Apr 2023 02:36:33 GMT Subject: RFR: 8305728: RISC-V: Use bexti instruction to do single-bit testing In-Reply-To: References: Message-ID: On Thu, 6 Apr 2023 00:52:15 GMT, Feilong Jiang wrote: > Current RISC-V port tests bit masks with `andi` instruction. 
But for those mask values not in the range of `simm12` (`andi` > only accepts sign-extended 12-bit immediate [1]), we need an extra temp register (`t0` as default for `andi`) to store the mask value [2]. > Since we now support Zbs extension of Bit-Manipulation, we have a more convenient way to test power-of-two bit > masks with the single instruction `bexti` [3] without any temp register. > > 1. https://github.com/riscv/riscv-isa-manual/blob/f6b8d5c7d2dcd935b48689a337c8f5bc2be4b5e5/src/rv32.tex#L519-L521 > 2. https://github.com/openjdk/jdk/blob/ce6e7461dc5ac56459a79e75d5de76929d1be0a3/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp#L1852-L1860 > 3. https://github.com/riscv/riscv-bitmanip/blob/main/bitmanip/insns/bexti.adoc > > Testing: > - [x] `hotspot_tier1`, `jdk_tier1` on QEMU-User w/ and w/o `UseZbs` (release build) > - [x] tier1 tests on unmatched board w/o `UseZbs` (release build) Looks good. ------------- Marked as reviewed by yzhu (Author). PR Review: https://git.openjdk.org/jdk/pull/13368#pullrequestreview-1378366066 From fjiang at openjdk.org Tue Apr 11 02:58:40 2023 From: fjiang at openjdk.org (Feilong Jiang) Date: Tue, 11 Apr 2023 02:58:40 GMT Subject: RFR: 8305728: RISC-V: Use bexti instruction to do single-bit testing In-Reply-To: References: Message-ID: On Thu, 6 Apr 2023 00:52:15 GMT, Feilong Jiang wrote: > Current RISC-V port tests bit masks with `andi` instruction. But for those mask values not in the range of `simm12` (`andi` > only accepts sign-extended 12-bit immediate [1]), we need an extra temp register (`t0` as default for `andi`) to store the mask value [2]. > Since we now support Zbs extension of Bit-Manipulation, we have a more convenient way to test power-of-two bit > masks with the single instruction `bexti` [3] without any temp register. > > 1. https://github.com/riscv/riscv-isa-manual/blob/f6b8d5c7d2dcd935b48689a337c8f5bc2be4b5e5/src/rv32.tex#L519-L521 > 2. 
https://github.com/openjdk/jdk/blob/ce6e7461dc5ac56459a79e75d5de76929d1be0a3/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp#L1852-L1860
> 3. https://github.com/riscv/riscv-bitmanip/blob/main/bitmanip/insns/bexti.adoc
>
> Testing:
> - [x] `hotspot_tier1`, `jdk_tier1` on QEMU-User w/ and w/o `UseZbs` (release build)
> - [x] tier1-3 tests on unmatched board w/o `UseZbs` (release build)

hotspot/jdk tier2-3 are also good w/ `UseZbs` on QEMU-User with release build.

@yhzhu20 @RealFYang -- Thank you all for the review!

-------------

PR Comment: https://git.openjdk.org/jdk/pull/13368#issuecomment-1502617836

From fyang at openjdk.org  Tue Apr 11 02:59:39 2023
From: fyang at openjdk.org (Fei Yang)
Date: Tue, 11 Apr 2023 02:59:39 GMT
Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v14]
In-Reply-To: <7-vzGGq80CYZvbB9N5qrn5mrhVhxxh97oo7AkgYRn1k=.8b9fe40f-d858-43d0-bdb8-4b050009876f@github.com>
References: <7-vzGGq80CYZvbB9N5qrn5mrhVhxxh97oo7AkgYRn1k=.8b9fe40f-d858-43d0-bdb8-4b050009876f@github.com>
Message-ID: <6xhQpQfUt6G3C7GES9UhRMCSFyfNGIF1ANCQ1u_xcQA=.acc3520f-df68-494d-ab30-c4af7334c8b8@github.com>

On Fri, 7 Apr 2023 12:16:44 GMT, Dingli Zhang wrote:

>> HI,
>>
>> We have added support for vector add mask instructions, please take a look and have some reviews. Thanks a lot!
>> This patch will add support of vector add/sub/mul/div mask version. It was implemented by referring to RVV v1.0 [1].
>>
>> ## Load/Store/Cmp Mask
>> `VectorLoadMask, VectorMaskCmp, VectorStoreMask` will implement the mask datapath. We can see where the data is passed in the compilation log with `jdk/incubator/vector/Byte128VectorTests.java`:
>>
>>     218 loadV V1, [R7] # vector (rvv)
>>     220 vloadmask V0, V1
>>     ...
>>     23c vmaskcmp_rvv_masked V0, V4, V5, V0, V1, #0
>>     24c vstoremask V1, V0
>>     258 storeV [R7], V1 # vector (rvv)
>>
>>
>> The corresponding generated jit assembly:
>> >> # loadV >> 0x000000400c8ef958: vsetvli t0,zero,e8,m1,tu,mu >> 0x000000400c8ef95c: vle8.v v1,(t2) >> >> # vloadmask >> 0x000000400c8ef960: vsetvli t0,zero,e8,m1,tu, >> 0x000000400c8ef964: vmsne.vx v0,v1,zero >> >> # vmaskcmp_rvv_masked >> 0x000000400c8ef97c: vsetvli t0,zero,e8,m1,tu,mu >> 0x000000400c8ef980: vmclr.m v1 >> 0x000000400c8ef984: vmseq.vv v1,v4,v5,v0.t >> 0x000000400c8ef988: vmv1r.v v0,v1 >> >> # vstoremask >> 0x000000400c8ef98c: vsetvli t0,zero,e8,m1,tu,mu >> 0x000000400c8ef990: vmv.v.x v1,zero >> 0x000000400c8ef994: vmerge.vim v1,v1,1,v0 >> >> >> ## Masked vector arithmetic instructions (e.g. vadd) >> AddMaskTestMerge case: >> >> import jdk.incubator.vector.IntVector; >> import jdk.incubator.vector.VectorMask; >> import jdk.incubator.vector.VectorOperators; >> import jdk.incubator.vector.VectorSpecies; >> >> public class AddMaskTestMerge { >> >> static final VectorSpecies SPECIES = IntVector.SPECIES_128; >> static final int SIZE = 1024; >> static int[] a = new int[SIZE]; >> static int[] b = new int[SIZE]; >> static int[] r = new int[SIZE]; >> static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false}; >> static { >> for (int i = 0; i < SIZE; i++) { >> a[i] = i; >> b[i] = i; >> } >> } >> >> static void workload(int idx) { >> VectorMask vmask = VectorMask.fromArray(SPECIES, c, 0); >> IntVector av = IntVector.fromArray(SPECIES, a, idx); >> IntVector bv = IntVector.fromArray(SPECIES, b, idx); >> av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx); >> } >> >> public static void main(String[] args) { >> for (int i = 0; i < 30_0000; i++) { >> for (int j = 0; j < SIZE; j += SPECIES.length()) { >> workload(j); >> } >> } >> } >> } >> >> >> This test case is reduced from existing jtreg vector tests Int128VectorTests.java[2]. This test case corresponds to the add instruction of the vector mask version and other instructions are similar. >> >> Before this patch, the compilation log will not print RVV-related instructions. 
Now the compilation log is as follows: >> >> >> 0ae B10: # out( B25 B11 ) <- in( B9 ) Freq: 0.999991 >> 0ae loadV V1, [R31] # vector (rvv) >> 0b6 vloadmask V0, V2 >> 0be vadd.vv V3, V1, V0 #@vaddI_masked >> 0c6 lwu R28, [R7, #124] # loadN, compressed ptr, #@loadN ! Field: AddMaskTestMerge.r >> 0ca decode_heap_oop R28, R28 #@decodeHeapOop >> 0cc lwu R7, [R28, #12] # range, #@loadRange >> 0d0 NullCheck R28 >> >> >> And the jit code is as follows: >> >> >> 0x000000400c823cee: vsetvli t0,zero,e32,m1,tu,mu >> 0x000000400c823cf2: vle32.v v1,(t6) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0} >> ; - jdk.incubator.vector.IntVector::intoArray at 43 (line 3228) >> ; - AddMaskTestMerge::workload at 46 (line 25) >> 0x000000400c823cf6: vsetvli t0,zero,e8,m1,tu,mu >> 0x000000400c823cfa: vmsne.vx v0,v2,zero ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0} >> ; - jdk.incubator.vector.VectorMask::fromArray at 47 (line 208) >> ; - AddMaskTestMerge::workload at 7 (line 22) >> 0x000000400c823cfe: vsetvli t0,zero,e32,m1,tu,mu >> 0x000000400c823d02: vadd.vv v3,v3,v1,v0.t ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0} >> ; - jdk.incubator.vector.IntVector::lanewiseTemplate at 192 (line 834) >> ; - jdk.incubator.vector.Int128Vector::lanewise at 9 (line 291) >> ; - jdk.incubator.vector.Int128Vector::lanewise at 4 (line 41) >> ; - AddMaskTestMerge::workload at 39 (line 25) >> >> >> ## Mask register allocation & mask bit opreation >> Since v0 is to be used as a mask register in spec[1], sometimes we need two vmask to do the vector mask logical ops like `AndVMask, OrVMask, XorVMask`. And if only v0 and v31 mask registers are defined, the corresponding c2 nodes will not be generated correctly because of the register pressure[3]. >> When we use only v0 and v31 as mask registers, jtreg testing of Byte128VectorTests.java[4] with `-XX:+PrintAssembly` and `-XX:LogFile` will not emit the expected rvv mask instruction. 
As opposed to this, the following compilation failure log[3] is generated:
>>
>>
>>
>>
>>
>>
>>
>> So we define v30 and v31 as mask registers too, and `AndVMask` will emit C2 JIT code like:
>>
>>     vloadmask V0, V1
>>     vloadmask V30, V2
>>     vmask_and V0, V30, V0
>>
>> We also modified the implementation of `spill_copy_vector_stack_to_stack` so that it no longer occupies the v0 register. In addition, we changed some nodes such as `vasr/vlsl/vlsr/vstring_x/varray_x/vclearArray_x`, which use v0 internally, so that C2 is aware that they use v0.
>>
>> By the way, the current implementation of `VectorMaskCast` only covers the case where the parameter data has equal width; other cases depend on the subsequent cast node.
>>
>> [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc
>> [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java
>> [3] https://github.com/openjdk/jdk/blob/0deb648985b018653ccdaf193dc13b3cf21c088a/src/hotspot/share/opto/chaitin.cpp#L526
>> [4] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Byte128VectorTests.java
>>
>> ### Testing:
>>
>> qemu with UseRVV:
>> - [x] Tier1 tests (release)
>> - [x] Tier2 tests (release)
>> - [ ] Tier3 tests (release)
>> - [x] test/jdk/jdk/incubator/vector (release/fastdebug)
>
> Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision:
>
>   Remove unneeded combination nodes

Hi, I would suggest adding support for nodes like LoadVectorMasked, StoreVectorMaskedSome at the same time. Some initial comments after a brief look.
src/hotspot/cpu/riscv/riscv_v.ad line 162: > 160: instruct vmaskcmp(vRegMask dst, vReg src1, vReg src2, immI cond) %{ > 161: match(Set dst (VectorMaskCmp (Binary src1 src2) cond)); > 162: format %{ "vmaskcmp_rvv $dst, $src1, $src2, $cond" %} Suggestion: s/vmaskcmp_rvv/vmaskcmp/ src/hotspot/cpu/riscv/riscv_v.ad line 175: > 173: match(Set dst (VectorMaskCmp (Binary src1 src2) (Binary cond vmask))); > 174: effect(TEMP tmp); > 175: format %{ "vmaskcmp_rvv_masked $dst, $src1, $src2, $vmask, $tmp, $cond" %} Suggestion: s/vmaskcmp_rvv_masked/vmaskcmp_masked/ src/hotspot/cpu/riscv/riscv_v.ad line 2561: > 2559: %} > 2560: > 2561: instruct vmaskcast_same_esize_rvv(vRegMask dst_src) %{ Suggestion: s/vmaskcast_same_esize_rvv/vmaskcast_same_esize/ src/hotspot/cpu/riscv/riscv_v.ad line 2562: > 2560: > 2561: instruct vmaskcast_same_esize_rvv(vRegMask dst_src) %{ > 2562: predicate(Matcher::vector_length_in_bytes(n) == Matcher::vector_length_in_bytes(n->in(1))); Can we add support for other cases for VectorMaskCast at the same time? src/hotspot/cpu/riscv/riscv_v.ad line 2565: > 2563: match(Set dst_src (VectorMaskCast dst_src)); > 2564: ins_cost(0); > 2565: format %{ "vmaskcast_same_esize_rvv $dst_src\t# do nothing" %} Suggestion: s/vmaskcast_same_esize_rvv/vmaskcast_same_esize/ ------------- Changes requested by fyang (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/12682#pullrequestreview-1307347158
PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1162247035
PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1162247224
PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1162247490
PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1162248169
PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1162247571

From fyang at openjdk.org  Tue Apr 11 02:59:42 2023
From: fyang at openjdk.org (Fei Yang)
Date: Tue, 11 Apr 2023 02:59:42 GMT
Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v2]
In-Reply-To:
References:
Message-ID:

On Tue, 21 Feb 2023 09:40:55 GMT, Dingli Zhang wrote:

>> HI,
>>
>> We have added support for vector add mask instructions, please take a look and have some reviews. Thanks a lot!
>> This patch will add support of vector add/sub/mul/div mask version. It was implemented by referring to RVV v1.0 [1].
>>
>> ## Load/Store/Cmp Mask
>> `VectorLoadMask, VectorMaskCmp, VectorStoreMask` will implement the mask datapath. We can see where the data is passed in the compilation log with `jdk/incubator/vector/Byte128VectorTests.java`:
>>
>>     218 loadV V1, [R7] # vector (rvv)
>>     220 vloadmask V0, V1
>>     ...
>>     23c vmaskcmp_rvv_masked V0, V4, V5, V0, V1, #0
>>     24c vstoremask V1, V0
>>     258 storeV [R7], V1 # vector (rvv)
>>
>>
>> The corresponding generated jit assembly:
>> >> # loadV >> 0x000000400c8ef958: vsetvli t0,zero,e8,m1,tu,mu >> 0x000000400c8ef95c: vle8.v v1,(t2) >> >> # vloadmask >> 0x000000400c8ef960: vsetvli t0,zero,e8,m1,tu, >> 0x000000400c8ef964: vmsne.vx v0,v1,zero >> >> # vmaskcmp_rvv_masked >> 0x000000400c8ef97c: vsetvli t0,zero,e8,m1,tu,mu >> 0x000000400c8ef980: vmclr.m v1 >> 0x000000400c8ef984: vmseq.vv v1,v4,v5,v0.t >> 0x000000400c8ef988: vmv1r.v v0,v1 >> >> # vstoremask >> 0x000000400c8ef98c: vsetvli t0,zero,e8,m1,tu,mu >> 0x000000400c8ef990: vmv.v.x v1,zero >> 0x000000400c8ef994: vmerge.vim v1,v1,1,v0 >> >> >> ## Masked vector arithmetic instructions (e.g. vadd) >> AddMaskTestMerge case: >> >> import jdk.incubator.vector.IntVector; >> import jdk.incubator.vector.VectorMask; >> import jdk.incubator.vector.VectorOperators; >> import jdk.incubator.vector.VectorSpecies; >> >> public class AddMaskTestMerge { >> >> static final VectorSpecies SPECIES = IntVector.SPECIES_128; >> static final int SIZE = 1024; >> static int[] a = new int[SIZE]; >> static int[] b = new int[SIZE]; >> static int[] r = new int[SIZE]; >> static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false}; >> static { >> for (int i = 0; i < SIZE; i++) { >> a[i] = i; >> b[i] = i; >> } >> } >> >> static void workload(int idx) { >> VectorMask vmask = VectorMask.fromArray(SPECIES, c, 0); >> IntVector av = IntVector.fromArray(SPECIES, a, idx); >> IntVector bv = IntVector.fromArray(SPECIES, b, idx); >> av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx); >> } >> >> public static void main(String[] args) { >> for (int i = 0; i < 30_0000; i++) { >> for (int j = 0; j < SIZE; j += SPECIES.length()) { >> workload(j); >> } >> } >> } >> } >> >> >> This test case is reduced from existing jtreg vector tests Int128VectorTests.java[2]. This test case corresponds to the add instruction of the vector mask version and other instructions are similar. >> >> Before this patch, the compilation log will not print RVV-related instructions. 
Now the compilation log is as follows: >> >> >> 0ae B10: # out( B25 B11 ) <- in( B9 ) Freq: 0.999991 >> 0ae loadV V1, [R31] # vector (rvv) >> 0b6 vloadmask V0, V2 >> 0be vadd.vv V3, V1, V0 #@vaddI_masked >> 0c6 lwu R28, [R7, #124] # loadN, compressed ptr, #@loadN ! Field: AddMaskTestMerge.r >> 0ca decode_heap_oop R28, R28 #@decodeHeapOop >> 0cc lwu R7, [R28, #12] # range, #@loadRange >> 0d0 NullCheck R28 >> >> >> And the jit code is as follows: >> >> >> 0x000000400c823cee: vsetvli t0,zero,e32,m1,tu,mu >> 0x000000400c823cf2: vle32.v v1,(t6) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0} >> ; - jdk.incubator.vector.IntVector::intoArray at 43 (line 3228) >> ; - AddMaskTestMerge::workload at 46 (line 25) >> 0x000000400c823cf6: vsetvli t0,zero,e8,m1,tu,mu >> 0x000000400c823cfa: vmsne.vx v0,v2,zero ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0} >> ; - jdk.incubator.vector.VectorMask::fromArray at 47 (line 208) >> ; - AddMaskTestMerge::workload at 7 (line 22) >> 0x000000400c823cfe: vsetvli t0,zero,e32,m1,tu,mu >> 0x000000400c823d02: vadd.vv v3,v3,v1,v0.t ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0} >> ; - jdk.incubator.vector.IntVector::lanewiseTemplate at 192 (line 834) >> ; - jdk.incubator.vector.Int128Vector::lanewise at 9 (line 291) >> ; - jdk.incubator.vector.Int128Vector::lanewise at 4 (line 41) >> ; - AddMaskTestMerge::workload at 39 (line 25) >> >> >> ## Mask register allocation & mask bit opreation >> Since v0 is to be used as a mask register in spec[1], sometimes we need two vmask to do the vector mask logical ops like `AndVMask, OrVMask, XorVMask`. And if only v0 and v31 mask registers are defined, the corresponding c2 nodes will not be generated correctly because of the register pressure[3]. >> When we use only v0 and v31 as mask registers, jtreg testing of Byte128VectorTests.java[4] with `-XX:+PrintAssembly` and `-XX:LogFile` will not emit the expected rvv mask instruction. 
As opposed to this, the following compilation failure log[3] is generated: >> >> >> >> >> >> >> >> >> So define v30 and v31 as mask register too and `AndVMask` will emit the C2 JIT code like: >> >> vloadmask V0, V1 >> vloadmask V30, V2 >> vmask_and V0, V30, V0 >> >> We also modified the implementation of `spill_copy_vector_stack_to_stack ` so that it no longer occupies the v0 register. In addition to that, we change some node like `vasr/vlsl/vlsr/vstring_x/varray_x/vclearArray_x`, which use v0 internally, to make C2 to sense that they used v0. >> >> By the way, the current implementation of `VectorMaskCast` is for the case of equal width of the parameter data, other cases depend on the subsequent cast node. >> >> [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc >> [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java >> [3] https://github.com/openjdk/jdk/blob/0deb648985b018653ccdaf193dc13b3cf21c088a/src/hotspot/share/opto/chaitin.cpp#L526 >> [4] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Byte128VectorTests.java >> >> ### Testing: >> >> qemu with UseRVV: >> - [x] Tier1 tests (release) >> - [x] Tier2 tests (release) >> - [ ] Tier3 tests (release) >> - [x] test/jdk/jdk/incubator/vector (release/fastdebug) > > Dingli Zhang has refreshed the contents of this pull request, and previous commits have been removed. Incremental views are not available. src/hotspot/cpu/riscv/riscv.ad line 3540: > 3538: %} > 3539: > 3540: operand pRegGov() I think it's better to rename "pRegGov" into something like "vReg_V0". 
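
For readers following the quoted `AddMaskTestMerge` case, here is a scalar sketch of what the masked lanewise add computes. It assumes the merge behaviour of the Vector API's masked `lanewise`, where lanes with an unset mask bit keep the value of the first operand; the class and method names are hypothetical and are not part of the patch.

```java
// Scalar reference for a masked lanewise add with merge semantics.
// Assumption: lanes whose mask bit is unset keep the first operand's value,
// as with av.lanewise(VectorOperators.ADD, bv, vmask) in the quoted test.
final class MaskedAddSketch {

    // Processes `lanes` consecutive elements starting at `idx`.
    static void maskedAdd(int[] a, int[] b, boolean[] mask, int[] r, int idx, int lanes) {
        for (int i = 0; i < lanes; i++) {
            int x = a[idx + i];
            r[idx + i] = mask[i] ? x + b[idx + i] : x;
        }
    }

    public static void main(String[] args) {
        int[] a = {0, 1, 2, 3};
        int[] b = {0, 1, 2, 3};
        int[] r = new int[4];
        boolean[] c = {true, false, true, false};
        maskedAdd(a, b, c, r, 0, 4);
        System.out.println(java.util.Arrays.toString(r)); // [0, 1, 4, 3]
    }
}
```

With a 128-bit int species there are four lanes, so only the first four entries of the boolean array `c` in the quoted test take part in the mask.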
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1113009360 From fjiang at openjdk.org Tue Apr 11 03:03:52 2023 From: fjiang at openjdk.org (Feilong Jiang) Date: Tue, 11 Apr 2023 03:03:52 GMT Subject: Integrated: 8305728: RISC-V: Use bexti instruction to do single-bit testing In-Reply-To: References: Message-ID: <_8NWAHG3MNTG80W0804YRn0jhrM2PV1_y9Q3zHJNeu0=.cf738c1e-1c7a-435d-9b0f-fffd936db0af@github.com> On Thu, 6 Apr 2023 00:52:15 GMT, Feilong Jiang wrote: > Current RISC-V port tests bit masks with `andi` instruction. But for those mask values not in the range of `simm12` (`andi` > only accepts sign-extended 12-bit immediate [1]), we need an extra temp register (`t0` as default for `andi`) to store the mask value [2]. > Since we now support Zbs extension of Bit-Manipulation, we have a more convenient way to test power-of-two bit > masks with the single instruction `bexti` [3] without any temp register. > > 1. https://github.com/riscv/riscv-isa-manual/blob/f6b8d5c7d2dcd935b48689a337c8f5bc2be4b5e5/src/rv32.tex#L519-L521 > 2. https://github.com/openjdk/jdk/blob/ce6e7461dc5ac56459a79e75d5de76929d1be0a3/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp#L1852-L1860 > 3. https://github.com/riscv/riscv-bitmanip/blob/main/bitmanip/insns/bexti.adoc > > Testing: > - [x] `hotspot_tier1`, `jdk_tier1` on QEMU-User w/ and w/o `UseZbs` (release build) > - [x] tier1-3 tests on unmatched board w/o `UseZbs` (release build) This pull request has now been integrated. 
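
To make the motivation in the quoted description concrete, the following is a small Java model of the two ways to test a single bit. It is only a sketch: `fitsSimm12` models the 12-bit sign-extended immediate range of `andi`, and `bext` models the documented semantics of `bexti` (shift right by the bit index, keep bit 0). The class and method names are hypothetical and do not correspond to anything in `macroAssembler_riscv.cpp`.

```java
// Model of single-bit testing: mask-and-compare versus bit extraction.
// All names here are illustrative assumptions, not HotSpot code.
final class SingleBitTestSketch {

    // andi only accepts a sign-extended 12-bit immediate: [-2048, 2047].
    static boolean fitsSimm12(long imm) {
        return imm >= -2048 && imm <= 2047;
    }

    // andi-style test: needs the mask as an immediate or in a temp register.
    static boolean testWithMask(long x, long mask) {
        return (x & mask) != 0;
    }

    // bexti-style test: extracts bit `bit` of x, yielding 0 or 1.
    static long bext(long x, int bit) {
        return (x >>> bit) & 1L;
    }

    public static void main(String[] args) {
        long mask = 1L << 11;                         // 2048, just outside simm12
        System.out.println(fitsSimm12(mask));         // false: andi would need a temp register
        System.out.println(testWithMask(0x800L, mask)); // true
        System.out.println(bext(0x800L, 11));         // 1
    }
}
```

Single-bit masks for bits 0 through 10 still fit in `simm12`; from bit 11 upward the mask no longer fits, which is where a bit-index form avoids materializing the mask.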
Changeset: 13751302
Author:    Feilong Jiang
Committer: Fei Yang
URL:       https://git.openjdk.org/jdk/commit/137513025dad06fc08818fa832edb4a487298f81
Stats:     86 lines in 15 files changed: 12 ins; 0 del; 74 mod

8305728: RISC-V: Use bexti instruction to do single-bit testing

Reviewed-by: fyang, yzhu

-------------

PR: https://git.openjdk.org/jdk/pull/13368

From qamai at openjdk.org  Tue Apr 11 04:13:32 2023
From: qamai at openjdk.org (Quan Anh Mai)
Date: Tue, 11 Apr 2023 04:13:32 GMT
Subject: RFR: 8051725: Questionable if-conversion involving SETNE
In-Reply-To:
References:
Message-ID:

On Wed, 5 Apr 2023 04:52:14 GMT, Jasmine Karthikeyan wrote:

> Hi, I've created optimizations for the x86 lowering of `Conv2B` nodes, when followed immediately by an xor of 1. This pattern is fairly common, and can arise from both [cmov idealization](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/movenode.cpp#L241) and [diamond-phi optimization](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L1571). The optimization here is using the `sete` instruction instead of always using `setne` and flipping the bit with xor afterwards. According to the Intel optimization guide (pages 3-26 and 3-27), this sequence is preferred over `cmp $0, %src` as it avoids the need to encode the constant in the assembly sequence. A similar rule exists in the PPC backend, here: https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/ppc/ppc.ad#L10462. I've attached some performance testing, but I think the real-world improvements will be less significant; the motivation is primarily to decrease the number of instructions that are generated, as that can help in cases where applications are I-Cache bound.
>
>                                              Baseline                     Patch                Improvement
> Benchmark                      Mode  Cnt   Score    Error  Units      Score    Error  Units
> Conv2BRules.testEquals0        avgt   10  47.566 ±  0.346  ns/op  /  37.904 ±  1.856  ns/op    + 22.6%
> Conv2BRules.testNotEquals0     avgt   10  37.167 ±  0.211  ns/op  /  37.352 ±  1.529  ns/op    (unchanged)
> Conv2BRules.testEquals1        avgt   10  35.059 ±  0.280  ns/op  /  34.847 ±  0.160  ns/op    (unchanged)
> Conv2BRules.testEqualsNull     avgt   10  56.768 ±  2.600  ns/op  /  46.916 ±  0.308  ns/op    + 19.0%
> Conv2BRules.testNotEqualsNull  avgt   10  47.447 ±  1.193  ns/op  /  46.974 ±  0.218  ns/op    (unchanged)
>
> This change also cleans up some code relating to `Assembler::set_byte_if_not_zero`, as that function duplicates behavior with `Assembler::setne`. The 32-bit only version of that method is never called, as the only other usage is in the C1 LIR assembler, which is also guarded behind a 64-bit check, so I opted to remove it entirely and replace usages with `Assembler::setne`. Reviews would be greatly appreciated!
>
> Testing: tier1-2 on linux x64, GHA

I think this should be done in the middle-end instead. May I ask what advantages `Conv2B` has over `CMove` that require keeping it all the way to matching?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/13345#issuecomment-1502661307

From thartmann at openjdk.org  Tue Apr 11 06:06:31 2023
From: thartmann at openjdk.org (Tobias Hartmann)
Date: Tue, 11 Apr 2023 06:06:31 GMT
Subject: RFR: 8305740: C2: add print statements to assert: Can't determine return type.
In-Reply-To:
References:
Message-ID:

On Fri, 7 Apr 2023 09:08:55 GMT, Emanuel Peter wrote:

> I added this assert before the bailout, because it probably hides bugs. Now we have failure reports with [JDK-8305185](https://bugs.openjdk.org/browse/JDK-8305185).
>
> It is difficult to reproduce, so I'd like to add some print statements to get at least a bit of info.
>
> Passed tests up to tier5 and stress testing.

Looks good to me too.

-------------

Marked as reviewed by thartmann (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/13385#pullrequestreview-1378502703 From chagedorn at openjdk.org Tue Apr 11 06:50:41 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 11 Apr 2023 06:50:41 GMT Subject: RFR: 8303466: C2: failed: malformed control flow. Limit type made precise with MaxL/MinL In-Reply-To: References: Message-ID: On Fri, 31 Mar 2023 12:44:17 GMT, Emanuel Peter wrote: > **Context** > > During `PhaseIdealLoop::do_unroll`, we hack the loop-limit, and subtract `stride` from it. We have to prevent underflow on that subtract. Currently, we do this with a `CMoveI`. The problem with this: `CMoveI` is not smart enough to generate a precise type. For example, there are many cases where the input types get better, and underflow is not possible anymore. But the `CMoveI` does not detect this, and still has type `min_jint..hi`. > > We have the same issue in `PhaseIdealLoop::adjust_limit`, where we use `CMoveL` to implement long max/min. The types are not as precise as they could and should be. > > **Problem** > > The imprecise type is used for the zero-trip-guard. It does not fold to false, even though the data-path into the post loop does constant fold to `TOP`. The graph breaks, and assert `malformed control flow` triggers. > > Details: In these cases, we have the super-unrolled main-loop (SuperWord'ed, then further unrolled) directly leading to a vectorized post-loop. The effect is that there is no `region/phi` merging main-exit and main-zero-trip-guard. So the types are already more narrow here. It may be possible that the values are such that we find out that we should never enter the vectorized post-loop. But if data finds out and control does not, we get a broken graph. > Note: we have pre-loop. Then a main-loop and vectorized post loop. Then we merge the main-zero-trip-guard. And at the end we have the scalar post loop. > > I have already recently fixed a bug around this `CMoveI`. 
https://github.com/openjdk/jdk/commit/5a4945c0d95423d0ab07762c915e9cb4d3c66abb I would now like to have a more satisfactory fix that properly propagates the types.
>
> **Solution**
>
> `PhaseIdealLoop::adjust_limit` already converts the limit from int to long, and does all computations in long, including taking max/min with a `CMoveL`. I now use the so far unused `MaxL/MinL`. I implemented some missing `Value/Identity` components for it. Since `MaxL/MinL` is not implemented in the backend, I just expand it in macro-expansion to a `CMoveL`. At that point the loop-opts are over, and it is most likely ok that we do not make the types more precise after this.
>
> I take the same approach for `PhaseIdealLoop::do_unroll`: convert limits to long, do subtraction in long, take `MinL/MaxL` to clamp it to the int-range (prevent subtraction underflow).
>
> **Discussion**
>
> This solution seems much cleaner to me, and I hope that we will see fewer bugs because of imprecise types in the limit computation, which were often due to the `CMove` not being smart enough to analyze all inputs (it would have to recognize a multitude of patterns, for the Cmp inputs and the direct inputs to the CMove - we currently do not do that, but just take the union of the input types - this is very imprecise).
>
> There is a bit of an overhead here: We use longs even though we only want to have int values. But I think we should prefer a clean implementation here, with correct type computation. The performance impact is probably non-existent on 64-bit machines anyway.
>
> **Caveat**
>
> I found some cases with the same assert `malformed control flow` that are most likely skeleton/assertion predicate bugs [JDK-8288981](https://bugs.openjdk.org/browse/JDK-8288981). Some of those cases were new patterns, for example where we PreMainPost a main loop.
>
> I hope that this fix here at least reduces the frequency of failures significantly.
>
> **Testing**
>
> I added 2 regression tests.
Our fuzzer seems to spit out examples regularly, so that gives us extra coverage. > > Tested up to `tier5` and stress testing. Performance testing **running...** > > **Future Work** > > We should implement `MaxL/MinL` in the backend. We should also use them during parsing. This would also allow to `SuperWord` the instruction, on the platforms that support it. > > Should we add such an assert during IGVN? I think after IGVN, we should never have a `MultiBranchNode` that does not have the required number of outputs, right? We could add it to `VerifyIterativeGVN`. That looks good to me and indeed much cleaner! I only have some minor code style comments. > There is a bit of an overhead here: We use longs even though we only want to have int values. But I think we should prefer a clean implementation here, with correct type computation. The performance impact is probably non-existent on 64-bit machines anyway. I agree with that. > I have already recently fixed a bug around this CMoveI. https://github.com/openjdk/jdk/commit/5a4945c0d95423d0ab07762c915e9cb4d3c66abb I would now like to have a more satisfactory fix, that properly propagates the types. I've had a feeling that we are revisiting this again at some point. > Should we add such an assert during IGVN? I think after IGVN, we should never have a MultiBranchNode that does not have the required number of outputs, right? We could add it to VerifyIterativeGVN. That would be a good idea to investigate in an RFE - maybe also for other nodes to assert on well-known input/output patterns. We've had such problems before after CCP with `If` nodes with only one out projection. 
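
As a rough illustration of why a dedicated max/min node can carry a tighter type than a conditional move, here is a plain-Java interval sketch. It models a long type as a `[lo, hi]` range and computes the range of `max` and `min` component-wise, in the spirit of the `TypeLong::make(MAX2(...), MAX2(...), ...)` lines under review. The class is hypothetical, ignores the `_widen` component, and is not C2 code.

```java
// Interval model of long types: each value is known to lie in [lo, hi].
// Hypothetical sketch; C2's TypeLong additionally tracks a widen counter.
final class LongRangeSketch {
    final long lo;
    final long hi;

    LongRangeSketch(long lo, long hi) {
        this.lo = lo;
        this.hi = hi;
    }

    // Range of max(a, b): both bounds are the max of the inputs' bounds.
    static LongRangeSketch max(LongRangeSketch a, LongRangeSketch b) {
        return new LongRangeSketch(Math.max(a.lo, b.lo), Math.max(a.hi, b.hi));
    }

    // Range of min(a, b): both bounds are the min of the inputs' bounds.
    static LongRangeSketch min(LongRangeSketch a, LongRangeSketch b) {
        return new LongRangeSketch(Math.min(a.lo, b.lo), Math.min(a.hi, b.hi));
    }

    // What a conservative "union of the inputs" would give instead.
    static LongRangeSketch union(LongRangeSketch a, LongRangeSketch b) {
        return new LongRangeSketch(Math.min(a.lo, b.lo), Math.max(a.hi, b.hi));
    }

    public static void main(String[] args) {
        LongRangeSketch x = new LongRangeSketch(-10, 100);
        LongRangeSketch zero = new LongRangeSketch(0, 0);
        LongRangeSketch m = max(x, zero);
        LongRangeSketch u = union(x, zero);
        System.out.println("max:   [" + m.lo + ", " + m.hi + "]"); // max:   [0, 100]
        System.out.println("union: [" + u.lo + ", " + u.hi + "]"); // union: [-10, 100]
    }
}
```

The union keeps the lower bound of -10 even though `max(x, 0)` can never be negative, which is the kind of imprecision the PR description attributes to the `CMove`-based clamping.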
src/hotspot/share/opto/addnode.cpp line 1265: > 1263: const TypeLong* r1 = t1->is_long(); > 1264: > 1265: return TypeLong::make( MAX2(r0->_lo,r1->_lo), MAX2(r0->_hi,r1->_hi), MAX2(r0->_widen,r1->_widen) ); Suggestion: return TypeLong::make(MAX2(r0->_lo, r1->_lo), MAX2(r0->_hi, r1->_hi), MAX2(r0->_widen, r1->_widen)); src/hotspot/share/opto/addnode.cpp line 1286: > 1284: const TypeLong* r1 = t1->is_long(); > 1285: > 1286: return TypeLong::make( MIN2(r0->_lo,r1->_lo), MIN2(r0->_hi,r1->_hi), MIN2(r0->_widen,r1->_widen) ); Suggestion: return TypeLong::make(MIN2(r0->_lo, r1->_lo), MIN2(r0->_hi, r1->_hi), MIN2(r0->_widen, r1->_widen)); src/hotspot/share/opto/addnode.hpp line 325: > 323: class MaxLNode : public MaxNode { > 324: public: > 325: MaxLNode(Compile* C, Node *in1, Node *in2) : MaxNode(in1, in2) { Suggestion: MaxLNode(Compile* C, Node* in1, Node* in2) : MaxNode(in1, in2) { src/hotspot/share/opto/addnode.hpp line 330: > 328: } > 329: virtual int Opcode() const; > 330: virtual const Type* add_ring(const Type*, const Type*) const; Suggestion: virtual const Type* add_ring(const Type* t0, const Type* t1) const; src/hotspot/share/opto/addnode.hpp line 343: > 341: class MinLNode : public MaxNode { > 342: public: > 343: MinLNode(Compile* C, Node *in1, Node *in2) : MaxNode(in1, in2) { Suggestion: MinLNode(Compile* C, Node* in1, Node* in2) : MaxNode(in1, in2) { src/hotspot/share/opto/addnode.hpp line 348: > 346: } > 347: virtual int Opcode() const; > 348: virtual const Type* add_ring(const Type*, const Type*) const; Suggestion: virtual const Type* add_ring(const Type* t0, const Type* t1) const; ------------- Marked as reviewed by chagedorn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/13269#pullrequestreview-1368479697 PR Review Comment: https://git.openjdk.org/jdk/pull/13269#discussion_r1162351912 PR Review Comment: https://git.openjdk.org/jdk/pull/13269#discussion_r1162352744 PR Review Comment: https://git.openjdk.org/jdk/pull/13269#discussion_r1162353139 PR Review Comment: https://git.openjdk.org/jdk/pull/13269#discussion_r1162353467 PR Review Comment: https://git.openjdk.org/jdk/pull/13269#discussion_r1162353721 PR Review Comment: https://git.openjdk.org/jdk/pull/13269#discussion_r1162353830 From stuefe at openjdk.org Tue Apr 11 06:55:30 2023 From: stuefe at openjdk.org (Thomas Stuefe) Date: Tue, 11 Apr 2023 06:55:30 GMT Subject: RFR: JDK-8305782: Provide MacroAssembler::breakpoint on aarch64 [v2] In-Reply-To: References: Message-ID: > The ability to emit debug traps was useful for me on arm, and I miss it on aarch64. > > Tested manually on Linux aarch64 in gdb with various values for hint covering the whole 16-bit range set. Hint gets encoded in the instruction (gdb decodes instruction as "BRK xxx" with xxx being the hint). According to documentation the hint ends up in ESR.ELx.ISS after the trap hit, but gdb refused to display the ESR register, so I could not verify that. 
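
As a small aid for checking what a debugger shows, the sketch below models how a 16-bit hint sits inside the instruction word. It assumes the standard A64 `BRK #imm16` encoding (base word `0xD4200000`, immediate in bits 20:5); the class and method names are hypothetical and unrelated to HotSpot's `Assembler::brk`.

```java
// Model of the A64 BRK #imm16 instruction word.
// Assumption: base encoding 0xD4200000 with the immediate in bits 20:5.
final class BrkEncodingSketch {
    static final int BRK_BASE = 0xD4200000;

    // Builds the 32-bit instruction word for a given 16-bit hint.
    static int encode(int hint) {
        if (hint < 0 || hint > 0xFFFF) {
            throw new IllegalArgumentException("hint must fit in 16 bits");
        }
        return BRK_BASE | (hint << 5);
    }

    // Recovers the hint from an instruction word.
    static int decodeHint(int insn) {
        return (insn >>> 5) & 0xFFFF;
    }

    public static void main(String[] args) {
        int insn = encode(0x1234);
        System.out.printf("%08x -> hint 0x%x%n", insn, decodeHint(insn));
    }
}
```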
Thomas Stuefe has updated the pull request incrementally with one additional commit since the last revision: reuse Assembler::brk ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13401/files - new: https://git.openjdk.org/jdk/pull/13401/files/bb5b3592..c9adf3bc Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13401&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13401&range=00-01 Stats: 3 lines in 1 file changed: 0 ins; 2 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/13401.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13401/head:pull/13401 PR: https://git.openjdk.org/jdk/pull/13401 From stuefe at openjdk.org Tue Apr 11 06:55:32 2023 From: stuefe at openjdk.org (Thomas Stuefe) Date: Tue, 11 Apr 2023 06:55:32 GMT Subject: RFR: JDK-8305782: Provide MacroAssembler::breakpoint on aarch64 In-Reply-To: References: Message-ID: On Sun, 9 Apr 2023 07:45:46 GMT, Thomas Stuefe wrote: > The ability to emit debug traps was useful for me on arm, and I miss it on aarch64. > > Tested manually on Linux aarch64 in gdb with various values for hint covering the whole 16-bit range set. Hint gets encoded in the instruction (gdb decodes instruction as "BRK xxx" with xxx being the hint). According to documentation the hint ends up in ESR.ELx.ISS after the trap hit, but gdb refused to display the ESR register, so I could not verify that. Thanks @feilongjiang; I reuse brk now. Re-tested with various hints, still works. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13401#issuecomment-1502771412 From epeter at openjdk.org Tue Apr 11 07:39:33 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 11 Apr 2023 07:39:33 GMT Subject: RFR: 8303466: C2: failed: malformed control flow. Limit type made precise with MaxL/MinL In-Reply-To: References: Message-ID: On Mon, 10 Apr 2023 18:43:00 GMT, Vladimir Kozlov wrote: >If we have several unrolls we will have chain of MaxL/MinL nodes. Will the chain be folded by IGVN? 
@vnkozlov I fear it would not fold currently. The CMove would not fold before either, but with repeated unrolling, the CMove was reused, and so there was only ever a single CMove (unless some RC got in between). I think in many cases, the type does not underflow, and the `MaxL/MinL` can be removed completely. However, if that does not work, I think it now also fails to remove the repeated `ConvI2L / ConvL2I`. We would have to add more IGVN optimizations to fold things more. I think the performance impact is now insignificant, if it does not fold. Because the limits are only calculated once per loop. We can still improve the folding, if you want. I can also do that in a follow-up RFE, and try to add some IR tests that target type-limit underflow, and count the `MaxL/MinL` nodes. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13269#issuecomment-1502838643 From epeter at openjdk.org Tue Apr 11 07:47:34 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 11 Apr 2023 07:47:34 GMT Subject: RFR: 8303466: C2: failed: malformed control flow. Limit type made precise with MaxL/MinL [v2] In-Reply-To: References: Message-ID: > **Context** > > During `PhaseIdealLoop::do_unroll`, we hack the loop-limit, and subtract `stride` from it. We have to prevent underflow on that subtract. Currently, we do this with a `CMoveI`. The problem with this: `CMoveI` is not smart enough to generate a precise type. For example, there are many cases where the input types get better, and underflow is not possible anymore. But the `CMoveI` does not detect this, and still has type `min_jint..hi`. > > We have the same issue in `PhaseIdealLoop::adjust_limit`, where we use `CMoveL` to implement long max/min. The types are not as precise as they could and should be. > > **Problem** > > The imprecise type is used for the zero-trip-guard. It does not fold to false, even though the data-path into the post loop does constant fold to `TOP`. 
The graph breaks, and assert `malformed control flow` triggers. > > Details: In these cases, we have the super-unrolled main-loop (SuperWord'ed, then further unrolled) directly leading to a vectorized post-loop. The effect is that there is no `region/phi` merging main-exit and main-zero-trip-guard. So the types are already more narrow here. It may be possible that the values are such that we find out that we should never enter the vectorized post-loop. But if data finds out and control does not, we get a broken graph. > Note: we have pre-loop. Then a main-loop and vectorized post loop. Then we merge the main-zero-trip-guard. And at the end we have the scalar post loop. > > I have already recently fixed a bug around this `CMoveI`. https://github.com/openjdk/jdk/commit/5a4945c0d95423d0ab07762c915e9cb4d3c66abb I would now like to have a more satisfactory fix, that properly propagates the types. > > **Solution** > > `PhaseIdealLoop::adjust_limit` already converts the limit from int to long, and does all computations in long, including taking max/min with a `CMoveL`. I now use the so far unused `MaxL/MinL`. I implemented some missing `Value/Identity` components for it. Since `MaxL/MinL` is not implemented in the backend, I just expand it in macro-expansion to a `CMoveL`. At that point the loop-opts are over, and it is most likely ok that we do not make the types more precise after this. > > I take the same approach for `PhaseIdealLoop::do_unroll`: convert limits to long, do subtraction in long, take `MinL/MaxL` to clamp it to the int-range (prevent subtraction underflow). 
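The clamping idea from the Solution section above can be illustrated with plain Java arithmetic. This is only a sketch of the idea, not the C2 code: `adjustedLimit` is a hypothetical helper, and it assumes a positive stride (for a negative stride the clamp would have to be against `Integer.MAX_VALUE` instead).

```java
public class LimitClamp {
    // Hypothetical helper: compute limit - stride without int underflow.
    // Assumes stride > 0, so the difference can only leave the int range at the low end.
    static int adjustedLimit(int limit, int stride) {
        long limitL = limit;            // widen to long (ConvI2L analogue)
        long strideL = stride;
        long diff = limitL - strideL;   // cannot overflow in long arithmetic
        long clamped = Math.max(diff, (long) Integer.MIN_VALUE); // MaxL analogue
        return (int) clamped;           // narrowing is lossless after the clamp
    }

    public static void main(String[] args) {
        System.out.println(adjustedLimit(100, 8));
        System.out.println(adjustedLimit(Integer.MIN_VALUE + 3, 8));
        // For comparison: the same subtraction done in int wraps around.
        System.out.println((Integer.MIN_VALUE + 3) - 8);
    }
}
```

The last line shows why the clamp is needed: done in `int`, the subtraction wraps to a large positive value instead of saturating at `Integer.MIN_VALUE`.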
> > **Discussion** > > This solution seems much cleaner to me, and I hope that we will see less bugs because of imprecise types in the limit computation, which were often due to the `CMove` not being smart enough to analyze all inputs (it would have to recognize a multitude of patterns, for the Cmp inputs and the direct inputs to the CMove - we currently do not do that, but just take the union of the input types - this is very inprecise). > > There is a bit of an overhead here: We use longs even though we only want to have int values. But I think we should prefer a clean implementation here, with correct type computation. The performance impact is probably non-existent on 64-bit machines anyway. > > **Caveat** > > I found some cases with the same assert `malformed control flow` that are most likely skeleton/assertion predicate bugs [JDK-8288981](https://bugs.openjdk.org/browse/JDK-8288981). Some of those cases were new patterns, for example where we PreMainPost a main loop. > > I hope that this fix here at least reduces the frequency of failures significantly. > > **Testing** > > I added 2 regression tests. Our fuzzer seems to spit out examples regularly, so that gives us extra coverage. > > Tested up to `tier5` and stress testing. Performance testing **running...** > > **Future Work** > > We should implement `MaxL/MinL` in the backend. We should also use them during parsing. This would also allow to `SuperWord` the instruction, on the platforms that support it. > > Should we add such an assert during IGVN? I think after IGVN, we should never have a `MultiBranchNode` that does not have the required number of outputs, right? We could add it to `VerifyIterativeGVN`. Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: Apply suggestions from code review (Christian) Thanks for the suggestions! 
Co-authored-by: Christian Hagedorn ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13269/files - new: https://git.openjdk.org/jdk/pull/13269/files/fdf6b08a..2027a3e1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13269&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13269&range=00-01 Stats: 6 lines in 2 files changed: 0 ins; 0 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/13269.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13269/head:pull/13269 PR: https://git.openjdk.org/jdk/pull/13269 From epeter at openjdk.org Tue Apr 11 07:47:37 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 11 Apr 2023 07:47:37 GMT Subject: RFR: 8303466: C2: failed: malformed control flow. Limit type made precise with MaxL/MinL In-Reply-To: References: Message-ID: On Fri, 31 Mar 2023 12:44:17 GMT, Emanuel Peter wrote: > **Context** > > During `PhaseIdealLoop::do_unroll`, we hack the loop-limit, and subtract `stride` from it. We have to prevent underflow on that subtract. Currently, we do this with a `CMoveI`. The problem with this: `CMoveI` is not smart enough to generate a precise type. For example, there are many cases where the input types get better, and underflow is not possible anymore. But the `CMoveI` does not detect this, and still has type `min_jint..hi`. > > We have the same issue in `PhaseIdealLoop::adjust_limit`, where we use `CMoveL` to implement long max/min. The types are not as precise as they could and should be. > > **Problem** > > The imprecise type is used for the zero-trip-guard. It does not fold to false, even though the data-path into the post loop does constant fold to `TOP`. The graph breaks, and assert `malformed control flow` triggers. > > Details: In these cases, we have the super-unrolled main-loop (SuperWord'ed, then further unrolled) directly leading to a vectorized post-loop. The effect is that there is no `region/phi` merging main-exit and main-zero-trip-guard. 
So the types are already more narrow here. It may be possible that the values are such that we find out that we should never enter the vectorized post-loop. But if data finds out and control does not, we get a broken graph. > Note: we have pre-loop. Then a main-loop and vectorized post loop. Then we merge the main-zero-trip-guard. And at the end we have the scalar post loop. > > I have already recently fixed a bug around this `CMoveI`. https://github.com/openjdk/jdk/commit/5a4945c0d95423d0ab07762c915e9cb4d3c66abb I would now like to have a more satisfactory fix, that properly propagates the types. > > **Solution** > > `PhaseIdealLoop::adjust_limit` already converts the limit from int to long, and does all computations in long, including taking max/min with a `CMoveL`. I now use the so far unused `MaxL/MinL`. I implemented some missing `Value/Identity` components for it. Since `MaxL/MinL` is not implemented in the backend, I just expand it in macro-expansion to a `CMoveL`. At that point the loop-opts are over, and it is most likely ok that we do not make the types more precise after this. > > I take the same approach for `PhaseIdealLoop::do_unroll`: convert limits to long, do subtraction in long, take `MinL/MaxL` to clamp it to the int-range (prevent subtraction underflow). > > **Discussion** > > This solution seems much cleaner to me, and I hope that we will see less bugs because of imprecise types in the limit computation, which were often due to the `CMove` not being smart enough to analyze all inputs (it would have to recognize a multitude of patterns, for the Cmp inputs and the direct inputs to the CMove - we currently do not do that, but just take the union of the input types - this is very inprecise). > > There is a bit of an overhead here: We use longs even though we only want to have int values. But I think we should prefer a clean implementation here, with correct type computation. The performance impact is probably non-existent on 64-bit machines anyway. 
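The type-precision argument in the Discussion above can be made concrete with simple interval arithmetic. The sketch below is an illustration only: `Range`, `union` and `max` are hypothetical names, and the code is not the `Value` implementation from the patch. It contrasts "take the union of the input types" (what the description says the `CMove` does) with the tightest interval for a maximum.

```java
public class RangeTypes {
    // Closed interval [lo, hi]; a stand-in for an integer type range, not HotSpot's TypeLong.
    static final class Range {
        final long lo, hi;
        Range(long lo, long hi) { this.lo = lo; this.hi = hi; }
        @Override public String toString() { return "[" + lo + ".." + hi + "]"; }
    }

    // Union of both inputs: safe for a conditional move, but imprecise.
    static Range union(Range a, Range b) {
        return new Range(Math.min(a.lo, b.lo), Math.max(a.hi, b.hi));
    }

    // Tightest interval for max(a, b): the lower bound is the larger of the two lower bounds.
    static Range max(Range a, Range b) {
        return new Range(Math.max(a.lo, b.lo), Math.max(a.hi, b.hi));
    }

    public static void main(String[] args) {
        Range value = new Range(0, 100);
        Range clamp = new Range(50, 50);
        System.out.println("union: " + union(value, clamp));
        System.out.println("max:   " + max(value, clamp));
    }
}
```

With these inputs the union keeps the lower bound of `value`, while the max interval raises it to the clamp constant; that tighter lower bound is the kind of information a zero-trip-guard needs in order to fold.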
> > **Caveat** > > I found some cases with the same assert `malformed control flow` that are most likely skeleton/assertion predicate bugs [JDK-8288981](https://bugs.openjdk.org/browse/JDK-8288981). Some of those cases were new patterns, for example where we PreMainPost a main loop. > > I hope that this fix here at least reduces the frequency of failures significantly. > > **Testing** > > I added 2 regression tests. Our fuzzer seems to spit out examples regularly, so that gives us extra coverage. > > Tested up to `tier5` and stress testing. Performance testing **running...** > > **Future Work** > > We should implement `MaxL/MinL` in the backend. We should also use them during parsing. This would also allow to `SuperWord` the instruction, on the platforms that support it. > > Should we add such an assert during IGVN? I think after IGVN, we should never have a `MultiBranchNode` that does not have the required number of outputs, right? We could add it to `VerifyIterativeGVN`. I added the idea about verifying out-proj of `MultiBranch` to this RFE [JDK-8298951](https://bugs.openjdk.org/browse/JDK-8298951). ------------- PR Comment: https://git.openjdk.org/jdk/pull/13269#issuecomment-1502844815 From epeter at openjdk.org Tue Apr 11 08:00:35 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 11 Apr 2023 08:00:35 GMT Subject: RFR: 8303466: C2: failed: malformed control flow. Limit type made precise with MaxL/MinL [v3] In-Reply-To: References: Message-ID: > **Context** > > During `PhaseIdealLoop::do_unroll`, we hack the loop-limit, and subtract `stride` from it. We have to prevent underflow on that subtract. Currently, we do this with a `CMoveI`. The problem with this: `CMoveI` is not smart enough to generate a precise type. For example, there are many cases where the input types get better, and underflow is not possible anymore. But the `CMoveI` does not detect this, and still has type `min_jint..hi`. 
> > We have the same issue in `PhaseIdealLoop::adjust_limit`, where we use `CMoveL` to implement long max/min. The types are not as precise as they could and should be. > > **Problem** > > The imprecise type is used for the zero-trip-guard. It does not fold to false, even though the data-path into the post loop does constant fold to `TOP`. The graph breaks, and assert `malformed control flow` triggers. > > Details: In these cases, we have the super-unrolled main-loop (SuperWord'ed, then further unrolled) directly leading to a vectorized post-loop. The effect is that there is no `region/phi` merging main-exit and main-zero-trip-guard. So the types are already more narrow here. It may be possible that the values are such that we find out that we should never enter the vectorized post-loop. But if data finds out and control does not, we get a broken graph. > Note: we have pre-loop. Then a main-loop and vectorized post loop. Then we merge the main-zero-trip-guard. And at the end we have the scalar post loop. > > I have already recently fixed a bug around this `CMoveI`. https://github.com/openjdk/jdk/commit/5a4945c0d95423d0ab07762c915e9cb4d3c66abb I would now like to have a more satisfactory fix, that properly propagates the types. > > **Solution** > > `PhaseIdealLoop::adjust_limit` already converts the limit from int to long, and does all computations in long, including taking max/min with a `CMoveL`. I now use the so far unused `MaxL/MinL`. I implemented some missing `Value/Identity` components for it. Since `MaxL/MinL` is not implemented in the backend, I just expand it in macro-expansion to a `CMoveL`. At that point the loop-opts are over, and it is most likely ok that we do not make the types more precise after this. > > I take the same approach for `PhaseIdealLoop::do_unroll`: convert limits to long, do subtraction in long, take `MinL/MaxL` to clamp it to the int-range (prevent subtraction underflow). 
> > **Discussion** > > This solution seems much cleaner to me, and I hope that we will see less bugs because of imprecise types in the limit computation, which were often due to the `CMove` not being smart enough to analyze all inputs (it would have to recognize a multitude of patterns, for the Cmp inputs and the direct inputs to the CMove - we currently do not do that, but just take the union of the input types - this is very inprecise). > > There is a bit of an overhead here: We use longs even though we only want to have int values. But I think we should prefer a clean implementation here, with correct type computation. The performance impact is probably non-existent on 64-bit machines anyway. > > **Caveat** > > I found some cases with the same assert `malformed control flow` that are most likely skeleton/assertion predicate bugs [JDK-8288981](https://bugs.openjdk.org/browse/JDK-8288981). Some of those cases were new patterns, for example where we PreMainPost a main loop. > > I hope that this fix here at least reduces the frequency of failures significantly. > > **Testing** > > I added 2 regression tests. Our fuzzer seems to spit out examples regularly, so that gives us extra coverage. > > Tested up to `tier5` and stress testing. Performance testing **running...** > > **Future Work** > > We should implement `MaxL/MinL` in the backend. We should also use them during parsing. This would also allow to `SuperWord` the instruction, on the platforms that support it. > > Should we add such an assert during IGVN? I think after IGVN, we should never have a `MultiBranchNode` that does not have the required number of outputs, right? We could add it to `VerifyIterativeGVN`. 
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: stride_l should be longcon ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13269/files - new: https://git.openjdk.org/jdk/pull/13269/files/2027a3e1..e08a9ef7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13269&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13269&range=01-02 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/13269.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13269/head:pull/13269 PR: https://git.openjdk.org/jdk/pull/13269 From epeter at openjdk.org Tue Apr 11 08:00:38 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 11 Apr 2023 08:00:38 GMT Subject: RFR: 8303466: C2: failed: malformed control flow. Limit type made precise with MaxL/MinL [v3] In-Reply-To: References: Message-ID: On Mon, 10 Apr 2023 18:25:43 GMT, Vladimir Kozlov wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> stride_l should be longcon > > src/hotspot/share/opto/loopTransform.cpp line 2296: > >> 2294: register_new_node(limit_l, get_ctrl(limit)); >> 2295: Node* stride_l = new ConvI2LNode(stride); >> 2296: register_new_node(stride_l, get_ctrl(limit)); > > Can we make Long constant (`_igvn.longcon(stride)`) instead since `stride` is constant? Similar to `underflow_clamp_l`. My concern is you set control to constant which is not `Root`. Good point, will replace it with constant. Yes, I had the ctrl wrong, it should be root. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13269#discussion_r1162432158 From dzhang at openjdk.org Tue Apr 11 08:26:38 2023 From: dzhang at openjdk.org (Dingli Zhang) Date: Tue, 11 Apr 2023 08:26:38 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v14] In-Reply-To: <6xhQpQfUt6G3C7GES9UhRMCSFyfNGIF1ANCQ1u_xcQA=.acc3520f-df68-494d-ab30-c4af7334c8b8@github.com> References: <7-vzGGq80CYZvbB9N5qrn5mrhVhxxh97oo7AkgYRn1k=.8b9fe40f-d858-43d0-bdb8-4b050009876f@github.com> <6xhQpQfUt6G3C7GES9UhRMCSFyfNGIF1ANCQ1u_xcQA=.acc3520f-df68-494d-ab30-c4af7334c8b8@github.com> Message-ID: On Tue, 11 Apr 2023 02:48:55 GMT, Fei Yang wrote: >> Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove unneeded combination nodes > > src/hotspot/cpu/riscv/riscv_v.ad line 162: > >> 160: instruct vmaskcmp(vRegMask dst, vReg src1, vReg src2, immI cond) %{ >> 161: match(Set dst (VectorMaskCmp (Binary src1 src2) cond)); >> 162: format %{ "vmaskcmp_rvv $dst, $src1, $src2, $cond" %} > > Suggestion: s/vmaskcmp_rvv/vmaskcmp/ Thanks for the review! We will fix it in the next commit. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1162467441 From aph at openjdk.org Tue Apr 11 08:32:36 2023 From: aph at openjdk.org (Andrew Haley) Date: Tue, 11 Apr 2023 08:32:36 GMT Subject: RFR: 8301739: AArch64: Add optimized rules for vector compare with immediate for SVE In-Reply-To: References: Message-ID: <_EkA8118lmQT0X8Ju47AZA70EsFT0g_NPs_yUJ4Lqrg=.3f3b9c23-73a9-439e-8473-9cd2ff0c722e@github.com> On Tue, 28 Mar 2023 02:06:41 GMT, Chang Peng wrote: > We can use SVE compare-with-integer-immediate instructions like cmpgt(immediate)[1] to avoid the extra scalar2vector operations. 
>
> The following instruction sequence
>
>     movi v17.16b, #12
>     cmpgt p0.b, p7/z, z16.b, z17.b
>
> can be optimized to:
>
>     cmpgt p0.b, p7/z, z16.b, #12
>
> This patch does the following:
> 1. Add SVE compare-with-7bit-unsigned-immediate instructions to C2's backend.
>    SVE cmp(immediate) instructions can support vector comparing with 7bit unsigned integer immediate (range from 0 to 127) or 5bit signed integer immediate (range from -16 to 15).
>
> 2. Add optimized match rules to generate the compare-with-immediate instructions.
>
> [1]: https://developer.arm.com/documentation/ddi0596/2021-12/SVE-Instructions/CMP-cc---immediate---Compare-vector-to-immediate-

src/hotspot/cpu/aarch64/aarch64.ad line 4328:

> 4326: operand immI_cmpU_cond()
> 4327: %{
> 4328:   predicate(n->get_int() > (int)(BoolTest::unsigned_compare));

This function is in x86.ad. I suggest you move it to shared code, then use it here.

    static inline bool is_unsigned_booltest_pred(int bt) {
      return ((bt & BoolTest::unsigned_compare) == BoolTest::unsigned_compare);
    }

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/13200#discussion_r1162473468

From aph at openjdk.org  Tue Apr 11 08:35:33 2023
From: aph at openjdk.org (Andrew Haley)
Date: Tue, 11 Apr 2023 08:35:33 GMT
Subject: RFR: 8301739: AArch64: Add optimized rules for vector compare with immediate for SVE
In-Reply-To:
References:
Message-ID:

On Tue, 28 Mar 2023 02:06:41 GMT, Chang Peng wrote:

> We can use SVE compare-with-integer-immediate instructions like cmpgt(immediate)[1] to avoid the extra scalar2vector operations.
>
> The following instruction sequence
>
>     movi v17.16b, #12
>     cmpgt p0.b, p7/z, z16.b, z17.b
>
> can be optimized to:
>
>     cmpgt p0.b, p7/z, z16.b, #12
>
> This patch does the following:
> 1. Add SVE compare-with-7bit-unsigned-immediate instructions to C2's backend.
> SVE cmp(immediate) instructions can support vector comparing with 7bit unsigned integer immediate (range from 0 to > 127)or 5bit signed integer immediate (range from -16 to 15). > > 2. Add optimized match rules to generate the compare-with-immediate instructions. > > [1]: https://developer.arm.com/documentation/ddi0596/2021-12/SVE-Instructions/CMP-cc---immediate---Compare-vector-to-immediate- src/hotspot/cpu/aarch64/aarch64.ad line 4438: > 4436: operand immI5() > 4437: %{ > 4438: predicate(((-(1 << 4)) <= n->get_int()) && (n->get_int() < (1 << 4))); At some point someone must realize that `((-(1 << size)) <= n->get_int()) && (n->get_int() < (1 << size))` could be a function. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13200#discussion_r1162477246 From dzhang at openjdk.org Tue Apr 11 08:39:38 2023 From: dzhang at openjdk.org (Dingli Zhang) Date: Tue, 11 Apr 2023 08:39:38 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v14] In-Reply-To: <6xhQpQfUt6G3C7GES9UhRMCSFyfNGIF1ANCQ1u_xcQA=.acc3520f-df68-494d-ab30-c4af7334c8b8@github.com> References: <7-vzGGq80CYZvbB9N5qrn5mrhVhxxh97oo7AkgYRn1k=.8b9fe40f-d858-43d0-bdb8-4b050009876f@github.com> <6xhQpQfUt6G3C7GES9UhRMCSFyfNGIF1ANCQ1u_xcQA=.acc3520f-df68-494d-ab30-c4af7334c8b8@github.com> Message-ID: On Tue, 11 Apr 2023 02:51:35 GMT, Fei Yang wrote: >> Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove unneeded combination nodes > > src/hotspot/cpu/riscv/riscv_v.ad line 2562: > >> 2560: >> 2561: instruct vmaskcast_same_esize_rvv(vRegMask dst_src) %{ >> 2562: predicate(Matcher::vector_length_in_bytes(n) == Matcher::vector_length_in_bytes(n->in(1))); > > Can we add support for other cases for VectorMaskCast at the same time? 
Each bit in the mask register in RVV always corresponds to an element, and there is no need to convert the width of the mask according to the type of vector element, as in aarch64. So we just need to remove the predicate here and change the instruct name to make it match the logic, which is the same as x86 AVX512[1]. [1] https://github.com/openjdk/jdk/blob/cd7d53c88c27eedbe16020b88c2219708d170a1e/src/hotspot/cpu/x86/x86.ad#L8326-L8334 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1162481559 From aph at openjdk.org Tue Apr 11 08:45:35 2023 From: aph at openjdk.org (Andrew Haley) Date: Tue, 11 Apr 2023 08:45:35 GMT Subject: RFR: 8301739: AArch64: Add optimized rules for vector compare with immediate for SVE In-Reply-To: References: Message-ID: On Tue, 28 Mar 2023 02:06:41 GMT, Chang Peng wrote: > We can use SVE compare-with-integer-immediate instructions like cmpgt(immediate)[1] to avoid the extra scalar2vector operations. > > The following instruction sequence > > > movi v17.16b, #12 > cmpgt p0.b, p7/z, z16.b, z17.b > > > can be optimized to: > > > cmpgt p0.b, p7/z, z16.b, #12 > > > This patch does the following: > 1. Add SVE compare-with-7bit-unsigned-immediate instructions to C2's backend. > SVE cmp(immediate) instructions can support vector comparing with 7bit unsigned integer immediate (range from 0 to > 127)or 5bit signed integer immediate (range from -16 to 15). > > 2. Add optimized match rules to generate the compare-with-immediate instructions. > > [1]: https://developer.arm.com/documentation/ddi0596/2021-12/SVE-Instructions/CMP-cc---immediate---Compare-vector-to-immediate- src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 3823: > 3821: f((cond_op >> 1) & 0x7, 15, 13), pgrf(Pg, 10), rf(Zn, 5); > 3822: f(cond_op & 0x1, 4), prf(Pd, 0); > 3823: } This entire block of code needs extensive refactoring. Please write a function from Condition -> int. Use a simple boolean is_unsigned. 
Extract the common code from the two arms of this if statement.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/13200#discussion_r1162489309

From dnsimon at openjdk.org  Tue Apr 11 08:53:36 2023
From: dnsimon at openjdk.org (Doug Simon)
Date: Tue, 11 Apr 2023 08:53:36 GMT
Subject: RFR: 8305419: JDK-8301995 broke building libgraal
In-Reply-To:
References:
Message-ID:

On Fri, 7 Apr 2023 19:41:45 GMT, Tom Rodriguez wrote:

> There were some minor mismatches in where the indy index was passed and where the real constant pool index was used. There's no general test of `loadReferencedType` but I modified TestDynamicConstant to exercise it. The arguments to the invokedynamic in that test were mildly broken as well, which was only exposed once we tried to resolve it.

Marked as reviewed by dnsimon (Committer).

-------------

PR Review: https://git.openjdk.org/jdk/pull/13392#pullrequestreview-1378748318

From aph at openjdk.org  Tue Apr 11 08:53:35 2023
From: aph at openjdk.org (Andrew Haley)
Date: Tue, 11 Apr 2023 08:53:35 GMT
Subject: RFR: 8305524: AArch64: Fix arraycopy issue on SVE caused by matching rule vmask_gen_sub
In-Reply-To:
References:
Message-ID:

On Fri, 7 Apr 2023 07:21:05 GMT, Pengfei Li wrote:

> From recent tests, we find that a `System.arraycopy()` call with a negated variable as its length argument does not perform the copy. This issue is reproducible with the test case below on AArch64 platforms with SVE.
>
> public class Test {
>     static char[] src = {'A', 'A', 'A', 'A', 'A'};
>     static char[] dst = {'B', 'B', 'B', 'B', 'B'};
>
>     static void copy(int nlen) {
>         System.arraycopy(src, 0, dst, 0, -nlen);
>     }
>
>     public static void main(String[] args) {
>         for (int i = 0; i < 25000; i++) {
>             copy(0);
>         }
>         copy(-5);
>         for (char c : dst) {
>             if (c != 'A') {
>                 throw new RuntimeException("Wrong value!");
>             }
>         }
>         System.out.println("PASS");
>     }
> }
>
> /*
> $ java -Xint Test
> PASS
> $ java -Xbatch Test
> Exception in thread "main" java.lang.RuntimeException: Wrong value!
>         at Test.main(Test.java:16)
> */
>
> Cause of this is a new AArch64 matching rule `vmask_gen_sub` introduced by JDK-8293198. It matches `VectorMaskGen (SubL src1 src2)` on AArch64 platforms with SVE and generates SVE `whilelo` instructions. Current C2 compiler uses a technique called "partial inlining" to vectorize small array copy operations by generating vector masks. In above test case, a negated variable `-nlen` is used as the length argument of the call and `-nlen` has a small positive value, so it is a "partial inlining" case. C2 will transform the ideal graph to `VectorMaskGen (SubL 0 nlen)` and eventually output an instruction of `whilelo p0, nlen, zr` which always generates an all-false vector mask. That's why arraycopy does nothing.
>
> The problem of that matching rule is that it regards inputs `src1` and `src2` as unsigned long integers but they can be signed in use cases of arraycopy. To fix the issue, this patch replaces `whilelo` instruction by `whilelt` in that rule as well as some other places.
>
> We tested tier1~3 on SVE and found no new failure. A jtreg math library test jdk/internal/math/FloatingDecimal/TestFloatingDecimal.java which fails on SVE before can pass now.

Marked as reviewed by aph (Reviewer).
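The unsigned-versus-signed issue described in the quoted text above can be reproduced with scalar Java. This is a simplified model, not the SVE semantics in full: `LANES`, `maskUnsigned` and `maskSigned` are hypothetical names, the lane count is an arbitrary assumption, and each lane is simply modelled as "active while `start + i` is less than `limit`" under the given comparison.

```java
public class WhileMask {
    static final int LANES = 8; // assumed lane count, for illustration only

    // whilelo analogue: lane i is active if (start + i) < limit, compared as unsigned.
    static boolean[] maskUnsigned(long start, long limit) {
        boolean[] mask = new boolean[LANES];
        for (int i = 0; i < LANES; i++) {
            mask[i] = Long.compareUnsigned(start + i, limit) < 0;
        }
        return mask;
    }

    // whilelt analogue: lane i is active if (start + i) < limit, compared as signed.
    static boolean[] maskSigned(long start, long limit) {
        boolean[] mask = new boolean[LANES];
        for (int i = 0; i < LANES; i++) {
            mask[i] = (start + i) < limit;
        }
        return mask;
    }

    static int activeLanes(boolean[] mask) {
        int n = 0;
        for (boolean b : mask) {
            if (b) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        long nlen = -5; // copy(-5) in the test case, so the requested length -nlen is 5
        // The matched pattern (SubL 0 nlen) becomes "while nlen < 0".
        System.out.println("unsigned: " + activeLanes(maskUnsigned(nlen, 0)));
        System.out.println("signed:   " + activeLanes(maskSigned(nlen, 0)));
    }
}
```

Under the unsigned comparison nothing is below zero, so no lane is active and nothing is copied; under the signed comparison the five expected lanes are active.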
------------- PR Review: https://git.openjdk.org/jdk/pull/13382#pullrequestreview-1378748623 From aph at openjdk.org Tue Apr 11 08:57:36 2023 From: aph at openjdk.org (Andrew Haley) Date: Tue, 11 Apr 2023 08:57:36 GMT Subject: RFR: JDK-8305782: Provide MacroAssembler::breakpoint on aarch64 [v2] In-Reply-To: References: Message-ID: <2GMaeVAv8EfUb9OdDDKByBFuFF8gs13gtccwGCUKu10=.22cd7186-6cd4-4ccc-a358-cfe3004eb155@github.com> On Tue, 11 Apr 2023 06:55:30 GMT, Thomas Stuefe wrote: >> The ability to emit debug traps was useful for me on arm, and I miss it on aarch64. >> >> Tested manually on Linux aarch64 in gdb with various values for hint covering the whole 16-bit range set. Hint gets encoded in the instruction (gdb decodes instruction as "BRK xxx" with xxx being the hint). According to documentation the hint ends up in ESR.ELx.ISS after the trap hit, but gdb refused to display the ESR register, so I could not verify that. > > Thomas Stuefe has updated the pull request incrementally with one additional commit since the last revision: > > reuse Assembler::brk src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp line 143: > 141: } > 142: > 143: void LIR_Assembler::breakpoint() { __ breakpoint(); } Suggestion: void LIR_Assembler::breakpoint() { __ brk(0); } I don't think we need the rest of this patch. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13401#discussion_r1162503145 From thartmann at openjdk.org Tue Apr 11 09:02:34 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 11 Apr 2023 09:02:34 GMT Subject: RFR: 8305419: JDK-8301995 broke building libgraal In-Reply-To: References: Message-ID: <11j4h-td_1Wc32gXCbWmfI7pDl3ImqNnr0QyIAoW7zg=.0e0d05ac-fe1d-456d-8da7-460215f45a48@github.com> On Fri, 7 Apr 2023 19:41:45 GMT, Tom Rodriguez wrote: > There were some minor mismatches is where the indy index was passed and where the real constant pool index was used. 
There's no general test of `loadReferencedType` but I modified TestDynamicConstant to exercise it. The arguments to the invokedynamic in that test were mildly broken as well which was only exposed once we tried to resolve it. Looks good to me too. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13392#pullrequestreview-1378766107 From thartmann at openjdk.org Tue Apr 11 09:07:40 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 11 Apr 2023 09:07:40 GMT Subject: RFR: 8303466: C2: failed: malformed control flow. Limit type made precise with MaxL/MinL [v3] In-Reply-To: References: Message-ID: On Tue, 11 Apr 2023 08:00:35 GMT, Emanuel Peter wrote: >> **Context** >> >> During `PhaseIdealLoop::do_unroll`, we hack the loop-limit, and subtract `stride` from it. We have to prevent underflow on that subtract. Currently, we do this with a `CMoveI`. The problem with this: `CMoveI` is not smart enough to generate a precise type. For example, there are many cases where the input types get better, and underflow is not possible anymore. But the `CMoveI` does not detect this, and still has type `min_jint..hi`. >> >> We have the same issue in `PhaseIdealLoop::adjust_limit`, where we use `CMoveL` to implement long max/min. The types are not as precise as they could and should be. >> >> **Problem** >> >> The imprecise type is used for the zero-trip-guard. It does not fold to false, even though the data-path into the post loop does constant fold to `TOP`. The graph breaks, and assert `malformed control flow` triggers. >> >> Details: In these cases, we have the super-unrolled main-loop (SuperWord'ed, then further unrolled) directly leading to a vectorized post-loop. The effect is that there is no `region/phi` merging main-exit and main-zero-trip-guard. So the types are already more narrow here. It may be possible that the values are such that we find out that we should never enter the vectorized post-loop. 
But if data finds out and control does not, we get a broken graph. >> Note: we have pre-loop. Then a main-loop and vectorized post loop. Then we merge the main-zero-trip-guard. And at the end we have the scalar post loop. >> >> I have already recently fixed a bug around this `CMoveI`. https://github.com/openjdk/jdk/commit/5a4945c0d95423d0ab07762c915e9cb4d3c66abb I would now like to have a more satisfactory fix, that properly propagates the types. >> >> **Solution** >> >> `PhaseIdealLoop::adjust_limit` already converts the limit from int to long, and does all computations in long, including taking max/min with a `CMoveL`. I now use the so far unused `MaxL/MinL`. I implemented some missing `Value/Identity` components for it. Since `MaxL/MinL` is not implemented in the backend, I just expand it in macro-expansion to a `CMoveL`. At that point the loop-opts are over, and it is most likely ok that we do not make the types more precise after this. >> >> I take the same approach for `PhaseIdealLoop::do_unroll`: convert limits to long, do subtraction in long, take `MinL/MaxL` to clamp it to the int-range (prevent subtraction underflow). >> >> **Discussion** >> >> This solution seems much cleaner to me, and I hope that we will see less bugs because of imprecise types in the limit computation, which were often due to the `CMove` not being smart enough to analyze all inputs (it would have to recognize a multitude of patterns, for the Cmp inputs and the direct inputs to the CMove - we currently do not do that, but just take the union of the input types - this is very inprecise). >> >> There is a bit of an overhead here: We use longs even though we only want to have int values. But I think we should prefer a clean implementation here, with correct type computation. The performance impact is probably non-existent on 64-bit machines anyway. 
>> >> **Caveat** >> >> I found some cases with the same assert `malformed control flow` that are most likely skeleton/assertion predicate bugs [JDK-8288981](https://bugs.openjdk.org/browse/JDK-8288981). Some of those cases were new patterns, for example where we PreMainPost a main loop. >> >> I hope that this fix here at least reduces the frequency of failures significantly. >> >> **Testing** >> >> I added 2 regression tests. Our fuzzer seems to spit out examples regularly, so that gives us extra coverage. >> >> Tested up to `tier5` and stress testing. Performance testing **running...** >> >> **Future Work** >> >> We should implement `MaxL/MinL` in the backend. We should also use them during parsing. This would also allow to `SuperWord` the instruction, on the platforms that support it. >> >> Should we add such an assert during IGVN? I think after IGVN, we should never have a `MultiBranchNode` that does not have the required number of outputs, right? We could add it to `VerifyIterativeGVN`. > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > stride_l should be longcon src/hotspot/share/opto/addnode.cpp line 1272: > 1270: const TypeLong* t2 = phase->type(in(2))->is_long(); > 1271: > 1272: // Can we determine minimum statically? Suggestion: // Can we determine maximum statically? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13269#discussion_r1162513651 From duke at openjdk.org Tue Apr 11 09:15:34 2023 From: duke at openjdk.org (Daohan Qu) Date: Tue, 11 Apr 2023 09:15:34 GMT Subject: RFR: 8305324: C2: Wrong execution of vectorizing Interger.reverseBytes In-Reply-To: References: Message-ID: On Mon, 10 Apr 2023 18:50:52 GMT, Vladimir Kozlov wrote: > One comment. We are not using bug id in test names mostly now. Please rename it to something meaningful and place test into `compiler/vectorization/` or `compiler/loopopts/superword/` directory. Thanks for your review. I'll rename it. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/13406#issuecomment-1502965794 From duke at openjdk.org Tue Apr 11 09:18:35 2023 From: duke at openjdk.org (Daohan Qu) Date: Tue, 11 Apr 2023 09:18:35 GMT Subject: RFR: 8305324: C2: Wrong execution of vectorizing Interger.reverseBytes In-Reply-To: References: Message-ID: On Tue, 11 Apr 2023 02:14:11 GMT, Pengfei Li wrote: >> This patch should fix [JDK-8305324](https://bugs.openjdk.org/browse/JDK-8305324). >> >> `SuperWord::compute_vector_element_type()` implemented in `jdk/src/hotspot/share/opto/superword.cpp` propagates backward a narrower integer type when the upper bits of the value are not needed. However, `Integer.reverseBytes()` depends on higher-order bits of an integer and should be prevented from being narrowed and vectorized. Instead, it needs to be treated like `Math.abs()` (which is represented by `Op_AbsI` in the following code). >> >> https://github.com/openjdk/jdk/blob/0243da2e4adc1b7ab6fcd5b10778532101158dce/src/hotspot/share/opto/superword.cpp#L3935-L3945 >> >> I have tested this patch for tier 1-3 on x86-64. > > src/hotspot/share/opto/superword.cpp line 3941: > >> 3939: // RShiftI or AbsI operations, the compiler has to know the precise >> 3940: // signedness info of the 1st operand. These operations shouldn't be >> 3941: // vectorized if the signedness info is imprecise. > > Could you update the comments I wrote before? Sorry, I forget to update the comments. I'll do it. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13406#discussion_r1162529920 From duke at openjdk.org Tue Apr 11 09:26:34 2023 From: duke at openjdk.org (Daohan Qu) Date: Tue, 11 Apr 2023 09:26:34 GMT Subject: RFR: 8305324: C2: Wrong execution of vectorizing Interger.reverseBytes In-Reply-To: References: Message-ID: On Tue, 11 Apr 2023 02:24:42 GMT, Pengfei Li wrote: >> This patch should fix [JDK-8305324](https://bugs.openjdk.org/browse/JDK-8305324). 
>> >> `SuperWord::compute_vector_element_type()` implemented in `jdk/src/hotspot/share/opto/superword.cpp` propagates backward a narrower integer type when the upper bits of the value are not needed. However, `Integer.reverseBytes()` depends on higher-order bits of an integer and should be prevented from being narrowed and vectorized. Instead, it needs to be treated like `Math.abs()` (which is represented by `Op_AbsI` in the following code). >> >> https://github.com/openjdk/jdk/blob/0243da2e4adc1b7ab6fcd5b10778532101158dce/src/hotspot/share/opto/superword.cpp#L3935-L3945 >> >> I have tested this patch for tier 1-3 on x86-64. > > src/hotspot/share/opto/superword.cpp line 3944: > >> 3942: const Type* vt = vtn; >> 3943: int op = in->Opcode(); >> 3944: if (VectorNode::is_shift_opcode(op) || op == Op_AbsI || op == Op_ReverseBytesI) { > > (another suggestion) This list may be longer and longer as we vectorize more operations. Shall we move this check into a static function of `VectorNode`, like `VectorNode::requires_higher_order_bits(op)`, and put the comments inside? Good suggestion, I think a function can make this condition self-documented. Thank you. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13406#discussion_r1162539678 From duke at openjdk.org Tue Apr 11 09:30:33 2023 From: duke at openjdk.org (Chang Peng) Date: Tue, 11 Apr 2023 09:30:33 GMT Subject: RFR: 8301739: AArch64: Add optimized rules for vector compare with immediate for SVE In-Reply-To: References: Message-ID: <0-kFcOGOTwD4aBOrKu5xSAGoMTOQHrC1Eb6buucSNfo=.d813a3fe-dd74-430d-8251-dc324f5fd485@github.com> On Tue, 11 Apr 2023 08:43:04 GMT, Andrew Haley wrote: > This entire block of code needs extensive refactoring. > > Please write a function from Condition -> int. Use a simple boolean is_unsigned. Extract the common code from the two arms of this if statement. Ok, thanks. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13200#discussion_r1162544493 From epeter at openjdk.org Tue Apr 11 09:37:28 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 11 Apr 2023 09:37:28 GMT Subject: RFR: 8303466: C2: failed: malformed control flow. Limit type made precise with MaxL/MinL [v4] In-Reply-To: References: Message-ID: > **Context** > > During `PhaseIdealLoop::do_unroll`, we hack the loop-limit, and subtract `stride` from it. We have to prevent underflow on that subtract. Currently, we do this with a `CMoveI`. The problem with this: `CMoveI` is not smart enough to generate a precise type. For example, there are many cases where the input types get better, and underflow is not possible anymore. But the `CMoveI` does not detect this, and still has type `min_jint..hi`. > > We have the same issue in `PhaseIdealLoop::adjust_limit`, where we use `CMoveL` to implement long max/min. The types are not as precise as they could and should be. > > **Problem** > > The imprecise type is used for the zero-trip-guard. It does not fold to false, even though the data-path into the post loop does constant fold to `TOP`. The graph breaks, and assert `malformed control flow` triggers. > > Details: In these cases, we have the super-unrolled main-loop (SuperWord'ed, then further unrolled) directly leading to a vectorized post-loop. The effect is that there is no `region/phi` merging main-exit and main-zero-trip-guard. So the types are already more narrow here. It may be possible that the values are such that we find out that we should never enter the vectorized post-loop. But if data finds out and control does not, we get a broken graph. > Note: we have pre-loop. Then a main-loop and vectorized post loop. Then we merge the main-zero-trip-guard. And at the end we have the scalar post loop. > > I have already recently fixed a bug around this `CMoveI`. 
https://github.com/openjdk/jdk/commit/5a4945c0d95423d0ab07762c915e9cb4d3c66abb I would now like to have a more satisfactory fix, that properly propagates the types. > > **Solution** > > `PhaseIdealLoop::adjust_limit` already converts the limit from int to long, and does all computations in long, including taking max/min with a `CMoveL`. I now use the so far unused `MaxL/MinL`. I implemented some missing `Value/Identity` components for it. Since `MaxL/MinL` is not implemented in the backend, I just expand it in macro-expansion to a `CMoveL`. At that point the loop-opts are over, and it is most likely ok that we do not make the types more precise after this. > > I take the same approach for `PhaseIdealLoop::do_unroll`: convert limits to long, do subtraction in long, take `MinL/MaxL` to clamp it to the int-range (prevent subtraction underflow). > > **Discussion** > > This solution seems much cleaner to me, and I hope that we will see less bugs because of imprecise types in the limit computation, which were often due to the `CMove` not being smart enough to analyze all inputs (it would have to recognize a multitude of patterns, for the Cmp inputs and the direct inputs to the CMove - we currently do not do that, but just take the union of the input types - this is very inprecise). > > There is a bit of an overhead here: We use longs even though we only want to have int values. But I think we should prefer a clean implementation here, with correct type computation. The performance impact is probably non-existent on 64-bit machines anyway. > > **Caveat** > > I found some cases with the same assert `malformed control flow` that are most likely skeleton/assertion predicate bugs [JDK-8288981](https://bugs.openjdk.org/browse/JDK-8288981). Some of those cases were new patterns, for example where we PreMainPost a main loop. > > I hope that this fix here at least reduces the frequency of failures significantly. > > **Testing** > > I added 2 regression tests. 
Our fuzzer seems to spit out examples regularly, so that gives us extra coverage. > > Tested up to `tier5` and stress testing. Performance testing **running...** > > **Future Work** > > We should implement `MaxL/MinL` in the backend. We should also use them during parsing. This would also allow to `SuperWord` the instruction, on the platforms that support it. > > Should we add such an assert during IGVN? I think after IGVN, we should never have a `MultiBranchNode` that does not have the required number of outputs, right? We could add it to `VerifyIterativeGVN`. Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: Review suggestion by Tobias Hartmann Co-authored-by: Tobias Hartmann ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13269/files - new: https://git.openjdk.org/jdk/pull/13269/files/e08a9ef7..ecdff09b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13269&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13269&range=02-03 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/13269.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13269/head:pull/13269 PR: https://git.openjdk.org/jdk/pull/13269 From qamai at openjdk.org Tue Apr 11 09:38:50 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Tue, 11 Apr 2023 09:38:50 GMT Subject: RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v6] In-Reply-To: References: Message-ID: On Mon, 10 Apr 2023 15:16:59 GMT, Jatin Bhateja wrote: >> Yes I think it is a drawback of this approach, however currently we do not support shuffling for 256-bit vectors on AVX1 machines either, and AVX1 seems to be a special case in this regard. This species of float and double may also be less common in the usage of Vector API since it is larger than SPECIES_PREFERRED. 
> > Hi @merykitty , Agree with you that SPECIES_PREFERRED is preferred for vector algorithms intercepting both integral and floating point vectors. > > FTR, we see a perf regression with Float256 based micro now on AVX=1 targets, > > > public static short micro() { > VectorShuffle iota = FloatVector.SPECIES_256.iotaShuffle(0, 1, true); > return iota.cast(ShortVector.SPECIES_128).toVector().reinterpretAsShorts().lane(1); > } > > CPROMPT>javad --add-modules=jdk.incubator.vector -XX:UseAVX=1 -XX:+PrintIntrinsics -XX:CompileCommand=compileonly,shufflef::micro -cp . shufflef > CompileCommand: compileonly shufflef.micro bool compileonly = true > ** not supported: arity=1 op=reinterpret/1 vlen1=8 etype1=int ismask=0 > ** not supported: arity=1 op=cast/1 vlen1=8 etype1=int ismask=0 > @ 17 java.lang.Object::getClass (0 bytes) (intrinsic) > @ 24 java.lang.Object::getClass (0 bytes) (intrinsic) > @ 45 jdk.internal.vm.vector.VectorSupport::convert (36 bytes) failed to inline (intrinsic) > @ 34 java.lang.Object::getClass (0 bytes) (intrinsic) > @ 54 jdk.internal.vm.vector.VectorSupport::convert (36 bytes) failed to inline (intrinsic) > @ 17 java.lang.Object::getClass (0 bytes) (intrinsic) > @ 24 java.lang.Object::getClass (0 bytes) (intrinsic) > @ 45 jdk.internal.vm.vector.VectorSupport::convert (36 bytes) (intrinsic) > @ 292 java.lang.Object::getClass (0 bytes) (intrinsic) > @ 298 java.lang.Object::getClass (0 bytes) (intrinsic) > @ 322 jdk.internal.vm.vector.VectorSupport::convert (36 bytes) (intrinsic) > @ 292 java.lang.Object::getClass (0 bytes) (intrinsic) > @ 298 java.lang.Object::getClass (0 bytes) (intrinsic) > @ 322 jdk.internal.vm.vector.VectorSupport::convert (36 bytes) (intrinsic) > @ 16 jdk.internal.vm.vector.VectorSupport::extract (35 bytes) (intrinsic) > [time] 386ms [res]3392 > CPROMPT>export JAVA_HOME=/home/jatinbha/softwares/jdk-20/ > CPROMPT>export PATH=$JAVA_HOME/bin:$PATH > CPROMPT>javad --add-modules=jdk.incubator.vector -XX:UseAVX=1 -XX:+PrintIntrinsics 
-XX:CompileCommand=compileonly,shufflef::micro -cp . shufflef > CompileCommand: compileonly shufflef.micro bool compileonly = true > WARNING: Using incubator modules: jdk.incubator.vector > @ 3 jdk.internal.misc.Unsafe::loadFence (5 bytes) (intrinsic) > @ 3 jdk.internal.misc.Unsafe::loadFence (5 bytes) (intrinsic) > @ 17 jdk.internal.vm.vector.VectorSupport::shuffleToVector (33 bytes) (intrinsic) > @ 292 java.lang.Object::getClass (0 bytes) (intrinsic) > @ 298 java.lang.Object::getClass (0 bytes) (intrinsic) > @ 322 jdk.internal.vm.vector.VectorSupport::convert (36 bytes) (intrinsic) > @ 16 jdk.internal.vm.vector.VectorSupport::extract (35 bytes) (intrinsic) > [time] 7ms [res]3392 @jatin-bhateja Since `Float256Shuffle` is represented as a 256-bit int vector, which is not supported by AVX1, the compiled code falls back to Java implementation, which explains the regression. However, having a `VectorShuffle` but not for `Vector::rearrange` is not really useful, and the code snippet is similar to `ShortVector.SPECIES_128.iotaShuffle(0, 1, true).toVector().reinterpretAsShorts().lane(1)`. As a result, I think having some regressions in edge cases of AVX1 is acceptable in contrast with the improvement in all other operations on all platforms. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13093#discussion_r1162555106 From thartmann at openjdk.org Tue Apr 11 09:49:37 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 11 Apr 2023 09:49:37 GMT Subject: RFR: JDK-8305484: Compiler::init_c1_runtime unnecessarily uses an Arena that lives for the lifetime of the process [v2] In-Reply-To: <6fLO1cz_N50WWJ_qEPVtAyXdVwrwKuzKNI_hkEr-3kg=.9743f6fb-0eda-4ad7-bce6-3c5c78c1a129@github.com> References: <6fLO1cz_N50WWJ_qEPVtAyXdVwrwKuzKNI_hkEr-3kg=.9743f6fb-0eda-4ad7-bce6-3c5c78c1a129@github.com> Message-ID: <8CEvFxq0MCzx6ImrdR7ZSyA9Qxj0_IKULkRCDvJTfw0=.e9f6b96c-4fb2-4664-af37-df14de4d9d3e@github.com> On Mon, 3 Apr 2023 15:41:56 GMT, Justin King wrote: >> Remove unnecessary usage of Arena for globals that live the lifetime of the process and are initialized once. > > Justin King has updated the pull request incrementally with three additional commits since the last revision: > > - Remove now unused include > > Signed-off-by: Justin King > - Remove incorrect comment > > Signed-off-by: Justin King > - Fix typo > > Signed-off-by: Justin King Looks reasonable to me. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13300#pullrequestreview-1378851697 From thartmann at openjdk.org Tue Apr 11 09:51:35 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 11 Apr 2023 09:51:35 GMT Subject: RFR: 8303466: C2: failed: malformed control flow. Limit type made precise with MaxL/MinL [v4] In-Reply-To: References: Message-ID: On Tue, 11 Apr 2023 09:37:28 GMT, Emanuel Peter wrote: >> **Context** >> >> During `PhaseIdealLoop::do_unroll`, we hack the loop-limit, and subtract `stride` from it. We have to prevent underflow on that subtract. Currently, we do this with a `CMoveI`. The problem with this: `CMoveI` is not smart enough to generate a precise type. 
For example, there are many cases where the input types get better, and underflow is not possible anymore. But the `CMoveI` does not detect this, and still has type `min_jint..hi`. >> >> We have the same issue in `PhaseIdealLoop::adjust_limit`, where we use `CMoveL` to implement long max/min. The types are not as precise as they could and should be. >> >> **Problem** >> >> The imprecise type is used for the zero-trip-guard. It does not fold to false, even though the data-path into the post loop does constant fold to `TOP`. The graph breaks, and assert `malformed control flow` triggers. >> >> Details: In these cases, we have the super-unrolled main-loop (SuperWord'ed, then further unrolled) directly leading to a vectorized post-loop. The effect is that there is no `region/phi` merging main-exit and main-zero-trip-guard. So the types are already more narrow here. It may be possible that the values are such that we find out that we should never enter the vectorized post-loop. But if data finds out and control does not, we get a broken graph. >> Note: we have pre-loop. Then a main-loop and vectorized post loop. Then we merge the main-zero-trip-guard. And at the end we have the scalar post loop. >> >> I have already recently fixed a bug around this `CMoveI`. https://github.com/openjdk/jdk/commit/5a4945c0d95423d0ab07762c915e9cb4d3c66abb I would now like to have a more satisfactory fix, that properly propagates the types. >> >> **Solution** >> >> `PhaseIdealLoop::adjust_limit` already converts the limit from int to long, and does all computations in long, including taking max/min with a `CMoveL`. I now use the so far unused `MaxL/MinL`. I implemented some missing `Value/Identity` components for it. Since `MaxL/MinL` is not implemented in the backend, I just expand it in macro-expansion to a `CMoveL`. At that point the loop-opts are over, and it is most likely ok that we do not make the types more precise after this. 
>> >> I take the same approach for `PhaseIdealLoop::do_unroll`: convert limits to long, do subtraction in long, take `MinL/MaxL` to clamp it to the int-range (prevent subtraction underflow). >> >> **Discussion** >> >> This solution seems much cleaner to me, and I hope that we will see less bugs because of imprecise types in the limit computation, which were often due to the `CMove` not being smart enough to analyze all inputs (it would have to recognize a multitude of patterns, for the Cmp inputs and the direct inputs to the CMove - we currently do not do that, but just take the union of the input types - this is very inprecise). >> >> There is a bit of an overhead here: We use longs even though we only want to have int values. But I think we should prefer a clean implementation here, with correct type computation. The performance impact is probably non-existent on 64-bit machines anyway. >> >> **Caveat** >> >> I found some cases with the same assert `malformed control flow` that are most likely skeleton/assertion predicate bugs [JDK-8288981](https://bugs.openjdk.org/browse/JDK-8288981). Some of those cases were new patterns, for example where we PreMainPost a main loop. >> >> I hope that this fix here at least reduces the frequency of failures significantly. >> >> **Testing** >> >> I added 2 regression tests. Our fuzzer seems to spit out examples regularly, so that gives us extra coverage. >> >> Tested up to `tier5` and stress testing. Performance testing **running...** >> >> **Future Work** >> >> We should implement `MaxL/MinL` in the backend. We should also use them during parsing. This would also allow to `SuperWord` the instruction, on the platforms that support it. >> >> Should we add such an assert during IGVN? I think after IGVN, we should never have a `MultiBranchNode` that does not have the required number of outputs, right? We could add it to `VerifyIterativeGVN`. 
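The clamping idea described in the **Solution** section can be sketched in plain Java rather than in C2 IR. This is only an illustration under stated assumptions: the class and method names (`UnrollLimitSketch`, `adjustedLimit`) are hypothetical and do not appear in the patch, and `Math.max`/`Math.min` on `long` stand in for the `MaxL/MinL` nodes.

```java
// Hypothetical illustration, not code from the patch.
final class UnrollLimitSketch {
    // Computes limit - stride without wrapping: the subtraction is done in
    // long (where it cannot overflow for int inputs), then the result is
    // clamped back into the int range. Math.max/Math.min on long play the
    // role that MaxL/MinL play in the IR.
    static int adjustedLimit(int limit, int stride) {
        long wide = (long) limit - (long) stride;
        long clamped = Math.max(Integer.MIN_VALUE, Math.min(Integer.MAX_VALUE, wide));
        return (int) clamped;
    }

    public static void main(String[] args) {
        System.out.println(adjustedLimit(100, 8));                   // ordinary case
        System.out.println(adjustedLimit(Integer.MIN_VALUE + 3, 8)); // clamped to the int minimum
        System.out.println((Integer.MIN_VALUE + 3) - 8);             // plain int subtraction wraps around
    }
}
```

With plain `int` arithmetic the third expression wraps to a large positive value, which is the underflow the clamp is meant to prevent.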
> > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > Review suggestion by Tobias Hartmann > > Co-authored-by: Tobias Hartmann Looks good to me. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13269#pullrequestreview-1378854862 From duke at openjdk.org Tue Apr 11 10:10:48 2023 From: duke at openjdk.org (Daohan Qu) Date: Tue, 11 Apr 2023 10:10:48 GMT Subject: RFR: 8305324: C2: Wrong execution of vectorizing Interger.reverseBytes [v2] In-Reply-To: References: Message-ID: > This patch should fix [JDK-8305324](https://bugs.openjdk.org/browse/JDK-8305324). > > `SuperWord::compute_vector_element_type()` implemented in `jdk/src/hotspot/share/opto/superword.cpp` propagates backward a narrower integer type when the upper bits of the value are not needed. However, `Integer.reverseBytes()` depends on higher-order bits of an integer and should be prevented from being narrowed and vectorized. Instead, it needs to be treated like `Math.abs()` (which is represented by `Op_AbsI` in the following code). > > https://github.com/openjdk/jdk/blob/0243da2e4adc1b7ab6fcd5b10778532101158dce/src/hotspot/share/opto/superword.cpp#L3935-L3945 > > I have tested this patch for tier 1-3 on x86-64. 
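Why `Integer.reverseBytes()` must not be narrowed can be seen with a small scalar experiment. This is a sketch, not the regression test from the PR; the class and method names are hypothetical. Two ints that agree in their low 16 bits give the same `short` after an addition, but not after a byte reversal, because the reversal moves the high-order bytes into the low half.

```java
// Hypothetical illustration, not the jtreg test added by the PR.
final class NarrowingSketch {
    // The low 16 bits of v + 1 depend only on the low 16 bits of v,
    // so computing this in a narrower type is safe.
    static short addThenNarrow(int v) {
        return (short) (v + 1);
    }

    // The low 16 bits of reverseBytes(v) come from the HIGH 16 bits of v,
    // so the operation cannot be performed on a narrowed input.
    static short reverseThenNarrow(int v) {
        return (short) Integer.reverseBytes(v);
    }

    public static void main(String[] args) {
        int a = 0x12345678;
        int b = 0x00005678; // same low 16 bits as a, different high bits
        System.out.println(addThenNarrow(a) == addThenNarrow(b));         // true
        System.out.println(reverseThenNarrow(a) == reverseThenNarrow(b)); // false
    }
}
```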
Daohan Qu has updated the pull request incrementally with one additional commit since the last revision: Rename the jtreg test according to a comment ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13406/files - new: https://git.openjdk.org/jdk/pull/13406/files/8cad83db..cafec83b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13406&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13406&range=00-01 Stats: 112 lines in 2 files changed: 56 ins; 56 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/13406.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13406/head:pull/13406 PR: https://git.openjdk.org/jdk/pull/13406 From duke at openjdk.org Tue Apr 11 11:26:39 2023 From: duke at openjdk.org (Daohan Qu) Date: Tue, 11 Apr 2023 11:26:39 GMT Subject: RFR: 8305324: C2: Wrong execution of vectorizing Interger.reverseBytes [v3] In-Reply-To: References: Message-ID: > This patch should fix [JDK-8305324](https://bugs.openjdk.org/browse/JDK-8305324). > > `SuperWord::compute_vector_element_type()` implemented in `jdk/src/hotspot/share/opto/superword.cpp` propagates backward a narrower integer type when the upper bits of the value are not needed. However, `Integer.reverseBytes()` depends on higher-order bits of an integer and should be prevented from being narrowed and vectorized. Instead, it needs to be treated like `Math.abs()` (which is represented by `Op_AbsI` in the following code). > > https://github.com/openjdk/jdk/blob/0243da2e4adc1b7ab6fcd5b10778532101158dce/src/hotspot/share/opto/superword.cpp#L3935-L3945 > > I have tested this patch for tier 1-3 on x86-64. 
Daohan Qu has updated the pull request incrementally with one additional commit since the last revision: Use a function in branch condition as suggested ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13406/files - new: https://git.openjdk.org/jdk/pull/13406/files/cafec83b..7d90bfaa Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13406&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13406&range=01-02 Stats: 24 lines in 3 files changed: 17 ins; 0 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/13406.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13406/head:pull/13406 PR: https://git.openjdk.org/jdk/pull/13406 From duke at openjdk.org Tue Apr 11 11:26:57 2023 From: duke at openjdk.org (Daohan Qu) Date: Tue, 11 Apr 2023 11:26:57 GMT Subject: RFR: 8305324: C2: Wrong execution of vectorizing Interger.reverseBytes [v3] In-Reply-To: References: Message-ID: On Tue, 11 Apr 2023 09:23:37 GMT, Daohan Qu wrote: >> src/hotspot/share/opto/superword.cpp line 3944: >> >>> 3942: const Type* vt = vtn; >>> 3943: int op = in->Opcode(); >>> 3944: if (VectorNode::is_shift_opcode(op) || op == Op_AbsI || op == Op_ReverseBytesI) { >> >> (another suggestion) This list may be longer and longer as we vectorize more operations. Shall we move this check into a static function of `VectorNode`, like `VectorNode::requires_higher_order_bits(op)`, and put the comments inside? > > Good suggestion, I think a function can make this condition self-documented. Thank you. I have made the changes but I leave the updated comments here. This comment seems to be more related to the context here instead of that new function. Could you please review it again? Thanks. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13406#discussion_r1162671106 From duke at openjdk.org Tue Apr 11 11:48:26 2023 From: duke at openjdk.org (Daohan Qu) Date: Tue, 11 Apr 2023 11:48:26 GMT Subject: RFR: 8305324: C2: Wrong execution of vectorizing Interger.reverseBytes [v4] In-Reply-To: References: Message-ID: <7Jd0AuPrzrUbfkT5PkQh1PIa8qCGsFVg9lWiqPLZ-gg=.b39299a9-4384-475f-b81f-09eccd60bd02@github.com> > This patch should fix [JDK-8305324](https://bugs.openjdk.org/browse/JDK-8305324). > > `SuperWord::compute_vector_element_type()` implemented in `jdk/src/hotspot/share/opto/superword.cpp` propagates backward a narrower integer type when the upper bits of the value are not needed. However, `Integer.reverseBytes()` depends on higher-order bits of an integer and should be prevented from being narrowed and vectorized. Instead, it needs to be treated like `Math.abs()` (which is represented by `Op_AbsI` in the following code). > > https://github.com/openjdk/jdk/blob/0243da2e4adc1b7ab6fcd5b10778532101158dce/src/hotspot/share/opto/superword.cpp#L3935-L3945 > > I have tested this patch for tier 1-3 on x86-64. 
Daohan Qu has updated the pull request incrementally with one additional commit since the last revision: Update full name ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13406/files - new: https://git.openjdk.org/jdk/pull/13406/files/7d90bfaa..2d92210e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13406&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13406&range=02-03 Stats: 0 lines in 0 files changed: 0 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/13406.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13406/head:pull/13406 PR: https://git.openjdk.org/jdk/pull/13406 From stuefe at openjdk.org Tue Apr 11 14:12:37 2023 From: stuefe at openjdk.org (Thomas Stuefe) Date: Tue, 11 Apr 2023 14:12:37 GMT Subject: RFR: JDK-8305782: Provide MacroAssembler::breakpoint on aarch64 [v2] In-Reply-To: <2GMaeVAv8EfUb9OdDDKByBFuFF8gs13gtccwGCUKu10=.22cd7186-6cd4-4ccc-a358-cfe3004eb155@github.com> References: <2GMaeVAv8EfUb9OdDDKByBFuFF8gs13gtccwGCUKu10=.22cd7186-6cd4-4ccc-a358-cfe3004eb155@github.com> Message-ID: <7ns_ObOwKgL8H2TlGhh8PWEQQo3HQqyJlQIDlEhFDBw=.7eafe449-1928-465c-aaec-3c49d1b74dd2@github.com> On Tue, 11 Apr 2023 08:54:24 GMT, Andrew Haley wrote: >> Thomas Stuefe has updated the pull request incrementally with one additional commit since the last revision: >> >> reuse Assembler::brk > > src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp line 143: > >> 141: } >> 142: >> 143: void LIR_Assembler::breakpoint() { __ breakpoint(); } > > Suggestion: > > void LIR_Assembler::breakpoint() { __ brk(0); } > > > I don't think we need the rest of this patch. "Assembler::breakpoint" does not require me to remember the concrete trap instruction on the platform I'm on. Otherwise, I need to remember brk(), illtrap(), emit(0xCC), ... etc. But right now, the only platform implementing this is Arm. I don't have a lot of emotions here though. This is just a debugging aid. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13401#discussion_r1162879467 From rrich at openjdk.org Tue Apr 11 14:25:32 2023 From: rrich at openjdk.org (Richard Reingruber) Date: Tue, 11 Apr 2023 14:25:32 GMT Subject: RFR: 8305668: PPC: Non-Top Interpreted frames should be independent of ABI_ELFv2 In-Reply-To: References: Message-ID: <-PvmUW02zjXLVrTELOX6IJ7mtthBK6ryAxtyd-57LtQ=.1caa1d2d-4fcb-4492-ae81-99a22035ab3f@github.com> On Thu, 6 Apr 2023 13:22:49 GMT, Richard Reingruber wrote: > This PR makes parent interpreted Java frames independent of `ABI_ELFv2`. > With the changes `test/jdk/jdk/internal/vm/Continuation/BasicExt.java#COMP_ALL` succeeds on PPC64 Big Endian Linux. > > Before: > > * `parent_ijava_frame_abi` was derived from `abi_minframe` which depends on `ABI_ELFv2` > * jit_abi is independent of `abi_minframe` > * `frame::metadata_words` is wrong for `parent_ijava_frame_abi` if `ABI_ELFv2` is not defined (big endian) > > After changes: > > * prefixed structs that depend on `ABI_ELFv2` with `native_` > * introduced `java_abi` which is independent of `ABI_ELFv2` > * `frame::metadata_words` is the size in words of `java_abi` > > This is still a little imprecise since `top_ijava_frame_abi` is larger than `java_abi` but the top frame is never frozen as it is always `vmIntrinsics::_Continuation_doYield` > > Testing: > > PPC64le: most JCK and JTREG tiers 1-4, also in Xcomp mode. 
> PPC64be Linux: hotspot tier1 tests Going back to draft because the following fails with this PR: time make test TEST=test/hotspot/jtreg/serviceability/jvmti/thread/GetThreadState/thrstat03 ------------- PR Comment: https://git.openjdk.org/jdk/pull/13372#issuecomment-1503460257 From never at openjdk.org Tue Apr 11 14:59:45 2023 From: never at openjdk.org (Tom Rodriguez) Date: Tue, 11 Apr 2023 14:59:45 GMT Subject: RFR: 8305419: JDK-8301995 broke building libgraal In-Reply-To: References: Message-ID: On Fri, 7 Apr 2023 19:41:45 GMT, Tom Rodriguez wrote: > There were some minor mismatches in where the indy index was passed and where the real constant pool index was used. There's no general test of `loadReferencedType` but I modified TestDynamicConstant to exercise it. The arguments to the invokedynamic in that test were mildly broken as well, which was only exposed once we tried to resolve it. Thanks for the reviews. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13392#issuecomment-1503527915 From never at openjdk.org Tue Apr 11 14:59:46 2023 From: never at openjdk.org (Tom Rodriguez) Date: Tue, 11 Apr 2023 14:59:46 GMT Subject: Integrated: 8305419: JDK-8301995 broke building libgraal In-Reply-To: References: Message-ID: On Fri, 7 Apr 2023 19:41:45 GMT, Tom Rodriguez wrote: > There were some minor mismatches in where the indy index was passed and where the real constant pool index was used. There's no general test of `loadReferencedType` but I modified TestDynamicConstant to exercise it. The arguments to the invokedynamic in that test were mildly broken as well, which was only exposed once we tried to resolve it. This pull request has now been integrated.
Changeset: 12946f57 Author: Tom Rodriguez URL: https://git.openjdk.org/jdk/commit/12946f5748c819f436e9d16a150313656d059ec2 Stats: 53 lines in 4 files changed: 24 ins; 17 del; 12 mod 8305419: JDK-8301995 broke building libgraal Reviewed-by: matsaave, dnsimon, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/13392 From kvn at openjdk.org Tue Apr 11 16:08:44 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 11 Apr 2023 16:08:44 GMT Subject: RFR: 8305324: C2: Wrong execution of vectorizing Interger.reverseBytes [v4] In-Reply-To: <7Jd0AuPrzrUbfkT5PkQh1PIa8qCGsFVg9lWiqPLZ-gg=.b39299a9-4384-475f-b81f-09eccd60bd02@github.com> References: <7Jd0AuPrzrUbfkT5PkQh1PIa8qCGsFVg9lWiqPLZ-gg=.b39299a9-4384-475f-b81f-09eccd60bd02@github.com> Message-ID: On Tue, 11 Apr 2023 11:48:26 GMT, Daohan Qu wrote: >> This patch should fix [JDK-8305324](https://bugs.openjdk.org/browse/JDK-8305324). >> >> `SuperWord::compute_vector_element_type()` implemented in `jdk/src/hotspot/share/opto/superword.cpp` propagates backward a narrower integer type when the upper bits of the value are not needed. However, `Integer.reverseBytes()` depends on higher-order bits of an integer and should be prevented from being narrowed and vectorized. Instead, it needs to be treated like `Math.abs()` (which is represented by `Op_AbsI` in the following code). >> >> https://github.com/openjdk/jdk/blob/0243da2e4adc1b7ab6fcd5b10778532101158dce/src/hotspot/share/opto/superword.cpp#L3935-L3945 >> >> I have tested this patch for tier 1-3 on x86-64. > > Daohan Qu has updated the pull request incrementally with one additional commit since the last revision: > > Update full name src/hotspot/share/opto/vectornode.cpp line 440: > 438: // of its higher order bits/bytes > 439: bool VectorNode::requires_higher_order_bits_of_integer(int opc) { > 440: if (is_shift_opcode(opc) && opc != Op_LShiftI) { Checking Op_LShiftI here will change behavior for `Short s = LShiftI(LoadB)` case and similar. 
May be it is okay and previous code worked because we did not vectorize due to different sizes of destination and Load. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13406#discussion_r1163041647 From kvn at openjdk.org Tue Apr 11 16:10:34 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 11 Apr 2023 16:10:34 GMT Subject: RFR: JDK-8305484: Compiler::init_c1_runtime unnecessarily uses an Arena that lives for the lifetime of the process [v2] In-Reply-To: <6fLO1cz_N50WWJ_qEPVtAyXdVwrwKuzKNI_hkEr-3kg=.9743f6fb-0eda-4ad7-bce6-3c5c78c1a129@github.com> References: <6fLO1cz_N50WWJ_qEPVtAyXdVwrwKuzKNI_hkEr-3kg=.9743f6fb-0eda-4ad7-bce6-3c5c78c1a129@github.com> Message-ID: On Mon, 3 Apr 2023 15:41:56 GMT, Justin King wrote: >> Remove unnecessary usage of Arena for globals that live the lifetime of the process and are initialized once. > > Justin King has updated the pull request incrementally with three additional commits since the last revision: > > - Remove now unused include > > Signed-off-by: Justin King > - Remove incorrect comment > > Signed-off-by: Justin King > - Fix typo > > Signed-off-by: Justin King It is good for me too. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13300#pullrequestreview-1379639454 From kvn at openjdk.org Tue Apr 11 16:20:36 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 11 Apr 2023 16:20:36 GMT Subject: RFR: 8303466: C2: failed: malformed control flow. Limit type made precise with MaxL/MinL [v4] In-Reply-To: References: Message-ID: On Tue, 11 Apr 2023 07:36:58 GMT, Emanuel Peter wrote: > I think in many cases, the type does not underflow, and the `MaxL/MinL` can be removed completely. I would disagree. In many cases `limit` is variable. At least I want you to check generated code for such case and add it to your test. 
Even if additional latency of several CMov (to which you convert Max/Min nodes) vs one node is negligible, the bigger size of generated code may affect inlining. > TLDR: @vnkozlov is it ok if I investigate & test `MaxL/MinL` and `ConvI2L / ConvL2I` folding in a follow-up RFE? That depends on the result of the investigation of the generated code. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13269#issuecomment-1503701465 From duke at openjdk.org Tue Apr 11 17:10:38 2023 From: duke at openjdk.org (Daohan Qu) Date: Tue, 11 Apr 2023 17:10:38 GMT Subject: RFR: 8305324: C2: Wrong execution of vectorizing Interger.reverseBytes [v4] In-Reply-To: References: <7Jd0AuPrzrUbfkT5PkQh1PIa8qCGsFVg9lWiqPLZ-gg=.b39299a9-4384-475f-b81f-09eccd60bd02@github.com> Message-ID: On Tue, 11 Apr 2023 16:05:24 GMT, Vladimir Kozlov wrote: >> Daohan Qu has updated the pull request incrementally with one additional commit since the last revision: >> >> Update full name > > src/hotspot/share/opto/vectornode.cpp line 440: > >> 438: // of its higher order bits/bytes >> 439: bool VectorNode::requires_higher_order_bits_of_integer(int opc) { >> 440: if (is_shift_opcode(opc) && opc != Op_LShiftI) { > > Checking Op_LShiftI here will change behavior for `Short s = LShiftI(LoadB)` case and similar. Maybe it is okay and previous code worked because we did not vectorize due to different sizes of destination and Load. Yes, I see. AFAICS, the if-condition calling this function wants to check whether higher order bits are needed. So I distilled the condition content into a function. The `Op_LShiftI` is excluded since it doesn't need such info. Am I missing something? Or should the if-condition have checked more than what I thought? Thanks.
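For readers following the thread, here is a minimal sketch of why `Integer.reverseBytes` needs the higher-order bits of its input while a left shift does not. It is illustrative only and not part of the patch; the class and method names are hypothetical, and it models the narrowing with a plain cast to `short` rather than with C2's vector element types.

```java
// Hypothetical demo class; not taken from the JDK sources or the PR.
final class HighBitsDemo {
    // Reverse the bytes of the full 32-bit value, then keep the low 16 bits.
    static short reverseThenNarrow(int x) {
        return (short) Integer.reverseBytes(x);
    }

    // Shift left, then keep the low 16 bits.
    static short shiftThenNarrow(int x, int k) {
        return (short) (x << k);
    }

    public static void main(String[] args) {
        int a = 0x12345678;
        int b = 0x00005678; // same low 16 bits as a, different high 16 bits

        // The low 16 bits of the reversed value come from the high bytes of the input.
        System.out.println(reverseThenNarrow(a) == reverseThenNarrow(b)); // prints false

        // The low 16 bits of a left shift depend only on the low 16 bits of the input.
        System.out.println(shiftThenNarrow(a, 3) == shiftThenNarrow(b, 3)); // prints true
    }
}
```

Under these assumptions, an input narrowed before the byte reversal would yield a different result, whereas narrowing the input of the left shift would not.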
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13406#discussion_r1163107639 From jkarthikeyan at openjdk.org Tue Apr 11 17:14:34 2023 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Tue, 11 Apr 2023 17:14:34 GMT Subject: RFR: 8051725: Questionable if-conversion involving SETNE In-Reply-To: References: Message-ID: On Wed, 5 Apr 2023 04:52:14 GMT, Jasmine Karthikeyan wrote: > Hi, I've created optimizations for the x86 lowering of `Conv2B` nodes, when followed immediately by an xor of 1. This pattern is fairly common, and can arise from both [cmov idealization](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/movenode.cpp#L241) and [diamond-phi optimization](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L1571). The optimization here is using the `sete` instruction instead of always using `setne` and flipping the bit with xor afterwards. According to the Intel optimization guide (pages 3-26 and 3-27), this sequence is preferred over `cmp $0, %src` as it prevents the need to encode the constant in the assembly sequence. A similar rule exists in the PPC backend, here: https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/ppc/ppc.ad#L10462. I've attached some performance testing but I think the real world improvements will be less significant; the motivation is primarily to decrease the number of instructions that are generated, as that can help in cases where applications are I-Cache bound.
>
>                                                 Baseline                Patch          Improvement
> Benchmark                      Mode  Cnt   Score   Error Units    Score   Error Units
> Conv2BRules.testEquals0        avgt   10  47.566 ± 0.346 ns/op / 37.904 ± 1.856 ns/op  + 22.6%
> Conv2BRules.testNotEquals0     avgt   10  37.167 ± 0.211 ns/op / 37.352 ± 1.529 ns/op  (unchanged)
> Conv2BRules.testEquals1        avgt   10  35.059 ± 0.280 ns/op / 34.847 ± 0.160 ns/op  (unchanged)
> Conv2BRules.testEqualsNull     avgt   10  56.768 ± 2.600 ns/op / 46.916 ± 0.308 ns/op  + 19.0%
> Conv2BRules.testNotEqualsNull  avgt   10  47.447 ± 1.193 ns/op / 46.974 ± 0.218 ns/op  (unchanged)
>
> This change also cleans up some code relating to `Assembler::set_byte_if_not_zero`, as that function duplicates behavior with `Assembler::setne`. The 32-bit only version of that method is never called as the only other usage is in the C1 LIR assembler, which is also guarded behind a 64-bit check so I opted to remove it entirely and replace usages with `Assembler::setne`. Reviews would be greatly appreciated! > > Testing: tier1-2 on linux x64, GHA As far as I know, `Conv2B` is a special-case convert node where it does either `c == 0 ? 0 : 1` or `p == null ? 0 : 1`. I think an advantage of Conv2B over CMove here is that the Conv2B has more specialized rules for value() and identity(), so it can prune more types of inputs than an equivalent CMove can. I agree that it's not great having to shuffle the node from the middle-end to the backend, but I think it's still helpful as it can remove some dead bools that CMove wouldn't be able to. Hope this clarifies a bit! ------------- PR Comment: https://git.openjdk.org/jdk/pull/13345#issuecomment-1503794159 From jkarthikeyan at openjdk.org Tue Apr 11 17:41:32 2023 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Tue, 11 Apr 2023 17:41:32 GMT Subject: RFR: 8305787: Wrong debugging information printed with TraceOptoOutput [v2] In-Reply-To: References: Message-ID: > This patch fixes a minor bug in adlc where the wrong resource names are printed when the flag TraceOptoOutput is enabled to debug instruction scheduling. > As an example, the output: > > *** Bundle: 1 instr, resources: D0 BR > 126 salI_rReg_imm === _ 240 |271 [[ 127 125 ]] #5/0x00000005 > > states that the bundle is using resources D0 and BR, but the second resource used is actually ALU0.
> > The issue is caused because `pipeline->_rescount` is only incremented for discrete resources [(here)](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/adlc/adlparse.cpp#L1612), resources specified without `=`. However, the list of names is added to for *all* resources [(here)](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/adlc/adlparse.cpp#L1652), so using `_rescount` to index the names causes it to go out of sync. The fix is found in [output_h.cpp](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/adlc/output_h.cpp#L2231), where it uses the iterator to go through all the resources and use only the ones that are discrete. I applied that fix to this case, and also fixed the other instances of this bug. Reviews on this fix would be appreciated! Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: Update copyright years ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13403/files - new: https://git.openjdk.org/jdk/pull/13403/files/99613b46..79e3f744 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13403&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13403&range=00-01 Stats: 3 lines in 3 files changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/13403.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13403/head:pull/13403 PR: https://git.openjdk.org/jdk/pull/13403 From jbhateja at openjdk.org Tue Apr 11 17:50:45 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 11 Apr 2023 17:50:45 GMT Subject: RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v7] In-Reply-To: References: Message-ID: On Fri, 7 Apr 2023 17:13:50 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch reimplements `VectorShuffle` implementations to be a vector of the bit type. Currently, VectorShuffle is stored as a byte array, and would be expanded upon usage. This poses several drawbacks: >> >> 1. 
Inefficient conversions between a shuffle and its corresponding vector. This hinders the performance when the shuffle indices are not constant and are loaded or computed dynamically. >> 2. Redundant expansions in `rearrange` operations. On all platforms, it seems that a shuffle index vector is always expanded to the correct type before executing the `rearrange` operations. >> 3. Some redundant intrinsics are needed to support this handling as well as special considerations in the C2 compiler. >> 4. Range checks are performed using `VectorShuffle::toVector`, which is inefficient for FP types since both FP conversions and FP comparisons are more expensive than the integral ones. >> >> Upon these changes, a `rearrange` can emit more efficient code: >> >> var species = IntVector.SPECIES_128; >> var v1 = IntVector.fromArray(species, SRC1, 0); >> var v2 = IntVector.fromArray(species, SRC2, 0); >> v1.rearrange(v2.toShuffle()).intoArray(DST, 0); >> >> Before: >> movabs $0x751589fa8,%r10 ; {oop([I{0x0000000751589fa8})} >> vmovdqu 0x10(%r10),%xmm2 >> movabs $0x7515a0d08,%r10 ; {oop([I{0x00000007515a0d08})} >> vmovdqu 0x10(%r10),%xmm1 >> movabs $0x75158afb8,%r10 ; {oop([I{0x000000075158afb8})} >> vmovdqu 0x10(%r10),%xmm0 >> vpand -0xddc12(%rip),%xmm0,%xmm0 # Stub::vector_int_to_byte_mask >> ; {external_word} >> vpackusdw %xmm0,%xmm0,%xmm0 >> vpackuswb %xmm0,%xmm0,%xmm0 >> vpmovsxbd %xmm0,%xmm3 >> vpcmpgtd %xmm3,%xmm1,%xmm3 >> vtestps %xmm3,%xmm3 >> jne 0x00007fc2acb4e0d8 >> vpmovzxbd %xmm0,%xmm0 >> vpermd %ymm2,%ymm0,%ymm0 >> movabs $0x751588f98,%r10 ; {oop([I{0x0000000751588f98})} >> vmovdqu %xmm0,0x10(%r10) >> >> After: >> movabs $0x751589c78,%r10 ; {oop([I{0x0000000751589c78})} >> vmovdqu 0x10(%r10),%xmm1 >> movabs $0x75158ac88,%r10 ; {oop([I{0x000000075158ac88})} >> vmovdqu 0x10(%r10),%xmm2 >> vpxor %xmm0,%xmm0,%xmm0 >> vpcmpgtd %xmm2,%xmm0,%xmm3 >> vtestps %xmm3,%xmm3 >> jne 0x00007fa818b27cb1 >> vpermd %ymm1,%ymm2,%ymm0 >> movabs $0x751588c68,%r10 ; 
{oop([I{0x0000000751588c68})} >> vmovdqu %xmm0,0x10(%r10) >> >> Please take a look and leave reviews. Thanks a lot. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > special case iotaShuffle Marked as reviewed by jbhateja (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/13093#pullrequestreview-1379800113 From jbhateja at openjdk.org Tue Apr 11 17:50:48 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 11 Apr 2023 17:50:48 GMT Subject: RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v6] In-Reply-To: References: Message-ID: <92ZNVJBTNzBNw4reI-1HMd5nvWlNdz_-0wfvGvGe5nk=.e601910e-8c04-48f4-a604-cd14e7b75ee0@github.com> On Tue, 11 Apr 2023 09:36:06 GMT, Quan Anh Mai wrote: >> Hi @merykitty , Agree with you that SPECIES_PREFERRED is preferred for vector algorithms intercepting both integral and floating point vectors. >> >> FTR, we see a perf regression with Float256 based micro now on AVX=1 targets, >> >> >> public static short micro() { >> VectorShuffle iota = FloatVector.SPECIES_256.iotaShuffle(0, 1, true); >> return iota.cast(ShortVector.SPECIES_128).toVector().reinterpretAsShorts().lane(1); >> } >> >> CPROMPT>javad --add-modules=jdk.incubator.vector -XX:UseAVX=1 -XX:+PrintIntrinsics -XX:CompileCommand=compileonly,shufflef::micro -cp . 
shufflef >> CompileCommand: compileonly shufflef.micro bool compileonly = true >> ** not supported: arity=1 op=reinterpret/1 vlen1=8 etype1=int ismask=0 >> ** not supported: arity=1 op=cast/1 vlen1=8 etype1=int ismask=0 >> @ 17 java.lang.Object::getClass (0 bytes) (intrinsic) >> @ 24 java.lang.Object::getClass (0 bytes) (intrinsic) >> @ 45 jdk.internal.vm.vector.VectorSupport::convert (36 bytes) failed to inline (intrinsic) >> @ 34 java.lang.Object::getClass (0 bytes) (intrinsic) >> @ 54 jdk.internal.vm.vector.VectorSupport::convert (36 bytes) failed to inline (intrinsic) >> @ 17 java.lang.Object::getClass (0 bytes) (intrinsic) >> @ 24 java.lang.Object::getClass (0 bytes) (intrinsic) >> @ 45 jdk.internal.vm.vector.VectorSupport::convert (36 bytes) (intrinsic) >> @ 292 java.lang.Object::getClass (0 bytes) (intrinsic) >> @ 298 java.lang.Object::getClass (0 bytes) (intrinsic) >> @ 322 jdk.internal.vm.vector.VectorSupport::convert (36 bytes) (intrinsic) >> @ 292 java.lang.Object::getClass (0 bytes) (intrinsic) >> @ 298 java.lang.Object::getClass (0 bytes) (intrinsic) >> @ 322 jdk.internal.vm.vector.VectorSupport::convert (36 bytes) (intrinsic) >> @ 16 jdk.internal.vm.vector.VectorSupport::extract (35 bytes) (intrinsic) >> [time] 386ms [res]3392 >> CPROMPT>export JAVA_HOME=/home/jatinbha/softwares/jdk-20/ >> CPROMPT>export PATH=$JAVA_HOME/bin:$PATH >> CPROMPT>javad --add-modules=jdk.incubator.vector -XX:UseAVX=1 -XX:+PrintIntrinsics -XX:CompileCommand=compileonly,shufflef::micro -cp . 
shufflef >> CompileCommand: compileonly shufflef.micro bool compileonly = true >> WARNING: Using incubator modules: jdk.incubator.vector >> @ 3 jdk.internal.misc.Unsafe::loadFence (5 bytes) (intrinsic) >> @ 3 jdk.internal.misc.Unsafe::loadFence (5 bytes) (intrinsic) >> @ 17 jdk.internal.vm.vector.VectorSupport::shuffleToVector (33 bytes) (intrinsic) >> @ 292 java.lang.Object::getClass (0 bytes) (intrinsic) >> @ 298 java.lang.Object::getClass (0 bytes) (intrinsic) >> @ 322 jdk.internal.vm.vector.VectorSupport::convert (36 bytes) (intrinsic) >> @ 16 jdk.internal.vm.vector.VectorSupport::extract (35 bytes) (intrinsic) >> [time] 7ms [res]3392 > > @jatin-bhateja Since `Float256Shuffle` is represented as a 256-bit int vector, which is not supported by AVX1, the compiled code falls back to Java implementation, which explains the regression. However, having a `VectorShuffle` but not for `Vector::rearrange` is not really useful, and the code snippet is similar to `ShortVector.SPECIES_128.iotaShuffle(0, 1, true).toVector().reinterpretAsShorts().lane(1)`. As a result, I think having some regressions in edge cases of AVX1 is acceptable in contrast with the improvement in all other operations on all platforms. Agree, this is also fixing less than 32 bit shuffle vectors case, i.e. shuffles involving Long128, Int64 and Float64 will get benefitted on x86. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13093#discussion_r1163147535 From kvn at openjdk.org Tue Apr 11 17:55:35 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 11 Apr 2023 17:55:35 GMT Subject: RFR: 8305324: C2: Wrong execution of vectorizing Interger.reverseBytes [v4] In-Reply-To: References: <7Jd0AuPrzrUbfkT5PkQh1PIa8qCGsFVg9lWiqPLZ-gg=.b39299a9-4384-475f-b81f-09eccd60bd02@github.com> Message-ID: On Tue, 11 Apr 2023 17:07:36 GMT, Daohan Qu wrote: >> src/hotspot/share/opto/vectornode.cpp line 440: >> >>> 438: // of its higher order bits/bytes >>> 439: bool VectorNode::requires_higher_order_bits_of_integer(int opc) { >>> 440: if (is_shift_opcode(opc) && opc != Op_LShiftI) { >> >> Checking Op_LShiftI here will change behavior for `Short s = LShiftI(LoadB)` case and similar. May be it is okay and previous code worked because we did not vectorize due to different sizes of destination and Load. > > Thanks for your review. Yes, I see. AFAICS, the if-condition calling this function wants to check whether higher order bits are needed. So I distill the condition content into a function. The `Op_LShiftI` is excluded since it doesn't need such info. Do I miss something? Or should the if-condition have checked more than what I thought? What I am trying to say is that before this change `vt` will be set to `velt_type(load)` even if `in` is `LShiftI` node. With your changes `vt` will stay `== vtn` if `in` is `LShiftI` node. `velt_type(load)` could be different from `vtn` and as result your change may introduce difference in code generation in other than `ReverseBytesI` cases. This needs to be tested to see if number of generated vectors is not reduced for such cases. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13406#discussion_r1163152069 From qamai at openjdk.org Tue Apr 11 18:26:32 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Tue, 11 Apr 2023 18:26:32 GMT Subject: RFR: 8051725: Questionable if-conversion involving SETNE In-Reply-To: References: Message-ID: On Tue, 11 Apr 2023 17:11:57 GMT, Jasmine Karthikeyan wrote: >> Hi, I've created optimizations for the x86 lowering of `Conv2B` nodes, when followed immediately by an xor of 1. This pattern is fairly common, and can arise from both [cmov idealization](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/movenode.cpp#L241) and [diamond-phi optimization](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L1571). The optimization here is using the `sete` instruction instead of always using `setne` and flipping the bit with xor afterwards. According to the Intel optimization guide (pages 3-26 and 3-27), this sequence is preferred over `cmp $0, %src` as it prevents the need to encode the constant in the assembly sequence. A similar rule exists in the PPC backend, here: https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/ppc/ppc.ad#L10462. I've attached some performance testing but I think the real world improvements will be less significant- the motivation is primarily to decrease the amount of instruc tions that are generated, as that can help in cases where applications are I-Cache bound. >> >> >> Baseline Patch Improvement >> Benchmark Mode Cnt Score Error Units Score Error Units >> Conv2BRules.testEquals0 avgt 10 47.566 ? 0.346 ns/op / 37.904 ? 1.856 ns/op + 22.6% >> Conv2BRules.testNotEquals0 avgt 10 37.167 ? 0.211 ns/op / 37.352 ? 1.529 ns/op (unchanged) >> Conv2BRules.testEquals1 avgt 10 35.059 ? 0.280 ns/op / 34.847 ? 0.160 ns/op (unchanged) >> Conv2BRules.testEqualsNull avgt 10 56.768 ? 2.600 ns/op / 46.916 ? 0.308 ns/op + 19.0% >> Conv2BRules.testNotEqualsNull avgt 10 47.447 ? 
1.193 ns/op / 46.974 ? 0.218 ns/op (unchanged) >> >> >> This change also cleans up some code relating to `Assembler::set_byte_if_not_zero`, as that function duplicates behavior with `Assembler::setne`. The 32-bit only version of that method is never called as the only other usage is in the C1 LIR assembler, which is also guarded behind an 64-bit check so I opted to remove it entirely and replace usages with `Assembler::setne`. Reviews would be greatly appreciated! >> >> Testing: tier1-2 on linux x64, GHA > > As far as I know, `Conv2B` is a special-case convert node where it does either `c == 0 ? 0 : 1` or `p == null ? 0 : 1`. I think an advantage of Conv2B over CMove here is that the Conv2B has more specialized rules for value() and identity(), so it can prune more types of inputs than an equivalent Cmove can. I agree that it's not great having to shuffle the node from the middle-end to the backend, but I think it's still helpful as it can remove some dead bools that CMove wouldn't be able to. Hope this clarifies a bit! @jaskarth Yes I also think that is the case, but without advantages in the back-end maybe it's best us lowering it in macro expansion phase so the compiler can have chances to transform more primitive `CMove`, what do you think? ------------- PR Comment: https://git.openjdk.org/jdk/pull/13345#issuecomment-1503882160 From xliu at openjdk.org Tue Apr 11 18:48:22 2023 From: xliu at openjdk.org (Xin Liu) Date: Tue, 11 Apr 2023 18:48:22 GMT Subject: RFR: 8305203: Simplify trimming operation in Region::Ideal [v3] In-Reply-To: References: Message-ID: <4zSRbJEzYL1zLoRbQfHcmYtHTSZfRVrd7LBk0TWRS5w=.9dd259cf-4bb4-4d44-abd3-d671e87905d7@github.com> > This patch improves how Region::Ideal trims unreachable paths. > > 1. Don't restart from beginning. Trimming doesn't change the DU-chain. > 2. Replace DFIterator with DFIterator_Fast. The later is a raw pointer in release build. > 3. Don't call add_users_to_worklist(this) repeatly. > 4. 
~~Reduce its strength from add_users_to_worklist to > add_users_to_worklist0 because RegionNode has no special logic.~~(we can't measure any change of compilation time, so there's no point to simplify it) > > This patch also includes a cosmetic change: rename n to 'use' inside of the loop. > Otherwise, we would overshadow Node* n = in(i). Nothing wrong but harder to read. Xin Liu has updated the pull request incrementally with one additional commit since the last revision: Check outcnt() after loop. We avoid from calling add_users_to_worklist repeatly because we assume we don't delete any use of RegionNode. Assert that after loop. This patch also changes back to add_users_to_worklist. We can't measure any compilation time change. There's no point to use add_users_to_worklist0. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13238/files - new: https://git.openjdk.org/jdk/pull/13238/files/8732ee71..27f1b2ce Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13238&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13238&range=01-02 Stats: 5 lines in 1 file changed: 3 ins; 1 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/13238.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13238/head:pull/13238 PR: https://git.openjdk.org/jdk/pull/13238 From jkarthikeyan at openjdk.org Tue Apr 11 18:55:38 2023 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Tue, 11 Apr 2023 18:55:38 GMT Subject: RFR: 8051725: Questionable if-conversion involving SETNE In-Reply-To: References: Message-ID: <-wYuaT3bGddwvehTh4zMj54aCxASJaMrhtCJ_6VVao0=.4412e653-6f31-48ce-b879-77de848bbed4@github.com> On Wed, 5 Apr 2023 04:52:14 GMT, Jasmine Karthikeyan wrote: > Hi, I've created optimizations for the x86 lowering of `Conv2B` nodes, when followed immediately by an xor of 1. 
This pattern is fairly common, and can arise from both [cmov idealization](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/movenode.cpp#L241) and [diamond-phi optimization](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L1571). The optimization here is using the `sete` instruction instead of always using `setne` and flipping the bit with xor afterwards. According to the Intel optimization guide (pages 3-26 and 3-27), this sequence is preferred over `cmp $0, %src` as it prevents the need to encode the constant in the assembly sequence. A similar rule exists in the PPC backend, here: https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/ppc/ppc.ad#L10462. I've attached some performance testing but I think the real world improvements will be less significant; the motivation is primarily to decrease the number of instructions that are generated, as that can help in cases where applications are I-Cache bound.
>
>                                                 Baseline                Patch          Improvement
> Benchmark                      Mode  Cnt   Score   Error Units    Score   Error Units
> Conv2BRules.testEquals0        avgt   10  47.566 ± 0.346 ns/op / 37.904 ± 1.856 ns/op  + 22.6%
> Conv2BRules.testNotEquals0     avgt   10  37.167 ± 0.211 ns/op / 37.352 ± 1.529 ns/op  (unchanged)
> Conv2BRules.testEquals1        avgt   10  35.059 ± 0.280 ns/op / 34.847 ± 0.160 ns/op  (unchanged)
> Conv2BRules.testEqualsNull     avgt   10  56.768 ± 2.600 ns/op / 46.916 ± 0.308 ns/op  + 19.0%
> Conv2BRules.testNotEqualsNull  avgt   10  47.447 ± 1.193 ns/op / 46.974 ± 0.218 ns/op  (unchanged)
>
> This change also cleans up some code relating to `Assembler::set_byte_if_not_zero`, as that function duplicates behavior with `Assembler::setne`. The 32-bit only version of that method is never called as the only other usage is in the C1 LIR assembler, which is also guarded behind a 64-bit check so I opted to remove it entirely and replace usages with `Assembler::setne`. Reviews would be greatly appreciated!
> > Testing: tier1-2 on linux x64, GHA Oh, I had forgotten to consider the macro expansion step! That sounds reasonable to me, I'll make this change and see what the performance is like. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13345#issuecomment-1503924626 From vlivanov at openjdk.org Tue Apr 11 19:01:48 2023 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Tue, 11 Apr 2023 19:01:48 GMT Subject: RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v7] In-Reply-To: References: Message-ID: On Fri, 7 Apr 2023 17:13:50 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch reimplements `VectorShuffle` implementations to be a vector of the bit type. Currently, VectorShuffle is stored as a byte array, and would be expanded upon usage. This poses several drawbacks: >> >> 1. Inefficient conversions between a shuffle and its corresponding vector. This hinders the performance when the shuffle indices are not constant and are loaded or computed dynamically. >> 2. Redundant expansions in `rearrange` operations. On all platforms, it seems that a shuffle index vector is always expanded to the correct type before executing the `rearrange` operations. >> 3. Some redundant intrinsics are needed to support this handling as well as special considerations in the C2 compiler. >> 4. Range checks are performed using `VectorShuffle::toVector`, which is inefficient for FP types since both FP conversions and FP comparisons are more expensive than the integral ones. 
>> >> Upon these changes, a `rearrange` can emit more efficient code: >> >> var species = IntVector.SPECIES_128; >> var v1 = IntVector.fromArray(species, SRC1, 0); >> var v2 = IntVector.fromArray(species, SRC2, 0); >> v1.rearrange(v2.toShuffle()).intoArray(DST, 0); >> >> Before: >> movabs $0x751589fa8,%r10 ; {oop([I{0x0000000751589fa8})} >> vmovdqu 0x10(%r10),%xmm2 >> movabs $0x7515a0d08,%r10 ; {oop([I{0x00000007515a0d08})} >> vmovdqu 0x10(%r10),%xmm1 >> movabs $0x75158afb8,%r10 ; {oop([I{0x000000075158afb8})} >> vmovdqu 0x10(%r10),%xmm0 >> vpand -0xddc12(%rip),%xmm0,%xmm0 # Stub::vector_int_to_byte_mask >> ; {external_word} >> vpackusdw %xmm0,%xmm0,%xmm0 >> vpackuswb %xmm0,%xmm0,%xmm0 >> vpmovsxbd %xmm0,%xmm3 >> vpcmpgtd %xmm3,%xmm1,%xmm3 >> vtestps %xmm3,%xmm3 >> jne 0x00007fc2acb4e0d8 >> vpmovzxbd %xmm0,%xmm0 >> vpermd %ymm2,%ymm0,%ymm0 >> movabs $0x751588f98,%r10 ; {oop([I{0x0000000751588f98})} >> vmovdqu %xmm0,0x10(%r10) >> >> After: >> movabs $0x751589c78,%r10 ; {oop([I{0x0000000751589c78})} >> vmovdqu 0x10(%r10),%xmm1 >> movabs $0x75158ac88,%r10 ; {oop([I{0x000000075158ac88})} >> vmovdqu 0x10(%r10),%xmm2 >> vpxor %xmm0,%xmm0,%xmm0 >> vpcmpgtd %xmm2,%xmm0,%xmm3 >> vtestps %xmm3,%xmm3 >> jne 0x00007fa818b27cb1 >> vpermd %ymm1,%ymm2,%ymm0 >> movabs $0x751588c68,%r10 ; {oop([I{0x0000000751588c68})} >> vmovdqu %xmm0,0x10(%r10) >> >> Please take a look and leave reviews. Thanks a lot. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > special case iotaShuffle Nice refactoring! Happy to see so much code gone. Looks good. ------------- Marked as reviewed by vlivanov (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/13093#pullrequestreview-1379896647 From vlivanov at openjdk.org Tue Apr 11 19:06:38 2023 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Tue, 11 Apr 2023 19:06:38 GMT Subject: RFR: 8303762: [vectorapi] Intrinsification of Vector.slice [v6] In-Reply-To: References: Message-ID: On Tue, 4 Apr 2023 13:46:12 GMT, Quan Anh Mai wrote: >> `Vector::slice` is a method at the top-level class of the Vector API that concatenates the 2 inputs into an intermediate composite and extracts a window equal to the size of the inputs into the result. It is used in vector conversion methods where the part number is not 0 to slice the parts to the correct positions. Slicing is also used in text processing such as utf8 and utf16 validation. x86 starting from SSSE3 has `palignr` which does vector slicing very efficiently. As a result, I think it is beneficial to add a C2 node for this operation as well as intrinsify `Vector::slice` method. >> >> A slice is currently implemented as `v2.rearrange(iota).blend(v1.rearrange(iota), blendMask)` which requires preparation of the index vector and the blending mask. Even with the preparations being hoisted out of the loops, microbenchmarks show improvement using the slice instrinsics. Some have tremendous increases in throughput due to the limitation that a mask of length 2 cannot currently be intrinsified, leading to falling back to the Java implementations. >> >> Please take a look and have some reviews. Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > style src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ShortVector.java line 2295: > 2293: // to be performant > 2294: @ForceInline > 2295: public ShortVector apply(ShortVector v1, ShortVector v2, int o) { Have you considered matching the corresponding IR during GVN to produce VectorSlice nodes rather than going through VM intrinsic? 
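As a reading aid for the `slice` discussion above, here is a scalar sketch of the semantics described in the PR text: concatenate the two inputs and extract a window of the input length starting at the origin. It is a model under stated assumptions, not the Vector API implementation and not the proposed intrinsic; the class name and the use of plain `short[]` arrays in place of `ShortVector` are hypothetical.

```java
// Hypothetical scalar model of slicing; arrays stand in for vectors.
final class SliceModel {
    // Returns lanes [origin, origin + n) of the concatenation v1 ++ v2,
    // where n is the common length of both inputs.
    static short[] slice(short[] v1, short[] v2, int origin) {
        int n = v1.length;
        if (v2.length != n) {
            throw new IllegalArgumentException("inputs must have the same length");
        }
        if (origin < 0 || origin > n) {
            throw new IndexOutOfBoundsException("origin out of range: " + origin);
        }
        short[] result = new short[n];
        for (int i = 0; i < n; i++) {
            int j = origin + i;
            // Lanes before n come from the first input, the rest from the second.
            result[i] = (j < n) ? v1[j] : v2[j - n];
        }
        return result;
    }

    public static void main(String[] args) {
        short[] a = {1, 2, 3, 4};
        short[] b = {5, 6, 7, 8};
        System.out.println(java.util.Arrays.toString(slice(a, b, 1))); // prints [2, 3, 4, 5]
    }
}
```

With an origin of 0 the result is the first input, and with an origin equal to the length it is the second input, which is the window behavior the PR description attributes to `palignr` on x86.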
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12909#discussion_r1163216924 From jkarthikeyan at openjdk.org Tue Apr 11 19:21:34 2023 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Tue, 11 Apr 2023 19:21:34 GMT Subject: RFR: 8303466: C2: failed: malformed control flow. Limit type made precise with MaxL/MinL [v4] In-Reply-To: References: Message-ID: On Tue, 11 Apr 2023 07:36:58 GMT, Emanuel Peter wrote: > However, if that does not work, I think it now also fails to remove the repeated ConvI2L / ConvL2I. We would have to add more IGVN optimizations to fold things more. I think you're running into an issue where some nodes created by counted loop expansion aren't properly passed onto the IGVN worklist- I found the same thing while trying to investigate some strange code generation from small loops. If you make that follow-up RFE I would be happy to attach the cases that I found as well. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13269#issuecomment-1503969139 From jcking at openjdk.org Tue Apr 11 19:53:46 2023 From: jcking at openjdk.org (Justin King) Date: Tue, 11 Apr 2023 19:53:46 GMT Subject: Integrated: JDK-8305484: Compiler::init_c1_runtime unnecessarily uses an Arena that lives for the lifetime of the process In-Reply-To: References: Message-ID: On Mon, 3 Apr 2023 15:23:43 GMT, Justin King wrote: > Remove unnecessary usage of Arena for globals that live the lifetime of the process and are initialized once. This pull request has now been integrated. 
Changeset: 42fa000a Author: Justin King URL: https://git.openjdk.org/jdk/commit/42fa000a7d042e425913aab2842f8166a0c2172a Stats: 45 lines in 5 files changed: 12 ins; 5 del; 28 mod 8305484: Compiler::init_c1_runtime unnecessarily uses an Arena that lives for the lifetime of the process Reviewed-by: thartmann, kvn ------------- PR: https://git.openjdk.org/jdk/pull/13300 From kvn at openjdk.org Tue Apr 11 21:31:34 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 11 Apr 2023 21:31:34 GMT Subject: RFR: 8305203: Simplify trimming operation in Region::Ideal [v3] In-Reply-To: <4zSRbJEzYL1zLoRbQfHcmYtHTSZfRVrd7LBk0TWRS5w=.9dd259cf-4bb4-4d44-abd3-d671e87905d7@github.com> References: <4zSRbJEzYL1zLoRbQfHcmYtHTSZfRVrd7LBk0TWRS5w=.9dd259cf-4bb4-4d44-abd3-d671e87905d7@github.com> Message-ID: On Tue, 11 Apr 2023 18:48:22 GMT, Xin Liu wrote: >> This patch improves how Region::Ideal trims unreachable paths. >> >> 1. Don't restart from beginning. Trimming doesn't change the DU-chain. >> 2. Replace DFIterator with DFIterator_Fast. The later is a raw pointer in release build. >> 3. Don't call add_users_to_worklist(this) repeatly. >> 4. ~~Reduce its strength from add_users_to_worklist to >> add_users_to_worklist0 because RegionNode has no special logic.~~(we can't measure any change of compilation time, so there's no point to simplify it) >> >> This patch also includes a cosmetic change: rename n to 'use' inside of the loop. >> Otherwise, we would overshadow Node* n = in(i). Nothing wrong but harder to read. > > Xin Liu has updated the pull request incrementally with one additional commit since the last revision: > > Check outcnt() after loop. > > We avoid from calling add_users_to_worklist repeatly because we assume > we don't delete any use of RegionNode. Assert that after loop. > > This patch also changes back to add_users_to_worklist. We can't measure > any compilation time change. There's no point to use > add_users_to_worklist0. Good. 
------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13238#pullrequestreview-1380112279 From cslucas at openjdk.org Tue Apr 11 22:06:42 2023 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Tue, 11 Apr 2023 22:06:42 GMT Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v7] In-Reply-To: <4uPGi8Ulap_QoQpkL1zTZUdP-jdL_WDEkpdP7asLow4=.9047ce21-688f-4d29-a643-f9acfd4344c7@github.com> References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> <4uPGi8Ulap_QoQpkL1zTZUdP-jdL_WDEkpdP7asLow4=.9047ce21-688f-4d29-a643-f9acfd4344c7@github.com> Message-ID: On Thu, 6 Apr 2023 04:34:52 GMT, Vladimir Kozlov wrote: >> Cesar Soares Lucas has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains seven commits: >> >> - Merge with Master >> - Addressing PR review 2: refactor & reuse MacroExpand::scalar_replacement method. >> - Address PR feeedback 1: make ObjectMergeValue subclass of ObjectValue & create new IR class to represent scalarized merges. >> - Add support for SR'ing some inputs of merges used for field loads >> - Fix some typos and do some small refactorings. >> - Merge master >> - Add support for rematerializing scalar replaced objects participating in allocation merges > > src/hotspot/share/opto/output.cpp line 755: > >> 753: ciKlass* cik = t->is_oopptr()->exact_klass(); >> 754: assert(cik->is_instance_klass() || >> 755: cik->is_array_klass(), "Not supported allocation."); > > Why did the spacing change? The indentation level was incorrect before.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1163375015 From cslucas at openjdk.org Wed Apr 12 00:32:40 2023 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Wed, 12 Apr 2023 00:32:40 GMT Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v7] In-Reply-To: <4uPGi8Ulap_QoQpkL1zTZUdP-jdL_WDEkpdP7asLow4=.9047ce21-688f-4d29-a643-f9acfd4344c7@github.com> References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> <4uPGi8Ulap_QoQpkL1zTZUdP-jdL_WDEkpdP7asLow4=.9047ce21-688f-4d29-a643-f9acfd4344c7@github.com> Message-ID: On Thu, 6 Apr 2023 03:25:31 GMT, Vladimir Kozlov wrote: >> Cesar Soares Lucas has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains seven commits: >> >> - Merge with Master >> - Addressing PR review 2: refactor & reuse MacroExpand::scalar_replacement method. >> - Address PR feeedback 1: make ObjectMergeValue subclass of ObjectValue & create new IR class to represent scalarized merges. >> - Add support for SR'ing some inputs of merges used for field loads >> - Fix some typos and do some small refactorings. >> - Merge master >> - Add support for rematerializing scalar replaced objects participating in allocation merges > > src/hotspot/share/opto/escape.cpp line 633: > >> 631: >> 632: SafePointScalarMergeNode* smerge = new SafePointScalarMergeNode(merge_t, merge_idx); >> 633: smerge->init_req(0, _compile->root()); > > May be use ophi's control here, it should stay bellow merge point. Was there a reason you use `root`? To be honest, for this Node, I thought it didn't matter. I actually just used the same pattern as in PhaseMacroExpand. I'll adjust the patch as you suggested, though. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1163448361 From xliu at openjdk.org Wed Apr 12 00:41:02 2023 From: xliu at openjdk.org (Xin Liu) Date: Wed, 12 Apr 2023 00:41:02 GMT Subject: RFR: 8305203: Simplify trimming operation in Region::Ideal [v3] In-Reply-To: <4zSRbJEzYL1zLoRbQfHcmYtHTSZfRVrd7LBk0TWRS5w=.9dd259cf-4bb4-4d44-abd3-d671e87905d7@github.com> References: <4zSRbJEzYL1zLoRbQfHcmYtHTSZfRVrd7LBk0TWRS5w=.9dd259cf-4bb4-4d44-abd3-d671e87905d7@github.com> Message-ID: On Tue, 11 Apr 2023 18:48:22 GMT, Xin Liu wrote: >> This patch improves how Region::Ideal trims unreachable paths. >> >> 1. Don't restart from beginning. Trimming doesn't change the DU-chain. >> 2. Replace DFIterator with DFIterator_Fast. The later is a raw pointer in release build. >> 3. Don't call add_users_to_worklist(this) repeatly. >> 4. ~~Reduce its strength from add_users_to_worklist to >> add_users_to_worklist0 because RegionNode has no special logic.~~(we can't measure any change of compilation time, so there's no point to simplify it) >> >> This patch also includes a cosmetic change: rename n to 'use' inside of the loop. >> Otherwise, we would overshadow Node* n = in(i). Nothing wrong but harder to read. > > Xin Liu has updated the pull request incrementally with one additional commit since the last revision: > > Check outcnt() after loop. > > We avoid from calling add_users_to_worklist repeatly because we assume > we don't delete any use of RegionNode. Assert that after loop. > > This patch also changes back to add_users_to_worklist. We can't measure > any compilation time change. There's no point to use > add_users_to_worklist0. Thanks all reviewers for helping this PR. 
--lx ------------- PR Comment: https://git.openjdk.org/jdk/pull/13238#issuecomment-1504337297 From xliu at openjdk.org Wed Apr 12 00:41:02 2023 From: xliu at openjdk.org (Xin Liu) Date: Wed, 12 Apr 2023 00:41:02 GMT Subject: Integrated: 8305203: Simplify trimming operation in Region::Ideal In-Reply-To: References: Message-ID: <1DM9m8G8BEd6B-PFpbWzh_jzxAXv1bA8OdGnwdTmsVM=.9e85a73e-419f-4d2e-a59c-10b0a142b1a8@github.com> On Thu, 30 Mar 2023 05:26:08 GMT, Xin Liu wrote: > This patch improves how Region::Ideal trims unreachable paths. > > 1. Don't restart from beginning. Trimming doesn't change the DU-chain. > 2. Replace DFIterator with DFIterator_Fast. The later is a raw pointer in release build. > 3. Don't call add_users_to_worklist(this) repeatly. > 4. ~~Reduce its strength from add_users_to_worklist to > add_users_to_worklist0 because RegionNode has no special logic.~~(we can't measure any change of compilation time, so there's no point to simplify it) > > This patch also includes a cosmetic change: rename n to 'use' inside of the loop. > Otherwise, we would overshadow Node* n = in(i). Nothing wrong but harder to read. This pull request has now been integrated. 
Changeset: 82e8b033 Author: Xin Liu URL: https://git.openjdk.org/jdk/commit/82e8b0332b5313dda26688c49434837374d233d6 Stats: 28 lines in 1 file changed: 7 ins; 10 del; 11 mod 8305203: Simplify trimming operation in Region::Ideal Reviewed-by: thartmann, kvn ------------- PR: https://git.openjdk.org/jdk/pull/13238 From duke at openjdk.org Wed Apr 12 02:36:33 2023 From: duke at openjdk.org (Daohan Qu) Date: Wed, 12 Apr 2023 02:36:33 GMT Subject: RFR: 8305324: C2: Wrong execution of vectorizing Interger.reverseBytes [v4] In-Reply-To: References: <7Jd0AuPrzrUbfkT5PkQh1PIa8qCGsFVg9lWiqPLZ-gg=.b39299a9-4384-475f-b81f-09eccd60bd02@github.com> Message-ID: <795PMcUlqhSk2UXQx4nC2nJGRd4mnzbuRt9ZimJPIUk=.a05fa713-a530-4951-8829-6ba56049f7f9@github.com> On Tue, 11 Apr 2023 17:52:26 GMT, Vladimir Kozlov wrote: >> Thanks for your review. Yes, I see. AFAICS, the if-condition calling this function wants to check whether higher order bits are needed. So I distill the condition content into a function. The `Op_LShiftI` is excluded since it doesn't need such info. Do I miss something? Or should the if-condition have checked more than what I thought? > > What I am trying to say is that before this change `vt` will be set to `velt_type(load)` even if `in` is `LShiftI` node. With your changes `vt` will stay `== vtn` if `in` is `LShiftI` node. `velt_type(load)` could be different from `vtn` and as result your change may introduce difference in code generation in other than `ReverseBytesI` cases. > This needs to be tested to see if number of generated vectors is not reduced for such cases. I agree. Now that I'm not a hundred per cent sure if number of generated vectors is reduced, I'd better revert some changes. I don't want to make this new function's name misleading (as `Op_LShiftI` doesn't require higher bits info), so I'd rather remove this function. Thanks for your review! 
BTW, I notice that you added this if condition at about 2012 in `jdk8u`, do you remember why you test `is_shift` in the if condition instead of something like `is_rshift` or so?

if (same_type) {
  // For right shifts of small integer types (bool, byte, char, short)
  // we need precise information about sign-ness. Only Load nodes have
  // this information because Store nodes are the same for signed and
  // unsigned values. And any arithmetic operation after a load may
  // expand a value to signed Int so such right shifts can't be used
  // because vector elements do not have upper bits of Int.
  const Type* vt = vtn;
  if (VectorNode::is_shift(in)) {
    Node* load = in->in(1);
    if (load->is_Load() && in_bb(load) && (velt_type(load)->basic_type() == T_INT)) {
      vt = velt_type(load);
    } else if (in->Opcode() != Op_LShiftI) {
      // Widen type to Int to avoid creation of right shift vector
      // (align + data_size(s1) check in stmts_can_pack() will fail).
      // Note, left shifts work regardless type.
      vt = TypeInt::INT;
    }
  }

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/13406#discussion_r1163510555

From xgong at openjdk.org Wed Apr 12 02:41:40 2023
From: xgong at openjdk.org (Xiaohong Gong)
Date: Wed, 12 Apr 2023 02:41:40 GMT
Subject: RFR: 8305524: AArch64: Fix arraycopy issue on SVE caused by matching rule vmask_gen_sub
In-Reply-To: 
References: 
Message-ID: 

On Fri, 7 Apr 2023 07:21:05 GMT, Pengfei Li wrote:

> From recent tests, we find that `System.arraycopy()` call with a negated variable as its length argument does not perform the copy. This issue is reproducible by below test case on AArch64 platforms with SVE.
> > > public class Test { > static char[] src = {'A', 'A', 'A', 'A', 'A'}; > static char[] dst = {'B', 'B', 'B', 'B', 'B'}; > > static void copy(int nlen) { > System.arraycopy(src, 0, dst, 0, -nlen); > } > > public static void main(String[] args) { > for (int i = 0; i < 25000; i++) { > copy(0); > } > copy(-5); > for (char c : dst) { > if (c != 'A') { > throw new RuntimeException("Wrong value!"); > } > } > System.out.println("PASS"); > } > } > > /* > $ java -Xint Test > PASS > $ java -Xbatch Test > Exception in thread "main" java.lang.RuntimeException: Wrong value! > at Test.main(Test.java:16) > */ > > > Cause of this is a new AArch64 matching rule `vmask_gen_sub` introduced by JDK-8293198. It matches `VectorMaskGen (SubL src1 src2)` on AArch64 platforms with SVE and generates SVE `whilelo` instructions. Current C2 compiler uses a technique called "partial inlining" to vectorize small array copy operations by generating vector masks. In above test case, a negated variable `-nlen` is used as the length argument of the call and `-nlen` has a small positive value, so it is a "partial inlining" case. C2 will transform the ideal graph to `VectorMaskGen (SubL 0 nlen)` and eventually output an instruction of `whilelo p0, nlen, zr` which always generates an all-false vector mask. That's why arraycopy does nothing. > > The problem of that matching rule is that it regards inputs `src1` and `src2` as unsigned long integers but they can be signed in use cases of arraycopy. To fix the issue, this patch replaces `whilelo` instruction by `whilelt` in that rule as well as some other places. > > We tested tier1~3 on SVE and found no new failure. A jtreg math library test jdk/internal/math/FloatingDecimal/TestFloatingDecimal.java which fails on SVE before can pass now. Thanks for the fixing! Looks good to me. ------------- Marked as reviewed by xgong (Committer). 
PR Review: https://git.openjdk.org/jdk/pull/13382#pullrequestreview-1380346111 From duke at openjdk.org Wed Apr 12 02:48:23 2023 From: duke at openjdk.org (Daohan Qu) Date: Wed, 12 Apr 2023 02:48:23 GMT Subject: RFR: 8305324: C2: Wrong execution of vectorizing Interger.reverseBytes [v5] In-Reply-To: References: Message-ID: > This patch should fix [JDK-8305324](https://bugs.openjdk.org/browse/JDK-8305324). > > `SuperWord::compute_vector_element_type()` implemented in `jdk/src/hotspot/share/opto/superword.cpp` propagates backward a narrower integer type when the upper bits of the value are not needed. However, `Integer.reverseBytes()` depends on higher-order bits of an integer and should be prevented from being narrowed and vectorized. Instead, it needs to be treated like `Math.abs()` (which is represented by `Op_AbsI` in the following code). > > https://github.com/openjdk/jdk/blob/0243da2e4adc1b7ab6fcd5b10778532101158dce/src/hotspot/share/opto/superword.cpp#L3935-L3945 > > I have tested this patch for tier 1-3 on x86-64. 
Daohan Qu has updated the pull request incrementally with one additional commit since the last revision: Revert some changes for safety ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13406/files - new: https://git.openjdk.org/jdk/pull/13406/files/2d92210e..26fe323f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13406&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13406&range=03-04 Stats: 18 lines in 3 files changed: 0 ins; 16 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/13406.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13406/head:pull/13406 PR: https://git.openjdk.org/jdk/pull/13406 From pli at openjdk.org Wed Apr 12 03:19:53 2023 From: pli at openjdk.org (Pengfei Li) Date: Wed, 12 Apr 2023 03:19:53 GMT Subject: RFR: 8305524: AArch64: Fix arraycopy issue on SVE caused by matching rule vmask_gen_sub In-Reply-To: References: Message-ID: On Tue, 11 Apr 2023 08:50:36 GMT, Andrew Haley wrote: >> From recent tests, we find that `System.arraycopy()` call with a negated variable as its length argument does not perform the copy. This issue is reproducible by below test case on AArch64 platforms with SVE. >> >> >> public class Test { >> static char[] src = {'A', 'A', 'A', 'A', 'A'}; >> static char[] dst = {'B', 'B', 'B', 'B', 'B'}; >> >> static void copy(int nlen) { >> System.arraycopy(src, 0, dst, 0, -nlen); >> } >> >> public static void main(String[] args) { >> for (int i = 0; i < 25000; i++) { >> copy(0); >> } >> copy(-5); >> for (char c : dst) { >> if (c != 'A') { >> throw new RuntimeException("Wrong value!"); >> } >> } >> System.out.println("PASS"); >> } >> } >> >> /* >> $ java -Xint Test >> PASS >> $ java -Xbatch Test >> Exception in thread "main" java.lang.RuntimeException: Wrong value! >> at Test.main(Test.java:16) >> */ >> >> >> Cause of this is a new AArch64 matching rule `vmask_gen_sub` introduced by JDK-8293198. 
It matches `VectorMaskGen (SubL src1 src2)` on AArch64 platforms with SVE and generates SVE `whilelo` instructions. Current C2 compiler uses a technique called "partial inlining" to vectorize small array copy operations by generating vector masks. In above test case, a negated variable `-nlen` is used as the length argument of the call and `-nlen` has a small positive value, so it is a "partial inlining" case. C2 will transform the ideal graph to `VectorMaskGen (SubL 0 nlen)` and eventually output an instruction of `whilelo p0, nlen, zr` which always generates an all-false vector mask. That's why arraycopy does nothing. >> >> The problem of that matching rule is that it regards inputs `src1` and `src2` as unsigned long integers but they can be signed in use cases of arraycopy. To fix the issue, this patch replaces `whilelo` instruction by `whilelt` in that rule as well as some other places. >> >> We tested tier1~3 on SVE and found no new failure. A jtreg math library test jdk/internal/math/FloatingDecimal/TestFloatingDecimal.java which fails on SVE before can pass now. > > Marked as reviewed by aph (Reviewer). @theRealAph @XiaohongGong Thanks for your review. I will integrate this. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13382#issuecomment-1504509564 From pli at openjdk.org Wed Apr 12 03:19:55 2023 From: pli at openjdk.org (Pengfei Li) Date: Wed, 12 Apr 2023 03:19:55 GMT Subject: Integrated: 8305524: AArch64: Fix arraycopy issue on SVE caused by matching rule vmask_gen_sub In-Reply-To: References: Message-ID: On Fri, 7 Apr 2023 07:21:05 GMT, Pengfei Li wrote: > From recent tests, we find that `System.arraycopy()` call with a negated variable as its length argument does not perform the copy. This issue is reproducible by below test case on AArch64 platforms with SVE. 
>
>
> public class Test {
>     static char[] src = {'A', 'A', 'A', 'A', 'A'};
>     static char[] dst = {'B', 'B', 'B', 'B', 'B'};
>
>     static void copy(int nlen) {
>         System.arraycopy(src, 0, dst, 0, -nlen);
>     }
>
>     public static void main(String[] args) {
>         for (int i = 0; i < 25000; i++) {
>             copy(0);
>         }
>         copy(-5);
>         for (char c : dst) {
>             if (c != 'A') {
>                 throw new RuntimeException("Wrong value!");
>             }
>         }
>         System.out.println("PASS");
>     }
> }
>
> /*
> $ java -Xint Test
> PASS
> $ java -Xbatch Test
> Exception in thread "main" java.lang.RuntimeException: Wrong value!
>         at Test.main(Test.java:16)
> */
>
>
> Cause of this is a new AArch64 matching rule `vmask_gen_sub` introduced by JDK-8293198. It matches `VectorMaskGen (SubL src1 src2)` on AArch64 platforms with SVE and generates SVE `whilelo` instructions. Current C2 compiler uses a technique called "partial inlining" to vectorize small array copy operations by generating vector masks. In above test case, a negated variable `-nlen` is used as the length argument of the call and `-nlen` has a small positive value, so it is a "partial inlining" case. C2 will transform the ideal graph to `VectorMaskGen (SubL 0 nlen)` and eventually output an instruction of `whilelo p0, nlen, zr` which always generates an all-false vector mask. That's why arraycopy does nothing.
>
> The problem of that matching rule is that it regards inputs `src1` and `src2` as unsigned long integers but they can be signed in use cases of arraycopy. To fix the issue, this patch replaces `whilelo` instruction by `whilelt` in that rule as well as some other places.
>
> We tested tier1~3 on SVE and found no new failure. A jtreg math library test jdk/internal/math/FloatingDecimal/TestFloatingDecimal.java which fails on SVE before can pass now.

This pull request has now been integrated.
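
The unsigned versus signed comparison described in the quoted text can be modelled in a few lines of Java. This is only a sketch under a simplifying assumption: lane `i` of the mask is active while `first + i < limit` holds, and every lane after the first failed comparison stays inactive. It is not the architectural SVE pseudocode, and the class and method names (`WhileMaskSketch`, `whileMask`, `activeLanes`) and the lane count of 8 are made up for illustration.

```java
class WhileMaskSketch {
    // Simplified lane model (an assumption, not the SVE specification):
    // lane i is active while (first + i) < limit, compared either as
    // unsigned values (like whilelo) or as signed values (like whilelt).
    static boolean[] whileMask(long first, long limit, int lanes, boolean unsigned) {
        boolean[] mask = new boolean[lanes];
        boolean active = true;
        for (int i = 0; i < lanes; i++) {
            long value = first + i;
            boolean less = unsigned
                    ? Long.compareUnsigned(value, limit) < 0
                    : value < limit;
            active = active && less; // once a lane fails, later lanes stay off
            mask[i] = active;
        }
        return mask;
    }

    // Counts the active lanes of a mask.
    static int activeLanes(boolean[] mask) {
        int count = 0;
        for (boolean lane : mask) {
            if (lane) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        long nlen = -5; // the argument passed to copy() in the test case
        int lanes = 8;  // hypothetical number of lanes
        // Operands in the order of `whilelo p0, nlen, zr`: first = nlen, limit = 0.
        System.out.println("unsigned compare: " + activeLanes(whileMask(nlen, 0, lanes, true)));
        System.out.println("signed compare:   " + activeLanes(whileMask(nlen, 0, lanes, false)));
    }
}
```

In this model the unsigned comparison treats -5 as a very large value, so no lane is active, while the signed comparison activates five lanes, which matches the copy length of 5 in the test case.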
Changeset: b9bdbe9a Author: Pengfei Li URL: https://git.openjdk.org/jdk/commit/b9bdbe9ab3922c4dc7a754200df2fe542b11359b Stats: 62 lines in 4 files changed: 52 ins; 0 del; 10 mod 8305524: AArch64: Fix arraycopy issue on SVE caused by matching rule vmask_gen_sub Reviewed-by: aph, xgong ------------- PR: https://git.openjdk.org/jdk/pull/13382 From kvn at openjdk.org Wed Apr 12 03:30:36 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 12 Apr 2023 03:30:36 GMT Subject: RFR: 8305324: C2: Wrong execution of vectorizing Interger.reverseBytes [v5] In-Reply-To: References: Message-ID: On Wed, 12 Apr 2023 02:48:23 GMT, Daohan Qu wrote: >> This patch should fix [JDK-8305324](https://bugs.openjdk.org/browse/JDK-8305324). >> >> `SuperWord::compute_vector_element_type()` implemented in `jdk/src/hotspot/share/opto/superword.cpp` propagates backward a narrower integer type when the upper bits of the value are not needed. However, `Integer.reverseBytes()` depends on higher-order bits of an integer and should be prevented from being narrowed and vectorized. Instead, it needs to be treated like `Math.abs()` (which is represented by `Op_AbsI` in the following code). >> >> https://github.com/openjdk/jdk/blob/0243da2e4adc1b7ab6fcd5b10778532101158dce/src/hotspot/share/opto/superword.cpp#L3935-L3945 >> >> I have tested this patch for tier 1-3 on x86-64. > > Daohan Qu has updated the pull request incrementally with one additional commit since the last revision: > > Revert some changes for safety Good. ------------- Marked as reviewed by kvn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/13406#pullrequestreview-1380393532 From pli at openjdk.org Wed Apr 12 03:36:35 2023 From: pli at openjdk.org (Pengfei Li) Date: Wed, 12 Apr 2023 03:36:35 GMT Subject: RFR: 8305324: C2: Wrong execution of vectorizing Interger.reverseBytes [v5] In-Reply-To: References: Message-ID: On Wed, 12 Apr 2023 02:48:23 GMT, Daohan Qu wrote: >> This patch should fix [JDK-8305324](https://bugs.openjdk.org/browse/JDK-8305324). >> >> `SuperWord::compute_vector_element_type()` implemented in `jdk/src/hotspot/share/opto/superword.cpp` propagates backward a narrower integer type when the upper bits of the value are not needed. However, `Integer.reverseBytes()` depends on higher-order bits of an integer and should be prevented from being narrowed and vectorized. Instead, it needs to be treated like `Math.abs()` (which is represented by `Op_AbsI` in the following code). >> >> https://github.com/openjdk/jdk/blob/0243da2e4adc1b7ab6fcd5b10778532101158dce/src/hotspot/share/opto/superword.cpp#L3935-L3945 >> >> I have tested this patch for tier 1-3 on x86-64. > > Daohan Qu has updated the pull request incrementally with one additional commit since the last revision: > > Revert some changes for safety That's ok for me. We can refine the code later if the op list needs to be further extended. ------------- Marked as reviewed by pli (Committer). PR Review: https://git.openjdk.org/jdk/pull/13406#pullrequestreview-1380398239 From kvn at openjdk.org Wed Apr 12 03:54:35 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 12 Apr 2023 03:54:35 GMT Subject: RFR: 8305324: C2: Wrong execution of vectorizing Interger.reverseBytes [v5] In-Reply-To: References: Message-ID: <6SF3Gu8bGQJ4YL9yzgpyEMVvLv_C6YDwEB0V3uGEyrw=.580a358c-d6cb-431b-ab61-379c3ae87ad3@github.com> On Wed, 12 Apr 2023 02:48:23 GMT, Daohan Qu wrote: >> This patch should fix [JDK-8305324](https://bugs.openjdk.org/browse/JDK-8305324). 
>> >> `SuperWord::compute_vector_element_type()` implemented in `jdk/src/hotspot/share/opto/superword.cpp` propagates backward a narrower integer type when the upper bits of the value are not needed. However, `Integer.reverseBytes()` depends on higher-order bits of an integer and should be prevented from being narrowed and vectorized. Instead, it needs to be treated like `Math.abs()` (which is represented by `Op_AbsI` in the following code). >> >> https://github.com/openjdk/jdk/blob/0243da2e4adc1b7ab6fcd5b10778532101158dce/src/hotspot/share/opto/superword.cpp#L3935-L3945 >> >> I have tested this patch for tier 1-3 on x86-64. > > Daohan Qu has updated the pull request incrementally with one additional commit since the last revision: > > Revert some changes for safety This needs to be retested (at least tier1-3) before integration. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13406#issuecomment-1504549142 From duke at openjdk.org Wed Apr 12 03:54:34 2023 From: duke at openjdk.org (Daohan Qu) Date: Wed, 12 Apr 2023 03:54:34 GMT Subject: RFR: 8305324: C2: Wrong execution of vectorizing Interger.reverseBytes [v5] In-Reply-To: References: Message-ID: On Wed, 12 Apr 2023 03:27:23 GMT, Vladimir Kozlov wrote: >> Daohan Qu has updated the pull request incrementally with one additional commit since the last revision: >> >> Revert some changes for safety > > Good. Thanks a bunch for your reviews! @vnkozlov @pfustc. 
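
The point made in the PR description, that `Integer.reverseBytes()` depends on the higher-order bits of its input, can be checked with plain Java arithmetic. The sketch below does not involve C2 or SuperWord at all; it only compares reversing a full `int` and then narrowing with a hypothetical "narrow first, then reverse" order, and contrasts that with addition, where narrowing first gives the same low 16 bits. The class and method names are invented for this illustration.

```java
class ReverseBytesNarrowingSketch {
    // Reverse the full 32-bit value, then keep the low 16 bits.
    static short reverseThenNarrow(int x) {
        return (short) Integer.reverseBytes(x);
    }

    // Hypothetical wrong order: drop the upper 16 bits first, then reverse.
    static short narrowThenReverse(int x) {
        return Short.reverseBytes((short) x);
    }

    // For addition the low 16 bits of the result depend only on the low
    // 16 bits of the inputs, so narrowing first is harmless.
    static boolean addIsNarrowable(int x, int y) {
        short full = (short) (x + y);
        short narrowed = (short) ((short) x + (short) y);
        return full == narrowed;
    }

    public static void main(String[] args) {
        int x = 0x11223344;
        System.out.println(Integer.toHexString(reverseThenNarrow(x) & 0xFFFF)); // 2211
        System.out.println(Integer.toHexString(narrowThenReverse(x) & 0xFFFF)); // 4433
        System.out.println(addIsNarrowable(x, 0x0000FFFF));                     // true
    }
}
```

The two byte-reversal results differ because the low half of `Integer.reverseBytes(x)` comes from the upper half of `x`, which is exactly the information a narrowed element type would no longer carry.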
------------- PR Comment: https://git.openjdk.org/jdk/pull/13406#issuecomment-1504544594 From kvn at openjdk.org Wed Apr 12 03:54:37 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 12 Apr 2023 03:54:37 GMT Subject: RFR: 8305324: C2: Wrong execution of vectorizing Interger.reverseBytes [v4] In-Reply-To: <795PMcUlqhSk2UXQx4nC2nJGRd4mnzbuRt9ZimJPIUk=.a05fa713-a530-4951-8829-6ba56049f7f9@github.com> References: <7Jd0AuPrzrUbfkT5PkQh1PIa8qCGsFVg9lWiqPLZ-gg=.b39299a9-4384-475f-b81f-09eccd60bd02@github.com> <795PMcUlqhSk2UXQx4nC2nJGRd4mnzbuRt9ZimJPIUk=.a05fa713-a530-4951-8829-6ba56049f7f9@github.com> Message-ID: On Wed, 12 Apr 2023 02:33:57 GMT, Daohan Qu wrote: > BTW, I notice that you added this if condition at about 2012 in `jdk8u`, do you remember why you test `is_shift` in the if condition instead of something like `is_rshift` or so? At that time we had only 3 shift `Int` vector operations: LShiftI, RShiftI, URShiftI. It did not make sense to have separate function only for right shift. For loads all operations work since we take type from load. For not loads we left with only RShiftI and URShiftI after excluding LShiftI. Note, I am not against executing this code only for right shifts but it needs to be tested. And as separate changes. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13406#discussion_r1163564026 From duke at openjdk.org Wed Apr 12 03:54:38 2023 From: duke at openjdk.org (Daohan Qu) Date: Wed, 12 Apr 2023 03:54:38 GMT Subject: RFR: 8305324: C2: Wrong execution of vectorizing Interger.reverseBytes [v4] In-Reply-To: References: <7Jd0AuPrzrUbfkT5PkQh1PIa8qCGsFVg9lWiqPLZ-gg=.b39299a9-4384-475f-b81f-09eccd60bd02@github.com> <795PMcUlqhSk2UXQx4nC2nJGRd4mnzbuRt9ZimJPIUk=.a05fa713-a530-4951-8829-6ba56049f7f9@github.com> Message-ID: <_QcUH2QPjlil0HRnsh0Bxp2JNRUtcWm1oBQVbXEbjB8=.f8c2139b-52cf-4d73-a08c-afe46c0a8610@github.com> On Wed, 12 Apr 2023 03:48:35 GMT, Vladimir Kozlov wrote: >> I agree. 
Now that I'm not a hundred per cent sure if number of generated vectors is reduced, I'd better revert some changes. I don't want to make this new function's name misleading (as `Op_LShiftI` doesn't require higher bits info), so I'd rather remove this function. Thanks for your review!
>>
>> BTW, I notice that you added this if condition at about 2012 in `jdk8u`, do you remember why you test `is_shift` in the if condition instead of something like `is_rshift` or so?
>>
>> if (same_type) {
>>   // For right shifts of small integer types (bool, byte, char, short)
>>   // we need precise information about sign-ness. Only Load nodes have
>>   // this information because Store nodes are the same for signed and
>>   // unsigned values. And any arithmetic operation after a load may
>>   // expand a value to signed Int so such right shifts can't be used
>>   // because vector elements do not have upper bits of Int.
>>   const Type* vt = vtn;
>>   if (VectorNode::is_shift(in)) {
>>     Node* load = in->in(1);
>>     if (load->is_Load() && in_bb(load) && (velt_type(load)->basic_type() == T_INT)) {
>>       vt = velt_type(load);
>>     } else if (in->Opcode() != Op_LShiftI) {
>>       // Widen type to Int to avoid creation of right shift vector
>>       // (align + data_size(s1) check in stmts_can_pack() will fail).
>>       // Note, left shifts work regardless type.
>>       vt = TypeInt::INT;
>>     }
>>   }
>
>> BTW, I notice that you added this if condition at about 2012 in `jdk8u`, do you remember why you test `is_shift` in the if condition instead of something like `is_rshift` or so?
>
> At that time we had only 3 shift `Int` vector operations: LShiftI, RShiftI, URShiftI. It did not make sense to have separate function only for right shift. For loads all operations work since we take type from load. For not loads we left with only RShiftI and URShiftI after excluding LShiftI.
>
> Note, I am not against executing this code only for right shifts but it needs to be tested. And as separate changes.

Thanks for your detailed explanations!
It helps a lot! BTW, could you sponsor me? :P ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13406#discussion_r1163565515 From duke at openjdk.org Wed Apr 12 03:58:38 2023 From: duke at openjdk.org (Daohan Qu) Date: Wed, 12 Apr 2023 03:58:38 GMT Subject: RFR: 8305324: C2: Wrong execution of vectorizing Interger.reverseBytes [v5] In-Reply-To: <6SF3Gu8bGQJ4YL9yzgpyEMVvLv_C6YDwEB0V3uGEyrw=.580a358c-d6cb-431b-ab61-379c3ae87ad3@github.com> References: <6SF3Gu8bGQJ4YL9yzgpyEMVvLv_C6YDwEB0V3uGEyrw=.580a358c-d6cb-431b-ab61-379c3ae87ad3@github.com> Message-ID: On Wed, 12 Apr 2023 03:51:15 GMT, Vladimir Kozlov wrote: > This needs to be retested (at least tier1-3) before integration. It contains a if condition change and a comment change. I have tested the if condition change for tier1-3 on x86-64 at the beginning. Could we just wait the github workflow to finsh? Or Do I need to test locally again? (I only have x86 linux PC and it may cost a long time.) ------------- PR Comment: https://git.openjdk.org/jdk/pull/13406#issuecomment-1504553525 From thartmann at openjdk.org Wed Apr 12 05:59:33 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 12 Apr 2023 05:59:33 GMT Subject: RFR: 8305783: x86_64: Optimize AbsI and AbsL [v2] In-Reply-To: References: <54TM3HtMQkkbfqhjZQ1FbbYZnAP6gMfWP019Ps7NjVU=.96b1f5d6-9673-45f0-8742-3d9ff8739cb8@github.com> Message-ID: On Mon, 10 Apr 2023 13:04:43 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch optimizes the sequence emitted by `AbsINode` and `AbsLNode` to save some instructions and 1 temp register. Please take a look and kindly leave your reviews. >> >> Thanks a lot. > > Quan Anh Mai has updated the pull request incrementally with two additional commits since the last revision: > > - description > - description Looks good to me. ------------- Marked as reviewed by thartmann (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/13402#pullrequestreview-1380566314 From eliu at openjdk.org Wed Apr 12 06:31:35 2023 From: eliu at openjdk.org (Eric Liu) Date: Wed, 12 Apr 2023 06:31:35 GMT Subject: RFR: JDK-8305782: Provide MacroAssembler::breakpoint on aarch64 [v2] In-Reply-To: References: Message-ID: On Tue, 11 Apr 2023 06:55:30 GMT, Thomas Stuefe wrote: >> The ability to emit debug traps was useful for me on arm, and I miss it on aarch64. >> >> Tested manually on Linux aarch64 in gdb with various values for hint covering the whole 16-bit range set. Hint gets encoded in the instruction (gdb decodes instruction as "BRK xxx" with xxx being the hint). According to documentation the hint ends up in ESR.ELx.ISS after the trap hit, but gdb refused to display the ESR register, so I could not verify that. > > Thomas Stuefe has updated the pull request incrementally with one additional commit since the last revision: > > reuse Assembler::brk LGTM. ------------- Marked as reviewed by eliu (Committer). PR Review: https://git.openjdk.org/jdk/pull/13401#pullrequestreview-1380601253 From qamai at openjdk.org Wed Apr 12 06:55:33 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Wed, 12 Apr 2023 06:55:33 GMT Subject: RFR: 8305783: x86_64: Optimize AbsI and AbsL [v2] In-Reply-To: References: <54TM3HtMQkkbfqhjZQ1FbbYZnAP6gMfWP019Ps7NjVU=.96b1f5d6-9673-45f0-8742-3d9ff8739cb8@github.com> Message-ID: On Wed, 12 Apr 2023 05:57:07 GMT, Tobias Hartmann wrote: >> Quan Anh Mai has updated the pull request incrementally with two additional commits since the last revision: >> >> - description >> - description > > Looks good to me. 
@TobiHartmann Thanks for your review, I will integrate the change ------------- PR Comment: https://git.openjdk.org/jdk/pull/13402#issuecomment-1504757424 From qamai at openjdk.org Wed Apr 12 06:56:51 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Wed, 12 Apr 2023 06:56:51 GMT Subject: RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v7] In-Reply-To: References: Message-ID: On Tue, 11 Apr 2023 17:47:56 GMT, Jatin Bhateja wrote: >> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: >> >> special case iotaShuffle > > Marked as reviewed by jbhateja (Reviewer). @jatin-bhateja @iwanowww Thanks a lot for your approvals, I will integrate the patch ------------- PR Comment: https://git.openjdk.org/jdk/pull/13093#issuecomment-1504758215 From duke at openjdk.org Wed Apr 12 08:49:35 2023 From: duke at openjdk.org (Chang Peng) Date: Wed, 12 Apr 2023 08:49:35 GMT Subject: RFR: 8301739: AArch64: Add optimized rules for vector compare with immediate for SVE In-Reply-To: References: Message-ID: On Tue, 11 Apr 2023 08:32:38 GMT, Andrew Haley wrote: > At some point someone must realize that > > `((-(1 << size)) <= n->get_int()) && (n->get_int() < (1 << size))` > > could be a function. I think this function can make this predicate more clearly, but I found there are many predicates to modify in aarch64.ad. I suggest using an extra patch to add this function and modify all similar predicates. 
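
The range check quoted above, `((-(1 << size)) <= n->get_int()) && (n->get_int() < (1 << size))`, is the usual "fits in a signed immediate" test. A minimal Java sketch of it as a reusable function follows. `fitsSignedImmediate` is a hypothetical name used only here; it takes the total bit width of the immediate, so the `immI5` predicate with `1 << 4` corresponds to `bits = 5`. A later reply in this thread points to `AbstractAssembler::is_simm(int64_t, uint)` as the existing HotSpot helper for the same check.

```java
class SignedImmediateSketch {
    // Hypothetical helper: true if `value` fits in a signed immediate field
    // that is `bits` wide, i.e. -(2^(bits-1)) <= value < 2^(bits-1).
    static boolean fitsSignedImmediate(long value, int bits) {
        if (bits <= 1 || bits >= 64) {
            throw new IllegalArgumentException("bits must be in 2..63");
        }
        long limit = 1L << (bits - 1);
        return -limit <= value && value < limit;
    }

    public static void main(String[] args) {
        // 5-bit signed immediate, the range checked by the immI5 operand.
        System.out.println(fitsSignedImmediate(15, 5));  // true
        System.out.println(fitsSignedImmediate(16, 5));  // false
        System.out.println(fitsSignedImmediate(-16, 5)); // true
        System.out.println(fitsSignedImmediate(-17, 5)); // false
    }
}
```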
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13200#discussion_r1163820799 From qamai at openjdk.org Wed Apr 12 09:07:35 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Wed, 12 Apr 2023 09:07:35 GMT Subject: RFR: 8301739: AArch64: Add optimized rules for vector compare with immediate for SVE In-Reply-To: References: Message-ID: On Wed, 12 Apr 2023 08:47:06 GMT, Chang Peng wrote: >> src/hotspot/cpu/aarch64/aarch64.ad line 4438: >> >>> 4436: operand immI5() >>> 4437: %{ >>> 4438: predicate(((-(1 << 4)) <= n->get_int()) && (n->get_int() < (1 << 4))); >> >> At some point someone must realize that >> >> `((-(1 << size)) <= n->get_int()) && (n->get_int() < (1 << size))` >> >> could be a function. > >> At some point someone must realize that >> >> `((-(1 << size)) <= n->get_int()) && (n->get_int() < (1 << size))` >> >> could be a function. > > I think this function can make this predicate more clearly, but I found there are many predicates to modify in aarch64.ad. > I suggest using an extra patch to add this function and modify all similar predicates. We have `AbstractAssembler::is_simm(int64_t, uint)` which does exactly this. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13200#discussion_r1163841475 From duke at openjdk.org Wed Apr 12 09:13:34 2023 From: duke at openjdk.org (Chang Peng) Date: Wed, 12 Apr 2023 09:13:34 GMT Subject: RFR: 8301739: AArch64: Add optimized rules for vector compare with immediate for SVE In-Reply-To: References: Message-ID: On Wed, 12 Apr 2023 09:04:33 GMT, Quan Anh Mai wrote: > We have `AbstractAssembler::is_simm(int64_t, uint)` which does exactly this. Thanks. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13200#discussion_r1163849301 From duke at openjdk.org Wed Apr 12 09:19:24 2023 From: duke at openjdk.org (Afshin Zafari) Date: Wed, 12 Apr 2023 09:19:24 GMT Subject: RFR: 8305080: Remove finalize() from test/hotspot/jtreg/compiler/jvmci/common/testcases that used in compiler/jvmci/compilerToVM/ tests Message-ID: The finalize() methods are removed and replaced by Cleaner callbacks. Note: `test/hotspot/jtreg/compiler/jvmci/compilerToVM/HasFinalizableSubclassTest.java` may be removed since there is no need to test if finalize() exists in the subclasses or not.. ------------- Commit messages: - 8305080: Remove finalize() from test/hotspot/jtreg/compiler/jvmci/common/testcases that used in compiler/jvmci/compilerToVM/ tests Changes: https://git.openjdk.org/jdk/pull/13419/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13419&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8305080 Stats: 35 lines in 7 files changed: 3 ins; 20 del; 12 mod Patch: https://git.openjdk.org/jdk/pull/13419.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13419/head:pull/13419 PR: https://git.openjdk.org/jdk/pull/13419 From duke at openjdk.org Wed Apr 12 09:19:47 2023 From: duke at openjdk.org (Afshin Zafari) Date: Wed, 12 Apr 2023 09:19:47 GMT Subject: RFR: 8305079: Remove finalize() from compiler/c2/Test719030 Message-ID: The `finalize()` method is replaced by a Cleaner callback. ------------- Commit messages: - Removed finalize() method. 
Changes: https://git.openjdk.org/jdk/pull/13418/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13418&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8305079 Stats: 8 lines in 1 file changed: 2 ins; 5 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/13418.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13418/head:pull/13418 PR: https://git.openjdk.org/jdk/pull/13418 From dzhang at openjdk.org Wed Apr 12 11:08:40 2023 From: dzhang at openjdk.org (Dingli Zhang) Date: Wed, 12 Apr 2023 11:08:40 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v15] In-Reply-To: References: Message-ID: > HI, > > We have added support for vector add mask instructions, please take a look and have some reviews. Thanks a lot! > This patch will add support of vector add/sub/mul/div mask version. It was implemented by referring to RVV v1.0 [1]. > > ## Load/Store/Cmp Mask > `VectorLoadMask, VectorMaskCmp, VectorStoreMask` will implement the mask datapath. We can see where the data is passed in the compilation log with `jdk/incubator/vector/Byte128VectorTests.java`? > > 218 loadV V1, [R7] # vector (rvv) > 220 vloadmask V0, V1 > ... > 23c vmaskcmp_rvv_masked V0, V4, V5, V0, V1, #0 > 24c vstoremask V1, V0 > 258 storeV [R7], V1 # vector (rvv) > > > The corresponding generated jit assembly? > > # loadV > 0x000000400c8ef958: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef95c: vle8.v v1,(t2) > > # vloadmask > 0x000000400c8ef960: vsetvli t0,zero,e8,m1,tu, > 0x000000400c8ef964: vmsne.vx v0,v1,zero > > # vmaskcmp_rvv_masked > 0x000000400c8ef97c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef980: vmclr.m v1 > 0x000000400c8ef984: vmseq.vv v1,v4,v5,v0.t > 0x000000400c8ef988: vmv1r.v v0,v1 > > # vstoremask > 0x000000400c8ef98c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef990: vmv.v.x v1,zero > 0x000000400c8ef994: vmerge.vim v1,v1,1,v0 > > > ## Masked vector arithmetic instructions (e.g. 
vadd) > AddMaskTestMerge case: > > import jdk.incubator.vector.IntVector; > import jdk.incubator.vector.VectorMask; > import jdk.incubator.vector.VectorOperators; > import jdk.incubator.vector.VectorSpecies; > > public class AddMaskTestMerge { > > static final VectorSpecies SPECIES = IntVector.SPECIES_128; > static final int SIZE = 1024; > static int[] a = new int[SIZE]; > static int[] b = new int[SIZE]; > static int[] r = new int[SIZE]; > static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false}; > static { > for (int i = 0; i < SIZE; i++) { > a[i] = i; > b[i] = i; > } > } > > static void workload(int idx) { > VectorMask vmask = VectorMask.fromArray(SPECIES, c, 0); > IntVector av = IntVector.fromArray(SPECIES, a, idx); > IntVector bv = IntVector.fromArray(SPECIES, b, idx); > av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx); > } > > public static void main(String[] args) { > for (int i = 0; i < 30_0000; i++) { > for (int j = 0; j < SIZE; j += SPECIES.length()) { > workload(j); > } > } > } > } > > > This test case is reduced from existing jtreg vector tests Int128VectorTests.java[2]. This test case corresponds to the add instruction of the vector mask version and other instructions are similar. > > Before this patch, the compilation log will not print RVV-related instructions. Now the compilation log is as follows: > > > 0ae B10: # out( B25 B11 ) <- in( B9 ) Freq: 0.999991 > 0ae loadV V1, [R31] # vector (rvv) > 0b6 vloadmask V0, V2 > 0be vadd.vv V3, V1, V0 #@vaddI_masked > 0c6 lwu R28, [R7, #124] # loadN, compressed ptr, #@loadN ! 
Field: AddMaskTestMerge.r > 0ca decode_heap_oop R28, R28 #@decodeHeapOop > 0cc lwu R7, [R28, #12] # range, #@loadRange > 0d0 NullCheck R28 > > > And the jit code is as follows: > > > 0x000000400c823cee: vsetvli t0,zero,e32,m1,tu,mu > 0x000000400c823cf2: vle32.v v1,(t6) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::intoArray at 43 (line 3228) > ; - AddMaskTestMerge::workload at 46 (line 25) > 0x000000400c823cf6: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c823cfa: vmsne.vx v0,v2,zero ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.VectorMask::fromArray at 47 (line 208) > ; - AddMaskTestMerge::workload at 7 (line 22) > 0x000000400c823cfe: vsetvli t0,zero,e32,m1,tu,mu > 0x000000400c823d02: vadd.vv v3,v3,v1,v0.t ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::lanewiseTemplate at 192 (line 834) > ; - jdk.incubator.vector.Int128Vector::lanewise at 9 (line 291) > ; - jdk.incubator.vector.Int128Vector::lanewise at 4 (line 41) > ; - AddMaskTestMerge::workload at 39 (line 25) > > > ## Mask register allocation & mask bit opreation > Since v0 is to be used as a mask register in spec[1], sometimes we need two vmask to do the vector mask logical ops like `AndVMask, OrVMask, XorVMask`. And if only v0 and v31 mask registers are defined, the corresponding c2 nodes will not be generated correctly because of the register pressure[3]. > When we use only v0 and v31 as mask registers, jtreg testing of Byte128VectorTests.java[4] with `-XX:+PrintAssembly` and `-XX:LogFile` will not emit the expected rvv mask instruction. 
As opposed to this, the following compilation failure log[3] is generated: > > > > > > > > > So define v30 and v31 as mask register too and `AndVMask` will emit the C2 JIT code like: > > vloadmask V0, V1 > vloadmask V30, V2 > vmask_and V0, V30, V0 > > We also modified the implementation of `spill_copy_vector_stack_to_stack ` so that it no longer occupies the v0 register. In addition to that, we change some node like `vasr/vlsl/vlsr/vstring_x/varray_x/vclearArray_x`, which use v0 internally, to make C2 to sense that they used v0. > > By the way, the current implementation of `VectorMaskCast` is for the case of equal width of the parameter data, other cases depend on the subsequent cast node. > > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc > [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java > [3] https://github.com/openjdk/jdk/blob/0deb648985b018653ccdaf193dc13b3cf21c088a/src/hotspot/share/opto/chaitin.cpp#L526 > [4] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Byte128VectorTests.java > > ### Testing: > > qemu with UseRVV: > - [x] Tier1 tests (release) > - [x] Tier2 tests (release) > - [ ] Tier3 tests (release) > - [x] test/jdk/jdk/incubator/vector (release/fastdebug) Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: Add loadstoremask support ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12682/files - new: https://git.openjdk.org/jdk/pull/12682/files/2ef39c07..45f499e3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=14 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=13-14 Stats: 262 lines in 3 files changed: 135 ins; 52 del; 75 mod Patch: https://git.openjdk.org/jdk/pull/12682.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12682/head:pull/12682 PR: https://git.openjdk.org/jdk/pull/12682 From duke at openjdk.org Wed Apr 12 13:39:40 2023 From: duke at 
openjdk.org (Daohan Qu) Date: Wed, 12 Apr 2023 13:39:40 GMT Subject: RFR: 8305324: C2: Wrong execution of vectorizing Interger.reverseBytes [v5] In-Reply-To: References: Message-ID: <7NpJ9V_B1pZfWEuCYQxVtKR-df8F_AvVFFaZ97iepA4=.ba6a9cd9-99e6-4417-97be-7ff5dd766e4d@github.com> On Wed, 12 Apr 2023 02:48:23 GMT, Daohan Qu wrote: >> This patch should fix [JDK-8305324](https://bugs.openjdk.org/browse/JDK-8305324). >> >> `SuperWord::compute_vector_element_type()` implemented in `jdk/src/hotspot/share/opto/superword.cpp` propagates backward a narrower integer type when the upper bits of the value are not needed. However, `Integer.reverseBytes()` depends on higher-order bits of an integer and should be prevented from being narrowed and vectorized. Instead, it needs to be treated like `Math.abs()` (which is represented by `Op_AbsI` in the following code). >> >> https://github.com/openjdk/jdk/blob/0243da2e4adc1b7ab6fcd5b10778532101158dce/src/hotspot/share/opto/superword.cpp#L3935-L3945 >> >> I have tested this patch for tier 1-3 on x86-64. 
>
> Daohan Qu has updated the pull request incrementally with one additional commit since the last revision:
>
>   Revert some changes for safety

This is my local test result of release build on Ubuntu 22.04 x86_64:

$ make test TEST=tier1
==============================
Test summary
==============================
   TEST                                TOTAL  PASS  FAIL ERROR
   jtreg:test/hotspot/jtreg:tier1       2251  2251     0     0
   jtreg:test/jdk:tier1                 2328  2328     0     0
   jtreg:test/langtools:tier1           4373  4373     0     0
   jtreg:test/jaxp:tier1                   0     0     0     0
   jtreg:test/lib-test:tier1              28    28     0     0
==============================
TEST SUCCESS

$ make test TEST=tier2
==============================
Test summary
==============================
   TEST                                TOTAL  PASS  FAIL ERROR
   jtreg:test/hotspot/jtreg:tier2        722   722     0     0
   jtreg:test/jdk:tier2                 4050  4050     0     0
   jtreg:test/langtools:tier2             11    11     0     0
   jtreg:test/jaxp:tier2                 470   470     0     0
==============================
TEST SUCCESS

$ make test TEST=tier3
==============================
Test summary
==============================
   TEST                                TOTAL  PASS  FAIL ERROR
   jtreg:test/hotspot/jtreg:tier3        227   227     0     0
   jtreg:test/jdk:tier3                 1307  1307     0     0
   jtreg:test/langtools:tier3              0     0     0     0
   jtreg:test/jaxp:tier3                   0     0     0     0
==============================
TEST SUCCESS

-------------

PR Comment: https://git.openjdk.org/jdk/pull/13406#issuecomment-1505294494

From qamai at openjdk.org Wed Apr 12 13:51:50 2023
From: qamai at openjdk.org (Quan Anh Mai)
Date: Wed, 12 Apr 2023 13:51:50 GMT
Subject: Integrated: 8305783: x86_64: Optimize AbsI and AbsL
In-Reply-To: <54TM3HtMQkkbfqhjZQ1FbbYZnAP6gMfWP019Ps7NjVU=.96b1f5d6-9673-45f0-8742-3d9ff8739cb8@github.com>
References: <54TM3HtMQkkbfqhjZQ1FbbYZnAP6gMfWP019Ps7NjVU=.96b1f5d6-9673-45f0-8742-3d9ff8739cb8@github.com>
Message-ID:

On Sun, 9 Apr 2023 09:41:08 GMT, Quan Anh Mai wrote:

> Hi,
>
> This patch optimizes the sequence emitted by `AbsINode` and `AbsLNode` to save some instructions and 1 temp register. Please take a look and kindly leave your reviews.
>
> Thanks a lot.

This pull request has now been integrated.
Changeset: 99a9dbc8
Author:    Quan Anh Mai
URL:       https://git.openjdk.org/jdk/commit/99a9dbc8f191d3c9a9e7569d8a6dd4cca7c9076c
Stats:     28 lines in 1 file changed: 0 ins; 10 del; 18 mod

8305783: x86_64: Optimize AbsI and AbsL

Reviewed-by: jkarthikeyan, thartmann

-------------

PR: https://git.openjdk.org/jdk/pull/13402

From kvn at openjdk.org Wed Apr 12 15:29:20 2023
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Wed, 12 Apr 2023 15:29:20 GMT
Subject: RFR: 8305324: C2: Wrong execution of vectorizing Interger.reverseBytes [v5]
In-Reply-To:
References:
Message-ID:

On Wed, 12 Apr 2023 02:48:23 GMT, Daohan Qu wrote:

>> This patch should fix [JDK-8305324](https://bugs.openjdk.org/browse/JDK-8305324).
>>
>> `SuperWord::compute_vector_element_type()` implemented in `jdk/src/hotspot/share/opto/superword.cpp` propagates backward a narrower integer type when the upper bits of the value are not needed. However, `Integer.reverseBytes()` depends on higher-order bits of an integer and should be prevented from being narrowed and vectorized. Instead, it needs to be treated like `Math.abs()` (which is represented by `Op_AbsI` in the following code).
>>
>> https://github.com/openjdk/jdk/blob/0243da2e4adc1b7ab6fcd5b10778532101158dce/src/hotspot/share/opto/superword.cpp#L3935-L3945
>>
>> I have tested this patch for tier 1-3 on x86-64.
>
> Daohan Qu has updated the pull request incrementally with one additional commit since the last revision:
>
>   Revert some changes for safety

Thank you for testing. I submitted our internal testing too. After it completes I will sponsor the changes.
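To make the quoted argument concrete, here is a scalar Java sketch of why `Integer.reverseBytes()` must not be narrowed. It is an illustration under assumptions, not the C2 transformation itself nor the regression test from the PR: the class and method names are hypothetical, and "truncate then reverse" is just one way to picture what a 16-bit version of the operation would compute.

```java
// Illustration only: a scalar model of the narrowing hazard, not C2 code.
final class ReverseBytesNarrowing {

    // What the Java source asks for: reverse all four bytes of the int,
    // then keep the low 16 bits of the result.
    static short reverseThenTruncate(int x) {
        return (short) Integer.reverseBytes(x);
    }

    // What a narrowed (16-bit) operation would compute instead: drop the
    // upper 16 bits first, then reverse only the two remaining bytes.
    static short truncateThenReverse(int x) {
        return Short.reverseBytes((short) x);
    }

    public static void main(String[] args) {
        int x = 0x12345678;
        // The low 16 bits of the reversed int come from the HIGH bytes of x,
        // so the two orders of operations disagree.
        System.out.printf("%04x%n", reverseThenTruncate(x) & 0xFFFF);
        System.out.printf("%04x%n", truncateThenReverse(x) & 0xFFFF);
    }
}
```

For `0x12345678` the first method yields `0x3412` and the second `0x7856`, which is why the result cannot be derived from the low bits of the input alone.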
------------- PR Comment: https://git.openjdk.org/jdk/pull/13406#issuecomment-1505472623 From eliu at openjdk.org Wed Apr 12 16:09:41 2023 From: eliu at openjdk.org (Eric Liu) Date: Wed, 12 Apr 2023 16:09:41 GMT Subject: RFR: 8303762: [vectorapi] Intrinsification of Vector.slice [v6] In-Reply-To: References: Message-ID: <4vt3o6jU1_qUlYB4YtkXOUmG8Gi9NzRUHXjYqboYlPU=.3edc8876-bf2a-40a0-bb3e-3c5ea2aea3d5@github.com> On Tue, 4 Apr 2023 13:46:12 GMT, Quan Anh Mai wrote: >> `Vector::slice` is a method at the top-level class of the Vector API that concatenates the 2 inputs into an intermediate composite and extracts a window equal to the size of the inputs into the result. It is used in vector conversion methods where the part number is not 0 to slice the parts to the correct positions. Slicing is also used in text processing such as utf8 and utf16 validation. x86 starting from SSSE3 has `palignr` which does vector slicing very efficiently. As a result, I think it is beneficial to add a C2 node for this operation as well as intrinsify `Vector::slice` method. >> >> A slice is currently implemented as `v2.rearrange(iota).blend(v1.rearrange(iota), blendMask)` which requires preparation of the index vector and the blending mask. Even with the preparations being hoisted out of the loops, microbenchmarks show improvement using the slice instrinsics. Some have tremendous increases in throughput due to the limitation that a mask of length 2 cannot currently be intrinsified, leading to falling back to the Java implementations. >> >> Please take a look and have some reviews. Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > style src/hotspot/share/opto/vectorIntrinsics.cpp line 1953: > 1951: Node* v1 = unbox_vector(argument(3), vbox_type, elem_bt, num_elem); > 1952: Node* v2 = unbox_vector(argument(4), vbox_type, elem_bt, num_elem); > 1953: if (v1 == NULL || v2 == NULL) { nullptr is more common. 
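As a reference point for the `Vector::slice` description quoted above, the following is a scalar Java model of the operation on plain `int` arrays. It is a sketch under assumptions: `SliceModel` is a hypothetical name, arrays stand in for vectors, and it says nothing about how the intrinsic or `palignr` is actually emitted.

```java
import java.util.Arrays;

// Scalar model for illustration; arrays stand in for vectors of equal length.
final class SliceModel {

    // Returns lanes [origin, origin + n) of the concatenation a ++ b,
    // where n is the common length of a and b and 0 <= origin <= n.
    static int[] slice(int origin, int[] a, int[] b) {
        int n = a.length;
        if (b.length != n || origin < 0 || origin > n) {
            throw new IllegalArgumentException("bad length or origin");
        }
        int[] result = new int[n];
        for (int i = 0; i < n; i++) {
            int j = origin + i;
            // Lanes past the end of a continue into b.
            result[i] = j < n ? a[j] : b[j - n];
        }
        return result;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4};
        int[] b = {5, 6, 7, 8};
        System.out.println(Arrays.toString(slice(1, a, b)));
        System.out.println(Arrays.toString(slice(0, a, b)));
        System.out.println(Arrays.toString(slice(4, a, b)));
    }
}
```

In this model an origin of 0 returns the first input unchanged and an origin equal to the vector length returns the second input.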
src/hotspot/share/opto/vectornode.cpp line 1999: > 1997: // (VectorSlice X Y 0) => X > 1998: // (VectorSlice X Y VLENGTH) => Y > 1999: if (origin->is_con(0)) { is_con(0) is pre defined as TypeInt::ZERO. src/hotspot/share/opto/vectornode.cpp line 2001: > 1999: if (origin->is_con(0)) { > 2000: return in(1); > 2001: } else if (origin->is_con(Matcher::vector_length(this))) { If they were the same, length() looks simple. Suggestion: } else if (origin->is_con(length())) { src/java.base/share/classes/jdk/internal/vm/vector/VectorSupport.java line 635: > 633: } > 634: > 635: @ForceInline May I ask why `forceInline` here is necessary? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12909#discussion_r1164311234 PR Review Comment: https://git.openjdk.org/jdk/pull/12909#discussion_r1164328793 PR Review Comment: https://git.openjdk.org/jdk/pull/12909#discussion_r1164346771 PR Review Comment: https://git.openjdk.org/jdk/pull/12909#discussion_r1164348507 From duke at openjdk.org Wed Apr 12 16:14:36 2023 From: duke at openjdk.org (Daohan Qu) Date: Wed, 12 Apr 2023 16:14:36 GMT Subject: RFR: 8305324: C2: Wrong execution of vectorizing Interger.reverseBytes [v5] In-Reply-To: References: Message-ID: On Wed, 12 Apr 2023 02:48:23 GMT, Daohan Qu wrote: >> This patch should fix [JDK-8305324](https://bugs.openjdk.org/browse/JDK-8305324). >> >> `SuperWord::compute_vector_element_type()` implemented in `jdk/src/hotspot/share/opto/superword.cpp` propagates backward a narrower integer type when the upper bits of the value are not needed. However, `Integer.reverseBytes()` depends on higher-order bits of an integer and should be prevented from being narrowed and vectorized. Instead, it needs to be treated like `Math.abs()` (which is represented by `Op_AbsI` in the following code). >> >> https://github.com/openjdk/jdk/blob/0243da2e4adc1b7ab6fcd5b10778532101158dce/src/hotspot/share/opto/superword.cpp#L3935-L3945 >> >> I have tested this patch for tier 1-3 on x86-64. 
> > Daohan Qu has updated the pull request incrementally with one additional commit since the last revision: > > Revert some changes for safety Okay, thanks a lot. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13406#issuecomment-1505553896 From kvn at openjdk.org Wed Apr 12 17:08:42 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 12 Apr 2023 17:08:42 GMT Subject: RFR: 8305324: C2: Wrong execution of vectorizing Interger.reverseBytes [v5] In-Reply-To: References: Message-ID: On Wed, 12 Apr 2023 02:48:23 GMT, Daohan Qu wrote: >> This patch should fix [JDK-8305324](https://bugs.openjdk.org/browse/JDK-8305324). >> >> `SuperWord::compute_vector_element_type()` implemented in `jdk/src/hotspot/share/opto/superword.cpp` propagates backward a narrower integer type when the upper bits of the value are not needed. However, `Integer.reverseBytes()` depends on higher-order bits of an integer and should be prevented from being narrowed and vectorized. Instead, it needs to be treated like `Math.abs()` (which is represented by `Op_AbsI` in the following code). >> >> https://github.com/openjdk/jdk/blob/0243da2e4adc1b7ab6fcd5b10778532101158dce/src/hotspot/share/opto/superword.cpp#L3935-L3945 >> >> I have tested this patch for tier 1-3 on x86-64. > > Daohan Qu has updated the pull request incrementally with one additional commit since the last revision: > > Revert some changes for safety My testing passed. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13406#issuecomment-1505625045 From duke at openjdk.org Wed Apr 12 17:11:47 2023 From: duke at openjdk.org (Daohan Qu) Date: Wed, 12 Apr 2023 17:11:47 GMT Subject: Integrated: 8305324: C2: Wrong execution of vectorizing Interger.reverseBytes In-Reply-To: References: Message-ID: On Mon, 10 Apr 2023 13:21:29 GMT, Daohan Qu wrote: > This patch should fix [JDK-8305324](https://bugs.openjdk.org/browse/JDK-8305324). 
> > `SuperWord::compute_vector_element_type()` implemented in `jdk/src/hotspot/share/opto/superword.cpp` propagates backward a narrower integer type when the upper bits of the value are not needed. However, `Integer.reverseBytes()` depends on higher-order bits of an integer and should be prevented from being narrowed and vectorized. Instead, it needs to be treated like `Math.abs()` (which is represented by `Op_AbsI` in the following code). > > https://github.com/openjdk/jdk/blob/0243da2e4adc1b7ab6fcd5b10778532101158dce/src/hotspot/share/opto/superword.cpp#L3935-L3945 > > I have tested this patch for tier 1-3 on x86-64. This pull request has now been integrated. Changeset: 19380d74 Author: quadhier Committer: Vladimir Kozlov URL: https://git.openjdk.org/jdk/commit/19380d74e437c17c4d8292e2adfd0fb20f059bb0 Stats: 63 lines in 2 files changed: 57 ins; 0 del; 6 mod 8305324: C2: Wrong execution of vectorizing Interger.reverseBytes Reviewed-by: kvn, pli ------------- PR: https://git.openjdk.org/jdk/pull/13406 From duke at openjdk.org Wed Apr 12 17:29:52 2023 From: duke at openjdk.org (duke) Date: Wed, 12 Apr 2023 17:29:52 GMT Subject: Withdrawn: 8298091: Dump native instruction along with nmethod name when using Compiler.codelist In-Reply-To: <1dSP2mbbyqiKLRmWwCfwGdb67ll7-K4cRtV_muUkS9I=.2d2dc2f8-acca-49de-89f5-70a825c057fa@github.com> References: <1dSP2mbbyqiKLRmWwCfwGdb67ll7-K4cRtV_muUkS9I=.2d2dc2f8-acca-49de-89f5-70a825c057fa@github.com> Message-ID: <9eHublccmZCiRVn1sPJZCfD6k8d6iQGA2txUt-Ne7cw=.fffa3d9c-9e65-487d-9f37-b093b1953392@github.com> On Thu, 2 Feb 2023 02:12:06 GMT, Yi Yang wrote: > This patch adds new functionality for Compiler.codelist, it optionally prints assembly code along with compiled method line. This allows us to inspect assembly code for specified JIT method on the fly, also a manageable flag ForceLoadDisassembler is added to load hs-dis if it was not initially present when JVM starts. 
> > The output looks like this: > > $ jcmd Compiler.codelist decode=Thread.interrupt > 76900: > ... > 2678 3 0 com.sun.tools.javac.api.JavacTaskPool$ReusableContext$1.scan(Lcom/sun/source/tree/Tree;Lcom/sun/tools/javac/code/Symtab;)Ljava/lang/Void; [0x00007fbe85105590, 0x00007fbe85105780 - 0x00007fbe85105ec0] > 2683 3 0 java.lang.Thread.interrupted()Z [0x00007fbe85106090, 0x00007fbe85106220 - 0x00007fbe85106488] > [Disassembly] > -------------------------------------------------------------------------------- > [Constant Pool (empty)] > > -------------------------------------------------------------------------------- > > [MachCode] > 0x00007fbe85106220: 8984 2400 | c0fe ff55 | 4883 ec40 | 4181 7f20 | 0700 0000 | 7405 e8a5 | 8af9 0648 | be28 d2ca > 0x00007fbe85106240: 3cbe 7f00 | 008b bef4 | 0000 0083 | c702 89be | f400 0000 | 81e7 fe07 | 0000 83ff | 000f 8461 > 0x00007fbe85106260: 0100 0048 | be28 d2ca | 3cbe 7f00 | 0048 8386 | 3801 0000 | 0149 8bb7 | a802 0000 | 488b 3648 > 0x00007fbe85106280: 3b06 488b | fe48 bb28 | d2ca 3cbe | 7f00 008b | 7f08 49ba | 0000 0000 | 0800 0000 | 4903 fa48 > 0x00007fbe851062a0: 3bbb 5801 | 0000 750d | 4883 8360 | 0100 0001 | e960 0000 | 0048 3bbb | 6801 0000 | 750d 4883 > 0x00007fbe851062c0: 8370 0100 | 0001 e94a | 0000 0048 | 83bb 5801 | 0000 0075 | 1748 89bb | 5801 0000 | 48c7 8360 > 0x00007fbe851062e0: 0100 0001 | 0000 00e9 | 2900 0000 | 4883 bb68 | 0100 0000 | 7517 4889 | bb68 0100 | 0048 c783 > 0x00007fbe85106300: 7001 0000 | 0100 0000 | e908 0000 | 0048 8383 | 4801 0000 | 0148 bfc8 | 7036 3cbe | 7f00 008b > 0x00007fbe85106320: 9ff4 0000 | 0083 c302 | 899f f400 | 0000 81e3 | feff 1f00 | 83fb 000f | 84ad 0000 | 000f be7e > 0x00007fbe85106340: 3683 ff00 | 48bb c870 | 363c be7f | 0000 48b8 | 3801 0000 | 0000 0000 | 0f84 0a00 | 0000 48b8 > 0x00007fbe85106360: 4801 0000 | 0000 0000 | 488b 1403 | 488d 5201 | 4889 1403 | 0f84 2e00 | 0000 897c | 2428 bb00 > 0x00007fbe85106380: 0000 0088 | 5e36 f083 | 4424 c000 | 48be c870 | 363c be7f | 
0000 4883 | 8658 0100 | 0001 90e8 > 0x00007fbe851063a0: 5c11 fb06 | 8b7c 2428 | 83e7 0183 | e701 488b | c748 83c4 | 405d 493b | a778 0300 | 000f 8748 > 0x00007fbe851063c0: 0000 00c3 | 49ba 98aa | 4200 0800 | 0000 4c89 | 5424 0848 | c704 24ff | ffff ffe8 | 20f7 0607 > 0x00007fbe851063e0: e97e feff | ffe8 1684 | 0607 49ba | 7880 0100 | 0800 0000 | 4c89 5424 | 0848 c704 | 24ff ffff > 0x00007fbe85106400: ffe8 faf6 | 0607 e932 | ffff ff49 | bab6 6310 | 85be 7f00 | 004d 8997 | 9003 0000 | e95f 77fb > 0x00007fbe85106420: 0649 8b87 | 2804 0000 | 49c7 8728 | 0400 0000 | 0000 0049 | c787 3004 | 0000 0000 | 0000 4883 > 0x00007fbe85106440: c440 5de9 | b86b 0607 | e833 af06 | 0748 bf42 | 29a7 a2be | 7f00 0048 | 83e4 f0e8 | b097 4b1d > 0x00007fbe85106460: f449 ba61 | 6410 85be | 7f00 0041 | 52e9 ae69 | fb06 48bb | 0000 0000 | 0000 0000 | e9fb ffff > 0x00007fbe85106480: fff4 f4f4 | f4f4 f4f4 > [/MachCode] > -------------------------------------------------------------------------------- > [/Disassembly] > 2684 3 0 java.io.FileInputStream.read()I [0x00007fbe85106590, 0x00007fbe85106740 - 0x00007fbe85106928] > 2686 3 0 jdk.internal.org.jline.utils.NonBlockingInputStream.read(J)I [0x00007fbe85106a90, 0x00007fbe85106c20 - 0x00007fbe85106de0] > 2687 3 0 jdk.internal.org.jline.terminal.impl.AbstractPty.checkInterrupted()V [0x00007fbe85106e90, 0x00007fbe85107060 - 0x00007fbe85107458] > > This is a common situation in production environment. Few applications will bring hsdis at startup, but when we really need it, we seem to have no good way except to restart application. 
Now, we can turn on ForceLoadDisassembler and load hsdis dynamically without restarting: > > $ jcmd Compiler.codelist decode=Thread.interrupt > 2679 3 0 com.sun.source.util.TreeScanner.scan(Lcom/sun/source/tree/Tree;Ljava/lang/Object;)Ljava/lang/Object; [0x00007fbe85105110, 0x00007fbe851052a0 - 0x00007fbe851054c8] > 2678 3 0 com.sun.tools.javac.api.JavacTaskPool$ReusableContext$1.scan(Lcom/sun/source/tree/Tree;Lcom/sun/tools/javac/code/Symtab;)Ljava/lang/Void; [0x00007fbe85105590, 0x00007fbe85105780 - 0x00007fbe85105ec0] > 2683 3 0 java.lang.Thread.interrupted()Z [0x00007fbe85106090, 0x00007fbe85106220 - 0x00007fbe85106488] > [Disassembly] > -------------------------------------------------------------------------------- > [Constant Pool (empty)] > > -------------------------------------------------------------------------------- > > [Verified Entry Point] > # {method} {0x000000080042aa98} 'interrupted' '()Z' in 'java/lang/Thread' > # [sp+0x50] (sp of caller) > 0x00007fbe85106220: mov %eax,-0x14000(%rsp) > 0x00007fbe85106227: push %rbp > 0x00007fbe85106228: sub $0x40,%rsp > 0x00007fbe8510622c: cmpl $0x7,0x20(%r15) > 0x00007fbe85106234: je 0x00007fbe8510623b > .... 
> 0x00007fbe8510643e: add $0x40,%rsp > 0x00007fbe85106442: pop %rbp > 0x00007fbe85106443: jmpq 0x00007fbe8c16d000 ; {runtime_call unwind_exception Runtime1 stub} > [Exception Handler] > 0x00007fbe85106448: callq 0x00007fbe8c171380 ; {no_reloc} > 0x00007fbe8510644d: mov $0x7fbea2a72942,%rdi ; {external_word} > 0x00007fbe85106457: and $0xfffffffffffffff0,%rsp > 0x00007fbe8510645b: callq 0x00007fbea25bfc10 ; {runtime_call MacroAssembler::debug64(char*, long, long*)} > 0x00007fbe85106460: hlt > [Deopt Handler Code] > 0x00007fbe85106461: mov $0x7fbe85106461,%r10 ; {section_word} > 0x00007fbe8510646b: push %r10 > 0x00007fbe8510646d: jmpq 0x00007fbe8c0bce20 ; {runtime_call DeoptimizationBlob} > 0x00007fbe85106472: mov $0x0,%rbx ; {static_stub} > 0x00007fbe8510647c: jmpq 0x00007fbe8510647c ; {runtime_call} > 0x00007fbe85106481: hlt > 0x00007fbe85106482: hlt > 0x00007fbe85106483: hlt > 0x00007fbe85106484: hlt > 0x00007fbe85106485: hlt > 0x00007fbe85106486: hlt > 0x00007fbe85106487: hlt > -------------------------------------------------------------------------------- > [/Disassembly] > 2684 3 0 java.io.FileInputStream.read()I [0x00007fbe85106590, 0x00007fbe85106740 - 0x00007fbe85106928] > 2686 3 0 jdk.internal.org.jline.utils.NonBlockingInputStream.read(J)I [0x00007fbe85106a90, 0x00007fbe85106c20 - 0x00007fbe85106de0] > 2687 3 0 jdk.internal.org.jline.terminal.impl.AbstractPty.checkInterrupted()V [0x00007fbe85106e90, 0x00007fbe85107060 - 0x00007fbe85107458] > ... > > > A sample use case is we want to know where line of code we have high cache line contention once we know a JIT address from perf c2c tool: > > Cacheline 0x456017840 > -- Peer Snoop -- ------- Store Refs ------ ------- CL -------- ---------- cycles ---------- Total cpu > Rmt Lcl L1 Hit L1 Miss N/A Off Node PA cnt Code address rmt peer lcl peer load records cnt Symbol > 0.00% 35.59% 0.00% 0.00% 0.00% 0x0 1 1 0xffff688f2a84 0 406 324 199524 1 [.] 
0x0000ffff688f2a84 [JIT] ti > 0.00% 33.12% 0.00% 0.00% 0.00% 0x0 1 1 0xffff688f2ab8 0 411 329 190202 1 [.] > ... > > But this example is too conservative. In fact, after adding this function, we can easily check the assembly representation of any JIT method, whether we find a potential performance problem with a JIT address, or we find it from the flame graph, or when we do some debugging. This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/12381 From never at openjdk.org Wed Apr 12 19:32:40 2023 From: never at openjdk.org (Tom Rodriguez) Date: Wed, 12 Apr 2023 19:32:40 GMT Subject: RFR: 8305755: [JVMCI] missing barriers in CompilerToVM.readFieldValue for Reference.referent [v2] In-Reply-To: References: <0B1MuPO_KSgWyfCDYo3vsX4KTXV3o0x2NENQDTBzgWI=.f638f9e0-0168-4be7-b047-2e3f08a3864f@github.com> Message-ID: On Tue, 11 Apr 2023 00:15:41 GMT, Tom Rodriguez wrote: >> Add missing GC barrier for reflective read. I'm not sure the idiom I've chosen it the correct one so please correct me if there's a better way to write this. In testing, this resolved the issue. > > Tom Rodriguez has updated the pull request incrementally with one additional commit since the last revision: > > Add MO_SEQ_CST Thanks. I've added MO_SEQ_CST to the oop case and changed the other cases to use HeapAccess with MO_SEQ_CST. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13389#issuecomment-1505810573 From never at openjdk.org Wed Apr 12 19:32:38 2023 From: never at openjdk.org (Tom Rodriguez) Date: Wed, 12 Apr 2023 19:32:38 GMT Subject: RFR: 8305755: [JVMCI] missing barriers in CompilerToVM.readFieldValue for Reference.referent [v3] In-Reply-To: <0B1MuPO_KSgWyfCDYo3vsX4KTXV3o0x2NENQDTBzgWI=.f638f9e0-0168-4be7-b047-2e3f08a3864f@github.com> References: <0B1MuPO_KSgWyfCDYo3vsX4KTXV3o0x2NENQDTBzgWI=.f638f9e0-0168-4be7-b047-2e3f08a3864f@github.com> Message-ID: > Add missing GC barrier for reflective read. 
I'm not sure the idiom I've chosen is the correct one so please correct me if there's a better way to write this. In testing, this resolved the issue.

Tom Rodriguez has updated the pull request incrementally with one additional commit since the last revision:

  Add MO_SEQ_CST for primitive types

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/13389/files
  - new: https://git.openjdk.org/jdk/pull/13389/files/3cd12b39..16856123

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=13389&range=02
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13389&range=01-02

  Stats: 6 lines in 1 file changed: 0 ins; 0 del; 6 mod
  Patch: https://git.openjdk.org/jdk/pull/13389.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/13389/head:pull/13389

PR: https://git.openjdk.org/jdk/pull/13389

From eosterlund at openjdk.org Wed Apr 12 20:06:35 2023
From: eosterlund at openjdk.org (Erik Österlund)
Date: Wed, 12 Apr 2023 20:06:35 GMT
Subject: RFR: 8305755: [JVMCI] missing barriers in CompilerToVM.readFieldValue for Reference.referent [v3]
In-Reply-To:
References: <0B1MuPO_KSgWyfCDYo3vsX4KTXV3o0x2NENQDTBzgWI=.f638f9e0-0168-4be7-b047-2e3f08a3864f@github.com>
Message-ID:

On Wed, 12 Apr 2023 19:32:38 GMT, Tom Rodriguez wrote:

>> Add missing GC barrier for reflective read. I'm not sure the idiom I've chosen is the correct one so please correct me if there's a better way to write this. In testing, this resolved the issue.
>
> Tom Rodriguez has updated the pull request incrementally with one additional commit since the last revision:
>
>   Add MO_SEQ_CST for primitive types

Looks good.

-------------

Marked as reviewed by eosterlund (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/13389#pullrequestreview-1382035034 From rrich at openjdk.org Wed Apr 12 21:25:37 2023 From: rrich at openjdk.org (Richard Reingruber) Date: Wed, 12 Apr 2023 21:25:37 GMT Subject: RFR: 8305934: PPC64: Disable VMContinuations on Big Endian Message-ID: Disable VMContinuations on PPC64 big endian because there are known failures in jdk:jdk_loom tests. ------------- Commit messages: - Disable VMContinuations on big endian PPC Changes: https://git.openjdk.org/jdk/pull/13449/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13449&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8305934 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/13449.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13449/head:pull/13449 PR: https://git.openjdk.org/jdk/pull/13449 From rrich at openjdk.org Wed Apr 12 21:27:32 2023 From: rrich at openjdk.org (Richard Reingruber) Date: Wed, 12 Apr 2023 21:27:32 GMT Subject: RFR: 8305934: PPC64: Disable VMContinuations on Big Endian In-Reply-To: References: Message-ID: On Wed, 12 Apr 2023 21:16:33 GMT, Richard Reingruber wrote: > Disable VMContinuations on PPC64 big endian in general (not only on AIX) because there are known failures in jdk:jdk_loom tests. @backwaterred are you ok with disabling VMContinuations on PPC big endian in general (not only AIX)? This will simplify regression testing. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13449#issuecomment-1505969285 From thartmann at openjdk.org Thu Apr 13 06:23:32 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 13 Apr 2023 06:23:32 GMT Subject: RFR: 8305079: Remove finalize() from compiler/c2/Test719030 In-Reply-To: References: Message-ID: <5GLU3A66aFWARre8_19-7-CAKr_w_VSGud9sYg1fOZg=.9895f450-0f9f-4953-9b6c-8625f73a7bc8@github.com> On Tue, 11 Apr 2023 07:33:16 GMT, Afshin Zafari wrote: > The `finalize()` method is replaced by a Cleaner callback. Looks good to me. 
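For context on the change being reviewed, here is a minimal Java sketch of the general `finalize()`-to-`Cleaner` pattern. It is an assumption-laden illustration, not the code from the PR: `CleanerSketch`, `Resource`, and the `CLEANED` counter are hypothetical names, and the cleanup is triggered explicitly through `close()` so the behavior does not depend on garbage-collection timing.

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicInteger;

// Illustration only: shows the shape of a Cleaner-based replacement for finalize().
final class CleanerSketch {
    static final Cleaner CLEANER = Cleaner.create();
    // Counts how many times the cleanup action has run.
    static final AtomicInteger CLEANED = new AtomicInteger();

    static final class Resource implements AutoCloseable {
        private final Cleaner.Cleanable cleanable;

        Resource() {
            // The action must not capture 'this'; otherwise the object could
            // never become phantom reachable. Here it only touches a static counter.
            cleanable = CLEANER.register(this, CLEANED::incrementAndGet);
        }

        @Override
        public void close() {
            // Runs the action at most once; later calls are no-ops.
            cleanable.clean();
        }
    }

    public static void main(String[] args) {
        try (Resource r = new Resource()) {
            // use the resource
        }
        System.out.println(CLEANED.get());
    }
}
```

If `close()` is never called, the same action runs after the `Resource` becomes phantom reachable, which is the role `finalize()` used to play.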
------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13418#pullrequestreview-1382708079 From thartmann at openjdk.org Thu Apr 13 06:26:35 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 13 Apr 2023 06:26:35 GMT Subject: RFR: 8305080: Remove finalize() from test/hotspot/jtreg/compiler/jvmci/common/testcases that used in compiler/jvmci/compilerToVM/ tests In-Reply-To: References: Message-ID: On Tue, 11 Apr 2023 07:58:35 GMT, Afshin Zafari wrote: > The finalize() methods are removed and replaced by Cleaner callbacks. > > Note: > `test/hotspot/jtreg/compiler/jvmci/compilerToVM/HasFinalizableSubclassTest.java` may be removed since there is no need to test if finalize() exists in the subclasses or not.. The GraalVM team (@dougxc) should have a look at this. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13419#issuecomment-1506413700 From qamai at openjdk.org Thu Apr 13 07:05:54 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Thu, 13 Apr 2023 07:05:54 GMT Subject: Integrated: 8304450: [vectorapi] Refactor VectorShuffle implementation In-Reply-To: References: Message-ID: <1_1SPocmj-NTrY9ZZ35vVCt7Gc4dVtZVIxeyJwXrBj0=.66358c02-6efa-4d04-9e8b-3ceb12c6af66@github.com> On Sun, 19 Mar 2023 13:04:19 GMT, Quan Anh Mai wrote: > Hi, > > This patch reimplements `VectorShuffle` implementations to be a vector of the bit type. Currently, VectorShuffle is stored as a byte array, and would be expanded upon usage. This poses several drawbacks: > > 1. Inefficient conversions between a shuffle and its corresponding vector. This hinders the performance when the shuffle indices are not constant and are loaded or computed dynamically. > 2. Redundant expansions in `rearrange` operations. On all platforms, it seems that a shuffle index vector is always expanded to the correct type before executing the `rearrange` operations. > 3. 
Some redundant intrinsics are needed to support this handling as well as special considerations in the C2 compiler. > 4. Range checks are performed using `VectorShuffle::toVector`, which is inefficient for FP types since both FP conversions and FP comparisons are more expensive than the integral ones. > > Upon these changes, a `rearrange` can emit more efficient code: > > var species = IntVector.SPECIES_128; > var v1 = IntVector.fromArray(species, SRC1, 0); > var v2 = IntVector.fromArray(species, SRC2, 0); > v1.rearrange(v2.toShuffle()).intoArray(DST, 0); > > Before: > movabs $0x751589fa8,%r10 ; {oop([I{0x0000000751589fa8})} > vmovdqu 0x10(%r10),%xmm2 > movabs $0x7515a0d08,%r10 ; {oop([I{0x00000007515a0d08})} > vmovdqu 0x10(%r10),%xmm1 > movabs $0x75158afb8,%r10 ; {oop([I{0x000000075158afb8})} > vmovdqu 0x10(%r10),%xmm0 > vpand -0xddc12(%rip),%xmm0,%xmm0 # Stub::vector_int_to_byte_mask > ; {external_word} > vpackusdw %xmm0,%xmm0,%xmm0 > vpackuswb %xmm0,%xmm0,%xmm0 > vpmovsxbd %xmm0,%xmm3 > vpcmpgtd %xmm3,%xmm1,%xmm3 > vtestps %xmm3,%xmm3 > jne 0x00007fc2acb4e0d8 > vpmovzxbd %xmm0,%xmm0 > vpermd %ymm2,%ymm0,%ymm0 > movabs $0x751588f98,%r10 ; {oop([I{0x0000000751588f98})} > vmovdqu %xmm0,0x10(%r10) > > After: > movabs $0x751589c78,%r10 ; {oop([I{0x0000000751589c78})} > vmovdqu 0x10(%r10),%xmm1 > movabs $0x75158ac88,%r10 ; {oop([I{0x000000075158ac88})} > vmovdqu 0x10(%r10),%xmm2 > vpxor %xmm0,%xmm0,%xmm0 > vpcmpgtd %xmm2,%xmm0,%xmm3 > vtestps %xmm3,%xmm3 > jne 0x00007fa818b27cb1 > vpermd %ymm1,%ymm2,%ymm0 > movabs $0x751588c68,%r10 ; {oop([I{0x0000000751588c68})} > vmovdqu %xmm0,0x10(%r10) > > Please take a look and leave reviews. Thanks a lot. This pull request has now been integrated. 
Changeset: e846a1d7
Author: Quan Anh Mai
URL: https://git.openjdk.org/jdk/commit/e846a1d70043f7b57ae76847e85e5426c86539a5
Stats: 3690 lines in 64 files changed: 1615 ins; 1169 del; 906 mod

8304450: [vectorapi] Refactor VectorShuffle implementation

Reviewed-by: psandoz, xgong, jbhateja, vlivanov

-------------

PR: https://git.openjdk.org/jdk/pull/13093

From dnsimon at openjdk.org Thu Apr 13 07:30:34 2023
From: dnsimon at openjdk.org (Doug Simon)
Date: Thu, 13 Apr 2023 07:30:34 GMT
Subject: RFR: 8305755: [JVMCI] missing barriers in CompilerToVM.readFieldValue for Reference.referent [v3]
In-Reply-To:
References: <0B1MuPO_KSgWyfCDYo3vsX4KTXV3o0x2NENQDTBzgWI=.f638f9e0-0168-4be7-b047-2e3f08a3864f@github.com>
Message-ID: <2CfSAETeFi90EUX2zXJlY_mU1uwaF1e71Spj8mQXkhU=.0c2dadee-44a9-4e15-b0f1-c64aad764434@github.com>

On Wed, 12 Apr 2023 19:32:38 GMT, Tom Rodriguez wrote:

>> Add missing GC barrier for reflective read. I'm not sure the idiom I've chosen is the correct one so please correct me if there's a better way to write this. In testing, this resolved the issue.
>
> Tom Rodriguez has updated the pull request incrementally with one additional commit since the last revision:
>
>   Add MO_SEQ_CST for primtive types

Marked as reviewed by dnsimon (Committer).

-------------

PR Review: https://git.openjdk.org/jdk/pull/13389#pullrequestreview-1382802080

From dnsimon at openjdk.org Thu Apr 13 08:00:33 2023
From: dnsimon at openjdk.org (Doug Simon)
Date: Thu, 13 Apr 2023 08:00:33 GMT
Subject: RFR: 8305080: Remove finalize() from test/hotspot/jtreg/compiler/jvmci/common/testcases that used in compiler/jvmci/compilerToVM/ tests
In-Reply-To:
References:
Message-ID:

On Tue, 11 Apr 2023 07:58:35 GMT, Afshin Zafari wrote:

> The finalize() methods are removed and replaced by Cleaner callbacks.
>
> Note:
> `test/hotspot/jtreg/compiler/jvmci/compilerToVM/HasFinalizableSubclassTest.java` may be removed since there is no need to test if finalize() exists in the subclasses or not.
These tests are for JVMCI functionality related to implementing finalizers properly. As David states [here](https://bugs.openjdk.org/browse/JDK-8305063?focusedCommentId=14570181&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14570181), these tests should only be changed when finalization itself is removed. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13419#issuecomment-1506519024 From mdoerr at openjdk.org Thu Apr 13 09:30:32 2023 From: mdoerr at openjdk.org (Martin Doerr) Date: Thu, 13 Apr 2023 09:30:32 GMT Subject: RFR: 8305934: PPC64: Disable VMContinuations on Big Endian In-Reply-To: References: Message-ID: On Wed, 12 Apr 2023 21:16:33 GMT, Richard Reingruber wrote: > Disable VMContinuations on PPC64 big endian in general (not only on AIX) because there are known failures in jdk:jdk_loom tests. LGTM. ------------- Marked as reviewed by mdoerr (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13449#pullrequestreview-1383020245 From dzhang at openjdk.org Thu Apr 13 09:37:35 2023 From: dzhang at openjdk.org (Dingli Zhang) Date: Thu, 13 Apr 2023 09:37:35 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v16] In-Reply-To: References: Message-ID: <8j0il6k2xmB5s72N2EAiTijDDjN7FNoXrPux4ur9IdE=.485aa3bb-de42-4bbb-a374-1c2ec7340f71@github.com> > HI, > > We have added support for vector add mask instructions, please take a look and have some reviews. Thanks a lot! > This patch will add support of vector add/sub/mul/div mask version. It was implemented by referring to RVV v1.0 [1]. > > ## Load/Store/Cmp Mask > `VectorLoadMask, VectorMaskCmp, VectorStoreMask` will implement the mask datapath. We can see where the data is passed in the compilation log with `jdk/incubator/vector/Byte128VectorTests.java`? > > 218 loadV V1, [R7] # vector (rvv) > 220 vloadmask V0, V1 > ... 
> 23c vmaskcmp_rvv_masked V0, V4, V5, V0, V1, #0 > 24c vstoremask V1, V0 > 258 storeV [R7], V1 # vector (rvv) > > > The corresponding generated jit assembly? > > # loadV > 0x000000400c8ef958: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef95c: vle8.v v1,(t2) > > # vloadmask > 0x000000400c8ef960: vsetvli t0,zero,e8,m1,tu, > 0x000000400c8ef964: vmsne.vx v0,v1,zero > > # vmaskcmp_rvv_masked > 0x000000400c8ef97c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef980: vmclr.m v1 > 0x000000400c8ef984: vmseq.vv v1,v4,v5,v0.t > 0x000000400c8ef988: vmv1r.v v0,v1 > > # vstoremask > 0x000000400c8ef98c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef990: vmv.v.x v1,zero > 0x000000400c8ef994: vmerge.vim v1,v1,1,v0 > > > ## Masked vector arithmetic instructions (e.g. vadd) > AddMaskTestMerge case: > > import jdk.incubator.vector.IntVector; > import jdk.incubator.vector.VectorMask; > import jdk.incubator.vector.VectorOperators; > import jdk.incubator.vector.VectorSpecies; > > public class AddMaskTestMerge { > > static final VectorSpecies SPECIES = IntVector.SPECIES_128; > static final int SIZE = 1024; > static int[] a = new int[SIZE]; > static int[] b = new int[SIZE]; > static int[] r = new int[SIZE]; > static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false}; > static { > for (int i = 0; i < SIZE; i++) { > a[i] = i; > b[i] = i; > } > } > > static void workload(int idx) { > VectorMask vmask = VectorMask.fromArray(SPECIES, c, 0); > IntVector av = IntVector.fromArray(SPECIES, a, idx); > IntVector bv = IntVector.fromArray(SPECIES, b, idx); > av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx); > } > > public static void main(String[] args) { > for (int i = 0; i < 30_0000; i++) { > for (int j = 0; j < SIZE; j += SPECIES.length()) { > workload(j); > } > } > } > } > > > This test case is reduced from existing jtreg vector tests Int128VectorTests.java[2]. 
This test case corresponds to the add instruction of the vector mask version and other instructions are similar. > > Before this patch, the compilation log will not print RVV-related instructions. Now the compilation log is as follows: > > > 0ae B10: # out( B25 B11 ) <- in( B9 ) Freq: 0.999991 > 0ae loadV V1, [R31] # vector (rvv) > 0b6 vloadmask V0, V2 > 0be vadd.vv V3, V1, V0 #@vaddI_masked > 0c6 lwu R28, [R7, #124] # loadN, compressed ptr, #@loadN ! Field: AddMaskTestMerge.r > 0ca decode_heap_oop R28, R28 #@decodeHeapOop > 0cc lwu R7, [R28, #12] # range, #@loadRange > 0d0 NullCheck R28 > > > And the jit code is as follows: > > > 0x000000400c823cee: vsetvli t0,zero,e32,m1,tu,mu > 0x000000400c823cf2: vle32.v v1,(t6) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::intoArray at 43 (line 3228) > ; - AddMaskTestMerge::workload at 46 (line 25) > 0x000000400c823cf6: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c823cfa: vmsne.vx v0,v2,zero ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.VectorMask::fromArray at 47 (line 208) > ; - AddMaskTestMerge::workload at 7 (line 22) > 0x000000400c823cfe: vsetvli t0,zero,e32,m1,tu,mu > 0x000000400c823d02: vadd.vv v3,v3,v1,v0.t ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::lanewiseTemplate at 192 (line 834) > ; - jdk.incubator.vector.Int128Vector::lanewise at 9 (line 291) > ; - jdk.incubator.vector.Int128Vector::lanewise at 4 (line 41) > ; - AddMaskTestMerge::workload at 39 (line 25) > > > ## Mask register allocation & mask bit opreation > Since v0 is to be used as a mask register in spec[1], sometimes we need two vmask to do the vector mask logical ops like `AndVMask, OrVMask, XorVMask`. And if only v0 and v31 mask registers are defined, the corresponding c2 nodes will not be generated correctly because of the register pressure[3]. 
> When we use only v0 and v31 as mask registers, jtreg testing of Byte128VectorTests.java[4] with `-XX:+PrintAssembly` and `-XX:LogFile` will not emit the expected rvv mask instruction. As opposed to this, the following compilation failure log[3] is generated: > > > > > > > > > So define v30 and v31 as mask register too and `AndVMask` will emit the C2 JIT code like: > > vloadmask V0, V1 > vloadmask V30, V2 > vmask_and V0, V30, V0 > > We also modified the implementation of `spill_copy_vector_stack_to_stack ` so that it no longer occupies the v0 register. In addition to that, we change some node like `vasr/vlsl/vlsr/vstring_x/varray_x/vclearArray_x`, which use v0 internally, to make C2 to sense that they used v0. > > ## vector load/store - predicated & blend opreation > > Jtreg testing of Byte128VectorTests.java[4] with -XX:+PrintOptoAssembly and -XX:LogFile will print the following compilation log, which generated by predicated vector load/store: > > 152 B21: # out( B22 ) <- in( B20 ) Freq: 0.499984 > 152 vmask_gen_L V0, R12 > 162 loadV_masked V1, V0, [R10] > 16e storeV_masked [R11], V0, V1 > > > And `VectorBlend` will generate the following compilation log (part of rotate opreation): > > 1ea vlsrBS V6, V1, V3 V0 > 1fe vlslBS V5, V1, V2 V0 > 212 vor.vv V2, V5, V6 #@vor > 21a vloadmask V0, V4 > 222 vmerge_vvm V1, V1, V2 # vector blend > 22a bgeu R9, R30, B56 #@cmpU_branch P=0.000001 C=-1.000000 > > > At the same time, we added the predicated nodes of `RShiftV/LShiftV/URShiftV`. While there was some code duplication for the corresponding nodes in non-masked form, so a small refactoring was done. 
>
>
> [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc
> [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java
> [3] https://github.com/openjdk/jdk/blob/0deb648985b018653ccdaf193dc13b3cf21c088a/src/hotspot/share/opto/chaitin.cpp#L526
> [4] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Byte128VectorTests.java
>
> ### Testing:
>
> qemu with UseRVV:
> - [x] Tier1 tests (release)
> - [x] Tier2 tests (release)
> - [x] Tier3 tests (release)
> - [x] test/jdk/jdk/incubator/vector (release/fastdebug)

Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision:

  Fix typo

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/12682/files
  - new: https://git.openjdk.org/jdk/pull/12682/files/45f499e3..bcbab448

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=15
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=14-15

  Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
  Patch: https://git.openjdk.org/jdk/pull/12682.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/12682/head:pull/12682

PR: https://git.openjdk.org/jdk/pull/12682

From mdoerr at openjdk.org Thu Apr 13 09:52:35 2023
From: mdoerr at openjdk.org (Martin Doerr)
Date: Thu, 13 Apr 2023 09:52:35 GMT
Subject: RFR: 8305351: C2 setScopedValueCache intrinsic doesn't use access API
In-Reply-To:
References:
Message-ID:

On Tue, 4 Apr 2023 12:40:14 GMT, Erik Österlund wrote:

> The setScopedValueCache intrinsic for C2 doesn't use the access API. Instead, we store into an OopHandle with a raw store. That doesn't necessarily play well with all GCs, for example Shenandoah and generational ZGC. We should use the access API to ensure the right barriers are emitted.

Thanks for fixing it! I guess we should also backport the fix for Shenandoah.

-------------

Marked as reviewed by mdoerr (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/13324#pullrequestreview-1383059862 From coleenp at openjdk.org Thu Apr 13 12:33:18 2023 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 13 Apr 2023 12:33:18 GMT Subject: RFR: 8305404: Compile_lock not needed for InstanceKlass::implementor() Message-ID: See CR for details. Tested with tier1-4, 7, 8. ------------- Commit messages: - 8305404: Compile_lock not needed for InstanceKlass::implementor() Changes: https://git.openjdk.org/jdk/pull/13458/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13458&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8305404 Stats: 11 lines in 2 files changed: 0 ins; 5 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/13458.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13458/head:pull/13458 PR: https://git.openjdk.org/jdk/pull/13458 From fjiang at openjdk.org Thu Apr 13 13:41:51 2023 From: fjiang at openjdk.org (Feilong Jiang) Date: Thu, 13 Apr 2023 13:41:51 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v16] In-Reply-To: <8j0il6k2xmB5s72N2EAiTijDDjN7FNoXrPux4ur9IdE=.485aa3bb-de42-4bbb-a374-1c2ec7340f71@github.com> References: <8j0il6k2xmB5s72N2EAiTijDDjN7FNoXrPux4ur9IdE=.485aa3bb-de42-4bbb-a374-1c2ec7340f71@github.com> Message-ID: On Thu, 13 Apr 2023 09:37:35 GMT, Dingli Zhang wrote: >> HI, >> >> We have added support for vector add mask instructions, please take a look and have some reviews. Thanks a lot! >> This patch will add support of vector add/sub/mul/div mask version. It was implemented by referring to RVV v1.0 [1]. >> >> ## Load/Store/Cmp Mask >> `VectorLoadMask, VectorMaskCmp, VectorStoreMask` will implement the mask datapath. We can see where the data is passed in the compilation log with `jdk/incubator/vector/Byte128VectorTests.java`? >> >> 218 loadV V1, [R7] # vector (rvv) >> 220 vloadmask V0, V1 >> ... 
>> 23c vmaskcmp_rvv_masked V0, V4, V5, V0, V1, #0 >> 24c vstoremask V1, V0 >> 258 storeV [R7], V1 # vector (rvv) >> >> >> The corresponding generated jit assembly? >> >> # loadV >> 0x000000400c8ef958: vsetvli t0,zero,e8,m1,tu,mu >> 0x000000400c8ef95c: vle8.v v1,(t2) >> >> # vloadmask >> 0x000000400c8ef960: vsetvli t0,zero,e8,m1,tu, >> 0x000000400c8ef964: vmsne.vx v0,v1,zero >> >> # vmaskcmp_rvv_masked >> 0x000000400c8ef97c: vsetvli t0,zero,e8,m1,tu,mu >> 0x000000400c8ef980: vmclr.m v1 >> 0x000000400c8ef984: vmseq.vv v1,v4,v5,v0.t >> 0x000000400c8ef988: vmv1r.v v0,v1 >> >> # vstoremask >> 0x000000400c8ef98c: vsetvli t0,zero,e8,m1,tu,mu >> 0x000000400c8ef990: vmv.v.x v1,zero >> 0x000000400c8ef994: vmerge.vim v1,v1,1,v0 >> >> >> ## Masked vector arithmetic instructions (e.g. vadd) >> AddMaskTestMerge case: >> >> import jdk.incubator.vector.IntVector; >> import jdk.incubator.vector.VectorMask; >> import jdk.incubator.vector.VectorOperators; >> import jdk.incubator.vector.VectorSpecies; >> >> public class AddMaskTestMerge { >> >> static final VectorSpecies SPECIES = IntVector.SPECIES_128; >> static final int SIZE = 1024; >> static int[] a = new int[SIZE]; >> static int[] b = new int[SIZE]; >> static int[] r = new int[SIZE]; >> static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false}; >> static { >> for (int i = 0; i < SIZE; i++) { >> a[i] = i; >> b[i] = i; >> } >> } >> >> static void workload(int idx) { >> VectorMask vmask = VectorMask.fromArray(SPECIES, c, 0); >> IntVector av = IntVector.fromArray(SPECIES, a, idx); >> IntVector bv = IntVector.fromArray(SPECIES, b, idx); >> av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx); >> } >> >> public static void main(String[] args) { >> for (int i = 0; i < 30_0000; i++) { >> for (int j = 0; j < SIZE; j += SPECIES.length()) { >> workload(j); >> } >> } >> } >> } >> >> >> This test case is reduced from existing jtreg vector tests Int128VectorTests.java[2]. 
This test case corresponds to the add instruction of the vector mask version and other instructions are similar. >> >> Before this patch, the compilation log will not print RVV-related instructions. Now the compilation log is as follows: >> >> >> 0ae B10: # out( B25 B11 ) <- in( B9 ) Freq: 0.999991 >> 0ae loadV V1, [R31] # vector (rvv) >> 0b6 vloadmask V0, V2 >> 0be vadd.vv V3, V1, V0 #@vaddI_masked >> 0c6 lwu R28, [R7, #124] # loadN, compressed ptr, #@loadN ! Field: AddMaskTestMerge.r >> 0ca decode_heap_oop R28, R28 #@decodeHeapOop >> 0cc lwu R7, [R28, #12] # range, #@loadRange >> 0d0 NullCheck R28 >> >> >> And the jit code is as follows: >> >> >> 0x000000400c823cee: vsetvli t0,zero,e32,m1,tu,mu >> 0x000000400c823cf2: vle32.v v1,(t6) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0} >> ; - jdk.incubator.vector.IntVector::intoArray at 43 (line 3228) >> ; - AddMaskTestMerge::workload at 46 (line 25) >> 0x000000400c823cf6: vsetvli t0,zero,e8,m1,tu,mu >> 0x000000400c823cfa: vmsne.vx v0,v2,zero ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0} >> ; - jdk.incubator.vector.VectorMask::fromArray at 47 (line 208) >> ; - AddMaskTestMerge::workload at 7 (line 22) >> 0x000000400c823cfe: vsetvli t0,zero,e32,m1,tu,mu >> 0x000000400c823d02: vadd.vv v3,v3,v1,v0.t ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0} >> ; - jdk.incubator.vector.IntVector::lanewiseTemplate at 192 (line 834) >> ; - jdk.incubator.vector.Int128Vector::lanewise at 9 (line 291) >> ; - jdk.incubator.vector.Int128Vector::lanewise at 4 (line 41) >> ; - AddMaskTestMerge::workload at 39 (line 25) >> >> >> ## Mask register allocation & mask bit opreation >> Since v0 is to be used as a mask register in spec[1], sometimes we need two vmask to do the vector mask logical ops like `AndVMask, OrVMask, XorVMask`. And if only v0 and v31 mask registers are defined, the corresponding c2 nodes will not be generated correctly because of the register pressure[3]. 
>> When we use only v0 and v31 as mask registers, jtreg testing of Byte128VectorTests.java[4] with `-XX:+PrintAssembly` and `-XX:LogFile` will not emit the expected rvv mask instruction. As opposed to this, the following compilation failure log[3] is generated: >> >> >> >> >> >> >> >> >> So define v30 and v31 as mask register too and `AndVMask` will emit the C2 JIT code like: >> >> vloadmask V0, V1 >> vloadmask V30, V2 >> vmask_and V0, V30, V0 >> >> We also modified the implementation of `spill_copy_vector_stack_to_stack ` so that it no longer occupies the v0 register. In addition to that, we change some node like `vasr/vlsl/vlsr/vstring_x/varray_x/vclearArray_x`, which use v0 internally, to make C2 to sense that they used v0. >> >> ## vector load/store - predicated & blend opreation >> >> Jtreg testing of Byte128VectorTests.java[4] with -XX:+PrintOptoAssembly and -XX:LogFile will print the following compilation log, which generated by predicated vector load/store: >> >> 152 B21: # out( B22 ) <- in( B20 ) Freq: 0.499984 >> 152 vmask_gen_L V0, R12 >> 162 loadV_masked V1, V0, [R10] >> 16e storeV_masked [R11], V0, V1 >> >> >> And `VectorBlend` will generate the following compilation log (part of rotate opreation): >> >> 1ea vlsrBS V6, V1, V3 V0 >> 1fe vlslBS V5, V1, V2 V0 >> 212 vor.vv V2, V5, V6 #@vor >> 21a vloadmask V0, V4 >> 222 vmerge_vvm V1, V1, V2 # vector blend >> 22a bgeu R9, R30, B56 #@cmpU_branch P=0.000001 C=-1.000000 >> >> >> At the same time, we added the predicated nodes of `RShiftV/LShiftV/URShiftV`. While there was some code duplication for the corresponding nodes in non-masked form, so a small refactoring was done. 
>>
>>
>> [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc
>> [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java
>> [3] https://github.com/openjdk/jdk/blob/0deb648985b018653ccdaf193dc13b3cf21c088a/src/hotspot/share/opto/chaitin.cpp#L526
>> [4] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Byte128VectorTests.java
>>
>> ### Testing:
>>
>> qemu with UseRVV:
>> - [x] Tier1 tests (release)
>> - [x] Tier2 tests (release)
>> - [x] Tier3 tests (release)
>> - [x] test/jdk/jdk/incubator/vector (release/fastdebug)
>
> Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision:
>
>   Fix typo

From a cursory review, with some comments.

src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1726:

> 1724: }
> 1725:
> 1726: void C2_MacroAssembler::rvv_compare(VectorRegister vd, BasicType bt, int length_in_bytes, VectorRegister src1, VectorRegister src2, int cond, VectorMask vm) {

All RVV-related methods are named with the suffix `_v` in C2_MacroAssembler (except `rvv_vsetvli` and `rvv_reduce_integral`, which should be renamed too, IMO); I think we should follow this naming style.

src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1734:

> 1732: case BoolTest::ne: vmfne_vv(vd, src1, src2, vm); break;
> 1733: case BoolTest::le: vmfle_vv(vd, src1, src2, vm); break;
> 1734: case BoolTest::ge: vmfle_vv(vd, src2, src1, vm); break;

Maybe we could add some pseudo instructions like `vmfge_vv`/`vmfgt_vv` [1].

1.
https://github.com/riscv/riscv-v-spec/blob/b9afd6f5709fe3f91ce39bb83695bcfaa78eef94/v-spec.adoc?plain=1#L2724-L2729 src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.hpp line 203: > 201: void rvv_vsetvli(BasicType bt, int length_in_bytes, Register tmp = t0); > 202: > 203: void rvv_compare(VectorRegister dst, BasicType bt, int length_in_bytes, Suggestion: void compare_v(VectorRegister dst, BasicType bt, int length_in_bytes, src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.hpp line 230: > 228: > 229: // Clear vector registers independent of previous vl and vtype. > 230: void rvv_clear_register(VectorRegister v) { Suggestion: void clear_register_v(VectorRegister v) { src/hotspot/cpu/riscv/riscv.ad line 919: > 917: // The mask value used to control execution of a masked vector > 918: // instruction is always supplied by vector register v0. > 919: reg_class vectmask_reg_v0 ( Suggestion: reg_class vmask_reg_v0 ( src/hotspot/cpu/riscv/riscv.ad line 926: > 924: // We need two more vmask registers to do the vector mask logical ops, > 925: // so define v30, v31 as mask register too. > 926: reg_class vectmask_reg ( Suggestion: reg_class vmask_reg ( src/hotspot/cpu/riscv/riscv_v.ad line 2108: > 2106: > 2107: instruct vsubS_masked(vReg dst_src1, vReg src2, vRegMask_V0 vmask) %{ > 2108: match(Set dst_src1 (SubVS (Binary dst_src1 src2) vmask)); Can we just merge those match rules in one instruct just like `vlsrIL`? Looks like those instructs only differ from BasicType. ------------- Changes requested by fjiang (Author). 
PR Review: https://git.openjdk.org/jdk/pull/12682#pullrequestreview-1383275726
PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1165433737
PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1165499306
PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1165505023
PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1165447286
PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1165447547
PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1165482086
PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1165482301
PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1165523204

From eosterlund at openjdk.org Thu Apr 13 15:51:34 2023
From: eosterlund at openjdk.org (Erik Österlund)
Date: Thu, 13 Apr 2023 15:51:34 GMT
Subject: RFR: 8305404: Compile_lock not needed for InstanceKlass::implementor()
In-Reply-To:
References:
Message-ID:

On Thu, 13 Apr 2023 12:25:27 GMT, Coleen Phillimore wrote:

> See CR for details.
> Tested with tier1-4, 7, 8.

Looks good.

-------------

Marked as reviewed by eosterlund (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/13458#pullrequestreview-1383733329

From tsteele at openjdk.org Thu Apr 13 16:46:41 2023
From: tsteele at openjdk.org (Tyler Steele)
Date: Thu, 13 Apr 2023 16:46:41 GMT
Subject: RFR: 8305668: PPC: Non-Top Interpreted frames should be independent of ABI_ELFv2
In-Reply-To: <-PvmUW02zjXLVrTELOX6IJ7mtthBK6ryAxtyd-57LtQ=.1caa1d2d-4fcb-4492-ae81-99a22035ab3f@github.com>
References: <-PvmUW02zjXLVrTELOX6IJ7mtthBK6ryAxtyd-57LtQ=.1caa1d2d-4fcb-4492-ae81-99a22035ab3f@github.com>
Message-ID:

On Tue, 11 Apr 2023 14:22:46 GMT, Richard Reingruber wrote:

>> This PR makes parent interpreted Java frames independent of `ABI_ELFv2`.
>> With the changes `test/jdk/jdk/internal/vm/Continuation/BasicExt.java#COMP_ALL` succeeds on PPC64 Big Endian Linux.
>> >> Before: >> >> * `parent_ijava_frame_abi` was derived from `abi_minframe` which depends on `ABI_ELFv2` >> * jit_abi is independent of `abi_minframe` >> * `frame::metadata_words` is wrong for `parent_ijava_frame_abi` if `ABI_ELFv2` is not defined (big endian) >> >> After changes: >> >> * prefixed structs that depend on `ABI_ELFv2` with `native_` >> * introduced `java_abi` which is independent of `ABI_ELFv2` >> * `frame::metadata_words` is the size in words of `java_abi` >> >> This is still a little imprecise since `top_ijava_frame_abi` is larger than `java_abi` but the top frame is never frozen as it is always `vmIntrinsics::_Continuation_doYield` >> >> Testing: >> >> PPC64le: most JCK and JTREG tiers 1-4, also in Xcomp mode. >> PPC64be Linux: hotspot tier1 tests > > Going back to draft because the following fails with this pr: > > > time make test TEST=test/hotspot/jtreg/serviceability/jvmti/thread/GetThreadState/thrstat03 @reinrich Thanks for spotting this! This issue was affecting my VThreads port in a way I hadn't correctly diagnosed. > Going back to draft because the following fails with this pr: > > ``` > time make test TEST=test/hotspot/jtreg/serviceability/jvmti/thread/GetThreadState/thrstat03 > ``` Out of curiosity, what is the problem with the above command? It seems to work as expected for me. 
Building target 'test' in configuration 'aix-ppc64-server-fastdebug' Test selection 'test/hotspot/jtreg/serviceability/jvmti/thread/GetThreadState/thrstat03', will run: * jtreg:test/hotspot/jtreg/serviceability/jvmti/thread/GetThreadState/thrstat03 Running test 'jtreg:test/hotspot/jtreg/serviceability/jvmti/thread/GetThreadState/thrstat03' Passed: serviceability/jvmti/thread/GetThreadState/thrstat03/thrstat03.java Test results: passed: 1 Report written to /home/hotspot/openjdk/jdk-tyler/build/aix-ppc64-server-fastdebug/test-results/jtreg_test_hotspot_jtreg_serviceability_jvmti_thread_GetThreadState_thrstat03/html/report.html Results written to /home/hotspot/openjdk/jdk-tyler/build/aix-ppc64-server-fastdebug/test-support/jtreg_test_hotspot_jtreg_serviceability_jvmti_thread_GetThreadState_thrstat03 Finished running test 'jtreg:test/hotspot/jtreg/serviceability/jvmti/thread/GetThreadState/thrstat03' Test report is stored in build/aix-ppc64-server-fastdebug/test-results/jtreg_test_hotspot_jtreg_serviceability_jvmti_thread_GetThreadState_thrstat03 ============================== Test summary ============================== TEST TOTAL PASS FAIL ERROR jtreg:test/hotspot/jtreg/serviceability/jvmti/thread/GetThreadState/thrstat03 1 1 0 0 ============================== TEST SUCCESS Finished building target 'test' in configuration 'aix-ppc64-server-fastdebug' real 0m41.091s user 0m42.847s sys 0m5.076s ------------- PR Comment: https://git.openjdk.org/jdk/pull/13372#issuecomment-1503627092 PR Comment: https://git.openjdk.org/jdk/pull/13372#issuecomment-1504222801 From rrich at openjdk.org Thu Apr 13 16:46:42 2023 From: rrich at openjdk.org (Richard Reingruber) Date: Thu, 13 Apr 2023 16:46:42 GMT Subject: RFR: 8305668: PPC: Non-Top Interpreted frames should be independent of ABI_ELFv2 In-Reply-To: References: Message-ID: <-_adzJAIBS7ZIqMGGNH3GP8Zz_JAT8Oe2TiWQJR4CEM=.abdd74c0-0ff2-49b4-8310-1efdb3712cda@github.com> On Thu, 6 Apr 2023 13:22:49 GMT, Richard Reingruber wrote: > This 
PR makes parent interpreted Java frames independent of `ABI_ELFv2`. > With the changes `test/jdk/jdk/internal/vm/Continuation/BasicExt.java#COMP_ALL` succeeds on PPC64 Big Endian Linux. > > Before: > > * `parent_ijava_frame_abi` was derived from `abi_minframe` which depends on `ABI_ELFv2` > * jit_abi is independent of `abi_minframe` > * `frame::metadata_words` is wrong for `parent_ijava_frame_abi` if `ABI_ELFv2` is not defined (big endian) > > After changes: > > * prefixed structs that depend on `ABI_ELFv2` with `native_` > * introduced `java_abi` which is independent of `ABI_ELFv2` > * `frame::metadata_words` is the size in words of `java_abi` > > This is still a little imprecise since `top_ijava_frame_abi` is larger than `java_abi` but the top frame is never frozen as it is always `vmIntrinsics::_Continuation_doYield` > > Testing: > > PPC64le: most JCK and JTREG tiers 1-4, also in Xcomp mode. > PPC64be Linux: hotspot tier1 tests Thanks for the testing. Are you sure though your build includes the changes of this pr? There is a problem in exception handling which is related to the changes. `test/hotspot/jtreg/serviceability/jvmti/thread/GetThreadState/thrstat03` crashes in my tests with SIGSEGV: Stack: [0x000004001a0d0000,0x000004001a2d0000], sp=0x000004001a2cd310, free space=2036k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.so+0xedc318] frame::interpreter_frame_method() const+0x58 (frame.cpp:398) V [libjvm.so+0x11bdd30] InterpreterRuntime::exception_handler_for_exception(JavaThread*, oopDesc*)+0x120 (interpreterRuntime.cpp:473) Other tests seem to hang. Also the failures of `test/jdk/jdk/internal/vm/Continuation/BasicExt.java` are related to exception handling. BTW: does the COMP_ALL variant of BasicExt.java succeed on AIX too with these changes? 
------------- PR Comment: https://git.openjdk.org/jdk/pull/13372#issuecomment-1504730860 From tsteele at openjdk.org Thu Apr 13 16:46:43 2023 From: tsteele at openjdk.org (Tyler Steele) Date: Thu, 13 Apr 2023 16:46:43 GMT Subject: RFR: 8305668: PPC: Non-Top Interpreted frames should be independent of ABI_ELFv2 In-Reply-To: References: Message-ID: <00jGuPkoV8d38IYv8qUcsAjC9_D6P2UC1GZo38QsI5o=.aaf3ec24-199f-4b71-ae63-2b0f34470bf8@github.com> On Thu, 6 Apr 2023 13:22:49 GMT, Richard Reingruber wrote: > This PR makes parent interpreted Java frames independent of `ABI_ELFv2`. > With the changes `test/jdk/jdk/internal/vm/Continuation/BasicExt.java#COMP_ALL` succeeds on PPC64 Big Endian Linux. > > Before: > > * `parent_ijava_frame_abi` was derived from `abi_minframe` which depends on `ABI_ELFv2` > * jit_abi is independent of `abi_minframe` > * `frame::metadata_words` is wrong for `parent_ijava_frame_abi` if `ABI_ELFv2` is not defined (big endian) > > After changes: > > * prefixed structs that depend on `ABI_ELFv2` with `native_` > * introduced `java_abi` which is independent of `ABI_ELFv2` > * `frame::metadata_words` is the size in words of `java_abi` > > This is still a little imprecise since `top_ijava_frame_abi` is larger than `java_abi` but the top frame is never frozen as it is always `vmIntrinsics::_Continuation_doYield` > > Testing: > > PPC64le: most JCK and JTREG tiers 1-4, also in Xcomp mode. > PPC64be Linux: hotspot tier1 tests Interesting. I see no such failure on AIX either on my VThread branch, or when checking this PR out directly. I am currently running on Linux/Power to see if I can confirm that I see the failure there. I don't see any failures in BasicExt.java on my aix/vthread branch with these and other changes (most notably the PollsetPoller implementation). For reference, I created #13452. It collects some in-progress changes from me, changes from this PR, and some changes from Matthias that enable the Harfbuzz library to build. 
> [VMContinuations are disabled on AIX](https://github.com/openjdk/jdk/blob/bc15163386659bfd549576817b4efe7307261ea8/src/hotspot/cpu/ppc/globals_ppc.hpp#L59). Have you changed that line? BasicExt.java is otherwise skipped because [it requires continuations](https://github.com/openjdk/jdk/blob/bc15163386659bfd549576817b4efe7307261ea8/test/jdk/jdk/internal/vm/Continuation/BasicExt.java#L28). If you run it standalone (w/o jtreg) you should get an UnsupportedOperationException

With the changes mentioned above, I believe I am running your VThreads implementation on AIX: VMContinuations is indeed set to true, and BasicExt should be running the 'real' vthread impl. The issues I see appear to be hangs while waiting for my PollsetPoller implementation, so I don't believe anything [in this PR] is an issue for me.

> I'm planning to disable VMContinuations on PPC big endian platforms for now. This will help testing changes like this pr.

Sounds good. I'll make sure to test my changes on AIX & Linux BE before merging. Thanks again for catching this issue. I was definitely looking in the wrong place for its cause.
------------- PR Comment: https://git.openjdk.org/jdk/pull/13372#issuecomment-1505624569 PR Comment: https://git.openjdk.org/jdk/pull/13372#issuecomment-1506127285 From rrich at openjdk.org Thu Apr 13 16:46:46 2023 From: rrich at openjdk.org (Richard Reingruber) Date: Thu, 13 Apr 2023 16:46:46 GMT Subject: RFR: 8305668: PPC: Non-Top Interpreted frames should be independent of ABI_ELFv2 In-Reply-To: References: <-PvmUW02zjXLVrTELOX6IJ7mtthBK6ryAxtyd-57LtQ=.1caa1d2d-4fcb-4492-ae81-99a22035ab3f@github.com> Message-ID: On Tue, 11 Apr 2023 22:46:01 GMT, Tyler Steele wrote: > Going back to draft because the following fails with this pr: > > ``` > time make test TEST=test/hotspot/jtreg/serviceability/jvmti/thread/GetThreadState/thrstat03 > ``` I've overlooked that there's also a [test case with a virtual thread](https://github.com/openjdk/jdk/blob/bc15163386659bfd549576817b4efe7307261ea8/test/hotspot/jtreg/serviceability/jvmti/thread/GetThreadState/thrstat03/thrstat03.java#L59). The test succeeds with `VMContinuations` disabled. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13372#issuecomment-1505831116 From rrich at openjdk.org Thu Apr 13 16:46:47 2023 From: rrich at openjdk.org (Richard Reingruber) Date: Thu, 13 Apr 2023 16:46:47 GMT Subject: RFR: 8305668: PPC: Non-Top Interpreted frames should be independent of ABI_ELFv2 In-Reply-To: <00jGuPkoV8d38IYv8qUcsAjC9_D6P2UC1GZo38QsI5o=.aaf3ec24-199f-4b71-ae63-2b0f34470bf8@github.com> References: <00jGuPkoV8d38IYv8qUcsAjC9_D6P2UC1GZo38QsI5o=.aaf3ec24-199f-4b71-ae63-2b0f34470bf8@github.com> Message-ID: On Wed, 12 Apr 2023 17:05:46 GMT, Tyler Steele wrote: > Interesting. I see no such failure on AIX either on my VThread branch, or when checking this PR out directly. I am currently running on Linux/Power to see if I can confirm that I see the failure there. > > I don't see any failures in BasicExt.java on my aix/vthread branch with these and other changes (most notably the PollsetPoller implementation). 
[`VMContinuations` are disabled on AIX](https://github.com/openjdk/jdk/blob/bc15163386659bfd549576817b4efe7307261ea8/src/hotspot/cpu/ppc/globals_ppc.hpp#L59). Have you changed that line? BasicExt.java is otherwise skipped because [it requires continuations](https://github.com/openjdk/jdk/blob/bc15163386659bfd549576817b4efe7307261ea8/test/jdk/jdk/internal/vm/Continuation/BasicExt.java#L28). If you run it standalone (w/o jtreg) you should get an `UnsupportedOperationException` I'm planning to disable `VMContinuations` on PPC big endian platforms for now. This will help testing changes like this pr. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13372#issuecomment-1505856926 From rrich at openjdk.org Thu Apr 13 17:03:39 2023 From: rrich at openjdk.org (Richard Reingruber) Date: Thu, 13 Apr 2023 17:03:39 GMT Subject: RFR: 8305668: PPC: Non-Top Interpreted frames should be independent of ABI_ELFv2 In-Reply-To: References: Message-ID: <4AJFIOvZS709q6k54ynUB-4P-3zP67c0ER_GboWLQno=.b27ef6f3-911e-491f-a212-8925b6885a2f@github.com> On Thu, 6 Apr 2023 13:22:49 GMT, Richard Reingruber wrote: > This PR makes parent interpreted Java frames independent of `ABI_ELFv2`. > With the changes `test/jdk/jdk/internal/vm/Continuation/BasicExt.java#COMP_ALL` succeeds on PPC64 Big Endian Linux. 
>
> Before:
>
> * `parent_ijava_frame_abi` was derived from `abi_minframe` which depends on `ABI_ELFv2`
> * jit_abi is independent of `abi_minframe`
> * `frame::metadata_words` is wrong for `parent_ijava_frame_abi` if `ABI_ELFv2` is not defined (big endian)
>
> After changes:
>
> * prefixed structs that depend on `ABI_ELFv2` with `native_`
> * introduced `java_abi` which is independent of `ABI_ELFv2`
> * `frame::metadata_words` is the size in words of `java_abi`
>
> This is still a little imprecise since `top_ijava_frame_abi` is larger than `java_abi` but the top frame is never frozen as it is always `vmIntrinsics::_Continuation_doYield`
>
> Testing:
>
> PPC64le: most JCK and JTREG tiers 1-4, also in Xcomp mode.
> PPC64be Linux: hotspot tier1 tests

I think this is ready for integration now.

I think I do know the reason for the remaining issues with continuations. The [call of `SharedRuntime::exception_handler_for_return_address`](https://github.com/openjdk/jdk/blob/92521b100f1eb785eabd101870f631f555c3b135/src/hotspot/cpu/ppc/stubGenerator_ppc.cpp#L4579) hasn't got the full ABI required for calling native code. This is compensated for by dead space in the interpreter frame's expression stack, where the call parameters used to be. That space is sufficient on LE but not on BE, where 2 more words are required. Whether live data is actually killed by the call depends on the C++ compiler, though. So @backwaterred it is possible that your continuation tests succeed on AIX despite the issue.

I will take care of this in a followup.
------------- PR Comment: https://git.openjdk.org/jdk/pull/13372#issuecomment-1507301770 From tsteele at openjdk.org Thu Apr 13 17:08:33 2023 From: tsteele at openjdk.org (Tyler Steele) Date: Thu, 13 Apr 2023 17:08:33 GMT Subject: RFR: 8305934: PPC64: Disable VMContinuations on Big Endian In-Reply-To: References: Message-ID: On Wed, 12 Apr 2023 21:16:33 GMT, Richard Reingruber wrote: > Disable VMContinuations on PPC64 big endian in general (not only on AIX) because there are known failures in jdk:jdk_loom tests. I'm not totally clear on the reason for making this change. Other than #13372, is there much work to do in order to enable Continuations on BE Linux? If there are more changes, then it makes sense to me to go ahead with this PR. Otherwise, I'd rather work with you to tweak #13372 to get Continuations working instead. I don't want to stand in the way of progress, so I'm happy to support this change if you feel that it streamlines things. I'm just thinking that we are likely to undo these changes in a couple weeks. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13449#issuecomment-1507309309 From rrich at openjdk.org Thu Apr 13 17:10:20 2023 From: rrich at openjdk.org (Richard Reingruber) Date: Thu, 13 Apr 2023 17:10:20 GMT Subject: RFR: 8305668: PPC: Non-Top Interpreted frames should be independent of ABI_ELFv2 In-Reply-To: References: Message-ID: <-m6BWI7jTr71NgO4WjwGWa_B1yJKkdGUF90RVeVgRYw=.64204a86-d645-4670-ba98-8b0472e0c225@github.com> On Thu, 6 Apr 2023 13:22:49 GMT, Richard Reingruber wrote: > This PR makes parent interpreted Java frames independent of `ABI_ELFv2`. > With the changes `test/jdk/jdk/internal/vm/Continuation/BasicExt.java#COMP_ALL` succeeds on PPC64 Big Endian Linux. 
> > Before: > > * `parent_ijava_frame_abi` was derived from `abi_minframe` which depends on `ABI_ELFv2` > * jit_abi is independent of `abi_minframe` > * `frame::metadata_words` is wrong for `parent_ijava_frame_abi` if `ABI_ELFv2` is not defined (big endian) > > After changes: > > * prefixed structs that depend on `ABI_ELFv2` with `native_` > * introduced `java_abi` which is independent of `ABI_ELFv2` > * `frame::metadata_words` is the size in words of `java_abi` > > This is still a little imprecise since `top_ijava_frame_abi` is larger than `java_abi` but the top frame is never frozen as it is always `vmIntrinsics::_Continuation_doYield` > > Testing: > > PPC64le: most JCK and JTREG tiers 1-4, also in Xcomp mode. > PPC64be Linux: hotspot tier1 tests Top of stack when `SharedRuntime::exception_handler_for_return_address` is called. Calling C++ code kills 14 words in the caller. This kills the `method` field of the interpreted frame. On little endian only 12 words are killed which is just enough. At least in this test. If there is no dead space then it can crash too. 
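As a rough check of the word counts quoted above, here is a minimal sketch (not HotSpot code). It assumes 8-byte stack words and that the callee clobbers slots upward starting at the caller's sp; the class and method names are hypothetical. The two addresses are the `unextended_sp` and `method` slots from the trace that follows in this message.

```java
public class FrameClobberSketch {
    // Assumption: 64-bit PPC, so one stack word is 8 bytes.
    static final int WORD_SIZE = 8;

    // Number of words in the inclusive slot range [lo, hi].
    static long wordsInRange(long lo, long hi) {
        return (hi - lo) / WORD_SIZE + 1;
    }

    // True if `slot` lies within the first `killedWords` words at or above `sp`.
    static boolean isClobbered(long sp, int killedWords, long slot) {
        return slot >= sp && slot < sp + (long) killedWords * WORD_SIZE;
    }

    public static void main(String[] args) {
        long sp = 0x00000fffb562d970L;     // slot annotated "unextended_sp for #0"
        long method = 0x00000fffb562d9d8L; // slot annotated "method"
        System.out.println(wordsInRange(sp, method));    // 14
        System.out.println(isClobbered(sp, 14, method)); // true  (big endian case)
        System.out.println(isClobbered(sp, 12, method)); // false (little endian case)
    }
}
```

Under these assumptions the `method` slot is the 14th word above sp, so it is hit when 14 words are killed but survives when only 12 are.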
[0.832s][trace][continuations] 0x00000fffb562da40: 0x00000fffb562db50 #0 method BasicExt$Continuation3Frames.ord103_testMethod_dontinline(JJJLjava/lang/String;)Ljava/lang/String; @ 7
[0.832s][trace][continuations] - 8 locals 9 max stack
[0.832s][trace][continuations] - codelet: return entry points
[0.832s][trace][continuations] sp for #1
[0.832s][trace][continuations] 0x00000fffb562da38: 0x00000fff9b629690 fresult
[0.832s][trace][continuations] 0x00000fffb562da30: 0x00000000ffb72cd8 lresult
[0.832s][trace][continuations] 0x00000fffb562da28: 0x0000000000000000 oop_tmp
[0.832s][trace][continuations] 0x00000fffb562da20: 0x00000fffb562daa0 sender_sp
[0.832s][trace][continuations] 0x00000fffb562da18: 0x00000fffb562d920 top_frame_sp
[0.832s][trace][continuations] 0x00000fffb562da10: 0x00000fffb562d730 mdx
[0.832s][trace][continuations] 0x00000fffb562da08: 0x00000fffb562d990 esp
[0.832s][trace][continuations] 0x00000fffb562da00: 0x00000fff9b614f87 bcp
[0.832s][trace][continuations] 0x00000fffb562d9f8: 0x00000fff9b612000 cpoolCache
[0.832s][trace][continuations] 0x00000fffb562d9f0: 0x00000fffb562d9d8 monitors
[0.832s][trace][continuations] 0x00000fffb562d9e8: 0x000000000000000b locals
[0.832s][trace][continuations] 0x00000fffb562d9e0: 0x00000000ffcc02b0 mirror
[0.832s][trace][continuations] oop for #0
[0.832s][trace][continuations] 0x00000fffb562d9d8: 0x00000fff9b614fd8 method
[0.832s][trace][continuations] 0x00000fffb562d9d0: 0x00000000ffcc1bf8
[0.832s][trace][continuations] 0x00000fffb562d9c8: 0x0000000000000000
[0.832s][trace][continuations] 0x00000fffb562d9c0: 0x0000000000000001
[0.832s][trace][continuations] 0x00000fffb562d9b8: 0x0000000000000000
[0.832s][trace][continuations] 0x00000fffb562d9b0: 0x0000000000000002
[0.832s][trace][continuations] 0x00000fffb562d9a8: 0x0000000000000000
[0.832s][trace][continuations] 0x00000fffb562d9a0: 0x0000000000000003 DEAD SPACE
[0.832s][trace][continuations] 0x00000fffb562d998: 0x00000000ffb72cd8
[0.832s][trace][continuations] 0x00000fffb562d990: 0x00000fffa9de84dc <- esp
[0.832s][trace][continuations] 0x00000fffb562d988: 0x0000000000000006
[0.832s][trace][continuations] 0x00000fffb562d980: 0x00000fffa9de81a8 return address
[0.832s][trace][continuations] 0x00000fffb562d978: 0x00000000bbaaddf9
[0.832s][trace][continuations] 0x00000fffb562d970: 0x00000fffb562da40 unextended_sp for #0

Range [0x00000fffb562d970, 0x00000fffb562d9d8] has 14 words !!!

------------- PR Comment: https://git.openjdk.org/jdk/pull/13372#issuecomment-1507312475

From rrich at openjdk.org Thu Apr 13 17:17:51 2023 From: rrich at openjdk.org (Richard Reingruber) Date: Thu, 13 Apr 2023 17:17:51 GMT Subject: RFR: 8305934: PPC64: Disable VMContinuations on Big Endian In-Reply-To: References: Message-ID:

On Wed, 12 Apr 2023 21:16:33 GMT, Richard Reingruber wrote:

> Disable VMContinuations on PPC64 big endian in general (not only on AIX) because there are known failures in jdk:jdk_loom tests.

Yes we will undo the change but only when no issues are open. This is to reduce the noise. I lost a day analyzing test failures with https://github.com/openjdk/jdk/pull/13372.

------------- PR Comment: https://git.openjdk.org/jdk/pull/13449#issuecomment-1507324518

From epeter at openjdk.org Thu Apr 13 17:55:39 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 13 Apr 2023 17:55:39 GMT Subject: RFR: 8303466: C2: failed: malformed control flow. Limit type made precise with MaxL/MinL [v4] In-Reply-To: References: Message-ID:

On Tue, 11 Apr 2023 19:18:16 GMT, Jasmine Karthikeyan wrote:

>>>If we have several unrolls we will have chain of MaxL/MinL nodes. Will the chain be folded by IGVN?
>>
>> @vnkozlov I fear it would not fold currently. The CMove would not fold before either, but with repeated unrolling, the CMove was reused, and so there was only ever a single CMove (unless some RC got in between).
>>
>> I think in many cases, the type does not underflow, and the `MaxL/MinL` can be removed completely.
>> However, if that does not work, I think it now also fails to remove the repeated `ConvI2L / ConvL2I`. We would have to add more IGVN optimizations to fold things more.
>>
>> I think the performance impact is now insignificant even if it does not fold, because the limits are only calculated once per loop. We can still improve the folding, if you want. I can also do that in a follow-up RFE, and try to add some IR tests that target type-limit underflow, and count the `MaxL/MinL` nodes.
>>
>> TLDR: @vnkozlov is it ok if I investigate & test `MaxL/MinL` and `ConvI2L / ConvL2I` folding in a follow-up RFE?
>
>> However, if that does not work, I think it now also fails to remove the repeated ConvI2L / ConvL2I. We would have to add more IGVN optimizations to fold things more.
>
> I think you're running into an issue where some nodes created by counted loop expansion aren't properly passed onto the IGVN worklist; I found the same thing while trying to investigate some strange code generation from small loops. If you make that follow-up RFE I would be happy to attach the cases that I found as well.

@jaskarth please send me those cases; if there are many, email may be better. I'm generally working on doing verification of that kind, see [JDK-8298951](https://bugs.openjdk.org/browse/JDK-8298951).

------------- PR Comment: https://git.openjdk.org/jdk/pull/13269#issuecomment-1507385614

From tsteele at openjdk.org Thu Apr 13 18:16:20 2023 From: tsteele at openjdk.org (Tyler Steele) Date: Thu, 13 Apr 2023 18:16:20 GMT Subject: RFR: 8305934: PPC64: Disable VMContinuations on Big Endian In-Reply-To: References: Message-ID:

On Wed, 12 Apr 2023 21:16:33 GMT, Richard Reingruber wrote:

> Disable VMContinuations on PPC64 big endian in general (not only on AIX) because there are known failures in jdk:jdk_loom tests.

Marked as reviewed by tsteele (Committer).
------------- PR Review: https://git.openjdk.org/jdk/pull/13449#pullrequestreview-1383971671 From tsteele at openjdk.org Thu Apr 13 18:16:22 2023 From: tsteele at openjdk.org (Tyler Steele) Date: Thu, 13 Apr 2023 18:16:22 GMT Subject: RFR: 8305934: PPC64: Disable VMContinuations on Big Endian In-Reply-To: References: Message-ID: On Thu, 13 Apr 2023 17:15:13 GMT, Richard Reingruber wrote: >> Disable VMContinuations on PPC64 big endian in general (not only on AIX) because there are known failures in jdk:jdk_loom tests. > > Yes we will undo the change but only when no issues are open. This is to reduce the noise. I lost a day analyzing test failures with https://github.com/openjdk/jdk/pull/13372. Sounds good. Thanks @reinrich. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13449#issuecomment-1507414190 From coleenp at openjdk.org Thu Apr 13 18:35:31 2023 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 13 Apr 2023 18:35:31 GMT Subject: RFR: 8305404: Compile_lock not needed for InstanceKlass::implementor() In-Reply-To: References: Message-ID: On Thu, 13 Apr 2023 12:25:27 GMT, Coleen Phillimore wrote: > See CR for details. > Tested with tier1-4, 7, 8. Thanks Erik! ------------- PR Comment: https://git.openjdk.org/jdk/pull/13458#issuecomment-1507440861 From rrich at openjdk.org Fri Apr 14 06:47:49 2023 From: rrich at openjdk.org (Richard Reingruber) Date: Fri, 14 Apr 2023 06:47:49 GMT Subject: RFR: 8305934: PPC64: Disable VMContinuations on Big Endian In-Reply-To: References: Message-ID: On Thu, 13 Apr 2023 09:27:31 GMT, Martin Doerr wrote: >> Disable VMContinuations on PPC64 big endian in general (not only on AIX) because there are known failures in jdk:jdk_loom tests. > > LGTM. Thanks for the reviews @TheRealMDoerr and @backwaterred. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/13449#issuecomment-1508005376 From rrich at openjdk.org Fri Apr 14 06:47:50 2023 From: rrich at openjdk.org (Richard Reingruber) Date: Fri, 14 Apr 2023 06:47:50 GMT Subject: Integrated: 8305934: PPC64: Disable VMContinuations on Big Endian In-Reply-To: References: Message-ID: On Wed, 12 Apr 2023 21:16:33 GMT, Richard Reingruber wrote: > Disable VMContinuations on PPC64 big endian in general (not only on AIX) because there are known failures in jdk:jdk_loom tests. This pull request has now been integrated. Changeset: 12358e6c Author: Richard Reingruber URL: https://git.openjdk.org/jdk/commit/12358e6c94bc96e618efc3ec5299a2cfe1b4669d Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod 8305934: PPC64: Disable VMContinuations on Big Endian Reviewed-by: mdoerr, tsteele ------------- PR: https://git.openjdk.org/jdk/pull/13449 From dzhang at openjdk.org Fri Apr 14 06:56:41 2023 From: dzhang at openjdk.org (Dingli Zhang) Date: Fri, 14 Apr 2023 06:56:41 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v14] In-Reply-To: <6xhQpQfUt6G3C7GES9UhRMCSFyfNGIF1ANCQ1u_xcQA=.acc3520f-df68-494d-ab30-c4af7334c8b8@github.com> References: <7-vzGGq80CYZvbB9N5qrn5mrhVhxxh97oo7AkgYRn1k=.8b9fe40f-d858-43d0-bdb8-4b050009876f@github.com> <6xhQpQfUt6G3C7GES9UhRMCSFyfNGIF1ANCQ1u_xcQA=.acc3520f-df68-494d-ab30-c4af7334c8b8@github.com> Message-ID: On Tue, 11 Apr 2023 02:49:22 GMT, Fei Yang wrote: >> Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove unneeded combination nodes > > src/hotspot/cpu/riscv/riscv_v.ad line 175: > >> 173: match(Set dst (VectorMaskCmp (Binary src1 src2) (Binary cond vmask))); >> 174: effect(TEMP tmp); >> 175: format %{ "vmaskcmp_rvv_masked $dst, $src1, $src2, $vmask, $tmp, $cond" %} > > Suggestion: s/vmaskcmp_rvv_masked/vmaskcmp_masked/ Fixed. 
> src/hotspot/cpu/riscv/riscv_v.ad line 2561: > >> 2559: %} >> 2560: >> 2561: instruct vmaskcast_same_esize_rvv(vRegMask dst_src) %{ > > Suggestion: s/vmaskcast_same_esize_rvv/vmaskcast_same_esize/ Fixed. > src/hotspot/cpu/riscv/riscv_v.ad line 2565: > >> 2563: match(Set dst_src (VectorMaskCast dst_src)); >> 2564: ins_cost(0); >> 2565: format %{ "vmaskcast_same_esize_rvv $dst_src\t# do nothing" %} > > Suggestion: s/vmaskcast_same_esize_rvv/vmaskcast_same_esize/ Fixed. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1166363159 PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1166363198 PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1166363302 From rrich at openjdk.org Fri Apr 14 07:01:49 2023 From: rrich at openjdk.org (Richard Reingruber) Date: Fri, 14 Apr 2023 07:01:49 GMT Subject: RFR: 8305668: PPC: Non-Top Interpreted frames should be independent of ABI_ELFv2 [v2] In-Reply-To: References: Message-ID: > This PR makes parent interpreted Java frames independent of `ABI_ELFv2`. > With the changes `test/jdk/jdk/internal/vm/Continuation/BasicExt.java#COMP_ALL` succeeds on PPC64 Big Endian Linux. > > Before: > > * `parent_ijava_frame_abi` was derived from `abi_minframe` which depends on `ABI_ELFv2` > * jit_abi is independent of `abi_minframe` > * `frame::metadata_words` is wrong for `parent_ijava_frame_abi` if `ABI_ELFv2` is not defined (big endian) > > After changes: > > * prefixed structs that depend on `ABI_ELFv2` with `native_` > * introduced `java_abi` which is independent of `ABI_ELFv2` > * `frame::metadata_words` is the size in words of `java_abi` > > This is still a little imprecise since `top_ijava_frame_abi` is larger than `java_abi` but the top frame is never frozen as it is always `vmIntrinsics::_Continuation_doYield` > > Testing: > > PPC64le: most JCK and JTREG tiers 1-4, also in Xcomp mode. 
> PPC64be Linux: hotspot tier1 tests Richard Reingruber has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains ten additional commits since the last revision: - Merge branch 'master' - Update comments - Rename abi_reg_args_spill -> native_abi_reg_args_spill - Use correct abi definitions - Rename native abi size enum elements - Introduce common_abi - Derive parent_ijava_frame_abi from java_abi - java_abi - Native abi structs ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13372/files - new: https://git.openjdk.org/jdk/pull/13372/files/9613ecd9..ea9ddcfe Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13372&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13372&range=00-01 Stats: 24490 lines in 716 files changed: 13432 ins; 8019 del; 3039 mod Patch: https://git.openjdk.org/jdk/pull/13372.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13372/head:pull/13372 PR: https://git.openjdk.org/jdk/pull/13372 From dzhang at openjdk.org Fri Apr 14 08:02:37 2023 From: dzhang at openjdk.org (Dingli Zhang) Date: Fri, 14 Apr 2023 08:02:37 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v17] In-Reply-To: References: Message-ID: > HI, > > We have added support for vector add mask instructions, please take a look and have some reviews. Thanks a lot! > This patch will add support of vector add/sub/mul/div mask version. It was implemented by referring to RVV v1.0 [1]. > > ## Load/Store/Cmp Mask > `VectorLoadMask, VectorMaskCmp, VectorStoreMask` will implement the mask datapath. We can see where the data is passed in the compilation log with `jdk/incubator/vector/Byte128VectorTests.java`? > > 218 loadV V1, [R7] # vector (rvv) > 220 vloadmask V0, V1 > ... 
> 23c vmaskcmp_rvv_masked V0, V4, V5, V0, V1, #0
> 24c vstoremask V1, V0
> 258 storeV [R7], V1 # vector (rvv)
>
>
> The corresponding generated jit assembly:
>
> # loadV
> 0x000000400c8ef958: vsetvli t0,zero,e8,m1,tu,mu
> 0x000000400c8ef95c: vle8.v v1,(t2)
>
> # vloadmask
> 0x000000400c8ef960: vsetvli t0,zero,e8,m1,tu,
> 0x000000400c8ef964: vmsne.vx v0,v1,zero
>
> # vmaskcmp_rvv_masked
> 0x000000400c8ef97c: vsetvli t0,zero,e8,m1,tu,mu
> 0x000000400c8ef980: vmclr.m v1
> 0x000000400c8ef984: vmseq.vv v1,v4,v5,v0.t
> 0x000000400c8ef988: vmv1r.v v0,v1
>
> # vstoremask
> 0x000000400c8ef98c: vsetvli t0,zero,e8,m1,tu,mu
> 0x000000400c8ef990: vmv.v.x v1,zero
> 0x000000400c8ef994: vmerge.vim v1,v1,1,v0
>
>
> ## Masked vector arithmetic instructions (e.g. vadd)
> AddMaskTestMerge case:
>
> import jdk.incubator.vector.IntVector;
> import jdk.incubator.vector.VectorMask;
> import jdk.incubator.vector.VectorOperators;
> import jdk.incubator.vector.VectorSpecies;
>
> public class AddMaskTestMerge {
>
>     static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128;
>     static final int SIZE = 1024;
>     static int[] a = new int[SIZE];
>     static int[] b = new int[SIZE];
>     static int[] r = new int[SIZE];
>     static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false};
>     static {
>         for (int i = 0; i < SIZE; i++) {
>             a[i] = i;
>             b[i] = i;
>         }
>     }
>
>     static void workload(int idx) {
>         VectorMask<Integer> vmask = VectorMask.fromArray(SPECIES, c, 0);
>         IntVector av = IntVector.fromArray(SPECIES, a, idx);
>         IntVector bv = IntVector.fromArray(SPECIES, b, idx);
>         av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx);
>     }
>
>     public static void main(String[] args) {
>         for (int i = 0; i < 30_0000; i++) {
>             for (int j = 0; j < SIZE; j += SPECIES.length()) {
>                 workload(j);
>             }
>         }
>     }
> }
>
>
> This test case is reduced from existing jtreg vector tests Int128VectorTests.java[2].
This test case corresponds to the add instruction of the vector mask version; other instructions are similar.
>
> Before this patch, the compilation log did not print RVV-related instructions. Now the compilation log is as follows:
>
>
> 0ae B10: # out( B25 B11 ) <- in( B9 ) Freq: 0.999991
> 0ae loadV V1, [R31] # vector (rvv)
> 0b6 vloadmask V0, V2
> 0be vadd.vv V3, V1, V0 #@vaddI_masked
> 0c6 lwu R28, [R7, #124] # loadN, compressed ptr, #@loadN ! Field: AddMaskTestMerge.r
> 0ca decode_heap_oop R28, R28 #@decodeHeapOop
> 0cc lwu R7, [R28, #12] # range, #@loadRange
> 0d0 NullCheck R28
>
>
> And the jit code is as follows:
>
>
> 0x000000400c823cee: vsetvli t0,zero,e32,m1,tu,mu
> 0x000000400c823cf2: vle32.v v1,(t6) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0}
> ; - jdk.incubator.vector.IntVector::intoArray@43 (line 3228)
> ; - AddMaskTestMerge::workload@46 (line 25)
> 0x000000400c823cf6: vsetvli t0,zero,e8,m1,tu,mu
> 0x000000400c823cfa: vmsne.vx v0,v2,zero ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0}
> ; - jdk.incubator.vector.VectorMask::fromArray@47 (line 208)
> ; - AddMaskTestMerge::workload@7 (line 22)
> 0x000000400c823cfe: vsetvli t0,zero,e32,m1,tu,mu
> 0x000000400c823d02: vadd.vv v3,v3,v1,v0.t ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0}
> ; - jdk.incubator.vector.IntVector::lanewiseTemplate@192 (line 834)
> ; - jdk.incubator.vector.Int128Vector::lanewise@9 (line 291)
> ; - jdk.incubator.vector.Int128Vector::lanewise@4 (line 41)
> ; - AddMaskTestMerge::workload@39 (line 25)
>
>
> ## Mask register allocation & mask bit operation
> Since the spec[1] uses v0 as the mask register, we sometimes need two mask registers for vector mask logical ops like `AndVMask, OrVMask, XorVMask`. And if only v0 and v31 mask registers are defined, the corresponding C2 nodes will not be generated correctly because of the register pressure[3].
> When we use only v0 and v31 as mask registers, jtreg testing of Byte128VectorTests.java[4] with `-XX:+PrintAssembly` and `-XX:LogFile` will not emit the expected rvv mask instruction. Instead, the following compilation failure log[3] is generated:
>
>
>
>
>
>
>
>
> So we define v30 and v31 as mask registers too, and `AndVMask` will emit C2 JIT code like:
>
> vloadmask V0, V1
> vloadmask V30, V2
> vmask_and V0, V30, V0
>
> We also modified the implementation of `spill_copy_vector_stack_to_stack` so that it no longer occupies the v0 register. In addition, we changed some nodes such as `vasr/vlsl/vlsr/vstring_x/varray_x/vclearArray_x`, which use v0 internally, so that C2 knows they use v0.
>
> ## vector load/store - predicated & blend operation
>
> Jtreg testing of Byte128VectorTests.java[4] with -XX:+PrintOptoAssembly and -XX:LogFile will print the following compilation log, which is generated by predicated vector load/store:
>
> 152 B21: # out( B22 ) <- in( B20 ) Freq: 0.499984
> 152 vmask_gen_L V0, R12
> 162 loadV_masked V1, V0, [R10]
> 16e storeV_masked [R11], V0, V1
>
>
> And `VectorBlend` will generate the following compilation log (part of a rotate operation):
>
> 1ea vlsrBS V6, V1, V3 V0
> 1fe vlslBS V5, V1, V2 V0
> 212 vor.vv V2, V5, V6 #@vor
> 21a vloadmask V0, V4
> 222 vmerge_vvm V1, V1, V2 # vector blend
> 22a bgeu R9, R30, B56 #@cmpU_branch P=0.000001 C=-1.000000
>
>
> At the same time, we added the predicated nodes of `RShiftV/LShiftV/URShiftV`. Since this duplicated some code from the corresponding nodes in non-masked form, a small refactoring was done.
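For readers less familiar with the lane semantics behind the masked add and blend nodes described in this message, here is a minimal scalar sketch in plain Java. It is not the Vector API and not the C2 implementation; the class and method names are hypothetical. It assumes the usual merge behaviour of a masked lanewise operation (unset lanes keep the first operand) and the usual blend behaviour (set lanes take the second operand).

```java
import java.util.Arrays;

public class MaskedLaneSketch {
    // Masked add with merge: set lanes get a[i] + b[i], unset lanes keep a[i].
    static int[] maskedAdd(int[] a, int[] b, boolean[] mask) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = mask[i] ? a[i] + b[i] : a[i];
        }
        return r;
    }

    // Blend: set lanes take b[i], unset lanes keep a[i].
    static int[] blend(int[] a, int[] b, boolean[] mask) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = mask[i] ? b[i] : a[i];
        }
        return r;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4};
        int[] b = {10, 20, 30, 40};
        boolean[] m = {true, false, true, false};
        System.out.println(Arrays.toString(maskedAdd(a, b, m))); // [11, 2, 33, 4]
        System.out.println(Arrays.toString(blend(a, b, m)));     // [10, 2, 30, 4]
    }
}
```

The four-lane arrays mirror a 128-bit vector of ints, as in the AddMaskTestMerge case quoted earlier in this message.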
> > > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc > [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java > [3] https://github.com/openjdk/jdk/blob/0deb648985b018653ccdaf193dc13b3cf21c088a/src/hotspot/share/opto/chaitin.cpp#L526 > [4] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Byte128VectorTests.java > > ### Testing: > > qemu with UseRVV: > - [x] Tier1 tests (release) > - [x] Tier2 tests (release) > - [x] Tier3 tests (release) > - [x] test/jdk/jdk/incubator/vector (release/fastdebug) Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: Add some pseudoinstruction and unify function name ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12682/files - new: https://git.openjdk.org/jdk/pull/12682/files/bcbab448..110ebcf3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=16 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=15-16 Stats: 45 lines in 5 files changed: 25 ins; 0 del; 20 mod Patch: https://git.openjdk.org/jdk/pull/12682.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12682/head:pull/12682 PR: https://git.openjdk.org/jdk/pull/12682 From dzhang at openjdk.org Fri Apr 14 08:02:44 2023 From: dzhang at openjdk.org (Dingli Zhang) Date: Fri, 14 Apr 2023 08:02:44 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v16] In-Reply-To: References: <8j0il6k2xmB5s72N2EAiTijDDjN7FNoXrPux4ur9IdE=.485aa3bb-de42-4bbb-a374-1c2ec7340f71@github.com> Message-ID: On Thu, 13 Apr 2023 12:12:35 GMT, Feilong Jiang wrote: >> Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix typo > > src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1726: > >> 1724: } >> 1725: >> 1726: void C2_MacroAssembler::rvv_compare(VectorRegister vd, BasicType bt, int length_in_bytes, 
VectorRegister src1, VectorRegister src2, int cond, VectorMask vm) { > > All RVV-related methods are named with the suffix `_v` in C2_MacroAssembler (except `rvv_vsetvli` and `rvv_reduce_integral`, which should be renamed too, IMO), I think we should follow this naming style. Thanks for the review! Fixed. > src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1734: > >> 1732: case BoolTest::ne: vmfne_vv(vd, src1, src2, vm); break; >> 1733: case BoolTest::le: vmfle_vv(vd, src1, src2, vm); break; >> 1734: case BoolTest::ge: vmfle_vv(vd, src2, src1, vm); break; > > Maybe we could add some pseudo instructions like `vmfge_vv`/`vmfgt_vv` [1] . > > 1. https://github.com/riscv/riscv-v-spec/blob/b9afd6f5709fe3f91ce39bb83695bcfaa78eef94/v-spec.adoc?plain=1#L3676-L3681 Fixed. > src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1747: > >> 1745: case BoolTest::ne: vmsne_vv(vd, src1, src2, vm); break; >> 1746: case BoolTest::le: vmsle_vv(vd, src1, src2, vm); break; >> 1747: case BoolTest::ge: vmsle_vv(vd, src2, src1, vm); break; > > Same here, `vmsge_vv`/`vmsgt_vv` [1] would be better. > > 1. https://github.com/riscv/riscv-v-spec/blob/b9afd6f5709fe3f91ce39bb83695bcfaa78eef94/v-spec.adoc?plain=1#L2724-L2729 Fixed. > src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.hpp line 203: > >> 201: void rvv_vsetvli(BasicType bt, int length_in_bytes, Register tmp = t0); >> 202: >> 203: void rvv_compare(VectorRegister dst, BasicType bt, int length_in_bytes, > > Suggestion: > > void compare_v(VectorRegister dst, BasicType bt, int length_in_bytes, Fixed. > src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.hpp line 230: > >> 228: >> 229: // Clear vector registers independent of previous vl and vtype. >> 230: void rvv_clear_register(VectorRegister v) { > > Suggestion: > > void clear_register_v(VectorRegister v) { Fixed. 
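The `ge` cases quoted above reuse the `le` instruction with swapped operands, which is also what the suggested `vmsge_vv`/`vmsgt_vv` pseudo instructions would wrap. A minimal scalar sketch of that identity, in plain Java rather than the macro assembler; the names are hypothetical and the mask operand is left out for brevity:

```java
import java.util.Arrays;

public class CompareSwapSketch {
    // Lane-wise x <= y.
    static boolean[] le(int[] x, int[] y) {
        boolean[] r = new boolean[x.length];
        for (int i = 0; i < x.length; i++) {
            r[i] = x[i] <= y[i];
        }
        return r;
    }

    // Lane-wise x < y.
    static boolean[] lt(int[] x, int[] y) {
        boolean[] r = new boolean[x.length];
        for (int i = 0; i < x.length; i++) {
            r[i] = x[i] < y[i];
        }
        return r;
    }

    // x >= y is y <= x, so "ge" needs no instruction of its own.
    static boolean[] ge(int[] x, int[] y) {
        return le(y, x);
    }

    // x > y is y < x.
    static boolean[] gt(int[] x, int[] y) {
        return lt(y, x);
    }

    public static void main(String[] args) {
        int[] x = {1, 5, 3};
        int[] y = {2, 5, 1};
        System.out.println(Arrays.toString(ge(x, y))); // [false, true, true]
        System.out.println(Arrays.toString(gt(x, y))); // [false, false, true]
    }
}
```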
> src/hotspot/cpu/riscv/riscv.ad line 919: > >> 917: // The mask value used to control execution of a masked vector >> 918: // instruction is always supplied by vector register v0. >> 919: reg_class vectmask_reg_v0 ( > > Suggestion: > > reg_class vmask_reg_v0 ( Fixed. > src/hotspot/cpu/riscv/riscv.ad line 926: > >> 924: // We need two more vmask registers to do the vector mask logical ops, >> 925: // so define v30, v31 as mask register too. >> 926: reg_class vectmask_reg ( > > Suggestion: > > reg_class vmask_reg ( Fixed. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1166445903 PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1166446278 PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1166446324 PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1166445963 PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1166446007 PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1166446060 PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1166446104 From mdoerr at openjdk.org Fri Apr 14 09:04:42 2023 From: mdoerr at openjdk.org (Martin Doerr) Date: Fri, 14 Apr 2023 09:04:42 GMT Subject: RFR: 8305668: PPC: Non-Top Interpreted frames should be independent of ABI_ELFv2 [v2] In-Reply-To: References: Message-ID: <36aK7kr7dji6Y485mdrljB8JEZvG2-L5_EnD9MG-AG0=.c3cd89fc-9b9e-4f68-afb2-7a868c9814ff@github.com> On Fri, 14 Apr 2023 07:01:49 GMT, Richard Reingruber wrote: >> This PR makes parent interpreted Java frames independent of `ABI_ELFv2`. >> With the changes `test/jdk/jdk/internal/vm/Continuation/BasicExt.java#COMP_ALL` succeeds on PPC64 Big Endian Linux. 
>> >> Before: >> >> * `parent_ijava_frame_abi` was derived from `abi_minframe` which depends on `ABI_ELFv2` >> * jit_abi is independent of `abi_minframe` >> * `frame::metadata_words` is wrong for `parent_ijava_frame_abi` if `ABI_ELFv2` is not defined (big endian) >> >> After changes: >> >> * prefixed structs that depend on `ABI_ELFv2` with `native_` >> * introduced `java_abi` which is independent of `ABI_ELFv2` >> * `frame::metadata_words` is the size in words of `java_abi` >> >> This is still a little imprecise since `top_ijava_frame_abi` is larger than `java_abi` but the top frame is never frozen as it is always `vmIntrinsics::_Continuation_doYield` >> >> Testing: >> >> PPC64le: most JCK and JTREG tiers 1-4, also in Xcomp mode. >> PPC64be Linux: hotspot tier1 tests > > Richard Reingruber has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains ten additional commits since the last revision: > > - Merge branch 'master' > - Update comments > - Rename abi_reg_args_spill -> native_abi_reg_args_spill > - Use correct abi definitions > - Rename native abi size enum elements > - Introduce common_abi > - Derive parent_ijava_frame_abi from java_abi > - java_abi > - Native abi structs Marked as reviewed by mdoerr (Reviewer). LGTM. ------------- PR Review: https://git.openjdk.org/jdk/pull/13372#pullrequestreview-1385000384 PR Comment: https://git.openjdk.org/jdk/pull/13372#issuecomment-1508182994 From duke at openjdk.org Fri Apr 14 09:37:35 2023 From: duke at openjdk.org (Kirill A. Korinsky) Date: Fri, 14 Apr 2023 09:37:35 GMT Subject: RFR: 8305995: Use full dominant search for regions Message-ID: <1fT7mBT0BgH-MlwqeeNtA3biaCc3tt6URxnYrWnMEkE=.e3b393b8-ff0e-4c4d-9046-cec013d7b978@github.com> This is a fix for the regression introduced by da43cb5e463069cf4dafb262664f0d3d7c2e0eac in fix 8224957. 
This regression was found while attempting to migrate an application from JDK 1.8 to JDK 17: when running internal benchmarks, one of them showed abnormal memory usage, about 4 times higher.

The regression appears in a JMH benchmark that builds a huge tree containing boxed integers from 0 to a few thousand. The tree has a very complex structure and the same objects are reused a lot.

When an `integer` is found, it is collected as an `Integer` and unboxed inside the collector callback.

This benchmark was run with `ParallelGC` on different JVMs: `JDK 1.8.0_362`, `JDK 11.0.18`, `JDK 13.0.13`, `JDK 15.0.9`, `JDK 17.0.6`, `JDK 19.0.2` and `JDK 20`. The results show that something changed between 13 and 15, and that the memory footprint for this code increased from `3152` to `11828` bytes per operation.

After that I ran a `git bisect`, which allowed me to locate the commit that introduced the regression.

With the current fix, the memory footprint on the local root 425ef0685c584abec80454fbcccdcc6db6558f93 drops to `2960` bytes per operation.

-------------

Commit messages:
 - Use full dominant search for regions

Changes: https://git.openjdk.org/jdk/pull/13453/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13453&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8305995
Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
Patch: https://git.openjdk.org/jdk/pull/13453.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/13453/head:pull/13453

PR: https://git.openjdk.org/jdk/pull/13453

From duke at openjdk.org Fri Apr 14 09:37:36 2023
From: duke at openjdk.org (Kirill A. Korinsky)
Date: Fri, 14 Apr 2023 09:37:36 GMT
Subject: RFR: 8305995: Use full dominant search for regions
In-Reply-To: <1fT7mBT0BgH-MlwqeeNtA3biaCc3tt6URxnYrWnMEkE=.e3b393b8-ff0e-4c4d-9046-cec013d7b978@github.com>
References: <1fT7mBT0BgH-MlwqeeNtA3biaCc3tt6URxnYrWnMEkE=.e3b393b8-ff0e-4c4d-9046-cec013d7b978@github.com>
Message-ID:

On Thu, 13 Apr 2023 01:02:08 GMT, Kirill A. Korinsky wrote:

> This is a fix for the regression introduced by da43cb5e463069cf4dafb262664f0d3d7c2e0eac in fix 8224957.

I've minimized my code to a trivial benchmark https://gist.github.com/catap/3ac65ba878048cca0132e6cba17d86ba which shows that the effect of this patch is a 20-fold reduction in memory footprint.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/13453#issuecomment-1507796872

From thartmann at openjdk.org Fri Apr 14 09:37:37 2023
From: thartmann at openjdk.org (Tobias Hartmann)
Date: Fri, 14 Apr 2023 09:37:37 GMT
Subject: RFR: 8305995: Use full dominant search for regions
In-Reply-To: <1fT7mBT0BgH-MlwqeeNtA3biaCc3tt6URxnYrWnMEkE=.e3b393b8-ff0e-4c4d-9046-cec013d7b978@github.com>
References: <1fT7mBT0BgH-MlwqeeNtA3biaCc3tt6URxnYrWnMEkE=.e3b393b8-ff0e-4c4d-9046-cec013d7b978@github.com>
Message-ID: <8gRD-UO-2z5X8b3nRRJtkWyRaRc4Ld9DAWMIdPs-Vsg=.738455bd-0d36-4481-8545-413d9de783ed@github.com>

On Thu, 13 Apr 2023 01:02:08 GMT, Kirill A. Korinsky wrote:

> This is a fix for the regression introduced by da43cb5e463069cf4dafb262664f0d3d7c2e0eac in fix 8224957.

Thanks for reporting this, Kirill. I filed [JDK-8305995](https://bugs.openjdk.org/browse/JDK-8305995) for tracking.
It would be great if you could add your benchmark (to `test/micro/org/openjdk/bench/`). I'll have a look at your proposed fix next week.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/13453#issuecomment-1508216682

From rcastanedalo at openjdk.org Fri Apr 14 12:47:39 2023
From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano)
Date: Fri, 14 Apr 2023 12:47:39 GMT
Subject: RFR: 8287087: C2: perform SLP reduction analysis on-demand [v3]
In-Reply-To:
References:
Message-ID:

> Reduction analysis finds cycles of reduction operations within loops. The result of this analysis is used by SLP auto-vectorization (to vectorize reductions if deemed profitable) and by x64 instruction matching (to select specialized scalar floating-point `Math.min()/max()` implementations). Currently, reduction analysis is applied early (before loop unrolling), and the result is propagated through loop unrolling by marking nodes and loops with special reduction flags. Applying reduction analysis early is efficient, but propagating the results correctly through loop unrolling and arbitrary graph transformations is challenging and often leads to inconsistent node-loop reduction flag states, some of which have led to actual miscompilations in the past (see [JDK-8261147](https://bugs.openjdk.org/browse/JDK-8261147) and [JDK-8279622](https://bugs.openjdk.org/browse/JDK-8279622)).
>
> This changeset postpones reduction analysis to the point where its results are actually used. To do so, it generalizes the analysis to find reduction cycles on unrolled loops:
>
> ![reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226725587-b7d68509-3717-4bbe-8d54-f9a105853fda.png)
>
> The generalized analysis precludes the need to maintain and propagate node and loop reduction flags through arbitrary IR transformations, reducing the risk of miscompilations due to invalidation of the analysis results.
> The generalization is slightly more costly than the current analysis, but still negligible in micro- and general benchmarks.
>
> ## Performance Benefits
>
> As a side benefit, the proposed generalization is able to find more reductions, increasing the scope of auto-vectorization and the performance of x64 floating-point `Math.min()/max()` in multiple scenarios.
>
> ### Increased Auto-Vectorization Scope
>
> There are two main scenarios in which the proposed changeset enables further auto-vectorization:
>
> #### Reductions Using Global Accumulators
>
>
>     public class Foo {
>       int acc = 0;
>       (..)
>       void reduce(int[] array) {
>         for (int i = 0; i < array.length; i++) {
>           acc += array[i];
>         }
>       }
>     }
>
> Initially, such reductions are wrapped by load and store nodes, which defeats the current reduction analysis. However, after unrolling and other optimizations are applied, the reduction becomes recognizable by the proposed analysis:
>
> ![global-reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226745351-33494e40-7c07-4a8b-8bf6-d3a96e84b1c2.png)
>
> #### Reductions of partially unrolled loops
>
>
>     (..)
>     for (int i = 0; i < array.length / 2; i++) {
>       acc += array[2*i];
>       acc += array[2*i + 1];
>     }
>     (..)
>
>
> These reductions are manually unrolled from the beginning, so the current reduction analysis fails to find them, while the proposed analysis is able to detect them as if they were unrolled automatically.
>
> ### Increased Performance of x64 Floating-Point `Math.min()/max()`
>
> Besides the above scenarios, the proposed generalization allows the x64 matcher to select specialized floating-point `Math.min()/max()` implementations for reductions in non-counted and outer loops (see the new micro-benchmarks in `FpMinMaxIntrinsics.java` for more details).
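As a self-contained illustration of the loop shapes discussed above, the following sketch puts a field accumulator, a manually unrolled reduction, and a floating-point min reduction side by side. The class and method names are hypothetical and are not taken from the changeset's tests or micro-benchmarks. The variable-stride loop is only assumed to be an example of a loop that C2 does not treat as counted. Whether any of these loops is vectorized or matched to a specialized implementation depends on the platform and JVM flags.

```java
public class ReductionShapes {
    // Field accumulator: every iteration loads and stores 'acc'.
    public int acc = 0;

    // Reduction on a global (field) accumulator.
    public void reduceGlobal(int[] array) {
        for (int i = 0; i < array.length; i++) {
            acc += array[i];
        }
    }

    // Manually (partially) unrolled reduction: two additions per iteration.
    // With an odd array length the last element is ignored.
    public static int reducePartiallyUnrolled(int[] array) {
        int sum = 0;
        for (int i = 0; i < array.length / 2; i++) {
            sum += array[2 * i];
            sum += array[2 * i + 1];
        }
        return sum;
    }

    // Floating-point min reduction with a stride that is not a constant.
    // Assumption: such a loop is not a counted loop for C2.
    // 'stride' must be positive, otherwise the loop does not terminate.
    public static float minWithStride(float[] values, int stride) {
        float min = Float.POSITIVE_INFINITY;
        for (int i = 0; i < values.length; i += stride) {
            min = Math.min(min, values[i]);
        }
        return min;
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4, 5, 6};
        ReductionShapes shapes = new ReductionShapes();
        shapes.reduceGlobal(data);
        System.out.println(shapes.acc);
        System.out.println(reducePartiallyUnrolled(data));
        System.out.println(minWithStride(new float[] {3.0f, 1.5f, 2.0f, 0.5f, 4.0f}, 2));
    }
}
```

For an even-length array, the first two methods compute the same sum, which is the sense in which the manually unrolled loop is "the same reduction" as the rolled one.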
>
> ## Implementation details
>
> The generalized reduction analysis finds reductions in a loop by looking for chains of reduction operators of the same node type starting and finishing on each phi node in the loop. To avoid a combinatorial explosion, the analysis assumes that all nodes in a chain are connected via the same edge index, which is realistic because chains usually consist of identical nodes cloned by loop unrolling. This assumption allows the analysis to test only two paths for each examined phi node. A failure of this assumption (e.g. as illustrated in test case `testReductionOnPartiallyUnrolledLoopWithSwappedInputs` from `TestGeneralizedReductions.java`) results in missing vectorization but does not affect correctness. Note that the same-index assumption can only fail in cases where current auto-vectorization would also fail to vectorize (manually unrolled loops).
>
> A complication results from edge swapping in the nodes cloned by loop unrolling (see [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/addnode.cpp#L123) and [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/mulnode.cpp#L113)), which can lead to reduction chains connected via different input indices. This is addressed by tracking whether nodes have swapped edges and adjusting the explored input indices in the reduction analysis accordingly. An alternative (proposed by @eme64) is to replace this changeset's linear chain finding approach with a more general shortest path-finding algorithm. This alternative might preclude the need for tracking edge swapping at a potentially higher computational cost. Since the trade-off is not obvious, I propose to investigate it in a follow-up RFE.
>
> The changeset implements a more relaxed version of the reduction analysis for x64 matching, suitable for queries on single nodes.
> This analysis is run only in the presence of `[Min|Max][F|D]` nodes.
>
> ## Testing
>
> ### Functionality
>
> - tier1-5 (linux-x64, linux-aarch64, windows-x64, macosx-x64, and macosx-aarch64).
> - fuzzing (12 h. on linux-x64 and linux-aarch64).
>
> ##### TestGeneralizedReductions.java
>
> Tests the new scenarios in which vectorization occurs. These tests are restricted to 64-bit platforms, since I do not have access to 32-bit ones. `testReductionOnPartiallyUnrolledLoop` has been observed to fail on [linux-x86](https://github.com/robcasloz/jdk/actions/runs/4478959520/jobs/7873827856#logs) due to missing vectorization. If anyone wants to have a look and derive the necessary IR test framework preconditions for the test to pass on linux-x86, I am happy to lift the 64-bit restriction.
>
> ##### TestFpMinMaxReductions.java
>
> Tests the matching of floating-point max/min implementations in x64.
>
> ##### TestSuperwordFailsUnrolling.java
>
> This test file is updated to ensure auto-vectorization is never triggered, because this changeset would otherwise enable it and defeat the purpose of the test.
>
> ### Performance
>
> #### General Benchmarks
>
> The changeset does not cause any performance regression on the DaCapo, SPECjvm 2008, and SPECjbb2015 benchmark suites for linux-x64 and linux-aarch64.
>
> #### Micro-benchmarks
>
> The changeset extends two existing files with additional micro-benchmarks that show the benefit of the generalized reduction analysis ([full results](https://github.com/openjdk/jdk/files/11039207/microbenchmark-results.ods)).
>
>
> ##### VectorReduction.java
>
> These micro-benchmarks are first adjusted to actually vectorize in the mainline approach, since they suffered from the global-accumulator limitation. Two micro-benchmarks are added to exercise vectorization in the presence of global accumulators and partially unrolled loops.
> Running `VectorReduction.java` on an x64 (Cascade Lake) machine confirms the expectations: compared to mainline (with the adjustment mentioned above), this changeset yields similar performance results except for `andRedIOnGlobalAccumulator` and `andRedIPartiallyUnrolled`, where the changeset improves performance by 2.4x in both cases.
>
> ##### MaxIntrinsics.java
>
> This file is extended with four new micro-benchmarks. Running it on the same machine as above shows that the changeset does not affect the performance of the existing micro-benchmarks, and moderately to substantially improves the performance of the new ones (because it allows the x64 matcher to select a floating-point `Math.min()` implementation that is specialized for reduction min operations):
>
> | micro-benchmark | speedup compared to mainline |
> | --- | --- |
> | `fMinReduceInOuterLoop` | 1.1x |
> | `fMinReduceNonCounted` | 2.3x |
> | `fMinReduceGlobalAccumulator` | 2.4x |
> | `fMinReducePartiallyUnrolled` | 3.9x |
>
> ## Acknowledgments
>
> Thanks to @danielogh for making it possible to test this improvement with confidence ([JDK-8294715](https://bugs.openjdk.org/browse/JDK-8294715)) and to @TobiHartmann, @chhagedorn, @vnkozlov and @eme64 for discussions and useful feedback.

Roberto Castañeda Lozano has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 31 commits:

 - Use is_marked_reduction() in new SLP code
 - Merge master
 - Emit Node::Flag_has_swapped_edges in IGV graphs
 - Merge master
 - Relax the reduction cycle search bound
 - Remove redundant IR check precondition
 - Use SuperWord members in reduction marking
 - Remove redundant opcode checks
 - Do not run test in x86-32
 - Update existing test instead of removing it
 - ...
and 21 more: https://git.openjdk.org/jdk/compare/a3137c75...d9fc7b22

-------------

Changes: https://git.openjdk.org/jdk/pull/13120/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13120&range=02
Stats: 821 lines in 17 files changed: 654 ins; 106 del; 61 mod
Patch: https://git.openjdk.org/jdk/pull/13120.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/13120/head:pull/13120

PR: https://git.openjdk.org/jdk/pull/13120

From epeter at openjdk.org Fri Apr 14 12:47:41 2023
From: epeter at openjdk.org (Emanuel Peter)
Date: Fri, 14 Apr 2023 12:47:41 GMT
Subject: RFR: 8287087: C2: perform SLP reduction analysis on-demand [v2]
In-Reply-To:
References:
Message-ID:

On Fri, 24 Mar 2023 15:21:45 GMT, Roberto Castañeda Lozano wrote:

> Roberto Castañeda Lozano has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 28 commits:
>
>  - Merge master
>  - Relax the reduction cycle search bound
>  - Remove redundant IR check precondition
>  - Use SuperWord members in reduction marking
>  - Remove redundant opcode checks
>  - Do not run test in x86-32
>  - Update existing test instead of removing it
>  - Add negative vectorization test
>  - Update copyright headers
>  - Add two more reduction vectorization microbenchmarks
>  - ... and 18 more: https://git.openjdk.org/jdk/compare/941a7ac7...95f6cc33

I filed this RFE, which is related to this work:
[JDK-8305707](https://bugs.openjdk.org/browse/JDK-8305707) "SuperWord should vectorize reverse-order reduction loops"

-------------

PR Comment: https://git.openjdk.org/jdk/pull/13120#issuecomment-1499072763

From rcastanedalo at openjdk.org Fri Apr 14 12:47:44 2023
From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano)
Date: Fri, 14 Apr 2023 12:47:44 GMT
Subject: RFR: 8287087: C2: perform SLP reduction analysis on-demand [v2]
In-Reply-To:
References:
Message-ID:

On Sun, 2 Apr 2023 05:52:17 GMT, Jatin Bhateja wrote:

>> src/hotspot/share/opto/superword.cpp line 504:
>>
>>> 502:     // to the phi node following edge index 'input'.
>>> 503:     PathEnd path =
>>> 504:       find_in_path(
>>
>> Hi @robcasloz,
>> find_in_path expects reduction nodes to be present at same edge indices in the reduction chain, it also honors has_swapped_edge flag during backward traversal.
>> However, there are still some ideal transforms like following which may break the reduction chain and this will prevent Min/Max reductions for test case mentioned in JDK-8302673.
>> https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/addnode.cpp#L1147 >> https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/addnode.cpp#L1230 > > One way to add fault-tolerance to find_in_path could be to follow strict DFS semantics where an alternate path is taken if node's predicates are not satisfied, currently we are starting all over again from the first node of chain with a different reduction_input which prevents inferring reduction chain even though all the nodes in the chain are commutative isomorphic operations. Hi again @jatin-bhateja, I have studied now [JDK-8302673](https://bugs.openjdk.org/browse/JDK-8302673) in more detail, and my conclusion is that it is not a duplicate but actually orthogonal to this changeset. Even a perfect reduction analysis alone would not re-enable the missing vectorization, because the canonicalization transformations done by `MaxI/MinINode::Ideal()` inhibit SuperWord analysis at a later stage. In light of this, I propose to re-open [JDK-8302673](https://bugs.openjdk.org/browse/JDK-8302673), and address it by handling all four combinations of two-level inputs in `MaxI/MinINode::Ideal()` instead of canonicalizing `MaxI/MinI` chains. This solution is, in my opinion, more straightforward (and not necessarily more expensive). The main reason is that it separates concerns, making it possible to reason about input swapping, optimization of `MaxI/MinI` nodes, reduction analysis, and auto-vectorization separately. I have [a WIP, prototype implementation](https://github.com/openjdk/jdk/compare/master...robcasloz:jdk:JDK-8302673) which seems to work fine for all of the discussed reduction analysis strategies. @jatin-bhateja, if you want I can take over JDK-8302673 and submit it for review once I have polished it. 
Given that JDK-8302673 is orthogonal to this RFE, and that a solution to JDK-8302673 is available for which this RFE detects MaxI/MinI reductions correctly, I suggest moving on with this RFE as-is and filing a follow-up RFE to investigate generic search approaches.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/13120#discussion_r1166778814

From rcastanedalo at openjdk.org Fri Apr 14 12:47:46 2023
From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano)
Date: Fri, 14 Apr 2023 12:47:46 GMT
Subject: RFR: 8287087: C2: perform SLP reduction analysis on-demand [v2]
In-Reply-To:
References:
Message-ID:

On Wed, 5 Apr 2023 09:13:20 GMT, Roberto Castañeda Lozano wrote:

>> src/hotspot/share/opto/superword.cpp line 539:
>>
>>> 537:       pred = current;
>>> 538:       current = original_input(current, reduction_input);
>>> 539:     }
>>
>> If we bookkeep the nodes in the reduction chain path during initial backward traversal we may simplify this checking and also another call to _original_input_ while populating _loop_reductions set on [#L547 ](https://github.com/openjdk/jdk/pull/13120/files#diff-8f29dd005a0f949d108687dabb7379c73dfd85cd782da453509dc9b6cb8c9f81R547)
>
> Thanks, will consider this together with your earlier comments.

I tried out your suggestion but unfortunately, the bookkeeping code (marking/storing candidate nodes and their predecessors in the tentative reduction chain) became more complex than the simplifications it enabled.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/13120#discussion_r1166790520

From rcastanedalo at openjdk.org Fri Apr 14 12:52:40 2023
From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano)
Date: Fri, 14 Apr 2023 12:52:40 GMT
Subject: RFR: 8287087: C2: perform SLP reduction analysis on-demand [v3]
In-Reply-To:
References:
Message-ID:

On Fri, 14 Apr 2023 12:47:39 GMT, Roberto Castañeda Lozano wrote:

>> Reduction analysis finds cycles of reduction operations within loops.
The result of this analysis is used by SLP auto-vectorization (to vectorize reductions if deemed profitable) and by x64 instruction matching (to select specialized scalar floating-point `Math.min()/max()` implementations). Currently, reduction analysis is applied early (before loop unrolling), and the result is propagated through loop unrolling by marking nodes and loops with special reduction flags. Applying reduction analysis early is efficient, but propagating the results correctly through loop unrolling and arbitrary graph transformations is challenging and often leads to inconsistent node-loop reduction flag states, some of which have led to actual miscompilations in the past (see [JDK-8261147](https://bugs.openjdk.org/browse/JDK-8261147) and [JDK-8279622](https://bugs.openjdk.org/browse/JDK-8279622)). >> >> This changeset postpones reduction analysis to the point where its results are actually used. To do so, it generalizes the analysis to find reduction cycles on unrolled loops: >> >> ![reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226725587-b7d68509-3717-4bbe-8d54-f9a105853fda.png) >> >> The generalized analysis precludes the need to maintain and propagate node and loop reduction flags through arbitrary IR transformations, reducing the risk of miscompilations due to invalidation of the analysis results. The generalization is slightly more costly than the current analysis, but still negligible in micro- and general benchmarks. >> >> ## Performance Benefits >> >> As a side benefit, the proposed generalization is able to find more reductions, increasing the scope of auto-vectorization and the performance of x64 floating-point `Math.min()/max()` in multiple scenarios. >> >> ### Increased Auto-Vectorization Scope >> >> There are two main scenarios in which the proposed changeset enables further auto-vectorization: >> >> #### Reductions Using Global Accumulators >> >> >> public class Foo { >> int acc = 0; >> (..) 
>> void reduce(int[] array) { >> for (int i = 0; i < array.length; i++) { >> acc += array[i]; >> } >> } >> } >> >> Initially, such reductions are wrapped by load and store nodes, which defeats the current reduction analysis. However, after unrolling and other optimizations are applied, the reduction becomes recognizable by the proposed analysis: >> >> ![global-reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226745351-33494e40-7c07-4a8b-8bf6-d3a96e84b1c2.png) >> >> #### Reductions of partially unrolled loops >> >> >> (..) >> for (int i = 0; i < array.length / 2; i++) { >> acc += array[2*i]; >> acc += array[2*i + 1]; >> } >> (..) >> >> >> These reductions are manually unrolled from the beginning, so the current reduction analysis fails to find them, while the proposed analysis is able to detect them as if they were unrolled automatically. >> >> ### Increased Performance of x64 Floating-Point `Math.min()/max()` >> >> Besides the above scenarios, the proposed generalization allows the x64 matcher to select specialized floating-point `Math.min()/max()` implementations for reductions in non-counted and outer loops (see the new micro-benchmarks in `FpMinMaxIntrinsics.java` for more details). >> >> ## Implementation details >> >> The generalized reduction analysis finds reductions in a loop by looking for chains of reduction operators of the same node type starting and finishing on each phi node in the loop. To avoid a combinatorial explosion, the analysis assumes that all nodes in a chain are connected via the same edge index, which is realistic because chains usually consist of identical nodes cloned by loop unrolling. This assumption allows the analysis to test only two paths for each examined phi node. A failure of this assumption (e.g. as illustrated in test case `testReductionOnPartiallyUnrolledLoopWithSwappedInputs` from `TestGeneralizedReductions.java`) results in missing vectorization but does not affect correctness.
Note that the same-index assumption can only fail in cases where current auto-vectorization would also fail to vectorize (manually unrolled loops). >> >> A complication results from edge swapping in the nodes cloned by loop unrolling (see [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/addnode.cpp#L123) and [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/mulnode.cpp#L113)), which can lead to reduction chains connected via different input indices. This is addressed by tracking whether nodes have swapped edges and adjusting the explored input indices in the reduction analysis accordingly. An alternative (proposed by @eme64) is to replace this changeset's linear chain finding approach with a more general shortest path-finding algorithm. This alternative might preclude the need for tracking edge swapping at a potentially higher computational cost. Since the trade-off is not obvious, I propose to investigate it in a follow-up RFE. >> >> The changeset implements a more relaxed version of the reduction analysis for x64 matching, suitable for queries on single nodes. This analysis is run only in the presence of `[Min|Max][F|D]` nodes. >> >> ## Testing >> >> ### Functionality >> >> - tier1-5 (linux-x64, linux-aarch64, windows-x64, macosx-x64, and macosx-aarch64). >> - fuzzing (12 h. on linux-x64 and linux-aarch64). >> >> ##### TestGeneralizedReductions.java >> >> Tests the new scenarios in which vectorization occurs. These tests are restricted to 64-bits platforms, since I do not have access to 32-bits ones. `testReductionOnPartiallyUnrolledLoop` has been observed to fail on [linux-x86](https://github.com/robcasloz/jdk/actions/runs/4478959520/jobs/7873827856#logs) due to missing vectorization. If anyone wants to have a look and derive the necessary IR test framework preconditions for the test to pass on linux-x86, I am happy to lift the 64-bits restriction. 
>> >> ##### TestFpMinMaxReductions.java >> >> Tests the matching of floating-point max/min implementations in x64. >> >> ##### TestSuperwordFailsUnrolling.java >> >> This test file is updated to ensure auto-vectorization is never triggered, because this changeset would otherwise enable it and defeat the purpose of the test. >> >> ### Performance >> >> #### General Benchmarks >> >> The changeset does not cause any performance regression on the DaCapo, SPECjvm 2008, and SPECjbb2015 benchmark suites for linux-x64 and linux-aarch64. >> >> #### Micro-benchmarks >> >> The changeset extends two existing files with additional micro-benchmarks that show the benefit of the generalized reduction analysis ([full results](https://github.com/openjdk/jdk/files/11039207/microbenchmark-results.ods)). >> >> >> ##### VectorReduction.java >> >> These micro-benchmarks are first adjusted to actually vectorize in the mainline approach, since they suffered from the global-accumulator limitation. Two micro-benchmarks are added to exercise vectorization in the presence of global accumulators and partially unrolled loops. Running `VectorReduction.java` on an x64 (Cascade Lake) machine confirms the expectations: compared to mainline (with the adjustment mentioned above), this changeset yields similar performance results except for `andRedIOnGlobalAccumulator` and `andRedIPartiallyUnrolled`, where the changeset improves performance by 2.4x in both cases. >> >> ##### MaxIntrinsics.java >> >> This file is extended with four new micro-benchmarks. 
Running it on the same machine as above shows that the changeset does not affect the performance of the existing micro-benchmarks, and improves moderately to substantially the performance of the new ones (because it allows the x64 matcher to select a floating-point `Math.min()` implementation that is specialized for reduction min operations): >> >> | micro-benchmark | speedup compared to mainline | >> | --- | --- | >> | `fMinReduceInOuterLoop` | 1.1x | >> | `fMinReduceNonCounted` | 2.3x | >> | `fMinReduceGlobalAccumulator` | 2.4x | >> | `fMinReducePartiallyUnrolled` | 3.9x | >> >> ## Acknowledgments >> >> Thanks to @danielogh for making it possible to test this improvement with confidence ([JDK-8294715](https://bugs.openjdk.org/browse/JDK-8294715)) and to @TobiHartmann, @chhagedorn, @vnkozlov and @eme64 for discussions and useful feedback. > > Roberto Castañeda Lozano has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 31 commits: > > - Use is_marked_reduction() in new SLP code > - Merge master > - Emit Node::Flag_has_swapped_edges in IGV graphs > - Merge master > - Relax the reduction cycle search bound > - Remove redundant IR check precondition > - Use SuperWord members in reduction marking > - Remove redundant opcode checks > - Do not run test in x86-32 > - Update existing test instead of removing it > - ... and 21 more: https://git.openjdk.org/jdk/compare/a3137c75...d9fc7b22 I have resolved conflicts caused by the integration of [JDK-8304042](https://bugs.openjdk.org/browse/JDK-8304042); added minimal, debug-only code for emitting `Node::Flag_has_swapped_edges` for IGV nodes; and addressed @jatin-bhateja's comments (including analyzing the interaction with [JDK-8302673](https://bugs.openjdk.org/browse/JDK-8302673) and determining it is orthogonal to this RFE). Please review.
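For readers following the "Implementation details" part of the quoted description, the same-index chain search can be illustrated with a small Java sketch on a toy graph. This is only an illustration under assumptions: the `Node` class, the opcode strings, and `findReductionChain` are hypothetical names and not HotSpot code. The actual analysis is written in C++ in `superword.cpp` and additionally accounts for swapped edges, which this sketch does not model.

```java
import java.util.ArrayList;
import java.util.List;

public class ReductionChainSketch {
    /** Toy IR node: an opcode plus ordered inputs (hypothetical, not HotSpot's Node). */
    static final class Node {
        final String op;
        final List<Node> in = new ArrayList<>();
        Node(String op) { this.op = op; }
    }

    /**
     * Starting at the phi's loop-carried input, follow one fixed input index
     * through nodes that all share the same opcode until the phi is reached.
     * Returns the chain (node nearest the back edge first), or an empty list
     * if the walk leaves the chain or exceeds maxLength nodes.
     */
    static List<Node> findReductionChain(Node phi, Node backedgeInput, int inputIndex, int maxLength) {
        List<Node> chain = new ArrayList<>();
        Node current = backedgeInput;
        String op = current.op;
        while (current != phi) {
            if (!current.op.equals(op) || chain.size() == maxLength || inputIndex >= current.in.size()) {
                return List.of();
            }
            chain.add(current);
            current = current.in.get(inputIndex);
        }
        return chain;
    }

    /** Test only two paths per phi: input index 0 and input index 1 of a binary operator. */
    static List<Node> findReductionChain(Node phi, Node backedgeInput, int maxLength) {
        for (int index = 0; index < 2; index++) {
            List<Node> chain = findReductionChain(phi, backedgeInput, index, maxLength);
            if (!chain.isEmpty()) {
                return chain;
            }
        }
        return List.of();
    }

    public static void main(String[] args) {
        // Models "acc += a[2*i]; acc += a[2*i + 1];" with the phi merging acc.
        Node phi = new Node("Phi");
        Node add1 = new Node("AddI");
        add1.in.add(phi);
        add1.in.add(new Node("LoadI"));
        Node add2 = new Node("AddI");
        add2.in.add(add1);
        add2.in.add(new Node("LoadI"));
        System.out.println(findReductionChain(phi, add2, 8).size()); // prints 2
    }
}
```

If the inputs of `add2` are swapped (load first, `add1` second) while `add1` keeps the phi at index 0, neither fixed index leads back to the phi and the sketch returns an empty chain. That mirrors the missed-vectorization case the description mentions for swapped inputs.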
------------- PR Comment: https://git.openjdk.org/jdk/pull/13120#issuecomment-1508456771 From rrich at openjdk.org Fri Apr 14 14:11:34 2023 From: rrich at openjdk.org (Richard Reingruber) Date: Fri, 14 Apr 2023 14:11:34 GMT Subject: RFR: 8305668: PPC: Non-Top Interpreted frames should be independent of ABI_ELFv2 [v2] In-Reply-To: <36aK7kr7dji6Y485mdrljB8JEZvG2-L5_EnD9MG-AG0=.c3cd89fc-9b9e-4f68-afb2-7a868c9814ff@github.com> References: <36aK7kr7dji6Y485mdrljB8JEZvG2-L5_EnD9MG-AG0=.c3cd89fc-9b9e-4f68-afb2-7a868c9814ff@github.com> Message-ID: <4uJKFMIWhLrs6uad1lFd3xTNPy8-vvsmZ6GKNjdXdYg=.d306cdd6-d270-409a-b14a-167694cccc18@github.com> On Fri, 14 Apr 2023 09:01:23 GMT, Martin Doerr wrote: > This PR LGTM. For the remaining BE issue, shouldn't we push a dummy frame (`push_frame_reg_args(0, R0);`)? Other callers of `SharedRuntime::exception_handler_for_return_address` do that. Yes, I've done that and it works. I'll come up with a separate PR that will also enable VMContinuations again on PPC64 big endian. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13372#issuecomment-1508601093 From rrich at openjdk.org Fri Apr 14 14:18:32 2023 From: rrich at openjdk.org (Richard Reingruber) Date: Fri, 14 Apr 2023 14:18:32 GMT Subject: RFR: 8305668: PPC: Non-Top Interpreted frames should be independent of ABI_ELFv2 [v3] In-Reply-To: References: Message-ID: > This PR makes parent interpreted Java frames independent of `ABI_ELFv2`. > With the changes `test/jdk/jdk/internal/vm/Continuation/BasicExt.java#COMP_ALL` succeeds on PPC64 Big Endian Linux. 
> > Before: > > * `parent_ijava_frame_abi` was derived from `abi_minframe` which depends on `ABI_ELFv2` > * jit_abi is independent of `abi_minframe` > * `frame::metadata_words` is wrong for `parent_ijava_frame_abi` if `ABI_ELFv2` is not defined (big endian) > > After changes: > > * prefixed structs that depend on `ABI_ELFv2` with `native_` > * introduced `java_abi` which is independent of `ABI_ELFv2` > * `frame::metadata_words` is the size in words of `java_abi` > > This is still a little imprecise since `top_ijava_frame_abi` is larger than `java_abi` but the top frame is never frozen as it is always `vmIntrinsics::_Continuation_doYield` > > Testing: > > PPC64le: most JCK and JTREG tiers 1-4, also in Xcomp mode. > PPC64be Linux: hotspot tier1 tests Richard Reingruber has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 10 additional commits since the last revision: - Merge branch 'master' after 8301495: Replace NULL with nullptr in cpu/ppc - Merge branch 'master' - Update comments - Rename abi_reg_args_spill -> native_abi_reg_args_spill - Use correct abi definitions - Rename native abi size enum elements - Introduce common_abi - Derive parent_ijava_frame_abi from java_abi - java_abi - Native abi structs ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13372/files - new: https://git.openjdk.org/jdk/pull/13372/files/ea9ddcfe..1e7ba3f6 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13372&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13372&range=01-02 Stats: 415 lines in 54 files changed: 30 ins; 1 del; 384 mod Patch: https://git.openjdk.org/jdk/pull/13372.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13372/head:pull/13372 PR: https://git.openjdk.org/jdk/pull/13372 From fparain at openjdk.org Fri Apr 14 15:26:35 2023 From: fparain at openjdk.org (Frederic Parain) Date: Fri, 14 Apr 
2023 15:26:35 GMT Subject: RFR: 8305404: Compile_lock not needed for InstanceKlass::implementor() In-Reply-To: References: Message-ID: On Thu, 13 Apr 2023 12:25:27 GMT, Coleen Phillimore wrote: > See CR for details. > Tested with tier1-4, 7, 8. Marked as reviewed by fparain (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/13458#pullrequestreview-1385674518 From dzhang at openjdk.org Fri Apr 14 15:34:29 2023 From: dzhang at openjdk.org (Dingli Zhang) Date: Fri, 14 Apr 2023 15:34:29 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v18] In-Reply-To: References: Message-ID: > HI, > > We have added support for vector add mask instructions, please take a look and have some reviews. Thanks a lot! > This patch will add support of vector add/sub/mul/div mask version. It was implemented by referring to RVV v1.0 [1]. > > ## Load/Store/Cmp Mask > `VectorLoadMask, VectorMaskCmp, VectorStoreMask` will implement the mask datapath. We can see where the data is passed in the compilation log with `jdk/incubator/vector/Byte128VectorTests.java`? > > 218 loadV V1, [R7] # vector (rvv) > 220 vloadmask V0, V1 > ... > 23c vmaskcmp_rvv_masked V0, V4, V5, V0, V1, #0 > 24c vstoremask V1, V0 > 258 storeV [R7], V1 # vector (rvv) > > > The corresponding generated jit assembly? > > # loadV > 0x000000400c8ef958: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef95c: vle8.v v1,(t2) > > # vloadmask > 0x000000400c8ef960: vsetvli t0,zero,e8,m1,tu, > 0x000000400c8ef964: vmsne.vx v0,v1,zero > > # vmaskcmp_rvv_masked > 0x000000400c8ef97c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef980: vmclr.m v1 > 0x000000400c8ef984: vmseq.vv v1,v4,v5,v0.t > 0x000000400c8ef988: vmv1r.v v0,v1 > > # vstoremask > 0x000000400c8ef98c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef990: vmv.v.x v1,zero > 0x000000400c8ef994: vmerge.vim v1,v1,1,v0 > > > ## Masked vector arithmetic instructions (e.g. 
vadd) > AddMaskTestMerge case: > > import jdk.incubator.vector.IntVector; > import jdk.incubator.vector.VectorMask; > import jdk.incubator.vector.VectorOperators; > import jdk.incubator.vector.VectorSpecies; > > public class AddMaskTestMerge { > > static final VectorSpecies SPECIES = IntVector.SPECIES_128; > static final int SIZE = 1024; > static int[] a = new int[SIZE]; > static int[] b = new int[SIZE]; > static int[] r = new int[SIZE]; > static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false}; > static { > for (int i = 0; i < SIZE; i++) { > a[i] = i; > b[i] = i; > } > } > > static void workload(int idx) { > VectorMask vmask = VectorMask.fromArray(SPECIES, c, 0); > IntVector av = IntVector.fromArray(SPECIES, a, idx); > IntVector bv = IntVector.fromArray(SPECIES, b, idx); > av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx); > } > > public static void main(String[] args) { > for (int i = 0; i < 30_0000; i++) { > for (int j = 0; j < SIZE; j += SPECIES.length()) { > workload(j); > } > } > } > } > > > This test case is reduced from existing jtreg vector tests Int128VectorTests.java[2]. This test case corresponds to the add instruction of the vector mask version and other instructions are similar. > > Before this patch, the compilation log will not print RVV-related instructions. Now the compilation log is as follows: > > > 0ae B10: # out( B25 B11 ) <- in( B9 ) Freq: 0.999991 > 0ae loadV V1, [R31] # vector (rvv) > 0b6 vloadmask V0, V2 > 0be vadd.vv V3, V1, V0 #@vaddI_masked > 0c6 lwu R28, [R7, #124] # loadN, compressed ptr, #@loadN ! 
Field: AddMaskTestMerge.r > 0ca decode_heap_oop R28, R28 #@decodeHeapOop > 0cc lwu R7, [R28, #12] # range, #@loadRange > 0d0 NullCheck R28 > > > And the jit code is as follows: > > > 0x000000400c823cee: vsetvli t0,zero,e32,m1,tu,mu > 0x000000400c823cf2: vle32.v v1,(t6) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::intoArray@43 (line 3228) > ; - AddMaskTestMerge::workload@46 (line 25) > 0x000000400c823cf6: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c823cfa: vmsne.vx v0,v2,zero ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.VectorMask::fromArray@47 (line 208) > ; - AddMaskTestMerge::workload@7 (line 22) > 0x000000400c823cfe: vsetvli t0,zero,e32,m1,tu,mu > 0x000000400c823d02: vadd.vv v3,v3,v1,v0.t ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::lanewiseTemplate@192 (line 834) > ; - jdk.incubator.vector.Int128Vector::lanewise@9 (line 291) > ; - jdk.incubator.vector.Int128Vector::lanewise@4 (line 41) > ; - AddMaskTestMerge::workload@39 (line 25) > > > ## Mask register allocation & mask bit operation > Since v0 is to be used as a mask register in spec[1], sometimes we need two vmasks to do the vector mask logical ops like `AndVMask, OrVMask, XorVMask`. And if only v0 and v31 mask registers are defined, the corresponding c2 nodes will not be generated correctly because of the register pressure[3]. > When we use only v0 and v31 as mask registers, jtreg testing of Byte128VectorTests.java[4] with `-XX:+PrintAssembly` and `-XX:LogFile` will not emit the expected rvv mask instruction.
Instead, the following compilation failure log[3] is generated: > > > > > > > > > So we define v30 and v31 as mask registers too, and `AndVMask` will emit C2 JIT code like: > > vloadmask V0, V1 > vloadmask V30, V2 > vmask_and V0, V30, V0 > > We also modified the implementation of `spill_copy_vector_stack_to_stack` so that it no longer occupies the v0 register. In addition, we changed some nodes such as `vasr/vlsl/vlsr/vstring_x/varray_x/vclearArray_x`, which use v0 internally, so that C2 is aware that they use v0. > > ## vector load/store - predicated & blend operation > > Jtreg testing of Byte128VectorTests.java[4] with -XX:+PrintOptoAssembly and -XX:LogFile will print the following compilation log, which is generated by predicated vector load/store: > > 152 B21: # out( B22 ) <- in( B20 ) Freq: 0.499984 > 152 vmask_gen_L V0, R12 > 162 loadV_masked V1, V0, [R10] > 16e storeV_masked [R11], V0, V1 > > > And `VectorBlend` will generate the following compilation log (part of rotate operation): > > 1ea vlsrBS V6, V1, V3 V0 > 1fe vlslBS V5, V1, V2 V0 > 212 vor.vv V2, V5, V6 #@vor > 21a vloadmask V0, V4 > 222 vmerge_vvm V1, V1, V2 # vector blend > 22a bgeu R9, R30, B56 #@cmpU_branch P=0.000001 C=-1.000000 > > > At the same time, we added the predicated nodes of `RShiftV/LShiftV/URShiftV`. Since there was some code duplication with the corresponding nodes in non-masked form, a small refactoring was done.
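To make the lane semantics behind the masked arithmetic and blend nodes in this description concrete, here is a scalar Java sketch. It is an illustration only: `MaskedLaneSketch` and its methods are hypothetical helper names, and plain arrays stand in for `IntVector` and `VectorMask` so that the sketch runs without the incubator module. It models the documented Vector API behaviour that unset lanes of a masked `lanewise` keep the first operand's value, and that `blend` takes the second operand's value in set lanes.

```java
import java.util.Arrays;

public class MaskedLaneSketch {
    /** Scalar model of a.lanewise(VectorOperators.ADD, b, mask): unset lanes keep a's value. */
    static int[] maskedAdd(int[] a, int[] b, boolean[] mask) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = mask[i] ? a[i] + b[i] : a[i];
        }
        return r;
    }

    /** Scalar model of a.blend(b, mask): set lanes take b's value, unset lanes keep a's. */
    static int[] blend(int[] a, int[] b, boolean[] mask) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = mask[i] ? b[i] : a[i];
        }
        return r;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4};
        int[] b = {10, 20, 30, 40};
        boolean[] mask = {true, false, true, false};
        System.out.println(Arrays.toString(maskedAdd(a, b, mask))); // [11, 2, 33, 4]
        System.out.println(Arrays.toString(blend(a, b, mask)));     // [10, 2, 30, 4]
    }
}
```

The masked add above corresponds to the merge behaviour that the `v0.t` operand selects in the `vadd.vv` listing, and the blend corresponds to the `vmerge_vvm` line.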
> > > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc > [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java > [3] https://github.com/openjdk/jdk/blob/0deb648985b018653ccdaf193dc13b3cf21c088a/src/hotspot/share/opto/chaitin.cpp#L526 > [4] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Byte128VectorTests.java > > ### Testing: > > qemu with UseRVV: > - [x] Tier1 tests (release) > - [x] Tier2 tests (release) > - [x] Tier3 tests (release) > - [x] test/jdk/jdk/incubator/vector (release/fastdebug) Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: Simplify some arithmetic mask nodes ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12682/files - new: https://git.openjdk.org/jdk/pull/12682/files/110ebcf3..a7f66796 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=17 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=16-17 Stats: 114 lines in 1 file changed: 3 ins; 102 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/12682.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12682/head:pull/12682 PR: https://git.openjdk.org/jdk/pull/12682 From coleenp at openjdk.org Fri Apr 14 15:36:45 2023 From: coleenp at openjdk.org (Coleen Phillimore) Date: Fri, 14 Apr 2023 15:36:45 GMT Subject: RFR: 8305404: Compile_lock not needed for InstanceKlass::implementor() In-Reply-To: References: Message-ID: On Thu, 13 Apr 2023 12:25:27 GMT, Coleen Phillimore wrote: > See CR for details. > Tested with tier1-4, 7, 8. Thanks Fred! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/13458#issuecomment-1508799136 From coleenp at openjdk.org Fri Apr 14 15:36:45 2023 From: coleenp at openjdk.org (Coleen Phillimore) Date: Fri, 14 Apr 2023 15:36:45 GMT Subject: Integrated: 8305404: Compile_lock not needed for InstanceKlass::implementor() In-Reply-To: References: Message-ID: On Thu, 13 Apr 2023 12:25:27 GMT, Coleen Phillimore wrote: > See CR for details. > Tested with tier1-4, 7, 8. This pull request has now been integrated. Changeset: ebeee6dc Author: Coleen Phillimore URL: https://git.openjdk.org/jdk/commit/ebeee6dce8c52ef156d54ad14cce81a243ef5c0b Stats: 11 lines in 2 files changed: 0 ins; 5 del; 6 mod 8305404: Compile_lock not needed for InstanceKlass::implementor() Reviewed-by: eosterlund, fparain ------------- PR: https://git.openjdk.org/jdk/pull/13458 From dzhang at openjdk.org Fri Apr 14 15:36:43 2023 From: dzhang at openjdk.org (Dingli Zhang) Date: Fri, 14 Apr 2023 15:36:43 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v16] In-Reply-To: References: <8j0il6k2xmB5s72N2EAiTijDDjN7FNoXrPux4ur9IdE=.485aa3bb-de42-4bbb-a374-1c2ec7340f71@github.com> Message-ID: On Thu, 13 Apr 2023 13:28:17 GMT, Feilong Jiang wrote: >> Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix typo > > src/hotspot/cpu/riscv/riscv_v.ad line 2108: > >> 2106: >> 2107: instruct vsubS_masked(vReg dst_src1, vReg src2, vRegMask_V0 vmask) %{ >> 2108: match(Set dst_src1 (SubVS (Binary dst_src1 src2) vmask)); > > Can we just merge those match rules in one instruct just like `vlsrIL`? Looks like those instructs only differ from BasicType. Fixed. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1166986311 From dzhang at openjdk.org Fri Apr 14 15:48:40 2023 From: dzhang at openjdk.org (Dingli Zhang) Date: Fri, 14 Apr 2023 15:48:40 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v12] In-Reply-To: References: Message-ID: On Thu, 6 Apr 2023 16:42:20 GMT, Dingli Zhang wrote: >> src/hotspot/cpu/riscv/riscv_v.ad line 2422: >> >>> 2420: %} >>> 2421: >>> 2422: instruct vmask_gen_I(vRegMask dst, iRegI src) %{ >> >> Just a reminder that your following new rules will enable array operations partial inlining. Have you tested this feature on RISC-V? > > I am sorry I misunderstood you before, currently riscv does not have arraycopy partial inlining enabled like aarch64[1] because we did not add all the nodes needed for the above optimization in this patch[2]. > > We added the required nodes and tested `org/openjdk/bench/java/lang/ArrayCopyAligned.java`, the arraycopy partial inline function tested correctly and the corresponding nodes could be printed in the compilation log as follows: > > 124 B23: # out( B24 ) <- in( B22 ) Freq: 0.402723 > 124 # castII of R8, #@castII > 124 vmask_gen_I V0, R8 > 134 loadV_masked V1, V0, [R10] > 140 storeV_masked [R11], V0, V1 > > However, because of the lack of hardware data, we do not turn on the partial inlining optimization by default for now, as the partial inline optimization depends on the nodes we will append. > > We are testing more extensively, and will update the PR after that. > > [1] https://github.com/openjdk/jdk/blob/cd7d53c88c27eedbe16020b88c2219708d170a1e/src/hotspot/cpu/aarch64/vm_version_aarch64.cpp#L557-L563 > [2] https://github.com/openjdk/jdk/blob/cd7d53c88c27eedbe16020b88c2219708d170a1e/src/hotspot/share/opto/macroArrayCopy.cpp#L220-L225 > Just a reminder that your following new rules will enable array operations partial inlining. Have you tested this feature on RISC-V? 
FYI: We enable array operations partial inlining on riscv with the patch below, tier1-3 looks fine. diff --git a/src/hotspot/cpu/riscv/vm_version_riscv.cpp b/src/hotspot/cpu/riscv/vm_version_riscv.cpp index 82f34ade0ef..6cbb8b61b6a 100644 --- a/src/hotspot/cpu/riscv/vm_version_riscv.cpp +++ b/src/hotspot/cpu/riscv/vm_version_riscv.cpp @@ -320,6 +320,14 @@ void VM_Version::c2_initialize() { if (FLAG_IS_DEFAULT(UseMontgomerySquareIntrinsic)) { FLAG_SET_DEFAULT(UseMontgomerySquareIntrinsic, true); } + + int inline_size = (UseRVV && MaxVectorSize >= 16) ? MaxVectorSize : 0; + if (FLAG_IS_DEFAULT(ArrayOperationPartialInlineSize)) { + FLAG_SET_DEFAULT(ArrayOperationPartialInlineSize, inline_size); + } else if (ArrayOperationPartialInlineSize != 0 && ArrayOperationPartialInlineSize != inline_size) { + warning("Setting ArrayOperationPartialInlineSize to %d", inline_size); + ArrayOperationPartialInlineSize = inline_size; + } } #endif // COMPILER2 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1166997854 From kvn at openjdk.org Fri Apr 14 16:09:34 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 14 Apr 2023 16:09:34 GMT Subject: RFR: 8305995: Use full dominant search for regions In-Reply-To: <1fT7mBT0BgH-MlwqeeNtA3biaCc3tt6URxnYrWnMEkE=.e3b393b8-ff0e-4c4d-9046-cec013d7b978@github.com> References: <1fT7mBT0BgH-MlwqeeNtA3biaCc3tt6URxnYrWnMEkE=.e3b393b8-ff0e-4c4d-9046-cec013d7b978@github.com> Message-ID: <88HlHOLi6yXBmp2tTpI3fIHAGT0ZFeExGG7x4dbUJcY=.8b6e5603-941c-467e-b550-ecc24bcd2b96@github.com> On Thu, 13 Apr 2023 01:02:08 GMT, Kirill A. Korinsky wrote: > This is a fix for the regression introduced by da43cb5e463069cf4dafb262664f0d3d7c2e0eac in fix 8224957. > > This regression was found while attempting to migrate an application from JDK 1.8 to JDK 17, by running internal benchmarks, and while investigating abnormal memory usage for about 4 times more from one of them. 
> > The regression appears in the JMH benchmark, which builds a huge tree which contains boxed integers from 0 to a few thousand. A tree has very complex structure and the same objects are reused a lot. > > When an `integer` is found it's collected as `Integer` and unboxed inside the collector callback. > > This benchmark was run with `ParallelGC` on different JVMs: `JDK 1.8.0_362`, `JDK 11.0.18`, `JDK 13.0.13`, `JDK 15.0.9`, `JDK 17.0.6`, `JDK 19.0.2` and `JDK 20`. This allows to see that something has changed between 13 and 15, and that the memory footprint for this code has increased from `3152` to `11828` bytes per operation. > > After that I've done a `git bisect` which allows me to locate the introducer. > > So the current fix reduces the memory footprint on the local root 425ef0685c584abec80454fbcccdcc6db6558f93 to `2960` bytes per operation. First, please update PR's title, it should match JBS's title. Second, @catap you referenced JMH benchmark. Is it already in `test/micro/org/openjdk/bench/` or it is new? Please add it if it is new. So before 8224957 code looks through in(1) for req() > 3 (and == 2). After 8224957 we bailout for >3 inputs. This fix removed this restriction and we now go through all inputs for case > 3 inputs. It may increase compilation time for such cases significantly but on other hand it may allow reduce IR. I assume, based on description, x4 increase is in Java heap due to not unboxing some Java objects. Or is it memory consumed by C2 Jit compiler during benchmark execution? ------------- PR Comment: https://git.openjdk.org/jdk/pull/13453#issuecomment-1508893706 From duke at openjdk.org Fri Apr 14 18:16:35 2023 From: duke at openjdk.org (Kirill A. 
Korinsky) Date: Fri, 14 Apr 2023 18:16:35 GMT Subject: RFR: 8305995: Use full dominant search for regions [v2] In-Reply-To: <1fT7mBT0BgH-MlwqeeNtA3biaCc3tt6URxnYrWnMEkE=.e3b393b8-ff0e-4c4d-9046-cec013d7b978@github.com> References: <1fT7mBT0BgH-MlwqeeNtA3biaCc3tt6URxnYrWnMEkE=.e3b393b8-ff0e-4c4d-9046-cec013d7b978@github.com> Message-ID: > This is a fix for the regression introduced by da43cb5e463069cf4dafb262664f0d3d7c2e0eac in fix 8224957. > > This regression was found while attempting to migrate an application from JDK 1.8 to JDK 17, by running internal benchmarks, and while investigating abnormal memory usage for about 4 times more from one of them. > > The regression appears in the JMH benchmark, which builds a huge tree which contains boxed integers from 0 to a few thousand. A tree has very complex structure and the same objects are reused a lot. > > When an `integer` is found it's collected as `Integer` and unboxed inside the collector callback. > > This benchmark was run with `ParallelGC` on different JVMs: `JDK 1.8.0_362`, `JDK 11.0.18`, `JDK 13.0.13`, `JDK 15.0.9`, `JDK 17.0.6`, `JDK 19.0.2` and `JDK 20`. This allows to see that something has changed between 13 and 15, and that the memory footprint for this code has increased from `3152` to `11828` bytes per operation. > > After that I've done a `git bisect` which allows me to locate the introducer. > > So the current fix reduces the memory footprint on the local root 425ef0685c584abec80454fbcccdcc6db6558f93 to `2960` bytes per operation. Kirill A. Korinsky has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: 8305995: Use full dominant search for regions This is a fix for the regression introduced by da43cb5e463069cf4dafb262664f0d3d7c2e0eac in fix 8224957. 
This regression was found while attempting to migrate an application from JDK 1.8 to JDK 17, by running internal benchmarks, and while investigating abnormal memory usage for about 4 times more from one of them. The regression appears in provided JMH benchmark, which builds a RB-tree based map which contains 780 entries with primitive integers. This benchmark was run with `ParallelGC` on different JVMs: `JDK 1.8.0_362`, `JDK 11.0.18`, `JDK 13.0.13`, `JDK 15.0.9`, `JDK 17.0.6`, `JDK 19.0.2` and `JDK 20`. This allows to see that something has changed between 13 and 15, and that the memory footprint for this code has increased from nothing `? 10?? ` to `7536` bytes per operation. Proposed fix reduces the memory footprint expected value. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13453/files - new: https://git.openjdk.org/jdk/pull/13453/files/82c77579..dbc3ebf9 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13453&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13453&range=00-01 Stats: 1203 lines in 1 file changed: 1203 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/13453.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13453/head:pull/13453 PR: https://git.openjdk.org/jdk/pull/13453 From duke at openjdk.org Fri Apr 14 18:16:36 2023 From: duke at openjdk.org (Kirill A. Korinsky) Date: Fri, 14 Apr 2023 18:16:36 GMT Subject: RFR: 8305995: Use full dominant search for regions In-Reply-To: <88HlHOLi6yXBmp2tTpI3fIHAGT0ZFeExGG7x4dbUJcY=.8b6e5603-941c-467e-b550-ecc24bcd2b96@github.com> References: <1fT7mBT0BgH-MlwqeeNtA3biaCc3tt6URxnYrWnMEkE=.e3b393b8-ff0e-4c4d-9046-cec013d7b978@github.com> <88HlHOLi6yXBmp2tTpI3fIHAGT0ZFeExGG7x4dbUJcY=.8b6e5603-941c-467e-b550-ecc24bcd2b96@github.com> Message-ID: On Fri, 14 Apr 2023 16:06:05 GMT, Vladimir Kozlov wrote: >> This is a fix for the regression introduced by da43cb5e463069cf4dafb262664f0d3d7c2e0eac in fix 8224957. 
>> >> This regression was found while attempting to migrate an application from JDK 1.8 to JDK 17, by running internal benchmarks, and while investigating abnormal memory usage for about 4 times more from one of them. >> >> The regression appears in the JMH benchmark, which builds a huge tree which contains boxed integers from 0 to a few thousand. A tree has very complex structure and the same objects are reused a lot. >> >> When an `integer` is found it's collected as `Integer` and unboxed inside the collector callback. >> >> This benchmark was run with `ParallelGC` on different JVMs: `JDK 1.8.0_362`, `JDK 11.0.18`, `JDK 13.0.13`, `JDK 15.0.9`, `JDK 17.0.6`, `JDK 19.0.2` and `JDK 20`. This allows to see that something has changed between 13 and 15, and that the memory footprint for this code has increased from `3152` to `11828` bytes per operation. >> >> After that I've done a `git bisect` which allows me to locate the introducer. >> >> So the current fix reduces the memory footprint on the local root 425ef0685c584abec80454fbcccdcc6db6558f93 to `2960` bytes per operation. > > First, please update PR's title, it should match JBS's title. > Second, @catap you referenced JMH benchmark. Is it already in `test/micro/org/openjdk/bench/` or it is new? Please add it if it is new. > > So before 8224957 code looks through in(1) for req() > 3 (and == 2). After 8224957 we bailout for >3 inputs. > This fix removed this restriction and we now go through all inputs for case > 3 inputs. > It may increase compilation time for such cases significantly but on other hand it may allow reduce IR. > > I assume, based on description, x4 increase is in Java heap due to not unboxing some Java objects. Or is it memory consumed by C2 Jit compiler during benchmark execution? @vnkozlov after spending some time to simplify my benchmark and to include it into PR as the next commit, I'd like to conclude that boxing / unboxing aren't doing anything with it. 
I just made a force push which contains the benchmark. JMH reports almost no runtime memory footprint (`≈ 10??`) on JDKs up to and including 13, but on JDK 15 and later it consumes `≈ 7536`. So in fact it isn't x4; my first assumption was wrong. It is much bigger. P.S. I've also rewritten the commit message and included the issue ID in it. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13453#issuecomment-1509043148 From kvn at openjdk.org Fri Apr 14 18:32:34 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 14 Apr 2023 18:32:34 GMT Subject: RFR: 8305995: Use full dominant search for regions [v2] In-Reply-To: References: <1fT7mBT0BgH-MlwqeeNtA3biaCc3tt6URxnYrWnMEkE=.e3b393b8-ff0e-4c4d-9046-cec013d7b978@github.com> Message-ID: On Fri, 14 Apr 2023 18:16:35 GMT, Kirill A. Korinsky wrote: >> This is a fix for the regression introduced by da43cb5e463069cf4dafb262664f0d3d7c2e0eac in fix 8224957. >> >> This regression was found while attempting to migrate an application from JDK 1.8 to JDK 17, by running internal benchmarks, and while investigating abnormal memory usage for about 4 times more from one of them. >> >> The regression appears in the JMH benchmark, which builds a huge tree which contains boxed integers from 0 to a few thousand. A tree has very complex structure and the same objects are reused a lot. >> >> When an `integer` is found it's collected as `Integer` and unboxed inside the collector callback. >> >> This benchmark was run with `ParallelGC` on different JVMs: `JDK 1.8.0_362`, `JDK 11.0.18`, `JDK 13.0.13`, `JDK 15.0.9`, `JDK 17.0.6`, `JDK 19.0.2` and `JDK 20`. This allows to see that something has changed between 13 and 15, and that the memory footprint for this code has increased from `3152` to `11828` bytes per operation. >> >> After that I've done a `git bisect` which allows me to locate the introducer.
>> >> So the current fix reduces the memory footprint on the local root 425ef0685c584abec80454fbcccdcc6db6558f93 to `2960` bytes per operation. > > Kirill A. Korinsky has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: > > 8305995: Use full dominant search for regions > > This is a fix for the regression introduced by > da43cb5e463069cf4dafb262664f0d3d7c2e0eac in fix 8224957. > > This regression was found while attempting to migrate an application > from JDK 1.8 to JDK 17, by running internal benchmarks, and while > investigating abnormal memory usage for about 4 times more from one of > them. > > The regression appears in provided JMH benchmark, which builds a RB-tree > based map which contains 780 entries with primitive integers. > > This benchmark was run with `ParallelGC` on different JVMs: `JDK > 1.8.0_362`, `JDK 11.0.18`, `JDK 13.0.13`, `JDK 15.0.9`, `JDK 17.0.6`, > `JDK 19.0.2` and `JDK 20`. This allows to see that something has changed > between 13 and 15, and that the memory footprint for this code has > increased from nothing `? 10?? ` to `7536` bytes per operation. > > Proposed fix reduces the memory footprint expected value. Thank you for adding benchmark. Please use meaningful name for benchmark and add copyright header to it. See other benchmarks there as example. And add benchmark's output as comment and what machine you used to run it. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13453#issuecomment-1509058021 PR Comment: https://git.openjdk.org/jdk/pull/13453#issuecomment-1509060526 From duke at openjdk.org Fri Apr 14 18:51:35 2023 From: duke at openjdk.org (Kirill A. 
Korinsky) Date: Fri, 14 Apr 2023 18:51:35 GMT Subject: RFR: 8305995: Use full dominant search for regions [v2] In-Reply-To: References: <1fT7mBT0BgH-MlwqeeNtA3biaCc3tt6URxnYrWnMEkE=.e3b393b8-ff0e-4c4d-9046-cec013d7b978@github.com> Message-ID: <4aZBKA4AocZOcpK9rGxVlMbmlL40SD-Wq0rN9m1waIA=.00dfdd52-2c59-4704-9c39-ca714c3bd85e@github.com> On Fri, 14 Apr 2023 18:29:58 GMT, Vladimir Kozlov wrote: >> Kirill A. Korinsky has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: >> >> 8305995: Use full dominant search for regions >> >> This is a fix for the regression introduced by >> da43cb5e463069cf4dafb262664f0d3d7c2e0eac in fix 8224957. >> >> This regression was found while attempting to migrate an application >> from JDK 1.8 to JDK 17, by running internal benchmarks, and while >> investigating abnormal memory usage for about 4 times more from one of >> them. >> >> The regression appears in provided JMH benchmark, which builds a RB-tree >> based map which contains 780 entries with primitive integers. >> >> This benchmark was run with `ParallelGC` on different JVMs: `JDK >> 1.8.0_362`, `JDK 11.0.18`, `JDK 13.0.13`, `JDK 15.0.9`, `JDK 17.0.6`, >> `JDK 19.0.2` and `JDK 20`. This allows to see that something has changed >> between 13 and 15, and that the memory footprint for this code has >> increased from nothing `? 10?? ` to `7536` bytes per operation. >> >> Proposed fix reduces the memory footprint expected value. > > And add benchmark's output as comment and what machine you used to run it. @vnkozlov may you suggest any meaningful name? :) `RBTreeSearch` is good enough? ------------- PR Comment: https://git.openjdk.org/jdk/pull/13453#issuecomment-1509076244 From duke at openjdk.org Fri Apr 14 19:14:36 2023 From: duke at openjdk.org (Kirill A. 
Korinsky) Date: Fri, 14 Apr 2023 19:14:36 GMT Subject: RFR: 8305995: Use full dominant search for regions [v3] In-Reply-To: <1fT7mBT0BgH-MlwqeeNtA3biaCc3tt6URxnYrWnMEkE=.e3b393b8-ff0e-4c4d-9046-cec013d7b978@github.com> References: <1fT7mBT0BgH-MlwqeeNtA3biaCc3tt6URxnYrWnMEkE=.e3b393b8-ff0e-4c4d-9046-cec013d7b978@github.com> Message-ID: > This is a fix for the regression introduced by da43cb5e463069cf4dafb262664f0d3d7c2e0eac in fix 8224957. > > This regression was found while attempting to migrate an application from JDK 1.8 to JDK 17, by running internal benchmarks, and while investigating abnormal memory usage for about 4 times more from one of them. > > The regression appears in the JMH benchmark, which builds a huge tree which contains boxed integers from 0 to a few thousand. A tree has very complex structure and the same objects are reused a lot. > > When an `integer` is found it's collected as `Integer` and unboxed inside the collector callback. > > This benchmark was run with `ParallelGC` on different JVMs: `JDK 1.8.0_362`, `JDK 11.0.18`, `JDK 13.0.13`, `JDK 15.0.9`, `JDK 17.0.6`, `JDK 19.0.2` and `JDK 20`. This allows to see that something has changed between 13 and 15, and that the memory footprint for this code has increased from `3152` to `11828` bytes per operation. > > After that I've done a `git bisect` which allows me to locate the introducer. > > So the current fix reduces the memory footprint on the local root 425ef0685c584abec80454fbcccdcc6db6558f93 to `2960` bytes per operation. Kirill A. 
Korinsky has updated the pull request incrementally with one additional commit since the last revision: Rename benchmark and add header ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13453/files - new: https://git.openjdk.org/jdk/pull/13453/files/dbc3ebf9..4cd22b60 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13453&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13453&range=01-02 Stats: 36 lines in 1 file changed: 33 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/13453.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13453/head:pull/13453 PR: https://git.openjdk.org/jdk/pull/13453 From duke at openjdk.org Fri Apr 14 19:14:45 2023 From: duke at openjdk.org (Kirill A. Korinsky) Date: Fri, 14 Apr 2023 19:14:45 GMT Subject: RFR: 8305995: Use full dominant search for regions [v2] In-Reply-To: References: <1fT7mBT0BgH-MlwqeeNtA3biaCc3tt6URxnYrWnMEkE=.e3b393b8-ff0e-4c4d-9046-cec013d7b978@github.com> Message-ID: On Fri, 14 Apr 2023 18:16:35 GMT, Kirill A. Korinsky wrote: >> This is a fix for the regression introduced by da43cb5e463069cf4dafb262664f0d3d7c2e0eac in fix 8224957. >> >> This regression was found while attempting to migrate an application from JDK 1.8 to JDK 17, by running internal benchmarks, and while investigating abnormal memory usage for about 4 times more from one of them. >> >> The regression appears in the JMH benchmark, which builds a huge tree which contains boxed integers from 0 to a few thousand. A tree has very complex structure and the same objects are reused a lot. >> >> When an `integer` is found it's collected as `Integer` and unboxed inside the collector callback. >> >> This benchmark was run with `ParallelGC` on different JVMs: `JDK 1.8.0_362`, `JDK 11.0.18`, `JDK 13.0.13`, `JDK 15.0.9`, `JDK 17.0.6`, `JDK 19.0.2` and `JDK 20`. 
This allows to see that something has changed between 13 and 15, and that the memory footprint for this code has increased from `3152` to `11828` bytes per operation. >> >> After that I've done a `git bisect` which allows me to locate the introducer. >> >> So the current fix reduces the memory footprint on the local root 425ef0685c584abec80454fbcccdcc6db6558f93 to `2960` bytes per operation. > > Kirill A. Korinsky has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: > > 8305995: Use full dominant search for regions > > This is a fix for the regression introduced by > da43cb5e463069cf4dafb262664f0d3d7c2e0eac in fix 8224957. > > This regression was found while attempting to migrate an application > from JDK 1.8 to JDK 17, by running internal benchmarks, and while > investigating abnormal memory usage for about 4 times more from one of > them. > > The regression appears in provided JMH benchmark, which builds a RB-tree > based map which contains 780 entries with primitive integers. > > This benchmark was run with `ParallelGC` on different JVMs: `JDK > 1.8.0_362`, `JDK 11.0.18`, `JDK 13.0.13`, `JDK 15.0.9`, `JDK 17.0.6`, > `JDK 19.0.2` and `JDK 20`. This allows to see that something has changed > between 13 and 15, and that the memory footprint for this code has > increased from nothing `≈ 10?? ` to `7536` bytes per operation. > > Proposed fix reduces the memory footprint expected value. Renamed. I ran it as `make test TEST="micro:RBTreeSearch" MICRO_OPTIONS="-prof gc"` and got this output on my branch:

    Benchmark                                 Mode  Cnt  Score    Error  Units
    RBTreeSearch.search                      thrpt   12  0.079 ±  0.013  ops/us
    RBTreeSearch.search:·gc.alloc.rate       thrpt   12  ≈ 10??          MB/sec
    RBTreeSearch.search:·gc.alloc.rate.norm  thrpt   12  0.003 ±  0.001  B/op
    RBTreeSearch.search:·gc.count            thrpt   12  ≈ 0             counts

If I revert my changes, for example with `git diff ..origin/master -- src/hotspot/share/opto/node.cpp | patch -p1`, and re-run the test, I get this output:

    Benchmark                                 Mode  Cnt     Score     Error  Units
    RBTreeSearch.search                      thrpt   12     0.044 ±   0.009  ops/us
    RBTreeSearch.search:·gc.alloc.rate       thrpt   12   209.072 ±  42.450  MB/sec
    RBTreeSearch.search:·gc.alloc.rate.norm  thrpt   12  5024.005 ±   0.001  B/op
    RBTreeSearch.search:·gc.count            thrpt   12    17.000            counts
    RBTreeSearch.search:·gc.time             thrpt   12    17.000            ms

As you can see, the memory footprint is quite different. My test machine currently runs macOS 12.6.3. I could find a Linux machine, but I doubt that this logic is OS-specific. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13453#issuecomment-1509101903 From kvn at openjdk.org Fri Apr 14 20:41:34 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 14 Apr 2023 20:41:34 GMT Subject: RFR: 8305995: Footprint regression from JDK-8224957 [v3] In-Reply-To: References: <1fT7mBT0BgH-MlwqeeNtA3biaCc3tt6URxnYrWnMEkE=.e3b393b8-ff0e-4c4d-9046-cec013d7b978@github.com> Message-ID: On Fri, 14 Apr 2023 19:14:36 GMT, Kirill A. Korinsky wrote: >> This is a fix for the regression introduced by da43cb5e463069cf4dafb262664f0d3d7c2e0eac in fix 8224957. >> >> This regression was found while attempting to migrate an application from JDK 1.8 to JDK 17, by running internal benchmarks, and while investigating abnormal memory usage for about 4 times more from one of them. >> >> The regression appears in the JMH benchmark, which builds a huge tree which contains boxed integers from 0 to a few thousand. A tree has very complex structure and the same objects are reused a lot. >> >> When an `integer` is found it's collected as `Integer` and unboxed inside the collector callback. >> >> This benchmark was run with `ParallelGC` on different JVMs: `JDK 1.8.0_362`, `JDK 11.0.18`, `JDK 13.0.13`, `JDK 15.0.9`, `JDK 17.0.6`, `JDK 19.0.2` and `JDK 20`.
This allows to see that something has changed between 13 and 15, and that the memory footprint for this code has increased from `3152` to `11828` bytes per operation. >> >> After that I've done a `git bisect` which allows me to locate the introducer. >> >> So the current fix reduces the memory footprint on the local root 425ef0685c584abec80454fbcccdcc6db6558f93 to `2960` bytes per operation. > > Kirill A. Korinsky has updated the pull request incrementally with one additional commit since the last revision: > > Rename benchmark and add header New test's name is fine. Please, update copyright year (first line) to 2023. Thank you for data. It shows that C2 did not eliminate Integer boxing allocations without your fix. As result they consume Java heap. With your fix C2's Escape Analysis eliminated these allocations. Good! I will submit internal testing for your changes. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13453#issuecomment-1509229848 From duke at openjdk.org Fri Apr 14 20:51:35 2023 From: duke at openjdk.org (Kirill A. Korinsky) Date: Fri, 14 Apr 2023 20:51:35 GMT Subject: RFR: 8305995: Footprint regression from JDK-8224957 [v4] In-Reply-To: <1fT7mBT0BgH-MlwqeeNtA3biaCc3tt6URxnYrWnMEkE=.e3b393b8-ff0e-4c4d-9046-cec013d7b978@github.com> References: <1fT7mBT0BgH-MlwqeeNtA3biaCc3tt6URxnYrWnMEkE=.e3b393b8-ff0e-4c4d-9046-cec013d7b978@github.com> Message-ID: > This is a fix for the regression introduced by da43cb5e463069cf4dafb262664f0d3d7c2e0eac in fix 8224957. > > This regression was found while attempting to migrate an application from JDK 1.8 to JDK 17, by running internal benchmarks, and while investigating abnormal memory usage for about 4 times more from one of them. > > The regression appears in the JMH benchmark, which builds a huge tree which contains boxed integers from 0 to a few thousand. A tree has very complex structure and the same objects are reused a lot. 
> > When an `integer` is found it's collected as `Integer` and unboxed inside the collector callback. > > This benchmark was run with `ParallelGC` on different JVMs: `JDK 1.8.0_362`, `JDK 11.0.18`, `JDK 13.0.13`, `JDK 15.0.9`, `JDK 17.0.6`, `JDK 19.0.2` and `JDK 20`. This allows to see that something has changed between 13 and 15, and that the memory footprint for this code has increased from `3152` to `11828` bytes per operation. > > After that I've done a `git bisect` which allows me to locate the introducer. > > So the current fix reduces the memory footprint on the local root 425ef0685c584abec80454fbcccdcc6db6558f93 to `2960` bytes per operation. Kirill A. Korinsky has updated the pull request incrementally with one additional commit since the last revision: Fix the copyright year ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13453/files - new: https://git.openjdk.org/jdk/pull/13453/files/4cd22b60..18b997f9 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13453&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13453&range=02-03 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/13453.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13453/head:pull/13453 PR: https://git.openjdk.org/jdk/pull/13453 From cslucas at openjdk.org Fri Apr 14 20:54:45 2023 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Fri, 14 Apr 2023 20:54:45 GMT Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v8] In-Reply-To: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> Message-ID: > Can I please get reviews for this PR? > > The most common and frequent use of NonEscaping Phis merging object allocations is for debugging information. 
The two graphs below show numbers for Renaissance and DaCapo benchmarks - similar results are obtained for all other applications that I tested. > > With what frequency does each IR node type occurs as an allocation merge user? I.e., if the same node type uses a Phi N times the counter is incremented by N: > > ![image](https://user-images.githubusercontent.com/2249648/222280517-4dcf5871-2564-4207-b49e-22aee47fa49d.png) > > What are the most common users of allocation merges? I.e., if the same node type uses a Phi N times the counter is incremented by 1: > > ![image](https://user-images.githubusercontent.com/2249648/222280608-ca742a4e-1622-4e69-a778-e4db6805ea02.png) > > This PR adds support scalar replacing allocations participating in merges that are used as debug information OR as a base for field loads. I plan to create subsequent PRs to enable scalar replacement of merges used by other node types (CmpP is next on the list) subsequently. > > The approach I used for _rematerialization_ is pretty straightforward. It consists basically in: 1) Extend SafePointScalarObjectNode to represent multiple SR objects; 2) Add a new Class to support rematerialization of SR objects part of merges; 3) Patch HotSpot to be able to serialize and deserialize debug information related to allocation merges; 4) Patch C2 to generate unique types for SR objects participating in some allocation merges. > > The approach I used for _enabling the scalar replacement of some of the inputs of the allocation merge_ is also pretty straight forward: call `MemNode::split_through_phi` to, well, split AddP->Load* through the merge which will render the Phi useless. > > I tested this with JTREG tests tier 1-4 (Windows, Linux, and Mac) and didn't see regression. I also tested with several applications and didn't see any failure. 
I also ran tests with "-ea -esa -Xbatch -Xcomp -XX:+UnlockExperimentalVMOptions -XX:-TieredCompilation -server -XX:+IgnoreUnrecognizedVMOptions -XX:+UnlockDiagnosticVMOptions -XX:+StressLCM -XX:+StressGCM -XX:+StressCCP" and didn't observe any related failures. Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision: Address PR review 3. Some comments and be able to abort compilation. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12897/files - new: https://git.openjdk.org/jdk/pull/12897/files/8ed147f4..a10b0a4c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12897&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12897&range=06-07 Stats: 118 lines in 13 files changed: 60 ins; 11 del; 47 mod Patch: https://git.openjdk.org/jdk/pull/12897.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12897/head:pull/12897 PR: https://git.openjdk.org/jdk/pull/12897 From cslucas at openjdk.org Fri Apr 14 20:54:48 2023 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Fri, 14 Apr 2023 20:54:48 GMT Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v5] In-Reply-To: References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> Message-ID: On Fri, 31 Mar 2023 18:30:19 GMT, Xin Liu wrote: >> Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision: >> >> Address PR feeedback 1: make ObjectMergeValue subclass of ObjectValue & create new IR class to represent scalarized merges. > > src/hotspot/share/opto/escape.cpp line 457: > >> 455: found_sr_allocate = true; >> 456: } else { >> 457: ptn->set_scalar_replaceable(false); > > This member function is const. Do we really need to change ptn's property here? > > My reading is ophi is profitable as long as we spot any input object which can be eliminated. 
How about you just return at line 455? This is actually necessary here. By setting the input to NSR, I don't need to check later, when performing the reduction, that I can eliminate the node. I can just check that I can scalar replace the input. If I removed this line I'd hit a problem if the merge had an input that is SR but that ME can't eliminate. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1167263888 From cslucas at openjdk.org Fri Apr 14 20:56:39 2023 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Fri, 14 Apr 2023 20:56:39 GMT Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v4] In-Reply-To: References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> <6NDwZSpjSrokmglncPRp4tM7_Hiq4b26dXukhXODpKo=.8ba7efd0-bc44-4f1e-beb8-c1c68bc33515@github.com> Message-ID: On Fri, 24 Mar 2023 16:40:15 GMT, Vladimir Kozlov wrote: >> Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision: >> >> Add support for SR'ing some inputs of merges used for field loads > > Your new test failed in GHA testing with 32-bit VM: `Could not find VM flag "UseCompressedOops" in @IR rule 1 at int`. > You need to adjust next rule: `@IR(counts = { IRNode.ALLOC, "2" }, applyIf = { "UseCompressedOops", "false" })` @vnkozlov - I think I addressed all your comments. Please let me know if I missed something or if there is something that you think needs to be improved. @iwanowww - can I ask you to please take a look and let me know what you think?
------------- PR Comment: https://git.openjdk.org/jdk/pull/12897#issuecomment-1509247613 From kvn at openjdk.org Fri Apr 14 21:49:36 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 14 Apr 2023 21:49:36 GMT Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v8] In-Reply-To: References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> Message-ID: <1pTfg6PGb3zu3ndvKYt0FSFmkOA01w9qLFtQ_s1BQbE=.7de234bc-5484-4d98-a003-ff86836922b9@github.com> On Fri, 14 Apr 2023 20:54:45 GMT, Cesar Soares Lucas wrote: >> Can I please get reviews for this PR? >> >> The most common and frequent use of NonEscaping Phis merging object allocations is for debugging information. The two graphs below show numbers for Renaissance and DaCapo benchmarks - similar results are obtained for all other applications that I tested. >> >> With what frequency does each IR node type occurs as an allocation merge user? I.e., if the same node type uses a Phi N times the counter is incremented by N: >> >> ![image](https://user-images.githubusercontent.com/2249648/222280517-4dcf5871-2564-4207-b49e-22aee47fa49d.png) >> >> What are the most common users of allocation merges? I.e., if the same node type uses a Phi N times the counter is incremented by 1: >> >> ![image](https://user-images.githubusercontent.com/2249648/222280608-ca742a4e-1622-4e69-a778-e4db6805ea02.png) >> >> This PR adds support scalar replacing allocations participating in merges that are used as debug information OR as a base for field loads. I plan to create subsequent PRs to enable scalar replacement of merges used by other node types (CmpP is next on the list) subsequently. >> >> The approach I used for _rematerialization_ is pretty straightforward. 
It consists basically in: 1) Extend SafePointScalarObjectNode to represent multiple SR objects; 2) Add a new Class to support rematerialization of SR objects part of merges; 3) Patch HotSpot to be able to serialize and deserialize debug information related to allocation merges; 4) Patch C2 to generate unique types for SR objects participating in some allocation merges. >> >> The approach I used for _enabling the scalar replacement of some of the inputs of the allocation merge_ is also pretty straight forward: call `MemNode::split_through_phi` to, well, split AddP->Load* through the merge which will render the Phi useless. >> >> I tested this with JTREG tests tier 1-4 (Windows, Linux, and Mac) and didn't see regression. I also tested with several applications and didn't see any failure. I also ran tests with "-ea -esa -Xbatch -Xcomp -XX:+UnlockExperimentalVMOptions -XX:-TieredCompilation -server -XX:+IgnoreUnrecognizedVMOptions -XX:+UnlockDiagnosticVMOptions -XX:+StressLCM -XX:+StressGCM -XX:+StressCCP" and didn't observe any related failures. > > Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision: > > Address PR review 3. Some comments and be able to abort compilation. Nice. I will test it. ------------- PR Review: https://git.openjdk.org/jdk/pull/12897#pullrequestreview-1386210380 From kvn at openjdk.org Sat Apr 15 00:20:36 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sat, 15 Apr 2023 00:20:36 GMT Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v8] In-Reply-To: References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> Message-ID: On Fri, 14 Apr 2023 20:54:45 GMT, Cesar Soares Lucas wrote: >> Can I please get reviews for this PR? >> >> The most common and frequent use of NonEscaping Phis merging object allocations is for debugging information. 
The two graphs below show numbers for Renaissance and DaCapo benchmarks - similar results are obtained for all other applications that I tested. >> >> With what frequency does each IR node type occurs as an allocation merge user? I.e., if the same node type uses a Phi N times the counter is incremented by N: >> >> ![image](https://user-images.githubusercontent.com/2249648/222280517-4dcf5871-2564-4207-b49e-22aee47fa49d.png) >> >> What are the most common users of allocation merges? I.e., if the same node type uses a Phi N times the counter is incremented by 1: >> >> ![image](https://user-images.githubusercontent.com/2249648/222280608-ca742a4e-1622-4e69-a778-e4db6805ea02.png) >> >> This PR adds support scalar replacing allocations participating in merges that are used as debug information OR as a base for field loads. I plan to create subsequent PRs to enable scalar replacement of merges used by other node types (CmpP is next on the list) subsequently. >> >> The approach I used for _rematerialization_ is pretty straightforward. It consists basically in: 1) Extend SafePointScalarObjectNode to represent multiple SR objects; 2) Add a new Class to support rematerialization of SR objects part of merges; 3) Patch HotSpot to be able to serialize and deserialize debug information related to allocation merges; 4) Patch C2 to generate unique types for SR objects participating in some allocation merges. >> >> The approach I used for _enabling the scalar replacement of some of the inputs of the allocation merge_ is also pretty straight forward: call `MemNode::split_through_phi` to, well, split AddP->Load* through the merge which will render the Phi useless. >> >> I tested this with JTREG tests tier 1-4 (Windows, Linux, and Mac) and didn't see regression. I also tested with several applications and didn't see any failure. 
I also ran tests with "-ea -esa -Xbatch -Xcomp -XX:+UnlockExperimentalVMOptions -XX:-TieredCompilation -server -XX:+IgnoreUnrecognizedVMOptions -XX:+UnlockDiagnosticVMOptions -XX:+StressLCM -XX:+StressGCM -XX:+StressCCP" and didn't observe any related failures. > > Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision: > > Address PR review 3. Some comments and be able to abort compilation. The new test failed in tier1 on all platforms. Here is the list:

1) Method "int compiler.c2.irTests.scalarReplacement.AllocationMergesTests.TestTrapAfterMerge(boolean,int,int)" - [Failed IR rules: 1]:
2) Method "int compiler.c2.irTests.scalarReplacement.AllocationMergesTests.testCondLoadAfterMerge(boolean,boolean,int,int)" - [Failed IR rules: 1]:
3) Method "int compiler.c2.irTests.scalarReplacement.AllocationMergesTests.testLoadInCondAfterMerge(boolean,int,int)" - [Failed IR rules: 1]:
4) Method "int compiler.c2.irTests.scalarReplacement.AllocationMergesTests.testMergesAndMixedEscape(boolean,int,int)" - [Failed IR rules: 1]:
5) Method "compiler.c2.irTests.scalarReplacement.AllocationMergesTests$Point[] compiler.c2.irTests.scalarReplacement.AllocationMergesTests.testNestedObjectsArray(boolean,int,int)" - [Failed IR rules: 1]:
6) Method "int compiler.c2.irTests.scalarReplacement.AllocationMergesTests.testNestedObjectsNoEscapeObject(boolean,int,int)" - [Failed IR rules: 1]:
7) Method "compiler.c2.irTests.scalarReplacement.AllocationMergesTests$Point compiler.c2.irTests.scalarReplacement.AllocationMergesTests.testNestedObjectsObject(boolean,int,int)" - [Failed IR rules: 1]:

------------- PR Comment: https://git.openjdk.org/jdk/pull/12897#issuecomment-1509415967 From kvn at openjdk.org Sat Apr 15 00:26:33 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sat, 15 Apr 2023 00:26:33 GMT Subject: RFR: 8305995: Footprint regression from JDK-8224957 [v4] In-Reply-To: References:
<1fT7mBT0BgH-MlwqeeNtA3biaCc3tt6URxnYrWnMEkE=.e3b393b8-ff0e-4c4d-9046-cec013d7b978@github.com> Message-ID: On Fri, 14 Apr 2023 20:51:35 GMT, Kirill A. Korinsky wrote: >> This is a fix for the regression introduced by da43cb5e463069cf4dafb262664f0d3d7c2e0eac in fix 8224957. >> >> This regression was found while attempting to migrate an application from JDK 1.8 to JDK 17, by running internal benchmarks, and while investigating abnormal memory usage for about 4 times more from one of them. >> >> The regression appears in the JMH benchmark, which builds a huge tree which contains boxed integers from 0 to a few thousand. A tree has very complex structure and the same objects are reused a lot. >> >> When an `integer` is found it's collected as `Integer` and unboxed inside the collector callback. >> >> This benchmark was run with `ParallelGC` on different JVMs: `JDK 1.8.0_362`, `JDK 11.0.18`, `JDK 13.0.13`, `JDK 15.0.9`, `JDK 17.0.6`, `JDK 19.0.2` and `JDK 20`. This allows to see that something has changed between 13 and 15, and that the memory footprint for this code has increased from `3152` to `11828` bytes per operation. >> >> After that I've done a `git bisect` which allows me to locate the introducer. >> >> So the current fix reduces the memory footprint on the local root 425ef0685c584abec80454fbcccdcc6db6558f93 to `2960` bytes per operation. > > Kirill A. Korinsky has updated the pull request incrementally with one additional commit since the last revision: > > Fix the copyright year My testing passed. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13453#pullrequestreview-1386276077 From jkarthikeyan at openjdk.org Sat Apr 15 06:37:41 2023 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Sat, 15 Apr 2023 06:37:41 GMT Subject: RFR: 8303466: C2: failed: malformed control flow. 
Limit type made precise with MaxL/MinL [v4] In-Reply-To: References: Message-ID: On Tue, 11 Apr 2023 09:37:28 GMT, Emanuel Peter wrote: >> **Context** >> >> During `PhaseIdealLoop::do_unroll`, we hack the loop-limit, and subtract `stride` from it. We have to prevent underflow on that subtract. Currently, we do this with a `CMoveI`. The problem with this: `CMoveI` is not smart enough to generate a precise type. For example, there are many cases where the input types get better, and underflow is not possible anymore. But the `CMoveI` does not detect this, and still has type `min_jint..hi`. >> >> We have the same issue in `PhaseIdealLoop::adjust_limit`, where we use `CMoveL` to implement long max/min. The types are not as precise as they could and should be. >> >> **Problem** >> >> The imprecise type is used for the zero-trip-guard. It does not fold to false, even though the data-path into the post loop does constant fold to `TOP`. The graph breaks, and assert `malformed control flow` triggers. >> >> Details: In these cases, we have the super-unrolled main-loop (SuperWord'ed, then further unrolled) directly leading to a vectorized post-loop. The effect is that there is no `region/phi` merging main-exit and main-zero-trip-guard. So the types are already more narrow here. It may be possible that the values are such that we find out that we should never enter the vectorized post-loop. But if data finds out and control does not, we get a broken graph. >> Note: we have pre-loop. Then a main-loop and vectorized post loop. Then we merge the main-zero-trip-guard. And at the end we have the scalar post loop. >> >> I have already recently fixed a bug around this `CMoveI`. https://github.com/openjdk/jdk/commit/5a4945c0d95423d0ab07762c915e9cb4d3c66abb I would now like to have a more satisfactory fix, that properly propagates the types. 
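The underflow mentioned in the Context paragraph above can be illustrated in plain Java. This is only a sketch of the arithmetic, not C2 code: the class and method names are hypothetical, and the clamp uses `Math.max`/`Math.min` on `long` values as a stand-in for whatever node C2 uses internally.

```java
// Illustrative only: hypothetical names, not part of HotSpot.
class LimitClampSketch {
    // Naive int subtraction: wraps around when limit is close to Integer.MIN_VALUE.
    static int naiveNewLimit(int limit, int stride) {
        return limit - stride;
    }

    // Widen to long, subtract there (cannot overflow for int inputs),
    // then clamp the result back into the int range.
    static int clampedNewLimit(int limit, int stride) {
        long wide = (long) limit - (long) stride;
        long clamped = Math.min(Math.max(wide, (long) Integer.MIN_VALUE), (long) Integer.MAX_VALUE);
        return (int) clamped;
    }

    public static void main(String[] args) {
        int limit = Integer.MIN_VALUE + 1;
        int stride = 4;
        System.out.println(naiveNewLimit(limit, stride));   // wrapped: a large positive value
        System.out.println(clampedNewLimit(limit, stride)); // clamped to Integer.MIN_VALUE
    }
}
```

The point of the sketch is that the clamped result has an obviously bounded range, whereas the wrapped result can land anywhere in the int range.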
>> >> **Solution** >> >> `PhaseIdealLoop::adjust_limit` already converts the limit from int to long, and does all computations in long, including taking max/min with a `CMoveL`. I now use the so far unused `MaxL/MinL`. I implemented some missing `Value/Identity` components for it. Since `MaxL/MinL` is not implemented in the backend, I just expand it in macro-expansion to a `CMoveL`. At that point the loop-opts are over, and it is most likely ok that we do not make the types more precise after this. >> >> I take the same approach for `PhaseIdealLoop::do_unroll`: convert limits to long, do subtraction in long, take `MinL/MaxL` to clamp it to the int-range (prevent subtraction underflow). >> >> **Discussion** >> >> This solution seems much cleaner to me, and I hope that we will see fewer bugs because of imprecise types in the limit computation, which were often due to the `CMove` not being smart enough to analyze all inputs (it would have to recognize a multitude of patterns, for the Cmp inputs and the direct inputs to the CMove - we currently do not do that, but just take the union of the input types - this is very imprecise). >> >> There is a bit of an overhead here: We use longs even though we only want to have int values. But I think we should prefer a clean implementation here, with correct type computation. The performance impact is probably non-existent on 64-bit machines anyway. >> >> **Caveat** >> >> I found some cases with the same assert `malformed control flow` that are most likely skeleton/assertion predicate bugs [JDK-8288981](https://bugs.openjdk.org/browse/JDK-8288981). Some of those cases were new patterns, for example where we PreMainPost a main loop. >> >> I hope that this fix here at least reduces the frequency of failures significantly. >> >> **Testing** >> >> I added 2 regression tests. Our fuzzer seems to spit out examples regularly, so that gives us extra coverage. >> >> Tested up to `tier5` and stress testing. 
Performance testing **running...** >> >> **Future Work** >> >> We should implement `MaxL/MinL` in the backend. We should also use them during parsing. This would also allow to `SuperWord` the instruction, on the platforms that support it. >> >> Should we add such an assert during IGVN? I think after IGVN, we should never have a `MultiBranchNode` that does not have the required number of outputs, right? We could add it to `VerifyIterativeGVN`. > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > Review suggestion by Tobias Hartmann > > Co-authored-by: Tobias Hartmann I only had a handful of cases so I've attached them in [this gist](https://gist.github.com/jaskarth/878648eecaf74e168d5499c5961b4715). I found these a few weeks ago, but looking back I think I have misremembered the problem chain. Examples 3-5 may be a different bug as they deal with `ConvL2I->ConvI2L` chains instead of `ConvI2L->ConvL2I` chains as you are seeing, as the latter has an Identity() transform defined while it seems the former does not- I apologize for the noise if the issues are unrelated. Examples 1 and 2 could perhaps still be useful in diagnosing the issue, as they describe cases where ideal transforms that do exist aren't taken. I tried looking into that bug myself a while ago but didn't get far. JDK-8298951 is exciting, it'll make reasoning about middle-end optimizations a lot easier :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/13269#issuecomment-1509586279 From jkarthikeyan at openjdk.org Mon Apr 17 04:32:29 2023 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Mon, 17 Apr 2023 04:32:29 GMT Subject: RFR: 8051725: Questionable if-conversion involving SETNE [v2] In-Reply-To: References: Message-ID: > Hi, I've created optimizations for the x86 lowering of `Conv2B` nodes, when followed immediately by an xor of 1. 
This pattern is fairly common, and can arise from both [cmov idealization](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/movenode.cpp#L241) and [diamond-phi optimization](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L1571). The optimization here is using the `sete` instruction instead of always using `setne` and flipping the bit with xor afterwards. According to the Intel optimization guide (pages 3-26 and 3-27), this sequence is preferred over `cmp $0, %src` as it prevents the need to encode the constant in the assembly sequence. A similar rule exists in the PPC backend, here: https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/ppc/ppc.ad#L10462. I've attached some performance testing but I think the real world improvements will be less significant; the motivation is primarily to decrease the number of instructions that are generated, as that can help in cases where applications are I-Cache bound.
>
>                                   Baseline                       Patch            Improvement
> Benchmark                      Mode  Cnt   Score    Error  Units      Score    Error  Units
> Conv2BRules.testEquals0        avgt   10  47.566 ±  0.346  ns/op  /  37.904 ±  1.856  ns/op  + 22.6%
> Conv2BRules.testNotEquals0     avgt   10  37.167 ±  0.211  ns/op  /  37.352 ±  1.529  ns/op  (unchanged)
> Conv2BRules.testEquals1        avgt   10  35.059 ±  0.280  ns/op  /  34.847 ±  0.160  ns/op  (unchanged)
> Conv2BRules.testEqualsNull     avgt   10  56.768 ±  2.600  ns/op  /  46.916 ±  0.308  ns/op  + 19.0%
> Conv2BRules.testNotEqualsNull  avgt   10  47.447 ±  1.193  ns/op  /  46.974 ±  0.218  ns/op  (unchanged)
>
> This change also cleans up some code relating to `Assembler::set_byte_if_not_zero`, as that function duplicates behavior with `Assembler::setne`. The 32-bit only version of that method is never called as the only other usage is in the C1 LIR assembler, which is also guarded behind a 64-bit check so I opted to remove it entirely and replace usages with `Assembler::setne`. Reviews would be greatly appreciated! 
> > Testing: tier1-2 on linux x64, GHA Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: Re-work transform to happen in macro expansion ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13345/files - new: https://git.openjdk.org/jdk/pull/13345/files/1f7878d0..ee468b9e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13345&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13345&range=00-01 Stats: 88 lines in 9 files changed: 56 ins; 15 del; 17 mod Patch: https://git.openjdk.org/jdk/pull/13345.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13345/head:pull/13345 PR: https://git.openjdk.org/jdk/pull/13345 From jkarthikeyan at openjdk.org Mon Apr 17 04:37:40 2023 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Mon, 17 Apr 2023 04:37:40 GMT Subject: RFR: 8051725: Questionable if-conversion involving SETNE [v2] In-Reply-To: References: Message-ID: On Mon, 17 Apr 2023 04:32:29 GMT, Jasmine Karthikeyan wrote: >> Hi, I've created optimizations for the x86 lowering of `Conv2B` nodes, when followed immediately by an xor of 1. This pattern is fairly common, and can arise from both [cmov idealization](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/movenode.cpp#L241) and [diamond-phi optimization](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L1571). The optimization here is using the `sete` instruction instead of always using `setne` and flipping the bit with xor afterwards. According to the Intel optimization guide (pages 3-26 and 3-27), this sequence is preferred over `cmp $0, %src` as it prevents the need to encode the constant in the assembly sequence. A similar rule exists in the PPC backend, here: https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/ppc/ppc.ad#L10462. 
I've attached some performance testing but I think the real world improvements will be less significant; the motivation is primarily to decrease the number of instructions that are generated, as that can help in cases where applications are I-Cache bound.
>>
>>                                   Baseline                       Patch            Improvement
>> Benchmark                      Mode  Cnt   Score    Error  Units      Score    Error  Units
>> Conv2BRules.testEquals0        avgt   10  47.566 ±  0.346  ns/op  /  37.904 ±  1.856  ns/op  + 22.6%
>> Conv2BRules.testNotEquals0     avgt   10  37.167 ±  0.211  ns/op  /  37.352 ±  1.529  ns/op  (unchanged)
>> Conv2BRules.testEquals1        avgt   10  35.059 ±  0.280  ns/op  /  34.847 ±  0.160  ns/op  (unchanged)
>> Conv2BRules.testEqualsNull     avgt   10  56.768 ±  2.600  ns/op  /  46.916 ±  0.308  ns/op  + 19.0%
>> Conv2BRules.testNotEqualsNull  avgt   10  47.447 ±  1.193  ns/op  /  46.974 ±  0.218  ns/op  (unchanged)
>>
>> This change also cleans up some code relating to `Assembler::set_byte_if_not_zero`, as that function duplicates behavior with `Assembler::setne`. The 32-bit only version of that method is never called as the only other usage is in the C1 LIR assembler, which is also guarded behind a 64-bit check so I opted to remove it entirely and replace usages with `Assembler::setne`. Reviews would be greatly appreciated! >> >> Testing: tier1-2 on linux x64, GHA > > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Re-work transform to happen in macro expansion I've reworked the transformation to happen in macro expansion, and it seems the performance is actually *better* now!

                                  Baseline                       Patch            Improvement
Benchmark                      Mode  Cnt   Score    Error  Units      Score    Error  Units
Conv2BRules.testEquals0        avgt   10  47.566 ±  0.346  ns/op  /  34.130 ±  0.177  ns/op  + 28.2%
Conv2BRules.testNotEquals0     avgt   10  37.167 ±  0.211  ns/op  /  34.185 ±  0.258  ns/op  + 8.0%
Conv2BRules.testEquals1        avgt   10  35.059 ±  0.280  ns/op  /  34.847 ±  0.160  ns/op  (unchanged)
Conv2BRules.testEqualsNull     avgt   10  56.768 ±  2.600  ns/op  /  34.330 ±  0.625  ns/op  + 39.5%
Conv2BRules.testNotEqualsNull  avgt   10  47.447 ±  1.193  ns/op  /  34.142 ±  0.303  ns/op  + 28.0%

The comparison is now basically the same as doing it with `cmp`, which is nice. It seems the reason is that the assembly now zeroes the register, tests against zero, and then does the `setcc`, instead of comparison, `setcc`, then `movzbl`. So, it seems that doing the transform in macro expansion is indeed a better choice for x86, as well as reducing the overhead in the matcher. However, I'm not so sure if the benefit will be the same across other platforms as it seems like the different architectures implement `Conv2B` using different strategies. Do you have any thoughts on this approach @merykitty? ------------- PR Comment: https://git.openjdk.org/jdk/pull/13345#issuecomment-1510684128 From rrich at openjdk.org Mon Apr 17 07:25:36 2023 From: rrich at openjdk.org (Richard Reingruber) Date: Mon, 17 Apr 2023 07:25:36 GMT Subject: RFR: 8305668: PPC: Non-Top Interpreted frames should be independent of ABI_ELFv2 [v4] In-Reply-To: References: Message-ID: > This PR makes parent interpreted Java frames independent of `ABI_ELFv2`. > With the changes `test/jdk/jdk/internal/vm/Continuation/BasicExt.java#COMP_ALL` succeeds on PPC64 Big Endian Linux. > > Before: > > * `parent_ijava_frame_abi` was derived from `abi_minframe` which depends on `ABI_ELFv2` > * jit_abi is independent of `abi_minframe` > * `frame::metadata_words` is wrong for `parent_ijava_frame_abi` if `ABI_ELFv2` is not defined (big endian) > > After changes: > > * prefixed structs that depend on `ABI_ELFv2` with `native_` > * introduced `java_abi` which is independent of `ABI_ELFv2` > * `frame::metadata_words` is the size in words of `java_abi` > > This is still a little imprecise since `top_ijava_frame_abi` is larger than `java_abi` but the top frame is never frozen as it is always `vmIntrinsics::_Continuation_doYield` > > Testing: > > PPC64le: most JCK and JTREG tiers 1-4, also in Xcomp mode. 
> PPC64be Linux: hotspot tier1 tests Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision: Copyright years ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13372/files - new: https://git.openjdk.org/jdk/pull/13372/files/1e7ba3f6..207b500f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13372&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13372&range=02-03 Stats: 15 lines in 14 files changed: 0 ins; 0 del; 15 mod Patch: https://git.openjdk.org/jdk/pull/13372.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13372/head:pull/13372 PR: https://git.openjdk.org/jdk/pull/13372 From fjiang at openjdk.org Mon Apr 17 07:30:42 2023 From: fjiang at openjdk.org (Feilong Jiang) Date: Mon, 17 Apr 2023 07:30:42 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v12] In-Reply-To: References: Message-ID: On Thu, 6 Apr 2023 08:50:27 GMT, Dingli Zhang wrote: >> src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1731: >> >>> 1729: if (bt == T_FLOAT || bt == T_DOUBLE) { >>> 1730: switch (cond) { >>> 1731: case BoolTest::eq: vmfeq_vv(vd, src1, src2, vm); break; >> >> `BoolTest::ge` and `BoolTest::gt` are implemented with `BoolTest::le` and `BoolTest::lt` by exchanging the operands. When one of the operands is NaN, will the results of the comparisons be wrong? > > Thanks for the review! > > I think there may be no problem here. The floating-point compare instructions follow the semantics of the scalar floating-point compare instructions[1] in RVV. For all three instructions (FEQ.S, FLT.S, FLE.S), the result is 0 if either operand is NaN[2]. So when one of the operands is NaN, `BoolTest::ge`, `BoolTest::gt`, `BoolTest::le` and `BoolTest::lt` will all generate a 0 on the corresponding bit. > > Also a jtreg test case[3] proves that our current logic is fine. `GTFloat512VectorTests` covers the case where the input is NaN. 
The test passes and generates the following compilation log, which contains `vmaskcmp_rvv`:
>
> 1ac B20: # out( B49 B21 ) <- in( B48 B19 ) Freq: 4188.06
> 1ac vmaskcmp_rvv V0, V4, V5, #3
> 1b8
> 1b8 MEMBAR-store-store #@membar_storestore
> 1bc # checkcastPP of R11, #@checkCastPP
> 1bc vstoremask V1, V0
> 1c8 addi R7, R11, #16 # ptr, #@addP_reg_imm
> 1cc spill R11 -> [sp, #104] # spill size = 64
> 1ce storeV [R7], V1 # vector (rvv)
> 1d6 ld R19, [R23, #264] # ptr, #@loadP
> 1da ld R7, [R23, #280] # ptr, #@loadP
> 1de addi R28, R19, #16 # ptr, #@addP_reg_imm
> 1e2 bgeu R28, R7, B49 #@cmpP_branch P=0.000100 C=-1.000000
>
> [1] https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#1313-vector-floating-point-compare-instructions
> [2] https://github.com/riscv/riscv-isa-manual/releases/download/draft-20230131-c0b298a/riscv-spec.pdf
> [3] https://github.com/openjdk/jdk/tree/master/test/jdk/jdk/incubator/vector/Float512VectorTests.java
Hi, @DingliZhang, I have the same concern here. In the scalar version of the floating-point comparison, we have an `is_unordered` flag to return the right result if the operands contain NaN(s) [1]. Does vfcmp need this flag too?
1. https://github.com/openjdk/jdk/blob/7f56de8f78c0b54e5cf313f53213102a3495234f/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp#L1010-L1037
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1168288129 From eliu at openjdk.org Mon Apr 17 08:50:52 2023 From: eliu at openjdk.org (Eric Liu) Date: Mon, 17 Apr 2023 08:50:52 GMT Subject: RFR: 8304948: [vectorapi] C2 crashes when expanding VectorBox Message-ID: This patch fixes a C2 failure with SIGSEGV due to endless recursion. 
With the test case VectorBoxExpandTest.java in this patch, C2 generates an IR graph like the one below:

                ------------
               /            \
       Region |  VectorBox   |
            \ | /            |
             Phi             |
              |              |
              |              |
       Region |  VectorBox   |
            \ | /            |
             Phi             |
              |              |
              |------------/
              |

This Phi will be optimized by merge_through_phi [1], which transforms `Phi (VectorBox VectorBox)` into `VectorBox (Phi Phi)` to create opportunities for combining VectorBox with VectorUnbox. In this process, both the type pre-check [2] and the cloning of Phi nodes [3] take the cyclic case into account to avoid falling into an endless loop. After merge_through_phi, each input Phi of the new VectorBox has the same shape as the original root Phi before merging (only the VectorBox has been replaced). After several other optimizations, C2 expands VectorBox [4] on a graph like the one below:

                ------------
               /            \
       Region |  Proj        |
            \ | /            |
             Phi             |
              |              |
              |              |
       Region |  Proj        |
            \ | /            |
             Phi             |
              |              |
              |------------/
              |
              |   Phi
              |  /
           VectorBox

Here the cyclic case has to be taken into consideration as well.

[TEST] Full Jtreg passed without new failures. 
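As a rough illustration of why a recursive walk over such a graph needs to account for cycles, here is a minimal Java sketch. It is not the HotSpot code: `Node` is a hypothetical stand-in for an IR node, and a visited set is just one common way to make the recursion terminate; the actual patch may handle the cycle differently.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: hypothetical types, not HotSpot's Node/PhiNode classes.
class CycleAwareWalk {
    static final class Node {
        final String name;
        final List<Node> inputs = new ArrayList<>();
        Node(String name) { this.name = name; }
    }

    // Counts the distinct nodes reachable through inputs. The visited set
    // makes the walk terminate even when two Phis feed each other.
    static int countReachable(Node root) {
        Set<Node> visited = new HashSet<>(); // identity-based: Node does not override equals
        walk(root, visited);
        return visited.size();
    }

    private static void walk(Node n, Set<Node> visited) {
        if (!visited.add(n)) {
            return; // already seen (shared input or cycle): stop recursing
        }
        for (Node in : n.inputs) {
            walk(in, visited);
        }
    }

    public static void main(String[] args) {
        Node phi1 = new Node("Phi1");
        Node phi2 = new Node("Phi2");
        Node proj1 = new Node("Proj1");
        Node proj2 = new Node("Proj2");
        phi1.inputs.add(phi2);  // Phi1 takes Phi2 as input
        phi1.inputs.add(proj1);
        phi2.inputs.add(phi1);  // Phi2 takes Phi1 as input, closing the cycle
        phi2.inputs.add(proj2);
        System.out.println(countReachable(phi1));
    }
}
```

Without the `visited` check, `walk` would bounce between the two Phis until the stack overflows, which is the same class of failure as the endless recursion described above.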
[1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L2554 [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L2571 [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L2531 [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/vector.cpp#L311 ------------- Commit messages: - 8304948: [vectorapi] C2 crashes when expanding VectorBox Changes: https://git.openjdk.org/jdk/pull/13489/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13489&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8304948 Stats: 119 lines in 3 files changed: 113 ins; 0 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/13489.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13489/head:pull/13489 PR: https://git.openjdk.org/jdk/pull/13489 From duke at openjdk.org Mon Apr 17 08:56:42 2023 From: duke at openjdk.org (Afshin Zafari) Date: Mon, 17 Apr 2023 08:56:42 GMT Subject: RFR: 8305080: Remove the 'removal' warning for finalize() from test/hotspot/jtreg/compiler/jvmci/common/testcases that used in compiler/jvmci/compilerToVM/ tests [v2] In-Reply-To: References: Message-ID: > The finalize() methods are removed and replaced by Cleaner callbacks. > > Note: > `test/hotspot/jtreg/compiler/jvmci/compilerToVM/HasFinalizableSubclassTest.java` may be removed since there is no need to test if finalize() exists in the subclasses or not.. 
Afshin Zafari has updated the pull request incrementally with one additional commit since the last revision: Remove the 'removal' warning for finalize() from test/hotspot/jtreg/compiler/jvmci/common/testcases that used in compiler/jvmci/compilerToVM/ tests ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13419/files - new: https://git.openjdk.org/jdk/pull/13419/files/386af9f8..fc614316 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13419&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13419&range=00-01 Stats: 46 lines in 10 files changed: 28 ins; 3 del; 15 mod Patch: https://git.openjdk.org/jdk/pull/13419.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13419/head:pull/13419 PR: https://git.openjdk.org/jdk/pull/13419 From dnsimon at openjdk.org Mon Apr 17 09:07:01 2023 From: dnsimon at openjdk.org (Doug Simon) Date: Mon, 17 Apr 2023 09:07:01 GMT Subject: RFR: 8305080: Remove the 'removal' warning for finalize() from test/hotspot/jtreg/compiler/jvmci/common/testcases that used in compiler/jvmci/compilerToVM/ tests [v2] In-Reply-To: References: Message-ID: On Mon, 17 Apr 2023 08:56:42 GMT, Afshin Zafari wrote: >> The finalize() methods are removed and replaced by Cleaner callbacks. >> >> Note: >> `test/hotspot/jtreg/compiler/jvmci/compilerToVM/HasFinalizableSubclassTest.java` may be removed since there is no need to test if finalize() exists in the subclasses or not.. > > Afshin Zafari has updated the pull request incrementally with one additional commit since the last revision: > > Remove the 'removal' warning for finalize() from test/hotspot/jtreg/compiler/jvmci/common/testcases that used in compiler/jvmci/compilerToVM/ tests Marked as reviewed by dnsimon (Committer). Looks good. Only final comment is that I would change the title of this issue from "Remove the..." to "Suppress the...". 
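For readers unfamiliar with the replacement mentioned in the PR description (`finalize()` methods replaced by Cleaner callbacks), a minimal sketch of the `java.lang.ref.Cleaner` pattern follows. It is not taken from the PR's test cases; the class names are hypothetical and the "resource" is only a flag.

```java
import java.lang.ref.Cleaner;

// Illustrative only: hypothetical class, not one of the JVMCI test cases.
class CleanerSketch implements AutoCloseable {
    private static final Cleaner CLEANER = Cleaner.create();

    // The cleanup state must not hold a reference to the owning object,
    // otherwise the owner could never become phantom reachable.
    private static final class State implements Runnable {
        volatile boolean released;

        @Override
        public void run() {
            released = true; // stand-in for releasing a real resource
        }
    }

    private final State state = new State();
    private final Cleaner.Cleanable cleanable = CLEANER.register(this, state);

    // Explicit cleanup; the action runs at most once, even if the
    // Cleaner would also trigger it later.
    @Override
    public void close() {
        cleanable.clean();
    }

    boolean isReleased() {
        return state.released;
    }

    public static void main(String[] args) {
        CleanerSketch resource = new CleanerSketch();
        resource.close();
        System.out.println(resource.isReleased()); // prints true
    }
}
```

Unlike `finalize()`, the cleanup action here is a separate object registered with the Cleaner, so no method on the owner is overridden and no deprecation-for-removal warning is involved.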
------------- PR Review: https://git.openjdk.org/jdk/pull/13419#pullrequestreview-1387596214 PR Comment: https://git.openjdk.org/jdk/pull/13419#issuecomment-1510971144 From jsjolen at openjdk.org Mon Apr 17 09:34:16 2023 From: jsjolen at openjdk.org (Johan Sjölen) Date: Mon, 17 Apr 2023 09:34:16 GMT Subject: RFR: JDK-8306077: Replace NEW_ARENA_ARRAY with NEW_RESOURCE_ARRAY when applicable in opto Message-ID: Hi, this is a small cleanup that switches out NEW_ARENA_ARRAY with NEW_RESOURCE_ARRAY when the allocation is done on a ResourceArea. Please consider, thank you. ------------- Commit messages: - Style - Replace NEW_ARENA_ARRAY with NEW_RESOURCE_ARRAY when applicable Changes: https://git.openjdk.org/jdk/pull/13490/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13490&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8306077 Stats: 11 lines in 1 file changed: 0 ins; 2 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/13490.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13490/head:pull/13490 PR: https://git.openjdk.org/jdk/pull/13490 From epeter at openjdk.org Mon Apr 17 10:35:42 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 17 Apr 2023 10:35:42 GMT Subject: RFR: 8305740: C2: add print statements to assert: Can't determine return type. In-Reply-To: <7w453XpFls8FMawBknRZOjVtfySrYr5wmeDJl5jUMZM=.8d7eeeaf-2c00-483d-9f60-1f65d7f02c21@github.com> References: <7w453XpFls8FMawBknRZOjVtfySrYr5wmeDJl5jUMZM=.8d7eeeaf-2c00-483d-9f60-1f65d7f02c21@github.com> Message-ID: On Mon, 10 Apr 2023 18:45:19 GMT, Vladimir Kozlov wrote: >> I added this assert before the bailout, because it probably hides bugs. Now we have failure reports with [JDK-8305185](https://bugs.openjdk.org/browse/JDK-8305185). >> >> It is difficult to reproduce, so I'd like to add some print statements to get at least a bit of info. >> >> Passed tests up to tier5 and stress testing. > > Marked as reviewed by kvn (Reviewer). 
Thanks for the reviews @vnkozlov @TobiHartmann ! ------------- PR Comment: https://git.openjdk.org/jdk/pull/13385#issuecomment-1511095166 From epeter at openjdk.org Mon Apr 17 10:35:42 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 17 Apr 2023 10:35:42 GMT Subject: Integrated: 8305740: C2: add print statements to assert: Can't determine return type. In-Reply-To: References: Message-ID: On Fri, 7 Apr 2023 09:08:55 GMT, Emanuel Peter wrote: > I added this assert before the bailout, because it probably hides bugs. Now we have failure reports with [JDK-8305185](https://bugs.openjdk.org/browse/JDK-8305185). > > It is difficult to reproduce, so I'd like to add some print statements to get at least a bit of info. > > Passed tests up to tier5 and stress testing. This pull request has now been integrated. Changeset: c0b4957f Author: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/c0b4957fcce530290fe3b1e730b593b6458285aa Stats: 9 lines in 1 file changed: 9 ins; 0 del; 0 mod 8305740: C2: add print statements to assert: Can't determine return type. Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/13385 From rrich at openjdk.org Mon Apr 17 10:36:38 2023 From: rrich at openjdk.org (Richard Reingruber) Date: Mon, 17 Apr 2023 10:36:38 GMT Subject: RFR: 8305668: PPC: Non-Top Interpreted frames should be independent of ABI_ELFv2 [v4] In-Reply-To: References: Message-ID: On Mon, 17 Apr 2023 07:25:36 GMT, Richard Reingruber wrote: >> This PR makes parent interpreted Java frames independent of `ABI_ELFv2`. >> With the changes `test/jdk/jdk/internal/vm/Continuation/BasicExt.java#COMP_ALL` succeeds on PPC64 Big Endian Linux. 
>> >> Before: >> >> * `parent_ijava_frame_abi` was derived from `abi_minframe` which depends on `ABI_ELFv2` >> * jit_abi is independent of `abi_minframe` >> * `frame::metadata_words` is wrong for `parent_ijava_frame_abi` if `ABI_ELFv2` is not defined (big endian) >> >> After changes: >> >> * prefixed structs that depend on `ABI_ELFv2` with `native_` >> * introduced `java_abi` which is independent of `ABI_ELFv2` >> * `frame::metadata_words` is the size in words of `java_abi` >> >> This is still a little imprecise since `top_ijava_frame_abi` is larger than `java_abi` but the top frame is never frozen as it is always `vmIntrinsics::_Continuation_doYield` >> >> Testing: >> >> PPC64le: most JCK and JTREG tiers 1-4, also in Xcomp mode. >> PPC64be Linux: hotspot tier1 tests > > Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision: > > Copyright years Another round of testing succeeded (tier 1 - 4). On AIX `serviceability/jvmti/vthread/VThreadNotifyFramePopTest/VThreadNotifyFramePopTest.java` fails because the PollsetPoller is not fully implemented there. @backwaterred is working on it. Thanks for the reviews. I'm planning to integrate the pr tomorrow. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13372#issuecomment-1511098385 From gli at openjdk.org Mon Apr 17 11:23:34 2023 From: gli at openjdk.org (Guoxiong Li) Date: Mon, 17 Apr 2023 11:23:34 GMT Subject: RFR: 8305690: [X86] Do not emit two REX prefixes in Assembler::prefix In-Reply-To: References: Message-ID: <3alDjtzx-7Y5JS71Y8qoBkKehT8SiUNUNx5y_zSPVXg=.803e4d92-f3e3-490a-b2a0-2169bd71df48@github.com> On Thu, 6 Apr 2023 05:36:06 GMT, Guoxiong Li wrote: > Hi all, > > This patch prevents `Assembler::prefix` from emitting two `REX prefixes`. The current code in mainline works well because the corresponding code path is not triggered. > > Thanks for the review. > > Best Regards, > -- Guoxiong Ping for review. Thanks. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/13369#issuecomment-1511159217 From roland at openjdk.org Mon Apr 17 11:35:17 2023 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 17 Apr 2023 11:35:17 GMT Subject: RFR: 8305781: compiler/c2/irTests/TestVectorizationMultiInvar.java failed with "IRViolationException: There were one or multiple IR rule failures." Message-ID: The test case only works if unaligned accesses are allowed (that is AlignVector false). I added a runtime check similar to what I did with TestVectorizationMismatchedAccess. ------------- Commit messages: - test fix Changes: https://git.openjdk.org/jdk/pull/13492/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13492&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8305781 Stats: 9 lines in 1 file changed: 7 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/13492.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13492/head:pull/13492 PR: https://git.openjdk.org/jdk/pull/13492 From thartmann at openjdk.org Mon Apr 17 11:43:30 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 17 Apr 2023 11:43:30 GMT Subject: RFR: 8305781: compiler/c2/irTests/TestVectorizationMultiInvar.java failed with "IRViolationException: There were one or multiple IR rule failures." In-Reply-To: References: Message-ID: On Mon, 17 Apr 2023 11:26:47 GMT, Roland Westrelin wrote: > The test case only works if unaligned accesses are allowed (that is > AlignVector false). I added a runtime check similar to what I did with > TestVectorizationMismatchedAccess. Looks good to me. ------------- Marked as reviewed by thartmann (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/13492#pullrequestreview-1387847533 From eosterlund at openjdk.org Mon Apr 17 12:16:43 2023 From: eosterlund at openjdk.org (Erik Österlund) Date: Mon, 17 Apr 2023 12:16:43 GMT Subject: Integrated: 8305543: Ensure GC barriers for arraycopy on AArch64 use caller saved neon temp registers In-Reply-To: <4RyYObqmSswKoIdIw59VFyr6m47nN0rsiPu4ebsrlC0=.5ab0d5e2-aa03-4bf0-9b77-237a74d16670@github.com> References: <4RyYObqmSswKoIdIw59VFyr6m47nN0rsiPu4ebsrlC0=.5ab0d5e2-aa03-4bf0-9b77-237a74d16670@github.com> Message-ID: On Tue, 4 Apr 2023 13:02:05 GMT, Erik Österlund wrote: > The arraycopy stubs on AArch64 now allow the GC to vectorize arraycopy barriers. That's great! But the gct3 register we hand to the GC is v8 today, which is callee saved (well at least the lower 64 bits). Therefore, if the GC clobbers this temp register, it can have unexpected side effects on the caller float/double registers. We should use a caller saved register instead. > This is only used by generational ZGC, so isn't a mainline bug yet. We should fix it before it becomes one. This pull request has now been integrated. Changeset: 2240c7ec Author: Erik Österlund URL: https://git.openjdk.org/jdk/commit/2240c7ec2fd87a4fd5670f88b9e7dcb3758294c6 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod 8305543: Ensure GC barriers for arraycopy on AArch64 use caller saved neon temp registers Reviewed-by: rcastanedalo, aph ------------- PR: https://git.openjdk.org/jdk/pull/13325 From eosterlund at openjdk.org Mon Apr 17 12:17:44 2023 From: eosterlund at openjdk.org (Erik Österlund) Date: Mon, 17 Apr 2023 12:17:44 GMT Subject: RFR: 8305351: C2 setScopedValueCache intrinsic doesn't use access API In-Reply-To: References: Message-ID: <6mhkx0K3fKWsW_jYNIzmWq31nKAe4TBAJtiClOJdgdU=.92f97ad1-9657-444d-9514-450ef2b167c0@github.com> On Thu, 13 Apr 2023 09:49:34 GMT, Martin Doerr wrote: > Thanks for fixing it! 
I guess we should also backport the fix for Shenandoah. Thanks for the review! Yes I think Shenandoah would benefit from having this backported. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13324#issuecomment-1511226536 From eosterlund at openjdk.org Mon Apr 17 12:17:46 2023 From: eosterlund at openjdk.org (Erik Österlund) Date: Mon, 17 Apr 2023 12:17:46 GMT Subject: Integrated: 8305351: C2 setScopedValueCache intrinsic doesn't use access API In-Reply-To: References: Message-ID: On Tue, 4 Apr 2023 12:40:14 GMT, Erik Österlund wrote: > The setScopedValueCache intrinsic for C2 doesn't use the access API. Instead, we store into an OopHandle with a raw store. That doesn't necessarily play well with all GCs, for example Shenandoah and generational ZGC. We should use the access API to ensure the right barriers are emitted. This pull request has now been integrated. Changeset: 02347d0c Author: Erik Österlund URL: https://git.openjdk.org/jdk/commit/02347d0cec77212d38aad8d06b6ac0c316be00d7 Stats: 2 lines in 1 file changed: 0 ins; 1 del; 1 mod 8305351: C2 setScopedValueCache intrinsic doesn't use access API Reviewed-by: kvn, rcastanedalo, aph, mdoerr ------------- PR: https://git.openjdk.org/jdk/pull/13324 From thartmann at openjdk.org Mon Apr 17 12:22:39 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 17 Apr 2023 12:22:39 GMT Subject: RFR: 8305995: Footprint regression from JDK-8224957 [v4] In-Reply-To: References: <1fT7mBT0BgH-MlwqeeNtA3biaCc3tt6URxnYrWnMEkE=.e3b393b8-ff0e-4c4d-9046-cec013d7b978@github.com> Message-ID: On Fri, 14 Apr 2023 20:51:35 GMT, Kirill A. Korinsky wrote: >> This is a fix for the regression introduced by da43cb5e463069cf4dafb262664f0d3d7c2e0eac in fix 8224957. >> >> This regression was found while attempting to migrate an application from JDK 1.8 to JDK 17, by running internal benchmarks, and while investigating abnormal memory usage for about 4 times more from one of them. 
>>
>> The regression appears in the JMH benchmark, which builds a huge tree which contains boxed integers from 0 to a few thousand. A tree has very complex structure and the same objects are reused a lot.
>>
>> When an `integer` is found it's collected as `Integer` and unboxed inside the collector callback.
>>
>> This benchmark was run with `ParallelGC` on different JVMs: `JDK 1.8.0_362`, `JDK 11.0.18`, `JDK 13.0.13`, `JDK 15.0.9`, `JDK 17.0.6`, `JDK 19.0.2` and `JDK 20`. This allows to see that something has changed between 13 and 15, and that the memory footprint for this code has increased from `3152` to `11828` bytes per operation.
>>
>> After that I've done a `git bisect` which allows me to locate the introducer.
>>
>> So the current fix reduces the memory footprint on the local root 425ef0685c584abec80454fbcccdcc6db6558f93 to `2960` bytes per operation.
>
> Kirill A. Korinsky has updated the pull request incrementally with one additional commit since the last revision:
>
> Fix the copyright year

Looks good to me too. Thanks for fixing this!

-------------

Marked as reviewed by thartmann (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/13453#pullrequestreview-1387910597

From thartmann at openjdk.org Mon Apr 17 12:24:38 2023
From: thartmann at openjdk.org (Tobias Hartmann)
Date: Mon, 17 Apr 2023 12:24:38 GMT
Subject: RFR: JDK-8306077: Replace NEW_ARENA_ARRAY with NEW_RESOURCE_ARRAY when applicable in opto
In-Reply-To:
References:
Message-ID:

On Mon, 17 Apr 2023 09:27:51 GMT, Johan Sjölen wrote:

> Hi, this is a small cleanup that switches out NEW_ARENA_ARRAY with NEW_RESOURCE_ARRAY when the allocation is done on a ResourceArea.
>
> Please consider, thank you.

Looks good and trivial.

-------------

Marked as reviewed by thartmann (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/13490#pullrequestreview-1387912834

From duke at openjdk.org Mon Apr 17 12:25:56 2023
From: duke at openjdk.org (Kirill A.
Korinsky) Date: Mon, 17 Apr 2023 12:25:56 GMT Subject: Integrated: 8305995: Footprint regression from JDK-8224957 In-Reply-To: <1fT7mBT0BgH-MlwqeeNtA3biaCc3tt6URxnYrWnMEkE=.e3b393b8-ff0e-4c4d-9046-cec013d7b978@github.com> References: <1fT7mBT0BgH-MlwqeeNtA3biaCc3tt6URxnYrWnMEkE=.e3b393b8-ff0e-4c4d-9046-cec013d7b978@github.com> Message-ID: On Thu, 13 Apr 2023 01:02:08 GMT, Kirill A. Korinsky wrote: > This is a fix for the regression introduced by da43cb5e463069cf4dafb262664f0d3d7c2e0eac in fix 8224957. > > This regression was found while attempting to migrate an application from JDK 1.8 to JDK 17, by running internal benchmarks, and while investigating abnormal memory usage for about 4 times more from one of them. > > The regression appears in the JMH benchmark, which builds a huge tree which contains boxed integers from 0 to a few thousand. A tree has very complex structure and the same objects are reused a lot. > > When an `integer` is found it's collected as `Integer` and unboxed inside the collector callback. > > This benchmark was run with `ParallelGC` on different JVMs: `JDK 1.8.0_362`, `JDK 11.0.18`, `JDK 13.0.13`, `JDK 15.0.9`, `JDK 17.0.6`, `JDK 19.0.2` and `JDK 20`. This allows to see that something has changed between 13 and 15, and that the memory footprint for this code has increased from `3152` to `11828` bytes per operation. > > After that I've done a `git bisect` which allows me to locate the introducer. > > So the current fix reduces the memory footprint on the local root 425ef0685c584abec80454fbcccdcc6db6558f93 to `2960` bytes per operation. This pull request has now been integrated. Changeset: 75515298 Author: Kirill A. 
Korinsky Committer: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/7551529854b325488b58481e11103b08a211aff4 Stats: 1237 lines in 2 files changed: 1236 ins; 0 del; 1 mod 8305995: Footprint regression from JDK-8224957 Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/13453 From duke at openjdk.org Mon Apr 17 12:29:47 2023 From: duke at openjdk.org (Kirill A. Korinsky) Date: Mon, 17 Apr 2023 12:29:47 GMT Subject: RFR: 8305995: Footprint regression from JDK-8224957 In-Reply-To: <8gRD-UO-2z5X8b3nRRJtkWyRaRc4Ld9DAWMIdPs-Vsg=.738455bd-0d36-4481-8545-413d9de783ed@github.com> References: <1fT7mBT0BgH-MlwqeeNtA3biaCc3tt6URxnYrWnMEkE=.e3b393b8-ff0e-4c4d-9046-cec013d7b978@github.com> <8gRD-UO-2z5X8b3nRRJtkWyRaRc4Ld9DAWMIdPs-Vsg=.738455bd-0d36-4481-8545-413d9de783ed@github.com> Message-ID: On Fri, 14 Apr 2023 09:23:55 GMT, Tobias Hartmann wrote: >> This is a fix for the regression introduced by da43cb5e463069cf4dafb262664f0d3d7c2e0eac in fix 8224957. >> >> This regression was found while attempting to migrate an application from JDK 1.8 to JDK 17, by running internal benchmarks, and while investigating abnormal memory usage for about 4 times more from one of them. >> >> The regression appears in the JMH benchmark, which builds a huge tree which contains boxed integers from 0 to a few thousand. A tree has very complex structure and the same objects are reused a lot. >> >> When an `integer` is found it's collected as `Integer` and unboxed inside the collector callback. >> >> This benchmark was run with `ParallelGC` on different JVMs: `JDK 1.8.0_362`, `JDK 11.0.18`, `JDK 13.0.13`, `JDK 15.0.9`, `JDK 17.0.6`, `JDK 19.0.2` and `JDK 20`. This allows to see that something has changed between 13 and 15, and that the memory footprint for this code has increased from `3152` to `11828` bytes per operation. >> >> After that I've done a `git bisect` which allows me to locate the introducer. 
>> >> So the current fix reduces the memory footprint on the local root 425ef0685c584abec80454fbcccdcc6db6558f93 to `2960` bytes per operation. > > Thanks for reporting this, Kirill. I filed [JDK-8305995](https://bugs.openjdk.org/browse/JDK-8305995) for tracking. It would be great if you could add your benchmark (to `test/micro/org/openjdk/bench/`). I'll have a look at your proposed fix next week. @TobiHartmann / @vnkozlov may it also be backported to jdk17? ------------- PR Comment: https://git.openjdk.org/jdk/pull/13453#issuecomment-1511245144 From thartmann at openjdk.org Mon Apr 17 12:36:50 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 17 Apr 2023 12:36:50 GMT Subject: RFR: 8305995: Footprint regression from JDK-8224957 [v4] In-Reply-To: References: <1fT7mBT0BgH-MlwqeeNtA3biaCc3tt6URxnYrWnMEkE=.e3b393b8-ff0e-4c4d-9046-cec013d7b978@github.com> Message-ID: On Fri, 14 Apr 2023 20:51:35 GMT, Kirill A. Korinsky wrote: >> This is a fix for the regression introduced by da43cb5e463069cf4dafb262664f0d3d7c2e0eac in fix 8224957. >> >> This regression was found while attempting to migrate an application from JDK 1.8 to JDK 17, by running internal benchmarks, and while investigating abnormal memory usage for about 4 times more from one of them. >> >> The regression appears in the JMH benchmark, which builds a huge tree which contains boxed integers from 0 to a few thousand. A tree has very complex structure and the same objects are reused a lot. >> >> When an `integer` is found it's collected as `Integer` and unboxed inside the collector callback. >> >> This benchmark was run with `ParallelGC` on different JVMs: `JDK 1.8.0_362`, `JDK 11.0.18`, `JDK 13.0.13`, `JDK 15.0.9`, `JDK 17.0.6`, `JDK 19.0.2` and `JDK 20`. This allows to see that something has changed between 13 and 15, and that the memory footprint for this code has increased from `3152` to `11828` bytes per operation. 
>> >> After that I've done a `git bisect` which allows me to locate the introducer. >> >> So the current fix reduces the memory footprint on the local root 425ef0685c584abec80454fbcccdcc6db6558f93 to `2960` bytes per operation. > > Kirill A. Korinsky has updated the pull request incrementally with one additional commit since the last revision: > > Fix the copyright year We (Oracle) will backport it to Oracle JDK 17u after some bake time in mainline. It's then up to the OpenJDK community to backport it to OpenJDK 17u as well but I think it's likely that they will do so. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13453#issuecomment-1511255355 From duke at openjdk.org Mon Apr 17 12:53:48 2023 From: duke at openjdk.org (Kirill A. Korinsky) Date: Mon, 17 Apr 2023 12:53:48 GMT Subject: RFR: 8305995: Footprint regression from JDK-8224957 [v4] In-Reply-To: References: <1fT7mBT0BgH-MlwqeeNtA3biaCc3tt6URxnYrWnMEkE=.e3b393b8-ff0e-4c4d-9046-cec013d7b978@github.com> Message-ID: On Mon, 17 Apr 2023 12:34:10 GMT, Tobias Hartmann wrote: >> Kirill A. Korinsky has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix the copyright year > > We (Oracle) will backport it to Oracle JDK 17u after some bake time in mainline. It's then up to the OpenJDK community to backport it to OpenJDK 17u as well but I think it's likely that they will do so. @TobiHartmann thanks, it's clear. But can I expect that this fix will be the next release of JDK17? ------------- PR Comment: https://git.openjdk.org/jdk/pull/13453#issuecomment-1511277849 From thartmann at openjdk.org Mon Apr 17 13:10:56 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 17 Apr 2023 13:10:56 GMT Subject: RFR: 8305995: Footprint regression from JDK-8224957 [v4] In-Reply-To: References: <1fT7mBT0BgH-MlwqeeNtA3biaCc3tt6URxnYrWnMEkE=.e3b393b8-ff0e-4c4d-9046-cec013d7b978@github.com> Message-ID: On Fri, 14 Apr 2023 20:51:35 GMT, Kirill A. 
Korinsky wrote: >> This is a fix for the regression introduced by da43cb5e463069cf4dafb262664f0d3d7c2e0eac in fix 8224957. >> >> This regression was found while attempting to migrate an application from JDK 1.8 to JDK 17, by running internal benchmarks, and while investigating abnormal memory usage for about 4 times more from one of them. >> >> The regression appears in the JMH benchmark, which builds a huge tree which contains boxed integers from 0 to a few thousand. A tree has very complex structure and the same objects are reused a lot. >> >> When an `integer` is found it's collected as `Integer` and unboxed inside the collector callback. >> >> This benchmark was run with `ParallelGC` on different JVMs: `JDK 1.8.0_362`, `JDK 11.0.18`, `JDK 13.0.13`, `JDK 15.0.9`, `JDK 17.0.6`, `JDK 19.0.2` and `JDK 20`. This allows to see that something has changed between 13 and 15, and that the memory footprint for this code has increased from `3152` to `11828` bytes per operation. >> >> After that I've done a `git bisect` which allows me to locate the introducer. >> >> So the current fix reduces the memory footprint on the local root 425ef0685c584abec80454fbcccdcc6db6558f93 to `2960` bytes per operation. > > Kirill A. Korinsky has updated the pull request incrementally with one additional commit since the last revision: > > Fix the copyright year This will most likely go into JDK 17.0.8 but it depends on if and when the fix is backported. 
-------------

PR Comment: https://git.openjdk.org/jdk/pull/13453#issuecomment-1511308911

From fjiang at openjdk.org Mon Apr 17 13:18:42 2023
From: fjiang at openjdk.org (Feilong Jiang)
Date: Mon, 17 Apr 2023 13:18:42 GMT
Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v18]
In-Reply-To:
References:
Message-ID: <3Hjv_E07Y6V-yGy0C2SGVBqCprxiXil1IhfkR_fM75M=.0a780b41-98cc-41c7-be08-1981075b8c39@github.com>

On Fri, 14 Apr 2023 15:34:29 GMT, Dingli Zhang wrote:

>> Hi,
>>
>> We have added support for vector add mask instructions, please take a look and have some reviews. Thanks a lot!
>> This patch will add support of vector add/sub/mul/div mask version. It was implemented by referring to RVV v1.0 [1].
>>
>> ## Load/Store/Cmp Mask
>> `VectorLoadMask, VectorMaskCmp, VectorStoreMask` will implement the mask datapath. We can see where the data is passed in the compilation log with `jdk/incubator/vector/Byte128VectorTests.java`:
>>
>> 218 loadV V1, [R7] # vector (rvv)
>> 220 vloadmask V0, V1
>> ...
>> 23c vmaskcmp_rvv_masked V0, V4, V5, V0, V1, #0
>> 24c vstoremask V1, V0
>> 258 storeV [R7], V1 # vector (rvv)
>>
>>
>> The corresponding generated jit assembly:
>>
>> # loadV
>> 0x000000400c8ef958: vsetvli t0,zero,e8,m1,tu,mu
>> 0x000000400c8ef95c: vle8.v v1,(t2)
>>
>> # vloadmask
>> 0x000000400c8ef960: vsetvli t0,zero,e8,m1,tu,
>> 0x000000400c8ef964: vmsne.vx v0,v1,zero
>>
>> # vmaskcmp_rvv_masked
>> 0x000000400c8ef97c: vsetvli t0,zero,e8,m1,tu,mu
>> 0x000000400c8ef980: vmclr.m v1
>> 0x000000400c8ef984: vmseq.vv v1,v4,v5,v0.t
>> 0x000000400c8ef988: vmv1r.v v0,v1
>>
>> # vstoremask
>> 0x000000400c8ef98c: vsetvli t0,zero,e8,m1,tu,mu
>> 0x000000400c8ef990: vmv.v.x v1,zero
>> 0x000000400c8ef994: vmerge.vim v1,v1,1,v0
>>
>>
>> ## Masked vector arithmetic instructions (e.g.
vadd) >> AddMaskTestMerge case: >> >> import jdk.incubator.vector.IntVector; >> import jdk.incubator.vector.VectorMask; >> import jdk.incubator.vector.VectorOperators; >> import jdk.incubator.vector.VectorSpecies; >> >> public class AddMaskTestMerge { >> >> static final VectorSpecies SPECIES = IntVector.SPECIES_128; >> static final int SIZE = 1024; >> static int[] a = new int[SIZE]; >> static int[] b = new int[SIZE]; >> static int[] r = new int[SIZE]; >> static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false}; >> static { >> for (int i = 0; i < SIZE; i++) { >> a[i] = i; >> b[i] = i; >> } >> } >> >> static void workload(int idx) { >> VectorMask vmask = VectorMask.fromArray(SPECIES, c, 0); >> IntVector av = IntVector.fromArray(SPECIES, a, idx); >> IntVector bv = IntVector.fromArray(SPECIES, b, idx); >> av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx); >> } >> >> public static void main(String[] args) { >> for (int i = 0; i < 30_0000; i++) { >> for (int j = 0; j < SIZE; j += SPECIES.length()) { >> workload(j); >> } >> } >> } >> } >> >> >> This test case is reduced from existing jtreg vector tests Int128VectorTests.java[2]. This test case corresponds to the add instruction of the vector mask version and other instructions are similar. >> >> Before this patch, the compilation log will not print RVV-related instructions. Now the compilation log is as follows: >> >> >> 0ae B10: # out( B25 B11 ) <- in( B9 ) Freq: 0.999991 >> 0ae loadV V1, [R31] # vector (rvv) >> 0b6 vloadmask V0, V2 >> 0be vadd.vv V3, V1, V0 #@vaddI_masked >> 0c6 lwu R28, [R7, #124] # loadN, compressed ptr, #@loadN ! 
Field: AddMaskTestMerge.r
>> 0ca decode_heap_oop R28, R28 #@decodeHeapOop
>> 0cc lwu R7, [R28, #12] # range, #@loadRange
>> 0d0 NullCheck R28
>>
>>
>> And the jit code is as follows:
>>
>>
>> 0x000000400c823cee: vsetvli t0,zero,e32,m1,tu,mu
>> 0x000000400c823cf2: vle32.v v1,(t6) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0}
>> ; - jdk.incubator.vector.IntVector::intoArray@43 (line 3228)
>> ; - AddMaskTestMerge::workload@46 (line 25)
>> 0x000000400c823cf6: vsetvli t0,zero,e8,m1,tu,mu
>> 0x000000400c823cfa: vmsne.vx v0,v2,zero ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0}
>> ; - jdk.incubator.vector.VectorMask::fromArray@47 (line 208)
>> ; - AddMaskTestMerge::workload@7 (line 22)
>> 0x000000400c823cfe: vsetvli t0,zero,e32,m1,tu,mu
>> 0x000000400c823d02: vadd.vv v3,v3,v1,v0.t ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0}
>> ; - jdk.incubator.vector.IntVector::lanewiseTemplate@192 (line 834)
>> ; - jdk.incubator.vector.Int128Vector::lanewise@9 (line 291)
>> ; - jdk.incubator.vector.Int128Vector::lanewise@4 (line 41)
>> ; - AddMaskTestMerge::workload@39 (line 25)
>>
>>
>> ## Mask register allocation & mask bit operation
>> Since v0 is to be used as a mask register in spec[1], sometimes we need two vmask to do the vector mask logical ops like `AndVMask, OrVMask, XorVMask`. And if only v0 and v31 mask registers are defined, the corresponding C2 nodes will not be generated correctly because of the register pressure[3].
>> When we use only v0 and v31 as mask registers, jtreg testing of Byte128VectorTests.java[4] with `-XX:+PrintAssembly` and `-XX:LogFile` will not emit the expected rvv mask instruction.
Instead, the following compilation failure log[3] is generated:
>>
>>
>>
>>
>>
>>
>>
>>
>> So define v30 and v31 as mask register too and `AndVMask` will emit the C2 JIT code like:
>>
>> vloadmask V0, V1
>> vloadmask V30, V2
>> vmask_and V0, V30, V0
>>
>> We also modified the implementation of `spill_copy_vector_stack_to_stack` so that it no longer occupies the v0 register. In addition to that, we changed some nodes like `vasr/vlsl/vlsr/vstring_x/varray_x/vclearArray_x`, which use v0 internally, so that C2 is aware that they use v0.
>>
>> ## vector load/store - predicated & blend operation
>>
>> Jtreg testing of Byte128VectorTests.java[4] with -XX:+PrintOptoAssembly and -XX:LogFile will print the following compilation log, which is generated by predicated vector load/store:
>>
>> 152 B21: # out( B22 ) <- in( B20 ) Freq: 0.499984
>> 152 vmask_gen_L V0, R12
>> 162 loadV_masked V1, V0, [R10]
>> 16e storeV_masked [R11], V0, V1
>>
>>
>> And `VectorBlend` will generate the following compilation log (part of rotate operation):
>>
>> 1ea vlsrBS V6, V1, V3 V0
>> 1fe vlslBS V5, V1, V2 V0
>> 212 vor.vv V2, V5, V6 #@vor
>> 21a vloadmask V0, V4
>> 222 vmerge_vvm V1, V1, V2 # vector blend
>> 22a bgeu R9, R30, B56 #@cmpU_branch P=0.000001 C=-1.000000
>>
>>
>> At the same time, we added the predicated nodes of `RShiftV/LShiftV/URShiftV`. There was some code duplication with the corresponding nodes in non-masked form, so a small refactoring was done.
>> >> >> [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc >> [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java >> [3] https://github.com/openjdk/jdk/blob/0deb648985b018653ccdaf193dc13b3cf21c088a/src/hotspot/share/opto/chaitin.cpp#L526 >> [4] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Byte128VectorTests.java >> >> ### Testing: >> >> qemu with UseRVV: >> - [x] Tier1 tests (release) >> - [x] Tier2 tests (release) >> - [x] Tier3 tests (release) >> - [x] test/jdk/jdk/incubator/vector (release/fastdebug) > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Simplify some arithmetic mask nodes src/hotspot/cpu/riscv/riscv.ad line 1937: > 1935: case Op_OrVMask: > 1936: case Op_LoadVector: > 1937: opcode = Op_LoadVectorMasked; There is no `break` before `Op_LoadVector`, then all `opcode` matched before `Op_LoadVector` will be assigned to `Op_LoadVectorMasked`. Is this as expected? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1168676688 From vkempik at openjdk.org Mon Apr 17 13:53:52 2023 From: vkempik at openjdk.org (Vladimir Kempik) Date: Mon, 17 Apr 2023 13:53:52 GMT Subject: RFR: 8305056: Avoid unaligned access in emit_intX methods if not enabled [v3] In-Reply-To: <6xESHmK3740UCxCW9YpqxH8qg5mwR6GnBqyK8s5baAA=.784e556f-918e-4cf4-a92a-5159083d928c@github.com> References: <6xESHmK3740UCxCW9YpqxH8qg5mwR6GnBqyK8s5baAA=.784e556f-918e-4cf4-a92a-5159083d928c@github.com> Message-ID: On Thu, 30 Mar 2023 16:29:34 GMT, Quan Anh Mai wrote: > It would probably be more efficient if you just use `memcpy` and let the compiler figure out the best method to do memory accesses. 
Yeah, I think it would be better to just use memcpy in emit_intX and to not modify any put_native_uY code at all.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/13227#issuecomment-1511393158

From dzhang at openjdk.org Mon Apr 17 13:56:40 2023
From: dzhang at openjdk.org (Dingli Zhang)
Date: Mon, 17 Apr 2023 13:56:40 GMT
Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v12]
In-Reply-To:
References:
Message-ID: <8TwWTG4t0LW36zLNaJx_oIb8qhhg4_O63mGC7IVn8ZM=.318f53c2-c4ec-4e76-acec-a1d8db53a0ce@github.com>

On Mon, 17 Apr 2023 07:27:28 GMT, Feilong Jiang wrote:

>> Thanks for the review!
>>
>> I think there may be no problem here. The floating-point compare instructions follow the semantics of the scalar floating-point compare instructions[1] in RVV. For all three instructions (FEQ.S, FLT.S, FLE.S), the result is 0 if either operand is NaN[2]. So when one of the operands is NaN, `BoolTest::ge`, `BoolTest::gt`, `BoolTest::le` and `BoolTest::lt` will all generate a 0 on the corresponding bit.
>>
>> Also a jtreg test case[3] proves that our current logic is fine. `GTFloat512VectorTests` covers the case where the input is NaN.
The test will pass properly and generate the following compilation log which contains `vmaskcmp_rvv`:
>>
>>
>> 1ac B20: # out( B49 B21 ) <- in( B48 B19 ) Freq: 4188.06
>> 1ac vmaskcmp_rvv V0, V4, V5, #3
>> 1b8
>> 1b8 MEMBAR-store-store #@membar_storestore
>> 1bc # checkcastPP of R11, #@checkCastPP
>> 1bc vstoremask V1, V0
>> 1c8 addi R7, R11, #16 # ptr, #@addP_reg_imm
>> 1cc spill R11 -> [sp, #104] # spill size = 64
>> 1ce storeV [R7], V1 # vector (rvv)
>> 1d6 ld R19, [R23, #264] # ptr, #@loadP
>> 1da ld R7, [R23, #280] # ptr, #@loadP
>> 1de addi R28, R19, #16 # ptr, #@addP_reg_imm
>> 1e2 bgeu R28, R7, B49 #@cmpP_branch P=0.000100 C=-1.000000
>>
>>
>> [1] https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#1313-vector-floating-point-compare-instructions
>> [2] https://github.com/riscv/riscv-isa-manual/releases/download/draft-20230131-c0b298a/riscv-spec.pdf
>> [3] https://github.com/openjdk/jdk/tree/master/test/jdk/jdk/incubator/vector/Float512VectorTests.java
>
> Hi, @DingliZhang, I have the same concern here. As in the scalar version of floating-point comparison, where we have an `is_unordered` flag to return the right result if operands contain NaN(s) [1]. Does vfcmp need this flag too?
>
> Also, I see an example of implementing `isgreater()` when operands contain NaN(s) for vfcmp [2], which checks NaN for both operands.
>
> 1. https://github.com/openjdk/jdk/blob/7f56de8f78c0b54e5cf313f53213102a3495234f/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp#L1010-L1037
> 2. https://github.com/riscv/riscv-v-spec/blob/b9afd6f5709fe3f91ce39bb83695bcfaa78eef94/v-spec.adoc?plain=1#L3700-L3706

Hi @feilongjiang, thanks for the review!

I think maybe we don't need to distinguish whether it is unordered or not here at the moment. For example, the greater than or less than comparison in sve_compare in the vmaskcmp call in aarch64 uses the GE/GT[1][2] in the Assembler condition to make the determination (instead of using the LE/LT flag bits, which would include unordered as well).
The most obvious difference in the results of isgreater() and the logic in the patch is that `isgreater()`[3] does not set the invalid operation exception flag (NV=1) when the input is NaN.

It does seem that we would be better off using the logic in [3] for the case when the input is NaN. We will update the PR later.

[1] https://developer.arm.com/documentation/ddi0596/2021-12/Shared-Pseudocode/Shared-Functions?lang=en#impl-shared.FPCompareGE.3
[2] https://developer.arm.com/documentation/ddi0596/2021-12/Shared-Pseudocode/Shared-Functions?lang=en#impl-shared.FPCompareGT.3
[3] https://github.com/riscv/riscv-v-spec/blob/b9afd6f5709fe3f91ce39bb83695bcfaa78eef94/v-spec.adoc?plain=1#L3700-L3706

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1168736383

From dzhang at openjdk.org Mon Apr 17 14:12:34 2023
From: dzhang at openjdk.org (Dingli Zhang)
Date: Mon, 17 Apr 2023 14:12:34 GMT
Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v19]
In-Reply-To:
References:
Message-ID:

> Hi,
>
> We have added support for vector add mask instructions, please take a look and have some reviews. Thanks a lot!
> This patch will add support of vector add/sub/mul/div mask version. It was implemented by referring to RVV v1.0 [1].
>
> ## Load/Store/Cmp Mask
> `VectorLoadMask, VectorMaskCmp, VectorStoreMask` will implement the mask datapath. We can see where the data is passed in the compilation log with `jdk/incubator/vector/Byte128VectorTests.java`:
>
> 218 loadV V1, [R7] # vector (rvv)
> 220 vloadmask V0, V1
> ...
> 23c vmaskcmp_rvv_masked V0, V4, V5, V0, V1, #0
> 24c vstoremask V1, V0
> 258 storeV [R7], V1 # vector (rvv)
>
>
> The corresponding generated jit assembly:
> > # loadV > 0x000000400c8ef958: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef95c: vle8.v v1,(t2) > > # vloadmask > 0x000000400c8ef960: vsetvli t0,zero,e8,m1,tu, > 0x000000400c8ef964: vmsne.vx v0,v1,zero > > # vmaskcmp_rvv_masked > 0x000000400c8ef97c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef980: vmclr.m v1 > 0x000000400c8ef984: vmseq.vv v1,v4,v5,v0.t > 0x000000400c8ef988: vmv1r.v v0,v1 > > # vstoremask > 0x000000400c8ef98c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef990: vmv.v.x v1,zero > 0x000000400c8ef994: vmerge.vim v1,v1,1,v0 > > > ## Masked vector arithmetic instructions (e.g. vadd) > AddMaskTestMerge case: > > import jdk.incubator.vector.IntVector; > import jdk.incubator.vector.VectorMask; > import jdk.incubator.vector.VectorOperators; > import jdk.incubator.vector.VectorSpecies; > > public class AddMaskTestMerge { > > static final VectorSpecies SPECIES = IntVector.SPECIES_128; > static final int SIZE = 1024; > static int[] a = new int[SIZE]; > static int[] b = new int[SIZE]; > static int[] r = new int[SIZE]; > static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false}; > static { > for (int i = 0; i < SIZE; i++) { > a[i] = i; > b[i] = i; > } > } > > static void workload(int idx) { > VectorMask vmask = VectorMask.fromArray(SPECIES, c, 0); > IntVector av = IntVector.fromArray(SPECIES, a, idx); > IntVector bv = IntVector.fromArray(SPECIES, b, idx); > av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx); > } > > public static void main(String[] args) { > for (int i = 0; i < 30_0000; i++) { > for (int j = 0; j < SIZE; j += SPECIES.length()) { > workload(j); > } > } > } > } > > > This test case is reduced from existing jtreg vector tests Int128VectorTests.java[2]. This test case corresponds to the add instruction of the vector mask version and other instructions are similar. > > Before this patch, the compilation log will not print RVV-related instructions. 
Now the compilation log is as follows:
>
>
> 0ae B10: # out( B25 B11 ) <- in( B9 ) Freq: 0.999991
> 0ae loadV V1, [R31] # vector (rvv)
> 0b6 vloadmask V0, V2
> 0be vadd.vv V3, V1, V0 #@vaddI_masked
> 0c6 lwu R28, [R7, #124] # loadN, compressed ptr, #@loadN ! Field: AddMaskTestMerge.r
> 0ca decode_heap_oop R28, R28 #@decodeHeapOop
> 0cc lwu R7, [R28, #12] # range, #@loadRange
> 0d0 NullCheck R28
>
>
> And the jit code is as follows:
>
>
> 0x000000400c823cee: vsetvli t0,zero,e32,m1,tu,mu
> 0x000000400c823cf2: vle32.v v1,(t6) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0}
> ; - jdk.incubator.vector.IntVector::intoArray@43 (line 3228)
> ; - AddMaskTestMerge::workload@46 (line 25)
> 0x000000400c823cf6: vsetvli t0,zero,e8,m1,tu,mu
> 0x000000400c823cfa: vmsne.vx v0,v2,zero ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0}
> ; - jdk.incubator.vector.VectorMask::fromArray@47 (line 208)
> ; - AddMaskTestMerge::workload@7 (line 22)
> 0x000000400c823cfe: vsetvli t0,zero,e32,m1,tu,mu
> 0x000000400c823d02: vadd.vv v3,v3,v1,v0.t ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0}
> ; - jdk.incubator.vector.IntVector::lanewiseTemplate@192 (line 834)
> ; - jdk.incubator.vector.Int128Vector::lanewise@9 (line 291)
> ; - jdk.incubator.vector.Int128Vector::lanewise@4 (line 41)
> ; - AddMaskTestMerge::workload@39 (line 25)
>
>
> ## Mask register allocation & mask bit operation
> Since v0 is to be used as a mask register in spec[1], sometimes we need two vmask to do the vector mask logical ops like `AndVMask, OrVMask, XorVMask`. And if only v0 and v31 mask registers are defined, the corresponding C2 nodes will not be generated correctly because of the register pressure[3].
> When we use only v0 and v31 as mask registers, jtreg testing of Byte128VectorTests.java[4] with `-XX:+PrintAssembly` and `-XX:LogFile` will not emit the expected rvv mask instruction.
Instead, the following compilation failure log[3] is generated:
>
>
>
>
>
>
>
>
> So define v30 and v31 as mask register too and `AndVMask` will emit the C2 JIT code like:
>
> vloadmask V0, V1
> vloadmask V30, V2
> vmask_and V0, V30, V0
>
> We also modified the implementation of `spill_copy_vector_stack_to_stack` so that it no longer occupies the v0 register. In addition to that, we changed some nodes like `vasr/vlsl/vlsr/vstring_x/varray_x/vclearArray_x`, which use v0 internally, so that C2 is aware that they use v0.
>
> ## vector load/store - predicated & blend operation
>
> Jtreg testing of Byte128VectorTests.java[4] with -XX:+PrintOptoAssembly and -XX:LogFile will print the following compilation log, which is generated by predicated vector load/store:
>
> 152 B21: # out( B22 ) <- in( B20 ) Freq: 0.499984
> 152 vmask_gen_L V0, R12
> 162 loadV_masked V1, V0, [R10]
> 16e storeV_masked [R11], V0, V1
>
>
> And `VectorBlend` will generate the following compilation log (part of rotate operation):
>
> 1ea vlsrBS V6, V1, V3 V0
> 1fe vlslBS V5, V1, V2 V0
> 212 vor.vv V2, V5, V6 #@vor
> 21a vloadmask V0, V4
> 222 vmerge_vvm V1, V1, V2 # vector blend
> 22a bgeu R9, R30, B56 #@cmpU_branch P=0.000001 C=-1.000000
>
>
> At the same time, we added the predicated nodes of `RShiftV/LShiftV/URShiftV`. There was some code duplication with the corresponding nodes in non-masked form, so a small refactoring was done.
> > > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc > [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java > [3] https://github.com/openjdk/jdk/blob/0deb648985b018653ccdaf193dc13b3cf21c088a/src/hotspot/share/opto/chaitin.cpp#L526 > [4] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Byte128VectorTests.java > > ### Testing: > > qemu with UseRVV: > - [x] Tier1 tests (release) > - [x] Tier2 tests (release) > - [x] Tier3 tests (release) > - [x] test/jdk/jdk/incubator/vector (release/fastdebug) Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: Fix match_rule_supported_vector_masked ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12682/files - new: https://git.openjdk.org/jdk/pull/12682/files/a7f66796..af237dae Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=18 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=17-18 Stats: 12 lines in 1 file changed: 7 ins; 4 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/12682.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12682/head:pull/12682 PR: https://git.openjdk.org/jdk/pull/12682 From dzhang at openjdk.org Mon Apr 17 14:12:39 2023 From: dzhang at openjdk.org (Dingli Zhang) Date: Mon, 17 Apr 2023 14:12:39 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v18] In-Reply-To: <3Hjv_E07Y6V-yGy0C2SGVBqCprxiXil1IhfkR_fM75M=.0a780b41-98cc-41c7-be08-1981075b8c39@github.com> References: <3Hjv_E07Y6V-yGy0C2SGVBqCprxiXil1IhfkR_fM75M=.0a780b41-98cc-41c7-be08-1981075b8c39@github.com> Message-ID: On Mon, 17 Apr 2023 13:15:03 GMT, Feilong Jiang wrote: >> Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: >> >> Simplify some arithmetic mask nodes > > src/hotspot/cpu/riscv/riscv.ad line 1937: > >> 1935: case Op_OrVMask: >> 1936: case 
Op_LoadVector:
>> 1937: opcode = Op_LoadVectorMasked;
>
> There is no `break` before `Op_LoadVector`, so every `opcode` matched before `Op_LoadVector` will be assigned `Op_LoadVectorMasked`.
> Is this as expected?

Fixed.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1168757392

From vkempik at openjdk.org Mon Apr 17 14:19:36 2023
From: vkempik at openjdk.org (Vladimir Kempik)
Date: Mon, 17 Apr 2023 14:19:36 GMT
Subject: RFR: 8305056: Avoid unaligned access in emit_intX methods if not enabled [v4]
In-Reply-To:
References:
Message-ID:

> Please review this change which attempts to eliminate unaligned memory stores generated by emit_int16/32/64 methods on some platforms.
>
> The primary aim is the RISC-V platform. But I had to change some code in ppc/arm32/x86 to prevent possible perf degradation.

Vladimir Kempik has updated the pull request incrementally with one additional commit since the last revision:

Rework the fix to use memcpy in codeBuffer.hpp

-------------

Changes:
- all: https://git.openjdk.org/jdk/pull/13227/files
- new: https://git.openjdk.org/jdk/pull/13227/files/ffa4edd3..c014a806

Webrevs:
- full: https://webrevs.openjdk.org/?repo=jdk&pr=13227&range=03
- incr: https://webrevs.openjdk.org/?repo=jdk&pr=13227&range=02-03

Stats: 38 lines in 5 files changed: 7 ins; 11 del; 20 mod
Patch: https://git.openjdk.org/jdk/pull/13227.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/13227/head:pull/13227

PR: https://git.openjdk.org/jdk/pull/13227

From epeter at openjdk.org Mon Apr 17 14:24:41 2023
From: epeter at openjdk.org (Emanuel Peter)
Date: Mon, 17 Apr 2023 14:24:41 GMT
Subject: RFR: 8303466: C2: failed: malformed control flow.
Limit type made precise with MaxL/MinL [v4] In-Reply-To: References: Message-ID: <9druojszHMZKJqtonknAR-ykDUZwTqkAgpWx6TI0_zA=.011ac9ae-e41e-4e5b-8a4f-f9567eef3ce5@github.com> On Sat, 15 Apr 2023 06:35:02 GMT, Jasmine Karthikeyan wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> Review suggestion by Tobias Hartmann >> >> Co-authored-by: Tobias Hartmann > > I only had a handful of cases so I've attached them in [this gist](https://gist.github.com/jaskarth/878648eecaf74e168d5499c5961b4715). > I found these a few weeks ago, but looking back I think I have misremembered the problem chain. Examples 3-5 may be a different bug as they deal with `ConvL2I->ConvI2L` chains instead of `ConvI2L->ConvL2I` chains as you are seeing, as the latter has an Identity() transform defined while it seems the former does not- I apologize for the noise if the issues are unrelated. Examples 1 and 2 could perhaps still be useful in diagnosing the issue, as they describe cases where ideal transforms that do exist aren't taken. I tried looking into that bug myself a while ago but didn't get far. > > JDK-8298951 is exciting, it'll make reasoning about middle-end optimizations a lot easier :) @jaskarth I think your issues are not related, though I can look at them again once I get back to IGVN verification. @vnkozlov I thought about it a bit more. With a simple example like `Test::test`, I get unrolling `2048`, so we unroll 10-ish times. I see accordingly many `ConvI2L, MaxL, ConvL2I` nodes. Now, I can collapse the `ConvL2I -> ConvI2L` parts (the types guarantee that we never leave the `int` range, so conversion never clips anything), so it is only a chain of `MaxL` nodes. The issue at that point: I now basically have a reduction of all the "limits" I have ever wanted to respect (including some range-check limits, and all the unroll limits: `limit, limit-1, limit-3, limit-7, limit-15, limit-31, ... limit-1023`. 
The problem is that most of them have quite large ranges that are overlapping, so we cannot fold them at that point. The example: `./java -Xcomp -XX:CompileCommand=compileonly,Test::test -XX:CompileCommand=printcompilation,Test::test -XX:+TraceLoopOpts -XX:+PrintIdeal Test.java` public class Test { static int START = 0; static int FINISH = 512; static int RANGE = 512; public static void main(String args[]) { byte[] data = new byte[RANGE]; test(data); } public static void test(byte[] data) { for (int j = START; j < FINISH; j++) { data[j] = (byte)(data[j] * 11); } } } What to do with this? - Performance testing did not show any difference. But maybe we do not trust that enough. - Before and now, the chain of unrolling-limits can be interrupted by range-check limits. We probably will just accept that this means that not all of the unrolling-limits can be folded together. **I have an alternative proposal:** Leave the `MaxL/MinL` node for the range-check limits, there are usually not that many RC-limits, and up to now we used a `CMove` node per such limit already anyway. But for the unroll-limits, we introduce a `SubINoUnderflow` node, which does a safe (no-underflow) subtraction `limit-stride`. These nodes can be folded together relatively easily. I already had such an implementation before, and reverted it https://github.com/openjdk/jdk/pull/13269/commits/f5fcf6084a2446876ba2a85907a2991ef4c705b7 I had already discussed this idea with @chhagedorn a while ago. But then decided against it once I also saw that I wanted a unified solution for RC-limits and unroll-limits. But now I'm questioning that decision again. With this `SubINoUnderflowNode` idea, we would have a constant number of nodes added per RC-limit. And then for all the unroll limit adjustments together, we would only have one `SubINoUnderflow` node, as they would all collapse into one. At macro expansion, I can then expand it into a single CMove node. @vnkozlov What do you think? Do you have any other ideas? 
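To illustrate why the unroll-limit adjustments should fold into a single node, here is a minimal scalar sketch in plain Java. It is not C2 code: the class `UnrollLimitSketch` and the methods `subNoUnderflow`, `chained` and `collapsed` are hypothetical names made up for this illustration, and the sketch assumes a non-negative stride (a loop that counts upwards). Under that assumption, a chain of clamped subtractions should give the same result as one clamped subtraction of the summed strides, because once the value is clamped to `min_int` every later step stays at `min_int`.

```java
// Scalar model only: this is not C2 code. The class and method names are
// hypothetical and exist purely for illustration.
final class UnrollLimitSketch {

    // Models a no-underflow subtraction: compute limit - stride in long
    // arithmetic, then clamp the result to Integer.MIN_VALUE.
    // Assumes stride >= 0 (a loop counting upwards).
    static int subNoUnderflow(int limit, int stride) {
        long wide = (long) limit - (long) stride;
        return (int) Math.max(wide, (long) Integer.MIN_VALUE);
    }

    // Applies one clamped subtraction per unrolling step, as a chain.
    static int chained(int limit, int[] strides) {
        int result = limit;
        for (int s : strides) {
            result = subNoUnderflow(result, s);
        }
        return result;
    }

    // Applies a single clamped subtraction of the summed strides.
    // The sum is kept in a long so that it cannot overflow.
    static int collapsed(int limit, int[] strides) {
        long total = 0;
        for (int s : strides) {
            total += s;
        }
        long wide = (long) limit - total;
        return (int) Math.max(wide, (long) Integer.MIN_VALUE);
    }

    public static void main(String[] args) {
        // Strides double with every unrolling step, as in the example above.
        int[] strides = {1, 2, 4, 8, 16};
        int[] limits = {512, 0, Integer.MIN_VALUE + 10, Integer.MIN_VALUE};
        for (int limit : limits) {
            System.out.println(limit + ": chained=" + chained(limit, strides)
                    + " collapsed=" + collapsed(limit, strides));
        }
    }
}
```

The sketch only models the arithmetic. How the type of the folded node would be computed in IGVN is a separate question that it does not address.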
------------- PR Comment: https://git.openjdk.org/jdk/pull/13269#issuecomment-1511456666 From epeter at openjdk.org Mon Apr 17 14:41:34 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 17 Apr 2023 14:41:34 GMT Subject: RFR: 8303466: C2: failed: malformed control flow. Limit type made precise with MaxL/MinL [v5] In-Reply-To: References: Message-ID: > **Context** > > During `PhaseIdealLoop::do_unroll`, we hack the loop-limit, and subtract `stride` from it. We have to prevent underflow on that subtract. Currently, we do this with a `CMoveI`. The problem with this: `CMoveI` is not smart enough to generate a precise type. For example, there are many cases where the input types get better, and underflow is not possible anymore. But the `CMoveI` does not detect this, and still has type `min_jint..hi`. > > We have the same issue in `PhaseIdealLoop::adjust_limit`, where we use `CMoveL` to implement long max/min. The types are not as precise as they could and should be. > > **Problem** > > The imprecise type is used for the zero-trip-guard. It does not fold to false, even though the data-path into the post loop does constant fold to `TOP`. The graph breaks, and assert `malformed control flow` triggers. > > Details: In these cases, we have the super-unrolled main-loop (SuperWord'ed, then further unrolled) directly leading to a vectorized post-loop. The effect is that there is no `region/phi` merging main-exit and main-zero-trip-guard. So the types are already more narrow here. It may be possible that the values are such that we find out that we should never enter the vectorized post-loop. But if data finds out and control does not, we get a broken graph. > Note: we have pre-loop. Then a main-loop and vectorized post loop. Then we merge the main-zero-trip-guard. And at the end we have the scalar post loop. > > I have already recently fixed a bug around this `CMoveI`. 
https://github.com/openjdk/jdk/commit/5a4945c0d95423d0ab07762c915e9cb4d3c66abb I would now like to have a more satisfactory fix that properly propagates the types. > > **Solution** > > `PhaseIdealLoop::adjust_limit` already converts the limit from int to long, and does all computations in long, including taking max/min with a `CMoveL`. I now use the so far unused `MaxL/MinL`. I implemented some missing `Value/Identity` components for it. Since `MaxL/MinL` is not implemented in the backend, I just expand it in macro-expansion to a `CMoveL`. At that point the loop-opts are over, and it is most likely ok that we do not make the types more precise after this. > > I take the same approach for `PhaseIdealLoop::do_unroll`: convert limits to long, do subtraction in long, take `MinL/MaxL` to clamp it to the int-range (prevent subtraction underflow). > > **Discussion** > > This solution seems much cleaner to me, and I hope that we will see fewer bugs because of imprecise types in the limit computation, which were often due to the `CMove` not being smart enough to analyze all inputs (it would have to recognize a multitude of patterns, for the Cmp inputs and the direct inputs to the CMove - we currently do not do that, but just take the union of the input types - this is very imprecise). > > There is a bit of an overhead here: We use longs even though we only want to have int values. But I think we should prefer a clean implementation here, with correct type computation. The performance impact is probably non-existent on 64-bit machines anyway. > > **Caveat** > > I found some cases with the same assert `malformed control flow` that are most likely skeleton/assertion predicate bugs [JDK-8288981](https://bugs.openjdk.org/browse/JDK-8288981). Some of those cases were new patterns, for example where we PreMainPost a main loop. > > I hope that this fix here at least reduces the frequency of failures significantly. > > **Testing** > > I added 2 regression tests.
Our fuzzer seems to spit out examples regularly, so that gives us extra coverage. > > Tested up to `tier5` and stress testing. Performance testing **running...** > > **Future Work** > > We should implement `MaxL/MinL` in the backend. We should also use them during parsing. This would also allow us to `SuperWord` the instruction on the platforms that support it. > > Should we add such an assert during IGVN? I think after IGVN, we should never have a `MultiBranchNode` that does not have the required number of outputs, right? We could add it to `VerifyIterativeGVN`. Emanuel Peter has updated the pull request incrementally with two additional commits since the last revision: - Merge branch 'JDK-8303466' of https://github.com/eme64/jdk into JDK-8303466 - convert I2L(L2I(x)) => x, when allowed by types ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13269/files - new: https://git.openjdk.org/jdk/pull/13269/files/ecdff09b..2f5eb056 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13269&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13269&range=03-04 Stats: 15 lines in 2 files changed: 15 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/13269.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13269/head:pull/13269 PR: https://git.openjdk.org/jdk/pull/13269 From mdoerr at openjdk.org Mon Apr 17 15:50:50 2023 From: mdoerr at openjdk.org (Martin Doerr) Date: Mon, 17 Apr 2023 15:50:50 GMT Subject: RFR: 8305351: C2 setScopedValueCache intrinsic doesn't use access API In-Reply-To: References: Message-ID: <8Doe9eQSiFXfKFZnXXl8JnCKFTS7iMto-LeyGYnHo4E=.fac7c5e7-9a04-4764-84bb-ee9589113dc0@github.com> On Tue, 4 Apr 2023 12:40:14 GMT, Erik Österlund wrote: > The setScopedValueCache intrinsic for C2 doesn't use the access API. Instead, we store into an OopHandle with a raw store. That doesn't necessarily play well with all GCs, for example Shenandoah and generational ZGC.
We should use the access API to ensure the right barriers are emitted. This fix is for JEP 429 ([JDK-8286666](https://bugs.openjdk.org/browse/JDK-8286666)), so 17u is not affected. Only backport to JDK 20 may be desirable. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13324#issuecomment-1511629178 From cslucas at openjdk.org Mon Apr 17 16:17:30 2023 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Mon, 17 Apr 2023 16:17:30 GMT Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v9] In-Reply-To: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> Message-ID: <0oNkCfUBIR1hpPwN0i_ONwwyjd0AYux7GkLm-G1PdsU=.b3a5e7ff-e9bf-45b6-b996-691f86aa7057@github.com> > Can I please get reviews for this PR? > > The most common and frequent use of NonEscaping Phis merging object allocations is for debugging information. The two graphs below show numbers for Renaissance and DaCapo benchmarks - similar results are obtained for all other applications that I tested. > > With what frequency does each IR node type occurs as an allocation merge user? I.e., if the same node type uses a Phi N times the counter is incremented by N: > > ![image](https://user-images.githubusercontent.com/2249648/222280517-4dcf5871-2564-4207-b49e-22aee47fa49d.png) > > What are the most common users of allocation merges? I.e., if the same node type uses a Phi N times the counter is incremented by 1: > > ![image](https://user-images.githubusercontent.com/2249648/222280608-ca742a4e-1622-4e69-a778-e4db6805ea02.png) > > This PR adds support scalar replacing allocations participating in merges used as debug information OR as a base for field loads. I plan to create subsequent PRs to enable scalar replacement of merges used by other node types (CmpP is next on the list) subsequently. 
> > The approach I used for _rematerialization_ is pretty straightforward. It consists basically of the following. 1) New IR node (suggested by V. Kozlov), named SafePointScalarMergeNode, to represent a set of SafePointScalarObjectNode; 2) Each scalar replaceable input participating in a merge will get a SafePointScalarObjectNode like if it weren't part of a merge. 3) Add a new Class to support the rematerialization of SR objects that are part of a merge; 4) Patch HotSpot to be able to serialize and deserialize debug information related to allocation merges; 5) Patch C2 to generate unique types for SR objects participating in some allocation merges. > > The approach I used for _enabling the scalar replacement of some of the inputs of the allocation merge_ is also pretty straightforward: call `MemNode::split_through_phi` to, well, split AddP->Load* through the merge which will render the Phi useless. > > I tested this with JTREG tests tier 1-4 (Windows, Linux, and Mac) and didn't see regression. I also experimented with several applications and didn't see any failure. I also ran tests with "-ea -esa -Xbatch -Xcomp -XX:+UnlockExperimentalVMOptions -XX:-TieredCompilation -server -XX:+IgnoreUnrecognizedVMOptions -XX:+UnlockDiagnosticVMOptions -XX:+StressLCM -XX:+StressGCM -XX:+StressCCP" and didn't observe any related failures. Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision: Fix tests. Remember previous reducible Phis. 
------------- Changes: - all: https://git.openjdk.org/jdk/pull/12897/files - new: https://git.openjdk.org/jdk/pull/12897/files/a10b0a4c..aec1b07a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12897&range=08 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12897&range=07-08 Stats: 4 lines in 1 file changed: 3 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/12897.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12897/head:pull/12897 PR: https://git.openjdk.org/jdk/pull/12897 From never at openjdk.org Mon Apr 17 16:49:44 2023 From: never at openjdk.org (Tom Rodriguez) Date: Mon, 17 Apr 2023 16:49:44 GMT Subject: Integrated: 8305755: [JVMCI] missing barriers in CompilerToVM.readFieldValue for Reference.referent In-Reply-To: <0B1MuPO_KSgWyfCDYo3vsX4KTXV3o0x2NENQDTBzgWI=.f638f9e0-0168-4be7-b047-2e3f08a3864f@github.com> References: <0B1MuPO_KSgWyfCDYo3vsX4KTXV3o0x2NENQDTBzgWI=.f638f9e0-0168-4be7-b047-2e3f08a3864f@github.com> Message-ID: On Fri, 7 Apr 2023 17:30:39 GMT, Tom Rodriguez wrote: > Add missing GC barrier for reflective read. I'm not sure the idiom I've chosen is the correct one so please correct me if there's a better way to write this. In testing, this resolved the issue. This pull request has now been integrated.
Changeset: 497f9e76 Author: Tom Rodriguez URL: https://git.openjdk.org/jdk/commit/497f9e760da6342c611a2f542090c5cf4428b9fd Stats: 12 lines in 2 files changed: 4 ins; 0 del; 8 mod 8305755: [JVMCI] missing barriers in CompilerToVM.readFieldValue for Reference.referent Reviewed-by: eosterlund, dnsimon ------------- PR: https://git.openjdk.org/jdk/pull/13389 From kvn at openjdk.org Mon Apr 17 19:33:08 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 17 Apr 2023 19:33:08 GMT Subject: RFR: 8305690: [X86] Do not emit two REX prefixes in Assembler::prefix In-Reply-To: References: Message-ID: On Thu, 6 Apr 2023 05:36:06 GMT, Guoxiong Li wrote: > Hi all, > > This patch prevents `Assembler::prefix` from emitting two `REX prefixes`. The current code in mainline works well because the corresponding code path is not triggered. > > Thanks for the review. > > Best Regards, > -- Guoxiong Looks good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13369#pullrequestreview-1388823706 From kvn at openjdk.org Mon Apr 17 19:35:23 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 17 Apr 2023 19:35:23 GMT Subject: RFR: 8305781: compiler/c2/irTests/TestVectorizationMultiInvar.java failed with "IRViolationException: There were one or multiple IR rule failures." In-Reply-To: References: Message-ID: On Mon, 17 Apr 2023 11:26:47 GMT, Roland Westrelin wrote: > The test case only works if unaligned accesses are allowed (that is > AlignVector false). I added a runtime check similar to what I did with > TestVectorizationMismatchedAccess. Good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13492#pullrequestreview-1388826246 From kvn at openjdk.org Mon Apr 17 19:58:58 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 17 Apr 2023 19:58:58 GMT Subject: RFR: 8303466: C2: failed: malformed control flow. 
Limit type made precise with MaxL/MinL [v4] In-Reply-To: <9druojszHMZKJqtonknAR-ykDUZwTqkAgpWx6TI0_zA=.011ac9ae-e41e-4e5b-8a4f-f9567eef3ce5@github.com> References: <9druojszHMZKJqtonknAR-ykDUZwTqkAgpWx6TI0_zA=.011ac9ae-e41e-4e5b-8a4f-f9567eef3ce5@github.com> Message-ID: On Mon, 17 Apr 2023 14:21:51 GMT, Emanuel Peter wrote: > But I think I can do the same with just collapsing `SubL -> MaxL -> SubL -> MaxL` to `SubL -> MaxL`. That may be cleaner. I prefer this if you can do it. So you have the sequence (after folding `Conv` nodes) MaxL(SubL(MaxL(SubL(limit, stride), min_int), stride*2), min_int); Yes, I think it can be collapsed to: MaxL(SubL(limit, stride*3), min_int); If at any point in the chain `limit` becomes `min_int`, it will stay `min_int` (even if `stride` is `max_int`) because you use Long arithmetic and we have a "small" limit on unrolling (16?). If it does not hit min_int, the result is similar to SubL(SubL(limit, stride), stride*2). So you just need to correctly collect `stride*N` values. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13269#issuecomment-1511996427 From cslucas at openjdk.org Mon Apr 17 22:11:46 2023 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Mon, 17 Apr 2023 22:11:46 GMT Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v8] In-Reply-To: References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> Message-ID: On Sat, 15 Apr 2023 00:17:55 GMT, Vladimir Kozlov wrote: >> Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision: >> >> Address PR review 3. Some comments and be able to abort compilation. > > New test failed in tier1 on all platforms.
Here is list: > > 1) Method "int compiler.c2.irTests.scalarReplacement.AllocationMergesTests.TestTrapAfterMerge(boolean,int,int)" - [Failed IR rules: 1]: > 2) Method "int compiler.c2.irTests.scalarReplacement.AllocationMergesTests.testCondLoadAfterMerge(boolean,boolean,int,int)" - [Failed IR rules: 1]: > 3) Method "int compiler.c2.irTests.scalarReplacement.AllocationMergesTests.testLoadInCondAfterMerge(boolean,int,int)" - [Failed IR rules: 1]: > 4) Method "int compiler.c2.irTests.scalarReplacement.AllocationMergesTests.testMergesAndMixedEscape(boolean,int,int)" - [Failed IR rules: 1]: > 5) Method "compiler.c2.irTests.scalarReplacement.AllocationMergesTests$Point[] compiler.c2.irTests.scalarReplacement.AllocationMergesTests.testNestedObjectsArray(boolean,int,int)" - [Failed IR rules: 1]: > 6) Method "int compiler.c2.irTests.scalarReplacement.AllocationMergesTests.testNestedObjectsNoEscapeObject(boolean,int,int)" - [Failed IR rules: 1]: > 7) Method "compiler.c2.irTests.scalarReplacement.AllocationMergesTests$Point compiler.c2.irTests.scalarReplacement.AllocationMergesTests.testNestedObjectsObject(boolean,int,int)" - [Failed IR rules: 1]: @vnkozlov - sorry about that. I fixed the code now and all GHA tests are passing. ------------- PR Comment: https://git.openjdk.org/jdk/pull/12897#issuecomment-1512148978 From fyang at openjdk.org Tue Apr 18 02:20:48 2023 From: fyang at openjdk.org (Fei Yang) Date: Tue, 18 Apr 2023 02:20:48 GMT Subject: RFR: 8305056: Avoid unaligned access in emit_intX methods if not enabled [v4] In-Reply-To: References: Message-ID: <9HYcBfpVIcGW2BOpMOGcCOjReR_lOxnMo_N8u1NnQZQ=.eee4f1dd-b298-49a1-bed3-94b08f3c75d2@github.com> On Mon, 17 Apr 2023 14:19:36 GMT, Vladimir Kempik wrote: >> Please review this change which attempts to eliminate unaligned memory stores generated by emit_int16/32/64 methods on some platforms. >> >> Primary aim is risc-v platform. But I had to change some code in ppc/arm32/x86 to prevent possible perf degradation. 
> > Vladimir Kempik has updated the pull request incrementally with one additional commit since the last revision: > > Rework the fix to use memcpy in codeBuffer.hpp src/hotspot/share/asm/codeBuffer.hpp line 262: > 260: void emit_float( jfloat x) { put_native(end(), jint_cast(x)); set_end(end() + sizeof(jfloat)); } > 261: void emit_double(jdouble x) { put_native(end(), julong_cast(x)); set_end(end() + sizeof(jdouble)); } > 262: void emit_address(address x) { put_native(end(), p2i(x)); set_end(end() + sizeof(address)); } You might want to remove the explicit casts of 'x'. And the JBS title might need to be updated too. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13227#discussion_r1169420154 From thartmann at openjdk.org Tue Apr 18 06:00:47 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 18 Apr 2023 06:00:47 GMT Subject: RFR: 8305690: [X86] Do not emit two REX prefixes in Assembler::prefix In-Reply-To: References: Message-ID: On Thu, 6 Apr 2023 05:36:06 GMT, Guoxiong Li wrote: > Hi all, > > This patch prevents `Assembler::prefix` from emitting two `REX prefixes`. The current code in mainline works well because the corresponding code path is not triggered. > > Thanks for the review. > > Best Regards, > -- Guoxiong Looks good to me too. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13369#pullrequestreview-1389340091 From dzhang at openjdk.org Tue Apr 18 06:13:34 2023 From: dzhang at openjdk.org (Dingli Zhang) Date: Tue, 18 Apr 2023 06:13:34 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v20] In-Reply-To: References: Message-ID: > Hi, > > We have added support for vector add mask instructions, please take a look and review. Thanks a lot! > This patch adds support for the masked versions of vector add/sub/mul/div. It was implemented by referring to RVV v1.0 [1].
> > ## Load/Store/Cmp Mask > `VectorLoadMask, VectorMaskCmp, VectorStoreMask` will implement the mask datapath. We can see where the data is passed in the compilation log with `jdk/incubator/vector/Byte128VectorTests.java`: > > 218 loadV V1, [R7] # vector (rvv) > 220 vloadmask V0, V1 > ... > 23c vmaskcmp_rvv_masked V0, V4, V5, V0, V1, #0 > 24c vstoremask V1, V0 > 258 storeV [R7], V1 # vector (rvv) > > > The corresponding generated jit assembly: > > # loadV > 0x000000400c8ef958: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef95c: vle8.v v1,(t2) > > # vloadmask > 0x000000400c8ef960: vsetvli t0,zero,e8,m1,tu, > 0x000000400c8ef964: vmsne.vx v0,v1,zero > > # vmaskcmp_rvv_masked > 0x000000400c8ef97c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef980: vmclr.m v1 > 0x000000400c8ef984: vmseq.vv v1,v4,v5,v0.t > 0x000000400c8ef988: vmv1r.v v0,v1 > > # vstoremask > 0x000000400c8ef98c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef990: vmv.v.x v1,zero > 0x000000400c8ef994: vmerge.vim v1,v1,1,v0 > > > ## Masked vector arithmetic instructions (e.g.
vadd) > AddMaskTestMerge case: > > import jdk.incubator.vector.IntVector; > import jdk.incubator.vector.VectorMask; > import jdk.incubator.vector.VectorOperators; > import jdk.incubator.vector.VectorSpecies; > > public class AddMaskTestMerge { > > static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128; > static final int SIZE = 1024; > static int[] a = new int[SIZE]; > static int[] b = new int[SIZE]; > static int[] r = new int[SIZE]; > static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false}; > static { > for (int i = 0; i < SIZE; i++) { > a[i] = i; > b[i] = i; > } > } > > static void workload(int idx) { > VectorMask<Integer> vmask = VectorMask.fromArray(SPECIES, c, 0); > IntVector av = IntVector.fromArray(SPECIES, a, idx); > IntVector bv = IntVector.fromArray(SPECIES, b, idx); > av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx); > } > > public static void main(String[] args) { > for (int i = 0; i < 30_0000; i++) { > for (int j = 0; j < SIZE; j += SPECIES.length()) { > workload(j); > } > } > } > } > > > This test case is reduced from the existing jtreg vector test Int128VectorTests.java[2]. This test case exercises the masked version of the vector add instruction; the other instructions are similar. > > Before this patch, the compilation log did not contain RVV-related instructions. Now the compilation log is as follows: > > > 0ae B10: # out( B25 B11 ) <- in( B9 ) Freq: 0.999991 > 0ae loadV V1, [R31] # vector (rvv) > 0b6 vloadmask V0, V2 > 0be vadd.vv V3, V1, V0 #@vaddI_masked > 0c6 lwu R28, [R7, #124] # loadN, compressed ptr, #@loadN !
Field: AddMaskTestMerge.r > 0ca decode_heap_oop R28, R28 #@decodeHeapOop > 0cc lwu R7, [R28, #12] # range, #@loadRange > 0d0 NullCheck R28 > > > And the jit code is as follows: > > > 0x000000400c823cee: vsetvli t0,zero,e32,m1,tu,mu > 0x000000400c823cf2: vle32.v v1,(t6) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::intoArray@43 (line 3228) > ; - AddMaskTestMerge::workload@46 (line 25) > 0x000000400c823cf6: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c823cfa: vmsne.vx v0,v2,zero ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.VectorMask::fromArray@47 (line 208) > ; - AddMaskTestMerge::workload@7 (line 22) > 0x000000400c823cfe: vsetvli t0,zero,e32,m1,tu,mu > 0x000000400c823d02: vadd.vv v3,v3,v1,v0.t ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::lanewiseTemplate@192 (line 834) > ; - jdk.incubator.vector.Int128Vector::lanewise@9 (line 291) > ; - jdk.incubator.vector.Int128Vector::lanewise@4 (line 41) > ; - AddMaskTestMerge::workload@39 (line 25) > > > ## Mask register allocation & mask bit operation > Since v0 is to be used as a mask register in the spec[1], sometimes we need two vmasks to do the vector mask logical ops like `AndVMask, OrVMask, XorVMask`. And if only v0 and v31 mask registers are defined, the corresponding c2 nodes will not be generated correctly because of the register pressure[3]. > When we use only v0 and v31 as mask registers, jtreg testing of Byte128VectorTests.java[4] with `-XX:+PrintAssembly` and `-XX:LogFile` will not emit the expected rvv mask instruction.
Instead, the following compilation failure log[3] is generated: > > > > > > > > > So define v30 and v31 as mask registers too and `AndVMask` will emit the C2 JIT code like: > > vloadmask V0, V1 > vloadmask V30, V2 > vmask_and V0, V30, V0 > > We also modified the implementation of `spill_copy_vector_stack_to_stack` so that it no longer occupies the v0 register. In addition to that, we changed some nodes like `vasr/vlsl/vlsr/vstring_x/varray_x/vclearArray_x`, which use v0 internally, to make C2 aware that they use v0. > > ## vector load/store - predicated & blend operation > > Jtreg testing of Byte128VectorTests.java[4] with -XX:+PrintOptoAssembly and -XX:LogFile will print the following compilation log, which is generated by predicated vector load/store: > > 152 B21: # out( B22 ) <- in( B20 ) Freq: 0.499984 > 152 vmask_gen_L V0, R12 > 162 loadV_masked V1, V0, [R10] > 16e storeV_masked [R11], V0, V1 > > > And `VectorBlend` will generate the following compilation log (part of rotate operation): > > 1ea vlsrBS V6, V1, V3 V0 > 1fe vlslBS V5, V1, V2 V0 > 212 vor.vv V2, V5, V6 #@vor > 21a vloadmask V0, V4 > 222 vmerge_vvm V1, V1, V2 # vector blend > 22a bgeu R9, R30, B56 #@cmpU_branch P=0.000001 C=-1.000000 > > > At the same time, we added the predicated nodes of `RShiftV/LShiftV/URShiftV`. Since there was some code duplication with the corresponding nodes in non-masked form, a small refactoring was done.
> > > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc > [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java > [3] https://github.com/openjdk/jdk/blob/0deb648985b018653ccdaf193dc13b3cf21c088a/src/hotspot/share/opto/chaitin.cpp#L526 > [4] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Byte128VectorTests.java > > ### Testing: > > qemu with UseRVV: > - [x] Tier1 tests (release) > - [x] Tier2 tests (release) > - [x] Tier3 tests (release) > - [x] test/jdk/jdk/incubator/vector (release/fastdebug) Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: Handle unordered compares ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12682/files - new: https://git.openjdk.org/jdk/pull/12682/files/af237dae..e9a707ea Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=19 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=18-19 Stats: 102 lines in 3 files changed: 70 ins; 0 del; 32 mod Patch: https://git.openjdk.org/jdk/pull/12682.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12682/head:pull/12682 PR: https://git.openjdk.org/jdk/pull/12682 From dzhang at openjdk.org Tue Apr 18 06:36:06 2023 From: dzhang at openjdk.org (Dingli Zhang) Date: Tue, 18 Apr 2023 06:36:06 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v21] In-Reply-To: References: Message-ID: > Hi, > > We have added support for vector add mask instructions, please take a look and review. Thanks a lot! > This patch adds support for the masked versions of vector add/sub/mul/div. It was implemented by referring to RVV v1.0 [1]. > > ## Load/Store/Cmp Mask > `VectorLoadMask, VectorMaskCmp, VectorStoreMask` will implement the mask datapath. We can see where the data is passed in the compilation log with `jdk/incubator/vector/Byte128VectorTests.java`:
> > 218 loadV V1, [R7] # vector (rvv) > 220 vloadmask V0, V1 > ... > 23c vmaskcmp_rvv_masked V0, V4, V5, V0, V1, #0 > 24c vstoremask V1, V0 > 258 storeV [R7], V1 # vector (rvv) > > > The corresponding generated jit assembly: > > # loadV > 0x000000400c8ef958: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef95c: vle8.v v1,(t2) > > # vloadmask > 0x000000400c8ef960: vsetvli t0,zero,e8,m1,tu, > 0x000000400c8ef964: vmsne.vx v0,v1,zero > > # vmaskcmp_rvv_masked > 0x000000400c8ef97c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef980: vmclr.m v1 > 0x000000400c8ef984: vmseq.vv v1,v4,v5,v0.t > 0x000000400c8ef988: vmv1r.v v0,v1 > > # vstoremask > 0x000000400c8ef98c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef990: vmv.v.x v1,zero > 0x000000400c8ef994: vmerge.vim v1,v1,1,v0 > > > ## Masked vector arithmetic instructions (e.g. vadd) > AddMaskTestMerge case: > > import jdk.incubator.vector.IntVector; > import jdk.incubator.vector.VectorMask; > import jdk.incubator.vector.VectorOperators; > import jdk.incubator.vector.VectorSpecies; > > public class AddMaskTestMerge { > > static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128; > static final int SIZE = 1024; > static int[] a = new int[SIZE]; > static int[] b = new int[SIZE]; > static int[] r = new int[SIZE]; > static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false}; > static { > for (int i = 0; i < SIZE; i++) { > a[i] = i; > b[i] = i; > } > } > > static void workload(int idx) { > VectorMask<Integer> vmask = VectorMask.fromArray(SPECIES, c, 0); > IntVector av = IntVector.fromArray(SPECIES, a, idx); > IntVector bv = IntVector.fromArray(SPECIES, b, idx); > av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx); > } > > public static void main(String[] args) { > for (int i = 0; i < 30_0000; i++) { > for (int j = 0; j < SIZE; j += SPECIES.length()) { > workload(j); > } > } > } > } > > > This test case is reduced from the existing jtreg vector test Int128VectorTests.java[2].
This test case exercises the masked version of the vector add instruction; the other instructions are similar. > > Before this patch, the compilation log did not contain RVV-related instructions. Now the compilation log is as follows: > > > 0ae B10: # out( B25 B11 ) <- in( B9 ) Freq: 0.999991 > 0ae loadV V1, [R31] # vector (rvv) > 0b6 vloadmask V0, V2 > 0be vadd.vv V3, V1, V0 #@vaddI_masked > 0c6 lwu R28, [R7, #124] # loadN, compressed ptr, #@loadN ! Field: AddMaskTestMerge.r > 0ca decode_heap_oop R28, R28 #@decodeHeapOop > 0cc lwu R7, [R28, #12] # range, #@loadRange > 0d0 NullCheck R28 > > > And the jit code is as follows: > > > 0x000000400c823cee: vsetvli t0,zero,e32,m1,tu,mu > 0x000000400c823cf2: vle32.v v1,(t6) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::intoArray@43 (line 3228) > ; - AddMaskTestMerge::workload@46 (line 25) > 0x000000400c823cf6: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c823cfa: vmsne.vx v0,v2,zero ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.VectorMask::fromArray@47 (line 208) > ; - AddMaskTestMerge::workload@7 (line 22) > 0x000000400c823cfe: vsetvli t0,zero,e32,m1,tu,mu > 0x000000400c823d02: vadd.vv v3,v3,v1,v0.t ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::lanewiseTemplate@192 (line 834) > ; - jdk.incubator.vector.Int128Vector::lanewise@9 (line 291) > ; - jdk.incubator.vector.Int128Vector::lanewise@4 (line 41) > ; - AddMaskTestMerge::workload@39 (line 25) > > > ## Mask register allocation & mask bit operation > Since v0 is to be used as a mask register in the spec[1], sometimes we need two vmasks to do the vector mask logical ops like `AndVMask, OrVMask, XorVMask`. And if only v0 and v31 mask registers are defined, the corresponding c2 nodes will not be generated correctly because of the register pressure[3].
> When we use only v0 and v31 as mask registers, jtreg testing of Byte128VectorTests.java[4] with `-XX:+PrintAssembly` and `-XX:LogFile` will not emit the expected RVV mask instruction. Instead, the following compilation failure log[3] is generated: > > > > > > > > > So we define v30 and v31 as mask registers too, and `AndVMask` will emit C2 JIT code like: > > vloadmask V0, V1 > vloadmask V30, V2 > vmask_and V0, V30, V0 > > We also modified the implementation of `spill_copy_vector_stack_to_stack` so that it no longer occupies the v0 register. In addition, we changed some nodes such as `vasr/vlsl/vlsr/vstring_x/varray_x/vclearArray_x`, which use v0 internally, so that C2 is aware that they use v0. > > ## vector load/store - predicated & blend operation > > Jtreg testing of Byte128VectorTests.java[4] with -XX:+PrintOptoAssembly and -XX:LogFile will print the following compilation log, which is generated by predicated vector load/store: > > 152 B21: # out( B22 ) <- in( B20 ) Freq: 0.499984 > 152 vmask_gen_L V0, R12 > 162 loadV_masked V1, V0, [R10] > 16e storeV_masked [R11], V0, V1 > > > And `VectorBlend` will generate the following compilation log (part of a rotate operation): > > 1ea vlsrBS V6, V1, V3 V0 > 1fe vlslBS V5, V1, V2 V0 > 212 vor.vv V2, V5, V6 #@vor > 21a vloadmask V0, V4 > 222 vmerge_vvm V1, V1, V2 # vector blend > 22a bgeu R9, R30, B56 #@cmpU_branch P=0.000001 C=-1.000000 > > > At the same time, we added the predicated nodes of `RShiftV/LShiftV/URShiftV`. Since this duplicated some code from the corresponding non-masked nodes, a small refactoring was done.
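As a rough illustration of the lane semantics that the masked arithmetic and blend nodes in the quoted description have to implement, here is a scalar model in plain Java. It is only a sketch: `MaskedLaneModel`, `maskedAdd` and `blend` are names made up for this note and are neither part of the patch nor of the Vector API. It assumes the documented Vector API behavior that inactive lanes of a masked `lanewise` keep the first operand's value, and that `blend` takes the second operand where the mask is set.

```java
import java.util.Arrays;

// Hypothetical scalar model of masked lane operations; not HotSpot or Vector API code.
public class MaskedLaneModel {

    // Masked ADD: active lanes get a[i] + b[i], inactive lanes keep a[i].
    static int[] maskedAdd(int[] a, int[] b, boolean[] m) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = m[i] ? a[i] + b[i] : a[i];
        }
        return r;
    }

    // Blend: lane i takes b[i] where the mask is set, otherwise a[i].
    static int[] blend(int[] a, int[] b, boolean[] m) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = m[i] ? b[i] : a[i];
        }
        return r;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4};
        int[] b = {10, 20, 30, 40};
        boolean[] m = {true, false, true, false};
        System.out.println(Arrays.toString(maskedAdd(a, b, m))); // [11, 2, 33, 4]
        System.out.println(Arrays.toString(blend(a, b, m)));     // [10, 2, 30, 4]
    }
}
```

Comparing a compiled Vector API kernel against a scalar loop of this shape is one plausible way to sanity-check a new backend rule, in the spirit of the jtreg vector tests cited in the description.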
> > > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc > [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java > [3] https://github.com/openjdk/jdk/blob/0deb648985b018653ccdaf193dc13b3cf21c088a/src/hotspot/share/opto/chaitin.cpp#L526 > [4] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Byte128VectorTests.java > > ### Testing: > > qemu with UseRVV: > - [x] Tier1 tests (release) > - [x] Tier2 tests (release) > - [x] Tier3 tests (release) > - [x] test/jdk/jdk/incubator/vector (release/fastdebug) Dingli Zhang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 12 additional commits since the last revision: - Merge remote-tracking branch 'upstream/master' into JDK-8302908-merge - Fix trailing whitespace - Handle unordered compares - Fix match_rule_supported_vector_masked - Simplify some arithmetic mask nodes - Add some pseudoinstruction and unify function name - Fix typo - Add loadstoremask support - Remove unneeded combination nodes - Fix typo and use match_rule_supported_vector instead of true - ... 
and 2 more: https://git.openjdk.org/jdk/compare/b5b736f6...a52686f4 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12682/files - new: https://git.openjdk.org/jdk/pull/12682/files/e9a707ea..a52686f4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=20 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=19-20 Stats: 259256 lines in 2244 files changed: 226626 ins; 19706 del; 12924 mod Patch: https://git.openjdk.org/jdk/pull/12682.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12682/head:pull/12682 PR: https://git.openjdk.org/jdk/pull/12682 From dzhang at openjdk.org Tue Apr 18 06:36:50 2023 From: dzhang at openjdk.org (Dingli Zhang) Date: Tue, 18 Apr 2023 06:36:50 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v12] In-Reply-To: <8TwWTG4t0LW36zLNaJx_oIb8qhhg4_O63mGC7IVn8ZM=.318f53c2-c4ec-4e76-acec-a1d8db53a0ce@github.com> References: <8TwWTG4t0LW36zLNaJx_oIb8qhhg4_O63mGC7IVn8ZM=.318f53c2-c4ec-4e76-acec-a1d8db53a0ce@github.com> Message-ID: <2SJdL1ESHCrOW0gGwOqditQBCvODlTWAWHXFBTzsmJw=.c9de8ef1-30e1-4b71-80fe-0e763f9e716e@github.com> On Mon, 17 Apr 2023 13:53:02 GMT, Dingli Zhang wrote: >> Hi, @DingliZhang, I have the same concern here. As the scalar version of Float-point comparison, we have `is_unordered` flag to return the right result if operands contain NaN(s) [1]. Does vfcmp need this flag too? >> >> Also, I see an example of implementing `isgreater()` when operands contain NaN(s) for vfcmp [2], which checks NaN for both operands. >> >> 1. https://github.com/openjdk/jdk/blob/7f56de8f78c0b54e5cf313f53213102a3495234f/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp#L1010-L1037 >> 2. https://github.com/riscv/riscv-v-spec/blob/b9afd6f5709fe3f91ce39bb83695bcfaa78eef94/v-spec.adoc?plain=1#L3700-L3706 > > Hi @feilongjiang , thanks for the view! > > I think maybe we don't need to distinguish whether it is unordered or not here at the moment. 
For example, the aarch64 greater(than) or less(than) comparison of sve_compare[1] in the vmaskcmp call uses the GE/GT[2][3] in the Assembler condition to make the determination (instead of using the LE/LT flag bits, which would include unordered as well). > > The most obvious difference in the results between the sequence of `isgreater()`[4] and the sequence of this patch is that `isgreater()` does not set the invalid operation exception flag (NV=1) if operands contain NaN(s). It seems it would be better to use a sequence like `isgreater()` for the case where operands contain NaN(s). > > We will update the PR later. > > [1] https://github.com/openjdk/jdk/blob/2b81faeb3514060e6c8c950ef4e39e299c43199d/src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp#L1127-L1143 > [2] https://developer.arm.com/documentation/ddi0596/2021-12/Shared-Pseudocode/Shared-Functions?lang=en#impl-shared.FPCompareGE.3 > [3] https://developer.arm.com/documentation/ddi0596/2021-12/Shared-Pseudocode/Shared-Functions?lang=en#impl-shared.FPCompareGT.3 > [4] https://github.com/riscv/riscv-v-spec/blob/b9afd6f5709fe3f91ce39bb83695bcfaa78eef94/v-spec.adoc?plain=1#L3700-L3706 Fixed. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1169557323 From rrich at openjdk.org Tue Apr 18 07:03:00 2023 From: rrich at openjdk.org (Richard Reingruber) Date: Tue, 18 Apr 2023 07:03:00 GMT Subject: Integrated: 8305668: PPC: Non-Top Interpreted frames should be independent of ABI_ELFv2 In-Reply-To: References: Message-ID: On Thu, 6 Apr 2023 13:22:49 GMT, Richard Reingruber wrote: > This PR makes parent interpreted Java frames independent of `ABI_ELFv2`. > With the changes `test/jdk/jdk/internal/vm/Continuation/BasicExt.java#COMP_ALL` succeeds on PPC64 Big Endian Linux.
> > Before: > > * `parent_ijava_frame_abi` was derived from `abi_minframe` which depends on `ABI_ELFv2` > * jit_abi is independent of `abi_minframe` > * `frame::metadata_words` is wrong for `parent_ijava_frame_abi` if `ABI_ELFv2` is not defined (big endian) > > After changes: > > * prefixed structs that depend on `ABI_ELFv2` with `native_` > * introduced `java_abi` which is independent of `ABI_ELFv2` > * `frame::metadata_words` is the size in words of `java_abi` > > This is still a little imprecise since `top_ijava_frame_abi` is larger than `java_abi` but the top frame is never frozen as it is always `vmIntrinsics::_Continuation_doYield` > > Testing: > > PPC64le: most JCK and JTREG tiers 1-4, also in Xcomp mode. > PPC64be Linux: hotspot tier1 tests This pull request has now been integrated. Changeset: 445ebef4 Author: Richard Reingruber URL: https://git.openjdk.org/jdk/commit/445ebef4371569b574af698138dccb159ce95602 Stats: 161 lines in 21 files changed: 13 ins; 13 del; 135 mod 8305668: PPC: Non-Top Interpreted frames should be independent of ABI_ELFv2 Reviewed-by: mdoerr ------------- PR: https://git.openjdk.org/jdk/pull/13372 From yzhu at openjdk.org Tue Apr 18 07:05:50 2023 From: yzhu at openjdk.org (Yanhong Zhu) Date: Tue, 18 Apr 2023 07:05:50 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v19] In-Reply-To: References: Message-ID: <64ngA7LrbMqHMbMns6QT9pKQh4NLyeJJlgjbXyTmYJk=.c07b846c-0108-4460-a20b-80661709f67c@github.com> On Mon, 17 Apr 2023 14:12:34 GMT, Dingli Zhang wrote: >> HI, >> >> We have added support for vector add mask instructions, please take a look and have some reviews. Thanks a lot! >> This patch will add support of vector add/sub/mul/div mask version. It was implemented by referring to RVV v1.0 [1]. >> >> ## Load/Store/Cmp Mask >> `VectorLoadMask, VectorMaskCmp, VectorStoreMask` will implement the mask datapath. 
We can see where the data is passed in the compilation log with `jdk/incubator/vector/Byte128VectorTests.java`? >> >> 218 loadV V1, [R7] # vector (rvv) >> 220 vloadmask V0, V1 >> ... >> 23c vmaskcmp_rvv_masked V0, V4, V5, V0, V1, #0 >> 24c vstoremask V1, V0 >> 258 storeV [R7], V1 # vector (rvv) >> >> >> The corresponding generated jit assembly? >> >> # loadV >> 0x000000400c8ef958: vsetvli t0,zero,e8,m1,tu,mu >> 0x000000400c8ef95c: vle8.v v1,(t2) >> >> # vloadmask >> 0x000000400c8ef960: vsetvli t0,zero,e8,m1,tu, >> 0x000000400c8ef964: vmsne.vx v0,v1,zero >> >> # vmaskcmp_rvv_masked >> 0x000000400c8ef97c: vsetvli t0,zero,e8,m1,tu,mu >> 0x000000400c8ef980: vmclr.m v1 >> 0x000000400c8ef984: vmseq.vv v1,v4,v5,v0.t >> 0x000000400c8ef988: vmv1r.v v0,v1 >> >> # vstoremask >> 0x000000400c8ef98c: vsetvli t0,zero,e8,m1,tu,mu >> 0x000000400c8ef990: vmv.v.x v1,zero >> 0x000000400c8ef994: vmerge.vim v1,v1,1,v0 >> >> >> ## Masked vector arithmetic instructions (e.g. vadd) >> AddMaskTestMerge case: >> >> import jdk.incubator.vector.IntVector; >> import jdk.incubator.vector.VectorMask; >> import jdk.incubator.vector.VectorOperators; >> import jdk.incubator.vector.VectorSpecies; >> >> public class AddMaskTestMerge { >> >> static final VectorSpecies SPECIES = IntVector.SPECIES_128; >> static final int SIZE = 1024; >> static int[] a = new int[SIZE]; >> static int[] b = new int[SIZE]; >> static int[] r = new int[SIZE]; >> static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false}; >> static { >> for (int i = 0; i < SIZE; i++) { >> a[i] = i; >> b[i] = i; >> } >> } >> >> static void workload(int idx) { >> VectorMask vmask = VectorMask.fromArray(SPECIES, c, 0); >> IntVector av = IntVector.fromArray(SPECIES, a, idx); >> IntVector bv = IntVector.fromArray(SPECIES, b, idx); >> av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx); >> } >> >> public static void main(String[] args) { >> for (int i = 0; i < 30_0000; i++) { >> for (int j = 0; j < SIZE; j += 
SPECIES.length()) { >> workload(j); >> } >> } >> } >> } >> >> >> This test case is reduced from existing jtreg vector tests Int128VectorTests.java[2]. This test case corresponds to the add instruction of the vector mask version and other instructions are similar. >> >> Before this patch, the compilation log will not print RVV-related instructions. Now the compilation log is as follows: >> >> >> 0ae B10: # out( B25 B11 ) <- in( B9 ) Freq: 0.999991 >> 0ae loadV V1, [R31] # vector (rvv) >> 0b6 vloadmask V0, V2 >> 0be vadd.vv V3, V1, V0 #@vaddI_masked >> 0c6 lwu R28, [R7, #124] # loadN, compressed ptr, #@loadN ! Field: AddMaskTestMerge.r >> 0ca decode_heap_oop R28, R28 #@decodeHeapOop >> 0cc lwu R7, [R28, #12] # range, #@loadRange >> 0d0 NullCheck R28 >> >> >> And the jit code is as follows: >> >> >> 0x000000400c823cee: vsetvli t0,zero,e32,m1,tu,mu >> 0x000000400c823cf2: vle32.v v1,(t6) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0} >> ; - jdk.incubator.vector.IntVector::intoArray at 43 (line 3228) >> ; - AddMaskTestMerge::workload at 46 (line 25) >> 0x000000400c823cf6: vsetvli t0,zero,e8,m1,tu,mu >> 0x000000400c823cfa: vmsne.vx v0,v2,zero ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0} >> ; - jdk.incubator.vector.VectorMask::fromArray at 47 (line 208) >> ; - AddMaskTestMerge::workload at 7 (line 22) >> 0x000000400c823cfe: vsetvli t0,zero,e32,m1,tu,mu >> 0x000000400c823d02: vadd.vv v3,v3,v1,v0.t ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0} >> ; - jdk.incubator.vector.IntVector::lanewiseTemplate at 192 (line 834) >> ; - jdk.incubator.vector.Int128Vector::lanewise at 9 (line 291) >> ; - jdk.incubator.vector.Int128Vector::lanewise at 4 (line 41) >> ; - AddMaskTestMerge::workload at 39 (line 25) >> >> >> ## Mask register allocation & mask bit opreation >> Since v0 is to be used as a mask register in spec[1], sometimes we need two vmask to do the vector mask logical ops like `AndVMask, OrVMask, XorVMask`. 
And if only v0 and v31 mask registers are defined, the corresponding c2 nodes will not be generated correctly because of the register pressure[3]. >> When we use only v0 and v31 as mask registers, jtreg testing of Byte128VectorTests.java[4] with `-XX:+PrintAssembly` and `-XX:LogFile` will not emit the expected RVV mask instruction. Instead, the following compilation failure log[3] is generated: >> >> >> >> >> >> >> >> >> So we define v30 and v31 as mask registers too, and `AndVMask` will emit C2 JIT code like: >> >> vloadmask V0, V1 >> vloadmask V30, V2 >> vmask_and V0, V30, V0 >> >> We also modified the implementation of `spill_copy_vector_stack_to_stack` so that it no longer occupies the v0 register. In addition, we changed some nodes such as `vasr/vlsl/vlsr/vstring_x/varray_x/vclearArray_x`, which use v0 internally, so that C2 is aware that they use v0. >> >> ## vector load/store - predicated & blend operation >> >> Jtreg testing of Byte128VectorTests.java[4] with -XX:+PrintOptoAssembly and -XX:LogFile will print the following compilation log, which is generated by predicated vector load/store: >> >> 152 B21: # out( B22 ) <- in( B20 ) Freq: 0.499984 >> 152 vmask_gen_L V0, R12 >> 162 loadV_masked V1, V0, [R10] >> 16e storeV_masked [R11], V0, V1 >> >> >> And `VectorBlend` will generate the following compilation log (part of a rotate operation): >> >> 1ea vlsrBS V6, V1, V3 V0 >> 1fe vlslBS V5, V1, V2 V0 >> 212 vor.vv V2, V5, V6 #@vor >> 21a vloadmask V0, V4 >> 222 vmerge_vvm V1, V1, V2 # vector blend >> 22a bgeu R9, R30, B56 #@cmpU_branch P=0.000001 C=-1.000000 >> >> >> At the same time, we added the predicated nodes of `RShiftV/LShiftV/URShiftV`. Since this duplicated some code from the corresponding non-masked nodes, a small refactoring was done.
>> >> >> [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc >> [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java >> [3] https://github.com/openjdk/jdk/blob/0deb648985b018653ccdaf193dc13b3cf21c088a/src/hotspot/share/opto/chaitin.cpp#L526 >> [4] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Byte128VectorTests.java >> >> ### Testing: >> >> qemu with UseRVV: >> - [x] Tier1 tests (release) >> - [x] Tier2 tests (release) >> - [x] Tier3 tests (release) >> - [x] test/jdk/jdk/incubator/vector (release/fastdebug) > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Fix match_rule_supported_vector_masked Marked as reviewed by yzhu (Author). ------------- PR Review: https://git.openjdk.org/jdk/pull/12682#pullrequestreview-1389420655 From yzhu at openjdk.org Tue Apr 18 07:05:52 2023 From: yzhu at openjdk.org (Yanhong Zhu) Date: Tue, 18 Apr 2023 07:05:52 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v12] In-Reply-To: <2SJdL1ESHCrOW0gGwOqditQBCvODlTWAWHXFBTzsmJw=.c9de8ef1-30e1-4b71-80fe-0e763f9e716e@github.com> References: <8TwWTG4t0LW36zLNaJx_oIb8qhhg4_O63mGC7IVn8ZM=.318f53c2-c4ec-4e76-acec-a1d8db53a0ce@github.com> <2SJdL1ESHCrOW0gGwOqditQBCvODlTWAWHXFBTzsmJw=.c9de8ef1-30e1-4b71-80fe-0e763f9e716e@github.com> Message-ID: On Tue, 18 Apr 2023 06:34:49 GMT, Dingli Zhang wrote: >> Hi @feilongjiang , thanks for the view! >> >> I think maybe we don't need to distinguish whether it is unordered or not here at the moment. For example, the aarch64 greater(than) or less(than) comparison of sve_comapre[1] in the vmaskcmp call uses the GE/GT[2][3] in the Assembler condition to make the determination (instead of using the LE/LT flag bits, which would include unordered as well). 
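For context on the ordered/unordered distinction in this exchange, a small plain-Java sketch may help. It is not code from the patch; `NanCompareDemo`, `gt` and `notLe` are hypothetical names. It only relies on the Java language rule that `<`, `<=`, `>` and `>=` evaluate to false when either operand is NaN, which is why "greater than" cannot be derived by negating "less than or equal" once NaN is involved.

```java
// Illustrative only; class and method names are hypothetical.
public class NanCompareDemo {

    // Ordered greater-than: false whenever either operand is NaN.
    static boolean gt(float a, float b) {
        return a > b;
    }

    // Negated less-or-equal: true for NaN operands, so it also covers the unordered case.
    static boolean notLe(float a, float b) {
        return !(a <= b);
    }

    public static void main(String[] args) {
        float nan = Float.NaN;
        System.out.println(gt(nan, 1.0f));    // false
        System.out.println(gt(1.0f, nan));    // false
        System.out.println(notLe(nan, 1.0f)); // true
        // For ordinary (non-NaN) operands the two forms agree.
        System.out.println(gt(2.0f, 1.0f) == notLe(2.0f, 1.0f)); // true
    }
}
```

A mask lane produced for `GT` should therefore be 0 when either input lane is NaN, which matches the behavior the thread describes for the RVV compare instructions.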
>> >> The most obvious difference in the results between the sequence of `isgreater()`[4] and the sequence of this patch is that `isgreater()` does not set the invalid operation exception flag (NV=1) if operands contain NaN(s). It seems it would be better to use a sequence like `isgreater()` for the case where operands contain NaN(s). >> >> We will update the PR later. >> >> [1] https://github.com/openjdk/jdk/blob/2b81faeb3514060e6c8c950ef4e39e299c43199d/src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp#L1127-L1143 >> [2] https://developer.arm.com/documentation/ddi0596/2021-12/Shared-Pseudocode/Shared-Functions?lang=en#impl-shared.FPCompareGE.3 >> [3] https://developer.arm.com/documentation/ddi0596/2021-12/Shared-Pseudocode/Shared-Functions?lang=en#impl-shared.FPCompareGT.3 >> [4] https://github.com/riscv/riscv-v-spec/blob/b9afd6f5709fe3f91ce39bb83695bcfaa78eef94/v-spec.adoc?plain=1#L3700-L3706 > > Fixed. > Thanks for the review! > > I think there may be no problem here. The floating-point compare instructions in RVV follow the semantics of the scalar floating-point compare instructions[1]. For all three instructions (FEQ.S, FLT.S, FLE.S), the result is 0 if either operand is NaN[2]. So when one of the operands is NaN, `BoolTest::ge`, `BoolTest::gt`, `BoolTest::le` and `BoolTest::lt` will all generate a 0 on the corresponding bit. > > Also a jtreg test case[3] proves that our current logic is fine. `GTFloat512VectorTests` covers the case where the input is NaN.
The test will pass properly and generate the following compilation log which contains `vmaskcmp_rvv`: > > ``` > 1ac B20: # out( B49 B21 ) <- in( B48 B19 ) Freq: 4188.06 > 1ac vmaskcmp_rvv V0, V4, V5, #3 > 1b8 > 1b8 MEMBAR-store-store #@membar_storestore > 1bc # checkcastPP of R11, #@checkCastPP > 1bc vstoremask V1, V0 > 1c8 addi R7, R11, #16 # ptr, #@addP_reg_imm > 1cc spill R11 -> [sp, #104] # spill size = 64 > 1ce storeV [R7], V1 # vector (rvv) > 1d6 ld R19, [R23, #264] # ptr, #@loadP > 1da ld R7, [R23, #280] # ptr, #@loadP > 1de addi R28, R19, #16 # ptr, #@addP_reg_imm > 1e2 bgeu R28, R7, B49 #@cmpP_branch P=0.000100 C=-1.000000 > ``` > > [1] https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#1313-vector-floating-point-compare-instructions [2] https://github.com/riscv/riscv-isa-manual/releases/download/draft-20230131-c0b298a/riscv-spec.pdf [3] https://github.com/openjdk/jdk/tree/master/test/jdk/jdk/incubator/vector/Float512VectorTests.java Thank you for your explanation. Looks good to me. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1169583928 From thartmann at openjdk.org Tue Apr 18 07:06:51 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 18 Apr 2023 07:06:51 GMT Subject: RFR: 8051725: Questionable if-conversion involving SETNE [v2] In-Reply-To: References: Message-ID: <5AhwzeWlzoPsRh_-UiA5g6_vqWQ3fO41BD8-eQRbHKI=.0c0a6b86-d547-46f9-85aa-5ec74a8867ea@github.com> On Mon, 17 Apr 2023 04:32:29 GMT, Jasmine Karthikeyan wrote: >> Hi, I've created optimizations for the x86 lowering of `Conv2B` nodes, when followed immediately by an xor of 1. This pattern is fairly common, and can arise from both [cmov idealization](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/movenode.cpp#L241) and [diamond-phi optimization](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L1571). 
The optimization here is using the `sete` instruction instead of always using `setne` and flipping the bit with xor afterwards. According to the Intel optimization guide (pages 3-26 and 3-27), this sequence is preferred over `cmp $0, %src` as it prevents the need to encode the constant in the assembly sequence. A similar rule exists in the PPC backend, here: https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/ppc/ppc.ad#L10462. I've attached some performance testing but I think the real world improvements will be less significant; the motivation is primarily to decrease the number of instructions that are generated, as that can help in cases where applications are I-Cache bound. >> >> >> Baseline Patch Improvement >> Benchmark Mode Cnt Score Error Units Score Error Units >> Conv2BRules.testEquals0 avgt 10 47.566 ± 0.346 ns/op / 37.904 ± 1.856 ns/op + 22.6% >> Conv2BRules.testNotEquals0 avgt 10 37.167 ± 0.211 ns/op / 37.352 ± 1.529 ns/op (unchanged) >> Conv2BRules.testEquals1 avgt 10 35.059 ± 0.280 ns/op / 34.847 ± 0.160 ns/op (unchanged) >> Conv2BRules.testEqualsNull avgt 10 56.768 ± 2.600 ns/op / 46.916 ± 0.308 ns/op + 19.0% >> Conv2BRules.testNotEqualsNull avgt 10 47.447 ± 1.193 ns/op / 46.974 ± 0.218 ns/op (unchanged) >> >> >> This change also cleans up some code relating to `Assembler::set_byte_if_not_zero`, as that function duplicates behavior with `Assembler::setne`. The 32-bit only version of that method is never called as the only other usage is in the C1 LIR assembler, which is also guarded behind a 64-bit check, so I opted to remove it entirely and replace usages with `Assembler::setne`. Reviews would be greatly appreciated! >> >> Testing: tier1-2 on linux x64, GHA > > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Re-work transform to happen in macro expansion I'm a bit confused: why do you need the new match rules if Conv2B nodes are now macro expanded to CMove?
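For readers following the thread, the identity behind the quoted `sete` optimization can be sketched in plain Java. This is only an illustration of the value-level equivalence, not the matcher rule from the PR; `Conv2BXorDemo` and its method names are hypothetical. It assumes `Conv2B` maps zero to 0 and any non-zero value to 1, as the quoted description implies.

```java
// Illustrative only; models the values, not the generated instructions.
public class Conv2BXorDemo {

    // Conv2B: 0 stays 0, any non-zero value becomes 1 (the `setne` result).
    static int conv2b(int x) {
        return x != 0 ? 1 : 0;
    }

    // The targeted pattern: Conv2B followed by an xor with 1.
    static int conv2bXor1(int x) {
        return conv2b(x) ^ 1;
    }

    // The direct form, corresponding to a single `sete`.
    static int isZero(int x) {
        return x == 0 ? 1 : 0;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 42, Integer.MIN_VALUE, Integer.MAX_VALUE};
        for (int x : samples) {
            if (conv2bXor1(x) != isZero(x)) {
                throw new AssertionError("mismatch for " + x);
            }
        }
        System.out.println("conv2b(x) ^ 1 == (x == 0 ? 1 : 0) for all samples");
    }
}
```

Because the two forms agree on every input, a backend is free to emit the shorter one; whether that is best done with match rules or during macro expansion is exactly the question raised in the review comment.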
------------- PR Review: https://git.openjdk.org/jdk/pull/13345#pullrequestreview-1389422395 From dzhang at openjdk.org Tue Apr 18 07:11:54 2023 From: dzhang at openjdk.org (Dingli Zhang) Date: Tue, 18 Apr 2023 07:11:54 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v22] In-Reply-To: References: Message-ID: <7cvHLdmmY_pzqr6YRqBQDqMU2oDV6BrsG_PnQ5g5UpE=.7942ff03-47f8-4a0e-93d4-40fd0f8f5274@github.com> > HI, > > We have added support for vector add mask instructions, please take a look and have some reviews. Thanks a lot! > This patch will add support of vector add/sub/mul/div mask version. It was implemented by referring to RVV v1.0 [1]. > > ## Load/Store/Cmp Mask > `VectorLoadMask, VectorMaskCmp, VectorStoreMask` will implement the mask datapath. We can see where the data is passed in the compilation log with `jdk/incubator/vector/Byte128VectorTests.java`? > > 218 loadV V1, [R7] # vector (rvv) > 220 vloadmask V0, V1 > ... > 23c vmaskcmp_rvv_masked V0, V4, V5, V0, V1, #0 > 24c vstoremask V1, V0 > 258 storeV [R7], V1 # vector (rvv) > > > The corresponding generated jit assembly? > > # loadV > 0x000000400c8ef958: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef95c: vle8.v v1,(t2) > > # vloadmask > 0x000000400c8ef960: vsetvli t0,zero,e8,m1,tu, > 0x000000400c8ef964: vmsne.vx v0,v1,zero > > # vmaskcmp_rvv_masked > 0x000000400c8ef97c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef980: vmclr.m v1 > 0x000000400c8ef984: vmseq.vv v1,v4,v5,v0.t > 0x000000400c8ef988: vmv1r.v v0,v1 > > # vstoremask > 0x000000400c8ef98c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef990: vmv.v.x v1,zero > 0x000000400c8ef994: vmerge.vim v1,v1,1,v0 > > > ## Masked vector arithmetic instructions (e.g. 
vadd) > AddMaskTestMerge case: > > import jdk.incubator.vector.IntVector; > import jdk.incubator.vector.VectorMask; > import jdk.incubator.vector.VectorOperators; > import jdk.incubator.vector.VectorSpecies; > > public class AddMaskTestMerge { > > static final VectorSpecies SPECIES = IntVector.SPECIES_128; > static final int SIZE = 1024; > static int[] a = new int[SIZE]; > static int[] b = new int[SIZE]; > static int[] r = new int[SIZE]; > static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false}; > static { > for (int i = 0; i < SIZE; i++) { > a[i] = i; > b[i] = i; > } > } > > static void workload(int idx) { > VectorMask vmask = VectorMask.fromArray(SPECIES, c, 0); > IntVector av = IntVector.fromArray(SPECIES, a, idx); > IntVector bv = IntVector.fromArray(SPECIES, b, idx); > av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx); > } > > public static void main(String[] args) { > for (int i = 0; i < 30_0000; i++) { > for (int j = 0; j < SIZE; j += SPECIES.length()) { > workload(j); > } > } > } > } > > > This test case is reduced from existing jtreg vector tests Int128VectorTests.java[2]. This test case corresponds to the add instruction of the vector mask version and other instructions are similar. > > Before this patch, the compilation log will not print RVV-related instructions. Now the compilation log is as follows: > > > 0ae B10: # out( B25 B11 ) <- in( B9 ) Freq: 0.999991 > 0ae loadV V1, [R31] # vector (rvv) > 0b6 vloadmask V0, V2 > 0be vadd.vv V3, V1, V0 #@vaddI_masked > 0c6 lwu R28, [R7, #124] # loadN, compressed ptr, #@loadN ! 
Field: AddMaskTestMerge.r > 0ca decode_heap_oop R28, R28 #@decodeHeapOop > 0cc lwu R7, [R28, #12] # range, #@loadRange > 0d0 NullCheck R28 > > > And the jit code is as follows: > > > 0x000000400c823cee: vsetvli t0,zero,e32,m1,tu,mu > 0x000000400c823cf2: vle32.v v1,(t6) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::intoArray at 43 (line 3228) > ; - AddMaskTestMerge::workload at 46 (line 25) > 0x000000400c823cf6: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c823cfa: vmsne.vx v0,v2,zero ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.VectorMask::fromArray at 47 (line 208) > ; - AddMaskTestMerge::workload at 7 (line 22) > 0x000000400c823cfe: vsetvli t0,zero,e32,m1,tu,mu > 0x000000400c823d02: vadd.vv v3,v3,v1,v0.t ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::lanewiseTemplate at 192 (line 834) > ; - jdk.incubator.vector.Int128Vector::lanewise at 9 (line 291) > ; - jdk.incubator.vector.Int128Vector::lanewise at 4 (line 41) > ; - AddMaskTestMerge::workload at 39 (line 25) > > > ## Mask register allocation & mask bit opreation > Since v0 is to be used as a mask register in spec[1], sometimes we need two vmask to do the vector mask logical ops like `AndVMask, OrVMask, XorVMask`. And if only v0 and v31 mask registers are defined, the corresponding c2 nodes will not be generated correctly because of the register pressure[3]. > When we use only v0 and v31 as mask registers, jtreg testing of Byte128VectorTests.java[4] with `-XX:+PrintAssembly` and `-XX:LogFile` will not emit the expected rvv mask instruction. 
Instead, the following compilation failure log[3] is generated: > > > > > > > > > So we define v30 and v31 as mask registers too, and `AndVMask` will emit C2 JIT code like: > > vloadmask V0, V1 > vloadmask V30, V2 > vmask_and V0, V30, V0 > > We also modified the implementation of `spill_copy_vector_stack_to_stack` so that it no longer occupies the v0 register. In addition, we changed some nodes such as `vasr/vlsl/vlsr/vstring_x/varray_x/vclearArray_x`, which use v0 internally, so that C2 is aware that they use v0. > > ## vector load/store - predicated & blend operation > > Jtreg testing of Byte128VectorTests.java[4] with -XX:+PrintOptoAssembly and -XX:LogFile will print the following compilation log, which is generated by predicated vector load/store: > > 152 B21: # out( B22 ) <- in( B20 ) Freq: 0.499984 > 152 vmask_gen_L V0, R12 > 162 loadV_masked V1, V0, [R10] > 16e storeV_masked [R11], V0, V1 > > > And `VectorBlend` will generate the following compilation log (part of a rotate operation): > > 1ea vlsrBS V6, V1, V3 V0 > 1fe vlslBS V5, V1, V2 V0 > 212 vor.vv V2, V5, V6 #@vor > 21a vloadmask V0, V4 > 222 vmerge_vvm V1, V1, V2 # vector blend > 22a bgeu R9, R30, B56 #@cmpU_branch P=0.000001 C=-1.000000 > > > At the same time, we added the predicated nodes of `RShiftV/LShiftV/URShiftV`. Since this duplicated some code from the corresponding non-masked nodes, a small refactoring was done.
> > > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc > [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java > [3] https://github.com/openjdk/jdk/blob/0deb648985b018653ccdaf193dc13b3cf21c088a/src/hotspot/share/opto/chaitin.cpp#L526 > [4] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Byte128VectorTests.java > > ### Testing: > > qemu with UseRVV: > - [x] Tier1 tests (release) > - [x] Tier2 tests (release) > - [x] Tier3 tests (release) > - [x] test/jdk/jdk/incubator/vector (release/fastdebug) Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: Fix build fail after JDK-8305008 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12682/files - new: https://git.openjdk.org/jdk/pull/12682/files/a52686f4..f5974d48 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=21 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=20-21 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/12682.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12682/head:pull/12682 PR: https://git.openjdk.org/jdk/pull/12682 From thartmann at openjdk.org Tue Apr 18 07:52:47 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 18 Apr 2023 07:52:47 GMT Subject: RFR: 8304948: [vectorapi] C2 crashes when expanding VectorBox In-Reply-To: References: Message-ID: <1yri9IEJiIiarInYDIRxDllndOsk9qPHsN0ABQbFe2o=.5220987f-7f46-46fc-8a2b-414604ea2c7b@github.com> On Mon, 17 Apr 2023 08:43:28 GMT, Eric Liu wrote: > This patch fixes C2 failure with SIGSEGV due to endless recursion. 
> > With test case VectorBoxExpandTest.java in this patch, C2 would generate IR graph like below: > > > ------------ > / \ > Region | VectorBox | > \ | / | > Phi | > | | > | | > Region | VectorBox | > \ | / | > Phi | > | | > |------------/ > | > > > > This Phi will be optimized by merge_through_phi [1], which transforms `Phi (VectorBox VectorBox)` into `VectorBox (Phi Phi)` to pursue opportunity of combining VectorBox with VectorUnbox. In this process, either the pre type check [2] or the process cloning Phi nodes [3], the circle case is well considered to avoid falling into endless loop. > > After merge_through_phi, each input Phi of new VectorBox has the same shape with original root Phi before merging (only VectorBox has been replaced). After several other optimizations, C2 would expand VectorBox [4] on a graph like below: > > > ------------ > / \ > Region | Proj | > \ | / | > Phi | > | | > | | > Region | Proj | > \ | / | > Phi | > | | > |------------/ > | > | Phi > | / > VectorBox > > > which the circle case should be taken into consideration as well. > > [TEST] > Full Jtreg passed without new failure. > > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L2554 > [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L2571 > [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L2531 > [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/vector.cpp#L311 Looks reasonable to me. src/hotspot/share/opto/vector.cpp line 307: > 305: void PhaseVector::expand_vbox_node(VectorBoxNode* vec_box) { > 306: if (vec_box->outcnt() > 0) { > 307: VectorSet visited; Do we need a ResourceMark here or above? src/hotspot/share/opto/vector.cpp line 327: > 325: // Phi (VectorBox VectorBox) => VectorBox (Phi Phi) > 326: if (visited.test_set(vbox->_idx)) { > 327: assert(vbox->is_Phi(), "not a phi"); Are we sure that the cycle is always detected at a phi? 
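The guard in the hunk above is a standard visited-set check for walking a graph that may contain Phi cycles. A minimal sketch of the idea in plain Java follows. It is not the HotSpot implementation: `Node`, `countLeaves` and `buildCyclicPhis` are hypothetical names for this note, and `java.util.BitSet` merely stands in for C2's `VectorSet`.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical model of a cycle-safe recursive walk; not C2 code.
public class CycleSafeWalk {

    static final class Node {
        final int idx;                              // unique index, like a node's _idx
        final String kind;                          // e.g. "Phi" or "Proj"
        final List<Node> inputs = new ArrayList<>();

        Node(int idx, String kind) {
            this.idx = idx;
            this.kind = kind;
        }
    }

    // Counts the input-less nodes reachable from n. The visited set makes the
    // recursion terminate even when Phi nodes feed each other in a loop.
    static int countLeaves(Node n, BitSet visited) {
        if (visited.get(n.idx)) {
            return 0;                               // already handled: stop instead of recursing forever
        }
        visited.set(n.idx);
        if (n.inputs.isEmpty()) {
            return 1;
        }
        int total = 0;
        for (Node in : n.inputs) {
            total += countLeaves(in, visited);
        }
        return total;
    }

    // Two Phis that reference each other, each with one Proj input,
    // loosely mirroring the shape drawn in the PR description.
    static Node buildCyclicPhis() {
        Node proj1 = new Node(0, "Proj");
        Node proj2 = new Node(1, "Proj");
        Node phi1 = new Node(2, "Phi");
        Node phi2 = new Node(3, "Phi");
        phi1.inputs.add(proj1);
        phi1.inputs.add(phi2);
        phi2.inputs.add(proj2);
        phi2.inputs.add(phi1);
        return phi1;
    }

    public static void main(String[] args) {
        System.out.println(countLeaves(buildCyclicPhis(), new BitSet())); // 2
    }
}
```

Without the `visited` check the walk from `phi1` would bounce between the two Phis until the stack overflows, which is the kind of endless recursion the PR describes. In this model a revisit can only happen at a node with inputs, which is the analogue of the question asked above about the cycle always being detected at a phi.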
test/hotspot/jtreg/compiler/vectorapi/VectorBoxExpandTest.java line 83: > 81: // VectorBox > 82: // > 83: // which the circle case should be taken into consideration as well. Suggestion: // where the circle case should be taken into consideration as well. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13489#pullrequestreview-1389432514 PR Review Comment: https://git.openjdk.org/jdk/pull/13489#discussion_r1169593574 PR Review Comment: https://git.openjdk.org/jdk/pull/13489#discussion_r1169634345 PR Review Comment: https://git.openjdk.org/jdk/pull/13489#discussion_r1169591421 From gli at openjdk.org Tue Apr 18 08:01:41 2023 From: gli at openjdk.org (Guoxiong Li) Date: Tue, 18 Apr 2023 08:01:41 GMT Subject: RFR: 8305690: [X86] Do not emit two REX prefixes in Assembler::prefix In-Reply-To: References: Message-ID: On Thu, 6 Apr 2023 05:36:06 GMT, Guoxiong Li wrote: > Hi all, > > This patch prevents `Assembler::prefix` from emitting two `REX prefixes`. The current code in mainline works well because the corresponding code path is not triggered. > > Thanks for the review. > > Best Regards, > -- Guoxiong Thanks for the review. Integrating. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13369#issuecomment-1512631116 From gli at openjdk.org Tue Apr 18 08:04:56 2023 From: gli at openjdk.org (Guoxiong Li) Date: Tue, 18 Apr 2023 08:04:56 GMT Subject: Integrated: 8305690: [X86] Do not emit two REX prefixes in Assembler::prefix In-Reply-To: References: Message-ID: On Thu, 6 Apr 2023 05:36:06 GMT, Guoxiong Li wrote: > Hi all, > > This patch prevents `Assembler::prefix` from emitting two `REX prefixes`. The current code in mainline works well because the corresponding code path is not triggered. > > Thanks for the review. > > Best Regards, > -- Guoxiong This pull request has now been integrated. 
Changeset: 49726ee3 Author: Guoxiong Li URL: https://git.openjdk.org/jdk/commit/49726ee3a95023a912aacad0e3714eae146eed21 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod 8305690: [X86] Do not emit two REX prefixes in Assembler::prefix Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/13369 From qamai at openjdk.org Tue Apr 18 08:19:45 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Tue, 18 Apr 2023 08:19:45 GMT Subject: RFR: 8304948: [vectorapi] C2 crashes when expanding VectorBox In-Reply-To: References: Message-ID: On Mon, 17 Apr 2023 08:43:28 GMT, Eric Liu wrote: > This patch fixes C2 failure with SIGSEGV due to endless recursion. > > With test case VectorBoxExpandTest.java in this patch, C2 would generate IR graph like below: > > > ------------ > / \ > Region | VectorBox | > \ | / | > Phi | > | | > | | > Region | VectorBox | > \ | / | > Phi | > | | > |------------/ > | > > > > This Phi will be optimized by merge_through_phi [1], which transforms `Phi (VectorBox VectorBox)` into `VectorBox (Phi Phi)` to pursue opportunity of combining VectorBox with VectorUnbox. In this process, either the pre type check [2] or the process cloning Phi nodes [3], the circle case is well considered to avoid falling into endless loop. > > After merge_through_phi, each input Phi of new VectorBox has the same shape with original root Phi before merging (only VectorBox has been replaced). After several other optimizations, C2 would expand VectorBox [4] on a graph like below: > > > ------------ > / \ > Region | Proj | > \ | / | > Phi | > | | > | | > Region | Proj | > \ | / | > Phi | > | | > |------------/ > | > | Phi > | / > VectorBox > > > which the circle case should be taken into consideration as well. > > [TEST] > Full Jtreg passed without new failure. 
> > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L2554 > [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L2571 > [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L2531 > [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/vector.cpp#L311 Nit: It is a CYCLE, not a circle. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13489#issuecomment-1512658488 From roland at openjdk.org Tue Apr 18 08:57:56 2023 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 18 Apr 2023 08:57:56 GMT Subject: Integrated: 8305781: compiler/c2/irTests/TestVectorizationMultiInvar.java failed with "IRViolationException: There were one or multiple IR rule failures." In-Reply-To: References: Message-ID: On Mon, 17 Apr 2023 11:26:47 GMT, Roland Westrelin wrote: > The test case only works if unaligned accesses are allowed (that is > AlignVector false). I added a runtime check similar to what I did with > TestVectorizationMismatchedAccess. This pull request has now been integrated. Changeset: 8ecb5dfa Author: Roland Westrelin URL: https://git.openjdk.org/jdk/commit/8ecb5dfa34ebd2ef7717994522fbb4bd7a14e0c9 Stats: 9 lines in 1 file changed: 7 ins; 0 del; 2 mod 8305781: compiler/c2/irTests/TestVectorizationMultiInvar.java failed with "IRViolationException: There were one or multiple IR rule failures." Reviewed-by: thartmann, kvn ------------- PR: https://git.openjdk.org/jdk/pull/13492 From roland at openjdk.org Tue Apr 18 08:57:54 2023 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 18 Apr 2023 08:57:54 GMT Subject: RFR: 8305781: compiler/c2/irTests/TestVectorizationMultiInvar.java failed with "IRViolationException: There were one or multiple IR rule failures." In-Reply-To: References: Message-ID: On Mon, 17 Apr 2023 11:40:32 GMT, Tobias Hartmann wrote: >> The test case only works if unaligned accesses are allowed (that is >> AlignVector false). 
I added a runtime check similar to what I did with >> TestVectorizationMismatchedAccess. > > Looks good to me. @TobiHartmann @vnkozlov thanks for the reviews. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13492#issuecomment-1512710180 From jsjolen at openjdk.org Tue Apr 18 09:00:45 2023 From: jsjolen at openjdk.org (Johan Sjölen) Date: Tue, 18 Apr 2023 09:00:45 GMT Subject: RFR: JDK-8306077: Replace NEW_ARENA_ARRAY with NEW_RESOURCE_ARRAY when applicable in opto In-Reply-To: References: Message-ID: On Mon, 17 Apr 2023 09:27:51 GMT, Johan Sjölen wrote: > Hi, this is a small cleanup that switches out NEW_ARENA_ARRAY with NEW_RESOURCE_ARRAY when the allocation is done on a ResourceArea. > > Please consider, thank you. Thank you! ------------- PR Comment: https://git.openjdk.org/jdk/pull/13490#issuecomment-1512716708 From jsjolen at openjdk.org Tue Apr 18 09:03:53 2023 From: jsjolen at openjdk.org (Johan Sjölen) Date: Tue, 18 Apr 2023 09:03:53 GMT Subject: Integrated: JDK-8306077: Replace NEW_ARENA_ARRAY with NEW_RESOURCE_ARRAY when applicable in opto In-Reply-To: References: Message-ID: On Mon, 17 Apr 2023 09:27:51 GMT, Johan Sjölen wrote: > Hi, this is a small cleanup that switches out NEW_ARENA_ARRAY with NEW_RESOURCE_ARRAY when the allocation is done on a ResourceArea. > > Please consider, thank you. This pull request has now been integrated.
Changeset: 896207de Author: Johan Sjölen URL: https://git.openjdk.org/jdk/commit/896207de144380e58584838382e0ec32fb0f9d02 Stats: 11 lines in 1 file changed: 0 ins; 2 del; 9 mod 8306077: Replace NEW_ARENA_ARRAY with NEW_RESOURCE_ARRAY when applicable in opto Reviewed-by: thartmann ------------- PR: https://git.openjdk.org/jdk/pull/13490 From aph at openjdk.org Tue Apr 18 09:41:46 2023 From: aph at openjdk.org (Andrew Haley) Date: Tue, 18 Apr 2023 09:41:46 GMT Subject: RFR: 8305056: Avoid unaligned access in emit_intX methods if not enabled [v4] In-Reply-To: References: Message-ID: On Mon, 17 Apr 2023 14:19:36 GMT, Vladimir Kempik wrote: >> Please review this change which attempts to eliminate unaligned memory stores generated by emit_int16/32/64 methods on some platforms. >> >> Primary aim is risc-v platform. But I had to change some code in ppc/arm32/x86 to prevent possible perf degradation. > > Vladimir Kempik has updated the pull request incrementally with one additional commit since the last revision: > > Rework the fix to use memcpy in codeBuffer.hpp src/hotspot/share/asm/codeBuffer.hpp line 42: > 40: assert(p != nullptr, "null pointer"); > 41: > 42: memcpy((void*)p, &x, sizeof(T)); Suggestion: memcpy((void*)p, &x, sizeof x); src/hotspot/share/asm/codeBuffer.hpp line 228: > 226: } > 227: > 228: void emit_int16(uint16_t x) { put_native(end(), x); set_end(end() + sizeof(uint16_t)); } Suggestion: void emit_int16(uint16_t x) { put_native(end(), x); set_end(end() + sizeof x); } ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13227#discussion_r1169763187 PR Review Comment: https://git.openjdk.org/jdk/pull/13227#discussion_r1169761777 From aph at openjdk.org Tue Apr 18 09:46:47 2023 From: aph at openjdk.org (Andrew Haley) Date: Tue, 18 Apr 2023 09:46:47 GMT Subject: RFR: 8305056: Avoid unaligned access in emit_intX methods if not enabled [v4] In-Reply-To: References: Message-ID:
<1unE46SS_SaArMYRAPWJWjxPyB_acMvkuxr-69LNob0=.653291e3-bfcc-45d3-aa92-6950361feb52@github.com> On Mon, 17 Apr 2023 14:19:36 GMT, Vladimir Kempik wrote: >> Please review this change which attempts to eliminate unaligned memory stores generated by emit_int16/32/64 methods on some platforms. >> >> Primary aim is risc-v platform. But I had to change some code in ppc/arm32/x86 to prevent possible perf degradation. > > Vladimir Kempik has updated the pull request incrementally with one additional commit since the last revision: > > Rework the fix to use memcpy in codeBuffer.hpp src/hotspot/share/asm/codeBuffer.hpp line 262: > 260: void emit_float( jfloat x) { put_native(end(), jint_cast(x)); set_end(end() + sizeof(jfloat)); } > 261: void emit_double(jdouble x) { put_native(end(), julong_cast(x)); set_end(end() + sizeof(jdouble)); } > 262: void emit_address(address x) { put_native(end(), p2i(x)); set_end(end() + sizeof(address)); } Suggestion: template <typename T> void emit_native(T x) { put_native(end(), x); set_end(end() + sizeof x); } void emit_float( jfloat x) { emit_native(x); } void emit_double(jdouble x) { emit_native(x); } void emit_address(address x) { emit_native(x); } ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13227#discussion_r1169768579 From epeter at openjdk.org Tue Apr 18 10:02:00 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 18 Apr 2023 10:02:00 GMT Subject: RFR: 8303466: C2: failed: malformed control flow. Limit type made precise with MaxL/MinL [v6] In-Reply-To: References: Message-ID: > **Context** > > During `PhaseIdealLoop::do_unroll`, we hack the loop-limit, and subtract `stride` from it. We have to prevent underflow on that subtract. Currently, we do this with a `CMoveI`. The problem with this: `CMoveI` is not smart enough to generate a precise type. For example, there are many cases where the input types get better, and underflow is not possible anymore.
But the `CMoveI` does not detect this, and still has type `min_jint..hi`. > > We have the same issue in `PhaseIdealLoop::adjust_limit`, where we use `CMoveL` to implement long max/min. The types are not as precise as they could and should be. > > **Problem** > > The imprecise type is used for the zero-trip-guard. It does not fold to false, even though the data-path into the post loop does constant fold to `TOP`. The graph breaks, and assert `malformed control flow` triggers. > > Details: In these cases, we have the super-unrolled main-loop (SuperWord'ed, then further unrolled) directly leading to a vectorized post-loop. The effect is that there is no `region/phi` merging main-exit and main-zero-trip-guard. So the types are already more narrow here. It may be possible that the values are such that we find out that we should never enter the vectorized post-loop. But if data finds out and control does not, we get a broken graph. > Note: we have pre-loop. Then a main-loop and vectorized post loop. Then we merge the main-zero-trip-guard. And at the end we have the scalar post loop. > > I have already recently fixed a bug around this `CMoveI`. https://github.com/openjdk/jdk/commit/5a4945c0d95423d0ab07762c915e9cb4d3c66abb I would now like to have a more satisfactory fix, that properly propagates the types. > > **Solution** > > `PhaseIdealLoop::adjust_limit` already converts the limit from int to long, and does all computations in long, including taking max/min with a `CMoveL`. I now use the so far unused `MaxL/MinL`. I implemented some missing `Value/Identity` components for it. Since `MaxL/MinL` is not implemented in the backend, I just expand it in macro-expansion to a `CMoveL`. At that point the loop-opts are over, and it is most likely ok that we do not make the types more precise after this. 
> > I take the same approach for `PhaseIdealLoop::do_unroll`: convert limits to long, do subtraction in long, take `MinL/MaxL` to clamp it to the int-range (prevent subtraction underflow). > > **Discussion** > > This solution seems much cleaner to me, and I hope that we will see fewer bugs because of imprecise types in the limit computation, which were often due to the `CMove` not being smart enough to analyze all inputs (it would have to recognize a multitude of patterns, for the Cmp inputs and the direct inputs to the CMove - we currently do not do that, but just take the union of the input types - this is very imprecise). > > There is a bit of an overhead here: We use longs even though we only want to have int values. But I think we should prefer a clean implementation here, with correct type computation. The performance impact is probably non-existent on 64-bit machines anyway. > > **Caveat** > > I found some cases with the same assert `malformed control flow` that are most likely skeleton/assertion predicate bugs [JDK-8288981](https://bugs.openjdk.org/browse/JDK-8288981). Some of those cases were new patterns, for example where we PreMainPost a main loop. > > I hope that this fix here at least reduces the frequency of failures significantly. > > **Testing** > > I added 2 regression tests. Our fuzzer seems to spit out examples regularly, so that gives us extra coverage. > > Tested up to `tier5` and stress testing. Performance testing **running...** > > **Future Work** > > We should implement `MaxL/MinL` in the backend. We should also use them during parsing. This would also allow us to `SuperWord` the instruction, on the platforms that support it. > > Should we add such an assert during IGVN? I think after IGVN, we should never have a `MultiBranchNode` that does not have the required number of outputs, right? We could add it to `VerifyIterativeGVN`.
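The clamping idea from the **Solution** section can be sketched in plain C++ as follows. This is only an illustration of the arithmetic (widen to 64 bits, subtract, clamp back into the int range), not C2 code; the function name `adjusted_limit` is an assumption made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>

// Computes limit - stride without 32-bit wrap-around: the subtraction is done
// in 64 bits and the result is clamped to the int range, which is the job the
// description assigns to MaxL/MinL.
int32_t adjusted_limit(int32_t limit, int32_t stride) {
  const int64_t lo = std::numeric_limits<int32_t>::min();
  const int64_t hi = std::numeric_limits<int32_t>::max();
  const int64_t wide = static_cast<int64_t>(limit) - static_cast<int64_t>(stride);
  return static_cast<int32_t>(std::min(hi, std::max(lo, wide)));
}
```

For example, `adjusted_limit(INT32_MIN, 1)` stays at `INT32_MIN` instead of wrapping around to a large positive value.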
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: Collapse SubL->MaxL->SubL->MaxL pattern, test it ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13269/files - new: https://git.openjdk.org/jdk/pull/13269/files/2f5eb056..34a69b0f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13269&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13269&range=04-05 Stats: 169 lines in 4 files changed: 169 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/13269.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13269/head:pull/13269 PR: https://git.openjdk.org/jdk/pull/13269 From epeter at openjdk.org Tue Apr 18 10:13:01 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 18 Apr 2023 10:13:01 GMT Subject: RFR: 8303466: C2: failed: malformed control flow. Limit type made precise with MaxL/MinL [v7] In-Reply-To: References: Message-ID: > **Context** > > During `PhaseIdealLoop::do_unroll`, we hack the loop-limit, and subtract `stride` from it. We have to prevent underflow on that subtract. Currently, we do this with a `CMoveI`. The problem with this: `CMoveI` is not smart enough to generate a precise type. For example, there are many cases where the input types get better, and underflow is not possible anymore. But the `CMoveI` does not detect this, and still has type `min_jint..hi`. > > We have the same issue in `PhaseIdealLoop::adjust_limit`, where we use `CMoveL` to implement long max/min. The types are not as precise as they could and should be. > > **Problem** > > The imprecise type is used for the zero-trip-guard. It does not fold to false, even though the data-path into the post loop does constant fold to `TOP`. The graph breaks, and assert `malformed control flow` triggers. > > Details: In these cases, we have the super-unrolled main-loop (SuperWord'ed, then further unrolled) directly leading to a vectorized post-loop. 
The effect is that there is no `region/phi` merging main-exit and main-zero-trip-guard. So the types are already more narrow here. It may be possible that the values are such that we find out that we should never enter the vectorized post-loop. But if data finds out and control does not, we get a broken graph. > Note: we have pre-loop. Then a main-loop and vectorized post loop. Then we merge the main-zero-trip-guard. And at the end we have the scalar post loop. > > I have already recently fixed a bug around this `CMoveI`. https://github.com/openjdk/jdk/commit/5a4945c0d95423d0ab07762c915e9cb4d3c66abb I would now like to have a more satisfactory fix, that properly propagates the types. > > **Solution** > > `PhaseIdealLoop::adjust_limit` already converts the limit from int to long, and does all computations in long, including taking max/min with a `CMoveL`. I now use the so far unused `MaxL/MinL`. I implemented some missing `Value/Identity` components for it. Since `MaxL/MinL` is not implemented in the backend, I just expand it in macro-expansion to a `CMoveL`. At that point the loop-opts are over, and it is most likely ok that we do not make the types more precise after this. > > I take the same approach for `PhaseIdealLoop::do_unroll`: convert limits to long, do subtraction in long, take `MinL/MaxL` to clamp it to the int-range (prevent subtraction underflow). > > **Discussion** > > This solution seems much cleaner to me, and I hope that we will see fewer bugs because of imprecise types in the limit computation, which were often due to the `CMove` not being smart enough to analyze all inputs (it would have to recognize a multitude of patterns, for the Cmp inputs and the direct inputs to the CMove - we currently do not do that, but just take the union of the input types - this is very imprecise). > > There is a bit of an overhead here: We use longs even though we only want to have int values.
But I think we should prefer a clean implementation here, with correct type computation. The performance impact is probably non-existent on 64-bit machines anyway. > > **Caveat** > > I found some cases with the same assert `malformed control flow` that are most likely skeleton/assertion predicate bugs [JDK-8288981](https://bugs.openjdk.org/browse/JDK-8288981). Some of those cases were new patterns, for example where we PreMainPost a main loop. > > I hope that this fix here at least reduces the frequency of failures significantly. > > **Testing** > > I added 2 regression tests. Our fuzzer seems to spit out examples regularly, so that gives us extra coverage. > > Tested up to `tier5` and stress testing. Performance testing **running...** > > **Future Work** > > We should implement `MaxL/MinL` in the backend. We should also use them during parsing. This would also allow to `SuperWord` the instruction, on the platforms that support it. > > Should we add such an assert during IGVN? I think after IGVN, we should never have a `MultiBranchNode` that does not have the required number of outputs, right? We could add it to `VerifyIterativeGVN`. 
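Why a dedicated max node can carry a tighter type than a select that only merges its inputs can be seen with a toy interval model. The sketch below is an assumption-laden illustration: `Range`, `union_type` and `max_type` are hypothetical names and do not correspond to C2's `TypeLong` API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Toy value range [lo, hi]; not C2's TypeLong.
struct Range {
  int64_t lo;
  int64_t hi;
};

// What a select-like node gets when it only takes the union of its inputs.
Range union_type(Range a, Range b) {
  return { std::min(a.lo, b.lo), std::max(a.hi, b.hi) };
}

// Range of max(a, b): the result is at least as large as both lower bounds,
// and no larger than the larger upper bound.
Range max_type(Range a, Range b) {
  return { std::max(a.lo, b.lo), std::max(a.hi, b.hi) };
}
```

With `a = [-5000000000, 100]` and the constant `b = [INT32_MIN, INT32_MIN]`, the union keeps the lower bound `-5000000000`, while the max range is `[INT32_MIN, 100]`.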
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: Fixed some TOP cases ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13269/files - new: https://git.openjdk.org/jdk/pull/13269/files/34a69b0f..33e1ad54 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13269&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13269&range=05-06 Stats: 3 lines in 2 files changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/13269.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13269/head:pull/13269 PR: https://git.openjdk.org/jdk/pull/13269 From rrich at openjdk.org Tue Apr 18 13:14:37 2023 From: rrich at openjdk.org (Richard Reingruber) Date: Tue, 18 Apr 2023 13:14:37 GMT Subject: RFR: 8306111: PPC64: RT call after thaw with exception requires larger ABI section Message-ID: <3i8_NDyYCUM0wutPbFoOvmIuJpJN0o7fbgikLxQJzO8=.b1b3747f-f75c-404c-a579-b81e1861cdc9@github.com> After thawing a frame to forward an exception, we call `SharedRuntime::exception_handler_for_return_address()`. The frame just thawed lacks the required `frame::native_abi_reg_args` though. It is only equipped with `frame::java_abi`. This change pushes a new frame with the required ABI. Furthermore the change enables `VMContinuations` by default also for PPC64 big endian. Testing: jdk_loom on little and big endian. 
------------- Commit messages: - Copyright year - Enable VMContinuations also on big endian PPC - Fix Changes: https://git.openjdk.org/jdk/pull/13505/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13505&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8306111 Stats: 5 lines in 2 files changed: 3 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/13505.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13505/head:pull/13505 PR: https://git.openjdk.org/jdk/pull/13505 From mdoerr at openjdk.org Tue Apr 18 13:14:39 2023 From: mdoerr at openjdk.org (Martin Doerr) Date: Tue, 18 Apr 2023 13:14:39 GMT Subject: RFR: 8306111: PPC64: RT call after thaw with exception requires larger ABI section In-Reply-To: <3i8_NDyYCUM0wutPbFoOvmIuJpJN0o7fbgikLxQJzO8=.b1b3747f-f75c-404c-a579-b81e1861cdc9@github.com> References: <3i8_NDyYCUM0wutPbFoOvmIuJpJN0o7fbgikLxQJzO8=.b1b3747f-f75c-404c-a579-b81e1861cdc9@github.com> Message-ID: On Tue, 18 Apr 2023 07:54:29 GMT, Richard Reingruber wrote: > After thawing a frame to forward an exception, we call `SharedRuntime::exception_handler_for_return_address()`. The frame just thawed lacks the required `frame::native_abi_reg_args` though. It is only equipped with `frame::java_abi`. This change pushes a new frame with the required ABI. > > Furthermore the change enables `VMContinuations` by default also for PPC64 big endian. > > Testing: jdk_loom on little and big endian. LGTM. ------------- Marked as reviewed by mdoerr (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/13505#pullrequestreview-1390076120 From eliu at openjdk.org Tue Apr 18 14:57:00 2023 From: eliu at openjdk.org (Eric Liu) Date: Tue, 18 Apr 2023 14:57:00 GMT Subject: RFR: 8304948: [vectorapi] C2 crashes when expanding VectorBox In-Reply-To: <1yri9IEJiIiarInYDIRxDllndOsk9qPHsN0ABQbFe2o=.5220987f-7f46-46fc-8a2b-414604ea2c7b@github.com> References: <1yri9IEJiIiarInYDIRxDllndOsk9qPHsN0ABQbFe2o=.5220987f-7f46-46fc-8a2b-414604ea2c7b@github.com> Message-ID: On Tue, 18 Apr 2023 07:12:25 GMT, Tobias Hartmann wrote: >> This patch fixes C2 failure with SIGSEGV due to endless recursion. >> >> With test case VectorBoxExpandTest.java in this patch, C2 would generate IR graph like below: >> >> >> ------------ >> / \ >> Region | VectorBox | >> \ | / | >> Phi | >> | | >> | | >> Region | VectorBox | >> \ | / | >> Phi | >> | | >> |------------/ >> | >> >> >> >> This Phi will be optimized by merge_through_phi [1], which transforms `Phi (VectorBox VectorBox)` into `VectorBox (Phi Phi)` to pursue opportunity of combining VectorBox with VectorUnbox. In this process, either the pre type check [2] or the process cloning Phi nodes [3], the circle case is well considered to avoid falling into endless loop. >> >> After merge_through_phi, each input Phi of new VectorBox has the same shape with original root Phi before merging (only VectorBox has been replaced). After several other optimizations, C2 would expand VectorBox [4] on a graph like below: >> >> >> ------------ >> / \ >> Region | Proj | >> \ | / | >> Phi | >> | | >> | | >> Region | Proj | >> \ | / | >> Phi | >> | | >> |------------/ >> | >> | Phi >> | / >> VectorBox >> >> >> which the circle case should be taken into consideration as well. >> >> [TEST] >> Full Jtreg passed without new failure. 
>> >> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L2554 >> [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L2571 >> [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L2531 >> [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/vector.cpp#L311 > > src/hotspot/share/opto/vector.cpp line 307: > >> 305: void PhaseVector::expand_vbox_node(VectorBoxNode* vec_box) { >> 306: if (vec_box->outcnt() > 0) { >> 307: VectorSet visited; > > Do we need a ResourceMark here or above? expand_vbox_node only has a single one call chain, which start from Compile::Optimize(). I think we can trust the ResourceMark defined in there [1]. [1] https://github.com/e1iu/jdk/blob/ENTLLT-4778-external/src/hotspot/share/opto/compile.cpp#L2207 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13489#discussion_r1170162842 From eliu at openjdk.org Tue Apr 18 16:43:17 2023 From: eliu at openjdk.org (Eric Liu) Date: Tue, 18 Apr 2023 16:43:17 GMT Subject: RFR: 8304948: [vectorapi] C2 crashes when expanding VectorBox In-Reply-To: <1yri9IEJiIiarInYDIRxDllndOsk9qPHsN0ABQbFe2o=.5220987f-7f46-46fc-8a2b-414604ea2c7b@github.com> References: <1yri9IEJiIiarInYDIRxDllndOsk9qPHsN0ABQbFe2o=.5220987f-7f46-46fc-8a2b-414604ea2c7b@github.com> Message-ID: <4NUT5QmYN6dXvrNE5pqA4rdCkEfSXumQKjWOYZErCFY=.13a1e22c-d423-4a3d-9e6f-997cfd488a32@github.com> On Tue, 18 Apr 2023 07:49:56 GMT, Tobias Hartmann wrote: >> This patch fixes C2 failure with SIGSEGV due to endless recursion. 
>> >> With test case VectorBoxExpandTest.java in this patch, C2 would generate IR graph like below: >> >> >> ------------ >> / \ >> Region | VectorBox | >> \ | / | >> Phi | >> | | >> | | >> Region | VectorBox | >> \ | / | >> Phi | >> | | >> |------------/ >> | >> >> >> >> This Phi will be optimized by merge_through_phi [1], which transforms `Phi (VectorBox VectorBox)` into `VectorBox (Phi Phi)` to pursue opportunity of combining VectorBox with VectorUnbox. In this process, either the pre type check [2] or the process cloning Phi nodes [3], the circle case is well considered to avoid falling into endless loop. >> >> After merge_through_phi, each input Phi of new VectorBox has the same shape with original root Phi before merging (only VectorBox has been replaced). After several other optimizations, C2 would expand VectorBox [4] on a graph like below: >> >> >> ------------ >> / \ >> Region | Proj | >> \ | / | >> Phi | >> | | >> | | >> Region | Proj | >> \ | / | >> Phi | >> | | >> |------------/ >> | >> | Phi >> | / >> VectorBox >> >> >> which the circle case should be taken into consideration as well. >> >> [TEST] >> Full Jtreg passed without new failure. >> >> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L2554 >> [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L2571 >> [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L2531 >> [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/vector.cpp#L311 > > src/hotspot/share/opto/vector.cpp line 327: > >> 325: // Phi (VectorBox VectorBox) => VectorBox (Phi Phi) >> 326: if (visited.test_set(vbox->_idx)) { >> 327: assert(vbox->is_Phi(), "not a phi"); > > Are we sure that the cycle is always detected at a phi? I think it is. The CYCLE is derived from merge_through_phi, in which only Phi CYCLE is allowed [1]. 
This method `expand_vbox_node_helper` serves to locate the target node `Proj(VectorBoxAllocate)` and then replace it with a bunch of other nodes. When expanding VectorBox, the normal case is `VectorBox (Proj value)`. If it is in the shape of `VectorBox (Phi1 Phi2)`[2] or `VectorBox (Phi1 vect)`[3], I suppose it should be transformed by merge_through_phi. In `expand_vbox_node_helper`, it recurses only at Phi1, since that is where the Proj is. [1] https://github.com/e1iu/jdk/blob/ENTLLT-4778-external/src/hotspot/share/opto/cfgnode.cpp#L2574 [2] https://github.com/e1iu/jdk/blob/ENTLLT-4778-external/src/hotspot/share/opto/vector.cpp#L331 [3] https://github.com/e1iu/jdk/blob/ENTLLT-4778-external/src/hotspot/share/opto/vector.cpp#L341 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13489#discussion_r1170307982 From rrich at openjdk.org Tue Apr 18 18:34:52 2023 From: rrich at openjdk.org (Richard Reingruber) Date: Tue, 18 Apr 2023 18:34:52 GMT Subject: RFR: 8306111: PPC64: RT call after thaw with exception requires larger ABI section [v2] In-Reply-To: <3i8_NDyYCUM0wutPbFoOvmIuJpJN0o7fbgikLxQJzO8=.b1b3747f-f75c-404c-a579-b81e1861cdc9@github.com> References: <3i8_NDyYCUM0wutPbFoOvmIuJpJN0o7fbgikLxQJzO8=.b1b3747f-f75c-404c-a579-b81e1861cdc9@github.com> Message-ID: > After thawing a frame to forward an exception, we call `SharedRuntime::exception_handler_for_return_address()`. The frame just thawed lacks the required `frame::native_abi_reg_args` though. It is only equipped with `frame::java_abi`. This change pushes a new frame with the required ABI. > > Furthermore the change enables `VMContinuations` by default also for PPC64 big endian. > > Testing: jdk_loom on little and big endian. Richard Reingruber has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains four additional commits since the last revision: - Merge branch 'master' to get rid of unrelated GHA failures - Copyright year - Enable VMContinuations also on big endian PPC - Fix ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13505/files - new: https://git.openjdk.org/jdk/pull/13505/files/aa35da7e..12e3ec71 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13505&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13505&range=00-01 Stats: 695 lines in 64 files changed: 19 ins; 128 del; 548 mod Patch: https://git.openjdk.org/jdk/pull/13505.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13505/head:pull/13505 PR: https://git.openjdk.org/jdk/pull/13505 From dzhang at openjdk.org Wed Apr 19 02:51:00 2023 From: dzhang at openjdk.org (Dingli Zhang) Date: Wed, 19 Apr 2023 02:51:00 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v23] In-Reply-To: References: Message-ID: <-tqGJTDwdjWYBNBNZxFAcosCq3_V_fWvGzwJc98uUZI=.013ff8de-380c-464d-8d09-405cfcac17d3@github.com> > HI, > > We have added support for vector add mask instructions, please take a look and have some reviews. Thanks a lot! > This patch will add support of vector add/sub/mul/div mask version. It was implemented by referring to RVV v1.0 [1]. > > ## Load/Store/Cmp Mask > `VectorLoadMask, VectorMaskCmp, VectorStoreMask` will implement the mask datapath. We can see where the data is passed in the compilation log with `jdk/incubator/vector/Byte128VectorTests.java`? > > 218 loadV V1, [R7] # vector (rvv) > 220 vloadmask V0, V1 > ... > 23c vmaskcmp_rvv_masked V0, V4, V5, V0, V1, #0 > 24c vstoremask V1, V0 > 258 storeV [R7], V1 # vector (rvv) > > > The corresponding generated jit assembly? 
> > # loadV > 0x000000400c8ef958: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef95c: vle8.v v1,(t2) > > # vloadmask > 0x000000400c8ef960: vsetvli t0,zero,e8,m1,tu, > 0x000000400c8ef964: vmsne.vx v0,v1,zero > > # vmaskcmp_rvv_masked > 0x000000400c8ef97c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef980: vmclr.m v1 > 0x000000400c8ef984: vmseq.vv v1,v4,v5,v0.t > 0x000000400c8ef988: vmv1r.v v0,v1 > > # vstoremask > 0x000000400c8ef98c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef990: vmv.v.x v1,zero > 0x000000400c8ef994: vmerge.vim v1,v1,1,v0 > > > ## Masked vector arithmetic instructions (e.g. vadd) > AddMaskTestMerge case: > > import jdk.incubator.vector.IntVector; > import jdk.incubator.vector.VectorMask; > import jdk.incubator.vector.VectorOperators; > import jdk.incubator.vector.VectorSpecies; > > public class AddMaskTestMerge { > > static final VectorSpecies SPECIES = IntVector.SPECIES_128; > static final int SIZE = 1024; > static int[] a = new int[SIZE]; > static int[] b = new int[SIZE]; > static int[] r = new int[SIZE]; > static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false}; > static { > for (int i = 0; i < SIZE; i++) { > a[i] = i; > b[i] = i; > } > } > > static void workload(int idx) { > VectorMask vmask = VectorMask.fromArray(SPECIES, c, 0); > IntVector av = IntVector.fromArray(SPECIES, a, idx); > IntVector bv = IntVector.fromArray(SPECIES, b, idx); > av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx); > } > > public static void main(String[] args) { > for (int i = 0; i < 30_0000; i++) { > for (int j = 0; j < SIZE; j += SPECIES.length()) { > workload(j); > } > } > } > } > > > This test case is reduced from existing jtreg vector tests Int128VectorTests.java[2]. This test case corresponds to the add instruction of the vector mask version and other instructions are similar. > > Before this patch, the compilation log will not print RVV-related instructions. 
Now the compilation log is as follows: > > > 0ae B10: # out( B25 B11 ) <- in( B9 ) Freq: 0.999991 > 0ae loadV V1, [R31] # vector (rvv) > 0b6 vloadmask V0, V2 > 0be vadd.vv V3, V1, V0 #@vaddI_masked > 0c6 lwu R28, [R7, #124] # loadN, compressed ptr, #@loadN ! Field: AddMaskTestMerge.r > 0ca decode_heap_oop R28, R28 #@decodeHeapOop > 0cc lwu R7, [R28, #12] # range, #@loadRange > 0d0 NullCheck R28 > > > And the jit code is as follows: > > > 0x000000400c823cee: vsetvli t0,zero,e32,m1,tu,mu > 0x000000400c823cf2: vle32.v v1,(t6) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::intoArray at 43 (line 3228) > ; - AddMaskTestMerge::workload at 46 (line 25) > 0x000000400c823cf6: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c823cfa: vmsne.vx v0,v2,zero ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.VectorMask::fromArray at 47 (line 208) > ; - AddMaskTestMerge::workload at 7 (line 22) > 0x000000400c823cfe: vsetvli t0,zero,e32,m1,tu,mu > 0x000000400c823d02: vadd.vv v3,v3,v1,v0.t ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::lanewiseTemplate at 192 (line 834) > ; - jdk.incubator.vector.Int128Vector::lanewise at 9 (line 291) > ; - jdk.incubator.vector.Int128Vector::lanewise at 4 (line 41) > ; - AddMaskTestMerge::workload at 39 (line 25) > > > ## Mask register allocation & mask bit opreation > Since v0 is to be used as a mask register in spec[1], sometimes we need two vmask to do the vector mask logical ops like `AndVMask, OrVMask, XorVMask`. And if only v0 and v31 mask registers are defined, the corresponding c2 nodes will not be generated correctly because of the register pressure[3]. > When we use only v0 and v31 as mask registers, jtreg testing of Byte128VectorTests.java[4] with `-XX:+PrintAssembly` and `-XX:LogFile` will not emit the expected rvv mask instruction. 
Instead, the following compilation failure log[3] is generated: > > > > > > > > > So define v30 and v31 as mask registers too, and `AndVMask` will emit C2 JIT code like: > > vloadmask V0, V1 > vloadmask V30, V2 > vmask_and V0, V30, V0 > > We also modified the implementation of `spill_copy_vector_stack_to_stack` so that it no longer occupies the v0 register. In addition, we changed some nodes such as `vasr/vlsl/vlsr/vstring_x/varray_x/vclearArray_x`, which use v0 internally, so that C2 knows they use v0. > > ## vector load/store - predicated & blend operation > > Jtreg testing of Byte128VectorTests.java[4] with -XX:+PrintOptoAssembly and -XX:LogFile will print the following compilation log, which is generated by predicated vector load/store: > > 152 B21: # out( B22 ) <- in( B20 ) Freq: 0.499984 > 152 vmask_gen_L V0, R12 > 162 loadV_masked V1, V0, [R10] > 16e storeV_masked [R11], V0, V1 > > > And `VectorBlend` will generate the following compilation log (part of a rotate operation): > > 1ea vlsrBS V6, V1, V3 V0 > 1fe vlslBS V5, V1, V2 V0 > 212 vor.vv V2, V5, V6 #@vor > 21a vloadmask V0, V4 > 222 vmerge_vvm V1, V1, V2 # vector blend > 22a bgeu R9, R30, B56 #@cmpU_branch P=0.000001 C=-1.000000 > > > At the same time, we added the predicated nodes of `RShiftV/LShiftV/URShiftV`. Since this duplicated some code from the corresponding non-masked nodes, a small refactoring was done.
> > > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc > [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java > [3] https://github.com/openjdk/jdk/blob/0deb648985b018653ccdaf193dc13b3cf21c088a/src/hotspot/share/opto/chaitin.cpp#L526 > [4] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Byte128VectorTests.java > > ### Testing: > > qemu with UseRVV: > - [x] Tier1 tests (release) > - [x] Tier2 tests (release) > - [x] Tier3 tests (release) > - [x] test/jdk/jdk/incubator/vector (release/fastdebug) Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: Rename vmaskcmp_DF ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12682/files - new: https://git.openjdk.org/jdk/pull/12682/files/f5974d48..b5f716dc Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=22 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=21-22 Stats: 4 lines in 1 file changed: 0 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/12682.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12682/head:pull/12682 PR: https://git.openjdk.org/jdk/pull/12682 From fjiang at openjdk.org Wed Apr 19 02:51:02 2023 From: fjiang at openjdk.org (Feilong Jiang) Date: Wed, 19 Apr 2023 02:51:02 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v22] In-Reply-To: <7cvHLdmmY_pzqr6YRqBQDqMU2oDV6BrsG_PnQ5g5UpE=.7942ff03-47f8-4a0e-93d4-40fd0f8f5274@github.com> References: <7cvHLdmmY_pzqr6YRqBQDqMU2oDV6BrsG_PnQ5g5UpE=.7942ff03-47f8-4a0e-93d4-40fd0f8f5274@github.com> Message-ID: <4-igbfDm_1lEJAn0Pj-CN3KuXky5i6LZUEj3nCV1t0s=.ab9ade8b-db62-41b5-81d1-79e8dc259149@github.com> On Tue, 18 Apr 2023 07:11:54 GMT, Dingli Zhang wrote: >> HI, >> >> We have added support for vector add mask instructions, please take a look and have some reviews. Thanks a lot! 
>> >> >> [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc >> [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java >> [3] https://github.com/openjdk/jdk/blob/0deb648985b018653ccdaf193dc13b3cf21c088a/src/hotspot/share/opto/chaitin.cpp#L526 >> [4] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Byte128VectorTests.java >> >> ### Testing: >> >> qemu with UseRVV: >> - [x] Tier1 tests (release) >> - [x] Tier2 tests (release) >> - [x] Tier3 tests (release) >> - [x] test/jdk/jdk/incubator/vector (release/fastdebug) > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Fix build fail after JDK-8305008 The update looks good, thanks. src/hotspot/cpu/riscv/riscv_v.ad line 199: > 197: // vector mask float compare > 198: > 199: instruct vmaskcmp_DF(vRegMask dst, vReg src1, vReg src2, immI cond, vReg tmp1, vReg tmp2) %{ Would you please rename `vmaskcmp_DF` to `vmaskcmp_FD`? We got `minmax_FD_v` already. src/hotspot/cpu/riscv/riscv_v.ad line 216: > 214: %} > 215: > 216: instruct vmaskcmp_DF_masked(vRegMask dst, vReg src1, vReg src2, immI cond, vRegMask_V0 vmask, vReg tmp1, vReg tmp2, vReg tmp3) %{ ditto ------------- Marked as reviewed by fjiang (Author). 
PR Review: https://git.openjdk.org/jdk/pull/12682#pullrequestreview-1391182204 PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1170748852 PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1170750331 From dzhang at openjdk.org Wed Apr 19 02:51:04 2023 From: dzhang at openjdk.org (Dingli Zhang) Date: Wed, 19 Apr 2023 02:51:04 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v22] In-Reply-To: <4-igbfDm_1lEJAn0Pj-CN3KuXky5i6LZUEj3nCV1t0s=.ab9ade8b-db62-41b5-81d1-79e8dc259149@github.com> References: <7cvHLdmmY_pzqr6YRqBQDqMU2oDV6BrsG_PnQ5g5UpE=.7942ff03-47f8-4a0e-93d4-40fd0f8f5274@github.com> <4-igbfDm_1lEJAn0Pj-CN3KuXky5i6LZUEj3nCV1t0s=.ab9ade8b-db62-41b5-81d1-79e8dc259149@github.com> Message-ID: On Wed, 19 Apr 2023 02:37:35 GMT, Feilong Jiang wrote: >> Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix build fail after JDK-8305008 > > src/hotspot/cpu/riscv/riscv_v.ad line 199: > >> 197: // vector mask float compare >> 198: >> 199: instruct vmaskcmp_DF(vRegMask dst, vReg src1, vReg src2, immI cond, vReg tmp1, vReg tmp2) %{ > > Would you please rename `vmaskcmp_DF` to `vmaskcmp_FD`? We got `minmax_FD_v` already. Thanks! Fixed. > src/hotspot/cpu/riscv/riscv_v.ad line 216: > >> 214: %} >> 215: >> 216: instruct vmaskcmp_DF_masked(vRegMask dst, vReg src1, vReg src2, immI cond, vRegMask_V0 vmask, vReg tmp1, vReg tmp2, vReg tmp3) %{ > > ditto Fixed. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1170752063 PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1170752084 From jkarthikeyan at openjdk.org Wed Apr 19 04:30:39 2023 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Wed, 19 Apr 2023 04:30:39 GMT Subject: RFR: 8051725: Questionable if-conversion involving SETNE [v3] In-Reply-To: References: Message-ID: > Hi, I've created optimizations for the x86 lowering of `Conv2B` nodes, when followed immediately by an xor of 1. This pattern is fairly common, and can arise from both [cmov idealization](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/movenode.cpp#L241) and [diamond-phi optimization](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L1571). The optimization here is using the `sete` instruction instead of always using `setne` and flipping the bit with xor afterwards. According to the Intel optimization guide (pages 3-26 and 3-27), this sequence is preferred over `cmp $0, %src` as it prevents the need to encode the constant in the assembly sequence. A similar rule exists in the PPC backend, here: https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/ppc/ppc.ad#L10462. I've attached some performance testing but I think the real world improvements will be less significant; the motivation is primarily to decrease the number of instructions that are generated, as that can help in cases where applications are I-Cache bound.
>
>                                          Baseline                 Patch             Improvement
> Benchmark                      Mode  Cnt  Score   Error  Units   Score   Error  Units
> Conv2BRules.testEquals0        avgt   10  47.566 ± 0.346 ns/op / 37.904 ± 1.856 ns/op  + 22.6%
> Conv2BRules.testNotEquals0     avgt   10  37.167 ± 0.211 ns/op / 37.352 ± 1.529 ns/op  (unchanged)
> Conv2BRules.testEquals1        avgt   10  35.059 ± 0.280 ns/op / 34.847 ± 0.160 ns/op  (unchanged)
> Conv2BRules.testEqualsNull     avgt   10  56.768 ± 2.600 ns/op / 46.916 ± 0.308 ns/op  + 19.0%
> Conv2BRules.testNotEqualsNull  avgt   10  47.447 ± 1.193 ns/op / 46.974 ± 0.218 ns/op  (unchanged)
>
> This change also cleans up some code relating to `Assembler::set_byte_if_not_zero`, as that function duplicates behavior with `Assembler::setne`. The 32-bit only version of that method is never called as the only other usage is in the C1 LIR assembler, which is also guarded behind a 64-bit check so I opted to remove it entirely and replace usages with `Assembler::setne`. Reviews would be greatly appreciated! > > Testing: tier1-2 on linux x64, GHA Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: Remove Conv2B from backend as it's macro expanded now ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13345/files - new: https://git.openjdk.org/jdk/pull/13345/files/ee468b9e..59a68a10 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13345&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13345&range=01-02 Stats: 648 lines in 11 files changed: 91 ins; 555 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/13345.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13345/head:pull/13345 PR: https://git.openjdk.org/jdk/pull/13345 From jkarthikeyan at openjdk.org Wed Apr 19 04:30:42 2023 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Wed, 19 Apr 2023 04:30:42 GMT Subject: RFR: 8051725: Questionable if-conversion involving SETNE [v2] In-Reply-To: References: Message-ID: On Mon, 17 Apr 2023 04:32:29 GMT, Jasmine Karthikeyan wrote: >> Hi, I've created optimizations for the x86 lowering of `Conv2B` nodes, when followed immediately by an xor of 1. This pattern is fairly common, and can arise from both [cmov idealization](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/movenode.cpp#L241) and [diamond-phi optimization](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L1571).
>> >> Testing: tier1-2 on linux x64, GHA > > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Re-work transform to happen in macro expansion Hi, that's my bad- I was working on removing the now-redundant Conv2B rules from the backends but I hadn't had a chance to push that yet, I've done so now. Since this has become a broader cleanup effort I'll also go ahead and rename the JBS issue to reflect the new approach with macro expansion as well. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13345#issuecomment-1514113816 From rrich at openjdk.org Wed Apr 19 07:21:56 2023 From: rrich at openjdk.org (Richard Reingruber) Date: Wed, 19 Apr 2023 07:21:56 GMT Subject: Integrated: 8306111: PPC64: RT call after thaw with exception requires larger ABI section In-Reply-To: <3i8_NDyYCUM0wutPbFoOvmIuJpJN0o7fbgikLxQJzO8=.b1b3747f-f75c-404c-a579-b81e1861cdc9@github.com> References: <3i8_NDyYCUM0wutPbFoOvmIuJpJN0o7fbgikLxQJzO8=.b1b3747f-f75c-404c-a579-b81e1861cdc9@github.com> Message-ID: On Tue, 18 Apr 2023 07:54:29 GMT, Richard Reingruber wrote: > After thawing a frame to forward an exception, we call `SharedRuntime::exception_handler_for_return_address()`. The frame just thawed lacks the required `frame::native_abi_reg_args` though. It is only equipped with `frame::java_abi`. This change pushes a new frame with the required ABI. > > Furthermore the change enables `VMContinuations` by default also for PPC64 big endian. > > Testing: jdk_loom on little and big endian. This pull request has now been integrated. 
Changeset: 42b7260e Author: Richard Reingruber URL: https://git.openjdk.org/jdk/commit/42b7260e8be02de78d82c6a4601519b9895826e9 Stats: 5 lines in 2 files changed: 3 ins; 0 del; 2 mod 8306111: PPC64: RT call after thaw with exception requires larger ABI section Reviewed-by: mdoerr ------------- PR: https://git.openjdk.org/jdk/pull/13505 From eliu at openjdk.org Wed Apr 19 08:11:54 2023 From: eliu at openjdk.org (Eric Liu) Date: Wed, 19 Apr 2023 08:11:54 GMT Subject: RFR: 8304948: [vectorapi] C2 crashes when expanding VectorBox [v2] In-Reply-To: References: Message-ID: > This patch fixes C2 failure with SIGSEGV due to endless recursion. > > With test case VectorBoxExpandTest.java in this patch, C2 would generate IR graph like below: > > > ------------ > / \ > Region | VectorBox | > \ | / | > Phi | > | | > | | > Region | VectorBox | > \ | / | > Phi | > | | > |------------/ > | > > > > This Phi will be optimized by merge_through_phi [1], which transforms `Phi (VectorBox VectorBox)` into `VectorBox (Phi Phi)` to pursue opportunity of combining VectorBox with VectorUnbox. In this process, either the pre type check [2] or the process cloning Phi nodes [3], the circle case is well considered to avoid falling into endless loop. > > After merge_through_phi, each input Phi of new VectorBox has the same shape with original root Phi before merging (only VectorBox has been replaced). After several other optimizations, C2 would expand VectorBox [4] on a graph like below: > > > ------------ > / \ > Region | Proj | > \ | / | > Phi | > | | > | | > Region | Proj | > \ | / | > Phi | > | | > |------------/ > | > | Phi > | / > VectorBox > > > which the circle case should be taken into consideration as well. > > [TEST] > Full Jtreg passed without new failure. 
> > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L2554 > [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L2571 > [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L2531 > [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/vector.cpp#L311 Eric Liu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: - Merge jdk:master Change-Id: I63c06b87d5b0c20ddaec0aa43031872b8ebb5362 - fix typo Change-Id: I1b84c4957398178bf234f71242a1cdd044181a79 - 8304948: [vectorapi] C2 crashes when expanding VectorBox This patch fixes C2 failure with SIGSEGV due to endless recursion. With test case VectorBoxExpandTest.java in this patch, C2 would generate IR graph like below: ``` ------------ / \ Region | VectorBox | \ | / | Phi | | | | | Region | VectorBox | \ | / | Phi | | | |\------------/ | ``` This Phi will be optimized by merge_through_phi [1], which transforms `Phi (VectorBox VectorBox)` into `VectorBox (Phi Phi)` to pursue opportunity of combining VectorBox with VectorUnbox. In this process, either the pre type check [2] or the process cloning Phi nodes [3], the circle case is well considered to avoid falling into endless loop. After merge_through_phi, each input Phi of new VectorBox has the same shape with original root Phi before merging (only VectorBox has been replaced). After several other optimizations, C2 would expand VectorBox [4] on a graph like below: ``` ------------ / \ Region | Proj | \ | / | Phi | | | | | Region | Proj | \ | / | Phi | | | |\------------/ | | Phi | / VectorBox ``` which the circle case should be taken into consideration as well. [TEST] Full Jtreg passed without new failure. 
[1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L2557 [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L2574 [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L2534 [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/vector.cpp#L316 Change-Id: I381b1ba7e0865814d97535e365db6d9d72ef1949 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13489/files - new: https://git.openjdk.org/jdk/pull/13489/files/0e73688a..7119ed69 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13489&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13489&range=00-01 Stats: 245814 lines in 2038 files changed: 221344 ins; 11953 del; 12517 mod Patch: https://git.openjdk.org/jdk/pull/13489.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13489/head:pull/13489 PR: https://git.openjdk.org/jdk/pull/13489 From thartmann at openjdk.org Wed Apr 19 08:17:49 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 19 Apr 2023 08:17:49 GMT Subject: RFR: 8304948: [vectorapi] C2 crashes when expanding VectorBox [v2] In-Reply-To: References: Message-ID: <4NjSkOwLqjp_TkdMwYLo2GcxEACHHlBwcRjUTW50fVg=.18b2de9d-75a5-42f2-a0f7-3324c19323ba@github.com> On Wed, 19 Apr 2023 08:11:54 GMT, Eric Liu wrote: >> This patch fixes C2 failure with SIGSEGV due to endless recursion. >> >> With test case VectorBoxExpandTest.java in this patch, C2 would generate IR graph like below: >> >> >> ------------ >> / \ >> Region | VectorBox | >> \ | / | >> Phi | >> | | >> | | >> Region | VectorBox | >> \ | / | >> Phi | >> | | >> |------------/ >> | >> >> >> >> This Phi will be optimized by merge_through_phi [1], which transforms `Phi (VectorBox VectorBox)` into `VectorBox (Phi Phi)` to pursue opportunity of combining VectorBox with VectorUnbox. 
In this process, either the pre type check [2] or the process cloning Phi nodes [3], the circle case is well considered to avoid falling into endless loop. >> >> After merge_through_phi, each input Phi of new VectorBox has the same shape with original root Phi before merging (only VectorBox has been replaced). After several other optimizations, C2 would expand VectorBox [4] on a graph like below: >> >> >> ------------ >> / \ >> Region | Proj | >> \ | / | >> Phi | >> | | >> | | >> Region | Proj | >> \ | / | >> Phi | >> | | >> |------------/ >> | >> | Phi >> | / >> VectorBox >> >> >> which the circle case should be taken into consideration as well. >> >> [TEST] >> Full Jtreg passed without new failure. >> >> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L2554 >> [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L2571 >> [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L2531 >> [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/vector.cpp#L311 > > Eric Liu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Merge jdk:master > > Change-Id: I63c06b87d5b0c20ddaec0aa43031872b8ebb5362 > - fix typo > > Change-Id: I1b84c4957398178bf234f71242a1cdd044181a79 > - 8304948: [vectorapi] C2 crashes when expanding VectorBox > > This patch fixes C2 failure with SIGSEGV due to endless recursion. 
> > With test case VectorBoxExpandTest.java in this patch, C2 would generate > IR graph like below: > > ``` > ------------ > / \ > Region | VectorBox | > \ | / | > Phi | > | | > | | > Region | VectorBox | > \ | / | > Phi | > | | > |\------------/ > | > > ``` > > This Phi will be optimized by merge_through_phi [1], which transforms > `Phi (VectorBox VectorBox)` into `VectorBox (Phi Phi)` to pursue > opportunity of combining VectorBox with VectorUnbox. In this process, > either the pre type check [2] or the process cloning Phi nodes [3], the > circle case is well considered to avoid falling into endless loop. > > After merge_through_phi, each input Phi of new VectorBox has the same > shape with original root Phi before merging (only VectorBox has been > replaced). After several other optimizations, C2 would expand VectorBox > [4] on a graph like below: > > ``` > ------------ > / \ > Region | Proj | > \ | / | > Phi | > | | > | | > Region | Proj | > \ | / | > Phi | > | | > |\------------/ > | > | Phi > | / > VectorBox > > ``` > which the circle case should be taken into consideration as well. > > [TEST] > Full Jtreg passed without new failure. > > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L2557 > [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L2574 > [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L2534 > [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/vector.cpp#L316 > > Change-Id: I381b1ba7e0865814d97535e365db6d9d72ef1949 Another review would be good. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/13489#issuecomment-1514321737 From thartmann at openjdk.org Wed Apr 19 08:17:53 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 19 Apr 2023 08:17:53 GMT Subject: RFR: 8304948: [vectorapi] C2 crashes when expanding VectorBox [v2] In-Reply-To: References: <1yri9IEJiIiarInYDIRxDllndOsk9qPHsN0ABQbFe2o=.5220987f-7f46-46fc-8a2b-414604ea2c7b@github.com> Message-ID: On Tue, 18 Apr 2023 14:53:54 GMT, Eric Liu wrote: >> src/hotspot/share/opto/vector.cpp line 307: >> >>> 305: void PhaseVector::expand_vbox_node(VectorBoxNode* vec_box) { >>> 306: if (vec_box->outcnt() > 0) { >>> 307: VectorSet visited; >> >> Do we need a ResourceMark here or above? > > expand_vbox_node has only a single call chain, which starts from Compile::Optimize(). I think we can trust the ResourceMark defined there [1]. > > [1] https://github.com/e1iu/jdk/blob/ENTLLT-4778-external/src/hotspot/share/opto/compile.cpp#L2207 Makes sense. >> src/hotspot/share/opto/vector.cpp line 327: >> >>> 325: // Phi (VectorBox VectorBox) => VectorBox (Phi Phi) >>> 326: if (visited.test_set(vbox->_idx)) { >>> 327: assert(vbox->is_Phi(), "not a phi"); >> >> Are we sure that the cycle is always detected at a phi? > > I think it is. > > The cycle is derived from merge_through_phi, in which only a Phi cycle is allowed [1]. > > The method `expand_vbox_node_helper` serves to locate the target node `Proj(VectorBoxAllocate)` and then replace it with a bunch of other nodes. When expanding VectorBox, the normal case is `VectorBox (Proj value)`. If it has the shape `VectorBox (Phi1 Phi2)`[2] or `VectorBox (Phi1 vect)`[3], I suppose it was transformed by merge_through_phi. > > In `expand_vbox_node_helper`, the recursion happens only at Phi1, since that is where the Proj is.
> > [1] https://github.com/e1iu/jdk/blob/ENTLLT-4778-external/src/hotspot/share/opto/cfgnode.cpp#L2574 > [2] https://github.com/e1iu/jdk/blob/ENTLLT-4778-external/src/hotspot/share/opto/vector.cpp#L331 > [3] https://github.com/e1iu/jdk/blob/ENTLLT-4778-external/src/hotspot/share/opto/vector.cpp#L341 Okay, thanks for the details! Looks good. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13489#discussion_r1170982063 PR Review Comment: https://git.openjdk.org/jdk/pull/13489#discussion_r1170982799 From jsjolen at openjdk.org Wed Apr 19 13:19:43 2023 From: jsjolen at openjdk.org (Johan Sjölen) Date: Wed, 19 Apr 2023 13:19:43 GMT Subject: RFR: 8306444: Don't leak memory in PhaseChaitin::PhaseChaitin Message-ID: Hi, First, `PhaseChaitin::PhaseChaitin` used to create 4 resource arrays of size `_cfg.number_of_blocks`: one to store all of the block pointers in (`_blks`), and three to sort the blocks in some order. The latter three weren't freed in the constructor, causing them to hang around for the entire duration of the phase. This is unnecessary, so this patch frees the arrays when we're done with them. It also allocates all of the resource arrays in one go. Second, it copied over each partially filled bucket into `_blks`, one block at a time. This patch changes this so that we don't allocate the `_blks` resource array at all; instead, we simply squash all of the partially filled buckets into the first one using `::memmove`. I haven't done any micro benchmarking, but this should be faster and take less space. This is currently passing tier1.
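The bucket squashing described above can be sketched in isolation. This is only a minimal illustration under stated assumptions, not the HotSpot code: `Block`, `squash_buckets`, and the layout (three equally sized buckets placed back to back in a single allocation) are hypothetical names and choices made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a basic block; only pointer identity matters here.
struct Block { int id; };

// Squash three partially filled buckets, laid out back to back in one
// allocation of 3 * capacity slots, into a dense prefix of that allocation.
// counts[b] is the number of valid entries at the start of bucket b.
// Returns the number of valid entries now at the front of `slots`.
std::size_t squash_buckets(Block** slots, std::size_t capacity,
                           const std::size_t counts[3]) {
  assert(counts[0] <= capacity);
  std::size_t filled = counts[0];  // bucket 0 is already in place
  for (std::size_t b = 1; b < 3; b++) {
    assert(counts[b] <= capacity);
    // Source and destination ranges may overlap, hence memmove, not memcpy.
    std::memmove(slots + filled, slots + b * capacity,
                 counts[b] * sizeof(Block*));
    filled += counts[b];
  }
  return filled;
}
```

Because the destination offset never exceeds the source offset, each bucket is moved at most once and no separate output array is needed, which appears to be the saving the message is after.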
------------- Commit messages: - Use nr_blocks in assert - Merge loops - Optimize PhaseChaitin Changes: https://git.openjdk.org/jdk/pull/13533/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13533&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8306444 Stats: 37 lines in 1 file changed: 23 ins; 6 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/13533.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13533/head:pull/13533 PR: https://git.openjdk.org/jdk/pull/13533 From jsjolen at openjdk.org Wed Apr 19 14:28:54 2023 From: jsjolen at openjdk.org (Johan Sjölen) Date: Wed, 19 Apr 2023 14:28:54 GMT Subject: RFR: 8306456: Don't leak _worklist's memory in PhaseLive::compute Message-ID: `PhaseLive::compute` used to do this: `_worklist = new (_arena) Block_List();`. This allocates the `Block_List` in the `_arena`, but the backing array is allocated on the resource area: `Block_List() : Block_Array(Thread::current()->resource_area()), _cnt(0) {}`. This causes at least 4 and at most 5 worklists to be created and not freed until the compilation is finished. This patch allocates the worklist within `PhaseLive::compute`'s local resource mark.
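The effect of allocating inside a local resource mark can be sketched with a toy bump allocator. This is a rough illustration only: `ToyResourceArea`, `ToyResourceMark`, and `compute_pass` are hypothetical names invented for the example and are not HotSpot's `ResourceArea` or `ResourceMark` classes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy bump allocator standing in for a thread's resource area (hypothetical).
class ToyResourceArea {
 public:
  explicit ToyResourceArea(std::size_t bytes) : _buf(bytes), _top(0) {}
  void* allocate(std::size_t bytes) {
    assert(_top + bytes <= _buf.size());  // out of space means something leaked
    void* p = _buf.data() + _top;
    _top += bytes;
    return p;
  }
  std::size_t used() const { return _top; }
  void rollback(std::size_t top) { _top = top; }
 private:
  std::vector<unsigned char> _buf;
  std::size_t _top;
};

// RAII mark: everything allocated after construction is released on destruction.
class ToyResourceMark {
 public:
  explicit ToyResourceMark(ToyResourceArea& area)
      : _area(area), _saved(area.used()) {}
  ~ToyResourceMark() { _area.rollback(_saved); }
 private:
  ToyResourceArea& _area;
  std::size_t _saved;
};

// One pass: the worklist's backing storage lives only inside the mark's scope.
std::size_t compute_pass(ToyResourceArea& area, std::size_t nblocks) {
  ToyResourceMark mark(area);
  int* worklist = static_cast<int*>(area.allocate(nblocks * sizeof(int)));
  std::size_t sum = 0;
  for (std::size_t i = 0; i < nblocks; i++) {
    worklist[i] = static_cast<int>(i);
    sum += static_cast<std::size_t>(worklist[i]);
  }
  return sum;  // the mark's destructor rolls the area back here
}
```

With the mark in place, repeated passes reuse the same memory; without it, each pass would add another worklist to the area until the enclosing scope ends, which is the kind of accumulation the message describes.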
-------------

Commit messages:
 - Don't leak the worklist in PhaseLive

Changes: https://git.openjdk.org/jdk/pull/13535/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13535&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8306456
Stats: 5 lines in 1 file changed: 3 ins; 1 del; 1 mod
Patch: https://git.openjdk.org/jdk/pull/13535.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/13535/head:pull/13535

PR: https://git.openjdk.org/jdk/pull/13535

From qamai at openjdk.org Wed Apr 19 15:50:49 2023
From: qamai at openjdk.org (Quan Anh Mai)
Date: Wed, 19 Apr 2023 15:50:49 GMT
Subject: RFR: 8051725: Improve expansion of Conv2B nodes in the middle-end [v3]
In-Reply-To:
References:
Message-ID:

On Wed, 19 Apr 2023 04:30:39 GMT, Jasmine Karthikeyan wrote:

>> Hi, I've created optimizations for the expansion of `Conv2B` nodes, especially when followed immediately by an xor of 1. This pattern is fairly common, and can arise from both [cmov idealization](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/movenode.cpp#L241) and [diamond-phi optimization](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L1571). This change replaces `Conv2B` nodes in the middle-end during macro expansion with conditional moves, allowing the bit flip with `xor` to be subsumed with an inversion of the comparison instead. This change also reduces the overhead of the matcher in the backend, as fewer rules need to be traversed in order to match an ideal node. Performance results from my (Zen 2) machine:
>>
>>
>>                                            Baseline                       Patch               Improvement
>> Benchmark                      Mode  Cnt   Score   Error  Units      Score   Error  Units
>> Conv2BRules.testEquals0        avgt   10  47.566 ± 0.346  ns/op  /  34.130 ± 0.177  ns/op   + 28.2%
>> Conv2BRules.testNotEquals0     avgt   10  37.167 ± 0.211  ns/op  /  34.185 ± 0.258  ns/op   + 8.0%
>> Conv2BRules.testEquals1        avgt   10  35.059 ± 0.280  ns/op  /  34.847 ± 0.160  ns/op   (unchanged)
>> Conv2BRules.testEqualsNull     avgt   10  56.768 ± 2.600  ns/op  /  34.330 ± 0.625  ns/op   + 39.5%
>> Conv2BRules.testNotEqualsNull  avgt   10  47.447 ± 1.193  ns/op  /  34.142 ± 0.303  ns/op   + 28.0%
>>
>> Reviews would be greatly appreciated!
>>
>> Testing: tier1-2 on linux x64, GHA
>
> Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision:
>
>   Remove Conv2B from backend as it's macro expanded now

Generally I think it's good. One small question: should we split the `Xor` through the `CMove` as an idealisation of the `Xor`?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/13345#issuecomment-1514967526

From amitkumar at openjdk.org Wed Apr 19 16:08:50 2023
From: amitkumar at openjdk.org (Amit Kumar)
Date: Wed, 19 Apr 2023 16:08:50 GMT
Subject: RFR: 8306459: s390x: Replace NULL to nullptr
Message-ID:

This PR changes two occurrences of NULL to null and nullptr. It is a trivial change; these were probably left out of #12325.

-------------

Commit messages:
 - replace NULL with nullptr

Changes: https://git.openjdk.org/jdk/pull/13538/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13538&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8306459
Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod
Patch: https://git.openjdk.org/jdk/pull/13538.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/13538/head:pull/13538

PR: https://git.openjdk.org/jdk/pull/13538

From kvn at openjdk.org Wed Apr 19 17:16:50 2023
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Wed, 19 Apr 2023 17:16:50 GMT
Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v9]
In-Reply-To: <0oNkCfUBIR1hpPwN0i_ONwwyjd0AYux7GkLm-G1PdsU=.b3a5e7ff-e9bf-45b6-b996-691f86aa7057@github.com>
References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com>
 <0oNkCfUBIR1hpPwN0i_ONwwyjd0AYux7GkLm-G1PdsU=.b3a5e7ff-e9bf-45b6-b996-691f86aa7057@github.com>
Message-ID:

On Mon, 17 Apr 2023 16:17:30
GMT, Cesar Soares Lucas wrote: >> Can I please get reviews for this PR? >> >> The most common and frequent use of NonEscaping Phis merging object allocations is for debugging information. The two graphs below show numbers for Renaissance and DaCapo benchmarks - similar results are obtained for all other applications that I tested. >> >> With what frequency does each IR node type occurs as an allocation merge user? I.e., if the same node type uses a Phi N times the counter is incremented by N: >> >> ![image](https://user-images.githubusercontent.com/2249648/222280517-4dcf5871-2564-4207-b49e-22aee47fa49d.png) >> >> What are the most common users of allocation merges? I.e., if the same node type uses a Phi N times the counter is incremented by 1: >> >> ![image](https://user-images.githubusercontent.com/2249648/222280608-ca742a4e-1622-4e69-a778-e4db6805ea02.png) >> >> This PR adds support scalar replacing allocations participating in merges used as debug information OR as a base for field loads. I plan to create subsequent PRs to enable scalar replacement of merges used by other node types (CmpP is next on the list) subsequently. >> >> The approach I used for _rematerialization_ is pretty straightforward. It consists basically of the following. 1) New IR node (suggested by V. Kozlov), named SafePointScalarMergeNode, to represent a set of SafePointScalarObjectNode; 2) Each scalar replaceable input participating in a merge will get a SafePointScalarObjectNode like if it weren't part of a merge. 3) Add a new Class to support the rematerialization of SR objects that are part of a merge; 4) Patch HotSpot to be able to serialize and deserialize debug information related to allocation merges; 5) Patch C2 to generate unique types for SR objects participating in some allocation merges. 
>> >> The approach I used for _enabling the scalar replacement of some of the inputs of the allocation merge_ is also pretty straightforward: call `MemNode::split_through_phi` to, well, split AddP->Load* through the merge which will render the Phi useless. >> >> I tested this with JTREG tests tier 1-4 (Windows, Linux, and Mac) and didn't see regression. I also experimented with several applications and didn't see any failure. I also ran tests with "-ea -esa -Xbatch -Xcomp -XX:+UnlockExperimentalVMOptions -XX:-TieredCompilation -server -XX:+IgnoreUnrecognizedVMOptions -XX:+UnlockDiagnosticVMOptions -XX:+StressLCM -XX:+StressGCM -XX:+StressCCP" and didn't observe any related failures. > > Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision: > > Fix tests. Remember previous reducible Phis. Submitted new testing ------------- PR Comment: https://git.openjdk.org/jdk/pull/12897#issuecomment-1515089566 From kvn at openjdk.org Wed Apr 19 17:54:49 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 19 Apr 2023 17:54:49 GMT Subject: RFR: 8306444: Don't leak memory in PhaseChaitin::PhaseChaitin In-Reply-To: References: Message-ID: <1XX36qmVJxnq4scfh4CY8YMuo40K4omFvs370V7hIfI=.f976cd60-0d1c-4691-a582-bf00fbc3a2fd@github.com> On Wed, 19 Apr 2023 13:12:00 GMT, Johan Sj?len wrote: > Hi, > > First, `PhaseChaitin::PhaseChaitin` used to create 4 resource array of size `_cfg.number_of_blocks`: one to store all of the block pointers in (`_blks`), and three to do a sorting of the blocks in some order. The latter three weren't freed in the constructor, causing them to hang around for the entire duration of the phase. This is unnecessary, so this patch frees the arrays when we're done with them. It also allocates all of the resources arrays in one go. > > Second, it copied over each partially filled bucket into the `_blks`, one block at a time. 
This patch changes this so that we don't allocate the `_blks` resource array at all, instead we simply squash all of the partially filled buckets into the first one using `::memmove`.
>
> I haven't done any micro benchmarking, but this should be faster and take less space.
>
> This is currently passing tier1.

Changes are good. Just a few comments:

src/hotspot/share/opto/chaitin.cpp line 250:

> 248:     uint cnt = buckcnt[j];
> 249:     // Assign block to end of list for appropriate bucket
> 250:     buckets[j][cnt] = _cfg.get_block(i);

You can use `blk` here.

src/hotspot/share/opto/chaitin.cpp line 263:

> 261:     ::memmove(offset, buckets[i], buckcnt[i]*sizeof(Block*));
> 262:     offset += buckcnt[i];
> 263:   }

Maybe add an assert that `(offset - &buckets[0][0]) == nr_blocks`.

src/hotspot/share/opto/chaitin.cpp line 265:

> 263:   }
> 264:   // Free the now unused memory
> 265:   FREE_RESOURCE_ARRAY(Block*, buckets[1], (NUMBUCKS-1)*nr_blocks);

I did not know that you can free part of an allocated space.

-------------

PR Review: https://git.openjdk.org/jdk/pull/13533#pullrequestreview-1392572048
PR Review Comment: https://git.openjdk.org/jdk/pull/13533#discussion_r1171657776
PR Review Comment: https://git.openjdk.org/jdk/pull/13533#discussion_r1171680140
PR Review Comment: https://git.openjdk.org/jdk/pull/13533#discussion_r1171675541

From kvn at openjdk.org Wed Apr 19 17:56:43 2023
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Wed, 19 Apr 2023 17:56:43 GMT
Subject: RFR: 8306456: Don't leak _worklist's memory in PhaseLive::compute
In-Reply-To:
References:
Message-ID:

On Wed, 19 Apr 2023 14:21:13 GMT, Johan Sjölen wrote:

> `PhaseLive::compute` used to do this: `_worklist = new (_arena) Block_List();`. This allocates the `Block_List` to the `_arena`, but the backing array is allocated on the resource area: `Block_List() : Block_Array(Thread::current()->resource_area()), _cnt(0) {}`.
This causes at most 5 worklists and at least 4 worklists to be created and not freed until the compilation is finished. This patch allocates the worklist within `PhaseLive::compute`:s local resource mark. Good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13535#pullrequestreview-1392607956 From qamai at openjdk.org Wed Apr 19 19:11:47 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Wed, 19 Apr 2023 19:11:47 GMT Subject: RFR: 8304948: [vectorapi] C2 crashes when expanding VectorBox [v2] In-Reply-To: References: Message-ID: On Wed, 19 Apr 2023 08:11:54 GMT, Eric Liu wrote: >> This patch fixes C2 failure with SIGSEGV due to endless recursion. >> >> With test case VectorBoxExpandTest.java in this patch, C2 would generate IR graph like below: >> >> >> ------------ >> / \ >> Region | VectorBox | >> \ | / | >> Phi | >> | | >> | | >> Region | VectorBox | >> \ | / | >> Phi | >> | | >> |------------/ >> | >> >> >> >> This Phi will be optimized by merge_through_phi [1], which transforms `Phi (VectorBox VectorBox)` into `VectorBox (Phi Phi)` to pursue opportunity of combining VectorBox with VectorUnbox. In this process, either the pre type check [2] or the process cloning Phi nodes [3], the circle case is well considered to avoid falling into endless loop. >> >> After merge_through_phi, each input Phi of new VectorBox has the same shape with original root Phi before merging (only VectorBox has been replaced). After several other optimizations, C2 would expand VectorBox [4] on a graph like below: >> >> >> ------------ >> / \ >> Region | Proj | >> \ | / | >> Phi | >> | | >> | | >> Region | Proj | >> \ | / | >> Phi | >> | | >> |------------/ >> | >> | Phi >> | / >> VectorBox >> >> >> which the circle case should be taken into consideration as well. >> >> [TEST] >> Full Jtreg passed without new failure. 
>> >> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L2554 >> [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L2571 >> [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L2531 >> [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/vector.cpp#L311 > > Eric Liu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Merge jdk:master > > Change-Id: I63c06b87d5b0c20ddaec0aa43031872b8ebb5362 > - fix typo > > Change-Id: I1b84c4957398178bf234f71242a1cdd044181a79 > - 8304948: [vectorapi] C2 crashes when expanding VectorBox > > This patch fixes C2 failure with SIGSEGV due to endless recursion. > > With test case VectorBoxExpandTest.java in this patch, C2 would generate > IR graph like below: > > ``` > ------------ > / \ > Region | VectorBox | > \ | / | > Phi | > | | > | | > Region | VectorBox | > \ | / | > Phi | > | | > |\------------/ > | > > ``` > > This Phi will be optimized by merge_through_phi [1], which transforms > `Phi (VectorBox VectorBox)` into `VectorBox (Phi Phi)` to pursue > opportunity of combining VectorBox with VectorUnbox. In this process, > either the pre type check [2] or the process cloning Phi nodes [3], the > circle case is well considered to avoid falling into endless loop. > > After merge_through_phi, each input Phi of new VectorBox has the same > shape with original root Phi before merging (only VectorBox has been > replaced). 
After several other optimizations, C2 would expand VectorBox > [4] on a graph like below: > > ``` > ------------ > / \ > Region | Proj | > \ | / | > Phi | > | | > | | > Region | Proj | > \ | / | > Phi | > | | > |\------------/ > | > | Phi > | / > VectorBox > > ``` > which the circle case should be taken into consideration as well. > > [TEST] > Full Jtreg passed without new failure. > > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L2557 > [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L2574 > [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L2534 > [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/vector.cpp#L316 > > Change-Id: I381b1ba7e0865814d97535e365db6d9d72ef1949 src/hotspot/share/opto/vector.cpp line 323: > 321: if (visited.test_set(vbox->_idx)) { > 322: assert(vbox->is_Phi(), "not a phi"); > 323: return vbox; // already visited Shouldn't the short circuit return a transformed node instead? Or it does not matter here? Thanks. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13489#discussion_r1171752679 From dlong at openjdk.org Wed Apr 19 23:28:44 2023 From: dlong at openjdk.org (Dean Long) Date: Wed, 19 Apr 2023 23:28:44 GMT Subject: RFR: 8306456: Don't leak _worklist's memory in PhaseLive::compute In-Reply-To: References: Message-ID: On Wed, 19 Apr 2023 14:21:13 GMT, Johan Sj?len wrote: > `PhaseLive::compute` used to do this: `_worklist = new (_arena) Block_List();`. This allocates the `Block_List` to the `_arena`, but the backing array is allocated on the resource area: `Block_List() : Block_Array(Thread::current()->resource_area()), _cnt(0) {}`. This causes at most 5 worklists and at least 4 worklists to be created and not freed until the compilation is finished. This patch allocates the worklist within `PhaseLive::compute`:s local resource mark. Changes requested by dlong (Reviewer). 
src/hotspot/share/opto/live.cpp line 92: > 90: Block_List wl; > 91: _worklist = &wl; > 92: Now `_worklist` is a dangling pointer to released stack memory at the end of this method. How do we make sure it isn't used? ------------- PR Review: https://git.openjdk.org/jdk/pull/13535#pullrequestreview-1393009006 PR Review Comment: https://git.openjdk.org/jdk/pull/13535#discussion_r1171940037 From kvn at openjdk.org Thu Apr 20 00:37:49 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 20 Apr 2023 00:37:49 GMT Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v9] In-Reply-To: <0oNkCfUBIR1hpPwN0i_ONwwyjd0AYux7GkLm-G1PdsU=.b3a5e7ff-e9bf-45b6-b996-691f86aa7057@github.com> References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> <0oNkCfUBIR1hpPwN0i_ONwwyjd0AYux7GkLm-G1PdsU=.b3a5e7ff-e9bf-45b6-b996-691f86aa7057@github.com> Message-ID: On Mon, 17 Apr 2023 16:17:30 GMT, Cesar Soares Lucas wrote: >> Can I please get reviews for this PR? >> >> The most common and frequent use of NonEscaping Phis merging object allocations is for debugging information. The two graphs below show numbers for Renaissance and DaCapo benchmarks - similar results are obtained for all other applications that I tested. >> >> With what frequency does each IR node type occurs as an allocation merge user? I.e., if the same node type uses a Phi N times the counter is incremented by N: >> >> ![image](https://user-images.githubusercontent.com/2249648/222280517-4dcf5871-2564-4207-b49e-22aee47fa49d.png) >> >> What are the most common users of allocation merges? I.e., if the same node type uses a Phi N times the counter is incremented by 1: >> >> ![image](https://user-images.githubusercontent.com/2249648/222280608-ca742a4e-1622-4e69-a778-e4db6805ea02.png) >> >> This PR adds support scalar replacing allocations participating in merges used as debug information OR as a base for field loads. 
I plan to create subsequent PRs to enable scalar replacement of merges used by other node types (CmpP is next on the list) subsequently. >> >> The approach I used for _rematerialization_ is pretty straightforward. It consists basically of the following. 1) New IR node (suggested by V. Kozlov), named SafePointScalarMergeNode, to represent a set of SafePointScalarObjectNode; 2) Each scalar replaceable input participating in a merge will get a SafePointScalarObjectNode like if it weren't part of a merge. 3) Add a new Class to support the rematerialization of SR objects that are part of a merge; 4) Patch HotSpot to be able to serialize and deserialize debug information related to allocation merges; 5) Patch C2 to generate unique types for SR objects participating in some allocation merges. >> >> The approach I used for _enabling the scalar replacement of some of the inputs of the allocation merge_ is also pretty straightforward: call `MemNode::split_through_phi` to, well, split AddP->Load* through the merge which will render the Phi useless. >> >> I tested this with JTREG tests tier 1-4 (Windows, Linux, and Mac) and didn't see regression. I also experimented with several applications and didn't see any failure. I also ran tests with "-ea -esa -Xbatch -Xcomp -XX:+UnlockExperimentalVMOptions -XX:-TieredCompilation -server -XX:+IgnoreUnrecognizedVMOptions -XX:+UnlockDiagnosticVMOptions -XX:+StressLCM -XX:+StressGCM -XX:+StressCCP" and didn't observe any related failures. > > Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision: > > Fix tests. Remember previous reducible Phis. 
Again got failures in the test on Aarch64 running with -XX:-UseTLAB: testCmpMergeWithNull(boolean,int,int): - Failed comparison: [found] 0 = 2 [given] testCmpMergeWithNull_Second(boolean,int,int) - Failed comparison: [found] 0 = 1 [given] testMergedAccessAfterCallNoWrite(boolean,int,int) - Failed comparison: [found] 2 = 3 [given] testMergedAccessAfterCallWithWrite(boolean,int,int) - Failed comparison: [found] 2 = 3 [given] testNestedObjectsArray(boolean,int,int) - Failed comparison: [found] 2 = 4 [given] ------------- PR Comment: https://git.openjdk.org/jdk/pull/12897#issuecomment-1515550553 From kvn at openjdk.org Thu Apr 20 00:50:59 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 20 Apr 2023 00:50:59 GMT Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v9] In-Reply-To: <0oNkCfUBIR1hpPwN0i_ONwwyjd0AYux7GkLm-G1PdsU=.b3a5e7ff-e9bf-45b6-b996-691f86aa7057@github.com> References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> <0oNkCfUBIR1hpPwN0i_ONwwyjd0AYux7GkLm-G1PdsU=.b3a5e7ff-e9bf-45b6-b996-691f86aa7057@github.com> Message-ID: <09b2gzJOWHojxvBpg79PfgQgD0qh56CqHJk484zJX-8=.f1df20ad-c202-4a20-a98b-c334e808eaae@github.com> On Mon, 17 Apr 2023 16:17:30 GMT, Cesar Soares Lucas wrote: >> Can I please get reviews for this PR? >> >> The most common and frequent use of NonEscaping Phis merging object allocations is for debugging information. The two graphs below show numbers for Renaissance and DaCapo benchmarks - similar results are obtained for all other applications that I tested. >> >> With what frequency does each IR node type occurs as an allocation merge user? I.e., if the same node type uses a Phi N times the counter is incremented by N: >> >> ![image](https://user-images.githubusercontent.com/2249648/222280517-4dcf5871-2564-4207-b49e-22aee47fa49d.png) >> >> What are the most common users of allocation merges? 
I.e., if the same node type uses a Phi N times the counter is incremented by 1: >> >> ![image](https://user-images.githubusercontent.com/2249648/222280608-ca742a4e-1622-4e69-a778-e4db6805ea02.png) >> >> This PR adds support scalar replacing allocations participating in merges used as debug information OR as a base for field loads. I plan to create subsequent PRs to enable scalar replacement of merges used by other node types (CmpP is next on the list) subsequently. >> >> The approach I used for _rematerialization_ is pretty straightforward. It consists basically of the following. 1) New IR node (suggested by V. Kozlov), named SafePointScalarMergeNode, to represent a set of SafePointScalarObjectNode; 2) Each scalar replaceable input participating in a merge will get a SafePointScalarObjectNode like if it weren't part of a merge. 3) Add a new Class to support the rematerialization of SR objects that are part of a merge; 4) Patch HotSpot to be able to serialize and deserialize debug information related to allocation merges; 5) Patch C2 to generate unique types for SR objects participating in some allocation merges. >> >> The approach I used for _enabling the scalar replacement of some of the inputs of the allocation merge_ is also pretty straightforward: call `MemNode::split_through_phi` to, well, split AddP->Load* through the merge which will render the Phi useless. >> >> I tested this with JTREG tests tier 1-4 (Windows, Linux, and Mac) and didn't see regression. I also experimented with several applications and didn't see any failure. I also ran tests with "-ea -esa -Xbatch -Xcomp -XX:+UnlockExperimentalVMOptions -XX:-TieredCompilation -server -XX:+IgnoreUnrecognizedVMOptions -XX:+UnlockDiagnosticVMOptions -XX:+StressLCM -XX:+StressGCM -XX:+StressCCP" and didn't observe any related failures. > > Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision: > > Fix tests. Remember previous reducible Phis. 
Also next 2 JVMCI tests failed: compiler/jvmci/jdk.vm.ci.runtime.test/src/jdk/vm/ci/runtime/test/TestResolvedJavaMethod.java compiler/jvmci/jdk.vm.ci.runtime.test/src/jdk/vm/ci/runtime/test/TestResolvedJavaType.java # Internal Error (/workspace/open/src/hotspot/cpu/x86/macroAssembler_x86.cpp:829), pid=2430194, tid=2430218 # fatal error: DEBUG MESSAGE: exact klass and actual klass differ Could be due to [12810](https://git.openjdk.org/jdk/pull/12810) ------------- PR Comment: https://git.openjdk.org/jdk/pull/12897#issuecomment-1515556571 From duke at openjdk.org Thu Apr 20 01:55:44 2023 From: duke at openjdk.org (Chang Peng) Date: Thu, 20 Apr 2023 01:55:44 GMT Subject: RFR: 8301739: AArch64: Add optimized rules for vector compare with immediate for SVE [v2] In-Reply-To: References: Message-ID: > We can use SVE compare-with-integer-immediate instructions like cmpgt(immediate)[1] to avoid the extra scalar2vector operations. > > The following instruction sequence > > > movi v17.16b, #12 > cmpgt p0.b, p7/z, z16.b, z17.b > > > can be optimized to: > > > cmpgt p0.b, p7/z, z16.b, #12 > > > This patch does the following: > 1. Add SVE compare-with-7bit-unsigned-immediate instructions to C2's backend. > SVE cmp(immediate) instructions can support vector comparing with 7bit unsigned integer immediate (range from 0 to > 127)or 5bit signed integer immediate (range from -16 to 15). > > 2. Add optimized match rules to generate the compare-with-immediate instructions. > > [1]: https://developer.arm.com/documentation/ddi0596/2021-12/SVE-Instructions/CMP-cc---immediate---Compare-vector-to-immediate- Chang Peng has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains two additional commits since the last revision: - Merge branch 'openjdk:master' into add_sve_cmpU - 8301739: AArch64: Add optimized rules for vector compare with immediate for SVE We can use SVE compare-with-integer-immediate instructions like cmpgt(immediate)[1] to avoid the extra scalar2vector operations. The following instruction sequence ``` movi v17.16b, #12 cmpgt p0.b, p7/z, z16.b, z17.b ``` can be optimized to: ``` cmpgt p0.b, p7/z, z16.b, #12 ``` This patch does the following: 1. Add SVE compare-with-7bit-unsigned-immediate instructions to C2's backend. SVE cmp(immediate) instructions can support vector comparing with 7bit unsigned integer immediate (range from 0 to 127) or 5bit signed integer immediate (range from -16 to 15). 2. Add optimized match rules to generate the compare-with-immediate instructions. [1]: https://developer.arm.com/documentation/ddi0596/2021-12/SVE-Instructions/CMP-cc---immediate---Compare-vector-to-immediate- TEST_LABEL: v1 || n2, aarch64&&ubuntu&&conformance-enabled JDK_SCOPE: hotspot:compiler/vectorapi, jdk:jdk/incubator/vector/ Jira: ENTLLT-5294 Change-Id: I6b915864308faf9a8ec6e35ca1b4948666d75dca ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13200/files - new: https://git.openjdk.org/jdk/pull/13200/files/dd190608..8a9a43d9 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13200&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13200&range=00-01 Stats: 295625 lines in 3103 files changed: 247243 ins; 30238 del; 18144 mod Patch: https://git.openjdk.org/jdk/pull/13200.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13200/head:pull/13200 PR: https://git.openjdk.org/jdk/pull/13200 From duke at openjdk.org Thu Apr 20 02:25:45 2023 From: duke at openjdk.org (Chang Peng) Date: Thu, 20 Apr 2023 02:25:45 GMT Subject: RFR: 8301739: AArch64: Add optimized rules for vector compare with immediate for SVE [v3] In-Reply-To: References: Message-ID: > We can use SVE 
compare-with-integer-immediate instructions like cmpgt(immediate)[1] to avoid the extra scalar2vector operations. > > The following instruction sequence > > > movi v17.16b, #12 > cmpgt p0.b, p7/z, z16.b, z17.b > > > can be optimized to: > > > cmpgt p0.b, p7/z, z16.b, #12 > > > This patch does the following: > 1. Add SVE compare-with-7bit-unsigned-immediate instructions to C2's backend. > SVE cmp(immediate) instructions can support vector comparing with 7bit unsigned integer immediate (range from 0 to > 127)or 5bit signed integer immediate (range from -16 to 15). > > 2. Add optimized match rules to generate the compare-with-immediate instructions. > > [1]: https://developer.arm.com/documentation/ddi0596/2021-12/SVE-Instructions/CMP-cc---immediate---Compare-vector-to-immediate- Chang Peng has updated the pull request incrementally with one additional commit since the last revision: Refactor some code ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13200/files - new: https://git.openjdk.org/jdk/pull/13200/files/8a9a43d9..d9d861ea Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13200&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13200&range=01-02 Stats: 111 lines in 4 files changed: 42 ins; 28 del; 41 mod Patch: https://git.openjdk.org/jdk/pull/13200.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13200/head:pull/13200 PR: https://git.openjdk.org/jdk/pull/13200 From dlong at openjdk.org Thu Apr 20 02:50:54 2023 From: dlong at openjdk.org (Dean Long) Date: Thu, 20 Apr 2023 02:50:54 GMT Subject: RFR: 8306331: assert((cnt > 0.0f) && (prob > 0.0f)) failed: Bad frequency assignment in if Message-ID: This change removes undefined behavior caused by signed overflow, which triggered an assert with Xcode14.3+1.0-beta1 on macos aarch64. 
------------- Commit messages: - fix signed overflow Changes: https://git.openjdk.org/jdk/pull/13551/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13551&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8306331 Stats: 26 lines in 1 file changed: 24 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/13551.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13551/head:pull/13551 PR: https://git.openjdk.org/jdk/pull/13551 From eliu at openjdk.org Thu Apr 20 03:23:49 2023 From: eliu at openjdk.org (Eric Liu) Date: Thu, 20 Apr 2023 03:23:49 GMT Subject: RFR: 8304948: [vectorapi] C2 crashes when expanding VectorBox [v2] In-Reply-To: References: Message-ID: On Wed, 19 Apr 2023 19:08:57 GMT, Quan Anh Mai wrote: >> Eric Liu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: >> >> - Merge jdk:master >> >> Change-Id: I63c06b87d5b0c20ddaec0aa43031872b8ebb5362 >> - fix typo >> >> Change-Id: I1b84c4957398178bf234f71242a1cdd044181a79 >> - 8304948: [vectorapi] C2 crashes when expanding VectorBox >> >> This patch fixes C2 failure with SIGSEGV due to endless recursion. >> >> With test case VectorBoxExpandTest.java in this patch, C2 would generate >> IR graph like below: >> >> ``` >> ------------ >> / \ >> Region | VectorBox | >> \ | / | >> Phi | >> | | >> | | >> Region | VectorBox | >> \ | / | >> Phi | >> | | >> |\------------/ >> | >> >> ``` >> >> This Phi will be optimized by merge_through_phi [1], which transforms >> `Phi (VectorBox VectorBox)` into `VectorBox (Phi Phi)` to pursue >> opportunity of combining VectorBox with VectorUnbox. In this process, >> either the pre type check [2] or the process cloning Phi nodes [3], the >> circle case is well considered to avoid falling into endless loop. 
>> >> After merge_through_phi, each input Phi of new VectorBox has the same >> shape with original root Phi before merging (only VectorBox has been >> replaced). After several other optimizations, C2 would expand VectorBox >> [4] on a graph like below: >> >> ``` >> ------------ >> / \ >> Region | Proj | >> \ | / | >> Phi | >> | | >> | | >> Region | Proj | >> \ | / | >> Phi | >> | | >> |\------------/ >> | >> | Phi >> | / >> VectorBox >> >> ``` >> which the circle case should be taken into consideration as well. >> >> [TEST] >> Full Jtreg passed without new failure. >> >> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L2557 >> [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L2574 >> [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L2534 >> [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/vector.cpp#L316 >> >> Change-Id: I381b1ba7e0865814d97535e365db6d9d72ef1949 > > src/hotspot/share/opto/vector.cpp line 323: > >> 321: if (visited.test_set(vbox->_idx)) { >> 322: assert(vbox->is_Phi(), "not a phi"); >> 323: return vbox; // already visited > > Shouldn't the short circuit return a transformed node instead? Or it does not matter here? Thanks. It does not matter. It will finally be transformed in the round, in which it is the root. At this round, it's an input of another Phi. 
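
The short circuit under discussion can be pictured with a toy graph walk. This is only a sketch under stated assumptions: `Node`, `Kind` and `expand` are invented for the example and are not C2's `Node` or `expand_vbox_node_helper`, and in this toy version Phi nodes are updated in place, so handing back an already visited Phi keeps the cycle intact.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical node kinds for the toy graph.
enum class Kind { Phi, Proj, Replacement };

struct Node {
  Kind kind;
  std::vector<Node*> in;  // inputs
};

// Walks the inputs reachable from `n` and swaps every Proj input for
// `replacement`. Phis are edited in place, so a Phi that was already
// visited is simply returned; this stops the recursion on a Phi cycle.
Node* expand(Node* n, Node* replacement, std::set<Node*>& visited) {
  if (n->kind == Kind::Proj) {
    return replacement;        // leaf to be replaced
  }
  if (n->kind != Kind::Phi) {
    return n;                  // anything else is left alone
  }
  if (!visited.insert(n).second) {
    return n;                  // already visited: short circuit
  }
  for (Node*& input : n->in) {
    Node* result = expand(input, replacement, visited);
    if (result->kind != Kind::Phi) {
      input = result;          // only non-Phi results change an edge
    }
  }
  return n;
}
```

On a two-Phi cycle where each Phi also has a Proj input, the walk terminates, both Proj inputs end up pointing at the replacement, and the two Phis still reference each other.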
-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/13489#discussion_r1172041740

From qamai at openjdk.org Thu Apr 20 03:43:46 2023
From: qamai at openjdk.org (Quan Anh Mai)
Date: Thu, 20 Apr 2023 03:43:46 GMT
Subject: RFR: 8304948: [vectorapi] C2 crashes when expanding VectorBox [v2]
In-Reply-To:
References:
Message-ID:

On Thu, 20 Apr 2023 03:21:09 GMT, Eric Liu wrote:

>> src/hotspot/share/opto/vector.cpp line 323:
>>
>>> 321:   if (visited.test_set(vbox->_idx)) {
>>> 322:     assert(vbox->is_Phi(), "not a phi");
>>> 323:     return vbox; // already visited
>>
>> Shouldn't the short circuit return a transformed node instead? Or it does not matter here? Thanks.
>
> It does not matter. It will finally be transformed in the round, in which it is the root. At this round, it's an input of another Phi.

But that phi will have an incorrect input, because the return value of this call is used as an input of the transformed phi that uses this node?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/13489#discussion_r1172049352

From eliu at openjdk.org Thu Apr 20 05:25:43 2023
From: eliu at openjdk.org (Eric Liu)
Date: Thu, 20 Apr 2023 05:25:43 GMT
Subject: RFR: 8304948: [vectorapi] C2 crashes when expanding VectorBox [v2]
In-Reply-To:
References:
Message-ID:

On Thu, 20 Apr 2023 03:41:02 GMT, Quan Anh Mai wrote:

> But that phi will have an incorrect input, because the return value of this call is used as an input of the transformed phi that uses this node?

The return value of this call is Phi1. Phi1 is used as an input of Phi2, which is used by Phi1 as well. The Phi cycle is not an incorrect shape; it's a normal case generated by some simple code, e.g., the test case in this patch. When expanding a VectorBox node, the purpose is to traverse the first input of VectorBox to locate Proj, and replace Proj with some other nodes. The first input of VectorBox can be a graph that contains Phi (maybe a Phi cycle) and Proj. The process of finding and replacing Proj is not done in the local graph; it creates a new graph at the same time. Returning the visited node here is used to maintain that cycle. Besides Proj, nodes in the graph should not be changed.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/13489#discussion_r1172093284

From aph at openjdk.org Thu Apr 20 08:11:56 2023
From: aph at openjdk.org (Andrew Haley)
Date: Thu, 20 Apr 2023 08:11:56 GMT
Subject: RFR: 8301739: AArch64: Add optimized rules for vector compare with immediate for SVE [v3]
In-Reply-To:
References:
Message-ID:

On Thu, 20 Apr 2023 02:25:45 GMT, Chang Peng wrote:

>> We can use SVE compare-with-integer-immediate instructions like cmpgt(immediate)[1] to avoid the extra scalar2vector operations.
>>
>> The following instruction sequence
>>
>>
>>     movi    v17.16b, #12
>>     cmpgt   p0.b, p7/z, z16.b, z17.b
>>
>>
>> can be optimized to:
>>
>>
>>     cmpgt   p0.b, p7/z, z16.b, #12
>>
>>
>> This patch does the following:
>> 1. Add SVE compare-with-7bit-unsigned-immediate instructions to C2's backend.
>> SVE cmp(immediate) instructions can support vector comparing with 7bit unsigned integer immediate (range from 0 to 127) or 5bit signed integer immediate (range from -16 to 15).
>>
>> 2. Add optimized match rules to generate the compare-with-immediate instructions.
>>
>> [1]: https://developer.arm.com/documentation/ddi0596/2021-12/SVE-Instructions/CMP-cc---immediate---Compare-vector-to-immediate-
>
> Chang Peng has updated the pull request incrementally with one additional commit since the last revision:
>
>   Refactor some code

src/hotspot/cpu/aarch64/aarch64.ad line 4321:

> 4319: operand immI_cmp_cond()
> 4320: %{
> 4321:   predicate(!Matcher::is_unsigned_booltest_pred(n->get_int()));

Suggestion:

  predicate(! Matcher::is_unsigned_booltest_pred(n->get_int()));

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/13200#discussion_r1172231889

From aph at openjdk.org Thu Apr 20 08:18:48 2023
From: aph at openjdk.org (Andrew Haley)
Date: Thu, 20 Apr 2023 08:18:48 GMT
Subject: RFR: 8301739: AArch64: Add optimized rules for vector compare with immediate for SVE [v3]
In-Reply-To:
References:
Message-ID:

On Thu, 20 Apr 2023 02:25:45 GMT, Chang Peng wrote:

>> We can use SVE compare-with-integer-immediate instructions like cmpgt(immediate)[1] to avoid the extra scalar2vector operations.
>>
>> The following instruction sequence
>>
>>
>>     movi    v17.16b, #12
>>     cmpgt   p0.b, p7/z, z16.b, z17.b
>>
>>
>> can be optimized to:
>>
>>
>>     cmpgt   p0.b, p7/z, z16.b, #12
>>
>>
>> This patch does the following:
>> 1. Add SVE compare-with-7bit-unsigned-immediate instructions to C2's backend.
>> SVE cmp(immediate) instructions can support vector comparing with 7bit unsigned integer immediate (range from 0 to 127) or 5bit signed integer immediate (range from -16 to 15).
>>
>> 2. Add optimized match rules to generate the compare-with-immediate instructions.
>>
>> [1]: https://developer.arm.com/documentation/ddi0596/2021-12/SVE-Instructions/CMP-cc---immediate---Compare-vector-to-immediate-
>
> Chang Peng has updated the pull request incrementally with one additional commit since the last revision:
>
>   Refactor some code

src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 3656:

> 3654:   ins_pipe(pipe_slow);
> 3655: %}')dnl
> 3656: VMASKCMP_SVE_IMM_I(immI5, cmp)

This is tricky to review because the two macros here seem to be almost, but not exactly, the same. Why is that?
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13200#discussion_r1172239949 From duke at openjdk.org Thu Apr 20 09:11:46 2023 From: duke at openjdk.org (Chang Peng) Date: Thu, 20 Apr 2023 09:11:46 GMT Subject: RFR: 8301739: AArch64: Add optimized rules for vector compare with immediate for SVE [v3] In-Reply-To: References: Message-ID: On Thu, 20 Apr 2023 08:15:49 GMT, Andrew Haley wrote: >> Chang Peng has updated the pull request incrementally with one additional commit since the last revision: >> >> Refactor some code > > src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 3656: > >> 3654: ins_pipe(pipe_slow); >> 3655: %}')dnl >> 3656: VMASKCMP_SVE_IMM_I(immI5, cmp) > > This is tricky to review because the two macros here seems to be almost, but not exactly, the same. Why is that? This patch adds rules for vector comparing with immediate. These immediate may have different types and his manifests as different ConNodes in middle-end (ConI and ConL). ConI and ConL have different methods to get the value, i.e., get_int() for ConI and get_long() for ConL. We should use this value in predicate, so I set two macros, one for integer (byte, short and int) and another one for long. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13200#discussion_r1172303063 From mdoerr at openjdk.org Thu Apr 20 10:09:45 2023 From: mdoerr at openjdk.org (Martin Doerr) Date: Thu, 20 Apr 2023 10:09:45 GMT Subject: RFR: 8306459: s390x: Replace NULL to nullptr In-Reply-To: References: Message-ID: On Wed, 19 Apr 2023 16:01:22 GMT, Amit Kumar wrote: > This PR changes two occurrences of NULL to null & nullptr. It is a trivial change OR probably we can say some left out from #12325. LGTM. ------------- Marked as reviewed by mdoerr (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/13538#pullrequestreview-1393661147 From amitkumar at openjdk.org Thu Apr 20 11:31:43 2023 From: amitkumar at openjdk.org (Amit Kumar) Date: Thu, 20 Apr 2023 11:31:43 GMT Subject: RFR: 8306459: s390x: Replace NULL to nullptr In-Reply-To: References: Message-ID: On Thu, 20 Apr 2023 10:07:15 GMT, Martin Doerr wrote: >> This PR changes two occurrences of NULL to null & nullptr. It is a trivial change OR probably we can say some left out from #12325. > > LGTM. Thanks for review @TheRealMDoerr , This is trivial change so integrating. Would you mind to sponsor this as well ? ------------- PR Comment: https://git.openjdk.org/jdk/pull/13538#issuecomment-1516168507 From jsjolen at openjdk.org Thu Apr 20 12:05:49 2023 From: jsjolen at openjdk.org (Johan =?UTF-8?B?U2rDtmxlbg==?=) Date: Thu, 20 Apr 2023 12:05:49 GMT Subject: RFR: 8306456: Don't leak _worklist's memory in PhaseLive::compute [v2] In-Reply-To: References: Message-ID: <1UR24BFT95Q5DDm5Xuo430l_So0NjrDiEGAojfymLRk=.58769e0b-f132-4a1e-9026-302d618e64ad@github.com> > `PhaseLive::compute` used to do this: `_worklist = new (_arena) Block_List();`. This allocates the `Block_List` to the `_arena`, but the backing array is allocated on the resource area: `Block_List() : Block_Array(Thread::current()->resource_area()), _cnt(0) {}`. This causes at most 5 worklists and at least 4 worklists to be created and not freed until the compilation is finished. This patch allocates the worklist within `PhaseLive::compute`:s local resource mark. 
Johan Sjölen has updated the pull request incrementally with one additional commit since the last revision:

  Fix style and make worklist passed in as argument

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/13535/files
  - new: https://git.openjdk.org/jdk/pull/13535/files/e1191341..8c436ab9

Webrevs:
  - full: https://webrevs.openjdk.org/?repo=jdk&pr=13535&range=01
  - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13535&range=00-01

Stats: 17 lines in 2 files changed: 0 ins; 3 del; 14 mod
Patch: https://git.openjdk.org/jdk/pull/13535.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/13535/head:pull/13535

PR: https://git.openjdk.org/jdk/pull/13535

From jsjolen at openjdk.org Thu Apr 20 12:05:51 2023
From: jsjolen at openjdk.org (Johan Sjölen)
Date: Thu, 20 Apr 2023 12:05:51 GMT
Subject: RFR: 8306456: Don't leak _worklist's memory in PhaseLive::compute
In-Reply-To:
References:
Message-ID:

On Wed, 19 Apr 2023 14:21:13 GMT, Johan Sjölen wrote:

> `PhaseLive::compute` used to do this: `_worklist = new (_arena) Block_List();`. This allocates the `Block_List` to the `_arena`, but the backing array is allocated on the resource area: `Block_List() : Block_Array(Thread::current()->resource_area()), _cnt(0) {}`. This causes at most 5 worklists and at least 4 worklists to be created and not freed until the compilation is finished. This patch allocates the worklist within `PhaseLive::compute`:s local resource mark.

I also added some style changes, to make the code look more like the rest of Hotspot. I didn't convert the whole file though, only areas that I touched.
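To illustrate why allocating scratch data under a scoped mark avoids this kind of leak, here is a minimal sketch. It assumes a toy bump allocator: `ToyArea`, `ToyMark` and `bytes_in_use_after` are hypothetical names and only imitate the general idea of a resource area with a mark that rolls it back; they are not HotSpot's `ResourceArea` or `ResourceMark`. Alignment is ignored for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical bump allocator; a toy stand-in, not HotSpot's ResourceArea.
class ToyArea {
 public:
  explicit ToyArea(std::size_t capacity) : _buf(capacity), _top(0) {}

  // Hands out `bytes` bytes by bumping the top-of-area index.
  void* allocate(std::size_t bytes) {
    assert(_top + bytes <= _buf.size() && "toy area exhausted");
    void* p = _buf.data() + _top;
    _top += bytes;
    return p;
  }

  std::size_t used() const { return _top; }
  void rollback_to(std::size_t mark) { _top = mark; }

 private:
  std::vector<unsigned char> _buf;
  std::size_t _top;
};

// Hypothetical analogue of a resource mark: remembers the top on entry
// and rolls the area back to it when the scope ends.
class ToyMark {
 public:
  explicit ToyMark(ToyArea& area) : _area(area), _saved(area.used()) {}
  ~ToyMark() { _area.rollback_to(_saved); }
  ToyMark(const ToyMark&) = delete;
  ToyMark& operator=(const ToyMark&) = delete;

 private:
  ToyArea& _area;
  std::size_t _saved;
};

// Runs `passes` rounds that each need a 64-byte scratch worklist.
// With the mark, each round's scratch memory is released when the round
// ends; without it, the allocations pile up for the lifetime of the area.
std::size_t bytes_in_use_after(ToyArea& area, int passes, bool use_mark) {
  for (int i = 0; i < passes; i++) {
    if (use_mark) {
      ToyMark mark(area);
      area.allocate(64);  // scratch worklist backing array
    } else {
      area.allocate(64);
    }
  }
  return area.used();
}
```

In this toy model, five passes without a mark leave 320 bytes in use, while five passes under a mark leave none.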

-------------

PR Comment: https://git.openjdk.org/jdk/pull/13535#issuecomment-1516206650

From jsjolen at openjdk.org Thu Apr 20 12:05:55 2023
From: jsjolen at openjdk.org (Johan Sjölen)
Date: Thu, 20 Apr 2023 12:05:55 GMT
Subject: RFR: 8306456: Don't leak _worklist's memory in PhaseLive::compute [v2]
In-Reply-To:
References:
Message-ID:

On Wed, 19 Apr 2023 23:26:10 GMT, Dean Long wrote:

>> Johan Sjölen has updated the pull request incrementally with one additional commit since the last revision:
>>
>> Fix style and make worklist passed in as argument
>
> src/hotspot/share/opto/live.cpp line 92:
>
>> 90: Block_List wl;
>> 91: _worklist = &wl;
>> 92:
>
> Now `_worklist` is a dangling pointer to released stack memory at the end of this method. How do we make sure it isn't used?

I removed the `_worklist` local variable and pass it into the functions that need it instead.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/13535#discussion_r1172483362

From jsjolen at openjdk.org Thu Apr 20 12:08:45 2023
From: jsjolen at openjdk.org (Johan Sjölen)
Date: Thu, 20 Apr 2023 12:08:45 GMT
Subject: RFR: 8306444: Don't leak memory in PhaseChaitin::PhaseChaitin
In-Reply-To: <1XX36qmVJxnq4scfh4CY8YMuo40K4omFvs370V7hIfI=.f976cd60-0d1c-4691-a582-bf00fbc3a2fd@github.com>
References: <1XX36qmVJxnq4scfh4CY8YMuo40K4omFvs370V7hIfI=.f976cd60-0d1c-4691-a582-bf00fbc3a2fd@github.com>
Message-ID:

On Wed, 19 Apr 2023 17:46:42 GMT, Vladimir Kozlov wrote:

>> Hi,
>>
>> First, `PhaseChaitin::PhaseChaitin` used to create 4 resource array of size `_cfg.number_of_blocks`: one to store all of the block pointers in (`_blks`), and three to do a sorting of the blocks in some order. The latter three weren't freed in the constructor, causing them to hang around for the entire duration of the phase. This is unnecessary, so this patch frees the arrays when we're done with them. It also allocates all of the resources arrays in one go.
>> >> Second, it copied over each partially filled bucket into the `_blks`, one block at a time. This patch changes this so that we don't allocate the `_blks` resource array at all, instead we simply squash all of the partially filled buckets into the first one using `::memmove`. >> >> I haven't done any micro benchmarking, but this should be faster and take less space. >> >> This is currently passing tier1. > > src/hotspot/share/opto/chaitin.cpp line 265: > >> 263: } >> 264: // Free the now unused memory >> 265: FREE_RESOURCE_ARRAY(Block*, buckets[1], (NUMBUCKS-1)*nr_blocks); > > I did not know that you can free part of allocated space. Arena and ResourceArea looks at array allocations as `sizeof(T)*N` bytes being allocated, nothing more than that. If it's the last thing you allocated then you can free that space. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13533#discussion_r1172489348 From amitkumar at openjdk.org Thu Apr 20 12:32:00 2023 From: amitkumar at openjdk.org (Amit Kumar) Date: Thu, 20 Apr 2023 12:32:00 GMT Subject: Integrated: 8306459: s390x: Replace NULL to nullptr In-Reply-To: References: Message-ID: On Wed, 19 Apr 2023 16:01:22 GMT, Amit Kumar wrote: > This PR changes two occurrences of NULL to null & nullptr. It is a trivial change OR probably we can say some left out from #12325. This pull request has now been integrated. 
Changeset: 9c2e5b38 Author: Amit Kumar Committer: Martin Doerr URL: https://git.openjdk.org/jdk/commit/9c2e5b387112606352b3150a5cc10ddec8d3afe9 Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod 8306459: s390x: Replace NULL to nullptr Reviewed-by: mdoerr ------------- PR: https://git.openjdk.org/jdk/pull/13538 From jsjolen at openjdk.org Thu Apr 20 12:38:31 2023 From: jsjolen at openjdk.org (Johan =?UTF-8?B?U2rDtmxlbg==?=) Date: Thu, 20 Apr 2023 12:38:31 GMT Subject: RFR: 8306444: Don't leak memory in PhaseChaitin::PhaseChaitin [v2] In-Reply-To: References: Message-ID: > Hi, > > First, `PhaseChaitin::PhaseChaitin` used to create 4 resource array of size `_cfg.number_of_blocks`: one to store all of the block pointers in (`_blks`), and three to do a sorting of the blocks in some order. The latter three weren't freed in the constructor, causing them to hang around for the entire duration of the phase. This is unnecessary, so this patch frees the arrays when we're done with them. It also allocates all of the resources arrays in one go. > > Second, it copied over each partially filled bucket into the `_blks`, one block at a time. This patch changes this so that we don't allocate the `_blks` resource array at all, instead we simply squash all of the partially filled buckets into the first one using `::memmove`. > > I haven't done any micro benchmarking, but this should be faster and take less space. > > This is currently passing tier1. 
Johan Sjölen has updated the pull request incrementally with one additional commit since the last revision:

  Apply Kozlov's comments

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/13533/files
  - new: https://git.openjdk.org/jdk/pull/13533/files/137b98ab..52db05ef

Webrevs:
  - full: https://webrevs.openjdk.org/?repo=jdk&pr=13533&range=01
  - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13533&range=00-01

Stats: 3 lines in 1 file changed: 2 ins; 0 del; 1 mod
Patch: https://git.openjdk.org/jdk/pull/13533.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/13533/head:pull/13533

PR: https://git.openjdk.org/jdk/pull/13533

From jsjolen at openjdk.org Thu Apr 20 12:38:34 2023
From: jsjolen at openjdk.org (Johan Sjölen)
Date: Thu, 20 Apr 2023 12:38:34 GMT
Subject: RFR: 8306444: Don't leak memory in PhaseChaitin::PhaseChaitin [v2]
In-Reply-To: <1XX36qmVJxnq4scfh4CY8YMuo40K4omFvs370V7hIfI=.f976cd60-0d1c-4691-a582-bf00fbc3a2fd@github.com>
References: <1XX36qmVJxnq4scfh4CY8YMuo40K4omFvs370V7hIfI=.f976cd60-0d1c-4691-a582-bf00fbc3a2fd@github.com>
Message-ID:

On Wed, 19 Apr 2023 17:51:09 GMT, Vladimir Kozlov wrote:

>> Johan Sjölen has updated the pull request incrementally with one additional commit since the last revision:
>>
>> Apply Kozlov's comments
>
> src/hotspot/share/opto/chaitin.cpp line 263:
>
>> 261: ::memmove(offset, buckets[i], buckcnt[i]*sizeof(Block*));
>> 262: offset += buckcnt[i];
>> 263: }
>
> May add assert that `assert((offset - &buckets[0][0]) == nr_blocks`

I added `assert((&buckets[0][0] + nr_blocks) == offset, "should be");`, it's easier for me to see that `nr_blocks` is implicitly multiplied with `sizeof(Block*)` :-).
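The squashing step and the pointer-arithmetic check discussed above can be sketched in isolation. This is an illustration under stated assumptions, not the code from chaitin.cpp: `squash_buckets` and `demo_squash` are hypothetical names, the buckets hold `int` rather than `Block*`, and they are assumed to sit back to back in one flat array with a fixed capacity each.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Compacts `num_buckets` partially filled buckets, laid out back to back in
// one flat array with `capacity` slots each, into a dense prefix of that
// array. `counts[i]` is the number of used slots in bucket i.
// Returns the total number of elements after compaction.
std::size_t squash_buckets(int* flat, std::size_t num_buckets,
                           std::size_t capacity, const std::size_t* counts) {
  int* offset = flat + counts[0];  // bucket 0 is already in place
  for (std::size_t i = 1; i < num_buckets; i++) {
    // Source and destination may overlap, hence memmove rather than memcpy.
    // The byte count needs an explicit sizeof factor.
    std::memmove(offset, flat + i * capacity, counts[i] * sizeof(int));
    // Pointer arithmetic advances by elements, so no sizeof factor here.
    offset += counts[i];
  }
  return static_cast<std::size_t>(offset - flat);
}

// Three buckets of capacity 4; -1 marks an unused slot.
std::vector<int> demo_squash() {
  std::vector<int> flat = {1, 2, -1, -1, 3, -1, -1, -1, 4, 5, 6, -1};
  const std::size_t counts[] = {2, 1, 3};
  std::size_t n = squash_buckets(flat.data(), 3, 4, counts);
  assert(n == 6);  // same idea as checking offset against base + nr_blocks
  flat.resize(n);
  return flat;
}
```

The two forms of the assert in the review are equivalent: subtracting two element pointers and adding an element count to a pointer both work in units of elements, and only the `memmove` length is expressed in bytes.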

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/13533#discussion_r1172521966

From jsjolen at openjdk.org Thu Apr 20 13:00:42 2023
From: jsjolen at openjdk.org (Johan Sjölen)
Date: Thu, 20 Apr 2023 13:00:42 GMT
Subject: RFR: 8306444: Don't leak memory in PhaseChaitin::PhaseChaitin [v3]
In-Reply-To:
References:
Message-ID:

> Hi,
>
> First, `PhaseChaitin::PhaseChaitin` used to create 4 resource array of size `_cfg.number_of_blocks`: one to store all of the block pointers in (`_blks`), and three to do a sorting of the blocks in some order. The latter three weren't freed in the constructor, causing them to hang around for the entire duration of the phase. This is unnecessary, so this patch frees the arrays when we're done with them. It also allocates all of the resources arrays in one go.
>
> Second, it copied over each partially filled bucket into the `_blks`, one block at a time. This patch changes this so that we don't allocate the `_blks` resource array at all, instead we simply squash all of the partially filled buckets into the first one using `::memmove`.
>
> I haven't done any micro benchmarking, but this should be faster and take less space.
>
> This is currently passing tier1.

Johan Sjölen has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains five additional commits since the last revision:

  - Merge remote-tracking branch 'origin/master' into opt-chaitin
  - Apply Kozlov's comments
  - Use nr_blocks in assert
  - Merge loops
  - Optimize PhaseChaitin

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/13533/files
  - new: https://git.openjdk.org/jdk/pull/13533/files/52db05ef..c6fe3ec7

Webrevs:
  - full: https://webrevs.openjdk.org/?repo=jdk&pr=13533&range=02
  - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13533&range=01-02

Stats: 6222 lines in 152 files changed: 5434 ins; 432 del; 356 mod
Patch: https://git.openjdk.org/jdk/pull/13533.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/13533/head:pull/13533

PR: https://git.openjdk.org/jdk/pull/13533

From jsjolen at openjdk.org Thu Apr 20 13:02:23 2023
From: jsjolen at openjdk.org (Johan Sjölen)
Date: Thu, 20 Apr 2023 13:02:23 GMT
Subject: RFR: 8306456: Don't leak _worklist's memory in PhaseLive::compute [v3]
In-Reply-To:
References:
Message-ID:

> `PhaseLive::compute` used to do this: `_worklist = new (_arena) Block_List();`. This allocates the `Block_List` to the `_arena`, but the backing array is allocated on the resource area: `Block_List() : Block_Array(Thread::current()->resource_area()), _cnt(0) {}`. This causes at most 5 worklists and at least 4 worklists to be created and not freed until the compilation is finished. This patch allocates the worklist within `PhaseLive::compute`:s local resource mark.

Johan Sjölen has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains three additional commits since the last revision: - Merge remote-tracking branch 'origin/master' into dontleak-worklist - Fix style and make worklist passed in as argument - Don't leak the worklist in PhaseLive ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13535/files - new: https://git.openjdk.org/jdk/pull/13535/files/8c436ab9..81a039f5 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13535&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13535&range=01-02 Stats: 6222 lines in 152 files changed: 5434 ins; 432 del; 356 mod Patch: https://git.openjdk.org/jdk/pull/13535.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13535/head:pull/13535 PR: https://git.openjdk.org/jdk/pull/13535 From vkempik at openjdk.org Thu Apr 20 13:52:19 2023 From: vkempik at openjdk.org (Vladimir Kempik) Date: Thu, 20 Apr 2023 13:52:19 GMT Subject: RFR: 8305056: Avoid unaligned access in emit_intX methods if not enabled [v5] In-Reply-To: References: Message-ID: <5HIRhLY2y8NJE_oQvoEvKWtfwlsI0bCmzqImcweV4nI=.241452f9-6b7f-4742-b7db-5f9aa21dc202@github.com> > Please review this change which attempts to eliminate unaligned memory stores generated by emit_int16/32/64 methods on some platforms. > > Primary aim is risc-v platform. But I had to change some code in ppc/arm32/x86 to prevent possible perf degradation. 
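As background for this change, here is a minimal sketch of one portable way to write a multi-byte value to an address that may not be naturally aligned. It is an illustration only: `put_u32`, `get_u32` and `demo_misaligned_roundtrip` are hypothetical helpers and do not reproduce what the patch does inside the `emit_intX` methods.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Stores a 32-bit value at an arbitrary, possibly misaligned, address.
// memcpy has no alignment requirement on its arguments, so this avoids the
// undefined behaviour of writing through a misaligned uint32_t pointer.
// How it is lowered (one wide store or several narrow ones) is up to the
// compiler and the target.
inline void put_u32(unsigned char* dst, std::uint32_t value) {
  std::memcpy(dst, &value, sizeof(value));
}

// Reads the value back the same way.
inline std::uint32_t get_u32(const unsigned char* src) {
  std::uint32_t value;
  std::memcpy(&value, src, sizeof(value));
  return value;
}

// Writes and re-reads a value at an address that is off by one byte from
// a 4-byte boundary.
std::uint32_t demo_misaligned_roundtrip() {
  alignas(4) unsigned char buffer[8] = {};
  put_u32(buffer + 1, 0xCAFEBABEu);  // deliberately misaligned
  return get_u32(buffer + 1);
}
```

On targets where misaligned stores trap into an emulation path, the point of such a helper is that the generated code no longer depends on the hardware or firmware fixing up the access.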
Vladimir Kempik has updated the pull request incrementally with one additional commit since the last revision: Add APH's suggestions and remove some whitespace ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13227/files - new: https://git.openjdk.org/jdk/pull/13227/files/c014a806..82b918a1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13227&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13227&range=03-04 Stats: 10 lines in 1 file changed: 1 ins; 2 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/13227.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13227/head:pull/13227 PR: https://git.openjdk.org/jdk/pull/13227 From vkempik at openjdk.org Thu Apr 20 13:58:26 2023 From: vkempik at openjdk.org (Vladimir Kempik) Date: Thu, 20 Apr 2023 13:58:26 GMT Subject: RFR: 8305056: Avoid unaligned access in emit_intX methods if not enabled [v6] In-Reply-To: References: Message-ID: > Please review this change which attempts to eliminate unaligned memory stores generated by emit_int16/32/64 methods on some platforms. > > Primary aim is risc-v platform. But I had to change some code in ppc/arm32/x86 to prevent possible perf degradation. 
Vladimir Kempik has updated the pull request incrementally with one additional commit since the last revision: Rewrite few more helpers ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13227/files - new: https://git.openjdk.org/jdk/pull/13227/files/82b918a1..81861821 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13227&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13227&range=04-05 Stats: 13 lines in 1 file changed: 3 ins; 6 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/13227.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13227/head:pull/13227 PR: https://git.openjdk.org/jdk/pull/13227 From fyang at openjdk.org Thu Apr 20 14:35:51 2023 From: fyang at openjdk.org (Fei Yang) Date: Thu, 20 Apr 2023 14:35:51 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v23] In-Reply-To: <-tqGJTDwdjWYBNBNZxFAcosCq3_V_fWvGzwJc98uUZI=.013ff8de-380c-464d-8d09-405cfcac17d3@github.com> References: <-tqGJTDwdjWYBNBNZxFAcosCq3_V_fWvGzwJc98uUZI=.013ff8de-380c-464d-8d09-405cfcac17d3@github.com> Message-ID: On Wed, 19 Apr 2023 02:51:00 GMT, Dingli Zhang wrote: >> HI, >> >> We have added support for vector add mask instructions, please take a look and have some reviews. Thanks a lot! >> This patch will add support of vector add/sub/mul/div mask version. It was implemented by referring to RVV v1.0 [1]. >> >> ## Load/Store/Cmp Mask >> `VectorLoadMask, VectorMaskCmp, VectorStoreMask` will implement the mask datapath. We can see where the data is passed in the compilation log with `jdk/incubator/vector/Byte128VectorTests.java`? >> >> 218 loadV V1, [R7] # vector (rvv) >> 220 vloadmask V0, V1 >> ... >> 23c vmaskcmp_rvv_masked V0, V4, V5, V0, V1, #0 >> 24c vstoremask V1, V0 >> 258 storeV [R7], V1 # vector (rvv) >> >> >> The corresponding generated jit assembly? 
>> >> # loadV >> 0x000000400c8ef958: vsetvli t0,zero,e8,m1,tu,mu >> 0x000000400c8ef95c: vle8.v v1,(t2) >> >> # vloadmask >> 0x000000400c8ef960: vsetvli t0,zero,e8,m1,tu, >> 0x000000400c8ef964: vmsne.vx v0,v1,zero >> >> # vmaskcmp_rvv_masked >> 0x000000400c8ef97c: vsetvli t0,zero,e8,m1,tu,mu >> 0x000000400c8ef980: vmclr.m v1 >> 0x000000400c8ef984: vmseq.vv v1,v4,v5,v0.t >> 0x000000400c8ef988: vmv1r.v v0,v1 >> >> # vstoremask >> 0x000000400c8ef98c: vsetvli t0,zero,e8,m1,tu,mu >> 0x000000400c8ef990: vmv.v.x v1,zero >> 0x000000400c8ef994: vmerge.vim v1,v1,1,v0 >> >> >> ## Masked vector arithmetic instructions (e.g. vadd) >> AddMaskTestMerge case: >> >> import jdk.incubator.vector.IntVector; >> import jdk.incubator.vector.VectorMask; >> import jdk.incubator.vector.VectorOperators; >> import jdk.incubator.vector.VectorSpecies; >> >> public class AddMaskTestMerge { >> >> static final VectorSpecies SPECIES = IntVector.SPECIES_128; >> static final int SIZE = 1024; >> static int[] a = new int[SIZE]; >> static int[] b = new int[SIZE]; >> static int[] r = new int[SIZE]; >> static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false}; >> static { >> for (int i = 0; i < SIZE; i++) { >> a[i] = i; >> b[i] = i; >> } >> } >> >> static void workload(int idx) { >> VectorMask vmask = VectorMask.fromArray(SPECIES, c, 0); >> IntVector av = IntVector.fromArray(SPECIES, a, idx); >> IntVector bv = IntVector.fromArray(SPECIES, b, idx); >> av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx); >> } >> >> public static void main(String[] args) { >> for (int i = 0; i < 30_0000; i++) { >> for (int j = 0; j < SIZE; j += SPECIES.length()) { >> workload(j); >> } >> } >> } >> } >> >> >> This test case is reduced from existing jtreg vector tests Int128VectorTests.java[2]. This test case corresponds to the add instruction of the vector mask version and other instructions are similar. >> >> Before this patch, the compilation log will not print RVV-related instructions. 
Now the compilation log is as follows: >> >> >> 0ae B10: # out( B25 B11 ) <- in( B9 ) Freq: 0.999991 >> 0ae loadV V1, [R31] # vector (rvv) >> 0b6 vloadmask V0, V2 >> 0be vadd.vv V3, V1, V0 #@vaddI_masked >> 0c6 lwu R28, [R7, #124] # loadN, compressed ptr, #@loadN ! Field: AddMaskTestMerge.r >> 0ca decode_heap_oop R28, R28 #@decodeHeapOop >> 0cc lwu R7, [R28, #12] # range, #@loadRange >> 0d0 NullCheck R28 >> >> >> And the jit code is as follows: >> >> >> 0x000000400c823cee: vsetvli t0,zero,e32,m1,tu,mu >> 0x000000400c823cf2: vle32.v v1,(t6) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0} >> ; - jdk.incubator.vector.IntVector::intoArray at 43 (line 3228) >> ; - AddMaskTestMerge::workload at 46 (line 25) >> 0x000000400c823cf6: vsetvli t0,zero,e8,m1,tu,mu >> 0x000000400c823cfa: vmsne.vx v0,v2,zero ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0} >> ; - jdk.incubator.vector.VectorMask::fromArray at 47 (line 208) >> ; - AddMaskTestMerge::workload at 7 (line 22) >> 0x000000400c823cfe: vsetvli t0,zero,e32,m1,tu,mu >> 0x000000400c823d02: vadd.vv v3,v3,v1,v0.t ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0} >> ; - jdk.incubator.vector.IntVector::lanewiseTemplate at 192 (line 834) >> ; - jdk.incubator.vector.Int128Vector::lanewise at 9 (line 291) >> ; - jdk.incubator.vector.Int128Vector::lanewise at 4 (line 41) >> ; - AddMaskTestMerge::workload at 39 (line 25) >> >> >> ## Mask register allocation & mask bit opreation >> Since v0 is to be used as a mask register in spec[1], sometimes we need two vmask to do the vector mask logical ops like `AndVMask, OrVMask, XorVMask`. And if only v0 and v31 mask registers are defined, the corresponding c2 nodes will not be generated correctly because of the register pressure[3]. >> When we use only v0 and v31 as mask registers, jtreg testing of Byte128VectorTests.java[4] with `-XX:+PrintAssembly` and `-XX:LogFile` will not emit the expected rvv mask instruction. 
As opposed to this, the following compilation failure log[3] is generated: >> >> >> >> >> >> >> >> >> So define v30 and v31 as mask register too and `AndVMask` will emit the C2 JIT code like: >> >> vloadmask V0, V1 >> vloadmask V30, V2 >> vmask_and V0, V30, V0 >> >> We also modified the implementation of `spill_copy_vector_stack_to_stack ` so that it no longer occupies the v0 register. In addition to that, we change some node like `vasr/vlsl/vlsr/vstring_x/varray_x/vclearArray_x`, which use v0 internally, to make C2 to sense that they used v0. >> >> ## vector load/store - predicated & blend opreation >> >> Jtreg testing of Byte128VectorTests.java[4] with -XX:+PrintOptoAssembly and -XX:LogFile will print the following compilation log, which generated by predicated vector load/store: >> >> 152 B21: # out( B22 ) <- in( B20 ) Freq: 0.499984 >> 152 vmask_gen_L V0, R12 >> 162 loadV_masked V1, V0, [R10] >> 16e storeV_masked [R11], V0, V1 >> >> >> And `VectorBlend` will generate the following compilation log (part of rotate opreation): >> >> 1ea vlsrBS V6, V1, V3 V0 >> 1fe vlslBS V5, V1, V2 V0 >> 212 vor.vv V2, V5, V6 #@vor >> 21a vloadmask V0, V4 >> 222 vmerge_vvm V1, V1, V2 # vector blend >> 22a bgeu R9, R30, B56 #@cmpU_branch P=0.000001 C=-1.000000 >> >> >> At the same time, we added the predicated nodes of `RShiftV/LShiftV/URShiftV`. While there was some code duplication for the corresponding nodes in non-masked form, so a small refactoring was done. 
>> >> >> [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc >> [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java >> [3] https://github.com/openjdk/jdk/blob/0deb648985b018653ccdaf193dc13b3cf21c088a/src/hotspot/share/opto/chaitin.cpp#L526 >> [4] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Byte128VectorTests.java >> >> ### Testing: >> >> qemu with UseRVV: >> - [x] Tier1 tests (release) >> - [x] Tier2 tests (release) >> - [x] Tier3 tests (release) >> - [x] test/jdk/jdk/incubator/vector (release/fastdebug) > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Rename vmaskcmp_DF Changes requested by fyang (Reviewer). src/hotspot/cpu/riscv/riscv_v.ad line 199: > 197: // vector mask float compare > 198: > 199: instruct vmaskcmp_FD(vRegMask dst, vReg src1, vReg src2, immI cond, vReg tmp1, vReg tmp2) %{ I would prefer renaming this into something like 'vmaskcmp_fp'. Also, would you mind renaming macro-assember functions minmax_FD, minmax_FD_v and reduce_minmax_FD_v into something like minmax_fp, minmax_fp_v, reduce_minmax_fp_v respectively? Thanks. src/hotspot/cpu/riscv/riscv_v.ad line 216: > 214: %} > 215: > 216: instruct vmaskcmp_FD_masked(vRegMask dst, vReg src1, vReg src2, immI cond, vRegMask_V0 vmask, vReg tmp1, vReg tmp2, vReg tmp3) %{ Similar here, we can rename this into something like 'vmaskcmp_fp_masked'. src/hotspot/cpu/riscv/riscv_v.ad line 413: > 411: %} > 412: > 413: instruct vaddF_masked(vReg dst_src1, vReg src2, vRegMask_V0 vmask) %{ Can we combine this one with the next 'vaddD_masked' into a new instruct and give it a new name like 'vadd_fp_masked'? It looks to me that we can do similar thing for other instructs like 'vmulF_masked/vmulD_masked', etc. 
src/hotspot/cpu/riscv/riscv_v.ad line 1491: > 1489: // vector shift > 1490: > 1491: instruct vasrBS(vReg dst, vReg src, vReg shift, vRegMask_V0 tmp) %{ I really don't like instruct names like 'vasrBS' and 'vasrIL' which looks kind of misleading. I would perfer keep those seperated like before. ------------- PR Review: https://git.openjdk.org/jdk/pull/12682#pullrequestreview-1394119086 PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1172668496 PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1172669505 PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1172671930 PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1172686453 From aph at openjdk.org Thu Apr 20 14:54:46 2023 From: aph at openjdk.org (Andrew Haley) Date: Thu, 20 Apr 2023 14:54:46 GMT Subject: RFR: 8301739: AArch64: Add optimized rules for vector compare with immediate for SVE [v3] In-Reply-To: References: Message-ID: On Thu, 20 Apr 2023 09:08:28 GMT, Chang Peng wrote: >> src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 3656: >> >>> 3654: ins_pipe(pipe_slow); >>> 3655: %}')dnl >>> 3656: VMASKCMP_SVE_IMM_I(immI5, cmp) >> >> This is tricky to review because the two macros here seems to be almost, but not exactly, the same. Why is that? > > This patch adds rules for vector comparing with immediate. These immediate may have different types and his manifests as different ConNodes in middle-end (ConI and ConL). ConI and ConL have different methods to get the value, i.e., get_int() for ConI and get_long() for ConL. We should use this value in predicate, so I set two macros, one for integer (byte, short and int) and another one for long. Please show me where these differences appear. I can't see them. 

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/13200#discussion_r1172714471

From jkarthikeyan at openjdk.org Thu Apr 20 15:32:43 2023
From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan)
Date: Thu, 20 Apr 2023 15:32:43 GMT
Subject: RFR: 8051725: Improve expansion of Conv2B nodes in the middle-end [v3]
In-Reply-To:
References:
Message-ID:

On Wed, 19 Apr 2023 04:30:39 GMT, Jasmine Karthikeyan wrote:

>> Hi, I've created optimizations for the expansion of `Conv2B` nodes, especially when followed immediately by an xor of 1. This pattern is fairly common, and can arise from both [cmov idealization](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/movenode.cpp#L241) and [diamond-phi optimization](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L1571). This change replaces `Conv2B` nodes in the middle-end during macro expansion with conditional moves, allowing the bit flip with `xor` to be subsumed with an inversion of the comparison instead. This change also reduces the overhead of the matcher in the backend, as fewer rules need to be traversed in order to match an ideal node. Performance results from my (Zen 2) machine:
>>
>>                                            Baseline                     Patch                Improvement
>> Benchmark                      Mode  Cnt   Score    Error  Units        Score    Error  Units
>> Conv2BRules.testEquals0        avgt   10  47.566 ± 0.346  ns/op   /   34.130 ± 0.177  ns/op   + 28.2%
>> Conv2BRules.testNotEquals0     avgt   10  37.167 ± 0.211  ns/op   /   34.185 ± 0.258  ns/op   + 8.0%
>> Conv2BRules.testEquals1        avgt   10  35.059 ± 0.280  ns/op   /   34.847 ± 0.160  ns/op   (unchanged)
>> Conv2BRules.testEqualsNull     avgt   10  56.768 ± 2.600  ns/op   /   34.330 ± 0.625  ns/op   + 39.5%
>> Conv2BRules.testNotEqualsNull  avgt   10  47.447 ± 1.193  ns/op   /   34.142 ± 0.303  ns/op   + 28.0%
>>
>> Reviews would be greatly appreciated!
>> >> Testing: tier1-2 on linux x64, GHA > > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Remove Conv2B from backend as it's macro expanded now I think that's a good idea, it would reduce the complexity of the logic in macro expansion and allow the transform to be applied more generally. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13345#issuecomment-1516537644 From vkempik at openjdk.org Thu Apr 20 15:48:46 2023 From: vkempik at openjdk.org (Vladimir Kempik) Date: Thu, 20 Apr 2023 15:48:46 GMT Subject: RFR: 8305056: Avoid unaligned access in emit_intX methods if it's unsupported [v6] In-Reply-To: References: Message-ID: On Thu, 20 Apr 2023 13:58:26 GMT, Vladimir Kempik wrote: >> Please review this change which attempts to eliminate unaligned memory stores generated by emit_int16/32/64 methods on some platforms. >> >> Primary aim is risc-v platform. But I had to change some code in ppc/arm32/x86 to prevent possible perf degradation. 
>
> Vladimir Kempik has updated the pull request incrementally with one additional commit since the last revision:
>
>   Rewrite few more helpers

The numbers on misaligned stores (trp_sam event) before and after the patch on RISC-V:

before:

 Performance counter stats for './jdk/bin/java -version':

            170054      trp_lam
             34171      trp_sam

       5.120392220 seconds time elapsed

       5.865599000 seconds user
       0.593732000 seconds sys

after:

            169598      trp_lam
             13562      trp_sam

       5.374909022 seconds time elapsed

       6.057279000 seconds user
       0.524302000 seconds sys

-------------

PR Comment: https://git.openjdk.org/jdk/pull/13227#issuecomment-1516561316

From vkempik at openjdk.org Thu Apr 20 16:25:52 2023
From: vkempik at openjdk.org (Vladimir Kempik)
Date: Thu, 20 Apr 2023 16:25:52 GMT
Subject: RFR: 8305056: Avoid unaligned access in emit_intX methods if it's unsupported [v6]
In-Reply-To:
References:
Message-ID:

On Thu, 20 Apr 2023 13:58:26 GMT, Vladimir Kempik wrote:

>> Please review this change which attempts to eliminate unaligned memory stores generated by emit_int16/32/64 methods on some platforms.
>>
>> Primary aim is risc-v platform. But I had to change some code in ppc/arm32/x86 to prevent possible perf degradation.
>
> Vladimir Kempik has updated the pull request incrementally with one additional commit since the last revision:
>
>   Rewrite few more helpers

Windows failures are an infra issue (Visual Studio can't be installed) and unrelated.
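
For readers following this thread, the general idea of avoiding a misaligned wide store can be sketched as follows: the value is written one byte at a time, so no single store needs more than byte alignment. This is only an analogy in Java, not the HotSpot C++ code from the PR; the class `EmitIntSketch` and its methods are hypothetical names invented for illustration.

```java
// Hypothetical illustration only: not the HotSpot implementation.
final class EmitIntSketch {
    // Write a 32-bit value in little-endian order using four single-byte
    // stores, so the destination offset may be odd.
    static void emitInt32(byte[] code, int pos, int value) {
        code[pos]     = (byte) value;
        code[pos + 1] = (byte) (value >>> 8);
        code[pos + 2] = (byte) (value >>> 16);
        code[pos + 3] = (byte) (value >>> 24);
    }

    // Read the value back the same way, one byte at a time.
    static int readInt32(byte[] code, int pos) {
        return (code[pos] & 0xFF)
             | (code[pos + 1] & 0xFF) << 8
             | (code[pos + 2] & 0xFF) << 16
             | (code[pos + 3] & 0xFF) << 24;
    }

    public static void main(String[] args) {
        byte[] code = new byte[8];
        emitInt32(code, 1, 0xCAFEBABE);   // deliberately odd offset
        System.out.println(Integer.toHexString(readInt32(code, 1))); // cafebabe
    }
}
```

The trade-off is more instructions per emitted value in exchange for never trapping on a misaligned store, which is what the trp_sam counts above measure.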
-------------

PR Comment: https://git.openjdk.org/jdk/pull/13227#issuecomment-1516615266

From kvn at openjdk.org Thu Apr 20 16:39:44 2023
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Thu, 20 Apr 2023 16:39:44 GMT
Subject: RFR: 8306444: Don't leak memory in PhaseChaitin::PhaseChaitin [v3]
In-Reply-To:
References:
Message-ID:

On Thu, 20 Apr 2023 13:00:42 GMT, Johan Sjölen wrote:

>> Hi,
>>
>> First, `PhaseChaitin::PhaseChaitin` used to create 4 resource arrays of size `_cfg.number_of_blocks`: one to store all of the block pointers in (`_blks`), and three to do a sorting of the blocks in some order. The latter three weren't freed in the constructor, causing them to hang around for the entire duration of the phase. This is unnecessary, so this patch frees the arrays when we're done with them. It also allocates all of the resource arrays in one go.
>>
>> Second, it copied over each partially filled bucket into the `_blks`, one block at a time. This patch changes this so that we don't allocate the `_blks` resource array at all; instead we simply squash all of the partially filled buckets into the first one using `::memmove`.
>>
>> I haven't done any micro benchmarking, but this should be faster and take less space.
>>
>> This is currently passing tier1.
>
> Johan Sjölen has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision:
>
>  - Merge remote-tracking branch 'origin/master' into opt-chaitin
>  - Apply Kozlov's comments
>  - Use nr_blocks in assert
>  - Merge loops
>  - Optimize PhaseChaitin

Looks good. You need a second review for this.

-------------

Marked as reviewed by kvn (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/13533#pullrequestreview-1394385535

From dlong at openjdk.org Thu Apr 20 18:29:44 2023
From: dlong at openjdk.org (Dean Long)
Date: Thu, 20 Apr 2023 18:29:44 GMT
Subject: RFR: 8306456: Don't leak _worklist's memory in PhaseLive::compute [v3]
In-Reply-To:
References:
Message-ID:

On Thu, 20 Apr 2023 13:02:23 GMT, Johan Sjölen wrote:

>> `PhaseLive::compute` used to do this: `_worklist = new (_arena) Block_List();`. This allocates the `Block_List` in the `_arena`, but the backing array is allocated in the resource area: `Block_List() : Block_Array(Thread::current()->resource_area()), _cnt(0) {}`. This causes at most 5 worklists and at least 4 worklists to be created and not freed until the compilation is finished. This patch allocates the worklist within `PhaseLive::compute`'s local resource mark.
>
> Johan Sjölen has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision:
>
>  - Merge remote-tracking branch 'origin/master' into dontleak-worklist
>  - Fix style and make worklist passed in as argument
>  - Don't leak the worklist in PhaseLive

Looks good.

-------------

Marked as reviewed by dlong (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/13535#pullrequestreview-1394549616

From cslucas at openjdk.org Thu Apr 20 19:27:58 2023
From: cslucas at openjdk.org (Cesar Soares Lucas)
Date: Thu, 20 Apr 2023 19:27:58 GMT
Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v10]
In-Reply-To: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com>
References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com>
Message-ID:

> Can I please get reviews for this PR?
> > The most common and frequent use of NonEscaping Phis merging object allocations is for debugging information. The two graphs below show numbers for Renaissance and DaCapo benchmarks - similar results are obtained for all other applications that I tested. > > With what frequency does each IR node type occurs as an allocation merge user? I.e., if the same node type uses a Phi N times the counter is incremented by N: > > ![image](https://user-images.githubusercontent.com/2249648/222280517-4dcf5871-2564-4207-b49e-22aee47fa49d.png) > > What are the most common users of allocation merges? I.e., if the same node type uses a Phi N times the counter is incremented by 1: > > ![image](https://user-images.githubusercontent.com/2249648/222280608-ca742a4e-1622-4e69-a778-e4db6805ea02.png) > > This PR adds support scalar replacing allocations participating in merges used as debug information OR as a base for field loads. I plan to create subsequent PRs to enable scalar replacement of merges used by other node types (CmpP is next on the list) subsequently. > > The approach I used for _rematerialization_ is pretty straightforward. It consists basically of the following. 1) New IR node (suggested by V. Kozlov), named SafePointScalarMergeNode, to represent a set of SafePointScalarObjectNode; 2) Each scalar replaceable input participating in a merge will get a SafePointScalarObjectNode like if it weren't part of a merge. 3) Add a new Class to support the rematerialization of SR objects that are part of a merge; 4) Patch HotSpot to be able to serialize and deserialize debug information related to allocation merges; 5) Patch C2 to generate unique types for SR objects participating in some allocation merges. > > The approach I used for _enabling the scalar replacement of some of the inputs of the allocation merge_ is also pretty straightforward: call `MemNode::split_through_phi` to, well, split AddP->Load* through the merge which will render the Phi useless. 
> > I tested this with JTREG tests tier 1-4 (Windows, Linux, and Mac) and didn't see regression. I also experimented with several applications and didn't see any failure. I also ran tests with "-ea -esa -Xbatch -Xcomp -XX:+UnlockExperimentalVMOptions -XX:-TieredCompilation -server -XX:+IgnoreUnrecognizedVMOptions -XX:+UnlockDiagnosticVMOptions -XX:+StressLCM -XX:+StressGCM -XX:+StressCCP" and didn't observe any related failures. Cesar Soares Lucas has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 10 commits: - Catching up with master Merge remote-tracking branch 'origin/master' into rematerialization-of-merges - Fix tests. Remember previous reducible Phis. - Address PR review 3. Some comments and be able to abort compilation. - Merge with Master - Addressing PR review 2: refactor & reuse MacroExpand::scalar_replacement method. - Address PR feeedback 1: make ObjectMergeValue subclass of ObjectValue & create new IR class to represent scalarized merges. - Add support for SR'ing some inputs of merges used for field loads - Fix some typos and do some small refactorings. - Merge master - Add support for rematerializing scalar replaced objects participating in allocation merges ------------- Changes: https://git.openjdk.org/jdk/pull/12897/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12897&range=09 Stats: 2253 lines in 26 files changed: 1992 ins; 108 del; 153 mod Patch: https://git.openjdk.org/jdk/pull/12897.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12897/head:pull/12897 PR: https://git.openjdk.org/jdk/pull/12897 From dnsimon at openjdk.org Thu Apr 20 19:40:41 2023 From: dnsimon at openjdk.org (Doug Simon) Date: Thu, 20 Apr 2023 19:40:41 GMT Subject: RFR: 8306581: JVMCI tests failed when run with -XX:TypeProfileLevel=222 after JDK-8303431 Message-ID: This PR fixes an issue where an `Object[]` value is allocated in the VM and passed to a parameter of type `Class[]`. 
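
The kind of mismatch described here can be illustrated in plain Java: an `Object[]` that happens to hold only `Class` objects is still not a `Class[]`, because the array's runtime type is fixed when it is allocated. The sketch below is a hypothetical stand-alone example (the class `ArrayTypeSketch` is invented for illustration) and is not the JVMCI code touched by the PR.

```java
// Hypothetical illustration only; not the JVMCI code touched by the PR.
final class ArrayTypeSketch {
    // True only if the array object itself was allocated as a Class[].
    static boolean isClassArray(Object[] values) {
        return values instanceof Class[];
    }

    public static void main(String[] args) {
        Object[] asObjects = new Object[] { String.class, Integer.class };
        Object[] asClasses = new Class<?>[] { String.class, Integer.class };

        System.out.println(isClassArray(asObjects)); // false
        System.out.println(isClassArray(asClasses)); // true

        try {
            // Same elements, but the array was allocated as Object[].
            Class<?>[] bad = (Class<?>[]) asObjects;
            System.out.println(bad.length);
        } catch (ClassCastException e) {
            System.out.println("ClassCastException");
        }
    }
}
```

In ordinary Java code the checked cast rejects the array, so a value like this can only reach a `Class[]` parameter when it is constructed outside those checks, as in the upcall this PR fixes.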
------------- Commit messages: - fix type punning bug in upcall to VMSupport::encodeAnnotations Changes: https://git.openjdk.org/jdk/pull/13566/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13566&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8306581 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/13566.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13566/head:pull/13566 PR: https://git.openjdk.org/jdk/pull/13566 From never at openjdk.org Thu Apr 20 19:40:43 2023 From: never at openjdk.org (Tom Rodriguez) Date: Thu, 20 Apr 2023 19:40:43 GMT Subject: RFR: 8306581: JVMCI tests failed when run with -XX:TypeProfileLevel=222 after JDK-8303431 In-Reply-To: References: Message-ID: On Thu, 20 Apr 2023 19:31:14 GMT, Doug Simon wrote: > This PR fixes an issue where an `Object[]` value is allocated in the VM and passed to a parameter of type `Class[]`. Marked as reviewed by never (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/13566#pullrequestreview-1394643581 From kvn at openjdk.org Thu Apr 20 20:01:44 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 20 Apr 2023 20:01:44 GMT Subject: RFR: 8306581: JVMCI tests failed when run with -XX:TypeProfileLevel=222 after JDK-8303431 In-Reply-To: References: Message-ID: On Thu, 20 Apr 2023 19:31:14 GMT, Doug Simon wrote: > This PR fixes an issue where an `Object[]` value is allocated in the VM and passed to a parameter of type `Class[]`. Good. ------------- Marked as reviewed by kvn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/13566#pullrequestreview-1394674454 From cslucas at openjdk.org Thu Apr 20 20:19:51 2023 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Thu, 20 Apr 2023 20:19:51 GMT Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v10] In-Reply-To: References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> Message-ID: On Thu, 20 Apr 2023 19:27:58 GMT, Cesar Soares Lucas wrote: >> Can I please get reviews for this PR? >> >> The most common and frequent use of NonEscaping Phis merging object allocations is for debugging information. The two graphs below show numbers for Renaissance and DaCapo benchmarks - similar results are obtained for all other applications that I tested. >> >> With what frequency does each IR node type occurs as an allocation merge user? I.e., if the same node type uses a Phi N times the counter is incremented by N: >> >> ![image](https://user-images.githubusercontent.com/2249648/222280517-4dcf5871-2564-4207-b49e-22aee47fa49d.png) >> >> What are the most common users of allocation merges? I.e., if the same node type uses a Phi N times the counter is incremented by 1: >> >> ![image](https://user-images.githubusercontent.com/2249648/222280608-ca742a4e-1622-4e69-a778-e4db6805ea02.png) >> >> This PR adds support scalar replacing allocations participating in merges used as debug information OR as a base for field loads. I plan to create subsequent PRs to enable scalar replacement of merges used by other node types (CmpP is next on the list) subsequently. >> >> The approach I used for _rematerialization_ is pretty straightforward. It consists basically of the following. 1) New IR node (suggested by V. 
Kozlov), named SafePointScalarMergeNode, to represent a set of SafePointScalarObjectNode; 2) Each scalar replaceable input participating in a merge will get a SafePointScalarObjectNode like if it weren't part of a merge. 3) Add a new Class to support the rematerialization of SR objects that are part of a merge; 4) Patch HotSpot to be able to serialize and deserialize debug information related to allocation merges; 5) Patch C2 to generate unique types for SR objects participating in some allocation merges. >> >> The approach I used for _enabling the scalar replacement of some of the inputs of the allocation merge_ is also pretty straightforward: call `MemNode::split_through_phi` to, well, split AddP->Load* through the merge which will render the Phi useless. >> >> I tested this with JTREG tests tier 1-4 (Windows, Linux, and Mac) and didn't see regression. I also experimented with several applications and didn't see any failure. I also ran tests with "-ea -esa -Xbatch -Xcomp -XX:+UnlockExperimentalVMOptions -XX:-TieredCompilation -server -XX:+IgnoreUnrecognizedVMOptions -XX:+UnlockDiagnosticVMOptions -XX:+StressLCM -XX:+StressGCM -XX:+StressCCP" and didn't observe any related failures. > > Cesar Soares Lucas has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 10 commits: > > - Catching up with master > > Merge remote-tracking branch 'origin/master' into rematerialization-of-merges > - Fix tests. Remember previous reducible Phis. > - Address PR review 3. Some comments and be able to abort compilation. > - Merge with Master > - Addressing PR review 2: refactor & reuse MacroExpand::scalar_replacement method. > - Address PR feeedback 1: make ObjectMergeValue subclass of ObjectValue & create new IR class to represent scalarized merges. > - Add support for SR'ing some inputs of merges used for field loads > - Fix some typos and do some small refactorings. 
> - Merge master > - Add support for rematerializing scalar replaced objects participating in allocation merges Thank you for testing, Vladimir. I was able to reproduce the IR test failures on AArch64 with -UseTLAB. I'll push a fix later today. Looks like the other failures are due to: https://bugs.openjdk.org/browse/JDK-8306581 ------------- PR Comment: https://git.openjdk.org/jdk/pull/12897#issuecomment-1516893566 From fyang at openjdk.org Fri Apr 21 00:41:47 2023 From: fyang at openjdk.org (Fei Yang) Date: Fri, 21 Apr 2023 00:41:47 GMT Subject: RFR: 8051725: Improve expansion of Conv2B nodes in the middle-end [v3] In-Reply-To: References: Message-ID: On Wed, 19 Apr 2023 04:30:39 GMT, Jasmine Karthikeyan wrote: >> Hi, I've created optimizations for the expansion of `Conv2B` nodes, especially when followed immediately by an xor of 1. This pattern is fairly common, and can arise from both [cmov idealization](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/movenode.cpp#L241) and [diamond-phi optimization](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L1571). This change replaces `Conv2B` nodes in the middle-end during macro expansion with conditional moves, allowing the bit flip with `xor` to be subsumed with an inversion of the comparison instead. This change also reduces the overhead of the matcher in the backend, as fewer rules need to be traversed in order to match an ideal node. Performance results from my (Zen 2) machine: >> >> >> Baseline Patch Improvement >> Benchmark Mode Cnt Score Error Units Score Error Units >> Conv2BRules.testEquals0 avgt 10 47.566 ? 0.346 ns/op / 34.130 ? 0.177 ns/op + 28.2% >> Conv2BRules.testNotEquals0 avgt 10 37.167 ? 0.211 ns/op / 34.185 ? 0.258 ns/op + 8.0% >> Conv2BRules.testEquals1 avgt 10 35.059 ? 0.280 ns/op / 34.847 ? 0.160 ns/op (unchanged) >> Conv2BRules.testEqualsNull avgt 10 56.768 ? 2.600 ns/op / 34.330 ? 
0.625 ns/op + 39.5% >> Conv2BRules.testNotEqualsNull avgt 10 47.447 ? 1.193 ns/op / 34.142 ? 0.303 ns/op + 28.0% >> >> Reviews would be greatly appreciated! >> >> Testing: tier1-2 on linux x64, GHA > > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Remove Conv2B from backend as it's macro expanded now Hello, I wonder if we could make this transformation of Conv2B conditional? Architectures like RISC-V doesn't have support of conditional moves at the ISA level for now. So we set ConditionalMoveLimit parameter to 0 for this platform and conditionals moves are emulated with normal compare and branch instructions instead [1]. I don't think we would achieve better performance numbers on this platform with this change. [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/riscv.ad#L9583 ------------- PR Review: https://git.openjdk.org/jdk/pull/13345#pullrequestreview-1394916330 From fyang at openjdk.org Fri Apr 21 03:38:50 2023 From: fyang at openjdk.org (Fei Yang) Date: Fri, 21 Apr 2023 03:38:50 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v23] In-Reply-To: <-tqGJTDwdjWYBNBNZxFAcosCq3_V_fWvGzwJc98uUZI=.013ff8de-380c-464d-8d09-405cfcac17d3@github.com> References: <-tqGJTDwdjWYBNBNZxFAcosCq3_V_fWvGzwJc98uUZI=.013ff8de-380c-464d-8d09-405cfcac17d3@github.com> Message-ID: On Wed, 19 Apr 2023 02:51:00 GMT, Dingli Zhang wrote: >> HI, >> >> We have added support for vector add mask instructions, please take a look and have some reviews. Thanks a lot! >> This patch will add support of vector add/sub/mul/div mask version. It was implemented by referring to RVV v1.0 [1]. >> >> ## Load/Store/Cmp Mask >> `VectorLoadMask, VectorMaskCmp, VectorStoreMask` will implement the mask datapath. We can see where the data is passed in the compilation log with `jdk/incubator/vector/Byte128VectorTests.java`? 
>> >> 218 loadV V1, [R7] # vector (rvv) >> 220 vloadmask V0, V1 >> ... >> 23c vmaskcmp_rvv_masked V0, V4, V5, V0, V1, #0 >> 24c vstoremask V1, V0 >> 258 storeV [R7], V1 # vector (rvv) >> >> >> The corresponding generated jit assembly? >> >> # loadV >> 0x000000400c8ef958: vsetvli t0,zero,e8,m1,tu,mu >> 0x000000400c8ef95c: vle8.v v1,(t2) >> >> # vloadmask >> 0x000000400c8ef960: vsetvli t0,zero,e8,m1,tu, >> 0x000000400c8ef964: vmsne.vx v0,v1,zero >> >> # vmaskcmp_rvv_masked >> 0x000000400c8ef97c: vsetvli t0,zero,e8,m1,tu,mu >> 0x000000400c8ef980: vmclr.m v1 >> 0x000000400c8ef984: vmseq.vv v1,v4,v5,v0.t >> 0x000000400c8ef988: vmv1r.v v0,v1 >> >> # vstoremask >> 0x000000400c8ef98c: vsetvli t0,zero,e8,m1,tu,mu >> 0x000000400c8ef990: vmv.v.x v1,zero >> 0x000000400c8ef994: vmerge.vim v1,v1,1,v0 >> >> >> ## Masked vector arithmetic instructions (e.g. vadd) >> AddMaskTestMerge case: >> >> import jdk.incubator.vector.IntVector; >> import jdk.incubator.vector.VectorMask; >> import jdk.incubator.vector.VectorOperators; >> import jdk.incubator.vector.VectorSpecies; >> >> public class AddMaskTestMerge { >> >> static final VectorSpecies SPECIES = IntVector.SPECIES_128; >> static final int SIZE = 1024; >> static int[] a = new int[SIZE]; >> static int[] b = new int[SIZE]; >> static int[] r = new int[SIZE]; >> static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false}; >> static { >> for (int i = 0; i < SIZE; i++) { >> a[i] = i; >> b[i] = i; >> } >> } >> >> static void workload(int idx) { >> VectorMask vmask = VectorMask.fromArray(SPECIES, c, 0); >> IntVector av = IntVector.fromArray(SPECIES, a, idx); >> IntVector bv = IntVector.fromArray(SPECIES, b, idx); >> av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx); >> } >> >> public static void main(String[] args) { >> for (int i = 0; i < 30_0000; i++) { >> for (int j = 0; j < SIZE; j += SPECIES.length()) { >> workload(j); >> } >> } >> } >> } >> >> >> This test case is reduced from existing jtreg vector 
tests Int128VectorTests.java[2]. This test case corresponds to the add instruction of the vector mask version and other instructions are similar. >> >> Before this patch, the compilation log will not print RVV-related instructions. Now the compilation log is as follows: >> >> >> 0ae B10: # out( B25 B11 ) <- in( B9 ) Freq: 0.999991 >> 0ae loadV V1, [R31] # vector (rvv) >> 0b6 vloadmask V0, V2 >> 0be vadd.vv V3, V1, V0 #@vaddI_masked >> 0c6 lwu R28, [R7, #124] # loadN, compressed ptr, #@loadN ! Field: AddMaskTestMerge.r >> 0ca decode_heap_oop R28, R28 #@decodeHeapOop >> 0cc lwu R7, [R28, #12] # range, #@loadRange >> 0d0 NullCheck R28 >> >> >> And the jit code is as follows: >> >> >> 0x000000400c823cee: vsetvli t0,zero,e32,m1,tu,mu >> 0x000000400c823cf2: vle32.v v1,(t6) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0} >> ; - jdk.incubator.vector.IntVector::intoArray at 43 (line 3228) >> ; - AddMaskTestMerge::workload at 46 (line 25) >> 0x000000400c823cf6: vsetvli t0,zero,e8,m1,tu,mu >> 0x000000400c823cfa: vmsne.vx v0,v2,zero ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0} >> ; - jdk.incubator.vector.VectorMask::fromArray at 47 (line 208) >> ; - AddMaskTestMerge::workload at 7 (line 22) >> 0x000000400c823cfe: vsetvli t0,zero,e32,m1,tu,mu >> 0x000000400c823d02: vadd.vv v3,v3,v1,v0.t ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0} >> ; - jdk.incubator.vector.IntVector::lanewiseTemplate at 192 (line 834) >> ; - jdk.incubator.vector.Int128Vector::lanewise at 9 (line 291) >> ; - jdk.incubator.vector.Int128Vector::lanewise at 4 (line 41) >> ; - AddMaskTestMerge::workload at 39 (line 25) >> >> >> ## Mask register allocation & mask bit opreation >> Since v0 is to be used as a mask register in spec[1], sometimes we need two vmask to do the vector mask logical ops like `AndVMask, OrVMask, XorVMask`. And if only v0 and v31 mask registers are defined, the corresponding c2 nodes will not be generated correctly because of the register pressure[3]. 
>> When we use only v0 and v31 as mask registers, jtreg testing of Byte128VectorTests.java[4] with `-XX:+PrintAssembly` and `-XX:LogFile` will not emit the expected rvv mask instruction. As opposed to this, the following compilation failure log[3] is generated: >> >> >> >> >> >> >> >> >> So define v30 and v31 as mask register too and `AndVMask` will emit the C2 JIT code like: >> >> vloadmask V0, V1 >> vloadmask V30, V2 >> vmask_and V0, V30, V0 >> >> We also modified the implementation of `spill_copy_vector_stack_to_stack ` so that it no longer occupies the v0 register. In addition to that, we change some node like `vasr/vlsl/vlsr/vstring_x/varray_x/vclearArray_x`, which use v0 internally, to make C2 to sense that they used v0. >> >> ## vector load/store - predicated & blend opreation >> >> Jtreg testing of Byte128VectorTests.java[4] with -XX:+PrintOptoAssembly and -XX:LogFile will print the following compilation log, which generated by predicated vector load/store: >> >> 152 B21: # out( B22 ) <- in( B20 ) Freq: 0.499984 >> 152 vmask_gen_L V0, R12 >> 162 loadV_masked V1, V0, [R10] >> 16e storeV_masked [R11], V0, V1 >> >> >> And `VectorBlend` will generate the following compilation log (part of rotate opreation): >> >> 1ea vlsrBS V6, V1, V3 V0 >> 1fe vlslBS V5, V1, V2 V0 >> 212 vor.vv V2, V5, V6 #@vor >> 21a vloadmask V0, V4 >> 222 vmerge_vvm V1, V1, V2 # vector blend >> 22a bgeu R9, R30, B56 #@cmpU_branch P=0.000001 C=-1.000000 >> >> >> At the same time, we added the predicated nodes of `RShiftV/LShiftV/URShiftV`. While there was some code duplication for the corresponding nodes in non-masked form, so a small refactoring was done. 
>> >> >> [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc >> [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java >> [3] https://github.com/openjdk/jdk/blob/0deb648985b018653ccdaf193dc13b3cf21c088a/src/hotspot/share/opto/chaitin.cpp#L526 >> [4] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Byte128VectorTests.java >> >> ### Testing: >> >> qemu with UseRVV: >> - [x] Tier1 tests (release) >> - [x] Tier2 tests (release) >> - [x] Tier3 tests (release) >> - [x] test/jdk/jdk/incubator/vector (release/fastdebug) > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Rename vmaskcmp_DF Changes requested by fyang (Reviewer). src/hotspot/cpu/riscv/riscv_v.ad line 2393: > 2391: format %{ "vmask_gen_I $dst, $src" %} > 2392: ins_encode %{ > 2393: __ clear_register_v(as_VectorRegister($dst$$reg)); Could you explain why we need to clear 'dst' here? And can we use vector mask-register logical instruction 'vmclr.m' instead if it is really needed? src/hotspot/cpu/riscv/riscv_v.ad line 2406: > 2404: __ clear_register_v(as_VectorRegister($dst$$reg)); > 2405: __ vsetvli(t0, $src$$Register, Assembler::e8); > 2406: __ vmxnor_mm(as_VectorRegister($dst$$reg), as_VectorRegister($dst$$reg), as_VectorRegister($dst$$reg)); We could introduce assembler pseudoinstructions 'vmmv.m', 'vmclr.m', 'vmset.m' and 'vmnot.m' from the spec to make the vector mask computation more readable. 

    vmmv.m  vd, vs  =>  vmand.mm  vd, vs, vs   # Copy mask register
    vmclr.m vd      =>  vmxor.mm  vd, vd, vd   # Clear mask register
    vmset.m vd      =>  vmxnor.mm vd, vd, vd   # Set mask register
    vmnot.m vd, vs  =>  vmnand.mm vd, vs, vs   # Invert bits

-------------

PR Review: https://git.openjdk.org/jdk/pull/12682#pullrequestreview-1395006081
PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1173268904
PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1173270395

From dzhang at openjdk.org Fri Apr 21 07:07:48 2023
From: dzhang at openjdk.org (Dingli Zhang)
Date: Fri, 21 Apr 2023 07:07:48 GMT
Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v24]
In-Reply-To:
References:
Message-ID:

> HI,
>
> We have added support for vector add mask instructions, please take a look and have some reviews. Thanks a lot!
> This patch will add support of vector add/sub/mul/div mask version. It was implemented by referring to RVV v1.0 [1].
>
> ## Load/Store/Cmp Mask
> `VectorLoadMask, VectorMaskCmp, VectorStoreMask` will implement the mask datapath. We can see where the data is passed in the compilation log with `jdk/incubator/vector/Byte128VectorTests.java`:
>
> 218 loadV V1, [R7] # vector (rvv)
> 220 vloadmask V0, V1
> ...
> 23c vmaskcmp_rvv_masked V0, V4, V5, V0, V1, #0
> 24c vstoremask V1, V0
> 258 storeV [R7], V1 # vector (rvv)
>
>
> The corresponding generated jit assembly:
> > # loadV > 0x000000400c8ef958: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef95c: vle8.v v1,(t2) > > # vloadmask > 0x000000400c8ef960: vsetvli t0,zero,e8,m1,tu, > 0x000000400c8ef964: vmsne.vx v0,v1,zero > > # vmaskcmp_rvv_masked > 0x000000400c8ef97c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef980: vmclr.m v1 > 0x000000400c8ef984: vmseq.vv v1,v4,v5,v0.t > 0x000000400c8ef988: vmv1r.v v0,v1 > > # vstoremask > 0x000000400c8ef98c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef990: vmv.v.x v1,zero > 0x000000400c8ef994: vmerge.vim v1,v1,1,v0 > > > ## Masked vector arithmetic instructions (e.g. vadd) > AddMaskTestMerge case: > > import jdk.incubator.vector.IntVector; > import jdk.incubator.vector.VectorMask; > import jdk.incubator.vector.VectorOperators; > import jdk.incubator.vector.VectorSpecies; > > public class AddMaskTestMerge { > > static final VectorSpecies SPECIES = IntVector.SPECIES_128; > static final int SIZE = 1024; > static int[] a = new int[SIZE]; > static int[] b = new int[SIZE]; > static int[] r = new int[SIZE]; > static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false}; > static { > for (int i = 0; i < SIZE; i++) { > a[i] = i; > b[i] = i; > } > } > > static void workload(int idx) { > VectorMask vmask = VectorMask.fromArray(SPECIES, c, 0); > IntVector av = IntVector.fromArray(SPECIES, a, idx); > IntVector bv = IntVector.fromArray(SPECIES, b, idx); > av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx); > } > > public static void main(String[] args) { > for (int i = 0; i < 30_0000; i++) { > for (int j = 0; j < SIZE; j += SPECIES.length()) { > workload(j); > } > } > } > } > > > This test case is reduced from existing jtreg vector tests Int128VectorTests.java[2]. This test case corresponds to the add instruction of the vector mask version and other instructions are similar. > > Before this patch, the compilation log will not print RVV-related instructions. 
Now the compilation log is as follows:
> 
> 0ae B10: # out( B25 B11 ) <- in( B9 ) Freq: 0.999991
> 0ae loadV V1, [R31] # vector (rvv)
> 0b6 vloadmask V0, V2
> 0be vadd.vv V3, V1, V0 #@vaddI_masked
> 0c6 lwu R28, [R7, #124] # loadN, compressed ptr, #@loadN ! Field: AddMaskTestMerge.r
> 0ca decode_heap_oop R28, R28 #@decodeHeapOop
> 0cc lwu R7, [R28, #12] # range, #@loadRange
> 0d0 NullCheck R28
> 
> And the JIT code is as follows:
> 
> 0x000000400c823cee: vsetvli t0,zero,e32,m1,tu,mu
> 0x000000400c823cf2: vle32.v v1,(t6) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0}
>     ; - jdk.incubator.vector.IntVector::intoArray at 43 (line 3228)
>     ; - AddMaskTestMerge::workload at 46 (line 25)
> 0x000000400c823cf6: vsetvli t0,zero,e8,m1,tu,mu
> 0x000000400c823cfa: vmsne.vx v0,v2,zero ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0}
>     ; - jdk.incubator.vector.VectorMask::fromArray at 47 (line 208)
>     ; - AddMaskTestMerge::workload at 7 (line 22)
> 0x000000400c823cfe: vsetvli t0,zero,e32,m1,tu,mu
> 0x000000400c823d02: vadd.vv v3,v3,v1,v0.t ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0}
>     ; - jdk.incubator.vector.IntVector::lanewiseTemplate at 192 (line 834)
>     ; - jdk.incubator.vector.Int128Vector::lanewise at 9 (line 291)
>     ; - jdk.incubator.vector.Int128Vector::lanewise at 4 (line 41)
>     ; - AddMaskTestMerge::workload at 39 (line 25)
> 
> ## Mask register allocation & mask bit operation
> Since the spec[1] designates v0 as the mask register, we sometimes need two mask registers for vector mask logical ops such as `AndVMask, OrVMask, XorVMask`. If only v0 and v31 are defined as mask registers, the corresponding C2 nodes are not generated correctly because of register pressure[3].
> When we use only v0 and v31 as mask registers, jtreg testing of Byte128VectorTests.java[4] with `-XX:+PrintAssembly` and `-XX:LogFile` does not emit the expected RVV mask instructions. Instead, the following compilation failure log[3] is generated:
> 
> 
> So we define v30 and v31 as mask registers too, and `AndVMask` then emits C2 JIT code like:
> 
> vloadmask V0, V1
> vloadmask V30, V2
> vmask_and V0, V30, V0
> 
> We also modified the implementation of `spill_copy_vector_stack_to_stack` so that it no longer occupies the v0 register. In addition, we changed some nodes such as `vasr/vlsl/vlsr/vstring_x/varray_x/vclearArray_x`, which use v0 internally, so that C2 is aware that they use v0.
> 
> ## vector load/store - predicated & blend operation
> 
> Jtreg testing of Byte128VectorTests.java[4] with -XX:+PrintOptoAssembly and -XX:LogFile prints the following compilation log, which is generated by predicated vector load/store:
> 
> 152 B21: # out( B22 ) <- in( B20 ) Freq: 0.499984
> 152 vmask_gen_L V0, R12
> 162 loadV_masked V1, V0, [R10]
> 16e storeV_masked [R11], V0, V1
> 
> And `VectorBlend` generates the following compilation log (part of a rotate operation):
> 
> 1ea vlsrBS V6, V1, V3 V0
> 1fe vlslBS V5, V1, V2 V0
> 212 vor.vv V2, V5, V6 #@vor
> 21a vloadmask V0, V4
> 222 vmerge_vvm V1, V1, V2 # vector blend
> 22a bgeu R9, R30, B56 #@cmpU_branch P=0.000001 C=-1.000000
> 
> At the same time, we added the predicated nodes for `RShiftV/LShiftV/URShiftV`. Since this duplicated some code from the corresponding non-masked nodes, a small refactoring was done.
> 
> [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc
> [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java
> [3] https://github.com/openjdk/jdk/blob/0deb648985b018653ccdaf193dc13b3cf21c088a/src/hotspot/share/opto/chaitin.cpp#L526
> [4] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Byte128VectorTests.java
> 
> ### Testing:
> 
> qemu with UseRVV:
> - [x] Tier1 tests (release)
> - [x] Tier2 tests (release)
> - [x] Tier3 tests (release)
> - [x] test/jdk/jdk/incubator/vector (release/fastdebug)

Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision:

  rename fp and modify VectorMaskGen node

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/12682/files
  - new: https://git.openjdk.org/jdk/pull/12682/files/b5f716dc..4631181a

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=23
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=22-23

  Stats: 354 lines in 4 files changed: 194 ins; 65 del; 95 mod
  Patch: https://git.openjdk.org/jdk/pull/12682.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/12682/head:pull/12682

PR: https://git.openjdk.org/jdk/pull/12682

From dnsimon at openjdk.org Fri Apr 21 07:16:53 2023
From: dnsimon at openjdk.org (Doug Simon)
Date: Fri, 21 Apr 2023 07:16:53 GMT
Subject: RFR: 8306581: JVMCI tests failed when run with -XX:TypeProfileLevel=222 after JDK-8303431
In-Reply-To:
References:
Message-ID: <8FB32fxEzgqAiy-sIF2dGVm-pqMQV_YqrBtmqrv2YhM=.6e2234ad-0ada-4ad0-9b9a-e63c08956437@github.com>

On Thu, 20 Apr 2023 19:35:07 GMT, Tom Rodriguez wrote:

>> This PR fixes an issue where an `Object[]` value is allocated in the VM and passed to a parameter of type `Class[]`.
> 
> Marked as reviewed by never (Reviewer).

Thanks for the review @tkrodriguez and @vnkozlov.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/13566#issuecomment-1517374733

From dnsimon at openjdk.org Fri Apr 21 07:16:55 2023
From: dnsimon at openjdk.org (Doug Simon)
Date: Fri, 21 Apr 2023 07:16:55 GMT
Subject: Integrated: 8306581: JVMCI tests failed when run with -XX:TypeProfileLevel=222 after JDK-8303431
In-Reply-To:
References:
Message-ID:

On Thu, 20 Apr 2023 19:31:14 GMT, Doug Simon wrote:

> This PR fixes an issue where an `Object[]` value is allocated in the VM and passed to a parameter of type `Class[]`.

This pull request has now been integrated.

Changeset: fdaabd6e
Author: Doug Simon
URL: https://git.openjdk.org/jdk/commit/fdaabd6eecd86d1a8b1d1a4ed11cd03996d1db65
Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod

8306581: JVMCI tests failed when run with -XX:TypeProfileLevel=222 after JDK-8303431

Reviewed-by: never, kvn

-------------

PR: https://git.openjdk.org/jdk/pull/13566

From fjiang at openjdk.org Fri Apr 21 09:01:55 2023
From: fjiang at openjdk.org (Feilong Jiang)
Date: Fri, 21 Apr 2023 09:01:55 GMT
Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v24]
In-Reply-To:
References:
Message-ID: <7KCd9xWDpRwOQCUZN7wfg_eJ8cIBOerEM3p7UJ3sV6k=.06133ed0-fd6d-428f-b730-5a24bee992f2@github.com>

On Fri, 21 Apr 2023 07:07:48 GMT, Dingli Zhang wrote:

> Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision:
> 
>   rename fp and modify VectorMaskGen node

src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.hpp line 234:

> 232: 
> 233: // Clear vector registers independent of previous vl and vtype.
> 234: void clear_register_v(VectorRegister v) {

Do we still need `clear_register_v`? It no longer seems to be used anywhere.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1173522695

From epeter at openjdk.org Fri Apr 21 13:00:49 2023
From: epeter at openjdk.org (Emanuel Peter)
Date: Fri, 21 Apr 2023 13:00:49 GMT
Subject: RFR: 8303466: C2: failed: malformed control flow. Limit type made precise with MaxL/MinL [v4]
In-Reply-To:
References: <9druojszHMZKJqtonknAR-ykDUZwTqkAgpWx6TI0_zA=.011ac9ae-e41e-4e5b-8a4f-f9567eef3ce5@github.com>
Message-ID:

On Mon, 17 Apr 2023 19:55:52 GMT, Vladimir Kozlov wrote:

>> @jaskarth I think your issues are not related, though I can look at them again once I get back to IGVN verification.
>> 
>> @vnkozlov I thought about it a bit more. With a simple example like `Test::test`, I get unrolling `2048`, so we unroll 10-ish times. I see accordingly many `ConvI2L, SubL, MaxL, ConvL2I` nodes.
Now, I can collapse the `ConvL2I -> ConvI2L` parts (the types guarantee that we never leave the `int` range, so conversion never clips anything), so it is only a chain of `SubL -> MaxL` nodes.
>> 
>> One idea would be to fold `SubL -> MaxL -> SubL -> MaxL` down to a single subtraction and maximum. Maybe that could be done, we just have to be very careful with the types. **I'll give it a try**, and it seems to work on a basic example.
>> 
>> The example:
>> 
>> `./java -Xcomp -XX:CompileCommand=compileonly,Test::test -XX:CompileCommand=printcompilation,Test::test -XX:+TraceLoopOpts -XX:+PrintIdeal Test.java`
>> 
>> public class Test {
>>     static int START = 0;
>>     static int FINISH = 512;
>>     static int RANGE = 512;
>> 
>>     public static void main(String args[]) {
>>         byte[] data = new byte[RANGE];
>>         test(data);
>>     }
>> 
>>     public static void test(byte[] data) {
>>         for (int j = START; j < FINISH; j++) {
>>             data[j] = (byte)(data[j] * 11);
>>         }
>>     }
>> }
>> 
>> What to do with this?
>> - Performance testing did not show any difference. But maybe we do not trust that enough.
>> - Before and now, the chain of unrolling-limits can be interrupted by range-check limits. We probably will just accept that this means that not all of the unrolling-limits can be folded together.
>> 
>> **I have an alternative proposal:**
>> Leave the `MaxL/MinL` node for the range-check limits, there are usually not that many RC-limits, and up to now we used a `CMove` node per such limit already anyway.
>> 
>> But for the unroll-limits, we introduce a `SubINoUnderflow` node, which does a safe (no-underflow) subtraction `limit-stride`.
>> These nodes can be folded together relatively easily.
>> I already had such an implementation before, and reverted it: https://github.com/openjdk/jdk/pull/13269/commits/f5fcf6084a2446876ba2a85907a2991ef4c705b7
>> I had already discussed this idea with @chhagedorn a while ago. But I then decided against it once I saw that I wanted a unified solution for RC-limits and unroll-limits. The downside is that it takes a new special node.
>> 
>> With this `SubINoUnderflowNode` idea, we would have a constant number of nodes added per RC-limit. And then for all the unroll limit adjustments together, we would only have one `SubINoUnderflow` node, as they would all collapse into one. At macro expansion, I can then expand it into a single CMove node.
>> 
>> But I think I can do the same with just collapsing `SubL -> MaxL -> SubL -> MaxL` to `SubL -> MaxL`. That may be cleaner.
>> 
>> @vnkozlov What do you think? Do you have any other ideas? What solution would you prefer?
> 
>> But I think I can do the same with just collapsing `SubL -> MaxL -> SubL -> MaxL` to `SubL -> MaxL`. That may be cleaner.
> 
> I prefer this if you can do it. So you have the sequence (after folding `Conv` nodes)
> 
> MaxL(SubL(MaxL(SubL(limit, stride), min_int), stride*2), min_int);
> 
> Yes, I think it can be collapsed to:
> 
> MaxL(SubL(limit, stride*3), min_int);
> 
> If at any point in the chain `limit` becomes `min_int`, it will stay `min_int` (even if `stride` is `max_int`) because you use Long arithmetic and we have a "small" limit on unrolling (16?).
> If it does not hit min_int, the result is similar to SubL(SubL(limit, stride), stride*2).
> So you just need to correctly collect `stride*N` values.

@vnkozlov I added some more IGVN optimizations that help to fold the `SubL -> MaxL` chains.

1. `fold_subI_no_underflow_pattern` in `MaxLNode::Ideal`. Collapses `SubL -> MaxL -> SubL -> MaxL` to a simple `SubL -> MaxL`.
2. `ConvI2LNode::Identity` can now convert `I2L(L2I(x))` => `x`. We need this so that the casts are not in the way of the first optimization.
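
To make the folding argument above concrete, here is a minimal Java sketch. It is not HotSpot code: it models `SubL` and `MaxL` with plain `long` arithmetic and checks that two clamped subtractions give the same result as one clamped subtraction of the summed strides. The names `LimitFold`, `step`, `chained` and `folded` are hypothetical, and the sketch assumes non-negative strides (an up-counting loop) and a `limit` that starts in the `int` range.

```java
final class LimitFold {
    static final long MIN_INT = Integer.MIN_VALUE;

    // One limit adjustment: MaxL(SubL(limit, stride), min_int), done in long arithmetic.
    static long step(long limit, long stride) {
        return Math.max(limit - stride, MIN_INT);
    }

    // Chained form: MaxL(SubL(MaxL(SubL(limit, s1), min_int), s2), min_int).
    static long chained(int limit, long s1, long s2) {
        return step(step(limit, s1), s2);
    }

    // Folded form: MaxL(SubL(limit, s1 + s2), min_int).
    static long folded(int limit, long s1, long s2) {
        return step(limit, s1 + s2);
    }

    public static void main(String[] args) {
        int[] limits = {Integer.MIN_VALUE, Integer.MIN_VALUE + 1, -5, 0, 7, Integer.MAX_VALUE};
        long[] strides = {0, 1, 2, 16, Integer.MAX_VALUE};
        for (int limit : limits) {
            for (long s1 : strides) {
                for (long s2 : strides) {
                    if (chained(limit, s1, s2) != folded(limit, s1, s2)) {
                        throw new AssertionError(limit + " " + s1 + " " + s2);
                    }
                }
            }
        }
        System.out.println("chained and folded forms agree");
    }
}
```

The two forms agree because the intermediate subtraction cannot overflow a `long`, and once the value is clamped to `min_int` a further non-negative subtraction followed by the clamp leaves it at `min_int`.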

I added verification that these optimizations are really taken:
https://github.com/openjdk/jdk/blob/33e1ad54d0322bdbc50c85fbea6f1ada0963f3ef/test/hotspot/jtreg/compiler/loopopts/TestLoopLimitSubtractionsCollapse.java#L50-L68

Is this now ok?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/13269#issuecomment-1517798221

From rsunderbabu at openjdk.org Fri Apr 21 13:18:53 2023
From: rsunderbabu at openjdk.org (Ramkumar Sunderbabu)
Date: Fri, 21 Apr 2023 13:18:53 GMT
Subject: RFR: 8306636: Disable compiler/c2/Test6905845.java with -XX:TieredStopAtLevel=3
Message-ID: <3vWrB1NyJ0jObav66ZyBcnd41zjYERxNfGEOgGkJ9jw=.649c81d0-7d21-4c2a-a13a-f1c7ccf3d177@github.com>

Disable the c2 test for TieredStopAtLevel=3

-------------

Commit messages:
 - 8306636: Disable compiler/c2/Test6905845.java with -XX:TieredStopAtLevel=3

Changes: https://git.openjdk.org/jdk/pull/13579/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13579&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8306636
  Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod
  Patch: https://git.openjdk.org/jdk/pull/13579.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/13579/head:pull/13579

PR: https://git.openjdk.org/jdk/pull/13579

From dzhang at openjdk.org Fri Apr 21 13:29:50 2023
From: dzhang at openjdk.org (Dingli Zhang)
Date: Fri, 21 Apr 2023 13:29:50 GMT
Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v25]
In-Reply-To:
References:
Message-ID: <2lw4KEg0sODdIaxQqrihbNU0W-bTM3YPy12jECrgGM0=.675c17f2-0c62-4c47-a2c2-1a5741e6c4ff@github.com>

> Hi,
> 
> We have added support for vector add mask instructions; please take a look and review. Thanks a lot!
> This patch adds support for the masked versions of vector add/sub/mul/div. It was implemented by referring to RVV v1.0 [1].
> 
> ## Load/Store/Cmp Mask
> `VectorLoadMask, VectorMaskCmp, VectorStoreMask` implement the mask datapath. We can see where the data is passed in the compilation log of `jdk/incubator/vector/Byte128VectorTests.java`:
> 
> 218 loadV V1, [R7] # vector (rvv)
> 220 vloadmask V0, V1
> ...
> 23c vmaskcmp_rvv_masked V0, V4, V5, V0, V1, #0
> 24c vstoremask V1, V0
> 258 storeV [R7], V1 # vector (rvv)
> 
> The corresponding generated JIT assembly:
> 
> # loadV
> 0x000000400c8ef958: vsetvli t0,zero,e8,m1,tu,mu
> 0x000000400c8ef95c: vle8.v v1,(t2)
> 
> # vloadmask
> 0x000000400c8ef960: vsetvli t0,zero,e8,m1,tu,
> 0x000000400c8ef964: vmsne.vx v0,v1,zero
> 
> # vmaskcmp_rvv_masked
> 0x000000400c8ef97c: vsetvli t0,zero,e8,m1,tu,mu
> 0x000000400c8ef980: vmclr.m v1
> 0x000000400c8ef984: vmseq.vv v1,v4,v5,v0.t
> 0x000000400c8ef988: vmv1r.v v0,v1
> 
> # vstoremask
> 0x000000400c8ef98c: vsetvli t0,zero,e8,m1,tu,mu
> 0x000000400c8ef990: vmv.v.x v1,zero
> 0x000000400c8ef994: vmerge.vim v1,v1,1,v0
> 
> ## Masked vector arithmetic instructions (e.g. vadd)
> AddMaskTestMerge case:
> 
> import jdk.incubator.vector.IntVector;
> import jdk.incubator.vector.VectorMask;
> import jdk.incubator.vector.VectorOperators;
> import jdk.incubator.vector.VectorSpecies;
> 
> public class AddMaskTestMerge {
> 
>     static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128;
>     static final int SIZE = 1024;
>     static int[] a = new int[SIZE];
>     static int[] b = new int[SIZE];
>     static int[] r = new int[SIZE];
>     static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false};
>     static {
>         for (int i = 0; i < SIZE; i++) {
>             a[i] = i;
>             b[i] = i;
>         }
>     }
> 
>     static void workload(int idx) {
>         VectorMask<Integer> vmask = VectorMask.fromArray(SPECIES, c, 0);
>         IntVector av = IntVector.fromArray(SPECIES, a, idx);
>         IntVector bv = IntVector.fromArray(SPECIES, b, idx);
>         av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx);
>     }
> 
>     public static void main(String[] args) {
>         for (int i = 0; i < 30_0000; i++) {
>             for (int j = 0; j < SIZE; j += SPECIES.length()) {
>                 workload(j);
>             }
>         }
>     }
> }
> 
> This test case is reduced from the existing jtreg vector test Int128VectorTests.java[2]. It exercises the masked vector add instruction; other instructions are similar.
> 
> Before this patch, the compilation log did not contain RVV-related instructions. Now the compilation log is as follows:
> 
> 0ae B10: # out( B25 B11 ) <- in( B9 ) Freq: 0.999991
> 0ae loadV V1, [R31] # vector (rvv)
> 0b6 vloadmask V0, V2
> 0be vadd.vv V3, V1, V0 #@vaddI_masked
> 0c6 lwu R28, [R7, #124] # loadN, compressed ptr, #@loadN ! Field: AddMaskTestMerge.r
> 0ca decode_heap_oop R28, R28 #@decodeHeapOop
> 0cc lwu R7, [R28, #12] # range, #@loadRange
> 0d0 NullCheck R28
> 
> And the JIT code is as follows:
> 
> 0x000000400c823cee: vsetvli t0,zero,e32,m1,tu,mu
> 0x000000400c823cf2: vle32.v v1,(t6) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0}
>     ; - jdk.incubator.vector.IntVector::intoArray at 43 (line 3228)
>     ; - AddMaskTestMerge::workload at 46 (line 25)
> 0x000000400c823cf6: vsetvli t0,zero,e8,m1,tu,mu
> 0x000000400c823cfa: vmsne.vx v0,v2,zero ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0}
>     ; - jdk.incubator.vector.VectorMask::fromArray at 47 (line 208)
>     ; - AddMaskTestMerge::workload at 7 (line 22)
> 0x000000400c823cfe: vsetvli t0,zero,e32,m1,tu,mu
> 0x000000400c823d02: vadd.vv v3,v3,v1,v0.t ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0}
>     ; - jdk.incubator.vector.IntVector::lanewiseTemplate at 192 (line 834)
>     ; - jdk.incubator.vector.Int128Vector::lanewise at 9 (line 291)
>     ; - jdk.incubator.vector.Int128Vector::lanewise at 4 (line 41)
>     ; - AddMaskTestMerge::workload at 39 (line 25)
> 
> ## Mask register allocation & mask bit operation
> Since the spec[1] designates v0 as the mask register, we sometimes need two mask registers for vector mask logical ops such as `AndVMask, OrVMask, XorVMask`. If only v0 and v31 are defined as mask registers, the corresponding C2 nodes are not generated correctly because of register pressure[3].
> When we use only v0 and v31 as mask registers, jtreg testing of Byte128VectorTests.java[4] with `-XX:+PrintAssembly` and `-XX:LogFile` does not emit the expected RVV mask instructions. Instead, the following compilation failure log[3] is generated:
> 
> 
> So we define v30 and v31 as mask registers too, and `AndVMask` then emits C2 JIT code like:
> 
> vloadmask V0, V1
> vloadmask V30, V2
> vmask_and V0, V30, V0
> 
> We also modified the implementation of `spill_copy_vector_stack_to_stack` so that it no longer occupies the v0 register. In addition, we changed some nodes such as `vasr/vlsl/vlsr/vstring_x/varray_x/vclearArray_x`, which use v0 internally, so that C2 is aware that they use v0.
> 
> ## vector load/store - predicated & blend operation
> 
> Jtreg testing of Byte128VectorTests.java[4] with -XX:+PrintOptoAssembly and -XX:LogFile prints the following compilation log, which is generated by predicated vector load/store:
> 
> 152 B21: # out( B22 ) <- in( B20 ) Freq: 0.499984
> 152 vmask_gen_L V0, R12
> 162 loadV_masked V1, V0, [R10]
> 16e storeV_masked [R11], V0, V1
> 
> And `VectorBlend` generates the following compilation log (part of a rotate operation):
> 
> 1ea vlsrBS V6, V1, V3 V0
> 1fe vlslBS V5, V1, V2 V0
> 212 vor.vv V2, V5, V6 #@vor
> 21a vloadmask V0, V4
> 222 vmerge_vvm V1, V1, V2 # vector blend
> 22a bgeu R9, R30, B56 #@cmpU_branch P=0.000001 C=-1.000000
> 
> At the same time, we added the predicated nodes for `RShiftV/LShiftV/URShiftV`. Since this duplicated some code from the corresponding non-masked nodes, a small refactoring was done.
> 
> [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc
> [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java
> [3] https://github.com/openjdk/jdk/blob/0deb648985b018653ccdaf193dc13b3cf21c088a/src/hotspot/share/opto/chaitin.cpp#L526
> [4] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Byte128VectorTests.java
> 
> ### Testing:
> 
> qemu with UseRVV:
> - [x] Tier1 tests (release)
> - [x] Tier2 tests (release)
> - [x] Tier3 tests (release)
> - [x] test/jdk/jdk/incubator/vector (release/fastdebug)

Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision:

  Add some vector pseudo instructions

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/12682/files
  - new: https://git.openjdk.org/jdk/pull/12682/files/4631181a..800205bb

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=24
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=23-24

  Stats: 28 lines in 4 files changed: 15 ins; 6 del; 7 mod
  Patch: https://git.openjdk.org/jdk/pull/12682.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/12682/head:pull/12682

PR: https://git.openjdk.org/jdk/pull/12682

From gcao at openjdk.org Fri Apr 21 13:29:54 2023
From: gcao at openjdk.org (Gui Cao)
Date: Fri, 21 Apr 2023 13:29:54 GMT
Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v24]
In-Reply-To: <7KCd9xWDpRwOQCUZN7wfg_eJ8cIBOerEM3p7UJ3sV6k=.06133ed0-fd6d-428f-b730-5a24bee992f2@github.com>
References: <7KCd9xWDpRwOQCUZN7wfg_eJ8cIBOerEM3p7UJ3sV6k=.06133ed0-fd6d-428f-b730-5a24bee992f2@github.com>
Message-ID:

On Fri, 21 Apr 2023 08:58:31 GMT, Feilong Jiang wrote:

>> Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision:
>> 
>>   rename fp and modify VectorMaskGen node
> 
> src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.hpp line 234:
> 
>> 232: 
>> 233: // Clear vector
registers independent of previous vl and vtype.
>> 234: void clear_register_v(VectorRegister v) {
> 
> Do we still need `clear_register_v`? There seems to be nowhere to use it anymore.

Fixed.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1173761273

From gcao at openjdk.org Fri Apr 21 13:30:02 2023
From: gcao at openjdk.org (Gui Cao)
Date: Fri, 21 Apr 2023 13:30:02 GMT
Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v23]
In-Reply-To:
References: <-tqGJTDwdjWYBNBNZxFAcosCq3_V_fWvGzwJc98uUZI=.013ff8de-380c-464d-8d09-405cfcac17d3@github.com>
Message-ID:

On Thu, 20 Apr 2023 14:20:43 GMT, Fei Yang wrote:

>> Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision:
>> 
>>   Rename vmaskcmp_DF
> 
> src/hotspot/cpu/riscv/riscv_v.ad line 199:
> 
>> 197: // vector mask float compare
>> 198: 
>> 199: instruct vmaskcmp_FD(vRegMask dst, vReg src1, vReg src2, immI cond, vReg tmp1, vReg tmp2) %{
> 
> I would prefer renaming this into something like 'vmaskcmp_fp'. Also, would you mind renaming the macro-assembler functions minmax_FD, minmax_FD_v and reduce_minmax_FD_v into something like minmax_fp, minmax_fp_v, reduce_minmax_fp_v respectively? Thanks.

Fixed.

> src/hotspot/cpu/riscv/riscv_v.ad line 216:
> 
>> 214: %}
>> 215: 
>> 216: instruct vmaskcmp_FD_masked(vRegMask dst, vReg src1, vReg src2, immI cond, vRegMask_V0 vmask, vReg tmp1, vReg tmp2, vReg tmp3) %{
> 
> Similarly here, we can rename this into something like 'vmaskcmp_fp_masked'.

Fixed.

> src/hotspot/cpu/riscv/riscv_v.ad line 413:
> 
>> 411: %}
>> 412: 
>> 413: instruct vaddF_masked(vReg dst_src1, vReg src2, vRegMask_V0 vmask) %{
> 
> Can we combine this one with the next 'vaddD_masked' into a new instruct and give it a new name like 'vadd_fp_masked'? It looks to me that we can do a similar thing for other instructs like 'vmulF_masked/vmulD_masked', etc.

Fixed.

> src/hotspot/cpu/riscv/riscv_v.ad line 1491:
> 
>> 1489: // vector shift
>> 1490: 
>> 1491: instruct vasrBS(vReg dst, vReg src, vReg shift, vRegMask_V0 tmp) %{
> 
> I really don't like instruct names like 'vasrBS' and 'vasrIL', which look rather misleading. I would prefer to keep those separated like before.

Fixed.

> src/hotspot/cpu/riscv/riscv_v.ad line 2393:
> 
>> 2391: format %{ "vmask_gen_I $dst, $src" %}
>> 2392: ins_encode %{
>> 2393: __ clear_register_v(as_VectorRegister($dst$$reg));
> 
> Could you explain why we need to clear 'dst' here? And can we use the vector mask-register logical instruction 'vmclr.m' instead if it is really needed?

It was indeed redundant here and has been removed.

> src/hotspot/cpu/riscv/riscv_v.ad line 2406:
> 
>> 2404: __ clear_register_v(as_VectorRegister($dst$$reg));
>> 2405: __ vsetvli(t0, $src$$Register, Assembler::e8);
>> 2406: __ vmxnor_mm(as_VectorRegister($dst$$reg), as_VectorRegister($dst$$reg), as_VectorRegister($dst$$reg));
> 
> We could introduce the assembler pseudoinstructions 'vmmv.m', 'vmclr.m', 'vmset.m' and 'vmnot.m' from the spec to make the vector mask computation more readable.
> 
> vmmv.m vd, vs  => vmand.mm vd, vs, vs   # Copy mask register
> vmclr.m vd     => vmxor.mm vd, vd, vd   # Clear mask register
> vmset.m vd     => vmxnor.mm vd, vd, vd  # Set mask register
> vmnot.m vd, vs => vmnand.mm vd, vs, vs  # Invert bits

Fixed.
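
The four pseudoinstruction expansions quoted above rest on simple bitwise identities. The following Java sketch illustrates them under stated assumptions: a mask register is modeled as the low 8 bits of an `int` (one bit per lane), and the class `MaskPseudoOps` and its methods are hypothetical names mirroring the RVV mnemonics, not HotSpot assembler functions.

```java
final class MaskPseudoOps {
    static final int LANES = 8;                 // assumed number of mask bits in this model
    static final int ALL = (1 << LANES) - 1;    // all lanes set

    // The underlying mask-register logical instructions.
    static int vmand(int a, int b)  { return (a & b) & ALL; }
    static int vmnand(int a, int b) { return ~(a & b) & ALL; }
    static int vmxor(int a, int b)  { return (a ^ b) & ALL; }
    static int vmxnor(int a, int b) { return ~(a ^ b) & ALL; }

    // The pseudoinstructions, each expressed through one logical instruction.
    static int vmmv(int vs)  { return vmand(vs, vs); }   // x & x == x
    static int vmclr(int vd) { return vmxor(vd, vd); }   // x ^ x == 0
    static int vmset(int vd) { return vmxnor(vd, vd); }  // ~(x ^ x) == all ones
    static int vmnot(int vs) { return vmnand(vs, vs); }  // ~(x & x) == ~x

    public static void main(String[] args) {
        int m = 0b1010_0101;
        if (vmmv(m) != m) throw new AssertionError("copy");
        if (vmclr(m) != 0) throw new AssertionError("clear");
        if (vmset(m) != ALL) throw new AssertionError("set");
        if (vmnot(m) != 0b0101_1010) throw new AssertionError("invert");
        System.out.println("pseudoinstruction identities hold");
    }
}
```

Note that `vmclr` and `vmset` ignore the previous contents of the destination, which is why the result of `vmxnor vd, vd, vd` does not depend on clearing the register first.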

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1173762016
PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1173762191
PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1173762356
PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1173762484
PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1173763532
PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1173763835

From jsjolen at openjdk.org Fri Apr 21 13:39:56 2023
From: jsjolen at openjdk.org (Johan Sjölen)
Date: Fri, 21 Apr 2023 13:39:56 GMT
Subject: RFR: 8306456: Don't leak _worklist's memory in PhaseLive::compute [v3]
In-Reply-To:
References:
Message-ID:

On Thu, 20 Apr 2023 13:02:23 GMT, Johan Sjölen wrote:

>> `PhaseLive::compute` used to do this: `_worklist = new (_arena) Block_List();`. This allocates the `Block_List` in the `_arena`, but the backing array is allocated on the resource area: `Block_List() : Block_Array(Thread::current()->resource_area()), _cnt(0) {}`. This causes at least 4 and at most 5 worklists to be created and not freed until the compilation is finished. This patch allocates the worklist within `PhaseLive::compute`'s local resource mark.
> 
> Johan Sjölen has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision:
> 
>  - Merge remote-tracking branch 'origin/master' into dontleak-worklist
>  - Fix style and make worklist passed in as argument
>  - Don't leak the worklist in PhaseLive

Passes tier1 and tier2 in Mach5. Thank you for the reviews.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/13535#issuecomment-1517844898

From jsjolen at openjdk.org Fri Apr 21 13:39:57 2023
From: jsjolen at openjdk.org (Johan Sjölen)
Date: Fri, 21 Apr 2023 13:39:57 GMT
Subject: Integrated: 8306456: Don't leak _worklist's memory in PhaseLive::compute
In-Reply-To:
References:
Message-ID: <_5MgaxQzLWinb62PptAdfLdf6pHrGvqyS4370nG-0OM=.1c967ee2-13a8-44ae-aac0-3948c21f76c7@github.com>

On Wed, 19 Apr 2023 14:21:13 GMT, Johan Sjölen wrote:

> `PhaseLive::compute` used to do this: `_worklist = new (_arena) Block_List();`. This allocates the `Block_List` in the `_arena`, but the backing array is allocated on the resource area: `Block_List() : Block_Array(Thread::current()->resource_area()), _cnt(0) {}`. This causes at least 4 and at most 5 worklists to be created and not freed until the compilation is finished. This patch allocates the worklist within `PhaseLive::compute`'s local resource mark.

This pull request has now been integrated.

Changeset: 6e77e14f
Author: Johan Sjölen
URL: https://git.openjdk.org/jdk/commit/6e77e14fdbf4ab083020467cf2ecb8225f3dcbc7
Stats: 18 lines in 2 files changed: 2 ins; 3 del; 13 mod

8306456: Don't leak _worklist's memory in PhaseLive::compute

Reviewed-by: kvn, dlong

-------------

PR: https://git.openjdk.org/jdk/pull/13535

From vkempik at openjdk.org Fri Apr 21 13:50:57 2023
From: vkempik at openjdk.org (Vladimir Kempik)
Date: Fri, 21 Apr 2023 13:50:57 GMT
Subject: RFR: 8305056: Avoid unaligned access in emit_intX methods if it's unsupported [v7]
In-Reply-To:
References:
Message-ID:

> Please review this change, which attempts to eliminate unaligned memory stores generated by emit_int16/32/64 methods on some platforms.
> 
> The primary target is the RISC-V platform, but I had to change some code in ppc/arm32/x86 to prevent possible performance degradation.

Vladimir Kempik has updated the pull request with a new target base due to a merge or a rebase.
The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 10 additional commits since the last revision: - Merge - Rewrite few more helpers - Add APH's suggestions and remove some whitespace - Rework the fix to use memcpy in codeBuffer.hpp - Reduce code duplication - Fix 32-bit archs - Fix typo - change long to ulong in type convertion - Fix includes - 8305056: Avoid unaligned access in emit_intX methods if not enabled ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13227/files - new: https://git.openjdk.org/jdk/pull/13227/files/81861821..b221242a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13227&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13227&range=05-06 Stats: 275367 lines in 2564 files changed: 238622 ins; 22050 del; 14695 mod Patch: https://git.openjdk.org/jdk/pull/13227.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13227/head:pull/13227 PR: https://git.openjdk.org/jdk/pull/13227 From roland at openjdk.org Fri Apr 21 14:51:45 2023 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 21 Apr 2023 14:51:45 GMT Subject: RFR: 8306444: Don't leak memory in PhaseChaitin::PhaseChaitin [v3] In-Reply-To: References: Message-ID: <__b8goDWSbUtEcqilI4LhOx5CbbD54SwKfGMZQuFxPg=.442ed3ad-55ee-42f9-9c33-a1254295c99d@github.com> On Thu, 20 Apr 2023 13:00:42 GMT, Johan Sj?len wrote: >> Hi, >> >> First, `PhaseChaitin::PhaseChaitin` used to create 4 resource array of size `_cfg.number_of_blocks`: one to store all of the block pointers in (`_blks`), and three to do a sorting of the blocks in some order. The latter three weren't freed in the constructor, causing them to hang around for the entire duration of the phase. This is unnecessary, so this patch frees the arrays when we're done with them. It also allocates all of the resources arrays in one go. >> >> Second, it copied over each partially filled bucket into the `_blks`, one block at a time. 
This patch changes this so that we don't allocate the `_blks` resource array at all; instead, we simply squash all of the partially filled buckets into the first one using `::memmove`. >> >> I haven't done any micro benchmarking, but this should be faster and take less space. >> >> This is currently passing tier1. > > Johan Sjölen has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: > > - Merge remote-tracking branch 'origin/master' into opt-chaitin > - Apply Kozlov's comments > - Use nr_blocks in assert > - Merge loops > - Optimize PhaseChaitin Looks good to me. ------------- Marked as reviewed by roland (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13533#pullrequestreview-1395921994 From cslucas at openjdk.org Fri Apr 21 15:10:49 2023 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Fri, 21 Apr 2023 15:10:49 GMT Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v9] In-Reply-To: References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> <0oNkCfUBIR1hpPwN0i_ONwwyjd0AYux7GkLm-G1PdsU=.b3a5e7ff-e9bf-45b6-b996-691f86aa7057@github.com> Message-ID: <8AmU_ta4meiUmO99Em5bV7XLAV4H9fAcil519yh70fU=.1a28f4a9-a992-43a7-8c4a-d1cf96835963@github.com> On Thu, 20 Apr 2023 00:35:19 GMT, Vladimir Kozlov wrote: > Again got failures in the test on Aarch64 running with -XX:-UseTLAB: > > ``` > testCmpMergeWithNull(boolean,int,int): > - Failed comparison: [found] 0 = 2 [given] > testCmpMergeWithNull_Second(boolean,int,int) > - Failed comparison: [found] 0 = 1 [given] > testMergedAccessAfterCallNoWrite(boolean,int,int) > - Failed comparison: [found] 2 = 3 [given] > testMergedAccessAfterCallWithWrite(boolean,int,int) > - Failed comparison: [found] 2 = 3 [given] > 
testNestedObjectsArray(boolean,int,int) > - Failed comparison: [found] 2 = 4 [given] > ``` @vnkozlov - These failures are caused by an issue in the test framework's ALLOC regex: https://bugs.openjdk.org/browse/JDK-8306625. Since only the tests added in this PR fail because of that problem, do you think I should create a separate PR to fix the regex, or just include the fix in this PR? ------------- PR Comment: https://git.openjdk.org/jdk/pull/12897#issuecomment-1517977996 From qamai at openjdk.org Fri Apr 21 18:23:47 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Fri, 21 Apr 2023 18:23:47 GMT Subject: RFR: 8304948: [vectorapi] C2 crashes when expanding VectorBox [v2] In-Reply-To: References: Message-ID: On Thu, 20 Apr 2023 05:22:35 GMT, Eric Liu wrote: >> But that phi will have an incorrect input, because the return value of this call is used as an input of the transformed phi that uses this node? > >> But that phi will have an incorrect input, because the return value of this call is used as an input of the transformed phi that uses this node? > > the return value of this call is Phi1. Phi1 is used as an input of Phi2 which is used by Phi1 as well. > > The Phi cycle is not an incorrect shape, it's a normal case generated by some simple cases, e.g., I have a test case in this patch. > When expanding VectorBox node, the purpose is to traverse the first input of VectorBox to locate Proj, and replace Proj with some other nodes. The first input of VectorBox can be a graph, contains Phi (maybe Phi cycle) and Proj. > > The process finding and replacing Proj is not in local graph, it creates a new graph at the same time. Return this visited node here is used to maintain that cycle. Besides Proj, nodes in graph should not be changed. I still do not fully grasp it, though.
When `expand_vbox_helper` is called on `Phi1`, it creates `NewPhi1` and then calls `expand_vbox_helper` on `Phi2`, which creates `NewPhi2`. Then `expand_vbox_helper` is invoked again on `Phi1`, where it short-circuits and returns `Phi1`, which will be attached as an input of `NewPhi2`. At this step, shouldn't the second invocation on `Phi1` return `NewPhi1` instead? Thanks a lot. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13489#discussion_r1174052778 From kvn at openjdk.org Fri Apr 21 19:12:47 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 21 Apr 2023 19:12:47 GMT Subject: RFR: 8303466: C2: failed: malformed control flow. Limit type made precise with MaxL/MinL [v7] In-Reply-To: References: Message-ID: On Tue, 18 Apr 2023 10:13:01 GMT, Emanuel Peter wrote: >> **Context** >> >> During `PhaseIdealLoop::do_unroll`, we hack the loop-limit, and subtract `stride` from it. We have to prevent underflow on that subtract. Currently, we do this with a `CMoveI`. The problem with this: `CMoveI` is not smart enough to generate a precise type. For example, there are many cases where the input types get better, and underflow is not possible anymore. But the `CMoveI` does not detect this, and still has type `min_jint..hi`. >> >> We have the same issue in `PhaseIdealLoop::adjust_limit`, where we use `CMoveL` to implement long max/min. The types are not as precise as they could and should be. >> >> **Problem** >> >> The imprecise type is used for the zero-trip-guard. It does not fold to false, even though the data-path into the post loop does constant fold to `TOP`. The graph breaks, and assert `malformed control flow` triggers. >> >> Details: In these cases, we have the super-unrolled main-loop (SuperWord'ed, then further unrolled) directly leading to a vectorized post-loop. The effect is that there is no `region/phi` merging main-exit and main-zero-trip-guard. So the types are already more narrow here.
It may be possible that the values are such that we find out that we should never enter the vectorized post-loop. But if data finds out and control does not, we get a broken graph. >> Note: we have pre-loop. Then a main-loop and vectorized post loop. Then we merge the main-zero-trip-guard. And at the end we have the scalar post loop. >> >> I have already recently fixed a bug around this `CMoveI`. https://github.com/openjdk/jdk/commit/5a4945c0d95423d0ab07762c915e9cb4d3c66abb I would now like to have a more satisfactory fix, that properly propagates the types. >> >> **Solution** >> >> `PhaseIdealLoop::adjust_limit` already converts the limit from int to long, and does all computations in long, including taking max/min with a `CMoveL`. I now use the so far unused `MaxL/MinL`. I implemented some missing `Value/Identity` components for it. Since `MaxL/MinL` is not implemented in the backend, I just expand it in macro-expansion to a `CMoveL`. At that point the loop-opts are over, and it is most likely ok that we do not make the types more precise after this. >> >> I take the same approach for `PhaseIdealLoop::do_unroll`: convert limits to long, do subtraction in long, take `MinL/MaxL` to clamp it to the int-range (prevent subtraction underflow). >> >> **Discussion** >> >> This solution seems much cleaner to me, and I hope that we will see less bugs because of imprecise types in the limit computation, which were often due to the `CMove` not being smart enough to analyze all inputs (it would have to recognize a multitude of patterns, for the Cmp inputs and the direct inputs to the CMove - we currently do not do that, but just take the union of the input types - this is very inprecise). >> >> There is a bit of an overhead here: We use longs even though we only want to have int values. But I think we should prefer a clean implementation here, with correct type computation. The performance impact is probably non-existent on 64-bit machines anyway. 
>> >> **Caveat** >> >> I found some cases with the same assert `malformed control flow` that are most likely skeleton/assertion predicate bugs [JDK-8288981](https://bugs.openjdk.org/browse/JDK-8288981). Some of those cases were new patterns, for example where we PreMainPost a main loop. >> >> I hope that this fix here at least reduces the frequency of failures significantly. >> >> **Testing** >> >> I added 2 regression tests. Our fuzzer seems to spit out examples regularly, so that gives us extra coverage. >> >> Tested up to `tier5` and stress testing. Performance testing **running...** >> >> **Future Work** >> >> We should implement `MaxL/MinL` in the backend. We should also use them during parsing. This would also allow to `SuperWord` the instruction, on the platforms that support it. >> >> Should we add such an assert during IGVN? I think after IGVN, we should never have a `MultiBranchNode` that does not have the required number of outputs, right? We could add it to `VerifyIterativeGVN`. > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > Fixed some TOP cases Looks good. Just minor comments. src/hotspot/share/opto/addnode.cpp line 1285: > 1283: // "subtraction with underflow-protection" pattern. These are created during the > 1284: // unrolling, when we have to adjust the limit by subtracting the stride, but want > 1285: // to protect agains underflow: MaxL(SubL(limit, stride), min_jint). May add note that SubL node is replaced with AddL and reversed stride ( I assume that is what happened here). src/hotspot/share/opto/addnode.cpp line 1308: > 1306: // Max/MinL (n) > 1307: // > 1308: Node* fold_subI_no_underflow_pattern(Node* n, PhaseGVN* phase) { Move this method and it comment before `MaxLNode::add_ring` so all MaxL and MinL method stay together. ------------- Marked as reviewed by kvn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/13269#pullrequestreview-1396281399 PR Review Comment: https://git.openjdk.org/jdk/pull/13269#discussion_r1174086806 PR Review Comment: https://git.openjdk.org/jdk/pull/13269#discussion_r1174088066 From kvn at openjdk.org Fri Apr 21 19:21:42 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 21 Apr 2023 19:21:42 GMT Subject: RFR: 8306636: Disable compiler/c2/Test6905845.java with -XX:TieredStopAtLevel=3 In-Reply-To: <3vWrB1NyJ0jObav66ZyBcnd41zjYERxNfGEOgGkJ9jw=.649c81d0-7d21-4c2a-a13a-f1c7ccf3d177@github.com> References: <3vWrB1NyJ0jObav66ZyBcnd41zjYERxNfGEOgGkJ9jw=.649c81d0-7d21-4c2a-a13a-f1c7ccf3d177@github.com> Message-ID: <93EOGaLLhTNTFFD1wJ5k_ds1FlWL35IuZda3o2jot4w=.8e5eb160-6a27-462e-8809-d55d239d9ade@github.com> On Fri, 21 Apr 2023 13:11:41 GMT, Ramkumar Sunderbabu wrote: > Disable the c2 test for TieredStopAtLevel=3 Good and trivial. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13579#pullrequestreview-1396291968 From kvn at openjdk.org Fri Apr 21 19:26:48 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 21 Apr 2023 19:26:48 GMT Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v9] In-Reply-To: <8AmU_ta4meiUmO99Em5bV7XLAV4H9fAcil519yh70fU=.1a28f4a9-a992-43a7-8c4a-d1cf96835963@github.com> References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> <0oNkCfUBIR1hpPwN0i_ONwwyjd0AYux7GkLm-G1PdsU=.b3a5e7ff-e9bf-45b6-b996-691f86aa7057@github.com> <8AmU_ta4meiUmO99Em5bV7XLAV4H9fAcil519yh70fU=.1a28f4a9-a992-43a7-8c4a-d1cf96835963@github.com> Message-ID: On Fri, 21 Apr 2023 15:08:21 GMT, Cesar Soares Lucas wrote: > Since only the tests added in this PR are failing due to that problem do you think I should create a separate PR to fix the Regex or just include the fix in this PR? Create separate PR and fix it first. 
This PR still needs review from @iwanowww, and it may take time to address additional comments. ------------- PR Comment: https://git.openjdk.org/jdk/pull/12897#issuecomment-1518247727 From dcubed at openjdk.org Fri Apr 21 21:54:39 2023 From: dcubed at openjdk.org (Daniel D. Daugherty) Date: Fri, 21 Apr 2023 21:54:39 GMT Subject: RFR: 8301377: adjust timeout for JLI GetObjectSizeIntrinsicsTest.java subtest again Message-ID: Trivial fixes to increase timeouts for tests that time out under heavy stress: [JDK-8301377](https://bugs.openjdk.org/browse/JDK-8301377) adjust timeout for JLI GetObjectSizeIntrinsicsTest.java subtest again [JDK-8305502](https://bugs.openjdk.org/browse/JDK-8305502) adjust timeouts in three more M&M tests [JDK-8302607](https://bugs.openjdk.org/browse/JDK-8302607) increase timeout for ContinuousCallSiteTargetChange.java ------------- Commit messages: - 8302607: adjust timeout for compiler/jsr292/ContinuousCallSiteTargetChange.java - 8305502: adjust timeouts in three more M&M tests - 8301377: adjust timeout for JLI GetObjectSizeIntrinsicsTest.java subtest again Changes: https://git.openjdk.org/jdk/pull/13593/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13593&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8301377 Stats: 9 lines in 5 files changed: 0 ins; 0 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/13593.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13593/head:pull/13593 PR: https://git.openjdk.org/jdk/pull/13593 From xliu at openjdk.org Fri Apr 21 22:02:50 2023 From: xliu at openjdk.org (Xin Liu) Date: Fri, 21 Apr 2023 22:02:50 GMT Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v5] In-Reply-To: References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> Message-ID: On Fri, 14 Apr 2023 20:50:03 GMT, Cesar Soares Lucas wrote: >> src/hotspot/share/opto/escape.cpp line 457: >> >>> 455: found_sr_allocate = 
true; >>> 456: } else { >>> 457: ptn->set_scalar_replaceable(false); >> >> This member function is const. Do we really need to change ptn's property here? >> >> My reading is ophi is profitable as long as we spot any input object which can be eliminated. how about you just return at line 455? > > This is actually necessary here. By setting the input to NSR I don't need to later, when performing reduction, check that I can eliminate the node. I can just check that I can scalar replace the input. If I removed this line I'd hit a problem if the merge had an input that is SR but that ME can't eliminate. Okay, I see what you mean. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1174188560 From naoto at openjdk.org Fri Apr 21 22:19:46 2023 From: naoto at openjdk.org (Naoto Sato) Date: Fri, 21 Apr 2023 22:19:46 GMT Subject: RFR: 8301377: adjust timeout for JLI GetObjectSizeIntrinsicsTest.java subtest again In-Reply-To: References: Message-ID: <1XddcAzXBaAyIg0SS6cRwCw_e4dZv_ydPfBOn7jAnQg=.52261e09-7238-468e-9319-a81d72954cce@github.com> On Fri, 21 Apr 2023 21:35:07 GMT, Daniel D. Daugherty wrote: > Trivial fixes to increase timeouts for tests that timeout under heavy stress: > [JDK-8301377](https://bugs.openjdk.org/browse/JDK-8301377) adjust timeout for JLI GetObjectSizeIntrinsicsTest.java subtest again > [JDK-8305502](https://bugs.openjdk.org/browse/JDK-8305502) adjust timeouts in three more M&M tests > [JDK-8302607](https://bugs.openjdk.org/browse/JDK-8302607) increase timeout for ContinuousCallSiteTargetChange.java Marked as reviewed by naoto (Reviewer).
------------- PR Review: https://git.openjdk.org/jdk/pull/13593#pullrequestreview-1396444259 From lmesnik at openjdk.org Fri Apr 21 22:48:43 2023 From: lmesnik at openjdk.org (Leonid Mesnik) Date: Fri, 21 Apr 2023 22:48:43 GMT Subject: RFR: 8301377: adjust timeout for JLI GetObjectSizeIntrinsicsTest.java subtest again In-Reply-To: References: Message-ID: On Fri, 21 Apr 2023 21:35:07 GMT, Daniel D. Daugherty wrote: > Trivial fixes to increase timeouts for tests that timeout under heavy stress: > [JDK-8301377](https://bugs.openjdk.org/browse/JDK-8301377) adjust timeout for JLI GetObjectSizeIntrinsicsTest.java subtest again > [JDK-8305502](https://bugs.openjdk.org/browse/JDK-8305502) adjust timeouts in three more M&M tests > [JDK-8302607](https://bugs.openjdk.org/browse/JDK-8302607) increase timeout for ContinuousCallSiteTargetChange.java Marked as reviewed by lmesnik (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/13593#pullrequestreview-1396464773 From dcubed at openjdk.org Fri Apr 21 22:53:43 2023 From: dcubed at openjdk.org (Daniel D. Daugherty) Date: Fri, 21 Apr 2023 22:53:43 GMT Subject: RFR: 8301377: adjust timeout for JLI GetObjectSizeIntrinsicsTest.java subtest again In-Reply-To: References: Message-ID: On Fri, 21 Apr 2023 21:35:07 GMT, Daniel D. Daugherty wrote: > Trivial fixes to increase timeouts for tests that timeout under heavy stress: > [JDK-8301377](https://bugs.openjdk.org/browse/JDK-8301377) adjust timeout for JLI GetObjectSizeIntrinsicsTest.java subtest again > [JDK-8305502](https://bugs.openjdk.org/browse/JDK-8305502) adjust timeouts in three more M&M tests > [JDK-8302607](https://bugs.openjdk.org/browse/JDK-8302607) increase timeout for ContinuousCallSiteTargetChange.java I forgot to include testing info: - 8301377 has been tested in jdk-20+34 and jdk-21+{9,1[013-9]} stress testing. - 8302607 has been reworked into increasing the timeout and has been tested in jdk-21+1[89] stress testing. 
- 8305502 has been tested in jdk-21+1[7-9] stress testing. The jdk-21+19 stress run will complete late on Sunday, 2023.04.23. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13593#issuecomment-1518402540 From dcubed at openjdk.org Fri Apr 21 22:53:44 2023 From: dcubed at openjdk.org (Daniel D. Daugherty) Date: Fri, 21 Apr 2023 22:53:44 GMT Subject: RFR: 8301377: adjust timeout for JLI GetObjectSizeIntrinsicsTest.java subtest again In-Reply-To: <1XddcAzXBaAyIg0SS6cRwCw_e4dZv_ydPfBOn7jAnQg=.52261e09-7238-468e-9319-a81d72954cce@github.com> References: <1XddcAzXBaAyIg0SS6cRwCw_e4dZv_ydPfBOn7jAnQg=.52261e09-7238-468e-9319-a81d72954cce@github.com> Message-ID: On Fri, 21 Apr 2023 22:16:32 GMT, Naoto Sato wrote: >> Trivial fixes to increase timeouts for tests that timeout under heavy stress: >> [JDK-8301377](https://bugs.openjdk.org/browse/JDK-8301377) adjust timeout for JLI GetObjectSizeIntrinsicsTest.java subtest again >> [JDK-8305502](https://bugs.openjdk.org/browse/JDK-8305502) adjust timeouts in three more M&M tests >> [JDK-8302607](https://bugs.openjdk.org/browse/JDK-8302607) increase timeout for ContinuousCallSiteTargetChange.java > > Marked as reviewed by naoto (Reviewer). @naotoj and @lmesnik - Thanks for the reviews! This PR will likely integrate on Monday (2023.04.24). ------------- PR Comment: https://git.openjdk.org/jdk/pull/13593#issuecomment-1518403740 From vlivanov at openjdk.org Sat Apr 22 02:00:56 2023 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Sat, 22 Apr 2023 02:00:56 GMT Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v10] In-Reply-To: References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> Message-ID: On Thu, 20 Apr 2023 19:27:58 GMT, Cesar Soares Lucas wrote: >> Can I please get reviews for this PR? >> >> The most common and frequent use of NonEscaping Phis merging object allocations is for debugging information. 
The two graphs below show numbers for Renaissance and DaCapo benchmarks - similar results are obtained for all other applications that I tested. >> >> With what frequency does each IR node type occurs as an allocation merge user? I.e., if the same node type uses a Phi N times the counter is incremented by N: >> >> ![image](https://user-images.githubusercontent.com/2249648/222280517-4dcf5871-2564-4207-b49e-22aee47fa49d.png) >> >> What are the most common users of allocation merges? I.e., if the same node type uses a Phi N times the counter is incremented by 1: >> >> ![image](https://user-images.githubusercontent.com/2249648/222280608-ca742a4e-1622-4e69-a778-e4db6805ea02.png) >> >> This PR adds support scalar replacing allocations participating in merges used as debug information OR as a base for field loads. I plan to create subsequent PRs to enable scalar replacement of merges used by other node types (CmpP is next on the list) subsequently. >> >> The approach I used for _rematerialization_ is pretty straightforward. It consists basically of the following. 1) New IR node (suggested by V. Kozlov), named SafePointScalarMergeNode, to represent a set of SafePointScalarObjectNode; 2) Each scalar replaceable input participating in a merge will get a SafePointScalarObjectNode like if it weren't part of a merge. 3) Add a new Class to support the rematerialization of SR objects that are part of a merge; 4) Patch HotSpot to be able to serialize and deserialize debug information related to allocation merges; 5) Patch C2 to generate unique types for SR objects participating in some allocation merges. >> >> The approach I used for _enabling the scalar replacement of some of the inputs of the allocation merge_ is also pretty straightforward: call `MemNode::split_through_phi` to, well, split AddP->Load* through the merge which will render the Phi useless. >> >> I tested this with JTREG tests tier 1-4 (Windows, Linux, and Mac) and didn't see regression. 
I also experimented with several applications and didn't see any failure. I also ran tests with "-ea -esa -Xbatch -Xcomp -XX:+UnlockExperimentalVMOptions -XX:-TieredCompilation -server -XX:+IgnoreUnrecognizedVMOptions -XX:+UnlockDiagnosticVMOptions -XX:+StressLCM -XX:+StressGCM -XX:+StressCCP" and didn't observe any related failures. > > Cesar Soares Lucas has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 10 commits: > > - Catching up with master > > Merge remote-tracking branch 'origin/master' into rematerialization-of-merges > - Fix tests. Remember previous reducible Phis. > - Address PR review 3. Some comments and be able to abort compilation. > - Merge with Master > - Addressing PR review 2: refactor & reuse MacroExpand::scalar_replacement method. > - Address PR feeedback 1: make ObjectMergeValue subclass of ObjectValue & create new IR class to represent scalarized merges. > - Add support for SR'ing some inputs of merges used for field loads > - Fix some typos and do some small refactorings. > - Merge master > - Add support for rematerializing scalar replaced objects participating in allocation merges Nice work, Cesar! I like how the patch is shaping up now. I'm not done with the review yet, but decided to share the comments I have so far. src/hotspot/share/code/debugInfo.cpp line 232: > 230: // If we call select again on the same merge we should return the same result > 231: if (_selected != nullptr) { > 232: return _selected; I'm not sure I understand how it is intended to work. The code below initializes `_selected`, but returns `nullptr` when `selector >= 0`. Subsequent calls will return a non-null value.
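To make the concern in the review comment above concrete, here is a minimal sketch of a memoized `select()` in which the first call and every later call return the same pointer. `ObjSketch` and `MergeSketch` are hypothetical stand-ins, not the HotSpot `ObjectValue`/`ObjectMergeValue` classes; the sketch only models the `selector >= 0` case, and the real `select()` in the patch may intentionally behave differently.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical stand-in for a scalar-replaced object description.
struct ObjSketch {
  int id;
};

// Hypothetical stand-in for a merge of several candidate objects.
class MergeSketch {
 public:
  explicit MergeSketch(std::vector<ObjSketch*> candidates)
      : _candidates(std::move(candidates)) {}

  // Picks the candidate at index `selector` on the first call and caches it.
  // The first call returns the freshly cached pointer (not nullptr), so
  // repeated calls on the same merge always agree with the first one.
  ObjSketch* select(int selector) {
    if (_selected == nullptr) {
      assert(selector >= 0 &&
             selector < static_cast<int>(_candidates.size()));
      _selected = _candidates[selector];
    }
    return _selected;
  }

 private:
  std::vector<ObjSketch*> _candidates;
  ObjSketch* _selected = nullptr;
};
```

With this shape a caller never has to distinguish "first call" from "later call"; whether that matches the intent of the code under review is exactly what the comment asks.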
src/hotspot/share/code/debugInfo.cpp line 257: > 255: } else { > 256: assert(selector < _possible_objects.length(), "sanity"); > 257: _selected = (ObjectValue*) _possible_objects.at(selector); Any particular reason to reuse `ObjectValue` from `_possible_objects` instead of allocating a fresh one (as you do on the `selector == -1` branch)? I'd prefer `ObjectMergeValue::select()` to always allocate a fresh `ObjectValue` when converting `ObjectMergeValue` + `ObjectMergeCandidateValue` into `ObjectValue`. src/hotspot/share/code/debugInfo.hpp line 199: > 197: // ObjectValue describing an object that was scalar replaced. > 198: > 199: class ObjectMergeValue: public ObjectValue { I find the decision to subclass `ObjectValue` confusing and error-prone: now `is_object()` returns true for `ObjectMergeValue`, but you have to apply the selector first to turn it into `ObjectValue`. And now the order of checks matters, so you always have to perform `is_object_merge()` first and then follow it with an `is_object()` guard. You have 3 flavors of `ObjectValue` now: * good old `ObjectValue`; * `ObjectMergeValue`; * merge candidates (`ObjectMergeCandidateValue`?) Does it make sense to introduce 3 different subclasses under `ObjectValue` to clearly distinguish the scenarios? src/hotspot/share/code/debugInfo.hpp line 242: > 240: bool is_cached() const { return _cached; } > 241: void set_cached(bool cached) { _cached = cached; } > 242: AutoBoxObjectValue(int id, ScopeValue* klass, bool only_merge_candidate = false) : ObjectValue(id, klass, only_merge_candidate), _cached(false) { } Any particular reason to allow `AutoBoxObjectValue` to be a merge candidate? src/hotspot/share/opto/escape.hpp line 593: > 591: // Methods related to Reduce Allocation Merges > 592: > 593: bool can_reduce_this_phi(PhiNode* ophi) const; On naming: IMO referring to "this" doesn't help, but adds noise. If you drop it ("can_reduce_this_phi" => "can_reduce_phi"), it's still clear what the method does.
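The ordering hazard raised above for `debugInfo.hpp` line 199 can be illustrated with a small sketch. All type names here (`ScopeValueSketch`, `MergeAsSubclass`, `MergeAsSibling`, `classify`) are hypothetical and are not the HotSpot declarations; the sibling layout is just one possible way to make the checks order-independent and is not necessarily the three-subclass split the reviewer has in mind.

```cpp
#include <cassert>

// Hypothetical base with the two type tests discussed in the review.
struct ScopeValueSketch {
  virtual ~ScopeValueSketch() = default;
  virtual bool is_object() const { return false; }
  virtual bool is_object_merge() const { return false; }
};

// Plain scalar-replaced object.
struct ObjectValueSketch : ScopeValueSketch {
  bool is_object() const override { return true; }
};

// Shape described in the review: the merge IS-A object, so is_object()
// is also true and callers must test is_object_merge() first.
struct MergeAsSubclass : ObjectValueSketch {
  bool is_object_merge() const override { return true; }
};

// Alternative shape (an assumption for illustration): the merge is a
// sibling, so is_object() stays false and the order of checks is irrelevant.
struct MergeAsSibling : ScopeValueSketch {
  bool is_object_merge() const override { return true; }
};

enum class Kind { Object, Merge, Other };

// A caller that happens to test is_object() before is_object_merge().
inline Kind classify(const ScopeValueSketch& v) {
  if (v.is_object()) return Kind::Object;
  if (v.is_object_merge()) return Kind::Merge;
  return Kind::Other;
}
```

In this sketch `classify` reports a `MergeAsSubclass` as a plain object because of the check order, while a `MergeAsSibling` is reported as a merge regardless of which test comes first.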
src/java.base/share/classes/java/security/AccessController.java line 786: > 784: // allocation merge Phi leading to it) might become NonEscaping and get > 785: // scalar replaced. The call below enforces 'result' to always escape. > 786: ensureMaterializedForStackWalk(result); Why don't you add the same call in the other `executePrivileged` overload? It has the very same code shape. ------------- PR Review: https://git.openjdk.org/jdk/pull/12897#pullrequestreview-1396497913 PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1174242946 PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1174249820 PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1174248472 PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1174250881 PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1174248735 PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1174235850 From eliu at openjdk.org Sat Apr 22 09:41:45 2023 From: eliu at openjdk.org (Eric Liu) Date: Sat, 22 Apr 2023 09:41:45 GMT Subject: RFR: 8304948: [vectorapi] C2 crashes when expanding VectorBox [v2] In-Reply-To: References: Message-ID: On Fri, 21 Apr 2023 18:20:27 GMT, Quan Anh Mai wrote: >>> But that phi will have an incorrect input, because the return value of this call is used as an input of the transformed phi that uses this node? >> >> the return value of this call is Phi1. Phi1 is used as an input of Phi2 which is used by Phi1 as well. >> >> The Phi cycle is not an incorrect shape, it's a normal case generated by some simple cases, e.g., I have a test case in this patch. >> When expanding VectorBox node, the purpose is to traverse the first input of VectorBox to locate Proj, and replace Proj with some other nodes. The first input of VectorBox can be a graph, contains Phi (maybe Phi cycle) and Proj. >> >> The process finding and replacing Proj is not in local graph, it creates a new graph at the same time. 
Return this visited node here is used to maintain that cycle. Besides Proj, nodes in graph should not be changed. > > I do still not fully grasp it though. When `expand_vbox_helper` is called on `Phi1`, it creates `NewPhi1`, then it calls `expand_vbox_helper` on `Phi2`, a `NewPhi2` is created, then `expand_vbox_helper` is invoked again on `Phi1`, which it shortcircuits to return `Phi1`, which will be attached as an input of `NewPhi2`, at this step should the second invocation on `Phi1` returns `NewPhi1` instead? > > Thanks a lot. Yes, I think it should return `NewPhi1` instead. In my test case, `NewPhi1` and `NewPhi2` are idealized to `Phi1` and `Phi2`, so it does not matter whether it returns the new one. But I'm not sure they are guaranteed to be idealized to the old ones. Anyway, returning the new one is more reasonable. I will fix that and test it. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13489#discussion_r1174351867 From aph at openjdk.org Sat Apr 22 10:21:49 2023 From: aph at openjdk.org (Andrew Haley) Date: Sat, 22 Apr 2023 10:21:49 GMT Subject: RFR: 8305056: Avoid unaligned access in emit_intX methods if it's unsupported [v7] In-Reply-To: References: Message-ID: On Fri, 21 Apr 2023 13:50:57 GMT, Vladimir Kempik wrote: >> Please review this change which attempts to eliminate unaligned memory stores generated by emit_int16/32/64 methods on some platforms. >> >> Primary aim is risc-v platform. But I had to change some code in ppc/arm32/x86 to prevent possible perf degradation. > > Vladimir Kempik has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains 10 additional commits since the last revision: > > - Merge > - Rewrite few more helpers > - Add APH's suggestions and remove some whitespace > - Rework the fix to use memcpy in codeBuffer.hpp > - Reduce code duplication > - Fix 32-bit archs > - Fix typo > - change long to ulong in type convertion > - Fix includes > - 8305056: Avoid unaligned access in emit_intX methods if not enabled This looks good. I'm sure GCC and LLVM will be fine, but it might be worth checking that Windows doesn't generate awful code for `put_native()`. If you don't have a Microsoft box, ping George.Adams at microsoft.com and Bruno Borges ? ------------- PR Review: https://git.openjdk.org/jdk/pull/13227#pullrequestreview-1396664977 From vkempik at openjdk.org Sat Apr 22 15:56:48 2023 From: vkempik at openjdk.org (Vladimir Kempik) Date: Sat, 22 Apr 2023 15:56:48 GMT Subject: RFR: 8305056: Avoid unaligned access in emit_intX methods if it's unsupported [v7] In-Reply-To: References: Message-ID: On Fri, 21 Apr 2023 13:50:57 GMT, Vladimir Kempik wrote: >> Please review this change which attempts to eliminate unaligned memory stores generated by emit_int16/32/64 methods on some platforms. >> >> Primary aim is risc-v platform. But I had to change some code in ppc/arm32/x86 to prevent possible perf degradation. > > Vladimir Kempik has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 10 additional commits since the last revision: > > - Merge > - Rewrite few more helpers > - Add APH's suggestions and remove some whitespace > - Rework the fix to use memcpy in codeBuffer.hpp > - Reduce code duplication > - Fix 32-bit archs > - Fix typo > - change long to ulong in type convertion > - Fix includes > - 8305056: Avoid unaligned access in emit_intX methods if not enabled I have yet to check the win64 jvm.dll.
So far, godbolt shows me this for msvc 19 -O2:
void put_native(void *,unsigned short) PROC ; put_native, COMDAT
        mov WORD PTR [rcx], dx
        ret 0
void put_native(void *,unsigned short) ENDP ; put_native
------------- PR Comment: https://git.openjdk.org/jdk/pull/13227#issuecomment-1518691168 From duke at openjdk.org Sun Apr 23 04:20:53 2023 From: duke at openjdk.org (Chang Peng) Date: Sun, 23 Apr 2023 04:20:53 GMT Subject: RFR: 8301739: AArch64: Add optimized rules for vector compare with immediate for SVE [v3] In-Reply-To: References: Message-ID: On Thu, 20 Apr 2023 02:25:45 GMT, Chang Peng wrote: >> We can use SVE compare-with-integer-immediate instructions like cmpgt(immediate)[1] to avoid the extra scalar2vector operations. >> >> The following instruction sequence >> >> >> movi v17.16b, #12 >> cmpgt p0.b, p7/z, z16.b, z17.b >> >> >> can be optimized to: >> >> >> cmpgt p0.b, p7/z, z16.b, #12 >> >> >> This patch does the following: >> 1. Add SVE compare-with-7bit-unsigned-immediate instructions to C2's backend. >> SVE cmp(immediate) instructions can support vector comparing with 7bit unsigned integer immediate (range from 0 to >> 127)or 5bit signed integer immediate (range from -16 to 15). >> >> 2. Add optimized match rules to generate the compare-with-immediate instructions. >> >> [1]: https://developer.arm.com/documentation/ddi0596/2021-12/SVE-Instructions/CMP-cc---immediate---Compare-vector-to-immediate- > > Chang Peng has updated the pull request incrementally with one additional commit since the last revision: > > Refactor some code src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 3656: > 3654: ins_pipe(pipe_slow); > 3655: %}')dnl > 3656: VMASKCMP_SVE_IMM_I(immI5, cmp) For the first macro, imm is ConINode. src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 3659: > 3657: VMASKCMP_SVE_IMM_L(immL5, cmp) > 3658: VMASKCMP_SVE_IMM_I(immIU7, cmpU) > 3659: VMASKCMP_SVE_IMM_L(immLU7, cmpU) And ConLNode for the second macro.
-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/13200#discussion_r1174505089
PR Review Comment: https://git.openjdk.org/jdk/pull/13200#discussion_r1174505178

From duke at openjdk.org Sun Apr 23 04:23:55 2023
From: duke at openjdk.org (Chang Peng)
Date: Sun, 23 Apr 2023 04:23:55 GMT
Subject: RFR: 8301739: AArch64: Add optimized rules for vector compare with immediate for SVE [v3]
In-Reply-To: References: Message-ID:

On Thu, 20 Apr 2023 02:25:45 GMT, Chang Peng wrote:

>> We can use SVE compare-with-integer-immediate instructions like cmpgt(immediate)[1] to avoid the extra scalar2vector operations.
>>
>> The following instruction sequence
>>
>>     movi v17.16b, #12
>>     cmpgt p0.b, p7/z, z16.b, z17.b
>>
>> can be optimized to:
>>
>>     cmpgt p0.b, p7/z, z16.b, #12
>>
>> This patch does the following:
>> 1. Add SVE compare-with-7bit-unsigned-immediate instructions to C2's backend.
>> SVE cmp(immediate) instructions can support vector comparing with 7bit unsigned integer immediate (range from 0 to
>> 127) or 5bit signed integer immediate (range from -16 to 15).
>>
>> 2. Add optimized match rules to generate the compare-with-immediate instructions.
>>
>> [1]: https://developer.arm.com/documentation/ddi0596/2021-12/SVE-Instructions/CMP-cc---immediate---Compare-vector-to-immediate-
>
> Chang Peng has updated the pull request incrementally with one additional commit since the last revision:
>
>   Refactor some code

src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 3625:

> 3623: instruct vmask$2_immI_sve(pReg dst, vReg src, $1 imm, immI_$2_cond cond, rFlagsReg cr) %{
> 3624:   predicate(UseSVE > 0);
> 3625:   match(Set dst (VectorMaskCmp (Binary src (ReplicateB imm)) cond));

@theRealAph The ReplicateXNodes used in the match rules are also different in these two macros. I don't think we need to merge the two macros, since that would introduce some if-else statements and reduce readability.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13200#discussion_r1174505387 From duke at openjdk.org Sun Apr 23 05:14:56 2023 From: duke at openjdk.org (duke) Date: Sun, 23 Apr 2023 05:14:56 GMT Subject: Withdrawn: 8302267: [jittester] Improve separation of test generation and execution In-Reply-To: References: Message-ID: On Mon, 13 Feb 2023 09:55:52 GMT, Evgeny Nikitin wrote: > Please review a set of improvements that should improve working with other fuzzing generators and usage of JitTesterDriver with tests generated not by the JITTester: > > - Provide better separation of individual test generation from java file writing, compiling, executing, etc.; > - Introduce distinct Phases of the generation process (Generation, Compilation, GoldRun and VerificationRun); > - Extract JItTesterDriver headers generation so that it would be possible to provide other header generators; > - Introduce error tolerance to not get distracted by OOMEs, intrinsics missing in the compiled code, etc.; > - Make it possible to specify time limit for an individual test generation; > - Give better control over temp/workdir creation and cleaning; > - Unify external process running; > - Introduce UTF-8 support in external processes' output and human-readable escaping of it; This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/12527 From qamai at openjdk.org Sun Apr 23 18:29:36 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Sun, 23 Apr 2023 18:29:36 GMT Subject: RFR: 8306706: Support out-of-line code generation for MachNodes Message-ID: <3haQdXHxlUHKAqi4MNWaVz3gVcFB9M8A20tGPQIok3c=.940d6d13-9764-449a-a9e1-36247f08b68e@github.com> Hi, This patch adds supports for MachNodes to emit an out-of-line piece of code in the stub section of the compiled method. This allows the separation of the uncommon path from the common one, which speeds up the common path a little bit and increases compiled code density. 
Please take a look and leave reviews. Thanks a lot. ------------- Commit messages: - move opnd_array to private - copyright - prototype Changes: https://git.openjdk.org/jdk/pull/13602/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13602&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8306706 Stats: 232 lines in 10 files changed: 175 ins; 16 del; 41 mod Patch: https://git.openjdk.org/jdk/pull/13602.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13602/head:pull/13602 PR: https://git.openjdk.org/jdk/pull/13602 From qamai at openjdk.org Sun Apr 23 18:49:42 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Sun, 23 Apr 2023 18:49:42 GMT Subject: RFR: 8306706: Support out-of-line code generation for MachNodes In-Reply-To: <3haQdXHxlUHKAqi4MNWaVz3gVcFB9M8A20tGPQIok3c=.940d6d13-9764-449a-a9e1-36247f08b68e@github.com> References: <3haQdXHxlUHKAqi4MNWaVz3gVcFB9M8A20tGPQIok3c=.940d6d13-9764-449a-a9e1-36247f08b68e@github.com> Message-ID: On Sun, 23 Apr 2023 18:22:35 GMT, Quan Anh Mai wrote: > Hi, > > This patch adds supports for MachNodes to emit an out-of-line piece of code in the stub section of the compiled method. This allows the separation of the uncommon path from the common one, which speeds up the common path a little bit and increases compiled code density. Please take a look and leave reviews. > > Thanks a lot. 
With this patch, the compiled code for a float-to-int conversion changes as follows.

Before:

    vcvttss2si %xmm1,%eax
    cmp $0x80000000,%eax
    jne DONE
    sub $0x8,%rsp
    vmovss %xmm1,(%rsp)
    call Stub::f2i_fixup ; {runtime_call StubRoutines (initial stubs)}
    pop %rax
    DONE:

After:

    vcvttss2si %xmm1,%eax
    cmp $0x80000000,%eax
    je STUB
    CONTINUE:

    STUB:
    sub $0x8,%rsp
    vmovss %xmm1,(%rsp)
    call Stub::f2i_fixup ; {runtime_call StubRoutines (initial stubs)}
    pop %rax
    jmp CONTINUE

Microbenchmarks show slight improvements. Although the results differ from run to run, the patched version seems to be generally more performant:

                                           Before               After
    Benchmark             Mode  Cnt    Score     Error      Score     Error  Units   Change
    ConvertF2I.d2iArray   avgt    5  266.890 ±   3.277    260.720 ±   1.382  ns/op   -2.31%
    ConvertF2I.d2iSingle  avgt    5    0.378 ±   0.005      0.317 ±   0.013  ns/op  -16.14%
    ConvertF2I.d2lArray   avgt    5  273.999 ±  12.571    267.862 ±   4.806  ns/op   -2.24%
    ConvertF2I.d2lSingle  avgt    5    0.379 ±   0.005      0.348 ±   0.044  ns/op   -8.18%
    ConvertF2I.f2iArray   avgt    5  261.549 ±   1.391    255.522 ±  15.133  ns/op   -2.30%
    ConvertF2I.f2iSingle  avgt    5    0.378 ±   0.005      0.311 ±   0.007  ns/op  -17.72%
    ConvertF2I.f2lArray   avgt    5  272.745 ±   1.661    267.770 ±   7.033  ns/op   -1.82%
    ConvertF2I.f2lSingle  avgt    5    0.379 ±   0.007      0.350 ±   0.022  ns/op   -7.65%

-------------

PR Comment: https://git.openjdk.org/jdk/pull/13602#issuecomment-1519130423

From qamai at openjdk.org Sun Apr 23 18:56:42 2023
From: qamai at openjdk.org (Quan Anh Mai)
Date: Sun, 23 Apr 2023 18:56:42 GMT
Subject: RFR: 8306706: Support out-of-line code generation for MachNodes
In-Reply-To: <3haQdXHxlUHKAqi4MNWaVz3gVcFB9M8A20tGPQIok3c=.940d6d13-9764-449a-a9e1-36247f08b68e@github.com>
References: <3haQdXHxlUHKAqi4MNWaVz3gVcFB9M8A20tGPQIok3c=.940d6d13-9764-449a-a9e1-36247f08b68e@github.com>
Message-ID:

On Sun, 23 Apr 2023 18:22:35 GMT, Quan Anh Mai wrote:

> Hi,
>
> This patch adds supports for MachNodes to emit an out-of-line piece of code in the stub section of the compiled method.
> This allows the separation of the uncommon path from the common one, which speeds up the common path a little bit and increases compiled code density. Please take a look and leave reviews.
>
> Thanks a lot.

The generated node for the stub looks like this:

    class convF2I_reg_regStub : public C2CodeStub {
    private:
      const convF2I_reg_regNode* _node;
      PhaseRegAlloc* ra_;
      MachOper* opnd_array(uint index) const { return _node->opnd_array(index); }
    public:
      convF2I_reg_regStub(const convF2I_reg_regNode* node, PhaseRegAlloc* ra) : _node(node), ra_(ra) {}
      int max_size() const { return 23; }
      void emit(C2_MacroAssembler& masm);
    };

And the corresponding node's `emit` method has an additional section:

    void convF2I_reg_regNode::emit(CodeBuffer& cbuf, PhaseRegAlloc* ra_) const {
      cbuf.set_insts_mark();
      convF2I_reg_regStub* stub = new (Compile::current()->comp_arena()) convF2I_reg_regStub(this, ra_);
      if (!Compile::current()->output()->in_scratch_emit_size()) {
        Compile::current()->output()->add_stub(stub);
      }
      ...
    }

-------------

PR Comment: https://git.openjdk.org/jdk/pull/13602#issuecomment-1519131633

From qamai at openjdk.org Sun Apr 23 19:58:50 2023
From: qamai at openjdk.org (Quan Anh Mai)
Date: Sun, 23 Apr 2023 19:58:50 GMT
Subject: RFR: 8304676: [vectorapi] x86_32: Crash in Assembler::kmovql(Address, KRegister)
Message-ID:

Hi,

Can I have reviews for this patch, which fixes crashes during PrintOptoAssembly of KRegister spilling code? The reason is that we are missing a check for a nullptr cbuf.

Thanks a lot.
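To illustrate the class of bug described here, the following is a minimal, self-contained sketch of a routine that serves both code emission and assembly printing. It is not the code from the patch: `SketchCodeBuffer`, `spill_mask_sketch`, the placeholder byte and the printed text are all hypothetical. It assumes only what the message states, namely that the printing path runs with a null `cbuf` and therefore must not try to emit.

```cpp
#include <cstdint>
#include <ostream>
#include <sstream>
#include <vector>

// Hypothetical stand-in for a code buffer; not HotSpot's CodeBuffer.
struct SketchCodeBuffer {
  std::vector<uint8_t> bytes;
};

// One routine, two callers: emission (cbuf != nullptr) and
// printing (cbuf == nullptr, text goes to `out`).
inline void spill_mask_sketch(SketchCodeBuffer* cbuf, int stack_offset,
                              std::ostream& out) {
  if (cbuf != nullptr) {
    // Emission path: only touch the buffer when one was supplied.
    cbuf->bytes.push_back(0x90);  // placeholder byte, not a real encoding
  } else {
    // Print-only path: describe the instruction without assembling it.
    out << "kmov [rsp + " << stack_offset << "], k1";
  }
}
```

Without the `cbuf != nullptr` guard, the print-only call would dereference a null pointer, which is the kind of crash the subject line reports.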
------------- Commit messages: - mistakes - Merge branch 'master' into fixformatcrash - fix print assembly Changes: https://git.openjdk.org/jdk/pull/13603/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13603&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8304676 Stats: 26 lines in 1 file changed: 20 ins; 2 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/13603.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13603/head:pull/13603 PR: https://git.openjdk.org/jdk/pull/13603 From rsunderbabu at openjdk.org Mon Apr 24 02:22:04 2023 From: rsunderbabu at openjdk.org (Ramkumar Sunderbabu) Date: Mon, 24 Apr 2023 02:22:04 GMT Subject: Integrated: 8306636: Disable compiler/c2/Test6905845.java with -XX:TieredStopAtLevel=3 In-Reply-To: <3vWrB1NyJ0jObav66ZyBcnd41zjYERxNfGEOgGkJ9jw=.649c81d0-7d21-4c2a-a13a-f1c7ccf3d177@github.com> References: <3vWrB1NyJ0jObav66ZyBcnd41zjYERxNfGEOgGkJ9jw=.649c81d0-7d21-4c2a-a13a-f1c7ccf3d177@github.com> Message-ID: On Fri, 21 Apr 2023 13:11:41 GMT, Ramkumar Sunderbabu wrote: > Disable the c2 test for TieredStopAtLevel=3 This pull request has now been integrated. Changeset: 49005174 Author: Ramkumar Sunderbabu Committer: Fairoz Matte URL: https://git.openjdk.org/jdk/commit/4900517479f12b59cd8f1c31ad94ad7487c522f7 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod 8306636: Disable compiler/c2/Test6905845.java with -XX:TieredStopAtLevel=3 Reviewed-by: kvn ------------- PR: https://git.openjdk.org/jdk/pull/13579 From thartmann at openjdk.org Mon Apr 24 05:26:53 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 24 Apr 2023 05:26:53 GMT Subject: RFR: 8301377: adjust timeout for JLI GetObjectSizeIntrinsicsTest.java subtest again In-Reply-To: References: Message-ID: On Fri, 21 Apr 2023 21:35:07 GMT, Daniel D. 
Daugherty wrote:

> Trivial fixes to increase timeouts for tests that timeout under heavy stress:
> [JDK-8301377](https://bugs.openjdk.org/browse/JDK-8301377) adjust timeout for JLI GetObjectSizeIntrinsicsTest.java subtest again
> [JDK-8305502](https://bugs.openjdk.org/browse/JDK-8305502) adjust timeouts in three more M&M tests
> [JDK-8302607](https://bugs.openjdk.org/browse/JDK-8302607) increase timeout for ContinuousCallSiteTargetChange.java

Looks good.

-------------

Marked as reviewed by thartmann (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/13593#pullrequestreview-1397215121

From thartmann at openjdk.org Mon Apr 24 05:33:44 2023
From: thartmann at openjdk.org (Tobias Hartmann)
Date: Mon, 24 Apr 2023 05:33:44 GMT
Subject: RFR: 8306331: assert((cnt > 0.0f) && (prob > 0.0f)) failed: Bad frequency assignment in if
In-Reply-To: References: Message-ID:

On Thu, 20 Apr 2023 02:44:00 GMT, Dean Long wrote:

> This change removes undefined behavior caused by signed overflow, which triggered an assert with Xcode14.3+1.0-beta1 on macos aarch64.

Looks good to me.

src/hotspot/share/opto/parse2.cpp line 1200:

> 1198: // (check for saturation, integer overflow, and immature counts)
> 1199: static bool counters_are_meaningful(int counter1, int counter2, int min) {
> 1200:   // check for saturation, inluding "uint" values too big to fit it "int"

Suggestion:

      // check for saturation, including "uint" values too big to fit in "int"

src/hotspot/share/opto/parse2.cpp line 1211:

> 1209:   }
> 1210:   // check if mature
> 1211:   return counter1 + counter2 >= min;

Suggestion:

      return (counter1 + counter2) >= min;

-------------

Marked as reviewed by thartmann (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/13551#pullrequestreview-1397218355 PR Review Comment: https://git.openjdk.org/jdk/pull/13551#discussion_r1174789335 PR Review Comment: https://git.openjdk.org/jdk/pull/13551#discussion_r1174792091 From epeter at openjdk.org Mon Apr 24 06:13:49 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 24 Apr 2023 06:13:49 GMT Subject: RFR: 8303466: C2: failed: malformed control flow. Limit type made precise with MaxL/MinL [v8] In-Reply-To: References: Message-ID: > **Context** > > During `PhaseIdealLoop::do_unroll`, we hack the loop-limit, and subtract `stride` from it. We have to prevent underflow on that subtract. Currently, we do this with a `CMoveI`. The problem with this: `CMoveI` is not smart enough to generate a precise type. For example, there are many cases where the input types get better, and underflow is not possible anymore. But the `CMoveI` does not detect this, and still has type `min_jint..hi`. > > We have the same issue in `PhaseIdealLoop::adjust_limit`, where we use `CMoveL` to implement long max/min. The types are not as precise as they could and should be. > > **Problem** > > The imprecise type is used for the zero-trip-guard. It does not fold to false, even though the data-path into the post loop does constant fold to `TOP`. The graph breaks, and assert `malformed control flow` triggers. > > Details: In these cases, we have the super-unrolled main-loop (SuperWord'ed, then further unrolled) directly leading to a vectorized post-loop. The effect is that there is no `region/phi` merging main-exit and main-zero-trip-guard. So the types are already more narrow here. It may be possible that the values are such that we find out that we should never enter the vectorized post-loop. But if data finds out and control does not, we get a broken graph. > Note: we have pre-loop. Then a main-loop and vectorized post loop. Then we merge the main-zero-trip-guard. And at the end we have the scalar post loop. 
> > I have already recently fixed a bug around this `CMoveI`. https://github.com/openjdk/jdk/commit/5a4945c0d95423d0ab07762c915e9cb4d3c66abb I would now like to have a more satisfactory fix, that properly propagates the types. > > **Solution** > > `PhaseIdealLoop::adjust_limit` already converts the limit from int to long, and does all computations in long, including taking max/min with a `CMoveL`. I now use the so far unused `MaxL/MinL`. I implemented some missing `Value/Identity` components for it. Since `MaxL/MinL` is not implemented in the backend, I just expand it in macro-expansion to a `CMoveL`. At that point the loop-opts are over, and it is most likely ok that we do not make the types more precise after this. > > I take the same approach for `PhaseIdealLoop::do_unroll`: convert limits to long, do subtraction in long, take `MinL/MaxL` to clamp it to the int-range (prevent subtraction underflow). > > **Discussion** > > This solution seems much cleaner to me, and I hope that we will see less bugs because of imprecise types in the limit computation, which were often due to the `CMove` not being smart enough to analyze all inputs (it would have to recognize a multitude of patterns, for the Cmp inputs and the direct inputs to the CMove - we currently do not do that, but just take the union of the input types - this is very inprecise). > > There is a bit of an overhead here: We use longs even though we only want to have int values. But I think we should prefer a clean implementation here, with correct type computation. The performance impact is probably non-existent on 64-bit machines anyway. > > **Caveat** > > I found some cases with the same assert `malformed control flow` that are most likely skeleton/assertion predicate bugs [JDK-8288981](https://bugs.openjdk.org/browse/JDK-8288981). Some of those cases were new patterns, for example where we PreMainPost a main loop. > > I hope that this fix here at least reduces the frequency of failures significantly. 
> > **Testing** > > I added 2 regression tests. Our fuzzer seems to spit out examples regularly, so that gives us extra coverage. > > Tested up to `tier5` and stress testing. Performance testing **running...** > > **Future Work** > > We should implement `MaxL/MinL` in the backend. We should also use them during parsing. This would also allow to `SuperWord` the instruction, on the platforms that support it. > > Should we add such an assert during IGVN? I think after IGVN, we should never have a `MultiBranchNode` that does not have the required number of outputs, right? We could add it to `VerifyIterativeGVN`. Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: For Vladimir: add comment and move code ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13269/files - new: https://git.openjdk.org/jdk/pull/13269/files/33e1ad54..ec052b17 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13269&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13269&range=06-07 Stats: 44 lines in 1 file changed: 23 ins; 21 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/13269.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13269/head:pull/13269 PR: https://git.openjdk.org/jdk/pull/13269 From fyang at openjdk.org Mon Apr 24 06:53:58 2023 From: fyang at openjdk.org (Fei Yang) Date: Mon, 24 Apr 2023 06:53:58 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v25] In-Reply-To: <2lw4KEg0sODdIaxQqrihbNU0W-bTM3YPy12jECrgGM0=.675c17f2-0c62-4c47-a2c2-1a5741e6c4ff@github.com> References: <2lw4KEg0sODdIaxQqrihbNU0W-bTM3YPy12jECrgGM0=.675c17f2-0c62-4c47-a2c2-1a5741e6c4ff@github.com> Message-ID: On Fri, 21 Apr 2023 13:29:50 GMT, Dingli Zhang wrote: >> HI, >> >> We have added support for vector add mask instructions, please take a look and have some reviews. Thanks a lot! >> This patch will add support of vector add/sub/mul/div mask version. 
>> It was implemented by referring to RVV v1.0 [1].
>>
>> ## Load/Store/Cmp Mask
>> `VectorLoadMask, VectorMaskCmp, VectorStoreMask` will implement the mask datapath. We can see where the data is passed in the compilation log with `jdk/incubator/vector/Byte128VectorTests.java`:
>>
>>     218 loadV V1, [R7] # vector (rvv)
>>     220 vloadmask V0, V1
>>     ...
>>     23c vmaskcmp_rvv_masked V0, V4, V5, V0, V1, #0
>>     24c vstoremask V1, V0
>>     258 storeV [R7], V1 # vector (rvv)
>>
>> The corresponding generated jit assembly:
>>
>>     # loadV
>>     0x000000400c8ef958: vsetvli t0,zero,e8,m1,tu,mu
>>     0x000000400c8ef95c: vle8.v v1,(t2)
>>
>>     # vloadmask
>>     0x000000400c8ef960: vsetvli t0,zero,e8,m1,tu,
>>     0x000000400c8ef964: vmsne.vx v0,v1,zero
>>
>>     # vmaskcmp_rvv_masked
>>     0x000000400c8ef97c: vsetvli t0,zero,e8,m1,tu,mu
>>     0x000000400c8ef980: vmclr.m v1
>>     0x000000400c8ef984: vmseq.vv v1,v4,v5,v0.t
>>     0x000000400c8ef988: vmv1r.v v0,v1
>>
>>     # vstoremask
>>     0x000000400c8ef98c: vsetvli t0,zero,e8,m1,tu,mu
>>     0x000000400c8ef990: vmv.v.x v1,zero
>>     0x000000400c8ef994: vmerge.vim v1,v1,1,v0
>>
>> ## Masked vector arithmetic instructions (e.g. vadd)
>> AddMaskTestMerge case:
>>
>>     import jdk.incubator.vector.IntVector;
>>     import jdk.incubator.vector.VectorMask;
>>     import jdk.incubator.vector.VectorOperators;
>>     import jdk.incubator.vector.VectorSpecies;
>>
>>     public class AddMaskTestMerge {
>>
>>         static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128;
>>         static final int SIZE = 1024;
>>         static int[] a = new int[SIZE];
>>         static int[] b = new int[SIZE];
>>         static int[] r = new int[SIZE];
>>         static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false};
>>         static {
>>             for (int i = 0; i < SIZE; i++) {
>>                 a[i] = i;
>>                 b[i] = i;
>>             }
>>         }
>>
>>         static void workload(int idx) {
>>             VectorMask<Integer> vmask = VectorMask.fromArray(SPECIES, c, 0);
>>             IntVector av = IntVector.fromArray(SPECIES, a, idx);
>>             IntVector bv = IntVector.fromArray(SPECIES, b, idx);
>>             av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx);
>>         }
>>
>>         public static void main(String[] args) {
>>             for (int i = 0; i < 30_0000; i++) {
>>                 for (int j = 0; j < SIZE; j += SPECIES.length()) {
>>                     workload(j);
>>                 }
>>             }
>>         }
>>     }
>>
>> This test case is reduced from existing jtreg vector tests Int128VectorTests.java[2]. This test case corresponds to the add instruction of the vector mask version and other instructions are similar.
>>
>> Before this patch, the compilation log will not print RVV-related instructions. Now the compilation log is as follows:
>>
>>     0ae B10: # out( B25 B11 ) <- in( B9 ) Freq: 0.999991
>>     0ae loadV V1, [R31] # vector (rvv)
>>     0b6 vloadmask V0, V2
>>     0be vadd.vv V3, V1, V0 #@vaddI_masked
>>     0c6 lwu R28, [R7, #124] # loadN, compressed ptr, #@loadN ! Field: AddMaskTestMerge.r
>>     0ca decode_heap_oop R28, R28 #@decodeHeapOop
>>     0cc lwu R7, [R28, #12] # range, #@loadRange
>>     0d0 NullCheck R28
>>
>> And the jit code is as follows:
>>
>>     0x000000400c823cee: vsetvli t0,zero,e32,m1,tu,mu
>>     0x000000400c823cf2: vle32.v v1,(t6) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0}
>>                         ; - jdk.incubator.vector.IntVector::intoArray at 43 (line 3228)
>>                         ; - AddMaskTestMerge::workload at 46 (line 25)
>>     0x000000400c823cf6: vsetvli t0,zero,e8,m1,tu,mu
>>     0x000000400c823cfa: vmsne.vx v0,v2,zero ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0}
>>                         ; - jdk.incubator.vector.VectorMask::fromArray at 47 (line 208)
>>                         ; - AddMaskTestMerge::workload at 7 (line 22)
>>     0x000000400c823cfe: vsetvli t0,zero,e32,m1,tu,mu
>>     0x000000400c823d02: vadd.vv v3,v3,v1,v0.t ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0}
>>                         ; - jdk.incubator.vector.IntVector::lanewiseTemplate at 192 (line 834)
>>                         ; - jdk.incubator.vector.Int128Vector::lanewise at 9 (line 291)
>>                         ; - jdk.incubator.vector.Int128Vector::lanewise at 4 (line 41)
>>                         ; - AddMaskTestMerge::workload at 39 (line 25)
>>
>> ## Mask register allocation & mask bit operation
>> Since v0 is to be used as a mask register in spec[1], sometimes we need two vmask to do the vector mask logical ops like `AndVMask, OrVMask, XorVMask`. And if only v0 and v31 mask registers are defined, the corresponding c2 nodes will not be generated correctly because of the register pressure[3].
>> When we use only v0 and v31 as mask registers, jtreg testing of Byte128VectorTests.java[4] with `-XX:+PrintAssembly` and `-XX:LogFile` will not emit the expected rvv mask instruction.
As opposed to this, the following compilation failure log[3] is generated: >> >> >> >> >> >> >> >> >> So define v30 and v31 as mask register too and `AndVMask` will emit the C2 JIT code like: >> >> vloadmask V0, V1 >> vloadmask V30, V2 >> vmask_and V0, V30, V0 >> >> We also modified the implementation of `spill_copy_vector_stack_to_stack ` so that it no longer occupies the v0 register. In addition to that, we change some node like `vasr/vlsl/vlsr/vstring_x/varray_x/vclearArray_x`, which use v0 internally, to make C2 to sense that they used v0. >> >> ## vector load/store - predicated & blend opreation >> >> Jtreg testing of Byte128VectorTests.java[4] with -XX:+PrintOptoAssembly and -XX:LogFile will print the following compilation log, which generated by predicated vector load/store: >> >> 152 B21: # out( B22 ) <- in( B20 ) Freq: 0.499984 >> 152 vmask_gen_L V0, R12 >> 162 loadV_masked V1, V0, [R10] >> 16e storeV_masked [R11], V0, V1 >> >> >> And `VectorBlend` will generate the following compilation log (part of rotate opreation): >> >> 1ea vlsrBS V6, V1, V3 V0 >> 1fe vlslBS V5, V1, V2 V0 >> 212 vor.vv V2, V5, V6 #@vor >> 21a vloadmask V0, V4 >> 222 vmerge_vvm V1, V1, V2 # vector blend >> 22a bgeu R9, R30, B56 #@cmpU_branch P=0.000001 C=-1.000000 >> >> >> At the same time, we added the predicated nodes of `RShiftV/LShiftV/URShiftV`. While there was some code duplication for the corresponding nodes in non-masked form, so a small refactoring was done. 
>> >> >> [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc >> [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java >> [3] https://github.com/openjdk/jdk/blob/0deb648985b018653ccdaf193dc13b3cf21c088a/src/hotspot/share/opto/chaitin.cpp#L526 >> [4] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Byte128VectorTests.java >> >> ### Testing: >> >> qemu with UseRVV: >> - [x] Tier1 tests (release) >> - [x] Tier2 tests (release) >> - [x] Tier3 tests (release) >> - [x] test/jdk/jdk/incubator/vector (release/fastdebug) > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Add some vector pseudo instructions Changes requested by fyang (Reviewer). src/hotspot/cpu/riscv/riscv_v.ad line 179: > 177: %} > 178: > 179: instruct vmaskcmp_masked(vRegMask dst, vReg src1, vReg src2, immI cond, vRegMask_V0 vmask, vReg tmp) %{ I think we can introduce another new operand type (say 'vRegMaskNoV0') which excludes mask register 'v0' for 'dst' here and other places where 'v0' could not be used as the destination register for a masked vector instruction as required by the RVV spec. Then we could eliminate the use of 'tmp' register and 'vmv1r.v' instruction. Also, I would like to further rename 'vRegMask_V0 vmask' into 'vRegMask_V0 v0'. The RVV spec says that the mask value used to control execution of a masked vector instruction is always supplied by vector register 'v0' for now. 
------------- PR Review: https://git.openjdk.org/jdk/pull/12682#pullrequestreview-1397254398 PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1174827238 From rcastanedalo at openjdk.org Mon Apr 24 07:17:49 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Mon, 24 Apr 2023 07:17:49 GMT Subject: RFR: 8298189: Regression in SPECjvm2008-MonteCarlo for pre-Cascade Lake Intel processors Message-ID: <-rGBBQHk5en1kP23u9akxzvubL5KEC_s73H6Ox2Yk4U=.42502cee-d135-491e-aa90-2d5634db4df9@github.com> The `mov + inc/dec -> lea` subset of the peephole rules introduced by [JDK-8283699](https://bugs.openjdk.org/browse/JDK-8283699) has been found to cause minor regressions for some common benchmarks on Intel microarchitectures earlier than Cascade Lake. This changeset limits their application to Intel Cascade Lake and microarchitectures with full ALU support for lea (`VM_Version::supports_fast_3op_lea()`), where these peephole rules have been confirmed to be beneficial. The adjustment speeds up SPECjvm2008's MonteCarlo benchmark by between 0.1% and 2.7% on pre-Cascade Lake microarchitectures (Haswell-DT, Coffee Lake-B) across different garbage collectors (G1, ZGC). It additionally yields a speedup of 2.1% on SPECjvm2008's Derby benchmark when using G1 on Coffee Lake-B. Thanks to @ericcaspole for discussions and helping out with benchmarking. #### Testing ##### Functionality - tier1-5 (windows-x64, linux-x64, macosx-x64; release and debug mode). - Checked that the expected combination of peephole rules is enabled for all microarchitectures supported by Intel's Software Development Emulator 9.0. ##### Performance - Tested performance on a set of standard benchmark suites (DaCapo, SPECjbb2015, SPECjvm2008), different Intel microarchitectures (Haswell-DT, Coffee Lake-B, Cascade Lake, Ice Lake-SP) and operating systems (linux-x64, windows-x64, and macosx-x64). 
No significant change was observed besides the improvements mentioned above. ------------- Commit messages: - Run x86 mov + inc/dec -> lea peephole rules only on Cascade Lake or later Changes: https://git.openjdk.org/jdk/pull/13605/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13605&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8298189 Stats: 23 lines in 3 files changed: 17 ins; 1 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/13605.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13605/head:pull/13605 PR: https://git.openjdk.org/jdk/pull/13605 From epeter at openjdk.org Mon Apr 24 07:33:51 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 24 Apr 2023 07:33:51 GMT Subject: RFR: 8303466: C2: failed: malformed control flow. Limit type made precise with MaxL/MinL [v9] In-Reply-To: References: Message-ID: > **Context** > > During `PhaseIdealLoop::do_unroll`, we hack the loop-limit, and subtract `stride` from it. We have to prevent underflow on that subtract. Currently, we do this with a `CMoveI`. The problem with this: `CMoveI` is not smart enough to generate a precise type. For example, there are many cases where the input types get better, and underflow is not possible anymore. But the `CMoveI` does not detect this, and still has type `min_jint..hi`. > > We have the same issue in `PhaseIdealLoop::adjust_limit`, where we use `CMoveL` to implement long max/min. The types are not as precise as they could and should be. > > **Problem** > > The imprecise type is used for the zero-trip-guard. It does not fold to false, even though the data-path into the post loop does constant fold to `TOP`. The graph breaks, and assert `malformed control flow` triggers. > > Details: In these cases, we have the super-unrolled main-loop (SuperWord'ed, then further unrolled) directly leading to a vectorized post-loop. The effect is that there is no `region/phi` merging main-exit and main-zero-trip-guard. So the types are already more narrow here. 
It may be possible that the values are such that we find out that we should never enter the vectorized post-loop. But if data finds out and control does not, we get a broken graph. > Note: we have pre-loop. Then a main-loop and vectorized post loop. Then we merge the main-zero-trip-guard. And at the end we have the scalar post loop. > > I have already recently fixed a bug around this `CMoveI`. https://github.com/openjdk/jdk/commit/5a4945c0d95423d0ab07762c915e9cb4d3c66abb I would now like to have a more satisfactory fix, that properly propagates the types. > > **Solution** > > `PhaseIdealLoop::adjust_limit` already converts the limit from int to long, and does all computations in long, including taking max/min with a `CMoveL`. I now use the so far unused `MaxL/MinL`. I implemented some missing `Value/Identity` components for it. Since `MaxL/MinL` is not implemented in the backend, I just expand it in macro-expansion to a `CMoveL`. At that point the loop-opts are over, and it is most likely ok that we do not make the types more precise after this. > > I take the same approach for `PhaseIdealLoop::do_unroll`: convert limits to long, do subtraction in long, take `MinL/MaxL` to clamp it to the int-range (prevent subtraction underflow). > > **Discussion** > > This solution seems much cleaner to me, and I hope that we will see less bugs because of imprecise types in the limit computation, which were often due to the `CMove` not being smart enough to analyze all inputs (it would have to recognize a multitude of patterns, for the Cmp inputs and the direct inputs to the CMove - we currently do not do that, but just take the union of the input types - this is very inprecise). > > There is a bit of an overhead here: We use longs even though we only want to have int values. But I think we should prefer a clean implementation here, with correct type computation. The performance impact is probably non-existent on 64-bit machines anyway. 
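As an aside on the **Solution** section quoted above, here is a minimal scalar sketch in plain C++ of the arithmetic being modelled. It is not C2 IR code, and the function name `clamped_limit_minus_stride` is made up for illustration. It assumes exactly what the description says: widen to 64 bits, subtract there, then clamp back into the int range with max/min so that the subtraction cannot wrap.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>

// Hypothetical scalar model of "subtract in long, then clamp to the int range".
inline int32_t clamped_limit_minus_stride(int32_t limit, int32_t stride) {
  const int64_t lo = std::numeric_limits<int32_t>::min();
  const int64_t hi = std::numeric_limits<int32_t>::max();
  // The 64-bit difference of two 32-bit values cannot overflow.
  const int64_t diff = static_cast<int64_t>(limit) - static_cast<int64_t>(stride);
  // Max/min analogue of MaxL/MinL: clamp into [min_jint, max_jint].
  const int64_t clamped = std::min(std::max(diff, lo), hi);
  return static_cast<int32_t>(clamped);
}
```

The result range of a max or min follows directly from the ranges of its two inputs, so it can narrow as the inputs narrow. That is the property the description relies on when it prefers `MaxL/MinL` to a `CMoveL`.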
> > **Caveat** > > I found some cases with the same assert `malformed control flow` that are most likely skeleton/assertion predicate bugs [JDK-8288981](https://bugs.openjdk.org/browse/JDK-8288981). Some of those cases were new patterns, for example where we PreMainPost a main loop. > > I hope that this fix here at least reduces the frequency of failures significantly. > > **Testing** > > I added 2 regression tests. Our fuzzer seems to spit out examples regularly, so that gives us extra coverage. > > Tested up to `tier5` and stress testing. Performance testing **running...** > > **Future Work** > > We should implement `MaxL/MinL` in the backend. We should also use them during parsing. This would also allow to `SuperWord` the instruction, on the platforms that support it. > > Should we add such an assert during IGVN? I think after IGVN, we should never have a `MultiBranchNode` that does not have the required number of outputs, right? We could add it to `VerifyIterativeGVN`. Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: Apply suggestions from code review (Tobias) Co-authored-by: Tobias Hartmann ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13269/files - new: https://git.openjdk.org/jdk/pull/13269/files/ec052b17..4a13b00b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13269&range=08 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13269&range=07-08 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/13269.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13269/head:pull/13269 PR: https://git.openjdk.org/jdk/pull/13269 From thartmann at openjdk.org Mon Apr 24 07:33:55 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 24 Apr 2023 07:33:55 GMT Subject: RFR: 8303466: C2: failed: malformed control flow. 
Limit type made precise with MaxL/MinL [v8] In-Reply-To: References: Message-ID: On Mon, 24 Apr 2023 06:13:49 GMT, Emanuel Peter wrote: >> **Context** >> >> During `PhaseIdealLoop::do_unroll`, we hack the loop-limit, and subtract `stride` from it. We have to prevent underflow on that subtract. Currently, we do this with a `CMoveI`. The problem with this: `CMoveI` is not smart enough to generate a precise type. For example, there are many cases where the input types get better, and underflow is not possible anymore. But the `CMoveI` does not detect this, and still has type `min_jint..hi`. >> >> We have the same issue in `PhaseIdealLoop::adjust_limit`, where we use `CMoveL` to implement long max/min. The types are not as precise as they could and should be. >> >> **Problem** >> >> The imprecise type is used for the zero-trip-guard. It does not fold to false, even though the data-path into the post loop does constant fold to `TOP`. The graph breaks, and assert `malformed control flow` triggers. >> >> Details: In these cases, we have the super-unrolled main-loop (SuperWord'ed, then further unrolled) directly leading to a vectorized post-loop. The effect is that there is no `region/phi` merging main-exit and main-zero-trip-guard. So the types are already more narrow here. It may be possible that the values are such that we find out that we should never enter the vectorized post-loop. But if data finds out and control does not, we get a broken graph. >> Note: we have pre-loop. Then a main-loop and vectorized post loop. Then we merge the main-zero-trip-guard. And at the end we have the scalar post loop. >> >> I have already recently fixed a bug around this `CMoveI`. https://github.com/openjdk/jdk/commit/5a4945c0d95423d0ab07762c915e9cb4d3c66abb I would now like to have a more satisfactory fix, that properly propagates the types. 
>> >> **Solution** >> >> `PhaseIdealLoop::adjust_limit` already converts the limit from int to long, and does all computations in long, including taking max/min with a `CMoveL`. I now use the so far unused `MaxL/MinL`. I implemented some missing `Value/Identity` components for it. Since `MaxL/MinL` is not implemented in the backend, I just expand it in macro-expansion to a `CMoveL`. At that point the loop-opts are over, and it is most likely ok that we do not make the types more precise after this. >> >> I take the same approach for `PhaseIdealLoop::do_unroll`: convert limits to long, do subtraction in long, take `MinL/MaxL` to clamp it to the int-range (prevent subtraction underflow). >> >> **Discussion** >> >> This solution seems much cleaner to me, and I hope that we will see fewer bugs because of imprecise types in the limit computation, which were often due to the `CMove` not being smart enough to analyze all inputs (it would have to recognize a multitude of patterns, for the Cmp inputs and the direct inputs to the CMove - we currently do not do that, but just take the union of the input types - this is very imprecise). >> >> There is a bit of an overhead here: We use longs even though we only want to have int values. But I think we should prefer a clean implementation here, with correct type computation. The performance impact is probably non-existent on 64-bit machines anyway. >> >> **Caveat** >> >> I found some cases with the same assert `malformed control flow` that are most likely skeleton/assertion predicate bugs [JDK-8288981](https://bugs.openjdk.org/browse/JDK-8288981). Some of those cases were new patterns, for example where we PreMainPost a main loop. >> >> I hope that this fix here at least reduces the frequency of failures significantly. >> >> **Testing** >> >> I added 2 regression tests. Our fuzzer seems to spit out examples regularly, so that gives us extra coverage. >> >> Tested up to `tier5` and stress testing.
Performance testing **running...** >> >> **Future Work** >> >> We should implement `MaxL/MinL` in the backend. We should also use them during parsing. This would also allow to `SuperWord` the instruction, on the platforms that support it. >> >> Should we add such an assert during IGVN? I think after IGVN, we should never have a `MultiBranchNode` that does not have the required number of outputs, right? We could add it to `VerifyIterativeGVN`. > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > For Vladimir: add comment and move code Still looks good to me. src/hotspot/share/opto/addnode.cpp line 1261: > 1259: } > 1260: > 1261: // Collapse the "addition with overflow-protection" pattern, and the symetrical Suggestion: // Collapse the "addition with overflow-protection" pattern, and the symmetrical src/hotspot/share/opto/addnode.cpp line 1264: > 1262: // "subtraction with underflow-protection" pattern. These are created during the > 1263: // unrolling, when we have to adjust the limit by subtracting the stride, but want > 1264: // to protect agains underflow: MaxL(SubL(limit, stride), min_jint). Suggestion: // to protect against underflow: MaxL(SubL(limit, stride), min_jint). ------------- Marked as reviewed by thartmann (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/13269#pullrequestreview-1397288749 PR Review Comment: https://git.openjdk.org/jdk/pull/13269#discussion_r1174849896 PR Review Comment: https://git.openjdk.org/jdk/pull/13269#discussion_r1174852398 From dzhang at openjdk.org Mon Apr 24 07:56:54 2023 From: dzhang at openjdk.org (Dingli Zhang) Date: Mon, 24 Apr 2023 07:56:54 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v26] In-Reply-To: References: Message-ID: <9Oht3blG9TjyM_vsrvAz0BEEyXXH2FbjPY1BWe3xrOo=.586ffa60-c89c-4f15-9778-1a734537b312@github.com> > HI, > > We have added support for vector add mask instructions, please take a look and have some reviews. Thanks a lot! > This patch will add support of vector add/sub/mul/div mask version. It was implemented by referring to RVV v1.0 [1]. > > ## Load/Store/Cmp Mask > `VectorLoadMask, VectorMaskCmp, VectorStoreMask` will implement the mask datapath. We can see where the data is passed in the compilation log with `jdk/incubator/vector/Byte128VectorTests.java`: > > 218 loadV V1, [R7] # vector (rvv) > 220 vloadmask V0, V1 > ... > 23c vmaskcmp_rvv_masked V0, V4, V5, V0, V1, #0 > 24c vstoremask V1, V0 > 258 storeV [R7], V1 # vector (rvv) > > > The corresponding generated jit assembly: > > # loadV > 0x000000400c8ef958: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef95c: vle8.v v1,(t2) > > # vloadmask > 0x000000400c8ef960: vsetvli t0,zero,e8,m1,tu, > 0x000000400c8ef964: vmsne.vx v0,v1,zero > > # vmaskcmp_rvv_masked > 0x000000400c8ef97c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef980: vmclr.m v1 > 0x000000400c8ef984: vmseq.vv v1,v4,v5,v0.t > 0x000000400c8ef988: vmv1r.v v0,v1 > > # vstoremask > 0x000000400c8ef98c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef990: vmv.v.x v1,zero > 0x000000400c8ef994: vmerge.vim v1,v1,1,v0 > > > ## Masked vector arithmetic instructions (e.g.
vadd) > AddMaskTestMerge case: > > import jdk.incubator.vector.IntVector; > import jdk.incubator.vector.VectorMask; > import jdk.incubator.vector.VectorOperators; > import jdk.incubator.vector.VectorSpecies; > > public class AddMaskTestMerge { > > static final VectorSpecies SPECIES = IntVector.SPECIES_128; > static final int SIZE = 1024; > static int[] a = new int[SIZE]; > static int[] b = new int[SIZE]; > static int[] r = new int[SIZE]; > static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false}; > static { > for (int i = 0; i < SIZE; i++) { > a[i] = i; > b[i] = i; > } > } > > static void workload(int idx) { > VectorMask vmask = VectorMask.fromArray(SPECIES, c, 0); > IntVector av = IntVector.fromArray(SPECIES, a, idx); > IntVector bv = IntVector.fromArray(SPECIES, b, idx); > av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx); > } > > public static void main(String[] args) { > for (int i = 0; i < 30_0000; i++) { > for (int j = 0; j < SIZE; j += SPECIES.length()) { > workload(j); > } > } > } > } > > > This test case is reduced from existing jtreg vector tests Int128VectorTests.java[2]. This test case corresponds to the add instruction of the vector mask version and other instructions are similar. > > Before this patch, the compilation log will not print RVV-related instructions. Now the compilation log is as follows: > > > 0ae B10: # out( B25 B11 ) <- in( B9 ) Freq: 0.999991 > 0ae loadV V1, [R31] # vector (rvv) > 0b6 vloadmask V0, V2 > 0be vadd.vv V3, V1, V0 #@vaddI_masked > 0c6 lwu R28, [R7, #124] # loadN, compressed ptr, #@loadN ! 
Field: AddMaskTestMerge.r > 0ca decode_heap_oop R28, R28 #@decodeHeapOop > 0cc lwu R7, [R28, #12] # range, #@loadRange > 0d0 NullCheck R28 > > > And the jit code is as follows: > > > 0x000000400c823cee: vsetvli t0,zero,e32,m1,tu,mu > 0x000000400c823cf2: vle32.v v1,(t6) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::intoArray at 43 (line 3228) > ; - AddMaskTestMerge::workload at 46 (line 25) > 0x000000400c823cf6: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c823cfa: vmsne.vx v0,v2,zero ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.VectorMask::fromArray at 47 (line 208) > ; - AddMaskTestMerge::workload at 7 (line 22) > 0x000000400c823cfe: vsetvli t0,zero,e32,m1,tu,mu > 0x000000400c823d02: vadd.vv v3,v3,v1,v0.t ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::lanewiseTemplate at 192 (line 834) > ; - jdk.incubator.vector.Int128Vector::lanewise at 9 (line 291) > ; - jdk.incubator.vector.Int128Vector::lanewise at 4 (line 41) > ; - AddMaskTestMerge::workload at 39 (line 25) > > > ## Mask register allocation & mask bit operation > Since v0 is to be used as a mask register in spec[1], sometimes we need two vmasks to do the vector mask logical ops like `AndVMask, OrVMask, XorVMask`. And if only v0 and v31 mask registers are defined, the corresponding C2 nodes will not be generated correctly because of the register pressure[3]. > When we use only v0 and v31 as mask registers, jtreg testing of Byte128VectorTests.java[4] with `-XX:+PrintAssembly` and `-XX:LogFile` will not emit the expected rvv mask instruction.
Instead, the following compilation failure log[3] is generated: > > > > > > > > > So we define v30 and v31 as mask registers too, and `AndVMask` will emit C2 JIT code like: > > vloadmask V0, V1 > vloadmask V30, V2 > vmask_and V0, V30, V0 > > We also modified the implementation of `spill_copy_vector_stack_to_stack` so that it no longer occupies the v0 register. In addition to that, we changed some nodes like `vasr/vlsl/vlsr/vstring_x/varray_x/vclearArray_x`, which use v0 internally, so that C2 is aware that they use v0. > > ## vector load/store - predicated & blend operation > > Jtreg testing of Byte128VectorTests.java[4] with -XX:+PrintOptoAssembly and -XX:LogFile will print the following compilation log, which is generated by predicated vector load/store: > > 152 B21: # out( B22 ) <- in( B20 ) Freq: 0.499984 > 152 vmask_gen_L V0, R12 > 162 loadV_masked V1, V0, [R10] > 16e storeV_masked [R11], V0, V1 > > > And `VectorBlend` will generate the following compilation log (part of rotate operation): > > 1ea vlsrBS V6, V1, V3 V0 > 1fe vlslBS V5, V1, V2 V0 > 212 vor.vv V2, V5, V6 #@vor > 21a vloadmask V0, V4 > 222 vmerge_vvm V1, V1, V2 # vector blend > 22a bgeu R9, R30, B56 #@cmpU_branch P=0.000001 C=-1.000000 > > > At the same time, we added the predicated nodes of `RShiftV/LShiftV/URShiftV`. Since there was some code duplication with the corresponding nodes in non-masked form, a small refactoring was done.
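As a reading aid for the masked `vadd.vv ... v0.t` in the AddMaskTestMerge case, the following scalar Java sketch spells out what `av.lanewise(VectorOperators.ADD, bv, vmask)` computes per lane. It assumes the Vector API behavior that a lane whose mask bit is unset keeps the value of the first operand. The class `MaskedAddReference` and the method `maskedAdd` are hypothetical names used only for this illustration; they are not part of the patch or of the jtreg tests.

```java
import java.util.Arrays;

public class MaskedAddReference {
    // Hypothetical scalar reference for a masked lanewise ADD (illustration only).
    // Lanes with mask[i] == true receive a[i] + b[i];
    // lanes with mask[i] == false keep a[i] unchanged.
    static int[] maskedAdd(int[] a, int[] b, boolean[] mask) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = mask[i] ? a[i] + b[i] : a[i];
        }
        return r;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4};
        int[] b = {10, 20, 30, 40};
        boolean[] mask = {true, false, true, false};
        // One 128-bit int vector holds four lanes, hence four elements here.
        System.out.println(Arrays.toString(maskedAdd(a, b, mask)));
    }
}
```

A reference like this can be handy when checking the output array `r` of a vectorized run against expected values lane by lane.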
> > > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc > [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java > [3] https://github.com/openjdk/jdk/blob/0deb648985b018653ccdaf193dc13b3cf21c088a/src/hotspot/share/opto/chaitin.cpp#L526 > [4] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Byte128VectorTests.java > > ### Testing: > > qemu with UseRVV: > - [x] Tier1 tests (release) > - [x] Tier2 tests (release) > - [x] Tier3 tests (release) > - [x] test/jdk/jdk/incubator/vector (release/fastdebug) Dingli Zhang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 19 additional commits since the last revision: - Add some comments and modify some operand order - Merge remote-tracking branch 'upstream/master' into JDK-8302908 - Optimize vmaskall and modify some format Add vRegMaskNoV0 - Add some vector pseudo instructions - rename fp and modify VectorMaskGen node - Rename vmaskcmp_DF - Fix build fail after JDK-8305008 - Merge remote-tracking branch 'upstream/master' into JDK-8302908-merge - Fix trailing whitespace - Handle unordered compares - ... 
and 9 more: https://git.openjdk.org/jdk/compare/e169176f...0bb95839 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12682/files - new: https://git.openjdk.org/jdk/pull/12682/files/800205bb..0bb95839 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=25 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=24-25 Stats: 14838 lines in 376 files changed: 9848 ins; 3067 del; 1923 mod Patch: https://git.openjdk.org/jdk/pull/12682.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12682/head:pull/12682 PR: https://git.openjdk.org/jdk/pull/12682 From dzhang at openjdk.org Mon Apr 24 08:04:56 2023 From: dzhang at openjdk.org (Dingli Zhang) Date: Mon, 24 Apr 2023 08:04:56 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v25] In-Reply-To: References: <2lw4KEg0sODdIaxQqrihbNU0W-bTM3YPy12jECrgGM0=.675c17f2-0c62-4c47-a2c2-1a5741e6c4ff@github.com> Message-ID: On Mon, 24 Apr 2023 06:12:38 GMT, Fei Yang wrote: >> Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: >> >> Add some vector pseudo instructions > > src/hotspot/cpu/riscv/riscv_v.ad line 179: > >> 177: %} >> 178: >> 179: instruct vmaskcmp_masked(vRegMask dst, vReg src1, vReg src2, immI cond, vRegMask_V0 vmask, vReg tmp) %{ > > I think we can introduce another new operand type (say 'vRegMaskNoV0') which excludes mask register 'v0' for 'dst' here and other places where 'v0' could not be used as the destination register for a masked vector instruction as required by the RVV spec. Then we could eliminate the use of 'tmp' register and 'vmv1r.v' instruction. > > Also, I would like to further rename 'vRegMask_V0 vmask' into 'vRegMask_V0 v0'. The RVV spec says that the mask value used to control execution of a masked vector instruction is always supplied by vector register 'v0' for now. Fixed. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1174922689 From shade at openjdk.org Mon Apr 24 08:43:43 2023 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 24 Apr 2023 08:43:43 GMT Subject: RFR: 8298189: Regression in SPECjvm2008-MonteCarlo for pre-Cascade Lake Intel processors In-Reply-To: <-rGBBQHk5en1kP23u9akxzvubL5KEC_s73H6Ox2Yk4U=.42502cee-d135-491e-aa90-2d5634db4df9@github.com> References: <-rGBBQHk5en1kP23u9akxzvubL5KEC_s73H6Ox2Yk4U=.42502cee-d135-491e-aa90-2d5634db4df9@github.com> Message-ID: On Mon, 24 Apr 2023 06:05:21 GMT, Roberto Castañeda Lozano wrote: > The `mov + inc/dec -> lea` subset of the peephole rules introduced by [JDK-8283699](https://bugs.openjdk.org/browse/JDK-8283699) has been found to cause minor regressions for some common benchmarks on Intel microarchitectures earlier than Cascade Lake. This changeset limits their application to Intel Cascade Lake and microarchitectures with full ALU support for lea (`VM_Version::supports_fast_3op_lea()`), where these peephole rules have been confirmed to be beneficial. The adjustment speeds up SPECjvm2008's MonteCarlo benchmark by between 0.1% and 2.7% on pre-Cascade Lake microarchitectures (Haswell-DT, Coffee Lake-B) across different garbage collectors (G1, ZGC). It additionally yields a speedup of 2.1% on SPECjvm2008's Derby benchmark when using G1 on Coffee Lake-B. > > Thanks to @ericcaspole for discussions and helping out with benchmarking. > > #### Testing > > ##### Functionality > > - tier1-5 (windows-x64, linux-x64, macosx-x64; release and debug mode). > - Checked that the expected combination of peephole rules is enabled for all microarchitectures supported by Intel's Software Development Emulator 9.0.
> > ##### Performance > > - Tested performance on a set of standard benchmark suites (DaCapo, SPECjbb2015, SPECjvm2008), different Intel microarchitectures (Haswell-DT, Coffee Lake-B, Cascade Lake, Ice Lake-SP) and operating systems (linux-x64, windows-x64, and macosx-x64). No significant change was observed besides the improvements mentioned above. This looks okay to me. ------------- Marked as reviewed by shade (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13605#pullrequestreview-1397464015 From shade at openjdk.org Mon Apr 24 08:46:44 2023 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 24 Apr 2023 08:46:44 GMT Subject: RFR: 8304676: [vectorapi] x86_32: Crash in Assembler::kmovql(Address, KRegister) In-Reply-To: References: Message-ID: <80deH338vckLLkyfUZGdnKk4XgGRXPgTUm5DmEA_GK0=.50ad95eb-b1bb-475c-90d5-be07ddd6edd6@github.com> On Sun, 23 Apr 2023 19:51:58 GMT, Quan Anh Mai wrote: > Hi, > > Can I have reviews for this patch which fixes crashes during PrintOptoAssembly of KRegister spilling code. The reason is that we miss the check for nullptr cbuf. > > Thanks a lot. This looks fine, and it matches what the surrounding code does when `cbuf` is `nullptr`. ------------- Marked as reviewed by shade (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13603#pullrequestreview-1397469708 From dzhang at openjdk.org Mon Apr 24 08:46:56 2023 From: dzhang at openjdk.org (Dingli Zhang) Date: Mon, 24 Apr 2023 08:46:56 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v27] In-Reply-To: References: Message-ID: <1fp9Phfz1VAL-BRiS0ANWSrfJptV8M8bV6SPM2N6Ivk=.709d55f4-2df9-4fdb-bef9-e3f72790d0a4@github.com> > HI, > > We have added support for vector add mask instructions, please take a look and have some reviews. Thanks a lot! > This patch will add support of vector add/sub/mul/div mask version. It was implemented by referring to RVV v1.0 [1]. 
> > ## Load/Store/Cmp Mask > `VectorLoadMask, VectorMaskCmp, VectorStoreMask` will implement the mask datapath. We can see where the data is passed in the compilation log with `jdk/incubator/vector/Byte128VectorTests.java`: > > 218 loadV V1, [R7] # vector (rvv) > 220 vloadmask V0, V1 > ... > 23c vmaskcmp_rvv_masked V0, V4, V5, V0, V1, #0 > 24c vstoremask V1, V0 > 258 storeV [R7], V1 # vector (rvv) > > > The corresponding generated jit assembly: > > # loadV > 0x000000400c8ef958: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef95c: vle8.v v1,(t2) > > # vloadmask > 0x000000400c8ef960: vsetvli t0,zero,e8,m1,tu, > 0x000000400c8ef964: vmsne.vx v0,v1,zero > > # vmaskcmp_rvv_masked > 0x000000400c8ef97c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef980: vmclr.m v1 > 0x000000400c8ef984: vmseq.vv v1,v4,v5,v0.t > 0x000000400c8ef988: vmv1r.v v0,v1 > > # vstoremask > 0x000000400c8ef98c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef990: vmv.v.x v1,zero > 0x000000400c8ef994: vmerge.vim v1,v1,1,v0 > > > ## Masked vector arithmetic instructions (e.g.
vadd) > AddMaskTestMerge case: > > import jdk.incubator.vector.IntVector; > import jdk.incubator.vector.VectorMask; > import jdk.incubator.vector.VectorOperators; > import jdk.incubator.vector.VectorSpecies; > > public class AddMaskTestMerge { > > static final VectorSpecies SPECIES = IntVector.SPECIES_128; > static final int SIZE = 1024; > static int[] a = new int[SIZE]; > static int[] b = new int[SIZE]; > static int[] r = new int[SIZE]; > static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false}; > static { > for (int i = 0; i < SIZE; i++) { > a[i] = i; > b[i] = i; > } > } > > static void workload(int idx) { > VectorMask vmask = VectorMask.fromArray(SPECIES, c, 0); > IntVector av = IntVector.fromArray(SPECIES, a, idx); > IntVector bv = IntVector.fromArray(SPECIES, b, idx); > av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx); > } > > public static void main(String[] args) { > for (int i = 0; i < 30_0000; i++) { > for (int j = 0; j < SIZE; j += SPECIES.length()) { > workload(j); > } > } > } > } > > > This test case is reduced from existing jtreg vector tests Int128VectorTests.java[2]. This test case corresponds to the add instruction of the vector mask version and other instructions are similar. > > Before this patch, the compilation log will not print RVV-related instructions. Now the compilation log is as follows: > > > 0ae B10: # out( B25 B11 ) <- in( B9 ) Freq: 0.999991 > 0ae loadV V1, [R31] # vector (rvv) > 0b6 vloadmask V0, V2 > 0be vadd.vv V3, V1, V0 #@vaddI_masked > 0c6 lwu R28, [R7, #124] # loadN, compressed ptr, #@loadN ! 
Field: AddMaskTestMerge.r > 0ca decode_heap_oop R28, R28 #@decodeHeapOop > 0cc lwu R7, [R28, #12] # range, #@loadRange > 0d0 NullCheck R28 > > > And the jit code is as follows: > > > 0x000000400c823cee: vsetvli t0,zero,e32,m1,tu,mu > 0x000000400c823cf2: vle32.v v1,(t6) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::intoArray at 43 (line 3228) > ; - AddMaskTestMerge::workload at 46 (line 25) > 0x000000400c823cf6: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c823cfa: vmsne.vx v0,v2,zero ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.VectorMask::fromArray at 47 (line 208) > ; - AddMaskTestMerge::workload at 7 (line 22) > 0x000000400c823cfe: vsetvli t0,zero,e32,m1,tu,mu > 0x000000400c823d02: vadd.vv v3,v3,v1,v0.t ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::lanewiseTemplate at 192 (line 834) > ; - jdk.incubator.vector.Int128Vector::lanewise at 9 (line 291) > ; - jdk.incubator.vector.Int128Vector::lanewise at 4 (line 41) > ; - AddMaskTestMerge::workload at 39 (line 25) > > > ## Mask register allocation & mask bit operation > Since v0 is to be used as a mask register in spec[1], sometimes we need two vmasks to do the vector mask logical ops like `AndVMask, OrVMask, XorVMask`. And if only v0 and v31 mask registers are defined, the corresponding C2 nodes will not be generated correctly because of the register pressure[3]. > When we use only v0 and v31 as mask registers, jtreg testing of Byte128VectorTests.java[4] with `-XX:+PrintAssembly` and `-XX:LogFile` will not emit the expected rvv mask instruction.
Instead, the following compilation failure log[3] is generated: > > > > > > > > > So we define v30 and v31 as mask registers too, and `AndVMask` will emit C2 JIT code like: > > vloadmask V0, V1 > vloadmask V30, V2 > vmask_and V0, V30, V0 > > We also modified the implementation of `spill_copy_vector_stack_to_stack` so that it no longer occupies the v0 register. In addition to that, we changed some nodes like `vasr/vlsl/vlsr/vstring_x/varray_x/vclearArray_x`, which use v0 internally, so that C2 is aware that they use v0. > > ## vector load/store - predicated & blend operation > > Jtreg testing of Byte128VectorTests.java[4] with -XX:+PrintOptoAssembly and -XX:LogFile will print the following compilation log, which is generated by predicated vector load/store: > > 152 B21: # out( B22 ) <- in( B20 ) Freq: 0.499984 > 152 vmask_gen_L V0, R12 > 162 loadV_masked V1, V0, [R10] > 16e storeV_masked [R11], V0, V1 > > > And `VectorBlend` will generate the following compilation log (part of rotate operation): > > 1ea vlsrBS V6, V1, V3 V0 > 1fe vlslBS V5, V1, V2 V0 > 212 vor.vv V2, V5, V6 #@vor > 21a vloadmask V0, V4 > 222 vmerge_vvm V1, V1, V2 # vector blend > 22a bgeu R9, R30, B56 #@cmpU_branch P=0.000001 C=-1.000000 > > > At the same time, we added the predicated nodes of `RShiftV/LShiftV/URShiftV`. Since there was some code duplication with the corresponding nodes in non-masked form, a small refactoring was done.
> > > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc > [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java > [3] https://github.com/openjdk/jdk/blob/0deb648985b018653ccdaf193dc13b3cf21c088a/src/hotspot/share/opto/chaitin.cpp#L526 > [4] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Byte128VectorTests.java > > ### Testing: > > qemu with UseRVV: > - [x] Tier1 tests (release) > - [x] Tier2 tests (release) > - [x] Tier3 tests (release) > - [x] test/jdk/jdk/incubator/vector (release/fastdebug) Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: Remove vRegMaskNoV0 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12682/files - new: https://git.openjdk.org/jdk/pull/12682/files/0bb95839..d338904d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=26 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=25-26 Stats: 23 lines in 2 files changed: 1 ins; 16 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/12682.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12682/head:pull/12682 PR: https://git.openjdk.org/jdk/pull/12682 From rcastanedalo at openjdk.org Mon Apr 24 08:53:53 2023 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Mon, 24 Apr 2023 08:53:53 GMT Subject: RFR: 8298189: Regression in SPECjvm2008-MonteCarlo for pre-Cascade Lake Intel processors In-Reply-To: References: <-rGBBQHk5en1kP23u9akxzvubL5KEC_s73H6Ox2Yk4U=.42502cee-d135-491e-aa90-2d5634db4df9@github.com> Message-ID: On Mon, 24 Apr 2023 08:40:33 GMT, Aleksey Shipilev wrote: > This looks okay to me. Thanks for reviewing, Aleksey!
------------- PR Comment: https://git.openjdk.org/jdk/pull/13605#issuecomment-1519649285 From dzhang at openjdk.org Mon Apr 24 09:03:12 2023 From: dzhang at openjdk.org (Dingli Zhang) Date: Mon, 24 Apr 2023 09:03:12 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v28] In-Reply-To: References: Message-ID: > HI, > > We have added support for vector add mask instructions, please take a look and have some reviews. Thanks a lot! > This patch will add support of vector add/sub/mul/div mask version. It was implemented by referring to RVV v1.0 [1]. > > ## Load/Store/Cmp Mask > `VectorLoadMask, VectorMaskCmp, VectorStoreMask` will implement the mask datapath. We can see where the data is passed in the compilation log with `jdk/incubator/vector/Byte128VectorTests.java`: > > 218 loadV V1, [R7] # vector (rvv) > 220 vloadmask V0, V1 > ... > 23c vmaskcmp_rvv_masked V0, V4, V5, V0, V1, #0 > 24c vstoremask V1, V0 > 258 storeV [R7], V1 # vector (rvv) > > > The corresponding generated jit assembly: > > # loadV > 0x000000400c8ef958: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef95c: vle8.v v1,(t2) > > # vloadmask > 0x000000400c8ef960: vsetvli t0,zero,e8,m1,tu, > 0x000000400c8ef964: vmsne.vx v0,v1,zero > > # vmaskcmp_rvv_masked > 0x000000400c8ef97c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef980: vmclr.m v1 > 0x000000400c8ef984: vmseq.vv v1,v4,v5,v0.t > 0x000000400c8ef988: vmv1r.v v0,v1 > > # vstoremask > 0x000000400c8ef98c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef990: vmv.v.x v1,zero > 0x000000400c8ef994: vmerge.vim v1,v1,1,v0 > > > ## Masked vector arithmetic instructions (e.g.
vadd) > AddMaskTestMerge case: > > import jdk.incubator.vector.IntVector; > import jdk.incubator.vector.VectorMask; > import jdk.incubator.vector.VectorOperators; > import jdk.incubator.vector.VectorSpecies; > > public class AddMaskTestMerge { > > static final VectorSpecies SPECIES = IntVector.SPECIES_128; > static final int SIZE = 1024; > static int[] a = new int[SIZE]; > static int[] b = new int[SIZE]; > static int[] r = new int[SIZE]; > static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false}; > static { > for (int i = 0; i < SIZE; i++) { > a[i] = i; > b[i] = i; > } > } > > static void workload(int idx) { > VectorMask vmask = VectorMask.fromArray(SPECIES, c, 0); > IntVector av = IntVector.fromArray(SPECIES, a, idx); > IntVector bv = IntVector.fromArray(SPECIES, b, idx); > av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx); > } > > public static void main(String[] args) { > for (int i = 0; i < 30_0000; i++) { > for (int j = 0; j < SIZE; j += SPECIES.length()) { > workload(j); > } > } > } > } > > > This test case is reduced from existing jtreg vector tests Int128VectorTests.java[2]. This test case corresponds to the add instruction of the vector mask version and other instructions are similar. > > Before this patch, the compilation log will not print RVV-related instructions. Now the compilation log is as follows: > > > 0ae B10: # out( B25 B11 ) <- in( B9 ) Freq: 0.999991 > 0ae loadV V1, [R31] # vector (rvv) > 0b6 vloadmask V0, V2 > 0be vadd.vv V3, V1, V0 #@vaddI_masked > 0c6 lwu R28, [R7, #124] # loadN, compressed ptr, #@loadN ! 
Field: AddMaskTestMerge.r > 0ca decode_heap_oop R28, R28 #@decodeHeapOop > 0cc lwu R7, [R28, #12] # range, #@loadRange > 0d0 NullCheck R28 > > > And the jit code is as follows: > > > 0x000000400c823cee: vsetvli t0,zero,e32,m1,tu,mu > 0x000000400c823cf2: vle32.v v1,(t6) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::intoArray at 43 (line 3228) > ; - AddMaskTestMerge::workload at 46 (line 25) > 0x000000400c823cf6: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c823cfa: vmsne.vx v0,v2,zero ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.VectorMask::fromArray at 47 (line 208) > ; - AddMaskTestMerge::workload at 7 (line 22) > 0x000000400c823cfe: vsetvli t0,zero,e32,m1,tu,mu > 0x000000400c823d02: vadd.vv v3,v3,v1,v0.t ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::lanewiseTemplate at 192 (line 834) > ; - jdk.incubator.vector.Int128Vector::lanewise at 9 (line 291) > ; - jdk.incubator.vector.Int128Vector::lanewise at 4 (line 41) > ; - AddMaskTestMerge::workload at 39 (line 25) > > > ## Mask register allocation & mask bit operation > Since v0 is to be used as a mask register in spec[1], sometimes we need two vmasks to do the vector mask logical ops like `AndVMask, OrVMask, XorVMask`. And if only v0 and v31 mask registers are defined, the corresponding C2 nodes will not be generated correctly because of the register pressure[3]. > When we use only v0 and v31 as mask registers, jtreg testing of Byte128VectorTests.java[4] with `-XX:+PrintAssembly` and `-XX:LogFile` will not emit the expected rvv mask instruction.
Instead, the following compilation failure log[3] is generated: > > > > > > > > > So we define v30 and v31 as mask registers too, and `AndVMask` will emit C2 JIT code like: > > vloadmask V0, V1 > vloadmask V30, V2 > vmask_and V0, V30, V0 > > We also modified the implementation of `spill_copy_vector_stack_to_stack` so that it no longer occupies the v0 register. In addition to that, we changed some nodes like `vasr/vlsl/vlsr/vstring_x/varray_x/vclearArray_x`, which use v0 internally, so that C2 is aware that they use v0. > > ## vector load/store - predicated & blend operation > > Jtreg testing of Byte128VectorTests.java[4] with -XX:+PrintOptoAssembly and -XX:LogFile will print the following compilation log, which is generated by predicated vector load/store: > > 152 B21: # out( B22 ) <- in( B20 ) Freq: 0.499984 > 152 vmask_gen_L V0, R12 > 162 loadV_masked V1, V0, [R10] > 16e storeV_masked [R11], V0, V1 > > > And `VectorBlend` will generate the following compilation log (part of rotate operation): > > 1ea vlsrBS V6, V1, V3 V0 > 1fe vlslBS V5, V1, V2 V0 > 212 vor.vv V2, V5, V6 #@vor > 21a vloadmask V0, V4 > 222 vmerge_vvm V1, V1, V2 # vector blend > 22a bgeu R9, R30, B56 #@cmpU_branch P=0.000001 C=-1.000000 > > > At the same time, we added the predicated nodes of `RShiftV/LShiftV/URShiftV`. Since there was some code duplication with the corresponding nodes in non-masked form, a small refactoring was done.
> > > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc > [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java > [3] https://github.com/openjdk/jdk/blob/0deb648985b018653ccdaf193dc13b3cf21c088a/src/hotspot/share/opto/chaitin.cpp#L526 > [4] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Byte128VectorTests.java > > ### Testing: > > qemu with UseRVV: > - [x] Tier1 tests (release) > - [x] Tier2 tests (release) > - [x] Tier3 tests (release) > - [x] test/jdk/jdk/incubator/vector (release/fastdebug) Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: Add vRegMaskNoV0 and modify vmask load/store avl ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12682/files - new: https://git.openjdk.org/jdk/pull/12682/files/d338904d..3f0a3bc4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=27 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=26-27 Stats: 20 lines in 2 files changed: 16 ins; 1 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/12682.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12682/head:pull/12682 PR: https://git.openjdk.org/jdk/pull/12682 From chagedorn at openjdk.org Mon Apr 24 09:08:59 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 24 Apr 2023 09:08:59 GMT Subject: RFR: 8303466: C2: failed: malformed control flow. Limit type made precise with MaxL/MinL [v9] In-Reply-To: References: Message-ID: On Mon, 24 Apr 2023 07:33:51 GMT, Emanuel Peter wrote: >> **Context** >> >> During `PhaseIdealLoop::do_unroll`, we hack the loop-limit, and subtract `stride` from it. We have to prevent underflow on that subtract. Currently, we do this with a `CMoveI`. The problem with this: `CMoveI` is not smart enough to generate a precise type. For example, there are many cases where the input types get better, and underflow is not possible anymore. 
But the `CMoveI` does not detect this, and still has type `min_jint..hi`. >> >> We have the same issue in `PhaseIdealLoop::adjust_limit`, where we use `CMoveL` to implement long max/min. The types are not as precise as they could and should be. >> >> **Problem** >> >> The imprecise type is used for the zero-trip-guard. It does not fold to false, even though the data-path into the post loop does constant fold to `TOP`. The graph breaks, and assert `malformed control flow` triggers. >> >> Details: In these cases, we have the super-unrolled main-loop (SuperWord'ed, then further unrolled) directly leading to a vectorized post-loop. The effect is that there is no `region/phi` merging main-exit and main-zero-trip-guard. So the types are already more narrow here. It may be possible that the values are such that we find out that we should never enter the vectorized post-loop. But if data finds out and control does not, we get a broken graph. >> Note: we have pre-loop. Then a main-loop and vectorized post loop. Then we merge the main-zero-trip-guard. And at the end we have the scalar post loop. >> >> I have already recently fixed a bug around this `CMoveI`. https://github.com/openjdk/jdk/commit/5a4945c0d95423d0ab07762c915e9cb4d3c66abb I would now like to have a more satisfactory fix, that properly propagates the types. >> >> **Solution** >> >> `PhaseIdealLoop::adjust_limit` already converts the limit from int to long, and does all computations in long, including taking max/min with a `CMoveL`. I now use the so far unused `MaxL/MinL`. I implemented some missing `Value/Identity` components for it. Since `MaxL/MinL` is not implemented in the backend, I just expand it in macro-expansion to a `CMoveL`. At that point the loop-opts are over, and it is most likely ok that we do not make the types more precise after this. 
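For readers following the limit adjustment, here is a minimal sketch in plain Java of why the subtraction is done in long and then clamped to the int range. It is not C2 code: `LimitClampSketch`, `naiveLimit`, and `adjustedLimit` are hypothetical names, and `Math.min`/`Math.max` on longs only stand in for the `MinL`/`MaxL` nodes.

```java
final class LimitClampSketch {
    // Plain int subtraction: wraps around silently when limit - stride underflows.
    static int naiveLimit(int limit, int stride) {
        return limit - stride;
    }

    // Hypothetical illustration of the long-based computation:
    // subtract in long (cannot overflow for int inputs), then clamp to the int range.
    static int adjustedLimit(int limit, int stride) {
        long wide = (long) limit - (long) stride;
        long clamped = Math.max((long) Integer.MIN_VALUE,
                                Math.min((long) Integer.MAX_VALUE, wide));
        return (int) clamped; // safe: clamped is within the int range
    }

    public static void main(String[] args) {
        int limit = Integer.MIN_VALUE + 1;
        int stride = 4;
        System.out.println("naive:    " + naiveLimit(limit, stride));
        System.out.println("adjusted: " + adjustedLimit(limit, stride));
    }
}
```

With `limit = Integer.MIN_VALUE + 1` and `stride = 4`, the int subtraction wraps around to a large positive value, while the long version saturates at `Integer.MIN_VALUE`.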
>> >> I take the same approach for `PhaseIdealLoop::do_unroll`: convert limits to long, do subtraction in long, take `MinL/MaxL` to clamp it to the int-range (prevent subtraction underflow). >> >> **Discussion** >> >> This solution seems much cleaner to me, and I hope that we will see fewer bugs caused by imprecise types in the limit computation. Those were often due to the `CMove` not being smart enough to analyze all inputs: it would have to recognize a multitude of patterns for the Cmp inputs and the direct inputs to the CMove. We currently do not do that, but just take the union of the input types, which is very imprecise. >> >> There is a bit of an overhead here: We use longs even though we only want to have int values. But I think we should prefer a clean implementation here, with correct type computation. The performance impact is probably non-existent on 64-bit machines anyway. >> >> **Caveat** >> >> I found some cases with the same assert `malformed control flow` that are most likely skeleton/assertion predicate bugs [JDK-8288981](https://bugs.openjdk.org/browse/JDK-8288981). Some of those cases were new patterns, for example where we PreMainPost a main loop. >> >> I hope that this fix here at least reduces the frequency of failures significantly. >> >> **Testing** >> >> I added 2 regression tests. Our fuzzer seems to spit out examples regularly, so that gives us extra coverage. >> >> Tested up to `tier5` and stress testing. Performance testing **running...** >> >> **Future Work** >> >> We should implement `MaxL/MinL` in the backend. We should also use them during parsing. This would also allow us to `SuperWord` the instruction, on the platforms that support it. >> >> Should we add such an assert during IGVN? I think after IGVN, we should never have a `MultiBranchNode` that does not have the required number of outputs, right? We could add it to `VerifyIterativeGVN`.
> > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > Apply suggestions from code review (Tobias) > > Co-authored-by: Tobias Hartmann Updates look good! src/hotspot/share/opto/addnode.hpp line 356: > 354: int min_opcode() const { return Op_MinL; } > 355: virtual Node* Identity(PhaseGVN* phase); > 356: virtual Node* Ideal(PhaseGVN *phase, bool can_reshape); Suggestion: virtual Node* Ideal(PhaseGVN* phase, bool can_reshape); ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13269#pullrequestreview-1397508282 PR Review Comment: https://git.openjdk.org/jdk/pull/13269#discussion_r1174994089 From dzhang at openjdk.org Mon Apr 24 09:12:10 2023 From: dzhang at openjdk.org (Dingli Zhang) Date: Mon, 24 Apr 2023 09:12:10 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v29] In-Reply-To: References: Message-ID: > Hi, > > We have added support for vector add mask instructions, please take a look and have some reviews. Thanks a lot! > This patch adds support for the masked versions of vector add/sub/mul/div. It was implemented by referring to RVV v1.0 [1]. > > ## Load/Store/Cmp Mask > `VectorLoadMask, VectorMaskCmp, VectorStoreMask` will implement the mask datapath. We can see where the data is passed in the compilation log with `jdk/incubator/vector/Byte128VectorTests.java`: > > 218 loadV V1, [R7] # vector (rvv) > 220 vloadmask V0, V1 > ... > 23c vmaskcmp_rvv_masked V0, V4, V5, V0, V1, #0 > 24c vstoremask V1, V0 > 258 storeV [R7], V1 # vector (rvv) > > > The corresponding generated jit assembly:
> > # loadV > 0x000000400c8ef958: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef95c: vle8.v v1,(t2) > > # vloadmask > 0x000000400c8ef960: vsetvli t0,zero,e8,m1,tu, > 0x000000400c8ef964: vmsne.vx v0,v1,zero > > # vmaskcmp_rvv_masked > 0x000000400c8ef97c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef980: vmclr.m v1 > 0x000000400c8ef984: vmseq.vv v1,v4,v5,v0.t > 0x000000400c8ef988: vmv1r.v v0,v1 > > # vstoremask > 0x000000400c8ef98c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef990: vmv.v.x v1,zero > 0x000000400c8ef994: vmerge.vim v1,v1,1,v0 > > > ## Masked vector arithmetic instructions (e.g. vadd) > AddMaskTestMerge case: > > import jdk.incubator.vector.IntVector; > import jdk.incubator.vector.VectorMask; > import jdk.incubator.vector.VectorOperators; > import jdk.incubator.vector.VectorSpecies; > > public class AddMaskTestMerge { > > static final VectorSpecies SPECIES = IntVector.SPECIES_128; > static final int SIZE = 1024; > static int[] a = new int[SIZE]; > static int[] b = new int[SIZE]; > static int[] r = new int[SIZE]; > static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false}; > static { > for (int i = 0; i < SIZE; i++) { > a[i] = i; > b[i] = i; > } > } > > static void workload(int idx) { > VectorMask vmask = VectorMask.fromArray(SPECIES, c, 0); > IntVector av = IntVector.fromArray(SPECIES, a, idx); > IntVector bv = IntVector.fromArray(SPECIES, b, idx); > av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx); > } > > public static void main(String[] args) { > for (int i = 0; i < 30_0000; i++) { > for (int j = 0; j < SIZE; j += SPECIES.length()) { > workload(j); > } > } > } > } > > > This test case is reduced from existing jtreg vector tests Int128VectorTests.java[2]. This test case corresponds to the add instruction of the vector mask version and other instructions are similar. > > Before this patch, the compilation log will not print RVV-related instructions. 
Now the compilation log is as follows: > > > 0ae B10: # out( B25 B11 ) <- in( B9 ) Freq: 0.999991 > 0ae loadV V1, [R31] # vector (rvv) > 0b6 vloadmask V0, V2 > 0be vadd.vv V3, V1, V0 #@vaddI_masked > 0c6 lwu R28, [R7, #124] # loadN, compressed ptr, #@loadN ! Field: AddMaskTestMerge.r > 0ca decode_heap_oop R28, R28 #@decodeHeapOop > 0cc lwu R7, [R28, #12] # range, #@loadRange > 0d0 NullCheck R28 > > > And the jit code is as follows: > > > 0x000000400c823cee: vsetvli t0,zero,e32,m1,tu,mu > 0x000000400c823cf2: vle32.v v1,(t6) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::intoArray at 43 (line 3228) > ; - AddMaskTestMerge::workload at 46 (line 25) > 0x000000400c823cf6: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c823cfa: vmsne.vx v0,v2,zero ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.VectorMask::fromArray at 47 (line 208) > ; - AddMaskTestMerge::workload at 7 (line 22) > 0x000000400c823cfe: vsetvli t0,zero,e32,m1,tu,mu > 0x000000400c823d02: vadd.vv v3,v3,v1,v0.t ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::lanewiseTemplate at 192 (line 834) > ; - jdk.incubator.vector.Int128Vector::lanewise at 9 (line 291) > ; - jdk.incubator.vector.Int128Vector::lanewise at 4 (line 41) > ; - AddMaskTestMerge::workload at 39 (line 25) > > > ## Mask register allocation & mask bit operation > Since v0 is used as the mask register in the spec[1], we sometimes need two mask registers to perform vector mask logical ops such as `AndVMask, OrVMask, XorVMask`. If only v0 and v31 are defined as mask registers, the corresponding C2 nodes are not generated correctly because of register pressure[3]. > When we use only v0 and v31 as mask registers, jtreg testing of Byte128VectorTests.java[4] with `-XX:+PrintAssembly` and `-XX:LogFile` will not emit the expected rvv mask instruction.
Instead, the following compilation failure log[3] is generated: > > > > > > > > > So we define v30 and v31 as mask registers too, and `AndVMask` will emit C2 JIT code like: > > vloadmask V0, V1 > vloadmask V30, V2 > vmask_and V0, V30, V0 > > We also modified the implementation of `spill_copy_vector_stack_to_stack` so that it no longer occupies the v0 register. In addition, we changed some nodes such as `vasr/vlsl/vlsr/vstring_x/varray_x/vclearArray_x`, which use v0 internally, so that C2 is aware that they use v0. > > ## vector load/store - predicated & blend operation > > Jtreg testing of Byte128VectorTests.java[4] with -XX:+PrintOptoAssembly and -XX:LogFile will print the following compilation log, which is generated by predicated vector load/store: > > 152 B21: # out( B22 ) <- in( B20 ) Freq: 0.499984 > 152 vmask_gen_L V0, R12 > 162 loadV_masked V1, V0, [R10] > 16e storeV_masked [R11], V0, V1 > > > And `VectorBlend` will generate the following compilation log (part of a rotate operation): > > 1ea vlsrBS V6, V1, V3 V0 > 1fe vlslBS V5, V1, V2 V0 > 212 vor.vv V2, V5, V6 #@vor > 21a vloadmask V0, V4 > 222 vmerge_vvm V1, V1, V2 # vector blend > 22a bgeu R9, R30, B56 #@cmpU_branch P=0.000001 C=-1.000000 > > > At the same time, we added the predicated nodes of `RShiftV/LShiftV/URShiftV`. Because this duplicated code from the corresponding non-masked nodes, a small refactoring was done.
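As a reading aid for the masked `lanewise` call in the `AddMaskTestMerge` case and for `VectorBlend`, the lane semantics can be modelled with plain scalar loops. This is only a sketch under the assumption that unset lanes of a masked lanewise operation keep the value of the first operand, as the Vector API documents; `MaskedLaneSemantics`, `maskedAdd`, and `blend` are hypothetical names and do not use `jdk.incubator.vector`.

```java
import java.util.Arrays;

final class MaskedLaneSemantics {
    // Scalar model of av.lanewise(VectorOperators.ADD, bv, mask):
    // set lanes receive a + b, unset lanes keep the lane of a.
    static int[] maskedAdd(int[] a, int[] b, boolean[] mask) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = mask[i] ? a[i] + b[i] : a[i];
        }
        return r;
    }

    // Scalar model of a.blend(b, mask): set lanes take b, unset lanes keep a.
    static int[] blend(int[] a, int[] b, boolean[] mask) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = mask[i] ? b[i] : a[i];
        }
        return r;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4};
        int[] b = {10, 20, 30, 40};
        boolean[] m = {true, false, true, false};
        System.out.println(Arrays.toString(maskedAdd(a, b, m)));
        System.out.println(Arrays.toString(blend(a, b, m)));
    }
}
```

A model like this can serve as the reference result when checking the output of a masked vector kernel lane by lane.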
> > > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc > [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java > [3] https://github.com/openjdk/jdk/blob/0deb648985b018653ccdaf193dc13b3cf21c088a/src/hotspot/share/opto/chaitin.cpp#L526 > [4] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Byte128VectorTests.java > > ### Testing: > > qemu with UseRVV: > - [x] Tier1 tests (release) > - [x] Tier2 tests (release) > - [x] Tier3 tests (release) > - [x] test/jdk/jdk/incubator/vector (release/fastdebug) Dingli Zhang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 22 additional commits since the last revision: - Merge remote-tracking branch 'upstream/master' into JDK-8302908 - Add vRegMaskNoV0 and modify vmask load/store avl - Remove vRegMaskNoV0 - Add some comments and modify some operand order - Merge remote-tracking branch 'upstream/master' into JDK-8302908 - Optimize vmaskall and modify some format Add vRegMaskNoV0 - Add some vector pseudo instructions - rename fp and modify VectorMaskGen node - Rename vmaskcmp_DF - Fix build fail after JDK-8305008 - ... 
and 12 more: https://git.openjdk.org/jdk/compare/fcfc8118...89b9c1fa ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12682/files - new: https://git.openjdk.org/jdk/pull/12682/files/3f0a3bc4..89b9c1fa Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=28 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=27-28 Stats: 502 lines in 16 files changed: 479 ins; 4 del; 19 mod Patch: https://git.openjdk.org/jdk/pull/12682.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12682/head:pull/12682 PR: https://git.openjdk.org/jdk/pull/12682 From rcastanedalo at openjdk.org Mon Apr 24 09:12:53 2023 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Mon, 24 Apr 2023 09:12:53 GMT Subject: RFR: 8287087: C2: perform SLP reduction analysis on-demand [v3] In-Reply-To: References: Message-ID: On Fri, 14 Apr 2023 12:47:39 GMT, Roberto Castañeda Lozano wrote: >> Reduction analysis finds cycles of reduction operations within loops. The result of this analysis is used by SLP auto-vectorization (to vectorize reductions if deemed profitable) and by x64 instruction matching (to select specialized scalar floating-point `Math.min()/max()` implementations). Currently, reduction analysis is applied early (before loop unrolling), and the result is propagated through loop unrolling by marking nodes and loops with special reduction flags. Applying reduction analysis early is efficient, but propagating the results correctly through loop unrolling and arbitrary graph transformations is challenging and often leads to inconsistent node-loop reduction flag states, some of which have led to actual miscompilations in the past (see [JDK-8261147](https://bugs.openjdk.org/browse/JDK-8261147) and [JDK-8279622](https://bugs.openjdk.org/browse/JDK-8279622)). >> >> This changeset postpones reduction analysis to the point where its results are actually used.
To do so, it generalizes the analysis to find reduction cycles on unrolled loops: >> >> ![reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226725587-b7d68509-3717-4bbe-8d54-f9a105853fda.png) >> >> The generalized analysis precludes the need to maintain and propagate node and loop reduction flags through arbitrary IR transformations, reducing the risk of miscompilations due to invalidation of the analysis results. The generalization is slightly more costly than the current analysis, but still negligible in micro- and general benchmarks. >> >> ## Performance Benefits >> >> As a side benefit, the proposed generalization is able to find more reductions, increasing the scope of auto-vectorization and the performance of x64 floating-point `Math.min()/max()` in multiple scenarios. >> >> ### Increased Auto-Vectorization Scope >> >> There are two main scenarios in which the proposed changeset enables further auto-vectorization: >> >> #### Reductions Using Global Accumulators >> >> >> public class Foo { >> int acc = 0; >> (..) >> void reduce(int[] array) { >> for (int i = 0; i < array.length; i++) { >> acc += array[i]; >> } >> } >> } >> >> Initially, such reductions are wrapped by load and store nodes, which defeats the current reduction analysis. However, after unrolling and other optimizations are applied, the reduction becomes recognizable by the proposed analysis: >> >> ![global-reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226745351-33494e40-7c07-4a8b-8bf6-d3a96e84b1c2.png) >> >> #### Reductions of partially unrolled loops >> >> >> (..) >> for (int i = 0; i < array.length / 2; i++) { >> acc += array[2*i]; >> acc += array[2*i + 1]; >> } >> (..) >> >> >> These reductions are manually unrolled from the beginning, so the current reduction analysis fails to find them, while the proposed analysis is able to detect them as if they were unrolled automatically. 
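The two loop shapes discussed here can be put side by side with a plain local-accumulator reduction. The following is a self-contained sketch (the class and method names are hypothetical, not taken from the PR's tests); it only shows that the three shapes compute the same sum, and makes no claim about which of them C2 vectorizes on a given platform.

```java
final class ReductionShapes {
    int acc = 0; // field ("global") accumulator, as in the Foo example

    // Reduction into a field: initially wrapped by a load and a store of acc.
    void reduceIntoField(int[] array) {
        for (int i = 0; i < array.length; i++) {
            acc += array[i];
        }
    }

    // Manually unrolled reduction; assumes an even array length.
    static int reducePartiallyUnrolled(int[] array) {
        int acc = 0;
        for (int i = 0; i < array.length / 2; i++) {
            acc += array[2 * i];
            acc += array[2 * i + 1];
        }
        return acc;
    }

    // Baseline: local accumulator, one element per iteration.
    static int reduceLocal(int[] array) {
        int acc = 0;
        for (int i = 0; i < array.length; i++) {
            acc += array[i];
        }
        return acc;
    }

    public static void main(String[] args) {
        int[] data = new int[1024];
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }
        ReductionShapes shapes = new ReductionShapes();
        shapes.reduceIntoField(data);
        System.out.println(shapes.acc + " " + reducePartiallyUnrolled(data) + " " + reduceLocal(data));
    }
}
```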
>> >> ### Increased Performance of x64 Floating-Point `Math.min()/max()` >> >> Besides the above scenarios, the proposed generalization allows the x64 matcher to select specialized floating-point `Math.min()/max()` implementations for reductions in non-counted and outer loops (see the new micro-benchmarks in `FpMinMaxIntrinsics.java` for more details). >> >> ## Implementation details >> >> The generalized reduction analysis finds reductions in a loop by looking for chains of reduction operators of the same node type starting and finishing on each phi node in the loop. To avoid a combinatorial explosion, the analysis assumes that all nodes in a chain are connected via the same edge index, which is realistic because chains usually consist of identical nodes cloned by loop unrolling. This assumption allows the analysis to test only two paths for each examined phi node. A failure of this assumption (e.g. as illustrated in test case `testReductionOnPartiallyUnrolledLoopWithSwappedInputs` from `TestGeneralizedReductions.java`) results in missed vectorization but does not affect correctness. Note that the same-index assumption can only fail in cases where current auto-vectorization would also fail to vectorize (manually unrolled loops). >> >> The changeset implements a more relaxed version of the reduction analysis for x64 matching, suitable for queries on single nodes. This analysis is run only in the presence of `[Min|Max][F|D]` nodes. >> >> ## Alternative approaches >> >> A complication results from edge swapping in the nodes cloned by loop unrolling (see [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/addnode.cpp#L123) and [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/mulnode.cpp#L113)), which can lead to reduction chains connected via different input indices.
This is addressed by tracking whether nodes have swapped edges and adjusting the explored input indices in the reduction analysis accordingly. An alternative (proposed by @eme64 and @jatin-bhateja ) is to replace this changeset's linear chain finding approach with some form of general path-finding algorithm. This alternative would preclude the need for tracking edge swapping at a potentially higher computational cost. The following table summarizes the pros and cons of the current mainline approach, this changeset, and the proposed alternative:
>>
>> | approach | correctness | efficiency | effectiveness | conceptual complexity |
>> | -------- | ----------- | ---------- | ------------- | --------------------- |
>> | mainline (current) | hard to establish due to need of maintaining reduction flags through arbitrary graph transformations (has led to miscompilations, see JDK-8261147 and JDK-8279622) | high | low (misses substantial reduction vectorization opportunities) | high (requires maintaining non-local reduction node state) |
>> | this changeset | easy to establish since client transformations operate on the same graph that is analyzed | medium (limited search for chains of nodes) | high (finds all reduction cycles except for partially unrolled loops with manually-swapped inputs) | medium (requires maintaining local swapped-edge node state) |
>> | general search | easy to establish (same as above) | low (general search), particularly for x64 matching where the analysis runs once for every node in a chain | high (similar to above but also covering manually-swapped inputs) | low (no node state required, use of well-known graph search algorithms) |
>>
>> Since the efficiency-conceptual complexity trade-off between this changeset and the general search approach is not obvious, I propose to integrate this changeset (which strikes a balance between the two) and investigate the latter one in a follow-up RFE.
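To make the same-index assumption from the implementation details concrete, here is a toy sketch of a chain search over a hand-built graph. Everything in it is an assumption made for illustration: `Node`, `chainViaIndex`, `isReductionPhi`, and `buildChain` are hypothetical and unrelated to the actual C2 classes, and the sketch deliberately leaves out the swapped-edge tracking that the changeset adds.

```java
final class ChainSearchSketch {
    // Toy IR node: in[0] is unused (control), in[1] and in[2] are data inputs.
    static final class Node {
        final String op;
        final Node[] in = new Node[3];
        Node(String op) { this.op = op; }
    }

    // Follows a single input index from the phi's backedge input and reports
    // whether the walk returns to the phi through nodes of one operator type.
    static boolean chainViaIndex(Node phi, int index, int maxLength) {
        Node first = phi.in[2];
        Node cur = first;
        for (int steps = 0; steps < maxLength; steps++) {
            if (cur == null || !cur.op.equals(first.op)) {
                return false; // left the chain of identical operators
            }
            Node next = cur.in[index];
            if (next == phi) {
                return true; // cycle closed: phi -> op -> ... -> op -> phi
            }
            cur = next;
        }
        return false; // search bound exceeded
    }

    // Only two paths are tested per phi: all edges via in[1] or all via in[2].
    static boolean isReductionPhi(Node phi, int maxLength) {
        return chainViaIndex(phi, 1, maxLength) || chainViaIndex(phi, 2, maxLength);
    }

    // Builds phi -> AddI -> ... -> AddI -> phi; optionally the second AddI
    // uses the other input index, mimicking manually swapped inputs.
    static Node buildChain(int length, boolean swapSecond) {
        Node phi = new Node("Phi");
        phi.in[1] = new Node("ConI");
        Node prev = phi;
        for (int i = 0; i < length; i++) {
            Node add = new Node("AddI");
            int chainIdx = (swapSecond && i == 1) ? 2 : 1;
            add.in[chainIdx] = prev;
            add.in[3 - chainIdx] = new Node("LoadI");
            prev = add;
        }
        phi.in[2] = prev;
        return phi;
    }

    public static void main(String[] args) {
        System.out.println(isReductionPhi(buildChain(4, false), 8));
        System.out.println(isReductionPhi(buildChain(4, true), 8));
    }
}
```

In this toy model a chain with one swapped input is simply not recognized, which mirrors the trade-off in the table: a missed opportunity rather than a wrong result. A general path search would find it at the cost of exploring more paths.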
>> >> ## Testing >> >> ### Functionality >> >> - tier1-5 (linux-x64, linux-aarch64, windows-x64, macosx-x64, and macosx-aarch64). >> - fuzzing (12 h. on linux-x64 and linux-aarch64). >> >> ##### TestGeneralizedReductions.java >> >> Tests the new scenarios in which vectorization occurs. These tests are restricted to 64-bit platforms, since I do not have access to 32-bit ones. `testReductionOnPartiallyUnrolledLoop` has been observed to fail on [linux-x86](https://github.com/robcasloz/jdk/actions/runs/4478959520/jobs/7873827856#logs) due to missing vectorization. If anyone wants to have a look and derive the necessary IR test framework preconditions for the test to pass on linux-x86, I am happy to lift the 64-bit restriction. >> >> ##### TestFpMinMaxReductions.java >> >> Tests the matching of floating-point max/min implementations in x64. >> >> ##### TestSuperwordFailsUnrolling.java >> >> This test file is updated to ensure auto-vectorization is never triggered, because this changeset would otherwise enable it and defeat the purpose of the test. >> >> ### Performance >> >> #### General Benchmarks >> >> The changeset does not cause any performance regression on the DaCapo, SPECjvm 2008, and SPECjbb2015 benchmark suites for linux-x64 and linux-aarch64. >> >> #### Micro-benchmarks >> >> The changeset extends two existing files with additional micro-benchmarks that show the benefit of the generalized reduction analysis ([full results](https://github.com/openjdk/jdk/files/11039207/microbenchmark-results.ods)). >> >> >> ##### VectorReduction.java >> >> These micro-benchmarks are first adjusted to actually vectorize in the mainline approach, since they suffered from the global-accumulator limitation. Two micro-benchmarks are added to exercise vectorization in the presence of global accumulators and partially unrolled loops.
Running `VectorReduction.java` on an x64 (Cascade Lake) machine confirms the expectations: compared to mainline (with the adjustment mentioned above), this changeset yields similar performance results except for `andRedIOnGlobalAccumulator` and `andRedIPartiallyUnrolled`, where the changeset improves performance by 2.4x in both cases. >> >> ##### MaxIntrinsics.java >> >> This file is extended with four new micro-benchmarks. Running it on the same machine as above shows that the changeset does not affect the performance of the existing micro-benchmarks, and improves moderately to substantially the performance of the new ones (because it allows the x64 matcher to select a floating-point `Math.min()` implementation that is specialized for reduction min operations):
>>
>> | micro-benchmark | speedup compared to mainline |
>> | --- | --- |
>> | `fMinReduceInOuterLoop` | 1.1x |
>> | `fMinReduceNonCounted` | 2.3x |
>> | `fMinReduceGlobalAccumulator` | 2.4x |
>> | `fMinReducePartiallyUnrolled` | 3.9x |
>>
>> ## Acknowledgments >> >> Thanks to @danielogh for making it possible to test this improvement with confidence ([JDK-8294715](https://bugs.openjdk.org/browse/JDK-8294715)) and to @TobiHartmann, @chhagedorn, @vnkozlov and @eme64 for discussions and useful feedback. > > Roberto Castañeda Lozano has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 31 commits: > > - Use is_marked_reduction() in new SLP code > - Merge master > - Emit Node::Flag_has_swapped_edges in IGV graphs > - Merge master > - Relax the reduction cycle search bound > - Remove redundant IR check precondition > - Use SuperWord members in reduction marking > - Remove redundant opcode checks > - Do not run test in x86-32 > - Update existing test instead of removing it > - ...
and 21 more: https://git.openjdk.org/jdk/compare/a3137c75...d9fc7b22 Hi @jatin-bhateja, I have written a qualitative comparison between this PR and the generic search approach proposed by @eme64 and you (see **Alternative approaches** section in the updated PR description). I hope the comparison clarifies and motivates the plan outlined in https://github.com/openjdk/jdk/pull/13120#discussion_r1166778814. Please let me know whether you agree with that plan so that we can move forward with this RFE, [JDK-8302673](https://bugs.openjdk.org/browse/JDK-8302673), and also [JDK-8302652](https://bugs.openjdk.org/browse/JDK-8302652). ------------- PR Comment: https://git.openjdk.org/jdk/pull/13120#issuecomment-1519682518 From thartmann at openjdk.org Mon Apr 24 09:19:45 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 24 Apr 2023 09:19:45 GMT Subject: RFR: 8298189: Regression in SPECjvm2008-MonteCarlo for pre-Cascade Lake Intel processors In-Reply-To: <-rGBBQHk5en1kP23u9akxzvubL5KEC_s73H6Ox2Yk4U=.42502cee-d135-491e-aa90-2d5634db4df9@github.com> References: <-rGBBQHk5en1kP23u9akxzvubL5KEC_s73H6Ox2Yk4U=.42502cee-d135-491e-aa90-2d5634db4df9@github.com> Message-ID: On Mon, 24 Apr 2023 06:05:21 GMT, Roberto Castañeda Lozano wrote: > The `mov + inc/dec -> lea` subset of the peephole rules introduced by [JDK-8283699](https://bugs.openjdk.org/browse/JDK-8283699) has been found to cause minor regressions for some common benchmarks on Intel microarchitectures earlier than Cascade Lake. This changeset limits their application to Intel Cascade Lake and microarchitectures with full ALU support for lea (`VM_Version::supports_fast_3op_lea()`), where these peephole rules have been confirmed to be beneficial. The adjustment speeds up SPECjvm2008's MonteCarlo benchmark by between 0.1% and 2.7% on pre-Cascade Lake microarchitectures (Haswell-DT, Coffee Lake-B) across different garbage collectors (G1, ZGC).
It additionally yields a speedup of 2.1% on SPECjvm2008's Derby benchmark when using G1 on Coffee Lake-B. > > Thanks to @ericcaspole for discussions and helping out with benchmarking. > > #### Testing > > ##### Functionality > > - tier1-5 (windows-x64, linux-x64, macosx-x64; release and debug mode). > - Checked that the expected combination of peephole rules is enabled for all microarchitectures supported by Intel's Software Development Emulator 9.0. > > ##### Performance > > - Tested performance on a set of standard benchmark suites (DaCapo, SPECjbb2015, SPECjvm2008), different Intel microarchitectures (Haswell-DT, Coffee Lake-B, Cascade Lake, Ice Lake-SP) and operating systems (linux-x64, windows-x64, and macosx-x64). No significant change was observed besides the improvements mentioned above. Good job in nailing this down! The fix looks reasonable to me. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13605#pullrequestreview-1397530117 From thartmann at openjdk.org Mon Apr 24 09:23:43 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 24 Apr 2023 09:23:43 GMT Subject: RFR: 8304676: [vectorapi] x86_32: Crash in Assembler::kmovql(Address, KRegister) In-Reply-To: References: Message-ID: On Sun, 23 Apr 2023 19:51:58 GMT, Quan Anh Mai wrote: > Hi, > > Can I have reviews for this patch which fixes crashes during PrintOptoAssembly of KRegister spilling code. The reason is that we miss the check for nullptr cbuf. > > Thanks a lot. Looks good. ------------- Marked as reviewed by thartmann (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/13603#pullrequestreview-1397537655 From rcastanedalo at openjdk.org Mon Apr 24 09:27:46 2023 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Mon, 24 Apr 2023 09:27:46 GMT Subject: RFR: 8298189: Regression in SPECjvm2008-MonteCarlo for pre-Cascade Lake Intel processors In-Reply-To: References: <-rGBBQHk5en1kP23u9akxzvubL5KEC_s73H6Ox2Yk4U=.42502cee-d135-491e-aa90-2d5634db4df9@github.com> Message-ID: On Mon, 24 Apr 2023 09:17:19 GMT, Tobias Hartmann wrote: > Good job in nailing this down! The fix looks reasonable to me. Thanks, Tobias! ------------- PR Comment: https://git.openjdk.org/jdk/pull/13605#issuecomment-1519711177 From dzhang at openjdk.org Mon Apr 24 09:33:56 2023 From: dzhang at openjdk.org (Dingli Zhang) Date: Mon, 24 Apr 2023 09:33:56 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v30] In-Reply-To: References: Message-ID: <0tNGyJpKg41ABKULU6hlvXtwJGyU3AFjTFwVBZSkiFs=.d0e85a63-00d3-468a-b3ed-531b707b72ba@github.com> > Hi, > > We have added support for vector add mask instructions, please take a look and have some reviews. Thanks a lot! > This patch adds support for the masked versions of vector add/sub/mul/div. It was implemented by referring to RVV v1.0 [1]. > > ## Load/Store/Cmp Mask > `VectorLoadMask, VectorMaskCmp, VectorStoreMask` will implement the mask datapath. We can see where the data is passed in the compilation log with `jdk/incubator/vector/Byte128VectorTests.java`: > > 218 loadV V1, [R7] # vector (rvv) > 220 vloadmask V0, V1 > ... > 23c vmaskcmp_rvv_masked V0, V4, V5, V0, V1, #0 > 24c vstoremask V1, V0 > 258 storeV [R7], V1 # vector (rvv) > > > The corresponding generated jit assembly:
>
>     # loadV
>     0x000000400c8ef958: vsetvli t0,zero,e8,m1,tu,mu
>     0x000000400c8ef95c: vle8.v v1,(t2)
>
>     # vloadmask
>     0x000000400c8ef960: vsetvli t0,zero,e8,m1,tu,
>     0x000000400c8ef964: vmsne.vx v0,v1,zero
>
>     # vmaskcmp_rvv_masked
>     0x000000400c8ef97c: vsetvli t0,zero,e8,m1,tu,mu
>     0x000000400c8ef980: vmclr.m v1
>     0x000000400c8ef984: vmseq.vv v1,v4,v5,v0.t
>     0x000000400c8ef988: vmv1r.v v0,v1
>
>     # vstoremask
>     0x000000400c8ef98c: vsetvli t0,zero,e8,m1,tu,mu
>     0x000000400c8ef990: vmv.v.x v1,zero
>     0x000000400c8ef994: vmerge.vim v1,v1,1,v0
>
>
> ## Masked vector arithmetic instructions (e.g. vadd)
> AddMaskTestMerge case:
>
>     import jdk.incubator.vector.IntVector;
>     import jdk.incubator.vector.VectorMask;
>     import jdk.incubator.vector.VectorOperators;
>     import jdk.incubator.vector.VectorSpecies;
>
>     public class AddMaskTestMerge {
>
>         static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128;
>         static final int SIZE = 1024;
>         static int[] a = new int[SIZE];
>         static int[] b = new int[SIZE];
>         static int[] r = new int[SIZE];
>         static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false};
>         static {
>             for (int i = 0; i < SIZE; i++) {
>                 a[i] = i;
>                 b[i] = i;
>             }
>         }
>
>         static void workload(int idx) {
>             VectorMask<Integer> vmask = VectorMask.fromArray(SPECIES, c, 0);
>             IntVector av = IntVector.fromArray(SPECIES, a, idx);
>             IntVector bv = IntVector.fromArray(SPECIES, b, idx);
>             av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx);
>         }
>
>         public static void main(String[] args) {
>             for (int i = 0; i < 30_0000; i++) {
>                 for (int j = 0; j < SIZE; j += SPECIES.length()) {
>                     workload(j);
>                 }
>             }
>         }
>     }
>
>
> This test case is reduced from the existing jtreg vector test Int128VectorTests.java[2]. It corresponds to the masked version of the vector add instruction; other instructions are similar.
>
> Before this patch, the compilation log will not print RVV-related instructions. Now the compilation log is as follows:
>
>
>     0ae B10: # out( B25 B11 ) <- in( B9 ) Freq: 0.999991
>     0ae loadV V1, [R31] # vector (rvv)
>     0b6 vloadmask V0, V2
>     0be vadd.vv V3, V1, V0 #@vaddI_masked
>     0c6 lwu R28, [R7, #124] # loadN, compressed ptr, #@loadN ! Field: AddMaskTestMerge.r
>     0ca decode_heap_oop R28, R28 #@decodeHeapOop
>     0cc lwu R7, [R28, #12] # range, #@loadRange
>     0d0 NullCheck R28
>
>
> And the jit code is as follows:
>
>
>     0x000000400c823cee: vsetvli t0,zero,e32,m1,tu,mu
>     0x000000400c823cf2: vle32.v v1,(t6) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0}
>                                         ; - jdk.incubator.vector.IntVector::intoArray at 43 (line 3228)
>                                         ; - AddMaskTestMerge::workload at 46 (line 25)
>     0x000000400c823cf6: vsetvli t0,zero,e8,m1,tu,mu
>     0x000000400c823cfa: vmsne.vx v0,v2,zero ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0}
>                                         ; - jdk.incubator.vector.VectorMask::fromArray at 47 (line 208)
>                                         ; - AddMaskTestMerge::workload at 7 (line 22)
>     0x000000400c823cfe: vsetvli t0,zero,e32,m1,tu,mu
>     0x000000400c823d02: vadd.vv v3,v3,v1,v0.t ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0}
>                                         ; - jdk.incubator.vector.IntVector::lanewiseTemplate at 192 (line 834)
>                                         ; - jdk.incubator.vector.Int128Vector::lanewise at 9 (line 291)
>                                         ; - jdk.incubator.vector.Int128Vector::lanewise at 4 (line 41)
>                                         ; - AddMaskTestMerge::workload at 39 (line 25)
>
>
> ## Mask register allocation & mask bit operation
> Since v0 is to be used as a mask register in spec[1], sometimes we need two vmasks to do vector mask logical ops like `AndVMask, OrVMask, XorVMask`. If only v0 and v31 are defined as mask registers, the corresponding C2 nodes will not be generated correctly because of register pressure[3].
> When we use only v0 and v31 as mask registers, jtreg testing of Byte128VectorTests.java[4] with `-XX:+PrintAssembly` and `-XX:LogFile` will not emit the expected rvv mask instruction. Instead, the following compilation failure log[3] is generated:
>
>
>
> So we define v30 and v31 as mask registers too, and `AndVMask` will emit C2 JIT code like:
>
>     vloadmask V0, V1
>     vloadmask V30, V2
>     vmask_and V0, V30, V0
>
> We also modified the implementation of `spill_copy_vector_stack_to_stack ` so that it no longer occupies the v0 register. In addition, we changed some nodes such as `vasr/vlsl/vlsr/vstring_x/varray_x/vclearArray_x`, which use v0 internally, so that C2 is aware that they use v0.
>
> ## vector load/store - predicated & blend operation
>
> Jtreg testing of Byte128VectorTests.java[4] with -XX:+PrintOptoAssembly and -XX:LogFile will print the following compilation log, which is generated by predicated vector load/store:
>
>     152 B21: # out( B22 ) <- in( B20 ) Freq: 0.499984
>     152 vmask_gen_L V0, R12
>     162 loadV_masked V1, V0, [R10]
>     16e storeV_masked [R11], V0, V1
>
>
> And `VectorBlend` will generate the following compilation log (part of a rotate operation):
>
>     1ea vlsrBS V6, V1, V3 V0
>     1fe vlslBS V5, V1, V2 V0
>     212 vor.vv V2, V5, V6 #@vor
>     21a vloadmask V0, V4
>     222 vmerge_vvm V1, V1, V2 # vector blend
>     22a bgeu R9, R30, B56 #@cmpU_branch P=0.000001 C=-1.000000
>
>
> At the same time, we added the predicated nodes of `RShiftV/LShiftV/URShiftV`. There was some code duplication with the corresponding nodes in non-masked form, so a small refactoring was done.
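
For readers unfamiliar with the lane semantics behind the masked add and `VectorBlend` nodes discussed above, the following is a minimal scalar sketch in plain Java. It is not code from the patch: the class and method names (`MaskedLaneSketch`, `maskedAdd`, `blend`) are hypothetical, and it only models the documented Vector API behaviour of `lanewise(op, v, mask)` (unset lanes keep the value of the receiver vector) and `blend(v, mask)` (set lanes take the value of the argument vector).

```java
import java.util.Arrays;

// Hypothetical scalar model of masked lane operations; not part of the patch.
final class MaskedLaneSketch {

    // Models av.lanewise(VectorOperators.ADD, bv, mask):
    // set lanes receive a[i] + b[i], unset lanes keep a[i].
    static int[] maskedAdd(int[] a, int[] b, boolean[] mask) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = mask[i] ? a[i] + b[i] : a[i];
        }
        return r;
    }

    // Models av.blend(bv, mask):
    // set lanes receive b[i], unset lanes keep a[i].
    static int[] blend(int[] a, int[] b, boolean[] mask) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = mask[i] ? b[i] : a[i];
        }
        return r;
    }

    public static void main(String[] args) {
        int[] a = {0, 1, 2, 3};
        int[] b = {10, 20, 30, 40};
        boolean[] mask = {true, false, true, false};
        System.out.println(Arrays.toString(maskedAdd(a, b, mask))); // [10, 1, 32, 3]
        System.out.println(Arrays.toString(blend(a, b, mask)));     // [10, 1, 30, 3]
    }
}
```

The four-element arrays correspond to the four `int` lanes of a 128-bit species such as the `IntVector.SPECIES_128` used in the test case above.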
>
>
> [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc
> [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java
> [3] https://github.com/openjdk/jdk/blob/0deb648985b018653ccdaf193dc13b3cf21c088a/src/hotspot/share/opto/chaitin.cpp#L526
> [4] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Byte128VectorTests.java
>
> ### Testing:
>
> qemu with UseRVV:
> - [x] Tier1 tests (release)
> - [x] Tier2 tests (release)
> - [x] Tier3 tests (release)
> - [x] test/jdk/jdk/incubator/vector (release/fastdebug)

Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision:

  Rename x0 to zr and modify avl of vloadmask/vstoremask

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/12682/files
  - new: https://git.openjdk.org/jdk/pull/12682/files/89b9c1fa..58fb42e5

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=29
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=28-29

  Stats: 8 lines in 1 file changed: 0 ins; 0 del; 8 mod
  Patch: https://git.openjdk.org/jdk/pull/12682.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/12682/head:pull/12682

PR: https://git.openjdk.org/jdk/pull/12682

From duke at openjdk.org Mon Apr 24 09:36:46 2023
From: duke at openjdk.org (Ilya Korennoy)
Date: Mon, 24 Apr 2023 09:36:46 GMT
Subject: RFR: 8299226: compiler/profiling/TestTypeProfiling.java: make it not throw if C2 is not enabled.
In-Reply-To:
References:
Message-ID: <3iM-vkwplJeQZi2YpT15meLWPypFq-gTQCMhJuOTPao=.8dc4e7a2-bb15-4940-aad1-333eace23e14@github.com>

On Fri, 10 Mar 2023 22:47:03 GMT, Vladimir Kozlov wrote:

>> I looked again at the Level2RecompilationTest and OSRFailureLevel4Test and it seems that I was wrong: in these tests the checks in `main()` are different.
>>
>> So the checks only need to be removed from TestTypeProfiling.
>
> @ikorennoy, I added a comment with a question to Evgeny about how he hit the issue so we can reproduce it.
> These tests can't be run without JTREG, which filters them. Based on his answer we either close the bug as not an issue or try to find out why filtering does not work in his configuration.
>
> If you want to do a cleanup to remove unneeded checks and simplify `@requires`, I would suggest filing a separate RFE.

@vnkozlov seems like there are no updates in Jira. What can we do now?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/12981#issuecomment-1519730570

From epeter at openjdk.org Mon Apr 24 09:53:55 2023
From: epeter at openjdk.org (Emanuel Peter)
Date: Mon, 24 Apr 2023 09:53:55 GMT
Subject: RFR: 8303466: C2: failed: malformed control flow. Limit type made precise with MaxL/MinL [v10]
In-Reply-To:
References:
Message-ID:

> **Context**
>
> During `PhaseIdealLoop::do_unroll`, we hack the loop-limit, and subtract `stride` from it. We have to prevent underflow on that subtract. Currently, we do this with a `CMoveI`. The problem with this: `CMoveI` is not smart enough to generate a precise type. For example, there are many cases where the input types get better, and underflow is not possible anymore. But the `CMoveI` does not detect this, and still has type `min_jint..hi`.
>
> We have the same issue in `PhaseIdealLoop::adjust_limit`, where we use `CMoveL` to implement long max/min. The types are not as precise as they could and should be.
>
> **Problem**
>
> The imprecise type is used for the zero-trip-guard. It does not fold to false, even though the data-path into the post loop does constant fold to `TOP`. The graph breaks, and assert `malformed control flow` triggers.
>
> Details: In these cases, we have the super-unrolled main-loop (SuperWord'ed, then further unrolled) directly leading to a vectorized post-loop. The effect is that there is no `region/phi` merging main-exit and main-zero-trip-guard. So the types are already more narrow here. It may be possible that the values are such that we find out that we should never enter the vectorized post-loop. But if data finds out and control does not, we get a broken graph.
> Note: we have a pre-loop, then a main-loop and a vectorized post-loop. Then we merge the main-zero-trip-guard, and at the end we have the scalar post-loop.
>
> I have already recently fixed a bug around this `CMoveI`. https://github.com/openjdk/jdk/commit/5a4945c0d95423d0ab07762c915e9cb4d3c66abb I would now like to have a more satisfactory fix that properly propagates the types.
>
> **Solution**
>
> `PhaseIdealLoop::adjust_limit` already converts the limit from int to long, and does all computations in long, including taking max/min with a `CMoveL`. I now use the so far unused `MaxL/MinL`. I implemented some missing `Value/Identity` components for it. Since `MaxL/MinL` is not implemented in the backend, I just expand it in macro-expansion to a `CMoveL`. At that point the loop-opts are over, and it is most likely ok that we do not make the types more precise after this.
>
> I take the same approach for `PhaseIdealLoop::do_unroll`: convert limits to long, do subtraction in long, take `MinL/MaxL` to clamp it to the int-range (prevent subtraction underflow).
>
> **Discussion**
>
> This solution seems much cleaner to me, and I hope that we will see fewer bugs because of imprecise types in the limit computation, which were often due to the `CMove` not being smart enough to analyze all inputs (it would have to recognize a multitude of patterns, for the Cmp inputs and the direct inputs to the CMove - we currently do not do that, but just take the union of the input types - this is very imprecise).
>
> There is a bit of an overhead here: We use longs even though we only want to have int values. But I think we should prefer a clean implementation here, with correct type computation. The performance impact is probably non-existent on 64-bit machines anyway.
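
As an illustration of the arithmetic idea described in the **Solution** paragraph (widen to long, subtract, clamp back into the int range), here is a small Java sketch. It is not the C2 implementation: `LimitClampSketch` and `adjustedLimit` are hypothetical names, and the sketch only models the value computation, not the IR nodes or their types.

```java
// Hypothetical value-level model of "subtract in long, clamp with max/min";
// it does not reflect how C2 builds MaxL/MinL nodes.
final class LimitClampSketch {

    // Computes limit - stride without int wrap-around by widening to long
    // and clamping the result into [Integer.MIN_VALUE, Integer.MAX_VALUE].
    static int adjustedLimit(int limit, int stride) {
        long wide = (long) limit - (long) stride;
        long clamped = Math.max((long) Integer.MIN_VALUE,
                                Math.min((long) Integer.MAX_VALUE, wide));
        return (int) clamped;
    }

    public static void main(String[] args) {
        System.out.println(adjustedLimit(100, 4));                   // 96
        System.out.println(adjustedLimit(Integer.MIN_VALUE + 1, 4)); // -2147483648
        // Plain int arithmetic wraps around instead:
        System.out.println((Integer.MIN_VALUE + 1) - 4);             // 2147483645
    }
}
```

The clamped result can be given a precise range directly from the ranges of its inputs, which is the property the description attributes to `MaxL/MinL` and finds missing in a conditional move.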
>
> **Caveat**
>
> I found some cases with the same assert `malformed control flow` that are most likely skeleton/assertion predicate bugs [JDK-8288981](https://bugs.openjdk.org/browse/JDK-8288981). Some of those cases were new patterns, for example where we PreMainPost a main loop.
>
> I hope that this fix here at least reduces the frequency of failures significantly.
>
> **Testing**
>
> I added 2 regression tests. Our fuzzer seems to spit out examples regularly, so that gives us extra coverage.
>
> Tested up to `tier5` and stress testing. Performance testing **running...**
>
> **Future Work**
>
> We should implement `MaxL/MinL` in the backend. We should also use them during parsing. This would also allow us to `SuperWord` the instruction, on the platforms that support it.
>
> Should we add such an assert during IGVN? I think after IGVN, we should never have a `MultiBranchNode` that does not have the required number of outputs, right? We could add it to `VerifyIterativeGVN`.

Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:

  Update src/hotspot/share/opto/addnode.hpp (Christian)

  Co-authored-by: Christian Hagedorn

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/13269/files
  - new: https://git.openjdk.org/jdk/pull/13269/files/4a13b00b..d22dc2b5

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=13269&range=09
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13269&range=08-09

  Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
  Patch: https://git.openjdk.org/jdk/pull/13269.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/13269/head:pull/13269

PR: https://git.openjdk.org/jdk/pull/13269

From aph at openjdk.org Mon Apr 24 10:13:44 2023
From: aph at openjdk.org (Andrew Haley)
Date: Mon, 24 Apr 2023 10:13:44 GMT
Subject: RFR: 8301739: AArch64: Add optimized rules for vector compare with immediate for SVE [v3]
In-Reply-To:
References:
Message-ID: <6ErYjQMejOeSMbZcq9THGES3rxYF2fKxM9vN9__5D6s=.8b3ff40a-81ae-43d8-a83a-94f1d28b9d44@github.com>

On Sun, 23 Apr 2023 04:21:14 GMT, Chang Peng wrote:

> @theRealAph The ReplicateXNodes used in match rules are also different in these two macros. I think we needn't merge these two macros since this will introduce some if-else statements which will reduce the readability.

It won't introduce if-else, surely. Not if you make the parts that are different into parameters. Then a reviewer can immediately see which parts of the macros are actually different, rather than needing to do a deep study.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/13200#discussion_r1175073413

From aph at openjdk.org Mon Apr 24 11:22:57 2023
From: aph at openjdk.org (Andrew Haley)
Date: Mon, 24 Apr 2023 11:22:57 GMT
Subject: RFR: 8305056: Avoid unaligned access in emit_intX methods if it's unsupported [v7]
In-Reply-To:
References:
Message-ID:

On Fri, 21 Apr 2023 13:50:57 GMT, Vladimir Kempik wrote:

>> Please review this change which attempts to eliminate unaligned memory stores generated by emit_int16/32/64 methods on some platforms.
>>
>> Primary aim is risc-v platform. But I had to change some code in ppc/arm32/x86 to prevent possible perf degradation.
>
> Vladimir Kempik has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 10 additional commits since the last revision:
>
> - Merge
> - Rewrite few more helpers
> - Add APH's suggestions and remove some whitespace
> - Rework the fix to use memcpy in codeBuffer.hpp
> - Reduce code duplication
> - Fix 32-bit archs
> - Fix typo
> - change long to ulong in type convertion
> - Fix includes
> - 8305056: Avoid unaligned access in emit_intX methods if not enabled

Looks good to me.

-------------

Marked as reviewed by aph (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/13227#pullrequestreview-1397743486

From vkempik at openjdk.org Mon Apr 24 11:35:00 2023
From: vkempik at openjdk.org (Vladimir Kempik)
Date: Mon, 24 Apr 2023 11:35:00 GMT
Subject: Integrated: 8305056: Avoid unaligned access in emit_intX methods if it's unsupported
In-Reply-To:
References:
Message-ID:

On Wed, 29 Mar 2023 12:40:23 GMT, Vladimir Kempik wrote:

> Please review this change which attempts to eliminate unaligned memory stores generated by emit_int16/32/64 methods on some platforms.
>
> Primary aim is risc-v platform. But I had to change some code in ppc/arm32/x86 to prevent possible perf degradation.

This pull request has now been integrated.

Changeset: f239695b
Author: Vladimir Kempik
URL: https://git.openjdk.org/jdk/commit/f239695b5670bfbc251430d2f7e632804894a8bc
Stats: 19 lines in 1 file changed: 8 ins; 5 del; 6 mod

8305056: Avoid unaligned access in emit_intX methods if it's unsupported

Reviewed-by: aph

-------------

PR: https://git.openjdk.org/jdk/pull/13227

From jbhateja at openjdk.org Mon Apr 24 11:52:53 2023
From: jbhateja at openjdk.org (Jatin Bhateja)
Date: Mon, 24 Apr 2023 11:52:53 GMT
Subject: RFR: 8304676: [vectorapi] x86_32: Crash in Assembler::kmovql(Address, KRegister)
In-Reply-To:
References:
Message-ID:

On Sun, 23 Apr 2023 19:51:58 GMT, Quan Anh Mai wrote:

> Hi,
>
> Can I have reviews for this patch which fixes crashes during PrintOptoAssembly of KRegister spilling code. The reason is that we miss the check for nullptr cbuf.
>
> Thanks a lot.

LGTM.

-------------

Marked as reviewed by jbhateja (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/13603#pullrequestreview-1397804431

From dzhang at openjdk.org Mon Apr 24 11:58:12 2023
From: dzhang at openjdk.org (Dingli Zhang)
Date: Mon, 24 Apr 2023 11:58:12 GMT
Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v31]
In-Reply-To:
References:
Message-ID:

Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision:

  Modify vector mask logical ops

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/12682/files
  - new: https://git.openjdk.org/jdk/pull/12682/files/58fb42e5..003ca54a

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=30
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=29-30

  Stats: 6 lines in 1 file changed: 3 ins; 0 del; 3 mod
  Patch: https://git.openjdk.org/jdk/pull/12682.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/12682/head:pull/12682

PR: https://git.openjdk.org/jdk/pull/12682

From jbhateja at openjdk.org Mon Apr 24 12:08:06 2023
From: jbhateja at openjdk.org (Jatin Bhateja)
Date: Mon, 24 Apr 2023 12:08:06 GMT
Subject: RFR: 8287087: C2: perform SLP reduction analysis on-demand [v2]
In-Reply-To:
References:
Message-ID:

On Fri, 14 Apr 2023 12:42:46 GMT, Roberto Castañeda Lozano wrote:

> I tried out your suggestion but unfortunately, the bookkeeping code (marking/storing candidate nodes and their predecessors in the tentative reduction chain) became more complex than the simplifications it enabled.

Hi @robcasloz , Ok, my concern was that post path detection we have two occurrences of _original_input_ , this can be optimized if we bookkeep node encountered during path detection.
Kindly consider the attached rough patch, which records the nodes during path detection.
[reduction_patch.txt](https://github.com/openjdk/jdk/files/11310121/reduction_patch.txt)

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/13120#discussion_r1175182260

From jbhateja at openjdk.org Mon Apr 24 12:08:04 2023
From: jbhateja at openjdk.org (Jatin Bhateja)
Date: Mon, 24 Apr 2023 12:08:04 GMT
Subject: RFR: 8287087: C2: perform SLP reduction analysis on-demand [v3]
In-Reply-To:
References:
Message-ID:

On Fri, 14 Apr 2023 12:47:39 GMT, Roberto Castañeda Lozano wrote:

>> Reduction analysis finds cycles of reduction operations within loops. The result of this analysis is used by SLP auto-vectorization (to vectorize reductions if deemed profitable) and by x64 instruction matching (to select specialized scalar floating-point `Math.min()/max()` implementations). Currently, reduction analysis is applied early (before loop unrolling), and the result is propagated through loop unrolling by marking nodes and loops with special reduction flags. Applying reduction analysis early is efficient, but propagating the results correctly through loop unrolling and arbitrary graph transformations is challenging and often leads to inconsistent node-loop reduction flag states, some of which have led to actual miscompilations in the past (see [JDK-8261147](https://bugs.openjdk.org/browse/JDK-8261147) and [JDK-8279622](https://bugs.openjdk.org/browse/JDK-8279622)).
>>
>> This changeset postpones reduction analysis to the point where its results are actually used. To do so, it generalizes the analysis to find reduction cycles on unrolled loops:
>>
>> ![reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226725587-b7d68509-3717-4bbe-8d54-f9a105853fda.png)
>>
>> The generalized analysis precludes the need to maintain and propagate node and loop reduction flags through arbitrary IR transformations, reducing the risk of miscompilations due to invalidation of the analysis results. The generalization is slightly more costly than the current analysis, but still negligible in micro- and general benchmarks.
>>
>> ## Performance Benefits
>>
>> As a side benefit, the proposed generalization is able to find more reductions, increasing the scope of auto-vectorization and the performance of x64 floating-point `Math.min()/max()` in multiple scenarios.
>>
>> ### Increased Auto-Vectorization Scope
>>
>> There are two main scenarios in which the proposed changeset enables further auto-vectorization:
>>
>> #### Reductions Using Global Accumulators
>>
>>
>>     public class Foo {
>>       int acc = 0;
>>       (..)
>>       void reduce(int[] array) {
>>         for (int i = 0; i < array.length; i++) {
>>           acc += array[i];
>>         }
>>       }
>>     }
>>
>> Initially, such reductions are wrapped by load and store nodes, which defeats the current reduction analysis. However, after unrolling and other optimizations are applied, the reduction becomes recognizable by the proposed analysis:
>>
>> ![global-reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226745351-33494e40-7c07-4a8b-8bf6-d3a96e84b1c2.png)
>>
>> #### Reductions of partially unrolled loops
>>
>>
>>     (..)
>>     for (int i = 0; i < array.length / 2; i++) {
>>       acc += array[2*i];
>>       acc += array[2*i + 1];
>>     }
>>     (..)
>>
>>
>> These reductions are manually unrolled from the beginning, so the current reduction analysis fails to find them, while the proposed analysis is able to detect them as if they were unrolled automatically.
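
The two loop shapes quoted above can be put side by side in a self-contained program. This is only a sketch for experimentation: the class name `ReductionShapes` and its methods are hypothetical, and running it shows that both shapes compute the same sum, not whether C2 vectorizes them (that would require inspecting the compiler output, for example with the IR test framework mentioned in the description).

```java
// Hypothetical, self-contained versions of the two reduction shapes;
// the program checks results only and says nothing about vectorization.
final class ReductionShapes {

    int acc = 0; // accumulator kept in a field ("global accumulator")

    // Reduction into a field: each iteration loads and stores acc.
    void reduceIntoField(int[] array) {
        for (int i = 0; i < array.length; i++) {
            acc += array[i];
        }
    }

    // Manually (partially) unrolled reduction; assumes an even array length.
    static int reducePartiallyUnrolled(int[] array) {
        int acc = 0;
        for (int i = 0; i < array.length / 2; i++) {
            acc += array[2 * i];
            acc += array[2 * i + 1];
        }
        return acc;
    }

    // Builds the array {0, 1, ..., size - 1}.
    static int[] iota(int size) {
        int[] array = new int[size];
        for (int i = 0; i < size; i++) {
            array[i] = i;
        }
        return array;
    }

    public static void main(String[] args) {
        int[] array = iota(1024);
        ReductionShapes shapes = new ReductionShapes();
        shapes.reduceIntoField(array);
        System.out.println(shapes.acc);                     // 523776
        System.out.println(reducePartiallyUnrolled(array)); // 523776
    }
}
```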
>>
>> ### Increased Performance of x64 Floating-Point `Math.min()/max()`
>>
>> Besides the above scenarios, the proposed generalization allows the x64 matcher to select specialized floating-point `Math.min()/max()` implementations for reductions in non-counted and outer loops (see the new micro-benchmarks in `FpMinMaxIntrinsics.java` for more details).
>>
>> ## Implementation details
>>
>> The generalized reduction analysis finds reductions in a loop by looking for chains of reduction operators of the same node type starting and finishing on each phi node in the loop. To avoid a combinatorial explosion, the analysis assumes that all nodes in a chain are connected via the same edge index, which is realistic because chains usually consist of identical nodes cloned by loop unrolling. This assumption allows the analysis to test only two paths for each examined phi node. A failure of this assumption (e.g. as illustrated in test case `testReductionOnPartiallyUnrolledLoopWithSwappedInputs` from `TestGeneralizedReductions.java`) results in missing vectorization but does not affect correctness. Note that the same-index assumption can only fail in cases where current auto-vectorization would also fail to vectorize (manually unrolled loops).
>>
>> The changeset implements a more relaxed version of the reduction analysis for x64 matching, suitable for queries on single nodes. This analysis is run only in the presence of `[Min|Max][F|D]` nodes.
>>
>> ## Alternative approaches
>>
>> A complication results from edge swapping in the nodes cloned by loop unrolling (see [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/addnode.cpp#L123) and [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/mulnode.cpp#L113)), which can lead to reduction chains connected via different input indices. This is addressed by tracking whether nodes have swapped edges and adjusting the explored input indices in the reduction analysis accordingly. An alternative (proposed by @eme64 and @jatin-bhateja ) is to replace this changeset's linear chain finding approach with some form of general path-finding algorithm. This alternative would preclude the need for tracking edge swapping at a potentially higher computational cost. The following table summarizes the pros and cons of the current mainline approach, this changeset, and the proposed alternative:
>>
>> | approach | correctness | efficiency | effectiveness | conceptual complexity |
>> | -------- | ----------- | ---------- | ------------- | --------------------- |
>> | mainline (current) | hard to establish due to need of maintaining reduction flags through arbitrary graph transformations (has led to miscompilations, see JDK-8261147 and JDK-8279622) | high | low (misses substantial reduction vectorization opportunities) | high (requires maintaining non-local reduction node state) |
>> | this changeset | easy to establish since client transformations operate on the same graph that is analyzed | medium (limited search for chains of nodes) | high (finds all reduction cycles except for partially unrolled loops with manually-swapped inputs) | medium (requires maintaining local swapped-edge node state) |
>> | general search | easy to establish (same as above) | low (general search), particularly for x64 matching where the analysis runs once for every node in a chain | high (similar to above but also covering manually-swapped inputs) | low (no node state required, use of well-known graph search algorithms) |
>>
>> Since the efficiency-conceptual complexity trade-off between this changeset and the general search approach is not obvious, I propose to integrate this changeset (which strikes a balance between the two) and investigate the latter one in a follow-up RFE.
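
To make the same-edge-index chain search from the implementation details more concrete, here is a toy Java model. It is a sketch under strong assumptions and not the code in `superword.cpp`: the `Node` class, the opcode strings, and the methods `chainLength` and `findReductionChain` are all hypothetical, and the swapped-edge tracking that the changeset adds is deliberately left out.

```java
// Toy model of a same-index reduction chain search; all names are hypothetical
// and the real C2 analysis differs in many details.
final class ReductionChainSketch {

    static final class Node {
        final String opcode;
        final Node[] in; // in[0] is unused here (control slot in C2), data inputs are in[1] and in[2]

        Node(String opcode, Node in1, Node in2) {
            this.opcode = opcode;
            this.in = new Node[] {null, in1, in2};
        }
    }

    // Walks from 'last' (the value fed back into the phi) along input 'index',
    // requiring every visited node to have the same opcode as 'last'.
    // Returns the number of operators in the chain, or -1 if the phi is not reached.
    static int chainLength(Node phi, Node last, int index, int maxLength) {
        Node current = last;
        int length = 0;
        while (current != phi) {
            if (current == null || length == maxLength || !current.opcode.equals(last.opcode)) {
                return -1;
            }
            current = current.in[index];
            length++;
        }
        return length;
    }

    // Tests only two paths per phi: all nodes linked via input 1, or all via input 2.
    static int findReductionChain(Node phi, Node last, int maxLength) {
        for (int index = 1; index <= 2; index++) {
            int length = chainLength(phi, last, index, maxLength);
            if (length > 0) {
                return length;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Node phi = new Node("Phi", null, null);
        Node load1 = new Node("LoadI", null, null);
        Node load2 = new Node("LoadI", null, null);

        // acc = (phi + load1) + load2, both adds reach the phi through input 1.
        Node add1 = new Node("AddI", phi, load1);
        Node add2 = new Node("AddI", add1, load2);
        System.out.println(findReductionChain(phi, add2, 8)); // 2

        // Same computation with the first add's inputs swapped: no single index works.
        Node add1Swapped = new Node("AddI", load1, phi);
        Node add2Swapped = new Node("AddI", add1Swapped, load2);
        System.out.println(findReductionChain(phi, add2Swapped, 8)); // -1
    }
}
```

The second case mirrors the limitation stated in the description: when inputs are swapped along the chain, a same-index search misses the reduction, which costs a vectorization opportunity but does not affect correctness.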
>> >> ## Testing >> >> ### Functionality >> >> - tier1-5 (linux-x64, linux-aarch64, windows-x64, macosx-x64, and macosx-aarch64). >> - fuzzing (12 h. on linux-x64 and linux-aarch64). >> >> ##### TestGeneralizedReductions.java >> >> Tests the new scenarios in which vectorization occurs. These tests are restricted to 64-bits platforms, since I do not have access to 32-bits ones. `testReductionOnPartiallyUnrolledLoop` has been observed to fail on [linux-x86](https://github.com/robcasloz/jdk/actions/runs/4478959520/jobs/7873827856#logs) due to missing vectorization. If anyone wants to have a look and derive the necessary IR test framework preconditions for the test to pass on linux-x86, I am happy to lift the 64-bits restriction. >> >> ##### TestFpMinMaxReductions.java >> >> Tests the matching of floating-point max/min implementations in x64. >> >> ##### TestSuperwordFailsUnrolling.java >> >> This test file is updated to ensure auto-vectorization is never triggered, because this changeset would otherwise enable it and defeat the purpose of the test. >> >> ### Performance >> >> #### General Benchmarks >> >> The changeset does not cause any performance regression on the DaCapo, SPECjvm 2008, and SPECjbb2015 benchmark suites for linux-x64 and linux-aarch64. >> >> #### Micro-benchmarks >> >> The changeset extends two existing files with additional micro-benchmarks that show the benefit of the generalized reduction analysis ([full results](https://github.com/openjdk/jdk/files/11039207/microbenchmark-results.ods)). >> >> >> ##### VectorReduction.java >> >> These micro-benchmarks are first adjusted to actually vectorize in the mainline approach, since they suffered from the global-accumulator limitation. Two micro-benchmarks are added to exercise vectorization in the presence of global accumulators and partially unrolled loops. 
Running `VectorReduction.java` on an x64 (Cascade Lake) machine confirms the expectations: compared to mainline (with the adjustment mentioned above), this changeset yields similar performance results except for `andRedIOnGlobalAccumulator` and `andRedIPartiallyUnrolled`, where the changeset improves performance by 2.4x in both cases. >> >> ##### MaxIntrinsics.java >> >> This file is extended with four new micro-benchmarks. Running it on the same machine as above shows that the changeset does not affect the performance of the existing micro-benchmarks, and improves moderately to substantially the performance of the new ones (because it allows the x64 matcher to select a floating-point `Math.min()` implementation that is specialized for reduction min operations): >> >> | micro-benchmark | speedup compared to mainline | >> | --- | --- | >> | `fMinReduceInOuterLoop` | 1.1x | >> | `fMinReduceNonCounted` | 2.3x | >> | `fMinReduceGlobalAccumulator` | 2.4x | >> | `fMinReducePartiallyUnrolled` | 3.9x | >> >> ## Acknowledgments >> >> Thanks to @danielogh for making it possible to test this improvement with confidence ([JDK-8294715](https://bugs.openjdk.org/browse/JDK-8294715)) and to @TobiHartmann, @chhagedorn, @vnkozlov and @eme64 for discussions and useful feedback. > > Roberto Castañeda Lozano has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 31 commits: > > - Use is_marked_reduction() in new SLP code > - Merge master > - Emit Node::Flag_has_swapped_edges in IGV graphs > - Merge master > - Relax the reduction cycle search bound > - Remove redundant IR check precondition > - Use SuperWord members in reduction marking > - Remove redundant opcode checks > - Do not run test in x86-32 > - Update existing test instead of removing it > - ...
and 21 more: https://git.openjdk.org/jdk/compare/a3137c75...d9fc7b22 src/hotspot/share/opto/superword.cpp line 518: > 516: } > 517: // Test that reduction nodes do not have any users in the loop besides their > 518: // reduction cycle predecessors. Slight nomenclature confusion in the comment: I think you meant successors here, since the phi is a successor of the first node and we are using fast_outs in the following loop to check for unique successors. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13120#discussion_r1174991430 From jbhateja at openjdk.org Mon Apr 24 12:39:59 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 24 Apr 2023 12:39:59 GMT Subject: RFR: 8287087: C2: perform SLP reduction analysis on-demand [v3] In-Reply-To: References: Message-ID: <5R3fMh_9EdNFTHTzu_Si-X3HOsuqDPJOafoGKoiQ1XU=.2179d7d8-9610-4bd3-ad06-6f14e8b17884@github.com> On Fri, 14 Apr 2023 12:47:39 GMT, Roberto Castañeda Lozano wrote: >> Reduction analysis finds cycles of reduction operations within loops. The result of this analysis is used by SLP auto-vectorization (to vectorize reductions if deemed profitable) and by x64 instruction matching (to select specialized scalar floating-point `Math.min()/max()` implementations). Currently, reduction analysis is applied early (before loop unrolling), and the result is propagated through loop unrolling by marking nodes and loops with special reduction flags. Applying reduction analysis early is efficient, but propagating the results correctly through loop unrolling and arbitrary graph transformations is challenging and often leads to inconsistent node-loop reduction flag states, some of which have led to actual miscompilations in the past (see [JDK-8261147](https://bugs.openjdk.org/browse/JDK-8261147) and [JDK-8279622](https://bugs.openjdk.org/browse/JDK-8279622)). >> >> This changeset postpones reduction analysis to the point where its results are actually used.
To do so, it generalizes the analysis to find reduction cycles on unrolled loops: >> >> ![reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226725587-b7d68509-3717-4bbe-8d54-f9a105853fda.png) >> >> The generalized analysis precludes the need to maintain and propagate node and loop reduction flags through arbitrary IR transformations, reducing the risk of miscompilations due to invalidation of the analysis results. The generalization is slightly more costly than the current analysis, but still negligible in micro- and general benchmarks. >> >> ## Performance Benefits >> >> As a side benefit, the proposed generalization is able to find more reductions, increasing the scope of auto-vectorization and the performance of x64 floating-point `Math.min()/max()` in multiple scenarios. >> >> ### Increased Auto-Vectorization Scope >> >> There are two main scenarios in which the proposed changeset enables further auto-vectorization: >> >> #### Reductions Using Global Accumulators >> >> >> public class Foo { >> int acc = 0; >> (..) >> void reduce(int[] array) { >> for (int i = 0; i < array.length; i++) { >> acc += array[i]; >> } >> } >> } >> >> Initially, such reductions are wrapped by load and store nodes, which defeats the current reduction analysis. However, after unrolling and other optimizations are applied, the reduction becomes recognizable by the proposed analysis: >> >> ![global-reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226745351-33494e40-7c07-4a8b-8bf6-d3a96e84b1c2.png) >> >> #### Reductions of partially unrolled loops >> >> >> (..) >> for (int i = 0; i < array.length / 2; i++) { >> acc += array[2*i]; >> acc += array[2*i + 1]; >> } >> (..) >> >> >> These reductions are manually unrolled from the beginning, so the current reduction analysis fails to find them, while the proposed analysis is able to detect them as if they were unrolled automatically. 
>> >> ### Increased Performance of x64 Floating-Point `Math.min()/max()` >> >> Besides the above scenarios, the proposed generalization allows the x64 matcher to select specialized floating-point `Math.min()/max()` implementations for reductions in non-counted and outer loops (see the new micro-benchmarks in `FpMinMaxIntrinsics.java` for more details). >> >> ## Implementation details >> >> The generalized reduction analysis finds reductions in a loop by looking for chains of reduction operators of the same node type starting and finishing on each phi node in the loop. To avoid a combinatorial explosion, the analysis assumes that all nodes in a chain are connected via the same edge index, which is realistic because chains usually consist of identical nodes cloned by loop unrolling. This assumption allows the analysis to test only two paths for each examined phi node. A failure of this assumption (e.g. as illustrated in test case `testReductionOnPartiallyUnrolledLoopWithSwappedInputs` from `TestGeneralizedReductions.java`) results in missing vectorization but does not affect correctness. Note that the same-index assumption can only fail in cases where current auto-vectorization would also fail to vectorize (manually unrolled loops). >> >> The changeset implements a more relaxed version of the reduction analysis for x64 matching, suitable for queries on single nodes. This analysis is run only in the presence of `[Min|Max][F|D]` nodes. >> >> ## Alternative approaches >> >> A complication results from edge swapping in the nodes cloned by loop unrolling (see [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/addnode.cpp#L123) and [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/mulnode.cpp#L113)), which can lead to reduction chains connected via different input indices.
This is addressed by tracking whether nodes have swapped edges and adjusting the explored input indices in the reduction analysis accordingly. An alternative (proposed by @eme64 and @jatin-bhateja ) is to replace this changeset's linear chain finding approach with some form of general path-finding algorithm. This alternative would preclude the need for tracking edge swapping at a potentially higher computational cost. The following table summarizes the pros and cons of the current mainline approach, this changeset, and the proposed alternative: >> >> | approach | correctness | efficiency | effectiveness | conceptual complexity | >> | -------- | ----------- | ---------- | ------------- | --------------------- | >> | mainline (current) | hard to establish due to need of maintaining reduction flags through arbitrary graph transformations (has led to miscompilations, see JDK-8261147 and JDK-8279622) | high | low (misses substantial reduction vectorization opportunities) | high (requires maintaining non-local reduction node state) | >> | this changeset | easy to establish since client transformations operate on the same graph that is analyzed | medium (limited search for chains of nodes) | high (finds all reduction cycles except for partially unrolled loops with manually-swapped inputs) | medium (requires maintaining local swapped-edge node state) | >> | general search | easy to establish (same as above) | low (general search), particularly for x64 matching where the analysis runs once for every node in a chain | high (similar to above but also covering manually-swapped inputs) | low (no node state required, use of well-known graph search algorithms) | >> >> Since the efficiency-conceptual complexity trade-off between this changeset and the general search approach is not obvious, I propose to integrate this changeset (which strikes a balance between the two) and investigate the latter one in a follow-up RFE. 
>> >> ## Testing >> >> ### Functionality >> >> - tier1-5 (linux-x64, linux-aarch64, windows-x64, macosx-x64, and macosx-aarch64). >> - fuzzing (12 h. on linux-x64 and linux-aarch64). >> >> ##### TestGeneralizedReductions.java >> >> Tests the new scenarios in which vectorization occurs. These tests are restricted to 64-bits platforms, since I do not have access to 32-bits ones. `testReductionOnPartiallyUnrolledLoop` has been observed to fail on [linux-x86](https://github.com/robcasloz/jdk/actions/runs/4478959520/jobs/7873827856#logs) due to missing vectorization. If anyone wants to have a look and derive the necessary IR test framework preconditions for the test to pass on linux-x86, I am happy to lift the 64-bits restriction. >> >> ##### TestFpMinMaxReductions.java >> >> Tests the matching of floating-point max/min implementations in x64. >> >> ##### TestSuperwordFailsUnrolling.java >> >> This test file is updated to ensure auto-vectorization is never triggered, because this changeset would otherwise enable it and defeat the purpose of the test. >> >> ### Performance >> >> #### General Benchmarks >> >> The changeset does not cause any performance regression on the DaCapo, SPECjvm 2008, and SPECjbb2015 benchmark suites for linux-x64 and linux-aarch64. >> >> #### Micro-benchmarks >> >> The changeset extends two existing files with additional micro-benchmarks that show the benefit of the generalized reduction analysis ([full results](https://github.com/openjdk/jdk/files/11039207/microbenchmark-results.ods)). >> >> >> ##### VectorReduction.java >> >> These micro-benchmarks are first adjusted to actually vectorize in the mainline approach, since they suffered from the global-accumulator limitation. Two micro-benchmarks are added to exercise vectorization in the presence of global accumulators and partially unrolled loops. 
Running `VectorReduction.java` on an x64 (Cascade Lake) machine confirms the expectations: compared to mainline (with the adjustment mentioned above), this changeset yields similar performance results except for `andRedIOnGlobalAccumulator` and `andRedIPartiallyUnrolled`, where the changeset improves performance by 2.4x in both cases. >> >> ##### MaxIntrinsics.java >> >> This file is extended with four new micro-benchmarks. Running it on the same machine as above shows that the changeset does not affect the performance of the existing micro-benchmarks, and improves moderately to substantially the performance of the new ones (because it allows the x64 matcher to select a floating-point `Math.min()` implementation that is specialized for reduction min operations): >> >> | micro-benchmark | speedup compared to mainline | >> | --- | --- | >> | `fMinReduceInOuterLoop` | 1.1x | >> | `fMinReduceNonCounted` | 2.3x | >> | `fMinReduceGlobalAccumulator` | 2.4x | >> | `fMinReducePartiallyUnrolled` | 3.9x | >> >> ## Acknowledgments >> >> Thanks to @danielogh for making it possible to test this improvement with confidence ([JDK-8294715](https://bugs.openjdk.org/browse/JDK-8294715)) and to @TobiHartmann, @chhagedorn, @vnkozlov and @eme64 for discussions and useful feedback. > > Roberto Castañeda Lozano has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 31 commits: > > - Use is_marked_reduction() in new SLP code > - Merge master > - Emit Node::Flag_has_swapped_edges in IGV graphs > - Merge master > - Relax the reduction cycle search bound > - Remove redundant IR check precondition > - Use SuperWord members in reduction marking > - Remove redundant opcode checks > - Do not run test in x86-32 > - Update existing test instead of removing it > - ...
and 21 more: https://git.openjdk.org/jdk/compare/a3137c75...d9fc7b22 > Hi @jatin-bhateja, I have written a qualitative comparison between this PR and the generic search approach proposed by @eme64 and you (see **Alternative approaches** section in the updated PR description). I hope the comparison clarifies and motivates the plan outlined in [#13120 (comment)](https://github.com/openjdk/jdk/pull/13120#discussion_r1166778814). Please let me know whether you agree with that plan so that we can move forward with this RFE, [JDK-8302673](https://bugs.openjdk.org/browse/JDK-8302673), and also [JDK-8302652](https://bugs.openjdk.org/browse/JDK-8302652). Hi @robcasloz, The problem occurs because the Min/Max canonicalizing transformations create new nodes but do not propagate the _has_swapped_edges_ flag. A forward traversal starting from the output of the phi node can avoid edge-swapping issues and can give up discovering a path if any node feeds more than one user. I want to stress that even if _mark_reductions_ detects a set of nodes as part of a reduction chain, SLP may still not vectorize it, e.g. an AddI reduction chain with different constant inputs. Your approach looks good to me: path finding is strict, follows the same edge for path discovery, and fixes several missed reduction scenarios. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13120#issuecomment-1520077842 From dzhang at openjdk.org Mon Apr 24 12:57:03 2023 From: dzhang at openjdk.org (Dingli Zhang) Date: Mon, 24 Apr 2023 12:57:03 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v32] In-Reply-To: References: Message-ID: > Hi, > > We have added support for masked vector add instructions; please take a look and review. Thanks a lot! > This patch adds support for the masked versions of vector add/sub/mul/div. It was implemented by referring to RVV v1.0 [1].
> > ## Load/Store/Cmp Mask > `VectorLoadMask, VectorMaskCmp, VectorStoreMask` will implement the mask datapath. We can see where the data is passed in the compilation log with `jdk/incubator/vector/Byte128VectorTests.java`: > > 218 loadV V1, [R7] # vector (rvv) > 220 vloadmask V0, V1 > ... > 23c vmaskcmp_rvv_masked V0, V4, V5, V0, V1, #0 > 24c vstoremask V1, V0 > 258 storeV [R7], V1 # vector (rvv) > > > The corresponding generated JIT assembly: > > # loadV > 0x000000400c8ef958: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef95c: vle8.v v1,(t2) > > # vloadmask > 0x000000400c8ef960: vsetvli t0,zero,e8,m1,tu, > 0x000000400c8ef964: vmsne.vx v0,v1,zero > > # vmaskcmp_rvv_masked > 0x000000400c8ef97c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef980: vmclr.m v1 > 0x000000400c8ef984: vmseq.vv v1,v4,v5,v0.t > 0x000000400c8ef988: vmv1r.v v0,v1 > > # vstoremask > 0x000000400c8ef98c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef990: vmv.v.x v1,zero > 0x000000400c8ef994: vmerge.vim v1,v1,1,v0 > > > ## Masked vector arithmetic instructions (e.g.
vadd) > AddMaskTestMerge case: > > import jdk.incubator.vector.IntVector; > import jdk.incubator.vector.VectorMask; > import jdk.incubator.vector.VectorOperators; > import jdk.incubator.vector.VectorSpecies; > > public class AddMaskTestMerge { > > static final VectorSpecies SPECIES = IntVector.SPECIES_128; > static final int SIZE = 1024; > static int[] a = new int[SIZE]; > static int[] b = new int[SIZE]; > static int[] r = new int[SIZE]; > static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false}; > static { > for (int i = 0; i < SIZE; i++) { > a[i] = i; > b[i] = i; > } > } > > static void workload(int idx) { > VectorMask vmask = VectorMask.fromArray(SPECIES, c, 0); > IntVector av = IntVector.fromArray(SPECIES, a, idx); > IntVector bv = IntVector.fromArray(SPECIES, b, idx); > av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx); > } > > public static void main(String[] args) { > for (int i = 0; i < 30_0000; i++) { > for (int j = 0; j < SIZE; j += SPECIES.length()) { > workload(j); > } > } > } > } > > > This test case is reduced from existing jtreg vector tests Int128VectorTests.java[2]. This test case corresponds to the add instruction of the vector mask version and other instructions are similar. > > Before this patch, the compilation log will not print RVV-related instructions. Now the compilation log is as follows: > > > 0ae B10: # out( B25 B11 ) <- in( B9 ) Freq: 0.999991 > 0ae loadV V1, [R31] # vector (rvv) > 0b6 vloadmask V0, V2 > 0be vadd.vv V3, V1, V0 #@vaddI_masked > 0c6 lwu R28, [R7, #124] # loadN, compressed ptr, #@loadN ! 
Field: AddMaskTestMerge.r > 0ca decode_heap_oop R28, R28 #@decodeHeapOop > 0cc lwu R7, [R28, #12] # range, #@loadRange > 0d0 NullCheck R28 > > > And the JIT code is as follows: > > > 0x000000400c823cee: vsetvli t0,zero,e32,m1,tu,mu > 0x000000400c823cf2: vle32.v v1,(t6) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::intoArray at 43 (line 3228) > ; - AddMaskTestMerge::workload at 46 (line 25) > 0x000000400c823cf6: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c823cfa: vmsne.vx v0,v2,zero ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.VectorMask::fromArray at 47 (line 208) > ; - AddMaskTestMerge::workload at 7 (line 22) > 0x000000400c823cfe: vsetvli t0,zero,e32,m1,tu,mu > 0x000000400c823d02: vadd.vv v3,v3,v1,v0.t ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::lanewiseTemplate at 192 (line 834) > ; - jdk.incubator.vector.Int128Vector::lanewise at 9 (line 291) > ; - jdk.incubator.vector.Int128Vector::lanewise at 4 (line 41) > ; - AddMaskTestMerge::workload at 39 (line 25) > > > ## Mask register allocation & mask bit operation > Since v0 is to be used as a mask register in the spec[1], we sometimes need two mask registers to perform vector mask logical ops like `AndVMask, OrVMask, XorVMask`. If only v0 and v31 are defined as mask registers, the corresponding C2 nodes will not be generated correctly because of register pressure[3]. > When we use only v0 and v31 as mask registers, jtreg testing of Byte128VectorTests.java[4] with `-XX:+PrintAssembly` and `-XX:LogFile` will not emit the expected RVV mask instruction.
Instead, the following compilation failure log[3] is generated: > > > > > > > > > So we define v30 and v31 as mask registers too, and `AndVMask` will emit C2 JIT code like: > > vloadmask V0, V1 > vloadmask V30, V2 > vmask_and V0, V30, V0 > > We also modified the implementation of `spill_copy_vector_stack_to_stack` so that it no longer occupies the v0 register. In addition, we changed some nodes, such as `vasr/vlsl/vlsr/vstring_x/varray_x/vclearArray_x`, which use v0 internally, so that C2 is aware that they use v0. > > ## vector load/store - predicated & blend operation > > Jtreg testing of Byte128VectorTests.java[4] with -XX:+PrintOptoAssembly and -XX:LogFile will print the following compilation log, which is generated by predicated vector load/store: > > 152 B21: # out( B22 ) <- in( B20 ) Freq: 0.499984 > 152 vmask_gen_L V0, R12 > 162 loadV_masked V1, V0, [R10] > 16e storeV_masked [R11], V0, V1 > > > And `VectorBlend` will generate the following compilation log (part of a rotate operation): > > 1ea vlsrBS V6, V1, V3 V0 > 1fe vlslBS V5, V1, V2 V0 > 212 vor.vv V2, V5, V6 #@vor > 21a vloadmask V0, V4 > 222 vmerge_vvm V1, V1, V2 # vector blend > 22a bgeu R9, R30, B56 #@cmpU_branch P=0.000001 C=-1.000000 > > > At the same time, we added the predicated nodes of `RShiftV/LShiftV/URShiftV`. Since this duplicated some code from the corresponding non-masked nodes, a small refactoring was done.
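For readers less familiar with the lane semantics that the masked `vadd.vv ... v0.t` and the `vmerge_vvm` blend in the logs above have to preserve, the following is a minimal scalar sketch in plain Java. It is not part of the patch: the class and method names are hypothetical, and it models the documented Vector API behaviour (a masked lanewise operation keeps the first operand in unset lanes; a blend takes the second operand in set lanes) with ordinary arrays instead of `jdk.incubator.vector` types.

```java
import java.util.Arrays;

// Hypothetical scalar model of two Vector API operations; not HotSpot or RVV code.
final class MaskedLaneSketch {
    // Models av.lanewise(VectorOperators.ADD, bv, mask):
    // set lanes receive a[i] + b[i], unset lanes keep a[i].
    static int[] maskedAdd(int[] a, int[] b, boolean[] mask) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = mask[i] ? a[i] + b[i] : a[i];
        }
        return r;
    }

    // Models av.blend(bv, mask): set lanes take b[i], unset lanes keep a[i].
    static int[] blend(int[] a, int[] b, boolean[] mask) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = mask[i] ? b[i] : a[i];
        }
        return r;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4};
        int[] b = {10, 20, 30, 40};
        boolean[] m = {true, false, true, false};
        System.out.println(Arrays.toString(maskedAdd(a, b, m))); // prints [11, 2, 33, 4]
        System.out.println(Arrays.toString(blend(a, b, m)));     // prints [10, 2, 30, 4]
    }
}
```

Under this model a masked add is the same as blending the first operand with the unmasked sum, which is why a merging (rather than zeroing) instruction form is the natural match.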
> > > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc > [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java > [3] https://github.com/openjdk/jdk/blob/0deb648985b018653ccdaf193dc13b3cf21c088a/src/hotspot/share/opto/chaitin.cpp#L526 > [4] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Byte128VectorTests.java > > ### Testing: > > qemu with UseRVV: > - [x] Tier1 tests (release) > - [x] Tier2 tests (release) > - [x] Tier3 tests (release) > - [x] test/jdk/jdk/incubator/vector (release/fastdebug) Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: Remove useless BasicType bt ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12682/files - new: https://git.openjdk.org/jdk/pull/12682/files/003ca54a..648861c8 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=31 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=30-31 Stats: 2 lines in 1 file changed: 0 ins; 2 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/12682.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12682/head:pull/12682 PR: https://git.openjdk.org/jdk/pull/12682 From epeter at openjdk.org Mon Apr 24 14:53:58 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 24 Apr 2023 14:53:58 GMT Subject: RFR: 8303466: C2: failed: malformed control flow. Limit type made precise with MaxL/MinL [v11] In-Reply-To: References: Message-ID: > **Context** > > During `PhaseIdealLoop::do_unroll`, we hack the loop-limit, and subtract `stride` from it. We have to prevent underflow on that subtract. Currently, we do this with a `CMoveI`. The problem with this: `CMoveI` is not smart enough to generate a precise type. For example, there are many cases where the input types get better, and underflow is not possible anymore. But the `CMoveI` does not detect this, and still has type `min_jint..hi`. 
> > We have the same issue in `PhaseIdealLoop::adjust_limit`, where we use `CMoveL` to implement long max/min. The types are not as precise as they could and should be. > > **Problem** > > The imprecise type is used for the zero-trip-guard. It does not fold to false, even though the data-path into the post loop does constant fold to `TOP`. The graph breaks, and assert `malformed control flow` triggers. > > Details: In these cases, we have the super-unrolled main-loop (SuperWord'ed, then further unrolled) directly leading to a vectorized post-loop. The effect is that there is no `region/phi` merging main-exit and main-zero-trip-guard. So the types are already more narrow here. It may be possible that the values are such that we find out that we should never enter the vectorized post-loop. But if data finds out and control does not, we get a broken graph. > Note: we have pre-loop. Then a main-loop and vectorized post loop. Then we merge the main-zero-trip-guard. And at the end we have the scalar post loop. > > I have already recently fixed a bug around this `CMoveI`. https://github.com/openjdk/jdk/commit/5a4945c0d95423d0ab07762c915e9cb4d3c66abb I would now like to have a more satisfactory fix, that properly propagates the types. > > **Solution** > > `PhaseIdealLoop::adjust_limit` already converts the limit from int to long, and does all computations in long, including taking max/min with a `CMoveL`. I now use the so far unused `MaxL/MinL`. I implemented some missing `Value/Identity` components for it. Since `MaxL/MinL` is not implemented in the backend, I just expand it in macro-expansion to a `CMoveL`. At that point the loop-opts are over, and it is most likely ok that we do not make the types more precise after this. > > I take the same approach for `PhaseIdealLoop::do_unroll`: convert limits to long, do subtraction in long, take `MinL/MaxL` to clamp it to the int-range (prevent subtraction underflow). 
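The clamping idea described in the Solution can be illustrated with a small scalar sketch. This is only an analogy in plain Java, not the C2 IR change itself: the class and method names are hypothetical, `Math.min`/`Math.max` on `long` stand in for the `MinL/MaxL` nodes, and the sketch says nothing about how types are propagated in the compiler.

```java
// Hypothetical scalar analogy of "subtract in long, then clamp to the int range".
final class UnrollLimitSketch {
    // Computes limit - stride without int wrap-around: widen both operands to
    // long, subtract, then clamp the result back into [Integer.MIN_VALUE, Integer.MAX_VALUE].
    static int adjustedLimit(int limit, int stride) {
        long wide = (long) limit - (long) stride;
        long clamped = Math.max((long) Integer.MIN_VALUE,
                                Math.min((long) Integer.MAX_VALUE, wide));
        return (int) clamped;
    }

    public static void main(String[] args) {
        System.out.println(adjustedLimit(100, 4));                   // prints 96
        // A plain int subtraction would wrap around here; the clamped version saturates.
        System.out.println(adjustedLimit(Integer.MIN_VALUE + 1, 4)); // prints -2147483648
        System.out.println(adjustedLimit(Integer.MAX_VALUE, -1));    // prints 2147483647
    }
}
```

Because the result is expressed as a min/max of known bounds rather than as a conditional select, the possible range of the result can be read off directly from the operands, which is the property the PR relies on for more precise types.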
> > **Discussion** > > This solution seems much cleaner to me, and I hope that we will see fewer bugs because of imprecise types in the limit computation, which were often due to the `CMove` not being smart enough to analyze all inputs (it would have to recognize a multitude of patterns, for the Cmp inputs and the direct inputs to the CMove - we currently do not do that, but just take the union of the input types - this is very imprecise). > > There is a bit of an overhead here: We use longs even though we only want to have int values. But I think we should prefer a clean implementation here, with correct type computation. The performance impact is probably non-existent on 64-bit machines anyway. > > **Caveat** > > I found some cases with the same assert `malformed control flow` that are most likely skeleton/assertion predicate bugs [JDK-8288981](https://bugs.openjdk.org/browse/JDK-8288981). Some of those cases were new patterns, for example where we PreMainPost a main loop. > > I hope that this fix here at least reduces the frequency of failures significantly. > > **Testing** > > I added 2 regression tests. Our fuzzer seems to spit out examples regularly, so that gives us extra coverage. > > Tested up to `tier5` and stress testing. Performance testing **running...** > > **Future Work** > > We should implement `MaxL/MinL` in the backend. We should also use them during parsing. This would also allow us to `SuperWord` the instruction, on the platforms that support it. > > Should we add such an assert during IGVN? I think after IGVN, we should never have a `MultiBranchNode` that does not have the required number of outputs, right? We could add it to `VerifyIterativeGVN`. Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains 19 additional commits since the last revision: - Merge branch 'master' into JDK-8303466 - Update src/hotspot/share/opto/addnode.hpp (Christian) Co-authored-by: Christian Hagedorn - Apply suggestions from code review (Tobias) Co-authored-by: Tobias Hartmann - For Vladimir: add comment and move code - Fixed some TOP cases - Collapse SubL->MaxL->SubL->MaxL pattern, test it - Merge branch 'JDK-8303466' of https://github.com/eme64/jdk into JDK-8303466 - Review suggestion by Tobias Hartmann Co-authored-by: Tobias Hartmann - convert I2L(L2I(x)) => x, when allowed by types - stride_l should be longcon - ... and 9 more: https://git.openjdk.org/jdk/compare/37a5665b...a778346b ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13269/files - new: https://git.openjdk.org/jdk/pull/13269/files/d22dc2b5..a778346b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13269&range=10 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13269&range=09-10 Stats: 292985 lines in 2867 files changed: 246094 ins; 30275 del; 16616 mod Patch: https://git.openjdk.org/jdk/pull/13269.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13269/head:pull/13269 PR: https://git.openjdk.org/jdk/pull/13269 From rcastanedalo at openjdk.org Mon Apr 24 15:06:12 2023 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Mon, 24 Apr 2023 15:06:12 GMT Subject: RFR: 8287087: C2: perform SLP reduction analysis on-demand [v4] In-Reply-To: References: Message-ID: <-5XPy4I-SVMVvxwGusambZ0QlT_UjpwB_pmq5IWpWlk=.dba6fe41-16f4-4ed4-a4df-62094fc33a86@github.com> > Reduction analysis finds cycles of reduction operations within loops. The result of this analysis is used by SLP auto-vectorization (to vectorize reductions if deemed profitable) and by x64 instruction matching (to select specialized scalar floating-point `Math.min()/max()` implementations).
Currently, reduction analysis is applied early (before loop unrolling), and the result is propagated through loop unrolling by marking nodes and loops with special reduction flags. Applying reduction analysis early is efficient, but propagating the results correctly through loop unrolling and arbitrary graph transformations is challenging and often leads to inconsistent node-loop reduction flag states, some of which have led to actual miscompilations in the past (see [JDK-8261147](https://bugs.openjdk.org/browse/JDK-8261147) and [JDK-8279622](https://bugs.openjdk.org/browse/JDK-8279622)). > > This changeset postpones reduction analysis to the point where its results are actually used. To do so, it generalizes the analysis to find reduction cycles on unrolled loops: > > ![reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226725587-b7d68509-3717-4bbe-8d54-f9a105853fda.png) > > The generalized analysis precludes the need to maintain and propagate node and loop reduction flags through arbitrary IR transformations, reducing the risk of miscompilations due to invalidation of the analysis results. The generalization is slightly more costly than the current analysis, but still negligible in micro- and general benchmarks. > > ## Performance Benefits > > As a side benefit, the proposed generalization is able to find more reductions, increasing the scope of auto-vectorization and the performance of x64 floating-point `Math.min()/max()` in multiple scenarios. > > ### Increased Auto-Vectorization Scope > > There are two main scenarios in which the proposed changeset enables further auto-vectorization: > > #### Reductions Using Global Accumulators > > > public class Foo { > int acc = 0; > (..) > void reduce(int[] array) { > for (int i = 0; i < array.length; i++) { > acc += array[i]; > } > } > } > > Initially, such reductions are wrapped by load and store nodes, which defeats the current reduction analysis. 
However, after unrolling and other optimizations are applied, the reduction becomes recognizable by the proposed analysis: > > ![global-reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226745351-33494e40-7c07-4a8b-8bf6-d3a96e84b1c2.png) > > #### Reductions of partially unrolled loops > > > (..) > for (int i = 0; i < array.length / 2; i++) { > acc += array[2*i]; > acc += array[2*i + 1]; > } > (..) > > > These reductions are manually unrolled from the beginning, so the current reduction analysis fails to find them, while the proposed analysis is able to detect them as if they were unrolled automatically. > > ### Increased Performance of x64 Floating-Point `Math.min()/max()` > > Besides the above scenarios, the proposed generalization allows the x64 matcher to select specialized floating-point `Math.min()/max()` implementations for reductions in non-counted and outer loops (see the new micro-benchmarks in `FpMinMaxIntrinsics.java` for more details). > > ## Implementation details > > The generalized reduction analysis finds reductions in a loop by looking for chains of reduction operators of the same node type starting and finishing on each phi node in the loop. To avoid a combinatorial explosion, the analysis assumes that all nodes in a chain are connected via the same edge index, which is realistic because chains usually consist of identical nodes cloned by loop unrolling. This assumption allows the analysis to test only two paths for each examined phi node. A failure of this assumption (e.g. as illustrated in test case `testReductionOnPartiallyUnrolledLoopWithSwappedInputs` from `TestGeneralizedReductions.java`) results in missing vectorization but does not affect correctness. Note that the same-index assumption can only fail in cases where current auto-vectorization would also fail to vectorize (manually unrolled loops).
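The same-index chain search described in the paragraph above can be sketched on a toy graph. This is a simplified illustration in plain Java, not the C2 implementation: the `Node` class, the slot layout (slot 2 of the phi holding the loop-carried value), the `maxLength` bound and all method names are assumptions made for the example, and the swapped-edge tracking that the changeset adds for nodes cloned by unrolling is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model of a same-index reduction chain search; not HotSpot code.
final class ReductionChainSketch {
    // Toy IR node: an operator name and two data inputs in slots 1 and 2
    // (slot 0 is left unused, loosely mimicking a control input).
    static final class Node {
        final String op;
        final Node[] in = new Node[3];
        Node(String op) { this.op = op; }
    }

    static Node binary(String op, Node left, Node right) {
        Node n = new Node(op);
        n.in[1] = left;
        n.in[2] = right;
        return n;
    }

    // Walks backwards from the phi's loop-carried input, always following the
    // same input slot. Returns the chain if it ends at the phi, else an empty list.
    static List<Node> chainVia(Node phi, int index, int maxLength) {
        List<Node> chain = new ArrayList<>();
        Node first = phi.in[2];
        Node current = first;
        while (current != null && current != phi && chain.size() < maxLength) {
            if (!current.op.equals(first.op)) {
                return List.of(); // operator changes: not a reduction chain
            }
            chain.add(current);
            current = current.in[index];
        }
        return current == phi ? chain : List.of();
    }

    // Only two paths are tried per phi: all-slot-1 and all-slot-2.
    static List<Node> findReductionChain(Node phi, int maxLength) {
        for (int index = 1; index <= 2; index++) {
            List<Node> chain = chainVia(phi, index, maxLength);
            if (!chain.isEmpty()) {
                return chain;
            }
        }
        return List.of();
    }

    // Builds phi -> AddI -> AddI -> phi; optionally swaps the inputs of the second add.
    static Node example(boolean swapSecond) {
        Node phi = new Node("Phi");
        Node add1 = binary("AddI", phi, new Node("LoadI"));
        Node add2 = swapSecond ? binary("AddI", new Node("LoadI"), add1)
                               : binary("AddI", add1, new Node("LoadI"));
        phi.in[2] = add2;
        return phi;
    }

    public static void main(String[] args) {
        System.out.println(findReductionChain(example(false), 8).size()); // prints 2
        System.out.println(findReductionChain(example(true), 8).size());  // prints 0
    }
}
```

The second call shows the trade-off stated in the text: when the inputs of one add are swapped by hand, neither fixed-slot path reaches the phi, so the chain is simply not reported and vectorization is missed without any effect on correctness.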
> > The changeset implements a more relaxed version of the reduction analysis for x64 matching, suitable for queries on single nodes. This analysis is run only in the presence of `[Min|Max][F|D]` nodes. > > ## Alternative approaches > > A complication results from edge swapping in the nodes cloned by loop unrolling (see [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/addnode.cpp#L123) and [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/mulnode.cpp#L113)), which can lead to reduction chains connected via different input indices. This is addressed by tracking whether nodes have swapped edges and adjusting the explored input indices in the reduction analysis accordingly. An alternative (proposed by @eme64 and @jatin-bhateja ) is to replace this changeset's linear chain finding approach with some form of general path-finding algorithm. This alternative would preclude the need for tracking edge swapping at a potentially higher computational cost. 
The following table summarizes the pros and cons of the current mainline approach, this changeset, and the proposed alternative:
>
> | approach | correctness | efficiency | effectiveness | conceptual complexity |
> | -------- | ----------- | ---------- | ------------- | --------------------- |
> | mainline (current) | hard to establish due to need of maintaining reduction flags through arbitrary graph transformations (has led to miscompilations, see JDK-8261147 and JDK-8279622) | high | low (misses substantial reduction vectorization opportunities) | high (requires maintaining non-local reduction node state) |
> | this changeset | easy to establish since client transformations operate on the same graph that is analyzed | medium (limited search for chains of nodes) | high (finds all reduction cycles except for partially unrolled loops with manually-swapped inputs) | medium (requires maintaining local swapped-edge node state) |
> | general search | easy to establish (same as above) | low (general search), particularly for x64 matching where the analysis runs once for every node in a chain | high (similar to above but also covering manually-swapped inputs) | low (no node state required, use of well-known graph search algorithms) |
>
> Since the efficiency-conceptual complexity trade-off between this changeset and the general search approach is not obvious, I propose to integrate this changeset (which strikes a balance between the two) and investigate the latter one in a follow-up RFE.
>
> ## Testing
>
> ### Functionality
>
> - tier1-5 (linux-x64, linux-aarch64, windows-x64, macosx-x64, and macosx-aarch64).
> - fuzzing (12 h. on linux-x64 and linux-aarch64).
>
> ##### TestGeneralizedReductions.java
>
> Tests the new scenarios in which vectorization occurs. These tests are restricted to 64-bits platforms, since I do not have access to 32-bits ones.
`testReductionOnPartiallyUnrolledLoop` has been observed to fail on [linux-x86](https://github.com/robcasloz/jdk/actions/runs/4478959520/jobs/7873827856#logs) due to missing vectorization. If anyone wants to have a look and derive the necessary IR test framework preconditions for the test to pass on linux-x86, I am happy to lift the 64-bits restriction. > > ##### TestFpMinMaxReductions.java > > Tests the matching of floating-point max/min implementations in x64. > > ##### TestSuperwordFailsUnrolling.java > > This test file is updated to ensure auto-vectorization is never triggered, because this changeset would otherwise enable it and defeat the purpose of the test. > > ### Performance > > #### General Benchmarks > > The changeset does not cause any performance regression on the DaCapo, SPECjvm 2008, and SPECjbb2015 benchmark suites for linux-x64 and linux-aarch64. > > #### Micro-benchmarks > > The changeset extends two existing files with additional micro-benchmarks that show the benefit of the generalized reduction analysis ([full results](https://github.com/openjdk/jdk/files/11039207/microbenchmark-results.ods)). > > > ##### VectorReduction.java > > These micro-benchmarks are first adjusted to actually vectorize in the mainline approach, since they suffered from the global-accumulator limitation. Two micro-benchmarks are added to exercise vectorization in the presence of global accumulators and partially unrolled loops. Running `VectorReduction.java` on an x64 (Cascade Lake) machine confirms the expectations: compared to mainline (with the adjustment mentioned above), this changeset yields similar performance results except for `andRedIOnGlobalAccumulator` and `andRedIPartiallyUnrolled`, where the changeset improves performance by 2.4x in both cases. > > ##### MaxIntrinsics.java > > This file is extended with four new micro-benchmarks. 
Running it on the same machine as above shows that the changeset does not affect the performance of the existing micro-benchmarks, and improves moderately to substantially the performance of the new ones (because it allows the x64 matcher to select a floating-point `Math.min()` implementation that is specialized for reduction min operations):
>
> | micro-benchmark | speedup compared to mainline |
> | --- | --- |
> | `fMinReduceInOuterLoop` | 1.1x |
> | `fMinReduceNonCounted` | 2.3x |
> | `fMinReduceGlobalAccumulator` | 2.4x |
> | `fMinReducePartiallyUnrolled` | 3.9x |
>
> ## Acknowledgments
>
> Thanks to @danielogh for making it possible to test this improvement with confidence ([JDK-8294715](https://bugs.openjdk.org/browse/JDK-8294715)) and to @TobiHartmann, @chhagedorn, @vnkozlov and @eme64 for discussions and useful feedback.

Roberto Castañeda Lozano has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 33 commits:

 - Merge master
 - Fix node naming in reduction chain traversal
 - Use is_marked_reduction() in new SLP code
 - Merge master
 - Emit Node::Flag_has_swapped_edges in IGV graphs
 - Merge master
 - Relax the reduction cycle search bound
 - Remove redundant IR check precondition
 - Use SuperWord members in reduction marking
 - Remove redundant opcode checks
 - ... and 23 more: https://git.openjdk.org/jdk/compare/7400aff3...1510accd

-------------

Changes: https://git.openjdk.org/jdk/pull/13120/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13120&range=03
Stats: 821 lines in 17 files changed: 654 ins; 106 del; 61 mod
Patch: https://git.openjdk.org/jdk/pull/13120.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/13120/head:pull/13120

PR: https://git.openjdk.org/jdk/pull/13120

From rcastanedalo at openjdk.org Mon Apr 24 15:10:52 2023
From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano)
Date: Mon, 24 Apr 2023 15:10:52 GMT
Subject: RFR: 8287087: C2: perform SLP reduction analysis on-demand [v3]
In-Reply-To:
References:
Message-ID:

On Mon, 24 Apr 2023 09:02:47 GMT, Jatin Bhateja wrote:

>> Roberto Castañeda Lozano has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 31 commits:
>>
>> - Use is_marked_reduction() in new SLP code
>> - Merge master
>> - Emit Node::Flag_has_swapped_edges in IGV graphs
>> - Merge master
>> - Relax the reduction cycle search bound
>> - Remove redundant IR check precondition
>> - Use SuperWord members in reduction marking
>> - Remove redundant opcode checks
>> - Do not run test in x86-32
>> - Update existing test instead of removing it
>> - ... and 21 more: https://git.openjdk.org/jdk/compare/a3137c75...d9fc7b22
>
> src/hotspot/share/opto/superword.cpp line 518:
>
>> 516: }
>> 517: // Test that reduction nodes do not have any users in the loop besides their
>> 518: // reduction cycle predecessors.
>
> Slight nomenclature confusion in comment, I think you meant successor in above comment, phi is a successor of first node, we are using fast_outs in following loop to check for unique successors.

Good catch, thanks! My confusion stemmed from thinking about the traversal order rather than the direction of the edges in the Ideal graph. I have updated the comment and the variable name.
-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/13120#discussion_r1175431599

From jsjolen at openjdk.org Mon Apr 24 15:17:08 2023
From: jsjolen at openjdk.org (Johan Sjölen)
Date: Mon, 24 Apr 2023 15:17:08 GMT
Subject: RFR: 8306444: Don't leak memory in PhaseChaitin::PhaseChaitin [v3]
In-Reply-To:
References:
Message-ID: <5hZ6Na5qqsOPgyCC9kYXbzRm5Fg_bphgw8Y74X9r30o=.631926d3-e023-427b-9e1d-fd9dee8ce320@github.com>

On Thu, 20 Apr 2023 13:00:42 GMT, Johan Sjölen wrote:

>> Hi,
>>
>> First, `PhaseChaitin::PhaseChaitin` used to create 4 resource arrays of size `_cfg.number_of_blocks`: one to store all of the block pointers in (`_blks`), and three to do a sorting of the blocks in some order. The latter three weren't freed in the constructor, causing them to hang around for the entire duration of the phase. This is unnecessary, so this patch frees the arrays when we're done with them. It also allocates all of the resource arrays in one go.
>>
>> Second, it copied over each partially filled bucket into the `_blks`, one block at a time. This patch changes this so that we don't allocate the `_blks` resource array at all; instead we simply squash all of the partially filled buckets into the first one using `::memmove`.
>>
>> I haven't done any micro benchmarking, but this should be faster and take less space.
>>
>> This is currently passing tier1.
>
> Johan Sjölen has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision:
>
> - Merge remote-tracking branch 'origin/master' into opt-chaitin
> - Apply Kozlov's comments
> - Use nr_blocks in assert
> - Merge loops
> - Optimize PhaseChaitin

Passes tier1, tier2.
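
The bucket-squashing idea quoted above can be sketched in isolation as follows. This is only an illustration under assumed conditions: the buckets are taken to be equally sized slices of a single array, the elements are plain `int`s rather than block pointers, and `squash_buckets` and its parameters are hypothetical names, not the code in `PhaseChaitin::PhaseChaitin`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical sketch: `storage` holds `num_buckets` buckets back to back,
// each with room for `capacity` elements, of which only the first
// `counts[b]` are in use. The used parts of buckets 1..n-1 are moved down
// so that all used elements form one contiguous run starting at storage[0].
// Returns the total number of used elements.
std::size_t squash_buckets(int* storage, std::size_t capacity,
                           const std::size_t* counts,
                           std::size_t num_buckets) {
  assert(num_buckets > 0 && counts[0] <= capacity);
  std::size_t filled = counts[0];
  for (std::size_t b = 1; b < num_buckets; ++b) {
    assert(counts[b] <= capacity);
    // Source and destination ranges may overlap, so memmove (not memcpy)
    // is the appropriate call here.
    std::memmove(storage + filled, storage + b * capacity,
                 counts[b] * sizeof(int));
    filled += counts[b];
  }
  return filled;
}
```

Because the destination never lies past the source, one pass over the buckets is enough and no second array is needed for the concatenated result.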
-------------

PR Comment: https://git.openjdk.org/jdk/pull/13533#issuecomment-1520365987

From jsjolen at openjdk.org Mon Apr 24 15:17:09 2023
From: jsjolen at openjdk.org (Johan Sjölen)
Date: Mon, 24 Apr 2023 15:17:09 GMT
Subject: Integrated: 8306444: Don't leak memory in PhaseChaitin::PhaseChaitin
In-Reply-To:
References:
Message-ID:

On Wed, 19 Apr 2023 13:12:00 GMT, Johan Sjölen wrote:

> Hi,
>
> First, `PhaseChaitin::PhaseChaitin` used to create 4 resource arrays of size `_cfg.number_of_blocks`: one to store all of the block pointers in (`_blks`), and three to do a sorting of the blocks in some order. The latter three weren't freed in the constructor, causing them to hang around for the entire duration of the phase. This is unnecessary, so this patch frees the arrays when we're done with them. It also allocates all of the resource arrays in one go.
>
> Second, it copied over each partially filled bucket into the `_blks`, one block at a time. This patch changes this so that we don't allocate the `_blks` resource array at all; instead we simply squash all of the partially filled buckets into the first one using `::memmove`.
>
> I haven't done any micro benchmarking, but this should be faster and take less space.
>
> This is currently passing tier1.

This pull request has now been integrated.

Changeset: b2ccc973
Author: Johan Sjölen
URL: https://git.openjdk.org/jdk/commit/b2ccc9731e3a183bc6f31480c7d12f110633ea2b
Stats: 39 lines in 1 file changed: 25 ins; 6 del; 8 mod

8306444: Don't leak memory in PhaseChaitin::PhaseChaitin

Reviewed-by: kvn, roland

-------------

PR: https://git.openjdk.org/jdk/pull/13533

From dcubed at openjdk.org Mon Apr 24 16:13:56 2023
From: dcubed at openjdk.org (Daniel D.
Daugherty) Date: Mon, 24 Apr 2023 16:13:56 GMT Subject: RFR: 8301377: adjust timeout for JLI GetObjectSizeIntrinsicsTest.java subtest again In-Reply-To: References: Message-ID: On Mon, 24 Apr 2023 05:24:20 GMT, Tobias Hartmann wrote: >> Trivial fixes to increase timeouts for tests that timeout under heavy stress: >> [JDK-8301377](https://bugs.openjdk.org/browse/JDK-8301377) adjust timeout for JLI GetObjectSizeIntrinsicsTest.java subtest again >> [JDK-8305502](https://bugs.openjdk.org/browse/JDK-8305502) adjust timeouts in three more M&M tests >> [JDK-8302607](https://bugs.openjdk.org/browse/JDK-8302607) increase timeout for ContinuousCallSiteTargetChange.java > > Looks good. @TobiHartmann - Thanks for the review! My jdk-21+19 stress testing round had no issues with these fixes. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13593#issuecomment-1520457554 From dcubed at openjdk.org Mon Apr 24 16:13:58 2023 From: dcubed at openjdk.org (Daniel D. Daugherty) Date: Mon, 24 Apr 2023 16:13:58 GMT Subject: Integrated: 8301377: adjust timeout for JLI GetObjectSizeIntrinsicsTest.java subtest again In-Reply-To: References: Message-ID: <3UbIgqWd7vlTLk6ovrv5dJX2kocPtT147xX26XcA3mU=.bb2b347b-eab3-457f-8acd-ed54ef5b2e61@github.com> On Fri, 21 Apr 2023 21:35:07 GMT, Daniel D. Daugherty wrote: > Trivial fixes to increase timeouts for tests that timeout under heavy stress: > [JDK-8301377](https://bugs.openjdk.org/browse/JDK-8301377) adjust timeout for JLI GetObjectSizeIntrinsicsTest.java subtest again > [JDK-8305502](https://bugs.openjdk.org/browse/JDK-8305502) adjust timeouts in three more M&M tests > [JDK-8302607](https://bugs.openjdk.org/browse/JDK-8302607) increase timeout for ContinuousCallSiteTargetChange.java This pull request has now been integrated. Changeset: 4b23bef5 Author: Daniel D. 
Daugherty URL: https://git.openjdk.org/jdk/commit/4b23bef51df9c1a5bc8f43748a8d6c8d99995656 Stats: 9 lines in 5 files changed: 0 ins; 0 del; 9 mod 8301377: adjust timeout for JLI GetObjectSizeIntrinsicsTest.java subtest again 8302607: increase timeout for ContinuousCallSiteTargetChange.java 8305502: adjust timeouts in three more M&M tests Reviewed-by: naoto, lmesnik, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/13593 From qamai at openjdk.org Mon Apr 24 18:08:04 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Mon, 24 Apr 2023 18:08:04 GMT Subject: RFR: 8304676: [vectorapi] x86_32: Crash in Assembler::kmovql(Address, KRegister) In-Reply-To: <80deH338vckLLkyfUZGdnKk4XgGRXPgTUm5DmEA_GK0=.50ad95eb-b1bb-475c-90d5-be07ddd6edd6@github.com> References: <80deH338vckLLkyfUZGdnKk4XgGRXPgTUm5DmEA_GK0=.50ad95eb-b1bb-475c-90d5-be07ddd6edd6@github.com> Message-ID: <6nZk29Bze8iFImhis-sQwP3B310fzxzEx3oU9AlrVU0=.9ea52957-e630-4006-afad-8de3ff330563@github.com> On Mon, 24 Apr 2023 08:43:32 GMT, Aleksey Shipilev wrote: >> Hi, >> >> Can I have reviews for this patch which fixes crashes during PrintOptoAssembly of KRegister spilling code. The reason is that we miss the check for nullptr cbuf. >> >> Thanks a lot. > > This looks fine, and it matches what the surrounding code does when `cbuf` is `nullptr`. @shipilev @TobiHartmann @jatin-bhateja Thanks a lot for your reviews, I will integrate the patch ------------- PR Comment: https://git.openjdk.org/jdk/pull/13603#issuecomment-1520592693 From dlong at openjdk.org Mon Apr 24 18:10:44 2023 From: dlong at openjdk.org (Dean Long) Date: Mon, 24 Apr 2023 18:10:44 GMT Subject: RFR: 8306331: assert((cnt > 0.0f) && (prob > 0.0f)) failed: Bad frequency assignment in if [v2] In-Reply-To: References: Message-ID: > This change removes undefined behavior caused by signed overflow, which triggered an assert with Xcode14.3+1.0-beta1 on macos aarch64. 
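
As background for the description above: in C++, an `int` addition that overflows is undefined behavior, so a check written as `counter1 + counter2 >= min` is only safe if the sum is known to fit. A minimal sketch of one common way to avoid this is to widen before adding. The helper name `sum_reaches` and its exact semantics are assumptions made for illustration; this is not the change made in `parse2.cpp`.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

// The sum of two ints must be representable in the wider type.
static_assert(INT_MAX < INT64_MAX / 2, "the widened sum must not overflow");

// Hypothetical helper (not the HotSpot code): returns true when both
// counters are non-negative and their sum is at least `min`, without ever
// evaluating `counter1 + counter2` in `int`.
bool sum_reaches(int counter1, int counter2, int min) {
  assert(min >= 0 && "threshold is expected to be non-negative");
  if (counter1 < 0 || counter2 < 0) {
    return false;  // treat negative counters as not meaningful
  }
  // Widening one operand makes the addition happen in 64 bits.
  const std::int64_t sum = static_cast<std::int64_t>(counter1) + counter2;
  return sum >= min;
}
```

With this shape, even two counters at `INT_MAX` are added exactly, so the result does not depend on what a particular compiler does with overflowing `int` arithmetic.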
Dean Long has updated the pull request incrementally with two additional commits since the last revision: - Update src/hotspot/share/opto/parse2.cpp Co-authored-by: Tobias Hartmann - Update src/hotspot/share/opto/parse2.cpp Co-authored-by: Tobias Hartmann ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13551/files - new: https://git.openjdk.org/jdk/pull/13551/files/9e7b087b..5e013405 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13551&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13551&range=00-01 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/13551.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13551/head:pull/13551 PR: https://git.openjdk.org/jdk/pull/13551 From dlong at openjdk.org Mon Apr 24 18:10:48 2023 From: dlong at openjdk.org (Dean Long) Date: Mon, 24 Apr 2023 18:10:48 GMT Subject: RFR: 8306331: assert((cnt > 0.0f) && (prob > 0.0f)) failed: Bad frequency assignment in if In-Reply-To: References: Message-ID: On Thu, 20 Apr 2023 02:44:00 GMT, Dean Long wrote: > This change removes undefined behavior caused by signed overflow, which triggered an assert with Xcode14.3+1.0-beta1 on macos aarch64. Thanks Tobias. 
After looking at the code again, I was wondering if anyone prefers using "count" instead of "counter" here, so this: `static bool counters_are_meaningful(int counter1, int counter2, int min) {` would become this: `static bool counts_are_meaningful(int count1, int count2, int min) {`

-------------

PR Comment: https://git.openjdk.org/jdk/pull/13551#issuecomment-1520600862
PR Comment: https://git.openjdk.org/jdk/pull/13551#issuecomment-1520604228

From shade at openjdk.org Mon Apr 24 19:02:59 2023
From: shade at openjdk.org (Aleksey Shipilev)
Date: Mon, 24 Apr 2023 19:02:59 GMT
Subject: RFR: 8306773: Problemlist jdk/incubator/vector/ShortMaxVectorTests.java on x86_32
Message-ID: <4F0VKLa0AcqKPygKp9-w6L90xBbHwAO66o5AvyaM468=.9a1d86c9-40a9-4c56-b03d-3a6bab1d7665@github.com>

There is a product bug, see the parent bug. Problemlisting to get cleaner GHA runs.

Additional testing:
 - [ ] GHA

-------------

Commit messages:
 - Fix

Changes: https://git.openjdk.org/jdk/pull/13623/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13623&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8306773
Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod
Patch: https://git.openjdk.org/jdk/pull/13623.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/13623/head:pull/13623

PR: https://git.openjdk.org/jdk/pull/13623

From kvn at openjdk.org Mon Apr 24 19:45:18 2023
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Mon, 24 Apr 2023 19:45:18 GMT
Subject: RFR: 8298189: Regression in SPECjvm2008-MonteCarlo for pre-Cascade Lake Intel processors
In-Reply-To: <-rGBBQHk5en1kP23u9akxzvubL5KEC_s73H6Ox2Yk4U=.42502cee-d135-491e-aa90-2d5634db4df9@github.com>
References: <-rGBBQHk5en1kP23u9akxzvubL5KEC_s73H6Ox2Yk4U=.42502cee-d135-491e-aa90-2d5634db4df9@github.com>
Message-ID:

On Mon, 24 Apr 2023 06:05:21 GMT, Roberto Castañeda Lozano wrote:

> The `mov + inc/dec -> lea` subset of the peephole rules introduced by [JDK-8283699](https://bugs.openjdk.org/browse/JDK-8283699) has been found to cause
minor regressions for some common benchmarks on Intel microarchitectures earlier than Cascade Lake. This changeset limits their application to Intel Cascade Lake and microarchitectures with full ALU support for lea (`VM_Version::supports_fast_3op_lea()`), where these peephole rules have been confirmed to be beneficial. The adjustment speeds up SPECjvm2008's MonteCarlo benchmark by between 0.1% and 2.7% on pre-Cascade Lake microarchitectures (Haswell-DT, Coffee Lake-B) across different garbage collectors (G1, ZGC). It additionally yields a speedup of 2.1% on SPECjvm2008's Derby benchmark when using G1 on Coffee Lake-B. > > Thanks to @ericcaspole for discussions and helping out with benchmarking. > > #### Testing > > ##### Functionality > > - tier1-5 (windows-x64, linux-x64, macosx-x64; release and debug mode). > - Checked that the expected combination of peephole rules is enabled for all microarchitectures supported by Intel's Software Development Emulator 9.0. > > ##### Performance > > - Tested performance on a set of standard benchmark suites (DaCapo, SPECjbb2015, SPECjvm2008), different Intel microarchitectures (Haswell-DT, Coffee Lake-B, Cascade Lake, Ice Lake-SP) and operating systems (linux-x64, windows-x64, and macosx-x64). No significant change was observed besides the improvements mentioned above. Good. ------------- Marked as reviewed by kvn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/13605#pullrequestreview-1398683452 From cslucas at openjdk.org Mon Apr 24 19:50:22 2023 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Mon, 24 Apr 2023 19:50:22 GMT Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v10] In-Reply-To: References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> Message-ID: On Sat, 22 Apr 2023 01:12:32 GMT, Vladimir Ivanov wrote: >> Cesar Soares Lucas has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 10 commits: >> >> - Catching up with master >> >> Merge remote-tracking branch 'origin/master' into rematerialization-of-merges >> - Fix tests. Remember previous reducible Phis. >> - Address PR review 3. Some comments and be able to abort compilation. >> - Merge with Master >> - Addressing PR review 2: refactor & reuse MacroExpand::scalar_replacement method. >> - Address PR feeedback 1: make ObjectMergeValue subclass of ObjectValue & create new IR class to represent scalarized merges. >> - Add support for SR'ing some inputs of merges used for field loads >> - Fix some typos and do some small refactorings. >> - Merge master >> - Add support for rematerializing scalar replaced objects participating in allocation merges > > src/hotspot/share/code/debugInfo.cpp line 232: > >> 230: // If we call select again on the same merge we should return the same result >> 231: if (_selected != nullptr) { >> 232: return _selected; > > I'm not sure I understand how it is intended to work. The code below initializes `_selected`, but returns `nullptr` when `selector >= 0`. Subsequent calls will return non-null value. This can be improved. I'll fix it. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1175715702 From kvn at openjdk.org Mon Apr 24 19:51:09 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 24 Apr 2023 19:51:09 GMT Subject: RFR: 8299226: compiler/profiling/TestTypeProfiling.java: make it not throw if C2 is not enabled. In-Reply-To: <3iM-vkwplJeQZi2YpT15meLWPypFq-gTQCMhJuOTPao=.8dc4e7a2-bb15-4940-aad1-333eace23e14@github.com> References: <3iM-vkwplJeQZi2YpT15meLWPypFq-gTQCMhJuOTPao=.8dc4e7a2-bb15-4940-aad1-333eace23e14@github.com> Message-ID: On Mon, 24 Apr 2023 09:34:11 GMT, Ilya Korennoy wrote: >> @ikorennoy, I added comment with question to Evgeny about how he hit the issue so we can reproduce it. >> These tests can't be run without JTREG which filter them. Based on his answer we either close bug as not issue or try to find why filtering does not work in his configuration. >> >> If you want to look to do clean to remove unneeded checks and simplify `@requires` I would suggest to file a separate RFE. > > @vnkozlov seems like there are no updates in Jira. What can we do now? @ikorennoy I added new comment in Jira. I am suggesting to close this issue if we don't get answer in few days. ------------- PR Comment: https://git.openjdk.org/jdk/pull/12981#issuecomment-1520735536 From kvn at openjdk.org Mon Apr 24 19:58:10 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 24 Apr 2023 19:58:10 GMT Subject: RFR: 8306773: Problemlist jdk/incubator/vector/ShortMaxVectorTests.java on x86_32 In-Reply-To: <4F0VKLa0AcqKPygKp9-w6L90xBbHwAO66o5AvyaM468=.9a1d86c9-40a9-4c56-b03d-3a6bab1d7665@github.com> References: <4F0VKLa0AcqKPygKp9-w6L90xBbHwAO66o5AvyaM468=.9a1d86c9-40a9-4c56-b03d-3a6bab1d7665@github.com> Message-ID: On Mon, 24 Apr 2023 18:26:24 GMT, Aleksey Shipilev wrote: > There is a product bug, see the parent bug. Problemlisting to get cleaner GHA runs. > > Additional testing: > - [ ] GHA Good. ------------- Marked as reviewed by kvn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/13623#pullrequestreview-1398700578

From cslucas at openjdk.org Tue Apr 25 00:14:13 2023
From: cslucas at openjdk.org (Cesar Soares Lucas)
Date: Tue, 25 Apr 2023 00:14:13 GMT
Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v10]
In-Reply-To:
References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com>
Message-ID:

On Sat, 22 Apr 2023 01:52:37 GMT, Vladimir Ivanov wrote:

>> Cesar Soares Lucas has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 10 commits:
>>
>> - Catching up with master
>>
>> Merge remote-tracking branch 'origin/master' into rematerialization-of-merges
>> - Fix tests. Remember previous reducible Phis.
>> - Address PR review 3. Some comments and be able to abort compilation.
>> - Merge with Master
>> - Addressing PR review 2: refactor & reuse MacroExpand::scalar_replacement method.
>> - Address PR feeedback 1: make ObjectMergeValue subclass of ObjectValue & create new IR class to represent scalarized merges.
>> - Add support for SR'ing some inputs of merges used for field loads
>> - Fix some typos and do some small refactorings.
>> - Merge master
>> - Add support for rematerializing scalar replaced objects participating in allocation merges
>
> src/hotspot/share/code/debugInfo.cpp line 257:
>
>> 255: } else {
>> 256: assert(selector < _possible_objects.length(), "sanity");
>> 257: _selected = (ObjectValue*) _possible_objects.at(selector);
>
> Any particular reason to reuse `ObjectValue` from `_possible_objects` instead of allocating a fresh one (as you do on `selector == -1` branch)? I'd prefer `ObjectMergeValue::select()` to always allocate a fresh `ObjectValue` when converting `ObjectMergeValue` + `ObjectMergeCandidateValue` into `ObjectValue`.

@iwanowww - may I ask why always allocating a fresh object might be better than returning a pointer to a previous "selected" object?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1175897406

From cslucas at openjdk.org Tue Apr 25 00:38:23 2023
From: cslucas at openjdk.org (Cesar Soares Lucas)
Date: Tue, 25 Apr 2023 00:38:23 GMT
Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v10]
In-Reply-To:
References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com>
Message-ID:

On Sat, 22 Apr 2023 01:42:41 GMT, Vladimir Ivanov wrote:

> Does it make sense to introduce 3 different subclasses under ObjectValue to clearly distinguish the scenarios?

I think that's a good idea. I'll give it a shot. Thanks.

> src/java.base/share/classes/java/security/AccessController.java line 786:
>
>> 784: // allocation merge Phi leading to it) might become NonEscaping and get
>> 785: // scalar replaced. The call below enforces 'result' to always escape.
>> 786: ensureMaterializedForStackWalk(result);
>
> Why don't you add the same call in the other `executePrivileged` overload? It has the very same code shape.

Totally missed that!

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1175906046
PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1175905602

From cslucas at openjdk.org Tue Apr 25 01:25:25 2023
From: cslucas at openjdk.org (Cesar Soares Lucas)
Date: Tue, 25 Apr 2023 01:25:25 GMT
Subject: RFR: 8306625 - Missing instructions on IR-based test framework ALLOC Regex
Message-ID:

On AArch64 with -XX:-UseTLAB, C2 can add an `add`, `mulw` or `addw` around the method call to allocate an object/array. When this happens, the current Regex of the IR-based test framework will NOT recognize the instruction sequence as an allocation, and the result will be a false-negative test result.
This PR is to adjust the four Regex to account for those possible instructions.

-------------

Commit messages:
 - Include add, addw, mulw in ALLOC regexes

Changes: https://git.openjdk.org/jdk/pull/13631/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13631&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8306625
Stats: 4 lines in 1 file changed: 0 ins; 0 del; 4 mod
Patch: https://git.openjdk.org/jdk/pull/13631.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/13631/head:pull/13631

PR: https://git.openjdk.org/jdk/pull/13631

From jkarthikeyan at openjdk.org Tue Apr 25 02:51:28 2023
From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan)
Date: Tue, 25 Apr 2023 02:51:28 GMT
Subject: RFR: 8051725: Improve expansion of Conv2B nodes in the middle-end [v4]
In-Reply-To:
References:
Message-ID:

> Hi, I've created optimizations for the expansion of `Conv2B` nodes, especially when followed immediately by an xor of 1. This pattern is fairly common, and can arise from both [cmov idealization](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/movenode.cpp#L241) and [diamond-phi optimization](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L1571). This change replaces `Conv2B` nodes in the middle-end during macro expansion with conditional moves, allowing the bit flip with `xor` to be subsumed with an inversion of the comparison instead. This change also reduces the overhead of the matcher in the backend, as fewer rules need to be traversed in order to match an ideal node. Performance results from my (Zen 2) machine:
>
>
>                                                 Baseline                  Patch          Improvement
> Benchmark                      Mode  Cnt   Score   Error  Units      Score   Error  Units
> Conv2BRules.testEquals0        avgt   10  47.566 ± 0.346  ns/op  /  34.130 ± 0.177  ns/op  + 28.2%
> Conv2BRules.testNotEquals0     avgt   10  37.167 ± 0.211  ns/op  /  34.185 ± 0.258  ns/op  + 8.0%
> Conv2BRules.testEquals1        avgt   10  35.059 ± 0.280  ns/op  /  34.847 ± 0.160  ns/op  (unchanged)
> Conv2BRules.testEqualsNull     avgt   10  56.768 ± 2.600  ns/op  /  34.330 ± 0.625  ns/op  + 39.5%
> Conv2BRules.testNotEqualsNull  avgt   10  47.447 ± 1.193  ns/op  /  34.142 ± 0.303  ns/op  + 28.0%
>
> Reviews would be greatly appreciated!
>
> Testing: tier1-2 on linux x64, GHA

Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision:

Make transform conditional

-------------

Changes:
 - all: https://git.openjdk.org/jdk/pull/13345/files
 - new: https://git.openjdk.org/jdk/pull/13345/files/59a68a10..91a898ee

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=13345&range=03
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13345&range=02-03

Stats: 414 lines in 15 files changed: 354 ins; 44 del; 16 mod
Patch: https://git.openjdk.org/jdk/pull/13345.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/13345/head:pull/13345

PR: https://git.openjdk.org/jdk/pull/13345

From jkarthikeyan at openjdk.org Tue Apr 25 02:51:29 2023
From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan)
Date: Tue, 25 Apr 2023 02:51:29 GMT
Subject: RFR: 8051725: Improve expansion of Conv2B nodes in the middle-end [v3]
In-Reply-To:
References:
Message-ID:

On Fri, 21 Apr 2023 00:38:32 GMT, Fei Yang wrote:

>> Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision:
>>
>> Remove Conv2B from backend as it's macro expanded now
>
> Hello, I wonder if we could make this transformation of Conv2B conditional? Architectures like RISC-V don't have support for conditional moves at the ISA level for now. So we set the ConditionalMoveLimit parameter to 0 for this platform, and conditional moves are emulated with normal compare and branch instructions instead [1]. I don't think we would achieve better performance numbers on this platform with this change.
>
> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/riscv.ad#L9583

Hey @RealFYang, thanks for this info!
I wasn't aware that RISC-V didn't have conditional moves, and I agree that it doesn't sound like this transform would be so profitable there. To make the transformation conditional I've moved it to post loop opts IGVN, and only run it if the match rule for `Conv2B` isn't found. In an effort to not accidentally cause performance regressions, I've limited the transform to x86_64, aarch64, and arm32. @merykitty I've also implemented your change with idealization and would like your thoughts on it, thanks! I'll attach aarch64 perf results soon.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/13345#issuecomment-1521085079

From jkarthikeyan at openjdk.org Tue Apr 25 02:55:52 2023
From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan)
Date: Tue, 25 Apr 2023 02:55:52 GMT
Subject: RFR: 8051725: Improve expansion of Conv2B nodes in the middle-end [v5]
In-Reply-To:
References:
Message-ID:

> Hi, I've created optimizations for the expansion of `Conv2B` nodes, especially when followed immediately by an xor of 1. This pattern is fairly common, and can arise from both [cmov idealization](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/movenode.cpp#L241) and [diamond-phi optimization](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L1571). This change replaces `Conv2B` nodes in the middle-end during macro expansion with conditional moves, allowing the bit flip with `xor` to be subsumed with an inversion of the comparison instead. This change also reduces the overhead of the matcher in the backend, as fewer rules need to be traversed in order to match an ideal node. Performance results from my (Zen 2) machine:
>
>
>                                                 Baseline                  Patch          Improvement
> Benchmark                      Mode  Cnt   Score   Error  Units      Score   Error  Units
> Conv2BRules.testEquals0        avgt   10  47.566 ± 0.346  ns/op  /  34.130 ± 0.177  ns/op  + 28.2%
> Conv2BRules.testNotEquals0     avgt   10  37.167 ± 0.211  ns/op  /  34.185 ± 0.258  ns/op  + 8.0%
> Conv2BRules.testEquals1        avgt   10  35.059 ± 0.280  ns/op  /  34.847 ± 0.160  ns/op  (unchanged)
> Conv2BRules.testEqualsNull     avgt   10  56.768 ± 2.600  ns/op  /  34.330 ± 0.625  ns/op  + 39.5%
> Conv2BRules.testNotEqualsNull  avgt   10  47.447 ± 1.193  ns/op  /  34.142 ± 0.303  ns/op  + 28.0%
>
> Reviews would be greatly appreciated!
>
> Testing: tier1-2 on linux x64, GHA

Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision:

Whitespace tweak

-------------

Changes:
 - all: https://git.openjdk.org/jdk/pull/13345/files
 - new: https://git.openjdk.org/jdk/pull/13345/files/91a898ee..556b2ab3

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=13345&range=04
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13345&range=03-04

Stats: 3 lines in 3 files changed: 0 ins; 0 del; 3 mod
Patch: https://git.openjdk.org/jdk/pull/13345.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/13345/head:pull/13345

PR: https://git.openjdk.org/jdk/pull/13345

From ysuenaga at openjdk.org Tue Apr 25 04:49:06 2023
From: ysuenaga at openjdk.org (Yasumasa Suenaga)
Date: Tue, 25 Apr 2023 04:49:06 GMT
Subject: RFR: 8305770: os::Linux::available_memory() should refer MemAvailable in /proc/meminfo
In-Reply-To:
References:
Message-ID:

On Sat, 8 Apr 2023 02:24:44 GMT, Yasumasa Suenaga wrote:

> `os::Linux::available_memory()` returns available memory from cgroups or sysinfo(2). For a process running outside a container, that value is based on `freeram` from sysinfo(2).
>
> `freeram` is equivalent to `MemFree` in `/proc/meminfo` [1]. However, it only counts free RAM. We should use `MemAvailable` when we want to know how much memory is available for the process [2]. `MemAvailable` is available in modern Linux kernels, and it has been backported to some older kernels (e.g. RHEL). `sar` from sysstat reads that value and shows it as `kbavail` [3].
>
> AFAIK the PhysicalMemory event in JFR depends on `os::Linux::available_memory()`, and it is used in automated analysis in JMC.
So the JFR/JMC user could misunderstand physical memory was exhausted even if the memory was available enough. > > [1] https://github.com/torvalds/linux/blob/c9c3395d5e3dcc6daee66c6908354d47bf98cb0c/fs/proc/meminfo.c#L59 > [2] https://docs.kernel.org/filesystems/proc.html?highlight=memavailable > [3] https://github.com/sysstat/sysstat/blob/ac1df71ca252c158e8d418ded93e5ed52f5e8765/rd_stats.c#L325-L328 Hi hotspot-compiler folks, I'd like to change `os::Linux::available_memory()` to refer `MemAvailable` in /proc/meminfo on Linux. One of user of this function is JIT compiler. It is used for determine number of compiler threads. After this change, more compiler threads would be started because `MemAvailable` includes not only free memory but also some caches - it means `MemAvailable` is bigger than `MemFree`. I think this change is not a problem because number of compiler threads is limited by `CICompilerCount`. Do you have any concerns in compiler perspective? ------------- PR Comment: https://git.openjdk.org/jdk/pull/13398#issuecomment-1521146546 From jbhateja at openjdk.org Tue Apr 25 05:24:19 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 25 Apr 2023 05:24:19 GMT Subject: RFR: 8287087: C2: perform SLP reduction analysis on-demand [v4] In-Reply-To: <-5XPy4I-SVMVvxwGusambZ0QlT_UjpwB_pmq5IWpWlk=.dba6fe41-16f4-4ed4-a4df-62094fc33a86@github.com> References: <-5XPy4I-SVMVvxwGusambZ0QlT_UjpwB_pmq5IWpWlk=.dba6fe41-16f4-4ed4-a4df-62094fc33a86@github.com> Message-ID: On Mon, 24 Apr 2023 15:06:12 GMT, Roberto Casta?eda Lozano wrote: >> Reduction analysis finds cycles of reduction operations within loops. The result of this analysis is used by SLP auto-vectorization (to vectorize reductions if deemed profitable) and by x64 instruction matching (to select specialized scalar floating-point `Math.min()/max()` implementations). 
Currently, reduction analysis is applied early (before loop unrolling), and the result is propagated through loop unrolling by marking nodes and loops with special reduction flags. Applying reduction analysis early is efficient, but propagating the results correctly through loop unrolling and arbitrary graph transformations is challenging and often leads to inconsistent node-loop reduction flag states, some of which have led to actual miscompilations in the past (see [JDK-8261147](https://bugs.openjdk.org/browse/JDK-8261147) and [JDK-8279622](https://bugs.openjdk.org/browse/JDK-8279622)). >> >> This changeset postpones reduction analysis to the point where its results are actually used. To do so, it generalizes the analysis to find reduction cycles on unrolled loops: >> >> ![reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226725587-b7d68509-3717-4bbe-8d54-f9a105853fda.png) >> >> The generalized analysis precludes the need to maintain and propagate node and loop reduction flags through arbitrary IR transformations, reducing the risk of miscompilations due to invalidation of the analysis results. The generalization is slightly more costly than the current analysis, but still negligible in micro- and general benchmarks. >> >> ## Performance Benefits >> >> As a side benefit, the proposed generalization is able to find more reductions, increasing the scope of auto-vectorization and the performance of x64 floating-point `Math.min()/max()` in multiple scenarios. >> >> ### Increased Auto-Vectorization Scope >> >> There are two main scenarios in which the proposed changeset enables further auto-vectorization: >> >> #### Reductions Using Global Accumulators >> >> >> public class Foo { >> int acc = 0; >> (..) >> void reduce(int[] array) { >> for (int i = 0; i < array.length; i++) { >> acc += array[i]; >> } >> } >> } >> >> Initially, such reductions are wrapped by load and store nodes, which defeats the current reduction analysis. 
However, after unrolling and other optimizations are applied, the reduction becomes recognizable by the proposed analysis: >> >> ![global-reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226745351-33494e40-7c07-4a8b-8bf6-d3a96e84b1c2.png) >> >> #### Reductions of partially unrolled loops >> >> >> (..) >> for (int i = 0; i < array.length / 2; i++) { >> acc += array[2*i]; >> acc += array[2*i + 1]; >> } >> (..) >> >> >> These reductions are manually unrolled from the beginning, so the current reduction analysis fails to find them, while the proposed analysis is able to detect them as if they were unrolled automatically. >> >> ### Increased Performance of x64 Floating-Point `Math.min()/max()` >> >> Besides the above scenarios, the proposed generalization allows the x64 matcher to select specialized floating-point `Math.min()/max()` implementations for reductions in non-counted and outer loops (see the new micro-benchmarks in `FpMinMaxIntrinsics.java` for more details). >> >> ## Implementation details >> >> The generalized reduction analysis finds reductions in a loop by looking for chains of reduction operators of the same node type starting and finishing on each phi node in the loop. To avoid a combinatorial explosion, the analysis assumes that all nodes in a chain are connected via the same edge index, which is realistic because chains usually consist of identical nodes cloned by loop unrolling. This assumption allows the analysis to test only two paths for each examined phi node. A failure of this assumption (e.g. as illustrated in test case `testReductionOnPartiallyUnrolledLoopWithSwappedInputs` from `TestGeneralizedReductions.java`) results in mising vectorization but does not affect correctness. Note that the same-index assumption can only fail in cases where current auto-vectorization would also fail to vectorize (manually unrolled loops). 
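A minimal sketch of the same-edge-index chain search described above, written as plain Java rather than HotSpot C++. The `Node` class, the method names, and the use of input indices 0 and 1 are all assumptions made for illustration (C2 nodes reserve input 0 for control and the changeset's real code differs); the sketch only shows why fixing one edge index per attempt limits the search to two paths per phi.

```java
import java.util.ArrayList;
import java.util.List;

public class ReductionChainSketch {
    /** Hypothetical stand-in for an IR node: an opcode plus ordered data inputs. */
    public static final class Node {
        public final String opcode;
        public final List<Node> in = new ArrayList<>();
        public Node(String opcode) { this.opcode = opcode; }
    }

    /** Convenience constructor for a two-input add node (illustrative only). */
    public static Node add(Node left, Node right) {
        Node n = new Node("AddI");
        n.in.add(left);
        n.in.add(right);
        return n;
    }

    /**
     * Walks backwards from the loop-carried value through input `edge` only,
     * requiring every visited node to share the first node's opcode.
     * Returns the chain length if the walk ends at `phi`, or -1 otherwise.
     */
    public static int chainLength(Node phi, Node backedgeValue, int edge, int maxLength) {
        String opcode = backedgeValue.opcode;
        Node current = backedgeValue;
        int length = 0;
        while (current != phi) {
            if (length == maxLength || !current.opcode.equals(opcode) || edge >= current.in.size()) {
                return -1; // bound exceeded, opcode changed, or no such input
            }
            length++;
            current = current.in.get(edge);
        }
        return length > 0 ? length : -1;
    }

    /** Tries the two candidate edge indices; this is the "two paths per phi" bound. */
    public static int findReduction(Node phi, Node backedgeValue, int maxLength) {
        for (int edge = 0; edge < 2; edge++) {
            int length = chainLength(phi, backedgeValue, edge, maxLength);
            if (length > 0) {
                return length;
            }
        }
        return -1;
    }
}
```

In this toy model, a chain whose nodes reach the phi through mixed input indices is simply not found, which mirrors the missed-vectorization (but still correct) outcome mentioned for manually swapped inputs.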
>> >> The changeset implements a more relaxed version of the reduction analysis for x64 matching, suitable for queries on single nodes. This analysis is run only in the presence of `[Min|Max][F|D]` nodes. >> >> ## Alternative approaches >> >> A complication results from edge swapping in the nodes cloned by loop unrolling (see [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/addnode.cpp#L123) and [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/mulnode.cpp#L113)), which can lead to reduction chains connected via different input indices. This is addressed by tracking whether nodes have swapped edges and adjusting the explored input indices in the reduction analysis accordingly. An alternative (proposed by @eme64 and @jatin-bhateja ) is to replace this changeset's linear chain finding approach with some form of general path-finding algorithm. This alternative would preclude the need for tracking edge swapping at a potentially higher computational cost. 
The following table summarizes the pros and cons of the current mainline approach, this changeset, and the proposed alternative: >> >> | approach | correctness | efficiency | effectiveness | conceptual complexity | >> | -------- | ----------- | ---------- | ------------- | --------------------- | >> | mainline (current) | hard to establish due to need of maintaining reduction flags through arbitrary graph transformations (has led to miscompilations, see JDK-8261147 and JDK-8279622) | high | low (misses substantial reduction vectorization opportunities) | high (requires maintaining non-local reduction node state) | >> | this changeset | easy to establish since client transformations operate on the same graph that is analyzed | medium (limited search for chains of nodes) | high (finds all reduction cycles except for partially unrolled loops with manually-swapped inputs) | medium (requires maintaining local swapped-edge node state) | >> | general search | easy to establish (same as above) | low (general search), particularly for x64 matching where the analysis runs once for every node in a chain | high (similar to above but also covering manually-swapped inputs) | low (no node state required, use of well-known graph search algorithms) | >> >> Since the efficiency-conceptual complexity trade-off between this changeset and the general search approach is not obvious, I propose to integrate this changeset (which strikes a balance between the two) and investigate the latter one in a follow-up RFE. >> >> ## Testing >> >> ### Functionality >> >> - tier1-5 (linux-x64, linux-aarch64, windows-x64, macosx-x64, and macosx-aarch64). >> - fuzzing (12 h. on linux-x64 and linux-aarch64). >> >> ##### TestGeneralizedReductions.java >> >> Tests the new scenarios in which vectorization occurs. These tests are restricted to 64-bits platforms, since I do not have access to 32-bits ones. 
`testReductionOnPartiallyUnrolledLoop` has been observed to fail on [linux-x86](https://github.com/robcasloz/jdk/actions/runs/4478959520/jobs/7873827856#logs) due to missing vectorization. If anyone wants to have a look and derive the necessary IR test framework preconditions for the test to pass on linux-x86, I am happy to lift the 64-bits restriction. >> >> ##### TestFpMinMaxReductions.java >> >> Tests the matching of floating-point max/min implementations in x64. >> >> ##### TestSuperwordFailsUnrolling.java >> >> This test file is updated to ensure auto-vectorization is never triggered, because this changeset would otherwise enable it and defeat the purpose of the test. >> >> ### Performance >> >> #### General Benchmarks >> >> The changeset does not cause any performance regression on the DaCapo, SPECjvm 2008, and SPECjbb2015 benchmark suites for linux-x64 and linux-aarch64. >> >> #### Micro-benchmarks >> >> The changeset extends two existing files with additional micro-benchmarks that show the benefit of the generalized reduction analysis ([full results](https://github.com/openjdk/jdk/files/11039207/microbenchmark-results.ods)). >> >> >> ##### VectorReduction.java >> >> These micro-benchmarks are first adjusted to actually vectorize in the mainline approach, since they suffered from the global-accumulator limitation. Two micro-benchmarks are added to exercise vectorization in the presence of global accumulators and partially unrolled loops. Running `VectorReduction.java` on an x64 (Cascade Lake) machine confirms the expectations: compared to mainline (with the adjustment mentioned above), this changeset yields similar performance results except for `andRedIOnGlobalAccumulator` and `andRedIPartiallyUnrolled`, where the changeset improves performance by 2.4x in both cases. >> >> ##### MaxIntrinsics.java >> >> This file is extended with four new micro-benchmarks. 
Running it on the same machine as above shows that the changeset does not affect the performance of the existing micro-benchmarks, and improves moderately to substantially the performance of the new ones (because it allows the x64 matcher to select a floating-point `Math.min()` implementation that is specialized for reduction min operations): >> >> | micro-benchmark | speedup compared to mainline | >> | --- | --- | >> | `fMinReduceInOuterLoop` | 1.1x | >> | `fMinReduceNonCounted` | 2.3x | >> | `fMinReduceGlobalAccumulator` | 2.4x | >> | `fMinReducePartiallyUnrolled` | 3.9x | >> >> ## Acknowledgments >> >> Thanks to @danielogh for making it possible to test this improvement with confidence ([JDK-8294715](https://bugs.openjdk.org/browse/JDK-8294715)) and to @TobiHartmann, @chhagedorn, @vnkozlov and @eme64 for discussions and useful feedback. > > Roberto Casta?eda Lozano has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 33 commits: > > - Merge master > - Fix node naming in reduction chain traversal > - Use is_marked_reduction() in new SLP code > - Merge master > - Emit Node::Flag_has_swapped_edges in IGV graphs > - Merge master > - Relax the reduction cycle search bound > - Remove redundant IR check precondition > - Use SuperWord members in reduction marking > - Remove redundant opcode checks > - ... and 23 more: https://git.openjdk.org/jdk/compare/7400aff3...1510accd Hi @robcasloz , Apart from some earlier shared concerns on path detection traversal which are not blocking issues, patch looks good to me. Best Regards, Jatin ------------- Marked as reviewed by jbhateja (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/13120#pullrequestreview-1399147288 From chagedorn at openjdk.org Tue Apr 25 05:29:06 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 25 Apr 2023 05:29:06 GMT Subject: RFR: 8306625 - Missing instructions on IR-based test framework ALLOC Regex In-Reply-To: References: Message-ID: On Tue, 25 Apr 2023 01:18:39 GMT, Cesar Soares Lucas wrote: > On AArch64 with -XX:-UseTLAB, C2 can add an `add`, `mulw` or `addw` around the method call to allocate an object/array. When this happens, the current Regex of the IR-based test framework will NOT recognize the instruction sequence as an allocation, and the result will be a false-negative test result. > > This PR adjusts the four Regex to account for those possible instructions. Looks good! Thanks for fixing this. ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13631#pullrequestreview-1399150368 From shade at openjdk.org Tue Apr 25 05:59:07 2023 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 25 Apr 2023 05:59:07 GMT Subject: RFR: 8306773: Problemlist jdk/incubator/vector/ShortMaxVectorTests.java on x86_32 In-Reply-To: <4F0VKLa0AcqKPygKp9-w6L90xBbHwAO66o5AvyaM468=.9a1d86c9-40a9-4c56-b03d-3a6bab1d7665@github.com> References: <4F0VKLa0AcqKPygKp9-w6L90xBbHwAO66o5AvyaM468=.9a1d86c9-40a9-4c56-b03d-3a6bab1d7665@github.com> Message-ID: On Mon, 24 Apr 2023 18:26:24 GMT, Aleksey Shipilev wrote: > There is a product bug, see the parent bug. Problemlisting to get cleaner GHA runs. > > Additional testing: > - [x] GHA Thanks! I am integrating to get cleaner GHA runs.
------------- PR Comment: https://git.openjdk.org/jdk/pull/13623#issuecomment-1521191411 From shade at openjdk.org Tue Apr 25 06:02:21 2023 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 25 Apr 2023 06:02:21 GMT Subject: Integrated: 8306773: Problemlist jdk/incubator/vector/ShortMaxVectorTests.java on x86_32 In-Reply-To: <4F0VKLa0AcqKPygKp9-w6L90xBbHwAO66o5AvyaM468=.9a1d86c9-40a9-4c56-b03d-3a6bab1d7665@github.com> References: <4F0VKLa0AcqKPygKp9-w6L90xBbHwAO66o5AvyaM468=.9a1d86c9-40a9-4c56-b03d-3a6bab1d7665@github.com> Message-ID: <739mtQOKeUkvgDEYS8OqJH6j1gGlxtBxz_Q-3F_zz6s=.2a555937-f302-4eb8-b1aa-b3a26e08a4b1@github.com> On Mon, 24 Apr 2023 18:26:24 GMT, Aleksey Shipilev wrote: > There is a product bug, see the parent bug. Problemlisting to get cleaner GHA runs. > > Additional testing: > - [x] GHA This pull request has now been integrated. Changeset: 2985738f Author: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/2985738f1584735fee34bbe706014f43ec369bdd Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod 8306773: Problemlist jdk/incubator/vector/ShortMaxVectorTests.java on x86_32 Reviewed-by: kvn ------------- PR: https://git.openjdk.org/jdk/pull/13623 From epeter at openjdk.org Tue Apr 25 06:24:11 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 25 Apr 2023 06:24:11 GMT Subject: RFR: 8306042: C2: failed: Missed optimization opportunity in PhaseCCP (adding LShift->Cast->Add notification) Message-ID: <5LdntwU5zlwXPnwYeJxzNPZTwrOuki6VebrE9Leeb8g=.3dc26d60-2729-4d60-9a5c-14cbb57f2813@github.com> Another case of `uncast` not being type-propagated through. We have a case like this: `Phi -> ShiftL -> CastII -> AndI` The Phi has an updated type, so we should re-run Value on the AndI.
In PhaseCCP::push_and, we do update a similar pattern: `X -> ShiftL -> AndI` I extended it to handle this pattern: `parent -> LShift (use) -> ConstraintCast* -> And` For this, I implemented: https://github.com/openjdk/jdk/blob/26f4adaae901822bea984b926c06d1a78f9c6b48/src/hotspot/share/opto/castnode.hpp#L73-L78 I could refactor code from a previous similar fix, for pattern: `ConstraintCast+ -> Sub/Phi` **Discussion** https://github.com/openjdk/jdk/blob/4d350f8f4eaabb18482c7656cb56a734e60187cf/src/hotspot/share/opto/castnode.hpp#L78-L79 I would have liked to place a `ResourceMark` between these two lines, to ensure the `internals` data structure is de-allocated after the traversal. But if I add it there, then one cannot modify any outer data-structure, or else one risks re-allocation of the outer data-structure in the inner ResourceMark, and then this memory gets de-allocated once the ResourceMark is cleared, and the outer data-structure is broken. This would for example mean that I could not push to the IGVN worklist inside the callback. Not having the ResourceMark means a memory leak, until the compile phase is over. But my code is not the only place, there are lots of places where we create a Resource allocated data-structure, but do not use ResourceMarks. 
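The hazard described above can be illustrated with a toy model. This is only a sketch in plain Java, not HotSpot code: `Arena`, `Mark`, and `ArenaList` are hypothetical stand-ins for a bump-pointer resource area, a scoped mark that rolls the area back, and a growable array backed by that area. It assumes that growing a list allocates fresh storage at the current top of the arena, which is what makes the outer list end up inside the inner scope.

```java
public class ResourceMarkSketch {
    /** Toy bump-pointer arena: allocating just advances `top`. */
    public static final class Arena {
        int top = 0;
        int allocate(int size) {
            int start = top;
            top += size;
            return start;
        }
    }

    /** Toy scoped mark: remembers `top` and rolls back to it when closed. */
    public static final class Mark implements AutoCloseable {
        private final Arena arena;
        private final int savedTop;
        Mark(Arena arena) {
            this.arena = arena;
            this.savedTop = arena.top;
        }
        @Override
        public void close() {
            arena.top = savedTop;
        }
    }

    /** Toy growable list whose backing storage is a range inside the arena. */
    public static final class ArenaList {
        private final Arena arena;
        int start;
        int capacity;
        int length;
        ArenaList(Arena arena, int capacity) {
            this.arena = arena;
            this.capacity = capacity;
            this.start = arena.allocate(capacity);
        }
        void push() {
            if (length == capacity) {
                // Growing abandons the old range and takes a new one at the current top.
                capacity *= 2;
                start = arena.allocate(capacity);
            }
            length++;
        }
        /** True if the backing range still lies below the arena's top. */
        boolean storageIsLive() {
            return start + capacity <= arena.top;
        }
    }

    /** Pushes to an outer list while an inner mark is active, then checks the outer list. */
    public static boolean outerSurvivesInnerMark(int pushesInsideMark) {
        Arena arena = new Arena();
        ArenaList worklist = new ArenaList(arena, 2);        // outer structure
        try (Mark inner = new Mark(arena)) {
            ArenaList internals = new ArenaList(arena, 4);   // inner, traversal-local structure
            for (int i = 0; i < pushesInsideMark; i++) {
                worklist.push();                             // may reallocate inside the mark
            }
        }
        return worklist.storageIsLive();
    }
}
```

In this model, `outerSurvivesInnerMark(2)` stays within the outer list's initial capacity and is unaffected, while `outerSurvivesInnerMark(3)` forces a reallocation inside the mark, so the outer list's storage is rolled back along with the inner structure.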
------------- Commit messages: - Remove ResourceMark, which lead to bad de-allocations - spurious new line - 8306042: C2: failed: Missed optimization opportunity in PhaseCCP (adding LShift->Cast->Add notification) Changes: https://git.openjdk.org/jdk/pull/13611/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13611&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8306042 Stats: 118 lines in 3 files changed: 89 ins; 20 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/13611.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13611/head:pull/13611 PR: https://git.openjdk.org/jdk/pull/13611 From thartmann at openjdk.org Tue Apr 25 06:27:12 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 25 Apr 2023 06:27:12 GMT Subject: RFR: 8306331: assert((cnt > 0.0f) && (prob > 0.0f)) failed: Bad frequency assignment in if [v2] In-Reply-To: References: Message-ID: <8d_Rz32X-c4T1L60pLf3mv-Lklm5JYMze2XIPsBCLDU=.1e1928eb-c82a-4ef7-aea7-b37d6eb2285e@github.com> On Mon, 24 Apr 2023 18:10:44 GMT, Dean Long wrote: >> This change removes undefined behavior caused by signed overflow, which triggered an assert with Xcode14.3+1.0-beta1 on macos aarch64. > > Dean Long has updated the pull request incrementally with two additional commits since the last revision: > > - Update src/hotspot/share/opto/parse2.cpp > > Co-authored-by: Tobias Hartmann > - Update src/hotspot/share/opto/parse2.cpp > > Co-authored-by: Tobias Hartmann Sounds reasonable to me. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/13551#issuecomment-1521216079 From thartmann at openjdk.org Tue Apr 25 06:29:06 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 25 Apr 2023 06:29:06 GMT Subject: RFR: 8306625 - Missing instructions on IR-based test framework ALLOC Regex In-Reply-To: References: Message-ID: On Tue, 25 Apr 2023 01:18:39 GMT, Cesar Soares Lucas wrote: > On AArch64 with -XX:-UseTLAB, C2 can add an `add`, `mulw` or `addw` around the method call to allocate an object/array. When this happens, the current Regex of the IR-based test framework will NOT recognize the instruction sequence as an allocation, and the result will be a false-negative test result. > > This PR adjusts the four Regex to account for those possible instructions. Should we add a test that triggers this? ------------- PR Review: https://git.openjdk.org/jdk/pull/13631#pullrequestreview-1399207446 From rcastanedalo at openjdk.org Tue Apr 25 07:57:08 2023 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Tue, 25 Apr 2023 07:57:08 GMT Subject: RFR: 8298189: Regression in SPECjvm2008-MonteCarlo for pre-Cascade Lake Intel processors In-Reply-To: References: <-rGBBQHk5en1kP23u9akxzvubL5KEC_s73H6Ox2Yk4U=.42502cee-d135-491e-aa90-2d5634db4df9@github.com> Message-ID: On Mon, 24 Apr 2023 09:25:11 GMT, Roberto Castañeda Lozano wrote: > Good. Thanks for reviewing, Vladimir! ------------- PR Comment: https://git.openjdk.org/jdk/pull/13605#issuecomment-1521331001 From rcastanedalo at openjdk.org Tue Apr 25 08:30:09 2023 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Tue, 25 Apr 2023 08:30:09 GMT Subject: RFR: 8305770: os::Linux::available_memory() should refer MemAvailable in /proc/meminfo In-Reply-To: References: Message-ID: On Tue, 25 Apr 2023 04:45:56 GMT, Yasumasa Suenaga wrote: >> `os::Linux::available_memory()` returns available memory from cgroups or sysinfo(2).
In case of the process which run on out of container, that value is based on `freeram` from sysinfo(2). >> >> `freeram` is equivalent to `MemFree` in `/proc/meminfo` [1]. However it means just a free RAM. We should use `MemAvailable` when we want to know how much memory is available for the process [2]. `MemAvailable` is available in modern Linux kernel, and it has been backported some older kernels (e.g. RHEL). In `sar` from sysstat, it refers that value and shows it as `kbavail` [3]. >> >> AFAIK PhysicalMemory event in JFR depends on `os::Linux::available_memory()`, and it is used in automated analysis in JMC. So the JFR/JMC user could misunderstand physical memory was exhausted even if the memory was available enough. >> >> [1] https://github.com/torvalds/linux/blob/c9c3395d5e3dcc6daee66c6908354d47bf98cb0c/fs/proc/meminfo.c#L59 >> [2] https://docs.kernel.org/filesystems/proc.html?highlight=memavailable >> [3] https://github.com/sysstat/sysstat/blob/ac1df71ca252c158e8d418ded93e5ed52f5e8765/rd_stats.c#L325-L328 > > Hi hotspot-compiler folks, > > I'd like to change `os::Linux::available_memory()` to refer `MemAvailable` in /proc/meminfo on Linux. One of user of this function is JIT compiler. It is used for determine number of compiler threads. After this change, more compiler threads would be started because `MemAvailable` includes not only free memory but also some caches - it means `MemAvailable` is bigger than `MemFree`. > > I think this change is not a problem because number of compiler threads is limited by `CICompilerCount`. Do you have any concerns in compiler perspective? Hi @YaSuenag, just to confirm that this change would not lead to excessive creation/deletion of compiler threads (which can have a significant cost in terms of memory usage, see e.g. 
[JDK-8302264](https://bugs.openjdk.org/browse/JDK-8302264)), it would be useful to see some measurements about number of created compiler threads over the execution of some applications in a environment configured so that `available_memory / (200*M)` becomes the limiting factor in https://github.com/openjdk/jdk/blob/f968da97a5a5c68c28ad29d13fdfbe3a4adf5ef7/src/hotspot/share/compiler/compileBroker.cpp#L1024-L1027, before and after the change. This could be easily measured e.g. using `-XX:+TraceCompilerThreads`. Do you have (or could produce) such measurements? ------------- PR Comment: https://git.openjdk.org/jdk/pull/13398#issuecomment-1521385514 From fyang at openjdk.org Tue Apr 25 08:41:19 2023 From: fyang at openjdk.org (Fei Yang) Date: Tue, 25 Apr 2023 08:41:19 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v32] In-Reply-To: References: Message-ID: On Mon, 24 Apr 2023 12:57:03 GMT, Dingli Zhang wrote: >> HI, >> >> We have added support for vector add mask instructions, please take a look and have some reviews. Thanks a lot! >> This patch will add support of vector add/sub/mul/div mask version. It was implemented by referring to RVV v1.0 [1]. >> >> ## Load/Store/Cmp Mask >> `VectorLoadMask, VectorMaskCmp, VectorStoreMask` will implement the mask datapath. We can see where the data is passed in the compilation log with `jdk/incubator/vector/Byte128VectorTests.java`? >> >> 218 loadV V1, [R7] # vector (rvv) >> 220 vloadmask V0, V1 >> ... >> 23c vmaskcmp_rvv_masked V0, V4, V5, V0, V1, #0 >> 24c vstoremask V1, V0 >> 258 storeV [R7], V1 # vector (rvv) >> >> >> The corresponding generated jit assembly? 
>> >> # loadV >> 0x000000400c8ef958: vsetvli t0,zero,e8,m1,tu,mu >> 0x000000400c8ef95c: vle8.v v1,(t2) >> >> # vloadmask >> 0x000000400c8ef960: vsetvli t0,zero,e8,m1,tu, >> 0x000000400c8ef964: vmsne.vx v0,v1,zero >> >> # vmaskcmp_rvv_masked >> 0x000000400c8ef97c: vsetvli t0,zero,e8,m1,tu,mu >> 0x000000400c8ef980: vmclr.m v1 >> 0x000000400c8ef984: vmseq.vv v1,v4,v5,v0.t >> 0x000000400c8ef988: vmv1r.v v0,v1 >> >> # vstoremask >> 0x000000400c8ef98c: vsetvli t0,zero,e8,m1,tu,mu >> 0x000000400c8ef990: vmv.v.x v1,zero >> 0x000000400c8ef994: vmerge.vim v1,v1,1,v0 >> >> >> ## Masked vector arithmetic instructions (e.g. vadd) >> AddMaskTestMerge case: >> >> import jdk.incubator.vector.IntVector; >> import jdk.incubator.vector.VectorMask; >> import jdk.incubator.vector.VectorOperators; >> import jdk.incubator.vector.VectorSpecies; >> >> public class AddMaskTestMerge { >> >> static final VectorSpecies SPECIES = IntVector.SPECIES_128; >> static final int SIZE = 1024; >> static int[] a = new int[SIZE]; >> static int[] b = new int[SIZE]; >> static int[] r = new int[SIZE]; >> static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false}; >> static { >> for (int i = 0; i < SIZE; i++) { >> a[i] = i; >> b[i] = i; >> } >> } >> >> static void workload(int idx) { >> VectorMask vmask = VectorMask.fromArray(SPECIES, c, 0); >> IntVector av = IntVector.fromArray(SPECIES, a, idx); >> IntVector bv = IntVector.fromArray(SPECIES, b, idx); >> av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx); >> } >> >> public static void main(String[] args) { >> for (int i = 0; i < 30_0000; i++) { >> for (int j = 0; j < SIZE; j += SPECIES.length()) { >> workload(j); >> } >> } >> } >> } >> >> >> This test case is reduced from existing jtreg vector tests Int128VectorTests.java[2]. This test case corresponds to the add instruction of the vector mask version and other instructions are similar. >> >> Before this patch, the compilation log will not print RVV-related instructions. 
Now the compilation log is as follows: >> >> >> 0ae B10: # out( B25 B11 ) <- in( B9 ) Freq: 0.999991 >> 0ae loadV V1, [R31] # vector (rvv) >> 0b6 vloadmask V0, V2 >> 0be vadd.vv V3, V1, V0 #@vaddI_masked >> 0c6 lwu R28, [R7, #124] # loadN, compressed ptr, #@loadN ! Field: AddMaskTestMerge.r >> 0ca decode_heap_oop R28, R28 #@decodeHeapOop >> 0cc lwu R7, [R28, #12] # range, #@loadRange >> 0d0 NullCheck R28 >> >> >> And the jit code is as follows: >> >> >> 0x000000400c823cee: vsetvli t0,zero,e32,m1,tu,mu >> 0x000000400c823cf2: vle32.v v1,(t6) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0} >> ; - jdk.incubator.vector.IntVector::intoArray at 43 (line 3228) >> ; - AddMaskTestMerge::workload at 46 (line 25) >> 0x000000400c823cf6: vsetvli t0,zero,e8,m1,tu,mu >> 0x000000400c823cfa: vmsne.vx v0,v2,zero ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0} >> ; - jdk.incubator.vector.VectorMask::fromArray at 47 (line 208) >> ; - AddMaskTestMerge::workload at 7 (line 22) >> 0x000000400c823cfe: vsetvli t0,zero,e32,m1,tu,mu >> 0x000000400c823d02: vadd.vv v3,v3,v1,v0.t ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0} >> ; - jdk.incubator.vector.IntVector::lanewiseTemplate at 192 (line 834) >> ; - jdk.incubator.vector.Int128Vector::lanewise at 9 (line 291) >> ; - jdk.incubator.vector.Int128Vector::lanewise at 4 (line 41) >> ; - AddMaskTestMerge::workload at 39 (line 25) >> >> >> ## Mask register allocation & mask bit opreation >> Since v0 is to be used as a mask register in spec[1], sometimes we need two vmask to do the vector mask logical ops like `AndVMask, OrVMask, XorVMask`. And if only v0 and v31 mask registers are defined, the corresponding c2 nodes will not be generated correctly because of the register pressure[3]. >> When we use only v0 and v31 as mask registers, jtreg testing of Byte128VectorTests.java[4] with `-XX:+PrintAssembly` and `-XX:LogFile` will not emit the expected rvv mask instruction. 
As opposed to this, the following compilation failure log[3] is generated: >> >> >> >> >> >> >> >> >> So define v30 and v31 as mask register too and `AndVMask` will emit the C2 JIT code like: >> >> vloadmask V0, V1 >> vloadmask V30, V2 >> vmask_and V0, V30, V0 >> >> We also modified the implementation of `spill_copy_vector_stack_to_stack ` so that it no longer occupies the v0 register. In addition to that, we change some node like `vasr/vlsl/vlsr/vstring_x/varray_x/vclearArray_x`, which use v0 internally, to make C2 to sense that they used v0. >> >> ## vector load/store - predicated & blend opreation >> >> Jtreg testing of Byte128VectorTests.java[4] with -XX:+PrintOptoAssembly and -XX:LogFile will print the following compilation log, which generated by predicated vector load/store: >> >> 152 B21: # out( B22 ) <- in( B20 ) Freq: 0.499984 >> 152 vmask_gen_L V0, R12 >> 162 loadV_masked V1, V0, [R10] >> 16e storeV_masked [R11], V0, V1 >> >> >> And `VectorBlend` will generate the following compilation log (part of rotate opreation): >> >> 1ea vlsrBS V6, V1, V3 V0 >> 1fe vlslBS V5, V1, V2 V0 >> 212 vor.vv V2, V5, V6 #@vor >> 21a vloadmask V0, V4 >> 222 vmerge_vvm V1, V1, V2 # vector blend >> 22a bgeu R9, R30, B56 #@cmpU_branch P=0.000001 C=-1.000000 >> >> >> At the same time, we added the predicated nodes of `RShiftV/LShiftV/URShiftV`. While there was some code duplication for the corresponding nodes in non-masked form, so a small refactoring was done. 
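For readers less familiar with the lane semantics that the masked and blend nodes above implement, here is a scalar sketch in plain Java. It deliberately avoids `jdk.incubator.vector` so that it runs without extra module flags; the class and method names are hypothetical. It models the behavior of a masked `lanewise(ADD, ...)` (unset lanes keep the first operand's value) and of `blend` (set lanes take the second operand), which is the per-lane result the RVV `v0.t`-predicated and `vmerge` sequences are expected to produce.

```java
public class MaskedLaneSketch {
    /** Scalar model of av.lanewise(ADD, bv, mask): unset lanes keep the value from a. */
    public static int[] maskedAdd(int[] a, int[] b, boolean[] mask) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = mask[i] ? a[i] + b[i] : a[i];
        }
        return r;
    }

    /** Scalar model of av.blend(bv, mask): set lanes take b, unset lanes keep a. */
    public static int[] blend(int[] a, int[] b, boolean[] mask) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = mask[i] ? b[i] : a[i];
        }
        return r;
    }
}
```

A masked add is therefore equivalent to an unmasked add followed by a blend of the original first operand with the sum under the same mask, which can serve as a reference when checking generated code by hand.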
>> >> >> [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc >> [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java >> [3] https://github.com/openjdk/jdk/blob/0deb648985b018653ccdaf193dc13b3cf21c088a/src/hotspot/share/opto/chaitin.cpp#L526 >> [4] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Byte128VectorTests.java >> >> ### Testing: >> >> qemu with UseRVV: >> - [x] Tier1 tests (release) >> - [x] Tier2 tests (release) >> - [x] Tier3 tests (release) >> - [x] test/jdk/jdk/incubator/vector (release/fastdebug) > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Remove useless BasicType bt Thanks for the update. Would you mind a few more tweaks please? src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1755: > 1753: // exception on both signaling and quiet NaN inputs, so we should > 1754: // mask the signaling compares when either input is NaN > 1755: // to implement floating-point quiet compares. I think we can check the vector elements for NaNs here with the Vector Floating-Point Classify Instruction? That would avoid raising invalid operation exception as a side effect with the current solution. src/hotspot/cpu/riscv/riscv_v.ad line 1528: > 1526: match(Set dst_src (RShiftVB (Binary dst_src shift) v0)); > 1527: ins_cost(VEC_COST); > 1528: effect(TEMP_DEF dst_src, USE v0, TEMP tmp); No need to add 'USE v0' in effect for this instruct and several other ones. ------------- Changes requested by fyang (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/12682#pullrequestreview-1399298164 PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1176117019 PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1176191979 From duke at openjdk.org Tue Apr 25 10:13:10 2023 From: duke at openjdk.org (Chang Peng) Date: Tue, 25 Apr 2023 10:13:10 GMT Subject: RFR: 8301739: AArch64: Add optimized rules for vector compare with immediate for SVE [v3] In-Reply-To: <6ErYjQMejOeSMbZcq9THGES3rxYF2fKxM9vN9__5D6s=.8b3ff40a-81ae-43d8-a83a-94f1d28b9d44@github.com> References: <6ErYjQMejOeSMbZcq9THGES3rxYF2fKxM9vN9__5D6s=.8b3ff40a-81ae-43d8-a83a-94f1d28b9d44@github.com> Message-ID: <8o1XTW5X6zHEuL7txai1gyy2eBGSZCQ0mujPlqyhgGc=.ec703e36-2b4c-40f4-af52-bbc4ba2927b4@github.com> On Mon, 24 Apr 2023 10:11:08 GMT, Andrew Haley wrote: >> src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 3625: >> >>> 3623: instruct vmask$2_immI_sve(pReg dst, vReg src, $1 imm, immI_$2_cond cond, rFlagsReg cr) %{ >>> 3624: predicate(UseSVE > 0); >>> 3625: match(Set dst (VectorMaskCmp (Binary src (ReplicateB imm)) cond)); >> >> @theRealAph >> The ReplicateXNodes used in match rules are also different in these two macros. >> I think we needn't merge these two macros since this will introduce some if-else statements which will reduce the readability. > >> @theRealAph The ReplicateXNodes used in match rules are also different in these two macros. I think we needn't merge these two macros since this will introduce some if-else statements which will reduce the readability. > > It won't introduce if-else, surely. Not if you make the parts that are different into parameters. Then a reviewer can immediately see which parts of the macros are actually different, rather than needing to do a deep study. Sorry for the confusion. I will try to rewrite the macros according to your comments.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13200#discussion_r1176301892 From duke at openjdk.org Tue Apr 25 10:21:08 2023 From: duke at openjdk.org (Chang Peng) Date: Tue, 25 Apr 2023 10:21:08 GMT Subject: RFR: 8301739: AArch64: Add optimized rules for vector compare with immediate for SVE [v3] In-Reply-To: <8o1XTW5X6zHEuL7txai1gyy2eBGSZCQ0mujPlqyhgGc=.ec703e36-2b4c-40f4-af52-bbc4ba2927b4@github.com> References: <6ErYjQMejOeSMbZcq9THGES3rxYF2fKxM9vN9__5D6s=.8b3ff40a-81ae-43d8-a83a-94f1d28b9d44@github.com> <8o1XTW5X6zHEuL7txai1gyy2eBGSZCQ0mujPlqyhgGc=.ec703e36-2b4c-40f4-af52-bbc4ba2927b4@github.com> Message-ID: On Tue, 25 Apr 2023 10:10:03 GMT, Chang Peng wrote: >>> @theRealAph The ReplicateXNodes used in match rules are also different in these two macros. I think we needn't merge these two macros since this will introduce some if-else statements which will reduce the readability. >> >> It won't introduce if-else, surely. Not if you make the parts that are different into parameters. Then a reviewer can immediately see which parts of the macros are actually different, rather than needing to do a deep study. > > Sorry for the confusion. I will try to rewrite the macros according to your comments. If we merge these macros as follows, more matching rules will be generated. But I think that is a trivial cost, since the number of classes in ad_aarch64.hpp will not change.
dnl VMASKCMP_SVE_IMM($1 , $2 , $3 , $4 ) dnl VMASKCMP_SVE_IMM(element_size, element_type, type_imm, type_condition) define(`VMASKCMP_SVE_IMM', ` instruct vmask$4_imm$2_sve(pReg dst, vReg src, $3 imm, immI_$4_cond cond, rFlagsReg cr) %{ predicate(UseSVE > 0); match(Set dst (VectorMaskCmp (Binary src (Replicate$2 imm)) cond)); effect(KILL cr); format %{ "vmask$4_imm$2_sve $dst, $src, $imm, $cond\t# KILL cr" %} ins_encode %{ Assembler::Condition condition = to_assembler_cond((BoolTest::mask)$cond$$constant); uint length_in_bytes = Matcher::vector_length_in_bytes(this); assert(length_in_bytes == MaxVectorSize, "invalid vector length"); __ sve_cmp(condition, $dst$$PRegister, __ $1, ptrue, $src$$FloatRegister, (int)$imm$$constant); %} ins_pipe(pipe_slow); %}')dnl VMASKCMP_SVE_IMM(B, B, immI5, cmp) VMASKCMP_SVE_IMM(B, B, immIU7, cmpU) VMASKCMP_SVE_IMM(H, S, immI5, cmp) VMASKCMP_SVE_IMM(H, S, immIU7, cmpU) VMASKCMP_SVE_IMM(S, I, immI5, cmp) VMASKCMP_SVE_IMM(S, I, immIU7, cmpU) VMASKCMP_SVE_IMM(D, L, immL5, cmp) VMASKCMP_SVE_IMM(D, L, immLU7, cmpU) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13200#discussion_r1176310384 From rcastanedalo at openjdk.org Tue Apr 25 12:02:12 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Tue, 25 Apr 2023 12:02:12 GMT Subject: RFR: 8287087: C2: perform SLP reduction analysis on-demand [v2] In-Reply-To: References: Message-ID: On Mon, 24 Apr 2023 11:59:45 GMT, Jatin Bhateja wrote: >> I tried out your suggestion but unfortunately, the bookkeeping code (marking/storing candidate nodes and their predecessors in the tentative reduction chain) became more complex than the simplifications it enabled. > >> I tried out your suggestion but unfortunately, the bookkeeping code (marking/storing candidate nodes and their predecessors in the tentative reduction chain) became more complex than the simplifications it enabled. 
> > Hi @robcasloz , OK, my concern was that after path detection we have two occurrences of _original_input_ ; this can be optimized if we bookkeep the nodes encountered during path detection. Kindly consider the attached rough patch, which records the nodes during path detection. > [reduction_patch.txt](https://github.com/openjdk/jdk/files/11310121/reduction_patch.txt) Thanks for the patch, it is very similar to what I had tried before (see my comment above), except your patch rejects reduction chains with external users, whereas this PR allows reduction chain nodes to be used as long as the user is not within the loop. This follows the original logic more closely (although I have not found a case where the distinction matters yet): https://github.com/openjdk/jdk/blob/a4a5385831b58e66fe3f34cef618643f9be68c9e/src/hotspot/share/opto/loopTransform.cpp#L2539-L2549 To ease comparison, I adapted your patch to perform the same test as this PR. Here is the result: https://github.com/robcasloz/jdk/compare/JDK-8287087...robcasloz:jdk:JDK-8287087-single-original-input-call In my opinion, even if the patch gets rid of the last reduction path traversal, the result is not necessarily more readable or efficient, due to the cognitive and computational overhead of introducing an auxiliary `GrowableArray` data structure.
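To make the difference between the two user checks discussed above concrete, here is a minimal toy model in Java. It is not C2 code: the `Node` class, the `inLoop` flag and the `accept` method are hypothetical names invented for this illustration, and "external user" is read here as "any user that is not the next node in the chain".

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only; none of these names exist in HotSpot.
final class ReductionUsersSketch {
    static final class Node {
        final String name;
        final boolean inLoop;                      // is this node part of the loop body?
        final List<Node> users = new ArrayList<>();
        Node(String name, boolean inLoop) { this.name = name; this.inLoop = inLoop; }
    }

    // 'chain' lists the nodes of a candidate reduction cycle in def-use order;
    // the last node feeds back into the first one.
    // rejectAllExternalUsers == true  : any user outside the chain disqualifies it.
    // rejectAllExternalUsers == false : only users inside the loop disqualify it.
    static boolean accept(List<Node> chain, boolean rejectAllExternalUsers) {
        for (int i = 0; i < chain.size(); i++) {
            Node current = chain.get(i);
            Node successor = chain.get((i + 1) % chain.size());
            for (Node user : current.users) {
                if (user == successor) {
                    continue;                      // the chain edge itself is always fine
                }
                if (rejectAllExternalUsers || user.inLoop) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Node phi = new Node("phi", true);
        Node add = new Node("add", true);
        Node useAfterLoop = new Node("useAfterLoop", false);
        phi.users.add(add);
        add.users.add(phi);
        add.users.add(useAfterLoop);
        List<Node> chain = List.of(phi, add);
        System.out.println(accept(chain, false));  // looser check
        System.out.println(accept(chain, true));   // stricter check
    }
}
```

In this toy, a chain whose only extra user sits after the loop passes the looser check and fails the stricter one; a chain with an extra user inside the loop fails both.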
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13120#discussion_r1176409706 From jbhateja at openjdk.org Tue Apr 25 12:22:22 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 25 Apr 2023 12:22:22 GMT Subject: RFR: 8303762: [vectorapi] Intrinsification of Vector.slice [v6] In-Reply-To: References: Message-ID: On Tue, 25 Apr 2023 11:57:21 GMT, Jatin Bhateja wrote: >> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: >> >> style > > src/hotspot/share/opto/vectorIntrinsics.cpp line 1914: > >> 1912: if (vector_klass->const_oop() == NULL || elem_klass->const_oop() == NULL || >> 1913: !vlen->is_con() || !origin_type->is_con()) { >> 1914: if (C->print_intrinsics()) { > > Hi @merykitty , your inline expander does not handle the non-constant origin case; this will introduce performance regressions w.r.t. the existing implementation. You can extend the expander to generate IR corresponding to the fallback implementation to handle the non-constant origin case. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12909#discussion_r1176410139 From jbhateja at openjdk.org Tue Apr 25 12:22:20 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 25 Apr 2023 12:22:20 GMT Subject: RFR: 8303762: [vectorapi] Intrinsification of Vector.slice [v6] In-Reply-To: References: Message-ID: On Tue, 4 Apr 2023 13:46:12 GMT, Quan Anh Mai wrote: >> `Vector::slice` is a method at the top-level class of the Vector API that concatenates the 2 inputs into an intermediate composite and extracts a window equal to the size of the inputs into the result. It is used in vector conversion methods where the part number is not 0 to slice the parts to the correct positions. Slicing is also used in text processing such as UTF-8 and UTF-16 validation. x86 starting from SSSE3 has `palignr` which does vector slicing very efficiently.
As a result, I think it is beneficial to add a C2 node for this operation as well as intrinsify the `Vector::slice` method. >> >> A slice is currently implemented as `v2.rearrange(iota).blend(v1.rearrange(iota), blendMask)` which requires preparation of the index vector and the blending mask. Even with the preparations being hoisted out of the loops, microbenchmarks show improvement using the slice intrinsics. Some have tremendous increases in throughput due to the limitation that a mask of length 2 cannot currently be intrinsified, leading to falling back to the Java implementations. >> >> Please take a look and have some reviews. Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > style src/hotspot/cpu/x86/x86.ad line 7953: > 7951: __ punpckldq($dst$$XMMRegister, $src$$XMMRegister); > 7952: } > 7953: __ psrldq($dst$$XMMRegister, $origin$$constant * type2aelembytes(bt)); Move it to a new macro assembly routine. src/hotspot/cpu/x86/x86.ad line 7962: > 7960: !VM_Version::supports_ssse3()); > 7961: match(Set dst (VectorSlice (Binary dst src) origin)); > 7962: effect(TEMP xtmp); Please also associate TEMP_DEF / TEMP with dst to avoid early source overwrite in case dst/src are allocated the same register. src/hotspot/cpu/x86/x86.ad line 7970: > 7968: __ movdqu($xtmp$$XMMRegister, $src$$XMMRegister); > 7969: __ pslldq($xtmp$$XMMRegister, 16 - shift_count); > 7970: __ por($dst$$XMMRegister, $xtmp$$XMMRegister); Move to macro assembly routine. src/hotspot/cpu/x86/x86.ad line 8007: > 8005: } > 8006: __ vpsrldq($dst$$XMMRegister, $dst$$XMMRegister, shift_count, Assembler::AVX_128bit); > 8007: } Move to macro assembly routine.
src/hotspot/cpu/x86/x86.ad line 8063: > 8061: (type2aelembytes(Matcher::vector_element_basic_type(n)) * n->in(2)->get_int()) % 4U != 0 && > 8062: (type2aelembytes(Matcher::vector_element_basic_type(n)) * n->in(2)->get_int() < 16 || > 8063: type2aelembytes(Matcher::vector_element_basic_type(n)) * n->in(2)->get_int() > 48)); Move these bulky predicates to the source_hpp section, as done at https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L8786 src/hotspot/cpu/x86/x86.ad line 8082: > 8080: type2aelembytes(Matcher::vector_element_basic_type(n)) * n->in(2)->get_int() > 16 && > 8081: type2aelembytes(Matcher::vector_element_basic_type(n)) * n->in(2)->get_int() < 48); > 8082: match(Set dst (VectorSlice (Binary src1 src2) origin)); Same as above. src/hotspot/cpu/x86/x86.ad line 8099: > 8097: (Matcher::vector_length_in_bytes(n) == 64 || > 8098: (Matcher::vector_length_in_bytes(n) == 32 && > 8099: VM_Version::supports_avx512vl()))); Same as above. src/hotspot/share/opto/vectorIntrinsics.cpp line 1914: > 1912: if (vector_klass->const_oop() == NULL || elem_klass->const_oop() == NULL || > 1913: !vlen->is_con() || !origin_type->is_con()) { > 1914: if (C->print_intrinsics()) { Hi @merykitty , your inline expander does not handle the non-constant origin case; this will introduce performance regressions w.r.t. the existing implementation.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12909#discussion_r1176428922 PR Review Comment: https://git.openjdk.org/jdk/pull/12909#discussion_r1176424735 PR Review Comment: https://git.openjdk.org/jdk/pull/12909#discussion_r1176429190 PR Review Comment: https://git.openjdk.org/jdk/pull/12909#discussion_r1176429410 PR Review Comment: https://git.openjdk.org/jdk/pull/12909#discussion_r1176428080 PR Review Comment: https://git.openjdk.org/jdk/pull/12909#discussion_r1176428309 PR Review Comment: https://git.openjdk.org/jdk/pull/12909#discussion_r1176428542 PR Review Comment: https://git.openjdk.org/jdk/pull/12909#discussion_r1176407424 From jbhateja at openjdk.org Tue Apr 25 12:25:08 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 25 Apr 2023 12:25:08 GMT Subject: RFR: 8304948: [vectorapi] C2 crashes when expanding VectorBox [v2] In-Reply-To: References: Message-ID: On Sat, 22 Apr 2023 09:38:37 GMT, Eric Liu wrote: >> I do still not fully grasp it though. When `expand_vbox_helper` is called on `Phi1`, it creates `NewPhi1`, then it calls `expand_vbox_helper` on `Phi2`, a `NewPhi2` is created, then `expand_vbox_helper` is invoked again on `Phi1`, which it shortcircuits to return `Phi1`, which will be attached as an input of `NewPhi2`, at this step should the second invocation on `Phi1` return `NewPhi1` instead? >> >> Thanks a lot. > > Yes, I think it should return the `NewPhi1` instead. > > In my test case, the `NewPhi1` and `NewPhi2` are idealized to `Phi1` and `Phi2`, so it does not matter whether it returns the new one. But I'm not sure if it's certain to be idealized to the old one. Anyway, returning the new one is more reasonable. I will fix that and do test. Hi @e1iu , Can you please elaborate on the reason for CYCLIC Phi creation in this case? Ideally we should not have seen that pattern in the first place. Your fix is circumventing this problem, but it would be good to know the cause of such a graph shape.
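For readers following the slice discussion, the lane-level meaning of `v1.slice(origin, v2)` described in the PR text (concatenate both inputs, then extract a window of the input size starting at `origin`) can be sketched with plain arrays. This is only a scalar reference model written for illustration; the class and method names are made up here, and the accepted `origin` range of 0 to the vector length inclusive is an assumption of the sketch.

```java
import java.util.Arrays;

// Scalar reference model of a two-vector slice; not the Vector API implementation.
final class SliceSketch {
    static int[] slice(int[] v1, int[] v2, int origin) {
        int vlen = v1.length;
        if (v2.length != vlen || origin < 0 || origin > vlen) {
            throw new IllegalArgumentException("inputs must have equal length and 0 <= origin <= length");
        }
        int[] result = new int[vlen];
        for (int i = 0; i < vlen; i++) {
            int j = i + origin;                    // index into the conceptual concatenation v1 ++ v2
            result[i] = (j < vlen) ? v1[j] : v2[j - vlen];
        }
        return result;
    }

    public static void main(String[] args) {
        int[] v1 = {0, 1, 2, 3};
        int[] v2 = {4, 5, 6, 7};
        System.out.println(Arrays.toString(slice(v1, v2, 1))); // prints [1, 2, 3, 4]
    }
}
```

With a constant `origin`, each result lane comes from a fixed position of the concatenation, which is the shape a byte-shift instruction such as `palignr` handles directly; a non-constant `origin` needs a different strategy, which is the case raised in the last review comment above.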
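The point about returning `NewPhi1` on the second visit can be illustrated with a small, generic sketch of expanding a cyclic graph with a memo table. This is a toy model in Java, not the HotSpot `expand_vbox_helper` code; `Node`, `expand` and the `memo` map are hypothetical names used only for this example.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

// Toy model of expanding a graph that may contain cycles (e.g. two Phis feeding each other).
final class CyclicExpandSketch {
    static final class Node {
        final String name;
        final List<Node> inputs = new ArrayList<>();
        Node(String name) { this.name = name; }
    }

    static Node expand(Node old, Map<Node, Node> memo) {
        Node known = memo.get(old);
        if (known != null) {
            return known;                 // revisit: hand back the replacement, not the original
        }
        Node fresh = new Node("New" + old.name);
        memo.put(old, fresh);             // register before recursing so that cycles terminate
        for (Node in : old.inputs) {
            fresh.inputs.add(expand(in, memo));
        }
        return fresh;
    }

    public static void main(String[] args) {
        Node phi1 = new Node("Phi1");
        Node phi2 = new Node("Phi2");
        phi1.inputs.add(phi2);
        phi2.inputs.add(phi1);
        Node newPhi1 = expand(phi1, new IdentityHashMap<>());
        Node newPhi2 = newPhi1.inputs.get(0);
        System.out.println(newPhi2.inputs.get(0) == newPhi1); // prints true
    }
}
```

Because the replacement is recorded before its inputs are expanded, the second visit of `Phi1` finds `NewPhi1` in the table, so `NewPhi2` ends up pointing at the new node rather than the old one.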
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13489#discussion_r1176432532 From dzhang at openjdk.org Tue Apr 25 12:48:12 2023 From: dzhang at openjdk.org (Dingli Zhang) Date: Tue, 25 Apr 2023 12:48:12 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v33] In-Reply-To: References: Message-ID: > HI, > > We have added support for vector add mask instructions, please take a look and have some reviews. Thanks a lot! > This patch will add support of vector add/sub/mul/div mask version. It was implemented by referring to RVV v1.0 [1]. > > ## Load/Store/Cmp Mask > `VectorLoadMask, VectorMaskCmp, VectorStoreMask` will implement the mask datapath. We can see where the data is passed in the compilation log with `jdk/incubator/vector/Byte128VectorTests.java`? > > 218 loadV V1, [R7] # vector (rvv) > 220 vloadmask V0, V1 > ... > 23c vmaskcmp_rvv_masked V0, V4, V5, V0, V1, #0 > 24c vstoremask V1, V0 > 258 storeV [R7], V1 # vector (rvv) > > > The corresponding generated jit assembly? > > # loadV > 0x000000400c8ef958: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef95c: vle8.v v1,(t2) > > # vloadmask > 0x000000400c8ef960: vsetvli t0,zero,e8,m1,tu, > 0x000000400c8ef964: vmsne.vx v0,v1,zero > > # vmaskcmp_rvv_masked > 0x000000400c8ef97c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef980: vmclr.m v1 > 0x000000400c8ef984: vmseq.vv v1,v4,v5,v0.t > 0x000000400c8ef988: vmv1r.v v0,v1 > > # vstoremask > 0x000000400c8ef98c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef990: vmv.v.x v1,zero > 0x000000400c8ef994: vmerge.vim v1,v1,1,v0 > > > ## Masked vector arithmetic instructions (e.g. 
vadd) > AddMaskTestMerge case: > > import jdk.incubator.vector.IntVector; > import jdk.incubator.vector.VectorMask; > import jdk.incubator.vector.VectorOperators; > import jdk.incubator.vector.VectorSpecies; > > public class AddMaskTestMerge { > > static final VectorSpecies SPECIES = IntVector.SPECIES_128; > static final int SIZE = 1024; > static int[] a = new int[SIZE]; > static int[] b = new int[SIZE]; > static int[] r = new int[SIZE]; > static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false}; > static { > for (int i = 0; i < SIZE; i++) { > a[i] = i; > b[i] = i; > } > } > > static void workload(int idx) { > VectorMask vmask = VectorMask.fromArray(SPECIES, c, 0); > IntVector av = IntVector.fromArray(SPECIES, a, idx); > IntVector bv = IntVector.fromArray(SPECIES, b, idx); > av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx); > } > > public static void main(String[] args) { > for (int i = 0; i < 30_0000; i++) { > for (int j = 0; j < SIZE; j += SPECIES.length()) { > workload(j); > } > } > } > } > > > This test case is reduced from existing jtreg vector tests Int128VectorTests.java[2]. This test case corresponds to the add instruction of the vector mask version and other instructions are similar. > > Before this patch, the compilation log will not print RVV-related instructions. Now the compilation log is as follows: > > > 0ae B10: # out( B25 B11 ) <- in( B9 ) Freq: 0.999991 > 0ae loadV V1, [R31] # vector (rvv) > 0b6 vloadmask V0, V2 > 0be vadd.vv V3, V1, V0 #@vaddI_masked > 0c6 lwu R28, [R7, #124] # loadN, compressed ptr, #@loadN ! 
Field: AddMaskTestMerge.r > 0ca decode_heap_oop R28, R28 #@decodeHeapOop > 0cc lwu R7, [R28, #12] # range, #@loadRange > 0d0 NullCheck R28 > > > And the jit code is as follows: > > > 0x000000400c823cee: vsetvli t0,zero,e32,m1,tu,mu > 0x000000400c823cf2: vle32.v v1,(t6) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::intoArray at 43 (line 3228) > ; - AddMaskTestMerge::workload at 46 (line 25) > 0x000000400c823cf6: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c823cfa: vmsne.vx v0,v2,zero ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.VectorMask::fromArray at 47 (line 208) > ; - AddMaskTestMerge::workload at 7 (line 22) > 0x000000400c823cfe: vsetvli t0,zero,e32,m1,tu,mu > 0x000000400c823d02: vadd.vv v3,v3,v1,v0.t ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::lanewiseTemplate at 192 (line 834) > ; - jdk.incubator.vector.Int128Vector::lanewise at 9 (line 291) > ; - jdk.incubator.vector.Int128Vector::lanewise at 4 (line 41) > ; - AddMaskTestMerge::workload at 39 (line 25) > > > ## Mask register allocation & mask bit opreation > Since v0 is to be used as a mask register in spec[1], sometimes we need two vmask to do the vector mask logical ops like `AndVMask, OrVMask, XorVMask`. And if only v0 and v31 mask registers are defined, the corresponding c2 nodes will not be generated correctly because of the register pressure[3]. > When we use only v0 and v31 as mask registers, jtreg testing of Byte128VectorTests.java[4] with `-XX:+PrintAssembly` and `-XX:LogFile` will not emit the expected rvv mask instruction. 
As opposed to this, the following compilation failure log[3] is generated: > > > > > > > > > So define v30 and v31 as mask registers too and `AndVMask` will emit the C2 JIT code like: > > vloadmask V0, V1 > vloadmask V30, V2 > vmask_and V0, V30, V0 > > We also modified the implementation of `spill_copy_vector_stack_to_stack` so that it no longer occupies the v0 register. In addition to that, we changed some nodes such as `vasr/vlsl/vlsr/vstring_x/varray_x/vclearArray_x`, which use v0 internally, so that C2 is aware that they use v0. > > ## vector load/store - predicated & blend operation > > Jtreg testing of Byte128VectorTests.java[4] with -XX:+PrintOptoAssembly and -XX:LogFile will print the following compilation log, which is generated by predicated vector load/store: > > 152 B21: # out( B22 ) <- in( B20 ) Freq: 0.499984 > 152 vmask_gen_L V0, R12 > 162 loadV_masked V1, V0, [R10] > 16e storeV_masked [R11], V0, V1 > > > And `VectorBlend` will generate the following compilation log (part of rotate operation): > > 1ea vlsrBS V6, V1, V3 V0 > 1fe vlslBS V5, V1, V2 V0 > 212 vor.vv V2, V5, V6 #@vor > 21a vloadmask V0, V4 > 222 vmerge_vvm V1, V1, V2 # vector blend > 22a bgeu R9, R30, B56 #@cmpU_branch P=0.000001 C=-1.000000 > > > At the same time, we added the predicated nodes of `RShiftV/LShiftV/URShiftV`. Since this duplicated some code from the corresponding non-masked nodes, a small refactoring was done.
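As a reading aid for the masked add in the `AddMaskTestMerge` case and for the blend mentioned above, the following scalar sketch spells out what each lane is expected to hold. It is a plain-array model of the Vector API semantics as I understand them (unset lanes of a masked `lanewise` keep the first operand; set lanes of `blend` come from the second vector), not generated code, and the class and method names are invented for the example.

```java
import java.util.Arrays;

// Scalar model of two masked operations; illustration only.
final class MaskedOpsSketch {
    // Models av.lanewise(VectorOperators.ADD, bv, vmask).
    static int[] maskedAdd(int[] a, int[] b, boolean[] mask) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = mask[i] ? a[i] + b[i] : a[i];   // unset lane: value of the first operand
        }
        return r;
    }

    // Models av.blend(bv, vmask).
    static int[] blend(int[] a, int[] b, boolean[] mask) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = mask[i] ? b[i] : a[i];          // set lane: value of the second operand
        }
        return r;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4};
        int[] b = {10, 20, 30, 40};
        boolean[] m = {true, false, true, false};
        System.out.println(Arrays.toString(maskedAdd(a, b, m))); // prints [11, 2, 33, 4]
        System.out.println(Arrays.toString(blend(a, b, m)));     // prints [10, 2, 30, 4]
    }
}
```

A model like this can serve as the expected-value side of a test that compares the vectorized result lane by lane.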
> > > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc > [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java > [3] https://github.com/openjdk/jdk/blob/0deb648985b018653ccdaf193dc13b3cf21c088a/src/hotspot/share/opto/chaitin.cpp#L526 > [4] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Byte128VectorTests.java > > ### Testing: > > qemu with UseRVV: > - [x] Tier1 tests (release) > - [x] Tier2 tests (release) > - [x] Tier3 tests (release) > - [x] test/jdk/jdk/incubator/vector (release/fastdebug) Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: Modify some effect and params ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12682/files - new: https://git.openjdk.org/jdk/pull/12682/files/648861c8..4788345a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=32 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=31-32 Stats: 102 lines in 4 files changed: 11 ins; 28 del; 63 mod Patch: https://git.openjdk.org/jdk/pull/12682.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12682/head:pull/12682 PR: https://git.openjdk.org/jdk/pull/12682 From dzhang at openjdk.org Tue Apr 25 12:48:15 2023 From: dzhang at openjdk.org (Dingli Zhang) Date: Tue, 25 Apr 2023 12:48:15 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v32] In-Reply-To: References: Message-ID: <4C9PWZ7PybUAo2R_TQiMpIGb3WyLWn_QTIUYJWb8nW0=.6a3c73be-af14-40e8-ba9b-b909ccf4e73d@github.com> On Tue, 25 Apr 2023 07:29:57 GMT, Fei Yang wrote: >> Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove useless BasicType bt > > src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1755: > >> 1753: // exception on both signaling and quiet NaN inputs, so we should >> 1754: // mask the signaling compares when either input is 
NaN >> 1755: // to implement floating-point quiet compares. > > I think we can check the vector elements for NaNs here with the Vector Floating-Point Classify Instruction? That would avoid raising invalid operation exception as a side effect with the current solution. Fixed. > src/hotspot/cpu/riscv/riscv_v.ad line 1528: > >> 1526: match(Set dst_src (RShiftVB (Binary dst_src shift) v0)); >> 1527: ins_cost(VEC_COST); >> 1528: effect(TEMP_DEF dst_src, USE v0, TEMP tmp); > > No need to add 'USE v0' in effect for this instruct and several other ones. Fixed. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1176459908 PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1176460417 From rcastanedalo at openjdk.org Tue Apr 25 13:08:22 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Tue, 25 Apr 2023 13:08:22 GMT Subject: RFR: 8287087: C2: perform SLP reduction analysis on-demand [v3] In-Reply-To: References: Message-ID: On Mon, 24 Apr 2023 09:09:54 GMT, Roberto Castañeda Lozano wrote: >> Roberto Castañeda Lozano has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 31 commits: >> >> - Use is_marked_reduction() in new SLP code >> - Merge master >> - Emit Node::Flag_has_swapped_edges in IGV graphs >> - Merge master >> - Relax the reduction cycle search bound >> - Remove redundant IR check precondition >> - Use SuperWord members in reduction marking >> - Remove redundant opcode checks >> - Do not run test in x86-32 >> - Update existing test instead of removing it >> - ... and 21 more: https://git.openjdk.org/jdk/compare/a3137c75...d9fc7b22 > > Hi @jatin-bhateja, I have written a qualitative comparison between this PR and the generic search approach proposed by @eme64 and you (see **Alternative approaches** section in the updated PR description).
I hope the comparison clarifies and motivates the plan outlined in https://github.com/openjdk/jdk/pull/13120#discussion_r1166778814. Please let me know whether you agree with that plan so that we can move forward with this RFE, [JDK-8302673](https://bugs.openjdk.org/browse/JDK-8302673), and also [JDK-8302652](https://bugs.openjdk.org/browse/JDK-8302652). > Hi @robcasloz , Apart from some earlier shared concerns on path detection traversal which are not blocking issues, patch looks good to me. Best Regards, Jatin Thanks for reviewing, Jatin! ------------- PR Comment: https://git.openjdk.org/jdk/pull/13120#issuecomment-1521757717 From dzhang at openjdk.org Tue Apr 25 13:17:25 2023 From: dzhang at openjdk.org (Dingli Zhang) Date: Tue, 25 Apr 2023 13:17:25 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v34] In-Reply-To: References: Message-ID: > HI, > > We have added support for vector add mask instructions, please take a look and have some reviews. Thanks a lot! > This patch will add support of vector add/sub/mul/div mask version. It was implemented by referring to RVV v1.0 [1]. > > ## Load/Store/Cmp Mask > `VectorLoadMask, VectorMaskCmp, VectorStoreMask` will implement the mask datapath. We can see where the data is passed in the compilation log with `jdk/incubator/vector/Byte128VectorTests.java`? > > 218 loadV V1, [R7] # vector (rvv) > 220 vloadmask V0, V1 > ... > 23c vmaskcmp_rvv_masked V0, V4, V5, V0, V1, #0 > 24c vstoremask V1, V0 > 258 storeV [R7], V1 # vector (rvv) > > > The corresponding generated jit assembly? 
> > # loadV > 0x000000400c8ef958: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef95c: vle8.v v1,(t2) > > # vloadmask > 0x000000400c8ef960: vsetvli t0,zero,e8,m1,tu, > 0x000000400c8ef964: vmsne.vx v0,v1,zero > > # vmaskcmp_rvv_masked > 0x000000400c8ef97c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef980: vmclr.m v1 > 0x000000400c8ef984: vmseq.vv v1,v4,v5,v0.t > 0x000000400c8ef988: vmv1r.v v0,v1 > > # vstoremask > 0x000000400c8ef98c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef990: vmv.v.x v1,zero > 0x000000400c8ef994: vmerge.vim v1,v1,1,v0 > > > ## Masked vector arithmetic instructions (e.g. vadd) > AddMaskTestMerge case: > > import jdk.incubator.vector.IntVector; > import jdk.incubator.vector.VectorMask; > import jdk.incubator.vector.VectorOperators; > import jdk.incubator.vector.VectorSpecies; > > public class AddMaskTestMerge { > > static final VectorSpecies SPECIES = IntVector.SPECIES_128; > static final int SIZE = 1024; > static int[] a = new int[SIZE]; > static int[] b = new int[SIZE]; > static int[] r = new int[SIZE]; > static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false}; > static { > for (int i = 0; i < SIZE; i++) { > a[i] = i; > b[i] = i; > } > } > > static void workload(int idx) { > VectorMask vmask = VectorMask.fromArray(SPECIES, c, 0); > IntVector av = IntVector.fromArray(SPECIES, a, idx); > IntVector bv = IntVector.fromArray(SPECIES, b, idx); > av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx); > } > > public static void main(String[] args) { > for (int i = 0; i < 30_0000; i++) { > for (int j = 0; j < SIZE; j += SPECIES.length()) { > workload(j); > } > } > } > } > > > This test case is reduced from existing jtreg vector tests Int128VectorTests.java[2]. This test case corresponds to the add instruction of the vector mask version and other instructions are similar. > > Before this patch, the compilation log will not print RVV-related instructions. 
Now the compilation log is as follows: > > > 0ae B10: # out( B25 B11 ) <- in( B9 ) Freq: 0.999991 > 0ae loadV V1, [R31] # vector (rvv) > 0b6 vloadmask V0, V2 > 0be vadd.vv V3, V1, V0 #@vaddI_masked > 0c6 lwu R28, [R7, #124] # loadN, compressed ptr, #@loadN ! Field: AddMaskTestMerge.r > 0ca decode_heap_oop R28, R28 #@decodeHeapOop > 0cc lwu R7, [R28, #12] # range, #@loadRange > 0d0 NullCheck R28 > > > And the jit code is as follows: > > > 0x000000400c823cee: vsetvli t0,zero,e32,m1,tu,mu > 0x000000400c823cf2: vle32.v v1,(t6) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::intoArray at 43 (line 3228) > ; - AddMaskTestMerge::workload at 46 (line 25) > 0x000000400c823cf6: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c823cfa: vmsne.vx v0,v2,zero ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.VectorMask::fromArray at 47 (line 208) > ; - AddMaskTestMerge::workload at 7 (line 22) > 0x000000400c823cfe: vsetvli t0,zero,e32,m1,tu,mu > 0x000000400c823d02: vadd.vv v3,v3,v1,v0.t ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::lanewiseTemplate at 192 (line 834) > ; - jdk.incubator.vector.Int128Vector::lanewise at 9 (line 291) > ; - jdk.incubator.vector.Int128Vector::lanewise at 4 (line 41) > ; - AddMaskTestMerge::workload at 39 (line 25) > > > ## Mask register allocation & mask bit opreation > Since v0 is to be used as a mask register in spec[1], sometimes we need two vmask to do the vector mask logical ops like `AndVMask, OrVMask, XorVMask`. And if only v0 and v31 mask registers are defined, the corresponding c2 nodes will not be generated correctly because of the register pressure[3]. > When we use only v0 and v31 as mask registers, jtreg testing of Byte128VectorTests.java[4] with `-XX:+PrintAssembly` and `-XX:LogFile` will not emit the expected rvv mask instruction. 
As opposed to this, the following compilation failure log[3] is generated: > > > > > > > > > So define v30 and v31 as mask registers too and `AndVMask` will emit the C2 JIT code like: > > vloadmask V0, V1 > vloadmask V30, V2 > vmask_and V0, V30, V0 > > We also modified the implementation of `spill_copy_vector_stack_to_stack` so that it no longer occupies the v0 register. In addition to that, we changed some nodes such as `vasr/vlsl/vlsr/vstring_x/varray_x/vclearArray_x`, which use v0 internally, so that C2 is aware that they use v0. > > ## vector load/store - predicated & blend operation > > Jtreg testing of Byte128VectorTests.java[4] with -XX:+PrintOptoAssembly and -XX:LogFile will print the following compilation log, which is generated by predicated vector load/store: > > 152 B21: # out( B22 ) <- in( B20 ) Freq: 0.499984 > 152 vmask_gen_L V0, R12 > 162 loadV_masked V1, V0, [R10] > 16e storeV_masked [R11], V0, V1 > > > And `VectorBlend` will generate the following compilation log (part of rotate operation): > > 1ea vlsrBS V6, V1, V3 V0 > 1fe vlslBS V5, V1, V2 V0 > 212 vor.vv V2, V5, V6 #@vor > 21a vloadmask V0, V4 > 222 vmerge_vvm V1, V1, V2 # vector blend > 22a bgeu R9, R30, B56 #@cmpU_branch P=0.000001 C=-1.000000 > > > At the same time, we added the predicated nodes of `RShiftV/LShiftV/URShiftV`. Since this duplicated some code from the corresponding non-masked nodes, a small refactoring was done.
> > > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc > [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java > [3] https://github.com/openjdk/jdk/blob/0deb648985b018653ccdaf193dc13b3cf21c088a/src/hotspot/share/opto/chaitin.cpp#L526 > [4] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Byte128VectorTests.java > > ### Testing: > > qemu with UseRVV: > - [x] Tier1 tests (release) > - [x] Tier2 tests (release) > - [x] Tier3 tests (release) > - [x] test/jdk/jdk/incubator/vector (release/fastdebug) Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: Modify some format ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12682/files - new: https://git.openjdk.org/jdk/pull/12682/files/4788345a..c0b96d9c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=33 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12682&range=32-33 Stats: 24 lines in 1 file changed: 0 ins; 0 del; 24 mod Patch: https://git.openjdk.org/jdk/pull/12682.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12682/head:pull/12682 PR: https://git.openjdk.org/jdk/pull/12682 From fyang at openjdk.org Tue Apr 25 13:28:20 2023 From: fyang at openjdk.org (Fei Yang) Date: Tue, 25 Apr 2023 13:28:20 GMT Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v34] In-Reply-To: References: Message-ID: On Tue, 25 Apr 2023 13:17:25 GMT, Dingli Zhang wrote: >> HI, >> >> We have added support for vector add mask instructions, please take a look and have some reviews. Thanks a lot! >> This patch will add support of vector add/sub/mul/div mask version. It was implemented by referring to RVV v1.0 [1]. >> >> ## Load/Store/Cmp Mask >> `VectorLoadMask, VectorMaskCmp, VectorStoreMask` will implement the mask datapath. 
We can see where the data is passed in the compilation log with `jdk/incubator/vector/Byte128VectorTests.java`? >> >> 218 loadV V1, [R7] # vector (rvv) >> 220 vloadmask V0, V1 >> ... >> 23c vmaskcmp_rvv_masked V0, V4, V5, V0, V1, #0 >> 24c vstoremask V1, V0 >> 258 storeV [R7], V1 # vector (rvv) >> >> >> The corresponding generated jit assembly? >> >> # loadV >> 0x000000400c8ef958: vsetvli t0,zero,e8,m1,tu,mu >> 0x000000400c8ef95c: vle8.v v1,(t2) >> >> # vloadmask >> 0x000000400c8ef960: vsetvli t0,zero,e8,m1,tu, >> 0x000000400c8ef964: vmsne.vx v0,v1,zero >> >> # vmaskcmp_rvv_masked >> 0x000000400c8ef97c: vsetvli t0,zero,e8,m1,tu,mu >> 0x000000400c8ef980: vmclr.m v1 >> 0x000000400c8ef984: vmseq.vv v1,v4,v5,v0.t >> 0x000000400c8ef988: vmv1r.v v0,v1 >> >> # vstoremask >> 0x000000400c8ef98c: vsetvli t0,zero,e8,m1,tu,mu >> 0x000000400c8ef990: vmv.v.x v1,zero >> 0x000000400c8ef994: vmerge.vim v1,v1,1,v0 >> >> >> ## Masked vector arithmetic instructions (e.g. vadd) >> AddMaskTestMerge case: >> >> import jdk.incubator.vector.IntVector; >> import jdk.incubator.vector.VectorMask; >> import jdk.incubator.vector.VectorOperators; >> import jdk.incubator.vector.VectorSpecies; >> >> public class AddMaskTestMerge { >> >> static final VectorSpecies SPECIES = IntVector.SPECIES_128; >> static final int SIZE = 1024; >> static int[] a = new int[SIZE]; >> static int[] b = new int[SIZE]; >> static int[] r = new int[SIZE]; >> static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false}; >> static { >> for (int i = 0; i < SIZE; i++) { >> a[i] = i; >> b[i] = i; >> } >> } >> >> static void workload(int idx) { >> VectorMask vmask = VectorMask.fromArray(SPECIES, c, 0); >> IntVector av = IntVector.fromArray(SPECIES, a, idx); >> IntVector bv = IntVector.fromArray(SPECIES, b, idx); >> av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx); >> } >> >> public static void main(String[] args) { >> for (int i = 0; i < 30_0000; i++) { >> for (int j = 0; j < SIZE; j += 
SPECIES.length()) { >> workload(j); >> } >> } >> } >> } >> >> >> This test case is reduced from existing jtreg vector tests Int128VectorTests.java[2]. This test case corresponds to the add instruction of the vector mask version and other instructions are similar. >> >> Before this patch, the compilation log will not print RVV-related instructions. Now the compilation log is as follows: >> >> >> 0ae B10: # out( B25 B11 ) <- in( B9 ) Freq: 0.999991 >> 0ae loadV V1, [R31] # vector (rvv) >> 0b6 vloadmask V0, V2 >> 0be vadd.vv V3, V1, V0 #@vaddI_masked >> 0c6 lwu R28, [R7, #124] # loadN, compressed ptr, #@loadN ! Field: AddMaskTestMerge.r >> 0ca decode_heap_oop R28, R28 #@decodeHeapOop >> 0cc lwu R7, [R28, #12] # range, #@loadRange >> 0d0 NullCheck R28 >> >> >> And the jit code is as follows: >> >> >> 0x000000400c823cee: vsetvli t0,zero,e32,m1,tu,mu >> 0x000000400c823cf2: vle32.v v1,(t6) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0} >> ; - jdk.incubator.vector.IntVector::intoArray at 43 (line 3228) >> ; - AddMaskTestMerge::workload at 46 (line 25) >> 0x000000400c823cf6: vsetvli t0,zero,e8,m1,tu,mu >> 0x000000400c823cfa: vmsne.vx v0,v2,zero ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0} >> ; - jdk.incubator.vector.VectorMask::fromArray at 47 (line 208) >> ; - AddMaskTestMerge::workload at 7 (line 22) >> 0x000000400c823cfe: vsetvli t0,zero,e32,m1,tu,mu >> 0x000000400c823d02: vadd.vv v3,v3,v1,v0.t ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0} >> ; - jdk.incubator.vector.IntVector::lanewiseTemplate at 192 (line 834) >> ; - jdk.incubator.vector.Int128Vector::lanewise at 9 (line 291) >> ; - jdk.incubator.vector.Int128Vector::lanewise at 4 (line 41) >> ; - AddMaskTestMerge::workload at 39 (line 25) >> >> >> ## Mask register allocation & mask bit opreation >> Since v0 is to be used as a mask register in spec[1], sometimes we need two vmask to do the vector mask logical ops like `AndVMask, OrVMask, XorVMask`. 
And if only v0 and v31 mask registers are defined, the corresponding c2 nodes will not be generated correctly because of the register pressure[3]. >> When we use only v0 and v31 as mask registers, jtreg testing of Byte128VectorTests.java[4] with `-XX:+PrintAssembly` and `-XX:LogFile` will not emit the expected rvv mask instruction. As opposed to this, the following compilation failure log[3] is generated: >> >> >> >> >> >> >> >> >> So define v30 and v31 as mask register too and `AndVMask` will emit the C2 JIT code like: >> >> vloadmask V0, V1 >> vloadmask V30, V2 >> vmask_and V0, V30, V0 >> >> We also modified the implementation of `spill_copy_vector_stack_to_stack ` so that it no longer occupies the v0 register. In addition to that, we change some node like `vasr/vlsl/vlsr/vstring_x/varray_x/vclearArray_x`, which use v0 internally, to make C2 to sense that they used v0. >> >> ## vector load/store - predicated & blend opreation >> >> Jtreg testing of Byte128VectorTests.java[4] with -XX:+PrintOptoAssembly and -XX:LogFile will print the following compilation log, which generated by predicated vector load/store: >> >> 152 B21: # out( B22 ) <- in( B20 ) Freq: 0.499984 >> 152 vmask_gen_L V0, R12 >> 162 loadV_masked V1, V0, [R10] >> 16e storeV_masked [R11], V0, V1 >> >> >> And `VectorBlend` will generate the following compilation log (part of rotate opreation): >> >> 1ea vlsrBS V6, V1, V3 V0 >> 1fe vlslBS V5, V1, V2 V0 >> 212 vor.vv V2, V5, V6 #@vor >> 21a vloadmask V0, V4 >> 222 vmerge_vvm V1, V1, V2 # vector blend >> 22a bgeu R9, R30, B56 #@cmpU_branch P=0.000001 C=-1.000000 >> >> >> At the same time, we added the predicated nodes of `RShiftV/LShiftV/URShiftV`. While there was some code duplication for the corresponding nodes in non-masked form, so a small refactoring was done. 
>> >> >> [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc >> [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java >> [3] https://github.com/openjdk/jdk/blob/0deb648985b018653ccdaf193dc13b3cf21c088a/src/hotspot/share/opto/chaitin.cpp#L526 >> [4] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Byte128VectorTests.java >> >> ### Testing: >> >> qemu with UseRVV: >> - [x] Tier1 tests (release) >> - [x] Tier2 tests (release) >> - [x] Tier3 tests (release) >> - [x] test/jdk/jdk/incubator/vector (release/fastdebug) > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Modify some format Updated change looks fine. We might need more tuning of the number of mask registers in the future. But I think it's reasonable to keep a relatively small number of mask registers for now. Thanks. ------------- Marked as reviewed by fyang (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/12682#pullrequestreview-1399919146 From epeter at openjdk.org Tue Apr 25 13:29:16 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 25 Apr 2023 13:29:16 GMT Subject: RFR: 8287087: C2: perform SLP reduction analysis on-demand [v4] In-Reply-To: <-5XPy4I-SVMVvxwGusambZ0QlT_UjpwB_pmq5IWpWlk=.dba6fe41-16f4-4ed4-a4df-62094fc33a86@github.com> References: <-5XPy4I-SVMVvxwGusambZ0QlT_UjpwB_pmq5IWpWlk=.dba6fe41-16f4-4ed4-a4df-62094fc33a86@github.com> Message-ID: On Mon, 24 Apr 2023 15:06:12 GMT, Roberto Castañeda Lozano wrote: >> Reduction analysis finds cycles of reduction operations within loops. The result of this analysis is used by SLP auto-vectorization (to vectorize reductions if deemed profitable) and by x64 instruction matching (to select specialized scalar floating-point `Math.min()/max()` implementations).
Currently, reduction analysis is applied early (before loop unrolling), and the result is propagated through loop unrolling by marking nodes and loops with special reduction flags. Applying reduction analysis early is efficient, but propagating the results correctly through loop unrolling and arbitrary graph transformations is challenging and often leads to inconsistent node-loop reduction flag states, some of which have led to actual miscompilations in the past (see [JDK-8261147](https://bugs.openjdk.org/browse/JDK-8261147) and [JDK-8279622](https://bugs.openjdk.org/browse/JDK-8279622)). >> >> This changeset postpones reduction analysis to the point where its results are actually used. To do so, it generalizes the analysis to find reduction cycles on unrolled loops: >> >> ![reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226725587-b7d68509-3717-4bbe-8d54-f9a105853fda.png) >> >> The generalized analysis precludes the need to maintain and propagate node and loop reduction flags through arbitrary IR transformations, reducing the risk of miscompilations due to invalidation of the analysis results. The generalization is slightly more costly than the current analysis, but still negligible in micro- and general benchmarks. >> >> ## Performance Benefits >> >> As a side benefit, the proposed generalization is able to find more reductions, increasing the scope of auto-vectorization and the performance of x64 floating-point `Math.min()/max()` in multiple scenarios. >> >> ### Increased Auto-Vectorization Scope >> >> There are two main scenarios in which the proposed changeset enables further auto-vectorization: >> >> #### Reductions Using Global Accumulators >> >> >> public class Foo { >> int acc = 0; >> (..) >> void reduce(int[] array) { >> for (int i = 0; i < array.length; i++) { >> acc += array[i]; >> } >> } >> } >> >> Initially, such reductions are wrapped by load and store nodes, which defeats the current reduction analysis. 
However, after unrolling and other optimizations are applied, the reduction becomes recognizable by the proposed analysis: >> >> ![global-reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226745351-33494e40-7c07-4a8b-8bf6-d3a96e84b1c2.png) >> >> #### Reductions of partially unrolled loops >> >> >> (..) >> for (int i = 0; i < array.length / 2; i++) { >> acc += array[2*i]; >> acc += array[2*i + 1]; >> } >> (..) >> >> >> These reductions are manually unrolled from the beginning, so the current reduction analysis fails to find them, while the proposed analysis is able to detect them as if they were unrolled automatically. >> >> ### Increased Performance of x64 Floating-Point `Math.min()/max()` >> >> Besides the above scenarios, the proposed generalization allows the x64 matcher to select specialized floating-point `Math.min()/max()` implementations for reductions in non-counted and outer loops (see the new micro-benchmarks in `FpMinMaxIntrinsics.java` for more details). >> >> ## Implementation details >> >> The generalized reduction analysis finds reductions in a loop by looking for chains of reduction operators of the same node type starting and finishing on each phi node in the loop. To avoid a combinatorial explosion, the analysis assumes that all nodes in a chain are connected via the same edge index, which is realistic because chains usually consist of identical nodes cloned by loop unrolling. This assumption allows the analysis to test only two paths for each examined phi node. A failure of this assumption (e.g. as illustrated in test case `testReductionOnPartiallyUnrolledLoopWithSwappedInputs` from `TestGeneralizedReductions.java`) results in mising vectorization but does not affect correctness. Note that the same-index assumption can only fail in cases where current auto-vectorization would also fail to vectorize (manually unrolled loops). 
>> >> The changeset implements a more relaxed version of the reduction analysis for x64 matching, suitable for queries on single nodes. This analysis is run only in the presence of `[Min|Max][F|D]` nodes. >> >> ## Alternative approaches >> >> A complication results from edge swapping in the nodes cloned by loop unrolling (see [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/addnode.cpp#L123) and [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/mulnode.cpp#L113)), which can lead to reduction chains connected via different input indices. This is addressed by tracking whether nodes have swapped edges and adjusting the explored input indices in the reduction analysis accordingly. An alternative (proposed by @eme64 and @jatin-bhateja ) is to replace this changeset's linear chain finding approach with some form of general path-finding algorithm. This alternative would preclude the need for tracking edge swapping at a potentially higher computational cost. 
The following table summarizes the pros and cons of the current mainline approach, this changeset, and the proposed alternative: >> >> | approach | correctness | efficiency | effectiveness | conceptual complexity | >> | -------- | ----------- | ---------- | ------------- | --------------------- | >> | mainline (current) | hard to establish due to need of maintaining reduction flags through arbitrary graph transformations (has led to miscompilations, see JDK-8261147 and JDK-8279622) | high | low (misses substantial reduction vectorization opportunities) | high (requires maintaining non-local reduction node state) | >> | this changeset | easy to establish since client transformations operate on the same graph that is analyzed | medium (limited search for chains of nodes) | high (finds all reduction cycles except for partially unrolled loops with manually-swapped inputs) | medium (requires maintaining local swapped-edge node state) | >> | general search | easy to establish (same as above) | low (general search), particularly for x64 matching where the analysis runs once for every node in a chain | high (similar to above but also covering manually-swapped inputs) | low (no node state required, use of well-known graph search algorithms) | >> >> Since the efficiency-conceptual complexity trade-off between this changeset and the general search approach is not obvious, I propose to integrate this changeset (which strikes a balance between the two) and investigate the latter one in a follow-up RFE. >> >> ## Testing >> >> ### Functionality >> >> - tier1-5 (linux-x64, linux-aarch64, windows-x64, macosx-x64, and macosx-aarch64). >> - fuzzing (12 h. on linux-x64 and linux-aarch64). >> >> ##### TestGeneralizedReductions.java >> >> Tests the new scenarios in which vectorization occurs. These tests are restricted to 64-bits platforms, since I do not have access to 32-bits ones. 
`testReductionOnPartiallyUnrolledLoop` has been observed to fail on [linux-x86](https://github.com/robcasloz/jdk/actions/runs/4478959520/jobs/7873827856#logs) due to missing vectorization. If anyone wants to have a look and derive the necessary IR test framework preconditions for the test to pass on linux-x86, I am happy to lift the 64-bits restriction. >> >> ##### TestFpMinMaxReductions.java >> >> Tests the matching of floating-point max/min implementations in x64. >> >> ##### TestSuperwordFailsUnrolling.java >> >> This test file is updated to ensure auto-vectorization is never triggered, because this changeset would otherwise enable it and defeat the purpose of the test. >> >> ### Performance >> >> #### General Benchmarks >> >> The changeset does not cause any performance regression on the DaCapo, SPECjvm 2008, and SPECjbb2015 benchmark suites for linux-x64 and linux-aarch64. >> >> #### Micro-benchmarks >> >> The changeset extends two existing files with additional micro-benchmarks that show the benefit of the generalized reduction analysis ([full results](https://github.com/openjdk/jdk/files/11039207/microbenchmark-results.ods)). >> >> >> ##### VectorReduction.java >> >> These micro-benchmarks are first adjusted to actually vectorize in the mainline approach, since they suffered from the global-accumulator limitation. Two micro-benchmarks are added to exercise vectorization in the presence of global accumulators and partially unrolled loops. Running `VectorReduction.java` on an x64 (Cascade Lake) machine confirms the expectations: compared to mainline (with the adjustment mentioned above), this changeset yields similar performance results except for `andRedIOnGlobalAccumulator` and `andRedIPartiallyUnrolled`, where the changeset improves performance by 2.4x in both cases. >> >> ##### MaxIntrinsics.java >> >> This file is extended with four new micro-benchmarks. 
Running it on the same machine as above shows that the changeset does not affect the performance of the existing micro-benchmarks, and improves moderately to substantially the performance of the new ones (because it allows the x64 matcher to select a floating-point `Math.min()` implementation that is specialized for reduction min operations): >> >> | micro-benchmark | speedup compared to mainline | >> | --- | --- | >> | `fMinReduceInOuterLoop` | 1.1x | >> | `fMinReduceNonCounted` | 2.3x | >> | `fMinReduceGlobalAccumulator` | 2.4x | >> | `fMinReducePartiallyUnrolled` | 3.9x | >> >> ## Acknowledgments >> >> Thanks to @danielogh for making it possible to test this improvement with confidence ([JDK-8294715](https://bugs.openjdk.org/browse/JDK-8294715)) and to @TobiHartmann, @chhagedorn, @vnkozlov and @eme64 for discussions and useful feedback. > > Roberto Castañeda Lozano has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 33 commits: > > - Merge master > - Fix node naming in reduction chain traversal > - Use is_marked_reduction() in new SLP code > - Merge master > - Emit Node::Flag_has_swapped_edges in IGV graphs > - Merge master > - Relax the reduction cycle search bound > - Remove redundant IR check precondition > - Use SuperWord members in reduction marking > - Remove redundant opcode checks > - ... and 23 more: https://git.openjdk.org/jdk/compare/7400aff3...1510accd Still looks good. ------------- Marked as reviewed by epeter (Committer).
PR Review: https://git.openjdk.org/jdk/pull/13120#pullrequestreview-1399921140 From jkarthikeyan at openjdk.org Tue Apr 25 14:43:23 2023 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Tue, 25 Apr 2023 14:43:23 GMT Subject: RFR: 8051725: Improve expansion of Conv2B nodes in the middle-end [v6] In-Reply-To: References: Message-ID: <2ODJH1IFMOVjRgjQIeobF2eb_nxTCgnxcV__ttNz9nw=.7cbf388a-0a65-4d1c-8b60-d29ae3502123@github.com> > Hi, I've created optimizations for the expansion of `Conv2B` nodes, especially when followed immediately by an xor of 1. This pattern is fairly common, and can arise from both [cmov idealization](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/movenode.cpp#L241) and [diamond-phi optimization](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L1571). This change replaces `Conv2B` nodes in the middle-end during post loop opts IGVN with conditional moves on supported platforms (x86_64, aarch64, arm32), allowing the bit flip with `xor` to be subsumed with an inversion of the comparison instead. This change also reduces the overhead of the matcher in the backends, as fewer rules need to be traversed in order to match an ideal node. Performance results from my (Zen 2) machine: > > > Baseline Patch Improvement > Benchmark Mode Cnt Score Error Units Score Error Units > Conv2BRules.testEquals0 avgt 10 47.566 ± 0.346 ns/op / 34.130 ± 0.177 ns/op + 28.2% > Conv2BRules.testNotEquals0 avgt 10 37.167 ± 0.211 ns/op / 34.185 ± 0.258 ns/op + 8.0% > Conv2BRules.testEquals1 avgt 10 35.059 ± 0.280 ns/op / 34.847 ± 0.160 ns/op (unchanged) > Conv2BRules.testEqualsNull avgt 10 56.768 ± 2.600 ns/op / 34.330 ± 0.625 ns/op + 39.5% > Conv2BRules.testNotEqualsNull avgt 10 47.447 ± 1.193 ns/op / 34.142 ± 0.303 ns/op + 28.0% > > Reviews would be greatly appreciated! > > Testing: tier1-2 on linux x64, GHA Jasmine Karthikeyan has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains 11 commits: - Merge branch 'master' into conv2b-x86-lowering - Whitespace tweak - Make transform conditional - Remove Conv2B from backend as it's macro expanded now - Re-work transform to happen in macro expansion - Fix whitespace and add bug tag to IR test - Merge branch 'master' into conv2b-x86-lowering - Merge branch 'master' into conv2b-x86-lowering - Merge branch 'master' into conv2b-x86-lowering - Merge branch 'master' into conv2b-x86-lowering - ... and 1 more: https://git.openjdk.org/jdk/compare/bad6aa68...295b9a67 ------------- Changes: https://git.openjdk.org/jdk/pull/13345/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13345&range=05 Stats: 358 lines in 14 files changed: 241 ins; 114 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/13345.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13345/head:pull/13345 PR: https://git.openjdk.org/jdk/pull/13345 From epeter at openjdk.org Tue Apr 25 15:07:19 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 25 Apr 2023 15:07:19 GMT Subject: RFR: 8304720: SuperWord::schedule should rebuild C2-graph from SuperWord dependency-graph Message-ID: `SuperWord::schedule`, and specifically `SuperWord::co_locate_pack`, is broken. As far as I can tell, the problem lies in its basic approach. Hence, I had to completely re-design the `schedule` algorithm, based on the `PacksetGraph` ([JDK-8304042](https://bugs.openjdk.org/browse/JDK-8304042), https://git.openjdk.org/jdk/pull/13078). **The current approach** The idea is to leave the non-vectorized memory ops in their place, and find the right place for the vectorized memops to be "sandwiched" into. The logic is very complex and has already had a few bugs fixed. **Why this does not work** However, in some rare cases, we have to reorder non-vectorized operations.
See this example that I added as a regression test: https://github.com/openjdk/jdk/blob/a771a61005aea272cc51fa3f3e1637c217582fce/test/hotspot/jtreg/compiler/loopopts/superword/TestScheduleReordersScalarMemops.java#L82-L109 I found this issue during work on https://github.com/openjdk/jdk/pull/13078, where I had to restrict/disable some tests that are now passing. **Solution** Abandon the idea of "sandwiching" memops. Rewrite `SuperWord::schedule`: https://github.com/openjdk/jdk/blob/6bb2da3da988618803823e905f23cb106cd9d6b2/src/hotspot/share/opto/superword.cpp#L2567-L2576 We first schedule all memops into a linear order. We do this scheduling based on the `PacksetGraph`, which gives us a `DAG` based on the `packset` and the dependency-graph (which in turn respects the data use-defs, as well as the memory dependencies, unless we can prove that they do not reference the same memory). In other words: we have a linearization that respects all dependencies that must be respected. Further, we make sure that ops from the same pack are scheduled as a block (all adjacent to each other), and in the order that the pack has internally. https://github.com/openjdk/jdk/blob/6bb2da3da988618803823e905f23cb106cd9d6b2/src/hotspot/share/opto/superword.cpp#L2489-L2493 Now that we have this order (and we have not aborted because we found a cycle in the `PacksetGraph`), we must apply this schedule to each memory slice, and reorder the memops in the slices accordingly. https://github.com/openjdk/jdk/blob/6bb2da3da988618803823e905f23cb106cd9d6b2/src/hotspot/share/opto/superword.cpp#L2617-L2619 This scheduling has the nice side-effect of simplifying `SuperWord::output` a little. We know now that the first element in a pack is also first in the slice order, and the last element in the pack is last in the slice (because we schedule the packs as a block, i.e. in the pack order).
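The linearization described above can be illustrated with a small, self-contained sketch. This is not the HotSpot code: the class `PackScheduleSketch`, its `schedule` method, and the integer encoding of memops, packs and dependencies are all hypothetical, chosen only to show the idea of contracting each pack into one graph node, running a topological sort, and giving up when the contracted graph has a cycle.

```java
import java.util.*;

// Hypothetical illustration only; names and data layout are assumptions,
// not taken from superword.cpp.
public class PackScheduleSketch {

    /**
     * @param n     number of memops, numbered 0..n-1 in their original order
     * @param packs each pack lists its members in pack order
     * @param deps  pairs {before, after}: "before" must be scheduled first
     * @return a linear order with every pack emitted as one block,
     *         or null if the contracted graph contains a cycle
     */
    static List<Integer> schedule(int n, List<int[]> packs, List<int[]> deps) {
        // One group per pack, plus one singleton group per remaining scalar op.
        int[] groupOf = new int[n];
        Arrays.fill(groupOf, -1);
        List<int[]> groups = new ArrayList<>();
        for (int[] pack : packs) {
            for (int op : pack) {
                groupOf[op] = groups.size();
            }
            groups.add(pack);
        }
        for (int op = 0; op < n; op++) {
            if (groupOf[op] == -1) {
                groupOf[op] = groups.size();
                groups.add(new int[] { op });
            }
        }

        // Lift the op-level dependencies to edges between groups.
        int g = groups.size();
        List<Set<Integer>> succ = new ArrayList<>();
        for (int i = 0; i < g; i++) {
            succ.add(new LinkedHashSet<>());
        }
        int[] indegree = new int[g];
        for (int[] d : deps) {
            int from = groupOf[d[0]];
            int to = groupOf[d[1]];
            if (from != to && succ.get(from).add(to)) {
                indegree[to]++;
            }
        }

        // Kahn's algorithm: repeatedly emit a group with no pending predecessors.
        Deque<Integer> ready = new ArrayDeque<>();
        for (int i = 0; i < g; i++) {
            if (indegree[i] == 0) {
                ready.add(i);
            }
        }
        List<Integer> order = new ArrayList<>();
        int scheduled = 0;
        while (!ready.isEmpty()) {
            int cur = ready.poll();
            scheduled++;
            for (int op : groups.get(cur)) {
                order.add(op); // pack members stay adjacent, in pack order
            }
            for (int next : succ.get(cur)) {
                if (--indegree[next] == 0) {
                    ready.add(next);
                }
            }
        }
        // Groups left unscheduled mean a cycle: the caller would have to abort.
        return scheduled == g ? order : null;
    }

    public static void main(String[] args) {
        // Ops 0 and 3 form a pack; op 1 depends on op 0, op 3 depends on op 2.
        List<int[]> packs = List.of(new int[] { 0, 3 });
        List<int[]> deps = List.of(new int[] { 0, 1 }, new int[] { 2, 3 });
        System.out.println(schedule(4, packs, deps)); // prints [2, 0, 3, 1]
    }
}
```

In the example in `main`, the two scalar ops 1 and 2 end up in swapped order, which is exactly the kind of scalar reordering that a "sandwiching" approach cannot express.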
**Discussion** This seems to me to be a much more straightforward approach, and it uses the code I recently added for verification of cyclic dependencies in the packset ([JDK-8304042](https://bugs.openjdk.org/browse/JDK-8304042), https://git.openjdk.org/jdk/pull/13078). One potential improvement to my fix: We now sometimes re-order the non-vectorized memory slices, even though it may not be necessary. This is not wrong, but it makes updates to the graph that may be confusing when debugging. Further, the re-ordering may have performance impacts. I could use a priority-queue (min-heap, would have to implement it since it does not yet exist), and schedule the `PacksetGraph` whenever possible with the lower `bb_idx` first. This would make the new linear order the same/closer to the old one. However, I am not sure if this is worth the effort and overhead of a priority-queue. **Testing** Github-actions pass. tier1-6 + stress testing passes. Performance testing showed no significant performance change. ------------- Commit messages: - Merge branch 'master' into JDK-8304720 - re-restrict a IR test to 64 bit, fails on x86-32 - bigger cleanup - weaken the assert that fails because of my_pack inconsistency with -XX:+UseVectorCmov - Improve some bad comments - fixed last memory state in loop, and enabled tests from prevous bug that should now pass - more to last commit - special handling for Loads with memory state outside loop - regression test - Update to SuperWord::output - may need more work there - ... 
and 1 more: https://git.openjdk.org/jdk/compare/f239695b...677400bb Changes: https://git.openjdk.org/jdk/pull/13354/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13354&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8304720 Stats: 617 lines in 5 files changed: 223 ins; 273 del; 121 mod Patch: https://git.openjdk.org/jdk/pull/13354.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13354/head:pull/13354 PR: https://git.openjdk.org/jdk/pull/13354 From eliu at openjdk.org Tue Apr 25 15:12:12 2023 From: eliu at openjdk.org (Eric Liu) Date: Tue, 25 Apr 2023 15:12:12 GMT Subject: RFR: 8304948: [vectorapi] C2 crashes when expanding VectorBox [v3] In-Reply-To: References: Message-ID: <1wE09YSA0YEQ5CdPJkBHiyO7HFOUF1dh9mbMOuT5W04=.22b77c9b-3b7a-4107-bd6d-44f9a3a6e5d5@github.com> > This patch fixes C2 failure with SIGSEGV due to endless recursion. > > With test case VectorBoxExpandTest.java in this patch, C2 would generate IR graph like below: > > > ------------ > / \ > Region | VectorBox | > \ | / | > Phi | > | | > | | > Region | VectorBox | > \ | / | > Phi | > | | > |------------/ > | > > > > This Phi will be optimized by merge_through_phi [1], which transforms `Phi (VectorBox VectorBox)` into `VectorBox (Phi Phi)` to pursue opportunity of combining VectorBox with VectorUnbox. In this process, either the pre type check [2] or the process cloning Phi nodes [3], the circle case is well considered to avoid falling into endless loop. > > After merge_through_phi, each input Phi of new VectorBox has the same shape with original root Phi before merging (only VectorBox has been replaced). After several other optimizations, C2 would expand VectorBox [4] on a graph like below: > > > ------------ > / \ > Region | Proj | > \ | / | > Phi | > | | > | | > Region | Proj | > \ | / | > Phi | > | | > |------------/ > | > | Phi > | / > VectorBox > > > which the circle case should be taken into consideration as well. > > [TEST] > Full Jtreg passed without new failure. 
> > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L2554 > [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L2571 > [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L2531 > [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/vector.cpp#L311 Eric Liu has updated the pull request incrementally with one additional commit since the last revision: expand vector box in local Change-Id: Ie7bddc049b479aad4f953ec920d83b91e7de2152 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13489/files - new: https://git.openjdk.org/jdk/pull/13489/files/7119ed69..f4630cdc Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13489&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13489&range=01-02 Stats: 52 lines in 1 file changed: 25 ins; 11 del; 16 mod Patch: https://git.openjdk.org/jdk/pull/13489.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13489/head:pull/13489 PR: https://git.openjdk.org/jdk/pull/13489 From eliu at openjdk.org Tue Apr 25 16:10:07 2023 From: eliu at openjdk.org (Eric Liu) Date: Tue, 25 Apr 2023 16:10:07 GMT Subject: RFR: 8304948: [vectorapi] C2 crashes when expanding VectorBox [v2] In-Reply-To: References: Message-ID: <0PLhkQb-zatS4ecUfr1e69h5BVo9yPNh6TCLHX7UOfg=.41a730a6-827b-4e6d-bb1f-8a01f7393d91@github.com> On Sat, 22 Apr 2023 09:38:37 GMT, Eric Liu wrote: >> I do still not fully grasp it though. When `expand_vbox_helper` is called on `Phi1`, it creates `NewPhi1`, then it calls `expand_vbox_helper` on `Phi2`, a `NewPhi2` is created, then `expand_vbox_helper` is invoked again on `Phi1`, which it shortcircuits to return `Phi1`, which will be attached as an input of `NewPhi2`, at this step should the second invocation on `Phi1` returns `NewPhi1` instead? >> >> Thanks a lot. > > Yes, I think it should return the `NewPhi1` instead. 
> > In my test case, the `NewPhi1` and `NewPhi2` are idealized to `Phi1` and `Phi2`, so it does not matter whether it returns the new one. But I'm not sure if it's certain to be idealize to the old one. Anyway, return the new is more reasonable. I will fix that and do test. > Hi @e1iu , Can you please elaborate the reason for CYCLIC Phi creation in this case. To explain this issue, I also added a test case with the IR graph. I think a cyclic Phi is normal in C2, especially in a two-level loop nest. The two Phi nodes represent the value of `a` after the inner loop and the outer loop respectively. After the outer loop, `a` should be a Phi (Phi1): one input is `a0`, the initial value, and the other is the value after the inner loop. The value of `a` after the inner loop should also be a Phi (Phi2): one input is the initial `a1`, and the other is the value after the outer loop, which is Phi1. In the Vector API, `a0` and `a1` can be VectorBox. > Ideally we should not have seen that pallet in first place. Cyclic Phis have already been taken into consideration in Phi optimization. Please refer to the links [2] and [3]. If we disabled this shape in merge_through_phi, I think the bug would be gone in this place. But that would lose some chances of matching VectorBox and VectorUnbox. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13489#discussion_r1176736542 From qamai at openjdk.org Tue Apr 25 16:16:24 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Tue, 25 Apr 2023 16:16:24 GMT Subject: Integrated: 8304676: [vectorapi] x86_32: Crash in Assembler::kmovql(Address, KRegister) In-Reply-To: References: Message-ID: On Sun, 23 Apr 2023 19:51:58 GMT, Quan Anh Mai wrote: > Hi, > > Can I have reviews for this patch which fixes crashes during PrintOptoAssembly of KRegister spilling code. The reason is that we miss the check for nullptr cbuf. > > Thanks a lot. This pull request has now been integrated.
Changeset: e8f62de1 Author: Quan Anh Mai URL: https://git.openjdk.org/jdk/commit/e8f62de1cf791d0212805c7a5a97497b67e2a34a Stats: 26 lines in 1 file changed: 20 ins; 2 del; 4 mod 8304676: [vectorapi] x86_32: Crash in Assembler::kmovql(Address, KRegister) Reviewed-by: shade, thartmann, jbhateja ------------- PR: https://git.openjdk.org/jdk/pull/13603 From eliu at openjdk.org Tue Apr 25 16:19:13 2023 From: eliu at openjdk.org (Eric Liu) Date: Tue, 25 Apr 2023 16:19:13 GMT Subject: RFR: 8304948: [vectorapi] C2 crashes when expanding VectorBox [v2] In-Reply-To: <0PLhkQb-zatS4ecUfr1e69h5BVo9yPNh6TCLHX7UOfg=.41a730a6-827b-4e6d-bb1f-8a01f7393d91@github.com> References: <0PLhkQb-zatS4ecUfr1e69h5BVo9yPNh6TCLHX7UOfg=.41a730a6-827b-4e6d-bb1f-8a01f7393d91@github.com> Message-ID: On Tue, 25 Apr 2023 16:05:30 GMT, Eric Liu wrote: >> Yes, I think it should return the `NewPhi1` instead. >> >> In my test case, the `NewPhi1` and `NewPhi2` are idealized to `Phi1` and `Phi2`, so it does not matter whether it returns the new one. But I'm not sure if it's certain to be idealize to the old one. Anyway, return the new is more reasonable. I will fix that and do test. > >> Hi @e1iu , Can you please elaborate the reason for CYCLIC Phi creation in this case. > > To explain this issue, I also added a test case with the IR graph. The cycled Phi, I think is normal in C2 especially when it's two tiered loop. > > The two Phi nodes represent the value of `a` after inner loop and outer loop respectively. After outer loop, `a` should be a Phi(Phi1), one value is `a0` which is the initial value, another is the value after inner loop. The value of `a` after inner loop should also be a Phi(Phi2), one value is the initial `a1`, another is the value after outer loop, which is Phi1. > > In Vector API, `a0` and `a1` can be VectorBox. > >> Ideally we should not have seen that pallet in first place. > > Cycled Phi has already been taken into consideration in Phi optimization. 
Please refer to the links [2] and [3]. If we disabled this shape in merge_thorugh_phi, I think the bug will be gone in this place. But that would loss some chances matching VectorBox and VectorUnbox. > I do still not fully grasp it though. When `expand_vbox_helper` is called on `Phi1`, it creates `NewPhi1`, then it calls `expand_vbox_helper` on `Phi2`, a `NewPhi2` is created, then `expand_vbox_helper` is invoked again on `Phi1`, which it shortcircuits to return `Phi1`, which will be attached as an input of `NewPhi2`, at this step should the second invocation on `Phi1` returns `NewPhi1` instead? > > Thanks a lot. Hi @merykitty , the fix is done. I think we dont need to create a new graph at all. Searching in local and just to replace the aimed Proj is more clear. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13489#discussion_r1176749326 From amitkumar at openjdk.org Tue Apr 25 17:24:20 2023 From: amitkumar at openjdk.org (Amit Kumar) Date: Tue, 25 Apr 2023 17:24:20 GMT Subject: RFR: 8306855: [s390x] fix difference in abi sizes Message-ID: <2GGc3c254on4RieO7rt8QGcxK9r-LtmamRbJdjYqYR4=.ed0aa5b9-7ecb-4414-beda-1e30cdbd151e@github.com> This PR equals the interpreter, native abi sizes. JIT abi is completely removed as it was not being used anywhere in s390x code base. Apart form it, `z_abi_16` was replaced by `z_common_abi` and some other renaming followed this ladder. Builds `slowdebug, fastdebug, release, optimized` are okay. `tier1-tests on fastdebug-build` are not affected as well. 
------------- Commit messages: - fixes different size issue for different abi frames Changes: https://git.openjdk.org/jdk/pull/13650/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13650&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8306855 Stats: 52 lines in 9 files changed: 14 ins; 11 del; 27 mod Patch: https://git.openjdk.org/jdk/pull/13650.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13650/head:pull/13650 PR: https://git.openjdk.org/jdk/pull/13650 From dlong at openjdk.org Tue Apr 25 17:26:08 2023 From: dlong at openjdk.org (Dean Long) Date: Tue, 25 Apr 2023 17:26:08 GMT Subject: RFR: 8306331: assert((cnt > 0.0f) && (prob > 0.0f)) failed: Bad frequency assignment in if [v2] In-Reply-To: References: Message-ID: <9HMLJeVXVyh0Y0LCcdPcmlxaT-9SBWGJi8ErBZxukxg=.33755bbe-c010-4c94-93bb-ff7221298938@github.com> On Mon, 24 Apr 2023 18:10:44 GMT, Dean Long wrote: >> This change removes undefined behavior caused by signed overflow, which triggered an assert with Xcode14.3+1.0-beta1 on macos aarch64. > > Dean Long has updated the pull request incrementally with two additional commits since the last revision: > > - Update src/hotspot/share/opto/parse2.cpp > > Co-authored-by: Tobias Hartmann > - Update src/hotspot/share/opto/parse2.cpp > > Co-authored-by: Tobias Hartmann I think I'll just leave the names as-is. The comments in the code seem to use "count" and "counter" interchangeably. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/13551#issuecomment-1522153183 From amitkumar at openjdk.org Tue Apr 25 17:30:18 2023 From: amitkumar at openjdk.org (Amit Kumar) Date: Tue, 25 Apr 2023 17:30:18 GMT Subject: RFR: 8306855: [s390x] fix difference in abi sizes [v2] In-Reply-To: <2GGc3c254on4RieO7rt8QGcxK9r-LtmamRbJdjYqYR4=.ed0aa5b9-7ecb-4414-beda-1e30cdbd151e@github.com> References: <2GGc3c254on4RieO7rt8QGcxK9r-LtmamRbJdjYqYR4=.ed0aa5b9-7ecb-4414-beda-1e30cdbd151e@github.com> Message-ID: <8oTtVlCVx_ELHQcLm4esknWWEm-5ucXKDtPv1ih3w0g=.3bb9222d-6316-40b0-8495-91ce201610a3@github.com> > This PR equals the interpreter, native abi sizes. JIT abi is completely removed as it was not being used anywhere in s390x code base. Apart form it, `z_abi_16` was replaced by `z_common_abi` and some other renaming followed this ladder. > > Builds `slowdebug, fastdebug, release, optimized` are okay. `tier1-tests on fastdebug-build` are not affected as well. Amit Kumar has updated the pull request incrementally with one additional commit since the last revision: updates header years ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13650/files - new: https://git.openjdk.org/jdk/pull/13650/files/c9fdd828..cb1f8537 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13650&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13650&range=00-01 Stats: 10 lines in 7 files changed: 0 ins; 0 del; 10 mod Patch: https://git.openjdk.org/jdk/pull/13650.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13650/head:pull/13650 PR: https://git.openjdk.org/jdk/pull/13650 From jbhateja at openjdk.org Tue Apr 25 17:48:10 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 25 Apr 2023 17:48:10 GMT Subject: RFR: 8304948: [vectorapi] C2 crashes when expanding VectorBox [v2] In-Reply-To: References: <0PLhkQb-zatS4ecUfr1e69h5BVo9yPNh6TCLHX7UOfg=.41a730a6-827b-4e6d-bb1f-8a01f7393d91@github.com> Message-ID: On Tue, 25 Apr 2023 16:15:18 GMT, Eric Liu 
wrote: >>> Hi @e1iu , Can you please elaborate the reason for CYCLIC Phi creation in this case. >> >> To explain this issue, I also added a test case with the IR graph. The cycled Phi, I think is normal in C2 especially when it's two tiered loop. >> >> The two Phi nodes represent the value of `a` after inner loop and outer loop respectively. After outer loop, `a` should be a Phi(Phi1), one value is `a0` which is the initial value, another is the value after inner loop. The value of `a` after inner loop should also be a Phi(Phi2), one value is the initial `a1`, another is the value after outer loop, which is Phi1. >> >> In Vector API, `a0` and `a1` can be VectorBox. >> >>> Ideally we should not have seen that pallet in first place. >> >> Cycled Phi has already been taken into consideration in Phi optimization. Please refer to the links [2] and [3]. If we disabled this shape in merge_thorugh_phi, I think the bug will be gone in this place. But that would loss some chances matching VectorBox and VectorUnbox. > >> I do still not fully grasp it though. When `expand_vbox_helper` is called on `Phi1`, it creates `NewPhi1`, then it calls `expand_vbox_helper` on `Phi2`, a `NewPhi2` is created, then `expand_vbox_helper` is invoked again on `Phi1`, which it shortcircuits to return `Phi1`, which will be attached as an input of `NewPhi2`, at this step should the second invocation on `Phi1` returns `NewPhi1` instead? >> >> Thanks a lot. > > Hi @merykitty , the fix is done. I think we dont need to create a new graph at all. Searching in local and just to replace the aimed Proj is more clear. > > Hi @e1iu , Can you please elaborate the reason for CYCLIC Phi creation in this case. > > To explain this issue, I also added a test case with the IR graph. The cycled Phi, I think is normal in C2 especially when it's two tiered loop. > > The two Phi nodes represent the value of `a` after inner loop and outer loop respectively. 
After outer loop, `a` should be a Phi(Phi1), one value is `a0` which is the initial value, another is the value after inner loop. The value of `a` after inner loop should also be a Phi(Phi2), one value is the initial `a1`, another is the value after outer loop, which is Phi1.
>
> In Vector API, `a0` and `a1` can be VectorBox.
>
> > Ideally we should not have seen that pattern in the first place.
>
> Cycled Phi has already been taken into consideration in Phi optimization. Please refer to the links [2] and [3]. If we disabled this shape in merge_through_phi, I think the bug will be gone in this place. But that would lose some chances of matching VectorBox and VectorUnbox.

Thanks for the explanation, yes it makes sense now. During _merge_through_phi_, actual boxes are forwarded across the phi nodes and VBAs are rewired to mimic the original phi pattern. During box expansion we expand allocations in all the converging paths, and because of the cyclic pattern we enter into endless recursion, resulting in a stack overflow.
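
The endless recursion on the Phi1/Phi2 cycle described above can be illustrated with a minimal Java sketch. This is not HotSpot code and not necessarily the fix that was chosen for this PR: `Node`, `copy`, and `PhiCopySketch` are hypothetical names. It only shows one generic way to make a recursive graph copy terminate on a cycle, by registering the copy before visiting the inputs so that a back edge resolves to the new node rather than the old one.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

final class PhiCopySketch {
    /** Hypothetical stand-in for an IR node; not a HotSpot class. */
    static final class Node {
        final String name;
        final List<Node> inputs = new ArrayList<>();
        Node(String name) { this.name = name; }
    }

    /**
     * Copies the graph reachable from n. The copy is registered in the memo
     * before the inputs are visited, so a cyclic back edge finds the copy
     * (the "NewPhi") instead of recursing forever or returning the old node.
     */
    static Node copy(Node n, Map<Node, Node> memo) {
        Node done = memo.get(n);
        if (done != null) {
            return done;
        }
        Node fresh = new Node("New" + n.name);
        memo.put(n, fresh);
        for (Node in : n.inputs) {
            fresh.inputs.add(copy(in, memo));
        }
        return fresh;
    }

    public static void main(String[] args) {
        // Phi1 = Phi(a0, Phi2) and Phi2 = Phi(a1, Phi1), as in the two-level loop.
        Node a0 = new Node("a0"), a1 = new Node("a1");
        Node phi1 = new Node("Phi1"), phi2 = new Node("Phi2");
        phi1.inputs.add(a0); phi1.inputs.add(phi2);
        phi2.inputs.add(a1); phi2.inputs.add(phi1);

        Node newPhi1 = copy(phi1, new IdentityHashMap<>());
        Node newPhi2 = newPhi1.inputs.get(1);
        // The back edge of the copied Phi2 points at the copied Phi1.
        System.out.println(newPhi2.inputs.get(1) == newPhi1);
    }
}
```

Without the memo lookup, `copy(phi1)` would call `copy(phi2)`, which would call `copy(phi1)` again, and so on until the stack overflows.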
I tried getting rid of VBA through early buffering but it leads to another set of [issues](https://github.com/openjdk/valhalla/pull/833#discussion_r1166386677)

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/13489#discussion_r1176841376

From vlivanov at openjdk.org Tue Apr 25 21:22:35 2023
From: vlivanov at openjdk.org (Vladimir Ivanov)
Date: Tue, 25 Apr 2023 21:22:35 GMT
Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v10]
In-Reply-To:
References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com>
Message-ID: <8OlLs3nmBxAKP_OcaZPhC3g1lNpfXJQ6zYCx2XNB43A=.055edcf3-96b7-4f97-9153-fdfe26bb0c0b@github.com>

On Tue, 25 Apr 2023 00:10:53 GMT, Cesar Soares Lucas wrote:

>> src/hotspot/share/code/debugInfo.cpp line 257:
>>
>>> 255: } else {
>>> 256: assert(selector < _possible_objects.length(), "sanity");
>>> 257: _selected = (ObjectValue*) _possible_objects.at(selector);
>>
>> Any particular reason to reuse `ObjectValue` from `_possible_objects` instead of allocating a fresh one (as you do on the `selector == -1` branch)? I'd prefer `ObjectMergeValue::select()` to always allocate a fresh `ObjectValue` when converting `ObjectMergeValue` + `ObjectMergeCandidateValue` into `ObjectValue`.
>
> @iwanowww - may I ask why always allocating a fresh object might be better than returning a pointer to a previous "selected" object?

I don't mind if there's caching happening, provided it gives a noticeable benefit. As of now, the surrounding code probably doesn't care, because the object is allocated in the resource arena.

What I'm against is repurposing existing instances: don't modify a candidate object into a "real object", allocate a fresh one instead.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1177069013

From xliu at openjdk.org Tue Apr 25 21:36:24 2023
From: xliu at openjdk.org (Xin Liu)
Date: Tue, 25 Apr 2023 21:36:24 GMT
Subject: RFR: 8306872: Rename Node_Array::Size()
Message-ID: <4BxSieT0OJp48RTlZtGWqWv_L63InYCdMEWfxWZI-Mg=.ab5c8678-21bb-4f37-ad6e-2f2cd2871434@github.com>

Node_List inherits Node_Array, so it inherits Size(). To resolve the naming conflict, Node_List uses size() (with a lowercase s) and keeps Size() as the capacity. I don't think it's good practice to distinguish two functions only by a capital initial, particularly when they have different meanings. A true story: it took me 2 days to find a bug caused by choosing the wrong one of them.

This patch just renames Node_Array::Size to max. Using different names will avoid misuse. We need to change 6 places: 4 are intentional uses of Node_List::Size(), and 2 are for Node_Array::Size().

-------------

Commit messages:
 - 8306872: Rename Node_Array::Size()

Changes: https://git.openjdk.org/jdk/pull/13659/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13659&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8306872
Stats: 7 lines in 5 files changed: 0 ins; 0 del; 7 mod
Patch: https://git.openjdk.org/jdk/pull/13659.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/13659/head:pull/13659

PR: https://git.openjdk.org/jdk/pull/13659

From cslucas at openjdk.org Tue Apr 25 21:40:15 2023
From: cslucas at openjdk.org (Cesar Soares Lucas)
Date: Tue, 25 Apr 2023 21:40:15 GMT
Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v10]
In-Reply-To: <8OlLs3nmBxAKP_OcaZPhC3g1lNpfXJQ6zYCx2XNB43A=.055edcf3-96b7-4f97-9153-fdfe26bb0c0b@github.com>
References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> <8OlLs3nmBxAKP_OcaZPhC3g1lNpfXJQ6zYCx2XNB43A=.055edcf3-96b7-4f97-9153-fdfe26bb0c0b@github.com>
Message-ID:
<5HmQ644vtSgf7NylFeEsQXfU8DC9W-zk3ayYq373LF4=.8fba7762-07b2-45b8-a002-1ad2c0c05b0e@github.com>

On Tue, 25 Apr 2023 21:19:06 GMT, Vladimir Ivanov wrote:

>> @iwanowww - may I ask why always allocating a fresh object might be better than returning a pointer to a previous "selected" object?
>
> I don't mind there's caching happening if it gives any noticeable benefit. As of now, the code around doesn't care, probably, because it is allocated in resource arena.
>
> What I'm against is repurposing existing instances: don't modify a candidate object into a "real object", allocate a fresh one instead.

Thanks for clarifying. There is one scenario where turning the candidate into a "real object" simplifies the implementation _greatly_. The scenario is when the ObjectValue is not just a candidate, i.e., the ObjectValue is also used independently of the merge. Example:

    Point p = new Point();
    Point q = new Point();
    if (cond) p = q;
    trap(p, q);

A second issue is that allocating a fresh ObjectValue will require copying the array of field values from the candidate object to the newly allocated object. That's not a big issue, I'm just pointing it out.

I propose that we allocate a fresh ObjectValue if the candidate is just a candidate (not used independently of the merge), and that we return the existing ObjectValue (turned into a 'real object') if the candidate is not just a candidate. I have that implemented and can push it for you to take a look. What do you think?
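
The selection policy proposed above can be sketched in a few lines of Java. This is only a model of the idea under stated assumptions, not the HotSpot implementation: `ObjValue`, `usedOutsideMerge`, and `select` are hypothetical names, and the real `ObjectValue` / `ObjectMergeValue` classes are C++ and carry more state.

```java
import java.util.ArrayList;
import java.util.List;

final class MergeSelectSketch {
    /** Hypothetical model of a described object; not HotSpot's ObjectValue. */
    static final class ObjValue {
        final List<Object> fieldValues = new ArrayList<>();
        /** True if the object is also referenced independently of the merge. */
        final boolean usedOutsideMerge;
        ObjValue(boolean usedOutsideMerge) { this.usedOutsideMerge = usedOutsideMerge; }
    }

    /**
     * Returns the object chosen by the selector. A candidate that is also
     * used on its own is returned as-is; a pure candidate is never
     * repurposed, a fresh object with a copy of its field values is returned.
     */
    static ObjValue select(List<ObjValue> candidates, int selector) {
        ObjValue candidate = candidates.get(selector);
        if (candidate.usedOutsideMerge) {
            return candidate;
        }
        ObjValue fresh = new ObjValue(false);
        fresh.fieldValues.addAll(candidate.fieldValues);
        return fresh;
    }
}
```

In the `trap(p, q)` example, `q` would be modelled with `usedOutsideMerge == true`, because it is passed to `trap` directly as well as through the merged `p`.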

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1177087887

From kvn at openjdk.org Tue Apr 25 22:16:08 2023
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Tue, 25 Apr 2023 22:16:08 GMT
Subject: RFR: 8306872: Rename Node_Array::Size()
In-Reply-To: <4BxSieT0OJp48RTlZtGWqWv_L63InYCdMEWfxWZI-Mg=.ab5c8678-21bb-4f37-ad6e-2f2cd2871434@github.com>
References: <4BxSieT0OJp48RTlZtGWqWv_L63InYCdMEWfxWZI-Mg=.ab5c8678-21bb-4f37-ad6e-2f2cd2871434@github.com>
Message-ID:

On Tue, 25 Apr 2023 21:28:43 GMT, Xin Liu wrote:

> Node_List inherits Node_Array, so it inherits Size(). To resolve the naming conflict,
> Node_List uses size() (with a lowercase s) and keeps Size() as the capacity. I don't think
> it's good practice to distinguish two functions only by a capital initial, particularly
> when they have different meanings. A true story: it took me 2 days to find a bug
> caused by choosing the wrong one of them.
>
> This patch just renames Node_Array::Size to max. Using different names will
> avoid misuse. We need to change 6 places: 4 are intentional uses of Node_List::Size(),
> and 2 are for Node_Array::Size().

Agree.

-------------

Marked as reviewed by kvn (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/13659#pullrequestreview-1400835456

From ysuenaga at openjdk.org Wed Apr 26 00:09:26 2023
From: ysuenaga at openjdk.org (Yasumasa Suenaga)
Date: Wed, 26 Apr 2023 00:09:26 GMT
Subject: RFR: 8305770: os::Linux::available_memory() should refer MemAvailable in /proc/meminfo
In-Reply-To:
References:
Message-ID:

On Tue, 25 Apr 2023 08:27:44 GMT, Roberto Castañeda Lozano wrote:

>> Hi hotspot-compiler folks,
>>
>> I'd like to change `os::Linux::available_memory()` to refer to `MemAvailable` in /proc/meminfo on Linux. One user of this function is the JIT compiler, which uses it to determine the number of compiler threads.
After this change, more compiler threads would be started because `MemAvailable` includes not only free memory but also some caches - it means `MemAvailable` is bigger than `MemFree`. >> >> I think this change is not a problem because number of compiler threads is limited by `CICompilerCount`. Do you have any concerns in compiler perspective? > > Hi @YaSuenag, just to confirm that this change would not lead to excessive creation/deletion of compiler threads (which can have a significant cost in terms of memory usage, see e.g. [JDK-8302264](https://bugs.openjdk.org/browse/JDK-8302264)), it would be useful to see some measurements about number of created compiler threads over the execution of some applications in a environment configured so that `available_memory / (200*M)` becomes the limiting factor in https://github.com/openjdk/jdk/blob/f968da97a5a5c68c28ad29d13fdfbe3a4adf5ef7/src/hotspot/share/compiler/compileBroker.cpp#L1024-L1027, before and after the change. This could be easily measured e.g. using `-XX:+TraceCompilerThreads`. Do you have (or could produce) such measurements? @robcasloz Thanks for your reply! The results of measurement with `-XX:+TraceCompilerThreads` is here. They look like no significant difference. Do you have any concerns? and we go ahead this change? 

# before

$ head -n 3 /proc/meminfo; ./build/linux-x86_64-server-fastdebug/images/jdk/bin/java -XX:+TraceCompilerThreads --version
MemTotal: 8117076 kB
MemFree: 123228 kB
MemAvailable: 7457284 kB
184 Added initial compiler thread C2 CompilerThread0
184 Added initial compiler thread C1 CompilerThread0
openjdk 21-internal 2023-09-19
OpenJDK Runtime Environment (fastdebug build 21-internal-adhoc.ysuenaga.jdk)
OpenJDK 64-Bit Server VM (fastdebug build 21-internal-adhoc.ysuenaga.jdk, mixed mode, sharing)

# after

$ head -n 3 /proc/meminfo; ./build/linux-x86_64-server-fastdebug/images/jdk/bin/java -XX:+TraceCompilerThreads --version
MemTotal: 8117076 kB
MemFree: 118276 kB
MemAvailable: 7625760 kB
185 Added initial compiler thread C2 CompilerThread0
185 Added initial compiler thread C1 CompilerThread0
openjdk 21-internal 2023-09-19
OpenJDK Runtime Environment (fastdebug build 21-internal-adhoc.ysuenaga.jdk)
OpenJDK 64-Bit Server VM (fastdebug build 21-internal-adhoc.ysuenaga.jdk, mixed mode, sharing)

They were measured on Fedora 38 on Hyper-V with 4 vCPUs assigned. I consumed memory with the following program, which ran in the background during each measurement.

    const char *fname = "test.dat";
    const unsigned long sz = 8000UL * 1024UL * 1024UL;
    char *mem;
    int fd;

    fd = open(fname, O_CREAT | O_CLOEXEC | O_RDWR, S_IRUSR | S_IWUSR);
    lseek(fd, sz, SEEK_SET);
    write(fd, &sz, 1);
    mem = (char *)mmap(NULL, sz, PROT_WRITE, MAP_SHARED, fd, 0);
    memset(mem, 1, sz);

-------------

PR Comment: https://git.openjdk.org/jdk/pull/13398#issuecomment-1522562257

From dzhang at openjdk.org Wed Apr 26 02:18:29 2023
From: dzhang at openjdk.org (Dingli Zhang)
Date: Wed, 26 Apr 2023 02:18:29 GMT
Subject: RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v19]
In-Reply-To: <64ngA7LrbMqHMbMns6QT9pKQh4NLyeJJlgjbXyTmYJk=.c07b846c-0108-4460-a20b-80661709f67c@github.com>
References: <64ngA7LrbMqHMbMns6QT9pKQh4NLyeJJlgjbXyTmYJk=.c07b846c-0108-4460-a20b-80661709f67c@github.com>
Message-ID:

On Tue, 18 Apr 2023 07:02:33 GMT, Yanhong Zhu wrote:

>> Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision:
>>
>> Fix match_rule_supported_vector_masked
>
> Marked as reviewed by yzhu (Author).

@yhzhu20 @pfustc @feilongjiang @RealFYang Thanks for the review!

-------------

PR Comment: https://git.openjdk.org/jdk/pull/12682#issuecomment-1522651491

From dzhang at openjdk.org Wed Apr 26 02:41:23 2023
From: dzhang at openjdk.org (Dingli Zhang)
Date: Wed, 26 Apr 2023 02:41:23 GMT
Subject: Integrated: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API
In-Reply-To:
References:
Message-ID:

On Tue, 21 Feb 2023 06:17:06 GMT, Dingli Zhang wrote:

> Hi,
>
> We have added support for vector add mask instructions, please take a look and have some reviews. Thanks a lot!
> This patch will add support of vector add/sub/mul/div mask version. It was implemented by referring to RVV v1.0 [1].
>
> ## Load/Store/Cmp Mask
> `VectorLoadMask, VectorMaskCmp, VectorStoreMask` will implement the mask datapath.
We can see where the data is passed in the compilation log with `jdk/incubator/vector/Byte128VectorTests.java`? > > 218 loadV V1, [R7] # vector (rvv) > 220 vloadmask V0, V1 > ... > 23c vmaskcmp_rvv_masked V0, V4, V5, V0, V1, #0 > 24c vstoremask V1, V0 > 258 storeV [R7], V1 # vector (rvv) > > > The corresponding generated jit assembly? > > # loadV > 0x000000400c8ef958: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef95c: vle8.v v1,(t2) > > # vloadmask > 0x000000400c8ef960: vsetvli t0,zero,e8,m1,tu, > 0x000000400c8ef964: vmsne.vx v0,v1,zero > > # vmaskcmp_rvv_masked > 0x000000400c8ef97c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef980: vmclr.m v1 > 0x000000400c8ef984: vmseq.vv v1,v4,v5,v0.t > 0x000000400c8ef988: vmv1r.v v0,v1 > > # vstoremask > 0x000000400c8ef98c: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c8ef990: vmv.v.x v1,zero > 0x000000400c8ef994: vmerge.vim v1,v1,1,v0 > > > ## Masked vector arithmetic instructions (e.g. vadd) > AddMaskTestMerge case: > > import jdk.incubator.vector.IntVector; > import jdk.incubator.vector.VectorMask; > import jdk.incubator.vector.VectorOperators; > import jdk.incubator.vector.VectorSpecies; > > public class AddMaskTestMerge { > > static final VectorSpecies SPECIES = IntVector.SPECIES_128; > static final int SIZE = 1024; > static int[] a = new int[SIZE]; > static int[] b = new int[SIZE]; > static int[] r = new int[SIZE]; > static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false}; > static { > for (int i = 0; i < SIZE; i++) { > a[i] = i; > b[i] = i; > } > } > > static void workload(int idx) { > VectorMask vmask = VectorMask.fromArray(SPECIES, c, 0); > IntVector av = IntVector.fromArray(SPECIES, a, idx); > IntVector bv = IntVector.fromArray(SPECIES, b, idx); > av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx); > } > > public static void main(String[] args) { > for (int i = 0; i < 30_0000; i++) { > for (int j = 0; j < SIZE; j += SPECIES.length()) { > workload(j); > } > } > } > } > > > This test case 
is reduced from existing jtreg vector tests Int128VectorTests.java[2]. This test case corresponds to the add instruction of the vector mask version and other instructions are similar. > > Before this patch, the compilation log will not print RVV-related instructions. Now the compilation log is as follows: > > > 0ae B10: # out( B25 B11 ) <- in( B9 ) Freq: 0.999991 > 0ae loadV V1, [R31] # vector (rvv) > 0b6 vloadmask V0, V2 > 0be vadd.vv V3, V1, V0 #@vaddI_masked > 0c6 lwu R28, [R7, #124] # loadN, compressed ptr, #@loadN ! Field: AddMaskTestMerge.r > 0ca decode_heap_oop R28, R28 #@decodeHeapOop > 0cc lwu R7, [R28, #12] # range, #@loadRange > 0d0 NullCheck R28 > > > And the jit code is as follows: > > > 0x000000400c823cee: vsetvli t0,zero,e32,m1,tu,mu > 0x000000400c823cf2: vle32.v v1,(t6) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::intoArray at 43 (line 3228) > ; - AddMaskTestMerge::workload at 46 (line 25) > 0x000000400c823cf6: vsetvli t0,zero,e8,m1,tu,mu > 0x000000400c823cfa: vmsne.vx v0,v2,zero ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.VectorMask::fromArray at 47 (line 208) > ; - AddMaskTestMerge::workload at 7 (line 22) > 0x000000400c823cfe: vsetvli t0,zero,e32,m1,tu,mu > 0x000000400c823d02: vadd.vv v3,v3,v1,v0.t ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0} > ; - jdk.incubator.vector.IntVector::lanewiseTemplate at 192 (line 834) > ; - jdk.incubator.vector.Int128Vector::lanewise at 9 (line 291) > ; - jdk.incubator.vector.Int128Vector::lanewise at 4 (line 41) > ; - AddMaskTestMerge::workload at 39 (line 25) > > > ## Mask register allocation & mask bit opreation > Since v0 is to be used as a mask register in spec[1], sometimes we need two vmask to do the vector mask logical ops like `AndVMask, OrVMask, XorVMask`. And if only v0 and v31 mask registers are defined, the corresponding c2 nodes will not be generated correctly because of the register pressure[3]. 
> When we use only v0 and v31 as mask registers, jtreg testing of Byte128VectorTests.java[4] with `-XX:+PrintAssembly` and `-XX:LogFile` will not emit the expected rvv mask instruction. As opposed to this, the following compilation failure log[3] is generated: > > > > > > > > > So define v30 and v31 as mask register too and `AndVMask` will emit the C2 JIT code like: > > vloadmask V0, V1 > vloadmask V30, V2 > vmask_and V0, V30, V0 > > We also modified the implementation of `spill_copy_vector_stack_to_stack ` so that it no longer occupies the v0 register. In addition to that, we change some node like `vasr/vlsl/vlsr/vstring_x/varray_x/vclearArray_x`, which use v0 internally, to make C2 to sense that they used v0. > > ## vector load/store - predicated & blend opreation > > Jtreg testing of Byte128VectorTests.java[4] with -XX:+PrintOptoAssembly and -XX:LogFile will print the following compilation log, which generated by predicated vector load/store: > > 152 B21: # out( B22 ) <- in( B20 ) Freq: 0.499984 > 152 vmask_gen_L V0, R12 > 162 loadV_masked V1, V0, [R10] > 16e storeV_masked [R11], V0, V1 > > > And `VectorBlend` will generate the following compilation log (part of rotate opreation): > > 1ea vlsrBS V6, V1, V3 V0 > 1fe vlslBS V5, V1, V2 V0 > 212 vor.vv V2, V5, V6 #@vor > 21a vloadmask V0, V4 > 222 vmerge_vvm V1, V1, V2 # vector blend > 22a bgeu R9, R30, B56 #@cmpU_branch P=0.000001 C=-1.000000 > > > At the same time, we added the predicated nodes of `RShiftV/LShiftV/URShiftV`. While there was some code duplication for the corresponding nodes in non-masked form, so a small refactoring was done. 
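
For readers unfamiliar with masked lanewise operations, the masked add exercised by the AddMaskTestMerge case quoted above can be modelled in scalar Java. This is a sketch of the Vector API semantics as I understand them (lanes whose mask bit is unset keep the value of the first operand), not of the RVV code that C2 emits; `MaskedAddSketch` and `maskedAdd` are hypothetical names.

```java
final class MaskedAddSketch {
    /**
     * Scalar model of av.lanewise(VectorOperators.ADD, bv, vmask):
     * lanes selected by the mask hold a[i] + b[i], all other lanes keep a[i].
     */
    static int[] maskedAdd(int[] a, int[] b, boolean[] mask) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = mask[i] ? a[i] + b[i] : a[i];
        }
        return r;
    }

    public static void main(String[] args) {
        // One 128-bit int vector holds 4 lanes; values mirror a[i] = b[i] = i.
        int[] a = {0, 1, 2, 3};
        int[] b = {0, 1, 2, 3};
        boolean[] mask = {true, false, true, false};
        System.out.println(java.util.Arrays.toString(maskedAdd(a, b, mask)));
    }
}
```

The merge behaviour is what the `v0.t` suffix on `vadd.vv` expresses in the JIT output: only lanes enabled in the mask register v0 are written.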
> > > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc > [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java > [3] https://github.com/openjdk/jdk/blob/0deb648985b018653ccdaf193dc13b3cf21c088a/src/hotspot/share/opto/chaitin.cpp#L526 > [4] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Byte128VectorTests.java > > ### Testing: > > qemu with UseRVV: > - [x] Tier1 tests (release) > - [x] Tier2 tests (release) > - [x] Tier3 tests (release) > - [x] test/jdk/jdk/incubator/vector (release/fastdebug) This pull request has now been integrated. Changeset: 1c1a73f7 Author: Dingli Zhang Committer: Fei Yang URL: https://git.openjdk.org/jdk/commit/1c1a73f715b291faabbc77d09d0f7b0ae65ebea7 Stats: 1015 lines in 7 files changed: 881 ins; 26 del; 108 mod 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API Co-authored-by: zifeihan Reviewed-by: fyang, fjiang, yzhu ------------- PR: https://git.openjdk.org/jdk/pull/12682 From kvn at openjdk.org Wed Apr 26 04:19:22 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 26 Apr 2023 04:19:22 GMT Subject: RFR: 8299226: compiler/profiling/TestTypeProfiling.java: make it not throw if C2 is not enabled. In-Reply-To: <3iM-vkwplJeQZi2YpT15meLWPypFq-gTQCMhJuOTPao=.8dc4e7a2-bb15-4940-aad1-333eace23e14@github.com> References: <3iM-vkwplJeQZi2YpT15meLWPypFq-gTQCMhJuOTPao=.8dc4e7a2-bb15-4940-aad1-333eace23e14@github.com> Message-ID: On Mon, 24 Apr 2023 09:34:11 GMT, Ilya Korennoy wrote: >> @ikorennoy, I added comment with question to Evgeny about how he hit the issue so we can reproduce it. >> These tests can't be run without JTREG which filter them. Based on his answer we either close bug as not issue or try to find why filtering does not work in his configuration. >> >> If you want to look to do clean to remove unneeded checks and simplify `@requires` I would suggest to file a separate RFE. 
> > @vnkozlov seems like there are no updates in Jira. What can we do now? @ikorennoy Evgeny closed bug so you can withdraw this PR. ------------- PR Comment: https://git.openjdk.org/jdk/pull/12981#issuecomment-1522747664 From epeter at openjdk.org Wed Apr 26 05:59:54 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 26 Apr 2023 05:59:54 GMT Subject: RFR: 8303466: C2: failed: malformed control flow. Limit type made precise with MaxL/MinL [v9] In-Reply-To: References: Message-ID: On Mon, 24 Apr 2023 09:05:46 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> Apply suggestions from code review (Tobias) >> >> Co-authored-by: Tobias Hartmann > > Updates look good! Thanks @chhagedorn @TobiHartmann @vnkozlov @rwestrel for the reviews and suggestions! The assert will now still fail with the fuzzer occasionally because of the `assertion / skeleton predicate` bug that @chhagedorn is already working on for a while. But I hope this fix will drastically reduce the rate of fuzzer failures with this assert. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13269#issuecomment-1522815564 From epeter at openjdk.org Wed Apr 26 05:59:57 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 26 Apr 2023 05:59:57 GMT Subject: Integrated: 8303466: C2: failed: malformed control flow. Limit type made precise with MaxL/MinL In-Reply-To: References: Message-ID: <3dp5kRtSdfzKTwDzCVEtKlnOEltjrA3fKS7bWx7axPE=.f3e57b04-802a-4a11-b504-982b6191da77@github.com> On Fri, 31 Mar 2023 12:44:17 GMT, Emanuel Peter wrote: > **Context** > > During `PhaseIdealLoop::do_unroll`, we hack the loop-limit, and subtract `stride` from it. We have to prevent underflow on that subtract. Currently, we do this with a `CMoveI`. The problem with this: `CMoveI` is not smart enough to generate a precise type. 
For example, there are many cases where the input types get better, and underflow is not possible anymore. But the `CMoveI` does not detect this, and still has type `min_jint..hi`. > > We have the same issue in `PhaseIdealLoop::adjust_limit`, where we use `CMoveL` to implement long max/min. The types are not as precise as they could and should be. > > **Problem** > > The imprecise type is used for the zero-trip-guard. It does not fold to false, even though the data-path into the post loop does constant fold to `TOP`. The graph breaks, and assert `malformed control flow` triggers. > > Details: In these cases, we have the super-unrolled main-loop (SuperWord'ed, then further unrolled) directly leading to a vectorized post-loop. The effect is that there is no `region/phi` merging main-exit and main-zero-trip-guard. So the types are already more narrow here. It may be possible that the values are such that we find out that we should never enter the vectorized post-loop. But if data finds out and control does not, we get a broken graph. > Note: we have pre-loop. Then a main-loop and vectorized post loop. Then we merge the main-zero-trip-guard. And at the end we have the scalar post loop. > > I have already recently fixed a bug around this `CMoveI`. https://github.com/openjdk/jdk/commit/5a4945c0d95423d0ab07762c915e9cb4d3c66abb I would now like to have a more satisfactory fix, that properly propagates the types. > > **Solution** > > `PhaseIdealLoop::adjust_limit` already converts the limit from int to long, and does all computations in long, including taking max/min with a `CMoveL`. I now use the so far unused `MaxL/MinL`. I implemented some missing `Value/Identity` components for it. Since `MaxL/MinL` is not implemented in the backend, I just expand it in macro-expansion to a `CMoveL`. At that point the loop-opts are over, and it is most likely ok that we do not make the types more precise after this. 
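
The idea of doing the limit arithmetic in long and clamping with min/max, as described above, can be sketched in plain Java. This illustrates the arithmetic only and is not the C2 node construction; `LimitClampSketch` and `limitMinusStride` are hypothetical names.

```java
final class LimitClampSketch {
    /**
     * Computes limit - stride without int underflow or overflow: the
     * subtraction is done in long, then the result is clamped to the int
     * range with max/min (the roles played by MaxL/MinL in the description).
     */
    static int limitMinusStride(int limit, int stride) {
        long adjusted = (long) limit - (long) stride;
        long clamped = Math.max((long) Integer.MIN_VALUE,
                                Math.min((long) Integer.MAX_VALUE, adjusted));
        return (int) clamped;
    }
}
```

For instance, `limitMinusStride(Integer.MIN_VALUE, 1)` yields `Integer.MIN_VALUE` instead of wrapping around to `Integer.MAX_VALUE`, which is what a plain int subtraction would produce.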
> > I take the same approach for `PhaseIdealLoop::do_unroll`: convert limits to long, do subtraction in long, take `MinL/MaxL` to clamp it to the int-range (prevent subtraction underflow). > > **Discussion** > > This solution seems much cleaner to me, and I hope that we will see less bugs because of imprecise types in the limit computation, which were often due to the `CMove` not being smart enough to analyze all inputs (it would have to recognize a multitude of patterns, for the Cmp inputs and the direct inputs to the CMove - we currently do not do that, but just take the union of the input types - this is very inprecise). > > There is a bit of an overhead here: We use longs even though we only want to have int values. But I think we should prefer a clean implementation here, with correct type computation. The performance impact is probably non-existent on 64-bit machines anyway. > > **Caveat** > > I found some cases with the same assert `malformed control flow` that are most likely skeleton/assertion predicate bugs [JDK-8288981](https://bugs.openjdk.org/browse/JDK-8288981). Some of those cases were new patterns, for example where we PreMainPost a main loop. > > I hope that this fix here at least reduces the frequency of failures significantly. > > **Testing** > > I added 2 regression tests. Our fuzzer seems to spit out examples regularly, so that gives us extra coverage. > > Tested up to `tier5` and stress testing. Performance testing **running...** > > **Future Work** > > We should implement `MaxL/MinL` in the backend. We should also use them during parsing. This would also allow to `SuperWord` the instruction, on the platforms that support it. > > Should we add such an assert during IGVN? I think after IGVN, we should never have a `MultiBranchNode` that does not have the required number of outputs, right? We could add it to `VerifyIterativeGVN`. This pull request has now been integrated. 
Changeset: cc894d84 Author: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/cc894d849aa5f730d5a806acfc7a237cf5170af1 Stats: 455 lines in 9 files changed: 364 ins; 69 del; 22 mod 8303466: C2: failed: malformed control flow. Limit type made precise with MaxL/MinL Reviewed-by: roland, kvn, chagedorn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/13269 From thartmann at openjdk.org Wed Apr 26 07:08:24 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 26 Apr 2023 07:08:24 GMT Subject: RFR: 8306872: Rename Node_Array::Size() In-Reply-To: <4BxSieT0OJp48RTlZtGWqWv_L63InYCdMEWfxWZI-Mg=.ab5c8678-21bb-4f37-ad6e-2f2cd2871434@github.com> References: <4BxSieT0OJp48RTlZtGWqWv_L63InYCdMEWfxWZI-Mg=.ab5c8678-21bb-4f37-ad6e-2f2cd2871434@github.com> Message-ID: <_XO_864kaLHC82xanFG9kwuMr_2Pq2pzSF8ImzIXetQ=.0e799c08-6abf-485f-9c76-9268d1ef2f16@github.com> On Tue, 25 Apr 2023 21:28:43 GMT, Xin Liu wrote: > Node_List inherits Node_Array, so it inherits Size(). To resolve naming conflict, > Node_List uses size() (with little s) and keep Size() as capacity. I don't think > it's a good practice to distinct two function using capital initial. In particular > they have different meanings. A true story is that it took me 2 days to find a bug > that I chose wrong one between them. > > This patch just renames Node_Array::Size to max. By using different names, this will > avoid from misusing. We need to changes 6 places. 4 are for Node_List::Size() intentionally. > 2 of them are for Node_Array::Size(). Looks good to me too. ------------- Marked as reviewed by thartmann (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/13659#pullrequestreview-1401279828 From shade at openjdk.org Wed Apr 26 08:01:53 2023 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 26 Apr 2023 08:01:53 GMT Subject: RFR: 8306872: Rename Node_Array::Size() In-Reply-To: <4BxSieT0OJp48RTlZtGWqWv_L63InYCdMEWfxWZI-Mg=.ab5c8678-21bb-4f37-ad6e-2f2cd2871434@github.com> References: <4BxSieT0OJp48RTlZtGWqWv_L63InYCdMEWfxWZI-Mg=.ab5c8678-21bb-4f37-ad6e-2f2cd2871434@github.com> Message-ID: On Tue, 25 Apr 2023 21:28:43 GMT, Xin Liu wrote: > Node_List inherits Node_Array, so it inherits Size(). To resolve naming conflict, > Node_List uses size() (with little s) and keep Size() as capacity. I don't think > it's a good practice to distinct two function using capital initial. In particular > they have different meanings. A true story is that it took me 2 days to find a bug > that I chose wrong one between them. > > This patch just renames Node_Array::Size to max. By using different names, this will > avoid from misusing. We need to changes 6 places. 4 are for Node_List::Size() intentionally. > 2 of them are for Node_Array::Size(). Marked as reviewed by shade (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/13659#pullrequestreview-1401361487 From rcastanedalo at openjdk.org Wed Apr 26 08:26:25 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Wed, 26 Apr 2023 08:26:25 GMT Subject: RFR: 8287087: C2: perform SLP reduction analysis on-demand [v4] In-Reply-To: References: <-5XPy4I-SVMVvxwGusambZ0QlT_UjpwB_pmq5IWpWlk=.dba6fe41-16f4-4ed4-a4df-62094fc33a86@github.com> Message-ID: On Tue, 25 Apr 2023 13:26:09 GMT, Emanuel Peter wrote: > Still looks good. Thanks for looking at this again, Emanuel! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/13120#issuecomment-1522983805 From duke at openjdk.org Wed Apr 26 08:28:55 2023 From: duke at openjdk.org (Ilya Korennoy) Date: Wed, 26 Apr 2023 08:28:55 GMT Subject: Withdrawn: 8299226: compiler/profiling/TestTypeProfiling.java: make it not throw if C2 is not enabled. In-Reply-To: References: Message-ID: On Fri, 10 Mar 2023 19:31:45 GMT, Ilya Korennoy wrote: > Changing RuntimeException to SkippedException when TieredStopAtLevel < 4. > > The main part of this problem was done in [JDK-8226795](https://bugs.openjdk.org/browse/JDK-8226795) but for some reason, the exception was not changed. This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/12981 From rcastanedalo at openjdk.org Wed Apr 26 08:41:55 2023 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Wed, 26 Apr 2023 08:41:55 GMT Subject: Integrated: 8298189: Regression in SPECjvm2008-MonteCarlo for pre-Cascade Lake Intel processors In-Reply-To: <-rGBBQHk5en1kP23u9akxzvubL5KEC_s73H6Ox2Yk4U=.42502cee-d135-491e-aa90-2d5634db4df9@github.com> References: <-rGBBQHk5en1kP23u9akxzvubL5KEC_s73H6Ox2Yk4U=.42502cee-d135-491e-aa90-2d5634db4df9@github.com> Message-ID: On Mon, 24 Apr 2023 06:05:21 GMT, Roberto Castañeda Lozano wrote: > The `mov + inc/dec -> lea` subset of the peephole rules introduced by [JDK-8283699](https://bugs.openjdk.org/browse/JDK-8283699) has been found to cause minor regressions for some common benchmarks on Intel microarchitectures earlier than Cascade Lake. This changeset limits their application to Intel Cascade Lake and microarchitectures with full ALU support for lea (`VM_Version::supports_fast_3op_lea()`), where these peephole rules have been confirmed to be beneficial.
The adjustment speeds up SPECjvm2008's MonteCarlo benchmark by between 0.1% and 2.7% on pre-Cascade Lake microarchitectures (Haswell-DT, Coffee Lake-B) across different garbage collectors (G1, ZGC). It additionally yields a speedup of 2.1% on SPECjvm2008's Derby benchmark when using G1 on Coffee Lake-B. > > Thanks to @ericcaspole for discussions and helping out with benchmarking. > > #### Testing > > ##### Functionality > > - tier1-5 (windows-x64, linux-x64, macosx-x64; release and debug mode). > - Checked that the expected combination of peephole rules is enabled for all microarchitectures supported by Intel's Software Development Emulator 9.0. > > ##### Performance > > - Tested performance on a set of standard benchmark suites (DaCapo, SPECjbb2015, SPECjvm2008), different Intel microarchitectures (Haswell-DT, Coffee Lake-B, Cascade Lake, Ice Lake-SP) and operating systems (linux-x64, windows-x64, and macosx-x64). No significant change was observed besides the improvements mentioned above. This pull request has now been integrated. Changeset: 8d899925 Author: Roberto Casta?eda Lozano URL: https://git.openjdk.org/jdk/commit/8d899925dc281c5dabbef14d85a6df807f8d300e Stats: 23 lines in 3 files changed: 17 ins; 1 del; 5 mod 8298189: Regression in SPECjvm2008-MonteCarlo for pre-Cascade Lake Intel processors Co-authored-by: Quan Anh Mai Reviewed-by: shade, thartmann, kvn ------------- PR: https://git.openjdk.org/jdk/pull/13605 From duke at openjdk.org Wed Apr 26 08:42:24 2023 From: duke at openjdk.org (Ilya Korennoy) Date: Wed, 26 Apr 2023 08:42:24 GMT Subject: RFR: 8299226: compiler/profiling/TestTypeProfiling.java: make it not throw if C2 is not enabled. In-Reply-To: References: <3iM-vkwplJeQZi2YpT15meLWPypFq-gTQCMhJuOTPao=.8dc4e7a2-bb15-4940-aad1-333eace23e14@github.com> Message-ID: On Wed, 26 Apr 2023 04:06:01 GMT, Vladimir Kozlov wrote: >> @vnkozlov seems like there are no updates in Jira. What can we do now? 
> > @ikorennoy Evgeny closed bug so you can withdraw this PR. @vnkozlov thank you for your help! ------------- PR Comment: https://git.openjdk.org/jdk/pull/12981#issuecomment-1522994796 From thartmann at openjdk.org Wed Apr 26 09:06:30 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 26 Apr 2023 09:06:30 GMT Subject: RFR: 8306042: C2: failed: Missed optimization opportunity in PhaseCCP (adding LShift->Cast->Add notification) [v2] In-Reply-To: <3kQ7Oh5f128BWKBL_AiQwGhJTMgihcJCF7nGauh_X1A=.9b463cc0-8457-46ef-ba1d-1b4639bc813a@github.com> References: <5LdntwU5zlwXPnwYeJxzNPZTwrOuki6VebrE9Leeb8g=.3dc26d60-2729-4d60-9a5c-14cbb57f2813@github.com> <3kQ7Oh5f128BWKBL_AiQwGhJTMgihcJCF7nGauh_X1A=.9b463cc0-8457-46ef-ba1d-1b4639bc813a@github.com> Message-ID: On Wed, 26 Apr 2023 08:54:35 GMT, Emanuel Peter wrote: >> An other case of `uncast` not being type-propagated through. >> >> We have a case like this: >> `Phi -> ShiftL -> CastII -> AndI` >> >> The Phi has an updated type, so we should re-run Value on the AndI. >> >> In PhaseCCP::push_and, we do update a similar pattern: >> `X -> ShiftL -> AndI` >> >> I extended it to handle this pattern: >> `parent -> LShift (use) -> ConstraintCast* -> And` >> >> For this, I implemented: >> https://github.com/openjdk/jdk/blob/26f4adaae901822bea984b926c06d1a78f9c6b48/src/hotspot/share/opto/castnode.hpp#L73-L78 >> >> I could refactor code from a previous similar fix, for pattern: `ConstraintCast+ -> Sub/Phi` >> >> **Discussion** >> >> https://github.com/openjdk/jdk/blob/4d350f8f4eaabb18482c7656cb56a734e60187cf/src/hotspot/share/opto/castnode.hpp#L78-L79 >> I would have liked to place a `ResourceMark` between these two lines, to ensure the `internals` data structure is de-allocated after the traversal. 
But if I add it there, then one cannot modify any outer data-structure, or else one risks re-allocation of the outer data-structure in the inner ResourceMark, and then this memory gets de-allocated once the ResourceMark is cleared, and the outer data-structure is broken. This would for example mean that I could not push to the IGVN worklist inside the callback. >> >> Not having the ResourceMark means a memory leak, until the compile phase is over. But my code is not the only place, there are lots of places where we create a Resource allocated data-structure, but do not use ResourceMarks. > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > corrected java whitespaces Looks good to me. I like the new `visit_uncasted_uses`. src/hotspot/share/opto/castnode.hpp line 80: > 78: static void visit_uncasted_uses(const Node* n, Callback callback) { > 79: Unique_Node_List internals; > 80: internals.push((Node*)n); // start traversal We should get rid of these casts with [JDK-8252694](https://bugs.openjdk.org/browse/JDK-8252694). I added a comment there. test/hotspot/jtreg/compiler/ccp/TestShiftCastAndNotification.java line 37: > 35: > 36: public class TestShiftCastAndNotification { > 37: static int N; Should use 4-whitespace indentation. ------------- Marked as reviewed by thartmann (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/13611#pullrequestreview-1401476822 PR Review Comment: https://git.openjdk.org/jdk/pull/13611#discussion_r1177551856 PR Review Comment: https://git.openjdk.org/jdk/pull/13611#discussion_r1177552254 From epeter at openjdk.org Wed Apr 26 09:06:30 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 26 Apr 2023 09:06:30 GMT Subject: RFR: 8306042: C2: failed: Missed optimization opportunity in PhaseCCP (adding LShift->Cast->Add notification) [v2] In-Reply-To: <5LdntwU5zlwXPnwYeJxzNPZTwrOuki6VebrE9Leeb8g=.3dc26d60-2729-4d60-9a5c-14cbb57f2813@github.com> References: <5LdntwU5zlwXPnwYeJxzNPZTwrOuki6VebrE9Leeb8g=.3dc26d60-2729-4d60-9a5c-14cbb57f2813@github.com> Message-ID: <3kQ7Oh5f128BWKBL_AiQwGhJTMgihcJCF7nGauh_X1A=.9b463cc0-8457-46ef-ba1d-1b4639bc813a@github.com> > An other case of `uncast` not being type-propagated through. > > We have a case like this: > `Phi -> ShiftL -> CastII -> AndI` > > The Phi has an updated type, so we should re-run Value on the AndI. > > In PhaseCCP::push_and, we do update a similar pattern: > `X -> ShiftL -> AndI` > > I extended it to handle this pattern: > `parent -> LShift (use) -> ConstraintCast* -> And` > > For this, I implemented: > https://github.com/openjdk/jdk/blob/26f4adaae901822bea984b926c06d1a78f9c6b48/src/hotspot/share/opto/castnode.hpp#L73-L78 > > I could refactor code from a previous similar fix, for pattern: `ConstraintCast+ -> Sub/Phi` > > **Discussion** > > https://github.com/openjdk/jdk/blob/4d350f8f4eaabb18482c7656cb56a734e60187cf/src/hotspot/share/opto/castnode.hpp#L78-L79 > I would have liked to place a `ResourceMark` between these two lines, to ensure the `internals` data structure is de-allocated after the traversal. 
But if I add it there, then one cannot modify any outer data-structure, or else one risks re-allocation of the outer data-structure in the inner ResourceMark, and then this memory gets de-allocated once the ResourceMark is cleared, and the outer data-structure is broken. This would for example mean that I could not push to the IGVN worklist inside the callback. > > Not having the ResourceMark means a memory leak, until the compile phase is over. But my code is not the only place, there are lots of places where we create a Resource allocated data-structure, but do not use ResourceMarks. Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: corrected java whitespaces ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13611/files - new: https://git.openjdk.org/jdk/pull/13611/files/4d350f8f..69c265e2 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13611&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13611&range=00-01 Stats: 28 lines in 1 file changed: 5 ins; 5 del; 18 mod Patch: https://git.openjdk.org/jdk/pull/13611.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13611/head:pull/13611 PR: https://git.openjdk.org/jdk/pull/13611 From amitkumar at openjdk.org Wed Apr 26 10:37:24 2023 From: amitkumar at openjdk.org (Amit Kumar) Date: Wed, 26 Apr 2023 10:37:24 GMT Subject: RFR: 8306855: [s390x] fix difference in abi sizes [v2] In-Reply-To: <8oTtVlCVx_ELHQcLm4esknWWEm-5ucXKDtPv1ih3w0g=.3bb9222d-6316-40b0-8495-91ce201610a3@github.com> References: <2GGc3c254on4RieO7rt8QGcxK9r-LtmamRbJdjYqYR4=.ed0aa5b9-7ecb-4414-beda-1e30cdbd151e@github.com> <8oTtVlCVx_ELHQcLm4esknWWEm-5ucXKDtPv1ih3w0g=.3bb9222d-6316-40b0-8495-91ce201610a3@github.com> Message-ID: On Tue, 25 Apr 2023 17:30:18 GMT, Amit Kumar wrote: >> This PR equals the interpreter, native abi sizes. JIT abi is completely removed as it was not being used anywhere in s390x code base. 
Apart from that, `z_abi_16` was replaced by `z_common_abi` and other names were renamed accordingly. >> >> Builds `slowdebug, fastdebug, release, optimized` are okay. `tier1-tests on fastdebug-build` are not affected either. > > Amit Kumar has updated the pull request incrementally with one additional commit since the last revision: > > updates header years The x86 test failure is unrelated. @RealLucy , @TheRealMDoerr , @reinrich Please review this PR. Thank you :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/13650#issuecomment-1523181317 From mdoerr at openjdk.org Wed Apr 26 11:00:53 2023 From: mdoerr at openjdk.org (Martin Doerr) Date: Wed, 26 Apr 2023 11:00:53 GMT Subject: RFR: 8306855: [s390x] fix difference in abi sizes [v2] In-Reply-To: <8oTtVlCVx_ELHQcLm4esknWWEm-5ucXKDtPv1ih3w0g=.3bb9222d-6316-40b0-8495-91ce201610a3@github.com> References: <2GGc3c254on4RieO7rt8QGcxK9r-LtmamRbJdjYqYR4=.ed0aa5b9-7ecb-4414-beda-1e30cdbd151e@github.com> <8oTtVlCVx_ELHQcLm4esknWWEm-5ucXKDtPv1ih3w0g=.3bb9222d-6316-40b0-8495-91ce201610a3@github.com> Message-ID: On Tue, 25 Apr 2023 17:30:18 GMT, Amit Kumar wrote: >> This PR makes the interpreter and native ABI sizes equal. JIT abi is completely removed as it was not being used anywhere in s390x code base. Apart from that, `z_abi_16` was replaced by `z_common_abi` and other names were renamed accordingly. >> >> Builds `slowdebug, fastdebug, release, optimized` are okay. `tier1-tests on fastdebug-build` are not affected either. > > Amit Kumar has updated the pull request incrementally with one additional commit since the last revision: > > updates header years LGTM. Seems like `uint64_t toc` was taken from PPC64, but doesn't make sense for s390x. At least, I couldn't find it in the ABI spec. ------------- Marked as reviewed by mdoerr (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/13650#pullrequestreview-1401707831 From lucy at openjdk.org Wed Apr 26 11:00:54 2023 From: lucy at openjdk.org (Lutz Schmidt) Date: Wed, 26 Apr 2023 11:00:54 GMT Subject: RFR: 8306855: [s390x] fix difference in abi sizes [v2] In-Reply-To: <8oTtVlCVx_ELHQcLm4esknWWEm-5ucXKDtPv1ih3w0g=.3bb9222d-6316-40b0-8495-91ce201610a3@github.com> References: <2GGc3c254on4RieO7rt8QGcxK9r-LtmamRbJdjYqYR4=.ed0aa5b9-7ecb-4414-beda-1e30cdbd151e@github.com> <8oTtVlCVx_ELHQcLm4esknWWEm-5ucXKDtPv1ih3w0g=.3bb9222d-6316-40b0-8495-91ce201610a3@github.com> Message-ID: On Tue, 25 Apr 2023 17:30:18 GMT, Amit Kumar wrote: >> This PR makes the interpreter and native ABI sizes equal. JIT abi is completely removed as it was not being used anywhere in s390x code base. Apart from that, `z_abi_16` was replaced by `z_common_abi` and other names were renamed accordingly. >> >> Builds `slowdebug, fastdebug, release, optimized` are okay. `tier1-tests on fastdebug-build` are not affected either. > > Amit Kumar has updated the pull request incrementally with one additional commit since the last revision: > > updates header years LGTM. Thank you for the harmonization. ------------- Marked as reviewed by lucy (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13650#pullrequestreview-1401724831 From amitkumar at openjdk.org Wed Apr 26 11:01:24 2023 From: amitkumar at openjdk.org (Amit Kumar) Date: Wed, 26 Apr 2023 11:01:24 GMT Subject: RFR: 8306855: [s390x] fix difference in abi sizes [v2] In-Reply-To: References: <2GGc3c254on4RieO7rt8QGcxK9r-LtmamRbJdjYqYR4=.ed0aa5b9-7ecb-4414-beda-1e30cdbd151e@github.com> <8oTtVlCVx_ELHQcLm4esknWWEm-5ucXKDtPv1ih3w0g=.3bb9222d-6316-40b0-8495-91ce201610a3@github.com> Message-ID: On Wed, 26 Apr 2023 10:43:43 GMT, Martin Doerr wrote: >> Amit Kumar has updated the pull request incrementally with one additional commit since the last revision: >> >> updates header years > > LGTM.
Seems like `uint64_t toc` was taken from PPC64, but doesn't make sense for s390x. At least, I couldn't find it in the ABI spec. Thanks @TheRealMDoerr for Review.. >LGTM. Seems like uint64_t toc was taken from PPC64, but doesn't make sense for s390x. At least, I couldn't find it in the ABI spec. I had the same issue finding uses for `tmp` and `toc`; it appears they were never used. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13650#issuecomment-1523215317 From thartmann at openjdk.org Wed Apr 26 11:01:53 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 26 Apr 2023 11:01:53 GMT Subject: RFR: 8287087: C2: perform SLP reduction analysis on-demand [v4] In-Reply-To: <-5XPy4I-SVMVvxwGusambZ0QlT_UjpwB_pmq5IWpWlk=.dba6fe41-16f4-4ed4-a4df-62094fc33a86@github.com> References: <-5XPy4I-SVMVvxwGusambZ0QlT_UjpwB_pmq5IWpWlk=.dba6fe41-16f4-4ed4-a4df-62094fc33a86@github.com> Message-ID: <0sjbo7yOoWn6PcnOAm1h4-T1nXLgZ5lvYlAoPjYxauo=.802890dc-5abb-4625-a8d6-aa0841ad3723@github.com> On Mon, 24 Apr 2023 15:06:12 GMT, Roberto Casta?eda Lozano wrote: >> Reduction analysis finds cycles of reduction operations within loops. The result of this analysis is used by SLP auto-vectorization (to vectorize reductions if deemed profitable) and by x64 instruction matching (to select specialized scalar floating-point `Math.min()/max()` implementations). Currently, reduction analysis is applied early (before loop unrolling), and the result is propagated through loop unrolling by marking nodes and loops with special reduction flags. Applying reduction analysis early is efficient, but propagating the results correctly through loop unrolling and arbitrary graph transformations is challenging and often leads to inconsistent node-loop reduction flag states, some of which have led to actual miscompilations in the past (see [JDK-8261147](https://bugs.openjdk.org/browse/JDK-8261147) and [JDK-8279622](https://bugs.openjdk.org/browse/JDK-8279622)). 
>> >> This changeset postpones reduction analysis to the point where its results are actually used. To do so, it generalizes the analysis to find reduction cycles on unrolled loops: >> >> ![reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226725587-b7d68509-3717-4bbe-8d54-f9a105853fda.png) >> >> The generalized analysis precludes the need to maintain and propagate node and loop reduction flags through arbitrary IR transformations, reducing the risk of miscompilations due to invalidation of the analysis results. The generalization is slightly more costly than the current analysis, but still negligible in micro- and general benchmarks. >> >> ## Performance Benefits >> >> As a side benefit, the proposed generalization is able to find more reductions, increasing the scope of auto-vectorization and the performance of x64 floating-point `Math.min()/max()` in multiple scenarios. >> >> ### Increased Auto-Vectorization Scope >> >> There are two main scenarios in which the proposed changeset enables further auto-vectorization: >> >> #### Reductions Using Global Accumulators >> >> >> public class Foo { >> int acc = 0; >> (..) >> void reduce(int[] array) { >> for (int i = 0; i < array.length; i++) { >> acc += array[i]; >> } >> } >> } >> >> Initially, such reductions are wrapped by load and store nodes, which defeats the current reduction analysis. However, after unrolling and other optimizations are applied, the reduction becomes recognizable by the proposed analysis: >> >> ![global-reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226745351-33494e40-7c07-4a8b-8bf6-d3a96e84b1c2.png) >> >> #### Reductions of partially unrolled loops >> >> >> (..) >> for (int i = 0; i < array.length / 2; i++) { >> acc += array[2*i]; >> acc += array[2*i + 1]; >> } >> (..) 
>> >> >> These reductions are manually unrolled from the beginning, so the current reduction analysis fails to find them, while the proposed analysis is able to detect them as if they were unrolled automatically. >> >> ### Increased Performance of x64 Floating-Point `Math.min()/max()` >> >> Besides the above scenarios, the proposed generalization allows the x64 matcher to select specialized floating-point `Math.min()/max()` implementations for reductions in non-counted and outer loops (see the new micro-benchmarks in `FpMinMaxIntrinsics.java` for more details). >> >> ## Implementation details >> >> The generalized reduction analysis finds reductions in a loop by looking for chains of reduction operators of the same node type starting and finishing on each phi node in the loop. To avoid a combinatorial explosion, the analysis assumes that all nodes in a chain are connected via the same edge index, which is realistic because chains usually consist of identical nodes cloned by loop unrolling. This assumption allows the analysis to test only two paths for each examined phi node. A failure of this assumption (e.g. as illustrated in test case `testReductionOnPartiallyUnrolledLoopWithSwappedInputs` from `TestGeneralizedReductions.java`) results in missing vectorization but does not affect correctness. Note that the same-index assumption can only fail in cases where current auto-vectorization would also fail to vectorize (manually unrolled loops). >> >> The changeset implements a more relaxed version of the reduction analysis for x64 matching, suitable for queries on single nodes. This analysis is run only in the presence of `[Min|Max][F|D]` nodes.
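The same-edge-index chain search described under "Implementation details" can be sketched on a toy graph as follows. This is an illustrative simplification under stated assumptions, not the SuperWord code: the `Node` struct, the `Kind` enum, and the functions `chain_length` and `find_reduction_chain` are all hypothetical, only binary operators are modeled, and the swapped-edge tracking that the changeset adds is left out. For one phi, the search tries input index 1 and then input index 2, which is why only two paths are examined.

```cpp
#include <cstddef>

// Hypothetical, heavily simplified IR node (not HotSpot's Node class).
struct Node {
    enum Kind { Phi, Add, Mul, Other };
    Kind kind;
    const Node* in[3];  // in[1] and in[2] are the data inputs; in[0] is unused here
    explicit Node(Kind k, const Node* a = nullptr, const Node* b = nullptr)
        : kind(k), in{nullptr, a, b} {}
};

// Starting at the value flowing into the phi's backedge, follow input `idx`
// through nodes of one and the same operator kind until the phi is reached.
// Returns the number of operators in the chain, or 0 if no chain is found
// within `max_len` steps.
std::size_t chain_length(const Node* phi, const Node* backedge,
                         std::size_t idx, std::size_t max_len) {
    const Node* cur = backedge;
    if (cur == nullptr || cur == phi) return 0;
    const Node::Kind kind = cur->kind;
    if (kind != Node::Add && kind != Node::Mul) return 0;
    for (std::size_t len = 1; len <= max_len; ++len) {
        if (cur->kind != kind) return 0;   // chain must be homogeneous
        const Node* next = cur->in[idx];   // same edge index at every step
        if (next == phi) return len;       // closed the cycle
        if (next == nullptr) return 0;
        cur = next;
    }
    return 0;                              // search bound exceeded
}

// Test the two candidate paths (via input 1, then via input 2).
std::size_t find_reduction_chain(const Node* phi, const Node* backedge,
                                 std::size_t max_len) {
    for (std::size_t idx = 1; idx <= 2; ++idx) {
        const std::size_t len = chain_length(phi, backedge, idx, max_len);
        if (len > 0) return len;
    }
    return 0;
}
```

In this sketch, a chain such as `(phi + x) + y` is found through index 1, while a manually swapped form such as `x + (phi + y)` mixes indices and is not found, which mirrors the missed-vectorization case mentioned above.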
>> >> ## Alternative approaches >> >> A complication results from edge swapping in the nodes cloned by loop unrolling (see [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/addnode.cpp#L123) and [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/mulnode.cpp#L113)), which can lead to reduction chains connected via different input indices. This is addressed by tracking whether nodes have swapped edges and adjusting the explored input indices in the reduction analysis accordingly. An alternative (proposed by @eme64 and @jatin-bhateja ) is to replace this changeset's linear chain finding approach with some form of general path-finding algorithm. This alternative would preclude the need for tracking edge swapping at a potentially higher computational cost. The following table summarizes the pros and cons of the current mainline approach, this changeset, and the proposed alternative: >> >> | approach | correctness | efficiency | effectiveness | conceptual complexity | >> | -------- | ----------- | ---------- | ------------- | --------------------- | >> | mainline (current) | hard to establish due to need of maintaining reduction flags through arbitrary graph transformations (has led to miscompilations, see JDK-8261147 and JDK-8279622) | high | low (misses substantial reduction vectorization opportunities) | high (requires maintaining non-local reduction node state) | >> | this changeset | easy to establish since client transformations operate on the same graph that is analyzed | medium (limited search for chains of nodes) | high (finds all reduction cycles except for partially unrolled loops with manually-swapped inputs) | medium (requires maintaining local swapped-edge node state) | >> | general search | easy to establish (same as above) | low (general search), particularly for x64 matching where the analysis runs once for every node in a chain | high 
(similar to above but also covering manually-swapped inputs) | low (no node state required, use of well-known graph search algorithms) | >> >> Since the efficiency-conceptual complexity trade-off between this changeset and the general search approach is not obvious, I propose to integrate this changeset (which strikes a balance between the two) and investigate the latter one in a follow-up RFE. >> >> ## Testing >> >> ### Functionality >> >> - tier1-5 (linux-x64, linux-aarch64, windows-x64, macosx-x64, and macosx-aarch64). >> - fuzzing (12 h. on linux-x64 and linux-aarch64). >> >> ##### TestGeneralizedReductions.java >> >> Tests the new scenarios in which vectorization occurs. These tests are restricted to 64-bits platforms, since I do not have access to 32-bits ones. `testReductionOnPartiallyUnrolledLoop` has been observed to fail on [linux-x86](https://github.com/robcasloz/jdk/actions/runs/4478959520/jobs/7873827856#logs) due to missing vectorization. If anyone wants to have a look and derive the necessary IR test framework preconditions for the test to pass on linux-x86, I am happy to lift the 64-bits restriction. >> >> ##### TestFpMinMaxReductions.java >> >> Tests the matching of floating-point max/min implementations in x64. >> >> ##### TestSuperwordFailsUnrolling.java >> >> This test file is updated to ensure auto-vectorization is never triggered, because this changeset would otherwise enable it and defeat the purpose of the test. >> >> ### Performance >> >> #### General Benchmarks >> >> The changeset does not cause any performance regression on the DaCapo, SPECjvm 2008, and SPECjbb2015 benchmark suites for linux-x64 and linux-aarch64. >> >> #### Micro-benchmarks >> >> The changeset extends two existing files with additional micro-benchmarks that show the benefit of the generalized reduction analysis ([full results](https://github.com/openjdk/jdk/files/11039207/microbenchmark-results.ods)). 
>> >> >> ##### VectorReduction.java >> >> These micro-benchmarks are first adjusted to actually vectorize in the mainline approach, since they suffered from the global-accumulator limitation. Two micro-benchmarks are added to exercise vectorization in the presence of global accumulators and partially unrolled loops. Running `VectorReduction.java` on an x64 (Cascade Lake) machine confirms the expectations: compared to mainline (with the adjustment mentioned above), this changeset yields similar performance results except for `andRedIOnGlobalAccumulator` and `andRedIPartiallyUnrolled`, where the changeset improves performance by 2.4x in both cases. >> >> ##### MaxIntrinsics.java >> >> This file is extended with four new micro-benchmarks. Running it on the same machine as above shows that the changeset does not affect the performance of the existing micro-benchmarks, and improves moderately to substantially the performance of the new ones (because it allows the x64 matcher to select a floating-point `Math.min()` implementation that is specialized for reduction min operations): >> >> | micro-benchmark | speedup compared to mainline | >> | --- | --- | >> | `fMinReduceInOuterLoop` | 1.1x | >> | `fMinReduceNonCounted` | 2.3x | >> | `fMinReduceGlobalAccumulator` | 2.4x | >> | `fMinReducePartiallyUnrolled` | 3.9x | >> >> ## Acknowledgments >> >> Thanks to @danielogh for making it possible to test this improvement with confidence ([JDK-8294715](https://bugs.openjdk.org/browse/JDK-8294715)) and to @TobiHartmann, @chhagedorn, @vnkozlov and @eme64 for discussions and useful feedback. > > Roberto Casta?eda Lozano has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 33 commits: > > - Merge master > - Fix node naming in reduction chain traversal > - Use is_marked_reduction() in new SLP code > - Merge master > - Emit Node::Flag_has_swapped_edges in IGV graphs > - Merge master > - Relax the reduction cycle search bound > - Remove redundant IR check precondition > - Use SuperWord members in reduction marking > - Remove redundant opcode checks > - ... and 23 more: https://git.openjdk.org/jdk/compare/7400aff3...1510accd Great, thorough analysis. The fix looks good to me. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13120#pullrequestreview-1401715770 From chagedorn at openjdk.org Wed Apr 26 11:22:23 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 26 Apr 2023 11:22:23 GMT Subject: RFR: 8306042: C2: failed: Missed optimization opportunity in PhaseCCP (adding LShift->Cast->Add notification) [v2] In-Reply-To: <3kQ7Oh5f128BWKBL_AiQwGhJTMgihcJCF7nGauh_X1A=.9b463cc0-8457-46ef-ba1d-1b4639bc813a@github.com> References: <5LdntwU5zlwXPnwYeJxzNPZTwrOuki6VebrE9Leeb8g=.3dc26d60-2729-4d60-9a5c-14cbb57f2813@github.com> <3kQ7Oh5f128BWKBL_AiQwGhJTMgihcJCF7nGauh_X1A=.9b463cc0-8457-46ef-ba1d-1b4639bc813a@github.com> Message-ID: On Wed, 26 Apr 2023 09:06:30 GMT, Emanuel Peter wrote: >> An other case of `uncast` not being type-propagated through. >> >> We have a case like this: >> `Phi -> ShiftL -> CastII -> AndI` >> >> The Phi has an updated type, so we should re-run Value on the AndI. 
>> >> In PhaseCCP::push_and, we do update a similar pattern: >> `X -> ShiftL -> AndI` >> >> I extended it to handle this pattern: >> `parent -> LShift (use) -> ConstraintCast* -> And` >> >> For this, I implemented: >> https://github.com/openjdk/jdk/blob/26f4adaae901822bea984b926c06d1a78f9c6b48/src/hotspot/share/opto/castnode.hpp#L73-L78 >> >> I could refactor code from a previous similar fix, for pattern: `ConstraintCast+ -> Sub/Phi` >> >> **Discussion** >> >> https://github.com/openjdk/jdk/blob/4d350f8f4eaabb18482c7656cb56a734e60187cf/src/hotspot/share/opto/castnode.hpp#L78-L79 >> I would have liked to place a `ResourceMark` between these two lines, to ensure the `internals` data structure is de-allocated after the traversal. But if I add it there, then one cannot modify any outer data-structure, or else one risks re-allocation of the outer data-structure in the inner ResourceMark, and then this memory gets de-allocated once the ResourceMark is cleared, and the outer data-structure is broken. This would for example mean that I could not push to the IGVN worklist inside the callback. >> >> Not having the ResourceMark means a memory leak, until the compile phase is over. But my code is not the only place, there are lots of places where we create a Resource allocated data-structure, but do not use ResourceMarks. > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > corrected java whitespaces Looks good! ------------- Marked as reviewed by chagedorn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/13611#pullrequestreview-1401746329 From rcastanedalo at openjdk.org Wed Apr 26 11:22:26 2023 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Wed, 26 Apr 2023 11:22:26 GMT Subject: RFR: 8287087: C2: perform SLP reduction analysis on-demand [v4] In-Reply-To: <0sjbo7yOoWn6PcnOAm1h4-T1nXLgZ5lvYlAoPjYxauo=.802890dc-5abb-4625-a8d6-aa0841ad3723@github.com> References: <-5XPy4I-SVMVvxwGusambZ0QlT_UjpwB_pmq5IWpWlk=.dba6fe41-16f4-4ed4-a4df-62094fc33a86@github.com> <0sjbo7yOoWn6PcnOAm1h4-T1nXLgZ5lvYlAoPjYxauo=.802890dc-5abb-4625-a8d6-aa0841ad3723@github.com> Message-ID: On Wed, 26 Apr 2023 10:49:13 GMT, Tobias Hartmann wrote: > Great, thorough analysis. The fix looks good to me. Thanks for reviewing, Tobias! ------------- PR Comment: https://git.openjdk.org/jdk/pull/13120#issuecomment-1523237431 From chagedorn at openjdk.org Wed Apr 26 11:22:23 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 26 Apr 2023 11:22:23 GMT Subject: RFR: 8306331: assert((cnt > 0.0f) && (prob > 0.0f)) failed: Bad frequency assignment in if [v2] In-Reply-To: References: Message-ID: On Mon, 24 Apr 2023 18:10:44 GMT, Dean Long wrote: >> This change removes undefined behavior caused by signed overflow, which triggered an assert with Xcode14.3+1.0-beta1 on macos aarch64. > > Dean Long has updated the pull request incrementally with two additional commits since the last revision: > > - Update src/hotspot/share/opto/parse2.cpp > > Co-authored-by: Tobias Hartmann > - Update src/hotspot/share/opto/parse2.cpp > > Co-authored-by: Tobias Hartmann Looks good! ------------- Marked as reviewed by chagedorn (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/13551#pullrequestreview-1401750308 From amitkumar at openjdk.org Wed Apr 26 11:25:54 2023 From: amitkumar at openjdk.org (Amit Kumar) Date: Wed, 26 Apr 2023 11:25:54 GMT Subject: RFR: 8306855: [s390x] fix difference in abi sizes [v2] In-Reply-To: References: <2GGc3c254on4RieO7rt8QGcxK9r-LtmamRbJdjYqYR4=.ed0aa5b9-7ecb-4414-beda-1e30cdbd151e@github.com> <8oTtVlCVx_ELHQcLm4esknWWEm-5ucXKDtPv1ih3w0g=.3bb9222d-6316-40b0-8495-91ce201610a3@github.com> Message-ID: On Wed, 26 Apr 2023 10:55:22 GMT, Lutz Schmidt wrote: >> Amit Kumar has updated the pull request incrementally with one additional commit since the last revision: >> >> updates header years > > LGTM. > Thank you for the harmonization. Thanks @RealLucy for the quick review response. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13650#issuecomment-1523238070 From amitkumar at openjdk.org Wed Apr 26 11:25:57 2023 From: amitkumar at openjdk.org (Amit Kumar) Date: Wed, 26 Apr 2023 11:25:57 GMT Subject: Integrated: 8306855: [s390x] fix difference in abi sizes In-Reply-To: <2GGc3c254on4RieO7rt8QGcxK9r-LtmamRbJdjYqYR4=.ed0aa5b9-7ecb-4414-beda-1e30cdbd151e@github.com> References: <2GGc3c254on4RieO7rt8QGcxK9r-LtmamRbJdjYqYR4=.ed0aa5b9-7ecb-4414-beda-1e30cdbd151e@github.com> Message-ID: On Tue, 25 Apr 2023 17:11:47 GMT, Amit Kumar wrote: > This PR equalizes the interpreter and native ABI sizes. The JIT ABI is completely removed as it was not being used anywhere in the s390x code base. Apart from that, `z_abi_16` was replaced by `z_common_abi` and some other renaming followed from this. > > Builds `slowdebug, fastdebug, release, optimized` are okay. `tier1-tests on fastdebug-build` are not affected either. This pull request has now been integrated.
Changeset: 35e7bc21 Author: Amit Kumar Committer: Lutz Schmidt URL: https://git.openjdk.org/jdk/commit/35e7bc21d3c1b38e2268924b20ae4b149b4f8cd8 Stats: 62 lines in 9 files changed: 14 ins; 11 del; 37 mod 8306855: [s390x] fix difference in abi sizes Reviewed-by: mdoerr, lucy ------------- PR: https://git.openjdk.org/jdk/pull/13650 From rcastanedalo at openjdk.org Wed Apr 26 11:45:30 2023 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Wed, 26 Apr 2023 11:45:30 GMT Subject: RFR: 8305770: os::Linux::available_memory() should refer MemAvailable in /proc/meminfo In-Reply-To: References: Message-ID: On Tue, 25 Apr 2023 08:27:44 GMT, Roberto Castañeda Lozano wrote: >> Hi hotspot-compiler folks, >> >> I'd like to change `os::Linux::available_memory()` to refer `MemAvailable` in /proc/meminfo on Linux. One of user of this function is JIT compiler. It is used for determine number of compiler threads. After this change, more compiler threads would be started because `MemAvailable` includes not only free memory but also some caches - it means `MemAvailable` is bigger than `MemFree`. >> >> I think this change is not a problem because number of compiler threads is limited by `CICompilerCount`. Do you have any concerns in compiler perspective? > > Hi @YaSuenag, just to confirm that this change would not lead to excessive creation/deletion of compiler threads (which can have a significant cost in terms of memory usage, see e.g. [JDK-8302264](https://bugs.openjdk.org/browse/JDK-8302264)), it would be useful to see some measurements about number of created compiler threads over the execution of some applications in an environment configured so that `available_memory / (200*M)` becomes the limiting factor in https://github.com/openjdk/jdk/blob/f968da97a5a5c68c28ad29d13fdfbe3a4adf5ef7/src/hotspot/share/compiler/compileBroker.cpp#L1024-L1027, before and after the change. This could be easily measured e.g. using `-XX:+TraceCompilerThreads`.
Do you have (or could produce) such measurements? > @robcasloz Thanks for your reply! The results of measurement with `-XX:+TraceCompilerThreads` is here. They look like no significant difference. Do you have any concerns? and we go ahead this change? > # before > > ``` > $ head -n 3 /proc/meminfo; ./build/linux-x86_64-server-fastdebug/images/jdk/bin/java -XX:+TraceCompilerThreads --version > MemTotal: 8117076 kB > MemFree: 123228 kB > MemAvailable: 7457284 kB > 184 Added initial compiler thread C2 CompilerThread0 > 184 Added initial compiler thread C1 CompilerThread0 > openjdk 21-internal 2023-09-19 > OpenJDK Runtime Environment (fastdebug build 21-internal-adhoc.ysuenaga.jdk) > OpenJDK 64-Bit Server VM (fastdebug build 21-internal-adhoc.ysuenaga.jdk, mixed mode, sharing) > ``` > > # after > > ``` > $ head -n 3 /proc/meminfo; ./build/linux-x86_64-server-fastdebug/images/jdk/bin/java -XX:+TraceCompilerThreads --version > MemTotal: 8117076 kB > MemFree: 118276 kB > MemAvailable: 7625760 kB > 185 Added initial compiler thread C2 CompilerThread0 > 185 Added initial compiler thread C1 CompilerThread0 > openjdk 21-internal 2023-09-19 > OpenJDK Runtime Environment (fastdebug build 21-internal-adhoc.ysuenaga.jdk) > OpenJDK 64-Bit Server VM (fastdebug build 21-internal-adhoc.ysuenaga.jdk, mixed mode, sharing) > ``` > > They were measured on Fedora 38 on Hyper-V which assigned 4 vCPUs. I consumed memory with following program. It run on background in each measurement. > > ```c > const char *fname = "test.dat"; > const unsigned long sz = 8000UL * 1024UL * 1024UL; > char *mem; > int fd; > > fd = open(fname, O_CREAT | O_CLOEXEC | O_RDWR, S_IRUSR | S_IWUSR); > lseek(fd, sz, SEEK_SET); > write(fd, &sz, 1); > > mem = (char *)mmap(NULL, sz, PROT_WRITE, MAP_SHARED, fd, 0); > memset(mem, 1, sz); > ``` Thanks for checking @YaSuenag! 
Do I get it right that, in your example, `CompileBroker::possibly_add_compiler_threads()` would see almost 8GB of available memory (`MemAvailable` in `/proc/meminfo`) after this change, whereas before the change it would see only around 120 MB (`MemFree`)? If that is the case, this is a significant increase that might indeed affect the behavior of the compiler thread creation policy, at least in environments where `MemFree` would otherwise be the limiting factor. I think it would be good to get more compiler thread creation statistics from running Java applications or benchmarks that require a significant amount of compilation (so that the compilation policy has a reason to create more compiler threads in the first place) on memory-constrained environments (so that the decision to create or not more compiler threads depends on `os::available_memory()` and not the other three factors in https://github.com/openjdk/jdk/blob/f968da97a5a5c68c28ad29d13fdfbe3a4adf5ef7/src/hotspot/share/compiler/compileBroker.cpp#L1024-L1027). Would it be possible to collect these measurements? ------------- PR Comment: https://git.openjdk.org/jdk/pull/13398#issuecomment-1523261296 From stuefe at openjdk.org Wed Apr 26 13:24:27 2023 From: stuefe at openjdk.org (Thomas Stuefe) Date: Wed, 26 Apr 2023 13:24:27 GMT Subject: RFR: 8305770: os::Linux::available_memory() should refer MemAvailable in /proc/meminfo In-Reply-To: References: Message-ID: On Sat, 8 Apr 2023 02:24:44 GMT, Yasumasa Suenaga wrote: > `os::Linux::available_memory()` returns available memory from cgroups or sysinfo(2). In case of the process which run on out of container, that value is based on `freeram` from sysinfo(2). > > `freeram` is equivalent to `MemFree` in `/proc/meminfo` [1]. However it means just a free RAM. We should use `MemAvailable` when we want to know how much memory is available for the process [2]. `MemAvailable` is available in modern Linux kernel, and it has been backported some older kernels (e.g. RHEL). 
In `sar` from sysstat, it refers that value and shows it as `kbavail` [3]. > > AFAIK PhysicalMemory event in JFR depends on `os::Linux::available_memory()`, and it is used in automated analysis in JMC. So the JFR/JMC user could misunderstand physical memory was exhausted even if the memory was available enough. > > [1] https://github.com/torvalds/linux/blob/c9c3395d5e3dcc6daee66c6908354d47bf98cb0c/fs/proc/meminfo.c#L59 > [2] https://docs.kernel.org/filesystems/proc.html?highlight=memavailable > [3] https://github.com/sysstat/sysstat/blob/ac1df71ca252c158e8d418ded93e5ed52f5e8765/rd_stats.c#L325-L328 We could also just bypass the compiler thread creation question for now. Let the compiler continue to use the old metric when calculating its thread count, but let all other users of os::available_memory() use the new one. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13398#issuecomment-1523400205 From rcastanedalo at openjdk.org Wed Apr 26 13:58:24 2023 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Wed, 26 Apr 2023 13:58:24 GMT Subject: RFR: 8305770: os::Linux::available_memory() should refer MemAvailable in /proc/meminfo In-Reply-To: References: Message-ID: On Wed, 26 Apr 2023 13:12:51 GMT, Thomas Stuefe wrote: > We could also just bypass the compiler thread creation question for now. Let the compiler continue to use the old metric when calculating its thread count, but let all other users of os::available_memory() use the new one. I agree, this could be the most pragmatic way forward.
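To make the size of the difference discussed in this thread concrete, here is a minimal sketch (not HotSpot code) that parses `/proc/meminfo`-style text and applies the `available_memory / (200*M)` bound quoted earlier in the thread to the `MemFree` and `MemAvailable` values from the "before" measurement. The class name `MemInfoSketch` and the helpers `fieldKb` and `memoryBound` are hypothetical names chosen for illustration; the real thread-count decision in `compileBroker.cpp` also depends on other factors that are not modelled here.

```java
import java.util.OptionalLong;

// Hypothetical illustration only; none of these names exist in HotSpot.
public class MemInfoSketch {

    // Returns the value (in kB) of a field such as "MemFree" from
    // /proc/meminfo-style text, or empty if the field is not present.
    static OptionalLong fieldKb(String meminfo, String name) {
        for (String line : meminfo.split("\n")) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length >= 2 && parts[0].equals(name + ":")) {
                return OptionalLong.of(Long.parseLong(parts[1]));
            }
        }
        return OptionalLong.empty();
    }

    // Memory-based bound from the thread: available bytes / (200 * M).
    // Assumption: M is 1024 * 1024 bytes.
    static long memoryBound(long availableKb) {
        final long M = 1024L * 1024L;
        return (availableKb * 1024L) / (200L * M);
    }

    public static void main(String[] args) {
        // Values copied from the "before" measurement in the thread.
        String sample = String.join("\n",
            "MemTotal:        8117076 kB",
            "MemFree:          123228 kB",
            "MemAvailable:    7457284 kB");
        long free = fieldKb(sample, "MemFree").orElseThrow();
        long avail = fieldKb(sample, "MemAvailable").orElseThrow();
        System.out.println("bound from MemFree:      " + memoryBound(free));
        System.out.println("bound from MemAvailable: " + memoryBound(avail));
    }
}
```

With these sample values the memory-based bound is 0 when computed from `MemFree` and 36 when computed from `MemAvailable`, which is why the choice of metric matters in environments where memory is the limiting factor.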
------------- PR Comment: https://git.openjdk.org/jdk/pull/13398#issuecomment-1523451549 From ysuenaga at openjdk.org Wed Apr 26 14:30:27 2023 From: ysuenaga at openjdk.org (Yasumasa Suenaga) Date: Wed, 26 Apr 2023 14:30:27 GMT Subject: RFR: 8305770: os::Linux::available_memory() should refer MemAvailable in /proc/meminfo In-Reply-To: References: Message-ID: On Wed, 26 Apr 2023 13:45:50 GMT, Roberto Castañeda Lozano wrote: > > We could also just bypass the compiler thread creation question for now. Let the compiler continue to use the old metric when calculating its thread count, but let all other users of os::available_memory() use the new one. > > I agree, this could be the most pragmatic way forward. Ok, so how do we implement that? It would be better to add that function to the `os` class, like `os::free_memory()`, but that affects all supported platforms. I think we can add a static function to get "free memory" to compileBroker.cpp. If it runs on Linux, it returns the old metric, otherwise it delegates to `os::available_memory()`. Is that ok? ------------- PR Comment: https://git.openjdk.org/jdk/pull/13398#issuecomment-1523501016 From coleenp at openjdk.org Wed Apr 26 15:50:30 2023 From: coleenp at openjdk.org (Coleen Phillimore) Date: Wed, 26 Apr 2023 15:50:30 GMT Subject: RFR: 8305079: Remove finalize() from compiler/c2/Test719030 In-Reply-To: References: Message-ID: On Tue, 11 Apr 2023 07:33:16 GMT, Afshin Zafari wrote: > The `finalize()` method is replaced by a Cleaner callback. This looks good! ------------- Marked as reviewed by coleenp (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/13418#pullrequestreview-1402269947 From coleenp at openjdk.org Wed Apr 26 15:50:31 2023 From: coleenp at openjdk.org (Coleen Phillimore) Date: Wed, 26 Apr 2023 15:50:31 GMT Subject: RFR: 8305080: Remove the 'removal' warning for finalize() from test/hotspot/jtreg/compiler/jvmci/common/testcases that used in compiler/jvmci/compilerToVM/ tests [v2] In-Reply-To: References: Message-ID: On Mon, 17 Apr 2023 08:56:42 GMT, Afshin Zafari wrote: >> The finalize() methods are removed and replaced by Cleaner callbacks. >> >> Note: >> `test/hotspot/jtreg/compiler/jvmci/compilerToVM/HasFinalizableSubclassTest.java` may be removed since there is no need to test if finalize() exists in the subclasses or not.. > > Afshin Zafari has updated the pull request incrementally with one additional commit since the last revision: > > Remove the 'removal' warning for finalize() from test/hotspot/jtreg/compiler/jvmci/common/testcases that used in compiler/jvmci/compilerToVM/ tests This is fine once the bug and PR titles are changed to "Suppress". Thanks. ------------- Marked as reviewed by coleenp (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13419#pullrequestreview-1402267206 From roland at openjdk.org Wed Apr 26 15:54:00 2023 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 26 Apr 2023 15:54:00 GMT Subject: RFR: 8306933: C2: "assert(false) failed: infinite loop" failure Message-ID: The assert fires because an infinite loop appears in the graph after loop opts are over. After loop opts, the `for(;;)` loop contains a null check and a range check for `array[i]`. So it's not considered an infinite loop (it has exits to uncommon traps). The null check and range check are redundant with the one right before the loop: `int v = array2[k];` IGVN can optimize it but it doesn't happen until after loop opts when a `ConvI2L` for the `array[i]` access is processed as part of post loop opts IGVN. 
The `for(;;)` loop is then emptied and only contains a `Loop` and a `Safepoint` nodes. I propose removing the assert (at least for now) as I don't see a way to guarantee no infinite loop can appear after loop opts. ------------- Commit messages: - test & fix Changes: https://git.openjdk.org/jdk/pull/13672/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13672&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8306933 Stats: 64 lines in 2 files changed: 61 ins; 3 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/13672.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13672/head:pull/13672 PR: https://git.openjdk.org/jdk/pull/13672 From xliu at openjdk.org Wed Apr 26 16:12:53 2023 From: xliu at openjdk.org (Xin Liu) Date: Wed, 26 Apr 2023 16:12:53 GMT Subject: Integrated: 8306872: Rename Node_Array::Size() In-Reply-To: <4BxSieT0OJp48RTlZtGWqWv_L63InYCdMEWfxWZI-Mg=.ab5c8678-21bb-4f37-ad6e-2f2cd2871434@github.com> References: <4BxSieT0OJp48RTlZtGWqWv_L63InYCdMEWfxWZI-Mg=.ab5c8678-21bb-4f37-ad6e-2f2cd2871434@github.com> Message-ID: On Tue, 25 Apr 2023 21:28:43 GMT, Xin Liu wrote: > Node_List inherits Node_Array, so it inherits Size(). To resolve naming conflict, > Node_List uses size() (with little s) and keeps Size() as capacity. I don't think > it's a good practice to distinct two function using capital initial. In particular > they have different meanings. A true story is that it took me 2 days to find a bug > that I chose wrong one between them. > > This patch just renames Node_Array::Size to max. By using different names, this will > avoid from misusing. We need to changes 6 places. 4 are for Node_List::Size() intentionally. > 2 of them are for Node_Array::Size(). This pull request has now been integrated. 
Changeset: 35e80237 Author: Xin Liu URL: https://git.openjdk.org/jdk/commit/35e802374c18123687ccb5d74a9c2eac0f1b4c52 Stats: 7 lines in 5 files changed: 0 ins; 0 del; 7 mod 8306872: Rename Node_Array::Size() Reviewed-by: kvn, thartmann, shade ------------- PR: https://git.openjdk.org/jdk/pull/13659 From cslucas at openjdk.org Wed Apr 26 17:28:53 2023 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Wed, 26 Apr 2023 17:28:53 GMT Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v11] In-Reply-To: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> Message-ID: <6I1KVkFSekhMTTDq6nXQNoKPE96bycERRtsPrTnZZvU=.c1933f7f-e659-4e22-93a3-e7fbbcdf53a1@github.com> > Can I please get reviews for this PR? > > The most common and frequent use of NonEscaping Phis merging object allocations is for debugging information. The two graphs below show numbers for Renaissance and DaCapo benchmarks - similar results are obtained for all other applications that I tested. > > With what frequency does each IR node type occurs as an allocation merge user? I.e., if the same node type uses a Phi N times the counter is incremented by N: > > ![image](https://user-images.githubusercontent.com/2249648/222280517-4dcf5871-2564-4207-b49e-22aee47fa49d.png) > > What are the most common users of allocation merges? I.e., if the same node type uses a Phi N times the counter is incremented by 1: > > ![image](https://user-images.githubusercontent.com/2249648/222280608-ca742a4e-1622-4e69-a778-e4db6805ea02.png) > > This PR adds support scalar replacing allocations participating in merges used as debug information OR as a base for field loads. I plan to create subsequent PRs to enable scalar replacement of merges used by other node types (CmpP is next on the list) subsequently. 
> > The approach I used for _rematerialization_ is pretty straightforward. It consists basically of the following. 1) New IR node (suggested by V. Kozlov), named SafePointScalarMergeNode, to represent a set of SafePointScalarObjectNode; 2) Each scalar replaceable input participating in a merge will get a SafePointScalarObjectNode like if it weren't part of a merge. 3) Add a new Class to support the rematerialization of SR objects that are part of a merge; 4) Patch HotSpot to be able to serialize and deserialize debug information related to allocation merges; 5) Patch C2 to generate unique types for SR objects participating in some allocation merges. > > The approach I used for _enabling the scalar replacement of some of the inputs of the allocation merge_ is also pretty straightforward: call `MemNode::split_through_phi` to, well, split AddP->Load* through the merge which will render the Phi useless. > > I tested this with JTREG tests tier 1-4 (Windows, Linux, and Mac) and didn't see regression. I also experimented with several applications and didn't see any failure. I also ran tests with "-ea -esa -Xbatch -Xcomp -XX:+UnlockExperimentalVMOptions -XX:-TieredCompilation -server -XX:+IgnoreUnrecognizedVMOptions -XX:+UnlockDiagnosticVMOptions -XX:+StressLCM -XX:+StressGCM -XX:+StressCCP" and didn't observe any related failures. 
Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision: Address part of PR review 4 & fix a bug setting only_candidate ------------- Changes: - all: https://git.openjdk.org/jdk/pull/12897/files - new: https://git.openjdk.org/jdk/pull/12897/files/329d9f40..78435065 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12897&range=10 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12897&range=09-10 Stats: 72 lines in 8 files changed: 15 ins; 17 del; 40 mod Patch: https://git.openjdk.org/jdk/pull/12897.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/12897/head:pull/12897 PR: https://git.openjdk.org/jdk/pull/12897 From vlivanov at openjdk.org Wed Apr 26 17:53:25 2023 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 26 Apr 2023 17:53:25 GMT Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v10] In-Reply-To: <5HmQ644vtSgf7NylFeEsQXfU8DC9W-zk3ayYq373LF4=.8fba7762-07b2-45b8-a002-1ad2c0c05b0e@github.com> References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> <8OlLs3nmBxAKP_OcaZPhC3g1lNpfXJQ6zYCx2XNB43A=.055edcf3-96b7-4f97-9153-fdfe26bb0c0b@github.com> <5HmQ644vtSgf7NylFeEsQXfU8DC9W-zk3ayYq373LF4=.8fba7762-07b2-45b8-a002-1ad2c0c05b0e@github.com> Message-ID: On Tue, 25 Apr 2023 21:37:11 GMT, Cesar Soares Lucas wrote: > ObjectValue is not just a candidate. I.e., the ObjectValue is also used independently of the merge. And now I'm wondering how it all plays with `is_only_merge_candidate()()`/`set_merge_candidate()`... Is it possible for an `ObjectValue` to be shared between multiple merges? 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1178207147 From kvn at openjdk.org Wed Apr 26 18:03:24 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 26 Apr 2023 18:03:24 GMT Subject: RFR: 8304720: SuperWord::schedule should rebuild C2-graph from SuperWord dependency-graph In-Reply-To: References: Message-ID: On Wed, 5 Apr 2023 14:55:55 GMT, Emanuel Peter wrote: > `SuperWord:schedule`, and specifically `SuperWord::co_locate_pack` is broken. > The problem is with the basic approach of it, as far as I know. > Hence, I had to completely re-design the `schedule` algorithm, based on the `PacksetGraph` ([JDK-8304042](https://bugs.openjdk.org/browse/JDK-8304042), https://git.openjdk.org/jdk/pull/13078). > > **The current approach** > > The idea is to leave the non-vectorized memory ops in their place, and find the right place for the vectorized memops to be "sandwiched" into. The logic is very complex and has already had a few bugs fixed. > > **Why this does not work** > > However, in some rare cases, we have to reorder non-vectorized operations. See this example that I added as a regression test: > > https://github.com/openjdk/jdk/blob/a771a61005aea272cc51fa3f3e1637c217582fce/test/hotspot/jtreg/compiler/loopopts/superword/TestScheduleReordersScalarMemops.java#L82-L109 > > I found this issue during work on https://github.com/openjdk/jdk/pull/13078, where I had to restrict/disable some tests that are now passing. > > **Solution** > > Abandon the idea of "sandwiching" memops. Rewrite `SuperWord:schedule`: > > https://github.com/openjdk/jdk/blob/6bb2da3da988618803823e905f23cb106cd9d6b2/src/hotspot/share/opto/superword.cpp#L2567-L2576 > > We first schedule all memops into a linear order. 
> We do this scheduling based on the `PacksetGraph`, which gives us a `DAG` based on the `packset` and the dependency-graph (which in turn respects the data use-defs, as well as the memory dependencies, unless we can prove that they do not reference the same memory). > In other words: we have a linearization that respects all dependencies that must be respected. > Further, we make sure that ops from the same pack are scheduled as a block (all adjacent to each other), and in order that the packset has internally. > > https://github.com/openjdk/jdk/blob/6bb2da3da988618803823e905f23cb106cd9d6b2/src/hotspot/share/opto/superword.cpp#L2489-L2493 > > Now that we have this order (and we have not aborted because we found a cycle in the `PacksetGraph`), we must apply this schedule to each memory slice, and reorder the memops in the slices accordingly. > > https://github.com/openjdk/jdk/blob/6bb2da3da988618803823e905f23cb106cd9d6b2/src/hotspot/share/opto/superword.cpp#L2617-L2619 > > This scheduling has the nice side-effect of simplifying `SuperWord::output` a little. > We know now that the first element in a pack is also first in the slice order, and the last element in the pack is last in the slice (because we schedule the packs as a block, i.e. in the pack order). > > **Discussion** > > This seems to me to be a much more straight forward approach, and it uses the code I recently added for verification of cyclic dependencies in the packset ([JDK-8304042](https://bugs.openjdk.org/browse/JDK-8304042), https://git.openjdk.org/jdk/pull/13078). > > One potential improvement to my fix: > We now sometimes re-order the non-vectorized memory slices, even though it may not be necessary. > This is not wrong, but it makes updates to the graph that may be confusing when debugging. > Further, the re-ordering may have performance impacts. 
> I could use a priority-queue (min-heap, would have to implement it since it does not yet exist), and schedule the `PacksetGraph` whenever possible with the lower `bb_idx` first. This would make the new linear order the same/closer to the old one. However, I am not sure if this is worth the effort and overhead of a priority-queue. > > **Testing** > Github-actions pass. tier1-6 + stress testing passes. > Performance testing showed no significant performance change. Nice rewrite. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13354#pullrequestreview-1402512534 From cslucas at openjdk.org Wed Apr 26 18:52:00 2023 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Wed, 26 Apr 2023 18:52:00 GMT Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v10] In-Reply-To: References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> <8OlLs3nmBxAKP_OcaZPhC3g1lNpfXJQ6zYCx2XNB43A=.055edcf3-96b7-4f97-9153-fdfe26bb0c0b@github.com> <5HmQ644vtSgf7NylFeEsQXfU8DC9W-zk3ayYq373LF4=.8fba7762-07b2-45b8-a002-1ad2c0c05b0e@github.com> Message-ID: <06yEWgFGq-XvgFp61UllqFWWpkYSY9fMqQRmkaNwi7Y=.2316f990-2ba1-4fa0-848e-b7e84fc744f7@github.com> On Wed, 26 Apr 2023 17:42:23 GMT, Vladimir Ivanov wrote: > Is it possible for an ObjectValue to be shared between multiple merges? When I posted my previous comment I thought that could happen. But now I realize that in the current implementation that won't happen: an ObjectValue is created for a combination of Phi x SafePointNode. However, one situation would _require_ sharing the ObjectValue in multiple merges: when different merges share at least one SR input are used as debug info _in the same_ SafePointNode. It's required because in the same SafePointNode all ObjectValues coming from same Allocate needs to have the same value. 
I think the example below will trigger that - I'll check and patch the current implementation to not RAM in that case. Point p = new Point(); Point q = new Point(); Point r = new Point(); if (cond_one) p = q; if (cond_two) r = q; trap(p, q, r); ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1178259628 From qamai at openjdk.org Thu Apr 27 03:24:28 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Thu, 27 Apr 2023 03:24:28 GMT Subject: Integrated: 8283699: Improve the peephole mechanism of hotspot In-Reply-To: References: Message-ID: On Tue, 29 Mar 2022 23:58:39 GMT, Quan Anh Mai wrote: > Hi, > > The current peephole mechanism has several drawbacks: > - Can only match and remove adjacent instructions. > - Cannot match machine ideal nodes (e.g MachSpillCopyNode). > - Can only replace 1 instruction, the position of insertion is limited to the position at which the matched nodes reside. > - Is actually broken since the nodes are not connected properly and OptoScheduling requires true dependencies between nodes. > > The patch proposes to enhance the peephole mechanism by allowing a peep rule to call into a dedicated function, which takes the responsibility to perform all required transformations on the basic block. This allows the peephole mechanism to perform several transformations effectively in a more fine-grain manner. > > The patch uses the peephole optimisation to perform some classic peepholes, transforming on x86 the sequences:
>
> mov r1, r2 -> lea r1, [r2 + r3/i]
> add r1, r3/i
>
> and
>
> mov r1, r2 -> lea r1, [r2 << i], with i = 1, 2, 3
> shl r1, i
>
> On the added benchmarks, the transformations show positive results:
>
> Benchmark Mode Cnt Score Error Units
> LeaPeephole.B_D_int avgt 5 1200.490 ± 104.662 ns/op
> LeaPeephole.B_D_long avgt 5 1211.439 ± 30.196 ns/op
> LeaPeephole.B_I_int avgt 5 1118.831 ± 7.995 ns/op
> LeaPeephole.B_I_long avgt 5 1112.389 ± 15.838 ns/op
> LeaPeephole.I_S_int avgt 5 1262.528 ± 7.293 ns/op
> LeaPeephole.I_S_long avgt 5 1223.820 ± 17.777 ns/op
>
> Benchmark Mode Cnt Score Error Units
> LeaPeephole.B_D_int avgt 5 860.889 ± 6.089 ns/op
> LeaPeephole.B_D_long avgt 5 945.455 ± 21.603 ns/op
> LeaPeephole.B_I_int avgt 5 849.109 ± 9.809 ns/op
> LeaPeephole.B_I_long avgt 5 851.283 ± 16.921 ns/op
> LeaPeephole.I_S_int avgt 5 976.594 ± 23.004 ns/op
> LeaPeephole.I_S_long avgt 5 936.984 ± 9.601 ns/op
>
> A following patch would add IR tests for these transformations since the IR framework has not been able to parse the ideal scheduling yet although printing the scheduling itself has been made possible recently. > > Thank you very much. This pull request has now been integrated. Changeset: 703a6ef5 Author: Quan Anh Mai Committer: Vladimir Kozlov URL: https://git.openjdk.org/jdk/commit/703a6ef591d56b9e5441cb3ca0c70b2b8685f6e1 Stats: 837 lines in 14 files changed: 693 ins; 24 del; 120 mod 8283699: Improve the peephole mechanism of hotspot Reviewed-by: kvn, dlong ------------- PR: https://git.openjdk.org/jdk/pull/8025 From gcao at openjdk.org Thu Apr 27 03:57:23 2023 From: gcao at openjdk.org (Gui Cao) Date: Thu, 27 Apr 2023 03:57:23 GMT Subject: RFR: 8306966: RISC-V: Support vector cast node for Vector API Message-ID: Hi, we have added some implementations related to vector cast, It was implemented by referring to RVV v1.0 [1]. please take a look and have some reviews. Thanks a lot. We can use the VectorReshapeTests.java[2] to print the compilation log, verify and observe the generation of nodes.
For example, we can use the following command to print the compilation log of a jtreg test case: /home/zifeihan/jdk-tools/jtreg/bin/jtreg \ -v:default \ -concurrency:16 -timeout:50 \ -javaoption:-XX:+UnlockExperimentalVMOptions \ -javaoption:-XX:+UseRVV \ -javaoption:-XX:+PrintOptoAssembly \ -javaoption:-XX:LogFile=/home/zifeihan/jdk-rvv/VectorReshapeTests_PrintOptoAssembly_20230426.log \ -jdk:/home/zifeihan/jdk-rvv/build/linux-riscv64-server-fastdebug/jdk \ -compilejdk:/home/zifeihan/jdk-rvv/build/linux-x86_64-server-release/images/jdk \ /home/zifeihan/jdk/test/jdk/jdk/incubator/vector/VectorReshapeTests.java #### VectorCast/VectorCastB2X/VectorCastD2X/VectorCastF2X/VectorCastI2X/VectorCastL2X/VectorCastS2X There are too many nodes here, and the following shows the log of `VectorCastB2X` nodes: ``` 1ba0 ld R28, [R23, #280] # ptr, #@loadP 1ba4 addi R29, R7, #32 # ptr, #@addP_reg_imm 1ba8 reinterpretResize V1, V5 1bb0 vcvtBtoX V4, V1 1bb8 far_bgeu R29, R28, B465 #@far_cmpP_branch P=0.000100 C=-1.000000 ``` #### VectorRearrange/VectorReinterpret When the original vector is transformed to the target vector, if the actual number of elements of the original vector is larger than the number of elements of the target vector, a slice action is performed to provide data for the subsequent cast nodes. the slice action depends on the `VectorRearrange` and `VectorReinterpret` nodes. 
The compilation log for the `VectorRearrange` node: ``` 1f6 spill R7 -> [sp, #320] # spill size = 64 1f8 spill [sp, #128] -> V1 # vector spill size = 256 200 spill [sp, #160] -> V2 # vector spill size = 256 208 rearrange V3, V1, V2 210 spill V3 -> [sp, #96] # vector spill size = 256 218 li R11, #4 # int, #@loadConI ``` The compilation log for the `VectorReinterpret` node: 1218 spill [sp, #32] -> V4 # vector spill size = 256 1220 spill [sp, #176] -> V3 # vector spill size = 256 1228 rearrange V2, V4, V3 1230 spill [sp, #72] -> V0 # vmask spill size = 32 123c vmerge_vvm V1, V1, V2, v0 #@vector blend 1244 reinterpretResize V2, V1 124c vcvtStoX_extend V5, V2 1254 bgeu R28, R7, B169 #@cmpP_branch P=0.000100 C=-1.000000 #### LShiftCntV/RShiftCntV/MaskAll We have merged `LShiftCntV`, `RShiftCntV` nodes and support boolean types The compilation log for the LShiftCntV/RShiftCntV node: 24c vasrB V3, V1, V2 260 storeV [R19], V3 # vector (rvv) 268 lbu R19, [R29, #48] # byte, #@loadUB 26c andi R19, R19, #7 #@andI_reg_imm 270 loadV V1, [R25] # vector (rvv) 278 vshiftcnt V2, R19 280 vasrB V3, V1, V2 294 storeV [R26], V3 # vector (rvv) 29c lbu R19, [R29, #80] # byte, #@loadUB 2a0 andi R19, R19, #7 #@andI_reg_imm 2a4 loadV V1, [R22] # vector (rvv) 2ac vshiftcnt V2, R19 By the way, the mask version of MaskAll is supported. 
[1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/VectorReshapeTests.java Testing: qemu with UseRVV: - [ ] Tier1 tests (release) - [ ] Tier2 tests (release) - [ ] Tier3 tests (release) - [x] test/jdk/jdk/incubator/vector (fastdebug) ------------- Commit messages: - 8306966: RISC-V: Support vector cast node for Vector API Changes: https://git.openjdk.org/jdk/pull/13684/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13684&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8306966 Stats: 500 lines in 5 files changed: 444 ins; 46 del; 10 mod Patch: https://git.openjdk.org/jdk/pull/13684.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13684/head:pull/13684 PR: https://git.openjdk.org/jdk/pull/13684 From epeter at openjdk.org Thu Apr 27 06:49:53 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 27 Apr 2023 06:49:53 GMT Subject: RFR: 8306042: C2: failed: Missed optimization opportunity in PhaseCCP (adding LShift->Cast->Add notification) [v2] In-Reply-To: References: <5LdntwU5zlwXPnwYeJxzNPZTwrOuki6VebrE9Leeb8g=.3dc26d60-2729-4d60-9a5c-14cbb57f2813@github.com> <3kQ7Oh5f128BWKBL_AiQwGhJTMgihcJCF7nGauh_X1A=.9b463cc0-8457-46ef-ba1d-1b4639bc813a@github.com> Message-ID: On Wed, 26 Apr 2023 11:09:52 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> corrected java whitespaces > > Looks good! @chhagedorn @TobiHartmann Thanks for the reviews. I could integrate now. But I have one more discovery: The reason why I cannot place the mentioned `ResourceMark` is that the `worklist` has its arena as the `Thread::current()->resource_area()` in some cases. One example is the `Unique_Node_List worklist` in `PhaseCCP::analyze`. So then in `PhaseCCP::push_and`, this `worklist` gets captured by the lambda, and passed to `ConstraintCastNode::visit_uncasted_uses`.
If there is a `ResourceMark` inside it, and we re-allocate memory for the `worklist` because it grows beyond its current capacity, then we will de-allocate that memory when we exit the `ResourceMark`, and the `worklist` now points to de-allocated memory. Without a `ResourceMark`, this can be avoided. But then we never release the memory for the `Unique_Node_List internals`. The second use of `ConstraintCastNode::visit_uncasted_uses` in `PhaseIterGVN::add_users_to_worklist` does not run into this issue, even if we have a `ResourceMark`. The reason is that there the `worklist` that is captured allocates from a separate arena, namely the `Compile::current()->comp_arena()`. We could thus also consider having the `worklist` inside `PhaseCCP::analyze` allocate also from the `Compile::current()->comp_arena()`. One downside is that we never de-allocate memory in that arena until the end of `Compile`. But currently, there is also no `ResourceMark` upstream of `PhaseCCP::analyze`, all the way up to `Compile`. So this `worklist` never has its memory de-allocated anyway. What is your opinion? Should I make such changes, or just integrate what I have now? ------------- PR Comment: https://git.openjdk.org/jdk/pull/13611#issuecomment-1524868040 From gcao at openjdk.org Thu Apr 27 07:42:01 2023 From: gcao at openjdk.org (Gui Cao) Date: Thu, 27 Apr 2023 07:42:01 GMT Subject: RFR: 8306966: RISC-V: Support vector cast node for Vector API [v2] In-Reply-To: References: Message-ID: > Hi, > > we have added some implementations related to vector cast, It was implemented by referring to RVV v1.0 [1]. please take a look and have some reviews. Thanks a lot. > > We can use the VectorReshapeTests.java[2] to print the compilation log, verify and observe the generation of nodes. 
> > For example, we can use the following command to print the compilation log of a jtreg test case: > > > /home/zifeihan/jdk-tools/jtreg/bin/jtreg \ > -v:default \ > -concurrency:16 -timeout:50 \ > -javaoption:-XX:+UnlockExperimentalVMOptions \ > -javaoption:-XX:+UseRVV \ > -javaoption:-XX:+PrintOptoAssembly \ > -javaoption:-XX:LogFile=/home/zifeihan/jdk-rvv/VectorReshapeTests_PrintOptoAssembly_20230426.log \ > -jdk:/home/zifeihan/jdk-rvv/build/linux-riscv64-server-fastdebug/jdk \ > -compilejdk:/home/zifeihan/jdk-rvv/build/linux-x86_64-server-release/images/jdk \ > /home/zifeihan/jdk/test/jdk/jdk/incubator/vector/VectorReshapeTests.java > > > #### VectorCast/VectorCastB2X/VectorCastD2X/VectorCastF2X/VectorCastI2X/VectorCastL2X/VectorCastS2X > There are too many nodes here, and the following shows the log of `VectorCastB2X` nodes: > > ``` > 1ba0 ld R28, [R23, #280] # ptr, #@loadP > 1ba4 addi R29, R7, #32 # ptr, #@addP_reg_imm > 1ba8 reinterpretResize V1, V5 > 1bb0 vcvtBtoX V4, V1 > 1bb8 far_bgeu R29, R28, B465 #@far_cmpP_branch P=0.000100 C=-1.000000 > ``` > > #### VectorRearrange/VectorReinterpret > > When the original vector is transformed to the target vector, if the actual number of elements of the original vector is larger than the number of elements of the target vector, a slice action is performed to provide data for the subsequent cast nodes. the slice action depends on the `VectorRearrange` and `VectorReinterpret` nodes. 
> > The compilation log for the `VectorRearrange` node: > > ``` > 1f6 spill R7 -> [sp, #320] # spill size = 64 > 1f8 spill [sp, #128] -> V1 # vector spill size = 256 > 200 spill [sp, #160] -> V2 # vector spill size = 256 > 208 rearrange V3, V1, V2 > 210 spill V3 -> [sp, #96] # vector spill size = 256 > 218 li R11, #4 # int, #@loadConI > ``` > > The compilation log for the `VectorReinterpret` node: > > > 1218 spill [sp, #32] -> V4 # vector spill size = 256 > 1220 spill [sp, #176] -> V3 # vector spill size = 256 > 1228 rearrange V2, V4, V3 > 1230 spill [sp, #72] -> V0 # vmask spill size = 32 > 123c vmerge_vvm V1, V1, V2, v0 #@vector blend > 1244 reinterpretResize V2, V1 > 124c vcvtStoX_extend V5, V2 > 1254 bgeu R28, R7, B169 #@cmpP_branch P=0.000100 C=-1.000000 > > > #### LShiftCntV/RShiftCntV/MaskAll > > We have merged `LShiftCntV`, `RShiftCntV` nodes and support boolean types > > The compilation log for the LShiftCntV/RShiftCntV node: > > > 24c vasrB V3, V1, V2 > 260 storeV [R19], V3 # vector (rvv) > 268 lbu R19, [R29, #48] # byte, #@loadUB > 26c andi R19, R19, #7 #@andI_reg_imm > 270 loadV V1, [R25] # vector (rvv) > 278 vshiftcnt V2, R19 > 280 vasrB V3, V1, V2 > 294 storeV [R26], V3 # vector (rvv) > 29c lbu R19, [R29, #80] # byte, #@loadUB > 2a0 andi R19, R19, #7 #@andI_reg_imm > 2a4 loadV V1, [R22] # vector (rvv) > 2ac vshiftcnt V2, R19 > > > By the way, the mask version of MaskAll is supported. 
> > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc > [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/VectorReshapeTests.java > Testing: > qemu with UseRVV: > > - [ ] Tier1 tests (release) > - [ ] Tier2 tests (release) > - [ ] Tier3 tests (release) > - [x] test/jdk/jdk/incubator/vector (fastdebug) Gui Cao has updated the pull request incrementally with one additional commit since the last revision: Use zr register instead of x0 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13684/files - new: https://git.openjdk.org/jdk/pull/13684/files/f77bdd8f..b2216ec9 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13684&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13684&range=00-01 Stats: 5 lines in 1 file changed: 0 ins; 0 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/13684.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13684/head:pull/13684 PR: https://git.openjdk.org/jdk/pull/13684 From epeter at openjdk.org Thu Apr 27 09:14:53 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 27 Apr 2023 09:14:53 GMT Subject: RFR: 8306042: C2: failed: Missed optimization opportunity in PhaseCCP (adding LShift->Cast->Add notification) [v3] In-Reply-To: <5LdntwU5zlwXPnwYeJxzNPZTwrOuki6VebrE9Leeb8g=.3dc26d60-2729-4d60-9a5c-14cbb57f2813@github.com> References: <5LdntwU5zlwXPnwYeJxzNPZTwrOuki6VebrE9Leeb8g=.3dc26d60-2729-4d60-9a5c-14cbb57f2813@github.com> Message-ID: > An other case of `uncast` not being type-propagated through. > > We have a case like this: > `Phi -> ShiftL -> CastII -> AndI` > > The Phi has an updated type, so we should re-run Value on the AndI. 
> > In PhaseCCP::push_and, we do update a similar pattern: > `X -> ShiftL -> AndI` > > I extended it to handle this pattern: > `parent -> LShift (use) -> ConstraintCast* -> And` > > For this, I implemented: > https://github.com/openjdk/jdk/blob/26f4adaae901822bea984b926c06d1a78f9c6b48/src/hotspot/share/opto/castnode.hpp#L73-L78 > > I could refactor code from a previous similar fix, for pattern: `ConstraintCast+ -> Sub/Phi` > > **Discussion** > > https://github.com/openjdk/jdk/blob/4d350f8f4eaabb18482c7656cb56a734e60187cf/src/hotspot/share/opto/castnode.hpp#L78-L79 > I would have liked to place a `ResourceMark` between these two lines, to ensure the `internals` data structure is de-allocated after the traversal. But if I add it there, then one cannot modify any outer data-structure, or else one risks re-allocation of the outer data-structure in the inner ResourceMark, and then this memory gets de-allocated once the ResourceMark is cleared, and the outer data-structure is broken. This would for example mean that I could not push to the IGVN worklist inside the callback. > > Not having the ResourceMark means a memory leak, until the compile phase is over. But my code is not the only place, there are lots of places where we create a Resource allocated data-structure, but do not use ResourceMarks. Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: re-introduced ResourceMark. 
Made CCP worklist allocate from comp_arena()

-------------

Changes:
 - all: https://git.openjdk.org/jdk/pull/13611/files
 - new: https://git.openjdk.org/jdk/pull/13611/files/69c265e2..f4df73bd

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=13611&range=02
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13611&range=01-02

Stats: 2 lines in 2 files changed: 1 ins; 0 del; 1 mod
Patch: https://git.openjdk.org/jdk/pull/13611.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/13611/head:pull/13611

PR: https://git.openjdk.org/jdk/pull/13611

From rcastanedalo at openjdk.org Thu Apr 27 09:50:53 2023
From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano)
Date: Thu, 27 Apr 2023 09:50:53 GMT
Subject: Integrated: 8287087: C2: perform SLP reduction analysis on-demand
In-Reply-To:
References:
Message-ID:

On Tue, 21 Mar 2023 14:49:26 GMT, Roberto Castañeda Lozano wrote:

> Reduction analysis finds cycles of reduction operations within loops. The result of this analysis is used by SLP auto-vectorization (to vectorize reductions if deemed profitable) and by x64 instruction matching (to select specialized scalar floating-point `Math.min()/max()` implementations). Currently, reduction analysis is applied early (before loop unrolling), and the result is propagated through loop unrolling by marking nodes and loops with special reduction flags. Applying reduction analysis early is efficient, but propagating the results correctly through loop unrolling and arbitrary graph transformations is challenging and often leads to inconsistent node-loop reduction flag states, some of which have led to actual miscompilations in the past (see [JDK-8261147](https://bugs.openjdk.org/browse/JDK-8261147) and [JDK-8279622](https://bugs.openjdk.org/browse/JDK-8279622)).
>
> This changeset postpones reduction analysis to the point where its results are actually used.
To do so, it generalizes the analysis to find reduction cycles on unrolled loops: > > ![reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226725587-b7d68509-3717-4bbe-8d54-f9a105853fda.png) > > The generalized analysis precludes the need to maintain and propagate node and loop reduction flags through arbitrary IR transformations, reducing the risk of miscompilations due to invalidation of the analysis results. The generalization is slightly more costly than the current analysis, but still negligible in micro- and general benchmarks. > > ## Performance Benefits > > As a side benefit, the proposed generalization is able to find more reductions, increasing the scope of auto-vectorization and the performance of x64 floating-point `Math.min()/max()` in multiple scenarios. > > ### Increased Auto-Vectorization Scope > > There are two main scenarios in which the proposed changeset enables further auto-vectorization: > > #### Reductions Using Global Accumulators > > > public class Foo { > int acc = 0; > (..) > void reduce(int[] array) { > for (int i = 0; i < array.length; i++) { > acc += array[i]; > } > } > } > > Initially, such reductions are wrapped by load and store nodes, which defeats the current reduction analysis. However, after unrolling and other optimizations are applied, the reduction becomes recognizable by the proposed analysis: > > ![global-reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226745351-33494e40-7c07-4a8b-8bf6-d3a96e84b1c2.png) > > #### Reductions of partially unrolled loops > > > (..) > for (int i = 0; i < array.length / 2; i++) { > acc += array[2*i]; > acc += array[2*i + 1]; > } > (..) > > > These reductions are manually unrolled from the beginning, so the current reduction analysis fails to find them, while the proposed analysis is able to detect them as if they were unrolled automatically. 
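For readers who want to try the two loop shapes quoted above, here is a minimal runnable sketch. The class name `ReductionShapes` and both method names are hypothetical and do not come from the changeset or its tests. The sketch only shows that both shapes compute the same sum; whether C2 actually vectorizes them depends on the JDK build and flags, and is not checked here.

```java
// Illustrative only: names are hypothetical, not taken from the JDK sources or tests.
final class ReductionShapes {
    // Global accumulator: a field rather than a local variable.
    int acc = 0;

    // Shape 1: reduction into a field (initially wrapped by load and store nodes).
    void reduce(int[] array) {
        for (int i = 0; i < array.length; i++) {
            acc += array[i];
        }
    }

    // Shape 2: reduction manually unrolled by two.
    // It covers every element only when array.length is even.
    void reducePartiallyUnrolled(int[] array) {
        for (int i = 0; i < array.length / 2; i++) {
            acc += array[2 * i];
            acc += array[2 * i + 1];
        }
    }

    public static void main(String[] args) {
        int[] data = new int[1024];
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }
        ReductionShapes plain = new ReductionShapes();
        plain.reduce(data);
        ReductionShapes unrolled = new ReductionShapes();
        unrolled.reducePartiallyUnrolled(data);
        // Both accumulators hold the same sum for an even-length array.
        System.out.println(plain.acc == unrolled.acc);
    }
}
```

To look at the generated code, one could run such a class under a fastdebug build with `-XX:+PrintOptoAssembly`, the flag used elsewhere in this archive; the exact output is build-dependent.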
>
> ### Increased Performance of x64 Floating-Point `Math.min()/max()`
>
> Besides the above scenarios, the proposed generalization allows the x64 matcher to select specialized floating-point `Math.min()/max()` implementations for reductions in non-counted and outer loops (see the new micro-benchmarks in `FpMinMaxIntrinsics.java` for more details).
>
> ## Implementation details
>
> The generalized reduction analysis finds reductions in a loop by looking for chains of reduction operators of the same node type starting and finishing on each phi node in the loop. To avoid a combinatorial explosion, the analysis assumes that all nodes in a chain are connected via the same edge index, which is realistic because chains usually consist of identical nodes cloned by loop unrolling. This assumption allows the analysis to test only two paths for each examined phi node. A failure of this assumption (e.g. as illustrated in test case `testReductionOnPartiallyUnrolledLoopWithSwappedInputs` from `TestGeneralizedReductions.java`) results in missing vectorization but does not affect correctness. Note that the same-index assumption can only fail in cases where current auto-vectorization would also fail to vectorize (manually unrolled loops).
>
> The changeset implements a more relaxed version of the reduction analysis for x64 matching, suitable for queries on single nodes. This analysis is run only in the presence of `[Min|Max][F|D]` nodes.
>
> ## Alternative approaches
>
> A complication results from edge swapping in the nodes cloned by loop unrolling (see [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/addnode.cpp#L123) and [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/mulnode.cpp#L113)), which can lead to reduction chains connected via different input indices.
This is addressed by tracking whether nodes have swapped edges and adjusting the explored input indices in the reduction analysis accordingly. An alternative (proposed by @eme64 and @jatin-bhateja ) is to replace this changeset's linear chain finding approach with some form of general path-finding algorithm. This alternative would preclude the need for tracking edge swapping at a potentially higher computational cost. The following table summarizes the pros and cons of the current mainline approach, this changeset, and the proposed alternative: > > | approach | correctness | efficiency | effectiveness | conceptual complexity | > | -------- | ----------- | ---------- | ------------- | --------------------- | > | mainline (current) | hard to establish due to need of maintaining reduction flags through arbitrary graph transformations (has led to miscompilations, see JDK-8261147 and JDK-8279622) | high | low (misses substantial reduction vectorization opportunities) | high (requires maintaining non-local reduction node state) | > | this changeset | easy to establish since client transformations operate on the same graph that is analyzed | medium (limited search for chains of nodes) | high (finds all reduction cycles except for partially unrolled loops with manually-swapped inputs) | medium (requires maintaining local swapped-edge node state) | > | general search | easy to establish (same as above) | low (general search), particularly for x64 matching where the analysis runs once for every node in a chain | high (similar to above but also covering manually-swapped inputs) | low (no node state required, use of well-known graph search algorithms) | > > Since the efficiency-conceptual complexity trade-off between this changeset and the general search approach is not obvious, I propose to integrate this changeset (which strikes a balance between the two) and investigate the latter one in a follow-up RFE. 
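The same-index chain search described under "Implementation details" can be illustrated with a toy model. This is a sketch under stated assumptions, not the HotSpot code: the `Node` class, the `MAX_CHAIN_LENGTH` bound, and the method names are all hypothetical, and the sketch deliberately omits the swapped-edge tracking that the changeset adds. It walks from a phi's backedge input back to the phi, following one fixed input index, and tries only the two candidate indices.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: these classes are hypothetical and are not HotSpot's Node/PhiNode.
final class ReductionChainSketch {
    static final class Node {
        final String opcode;
        final Node[] in; // in[0] is left unused, loosely mirroring a control input

        Node(String opcode, int inputCount) {
            this.opcode = opcode;
            this.in = new Node[inputCount];
        }
    }

    // Hypothetical search bound; the value is an assumption made for this sketch.
    static final int MAX_CHAIN_LENGTH = 16;

    // Walks from the phi's backedge value towards the phi, always following input `index`.
    // Returns the chain if it closes on the phi using a single opcode, else an empty list.
    static List<Node> findChain(Node phi, int backedge, int index) {
        List<Node> chain = new ArrayList<>();
        Node current = phi.in[backedge];
        String opcode = (current == null) ? null : current.opcode;
        while (current != null && current != phi) {
            boolean tooLong = chain.size() >= MAX_CHAIN_LENGTH;
            boolean otherOpcode = !current.opcode.equals(opcode);
            boolean noSuchInput = index >= current.in.length;
            if (tooLong || otherOpcode || noSuchInput) {
                return List.of();
            }
            chain.add(current);
            current = current.in[index];
        }
        return (current == phi) ? chain : List.of();
    }

    // Tries the two candidate input indices (1 and 2) for one phi.
    static List<Node> findReduction(Node phi, int backedge) {
        for (int index = 1; index <= 2; index++) {
            List<Node> chain = findChain(phi, backedge, index);
            if (!chain.isEmpty()) {
                return chain;
            }
        }
        return List.of();
    }

    // Builds phi -> add1 -> add2 -> phi; `swapFirst` puts the phi on input 2 of add1.
    static Node buildLoop(boolean swapFirst) {
        Node phi = new Node("Phi", 3);
        Node load = new Node("LoadI", 1);
        Node add1 = new Node("AddI", 3);
        Node add2 = new Node("AddI", 3);
        add1.in[swapFirst ? 2 : 1] = phi;
        add1.in[swapFirst ? 1 : 2] = load;
        add2.in[1] = add1;
        add2.in[2] = load;
        phi.in[1] = new Node("ConI", 1);
        phi.in[2] = add2;
        return phi;
    }

    public static void main(String[] args) {
        System.out.println(findReduction(buildLoop(false), 2).size());
        System.out.println(findReduction(buildLoop(true), 2).size());
    }
}
```

In this toy graph the chain with consistent input indices is found (two `AddI` nodes), while the variant with swapped inputs on the first add is missed. That mirrors the trade-off discussed above: a missed chain costs a vectorization opportunity but not correctness, and a general path search would find it at a higher cost.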
> > ## Testing > > ### Functionality > > - tier1-5 (linux-x64, linux-aarch64, windows-x64, macosx-x64, and macosx-aarch64). > - fuzzing (12 h. on linux-x64 and linux-aarch64). > > ##### TestGeneralizedReductions.java > > Tests the new scenarios in which vectorization occurs. These tests are restricted to 64-bits platforms, since I do not have access to 32-bits ones. `testReductionOnPartiallyUnrolledLoop` has been observed to fail on [linux-x86](https://github.com/robcasloz/jdk/actions/runs/4478959520/jobs/7873827856#logs) due to missing vectorization. If anyone wants to have a look and derive the necessary IR test framework preconditions for the test to pass on linux-x86, I am happy to lift the 64-bits restriction. > > ##### TestFpMinMaxReductions.java > > Tests the matching of floating-point max/min implementations in x64. > > ##### TestSuperwordFailsUnrolling.java > > This test file is updated to ensure auto-vectorization is never triggered, because this changeset would otherwise enable it and defeat the purpose of the test. > > ### Performance > > #### General Benchmarks > > The changeset does not cause any performance regression on the DaCapo, SPECjvm 2008, and SPECjbb2015 benchmark suites for linux-x64 and linux-aarch64. > > #### Micro-benchmarks > > The changeset extends two existing files with additional micro-benchmarks that show the benefit of the generalized reduction analysis ([full results](https://github.com/openjdk/jdk/files/11039207/microbenchmark-results.ods)). > > > ##### VectorReduction.java > > These micro-benchmarks are first adjusted to actually vectorize in the mainline approach, since they suffered from the global-accumulator limitation. Two micro-benchmarks are added to exercise vectorization in the presence of global accumulators and partially unrolled loops. 
Running `VectorReduction.java` on an x64 (Cascade Lake) machine confirms the expectations: compared to mainline (with the adjustment mentioned above), this changeset yields similar performance results except for `andRedIOnGlobalAccumulator` and `andRedIPartiallyUnrolled`, where the changeset improves performance by 2.4x in both cases. > > ##### MaxIntrinsics.java > > This file is extended with four new micro-benchmarks. Running it on the same machine as above shows that the changeset does not affect the performance of the existing micro-benchmarks, and improves moderately to substantially the performance of the new ones (because it allows the x64 matcher to select a floating-point `Math.min()` implementation that is specialized for reduction min operations): > > | micro-benchmark | speedup compared to mainline | > | --- | --- | > | `fMinReduceInOuterLoop` | 1.1x | > | `fMinReduceNonCounted` | 2.3x | > | `fMinReduceGlobalAccumulator` | 2.4x | > | `fMinReducePartiallyUnrolled` | 3.9x | > > ## Acknowledgments > > Thanks to @danielogh for making it possible to test this improvement with confidence ([JDK-8294715](https://bugs.openjdk.org/browse/JDK-8294715)) and to @TobiHartmann, @chhagedorn, @vnkozlov and @eme64 for discussions and useful feedback. This pull request has now been integrated. 
Changeset: 1be80a44
Author: Roberto Castañeda Lozano
URL: https://git.openjdk.org/jdk/commit/1be80a4445cf74adc9b2cd5bf262a897f9ede74f
Stats: 821 lines in 17 files changed: 654 ins; 106 del; 61 mod

8287087: C2: perform SLP reduction analysis on-demand

Reviewed-by: epeter, jbhateja, thartmann

-------------

PR: https://git.openjdk.org/jdk/pull/13120

From rcastanedalo at openjdk.org Thu Apr 27 10:25:54 2023
From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano)
Date: Thu, 27 Apr 2023 10:25:54 GMT
Subject: RFR: 8287087: C2: perform SLP reduction analysis on-demand [v4]
In-Reply-To: <-5XPy4I-SVMVvxwGusambZ0QlT_UjpwB_pmq5IWpWlk=.dba6fe41-16f4-4ed4-a4df-62094fc33a86@github.com>
References: <-5XPy4I-SVMVvxwGusambZ0QlT_UjpwB_pmq5IWpWlk=.dba6fe41-16f4-4ed4-a4df-62094fc33a86@github.com>
Message-ID:

On Mon, 24 Apr 2023 15:06:12 GMT, Roberto Castañeda Lozano wrote:

>> Reduction analysis finds cycles of reduction operations within loops. The result of this analysis is used by SLP auto-vectorization (to vectorize reductions if deemed profitable) and by x64 instruction matching (to select specialized scalar floating-point `Math.min()/max()` implementations). Currently, reduction analysis is applied early (before loop unrolling), and the result is propagated through loop unrolling by marking nodes and loops with special reduction flags. Applying reduction analysis early is efficient, but propagating the results correctly through loop unrolling and arbitrary graph transformations is challenging and often leads to inconsistent node-loop reduction flag states, some of which have led to actual miscompilations in the past (see [JDK-8261147](https://bugs.openjdk.org/browse/JDK-8261147) and [JDK-8279622](https://bugs.openjdk.org/browse/JDK-8279622)).
>>
>> This changeset postpones reduction analysis to the point where its results are actually used.
To do so, it generalizes the analysis to find reduction cycles on unrolled loops: >> >> ![reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226725587-b7d68509-3717-4bbe-8d54-f9a105853fda.png) >> >> The generalized analysis precludes the need to maintain and propagate node and loop reduction flags through arbitrary IR transformations, reducing the risk of miscompilations due to invalidation of the analysis results. The generalization is slightly more costly than the current analysis, but still negligible in micro- and general benchmarks. >> >> ## Performance Benefits >> >> As a side benefit, the proposed generalization is able to find more reductions, increasing the scope of auto-vectorization and the performance of x64 floating-point `Math.min()/max()` in multiple scenarios. >> >> ### Increased Auto-Vectorization Scope >> >> There are two main scenarios in which the proposed changeset enables further auto-vectorization: >> >> #### Reductions Using Global Accumulators >> >> >> public class Foo { >> int acc = 0; >> (..) >> void reduce(int[] array) { >> for (int i = 0; i < array.length; i++) { >> acc += array[i]; >> } >> } >> } >> >> Initially, such reductions are wrapped by load and store nodes, which defeats the current reduction analysis. However, after unrolling and other optimizations are applied, the reduction becomes recognizable by the proposed analysis: >> >> ![global-reduction-before-after-unrolling](https://user-images.githubusercontent.com/8792647/226745351-33494e40-7c07-4a8b-8bf6-d3a96e84b1c2.png) >> >> #### Reductions of partially unrolled loops >> >> >> (..) >> for (int i = 0; i < array.length / 2; i++) { >> acc += array[2*i]; >> acc += array[2*i + 1]; >> } >> (..) >> >> >> These reductions are manually unrolled from the beginning, so the current reduction analysis fails to find them, while the proposed analysis is able to detect them as if they were unrolled automatically. 
>>
>> ### Increased Performance of x64 Floating-Point `Math.min()/max()`
>>
>> Besides the above scenarios, the proposed generalization allows the x64 matcher to select specialized floating-point `Math.min()/max()` implementations for reductions in non-counted and outer loops (see the new micro-benchmarks in `FpMinMaxIntrinsics.java` for more details).
>>
>> ## Implementation details
>>
>> The generalized reduction analysis finds reductions in a loop by looking for chains of reduction operators of the same node type starting and finishing on each phi node in the loop. To avoid a combinatorial explosion, the analysis assumes that all nodes in a chain are connected via the same edge index, which is realistic because chains usually consist of identical nodes cloned by loop unrolling. This assumption allows the analysis to test only two paths for each examined phi node. A failure of this assumption (e.g. as illustrated in test case `testReductionOnPartiallyUnrolledLoopWithSwappedInputs` from `TestGeneralizedReductions.java`) results in missing vectorization but does not affect correctness. Note that the same-index assumption can only fail in cases where current auto-vectorization would also fail to vectorize (manually unrolled loops).
>>
>> The changeset implements a more relaxed version of the reduction analysis for x64 matching, suitable for queries on single nodes. This analysis is run only in the presence of `[Min|Max][F|D]` nodes.
>>
>> ## Alternative approaches
>>
>> A complication results from edge swapping in the nodes cloned by loop unrolling (see [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/addnode.cpp#L123) and [here](https://github.com/openjdk/jdk/blob/bbde2158d1d11be909292d0c8625211e6cf5359e/src/hotspot/share/opto/mulnode.cpp#L113)), which can lead to reduction chains connected via different input indices.
This is addressed by tracking whether nodes have swapped edges and adjusting the explored input indices in the reduction analysis accordingly. An alternative (proposed by @eme64 and @jatin-bhateja ) is to replace this changeset's linear chain finding approach with some form of general path-finding algorithm. This alternative would preclude the need for tracking edge swapping at a potentially higher computational cost. The following table summarizes the pros and cons of the current mainline approach, this changeset, and the proposed alternative: >> >> | approach | correctness | efficiency | effectiveness | conceptual complexity | >> | -------- | ----------- | ---------- | ------------- | --------------------- | >> | mainline (current) | hard to establish due to need of maintaining reduction flags through arbitrary graph transformations (has led to miscompilations, see JDK-8261147 and JDK-8279622) | high | low (misses substantial reduction vectorization opportunities) | high (requires maintaining non-local reduction node state) | >> | this changeset | easy to establish since client transformations operate on the same graph that is analyzed | medium (limited search for chains of nodes) | high (finds all reduction cycles except for partially unrolled loops with manually-swapped inputs) | medium (requires maintaining local swapped-edge node state) | >> | general search | easy to establish (same as above) | low (general search), particularly for x64 matching where the analysis runs once for every node in a chain | high (similar to above but also covering manually-swapped inputs) | low (no node state required, use of well-known graph search algorithms) | >> >> Since the efficiency-conceptual complexity trade-off between this changeset and the general search approach is not obvious, I propose to integrate this changeset (which strikes a balance between the two) and investigate the latter one in a follow-up RFE. 
>> >> ## Testing >> >> ### Functionality >> >> - tier1-5 (linux-x64, linux-aarch64, windows-x64, macosx-x64, and macosx-aarch64). >> - fuzzing (12 h. on linux-x64 and linux-aarch64). >> >> ##### TestGeneralizedReductions.java >> >> Tests the new scenarios in which vectorization occurs. These tests are restricted to 64-bits platforms, since I do not have access to 32-bits ones. `testReductionOnPartiallyUnrolledLoop` has been observed to fail on [linux-x86](https://github.com/robcasloz/jdk/actions/runs/4478959520/jobs/7873827856#logs) due to missing vectorization. If anyone wants to have a look and derive the necessary IR test framework preconditions for the test to pass on linux-x86, I am happy to lift the 64-bits restriction. >> >> ##### TestFpMinMaxReductions.java >> >> Tests the matching of floating-point max/min implementations in x64. >> >> ##### TestSuperwordFailsUnrolling.java >> >> This test file is updated to ensure auto-vectorization is never triggered, because this changeset would otherwise enable it and defeat the purpose of the test. >> >> ### Performance >> >> #### General Benchmarks >> >> The changeset does not cause any performance regression on the DaCapo, SPECjvm 2008, and SPECjbb2015 benchmark suites for linux-x64 and linux-aarch64. >> >> #### Micro-benchmarks >> >> The changeset extends two existing files with additional micro-benchmarks that show the benefit of the generalized reduction analysis ([full results](https://github.com/openjdk/jdk/files/11039207/microbenchmark-results.ods)). >> >> >> ##### VectorReduction.java >> >> These micro-benchmarks are first adjusted to actually vectorize in the mainline approach, since they suffered from the global-accumulator limitation. Two micro-benchmarks are added to exercise vectorization in the presence of global accumulators and partially unrolled loops. 
Running `VectorReduction.java` on an x64 (Cascade Lake) machine confirms the expectations: compared to mainline (with the adjustment mentioned above), this changeset yields similar performance results except for `andRedIOnGlobalAccumulator` and `andRedIPartiallyUnrolled`, where the changeset improves performance by 2.4x in both cases.
>>
>> ##### MaxIntrinsics.java
>>
>> This file is extended with four new micro-benchmarks. Running it on the same machine as above shows that the changeset does not affect the performance of the existing micro-benchmarks, and improves moderately to substantially the performance of the new ones (because it allows the x64 matcher to select a floating-point `Math.min()` implementation that is specialized for reduction min operations):
>>
>> | micro-benchmark | speedup compared to mainline |
>> | --- | --- |
>> | `fMinReduceInOuterLoop` | 1.1x |
>> | `fMinReduceNonCounted` | 2.3x |
>> | `fMinReduceGlobalAccumulator` | 2.4x |
>> | `fMinReducePartiallyUnrolled` | 3.9x |
>>
>> ## Acknowledgments
>>
>> Thanks to @danielogh for making it possible to test this improvement with confidence ([JDK-8294715](https://bugs.openjdk.org/browse/JDK-8294715)) and to @TobiHartmann, @chhagedorn, @vnkozlov and @eme64 for discussions and useful feedback.
>
> Roberto Castañeda Lozano has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 33 commits:
>
> - Merge master
> - Fix node naming in reduction chain traversal
> - Use is_marked_reduction() in new SLP code
> - Merge master
> - Emit Node::Flag_has_swapped_edges in IGV graphs
> - Merge master
> - Relax the reduction cycle search bound
> - Remove redundant IR check precondition
> - Use SuperWord members in reduction marking
> - Remove redundant opcode checks
> - ...
and 23 more: https://git.openjdk.org/jdk/compare/7400aff3...1510accd

> Since the efficiency-conceptual complexity trade-off between this changeset and the general search approach is not obvious, I propose to integrate this changeset (which strikes a balance between the two) and investigate the latter one in a follow-up RFE.

Filed now: [JDK-8306989](https://bugs.openjdk.org/browse/JDK-8306989).

-------------

PR Comment: https://git.openjdk.org/jdk/pull/13120#issuecomment-1525369938

From rcastanedalo at openjdk.org Thu Apr 27 10:44:27 2023
From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano)
Date: Thu, 27 Apr 2023 10:44:27 GMT
Subject: RFR: 8305770: os::Linux::available_memory() should refer MemAvailable in /proc/meminfo
In-Reply-To:
References:
Message-ID:

On Wed, 26 Apr 2023 14:17:18 GMT, Yasumasa Suenaga wrote:

> Ok, so how do we implement that? It is better to add that function to os class like os::free_memory()

This option sounds cleaner to me than adding OS-specific code to compileBroker.cpp.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/13398#issuecomment-1525415261

From chagedorn at openjdk.org Thu Apr 27 10:45:23 2023
From: chagedorn at openjdk.org (Christian Hagedorn)
Date: Thu, 27 Apr 2023 10:45:23 GMT
Subject: RFR: 8306933: C2: "assert(false) failed: infinite loop" failure
In-Reply-To:
References:
Message-ID:

On Wed, 26 Apr 2023 15:36:13 GMT, Roland Westrelin wrote:

> The assert fires because an infinite loop appears in the graph after
> loop opts are over.
>
> After loop opts, the `for(;;)` loop contains a null check and a range
> check for `array[i]`. So it's not considered an infinite loop (it has
> exits to uncommon traps). The null check and range check are redundant
> with the one right before the loop: `int v = array2[k];` IGVN can
> optimize it but it doesn't happen until after loop opts when a
> `ConvI2L` for the `array[i]` access is processed as part of post loop
> opts IGVN.
The `for(;;)` loop is then emptied and only contains a > `Loop` and a `Safepoint` nodes. > > I propose removing the assert (at least for now) as I don't see a way > to guarantee no infinite loop can appear after loop opts. That looks reasonable! test/hotspot/jtreg/compiler/c2/TestInfiniteLoopCompilationFailure.java line 31: > 29: * -XX:+StressIGVN -XX:StressSeed=675320863 TestInfiniteLoopCompilationFailure > 30: * @run main/othervm -Xcomp -XX:CompileOnly=TestInfiniteLoopCompilationFailure::test -XX:-UseLoopPredicate -XX:-UseProfiledLoopPredicate > 31: * -XX:+StressIGVN TestInfiniteLoopCompilationFailure Since `StressIGVN` is diagnostic, you need to add `-XX:+UnlockDiagnosticVMOptions` here. ------------- Changes requested by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13672#pullrequestreview-1403716416 PR Review Comment: https://git.openjdk.org/jdk/pull/13672#discussion_r1178957313 From duke at openjdk.org Thu Apr 27 11:30:52 2023 From: duke at openjdk.org (Afshin Zafari) Date: Thu, 27 Apr 2023 11:30:52 GMT Subject: RFR: 8305079: Remove finalize() from compiler/c2/Test719030 In-Reply-To: References: Message-ID: On Tue, 11 Apr 2023 07:33:16 GMT, Afshin Zafari wrote: > The `finalize()` method is replaced by a Cleaner callback. Thank you for your reviews. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13418#issuecomment-1525517000 From roland at openjdk.org Thu Apr 27 11:44:54 2023 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 27 Apr 2023 11:44:54 GMT Subject: RFR: 8306997: C2: "malformed control flow" assert due to missing safepoint on backedge with a switch Message-ID: The assert fires because a self loop (a `Loop` whose second input is itself) is removed by loop opts. That loop comes from a switch where the default case is a loop head (a code shape I couldn't get javac to produce). That `Loop` should at the very least have a `Safepoint` but the logic at parse time only looks for backedges in the non default cases. 
With that fixed, the `Loop` is no longer considered dead code. ------------- Commit messages: - test & fix Changes: https://git.openjdk.org/jdk/pull/13688/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13688&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8306997 Stats: 106 lines in 3 files changed: 104 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/13688.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13688/head:pull/13688 PR: https://git.openjdk.org/jdk/pull/13688 From gcao at openjdk.org Thu Apr 27 14:03:58 2023 From: gcao at openjdk.org (Gui Cao) Date: Thu, 27 Apr 2023 14:03:58 GMT Subject: RFR: 8306966: RISC-V: Support vector cast node for Vector API [v3] In-Reply-To: References: Message-ID: > Hi, > > we have added some implementations related to vector cast, It was implemented by referring to RVV v1.0 [1]. please take a look and have some reviews. Thanks a lot. > > We can use the VectorReshapeTests.java[2] to print the compilation log, verify and observe the generation of nodes. 
> > For example, we can use the following command to print the compilation log of a jtreg test case: > > > /home/zifeihan/jdk-tools/jtreg/bin/jtreg \ > -v:default \ > -concurrency:16 -timeout:50 \ > -javaoption:-XX:+UnlockExperimentalVMOptions \ > -javaoption:-XX:+UseRVV \ > -javaoption:-XX:+PrintOptoAssembly \ > -javaoption:-XX:LogFile=/home/zifeihan/jdk-rvv/VectorReshapeTests_PrintOptoAssembly_20230426.log \ > -jdk:/home/zifeihan/jdk-rvv/build/linux-riscv64-server-fastdebug/jdk \ > -compilejdk:/home/zifeihan/jdk-rvv/build/linux-x86_64-server-release/images/jdk \ > /home/zifeihan/jdk/test/jdk/jdk/incubator/vector/VectorReshapeTests.java > > > #### VectorCast/VectorCastB2X/VectorCastD2X/VectorCastF2X/VectorCastI2X/VectorCastL2X/VectorCastS2X > There are too many nodes here, and the following shows the log of `VectorCastB2X` nodes: > > ``` > 1ba0 ld R28, [R23, #280] # ptr, #@loadP > 1ba4 addi R29, R7, #32 # ptr, #@addP_reg_imm > 1ba8 reinterpretResize V1, V5 > 1bb0 vcvtBtoX V4, V1 > 1bb8 far_bgeu R29, R28, B465 #@far_cmpP_branch P=0.000100 C=-1.000000 > ``` > > #### VectorRearrange/VectorReinterpret > > When the original vector is transformed to the target vector, if the actual number of elements of the original vector is larger than the number of elements of the target vector, a slice action is performed to provide data for the subsequent cast nodes. the slice action depends on the `VectorRearrange` and `VectorReinterpret` nodes. 
> > The compilation log for the `VectorRearrange` node: > > ``` > 1f6 spill R7 -> [sp, #320] # spill size = 64 > 1f8 spill [sp, #128] -> V1 # vector spill size = 256 > 200 spill [sp, #160] -> V2 # vector spill size = 256 > 208 rearrange V3, V1, V2 > 210 spill V3 -> [sp, #96] # vector spill size = 256 > 218 li R11, #4 # int, #@loadConI > ``` > > The compilation log for the `VectorReinterpret` node: > > > 1218 spill [sp, #32] -> V4 # vector spill size = 256 > 1220 spill [sp, #176] -> V3 # vector spill size = 256 > 1228 rearrange V2, V4, V3 > 1230 spill [sp, #72] -> V0 # vmask spill size = 32 > 123c vmerge_vvm V1, V1, V2, v0 #@vector blend > 1244 reinterpretResize V2, V1 > 124c vcvtStoX_extend V5, V2 > 1254 bgeu R28, R7, B169 #@cmpP_branch P=0.000100 C=-1.000000 > > > #### LShiftCntV/RShiftCntV/MaskAll > > We have merged `LShiftCntV`, `RShiftCntV` nodes and support boolean types > > The compilation log for the LShiftCntV/RShiftCntV node: > > > 24c vasrB V3, V1, V2 > 260 storeV [R19], V3 # vector (rvv) > 268 lbu R19, [R29, #48] # byte, #@loadUB > 26c andi R19, R19, #7 #@andI_reg_imm > 270 loadV V1, [R25] # vector (rvv) > 278 vshiftcnt V2, R19 > 280 vasrB V3, V1, V2 > 294 storeV [R26], V3 # vector (rvv) > 29c lbu R19, [R29, #80] # byte, #@loadUB > 2a0 andi R19, R19, #7 #@andI_reg_imm > 2a4 loadV V1, [R22] # vector (rvv) > 2ac vshiftcnt V2, R19 > > > By the way, the mask version of MaskAll is supported. 
> > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc > [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/VectorReshapeTests.java > Testing: > qemu with UseRVV: > > - [ ] Tier1 tests (release) > - [ ] Tier2 tests (release) > - [ ] Tier3 tests (release) > - [x] test/jdk/jdk/incubator/vector (fastdebug) Gui Cao has updated the pull request incrementally with one additional commit since the last revision: During the conversion, specify the number of vectors ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13684/files - new: https://git.openjdk.org/jdk/pull/13684/files/b2216ec9..94efa172 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13684&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13684&range=01-02 Stats: 24 lines in 3 files changed: 0 ins; 1 del; 23 mod Patch: https://git.openjdk.org/jdk/pull/13684.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13684/head:pull/13684 PR: https://git.openjdk.org/jdk/pull/13684 From cslucas at openjdk.org Thu Apr 27 20:52:23 2023 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Thu, 27 Apr 2023 20:52:23 GMT Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v10] In-Reply-To: References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> Message-ID: On Sat, 22 Apr 2023 01:42:41 GMT, Vladimir Ivanov wrote: >> Cesar Soares Lucas has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 10 commits: >> >> - Catching up with master >> >> Merge remote-tracking branch 'origin/master' into rematerialization-of-merges >> - Fix tests. Remember previous reducible Phis. >> - Address PR review 3. Some comments and be able to abort compilation. >> - Merge with Master >> - Addressing PR review 2: refactor & reuse MacroExpand::scalar_replacement method. 
>> - Address PR feeedback 1: make ObjectMergeValue subclass of ObjectValue & create new IR class to represent scalarized merges. >> - Add support for SR'ing some inputs of merges used for field loads >> - Fix some typos and do some small refactorings. >> - Merge master >> - Add support for rematerializing scalar replaced objects participating in allocation merges > > src/hotspot/share/code/debugInfo.hpp line 199: > >> 197: // ObjectValue describing an object that was scalar replaced. >> 198: >> 199: class ObjectMergeValue: public ObjectValue { > > I find the decision to subclass`ObjectValue` confusing and error prone: now `is_object()` returns true for `ObjectMergeValue`, but you have to apply the selector first to turn it into `ObjectValue`. And now the order of checks matter, so you always have to perform `is_object_merge()` first and then follow it with `is_object()` guard. > > You have 3 flavors of `ObjectValue` now: > * good old `ObjectValue`; > * `ObjectMergeValue` > * merge candidates (`ObjectMergeCandidateValue`?) > > Does it make sense to introduce 3 different subclasses under `ObjectValue` to clearly distinguish the scenarios? Hi @iwanowww . I finished implementing a version of this like the illustration below (I didn't add a Candidate class). ScopeValue ObjectValue ObjectAllocationValue AutoBoxObjectValue ObjectMergeValue Here are some observations: - I don't think ObjectMergeValue should be under ObjectValue. The two classes only have two fields in common (_id and _visited). I think it should be a subclass of ScopeValue. - ObjectCandidateValue would need to go under ObjectAllocationValue because it essentially _is_ an ObjectAllocationValue in most aspects. 
- I didn't add a ObjectCandidateValue class because that class would need to go under ObjectAllocationValue and we would still need to do an "is_object_candidate" before all "is_object_allocation" and we would end up in much the situation that we want to avoid - needing to do is_object_merge before is_object. - It seems the best place to flag an object as candidate is really in ObjectAllocationValue. What do you think? As I said, I already have the code, if you want I can push it and you take a look. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1179649780 From vlivanov at openjdk.org Thu Apr 27 23:50:55 2023 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Thu, 27 Apr 2023 23:50:55 GMT Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v10] In-Reply-To: References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> Message-ID: On Thu, 27 Apr 2023 20:33:38 GMT, Cesar Soares Lucas wrote: >> src/hotspot/share/code/debugInfo.hpp line 199: >> >>> 197: // ObjectValue describing an object that was scalar replaced. >>> 198: >>> 199: class ObjectMergeValue: public ObjectValue { >> >> I find the decision to subclass`ObjectValue` confusing and error prone: now `is_object()` returns true for `ObjectMergeValue`, but you have to apply the selector first to turn it into `ObjectValue`. And now the order of checks matter, so you always have to perform `is_object_merge()` first and then follow it with `is_object()` guard. >> >> You have 3 flavors of `ObjectValue` now: >> * good old `ObjectValue`; >> * `ObjectMergeValue` >> * merge candidates (`ObjectMergeCandidateValue`?) >> >> Does it make sense to introduce 3 different subclasses under `ObjectValue` to clearly distinguish the scenarios? > > Hi @iwanowww . I finished implementing a version of this like the illustration below (I didn't add a Candidate class). 
> > > ScopeValue > ObjectValue > ObjectAllocationValue > AutoBoxObjectValue > ObjectMergeValue > > > Here are some observations: > > - I don't think ObjectMergeValue should be under ObjectValue. The two classes only have two fields in common (_id and _visited). I think it should be a subclass of ScopeValue. > - ObjectCandidateValue would need to go under ObjectAllocationValue because it essentially _is_ an ObjectAllocationValue in most aspects. > - I didn't add a ObjectCandidateValue class because that class would need to go under ObjectAllocationValue and we would still need to do an "is_object_candidate" before all "is_object_allocation" and we would end up in much the situation that we want to avoid - needing to do is_object_merge before is_object. > - It seems the best place to flag an object as candidate is really in ObjectAllocationValue. > > What do you think? As I said, I already have the code, if you want I can push it and you take a look. Can `ObjectCandidateValue` be a wrapper around a `ObjectAllocationValue`? It does make sense to separate `ObjectMergeValue` and `ObjectValue`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1179798496 From vlivanov at openjdk.org Thu Apr 27 23:50:56 2023 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Thu, 27 Apr 2023 23:50:56 GMT Subject: RFR: JDK-8287061: Support for rematerializing scalar replaced objects participating in allocation merges [v10] In-Reply-To: References: <7nqFW-lgT1FzuMHPMUQiCj1ATcV_bQtroolf4V_kCc4=.ccd12605-aad0-433e-ba44-5772d972f05d@github.com> Message-ID: On Thu, 27 Apr 2023 23:35:02 GMT, Vladimir Ivanov wrote: >> Hi @iwanowww . I finished implementing a version of this like the illustration below (I didn't add a Candidate class). >> >> >> ScopeValue >> ObjectValue >> ObjectAllocationValue >> AutoBoxObjectValue >> ObjectMergeValue >> >> >> Here are some observations: >> >> - I don't think ObjectMergeValue should be under ObjectValue. 
The two classes only have two fields in common (_id and _visited). I think it should be a subclass of ScopeValue. >> - ObjectCandidateValue would need to go under ObjectAllocationValue because it essentially _is_ an ObjectAllocationValue in most aspects. >> - I didn't add a ObjectCandidateValue class because that class would need to go under ObjectAllocationValue and we would still need to do an "is_object_candidate" before all "is_object_allocation" and we would end up in much the situation that we want to avoid - needing to do is_object_merge before is_object. >> - It seems the best place to flag an object as candidate is really in ObjectAllocationValue. >> >> What do you think? As I said, I already have the code, if you want I can push it and you take a look. > > Can `ObjectCandidateValue` be a wrapper around a `ObjectAllocationValue`? > > It does make sense to separate `ObjectMergeValue` and `ObjectValue`. I need to to study the code in more details. Seems like I'm missing something important here. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/12897#discussion_r1179798907 From fyang at openjdk.org Fri Apr 28 02:44:53 2023 From: fyang at openjdk.org (Fei Yang) Date: Fri, 28 Apr 2023 02:44:53 GMT Subject: RFR: 8306966: RISC-V: Support vector cast node for Vector API [v3] In-Reply-To: References: Message-ID: <_lEk_hUS0IJOsKPmIpLsEuWAz-V7tvVpJkk_DXq4RIc=.a1b43de9-0e0c-4857-958f-e014ca56aab2@github.com> On Thu, 27 Apr 2023 14:03:58 GMT, Gui Cao wrote: >> Hi, >> >> we have added some implementations related to vector cast, It was implemented by referring to RVV v1.0 [1]. please take a look and have some reviews. Thanks a lot. >> >> We can use the VectorReshapeTests.java[2] to print the compilation log, verify and observe the generation of nodes. 
>> >> For example, we can use the following command to print the compilation log of a jtreg test case: >> >> >> /home/zifeihan/jdk-tools/jtreg/bin/jtreg \ >> -v:default \ >> -concurrency:16 -timeout:50 \ >> -javaoption:-XX:+UnlockExperimentalVMOptions \ >> -javaoption:-XX:+UseRVV \ >> -javaoption:-XX:+PrintOptoAssembly \ >> -javaoption:-XX:LogFile=/home/zifeihan/jdk-rvv/VectorReshapeTests_PrintOptoAssembly_20230426.log \ >> -jdk:/home/zifeihan/jdk-rvv/build/linux-riscv64-server-fastdebug/jdk \ >> -compilejdk:/home/zifeihan/jdk-rvv/build/linux-x86_64-server-release/images/jdk \ >> /home/zifeihan/jdk/test/jdk/jdk/incubator/vector/VectorReshapeTests.java >> >> >> #### VectorCast/VectorCastB2X/VectorCastD2X/VectorCastF2X/VectorCastI2X/VectorCastL2X/VectorCastS2X >> There are too many nodes here, and the following shows the log of `VectorCastB2X` nodes: >> >> ``` >> 1ba0 ld R28, [R23, #280] # ptr, #@loadP >> 1ba4 addi R29, R7, #32 # ptr, #@addP_reg_imm >> 1ba8 reinterpretResize V1, V5 >> 1bb0 vcvtBtoX V4, V1 >> 1bb8 far_bgeu R29, R28, B465 #@far_cmpP_branch P=0.000100 C=-1.000000 >> ``` >> >> #### VectorRearrange/VectorReinterpret >> >> When the original vector is transformed to the target vector, if the actual number of elements of the original vector is larger than the number of elements of the target vector, a slice action is performed to provide data for the subsequent cast nodes. the slice action depends on the `VectorRearrange` and `VectorReinterpret` nodes. 
>> >> The compilation log for the `VectorRearrange` node: >> >> ``` >> 1f6 spill R7 -> [sp, #320] # spill size = 64 >> 1f8 spill [sp, #128] -> V1 # vector spill size = 256 >> 200 spill [sp, #160] -> V2 # vector spill size = 256 >> 208 rearrange V3, V1, V2 >> 210 spill V3 -> [sp, #96] # vector spill size = 256 >> 218 li R11, #4 # int, #@loadConI >> ``` >> >> The compilation log for the `VectorReinterpret` node: >> >> >> 1218 spill [sp, #32] -> V4 # vector spill size = 256 >> 1220 spill [sp, #176] -> V3 # vector spill size = 256 >> 1228 rearrange V2, V4, V3 >> 1230 spill [sp, #72] -> V0 # vmask spill size = 32 >> 123c vmerge_vvm V1, V1, V2, v0 #@vector blend >> 1244 reinterpretResize V2, V1 >> 124c vcvtStoX_extend V5, V2 >> 1254 bgeu R28, R7, B169 #@cmpP_branch P=0.000100 C=-1.000000 >> >> >> #### LShiftCntV/RShiftCntV/MaskAll >> >> We have merged `LShiftCntV`, `RShiftCntV` nodes and support boolean types >> >> The compilation log for the LShiftCntV/RShiftCntV node: >> >> >> 24c vasrB V3, V1, V2 >> 260 storeV [R19], V3 # vector (rvv) >> 268 lbu R19, [R29, #48] # byte, #@loadUB >> 26c andi R19, R19, #7 #@andI_reg_imm >> 270 loadV V1, [R25] # vector (rvv) >> 278 vshiftcnt V2, R19 >> 280 vasrB V3, V1, V2 >> 294 storeV [R26], V3 # vector (rvv) >> 29c lbu R19, [R29, #80] # byte, #@loadUB >> 2a0 andi R19, R19, #7 #@andI_reg_imm >> 2a4 loadV V1, [R22] # vector (rvv) >> 2ac vshiftcnt V2, R19 >> >> >> By the way, the mask version of MaskAll is supported. 
>> >> [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc >> [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/VectorReshapeTests.java >> Testing: >> qemu with UseRVV: >> >> - [ ] Tier1 tests (release) >> - [ ] Tier2 tests (release) >> - [ ] Tier3 tests (release) >> - [x] test/jdk/jdk/incubator/vector (fastdebug) > > Gui Cao has updated the pull request incrementally with one additional commit since the last revision: > > During the conversion, specify the number of vectors Changes requested by fyang (Reviewer). src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1797: > 1795: assert_different_registers(dst, src); > 1796: > 1797: rvv_vsetvli(dst_bt, length_in_bytes); I think we should use the actual AVL instead of 'length_in_bytes' for rvv_vsetvli? src/hotspot/cpu/riscv/riscv_v.ad line 2837: > 2835: if (bt == T_LONG) { > 2836: __ vector_integer_extend(as_VectorRegister($dst$$reg), T_LONG, > 2837: Matcher::vector_length_in_bytes(this), as_VectorRegister($dst$$reg), T_INT); Will this work? I see you are asserting that 'dst' and 'src' vector registers are different in vector_integer_extend. But the same vector register is passed for these two parameters here. src/hotspot/cpu/riscv/riscv_v.ad line 2885: > 2883: %} > 2884: > 2885: instruct vcvtDtoF(vReg dst_src1, vReg tmp) %{ Why not break down 'dst_src1' into two separate 'dst' and 'src' inputs like you do for 'vcvtFtoD'?
------------- PR Review: https://git.openjdk.org/jdk/pull/13684#pullrequestreview-1405146050 PR Review Comment: https://git.openjdk.org/jdk/pull/13684#discussion_r1179873004 PR Review Comment: https://git.openjdk.org/jdk/pull/13684#discussion_r1179875372 PR Review Comment: https://git.openjdk.org/jdk/pull/13684#discussion_r1179873881 From gcao at openjdk.org Fri Apr 28 03:51:28 2023 From: gcao at openjdk.org (Gui Cao) Date: Fri, 28 Apr 2023 03:51:28 GMT Subject: RFR: 8306966: RISC-V: Support vector cast node for Vector API [v4] In-Reply-To: References: Message-ID: > Hi, > > we have added some implementations related to vector cast, It was implemented by referring to RVV v1.0 [1]. please take a look and have some reviews. Thanks a lot. > > We can use the VectorReshapeTests.java[2] to print the compilation log, verify and observe the generation of nodes. > > For example, we can use the following command to print the compilation log of a jtreg test case: > > > /home/zifeihan/jdk-tools/jtreg/bin/jtreg \ > -v:default \ > -concurrency:16 -timeout:50 \ > -javaoption:-XX:+UnlockExperimentalVMOptions \ > -javaoption:-XX:+UseRVV \ > -javaoption:-XX:+PrintOptoAssembly \ > -javaoption:-XX:LogFile=/home/zifeihan/jdk-rvv/VectorReshapeTests_PrintOptoAssembly_20230426.log \ > -jdk:/home/zifeihan/jdk-rvv/build/linux-riscv64-server-fastdebug/jdk \ > -compilejdk:/home/zifeihan/jdk-rvv/build/linux-x86_64-server-release/images/jdk \ > /home/zifeihan/jdk/test/jdk/jdk/incubator/vector/VectorReshapeTests.java > > > #### VectorCast/VectorCastB2X/VectorCastD2X/VectorCastF2X/VectorCastI2X/VectorCastL2X/VectorCastS2X > There are too many nodes here, and the following shows the log of `VectorCastB2X` nodes: > > ``` > 1ba0 ld R28, [R23, #280] # ptr, #@loadP > 1ba4 addi R29, R7, #32 # ptr, #@addP_reg_imm > 1ba8 reinterpretResize V1, V5 > 1bb0 vcvtBtoX V4, V1 > 1bb8 far_bgeu R29, R28, B465 #@far_cmpP_branch P=0.000100 C=-1.000000 > ``` > > #### VectorRearrange/VectorReinterpret > > When the 
original vector is transformed to the target vector, if the actual number of elements of the original vector is larger than the number of elements of the target vector, a slice action is performed to provide data for the subsequent cast nodes. the slice action depends on the `VectorRearrange` and `VectorReinterpret` nodes. > > The compilation log for the `VectorRearrange` node: > > ``` > 1f6 spill R7 -> [sp, #320] # spill size = 64 > 1f8 spill [sp, #128] -> V1 # vector spill size = 256 > 200 spill [sp, #160] -> V2 # vector spill size = 256 > 208 rearrange V3, V1, V2 > 210 spill V3 -> [sp, #96] # vector spill size = 256 > 218 li R11, #4 # int, #@loadConI > ``` > > The compilation log for the `VectorReinterpret` node: > > > 1218 spill [sp, #32] -> V4 # vector spill size = 256 > 1220 spill [sp, #176] -> V3 # vector spill size = 256 > 1228 rearrange V2, V4, V3 > 1230 spill [sp, #72] -> V0 # vmask spill size = 32 > 123c vmerge_vvm V1, V1, V2, v0 #@vector blend > 1244 reinterpretResize V2, V1 > 124c vcvtStoX_extend V5, V2 > 1254 bgeu R28, R7, B169 #@cmpP_branch P=0.000100 C=-1.000000 > > > #### LShiftCntV/RShiftCntV/MaskAll > > We have merged `LShiftCntV`, `RShiftCntV` nodes and support boolean types > > The compilation log for the LShiftCntV/RShiftCntV node: > > > 24c vasrB V3, V1, V2 > 260 storeV [R19], V3 # vector (rvv) > 268 lbu R19, [R29, #48] # byte, #@loadUB > 26c andi R19, R19, #7 #@andI_reg_imm > 270 loadV V1, [R25] # vector (rvv) > 278 vshiftcnt V2, R19 > 280 vasrB V3, V1, V2 > 294 storeV [R26], V3 # vector (rvv) > 29c lbu R19, [R29, #80] # byte, #@loadUB > 2a0 andi R19, R19, #7 #@andI_reg_imm > 2a4 loadV V1, [R22] # vector (rvv) > 2ac vshiftcnt V2, R19 > > > By the way, the mask version of MaskAll is supported. 
> > [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc > [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/VectorReshapeTests.java > Testing: > qemu with UseRVV: > > - [ ] Tier1 tests (release) > - [ ] Tier2 tests (release) > - [ ] Tier3 tests (release) > - [x] test/jdk/jdk/incubator/vector (fastdebug) Gui Cao has updated the pull request incrementally with one additional commit since the last revision: Fix VectorCastF2X ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13684/files - new: https://git.openjdk.org/jdk/pull/13684/files/94efa172..3554e7a3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13684&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13684&range=02-03 Stats: 21 lines in 2 files changed: 4 ins; 5 del; 12 mod Patch: https://git.openjdk.org/jdk/pull/13684.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13684/head:pull/13684 PR: https://git.openjdk.org/jdk/pull/13684 From gcao at openjdk.org Fri Apr 28 03:51:30 2023 From: gcao at openjdk.org (Gui Cao) Date: Fri, 28 Apr 2023 03:51:30 GMT Subject: RFR: 8306966: RISC-V: Support vector cast node for Vector API [v3] In-Reply-To: <_lEk_hUS0IJOsKPmIpLsEuWAz-V7tvVpJkk_DXq4RIc=.a1b43de9-0e0c-4857-958f-e014ca56aab2@github.com> References: <_lEk_hUS0IJOsKPmIpLsEuWAz-V7tvVpJkk_DXq4RIc=.a1b43de9-0e0c-4857-958f-e014ca56aab2@github.com> Message-ID: On Fri, 28 Apr 2023 02:37:26 GMT, Fei Yang wrote: >> Gui Cao has updated the pull request incrementally with one additional commit since the last revision: >> >> During the conversion, specify the number of vectors > > src/hotspot/cpu/riscv/riscv_v.ad line 2837: > >> 2835: if (bt == T_LONG) { >> 2836: __ vector_integer_extend(as_VectorRegister($dst$$reg), T_LONG, >> 2837: Matcher::vector_length_in_bytes(this), as_VectorRegister($dst$$reg), T_INT); > > Will this work? I see you are asserting that 'dst' and 'src' vector registers are different in vector_integer_extend. 
But the same vector register is passed for these two parameters here. Fixed. > src/hotspot/cpu/riscv/riscv_v.ad line 2885: > >> 2883: %} >> 2884: >> 2885: instruct vcvtDtoF(vReg dst_src1, vReg tmp) %{ > > Why not break down 'dst_src1' into two separate 'dst' and 'src' inputs like you do for 'vcvtFtoD'? Fixed. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13684#discussion_r1179896243 PR Review Comment: https://git.openjdk.org/jdk/pull/13684#discussion_r1179896145 From fyang at openjdk.org Fri Apr 28 04:11:22 2023 From: fyang at openjdk.org (Fei Yang) Date: Fri, 28 Apr 2023 04:11:22 GMT Subject: RFR: 8051725: Improve expansion of Conv2B nodes in the middle-end [v3] In-Reply-To: References: Message-ID: On Tue, 25 Apr 2023 02:46:17 GMT, Jasmine Karthikeyan wrote: >> Hello, I wonder if we could make this transformation of Conv2B conditional? Architectures like RISC-V don't support conditional moves at the ISA level for now. So we set the ConditionalMoveLimit parameter to 0 for this platform, and conditional moves are emulated with normal compare and branch instructions instead [1]. I don't think we would achieve better performance numbers on this platform with this change. >> >> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/riscv.ad#L9583 > > Hey @RealFYang, thanks for this info! I wasn't aware that RISC-V didn't have conditional moves, and I agree that it doesn't sound like this transform would be so profitable there. To make the transformation conditional I've moved it to post loop opts IGVN, and only run it if the match rule for `Conv2B` isn't found. In an effort to not accidentally cause performance regressions, I've limited the transform to x86_64, aarch64, and arm32. > > @merykitty I've also implemented your change with idealization and would like your thoughts on it, thanks! > > I'll attach aarch64 perf results soon. @jaskarth: Thanks for taking care of that.
I performed tier1-3 tests on linux-riscv64 platform, result looks good. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13345#issuecomment-1526952806 From qamai at openjdk.org Fri Apr 28 06:01:53 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Fri, 28 Apr 2023 06:01:53 GMT Subject: RFR: 8051725: Improve expansion of Conv2B nodes in the middle-end [v6] In-Reply-To: <2ODJH1IFMOVjRgjQIeobF2eb_nxTCgnxcV__ttNz9nw=.7cbf388a-0a65-4d1c-8b60-d29ae3502123@github.com> References: <2ODJH1IFMOVjRgjQIeobF2eb_nxTCgnxcV__ttNz9nw=.7cbf388a-0a65-4d1c-8b60-d29ae3502123@github.com> Message-ID: <0LZmbabcnX0kf4HLiNPu-XX5IcMkTPvR1CAG6yEAat0=.cc5bd2fd-2029-4e41-86e5-3b899b1b523f@github.com> On Tue, 25 Apr 2023 14:43:23 GMT, Jasmine Karthikeyan wrote: >> Hi, I've created optimizations for the expansion of `Conv2B` nodes, especially when followed immediately by an xor of 1. This pattern is fairly common, and can arise from both [cmov idealization](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/movenode.cpp#L241) and [diamond-phi optimization](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L1571). This change replaces `Conv2B` nodes in the middle-end during post loop opts IGVN with conditional moves on supported platforms (x86_64, aarch64, arm32), allowing the bit flip with `xor` to be subsumed with an inversion of the comparison instead. This change also reduces the overhead of the matcher in the backends, as fewer rules need to be traversed in order to match an ideal node. Performance results from my (Zen 2) machine:
>>
>>                                           Baseline                     Patch                    Improvement
>> Benchmark                      Mode  Cnt  Score    Error  Units        Score    Error  Units
>> Conv2BRules.testEquals0        avgt   10  47.566 ± 0.346  ns/op    /   34.130 ± 0.177  ns/op    + 28.2%
>> Conv2BRules.testNotEquals0     avgt   10  37.167 ± 0.211  ns/op    /   34.185 ± 0.258  ns/op    + 8.0%
>> Conv2BRules.testEquals1        avgt   10  35.059 ± 0.280  ns/op    /   34.847 ± 0.160  ns/op    (unchanged)
>> Conv2BRules.testEqualsNull     avgt   10  56.768 ± 2.600  ns/op    /   34.330 ± 0.625  ns/op    + 39.5%
>> Conv2BRules.testNotEqualsNull  avgt   10  47.447 ± 1.193  ns/op    /   34.142 ± 0.303  ns/op    + 28.0%
>>
>> Reviews would be greatly appreciated! >> >> Testing: tier1-2 on linux x64, GHA > > Jasmine Karthikeyan has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 11 commits: > > - Merge branch 'master' into conv2b-x86-lowering > - Whitespace tweak > - Make transform conditional > - Remove Conv2B from backend as it's macro expanded now > - Re-work transform to happen in macro expansion > - Fix whitespace and add bug tag to IR test > - Merge branch 'master' into conv2b-x86-lowering > - Merge branch 'master' into conv2b-x86-lowering > - Merge branch 'master' into conv2b-x86-lowering > - Merge branch 'master' into conv2b-x86-lowering > - ... and 1 more: https://git.openjdk.org/jdk/compare/bad6aa68...295b9a67 src/hotspot/share/opto/addnode.cpp line 890: > 888: } > 889: > 890: // Try to convert (c ? 1 : 0) ^ 1 into !c ? 1 : 0. This pattern can occur after expansion of Conv2B nodes. Be more general? `Xor (CMove cond, iftrue, iffalse), op == CMove cond, (Xor iftrue op), (Xor iffalse op)`. You can be conservative and apply this only if `op`, `iftrue` and `iffalse` are all constant. src/hotspot/share/opto/cfgnode.cpp line 1576: > 1574: Node *n = new Conv2BNode(cmp->in(1)); > 1575: if( flipped ) > 1576: n = new XorINode( phase->transform(n), phase->intcon(1) ); This lives under the `if (flipped)`, maybe move into a block for more clarity. src/hotspot/share/opto/macro.cpp line 44: > 42: #include "opto/macro.hpp" > 43: #include "opto/memnode.hpp" > 44: #include "opto/movenode.hpp" Unnecessary change?
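To illustrate the generalization suggested for addnode.cpp above, here is a minimal standalone sketch (not HotSpot code). It models `CMove` as a plain select on 32-bit ints and checks that `Xor (CMove cond, iftrue, iffalse), op` produces the same value as `CMove cond, (Xor iftrue op), (Xor iffalse op)`. The helper names `cmove`, `xor_of_cmove`, `cmove_of_xors` and `check_identity_small_range` are hypothetical and exist only for this example; the argument order follows the notation of the comment, which may differ from the input order of C2's `CMoveNode`.

```cpp
#include <cstdint>

// Hypothetical model of a CMove node: a plain select between two values.
constexpr int32_t cmove(bool cond, int32_t iftrue, int32_t iffalse) {
  return cond ? iftrue : iffalse;
}

// Left-hand side of the identity: the Xor consumes the CMove result.
constexpr int32_t xor_of_cmove(bool cond, int32_t iftrue, int32_t iffalse, int32_t op) {
  return cmove(cond, iftrue, iffalse) ^ op;
}

// Right-hand side: the Xor is pushed into both CMove inputs. When iftrue,
// iffalse and op are all constants, both inner Xors can be folded.
constexpr int32_t cmove_of_xors(bool cond, int32_t iftrue, int32_t iffalse, int32_t op) {
  return cmove(cond, iftrue ^ op, iffalse ^ op);
}

// The Conv2B special case: (c ? 1 : 0) ^ 1 equals c ? 0 : 1, i.e. !c ? 1 : 0.
static_assert(xor_of_cmove(true, 1, 0, 1) == 0, "true case flips to 0");
static_assert(xor_of_cmove(false, 1, 0, 1) == 1, "false case flips to 1");

// Compares both forms over a small range of values; returns true if they
// agree for every combination tried.
inline bool check_identity_small_range() {
  for (int c = 0; c < 2; c++) {
    for (int32_t t = -4; t <= 4; t++) {
      for (int32_t f = -4; f <= 4; f++) {
        for (int32_t op = -4; op <= 4; op++) {
          if (xor_of_cmove(c != 0, t, f, op) != cmove_of_xors(c != 0, t, f, op)) {
            return false;
          }
        }
      }
    }
  }
  return true;
}
```

The two forms agree because exactly one input is selected, so applying the Xor before or after the select gives the same value. Restricting the rewrite to constant `op`, `iftrue` and `iffalse`, as suggested, avoids trading one Xor for two non-foldable ones.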
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13345#discussion_r1179955624 PR Review Comment: https://git.openjdk.org/jdk/pull/13345#discussion_r1179953289 PR Review Comment: https://git.openjdk.org/jdk/pull/13345#discussion_r1179952045 From thartmann at openjdk.org Fri Apr 28 06:45:30 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 28 Apr 2023 06:45:30 GMT Subject: RFR: 8306997: C2: "malformed control flow" assert due to missing safepoint on backedge with a switch In-Reply-To: References: Message-ID: On Thu, 27 Apr 2023 11:27:30 GMT, Roland Westrelin wrote: > The assert fires because a self loop (a `Loop` whose second input is > itself) is removed by loop opts. That loop comes from a switch where > the default case is a loop head (a code shape I couldn't get javac to > produce). That `Loop` should at the very least have a `Safepoint` but > the logic at parse time only looks for backedges in the non default > cases. With that fixed, the `Loop` is no longer considered dead code. Looks good to me. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13688#pullrequestreview-1405304285 From duke at openjdk.org Fri Apr 28 06:48:53 2023 From: duke at openjdk.org (Afshin Zafari) Date: Fri, 28 Apr 2023 06:48:53 GMT Subject: Integrated: 8305079: Remove finalize() from compiler/c2/Test719030 In-Reply-To: References: Message-ID: On Tue, 11 Apr 2023 07:33:16 GMT, Afshin Zafari wrote: > The `finalize()` method is replaced by a Cleaner callback. This pull request has now been integrated. 
Changeset: 84df74ca Author: Afshin Zafari Committer: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/84df74ca3948c50d8e6f24694310860ed3888aba Stats: 8 lines in 1 file changed: 2 ins; 5 del; 1 mod 8305079: Remove finalize() from compiler/c2/Test719030 Reviewed-by: thartmann, coleenp ------------- PR: https://git.openjdk.org/jdk/pull/13418 From epeter at openjdk.org Fri Apr 28 06:51:23 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 28 Apr 2023 06:51:23 GMT Subject: RFR: 8306042: C2: failed: Missed optimization opportunity in PhaseCCP (adding LShift->Cast->Add notification) [v4] In-Reply-To: <5LdntwU5zlwXPnwYeJxzNPZTwrOuki6VebrE9Leeb8g=.3dc26d60-2729-4d60-9a5c-14cbb57f2813@github.com> References: <5LdntwU5zlwXPnwYeJxzNPZTwrOuki6VebrE9Leeb8g=.3dc26d60-2729-4d60-9a5c-14cbb57f2813@github.com> Message-ID: > Another case of `uncast` not being type-propagated through. > > We have a case like this: > `Phi -> ShiftL -> CastII -> AndI` > > The Phi has an updated type, so we should re-run Value on the AndI. > > In PhaseCCP::push_and, we do update a similar pattern: > `X -> ShiftL -> AndI` > > I extended it to handle this pattern: > `parent -> LShift (use) -> ConstraintCast* -> And` > > For this, I implemented: > https://github.com/openjdk/jdk/blob/26f4adaae901822bea984b926c06d1a78f9c6b48/src/hotspot/share/opto/castnode.hpp#L73-L78 > > I could refactor code from a previous similar fix, for pattern: `ConstraintCast+ -> Sub/Phi` > > **Discussion** > > https://github.com/openjdk/jdk/blob/4d350f8f4eaabb18482c7656cb56a734e60187cf/src/hotspot/share/opto/castnode.hpp#L78-L79 > I would have liked to place a `ResourceMark` between these two lines, to ensure the `internals` data structure is de-allocated after the traversal.
But if I add it there, then one cannot modify any outer data-structure, or else one risks re-allocation of the outer data-structure in the inner ResourceMark, and then this memory gets de-allocated once the ResourceMark is cleared, and the outer data-structure is broken. This would for example mean that I could not push to the IGVN worklist inside the callback. > > Not having the ResourceMark means a memory leak, until the compile phase is over. But my code is not the only place, there are lots of places where we create a Resource allocated data-structure, but do not use ResourceMarks. Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: CCP worklist on local arena ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13611/files - new: https://git.openjdk.org/jdk/pull/13611/files/f4df73bd..4d2a4f9e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13611&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13611&range=02-03 Stats: 7 lines in 1 file changed: 6 ins; 1 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/13611.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13611/head:pull/13611 PR: https://git.openjdk.org/jdk/pull/13611 From thartmann at openjdk.org Fri Apr 28 06:55:22 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 28 Apr 2023 06:55:22 GMT Subject: RFR: 8306042: C2: failed: Missed optimization opportunity in PhaseCCP (adding LShift->Cast->Add notification) [v3] In-Reply-To: References: <5LdntwU5zlwXPnwYeJxzNPZTwrOuki6VebrE9Leeb8g=.3dc26d60-2729-4d60-9a5c-14cbb57f2813@github.com> Message-ID: On Thu, 27 Apr 2023 09:14:53 GMT, Emanuel Peter wrote: >> An other case of `uncast` not being type-propagated through. >> >> We have a case like this: >> `Phi -> ShiftL -> CastII -> AndI` >> >> The Phi has an updated type, so we should re-run Value on the AndI. 
>> >> In PhaseCCP::push_and, we do update a similar pattern: >> `X -> ShiftL -> AndI` >> >> I extended it to handle this pattern: >> `parent -> LShift (use) -> ConstraintCast* -> And` >> >> For this, I implemented: >> https://github.com/openjdk/jdk/blob/26f4adaae901822bea984b926c06d1a78f9c6b48/src/hotspot/share/opto/castnode.hpp#L73-L78 >> >> I could refactor code from a previous similar fix, for pattern: `ConstraintCast+ -> Sub/Phi` >> >> **Discussion** >> >> https://github.com/openjdk/jdk/blob/4d350f8f4eaabb18482c7656cb56a734e60187cf/src/hotspot/share/opto/castnode.hpp#L78-L79 >> I would have liked to place a `ResourceMark` between these two lines, to ensure the `internals` data structure is de-allocated after the traversal. But if I add it there, then one cannot modify any outer data-structure, or else one risks re-allocation of the outer data-structure in the inner ResourceMark, and then this memory gets de-allocated once the ResourceMark is cleared, and the outer data-structure is broken. This would for example mean that I could not push to the IGVN worklist inside the callback. >> >> Not having the ResourceMark means a memory leak, until the compile phase is over. But my code is not the only place, there are lots of places where we create a Resource allocated data-structure, but do not use ResourceMarks. > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > re-introduced ResourceMark. Made CCP worklist allocate from comp_arena() The new changes look good to me. Let's clean up memory management of worklists with [JDK-8302670](https://bugs.openjdk.org/browse/JDK-8302670). ------------- Marked as reviewed by thartmann (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/13611#pullrequestreview-1405312811 From thartmann at openjdk.org Fri Apr 28 06:59:54 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 28 Apr 2023 06:59:54 GMT Subject: RFR: 8306042: C2: failed: Missed optimization opportunity in PhaseCCP (adding LShift->Cast->Add notification) [v4] In-Reply-To: References: <5LdntwU5zlwXPnwYeJxzNPZTwrOuki6VebrE9Leeb8g=.3dc26d60-2729-4d60-9a5c-14cbb57f2813@github.com> Message-ID: On Fri, 28 Apr 2023 06:51:23 GMT, Emanuel Peter wrote: >> An other case of `uncast` not being type-propagated through. >> >> We have a case like this: >> `Phi -> ShiftL -> CastII -> AndI` >> >> The Phi has an updated type, so we should re-run Value on the AndI. >> >> In PhaseCCP::push_and, we do update a similar pattern: >> `X -> ShiftL -> AndI` >> >> I extended it to handle this pattern: >> `parent -> LShift (use) -> ConstraintCast* -> And` >> >> For this, I implemented: >> https://github.com/openjdk/jdk/blob/26f4adaae901822bea984b926c06d1a78f9c6b48/src/hotspot/share/opto/castnode.hpp#L73-L78 >> >> I could refactor code from a previous similar fix, for pattern: `ConstraintCast+ -> Sub/Phi` >> >> **Discussion** >> >> https://github.com/openjdk/jdk/blob/4d350f8f4eaabb18482c7656cb56a734e60187cf/src/hotspot/share/opto/castnode.hpp#L78-L79 >> I would have liked to place a `ResourceMark` between these two lines, to ensure the `internals` data structure is de-allocated after the traversal. But if I add it there, then one cannot modify any outer data-structure, or else one risks re-allocation of the outer data-structure in the inner ResourceMark, and then this memory gets de-allocated once the ResourceMark is cleared, and the outer data-structure is broken. This would for example mean that I could not push to the IGVN worklist inside the callback. >> >> Not having the ResourceMark means a memory leak, until the compile phase is over. 
But my code is not the only place, there are lots of places where we create a Resource allocated data-structure, but do not use ResourceMarks. > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > CCP worklist on local arena Marked as reviewed by thartmann (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/13611#pullrequestreview-1405321743 From chagedorn at openjdk.org Fri Apr 28 06:59:55 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 28 Apr 2023 06:59:55 GMT Subject: RFR: 8306042: C2: failed: Missed optimization opportunity in PhaseCCP (adding LShift->Cast->Add notification) [v4] In-Reply-To: References: <5LdntwU5zlwXPnwYeJxzNPZTwrOuki6VebrE9Leeb8g=.3dc26d60-2729-4d60-9a5c-14cbb57f2813@github.com> Message-ID: <4Hc_3iMenGek1zXDzHiVwH19aMNw5Wp91hgN6PVERks=.d054b00a-50a5-4042-b75a-db6cf593148f@github.com> On Fri, 28 Apr 2023 06:51:23 GMT, Emanuel Peter wrote: >> An other case of `uncast` not being type-propagated through. >> >> We have a case like this: >> `Phi -> ShiftL -> CastII -> AndI` >> >> The Phi has an updated type, so we should re-run Value on the AndI. >> >> In PhaseCCP::push_and, we do update a similar pattern: >> `X -> ShiftL -> AndI` >> >> I extended it to handle this pattern: >> `parent -> LShift (use) -> ConstraintCast* -> And` >> >> For this, I implemented: >> https://github.com/openjdk/jdk/blob/26f4adaae901822bea984b926c06d1a78f9c6b48/src/hotspot/share/opto/castnode.hpp#L73-L78 >> >> I could refactor code from a previous similar fix, for pattern: `ConstraintCast+ -> Sub/Phi` >> >> **Discussion** >> >> https://github.com/openjdk/jdk/blob/4d350f8f4eaabb18482c7656cb56a734e60187cf/src/hotspot/share/opto/castnode.hpp#L78-L79 >> I would have liked to place a `ResourceMark` between these two lines, to ensure the `internals` data structure is de-allocated after the traversal. 
But if I add it there, then one cannot modify any outer data-structure, or else one risks re-allocation of the outer data-structure in the inner ResourceMark, and then this memory gets de-allocated once the ResourceMark is cleared, and the outer data-structure is broken. This would for example mean that I could not push to the IGVN worklist inside the callback. >> >> Not having the ResourceMark means a memory leak, until the compile phase is over. But my code is not the only place, there are lots of places where we create a Resource allocated data-structure, but do not use ResourceMarks. > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > CCP worklist on local arena Update looks good! ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13611#pullrequestreview-1405324451 From epeter at openjdk.org Fri Apr 28 06:59:57 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 28 Apr 2023 06:59:57 GMT Subject: RFR: 8306042: C2: failed: Missed optimization opportunity in PhaseCCP (adding LShift->Cast->Add notification) [v3] In-Reply-To: References: <5LdntwU5zlwXPnwYeJxzNPZTwrOuki6VebrE9Leeb8g=.3dc26d60-2729-4d60-9a5c-14cbb57f2813@github.com> Message-ID: On Thu, 27 Apr 2023 09:14:53 GMT, Emanuel Peter wrote: >> An other case of `uncast` not being type-propagated through. >> >> We have a case like this: >> `Phi -> ShiftL -> CastII -> AndI` >> >> The Phi has an updated type, so we should re-run Value on the AndI. 
>> >> In PhaseCCP::push_and, we do update a similar pattern: >> `X -> ShiftL -> AndI` >> >> I extended it to handle this pattern: >> `parent -> LShift (use) -> ConstraintCast* -> And` >> >> For this, I implemented: >> https://github.com/openjdk/jdk/blob/26f4adaae901822bea984b926c06d1a78f9c6b48/src/hotspot/share/opto/castnode.hpp#L73-L78 >> >> I could refactor code from a previous similar fix, for pattern: `ConstraintCast+ -> Sub/Phi` >> >> **Discussion** >> >> https://github.com/openjdk/jdk/blob/4d350f8f4eaabb18482c7656cb56a734e60187cf/src/hotspot/share/opto/castnode.hpp#L78-L79 >> I would have liked to place a `ResourceMark` between these two lines, to ensure the `internals` data structure is de-allocated after the traversal. But if I add it there, then one cannot modify any outer data-structure, or else one risks re-allocation of the outer data-structure in the inner ResourceMark, and then this memory gets de-allocated once the ResourceMark is cleared, and the outer data-structure is broken. This would for example mean that I could not push to the IGVN worklist inside the callback. >> >> Not having the ResourceMark means a memory leak, until the compile phase is over. But my code is not the only place, there are lots of places where we create a Resource allocated data-structure, but do not use ResourceMarks. > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > re-introduced ResourceMark. Made CCP worklist allocate from comp_arena() I now added the `worklist` to a local arena instead. I may change this again in [JDK-8302670](https://bugs.openjdk.org/browse/JDK-8302670), if that is beneficial for memory management of the CCP data-structures. 
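The hazard described in the quoted **Discussion** above (an outer resource-allocated structure that grows while an inner `ResourceMark` is active) can be sketched with a toy arena. This is not HotSpot code and is not taken from the PR: `ToyArena`, `ToyMark`, `ToyList`, and `outer_storage_survives` are hypothetical stand-ins, and the sketch assumes that an arena-backed list grows by taking a fresh block from its arena and abandoning the old one. The `separate_arena` case only loosely mirrors the idea of putting the worklist on its own arena.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Toy bump allocator standing in for a resource area.
// Hypothetical: this is not HotSpot's Arena or ResourceArea.
class ToyArena {
 public:
  explicit ToyArena(std::size_t capacity) : _buf(capacity), _top(0) {}

  void* alloc(std::size_t bytes) {
    assert(_top + bytes <= _buf.size());
    void* p = _buf.data() + _top;
    _top += bytes;
    return p;
  }

  std::size_t top() const { return _top; }
  void rollback(std::size_t mark) { _top = mark; }

  // True if p points into the part of the buffer that is still allocated.
  bool is_live(const void* p) const {
    const char* c = static_cast<const char*>(p);
    return c >= _buf.data() && c < _buf.data() + _top;
  }

 private:
  std::vector<char> _buf;
  std::size_t _top;
};

// Toy scoped mark: remembers the arena top and rolls back to it on scope exit.
class ToyMark {
 public:
  explicit ToyMark(ToyArena* arena) : _arena(arena), _saved(arena->top()) {}
  ~ToyMark() { _arena->rollback(_saved); }

 private:
  ToyArena* _arena;
  std::size_t _saved;
};

// Toy growable int list whose storage lives in an arena.
class ToyList {
 public:
  ToyList(ToyArena* arena, std::size_t capacity)
      : _arena(arena),
        _data(static_cast<int*>(arena->alloc(capacity * sizeof(int)))),
        _cap(capacity),
        _len(0) {}

  void push(int v) {
    if (_len == _cap) {
      // Grow: take a bigger block from the arena, copy, abandon the old block.
      int* bigger = static_cast<int*>(_arena->alloc(2 * _cap * sizeof(int)));
      std::memcpy(bigger, _data, _len * sizeof(int));
      _data = bigger;
      _cap *= 2;
    }
    _data[_len++] = v;
  }

  const int* data() const { return _data; }

 private:
  ToyArena* _arena;
  int* _data;
  std::size_t _cap;
  std::size_t _len;
};

// Returns true if the outer list's storage is still allocated after the
// inner mark on the scratch arena has been released.
bool outer_storage_survives(bool grow_inside_mark, bool separate_arena) {
  ToyArena scratch(1024);
  ToyArena own(1024);
  ToyArena* outer_arena = separate_arena ? &own : &scratch;

  ToyList outer(outer_arena, 2);
  outer.push(1);
  outer.push(2);  // outer is now full; the next push reallocates
  {
    ToyMark mark(&scratch);
    ToyList inner(&scratch, 4);  // scratch data used only inside the mark
    inner.push(42);
    if (grow_inside_mark) {
      outer.push(3);  // the new block is allocated from outer's arena
    }
  }  // mark released: everything allocated from scratch after the mark is gone
  return outer_arena->is_live(outer.data());
}
```

In this toy model, `outer_storage_survives(true, false)` returns false because the outer list's new block was allocated above the mark and is rolled back with it, while `outer_storage_survives(true, true)` returns true because the outer list never allocates from the arena that the mark rolls back.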
------------- PR Comment: https://git.openjdk.org/jdk/pull/13611#issuecomment-1527066540 From ysuenaga at openjdk.org Fri Apr 28 12:09:58 2023 From: ysuenaga at openjdk.org (Yasumasa Suenaga) Date: Fri, 28 Apr 2023 12:09:58 GMT Subject: RFR: 8305770: os::Linux::available_memory() should refer MemAvailable in /proc/meminfo In-Reply-To: References: Message-ID: On Sat, 8 Apr 2023 02:24:44 GMT, Yasumasa Suenaga wrote: > `os::Linux::available_memory()` returns available memory from cgroups or sysinfo(2). In case of the process which run on out of container, that value is based on `freeram` from sysinfo(2). > > `freeram` is equivalent to `MemFree` in `/proc/meminfo` [1]. However it means just a free RAM. We should use `MemAvailable` when we want to know how much memory is available for the process [2]. `MemAvailable` is available in modern Linux kernel, and it has been backported some older kernels (e.g. RHEL). In `sar` from sysstat, it refers that value and shows it as `kbavail` [3]. > > AFAIK PhysicalMemory event in JFR depends on `os::Linux::available_memory()`, and it is used in automated analysis in JMC. So the JFR/JMC user could misunderstand physical memory was exhausted even if the memory was available enough. > > [1] https://github.com/torvalds/linux/blob/c9c3395d5e3dcc6daee66c6908354d47bf98cb0c/fs/proc/meminfo.c#L59 > [2] https://docs.kernel.org/filesystems/proc.html?highlight=memavailable > [3] https://github.com/sysstat/sysstat/blob/ac1df71ca252c158e8d418ded93e5ed52f5e8765/rd_stats.c#L325-L328 There appears to be no consistency in `os::available_memory()`. I think there are big difference between Linux, BSD, Windows. What is expected in this function? 

* Linux
    * cgroups - memory limit - memory usage
    * sysinfo(2) - equivalent to MemFree in /proc/meminfo
* Windows
    * GlobalMemoryStatusEx() - I guess it is equivalent to MemAvailable in Linux
* AIX
    * libo4::get_memory_info()
    * perfstat_memory_total()
* BSD
    * 1/4 of physical memory
* macOS
    * host_statistics64()

In the compiler thread, the policy may be different on each platform. If we add `os::free_memory()`, we have to implement it in os_linux, os_aix, os_bsd, os_windows. Wouldn't the impact be significant?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/13398#issuecomment-1527463178

From gcao at openjdk.org Fri Apr 28 12:21:28 2023
From: gcao at openjdk.org (Gui Cao)
Date: Fri, 28 Apr 2023 12:21:28 GMT
Subject: RFR: 8306966: RISC-V: Support vector cast node for Vector API [v3]
In-Reply-To: <_lEk_hUS0IJOsKPmIpLsEuWAz-V7tvVpJkk_DXq4RIc=.a1b43de9-0e0c-4857-958f-e014ca56aab2@github.com>
References: <_lEk_hUS0IJOsKPmIpLsEuWAz-V7tvVpJkk_DXq4RIc=.a1b43de9-0e0c-4857-958f-e014ca56aab2@github.com>
Message-ID:

On Fri, 28 Apr 2023 02:31:15 GMT, Fei Yang wrote:

>> Gui Cao has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   During the conversion, specify the number of vectors
>
> src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1797:
>
>> 1795:   assert_different_registers(dst, src);
>> 1796:
>> 1797:   rvv_vsetvli(dst_bt, length_in_bytes);
>
> I think we should use the actual AVL instead of 'length_in_bytes' for rvv_vsetvli?

No problem here,
https://github.com/openjdk/jdk/blob/3554e7a3ffb879c7e5ef7547eb053e484d09d12b/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp#L1830
There was a problem here and it has been fixed.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13684#discussion_r1180328798 From stuefe at openjdk.org Fri Apr 28 15:53:55 2023 From: stuefe at openjdk.org (Thomas Stuefe) Date: Fri, 28 Apr 2023 15:53:55 GMT Subject: RFR: JDK-8305782: Provide MacroAssembler::breakpoint on aarch64 [v2] In-Reply-To: References: Message-ID: On Tue, 11 Apr 2023 06:55:30 GMT, Thomas Stuefe wrote: >> The ability to emit debug traps was useful for me on arm, and I miss it on aarch64. >> >> Tested manually on Linux aarch64 in gdb with various values for hint covering the whole 16-bit range set. Hint gets encoded in the instruction (gdb decodes instruction as "BRK xxx" with xxx being the hint). According to documentation the hint ends up in ESR.ELx.ISS after the trap hit, but gdb refused to display the ESR register, so I could not verify that. > > Thomas Stuefe has updated the pull request incrementally with one additional commit since the last revision: > > reuse Assembler::brk Withdrawn since the usefulness is questioned (one can just use brk(0)) ------------- PR Comment: https://git.openjdk.org/jdk/pull/13401#issuecomment-1527748784 From stuefe at openjdk.org Fri Apr 28 15:53:57 2023 From: stuefe at openjdk.org (Thomas Stuefe) Date: Fri, 28 Apr 2023 15:53:57 GMT Subject: Withdrawn: JDK-8305782: Provide MacroAssembler::breakpoint on aarch64 In-Reply-To: References: Message-ID: On Sun, 9 Apr 2023 07:45:46 GMT, Thomas Stuefe wrote: > The ability to emit debug traps was useful for me on arm, and I miss it on aarch64. > > Tested manually on Linux aarch64 in gdb with various values for hint covering the whole 16-bit range set. Hint gets encoded in the instruction (gdb decodes instruction as "BRK xxx" with xxx being the hint). According to documentation the hint ends up in ESR.ELx.ISS after the trap hit, but gdb refused to display the ESR register, so I could not verify that. This pull request has been closed without being integrated. 
------------- PR: https://git.openjdk.org/jdk/pull/13401 From dlong at openjdk.org Fri Apr 28 16:07:00 2023 From: dlong at openjdk.org (Dean Long) Date: Fri, 28 Apr 2023 16:07:00 GMT Subject: RFR: 8306331: assert((cnt > 0.0f) && (prob > 0.0f)) failed: Bad frequency assignment in if [v2] In-Reply-To: References: Message-ID: On Mon, 24 Apr 2023 18:10:44 GMT, Dean Long wrote: >> This change removes undefined behavior caused by signed overflow, which triggered an assert with Xcode14.3+1.0-beta1 on macos aarch64. > > Dean Long has updated the pull request incrementally with two additional commits since the last revision: > > - Update src/hotspot/share/opto/parse2.cpp > > Co-authored-by: Tobias Hartmann > - Update src/hotspot/share/opto/parse2.cpp > > Co-authored-by: Tobias Hartmann Thanks Christian. ------------- PR Comment: https://git.openjdk.org/jdk/pull/13551#issuecomment-1527757000 From dlong at openjdk.org Fri Apr 28 16:07:24 2023 From: dlong at openjdk.org (Dean Long) Date: Fri, 28 Apr 2023 16:07:24 GMT Subject: Integrated: 8306331: assert((cnt > 0.0f) && (prob > 0.0f)) failed: Bad frequency assignment in if In-Reply-To: References: Message-ID: On Thu, 20 Apr 2023 02:44:00 GMT, Dean Long wrote: > This change removes undefined behavior caused by signed overflow, which triggered an assert with Xcode14.3+1.0-beta1 on macos aarch64. This pull request has now been integrated. 
Changeset: a177152f Author: Dean Long URL: https://git.openjdk.org/jdk/commit/a177152f224cdaa3ef24a90baa57f1b42c0cc220 Stats: 26 lines in 1 file changed: 24 ins; 0 del; 2 mod 8306331: assert((cnt > 0.0f) && (prob > 0.0f)) failed: Bad frequency assignment in if Reviewed-by: thartmann, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/13551 From kvn at openjdk.org Fri Apr 28 18:59:59 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 28 Apr 2023 18:59:59 GMT Subject: RFR: 8306997: C2: "malformed control flow" assert due to missing safepoint on backedge with a switch In-Reply-To: References: Message-ID: On Thu, 27 Apr 2023 11:27:30 GMT, Roland Westrelin wrote: > The assert fires because a self loop (a `Loop` whose second input is > itself) is removed by loop opts. That loop comes from a switch where > the default case is a loop head (a code shape I couldn't get javac to > produce). That `Loop` should at the very least have a `Safepoint` but > the logic at parse time only looks for backedges in the non default > cases. With that fixed, the `Loop` is no longer considered dead code. Good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13688#pullrequestreview-1406445600 From kvn at openjdk.org Fri Apr 28 19:12:53 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 28 Apr 2023 19:12:53 GMT Subject: RFR: 8306042: C2: failed: Missed optimization opportunity in PhaseCCP (adding LShift->Cast->Add notification) [v4] In-Reply-To: References: <5LdntwU5zlwXPnwYeJxzNPZTwrOuki6VebrE9Leeb8g=.3dc26d60-2729-4d60-9a5c-14cbb57f2813@github.com> Message-ID: On Fri, 28 Apr 2023 06:51:23 GMT, Emanuel Peter wrote: >> An other case of `uncast` not being type-propagated through. >> >> We have a case like this: >> `Phi -> ShiftL -> CastII -> AndI` >> >> The Phi has an updated type, so we should re-run Value on the AndI. 
>> >> In PhaseCCP::push_and, we do update a similar pattern: >> `X -> ShiftL -> AndI` >> >> I extended it to handle this pattern: >> `parent -> LShift (use) -> ConstraintCast* -> And` >> >> For this, I implemented: >> https://github.com/openjdk/jdk/blob/26f4adaae901822bea984b926c06d1a78f9c6b48/src/hotspot/share/opto/castnode.hpp#L73-L78 >> >> I could refactor code from a previous similar fix, for pattern: `ConstraintCast+ -> Sub/Phi` >> >> **Discussion** >> >> https://github.com/openjdk/jdk/blob/4d350f8f4eaabb18482c7656cb56a734e60187cf/src/hotspot/share/opto/castnode.hpp#L78-L79 >> I would have liked to place a `ResourceMark` between these two lines, to ensure the `internals` data structure is de-allocated after the traversal. But if I add it there, then one cannot modify any outer data-structure, or else one risks re-allocation of the outer data-structure in the inner ResourceMark, and then this memory gets de-allocated once the ResourceMark is cleared, and the outer data-structure is broken. This would for example mean that I could not push to the IGVN worklist inside the callback. >> >> Not having the ResourceMark means a memory leak, until the compile phase is over. But my code is not the only place, there are lots of places where we create a Resource allocated data-structure, but do not use ResourceMarks. > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > CCP worklist on local arena src/hotspot/share/opto/phaseX.cpp line 1961: > 1959: // Push root onto worklist > 1960: worklist.push(C->root()); > 1961: DEBUG_ONLY(Unique_Node_List worklist_verify;) Should you put `worklist_verify` to `local_arena` too? 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/13611#discussion_r1180724608 From qamai at openjdk.org Sat Apr 29 02:19:23 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Sat, 29 Apr 2023 02:19:23 GMT Subject: RFR: 8306706: Support out-of-line code generation for MachNodes [v2] In-Reply-To: <3haQdXHxlUHKAqi4MNWaVz3gVcFB9M8A20tGPQIok3c=.940d6d13-9764-449a-a9e1-36247f08b68e@github.com> References: <3haQdXHxlUHKAqi4MNWaVz3gVcFB9M8A20tGPQIok3c=.940d6d13-9764-449a-a9e1-36247f08b68e@github.com> Message-ID: > Hi, > > This patch adds supports for MachNodes to emit an out-of-line piece of code in the stub section of the compiled method. This allows the separation of the uncommon path from the common one, which speeds up the common path a little bit and increases compiled code density. Please take a look and leave reviews. > > Thanks a lot. Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: add benchmark ------------- Changes: - all: https://git.openjdk.org/jdk/pull/13602/files - new: https://git.openjdk.org/jdk/pull/13602/files/4e317bbb..3b13e9e6 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=13602&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13602&range=00-01 Stats: 91 lines in 1 file changed: 91 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/13602.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/13602/head:pull/13602 PR: https://git.openjdk.org/jdk/pull/13602 From aph at openjdk.org Sun Apr 30 10:07:24 2023 From: aph at openjdk.org (Andrew Haley) Date: Sun, 30 Apr 2023 10:07:24 GMT Subject: RFR: JDK-8305782: Provide MacroAssembler::breakpoint on aarch64 [v2] In-Reply-To: References: Message-ID: On Tue, 11 Apr 2023 06:55:30 GMT, Thomas Stuefe wrote: >> The ability to emit debug traps was useful for me on arm, and I miss it on aarch64. >> >> Tested manually on Linux aarch64 in gdb with various values for hint covering the whole 16-bit range set. 
Hint gets encoded in the instruction (gdb decodes instruction as "BRK xxx" with xxx being the hint). According to documentation the hint ends up in ESR.ELx.ISS after the trap hit, but gdb refused to display the ESR register, so I could not verify that.
>
> Thomas Stuefe has updated the pull request incrementally with one additional commit since the last revision:
>
>   reuse Assembler::brk

The best way is like this:

```
address poo;

void stuff() {
  ... instructions ...
  poo = pc();
  ... instructions ...
}
```

then, in gdb:

```
(gdb) b
breakpoint 2 at 0xabcedf
(gdb) comm 2
Type commands for breakpoint(s) 2, one per line.
End with a line saying just "end".
>b *poo
>c
>end
(gdb)
```

Now, every time you run the program, a breakpoint will be set at `*poo`. And you can continue after the breakpoint.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/13401#issuecomment-1528984092

From eosterlund at openjdk.org Sun Apr 30 12:04:59 2023
From: eosterlund at openjdk.org (Erik Österlund)
Date: Sun, 30 Apr 2023 12:04:59 GMT
Subject: RFR: JDK-8305782: Provide MacroAssembler::breakpoint on aarch64 [v2]
In-Reply-To:
References:
Message-ID:

On Sun, 30 Apr 2023 09:53:35 GMT, Andrew Haley wrote:

> The best way is like this:
>
> ```
> address poo;
>
> void stuff() {
>   ... instructions ...
>   poo = pc();
>   ... instructions ...
> }
> ```
>
> then, in gdb:
>
> ```
> (gdb) b
> breakpoint 2 at 0xabcedf
> (gdb) comm 2
> Type commands for breakpoint(s) 2, one per line.
> End with a line saying just "end".
> >b *poo
> >c
> >end
> (gdb)
> ```
>
> Now, every time you run the program, a breakpoint will be set at `*poo`. And you can continue after the breakpoint.

That doesn't work with code relocation, does it?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/13401#issuecomment-1529007868

From gcao at openjdk.org Sun Apr 30 16:13:53 2023
From: gcao at openjdk.org (Gui Cao)
Date: Sun, 30 Apr 2023 16:13:53 GMT
Subject: RFR: 8306966: RISC-V: Support vector cast node for Vector API [v5]
In-Reply-To:
References:
Message-ID: <3olpwS6aeI1iq5NQC5jbM62Hq2-LGg1ait21_9yHfas=.da841ea6-90ce-4fdb-8136-cf1da00b4e8c@github.com>

> Hi,
>
> We have added some implementations related to vector cast, written with reference to RVV v1.0 [1]. Please take a look and review. Thanks a lot.
>
> We can use VectorReshapeTests.java [2] to print the compilation log, and to verify and observe the generation of nodes.
>
> For example, we can use the following command to print the compilation log of a jtreg test case:
>
> ```
> /home/zifeihan/jdk-tools/jtreg/bin/jtreg \
> -v:default \
> -concurrency:16 -timeout:50 \
> -javaoption:-XX:+UnlockExperimentalVMOptions \
> -javaoption:-XX:+UseRVV \
> -javaoption:-XX:+PrintOptoAssembly \
> -javaoption:-XX:LogFile=/home/zifeihan/jdk-rvv/VectorReshapeTests_PrintOptoAssembly_20230426.log \
> -jdk:/home/zifeihan/jdk-rvv/build/linux-riscv64-server-fastdebug/jdk \
> -compilejdk:/home/zifeihan/jdk-rvv/build/linux-x86_64-server-release/images/jdk \
> /home/zifeihan/jdk/test/jdk/jdk/incubator/vector/VectorReshapeTests.java
> ```
>
> #### VectorCast/VectorCastB2X/VectorCastD2X/VectorCastF2X/VectorCastI2X/VectorCastL2X/VectorCastS2X
> There are too many nodes here to show them all; the following is the log for the `VectorCastB2X` node:
>
> ```
> 1ba0 ld R28, [R23, #280] # ptr, #@loadP
> 1ba4 addi R29, R7, #32 # ptr, #@addP_reg_imm
> 1ba8 reinterpretResize V1, V5
> 1bb0 vcvtBtoX V4, V1
> 1bb8 far_bgeu R29, R28, B465 #@far_cmpP_branch P=0.000100 C=-1.000000
> ```
>
> #### VectorRearrange
>
> When the original vector is converted to the target vector, if the actual number of elements of the original vector is greater than the number of elements of the target vector, a slicing action is performed to provide data for subsequent cast nodes. The slicing action depends on the VectorRearrange node.
>
> The compilation log for the `VectorRearrange` node:
>
> ```
> 1f6 spill R7 -> [sp, #320] # spill size = 64
> 1f8 spill [sp, #128] -> V1 # vector spill size = 256
> 200 spill [sp, #160] -> V2 # vector spill size = 256
> 208 rearrange V3, V1, V2
> 210 spill V3 -> [sp, #96] # vector spill size = 256
> 218 li R11, #4 # int, #@loadConI
> ```
>
> #### VectorReinterpret
> If num_elem_from and num_elem_to are not equal, Reinterpret is needed to reset the correct number.
> https://github.com/openjdk/jdk/blob/3554e7a3ffb879c7e5ef7547eb053e484d09d12b/src/hotspot/share/opto/vectorIntrinsics.cpp#L2374-L2376
> The compilation log for the `VectorReinterpret` node:
>
> ```
> 1218 spill [sp, #32] -> V4 # vector spill size = 256
> 1220 spill [sp, #176] -> V3 # vector spill size = 256
> 1228 rearrange V2, V4, V3
> 1230 spill [sp, #72] -> V0 # vmask spill size = 32
> 123c vmerge_vvm V1, V1, V2, v0 #@vector blend
> 1244 reinterpretResize V2, V1
> 124c vcvtStoX_extend V5, V2
> 1254 bgeu R28, R7, B169 #@cmpP_branch P=0.000100 C=-1.000000
> ```
>
> #### LShiftCntV/RShiftCntV
>
> We have merged the `LShiftCntV` and `RShiftCntV` nodes and support boolean types.
>
> The compilation log for the LShiftCntV/RShiftCntV node:
>
> ```
> 24c vasrB V3, V1, V2
> 260 storeV [R19], V3 # vector (rvv)
> 268 lbu R19, [R29, #48] # byte, #@loadUB
> 26c andi R19, R19, #7 #@andI_reg_imm
> 270 loadV V1, [R25] # vector (rvv)
> 278 vshiftcnt V2, R19
> 280 vasrB V3, V1, V2
> 294 storeV [R26], V3 # vector (rvv)
> 29c lbu R19, [R29, #80] # byte, #@loadUB
> 2a0 andi R19, R19, #7 #@andI_reg_imm
> 2a4 loadV V1, [R22] # vector (rvv)
> 2ac vshiftcnt V2, R19
> ```
>
> [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc
> [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/VectorReshapeTests.java
>
> Testing:
> qemu with UseRVV:
>
> - [ ] Tier1 tests (release)
> - [ ] Tier2 tests (release)
> - [ ] Tier3 tests (release)
> - [x] test/jdk/jdk/incubator/vector (fastdebug)

Gui Cao has updated the pull request incrementally with one additional commit since the last revision:

  Small refactoring of rvv_vsetvli

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/13684/files
  - new: https://git.openjdk.org/jdk/pull/13684/files/3554e7a3..586bce12

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=13684&range=04
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13684&range=03-04

  Stats: 261 lines in 4 files changed: 15 ins; 40 del; 206 mod
  Patch: https://git.openjdk.org/jdk/pull/13684.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/13684/head:pull/13684

PR: https://git.openjdk.org/jdk/pull/13684