From jiefu at openjdk.org Fri Dec 1 01:33:09 2023 From: jiefu at openjdk.org (Jie Fu) Date: Fri, 1 Dec 2023 01:33:09 GMT Subject: RFR: 8321141: VM build issue on MacOS after JDK-8267532 In-Reply-To: References: Message-ID: On Thu, 30 Nov 2023 23:29:48 GMT, Vladimir Kozlov wrote: > [JDK-8267532](https://bugs.openjdk.org/browse/JDK-8267532) added a new method `ciMethodData::exception_handler_bci_to_data()` which has a `ShouldNotReachHere()` call on one exit without returning any value. > > Unfortunately, the `ATTRIBUTE_NORETURN` attribute in `ShouldNotReachHere()` does not seem to work with older versions of Xcode. > > The fix is to add a `return` statement. > > Tested with old Xcode (12.4) I have. Also tested with tier1 which includes builds. LGTM ------------- Marked as reviewed by jiefu (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/16914#pullrequestreview-1758849463 From kvn at openjdk.org Fri Dec 1 03:25:08 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 1 Dec 2023 03:25:08 GMT Subject: RFR: 8321141: VM build issue on MacOS after JDK-8267532 In-Reply-To: References: Message-ID: On Fri, 1 Dec 2023 01:30:28 GMT, Jie Fu wrote: >> [JDK-8267532](https://bugs.openjdk.org/browse/JDK-8267532) added a new method `ciMethodData::exception_handler_bci_to_data()` which has a `ShouldNotReachHere()` call on one exit without returning any value. >> >> Unfortunately, the `ATTRIBUTE_NORETURN` attribute in `ShouldNotReachHere()` does not seem to work with older versions of Xcode. >> >> The fix is to add a `return` statement. >> >> Tested with old Xcode (12.4) I have. Also tested with tier1 which includes builds. > > LGTM Thank you, @DamonFool, for the review.
------------- PR Comment: https://git.openjdk.org/jdk/pull/16914#issuecomment-1835393588 From kvn at openjdk.org Fri Dec 1 03:38:13 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 1 Dec 2023 03:38:13 GMT Subject: RFR: 8321141: VM build issue on MacOS after JDK-8267532 In-Reply-To: References: Message-ID: On Thu, 30 Nov 2023 23:29:48 GMT, Vladimir Kozlov wrote: > [JDK-8267532](https://bugs.openjdk.org/browse/JDK-8267532) added a new method `ciMethodData::exception_handler_bci_to_data()` which has a `ShouldNotReachHere()` call on one exit without returning any value. > > Unfortunately, the `ATTRIBUTE_NORETURN` attribute in `ShouldNotReachHere()` does not seem to work with older versions of Xcode. > > The fix is to add a `return` statement. > > Tested with old Xcode (12.4) I have. Also tested with tier1 which includes builds. The `java/util/stream/GathererTest.java` test failure in GHA on 32-bit linux-x86 seems related to [JDK-8321124](https://bugs.openjdk.org/browse/JDK-8321124). I consider this change trivial and will push it. ------------- PR Comment: https://git.openjdk.org/jdk/pull/16914#issuecomment-1835401026 PR Comment: https://git.openjdk.org/jdk/pull/16914#issuecomment-1835401266 From kvn at openjdk.org Fri Dec 1 03:38:14 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 1 Dec 2023 03:38:14 GMT Subject: Integrated: 8321141: VM build issue on MacOS after JDK-8267532 In-Reply-To: References: Message-ID: <0rZbkDXz8NyNYPcurSOctS5SSyodT4R8O56cloCoZ-A=.e8a8733e-2711-4ea4-a7c9-f07b09afb885@github.com> On Thu, 30 Nov 2023 23:29:48 GMT, Vladimir Kozlov wrote: > [JDK-8267532](https://bugs.openjdk.org/browse/JDK-8267532) added a new method `ciMethodData::exception_handler_bci_to_data()` which has a `ShouldNotReachHere()` call on one exit without returning any value. > > Unfortunately, the `ATTRIBUTE_NORETURN` attribute in `ShouldNotReachHere()` does not seem to work with older versions of Xcode. > > The fix is to add a `return` statement.
> > Tested with old Xcode (12.4) I have. Also tested with tier1 which includes builds. This pull request has now been integrated. Changeset: 02ffab1a Author: Vladimir Kozlov URL: https://git.openjdk.org/jdk/commit/02ffab1a4d9e1209f3f1da715acae975e0754551 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod 8321141: VM build issue on MacOS after JDK-8267532 Reviewed-by: jiefu ------------- PR: https://git.openjdk.org/jdk/pull/16914 From chagedorn at openjdk.org Fri Dec 1 07:47:17 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 1 Dec 2023 07:47:17 GMT Subject: RFR: 8321107: Add more test cases for JDK-8319372 In-Reply-To: <1Ywfesy0UBjk9sVFZEgqsKSW7P37EWMINQFEirb6pLQ=.32fed60e-8f20-474f-a078-fd2f2f554550@github.com> References: <1Ywfesy0UBjk9sVFZEgqsKSW7P37EWMINQFEirb6pLQ=.32fed60e-8f20-474f-a078-fd2f2f554550@github.com> Message-ID: On Thu, 30 Nov 2023 14:34:36 GMT, Christian Hagedorn wrote: > This PR adds the remaining failing test cases of bugs linked to [JDK-8321097](https://bugs.openjdk.org/browse/JDK-8321097) which have now been fixed with [JDK-8319372](https://bugs.openjdk.org/browse/JDK-8319372). > > Thanks, > Christian Thanks Vladimir for your review! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/16902#issuecomment-1835624865 From chagedorn at openjdk.org Fri Dec 1 07:47:19 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 1 Dec 2023 07:47:19 GMT Subject: Integrated: 8321107: Add more test cases for JDK-8319372 In-Reply-To: <1Ywfesy0UBjk9sVFZEgqsKSW7P37EWMINQFEirb6pLQ=.32fed60e-8f20-474f-a078-fd2f2f554550@github.com> References: <1Ywfesy0UBjk9sVFZEgqsKSW7P37EWMINQFEirb6pLQ=.32fed60e-8f20-474f-a078-fd2f2f554550@github.com> Message-ID: <5cc7CL-wxQIWgp8-WZ9uw_AnGFJGC_eercJsXhae-io=.2cb6db91-9bc9-4cad-baf6-2e63589463b7@github.com> On Thu, 30 Nov 2023 14:34:36 GMT, Christian Hagedorn wrote: > This PR adds the remaining failing test cases of bugs linked to [JDK-8321097](https://bugs.openjdk.org/browse/JDK-8321097) which have now been fixed with [JDK-8319372](https://bugs.openjdk.org/browse/JDK-8319372). > > Thanks, > Christian This pull request has now been integrated. Changeset: ecd335d8 Author: Christian Hagedorn URL: https://git.openjdk.org/jdk/commit/ecd335d8f42757d332f217e220e1a9db8c48c8d6 Stats: 134 lines in 1 file changed: 134 ins; 0 del; 0 mod 8321107: Add more test cases for JDK-8319372 Reviewed-by: roland, kvn ------------- PR: https://git.openjdk.org/jdk/pull/16902 From jbhateja at openjdk.org Fri Dec 1 08:39:21 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 1 Dec 2023 08:39:21 GMT Subject: RFR: 8319111: Mismatched MemorySegment heap access is not consistently intrinsified [v2] In-Reply-To: References: Message-ID: > Patch enables intrinsification of fromMemorySegment, intoMemorySegment APIs and their masked variants for mismatched memory segments i.e. heap based memory segments whose backing storage type differs from the vector type in which they are loaded to or stored from. > > A load from a mismatched segment first moves the contents into type compatible vector followed by reinterpretation to desired vector type. 
This facilitates value forwarding from a preceding vector store, as alias indices are computed using the backing storage type. > > Mismatched masked vector loads and stores are performed at byte granularity; this handles both narrowing and widening scenarios, where the vector lane size is smaller than the backing storage element type and vice versa. > > The following are the performance numbers of an existing JMH micro. > > ![image](https://github.com/openjdk/jdk/assets/59989778/a0b177af-78ca-4ac8-b6b0-bfe3655b16a6) > > Please review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Review suggestions incorporated. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16888/files - new: https://git.openjdk.org/jdk/pull/16888/files/a8df0c99..935f4e07 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16888&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16888&range=00-01 Stats: 185 lines in 13 files changed: 8 ins; 4 del; 173 mod Patch: https://git.openjdk.org/jdk/pull/16888.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16888/head:pull/16888 PR: https://git.openjdk.org/jdk/pull/16888 From jbhateja at openjdk.org Fri Dec 1 08:39:23 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 1 Dec 2023 08:39:23 GMT Subject: RFR: 8319111: Mismatched MemorySegment heap access is not consistently intrinsified [v2] In-Reply-To: <0rIPEtQHzBJ305hTlGk1NW8Ixze3FxaBpgD-t1QO3vk=.1e425ce3-bee5-4e47-964a-815e9ed073bd@github.com> References: <8jI7MRXEtEo3bQNn8gACp9H1Sy0HiSyqMVIhLvGAZII=.cb6a1c93-155c-4aa6-9c4c-f11a83bf3d56@github.com> <0rIPEtQHzBJ305hTlGk1NW8Ixze3FxaBpgD-t1QO3vk=.1e425ce3-bee5-4e47-964a-815e9ed073bd@github.com> Message-ID: On Thu, 30 Nov 2023 22:57:29 GMT, Sandhya Viswanathan wrote: >> src/hotspot/share/opto/vectorIntrinsics.cpp line 1244: >> >>> 1242: >>> 1243: int mem_num_elem = mismatched_ms ? 
num_elem * type2aelembytes(elem_bt) : num_elem; >>> 1244: BasicType mem_elem_bt = mismatched_ms ? T_BYTE : elem_bt; >> >> Shouldn't the mem_elem_bt come from arr_type->elem()->array_element_basic_type()? >> Also the mem_num_elem be calculated accordingly? > > The non masked load/store is doing that but the masked load/store is using T_BYTE. Yes, that's what the intent is, for masked versions we operate at byte granularity to support widening and narrowing case e.g. loading masked float vector from a double memory segment, mask bit is in accordance with float type, but cannot be directly reflected over double lanes. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16888#discussion_r1411776496 From fjiang at openjdk.org Fri Dec 1 11:03:28 2023 From: fjiang at openjdk.org (Feilong Jiang) Date: Fri, 1 Dec 2023 11:03:28 GMT Subject: RFR: 8320697: RISC-V: Small refactoring for runtime calls [v4] In-Reply-To: References: Message-ID: > Hi, please review this refactoring for runtime calls. > Major changes: > 1. Unified the runtime calls with the existing MacroAssembler::rt_call. This will remove the duplicate code like `relocate(target.rspec() [&] {...}` to emit uncompressed instructions. > 2. Removed MacroAssembler::far_branches and made the call sites default to far branches. `branch_range` is 1MB for riscv, and `ReservedCodeCacheSize` will always bigger than `branch_range` in practice. We should remove this unnecessary check and simplify the code logic. > 3. Renamed MacroAssembler::la_patchable with MacroAssembler::la making it less confusing. > 4. `far_call` in `rt_call` should use `tmp` instead of the default temporary register `t0` > 5. Removed some unused codes in `g1BarrierSetAssembler_riscv.cpp` > > > Testing: > - [x] Tier1-3 tested on hifive unmatched board (release) > - [x] Run non-trivial benchmark workloads (fastdebug) Feilong Jiang has updated the pull request with a new target base due to a merge or a rebase. 
The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: - Merge branch 'master' of https://github.com/openjdk/jdk into JDK-8320697 - adjust format - remove unnecessary relocate - Rename la_patchable with la - RISC-V: Small refactoring for external and runtime calls ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16816/files - new: https://git.openjdk.org/jdk/pull/16816/files/fee7449a..f94ece09 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16816&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16816&range=02-03 Stats: 19803 lines in 677 files changed: 14322 ins; 3283 del; 2198 mod Patch: https://git.openjdk.org/jdk/pull/16816.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16816/head:pull/16816 PR: https://git.openjdk.org/jdk/pull/16816 From chagedorn at openjdk.org Fri Dec 1 12:54:12 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 1 Dec 2023 12:54:12 GMT Subject: RFR: 8310711: [IR Framework] Remove safepoint while printing handling Message-ID: This clean-up PR removes the handling of the `` message in the IR framework. It is no longer required since we dump the output of `PrintIdeal` to the hotspot_pid file differently since [JDK-8306922](https://bugs.openjdk.org/browse/JDK-8306922). There is no interrupting `` message anymore. I removed the corresponding now unneeded code together with the previously added test case for it. 
Testing: tier1-4 Thanks, Christian ------------- Commit messages: - 8310711: [IR Framework] Remove safepoint while printing handling Changes: https://git.openjdk.org/jdk/pull/16921/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=16921&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8310711 Stats: 459 lines in 6 files changed: 0 ins; 457 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/16921.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16921/head:pull/16921 PR: https://git.openjdk.org/jdk/pull/16921 From mli at openjdk.org Fri Dec 1 15:31:16 2023 From: mli at openjdk.org (Hamlin Li) Date: Fri, 1 Dec 2023 15:31:16 GMT Subject: RFR: 8321001: RISC-V: C2 SignumVF Message-ID: Hi, Can you review the patch to add the SignumVF/SignumVD intrinsics on RISC-V? Thanks ## Test test/hotspot/jtreg/compiler/intrinsics/ and tests found: grep -nr test/hotspot/jtreg/ -we Math.signum and tests found: grep -nr test/jdk/ -we Math.signum ------------- Commit messages: - Initial commit Changes: https://git.openjdk.org/jdk/pull/16925/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=16925&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8321001 Stats: 39 lines in 4 files changed: 39 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/16925.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16925/head:pull/16925 PR: https://git.openjdk.org/jdk/pull/16925 From roland at openjdk.org Fri Dec 1 16:36:08 2023 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 1 Dec 2023 16:36:08 GMT Subject: RFR: 8305638: Refactor Template Assertion Predicate Bool creation and Predicate code in Split If and Loop Unswitching In-Reply-To: References: Message-ID: On Wed, 29 Nov 2023 08:42:41 GMT, Christian Hagedorn wrote: > This patch is intended for JDK 23. 
> > While preparing the patch for the full fix for Assertion Predicates [JDK-8288981](https://bugs.openjdk.org/browse/JDK-8288981), I still noticed that some changes are not required for the actual fix and could be split off and reviewed separately in this PR. > > The patch applies the following cleanup changes: > - The complete fix had to add slightly different cloning cases in `PhaseIdealLoop::create_bool_from_template_assertion_predicate()` which already has quite some logic to switch between different cases. Additionally, the algorithm in the method itself was already hard to understand and difficult to adapt. I therefore re-implemented it in a separate class `CloneTemplateAssertionPredicateBool` together with some helper classes like `DFSNodeStack`. To use it, I've added a `TemplateAssertionPredicateBool` class that offers three cloning possibilities: > - `clone()`: Clone without modification > - `clone_and_replace_opaque_loop_nodes()`: Clone and replace the `OpaqueLoop*Nodes` with a new init and stride node. > - `clone_and_replace_init()`: Special case of `clone_and_replace_opaque_loop_nodes()` which only replaces `OpaqueLoopInitNode` and clones `OpaqueLoopStrideNode`. > > This refactoring could be extracted from the complete fix. > - The Split If code to detect (`subgraph_has_opaque()`) and clone Template Assertion Predicate Bools was extracted to a separate class `CloneTemplateAssertionPredicateBoolDown` and uses the new `TemplateAssertionPredicateBool` class to do the actual cloning. > - In the process of coding the complete fix, I've refactored the Loop Unswitching code quite a bit. This change could also be extracted into a separate RFE. Changes include: > - Renaming > - Extracting code to separate classes/methods > - Adding comments > - Some small refactoring including: > - Removing unused parameters > - Renaming variables/parameters/methods > > Thanks, > Christian Looks reasonable to me. ------------- Marked as reviewed by roland (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/16877#pullrequestreview-1760188526 From sviswanathan at openjdk.org Fri Dec 1 21:30:35 2023 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Fri, 1 Dec 2023 21:30:35 GMT Subject: RFR: 8319111: Mismatched MemorySegment heap access is not consistently intrinsified [v2] In-Reply-To: References: <8jI7MRXEtEo3bQNn8gACp9H1Sy0HiSyqMVIhLvGAZII=.cb6a1c93-155c-4aa6-9c4c-f11a83bf3d56@github.com> <0rIPEtQHzBJ305hTlGk1NW8Ixze3FxaBpgD-t1QO3vk=.1e425ce3-bee5-4e47-964a-815e9ed073bd@github.com> Message-ID: On Fri, 1 Dec 2023 08:36:18 GMT, Jatin Bhateja wrote: >> The non masked load/store is doing that but the masked load/store is using T_BYTE. > > Yes, that's what the intent is, for masked versions we operate at byte granularity to support widening and narrowing case e.g. loading masked float vector from a double memory segment, mask bit is in accordance with float type, but cannot be directly reflected over double lanes. Using byte type to write irrespective of memory segment type would work on little endian architectures but not on big endian architectures. On big endian the memory load/store should use the associated memory segment element type so a reinterpret node is needed as is done in non masked case. Alternatively at the minimum a check that the underlying architecture is little endian would be good to add for the masked intrinsic to succeed for the new cases. I think something like ((Endian::NATIVE == Endian::LITTLE) would do that. 
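The byte-order concern raised here can be illustrated outside HotSpot with a small plain-Java sketch. It is not taken from the PR: the class `EndianViewSketch` and the method `longAsInts` are hypothetical names, and `ByteBuffer` only stands in for a memory segment whose 8-byte elements are viewed as 4-byte lanes. The point is that the narrow lane at index 0 covers the low half of the wide element under little-endian order and the high half under big-endian order, so a per-lane mask computed for the narrow type would not select the same bytes on both kinds of platform.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class EndianViewSketch {
    /**
     * Stores one long using the given byte order, then reads the same
     * eight bytes back as two ints using that same byte order.
     */
    static int[] longAsInts(long value, ByteOrder order) {
        ByteBuffer buf = ByteBuffer.allocate(Long.BYTES).order(order);
        buf.putLong(0, value);
        return new int[] { buf.getInt(0), buf.getInt(Integer.BYTES) };
    }

    public static void main(String[] args) {
        long v = 0x0011223344556677L;
        int[] le = longAsInts(v, ByteOrder.LITTLE_ENDIAN);
        int[] be = longAsInts(v, ByteOrder.BIG_ENDIAN);
        // Little endian: lane 0 holds the low 32 bits of the long.
        System.out.printf("little endian: %08x %08x%n", le[0], le[1]);
        // Big endian: lane 0 holds the high 32 bits of the long.
        System.out.printf("big endian:    %08x %08x%n", be[0], be[1]);
    }
}
```

This only models the memory layout; it says nothing about how C2 actually emits the masked loads and stores.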
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16888#discussion_r1412588376 From jbhateja at openjdk.org Fri Dec 1 21:54:33 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 1 Dec 2023 21:54:33 GMT Subject: RFR: 8319111: Mismatched MemorySegment heap access is not consistently intrinsified [v2] In-Reply-To: References: <8jI7MRXEtEo3bQNn8gACp9H1Sy0HiSyqMVIhLvGAZII=.cb6a1c93-155c-4aa6-9c4c-f11a83bf3d56@github.com> <0rIPEtQHzBJ305hTlGk1NW8Ixze3FxaBpgD-t1QO3vk=.1e425ce3-bee5-4e47-964a-815e9ed073bd@github.com> Message-ID: <_G7vBwmMjgmcSHc5Seh50wMBLQt0kGzOdOWcuuMz6UI=.9ba6d270-dc7d-439b-a522-a769958c13aa@github.com> On Fri, 1 Dec 2023 21:28:14 GMT, Sandhya Viswanathan wrote: >> Yes, that's what the intent is, for masked versions we operate at byte granularity to support widening and narrowing case e.g. loading masked float vector from a double memory segment, mask bit is in accordance with float type, but cannot be directly reflected over double lanes. > > Using byte type to write irrespective of memory segment type would work on little endian architectures but not on big endian architectures. On big endian the memory load/store should use the associated memory segment element type so a reinterpret node is needed as is done in non masked case. > Alternatively at the minimum a check that the underlying architecture is little endian would be good to add for the masked intrinsic to succeed for the new cases. I think something like ((Endian::NATIVE == Endian::LITTLE) would do that. Any rearrangement for endianness is done prior to storing or after loading data into vectors. So actual load / store operations is agnostic to target endianness. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16888#discussion_r1412602738 From jbhateja at openjdk.org Sat Dec 2 07:19:17 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Sat, 2 Dec 2023 07:19:17 GMT Subject: RFR: 8319111: Mismatched MemorySegment heap access is not consistently intrinsified [v3] In-Reply-To: References: Message-ID: > This patch enables intrinsification of the fromMemorySegment and intoMemorySegment APIs and their masked variants for mismatched memory segments, i.e. heap-based memory segments whose backing storage type differs from the type of the vector they are loaded into or stored from. > > A load from a mismatched segment first moves the contents into a type-compatible vector, followed by reinterpretation to the desired vector type. This facilitates value forwarding from a preceding vector store, as alias indices are computed using the backing storage type. > > Mismatched masked vector loads and stores are performed at byte granularity; this handles both narrowing and widening scenarios, where the vector lane size is smaller than the backing storage element type and vice versa. > > The following are the performance numbers of an existing JMH micro. > > ![image](https://github.com/openjdk/jdk/assets/59989778/a0b177af-78ca-4ac8-b6b0-bfe3655b16a6) > > Please review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Review comment resolution. 
------------- Changes: - all: https://git.openjdk.org/jdk/pull/16888/files - new: https://git.openjdk.org/jdk/pull/16888/files/935f4e07..19a14f08 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16888&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16888&range=01-02 Stats: 3 lines in 1 file changed: 3 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/16888.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16888/head:pull/16888 PR: https://git.openjdk.org/jdk/pull/16888 From jbhateja at openjdk.org Sat Dec 2 07:53:13 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Sat, 2 Dec 2023 07:53:13 GMT Subject: RFR: 8319111: Mismatched MemorySegment heap access is not consistently intrinsified [v4] In-Reply-To: References: Message-ID: > This patch enables intrinsification of the fromMemorySegment and intoMemorySegment APIs and their masked variants for mismatched memory segments, i.e. heap-based memory segments whose backing storage type differs from the type of the vector they are loaded into or stored from. > > A load from a mismatched segment first moves the contents into a type-compatible vector, followed by reinterpretation to the desired vector type. This facilitates value forwarding from a preceding vector store, as alias indices are computed using the backing storage type. > > Mismatched masked vector loads and stores are performed at byte granularity; this handles both narrowing and widening scenarios, where the vector lane size is smaller than the backing storage element type and vice versa. > > The following are the performance numbers of an existing JMH micro. > > ![image](https://github.com/openjdk/jdk/assets/59989778/a0b177af-78ca-4ac8-b6b0-bfe3655b16a6) > > Please review and share your feedback. 
> > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Correcting BIG_ENDIAN_ONLY check ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16888/files - new: https://git.openjdk.org/jdk/pull/16888/files/19a14f08..ec0dba61 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16888&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16888&range=02-03 Stats: 3 lines in 1 file changed: 0 ins; 2 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/16888.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16888/head:pull/16888 PR: https://git.openjdk.org/jdk/pull/16888 From jbhateja at openjdk.org Sat Dec 2 07:53:13 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Sat, 2 Dec 2023 07:53:13 GMT Subject: RFR: 8319111: Mismatched MemorySegment heap access is not consistently intrinsified [v4] In-Reply-To: References: <8jI7MRXEtEo3bQNn8gACp9H1Sy0HiSyqMVIhLvGAZII=.cb6a1c93-155c-4aa6-9c4c-f11a83bf3d56@github.com> <0rIPEtQHzBJ305hTlGk1NW8Ixze3FxaBpgD-t1QO3vk=.1e425ce3-bee5-4e47-964a-815e9ed073bd@github.com> Message-ID: On Fri, 1 Dec 2023 21:28:14 GMT, Sandhya Viswanathan wrote: > Using byte type to write irrespective of memory segment type would work on little endian architectures but not on big endian architectures. On big endian the memory load/store should use the associated memory segment element type so a reinterpret node is needed as is done in non masked case. Alternatively at the minimum a check that the underlying architecture is little endian would be good to add for the masked intrinsic to succeed for the new cases. I think something like ((Endian::NATIVE == Endian::LITTLE) would do that. I agree; limiting this to little endian. BTW, reinterpretation nodes are already in place before stores and after loads, but reinterpretation of a vector value is different from architecture-level byte swizzling before loading memory contents into a vector. 
On big-endian architectures if we load 64 bit memory (laid out as [Hi-Addr] 0,1,2,3,4,5,6,7 [Lo-Addr] bytes in memory) into two integers then in case1) first int should hold 3:0 bytes and second one 7:4 byte in strict sense, however, case2) one may also argue if we see long memory layout from an integer view, then first integer value should have 7:4 byte and second one 3:0 byte. These are mismatched segment accesses backed by primitive arrays but being loaded into a vector of different type hence an unsafe semantics of case2 should be applicable. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16888#discussion_r1412751061 From duke at openjdk.org Sat Dec 2 22:10:37 2023 From: duke at openjdk.org (serge-sans-paille) Date: Sat, 2 Dec 2023 22:10:37 GMT Subject: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v6] In-Reply-To: <-lVL7hp68aKNlBWHCxdKdPrPDe6NOAKD3zoX-u5ZMEM=.6164666f-7312-4f0c-b15b-2e01a331e820@github.com> References: <-lVL7hp68aKNlBWHCxdKdPrPDe6NOAKD3zoX-u5ZMEM=.6164666f-7312-4f0c-b15b-2e01a331e820@github.com> Message-ID: On Wed, 29 Nov 2023 06:12:54 GMT, David Holmes wrote: >> Not listed here: https://oca.opensource.oracle.com/?ojr=contributors > > I take it he is not an Intel employee, in which case he has to be an OpenJDK contributor himself for code for which he holds a copyright to be contributed to OpenJDK. @robilad please correct me if I am wrong here. hey o/ No problem on my side to either let go my copyright or fill the contributor agreement (where is it?) 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16534#discussion_r1409280698 From robilad at openjdk.org Sat Dec 2 22:10:38 2023 From: robilad at openjdk.org (Dalibor Topic) Date: Sat, 2 Dec 2023 22:10:38 GMT Subject: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v6] In-Reply-To: References: <-lVL7hp68aKNlBWHCxdKdPrPDe6NOAKD3zoX-u5ZMEM=.6164666f-7312-4f0c-b15b-2e01a331e820@github.com> Message-ID: <6gIzR7OWDjVSklcQD-PM7ng_maqv6IbhAjHfgm-hWEs=.45910411-ab20-4ef0-99fa-88a13f722b7b@github.com> On Wed, 29 Nov 2023 13:26:58 GMT, serge-sans-paille wrote: >> I take it he is not an Intel employee, in which case he has to be an OpenJDK contributor himself for code for which he holds a copyright to be contributed to OpenJDK. @robilad please correct me if I am wrong here. > > hey o/ No problem on my side to either let go my copyright or fill the contributor agreement (where is it?) Thanks Serge! And thank you for sending in an OCA for processing, which has now been done. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16534#discussion_r1412885472 From dholmes at openjdk.org Mon Dec 4 03:20:38 2023 From: dholmes at openjdk.org (David Holmes) Date: Mon, 4 Dec 2023 03:20:38 GMT Subject: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v6] In-Reply-To: References: <-lVL7hp68aKNlBWHCxdKdPrPDe6NOAKD3zoX-u5ZMEM=.6164666f-7312-4f0c-b15b-2e01a331e820@github.com> Message-ID: On Wed, 29 Nov 2023 13:26:58 GMT, serge-sans-paille wrote: >> I take it he is not an Intel employee, in which case he has to be an OpenJDK contributor himself for code for which he holds a copyright to be contributed to OpenJDK. @robilad please correct me if I am wrong here. > > hey o/ No problem on my side to either let go my copyright or fill the contributor agreement (where is it?) Thank you @serge-sans-paille ! 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16534#discussion_r1413324819 From never at openjdk.org Mon Dec 4 05:41:53 2023 From: never at openjdk.org (Tom Rodriguez) Date: Mon, 4 Dec 2023 05:41:53 GMT Subject: RFR: 8321225: [JVMCI] HotSpotResolvedObjectTypeImpl.isLeafClass shouldn't create strong references Message-ID: Checking for leaf Klasses requires seeing if the subklass field is null. As part of the fix for JVMCI support for ZGC, JDK-8299229, it was changed to call into the runtime, which had the side effect of creating a strong reference to the class. Since it's only checking for non-null, it's OK to just perform the read directly, as was done prior to JDK-8299229. This avoids causing class unloading problems. ------------- Commit messages: - 8321225: [JVMCI] HotSpotResolvedObjectTypeImpl.isLeafClass shouldn't create strong references Changes: https://git.openjdk.org/jdk/pull/16943/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=16943&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8321225 Stats: 6 lines in 1 file changed: 5 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/16943.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16943/head:pull/16943 PR: https://git.openjdk.org/jdk/pull/16943 From fyang at openjdk.org Mon Dec 4 06:54:39 2023 From: fyang at openjdk.org (Fei Yang) Date: Mon, 4 Dec 2023 06:54:39 GMT Subject: RFR: 8321001: RISC-V: C2 SignumVF In-Reply-To: References: Message-ID: On Fri, 1 Dec 2023 15:24:35 GMT, Hamlin Li wrote: > Hi, > Can you review the patch to add the SignumVF/SignumVD intrinsics on RISC-V? > Thanks > > ## Test > test/hotspot/jtreg/compiler/intrinsics/ > test/hotspot/jtreg/compiler/vectorapi/ > and tests found via: > grep -nr test/hotspot/jtreg/ -we Math.signum > and tests found via: > grep -nr test/jdk/ -we Math.signum Hi, Did you check the C2 JIT code? I am wondering whether the newly-added code is covered well by the tests performed. 
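For reference, the coverage question can be made concrete with a minimal Java sketch of the kind of loop a test would need to contain. It is not from the PR: `SignumLoopSketch` and `signumAll` are hypothetical names, and whether C2 actually turns such a loop into `SignumVF` depends on the platform and flags in use, which is exactly what checking the JIT output would establish. The inputs include the special cases that `Math.signum` documents (NaN stays NaN, signed zeros are returned unchanged).

```java
import java.util.Arrays;

public class SignumLoopSketch {
    /**
     * Applies Math.signum element-wise. A simple counted loop of this
     * shape is the sort of code an auto-vectorizer could pick up.
     */
    static float[] signumAll(float[] in) {
        float[] out = new float[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = Math.signum(in[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        float[] in = { -2.5f, -0.0f, 0.0f, 3.0f, Float.NaN, Float.NEGATIVE_INFINITY };
        System.out.println(Arrays.toString(signumAll(in)));
    }
}
```

A real regression test would presumably also compare the compiled result against the interpreter, but that is beyond this sketch.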
src/hotspot/cpu/riscv/assembler_riscv.hpp line 1592: > 1590: INSN(vfsgnj_vf, 0b1010111, 0b101, 0b001000); > 1591: INSN(vfsgnjx_vf, 0b1010111, 0b101, 0b001010); > 1592: INSN(vfsgnjn_vf, 0b1010111, 0b101, 0b001001); Not used anywhere? src/hotspot/cpu/riscv/riscv_v.ad line 3670: > 3668: match(Set dst (SignumVF dst (Binary zero one))); > 3669: match(Set dst (SignumVD dst (Binary zero one))); > 3670: effect(TEMP_DEF dst); v0 is clobbered in `C2_MacroAssembler::signum_fp_v`. Shouldn't we add a `TEMP v0` to the effect? ------------- Changes requested by fyang (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/16925#pullrequestreview-1761692106 PR Review Comment: https://git.openjdk.org/jdk/pull/16925#discussion_r1413430803 PR Review Comment: https://git.openjdk.org/jdk/pull/16925#discussion_r1413430160 From vkempik at openjdk.org Mon Dec 4 06:54:43 2023 From: vkempik at openjdk.org (Vladimir Kempik) Date: Mon, 4 Dec 2023 06:54:43 GMT Subject: RFR: 8321001: RISC-V: C2 SignumVF In-Reply-To: References: Message-ID: <_Jo2yWlqtqmsrZxAJonVjXB1xPRlbPZwIxjNHaVAoTg=.eeea6a5b-8941-4563-9939-b30bc11916b2@github.com> On Fri, 1 Dec 2023 15:24:35 GMT, Hamlin Li wrote: > Hi, > Can you review the patch to add intrinisc SignumVF/SignumVD on riscv? > Thanks > > ## Test > test/hotspot/jtreg/compiler/intrinsics/ > test/hotspot/jtreg/compiler/vectorapi/ > and tests found via: > grep -nr test/hotspot/jtreg/ -we Math.signum > and test found via: > grep -nr test/jdk/ -we Math.signum src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1681: > 1679: void C2_MacroAssembler::signum_fp_v(VectorRegister dst, BasicType bt, int vlen, > 1680: VectorRegister zero, VectorRegister one) { > 1681: vsetvli_helper(bt, vlen); Can we have a situation where vlen times sew(bt) won't fit into h/w register ? 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16925#discussion_r1413434154 From xgong at openjdk.org Mon Dec 4 07:32:42 2023 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 4 Dec 2023 07:32:42 GMT Subject: RFR: 8319872: AArch64: [vectorapi] Implementation of unsigned (zero extended) casts [v4] In-Reply-To: References: Message-ID: On Wed, 22 Nov 2023 07:05:21 GMT, Eric Liu wrote: >> Vector API defines zero-extend operations [1], which are going to be intrinsified and generated to `VectorUCastNode` by C2. This patch adds backend implementation for `VectorUCastNode` on AArch64. >> >> The micro benchmark shows significant performance improvement. In my test machine (SVE, 256-bit), the result is shown as below: >> >> >> >> Benchmark Before After Units Gain >> VectorZeroExtend.byte2Int 3168.251 243012.399 ops/ms 75.70 >> VectorZeroExtend.byte2Long 3212.201 216291.588 ops/ms 66.33 >> VectorZeroExtend.byte2Short 3391.968 182655.365 ops/ms 52.85 >> VectorZeroExtend.int2Long 1012.197 80448.553 ops/ms 78.48 >> VectorZeroExtend.short2Int 1812.471 153416.828 ops/ms 83.65 >> VectorZeroExtend.short2Long 1788.382 129794.814 ops/ms 71.58 >> >> >> On other Neon systems, we can get similar performance boost as a result of intrinsification success. >> >> Since `VectorUCastNode` only used in Vector API's zero extension currently, this patch also adds assertion on nodes' definitions to clarify their usages. >> >> [TEST] >> compiler/vectorapi and jdk/incubator/vector passed on NEON and SVE machines. >> >> [1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/VectorOperators.java#L726 > > Eric Liu has updated the pull request incrementally with one additional commit since the last revision: > > small fix > > Change-Id: Icfe9619af1c9e7d5ea8cac457ccebb4eec5c34ad LGTM! ------------- Marked as reviewed by xgong (Committer). 
PR Review: https://git.openjdk.org/jdk/pull/16670#pullrequestreview-1761741783 From eliu at openjdk.org Mon Dec 4 08:17:56 2023 From: eliu at openjdk.org (Eric Liu) Date: Mon, 4 Dec 2023 08:17:56 GMT Subject: Integrated: 8319872: AArch64: [vectorapi] Implementation of unsigned (zero extended) casts In-Reply-To: References: Message-ID: <9H0QK5sdiSsreNc11CwQGpRnsW5p5__fHVrvmrLQMjs=.3599d4bd-0005-4db4-b11b-de2b9dff70dc@github.com> On Wed, 15 Nov 2023 07:48:28 GMT, Eric Liu wrote: > Vector API defines zero-extend operations [1], which are going to be intrinsified and generated to `VectorUCastNode` by C2. This patch adds backend implementation for `VectorUCastNode` on AArch64. > > The micro benchmark shows significant performance improvement. In my test machine (SVE, 256-bit), the result is shown as below: > > > > Benchmark Before After Units Gain > VectorZeroExtend.byte2Int 3168.251 243012.399 ops/ms 75.70 > VectorZeroExtend.byte2Long 3212.201 216291.588 ops/ms 66.33 > VectorZeroExtend.byte2Short 3391.968 182655.365 ops/ms 52.85 > VectorZeroExtend.int2Long 1012.197 80448.553 ops/ms 78.48 > VectorZeroExtend.short2Int 1812.471 153416.828 ops/ms 83.65 > VectorZeroExtend.short2Long 1788.382 129794.814 ops/ms 71.58 > > > On other Neon systems, we can get similar performance boost as a result of intrinsification success. > > Since `VectorUCastNode` only used in Vector API's zero extension currently, this patch also adds assertion on nodes' definitions to clarify their usages. > > [TEST] > compiler/vectorapi and jdk/incubator/vector passed on NEON and SVE machines. > > [1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/VectorOperators.java#L726 This pull request has now been integrated. 
Changeset: 9b8eaa2f Author: Eric Liu URL: https://git.openjdk.org/jdk/commit/9b8eaa2fc3c5127bc7828471916f5d881bf71228 Stats: 381 lines in 8 files changed: 299 ins; 23 del; 59 mod 8319872: AArch64: [vectorapi] Implementation of unsigned (zero extended) casts Reviewed-by: aph, xgong ------------- PR: https://git.openjdk.org/jdk/pull/16670 From rehn at openjdk.org Mon Dec 4 08:21:45 2023 From: rehn at openjdk.org (Robbin Ehn) Date: Mon, 4 Dec 2023 08:21:45 GMT Subject: RFR: 8320697: RISC-V: Small refactoring for runtime calls [v4] In-Reply-To: References: Message-ID: On Fri, 1 Dec 2023 11:03:28 GMT, Feilong Jiang wrote: >> Hi, please review this refactoring for runtime calls. >> Major changes: >> 1. Unified the runtime calls with the existing MacroAssembler::rt_call. This will remove the duplicate code like `relocate(target.rspec() [&] {...}` to emit uncompressed instructions. >> 2. Removed MacroAssembler::far_branches and made the call sites default to far branches. `branch_range` is 1MB for riscv, and `ReservedCodeCacheSize` will always bigger than `branch_range` in practice. We should remove this unnecessary check and simplify the code logic. >> 3. Renamed MacroAssembler::la_patchable with MacroAssembler::la making it less confusing. >> 4. `far_call` in `rt_call` should use `tmp` instead of the default temporary register `t0` >> 5. Removed some unused codes in `g1BarrierSetAssembler_riscv.cpp` >> >> >> Testing: >> - [x] Tier1-3 tested on hifive unmatched board (release) >> - [x] Run non-trivial benchmark workloads (fastdebug) > > Feilong Jiang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains five additional commits since the last revision: > > - Merge branch 'master' of https://github.com/openjdk/jdk into JDK-8320697 > - adjust format > - remove unnecessary relocate > - Rename la_patchable with la > - RISC-V: Small refactoring for external and runtime calls Hey! I notice this: static int far_branch_size() { if (far_branches()) { return 2 * 4; // auipc + jalr, see far_call() & far_jump() Which is used to determine deopt handler size. I don't understand how we ever may use movptr here instead with knowing who is calling? I think we want two methods here one with fixed size ? ------------- PR Comment: https://git.openjdk.org/jdk/pull/16816#issuecomment-1838043824 From chagedorn at openjdk.org Mon Dec 4 08:24:42 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 4 Dec 2023 08:24:42 GMT Subject: RFR: 8305638: Refactor Template Assertion Predicate Bool creation and Predicate code in Split If and Loop Unswitching In-Reply-To: References: Message-ID: On Wed, 29 Nov 2023 08:42:41 GMT, Christian Hagedorn wrote: > This patch is intended for JDK 23. > > While preparing the patch for the full fix for Assertion Predicates [JDK-8288981](https://bugs.openjdk.org/browse/JDK-8288981), I still noticed that some changes are not required for the actual fix and could be split off and reviewed separately in this PR. > > The patch applies the following cleanup changes: > - The complete fix had to add slightly different cloning cases in `PhaseIdealLoop::create_bool_from_template_assertion_predicate()` which already has quite some logic to switch between different cases. Additionally, the algorithm in the method itself was already hard to understand and difficult to adapt. I therefore re-implemented it in a separate class `CloneTemplateAssertionPredicateBool` together with some helper classes like `DFSNodeStack`. 
To use it, I've added a `TemplateAssertionPredicateBool` class that offers three cloning possibilities: > - `clone()`: Clone without modification > - `clone_and_replace_opaque_loop_nodes()`: Clone and replace the `OpaqueLoop*Nodes` with a new init and stride node. > - `clone_and_replace_init()`: Special case of `clone_and_replace_opaque_loop_nodes()` which only replaces `OpaqueLoopInitNode` and clones `OpaqueLoopStrideNode`. > > This refactoring could be extracted from the complete fix. > - The Split If code to detect (`subgraph_has_opaque()`) and clone Template Assertion Predicate Bools was extracted to a separate class `CloneTemplateAssertionPredicateBoolDown` and uses the new `TemplateAssertionPredicateBool` class to do the actual cloning. > - In the process of coding the complete fix, I've refactored the Loop Unswitching code quite a bit. This change could also be extracted into a separate RFE. Changes include: > - Renaming > - Extracting code to separate classes/methods > - Adding comments > - Some small refactoring including: > - Removing unused parameters > - Renaming variables/parameters/methods > > Thanks, > Christian Thanks Roland for your review! ------------- PR Comment: https://git.openjdk.org/jdk/pull/16877#issuecomment-1838047691 From ihse at openjdk.org Mon Dec 4 11:51:38 2023 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Mon, 4 Dec 2023 11:51:38 GMT Subject: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v6] In-Reply-To: References: Message-ID: <_gSNXk0qGAtpY-WJ5OCHk_3-nuGrwwSn-ffK9f2TEcs=.40f785ba-83dd-40fe-8075-a7a7872ea600@github.com> On Thu, 30 Nov 2023 20:19:56 GMT, Srinivas Vamsi Parasa wrote: >> Raising the minimum gcc version is not done willy-nilly. (I feel a "You just don't ..." meme coming up) >> >> But you are saying that you want to skip building this library unless you have a gcc version that supports c++17? >> >> I still don't really like it. 
I'd like to hear someone else who can think clearly about this, if we want to go down this path, and start adding libraries that use C++17. Maybe @kimbarrett has some input? > >> But you are saying that you want to skip building this library unless you have a gcc version that supports c++17? >> > Yes, the request is to skip building the simdsort library if GCC version is < 8 as only GCC >= 8 supports C++17 features. Then you must add logic to check for this. Now the build will just fail if building with an older gcc. That is not acceptable. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16534#discussion_r1413761798 From epeter at openjdk.org Mon Dec 4 12:42:35 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 4 Dec 2023 12:42:35 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v21] In-Reply-To: References: Message-ID: > I want to push this in JDK23. > After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). > > To calm your nerves: most of the changes are in auto-generated tests, and tests in general. > > **Context** > > `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). > > Alignment is split into two tasks: > - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. > - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. 
We would like to align the vectors with their vector width.
>
> **Problem**
>
> I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190).
> In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine.
> Thanks @fg1417 for confirming this!
> Hence, we need to fix the alignment correctness checks.
>
> While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly.
>
> **Problem Details**
>
> Reproducer:
>
>     static void test(short[] a, short[] b, short mask) {
>         for (int i = 0; i < RANGE; i+=8) {
>             // Problematic for AlignVector
>             b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0
>
>             b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes
>             b[i+4] = (short)(a[i+4] & mask);
>             b[i+5] = (short)(a[i+5] & mask);
>             b[i+6] = (short)(a[i+6] & mask);
>         }
>     }
>
> During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all.
>
> This is problematic as shown in this example. We have references at index offset `0, 3, 4, 5, 6`, and by...
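The byte offsets in the reproducer quoted above can be checked with a small standalone sketch. It is only an illustration, not SuperWord code: the class and method names are hypothetical, and the 8-byte alignment requirement is an assumption (the default value of `ObjectAlignmentInBytes`). With 2-byte `short` elements, the pack starting at index offset 3 sits 6 bytes past the reference at offset 0, so it cannot be 8-byte aligned whenever that reference is. The loop stride of 8 elements (16 bytes) keeps this relative misalignment in every iteration.

```java
public class PackAlignmentSketch {
    // Size of a Java short in bytes.
    static final int SHORT_BYTES = 2;
    // Assumed alignment requirement in bytes (default ObjectAlignmentInBytes).
    static final int ALIGNMENT_BYTES = 8;

    // Byte offset of a short[] element relative to the element at index offset 0.
    static int byteOffset(int indexOffset) {
        return indexOffset * SHORT_BYTES;
    }

    // True if the byte offset is a multiple of the given alignment.
    static boolean isAligned(int byteOffset, int alignmentBytes) {
        return byteOffset % alignmentBytes == 0;
    }

    public static void main(String[] args) {
        // Index offsets used by the reproducer: 0 (lowest reference) and the pack 3..6.
        for (int idx : new int[] {0, 3, 4, 5, 6}) {
            int off = byteOffset(idx);
            System.out.println("index offset " + idx + " -> byte offset " + off
                    + ", aligned to " + ALIGNMENT_BYTES + " bytes: "
                    + isAligned(off, ALIGNMENT_BYTES));
        }
    }
}
```

Checking only the lowest-offset reference (index offset 0) would accept the whole group, even though the pack that actually gets vectorized starts at byte offset 6.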
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: Suggestions by Christian for naming ------------- Changes: - all: https://git.openjdk.org/jdk/pull/14785/files - new: https://git.openjdk.org/jdk/pull/14785/files/0e9edf76..85cda773 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=20 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=19-20 Stats: 159 lines in 1 file changed: 36 ins; 0 del; 123 mod Patch: https://git.openjdk.org/jdk/pull/14785.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14785/head:pull/14785 PR: https://git.openjdk.org/jdk/pull/14785 From duke at openjdk.org Mon Dec 4 14:25:48 2023 From: duke at openjdk.org (Daniel Lundén) Date: Mon, 4 Dec 2023 14:25:48 GMT Subject: RFR: 8320682: [AArch64] C1 compilation fails with "Field too big for insn" Message-ID: This changeset fixes an issue on aarch64 where addresses for float and double constants were sometimes out of range for PC-relative offsets using `adr`. Changes: - Fix the issue by replacing `adr` with `lea`. - Add a regression test. Thanks to @fisk and @xmas92 for the assistance.
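For illustration, the range limit behind the "Field too big for insn" failure can be sketched as below. This is a minimal sketch, not HotSpot code: the class and method names are hypothetical. It relies only on the fact that the AArch64 `adr` instruction encodes a signed 21-bit byte offset, so it reaches roughly 1 MiB in either direction from the instruction. How `lea` materializes a farther address is not shown here.

```java
public class AdrRangeSketch {
    // adr encodes a signed 21-bit byte offset relative to the instruction address.
    static final long ADR_MIN = -(1L << 20);
    static final long ADR_MAX = (1L << 20) - 1;

    // True if 'target' is reachable from 'pc' with a single adr instruction.
    static boolean fitsInAdr(long pc, long target) {
        long offset = target - pc;
        return offset >= ADR_MIN && offset <= ADR_MAX;
    }

    public static void main(String[] args) {
        long pc = 0x1000_0000L; // arbitrary example address
        System.out.println("4 KiB away: " + fitsInAdr(pc, pc + 4096));
        System.out.println("2 MiB away: " + fitsInAdr(pc, pc + (2L << 20)));
    }
}
```

A constant placed more than about 1 MiB from the instruction that loads it fails this check, which is the situation the changeset describes.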
### Testing Tests: tier1, tier2, tier3, tier4, tier5 Platforms: windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64 ------------- Commit messages: - Remove OS restriction in test - Add regression test - Replace adr in const2reg with lea Changes: https://git.openjdk.org/jdk/pull/16951/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=16951&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8320682 Stats: 15 lines in 2 files changed: 13 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/16951.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16951/head:pull/16951 PR: https://git.openjdk.org/jdk/pull/16951 From thartmann at openjdk.org Mon Dec 4 14:50:41 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 4 Dec 2023 14:50:41 GMT Subject: RFR: 8320682: [AArch64] C1 compilation fails with "Field too big for insn" In-Reply-To: References: Message-ID: On Mon, 4 Dec 2023 14:19:10 GMT, Daniel Lundén wrote: > This changeset fixes an issue on aarch64 where addresses for float and double constants were sometimes out of range for PC-relative offsets using `adr`. > > Changes: > - Fix the issue by replacing `adr` with `lea`. > - Add a regression test. > > Thanks to @fisk and @xmas92 for the assistance. > > ### Testing > Tests: tier1, tier2, tier3, tier4, tier5 > Platforms: windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64 Looks good to me. ------------- Marked as reviewed by thartmann (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/16951#pullrequestreview-1762613552 From thartmann at openjdk.org Mon Dec 4 14:53:38 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 4 Dec 2023 14:53:38 GMT Subject: RFR: 8321225: [JVMCI] HotSpotResolvedObjectTypeImpl.isLeafClass shouldn't create strong references In-Reply-To: References: Message-ID: <9SWRqWmPhtcYtnNBBPMuw3VbihKoAlgmguR2-rmRonk=.881cf57c-494e-4d49-a56b-1c26e46478bf@github.com> On Mon, 4 Dec 2023 05:36:38 GMT, Tom Rodriguez wrote: > Checking for leaf Klasses requires seeing if the subklass field is null. As part of the fix for JVMCI support for ZGC, JDK-8299229, it was changed to call into the runtime which had the side effect of creating a strong reference to the class. Since it's only checking for non-null it's ok to just perform the read directly as was done prior to JDK-8299229. This avoids causing class unloading problems. Looks reasonable to me. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/16943#pullrequestreview-1762631218 From thartmann at openjdk.org Mon Dec 4 14:55:36 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 4 Dec 2023 14:55:36 GMT Subject: RFR: 8310711: [IR Framework] Remove safepoint while printing handling In-Reply-To: References: Message-ID: On Fri, 1 Dec 2023 12:47:48 GMT, Christian Hagedorn wrote: > This clean-up PR removes the handling of the `` message in the IR framework. It is no longer required since we dump the output of `PrintIdeal` to the hotspot_pid file differently since [JDK-8306922](https://bugs.openjdk.org/browse/JDK-8306922). There is no interrupting `` message anymore. I removed the corresponding now unneeded code together with the previously added test case for it. > > Testing: tier1-4 > > Thanks, > Christian Nice cleanup, looks good to me. ------------- Marked as reviewed by thartmann (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/16921#pullrequestreview-1762634894 From aph at openjdk.org Mon Dec 4 15:01:44 2023 From: aph at openjdk.org (Andrew Haley) Date: Mon, 4 Dec 2023 15:01:44 GMT Subject: RFR: 8320682: [AArch64] C1 compilation fails with "Field too big for insn" In-Reply-To: References: Message-ID: <8vJqCggXVpftvIxB46JcHu3-LZq29zZz_qp6v16Bc-8=.23e01553-4bc3-4184-8ad4-f29693f28311@github.com> On Mon, 4 Dec 2023 14:19:10 GMT, Daniel Lundén wrote: > This changeset fixes an issue on aarch64 where addresses for float and double constants were sometimes out of range for PC-relative offsets using `adr`. > > Changes: > - Fix the issue by replacing `adr` with `lea`. > - Add a regression test. > > Thanks to @fisk and @xmas92 for the assistance. > > ### Testing > Tests: tier1, tier2, tier3, tier4, tier5 > Platforms: windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64 Marked as reviewed by aph (Reviewer). This is fine for C1. If it were for C2, we're already doing a relaxation pass which we could utilize to fix up out-of-range loads. ------------- PR Review: https://git.openjdk.org/jdk/pull/16951#pullrequestreview-1762644482 PR Comment: https://git.openjdk.org/jdk/pull/16951#issuecomment-1838824825 From chagedorn at openjdk.org Mon Dec 4 15:06:44 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 4 Dec 2023 15:06:44 GMT Subject: RFR: 8310711: [IR Framework] Remove safepoint while printing handling In-Reply-To: References: Message-ID: On Fri, 1 Dec 2023 12:47:48 GMT, Christian Hagedorn wrote: > This clean-up PR removes the handling of the `` message in the IR framework. It is no longer required since we dump the output of `PrintIdeal` to the hotspot_pid file differently since [JDK-8306922](https://bugs.openjdk.org/browse/JDK-8306922). There is no interrupting `` message anymore. I removed the corresponding now unneeded code together with the previously added test case for it.
> > Testing: tier1-4 > > Thanks, > Christian Thanks Tobias for your review! ------------- PR Comment: https://git.openjdk.org/jdk/pull/16921#issuecomment-1838835035 From sviswanathan at openjdk.org Mon Dec 4 17:37:39 2023 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Mon, 4 Dec 2023 17:37:39 GMT Subject: RFR: 8319111: Mismatched MemorySegment heap access is not consistently intrinsified [v4] In-Reply-To: References: Message-ID: On Sat, 2 Dec 2023 07:53:13 GMT, Jatin Bhateja wrote: >> Patch enables intrinsification of fromMemorySegment, intoMemorySegment APIs and their masked variants for mismatched memory segments i.e. heap based memory segments whose backing storage type differs from the vector type in which they are loaded to or stored from. >> >> A load from a mismatched segment first moves the contents into type compatible vector followed by reinterpretation to desired vector type. This facilitates value forwarding from a preceding vector store as alias indices are computed using backing storage type. >> >> Mismatched masked vector loads and stores are performed at byte granularity, this handles both narrowing and widening scenarios where vector lane size is smaller than backing storage element type and vice versa. >> >> Following are the performance numbers of and existing JMH micro. >> >> ![image](https://github.com/openjdk/jdk/assets/59989778/a0b177af-78ca-4ac8-b6b0-bfe3655b16a6) >> >> Please review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Correting BIG_ENDIAN_ONLY check PR looks good to me. ------------- Marked as reviewed by sviswanathan (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/16888#pullrequestreview-1763020221 From sviswanathan at openjdk.org Mon Dec 4 19:28:39 2023 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Mon, 4 Dec 2023 19:28:39 GMT Subject: RFR: 8321215: Incorrect x86 instruction encoding for VSIB addressing mode Message-ID: For instructions that use VSIB addressing mode (gather/scatter), the assembler incorrectly sets EVEX.X bit when the VSIB vector register is in the range XMM16 - XMM23. The EVEX.X bit should only be set when bit 3 of the register encoding is 1, i.e. if the register encoding is 8 - 15 or 24 - 31. ------------- Commit messages: - 8321215: Incorrect x86 instruction encoding for VSIB addressing mode Changes: https://git.openjdk.org/jdk/pull/16957/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=16957&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8321215 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/16957.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16957/head:pull/16957 PR: https://git.openjdk.org/jdk/pull/16957 From shade at openjdk.org Mon Dec 4 19:49:51 2023 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 4 Dec 2023 19:49:51 GMT Subject: RFR: 8321215: Incorrect x86 instruction encoding for VSIB addressing mode In-Reply-To: References: Message-ID: On Mon, 4 Dec 2023 19:09:33 GMT, Sandhya Viswanathan wrote: > For instructions that use VSIB addressing mode (gather/scatter), the assembler incorrectly sets EVEX.X bit when the VSIB vector register is in the range XMM16 - XMM23. The EVEX.X bit should only be set when bit 3 of the register encoding is 1, i.e. if the register encoding is 8 - 15 or 24 - 31. I am curious which part of SDM it follows from? In Intel SDM Vol 2, "2.6.1 Instruction Format and EVEX", I see: "Operand specifier modifier bit for vector register: ... 
P[6] (Aleksey: EVEX.X) can also provide access to a high 16 vector register when SIB or VSIB addressing are not needed." This "high 16" seems to differ from "upper 16" like you described, right? I.e. "high 16" means bit 3 set (XMM8...XMM15 or XMM24...XMM31), whereas "upper 16" means the actual "upper index" registers (e.g. XMM16...XMM31)? ------------- PR Review: https://git.openjdk.org/jdk/pull/16957#pullrequestreview-1763275276 From shade at openjdk.org Mon Dec 4 19:59:08 2023 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 4 Dec 2023 19:59:08 GMT Subject: RFR: 8321215: Incorrect x86 instruction encoding for VSIB addressing mode In-Reply-To: References: Message-ID: On Mon, 4 Dec 2023 19:09:33 GMT, Sandhya Viswanathan wrote: > For instructions that use VSIB addressing mode (gather/scatter), the assembler incorrectly sets EVEX.X bit when the VSIB vector register is in the range XMM16 - XMM23. The EVEX.X bit should only be set when bit 3 of the register encoding is 1, i.e. if the register encoding is 8 - 15 or 24 - 31. Marked as reviewed by shade (Reviewer). Ah, AMD APM Vol 3, "1.2.8 VEX and XOP Prefixes" is significantly clearer on that part, it just states it adds 1 msb bit to SIB.index, which I think matches the _"high 16"_ in Intel SDM implies. REX.X: Index field extension (Bit 1). The REX.X bit adds a 1-bit (msb) extension to the SIB.index field. See "ModRM and SIB Bytes" on page 17.
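To make the bit split concrete, here is a minimal sketch of how a 5-bit VSIB index register encoding (0 to 31) divides into its three fields. It is not the HotSpot assembler code, and the class and method names are hypothetical. It computes the logical bit values only; in the EVEX prefix bytes themselves these extension bits are stored inverted, which the sketch leaves out.

```java
public class VsibIndexBitsSketch {
    // Bit 4 of the index register encoding: logical value of EVEX.V'.
    static int evexVPrime(int enc) {
        return (enc >> 4) & 1;
    }

    // Bit 3 of the index register encoding: logical value of EVEX.X.
    static int evexX(int enc) {
        return (enc >> 3) & 1;
    }

    // Bits 2:0 of the index register encoding: the SIB.index field.
    static int sibIndex(int enc) {
        return enc & 7;
    }

    public static void main(String[] args) {
        for (int enc : new int[] {0, 8, 15, 16, 23, 24, 31}) {
            System.out.println("xmm" + enc
                    + ": V'=" + evexVPrime(enc)
                    + " X=" + evexX(enc)
                    + " sib.index=" + sibIndex(enc));
        }
    }
}
```

Under this split, XMM16 to XMM23 have bit 4 set and bit 3 clear, so EVEX.X is 0 for them. Deriving EVEX.X from "encoding >= 16" instead of from bit 3 would produce the incorrect encoding described in the bug.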
------------- PR Review: https://git.openjdk.org/jdk/pull/16957#pullrequestreview-1763287946 PR Comment: https://git.openjdk.org/jdk/pull/16957#issuecomment-1839375880 From sviswanathan at openjdk.org Mon Dec 4 21:14:34 2023 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Mon, 4 Dec 2023 21:14:34 GMT Subject: RFR: 8321215: Incorrect x86 instruction encoding for VSIB addressing mode In-Reply-To: References: Message-ID: On Mon, 4 Dec 2023 19:55:20 GMT, Aleksey Shipilev wrote: >> For instructions that use VSIB addressing mode (gather/scatter), the assembler incorrectly sets EVEX.X bit when the VSIB vector register is in the range XMM16 - XMM23. The EVEX.X bit should only be set when bit 3 of the register encoding is 1, i.e. if the register encoding is 8 - 15 or 24 - 31. > > Ah, AMD APM Vol 3, "1.2.8 VEX and XOP Prefixes" is significantly clearer on that part, it just states it adds 1 msb bit to SIB.index, which I think matches the _"high 16"_ in Intel SDM implies. > > > REX.X: Index field extension (Bit 1). The REX.X bit adds a 1-bit (msb) extension to the > SIB.index field. See "ModRM and SIB Bytes" on page 17. @shipilev Thanks a lot for the review. Table 2-31 in section 2.7.2 specifies the Vector Index encoding of VSIB Memory Addressing as follows: VIDX 4:(EVEX.V') 3:(EVEX.X) [2:0]:(sib.index) ------------- PR Comment: https://git.openjdk.org/jdk/pull/16957#issuecomment-1839483764 From duke at openjdk.org Mon Dec 4 21:55:36 2023 From: duke at openjdk.org (Srinivas Vamsi Parasa) Date: Mon, 4 Dec 2023 21:55:36 GMT Subject: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v7] In-Reply-To: References: Message-ID: > The goal is to develop faster sort routines for x86_64 CPUs by taking advantage of AVX2 instructions. This enhancement provides an order of magnitude speedup for Arrays.sort() using int, long, float and double arrays.
>
> For serial sort on random data, this PR shows upto ~7.5x improvement for 32-bit datatypes (int, float) on Intel TigerLake machine as shown in the performance data below.
>
> For parallel sort on random data, this PR shows upto ~3.4x for 32-bit datatypes (int, float) as shown below.
>
> **Note:** This PR also improves the performance of AVX512 sort by upto 35%.
>
> Benchmark (Serial Sort) | Size | Baseline (us/op) | AVX2 (us/op) | Speedup
> -- | -- | -- | -- | --
> ArraysSort.intSort | 10 | 0.034 | 0.029 | 1.2
> ArraysSort.intSort | 25 | 0.088 | 0.044 | 2.0
> ArraysSort.intSort | 50 | 0.239 | 0.159 | 1.5
> ArraysSort.intSort | 75 | 0.417 | 0.27 | 1.5
> ArraysSort.intSort | 100 | 0.572 | 0.265 | 2.2
> ArraysSort.intSort | 1000 | 10.098 | 4.282 | 2.4
> ArraysSort.intSort | 10000 | 330.065 | 43.383 | 7.6
> ArraysSort.intSort | 100000 | 4099.527 | 778.943 | 5.3
> ArraysSort.intSort | 1000000 | 49150.16 | 9634.335 | 5.1
> ArraysSort.floatSort | 10 | 0.045 | 0.043 | 1.0
> ArraysSort.floatSort | 25 | 0.105 | 0.073 | 1.4
> ArraysSort.floatSort | 50 | 0.278 | 0.216 | 1.3
> ArraysSort.floatSort | 75 | 0.476 | 0.241 | 2.0
> ArraysSort.floatSort | 100 | 0.583 | 0.313 | 1.9
> ArraysSort.floatSort | 1000 | 10.182 | 4.329 | 2.4
> ArraysSort.floatSort | 10000 | 323.136 | 57.175 | 5.7
> ArraysSort.floatSort | 100000 | 4299.519 | 862.63 | 5.0
> ArraysSort.floatSort | 1000000 | 50889.4 | 10972.19 | 4.6
>
> ...
Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: add GCC version guards ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16534/files - new: https://git.openjdk.org/jdk/pull/16534/files/d957f413..bb5f711a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16534&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16534&range=05-06 Stats: 42 lines in 3 files changed: 42 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/16534.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16534/head:pull/16534 PR: https://git.openjdk.org/jdk/pull/16534 From duke at openjdk.org Mon Dec 4 22:15:24 2023 From: duke at openjdk.org (Srinivas Vamsi Parasa) Date: Mon, 4 Dec 2023 22:15:24 GMT Subject: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v8] In-Reply-To: References: Message-ID: <7ocsRxaWjoU2vxwPUSE7BrnLSL1bF_7Pp8vReacNJvE=.1c044b21-b27e-4e94-8db0-6ae888a1e8b9@github.com> > The goal is to develop faster sort routines for x86_64 CPUs by taking advantage of AVX2 instructions. This enhancement provides an order of magnitude speedup for Arrays.sort() using int, long, float and double arrays. > > For serial sort on random data, this PR shows upto ~7.5x improvement for 32-bit datatypes (int, float) on Intel TigerLake machine as shown in the performance data below. > > For parallel sort on random data, this PR shows upto ~3.4x for 32-bit datatypes (int, float) as shown below. > > **Note:** This PR also improves the performance of AVX512 sort by upto 35%. 
>
> Benchmark (Serial Sort) | Size | Baseline (us/op) | AVX2 (us/op) | Speedup
> -- | -- | -- | -- | --
> ArraysSort.intSort | 10 | 0.034 | 0.029 | 1.2
> ArraysSort.intSort | 25 | 0.088 | 0.044 | 2.0
> ArraysSort.intSort | 50 | 0.239 | 0.159 | 1.5
> ArraysSort.intSort | 75 | 0.417 | 0.27 | 1.5
> ArraysSort.intSort | 100 | 0.572 | 0.265 | 2.2
> ArraysSort.intSort | 1000 | 10.098 | 4.282 | 2.4
> ArraysSort.intSort | 10000 | 330.065 | 43.383 | 7.6
> ArraysSort.intSort | 100000 | 4099.527 | 778.943 | 5.3
> ArraysSort.intSort | 1000000 | 49150.16 | 9634.335 | 5.1
> ArraysSort.floatSort | 10 | 0.045 | 0.043 | 1.0
> ArraysSort.floatSort | 25 | 0.105 | 0.073 | 1.4
> ArraysSort.floatSort | 50 | 0.278 | 0.216 | 1.3
> ArraysSort.floatSort | 75 | 0.476 | 0.241 | 2.0
> ArraysSort.floatSort | 100 | 0.583 | 0.313 | 1.9
> ArraysSort.floatSort | 1000 | 10.182 | 4.329 | 2.4
> ArraysSort.floatSort | 10000 | 323.136 | 57.175 | 5.7
> ArraysSort.floatSort | 100000 | 4299.519 | 862.63 | 5.0
> ArraysSort.floatSort | 1000000 | 50889.4 | 10972.19 | 4.6
>
> ...

Srinivas Vamsi Parasa has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains 17 additional commits since the last revision: - Merge branch 'master' of https://git.openjdk.java.net/jdk into simdsort - add GCC version guards - Merge branch 'master' of https://git.openjdk.java.net/jdk into simdsort - Remove C++17 from C flags - add avoid masked stores operation - update the code to check for supported simd sort cpus - Disable AVX2 sort for 64-bit types - Merge branch 'master' of https://git.openjdk.java.net/jdk into simdsort - fix jcheck failures due to windows encoding - fix carriage return and change insertion sort thresholds - ... and 7 more: https://git.openjdk.org/jdk/compare/0b17dc14...bc590d9f ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16534/files - new: https://git.openjdk.org/jdk/pull/16534/files/bb5f711a..bc590d9f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16534&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16534&range=06-07 Stats: 36176 lines in 1033 files changed: 17425 ins; 14551 del; 4200 mod Patch: https://git.openjdk.org/jdk/pull/16534.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16534/head:pull/16534 PR: https://git.openjdk.org/jdk/pull/16534 From duke at openjdk.org Mon Dec 4 22:18:35 2023 From: duke at openjdk.org (Srinivas Vamsi Parasa) Date: Mon, 4 Dec 2023 22:18:35 GMT Subject: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v8] In-Reply-To: <_gSNXk0qGAtpY-WJ5OCHk_3-nuGrwwSn-ffK9f2TEcs=.40f785ba-83dd-40fe-8075-a7a7872ea600@github.com> References: <_gSNXk0qGAtpY-WJ5OCHk_3-nuGrwwSn-ffK9f2TEcs=.40f785ba-83dd-40fe-8075-a7a7872ea600@github.com> Message-ID: On Mon, 4 Dec 2023 11:48:44 GMT, Magnus Ihse Bursie wrote: >>> But you are saying that you want to skip building this library unless you have a gcc version that supports c++17? >>> >> Yes, the request is to skip building the simdsort library if GCC version is < 8 as only GCC >= 8 supports C++17 features. > > Then you must add logic to check for this. 
Now the build will just fail if building with an older gcc. That is not acceptable. Hi Marcus (@magicus), please see the updated code which added guards to check for GCC version >= 7.5 in `src/java.base/linux/native/libsimdsort/{avx2-linux-qsort.cpp, avx512-linux-qsort.cpp}`. GCC >= 7.5 is needed to compile libsimdsort using C++17 features. Made sure that OpenJDK builds without errors using both GCC 7.5 and GCC 6.4. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16534#discussion_r1414570644 From sviswanathan at openjdk.org Tue Dec 5 00:10:38 2023 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Tue, 5 Dec 2023 00:10:38 GMT Subject: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v8] In-Reply-To: <7ocsRxaWjoU2vxwPUSE7BrnLSL1bF_7Pp8vReacNJvE=.1c044b21-b27e-4e94-8db0-6ae888a1e8b9@github.com> References: <7ocsRxaWjoU2vxwPUSE7BrnLSL1bF_7Pp8vReacNJvE=.1c044b21-b27e-4e94-8db0-6ae888a1e8b9@github.com> Message-ID: <5JnWpXMWKZ85mosrUdVrdNLRyHWF_HNW3h8krKWO63k=.63bf3ffa-63a1-4ce4-b972-278655e8a567@github.com> On Mon, 4 Dec 2023 22:15:24 GMT, Srinivas Vamsi Parasa wrote: >> The goal is to develop faster sort routines for x86_64 CPUs by taking advantage of AVX2 instructions. This enhancement provides an order of magnitude speedup for Arrays.sort() using int, long, float and double arrays. >> >> For serial sort on random data, this PR shows upto ~7.5x improvement for 32-bit datatypes (int, float) on Intel TigerLake machine as shown in the performance data below. >> >> For parallel sort on random data, this PR shows upto ~3.4x for 32-bit datatypes (int, float) as shown below. >> >> **Note:** This PR also improves the performance of AVX512 sort by upto 35%. 
>>
>> Benchmark (Serial Sort) | Size | Baseline (us/op) | AVX2 (us/op) | Speedup
>> -- | -- | -- | -- | --
>> ArraysSort.intSort | 10 | 0.034 | 0.029 | 1.2
>> ArraysSort.intSort | 25 | 0.088 | 0.044 | 2.0
>> ArraysSort.intSort | 50 | 0.239 | 0.159 | 1.5
>> ArraysSort.intSort | 75 | 0.417 | 0.27 | 1.5
>> ArraysSort.intSort | 100 | 0.572 | 0.265 | 2.2
>> ArraysSort.intSort | 1000 | 10.098 | 4.282 | 2.4
>> ArraysSort.intSort | 10000 | 330.065 | 43.383 | 7.6
>> ArraysSort.intSort | 100000 | 4099.527 | 778.943 | 5.3
>> ArraysSort.intSort | 1000000 | 49150.16 | 9634.335 | 5.1
>> ArraysSort.floatSort | 10 | 0.045 | 0.043 | 1.0
>> ArraysSort.floatSort | 25 | 0.105 | 0.073 | 1.4
>> ArraysSort.floatSort | 50 | 0.278 | 0.216 | 1.3
>> ArraysSort.floatSort | 75 | 0.476 | 0.241 | 2.0
>> ArraysSort.floatSort | 100 | 0.583 | 0.313 | 1.9
>> ArraysSort.floatSort | 1000 | 10.182 | 4.329 | 2.4
>> ArraysSort.floatSort | 10000 | 323.136 | 57.175 | 5.7
>> ArraysSort.floatSort | 100000 | 4299.519 | 862.63 | 5.0
>> ArraysSort.floatSort | 1000000 | 50889.4 | 10972.19 | 4.6
>>
> Srinivas Vamsi Parasa has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains 17 additional commits since the last revision: > > - Merge branch 'master' of https://git.openjdk.java.net/jdk into simdsort > - add GCC version guards > - Merge branch 'master' of https://git.openjdk.java.net/jdk into simdsort > - Remove C++17 from C flags > - add avoid masked stores operation > - update the code to check for supported simd sort cpus > - Disable AVX2 sort for 64-bit types > - Merge branch 'master' of https://git.openjdk.java.net/jdk into simdsort > - fix jcheck failures due to windows encoding > - fix carriage return and change insertion sort thresholds > - ... and 7 more: https://git.openjdk.org/jdk/compare/d4804a12...bc590d9f The PR looks good to me. ------------- Marked as reviewed by sviswanathan (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/16534#pullrequestreview-1763680754 From duke at openjdk.org Tue Dec 5 00:50:41 2023 From: duke at openjdk.org (Srinivas Vamsi Parasa) Date: Tue, 5 Dec 2023 00:50:41 GMT Subject: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v8] In-Reply-To: <7ocsRxaWjoU2vxwPUSE7BrnLSL1bF_7Pp8vReacNJvE=.1c044b21-b27e-4e94-8db0-6ae888a1e8b9@github.com> References: <7ocsRxaWjoU2vxwPUSE7BrnLSL1bF_7Pp8vReacNJvE=.1c044b21-b27e-4e94-8db0-6ae888a1e8b9@github.com> Message-ID: On Mon, 4 Dec 2023 22:15:24 GMT, Srinivas Vamsi Parasa wrote: >> The goal is to develop faster sort routines for x86_64 CPUs by taking advantage of AVX2 instructions. This enhancement provides an order of magnitude speedup for Arrays.sort() using int, long, float and double arrays. >> >> For serial sort on random data, this PR shows upto ~7.5x improvement for 32-bit datatypes (int, float) on Intel TigerLake machine as shown in the performance data below. >> >> For parallel sort on random data, this PR shows upto ~3.4x for 32-bit datatypes (int, float) as shown below. >> >> **Note:** This PR also improves the performance of AVX512 sort by upto 35%. 
>>
>> Benchmark (Serial Sort) | Size | Baseline (us/op) | AVX2 (us/op) | Speedup
>> -- | -- | -- | -- | --
>> ArraysSort.intSort | 10 | 0.034 | 0.029 | 1.2
>> ArraysSort.intSort | 25 | 0.088 | 0.044 | 2.0
>> ArraysSort.intSort | 50 | 0.239 | 0.159 | 1.5
>> ArraysSort.intSort | 75 | 0.417 | 0.27 | 1.5
>> ArraysSort.intSort | 100 | 0.572 | 0.265 | 2.2
>> ArraysSort.intSort | 1000 | 10.098 | 4.282 | 2.4
>> ArraysSort.intSort | 10000 | 330.065 | 43.383 | 7.6
>> ArraysSort.intSort | 100000 | 4099.527 | 778.943 | 5.3
>> ArraysSort.intSort | 1000000 | 49150.16 | 9634.335 | 5.1
>> ArraysSort.floatSort | 10 | 0.045 | 0.043 | 1.0
>> ArraysSort.floatSort | 25 | 0.105 | 0.073 | 1.4
>> ArraysSort.floatSort | 50 | 0.278 | 0.216 | 1.3
>> ArraysSort.floatSort | 75 | 0.476 | 0.241 | 2.0
>> ArraysSort.floatSort | 100 | 0.583 | 0.313 | 1.9
>> ArraysSort.floatSort | 1000 | 10.182 | 4.329 | 2.4
>> ArraysSort.floatSort | 10000 | 323.136 | 57.175 | 5.7
>> ArraysSort.floatSort | 100000 | 4299.519 | 862.63 | 5.0
>> ArraysSort.floatSort | 1000000 | 50889.4 | 10972.19 | 4.6
>>
> Srinivas Vamsi Parasa has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains 17 additional commits since the last revision: > > - Merge branch 'master' of https://git.openjdk.java.net/jdk into simdsort > - add GCC version guards > - Merge branch 'master' of https://git.openjdk.java.net/jdk into simdsort > - Remove C++17 from C flags > - add avoid masked stores operation > - update the code to check for supported simd sort cpus > - Disable AVX2 sort for 64-bit types > - Merge branch 'master' of https://git.openjdk.java.net/jdk into simdsort > - fix jcheck failures due to windows encoding > - fix carriage return and change insertion sort thresholds > - ... and 7 more: https://git.openjdk.org/jdk/compare/0dc47dcf...bc590d9f Hello Vladimir (@vnkozlov), Could you please review this PR? Thanks, Vamsi ------------- PR Comment: https://git.openjdk.org/jdk/pull/16534#issuecomment-1839810768 From fjiang at openjdk.org Tue Dec 5 01:12:35 2023 From: fjiang at openjdk.org (Feilong Jiang) Date: Tue, 5 Dec 2023 01:12:35 GMT Subject: RFR: 8320697: RISC-V: Small refactoring for runtime calls [v4] In-Reply-To: References: Message-ID: On Mon, 4 Dec 2023 08:19:12 GMT, Robbin Ehn wrote: > I don't understand how we ever may use movptr here instead without knowing who is calling? > > I think we want two methods here one with fixed size ? Hi @robehn, I'm not sure if I understand it correctly. Do you mean we should use `movptr` to get fixed instruction size instead of `la_patchbale`? ------------- PR Comment: https://git.openjdk.org/jdk/pull/16816#issuecomment-1839828061 From rehn at openjdk.org Tue Dec 5 06:52:34 2023 From: rehn at openjdk.org (Robbin Ehn) Date: Tue, 5 Dec 2023 06:52:34 GMT Subject: RFR: 8320697: RISC-V: Small refactoring for runtime calls [v4] In-Reply-To: References: Message-ID: On Tue, 5 Dec 2023 01:09:51 GMT, Feilong Jiang wrote: > > I don't understand how we ever may use movptr here instead without knowing who is calling? > > I think we want two methods here one with fixed size ? 
> > Hi @robehn, I'm not sure if I understand it correctly. Do you mean we should use `movptr` to get fixed instruction size instead of `la_patchable`? https://bugs.openjdk.org/browse/JDK-8321315 Let's take here, this PR is good! ------------- PR Comment: https://git.openjdk.org/jdk/pull/16816#issuecomment-1840110033 From fjiang at openjdk.org Tue Dec 5 07:08:45 2023 From: fjiang at openjdk.org (Feilong Jiang) Date: Tue, 5 Dec 2023 07:08:45 GMT Subject: RFR: 8320697: RISC-V: Small refactoring for runtime calls [v4] In-Reply-To: References: Message-ID: <_RW7B0s680obwRqz-A7LMrF6WUzNyN1ZysqP81IYrNA=.dc85b625-6e71-4103-83c1-45e4b7876d42@github.com> On Tue, 5 Dec 2023 06:49:53 GMT, Robbin Ehn wrote: > https://bugs.openjdk.org/browse/JDK-8321315 > > Let's take here, this PR is good! Ok. Let's get this integrated then. Thanks. Tier1-3 tests are still good. ------------- PR Comment: https://git.openjdk.org/jdk/pull/16816#issuecomment-1840124649 From fjiang at openjdk.org Tue Dec 5 07:08:46 2023 From: fjiang at openjdk.org (Feilong Jiang) Date: Tue, 5 Dec 2023 07:08:46 GMT Subject: Integrated: 8320697: RISC-V: Small refactoring for runtime calls In-Reply-To: References: Message-ID: On Sun, 26 Nov 2023 10:52:18 GMT, Feilong Jiang wrote: > Hi, please review this refactoring for runtime calls. > Major changes: > 1. Unified the runtime calls with the existing MacroAssembler::rt_call. This will remove the duplicate code like `relocate(target.rspec() [&] {...}` to emit uncompressed instructions. > 2. Removed MacroAssembler::far_branches and made the call sites default to far branches. `branch_range` is 1MB for riscv, and `ReservedCodeCacheSize` will always bigger than `branch_range` in practice. We should remove this unnecessary check and simplify the code logic. > 3. Renamed MacroAssembler::la_patchable with MacroAssembler::la making it less confusing. > 4. `far_call` in `rt_call` should use `tmp` instead of the default temporary register `t0` > 5. 
Removed some unused codes in `g1BarrierSetAssembler_riscv.cpp` > > > Testing: > - [x] Tier1-3 tested on hifive unmatched board (release) > - [x] Run non-trivial benchmark workloads (fastdebug) This pull request has now been integrated. Changeset: aec38659 Author: Feilong Jiang URL: https://git.openjdk.org/jdk/commit/aec386596d531345b46be4f674b775df71df1eee Stats: 253 lines in 15 files changed: 26 ins; 137 del; 90 mod 8320697: RISC-V: Small refactoring for runtime calls Co-authored-by: Fei Yang Reviewed-by: fyang, rehn ------------- PR: https://git.openjdk.org/jdk/pull/16816 From eosterlund at openjdk.org Tue Dec 5 08:02:32 2023 From: eosterlund at openjdk.org (Erik =?UTF-8?B?w5ZzdGVybHVuZA==?=) Date: Tue, 5 Dec 2023 08:02:32 GMT Subject: RFR: 8321225: [JVMCI] HotSpotResolvedObjectTypeImpl.isLeafClass shouldn't create strong references In-Reply-To: References: Message-ID: On Mon, 4 Dec 2023 05:36:38 GMT, Tom Rodriguez wrote: > Checking for leaf Klasses requires seeing if the subklass field is null. As part of the fix for JVMCI support for ZGC, JDK-8299229, it was changed to call into the runtime which had the side effect of creating a strong reference to an the class. Since it's only checking for non-null it's ok to just perform thread directly as was done prior to JDK-8299229. This avoids causing class unloading problems. Looks good. I suppose this can yield some false negatives when the link isn't null but the subclass is concurrently unloading. But that probably doesn't matter, and you would have received the same negative answer had it been asked a bit earlier. ------------- Marked as reviewed by eosterlund (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/16943#pullrequestreview-1764242534 From thartmann at openjdk.org Tue Dec 5 08:16:03 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 5 Dec 2023 08:16:03 GMT Subject: RFR: 8318468: compiler/tiered/LevelTransitionTest.java fails with -XX:CompileThreshold=100 -XX:TieredStopAtLevel=1 Message-ID: The test fails with `-XX:CompileThreshold=100 -XX:TieredStopAtLevel=1` because `CompileMethodHolder::nonTrivialMethod` is unexpectedly OSR compiled but the test case has `isOSR() == false` (see line 197). The test is indeed not supposed to trigger an OSR compilation, and usually won't, but the loop is required to test tiered level transitions of a non-trivial method containing a loop. I simply changed the iterations to 1 to make sure that the backedge is never taken and thus prevent unexpected OSR compilations. The method will still be detected to have a loop and serve its purpose. Thanks, Tobias ------------- Commit messages: - 8318468: compiler/tiered/LevelTransitionTest.java fails with -XX:CompileThreshold=100 -XX:TieredStopAtLevel=1 Changes: https://git.openjdk.org/jdk/pull/16964/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=16964&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8318468 Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/16964.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16964/head:pull/16964 PR: https://git.openjdk.org/jdk/pull/16964 From rcastanedalo at openjdk.org Tue Dec 5 08:25:36 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Tue, 5 Dec 2023 08:25:36 GMT Subject: RFR: 8318468: compiler/tiered/LevelTransitionTest.java fails with -XX:CompileThreshold=100 -XX:TieredStopAtLevel=1 In-Reply-To: References: Message-ID: On Tue, 5 Dec 2023 08:09:34 GMT, Tobias Hartmann wrote: > The test fails with `-XX:CompileThreshold=100 -XX:TieredStopAtLevel=1` because 
`CompileMethodHolder::nonTrivialMethod` is unexpectedly OSR compiled but the test case has `isOSR() == false` (see line 197). The test is indeed not supposed to trigger an OSR compilation, and usually won't, but the loop is required to test tiered level transitions of a non-trivial method containing a loop. I simply changed the iterations to 1 to make sure that the backedge is never taken and thus prevent unexpected OSR compilations. The method will still be detected to have a loop and serve its purpose. > > Thanks, > Tobias Looks good. ------------- Marked as reviewed by rcastanedalo (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/16964#pullrequestreview-1764324986 From chagedorn at openjdk.org Tue Dec 5 08:30:38 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 5 Dec 2023 08:30:38 GMT Subject: RFR: 8318468: compiler/tiered/LevelTransitionTest.java fails with -XX:CompileThreshold=100 -XX:TieredStopAtLevel=1 In-Reply-To: References: Message-ID: On Tue, 5 Dec 2023 08:09:34 GMT, Tobias Hartmann wrote: > The test fails with `-XX:CompileThreshold=100 -XX:TieredStopAtLevel=1` because `CompileMethodHolder::nonTrivialMethod` is unexpectedly OSR compiled but the test case has `isOSR() == false` (see line 197). The test is indeed not supposed to trigger an OSR compilation, and usually won't, but the loop is required to test tiered level transitions of a non-trivial method containing a loop. I simply changed the iterations to 1 to make sure that the backedge is never taken and thus prevent unexpected OSR compilations. The method will still be detected to have a loop and serve its purpose. > > Thanks, > Tobias Looks good and trivial. ------------- Marked as reviewed by chagedorn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/16964#pullrequestreview-1764339947

From roland at openjdk.org Tue Dec 5 08:41:49 2023
From: roland at openjdk.org (Roland Westrelin)
Date: Tue, 5 Dec 2023 08:41:49 GMT
Subject: RFR: 8320649: C2: Optimize scoped values
Message-ID:

This change implements C2 optimizations for calls to ScopedValue.get(). Indeed, in:

    v1 = scopedValue.get();
    ...
    v2 = scopedValue.get();

`v2` can be replaced by `v1` and the second call to `get()` can be optimized out. That's true whatever is between the 2 calls unless a new mapping for `scopedValue` is created in between (when that happens no optimization is performed for the method being compiled). Hoisting a `get()` call out of a loop for a loop invariant `scopedValue` should also be legal in most cases.

`ScopedValue.get()` is implemented in Java code as a 2 step process. A cache is attached to the current thread object. If the `ScopedValue` object is in the cache then the result from `get()` is read from there. Otherwise a slow call is performed that also inserts the mapping in the cache. The cache itself is lazily allocated. One `ScopedValue` can be hashed to 2 different indexes in the cache. On a cache probe, both indexes are checked. As a consequence, the process of probing the cache is a multi step process (check if the cache is present, check first index, check second index if first index failed). If the cache is populated early on, then when the method that calls `ScopedValue.get()` is compiled, profile reports the slow path as never taken and only the read from the cache is compiled.

To perform the optimizations, I added 3 new node types to C2:

- the pair ScopedValueGetHitsInCacheNode/ScopedValueGetLoadFromCacheNode for the cache probe
- a cfg node ScopedValueGetResultNode to help locate the result of the `get()` call in the IR graph.

In pseudo code, once the nodes are inserted, the code of a `get()` is:

    hits_in_the_cache = ScopedValueGetHitsInCache(scopedValue)
    if (hits_in_the_cache) {
      res = ScopedValueGetLoadFromCache(hits_in_the_cache);
    } else {
      res = ..; // slow call possibly inlined. Subgraph can be arbitrarily complex
    }
    res = ScopedValueGetResult(res)

In the snippet:

    v1 = scopedValue.get();
    ...
    v2 = scopedValue.get();

Replacing `v2` by `v1` is then done by starting from the `ScopedValueGetResult` node for the second `get()` and looking for a dominating `ScopedValueGetResult` for the same `ScopedValue` object. When one is found, it is used as a replacement. Eliminating the second `get()` call is achieved by making `ScopedValueGetHitsInCache` always successful if there's a dominating `ScopedValueGetResult` and replacing its companion `ScopedValueGetLoadFromCache` by the dominating `ScopedValueGetResult`.

Hoisting a `get()` out of a loop is achieved by peeling one iteration of the loop. The optimization above then finds a dominating `get()` and removes the `get()` from the loop body.

An important case, I think, is when profile predicts the slow case to be never taken. Then the code of `get()` is:

    hits_in_the_cache = ScopedValueGetHitsInCache(scopedValue)
    if (hits_in_the_cache) {
      res = ScopedValueGetLoadFromCache(hits_in_the_cache);
    } else {
      trap();
    }
    res = ScopedValueGetResult(res)

The `ScopedValueGetResult` doesn't help and is removed early on. The optimization process then looks for a pair of `ScopedValueGetHitsInCache`/`ScopedValueGetLoadFromCache` that dominates the current pair of `ScopedValueGetHitsInCache`/`ScopedValueGetLoadFromCache` and can replace them. In that case, hoisting a `ScopedValue.get()` can be done by predication and I added special logic in predication for that.

Adding the new nodes to the graph when a `ScopedValue.get()` call is encountered is done in several steps:

1- inlining of `ScopedValue.get()` is delayed and the call is enqueued for late inlining.
2- Once the graph is fully constructed, for each call to `ScopedValue.get()`, a `ScopedValueGetResult` is added between the result of the call and its uses.
3- the call is then inlined by parsing the `ScopedValue.get()` method
4- finally the subgraph that results is pattern matched and the pieces required to perform the cache probe are extracted and attached to new `ScopedValueGetHitsInCache`/`ScopedValueGetLoadFromCache` nodes

There are a couple of reasons for steps 3 and 4:

- As mentioned above probing the cache is a multi step process. Having only 2 nodes in a simple graph shape to represent it makes it easier to write robust optimizations
- the subgraph for the method after parsing contains valuable pieces of information: profile data that captures which of the 2 locations in the cache is the most likely to cause a hit. Profile data is attached to the nodes.

Removal of redundant nodes is done during loop opts. The `ScopedValue` nodes are then expanded. That also happens during loop opts because once expansion is over, there are opportunities for further optimizations/clean up that can only happen during loop opts. During expansion, `ScopedValueGetResult` nodes are removed and `ScopedValueGetHitsInCache`/`ScopedValueGetLoadFromCache` are expanded to the multi step process of probing the cache. Profile data attached to the nodes are used to assign correct frequencies/counts to the If nodes. Of the 2 locations in the cache that are tested, the one that's the most likely to see a hit (from profile data) is done first.
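
The multi step cache probe described above can be modelled outside the JVM. The following is only a rough C++ sketch under stated assumptions, not the JDK's Java implementation and not C2 code: `ThreadCacheModel`, `Key`, the slot count and both index functions are hypothetical names and values chosen for illustration. It shows the three checks (cache present, first index, second index) and a slow path that lazily allocates the cache and inserts the mapping.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for a ScopedValue object: identity plus a hash.
struct Key {
    std::uint32_t hash;
};

// Hypothetical model of a per-thread cache with two candidate slots per key.
class ThreadCacheModel {
public:
    static constexpr std::size_t kSlots = 16;  // assumed size, power of two

    explicit ThreadCacheModel(std::unordered_map<const Key*, int> bindings)
        : bindings_(std::move(bindings)) {}

    int get(const Key& key) {
        if (cache_) {                                       // step 1: cache present?
            const Entry& first = (*cache_)[first_index(key)];
            if (first.key == &key) return first.value;      // step 2: first index
            const Entry& second = (*cache_)[second_index(key)];
            if (second.key == &key) return second.value;    // step 3: second index
        }
        return slow_get(key);                               // miss: slow path
    }

    int slow_calls() const { return slow_calls_; }

private:
    struct Entry {
        const Key* key = nullptr;
        int value = 0;
    };

    // Both index functions are assumptions made for this sketch only.
    static std::size_t first_index(const Key& k) { return k.hash & (kSlots - 1); }
    static std::size_t second_index(const Key& k) { return (k.hash >> 4) & (kSlots - 1); }

    int slow_get(const Key& key) {
        ++slow_calls_;
        int value = bindings_.at(&key);  // throws std::out_of_range if unbound
        if (!cache_) {
            // The cache is allocated lazily, on the first slow call.
            cache_ = std::make_unique<std::array<Entry, kSlots>>();
        }
        (*cache_)[first_index(key)] = Entry{&key, value};  // insert the mapping
        return value;
    }

    std::unordered_map<const Key*, int> bindings_;
    std::unique_ptr<std::array<Entry, kSlots>> cache_;
    int slow_calls_ = 0;
};
```

In this model, a second `get()` for the same key is served from the cache without another slow call, which is the situation where profile data would report the slow path as never taken.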
------------- Commit messages: - white spaces + bug id in test - test & fix Changes: https://git.openjdk.org/jdk/pull/16966/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=16966&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8320649 Stats: 2077 lines in 33 files changed: 2047 ins; 1 del; 29 mod Patch: https://git.openjdk.org/jdk/pull/16966.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16966/head:pull/16966 PR: https://git.openjdk.org/jdk/pull/16966 From duke at openjdk.org Tue Dec 5 09:10:52 2023 From: duke at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Tue, 5 Dec 2023 09:10:52 GMT Subject: RFR: 8310524: C2: record parser-generated LoadN nodes for IGVN Message-ID: This changeset fixes an issue where LoadN nodes were not recorded during bytecode parsing for later revisit in IGVN, in some cases resulting in missed optimization opportunities (see, e.g., the included new regression test). Changes: - Make sure to record newly added LoadN-nodes for IGVN in `GraphKit::make_load`. - Add a regression test. 
### Testing - tier1, tier2, tier3, tier4, tier5 (windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64) ------------- Commit messages: - Add regression test - Record LoadN for IGVN Changes: https://git.openjdk.org/jdk/pull/16967/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=16967&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8310524 Stats: 66 lines in 2 files changed: 66 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/16967.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16967/head:pull/16967 PR: https://git.openjdk.org/jdk/pull/16967 From chagedorn at openjdk.org Tue Dec 5 10:01:34 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 5 Dec 2023 10:01:34 GMT Subject: RFR: 8310524: C2: record parser-generated LoadN nodes for IGVN In-Reply-To: References: Message-ID: On Tue, 5 Dec 2023 09:05:35 GMT, Daniel Lundén wrote: > This changeset fixes an issue where LoadN nodes were not recorded during bytecode parsing for later revisit in IGVN, in some cases resulting in missed optimization opportunities (see, e.g., the included new regression test). > > Changes: > - Make sure to record newly added LoadN-nodes for IGVN in `GraphKit::make_load`. > - Add a regression test. > > ### Testing > - tier1, tier2, tier3, tier4, tier5 (windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64) Looks good! ------------- Marked as reviewed by chagedorn (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/16967#pullrequestreview-1764604357 From rcastanedalo at openjdk.org Tue Dec 5 10:33:39 2023 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Tue, 5 Dec 2023 10:33:39 GMT Subject: RFR: 8310524: C2: record parser-generated LoadN nodes for IGVN In-Reply-To: References: Message-ID: On Tue, 5 Dec 2023 09:05:35 GMT, Daniel Lundén wrote: > This changeset fixes an issue where LoadN nodes were not recorded during bytecode parsing for later revisit in IGVN, in some cases resulting in missed optimization opportunities (see, e.g., the included new regression test). > > Changes: > - Make sure to record newly added LoadN-nodes for IGVN in `GraphKit::make_load`. > - Add a regression test. > > ### Testing > - tier1, tier2, tier3, tier4, tier5 (windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64) src/hotspot/share/opto/graphKit.cpp line 1567: > 1565: record_for_igvn(ld); > 1566: if (ld->is_DecodeN()) { > 1567: // Also record the actual load (LoadN) in case ld is DecodeN Maybe add an assertion here checking that `ld->in(1)` is indeed a LoadN node. test/hotspot/jtreg/compiler/c2/irTests/igvn/TestLoadNIdeal.java line 48: > 46: > 47: @Test > 48: @IR(counts = {IRNode.LOAD_N, "1"}) Maybe add a precondition here testing that `UseCompressedOops` is enabled. Currently the test passes when running with `-XX:-UseCompressedOops` because `UseCompressedOops` is not whitelisted by the IR framework and hence the IR check is disabled, but better to be explicit I think.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16967#discussion_r1415322398 PR Review Comment: https://git.openjdk.org/jdk/pull/16967#discussion_r1415320929 From ihse at openjdk.org Tue Dec 5 11:21:36 2023 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Tue, 5 Dec 2023 11:21:36 GMT Subject: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v8] In-Reply-To: References: <_gSNXk0qGAtpY-WJ5OCHk_3-nuGrwwSn-ffK9f2TEcs=.40f785ba-83dd-40fe-8075-a7a7872ea600@github.com> Message-ID: On Mon, 4 Dec 2023 22:14:14 GMT, Srinivas Vamsi Parasa wrote: >> Then you must add logic to check for this. Now the build will just fail if building with an older gcc. That is not acceptable. > > Hi Marcus (@magicus), please see the updated code which added guards to check for GCC version >= 7.5 in `src/java.base/linux/native/libsimdsort/{avx2-linux-qsort.cpp, avx512-linux-qsort.cpp}`. GCC >= 7.5 is needed to compile libsimdsort using C++17 features. Made sure that OpenJDK builds without errors using both GCC 7.5 and GCC 6.4. That sounds weird. You can't check for if compiler options should be enabled or not inside source code files. Are you saying that when compiling with GCC 6, it will just silently ignore `-std=c++17`? I'd have assumed that it printed a warning or error about an unknown or invalid option, if C++17 is not supported. 
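
For readers unfamiliar with the kind of source-level guard being discussed, here is a minimal sketch, assuming the GCC >= 7.5 threshold quoted above. It is not the code from `avx2-linux-qsort.cpp`: the macro `SIMDSORT_TOOLCHAIN_OK`, the function `try_simd_sort` and the use of `std::sort` as a stand-in for the vectorized routine are all hypothetical. The sketch only shows how a translation unit can compile to a plain fallback when the compiler is too old; it does not answer the question of what an older GCC does when handed `-std=c++17` on the command line, which is a build-system matter.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical guard macro: true only for GCC 7.5 or newer (clang excluded,
// since it also defines __GNUC__ with an unrelated version number).
#if defined(__GNUC__) && !defined(__clang__) && \
    (__GNUC__ > 7 || (__GNUC__ == 7 && __GNUC_MINOR__ >= 5))
#define SIMDSORT_TOOLCHAIN_OK 1
#else
#define SIMDSORT_TOOLCHAIN_OK 0
#endif

#if SIMDSORT_TOOLCHAIN_OK && __cplusplus >= 201703L

// C++17-only code lives inside the guard, so older compilers never parse it.
template <typename T>
constexpr bool is_32bit_type() {
    if constexpr (sizeof(T) == 4) {
        return true;
    } else {
        return false;
    }
}

// Returns true if the guarded fast path handled the request.
bool try_simd_sort(int* data, std::size_t n) {
    static_assert(is_32bit_type<int>() || sizeof(int) != 4, "sanity check");
    std::sort(data, data + n);  // stand-in for the real vectorized sort
    return true;
}

#else

// Fallback when the toolchain is too old: report that nothing was done,
// so the caller can use its regular sort instead.
bool try_simd_sort(int*, std::size_t) {
    return false;
}

#endif
```

A caller would check the return value and fall back to its regular sort when `try_simd_sort` returns `false`.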
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16534#discussion_r1415392615 From thartmann at openjdk.org Tue Dec 5 11:43:33 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 5 Dec 2023 11:43:33 GMT Subject: RFR: 8318468: compiler/tiered/LevelTransitionTest.java fails with -XX:CompileThreshold=100 -XX:TieredStopAtLevel=1 In-Reply-To: References: Message-ID: <4BaKSz7ZrvExQ6sMDJJKB9UDj0jTPP4DRL8KEPZCbvQ=.9e1094fb-2f92-4d14-9555-79c52213d9e1@github.com> On Tue, 5 Dec 2023 08:09:34 GMT, Tobias Hartmann wrote: > The test fails with `-XX:CompileThreshold=100 -XX:TieredStopAtLevel=1` because `CompileMethodHolder::nonTrivialMethod` is unexpectedly OSR compiled but the test case has `isOSR() == false` (see line 197). The test is indeed not supposed to trigger an OSR compilation, and usually won't, but the loop is required to test tiered level transitions of a non-trivial method containing a loop. I simply changed the iterations to 1 to make sure that the backedge is never taken and thus prevent unexpected OSR compilations. The method will still be detected to have a loop and serve its purpose. > > Thanks, > Tobias Thanks for the reviews, Roberto and Christian! ------------- PR Comment: https://git.openjdk.org/jdk/pull/16964#issuecomment-1840608673 From duke at openjdk.org Tue Dec 5 11:55:53 2023 From: duke at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Tue, 5 Dec 2023 11:55:53 GMT Subject: RFR: 8310524: C2: record parser-generated LoadN nodes for IGVN [v2] In-Reply-To: References: Message-ID: > This changeset fixes an issue where LoadN nodes were not recorded during bytecode parsing for later revisit in IGVN, in some cases resulting in missed optimization opportunities (see, e.g., the included new regression test). > > Changes: > - Make sure to record newly added LoadN-nodes for IGVN in `GraphKit::make_load`. > - Add a regression test. 
> > ### Testing > - tier1, tier2, tier3, tier4, tier5 (windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64) Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: Address comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16967/files - new: https://git.openjdk.org/jdk/pull/16967/files/a238157a..23f80e63 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16967&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16967&range=00-01 Stats: 4 lines in 2 files changed: 2 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/16967.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16967/head:pull/16967 PR: https://git.openjdk.org/jdk/pull/16967 From duke at openjdk.org Tue Dec 5 11:55:56 2023 From: duke at openjdk.org (Daniel Lundén) Date: Tue, 5 Dec 2023 11:55:56 GMT Subject: RFR: 8310524: C2: record parser-generated LoadN nodes for IGVN [v2] In-Reply-To: References: Message-ID: On Tue, 5 Dec 2023 10:31:16 GMT, Roberto Castañeda Lozano wrote: >> Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: >> >> Address comments > > src/hotspot/share/opto/graphKit.cpp line 1567: > >> 1565: record_for_igvn(ld); >> 1566: if (ld->is_DecodeN()) { >> 1567: // Also record the actual load (LoadN) in case ld is DecodeN > > Maybe add an assertion here checking that `ld->in(1)` is indeed a LoadN node. Good idea, added > test/hotspot/jtreg/compiler/c2/irTests/igvn/TestLoadNIdeal.java line 48: > >> 46: >> 47: @Test >> 48: @IR(counts = {IRNode.LOAD_N, "1"}) > > Maybe add a precondition here testing that `UseCompressedOops` is enabled. Currently the test passes when running with `-XX:-UseCompressedOops` because `UseCompressedOops` is not whitelisted by the IR framework and hence the IR check is disabled, but better to be explicit I think. Thanks, now added.
I also switched to `TestFramework.runWithFlags("-XX:+UseCompressedOops");` to ensure that the test runs with compressed oops enabled (if, for some reason, that is not the default at some point in the future). ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16967#discussion_r1415471606 PR Review Comment: https://git.openjdk.org/jdk/pull/16967#discussion_r1415471474 From duke at openjdk.org Tue Dec 5 12:12:52 2023 From: duke at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Tue, 5 Dec 2023 12:12:52 GMT Subject: RFR: 8295166: IGV: dump graph at more locations Message-ID: This changeset 1. adds a number of new graph dumps for IdealGraphVisualizer (IGV): - Before conditional constant propagation - After register allocation - After block ordering - After peephole optimization - After post-allocation expansion - Before and after - loop predication - loop peeling - pre/main/post loops - loop unrolling - range check elimination - loop unswitching - partial peeling - split if - superword 2. adds support for enumeration of repeated IGV graph dumps. 3. adjusts IGV print levels to encompass the new graph dumps. The old levels 4 and 5 are now levels 5 and 6. The new level 4 is for loop optimization dumps. Example phase list screenshots in IGV (first at level 6, second at level 4) ![Screenshot from 2023-12-04 13-55-38](https://github.com/openjdk/jdk/assets/4222397/6759dc5a-9c9a-42b9-8d9e-2d0b53e76ab4) ![Screenshot from 2023-12-04 13-56-29](https://github.com/openjdk/jdk/assets/4222397/44d6a239-587b-4f7c-8ce1-f7613cb2fa35) Some notes: - While discussing the above changes, a separate question was brought up by @chhagedorn: > On a separate note, I'm wondering how useful it is to always dump all JFR events when calling print_method(). Should this be revisited again in general? - The new IGV graph dump enumeration enables a number of cleanups. There is now another RFE for IGV cleanup: [JDK-8319599](https://bugs.openjdk.org/browse/JDK-8319599). 
### Testing #### Platforms: windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64 - tier1, tier2, tier3, tier4, tier5. - Check that optimized builds (`--with-debug-level optimized`) still work. #### Platforms: linux-x64 - Tested that thousands of graphs are correctly opened and visualized with IGV. ------------- Commit messages: - Add phase enumeration reset - Merge AFTER_LOOP_PREDICATION IC and RC - Change to title case for new phases - Update JFR test after new phase level - Fix Christian's comments - Restore IGV .gitignore - Fix incorrect range for PrintIdealGraphLevel - Move incorrectly placed BEFORE_SUPERWORD_SCHEDULE - Superword dump update - Adjust print levels - ... and 14 more: https://git.openjdk.org/jdk/compare/1cf7ef52...07aac1c5 Changes: https://git.openjdk.org/jdk/pull/16120/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=16120&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8295166 Stats: 167 lines in 15 files changed: 115 ins; 0 del; 52 mod Patch: https://git.openjdk.org/jdk/pull/16120.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16120/head:pull/16120 PR: https://git.openjdk.org/jdk/pull/16120 From thartmann at openjdk.org Tue Dec 5 12:12:56 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 5 Dec 2023 12:12:56 GMT Subject: RFR: 8295166: IGV: dump graph at more locations In-Reply-To: References: Message-ID: On Tue, 10 Oct 2023 13:31:00 GMT, Daniel Lund?n wrote: > This changeset > 1. adds a number of new graph dumps for IdealGraphVisualizer (IGV): > - Before conditional constant propagation > - After register allocation > - After block ordering > - After peephole optimization > - After post-allocation expansion > - Before and after > - loop predication > - loop peeling > - pre/main/post loops > - loop unrolling > - range check elimination > - loop unswitching > - partial peeling > - split if > - superword > 2. adds support for enumeration of repeated IGV graph dumps. > 3. 
adjusts IGV print levels to encompass the new graph dumps. The old levels 4 and 5 are now levels 5 and 6. The new level 4 is for loop optimization dumps. > > Example phase list screenshots in IGV (first at level 6, second at level 4) > ![Screenshot from 2023-12-04 13-55-38](https://github.com/openjdk/jdk/assets/4222397/6759dc5a-9c9a-42b9-8d9e-2d0b53e76ab4) ![Screenshot from 2023-12-04 13-56-29](https://github.com/openjdk/jdk/assets/4222397/44d6a239-587b-4f7c-8ce1-f7613cb2fa35) > > > Some notes: > - While discussing the above changes, a separate question was brought up by @chhagedorn: > > On a separate note, I'm wondering how useful it is to always dump all JFR events when calling print_method(). Should this be revisited again in general? > - The new IGV graph dump enumeration enables a number of cleanups. There is now another RFE for IGV cleanup: [JDK-8319599](https://bugs.openjdk.org/browse/JDK-8319599). > > ### Testing > #### Platforms: windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64 > - tier1, tier2, tier3, tier4, tier5. > - Check that optimized builds (`--with-debug-level optimized`) still work. > > #### Platforms: linux-x64 > - Tested that thousands of graphs are correctly opened and visualized with IGV. Nice enhancement! Looks good to me overall. Some comments below. Please add an IGV screenshot of how the new phases look in the phase list. Could you summarize which of the ideas / proposals from the RFE are not covered? We should file a follow-up RFE for them. > Many of the proposed dump locations are educated guesses. Should we adjust any of them? They look good to me but let's see what others think. > Are the proposed levels (for PrintIdealGraphLevel) reasonable or should we adjust them? I put the loop optimization dumps at level 4 and adjusted IdealGraphVisualizer/README.md accordingly. I think that's reasonable. > I put most new calls to print_method/print_method_iter within a NOT_PRODUCT. Is this OK? 
Existing calls to `print_method` are not guarded because the method also commits a JFR event and updates `Compile::_latest_stage_start_counter`, so I think your new code should behave similarly.

And please make sure that you verify that the 'optimized' build (`--with-debug-level optimized`) still works. It's a level between fastdebug and release where both `#ifdef ASSERT` and `#ifdef PRODUCT` are false.

src/hotspot/share/opto/compile.cpp line 625:

> 623: #ifndef PRODUCT
> 624:   _igv_idx(0),
> 625:   _igv_phase_iter(),

This is value initialization which guarantees proper zeroing, right? For other arrays, for example `Compile::_trap_hist`, we use explicit `Copy::zero_to_bytes` but I think your variant is fine.

src/hotspot/share/opto/compile.cpp line 5115:

> 5113:   print_method(cpt, level, n, iter);
> 5114: #else
> 5115:   print_method(cpt, level, n);

This is dead code because all calls are guarded by `NOT_PRODUCT`, right?

src/utils/IdealGraphVisualizer/.gitignore line 6:

> 4: /lastModified/
> 5: /localeVariants
> 6: /package-attrs.dat

Is that really needed? Just wondering why these files haven't been added before.

-------------

PR Review: https://git.openjdk.org/jdk/pull/16120#pullrequestreview-1673474298
PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1356396506
PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1356402215
PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1356405379

From rcastanedalo at openjdk.org Tue Dec 5 12:12:57 2023
From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano)
Date: Tue, 5 Dec 2023 12:12:57 GMT
Subject: RFR: 8295166: IGV: dump graph at more locations
In-Reply-To:
References:
Message-ID:

On Tue, 10 Oct 2023 13:31:00 GMT, Daniel Lundén wrote:

> This changeset
> 1.
adds a number of new graph dumps for IdealGraphVisualizer (IGV): > - Before conditional constant propagation > - After register allocation > - After block ordering > - After peephole optimization > - After post-allocation expansion > - Before and after > - loop predication > - loop peeling > - pre/main/post loops > - loop unrolling > - range check elimination > - loop unswitching > - partial peeling > - split if > - superword > 2. adds support for enumeration of repeated IGV graph dumps. > 3. adjusts IGV print levels to encompass the new graph dumps. The old levels 4 and 5 are now levels 5 and 6. The new level 4 is for loop optimization dumps. > > Example phase list screenshots in IGV (first at level 6, second at level 4) > ![Screenshot from 2023-12-04 13-55-38](https://github.com/openjdk/jdk/assets/4222397/6759dc5a-9c9a-42b9-8d9e-2d0b53e76ab4) ![Screenshot from 2023-12-04 13-56-29](https://github.com/openjdk/jdk/assets/4222397/44d6a239-587b-4f7c-8ce1-f7613cb2fa35) > > > Some notes: > - While discussing the above changes, a separate question was brought up by @chhagedorn: > > On a separate note, I'm wondering how useful it is to always dump all JFR events when calling print_method(). Should this be revisited again in general? > - The new IGV graph dump enumeration enables a number of cleanups. There is now another RFE for IGV cleanup: [JDK-8319599](https://bugs.openjdk.org/browse/JDK-8319599). > > ### Testing > #### Platforms: windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64 > - tier1, tier2, tier3, tier4, tier5. > - Check that optimized builds (`--with-debug-level optimized`) still work. > > #### Platforms: linux-x64 > - Tested that thousands of graphs are correctly opened and visualized with IGV. Hi Daniel, thanks for working on this! 
The code changes themselves look good, I have a few comments/suggestions: - It might make sense to create a new print level between current 3 and 4 including the new loop transformation dumps but not the individual IGVN step dumps. - I see the value in numbering the `PHASE_PHASEIDEALLOOP_ITERATIONS` dumps, but I am not sure numbering the new loop transformations is worth the additional complexity (`_igv_phase_iter` array, etc.). Limiting numbering to `PHASE_PHASEIDEALLOOP_ITERATIONS` dumps would not require any additional state in `Compile`. - In my opinion, it would be clearer to only dump loop transformations if they actually take place. I find it a bit confusing to see e.g. `Before/After superword ...` graph dumps when vectorization has actually failed and the graph has not changed at all. Dumping only after effective transformations would also match better the output of `TraceLoopOpts`. Enforcing this invariant would require getting rid of the `Before...` dumps though, but this is acceptable (and perhaps even preferable) in my opinion. ------------- PR Review: https://git.openjdk.org/jdk/pull/16120#pullrequestreview-1684732753 From chagedorn at openjdk.org Tue Dec 5 12:13:12 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 5 Dec 2023 12:13:12 GMT Subject: RFR: 8295166: IGV: dump graph at more locations In-Reply-To: References: Message-ID: <5xBtgUJn2lRN0mgB9I_mmSNdYcw7OKwP7NeJDap_JkA=.2489d9b0-a3ea-4b74-be75-d70ea0530d8a@github.com> On Tue, 10 Oct 2023 13:31:00 GMT, Daniel Lund?n wrote: > This changeset > 1. adds a number of new graph dumps for IdealGraphVisualizer (IGV): > - Before conditional constant propagation > - After register allocation > - After block ordering > - After peephole optimization > - After post-allocation expansion > - Before and after > - loop predication > - loop peeling > - pre/main/post loops > - loop unrolling > - range check elimination > - loop unswitching > - partial peeling > - split if > - superword > 2. 
adds support for enumeration of repeated IGV graph dumps. > 3. adjusts IGV print levels to encompass the new graph dumps. The old levels 4 and 5 are now levels 5 and 6. The new level 4 is for loop optimization dumps. > > Example phase list screenshots in IGV (first at level 6, second at level 4) > ![Screenshot from 2023-12-04 13-55-38](https://github.com/openjdk/jdk/assets/4222397/6759dc5a-9c9a-42b9-8d9e-2d0b53e76ab4) ![Screenshot from 2023-12-04 13-56-29](https://github.com/openjdk/jdk/assets/4222397/44d6a239-587b-4f7c-8ce1-f7613cb2fa35) > > > Some notes: > - While discussing the above changes, a separate question was brought up by @chhagedorn: > > On a separate note, I'm wondering how useful it is to always dump all JFR events when calling print_method(). Should this be revisited again in general? > - The new IGV graph dump enumeration enables a number of cleanups. There is now another RFE for IGV cleanup: [JDK-8319599](https://bugs.openjdk.org/browse/JDK-8319599). > > ### Testing > #### Platforms: windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64 > - tier1, tier2, tier3, tier4, tier5. > - Check that optimized builds (`--with-debug-level optimized`) still work. > > #### Platforms: linux-x64 > - Tested that thousands of graphs are correctly opened and visualized with IGV. Thanks for working on this! This is a good addition and helps to better debug with IGV. I left a few comments with some suggestion and improvement ideas. I gave it some more thought to improve some of the places where we dump the phases and how. But this is of course open for discussion. src/hotspot/share/opto/compile.cpp line 5121: > 5119: #endif > 5120: } > 5121: I'm not so sure about having an extra method `print_method_iter()` where the user need to keep track if a method is possibly repeated or not. 
I therefore suggest to only keep `print_method()` with its original signature and do the increment here like this: int iter = ++_igv_phase_iter[cpt]; if (iter > 1) { ss.print(" %d", iter); } Doing it this way we only add a number for the second time a phase is dumped again. I guess that's fine. But I'm open for other opinions about that. src/hotspot/share/opto/loopPredicate.cpp line 1276: > 1274: offset, init, limit, stride, rng, overflow, reason); > 1275: > 1276: C->print_method(PHASE_AFTER_LOOP_PREDICATION_RC, 4, new_predicate_proj->in(0)); I thought about this here again. I propose to merge `PHASE_AFTER_LOOP_PREDICATION_RC` and `PHASE_AFTER_LOOP_PREDICATION_IC` to a single `PHASE_AFTER_LOOP_PREDICATION` phase and dump it after `dominated_by()` on L1289 which kills the hoisted check (replaces the bool with a constant). I think that's more intuitive. Otherwise, the old and the new `If` still share the same `BoolNode` in the dump. You can use `new_predicate_proj->in(0)` which is the same as `new_predicate_iff` for the invariant check. src/hotspot/share/opto/loopPredicate.cpp line 1390: > 1388: set_ctrl(zero, C->root()); > 1389: > 1390: NOT_PRODUCT(C->print_method_iter(PHASE_BEFORE_LOOP_PREDICATION, 4, head);) I suggest to move both the before and after phase into `PhaseIdealLoop::loop_predication_impl_helper()`, where we also do the dump with `TraceLoopOpts`. Additionally, we could also dump the node that's hoisted and the new predicate for it instead of the loop head. We could even define two separate phases for hoisting invariant checks and range checks. 
It could look something like this if we try to hoist `20 IfNode` with predicate `30 IfNode`: For invariant checks: - `Before Loop Predication IC - IfNode 20` - `After Loop Predication IC - IfNode 30` For range checks: - `Before Loop Predication RC - IfNode 20` - `After Loop Predication RC - IfNode 30` src/hotspot/share/opto/loopTransform.cpp line 802: > 800: loop->record_for_igvn(); > 801: > 802: C->print_method(PHASE_AFTER_LOOP_PEELING, 4, head); You can use the new head after peeling here: Suggestion: C->print_method(PHASE_AFTER_LOOP_PEELING, 4, new_head); src/hotspot/share/opto/loopTransform.cpp line 2390: > 2388: #endif > 2389: > 2390: NOT_PRODUCT(C->print_method_iter(PHASE_UNROLL_LOOP, 4, loop_head);) Here you could use the new loop head `clone_head` after unrolling src/hotspot/share/opto/loopTransform.cpp line 2872: > 2870: CountedLoopNode *cl = loop->_head->as_CountedLoop(); > 2871: > 2872: NOT_PRODUCT(C->print_method_iter(PHASE_BEFORE_RANGE_CHECK_ELIMINATION, 4, cl);) Here we could also try to dump the range check to be eliminated instead of the main loop head. There is no replacement though since we adjust the limits of the pre and main loop to eliminate this check. We could still think about dumping the main loop head in the `AFTER` phase as currently done. src/hotspot/share/opto/loopUnswitch.cpp line 205: > 203: #endif > 204: > 205: C->print_method(PHASE_AFTER_LOOP_UNSWITCHING, 4, head); Here you can use the cloned slow loop head: Suggestion: C->print_method(PHASE_AFTER_LOOP_UNSWITCHING, 4, head_clone); src/hotspot/share/opto/loopopts.cpp line 3902: > 3900: #endif > 3901: > 3902: NOT_PRODUCT(C->print_method_iter(PHASE_PARTIAL_PEEL, 4, head);) Here you can also dump the new head `new_head_clone` after partial peeling. src/hotspot/share/opto/phasetype.hpp line 31: > 29: flags(BEFORE_STRINGOPTS, "Before StringOpts") \ > 30: flags(AFTER_STRINGOPTS, "After StringOpts") \ > 31: flags(BEFORE_REMOVEUSELESS, "Before RemoveUseless") \ General comments here. 
I would add a `AFTER_` to match the `BEFORE_` phases for consistency where you also mention "After" in the name string. src/hotspot/share/opto/phasetype.hpp line 36: > 34: flags(ITER_GVN1, "Iter GVN 1") \ > 35: flags(AFTER_ITER_GVN_STEP, "After Iter GVN Step") \ > 36: flags(AFTER_ITER_GVN, "After Iter GVN") \ With the new `AFTER_ITER_GVN` phase that Roberto added some time ago, I think we can get rid of this one here together with Iter GVN 2. src/hotspot/share/opto/phasetype.hpp line 38: > 36: flags(AFTER_ITER_GVN, "After Iter GVN") \ > 37: flags(INCREMENTAL_INLINE_STEP, "Incremental Inline Step") \ > 38: flags(INCREMENTAL_INLINE_CLEANUP, "Incremental Inline Cleanup") \ We could use IGVN instead of Iter GVN which is more common to use. But I'm fine with both versions src/hotspot/share/opto/phasetype.hpp line 50: > 48: flags(BEFORE_BEAUTIFY_LOOPS, "Before beautify loops") \ > 49: flags(AFTER_BEAUTIFY_LOOPS, "After beautify loops") \ > 50: flags(BEFORE_UNROLL_LOOP, "Before loop unrolling") \ I suggest to use the same name as on the right side: Suggestion: flags(BEFORE_LOOP_UNROLLING, "Before loop unrolling") \ src/hotspot/share/opto/phasetype.hpp line 51: > 49: flags(AFTER_BEAUTIFY_LOOPS, "After beautify loops") \ > 50: flags(BEFORE_LOOP_UNROLLING, "Before loop unrolling") \ > 51: flags(AFTER_LOOP_UNROLLING, "After loop unrolling") \ Nit: I suggest to use upper case letters for nouns in the new phase name strings to follow the convention of the other existing phases. 
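
As a side illustration of the `flags(NAME, "Description")` entries discussed in these comments: lists of this shape are typically consumed with the X-macro pattern, where a single list is expanded once into an enum and once into a table of display strings, so an enumerator and its title cannot drift apart. The following is only a minimal sketch of that pattern, not the contents of `phasetype.hpp`. `DEMO_PHASES`, `DemoPhase`, `demo_phase_names`, and `demo_phase_name` are hypothetical names, and the three entries are borrowed from the suggestions in this review.

```cpp
#include <cstddef>

// Hypothetical, shortened phase list; each entry is (enum suffix, display string).
#define DEMO_PHASES(flags)                               \
  flags(BEFORE_LOOP_UNROLLING, "Before Loop Unrolling")  \
  flags(AFTER_LOOP_UNROLLING,  "After Loop Unrolling")   \
  flags(BEFORE_CCP,            "Before CCP")

// First expansion: one enumerator per list entry.
#define DEMO_ENUM_ENTRY(name, description) DEMO_PHASE_##name,
enum DemoPhase {
  DEMO_PHASES(DEMO_ENUM_ENTRY)
  DEMO_PHASE_COUNT  // number of entries, usable as an array size
};
#undef DEMO_ENUM_ENTRY

// Second expansion: display strings in the same order as the enumerators.
#define DEMO_NAME_ENTRY(name, description) description,
static const char* const demo_phase_names[] = {
  DEMO_PHASES(DEMO_NAME_ENTRY)
};
#undef DEMO_NAME_ENTRY

// Looks up the display string that belongs to an enumerator.
inline const char* demo_phase_name(DemoPhase phase) {
  return demo_phase_names[static_cast<std::size_t>(phase)];
}
```

With a layout like this, renaming an entry (for example `BEFORE_UNROLL_LOOP` to `BEFORE_LOOP_UNROLLING`) is a one-line change to the list, which is why keeping the identifier and the string aligned, as suggested above, is cheap.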
src/hotspot/share/opto/phasetype.hpp line 61: > 59: flags(LOOP_PEEL, "After loop peeling") \ > 60: flags(BEFORE_LOOP_UNSWITCH, "Before loop unswitching") \ > 61: flags(LOOP_UNSWITCH, "After loop unswitching") \ Same here: Suggestion: flags(PARTIAL_PEELING, "After partial peeling") \ flags(BEFORE_LOOP_PEELING, "Before loop peeling") \ flags(LOOP_PEELING, "After loop peeling") \ flags(BEFORE_LOOP_UNSWITCHING, "Before loop unswitching") \ flags(LOOP_UNSWITCHING, "After loop unswitching") \ src/hotspot/share/opto/phasetype.hpp line 64: > 62: flags(BEFORE_RANGE_CHECK_ELIMINATION, "Before range check elimination") \ > 63: flags(RANGE_CHECK_ELIMINATION, "After range check elimination") \ > 64: flags(BEFORE_PRE_POST_LOOPS, "Before pre/post loops") \ Here I suggest to use `BEFORE_PRE_MAIN_POST` to also mention the main loop which belongs to the pre and post loop. Same for the string on the right side. src/hotspot/share/opto/phasetype.hpp line 76: > 74: flags(PHASEIDEALLOOP1, "PhaseIdealLoop 1") \ > 75: flags(PHASEIDEALLOOP2, "PhaseIdealLoop 2") \ > 76: flags(PHASEIDEALLOOP3, "PhaseIdealLoop 3") \ I guess we can remove that as well since we already have `AFTER_EA` and `AFTER_ITER_GVN`. src/hotspot/share/opto/phasetype.hpp line 77: > 75: flags(PHASEIDEALLOOP2, "PhaseIdealLoop 2") \ > 76: flags(PHASEIDEALLOOP3, "PhaseIdealLoop 3") \ > 77: flags(BEFORE_CCP1, "Before PhaseCCP 1") \ I suggest to rename this to `AFTER_MACRO_NODE_ELIMINATION` and move it just before `igvn.optimize()` in the code. Otherwise, this phase is a duplication of `AFTER_ITER_GVN`. src/hotspot/share/opto/phasetype.hpp line 80: > 78: flags(CCP1, "PhaseCCP 1") \ > 79: flags(ITER_GVN2, "Iter GVN 2") \ > 80: flags(PHASEIDEALLOOP_ITERATIONS, "PhaseIdealLoop iterations") \ With the new counters, we could now specify a single `PHASE_IDEAL_LOOP` and use that one for these phases and for `PHASEIDEALLOOP_ITERATIONS`. 
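
The "new counters" referred to here are the per-phase dump counters proposed earlier in this review (`int iter = ++_igv_phase_iter[cpt];`, with a number appended only from the second dump of a phase onwards). A self-contained sketch of that idea follows, assuming a value-initialized counter array as in the quoted constructor line `_igv_phase_iter()`. `SketchPhase`, `SketchCompile`, and `next_dump_title` are hypothetical names, and the sketch returns a title string where the real code prints to a stream.

```cpp
#include <string>

// Hypothetical stand-in for the real phase enum.
enum SketchPhase { SKETCH_LOOP_PEELING, SKETCH_LOOP_UNROLLING, SKETCH_PHASE_COUNT };

// Hypothetical stand-in for the part of the compilation state that numbers dumps.
class SketchCompile {
 public:
  // Value-initialization "()" zeroes every element of the counter array.
  SketchCompile() : _phase_iter() {}

  // Returns the title a dump would get: the bare name for the first dump,
  // "<name> <n>" from the second dump of the same phase onwards.
  std::string next_dump_title(SketchPhase phase, const char* name) {
    int iter = ++_phase_iter[phase];
    std::string title(name);
    if (iter > 1) {
      title += " " + std::to_string(iter);
    }
    return title;
  }

 private:
  int _phase_iter[SKETCH_PHASE_COUNT];  // dumps seen so far, per phase
};
```

Because each phase has its own counter, repeated dumps of one phase are numbered independently of the others, which is what would let a single phase name cover several numbered dumps.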
src/hotspot/share/opto/phasetype.hpp line 82: > 80: flags(PHASEIDEALLOOP_ITERATIONS, "PhaseIdealLoop iterations") \ > 81: flags(MACRO_EXPANSION, "Macro expand") \ > 82: flags(BARRIER_EXPANSION, "Barrier expand") \ Since there is only one run of CCP, I suggest to remove "1" and also "phase". Suggestion: flags(BEFORE_CCP, "Before CCP") \ flags(CCP, "After CCP") \ src/hotspot/share/opto/superword.cpp line 2410: > 2408: #endif > 2409: > 2410: CountedLoopNode *cl = lpt()->_head->as_CountedLoop(); Asterisk should be at type: Suggestion: CountedLoopNode* cl = lpt()->_head->as_CountedLoop(); ------------- PR Review: https://git.openjdk.org/jdk/pull/16120#pullrequestreview-1710654550 PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1381622519 PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1386246842 PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1380370224 PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1386106559 PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1381643702 PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1381648394 PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1386122306 PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1381653756 PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1380387096 PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1381606655 PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1381607678 PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1381594341 PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1386263247 PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1381597191 PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1381599029 PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1381610367 PR Review Comment: 
https://git.openjdk.org/jdk/pull/16120#discussion_r1381612937 PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1381602356 PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1380391478 PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1386124951 From duke at openjdk.org Tue Dec 5 12:13:14 2023 From: duke at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Tue, 5 Dec 2023 12:13:14 GMT Subject: RFR: 8295166: IGV: dump graph at more locations In-Reply-To: References: Message-ID: On Tue, 10 Oct 2023 13:31:00 GMT, Daniel Lund?n wrote: > This changeset > 1. adds a number of new graph dumps for IdealGraphVisualizer (IGV): > - Before conditional constant propagation > - After register allocation > - After block ordering > - After peephole optimization > - After post-allocation expansion > - Before and after > - loop predication > - loop peeling > - pre/main/post loops > - loop unrolling > - range check elimination > - loop unswitching > - partial peeling > - split if > - superword > 2. adds support for enumeration of repeated IGV graph dumps. > 3. adjusts IGV print levels to encompass the new graph dumps. The old levels 4 and 5 are now levels 5 and 6. The new level 4 is for loop optimization dumps. > > Example phase list screenshots in IGV (first at level 6, second at level 4) > ![Screenshot from 2023-12-04 13-55-38](https://github.com/openjdk/jdk/assets/4222397/6759dc5a-9c9a-42b9-8d9e-2d0b53e76ab4) ![Screenshot from 2023-12-04 13-56-29](https://github.com/openjdk/jdk/assets/4222397/44d6a239-587b-4f7c-8ce1-f7613cb2fa35) > > > Some notes: > - While discussing the above changes, a separate question was brought up by @chhagedorn: > > On a separate note, I'm wondering how useful it is to always dump all JFR events when calling print_method(). Should this be revisited again in general? > - The new IGV graph dump enumeration enables a number of cleanups. 
There is now another RFE for IGV cleanup: [JDK-8319599](https://bugs.openjdk.org/browse/JDK-8319599). > > ### Testing > #### Platforms: windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64 > - tier1, tier2, tier3, tier4, tier5. > - Check that optimized builds (`--with-debug-level optimized`) still work. > > #### Platforms: linux-x64 > - Tested that thousands of graphs are correctly opened and visualized with IGV. Thanks for the review Tobias! > Please add an IGV screenshot of how the new phases look in the phase list. I've attached two sample screenshots of the phase list (from running one of the tests in `hotspot/jtreg/compiler/loopopts/PartialPeelingUnswitch.java`). Note the enumeration of the loop optimizations and the attached target loop node indices. ![Screenshot from 2023-10-13 12-59-22](https://github.com/openjdk/jdk/assets/4222397/897a8a92-d275-4b06-9d63-720745ed099c) ![Screenshot from 2023-10-13 12-59-53](https://github.com/openjdk/jdk/assets/4222397/ed0840ea-4a49-4e70-affb-cb7b5eda92ad) > Could you summarize which of the ideas / proposals from the RFE are not covered? We should file a follow-up RFE for them. I believe I have covered everything in the RFE. > Existing calls to print_method are not guarded because the method also commits a JFR event and updates Compile::_latest_stage_start_counter, so I think your new code should behave similar. Ah, I see. Note, however, that the existing call to `print_method` for `PHASE_AFTER_ITER_GVN_STEP` (at level 4) is guarded. Also, the bytecode printing (level 5) is guarded (although it does not use `Compile::print_method`). > And please make sure that you verify that the 'optimized' build (--with-debug-level optimized) still works. It's a level between fastdebug and release where both #ifdef ASSERT and #ifdef PRODUCT are false. I'll make sure to include this when testing. 
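
To make the guarding question above concrete, here is a simplified sketch of what a `NOT_PRODUCT`-style wrapper does to a call. The macro below is a reduced stand-in written for this example rather than a copy of HotSpot's definitions, and `GuardDemoCompile` and `run_demo_phase` are hypothetical. The counter stands in for the side effects mentioned in the thread (committing a JFR event, updating the stage start counter).

```cpp
// Reduced stand-in for a NOT_PRODUCT-style guard (assumption: the argument
// is dropped when PRODUCT is defined and kept otherwise).
#ifdef PRODUCT
#define NOT_PRODUCT(code)
#else
#define NOT_PRODUCT(code) code
#endif

// Hypothetical stand-in for the compilation object.
struct GuardDemoCompile {
  int events_committed = 0;  // stands in for "a JFR event was committed"

  void print_method() { ++events_committed; }
};

inline void run_demo_phase(GuardDemoCompile& c) {
  c.print_method();               // unguarded: runs in every build flavor
  NOT_PRODUCT(c.print_method();)  // guarded: compiled out when PRODUCT is defined
}
```

In a build where `PRODUCT` is not defined, which per the description above includes the 'optimized' level, both calls run. In a product build only the unguarded call remains, so a guarded call would also skip its side effects there.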
------------- PR Comment: https://git.openjdk.org/jdk/pull/16120#issuecomment-1761349652 From duke at openjdk.org Tue Dec 5 12:13:16 2023 From: duke at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Tue, 5 Dec 2023 12:13:16 GMT Subject: RFR: 8295166: IGV: dump graph at more locations In-Reply-To: References: Message-ID: On Thu, 12 Oct 2023 07:36:38 GMT, Tobias Hartmann wrote: >> This changeset >> 1. adds a number of new graph dumps for IdealGraphVisualizer (IGV): >> - Before conditional constant propagation >> - After register allocation >> - After block ordering >> - After peephole optimization >> - After post-allocation expansion >> - Before and after >> - loop predication >> - loop peeling >> - pre/main/post loops >> - loop unrolling >> - range check elimination >> - loop unswitching >> - partial peeling >> - split if >> - superword >> 2. adds support for enumeration of repeated IGV graph dumps. >> 3. adjusts IGV print levels to encompass the new graph dumps. The old levels 4 and 5 are now levels 5 and 6. The new level 4 is for loop optimization dumps. >> >> Example phase list screenshots in IGV (first at level 6, second at level 4) >> ![Screenshot from 2023-12-04 13-55-38](https://github.com/openjdk/jdk/assets/4222397/6759dc5a-9c9a-42b9-8d9e-2d0b53e76ab4) ![Screenshot from 2023-12-04 13-56-29](https://github.com/openjdk/jdk/assets/4222397/44d6a239-587b-4f7c-8ce1-f7613cb2fa35) >> >> >> Some notes: >> - While discussing the above changes, a separate question was brought up by @chhagedorn: >> > On a separate note, I'm wondering how useful it is to always dump all JFR events when calling print_method(). Should this be revisited again in general? >> - The new IGV graph dump enumeration enables a number of cleanups. There is now another RFE for IGV cleanup: [JDK-8319599](https://bugs.openjdk.org/browse/JDK-8319599). >> >> ### Testing >> #### Platforms: windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64 >> - tier1, tier2, tier3, tier4, tier5. 
>> - Check that optimized builds (`--with-debug-level optimized`) still work. >> >> #### Platforms: linux-x64 >> - Tested that thousands of graphs are correctly opened and visualized with IGV. > > src/hotspot/share/opto/compile.cpp line 625: > >> 623: #ifndef PRODUCT >> 624: _igv_idx(0), >> 625: _igv_phase_iter(), > > This is value initialization which guarantees proper zeroing, right? For other arrays, for example `Compile::_trap_hist`, we use explicit `Copy::zero_to_bytes` but I think your variant is fine. Yes, correct. I'll switch to the `Copy::zero_to_bytes` initialization, better to be consistent. > src/hotspot/share/opto/compile.cpp line 5115: > >> 5113: print_method(cpt, level, n, iter); >> 5114: #else >> 5115: print_method(cpt, level, n); > > This is dead code because all calls are guarded by `NOT_PRODUCT`, right? Not quite, there is one unguarded call for `PHASE_PHASEIDEALLOOP_ITERATIONS` at level 2. This is an existing phase that I converted from `print_method` to `print_method_iter` (as suggested by @chhagedorn in the issue corresponding to this PR). > src/utils/IdealGraphVisualizer/.gitignore line 6: > >> 4: /lastModified/ >> 5: /localeVariants >> 6: /package-attrs.dat > > Is that really needed? Just wondering why these files haven't been added before. Not sure, IGV generates them at first run on my system (after a `git clean` and clean IGV build). 
-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1358136315
PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1358135091
PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1358131887

From chagedorn at openjdk.org Tue Dec 5 12:13:18 2023
From: chagedorn at openjdk.org (Christian Hagedorn)
Date: Tue, 5 Dec 2023 12:13:18 GMT
Subject: RFR: 8295166: IGV: dump graph at more locations
In-Reply-To:
References:
Message-ID:

On Fri, 13 Oct 2023 11:26:23 GMT, Daniel Lundén wrote:

>> src/hotspot/share/opto/compile.cpp line 5115:
>>
>>> 5113:   print_method(cpt, level, n, iter);
>>> 5114: #else
>>> 5115:   print_method(cpt, level, n);
>>
>> This is dead code because all calls are guarded by `NOT_PRODUCT`, right?
>
> Not quite, there is one unguarded call for `PHASE_PHASEIDEALLOOP_ITERATIONS` at level 2. This is an existing phase that I converted from `print_method` to `print_method_iter` (as suggested by @chhagedorn in the issue corresponding to this PR).

The current rule seems to be that we always emit a JFR event when `print_method()` is called, regardless of whether a graph is actually dumped. I think we should follow this convention and remove the `NOT_PRODUCT` from the calls to `print_method_iter()`.

On a separate note, I'm wondering how useful it is to always dump all events when calling `print_method()`. Should this be revisited again in general?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1380459903

From duke at openjdk.org Tue Dec 5 12:13:18 2023
From: duke at openjdk.org (Daniel Lundén)
Date: Tue, 5 Dec 2023 12:13:18 GMT
Subject: RFR: 8295166: IGV: dump graph at more locations
In-Reply-To:
References:
Message-ID:

On Thu, 2 Nov 2023 16:46:36 GMT, Christian Hagedorn wrote:

>> Not quite, there is one unguarded call for `PHASE_PHASEIDEALLOOP_ITERATIONS` at level 2. This is an existing phase that I converted from `print_method` to `print_method_iter` (as suggested by @chhagedorn in the issue corresponding to this PR).
>
> The current rule seems to be that we always emit a JFR event when `print_method()` is called, regardless of whether a graph is actually dumped. I think we should follow this convention and remove the `NOT_PRODUCT` from the calls to `print_method_iter()`.
>
> On a separate note, I'm wondering how useful it is to always dump all events when calling `print_method()`. Should this be revisited again in general?

I have now incorporated the functionality of `print_method_iter` directly into `print_method`, and also removed all the `NOT_PRODUCT` wrappers.

I'm resolving this thread now. Should we move the discussion regarding JFR event dumping somewhere else?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1383053251

From chagedorn at openjdk.org Tue Dec 5 12:13:18 2023
From: chagedorn at openjdk.org (Christian Hagedorn)
Date: Tue, 5 Dec 2023 12:13:18 GMT
Subject: RFR: 8295166: IGV: dump graph at more locations
In-Reply-To:
References:
Message-ID:

On Mon, 6 Nov 2023 10:06:52 GMT, Daniel Lundén wrote:

>> The current rule seems to be that we always emit a JFR event when `print_method()` is called, regardless of whether a graph is actually dumped. I think we should follow this convention and remove the `NOT_PRODUCT` from the calls to `print_method_iter()`.
>>
>> On a separate note, I'm wondering how useful it is to always dump all events when calling `print_method()`. Should this be revisited again in general?
>
> I have now incorporated the functionality of `print_method_iter` directly into `print_method`, and also removed all the `NOT_PRODUCT` wrappers.
>
> I'm resolving this thread now. Should we move the discussion regarding JFR event dumping somewhere else?

Maybe we can mention it as a side note in the PR description that it is something we should think about at some point. But it should not block this PR.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1383078760

From duke at openjdk.org Tue Dec 5 12:13:18 2023
From: duke at openjdk.org (Daniel Lundén)
Date: Tue, 5 Dec 2023 12:13:18 GMT
Subject: RFR: 8295166: IGV: dump graph at more locations
In-Reply-To:
References:
Message-ID:

On Mon, 6 Nov 2023 10:27:23 GMT, Christian Hagedorn wrote:

>> I have now incorporated the functionality of `print_method_iter` directly into `print_method`, and also removed all the `NOT_PRODUCT` wrappers.
>>
>> I'm resolving this thread now. Should we move the discussion regarding JFR event dumping somewhere else?
>
> Maybe we can mention it as a side note in the PR description that it is something we should think about at some point. But it should not block this PR.

Sure, I'll add it to the final (non-draft) PR description.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1383082935

From duke at openjdk.org Tue Dec 5 12:13:19 2023
From: duke at openjdk.org (Daniel Lundén)
Date: Tue, 5 Dec 2023 12:13:19 GMT
Subject: RFR: 8295166: IGV: dump graph at more locations
In-Reply-To: <5xBtgUJn2lRN0mgB9I_mmSNdYcw7OKwP7NeJDap_JkA=.2489d9b0-a3ea-4b74-be75-d70ea0530d8a@github.com>
References: <5xBtgUJn2lRN0mgB9I_mmSNdYcw7OKwP7NeJDap_JkA=.2489d9b0-a3ea-4b74-be75-d70ea0530d8a@github.com>
Message-ID: <_TXgVBDkti-UmPSv9w7nVzxy7adWTVP_z6mMVCHK0yI=.dd1af554-3577-401c-b5fc-6b2467727cf1@github.com>

On Fri, 3 Nov 2023 12:39:57 GMT, Christian Hagedorn wrote:

>> This changeset
>> 1.
adds a number of new graph dumps for IdealGraphVisualizer (IGV): >> - Before conditional constant propagation >> - After register allocation >> - After block ordering >> - After peephole optimization >> - After post-allocation expansion >> - Before and after >> - loop predication >> - loop peeling >> - pre/main/post loops >> - loop unrolling >> - range check elimination >> - loop unswitching >> - partial peeling >> - split if >> - superword >> 2. adds support for enumeration of repeated IGV graph dumps. >> 3. adjusts IGV print levels to encompass the new graph dumps. The old levels 4 and 5 are now levels 5 and 6. The new level 4 is for loop optimization dumps. >> >> Example phase list screenshots in IGV (first at level 6, second at level 4) >> ![Screenshot from 2023-12-04 13-55-38](https://github.com/openjdk/jdk/assets/4222397/6759dc5a-9c9a-42b9-8d9e-2d0b53e76ab4) ![Screenshot from 2023-12-04 13-56-29](https://github.com/openjdk/jdk/assets/4222397/44d6a239-587b-4f7c-8ce1-f7613cb2fa35) >> >> >> Some notes: >> - While discussing the above changes, a separate question was brought up by @chhagedorn: >> > On a separate note, I'm wondering how useful it is to always dump all JFR events when calling print_method(). Should this be revisited again in general? >> - The new IGV graph dump enumeration enables a number of cleanups. There is now another RFE for IGV cleanup: [JDK-8319599](https://bugs.openjdk.org/browse/JDK-8319599). >> >> ### Testing >> #### Platforms: windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64 >> - tier1, tier2, tier3, tier4, tier5. >> - Check that optimized builds (`--with-debug-level optimized`) still work. >> >> #### Platforms: linux-x64 >> - Tested that thousands of graphs are correctly opened and visualized with IGV. 
> > src/hotspot/share/opto/compile.cpp line 5121: > >> 5119: #endif >> 5120: } >> 5121: > > I'm not so sure about having an extra method `print_method_iter()` where the user need to keep track if a method is possibly repeated or not. I therefore suggest to only keep `print_method()` with its original signature and do the increment here like this: > > > int iter = ++_igv_phase_iter[cpt]; > if (iter > 1) { > ss.print(" %d", iter); > } > > Doing it this way we only add a number for the second time a phase is dumped again. I guess that's fine. But I'm open for other opinions about that. I agree with just having a `print_method`. I initially hesitated to modify `print_method` in the way you suggest, as it would then add iteration numbering to _all_ phases with no exceptions. But, it seems this is a feature we want then? > src/hotspot/share/opto/loopPredicate.cpp line 1276: > >> 1274: offset, init, limit, stride, rng, overflow, reason); >> 1275: >> 1276: C->print_method(PHASE_AFTER_LOOP_PREDICATION_RC, 4, new_predicate_proj->in(0)); > > I thought about this here again. I propose to merge `PHASE_AFTER_LOOP_PREDICATION_RC` and `PHASE_AFTER_LOOP_PREDICATION_IC` to a single `PHASE_AFTER_LOOP_PREDICATION` phase and dump it after `dominated_by()` on L1289 which kills the hoisted check (replaces the bool with a constant). I think that's more intuitive. Otherwise, the old and the new `If` still share the same `BoolNode` in the dump. You can use `new_predicate_proj->in(0)` which is the same as `new_predicate_iff` for the invariant check. Good, now fixed. Just to be clear, I merged the two `AFTER` phases but did not touch the `BEFORE` phases. > src/hotspot/share/opto/loopPredicate.cpp line 1390: > >> 1388: set_ctrl(zero, C->root()); >> 1389: >> 1390: NOT_PRODUCT(C->print_method_iter(PHASE_BEFORE_LOOP_PREDICATION, 4, head);) > > I suggest to move both the before and after phase into `PhaseIdealLoop::loop_predication_impl_helper()`, where we also do the dump with `TraceLoopOpts`. 
Additionally, we could also dump the node that's hoisted and the new predicate for it instead of the loop head. We could even define two separate phases for hoisting invariant checks and range checks. > > It could look something like this if we try to hoist `20 IfNode` with predicate `30 IfNode`: > > For invariant checks: > - `Before Loop Predication IC - IfNode 20` > - `After Loop Predication IC - IfNode 30` > > For range checks: > - `Before Loop Predication RC - IfNode 20` > - `After Loop Predication RC - IfNode 30` Updated now > src/hotspot/share/opto/loopTransform.cpp line 2390: > >> 2388: #endif >> 2389: >> 2390: NOT_PRODUCT(C->print_method_iter(PHASE_UNROLL_LOOP, 4, loop_head);) > > Here you could use the new loop head `clone_head` after unrolling Updated > src/hotspot/share/opto/loopTransform.cpp line 2872: > >> 2870: CountedLoopNode *cl = loop->_head->as_CountedLoop(); >> 2871: >> 2872: NOT_PRODUCT(C->print_method_iter(PHASE_BEFORE_RANGE_CHECK_ELIMINATION, 4, cl);) > > Here we could also try to dump the range check to be eliminated instead of the main loop head. There is no replacement though since we adjust the limits of the pre and main loop to eliminate this check. We could still think about dumping the main loop head in the `AFTER` phase as currently done. Updated now > src/hotspot/share/opto/loopopts.cpp line 3902: > >> 3900: #endif >> 3901: >> 3902: NOT_PRODUCT(C->print_method_iter(PHASE_PARTIAL_PEEL, 4, head);) > > Here you can also dump the new head `new_head_clone` after partial peeling. Thanks, updated. > src/hotspot/share/opto/phasetype.hpp line 31: > >> 29: flags(BEFORE_STRINGOPTS, "Before StringOpts") \ >> 30: flags(AFTER_STRINGOPTS, "After StringOpts") \ >> 31: flags(BEFORE_REMOVEUSELESS, "Before RemoveUseless") \ > > General comments here. I would add a `AFTER_` to match the `BEFORE_` phases for consistency where you also mention "After" in the name string. 
I believe I've kept it consistent for my own additions, but the older phase names are sometimes inconsistent in this regard (including `STRINGOPTS`). Should I rename other phases to improve consistency? This changeset will touch many more files then, but perhaps that's OK. > src/hotspot/share/opto/phasetype.hpp line 36: > >> 34: flags(ITER_GVN1, "Iter GVN 1") \ >> 35: flags(AFTER_ITER_GVN_STEP, "After Iter GVN Step") \ >> 36: flags(AFTER_ITER_GVN, "After Iter GVN") \ > > With the new `AFTER_ITER_GVN` phase that Roberto added some time ago, I think we can get rid of this one here together with Iter GVN 2. Will be addressed in a separate RFE. > src/hotspot/share/opto/phasetype.hpp line 38: > >> 36: flags(AFTER_ITER_GVN, "After Iter GVN") \ >> 37: flags(INCREMENTAL_INLINE_STEP, "Incremental Inline Step") \ >> 38: flags(INCREMENTAL_INLINE_CLEANUP, "Incremental Inline Cleanup") \ > > We could use IGVN instead of Iter GVN which is more common to use. But I'm fine with both versions Will be addressed in a separate RFE. > src/hotspot/share/opto/phasetype.hpp line 51: > >> 49: flags(AFTER_BEAUTIFY_LOOPS, "After beautify loops") \ >> 50: flags(BEFORE_LOOP_UNROLLING, "Before loop unrolling") \ >> 51: flags(AFTER_LOOP_UNROLLING, "After loop unrolling") \ > > Nit: I suggest to use upper case letters for nouns in the new phase name strings to follow the convention of the other existing phases. I switched the new phases to title case. The existing phases are not quite consistent either, so I suggest that we change all phase descriptions to title case (I've added an item for this to the IGV cleanup RFE) > src/hotspot/share/opto/phasetype.hpp line 76: > >> 74: flags(PHASEIDEALLOOP1, "PhaseIdealLoop 1") \ >> 75: flags(PHASEIDEALLOOP2, "PhaseIdealLoop 2") \ >> 76: flags(PHASEIDEALLOOP3, "PhaseIdealLoop 3") \ > > I guess we can remove that as well since we already have `AFTER_EA` and `AFTER_ITER_GVN`. Will be addressed in a separate RFE. 
> src/hotspot/share/opto/phasetype.hpp line 77: > >> 75: flags(PHASEIDEALLOOP2, "PhaseIdealLoop 2") \ >> 76: flags(PHASEIDEALLOOP3, "PhaseIdealLoop 3") \ >> 77: flags(BEFORE_CCP1, "Before PhaseCCP 1") \ > > I suggest to rename this to `AFTER_MACRO_NODE_ELIMINATION` and move it just before `igvn.optimize()` in the code. Otherwise, this phase is a duplication of `AFTER_ITER_GVN`. Will be addressed in a separate RFE. > src/hotspot/share/opto/phasetype.hpp line 80: > >> 78: flags(CCP1, "PhaseCCP 1") \ >> 79: flags(ITER_GVN2, "Iter GVN 2") \ >> 80: flags(PHASEIDEALLOOP_ITERATIONS, "PhaseIdealLoop iterations") \ > > With the new counters, we could now specify a single `PHASE_IDEAL_LOOP` and use that one for these phases and for `PHASEIDEALLOOP_ITERATIONS`. Will be addressed in a separate RFE. > src/hotspot/share/opto/phasetype.hpp line 82: > >> 80: flags(PHASEIDEALLOOP_ITERATIONS, "PhaseIdealLoop iterations") \ >> 81: flags(MACRO_EXPANSION, "Macro expand") \ >> 82: flags(BARRIER_EXPANSION, "Barrier expand") \ > > Since there is only one run of CCP, I suggest to remove "1" and also "phase". > > Suggestion: > > flags(BEFORE_CCP, "Before CCP") \ > flags(CCP, "After CCP") \ Will be addressed in a separate RFE. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1381923946 PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1388033903 PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1383280092 PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1383309989 PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1383432144 PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1383308798 PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1381903351 PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1383096301 PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1383096468 PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1388028782 PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1383097079 PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1383097507 PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1383096000 PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1383063548 From chagedorn at openjdk.org Tue Dec 5 12:13:20 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 5 Dec 2023 12:13:20 GMT Subject: RFR: 8295166: IGV: dump graph at more locations In-Reply-To: <_TXgVBDkti-UmPSv9w7nVzxy7adWTVP_z6mMVCHK0yI=.dd1af554-3577-401c-b5fc-6b2467727cf1@github.com> References: <5xBtgUJn2lRN0mgB9I_mmSNdYcw7OKwP7NeJDap_JkA=.2489d9b0-a3ea-4b74-be75-d70ea0530d8a@github.com> <_TXgVBDkti-UmPSv9w7nVzxy7adWTVP_z6mMVCHK0yI=.dd1af554-3577-401c-b5fc-6b2467727cf1@github.com> Message-ID: On Fri, 3 Nov 2023 16:02:38 GMT, Daniel Lundén wrote: >> src/hotspot/share/opto/compile.cpp line 5121: >> >>> 5119: #endif >>> 5120: } >>> 5121: >> >> I'm not so sure about having an extra method `print_method_iter()` where the user needs to keep track of whether a method is possibly repeated or not.
I therefore suggest to only keep `print_method()` with its original signature and do the increment here like this: >> >> >> int iter = ++_igv_phase_iter[cpt]; >> if (iter > 1) { >> ss.print(" %d", iter); >> } >> >> Doing it this way we only add a number for the second time a phase is dumped again. I guess that's fine. But I'm open for other opinions about that. > > I agree with just having a `print_method`. I initially hesitated to modify `print_method` in the way you suggest, as it would then add iteration numbering to _all_ phases with no exceptions. But, it seems this is a feature we want then? I think it could be useful for any repeated phase, if others also agree with that. And we would still keep the same name for phases that are only printed once and only add a number for the repeated dumps. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1383082103 From duke at openjdk.org Tue Dec 5 12:13:20 2023 From: duke at openjdk.org (Daniel Lundén) Date: Tue, 5 Dec 2023 12:13:20 GMT Subject: RFR: 8295166: IGV: dump graph at more locations In-Reply-To: References: <5xBtgUJn2lRN0mgB9I_mmSNdYcw7OKwP7NeJDap_JkA=.2489d9b0-a3ea-4b74-be75-d70ea0530d8a@github.com> <_TXgVBDkti-UmPSv9w7nVzxy7adWTVP_z6mMVCHK0yI=.dd1af554-3577-401c-b5fc-6b2467727cf1@github.com> Message-ID: On Mon, 6 Nov 2023 10:30:03 GMT, Christian Hagedorn wrote: >> I agree with just having a `print_method`. I initially hesitated to modify `print_method` in the way you suggest, as it would then add iteration numbering to _all_ phases with no exceptions. But, it seems this is a feature we want then? > > I think it could be useful for any repeated phase, if others also agree with that. And we would still keep the same name for phases that are only printed once and only add a number for the repeated dumps. I've changed it for now. If anyone does not agree, please let us know.
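For illustration, the numbering scheme settled on in this exchange could be modeled as in the sketch below. It is a standalone model, not the HotSpot code: the `Phase` enum, the `phase_names` table, and the `PhaseDumper` class with its `dump_name()` method are hypothetical stand-ins, and only the "increment, then append the number when it exceeds 1" step mirrors the snippet quoted in the review.

```cpp
#include <array>
#include <cassert>
#include <string>

// Hypothetical stand-in for a handful of compiler phases (not the HotSpot enum).
enum Phase { BEFORE_LOOP_UNROLLING, AFTER_LOOP_UNROLLING, AFTER_ITER_GVN, PHASE_COUNT };

// Hypothetical description strings, one per phase above.
static const char* const phase_names[PHASE_COUNT] = {
  "Before Loop Unrolling", "After Loop Unrolling", "After Iter GVN"
};

// Hypothetical helper that models one dump counter per phase.
class PhaseDumper {
  std::array<int, PHASE_COUNT> _phase_iter{};  // all counters start at zero

 public:
  // Returns the label under which the next dump of 'phase' would be listed.
  std::string dump_name(Phase phase) {
    assert(phase >= 0 && phase < PHASE_COUNT);
    std::string name = phase_names[phase];
    int iter = ++_phase_iter[phase];
    if (iter > 1) {
      // Only repeated dumps get a number; the first dump keeps the plain name.
      name += " " + std::to_string(iter);
    }
    return name;
  }
};
```

Under this model, a phase that is dumped once keeps its plain description, while the second and third dumps of the same phase are listed with the suffixes " 2" and " 3". Callers do not need to know in advance whether a phase repeats, which is the point made in the review.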
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1383086835 From duke at openjdk.org Tue Dec 5 12:13:21 2023 From: duke at openjdk.org (Daniel Lundén) Date: Tue, 5 Dec 2023 12:13:21 GMT Subject: RFR: 8295166: IGV: dump graph at more locations In-Reply-To: <_TXgVBDkti-UmPSv9w7nVzxy7adWTVP_z6mMVCHK0yI=.dd1af554-3577-401c-b5fc-6b2467727cf1@github.com> References: <5xBtgUJn2lRN0mgB9I_mmSNdYcw7OKwP7NeJDap_JkA=.2489d9b0-a3ea-4b74-be75-d70ea0530d8a@github.com> <_TXgVBDkti-UmPSv9w7nVzxy7adWTVP_z6mMVCHK0yI=.dd1af554-3577-401c-b5fc-6b2467727cf1@github.com> Message-ID: <-NjedxE3lofoM1UtdPUVa0cJkN8sGaa3d1XAmfaH8HQ=.d9da589f-c55a-485a-bbd1-2fa06b32c312@github.com> On Fri, 3 Nov 2023 15:44:52 GMT, Daniel Lundén wrote: >> src/hotspot/share/opto/phasetype.hpp line 31: >> >>> 29: flags(BEFORE_STRINGOPTS, "Before StringOpts") \ >>> 30: flags(AFTER_STRINGOPTS, "After StringOpts") \ >>> 31: flags(BEFORE_REMOVEUSELESS, "Before RemoveUseless") \ >> >> General comments here. I would add an `AFTER_` to match the `BEFORE_` phases for consistency where you also mention "After" in the name string. > > I believe I've kept it consistent for my own additions, but the older phase names are sometimes inconsistent in this regard (including `STRINGOPTS`). Should I rename other phases to improve consistency? This changeset will touch many more files then, but perhaps that's OK. My mistake, it was actually called `BEFORE_STRINGOPTS` and `AFTER_STRINGOPTS`. But, the old phases `MATCHING` and `MACH_ANALYSIS` do contain "After" in the string but not in the name.
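As background for the naming discussion, the quoted `flags(NAME, "description")` lines follow the X-macro pattern: one list is expanded several times to produce an enum and matching string tables. The sketch below is a simplified, hypothetical model of that pattern, not the contents of `phasetype.hpp`. All `DEMO_`/`demo_` names are invented here, the list is shortened, and the description string given for `MATCHING` is assumed for illustration (the thread only states that it contains "After").

```cpp
#include <cassert>
#include <cstring>

// Hypothetical, shortened phase list in the style of the quoted lines.
// The MATCHING description is an assumed example string.
#define DEMO_PHASES(flags)                               \
  flags(BEFORE_STRINGOPTS,     "Before StringOpts")      \
  flags(AFTER_STRINGOPTS,      "After StringOpts")       \
  flags(BEFORE_LOOP_UNROLLING, "Before Loop Unrolling")  \
  flags(AFTER_LOOP_UNROLLING,  "After Loop Unrolling")   \
  flags(MATCHING,              "After Matching")

// First expansion: one enumerator per list entry.
#define DEMO_ENUM_ENTRY(name, description) DEMO_PHASE_##name,
enum DemoPhase { DEMO_PHASES(DEMO_ENUM_ENTRY) DEMO_PHASE_COUNT };
#undef DEMO_ENUM_ENTRY

// Second expansion: the description strings, in the same order as the enum.
#define DEMO_DESC_ENTRY(name, description) description,
static const char* const demo_phase_descriptions[] = { DEMO_PHASES(DEMO_DESC_ENTRY) };
#undef DEMO_DESC_ENTRY

// Third expansion: the identifiers as strings, in the same order as the enum.
#define DEMO_NAME_ENTRY(name, description) #name,
static const char* const demo_phase_names[] = { DEMO_PHASES(DEMO_NAME_ENTRY) };
#undef DEMO_NAME_ENTRY

inline const char* demo_phase_description(DemoPhase p) {
  assert(p >= 0 && p < DEMO_PHASE_COUNT);
  return demo_phase_descriptions[p];
}

// Checks the convention raised in the review: a description that starts with
// "After" should belong to an identifier that starts with AFTER_.
inline bool demo_after_naming_consistent(DemoPhase p) {
  assert(p >= 0 && p < DEMO_PHASE_COUNT);
  bool desc_after = std::strncmp(demo_phase_descriptions[p], "After", 5) == 0;
  bool name_after = std::strncmp(demo_phase_names[p], "AFTER_", 6) == 0;
  return desc_after == name_after;
}
```

Because the identifier and the description live in the same list entry, a mismatch such as the one noted for `MATCHING` is a property of a single line, and a check like `demo_after_naming_consistent` would flag it without touching the call sites.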
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1381912649 From duke at openjdk.org Tue Dec 5 12:13:21 2023 From: duke at openjdk.org (Daniel Lundén) Date: Tue, 5 Dec 2023 12:13:21 GMT Subject: RFR: 8295166: IGV: dump graph at more locations In-Reply-To: <-NjedxE3lofoM1UtdPUVa0cJkN8sGaa3d1XAmfaH8HQ=.d9da589f-c55a-485a-bbd1-2fa06b32c312@github.com> References: <5xBtgUJn2lRN0mgB9I_mmSNdYcw7OKwP7NeJDap_JkA=.2489d9b0-a3ea-4b74-be75-d70ea0530d8a@github.com> <_TXgVBDkti-UmPSv9w7nVzxy7adWTVP_z6mMVCHK0yI=.dd1af554-3577-401c-b5fc-6b2467727cf1@github.com> <-NjedxE3lofoM1UtdPUVa0cJkN8sGaa3d1XAmfaH8HQ=.d9da589f-c55a-485a-bbd1-2fa06b32c312@github.com> Message-ID: On Fri, 3 Nov 2023 15:52:59 GMT, Daniel Lundén wrote: >> I believe I've kept it consistent for my own additions, but the older phase names are sometimes inconsistent in this regard (including `STRINGOPTS`). Should I rename other phases to improve consistency? This changeset will touch many more files then, but perhaps that's OK. > > My mistake, it was actually called `BEFORE_STRINGOPTS` and `AFTER_STRINGOPTS`. But, the old phases `MATCHING` and `MACH_ANALYSIS` do contain "After" in the string but not in the name. I'll resolve this by using `BEFORE_` and `AFTER_` for all new phases that I'm adding, but leaving existing phase names intact. We can address inconsistencies in another RFE. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1383062476 From duke at openjdk.org Tue Dec 5 12:13:22 2023 From: duke at openjdk.org (Daniel Lundén) Date: Tue, 5 Dec 2023 12:13:22 GMT Subject: RFR: 8295166: IGV: dump graph at more locations In-Reply-To: References: Message-ID: On Fri, 13 Oct 2023 11:22:45 GMT, Daniel Lundén wrote: >> src/utils/IdealGraphVisualizer/.gitignore line 6: >> >>> 4: /lastModified/ >>> 5: /localeVariants >>> 6: /package-attrs.dat >> >> Is that really needed?
Just wondering why these files haven't been added before. > > Not sure, IGV generates them at first run on my system (after a `git clean` and clean IGV build). I'll remove this from the PR and make a separate JBS issue. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1383059846 From rcastanedalo at openjdk.org Tue Dec 5 12:26:36 2023 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Tue, 5 Dec 2023 12:26:36 GMT Subject: RFR: 8310524: C2: record parser-generated LoadN nodes for IGVN [v2] In-Reply-To: References: Message-ID: <2p_CCLhe_e5e4sN0i27J8STqXn8sDHXR80dN8zj9H2M=.4dd8ba76-84b6-4ebb-b7c7-dee519fb6741@github.com> On Tue, 5 Dec 2023 11:55:53 GMT, Daniel Lundén wrote: >> This changeset fixes an issue where LoadN nodes were not recorded during bytecode parsing for later revisit in IGVN, in some cases resulting in missed optimization opportunities (see, e.g., the included new regression test). >> >> Changes: >> - Make sure to record newly added LoadN-nodes for IGVN in `GraphKit::make_load`. >> - Add a regression test. >> >> ### Testing >> - tier1, tier2, tier3, tier4, tier5 (windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64) > > Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: > > Address comments Looks good, please re-run testing before integration to make sure the newly added assertion does not fail. ------------- Marked as reviewed by rcastanedalo (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/16967#pullrequestreview-1764950009 From duke at openjdk.org Tue Dec 5 12:26:37 2023 From: duke at openjdk.org (Daniel Lundén) Date: Tue, 5 Dec 2023 12:26:37 GMT Subject: RFR: 8310524: C2: record parser-generated LoadN nodes for IGVN [v2] In-Reply-To: <2p_CCLhe_e5e4sN0i27J8STqXn8sDHXR80dN8zj9H2M=.4dd8ba76-84b6-4ebb-b7c7-dee519fb6741@github.com> References: <2p_CCLhe_e5e4sN0i27J8STqXn8sDHXR80dN8zj9H2M=.4dd8ba76-84b6-4ebb-b7c7-dee519fb6741@github.com> Message-ID: On Tue, 5 Dec 2023 12:23:15 GMT, Roberto Castañeda Lozano wrote: > Looks good, please re-run testing before integration to make sure the newly added assertion does not fail. Of course, thanks for the review! ------------- PR Comment: https://git.openjdk.org/jdk/pull/16967#issuecomment-1840692992 From mli at openjdk.org Tue Dec 5 14:02:48 2023 From: mli at openjdk.org (Hamlin Li) Date: Tue, 5 Dec 2023 14:02:48 GMT Subject: RFR: 8321001: RISC-V: C2 SignumVF [v2] In-Reply-To: References: Message-ID: <03lGd1Rn0dXyqeUvrzDpIpnwOaf-FrTlv81bY1lZwhY=.18ddd848-def4-4cef-af66-7d9697415144@github.com> > Hi, > Can you review the patch to add intrinsic SignumVF/SignumVD on riscv?
> Thanks > > ## Test > test/hotspot/jtreg/compiler/intrinsics/ > test/hotspot/jtreg/compiler/vectorapi/ > and tests found via: > grep -nr test/hotspot/jtreg/ -we Math.signum > and test found via: > grep -nr test/jdk/ -we Math.signum Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: add v0 to effect ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16925/files - new: https://git.openjdk.org/jdk/pull/16925/files/3276c71e..4b89ab0f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16925&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16925&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/16925.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16925/head:pull/16925 PR: https://git.openjdk.org/jdk/pull/16925 From mli at openjdk.org Tue Dec 5 14:02:51 2023 From: mli at openjdk.org (Hamlin Li) Date: Tue, 5 Dec 2023 14:02:51 GMT Subject: RFR: 8321001: RISC-V: C2 SignumVF [v2] In-Reply-To: References: Message-ID: On Mon, 4 Dec 2023 06:51:25 GMT, Fei Yang wrote: > Hi, Did you check the C2 JIT code? I am wondering whether the newly-added code is covered well by the tests performed. I did not check the JIT code, but the tests did help to catch some bugs when I implemented the intrinsic. > src/hotspot/cpu/riscv/assembler_riscv.hpp line 1592: > >> 1590: INSN(vfsgnj_vf, 0b1010111, 0b101, 0b001000); >> 1591: INSN(vfsgnjx_vf, 0b1010111, 0b101, 0b001010); >> 1592: INSN(vfsgnjn_vf, 0b1010111, 0b101, 0b001001); > > Not used anywhere? I can remove them if you think it's better to do it. Reason I added them is that they're quite similar instructions and it's annoying to lookup the instruction formatting in spec and add every single one when needing them. 
> src/hotspot/cpu/riscv/riscv_v.ad line 3670: > >> 3668: match(Set dst (SignumVF dst (Binary zero one))); >> 3669: match(Set dst (SignumVD dst (Binary zero one))); >> 3670: effect(TEMP_DEF dst); > > v0 is clobbered in `C2_MacroAssembler::signum_fp_v`. Shouldn't we add a `TEMP v0` to the effect? Thanks for catching! Fixed. ------------- PR Comment: https://git.openjdk.org/jdk/pull/16925#issuecomment-1840847990 PR Review Comment: https://git.openjdk.org/jdk/pull/16925#discussion_r1415657006 PR Review Comment: https://git.openjdk.org/jdk/pull/16925#discussion_r1415657364 From mli at openjdk.org Tue Dec 5 14:02:52 2023 From: mli at openjdk.org (Hamlin Li) Date: Tue, 5 Dec 2023 14:02:52 GMT Subject: RFR: 8321001: RISC-V: C2 SignumVF [v2] In-Reply-To: <_Jo2yWlqtqmsrZxAJonVjXB1xPRlbPZwIxjNHaVAoTg=.eeea6a5b-8941-4563-9939-b30bc11916b2@github.com> References: <_Jo2yWlqtqmsrZxAJonVjXB1xPRlbPZwIxjNHaVAoTg=.eeea6a5b-8941-4563-9939-b30bc11916b2@github.com> Message-ID: On Mon, 4 Dec 2023 06:52:23 GMT, Vladimir Kempik wrote: >> Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: >> >> add v0 to effect > > src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1681: > >> 1679: void C2_MacroAssembler::signum_fp_v(VectorRegister dst, BasicType bt, int vlen, >> 1680: VectorRegister zero, VectorRegister one) { >> 1681: vsetvli_helper(bt, vlen); > > Can we have a situation where vlen times sew(bt) won't fit into h/w register ? No, as UseRVV is only enable when vlenb >= 16, and `match_rule_supported_vector` return false if UseRVV == false. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16925#discussion_r1415657105 From jbhateja at openjdk.org Tue Dec 5 15:01:37 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 5 Dec 2023 15:01:37 GMT Subject: RFR: 8319111: Mismatched MemorySegment heap access is not consistently intrinsified [v4] In-Reply-To: References: Message-ID: On Sat, 2 Dec 2023 07:53:13 GMT, Jatin Bhateja wrote: >> Patch enables intrinsification of fromMemorySegment, intoMemorySegment APIs and their masked variants for mismatched memory segments, i.e. heap-based memory segments whose backing storage type differs from the vector type they are loaded into or stored from. >> >> A load from a mismatched segment first moves the contents into a type-compatible vector, followed by reinterpretation to the desired vector type. This facilitates value forwarding from a preceding vector store as alias indices are computed using the backing storage type. >> >> Mismatched masked vector loads and stores are performed at byte granularity; this handles both narrowing and widening scenarios where the vector lane size is smaller than the backing storage element type and vice versa. >> >> Following are the performance numbers of an existing JMH micro. >> >> ![image](https://github.com/openjdk/jdk/assets/59989778/a0b177af-78ca-4ac8-b6b0-bfe3655b16a6) >> >> Please review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Correting BIG_ENDIAN_ONLY check Hi @PaulSandoz , Your comments have been addressed. Please let me know if it's good to land this in.
------------- PR Comment: https://git.openjdk.org/jdk/pull/16888#issuecomment-1840963951 From mli at openjdk.org Tue Dec 5 15:03:47 2023 From: mli at openjdk.org (Hamlin Li) Date: Tue, 5 Dec 2023 15:03:47 GMT Subject: RFR: 8321001: RISC-V: C2 SignumVF [v3] In-Reply-To: References: Message-ID: <1HWA8nW4l8CmCtLplPnKDuDrlJg-jOhOkjk1OLFINhQ=.52b418a9-2c73-4996-ab1d-d48a58e2cfb5@github.com> > Hi, > Can you review the patch to add intrinsic SignumVF/SignumVD on riscv? > Thanks > > ## Test > test/hotspot/jtreg/compiler/intrinsics/ > test/hotspot/jtreg/compiler/vectorapi/ > and tests found via: > grep -nr test/hotspot/jtreg/ -we Math.signum > and tests found via: > grep -nr test/jdk/ -we Math.signum Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: enable TestSignumVector.java on riscv ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16925/files - new: https://git.openjdk.org/jdk/pull/16925/files/4b89ab0f..3dddd029 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16925&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16925&range=01-02 Stats: 4 lines in 1 file changed: 2 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/16925.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16925/head:pull/16925 PR: https://git.openjdk.org/jdk/pull/16925 From thartmann at openjdk.org Tue Dec 5 16:01:54 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 5 Dec 2023 16:01:54 GMT Subject: RFR: 8295166: IGV: dump graph at more locations In-Reply-To: References: Message-ID: On Tue, 10 Oct 2023 13:31:00 GMT, Daniel Lundén wrote: > This changeset > 1.
adds a number of new graph dumps for IdealGraphVisualizer (IGV): > - Before conditional constant propagation > - After register allocation > - After block ordering > - After peephole optimization > - After post-allocation expansion > - Before and after > - loop predication > - loop peeling > - pre/main/post loops > - loop unrolling > - range check elimination > - loop unswitching > - partial peeling > - split if > - superword > 2. adds support for enumeration of repeated IGV graph dumps. > 3. adjusts IGV print levels to encompass the new graph dumps. The old levels 4 and 5 are now levels 5 and 6. The new level 4 is for loop optimization dumps. > > Example phase list screenshots in IGV (first at level 6, second at level 4) > ![Screenshot from 2023-12-04 13-55-38](https://github.com/openjdk/jdk/assets/4222397/6759dc5a-9c9a-42b9-8d9e-2d0b53e76ab4) ![Screenshot from 2023-12-04 13-56-29](https://github.com/openjdk/jdk/assets/4222397/44d6a239-587b-4f7c-8ce1-f7613cb2fa35) > > > Some notes: > - While discussing the above changes, a separate question was brought up by @chhagedorn: > > On a separate note, I'm wondering how useful it is to always dump all JFR events when calling print_method(). Should this be revisited again in general? > - The new IGV graph dump enumeration enables a number of cleanups. There is now another RFE for IGV cleanup: [JDK-8319599](https://bugs.openjdk.org/browse/JDK-8319599). > > ### Testing > #### Platforms: windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64 > - tier1, tier2, tier3, tier4, tier5. > - Check that optimized builds (`--with-debug-level optimized`) still work. > > #### Platforms: linux-x64 > - Tested that thousands of graphs are correctly opened and visualized with IGV. That looks good to me! ------------- Marked as reviewed by thartmann (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/16120#pullrequestreview-1765473696 From thartmann at openjdk.org Tue Dec 5 16:06:40 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 5 Dec 2023 16:06:40 GMT Subject: RFR: 8310524: C2: record parser-generated LoadN nodes for IGVN [v2] In-Reply-To: References: Message-ID: <95x_5ClhJG1tjcMpXO2879BUk3B8WR7OFOFEedX_Osk=.d7499d64-1dcc-471d-9a42-2f8697680694@github.com> On Tue, 5 Dec 2023 11:55:53 GMT, Daniel Lundén wrote: >> This changeset fixes an issue where LoadN nodes were not recorded during bytecode parsing for later revisit in IGVN, in some cases resulting in missed optimization opportunities (see, e.g., the included new regression test). >> >> Changes: >> - Make sure to record newly added LoadN-nodes for IGVN in `GraphKit::make_load`. >> - Add a regression test. >> >> ### Testing >> - tier1, tier2, tier3, tier4, tier5 (windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64) > > Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: > > Address comments test/hotspot/jtreg/compiler/c2/irTests/igvn/TestLoadNIdeal.java line 54: > 52: p[0] = new A(); > 53: > 54: // Dummy is not compiled and hence not inlined => Escape analysis Is there a reason you are not using [DontInline](https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/lib/ir_framework/DontInline.java) to prevent inlining of `dummy`?
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16967#discussion_r1415872687 From thartmann at openjdk.org Tue Dec 5 16:09:39 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 5 Dec 2023 16:09:39 GMT Subject: RFR: 8310524: C2: record parser-generated LoadN nodes for IGVN [v2] In-Reply-To: References: Message-ID: On Tue, 5 Dec 2023 11:55:53 GMT, Daniel Lundén wrote: >> This changeset fixes an issue where LoadN nodes were not recorded during bytecode parsing for later revisit in IGVN, in some cases resulting in missed optimization opportunities (see, e.g., the included new regression test). >> >> Changes: >> - Make sure to record newly added LoadN-nodes for IGVN in `GraphKit::make_load`. >> - Add a regression test. >> >> ### Testing >> - tier1, tier2, tier3, tier4, tier5 (windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64) > > Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: > > Address comments Looks good to me! ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/16967#pullrequestreview-1765498814 From thartmann at openjdk.org Tue Dec 5 16:15:35 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 5 Dec 2023 16:15:35 GMT Subject: RFR: 8321215: Incorrect x86 instruction encoding for VSIB addressing mode In-Reply-To: References: Message-ID: On Mon, 4 Dec 2023 19:09:33 GMT, Sandhya Viswanathan wrote: > For instructions that use VSIB addressing mode (gather/scatter), the assembler incorrectly sets EVEX.X bit when the VSIB vector register is in the range XMM16 - XMM23. The EVEX.X bit should only be set when bit 3 of the register encoding is 1, i.e. if the register encoding is 8 - 15 or 24 - 31. Looks reasonable. I assume it's not feasible to come up with a regression test, right? I added the 'noreg-hard' label to the bug. ------------- Marked as reviewed by thartmann (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/16957#pullrequestreview-1765517349 From thartmann at openjdk.org Tue Dec 5 16:30:42 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 5 Dec 2023 16:30:42 GMT Subject: Integrated: 8318468: compiler/tiered/LevelTransitionTest.java fails with -XX:CompileThreshold=100 -XX:TieredStopAtLevel=1 In-Reply-To: References: Message-ID: On Tue, 5 Dec 2023 08:09:34 GMT, Tobias Hartmann wrote: > The test fails with `-XX:CompileThreshold=100 -XX:TieredStopAtLevel=1` because `CompileMethodHolder::nonTrivialMethod` is unexpectedly OSR compiled but the test case has `isOSR() == false` (see line 197). The test is indeed not supposed to trigger an OSR compilation, and usually won't, but the loop is required to test tiered level transitions of a non-trivial method containing a loop. I simply changed the iterations to 1 to make sure that the backedge is never taken and thus prevent unexpected OSR compilations. The method will still be detected to have a loop and serve its purpose. > > Thanks, > Tobias This pull request has now been integrated. Changeset: 61d0db38 Author: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/61d0db3838932d4030b05ffb04ee2b0215ea686e Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod 8318468: compiler/tiered/LevelTransitionTest.java fails with -XX:CompileThreshold=100 -XX:TieredStopAtLevel=1 Reviewed-by: rcastanedalo, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/16964 From sviswanathan at openjdk.org Tue Dec 5 16:42:53 2023 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Tue, 5 Dec 2023 16:42:53 GMT Subject: RFR: 8321215: Incorrect x86 instruction encoding for VSIB addressing mode In-Reply-To: References: Message-ID: On Tue, 5 Dec 2023 16:13:03 GMT, Tobias Hartmann wrote: >> For instructions that use VSIB addressing mode (gather/scatter), the assembler incorrectly sets EVEX.X bit when the VSIB vector register is in the range XMM16 - XMM23. 
The EVEX.X bit should only be set when bit 3 of the register encoding is 1, i.e. if the register encoding is 8 - 15 or 24 - 31. > > Looks reasonable. I assume it's not feasible to come up with a regression test, right? I added the 'noreg-hard' label to the bug. Thanks a lot @TobiHartmann for the review. Yes, I couldn't come up with a regression test. ------------- PR Comment: https://git.openjdk.org/jdk/pull/16957#issuecomment-1841159767 From sviswanathan at openjdk.org Tue Dec 5 16:42:54 2023 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Tue, 5 Dec 2023 16:42:54 GMT Subject: Integrated: 8321215: Incorrect x86 instruction encoding for VSIB addressing mode In-Reply-To: References: Message-ID: <9QETWHvisXYhx-AhARAZxPma1p090KqsY0WtyPdWbb0=.1a25e0ff-f970-47b5-a24a-715502f3c623@github.com> On Mon, 4 Dec 2023 19:09:33 GMT, Sandhya Viswanathan wrote: > For instructions that use VSIB addressing mode (gather/scatter), the assembler incorrectly sets EVEX.X bit when the VSIB vector register is in the range XMM16 - XMM23. The EVEX.X bit should only be set when bit 3 of the register encoding is 1, i.e. if the register encoding is 8 - 15 or 24 - 31. This pull request has now been integrated. 
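The rule stated in the description above can be written down as a one-line predicate. The sketch below is illustrative only and is not the HotSpot assembler change: the function name `vsib_index_needs_evex_x` and its integer parameter are hypothetical, and the code assumes the vector index register is identified by its 5-bit encoding (0 to 31).

```cpp
#include <cassert>

// Hypothetical helper, not the HotSpot assembler code. Returns true when the
// EVEX.X extension is needed for a VSIB vector index register, following the
// rule quoted above: X depends only on bit 3 of the register encoding.
inline bool vsib_index_needs_evex_x(int xmm_encoding) {
  assert(xmm_encoding >= 0 && xmm_encoding < 32);
  return (xmm_encoding & 0x8) != 0;  // true for 8..15 and 24..31
}
```

With this predicate, encodings 16 to 23 (XMM16 - XMM23) have bit 3 clear and therefore do not need EVEX.X, which is the case the description reports as having been handled incorrectly.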
Changeset: 027b5dbb Author: Sandhya Viswanathan URL: https://git.openjdk.org/jdk/commit/027b5dbb6a299e49d3dcbe67d529d6edc67f16d9 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod 8321215: Incorrect x86 instruction encoding for VSIB addressing mode Reviewed-by: shade, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/16957 From psandoz at openjdk.org Tue Dec 5 17:06:37 2023 From: psandoz at openjdk.org (Paul Sandoz) Date: Tue, 5 Dec 2023 17:06:37 GMT Subject: RFR: 8319111: Mismatched MemorySegment heap access is not consistently intrinsified [v4] In-Reply-To: References: Message-ID: On Sat, 2 Dec 2023 07:53:13 GMT, Jatin Bhateja wrote: >> Patch enables intrinsification of fromMemorySegment, intoMemorySegment APIs and their masked variants for mismatched memory segments i.e. heap based memory segments whose backing storage type differs from the vector type in which they are loaded to or stored from. >> >> A load from a mismatched segment first moves the contents into type compatible vector followed by reinterpretation to desired vector type. This facilitates value forwarding from a preceding vector store as alias indices are computed using backing storage type. >> >> Mismatched masked vector loads and stores are performed at byte granularity, this handles both narrowing and widening scenarios where vector lane size is smaller than backing storage element type and vice versa. >> >> Following are the performance numbers of and existing JMH micro. >> >> ![image](https://github.com/openjdk/jdk/assets/59989778/a0b177af-78ca-4ac8-b6b0-bfe3655b16a6) >> >> Please review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Correting BIG_ENDIAN_ONLY check Java changes look good. Internal tests were initiated by @TobiHartmann, they look ok but please confirm. ------------- Marked as reviewed by psandoz (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/16888#pullrequestreview-1765671576

From kvn at openjdk.org Tue Dec 5 17:07:34 2023
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Tue, 5 Dec 2023 17:07:34 GMT
Subject: RFR: 8321225: [JVMCI] HotSpotResolvedObjectTypeImpl.isLeafClass shouldn't create strong references
In-Reply-To: References: Message-ID:

On Mon, 4 Dec 2023 05:36:38 GMT, Tom Rodriguez wrote:

> Checking for leaf Klasses requires seeing if the subklass field is null. As part of the fix for JVMCI support for ZGC, JDK-8299229, it was changed to call into the runtime which had the side effect of creating a strong reference to the class. Since it's only checking for non-null it's ok to just perform the read directly as was done prior to JDK-8299229. This avoids causing class unloading problems.

Looks good to me.

`java/util/stream/GathererTest.java` failure in GHA is a known issue.

-------------

Marked as reviewed by kvn (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/16943#pullrequestreview-1765669321
PR Comment: https://git.openjdk.org/jdk/pull/16943#issuecomment-1841241348

From duke at openjdk.org Tue Dec 5 17:28:39 2023
From: duke at openjdk.org (Srinivas Vamsi Parasa)
Date: Tue, 5 Dec 2023 17:28:39 GMT
Subject: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v8]
In-Reply-To: References: <_gSNXk0qGAtpY-WJ5OCHk_3-nuGrwwSn-ffK9f2TEcs=.40f785ba-83dd-40fe-8075-a7a7872ea600@github.com> Message-ID:

On Tue, 5 Dec 2023 11:19:00 GMT, Magnus Ihse Bursie wrote:

>> Hi Marcus (@magicus), please see the updated code which added guards to check for GCC version >= 7.5 in `src/java.base/linux/native/libsimdsort/{avx2-linux-qsort.cpp, avx512-linux-qsort.cpp}`. GCC >= 7.5 is needed to compile libsimdsort using C++17 features. Made sure that OpenJDK builds without errors using both GCC 7.5 and GCC 6.4.
>
> That sounds weird. You can't check for if compiler options should be enabled or not inside source code files.
>
> Are you saying that when compiling with GCC 6, it will just silently ignore `-std=c++17`? I'd have assumed that it printed a warning or error about an unknown or invalid option, if C++17 is not supported.

Hi Magnus (@magicus),

> Are you saying that when compiling with GCC 6, it will just silently ignore `-std=c++17`? I'd have assumed that it printed a warning or error about an unknown or invalid option, if C++17 is not supported.

The GCC compiler for versions 6 (and even 5) silently ignores the flag `-std=c++17`. It does not print any warning or error. I tested it with a toy C++ program and also by building OpenJDK using GCC 6.

> You can't check for if compiler options should be enabled or not inside source code files.

What I meant was that there are #ifdef guards using predefined macros in the C++ source code to check the GCC version and make the simdsort code available for compilation, or not, based on that version:

    // src/java.base/linux/native/libsimdsort/simdsort-support.hpp
    #if defined(_LP64) && (defined(__GNUC__) && ((__GNUC__ > 7) || ((__GNUC__ == 7) && (__GNUC_MINOR__ >= 5))))
    #define __SIMDSORT_SUPPORTED_LINUX
    #endif

    //src/java.base/linux/native/libsimdsort/avx2-linux-qsort.cpp
    #include "simdsort-support.hpp"
    #ifdef __SIMDSORT_SUPPORTED_LINUX
    #endif

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16534#discussion_r1416037340

From never at openjdk.org Tue Dec 5 18:15:47 2023
From: never at openjdk.org (Tom Rodriguez)
Date: Tue, 5 Dec 2023 18:15:47 GMT
Subject: RFR: 8321225: [JVMCI] HotSpotResolvedObjectTypeImpl.isLeafClass shouldn't create strong references
In-Reply-To: References: Message-ID:

On Mon, 4 Dec 2023 05:36:38 GMT, Tom Rodriguez wrote:

> Checking for leaf Klasses requires seeing if the subklass field is null. As part of the fix for JVMCI support for ZGC, JDK-8299229, it was changed to call into the runtime which had the side effect of creating a strong reference to the class.
> Since it's only checking for non-null it's ok to just perform the read directly as was done prior to JDK-8299229. This avoids causing class unloading problems.

Yes, the possible negative answer just seems like a race that could end up happening either way depending on when you compile. Thanks for the reviews. Testing was clean.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/16943#issuecomment-1841347105

From never at openjdk.org Tue Dec 5 18:15:48 2023
From: never at openjdk.org (Tom Rodriguez)
Date: Tue, 5 Dec 2023 18:15:48 GMT
Subject: Integrated: 8321225: [JVMCI] HotSpotResolvedObjectTypeImpl.isLeafClass shouldn't create strong references
In-Reply-To: References: Message-ID:

On Mon, 4 Dec 2023 05:36:38 GMT, Tom Rodriguez wrote:

> Checking for leaf Klasses requires seeing if the subklass field is null. As part of the fix for JVMCI support for ZGC, JDK-8299229, it was changed to call into the runtime which had the side effect of creating a strong reference to the class. Since it's only checking for non-null it's ok to just perform the read directly as was done prior to JDK-8299229. This avoids causing class unloading problems.

This pull request has now been integrated.
Changeset: fddc02e0
Author: Tom Rodriguez
URL: https://git.openjdk.org/jdk/commit/fddc02e046e926af75661ce167d4531393438c7a
Stats: 6 lines in 1 file changed: 5 ins; 0 del; 1 mod

8321225: [JVMCI] HotSpotResolvedObjectTypeImpl.isLeafClass shouldn't create strong references

Reviewed-by: thartmann, eosterlund, kvn

-------------

PR: https://git.openjdk.org/jdk/pull/16943

From never at openjdk.org Tue Dec 5 19:06:52 2023
From: never at openjdk.org (Tom Rodriguez)
Date: Tue, 5 Dec 2023 19:06:52 GMT
Subject: RFR: 8321288: [JVMCI] HotSpotJVMCIRuntime doesn't clean up WeakReferences in resolvedJavaTypes
Message-ID: <-_CpkWzu-kr4DjL_t7tespZcMBCFz3QQmr4-KsHqjMg=.aa6d1624-610e-41a4-aad3-a714f2c168a3@github.com>

HotSpotJVMCIRuntime.resolvedJavaTypes implements a weak value map but is lacking code to clean out cleared weak references. In normal mixed execution this isn't likely to get big and generally isolates are shut down frequently so this doesn't lead to problems. In Xcomp mode with tests that stress unloading this becomes more problematic. In the worst case it still doesn't lead to large heaps but does make the idle heap larger than required. This PR adds ReferenceQueue-based cleaning of reclaimed values. Testing in the context of a long running isolate shows that they are no longer accumulating.
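
For readers unfamiliar with the technique, here is a minimal sketch of a weak value map that uses a ReferenceQueue to drop entries whose values have been reclaimed. It is only an illustration of the general idea and is not the HotSpotJVMCIRuntime code; the names `WeakValueMap`, `KeyedRef` and `expunge` are hypothetical.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;

/** Illustrative weak value map (hypothetical, not the JVMCI implementation). */
class WeakValueMap<K, V> {
    /** Weak reference that remembers its key so the stale entry can be found later. */
    private static final class KeyedRef<K, V> extends WeakReference<V> {
        final K key;

        KeyedRef(K key, V value, ReferenceQueue<V> queue) {
            super(value, queue);
            this.key = key;
        }
    }

    private final Map<K, KeyedRef<K, V>> map = new HashMap<>();
    private final ReferenceQueue<V> queue = new ReferenceQueue<>();

    synchronized V get(K key) {
        expunge();
        KeyedRef<K, V> ref = map.get(key);
        return ref == null ? null : ref.get();
    }

    synchronized void put(K key, V value) {
        expunge();
        map.put(key, new KeyedRef<>(key, value, queue));
    }

    synchronized int size() {
        expunge();
        return map.size();
    }

    /** Removes entries whose values the garbage collector has already cleared. */
    private void expunge() {
        Reference<? extends V> r;
        while ((r = queue.poll()) != null) {
            KeyedRef<?, ?> ref = (KeyedRef<?, ?>) r;
            // Remove only if the map still holds this exact reference, so a
            // newer value stored under the same key is left untouched.
            map.remove(ref.key, ref);
        }
    }

    public static void main(String[] args) {
        WeakValueMap<String, Object> types = new WeakValueMap<>();
        Object value = new Object();
        types.put("example", value);
        System.out.println(types.size());
        System.out.println(types.get("example") == value);
    }
}
```

Without the `expunge` step the map would keep one cleared `WeakReference` per key forever, which matches the kind of slow accumulation described above. When entries actually disappear depends on the garbage collector, so the sketch makes no promise about timing.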
------------- Commit messages: - 8321288: [JVMCI] HotSpotJVMCIRuntime doesn't clean up WeakReferences in resolvedJavaTypes Changes: https://git.openjdk.org/jdk/pull/16981/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=16981&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8321288 Stats: 34 lines in 1 file changed: 31 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/16981.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16981/head:pull/16981 PR: https://git.openjdk.org/jdk/pull/16981 From jbhateja at openjdk.org Tue Dec 5 19:07:39 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 5 Dec 2023 19:07:39 GMT Subject: RFR: 8319111: Mismatched MemorySegment heap access is not consistently intrinsified [v4] In-Reply-To: References: Message-ID: On Sat, 2 Dec 2023 07:53:13 GMT, Jatin Bhateja wrote: >> Patch enables intrinsification of fromMemorySegment, intoMemorySegment APIs and their masked variants for mismatched memory segments i.e. heap based memory segments whose backing storage type differs from the vector type in which they are loaded to or stored from. >> >> A load from a mismatched segment first moves the contents into type compatible vector followed by reinterpretation to desired vector type. This facilitates value forwarding from a preceding vector store as alias indices are computed using backing storage type. >> >> Mismatched masked vector loads and stores are performed at byte granularity, this handles both narrowing and widening scenarios where vector lane size is smaller than backing storage element type and vice versa. >> >> Following are the performance numbers of and existing JMH micro. >> >> ![image](https://github.com/openjdk/jdk/assets/59989778/a0b177af-78ca-4ac8-b6b0-bfe3655b16a6) >> >> Please review and share your feedback. 
>> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Correting BIG_ENDIAN_ONLY check Hi @TobiHartmann , kindly confirm if results of your internal testing looks good. ------------- PR Comment: https://git.openjdk.org/jdk/pull/16888#issuecomment-1841448479 From jbhateja at openjdk.org Tue Dec 5 19:22:49 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 5 Dec 2023 19:22:49 GMT Subject: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v8] In-Reply-To: <7ocsRxaWjoU2vxwPUSE7BrnLSL1bF_7Pp8vReacNJvE=.1c044b21-b27e-4e94-8db0-6ae888a1e8b9@github.com> References: <7ocsRxaWjoU2vxwPUSE7BrnLSL1bF_7Pp8vReacNJvE=.1c044b21-b27e-4e94-8db0-6ae888a1e8b9@github.com> Message-ID: On Mon, 4 Dec 2023 22:15:24 GMT, Srinivas Vamsi Parasa wrote: >> The goal is to develop faster sort routines for x86_64 CPUs by taking advantage of AVX2 instructions. This enhancement provides an order of magnitude speedup for Arrays.sort() using int, long, float and double arrays. >> >> For serial sort on random data, this PR shows upto ~7.5x improvement for 32-bit datatypes (int, float) on Intel TigerLake machine as shown in the performance data below. >> >> For parallel sort on random data, this PR shows upto ~3.4x for 32-bit datatypes (int, float) as shown below. >> >> **Note:** This PR also improves the performance of AVX512 sort by upto 35%. 
>>
>> Benchmark (Serial Sort) | Size | Baseline (us/op) | AVX2 (us/op) | Speedup
>> -- | -- | -- | -- | --
>> ArraysSort.intSort | 10 | 0.034 | 0.029 | 1.2
>> ArraysSort.intSort | 25 | 0.088 | 0.044 | 2.0
>> ArraysSort.intSort | 50 | 0.239 | 0.159 | 1.5
>> ArraysSort.intSort | 75 | 0.417 | 0.27 | 1.5
>> ArraysSort.intSort | 100 | 0.572 | 0.265 | 2.2
>> ArraysSort.intSort | 1000 | 10.098 | 4.282 | 2.4
>> ArraysSort.intSort | 10000 | 330.065 | 43.383 | 7.6
>> ArraysSort.intSort | 100000 | 4099.527 | 778.943 | 5.3
>> ArraysSort.intSort | 1000000 | 49150.16 | 9634.335 | 5.1
>> ArraysSort.floatSort | 10 | 0.045 | 0.043 | 1.0
>> ArraysSort.floatSort | 25 | 0.105 | 0.073 | 1.4
>> ArraysSort.floatSort | 50 | 0.278 | 0.216 | 1.3
>> ArraysSort.floatSort | 75 | 0.476 | 0.241 | 2.0
>> ArraysSort.floatSort | 100 | 0.583 | 0.313 | 1.9
>> ArraysSort.floatSort | 1000 | 10.182 | 4.329 | 2.4
>> ArraysSort.floatSort | 10000 | 323.136 | 57.175 | 5.7
>> ArraysSort.floatSort | 100000 | 4299.519 | 862.63 | 5.0
>> ArraysSort.floatSort | 1000000 | 50889.4 | 10972.19 | 4.6
>
> Srinivas Vamsi Parasa has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
> The pull request contains 17 additional commits since the last revision:
>
> - Merge branch 'master' of https://git.openjdk.java.net/jdk into simdsort
> - add GCC version guards
> - Merge branch 'master' of https://git.openjdk.java.net/jdk into simdsort
> - Remove C++17 from C flags
> - add avoid masked stores operation
> - update the code to check for supported simd sort cpus
> - Disable AVX2 sort for 64-bit types
> - Merge branch 'master' of https://git.openjdk.java.net/jdk into simdsort
> - fix jcheck failures due to windows encoding
> - fix carriage return and change insertion sort thresholds
> - ... and 7 more: https://git.openjdk.org/jdk/compare/fc7d33f2...bc590d9f

src/java.base/linux/native/libsimdsort/avx2-linux-qsort.cpp line 50:

> 48: case JVM_T_DOUBLE:
> 49: avx2_fast_sort((double*)array, from_index, to_index, INSERTION_SORT_THRESHOLD_64BIT);
> 50: break;

Please add safe assertions for missing types.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16534#discussion_r1416173543

From jbhateja at openjdk.org Tue Dec 5 19:47:49 2023
From: jbhateja at openjdk.org (Jatin Bhateja)
Date: Tue, 5 Dec 2023 19:47:49 GMT
Subject: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v8]
In-Reply-To: <7ocsRxaWjoU2vxwPUSE7BrnLSL1bF_7Pp8vReacNJvE=.1c044b21-b27e-4e94-8db0-6ae888a1e8b9@github.com> References: <7ocsRxaWjoU2vxwPUSE7BrnLSL1bF_7Pp8vReacNJvE=.1c044b21-b27e-4e94-8db0-6ae888a1e8b9@github.com> Message-ID:

On Mon, 4 Dec 2023 22:15:24 GMT, Srinivas Vamsi Parasa wrote:

>> The goal is to develop faster sort routines for x86_64 CPUs by taking advantage of AVX2 instructions. This enhancement provides an order of magnitude speedup for Arrays.sort() using int, long, float and double arrays.
>>
>> For serial sort on random data, this PR shows upto ~7.5x improvement for 32-bit datatypes (int, float) on Intel TigerLake machine as shown in the performance data below.
>>
>> For parallel sort on random data, this PR shows upto ~3.4x for 32-bit datatypes (int, float) as shown below.
>>
>> **Note:** This PR also improves the performance of AVX512 sort by upto 35%.
>>
>> Benchmark (Serial Sort) | Size | Baseline (us/op) | AVX2 (us/op) | Speedup
>> -- | -- | -- | -- | --
>> ArraysSort.intSort | 10 | 0.034 | 0.029 | 1.2
>> ArraysSort.intSort | 25 | 0.088 | 0.044 | 2.0
>> ArraysSort.intSort | 50 | 0.239 | 0.159 | 1.5
>> ArraysSort.intSort | 75 | 0.417 | 0.27 | 1.5
>> ArraysSort.intSort | 100 | 0.572 | 0.265 | 2.2
>> ArraysSort.intSort | 1000 | 10.098 | 4.282 | 2.4
>> ArraysSort.intSort | 10000 | 330.065 | 43.383 | 7.6
>> ArraysSort.intSort | 100000 | 4099.527 | 778.943 | 5.3
>> ArraysSort.intSort | 1000000 | 49150.16 | 9634.335 | 5.1
>> ArraysSort.floatSort | 10 | 0.045 | 0.043 | 1.0
>> ArraysSort.floatSort | 25 | 0.105 | 0.073 | 1.4
>> ArraysSort.floatSort | 50 | 0.278 | 0.216 | 1.3
>> ArraysSort.floatSort | 75 | 0.476 | 0.241 | 2.0
>> ArraysSort.floatSort | 100 | 0.583 | 0.313 | 1.9
>> ArraysSort.floatSort | 1000 | 10.182 | 4.329 | 2.4
>> ArraysSort.floatSort | 10000 | 323.136 | 57.175 | 5.7
>> ArraysSort.floatSort | 100000 | 4299.519 | 862.63 | 5.0
>> ArraysSort.floatSort | 1000000 | 50889.4 | 10972.19 | 4.6
>
> Srinivas Vamsi Parasa has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
> The pull request contains 17 additional commits since the last revision:
>
> - Merge branch 'master' of https://git.openjdk.java.net/jdk into simdsort
> - add GCC version guards
> - Merge branch 'master' of https://git.openjdk.java.net/jdk into simdsort
> - Remove C++17 from C flags
> - add avoid masked stores operation
> - update the code to check for supported simd sort cpus
> - Disable AVX2 sort for 64-bit types
> - Merge branch 'master' of https://git.openjdk.java.net/jdk into simdsort
> - fix jcheck failures due to windows encoding
> - fix carriage return and change insertion sort thresholds
> - ... and 7 more: https://git.openjdk.org/jdk/compare/021f5063...bc590d9f

src/java.base/linux/native/libsimdsort/avx2-emu-funcs.hpp line 64:

> 62: }
> 63: return lut;
> 64: }();

Lut64 is needed for compress64 emulation, can be removed.

src/java.base/linux/native/libsimdsort/avx2-emu-funcs.hpp line 234:

> 232:
> 233: vtype::mask_storeu(leftStore, left, temp);
> 234: }

Can be removed if not being used.

src/java.base/linux/native/libsimdsort/avx2-emu-funcs.hpp line 277:

> 275:
> 276: return _mm_popcnt_u32(shortMask);
> 277: }

Can be removed if not being used.

src/java.base/linux/native/libsimdsort/avx2-linux-qsort.cpp line 44:

> 42: break;
> 43: case JVM_T_FLOAT:
> 44: avx2_fast_sort((float*)array, from_index, to_index, INSERTION_SORT_THRESHOLD_32BIT);

Assertions for unsupported types.

src/java.base/linux/native/libsimdsort/avx2-linux-qsort.cpp line 56:

> 54: case JVM_T_FLOAT:
> 55: avx2_fast_partition((float*)array, from_index, to_index, pivot_indices, index_pivot1, index_pivot2);
> 56: break;

Please add assertion for unsupported types.

src/java.base/linux/native/libsimdsort/avx512-32bit-qsort.hpp line 235:

> 233: return avx512_double_compressstore>(
> 234: left_addr, right_addr, k, reg);
> 235: }

Can be removed.
-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16534#discussion_r1416191049
PR Review Comment: https://git.openjdk.org/jdk/pull/16534#discussion_r1416186096
PR Review Comment: https://git.openjdk.org/jdk/pull/16534#discussion_r1416186814
PR Review Comment: https://git.openjdk.org/jdk/pull/16534#discussion_r1416189371
PR Review Comment: https://git.openjdk.org/jdk/pull/16534#discussion_r1416189115
PR Review Comment: https://git.openjdk.org/jdk/pull/16534#discussion_r1416187350

From fgao at openjdk.org Wed Dec 6 02:00:10 2023
From: fgao at openjdk.org (Fei Gao)
Date: Wed, 6 Dec 2023 02:00:10 GMT
Subject: RFR: 8321308: AArch64: Fix matching predication for cbz/cbnz
Message-ID:

For array length check like:

    if (a.length > 0) {
      [Block 1]
    } else {
      [Block 2]
    }

Since `a.length` is unsigned, it's semantically equivalent to:

    if (a.length != 0) {
      [Block 1]
    } else {
      [Block 2]
    }

On aarch64 port, we can do the conversion like above, during c2 compiler instruction matching, for certain unsigned integral comparisons.

For example,

    cmpw w11, #0 # unsigned
    bls label # unsigned
    [Block 1]

    label:
    [Block 2]

can be converted to:

    cbz w11, label
    [Block 1]

    label:
    [Block 2]

Currently, we have some matching rules to do the conversion [[1]](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/aarch64.ad#L16179). But the predicate here [[2]](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/aarch64.ad#L6140) matches wrong `BoolTest` masks, so these rules fail to convert. I guess it's a typo introduced in [JDK-8160006](https://bugs.openjdk.org/browse/JDK-8160006). The patch fixes it.
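
The equivalence that the conversion relies on can be checked in plain Java. The following is only a small sketch: the class and method names are made up for illustration, and it says nothing about what C2 actually emits. It shows that an unsigned "greater than zero" test agrees with a "not equal to zero" test for every sampled 32-bit pattern, which is why a compare-with-zero plus unsigned branch can be expressed as a single zero test.

```java
/** Hypothetical demo class; not part of the JDK or of the patch. */
class UnsignedZeroCheck {
    // "x > 0" when x is interpreted as an unsigned 32-bit value.
    static boolean unsignedGreaterThanZero(int x) {
        return Integer.compareUnsigned(x, 0) > 0;
    }

    // The plain zero test that a cbnz-style branch corresponds to.
    static boolean notZero(int x) {
        return x != 0;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, 2, 100, Integer.MAX_VALUE, -1, Integer.MIN_VALUE};
        for (int x : samples) {
            if (unsignedGreaterThanZero(x) != notZero(x)) {
                throw new AssertionError("mismatch for " + x);
            }
        }
        // Array lengths are never negative, so the signed form agrees as well.
        int[] empty = new int[0];
        int[] three = new int[3];
        if ((empty.length > 0) != (empty.length != 0)
                || (three.length > 0) != (three.length != 0)) {
            throw new AssertionError("array length mismatch");
        }
        System.out.println("all sampled values agree");
    }
}
```

The same reasoning gives the mirrored case: an unsigned "less than or equal to zero" test is true only for zero, so it agrees with an "equal to zero" test.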
------------- Commit messages: - 8321308: AArch64: Fix matching predication for cbz/cbnz Changes: https://git.openjdk.org/jdk/pull/16989/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=16989&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8321308 Stats: 103 lines in 3 files changed: 94 ins; 0 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/16989.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16989/head:pull/16989 PR: https://git.openjdk.org/jdk/pull/16989 From fyang at openjdk.org Wed Dec 6 02:46:35 2023 From: fyang at openjdk.org (Fei Yang) Date: Wed, 6 Dec 2023 02:46:35 GMT Subject: RFR: 8321001: RISC-V: C2 SignumVF [v3] In-Reply-To: <1HWA8nW4l8CmCtLplPnKDuDrlJg-jOhOkjk1OLFINhQ=.52b418a9-2c73-4996-ab1d-d48a58e2cfb5@github.com> References: <1HWA8nW4l8CmCtLplPnKDuDrlJg-jOhOkjk1OLFINhQ=.52b418a9-2c73-4996-ab1d-d48a58e2cfb5@github.com> Message-ID: <5bo7tE5jD1F09tllj-jUE2QY5VhokIv9QINqpga2XsQ=.12371d33-f337-4cd4-b940-c705be0c6442@github.com> On Tue, 5 Dec 2023 15:03:47 GMT, Hamlin Li wrote: >> Hi, >> Can you review the patch to add intrinisc SignumVF/SignumVD on riscv? >> Thanks >> >> ## Test >> test/hotspot/jtreg/compiler/intrinsics/ >> test/hotspot/jtreg/compiler/vectorapi/ >> and tests found via: >> grep -nr test/hotspot/jtreg/ -we Math.signum >> and test found via: >> grep -nr test/jdk/ -we Math.signum > > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: > > enable TestSignumVector.java on riscv Changes requested by fyang (Reviewer). src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1687: > 1685: mv(t0, fclass_mask::zero | fclass_mask::nan); > 1686: vand_vx(v0, v0, t0); > 1687: vmseq_vv(v0, v0, zero); I don't think that the input `zero` (a vector of floating-point 0.0) is appropriate here for `vmseq_vv` which does vector integer comparison. Why not do `vmseq_vi(v0, v0, 0)` instead? This will also help remove the `zero` parameter of this function. 
src/hotspot/cpu/riscv/riscv_v.ad line 3675: > 3673: BasicType bt = Matcher::vector_element_basic_type(this); > 3674: __ signum_fp_v(as_VectorRegister($dst$$reg), bt, Matcher::vector_length(this), > 3675: as_VectorRegister($zero$$reg), as_VectorRegister($one$$reg)); Nit: maybe leave one extra space here to align with parameters of the preceding line. ------------- PR Review: https://git.openjdk.org/jdk/pull/16925#pullrequestreview-1766524186 PR Review Comment: https://git.openjdk.org/jdk/pull/16925#discussion_r1416579404 PR Review Comment: https://git.openjdk.org/jdk/pull/16925#discussion_r1416582961 From fyang at openjdk.org Wed Dec 6 02:46:37 2023 From: fyang at openjdk.org (Fei Yang) Date: Wed, 6 Dec 2023 02:46:37 GMT Subject: RFR: 8321001: RISC-V: C2 SignumVF [v3] In-Reply-To: References: Message-ID: On Tue, 5 Dec 2023 13:58:09 GMT, Hamlin Li wrote: >> src/hotspot/cpu/riscv/assembler_riscv.hpp line 1592: >> >>> 1590: INSN(vfsgnj_vf, 0b1010111, 0b101, 0b001000); >>> 1591: INSN(vfsgnjx_vf, 0b1010111, 0b101, 0b001010); >>> 1592: INSN(vfsgnjn_vf, 0b1010111, 0b101, 0b001001); >> >> Not used anywhere? > > I can remove them if you think it's better to do it. > Reason I added them is that they're quite similar instructions and it's annoying to lookup the instruction formatting in spec and add every single one when needing them. Ah, I would suggest remove them for test coverity reasons. We can add them back when needed. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16925#discussion_r1416581988 From fgao at openjdk.org Wed Dec 6 03:54:48 2023 From: fgao at openjdk.org (Fei Gao) Date: Wed, 6 Dec 2023 03:54:48 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v21] In-Reply-To: References: Message-ID: On Mon, 4 Dec 2023 12:42:35 GMT, Emanuel Peter wrote: >> I want to push this in JDK23. >> After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). 
>> >> To calm your nerves: most of the changes are in auto-generated tests, and tests in general. >> >> **Context** >> >> `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). >> >> Alignment is split into two tasks: >> - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. >> - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width. >> >> **Problem** >> >> I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). >> In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. >> Thanks @fg1417 for confirming this! >> Hence, we need to fix the alignment correctness checks. >> >> While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. 
>>
>> **Problem Details**
>>
>> Reproducer:
>>
>>
>>     static void test(short[] a, short[] b, short mask) {
>>         for (int i = 0; i < RANGE; i+=8) {
>>             // Problematic for AlignVector
>>             b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0
>>
>>             b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes
>>             b[i+4] = (short)(a[i+4] & mask);
>>             b[i+5] = (short)(a[i+5] & mask);
>>             b[i+6] = (short)(a[i+6] & mask);
>>         }
>>     }
>>
>>
>> During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all.
>>
>> This is problemati...
>
> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:
>
> Suggestions by Christian for naming

src/hotspot/share/opto/superword.cpp line 1735:

> 1733: // "C_const" accounts for "init" instead.
> 1734: // 5) The "C_pre * pre_iter" term represents how much the iv is incremented
> 1735: // during the "pre_iter" many pre-loop iterations. This term can be adjusted

"during the "pre_iter" pre-loop iterations"? Drop "many".

src/hotspot/share/opto/superword.cpp line 1739:

> 1737: // of the main-loop memory reference.
> 1738: // 6) The "C_main * j" term represents how much the iv is increased during "j"
> 1739: // "j" main-loop iterations.

" during "j" main-loop iterations."? You have two repetitive "j" here.

src/hotspot/share/opto/superword.cpp line 1824:

> 1822: // C_invar % abs(C_pre) = 0 (3b*)
> 1823: //
> 1824: // to ensure that the variable term for init and invar can be aligned with the C_pre term.

How about illustrating it more directly?
Like:

    // Only when the variable terms for init and invar are aligned with the C_pre term, i.e.,
    // C_init % abs(C_pre) = 0 (3a*)
    // C_invar % abs(C_pre) = 0 (3b*)
    // in what follows, we can ensure that the C_pre term can align the C_const, C_init and C_invar terms,
    // by adjusting the pre-loop limit (pre_iter).

src/hotspot/share/opto/superword.cpp line 1902:

> 1900: //
> 1901: // 2. If a invariant is present, then we make the solution dependent
> 1902: // on C_pre and invar. Only solutions with tthe same dependenceis are

Typos: "the same dependencies"

src/hotspot/share/opto/superword.cpp line 3874:

> 3872: // otherwise the address does not depend on iv, and the alignment cannot be
> 3873: // affected by adjusting the pre-loop limit.
> 3874: // Further, if abs(scale) >= aw, then N has no effect on alignment, and we are not

Add one blank line after line 3873?

src/hotspot/share/opto/superword.cpp line 3916:

> 3914:
> 3915: // We chose an aw that is the maximal possible vector width for the type of
> 3916: // align_to_ref.

I have a question here: Could we always get benefit from aligning address to maximal possible vector width for small-size types, as vector width becomes large? E.g., for `byte` type on `512-bit` platform, will the pre-loop limit become very large and cost much more to execute the whole loop?
-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1415233354
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1415234834
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1416589939
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1416574488
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1416607050
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1416614970

From fgao at openjdk.org Wed Dec 6 04:08:33 2023
From: fgao at openjdk.org (Fei Gao)
Date: Wed, 6 Dec 2023 04:08:33 GMT
Subject: RFR: 8321308: AArch64: Fix matching predication for cbz/cbnz
In-Reply-To: References: Message-ID:

On Wed, 6 Dec 2023 01:54:59 GMT, Fei Gao wrote:

> For array length check like:
>
>     if (a.length > 0) {
>       [Block 1]
>     } else {
>       [Block 2]
>     }
>
>
> Since `a.length` is unsigned, it's semantically equivalent to:
>
>     if (a.length != 0) {
>       [Block 1]
>     } else {
>       [Block 2]
>     }
>
>
> On aarch64 port, we can do the conversion like above, during c2 compiler instruction matching, for certain unsigned integral comparisons.
>
> For example,
>
>     cmpw w11, #0 # unsigned
>     bls label # unsigned
>     [Block 1]
>
>     label:
>     [Block 2]
>
>
> can be converted to:
>
>     cbz w11, label
>     [Block 1]
>
>     label:
>     [Block 2]
>
>
> Currently, we have some matching rules to do the conversion [[1]](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/aarch64.ad#L16179). But the predicate here [[2]](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/aarch64.ad#L6140) matches wrong `BoolTest` masks, so these rules fail to convert. I guess it's a typo introduced in [JDK-8160006](https://bugs.openjdk.org/browse/JDK-8160006). The patch fixes it.
Suppose the GHA failure of `java/util/stream/GathererTest` on `linux-x86` is not caused by the patch :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/16989#issuecomment-1842050060 From thartmann at openjdk.org Wed Dec 6 07:37:36 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 6 Dec 2023 07:37:36 GMT Subject: RFR: 8319111: Mismatched MemorySegment heap access is not consistently intrinsified [v4] In-Reply-To: References: Message-ID: On Sat, 2 Dec 2023 07:53:13 GMT, Jatin Bhateja wrote: >> Patch enables intrinsification of fromMemorySegment, intoMemorySegment APIs and their masked variants for mismatched memory segments i.e. heap based memory segments whose backing storage type differs from the vector type in which they are loaded to or stored from. >> >> A load from a mismatched segment first moves the contents into type compatible vector followed by reinterpretation to desired vector type. This facilitates value forwarding from a preceding vector store as alias indices are computed using backing storage type. >> >> Mismatched masked vector loads and stores are performed at byte granularity, this handles both narrowing and widening scenarios where vector lane size is smaller than backing storage element type and vice versa. >> >> Following are the performance numbers of and existing JMH micro. >> >> ![image](https://github.com/openjdk/jdk/assets/59989778/a0b177af-78ca-4ac8-b6b0-bfe3655b16a6) >> >> Please review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Correting BIG_ENDIAN_ONLY check Changes look good to me and all tests passed. ------------- Marked as reviewed by thartmann (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/16888#pullrequestreview-1766847708 From thartmann at openjdk.org Wed Dec 6 07:55:32 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 6 Dec 2023 07:55:32 GMT Subject: RFR: 8320649: C2: Optimize scoped values In-Reply-To: References: Message-ID: On Tue, 5 Dec 2023 08:33:08 GMT, Roland Westrelin wrote: > This change implements C2 optimizations for calls to > ScopedValue.get(). Indeed, in: > > > v1 = scopedValue.get(); > ... > v2 = scopedValue.get(); > > > `v2` can be replaced by `v1` and the second call to `get()` can be > optimized out. That's true whatever is between the 2 calls unless a > new mapping for `scopedValue` is created in between (when that happens > no optimizations is performed for the method being compiled). Hoisting > a `get()` call out of loop for a loop invariant `scopedValue` should > also be legal in most cases. > > `ScopedValue.get()` is implemented in java code as a 2 step process. A > cache is attached to the current thread object. If the `ScopedValue` > object is in the cache then the result from `get()` is read from > there. Otherwise a slow call is performed that also inserts the > mapping in the cache. The cache itself is lazily allocated. One > `ScopedValue` can be hashed to 2 different indexes in the cache. On a > cache probe, both indexes are checked. As a consequence, the process > of probing the cache is a multi step process (check if the cache is > present, check first index, check second index if first index > failed). If the cache is populated early on, then when the method that > calls `ScopedValue.get()` is compiled, profile reports the slow path > as never taken and only the read from the cache is compiled. 
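The multi-step cache probe described above can be sketched in plain Java. This is an illustration only and not the `java.lang.ScopedValue` implementation: the class name, the cache size, the way the two indexes are derived and the replacement policy are all assumptions made for the example. Only the overall shape (lazily allocated cache, two candidate indexes, slow path that fills the cache) follows the description.

```java
import java.util.function.Supplier;

public class TwoProbeCacheSketch {
    // Hypothetical size: number of (key, value) pairs; must be a power of two.
    static final int PAIRS = 16;

    // Lazily allocated, laid out as [key0, value0, key1, value1, ...].
    private Object[] cache;

    // Counts how often the slow path ran, for demonstration only.
    int slowCalls = 0;

    // Two hypothetical slot indexes derived from one hash code.
    static int firstIndex(Object key) {
        return System.identityHashCode(key) & (PAIRS - 1);
    }

    static int secondIndex(Object key) {
        return (System.identityHashCode(key) >>> 4) & (PAIRS - 1);
    }

    Object get(Object key, Supplier<Object> slowLookup) {
        Object[] c = cache;
        if (c != null) {                       // step 1: is the cache present?
            int i = 2 * firstIndex(key);
            if (c[i] == key) {                 // step 2: probe the first index
                return c[i + 1];
            }
            i = 2 * secondIndex(key);
            if (c[i] == key) {                 // step 3: probe the second index
                return c[i + 1];
            }
        }
        return slowGet(key, slowLookup);       // miss: the slow call fills the cache
    }

    private Object slowGet(Object key, Supplier<Object> slowLookup) {
        slowCalls++;
        Object value = slowLookup.get();
        if (cache == null) {
            cache = new Object[2 * PAIRS];     // lazy allocation
        }
        int i = 2 * firstIndex(key);           // simplistic policy: always use the first index
        cache[i] = key;
        cache[i + 1] = value;
        return value;
    }

    public static void main(String[] args) {
        TwoProbeCacheSketch sketch = new TwoProbeCacheSketch();
        Object key = new Object();
        sketch.get(key, () -> "bound value");  // slow path, populates the cache
        sketch.get(key, () -> "not used");     // served from the cache
        System.out.println("slow calls: " + sketch.slowCalls);
    }
}
```

Once the cache is populated, repeated lookups of the same key take only the probe branches, which matches the observation that profiling then reports the slow path as never taken.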
> > To perform the optimizations, I added 3 new node types to C2: > > - the pair > ScopedValueGetHitsInCacheNode/ScopedValueGetLoadFromCacheNode for > the cache probe > > - a cfg node ScopedValueGetResultNode to help locate the result of the > `get()` call in the IR graph. > > In pseudo code, once the nodes are inserted, the code of a `get()` is: > > > hits_in_the_cache = ScopedValueGetHitsInCache(scopedValue) > if (hits_in_the_cache) { > res = ScopedValueGetLoadFromCache(hits_in_the_cache); > } else { > res = ..; //slow call possibly inlined. Subgraph can be arbitrarily complex > } > res = ScopedValueGetResult(res) > > > In the snippet: > > > v1 = scopedValue.get(); > ... > v2 = scopedValue.get(); > > > Replacing `v2` by `v1` is then done by starting from the > `ScopedValueGetResult` node for the second `get()` and looking for a > dominating `ScopedValueGetResult` for the same `ScopedValue` > object. When one is found, it is used as a replacement. Eliminating > the second `get()` call is achieved by making > `ScopedValueGetHitsInCache` always successful if there's a dominating > `ScopedValueGetResult` and replacing its companion > `ScopedValueGetLoadFromCache` by the dominating > `ScopedValueGetResult`. > > Hoisting a `g... No review yet, I just performed some quick testing. 
The optimized build fails: [2023-12-05T16:26:12,957Z] open/src/hotspot/share/opto/loopnode.cpp:4745: error: undefined reference to 'ScopedValueGetHitsInCacheNode::verify() const' [2023-12-05T16:26:12,960Z] open/src/hotspot/share/opto/loopnode.cpp:4761: error: undefined reference to 'ScopedValueGetLoadFromCacheNode::verify() const' [2023-12-05T16:26:12,964Z] open/src/hotspot/share/opto/loopnode.cpp:4908: error: undefined reference to 'ScopedValueGetHitsInCacheNode::verify() const' [2023-12-05T16:26:12,967Z] open/src/hotspot/share/opto/loopnode.cpp:4911: error: undefined reference to 'ScopedValueGetLoadFromCacheNode::verify() const' [2023-12-05T16:26:12,976Z] open/src/hotspot/share/opto/loopopts.cpp:3935: error: undefined reference to 'ScopedValueGetHitsInCacheNode::verify() const' [2023-12-05T16:26:15,455Z] collect2: error: ld returned 1 exit status `compiler/c2/irTests/TestScopedValue.java` fails with `-Xcomp` on Linux x64: # A fatal error has been detected by the Java Runtime Environment: # # SIGSEGV (0xb) at pc=0x00007f8de687d1b9, pid=270115, tid=270131 # # JRE version: Java(TM) SE Runtime Environment (22.0) (fastdebug build 22-internal-2023-12-05-1616186.tobias.hartmann.jdk2) # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 22-internal-2023-12-05-1616186.tobias.hartmann.jdk2, compiled mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64) # Problematic frame: # V [libjvm.so+0x12931b9] PhaseIdealLoop::get_early_ctrl(Node*)+0x4c9 Current CompileTask: C2:30390 8110 b 4 compiler.c2.irTests.TestScopedValue::testFastPath13 (28 bytes) Stack: [0x00007f8dc4353000,0x00007f8dc4453000], sp=0x00007f8dc444d6c0, free space=1001k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.so+0x12931b9] PhaseIdealLoop::get_early_ctrl(Node*)+0x4c9 (loopnode.hpp:1139) V [libjvm.so+0x1293d95] PhaseIdealLoop::set_subtree_ctrl(Node*, bool) [clone .part.0]+0x75 (loopnode.cpp:251) V [libjvm.so+0x1293df6] 
PhaseIdealLoop::set_subtree_ctrl(Node*, bool) [clone .part.0]+0xd6 (node.hpp:399) V [libjvm.so+0x1293df6] PhaseIdealLoop::set_subtree_ctrl(Node*, bool) [clone .part.0]+0xd6 (node.hpp:399) V [libjvm.so+0x1293df6] PhaseIdealLoop::set_subtree_ctrl(Node*, bool) [clone .part.0]+0xd6 (node.hpp:399) V [libjvm.so+0x1293df6] PhaseIdealLoop::set_subtree_ctrl(Node*, bool) [clone .part.0]+0xd6 (node.hpp:399) V [libjvm.so+0x1293df6] PhaseIdealLoop::set_subtree_ctrl(Node*, bool) [clone .part.0]+0xd6 (node.hpp:399) V [libjvm.so+0x1293df6] PhaseIdealLoop::set_subtree_ctrl(Node*, bool) [clone .part.0]+0xd6 (node.hpp:399) V [libjvm.so+0x1293df6] PhaseIdealLoop::set_subtree_ctrl(Node*, bool) [clone .part.0]+0xd6 (node.hpp:399) V [libjvm.so+0x12960e0] PhaseIdealLoop::test_and_load_from_cache(Node*, Node*, Node*, Node*, float, float, Node*, Node*&, Node*&, Node*&)+0x820 (loopnode.cpp:4900) V [libjvm.so+0x1296b6a] PhaseIdealLoop::expand_get_from_sv_cache(ScopedValueGetHitsInCacheNode*)+0x82a (loopnode.cpp:4822) V [libjvm.so+0x1297473] PhaseIdealLoop::expand_scoped_value_get_nodes()+0x243 (loopnode.cpp:4737) V [libjvm.so+0x12a45ed] PhaseIdealLoop::build_and_optimize()+0xf0d (loopnode.cpp:4672) V [libjvm.so+0x9f4ea2] PhaseIdealLoop::optimize(PhaseIterGVN&, LoopOptsMode)+0x432 (loopnode.hpp:1113) V [libjvm.so+0x9ed945] Compile::optimize_loops(PhaseIterGVN&, LoopOptsMode)+0x75 (compile.cpp:2248) V [libjvm.so+0x9f0253] Compile::Optimize()+0xfd3 (compile.cpp:2500) V [libjvm.so+0x9f37e1] Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x1c21 (compile.cpp:860) V [libjvm.so+0x83eca7] C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x1e7 (c2compiler.cpp:134) V [libjvm.so+0x9ff17c] CompileBroker::invoke_compiler_on_method(CompileTask*)+0x92c (compileBroker.cpp:2299) V [libjvm.so+0x9ffe08] CompileBroker::compiler_thread_loop()+0x468 (compileBroker.cpp:1958) V [libjvm.so+0xeb93bc] JavaThread::thread_main_inner()+0xcc (javaThread.cpp:720) V 
[libjvm.so+0x17992c6] Thread::call_run()+0xb6 (thread.cpp:220) V [libjvm.so+0x14a30f7] thread_native_entry(Thread*)+0x127 (os_linux.cpp:787) `compiler/c2/irTests/TestScopedValue.java` fails with `-ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:-TieredCompilation` on Linux x64: Failed IR Rules (1) of Methods (1) ---------------------------------- 1) Method "public static void compiler.c2.irTests.TestScopedValue.testFastPath7()" - [Failed IR rules: 1]: * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={}, failOn={"_#C#CALL_OF_METHOD#_", "slowGet"}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})" > Phase "PrintIdeal": - failOn: Graph contains forbidden nodes: * Constraint 1: "(\\d+(\\s){2}(Call.*Java.*)+(\\s){2}===.*slowGet )" - Matched forbidden node: * 501 CallStaticJava === 370 6 7 8 1 (648 1 1 1 1 1 ) [[ 502 503 504 ]] # Static java.lang.ScopedValue::slowGet `compiler/c2/irTests/TestScopedValue.java` fails with `-XX:TypeProfileLevel=222` on AArch64: # A fatal error has been detected by the Java Runtime Environment: # # Internal Error (/System/Volumes/Data/mesos/work_dir/slaves/0db9c48f-6638-40d0-9a4b-bd9cc7533eb8-S29331/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/e2db05c4-923c-4a63-923b-5f9870681cc5/runs/c74e986d-15f1-46dc-822b-a41d12c079e0/workspace/open/src/hotspot/share/opto/callGenerator.cpp:929), pid=44590, tid=26115 # Error: assert(in->Opcode() == Op_LoadP || in->Opcode() == Op_LoadN) failed Current CompileTask: C2:766 689 b 4 compiler.c2.irTests.TestScopedValue::testFastPath1 (30 bytes) Stack: [0x00000001719ec000,0x0000000171bef000], sp=0x0000000171beb0c0, free space=2044k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.dylib+0x1130268] VMError::report_and_die(int, char const*, char const*, char*, Thread*, 
unsigned char*, void*, void*, char const*, int, unsigned long)+0x564 (callGenerator.cpp:929) V [libjvm.dylib+0x1130a88] VMError::report_and_die(Thread*, unsigned int, unsigned char*, void*, void*)+0x0 V [libjvm.dylib+0x5618b0] print_error_for_unit_test(char const*, char const*, char*)+0x0 V [libjvm.dylib+0x396ea0] LateInlineScopedValueCallGenerator::process_result(GraphKit&)+0x2534 V [libjvm.dylib+0x38f8dc] CallGenerator::do_late_inline_helper()+0x660 V [libjvm.dylib+0x4cd2bc] Compile::inline_scoped_value_calls(PhaseIterGVN&)+0x570 V [libjvm.dylib+0x4c6944] Compile::Optimize()+0x210 V [libjvm.dylib+0x4c54bc] Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x1228 V [libjvm.dylib+0x38a590] C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x1e0 V [libjvm.dylib+0x4e2f48] CompileBroker::invoke_compiler_on_method(CompileTask*)+0x854 V [libjvm.dylib+0x4e238c] CompileBroker::compiler_thread_loop()+0x348 V [libjvm.dylib+0x8bb170] JavaThread::thread_main_inner()+0x1dc V [libjvm.dylib+0x1076548] Thread::call_run()+0xf4 V [libjvm.dylib+0xe39138] thread_native_entry(Thread*)+0x138 C [libsystem_pthread.dylib+0x726c] _pthread_start+0x94 `compiler/c2/irTests/TestScopedValue.java` fails with `-XX:+UnlockDiagnosticVMOptions -XX:TieredStopAtLevel=3 -XX:+StressLoopInvariantCodeMotion -XX:+StressRangeCheckElimination -XX:+StressLinearScan` on AArch64: compiler.lib.ir_framework.shared.TestRunException: There was an error while invoking @Run method private void compiler.c2.irTests.TestScopedValue.testFastPath1Runner() throws java.lang.Exception at compiler.lib.ir_framework.test.CustomRunTest.invokeTest(CustomRunTest.java:162) at compiler.lib.ir_framework.test.CustomRunTest.run(CustomRunTest.java:87) at compiler.lib.ir_framework.test.TestVM.runTests(TestVM.java:822) at compiler.lib.ir_framework.test.TestVM.start(TestVM.java:249) at compiler.lib.ir_framework.test.TestVM.main(TestVM.java:164) Caused by: java.lang.reflect.InvocationTargetException at 
java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:118) at java.base/java.lang.reflect.Method.invoke(Method.java:580) at compiler.lib.ir_framework.test.CustomRunTest.invokeTest(CustomRunTest.java:159) ... 4 more Caused by: java.lang.RuntimeException: should be compiled at compiler.c2.irTests.TestScopedValue.testFastPath1Runner(TestScopedValue.java:87) at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103) ... 6 more `compiler/c2/TestUnsignedByteCompare.java` and `compiler/codegen/TestSignedMultiplyLong.java` fail with `-Duse.JTREG_TEST_THREAD_FACTORY=Virtual -XX:-VerifyContinuations` intermittent on Windows x64: # A fatal error has been detected by the Java Runtime Environment: # # Internal Error (/System/Volumes/Data/mesos/work_dir/slaves/0db9c48f-6638-40d0-9a4b-bd9cc7533eb8-S29331/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/e2db05c4-923c-4a63-923b-5f9870681cc5/runs/c74e986d-15f1-46dc-822b-a41d12c079e0/workspace/open/src/hotspot/share/opto/compile.cpp:813), pid=29127, tid=26371 # assert(IncrementalInline || (_late_inlines.length() == 0 && !has_mh_late_inlines())) failed: incremental inlining is off Current CompileTask: C2:5797 3627 b java.lang.System$2::scopedValueCache (4 bytes) Stack: [0x0000000171694000,0x0000000171897000], sp=0x0000000171894bc0, free space=2050k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.dylib+0x1130268] VMError::report_and_die(int, char const*, char const*, char*, Thread*, unsigned char*, void*, void*, char const*, int, unsigned long)+0x564 (compile.cpp:813) V [libjvm.dylib+0x1130a88] VMError::report_and_die(Thread*, unsigned int, unsigned char*, void*, void*)+0x0 V [libjvm.dylib+0x5618b0] print_error_for_unit_test(char const*, char const*, char*)+0x0 V [libjvm.dylib+0x4c5794] Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x1500 V [libjvm.dylib+0x38a590] 
C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x1e0 V [libjvm.dylib+0x4e2f48] CompileBroker::invoke_compiler_on_method(CompileTask*)+0x854 V [libjvm.dylib+0x4e238c] CompileBroker::compiler_thread_loop()+0x348 V [libjvm.dylib+0x8bb170] JavaThread::thread_main_inner()+0x1dc V [libjvm.dylib+0x1076548] Thread::call_run()+0xf4 V [libjvm.dylib+0xe39138] thread_native_entry(Thread*)+0x138 C [libsystem_pthread.dylib+0x726c] _pthread_start+0x94 Just let me know if you need any more information. ------------- Changes requested by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/16966#pullrequestreview-1766883038 From fgao at openjdk.org Wed Dec 6 08:38:39 2023 From: fgao at openjdk.org (Fei Gao) Date: Wed, 6 Dec 2023 08:38:39 GMT Subject: RFR: 8315361: C2 SuperWord: refactor out loop analysis into shared auto-vectorization facility VLoopAnalyzer In-Reply-To: References: Message-ID: On Fri, 10 Nov 2023 17:24:22 GMT, Emanuel Peter wrote: > This is a refactoring of `SuperWord`. > I intend to push it for JDK23, after [this bug fix](https://github.com/openjdk/jdk/pull/14785). > > **Goals** > > 1. Clean up `SuperWord`: disentangle different components, make them more **modular**. > 2. Make the loop analysis parts a **shared facility**, not just for SuperWord but also the post-loop-vectorizer ([JDK-8308994](https://bugs.openjdk.org/browse/JDK-8308994)). > 3. It is also a necessary step on my bigger plans for improvement with the C2 Auto-Vectorizer ([see my blog post](https://eme64.github.io/blog/2023/11/03/C2-AutoVectorizer-Improvement-Ideas.html)). > 4. Improve tracing in the auto-vectorization by making it more systematic. > > **Summary** > > - I wrote a summary of how C2 auto-vectorization with SuperWord works (please read!): > https://github.com/openjdk/jdk/blob/95fd361e60fc66eb91edad321662e508b2d1bdde/src/hotspot/share/opto/superword.hpp#L32-L177 > - I moved many `Superword` components out to `VLoop` and its subclass `VLoopAnalyzer`. 
The idea is that any vectorizer can use these facilities in the future. They are therefore made more modular, which should hopefully make future changes easier. These components are: > - Checking the pre-conditions for vectorization (e.g. no unwanted ctrl-flow). > - `VLoop::check_preconditions_helper` replaces code from old `SuperWord::transform_loop`. > - Running all submodules of `VLoopAnalyzer`: `VLoopAnalyzer::analyze_helper`. Replaces analysis part of `SuperWord::SLP_extract`. > - Finding and marking reductions -> `VLoopReductions` > - Detecting memory slices -> `VLoopMemorySlices` > - Analyzing the body -> `VLoopBody` (renamed `in_bb` -> `in_body`) > - Determining vector element types, and functions to determine the `vector_width` of a node -> `VLoopTypes` > - Constructing the dependence graph -> `VLoopDependenceGraph`. Replaces old `DepGraph` with all its components. > - New: CompileCommand option `TraceAutovectorization` > - Run with `-XX:CompileCommand=traceAutovectorization,*::*,help` to get a usage description. > - Replaced all printing with flags `TraceSuperWord` (and `Verbose`) and of `VectorizeDebug`. > - The advantage of a CompileCommand is that tracing can be applied selectively for only a limited set of java classes / methods. > - It uses tags, which are more readable than the `VectorizeDebug` bit-flags. These tags can be used for all parts of the vectorizer, but one can also target SuperWord specifically. > - I systematically added tracing at every point where vector... 
src/hotspot/share/opto/superword.hpp line 282: > 280: bool is_trace_superword_adjacent_memops() const { > 281: return vla().is_trace_superword_adjacent_memops(); > 282: } How about redefining it as: bool is_trace_superword_adjacent_memops() const { return TraceSuperWord || vla().is_trace_tag_active(TraceAutovectorizationTag::TAG_SW_ADJACENT_MEMOPS); } And add a consulting interface in `class VLoop`: bool is_trace_tag_active(TraceAutovectorizationTag tag) const { return _trace_tags.at(tag); } Thus, we don't have to involve any `SuperWord` specific words or options in shared facility. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16620#discussion_r1416920701 From duke at openjdk.org Wed Dec 6 08:39:32 2023 From: duke at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Wed, 6 Dec 2023 08:39:32 GMT Subject: RFR: 8320682: [AArch64] C1 compilation fails with "Field too big for insn" In-Reply-To: References: Message-ID: On Mon, 4 Dec 2023 14:47:32 GMT, Tobias Hartmann wrote: >> This changeset fixes an issue on aarch64 where addresses for float and double constants were sometimes out of range for PC-relative offsets using `adr`. >> >> Changes: >> - Fix the issue by replacing `adr` with `lea`. >> - Add a regression test. >> >> Thanks to @fisk and @xmas92 for the assistance. >> >> ### Testing >> Tests: tier1, tier2, tier3, tier4, tier5 >> Platforms: windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64 > > Looks good to me. Thanks for the reviews @TobiHartmann and @theRealAph. Integrating now, but I need a sponsor. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/16951#issuecomment-1842428572 From duke at openjdk.org Wed Dec 6 08:50:40 2023 From: duke at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Wed, 6 Dec 2023 08:50:40 GMT Subject: RFR: 8310524: C2: record parser-generated LoadN nodes for IGVN [v2] In-Reply-To: <95x_5ClhJG1tjcMpXO2879BUk3B8WR7OFOFEedX_Osk=.d7499d64-1dcc-471d-9a42-2f8697680694@github.com> References: <95x_5ClhJG1tjcMpXO2879BUk3B8WR7OFOFEedX_Osk=.d7499d64-1dcc-471d-9a42-2f8697680694@github.com> Message-ID: <-B4ydr5YB2n7TrNOnsCrLhrWLgNdjJs7AWL_wykug5A=.2a2e88a1-fc42-41a1-8fe3-c7bf52298e25@github.com> On Tue, 5 Dec 2023 16:03:28 GMT, Tobias Hartmann wrote: >> Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: >> >> Address comments > > test/hotspot/jtreg/compiler/c2/irTests/igvn/TestLoadNIdeal.java line 54: > >> 52: p[0] = new A(); >> 53: >> 54: // Dummy is not compiled and hence not inlined => Escape analysis > > Is there a reason you are not using [DontInline](https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/lib/ir_framework/DontInline.java) to prevent inlining of `dummy`? No particular reason. I checked, and `@DontInline` also makes the test work as expected. Is it preferable to use `@DontInline`? 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16967#discussion_r1416933860 From thartmann at openjdk.org Wed Dec 6 09:03:36 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 6 Dec 2023 09:03:36 GMT Subject: RFR: 8310524: C2: record parser-generated LoadN nodes for IGVN [v2] In-Reply-To: <-B4ydr5YB2n7TrNOnsCrLhrWLgNdjJs7AWL_wykug5A=.2a2e88a1-fc42-41a1-8fe3-c7bf52298e25@github.com> References: <95x_5ClhJG1tjcMpXO2879BUk3B8WR7OFOFEedX_Osk=.d7499d64-1dcc-471d-9a42-2f8697680694@github.com> <-B4ydr5YB2n7TrNOnsCrLhrWLgNdjJs7AWL_wykug5A=.2a2e88a1-fc42-41a1-8fe3-c7bf52298e25@github.com> Message-ID: On Wed, 6 Dec 2023 08:47:45 GMT, Daniel Lundén wrote: >> test/hotspot/jtreg/compiler/c2/irTests/igvn/TestLoadNIdeal.java line 54: >> >>> 52: p[0] = new A(); >>> 53: >>> 54: // Dummy is not compiled and hence not inlined => Escape analysis >> >> Is there a reason you are not using [DontInline](https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/lib/ir_framework/DontInline.java) to prevent inlining of `dummy`? > > No particular reason. I checked, and `@DontInline` also makes the test work as expected. Is it preferable to use `@DontInline`? Yes, I think `@DontInline` would be clearer here. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16967#discussion_r1416950959 From aph at openjdk.org Wed Dec 6 09:16:36 2023 From: aph at openjdk.org (Andrew Haley) Date: Wed, 6 Dec 2023 09:16:36 GMT Subject: RFR: 8320682: [AArch64] C1 compilation fails with "Field too big for insn" In-Reply-To: References: Message-ID: <3FAxzgQkrFqWwD2iMUi30P95ewLH0lZAgUFzPYLYaK8=.cfd70c9e-d2d0-467b-849d-f3bbc14701a7@github.com> On Mon, 4 Dec 2023 14:19:10 GMT, Daniel Lundén wrote: > This changeset fixes an issue on aarch64 where addresses for float and double constants were sometimes out of range for PC-relative offsets using `adr`. > > Changes: > - Fix the issue by replacing `adr` with `lea`. > - Add a regression test. 
> > Thanks to @fisk and @xmas92 for the assistance. > > ### Testing > Tests: tier1, tier2, tier3, tier4, tier5 > Platforms: windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64 It doesn't seem to have been integrated. Wait a little while please, there's something I'd like to try first. ------------- PR Comment: https://git.openjdk.org/jdk/pull/16951#issuecomment-1842482607 From duke at openjdk.org Wed Dec 6 09:30:34 2023 From: duke at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Wed, 6 Dec 2023 09:30:34 GMT Subject: RFR: 8320682: [AArch64] C1 compilation fails with "Field too big for insn" In-Reply-To: <3FAxzgQkrFqWwD2iMUi30P95ewLH0lZAgUFzPYLYaK8=.cfd70c9e-d2d0-467b-849d-f3bbc14701a7@github.com> References: <3FAxzgQkrFqWwD2iMUi30P95ewLH0lZAgUFzPYLYaK8=.cfd70c9e-d2d0-467b-849d-f3bbc14701a7@github.com> Message-ID: <_3jdp86l3uuUPcMQ4VFXXej58pn-1MBAhmRYFfY8dLk=.57b21cf4-a282-48bb-9ea5-1c213c5609c0@github.com> On Wed, 6 Dec 2023 09:14:22 GMT, Andrew Haley wrote: > It doesn't seem to have been integrated. Wait a little while please, there's something I'd like to try first. Sure, I removed my integrate command and Roberto also removed his sponsor. Hopefully, that'll stop the integration. ------------- PR Comment: https://git.openjdk.org/jdk/pull/16951#issuecomment-1842502609 From epeter at openjdk.org Wed Dec 6 09:39:46 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 6 Dec 2023 09:39:46 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v21] In-Reply-To: References: Message-ID: On Wed, 6 Dec 2023 02:48:31 GMT, Fei Gao wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> Suggestions by Christian for naming > > src/hotspot/share/opto/superword.cpp line 1824: > >> 1822: // C_invar % abs(C_pre) = 0 (3b*) >> 1823: // >> 1824: // to ensure that the variable term for init and invar can be aligned with the C_pre term. 
> > How about illustrating it more directly? Like: > > // Only when the variable terms for init and invar are aligned with the C_pre term, i.e., > // C_init % abs(C_pre) = 0 (3a*) > // C_invar % abs(C_pre) = 0 (3b*) > // in what follows, we can ensure that the C_pre term can align the C_const, C_init and C_invar terms, > // by adjusting the pre-loop limit (pre_iter). Your comment prompted me to rewrite the proof a bit all the way down. I think things are now a bit more explicit. Let me know what you think! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1416999976 From epeter at openjdk.org Wed Dec 6 09:43:10 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 6 Dec 2023 09:43:10 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v22] In-Reply-To: References: Message-ID: > I want to push this in JDK23. > After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). > > To calm your nerves: most of the changes are in auto-generated tests, and tests in general. > > **Context** > > `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). > > Alignment is split into two tasks: > - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. > - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width. 
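To make the pre-loop idea in the context paragraph concrete, here is a small Java sketch. It is not the SuperWord code, which reasons symbolically about constant, invariant and init terms; the class name, the method name and the numeric model (a byte offset relative to an already aligned base, a fixed byte stride per iteration) are assumptions for illustration. It searches for the number of pre-loop iterations after which an access becomes aligned, and reports `-1` when no iteration count can achieve that.

```java
public class PreLoopAlignmentSketch {
    // Returns the smallest k >= 0 such that (startOffsetBytes + k * strideBytes)
    // is a multiple of alignmentBytes, or -1 if no such k exists.
    // Residues repeat with a period that divides alignmentBytes, so checking
    // k in [0, alignmentBytes) is sufficient.
    static int preIterationsToAlign(long startOffsetBytes, int strideBytes, int alignmentBytes) {
        for (int k = 0; k < alignmentBytes; k++) {
            long address = startOffsetBytes + (long) k * strideBytes;
            if (Math.floorMod(address, (long) alignmentBytes) == 0) {
                return k;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // short access at byte offset 6, stride 2 bytes, 8-byte alignment: one pre-iteration.
        System.out.println(preIterationsToAlign(6, 2, 8));
        // Same offset, but the loop advances 16 bytes per iteration: never aligned.
        System.out.println(preIterationsToAlign(6, 16, 8));
    }
}
```

The second call models the situation where vectorization has to be rejected under `-XX:+AlignVector`: the stride is a multiple of the alignment, so the misalignment of the starting offset can never be corrected by running more pre-loop iterations.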
> > **Problem** > > I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). > In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. > Thanks @fg1417 for confirming this! > Hence, we need to fix the alignment correctness checks. > > While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. > > **Problem Details** > > Reproducer: > > > static void test(short[] a, short[] b, short mask) { > for (int i = 0; i < RANGE; i+=8) { > // Problematic for AlignVector > b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 > > b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes > b[i+4] = (short)(a[i+4] & mask); > b[i+5] = (short)(a[i+5] & mask); > b[i+6] = (short)(a[i+6] & mask); > } > } > > > During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. > > This is problematic as shown in this example. We have references at index offset `0, 3, 4, 5, 6`, and by... 
Emanuel Peter has updated the pull request incrementally with two additional commits since the last revision: - add newline suggested by Faye - improve the alignment proof, make it more explicit ------------- Changes: - all: https://git.openjdk.org/jdk/pull/14785/files - new: https://git.openjdk.org/jdk/pull/14785/files/85cda773..fed4b013 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=21 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=20-21 Stats: 126 lines in 1 file changed: 68 ins; 20 del; 38 mod Patch: https://git.openjdk.org/jdk/pull/14785.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14785/head:pull/14785 PR: https://git.openjdk.org/jdk/pull/14785 From epeter at openjdk.org Wed Dec 6 09:52:43 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 6 Dec 2023 09:52:43 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs In-Reply-To: References: Message-ID: <3XWSvhohqRmR34gd0qGn62rZrfQU_liCr8gAVXIMQPE=.0f29e9e5-dcf2-4888-bf99-c9d4b7a0030b@github.com> On Wed, 15 Nov 2023 03:14:22 GMT, Fei Gao wrote: >>> @fg1417 thanks for the help! I ran tier1-6 with `-XX:+AlignVector -XX:+IgnoreUnrecognizedVMOptions -XX:+VerifyAlignVector`. Though `VerifyAlignVector` only has an effect if there is a matching rule in the `ad` files. And I ran quite a few repetitions of `compiler/loopopts/superword/TestAlignVectorFuzzer.java`. >> >> Hi @eme64, we have [ `tst`](https://developer.arm.com/documentation/ddi0602/2023-09/Base-Instructions/TST--immediate---Test-bits--immediate---an-alias-of-ANDS--immediate--?lang=en) on aarch64. >> >> instruct verify_vector_alignment(iRegP r, immL_positive_bitmaskI mask, rFlagsReg cr) %{ >> match(Set r (VerifyVectorAlignment r mask)); >> effect(KILL cr); >> format %{ "verify_vector_alignment $r $mask \t! 
verify alignment" %} >> ins_encode %{ >> Label Lskip; >> // check if masked bits of r are zero >> __ tst($r$$Register, $mask$$constant); >> __ br(Assembler::EQ, Lskip); >> __ stop("verify_vector_alignment found a misaligned vector memory access"); >> __ bind(Lskip); >> %} >> ins_pipe( pipe_slow ); >> %} >> >> >> I tested tier1-tier3 with `-XX:+AlignVector -XX:+IgnoreUnrecognizedVMOptions -XX:+VerifyAlignVector` on aarch64(neon and 128-bit sve) platforms, and several repetitions of `compiler/loopopts/superword/TestAlignVectorFuzzer.java`. No new failures found. Hope it works for you. Thanks. > >> @fg1417 thanks for the snippet! Where do I add it? I don't know how the `ad` and `m4` work. Is there some script that converts them? > > Hi @eme64 , you can add it [here](https://github.com/openjdk/jdk/blob/d9a89c59daa40fdc8da620940d5c518a9f18bc7b/src/hotspot/cpu/aarch64/aarch64.ad#L16394). Generally, we add **vector** operations to `src/hotspot/cpu/aarch64/aarch64_vector_ad.m4`, and redirect the output of the command: `m4 src/hotspot/cpu/aarch64/aarch64_vector_ad.m4` to `src/hotspot/cpu/aarch64/aarch64_vector.ad`. And other instructions are often added to `src/hotspot/cpu/aarch64/aarch64.ad`. Thanks! @fg1417 I rewrote the proof a bit, after some of your suggestions. I wanted to make the decomposition `pre_iter = pre_iter_C_const + pre_iter_C_invar + pre_iter_C_init` more explicit and bring it up earlier. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/14785#issuecomment-1842538942 From epeter at openjdk.org Wed Dec 6 09:52:46 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 6 Dec 2023 09:52:46 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v21] In-Reply-To: References: Message-ID: On Wed, 6 Dec 2023 03:38:51 GMT, Fei Gao wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> Suggestions by Christian for naming > > src/hotspot/share/opto/superword.cpp line 3916: > >> 3914: >> 3915: // We chose an aw that is the maximal possible vector width for the type of >> 3916: // align_to_ref. > > I have a question here: Could we always get benefit from aligning address to maximal possible vector width for small-size types, as vector width becomes large? E.g., for `byte` type on `512-bit` platform, will the pre-loop limit become very large and cost much more to execute the whole loop? I agree, this is a concern. But it was already like that before my fix: `int vw = vector_width_in_bytes(p.mem());` If we want to relax this, I suggest we do that in a future RFE. Some thoughts: - I did not want to change this now, since it may affect performance, and this is a bug fix. - We may relax the `aw` to be the maximal vector size used in the vectorization. Or we can make it even smaller, i.e. `MIN2(max_vw, ObjectAlignmentInBytes)`, which would then be analogous to `AlignmentSolution SuperWord::pack_alignment_solution`. - We may even want to completely remove the pre-loop adjustment if alignment is not a strict requirement and, on average, costs more performance than it gains. It is for example questionable to align one of the mem_refs if there are multiple, since we cannot guarantee alignment of all of them anyway. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1417017418 From fgao at openjdk.org Wed Dec 6 09:59:50 2023 From: fgao at openjdk.org (Fei Gao) Date: Wed, 6 Dec 2023 09:59:50 GMT Subject: RFR: 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug" Message-ID: <16J-lJ2AceGTVcRWBcP15yKcwO-1IA1XsngyOuNjf7k=.0776f081-ae2c-4279-87cf-d909806c2bc4@github.com> On LP64 systems, if the heap can be moved into low virtual address space (below 4GB) and the heap size is smaller than the interesting threshold of 4 GB, we can use unscaled decoding pattern for narrow klass decoding. It means that a generic field reference can be decoded by: cast<64> (32-bit compressed reference) + field_offset When the `field_offset` is an immediate, on aarch64 platform, the unscaled decoding pattern can match perfectly with a direct addressing mode, i.e., `base_plus_offset`, supported by `LDR/STR` instructions. But for certain data width, not all immediates can be encoded in the instruction field of `LDR/STR` [[1]](https://github.com/openjdk/jdk/blob/8db7bad992a0f31de9c7e00c2657c18670539102/src/hotspot/cpu/aarch64/assembler_aarch64.inline.hpp#L33). The ranges are different as data widths vary. For example, when we try to load a value of long type at offset of `1030`, the address expression is `(AddP (DecodeN base) 1030)`. Before the patch, the expression was matching with `operand indOffIN()`. But, for 64-bit `LDR/STR`, signed immediate byte offset must be in the range -256 to 255 or positive immediate byte offset must be a multiple of 8 in the range 0 to 32760 [[2]](https://developer.arm.com/documentation/ddi0602/2023-09/Base-Instructions/LDR--immediate---Load-Register--immediate--?lang=en). `1030` can't be encoded in the instruction field. So, after matching, when we do checking for instruction encoding, the assertion would fail. 
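The encoding rule quoted above can be written down as a small predicate. The Java sketch below is an illustration, not the HotSpot check itself; the class and method names are hypothetical, and the generalisation from the 64-bit case in the text to other access sizes (an unsigned 12-bit field scaled by the access size, or a signed 9-bit unscaled offset) is an assumption based on the Arm documentation linked in the description.

```java
public class LdrStrOffsetSketch {
    // accessBytes is the size of the loaded or stored value: 1, 2, 4 or 8.
    static boolean offsetOkForImmediate(long offset, int accessBytes) {
        // Unscaled form: signed 9-bit byte offset.
        boolean unscaled = offset >= -256 && offset <= 255;
        // Scaled form: non-negative multiple of the access size whose
        // quotient fits in an unsigned 12-bit field (0..4095).
        boolean scaled = offset >= 0
                && offset % accessBytes == 0
                && offset / accessBytes <= 4095;
        return unscaled || scaled;
    }

    public static void main(String[] args) {
        // The example from the description: a long load at offset 1030.
        System.out.println(offsetOkForImmediate(1030, 8));
        // The same offset is fine for a 2-byte access.
        System.out.println(offsetOkForImmediate(1030, 2));
    }
}
```

For an 8-byte access the scaled range works out to multiples of 8 from 0 to 32760, matching the range given in the text, and offset `1030` is rejected because it is neither within -256 to 255 nor a multiple of 8. The same offset is accepted for 1-byte and 2-byte accesses, which illustrates why the valid range depends on the data width.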
In this patch, we're going to filter out invalid immediates when deciding if current addressing mode can be matched as `base_plus_offset`. We introduce `indOffIN4/indOffLN4` and `indOffIN8/indOffLN8` for 32-bit data type and 64-bit data type separately in the patch. E.g., for `memory4`, we remove the generic `indOffIN/indOffLN`, which matches wrong unscaled immediate range, and replace them with `indOffIN4/indOffLN4` instead. Since 8-bit and 16-bit `LDR/STR` instructions also support the unscaled decoding pattern, we add the addressing mode in the lists of `memory1` and `memory2` by introducing `indOffIN1/indOffLN1` and `indOffIN2/indOffLN2`. We also remove unused operands `indOffI/indOffl/indOffIN/indOffLN` to avoid misuse. Tier 1-3 passed on aarch64. ------------- Commit messages: - 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug" Changes: https://git.openjdk.org/jdk/pull/16991/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=16991&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8319690 Stats: 297 lines in 2 files changed: 262 ins; 28 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/16991.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16991/head:pull/16991 PR: https://git.openjdk.org/jdk/pull/16991 From jbhateja at openjdk.org Wed Dec 6 10:00:43 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 6 Dec 2023 10:00:43 GMT Subject: Integrated: 8319111: Mismatched MemorySegment heap access is not consistently intrinsified In-Reply-To: References: Message-ID: On Wed, 29 Nov 2023 17:49:45 GMT, Jatin Bhateja wrote: > Patch enables intrinsification of fromMemorySegment, intoMemorySegment APIs and their masked variants for mismatched memory segments i.e. heap based memory segments whose backing storage type differs from the vector type in which they are loaded to or stored from. 
> > A load from a mismatched segment first moves the contents into type compatible vector followed by reinterpretation to desired vector type. This facilitates value forwarding from a preceding vector store as alias indices are computed using backing storage type. > > Mismatched masked vector loads and stores are performed at byte granularity, this handles both narrowing and widening scenarios where vector lane size is smaller than backing storage element type and vice versa. > > Following are the performance numbers of and existing JMH micro. > > ![image](https://github.com/openjdk/jdk/assets/59989778/a0b177af-78ca-4ac8-b6b0-bfe3655b16a6) > > Please review and share your feedback. > > Best Regards, > Jatin This pull request has now been integrated. Changeset: 2678e4cd Author: Jatin Bhateja URL: https://git.openjdk.org/jdk/commit/2678e4cd9424ca4e33ebb9693c84f9a86bf5504c Stats: 210 lines in 14 files changed: 54 ins; 5 del; 151 mod 8319111: Mismatched MemorySegment heap access is not consistently intrinsified Reviewed-by: sviswanathan, psandoz, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/16888 From jbhateja at openjdk.org Wed Dec 6 10:00:42 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 6 Dec 2023 10:00:42 GMT Subject: RFR: 8319111: Mismatched MemorySegment heap access is not consistently intrinsified In-Reply-To: References: Message-ID: <-0CXOQlD3AXxHS8UdFdV46CuCzxvCRY5n_3reVFAuw8=.03dbcda3-5ae6-44b4-aa30-5f134cca72f0@github.com> On Thu, 30 Nov 2023 22:54:55 GMT, Sandhya Viswanathan wrote: >> Patch enables intrinsification of fromMemorySegment, intoMemorySegment APIs and their masked variants for mismatched memory segments i.e. heap based memory segments whose backing storage type differs from the vector type in which they are loaded to or stored from. >> >> A load from a mismatched segment first moves the contents into type compatible vector followed by reinterpretation to desired vector type. 
This facilitates value forwarding from a preceding vector store as alias indices are computed using backing storage type. >> >> Mismatched masked vector loads and stores are performed at byte granularity, this handles both narrowing and widening scenarios where vector lane size is smaller than backing storage element type and vice versa. >> >> Following are the performance numbers of and existing JMH micro. >> >> ![image](https://github.com/openjdk/jdk/assets/59989778/a0b177af-78ca-4ac8-b6b0-bfe3655b16a6) >> >> Please review and share your feedback. >> >> Best Regards, >> Jatin > > @PaulSandoz Could you please also take a look at this PR? Thanks @sviswa7 , @PaulSandoz , @TobiHartmann ------------- PR Comment: https://git.openjdk.org/jdk/pull/16888#issuecomment-1842551532 From aph at openjdk.org Wed Dec 6 10:17:36 2023 From: aph at openjdk.org (Andrew Haley) Date: Wed, 6 Dec 2023 10:17:36 GMT Subject: RFR: 8320682: [AArch64] C1 compilation fails with "Field too big for insn" In-Reply-To: References: Message-ID: On Mon, 4 Dec 2023 14:19:10 GMT, Daniel Lundén wrote: > This changeset fixes an issue on aarch64 where addresses for float and double constants were sometimes out of range for PC-relative offsets using `adr`. > > Changes: > - Fix the issue by replacing `adr` with `lea`. > - Add a regression test. > > Thanks to @fisk and @xmas92 for the assistance. > > ### Testing > Tests: tier1, tier2, tier3, tier4, tier5 > Platforms: windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64 This fix generates too much code.
Please do this instead in both cases: @@ -585,8 +585,8 @@ void LIR_Assembler::const2reg(LIR_Opr src, LIR_Opr dest, LIR_PatchCode patch_cod if (__ operand_valid_for_float_immediate(c->as_jdouble())) { __ fmovd(dest->as_double_reg(), (c->as_jdouble())); } else { - __ adr(rscratch1, InternalAddress(double_constant(c->as_jdouble()))); - __ ldrd(dest->as_double_reg(), Address(rscratch1)); + __ mov(rscratch1, jlong_cast(c->as_jdouble())); + __ fmovd(dest->as_double_reg(), rscratch1); } break; } I don't think this is the only place where we assume that a single method will be less than a megabyte when compiled. What triggered it? ------------- PR Comment: https://git.openjdk.org/jdk/pull/16951#issuecomment-1842575812 PR Comment: https://git.openjdk.org/jdk/pull/16951#issuecomment-1842578283 From aph at openjdk.org Wed Dec 6 10:42:34 2023 From: aph at openjdk.org (Andrew Haley) Date: Wed, 6 Dec 2023 10:42:34 GMT Subject: RFR: 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug" In-Reply-To: <16J-lJ2AceGTVcRWBcP15yKcwO-1IA1XsngyOuNjf7k=.0776f081-ae2c-4279-87cf-d909806c2bc4@github.com> References: <16J-lJ2AceGTVcRWBcP15yKcwO-1IA1XsngyOuNjf7k=.0776f081-ae2c-4279-87cf-d909806c2bc4@github.com> Message-ID: On Wed, 6 Dec 2023 06:24:59 GMT, Fei Gao wrote: > On LP64 systems, if the heap can be moved into low virtual address space (below 4GB) and the heap size is smaller than the interesting threshold of 4 GB, we can use unscaled decoding pattern for narrow klass decoding. It means that a generic field reference can be decoded by: > > cast<64> (32-bit compressed reference) + field_offset > > > When the `field_offset` is an immediate, on aarch64 platform, the unscaled decoding pattern can match perfectly with a direct addressing mode, i.e., `base_plus_offset`, supported by `LDR/STR` instructions. 
But for certain data width, not all immediates can be encoded in the instruction field of `LDR/STR` [[1]](https://github.com/openjdk/jdk/blob/8db7bad992a0f31de9c7e00c2657c18670539102/src/hotspot/cpu/aarch64/assembler_aarch64.inline.hpp#L33). The ranges are different as data widths vary. > > For example, when we try to load a value of long type at offset of `1030`, the address expression is `(AddP (DecodeN base) 1030)`. Before the patch, the expression was matching with `operand indOffIN()`. But, for 64-bit `LDR/STR`, signed immediate byte offset must be in the range -256 to 255 or positive immediate byte offset must be a multiple of 8 in the range 0 to 32760 [[2]](https://developer.arm.com/documentation/ddi0602/2023-09/Base-Instructions/LDR--immediate---Load-Register--immediate--?lang=en). `1030` can't be encoded in the instruction field. So, after matching, when we do checking for instruction encoding, the assertion would fail. > > In this patch, we're going to filter out invalid immediates when deciding if current addressing mode can be matched as `base_plus_offset`. We introduce `indOffIN4/indOffLN4` and `indOffIN8/indOffLN8` for 32-bit data type and 64-bit data type separately in the patch. E.g., for `memory4`, we remove the generic `indOffIN/indOffLN`, which matches wrong unscaled immediate range, and replace them with `indOffIN4/indOffLN4` instead. > > Since 8-bit and 16-bit `LDR/STR` instructions also support the unscaled decoding pattern, we add the addressing mode in the lists of `memory1` and `memory2` by introducing `indOffIN1/indOffLN1` and `indOffIN2/indOffLN2`. > > We also remove unused operands `indOffI/indOffl/indOffIN/indOffLN` to avoid misuse. > > Tier 1-3 passed on aarch64. This is a complex fix for a corner case that can trivially be fixed via legitimize_address. Unless we can find something simpler, we should do so. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/16991#issuecomment-1842616496 From duke at openjdk.org Wed Dec 6 10:59:39 2023 From: duke at openjdk.org (Daniel Lundén) Date: Wed, 6 Dec 2023 10:59:39 GMT Subject: RFR: 8320682: [AArch64] C1 compilation fails with "Field too big for insn" In-Reply-To: References: Message-ID: On Mon, 4 Dec 2023 14:19:10 GMT, Daniel Lundén wrote: > This changeset fixes an issue on aarch64 where addresses for float and double constants were sometimes out of range for PC-relative offsets using `adr`. > > Changes: > - Fix the issue by replacing `adr` with `lea`. > - Add a regression test. > > Thanks to @fisk and @xmas92 for the assistance. > > ### Testing > Tests: tier1, tier2, tier3, tier4, tier5 > Platforms: windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64 Thanks Andrew. The trigger for this issue was a test case that I added as part of [JDK-8318817](https://bugs.openjdk.org/browse/JDK-8318817) (in `TestC1Globals.java`). Specifically, the test uses a very large value for `-XX:NMethodSizeLimit`, which causes the non-nmethod code heap to take up most of the code cache (leaving the profiled and non-profiled code heaps at minimum size, usually 4KB). If there is an implicit assumption regarding the cache size in more places, we should perhaps not integrate this fix and instead ensure the cache size is always properly sized. We are, for example, considering setting some more reasonable upper bound for `-XX:NMethodSizeLimit`. What do you think Andrew?
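For reference, the range limit behind the "Field too big for insn" failure can be sketched as follows. AArch64 `adr` encodes a signed 21-bit byte offset relative to the PC, which gives a reach of about 1 MB in either direction. The helper name `fits_adr` is hypothetical and only models the range check; it is not the assembler code that raises the error.

```cpp
#include <cassert>
#include <cstdint>

// Reach of AArch64 ADR: signed 21-bit byte offset, i.e. [-2^20, 2^20 - 1].
constexpr std::int64_t kAdrReach = std::int64_t{1} << 20;

// Hypothetical helper: true if `delta` (target address minus PC, in bytes)
// can be encoded in a single ADR instruction.
constexpr bool fits_adr(std::int64_t delta) {
  return delta >= -kAdrReach && delta < kAdrReach;
}

// Spot checks at the boundaries of the range.
inline void check_adr_examples() {
  assert(fits_adr(kAdrReach - 1));
  assert(!fits_adr(kAdrReach));
  assert(fits_adr(-kAdrReach));
}
```

A constant placed more than roughly 1 MB away from the instruction that references it fails this check, which matches the situation described above when a single compiled method or its constant section grows very large.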
------------- PR Comment: https://git.openjdk.org/jdk/pull/16951#issuecomment-1842641915 From epeter at openjdk.org Wed Dec 6 11:01:40 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 6 Dec 2023 11:01:40 GMT Subject: RFR: 8315361: C2 SuperWord: refactor out loop analysis into shared auto-vectorization facility VLoopAnalyzer In-Reply-To: References: Message-ID: On Wed, 6 Dec 2023 08:35:51 GMT, Fei Gao wrote: >> This is a refactoring of `SuperWord`. >> I intend to push it for JDK23, after [this bug fix](https://github.com/openjdk/jdk/pull/14785). >> >> **Goals** >> >> 1. Clean up `SuperWord`: disentangle different components, make them more **modular**. >> 2. Make the loop analysis parts a **shared facility**, not just for SuperWord but also the post-loop-vectorizer ([JDK-8308994](https://bugs.openjdk.org/browse/JDK-8308994)). >> 3. It is also a necessary step on my bigger plans for improvement with the C2 Auto-Vectorizer ([see my blog post](https://eme64.github.io/blog/2023/11/03/C2-AutoVectorizer-Improvement-Ideas.html)). >> 4. Improve tracing in the auto-vectorization by making it more systematic. >> >> **Summary** >> >> - I wrote a summary of how C2 auto-vectorization with SuperWord works (please read!): >> https://github.com/openjdk/jdk/blob/95fd361e60fc66eb91edad321662e508b2d1bdde/src/hotspot/share/opto/superword.hpp#L32-L177 >> - I moved many `Superword` components out to `VLoop` and its subclass `VLoopAnalyzer`. The idea is that any vectorizer can use these facilities in the future. They are therefore made more modular, which should hopefully make future changes easier. These components are: >> - Checking the pre-conditions for vectorization (e.g. no unwanted ctrl-flow). >> - `VLoop::check_preconditions_helper` replaces code from old `SuperWord::transform_loop`. >> - Running all submodules of `VLoopAnalyzer`: `VLoopAnalyzer::analyze_helper`. Replaces analysis part of `SuperWord::SLP_extract`. 
>> - Finding and marking reductions -> `VLoopReductions` >> - Detecting memory slices -> `VLoopMemorySlices` >> - Analyzing the body -> `VLoopBody` (renamed `in_bb` -> `in_body`) >> - Determining vector element types, and functions to determine the `vector_width` of a node -> `VLoopTypes` >> - Constructing the dependence graph -> `VLoopDependenceGraph`. Replaces old `DepGraph` with all its components. >> - New: CompileCommand option `TraceAutovectorization` >> - Run with `-XX:CompileCommand=traceAutovectorization,*::*,help` to get a usage description. >> - Replaced all printing with flags `TraceSuperWord` (and `Verbose`) and of `VectorizeDebug`. >> - The advantage of a CompileCommand is that tracing can be applied selectively for only a limited set of java classes / methods. >> - It uses tags, which are more readable than the `VectorizeDebug` bit-flags. These tags can be used for all parts of the vectorizer, but one can also target SuperWord specifically. >> - ... > > src/hotspot/share/opto/superword.hpp line 282: > >> 280: bool is_trace_superword_adjacent_memops() const { >> 281: return vla().is_trace_superword_adjacent_memops(); >> 282: } > > How about redefining it as: > > bool is_trace_superword_adjacent_memops() const { > return TraceSuperWord || vla().is_trace_tag_active(TraceAutovectorizationTag::TAG_SW_ADJACENT_MEMOPS); > } > > > And add a consulting interface in `class VLoop`: > > bool is_trace_tag_active(TraceAutovectorizationTag tag) const { > return _trace_tags.at(tag); > } > > > Thus, we don't have to involve any `SuperWord` specific words or options in shared facility. 
@fg1417 ok, I can do that :) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16620#discussion_r1417103516 From aph at openjdk.org Wed Dec 6 11:29:34 2023 From: aph at openjdk.org (Andrew Haley) Date: Wed, 6 Dec 2023 11:29:34 GMT Subject: RFR: 8320682: [AArch64] C1 compilation fails with "Field too big for insn" In-Reply-To: References: Message-ID: On Mon, 4 Dec 2023 14:19:10 GMT, Daniel Lundén wrote: > This changeset fixes an issue on aarch64 where addresses for float and double constants were sometimes out of range for PC-relative offsets using `adr`. > > Changes: > - Fix the issue by replacing `adr` with `lea`. > - Add a regression test. > > Thanks to @fisk and @xmas92 for the assistance. > > ### Testing > Tests: tier1, tier2, tier3, tier4, tier5 > Platforms: windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64 The PC-relative ldr() instruction is designed for exactly this use case, and its range is +/-1MB. We use this form for constant loads into SIMD and FP registers, in C1 and C2. We could also use it for other constant loads too, and the code is there to do so, but we don't at the present time. ------------- PR Comment: https://git.openjdk.org/jdk/pull/16951#issuecomment-1842684289 From aph at openjdk.org Wed Dec 6 11:30:33 2023 From: aph at openjdk.org (Andrew Haley) Date: Wed, 6 Dec 2023 11:30:33 GMT Subject: RFR: 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug" In-Reply-To: <16J-lJ2AceGTVcRWBcP15yKcwO-1IA1XsngyOuNjf7k=.0776f081-ae2c-4279-87cf-d909806c2bc4@github.com> References: <16J-lJ2AceGTVcRWBcP15yKcwO-1IA1XsngyOuNjf7k=.0776f081-ae2c-4279-87cf-d909806c2bc4@github.com> Message-ID: On Wed, 6 Dec 2023 06:24:59 GMT, Fei Gao wrote: > On LP64 systems, if the heap can be moved into low virtual address space (below 4GB) and the heap size is smaller than the interesting threshold of 4 GB, we can use unscaled decoding pattern for narrow klass decoding.
It means that a generic field reference can be decoded by: > > cast<64> (32-bit compressed reference) + field_offset > > > When the `field_offset` is an immediate, on aarch64 platform, the unscaled decoding pattern can match perfectly with a direct addressing mode, i.e., `base_plus_offset`, supported by `LDR/STR` instructions. But for certain data width, not all immediates can be encoded in the instruction field of `LDR/STR` [[1]](https://github.com/openjdk/jdk/blob/8db7bad992a0f31de9c7e00c2657c18670539102/src/hotspot/cpu/aarch64/assembler_aarch64.inline.hpp#L33). The ranges are different as data widths vary. > > For example, when we try to load a value of long type at offset of `1030`, the address expression is `(AddP (DecodeN base) 1030)`. Before the patch, the expression was matching with `operand indOffIN()`. But, for 64-bit `LDR/STR`, signed immediate byte offset must be in the range -256 to 255 or positive immediate byte offset must be a multiple of 8 in the range 0 to 32760 [[2]](https://developer.arm.com/documentation/ddi0602/2023-09/Base-Instructions/LDR--immediate---Load-Register--immediate--?lang=en). `1030` can't be encoded in the instruction field. So, after matching, when we do checking for instruction encoding, the assertion would fail. > > In this patch, we're going to filter out invalid immediates when deciding if current addressing mode can be matched as `base_plus_offset`. We introduce `indOffIN4/indOffLN4` and `indOffIN8/indOffLN8` for 32-bit data type and 64-bit data type separately in the patch. E.g., for `memory4`, we remove the generic `indOffIN/indOffLN`, which matches wrong unscaled immediate range, and replace them with `indOffIN4/indOffLN4` instead. > > Since 8-bit and 16-bit `LDR/STR` instructions also support the unscaled decoding pattern, we add the addressing mode in the lists of `memory1` and `memory2` by introducing `indOffIN1/indOffLN1` and `indOffIN2/indOffLN2`. 
> > We also remove unused operands `indOffI/indOffl/indOffIN/indOffLN` to avoid misuse. > > Tier 1-3 passed on aarch64. One question: is this only about misaligned loads? ------------- PR Comment: https://git.openjdk.org/jdk/pull/16991#issuecomment-1842686874 From mli at openjdk.org Wed Dec 6 11:36:57 2023 From: mli at openjdk.org (Hamlin Li) Date: Wed, 6 Dec 2023 11:36:57 GMT Subject: RFR: 8321001: RISC-V: C2 SignumVF [v4] In-Reply-To: References: Message-ID: > Hi, > Can you review the patch to add intrinisc SignumVF/SignumVD on riscv? > Thanks > > ## Test > test/hotspot/jtreg/compiler/intrinsics/ > test/hotspot/jtreg/compiler/vectorapi/ > and tests found via: > grep -nr test/hotspot/jtreg/ -we Math.signum > and test found via: > grep -nr test/jdk/ -we Math.signum Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: remove extra code ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16925/files - new: https://git.openjdk.org/jdk/pull/16925/files/3dddd029..6c30657f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16925&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16925&range=02-03 Stats: 6 lines in 2 files changed: 0 ins; 5 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/16925.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16925/head:pull/16925 PR: https://git.openjdk.org/jdk/pull/16925 From mli at openjdk.org Wed Dec 6 11:37:01 2023 From: mli at openjdk.org (Hamlin Li) Date: Wed, 6 Dec 2023 11:37:01 GMT Subject: RFR: 8321001: RISC-V: C2 SignumVF [v3] In-Reply-To: <5bo7tE5jD1F09tllj-jUE2QY5VhokIv9QINqpga2XsQ=.12371d33-f337-4cd4-b940-c705be0c6442@github.com> References: <1HWA8nW4l8CmCtLplPnKDuDrlJg-jOhOkjk1OLFINhQ=.52b418a9-2c73-4996-ab1d-d48a58e2cfb5@github.com> <5bo7tE5jD1F09tllj-jUE2QY5VhokIv9QINqpga2XsQ=.12371d33-f337-4cd4-b940-c705be0c6442@github.com> Message-ID: On Wed, 6 Dec 2023 02:32:57 GMT, Fei Yang wrote: >> Hamlin Li has updated the pull request 
incrementally with one additional commit since the last revision: >> >> enable TestSignumVector.java on riscv > > src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1687: > >> 1685: mv(t0, fclass_mask::zero | fclass_mask::nan); >> 1686: vand_vx(v0, v0, t0); >> 1687: vmseq_vv(v0, v0, zero); > > I don't think that the input `zero` (a vector of floating-point 0.0) is appropriate here for `vmseq_vv` which does vector integer comparison. Why not do `vmseq_vi(v0, v0, 0)` instead? This will also help remove the `zero` parameter of this function. Do you mean some patch like below? diff --git a/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp b/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp index ed421c9e287..aa82447b943 100644 --- a/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp +++ b/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp @@ -1676,15 +1676,14 @@ void C2_MacroAssembler::signum_fp(FloatRegister dst, FloatRegister one, bool is_ bind(done); } -void C2_MacroAssembler::signum_fp_v(VectorRegister dst, BasicType bt, int vlen, - VectorRegister zero, VectorRegister one) { +void C2_MacroAssembler::signum_fp_v(VectorRegister dst, BasicType bt, int vlen, VectorRegister one) { vsetvli_helper(bt, vlen); // check if input is -0, +0, signaling NaN or quiet NaN vfclass_v(v0, dst); mv(t0, fclass_mask::zero | fclass_mask::nan); vand_vx(v0, v0, t0); - vmseq_vv(v0, v0, zero); + vmseq_vi(v0, v0, 0); // use floating-point 1.0 with a sign of input vfsgnj_vv(dst, one, dst, v0_t); diff --git a/src/hotspot/cpu/riscv/riscv_v.ad b/src/hotspot/cpu/riscv/riscv_v.ad index 9b3c9125b3b..2940655f44e 100644 --- a/src/hotspot/cpu/riscv/riscv_v.ad +++ b/src/hotspot/cpu/riscv/riscv_v.ad @@ -3664,15 +3664,15 @@ instruct vexpand(vReg dst, vReg src, vRegMask_V0 v0, vReg tmp) %{ // Vector Math.signum -instruct vsignum_reg(vReg dst, vReg zero, vReg one, vRegMask_V0 v0) %{ - match(Set dst (SignumVF dst (Binary zero one))); - match(Set dst (SignumVD dst (Binary zero one))); +instruct 
vsignum_reg(vReg dst, vReg one, vRegMask_V0 v0) %{ + match(Set dst (SignumVF dst one)); + match(Set dst (SignumVD dst one)); effect(TEMP_DEF dst, TEMP v0); format %{ "vsignum $dst, $dst\t" %} ins_encode %{ BasicType bt = Matcher::vector_element_basic_type(this); __ signum_fp_v(as_VectorRegister($dst$$reg), bt, Matcher::vector_length(this), - as_VectorRegister($zero$$reg), as_VectorRegister($one$$reg)); + as_VectorRegister($one$$reg)); %} ins_pipe(pipe_slow); %} In fact this is also my initial patch, but at runtime it will report a `mismatch` issue. And in x86 and aarch64, they all have `zero` too. o749 SignumVD === _ o746 o1276 [[ o750 ]] #vectora[4]:{double} --N: o749 SignumVD === _ o746 o1276 [[ o750 ]] #vectora[4]:{double} --N: o746 LoadVector === o440 o865 o697 |o250 [[ o749 ]] @double[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=6; mismatched #vectora[4]:{double} VREG 300 loadV VREG_V1 300 loadV VREG_V2 300 loadV VREG_V3 300 loadV VREG_V4 300 loadV VREG_V5 300 loadV VREG_V6 300 loadV VREG_V7 300 loadV VREG_V8 300 loadV VREG_V9 300 loadV VREG_V10 300 loadV VREG_V11 300 loadV VREG_V12 300 loadV VREG_V13 300 loadV VREG_V14 300 loadV VREG_V15 300 loadV --N: o697 AddP === _ o61 o1069 o1177 [[ o746 ]] IREGP 100 addP_reg_imm IREGPNOSP 100 addP_reg_imm IREGP_R10 100 addP_reg_imm IREGP_R11 100 addP_reg_imm IREGP_R12 100 addP_reg_imm IREGP_R13 100 addP_reg_imm IREGP_R14 100 addP_reg_imm IREGP_R15 100 addP_reg_imm IREGP_R16 100 addP_reg_imm IREGP_R28 100 addP_reg_imm IREGP_R30 100 addP_reg_imm IREGP_R31 100 addP_reg_imm JAVATHREAD_REGP 100 addP_reg_imm INDIRECT 100 addP_reg_imm INDOFFL 0 INDOFFL INLINE_CACHE_REGP 100 addP_reg_imm MEMORY 0 INDOFFL IREGNORP 100 IREGP IREGILNP 100 IREGP IREGILNPNOSP 100 IREGPNOSP VMEMA 100 INDIRECT --N: o1069 AddP === _ o61 o61 o1124 [[ o1056 o1046 o954 o949 o697 o869 o1051 o1041 ]] IREGP 0 IREGP IREGPNOSP 0 IREGPNOSP IREGP_R10 0 IREGP_R10 IREGP_R11 0 IREGP_R11 IREGP_R12 0 IREGP_R12 IREGP_R13 0 IREGP_R13 
IREGP_R14 0 IREGP_R14 IREGP_R15 0 IREGP_R15 IREGP_R16 0 IREGP_R16 IREGP_R28 0 IREGP_R28 IREGP_R30 0 IREGP_R30 IREGP_R31 0 IREGP_R31 JAVATHREAD_REGP 0 JAVATHREAD_REGP INDIRECT 0 INDIRECT INLINE_CACHE_REGP 0 INLINE_CACHE_REGP MEMORY 0 INDIRECT IREGNORP 0 IREGP IREGILNP 0 IREGP IREGILNPNOSP 0 IREGPNOSP VMEMA 0 INDIRECT --N: o1177 ConL === o0 [[ o697 o698 ]] #long:240 IMML 0 IMML IMMLADD 0 IMMLADD IMMLSUB 0 IMMLSUB IMMLOFFSET 0 IMMLOFFSET IREGL 100 loadConL IREGLNOSP 100 loadConL IREGL_R28 100 loadConL IREGL_R29 100 loadConL IREGL_R30 100 loadConL IREGL_R10 100 loadConL IREGIORL 100 IREGL IREGILNP 100 IREGL IREGILNPNOSP 100 IREGLNOSP IMMIORL 0 IMML --N: o1276 Binary === _ o747 o748 [[ o749 ]] _Binary_vReg_vReg 0 _Binary_vReg_vReg --N: o747 Replicate === _ o192 [[ o1276 o1277 o1275 o1274 o1273 o1272 o1271 o1270 o1269 ]] #vectora[4]:{double} VREG 0 VREG VREG_V1 0 VREG_V1 VREG_V2 0 VREG_V2 VREG_V3 0 VREG_V3 VREG_V4 0 VREG_V4 VREG_V5 0 VREG_V5 VREG_V6 0 VREG_V6 VREG_V7 0 VREG_V7 VREG_V8 0 VREG_V8 VREG_V9 0 VREG_V9 VREG_V10 0 VREG_V10 VREG_V11 0 VREG_V11 VREG_V12 0 VREG_V12 VREG_V13 0 VREG_V13 VREG_V14 0 VREG_V14 VREG_V15 0 VREG_V15 --N: o748 Replicate === _ o193 [[ o1276 o1277 o1275 o1274 o1273 o1272 o1271 o1270 o1269 ]] #vectora[4]:{double} VREG 0 VREG VREG_V1 0 VREG_V1 VREG_V2 0 VREG_V2 VREG_V3 0 VREG_V3 VREG_V4 0 VREG_V4 VREG_V5 0 VREG_V5 VREG_V6 0 VREG_V6 VREG_V7 0 VREG_V7 VREG_V8 0 VREG_V8 VREG_V9 0 VREG_V9 VREG_V10 0 VREG_V10 VREG_V11 0 VREG_V11 VREG_V12 0 VREG_V12 VREG_V13 0 VREG_V13 VREG_V14 0 VREG_V14 VREG_V15 0 VREG_V15 > src/hotspot/cpu/riscv/riscv_v.ad line 3675: > >> 3673: BasicType bt = Matcher::vector_element_basic_type(this); >> 3674: __ signum_fp_v(as_VectorRegister($dst$$reg), bt, Matcher::vector_length(this), >> 3675: as_VectorRegister($zero$$reg), as_VectorRegister($one$$reg)); > > Nit: maybe leave one extra space here to align with parameters of the preceding line. > Ah, I would suggest remove them for test coverity reasons. 
We can add them back when needed. That's a good point! the code is removed. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16925#discussion_r1417144005 PR Review Comment: https://git.openjdk.org/jdk/pull/16925#discussion_r1417144210 From ihse at openjdk.org Wed Dec 6 12:02:42 2023 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Wed, 6 Dec 2023 12:02:42 GMT Subject: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v8] In-Reply-To: <7ocsRxaWjoU2vxwPUSE7BrnLSL1bF_7Pp8vReacNJvE=.1c044b21-b27e-4e94-8db0-6ae888a1e8b9@github.com> References: <7ocsRxaWjoU2vxwPUSE7BrnLSL1bF_7Pp8vReacNJvE=.1c044b21-b27e-4e94-8db0-6ae888a1e8b9@github.com> Message-ID: On Mon, 4 Dec 2023 22:15:24 GMT, Srinivas Vamsi Parasa wrote: >> The goal is to develop faster sort routines for x86_64 CPUs by taking advantage of AVX2 instructions. This enhancement provides an order of magnitude speedup for Arrays.sort() using int, long, float and double arrays. >> >> For serial sort on random data, this PR shows upto ~7.5x improvement for 32-bit datatypes (int, float) on Intel TigerLake machine as shown in the performance data below. >> >> For parallel sort on random data, this PR shows upto ~3.4x for 32-bit datatypes (int, float) as shown below. >> >> **Note:** This PR also improves the performance of AVX512 sort by upto 35%. 
>>
>> Benchmark (Serial Sort) | Size | Baseline (us/op) | AVX2 (us/op) | Speedup
>> -- | -- | -- | -- | --
>> ArraysSort.intSort | 10 | 0.034 | 0.029 | 1.2
>> ArraysSort.intSort | 25 | 0.088 | 0.044 | 2.0
>> ArraysSort.intSort | 50 | 0.239 | 0.159 | 1.5
>> ArraysSort.intSort | 75 | 0.417 | 0.27 | 1.5
>> ArraysSort.intSort | 100 | 0.572 | 0.265 | 2.2
>> ArraysSort.intSort | 1000 | 10.098 | 4.282 | 2.4
>> ArraysSort.intSort | 10000 | 330.065 | 43.383 | 7.6
>> ArraysSort.intSort | 100000 | 4099.527 | 778.943 | 5.3
>> ArraysSort.intSort | 1000000 | 49150.16 | 9634.335 | 5.1
>> ArraysSort.floatSort | 10 | 0.045 | 0.043 | 1.0
>> ArraysSort.floatSort | 25 | 0.105 | 0.073 | 1.4
>> ArraysSort.floatSort | 50 | 0.278 | 0.216 | 1.3
>> ArraysSort.floatSort | 75 | 0.476 | 0.241 | 2.0
>> ArraysSort.floatSort | 100 | 0.583 | 0.313 | 1.9
>> ArraysSort.floatSort | 1000 | 10.182 | 4.329 | 2.4
>> ArraysSort.floatSort | 10000 | 323.136 | 57.175 | 5.7
>> ArraysSort.floatSort | 100000 | 4299.519 | 862.63 | 5.0
>> ArraysSort.floatSort | 1000000 | 50889.4 | 10972.19 | 4.6
>>
>
> Srinivas Vamsi Parasa has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains 17 additional commits since the last revision: > > - Merge branch 'master' of https://git.openjdk.java.net/jdk into simdsort > - add GCC version guards > - Merge branch 'master' of https://git.openjdk.java.net/jdk into simdsort > - Remove C++17 from C flags > - add avoid masked stores operation > - update the code to check for supported simd sort cpus > - Disable AVX2 sort for 64-bit types > - Merge branch 'master' of https://git.openjdk.java.net/jdk into simdsort > - fix jcheck failures due to windows encoding > - fix carriage return and change insertion sort thresholds > - ... and 7 more: https://git.openjdk.org/jdk/compare/dbbc5f0a...bc590d9f Build changes look fine. You will still need the usual 2 reviewers from hotspot. ------------- Marked as reviewed by ihse (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/16534#pullrequestreview-1767374468 From ihse at openjdk.org Wed Dec 6 12:02:43 2023 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Wed, 6 Dec 2023 12:02:43 GMT Subject: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v8] In-Reply-To: References: <_gSNXk0qGAtpY-WJ5OCHk_3-nuGrwwSn-ffK9f2TEcs=.40f785ba-83dd-40fe-8075-a7a7872ea600@github.com> Message-ID: <7_T6sM3wjbSzZ0ab9FsptbpPnlQ2J4NNctQNkdbDFdI=.b595a8cc-4b14-44c6-8319-00e68fea21c3@github.com> On Tue, 5 Dec 2023 17:26:06 GMT, Srinivas Vamsi Parasa wrote: >> That sounds weird. You can't check for if compiler options should be enabled or not inside source code files. >> >> Are you saying that when compiling with GCC 6, it will just silently ignore `-std=c++17`? I'd have assumed that it printed a warning or error about an unknown or invalid option, if C++17 is not supported. > > Hi Magnus (@magicus), > >> Are you saying that when compiling with GCC 6, it will just silently ignore `-std=c++17`? I'd have assumed that it printed a warning or error about an unknown or invalid option, if C++17 is not supported. 
> > The GCC complier for versions 6 (and even 5) silently ignores the flag `-std=c++17`. It does not print any warning or error. I tested it with a toy C++ program and also by building OpenJDK using GCC 6. > >> You can't check for if compiler options should be enabled or not inside source code files. > > what I meant was, there are #ifdef guards using predefined macros in the C++ source code to check for GCC version and make the simdsort code available for compilation or not based on the GCC version > > > // src/java.base/linux/native/libsimdsort/simdsort-support.hpp > #if defined(_LP64) && (defined(__GNUC__) && ((__GNUC__ > 7) || ((__GNUC__ == 7) && (__GNUC_MINOR__ >= 5)))) > #define __SIMDSORT_SUPPORTED_LINUX > #endif > > > > //src/java.base/linux/native/libsimdsort/avx2-linux-qsort.cpp > #include "simdsort-support.hpp" > #ifdef __SIMDSORT_SUPPORTED_LINUX > > #endif Okay, then I guess I am fine with this. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16534#discussion_r1417170882 From chagedorn at openjdk.org Wed Dec 6 12:39:52 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 6 Dec 2023 12:39:52 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v22] In-Reply-To: References: Message-ID: On Wed, 6 Dec 2023 09:43:10 GMT, Emanuel Peter wrote: >> I want to push this in JDK23. >> After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). >> >> To calm your nerves: most of the changes are in auto-generated tests, and tests in general. >> >> **Context** >> >> `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. 
However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). >> >> Alignment is split into two tasks: >> - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. >> - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width. >> >> **Problem** >> >> I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). >> In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. >> Thanks @fg1417 for confirming this! >> Hence, we need to fix the alignment correctness checks. >> >> While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. >> >> **Problem Details** >> >> Reproducer: >> >> >> static void test(short[] a, short[] b, short mask) { >> for (int i = 0; i < RANGE; i+=8) { >> // Problematic for AlignVector >> b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 >> >> b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes >> b[i+4] = (short)(a[i+4] & mask); >> b[i+5] = (short)(a[i+5] & mask); >> b[i+6] = (short)(a[i+6] & mask); >> } >> } >> >> >> During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. >> >> This is problemati... 
> > Emanuel Peter has updated the pull request incrementally with two additional commits since the last revision: > > - add newline suggested by Faye > - improve the alignment proof, make it more explicit Impressive work! I'm still working my way through the proofs but here are some first comments. src/hotspot/share/opto/superword.cpp line 1820: > 1818: > 1819: // In what follows, we need to show that the C_const, init and invar terms can be aligned by > 1820: // adjusting the pre-loop limit (pre-iter). We decompose pre_iter: (pre-iter) -> (pre_iter)? src/hotspot/share/opto/superword.cpp line 1894: > 1892: // for any pre_iter_C_const >= 0: C_pre * pre_iter_C_const = 0 (mod aw) > 1893: // > 1894: // which implies that C_iter (and pre_iter_C_const) have no effect on the alignment of What is `C_iter`? ------------- PR Review: https://git.openjdk.org/jdk/pull/14785#pullrequestreview-1747517126 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1417193554 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1417171988 From chagedorn at openjdk.org Wed Dec 6 12:40:05 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 6 Dec 2023 12:40:05 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v19] In-Reply-To: References: Message-ID: On Tue, 28 Nov 2023 06:10:42 GMT, Emanuel Peter wrote: >> I want to push this in JDK23. >> After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). >> >> To calm your nerves: most of the changes are in auto-generated tests, and tests in general. >> >> **Context** >> >> `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. 
However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). >> >> Alignment is split into two tasks: >> - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. >> - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width. >> >> **Problem** >> >> I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). >> In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. >> Thanks @fg1417 for confirming this! >> Hence, we need to fix the alignment correctness checks. >> >> While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. >> >> **Problem Details** >> >> Reproducer: >> >> >> static void test(short[] a, short[] b, short mask) { >> for (int i = 0; i < RANGE; i+=8) { >> // Problematic for AlignVector >> b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 >> >> b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes >> b[i+4] = (short)(a[i+4] & mask); >> b[i+5] = (short)(a[i+5] & mask); >> b[i+6] = (short)(a[i+6] & mask); >> } >> } >> >> >> During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. >> >> This is problemati... 
> > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > For Faye: remove 64-bit platform requirement in test src/hotspot/cpu/aarch64/aarch64.ad line 8243: > 8241: // VerifyVectorAlignment Instruction > 8242: > 8243: instruct verify_vector_alignment(iRegP r, immL_positive_bitmaskI mask, rFlagsReg cr) %{ I suggest to use `addr` instead of `r` to make it simpler to read. Same for `x86.ad`. src/hotspot/cpu/aarch64/aarch64.ad line 8255: > 8253: __ bind(Lskip); > 8254: %} > 8255: ins_pipe( pipe_slow ); Suggestion: ins_pipe(pipe_slow); src/hotspot/cpu/x86/x86.ad line 8979: > 8977: __ bind(Lskip); > 8978: %} > 8979: ins_pipe( pipe_slow ); Suggestion: ins_pipe(pipe_slow); src/hotspot/share/opto/c2_globals.hpp line 97: > 95: \ > 96: develop(bool, VerifyAlignVector, false, \ > 97: "Check that vector store/load are aligned if AlignVector is on.") \ Suggestion: "Check that vector stores/loads are aligned if AlignVector is on.") \ src/hotspot/share/opto/c2compiler.cpp line 70: > 68: #ifdef ASSERT > 69: if (!AlignVector && VerifyAlignVector) { > 70: warning("VerifyAlignVector disabled because AlignVector not enabled."); Suggestion: warning("VerifyAlignVector disabled because AlignVector is not enabled."); src/hotspot/share/opto/compile.cpp line 3695: > 3693: n->as_StoreVector()->must_verify_alignment(); > 3694: if (must_verify_alignment) { > 3695: jlong memory_size = n->is_LoadVector() ? n->as_LoadVector()->memory_size() : I suggest to name this `vector_length` as you are mentioning it below in the comment. Suggestion: jlong vector_length = n->is_LoadVector() ? n->as_LoadVector()->memory_size() : src/hotspot/share/opto/compile.cpp line 3701: > 3699: // to ObjectAlignmentInBytes. 
Hence, even if multiple arrays are accessed in > 3700: // a loop we can expect at least the following alignment: > 3701: jlong alignment = MIN2(memory_size, (jlong)ObjectAlignmentInBytes); To make it more explicit, I suggest to name it: Suggestion: jlong guaranteed_alignment = MIN2(memory_size, (jlong)ObjectAlignmentInBytes); src/hotspot/share/opto/superword.cpp line 590: > 588: // Find the adjacent memory references and create pack pairs for them. > 589: // This is the initial set of packs that will then be extended by > 590: // following use->def and def->use links. Maybe add a comment/hint here that alignment will be checked later src/hotspot/share/opto/superword.cpp line 612: > 610: // Take the first mem_ref as the reference to align to. The pre-loop trip count is > 611: // modified to align this reference to a vector-aligned address. If strict alignment > 612: // is required, we may change the reference later (see filter_packs_for_alignment). Suggestion: // is required, we may change the reference later (see filter_packs_for_alignment()). src/hotspot/share/opto/superword.cpp line 669: > 667: } // while (memops.size() != 0) > 668: > 669: set_align_to_ref(align_to_mem_ref); You could directly set it inside the loop now that we always pick the first one. src/hotspot/share/opto/superword.cpp line 1594: > 1592: > 1593: // Remove all nullptr from packset > 1594: compress_packset(); IIUC, there should not be any nullptr in the packs before `combine_packs()`. Can we assert that at the entry of `combine_packs()`? src/hotspot/share/opto/superword.cpp line 1616: > 1614: > 1615: // Find the set of alignment solutions for load/store pack p. 
> 1616: AlignmentSolution SuperWord::pack_alignment_solution(Node_List* p) { As there are already a lot of variables in this methods, I suggest to name the parameter `pack` instead of `p`: Suggestion: AlignmentSolution SuperWord::pack_alignment_solution(Node_List* pack) { src/hotspot/share/opto/superword.hpp line 250: > 248: // Where scale is 0 if no scale dependency, > 249: // and invar is nullptr if no invar dependency. > 250: class AlignmentSolution { Please add some new lines between non-one-liner methods for better readability. src/hotspot/share/opto/superword.hpp line 251: > 249: // and invar is nullptr if no invar dependency. > 250: class AlignmentSolution { > 251: private: `private` is implied and can be removed. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1407506080 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1407494609 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1407492226 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1407497784 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1407498480 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1407513072 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1407519276 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1407522168 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1407524783 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1407545008 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1407551296 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1407689762 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1407559295 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1407558728 From chagedorn at openjdk.org Wed Dec 6 12:40:20 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 6 
Dec 2023 12:40:20 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v21] In-Reply-To: References: Message-ID: On Mon, 4 Dec 2023 12:42:35 GMT, Emanuel Peter wrote: >> I want to push this in JDK23. >> After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). >> >> To calm your nerves: most of the changes are in auto-generated tests, and tests in general. >> >> **Context** >> >> `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). >> >> Alignment is split into two tasks: >> - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. >> - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width. >> >> **Problem** >> >> I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). >> In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. >> Thanks @fg1417 for confirming this! >> Hence, we need to fix the alignment correctness checks. >> >> While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. 
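The alignment-correctness check described above can be illustrated with plain arithmetic. The following is only a sketch, not HotSpot code: the class and method names are hypothetical, and the 8-byte object alignment and the 4-element `short` vector are assumed values. It shows that once one access is aligned to the alignment width, a second access at a fixed byte distance is aligned too only if that distance is a multiple of the alignment width.

```java
public class PackAlignCheck {
    // Alignment width in bytes: the vector width, capped by the alignment
    // that every object (and hence every array) is guaranteed to have.
    static int alignmentWidth(int vectorWidthBytes, int objectAlignmentBytes) {
        return Math.min(vectorWidthBytes, objectAlignmentBytes);
    }

    // If one access is aligned to alignWidthBytes, a second access at a fixed
    // distance of deltaBytes is aligned too only when deltaBytes is a multiple
    // of alignWidthBytes.
    static boolean canAlignBoth(int deltaBytes, int alignWidthBytes) {
        return Math.floorMod(deltaBytes, alignWidthBytes) == 0;
    }

    public static void main(String[] args) {
        // Assumed values: 4 shorts per vector (8 bytes), 8-byte object alignment.
        int aw = alignmentWidth(4 * 2, 8);
        System.out.println(aw);                   // 8
        System.out.println(canAlignBoth(6, aw));  // false
        System.out.println(canAlignBoth(16, aw)); // true
    }
}
```

With these assumed numbers, two packs whose first elements are 6 bytes apart cannot both be aligned, so accepting a whole group of references based on a single representative would be unsafe.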
>> >> **Problem Details** >> >> Reproducer: >> >> >> static void test(short[] a, short[] b, short mask) { >> for (int i = 0; i < RANGE; i+=8) { >> // Problematic for AlignVector >> b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 >> >> b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes >> b[i+4] = (short)(a[i+4] & mask); >> b[i+5] = (short)(a[i+5] & mask); >> b[i+6] = (short)(a[i+6] & mask); >> } >> } >> >> >> During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. >> >> This is problemati... > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > Suggestions by Christian for naming src/hotspot/share/opto/superword.cpp line 1628: > 1626: int element_size = mem_ref->memory_size(); > 1627: int vw = pack_size * element_size; // vector_width > 1628: int aw = MIN2(vw, ObjectAlignmentInBytes); // alignment_width Is there a specific reason to go with `vw` and `aw` instead of directly naming the variable `vector_width` and `alighment_width`? src/hotspot/share/opto/superword.cpp line 1643: > 1641: Node* base = mem_ref_p.base(); > 1642: Node* invar = mem_ref_p.invar(); > 1643: int invar_factor = mem_ref_p.invar_factor(); I suggest to declare all constants `const`. Same for the constants further down in this method. src/hotspot/share/opto/superword.cpp line 1678: > 1676: tty->print_cr(" + scale(%d) * iv", scale); > 1677: } > 1678: #endif Not sure if you are planning to refactor the tracing code anyway but you might want to think about extracting any tracing code in this method to separate methods to not disrupt the readability of the code. You can derive some of the variables again (e.g. from VPointer etc.) 
to avoid having to pass many arguments to the tracing function. It might even be cleaner if this entire method would be part of `AlignmentSolution`. Then you do not need to pass around all the information to separate tracing methods. src/hotspot/share/opto/superword.cpp line 1684: > 1682: return AlignmentSolution("non power-of-2 stride not supported"); > 1683: } > 1684: assert(is_power_of_2(abs(pre_stride)), "pre_stride is power of 2"); Since you've just checked this condition with a bailout, you can probably get rid of this redundant assertion. src/hotspot/share/opto/superword.cpp line 1693: > 1691: } > 1692: > 1693: // We analyze the adress of the mem_ref. The idea is to disassemble it into a linear Suggestion: // We analyze the address of mem_ref. The idea is to disassemble it into a linear src/hotspot/share/opto/superword.cpp line 1707: > 1705: // init: value before pre-loop > 1706: // pre_stride: increment per pre-loop iteration > 1707: // pre_iter: number of pre-loop iterations (adjustible via pre-loop limit) Suggestion: // pre_iter: number of pre-loop iterations (adjustable via pre-loop limit) src/hotspot/share/opto/superword.cpp line 1709: > 1707: // pre_iter: number of pre-loop iterations (adjustible via pre-loop limit) > 1708: // main_stride: increment per main-loop iteration (= pre_stride * unroll_factor) > 1709: // j: number of main-loop iterations (j >= 0) Could we also name `j` -> `main_iter` to be consistent with the naming for `pre_iter`? src/hotspot/share/opto/superword.cpp line 1712: > 1710: // > 1711: // In the following, we restate the simple form of the address expression, by first > 1712: // expanding the iv varialbe. In a second step, we reshape the expression again, and Suggestion: // expanding the iv variable. 
In a second step, we reshape the expression again, and src/hotspot/share/opto/superword.cpp line 1720: > 1718: // + offset + offset + C_const (sum of constant terms) > 1719: // + invar + invar_factor * var_invar + C_invar * var_invar (term for variable init) > 1720: // / + scale * init + C_init * var_init (term for invariant) You flipped the comments for the variable init and invariant terms src/hotspot/share/opto/superword.cpp line 1725: > 1723: // > 1724: // We describe the 6 terms: > 1725: // 1) The "base" of the address is the address of a java object (e.g. array), Suggestion: // 1) The "base" of the address is the address of a Java object (e.g. array), src/hotspot/share/opto/superword.cpp line 1736: > 1734: // 5) The "C_pre * pre_iter" term represents how much the iv is incremented > 1735: // during the "pre_iter" many pre-loop iterations. This term can be adjusted > 1736: // by changing the pre-loop limit. This allows us to adjust the alignment Maybe add for completeness: by changing the pre-loop limit which defines how many pre-loop iterations are executed. src/hotspot/share/opto/superword.cpp line 1750: > 1748: } else { > 1749: C_init = scale; > 1750: } I suggest to make it implicit to better follow the logic: Suggestion: if (init_node->is_ConI()) { C_const_init = init_node->as_ConI()->get_int(); C_init = 0; } else { C_const_init = 0 C_init = scale; } src/hotspot/share/opto/superword.cpp line 1791: > 1789: > 1790: // We must find a pre_iter, such that adr is aw aligned: adr % aw = 0. > 1791: // Since "base mod aw = 0", we only need to ensure alignment of the other 5 terms: Maybe you can directly use `%` everywhere instead of writing "mod/modulo". src/hotspot/share/opto/superword.cpp line 1793: > 1791: // Since "base mod aw = 0", we only need to ensure alignment of the other 5 terms: > 1792: // > 1793: // C_const + C_invar * var_invar + C_init * var_init + C_pre * pre_iter + C_main * j = 0 (modulo aw) (1) Maybe state the modulo equation explicitly. 
Same for the other equations further down. Suggestion: // (C_const + C_invar * var_invar + C_init * var_init + C_pre * pre_iter + C_main * j) % aw = 0 (1) src/hotspot/share/opto/superword.cpp line 1795: > 1793: // C_const + C_invar * var_invar + C_init * var_init + C_pre * pre_iter + C_main * j = 0 (modulo aw) (1) > 1794: // > 1795: // Alignment must be maintained over all main-loop iterations, i.e for any j >= 0, we require: Suggestion: // Alignment must be maintained over all main-loop iterations, i.e. for any j >= 0, we require: src/hotspot/share/opto/superword.cpp line 1797: > 1795: // Alignment must be maintained over all main-loop iterations, i.e for any j >= 0, we require: > 1796: // > 1797: // C_main % aw = 0 (2*) So, if (2*) then we require: (C_const + C_invar * var_invar + C_init * var_init + C_pre * pre_iter) % aw = 0 to satisfy (1). Maybe you can add that for completeness. src/hotspot/share/opto/superword.cpp line 1818: > 1816: } > 1817: > 1818: // In what follows, me must ensure that the C_pre term can align the C_const, C_init and C_invar terms, Suggestion: // In what follows, we must ensure that the C_pre term can align the C_const, C_init and C_invar terms, src/hotspot/share/opto/superword.cpp line 1847: > 1845: // We must now show that the C_const term can be aligned. > 1846: // > 1847: // We can assume that abs(C_pre) is a power of 2. As a reminder, you could also mention here that `aw` is a power of 2. src/hotspot/share/opto/superword.cpp line 1848: > 1846: // > 1847: // We can assume that abs(C_pre) is a power of 2. 
> 1848: // If abs(C_pre) >= aw, then for any pre_iter >= 0: C_pre * pre_iter = 0 (mod aw), Suggestion: // If abs(C_pre) >= aw, then for any pre_iter >= 0: C_pre * pre_iter % aw = 0, ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1413834430 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1413836369 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1413839023 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1416879986 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1413841272 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1413862662 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1413864497 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1413863686 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1413875167 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1413887209 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1416912014 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1413896281 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1413936123 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1413938614 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1413941141 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1413944179 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1414047876 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1416887945 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1416839343 From chagedorn at openjdk.org Wed Dec 6 12:40:21 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 6 Dec 2023 12:40:21 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v21] In-Reply-To: References: 
Message-ID: On Mon, 4 Dec 2023 13:30:28 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> Suggestions by Christian for naming > > src/hotspot/share/opto/superword.cpp line 1720: > >> 1718: // + offset + offset + C_const (sum of constant terms) >> 1719: // + invar + invar_factor * var_invar + C_invar * var_invar (term for variable init) >> 1720: // / + scale * init + C_init * var_init (term for invariant) > > You flipped the comments for the variable init and invariant terms `var_invar` is slightly confusing. Maybe we should flip it to `invar_var` to be consistent with `invar_factor`? Or maybe you find another name for that term. I could not come up with something better for now. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1413876208 From chagedorn at openjdk.org Wed Dec 6 12:40:24 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 6 Dec 2023 12:40:24 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v16] In-Reply-To: References: Message-ID: <0aTi3Gf2Bz5f_TBeM2_2XxTWxMg5N-zShkX_nuqbgDg=.1434516c-1d41-46a6-a0d0-f77015463da1@github.com> On Tue, 21 Nov 2023 10:38:46 GMT, Emanuel Peter wrote: >> I want to push this in JDK23. >> After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). >> >> To calm your nerves: most of the changes are in auto-generated tests, and tests in general. >> >> **Context** >> >> `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). 
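As a rough illustration of "iterating in the pre-loop until we reach the alignment boundary", here is a minimal sketch. It is not the SuperWord implementation: it assumes the address after `preIter` pre-loop iterations is simply `startBytes + preIter * strideBytes`, and the names `PreLoopAlign` and `minPreIter` are hypothetical. It searches for the smallest pre-loop iteration count that aligns the address and reports when no such count exists, which corresponds to the case where vectorization has to be rejected.

```java
public class PreLoopAlign {
    // Returns the smallest preIter >= 0 such that
    // (startBytes + preIter * strideBytes) % alignBytes == 0,
    // or -1 if no such preIter exists. alignBytes must be positive.
    static int minPreIter(long startBytes, long strideBytes, int alignBytes) {
        // The residue of the address modulo alignBytes repeats with a period
        // of at most alignBytes, so trying alignBytes candidates is sufficient.
        for (int preIter = 0; preIter < alignBytes; preIter++) {
            long addr = startBytes + preIter * strideBytes;
            if (Math.floorMod(addr, (long) alignBytes) == 0) {
                return preIter;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // 2-byte elements, starting 4 bytes past an 8-byte boundary:
        System.out.println(minPreIter(4, 2, 8)); // 2
        // A stride of 8 bytes never changes the residue 6 modulo 8:
        System.out.println(minPreIter(6, 8, 8)); // -1
    }
}
```

The second call shows why alignment is not always achievable: when the stride is a multiple of the alignment width, extra pre-loop iterations cannot change the residue of the address.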
>> >> Alignment is split into two tasks: >> - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. >> - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width. >> >> **Problem** >> >> I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). >> In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. >> Thanks @fg1417 for confirming this! >> Hence, we need to fix the alignment correctness checks. >> >> While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. >> >> **Problem Details** >> >> Reproducer: >> >> >> static void test(short[] a, short[] b, short mask) { >> for (int i = 0; i < RANGE; i+=8) { >> // Problematic for AlignVector >> b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 >> >> b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes >> b[i+4] = (short)(a[i+4] & mask); >> b[i+5] = (short)(a[i+5] & mask); >> b[i+6] = (short)(a[i+6] & mask); >> } >> } >> >> >> During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. >> >> This is problemati... > > Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 67 commits:
>
>  - Merge branch 'master' into JDK-8311586
>  - TestBufferVectorization.java: Faye reported issue with AlignVector. Made to IR test, removed flagless
>  - Merge branch 'master' into JDK-8311586
>  - aarch64 match rule from Faye for VerifyVectorAlignment
>  - Faye found failure on 256 SVE machine, fixed
>  - Merge branch 'master' into JDK-8311586
>  - fix flags register in VerifyAlignVector
>  - Merge branch 'master' into JDK-8311586
>  - small fix
>  - Merge branch 'master' into JDK-8311586
>  - ... and 57 more: https://git.openjdk.org/jdk/compare/e055fae1...b491fbcb

src/hotspot/share/opto/superword.cpp line 3815:

> 3813: //   lim0:         current pre-loop limit
> 3814: //   lim:          new pre-loop limit
> 3815: //   N:            difference between lim and lim0

I find it hard to remember which is which when reading the equations below. I suggest to use more explicit names:

Suggestion:

  //   old_limit:    current pre-loop limit
  //   new_limit:    new pre-loop limit
  //   diff_limits:  difference between lim and lim0

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1404079008

From duke at openjdk.org Wed Dec 6 12:57:05 2023
From: duke at openjdk.org (Daniel Lundén)
Date: Wed, 6 Dec 2023 12:57:05 GMT
Subject: RFR: 8310524: C2: record parser-generated LoadN nodes for IGVN [v3]
In-Reply-To:
References:
Message-ID:

> This changeset fixes an issue where LoadN nodes were not recorded during bytecode parsing for later revisit in IGVN, in some cases resulting in missed optimization opportunities (see, e.g., the included new regression test).
>
> Changes:
> - Make sure to record newly added LoadN-nodes for IGVN in `GraphKit::make_load`.
> - Add a regression test.
>
> ### Testing
> - tier1, tier2, tier3, tier4, tier5 (windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64)

Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision:

  Switch to @DontInline

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/16967/files
  - new: https://git.openjdk.org/jdk/pull/16967/files/23f80e63..78a86af2

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=16967&range=02
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16967&range=01-02

  Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod
  Patch: https://git.openjdk.org/jdk/pull/16967.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/16967/head:pull/16967

PR: https://git.openjdk.org/jdk/pull/16967

From duke at openjdk.org Wed Dec 6 12:57:05 2023
From: duke at openjdk.org (Daniel Lundén)
Date: Wed, 6 Dec 2023 12:57:05 GMT
Subject: RFR: 8310524: C2: record parser-generated LoadN nodes for IGVN [v2]
In-Reply-To:
References: <95x_5ClhJG1tjcMpXO2879BUk3B8WR7OFOFEedX_Osk=.d7499d64-1dcc-471d-9a42-2f8697680694@github.com> <-B4ydr5YB2n7TrNOnsCrLhrWLgNdjJs7AWL_wykug5A=.2a2e88a1-fc42-41a1-8fe3-c7bf52298e25@github.com>
Message-ID:

On Wed, 6 Dec 2023 09:01:20 GMT, Tobias Hartmann wrote:

>> No particular reason. I checked, and `@DontInline` also makes the test work as expected. Is it preferable to use `@DontInline`?
>
> Yes, I think `@DontInline` would be clearer here.

OK, thanks. Updated now (and rerunning tests just to be sure).

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16967#discussion_r1417248051

From aph at openjdk.org Wed Dec 6 13:01:38 2023
From: aph at openjdk.org (Andrew Haley)
Date: Wed, 6 Dec 2023 13:01:38 GMT
Subject: RFR: 8320682: [AArch64] C1 compilation fails with "Field too big for insn"
In-Reply-To:
References:
Message-ID:

On Wed, 6 Dec 2023 11:26:35 GMT, Andrew Haley wrote:

> The PC-relative ldr() instruction...

Sorry, PC-relative adr(), but the same +/-1MB, and the reason for it, still applies.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/16951#issuecomment-1842832947

From rcastanedalo at openjdk.org Wed Dec 6 13:18:39 2023
From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano)
Date: Wed, 6 Dec 2023 13:18:39 GMT
Subject: RFR: 8310524: C2: record parser-generated LoadN nodes for IGVN [v3]
In-Reply-To:
References:
Message-ID:

On Wed, 6 Dec 2023 12:57:05 GMT, Daniel Lundén wrote:

>> This changeset fixes an issue where LoadN nodes were not recorded during bytecode parsing for later revisit in IGVN, in some cases resulting in missed optimization opportunities (see, e.g., the included new regression test).
>>
>> Changes:
>> - Make sure to record newly added LoadN-nodes for IGVN in `GraphKit::make_load`.
>> - Add a regression test.
>>
>> ### Testing
>> - tier1, tier2, tier3, tier4, tier5 (windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64)
>
> Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision:
>
>   Switch to @DontInline

Marked as reviewed by rcastanedalo (Reviewer).

-------------

PR Review: https://git.openjdk.org/jdk/pull/16967#pullrequestreview-1767526462

From chagedorn at openjdk.org Wed Dec 6 13:26:35 2023
From: chagedorn at openjdk.org (Christian Hagedorn)
Date: Wed, 6 Dec 2023 13:26:35 GMT
Subject: RFR: 8310524: C2: record parser-generated LoadN nodes for IGVN [v3]
In-Reply-To:
References:
Message-ID:

On Wed, 6 Dec 2023 12:57:05 GMT, Daniel Lundén wrote:

>> This changeset fixes an issue where LoadN nodes were not recorded during bytecode parsing for later revisit in IGVN, in some cases resulting in missed optimization opportunities (see, e.g., the included new regression test).
>>
>> Changes:
>> - Make sure to record newly added LoadN-nodes for IGVN in `GraphKit::make_load`.
>> - Add a regression test.
>>
>> ### Testing
>> - tier1, tier2, tier3, tier4, tier5 (windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64)
>
> Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision:
>
>   Switch to @DontInline

Marked as reviewed by chagedorn (Reviewer).

-------------

PR Review: https://git.openjdk.org/jdk/pull/16967#pullrequestreview-1767543495

From duke at openjdk.org Wed Dec 6 13:36:34 2023
From: duke at openjdk.org (Daniel Lundén)
Date: Wed, 6 Dec 2023 13:36:34 GMT
Subject: RFR: 8320682: [AArch64] C1 compilation fails with "Field too big for insn"
In-Reply-To:
References:
Message-ID:

On Mon, 4 Dec 2023 14:19:10 GMT, Daniel Lundén wrote:

> This changeset fixes an issue on aarch64 where addresses for float and double constants were sometimes out of range for PC-relative offsets using `adr`.
>
> Changes:
> - Fix the issue by replacing `adr` with `lea`.
> - Add a regression test.
>
> Thanks to @fisk and @xmas92 for the assistance.
>
> ### Testing
> Tests: tier1, tier2, tier3, tier4, tier5
> Platforms: windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64

Given the +/-1MB range for `adr`, would it instead be a good idea to limit `-XX:NMethodSizeLimit` to 1MB? Then we would not need the fix at all and could still use `adr`.
Something similar to: diff --git a/src/hotspot/share/c1/c1_globals.hpp b/src/hotspot/share/c1/c1_globals.hpp index 1c22cf16cfe..e2057d20e59 100644 --- a/src/hotspot/share/c1/c1_globals.hpp +++ b/src/hotspot/share/c1/c1_globals.hpp @@ -277,7 +277,7 @@ \ develop(intx, NMethodSizeLimit, (64*K)*wordSize, \ "Maximum size of a compiled method.") \ - range(0, max_jint) \ + range(0, 1*M) \ \ develop(bool, TraceFPUStack, false, \ "Trace emulation of the FPU stack (intel only)") \ ------------- PR Comment: https://git.openjdk.org/jdk/pull/16951#issuecomment-1842890054 From fyang at openjdk.org Wed Dec 6 13:48:37 2023 From: fyang at openjdk.org (Fei Yang) Date: Wed, 6 Dec 2023 13:48:37 GMT Subject: RFR: 8321001: RISC-V: C2 SignumVF [v3] In-Reply-To: References: <1HWA8nW4l8CmCtLplPnKDuDrlJg-jOhOkjk1OLFINhQ=.52b418a9-2c73-4996-ab1d-d48a58e2cfb5@github.com> <5bo7tE5jD1F09tllj-jUE2QY5VhokIv9QINqpga2XsQ=.12371d33-f337-4cd4-b940-c705be0c6442@github.com> Message-ID: On Wed, 6 Dec 2023 11:33:40 GMT, Hamlin Li wrote: >> src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1687: >> >>> 1685: mv(t0, fclass_mask::zero | fclass_mask::nan); >>> 1686: vand_vx(v0, v0, t0); >>> 1687: vmseq_vv(v0, v0, zero); >> >> I don't think that the input `zero` (a vector of floating-point 0.0) is appropriate here for `vmseq_vv` which does vector integer comparison. Why not do `vmseq_vi(v0, v0, 0)` instead? This will also help remove the `zero` parameter of this function. > > Do you mean some patch like below? 
> > diff --git a/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp b/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp > index ed421c9e287..aa82447b943 100644 > --- a/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp > +++ b/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp > @@ -1676,15 +1676,14 @@ void C2_MacroAssembler::signum_fp(FloatRegister dst, FloatRegister one, bool is_ > bind(done); > } > > -void C2_MacroAssembler::signum_fp_v(VectorRegister dst, BasicType bt, int vlen, > - VectorRegister zero, VectorRegister one) { > +void C2_MacroAssembler::signum_fp_v(VectorRegister dst, BasicType bt, int vlen, VectorRegister one) { > vsetvli_helper(bt, vlen); > > // check if input is -0, +0, signaling NaN or quiet NaN > vfclass_v(v0, dst); > mv(t0, fclass_mask::zero | fclass_mask::nan); > vand_vx(v0, v0, t0); > - vmseq_vv(v0, v0, zero); > + vmseq_vi(v0, v0, 0); > > // use floating-point 1.0 with a sign of input > vfsgnj_vv(dst, one, dst, v0_t); > diff --git a/src/hotspot/cpu/riscv/riscv_v.ad b/src/hotspot/cpu/riscv/riscv_v.ad > index 9b3c9125b3b..2940655f44e 100644 > --- a/src/hotspot/cpu/riscv/riscv_v.ad > +++ b/src/hotspot/cpu/riscv/riscv_v.ad > @@ -3664,15 +3664,15 @@ instruct vexpand(vReg dst, vReg src, vRegMask_V0 v0, vReg tmp) %{ > > // Vector Math.signum > > -instruct vsignum_reg(vReg dst, vReg zero, vReg one, vRegMask_V0 v0) %{ > - match(Set dst (SignumVF dst (Binary zero one))); > - match(Set dst (SignumVD dst (Binary zero one))); > +instruct vsignum_reg(vReg dst, vReg one, vRegMask_V0 v0) %{ > + match(Set dst (SignumVF dst one)); > + match(Set dst (SignumVD dst one)); > effect(TEMP_DEF dst, TEMP v0); > format %{ "vsignum $dst, $dst\t" %} > ins_encode %{ > BasicType bt = Matcher::vector_element_basic_type(this); > __ signum_fp_v(as_VectorRegister($dst$$reg), bt, Matcher::vector_length(this), > - as_VectorRegister($zero$$reg), as_VectorRegister($one$$reg)); > + as_VectorRegister($one$$reg)); > %} > ins_pipe(pipe_slow); > %} > > > In fact this is also my 
initial patch, but at runtime it will report a `mismatch` issue. > And in x86 and aarch64, they all have `zero` too. > > > o749 SignumVD === _ o746 o1276 [[ o750 ]] #vectora[4]:{double} > > --N: o749 SignumVD === _ o746 o1276 [[ o750 ]] #vectora[4]:{double} > > --N: o746 LoadVector === o440 o865 o697 |o250 [[ o749 ]] @double[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=6; mismatched #vectora[... Hi, sorry for not being clear enough. In fact, I am suggesting following add-on change: diff --git a/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp b/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp index 18b5df3f68c..b732e70ed84 100644 --- a/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp +++ b/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp @@ -1677,15 +1677,15 @@ void C2_MacroAssembler::signum_fp(FloatRegister dst, FloatRegister one, bool is_ bind(done); } -void C2_MacroAssembler::signum_fp_v(VectorRegister dst, BasicType bt, int vlen, - VectorRegister zero, VectorRegister one) { +void C2_MacroAssembler::signum_fp_v(VectorRegister dst, VectorRegister one, + BasicType bt, int vlen) { vsetvli_helper(bt, vlen); // check if input is -0, +0, signaling NaN or quiet NaN vfclass_v(v0, dst); mv(t0, fclass_mask::zero | fclass_mask::nan); vand_vx(v0, v0, t0); - vmseq_vv(v0, v0, zero); + vmseq_vi(v0, v0, 0); // use floating-point 1.0 with a sign of input vfsgnj_vv(dst, one, dst, v0_t); diff --git a/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.hpp b/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.hpp index 0ed7c3686dc..b9a7749631a 100644 --- a/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.hpp +++ b/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.hpp @@ -163,7 +163,7 @@ void signum_fp(FloatRegister dst, FloatRegister one, bool is_double); - void signum_fp_v(VectorRegister dst, BasicType bt, int vlen, VectorRegister zero, VectorRegister one); + void signum_fp_v(VectorRegister dst, VectorRegister one, BasicType bt, int vlen); // intrinsic methods 
implemented by rvv instructions diff --git a/src/hotspot/cpu/riscv/riscv_v.ad b/src/hotspot/cpu/riscv/riscv_v.ad index 9c2349a6c92..c163325fc81 100644 --- a/src/hotspot/cpu/riscv/riscv_v.ad +++ b/src/hotspot/cpu/riscv/riscv_v.ad @@ -3671,8 +3671,8 @@ instruct vsignum_reg(vReg dst, vReg zero, vReg one, vRegMask_V0 v0) %{ format %{ "vsignum $dst, $dst\t" %} ins_encode %{ BasicType bt = Matcher::vector_element_basic_type(this); - __ signum_fp_v(as_VectorRegister($dst$$reg), bt, Matcher::vector_length(this), - as_VectorRegister($zero$$reg), as_VectorRegister($one$$reg)); + __ signum_fp_v(as_VectorRegister($dst$$reg), as_VectorRegister($one$$reg), + bt, Matcher::vector_length(this)); %} ins_pipe(pipe_slow); %} ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16925#discussion_r1417284083 From thartmann at openjdk.org Wed Dec 6 13:55:38 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 6 Dec 2023 13:55:38 GMT Subject: RFR: 8310524: C2: record parser-generated LoadN nodes for IGVN [v3] In-Reply-To: References: Message-ID: On Wed, 6 Dec 2023 12:57:05 GMT, Daniel Lundén wrote: >> This changeset fixes an issue where LoadN nodes were not recorded during bytecode parsing for later revisit in IGVN, in some cases resulting in missed optimization opportunities (see, e.g., the included new regression test). >> >> Changes: >> - Make sure to record newly added LoadN-nodes for IGVN in `GraphKit::make_load`. >> - Add a regression test. >> >> ### Testing >> - tier1, tier2, tier3, tier4, tier5 (windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64) > > Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: > > Switch to @DontInline Marked as reviewed by thartmann (Reviewer).
------------- PR Review: https://git.openjdk.org/jdk/pull/16967#pullrequestreview-1767628750 From aph at openjdk.org Wed Dec 6 14:43:37 2023 From: aph at openjdk.org (Andrew Haley) Date: Wed, 6 Dec 2023 14:43:37 GMT Subject: RFR: 8320682: [AArch64] C1 compilation fails with "Field too big for insn" In-Reply-To: References: Message-ID: On Wed, 6 Dec 2023 13:33:32 GMT, Daniel Lundén wrote: > Given the +/-1MB range for `adr`, would it instead be a good idea to limit `-XX:NMethodSizeLimit` to 1MB? I guess that would be OK. I'm thinking of extreme scenarios where a method's constant pool is large but the stub code at its end is smaller, in which case an `adr` wouldn't quite reach. But I think that's unlikely. ------------- PR Comment: https://git.openjdk.org/jdk/pull/16951#issuecomment-1843020438 From mli at openjdk.org Wed Dec 6 14:43:49 2023 From: mli at openjdk.org (Hamlin Li) Date: Wed, 6 Dec 2023 14:43:49 GMT Subject: RFR: 8321001: RISC-V: C2 SignumVF [v5] In-Reply-To: References: Message-ID: > Hi, > Can you review the patch to add intrinsic SignumVF/SignumVD on riscv?
> Thanks > > ## Test > test/hotspot/jtreg/compiler/intrinsics/ > test/hotspot/jtreg/compiler/vectorapi/ > and tests found via: > grep -nr test/hotspot/jtreg/ -we Math.signum > and test found via: > grep -nr test/jdk/ -we Math.signum Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: Fix vmseq_vv ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16925/files - new: https://git.openjdk.org/jdk/pull/16925/files/6c30657f..394f68b0 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16925&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16925&range=03-04 Stats: 6 lines in 3 files changed: 0 ins; 1 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/16925.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16925/head:pull/16925 PR: https://git.openjdk.org/jdk/pull/16925 From mli at openjdk.org Wed Dec 6 14:43:51 2023 From: mli at openjdk.org (Hamlin Li) Date: Wed, 6 Dec 2023 14:43:51 GMT Subject: RFR: 8321001: RISC-V: C2 SignumVF [v3] In-Reply-To: References: <1HWA8nW4l8CmCtLplPnKDuDrlJg-jOhOkjk1OLFINhQ=.52b418a9-2c73-4996-ab1d-d48a58e2cfb5@github.com> <5bo7tE5jD1F09tllj-jUE2QY5VhokIv9QINqpga2XsQ=.12371d33-f337-4cd4-b940-c705be0c6442@github.com> Message-ID: On Wed, 6 Dec 2023 13:17:11 GMT, Fei Yang wrote: >> Do you mean some patch like below? 
>> >> diff --git a/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp b/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp >> index ed421c9e287..aa82447b943 100644 >> --- a/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp >> +++ b/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp >> @@ -1676,15 +1676,14 @@ void C2_MacroAssembler::signum_fp(FloatRegister dst, FloatRegister one, bool is_ >> bind(done); >> } >> >> -void C2_MacroAssembler::signum_fp_v(VectorRegister dst, BasicType bt, int vlen, >> - VectorRegister zero, VectorRegister one) { >> +void C2_MacroAssembler::signum_fp_v(VectorRegister dst, BasicType bt, int vlen, VectorRegister one) { >> vsetvli_helper(bt, vlen); >> >> // check if input is -0, +0, signaling NaN or quiet NaN >> vfclass_v(v0, dst); >> mv(t0, fclass_mask::zero | fclass_mask::nan); >> vand_vx(v0, v0, t0); >> - vmseq_vv(v0, v0, zero); >> + vmseq_vi(v0, v0, 0); >> >> // use floating-point 1.0 with a sign of input >> vfsgnj_vv(dst, one, dst, v0_t); >> diff --git a/src/hotspot/cpu/riscv/riscv_v.ad b/src/hotspot/cpu/riscv/riscv_v.ad >> index 9b3c9125b3b..2940655f44e 100644 >> --- a/src/hotspot/cpu/riscv/riscv_v.ad >> +++ b/src/hotspot/cpu/riscv/riscv_v.ad >> @@ -3664,15 +3664,15 @@ instruct vexpand(vReg dst, vReg src, vRegMask_V0 v0, vReg tmp) %{ >> >> // Vector Math.signum >> >> -instruct vsignum_reg(vReg dst, vReg zero, vReg one, vRegMask_V0 v0) %{ >> - match(Set dst (SignumVF dst (Binary zero one))); >> - match(Set dst (SignumVD dst (Binary zero one))); >> +instruct vsignum_reg(vReg dst, vReg one, vRegMask_V0 v0) %{ >> + match(Set dst (SignumVF dst one)); >> + match(Set dst (SignumVD dst one)); >> effect(TEMP_DEF dst, TEMP v0); >> format %{ "vsignum $dst, $dst\t" %} >> ins_encode %{ >> BasicType bt = Matcher::vector_element_basic_type(this); >> __ signum_fp_v(as_VectorRegister($dst$$reg), bt, Matcher::vector_length(this), >> - as_VectorRegister($zero$$reg), as_VectorRegister($one$$reg)); >> + as_VectorRegister($one$$reg)); >> %} >> 
ins_pipe(pipe_slow); >> %} >> >> >> In fact this is also my initial patch, but at runtime it will report a `mismatch` issue. >> And in x86 and aarch64, they all have `zero` too. >> >> >> o749 SignumVD === _ o746 o1276 [[ o750 ]] #vectora[4]:{double} >> >> --N: o749 SignumVD === _ o746 o1276 [[ o750 ]] #vectora[4]:{double} >> >> --N: o746 LoadVector === o440 o865 o697 |o250 [[... > > Hi, sorry for not being clear enough. In fact, I am suggesting following add-on change: > > diff --git a/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp b/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp > index 18b5df3f68c..b732e70ed84 100644 > --- a/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp > +++ b/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp > @@ -1677,15 +1677,15 @@ void C2_MacroAssembler::signum_fp(FloatRegister dst, FloatRegister one, bool is_ > bind(done); > } > > -void C2_MacroAssembler::signum_fp_v(VectorRegister dst, BasicType bt, int vlen, > - VectorRegister zero, VectorRegister one) { > +void C2_MacroAssembler::signum_fp_v(VectorRegister dst, VectorRegister one, > + BasicType bt, int vlen) { > vsetvli_helper(bt, vlen); > > // check if input is -0, +0, signaling NaN or quiet NaN > vfclass_v(v0, dst); > mv(t0, fclass_mask::zero | fclass_mask::nan); > vand_vx(v0, v0, t0); > - vmseq_vv(v0, v0, zero); > + vmseq_vi(v0, v0, 0); > > // use floating-point 1.0 with a sign of input > vfsgnj_vv(dst, one, dst, v0_t); > diff --git a/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.hpp b/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.hpp > index 0ed7c3686dc..b9a7749631a 100644 > --- a/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.hpp > +++ b/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.hpp > @@ -163,7 +163,7 @@ > > void signum_fp(FloatRegister dst, FloatRegister one, bool is_double); > > - void signum_fp_v(VectorRegister dst, BasicType bt, int vlen, VectorRegister zero, VectorRegister one); > + void signum_fp_v(VectorRegister dst, VectorRegister one, BasicType bt, int vlen); > 
> // intrinsic methods implemented by rvv instructions > > diff --git a/src/hotspot/cpu/riscv/riscv_v.ad b/src/hotspot/cpu/riscv/riscv_v.ad > index 9c2349a6c92..c163325fc81 100644 > --- a/src/hotspot/cpu/riscv/riscv_v.ad > +++ b/src/hotspot/cpu/riscv/riscv_v.ad > @@ -3671,8 +3671,8 @@ instruct vsignum_reg(vReg dst, vReg zero, vReg one, vRegMask_V0 v0) %{ > format %{ "vsignum $dst, $dst\t" %} > ins_encode %{ > BasicType bt = Matcher::vector_element_basic_type(this); > - __ signum_fp_v(as_VectorRegister($dst$$reg), bt, Matcher::vector_length(this), > - as_VectorRegister($zero$$reg), as_VectorRegister($one$$reg)); > + __ signum_fp_v(as_VectorRegister($dst$$reg), as_VectorRegister($one$$reg), > + bt, Matcher::vector_length(this)); > %} > ins_pipe(pipe_slow); > %} Good catch of the bug! Fixed. I did not realize it before. I think the reason that tests did not catch it is because both int and float zero has the exact same binary content. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16925#discussion_r1417432947 From epeter at openjdk.org Wed Dec 6 15:06:48 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 6 Dec 2023 15:06:48 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v21] In-Reply-To: References: Message-ID: On Mon, 4 Dec 2023 13:47:43 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> Suggestions by Christian for naming > > src/hotspot/share/opto/superword.cpp line 1750: > >> 1748: } else { >> 1749: C_init = scale; >> 1750: } > > I suggest to make it implicit to better follow the logic: > Suggestion: > > if (init_node->is_ConI()) { > C_const_init = init_node->as_ConI()->get_int(); > C_init = 0; > } else { > C_const_init = 0 > C_init = scale; > } I should also add some more comments here ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1417472901 From epeter at 
openjdk.org Wed Dec 6 15:10:06 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 6 Dec 2023 15:10:06 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v23] In-Reply-To: References: Message-ID: > I want to push this in JDK23. > After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). > > To calm your nerves: most of the changes are in auto-generated tests, and tests in general. > > **Context** > > `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). > > Alignment is split into two tasks: > - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. > - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width. > > **Problem** > > I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). > In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. > Thanks @fg1417 for confirming this! > Hence, we need to fix the alignment correctness checks. > > While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. 
> > **Problem Details** > > Reproducer: > > > static void test(short[] a, short[] b, short mask) { > for (int i = 0; i < RANGE; i+=8) { > // Problematic for AlignVector > b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 > > b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes > b[i+4] = (short)(a[i+4] & mask); > b[i+5] = (short)(a[i+5] & mask); > b[i+6] = (short)(a[i+6] & mask); > } > } > > > During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. > > This is problematic as shown in this example. We have references at index offset `0, 3, 4, 5, 6`, and by... Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: Apply suggestions from code review by Christian Thanks to Christian! 
Co-authored-by: Christian Hagedorn ------------- Changes: - all: https://git.openjdk.org/jdk/pull/14785/files - new: https://git.openjdk.org/jdk/pull/14785/files/fed4b013..840906bd Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=22 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=21-22 Stats: 12 lines in 5 files changed: 2 ins; 0 del; 10 mod Patch: https://git.openjdk.org/jdk/pull/14785.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14785/head:pull/14785 PR: https://git.openjdk.org/jdk/pull/14785 From epeter at openjdk.org Wed Dec 6 15:45:08 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 6 Dec 2023 15:45:08 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v19] In-Reply-To: References: Message-ID: On Tue, 28 Nov 2023 10:39:40 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> For Faye: remove 64-bit platform requirement in test > > src/hotspot/share/opto/superword.hpp line 251: > >> 249: // and invar is nullptr if no invar dependency. >> 250: class AlignmentSolution { >> 251: private: > > `private` is implied and can be removed. I would prefer to keep it. I think it is nicer to read, I make the intention explicit. Let me know if you disagree. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1417532878 From epeter at openjdk.org Wed Dec 6 15:45:05 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 6 Dec 2023 15:45:05 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v24] In-Reply-To: References: Message-ID: > I want to push this in JDK23. > After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). > > To calm your nerves: most of the changes are in auto-generated tests, and tests in general. 
> > **Context** > > `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). > > Alignment is split into two tasks: > - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. > - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width. > > **Problem** > > I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). > In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. > Thanks @fg1417 for confirming this! > Hence, we need to fix the alignment correctness checks. > > While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. > > **Problem Details** > > Reproducer: > > > static void test(short[] a, short[] b, short mask) { > for (int i = 0; i < RANGE; i+=8) { > // Problematic for AlignVector > b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 > > b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes > b[i+4] = (short)(a[i+4] & mask); > b[i+5] = (short)(a[i+5] & mask); > b[i+6] = (short)(a[i+6] & mask); > } > } > > > During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. 
For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. > > This is problematic as shown in this example. We have references at index offset `0, 3, 4, 5, 6`, and by... Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: more review updates for Christian ------------- Changes: - all: https://git.openjdk.org/jdk/pull/14785/files - new: https://git.openjdk.org/jdk/pull/14785/files/840906bd..571bd01a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=23 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=22-23 Stats: 39 lines in 4 files changed: 12 ins; 3 del; 24 mod Patch: https://git.openjdk.org/jdk/pull/14785.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14785/head:pull/14785 PR: https://git.openjdk.org/jdk/pull/14785 From epeter at openjdk.org Wed Dec 6 15:53:00 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 6 Dec 2023 15:53:00 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v25] In-Reply-To: References: Message-ID: > I want to push this in JDK23. > After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). > > To calm your nerves: most of the changes are in auto-generated tests, and tests in general. > > **Context** > > `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). 
> > Alignment is split into two tasks: > - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. > - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width. > > **Problem** > > I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). > In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. > Thanks @fg1417 for confirming this! > Hence, we need to fix the alignment correctness checks. > > While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. > > **Problem Details** > > Reproducer: > > > static void test(short[] a, short[] b, short mask) { > for (int i = 0; i < RANGE; i+=8) { > // Problematic for AlignVector > b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 > > b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes > b[i+4] = (short)(a[i+4] & mask); > b[i+5] = (short)(a[i+5] & mask); > b[i+6] = (short)(a[i+6] & mask); > } > } > > > During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. > > This is problematic as shown in this example. We have references at index offset `0, 3, 4, 5, 6`, and by... 
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: improve formatting for Christian ------------- Changes: - all: https://git.openjdk.org/jdk/pull/14785/files - new: https://git.openjdk.org/jdk/pull/14785/files/571bd01a..8590916e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=24 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=23-24 Stats: 16 lines in 1 file changed: 7 ins; 0 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/14785.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14785/head:pull/14785 PR: https://git.openjdk.org/jdk/pull/14785 From epeter at openjdk.org Wed Dec 6 16:02:03 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 6 Dec 2023 16:02:03 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v26] In-Reply-To: References: Message-ID: > I want to push this in JDK23. > After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). > > To calm your nerves: most of the changes are in auto-generated tests, and tests in general. > > **Context** > > `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). > > Alignment is split into two tasks: > - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. > - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. 
We would like to align the vectors with their vector width. > > **Problem** > > I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). > In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. > Thanks @fg1417 for confirming this! > Hence, we need to fix the alignment correctness checks. > > While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. > > **Problem Details** > > Reproducer: > > > static void test(short[] a, short[] b, short mask) { > for (int i = 0; i < RANGE; i+=8) { > // Problematic for AlignVector > b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 > > b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes > b[i+4] = (short)(a[i+4] & mask); > b[i+5] = (short)(a[i+5] & mask); > b[i+6] = (short)(a[i+6] & mask); > } > } > > > During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. > > This is problematic as shown in this example. We have references at index offset `0, 3, 4, 5, 6`, and by... 
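The alignment problem in the reproducer above can be checked with a little arithmetic. The following is a minimal sketch and not HotSpot code: it assumes that element 0 of the array sits at an 8-byte-aligned address (so byte offsets are counted from element 0), that `ObjectAlignmentInBytes` is 8, and that `i` is a multiple of 8 as in the loop. The class and method names are hypothetical.

```java
public class PackAlignmentSketch {
    // Assumption: ObjectAlignmentInBytes is 8.
    static final int OBJECT_ALIGNMENT_IN_BYTES = 8;
    // Size of one short element in bytes.
    static final int SHORT_SIZE = 2;

    // Byte offset of a[i + indexOffset], counted from element 0 of the array.
    static int byteOffset(int i, int indexOffset) {
        return (i + indexOffset) * SHORT_SIZE;
    }

    // True if the access at a[i + indexOffset] starts on an alignment boundary,
    // under the assumption that element 0 itself is aligned.
    static boolean isAligned(int i, int indexOffset) {
        return byteOffset(i, indexOffset) % OBJECT_ALIGNMENT_IN_BYTES == 0;
    }

    public static void main(String[] args) {
        // The loop in the reproducer advances i in steps of 8.
        for (int i = 0; i < 32; i += 8) {
            System.out.println("i=" + i
                + " a[i+0] aligned: " + isAligned(i, 0)
                + ", a[i+3] aligned: " + isAligned(i, 3));
        }
    }
}
```

Under these assumptions the reference at index offset 0 is always aligned, while the pack that starts at index offset 3 always begins 6 bytes past a boundary. Checking only the reference with the lowest offset therefore says nothing about that pack.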
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: p -> pack and vw -> vector_width ------------- Changes: - all: https://git.openjdk.org/jdk/pull/14785/files - new: https://git.openjdk.org/jdk/pull/14785/files/8590916e..9bd31652 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=25 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=24-25 Stats: 11 lines in 1 file changed: 1 ins; 0 del; 10 mod Patch: https://git.openjdk.org/jdk/pull/14785.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14785/head:pull/14785 PR: https://git.openjdk.org/jdk/pull/14785 From epeter at openjdk.org Wed Dec 6 16:06:52 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 6 Dec 2023 16:06:52 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v21] In-Reply-To: References: Message-ID: On Mon, 4 Dec 2023 12:56:17 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> Suggestions by Christian for naming > > src/hotspot/share/opto/superword.cpp line 1628: > >> 1626: int element_size = mem_ref->memory_size(); >> 1627: int vw = pack_size * element_size; // vector_width >> 1628: int aw = MIN2(vw, ObjectAlignmentInBytes); // alignment_width > > Is there a specific reason to go with `vw` and `aw` instead of directly naming the variables `vector_width` and `alignment_width`? I replaced `vw` with `vector_width`, since there were relatively few occurrences. But if I replace `aw` with `alignment_width`, then everything becomes much more "wordy" and all lines become much longer. I would like to keep `aw`; I think it is much more readable than the long form. What do you think?
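For readers following the naming discussion, the two quantities from the quoted snippet can be sketched in isolation. This is an illustrative sketch only, mirroring the two quoted lines (`vw = pack_size * element_size` and `aw = MIN2(vw, ObjectAlignmentInBytes)`); the class name, method names, and the sample values in `main` are assumptions and do not come from the patch.

```java
public class AlignmentWidthSketch {
    // vector_width (vw): total bytes covered by a pack of packSize elements.
    static int vectorWidth(int packSize, int elementSize) {
        return packSize * elementSize;
    }

    // alignment_width (aw): the smaller of the vector width and the object
    // alignment, mirroring MIN2(vw, ObjectAlignmentInBytes) in the quoted code.
    static int alignmentWidth(int vectorWidth, int objectAlignmentInBytes) {
        return Math.min(vectorWidth, objectAlignmentInBytes);
    }

    public static void main(String[] args) {
        int objectAlignment = 8; // assumed value of ObjectAlignmentInBytes

        int vwShorts = vectorWidth(4, 2);  // pack of 4 shorts
        int vwInts = vectorWidth(16, 4);   // pack of 16 ints
        int vwBytes = vectorWidth(2, 1);   // pack of 2 bytes

        System.out.println("4 shorts: vw=" + vwShorts + " aw=" + alignmentWidth(vwShorts, objectAlignment));
        System.out.println("16 ints:  vw=" + vwInts + " aw=" + alignmentWidth(vwInts, objectAlignment));
        System.out.println("2 bytes:  vw=" + vwBytes + " aw=" + alignmentWidth(vwBytes, objectAlignment));
    }
}
```

The sketch shows why the two names are not interchangeable: for wide packs `aw` is capped by the object alignment, while for narrow packs it equals `vw`.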
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1417592584 From epeter at openjdk.org Wed Dec 6 16:31:18 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 6 Dec 2023 16:31:18 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v27] In-Reply-To: References: Message-ID: > I want to push this in JDK23. > After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). > > To calm your nerves: most of the changes are in auto-generated tests, and tests in general. > > **Context** > > `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). > > Alignment is split into two tasks: > - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. > - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width. > > **Problem** > > I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). > In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. > Thanks @fg1417 for confirming this! > Hence, we need to fix the alignment correctness checks. > > While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. 
> > **Problem Details** > > Reproducer: > > > static void test(short[] a, short[] b, short mask) { > for (int i = 0; i < RANGE; i+=8) { > // Problematic for AlignVector > b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 > > b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes > b[i+4] = (short)(a[i+4] & mask); > b[i+5] = (short)(a[i+5] & mask); > b[i+6] = (short)(a[i+6] & mask); > } > } > > > During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. > > This is problematic as shown in this example. We have references at index offset `0, 3, 4, 5, 6`, and by... Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: made constants constant, and some more refactoring ------------- Changes: - all: https://git.openjdk.org/jdk/pull/14785/files - new: https://git.openjdk.org/jdk/pull/14785/files/9bd31652..ac24f845 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=26 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=25-26 Stats: 77 lines in 2 files changed: 28 ins; 10 del; 39 mod Patch: https://git.openjdk.org/jdk/pull/14785.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14785/head:pull/14785 PR: https://git.openjdk.org/jdk/pull/14785 From epeter at openjdk.org Wed Dec 6 16:31:19 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 6 Dec 2023 16:31:19 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v21] In-Reply-To: References: Message-ID: <8_cxOlz5zG5khk5L6fuE0yBb0YH6zHn77qRYB6vbgpE=.86b3fdee-91a3-4311-98b7-8b7d2a012aff@github.com> On Mon, 4 Dec 2023 13:00:00 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request 
incrementally with one additional commit since the last revision: >> >> Suggestions by Christian for naming > > src/hotspot/share/opto/superword.cpp line 1678: > >> 1676: tty->print_cr(" + scale(%d) * iv", scale); >> 1677: } >> 1678: #endif > > Not sure if you are planning to refactor the tracing code anyway but you might want to think about extracting any tracing code in this method to separate methods to not disrupt the readability of the code. You can derive some of the variables again (e.g. from VPointer etc.) to avoid having to pass many arguments to the tracing function. > > It might even be cleaner if this entire method would be part of `AlignmentSolution`. Then you do not need to pass around all the information to separate tracing methods. @chhagedorn What if I make it a `AlignmentSolver` that then returns an `AlignmentSolution`? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1417625914 From epeter at openjdk.org Wed Dec 6 16:41:05 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 6 Dec 2023 16:41:05 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v28] In-Reply-To: References: Message-ID: > I want to push this in JDK23. > After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). > > To calm your nerves: most of the changes are in auto-generated tests, and tests in general. > > **Context** > > `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). > > Alignment is split into two tasks: > - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. 
Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. > - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width. > > **Problem** > > I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). > In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. > Thanks @fg1417 for confirming this! > Hence, we need to fix the alignment correctness checks. > > While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. > > **Problem Details** > > Reproducer: > > > static void test(short[] a, short[] b, short mask) { > for (int i = 0; i < RANGE; i+=8) { > // Problematic for AlignVector > b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 > > b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes > b[i+4] = (short)(a[i+4] & mask); > b[i+5] = (short)(a[i+5] & mask); > b[i+6] = (short)(a[i+6] & mask); > } > } > > > During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. > > This is problematic as shown in this example. We have references at index offset `0, 3, 4, 5, 6`, and by... 
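
The reproducer quoted above can be checked by hand with a small model. The sketch below is an illustration under stated assumptions, not the algorithm used in `superword.cpp`: it assumes the array data starts at an address that is a multiple of the alignment width, and that each pre-loop iteration advances every address by the loop stride in bytes. The function name `pre_loop_iterations_to_align` is hypothetical.

```cpp
#include <cassert>

// Simplified model (assumption): the array data starts at a multiple of `aw`,
// and each pre-loop iteration advances the address by `stride_bytes`.
// Returns the number of pre-loop iterations needed to reach an aligned
// address, or -1 if no iteration count can ever align the access.
int pre_loop_iterations_to_align(int offset_bytes, int stride_bytes, int aw) {
  assert(aw > 0 && offset_bytes >= 0 && stride_bytes > 0);
  // Residues modulo aw repeat after at most aw steps, so aw tries suffice.
  for (int k = 0; k < aw; k++) {
    if ((offset_bytes + k * stride_bytes) % aw == 0) {
      return k;
    }
  }
  return -1;
}
```

In the reproducer, shorts are 2 bytes wide, so `i += 8` is a stride of 16 bytes and the pack of four shorts starting at `i+3` sits at byte offset 6. Assuming an alignment width of 8 bytes, `pre_loop_iterations_to_align(0, 16, 8)` returns 0 for the reference at offset 0, while `pre_loop_iterations_to_align(6, 16, 8)` returns -1: the address stays at 6 modulo 8 in every iteration. Judging the whole group by its lowest-offset reference therefore accepts a pack that can never be aligned.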
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: j -> main_iter ------------- Changes: - all: https://git.openjdk.org/jdk/pull/14785/files - new: https://git.openjdk.org/jdk/pull/14785/files/ac24f845..2f762aa1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=27 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=26-27 Stats: 21 lines in 2 files changed: 2 ins; 0 del; 19 mod Patch: https://git.openjdk.org/jdk/pull/14785.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14785/head:pull/14785 PR: https://git.openjdk.org/jdk/pull/14785 From epeter at openjdk.org Wed Dec 6 16:46:02 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 6 Dec 2023 16:46:02 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v29] In-Reply-To: References: Message-ID: > I want to push this in JDK23. > After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). > > To calm your nerves: most of the changes are in auto-generated tests, and tests in general. > > **Context** > > `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). > > Alignment is split into two tasks: > - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. > - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. 
We would like to align the vectors with their vector width. > > **Problem** > > I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). > In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. > Thanks @fg1417 for confirming this! > Hence, we need to fix the alignment correctness checks. > > While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. > > **Problem Details** > > Reproducer: > > > static void test(short[] a, short[] b, short mask) { > for (int i = 0; i < RANGE; i+=8) { > // Problematic for AlignVector > b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 > > b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes > b[i+4] = (short)(a[i+4] & mask); > b[i+5] = (short)(a[i+5] & mask); > b[i+6] = (short)(a[i+6] & mask); > } > } > > > During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. > > This is problematic as shown in this example. We have references at index offset `0, 3, 4, 5, 6`, and by... 
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:

  fix up invar / init comment

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/14785/files
  - new: https://git.openjdk.org/jdk/pull/14785/files/2f762aa1..ff9073e0

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=28
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=27-28

  Stats: 5 lines in 1 file changed: 0 ins; 0 del; 5 mod
  Patch: https://git.openjdk.org/jdk/pull/14785.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/14785/head:pull/14785

PR: https://git.openjdk.org/jdk/pull/14785

From epeter at openjdk.org Wed Dec 6 16:46:05 2023
From: epeter at openjdk.org (Emanuel Peter)
Date: Wed, 6 Dec 2023 16:46:05 GMT
Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v21]
In-Reply-To:
References:
Message-ID:

On Mon, 4 Dec 2023 13:31:24 GMT, Christian Hagedorn wrote:

>> src/hotspot/share/opto/superword.cpp line 1720:
>>
>>> 1718: // + offset + offset + C_const (sum of constant terms)
>>> 1719: // + invar + invar_factor * var_invar + C_invar * var_invar (term for variable init)
>>> 1720: // / + scale * init + C_init * var_init (term for invariant)
>>
>> You flipped the comments for the variable init and invariant terms
>
> `var_invar` is slightly confusing. Maybe we should flip it to `invar_var` to be consistent with `invar_factor`? Or maybe you find another name for that term. I could not come up with something better for now.

I really don't know anything better than `var_invar`. The idea is that there is both a constant `C` and a variable `var` part for both the `init` and `invar` terms. I explain below that there is such a factorization into constant and variable.
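
The factorization into a constant `C` and a variable `var` part can be illustrated with a small model. The sketch below is a simplification under stated assumptions and is not taken from the patch: the struct `AddressTerms` and both function names are hypothetical, the field names follow the quoted comment, and the base address is assumed to be a multiple of the alignment width `aw`.

```cpp
#include <cassert>

// Hypothetical, simplified address model (names follow the quoted comment):
//   adr = base + C_const + C_invar * var_invar + C_init * var_init
// Only the remainder modulo the alignment width `aw` matters for alignment.
struct AddressTerms {
  long base;     // assumed to be a multiple of aw
  long C_const;  // sum of constant terms
  long C_invar;  // constant factor of the invariant term
  long C_init;   // constant factor of the init term
};

// Remainder of the address modulo aw for given values of the variable parts.
long address_mod_aw(const AddressTerms& t, long var_invar, long var_init, long aw) {
  assert(aw > 0);
  long adr = t.base + t.C_const + t.C_invar * var_invar + t.C_init * var_init;
  return ((adr % aw) + aw) % aw;  // keep the result non-negative
}

// A term C * var cannot change the alignment if C is a multiple of aw,
// whatever value var takes at runtime.
bool term_is_alignment_neutral(long C, long aw) {
  assert(aw > 0);
  return C % aw == 0;
}
```

The point of the split is that the variable parts are unknown at compile time, while the constant factors are known. If a constant factor is a multiple of `aw`, its whole term drops out of the alignment question, and only the remaining constants need to be considered.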
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1417647566 From epeter at openjdk.org Wed Dec 6 16:58:07 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 6 Dec 2023 16:58:07 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v30] In-Reply-To: References: Message-ID: > I want to push this in JDK23. > After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). > > To calm your nerves: most of the changes are in auto-generated tests, and tests in general. > > **Context** > > `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). > > Alignment is split into two tasks: > - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. > - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width. > > **Problem** > > I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). > In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. > Thanks @fg1417 for confirming this! > Hence, we need to fix the alignment correctness checks. > > While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. 
> > **Problem Details** > > Reproducer: > > > static void test(short[] a, short[] b, short mask) { > for (int i = 0; i < RANGE; i+=8) { > // Problematic for AlignVector > b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 > > b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes > b[i+4] = (short)(a[i+4] & mask); > b[i+5] = (short)(a[i+5] & mask); > b[i+6] = (short)(a[i+6] & mask); > } > } > > > During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. > > This is problematic as shown in this example. We have references at index offset `0, 3, 4, 5, 6`, and by... Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: replace mod/modulo with ampercent in comments, and bracket things instead of modulo-equals ------------- Changes: - all: https://git.openjdk.org/jdk/pull/14785/files - new: https://git.openjdk.org/jdk/pull/14785/files/ff9073e0..4ede9cc5 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=29 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=28-29 Stats: 27 lines in 1 file changed: 0 ins; 0 del; 27 mod Patch: https://git.openjdk.org/jdk/pull/14785.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14785/head:pull/14785 PR: https://git.openjdk.org/jdk/pull/14785 From epeter at openjdk.org Wed Dec 6 17:04:04 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 6 Dec 2023 17:04:04 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v31] In-Reply-To: References: Message-ID: > I want to push this in JDK23. > After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). 
> > To calm your nerves: most of the changes are in auto-generated tests, and tests in general. > > **Context** > > `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). > > Alignment is split into two tasks: > - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. > - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width. > > **Problem** > > I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). > In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. > Thanks @fg1417 for confirming this! > Hence, we need to fix the alignment correctness checks. > > While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. > > **Problem Details** > > Reproducer: > > > static void test(short[] a, short[] b, short mask) { > for (int i = 0; i < RANGE; i+=8) { > // Problematic for AlignVector > b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 > > b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes > b[i+4] = (short)(a[i+4] & mask); > b[i+5] = (short)(a[i+5] & mask); > b[i+6] = (short)(a[i+6] & mask); > } > } > > > During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. 
For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. > > This is problematic as shown in this example. We have references at index offset `0, 3, 4, 5, 6`, and by... Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: a few more suggestions by Christian ------------- Changes: - all: https://git.openjdk.org/jdk/pull/14785/files - new: https://git.openjdk.org/jdk/pull/14785/files/4ede9cc5..e1a66633 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=30 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=29-30 Stats: 7 lines in 1 file changed: 1 ins; 1 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/14785.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14785/head:pull/14785 PR: https://git.openjdk.org/jdk/pull/14785 From epeter at openjdk.org Wed Dec 6 17:04:08 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 6 Dec 2023 17:04:08 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v22] In-Reply-To: References: Message-ID: On Wed, 6 Dec 2023 12:00:21 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with two additional commits since the last revision: >> >> - add newline suggested by Faye >> - improve the alignment proof, make it more explicit > > src/hotspot/share/opto/superword.cpp line 1894: > >> 1892: // for any pre_iter_C_const >= 0: C_pre * pre_iter_C_const = 0 (mod aw) >> 1893: // >> 1894: // which implies that C_iter (and pre_iter_C_const) have no effect on the alignment of > > What is `C_iter`? typo. 
`pre_iter` ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1417669831 From duke at openjdk.org Wed Dec 6 17:16:50 2023 From: duke at openjdk.org (Srinivas Vamsi Parasa) Date: Wed, 6 Dec 2023 17:16:50 GMT Subject: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v8] In-Reply-To: References: <7ocsRxaWjoU2vxwPUSE7BrnLSL1bF_7Pp8vReacNJvE=.1c044b21-b27e-4e94-8db0-6ae888a1e8b9@github.com> Message-ID: On Tue, 5 Dec 2023 19:33:48 GMT, Jatin Bhateja wrote: >> Srinivas Vamsi Parasa has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 17 additional commits since the last revision: >> >> - Merge branch 'master' of https://git.openjdk.java.net/jdk into simdsort >> - add GCC version guards >> - Merge branch 'master' of https://git.openjdk.java.net/jdk into simdsort >> - Remove C++17 from C flags >> - add avoid masked stores operation >> - update the code to check for supported simd sort cpus >> - Disable AVX2 sort for 64-bit types >> - Merge branch 'master' of https://git.openjdk.java.net/jdk into simdsort >> - fix jcheck failures due to windows encoding >> - fix carriage return and change insertion sort thresholds >> - ... and 7 more: https://git.openjdk.org/jdk/compare/d8b29378...bc590d9f > > src/java.base/linux/native/libsimdsort/avx512-32bit-qsort.hpp line 235: > >> 233: return avx512_double_compressstore>( >> 234: left_addr, right_addr, k, reg); >> 235: } > > Can be removed. This is needed for AVX512 sort... 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16534#discussion_r1417690992 From epeter at openjdk.org Wed Dec 6 17:16:53 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 6 Dec 2023 17:16:53 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v16] In-Reply-To: <0aTi3Gf2Bz5f_TBeM2_2XxTWxMg5N-zShkX_nuqbgDg=.1434516c-1d41-46a6-a0d0-f77015463da1@github.com> References: <0aTi3Gf2Bz5f_TBeM2_2XxTWxMg5N-zShkX_nuqbgDg=.1434516c-1d41-46a6-a0d0-f77015463da1@github.com> Message-ID: On Fri, 24 Nov 2023 08:39:04 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 67 commits: >> >> - Merge branch 'master' into JDK-8311586 >> - TestBufferVectorization.java: Faye reported issue with AlignVector. Made to IR test, removed flagless >> - Merge branch 'master' into JDK-8311586 >> - aarch64 match rule from Faye for VerifyVectorAlignment >> - Faye found failure on 256 SVE machine, fixed >> - Merge branch 'master' into JDK-8311586 >> - fix flags register in VerifyAlignVector >> - Merge branch 'master' into JDK-8311586 >> - small fix >> - Merge branch 'master' into JDK-8311586 >> - ... and 57 more: https://git.openjdk.org/jdk/compare/e055fae1...b491fbcb > > src/hotspot/share/opto/superword.cpp line 3815: > >> 3813: // lim0: current pre-loop limit >> 3814: // lim: new pre-loop limit >> 3815: // N: difference between lim and lim0 > > I find it hard to remember which is which when reading the equations below. 
I suggest to use more explicit names: > Suggestion: > > // old_limit: current pre-loop limit > // new_limit: new pre-loop limit > // diff_limits: difference between lim and lim0 done ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1417689474 From epeter at openjdk.org Wed Dec 6 17:20:10 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 6 Dec 2023 17:20:10 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v32] In-Reply-To: References: Message-ID: > I want to push this in JDK23. > After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). > > To calm your nerves: most of the changes are in auto-generated tests, and tests in general. > > **Context** > > `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). > > Alignment is split into two tasks: > - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. > - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width. > > **Problem** > > I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). > In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. > Thanks @fg1417 for confirming this! > Hence, we need to fix the alignment correctness checks. 
> > While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. > > **Problem Details** > > Reproducer: > > > static void test(short[] a, short[] b, short mask) { > for (int i = 0; i < RANGE; i+=8) { > // Problematic for AlignVector > b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 > > b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes > b[i+4] = (short)(a[i+4] & mask); > b[i+5] = (short)(a[i+5] & mask); > b[i+6] = (short)(a[i+6] & mask); > } > } > > > During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. > > This is problematic as shown in this example. We have references at index offset `0, 3, 4, 5, 6`, and by... 
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:

  For Christian: rename variables in adjust_pre_loop_limit_to_align_main_loop_vectors

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/14785/files
  - new: https://git.openjdk.org/jdk/pull/14785/files/e1a66633..cd781c46

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=31
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=30-31

  Stats: 44 lines in 1 file changed: 0 ins; 0 del; 44 mod
  Patch: https://git.openjdk.org/jdk/pull/14785.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/14785/head:pull/14785

PR: https://git.openjdk.org/jdk/pull/14785

From epeter at openjdk.org Wed Dec 6 17:22:50 2023
From: epeter at openjdk.org (Emanuel Peter)
Date: Wed, 6 Dec 2023 17:22:50 GMT
Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v22]
In-Reply-To:
References:
Message-ID: <7vm8YXPz9dVN_uA6OcrYOpnkndTTKtAbp7epL6nqKV8=.a611bb4f-9dad-4476-8a36-96abb0db3c6d@github.com>

On Wed, 6 Dec 2023 12:36:30 GMT, Christian Hagedorn wrote:

>> Emanuel Peter has updated the pull request incrementally with two additional commits since the last revision:
>>
>> - add newline suggested by Faye
>> - improve the alignment proof, make it more explicit
>
> Impressive work! I'm still working my way through the proofs but here are some first comments.

@chhagedorn @fg1417 thanks for the first passes of reviews!

I think I want to follow Christian's idea, and refactor `SuperWord::pack_alignment_solution` into a separate class. I will make it a bit more general if I can, so that future vectorizers can use it as well. I'll call it `AlignmentSolver`, which returns an `AlignmentSolution`. Wrapping it like this allows me to split the code into more sub-functions, without having to pass massive amounts of parameters. Hopefully that will make the code more readable.
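
The design idea described here, a solver object that stores its inputs once and returns a small result object, might look roughly as follows. This is a hypothetical sketch for illustration only: `AlignmentSolverSketch` and `AlignmentSolutionSketch` are invented names, the alignment model is a deliberately simple one (base address assumed to be a multiple of `aw`), and none of it reflects the classes that were eventually written for the patch.

```cpp
#include <cassert>

// Hypothetical result type: either "not alignable" or a pre-loop iteration count.
struct AlignmentSolutionSketch {
  bool alignable;
  int pre_iter;  // pre-loop iterations to reach alignment; valid only if alignable
};

// Hypothetical solver: the inputs are stored once as members, so the helper
// methods below do not need long parameter lists.
class AlignmentSolverSketch {
 public:
  AlignmentSolverSketch(int offset_bytes, int stride_bytes, int aw)
      : _offset(offset_bytes), _stride(stride_bytes), _aw(aw) {
    assert(aw > 0 && stride_bytes > 0 && offset_bytes >= 0);
  }

  // Tries every pre-loop iteration count below aw; residues repeat after that.
  AlignmentSolutionSketch solve() const {
    for (int k = 0; k < _aw; k++) {
      if (is_aligned_after(k)) {
        return {true, k};
      }
    }
    return {false, 0};
  }

 private:
  // Simplified model (assumption): the base address is a multiple of _aw.
  bool is_aligned_after(int pre_iter) const {
    return (_offset + pre_iter * _stride) % _aw == 0;
  }

  const int _offset;
  const int _stride;
  const int _aw;
};
```

A caller would construct the solver with the properties of one memory reference and call `solve()` once. Tracing or further sub-steps could then be added as private methods that read the stored members, which is the readability benefit the message is after.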
-------------

PR Comment: https://git.openjdk.org/jdk/pull/14785#issuecomment-1843329419

From duke at openjdk.org Wed Dec 6 17:23:03 2023
From: duke at openjdk.org (Srinivas Vamsi Parasa)
Date: Wed, 6 Dec 2023 17:23:03 GMT
Subject: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v9]
In-Reply-To:
References:
Message-ID:

> The goal is to develop faster sort routines for x86_64 CPUs by taking advantage of AVX2 instructions. This enhancement provides an order of magnitude speedup for Arrays.sort() using int, long, float and double arrays.
>
> For serial sort on random data, this PR shows up to ~7.5x improvement for 32-bit datatypes (int, float) on Intel TigerLake machine as shown in the performance data below.
>
> For parallel sort on random data, this PR shows up to ~3.4x for 32-bit datatypes (int, float) as shown below.
>
> **Note:** This PR also improves the performance of AVX512 sort by up to 35%.
>
> Benchmark (Serial Sort) | Size | Baseline (us/op) | AVX2 (us/op) | Speedup
> -- | -- | -- | -- | --
> ArraysSort.intSort | 10 | 0.034 | 0.029 | 1.2
> ArraysSort.intSort | 25 | 0.088 | 0.044 | 2.0
> ArraysSort.intSort | 50 | 0.239 | 0.159 | 1.5
> ArraysSort.intSort | 75 | 0.417 | 0.27 | 1.5
> ArraysSort.intSort | 100 | 0.572 | 0.265 | 2.2
> ArraysSort.intSort | 1000 | 10.098 | 4.282 | 2.4
> ArraysSort.intSort | 10000 | 330.065 | 43.383 | 7.6
> ArraysSort.intSort | 100000 | 4099.527 | 778.943 | 5.3
> ArraysSort.intSort | 1000000 | 49150.16 | 9634.335 | 5.1
> ArraysSort.floatSort | 10 | 0.045 | 0.043 | 1.0
> ArraysSort.floatSort | 25 | 0.105 | 0.073 | 1.4
> ArraysSort.floatSort | 50 | 0.278 | 0.216 | 1.3
> ArraysSort.floatSort | 75 | 0.476 | 0.241 | 2.0
> ArraysSort.floatSort | 100 | 0.583 | 0.313 | 1.9
> ArraysSort.floatSort | 1000 | 10.182 | 4.329 | 2.4
> ArraysSort.floatSort | 10000 | 323.136 | 57.175 | 5.7
> ArraysSort.floatSort | 100000 | 4299.519 | 862.63 | 5.0
> ArraysSort.floatSort | 1000000 | 50889.4 | 10972.19 | 4.6
>
> ...

Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:

  remove unused avx2 64 bit sort functions; add assertions

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/16534/files
  - new: https://git.openjdk.org/jdk/pull/16534/files/bc590d9f..c143e0b9

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=16534&range=08
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16534&range=07-08

  Stats: 128 lines in 4 files changed: 12 ins; 116 del; 0 mod
  Patch: https://git.openjdk.org/jdk/pull/16534.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/16534/head:pull/16534

PR: https://git.openjdk.org/jdk/pull/16534

From duke at openjdk.org Wed Dec 6 17:23:04 2023
From: duke at openjdk.org (Srinivas Vamsi Parasa)
Date: Wed, 6 Dec 2023 17:23:04 GMT
Subject: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v9]
In-Reply-To: <7_T6sM3wjbSzZ0ab9FsptbpPnlQ2J4NNctQNkdbDFdI=.b595a8cc-4b14-44c6-8319-00e68fea21c3@github.com>
References: <_gSNXk0qGAtpY-WJ5OCHk_3-nuGrwwSn-ffK9f2TEcs=.40f785ba-83dd-40fe-8075-a7a7872ea600@github.com> <7_T6sM3wjbSzZ0ab9FsptbpPnlQ2J4NNctQNkdbDFdI=.b595a8cc-4b14-44c6-8319-00e68fea21c3@github.com>
Message-ID:

On Wed, 6 Dec 2023 11:59:19 GMT, Magnus Ihse Bursie wrote:

>> Hi Magnus (@magicus),
>>
>>> Are you saying that when compiling with GCC 6, it will just silently ignore `-std=c++17`?
I'd have assumed that it printed a warning or error about an unknown or invalid option, if C++17 is not supported.
>>
>> The GCC compiler for versions 6 (and even 5) silently ignores the flag `-std=c++17`. It does not print any warning or error. I tested it with a toy C++ program and also by building OpenJDK using GCC 6.
>>
>>> You can't check for if compiler options should be enabled or not inside source code files.
>>
>> What I meant was that there are #ifdef guards using predefined macros in the C++ source code, which check the GCC version and make the simdsort code available for compilation or not based on that version:
>>
>> // src/java.base/linux/native/libsimdsort/simdsort-support.hpp
>> #if defined(_LP64) && (defined(__GNUC__) && ((__GNUC__ > 7) || ((__GNUC__ == 7) && (__GNUC_MINOR__ >= 5))))
>> #define __SIMDSORT_SUPPORTED_LINUX
>> #endif
>>
>> // src/java.base/linux/native/libsimdsort/avx2-linux-qsort.cpp
>> #include "simdsort-support.hpp"
>> #ifdef __SIMDSORT_SUPPORTED_LINUX
>>
>> #endif
>
> Okay, then I guess I am fine with this.

Thank you Magnus!

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16534#discussion_r1417707661

From duke at openjdk.org Wed Dec 6 17:23:13 2023
From: duke at openjdk.org (Srinivas Vamsi Parasa)
Date: Wed, 6 Dec 2023 17:23:13 GMT
Subject: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v8]
In-Reply-To:
References: <7ocsRxaWjoU2vxwPUSE7BrnLSL1bF_7Pp8vReacNJvE=.1c044b21-b27e-4e94-8db0-6ae888a1e8b9@github.com>
Message-ID:

On Tue, 5 Dec 2023 19:37:34 GMT, Jatin Bhateja wrote:

>> Srinivas Vamsi Parasa has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
>> The pull request contains 17 additional commits since the last revision:
>>
>>  - Merge branch 'master' of https://git.openjdk.java.net/jdk into simdsort
>>  - add GCC version guards
>>  - Merge branch 'master' of https://git.openjdk.java.net/jdk into simdsort
>>  - Remove C++17 from C flags
>>  - add avoid masked stores operation
>>  - update the code to check for supported simd sort cpus
>>  - Disable AVX2 sort for 64-bit types
>>  - Merge branch 'master' of https://git.openjdk.java.net/jdk into simdsort
>>  - fix jcheck failures due to windows encoding
>>  - fix carriage return and change insertion sort thresholds
>>  - ... and 7 more: https://git.openjdk.org/jdk/compare/d4151e5b...bc590d9f
>
> src/java.base/linux/native/libsimdsort/avx2-emu-funcs.hpp line 64:
>
>> 62: }
>> 63: return lut;
>> 64: }();
>
> Lut64 is needed for compress64 emulation, can be removed.

Removed in the latest commit...

> src/java.base/linux/native/libsimdsort/avx2-emu-funcs.hpp line 234:
>
>> 232:
>> 233: vtype::mask_storeu(leftStore, left, temp);
>> 234: }
>
> Can be removed if not being used.

Removed in the latest commit...

> src/java.base/linux/native/libsimdsort/avx2-emu-funcs.hpp line 277:
>
>> 275:
>> 276: return _mm_popcnt_u32(shortMask);
>> 277: }
>
> Can be removed if not being used.

Removed in the latest commit...

> src/java.base/linux/native/libsimdsort/avx2-linux-qsort.cpp line 44:
>
>> 42: break;
>> 43: case JVM_T_FLOAT:
>> 44: avx2_fast_sort((float*)array, from_index, to_index, INSERTION_SORT_THRESHOLD_32BIT);
>
> Assertions for unsupported types.

Added in the latest commit...

> src/java.base/linux/native/libsimdsort/avx2-linux-qsort.cpp line 56:
>
>> 54: case JVM_T_FLOAT:
>> 55: avx2_fast_partition((float*)array, from_index, to_index, pivot_indices, index_pivot1, index_pivot2);
>> 56: break;
>
> Please add assertion for unsupported types.

Added in the latest commit...
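
[Editorial illustration] The review request above (assert on element types the AVX2 path does not handle) could look roughly like the sketch below. This is not the code from the PR: `ElemType`, `sort_dispatch`, and the use of `std::sort` as a stand-in for `avx2_fast_sort` are hypothetical names and substitutions chosen only to show the pattern of a `default:` branch that asserts.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical element-type tags; the code under review switches on JVM_T_* constants.
enum ElemType { ELEM_INT, ELEM_FLOAT, ELEM_LONG, ELEM_DOUBLE };

// Sorts array[from_index, to_index) for the supported 32-bit types and
// asserts (in debug builds) on anything else. Returns true if a sort ran.
inline bool sort_dispatch(ElemType type, void* array,
                          int32_t from_index, int32_t to_index) {
    switch (type) {
        case ELEM_INT: {
            int32_t* a = static_cast<int32_t*>(array);
            std::sort(a + from_index, a + to_index);  // stand-in for the SIMD sort
            return true;
        }
        case ELEM_FLOAT: {
            float* a = static_cast<float*>(array);
            std::sort(a + from_index, a + to_index);  // stand-in for the SIMD sort
            return true;
        }
        default:
            // 64-bit types are deliberately not handled in this sketch.
            assert(false && "unexpected element type");
            return false;
    }
}
```

The `return false` after the assertion keeps release builds (where `assert` compiles away) from falling off the end of the function.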
-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16534#discussion_r1417701182
PR Review Comment: https://git.openjdk.org/jdk/pull/16534#discussion_r1417702999
PR Review Comment: https://git.openjdk.org/jdk/pull/16534#discussion_r1417702251
PR Review Comment: https://git.openjdk.org/jdk/pull/16534#discussion_r1417701469
PR Review Comment: https://git.openjdk.org/jdk/pull/16534#discussion_r1417701705

From duke at openjdk.org Wed Dec 6 17:23:14 2023
From: duke at openjdk.org (Srinivas Vamsi Parasa)
Date: Wed, 6 Dec 2023 17:23:14 GMT
Subject: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v9]
In-Reply-To:
References: <7ocsRxaWjoU2vxwPUSE7BrnLSL1bF_7Pp8vReacNJvE=.1c044b21-b27e-4e94-8db0-6ae888a1e8b9@github.com>
Message-ID:

On Tue, 5 Dec 2023 19:19:23 GMT, Jatin Bhateja wrote:

>> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
>>
>> remove unused avx2 64 bit sort functions; add assertions
>
> src/java.base/linux/native/libsimdsort/avx2-linux-qsort.cpp line 50:
>
>> 48: case JVM_T_DOUBLE:
>> 49: avx2_fast_sort((double*)array, from_index, to_index, INSERTION_SORT_THRESHOLD_64BIT);
>> 50: break;
>
> Please add safe assertions for missing types.

This comment refers to an older, now outdated commit. The assertions have been added in the other cases.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16534#discussion_r1417706670

From duke at openjdk.org Wed Dec 6 17:32:53 2023
From: duke at openjdk.org (Raphael Mosaner)
Date: Wed, 6 Dec 2023 17:32:53 GMT
Subject: RFR: 8320139: [JVMCI] VmObjectAlloc is not generated by intrinsics methods which allocate objects
Message-ID:

This PR exports a pointer to `JvmtiExport::_should_notify_object_alloc` via JVMCI to enable intrinsification of unsafe allocations in accordance with C2.

-------------

Commit messages:
 - [JVMCI] Export pointer to JvmtiExport::_should_notify_object_alloc via CompilerToVM::Data.

Changes: https://git.openjdk.org/jdk/pull/16980/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=16980&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8320139
  Stats: 8 lines in 3 files changed: 8 ins; 0 del; 0 mod
  Patch: https://git.openjdk.org/jdk/pull/16980.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/16980/head:pull/16980

PR: https://git.openjdk.org/jdk/pull/16980

From duke at openjdk.org Wed Dec 6 17:48:04 2023
From: duke at openjdk.org (Srinivas Vamsi Parasa)
Date: Wed, 6 Dec 2023 17:48:04 GMT
Subject: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v10]
In-Reply-To:
References:
Message-ID:

> The goal is to develop faster sort routines for x86_64 CPUs by taking advantage of AVX2 instructions. This enhancement provides an order of magnitude speedup for Arrays.sort() using int, long, float and double arrays.
>
> For serial sort on random data, this PR shows up to ~7.5x improvement for 32-bit datatypes (int, float) on Intel TigerLake machine as shown in the performance data below.
>
> For parallel sort on random data, this PR shows up to ~3.4x for 32-bit datatypes (int, float) as shown below.
>
> **Note:** This PR also improves the performance of AVX512 sort by up to 35%.
>
> Benchmark (Serial Sort) | Size | Baseline (us/op) | AVX2 (us/op) | Speedup
> -- | -- | -- | -- | --
> ArraysSort.intSort | 10 | 0.034 | 0.029 | 1.2
> ArraysSort.intSort | 25 | 0.088 | 0.044 | 2.0
> ArraysSort.intSort | 50 | 0.239 | 0.159 | 1.5
> ArraysSort.intSort | 75 | 0.417 | 0.27 | 1.5
> ArraysSort.intSort | 100 | 0.572 | 0.265 | 2.2
> ArraysSort.intSort | 1000 | 10.098 | 4.282 | 2.4
> ArraysSort.intSort | 10000 | 330.065 | 43.383 | 7.6
> ArraysSort.intSort | 100000 | 4099.527 | 778.943 | 5.3
> ArraysSort.intSort | 1000000 | 49150.16 | 9634.335 | 5.1
> ArraysSort.floatSort | 10 | 0.045 | 0.043 | 1.0
> ArraysSort.floatSort | 25 | 0.105 | 0.073 | 1.4
> ArraysSort.floatSort | 50 | 0.278 | 0.216 | 1.3
> ArraysSort.floatSort | 75 | 0.476 | 0.241 | 2.0
> ArraysSort.floatSort | 100 | 0.583 | 0.313 | 1.9
> ArraysSort.floatSort | 1000 | 10.182 | 4.329 | 2.4
> ArraysSort.floatSort | 10000 | 323.136 | 57.175 | 5.7
> ArraysSort.floatSort | 100000 | 4299.519 | 862.63 | 5.0
> ArraysSort.floatSort | 1000000 | 50889.4 | 10972.19 | 4.6

Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:

  add missing header files

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/16534/files
  - new: https://git.openjdk.org/jdk/pull/16534/files/c143e0b9..7e124581

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=16534&range=09
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16534&range=08-09

  Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod
  Patch: https://git.openjdk.org/jdk/pull/16534.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/16534/head:pull/16534

PR: https://git.openjdk.org/jdk/pull/16534

From jbhateja at openjdk.org Wed Dec 6 17:48:05 2023
From: jbhateja at openjdk.org (Jatin Bhateja)
Date: Wed, 6 Dec 2023 17:48:05 GMT
Subject: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v10]
In-Reply-To:
References:
Message-ID:

On Wed, 6 Dec 2023 17:44:25 GMT, Srinivas Vamsi Parasa wrote:

>> The goal is to develop faster sort routines for x86_64 CPUs by taking advantage of AVX2 instructions. This enhancement provides an order of magnitude speedup for Arrays.sort() using int, long, float and double arrays.
>>
>> For serial sort on random data, this PR shows up to ~7.5x improvement for 32-bit datatypes (int, float) on Intel TigerLake machine as shown in the performance data below.
>>
>> For parallel sort on random data, this PR shows up to ~3.4x for 32-bit datatypes (int, float) as shown below.
>>
>> **Note:** This PR also improves the performance of AVX512 sort by up to 35%.
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
>
>   add missing header files

LGTM, thanks!

-------------

Marked as reviewed by jbhateja (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/16534#pullrequestreview-1768255412

From duke at openjdk.org Wed Dec 6 17:48:06 2023
From: duke at openjdk.org (Srinivas Vamsi Parasa)
Date: Wed, 6 Dec 2023 17:48:06 GMT
Subject: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v10]
In-Reply-To:
References:
Message-ID:

On Wed, 6 Dec 2023 17:42:39 GMT, Jatin Bhateja wrote:

> LGTM, thanks!

Thanks Jatin!

-------------

PR Comment: https://git.openjdk.org/jdk/pull/16534#issuecomment-1843372385

From sviswanathan at openjdk.org Wed Dec 6 18:07:45 2023
From: sviswanathan at openjdk.org (Sandhya Viswanathan)
Date: Wed, 6 Dec 2023 18:07:45 GMT
Subject: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v10]
In-Reply-To:
References:
Message-ID: <1H8N81T-C79IER_JFotgxYegmmZHI8-Efbe0vbmO5oU=.2150bacd-7a5a-4a66-8096-662a35cc893e@github.com>

On Wed, 6 Dec 2023 17:48:04 GMT, Srinivas Vamsi Parasa wrote:

>> The goal is to develop faster sort routines for x86_64 CPUs by taking advantage of AVX2 instructions. This enhancement provides an order of magnitude speedup for Arrays.sort() using int, long, float and double arrays.
>>
>> For serial sort on random data, this PR shows up to ~7.5x improvement for 32-bit datatypes (int, float) on Intel TigerLake machine as shown in the performance data below.
>>
>> For parallel sort on random data, this PR shows up to ~3.4x for 32-bit datatypes (int, float) as shown below.
>>
>> **Note:** This PR also improves the performance of AVX512 sort by up to 35%.
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
>
>   add missing header files

@TobiHartmann @vnkozlov Please advise if we can go ahead and integrate this PR today before the fork.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/16534#issuecomment-1843407940

From kvn at openjdk.org Wed Dec 6 18:29:41 2023
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Wed, 6 Dec 2023 18:29:41 GMT
Subject: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v10]
In-Reply-To: <1H8N81T-C79IER_JFotgxYegmmZHI8-Efbe0vbmO5oU=.2150bacd-7a5a-4a66-8096-662a35cc893e@github.com>
References: <1H8N81T-C79IER_JFotgxYegmmZHI8-Efbe0vbmO5oU=.2150bacd-7a5a-4a66-8096-662a35cc893e@github.com>
Message-ID:

On Wed, 6 Dec 2023 18:05:22 GMT, Sandhya Viswanathan wrote:

> @TobiHartmann @vnkozlov Please advise if we can go ahead and integrate this PR today before the fork.

Too late. Changes look fine to me (I am still on the fence about us moving to a C++ implementation of intrinsics and requiring the latest C++ compiler version). I need to run testing on the latest version of the changes before approval. Let's not rush.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/16534#issuecomment-1843446956

From sviswanathan at openjdk.org Wed Dec 6 18:33:45 2023
From: sviswanathan at openjdk.org (Sandhya Viswanathan)
Date: Wed, 6 Dec 2023 18:33:45 GMT
Subject: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v10]
In-Reply-To:
References: <1H8N81T-C79IER_JFotgxYegmmZHI8-Efbe0vbmO5oU=.2150bacd-7a5a-4a66-8096-662a35cc893e@github.com>
Message-ID:

On Wed, 6 Dec 2023 18:26:34 GMT, Vladimir Kozlov wrote:

>> @TobiHartmann @vnkozlov Please advise if we can go ahead and integrate this PR today before the fork.
>
>> @TobiHartmann @vnkozlov Please advise if we can go ahead and integrate this PR today before the fork.
>
> Too late. Changes look fine to me (I am still on the fence about us moving to a C++ implementation of intrinsics and requiring the latest C++ compiler version). I need to run testing on the latest version of the changes before approval. Let's not rush.

Thanks a lot @vnkozlov, we will wait for your approval.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/16534#issuecomment-1843454162

From kvn at openjdk.org Wed Dec 6 18:44:47 2023
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Wed, 6 Dec 2023 18:44:47 GMT
Subject: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v10]
In-Reply-To:
References:
Message-ID:

On Wed, 6 Dec 2023 17:48:04 GMT, Srinivas Vamsi Parasa wrote:

>> The goal is to develop faster sort routines for x86_64 CPUs by taking advantage of AVX2 instructions. This enhancement provides an order of magnitude speedup for Arrays.sort() using int, long, float and double arrays.
>>
>> For serial sort on random data, this PR shows up to ~7.5x improvement for 32-bit datatypes (int, float) on Intel TigerLake machine as shown in the performance data below.
>>
>> For parallel sort on random data, this PR shows up to ~3.4x for 32-bit datatypes (int, float) as shown below.
>>
>> **Note:** This PR also improves the performance of AVX512 sort by up to 35%.
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
>
>   add missing header files

src/hotspot/share/opto/library_call.cpp line 5393:

> 5391: if (!Matcher::supports_simd_sort(bt)) {
> 5392: return false;
> 5393: }

This check should be in `C2Compiler::is_intrinsic_supported()`

src/hotspot/share/opto/library_call.cpp line 5450:

>
> 5448: if (!Matcher::supports_simd_sort(bt)) {
> 5449: return false;
> 5450: }

Same.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16534#discussion_r1417831171
PR Review Comment: https://git.openjdk.org/jdk/pull/16534#discussion_r1417832246

From dlong at openjdk.org Wed Dec 6 20:12:32 2023
From: dlong at openjdk.org (Dean Long)
Date: Wed, 6 Dec 2023 20:12:32 GMT
Subject: RFR: 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug"
In-Reply-To:
References: <16J-lJ2AceGTVcRWBcP15yKcwO-1IA1XsngyOuNjf7k=.0776f081-ae2c-4279-87cf-d909806c2bc4@github.com>
Message-ID: <6Uwf_EDsQf74mAOXAN4lwP1JRBLgVno0o5f_yGYImBc=.f9c2027b-9eb5-441d-a7e9-8fc5fab8a2cc@github.com>

On Wed, 6 Dec 2023 11:28:08 GMT, Andrew Haley wrote:

> One question: is this only about misaligned loads?

It looks to me that it also handles aligned loads with too-large offsets.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/16991#issuecomment-1843613577

From dlong at openjdk.org Wed Dec 6 20:12:34 2023
From: dlong at openjdk.org (Dean Long)
Date: Wed, 6 Dec 2023 20:12:34 GMT
Subject: RFR: 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug"
In-Reply-To: <16J-lJ2AceGTVcRWBcP15yKcwO-1IA1XsngyOuNjf7k=.0776f081-ae2c-4279-87cf-d909806c2bc4@github.com>
References: <16J-lJ2AceGTVcRWBcP15yKcwO-1IA1XsngyOuNjf7k=.0776f081-ae2c-4279-87cf-d909806c2bc4@github.com>
Message-ID: <4KRCqYxn02wMjYDuN3_HbYWxc9BDtYMNd40bNrJ4K8w=.5362dda4-8d30-48fd-8915-909eb15a6023@github.com>

On Wed, 6 Dec 2023 06:24:59 GMT, Fei Gao wrote:

> On LP64 systems, if the heap can be moved into low virtual address space (below 4GB) and the heap size is smaller than the interesting threshold of 4 GB, we can use unscaled decoding pattern for narrow klass decoding.
> It means that a generic field reference can be decoded by:
>
> cast<64> (32-bit compressed reference) + field_offset
>
> When the `field_offset` is an immediate, on the aarch64 platform, the unscaled decoding pattern can match perfectly with a direct addressing mode, i.e., `base_plus_offset`, supported by `LDR/STR` instructions. But for certain data widths, not all immediates can be encoded in the instruction field of `LDR/STR` [[1]](https://github.com/openjdk/jdk/blob/8db7bad992a0f31de9c7e00c2657c18670539102/src/hotspot/cpu/aarch64/assembler_aarch64.inline.hpp#L33). The valid ranges differ as the data width varies.
>
> For example, when we try to load a value of long type at an offset of `1030`, the address expression is `(AddP (DecodeN base) 1030)`. Before the patch, the expression was matched by `operand indOffIN()`. But, for 64-bit `LDR/STR`, a signed immediate byte offset must be in the range -256 to 255, or a positive immediate byte offset must be a multiple of 8 in the range 0 to 32760 [[2]](https://developer.arm.com/documentation/ddi0602/2023-09/Base-Instructions/LDR--immediate---Load-Register--immediate--?lang=en). `1030` can't be encoded in the instruction field. So, after matching, when we check the instruction encoding, the assertion fails.
>
> In this patch, we filter out invalid immediates when deciding whether the current addressing mode can be matched as `base_plus_offset`. We introduce `indOffIN4/indOffLN4` and `indOffIN8/indOffLN8` for 32-bit and 64-bit data types respectively. E.g., for `memory4`, we remove the generic `indOffIN/indOffLN`, which match the wrong unscaled immediate range, and replace them with `indOffIN4/indOffLN4`.
>
> Since 8-bit and 16-bit `LDR/STR` instructions also support the unscaled decoding pattern, we add the addressing mode to the lists of `memory1` and `memory2` by introducing `indOffIN1/indOffLN1` and `indOffIN2/indOffLN2`.
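
[Editorial illustration] The encoding constraint described in the quoted paragraphs can be sketched as a small standalone check. This is only an illustration, not the patch itself: the function name `offset_fits_ldr_str` is hypothetical, and `shift` is assumed to be log2 of the access size in bytes (0, 1, 2 or 3). The two accepted ranges are the ones quoted above (scaled unsigned 12-bit field, or unscaled signed 9-bit field).

```cpp
#include <cstdint>

// Returns true if `offset` can be encoded as an immediate in a
// base-plus-offset LDR/STR for an access of (1 << shift) bytes.
// Hypothetical helper for illustration only.
constexpr bool offset_fits_ldr_str(int64_t offset, unsigned shift) {
    const int64_t size = int64_t{1} << shift;
    // Scaled form: non-negative multiple of the access size whose
    // quotient fits a 12-bit unsigned field (0..4095).
    if (offset >= 0 && (offset % size) == 0 && (offset / size) <= 4095) {
        return true;
    }
    // Unscaled form: any byte offset that fits a 9-bit signed field.
    return offset >= -256 && offset <= 255;
}
```

With `shift == 3` (a 64-bit access), 1030 is rejected while 1032 is accepted, which is consistent with the failing example in the description; with `shift == 1` (a 16-bit access), 1030 is accepted.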
>
> We also remove unused operands `indOffI/indOffl/indOffIN/indOffLN` to avoid misuse.
>
> Tier 1-3 passed on aarch64.

After this change, `immIOffset` and `immLOffset` appear to be obsolete.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/16991#issuecomment-1843614588

From duke at openjdk.org Wed Dec 6 20:36:04 2023
From: duke at openjdk.org (Srinivas Vamsi Parasa)
Date: Wed, 6 Dec 2023 20:36:04 GMT
Subject: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v11]
In-Reply-To:
References:
Message-ID:

> The goal is to develop faster sort routines for x86_64 CPUs by taking advantage of AVX2 instructions. This enhancement provides an order of magnitude speedup for Arrays.sort() using int, long, float and double arrays.
>
> For serial sort on random data, this PR shows up to ~7.5x improvement for 32-bit datatypes (int, float) on Intel TigerLake machine as shown in the performance data below.
>
> For parallel sort on random data, this PR shows up to ~3.4x for 32-bit datatypes (int, float) as shown below.
>
> **Note:** This PR also improves the performance of AVX512 sort by up to 35%.

Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:

  Change supported intrinsic check

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/16534/files
  - new: https://git.openjdk.org/jdk/pull/16534/files/7e124581..9621eb04

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=16534&range=10
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16534&range=09-10

  Stats: 28 lines in 4 files changed: 20 ins; 0 del; 8 mod
  Patch: https://git.openjdk.org/jdk/pull/16534.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/16534/head:pull/16534

PR: https://git.openjdk.org/jdk/pull/16534

From duke at openjdk.org Wed Dec 6 20:36:09 2023
From: duke at openjdk.org (Srinivas Vamsi Parasa)
Date: Wed, 6 Dec 2023 20:36:09 GMT
Subject: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v10]
In-Reply-To:
References:
Message-ID:

On Wed, 6 Dec 2023 18:41:26 GMT, Vladimir Kozlov wrote:

>> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
>>
>> add missing header files
>
> src/hotspot/share/opto/library_call.cpp line 5393:
>
>> 5391: if (!Matcher::supports_simd_sort(bt)) {
>> 5392: return false;
>> 5393: }
>
> This check should be in `C2Compiler::is_intrinsic_supported()`

Hi Vladimir (@vnkozlov), please see the updated changes which use `C2Compiler::is_intrinsic_supported(id, bt)`

> src/hotspot/share/opto/library_call.cpp line 5450:
>
>> 5448: if (!Matcher::supports_simd_sort(bt)) {
>> 5449: return false;
>> 5450: }
>
> Same.

Please see the updated changes which use `C2Compiler::is_intrinsic_supported(id, bt)`

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16534#discussion_r1417946689
PR Review Comment: https://git.openjdk.org/jdk/pull/16534#discussion_r1417946968

From kvn at openjdk.org Wed Dec 6 22:59:41 2023
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Wed, 6 Dec 2023 22:59:41 GMT
Subject: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v10]
In-Reply-To:
References:
Message-ID:

On Wed, 6 Dec 2023 17:44:24 GMT, Srinivas Vamsi Parasa wrote:

>> LGTM, thanks!
>
>> LGTM, thanks!
>
> Thanks Jatin!

@vamsi-parasa, sorry, I was wrong. I missed that you need to check type `bt`. Latest change is more complicated than it was before. Please revert it back (undo last change). I will test previous version 09.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/16534#issuecomment-1843822075

From duke at openjdk.org Wed Dec 6 23:12:13 2023
From: duke at openjdk.org (Srinivas Vamsi Parasa)
Date: Wed, 6 Dec 2023 23:12:13 GMT
Subject: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v12]
In-Reply-To:
References:
Message-ID: <_L--_bl81TgfP1_0bLld6-xzcNqsopfL3HX3bmlbqgE=.923709b1-2e65-4659-a9cd-db4a6ac375c0@github.com>

> The goal is to develop faster sort routines for x86_64 CPUs by taking advantage of AVX2 instructions. This enhancement provides an order of magnitude speedup for Arrays.sort() using int, long, float and double arrays.
>
> For serial sort on random data, this PR shows up to ~7.5x improvement for 32-bit datatypes (int, float) on Intel TigerLake machine as shown in the performance data below.
>
> For parallel sort on random data, this PR shows up to ~3.4x for 32-bit datatypes (int, float) as shown below.
>
> **Note:** This PR also improves the performance of AVX512 sort by up to 35%.

Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:

  Revert "Change supported intrinsic check"

  This reverts commit 9621eb045c2958582f81ec06b237789a07481ddd.

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/16534/files
  - new: https://git.openjdk.org/jdk/pull/16534/files/9621eb04..eadba369

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=16534&range=11
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16534&range=10-11

  Stats: 28 lines in 4 files changed: 0 ins; 20 del; 8 mod
  Patch: https://git.openjdk.org/jdk/pull/16534.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/16534/head:pull/16534

PR: https://git.openjdk.org/jdk/pull/16534

From duke at openjdk.org Wed Dec 6 23:12:14 2023
From: duke at openjdk.org (Srinivas Vamsi Parasa)
Date: Wed, 6 Dec 2023 23:12:14 GMT
Subject: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v12]
In-Reply-To:
References:
Message-ID:

On Wed, 6 Dec 2023 17:44:24 GMT, Srinivas Vamsi Parasa wrote:

>> LGTM, thanks!
>
>> LGTM, thanks!
>
> Thanks Jatin!

> @vamsi-parasa, sorry, I was wrong. I missed that you need to check type `bt`. Latest change is more complicated than it was before. Please revert it back (undo last change). I will test previous version 09.

@vnkozlov
Vladimir, please see the commit reverted in the updated changes pushed now.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/16534#issuecomment-1843834085

From kvn at openjdk.org Wed Dec 6 23:17:42 2023
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Wed, 6 Dec 2023 23:17:42 GMT
Subject: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v12]
In-Reply-To: <_L--_bl81TgfP1_0bLld6-xzcNqsopfL3HX3bmlbqgE=.923709b1-2e65-4659-a9cd-db4a6ac375c0@github.com>
References: <_L--_bl81TgfP1_0bLld6-xzcNqsopfL3HX3bmlbqgE=.923709b1-2e65-4659-a9cd-db4a6ac375c0@github.com>
Message-ID:

On Wed, 6 Dec 2023 23:12:13 GMT, Srinivas Vamsi Parasa wrote:

>> The goal is to develop faster sort routines for x86_64 CPUs by taking advantage of AVX2 instructions.
>> This enhancement provides an order of magnitude speedup for Arrays.sort() using int, long, float and double arrays.
>>
>> For serial sort on random data, this PR shows up to ~7.5x improvement for 32-bit datatypes (int, float) on an Intel TigerLake machine, as shown in the performance data below.
>>
>> For parallel sort on random data, this PR shows up to ~3.4x improvement for 32-bit datatypes (int, float), as shown below.
>>
>> **Note:** This PR also improves the performance of AVX512 sort by up to 35%.
>>
>> Benchmark (Serial Sort) | Size | Baseline (us/op) | AVX2 (us/op) | Speedup
>> -- | -- | -- | -- | --
>> ArraysSort.intSort | 10 | 0.034 | 0.029 | 1.2
>> ArraysSort.intSort | 25 | 0.088 | 0.044 | 2.0
>> ArraysSort.intSort | 50 | 0.239 | 0.159 | 1.5
>> ArraysSort.intSort | 75 | 0.417 | 0.27 | 1.5
>> ArraysSort.intSort | 100 | 0.572 | 0.265 | 2.2
>> ArraysSort.intSort | 1000 | 10.098 | 4.282 | 2.4
>> ArraysSort.intSort | 10000 | 330.065 | 43.383 | 7.6
>> ArraysSort.intSort | 100000 | 4099.527 | 778.943 | 5.3
>> ArraysSort.intSort | 1000000 | 49150.16 | 9634.335 | 5.1
>> ArraysSort.floatSort | 10 | 0.045 | 0.043 | 1.0
>> ArraysSort.floatSort | 25 | 0.105 | 0.073 | 1.4
>> ArraysSort.floatSort | 50 | 0.278 | 0.216 | 1.3
>> ArraysSort.floatSort | 75 | 0.476 | 0.241 | 2.0
>> ArraysSort.floatSort | 100 | 0.583 | 0.313 | 1.9
>> ArraysSort.floatSort | 1000 | 10.182 | 4.329 | 2.4
>> ArraysSort.floatSort | 10000 | 323.136 | 57.175 | 5.7
>> ArraysSort.floatSort | 100000 | 4299.519 | 862.63 | 5.0
>> ArraysSort.floatSort | 1000000 | 50889.4 | 10972.19 | 4.6
>>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
>
>   Revert "Change supported intrinsic check"
>
>   This reverts commit 9621eb045c2958582f81ec06b237789a07481ddd.

Good. I submitted testing.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/16534#issuecomment-1843839669

From kvn at openjdk.org Thu Dec 7 00:34:48 2023
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Thu, 7 Dec 2023 00:34:48 GMT
Subject: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v12]
In-Reply-To:
References:
Message-ID:

On Wed, 6 Dec 2023 23:09:01 GMT, Srinivas Vamsi Parasa wrote:

>>> LGTM, thanks!
>>
>> Thanks Jatin!
>
>> @vamsi-parasa, sorry, I was wrong. I missed that you need to check type `bt`. Latest change is more complicated than it was before. Please revert it back (undo last change). I will test previous version 09.
> @vnkozlov
> Vladimir, please see the commit reverted in the updated changes pushed now.

@vamsi-parasa, please, remind me which tests check that code in `libsmdsort.so` is used?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/16534#issuecomment-1843938518

From duke at openjdk.org Thu Dec 7 01:03:47 2023
From: duke at openjdk.org (Srinivas Vamsi Parasa)
Date: Thu, 7 Dec 2023 01:03:47 GMT
Subject: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v12]
In-Reply-To:
References:
Message-ID:

On Wed, 6 Dec 2023 23:09:01 GMT, Srinivas Vamsi Parasa wrote:

>>> LGTM, thanks!
>>
>> Thanks Jatin!
>
>> @vamsi-parasa, sorry, I was wrong. I missed that you need to check type `bt`. Latest change is more complicated than it was before. Please revert it back (undo last change). I will test previous version 09.
> @vnkozlov
> Vladimir, please see the commit reverted in the updated changes pushed now.
> @vamsi-parasa, please, remind me which tests check that code in `libsmdsort.so` is used?

@vnkozlov
Please see the tests for simd sort code in `test/jdk/java/util/Arrays/Sorting.java`

-------------

PR Comment: https://git.openjdk.org/jdk/pull/16534#issuecomment-1843963054

From fgao at openjdk.org Thu Dec 7 01:41:48 2023
From: fgao at openjdk.org (Fei Gao)
Date: Thu, 7 Dec 2023 01:41:48 GMT
Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v21]
In-Reply-To:
References:
Message-ID:

On Wed, 6 Dec 2023 09:49:14 GMT, Emanuel Peter wrote:

>> src/hotspot/share/opto/superword.cpp line 3916:
>>
>>> 3914:
>>> 3915:   // We chose an aw that is the maximal possible vector width for the type of
>>> 3916:   // align_to_ref.
>>
>> I have a question here: Could we always get benefit from aligning address to maximal possible vector width for small-size types, as vector width becomes large? E.g., for `byte` type on `512-bit` platform, will the pre-loop limit become very large and cost much more to execute the whole loop?
>
> I agree, this is a concern. But this was already like that before my fix:
> `int vw = vector_width_in_bytes(p.mem());`
>
> If we want to relax this, I suggest we do that in a future RFE.
> Some thoughts:
> - I did not want to change this now, since it may affect performance, and this is a bug fix here.
> - We may relax the `aw` to be the maximal vector size used in the vectorization. Or we can make it even smaller, i.e. `MIN2(max_vw, ObjectAlignmentInBytes);`, which would then be analogous to `AlignmentSolution SuperWord::pack_alignment_solution`.
> - We may even want to completely remove pre-loop adjustment if alignment is not a strict requirement and is, on average, a performance cost that outweighs the performance gains of alignment. It is for example questionable to align one of the mem_refs if there are multiple, and hence we cannot guarantee alignment of all anyway.

Makes sense to me.
Thanks for your reply!

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1418194284

From dnsimon at openjdk.org Thu Dec 7 03:00:37 2023
From: dnsimon at openjdk.org (Doug Simon)
Date: Thu, 7 Dec 2023 03:00:37 GMT
Subject: RFR: 8321288: [JVMCI] HotSpotJVMCIRuntime doesn't clean up WeakReferences in resolvedJavaTypes
In-Reply-To: <-_CpkWzu-kr4DjL_t7tespZcMBCFz3QQmr4-KsHqjMg=.aa6d1624-610e-41a4-aad3-a714f2c168a3@github.com>
References: <-_CpkWzu-kr4DjL_t7tespZcMBCFz3QQmr4-KsHqjMg=.aa6d1624-610e-41a4-aad3-a714f2c168a3@github.com>
Message-ID:

On Tue, 5 Dec 2023 19:00:51 GMT, Tom Rodriguez wrote:

> HotSpotJVMCIRuntime.resolvedJavaTypes implements a weak value map but is lacking code to clean out cleared weak references. In normal mixed execution this isn't likely to get big, and generally isolates are shut down frequently, so this doesn't lead to problems. In Xcomp mode with tests that stress unloading this becomes more problematic. In the worst case it still doesn't lead to large heaps but does make the idle heap larger than required.
>
> This PR adds ReferenceQueue based cleaning of reclaimed values. Testing in the context of a long running isolate shows that they are no longer accumulating.

src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotJVMCIRuntime.java line 504:

> 502:
> 503:
> 504:     static class KlassWeakReference extends WeakReference {

Can you please add javadoc to this class explaining why it's needed/useful?

src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotJVMCIRuntime.java line 506:

> 504:     static class KlassWeakReference extends WeakReference {
> 505:
> 506:         private final Long klassPointer;

I assume this is `Long` instead of `long` to avoid boxing in `expungeStaleEntries`?

src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotJVMCIRuntime.java line 703:

> 701:      * Clean up WeakReferences whose referents have been cleared.
> 702:      */
> 703:     private void expungeStaleEntries() {

`expungeStaleEntries` -> `expungeStaleKlassEntries` (or `expungeStaleKlasses`)

src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotJVMCIRuntime.java line 706:

> 704:         KlassWeakReference current = (KlassWeakReference) resolvedJavaTypesQueue.poll();
> 705:         while (current != null) {
> 706:             // Make sure the entry is still mapped to the weak reference

The atomicity of this test-and-update relies on `fromMetaspace` being synchronized, right? Maybe worth pointing this out in the comment.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16981#discussion_r1418244577
PR Review Comment: https://git.openjdk.org/jdk/pull/16981#discussion_r1418250729
PR Review Comment: https://git.openjdk.org/jdk/pull/16981#discussion_r1418247243
PR Review Comment: https://git.openjdk.org/jdk/pull/16981#discussion_r1418248381

From fyang at openjdk.org Thu Dec 7 06:13:33 2023
From: fyang at openjdk.org (Fei Yang)
Date: Thu, 7 Dec 2023 06:13:33 GMT
Subject: RFR: 8321001: RISC-V: C2 SignumVF [v5]
In-Reply-To:
References:
Message-ID:

On Wed, 6 Dec 2023 14:43:49 GMT, Hamlin Li wrote:

>> Hi,
>> Can you review the patch to add intrinsic SignumVF/SignumVD on riscv?
>> Thanks
>>
>> ## Test
>> test/hotspot/jtreg/compiler/intrinsics/
>> test/hotspot/jtreg/compiler/vectorapi/
>> and tests found via:
>> grep -nr test/hotspot/jtreg/ -we Math.signum
>> and tests found via:
>> grep -nr test/jdk/ -we Math.signum
>
> Hamlin Li has updated the pull request incrementally with one additional commit since the last revision:
>
>   Fix vmseq_vv

Marked as reviewed by fyang (Reviewer).
------------- PR Review: https://git.openjdk.org/jdk/pull/16925#pullrequestreview-1769280972 From fgao at openjdk.org Thu Dec 7 06:42:49 2023 From: fgao at openjdk.org (Fei Gao) Date: Thu, 7 Dec 2023 06:42:49 GMT Subject: RFR: 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug" [v2] In-Reply-To: <16J-lJ2AceGTVcRWBcP15yKcwO-1IA1XsngyOuNjf7k=.0776f081-ae2c-4279-87cf-d909806c2bc4@github.com> References: <16J-lJ2AceGTVcRWBcP15yKcwO-1IA1XsngyOuNjf7k=.0776f081-ae2c-4279-87cf-d909806c2bc4@github.com> Message-ID: > On LP64 systems, if the heap can be moved into low virtual address space (below 4GB) and the heap size is smaller than the interesting threshold of 4 GB, we can use unscaled decoding pattern for narrow klass decoding. It means that a generic field reference can be decoded by: > > cast<64> (32-bit compressed reference) + field_offset > > > When the `field_offset` is an immediate, on aarch64 platform, the unscaled decoding pattern can match perfectly with a direct addressing mode, i.e., `base_plus_offset`, supported by `LDR/STR` instructions. But for certain data width, not all immediates can be encoded in the instruction field of `LDR/STR` [[1]](https://github.com/openjdk/jdk/blob/8db7bad992a0f31de9c7e00c2657c18670539102/src/hotspot/cpu/aarch64/assembler_aarch64.inline.hpp#L33). The ranges are different as data widths vary. > > For example, when we try to load a value of long type at offset of `1030`, the address expression is `(AddP (DecodeN base) 1030)`. Before the patch, the expression was matching with `operand indOffIN()`. But, for 64-bit `LDR/STR`, signed immediate byte offset must be in the range -256 to 255 or positive immediate byte offset must be a multiple of 8 in the range 0 to 32760 [[2]](https://developer.arm.com/documentation/ddi0602/2023-09/Base-Instructions/LDR--immediate---Load-Register--immediate--?lang=en). `1030` can't be encoded in the instruction field. 
So, after matching, when we do checking for instruction encoding, the assertion would fail. > > In this patch, we're going to filter out invalid immediates when deciding if current addressing mode can be matched as `base_plus_offset`. We introduce `indOffIN4/indOffLN4` and `indOffIN8/indOffLN8` for 32-bit data type and 64-bit data type separately in the patch. E.g., for `memory4`, we remove the generic `indOffIN/indOffLN`, which matches wrong unscaled immediate range, and replace them with `indOffIN4/indOffLN4` instead. > > Since 8-bit and 16-bit `LDR/STR` instructions also support the unscaled decoding pattern, we add the addressing mode in the lists of `memory1` and `memory2` by introducing `indOffIN1/indOffLN1` and `indOffIN2/indOffLN2`. > > We also remove unused operands `indOffI/indOffl/indOffIN/indOffLN` to avoid misuse. > > Tier 1-3 passed on aarch64. Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: - Remove unused immIOffset/immLOffset - Merge branch 'master' into fg8319690 - 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug" On LP64 systems, if the heap can be moved into low virtual address space (below 4GB) and the heap size is smaller than the interesting threshold of 4 GB, we can use unscaled decoding pattern for narrow klass decoding. It means that a generic field reference can be decoded by: ``` cast<64> (32-bit compressed reference) + field_offset ``` When the `field_offset` is an immediate, on aarch64 platform, the unscaled decoding pattern can match perfectly with a direct addressing mode, i.e., `base_plus_offset`, supported by LDR/STR instructions. But for certain data width, not all immediates can be encoded in the instruction field of LDR/STR[1]. The ranges are different as data widths vary. 
For example, when we try to load a value of long type at offset of `1030`, the address expression is `(AddP (DecodeN base) 1030)`. Before the patch, the expression was matching with `operand indOffIN()`. But, for 64-bit LDR/STR, signed immediate byte offset must be in the range -256 to 255 or positive immediate byte offset must be a multiple of 8 in the range 0 to 32760[2]. `1030` can't be encoded in the instruction field. So, after matching, when we do checking for instruction encoding, the assertion would fail. In this patch, we're going to filter out invalid immediates when deciding if current addressing mode can be matched as `base_plus_offset`. We introduce `indOffIN4/indOffLN4` and `indOffIN8/indOffLN8` for 32-bit data type and 64-bit data type separately in the patch. E.g., for `memory4`, we remove the generic `indOffIN/indOffLN`, which matches wrong unscaled immediate range, and replace them with `indOffIN4/indOffLN4` instead. Since 8-bit and 16-bit LDR/STR instructions also support the unscaled decoding pattern, we add the addressing mode in the lists of `memory1` and `memory2` by introducing `indOffIN1/indOffLN1` and `indOffIN2/indOffLN2`. We also remove unused operands `indOffI/indOffl/indOffIN/indOffLN` to avoid misuse. Tier 1-3 passed on aarch64. 
[1] https://github.com/openjdk/jdk/blob/8db7bad992a0f31de9c7e00c2657c18670539102/src/hotspot/cpu/aarch64/assembler_aarch64.inline.hpp#L33 [2] https://developer.arm.com/documentation/ddi0602/2023-09/Base-Instructions/LDR--immediate---Load-Register--immediate--?lang=en ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16991/files - new: https://git.openjdk.org/jdk/pull/16991/files/1895cf31..a7bfe267 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16991&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16991&range=00-01 Stats: 36013 lines in 403 files changed: 11716 ins; 22816 del; 1481 mod Patch: https://git.openjdk.org/jdk/pull/16991.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16991/head:pull/16991 PR: https://git.openjdk.org/jdk/pull/16991 From fgao at openjdk.org Thu Dec 7 06:47:34 2023 From: fgao at openjdk.org (Fei Gao) Date: Thu, 7 Dec 2023 06:47:34 GMT Subject: RFR: 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug" In-Reply-To: <6Uwf_EDsQf74mAOXAN4lwP1JRBLgVno0o5f_yGYImBc=.f9c2027b-9eb5-441d-a7e9-8fc5fab8a2cc@github.com> References: <16J-lJ2AceGTVcRWBcP15yKcwO-1IA1XsngyOuNjf7k=.0776f081-ae2c-4279-87cf-d909806c2bc4@github.com> <6Uwf_EDsQf74mAOXAN4lwP1JRBLgVno0o5f_yGYImBc=.f9c2027b-9eb5-441d-a7e9-8fc5fab8a2cc@github.com> Message-ID: On Wed, 6 Dec 2023 20:08:57 GMT, Dean Long wrote: > > One question: is this only about misaligned loads? > > It looks to me that it also handles aligned loads with too-large offsets. Yes. E.g., for long type, before the patch, it only accepts aligned offsets ranging from 0 to 4095, after the patch, it now accepts aligned offsets ranging from 0 to 32760. 
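
The encoding rule discussed in this thread can be sketched in a few lines. The following is a minimal Java model of the range check for AArch64 `LDR/STR` immediate offsets, written from the ranges quoted in the PR description (signed unscaled offset in -256..255, or a non-negative multiple of the access size whose scaled value fits in 12 bits). It is not the HotSpot source; the class name `OffsetOkForImmed`, the method name `offsetOkForImmed` and its parameters are illustrative assumptions.

```java
// Hypothetical model of the LDR/STR immediate-offset check; not HotSpot code.
class OffsetOkForImmed {

    /**
     * Returns true if byteOffset could be encoded as an immediate for an
     * access of (1 << shift) bytes: shift 0 = byte, 1 = short, 2 = int, 3 = long.
     */
    static boolean offsetOkForImmed(long byteOffset, int shift) {
        long mask = (1L << shift) - 1;
        if (byteOffset < 0 || (byteOffset & mask) != 0) {
            // Negative or not a multiple of the access size:
            // only the unscaled form applies, a signed 9-bit byte offset.
            return byteOffset >= -256 && byteOffset <= 255;
        }
        // Non-negative multiple of the access size:
        // scaled form, an unsigned 12-bit offset in units of the access size.
        return (byteOffset >> shift) < 4096;
    }

    public static void main(String[] args) {
        // The offset from the PR description: a long load at offset 1030.
        System.out.println(offsetOkForImmed(1030, 3));  // false
        // Upper end of the aligned range for long accesses.
        System.out.println(offsetOkForImmed(32760, 3)); // true
        System.out.println(offsetOkForImmed(32768, 3)); // false
    }
}
```

With `shift = 3` this reproduces the two facts stated above: `1030` is rejected because it is neither a multiple of 8 nor within -256..255, and aligned offsets are accepted up to 32760.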
------------- PR Comment: https://git.openjdk.org/jdk/pull/16991#issuecomment-1844767846 From fgao at openjdk.org Thu Dec 7 06:47:36 2023 From: fgao at openjdk.org (Fei Gao) Date: Thu, 7 Dec 2023 06:47:36 GMT Subject: RFR: 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug" In-Reply-To: <4KRCqYxn02wMjYDuN3_HbYWxc9BDtYMNd40bNrJ4K8w=.5362dda4-8d30-48fd-8915-909eb15a6023@github.com> References: <16J-lJ2AceGTVcRWBcP15yKcwO-1IA1XsngyOuNjf7k=.0776f081-ae2c-4279-87cf-d909806c2bc4@github.com> <4KRCqYxn02wMjYDuN3_HbYWxc9BDtYMNd40bNrJ4K8w=.5362dda4-8d30-48fd-8915-909eb15a6023@github.com> Message-ID: On Wed, 6 Dec 2023 20:09:44 GMT, Dean Long wrote: > After this change, `immIOffset` and `immLOffset` appear to be obsolete. Removed them in the new commit. Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/16991#issuecomment-1844768711 From epeter at openjdk.org Thu Dec 7 06:49:06 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 7 Dec 2023 06:49:06 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v33] In-Reply-To: References: Message-ID: <32EvFEDq_OFlTBBn4tsf4k9ApY0aX2RiueeO4WtqjQ8=.409d358e-5f17-4afd-873d-b0ec13bccb1c@github.com> > I want to push this in JDK23. > After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). > > To calm your nerves: most of the changes are in auto-generated tests, and tests in general. > > **Context** > > `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). 
> > Alignment is split into two tasks: > - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. > - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width. > > **Problem** > > I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). > In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. > Thanks @fg1417 for confirming this! > Hence, we need to fix the alignment correctness checks. > > While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. > > **Problem Details** > > Reproducer: > > > static void test(short[] a, short[] b, short mask) { > for (int i = 0; i < RANGE; i+=8) { > // Problematic for AlignVector > b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 > > b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes > b[i+4] = (short)(a[i+4] & mask); > b[i+5] = (short)(a[i+5] & mask); > b[i+6] = (short)(a[i+6] & mask); > } > } > > > During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. > > This is problematic as shown in this example. We have references at index offset `0, 3, 4, 5, 6`, and by... 
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: made mem_ref const for VPointer ------------- Changes: - all: https://git.openjdk.org/jdk/pull/14785/files - new: https://git.openjdk.org/jdk/pull/14785/files/cd781c46..40d681d6 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=32 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=31-32 Stats: 23 lines in 4 files changed: 2 ins; 0 del; 21 mod Patch: https://git.openjdk.org/jdk/pull/14785.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14785/head:pull/14785 PR: https://git.openjdk.org/jdk/pull/14785 From epeter at openjdk.org Thu Dec 7 07:01:00 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 7 Dec 2023 07:01:00 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v34] In-Reply-To: References: Message-ID: > I want to push this in JDK23. > After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). > > To calm your nerves: most of the changes are in auto-generated tests, and tests in general. > > **Context** > > `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). > > Alignment is split into two tasks: > - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. > - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. 
We would like to align the vectors with their vector width. > > **Problem** > > I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). > In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. > Thanks @fg1417 for confirming this! > Hence, we need to fix the alignment correctness checks. > > While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. > > **Problem Details** > > Reproducer: > > > static void test(short[] a, short[] b, short mask) { > for (int i = 0; i < RANGE; i+=8) { > // Problematic for AlignVector > b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 > > b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes > b[i+4] = (short)(a[i+4] & mask); > b[i+5] = (short)(a[i+5] & mask); > b[i+6] = (short)(a[i+6] & mask); > } > } > > > During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. > > This is problematic as shown in this example. We have references at index offset `0, 3, 4, 5, 6`, and by... 
Emanuel Peter has updated the pull request incrementally with two additional commits since the last revision: - more const - made invar const ------------- Changes: - all: https://git.openjdk.org/jdk/pull/14785/files - new: https://git.openjdk.org/jdk/pull/14785/files/40d681d6..d67b97a0 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=33 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=32-33 Stats: 15 lines in 2 files changed: 2 ins; 0 del; 13 mod Patch: https://git.openjdk.org/jdk/pull/14785.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14785/head:pull/14785 PR: https://git.openjdk.org/jdk/pull/14785 From chagedorn at openjdk.org Thu Dec 7 07:06:49 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 7 Dec 2023 07:06:49 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v21] In-Reply-To: <8_cxOlz5zG5khk5L6fuE0yBb0YH6zHn77qRYB6vbgpE=.86b3fdee-91a3-4311-98b7-8b7d2a012aff@github.com> References: <8_cxOlz5zG5khk5L6fuE0yBb0YH6zHn77qRYB6vbgpE=.86b3fdee-91a3-4311-98b7-8b7d2a012aff@github.com> Message-ID: On Wed, 6 Dec 2023 16:28:21 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/superword.cpp line 1678: >> >>> 1676: tty->print_cr(" + scale(%d) * iv", scale); >>> 1677: } >>> 1678: #endif >> >> Not sure if you are planning to refactor the tracing code anyway but you might want to think about extracting any tracing code in this method to separate methods to not disrupt the readability of the code. You can derive some of the variables again (e.g. from VPointer etc.) to avoid having to pass many arguments to the tracing function. >> >> It might even be cleaner if this entire method would be part of `AlignmentSolution`. Then you do not need to pass around all the information to separate tracing methods. > > @chhagedorn What if I make it a `AlignmentSolver` that then returns an `AlignmentSolution`? That's even better! I like that idea. 
-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1418476498

From chagedorn at openjdk.org Thu Dec 7 07:06:50 2023
From: chagedorn at openjdk.org (Christian Hagedorn)
Date: Thu, 7 Dec 2023 07:06:50 GMT
Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v21]
In-Reply-To:
References:
Message-ID: <2WDup1spcw1G17Gsw1ZoO5aZPtHOVJzIm5xVOiVabeI=.7832aff1-8b5f-4b66-be2f-d02829cb7733@github.com>

On Wed, 6 Dec 2023 16:42:24 GMT, Emanuel Peter wrote:

>> `var_invar` is slightly confusing. Maybe we should flip it to `invar_var` to be consistent with `invar_factor`? Or maybe you find another name for that term. I could not come up with something better for now.
>
> I really don't know anything better than `var_invar`. The idea is that there is both a constant `C` and variable `var` term for both the `init` and `invar` terms. I even explain that below, that there is such a factorization into constant and variable.

I agree with you - I also cannot think of something better. Let's keep it like that.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1418476576

From chagedorn at openjdk.org Thu Dec 7 07:22:47 2023
From: chagedorn at openjdk.org (Christian Hagedorn)
Date: Thu, 7 Dec 2023 07:22:47 GMT
Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v21]
In-Reply-To:
References:
Message-ID: <9e_RX2hsEocA1EDq8o7-buPi7ix3EMTmWoh098gapS8=.f830c50a-4d5f-46e5-9692-b8368690eee7@github.com>

On Wed, 6 Dec 2023 16:04:22 GMT, Emanuel Peter wrote:

>> src/hotspot/share/opto/superword.cpp line 1628:
>>
>>> 1626:   int element_size = mem_ref->memory_size();
>>> 1627:   int vw = pack_size * element_size; // vector_width
>>> 1628:   int aw = MIN2(vw, ObjectAlignmentInBytes); // alignment_width
>>
>> Is there a specific reason to go with `vw` and `aw` instead of directly naming the variable `vector_width` and `alignment_width`?
>
> I replaced `vw` with `vector_width`, there were relatively few occurrences. But if I replace `aw` with `alignment_width` then everything becomes much more "wordy", all lines become much longer. I would like to keep `aw`, I think it is much more readable than the long form. What do you think?

Thanks. Okay, I see your point. I agree that we can go with `aw` in the local scope of `pack_alignment_solution`/`AlignmentSolver` for readability purposes (and you explain what it is in the comments/print statements). However, I think the `AlignmentSolution` class should use the full name `alignment_width` as a general property to store/return. What are your thoughts about that?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1418488407

From rcastanedalo at openjdk.org Thu Dec 7 07:29:39 2023
From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano)
Date: Thu, 7 Dec 2023 07:29:39 GMT
Subject: RFR: 8295166: IGV: dump graph at more locations
In-Reply-To:
References:
Message-ID:

On Tue, 10 Oct 2023 13:31:00 GMT, Daniel Lundén wrote:

> This changeset
> 1. adds a number of new graph dumps for IdealGraphVisualizer (IGV):
>    - Before conditional constant propagation
>    - After register allocation
>    - After block ordering
>    - After peephole optimization
>    - After post-allocation expansion
>    - Before and after
>      - loop predication
>      - loop peeling
>      - pre/main/post loops
>      - loop unrolling
>      - range check elimination
>      - loop unswitching
>      - partial peeling
>      - split if
>      - superword
> 2. adds support for enumeration of repeated IGV graph dumps.
> 3. adjusts IGV print levels to encompass the new graph dumps. The old levels 4 and 5 are now levels 5 and 6. The new level 4 is for loop optimization dumps.
> > Example phase list screenshots in IGV (first at level 6, second at level 4) > ![Screenshot from 2023-12-04 13-55-38](https://github.com/openjdk/jdk/assets/4222397/6759dc5a-9c9a-42b9-8d9e-2d0b53e76ab4) ![Screenshot from 2023-12-04 13-56-29](https://github.com/openjdk/jdk/assets/4222397/44d6a239-587b-4f7c-8ce1-f7613cb2fa35) > > > Some notes: > - While discussing the above changes, a separate question was brought up by @chhagedorn: > > On a separate note, I'm wondering how useful it is to always dump all JFR events when calling print_method(). Should this be revisited again in general? > - The new IGV graph dump enumeration enables a number of cleanups. There is now another RFE for IGV cleanup: [JDK-8319599](https://bugs.openjdk.org/browse/JDK-8319599). > > ### Testing > #### Platforms: windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64 > - tier1, tier2, tier3, tier4, tier5. > - Check that optimized builds (`--with-debug-level optimized`) still work. > > #### Platforms: linux-x64 > - Tested that thousands of graphs are correctly opened and visualized with IGV. test/hotspot/jtreg/compiler/lib/ir_framework/CompilePhase.java line 69: > 67: BEFORE_LOOP_PREDICATION_IC("Before Loop Predication IC"), > 68: BEFORE_LOOP_PREDICATION_RC("Before Loop Predication RC"), > 69: AFTER_LOOP_PREDICATION("After Loop Predication"), Would it be possible to define `AFTER_LOOP_PREDICATION_IC` and `AFTER_LOOP_PREDICATION_RC` separately, to match the corresponding `BEFORE_*` phases? This would make it easier to understand what is the scope of each transformation in the IGV outline and also play better with the enumeration of repeated IGV graph dumps introduced in this changeset. To be more specific, what I would expect to see here: ![igv-pred](https://github.com/openjdk/jdk/assets/8792647/02a11ce7-d41a-43f1-9711-a7a403c678e9) is: 15. Before Loop Predication IC: 90 If 16. After Loop Predication IC: 163 If 17. Before Loop Predication RC: 106 RangeCheck 18. 
After Loop Predication RC: 196 RangeCheck ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1418493734 From epeter at openjdk.org Thu Dec 7 07:30:59 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 7 Dec 2023 07:30:59 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v35] In-Reply-To: References: Message-ID: <98a07OBKoHv417GOI6jPMkt-Bk4Ljho4e6kBjtHo8_M=.7ebc30f3-539c-418e-88fa-7ab33473625b@github.com> > I want to push this in JDK23. > After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). > > To calm your nerves: most of the changes are in auto-generated tests, and tests in general. > > **Context** > > `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). > > Alignment is split into two tasks: > - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. > - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width. > > **Problem** > > I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). > In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. > Thanks @fg1417 for confirming this! > Hence, we need to fix the alignment correctness checks. 
> > While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. > > **Problem Details** > > Reproducer: > > > static void test(short[] a, short[] b, short mask) { > for (int i = 0; i < RANGE; i+=8) { > // Problematic for AlignVector > b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 > > b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes > b[i+4] = (short)(a[i+4] & mask); > b[i+5] = (short)(a[i+5] & mask); > b[i+6] = (short)(a[i+6] & mask); > } > } > > > During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. > > This is problematic as shown in this example. We have references at index offset `0, 3, 4, 5, 6`, and by... 
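A minimal sketch of the alignment argument in the reproducer above. It is an illustration only, not the HotSpot implementation. It assumes that the address of element 0 is already 8-byte aligned and that the required alignment is 8 bytes (the width of a pack of four `short` values). The class `AlignmentSketch` and the method `alignable` are hypothetical names introduced here.

```java
public class AlignmentSketch {
    // Hypothetical helper: returns true if some number k >= 0 of extra
    // pre-loop iterations makes (offsetBytes + k * strideBytes) divisible
    // by alignBytes. The remainders repeat after at most alignBytes steps,
    // so checking k in [0, alignBytes) is sufficient.
    static boolean alignable(int offsetBytes, int strideBytes, int alignBytes) {
        for (int k = 0; k < alignBytes; k++) {
            if (Math.floorMod(offsetBytes + k * strideBytes, alignBytes) == 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int elemBytes = 2;          // sizeof(short)
        int stride = 8 * elemBytes; // the loop does i += 8, so 16 bytes per iteration
        int align = 8;              // assumed alignment requirement in bytes

        // Reference at index offset 0 (the lowest offset of its group).
        System.out.println(alignable(0 * elemBytes, stride, align)); // prints true
        // Pack starting at index offset 3, i.e. 6 bytes.
        System.out.println(alignable(3 * elemBytes, stride, align)); // prints false
    }
}
```

Under these assumptions the reference at offset 0 can be aligned, but the pack at offset 6 bytes never can: every iteration advances the address by 16 bytes, so its remainder modulo 8 stays at 6. Checking only the lowest-offset reference of a group would therefore accept a pack that is misaligned in every iteration.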
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: aw -> alignment_vector in AlignmentSolution ------------- Changes: - all: https://git.openjdk.org/jdk/pull/14785/files - new: https://git.openjdk.org/jdk/pull/14785/files/d67b97a0..d8868d71 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=34 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=33-34 Stats: 16 lines in 1 file changed: 3 ins; 0 del; 13 mod Patch: https://git.openjdk.org/jdk/pull/14785.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14785/head:pull/14785 PR: https://git.openjdk.org/jdk/pull/14785 From epeter at openjdk.org Thu Dec 7 07:31:00 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 7 Dec 2023 07:31:00 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v21] In-Reply-To: <9e_RX2hsEocA1EDq8o7-buPi7ix3EMTmWoh098gapS8=.f830c50a-4d5f-46e5-9692-b8368690eee7@github.com> References: <9e_RX2hsEocA1EDq8o7-buPi7ix3EMTmWoh098gapS8=.f830c50a-4d5f-46e5-9692-b8368690eee7@github.com> Message-ID: On Thu, 7 Dec 2023 07:20:00 GMT, Christian Hagedorn wrote: >> I replaced `vw` with `vector_width`, there were relatively few occurrences. But if I replace `aw` with `alignment_width` then everything becomes much more "wordy", all lines become much longer. I would like to keep `aw`, I think it is much more readable than the long form. What do you think? > > Thanks. Okay, I see your point. I agree that we can go with `aw` in the local scope of `pack_alignment_solution`/`AlignmentSolver` for readability purposes (and you explain what it is in the comments/print statements). However, I think the `AlignmentSolution` class should use the full name `alignment_width` as a general property to store/return. What are your thoughts about that? Ok, that sounds good! I did the update.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1418494908 From chagedorn at openjdk.org Thu Dec 7 07:35:45 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 7 Dec 2023 07:35:45 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v21] In-Reply-To: References: <9e_RX2hsEocA1EDq8o7-buPi7ix3EMTmWoh098gapS8=.f830c50a-4d5f-46e5-9692-b8368690eee7@github.com> Message-ID: On Thu, 7 Dec 2023 07:27:43 GMT, Emanuel Peter wrote: >> Thanks. Okay, I see your point. I agree that we can go with `aw` in the local scope of `pack_alignment_solution`/`AlignmentSolver` for readability purposes (and you explain what it is in the comments/print statements). However, I think the `AlignmentSolution` class should use the full name `alignment_width` as a general property to store/return. What are your thoughts about that? > > Ok, that sounds good! I did the update. Great, thanks! :-) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1418499670 From chagedorn at openjdk.org Thu Dec 7 07:55:40 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 7 Dec 2023 07:55:40 GMT Subject: RFR: 8295166: IGV: dump graph at more locations In-Reply-To: References: Message-ID: <6f3gJeWklXABfc41cNHX6qyZnu04ROG8Q8vfBrvIeMk=.24de43eb-757f-45e7-ba8d-3310e7c694d2@github.com> On Tue, 10 Oct 2023 13:31:00 GMT, Daniel Lundén wrote: > This changeset > 1. adds a number of new graph dumps for IdealGraphVisualizer (IGV): > - Before conditional constant propagation > - After register allocation > - After block ordering > - After peephole optimization > - After post-allocation expansion > - Before and after > - loop predication > - loop peeling > - pre/main/post loops > - loop unrolling > - range check elimination > - loop unswitching > - partial peeling > - split if > - superword > 2. adds support for enumeration of repeated IGV graph dumps. > 3.
adjusts IGV print levels to encompass the new graph dumps. The old levels 4 and 5 are now levels 5 and 6. The new level 4 is for loop optimization dumps. > > Example phase list screenshots in IGV (first at level 6, second at level 4) > ![Screenshot from 2023-12-04 13-55-38](https://github.com/openjdk/jdk/assets/4222397/6759dc5a-9c9a-42b9-8d9e-2d0b53e76ab4) ![Screenshot from 2023-12-04 13-56-29](https://github.com/openjdk/jdk/assets/4222397/44d6a239-587b-4f7c-8ce1-f7613cb2fa35) > > > Some notes: > - While discussing the above changes, a separate question was brought up by @chhagedorn: > > On a separate note, I'm wondering how useful it is to always dump all JFR events when calling print_method(). Should this be revisited again in general? > - The new IGV graph dump enumeration enables a number of cleanups. There is now another RFE for IGV cleanup: [JDK-8319599](https://bugs.openjdk.org/browse/JDK-8319599). > > ### Testing > #### Platforms: windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64 > - tier1, tier2, tier3, tier4, tier5. > - Check that optimized builds (`--with-debug-level optimized`) still work. > > #### Platforms: linux-x64 > - Tested that thousands of graphs are correctly opened and visualized with IGV. src/hotspot/share/opto/split_if.cpp line 595: > 593: void PhaseIdealLoop::do_split_if(Node* iff, RegionNode** new_false_region, RegionNode** new_true_region) { > 594: > 595: C->print_method(PHASE_BEFORE_SPLIT_IF, 4, iff); We call `do_split_if()` for the actual split if here: https://github.com/openjdk/jdk/blob/632a3c56e0626b4c4f79c8cb3d2ae312668d63fc/src/hotspot/share/opto/loopopts.cpp#L1448-L1450 but also to merge identical back to back ifs here: https://github.com/openjdk/jdk/blob/632a3c56e0626b4c4f79c8cb3d2ae312668d63fc/src/hotspot/share/opto/loopopts.cpp#L1529-L1545 I think we should only track the "real" split ifs done in the former case. Should we move this and the `tty` printing to L1448? But could also be done separately. 
On a separate note, I think we do not need the printing twice and can merge these two lines as well when doing this change: if (PrintOpto && VerifyLoopOptimizations) { tty->print_cr("Split-if"); } if (TraceLoopOpts) { tty->print_cr("SplitIf"); } ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1418517568 From rcastanedalo at openjdk.org Thu Dec 7 08:51:38 2023 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Thu, 7 Dec 2023 08:51:38 GMT Subject: RFR: 8295166: IGV: dump graph at more locations In-Reply-To: References: Message-ID: On Tue, 10 Oct 2023 13:31:00 GMT, Daniel Lundén wrote: > This changeset > 1. adds a number of new graph dumps for IdealGraphVisualizer (IGV): > - Before conditional constant propagation > - After register allocation > - After block ordering > - After peephole optimization > - After post-allocation expansion > - Before and after > - loop predication > - loop peeling > - pre/main/post loops > - loop unrolling > - range check elimination > - loop unswitching > - partial peeling > - split if > - superword > 2. adds support for enumeration of repeated IGV graph dumps. > 3. adjusts IGV print levels to encompass the new graph dumps. The old levels 4 and 5 are now levels 5 and 6. The new level 4 is for loop optimization dumps. > > Example phase list screenshots in IGV (first at level 6, second at level 4) > ![Screenshot from 2023-12-04 13-55-38](https://github.com/openjdk/jdk/assets/4222397/6759dc5a-9c9a-42b9-8d9e-2d0b53e76ab4) ![Screenshot from 2023-12-04 13-56-29](https://github.com/openjdk/jdk/assets/4222397/44d6a239-587b-4f7c-8ce1-f7613cb2fa35) > > > Some notes: > - While discussing the above changes, a separate question was brought up by @chhagedorn: > > On a separate note, I'm wondering how useful it is to always dump all JFR events when calling print_method(). Should this be revisited again in general? > - The new IGV graph dump enumeration enables a number of cleanups.
There is now another RFE for IGV cleanup: [JDK-8319599](https://bugs.openjdk.org/browse/JDK-8319599). > > ### Testing > #### Platforms: windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64 > - tier1, tier2, tier3, tier4, tier5. > - Check that optimized builds (`--with-debug-level optimized`) still work. > > #### Platforms: linux-x64 > - Tested that thousands of graphs are correctly opened and visualized with IGV. src/utils/IdealGraphVisualizer/README.md line 31: > 29: * `N=1`: after parsing, before matching, and final code (also for failed > 30: compilations, if available) > 31: * `N=2`: additionally, after every major phase (including loop opts) Suggestion: * `N=2`: additionally, after every major phase ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1418600694 From rcastanedalo at openjdk.org Thu Dec 7 09:08:42 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Thu, 7 Dec 2023 09:08:42 GMT Subject: RFR: 8295166: IGV: dump graph at more locations In-Reply-To: References: Message-ID: On Tue, 10 Oct 2023 13:31:00 GMT, Daniel Lund?n wrote: > This changeset > 1. adds a number of new graph dumps for IdealGraphVisualizer (IGV): > - Before conditional constant propagation > - After register allocation > - After block ordering > - After peephole optimization > - After post-allocation expansion > - Before and after > - loop predication > - loop peeling > - pre/main/post loops > - loop unrolling > - range check elimination > - loop unswitching > - partial peeling > - split if > - superword > 2. adds support for enumeration of repeated IGV graph dumps. > 3. adjusts IGV print levels to encompass the new graph dumps. The old levels 4 and 5 are now levels 5 and 6. The new level 4 is for loop optimization dumps. 
> > Example phase list screenshots in IGV (first at level 6, second at level 4) > ![Screenshot from 2023-12-04 13-55-38](https://github.com/openjdk/jdk/assets/4222397/6759dc5a-9c9a-42b9-8d9e-2d0b53e76ab4) ![Screenshot from 2023-12-04 13-56-29](https://github.com/openjdk/jdk/assets/4222397/44d6a239-587b-4f7c-8ce1-f7613cb2fa35) > > > Some notes: > - While discussing the above changes, a separate question was brought up by @chhagedorn: > > On a separate note, I'm wondering how useful it is to always dump all JFR events when calling print_method(). Should this be revisited again in general? > - The new IGV graph dump enumeration enables a number of cleanups. There is now another RFE for IGV cleanup: [JDK-8319599](https://bugs.openjdk.org/browse/JDK-8319599). > > ### Testing > #### Platforms: windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64 > - tier1, tier2, tier3, tier4, tier5. > - Check that optimized builds (`--with-debug-level optimized`) still work. > > #### Platforms: linux-x64 > - Tested that thousands of graphs are correctly opened and visualized with IGV. test/hotspot/jtreg/compiler/lib/ir_framework/CompilePhase.java line 103: > 101: BEFORE_MATCHING("Before matching"), > 102: MATCHING("After matching", RegexType.MACH), > 103: MACH_ANALYSIS("After mach analysis", RegexType.MACH), Could you move this line to after `POSTALLOC_EXPAND("Post-Allocation Expand", RegexType.MACH)`, for consistency with the order in which phases are defined in `phasetype.hpp`? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1418631532 From rcastanedalo at openjdk.org Thu Dec 7 09:14:43 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Thu, 7 Dec 2023 09:14:43 GMT Subject: RFR: 8295166: IGV: dump graph at more locations In-Reply-To: References: Message-ID: On Tue, 10 Oct 2023 13:31:00 GMT, Daniel Lund?n wrote: > This changeset > 1. 
adds a number of new graph dumps for IdealGraphVisualizer (IGV): > - Before conditional constant propagation > - After register allocation > - After block ordering > - After peephole optimization > - After post-allocation expansion > - Before and after > - loop predication > - loop peeling > - pre/main/post loops > - loop unrolling > - range check elimination > - loop unswitching > - partial peeling > - split if > - superword > 2. adds support for enumeration of repeated IGV graph dumps. > 3. adjusts IGV print levels to encompass the new graph dumps. The old levels 4 and 5 are now levels 5 and 6. The new level 4 is for loop optimization dumps. > > Example phase list screenshots in IGV (first at level 6, second at level 4) > ![Screenshot from 2023-12-04 13-55-38](https://github.com/openjdk/jdk/assets/4222397/6759dc5a-9c9a-42b9-8d9e-2d0b53e76ab4) ![Screenshot from 2023-12-04 13-56-29](https://github.com/openjdk/jdk/assets/4222397/44d6a239-587b-4f7c-8ce1-f7613cb2fa35) > > > Some notes: > - While discussing the above changes, a separate question was brought up by @chhagedorn: > > On a separate note, I'm wondering how useful it is to always dump all JFR events when calling print_method(). Should this be revisited again in general? > - The new IGV graph dump enumeration enables a number of cleanups. There is now another RFE for IGV cleanup: [JDK-8319599](https://bugs.openjdk.org/browse/JDK-8319599). > > ### Testing > #### Platforms: windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64 > - tier1, tier2, tier3, tier4, tier5. > - Check that optimized builds (`--with-debug-level optimized`) still work. > > #### Platforms: linux-x64 > - Tested that thousands of graphs are correctly opened and visualized with IGV. src/hotspot/share/opto/compile.cpp line 2999: > 2997: cfg.remove_unreachable_blocks(); > 2998: cfg.verify_dominator_tree(); > 2999: Nit: no need for extra line here. 
src/hotspot/share/opto/compile.cpp line 3008: > 3006: PhasePeephole peep( _regalloc, cfg); > 3007: peep.do_transform(); > 3008: Nit: no need for extra line here. src/hotspot/share/opto/compile.cpp line 3016: > 3014: TracePhase tp("postalloc_expand", &timers[_t_postalloc_expand]); > 3015: cfg.postalloc_expand(_regalloc); > 3016: Nit: no need for extra line here. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1418638810 PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1418638725 PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1418638621 From duke at openjdk.org Thu Dec 7 09:17:40 2023 From: duke at openjdk.org (Daniel Lundén) Date: Thu, 7 Dec 2023 09:17:40 GMT Subject: RFR: 8295166: IGV: dump graph at more locations In-Reply-To: References: Message-ID: On Thu, 7 Dec 2023 07:26:29 GMT, Roberto Castañeda Lozano wrote: >> This changeset >> 1. adds a number of new graph dumps for IdealGraphVisualizer (IGV): >> - Before conditional constant propagation >> - After register allocation >> - After block ordering >> - After peephole optimization >> - After post-allocation expansion >> - Before and after >> - loop predication >> - loop peeling >> - pre/main/post loops >> - loop unrolling >> - range check elimination >> - loop unswitching >> - partial peeling >> - split if >> - superword >> 2. adds support for enumeration of repeated IGV graph dumps. >> 3. adjusts IGV print levels to encompass the new graph dumps. The old levels 4 and 5 are now levels 5 and 6. The new level 4 is for loop optimization dumps.
>> >> Example phase list screenshots in IGV (first at level 6, second at level 4) >> ![Screenshot from 2023-12-04 13-55-38](https://github.com/openjdk/jdk/assets/4222397/6759dc5a-9c9a-42b9-8d9e-2d0b53e76ab4) ![Screenshot from 2023-12-04 13-56-29](https://github.com/openjdk/jdk/assets/4222397/44d6a239-587b-4f7c-8ce1-f7613cb2fa35) >> >> >> Some notes: >> - While discussing the above changes, a separate question was brought up by @chhagedorn: >> > On a separate note, I'm wondering how useful it is to always dump all JFR events when calling print_method(). Should this be revisited again in general? >> - The new IGV graph dump enumeration enables a number of cleanups. There is now another RFE for IGV cleanup: [JDK-8319599](https://bugs.openjdk.org/browse/JDK-8319599). >> >> ### Testing >> #### Platforms: windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64 >> - tier1, tier2, tier3, tier4, tier5. >> - Check that optimized builds (`--with-debug-level optimized`) still work. >> >> #### Platforms: linux-x64 >> - Tested that thousands of graphs are correctly opened and visualized with IGV. > > test/hotspot/jtreg/compiler/lib/ir_framework/CompilePhase.java line 69: > >> 67: BEFORE_LOOP_PREDICATION_IC("Before Loop Predication IC"), >> 68: BEFORE_LOOP_PREDICATION_RC("Before Loop Predication RC"), >> 69: AFTER_LOOP_PREDICATION("After Loop Predication"), > > Would it be possible to define `AFTER_LOOP_PREDICATION_IC` and `AFTER_LOOP_PREDICATION_RC` separately, to match the corresponding `BEFORE_*` phases? This would make it easier to understand what is the scope of each transformation in the IGV outline and also play better with the enumeration of repeated IGV graph dumps introduced in this changeset. To be more specific, what I would expect to see here: > ![igv-pred](https://github.com/openjdk/jdk/assets/8792647/02a11ce7-d41a-43f1-9711-a7a403c678e9) > is: > > 15. Before Loop Predication IC: 90 If > 16. After Loop Predication IC: 163 If > 17. 
Before Loop Predication RC: 106 RangeCheck > 18. After Loop Predication RC: 196 RangeCheck Sounds good to me, this is how we implemented it originally. I see @chhagedorn reacted with a thumbs up, so I'll go ahead with the change! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1418642724 From duke at openjdk.org Thu Dec 7 10:00:03 2023 From: duke at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Thu, 7 Dec 2023 10:00:03 GMT Subject: RFR: 8295166: IGV: dump graph at more locations [v2] In-Reply-To: References: Message-ID: > This changeset > 1. adds a number of new graph dumps for IdealGraphVisualizer (IGV): > - Before conditional constant propagation > - After register allocation > - After block ordering > - After peephole optimization > - After post-allocation expansion > - Before and after > - loop predication > - loop peeling > - pre/main/post loops > - loop unrolling > - range check elimination > - loop unswitching > - partial peeling > - split if > - superword > 2. adds support for enumeration of repeated IGV graph dumps. > 3. adjusts IGV print levels to encompass the new graph dumps. The old levels 4 and 5 are now levels 5 and 6. The new level 4 is for loop optimization dumps. > > Example phase list screenshots in IGV (first at level 6, second at level 4) > ![Screenshot from 2023-12-04 13-55-38](https://github.com/openjdk/jdk/assets/4222397/6759dc5a-9c9a-42b9-8d9e-2d0b53e76ab4) ![Screenshot from 2023-12-04 13-56-29](https://github.com/openjdk/jdk/assets/4222397/44d6a239-587b-4f7c-8ce1-f7613cb2fa35) > > > Some notes: > - While discussing the above changes, a separate question was brought up by @chhagedorn: > > On a separate note, I'm wondering how useful it is to always dump all JFR events when calling print_method(). Should this be revisited again in general? > - The new IGV graph dump enumeration enables a number of cleanups. 
There is now another RFE for IGV cleanup: [JDK-8319599](https://bugs.openjdk.org/browse/JDK-8319599). > > ### Testing > #### Platforms: windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64 > - tier1, tier2, tier3, tier4, tier5. > - Check that optimized builds (`--with-debug-level optimized`) still work. > > #### Platforms: linux-x64 > - Tested that thousands of graphs are correctly opened and visualized with IGV. Daniel Lund?n has updated the pull request incrementally with one additional commit since the last revision: Address comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16120/files - new: https://git.openjdk.org/jdk/pull/16120/files/07aac1c5..28c81d5a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16120&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16120&range=00-01 Stats: 31 lines in 7 files changed: 13 ins; 15 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/16120.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16120/head:pull/16120 PR: https://git.openjdk.org/jdk/pull/16120 From duke at openjdk.org Thu Dec 7 10:00:06 2023 From: duke at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Thu, 7 Dec 2023 10:00:06 GMT Subject: RFR: 8295166: IGV: dump graph at more locations [v2] In-Reply-To: <6f3gJeWklXABfc41cNHX6qyZnu04ROG8Q8vfBrvIeMk=.24de43eb-757f-45e7-ba8d-3310e7c694d2@github.com> References: <6f3gJeWklXABfc41cNHX6qyZnu04ROG8Q8vfBrvIeMk=.24de43eb-757f-45e7-ba8d-3310e7c694d2@github.com> Message-ID: On Thu, 7 Dec 2023 07:52:40 GMT, Christian Hagedorn wrote: >> Daniel Lund?n has updated the pull request incrementally with one additional commit since the last revision: >> >> Address comments > > src/hotspot/share/opto/split_if.cpp line 595: > >> 593: void PhaseIdealLoop::do_split_if(Node* iff, RegionNode** new_false_region, RegionNode** new_true_region) { >> 594: >> 595: C->print_method(PHASE_BEFORE_SPLIT_IF, 4, iff); > > We call `do_split_if()` for the actual split if here: > 
https://github.com/openjdk/jdk/blob/632a3c56e0626b4c4f79c8cb3d2ae312668d63fc/src/hotspot/share/opto/loopopts.cpp#L1448-L1450 > > but also to merge identical back to back ifs here: > > https://github.com/openjdk/jdk/blob/632a3c56e0626b4c4f79c8cb3d2ae312668d63fc/src/hotspot/share/opto/loopopts.cpp#L1529-L1545 > > I think we should only track the "real" split ifs done in the former case. Should we move this and the `tty` printing to L1448? But could also be done separately. > > On a separate note, I think we do not need the printing twice and can merge these two lines as well when doing this change: > > if (PrintOpto && VerifyLoopOptimizations) { > tty->print_cr("Split-if"); > } > if (TraceLoopOpts) { > tty->print_cr("SplitIf"); > } Sounds good to me; I've made the changes now. @chhagedorn, please check that I have not misunderstood anything. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1418691898 From duke at openjdk.org Thu Dec 7 10:00:08 2023 From: duke at openjdk.org (Daniel Lundén) Date: Thu, 7 Dec 2023 10:00:08 GMT Subject: RFR: 8295166: IGV: dump graph at more locations [v2] In-Reply-To: References: Message-ID: On Thu, 7 Dec 2023 08:48:45 GMT, Roberto Castañeda Lozano wrote: >> Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: >> >> Address comments > > src/utils/IdealGraphVisualizer/README.md line 31: > >> 29: * `N=1`: after parsing, before matching, and final code (also for failed >> 30: compilations, if available) >> 31: * `N=2`: additionally, after every major phase (including loop opts) > > Suggestion: > > * `N=2`: additionally, after every major phase Fixed, thanks. > test/hotspot/jtreg/compiler/lib/ir_framework/CompilePhase.java line 103: > >> 101: BEFORE_MATCHING("Before matching"), >> 102: MATCHING("After matching", RegexType.MACH), >> 103: MACH_ANALYSIS("After mach analysis", RegexType.MACH), > > Could you move this line to after
`POSTALLOC_EXPAND("Post-Allocation Expand", RegexType.MACH)`, for consistency with the order in which phases are defined in `phasetype.hpp`? Yes, fixed now. In general, I suspect there are other inconsistencies between `CompilePhase.java` and `phasetype.hpp`. I'll add it as an item to the cleanup RFE. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1418692221 PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1418693811 From duke at openjdk.org Thu Dec 7 10:06:53 2023 From: duke at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Thu, 7 Dec 2023 10:06:53 GMT Subject: RFR: 8295166: IGV: dump graph at more locations [v3] In-Reply-To: References: Message-ID: > This changeset > 1. adds a number of new graph dumps for IdealGraphVisualizer (IGV): > - Before conditional constant propagation > - After register allocation > - After block ordering > - After peephole optimization > - After post-allocation expansion > - Before and after > - loop predication > - loop peeling > - pre/main/post loops > - loop unrolling > - range check elimination > - loop unswitching > - partial peeling > - split if > - superword > 2. adds support for enumeration of repeated IGV graph dumps. > 3. adjusts IGV print levels to encompass the new graph dumps. The old levels 4 and 5 are now levels 5 and 6. The new level 4 is for loop optimization dumps. > > Example phase list screenshots in IGV (first at level 6, second at level 4) > ![Screenshot from 2023-12-04 13-55-38](https://github.com/openjdk/jdk/assets/4222397/6759dc5a-9c9a-42b9-8d9e-2d0b53e76ab4) ![Screenshot from 2023-12-04 13-56-29](https://github.com/openjdk/jdk/assets/4222397/44d6a239-587b-4f7c-8ce1-f7613cb2fa35) > > > Some notes: > - While discussing the above changes, a separate question was brought up by @chhagedorn: > > On a separate note, I'm wondering how useful it is to always dump all JFR events when calling print_method(). Should this be revisited again in general? 
> - The new IGV graph dump enumeration enables a number of cleanups. There is now another RFE for IGV cleanup: [JDK-8319599](https://bugs.openjdk.org/browse/JDK-8319599). > > ### Testing > #### Platforms: windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64 > - tier1, tier2, tier3, tier4, tier5. > - Check that optimized builds (`--with-debug-level optimized`) still work. > > #### Platforms: linux-x64 > - Tested that thousands of graphs are correctly opened and visualized with IGV. Daniel Lund?n has updated the pull request incrementally with one additional commit since the last revision: Remove leftover AFTER_SPLIT_IF ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16120/files - new: https://git.openjdk.org/jdk/pull/16120/files/28c81d5a..66f901a5 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16120&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16120&range=01-02 Stats: 2 lines in 1 file changed: 0 ins; 2 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/16120.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16120/head:pull/16120 PR: https://git.openjdk.org/jdk/pull/16120 From duke at openjdk.org Thu Dec 7 11:29:09 2023 From: duke at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Thu, 7 Dec 2023 11:29:09 GMT Subject: RFR: 8295166: IGV: dump graph at more locations [v4] In-Reply-To: References: Message-ID: > This changeset > 1. adds a number of new graph dumps for IdealGraphVisualizer (IGV): > - Before conditional constant propagation > - After register allocation > - After block ordering > - After peephole optimization > - After post-allocation expansion > - Before and after > - loop predication > - loop peeling > - pre/main/post loops > - loop unrolling > - range check elimination > - loop unswitching > - partial peeling > - split if > - superword > 2. adds support for enumeration of repeated IGV graph dumps. > 3. adjusts IGV print levels to encompass the new graph dumps. 
The old levels 4 and 5 are now levels 5 and 6. The new level 4 is for loop optimization dumps. > > Example phase list screenshots in IGV (first at level 6, second at level 4) > ![Screenshot from 2023-12-04 13-55-38](https://github.com/openjdk/jdk/assets/4222397/6759dc5a-9c9a-42b9-8d9e-2d0b53e76ab4) ![Screenshot from 2023-12-04 13-56-29](https://github.com/openjdk/jdk/assets/4222397/44d6a239-587b-4f7c-8ce1-f7613cb2fa35) > > > Some notes: > - While discussing the above changes, a separate question was brought up by @chhagedorn: > > On a separate note, I'm wondering how useful it is to always dump all JFR events when calling print_method(). Should this be revisited again in general? > - The new IGV graph dump enumeration enables a number of cleanups. There is now another RFE for IGV cleanup: [JDK-8319599](https://bugs.openjdk.org/browse/JDK-8319599). > > ### Testing > #### Platforms: windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64 > - tier1, tier2, tier3, tier4, tier5. > - Check that optimized builds (`--with-debug-level optimized`) still work. > > #### Platforms: linux-x64 > - Tested that thousands of graphs are correctly opened and visualized with IGV. 
Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: Rename superword phases ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16120/files - new: https://git.openjdk.org/jdk/pull/16120/files/66f901a5..ca180f0e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16120&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16120&range=02-03 Stats: 9 lines in 3 files changed: 0 ins; 0 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/16120.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16120/head:pull/16120 PR: https://git.openjdk.org/jdk/pull/16120 From duke at openjdk.org Thu Dec 7 13:01:58 2023 From: duke at openjdk.org (Daniel Lundén) Date: Thu, 7 Dec 2023 13:01:58 GMT Subject: RFR: 8295166: IGV: dump graph at more locations [v5] In-Reply-To: References: Message-ID: > This changeset > 1. adds a number of new graph dumps for IdealGraphVisualizer (IGV): > - Before conditional constant propagation > - After register allocation > - After block ordering > - After peephole optimization > - After post-allocation expansion > - Before and after > - loop predication > - loop peeling > - pre/main/post loops > - loop unrolling > - range check elimination > - loop unswitching > - partial peeling > - split if > - superword > 2. adds support for enumeration of repeated IGV graph dumps. > 3. adjusts IGV print levels to encompass the new graph dumps. The old levels 4 and 5 are now levels 5 and 6. The new level 4 is for loop optimization dumps.
> > Example phase list screenshots in IGV (first at level 6, second at level 4) > ![Screenshot from 2023-12-04 13-55-38](https://github.com/openjdk/jdk/assets/4222397/6759dc5a-9c9a-42b9-8d9e-2d0b53e76ab4) ![Screenshot from 2023-12-04 13-56-29](https://github.com/openjdk/jdk/assets/4222397/44d6a239-587b-4f7c-8ce1-f7613cb2fa35) > > > Some notes: > - While discussing the above changes, a separate question was brought up by @chhagedorn: > > On a separate note, I'm wondering how useful it is to always dump all JFR events when calling print_method(). Should this be revisited again in general? > - The new IGV graph dump enumeration enables a number of cleanups. There is now another RFE for IGV cleanup: [JDK-8319599](https://bugs.openjdk.org/browse/JDK-8319599). > > ### Testing > #### Platforms: windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64 > - tier1, tier2, tier3, tier4, tier5. > - Check that optimized builds (`--with-debug-level optimized`) still work. > > #### Platforms: linux-x64 > - Tested that thousands of graphs are correctly opened and visualized with IGV. 
Daniel Lund?n has updated the pull request incrementally with one additional commit since the last revision: Remove trailing whitespace ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16120/files - new: https://git.openjdk.org/jdk/pull/16120/files/ca180f0e..c2d001af Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16120&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16120&range=03-04 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/16120.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16120/head:pull/16120 PR: https://git.openjdk.org/jdk/pull/16120 From epeter at openjdk.org Thu Dec 7 13:15:06 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 7 Dec 2023 13:15:06 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v36] In-Reply-To: References: Message-ID: > I want to push this in JDK23. > After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). > > To calm your nerves: most of the changes are in auto-generated tests, and tests in general. > > **Context** > > `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). > > Alignment is split into two tasks: > - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. > - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. 
We would like to align the vectors with their vector width. > > **Problem** > > I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). > In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. > Thanks @fg1417 for confirming this! > Hence, we need to fix the alignment correctness checks. > > While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. > > **Problem Details** > > Reproducer: > > > static void test(short[] a, short[] b, short mask) { > for (int i = 0; i < RANGE; i+=8) { > // Problematic for AlignVector > b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 > > b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes > b[i+4] = (short)(a[i+4] & mask); > b[i+5] = (short)(a[i+5] & mask); > b[i+6] = (short)(a[i+6] & mask); > } > } > > > During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. > > This is problematic as shown in this example. We have references at index offset `0, 3, 4, 5, 6`, and by... 
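Editorial illustration (not part of the quoted message): the comments in the reproducer above say that the access at index offset 0 has "align 0" and that the pack starting at index offset 3 sits at byte offset 6. A minimal Java sketch of that arithmetic follows. The class name `PackAlignmentDemo` and the 8-byte alignment requirement are assumptions chosen for the example; they are not taken from the patch.

```java
// Editorial sketch, not HotSpot code: byte offsets of the reproducer's accesses.
final class PackAlignmentDemo {
    static final int SHORT_BYTES = 2;  // size of a Java short in bytes
    static final int ALIGN_BYTES = 8;  // assumed alignment requirement (hypothetical)

    // Byte offset of a short[] access at the given index offset.
    static int byteOffset(int indexOffset) {
        return indexOffset * SHORT_BYTES;
    }

    // True if an access at this index offset is aligned whenever
    // the access at index offset 0 is aligned.
    static boolean alignedTogetherWithOffsetZero(int indexOffset) {
        return byteOffset(indexOffset) % ALIGN_BYTES == 0;
    }

    public static void main(String[] args) {
        for (int idx : new int[] {0, 3}) {
            System.out.println("index offset " + idx
                + " -> byte offset " + byteOffset(idx)
                + ", aligned together with offset 0: "
                + alignedTogetherWithOffsetZero(idx));
        }
    }
}
```

Under these assumptions the two accesses cannot both be aligned, which is why a decision based only on the lowest-offset reference is not sufficient for the whole group.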
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: wip AlignmentSolver ------------- Changes: - all: https://git.openjdk.org/jdk/pull/14785/files - new: https://git.openjdk.org/jdk/pull/14785/files/d8868d71..4793370a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=35 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=34-35 Stats: 774 lines in 4 files changed: 577 ins; 192 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/14785.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14785/head:pull/14785 PR: https://git.openjdk.org/jdk/pull/14785 From epeter at openjdk.org Thu Dec 7 13:48:05 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 7 Dec 2023 13:48:05 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v37] In-Reply-To: References: Message-ID: > I want to push this in JDK23. > After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). > > To calm your nerves: most of the changes are in auto-generated tests, and tests in general. > > **Context** > > `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). > > Alignment is split into two tasks: > - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. > - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. 
We would like to align the vectors with their vector width. > > **Problem** > > I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). > In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. > Thanks @fg1417 for confirming this! > Hence, we need to fix the alignment correctness checks. > > While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. > > **Problem Details** > > Reproducer: > > > static void test(short[] a, short[] b, short mask) { > for (int i = 0; i < RANGE; i+=8) { > // Problematic for AlignVector > b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 > > b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes > b[i+4] = (short)(a[i+4] & mask); > b[i+5] = (short)(a[i+5] & mask); > b[i+6] = (short)(a[i+6] & mask); > } > } > > > During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. > > This is problematic as shown in this example. We have references at index offset `0, 3, 4, 5, 6`, and by... 
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: v2 WIP AlignmentSolver ------------- Changes: - all: https://git.openjdk.org/jdk/pull/14785/files - new: https://git.openjdk.org/jdk/pull/14785/files/4793370a..224f0f1a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=36 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=35-36 Stats: 552 lines in 3 files changed: 161 ins; 384 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/14785.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14785/head:pull/14785 PR: https://git.openjdk.org/jdk/pull/14785 From qamai at openjdk.org Thu Dec 7 13:49:40 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Thu, 7 Dec 2023 13:49:40 GMT Subject: RFR: 8282365: Consolidate and improve division by constant idealizations [v34] In-Reply-To: References: Message-ID: > This patch implements idealisation for unsigned divisions to change a division by a constant into a series of multiplication and shift. I also change the idealisation of `DivI` to get a more efficient series when the magic constant overflows an int32. > > In general, the idea behind a signed division transformation is that for a positive constant `d`, we would need to find constants `c` and `m` so that: > > floor(x / d) = floor(x * c / 2**m) for 0 < x < 2**(N - 1) (1) > ceil(x / d) = floor(x * c / 2**m) + 1 for -2**(N - 1) <= x < 0 (2) > > The implementation in the original book takes into consideration that the machine may not be able to perform the full multiplication `x * c`, so the constant overflow and we need to add back the dividend as in `DivLNode::Ideal` cases. However, for int32 division, `x * c` cannot overflow an int64. As a result, it is always feasible to just calculate the product and shift the result. 
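Editorial illustration (not part of the quoted message): a minimal Java sketch of conditions (1) and (2) for N = 32 and the divisor d = 7. The class name `SignedDivBy7` and the constants c = ceil(2**34 / 7) and m = 34 are choices made for this example only; the patch computes its own constants, which are not shown in this message.

```java
// Editorial sketch, not the patch's code: signed 32-bit division by 7
// done as a 64-bit multiplication followed by an arithmetic shift.
final class SignedDivBy7 {
    static final long C = 2454267027L;  // ceil(2^34 / 7), assumed for this example
    static final int M = 34;

    static int div7(int x) {
        // |x| <= 2^31 and C < 2^32, so the product always fits in a signed 64-bit long.
        long q = ((long) x * C) >> M;   // floor(x * c / 2^m)
        // For negative x, Java's x / 7 rounds toward zero, i.e. it is ceil(x / 7),
        // which condition (2) gives as floor(x * c / 2^m) + 1.
        return (int) (x < 0 ? q + 1 : q);
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, 6, 7, 100, -1, -7, -8, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int x : samples) {
            System.out.println(x + " / 7 = " + div7(x) + " (expected " + (x / 7) + ")");
        }
    }
}
```

The point of the sketch is the one made in the paragraph above: for int32 operands the full product is representable in an int64, so no add-back correction of the dividend is needed.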
> > For unsigned multiplication, the situation is somewhat trickier because the condition needs to be twice as strong (the condition (1) and (2) above are mostly the same). This results in the magic constant `c` calculated based on the method presented in Hacker's Delight by Henry S. Warren, Jr. may overflow an uintN. For int division, we can depend on the theorem devised by Arch D. Robison in N-Bit Unsigned Division Via N-Bit Multiply-Add, which states that there exists either: > > c1 in uint32 and m1, such that floor(x / d) = floor(x * c1 / 2**m1) for 0 < x < 2**32 (3) > c2 in uint32 and m2, such that floor(x / d) = floor((x + 1) * c2 / 2**m2) for 0 < x < 2**32 (4) > > which means that either `x * c1` never overflows an uint64 or `(x + 1) * c2` never overflows an uint64. And we can perform a full multiplication. > > For longs, there is no way to do a full multiplication so we do some basic transformations to achieve a computable formula. The details I have written as comments in the overflow case. > > More tests are added to cover the possible patterns. > > Please take a look and have some reviews. Thank you very much. Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 76 commits: - isolate javaArithmetic changes - Merge branch 'master' into unsignedDiv - fix proof - Merge branch 'master' into unsignedDiv - fix assert macro, benchmarks - comment styles - disable test with Xcomp - remove verify - fix x86 test - more rigorous control - ... 
and 66 more: https://git.openjdk.org/jdk/compare/c42535f1...501494a1 ------------- Changes: https://git.openjdk.org/jdk/pull/9947/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9947&range=33 Stats: 2274 lines in 13 files changed: 1797 ins; 287 del; 190 mod Patch: https://git.openjdk.org/jdk/pull/9947.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/9947/head:pull/9947 PR: https://git.openjdk.org/jdk/pull/9947 From duke at openjdk.org Thu Dec 7 14:14:30 2023 From: duke at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Thu, 7 Dec 2023 14:14:30 GMT Subject: RFR: 8295166: IGV: dump graph at more locations [v6] In-Reply-To: References: Message-ID: > This changeset > 1. adds a number of new graph dumps for IdealGraphVisualizer (IGV): > - Before conditional constant propagation > - After register allocation > - After block ordering > - After peephole optimization > - After post-allocation expansion > - Before and after > - loop predication > - loop peeling > - pre/main/post loops > - loop unrolling > - range check elimination > - loop unswitching > - partial peeling > - split if > - superword > 2. adds support for enumeration of repeated IGV graph dumps. > 3. adjusts IGV print levels to encompass the new graph dumps. The old levels 4 and 5 are now levels 5 and 6. The new level 4 is for loop optimization dumps. > > Example phase list screenshots in IGV (first at level 6, second at level 4) > ![Screenshot from 2023-12-04 13-55-38](https://github.com/openjdk/jdk/assets/4222397/6759dc5a-9c9a-42b9-8d9e-2d0b53e76ab4) ![Screenshot from 2023-12-04 13-56-29](https://github.com/openjdk/jdk/assets/4222397/44d6a239-587b-4f7c-8ce1-f7613cb2fa35) > > > Some notes: > - While discussing the above changes, a separate question was brought up by @chhagedorn: > > On a separate note, I'm wondering how useful it is to always dump all JFR events when calling print_method(). Should this be revisited again in general? 
> - The new IGV graph dump enumeration enables a number of cleanups. There is now another RFE for IGV cleanup: [JDK-8319599](https://bugs.openjdk.org/browse/JDK-8319599). > > ### Testing > #### Platforms: windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64 > - tier1, tier2, tier3, tier4, tier5. > - Check that optimized builds (`--with-debug-level optimized`) still work. > > #### Platforms: linux-x64 > - Tested that thousands of graphs are correctly opened and visualized with IGV. Daniel Lundén has updated the pull request incrementally with three additional commits since the last revision: - Update test/hotspot/jtreg/compiler/lib/ir_framework/CompilePhase.java Co-authored-by: Christian Hagedorn - Update src/hotspot/share/opto/phasetype.hpp Co-authored-by: Christian Hagedorn - Update src/hotspot/share/opto/loopopts.cpp Co-authored-by: Christian Hagedorn ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16120/files - new: https://git.openjdk.org/jdk/pull/16120/files/c2d001af..6f872e5d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16120&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16120&range=04-05 Stats: 5 lines in 3 files changed: 0 ins; 0 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/16120.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16120/head:pull/16120 PR: https://git.openjdk.org/jdk/pull/16120 From chagedorn at openjdk.org Thu Dec 7 14:14:41 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 7 Dec 2023 14:14:41 GMT Subject: RFR: 8295166: IGV: dump graph at more locations [v5] In-Reply-To: References: Message-ID: On Thu, 7 Dec 2023 13:01:58 GMT, Daniel Lundén wrote: >> This changeset >> 1.
adds a number of new graph dumps for IdealGraphVisualizer (IGV): >> - Before conditional constant propagation >> - After register allocation >> - After block ordering >> - After peephole optimization >> - After post-allocation expansion >> - Before and after >> - loop predication >> - loop peeling >> - pre/main/post loops >> - loop unrolling >> - range check elimination >> - loop unswitching >> - partial peeling >> - split if >> - superword >> 2. adds support for enumeration of repeated IGV graph dumps. >> 3. adjusts IGV print levels to encompass the new graph dumps. The old levels 4 and 5 are now levels 5 and 6. The new level 4 is for loop optimization dumps. >> >> Example phase list screenshots in IGV (first at level 6, second at level 4) >> ![Screenshot from 2023-12-04 13-55-38](https://github.com/openjdk/jdk/assets/4222397/6759dc5a-9c9a-42b9-8d9e-2d0b53e76ab4) ![Screenshot from 2023-12-04 13-56-29](https://github.com/openjdk/jdk/assets/4222397/44d6a239-587b-4f7c-8ce1-f7613cb2fa35) >> >> >> Some notes: >> - While discussing the above changes, a separate question was brought up by @chhagedorn: >> > On a separate note, I'm wondering how useful it is to always dump all JFR events when calling print_method(). Should this be revisited again in general? >> - The new IGV graph dump enumeration enables a number of cleanups. There is now another RFE for IGV cleanup: [JDK-8319599](https://bugs.openjdk.org/browse/JDK-8319599). >> >> ### Testing >> #### Platforms: windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64 >> - tier1, tier2, tier3, tier4, tier5. >> - Check that optimized builds (`--with-debug-level optimized`) still work. >> >> #### Platforms: linux-x64 >> - Tested that thousands of graphs are correctly opened and visualized with IGV. > > Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: > > Remove trailing whitespace Otherwise, it looks good to me!
Thanks for addressing all the comments and suggestions. src/hotspot/share/opto/loopopts.cpp line 1451: > 1449: C->print_method(PHASE_BEFORE_SPLIT_IF, 4, iff); > 1450: if ((PrintOpto && VerifyLoopOptimizations) || TraceLoopOpts) { > 1451: tty->print_cr("Split-if"); Suggestion: tty->print_cr("Split-If"); src/hotspot/share/opto/phasetype.hpp line 55: > 53: flags(AFTER_LOOP_UNROLLING, "After Loop Unrolling") \ > 54: flags(BEFORE_SPLIT_IF, "Before Split If") \ > 55: flags(AFTER_SPLIT_IF, "After Split If") \ I suggest to use a `-`: Suggestion: flags(BEFORE_SPLIT_IF, "Before Split-If") \ flags(AFTER_SPLIT_IF, "After Split-If") \ test/hotspot/jtreg/compiler/lib/ir_framework/CompilePhase.java line 66: > 64: AFTER_LOOP_UNROLLING("After Loop Unrolling"), > 65: BEFORE_SPLIT_IF("Before Split If"), > 66: AFTER_SPLIT_IF("After Split If"), Suggestion: BEFORE_SPLIT_IF("Before Split-If"), AFTER_SPLIT_IF("After Split-If"), ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/16120#pullrequestreview-1770210494 PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1418995778 PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1419002759 PR Review Comment: https://git.openjdk.org/jdk/pull/16120#discussion_r1419003229 From mli at openjdk.org Thu Dec 7 14:20:31 2023 From: mli at openjdk.org (Hamlin Li) Date: Thu, 7 Dec 2023 14:20:31 GMT Subject: RFR: 8321001: RISC-V: C2 SignumVF [v5] In-Reply-To: References: Message-ID: On Thu, 7 Dec 2023 06:11:12 GMT, Fei Yang wrote: >> Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix vmseq_vv > > Marked as reviewed by fyang (Reviewer). @RealFYang @VladimirKempik Thanks for your reviewing. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/16925#issuecomment-1845422330 From roland at openjdk.org Thu Dec 7 14:20:26 2023 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 7 Dec 2023 14:20:26 GMT Subject: RFR: 8320649: C2: Optimize scoped values [v2] In-Reply-To: References: Message-ID: > This change implements C2 optimizations for calls to > ScopedValue.get(). Indeed, in: > > > v1 = scopedValue.get(); > ... > v2 = scopedValue.get(); > > > `v2` can be replaced by `v1` and the second call to `get()` can be > optimized out. That's true whatever is between the 2 calls unless a > new mapping for `scopedValue` is created in between (when that happens > no optimizations is performed for the method being compiled). Hoisting > a `get()` call out of loop for a loop invariant `scopedValue` should > also be legal in most cases. > > `ScopedValue.get()` is implemented in java code as a 2 step process. A > cache is attached to the current thread object. If the `ScopedValue` > object is in the cache then the result from `get()` is read from > there. Otherwise a slow call is performed that also inserts the > mapping in the cache. The cache itself is lazily allocated. One > `ScopedValue` can be hashed to 2 different indexes in the cache. On a > cache probe, both indexes are checked. As a consequence, the process > of probing the cache is a multi step process (check if the cache is > present, check first index, check second index if first index > failed). If the cache is populated early on, then when the method that > calls `ScopedValue.get()` is compiled, profile reports the slow path > as never taken and only the read from the cache is compiled. > > To perform the optimizations, I added 3 new node types to C2: > > - the pair > ScopedValueGetHitsInCacheNode/ScopedValueGetLoadFromCacheNode for > the cache probe > > - a cfg node ScopedValueGetResultNode to help locate the result of the > `get()` call in the IR graph. 
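Editorial illustration (not part of the quoted message): the two-step lookup described above (probe a lazily allocated cache at two indexes, otherwise fall back to a slow call that also fills the cache) can be modeled roughly as follows. This is a toy model only and not the JDK's `ScopedValue` implementation: the class name `ToyScopedCache`, the cache size, the hashing, and the replacement policy are all invented for the example.

```java
import java.util.function.Function;

// Toy model of a lazily allocated cache that is probed at two indexes.
final class ToyScopedCache {
    private static final int SIZE = 16;   // arbitrary power of two (assumption)
    private Object[] keys;                // allocated on first miss
    private Object[] values;
    int slowCalls;                        // number of slow-path lookups, for illustration

    Object get(Object key, Function<Object, Object> slowLookup) {
        int h = System.identityHashCode(key);
        int first = h & (SIZE - 1);
        int second = (h >>> 16) & (SIZE - 1);
        if (keys != null) {                                  // step 1: is the cache present?
            if (keys[first] == key) return values[first];    // step 2: first index
            if (keys[second] == key) return values[second];  // step 3: second index
        }
        // Slow path: compute the value and insert the mapping into the cache.
        slowCalls++;
        Object value = slowLookup.apply(key);
        if (keys == null) {
            keys = new Object[SIZE];
            values = new Object[SIZE];
        }
        keys[first] = key;
        values[first] = value;
        return value;
    }

    public static void main(String[] args) {
        ToyScopedCache cache = new ToyScopedCache();
        Object key = new Object();
        Object v1 = cache.get(key, k -> "bound value");
        Object v2 = cache.get(key, k -> "bound value");
        System.out.println((v1 == v2) + ", slow calls: " + cache.slowCalls);
    }
}
```

In this model a second `get` for the same key never reaches the slow path, which mirrors the situation where the profile reports the slow path as never taken.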
> > In pseudo code, once the nodes are inserted, the code of a `get()` is: > > > hits_in_the_cache = ScopedValueGetHitsInCache(scopedValue) > if (hits_in_the_cache) { > res = ScopedValueGetLoadFromCache(hits_in_the_cache); > } else { > res = ..; //slow call possibly inlined. Subgraph can be arbitray complex > } > res = ScopedValueGetResult(res) > > > In the snippet: > > > v1 = scopedValue.get(); > ... > v2 = scopedValue.get(); > > > Replacing `v2` by `v1` is then done by starting from the > `ScopedValueGetResult` node for the second `get()` and looking for a > dominating `ScopedValueGetResult` for the same `ScopedValue` > object. When one is found, it is used as a replacement. Eliminating > the second `get()` call is achieved by making > `ScopedValueGetHitsInCache` always successful if there's a dominating > `ScopedValueGetResult` and replacing its companion > `ScopedValueGetLoadFromCache` by the dominating > `ScopedValueGetResult`. > > Hoisting a `g... Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: test failures ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16966/files - new: https://git.openjdk.org/jdk/pull/16966/files/bc028fbd..489bb2ab Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16966&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16966&range=00-01 Stats: 85 lines in 6 files changed: 11 ins; 51 del; 23 mod Patch: https://git.openjdk.org/jdk/pull/16966.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16966/head:pull/16966 PR: https://git.openjdk.org/jdk/pull/16966 From qamai at openjdk.org Thu Dec 7 14:24:31 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Thu, 7 Dec 2023 14:24:31 GMT Subject: RFR: 8282365: Consolidate and improve division by constant idealizations [v35] In-Reply-To: References: Message-ID: > This patch implements idealisation for unsigned divisions to change a division by a constant into a series of multiplication and shift. 
I also change the idealisation of `DivI` to get a more efficient series when the magic constant overflows an int32. > > In general, the idea behind a signed division transformation is that for a positive constant `d`, we would need to find constants `c` and `m` so that: > > floor(x / d) = floor(x * c / 2**m) for 0 < x < 2**(N - 1) (1) > ceil(x / d) = floor(x * c / 2**m) + 1 for -2**(N - 1) <= x < 0 (2) > > The implementation in the original book takes into consideration that the machine may not be able to perform the full multiplication `x * c`, so the constant overflow and we need to add back the dividend as in `DivLNode::Ideal` cases. However, for int32 division, `x * c` cannot overflow an int64. As a result, it is always feasible to just calculate the product and shift the result. > > For unsigned multiplication, the situation is somewhat trickier because the condition needs to be twice as strong (the condition (1) and (2) above are mostly the same). This results in the magic constant `c` calculated based on the method presented in Hacker's Delight by Henry S. Warren, Jr. may overflow an uintN. For int division, we can depend on the theorem devised by Arch D. Robison in N-Bit Unsigned Division Via N-Bit Multiply-Add, which states that there exists either: > > c1 in uint32 and m1, such that floor(x / d) = floor(x * c1 / 2**m1) for 0 < x < 2**32 (3) > c2 in uint32 and m2, such that floor(x / d) = floor((x + 1) * c2 / 2**m2) for 0 < x < 2**32 (4) > > which means that either `x * c1` never overflows an uint64 or `(x + 1) * c2` never overflows an uint64. And we can perform a full multiplication. > > For longs, there is no way to do a full multiplication so we do some basic transformations to achieve a computable formula. The details I have written as comments in the overflow case. > > More tests are added to cover the possible patterns. > > Please take a look and have some reviews. Thank you very much. 
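Editorial illustration (not part of the quoted message): a minimal Java sketch of the two unsigned forms (3) and (4) for 32-bit operands. The class name `UnsignedDivSketch` and the constants are choices made for this example (c1 = ceil(2**33 / 3) with m1 = 33 for d = 3, and c2 = floor(2**34 / 7) with m2 = 34 for d = 7); they are not taken from the patch.

```java
// Editorial sketch, not the patch's code: unsigned 32-bit division by a
// constant using a single 64-bit multiplication and a logical shift.
final class UnsignedDivSketch {
    static final long C1 = 0xAAAAAAABL;  // ceil(2^33 / 3), fits in uint32
    static final long C2 = 0x92492492L;  // floor(2^34 / 7), fits in uint32

    // Form (3): floor(x / 3) = floor(x * c1 / 2^33).
    static int divideUnsignedBy3(int x) {
        long ux = Integer.toUnsignedLong(x);
        // ux < 2^32 and C1 < 2^32, so the true product is below 2^64 and the
        // wrapped long holds exactly its bits; >>> treats them as unsigned.
        return (int) ((ux * C1) >>> 33);
    }

    // Form (4): floor(x / 7) = floor((x + 1) * c2 / 2^34).
    static int divideUnsignedBy7(int x) {
        long ux = Integer.toUnsignedLong(x);
        // ux + 1 <= 2^32 and C2 < 2^32, so this product is also below 2^64.
        return (int) (((ux + 1) * C2) >>> 34);
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, 6, 7, 100, Integer.MAX_VALUE, Integer.MIN_VALUE, -1};
        for (int x : samples) {
            System.out.println(Integer.toUnsignedString(x)
                + ": /3 = " + Integer.toUnsignedString(divideUnsignedBy3(x))
                + ", /7 = " + Integer.toUnsignedString(divideUnsignedBy7(x)));
        }
    }
}
```

For d = 7 the usual round-up constant ceil(2**35 / 7) needs 33 bits, which is why this sketch uses the `(x + 1) * c2` form for that divisor.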
Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: missing include ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9947/files - new: https://git.openjdk.org/jdk/pull/9947/files/501494a1..e8b54dad Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9947&range=34 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9947&range=33-34 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/9947.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/9947/head:pull/9947 PR: https://git.openjdk.org/jdk/pull/9947 From roland at openjdk.org Thu Dec 7 14:26:31 2023 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 7 Dec 2023 14:26:31 GMT Subject: RFR: 8320649: C2: Optimize scoped values [v2] In-Reply-To: References: Message-ID: On Wed, 6 Dec 2023 07:52:35 GMT, Tobias Hartmann wrote: > No review yet, I just performed some quick testing. Thanks for doing that. All issues should be fixed now (I couldn't reproduce the last one so not 100% sure about that one). ------------- PR Comment: https://git.openjdk.org/jdk/pull/16966#issuecomment-1845432654 From mli at openjdk.org Thu Dec 7 14:32:13 2023 From: mli at openjdk.org (Hamlin Li) Date: Thu, 7 Dec 2023 14:32:13 GMT Subject: Integrated: 8321001: RISC-V: C2 SignumVF In-Reply-To: References: Message-ID: On Fri, 1 Dec 2023 15:24:35 GMT, Hamlin Li wrote: > Hi, > Can you review the patch to add intrinsic SignumVF/SignumVD on riscv? > Thanks > > ## Test > test/hotspot/jtreg/compiler/intrinsics/ > test/hotspot/jtreg/compiler/vectorapi/ > and tests found via: > grep -nr test/hotspot/jtreg/ -we Math.signum > and test found via: > grep -nr test/jdk/ -we Math.signum This pull request has now been integrated.
Changeset: 2f9e70e4 Author: Hamlin Li URL: https://git.openjdk.org/jdk/commit/2f9e70e4ad94af0b94fd2fbc97356b32f0b73628 Stats: 37 lines in 5 files changed: 35 ins; 0 del; 2 mod 8321001: RISC-V: C2 SignumVF 8321002: RISC-V: C2 SignumVD Reviewed-by: fyang ------------- PR: https://git.openjdk.org/jdk/pull/16925 From epeter at openjdk.org Thu Dec 7 14:41:01 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 7 Dec 2023 14:41:01 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v38] In-Reply-To: References: Message-ID: > I want to push this in JDK23. > After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). > > To calm your nerves: most of the changes are in auto-generated tests, and tests in general. > > **Context** > > `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). > > Alignment is split into two tasks: > - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. > - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width. > > **Problem** > > I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). > In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. > Thanks @fg1417 for confirming this! 
> Hence, we need to fix the alignment correctness checks. > > While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. > > **Problem Details** > > Reproducer: > > > static void test(short[] a, short[] b, short mask) { > for (int i = 0; i < RANGE; i+=8) { > // Problematic for AlignVector > b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 > > b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes > b[i+4] = (short)(a[i+4] & mask); > b[i+5] = (short)(a[i+5] & mask); > b[i+6] = (short)(a[i+6] & mask); > } > } > > > During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. > > This is problematic as shown in this example. We have references at index offset `0, 3, 4, 5, 6`, and by... 
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: finished creating AlignmentSolver ------------- Changes: - all: https://git.openjdk.org/jdk/pull/14785/files - new: https://git.openjdk.org/jdk/pull/14785/files/224f0f1a..cc66ad4a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=37 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=36-37 Stats: 102 lines in 4 files changed: 49 ins; 43 del; 10 mod Patch: https://git.openjdk.org/jdk/pull/14785.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14785/head:pull/14785 PR: https://git.openjdk.org/jdk/pull/14785 From aph at openjdk.org Thu Dec 7 14:55:02 2023 From: aph at openjdk.org (Andrew Haley) Date: Thu, 7 Dec 2023 14:55:02 GMT Subject: RFR: 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug" [v2] In-Reply-To: References: <16J-lJ2AceGTVcRWBcP15yKcwO-1IA1XsngyOuNjf7k=.0776f081-ae2c-4279-87cf-d909806c2bc4@github.com> Message-ID: On Thu, 7 Dec 2023 06:42:49 GMT, Fei Gao wrote: >> On LP64 systems, if the heap can be moved into low virtual address space (below 4GB) and the heap size is smaller than the interesting threshold of 4 GB, we can use unscaled decoding pattern for narrow klass decoding. It means that a generic field reference can be decoded by: >> >> cast<64> (32-bit compressed reference) + field_offset >> >> >> When the `field_offset` is an immediate, on aarch64 platform, the unscaled decoding pattern can match perfectly with a direct addressing mode, i.e., `base_plus_offset`, supported by `LDR/STR` instructions. But for certain data width, not all immediates can be encoded in the instruction field of `LDR/STR` [[1]](https://github.com/openjdk/jdk/blob/8db7bad992a0f31de9c7e00c2657c18670539102/src/hotspot/cpu/aarch64/assembler_aarch64.inline.hpp#L33). The ranges are different as data widths vary. 
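Editorial illustration (not part of the quoted message): a minimal Java model of the encodability rule for `LDR/STR` immediate offsets on AArch64, written from the architectural ranges (an unscaled signed 9-bit offset, or an unsigned 12-bit offset scaled by the access size). The class and method names are hypothetical. The method is only modeled on the idea behind HotSpot's `Address::offset_ok_for_immed` and is not a copy of it.

```java
// Editorial sketch: which byte offsets fit the immediate field of LDR/STR
// for an access of 2^shift bytes (shift 0..3 for 8/16/32/64-bit data).
final class ImmOffsetCheck {
    static boolean offsetOkForImmed(long offset, int shift) {
        long mask = (1L << shift) - 1;
        if (offset < 0 || (offset & mask) != 0) {
            // Negative or not a multiple of the access size:
            // only the unscaled form with a signed 9-bit offset applies.
            return offset >= -256 && offset <= 255;
        }
        // Non-negative multiple of the access size:
        // scaled form with an unsigned 12-bit field.
        return (offset >> shift) <= 4095;
    }

    public static void main(String[] args) {
        // 1030 is not a multiple of 8 and is outside [-256, 255].
        System.out.println("64-bit access, offset 1030: " + offsetOkForImmed(1030, 3));
        System.out.println("64-bit access, offset 1032: " + offsetOkForImmed(1032, 3));
        System.out.println("16-bit access, offset 1030: " + offsetOkForImmed(1030, 1));
    }
}
```

The same offset can therefore be acceptable for one data width and not for another, which is the reason a single width-agnostic operand cannot describe all valid cases.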
>> >> For example, when we try to load a value of long type at offset of `1030`, the address expression is `(AddP (DecodeN base) 1030)`. Before the patch, the expression was matching with `operand indOffIN()`. But, for 64-bit `LDR/STR`, signed immediate byte offset must be in the range -256 to 255 or positive immediate byte offset must be a multiple of 8 in the range 0 to 32760 [[2]](https://developer.arm.com/documentation/ddi0602/2023-09/Base-Instructions/LDR--immediate---Load-Register--immediate--?lang=en). `1030` can't be encoded in the instruction field. So, after matching, when we do checking for instruction encoding, the assertion would fail. >> >> In this patch, we're going to filter out invalid immediates when deciding if current addressing mode can be matched as `base_plus_offset`. We introduce `indOffIN4/indOffLN4` and `indOffIN8/indOffLN8` for 32-bit data type and 64-bit data type separately in the patch. E.g., for `memory4`, we remove the generic `indOffIN/indOffLN`, which matches wrong unscaled immediate range, and replace them with `indOffIN4/indOffLN4` instead. >> >> Since 8-bit and 16-bit `LDR/STR` instructions also support the unscaled decoding pattern, we add the addressing mode in the lists of `memory1` and `memory2` by introducing `indOffIN1/indOffLN1` and `indOffIN2/indOffLN2`. >> >> We also remove unused operands `indOffI/indOffl/indOffIN/indOffLN` to avoid misuse. >> >> Tier 1-3 passed on aarch64. > > Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains three additional commits since the last revision: > > - Remove unused immIOffset/immLOffset > - Merge branch 'master' into fg8319690 > - 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug" > > On LP64 systems, if the heap can be moved into low virtual > address space (below 4GB) and the heap size is smaller than the > interesting threshold of 4 GB, we can use unscaled decoding > pattern for narrow klass decoding. It means that a generic field > reference can be decoded by: > ``` > cast<64> (32-bit compressed reference) + field_offset > ``` > > When the `field_offset` is an immediate, on aarch64 platform, the > unscaled decoding pattern can match perfectly with a direct > addressing mode, i.e., `base_plus_offset`, supported by LDR/STR > instructions. But for certain data width, not all immediates can > be encoded in the instruction field of LDR/STR[1]. The ranges are > different as data widths vary. > > For example, when we try to load a value of long type at offset of > `1030`, the address expression is `(AddP (DecodeN base) 1030)`. > Before the patch, the expression was matching with > `operand indOffIN()`. But, for 64-bit LDR/STR, signed immediate > byte offset must be in the range -256 to 255 or positive immediate > byte offset must be a multiple of 8 in the range 0 to 32760[2]. > `1030` can't be encoded in the instruction field. So, after > matching, when we do checking for instruction encoding, the > assertion would fail. > > In this patch, we're going to filter out invalid immediates > when deciding if current addressing mode can be matched as > `base_plus_offset`. We introduce `indOffIN4/indOffLN4` and > `indOffIN8/indOffLN8` for 32-bit data type and 64-bit data > type separately in the patch. E.g., for `memory4`, we remove > the generic `indOffIN/indOffLN`, which matches wrong unscaled > immediate range, and replace them with `indOffIN4/indOffLN4` > instead. 
> > Since 8-bit and 16-bit LDR/STR instructions also support the > unscaled decoding pattern, we add the addressing mode in the > lists of `memory1` and `memory2` by introducing > `indOffIN1/indOffLN1` and `indOffIN2/indOffLN2`. > > We also remove unused operands `indOffI/indOffl/indOffIN/indOffLN` > to avoid misuse. > > ... I think this patch is excessive for the problem and introduces a lot of code duplication. Maybe it would be simpler, smaller, and faster to check for what we need: diff --git a/src/hotspot/cpu/aarch64/aarch64.ad b/src/hotspot/cpu/aarch64/aarch64.ad index 233f9b6af7c..ea842912ce9 100644 --- a/src/hotspot/cpu/aarch64/aarch64.ad +++ b/src/hotspot/cpu/aarch64/aarch64.ad @@ -5911,7 +5911,8 @@ operand indIndexN(iRegN reg, iRegL lreg) operand indOffIN(iRegN reg, immIOffset off) %{ - predicate(CompressedOops::shift() == 0); + predicate(CompressedOops::shift() == 0 + && Address::offset_ok_for_immed(n->in(3)->find_int_con(min_jint), exact_log2(sizeof(jint)))); constraint(ALLOC_IN_RC(ptr_reg)); match(AddP (DecodeN reg) off); op_cost(0); @@ -5926,7 +5927,8 @@ operand indOffIN(iRegN reg, immIOffset off) operand indOffLN(iRegN reg, immLoffset off) %{ - predicate(CompressedOops::shift() == 0); + predicate(CompressedOops::shift() == 0 + && Address::offset_ok_for_immed(n->in(3)->find_long_con(min_jint), exact_log2(sizeof(jlong)))); constraint(ALLOC_IN_RC(ptr_reg)); match(AddP (DecodeN reg) off); op_cost(0); ------------- PR Comment: https://git.openjdk.org/jdk/pull/16991#issuecomment-1845483889 From chagedorn at openjdk.org Thu Dec 7 14:56:43 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 7 Dec 2023 14:56:43 GMT Subject: RFR: 8295166: IGV: dump graph at more locations [v6] In-Reply-To: References: Message-ID: On Thu, 7 Dec 2023 14:14:30 GMT, Daniel Lundén wrote: >> This changeset >> 1.
adds a number of new graph dumps for IdealGraphVisualizer (IGV): >> - Before conditional constant propagation >> - After register allocation >> - After block ordering >> - After peephole optimization >> - After post-allocation expansion >> - Before and after >> - loop predication >> - loop peeling >> - pre/main/post loops >> - loop unrolling >> - range check elimination >> - loop unswitching >> - partial peeling >> - split if >> - superword >> 2. adds support for enumeration of repeated IGV graph dumps. >> 3. adjusts IGV print levels to encompass the new graph dumps. The old levels 4 and 5 are now levels 5 and 6. The new level 4 is for loop optimization dumps. >> >> Example phase list screenshots in IGV (first at level 6, second at level 4) >> ![Screenshot from 2023-12-04 13-55-38](https://github.com/openjdk/jdk/assets/4222397/6759dc5a-9c9a-42b9-8d9e-2d0b53e76ab4) ![Screenshot from 2023-12-04 13-56-29](https://github.com/openjdk/jdk/assets/4222397/44d6a239-587b-4f7c-8ce1-f7613cb2fa35) >> >> >> Some notes: >> - While discussing the above changes, a separate question was brought up by @chhagedorn: >> > On a separate note, I'm wondering how useful it is to always dump all JFR events when calling print_method(). Should this be revisited again in general? >> - The new IGV graph dump enumeration enables a number of cleanups. There is now another RFE for IGV cleanup: [JDK-8319599](https://bugs.openjdk.org/browse/JDK-8319599). >> >> ### Testing >> #### Platforms: windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64 >> - tier1, tier2, tier3, tier4, tier5. >> - Check that optimized builds (`--with-debug-level optimized`) still work. >> >> #### Platforms: linux-x64 >> - Tested that thousands of graphs are correctly opened and visualized with IGV. 
> > Daniel Lundén has updated the pull request incrementally with three additional commits since the last revision: > > - Update test/hotspot/jtreg/compiler/lib/ir_framework/CompilePhase.java > > Co-authored-by: Christian Hagedorn > - Update src/hotspot/share/opto/phasetype.hpp > > Co-authored-by: Christian Hagedorn > - Update src/hotspot/share/opto/loopopts.cpp > > Co-authored-by: Christian Hagedorn Marked as reviewed by chagedorn (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/16120#pullrequestreview-1770349856 From duke at openjdk.org Thu Dec 7 15:03:54 2023 From: duke at openjdk.org (Daniel Lundén) Date: Thu, 7 Dec 2023 15:03:54 GMT Subject: RFR: 8320682: [AArch64] C1 compilation fails with "Field too big for insn" [v2] In-Reply-To: References: Message-ID: > This changeset fixes an issue on aarch64 where addresses for float and double constants were sometimes out of range for PC-relative offsets using `adr`. > > Changes: > - Fix the issue by replacing `adr` with `lea`. > - Add a regression test. > > Thanks to @fisk and @xmas92 for the assistance.
> > ### Testing > Tests: tier1, tier2, tier3, tier4, tier5 > Platforms: windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64 Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: Revert fix and restrict NMethodSizeLimit instead ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16951/files - new: https://git.openjdk.org/jdk/pull/16951/files/ed27abec..a35bad0e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16951&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16951&range=00-01 Stats: 44 lines in 3 files changed: 0 ins; 35 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/16951.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16951/head:pull/16951 PR: https://git.openjdk.org/jdk/pull/16951 From duke at openjdk.org Thu Dec 7 15:06:24 2023 From: duke at openjdk.org (Daniel Lundén) Date: Thu, 7 Dec 2023 15:06:24 GMT Subject: RFR: 8320682: [AArch64] C1 compilation fails with "Field too big for insn" In-Reply-To: References: Message-ID: <-nGgL43TAhxZBbL1YkHU4LgegeNiZKHnsyrgHRppgKY=.e3d3d85f-ee62-44d1-80d8-768365e9cb74@github.com> On Wed, 6 Dec 2023 14:40:30 GMT, Andrew Haley wrote: >> Given the +/-1MB range for `adr`, would it instead be a good idea to limit `-XX:NMethodSizeLimit` to 1MB? Then we would not need the fix at all and could still use `adr`.
>> >> Something similar to: >> >> diff --git a/src/hotspot/share/c1/c1_globals.hpp b/src/hotspot/share/c1/c1_globals.hpp >> index 1c22cf16cfe..e2057d20e59 100644 >> --- a/src/hotspot/share/c1/c1_globals.hpp >> +++ b/src/hotspot/share/c1/c1_globals.hpp >> @@ -277,7 +277,7 @@ >> \ >> develop(intx, NMethodSizeLimit, (64*K)*wordSize, \ >> "Maximum size of a compiled method.") \ >> - range(0, max_jint) \ >> + range(0, 1*M) \ >> \ >> develop(bool, TraceFPUStack, false, \ >> "Trace emulation of the FPU stack (intel only)") \ > >> Given the +/-1MB range for `adr`, would it instead be a good idea to limit `-XX:NMethodSizeLimit` to 1MB? > > I guess that would be OK. I'm thinking of extreme scenarios where a method's constant pool is large but the stub code at its end is smaller, in which case an `adr` wouldn't quite reach. But I think that's unlikely. @theRealAph @TobiHartmann: I've now reverted the fix and instead set an upper bound for `NMethodSizeLimit`. Preliminary testing shows no issues, but I'm also running more tests before integrating. Requesting a re-review, thanks. ------------- PR Comment: https://git.openjdk.org/jdk/pull/16951#issuecomment-1845509302 From epeter at openjdk.org Thu Dec 7 15:08:00 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 7 Dec 2023 15:08:00 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v39] In-Reply-To: References: Message-ID: > I want to push this in JDK23. > After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). > > To calm your nerves: most of the changes are in auto-generated tests, and tests in general. > > **Context** > > `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. 
However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). > > Alignment is split into two tasks: > - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. > - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width. > > **Problem** > > I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). > In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. > Thanks @fg1417 for confirming this! > Hence, we need to fix the alignment correctness checks. > > While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. > > **Problem Details** > > Reproducer: > > > static void test(short[] a, short[] b, short mask) { > for (int i = 0; i < RANGE; i+=8) { > // Problematic for AlignVector > b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 > > b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes > b[i+4] = (short)(a[i+4] & mask); > b[i+5] = (short)(a[i+5] & mask); > b[i+6] = (short)(a[i+6] & mask); > } > } > > > During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. > > This is problematic as shown in this example. We have references at index offset `0, 3, 4, 5, 6`, and by... 
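A minimal sketch of why the reproducer above is a problem for `-XX:+AlignVector`, under my own simplifying assumptions: both accesses go through the same array, a `short` is 2 bytes, and the required alignment is 8 bytes (the width of a 4-`short` pack). The helper name `can_align_together` is hypothetical and does not exist in HotSpot. If the pre-loop aligns the access at byte offset 0, the pack at byte offset 6 can only be aligned as well when the distance between the two is a multiple of the alignment.

```cpp
#include <cstdint>

// Hypothetical helper (not HotSpot code): two accesses at fixed byte offsets from
// a common address can both be `align`-byte aligned in the same iteration only if
// the distance between them is a multiple of `align`.
inline bool can_align_together(int64_t byte_offset_a, int64_t byte_offset_b, int64_t align) {
  return (byte_offset_a - byte_offset_b) % align == 0;
}
```

For the reproducer, `can_align_together(0, 6, 8)` is false, so aligning the reference with the lowest offset says nothing about the pack that starts at `b[i+3]`.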
Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 92 commits: - Merge branch 'master' into JDK-8311586 - finished creating AlignmentSolver - v2 WIP AlignmentSolver - wip AlignmentSolver - aw -> alignment_vector in AlignmentSolution - more const - made invar const - made mem_ref const for VPointer - For Christian: rename variables in adjust_pre_loop_limit_to_align_main_loop_vectors - a few more suggestions by Christian - ... and 82 more: https://git.openjdk.org/jdk/compare/2f9e70e4...9075309f ------------- Changes: https://git.openjdk.org/jdk/pull/14785/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=38 Stats: 8474 lines in 23 files changed: 7183 ins; 376 del; 915 mod Patch: https://git.openjdk.org/jdk/pull/14785.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14785/head:pull/14785 PR: https://git.openjdk.org/jdk/pull/14785 From epeter at openjdk.org Thu Dec 7 15:15:13 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 7 Dec 2023 15:15:13 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v40] In-Reply-To: References: Message-ID: > I want to push this in JDK23. > After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). > > To calm your nerves: most of the changes are in auto-generated tests, and tests in general. > > **Context** > > `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). > > Alignment is split into two tasks: > - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. 
Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. > - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width. > > **Problem** > > I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). > In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. > Thanks @fg1417 for confirming this! > Hence, we need to fix the alignment correctness checks. > > While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. > > **Problem Details** > > Reproducer: > > > static void test(short[] a, short[] b, short mask) { > for (int i = 0; i < RANGE; i+=8) { > // Problematic for AlignVector > b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 > > b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes > b[i+4] = (short)(a[i+4] & mask); > b[i+5] = (short)(a[i+5] & mask); > b[i+6] = (short)(a[i+6] & mask); > } > } > > > During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. > > This is problematic as shown in this example. We have references at index offset `0, 3, 4, 5, 6`, and by... 
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: remove dead code ------------- Changes: - all: https://git.openjdk.org/jdk/pull/14785/files - new: https://git.openjdk.org/jdk/pull/14785/files/9075309f..09104b6b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=39 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=38-39 Stats: 13 lines in 1 file changed: 0 ins; 13 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/14785.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14785/head:pull/14785 PR: https://git.openjdk.org/jdk/pull/14785 From epeter at openjdk.org Thu Dec 7 15:15:14 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 7 Dec 2023 15:15:14 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v22] In-Reply-To: References: Message-ID: On Wed, 6 Dec 2023 12:36:30 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with two additional commits since the last revision: >> >> - add newline suggested by Faye >> - improve the alignment proof, make it more explicit > > Impressive work! I'm still working my way through the proofs but here are some first comments. @chhagedorn @fg1417 Ok, I now moved basically all code over to the `AlignmentSolver`. I placed it in `vectorization.hpp/cpp`, since I plan to use this facility in the future, possibly outside of SuperWord. I'd be very thankful for re-reviews ;) ------------- PR Comment: https://git.openjdk.org/jdk/pull/14785#issuecomment-1845523575 From duke at openjdk.org Thu Dec 7 15:52:51 2023 From: duke at openjdk.org (Daniel Lundén) Date: Thu, 7 Dec 2023 15:52:51 GMT Subject: RFR: 8295166: IGV: dump graph at more locations [v7] In-Reply-To: References: Message-ID: > This changeset > 1.
adds a number of new graph dumps for IdealGraphVisualizer (IGV): > - Before conditional constant propagation > - After register allocation > - After block ordering > - After peephole optimization > - After post-allocation expansion > - Before and after > - loop predication > - loop peeling > - pre/main/post loops > - loop unrolling > - range check elimination > - loop unswitching > - partial peeling > - split if > - superword > 2. adds support for enumeration of repeated IGV graph dumps. > 3. adjusts IGV print levels to encompass the new graph dumps. The old levels 4 and 5 are now levels 5 and 6. The new level 4 is for loop optimization dumps. > > Example phase list screenshots in IGV (first at level 6, second at level 4) > ![Screenshot from 2023-12-04 13-55-38](https://github.com/openjdk/jdk/assets/4222397/6759dc5a-9c9a-42b9-8d9e-2d0b53e76ab4) ![Screenshot from 2023-12-04 13-56-29](https://github.com/openjdk/jdk/assets/4222397/44d6a239-587b-4f7c-8ce1-f7613cb2fa35) > > > Some notes: > - While discussing the above changes, a separate question was brought up by @chhagedorn: > > On a separate note, I'm wondering how useful it is to always dump all JFR events when calling print_method(). Should this be revisited again in general? > - The new IGV graph dump enumeration enables a number of cleanups. There is now another RFE for IGV cleanup: [JDK-8319599](https://bugs.openjdk.org/browse/JDK-8319599). > > ### Testing > #### Platforms: windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64 > - tier1, tier2, tier3, tier4, tier5. > - Check that optimized builds (`--with-debug-level optimized`) still work. > > #### Platforms: linux-x64 > - Tested that thousands of graphs are correctly opened and visualized with IGV. 
Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: Update copyright ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16120/files - new: https://git.openjdk.org/jdk/pull/16120/files/6f872e5d..61be3424 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16120&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16120&range=05-06 Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/16120.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16120/head:pull/16120 PR: https://git.openjdk.org/jdk/pull/16120 From duke at openjdk.org Thu Dec 7 15:53:47 2023 From: duke at openjdk.org (Daniel Lundén) Date: Thu, 7 Dec 2023 15:53:47 GMT Subject: RFR: 8320682: [AArch64] C1 compilation fails with "Field too big for insn" [v3] In-Reply-To: References: Message-ID: <0pwWmcg5mBei8T9v-z71ogtM0YB2QIKwOM7tK8yTrSo=.1bf57583-9fae-413c-91aa-4f94d5df5bf7@github.com> > This changeset fixes an issue where addresses for float and double constants on aarch64 were sometimes out of range for PC-relative offsets using `adr`. > > Changes: > - Set an upper bound of `1M` for the flag `NMethodSizeLimit`, ensuring that float and double constants are in range for `adr`. > - Revise tests in `TestC1Globals.java` to use the new upper bound of `1M` for `NMethodSizeLimit`. Also, remove no longer applicable tests in `TestC1Globals.java`.
> > ### Testing (in progress) > Platforms: windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64 > - tier1, tier2, tier3, tier4, tier5 > - Targeted and repeated tests for `TestC1Globals.java` in all tiers Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: Update copyright ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16951/files - new: https://git.openjdk.org/jdk/pull/16951/files/a35bad0e..d19040e5 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16951&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16951&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/16951.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16951/head:pull/16951 PR: https://git.openjdk.org/jdk/pull/16951 From qamai at openjdk.org Thu Dec 7 16:42:00 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Thu, 7 Dec 2023 16:42:00 GMT Subject: RFR: 8282365: Consolidate and improve division by constant idealizations [v35] In-Reply-To: References: Message-ID: On Thu, 7 Dec 2023 14:24:31 GMT, Quan Anh Mai wrote: >> This patch implements idealisation for unsigned divisions to change a division by a constant into a series of multiplication and shift. I also change the idealisation of `DivI` to get a more efficient series when the magic constant overflows an int32. >> >> In general, the idea behind a signed division transformation is that for a positive constant `d`, we would need to find constants `c` and `m` so that: >> >> floor(x / d) = floor(x * c / 2**m) for 0 < x < 2**(N - 1) (1) >> ceil(x / d) = floor(x * c / 2**m) + 1 for -2**(N - 1) <= x < 0 (2) >> >> The implementation in the original book takes into consideration that the machine may not be able to perform the full multiplication `x * c`, so the constant overflow and we need to add back the dividend as in `DivLNode::Ideal` cases.
However, for int32 division, `x * c` cannot overflow an int64. As a result, it is always feasible to just calculate the product and shift the result. >> >> For unsigned multiplication, the situation is somewhat trickier because the condition needs to be twice as strong (the condition (1) and (2) above are mostly the same). This results in the magic constant `c` calculated based on the method presented in Hacker's Delight by Henry S. Warren, Jr. may overflow an uintN. For int division, we can depend on the theorem devised by Arch D. Robison in N-Bit Unsigned Division Via N-Bit Multiply-Add, which states that there exists either: >> >> c1 in uint32 and m1, such that floor(x / d) = floor(x * c1 / 2**m1) for 0 < x < 2**32 (3) >> c2 in uint32 and m2, such that floor(x / d) = floor((x + 1) * c2 / 2**m2) for 0 < x < 2**32 (4) >> >> which means that either `x * c1` never overflows an uint64 or `(x + 1) * c2` never overflows an uint64. And we can perform a full multiplication. >> >> For longs, there is no way to do a full multiplication so we do some basic transformations to achieve a computable formula. The details I have written as comments in the overflow case. >> >> More tests are added to cover the possible patterns. >> >> Please take a look and have some reviews. Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > missing include I see, I have reverted most changes in `globalDefinition.hpp` apart from the minor fix to `ABS` since it seems more straightforward from here. 
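To make forms (3) and (4) from the quoted description concrete, here is a minimal sketch for unsigned 32-bit division. The divisors 3 and 7, the constants, and the helper names `udiv3` and `udiv7` are my own illustration and are not taken from the patch. For `d = 3` the round-up constant `ceil(2**33 / 3) = 0xAAAAAAAB` fits in 32 bits, so form (3) applies. For `d = 7` the round-up constant would need 33 bits, while the round-down constant `floor(2**34 / 7) = 0x92492492` fits, so form (4) applies. In both cases the product fits in a `uint64_t`.

```cpp
#include <cstdint>

// Hypothetical helper, form (3): floor(x / 3) == floor(x * c1 / 2**33)
// with c1 = ceil(2**33 / 3) = 0xAAAAAAAB. x * c1 < 2**64, so no overflow.
inline uint32_t udiv3(uint32_t x) {
  return static_cast<uint32_t>((static_cast<uint64_t>(x) * 0xAAAAAAABull) >> 33);
}

// Hypothetical helper, form (4): floor(x / 7) == floor((x + 1) * c2 / 2**34)
// with c2 = floor(2**34 / 7) = 0x92492492. The increment is done in 64 bits so
// that x == UINT32_MAX does not wrap, and (x + 1) * c2 < 2**64.
inline uint32_t udiv7(uint32_t x) {
  return static_cast<uint32_t>(((static_cast<uint64_t>(x) + 1) * 0x92492492ull) >> 34);
}
```

Both helpers are meant to agree with `x / 3` and `x / 7` over the full `uint32_t` range; how the patch actually selects `c` and the shift is in `magic_divide_constants`, which this sketch does not reproduce.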
------------- PR Comment: https://git.openjdk.org/jdk/pull/9947#issuecomment-1845672811 From never at openjdk.org Thu Dec 7 16:49:28 2023 From: never at openjdk.org (Tom Rodriguez) Date: Thu, 7 Dec 2023 16:49:28 GMT Subject: RFR: 8320139: [JVMCI] VmObjectAlloc is not generated by intrinsics methods which allocate objects In-Reply-To: References: Message-ID: On Tue, 5 Dec 2023 18:26:57 GMT, Raphael Mosaner wrote: > This PR exports a pointer to `JvmtiExport::_should_notify_object_alloc` via JVMCI to enable intrinsification of unsafe allocations in accordance to C2. Marked as reviewed by never (Reviewer). src/hotspot/share/jvmci/jvmciCompilerToVM.hpp line 118: > 116: static int data_section_item_alignment; > 117: > 118: static int* _should_notify_object_alloc; I think a small comment wouldn't hurt. Something like: // Pointer to JvmtiExport::_should_notify_object_alloc. Exposed as an int* instead of an address so the // underlying type is part of the JVMCIVMStructs definition. ------------- PR Review: https://git.openjdk.org/jdk/pull/16980#pullrequestreview-1770613118 PR Review Comment: https://git.openjdk.org/jdk/pull/16980#discussion_r1419277487 From qamai at openjdk.org Thu Dec 7 16:58:30 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Thu, 7 Dec 2023 16:58:30 GMT Subject: RFR: 8319451: PhaseIdealLoop::conditional_move is too conservative In-Reply-To: References: Message-ID: On Mon, 13 Nov 2023 19:53:30 GMT, Vladimir Kozlov wrote: >> Hi, >> >> When transforming a Phi into a CMove, the threshold is set to be approximately BlockLayoutMinDiamondPercentage, the reason is given: >> >> // BlockLayoutByFrequency optimization moves infrequent branch >> // from hot path. No point in CMOV'ing in such case >> >> This sets the default value of the threshold to be around 18%, which is too conservative. The reason also does not make a lot of sense since the important property which makes jumping expensive is not code layout. We should remove this. >> >> Please kindly review, thank you very much. > > Looks fine to me. > > Looking on history of this code and I added it to address [JDK-7097546](https://bugs.openjdk.org/browse/JDK-7097546). > > But later it was found not correct for some case and I even had similar fix prototype: [JDK-8034833](https://bugs.openjdk.org/browse/JDK-8034833). There was additional changes proposed there: in `block.hpp` and `.ad` file. > > Please, look on attached in that report test and additional code changes there. May be be we can improve more `cmove`. It could be done separately from this your fix if you want to spend more time on it. @vnkozlov I have investigated a little bit. For these kinds of loops
For these kinds of loops public static int test(int result, int limit, int mask) { // mask = 15 for (int i = 0; i < limit; i++) { if ((i&mask) == 0) result++; // Non frequent } return result; } Since this loop is perfectly predictable, no threshold of `CMove` transformation may offer performance advantages. I don't think this predictable branch is common, though. Regarding the register pressure relating to a `CMove`, the main issue is that our local code motion does not do much (it does some heuristics around calls; the other element is block-wise latency, which is kind of useless in LCM context), I have tried some heuristics but it is easy to find a case where it is insufficient. I think it is probably a good idea to reimplement LCM using a more optimal algorithm. ------------- PR Comment: https://git.openjdk.org/jdk/pull/16524#issuecomment-1845703278 From rcastanedalo at openjdk.org Thu Dec 7 19:10:55 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Thu, 7 Dec 2023 19:10:55 GMT Subject: RFR: 8295166: IGV: dump graph at more locations [v7] In-Reply-To: References: Message-ID: On Thu, 7 Dec 2023 15:52:51 GMT, Daniel Lund?n wrote: >> This changeset >> 1. adds a number of new graph dumps for IdealGraphVisualizer (IGV): >> - Before conditional constant propagation >> - After register allocation >> - After block ordering >> - After peephole optimization >> - After post-allocation expansion >> - Before and after >> - loop predication >> - loop peeling >> - pre/main/post loops >> - loop unrolling >> - range check elimination >> - loop unswitching >> - partial peeling >> - split if >> - superword >> 2. adds support for enumeration of repeated IGV graph dumps. >> 3. adjusts IGV print levels to encompass the new graph dumps. The old levels 4 and 5 are now levels 5 and 6. The new level 4 is for loop optimization dumps. 
>> >> Example phase list screenshots in IGV (first at level 6, second at level 4) >> ![Screenshot from 2023-12-04 13-55-38](https://github.com/openjdk/jdk/assets/4222397/6759dc5a-9c9a-42b9-8d9e-2d0b53e76ab4) ![Screenshot from 2023-12-04 13-56-29](https://github.com/openjdk/jdk/assets/4222397/44d6a239-587b-4f7c-8ce1-f7613cb2fa35) >> >> >> Some notes: >> - While discussing the above changes, a separate question was brought up by @chhagedorn: >> > On a separate note, I'm wondering how useful it is to always dump all JFR events when calling print_method(). Should this be revisited again in general? >> - The new IGV graph dump enumeration enables a number of cleanups. There is now another RFE for IGV cleanup: [JDK-8319599](https://bugs.openjdk.org/browse/JDK-8319599). >> >> ### Testing >> #### Platforms: windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64 >> - tier1, tier2, tier3, tier4, tier5. >> - Check that optimized builds (`--with-debug-level optimized`) still work. >> >> #### Platforms: linux-x64 >> - Tested that thousands of graphs are correctly opened and visualized with IGV. > > Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: > > Update copyright Thanks for addressing all the comments and thanks for this nice contribution! The additional dumps, in combination with the CFG view in IGV, make C2 loop optimizations much more approachable. ------------- Marked as reviewed by rcastanedalo (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/16120#pullrequestreview-1770861075 From kvn at openjdk.org Thu Dec 7 19:37:36 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 7 Dec 2023 19:37:36 GMT Subject: RFR: 8319451: PhaseIdealLoop::conditional_move is too conservative In-Reply-To: References: Message-ID: On Mon, 6 Nov 2023 19:10:42 GMT, Quan Anh Mai wrote: > Hi, > > When transforming a Phi into a CMove, the threshold is set to be approximately BlockLayoutMinDiamondPercentage, the reason is given: > > // BlockLayoutByFrequency optimization moves infrequent branch > // from hot path. No point in CMOV'ing in such case > > This sets the default value of the threshold to be around 18%, which is too conservative. The reason also does not make a lot of sense since the important property which makes jumping expensive is not code layout. We should remove this. > > Please kindly review, thank you very much. Thank you for additional investigation. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/16524#pullrequestreview-1770906167 From duke at openjdk.org Thu Dec 7 22:51:50 2023 From: duke at openjdk.org (Joshua Cao) Date: Thu, 7 Dec 2023 22:51:50 GMT Subject: RFR: 8319850: PrintInlining should print which methods are late inlines [v2] In-Reply-To: References: Message-ID: <42h7t16pyeYV2jszIztjGu0JE2ZZWnnJCiyRd2s2oLg=.fffb35a5-e208-442c-9157-ec5d3fcaa31d@github.com> > I'm not 100% sure if this covers all case of late inlines. > > Passes jtreg tier1 locally on my Linux machine with a fastdebug build. With sample Java programs and -XX:+PrintInlining, I can see > > > @ 15 java.lang.Float::valueOf (9 bytes) late inline (boxing method) Joshua Cao has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains three commits: - 8319850: PrintInlining should report late inlines - Revert "8319850: PrintInlining should report late inlines" This reverts commit c5bfb832ff989261b6b2c98f26017c6491fe3067. - 8319850: PrintInlining should report late inlines ------------- Changes: https://git.openjdk.org/jdk/pull/16595/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=16595&range=01 Stats: 19 lines in 2 files changed: 19 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/16595.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16595/head:pull/16595 PR: https://git.openjdk.org/jdk/pull/16595 From duke at openjdk.org Thu Dec 7 22:51:50 2023 From: duke at openjdk.org (Joshua Cao) Date: Thu, 7 Dec 2023 22:51:50 GMT Subject: RFR: 8319850: PrintInlining should print which methods are late inlines In-Reply-To: References: Message-ID: On Thu, 16 Nov 2023 12:07:10 GMT, Roland Westrelin wrote: > > Yes, `PrintInlining` reports late inlines, but I think it would be nice for it to explicitly state which inlines are late inlines. I want to print `late inline`. > > I get it now and that looks reasonable to me. What about method handle invokes and late inlining of virtual calls. For those 2, the call site is initially found to not be a candidate for inlining and only later the compiler finds that it can inline. Does your change cover those 2 cases? I reworked so that the patch can handle the cases you just mentioned. It does not explicitly mentions if a call is virtual. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/16595#issuecomment-1846225499 From kbarrett at openjdk.org Thu Dec 7 22:54:47 2023 From: kbarrett at openjdk.org (Kim Barrett) Date: Thu, 7 Dec 2023 22:54:47 GMT Subject: RFR: 8282365: Consolidate and improve division by constant idealizations [v35] In-Reply-To: References: Message-ID: On Thu, 7 Dec 2023 14:24:31 GMT, Quan Anh Mai wrote: >> This patch implements idealisation for unsigned divisions to change a division by a constant into a series of multiplication and shift. I also change the idealisation of `DivI` to get a more efficient series when the magic constant overflows an int32. >> >> In general, the idea behind a signed division transformation is that for a positive constant `d`, we would need to find constants `c` and `m` so that: >> >> floor(x / d) = floor(x * c / 2**m) for 0 < x < 2**(N - 1) (1) >> ceil(x / d) = floor(x * c / 2**m) + 1 for -2**(N - 1) <= x < 0 (2) >> >> The implementation in the original book takes into consideration that the machine may not be able to perform the full multiplication `x * c`, so the constant overflow and we need to add back the dividend as in `DivLNode::Ideal` cases. However, for int32 division, `x * c` cannot overflow an int64. As a result, it is always feasible to just calculate the product and shift the result. >> >> For unsigned multiplication, the situation is somewhat trickier because the condition needs to be twice as strong (the condition (1) and (2) above are mostly the same). This results in the magic constant `c` calculated based on the method presented in Hacker's Delight by Henry S. Warren, Jr. may overflow an uintN. For int division, we can depend on the theorem devised by Arch D. 
Robison in N-Bit Unsigned Division Via N-Bit Multiply-Add, which states that there exists either: >> >> c1 in uint32 and m1, such that floor(x / d) = floor(x * c1 / 2**m1) for 0 < x < 2**32 (3) >> c2 in uint32 and m2, such that floor(x / d) = floor((x + 1) * c2 / 2**m2) for 0 < x < 2**32 (4) >> >> which means that either `x * c1` never overflows an uint64 or `(x + 1) * c2` never overflows an uint64. And we can perform a full multiplication. >> >> For longs, there is no way to do a full multiplication so we do some basic transformations to achieve a computable formula. The details I have written as comments in the overflow case. >> >> More tests are added to cover the possible patterns. >> >> Please take a look and have some reviews. Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > missing include Not a full review, just a bit of spot-checking. I'm not planning to review the magic constants algorithm. src/hotspot/share/opto/divconstants.cpp line 27: > 25: #include "precompiled.hpp" > 26: #include > 27: #include We generally put standard library includes at the end. There was a long internal to Oracle discussion about include ordering about a year ago that still hasn't made it into the style guide. src/hotspot/share/opto/divconstants.cpp line 33: > 31: // division by constant into a multiply/shift series. > 32: > 33: // (1) Theory: All these blank lines in what is really a single large block comment seem odd to me. Anyone else bothered? src/hotspot/share/opto/divnode.cpp line 85: > 83: } > 84: > 85: // magic_divide_constants in utilities/javaArithmetic.hpp calculates the constant c, s javaArithmetic.hpp doesn't exist in this version of the PR. 
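As a rough illustration of case (3) in the quoted PR description, the sketch below searches for a constant `c` that fits in a uint32 and a shift `m` such that `floor(x / d) == (x * c) >> m` for every uint32 `x`. The names `Magic`, `find_magic` and `divide` are hypothetical and are not taken from the PR. The acceptance test is the standard sufficient condition `2**m <= c * d <= 2**m + 2**(m - 32)`, which is an assumption about how such constants can be validated, not a description of `magic_divide_constants`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type: a multiplier c and a shift m.
struct Magic {
  uint32_t c;
  unsigned m;
};

// Try to find c (fitting in uint32) and m such that
//   floor(x / d) == (uint64_t(x) * c) >> m   for all uint32 x.
// Returns false when no such 32-bit c exists (for example d == 7),
// which is the situation where the (x + 1) * c2 form is needed instead.
bool find_magic(uint32_t d, Magic& out) {
  if (d == 0) return false;
  for (unsigned m = 32; m < 64; m++) {
    uint64_t pow = uint64_t(1) << m;
    uint64_t c = (pow + d - 1) / d;          // ceil(2**m / d)
    if (c > UINT32_MAX) return false;        // c only grows with m
    uint64_t err = c * d - pow;              // 0 <= err < d, no overflow
    if (err <= (uint64_t(1) << (m - 32))) {  // sufficient condition holds
      out = { uint32_t(c), m };
      return true;
    }
  }
  return false;
}

// Full 64-bit product followed by a shift; cannot overflow for 32-bit inputs.
uint32_t divide(uint32_t x, const Magic& mg) {
  return uint32_t((uint64_t(x) * mg.c) >> mg.m);
}
```

With this sketch, `d == 3` yields `c == 0xAAAAAAAB` with `m == 33`, while `d == 7` is rejected because its constant needs 33 bits.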
src/hotspot/share/opto/divnode.hpp line 40: > 38: template > 39: void magic_divide_constants(T d, T N_neg, T N_pos, juint min_s, T& c, bool& c_ovf, juint& s); > 40: void magic_divide_constants_round_down(juint d, juint& c, juint& s); The definitions of these are in new file divconstants.cpp. So I think these should be in divconstants.hpp. src/hotspot/share/utilities/globalDefinitions.hpp line 1107: > 1105: using U = std::make_unsigned_t; > 1106: return (x >= 0) ? x : U(0) - U(x); > 1107: } I understand what this change to ABS is doing, though it's not obvious. (Dodging overflow UB for -x when x is the minimum value of a signed integral type.) I'm not entirely sure that's a wise move. As written this will trigger `-Wconversion` warnings someday (maybe). static_casting the subtraction result to T will eliminate that concern. However, this is an API change. The previous definition worked for floating point types, while this change does not. (std::make_unsigned requires T be an integral or enum, but not bool, type.) I also don't understand why this change is part of this PR. So I'm inclined to say no to this change without some compelling rationale. test/hotspot/gtest/opto/test_constant_division.cpp line 29: > 27: #include > 28: #include > 29: #include We mostly don't use C++ standard library facilities in HotSpot, even in tests; see the style guide. is permitted. We have GrowableArray instead of . And we have os::random, unless this really needs something better (seems unlikely, other than needing to deal with wider types than int.) Also, stdlib includes at the end again (except I think unittest.hpp is supposed to _really_ be last.) test/hotspot/gtest/opto/test_constant_division.cpp line 31: > 29: #include > 30: > 31: #undef assert We have utilities/vmassert_(uninstall,reinstall).hpp for dealing with stdlib assert. ------------- Changes requested by kbarrett (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/9947#pullrequestreview-1771164625 PR Review Comment: https://git.openjdk.org/jdk/pull/9947#discussion_r1419683294 PR Review Comment: https://git.openjdk.org/jdk/pull/9947#discussion_r1419684321 PR Review Comment: https://git.openjdk.org/jdk/pull/9947#discussion_r1419690273 PR Review Comment: https://git.openjdk.org/jdk/pull/9947#discussion_r1419688078 PR Review Comment: https://git.openjdk.org/jdk/pull/9947#discussion_r1419725886 PR Review Comment: https://git.openjdk.org/jdk/pull/9947#discussion_r1419740698 PR Review Comment: https://git.openjdk.org/jdk/pull/9947#discussion_r1419740532 From duke at openjdk.org Fri Dec 8 00:00:18 2023 From: duke at openjdk.org (Joshua Cao) Date: Fri, 8 Dec 2023 00:00:18 GMT Subject: RFR: 8319850: PrintInlining should print which methods are late inlines [v2] In-Reply-To: <42h7t16pyeYV2jszIztjGu0JE2ZZWnnJCiyRd2s2oLg=.fffb35a5-e208-442c-9157-ec5d3fcaa31d@github.com> References: <42h7t16pyeYV2jszIztjGu0JE2ZZWnnJCiyRd2s2oLg=.fffb35a5-e208-442c-9157-ec5d3fcaa31d@github.com> Message-ID: On Thu, 7 Dec 2023 22:51:50 GMT, Joshua Cao wrote: >> I'm not 100% sure if this covers all cases of late inlines. >> >> Passes jtreg tier1 locally on my Linux machine with a fastdebug build. With sample Java programs and -XX:+PrintInlining, I can see >> >> >> @ 15 java.lang.Float::valueOf (9 bytes) late inline (boxing method) > > Joshua Cao has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains three commits: > > - 8319850: PrintInlining should report late inlines > - Revert "8319850: PrintInlining should report late inlines" > > This reverts commit c5bfb832ff989261b6b2c98f26017c6491fe3067. > - 8319850: PrintInlining should report late inlines Rebased because of a merge conflict against master... I guess I should not do that. I will just merge next time.
------------- PR Comment: https://git.openjdk.org/jdk/pull/16595#issuecomment-1846285272 From dnsimon at openjdk.org Fri Dec 8 00:17:13 2023 From: dnsimon at openjdk.org (Doug Simon) Date: Fri, 8 Dec 2023 00:17:13 GMT Subject: RFR: 8320139: [JVMCI] VmObjectAlloc is not generated by intrinsics methods which allocate objects In-Reply-To: References: Message-ID: On Tue, 5 Dec 2023 18:26:57 GMT, Raphael Mosaner wrote: > This PR exports a pointer to `JvmtiExport::_should_notify_object_alloc` via JVMCI to enable intrinsification of unsafe allocations in accordance to C2. Marked as reviewed by dnsimon (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/16980#pullrequestreview-1771325250 From kvn at openjdk.org Fri Dec 8 00:36:21 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 8 Dec 2023 00:36:21 GMT Subject: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v12] In-Reply-To: <_L--_bl81TgfP1_0bLld6-xzcNqsopfL3HX3bmlbqgE=.923709b1-2e65-4659-a9cd-db4a6ac375c0@github.com> References: <_L--_bl81TgfP1_0bLld6-xzcNqsopfL3HX3bmlbqgE=.923709b1-2e65-4659-a9cd-db4a6ac375c0@github.com> Message-ID: On Wed, 6 Dec 2023 23:12:13 GMT, Srinivas Vamsi Parasa wrote: >> The goal is to develop faster sort routines for x86_64 CPUs by taking advantage of AVX2 instructions. This enhancement provides an order of magnitude speedup for Arrays.sort() using int, long, float and double arrays. >> >> For serial sort on random data, this PR shows upto ~7.5x improvement for 32-bit datatypes (int, float) on Intel TigerLake machine as shown in the performance data below. >> >> For parallel sort on random data, this PR shows upto ~3.4x for 32-bit datatypes (int, float) as shown below. >> >> **Note:** This PR also improves the performance of AVX512 sort by upto 35%. 
>>
>> Benchmark (Serial Sort) | Size | Baseline (us/op) | AVX2 (us/op) | Speedup
>> -- | -- | -- | -- | --
>> ArraysSort.intSort | 10 | 0.034 | 0.029 | 1.2
>> ArraysSort.intSort | 25 | 0.088 | 0.044 | 2.0
>> ArraysSort.intSort | 50 | 0.239 | 0.159 | 1.5
>> ArraysSort.intSort | 75 | 0.417 | 0.27 | 1.5
>> ArraysSort.intSort | 100 | 0.572 | 0.265 | 2.2
>> ArraysSort.intSort | 1000 | 10.098 | 4.282 | 2.4
>> ArraysSort.intSort | 10000 | 330.065 | 43.383 | 7.6
>> ArraysSort.intSort | 100000 | 4099.527 | 778.943 | 5.3
>> ArraysSort.intSort | 1000000 | 49150.16 | 9634.335 | 5.1
>> ArraysSort.floatSort | 10 | 0.045 | 0.043 | 1.0
>> ArraysSort.floatSort | 25 | 0.105 | 0.073 | 1.4
>> ArraysSort.floatSort | 50 | 0.278 | 0.216 | 1.3
>> ArraysSort.floatSort | 75 | 0.476 | 0.241 | 2.0
>> ArraysSort.floatSort | 100 | 0.583 | 0.313 | 1.9
>> ArraysSort.floatSort | 1000 | 10.182 | 4.329 | 2.4
>> ArraysSort.floatSort | 10000 | 323.136 | 57.175 | 5.7
>> ArraysSort.floatSort | 100000 | 4299.519 | 862.63 | 5.0
>> ArraysSort.floatSort | 1000000 | 50889.4 | 10972.19 | 4.6
>>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > Revert "Change supported intrinsic check" > > This reverts commit 9621eb045c2958582f81ec06b237789a07481ddd. Testing has only one failure in closed tests and I need to fix it before this can be pushed.
------------- PR Comment: https://git.openjdk.org/jdk/pull/16534#issuecomment-1846315767 From duke at openjdk.org Fri Dec 8 00:36:22 2023 From: duke at openjdk.org (Srinivas Vamsi Parasa) Date: Fri, 8 Dec 2023 00:36:22 GMT Subject: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v12] In-Reply-To: References: <_L--_bl81TgfP1_0bLld6-xzcNqsopfL3HX3bmlbqgE=.923709b1-2e65-4659-a9cd-db4a6ac375c0@github.com> Message-ID: On Fri, 8 Dec 2023 00:31:26 GMT, Vladimir Kozlov wrote: > Testing has only one failure in closed tests and I need to fix it before this can be pushed. Thanks Vladimir for the update. Is the test failure because of this PR? ------------- PR Comment: https://git.openjdk.org/jdk/pull/16534#issuecomment-1846317507 From kvn at openjdk.org Fri Dec 8 00:47:23 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 8 Dec 2023 00:47:23 GMT Subject: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v12] In-Reply-To: References: <_L--_bl81TgfP1_0bLld6-xzcNqsopfL3HX3bmlbqgE=.923709b1-2e65-4659-a9cd-db4a6ac375c0@github.com> Message-ID: On Fri, 8 Dec 2023 00:33:49 GMT, Srinivas Vamsi Parasa wrote: > > Testing has only one failure in closed tests and I need to fix it before this can be pushed. > > Thanks Vladimir for the update. Is the test failure because of this PR? Yes. One of our tests, which checks the integrity of the built JDK, is confused by changes in libsimdsort.so.
------------- PR Comment: https://git.openjdk.org/jdk/pull/16534#issuecomment-1846326542 From dlong at openjdk.org Fri Dec 8 02:15:14 2023 From: dlong at openjdk.org (Dean Long) Date: Fri, 8 Dec 2023 02:15:14 GMT Subject: RFR: 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug" [v2] In-Reply-To: References: <16J-lJ2AceGTVcRWBcP15yKcwO-1IA1XsngyOuNjf7k=.0776f081-ae2c-4279-87cf-d909806c2bc4@github.com> Message-ID: On Thu, 7 Dec 2023 14:52:31 GMT, Andrew Haley wrote: >> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: >> >> - Remove unused immIOffset/immLOffset >> - Merge branch 'master' into fg8319690 >> - 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug" >> >> On LP64 systems, if the heap can be moved into low virtual >> address space (below 4GB) and the heap size is smaller than the >> interesting threshold of 4 GB, we can use unscaled decoding >> pattern for narrow klass decoding. It means that a generic field >> reference can be decoded by: >> ``` >> cast<64> (32-bit compressed reference) + field_offset >> ``` >> >> When the `field_offset` is an immediate, on aarch64 platform, the >> unscaled decoding pattern can match perfectly with a direct >> addressing mode, i.e., `base_plus_offset`, supported by LDR/STR >> instructions. But for certain data width, not all immediates can >> be encoded in the instruction field of LDR/STR[1]. The ranges are >> different as data widths vary. >> >> For example, when we try to load a value of long type at offset of >> `1030`, the address expression is `(AddP (DecodeN base) 1030)`. >> Before the patch, the expression was matching with >> `operand indOffIN()`. 
But, for 64-bit LDR/STR, signed immediate >> byte offset must be in the range -256 to 255 or positive immediate >> byte offset must be a multiple of 8 in the range 0 to 32760[2]. >> `1030` can't be encoded in the instruction field. So, after >> matching, when we do checking for instruction encoding, the >> assertion would fail. >> >> In this patch, we're going to filter out invalid immediates >> when deciding if current addressing mode can be matched as >> `base_plus_offset`. We introduce `indOffIN4/indOffLN4` and >> `indOffIN8/indOffLN8` for 32-bit data type and 64-bit data >> type separately in the patch. E.g., for `memory4`, we remove >> the generic `indOffIN/indOffLN`, which matches wrong unscaled >> immediate range, and replace them with `indOffIN4/indOffLN4` >> instead. >> >> Since 8-bit and 16-bit LDR/STR instructions also support the >> unscaled decoding pattern, we add the addressing mode in the >> lists of `memory1` and `memory2` by introducing >> `indOffIN1/indOffLN1` and `indOffIN2/indOffLN2`. >> >> ... > > I think this patch is excessive for the problem and introduces a lot of code duplication.
Maybe it would be simpler, smaller, and faster to check for what we need: > > > diff --git a/src/hotspot/cpu/aarch64/aarch64.ad b/src/hotspot/cpu/aarch64/aarch64.ad > index 233f9b6af7c..ea842912ce9 100644 > --- a/src/hotspot/cpu/aarch64/aarch64.ad > +++ b/src/hotspot/cpu/aarch64/aarch64.ad > @@ -5911,7 +5911,8 @@ operand indIndexN(iRegN reg, iRegL lreg) > > operand indOffIN(iRegN reg, immIOffset off) > %{ > - predicate(CompressedOops::shift() == 0); > + predicate(CompressedOops::shift() == 0 > + && Address::offset_ok_for_immed(n->in(3)->find_int_con(min_jint), exact_log2(sizeof(jint)))); > constraint(ALLOC_IN_RC(ptr_reg)); > match(AddP (DecodeN reg) off); > op_cost(0); > @@ -5926,7 +5927,8 @@ operand indOffIN(iRegN reg, immIOffset off) > > operand indOffLN(iRegN reg, immLoffset off) > %{ > - predicate(CompressedOops::shift() == 0); > + predicate(CompressedOops::shift() == 0 > + && Address::offset_ok_for_immed(n->in(3)->find_long_con(min_jint), exact_log2(sizeof(jlong)))); > constraint(ALLOC_IN_RC(ptr_reg)); > match(AddP (DecodeN reg) off); > op_cost(0); @theRealAph , your patch only works when `indOffIN` is used in `memory4` and `indOffLN` is used in `memory8`, right? Introducing new operands like `indOffIN4` is consistent with how the code currently works with `indOffI4`. In fact I think the new `indOffIN` could be folded into the existing `indOffI` by using multiple `match` lines and a better predicate.
------------- PR Comment: https://git.openjdk.org/jdk/pull/16991#issuecomment-1846448522 From dlong at openjdk.org Fri Dec 8 02:27:20 2023 From: dlong at openjdk.org (Dean Long) Date: Fri, 8 Dec 2023 02:27:20 GMT Subject: RFR: 8320682: [AArch64] C1 compilation fails with "Field too big for insn" [v3] In-Reply-To: <0pwWmcg5mBei8T9v-z71ogtM0YB2QIKwOM7tK8yTrSo=.1bf57583-9fae-413c-91aa-4f94d5df5bf7@github.com> References: <0pwWmcg5mBei8T9v-z71ogtM0YB2QIKwOM7tK8yTrSo=.1bf57583-9fae-413c-91aa-4f94d5df5bf7@github.com> Message-ID: <5saRAOBSaCJZfAFXG2QzVzPi5k8sIuKgk_4gUYXkYcA=.fc8912ac-ba5c-46c1-a347-c0639d08c463@github.com> On Thu, 7 Dec 2023 15:53:47 GMT, Daniel Lundén wrote: >> This changeset fixes an issue where addresses for float and double constants on aarch64 were sometimes out of range for PC-relative offsets using `adr`. >> >> Changes: >> - Set an upper bound of `1M` for the flag `NMethodSizeLimit`, ensuring that float and double constants are in range for `adr`. >> - Revise tests in `TestC1Globals.java` to use the new upper bound of `1M` for `NMethodSizeLimit`. Also, remove no longer applicable tests in `TestC1Globals.java`. >> >> ### Testing (in progress) >> Platforms: windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64 >> - tier1, tier2, tier3, tier4, tier5 >> - Targeted and repeated tests for `TestC1Globals.java` in all tiers > > Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: > > Update copyright src/hotspot/share/c1/c1_globals.hpp line 280: > 278: develop(intx, NMethodSizeLimit, (64*K)*wordSize, \ > 279: "Maximum size of a compiled method.") \ > 280: range(0, 1*M) \ Shouldn't this be defined in platform-specific code, along with a comment explaining why 1MB was chosen?
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16951#discussion_r1419862893 From dlong at openjdk.org Fri Dec 8 02:32:17 2023 From: dlong at openjdk.org (Dean Long) Date: Fri, 8 Dec 2023 02:32:17 GMT Subject: RFR: 8320682: [AArch64] C1 compilation fails with "Field too big for insn" [v3] In-Reply-To: <0pwWmcg5mBei8T9v-z71ogtM0YB2QIKwOM7tK8yTrSo=.1bf57583-9fae-413c-91aa-4f94d5df5bf7@github.com> References: <0pwWmcg5mBei8T9v-z71ogtM0YB2QIKwOM7tK8yTrSo=.1bf57583-9fae-413c-91aa-4f94d5df5bf7@github.com> Message-ID: On Thu, 7 Dec 2023 15:53:47 GMT, Daniel Lundén wrote: >> This changeset fixes an issue where addresses for float and double constants on aarch64 were sometimes out of range for PC-relative offsets using `adr`. >> >> Changes: >> - Set an upper bound of `1M` for the flag `NMethodSizeLimit`, ensuring that float and double constants are in range for `adr`. >> - Revise tests in `TestC1Globals.java` to use the new upper bound of `1M` for `NMethodSizeLimit`. Also, remove no longer applicable tests in `TestC1Globals.java`. >> >> ### Testing (in progress) >> Platforms: windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64 >> - tier1, tier2, tier3, tier4, tier5 >> - Targeted and repeated tests for `TestC1Globals.java` in all tiers > > Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: > > Update copyright test/hotspot/jtreg/compiler/arguments/TestC1Globals.java line 62: > 60: * Linux. > 61: * > 62: * @run main/othervm -XX:NMethodSizeLimit=351658240 What were these large sizes of NMethodSizeLimit meant to test? Removing these test cases because of a problem with aarch64 seems wrong, unless these test cases really have no value for other platforms.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16951#discussion_r1419865173 From aph at openjdk.org Fri Dec 8 08:44:19 2023 From: aph at openjdk.org (Andrew Haley) Date: Fri, 8 Dec 2023 08:44:19 GMT Subject: RFR: 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug" [v2] In-Reply-To: References: <16J-lJ2AceGTVcRWBcP15yKcwO-1IA1XsngyOuNjf7k=.0776f081-ae2c-4279-87cf-d909806c2bc4@github.com> Message-ID: On Thu, 7 Dec 2023 14:52:31 GMT, Andrew Haley wrote: >> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: >> >> - Remove unused immIOffset/immLOffset >> - Merge branch 'master' into fg8319690 >> - 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug" >> >> On LP64 systems, if the heap can be moved into low virtual >> address space (below 4GB) and the heap size is smaller than the >> interesting threshold of 4 GB, we can use unscaled decoding >> pattern for narrow klass decoding. It means that a generic field >> reference can be decoded by: >> ``` >> cast<64> (32-bit compressed reference) + field_offset >> ``` >> >> When the `field_offset` is an immediate, on aarch64 platform, the >> unscaled decoding pattern can match perfectly with a direct >> addressing mode, i.e., `base_plus_offset`, supported by LDR/STR >> instructions. But for certain data width, not all immediates can >> be encoded in the instruction field of LDR/STR[1]. The ranges are >> different as data widths vary. >> >> For example, when we try to load a value of long type at offset of >> `1030`, the address expression is `(AddP (DecodeN base) 1030)`. >> Before the patch, the expression was matching with >> `operand indOffIN()`. 
But, for 64-bit LDR/STR, signed immediate >> byte offset must be in the range -256 to 255 or positive immediate >> byte offset must be a multiple of 8 in the range 0 to 32760[2]. >> `1030` can't be encoded in the instruction field. So, after >> matching, when we do checking for instruction encoding, the >> assertion would fail. >> >> In this patch, we're going to filter out invalid immediates >> when deciding if current addressing mode can be matched as >> `base_plus_offset`. We introduce `indOffIN4/indOffLN4` and >> `indOffIN8/indOffLN8` for 32-bit data type and 64-bit data >> type separately in the patch. E.g., for `memory4`, we remove >> the generic `indOffIN/indOffLN`, which matches wrong unscaled >> immediate range, and replace them with `indOffIN4/indOffLN4` >> instead. >> >> Since 8-bit and 16-bit LDR/STR instructions also support the >> unscaled decoding pattern, we add the addressing mode in the >> lists of `memory1` and `memory2` by introducing >> `indOffIN1/indOffLN1` and `indOffIN2/indOffLN2`. >> >> ... > > I think this patch is excessive for the problem and introduces a lot of code duplication.
Maybe it would be simpler, smaller, and faster to check for what we need: > > > diff --git a/src/hotspot/cpu/aarch64/aarch64.ad b/src/hotspot/cpu/aarch64/aarch64.ad > index 233f9b6af7c..ea842912ce9 100644 > --- a/src/hotspot/cpu/aarch64/aarch64.ad > +++ b/src/hotspot/cpu/aarch64/aarch64.ad > @@ -5911,7 +5911,8 @@ operand indIndexN(iRegN reg, iRegL lreg) > > operand indOffIN(iRegN reg, immIOffset off) > %{ > - predicate(CompressedOops::shift() == 0); > + predicate(CompressedOops::shift() == 0 > + && Address::offset_ok_for_immed(n->in(3)->find_int_con(min_jint), exact_log2(sizeof(jint)))); > constraint(ALLOC_IN_RC(ptr_reg)); > match(AddP (DecodeN reg) off); > op_cost(0); > @@ -5926,7 +5927,8 @@ operand indOffIN(iRegN reg, immIOffset off) > > operand indOffLN(iRegN reg, immLoffset off) > %{ > - predicate(CompressedOops::shift() == 0); > + predicate(CompressedOops::shift() == 0 > + && Address::offset_ok_for_immed(n->in(3)->find_long_con(min_jint), exact_log2(sizeof(jlong)))); > constraint(ALLOC_IN_RC(ptr_reg)); > match(AddP (DecodeN reg) off); > op_cost(0); > @theRealAph , your patch only works when `indOffIN` is used in `memory4` and `indOffLN` is used in `memory8`, right? Introducing new operands like `indOffIN4` is consistent with how the code currently works with `indOffI4`. Yes, it is, but clearly that does not scale, leading to a great profusion of operand kinds. > In fact I think the new `indOffIN` could be folded into the existing `indOffI` by using multiple `match` lines and a better predicate. Sure, that would be better still. Best of all would IMO be for each kind of memory access to check its offset operand by calling `ok_for_immed`, but let's see what folding and the use of better predicates does.
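To make the encoding constraint in this thread concrete, here is a minimal sketch of a per-access-size offset check, written from the ranges quoted above (signed byte offset in -256..255, or a non-negative multiple of the access size up to 4095 times that size). The function name `offset_encodable` is hypothetical; it is only a stand-in for the idea behind `Address::offset_ok_for_immed` and is not copied from HotSpot.

```cpp
#include <cstdint>

// Hypothetical stand-in for the check discussed in this thread.
// size_log2 is log2 of the access size in bytes: 0, 1, 2 or 3.
bool offset_encodable(int64_t offset, unsigned size_log2) {
  // Unscaled form: signed 9-bit byte offset.
  if (offset >= -256 && offset <= 255) {
    return true;
  }
  // Scaled form: unsigned 12-bit immediate, multiplied by the access size.
  const int64_t mask = (int64_t(1) << size_log2) - 1;
  return offset >= 0
      && (offset & mask) == 0
      && (offset >> size_log2) <= 4095;
}
```

Under these assumptions the offset `1030` from the bug report is accepted for 8-bit and 16-bit accesses but rejected for 32-bit and 64-bit ones, which is why a single operand shared across `memory4` and `memory8` cannot carry one fixed predicate.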
------------- PR Comment: https://git.openjdk.org/jdk/pull/16991#issuecomment-1846776665 From aph at openjdk.org Fri Dec 8 08:47:22 2023 From: aph at openjdk.org (Andrew Haley) Date: Fri, 8 Dec 2023 08:47:22 GMT Subject: RFR: 8320682: [AArch64] C1 compilation fails with "Field too big for insn" [v3] In-Reply-To: <5saRAOBSaCJZfAFXG2QzVzPi5k8sIuKgk_4gUYXkYcA=.fc8912ac-ba5c-46c1-a347-c0639d08c463@github.com> References: <0pwWmcg5mBei8T9v-z71ogtM0YB2QIKwOM7tK8yTrSo=.1bf57583-9fae-413c-91aa-4f94d5df5bf7@github.com> <5saRAOBSaCJZfAFXG2QzVzPi5k8sIuKgk_4gUYXkYcA=.fc8912ac-ba5c-46c1-a347-c0639d08c463@github.com> Message-ID: <0kWPYTh54_hsm2DQoY9M3BgIwMVpvBG2mLTJWjlTVAc=.d198fbd2-e128-4a61-9108-7001cb17258c@github.com> On Fri, 8 Dec 2023 02:24:45 GMT, Dean Long wrote: >> Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: >> >> Update copyright > > src/hotspot/share/c1/c1_globals.hpp line 280: > >> 278: develop(intx, NMethodSizeLimit, (64*K)*wordSize, \ >> 279: "Maximum size of a compiled method.") \ >> 280: range(0, 1*M) \ > > Shouldn't this be defined in platform-specific code, along with a comment explaining why 1MB was chosen? It could be, and I would have suggested doing so, but I am unaware of any circumstances in which ginormous C1-compiled methods are of any benefit to any port. > What were these large sizes of NMethodSizeLimit meant to test? Removing these test cases because of a problem with aarch64 seems wrong, unless these test cases really have no value for other platforms. That would be my guess.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16951#discussion_r1420111316 PR Review Comment: https://git.openjdk.org/jdk/pull/16951#discussion_r1420112021 From duke at openjdk.org Fri Dec 8 10:50:15 2023 From: duke at openjdk.org (Daniel Lundén) Date: Fri, 8 Dec 2023 10:50:15 GMT Subject: RFR: 8310524: C2: record parser-generated LoadN nodes for IGVN [v3] In-Reply-To: References: Message-ID: On Wed, 6 Dec 2023 12:57:05 GMT, Daniel Lundén wrote: >> This changeset fixes an issue where LoadN nodes were not recorded during bytecode parsing for later revisit in IGVN, in some cases resulting in missed optimization opportunities (see, e.g., the included new regression test). >> >> Changes: >> - Make sure to record newly added LoadN-nodes for IGVN in `GraphKit::make_load`. >> - Add a regression test. >> >> ### Testing >> - tier1, tier2, tier3, tier4, tier5 (windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64) > > Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: > > Switch to @DontInline Tests now rerun, integrating. Please sponsor! ------------- PR Comment: https://git.openjdk.org/jdk/pull/16967#issuecomment-1846948822 From duke at openjdk.org Fri Dec 8 11:00:23 2023 From: duke at openjdk.org (Daniel Lundén) Date: Fri, 8 Dec 2023 11:00:23 GMT Subject: RFR: 8295166: IGV: dump graph at more locations [v7] In-Reply-To: References: Message-ID: On Thu, 7 Dec 2023 15:52:51 GMT, Daniel Lundén wrote: >> This changeset >> 1.
adds a number of new graph dumps for IdealGraphVisualizer (IGV): >> - Before conditional constant propagation >> - After register allocation >> - After block ordering >> - After peephole optimization >> - After post-allocation expansion >> - Before and after >> - loop predication >> - loop peeling >> - pre/main/post loops >> - loop unrolling >> - range check elimination >> - loop unswitching >> - partial peeling >> - split if >> - superword >> 2. adds support for enumeration of repeated IGV graph dumps. >> 3. adjusts IGV print levels to encompass the new graph dumps. The old levels 4 and 5 are now levels 5 and 6. The new level 4 is for loop optimization dumps. >> >> Example phase list screenshots in IGV (first at level 6, second at level 4) >> ![Screenshot from 2023-12-04 13-55-38](https://github.com/openjdk/jdk/assets/4222397/6759dc5a-9c9a-42b9-8d9e-2d0b53e76ab4) ![Screenshot from 2023-12-04 13-56-29](https://github.com/openjdk/jdk/assets/4222397/44d6a239-587b-4f7c-8ce1-f7613cb2fa35) >> >> >> Some notes: >> - While discussing the above changes, a separate question was brought up by @chhagedorn: >> > On a separate note, I'm wondering how useful it is to always dump all JFR events when calling print_method(). Should this be revisited again in general? >> - The new IGV graph dump enumeration enables a number of cleanups. There is now another RFE for IGV cleanup: [JDK-8319599](https://bugs.openjdk.org/browse/JDK-8319599). >> >> ### Testing >> #### Platforms: windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64 >> - tier1, tier2, tier3, tier4, tier5. >> - Check that optimized builds (`--with-debug-level optimized`) still work. >> >> #### Platforms: linux-x64 >> - Tested that thousands of graphs are correctly opened and visualized with IGV. > > Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: > > Update copyright Thanks for the comments and reviews!
The tests are now finished, integrating (and I need a sponsor). ------------- PR Comment: https://git.openjdk.org/jdk/pull/16120#issuecomment-1846961804 From duke at openjdk.org Fri Dec 8 11:07:30 2023 From: duke at openjdk.org (Daniel Lundén) Date: Fri, 8 Dec 2023 11:07:30 GMT Subject: Integrated: 8310524: C2: record parser-generated LoadN nodes for IGVN In-Reply-To: References: Message-ID: On Tue, 5 Dec 2023 09:05:35 GMT, Daniel Lundén wrote: > This changeset fixes an issue where LoadN nodes were not recorded during bytecode parsing for later revisit in IGVN, in some cases resulting in missed optimization opportunities (see, e.g., the included new regression test). > > Changes: > - Make sure to record newly added LoadN-nodes for IGVN in `GraphKit::make_load`. > - Add a regression test. > > ### Testing > - tier1, tier2, tier3, tier4, tier5 (windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64) This pull request has now been integrated. Changeset: 9e48b90c Author: Daniel Lundén Committer: Roberto Castañeda Lozano URL: https://git.openjdk.org/jdk/commit/9e48b90c7fd349195a1389c480c66dfd9b1a7f75 Stats: 68 lines in 2 files changed: 68 ins; 0 del; 0 mod 8310524: C2: record parser-generated LoadN nodes for IGVN Reviewed-by: chagedorn, rcastanedalo, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/16967 From duke at openjdk.org Fri Dec 8 11:11:30 2023 From: duke at openjdk.org (Daniel Lundén) Date: Fri, 8 Dec 2023 11:11:30 GMT Subject: Integrated: 8295166: IGV: dump graph at more locations In-Reply-To: References: Message-ID: On Tue, 10 Oct 2023 13:31:00 GMT, Daniel Lundén wrote: > This changeset > 1.
adds a number of new graph dumps for IdealGraphVisualizer (IGV): > - Before conditional constant propagation > - After register allocation > - After block ordering > - After peephole optimization > - After post-allocation expansion > - Before and after > - loop predication > - loop peeling > - pre/main/post loops > - loop unrolling > - range check elimination > - loop unswitching > - partial peeling > - split if > - superword > 2. adds support for enumeration of repeated IGV graph dumps. > 3. adjusts IGV print levels to encompass the new graph dumps. The old levels 4 and 5 are now levels 5 and 6. The new level 4 is for loop optimization dumps. > > Example phase list screenshots in IGV (first at level 6, second at level 4) > ![Screenshot from 2023-12-04 13-55-38](https://github.com/openjdk/jdk/assets/4222397/6759dc5a-9c9a-42b9-8d9e-2d0b53e76ab4) ![Screenshot from 2023-12-04 13-56-29](https://github.com/openjdk/jdk/assets/4222397/44d6a239-587b-4f7c-8ce1-f7613cb2fa35) > > > Some notes: > - While discussing the above changes, a separate question was brought up by @chhagedorn: > > On a separate note, I'm wondering how useful it is to always dump all JFR events when calling print_method(). Should this be revisited again in general? > - The new IGV graph dump enumeration enables a number of cleanups. There is now another RFE for IGV cleanup: [JDK-8319599](https://bugs.openjdk.org/browse/JDK-8319599). > > ### Testing > #### Platforms: windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64 > - tier1, tier2, tier3, tier4, tier5. > - Check that optimized builds (`--with-debug-level optimized`) still work. > > #### Platforms: linux-x64 > - Tested that thousands of graphs are correctly opened and visualized with IGV. This pull request has now been integrated. 
Changeset: 701bc3bb Author: Daniel Lund?n Committer: Roberto Casta?eda Lozano URL: https://git.openjdk.org/jdk/commit/701bc3bbbe49a46aea7efc195463cc2efd64a785 Stats: 180 lines in 15 files changed: 118 ins; 7 del; 55 mod 8295166: IGV: dump graph at more locations Reviewed-by: thartmann, rcastanedalo, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/16120 From duke at openjdk.org Fri Dec 8 14:00:23 2023 From: duke at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Fri, 8 Dec 2023 14:00:23 GMT Subject: RFR: 8320682: [AArch64] C1 compilation fails with "Field too big for insn" [v3] In-Reply-To: References: <0pwWmcg5mBei8T9v-z71ogtM0YB2QIKwOM7tK8yTrSo=.1bf57583-9fae-413c-91aa-4f94d5df5bf7@github.com> Message-ID: On Fri, 8 Dec 2023 02:29:29 GMT, Dean Long wrote: >> Daniel Lund?n has updated the pull request incrementally with one additional commit since the last revision: >> >> Update copyright > > test/hotspot/jtreg/compiler/arguments/TestC1Globals.java line 62: > >> 60: * Linux. >> 61: * >> 62: * @run main/othervm -XX:NMethodSizeLimit=351658240 > > What were these large sizes of NMethodSizeLimit meant to test? Removing these test cases because of a problem with aarch64 seems wrong, unless these test cases really have no value for other platforms. Thanks for the review @dean-long. I added these tests recently for [JDK-8318817](https://bugs.openjdk.org/browse/JDK-8318817) and [JDK-8316653](https://bugs.openjdk.org/browse/JDK-8316653). From what I've gathered since then, there is little reason to allow such large values for NMethodSizeLimit. Do you know of any use case that requires more than 1MB? If not, I would say 1MB is probably a sensible upper bound, as it deals with this issue and also other potential issues due to code cache size assumptions (not only on aarch64). 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16951#discussion_r1420477576 From duke at openjdk.org Fri Dec 8 14:03:22 2023 From: duke at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Fri, 8 Dec 2023 14:03:22 GMT Subject: RFR: 8320682: [AArch64] C1 compilation fails with "Field too big for insn" [v3] In-Reply-To: <0kWPYTh54_hsm2DQoY9M3BgIwMVpvBG2mLTJWjlTVAc=.d198fbd2-e128-4a61-9108-7001cb17258c@github.com> References: <0pwWmcg5mBei8T9v-z71ogtM0YB2QIKwOM7tK8yTrSo=.1bf57583-9fae-413c-91aa-4f94d5df5bf7@github.com> <5saRAOBSaCJZfAFXG2QzVzPi5k8sIuKgk_4gUYXkYcA=.fc8912ac-ba5c-46c1-a347-c0639d08c463@github.com> <0kWPYTh54_hsm2DQoY9M3BgIwMVpvBG2mLTJWjlTVAc=.d198fbd2-e128-4a61-9108-7001cb17258c@github.com> Message-ID: On Fri, 8 Dec 2023 08:43:45 GMT, Andrew Haley wrote: >> src/hotspot/share/c1/c1_globals.hpp line 280: >> >>> 278: develop(intx, NMethodSizeLimit, (64*K)*wordSize, \ >>> 279: "Maximum size of a compiled method.") \ >>> 280: range(0, 1*M) \ >> >> Shouldn't this be defined in platform-specific code, along with a comment explaining why 1MB was chosen? > > It could be, and I would have suggested doing so, but I am unaware of any circumstances in which ginormous C1-compiled methods are of any benefit to any port. I think a 1MB upper bound is sensible for all platforms (see my other comment below). I'll add a comment explaining the choice, thanks. Please let me know if you think a larger bound is more suitable. In that case, we should also apply @theRealAph's improved aarch64 fix above for `const2reg`. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16951#discussion_r1420482258 From rcastanedalo at openjdk.org Fri Dec 8 14:56:20 2023 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Fri, 8 Dec 2023 14:56:20 GMT Subject: RFR: 8275202: C2: optimize out more redundant conditions In-Reply-To: References: <978cgwy3Nb_x7yU6jZz0f6zhTBZfphstisAkBf1Vktc=.283d06eb-4f79-40cf-b8dd-a9c230e59902@github.com> Message-ID: On Tue, 28 Nov 2023 14:30:09 GMT, Roberto Casta?eda Lozano wrote: > Thanks, I have not, will run some Renaissance benchmarks and report back. Here are results from running a few Renaissance benchmarks on x64 and aarch64 (lower is better): [score-baseline-vs-JDK-8275202-normalized-to-baseline.pdf](https://github.com/openjdk/jdk/files/13615435/score-baseline-vs-JDK-8275202-normalized-to-baseline.pdf) To summarize, there seems to be a slight improvement on ScalaKmeans (x64 and aarch64) and a slight regression on Mnemonics/ParMnemonics (x64), the remaining differences are probably in the noise. ------------- PR Comment: https://git.openjdk.org/jdk/pull/14586#issuecomment-1847322407 From eastigeevich at openjdk.org Fri Dec 8 15:19:44 2023 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Fri, 8 Dec 2023 15:19:44 GMT Subject: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v8] In-Reply-To: References: Message-ID: On Thu, 24 Jun 2021 17:02:03 GMT, Scott Gibbons wrote: >> Add the Base64 Decode intrinsic for x86 to utilize AVX-512 for acceleration. Also allows for performance improvement for non-AVX-512 enabled platforms. Due to the nature of MIME-encoded inputs, modify the intrinsic signature to accept an additional parameter (isMIME) for fast-path MIME decoding. >> >> A change was made to the signature of DecodeBlock in Base64.java to provide the intrinsic information as to whether MIME decoding was being done. 
This allows for the intrinsic to bypass the expensive setup of zmm registers from AVX tables, knowing there may be invalid Base64 characters every 76 characters or so. A change was also made here removing the restriction that the intrinsic must return an even multiple of 3 bytes decoded. This implementation handles the pad characters at the end of the string and will return the actual number of characters decoded.
>> 
>> The AVX portion of this code will decode in blocks of 256 bytes per loop iteration, then in chunks of 64 bytes, followed by end fixup decoding. The non-AVX code is an assembly-optimized version of the java DecodeBlock and behaves identically.
>> 
>> Running the Base64Decode benchmark, this change increases decode performance by an average of 2.6x with a maximum 19.7x for buffers > ~20k. The numbers are given in the table below.
>> 
>> **Base Score** is without intrinsic support, **Optimized Score** is using this intrinsic, and **Gain** is **Base** / **Optimized**.
>> 
>> Benchmark Name | Base Score | Optimized Score | Gain
>> -- | -- | -- | --
>> testBase64Decode size 1 | 15.36 | 15.32 | 1.00
>> testBase64Decode size 3 | 17.00 | 16.72 | 1.02
>> testBase64Decode size 7 | 20.60 | 18.82 | 1.09
>> testBase64Decode size 32 | 34.21 | 26.77 | 1.28
>> testBase64Decode size 64 | 54.43 | 38.35 | 1.42
>> testBase64Decode size 80 | 66.40 | 48.34 | 1.37
>> testBase64Decode size 96 | 73.16 | 52.90 | 1.38
>> testBase64Decode size 112 | 84.93 | 51.82 | 1.64
>> testBase64Decode size 512 | 288.81 | 32.04 | 9.01
>> testBase64Decode size 1000 | 560.48 | 40.79 | 13.74
>> testBase64Decode size 20000 | 9530.28 | 483.37 | 19.72
>> testBase64Decode size 50000 | 24552.24 | 1735.07 | 14.15
>> testBase64MIMEDecode size 1 | 22.87 | 21.36 | 1.07
>> testBase64MIMEDecode size 3 | 27.79 | 25.32 | 1.10
>> testBase64MIMEDecode size 7 | 44.74 | 43.81 | 1.02
>> testBase64MIMEDecode size 32 | 142.69 | 129.56 | 1.10
>> testBase64MIMEDecode size 64 | 256.90 | 243.80 | 1.05
>> testBase64MIMEDecode size 80 | 311.60 | 310.80 | 1.00
>> testBase64MIMEDecode size...
> 
> Scott Gibbons has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Fixed Windows register stomping.

We found this optimization causes https://bugs.openjdk.org/browse/JDK-8321599

-------------

PR Comment: https://git.openjdk.org/jdk/pull/4368#issuecomment-1847357495

From roland at openjdk.org Fri Dec 8 16:48:18 2023
From: roland at openjdk.org (Roland Westrelin)
Date: Fri, 8 Dec 2023 16:48:18 GMT
Subject: RFR: 8275202: C2: optimize out more redundant conditions
In-Reply-To:
References: <978cgwy3Nb_x7yU6jZz0f6zhTBZfphstisAkBf1Vktc=.283d06eb-4f79-40cf-b8dd-a9c230e59902@github.com>
Message-ID:

On Fri, 8 Dec 2023 14:53:34 GMT, Roberto Castañeda Lozano wrote:

>>> I'm not. Out of curiosity, have you tried anything from Renaissance?
>> 
>> Thanks, I have not, will run some Renaissance benchmarks and report back.
>> 
>>> I will propose a new change where the slow down is not as big (the new pass is run less often).
>> 
>> Sounds good! I am happy to re-evaluate C2 speed when the new change is ready.
> 
>> Thanks, I have not, will run some Renaissance benchmarks and report back.
> 
> Here are results from running a few Renaissance benchmarks on x64 and aarch64 (lower is better):
> 
> [score-baseline-vs-JDK-8275202-normalized-to-baseline.pdf](https://github.com/openjdk/jdk/files/13615435/score-baseline-vs-JDK-8275202-normalized-to-baseline.pdf)
> 
> To summarize, there seems to be a slight improvement on ScalaKmeans (x64 and aarch64) and a slight regression on Mnemonics/ParMnemonics (x64), the remaining differences are probably in the noise.

@robcasloz thanks!
------------- PR Comment: https://git.openjdk.org/jdk/pull/14586#issuecomment-1847508317 From dlong at openjdk.org Fri Dec 8 19:59:20 2023 From: dlong at openjdk.org (Dean Long) Date: Fri, 8 Dec 2023 19:59:20 GMT Subject: RFR: 8320682: [AArch64] C1 compilation fails with "Field too big for insn" [v3] In-Reply-To: References: <0pwWmcg5mBei8T9v-z71ogtM0YB2QIKwOM7tK8yTrSo=.1bf57583-9fae-413c-91aa-4f94d5df5bf7@github.com> <5saRAOBSaCJZfAFXG2QzVzPi5k8sIuKgk_4gUYXkYcA=.fc8912ac-ba5c-46c1-a347-c0639d08c463@github.com> <0kWPYTh54_hsm2DQoY9M3BgIwMVpvBG2mLTJWjlTVAc=.d198fbd2-e128-4a61-9108-7001cb17258c@github.com> Message-ID: On Fri, 8 Dec 2023 14:00:30 GMT, Daniel Lund?n wrote: >> It could be, and I would have suggested doing so, but I am unaware of any circumstances in which ginormous C1-compiled methods are of any benefit to any port. > > I think a 1MB upper bound is sensible for all platforms (see my other comment below). I'll add a comment explaining the choice, thanks. Please let me know if you think a larger bound is more suitable. In that case, we should also apply @theRealAph's improved aarch64 fix above for `const2reg` (or set a platform-specific bound of 1MB for aarch64). I don't know what a typical nmethod size is for C1, but I can imagine hitting the 1MB limit by doing something like stress testing with increased inlining limits. Or maybe very large initialization methods in the future thanks to Leyden AOT or computed constants? So I guess I'm leaning slightly towards fixing the aarch64 issue now. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16951#discussion_r1420975058 From sgibbons at openjdk.org Fri Dec 8 21:02:24 2023 From: sgibbons at openjdk.org (Scott Gibbons) Date: Fri, 8 Dec 2023 21:02:24 GMT Subject: RFR: JDK-8321599 Data loss in AVX3 Base64 decoding Message-ID: Fix for looking for padding characters within the encoded string. Was not adding start offset to length, so was looking at potentially freed or uninitialized memory. 
Tested tier1 and with the testcase supplied with the JBS issue.

The problem will only occur when all of the following are true:
1. The source offset of the string to be decoded is != 0.
2. The characters at the beginning of the string (minus the offset) plus the string length mod 64 are either "=" or "==".
3. The string is >= 32 characters.
4. The string is not MIME encoded.

If any of these conditions are not met, the decode works as expected. This was due to omitting the source offset of the string when checking for padding characters.

-------------

Commit messages:
 - Fix for JDK-8321599

Changes: https://git.openjdk.org/jdk/pull/17039/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=17039&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8321599
  Stats: 4 lines in 2 files changed: 2 ins; 0 del; 2 mod
  Patch: https://git.openjdk.org/jdk/pull/17039.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/17039/head:pull/17039

PR: https://git.openjdk.org/jdk/pull/17039

From eastigeevich at openjdk.org Fri Dec 8 21:17:12 2023
From: eastigeevich at openjdk.org (Evgeny Astigeevich)
Date: Fri, 8 Dec 2023 21:17:12 GMT
Subject: RFR: JDK-8321599 Data loss in AVX3 Base64 decoding
In-Reply-To:
References:
Message-ID: <9eAWrrDqD4dvhsYCNDPu_Ek7fCkTFV_ouiuFIDxTYB8=.9f1d5f64-1192-4e6e-bd1b-841e93520181@github.com>

On Fri, 8 Dec 2023 20:56:52 GMT, Scott Gibbons wrote:

> Fix for looking for padding characters within the encoded string. Was not adding start offset to length, so was looking at potentially freed or uninitialized memory.
> 
> Tested tier1 and with the testcase supplied with the JBS issue.
> 
> The problem will only occur when all of the following are true:
> 1. The source offset of the string to be decoded is != 0.
> 2. The characters at the beginning of the string (minus the offset) plus the string length mod 64 are either "=" or "==".
> 3. The string is >= 32 characters.
> 4. The string is not MIME encoded.
> > If any of these conditions are not met, the decode works as expected. This was due to omitting the source offset of the string when checking for padding characters. @asgibbons, thank you for the quick fix. I think it's worth to add the reproducer for the JBS issue as a test. ------------- PR Comment: https://git.openjdk.org/jdk/pull/17039#issuecomment-1847850653 From kvn at openjdk.org Fri Dec 8 22:40:21 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 8 Dec 2023 22:40:21 GMT Subject: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v12] In-Reply-To: <_L--_bl81TgfP1_0bLld6-xzcNqsopfL3HX3bmlbqgE=.923709b1-2e65-4659-a9cd-db4a6ac375c0@github.com> References: <_L--_bl81TgfP1_0bLld6-xzcNqsopfL3HX3bmlbqgE=.923709b1-2e65-4659-a9cd-db4a6ac375c0@github.com> Message-ID: On Wed, 6 Dec 2023 23:12:13 GMT, Srinivas Vamsi Parasa wrote: >> The goal is to develop faster sort routines for x86_64 CPUs by taking advantage of AVX2 instructions. This enhancement provides an order of magnitude speedup for Arrays.sort() using int, long, float and double arrays. >> >> For serial sort on random data, this PR shows upto ~7.5x improvement for 32-bit datatypes (int, float) on Intel TigerLake machine as shown in the performance data below. >> >> For parallel sort on random data, this PR shows upto ~3.4x for 32-bit datatypes (int, float) as shown below. >> >> **Note:** This PR also improves the performance of AVX512 sort by upto 35%. 
>> 
>> Benchmark (Serial Sort) | Size | Baseline (us/op) | AVX2 (us/op) | Speedup
>> -- | -- | -- | -- | --
>> ArraysSort.intSort | 10 | 0.034 | 0.029 | 1.2
>> ArraysSort.intSort | 25 | 0.088 | 0.044 | 2.0
>> ArraysSort.intSort | 50 | 0.239 | 0.159 | 1.5
>> ArraysSort.intSort | 75 | 0.417 | 0.27 | 1.5
>> ArraysSort.intSort | 100 | 0.572 | 0.265 | 2.2
>> ArraysSort.intSort | 1000 | 10.098 | 4.282 | 2.4
>> ArraysSort.intSort | 10000 | 330.065 | 43.383 | 7.6
>> ArraysSort.intSort | 100000 | 4099.527 | 778.943 | 5.3
>> ArraysSort.intSort | 1000000 | 49150.16 | 9634.335 | 5.1
>> ArraysSort.floatSort | 10 | 0.045 | 0.043 | 1.0
>> ArraysSort.floatSort | 25 | 0.105 | 0.073 | 1.4
>> ArraysSort.floatSort | 50 | 0.278 | 0.216 | 1.3
>> ArraysSort.floatSort | 75 | 0.476 | 0.241 | 2.0
>> ArraysSort.floatSort | 100 | 0.583 | 0.313 | 1.9
>> ArraysSort.floatSort | 1000 | 10.182 | 4.329 | 2.4
>> ArraysSort.floatSort | 10000 | 323.136 | 57.175 | 5.7
>> ArraysSort.floatSort | 100000 | 4299.519 | 862.63 | 5.0
>> ArraysSort.floatSort | 1000000 | 50889.4 | 10972.19 | 4.6
> 
> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Revert "Change supported intrinsic check"
>   
>   This reverts commit 9621eb045c2958582f81ec06b237789a07481ddd.

I pushed closed changes.

-------------

Marked as reviewed by kvn (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/16534#pullrequestreview-1773255608 From duke at openjdk.org Fri Dec 8 22:51:21 2023 From: duke at openjdk.org (Srinivas Vamsi Parasa) Date: Fri, 8 Dec 2023 22:51:21 GMT Subject: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v12] In-Reply-To: References: <_L--_bl81TgfP1_0bLld6-xzcNqsopfL3HX3bmlbqgE=.923709b1-2e65-4659-a9cd-db4a6ac375c0@github.com> Message-ID: On Fri, 8 Dec 2023 22:37:26 GMT, Vladimir Kozlov wrote: > I pushed closed changes. Thanks Vladimir! ------------- PR Comment: https://git.openjdk.org/jdk/pull/16534#issuecomment-1847939767 From eastigeevich at openjdk.org Fri Dec 8 22:54:12 2023 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Fri, 8 Dec 2023 22:54:12 GMT Subject: RFR: JDK-8321599 Data loss in AVX3 Base64 decoding In-Reply-To: References: Message-ID: On Fri, 8 Dec 2023 20:56:52 GMT, Scott Gibbons wrote: > Fix for looking for padding characters within the encoded string. Was not adding start offset to length, so was looking at potentially freed or uninitialized memory. > > Tested teir1 and with testcase supplied with JBS issue. > > The problem will only occur when all of the following are true: > 1. The source offset of the string to be decoded is != 0. > 2. The characters at the beginning of the string (minus the offset) plus the string length mod 64 are either "=" or "==". > 3. The string is >= 32 characters. > 4. The string is not MIME encoded. > > If any of these conditions are not met, the decode works as expected. This was due to omitting the source offset of the string when checking for padding characters. @asgibbons, am I correct the problem is that padding '=' characters were not found and not processed. This happens because a source offset is not taken into account. 
A test is:

    A, B: String
    Buf: ByteBuffer
    C := base64_encode(A) + base64_encode(B) # encode(B) should have '=' or '=='
    put C in Buf
    A' := base64_decode(Buf)
    B' := base64_decode(Buf)
    assert(A.equals(A'))
    assert(B.equals(B'))

-------------

PR Comment: https://git.openjdk.org/jdk/pull/17039#issuecomment-1847942366

From duke at openjdk.org Fri Dec 8 22:55:37 2023
From: duke at openjdk.org (Srinivas Vamsi Parasa)
Date: Fri, 8 Dec 2023 22:55:37 GMT
Subject: Integrated: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays)
In-Reply-To:
References:
Message-ID:

On Tue, 7 Nov 2023 00:12:41 GMT, Srinivas Vamsi Parasa wrote:

> The goal is to develop faster sort routines for x86_64 CPUs by taking advantage of AVX2 instructions. This enhancement provides an order of magnitude speedup for Arrays.sort() using int, long, float and double arrays.
> 
> For serial sort on random data, this PR shows up to ~7.5x improvement for 32-bit datatypes (int, float) on an Intel TigerLake machine as shown in the performance data below.
> 
> For parallel sort on random data, this PR shows up to ~3.4x for 32-bit datatypes (int, float) as shown below.
> 
> **Note:** This PR also improves the performance of AVX512 sort by up to 35%.
> 
> Benchmark (Serial Sort) | Size | Baseline (us/op) | AVX2 (us/op) | Speedup
> -- | -- | -- | -- | --
> ArraysSort.intSort | 10 | 0.034 | 0.029 | 1.2
> ArraysSort.intSort | 25 | 0.088 | 0.044 | 2.0
> ArraysSort.intSort | 50 | 0.239 | 0.159 | 1.5
> ArraysSort.intSort | 75 | 0.417 | 0.27 | 1.5
> ArraysSort.intSort | 100 | 0.572 | 0.265 | 2.2
> ArraysSort.intSort | 1000 | 10.098 | 4.282 | 2.4
> ArraysSort.intSort | 10000 | 330.065 | 43.383 | 7.6
> ArraysSort.intSort | 100000 | 4099.527 | 778.943 | 5.3
> ArraysSort.intSort | 1000000 | 49150.16 | 9634.335 | 5.1
> ArraysSort.floatSort | 10 | 0.045 | 0.043 | 1.0
> ArraysSort.floatSort | 25 | 0.105 | 0.073 | 1.4
> ArraysSort.floatSort | 50 | 0.278 | 0.216 | 1.3
> ArraysSort.floatSort | 75 | 0.476 | 0.241 | 2.0
> ArraysSort.floatSort | 100 | 0.583 | 0.313 | 1.9
> ArraysSort.floatSort | 1000 | 10.182 | 4.329 | 2.4
> ArraysSort.floatSort | 10000 | 323.136 | 57.175 | 5.7
> ArraysSort.floatSort | 100000 | 4299.519 | 862.63 | 5.0
> ArraysSort.floatSort | 1000000 | 50889.4 | 10972.19 | 4.6
> 
> ...

This pull request has now been integrated.
Changeset: ce108446
Author: vamsi-parasa
Committer: Sandhya Viswanathan
URL: https://git.openjdk.org/jdk/commit/ce108446ca1fe604ecc24bbefb0bf1c6318271c7
Stats: 4026 lines in 24 files changed: 2311 ins; 1560 del; 155 mod

8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays)

Reviewed-by: sviswanathan, ihse, jbhateja, kvn

-------------

PR: https://git.openjdk.org/jdk/pull/16534

From sgibbons at openjdk.org Fri Dec 8 23:14:11 2023
From: sgibbons at openjdk.org (Scott Gibbons)
Date: Fri, 8 Dec 2023 23:14:11 GMT
Subject: RFR: JDK-8321599 Data loss in AVX3 Base64 decoding
In-Reply-To:
References:
Message-ID:

On Fri, 8 Dec 2023 22:51:29 GMT, Evgeny Astigeevich wrote:

> @asgibbons, am I correct the problem is that padding '=' characters were not found and not processed. This happens because a source offset is not taken into account. A test is:
> 
> ```
> A, B: String
> Buf: ByteBuffer
> C := base64_encode(A) + base64_encode(B) # encode(B) should have '=' or '=='
> put C in Buf
> A' := base64_decode(Buf)
> B' := base64_decode(Buf)
> assert(A.equals(A'))
> assert(B.equals(B'))
> ```

No. The padding '=' character was found and terminated the decoding, which is expected. The issue is that the input string (encoded) is quite long in this case and the test is decoding a substring of the full string. The parameters passed to Decode are a pointer to the start of the (long) string and a (large) offset. I was looking for padding characters relative to the start of the long string instead of the substring (start plus the starting offset). Example:

    Encoded string:
    . . . = = . . . a a a a a a a ... a a a a
    ^               ^
    |               |
    start           start + offset

I was asked to decode the bytes at ```(start + offset)```. When the algorithm gets to the last 31 bytes of ```a a a a ... a a a a```, it looks for padding at ```(start + remaining_length - 1)``` instead of ```(start + start_offset + remaining_length - 1)```.
It actually found a padding byte at ```(start + remaining_length - 1)``` and decided that the output length should be reduced by one character (or 2 if there were 2 padding bytes found). A very specific edge case (so good catch by testers).

-------------

PR Comment: https://git.openjdk.org/jdk/pull/17039#issuecomment-1847958191

From duke at openjdk.org Sat Dec 9 00:36:11 2023
From: duke at openjdk.org (James Petty)
Date: Sat, 9 Dec 2023 00:36:11 GMT
Subject: RFR: JDK-8321599 Data loss in AVX3 Base64 decoding
In-Reply-To:
References:
Message-ID:

On Fri, 8 Dec 2023 20:56:52 GMT, Scott Gibbons wrote:

> Fix for looking for padding characters within the encoded string. Was not adding start offset to length, so was looking at potentially freed or uninitialized memory.
> 
> Tested tier1 and with the testcase supplied with the JBS issue.
> 
> The problem will only occur when all of the following are true:
> 1. The source offset of the string to be decoded is != 0.
> 2. The characters at the beginning of the string (minus the offset) plus the string length mod 64 are either "=" or "==".
> 3. The string is >= 32 characters.
> 4. The string is not MIME encoded.
> 
> If any of these conditions are not met, the decode works as expected. This was due to omitting the source offset of the string when checking for padding characters.

I was one of the engineers investigating the issue and wrote the original form of the reproducer submitted on the ticket. The use case that was failing, which I tried to mimic in the reproducer, was decoding base64 data in a column-oriented analytics engine, so the fact that the backing buffer is (much) larger than the subset being decoded on any invocation, has non-zero starting offsets, and contains padded base64 strings earlier in the source buffer isn't an exceptional scenario given that use case. Thanks again for the quick fix!
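
For readers following along, the shape of the use case described above (a large backing buffer, a non-zero starting offset, and a padded Base64 string earlier in the same buffer) can be sketched with the public `java.util.Base64` API. This is only an illustrative sketch, not the reproducer attached to the JBS issue; the class name `Base64SliceDecode`, the helper `decodeSlice`, and the sample data are made up here, and the sketch is not claimed to hit the exact conditions listed in the PR description.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical helper class, for illustration only.
final class Base64SliceDecode {

    /** Decodes the Base64 text stored in backing[offset, offset + length). */
    static byte[] decodeSlice(byte[] backing, int offset, int length) {
        // Wrapping with a non-zero offset makes the decoder start in the
        // middle of the backing array rather than at index 0.
        ByteBuffer slice = ByteBuffer.wrap(backing, offset, length);
        ByteBuffer decoded = Base64.getDecoder().decode(slice);
        byte[] out = new byte[decoded.remaining()];
        decoded.get(out);
        return out;
    }

    public static void main(String[] args) {
        Base64.Encoder enc = Base64.getEncoder();

        // The first value encodes to "YQ==", so padding sits early in the buffer.
        byte[] first = "a".getBytes(StandardCharsets.US_ASCII);
        // The second value is 96 bytes, which encodes to 128 characters without padding.
        byte[] second = new byte[96];
        Arrays.fill(second, (byte) 'x');

        byte[] e1 = enc.encode(first);
        byte[] e2 = enc.encode(second);
        byte[] backing = new byte[e1.length + e2.length];
        System.arraycopy(e1, 0, backing, 0, e1.length);
        System.arraycopy(e2, 0, backing, e1.length, e2.length);

        // Decode only the second string, which starts at a non-zero offset.
        byte[] roundTrip = decodeSlice(backing, e1.length, e2.length);
        System.out.println("round trip ok: " + Arrays.equals(second, roundTrip));
    }
}
```

With a correct decoder the round trip is expected to return the original 96 bytes; a shortened result would indicate that the output length was reduced based on padding found outside the slice.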
------------- PR Comment: https://git.openjdk.org/jdk/pull/17039#issuecomment-1848005626 From kvn at openjdk.org Sat Dec 9 00:44:21 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sat, 9 Dec 2023 00:44:21 GMT Subject: RFR: JDK-8321599 Data loss in AVX3 Base64 decoding In-Reply-To: References: Message-ID: On Sat, 9 Dec 2023 00:30:38 GMT, James Petty wrote: >> Fix for looking for padding characters within the encoded string. Was not adding start offset to length, so was looking at potentially freed or uninitialized memory. >> >> Tested teir1 and with testcase supplied with JBS issue. >> >> The problem will only occur when all of the following are true: >> 1. The source offset of the string to be decoded is != 0. >> 2. The characters at the beginning of the string (minus the offset) plus the string length mod 64 are either "=" or "==". >> 3. The string is >= 32 characters. >> 4. The string is not MIME encoded. >> >> If any of these conditions are not met, the decode works as expected. This was due to omitting the source offset of the string when checking for padding characters. > > I was one of the engineers investigating the issue and wrote the original form of reproducer submitted on the ticket. The use case that was failing that I tried to mimic in the reproducer was decoding base64 data in a column oriented analytics engine- so the fact that the backing buffer is (much) larger than the subset being decoded on any invocation, has non zero starting offsets, and contains padded base64 strings earlier in the source buffer isn?t an exceptional scenario given that use case. Thanks again for the quick fix! @pettyjamesm did you verified this fix with your case? 
------------- PR Comment: https://git.openjdk.org/jdk/pull/17039#issuecomment-1848014510 From duke at openjdk.org Sat Dec 9 00:50:16 2023 From: duke at openjdk.org (James Petty) Date: Sat, 9 Dec 2023 00:50:16 GMT Subject: RFR: JDK-8321599 Data loss in AVX3 Base64 decoding In-Reply-To: References: Message-ID: On Sat, 9 Dec 2023 00:30:38 GMT, James Petty wrote: >> Fix for looking for padding characters within the encoded string. Was not adding start offset to length, so was looking at potentially freed or uninitialized memory. >> >> Tested teir1 and with testcase supplied with JBS issue. >> >> The problem will only occur when all of the following are true: >> 1. The source offset of the string to be decoded is != 0. >> 2. The characters at the beginning of the string (minus the offset) plus the string length mod 64 are either "=" or "==". >> 3. The string is >= 32 characters. >> 4. The string is not MIME encoded. >> >> If any of these conditions are not met, the decode works as expected. This was due to omitting the source offset of the string when checking for padding characters. > > I was one of the engineers investigating the issue and wrote the original form of reproducer submitted on the ticket. The use case that was failing that I tried to mimic in the reproducer was decoding base64 data in a column oriented analytics engine- so the fact that the backing buffer is (much) larger than the subset being decoded on any invocation, has non zero starting offsets, and contains padded base64 strings earlier in the source buffer isn?t an exceptional scenario given that use case. Thanks again for the quick fix! > @pettyjamesm did you verified this fix with your case? Unfortunately not, we don?t currently have any workflow that builds a test artifact with a JDK/JVM built from source- so it would be a big lift to get to that point for us. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/17039#issuecomment-1848016975 From aph at openjdk.org Sat Dec 9 10:42:20 2023 From: aph at openjdk.org (Andrew Haley) Date: Sat, 9 Dec 2023 10:42:20 GMT Subject: RFR: 8320682: [AArch64] C1 compilation fails with "Field too big for insn" [v3] In-Reply-To: References: <0pwWmcg5mBei8T9v-z71ogtM0YB2QIKwOM7tK8yTrSo=.1bf57583-9fae-413c-91aa-4f94d5df5bf7@github.com> <5saRAOBSaCJZfAFXG2QzVzPi5k8sIuKgk_4gUYXkYcA=.fc8912ac-ba5c-46c1-a347-c0639d08c463@github.com> <0kWPYTh54_hsm2DQoY9M3BgIwMVpvBG2mLTJWjlTVAc=.d198fbd2-e128-4a61-9108-7001cb17258c@github.com> Message-ID: On Fri, 8 Dec 2023 19:56:24 GMT, Dean Long wrote: >> I think a 1MB upper bound is sensible for all platforms (see my other comment below). I'll add a comment explaining the choice, thanks. Please let me know if you think a larger bound is more suitable. In that case, we should also apply @theRealAph's improved aarch64 fix above for `const2reg` (or set a platform-specific bound of 1MB for aarch64). > > I don't know what a typical nmethod size is for C1, but I can imagine hitting the 1MB limit by doing something like stress testing with increased inlining limits. Or maybe very large initialization methods in the future thanks to Leyden AOT or computed constants? So I guess I'm leaning slightly towards fixing the aarch64 issue now. it's not a single AArch64 issue, though, it's a few places. I have a particular hatred for dead code (or in this case nearly-dead code) paths that only get exercised in weird test cases. If we insist that methods may be > 1MB we can't use the compact ADR and LDR forms of relative addressing. To solve this problem, if it is one, we either put up with inefficiencies for no reason other than to make test cases work, or we generate different code especially for such cases. 
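
As a rough illustration of the range limits behind this point (a sketch only, not HotSpot code; the class `PcRelativeRange` and its methods are hypothetical names): the AArch64 `ADR` instruction encodes a signed 21-bit byte offset, and `LDR` (literal) encodes a signed 19-bit offset counted in 4-byte words, so both reach roughly 1MB in either direction from the instruction.

```java
// Hypothetical helper, for illustration only; not part of the JDK sources.
final class PcRelativeRange {

    // ADR: signed 21-bit byte offset, i.e. [-2^20, 2^20 - 1].
    static final long ADR_MIN = -(1L << 20);
    static final long ADR_MAX = (1L << 20) - 1;

    /** True if a PC-relative byte offset can be encoded in a single ADR. */
    static boolean fitsAdr(long byteOffset) {
        return byteOffset >= ADR_MIN && byteOffset <= ADR_MAX;
    }

    /**
     * True if a PC-relative byte offset can be encoded in LDR (literal):
     * a signed 19-bit immediate scaled by 4, so the offset must be
     * 4-byte aligned and lie in [-2^20, 2^20 - 4].
     */
    static boolean fitsLdrLiteral(long byteOffset) {
        return (byteOffset & 3) == 0
                && byteOffset >= -(1L << 20)
                && byteOffset <= (1L << 20) - 4;
    }

    public static void main(String[] args) {
        long oneMB = 1L << 20;
        System.out.println("ADR reaches 1MB - 1: " + fitsAdr(oneMB - 1));
        System.out.println("ADR reaches 1MB:     " + fitsAdr(oneMB));
        System.out.println("LDR reaches 1MB - 4: " + fitsLdrLiteral(oneMB - 4));
        System.out.println("LDR reaches 1MB:     " + fitsLdrLiteral(oneMB));
    }
}
```

Under these encodings, a target within a method of at most 1MB is always within reach of the compact forms, which is the property a 1MB bound on `NMethodSizeLimit` would preserve.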
We could use something like relaxation, where we first generate methods optimistically then fall back to less-efficient forms, but again that's make-work for test cases. I know it's "only C1" but every little helps. Efficient systems are made of thousands of tiny optimizations, each one of which is to small to make a measurable difference on its own. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16951#discussion_r1421394837 From aph at openjdk.org Sat Dec 9 14:09:16 2023 From: aph at openjdk.org (Andrew Haley) Date: Sat, 9 Dec 2023 14:09:16 GMT Subject: RFR: 8320682: [AArch64] C1 compilation fails with "Field too big for insn" [v3] In-Reply-To: References: <0pwWmcg5mBei8T9v-z71ogtM0YB2QIKwOM7tK8yTrSo=.1bf57583-9fae-413c-91aa-4f94d5df5bf7@github.com> <5saRAOBSaCJZfAFXG2QzVzPi5k8sIuKgk_4gUYXkYcA=.fc8912ac-ba5c-46c1-a347-c0639d08c463@github.com> <0kWPYTh54_hsm2DQoY9M3BgIwMVpvBG2mLTJWjlTVAc=.d198fbd2-e128-4a61-9108-7001cb17258c@github.com> Message-ID: On Sat, 9 Dec 2023 10:40:01 GMT, Andrew Haley wrote: >> I don't know what a typical nmethod size is for C1, but I can imagine hitting the 1MB limit by doing something like stress testing with increased inlining limits. Or maybe very large initialization methods in the future thanks to Leyden AOT or computed constants? So I guess I'm leaning slightly towards fixing the aarch64 issue now. > > it's not a single AArch64 issue, though, it's a few places. I have a particular hatred for dead code (or in this case nearly-dead code) paths that only get exercised in weird test cases. If we insist that methods may be > 1MB we can't use the compact ADR and LDR forms of relative addressing. To solve this problem, if it is one, we either put up with inefficiencies for no reason other than to make test cases work, or we generate different code especially for such cases. 
> > We could use something like relaxation, where we first generate methods optimistically then fall back to less-efficient forms, but again that's make-work for test cases. > > I know it's "only C1" but every little helps. Efficient systems are made of thousands of tiny optimizations, each one of which is to small to make a measurable difference on its own. Sorry, that was a bit of a rant. I guess I'm happy that AArch64 will be restricted, if that is the word, to megabyte-long compiled methods, for the aforementioned reasons. Every RISC processor, or at least every processor with fixed-length instructions, will have similar issues. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16951#discussion_r1421425825 From kvn at openjdk.org Sat Dec 9 23:39:11 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sat, 9 Dec 2023 23:39:11 GMT Subject: RFR: JDK-8321599 Data loss in AVX3 Base64 decoding In-Reply-To: References: Message-ID: On Fri, 8 Dec 2023 23:11:43 GMT, Scott Gibbons wrote: >> @asgibbons, am I correct the problem is that padding '=' characters were not found and not processed. This happens because a source offset is not taken into account. >> A test is: >> >> A, B: String >> Buf: ByteBuffer >> C := base64_encode(A) + base64_encode(B) # encode(B) should have '=' or '==' >> put C in Buf >> A' := base64_decode(Buf) >> B' := base64_decode(Buf) >> assert(A.equals(A')) >> assert(B.equals(B')) > >> @asgibbons, am I correct the problem is that padding '=' characters were not found and not processed. This happens because a source offset is not taken into account. A test is: >> >> ``` >> A, B: String >> Buf: ByteBuffer >> C := base64_encode(A) + base64_encode(B) # encode(B) should have '=' or '==' >> put C in Buf >> A' := base64_decode(Buf) >> B' := base64_decode(Buf) >> assert(A.equals(A')) >> assert(B.equals(B')) >> ``` > > No. The padding '=' character was found and terminated the decoding, which is expected. 
The issue is that the input string (encoded) is quite long in this case and the test is decoding a substring of the full string. The parameters passed to Decode are a pointer to the start of the (long) string and a (large) offset. I was looking for padding characters relative to the start of the long string instead of the substring (start plus the starting offset). Example: > > > Encoded string: > . . . = = . . . a a a a a a a ... a a a a > ^ ^ > | | > start start + offset > > I was asked to decode the bytes at ```(start + offset)```. When the algorithm gets to the last 31 bytes of ```a a a a ... a a a a```, it looks for padding at ```(start + remaining_length - 1)``` instead of ```(start + start_offset + remaining_length - 1)```. It actually found a padding byte at ```(start + remaining_length - 1)``` and decided that the output length should be reduced by one character (or 2 if there were 2 padding bytes found). A very specific edge case (so good catch by testers). > @asgibbons, thank you for the quick fix. > I think it's worth adding the reproducer for the JBS issue as a test. Yes, we need a regression test with these changes. ------------- PR Comment: https://git.openjdk.org/jdk/pull/17039#issuecomment-1848781491 From duke at openjdk.org Mon Dec 11 02:06:30 2023 From: duke at openjdk.org (ArsenyBochkarev) Date: Mon, 11 Dec 2023 02:06:30 GMT Subject: RFR: 8317721: RISC-V: Implement CRC32 intrinsic Message-ID: Hi everyone! Please review this port of [AArch64](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp#L4224) `_updateBytesCRC32`, `_updateByteBufferCRC32` and `_updateCRC32` intrinsics. This patch introduces only the plain (non-vectorized, no Zbc) version. ### Correctness checks Tier 1/2 tests are ok.
### Performance results on T-Head board #### Results for enabled intrinsic: Used test is `test/micro/org/openjdk/bench/java/util/TestCRC32.java` | Benchmark | (count) | Mode | Cnt | Score | Error | Units | | --- | ---- | ----- | --- | ---- | --- | ---- | | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 24 | 3730.929 | 37.773 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 24 | 2126.673 | 2.032 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 24 | 1134.330 | 6.714 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 24 | 584.017 | 2.267 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 24 | 151.173 | 0.346 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 24 | 19.113 | 0.008 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 24 | 4.647 | 0.022 | ops/ms | #### Results for disabled intrinsic: | Benchmark | (count) | Mode | Cnt | Score | Error | Units | | --------------------------------------------------- | ---------- | --------- | ---- | ----------- | --------- | ---------- | | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 15 | 798.365 | 35.486 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 15 | 677.756 | 46.619 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 15 | 552.781 | 27.143 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 15 | 429.304 | 12.518 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 15 | 166.738 | 0.935 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 15 | 25.060 | 0.034 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 15 | 6.196 | 0.030 | ops/ms | ------------- Commit messages: - 8317721: RISC-V: Implement CRC32 intrinsic Changes: https://git.openjdk.org/jdk/pull/17046/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=17046&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8317721 Stats: 524 lines in 8 files changed: 519 ins; 1 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/17046.diff Fetch: 
git fetch https://git.openjdk.org/jdk.git pull/17046/head:pull/17046 PR: https://git.openjdk.org/jdk/pull/17046 From jbhateja at openjdk.org Mon Dec 11 07:36:30 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 11 Dec 2023 07:36:30 GMT Subject: RFR: 8321648: Integral gather optimized mask computation. Message-ID: Hi, This bug fix patch optimizes integral gather mask computation using cheaper instruction and fixes incorrect instruction attributes in legacy integral gather instructions. All Vector API JTREG tests are passing with this at various AVX levels. Kindly review and share feedback. Best Regards, Jatin ------------- Commit messages: - 8321648: Integral gather optimized mask computation. Changes: https://git.openjdk.org/jdk/pull/17048/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=17048&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8321648 Stats: 31 lines in 3 files changed: 11 ins; 14 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/17048.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17048/head:pull/17048 PR: https://git.openjdk.org/jdk/pull/17048 From qamai at openjdk.org Mon Dec 11 08:56:16 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Mon, 11 Dec 2023 08:56:16 GMT Subject: RFR: 8321648: Integral gather optimized mask computation. In-Reply-To: References: Message-ID: On Mon, 11 Dec 2023 07:26:31 GMT, Jatin Bhateja wrote: > Hi, > > This bug fix patch optimizes integral gather mask computation using cheaper instruction and fixes incorrect instruction attributes in legacy integral gather instructions. > > All Vector API JTREG tests are passing with this at various AVX levels. > > Kindly review and share feedback. > > Best Regards, > Jatin While you are at it, you can change the `address` operand of these to only accept no-index ones, removing the need of the `lea` instruction. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/17048#issuecomment-1849579917 From duke at openjdk.org Mon Dec 11 09:19:20 2023 From: duke at openjdk.org (Daniel Lundén) Date: Mon, 11 Dec 2023 09:19:20 GMT Subject: RFR: 8320682: [AArch64] C1 compilation fails with "Field too big for insn" [v3] In-Reply-To: References: <0pwWmcg5mBei8T9v-z71ogtM0YB2QIKwOM7tK8yTrSo=.1bf57583-9fae-413c-91aa-4f94d5df5bf7@github.com> <5saRAOBSaCJZfAFXG2QzVzPi5k8sIuKgk_4gUYXkYcA=.fc8912ac-ba5c-46c1-a347-c0639d08c463@github.com> <0kWPYTh54_hsm2DQoY9M3BgIwMVpvBG2mLTJWjlTVAc=.d198fbd2-e128-4a61-9108-7001cb17258c@github.com> Message-ID: On Sat, 9 Dec 2023 14:06:46 GMT, Andrew Haley wrote: >> It's not a single AArch64 issue, though, it's a few places. I have a particular hatred for dead code (or in this case nearly-dead code) paths that only get exercised in weird test cases. If we insist that methods may be > 1MB we can't use the compact ADR and LDR forms of relative addressing. To solve this problem, if it is one, we either put up with inefficiencies for no reason other than to make test cases work, or we generate different code especially for such cases. >> >> We could use something like relaxation, where we first generate methods optimistically then fall back to less-efficient forms, but again that's make-work for test cases. >> >> I know it's "only C1" but every little helps. Efficient systems are made of thousands of tiny optimizations, each one of which is too small to make a measurable difference on its own. > > Sorry, that was a bit of a rant. I guess I'm happy that AArch64 will be restricted, if that is the word, to megabyte-long compiled methods, for the aforementioned reasons. Every RISC processor, or at least every processor with fixed-length instructions, will have similar issues. Making the current assumptions explicit by limiting to 1MB on aarch64 seems like a good solution for the moment.
I would advise that we also set a default upper limit for all platforms, to avoid other issues in nearly-dead code paths exercised by huge flag values. See, e.g., [JDK-8320302](https://bugs.openjdk.org/browse/JDK-8320302). I've also seen some other potential issues when doing stress testing. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16951#discussion_r1422149213 From jbhateja at openjdk.org Mon Dec 11 10:36:15 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 11 Dec 2023 10:36:15 GMT Subject: RFR: 8321648: Integral gather optimized mask computation. In-Reply-To: References: Message-ID: On Mon, 11 Dec 2023 08:53:19 GMT, Quan Anh Mai wrote: > While you are at it, you can change the `address` operand of these to only accept no-index ones, removing the need of the `lea` instruction. Hi @merykitty, Memory patterns fold address generation components (base, index, scale) into the instruction encoding, thus eliminating the need to emit an explicit ADD, MUL instruction sequence to compute the address. Saving the lea may prevent folding memory patterns and may prove to be costly. ------------- PR Comment: https://git.openjdk.org/jdk/pull/17048#issuecomment-1849771671 From sgibbons at openjdk.org Mon Dec 11 15:15:31 2023 From: sgibbons at openjdk.org (Scott Gibbons) Date: Mon, 11 Dec 2023 15:15:31 GMT Subject: RFR: JDK-8321599 Data loss in AVX3 Base64 decoding In-Reply-To: References: Message-ID: On Fri, 8 Dec 2023 20:56:52 GMT, Scott Gibbons wrote: > Fix for looking for padding characters within the encoded string. Was not adding start offset to length, so was looking at potentially freed or uninitialized memory. > > Tested tier1 and with testcase supplied with JBS issue. > > The problem will only occur when all of the following are true: > 1. The source offset of the string to be decoded is != 0. > 2. The characters at the beginning of the string (minus the offset) plus the string length mod 64 are either "=" or "==". > 3. The string is >= 32 characters.
> 4. The string is not MIME encoded. > > If any of these conditions are not met, the decode works as expected. This was due to omitting the source offset of the string when checking for padding characters. Closing this PR at @TobiHartmann's request. ------------- PR Comment: https://git.openjdk.org/jdk/pull/17039#issuecomment-1850276838 From sgibbons at openjdk.org Mon Dec 11 15:15:32 2023 From: sgibbons at openjdk.org (Scott Gibbons) Date: Mon, 11 Dec 2023 15:15:32 GMT Subject: Withdrawn: JDK-8321599 Data loss in AVX3 Base64 decoding In-Reply-To: References: Message-ID: On Fri, 8 Dec 2023 20:56:52 GMT, Scott Gibbons wrote: > Fix for looking for padding characters within the encoded string. Was not adding start offset to length, so was looking at potentially freed or uninitialized memory. > > Tested tier1 and with testcase supplied with JBS issue. > > The problem will only occur when all of the following are true: > 1. The source offset of the string to be decoded is != 0. > 2. The characters at the beginning of the string (minus the offset) plus the string length mod 64 are either "=" or "==". > 3. The string is >= 32 characters. > 4. The string is not MIME encoded. > > If any of these conditions are not met, the decode works as expected. This was due to omitting the source offset of the string when checking for padding characters. This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/17039 From duke at openjdk.org Mon Dec 11 15:59:16 2023 From: duke at openjdk.org (ArsenyBochkarev) Date: Mon, 11 Dec 2023 15:59:16 GMT Subject: RFR: 8317721: RISC-V: Implement CRC32 intrinsic In-Reply-To: References: Message-ID: On Mon, 11 Dec 2023 01:59:33 GMT, ArsenyBochkarev wrote: > Hi everyone! Please review this port of [AArch64](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp#L4224) `_updateBytesCRC32`, `_updateByteBufferCRC32` and `_updateCRC32` intrinsics.
This patch introduces only the plain (non-vectorized, no Zbc) version. > > ### Correctness checks > > Tier 1/2 tests are ok. > > ### Performance results on T-Head board > > #### Results for enabled intrinsic: > > Used test is `test/micro/org/openjdk/bench/java/util/TestCRC32.java` > > | Benchmark | (count) | Mode | Cnt | Score | Error | Units | > | --- | ---- | ----- | --- | ---- | --- | ---- | > | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 24 | 3730.929 | 37.773 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 24 | 2126.673 | 2.032 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 24 | 1134.330 | 6.714 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 24 | 584.017 | 2.267 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 24 | 151.173 | 0.346 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 24 | 19.113 | 0.008 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 24 | 4.647 | 0.022 | ops/ms | > > #### Results for disabled intrinsic: > > | Benchmark | (count) | Mode | Cnt | Score | Error | Units | > | --------------------------------------------------- | ---------- | --------- | ---- | ----------- | --------- | ---------- | > | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 15 | 798.365 | 35.486 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 15 | 677.756 | 46.619 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 15 | 552.781 | 27.143 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 15 | 429.304 | 12.518 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 15 | 166.738 | 0.935 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 15 | 25.060 | 0.034 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 15 | 6.196 | 0.030 | ops/ms | Performance comparison for disabling/enabling Zba on StarFive VisionFive 2 board: `-XX:-UseZba`: | Benchmark | (count) | Mode | Cnt | Score | Error | Units | | 
--------------------------------------------------- | ---------- | ------- | ----- | ---------- | -------- | --------- | | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 12 | 3563.320 | 3.326 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 12 | 1928.837 | 2.234 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 12 | 1005.273 | 1.953 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 12 | 512.550 | 1.718 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 12 | 130.396 | 0.341 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 12 | 16.319 | 0.073 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 12 | 3.913 | 0.011 | ops/ms | `-XX:+UseZba`: | Benchmark | (count) | Mode | Cnt | Score | Error | Units | | --------------------------------------------------- | ---------- | ------- | -------- | -------- | -------- | ---------- | | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 12 | 4206.654 | 0.547 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 12 | 2308.843 | 3.565 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 12 | 1214.727 | 0.305 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 12 | 623.173 | 0.651 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 12 | 158.965 | 0.376 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 12 | 19.934 | 0.055 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 12 | 4.730 | 0.007 | ops/ms | ------------- PR Comment: https://git.openjdk.org/jdk/pull/17046#issuecomment-1850364667 From jvernee at openjdk.org Mon Dec 11 18:38:55 2023 From: jvernee at openjdk.org (Jorn Vernee) Date: Mon, 11 Dec 2023 18:38:55 GMT Subject: RFR: 8320310: CompiledMethod::has_monitors flag can be incorrect [v4] In-Reply-To: <4efExybeWDkEbcsckI1Qdz8kpYFqd-Rbmt7oiWz5qlo=.d8d38d0e-affa-48dc-b963-45f958041c4e@github.com> References: <4efExybeWDkEbcsckI1Qdz8kpYFqd-Rbmt7oiWz5qlo=.d8d38d0e-affa-48dc-b963-45f958041c4e@github.com> 
Message-ID: > Currently, the `CompiledMethod::has_monitors` flag is set when either a `monitorenter` is parsed by C1, and `monitorexit` is parsed by C1 or C2 during method compilation. However, not necessarily every bytecode of a method is parsed, which means that we could miss all `monitorenter`/`monitorexit` byte codes in a method, while it actually does use monitors. This can lead to situations where a thread holds a monitor, but `has_monitors` for all frames is set to `false`, leading to an assertion failure in 'freeze_internal' in continuationFreezeThaw.cpp: > > assert(monitors_on_stack(current) == ((current->held_monitor_count() - current->jni_monitor_count()) > 0), > "Held monitor count and locks on stack invariant: " INT64_FORMAT " JNI: " INT64_FORMAT, (int64_t)current->held_monitor_count(), (int64_t)current->jni_monitor_count()); > > The proposed fix is to rely on `Method::has_monitor_bytecodes` to set the `has_monitors` flag when compiling, which is immune to issues where not all byte codes of a method are parsed during compilation. We can follow the pattern established for `has_reserved_stack_access`, which is similar. > > Note that this PR is based on: https://github.com/openjdk/jdk/pull/16416 which disables the assertion. The goal of this PR is to fix the issue, and then re-enable the assertion. 
> > Testing: Tier 1-4, `java/lang/Thread/virtual/stress/PinALot.java` Jorn Vernee has updated the pull request incrementally with one additional commit since the last revision: re-enable assert again ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16799/files - new: https://git.openjdk.org/jdk/pull/16799/files/85b2d662..eb7f0f5a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16799&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16799&range=02-03 Stats: 17 lines in 1 file changed: 0 ins; 0 del; 17 mod Patch: https://git.openjdk.org/jdk/pull/16799.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16799/head:pull/16799 PR: https://git.openjdk.org/jdk/pull/16799 From never at openjdk.org Mon Dec 11 22:01:58 2023 From: never at openjdk.org (Tom Rodriguez) Date: Mon, 11 Dec 2023 22:01:58 GMT Subject: RFR: 8321288: [JVMCI] HotSpotJVMCIRuntime doesn't clean up WeakReferences in resolvedJavaTypes [v2] In-Reply-To: <-_CpkWzu-kr4DjL_t7tespZcMBCFz3QQmr4-KsHqjMg=.aa6d1624-610e-41a4-aad3-a714f2c168a3@github.com> References: <-_CpkWzu-kr4DjL_t7tespZcMBCFz3QQmr4-KsHqjMg=.aa6d1624-610e-41a4-aad3-a714f2c168a3@github.com> Message-ID: > HotSpotJVMCIRuntime.resolvedJavaTypes implements a weak value map but is lacking code to clean out cleared weak references. In normal mixed execution this isn't likely to get big and generally isolates are shut down frequently so this doesn't lead to problems. In Xcomp mode with tests that stress unloading this becomes more problematic. In the worst case it still doesn't lead to large heaps but does make the idle heap larger than required. > > This PR adds ReferenceQueue based cleaning of reclaimed values. Testing in the context of a long running isolate shows that they are no longer accumulating. Tom Rodriguez has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains three additional commits since the last revision: - Merge remote-tracking branch 'origin/master' into tkr-clean-weak - Comment and types improvements - 8321288: [JVMCI] HotSpotJVMCIRuntime doesn't clean up WeakReferences in resolvedJavaTypes ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16981/files - new: https://git.openjdk.org/jdk/pull/16981/files/e6a60ed0..34c575a7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16981&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16981&range=00-01 Stats: 47685 lines in 697 files changed: 21763 ins; 23717 del; 2205 mod Patch: https://git.openjdk.org/jdk/pull/16981.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16981/head:pull/16981 PR: https://git.openjdk.org/jdk/pull/16981 From never at openjdk.org Mon Dec 11 22:01:58 2023 From: never at openjdk.org (Tom Rodriguez) Date: Mon, 11 Dec 2023 22:01:58 GMT Subject: RFR: 8321288: [JVMCI] HotSpotJVMCIRuntime doesn't clean up WeakReferences in resolvedJavaTypes In-Reply-To: <-_CpkWzu-kr4DjL_t7tespZcMBCFz3QQmr4-KsHqjMg=.aa6d1624-610e-41a4-aad3-a714f2c168a3@github.com> References: <-_CpkWzu-kr4DjL_t7tespZcMBCFz3QQmr4-KsHqjMg=.aa6d1624-610e-41a4-aad3-a714f2c168a3@github.com> Message-ID: On Tue, 5 Dec 2023 19:00:51 GMT, Tom Rodriguez wrote: > HotSpotJVMCIRuntime.resolvedJavaTypes implements a weak value map but is lacking code to clean out cleared weak references. In normal mixed execution this isn't likely to get big and generally isolates are shut down frequently so this doesn't lead to problems. In Xcomp mode with tests that stress unloading this becomes more problematic. In the worst case it still doesn't lead to large heaps but does make the idle heap larger than required. > > This PR adds ReferenceQueue based cleaning of reclaimed values. Testing in the context of a long running isolate shows that they are no longer accumulating. I pushed some expanded comments and renames.
I also adjusted some types that were weaker than the actual type which removed a bit of casting. Testing was clean. ------------- PR Comment: https://git.openjdk.org/jdk/pull/16981#issuecomment-1850957119 From vlivanov at openjdk.org Tue Dec 12 00:17:14 2023 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Tue, 12 Dec 2023 00:17:14 GMT Subject: RFR: 8320310: CompiledMethod::has_monitors flag can be incorrect [v4] In-Reply-To: References: <4efExybeWDkEbcsckI1Qdz8kpYFqd-Rbmt7oiWz5qlo=.d8d38d0e-affa-48dc-b963-45f958041c4e@github.com> Message-ID: On Mon, 11 Dec 2023 18:38:55 GMT, Jorn Vernee wrote: >> Currently, the `CompiledMethod::has_monitors` flag is set when either a `monitorenter` is parsed by C1, and `monitorexit` is parsed by C1 or C2 during method compilation. However, not necessarily every bytecode of a method is parsed, which means that we could miss all `monitorenter`/`monitorexit` byte codes in a method, while it actually does use monitors. This can lead to situations where a thread holds a monitor, but `has_monitors` for all frames is set to `false`, leading to an assertion failure in 'freeze_internal' in continuationFreezeThaw.cpp: >> >> assert(monitors_on_stack(current) == ((current->held_monitor_count() - current->jni_monitor_count()) > 0), >> "Held monitor count and locks on stack invariant: " INT64_FORMAT " JNI: " INT64_FORMAT, (int64_t)current->held_monitor_count(), (int64_t)current->jni_monitor_count()); >> >> The proposed fix is to rely on `Method::has_monitor_bytecodes` to set the `has_monitors` flag when compiling, which is immune to issues where not all byte codes of a method are parsed during compilation. We can follow the pattern established for `has_reserved_stack_access`, which is similar. >> >> Note that this PR is based on: https://github.com/openjdk/jdk/pull/16416 which disables the assertion. The goal of this PR is to fix the issue, and then re-enable the assertion. 
>> >> Testing: Tier 1-4, `java/lang/Thread/virtual/stress/PinALot.java` > > Jorn Vernee has updated the pull request incrementally with one additional commit since the last revision: > > re-enable assert again Looks good. ------------- Marked as reviewed by vlivanov (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/16799#pullrequestreview-1776382808 From duke at openjdk.org Tue Dec 12 00:42:46 2023 From: duke at openjdk.org (Joshua Cao) Date: Tue, 12 Dec 2023 00:42:46 GMT Subject: RFR: 8321823: Remove redundant PhaseGVN transform and transform_no_reclaim Message-ID: `PhaseGVN::transform` is just a one line wrapper around `PhaseGVN::transform_no_reclaim`. Looking at the history, they had different functionality in 2008, but since have become the same thing. We prefer to keep `PhaseGVN::transform` because it has hundreds of callsites, while the other only has a few callsites that are shown in the PR. Passes tier1 locally on my Linux machine. ------------- Commit messages: - 8321823: Remove redundant PhaseGVN transform and transform_no_reclaim Changes: https://git.openjdk.org/jdk/pull/17071/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=17071&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8321823 Stats: 20 lines in 5 files changed: 0 ins; 8 del; 12 mod Patch: https://git.openjdk.org/jdk/pull/17071.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17071/head:pull/17071 PR: https://git.openjdk.org/jdk/pull/17071 From chagedorn at openjdk.org Tue Dec 12 08:22:34 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 12 Dec 2023 08:22:34 GMT Subject: RFR: 8321823: Remove redundant PhaseGVN transform and transform_no_reclaim In-Reply-To: References: Message-ID: On Tue, 12 Dec 2023 00:37:19 GMT, Joshua Cao wrote: > `PhaseGVN::transform` is just a one line wrapper around `PhaseGVN::transform_no_reclaim`. Looking at the history, they had different functionality in 2008, but since have become the same thing. 
We prefer to keep `PhaseGVN::transform` because it has hundreds of callsites, while the other only has a few callsites that are shown in the PR. > > Passes tier1 locally on my Linux machine. The title of the RFE is a little bit misleading as it suggests that both are redundant. Maybe you can change that into just mentioning that `PhaseGVN::transform_no_reclaim()` is redundant. Otherwise, looks good! src/hotspot/share/opto/phaseX.cpp line 676: > 674: // Return a node which computes the same function as this node, but > 675: // in a faster or cheaper fashion. > 676: Node *PhaseGVN::transform(Node *n) { While at it, you could also fix the asterisk positions: Suggestion: Node* PhaseGVN::transform(Node* n) { src/hotspot/share/opto/phaseX.hpp line 418: > 416: // Return a node which computes the same function as this node, but > 417: // in a faster or cheaper fashion. > 418: Node *transform(Node *n); Same here and maybe add a new line afterward to separate it better from `record_for_igvn()`. Suggestion: Node* transform(Node* n); ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/17071#pullrequestreview-1776879595 PR Review Comment: https://git.openjdk.org/jdk/pull/17071#discussion_r1423604333 PR Review Comment: https://git.openjdk.org/jdk/pull/17071#discussion_r1423606056 From rehn at openjdk.org Tue Dec 12 09:43:33 2023 From: rehn at openjdk.org (Robbin Ehn) Date: Tue, 12 Dec 2023 09:43:33 GMT Subject: RFR: 8317721: RISC-V: Implement CRC32 intrinsic In-Reply-To: References: Message-ID: On Mon, 11 Dec 2023 01:59:33 GMT, ArsenyBochkarev wrote: > Hi everyone! Please review this port of [AArch64](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp#L4224) `_updateBytesCRC32`, `_updateByteBufferCRC32` and `_updateCRC32` intrinsics. This patch introduces only the plain (non-vectorized, no Zbc) version. > > ### Correctness checks > > Tier 1/2 tests are ok. 
> > ### Performance results on T-Head board > > #### Results for enabled intrinsic: > > Used test is `test/micro/org/openjdk/bench/java/util/TestCRC32.java` > > | Benchmark | (count) | Mode | Cnt | Score | Error | Units | > | --- | ---- | ----- | --- | ---- | --- | ---- | > | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 24 | 3730.929 | 37.773 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 24 | 2126.673 | 2.032 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 24 | 1134.330 | 6.714 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 24 | 584.017 | 2.267 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 24 | 151.173 | 0.346 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 24 | 19.113 | 0.008 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 24 | 4.647 | 0.022 | ops/ms | > > #### Results for disabled intrinsic: > > | Benchmark | (count) | Mode | Cnt | Score | Error | Units | > | --------------------------------------------------- | ---------- | --------- | ---- | ----------- | --------- | ---------- | > | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 15 | 798.365 | 35.486 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 15 | 677.756 | 46.619 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 15 | 552.781 | 27.143 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 15 | 429.304 | 12.518 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 15 | 166.738 | 0.935 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 15 | 25.060 | 0.034 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 15 | 6.196 | 0.030 | ops/ms | Thanks! (not a review, just an ack) There are two other version we probably need also, using carry-less-multiplication. There is a scalar clmul in Zbc and there is a vclmul in vector. 
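For reference, the checksum that the plain (non-clmul) intrinsic has to reproduce can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the RISC-V stub from the patch: it is the textbook byte-at-a-time table-driven CRC-32 with the reflected polynomial 0xEDB88320, and the class name `PlainCrc32Sketch` and its `update` method are made up for this example. A clmul-based version would have to return the same values.

```java
// Illustrative sketch only; not HotSpot code. All names are hypothetical.
final class PlainCrc32Sketch {
    // 256-entry lookup table for the reflected CRC-32 polynomial 0xEDB88320.
    private static final int[] TABLE = new int[256];

    static {
        for (int i = 0; i < 256; i++) {
            int c = i;
            for (int k = 0; k < 8; k++) {
                // Shift right by one bit, folding in the polynomial when a 1 falls out.
                c = (c & 1) != 0 ? (c >>> 1) ^ 0xEDB88320 : c >>> 1;
            }
            TABLE[i] = c;
        }
    }

    // Continues a CRC-32 over buf[off .. off + len); pass 0 as crc to start.
    static int update(int crc, byte[] buf, int off, int len) {
        int c = ~crc; // the running value is kept bit-inverted while bytes are processed
        for (int i = off; i < off + len; i++) {
            c = TABLE[(c ^ buf[i]) & 0xFF] ^ (c >>> 8);
        }
        return ~c; // invert back before returning
    }
}
```

The result can be compared against `java.util.zip.CRC32` by masking it to an unsigned value, for example `PlainCrc32Sketch.update(0, data, 0, data.length) & 0xFFFFFFFFL`.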
------------- PR Comment: https://git.openjdk.org/jdk/pull/17046#issuecomment-1851659207 From vkempik at openjdk.org Tue Dec 12 09:58:24 2023 From: vkempik at openjdk.org (Vladimir Kempik) Date: Tue, 12 Dec 2023 09:58:24 GMT Subject: RFR: 8317721: RISC-V: Implement CRC32 intrinsic In-Reply-To: References: Message-ID: On Mon, 11 Dec 2023 01:59:33 GMT, ArsenyBochkarev wrote: > Hi everyone! Please review this port of [AArch64](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp#L4224) `_updateBytesCRC32`, `_updateByteBufferCRC32` and `_updateCRC32` intrinsics. This patch introduces only the plain (non-vectorized, no Zbc) version. > > ### Correctness checks > > Tier 1/2 tests are ok. > > ### Performance results on T-Head board > > #### Results for enabled intrinsic: > > Used test is `test/micro/org/openjdk/bench/java/util/TestCRC32.java` > > | Benchmark | (count) | Mode | Cnt | Score | Error | Units | > | --- | ---- | ----- | --- | ---- | --- | ---- | > | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 24 | 3730.929 | 37.773 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 24 | 2126.673 | 2.032 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 24 | 1134.330 | 6.714 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 24 | 584.017 | 2.267 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 24 | 151.173 | 0.346 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 24 | 19.113 | 0.008 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 24 | 4.647 | 0.022 | ops/ms | > > #### Results for disabled intrinsic: > > | Benchmark | (count) | Mode | Cnt | Score | Error | Units | > | --------------------------------------------------- | ---------- | --------- | ---- | ----------- | --------- | ---------- | > | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 15 | 798.365 | 35.486 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 15 | 677.756 | 46.619 | ops/ms | 
> | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 15 | 552.781 | 27.143 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 15 | 429.304 | 12.518 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 15 | 166.738 | 0.935 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 15 | 25.060 | 0.034 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 15 | 6.196 | 0.030 | ops/ms | src/hotspot/cpu/riscv/macroAssembler_riscv.cpp line 3753: > 3751: mv(tmp5, bits32); > 3752: notr(crc, crc); > 3753: andr(crc, crc, tmp5); can use andn(crc, tmp, crc); here, so get some accel when Zbb present ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17046#discussion_r1423747972 From dnsimon at openjdk.org Tue Dec 12 10:19:25 2023 From: dnsimon at openjdk.org (Doug Simon) Date: Tue, 12 Dec 2023 10:19:25 GMT Subject: RFR: 8321288: [JVMCI] HotSpotJVMCIRuntime doesn't clean up WeakReferences in resolvedJavaTypes [v2] In-Reply-To: References: <-_CpkWzu-kr4DjL_t7tespZcMBCFz3QQmr4-KsHqjMg=.aa6d1624-610e-41a4-aad3-a714f2c168a3@github.com> Message-ID: On Mon, 11 Dec 2023 22:01:58 GMT, Tom Rodriguez wrote: >> HotSpotJVMCIRuntime.resolvedJavaTypes implements a weak value map but is lacking code to clean out cleared weak references. In normal mixed execution this isn't likely to get big and generally isolates are shutdown frequently so this doesn't lead to problems. In Xcomp mode with tests that stress unloading this becomes more problematic. In the worst case is still doesn't lead to large heaps but does make the idle heap larger than required. >> >> This PR adds ReferenceQueue based cleaning of reclaimed values. Testing in the context of a long running isolate shows that they are no longer accumulating. > > Tom Rodriguez has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains three additional commits since the last revision: > > - Merge remote-tracking branch 'origin/master' into tkr-clean-weak > - Comment and types improvements > - 8321288: [JVMCI] HotSpotJVMCIRuntime doesn't clean up WeakReferences in resolvedJavaTypes Marked as reviewed by dnsimon (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/16981#pullrequestreview-1777123297 From ihse at openjdk.org Tue Dec 12 15:45:57 2023 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Tue, 12 Dec 2023 15:45:57 GMT Subject: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v12] In-Reply-To: References: <_gSNXk0qGAtpY-WJ5OCHk_3-nuGrwwSn-ffK9f2TEcs=.40f785ba-83dd-40fe-8075-a7a7872ea600@github.com> <7_T6sM3wjbSzZ0ab9FsptbpPnlQ2J4NNctQNkdbDFdI=.b595a8cc-4b14-44c6-8319-00e68fea21c3@github.com> Message-ID: On Wed, 6 Dec 2023 17:20:10 GMT, Srinivas Vamsi Parasa wrote: >> Okay, then I guess I am fine with this. > > Thank you Magnus! @vamsi-parasa You said: > Made sure that OpenJDK builds without errors using both GCC 7.5 and GCC 6.4. but now we have https://bugs.openjdk.org/browse/JDK-8321688. Did you introduce any changes after you tested with GCC 7.5? It seems strange to me that the code simultaneously both works and not works with gcc 7.5. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16534#discussion_r1424200521 From qamai at openjdk.org Tue Dec 12 16:09:02 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Tue, 12 Dec 2023 16:09:02 GMT Subject: RFR: 8282365: Consolidate and improve division by constant idealizations [v36] In-Reply-To: References: Message-ID: <8jAVTKPhBgICFteWb4xcrQTDE0tr174sSuLTZzZbEfU=.1ef8eaa1-11dd-43f7-b180-1509a14c5cb9@github.com> > This patch implements idealisation for unsigned divisions to change a division by a constant into a series of multiplication and shift. 
I also change the idealisation of `DivI` to get a more efficient series when the magic constant overflows an int32. > > In general, the idea behind a signed division transformation is that for a positive constant `d`, we would need to find constants `c` and `m` so that: > > floor(x / d) = floor(x * c / 2**m) for 0 < x < 2**(N - 1) (1) > ceil(x / d) = floor(x * c / 2**m) + 1 for -2**(N - 1) <= x < 0 (2) > > The implementation in the original book takes into consideration that the machine may not be able to perform the full multiplication `x * c`, so the constant overflow and we need to add back the dividend as in `DivLNode::Ideal` cases. However, for int32 division, `x * c` cannot overflow an int64. As a result, it is always feasible to just calculate the product and shift the result. > > For unsigned multiplication, the situation is somewhat trickier because the condition needs to be twice as strong (the condition (1) and (2) above are mostly the same). This results in the magic constant `c` calculated based on the method presented in Hacker's Delight by Henry S. Warren, Jr. may overflow an uintN. For int division, we can depend on the theorem devised by Arch D. Robison in N-Bit Unsigned Division Via N-Bit Multiply-Add, which states that there exists either: > > c1 in uint32 and m1, such that floor(x / d) = floor(x * c1 / 2**m1) for 0 < x < 2**32 (3) > c2 in uint32 and m2, such that floor(x / d) = floor((x + 1) * c2 / 2**m2) for 0 < x < 2**32 (4) > > which means that either `x * c1` never overflows an uint64 or `(x + 1) * c2` never overflows an uint64. And we can perform a full multiplication. > > For longs, there is no way to do a full multiplication so we do some basic transformations to achieve a computable formula. The details I have written as comments in the overflow case. > > More tests are added to cover the possible patterns. > > Please take a look and have some reviews. Thank you very much. 
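A small worked instance of conditions (1) and (2) and of the unsigned case may help here. The sketch below is not taken from the patch: it hard-codes the well-known magic constants for `d = 3` (`c = 0x55555556`, `m = 32` for signed int32; `c = 0xAAAAAAAB`, `m = 33` for unsigned int32), and the function names `sdiv3`, `udiv3` and `check_div3_examples` are hypothetical. It assumes that `>>` on a negative `int64_t` is an arithmetic shift, which is guaranteed from C++20 and is what mainstream compilers do before that.

```cpp
#include <cassert>
#include <cstdint>

// Signed case, d = 3, N = 32: c = 0x55555556 and m = 32.
// The product x * c always fits in an int64, so no "add back the dividend"
// step is needed; negative inputs get the +1 from condition (2).
constexpr int32_t sdiv3(int32_t x) {
  const int64_t prod = static_cast<int64_t>(x) * INT64_C(0x55555556);
  // Assumption: arithmetic right shift, i.e. floor(prod / 2**32).
  const int32_t q = static_cast<int32_t>(prod >> 32);
  return q + (x < 0 ? 1 : 0);
}

// Unsigned case, d = 3, N = 32: c = 0xAAAAAAAB fits in a uint32 and m = 33,
// so x * c never overflows a uint64 (the situation described by (3)).
constexpr uint32_t udiv3(uint32_t x) {
  return static_cast<uint32_t>(
      (static_cast<uint64_t>(x) * UINT64_C(0xAAAAAAAB)) >> 33);
}

static_assert(sdiv3(7) == 2, "positive dividend");
static_assert(sdiv3(-7) == -2, "negative dividend rounds toward zero");
static_assert(udiv3(10u) == 3u, "unsigned dividend");

// Runtime spot checks against the built-in division, including the extremes.
inline void check_div3_examples() {
  const int32_t signed_samples[] = {0, 1, 2, 3, -1, -3, -4, INT32_MAX, INT32_MIN};
  for (int32_t x : signed_samples) {
    assert(sdiv3(x) == x / 3);
  }
  const uint32_t unsigned_samples[] = {0u, 1u, 2u, 3u, 4u, 0x7FFFFFFFu, 0xFFFFFFFFu};
  for (uint32_t x : unsigned_samples) {
    assert(udiv3(x) == x / 3u);
  }
}
```

Calling `check_div3_examples()` compares both helpers with the compiler's own `/ 3`. Divisors whose magic constant does not fit in N bits are exactly the cases where the description above falls back to the `(x + 1) * c2` form.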
Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision:

  address reviews

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/9947/files
  - new: https://git.openjdk.org/jdk/pull/9947/files/e8b54dad..75a2c172

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=9947&range=35
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9947&range=34-35

  Stats: 85 lines in 3 files changed: 30 ins; 5 del; 50 mod
  Patch: https://git.openjdk.org/jdk/pull/9947.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/9947/head:pull/9947

PR: https://git.openjdk.org/jdk/pull/9947

From qamai at openjdk.org Tue Dec 12 16:09:06 2023
From: qamai at openjdk.org (Quan Anh Mai)
Date: Tue, 12 Dec 2023 16:09:06 GMT
Subject: RFR: 8282365: Consolidate and improve division by constant idealizations [v35]
In-Reply-To:
References:
Message-ID:

On Thu, 7 Dec 2023 22:26:54 GMT, Kim Barrett wrote:

>> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   missing include
>
> src/hotspot/share/utilities/globalDefinitions.hpp line 1107:
>
>> 1105:   using U = std::make_unsigned_t;
>> 1106:   return (x >= 0) ? x : U(0) - U(x);
>> 1107: }
>
> I understand what this change to ABS is doing, though it's not obvious. (Dodging overflow UB
> for -x when x is the minimum value of a signed integral type.) I'm not entirely sure that's a wise move.
>
> As written this will trigger `-Wconversion` warnings someday (maybe). static_casting the subtraction
> result to T will eliminate that concern.
>
> However, this is an API change. The previous definition worked for floating point types, while this
> change does not. (std::make_unsigned requires T be an integral or enum, but not bool, type.)
>
> I also don't understand why this change is part of this PR.
>
> So I'm inclined to say no to this change without some compelling rationale.

Sure, I have reverted that also.
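For readers following the ABS discussion, here is a minimal standalone sketch of the "negate in the unsigned domain" idea and of the explicit cast that was suggested to avoid `-Wconversion`. It is not the code from the PR (that change was reverted); the names `unsigned_abs`, `wrapping_abs` and `check_abs_examples` are hypothetical, and, as noted in the review, the approach only applies to integral types, not to floating point.

```cpp
#include <cassert>
#include <limits>
#include <type_traits>

// Absolute value computed with unsigned arithmetic, so that the minimum value
// of a signed type never goes through the undefined "-x" overflow.
template <typename T>
constexpr std::make_unsigned_t<T> unsigned_abs(T x) {
  static_assert(std::is_integral<T>::value && std::is_signed<T>::value,
                "sketch only covers signed integral types");
  using U = std::make_unsigned_t<T>;
  // The outer static_cast makes the narrowing explicit for small types such as
  // short, where U(0) - U(x) is computed as an int after integral promotion.
  return (x >= 0) ? static_cast<U>(x)
                  : static_cast<U>(U(0) - static_cast<U>(x));
}

// Same computation, converted back to T as in the reviewed snippet.
// Assumption: the unsigned-to-signed conversion wraps (guaranteed from C++20,
// implementation-defined before), so the minimum value maps to itself.
template <typename T>
constexpr T wrapping_abs(T x) {
  return static_cast<T>(unsigned_abs(x));
}

static_assert(unsigned_abs(-5) == 5u, "ordinary negative value");
static_assert(unsigned_abs(std::numeric_limits<int>::min()) ==
                  static_cast<unsigned>(std::numeric_limits<int>::max()) + 1u,
              "minimum value is representable in the unsigned type");

inline void check_abs_examples() {
  assert(unsigned_abs(static_cast<short>(-7)) == 7);
  assert(wrapping_abs(-42) == 42);
  assert(wrapping_abs(42) == 42);
}
```

Returning the unsigned type keeps the result of `unsigned_abs(min)` meaningful; converting back to `T` is where the behaviour for the minimum value becomes a design decision, which is part of what the review questions.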
-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/9947#discussion_r1424233574

From qamai at openjdk.org Tue Dec 12 16:15:19 2023
From: qamai at openjdk.org (Quan Anh Mai)
Date: Tue, 12 Dec 2023 16:15:19 GMT
Subject: RFR: 8282365: Consolidate and improve division by constant idealizations [v33]
In-Reply-To:
References:
Message-ID:

On Tue, 28 Nov 2023 19:55:16 GMT, Kim Barrett wrote:

>> Not a review - mostly a comment for future development. I only became aware of
>> this PR recently (I don't track compiler PRs that closely), so didn't notice
>> the embedded separation of Java-semantics arithmetic from globalDefinitions. I
>> strongly approve of that separation, but wish it had been done in its own PR.
>> Kind of late for suggesting that now though. That change can (and hopefully
>> will) be followed by changes to remove the need to #include javaArithmetic.hpp
>> from globalDefinitions.hpp, since most code doesn't need javaArithmetic.hpp.
>> That's likely to take a while to accomplish, but would be a nice goal to
>> achieve. This goes along with the ongoing cleanup of replacing
>> unnecessary/inappropriate uses of JNI integral types with "native" types.
>
>> @kimbarrett Thanks for the suggestion, I will create a tracking issue right after the integration of this.
>
> That's not what @stefank and I were suggesting. Instead we'd like the extraction of javaArithmetic to be a separate
> change that is integrated first, with the uses here coming after. Alternatively, leave out that extraction from this
> change (with corresponding adjustments of includes and some additions to (probably) globalDefinitions), and do
> the extraction as a followup. The point being that the two parts are largely independent.

@kimbarrett Thanks for your review, hope I have addressed your concerns.
------------- PR Comment: https://git.openjdk.org/jdk/pull/9947#issuecomment-1852349886 From qamai at openjdk.org Tue Dec 12 16:15:23 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Tue, 12 Dec 2023 16:15:23 GMT Subject: RFR: 8282365: Consolidate and improve division by constant idealizations [v35] In-Reply-To: References: Message-ID: On Thu, 7 Dec 2023 21:38:11 GMT, Kim Barrett wrote: >> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: >> >> missing include > > src/hotspot/share/opto/divnode.hpp line 40: > >> 38: template >> 39: void magic_divide_constants(T d, T N_neg, T N_pos, juint min_s, T& c, bool& c_ovf, juint& s); >> 40: void magic_divide_constants_round_down(juint d, juint& c, juint& s); > > The definitions of these are in new file divconstants.cpp. So I think these should be in divconstants.hpp. I see there are multiple cases where a header is defined in multiple source files, and these are used exclusively for `DivNode`s so putting them here seems logical. > test/hotspot/gtest/opto/test_constant_division.cpp line 29: > >> 27: #include >> 28: #include >> 29: #include > > We mostly don't use C++ standard library facilities in HotSpot, even in tests; see the style guide. > is permitted. We have GrowableArray instead of . And we have os::random, > unless this really needs something better (seems unlikely, other than needing to deal with wider > types than int.) Also, stdlib includes at the end again (except I think unittest.hpp is supposed to > _really_ be last.) I see, for `vector` I did not look it up carefully and thought that we need a VM for `GrowableArray`s, and `` is used mainly for wider types and custom bounds. I have changed them to use what are offered by OpenJDK instead. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/9947#discussion_r1424240505 PR Review Comment: https://git.openjdk.org/jdk/pull/9947#discussion_r1424238521 From qamai at openjdk.org Tue Dec 12 16:15:17 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Tue, 12 Dec 2023 16:15:17 GMT Subject: RFR: 8282365: Consolidate and improve division by constant idealizations [v37] In-Reply-To: References: Message-ID: > This patch implements idealisation for unsigned divisions to change a division by a constant into a series of multiplication and shift. I also change the idealisation of `DivI` to get a more efficient series when the magic constant overflows an int32. > > In general, the idea behind a signed division transformation is that for a positive constant `d`, we would need to find constants `c` and `m` so that: > > floor(x / d) = floor(x * c / 2**m) for 0 < x < 2**(N - 1) (1) > ceil(x / d) = floor(x * c / 2**m) + 1 for -2**(N - 1) <= x < 0 (2) > > The implementation in the original book takes into consideration that the machine may not be able to perform the full multiplication `x * c`, so the constant overflow and we need to add back the dividend as in `DivLNode::Ideal` cases. However, for int32 division, `x * c` cannot overflow an int64. As a result, it is always feasible to just calculate the product and shift the result. > > For unsigned multiplication, the situation is somewhat trickier because the condition needs to be twice as strong (the condition (1) and (2) above are mostly the same). This results in the magic constant `c` calculated based on the method presented in Hacker's Delight by Henry S. Warren, Jr. may overflow an uintN. For int division, we can depend on the theorem devised by Arch D. 
Robison in N-Bit Unsigned Division Via N-Bit Multiply-Add, which states that there exists either: > > c1 in uint32 and m1, such that floor(x / d) = floor(x * c1 / 2**m1) for 0 < x < 2**32 (3) > c2 in uint32 and m2, such that floor(x / d) = floor((x + 1) * c2 / 2**m2) for 0 < x < 2**32 (4) > > which means that either `x * c1` never overflows an uint64 or `(x + 1) * c2` never overflows an uint64. And we can perform a full multiplication. > > For longs, there is no way to do a full multiplication so we do some basic transformations to achieve a computable formula. The details I have written as comments in the overflow case. > > More tests are added to cover the possible patterns. > > Please take a look and have some reviews. Thank you very much. Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: remove static ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9947/files - new: https://git.openjdk.org/jdk/pull/9947/files/75a2c172..567eed97 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9947&range=36 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9947&range=35-36 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/9947.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/9947/head:pull/9947 PR: https://git.openjdk.org/jdk/pull/9947 From qamai at openjdk.org Tue Dec 12 16:39:17 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Tue, 12 Dec 2023 16:39:17 GMT Subject: RFR: 8319451: PhaseIdealLoop::conditional_move is too conservative [v2] In-Reply-To: References: Message-ID: > Hi, > > When transforming a Phi into a CMove, the threshold is set to be approximately BlockLayoutMinDiamondPercentage, the reason is given: > > // BlockLayoutByFrequency optimization moves infrequent branch > // from hot path. No point in CMOV'ing in such case > > This sets the default value of the threshold to be around 18%, which is too conservative. 
The reason also does not make a lot of sense since the important property which makes jumping expensive is not code layout. We should remove this. > > Please kindly review, thank you very much. Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: - change freq to double - Merge branch 'master' into cmovethreshold - adjust threshold, add benchmark ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16524/files - new: https://git.openjdk.org/jdk/pull/16524/files/4513cbef..e494d6df Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16524&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16524&range=00-01 Stats: 815599 lines in 4990 files changed: 184434 ins; 547191 del; 83974 mod Patch: https://git.openjdk.org/jdk/pull/16524.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16524/head:pull/16524 PR: https://git.openjdk.org/jdk/pull/16524 From qamai at openjdk.org Tue Dec 12 16:39:20 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Tue, 12 Dec 2023 16:39:20 GMT Subject: RFR: 8319451: PhaseIdealLoop::conditional_move is too conservative [v2] In-Reply-To: References: Message-ID: On Thu, 9 Nov 2023 14:29:06 GMT, Claes Redestad wrote: >> Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains three additional commits since the last revision: >> >> - change freq to double >> - Merge branch 'master' into cmovethreshold >> - adjust threshold, add benchmark > > test/micro/org/openjdk/bench/vm/compiler/CMove.java line 40: > >> 38: >> 39: @Param({"3", "6", "10", "20", "30", "60", "100", "200", "300", "600"}) >> 40: int freq; > > That `freq` is expressed in "occurrences per thousand" was only obvious after reading the code - perhaps a probability between 0 and 1 (with the appropriate adjustment to `r.nextFloat() < freq` below) would be slightly more intuitive? > Suggestion: > > @Param({"0.003", "0.006", "0.01", "0.02", "0.03", "0.06", "0.1", "0.2", "0.3", "0.6"}) > float freq; @cl4es You are right, I have fixed that. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16524#discussion_r1424279741 From vkempik at openjdk.org Tue Dec 12 16:41:33 2023 From: vkempik at openjdk.org (Vladimir Kempik) Date: Tue, 12 Dec 2023 16:41:33 GMT Subject: RFR: 8317721: RISC-V: Implement CRC32 intrinsic In-Reply-To: References: Message-ID: On Mon, 11 Dec 2023 01:59:33 GMT, ArsenyBochkarev wrote: > Hi everyone! Please review this port of [AArch64](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp#L4224) `_updateBytesCRC32`, `_updateByteBufferCRC32` and `_updateCRC32` intrinsics. This patch introduces only the plain (non-vectorized, no Zbc) version. > > ### Correctness checks > > Tier 1/2 tests are ok. 
> > ### Performance results on T-Head board > > #### Results for enabled intrinsic: > > Used test is `test/micro/org/openjdk/bench/java/util/TestCRC32.java` > > | Benchmark | (count) | Mode | Cnt | Score | Error | Units | > | --- | ---- | ----- | --- | ---- | --- | ---- | > | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 24 | 3730.929 | 37.773 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 24 | 2126.673 | 2.032 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 24 | 1134.330 | 6.714 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 24 | 584.017 | 2.267 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 24 | 151.173 | 0.346 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 24 | 19.113 | 0.008 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 24 | 4.647 | 0.022 | ops/ms | > > #### Results for disabled intrinsic: > > | Benchmark | (count) | Mode | Cnt | Score | Error | Units | > | --------------------------------------------------- | ---------- | --------- | ---- | ----------- | --------- | ---------- | > | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 15 | 798.365 | 35.486 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 15 | 677.756 | 46.619 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 15 | 552.781 | 27.143 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 15 | 429.304 | 12.518 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 15 | 166.738 | 0.935 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 15 | 25.060 | 0.034 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 15 | 6.196 | 0.030 | ops/ms | src/hotspot/cpu/riscv/macroAssembler_riscv.cpp line 3805: > 3803: bind(L_exit); > 3804: notr(crc, crc); > 3805: andr(crc, crc, tmp5); same, andn can be used instead of not&and ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17046#discussion_r1424284508 From qamai at openjdk.org Tue Dec 
12 16:49:13 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Tue, 12 Dec 2023 16:49:13 GMT Subject: RFR: 8282365: Consolidate and improve division by constant idealizations [v38] In-Reply-To: References: Message-ID: > This patch implements idealisation for unsigned divisions to change a division by a constant into a series of multiplication and shift. I also change the idealisation of `DivI` to get a more efficient series when the magic constant overflows an int32. > > In general, the idea behind a signed division transformation is that for a positive constant `d`, we would need to find constants `c` and `m` so that: > > floor(x / d) = floor(x * c / 2**m) for 0 < x < 2**(N - 1) (1) > ceil(x / d) = floor(x * c / 2**m) + 1 for -2**(N - 1) <= x < 0 (2) > > The implementation in the original book takes into consideration that the machine may not be able to perform the full multiplication `x * c`, so the constant overflow and we need to add back the dividend as in `DivLNode::Ideal` cases. However, for int32 division, `x * c` cannot overflow an int64. As a result, it is always feasible to just calculate the product and shift the result. > > For unsigned multiplication, the situation is somewhat trickier because the condition needs to be twice as strong (the condition (1) and (2) above are mostly the same). This results in the magic constant `c` calculated based on the method presented in Hacker's Delight by Henry S. Warren, Jr. may overflow an uintN. For int division, we can depend on the theorem devised by Arch D. Robison in N-Bit Unsigned Division Via N-Bit Multiply-Add, which states that there exists either: > > c1 in uint32 and m1, such that floor(x / d) = floor(x * c1 / 2**m1) for 0 < x < 2**32 (3) > c2 in uint32 and m2, such that floor(x / d) = floor((x + 1) * c2 / 2**m2) for 0 < x < 2**32 (4) > > which means that either `x * c1` never overflows an uint64 or `(x + 1) * c2` never overflows an uint64. And we can perform a full multiplication. 
> > For longs, there is no way to do a full multiplication so we do some basic transformations to achieve a computable formula. The details I have written as comments in the overflow case. > > More tests are added to cover the possible patterns. > > Please take a look and have some reviews. Thank you very much. Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: missing include ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9947/files - new: https://git.openjdk.org/jdk/pull/9947/files/567eed97..a8389af1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9947&range=37 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9947&range=36-37 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/9947.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/9947/head:pull/9947 PR: https://git.openjdk.org/jdk/pull/9947 From duke at openjdk.org Tue Dec 12 17:29:50 2023 From: duke at openjdk.org (Joshua Cao) Date: Tue, 12 Dec 2023 17:29:50 GMT Subject: RFR: 8321823: Remove redundant PhaseGVN transform and transform_no_reclaim [v2] In-Reply-To: References: Message-ID: <9TRF6bRfXq0gCy_39OitPXgejHyc_V73-sr_L-AC_gU=.803dfaa1-2549-4b86-ad1c-8e9140046006@github.com> > `PhaseGVN::transform` is just a one line wrapper around `PhaseGVN::transform_no_reclaim`. Looking at the history, they had different functionality in 2008, but since have become the same thing. We prefer to keep `PhaseGVN::transform` because it has hundreds of callsites, while the other only has a few callsites that are shown in the PR. > > Passes tier1 locally on my Linux machine. 
Joshua Cao has updated the pull request incrementally with one additional commit since the last revision: Fix formatting ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17071/files - new: https://git.openjdk.org/jdk/pull/17071/files/5a987f69..09225453 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17071&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17071&range=00-01 Stats: 3 lines in 2 files changed: 1 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/17071.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17071/head:pull/17071 PR: https://git.openjdk.org/jdk/pull/17071 From duke at openjdk.org Tue Dec 12 17:36:55 2023 From: duke at openjdk.org (Srinivas Vamsi Parasa) Date: Tue, 12 Dec 2023 17:36:55 GMT Subject: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v12] In-Reply-To: References: <_gSNXk0qGAtpY-WJ5OCHk_3-nuGrwwSn-ffK9f2TEcs=.40f785ba-83dd-40fe-8075-a7a7872ea600@github.com> <7_T6sM3wjbSzZ0ab9FsptbpPnlQ2J4NNctQNkdbDFdI=.b595a8cc-4b14-44c6-8319-00e68fea21c3@github.com> Message-ID: On Tue, 12 Dec 2023 15:42:09 GMT, Magnus Ihse Bursie wrote: >> Thank you Magnus! > > @vamsi-parasa You said: >> Made sure that OpenJDK builds without errors using both GCC 7.5 and GCC 6.4. > > but now we have https://bugs.openjdk.org/browse/JDK-8321688. Did you introduce any changes after you tested with GCC 7.5? It seems strange to me that the code simultaneously both works and not works with gcc 7.5. Hi Magnus (@magicus), did a fresh pull of the OpenJDK and was able to build it successfully (without any errors) using GCC 7.5.0 on Ubuntu Linux machine. (I am on vacation till Jan7th, 2024. 
Our team will look into this issue) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16534#discussion_r1424352122 From epeter at openjdk.org Tue Dec 12 18:27:55 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 12 Dec 2023 18:27:55 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v41] In-Reply-To: References: Message-ID: > I want to push this in JDK23. > After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). > > To calm your nerves: most of the changes are in auto-generated tests, and tests in general. > > **Context** > > `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). > > Alignment is split into two tasks: > - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. > - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width. > > **Problem** > > I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). > In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. > Thanks @fg1417 for confirming this! > Hence, we need to fix the alignment correctness checks. 
> > While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. > > **Problem Details** > > Reproducer: > > > static void test(short[] a, short[] b, short mask) { > for (int i = 0; i < RANGE; i+=8) { > // Problematic for AlignVector > b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 > > b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes > b[i+4] = (short)(a[i+4] & mask); > b[i+5] = (short)(a[i+5] & mask); > b[i+6] = (short)(a[i+6] & mask); > } > } > > > During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. > > This is problematic as shown in this example. We have references at index offset `0, 3, 4, 5, 6`, and by... Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 95 commits: - Merge branch 'master' into JDK-8311586 - Refactoring: AlignmentSolution with nice subclasses (Christian's idea) - remove dead code - Merge branch 'master' into JDK-8311586 - finished creating AlignmentSolver - v2 WIP AlignmentSolver - wip AlignmentSolver - aw -> alignment_vector in AlignmentSolution - more const - made invar const - ... 
and 85 more: https://git.openjdk.org/jdk/compare/df4ed7ef...690acf9a ------------- Changes: https://git.openjdk.org/jdk/pull/14785/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=40 Stats: 8448 lines in 23 files changed: 7157 ins; 376 del; 915 mod Patch: https://git.openjdk.org/jdk/pull/14785.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14785/head:pull/14785 PR: https://git.openjdk.org/jdk/pull/14785 From fjiang at openjdk.org Wed Dec 13 01:12:46 2023 From: fjiang at openjdk.org (Feilong Jiang) Date: Wed, 13 Dec 2023 01:12:46 GMT Subject: RFR: 8317721: RISC-V: Implement CRC32 intrinsic In-Reply-To: References: Message-ID: <9TEIo2dROB23lR_5-zKZlnsCpMgFamUPOSL0bxBZu5Q=.3dcfe283-5b3e-4798-b781-100836b451b6@github.com> On Mon, 11 Dec 2023 01:59:33 GMT, ArsenyBochkarev wrote: > Hi everyone! Please review this port of [AArch64](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp#L4224) `_updateBytesCRC32`, `_updateByteBufferCRC32` and `_updateCRC32` intrinsics. This patch introduces only the plain (non-vectorized, no Zbc) version. > > ### Correctness checks > > Tier 1/2 tests are ok. 
> > ### Performance results on T-Head board > > #### Results for enabled intrinsic: > > Used test is `test/micro/org/openjdk/bench/java/util/TestCRC32.java` > > | Benchmark | (count) | Mode | Cnt | Score | Error | Units | > | --- | ---- | ----- | --- | ---- | --- | ---- | > | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 24 | 3730.929 | 37.773 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 24 | 2126.673 | 2.032 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 24 | 1134.330 | 6.714 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 24 | 584.017 | 2.267 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 24 | 151.173 | 0.346 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 24 | 19.113 | 0.008 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 24 | 4.647 | 0.022 | ops/ms | > > #### Results for disabled intrinsic: > > | Benchmark | (count) | Mode | Cnt | Score | Error | Units | > | --------------------------------------------------- | ---------- | --------- | ---- | ----------- | --------- | ---------- | > | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 15 | 798.365 | 35.486 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 15 | 677.756 | 46.619 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 15 | 552.781 | 27.143 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 15 | 429.304 | 12.518 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 15 | 166.738 | 0.935 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 15 | 25.060 | 0.034 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 15 | 6.196 | 0.030 | ops/ms | Changes requested by fjiang (Committer). 
src/hotspot/cpu/riscv/c1_LIRAssembler_riscv.cpp line 1642:

> 1640:   __ xori(crc, crc, -1); // ~crc
> 1641:   __ slli(crc, crc, 32);
> 1642:   __ srli(crc, crc, 32);

Suggestion:

    __ notr(crc, crc); // ~crc
    __ zero_extend(crc, crc, 32);

src/hotspot/cpu/riscv/c1_LIRAssembler_riscv.cpp line 1646:

> 1644:   __ xori(res, crc, -1); // ~crc
> 1645:   __ slli(crc, crc, 32);
> 1646:   __ srli(crc, crc, 32);

Suggestion:

    __ notr(res, crc); // ~crc
    __ zero_extend(crc, crc, 32);

-------------

PR Review: https://git.openjdk.org/jdk/pull/17046#pullrequestreview-1778668061
PR Review Comment: https://git.openjdk.org/jdk/pull/17046#discussion_r1424730144
PR Review Comment: https://git.openjdk.org/jdk/pull/17046#discussion_r1424729897

From chagedorn at openjdk.org Wed Dec 13 07:11:57 2023
From: chagedorn at openjdk.org (Christian Hagedorn)
Date: Wed, 13 Dec 2023 07:11:57 GMT
Subject: RFR: 8321823: Remove redundant PhaseGVN transform_no_reclaim [v2]
In-Reply-To: <9TRF6bRfXq0gCy_39OitPXgejHyc_V73-sr_L-AC_gU=.803dfaa1-2549-4b86-ad1c-8e9140046006@github.com>
References: <9TRF6bRfXq0gCy_39OitPXgejHyc_V73-sr_L-AC_gU=.803dfaa1-2549-4b86-ad1c-8e9140046006@github.com>
Message-ID:

On Tue, 12 Dec 2023 17:29:50 GMT, Joshua Cao wrote:

>> `PhaseGVN::transform` is just a one line wrapper around `PhaseGVN::transform_no_reclaim`. Looking at the history, they had different functionality in 2008, but since have become the same thing. We prefer to keep `PhaseGVN::transform` because it has hundreds of callsites, while the other only has a few callsites that are shown in the PR.
>>
>> Passes tier1 locally on my Linux machine.
>
> Joshua Cao has updated the pull request incrementally with one additional commit since the last revision:
>
>   Fix formatting

Thanks for the update, still looks good.

-------------

Marked as reviewed by chagedorn (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/17071#pullrequestreview-1778968853 From thartmann at openjdk.org Wed Dec 13 07:33:37 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 13 Dec 2023 07:33:37 GMT Subject: RFR: 8321648: Integral gather optimized mask computation. In-Reply-To: References: Message-ID: On Mon, 11 Dec 2023 10:33:05 GMT, Jatin Bhateja wrote: >> While you are at it, you can change the `address` operand of these to only accept no-index ones, removing the need of the `lea` instruction. > >> While you are at it, you can change the `address` operand of these to only accept no-index ones, removing the need of the `lea` instruction. > > Hi @merykitty , Memory patterns fold address generation components (base , index, scale) into instruction encoding thus eliminating a need to emit explicit ADD, MUL instruction sequence to compute address, saving lea may prevent folding memory patterns and may prove to be costly. @jatin-bhateja Could you elaborate on what the failure mode for the incorrect instruction attribution would look like? Is this just inefficient execution or would it lead to a crash? ------------- PR Comment: https://git.openjdk.org/jdk/pull/17048#issuecomment-1853399018 From epeter at openjdk.org Wed Dec 13 08:42:14 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 13 Dec 2023 08:42:14 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v42] In-Reply-To: References: Message-ID: > I want to push this in JDK23. > After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). > > To calm your nerves: most of the changes are in auto-generated tests, and tests in general. > > **Context** > > `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. 
However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). > > Alignment is split into two tasks: > - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. > - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width. > > **Problem** > > I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). > In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. > Thanks @fg1417 for confirming this! > Hence, we need to fix the alignment correctness checks. > > While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. > > **Problem Details** > > Reproducer: > > > static void test(short[] a, short[] b, short mask) { > for (int i = 0; i < RANGE; i+=8) { > // Problematic for AlignVector > b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 > > b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes > b[i+4] = (short)(a[i+4] & mask); > b[i+5] = (short)(a[i+5] & mask); > b[i+6] = (short)(a[i+6] & mask); > } > } > > > During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. > > This is problematic as shown in this example. We have references at index offset `0, 3, 4, 5, 6`, and by... 
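
The byte offsets in the reproducer above can be checked with a small Java sketch. This is an illustration only: it assumes an 8-byte alignment requirement and ignores the array base offset and the loop-variant part of the address, and the names `AlignmentSketch`, `byteOffset` and `isAligned` are hypothetical, not taken from the patch. It shows that even if the reference at index offset 0 is aligned, the pack that starts at index offset 3 sits 6 bytes further on and so cannot be.

```java
final class AlignmentSketch {
    // Assumed alignment requirement in bytes (hypothetical value for this sketch).
    static final int ALIGNMENT_BYTES = 8;

    // Byte distance of a[i + indexOffset] from a[i] in a short[].
    static int byteOffset(int indexOffset) {
        return indexOffset * Short.BYTES;
    }

    // True if an access at this index offset stays aligned,
    // given that the access at index offset 0 is aligned.
    static boolean isAligned(int indexOffset) {
        return byteOffset(indexOffset) % ALIGNMENT_BYTES == 0;
    }

    public static void main(String[] args) {
        int[] packStarts = {0, 3}; // first element of each group in the reproducer
        for (int start : packStarts) {
            System.out.println("index offset " + start
                    + ": byte offset " + byteOffset(start)
                    + ", aligned=" + isAligned(start));
        }
    }
}
```

Checking only the reference with the lowest offset in a group, as described above, would accept the whole group on the strength of index offset 0 alone.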
Emanuel Peter has updated the pull request incrementally with two additional commits since the last revision: - expose less from AlignmentSolutionConstrained - fix up some virtual functions for Christian ------------- Changes: - all: https://git.openjdk.org/jdk/pull/14785/files - new: https://git.openjdk.org/jdk/pull/14785/files/690acf9a..f118744f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=41 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=40-41 Stats: 49 lines in 2 files changed: 6 ins; 12 del; 31 mod Patch: https://git.openjdk.org/jdk/pull/14785.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14785/head:pull/14785 PR: https://git.openjdk.org/jdk/pull/14785 From jbhateja at openjdk.org Wed Dec 13 08:53:48 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 13 Dec 2023 08:53:48 GMT Subject: RFR: 8321648: Integral gather optimized mask computation. In-Reply-To: References: Message-ID: On Mon, 11 Dec 2023 10:33:05 GMT, Jatin Bhateja wrote: >> While you are at it, you can change the `address` operand of these to only accept no-index ones, removing the need of the `lea` instruction. > >> While you are at it, you can change the `address` operand of these to only accept no-index ones, removing the need of the `lea` instruction. > > Hi @merykitty , Memory patterns fold address generation components (base , index, scale) into instruction encoding thus eliminating a need to emit explicit ADD, MUL instruction sequence to compute address, saving lea may prevent folding memory patterns and may prove to be costly. > @jatin-bhateja Could you elaborate on what the failure mode for the incorrect instruction attribution would look like? Is this just inefficient execution or would it lead to a crash? Hi @TobiHartmann , These gather instruction are strictly applicable for AVX2 targets and will always be VEX encoded, instruction patterns corresponding to them operate on legacy vector register mask operands. 
Thus, this looks more of a typo error to set VL as true. Other change is for strength reduction and replacing a memory operand instruction. ------------- PR Comment: https://git.openjdk.org/jdk/pull/17048#issuecomment-1853497698 From roland at openjdk.org Wed Dec 13 08:55:25 2023 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 13 Dec 2023 08:55:25 GMT Subject: RFR: 8319793: C2 compilation fails with "Bad graph detected in build_loop_late" after JDK-8279888 [v2] In-Reply-To: References: Message-ID: > Range check smearing and range check predication make an array access > dependent on 2 (or more in the case of RC smearing) conditions. As a > consequence, if a range check can be eliminated because there's an > identical dominating range check, the control dependent nodes that > could float and become dependent on the dominating range check cannot > be allowed to float because there's a risk that they would then bypass > one of the checks that make the access legal. > > `IfNode::dominated_by()` and `PhaseIdealLoop::dominated_by()` have > logic to prevent this: nodes that are control dependent on a range > check or predicate are not allowed to float. This is however not > sufficient as demonstrated by the test cases. > > In `TestArrayAccessAboveRCAfterSmearingOrPredication.testRangeCheckSmearing()`: > > > v += array[i]; > if (flag2) { > if (flag3) { > field = 0x42; > } > } > if (flagField == 1) { > v += array[i]; > } > > > The range check for the second `array[i]` load is replaced by the > dominating range check for the first `array[i]` but because the second > `array[i]` load could really be dependent on multiple range checks (in > case smearing happened which is not the case here), c2 doesn't allow > the second `array[i]` to float when the second range check is > removed. The second `array[i]` is then control dependent on: > > > if (flagField == 1) { > > > which is next found to be dominated by the same test: > > > if (flag == 1) { > > > and is removed. 
However, nothing in `dominated_by()` treats nodes > dependent on tests that are not range checks or predicates > specially. So the second `array[i]` is allowed to float and become > dependent on: > > > if (flag == 1) { > > > which is above the range check for that access. The test method in its > last invocation is passed an index for the array access that's wildly > out of range. The array load happens before the range check and > crashes the VM. `testLoopPredication()` is a similar test where array > loads become dependent on predicates and end up above range checks. > > `TestArrayAccessCastIIAboveRC.java` is the test case from the bug > where for similar reasons a range check `CastII` ends up above its > range check, becomes top because its input becomes some integer that > conflicts with its type (but there's no condition to catch it). The > graph becomes broken and c2 crashes. > > Logic in the `dominated_by()` methods ... Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits: - Merge branch 'master' into JDK-8319793 - fix & test ------------- Changes: https://git.openjdk.org/jdk/pull/16886/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=16886&range=01 Stats: 361 lines in 14 files changed: 303 ins; 27 del; 31 mod Patch: https://git.openjdk.org/jdk/pull/16886.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16886/head:pull/16886 PR: https://git.openjdk.org/jdk/pull/16886 From duke at openjdk.org Wed Dec 13 09:51:48 2023 From: duke at openjdk.org (Daniel Lundén) Date: Wed, 13 Dec 2023 09:51:48 GMT Subject: RFR: 8321820: TestLoadNIdeal fails on 32-bit because -XX:+UseCompressedOops is not recognized Message-ID: This changeset fixes an issue where `TestLoadNIdeal.java` fails on 32-bit, where `-XX:+UseCompressedOops` is not available. Changes: - Only run the test on 64-bit platforms.
### Testing windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64: - tier1, HotSpot parts of tier2 and tier3 linux-x86 (32-bit) - tier1 ------------- Commit messages: - Require 64-bit for TestLoadNIdeal Changes: https://git.openjdk.org/jdk/pull/17083/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=17083&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8321820 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/17083.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17083/head:pull/17083 PR: https://git.openjdk.org/jdk/pull/17083 From ihse at openjdk.org Wed Dec 13 09:52:02 2023 From: ihse at openjdk.org (Magnus Ihse Bursie) Date: Wed, 13 Dec 2023 09:52:02 GMT Subject: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v12] In-Reply-To: References: <_gSNXk0qGAtpY-WJ5OCHk_3-nuGrwwSn-ffK9f2TEcs=.40f785ba-83dd-40fe-8075-a7a7872ea600@github.com> <7_T6sM3wjbSzZ0ab9FsptbpPnlQ2J4NNctQNkdbDFdI=.b595a8cc-4b14-44c6-8319-00e68fea21c3@github.com> Message-ID: On Tue, 12 Dec 2023 17:33:12 GMT, Srinivas Vamsi Parasa wrote: >> @vamsi-parasa You said: >>> Made sure that OpenJDK builds without errors using both GCC 7.5 and GCC 6.4. >> >> but now we have https://bugs.openjdk.org/browse/JDK-8321688. Did you introduce any changes after you tested with GCC 7.5? It seems strange to me that the code simultaneously both works and not works with gcc 7.5. > > Hi Magnus (@magicus), did a fresh pull of the OpenJDK and was able to build it successfully (without any errors) using GCC 7.5.0 on Ubuntu Linux machine. > (I am on vacation till Jan7th, 2024. Our team will look into this issue) New information in JDK-8321688 says it is only happening on slowdebug. This is probably why it was missed. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16534#discussion_r1425106097 From rcastanedalo at openjdk.org Wed Dec 13 11:00:40 2023 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Wed, 13 Dec 2023 11:00:40 GMT Subject: RFR: 8321820: TestLoadNIdeal fails on 32-bit because -XX:+UseCompressedOops is not recognized In-Reply-To: References: Message-ID: On Wed, 13 Dec 2023 09:45:29 GMT, Daniel Lundén wrote: > This changeset fixes an issue where `TestLoadNIdeal.java` fails on 32-bit, where `-XX:+UseCompressedOops` is not available. > > Changes: > - Only run the test on 64-bit platforms. > > ### Testing > windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64: > - tier1, HotSpot parts of tier2 and tier3 > > linux-x86 (32-bit) > - tier1 Looks good. ------------- Marked as reviewed by rcastanedalo (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/17083#pullrequestreview-1779373042 From chagedorn at openjdk.org Wed Dec 13 11:32:38 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 13 Dec 2023 11:32:38 GMT Subject: RFR: 8321820: TestLoadNIdeal fails on 32-bit because -XX:+UseCompressedOops is not recognized In-Reply-To: References: Message-ID: <0vBhL_mK45Ir_r-yBl3cKyuis3cfr4J8wst7MfEwYy8=.1c0f9d28-eb79-4615-ba33-3991e0b1f730@github.com> On Wed, 13 Dec 2023 09:45:29 GMT, Daniel Lundén wrote: > This changeset fixes an issue where `TestLoadNIdeal.java` fails on 32-bit, where `-XX:+UseCompressedOops` is not available. > > Changes: > - Only run the test on 64-bit platforms. > > ### Testing > windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64: > - tier1, HotSpot parts of tier2 and tier3 > > linux-x86 (32-bit) > - tier1 Looks good and trivial. ------------- Marked as reviewed by chagedorn (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/17083#pullrequestreview-1779432646 From thartmann at openjdk.org Wed Dec 13 12:02:53 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 13 Dec 2023 12:02:53 GMT Subject: RFR: 8321974: Crash in ciKlass::is_subtype_of because TypeAryPtr::_klass is not initialized Message-ID: [JDK-8297933](https://bugs.openjdk.org/browse/JDK-8320292) added code that relies on lazy initialization of the `TypeAryPtr::_klass` field. However, there are cases when the field is not yet initialized, leading to a null pointer dereference at C2 compilation time. In the failing case we process a CmpP: 116 Phi === 109 160 57 [[ 120 128 128 ]] #long[int:1..2] (java/lang/Cloneable,java/io/Serializable):NotNull:exact * !jvms: TestSimple::test @ bci:11 (line 32) 10 Parm === 3 [[ 173 143 128 40 120 128 94 72 83 ]] Parm0: long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact * !jvms: TestSimple::test @ bci:-1 (line 29) 120 CmpP === _ 10 116 [[ 121 ]] !jvms: TestSimple::test @ bci:13 (line 32) `CmpPNode::sub` performs a subtype check to check if the klasses of its two operands are unrelated. We crash in `ciKlass::is_subtype_of` because the `TypeAryPtr::_klass` field is not initialized ( `= nullptr`) for the `116 Phi` operand. The issue only reproduces with release builds because [additional verification code](https://github.com/openjdk/jdk/blob/21cda19d05b688148f023f6d92778b5da210b709/src/hotspot/share/opto/type.cpp#L996-L1007) in `Type::meet_helper` in debug builds calls `klass()` which leads to eager initialization of the `_klass` field. When disabling the verification code, the issue also reproduces with debug builds and we hit the `this_one->_klass != nullptr && other->_klass != nullptr` assert in `TypePtr::is_same_java_type_as_helper_for_array`. The fix is to always use the `klass()` method for accesses which makes sure that the field is properly initialized since the overhead is negligible. 
The patch also includes some unrelated removal of dead code in `TypeAryPtr::compute_klass` (after [JDK-8297933](https://bugs.openjdk.org/browse/JDK-8320292), the verify argument is always false). Thanks, Tobias ------------- Commit messages: - 8321974: Crash in ciKlass::is_subtype_of because TypeAryPtr::_klass is not initialized Changes: https://git.openjdk.org/jdk/pull/17085/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=17085&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8321974 Stats: 89 lines in 3 files changed: 54 ins; 22 del; 13 mod Patch: https://git.openjdk.org/jdk/pull/17085.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17085/head:pull/17085 PR: https://git.openjdk.org/jdk/pull/17085 From roland at openjdk.org Wed Dec 13 12:06:39 2023 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 13 Dec 2023 12:06:39 GMT Subject: RFR: 8321974: Crash in ciKlass::is_subtype_of because TypeAryPtr::_klass is not initialized In-Reply-To: References: Message-ID: On Wed, 13 Dec 2023 11:57:42 GMT, Tobias Hartmann wrote: > [JDK-8297933](https://bugs.openjdk.org/browse/JDK-8320292) added code that relies on lazy initialization of the `TypeAryPtr::_klass` field. However, there are cases when the field is not yet initialized, leading to a null pointer dereference at C2 compilation time. > > In the failing case we process a CmpP: > > 116 Phi === 109 160 57 [[ 120 128 128 ]] #long[int:1..2] (java/lang/Cloneable,java/io/Serializable):NotNull:exact * !jvms: TestSimple::test @ bci:11 (line 32) > 10 Parm === 3 [[ 173 143 128 40 120 128 94 72 83 ]] Parm0: long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact * !jvms: TestSimple::test @ bci:-1 (line 29) > 120 CmpP === _ 10 116 [[ 121 ]] !jvms: TestSimple::test @ bci:13 (line 32) > > `CmpPNode::sub` performs a subtype check to check if the klasses of its two operands are unrelated. 
We crash in `ciKlass::is_subtype_of` because the `TypeAryPtr::_klass` field is not initialized ( `= nullptr`) for the `116 Phi` operand. > > The issue only reproduces with release builds because [additional verification code](https://github.com/openjdk/jdk/blob/21cda19d05b688148f023f6d92778b5da210b709/src/hotspot/share/opto/type.cpp#L996-L1007) in `Type::meet_helper` in debug builds calls `klass()` which leads to eager initialization of the `_klass` field. When disabling the verification code, the issue also reproduces with debug builds and we hit the `this_one->_klass != nullptr && other->_klass != nullptr` assert in `TypePtr::is_same_java_type_as_helper_for_array`. > > The fix is to always use the `klass()` method for accesses which makes sure that the field is properly initialized since the overhead is negligible. The patch also includes some unrelated removal of dead code in `TypeAryPtr::compute_klass` (after [JDK-8297933](https://bugs.openjdk.org/browse/JDK-8320292), the verify argument is always false). > > Thanks, > Tobias Looks good to me. ------------- Marked as reviewed by roland (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/17085#pullrequestreview-1779489428 From aph-open at littlepinkcloud.com Wed Dec 13 12:07:10 2023 From: aph-open at littlepinkcloud.com (Andrew Haley) Date: Wed, 13 Dec 2023 12:07:10 +0000 Subject: ABS usage in HotSpot [Was: RFR: 8282365: Consolidate and improve division by constant idealizations [v35]] In-Reply-To: References: Message-ID: <96035774-9883-46e1-8396-e3503e6fdc9b@littlepinkcloud.com> On 12/7/23 22:54, Kim Barrett wrote: > src/hotspot/share/utilities/globalDefinitions.hpp line 1107: > >> 1105: using U = std::make_unsigned_t; >> 1106: return (x >= 0) ? x : U(0) - U(x); >> 1107: } > I understand what this to change to ABS is doing, though it's not obvious. (Dodging overflow UB > for -x when x is the minimum value of a signed integral type.) I'm not entirely sure that's a wise move. Me either. 
In the case of integer types we have uabs(), which returns an unsigned value and cannot overflow, so we don't need to dodge any UB. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From thartmann at openjdk.org Wed Dec 13 12:09:39 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 13 Dec 2023 12:09:39 GMT Subject: RFR: 8321974: Crash in ciKlass::is_subtype_of because TypeAryPtr::_klass is not initialized In-Reply-To: References: Message-ID: On Wed, 13 Dec 2023 11:57:42 GMT, Tobias Hartmann wrote: > [JDK-8297933](https://bugs.openjdk.org/browse/JDK-8320292) added code that relies on lazy initialization of the `TypeAryPtr::_klass` field. However, there are cases when the field is not yet initialized, leading to a null pointer dereference at C2 compilation time. > > In the failing case we process a CmpP: > > 116 Phi === 109 160 57 [[ 120 128 128 ]] #long[int:1..2] (java/lang/Cloneable,java/io/Serializable):NotNull:exact * !jvms: TestSimple::test @ bci:11 (line 32) > 10 Parm === 3 [[ 173 143 128 40 120 128 94 72 83 ]] Parm0: long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact * !jvms: TestSimple::test @ bci:-1 (line 29) > 120 CmpP === _ 10 116 [[ 121 ]] !jvms: TestSimple::test @ bci:13 (line 32) > > `CmpPNode::sub` performs a subtype check to check if the klasses of its two operands are unrelated. We crash in `ciKlass::is_subtype_of` because the `TypeAryPtr::_klass` field is not initialized ( `= nullptr`) for the `116 Phi` operand. > > The issue only reproduces with release builds because [additional verification code](https://github.com/openjdk/jdk/blob/21cda19d05b688148f023f6d92778b5da210b709/src/hotspot/share/opto/type.cpp#L996-L1007) in `Type::meet_helper` in debug builds calls `klass()` which leads to eager initialization of the `_klass` field. 
When disabling the verification code, the issue also reproduces with debug builds and we hit the `this_one->_klass != nullptr && other->_klass != nullptr` assert in `TypePtr::is_same_java_type_as_helper_for_array`. > > The fix is to always use the `klass()` method for accesses which makes sure that the field is properly initialized since the overhead is negligible. The patch also includes some unrelated removal of dead code in `TypeAryPtr::compute_klass` (after [JDK-8297933](https://bugs.openjdk.org/browse/JDK-8320292), the verify argument is always false). > > Thanks, > Tobias Thanks for the quick review, Roland! ------------- PR Comment: https://git.openjdk.org/jdk/pull/17085#issuecomment-1853797822 From thartmann at openjdk.org Wed Dec 13 14:31:47 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 13 Dec 2023 14:31:47 GMT Subject: RFR: 8320682: [AArch64] C1 compilation fails with "Field too big for insn" [v3] In-Reply-To: <0pwWmcg5mBei8T9v-z71ogtM0YB2QIKwOM7tK8yTrSo=.1bf57583-9fae-413c-91aa-4f94d5df5bf7@github.com> References: <0pwWmcg5mBei8T9v-z71ogtM0YB2QIKwOM7tK8yTrSo=.1bf57583-9fae-413c-91aa-4f94d5df5bf7@github.com> Message-ID: On Thu, 7 Dec 2023 15:53:47 GMT, Daniel Lundén wrote: >> This changeset fixes an issue where addresses for float and double constants on aarch64 were sometimes out of range for PC-relative offsets using `adr`. >> >> Changes: >> - Set an upper bound of `1M` for the flag `NMethodSizeLimit`, ensuring that float and double constants are in range for `adr`. >> - Revise tests in `TestC1Globals.java` to use the new upper bound of `1M` for `NMethodSizeLimit`. Also, remove no longer applicable tests in `TestC1Globals.java`.
>> >> ### Testing (in progress) >> Platforms: windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64 >> - tier1, tier2, tier3, tier4, tier5 >> - Targeted and repeated tests for `TestC1Globals.java` in all tiers > > Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: > > Update copyright The `1M` limit seems reasonable to me given that `NMethodSizeLimit` is a develop flag and the default of `64K` is much lower. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/16951#pullrequestreview-1779770946 From thartmann at openjdk.org Wed Dec 13 14:33:39 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 13 Dec 2023 14:33:39 GMT Subject: RFR: 8321648: Integral gather optimized mask computation. In-Reply-To: References: Message-ID: On Mon, 11 Dec 2023 07:26:31 GMT, Jatin Bhateja wrote: > Hi, > > This bug fix patch optimizes integral gather mask computation using cheaper instruction and fixes incorrect instruction attributes in legacy integral gather instructions. > > All Vector API JTREG tests are passing with this at various AVX levels. > > Kindly review and share feedback. > > Best Regards, > Jatin Thanks for the clarification, Jatin. So the incorrect encoding has no real (negative) effect on code generation?
The instruction attribute VL allows EVEX-to-VEX demotion iff the participating vectors are allocated from the lower register bank and are less than 512 bits wide. Conversely, if the RA makes an allocation from the higher register bank, then we may need EVEX bits to accommodate the encoding for a register in the higher register bank even if the vectors are less than 512 bits wide. However, some instructions are constrained to use VEX encoding, and the AVX2 gathers belong to that class. The patch also strength-reduces the mask computation, which currently uses memory operands. ------------- PR Comment: https://git.openjdk.org/jdk/pull/17048#issuecomment-1854082915 From aph at openjdk.org Wed Dec 13 15:06:45 2023 From: aph at openjdk.org (Andrew Haley) Date: Wed, 13 Dec 2023 15:06:45 GMT Subject: RFR: 8320682: [AArch64] C1 compilation fails with "Field too big for insn" [v3] In-Reply-To: <0pwWmcg5mBei8T9v-z71ogtM0YB2QIKwOM7tK8yTrSo=.1bf57583-9fae-413c-91aa-4f94d5df5bf7@github.com> References: <0pwWmcg5mBei8T9v-z71ogtM0YB2QIKwOM7tK8yTrSo=.1bf57583-9fae-413c-91aa-4f94d5df5bf7@github.com> Message-ID: On Thu, 7 Dec 2023 15:53:47 GMT, Daniel Lundén wrote: >> This changeset fixes an issue where addresses for float and double constants on aarch64 were sometimes out of range for PC-relative offsets using `adr`. >> >> Changes: >> - Set an upper bound of `1M` for the flag `NMethodSizeLimit`, ensuring that float and double constants are in range for `adr`. >> - Revise tests in `TestC1Globals.java` to use the new upper bound of `1M` for `NMethodSizeLimit`. Also, remove no longer applicable tests in `TestC1Globals.java`.
>> >> ### Testing (in progress) >> Platforms: windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64 >> - tier1, tier2, tier3, tier4, tier5 >> - Targeted and repeated tests for `TestC1Globals.java` in all tiers > > Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: > > Update copyright Marked as reviewed by aph (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/16951#pullrequestreview-1779851010 From shade at openjdk.org Wed Dec 13 15:09:38 2023 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 13 Dec 2023 15:09:38 GMT Subject: RFR: 8321820: TestLoadNIdeal fails on 32-bit because -XX:+UseCompressedOops is not recognized In-Reply-To: References: Message-ID: On Wed, 13 Dec 2023 09:45:29 GMT, Daniel Lundén wrote: > This changeset fixes an issue where `TestLoadNIdeal.java` fails on 32-bit, where `-XX:+UseCompressedOops` is not available. > > Changes: > - Only run the test on 64-bit platforms. > > ### Testing > windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64: > - tier1, HotSpot parts of tier2 and tier3 > > linux-x86 (32-bit) > - tier1 Marked as reviewed by shade (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/17083#pullrequestreview-1779859062 From qamai at openjdk.org Wed Dec 13 16:01:42 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Wed, 13 Dec 2023 16:01:42 GMT Subject: RFR: 8282365: Consolidate and improve division by constant idealizations [v39] In-Reply-To: References: Message-ID: <4aXL_qh1epRWCwufaHiKXJ3wuPqG0xZSF6i-8r6OgcU=.97a5ff7e-5a19-47e5-b14d-af16ef5c56d5@github.com> > This patch implements idealisation for unsigned divisions to change a division by a constant into a series of multiplication and shift. I also change the idealisation of `DivI` to get a more efficient series when the magic constant overflows an int32.
> > In general, the idea behind a signed division transformation is that for a positive constant `d`, we would need to find constants `c` and `m` so that: > > floor(x / d) = floor(x * c / 2**m) for 0 < x < 2**(N - 1) (1) > ceil(x / d) = floor(x * c / 2**m) + 1 for -2**(N - 1) <= x < 0 (2) > > The implementation in the original book takes into consideration that the machine may not be able to perform the full multiplication `x * c`, so the constant overflow and we need to add back the dividend as in `DivLNode::Ideal` cases. However, for int32 division, `x * c` cannot overflow an int64. As a result, it is always feasible to just calculate the product and shift the result. > > For unsigned multiplication, the situation is somewhat trickier because the condition needs to be twice as strong (the condition (1) and (2) above are mostly the same). This results in the magic constant `c` calculated based on the method presented in Hacker's Delight by Henry S. Warren, Jr. may overflow an uintN. For int division, we can depend on the theorem devised by Arch D. Robison in N-Bit Unsigned Division Via N-Bit Multiply-Add, which states that there exists either: > > c1 in uint32 and m1, such that floor(x / d) = floor(x * c1 / 2**m1) for 0 < x < 2**32 (3) > c2 in uint32 and m2, such that floor(x / d) = floor((x + 1) * c2 / 2**m2) for 0 < x < 2**32 (4) > > which means that either `x * c1` never overflows an uint64 or `(x + 1) * c2` never overflows an uint64. And we can perform a full multiplication. > > For longs, there is no way to do a full multiplication so we do some basic transformations to achieve a computable formula. The details I have written as comments in the overflow case. > > More tests are added to cover the possible patterns. > > Please take a look and have some reviews. Thank you very much. 
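
A minimal sketch of the signed int32 case described above, written in Java for illustration only. It assumes one particular choice of constants, `m = 31 + ceil(log2(d))` and `c = floor(2**m / d) + 1`, which satisfies (1) and (2) for a positive int `d`; the constants C2 actually selects may differ. The names `MagicDivision`, `shiftFor`, `magicFor` and `divide` are hypothetical and do not come from the patch.

```java
final class MagicDivision {
    // Shift amount m = 31 + ceil(log2(d)), for a constant d >= 1 (assumed choice).
    static int shiftFor(int d) {
        int ceilLog2 = 32 - Integer.numberOfLeadingZeros(d - 1);
        return 31 + ceilLog2;
    }

    // Magic constant c = floor(2**m / d) + 1; with this m it stays below 2**32.
    static long magicFor(int d) {
        return (1L << shiftFor(d)) / d + 1;
    }

    // Truncating division x / d for a positive constant d, using multiply and shift.
    static int divide(int x, int d) {
        // |x| <= 2**31 and c < 2**32, so the full product fits in an int64.
        long product = (long) x * magicFor(d);
        // Arithmetic shift computes floor(x * c / 2**m).
        long q = product >> shiftFor(d);
        // Case (2): negative dividends need the +1 correction.
        return (int) (x < 0 ? q + 1 : q);
    }

    public static void main(String[] args) {
        int[] divisors = {1, 2, 3, 7, 10, 641, Integer.MAX_VALUE};
        int[] dividends = {0, 1, -1, 6, 7, -7, 100, -100,
                           Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int d : divisors) {
            for (int x : dividends) {
                if (divide(x, d) != x / d) {
                    throw new AssertionError(x + " / " + d);
                }
            }
        }
        System.out.println("all quotients match");
    }
}
```

The sketch covers only the signed int32 path, where the full product fits in an int64; the unsigned cases (3) and (4) and the long case need the extra handling the description mentions.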
Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: missing revert ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9947/files - new: https://git.openjdk.org/jdk/pull/9947/files/a8389af1..b56dc2d4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9947&range=38 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9947&range=37-38 Stats: 6 lines in 2 files changed: 0 ins; 3 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/9947.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/9947/head:pull/9947 PR: https://git.openjdk.org/jdk/pull/9947 From kvn at openjdk.org Wed Dec 13 17:10:37 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 13 Dec 2023 17:10:37 GMT Subject: RFR: 8321974: Crash in ciKlass::is_subtype_of because TypeAryPtr::_klass is not initialized In-Reply-To: References: Message-ID: On Wed, 13 Dec 2023 11:57:42 GMT, Tobias Hartmann wrote: > [JDK-8297933](https://bugs.openjdk.org/browse/JDK-8320292) added code that relies on lazy initialization of the `TypeAryPtr::_klass` field. However, there are cases when the field is not yet initialized, leading to a null pointer dereference at C2 compilation time. > > In the failing case we process a CmpP: > > 116 Phi === 109 160 57 [[ 120 128 128 ]] #long[int:1..2] (java/lang/Cloneable,java/io/Serializable):NotNull:exact * !jvms: TestSimple::test @ bci:11 (line 32) > 10 Parm === 3 [[ 173 143 128 40 120 128 94 72 83 ]] Parm0: long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact * !jvms: TestSimple::test @ bci:-1 (line 29) > 120 CmpP === _ 10 116 [[ 121 ]] !jvms: TestSimple::test @ bci:13 (line 32) > > `CmpPNode::sub` performs a subtype check to check if the klasses of its two operands are unrelated. We crash in `ciKlass::is_subtype_of` because the `TypeAryPtr::_klass` field is not initialized ( `= nullptr`) for the `116 Phi` operand. 
> > The issue only reproduces with release builds because [additional verification code](https://github.com/openjdk/jdk/blob/21cda19d05b688148f023f6d92778b5da210b709/src/hotspot/share/opto/type.cpp#L996-L1007) in `Type::meet_helper` in debug builds calls `klass()` which leads to eager initialization of the `_klass` field. When disabling the verification code, the issue also reproduces with debug builds and we hit the `this_one->_klass != nullptr && other->_klass != nullptr` assert in `TypePtr::is_same_java_type_as_helper_for_array`. > > The fix is to always use the `klass()` method for accesses which makes sure that the field is properly initialized since the overhead is negligible. The patch also includes some unrelated removal of dead code in `TypeAryPtr::compute_klass` (after [JDK-8297933](https://bugs.openjdk.org/browse/JDK-8320292), the verify argument is always false). > > Thanks, > Tobias Good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/17085#pullrequestreview-1780116575 From kvn at openjdk.org Wed Dec 13 19:03:41 2023 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 13 Dec 2023 19:03:41 GMT Subject: RFR: 8321288: [JVMCI] HotSpotJVMCIRuntime doesn't clean up WeakReferences in resolvedJavaTypes [v2] In-Reply-To: References: <-_CpkWzu-kr4DjL_t7tespZcMBCFz3QQmr4-KsHqjMg=.aa6d1624-610e-41a4-aad3-a714f2c168a3@github.com> Message-ID: On Mon, 11 Dec 2023 22:01:58 GMT, Tom Rodriguez wrote: >> HotSpotJVMCIRuntime.resolvedJavaTypes implements a weak value map but is lacking code to clean out cleared weak references. In normal mixed execution this isn't likely to get big and generally isolates are shutdown frequently so this doesn't lead to problems. In Xcomp mode with tests that stress unloading this becomes more problematic. In the worst case is still doesn't lead to large heaps but does make the idle heap larger than required. >> >> This PR adds ReferenceQueue based cleaning of reclaimed values. 
Testing in the context of a long running isolate shows that they are no longer accumulating. > > Tom Rodriguez has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Merge remote-tracking branch 'origin/master' into tkr-clean-weak > - Comment and types improvements > - 8321288: [JVMCI] HotSpotJVMCIRuntime doesn't clean up WeakReferences in resolvedJavaTypes Looks good as far as I can understand. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/16981#pullrequestreview-1780300803 From dlong at openjdk.org Wed Dec 13 20:25:48 2023 From: dlong at openjdk.org (Dean Long) Date: Wed, 13 Dec 2023 20:25:48 GMT Subject: RFR: 8320682: [AArch64] C1 compilation fails with "Field too big for insn" [v3] In-Reply-To: <0pwWmcg5mBei8T9v-z71ogtM0YB2QIKwOM7tK8yTrSo=.1bf57583-9fae-413c-91aa-4f94d5df5bf7@github.com> References: <0pwWmcg5mBei8T9v-z71ogtM0YB2QIKwOM7tK8yTrSo=.1bf57583-9fae-413c-91aa-4f94d5df5bf7@github.com> Message-ID: <3jo5iedMTBXKNXlGtPBnao5BKGd73gEDtLk8I7NALl8=.ed38efe3-1316-4e9a-9c47-e807f2b92621@github.com> On Thu, 7 Dec 2023 15:53:47 GMT, Daniel Lundén wrote: >> This changeset fixes an issue where addresses for float and double constants on aarch64 were sometimes out of range for PC-relative offsets using `adr`. >> >> Changes: >> - Set an upper bound of `1M` for the flag `NMethodSizeLimit`, ensuring that float and double constants are in range for `adr`. >> - Revise tests in `TestC1Globals.java` to use the new upper bound of `1M` for `NMethodSizeLimit`. Also, remove no longer applicable tests in `TestC1Globals.java`.
>> >> ### Testing (in progress) >> Platforms: windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64 >> - tier1, tier2, tier3, tier4, tier5 >> - Targeted and repeated tests for `TestC1Globals.java` in all tiers > > Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: > > Update copyright Marked as reviewed by dlong (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/16951#pullrequestreview-1780428799 From dlong at openjdk.org Wed Dec 13 20:25:50 2023 From: dlong at openjdk.org (Dean Long) Date: Wed, 13 Dec 2023 20:25:50 GMT Subject: RFR: 8320682: [AArch64] C1 compilation fails with "Field too big for insn" [v3] In-Reply-To: References: <0pwWmcg5mBei8T9v-z71ogtM0YB2QIKwOM7tK8yTrSo=.1bf57583-9fae-413c-91aa-4f94d5df5bf7@github.com> Message-ID: On Wed, 13 Dec 2023 14:28:42 GMT, Tobias Hartmann wrote: > The `1M` limit seems reasonable to me given that `NMethodSizeLimit` is a develop flag and the default of `64K` is much lower. Good point about it being a developer flag. I missed that. But the default is 64K words, or half of 1M. ------------- PR Comment: https://git.openjdk.org/jdk/pull/16951#issuecomment-1854649250 From phh at openjdk.org Wed Dec 13 21:43:39 2023 From: phh at openjdk.org (Paul Hohensee) Date: Wed, 13 Dec 2023 21:43:39 GMT Subject: RFR: 8321823: Remove redundant PhaseGVN transform_no_reclaim [v2] In-Reply-To: <9TRF6bRfXq0gCy_39OitPXgejHyc_V73-sr_L-AC_gU=.803dfaa1-2549-4b86-ad1c-8e9140046006@github.com> References: <9TRF6bRfXq0gCy_39OitPXgejHyc_V73-sr_L-AC_gU=.803dfaa1-2549-4b86-ad1c-8e9140046006@github.com> Message-ID: <6k8M7sUQ0HjBkRCUuYUBGM9_vbZaiJ2JXZYBruoe_iw=.4907a80f-32cb-49f6-ac1a-e5ee26eb99fa@github.com> On Tue, 12 Dec 2023 17:29:50 GMT, Joshua Cao wrote: >> `PhaseGVN::transform` is just a one line wrapper around `PhaseGVN::transform_no_reclaim`. Looking at the history, they had different functionality in 2008, but since have become the same thing.
We prefer to keep `PhaseGVN::transform` because it has hundreds of callsites, while the other only has a few callsites that are shown in the PR. >> >> Passes tier1 locally on my Linux machine. > > Joshua Cao has updated the pull request incrementally with one additional commit since the last revision: > > Fix formatting Marked as reviewed by phh (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/17071#pullrequestreview-1780536069 From phh at openjdk.org Wed Dec 13 21:46:39 2023 From: phh at openjdk.org (Paul Hohensee) Date: Wed, 13 Dec 2023 21:46:39 GMT Subject: RFR: 8321823: Remove redundant PhaseGVN transform_no_reclaim [v2] In-Reply-To: <9TRF6bRfXq0gCy_39OitPXgejHyc_V73-sr_L-AC_gU=.803dfaa1-2549-4b86-ad1c-8e9140046006@github.com> References: <9TRF6bRfXq0gCy_39OitPXgejHyc_V73-sr_L-AC_gU=.803dfaa1-2549-4b86-ad1c-8e9140046006@github.com> Message-ID: On Tue, 12 Dec 2023 17:29:50 GMT, Joshua Cao wrote: >> `PhaseGVN::transform` is just a one line wrapper around `PhaseGVN::transform_no_reclaim`. Looking at the history, they had different functionality in 2008, but since have become the same thing. We prefer to keep `PhaseGVN::transform` because it has hundreds of callsites, while the other only has a few callsites that are shown in the PR. >> >> Passes tier1 locally on my Linux machine. > > Joshua Cao has updated the pull request incrementally with one additional commit since the last revision: > > Fix formatting There are a couple of GHA failures that look compiler related. I approved the PR before looking at these because the patch is trivial on its face, but now I'm suspicious. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/17071#issuecomment-1854746506 From thartmann at openjdk.org Thu Dec 14 06:01:46 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 14 Dec 2023 06:01:46 GMT Subject: RFR: 8320682: [AArch64] C1 compilation fails with "Field too big for insn" [v3] In-Reply-To: References: <0pwWmcg5mBei8T9v-z71ogtM0YB2QIKwOM7tK8yTrSo=.1bf57583-9fae-413c-91aa-4f94d5df5bf7@github.com> Message-ID: On Wed, 13 Dec 2023 20:23:10 GMT, Dean Long wrote: > But the default is 64K words, or half of 1M. Right, I missed that. ------------- PR Comment: https://git.openjdk.org/jdk/pull/16951#issuecomment-1855192051 From thartmann at openjdk.org Thu Dec 14 06:02:38 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 14 Dec 2023 06:02:38 GMT Subject: RFR: 8321974: Crash in ciKlass::is_subtype_of because TypeAryPtr::_klass is not initialized In-Reply-To: References: Message-ID: <98YnUa35T0xAwY5Sye2CbM5mSTfrkCgZ8NPdPk0tyRk=.ad97aab9-a92e-425a-82e4-cf52f39a9717@github.com> On Wed, 13 Dec 2023 11:57:42 GMT, Tobias Hartmann wrote: > [JDK-8297933](https://bugs.openjdk.org/browse/JDK-8320292) added code that relies on lazy initialization of the `TypeAryPtr::_klass` field. However, there are cases when the field is not yet initialized, leading to a null pointer dereference at C2 compilation time. > > In the failing case we process a CmpP: > > 116 Phi === 109 160 57 [[ 120 128 128 ]] #long[int:1..2] (java/lang/Cloneable,java/io/Serializable):NotNull:exact * !jvms: TestSimple::test @ bci:11 (line 32) > 10 Parm === 3 [[ 173 143 128 40 120 128 94 72 83 ]] Parm0: long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact * !jvms: TestSimple::test @ bci:-1 (line 29) > 120 CmpP === _ 10 116 [[ 121 ]] !jvms: TestSimple::test @ bci:13 (line 32) > > `CmpPNode::sub` performs a subtype check to check if the klasses of its two operands are unrelated. 
We crash in `ciKlass::is_subtype_of` because the `TypeAryPtr::_klass` field is not initialized ( `= nullptr`) for the `116 Phi` operand. > > The issue only reproduces with release builds because [additional verification code](https://github.com/openjdk/jdk/blob/21cda19d05b688148f023f6d92778b5da210b709/src/hotspot/share/opto/type.cpp#L996-L1007) in `Type::meet_helper` in debug builds calls `klass()` which leads to eager initialization of the `_klass` field. When disabling the verification code, the issue also reproduces with debug builds and we hit the `this_one->_klass != nullptr && other->_klass != nullptr` assert in `TypePtr::is_same_java_type_as_helper_for_array`. > > The fix is to always use the `klass()` method for accesses which makes sure that the field is properly initialized since the overhead is negligible. The patch also includes some unrelated removal of dead code in `TypeAryPtr::compute_klass` (after [JDK-8297933](https://bugs.openjdk.org/browse/JDK-8320292), the verify argument is always false). > > Thanks, > Tobias Thanks for the review, Vladimir! ------------- PR Comment: https://git.openjdk.org/jdk/pull/17085#issuecomment-1855192910 From thartmann at openjdk.org Thu Dec 14 07:25:53 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 14 Dec 2023 07:25:53 GMT Subject: Integrated: 8321974: Crash in ciKlass::is_subtype_of because TypeAryPtr::_klass is not initialized In-Reply-To: References: Message-ID: On Wed, 13 Dec 2023 11:57:42 GMT, Tobias Hartmann wrote: > [JDK-8297933](https://bugs.openjdk.org/browse/JDK-8320292) added code that relies on lazy initialization of the `TypeAryPtr::_klass` field. However, there are cases when the field is not yet initialized, leading to a null pointer dereference at C2 compilation time. 
> > In the failing case we process a CmpP: > > 116 Phi === 109 160 57 [[ 120 128 128 ]] #long[int:1..2] (java/lang/Cloneable,java/io/Serializable):NotNull:exact * !jvms: TestSimple::test @ bci:11 (line 32) > 10 Parm === 3 [[ 173 143 128 40 120 128 94 72 83 ]] Parm0: long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact * !jvms: TestSimple::test @ bci:-1 (line 29) > 120 CmpP === _ 10 116 [[ 121 ]] !jvms: TestSimple::test @ bci:13 (line 32) > > `CmpPNode::sub` performs a subtype check to check if the klasses of its two operands are unrelated. We crash in `ciKlass::is_subtype_of` because the `TypeAryPtr::_klass` field is not initialized ( `= nullptr`) for the `116 Phi` operand. > > The issue only reproduces with release builds because [additional verification code](https://github.com/openjdk/jdk/blob/21cda19d05b688148f023f6d92778b5da210b709/src/hotspot/share/opto/type.cpp#L996-L1007) in `Type::meet_helper` in debug builds calls `klass()` which leads to eager initialization of the `_klass` field. When disabling the verification code, the issue also reproduces with debug builds and we hit the `this_one->_klass != nullptr && other->_klass != nullptr` assert in `TypePtr::is_same_java_type_as_helper_for_array`. > > The fix is to always use the `klass()` method for accesses which makes sure that the field is properly initialized since the overhead is negligible. The patch also includes some unrelated removal of dead code in `TypeAryPtr::compute_klass` (after [JDK-8297933](https://bugs.openjdk.org/browse/JDK-8320292), the verify argument is always false). > > Thanks, > Tobias This pull request has now been integrated. 
Changeset: c8ad7b7f Author: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/c8ad7b7f84ead3f850f034e1db6335bbbac41589 Stats: 89 lines in 3 files changed: 54 ins; 22 del; 13 mod 8321974: Crash in ciKlass::is_subtype_of because TypeAryPtr::_klass is not initialized Reviewed-by: roland, kvn ------------- PR: https://git.openjdk.org/jdk/pull/17085 From thartmann at openjdk.org Thu Dec 14 07:48:17 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 14 Dec 2023 07:48:17 GMT Subject: [jdk22] RFR: 8321974: Crash in ciKlass::is_subtype_of because TypeAryPtr::_klass is not initialized Message-ID: Hi all, This pull request contains a backport of commit [c8ad7b7f](https://github.com/openjdk/jdk/commit/c8ad7b7f84ead3f850f034e1db6335bbbac41589) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. The commit being backported was authored by Tobias Hartmann on 14 Dec 2023 and was reviewed by Roland Westrelin and Vladimir Kozlov. Thanks! ------------- Commit messages: - Backport c8ad7b7f84ead3f850f034e1db6335bbbac41589 Changes: https://git.openjdk.org/jdk22/pull/12/files Webrev: https://webrevs.openjdk.org/?repo=jdk22&pr=12&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8321974 Stats: 89 lines in 3 files changed: 54 ins; 22 del; 13 mod Patch: https://git.openjdk.org/jdk22/pull/12.diff Fetch: git fetch https://git.openjdk.org/jdk22.git pull/12/head:pull/12 PR: https://git.openjdk.org/jdk22/pull/12 From epeter at openjdk.org Thu Dec 14 08:22:41 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 14 Dec 2023 08:22:41 GMT Subject: [jdk22] RFR: 8321974: Crash in ciKlass::is_subtype_of because TypeAryPtr::_klass is not initialized In-Reply-To: References: Message-ID: On Thu, 14 Dec 2023 07:41:03 GMT, Tobias Hartmann wrote: > Hi all, > > This pull request contains a backport of commit [c8ad7b7f](https://github.com/openjdk/jdk/commit/c8ad7b7f84ead3f850f034e1db6335bbbac41589) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. 
> > The commit being backported was authored by Tobias Hartmann on 14 Dec 2023 and was reviewed by Roland Westrelin and Vladimir Kozlov. > > Thanks! Marked as reviewed by epeter (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk22/pull/12#pullrequestreview-1781272315 From thartmann at openjdk.org Thu Dec 14 08:27:49 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 14 Dec 2023 08:27:49 GMT Subject: [jdk22] RFR: 8321974: Crash in ciKlass::is_subtype_of because TypeAryPtr::_klass is not initialized In-Reply-To: References: Message-ID: On Thu, 14 Dec 2023 07:41:03 GMT, Tobias Hartmann wrote: > Hi all, > > This pull request contains a backport of commit [c8ad7b7f](https://github.com/openjdk/jdk/commit/c8ad7b7f84ead3f850f034e1db6335bbbac41589) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by Tobias Hartmann on 14 Dec 2023 and was reviewed by Roland Westrelin and Vladimir Kozlov. > > Thanks! Thanks, Emanuel! ------------- PR Comment: https://git.openjdk.org/jdk22/pull/12#issuecomment-1855390198 From thartmann at openjdk.org Thu Dec 14 08:32:38 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 14 Dec 2023 08:32:38 GMT Subject: RFR: 8321648: Integral gather optimized mask computation. In-Reply-To: References: Message-ID: On Mon, 11 Dec 2023 07:26:31 GMT, Jatin Bhateja wrote: > Hi, > > This bug fix patch optimizes integral gather mask computation using cheaper instruction and fixes incorrect instruction attributes in legacy integral gather instructions. > > All Vector API JTREG tests are passing with this at various AVX levels. > > Kindly review and share feedback. > > Best Regards, > Jatin Okay, so if I understand correctly, since these instructions are always VEX encoded, the VL set to true does not make a difference and should not lead to failures, correct? Looks reasonable to me but someone more experienced with this should have a look as well. 
------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/17048#pullrequestreview-1781287642 From tholenstein at openjdk.org Thu Dec 14 08:46:55 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Thu, 14 Dec 2023 08:46:55 GMT Subject: [jdk22] RFR: 8321974: Crash in ciKlass::is_subtype_of because TypeAryPtr::_klass is not initialized In-Reply-To: References: Message-ID: On Thu, 14 Dec 2023 07:41:03 GMT, Tobias Hartmann wrote: > Hi all, > > This pull request contains a backport of commit [c8ad7b7f](https://github.com/openjdk/jdk/commit/c8ad7b7f84ead3f850f034e1db6335bbbac41589) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by Tobias Hartmann on 14 Dec 2023 and was reviewed by Roland Westrelin and Vladimir Kozlov. > > Thanks! Marked as reviewed by tholenstein (Reviewer). Looks good. ------------- PR Review: https://git.openjdk.org/jdk22/pull/12#pullrequestreview-1781314549 PR Comment: https://git.openjdk.org/jdk22/pull/12#issuecomment-1855418767 From duke at openjdk.org Thu Dec 14 08:47:13 2023 From: duke at openjdk.org (Daniel Lundén) Date: Thu, 14 Dec 2023 08:47:13 GMT Subject: RFR: 8320682: [AArch64] C1 compilation fails with "Field too big for insn" [v4] In-Reply-To: References: Message-ID: > This changeset fixes an issue where addresses for float and double constants on aarch64 were sometimes out of range for PC-relative offsets using `adr`. > > Changes: > - Set an upper bound of `1M` for the flag `NMethodSizeLimit`, ensuring that float and double constants are in range for `adr`. > - Revise tests in `TestC1Globals.java` to use the new upper bound of `1M` for `NMethodSizeLimit`. Also, remove no longer applicable tests in `TestC1Globals.java`.
> > ### Testing (in progress) > Platforms: windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64 > - tier1, tier2, tier3, tier4, tier5 > - Targeted and repeated tests for `TestC1Globals.java` in all tiers Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: Add comment explaining 1MB restriction ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16951/files - new: https://git.openjdk.org/jdk/pull/16951/files/d19040e5..1d0515b7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16951&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16951&range=02-03 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/16951.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16951/head:pull/16951 PR: https://git.openjdk.org/jdk/pull/16951 From duke at openjdk.org Thu Dec 14 08:47:15 2023 From: duke at openjdk.org (Daniel Lundén) Date: Thu, 14 Dec 2023 08:47:15 GMT Subject: RFR: 8320682: [AArch64] C1 compilation fails with "Field too big for insn" [v3] In-Reply-To: <0pwWmcg5mBei8T9v-z71ogtM0YB2QIKwOM7tK8yTrSo=.1bf57583-9fae-413c-91aa-4f94d5df5bf7@github.com> References: <0pwWmcg5mBei8T9v-z71ogtM0YB2QIKwOM7tK8yTrSo=.1bf57583-9fae-413c-91aa-4f94d5df5bf7@github.com> Message-ID: On Thu, 7 Dec 2023 15:53:47 GMT, Daniel Lundén wrote: >> This changeset fixes an issue where addresses for float and double constants on aarch64 were sometimes out of range for PC-relative offsets using `adr`. >> >> Changes: >> - Set an upper bound of `1M` for the flag `NMethodSizeLimit`, ensuring that float and double constants are in range for `adr`. >> - Revise tests in `TestC1Globals.java` to use the new upper bound of `1M` for `NMethodSizeLimit`. Also, remove no longer applicable tests in `TestC1Globals.java`.
>> >> ### Testing (in progress) >> Platforms: windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64 >> - tier1, tier2, tier3, tier4, tier5 >> - Targeted and repeated tests for `TestC1Globals.java` in all tiers > > Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: > > Update copyright Thanks for the input everyone. Let's go with a 1MB upper bound for now then (on all platforms). I'll rerun some tests before integrating. ------------- PR Comment: https://git.openjdk.org/jdk/pull/16951#issuecomment-1855417755 From thartmann at openjdk.org Thu Dec 14 08:59:43 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 14 Dec 2023 08:59:43 GMT Subject: [jdk22] RFR: 8321974: Crash in ciKlass::is_subtype_of because TypeAryPtr::_klass is not initialized In-Reply-To: References: Message-ID: On Thu, 14 Dec 2023 07:41:03 GMT, Tobias Hartmann wrote: > Hi all, > > This pull request contains a backport of commit [c8ad7b7f](https://github.com/openjdk/jdk/commit/c8ad7b7f84ead3f850f034e1db6335bbbac41589) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by Tobias Hartmann on 14 Dec 2023 and was reviewed by Roland Westrelin and Vladimir Kozlov. > > Thanks! Thanks, Toby! ------------- PR Comment: https://git.openjdk.org/jdk22/pull/12#issuecomment-1855437203 From duke at openjdk.org Thu Dec 14 09:25:38 2023 From: duke at openjdk.org (Daniel Lundén) Date: Thu, 14 Dec 2023 09:25:38 GMT Subject: RFR: 8321820: TestLoadNIdeal fails on 32-bit because -XX:+UseCompressedOops is not recognized In-Reply-To: References: Message-ID: <7ju402gVOscJh8e5YZu_Sv6jg2WsNu4vTNuvAi3hK0U=.803375ba-f3ba-49f9-90d8-b95a699eac73@github.com> On Wed, 13 Dec 2023 09:45:29 GMT, Daniel Lundén wrote: > This changeset fixes an issue where `TestLoadNIdeal.java` fails on 32-bit, where `-XX:+UseCompressedOops` is not available.
> > Changes: > - Only run the test on 64-bit platforms. > > ### Testing > windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64: > - tier1, HotSpot parts of tier2 and tier3 > > linux-x86 (32-bit) > - tier1 Thanks for the reviews. Please sponsor! ------------- PR Comment: https://git.openjdk.org/jdk/pull/17083#issuecomment-1855476663 From duke at openjdk.org Thu Dec 14 09:32:50 2023 From: duke at openjdk.org (Daniel Lundén) Date: Thu, 14 Dec 2023 09:32:50 GMT Subject: Integrated: 8321820: TestLoadNIdeal fails on 32-bit because -XX:+UseCompressedOops is not recognized In-Reply-To: References: Message-ID: On Wed, 13 Dec 2023 09:45:29 GMT, Daniel Lundén wrote: > This changeset fixes an issue where `TestLoadNIdeal.java` fails on 32-bit, where `-XX:+UseCompressedOops` is not available. > > Changes: > - Only run the test on 64-bit platforms. > > ### Testing > windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64: > - tier1, HotSpot parts of tier2 and tier3 > > linux-x86 (32-bit) > - tier1 This pull request has now been integrated.
Changeset: d632d743 Author: Daniel Lundén Committer: Christian Hagedorn URL: https://git.openjdk.org/jdk/commit/d632d743e018c69ecf423af75b65354e8ffaefc8 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod 8321820: TestLoadNIdeal fails on 32-bit because -XX:+UseCompressedOops is not recognized Reviewed-by: rcastanedalo, chagedorn, shade ------------- PR: https://git.openjdk.org/jdk/pull/17083 From thartmann at openjdk.org Thu Dec 14 09:46:01 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 14 Dec 2023 09:46:01 GMT Subject: [jdk22] Integrated: 8321974: Crash in ciKlass::is_subtype_of because TypeAryPtr::_klass is not initialized In-Reply-To: References: Message-ID: On Thu, 14 Dec 2023 07:41:03 GMT, Tobias Hartmann wrote: > Hi all, > > This pull request contains a backport of commit [c8ad7b7f](https://github.com/openjdk/jdk/commit/c8ad7b7f84ead3f850f034e1db6335bbbac41589) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by Tobias Hartmann on 14 Dec 2023 and was reviewed by Roland Westrelin and Vladimir Kozlov. > > Thanks! This pull request has now been integrated. Changeset: 41b7296f Author: Tobias Hartmann URL: https://git.openjdk.org/jdk22/commit/41b7296f4988d2a3b3d212b0634ded9e59496590 Stats: 89 lines in 3 files changed: 54 ins; 22 del; 13 mod 8321974: Crash in ciKlass::is_subtype_of because TypeAryPtr::_klass is not initialized Reviewed-by: epeter, tholenstein Backport-of: c8ad7b7f84ead3f850f034e1db6335bbbac41589 ------------- PR: https://git.openjdk.org/jdk22/pull/12 From jbhateja at openjdk.org Thu Dec 14 09:47:39 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 14 Dec 2023 09:47:39 GMT Subject: RFR: 8321648: Integral gather optimized mask computation.
In-Reply-To: References: Message-ID: On Thu, 14 Dec 2023 08:29:49 GMT, Tobias Hartmann wrote: > Okay, so if I understand correctly, since these instructions are always VEX encoded, the VL set to true does not make a difference and should not lead to failures, correct? Thanks @TobiHartmann, yes, that is my understanding. A purely VEX-encoded instruction does not have sufficient bits in the VEX prefix to encode a register from a higher register bank. ------------- PR Comment: https://git.openjdk.org/jdk/pull/17048#issuecomment-1855511444 From epeter at openjdk.org Thu Dec 14 11:34:19 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 14 Dec 2023 11:34:19 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v43] In-Reply-To: References: Message-ID: > I want to push this in JDK23. > After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). > > To calm your nerves: most of the changes are in auto-generated tests, and tests in general. > > **Context** > > `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). > > Alignment is split into two tasks: > - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. > - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width.
> > **Problem** > > I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). > In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. > Thanks @fg1417 for confirming this! > Hence, we need to fix the alignment correctness checks. > > While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. > > **Problem Details** > > Reproducer: > > > static void test(short[] a, short[] b, short mask) { > for (int i = 0; i < RANGE; i+=8) { > // Problematic for AlignVector > b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 > > b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes > b[i+4] = (short)(a[i+4] & mask); > b[i+5] = (short)(a[i+5] & mask); > b[i+6] = (short)(a[i+6] & mask); > } > } > > > During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. > > This is problematic as shown in this example. We have references at index offset `0, 3, 4, 5, 6`, and by... 
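The byte offsets named in the reproducer above can be checked with a minimal sketch. This is not HotSpot or SuperWord code; the class name `AlignmentSketch`, the constant `VECTOR_BYTES` (an 8-byte vector holding four shorts), and the premise that the address of `a[i+0]` is already vector-aligned are all assumptions made for illustration. It only shows why checking the lowest-offset reference says nothing about the pack that starts three elements later.

```java
// Illustrative sketch only; not HotSpot code. VECTOR_BYTES and the premise that
// a[i + 0] is vector-aligned are hypothetical choices made for this example.
final class AlignmentSketch {
    static final int SHORT_BYTES = 2;   // size of one short element
    static final int VECTOR_BYTES = 8;  // assumed width of a pack of four shorts

    // Byte offset of a[i + elementOffset] relative to a[i + 0].
    static int byteOffset(int elementOffset) {
        return elementOffset * SHORT_BYTES;
    }

    // True if an access at this element offset is vector-aligned, assuming
    // the address of a[i + 0] itself is a multiple of VECTOR_BYTES.
    static boolean isAligned(int elementOffset) {
        return byteOffset(elementOffset) % VECTOR_BYTES == 0;
    }

    public static void main(String[] args) {
        // 0 is the lowest-offset reference; 3 is the first element of the 4-wide pack.
        int[] starts = {0, 3};
        for (int start : starts) {
            System.out.println("element offset " + start
                    + " -> byte offset " + byteOffset(start)
                    + ", aligned: " + isAligned(start));
        }
    }
}
```

Under these assumptions the reference at element offset 0 is aligned while the pack starting at element offset 3 sits at byte offset 6 and is not. Because the loop advances by 8 shorts (16 bytes) per iteration, that relative misalignment is the same in every iteration.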
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: renamings and proof improvement in adjust_pre_loop_limit_to_align_main_loop_vectors ------------- Changes: - all: https://git.openjdk.org/jdk/pull/14785/files - new: https://git.openjdk.org/jdk/pull/14785/files/f118744f..b37f0c12 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=42 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=41-42 Stats: 138 lines in 1 file changed: 59 ins; 10 del; 69 mod Patch: https://git.openjdk.org/jdk/pull/14785.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14785/head:pull/14785 PR: https://git.openjdk.org/jdk/pull/14785 From tholenstein at openjdk.org Thu Dec 14 11:41:56 2023 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Thu, 14 Dec 2023 11:41:56 GMT Subject: RFR: JDK-8321984: IGV: Upgrade to Netbeans Platform 20 Message-ID: Upgraded IGV and dependencies to the newest Netbeans Platform 20 which was released in December 2023. Tested that IGV still behaves as expected after the upgrade.
------------- Commit messages: - JDK-8321984: IGV: Upgrade to Netbeans Platform 20 Changes: https://git.openjdk.org/jdk/pull/17106/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=17106&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8321984 Stats: 10 lines in 1 file changed: 0 ins; 0 del; 10 mod Patch: https://git.openjdk.org/jdk/pull/17106.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17106/head:pull/17106 PR: https://git.openjdk.org/jdk/pull/17106 From rcastanedalo at openjdk.org Thu Dec 14 12:58:38 2023 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Thu, 14 Dec 2023 12:58:38 GMT Subject: RFR: JDK-8321984: IGV: Upgrade to Netbeans Platform 20 In-Reply-To: References: Message-ID: On Thu, 14 Dec 2023 11:36:28 GMT, Tobias Holenstein wrote: > Upgraded IGV and dependencies to the newest Netbeans Platform 20 which was released in December 2023. > > Tested that IGV still behaves as expected after the upgrade. Thanks for doing this, Tobias. If we are raising the minimum JDK version to 17, it would make sense to make the dependency on `nashorn-core` unconditional (in `src/utils/IdealGraphVisualizer/Filter/pom.xml`). ------------- Changes requested by rcastanedalo (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/17106#pullrequestreview-1781758306 From duke at openjdk.org Thu Dec 14 13:04:45 2023 From: duke at openjdk.org (Daniel Lundén) Date: Thu, 14 Dec 2023 13:04:45 GMT Subject: RFR: 8320682: [AArch64] C1 compilation fails with "Field too big for insn" [v4] In-Reply-To: References: Message-ID: On Thu, 14 Dec 2023 08:47:13 GMT, Daniel Lundén wrote: >> This changeset fixes an issue where addresses for float and double constants on aarch64 were sometimes out of range for PC-relative offsets using `adr`. >> >> Changes: >> - Set an upper bound of `1M` for the flag `NMethodSizeLimit`, ensuring that float and double constants are in range for `adr`.
>> - Revise tests in `TestC1Globals.java` to use the new upper bound of `1M` for `NMethodSizeLimit`. Also, remove no longer applicable tests in `TestC1Globals.java`. >> >> ### Testing (in progress) >> Platforms: windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64 >> - tier1, tier2, tier3, tier4, tier5 >> - Targeted and repeated tests for `TestC1Globals.java` in all tiers > > Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision: > > Add comment explaining 1MB restriction Ready now, please sponsor. ------------- PR Comment: https://git.openjdk.org/jdk/pull/16951#issuecomment-1855812681 From duke at openjdk.org Thu Dec 14 13:12:50 2023 From: duke at openjdk.org (Daniel Lundén) Date: Thu, 14 Dec 2023 13:12:50 GMT Subject: Integrated: 8320682: [AArch64] C1 compilation fails with "Field too big for insn" In-Reply-To: References: Message-ID: On Mon, 4 Dec 2023 14:19:10 GMT, Daniel Lundén wrote: > This changeset fixes an issue where addresses for float and double constants on aarch64 were sometimes out of range for PC-relative offsets using `adr`. > > Changes: > - Set an upper bound of `1M` for the flag `NMethodSizeLimit`, ensuring that float and double constants are in range for `adr`. > - Revise tests in `TestC1Globals.java` to use the new upper bound of `1M` for `NMethodSizeLimit`. Also, remove no longer applicable tests in `TestC1Globals.java`. > > ### Testing (in progress) > Platforms: windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64 > - tier1, tier2, tier3, tier4, tier5 > - Targeted and repeated tests for `TestC1Globals.java` in all tiers This pull request has now been integrated.
Changeset: 69014cd5 Author: Daniel Lundén Committer: Roberto Castañeda Lozano URL: https://git.openjdk.org/jdk/commit/69014cd55b59a0a63f4918fad575a6887640573e Stats: 36 lines in 2 files changed: 3 ins; 23 del; 10 mod 8320682: [AArch64] C1 compilation fails with "Field too big for insn" Reviewed-by: thartmann, aph, dlong ------------- PR: https://git.openjdk.org/jdk/pull/16951 From chagedorn at openjdk.org Thu Dec 14 13:49:47 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 14 Dec 2023 13:49:47 GMT Subject: RFR: 8319793: C2 compilation fails with "Bad graph detected in build_loop_late" after JDK-8279888 [v2] In-Reply-To: References: Message-ID: On Wed, 13 Dec 2023 08:55:25 GMT, Roland Westrelin wrote: >> Range check smearing and range check predication make an array access >> dependent on 2 (or more in the case of RC smearing) conditions. As a >> consequence, if a range check can be eliminated because there's an >> identical dominating range check, the control dependent nodes that >> could float and become dependent on the dominating range check cannot >> be allowed to float because there's a risk that they would then bypass >> one of the checks that make the access legal. >> >> `IfNode::dominated_by()` and `PhaseIdealLoop::dominated_by()` have >> logic to prevent this: nodes that are control dependent on a range >> check or predicate are not allowed to float. This is however not >> sufficient as demonstrated by the test cases.
>> >> In `TestArrayAccessAboveRCAfterSmearingOrPredication.testRangeCheckSmearing()`: >> >> >> v += array[i]; >> if (flag2) { >> if (flag3) { >> field = 0x42; >> } >> } >> if (flagField == 1) { >> v += array[i]; >> } >> >> >> The range check for the second `array[i]` load is replaced by the >> dominating range check for the first `array[i]` but because the second >> `array[i]` load could really be dependent on multiple range checks (in >> case smearing happened which is not the case here), c2 doesn't allow >> the second `array[i]` to float when the second range check is >> removed. The second `array[i]` is then control dependent on: >> >> >> if (flagField == 1) { >> >> >> which is next found to be dominated by the same test: >> >> >> if (flag == 1) { >> >> >> and is removed. However nothing in `dominated_by()` treats node >> dependent on tests that are not range check or predicates >> specially. So the second `array[i]` is allowed to float and become >> dependent on: >> >> >> if (flag == 1) { >> >> >> which is above the range check for that access. The test method in its >> last invocation is passed an index for the array access that's widely >> out of range. The array load happens before the range check and >> crashes the VM. `testLoopPredication()` is a similar test where array >> loads become dependent on predicates and end up above range checks. >> >> `TestArrayAccessCastIIAboveRC.java` is the test case from the bug >> where for similar reasons a range check `CastII` ends up above its >> range check, becomes top because its input becomes some integer that >> conflicts with its... > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits: > > - Merge branch 'master' into JDK-8319793 > - fix & test Nice summary! I have a few comments but otherwise, the fix looks reasonable. Have you also run some performance testing to check if delaying RC smearing has any impact? 
src/hotspot/share/opto/castnode.cpp line 326: > 324: #endif > 325: > 326: CastIINode* CastIINode::pin_for_array_access() const { Not sure if it's worth it, but you could add a sanity assert here that `_dependency == RegularDependency`, since you have always checked `depends_only_on_test()` before calling this. src/hotspot/share/opto/castnode.cpp line 328: > 326: CastIINode* CastIINode::pin_for_array_access() const { > 327: if (has_range_check()) { > 328: return new CastIINode(in(0), in(1), bottom_type(), ConstraintCastNode::StrongDependency, has_range_check()); `ConstraintCastNode::` can be removed: Suggestion: return new CastIINode(in(0), in(1), bottom_type(), StrongDependency, has_range_check()); src/hotspot/share/opto/castnode.hpp line 127: > 125: } > 126: > 127: CastIINode* pin_for_array_access() const; You could add `override` to make it easier to identify that this is an overriding method. Same in `LoadNode`. Suggestion: CastIINode* pin_for_array_access() const override; src/hotspot/share/opto/ifnode.cpp line 566: > 564: if (new_cmp == cmp) return; > 565: // Else, adjust existing check > 566: Node *new_bol = gvn->transform( new BoolNode(new_cmp, bol->as_Bool()->_test._test)); Suggestion: Node* new_bol = gvn->transform(new BoolNode(new_cmp, bol->as_Bool()->_test._test)); src/hotspot/share/opto/loopopts.cpp line 358: > 356: // Loads and range check Cast nodes that are control dependent on this range check depend on multiple dominating > 357: // range checks and can't float even if the range check they'll be control dependent on once this function > 358: // returns is replaced by a dominating range check: pin them. I suggest adding some more details from the PR description (this could be done analogously for the other two similar comments). How about: // Loads and range check Cast nodes that are control dependent on this range check (that is about to be removed) // now depend on multiple dominating range checks.
After the removal of this range check, these control dependent // nodes end up at the lowest/nearest dominating check in the graph. To ensure that these Loads/Casts do not float // above any of the dominating checks (even when the lowest dominating check is later replaced by yet another // dominating check), we need to pin them at the lowest dominating check. src/hotspot/share/opto/memnode.hpp line 295: > 293: bool has_pinned_control_dependency() const { return _control_dependency == Pinned; } > 294: > 295: LoadNode* pin_for_array_access() const; Suggestion: LoadNode* pin_for_array_access() const override; ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/16886#pullrequestreview-1781358495 PR Review Comment: https://git.openjdk.org/jdk/pull/16886#discussion_r1426522329 PR Review Comment: https://git.openjdk.org/jdk/pull/16886#discussion_r1426522454 PR Review Comment: https://git.openjdk.org/jdk/pull/16886#discussion_r1426480931 PR Review Comment: https://git.openjdk.org/jdk/pull/16886#discussion_r1426437097 PR Review Comment: https://git.openjdk.org/jdk/pull/16886#discussion_r1426736470 PR Review Comment: https://git.openjdk.org/jdk/pull/16886#discussion_r1426481493 From qamai at openjdk.org Thu Dec 14 15:04:49 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Thu, 14 Dec 2023 15:04:49 GMT Subject: RFR: 8319451: PhaseIdealLoop::conditional_move is too conservative [v2] In-Reply-To: References: Message-ID: On Mon, 13 Nov 2023 07:49:25 GMT, Tobias Hartmann wrote: >> Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: >> >> - change freq to double >> - Merge branch 'master' into cmovethreshold >> - adjust threshold, add benchmark > > Looks reasonable to me. 
All tests passed and performance results look neutral (no statistically significant improvements or regressions). @TobiHartmann @vnkozlov @cl4es Thanks a lot for your reviews and testing. If there is nothing that concerns you, I will integrate the patch. ------------- PR Comment: https://git.openjdk.org/jdk/pull/16524#issuecomment-1856012645 From roland at openjdk.org Thu Dec 14 15:09:52 2023 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 14 Dec 2023 15:09:52 GMT Subject: RFR: 8319793: C2 compilation fails with "Bad graph detected in build_loop_late" after JDK-8279888 [v3] In-Reply-To: References: Message-ID: > Range check smearing and range check predication make an array access > dependent on 2 (or more in the case of RC smearing) conditions. As a > consequence, if a range check can be eliminated because there's an > identical dominating range check, the control dependent nodes that > could float and become dependent on the dominating range check cannot > be allowed to float because there's a risk that they would then bypass > one of the checks that make the access legal. > > `IfNode::dominated_by()` and `PhaseIdealLoop::dominated_by()` have > logic to prevent this: nodes that are control dependent on a range > check or predicate are not allowed to float. This is however not > sufficient as demonstrated by the test cases. > > In `TestArrayAccessAboveRCAfterSmearingOrPredication.testRangeCheckSmearing()`: > > > v += array[i]; > if (flag2) { > if (flag3) { > field = 0x42; > } > } > if (flagField == 1) { > v += array[i]; > } > > > The range check for the second `array[i]` load is replaced by the > dominating range check for the first `array[i]` but because the second > `array[i]` load could really be dependent on multiple range checks (in > case smearing happened which is not the case here), c2 doesn't allow > the second `array[i]` to float when the second range check is > removed.
The second `array[i]` is then control dependent on: > > > if (flagField == 1) { > > > which is next found to be dominated by the same test: > > > if (flag == 1) { > > > and is removed. However nothing in `dominated_by()` treats node > dependent on tests that are not range check or predicates > specially. So the second `array[i]` is allowed to float and become > dependent on: > > > if (flag == 1) { > > > which is above the range check for that access. The test method in its > last invocation is passed an index for the array access that's widely > out of range. The array load happens before the range check and > crashes the VM. `testLoopPredication()` is a similar test where array > loads become dependent on predicates and end up above range checks. > > `TestArrayAccessCastIIAboveRC.java` is the test case from the bug > where for similar reasons a range check `CastII` ends up above its > range check, becomes top because its input becomes some integer that > conflicts with its type (but there's no condition to catch it). The > graph becomes broken and c2 crashes. > > Logic in the `dominated_by()` methods ... 
Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: Update src/hotspot/share/opto/ifnode.cpp Co-authored-by: Christian Hagedorn ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16886/files - new: https://git.openjdk.org/jdk/pull/16886/files/ce5d15d5..4acc9c6e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16886&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16886&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/16886.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16886/head:pull/16886 PR: https://git.openjdk.org/jdk/pull/16886 From roland at openjdk.org Thu Dec 14 15:13:01 2023 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 14 Dec 2023 15:13:01 GMT Subject: RFR: 8319793: C2 compilation fails with "Bad graph detected in build_loop_late" after JDK-8279888 [v4] In-Reply-To: References: Message-ID: > Range check smearing and range check predication make an array access > dependent on 2 (or more in the case of RC smearing) conditions. As a > consequence, if a range check can be eliminated because there's an > identical dominating range check, the control dependent nodes that > could float and become dependent on the dominating range check cannot > be allowed to float because there's a risk that they would then bypass > one of the checks that make the access legal. > > `IfNode::dominated_by()` and `PhaseIdealLoop::dominated_by()` have > logic to prevent this: nodes that are control dependent on a range > check or predicate are not allowed to float. This is however not > sufficient as demonstrated by the test cases. 
> > In `TestArrayAccessAboveRCAfterSmearingOrPredication.testRangeCheckSmearing()`: > > > v += array[i]; > if (flag2) { > if (flag3) { > field = 0x42; > } > } > if (flagField == 1) { > v += array[i]; > } > > > The range check for the second `array[i]` load is replaced by the > dominating range check for the first `array[i]` but because the second > `array[i]` load could really be dependent on multiple range checks (in > case smearing happened which is not the case here), c2 doesn't allow > the second `array[i]` to float when the second range check is > removed. The second `array[i]` is then control dependent on: > > > if (flagField == 1) { > > > which is next found to be dominated by the same test: > > > if (flag == 1) { > > > and is removed. However nothing in `dominated_by()` treats node > dependent on tests that are not range check or predicates > specially. So the second `array[i]` is allowed to float and become > dependent on: > > > if (flag == 1) { > > > which is above the range check for that access. The test method in its > last invocation is passed an index for the array access that's widely > out of range. The array load happens before the range check and > crashes the VM. `testLoopPredication()` is a similar test where array > loads become dependent on predicates and end up above range checks. > > `TestArrayAccessCastIIAboveRC.java` is the test case from the bug > where for similar reasons a range check `CastII` ends up above its > range check, becomes top because its input becomes some integer that > conflicts with its type (but there's no condition to catch it). The > graph becomes broken and c2 crashes. > > Logic in the `dominated_by()` methods ... 
Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: Update src/hotspot/share/opto/castnode.cpp Co-authored-by: Christian Hagedorn ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16886/files - new: https://git.openjdk.org/jdk/pull/16886/files/4acc9c6e..9420aae4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16886&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16886&range=02-03 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/16886.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16886/head:pull/16886 PR: https://git.openjdk.org/jdk/pull/16886 From roland at openjdk.org Thu Dec 14 15:17:56 2023 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 14 Dec 2023 15:17:56 GMT Subject: RFR: 8319793: C2 compilation fails with "Bad graph detected in build_loop_late" after JDK-8279888 [v5] In-Reply-To: References: Message-ID: > Range check smearing and range check predication make an array access > dependent on 2 (or more in the case of RC smearing) conditions. As a > consequence, if a range check can be eliminated because there's an > identical dominating range check, the control dependent nodes that > could float and become dependent on the dominating range check cannot > be allowed to float because there's a risk that they would then bypass > one of the checks that make the access legal. > > `IfNode::dominated_by()` and `PhaseIdealLoop::dominated_by()` have > logic to prevent this: nodes that are control dependent on a range > check or predicate are not allowed to float. This is however not > sufficient as demonstrated by the test cases. 
> > In `TestArrayAccessAboveRCAfterSmearingOrPredication.testRangeCheckSmearing()`: > > > v += array[i]; > if (flag2) { > if (flag3) { > field = 0x42; > } > } > if (flagField == 1) { > v += array[i]; > } > > > The range check for the second `array[i]` load is replaced by the > dominating range check for the first `array[i]` but because the second > `array[i]` load could really be dependent on multiple range checks (in > case smearing happened which is not the case here), c2 doesn't allow > the second `array[i]` to float when the second range check is > removed. The second `array[i]` is then control dependent on: > > > if (flagField == 1) { > > > which is next found to be dominated by the same test: > > > if (flag == 1) { > > > and is removed. However nothing in `dominated_by()` treats node > dependent on tests that are not range check or predicates > specially. So the second `array[i]` is allowed to float and become > dependent on: > > > if (flag == 1) { > > > which is above the range check for that access. The test method in its > last invocation is passed an index for the array access that's widely > out of range. The array load happens before the range check and > crashes the VM. `testLoopPredication()` is a similar test where array > loads become dependent on predicates and end up above range checks. > > `TestArrayAccessCastIIAboveRC.java` is the test case from the bug > where for similar reasons a range check `CastII` ends up above its > range check, becomes top because its input becomes some integer that > conflicts with its type (but there's no condition to catch it). The > graph becomes broken and c2 crashes. > > Logic in the `dominated_by()` methods ... 
Roland Westrelin has updated the pull request incrementally with two additional commits since the last revision: - Update src/hotspot/share/opto/memnode.hpp Co-authored-by: Christian Hagedorn - Update src/hotspot/share/opto/castnode.hpp Co-authored-by: Christian Hagedorn ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16886/files - new: https://git.openjdk.org/jdk/pull/16886/files/9420aae4..bdb731ea Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16886&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16886&range=03-04 Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/16886.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16886/head:pull/16886 PR: https://git.openjdk.org/jdk/pull/16886 From redestad at openjdk.org Thu Dec 14 15:22:43 2023 From: redestad at openjdk.org (Claes Redestad) Date: Thu, 14 Dec 2023 15:22:43 GMT Subject: RFR: 8319451: PhaseIdealLoop::conditional_move is too conservative [v2] In-Reply-To: References: Message-ID: On Tue, 12 Dec 2023 16:39:17 GMT, Quan Anh Mai wrote: >> Hi, >> >> When transforming a Phi into a CMove, the threshold is set to be approximately BlockLayoutMinDiamondPercentage, the reason is given: >> >> // BlockLayoutByFrequency optimization moves infrequent branch >> // from hot path. No point in CMOV'ing in such case >> >> This sets the default value of the threshold to be around 18%, which is too conservative. The reason also does not make a lot of sense since the important property which makes jumping expensive is not code layout. We should remove this. >> >> Please kindly review, thank you very much. > > Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains three additional commits since the last revision: > > - change freq to double > - Merge branch 'master' into cmovethreshold > - adjust threshold, add benchmark Thank you for the work on this and for picking up on my suggestions for the microbenchmark! Looking forward to seeing the effects of this on a wider selection of benchmarks. ------------- Marked as reviewed by redestad (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/16524#pullrequestreview-1782059077 From roland at openjdk.org Thu Dec 14 15:37:07 2023 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 14 Dec 2023 15:37:07 GMT Subject: RFR: 8319793: C2 compilation fails with "Bad graph detected in build_loop_late" after JDK-8279888 [v6] In-Reply-To: References: Message-ID: > Range check smearing and range check predication make an array access > dependent on 2 (or more in the case of RC smearing) conditions. As a > consequence, if a range check can be eliminated because there's an > identical dominating range check, the control dependent nodes that > could float and become dependent on the dominating range check cannot > be allowed to float because there's a risk that they would then bypass > one of the checks that make the access legal. > > `IfNode::dominated_by()` and `PhaseIdealLoop::dominated_by()` have > logic to prevent this: nodes that are control dependent on a range > check or predicate are not allowed to float. This is however not > sufficient as demonstrated by the test cases.
> > In `TestArrayAccessAboveRCAfterSmearingOrPredication.testRangeCheckSmearing()`: > > > v += array[i]; > if (flag2) { > if (flag3) { > field = 0x42; > } > } > if (flagField == 1) { > v += array[i]; > } > > > The range check for the second `array[i]` load is replaced by the > dominating range check for the first `array[i]` but because the second > `array[i]` load could really be dependent on multiple range checks (in > case smearing happened which is not the case here), c2 doesn't allow > the second `array[i]` to float when the second range check is > removed. The second `array[i]` is then control dependent on: > > > if (flagField == 1) { > > > which is next found to be dominated by the same test: > > > if (flag == 1) { > > > and is removed. However nothing in `dominated_by()` treats node > dependent on tests that are not range check or predicates > specially. So the second `array[i]` is allowed to float and become > dependent on: > > > if (flag == 1) { > > > which is above the range check for that access. The test method in its > last invocation is passed an index for the array access that's widely > out of range. The array load happens before the range check and > crashes the VM. `testLoopPredication()` is a similar test where array > loads become dependent on predicates and end up above range checks. > > `TestArrayAccessCastIIAboveRC.java` is the test case from the bug > where for similar reasons a range check `CastII` ends up above its > range check, becomes top because its input becomes some integer that > conflicts with its type (but there's no condition to catch it). The > graph becomes broken and c2 crashes. > > Logic in the `dominated_by()` methods ... 
Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: review ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16886/files - new: https://git.openjdk.org/jdk/pull/16886/files/bdb731ea..07867e6a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16886&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16886&range=04-05 Stats: 6 lines in 2 files changed: 3 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/16886.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16886/head:pull/16886 PR: https://git.openjdk.org/jdk/pull/16886 From roland at openjdk.org Thu Dec 14 15:37:09 2023 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 14 Dec 2023 15:37:09 GMT Subject: RFR: 8319793: C2 compilation fails with "Bad graph detected in build_loop_late" after JDK-8279888 [v2] In-Reply-To: References: Message-ID: On Thu, 14 Dec 2023 13:47:17 GMT, Christian Hagedorn wrote: > Nice summary! I have a few comments but otherwise, the fix looks reasonable. Thanks for the review and the suggestions. I made the changes you proposed. > Have you also run some performance testing to check if delaying RC smearing has any impact? @TobiHartmann ran performance testing for me. There was no regression. ------------- PR Comment: https://git.openjdk.org/jdk/pull/16886#issuecomment-1856068305 From duke at openjdk.org Thu Dec 14 17:38:04 2023 From: duke at openjdk.org (Zhiqiang Zang) Date: Thu, 14 Dec 2023 17:38:04 GMT Subject: RFR: 8322077: Add Ideal transformation: (~a) | (~b) => ~(a & b) Message-ID: Hello, (~a) | (~b) => ~(a & b) is a widely seen pattern, for example it is implemented for LLVM [here](https://github.com/llvm/llvm-project/blob/397f1ce9efb4eea1ee10fe4833f733b8c7abd878/llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp#L1617C28-L1617C28); however it is missing in current implementation of hotspot. This pull request adds this transformation and associated tests. Thanks. 
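As a quick sanity check of the identity behind the proposed transformation, here is a small standalone C++ sketch. It is not HotSpot code: the function names are made up for illustration, and it says nothing about how the Ideal transformation in the PR is actually implemented. It only compares both forms of the expression on a handful of 32-bit edge-case values.

```cpp
#include <cstdint>

// Left-hand side of the identity: (~a) | (~b), three bitwise operations.
constexpr std::int32_t or_of_nots(std::int32_t a, std::int32_t b) {
  return (~a) | (~b);
}

// Right-hand side of the identity: ~(a & b), two bitwise operations.
constexpr std::int32_t not_of_and(std::int32_t a, std::int32_t b) {
  return ~(a & b);
}

// Hypothetical helper: compares both forms on all pairs of a few sample values.
inline bool identity_holds_on_samples() {
  const std::int32_t samples[] = {0, 1, -1, 0x42, INT32_MIN, INT32_MAX};
  for (std::int32_t a : samples) {
    for (std::int32_t b : samples) {
      if (or_of_nots(a, b) != not_of_and(a, b)) {
        return false;
      }
    }
  }
  return true;
}
```

The rewritten form needs one bitwise operation fewer, which is presumably why the pattern is worth recognizing; the sample loop is only a spot check, since the identity itself is De Morgan's law applied bit by bit.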
------------- Commit messages: - include new optimization and tests. Changes: https://git.openjdk.org/jdk/pull/16334/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=16334&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8322077 Stats: 158 lines in 4 files changed: 158 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/16334.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16334/head:pull/16334 PR: https://git.openjdk.org/jdk/pull/16334 From duke at openjdk.org Thu Dec 14 17:38:04 2023 From: duke at openjdk.org (Zhiqiang Zang) Date: Thu, 14 Dec 2023 17:38:04 GMT Subject: RFR: 8322077: Add Ideal transformation: (~a) | (~b) => ~(a & b) In-Reply-To: References: Message-ID: On Tue, 24 Oct 2023 05:02:56 GMT, Zhiqiang Zang wrote: > Hello, > > (~a) | (~b) => ~(a & b) is a widely seen pattern, for example it is implemented for LLVM [here](https://github.com/llvm/llvm-project/blob/397f1ce9efb4eea1ee10fe4833f733b8c7abd878/llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp#L1617C28-L1617C28); however it is missing in current implementation of hotspot. This pull request adds this transformation and associated tests. > > Thanks. Hi, can I get a review? 
------------- PR Comment: https://git.openjdk.org/jdk/pull/16334#issuecomment-1821175175 From thartmann at openjdk.org Thu Dec 14 17:38:05 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 14 Dec 2023 17:38:05 GMT Subject: RFR: 8322077: Add Ideal transformation: (~a) | (~b) => ~(a & b) In-Reply-To: References: Message-ID: <9ghdtot7GNx0XL2U0QDhfDQhJ1aUR2z8ihymYYO8VVc=.a58eb903-3131-4dde-9b6f-d3b6229a6519@github.com> On Tue, 24 Oct 2023 05:02:56 GMT, Zhiqiang Zang wrote: > Hello, > > (~a) | (~b) => ~(a & b) is a widely seen pattern, for example it is implemented for LLVM [here](https://github.com/llvm/llvm-project/blob/397f1ce9efb4eea1ee10fe4833f733b8c7abd878/llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp#L1617C28-L1617C28); however it is missing in current implementation of hotspot. This pull request adds this transformation and associated tests. > > Thanks. A JBS issue has been created for this: [JDK-8322077](https://bugs.openjdk.org/browse/JDK-8322077). Please update the PR accordingly. ------------- PR Comment: https://git.openjdk.org/jdk/pull/16334#issuecomment-1855407243 From duke at openjdk.org Thu Dec 14 17:46:59 2023 From: duke at openjdk.org (Zhiqiang Zang) Date: Thu, 14 Dec 2023 17:46:59 GMT Subject: RFR: 8322077: Add Ideal transformation: (~a) | (~b) => ~(a & b) [v2] In-Reply-To: References: Message-ID: > Hello, > > (~a) | (~b) => ~(a & b) is a widely seen pattern, for example it is implemented for LLVM [here](https://github.com/llvm/llvm-project/blob/397f1ce9efb4eea1ee10fe4833f733b8c7abd878/llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp#L1617C28-L1617C28); however it is missing in current implementation of hotspot. This pull request adds this transformation and associated tests. > > Thanks. Zhiqiang Zang has updated the pull request incrementally with one additional commit since the last revision: include bug id. 
------------- Changes: - all: https://git.openjdk.org/jdk/pull/16334/files - new: https://git.openjdk.org/jdk/pull/16334/files/2e64889e..341f869c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16334&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16334&range=00-01 Stats: 2 lines in 2 files changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/16334.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16334/head:pull/16334 PR: https://git.openjdk.org/jdk/pull/16334 From duke at openjdk.org Thu Dec 14 17:47:00 2023 From: duke at openjdk.org (Zhiqiang Zang) Date: Thu, 14 Dec 2023 17:47:00 GMT Subject: RFR: 8322077: Add Ideal transformation: (~a) | (~b) => ~(a & b) In-Reply-To: <9ghdtot7GNx0XL2U0QDhfDQhJ1aUR2z8ihymYYO8VVc=.a58eb903-3131-4dde-9b6f-d3b6229a6519@github.com> References: <9ghdtot7GNx0XL2U0QDhfDQhJ1aUR2z8ihymYYO8VVc=.a58eb903-3131-4dde-9b6f-d3b6229a6519@github.com> Message-ID: On Thu, 14 Dec 2023 08:36:23 GMT, Tobias Hartmann wrote: > A JBS issue has been created for this: [JDK-8322077](https://bugs.openjdk.org/browse/JDK-8322077). Please update the PR accordingly. Thanks. Updated. ------------- PR Comment: https://git.openjdk.org/jdk/pull/16334#issuecomment-1856312363 From qamai at openjdk.org Thu Dec 14 18:09:40 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Thu, 14 Dec 2023 18:09:40 GMT Subject: RFR: 8322077: Add Ideal transformation: (~a) | (~b) => ~(a & b) [v2] In-Reply-To: References: Message-ID: On Thu, 14 Dec 2023 17:46:59 GMT, Zhiqiang Zang wrote: >> Hello, >> >> (~a) | (~b) => ~(a & b) is a widely seen pattern, for example it is implemented for LLVM [here](https://github.com/llvm/llvm-project/blob/397f1ce9efb4eea1ee10fe4833f733b8c7abd878/llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp#L1617C28-L1617C28); however it is missing in current implementation of hotspot. This pull request adds this transformation and associated tests. >> >> Thanks. 
> > Zhiqiang Zang has updated the pull request incrementally with one additional commit since the last revision: > > include bug id. LGTM, but could you consider having dedicated functions to check and create `Not` patterns? Thanks a lot. ------------- PR Comment: https://git.openjdk.org/jdk/pull/16334#issuecomment-1856343282 From duke at openjdk.org Thu Dec 14 18:19:12 2023 From: duke at openjdk.org (Joshua Cao) Date: Thu, 14 Dec 2023 18:19:12 GMT Subject: RFR: 8321823: Remove redundant PhaseGVN transform_no_reclaim [v3] In-Reply-To: References: Message-ID: > `PhaseGVN::transform` is just a one line wrapper around `PhaseGVN::transform_no_reclaim`. Looking at the history, they had different functionality in 2008, but since have become the same thing. We prefer to keep `PhaseGVN::transform` because it has hundreds of callsites, while the other only has a few callsites that are shown in the PR. > > Passes tier1 locally on my Linux machine. Joshua Cao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains three additional commits since the last revision: - Merge branch 'master' into gvntransform - Fix formatting - 8321823: Remove redundant PhaseGVN transform and transform_no_reclaim ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17071/files - new: https://git.openjdk.org/jdk/pull/17071/files/09225453..5ea1c16c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17071&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17071&range=01-02 Stats: 1486 lines in 97 files changed: 756 ins; 375 del; 355 mod Patch: https://git.openjdk.org/jdk/pull/17071.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17071/head:pull/17071 PR: https://git.openjdk.org/jdk/pull/17071 From qamai at openjdk.org Thu Dec 14 18:19:13 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Thu, 14 Dec 2023 18:19:13 GMT Subject: RFR: 8321823: Remove redundant PhaseGVN transform_no_reclaim [v2] In-Reply-To: References: <9TRF6bRfXq0gCy_39OitPXgejHyc_V73-sr_L-AC_gU=.803dfaa1-2549-4b86-ad1c-8e9140046006@github.com> Message-ID: On Wed, 13 Dec 2023 21:43:53 GMT, Paul Hohensee wrote: >> Joshua Cao has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix formatting > > There are a couple of GHA failures that look compiler related. I approved the PR before looking at these because the patch is trivial on its face, but now I'm suspicious. @phohensee Those are [JDK-8321820](https://bugs.openjdk.org/browse/JDK-8321820) and [JDK-8321542](https://bugs.openjdk.org/browse/JDK-8321542). 
------------- PR Comment: https://git.openjdk.org/jdk/pull/17071#issuecomment-1856356185 From duke at openjdk.org Thu Dec 14 18:19:13 2023 From: duke at openjdk.org (Joshua Cao) Date: Thu, 14 Dec 2023 18:19:13 GMT Subject: RFR: 8321823: Remove redundant PhaseGVN transform_no_reclaim [v2] In-Reply-To: <9TRF6bRfXq0gCy_39OitPXgejHyc_V73-sr_L-AC_gU=.803dfaa1-2549-4b86-ad1c-8e9140046006@github.com> References: <9TRF6bRfXq0gCy_39OitPXgejHyc_V73-sr_L-AC_gU=.803dfaa1-2549-4b86-ad1c-8e9140046006@github.com> Message-ID: <_TXnJxwuODAWHSWSMyvmvCwXhcxzovxBgyxtpBdqAKE=.4940a903-0d2a-46c4-942a-4079576b6e94@github.com> On Tue, 12 Dec 2023 17:29:50 GMT, Joshua Cao wrote: >> `PhaseGVN::transform` is just a one line wrapper around `PhaseGVN::transform_no_reclaim`. Looking at the history, they had different functionality in 2008, but since have become the same thing. We prefer to keep `PhaseGVN::transform` because it has hundreds of callsites, while the other only has a few callsites that are shown in the PR. >> >> Passes tier1 locally on my Linux machine. > > Joshua Cao has updated the pull request incrementally with one additional commit since the last revision: > > Fix formatting Merged from master. This should pass the failing test with https://github.com/openjdk/jdk/commit/d632d743e018c69ecf423af75b65354e8ffaefc8 ------------- PR Comment: https://git.openjdk.org/jdk/pull/17071#issuecomment-1856356470 From ysr at openjdk.org Fri Dec 15 00:33:40 2023 From: ysr at openjdk.org (Y. 
Srinivas Ramakrishna) Date: Fri, 15 Dec 2023 00:33:40 GMT Subject: RFR: 8321823: Remove redundant PhaseGVN transform_no_reclaim [v2] In-Reply-To: References: <9TRF6bRfXq0gCy_39OitPXgejHyc_V73-sr_L-AC_gU=.803dfaa1-2549-4b86-ad1c-8e9140046006@github.com> Message-ID: <8CbSxvFNpZ_VnQpL8j2VYUCW27yNSMbh-8-hZcWyw5s=.0ca246ed-7e20-496d-b696-0a5d14f14f53@github.com> On Wed, 13 Dec 2023 21:43:53 GMT, Paul Hohensee wrote: >> Joshua Cao has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix formatting > > There are a couple of GHA failures that look compiler related. I approved the PR before looking at these because the patch is trivial on its face, but now I'm suspicious. Since GHA passes, and @phohensee has reviewed, I'm happy to: ------------- PR Comment: https://git.openjdk.org/jdk/pull/17071#issuecomment-1857081243 From duke at openjdk.org Fri Dec 15 00:38:51 2023 From: duke at openjdk.org (Joshua Cao) Date: Fri, 15 Dec 2023 00:38:51 GMT Subject: Integrated: 8321823: Remove redundant PhaseGVN transform_no_reclaim In-Reply-To: References: Message-ID: On Tue, 12 Dec 2023 00:37:19 GMT, Joshua Cao wrote: > `PhaseGVN::transform` is just a one line wrapper around `PhaseGVN::transform_no_reclaim`. Looking at the history, they had different functionality in 2008, but since have become the same thing. We prefer to keep `PhaseGVN::transform` because it has hundreds of callsites, while the other only has a few callsites that are shown in the PR. > > Passes tier1 locally on my Linux machine. This pull request has now been integrated. Changeset: 6dfb8120 Author: Joshua Cao Committer: Y. 
Srinivas Ramakrishna URL: https://git.openjdk.org/jdk/commit/6dfb8120c270a76fcba5a5c3c9ad91da3282d5fa Stats: 20 lines in 5 files changed: 0 ins; 7 del; 13 mod 8321823: Remove redundant PhaseGVN transform_no_reclaim Reviewed-by: chagedorn, phh ------------- PR: https://git.openjdk.org/jdk/pull/17071 From chagedorn at openjdk.org Fri Dec 15 07:51:42 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 15 Dec 2023 07:51:42 GMT Subject: RFR: 8319793: C2 compilation fails with "Bad graph detected in build_loop_late" after JDK-8279888 [v6] In-Reply-To: References: Message-ID: On Thu, 14 Dec 2023 15:37:07 GMT, Roland Westrelin wrote: >> Range check smearing and range check predication make an array access >> dependent on 2 (or more in the case of RC smearing) conditions. As a >> consequence, if a range check can be eliminated because there's an >> identical dominating range check, the control dependent nodes that >> could float and become dependent on the dominating range check cannot >> be allowed to float because there's a risk that they would then bypass >> one of the checks that make the access legal. >> >> `IfNode::dominated_by()` and `PhaseIdealLoop::dominated_by()` have >> logic to prevent this: nodes that are control dependent on a range >> check or predicate are not allowed to float. This is however not >> sufficient as demonstrated by the test cases. >> >> In `TestArrayAccessAboveRCAfterSmearingOrPredication.testRangeCheckSmearing()`: >> >> >> v += array[i]; >> if (flag2) { >> if (flag3) { >> field = 0x42; >> } >> } >> if (flagField == 1) { >> v += array[i]; >> } >> >> >> The range check for the second `array[i]` load is replaced by the >> dominating range check for the first `array[i]` but because the second >> `array[i]` load could really be dependent on multiple range checks (in >> case smearing happened which is not the case here), c2 doesn't allow >> the second `array[i]` to float when the second range check is >> removed. 
The second `array[i]` is then control dependent on: >> >> >> if (flagField == 1) { >> >> >> which is next found to be dominated by the same test: >> >> >> if (flag == 1) { >> >> >> and is removed. However nothing in `dominated_by()` treats node >> dependent on tests that are not range check or predicates >> specially. So the second `array[i]` is allowed to float and become >> dependent on: >> >> >> if (flag == 1) { >> >> >> which is above the range check for that access. The test method in its >> last invocation is passed an index for the array access that's widely >> out of range. The array load happens before the range check and >> crashes the VM. `testLoopPredication()` is a similar test where array >> loads become dependent on predicates and end up above range checks. >> >> `TestArrayAccessCastIIAboveRC.java` is the test case from the bug >> where for similar reasons a range check `CastII` ends up above its >> range check, becomes top because its input becomes some integer that >> conflicts with its... > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > review Changes requested by chagedorn (Reviewer). src/hotspot/share/opto/memnode.hpp line 295: > 293: bool has_pinned_control_dependency() const { return _control_dependency == Pinned; } > 294: > 295: LoadNode* pin_for_array_access() const override; Okay, apparently this does not build on macos (see GHA) due to `-Winconsistent-missing-override`. So, you either have to mark all overriding methods in this class with `override` or use none at all. I guess it's simpler to just get rid of `override` here again.
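To illustrate the two consistent options mentioned above, here is a minimal sketch. It assumes Clang's behavior of warning when a class marks some, but not all, of its overriding methods with `override`; the class names (`BaseNode`, `AllMarked`, `NoneMarked`) and the helper `describe` are hypothetical stand-ins and are not taken from HotSpot.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a node base class; not a HotSpot type.
struct BaseNode {
  virtual ~BaseNode() = default;
  virtual std::string name() const { return "Base"; }
  virtual bool is_pinned() const { return false; }
};

// Consistent option 1: every overriding method carries `override`.
struct AllMarked : BaseNode {
  std::string name() const override { return "AllMarked"; }
  bool is_pinned() const override { return true; }
};

// Consistent option 2: no overriding method carries `override`.
struct NoneMarked : BaseNode {
  virtual std::string name() const { return "NoneMarked"; }
  virtual bool is_pinned() const { return true; }
};

// The mixed form is the one that is expected to be flagged (shown only as a
// comment so that this sketch still builds when warnings are errors):
//
//   struct Mixed : BaseNode {
//     virtual std::string name() const { return "Mixed"; }  // no `override`
//     bool is_pinned() const override { return true; }      // has `override`
//   };

// Dispatches through the base class, so both options behave identically.
inline std::string describe(const BaseNode& n) {
  return n.name() + (n.is_pinned() ? ":pinned" : ":floating");
}
```

Both consistent variants dispatch the same way at run time; for example, `describe(AllMarked())` and `describe(NoneMarked())` each end in `:pinned`. The choice only affects which diagnostics the compiler can emit.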
------------- PR Review: https://git.openjdk.org/jdk/pull/16886#pullrequestreview-1783282655 PR Review Comment: https://git.openjdk.org/jdk/pull/16886#discussion_r1427654549 From chagedorn at openjdk.org Fri Dec 15 07:51:44 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 15 Dec 2023 07:51:44 GMT Subject: RFR: 8319793: C2 compilation fails with "Bad graph detected in build_loop_late" after JDK-8279888 [v2] In-Reply-To: References: Message-ID: On Thu, 14 Dec 2023 15:33:40 GMT, Roland Westrelin wrote: > > Nice summary! I have a few comments but otherwise, the fix looks reasonable. > > Thanks for the review and the suggestions. I made the changes you proposed. Thanks for the changes. There is a build problem now on macos (see comment). > > > Have you also run some performance testing to check if delaying RC smearing has any impact? > > @TobiHartmann ran performance testing for me. There was no regression. That's great to hear, thanks @TobiHartmann! ------------- PR Comment: https://git.openjdk.org/jdk/pull/16886#issuecomment-1857425640 From roland at openjdk.org Fri Dec 15 08:41:20 2023 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 15 Dec 2023 08:41:20 GMT Subject: RFR: 8319793: C2 compilation fails with "Bad graph detected in build_loop_late" after JDK-8279888 [v7] In-Reply-To: References: Message-ID: > Range check smearing and range check predication make an array access > dependent on 2 (or more in the case of RC smearing) conditions. As a > consequence, if a range check can be eliminated because there's an > identical dominating range check, the control dependent nodes that > could float and become dependent on the dominating range check cannot > be allowed to float because there's a risk that they would then bypass > one of the checks that make the access legal. 
> > `IfNode::dominated_by()` and `PhaseIdealLoop::dominated_by()` have > logic to prevent this: nodes that are control dependent on a range > check or predicate are not allowed to float. This is however not > sufficient as demonstrated by the test cases. > > In `TestArrayAccessAboveRCAfterSmearingOrPredication.testRangeCheckSmearing()`: > > > v += array[i]; > if (flag2) { > if (flag3) { > field = 0x42; > } > } > if (flagField == 1) { > v += array[i]; > } > > > The range check for the second `array[i]` load is replaced by the > dominating range check for the first `array[i]` but because the second > `array[i]` load could really be dependent on multiple range checks (in > case smearing happened which is not the case here), c2 doesn't allow > the second `array[i]` to float when the second range check is > removed. The second `array[i]` is then control dependent on: > > > if (flagField == 1) { > > > which is next found to be dominated by the same test: > > > if (flag == 1) { > > > and is removed. However nothing in `dominated_by()` treats node > dependent on tests that are not range check or predicates > specially. So the second `array[i]` is allowed to float and become > dependent on: > > > if (flag == 1) { > > > which is above the range check for that access. The test method in its > last invocation is passed an index for the array access that's widely > out of range. The array load happens before the range check and > crashes the VM. `testLoopPredication()` is a similar test where array > loads become dependent on predicates and end up above range checks. > > `TestArrayAccessCastIIAboveRC.java` is the test case from the bug > where for similar reasons a range check `CastII` ends up above its > range check, becomes top because its input becomes some integer that > conflicts with its type (but there's no condition to catch it). The > graph becomes broken and c2 crashes. > > Logic in the `dominated_by()` methods ... 
Roland Westrelin has updated the pull request incrementally with two additional commits since the last revision: - Revert "Update src/hotspot/share/opto/castnode.hpp" This reverts commit 356c91cca911ed486f9f87f3eff53ce21e1e3ec9. - Revert "Update src/hotspot/share/opto/memnode.hpp" This reverts commit bdb731ea562f314f44d327f7243ef5cf9ad40b2e. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16886/files - new: https://git.openjdk.org/jdk/pull/16886/files/07867e6a..0ab8ae5f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16886&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16886&range=05-06 Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/16886.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16886/head:pull/16886 PR: https://git.openjdk.org/jdk/pull/16886 From roland at openjdk.org Fri Dec 15 08:41:20 2023 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 15 Dec 2023 08:41:20 GMT Subject: RFR: 8319793: C2 compilation fails with "Bad graph detected in build_loop_late" after JDK-8279888 [v2] In-Reply-To: References: Message-ID: <_Xj6c756-8ax1bhDXn1SA85WhXKOr_MlUmdhRh71vic=.e3fda2a9-0696-4e42-84c4-d6aeb93ad75d@github.com> On Fri, 15 Dec 2023 07:48:34 GMT, Christian Hagedorn wrote: > Thanks for the changes. There is a build problem now on macos (see comment). I reverted the `override` changes. ------------- PR Comment: https://git.openjdk.org/jdk/pull/16886#issuecomment-1857488344 From antoine.dessaigne at gmail.com Fri Dec 15 10:20:59 2023 From: antoine.dessaigne at gmail.com (Antoine DESSAIGNE) Date: Fri, 15 Dec 2023 11:20:59 +0100 Subject: Invalid code generated by C2 compiler in OpenJDK 21 Message-ID: Hello everyone, I've found an issue while migrating to OpenJDK 21. A valued local variable (effectively final) has its value removed and it throws a NullPointerException. 
Unfortunately, I cannot provide the source code and the data to reproduce the issue, and I couldn't create a smaller code snippet to show the issue. That said, I'll happily show the code and perform many tests during calls.

Here's what I did so far to diagnose the issue. I bisected the repository to find where the regression comes from. I found this commit 3696711efa5 [1] but it's a merge so I bisected the branch and found 10737e168c9 [2]. Looking at this commit, I have no idea how it could introduce this kind of regression.

Then, thanks to the guidance from Aleksey Shipilëv, I tested many things:

* Issue does *not* happen with the following flags: -Xint, -XX:-TieredCompilation, -XX:TieredStopAtLevel=1, -XX:TieredStopAtLevel=2, -XX:TieredStopAtLevel=3
* Issue also happens with fastdebug builds of OpenJDK, without crashing due to assertions
* Issue still happens in the latest version of the code (commit b31454e3623)
* Issue happens no matter which GC is used, I tried SerialGC, ParallelGC, G1GC, and ShenandoahGC

The tests were performed in Docker containers running on 4 different hosts. Therefore it looks like C2 is generating invalid assembly code. Unfortunately, I'm not great with assembly and the generated assembly is quite big (main code is around 20k).

Do you have an idea of why this is happening? Do you know what test I can run? If one of you is available, we can schedule calls for me to show you the code and my tests. Thank you very much for your assistance.
Have a nice day, Antoine DESSAIGNE [1] https://github.com/openjdk/jdk/commit/3696711efa566fb776d6923da86e17b0e1e22964 [2] https://github.com/openjdk/jdk/commit/10737e168c967a08e257927251861bf2c14795ab From aph-open at littlepinkcloud.com Fri Dec 15 10:46:20 2023 From: aph-open at littlepinkcloud.com (Andrew Haley) Date: Fri, 15 Dec 2023 10:46:20 +0000 Subject: Invalid code generated by C2 compiler in OpenJDK 21 In-Reply-To: References: Message-ID: On 12/15/23 10:20, Antoine DESSAIGNE wrote: > Do you have an idea of why this is happening? Do you know what test I > can run? First, try to reproduce it with JDK 22 preview. If you can't provide a reproducer, it's likely that no one will be able to fix it now, and you'll have to wait until it gets fixed. Try: '-XX:CompileCommand=exclude,foo.bar.baz.Classname::badMethodName' -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From chagedorn at openjdk.org Fri Dec 15 10:50:18 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 15 Dec 2023 10:50:18 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v43] In-Reply-To: References: Message-ID: On Thu, 14 Dec 2023 11:34:19 GMT, Emanuel Peter wrote: >> I want to push this in JDK23. >> After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). >> >> To calm your nerves: most of the changes are in auto-generated tests, and tests in general. >> >> **Context** >> >> `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). 
>> >> Alignment is split into two tasks: >> - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. >> - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width. >> >> **Problem** >> >> I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). >> In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. >> Thanks @fg1417 for confirming this! >> Hence, we need to fix the alignment correctness checks. >> >> While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. >> >> **Problem Details** >> >> Reproducer: >> >> >> static void test(short[] a, short[] b, short mask) { >> for (int i = 0; i < RANGE; i+=8) { >> // Problematic for AlignVector >> b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 >> >> b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes >> b[i+4] = (short)(a[i+4] & mask); >> b[i+5] = (short)(a[i+5] & mask); >> b[i+6] = (short)(a[i+6] & mask); >> } >> } >> >> >> During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. >> >> This is problemati... > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > renamings and proof improvement in adjust_pre_loop_limit_to_align_main_loop_vectors Thanks for addressing my other comments! 
I really like the new structure of having an `AlignmentSolution` interface with different alignment solution classes. I have some more comments but mostly fine tuning - it's already in a quite good shape and the comments really add a lot of benefit to understand the idea behind the code. We get there :-) src/hotspot/share/opto/superword.cpp line 3468: > 3466: // the address of "align_to_ref" to the maximal possible vector width. We adjust the pre-loop > 3467: // iteration count by adjusting the pre-loop limit. > 3468: void SuperWord::adjust_pre_loop_limit_to_align_main_loop_vectors() { Nice rewrite of this method - it's much easier to follow now :-) I have a few comments below. src/hotspot/share/opto/superword.cpp line 3495: > 3493: // alignment of the address. > 3494: // > 3495: // adr = base + offset + invar + scale * iv (1) For completeness, you can also add the desired `% aw = 0`: Suggestion: // adr = (base + offset + invar + scale * iv) % aw = 0 (1) src/hotspot/share/opto/superword.cpp line 3514: > 3512: // boi = base + offset + invar (4) > 3513: // > 3514: // And now we can simplify the address, using (1), (2), and (4): Suggestion: // And now we can simplify the address using (1), (2), and (4): src/hotspot/share/opto/superword.cpp line 3518: > 3516: // adr = boi + scale * new_limit > 3517: // adr = boi + scale * old_limit + scale * adjust_pre_iter (5a, stride > 0) > 3518: // adr = boi + scale * old_limit - scale * adjust_pre_iter (5b, stride < 0) It might be easier to use the following form, especially when later deriving equations 11*: Suggestion: // adr = boi + scale * (old_limit + adjust_pre_iter) (5a, stride > 0) // adr = boi + scale * (old_limit - adjust_pre_iter) (5b, stride < 0) I've suggested the related required changes below. What do you think? 
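As a sanity check on the factored form (5a), here is a small worked sketch for the `stride > 0` case only. It brute-forces the smallest `adjust_pre_iter` for which the address at the new pre-loop limit is divisible by `aw`, and it lets one compare the factored and expanded forms of the address. The function names and the sample numbers are hypothetical and are not part of superword.cpp; the real code computes the adjustment in closed form rather than by search.

```cpp
#include <cassert>

// Mathematical (non-negative) modulo; the built-in % can yield negative values.
inline long mod(long x, long m) { return ((x % m) + m) % m; }

// Factored form (5a): adr = boi + scale * (old_limit + adjust_pre_iter)
inline long address_factored(long boi, long scale, long old_limit, long adjust_pre_iter) {
  return boi + scale * (old_limit + adjust_pre_iter);
}

// Expanded form: adr = boi + scale * old_limit + scale * adjust_pre_iter
inline long address_expanded(long boi, long scale, long old_limit, long adjust_pre_iter) {
  return boi + scale * old_limit + scale * adjust_pre_iter;
}

// Smallest adjust_pre_iter in [0, aw) with adr % aw == 0, or -1 if none exists.
// Brute force is used here only for illustration (assumption: stride > 0).
inline long find_adjust_pre_iter(long boi, long scale, long old_limit, long aw) {
  assert(aw > 0);
  for (long adjust = 0; adjust < aw; ++adjust) {
    if (mod(address_factored(boi, scale, old_limit, adjust), aw) == 0) {
      return adjust;
    }
  }
  return -1;
}
```

With the made-up values `boi = 16`, `scale = 2`, `old_limit = 3`, `aw = 16`, the search returns 5, since 16 + 2 * (3 + 5) = 32 is a multiple of 16. With `boi = 17` and the same `scale`, every candidate address is odd, so no adjustment can align it and the search returns -1.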
src/hotspot/share/opto/superword.cpp line 3523: > 3521: // > 3522: // (boi + scale * old_limit + scale * adjust_pre_iter) % aw = 0 (6a, stride > 0) > 3523: // (boi + scale * old_limit - scale * adjust_pre_iter) % aw = 0 (6b, stride < 0) Suggestion: // (boi + scale * (old_limit + adjust_pre_iter)) % aw = 0 (6a, stride > 0) // (boi + scale * (old_limit - adjust_pre_iter)) % aw = 0 (6b, stride < 0) src/hotspot/share/opto/superword.cpp line 3525: > 3523: // (boi + scale * old_limit - scale * adjust_pre_iter) % aw = 0 (6b, stride < 0) > 3524: // > 3525: // In most cases, scale is the element size (elt_size), for example: Since there is no variable `elt_size` left, you can remove it: Suggestion: // In most cases, scale is the element size, for example: src/hotspot/share/opto/superword.cpp line 3537: > 3535: // we are not able to affect the alignment at all. Hence, we require abs(scale) < aw. > 3536: // > 3537: // Moreover, for alignment to be acheivabe, boi must be a multiple of scale. If strict Suggestion: // Moreover, for alignment to be achievable, boi must be a multiple of scale. If strict src/hotspot/share/opto/superword.cpp line 3553: > 3551: // > 3552: // (BOI + sign(scale) * old_limit + sign(scale) * adjust_pre_iter) % AW = 0 (9a, stride > 0) > 3553: // (BOI + sign(scale) * old_limit - sign(scale) * adjust_pre_iter) % AW = 0 (9b, stride < 0) Suggestion: // (BOI + sign(scale) * (old_limit + adjust_pre_iter)) % AW = 0 (9a, stride > 0) // (BOI + sign(scale) * (old_limit - adjust_pre_iter)) % AW = 0 (9b, stride < 0) src/hotspot/share/opto/superword.cpp line 3565: > 3563: // We solve (9) for adjust_pre_iter, in the following 4 cases: > 3564: // > 3565: // Case A: scale > 0 && stride > 0 (i.e. adjust_pre_iter >= 0) For completeness, you could add what `sign(scale)` is: Suggestion: // Case A: scale > 0 && stride > 0 (i.e.
sign(scale) = 1, adjust_pre_iter >= 0) src/hotspot/share/opto/superword.cpp line 3569: > 3567: // adjust_pre_iter = (-BOI - old_limit) % AW (11a) > 3568: // > 3569: // Case B: scale < 0 && stride > 0 (i.e. adjust_pre_iter >= 0) Suggestion: // Case B: scale < 0 && stride > 0 (i.e. sign(scale) = -1, adjust_pre_iter >= 0) src/hotspot/share/opto/superword.cpp line 3573: > 3571: // adjust_pre_iter = (BOI - old_limit) % AW (11b) > 3572: // > 3573: // Case C: scale > 0 && stride < 0 (i.e. adjust_pre_iter <= 0) Suggestion: // Case C: scale > 0 && stride < 0 (i.e. sign(scale) = 1, adjust_pre_iter <= 0) src/hotspot/share/opto/superword.cpp line 3577: > 3575: // adjust_pre_iter = (BOI + old_limit) % AW (11c) > 3576: // > 3577: // Case D: scale < 0 && stride < 0 (i.e. adjust_pre_iter <= 0) Suggestion: // Case D: scale < 0 && stride < 0 (i.e. sign(scale) = -1, adjust_pre_iter <= 0) src/hotspot/share/opto/superword.cpp line 3599: > 3597: // XBOI = BOI > 3598: // = boi / abs(scale) > 3599: // = xboi / abs(scale) (14b, stride * scale < 0) I suggest the following structure to better highlight the final result and how `XBOI` is defined: Suggestion: // We now generalize the equations (11*) by using: // // OP: (stride > 0) ? SUB : ADD // XBOI: (stride * scale > 0) ? -BOI : BOI // // which gives us the final pre-loop limit adjustment: // // adjust_pre_iter = (XBOI OP old_limit) % AW (12) // // We can construct XBOI by additionally defining: // // xboi = -boi = (-base - offset - invar) (13a, stride * scale > 0) // xboi = +boi = (+base + offset + invar) (13b, stride * scale < 0) // // which gives us: // // XBOI = (stride * scale > 0) ? -BOI : BOI // = (stride * scale > 0) ? 
-boi / abs(scale) : boi / abs(scale) // = xboi / abs(scale) (14) src/hotspot/share/opto/superword.cpp line 3655: > 3653: #endif > 3654: > 3655: // 1: Compute: You could reference the equations above here: Suggestion: // 1: Compute (13a, b): src/hotspot/share/opto/superword.cpp line 3702: > 3700: > 3701: // 2: Compute: XBOI = xboi / abs(scale) > 3702: // The division is executed as shift Same here: Suggestion: // 2: Compute (14): // XBOI = xboi / abs(scale) // The division is executed as shift src/hotspot/share/opto/superword.cpp line 3708: > 3706: _phase->set_ctrl(XBOI, pre_ctrl); > 3707: > 3708: // 3: Compute: XBOI_OP_old_limit = XBOI OP old_limit Maybe you can use 3.1 and 3.2 to better highlight that you calculate `(12)`here: Suggestion: // 3: Compute (12): // 3.1: XBOI_OP_old_limit = XBOI OP old_limit src/hotspot/share/opto/superword.cpp line 3718: > 3716: _phase->set_ctrl(XBOI_OP_old_limit, pre_ctrl); > 3717: > 3718: // 4: Compute: Suggestion: // 3.2: Compute: src/hotspot/share/opto/superword.cpp line 3731: > 3729: // 5: Compute: > 3730: // new_limit = old_limit + adjust_pre_iter (stride > 0) > 3731: // new_limit = old_limit - adjust_pre_iter (stride < 0) Suggestion: // 4: Compute (2a, b): // new_limit = old_limit + adjust_pre_iter (stride > 0) // new_limit = old_limit - adjust_pre_iter (stride < 0) src/hotspot/share/opto/superword.cpp line 3741: > 3739: _phase->set_ctrl(new_limit, pre_ctrl); > 3740: > 3741: // 6. Make sure not to exceed the original limit with the new limit Maybe you also want to mention that in the comment above somewhere for completeness. src/hotspot/share/opto/superword.cpp line 3741: > 3739: _phase->set_ctrl(new_limit, pre_ctrl); > 3740: > 3741: // 6. 
Make sure not to exceed the original limit with the new limit Suggestion: // 5: Make sure not to exceed the original limit with the new limit src/hotspot/share/opto/vectorization.cpp line 824: > 822: // (C_const + C_pre * pre_iter_C_const) % aw = 0 (4c) > 823: // > 824: // We can only guarantee solutions to (4a) and (4b) if: Should we also mention here that strengthening it with (4a-c) is the best we can do to prove (3) statically while we might miss some solutions for (3) where some of (4a-c) are false which, however, cannot be proven statically? src/hotspot/share/opto/vectorization.cpp line 829: > 827: // C_invar % abs(C_pre) = 0 (5b*) > 828: // > 829: // Which means there are X and Y such that: It should be obvious but add here that X and Y must be integers: Suggestion: // Which means there are integers X and Y such that: src/hotspot/share/opto/vectorization.cpp line 926: > 924: // = pre_r + pre_q * m + alignment_init(X * var_init) + alignment_invar(Y * var_invar) > 925: // > 926: // Hence, the solution depends on: Suggestion: // Hence, the solution for pre_iter depends on: src/hotspot/share/opto/vectorization.cpp line 940: > 938: // hence we have to add a dependency for invar, and scale (pre_stride is the > 939: // same for all mem_refs in the loop). If there is no invariant, then we add > 940: // a dependency that there is no invariant. I found it a little difficult to understand this part. I had to write it down to verify that we indeed are not missing a dependency. Maybe we can be more explicit here. How about something like that? // Hence, the solution for pre_iter depends on: // - Always: pre_r and pre_q // - If an init is present, we have a dependency on it. But since init is fixed and given by the loop for all mem_refs // of all packs, we can drop it. // - If init is a constant, we do not have another dependency. // - If init is a non-constant, we have a dependency on scale since C_init = scale. 
We need to keep it since it is not // given by the loop and could vary from pack to pack. We do not have to add another dependency since (5a*): // // C_init % abs(C_pre) = 0 <=> C_init = C_pre * X // // can only be satisfied if X is constant: // // C_init = C_pre * X // X = C_init / C_pre // X = scale / (scale * pre_stride) // X = 1 / pre_stride // // - If an invariant is present, we have a dependency on it which we need to keep since it is not given by the loop // and could vary from pack to pack. We additionally have to add another dependency on scale since (5b*): // // C_invar % abs(C_pre) = 0 <=> C_invar = C_pre * Y // // can only be satisfied if we have the following Y: // // C_invar = C_pre * Y // Y = (C_invar / C_pre) // Y = abs(invar_factor) / (scale * pre_stride) // // A dependency for pre_stride is not required as it is fixed and given by the loop for all mem_refs of all packs // (similar to init). src/hotspot/share/opto/vectorization.cpp line 957: > 955: > 956: const Node* invar_dependency = _invar; > 957: const int scale_dependency = (_invar != nullptr || !_init_node->is_ConI()) ? _scale : 0; Suggestion: const int scale_dependency = (_invar != nullptr || !_init_node->is_ConI()) ? _scale : 0; src/hotspot/share/opto/vectorization.hpp line 293: > 291: class AlignmentSolutionEmpty; > 292: class AlignmentSolutionTrivial; > 293: class AlignmentSolutionConstrained; I usually prefer to have it the other way round to better distinguish the different classes in the code. But that's just personal taste. I leave it up to you to decide which one you like better.
Suggestion: class EmptyAlignmentSolution; class TrivialAlignmentSolution; class ConstrainedAlignmentSolution; ------------- PR Review: https://git.openjdk.org/jdk/pull/14785#pullrequestreview-1772538187 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1427670431 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1427675101 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1426839974 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1426848558 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1426851242 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1426837081 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1426838703 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1426852449 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1426853895 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1426854616 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1426854821 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1426856582 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1426875615 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1426877072 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1426878545 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1426879328 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1426882871 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1426884143 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1426884733 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1427830566 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1427692355 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1427753811 PR Review 
Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1427720698 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1427814350 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1427829342 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1427822774 From chagedorn at openjdk.org Fri Dec 15 10:50:24 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 15 Dec 2023 10:50:24 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v40] In-Reply-To: References: Message-ID: On Thu, 7 Dec 2023 15:15:13 GMT, Emanuel Peter wrote: >> I want to push this in JDK23. >> After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). >> >> To calm your nerves: most of the changes are in auto-generated tests, and tests in general. >> >> **Context** >> >> `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). >> >> Alignment is split into two tasks: >> - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. >> - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width. >> >> **Problem** >> >> I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). 
>> In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. >> Thanks @fg1417 for confirming this! >> Hence, we need to fix the alignment correctness checks. >> >> While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. >> >> **Problem Details** >> >> Reproducer: >> >> >> static void test(short[] a, short[] b, short mask) { >> for (int i = 0; i < RANGE; i+=8) { >> // Problematic for AlignVector >> b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 >> >> b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes >> b[i+4] = (short)(a[i+4] & mask); >> b[i+5] = (short)(a[i+5] & mask); >> b[i+6] = (short)(a[i+6] & mask); >> } >> } >> >> >> During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. >> >> This is problemati... > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > remove dead code src/hotspot/share/opto/c2_globals.hpp line 97: > 95: \ > 96: develop(bool, VerifyAlignVector, false, \ > 97: "Check that vector stores/loads are aligned if AlignVector is on.") \ Nit: Do we need to wrap the line here such that the `\` is aligned again? src/hotspot/share/opto/superword.cpp line 591: > 589: // We can find adjacent memory references by comparing their relative > 590: // alignment. If the final vectors can be aligned can not yet be determined, > 591: // that is only done once all vectors are extended and combined. Suggestion: // alignment. Whether the final vectors can be aligned is determined later // once all vectors are extended and combined.
src/hotspot/share/opto/vectorization.cpp line 772: > 770: // during "main_iter" main-loop iterations. > 771: > 772: // Attribute init either to C_const or to C_init term. Suggestion: // Attribute init (i.e. _init_node) either to C_const or to C_init term. src/hotspot/share/opto/vectorization.cpp line 779: > 777: const int C_invar = (_invar == nullptr) ? 0 : abs(_invar_factor); > 778: > 779: const int C_const = _offset + C_const_init * _scale; Nit: I suggest to move this line up to `C_const_init` and follow the structure in the comment above: // Attribute init (i.e. _init_node) either to C_const or to C_init term. const int C_const_init = _init_node->is_ConI() ? _init_node->as_ConI()->get_int() : 0; const int C_const = _offset + C_const_init * _scale; // Set C_invar depending on if invar is present const int C_invar = (_invar == nullptr) ? 0 : abs(_invar_factor); const int C_init = _init_node->is_ConI() ? 0 : _scale; const int C_pre = _scale * _pre_stride; const int C_main = _scale * _main_stride; src/hotspot/share/opto/vectorization.cpp line 900: > 898: // Otherwise, if abs(C_pre) < aw, we find all solutions for pre_iter_C_const in (4c). > 899: // We state pre_iter_C_const in terms of the smallest possible pre_q and pre_r, such > 900: // that pre_q >= 0 and 0 <= pre_r < pre_q: I think you can remove the first inequation as it is implied by the second: Suggestion: // that 0 <= pre_r < pre_q: src/hotspot/share/opto/vectorization.cpp line 906: > 904: // We can now restate (4c) with (7): > 905: // > 906: // (C_const + C_pre * pre_r + C_pre * pre_q * m) % aw = 0 (8) You might want to repeat 4c here to better follow the transformation. 
For example: Suggestion: // (C_const + C_pre * pre_iter_C_const) % aw = 0 (4c) // (C_const + C_pre * (pre_r + pre_q * m) % aw = 0 (applying (7)) // (C_const + C_pre * pre_r + C_pre * pre_q * m) % aw = 0 (8) src/hotspot/share/opto/vectorization.cpp line 915: > 913: // Given that abs(C_pre) is a powers of 2, and abs(C_pre) < aw: > 914: // > 915: const int pre_q = _aw / abs(C_pre); Suggestion: const int pre_q = _aw / abs(C_pre); src/hotspot/share/opto/vectorization.hpp line 572: > 570: > 571: private: > 572: #ifdef ASSERT Since the `Tracer` class in this file is `NOT_PRODUCT`, I suggest to also go with that here and where the trace methods are used. Suggestion: #ifndef PRODUCT ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1423754881 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1420583333 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1424172989 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1424166141 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1424281102 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1424285379 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1424287409 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1420602287 From chagedorn at openjdk.org Fri Dec 15 10:50:24 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 15 Dec 2023 10:50:24 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v42] In-Reply-To: References: Message-ID: On Wed, 13 Dec 2023 08:42:14 GMT, Emanuel Peter wrote: >> I want to push this in JDK23. >> After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). >> >> To calm your nerves: most of the changes are in auto-generated tests, and tests in general. 
>> >> **Context** >> >> `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). >> >> Alignment is split into two tasks: >> - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. >> - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width. >> >> **Problem** >> >> I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). >> In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. >> Thanks @fg1417 for confirming this! >> Hence, we need to fix the alignment correctness checks. >> >> While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. >> >> **Problem Details** >> >> Reproducer: >> >> >> static void test(short[] a, short[] b, short mask) { >> for (int i = 0; i < RANGE; i+=8) { >> // Problematic for AlignVector >> b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 >> >> b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes >> b[i+4] = (short)(a[i+4] & mask); >> b[i+5] = (short)(a[i+5] & mask); >> b[i+6] = (short)(a[i+6] & mask); >> } >> } >> >> >> During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. 
For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. >> >> This is problemati... > > Emanuel Peter has updated the pull request incrementally with two additional commits since the last revision: > > - expose less from AlignmentSolutionConstrained > - fix up some virtual functions for Christian src/hotspot/share/opto/vectorization.cpp line 803: > 801: > 802: // In what follows, we need to show that the C_const, init and invar terms can be aligned by > 803: // adjusting the pre-loop limit (pre_iter). We decompose pre_iter: It's correct that in the end we want to adjust the pre-loop limit to get everything aligned. But in the context of this comment here to find suitable `pre_iter` values it might be better to directly state that we are trying to adjust `pre_iter` to achieve that. Maybe we can restate this as: Suggestion: // adjusting the pre-loop iteration count (pre_iter) which is defined by the pre-loop limit. 
We decompose pre_iter: ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1425257460 From chagedorn at openjdk.org Fri Dec 15 10:50:24 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 15 Dec 2023 10:50:24 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v43] In-Reply-To: References: Message-ID: On Fri, 15 Dec 2023 08:28:51 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> renamings and proof improvement in adjust_pre_loop_limit_to_align_main_loop_vectors > > src/hotspot/share/opto/vectorization.cpp line 824: > >> 822: // (C_const + C_pre * pre_iter_C_const) % aw = 0 (4c) >> 823: // >> 824: // We can only guarantee solutions to (4a) and (4b) if: > > Should we also mention here that strengthening it with (4a-c) is the best we can do to prove (3) statically while we might miss some solutions for (3) where some of (4a-c) are false which, however, cannot be proven statically? Maybe we can also put (5a-b) in words: Suggestion: // We can only guarantee solutions to (4a) and (4b) if C_init and C_invar are zero or multiples of C_pre: > src/hotspot/share/opto/vectorization.cpp line 940: > >> 938: // hence we have to add a dependency for invar, and scale (pre_stride is the >> 939: // same for all mem_refs in the loop). If there is no invariant, then we add >> 940: // a dependency that there is no invariant. > > I found it a little difficult to understand this part. I had to write it down to verify that we indeed are not missing a dependency. Maybe we can be more explicit here. How about something like that? > > > // Hence, the solution for pre_iter depends on: > // - Always: pre_r and pre_q > // - If an init is present, we have a dependency on it. But since init is fixed and given by the loop for all mem_refs > // of all packs, we can drop it. 
> // - If init is a constant, we do not have another dependency. > // - If init is a non-constant, we have a dependency on scale since C_init = scale. We need to keep since it is not > // given by the loop and could vary from pack to pack. We do not have to add another dependency since (5a*): > // > // C_init % abs(C_pre) = 0 <=> C_init = C_pre * X > // > // can only be satisfied if X is constant: > // > // C_init = C_pre * X > // X = C_init / C_pre > // X = scale / (scale * pre_stride) > // X = 1 / pre_stride > // > // - If an invariant is present, we have a dependency on it which we need to keep since it is not given by the loop > // and could vary from pack to pack. We additionally have to add another dependency on scale since (5b*): > // > // C_invar % abs(C_pre) = 0 <=> C_invar = C_pre * Y > // > // can only be satisfied if we have the following Y: > // > // C_invar = C_pre * Y > // Y = (C_invar / C_pre) > // Y = abs(invar_factor) / (scale * pre_stride) > // > // A dependency for pre_stride is not required as it is fixed and given by the loop for all mem_refs of all packs > // (similar to init). Another thought: Should we assert that if init is a non-constant that `abs(stride) = 1` if there is a solution? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1427694498 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1427819252 From thartmann at openjdk.org Fri Dec 15 11:02:00 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 15 Dec 2023 11:02:00 GMT Subject: [jdk22] RFR: 8320682: [AArch64] C1 compilation fails with "Field too big for insn" Message-ID: Hi all, This pull request contains a backport of commit [69014cd5](https://github.com/openjdk/jdk/commit/69014cd55b59a0a63f4918fad575a6887640573e) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. The commit being backported was authored by Daniel Lundén on 14 Dec 2023 and was reviewed by Tobias Hartmann, Andrew Haley and Dean Long.
Thanks! ------------- Commit messages: - Backport 69014cd55b59a0a63f4918fad575a6887640573e Changes: https://git.openjdk.org/jdk22/pull/15/files Webrev: https://webrevs.openjdk.org/?repo=jdk22&pr=15&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8320682 Stats: 36 lines in 2 files changed: 3 ins; 23 del; 10 mod Patch: https://git.openjdk.org/jdk22/pull/15.diff Fetch: git fetch https://git.openjdk.org/jdk22.git pull/15/head:pull/15 PR: https://git.openjdk.org/jdk22/pull/15 From volker.simonis at gmail.com Fri Dec 15 11:03:19 2023 From: volker.simonis at gmail.com (Volker Simonis) Date: Fri, 15 Dec 2023 12:03:19 +0100 Subject: Invalid code generated by C2 compiler in OpenJDK 21 In-Reply-To: References: Message-ID: Can you please share the hs_error file from the assertion crash with the fastdebug build? That may give us some hints. Antoine DESSAIGNE wrote on Fri, 15 Dec 2023, 11:21: > Hello everyone, > > I've found an issue while migrating to OpenJDK 21. A valued local > variable (effectively final) has its value removed and it throws a > NullPointerException. Unfortunately, I cannot provide the source code > and the data to reproduce the issue, and I couldn't create a smaller > code snippet to show the issue. That said, I'll happily show the code > and perform many tests during calls. > > Here's what I did so far to diagnose the issue. > > I bisected the repository to find where the regression comes from. I > found this commit 3696711efa5 [1] but it's a merge so I bisected the > branch and found 10737e168c9 [2]. Looking at this commit, I have no > idea how it could introduce this kind of regression.
> > Then, thanks to the guidance from Aleksey Shipilëv, I tested many things > * Issue does *not* happen with the following flags: -Xint, > -XX:-TieredCompilation, -XX:TieredStopAtLevel=1, > -XX:TieredStopAtLevel=2, -XX:TieredStopAtLevel=3 > * Issue also happens with fastdebug builds of OpenJDK, without > crashing due to assertions > * Issue still happens in the latest version of the code (commit > b31454e3623) > * Issue happens no matter which GC is used, I tried SerialGC, > ParallelGC, G1GC, and ShenandoahGC > > The tests were performed in Docker containers running on 4 different hosts. > > Therefore it looks like C2 is generating an invalid assembly code. > Unfortunately, I'm not great with assembly and the generated assembly > is quite big (main code is around 20k). > > Do you have an idea of why this is happening? Do you know what test I > can run? If one of you is available, we can schedule calls for me to > show you the code and my tests. Thank you very much for your > assistance. > > Have a nice day, > > Antoine DESSAIGNE > > [1] > https://github.com/openjdk/jdk/commit/3696711efa566fb776d6923da86e17b0e1e22964 > [2] > https://github.com/openjdk/jdk/commit/10737e168c967a08e257927251861bf2c14795ab > From duke at openjdk.org Fri Dec 15 11:49:50 2023 From: duke at openjdk.org (ArsenyBochkarev) Date: Fri, 15 Dec 2023 11:49:50 GMT Subject: RFR: 8317721: RISC-V: Implement CRC32 intrinsic [v2] In-Reply-To: References: Message-ID: > Hi everyone! Please review this port of [AArch64](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp#L4224) `_updateBytesCRC32`, `_updateByteBufferCRC32` and `_updateCRC32` intrinsics. This patch introduces only the plain (non-vectorized, no Zbc) version. > > ### Correctness checks > > Tier 1/2 tests are ok.
> > ### Performance results on T-Head board > > #### Results for enabled intrinsic: > > Used test is `test/micro/org/openjdk/bench/java/util/TestCRC32.java` > > | Benchmark | (count) | Mode | Cnt | Score | Error | Units | > | --- | ---- | ----- | --- | ---- | --- | ---- | > | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 24 | 3730.929 | 37.773 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 24 | 2126.673 | 2.032 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 24 | 1134.330 | 6.714 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 24 | 584.017 | 2.267 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 24 | 151.173 | 0.346 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 24 | 19.113 | 0.008 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 24 | 4.647 | 0.022 | ops/ms | > > #### Results for disabled intrinsic: > > | Benchmark | (count) | Mode | Cnt | Score | Error | Units | > | --------------------------------------------------- | ---------- | --------- | ---- | ----------- | --------- | ---------- | > | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 15 | 798.365 | 35.486 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 15 | 677.756 | 46.619 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 15 | 552.781 | 27.143 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 15 | 429.304 | 12.518 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 15 | 166.738 | 0.935 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 15 | 25.060 | 0.034 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 15 | 6.196 | 0.030 | ops/ms | ArsenyBochkarev has updated the pull request incrementally with three additional commits since the last revision: - Use zero_extend instead of shifts where possible - Use andn instead of notr + andr where possible - Replace shNadd with one instruction in most cases ------------- Changes: - all: 
https://git.openjdk.org/jdk/pull/17046/files - new: https://git.openjdk.org/jdk/pull/17046/files/d8d6968f..f7a4f0c7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17046&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17046&range=00-01 Stats: 23 lines in 2 files changed: 4 ins; 7 del; 12 mod Patch: https://git.openjdk.org/jdk/pull/17046.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17046/head:pull/17046 PR: https://git.openjdk.org/jdk/pull/17046 From duke at openjdk.org Fri Dec 15 11:49:54 2023 From: duke at openjdk.org (ArsenyBochkarev) Date: Fri, 15 Dec 2023 11:49:54 GMT Subject: RFR: 8317721: RISC-V: Implement CRC32 intrinsic [v2] In-Reply-To: References: Message-ID: On Tue, 12 Dec 2023 09:55:09 GMT, Vladimir Kempik wrote: >> ArsenyBochkarev has updated the pull request incrementally with three additional commits since the last revision: >> >> - Use zero_extend instead of shifts where possible >> - Use andn instead of notr + andr where possible >> - Replace shNadd with one instruction in most cases > > src/hotspot/cpu/riscv/macroAssembler_riscv.cpp line 3753: > >> 3751: mv(tmp5, bits32); >> 3752: notr(crc, crc); >> 3753: andr(crc, crc, tmp5); > > can use andn(crc, tmp, crc); here, so get some accel when Zbb present Done ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17046#discussion_r1427886298 From duke at openjdk.org Fri Dec 15 11:49:52 2023 From: duke at openjdk.org (ArsenyBochkarev) Date: Fri, 15 Dec 2023 11:49:52 GMT Subject: RFR: 8317721: RISC-V: Implement CRC32 intrinsic [v2] In-Reply-To: <9TEIo2dROB23lR_5-zKZlnsCpMgFamUPOSL0bxBZu5Q=.3dcfe283-5b3e-4798-b781-100836b451b6@github.com> References: <9TEIo2dROB23lR_5-zKZlnsCpMgFamUPOSL0bxBZu5Q=.3dcfe283-5b3e-4798-b781-100836b451b6@github.com> Message-ID: On Wed, 13 Dec 2023 01:04:16 GMT, Feilong Jiang wrote: >> ArsenyBochkarev has updated the pull request incrementally with three additional commits since the last revision: >> >> - Use zero_extend instead of 
shifts where possible >> - Use andn instead of notr + andr where possible >> - Replace shNadd with one instruction in most cases > > src/hotspot/cpu/riscv/c1_LIRAssembler_riscv.cpp line 1642: > >> 1640: __ xori(crc, crc, -1); // ~crc >> 1641: __ slli(crc, crc, 32); >> 1642: __ srli(crc, crc, 32); > > Suggestion: > > __ notr(crc, crc); // ~crc > __ zero_extend(crc, crc, 32); Done ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17046#discussion_r1427886119 From antoine.dessaigne at gmail.com Fri Dec 15 12:10:54 2023 From: antoine.dessaigne at gmail.com (Antoine DESSAIGNE) Date: Fri, 15 Dec 2023 13:10:54 +0100 Subject: Invalid code generated by C2 compiler in OpenJDK 21 In-Reply-To: References: Message-ID: Thank you Volker and Andrew for your replies, Unfortunately, there's no hs_error file generated; the code throws a NullPointerException when it shouldn't. I can even reproduce it with JDK 23: I took the master of the jdk repository this morning (commit b31454e3623 to be precise) and it still fails. I cannot exclude this method from the JIT as it's used a lot in our application. I checked the commit many times to be sure that I have the right one, and I do. It fails almost every time with 10737e168c9 [1] but it never fails with its parent commit, even after 20 tests. [1] https://github.com/openjdk/jdk/commit/10737e168c967a08e257927251861bf2c14795ab On Fri, 15 Dec 2023 at 12:37, Andrew Haley wrote: > > On 12/15/23 10:20, Antoine DESSAIGNE wrote: > > Do you have an idea of why this is happening? Do you know what test I > > can run? > > First, try to reproduce it with JDK 22 preview. > > If you can't provide a reproducer, it's likely that no one will be > able to fix it now, and you'll have to wait until it gets fixed. > > Try: '-XX:CompileCommand=exclude,foo.bar.baz.Classname::badMethodName' > > -- > Andrew Haley (he/him) > Java Platform Lead Engineer > Red Hat UK Ltd.
> https://keybase.io/andrewhaley > EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 > From epeter at openjdk.org Fri Dec 15 12:12:57 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 15 Dec 2023 12:12:57 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v43] In-Reply-To: References: Message-ID: On Fri, 15 Dec 2023 08:10:47 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> renamings and proof improvement in adjust_pre_loop_limit_to_align_main_loop_vectors > > src/hotspot/share/opto/superword.cpp line 3495: > >> 3493: // alignment of the address. >> 3494: // >> 3495: // adr = base + offset + invar + scale * iv (1) > > For completeness, you can also add the desired `% aw = 0`: > Suggestion: > > // adr = (base + offset + invar + scale * iv) % aw = 0 (1) I moved `adr % aw = 0` up, flipping (2) and (3) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1427906257 From epeter at openjdk.org Fri Dec 15 12:18:39 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 15 Dec 2023 12:18:39 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v44] In-Reply-To: References: Message-ID: > I want to push this in JDK23. > After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). > > To calm your nerves: most of the changes are in auto-generated tests, and tests in general. > > **Context** > > `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). 
> > Alignment is split into two tasks: > - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. > - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width. > > **Problem** > > I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). > In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. > Thanks @fg1417 for confirming this! > Hence, we need to fix the alignment correctness checks. > > While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. > > **Problem Details** > > Reproducer: > > > static void test(short[] a, short[] b, short mask) { > for (int i = 0; i < RANGE; i+=8) { > // Problematic for AlignVector > b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 > > b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes > b[i+4] = (short)(a[i+4] & mask); > b[i+5] = (short)(a[i+5] & mask); > b[i+6] = (short)(a[i+6] & mask); > } > } > > > During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. > > This is problematic as shown in this example. We have references at index offset `0, 3, 4, 5, 6`, and by... 
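As an illustration of the reproducer above, here is a minimal sketch of the offset arithmetic, not HotSpot code. It assumes a 2-byte `short` element, that the accesses at index offsets 3 to 6 form one 4-element pack (so an 8-byte vector), and that the reference at index offset 0 is the one brought to an aligned address. The class and method names are hypothetical.

```java
public class PackAlignmentSketch {
    // Byte offset of an access a[i + indexOffset] relative to the address
    // of a[i], for the given element size in bytes.
    static int byteOffset(int indexOffset, int elementSize) {
        return indexOffset * elementSize;
    }

    // True if a vector starting at byteOffset is aligned to vectorWidth
    // bytes, assuming the address at byte offset 0 is itself aligned.
    static boolean isAligned(int byteOffset, int vectorWidth) {
        return Math.floorMod(byteOffset, vectorWidth) == 0;
    }

    public static void main(String[] args) {
        int elementSize = Short.BYTES;             // 2 bytes per short
        int packSize = 4;                          // assumed: i+3 .. i+6 form one pack
        int vectorWidth = packSize * elementSize;  // 8 bytes

        int lowestRef = byteOffset(0, elementSize); // lowest-offset reference of the group
        int packStart = byteOffset(3, elementSize); // first element of the 4-element pack

        System.out.println("lowest reference aligned: " + isAligned(lowestRef, vectorWidth));
        System.out.println("pack start byte offset:   " + packStart);
        System.out.println("pack aligned:             " + isAligned(packStart, vectorWidth));
    }
}
```

Under these assumptions the lowest-offset reference passes the alignment check while the pack starting at byte offset 6 does not, which is why checking only one reference per group is not enough.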
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: Apply suggestions from code review First batch of Christian's suggestions Co-authored-by: Christian Hagedorn ------------- Changes: - all: https://git.openjdk.org/jdk/pull/14785/files - new: https://git.openjdk.org/jdk/pull/14785/files/b37f0c12..05bf5cd5 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=43 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=42-43 Stats: 14 lines in 1 file changed: 0 ins; 0 del; 14 mod Patch: https://git.openjdk.org/jdk/pull/14785.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14785/head:pull/14785 PR: https://git.openjdk.org/jdk/pull/14785 From epeter at openjdk.org Fri Dec 15 12:25:24 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 15 Dec 2023 12:25:24 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v45] In-Reply-To: References: Message-ID: > I want to push this in JDK23. > After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). > > To calm your nerves: most of the changes are in auto-generated tests, and tests in general. > > **Context** > > `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). > > Alignment is split into two tasks: > - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. 
> - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width. > > **Problem** > > I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). > In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. > Thanks @fg1417 for confirming this! > Hence, we need to fix the alignment correctness checks. > > While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. > > **Problem Details** > > Reproducer: > > > static void test(short[] a, short[] b, short mask) { > for (int i = 0; i < RANGE; i+=8) { > // Problematic for AlignVector > b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 > > b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes > b[i+4] = (short)(a[i+4] & mask); > b[i+5] = (short)(a[i+5] & mask); > b[i+6] = (short)(a[i+6] & mask); > } > } > > > During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. > > This is problematic as shown in this example. We have references at index offset `0, 3, 4, 5, 6`, and by... 
Emanuel Peter has updated the pull request incrementally with two additional commits since the last revision: - Apply suggestions from code review Co-authored-by: Christian Hagedorn - more review fixes ------------- Changes: - all: https://git.openjdk.org/jdk/pull/14785/files - new: https://git.openjdk.org/jdk/pull/14785/files/05bf5cd5..f1a706f7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=44 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=43-44 Stats: 23 lines in 2 files changed: 4 ins; 3 del; 16 mod Patch: https://git.openjdk.org/jdk/pull/14785.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14785/head:pull/14785 PR: https://git.openjdk.org/jdk/pull/14785 From chagedorn at openjdk.org Fri Dec 15 12:32:42 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 15 Dec 2023 12:32:42 GMT Subject: [jdk22] RFR: 8320682: [AArch64] C1 compilation fails with "Field too big for insn" In-Reply-To: References: Message-ID: On Fri, 15 Dec 2023 10:55:48 GMT, Tobias Hartmann wrote: > Hi all, > > This pull request contains a backport of commit [69014cd5](https://github.com/openjdk/jdk/commit/69014cd55b59a0a63f4918fad575a6887640573e) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by Daniel Lundén on 14 Dec 2023 and was reviewed by Tobias Hartmann, Andrew Haley and Dean Long. > > Thanks! Looks good! ------------- Marked as reviewed by chagedorn (Reviewer).
PR Review: https://git.openjdk.org/jdk22/pull/15#pullrequestreview-1783845282 From thartmann at openjdk.org Fri Dec 15 12:37:43 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 15 Dec 2023 12:37:43 GMT Subject: [jdk22] RFR: 8320682: [AArch64] C1 compilation fails with "Field too big for insn" In-Reply-To: References: Message-ID: On Fri, 15 Dec 2023 10:55:48 GMT, Tobias Hartmann wrote: > Hi all, > > This pull request contains a backport of commit [69014cd5](https://github.com/openjdk/jdk/commit/69014cd55b59a0a63f4918fad575a6887640573e) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by Daniel Lundén on 14 Dec 2023 and was reviewed by Tobias Hartmann, Andrew Haley and Dean Long. > > Thanks! Thanks, Christian! ------------- PR Comment: https://git.openjdk.org/jdk22/pull/15#issuecomment-1857815064 From epeter at openjdk.org Fri Dec 15 12:53:14 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 15 Dec 2023 12:53:14 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v46] In-Reply-To: References: Message-ID: > I want to push this in JDK23. > After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). > > To calm your nerves: most of the changes are in auto-generated tests, and tests in general. > > **Context** > > `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). > > Alignment is split into two tasks: > - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible.
We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. > - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width. > > **Problem** > > I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). > In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. > Thanks @fg1417 for confirming this! > Hence, we need to fix the alignment correctness checks. > > While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. > > **Problem Details** > > Reproducer: > > > static void test(short[] a, short[] b, short mask) { > for (int i = 0; i < RANGE; i+=8) { > // Problematic for AlignVector > b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 > > b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes > b[i+4] = (short)(a[i+4] & mask); > b[i+5] = (short)(a[i+5] & mask); > b[i+6] = (short)(a[i+6] & mask); > } > } > > > During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. > > This is problematic as shown in this example. We have references at index offset `0, 3, 4, 5, 6`, and by... 
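The "alignment by adjusting pre-loop limit" task mentioned above can be sketched with equation (4c) and `pre_q = aw / abs(C_pre)` from the review comments quoted earlier in this thread. The following is only an illustrative brute-force search, not the code from `vectorization.cpp`; it assumes `aw` and `abs(C_pre)` are powers of two with `abs(C_pre) < aw`, and the class and method names are hypothetical.

```java
public class PreLoopAlignmentSketch {
    // Searches for the smallest pre_r in [0, pre_q) such that
    //   (cConst + cPre * pre_r) % aw == 0          (compare (4c))
    // where pre_q = aw / abs(cPre). Returns -1 if no such pre_r exists.
    // Assumes aw and abs(cPre) are powers of two and abs(cPre) < aw.
    static int findPreR(int cConst, int cPre, int aw) {
        int preQ = aw / Math.abs(cPre);
        for (int preR = 0; preR < preQ; preR++) {
            if (Math.floorMod(cConst + cPre * preR, aw) == 0) {
                return preR;
            }
        }
        return -1;
    }

    // Checks whether a given pre-loop iteration count satisfies (4c).
    static boolean satisfies(int cConst, int cPre, int aw, int preIter) {
        return Math.floorMod(cConst + cPre * preIter, aw) == 0;
    }

    public static void main(String[] args) {
        int aw = 8;     // assumed alignment width in bytes
        int cPre = 2;   // assumed: scale 2 (short) times pre_stride 1

        int preR = findPreR(6, cPre, aw);
        int preQ = aw / Math.abs(cPre);
        System.out.println("pre_r = " + preR + ", pre_q = " + preQ);

        // Every pre_r + pre_q * m is also a solution, since cPre * pre_q is a multiple of aw.
        for (int m = 0; m < 3; m++) {
            System.out.println("m = " + m + ": " + satisfies(6, cPre, aw, preR + preQ * m));
        }

        // An odd constant offset can never be aligned with an even cPre.
        System.out.println("no solution: " + findPreR(3, cPre, aw));
    }
}
```

The search is bounded by `pre_q` because `C_pre * pre_q` is a multiple of `aw`, so any larger solution differs from one in `[0, pre_q)` by a multiple of `pre_q`.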
Emanuel Peter has updated the pull request incrementally with two additional commits since the last revision: - Christians suggestions - more suggestions implemented ------------- Changes: - all: https://git.openjdk.org/jdk/pull/14785/files - new: https://git.openjdk.org/jdk/pull/14785/files/f1a706f7..b323d7a7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=45 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=44-45 Stats: 35 lines in 3 files changed: 16 ins; 1 del; 18 mod Patch: https://git.openjdk.org/jdk/pull/14785.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14785/head:pull/14785 PR: https://git.openjdk.org/jdk/pull/14785 From chagedorn at openjdk.org Fri Dec 15 12:55:44 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 15 Dec 2023 12:55:44 GMT Subject: RFR: 8319793: C2 compilation fails with "Bad graph detected in build_loop_late" after JDK-8279888 [v7] In-Reply-To: References: Message-ID: <4O9MKwWQwTlG_t8S4rFdTu2QyvR2TtA953RcJ2YG6zM=.22262ff9-7b80-41e0-8986-eb08e4ab2630@github.com> On Fri, 15 Dec 2023 08:41:20 GMT, Roland Westrelin wrote: >> Range check smearing and range check predication make an array access >> dependent on 2 (or more in the case of RC smearing) conditions. As a >> consequence, if a range check can be eliminated because there's an >> identical dominating range check, the control dependent nodes that >> could float and become dependent on the dominating range check cannot >> be allowed to float because there's a risk that they would then bypass >> one of the checks that make the access legal. >> >> `IfNode::dominated_by()` and `PhaseIdealLoop::dominated_by()` have >> logic to prevent this: nodes that are control dependent on a range >> check or predicate are not allowed to float. This is however not >> sufficient as demonstrated by the test cases. 
>> >> In `TestArrayAccessAboveRCAfterSmearingOrPredication.testRangeCheckSmearing()`: >> >> >> v += array[i]; >> if (flag2) { >> if (flag3) { >> field = 0x42; >> } >> } >> if (flagField == 1) { >> v += array[i]; >> } >> >> >> The range check for the second `array[i]` load is replaced by the >> dominating range check for the first `array[i]` but because the second >> `array[i]` load could really be dependent on multiple range checks (in >> case smearing happened which is not the case here), c2 doesn't allow >> the second `array[i]` to float when the second range check is >> removed. The second `array[i]` is then control dependent on: >> >> >> if (flagField == 1) { >> >> >> which is next found to be dominated by the same test: >> >> >> if (flag == 1) { >> >> >> and is removed. However nothing in `dominated_by()` treats node >> dependent on tests that are not range check or predicates >> specially. So the second `array[i]` is allowed to float and become >> dependent on: >> >> >> if (flag == 1) { >> >> >> which is above the range check for that access. The test method in its >> last invocation is passed an index for the array access that's widely >> out of range. The array load happens before the range check and >> crashes the VM. `testLoopPredication()` is a similar test where array >> loads become dependent on predicates and end up above range checks. >> >> `TestArrayAccessCastIIAboveRC.java` is the test case from the bug >> where for similar reasons a range check `CastII` ends up above its >> range check, becomes top because its input becomes some integer that >> conflicts with its... > > Roland Westrelin has updated the pull request incrementally with two additional commits since the last revision: > > - Revert "Update src/hotspot/share/opto/castnode.hpp" > > This reverts commit 356c91cca911ed486f9f87f3eff53ce21e1e3ec9. > - Revert "Update src/hotspot/share/opto/memnode.hpp" > > This reverts commit bdb731ea562f314f44d327f7243ef5cf9ad40b2e. Otherwise, looks good! 
src/hotspot/share/opto/loopopts.cpp line 360: > 358: // dependent nodes end up at the lowest/nearest dominating check in the graph. To ensure that these Loads/Casts > 359: // do not float above any of the dominating checks (even when the lowest dominating check is later replaced by > 360: // yet another dominating check), we need to pin them at the lowest dominating check. Should we also add this updated comment to `ifnode.cpp:569` and `ifnode.cpp:1536`? ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/16886#pullrequestreview-1783911844 PR Review Comment: https://git.openjdk.org/jdk/pull/16886#discussion_r1427944258 From epeter at openjdk.org Fri Dec 15 13:27:55 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 15 Dec 2023 13:27:55 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v43] In-Reply-To: References: Message-ID: <_oKHio4P31Cg7U5s8Aeg5-6ytj7SeiPjFNhAvt0iDwg=.5ad82a51-e9d4-45f4-bf55-268358f4329e@github.com> On Fri, 15 Dec 2023 08:31:19 GMT, Christian Hagedorn wrote: >> src/hotspot/share/opto/vectorization.cpp line 824: >> >>> 822: // (C_const + C_pre * pre_iter_C_const) % aw = 0 (4c) >>> 823: // >>> 824: // We can only guarantee solutions to (4a) and (4b) if: >> >> Should we also mention here that strengthening it with (4a-c) is the best we can do to prove (3) statically while we might miss some solutions for (3) where some of (4a-c) are false which, however, cannot be proven statically? > > Maybe we can also put (5a-b) in words: > Suggestion: > > // We can only guarantee solutions to (4a) and (4b) if C_init and C_invar are zero or multiples of C_pre: Ok, I added some extra comments here. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1427973108 From epeter at openjdk.org Fri Dec 15 13:36:10 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 15 Dec 2023 13:36:10 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v47] In-Reply-To: References: Message-ID: <1TEBdfpuy8JHOPsFSetvjQzyzwMm_Vpxmi9Avdxv35w=.acd2c2ab-4a18-4fe8-a3a9-ad3d412446a8@github.com> > I want to push this in JDK23. > After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). > > To calm your nerves: most of the changes are in auto-generated tests, and tests in general. > > **Context** > > `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). > > Alignment is split into two tasks: > - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. > - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width. > > **Problem** > > I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). > In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. > Thanks @fg1417 for confirming this! > Hence, we need to fix the alignment correctness checks. 
> > While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. > > **Problem Details** > > Reproducer: > > > static void test(short[] a, short[] b, short mask) { > for (int i = 0; i < RANGE; i+=8) { > // Problematic for AlignVector > b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 > > b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes > b[i+4] = (short)(a[i+4] & mask); > b[i+5] = (short)(a[i+5] & mask); > b[i+6] = (short)(a[i+6] & mask); > } > } > > > During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. > > This is problematic as shown in this example. We have references at index offset `0, 3, 4, 5, 6`, and by... 
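A minimal sketch of why the pack starting at `b[i+3]` in the reproducer above cannot be aligned. It assumes 2-byte `short` elements, an alignment requirement of 8 bytes (the usual default for `ObjectAlignmentInBytes`), and an array base that is itself aligned; the class and method names are hypothetical and are not HotSpot code.

```java
public class AlignmentSketch {
    static final int ELEMENT_BYTES = 2; // sizeof(short)
    static final int ALIGN_BYTES = 8;   // assumed alignment requirement

    // Byte offset of element (i + indexOffset), relative to an assumed-aligned base.
    static int byteOffset(int i, int indexOffset) {
        return (i + indexOffset) * ELEMENT_BYTES;
    }

    // True if that byte offset is divisible by the assumed alignment.
    static boolean isAligned(int i, int indexOffset) {
        return byteOffset(i, indexOffset) % ALIGN_BYTES == 0;
    }

    public static void main(String[] args) {
        // The loop in the reproducer advances i by 8 elements (16 bytes) per iteration.
        for (int i = 0; i < 32; i += 8) {
            System.out.println("i=" + i
                + " ref[i+0] aligned=" + isAligned(i, 0)
                + " pack[i+3] aligned=" + isAligned(i, 3));
        }
    }
}
```

Under these assumptions the reference at index offset 0 is aligned in every iteration, while the pack at index offset 3 always sits 6 bytes past an 8-byte boundary, so checking only the lowest-offset reference of a group says nothing about that pack.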
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: another small review step ------------- Changes: - all: https://git.openjdk.org/jdk/pull/14785/files - new: https://git.openjdk.org/jdk/pull/14785/files/b323d7a7..0ae53186 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=46 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=45-46 Stats: 15 lines in 1 file changed: 9 ins; 0 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/14785.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14785/head:pull/14785 PR: https://git.openjdk.org/jdk/pull/14785 From thartmann at openjdk.org Fri Dec 15 14:31:45 2023 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 15 Dec 2023 14:31:45 GMT Subject: [jdk22] Integrated: 8320682: [AArch64] C1 compilation fails with "Field too big for insn" In-Reply-To: References: Message-ID: On Fri, 15 Dec 2023 10:55:48 GMT, Tobias Hartmann wrote: > Hi all, > > This pull request contains a backport of commit [69014cd5](https://github.com/openjdk/jdk/commit/69014cd55b59a0a63f4918fad575a6887640573e) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by Daniel Lundén on 14 Dec 2023 and was reviewed by Tobias Hartmann, Andrew Haley and Dean Long. > > Thanks! This pull request has now been integrated.
Changeset: 6b46c776 Author: Tobias Hartmann URL: https://git.openjdk.org/jdk22/commit/6b46c776e4c317c1b0778109d1684d96d7087a36 Stats: 36 lines in 2 files changed: 3 ins; 23 del; 10 mod 8320682: [AArch64] C1 compilation fails with "Field too big for insn" Reviewed-by: chagedorn Backport-of: 69014cd55b59a0a63f4918fad575a6887640573e ------------- PR: https://git.openjdk.org/jdk22/pull/15 From roland at openjdk.org Fri Dec 15 14:32:57 2023 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 15 Dec 2023 14:32:57 GMT Subject: RFR: 8319793: C2 compilation fails with "Bad graph detected in build_loop_late" after JDK-8279888 [v8] In-Reply-To: References: Message-ID: > Range check smearing and range check predication make an array access > dependent on 2 (or more in the case of RC smearing) conditions. As a > consequence, if a range check can be eliminated because there's an > identical dominating range check, the control dependent nodes that > could float and become dependent on the dominating range check cannot > be allowed to float because there's a risk that they would then bypass > one of the checks that make the access legal. > > `IfNode::dominated_by()` and `PhaseIdealLoop::dominated_by()` have > logic to prevent this: nodes that are control dependent on a range > check or predicate are not allowed to float. This is however not > sufficient as demonstrated by the test cases. > > In `TestArrayAccessAboveRCAfterSmearingOrPredication.testRangeCheckSmearing()`: > > > v += array[i]; > if (flag2) { > if (flag3) { > field = 0x42; > } > } > if (flagField == 1) { > v += array[i]; > } > > > The range check for the second `array[i]` load is replaced by the > dominating range check for the first `array[i]` but because the second > `array[i]` load could really be dependent on multiple range checks (in > case smearing happened which is not the case here), c2 doesn't allow > the second `array[i]` to float when the second range check is > removed. 
The second `array[i]` is then control dependent on: > > > if (flagField == 1) { > > > which is next found to be dominated by the same test: > > > if (flag == 1) { > > > and is removed. However nothing in `dominated_by()` treats node > dependent on tests that are not range check or predicates > specially. So the second `array[i]` is allowed to float and become > dependent on: > > > if (flag == 1) { > > > which is above the range check for that access. The test method in its > last invocation is passed an index for the array access that's widely > out of range. The array load happens before the range check and > crashes the VM. `testLoopPredication()` is a similar test where array > loads become dependent on predicates and end up above range checks. > > `TestArrayAccessCastIIAboveRC.java` is the test case from the bug > where for similar reasons a range check `CastII` ends up above its > range check, becomes top because its input becomes some integer that > conflicts with its type (but there's no condition to catch it). The > graph becomes broken and c2 crashes. > > Logic in the `dominated_by()` methods ... 
Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: review ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16886/files - new: https://git.openjdk.org/jdk/pull/16886/files/0ab8ae5f..32e41299 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16886&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16886&range=06-07 Stats: 9 lines in 1 file changed: 3 ins; 0 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/16886.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16886/head:pull/16886 PR: https://git.openjdk.org/jdk/pull/16886 From roland at openjdk.org Fri Dec 15 14:33:00 2023 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 15 Dec 2023 14:33:00 GMT Subject: RFR: 8319793: C2 compilation fails with "Bad graph detected in build_loop_late" after JDK-8279888 [v7] In-Reply-To: <4O9MKwWQwTlG_t8S4rFdTu2QyvR2TtA953RcJ2YG6zM=.22262ff9-7b80-41e0-8986-eb08e4ab2630@github.com> References: <4O9MKwWQwTlG_t8S4rFdTu2QyvR2TtA953RcJ2YG6zM=.22262ff9-7b80-41e0-8986-eb08e4ab2630@github.com> Message-ID: On Fri, 15 Dec 2023 12:52:45 GMT, Christian Hagedorn wrote: >> Roland Westrelin has updated the pull request incrementally with two additional commits since the last revision: >> >> - Revert "Update src/hotspot/share/opto/castnode.hpp" >> >> This reverts commit 356c91cca911ed486f9f87f3eff53ce21e1e3ec9. >> - Revert "Update src/hotspot/share/opto/memnode.hpp" >> >> This reverts commit bdb731ea562f314f44d327f7243ef5cf9ad40b2e. > > src/hotspot/share/opto/loopopts.cpp line 360: > >> 358: // dependent nodes end up at the lowest/nearest dominating check in the graph. To ensure that these Loads/Casts >> 359: // do not float above any of the dominating checks (even when the lowest dominating check is later replaced by >> 360: // yet another dominating check), we need to pin them at the lowest dominating check. 
> > Should we also add this updated comment to `ifnode.cpp:569` and `ifnode.cpp:1536`? Yes, right. Thanks! I updated the change. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16886#discussion_r1428036421 From chagedorn at openjdk.org Fri Dec 15 14:38:43 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 15 Dec 2023 14:38:43 GMT Subject: RFR: 8319793: C2 compilation fails with "Bad graph detected in build_loop_late" after JDK-8279888 [v8] In-Reply-To: References: Message-ID: On Fri, 15 Dec 2023 14:32:57 GMT, Roland Westrelin wrote: >> Range check smearing and range check predication make an array access >> dependent on 2 (or more in the case of RC smearing) conditions. As a >> consequence, if a range check can be eliminated because there's an >> identical dominating range check, the control dependent nodes that >> could float and become dependent on the dominating range check cannot >> be allowed to float because there's a risk that they would then bypass >> one of the checks that make the access legal. >> >> `IfNode::dominated_by()` and `PhaseIdealLoop::dominated_by()` have >> logic to prevent this: nodes that are control dependent on a range >> check or predicate are not allowed to float. This is however not >> sufficient as demonstrated by the test cases. >> >> In `TestArrayAccessAboveRCAfterSmearingOrPredication.testRangeCheckSmearing()`: >> >> >> v += array[i]; >> if (flag2) { >> if (flag3) { >> field = 0x42; >> } >> } >> if (flagField == 1) { >> v += array[i]; >> } >> >> >> The range check for the second `array[i]` load is replaced by the >> dominating range check for the first `array[i]` but because the second >> `array[i]` load could really be dependent on multiple range checks (in >> case smearing happened which is not the case here), c2 doesn't allow >> the second `array[i]` to float when the second range check is >> removed. 
The second `array[i]` is then control dependent on: >> >> >> if (flagField == 1) { >> >> >> which is next found to be dominated by the same test: >> >> >> if (flag == 1) { >> >> >> and is removed. However nothing in `dominated_by()` treats node >> dependent on tests that are not range check or predicates >> specially. So the second `array[i]` is allowed to float and become >> dependent on: >> >> >> if (flag == 1) { >> >> >> which is above the range check for that access. The test method in its >> last invocation is passed an index for the array access that's widely >> out of range. The array load happens before the range check and >> crashes the VM. `testLoopPredication()` is a similar test where array >> loads become dependent on predicates and end up above range checks. >> >> `TestArrayAccessCastIIAboveRC.java` is the test case from the bug >> where for similar reasons a range check `CastII` ends up above its >> range check, becomes top because its input becomes some integer that >> conflicts with its... > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > review Marked as reviewed by chagedorn (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/16886#pullrequestreview-1784218217 From chagedorn at openjdk.org Fri Dec 15 14:38:46 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 15 Dec 2023 14:38:46 GMT Subject: RFR: 8319793: C2 compilation fails with "Bad graph detected in build_loop_late" after JDK-8279888 [v7] In-Reply-To: References: <4O9MKwWQwTlG_t8S4rFdTu2QyvR2TtA953RcJ2YG6zM=.22262ff9-7b80-41e0-8986-eb08e4ab2630@github.com> Message-ID: On Fri, 15 Dec 2023 14:29:05 GMT, Roland Westrelin wrote: >> src/hotspot/share/opto/loopopts.cpp line 360: >> >>> 358: // dependent nodes end up at the lowest/nearest dominating check in the graph. 
To ensure that these Loads/Casts >>> 359: // do not float above any of the dominating checks (even when the lowest dominating check is later replaced by >>> 360: // yet another dominating check), we need to pin them at the lowest dominating check. >> >> Should we also add this updated comment to `ifnode.cpp:569` and `ifnode.cpp:1536`? > > Yes, right. Thanks! I updated the change. Great, thanks! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16886#discussion_r1428044084 From mli at openjdk.org Fri Dec 15 14:48:52 2023 From: mli at openjdk.org (Hamlin Li) Date: Fri, 15 Dec 2023 14:48:52 GMT Subject: RFR: 8322195: RISC-V: Minor improvement of MD5 instrinsic Message-ID: Hi, Can you review this minor patch to improve MD5 instrinsic? Thanks! ## Test tests (`find test/ -iname "*md5*.java"`) passed. ------------- Commit messages: - Initial commit Changes: https://git.openjdk.org/jdk/pull/17123/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=17123&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8322195 Stats: 3 lines in 1 file changed: 0 ins; 3 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/17123.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17123/head:pull/17123 PR: https://git.openjdk.org/jdk/pull/17123 From luhenry at openjdk.org Fri Dec 15 15:55:39 2023 From: luhenry at openjdk.org (Ludovic Henry) Date: Fri, 15 Dec 2023 15:55:39 GMT Subject: RFR: 8322195: RISC-V: Minor improvement of MD5 instrinsic In-Reply-To: References: Message-ID: On Fri, 15 Dec 2023 14:40:59 GMT, Hamlin Li wrote: > Hi, > Can you review this minor patch to improve MD5 instrinsic? > Thanks! > > ## Test > > tests (`find test/ -iname "*md5*.java"`) passed. I am assuming we cannot rework the store side with the same idea because of sign-extension? 
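The sign-extension concern raised in this question can be illustrated in plain Java. This is only a sketch: `addw` below is a hypothetical helper that models the RISC-V `addw` instruction (a 32-bit add whose result is sign-extended into the 64-bit register), and `zeroExtend` models masking with `0xFFFFFFFF`; neither is taken from the intrinsic's code.

```java
public class SignExtensionSketch {
    // Models RISC-V addw: add the low 32 bits, then sign-extend the result to 64 bits.
    static long addw(long a, long b) {
        return (long) (int) (a + b);
    }

    // Clears the upper 32 bits, as an AND with a 0xFFFFFFFF mask would.
    static long zeroExtend(long v) {
        return v & 0xFFFFFFFFL;
    }

    public static void main(String[] args) {
        long state0 = addw(0x7FFFFFFFL, 1L);
        System.out.println(Long.toHexString(state0));             // prints ffffffff80000000
        System.out.println(Long.toHexString(zeroExtend(state0))); // prints 80000000
    }
}
```

Whenever bit 31 of the 32-bit sum is set, the upper half of the modelled register is all ones, which is why a value produced this way would need masking before it is combined with other 64-bit data.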
------------- PR Review: https://git.openjdk.org/jdk/pull/17123#pullrequestreview-1784428466 From mli at openjdk.org Fri Dec 15 16:30:40 2023 From: mli at openjdk.org (Hamlin Li) Date: Fri, 15 Dec 2023 16:30:40 GMT Subject: RFR: 8322195: RISC-V: Minor improvement of MD5 instrinsic In-Reply-To: References: Message-ID: On Fri, 15 Dec 2023 15:52:40 GMT, Ludovic Henry wrote: > I am assuming we cannot rework the store side with the same idea because of sign-extension? Correct, for the store side, some code like `__ andr(state0, state0, t0);` is necessary because the upper 32 bits of state0 is not guaranteed to be zero, which is because of sign-extension introduced via `__ addw(state0, state0, a);` at the end of every block calculation. ------------- PR Comment: https://git.openjdk.org/jdk/pull/17123#issuecomment-1858146162 From mli at openjdk.org Fri Dec 15 16:37:48 2023 From: mli at openjdk.org (Hamlin Li) Date: Fri, 15 Dec 2023 16:37:48 GMT Subject: RFR: 8322209: RISC-V: Enable some tests related to MD5 instrinsic Message-ID: Hi, Can you review this simple patch to enable some tests for MD5 instrinsic? Thanks! 
------------- Commit messages: - Initial commit Changes: https://git.openjdk.org/jdk/pull/17126/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=17126&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8322209 Stats: 5 lines in 2 files changed: 4 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/17126.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17126/head:pull/17126 PR: https://git.openjdk.org/jdk/pull/17126 From never at openjdk.org Fri Dec 15 17:27:56 2023 From: never at openjdk.org (Tom Rodriguez) Date: Fri, 15 Dec 2023 17:27:56 GMT Subject: RFR: 8321288: [JVMCI] HotSpotJVMCIRuntime doesn't clean up WeakReferences in resolvedJavaTypes [v2] In-Reply-To: References: <-_CpkWzu-kr4DjL_t7tespZcMBCFz3QQmr4-KsHqjMg=.aa6d1624-610e-41a4-aad3-a714f2c168a3@github.com> Message-ID: On Thu, 7 Dec 2023 02:56:57 GMT, Doug Simon wrote: >> Tom Rodriguez has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: >> >> - Merge remote-tracking branch 'origin/master' into tkr-clean-weak >> - Comment and types improvements >> - 8321288: [JVMCI] HotSpotJVMCIRuntime doesn't clean up WeakReferences in resolvedJavaTypes > > src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotJVMCIRuntime.java line 506: > >> 504: static class KlassWeakReference extends WeakReference { >> 505: >> 506: private final Long klassPointer; > > I assume this is `Long` instead of `long` to avoid boxing in `expungeStaleEntries`? There's autoboxing in a more few places and I wanted to avoid having multiple copies of the box live so I promoted it to Long on entry to fromMetaspace. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16981#discussion_r1428239226 From never at openjdk.org Fri Dec 15 17:27:54 2023 From: never at openjdk.org (Tom Rodriguez) Date: Fri, 15 Dec 2023 17:27:54 GMT Subject: RFR: 8321288: [JVMCI] HotSpotJVMCIRuntime doesn't clean up WeakReferences in resolvedJavaTypes [v2] In-Reply-To: References: <-_CpkWzu-kr4DjL_t7tespZcMBCFz3QQmr4-KsHqjMg=.aa6d1624-610e-41a4-aad3-a714f2c168a3@github.com> Message-ID: On Mon, 11 Dec 2023 22:01:58 GMT, Tom Rodriguez wrote: >> HotSpotJVMCIRuntime.resolvedJavaTypes implements a weak value map but is lacking code to clean out cleared weak references. In normal mixed execution this isn't likely to get big and generally isolates are shutdown frequently so this doesn't lead to problems. In Xcomp mode with tests that stress unloading this becomes more problematic. In the worst case is still doesn't lead to large heaps but does make the idle heap larger than required. >> >> This PR adds ReferenceQueue based cleaning of reclaimed values. Testing in the context of a long running isolate shows that they are no longer accumulating. > > Tom Rodriguez has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Merge remote-tracking branch 'origin/master' into tkr-clean-weak > - Comment and types improvements > - 8321288: [JVMCI] HotSpotJVMCIRuntime doesn't clean up WeakReferences in resolvedJavaTypes Thanks for the reviews. 
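The `ReferenceQueue`-based cleaning described in the quoted PR text can be sketched roughly as follows. This is an illustration under assumptions, not the code from `HotSpotJVMCIRuntime`: `WeakValueMapSketch`, `KeyedWeakReference` and the method names are hypothetical, and the boxed `Long` key only mirrors the `Long klassPointer` field mentioned in the review.

```java
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;

public class WeakValueMapSketch<V> {
    // Weak reference that remembers its key, so the map entry can be removed
    // once the referent has been collected.
    static final class KeyedWeakReference<T> extends WeakReference<T> {
        final Long key;

        KeyedWeakReference(Long key, T value, ReferenceQueue<T> queue) {
            super(value, queue);
            this.key = key;
        }
    }

    private final Map<Long, KeyedWeakReference<V>> map = new HashMap<>();
    private final ReferenceQueue<V> queue = new ReferenceQueue<>();

    public synchronized void put(Long key, V value) {
        expungeStaleEntries();
        map.put(key, new KeyedWeakReference<>(key, value, queue));
    }

    public synchronized V get(Long key) {
        expungeStaleEntries();
        KeyedWeakReference<V> ref = map.get(key);
        return ref == null ? null : ref.get();
    }

    public synchronized int size() {
        expungeStaleEntries();
        return map.size();
    }

    // Drain references whose referents were collected and drop their entries.
    @SuppressWarnings("unchecked")
    private void expungeStaleEntries() {
        KeyedWeakReference<V> ref;
        while ((ref = (KeyedWeakReference<V>) queue.poll()) != null) {
            // Remove only if the key still maps to this exact (stale) reference.
            map.remove(ref.key, ref);
        }
    }
}
```

Without the `expungeStaleEntries` step the map would keep one cleared `WeakReference` per key indefinitely, which matches the accumulation the PR description reports.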
------------- PR Comment: https://git.openjdk.org/jdk/pull/16981#issuecomment-1858229215 From never at openjdk.org Fri Dec 15 17:27:58 2023 From: never at openjdk.org (Tom Rodriguez) Date: Fri, 15 Dec 2023 17:27:58 GMT Subject: Integrated: 8321288: [JVMCI] HotSpotJVMCIRuntime doesn't clean up WeakReferences in resolvedJavaTypes In-Reply-To: <-_CpkWzu-kr4DjL_t7tespZcMBCFz3QQmr4-KsHqjMg=.aa6d1624-610e-41a4-aad3-a714f2c168a3@github.com> References: <-_CpkWzu-kr4DjL_t7tespZcMBCFz3QQmr4-KsHqjMg=.aa6d1624-610e-41a4-aad3-a714f2c168a3@github.com> Message-ID: On Tue, 5 Dec 2023 19:00:51 GMT, Tom Rodriguez wrote: > HotSpotJVMCIRuntime.resolvedJavaTypes implements a weak value map but is lacking code to clean out cleared weak references. In normal mixed execution this isn't likely to get big and generally isolates are shutdown frequently so this doesn't lead to problems. In Xcomp mode with tests that stress unloading this becomes more problematic. In the worst case is still doesn't lead to large heaps but does make the idle heap larger than required. > > This PR adds ReferenceQueue based cleaning of reclaimed values. Testing in the context of a long running isolate shows that they are no longer accumulating. This pull request has now been integrated. 
Changeset: 05f7f0ad Author: Tom Rodriguez URL: https://git.openjdk.org/jdk/commit/05f7f0ade2c6c8ef57e884048cf159c46fa27b36 Stats: 48 lines in 1 file changed: 43 ins; 0 del; 5 mod 8321288: [JVMCI] HotSpotJVMCIRuntime doesn't clean up WeakReferences in resolvedJavaTypes Reviewed-by: dnsimon, kvn ------------- PR: https://git.openjdk.org/jdk/pull/16981 From volker.simonis at gmail.com Fri Dec 15 17:44:05 2023 From: volker.simonis at gmail.com (Volker Simonis) Date: Fri, 15 Dec 2023 18:44:05 +0100 Subject: Invalid code generated by C2 compiler in OpenJDK 21 In-Reply-To: References: Message-ID: On Fri, Dec 15, 2023 at 1:11 PM Antoine DESSAIGNE wrote: > > Thank you Volker and Andrew for your replies, > > Unfortunately, there's no hs_error generated, the code throws a > NullPointerException when it shouldn't. Sorry, overlooked the "without" before "crashing due to assertions" :) > > I can even reproduce it with JDK 23, I took the master of the jdk > repository this morning (commit b31454e3623 to be precise) and it > still fails. > > I cannot exclude this method from the JIT as it's used a lot in our application. > > I checked many many times the commit to be sure that I have the right > one and I do. It fails almost every time with 10737e168c9 [1] but it > never fails with its parent commit, even after 20 tests. > The change looks innocent, but CCing hotspot-runtime-dev and Coleen (who's the author of that change [1]). Maybe she has an idea? Is your code doing a lot of dynamic class loading and/or bytecode instrumentation/rewriting? > [1] https://github.com/openjdk/jdk/commit/10737e168c967a08e257927251861bf2c14795ab > > On Fri, 15 Dec 2023 at 12:37, Andrew Haley > wrote: > > > > On 12/15/23 10:20, Antoine DESSAIGNE wrote: > > > Do you have an idea of why this is happening? Do you know what test I > > > can run? > > > > First, try to reproduce it with JDK 22 preview.
> > > > If you can't provide a reproducer, it's likely that no one will be > > able to fix it now, and you'll have to wait until it gets fixed. > > > > Try: '-XX:CompileCommand=exclude,foo.bar.baz.Classname::badMethodName' > > > > -- > > Andrew Haley (he/him) > > Java Platform Lead Engineer > > Red Hat UK Ltd. > > https://keybase.io/andrewhaley > > EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 > > From luhenry at openjdk.org Fri Dec 15 18:02:38 2023 From: luhenry at openjdk.org (Ludovic Henry) Date: Fri, 15 Dec 2023 18:02:38 GMT Subject: RFR: 8322209: RISC-V: Enable some tests related to MD5 instrinsic In-Reply-To: References: Message-ID: <-ZaW-lOHY2ULRXXLl5BghaoiBKwEFDjJPBUQSxty1yQ=.045bf7a6-bf54-4093-a194-f77646ec95d2@github.com> On Fri, 15 Dec 2023 16:32:28 GMT, Hamlin Li wrote: > Hi, > Can you review this simple patch to enable some tests for MD5 intrinsic? > Thanks! Marked as reviewed by luhenry (Committer). ------------- PR Review: https://git.openjdk.org/jdk/pull/17126#pullrequestreview-1784664855 From antoine.dessaigne at gmail.com Fri Dec 15 20:08:37 2023 From: antoine.dessaigne at gmail.com (Antoine DESSAIGNE) Date: Fri, 15 Dec 2023 21:08:37 +0100 Subject: Invalid code generated by C2 compiler in OpenJDK 21 In-Reply-To: References: Message-ID: Hi Volker, > Is your code doing a lot of dynamic class loading and/or bytecode > instrumentation/rewriting? No bytecode instrumentation or rewriting. We do use OSGi and proxies *but* all loading is paused while we process a file. Processing this file makes thousands of calls to this method which gets compiled by C2, then it throws the exception. If I stop at level 3 with -XX:TieredStopAtLevel=3 then I don't have the exception. And I'm sure I have the right commit because all my tests with its parent commit are successful. Now I still don't understand how it can have this side effect. On Fri, 15 Dec 2023 at
18:44, Volker Simonis wrote: > > On Fri, Dec 15, 2023 at 1:11 PM Antoine DESSAIGNE > wrote: > > > > Thank you Volker and Andrew for your replies, > > > > Unfortunately, there's no hs_error generated, the code throws a > > NullPointerException when it shouldn't. > > Sorry, overlooked the "without" before "crashing due to assertions" :) > > > > > I can even reproduce it with JDK 23, I took the master of the jdk > > repository this morning (commit b31454e3623 to be precise) and it > > still fails. > > > > I cannot exclude this method from the JIT as it's used a lot in our application. > > > > I checked many many times the commit to be sure that I have the right > > one and I do. It fails almost every time with 10737e168c9 [1] but it > > never fails with its parent commit, even after 20 tests. > > > > The change looks innocent, but CCing hotspot-runtime-dev and Coleen > (who's the author of that change [1]). Maybe she has an idea? > > Is your code doing a lot of dynamic class loading and/or bytecode > instrumentation/rewriting? > > > [1] https://github.com/openjdk/jdk/commit/10737e168c967a08e257927251861bf2c14795ab > > > > On Fri, 15 Dec 2023 at 12:37, Andrew Haley > > wrote: > > > > > > On 12/15/23 10:20, Antoine DESSAIGNE wrote: > > > > Do you have an idea of why this is happening? Do you know what test I > > > > can run? > > > > > > First, try to reproduce it with JDK 22 preview. > > > > > > If you can't provide a reproducer, it's likely that no one will be > > > able to fix it now, and you'll have to wait until it gets fixed. > > > > > > Try: '-XX:CompileCommand=exclude,foo.bar.baz.Classname::badMethodName' > > > > > > -- > > > Andrew Haley (he/him) > > > Java Platform Lead Engineer > > > Red Hat UK Ltd.
> > > https://keybase.io/andrewhaley > > > EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 > > > From duke at openjdk.org Fri Dec 15 20:46:57 2023 From: duke at openjdk.org (Zhiqiang Zang) Date: Fri, 15 Dec 2023 20:46:57 GMT Subject: RFR: 8322077: Add Ideal transformation: (~a) | (~b) => ~(a & b) [v3] In-Reply-To: References: Message-ID: > Hello, > > (~a) | (~b) => ~(a & b) is a widely seen pattern, for example it is implemented for LLVM [here](https://github.com/llvm/llvm-project/blob/397f1ce9efb4eea1ee10fe4833f733b8c7abd878/llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp#L1617C28-L1617C28); however it is missing in current implementation of hotspot. This pull request adds this transformation and associated tests. > > Thanks. Zhiqiang Zang has updated the pull request incrementally with one additional commit since the last revision: use common helpful functions. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16334/files - new: https://git.openjdk.org/jdk/pull/16334/files/341f869c..6d291ae1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16334&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16334&range=01-02 Stats: 28 lines in 1 file changed: 18 ins; 2 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/16334.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16334/head:pull/16334 PR: https://git.openjdk.org/jdk/pull/16334 From duke at openjdk.org Fri Dec 15 20:46:58 2023 From: duke at openjdk.org (Zhiqiang Zang) Date: Fri, 15 Dec 2023 20:46:58 GMT Subject: RFR: 8322077: Add Ideal transformation: (~a) | (~b) => ~(a & b) [v2] In-Reply-To: References: Message-ID: On Thu, 14 Dec 2023 18:06:40 GMT, Quan Anh Mai wrote: > LGTM, but may you consider having dedicated functions to check and create `Not` patterns? Thanks a lot. Thanks. I updated. Please let me know if there are more comments. 
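The identity behind the proposed transformation, `(~a) | (~b) => ~(a & b)`, is De Morgan's law applied bitwise. A small Java check of it, as a sketch only: the class and method names are hypothetical, and this says nothing about how the C2 Ideal transformation itself is written.

```java
public class DeMorganSketch {
    // Shape of the expression before the transformation: two NOTs and an OR.
    static int before(int a, int b) {
        return (~a) | (~b);
    }

    // Shape after the transformation: one AND and one NOT.
    static int after(int a, int b) {
        return ~(a & b);
    }

    // The same identity for long operands.
    static long beforeL(long a, long b) {
        return (~a) | (~b);
    }

    static long afterL(long a, long b) {
        return ~(a & b);
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 42, Integer.MIN_VALUE, Integer.MAX_VALUE, 0x55555555};
        for (int a : samples) {
            for (int b : samples) {
                if (before(a, b) != after(a, b) || beforeL(a, b) != afterL(a, b)) {
                    throw new AssertionError("mismatch for a=" + a + ", b=" + b);
                }
            }
        }
        System.out.println("identity holds for all sampled pairs");
    }
}
```

The rewritten form uses two operations instead of three, which is the motivation for folding the pattern.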
------------- PR Comment: https://git.openjdk.org/jdk/pull/16334#issuecomment-1858463476 From duke at openjdk.org Fri Dec 15 23:35:56 2023 From: duke at openjdk.org (Zhiqiang Zang) Date: Fri, 15 Dec 2023 23:35:56 GMT Subject: RFR: 8322077: Add Ideal transformation: (~a) | (~b) => ~(a & b) [v4] In-Reply-To: References: Message-ID: <5KKTYmY7dJ4nW0OEQ2UPuloIWOYc-pI9M8HRjoaRzw4=.f5eda6f9-c38c-4a2b-9690-cbb1791a2622@github.com> > Hello, > > (~a) | (~b) => ~(a & b) is a widely seen pattern, for example it is implemented for LLVM [here](https://github.com/llvm/llvm-project/blob/397f1ce9efb4eea1ee10fe4833f733b8c7abd878/llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp#L1617C28-L1617C28); however it is missing in current implementation of hotspot. This pull request adds this transformation and associated tests. > > Thanks. Zhiqiang Zang has updated the pull request incrementally with one additional commit since the last revision: untabify. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16334/files - new: https://git.openjdk.org/jdk/pull/16334/files/6d291ae1..8697e399 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16334&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16334&range=02-03 Stats: 4 lines in 1 file changed: 0 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/16334.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16334/head:pull/16334 PR: https://git.openjdk.org/jdk/pull/16334 From aph-open at littlepinkcloud.com Sat Dec 16 13:26:18 2023 From: aph-open at littlepinkcloud.com (Andrew Haley) Date: Sat, 16 Dec 2023 13:26:18 +0000 Subject: Invalid code generated by C2 compiler in OpenJDK 21 In-Reply-To: References: Message-ID: <3f42eb44-67bc-4080-bf45-986a2f6a704c@littlepinkcloud.com> On 12/15/23 12:10, Antoine DESSAIGNE wrote: > I cannot exclude this method from the JIT as it's used a lot in our application. Try "dontinline" instead of "exclude". 
-- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From jbhateja at openjdk.org Sun Dec 17 17:55:11 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Sun, 17 Dec 2023 17:55:11 GMT Subject: RFR: 8318650: Optimized subword gather for x86 targets. [v8] In-Reply-To: References: Message-ID: <1-fZiP70dJ9idc-DmdVVQggxLvDuP2zeb3wt9l-R1dA=.265f4701-8feb-400b-a3e2-272b6d96fe7a@github.com> > Hi All, > > This patch optimizes sub-word gather operation for x86 targets with AVX2 and AVX512 features. > > Following is the summary of changes:- > > 1) Intrinsify sub-word gather using hybrid algorithm which initially partially unrolls scalar loop to accumulates values from gather indices into a quadword(64bit) slice followed by vector permutation to place the slice into appropriate vector lanes, it prevents code bloating and generates compact > JIT sequence. This coupled with savings from expansive array allocation in existing java implementation translates into significant performance of 1.3-5x gains with included micro. > > > ![image](https://github.com/openjdk/jdk/assets/59989778/e25ba4ad-6a61-42fa-9566-452f741a9c6d) > > > 2) Patch was also compared against modified java fallback implementation by replacing temporary array allocation with zero initialized vector and a scalar loops which inserts gathered values into vector. But, vector insert operation in higher vector lanes is a three step process which first extracts the upper vector 128 bit lane, updates it with gather subword value and then inserts the lane back to its original position. This makes inserts into higher order lanes costly w.r.t to proposed solution. In addition generated JIT code for modified fallback implementation was very bulky. This may impact in-lining decisions into caller contexts. > > 3) Some minor adjustments in existing gather instruction pattens for double/quad words. 
> > > Kindly review and share your feedback. > > > Best Regards, > Jatin Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains ten additional commits since the last revision: - Refined AVX3 implementation with integral gather. - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8318650 - Fix incorrect comment - Review comments resolutions. - Review comments resolutions. - Review comments resolutions. - Restricting masked sub-word gather to AVX512 target to align with integral gather support. - Review comments resolution. - 8318650: Optimized subword gather for x86 targets. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16354/files - new: https://git.openjdk.org/jdk/pull/16354/files/328b2217..a6f0f8cf Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16354&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16354&range=06-07 Stats: 842039 lines in 5288 files changed: 204894 ins; 551932 del; 85213 mod Patch: https://git.openjdk.org/jdk/pull/16354.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16354/head:pull/16354 PR: https://git.openjdk.org/jdk/pull/16354 From jbhateja at openjdk.org Sun Dec 17 17:55:12 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Sun, 17 Dec 2023 17:55:12 GMT Subject: RFR: 8318650: Optimized subword gather for x86 targets. [v7] In-Reply-To: <6trfINWUmYQ7emAfsQAeLPzXSWZGRubI7v8s-wWcGn4=.91ef2dc5-dc4b-4fdd-aa0d-1e7f809cd060@github.com> References: <6trfINWUmYQ7emAfsQAeLPzXSWZGRubI7v8s-wWcGn4=.91ef2dc5-dc4b-4fdd-aa0d-1e7f809cd060@github.com> Message-ID: On Wed, 15 Nov 2023 02:17:58 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch optimizes sub-word gather operation for x86 targets with AVX2 and AVX512 features. 
>> >> Following is the summary of changes:- >> >> 1) Intrinsify sub-word gather using hybrid algorithm which initially partially unrolls scalar loop to accumulates values from gather indices into a quadword(64bit) slice followed by vector permutation to place the slice into appropriate vector lanes, it prevents code bloating and generates compact >> JIT sequence. This coupled with savings from expansive array allocation in existing java implementation translates into significant performance of 1.3-5x gains with included micro. >> >> >> ![image](https://github.com/openjdk/jdk/assets/59989778/e25ba4ad-6a61-42fa-9566-452f741a9c6d) >> >> >> 2) Patch was also compared against modified java fallback implementation by replacing temporary array allocation with zero initialized vector and a scalar loops which inserts gathered values into vector. But, vector insert operation in higher vector lanes is a three step process which first extracts the upper vector 128 bit lane, updates it with gather subword value and then inserts the lane back to its original position. This makes inserts into higher order lanes costly w.r.t to proposed solution. In addition generated JIT code for modified fallback implementation was very bulky. This may impact in-lining decisions into caller contexts. >> >> 3) Some minor adjustments in existing gather instruction pattens for double/quad words. >> >> >> Kindly review and share your feedback. >> >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Fix incorrect comment Refined implementation using integral gather operation for AVX512 targets. As per Intel Optimization manual section 4.8.1.6 gather are micro coded atom with 50+ cycles latency, existing hybrid algorithm is performant for Intel Atom family CPUs and with runtime flag UseAVX=2. 
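To make the sub-word gather discussed in this thread concrete, here is a minimal scalar Java sketch, assuming the usual gather semantics in which lane i receives `src[offset + indexMap[i]]`. The class and method names are hypothetical. It models neither the intrinsic's JIT sequence nor the vector permutation step; `gatherSlice` only illustrates the idea of accumulating eight gathered bytes into one 64-bit slice.

```java
// Hypothetical scalar model of a byte gather; not the x86 intrinsic.
final class SubwordGatherSketch {
    // Reference semantics: lane i receives src[offset + indexMap[i]].
    static byte[] gatherBytes(byte[] src, int offset, int[] indexMap, int lanes) {
        byte[] out = new byte[lanes];
        for (int i = 0; i < lanes; i++) {
            out[i] = src[offset + indexMap[i]];
        }
        return out;
    }

    // Accumulates eight gathered bytes into one 64-bit slice,
    // with lane firstLane in the lowest byte (an assumed layout).
    static long gatherSlice(byte[] src, int offset, int[] indexMap, int firstLane) {
        long slice = 0;
        for (int i = 0; i < 8; i++) {
            long b = src[offset + indexMap[firstLane + i]] & 0xFFL;
            slice |= b << (8 * i);
        }
        return slice;
    }

    public static void main(String[] args) {
        byte[] src = {10, 20, 30, 40, 50, 60, 70, 80, 90};
        int[] indexMap = {8, 0, 1, 2, 3, 4, 5, 6};
        System.out.println(java.util.Arrays.toString(gatherBytes(src, 0, indexMap, 8)));
        System.out.println(Long.toHexString(gatherSlice(src, 0, indexMap, 0)));
    }
}
```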
------------- PR Comment: https://git.openjdk.org/jdk/pull/16354#issuecomment-1859235516 From jbhateja at openjdk.org Sun Dec 17 18:08:45 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Sun, 17 Dec 2023 18:08:45 GMT Subject: RFR: 8318650: Optimized subword gather for x86 targets. [v8] In-Reply-To: <1-fZiP70dJ9idc-DmdVVQggxLvDuP2zeb3wt9l-R1dA=.265f4701-8feb-400b-a3e2-272b6d96fe7a@github.com> References: <1-fZiP70dJ9idc-DmdVVQggxLvDuP2zeb3wt9l-R1dA=.265f4701-8feb-400b-a3e2-272b6d96fe7a@github.com> Message-ID: On Sun, 17 Dec 2023 17:55:11 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch optimizes sub-word gather operation for x86 targets with AVX2 and AVX512 features. >> >> Following is the summary of changes:- >> >> 1) Intrinsify sub-word gather using hybrid algorithm which initially partially unrolls scalar loop to accumulates values from gather indices into a quadword(64bit) slice followed by vector permutation to place the slice into appropriate vector lanes, it prevents code bloating and generates compact >> JIT sequence. This coupled with savings from expansive array allocation in existing java implementation translates into significant performance of 1.3-5x gains with included micro. >> >> >> ![image](https://github.com/openjdk/jdk/assets/59989778/e25ba4ad-6a61-42fa-9566-452f741a9c6d) >> >> >> 2) Patch was also compared against modified java fallback implementation by replacing temporary array allocation with zero initialized vector and a scalar loops which inserts gathered values into vector. But, vector insert operation in higher vector lanes is a three step process which first extracts the upper vector 128 bit lane, updates it with gather subword value and then inserts the lane back to its original position. This makes inserts into higher order lanes costly w.r.t to proposed solution. In addition generated JIT code for modified fallback implementation was very bulky. This may impact in-lining decisions into caller contexts. 
>> >> 3) Some minor adjustments in existing gather instruction pattens for double/quad words. >> >> >> Kindly review and share your feedback. >> >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains ten additional commits since the last revision: > > - Refined AVX3 implementation with integral gather. > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8318650 > - Fix incorrect comment > - Review comments resolutions. > - Review comments resolutions. > - Review comments resolutions. > - Restricting masked sub-word gather to AVX512 target to align with integral gather support. > - Review comments resolution. > - 8318650: Optimized subword gather for x86 targets. Latest performance number for E-Core and P-Core targets. ![image](https://github.com/openjdk/jdk/assets/59989778/34390500-44dd-425b-84d7-a1f59827397e) ![image](https://github.com/openjdk/jdk/assets/59989778/08c4c119-95d7-4466-868c-a2103d652f4c) ![image](https://github.com/openjdk/jdk/assets/59989778/5220fbbc-b8bb-4ca1-8670-26c190e0168d) ![image](https://github.com/openjdk/jdk/assets/59989778/bfee7e5e-f07b-4da6-b98d-d6dca9668ec0) ------------- PR Comment: https://git.openjdk.org/jdk/pull/16354#issuecomment-1859238889 From david.holmes at oracle.com Mon Dec 18 00:02:15 2023 From: david.holmes at oracle.com (David Holmes) Date: Mon, 18 Dec 2023 10:02:15 +1000 Subject: Invalid code generated by C2 compiler in OpenJDK 21 In-Reply-To: References: Message-ID: <5757b046-33d5-45d1-8912-b7b5978b7670@oracle.com> On 16/12/2023 3:44 am, Volker Simonis wrote: > On Fri, Dec 15, 2023 at 1:11 PM Antoine DESSAIGNE > wrote: >> >> Thank you Volker and Andrew for your replies, >> >> Unfortunately, there's no hs_error generated, the code throws a >> NullPointerException when it shouldn't.
> > Sorry, overlooked the "without" before "crashing due to assertions" :) > >> >> I can even reproduce it with JDK 23, I took the master of the jdk >> repository this morning (commit b31454e3623 to be precise) and it >> still fails. >> >> I cannot exclude this method from the JIT as it's used a lot in our application. >> >> I checked many many times the commit to be sure that I have the right >> one and I do. It fails almost every time with 10737e168c9 [1] but it >> never fails with its parent commit, even after 20 tests. >> > > The change looks innocent, but CCing hotstpo-runtime-dev and Coleen > (who's the author of that change [1]). Maybe she has an idea? Coleen is away for a while. I reviewed the change in [1] and I also cannot see how it could affect anything to do with the JIT. From the original mail on compiler-dev: > A valued local variable (effectively final) has its value removed What does this mean??? David ----- > Is your code doing a lot of dynamic class loading and/or bytecode > instrumentation/rewriting? > >> [1] https://github.com/openjdk/jdk/commit/10737e168c967a08e257927251861bf2c14795ab >> >> On Fri, 15 Dec 2023 at 12:37, Andrew Haley >> wrote: >>> >>> On 12/15/23 10:20, Antoine DESSAIGNE wrote: >>>> Do you have an idea of why this is happening? Do you know what test I >>>> can run? >>> >>> First, try to reproduce it with JDK 22 preview. >>> >>> If you can't provide a reproducer, it's likely that no one will be >>> able to fix it now, and you'll have to wait until it gets fixed. >>> >>> Try: '-XX:CompileCommand=exclude,foo.bar.baz.Classname::badMethodName' >>> >>> -- >>> Andrew Haley (he/him) >>> Java Platform Lead Engineer >>> Red Hat UK Ltd.
>>> https://keybase.io/andrewhaley >>> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 >>> From david.holmes at oracle.com Mon Dec 18 00:26:50 2023 From: david.holmes at oracle.com (David Holmes) Date: Mon, 18 Dec 2023 10:26:50 +1000 Subject: Invalid code generated by C2 compiler in OpenJDK 21 In-Reply-To: <5757b046-33d5-45d1-8912-b7b5978b7670@oracle.com> References: <5757b046-33d5-45d1-8912-b7b5978b7670@oracle.com> Message-ID: Also from the original: > I bisected the repository to find where the regression comes from. I > found this commit 3696711efa5 [1] but it's a merge so I bisected the > branch and found 10737e168c9 [2]. but commit [2] is not part of the merge commit [1]. David On 18/12/2023 10:02 am, David Holmes wrote: > On 16/12/2023 3:44 am, Volker Simonis wrote: >> On Fri, Dec 15, 2023 at 1:11 PM Antoine DESSAIGNE >> wrote: >>> >>> Thank you Volker and Andrew for your replies, >>> >>> Unfortunately, there's no hs_error generated, the code throws a >>> NullPointerException when it shouldn't. >> >> Sorry, overlooked the "without" before "crashing due to assertions" :) >> >>> >>> I can even reproduce it with JDK 23, I took the master of the jdk >>> repository this morning (commit b31454e3623 to be precise) and it >>> still fails. >>> >>> I cannot exclude this method from the JIT as it's used a lot in our >>> application. >>> >>> I checked many many times the commit to be sure that I have the right >>> one and I do. It fails almost every time with 10737e168c9 [1] but it >>> never fails with its parent commit, even after 20 tests. >>> >> >> The change looks innocent, but CCing hotstpo-runtime-dev and Coleen >> (who's the author of that change [1]). Maybe she has an idea? > > Coleen is away for a while. I reviewed the change in [1] and I also > cannot see how it could affect anything to do with the JIT. > > From the original mail on compiler-dev: > > > A valued local variable (effectively final) has its value removed > > What does this mean???
> > David > ----- > >> Is your code doing a lot of dynamic class loading and/or bytecode >> instrumentation/rewriting? >> >>> [1] >>> https://github.com/openjdk/jdk/commit/10737e168c967a08e257927251861bf2c14795ab >>> >>> On Fri, 15 Dec 2023 at 12:37, Andrew Haley >>> wrote: >>>> >>>> On 12/15/23 10:20, Antoine DESSAIGNE wrote: >>>>> Do you have an idea of why this is happening? Do you know what test I >>>>> can run? >>>> >>>> First, try to reproduce it with JDK 22 preview. >>>> >>>> If you can't provide a reproducer, it's likely that no one will be >>>> able to fix it now, and you'll have to wait until it gets fixed. >>>> >>>> Try: '-XX:CompileCommand=exclude,foo.bar.baz.Classname::badMethodName' >>>> >>>> -- >>>> Andrew Haley (he/him) >>>> Java Platform Lead Engineer >>>> Red Hat UK Ltd. >>>> https://keybase.io/andrewhaley >>>> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 >>>> From fjiang at openjdk.org Mon Dec 18 01:01:42 2023 From: fjiang at openjdk.org (Feilong Jiang) Date: Mon, 18 Dec 2023 01:01:42 GMT Subject: RFR: 8317721: RISC-V: Implement CRC32 intrinsic [v2] In-Reply-To: References: Message-ID: <-ESASip8OA080tvie3uCIlfszW2DvtPxGT_jri-6m4U=.e2800f29-650c-4e17-9c89-fcaae0627f8e@github.com> On Fri, 15 Dec 2023 11:49:50 GMT, ArsenyBochkarev wrote: >> Hi everyone! Please review this port of [AArch64](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp#L4224) `_updateBytesCRC32`, `_updateByteBufferCRC32` and `_updateCRC32` intrinsics. This patch introduces only the plain (non-vectorized, no Zbc) version. >> >> ### Correctness checks >> >> Tier 1/2 tests are ok.
>> >> ### Performance results on T-Head board >> >> #### Results for enabled intrinsic: >> >> Used test is `test/micro/org/openjdk/bench/java/util/TestCRC32.java` >> >> | Benchmark | (count) | Mode | Cnt | Score | Error | Units | >> | --- | ---- | ----- | --- | ---- | --- | ---- | >> | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 24 | 3730.929 | 37.773 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 24 | 2126.673 | 2.032 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 24 | 1134.330 | 6.714 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 24 | 584.017 | 2.267 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 24 | 151.173 | 0.346 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 24 | 19.113 | 0.008 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 24 | 4.647 | 0.022 | ops/ms | >> >> #### Results for disabled intrinsic: >> >> | Benchmark | (count) | Mode | Cnt | Score | Error | Units | >> | --------------------------------------------------- | ---------- | --------- | ---- | ----------- | --------- | ---------- | >> | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 15 | 798.365 | 35.486 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 15 | 677.756 | 46.619 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 15 | 552.781 | 27.143 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 15 | 429.304 | 12.518 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 15 | 166.738 | 0.935 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 15 | 25.060 | 0.034 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 15 | 6.196 | 0.030 | ops/ms | > > ArsenyBochkarev has updated the pull request incrementally with three additional commits since the last revision: > > - Use zero_extend instead of shifts where possible > - Use andn instead of notr + andr where possible > - Replace shNadd with one instruction in most cases 
src/hotspot/cpu/riscv/c1_LIRAssembler_riscv.cpp line 1643: > 1641: __ zero_extend(crc, crc, 32); > 1642: __ update_byte_crc32(crc, val, res); > 1643: __ notr(res, crc); // ~crc Do you miss the `zero_extend(crc, crc, 32)`? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17046#discussion_r1429344807 From jbhateja at openjdk.org Mon Dec 18 06:01:12 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 18 Dec 2023 06:01:12 GMT Subject: RFR: 8318650: Optimized subword gather for x86 targets. [v9] In-Reply-To: References: Message-ID: > Hi All, > > This patch optimizes sub-word gather operation for x86 targets with AVX2 and AVX512 features. > > Following is the summary of changes:- > > 1) Intrinsify sub-word gather using hybrid algorithm which initially partially unrolls scalar loop to accumulates values from gather indices into a quadword(64bit) slice followed by vector permutation to place the slice into appropriate vector lanes, it prevents code bloating and generates compact JIT sequence. This coupled with savings from expansive array allocation in existing java implementation translates into significant performance of 1.5-10x gains with included micro on Intel Atom family CPUs and with JVM option UseAVX=2. > > ![image](https://github.com/openjdk/jdk/assets/59989778/e25ba4ad-6a61-42fa-9566-452f741a9c6d) > > > 2) For AVX512 targets algorithm uses integral gather instructions to load values from normalized indices which are multiple of integer size, followed by shuffling and packing exact sub-word values from integral lanes. > > 3) Patch was also compared against modified java fallback implementation by replacing temporary array allocation with zero initialized vector and a scalar loops which inserts gathered values into vector. 
But, vector insert operation in higher vector lanes is a three step process which first extracts the upper vector 128 bit lane, updates it with gather subword value and then inserts the lane back to its original position. This makes inserts into higher order lanes costly w.r.t to proposed solution. In addition generated JIT code for modified fallback implementation was very bulky. This may impact in-lining decisions into caller contexts. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Removing JDK-8321648 related changes. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16354/files - new: https://git.openjdk.org/jdk/pull/16354/files/a6f0f8cf..4af776e8 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16354&range=08 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16354&range=07-08 Stats: 16 lines in 1 file changed: 14 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/16354.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16354/head:pull/16354 PR: https://git.openjdk.org/jdk/pull/16354 From fyang at openjdk.org Mon Dec 18 07:46:41 2023 From: fyang at openjdk.org (Fei Yang) Date: Mon, 18 Dec 2023 07:46:41 GMT Subject: RFR: 8322195: RISC-V: Minor improvement of MD5 instrinsic In-Reply-To: References: Message-ID: On Fri, 15 Dec 2023 14:40:59 GMT, Hamlin Li wrote: > Hi, > Can you review this minor patch to improve MD5 instrinsic? > Thanks! > > ## Test > > tests (`find test/ -iname "*md5*.java"`) passed. src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 4165: > 4163: __ ld(state2, Address(state, 8)); > 4164: __ srli(state3, state2, 32); > 4165: __ andr(state2, state2, t0); Better to add some code comment for this tweak. Or turn the four succeeding `mv` [1] into `addw` with `zr` to make it explicit that the lower half of `state` is needed? 
[1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp#L4170 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17123#discussion_r1429607831 From fyang at openjdk.org Mon Dec 18 07:57:38 2023 From: fyang at openjdk.org (Fei Yang) Date: Mon, 18 Dec 2023 07:57:38 GMT Subject: RFR: 8322209: RISC-V: Enable some tests related to MD5 instrinsic In-Reply-To: References: Message-ID: On Fri, 15 Dec 2023 16:32:28 GMT, Hamlin Li wrote: > Hi, > Can you review this simple patch to enable some tests for MD5 instrinsic? > Thanks! Marked as reviewed by fyang (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/17126#pullrequestreview-1786041440 From antoine.dessaigne at gmail.com Mon Dec 18 08:26:50 2023 From: antoine.dessaigne at gmail.com (Antoine DESSAIGNE) Date: Mon, 18 Dec 2023 09:26:50 +0100 Subject: Invalid code generated by C2 compiler in OpenJDK 21 In-Reply-To: References: <5757b046-33d5-45d1-8912-b7b5978b7670@oracle.com> Message-ID: Hi David, Here's a simplified version of the code private void resolveArrhythmicUpdate() { // extracting a value non-null value ReadableInterval storedValueVT = storedValue.getValidTime(); if (conditionA && !conditionB) { // then call storedValueVT.getStartMillis() and it's working } // some code not altering storedValueVT // same if as previouly but with one more condition if (conditionA && !conditionB && !conditionC) { // then call storedValueVT.getStartMillis() and it throws an NPE } } In my code storedValueVT always has a value, it's tested in the first if, but in the second if somehow it throws an NPE. The variable is not set to null and it's not reassigned, it only happens with C2 compiled code in the second "if" statement, the first one always works. I cannot publicly share the code on the internet but if you mail me directly we can arrange a video call and I can show you the code. 
As for the branch, here's the bisect history I did, started at jdk-19-ga, so skipping some history (but I have it complete if you need)

good: [0ba473489151d74c8a15b75ff4964ac480fecb28]
bad: [0eeaeb8e7ba40be5e93eb87c7e3dc94230062746]
bad: [3696711efa566fb776d6923da86e17b0e1e22964]
good: [f771c56e16a39724712ca0d8c2dd55b9ce260f4d]
first bad commit: [3696711efa566fb776d6923da86e17b0e1e22964]

The commit marked as "first bad commit" is a merge that has 2 parents f4caaca100d334b671eed56287dfe7a1009c47d7 and f771c56e16a39724712ca0d8c2dd55b9ce260f4d. We know from the first bisect that the commit f771 is valid so I redid a bisect on the other side, ending with f4ca...

bad: [3696711efa566fb776d6923da86e17b0e1e22964]
good: [10bc86cc260fac48bf10f67dd56aa73c6954f026]
bad: [e41686b4050d6b32fb451de8af39a78ec8bed0fd]
good: [0ef353925e645dd519e17aeb7a83e927271f8b95]
bad: [3cdbd878e68dc1131093137a7357710ad303ae8c]
good: [4b313b51b1787113961c289a41708e31fa19cacc]
bad: [10737e168c967a08e257927251861bf2c14795ab]
first bad commit: [10737e168c967a08e257927251861bf2c14795ab]

As I thought many times that I've messed up finding the commit, I tested commit 10737e168c967a08e257927251861bf2c14795ab and its parent many many times and it's always the same: 10737e1 is failing whereas its parent is always working.

Thank you for your help with this bug.

On Mon, 18 Dec 2023 at 01:36, David Holmes wrote:
>
> Also from the original:
>
> > I bisected the repository to find where the regression comes from. I
> > found this commit 3696711efa5 [1] but it's a merge so I bisected the
> > branch and found 10737e168c9 [2].
>
> but commit [2] is not part of the merge commit [1].
> > David > > On 18/12/2023 10:02 am, David Holmes wrote: > > On 16/12/2023 3:44 am, Volker Simonis wrote: > >> On Fri, Dec 15, 2023 at 1:11 PM Antoine DESSAIGNE > >> wrote: > >>> > >>> Thank you Volker and Andrew for your replies, > >>> > >>> Unfortunately, there's no hs_error generated, the code throws a > >>> NullPointerException when it shouldn't. > >> > >> Sorry, overlooked the "without" before "crashing due to assertions" :) > >> > >>> > >>> I can even reproduce it with JDK 23, I took the master of the jdk > >>> repository this morning (commit b31454e3623 to be precise) and it > >>> still fails. > >>> > >>> I cannot exclude this method from the JIT as it's used a lot in our > >>> application. > >>> > >>> I checked many many times the commit to be sure that I have the right > >>> one and I do. It fails almost every time with 10737e168c9 [1] but it > >>> never fails with its parent commit, even after 20 tests. > >>> > >> > >> The change looks innocent, but CCing hotstpo-runtime-dev and Coleen > >> (who's the author of that change [1]). Maybe she has an idea? > > > > Coleen is away for a while. I reviewed the change in [1] and I also > > cannot see how it could affect anything to do with the JIT. > > > > From the original mail on compiler-dev: > > > > > A valued local variable (effectively final) has its value removed > > > > What does this mean??? > > > > David > > ----- > > > >> Is your code doing a lot of dynamic class loading and/or bytecode > >> instrumentation/rewriting? > >> > >>> [1] > >>> https://github.com/openjdk/jdk/commit/10737e168c967a08e257927251861bf2c14795ab > >>> > >>> On Fri, 15 Dec 2023 at 12:37, Andrew Haley > >>> wrote: > >>>> > >>>> On 12/15/23 10:20, Antoine DESSAIGNE wrote: > >>>>> Do you have an idea of why this is happening? Do you know what test I > >>>>> can run? > >>>> > >>>> First, try to reproduce it with JDK 22 preview.
> >>>> > >>>> If you can't provide a reproducer, it's likely that no one will be > >>>> able to fix it now, and you'll have to wait until it gets fixed. > >>>> > >>>> Try: '-XX:CompileCommand=exclude,foo.bar.baz.Classname::badMethodName' > >>>> > >>>> -- > >>>> Andrew Haley (he/him) > >>>> Java Platform Lead Engineer > >>>> Red Hat UK Ltd. > >>>> https://keybase.io/andrewhaley > >>>> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 > >>>> From aph-open at littlepinkcloud.com Mon Dec 18 09:42:10 2023 From: aph-open at littlepinkcloud.com (Andrew Haley) Date: Mon, 18 Dec 2023 09:42:10 +0000 Subject: Invalid code generated by C2 compiler in OpenJDK 21 In-Reply-To: References: <5757b046-33d5-45d1-8912-b7b5978b7670@oracle.com> Message-ID: On 12/18/23 08:26, Antoine DESSAIGNE wrote: > In my code storedValueVT always has a value, it's tested in the first > if, but in the second if somehow it throws an NPE. The variable is not > set to null and it's not reassigned, it only happens with C2 compiled > code in the second "if" statement, the first one always works. I wonder if there's a data race here. Is it possible that the storedValue is changing while this executes? What would happen if you changed the code to do this? private void resolveArrhythmicUpdate() { // extracting a value non-null value if (conditionA && !conditionB) { // then call storedValue.getValidTime();.getStartMillis() and it's working } // some code not altering storedValueVT // same if as previouly but with one more condition if (conditionA && !conditionB && !conditionC) { // then call storedValue.getValidTime();.getStartMillis() and it throws an NPE } } -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. 
https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From antoine.dessaigne at gmail.com Mon Dec 18 10:08:13 2023 From: antoine.dessaigne at gmail.com (Antoine DESSAIGNE) Date: Mon, 18 Dec 2023 11:08:13 +0100 Subject: Invalid code generated by C2 compiler in OpenJDK 21 In-Reply-To: References: <5757b046-33d5-45d1-8912-b7b5978b7670@oracle.com> Message-ID: > I wonder if there's a data race here. Is it possible that the storedValue > is changing while this executes? No, storedValue doesn't change. And here is how the method getValidTime looks like public ReadableInterval getValidTime() { if (this.validTime == null) { this.validTime = new Interval(this.startMillis, this.endMillis); // both fields are longs } return this.validTime; } Once the reference to Interval is copied into my local variable storedValueVT, no change on the other class can change it. From mli at openjdk.org Mon Dec 18 10:34:48 2023 From: mli at openjdk.org (Hamlin Li) Date: Mon, 18 Dec 2023 10:34:48 GMT Subject: RFR: 8322209: RISC-V: Enable some tests related to MD5 instrinsic In-Reply-To: <-ZaW-lOHY2ULRXXLl5BghaoiBKwEFDjJPBUQSxty1yQ=.045bf7a6-bf54-4093-a194-f77646ec95d2@github.com> References: <-ZaW-lOHY2ULRXXLl5BghaoiBKwEFDjJPBUQSxty1yQ=.045bf7a6-bf54-4093-a194-f77646ec95d2@github.com> Message-ID: On Fri, 15 Dec 2023 18:00:04 GMT, Ludovic Henry wrote: >> Hi, >> Can you review this simple patch to enable some tests for MD5 instrinsic? >> Thanks! > > Marked as reviewed by luhenry (Committer). @luhenry @RealFYang Thanks for your reviewing. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/17126#issuecomment-1860054057 From mli at openjdk.org Mon Dec 18 10:34:50 2023 From: mli at openjdk.org (Hamlin Li) Date: Mon, 18 Dec 2023 10:34:50 GMT Subject: Integrated: 8322209: RISC-V: Enable some tests related to MD5 instrinsic In-Reply-To: References: Message-ID: On Fri, 15 Dec 2023 16:32:28 GMT, Hamlin Li wrote: > Hi, > Can you review this simple patch to enable some tests for MD5 instrinsic? > Thanks! This pull request has now been integrated. Changeset: a247d0c7 Author: Hamlin Li URL: https://git.openjdk.org/jdk/commit/a247d0c74bea50f11d24fb5f3576947c6901e567 Stats: 5 lines in 2 files changed: 4 ins; 0 del; 1 mod 8322209: RISC-V: Enable some tests related to MD5 instrinsic Reviewed-by: luhenry, fyang ------------- PR: https://git.openjdk.org/jdk/pull/17126 From mli at openjdk.org Mon Dec 18 11:03:03 2023 From: mli at openjdk.org (Hamlin Li) Date: Mon, 18 Dec 2023 11:03:03 GMT Subject: RFR: 8322195: RISC-V: Minor improvement of MD5 instrinsic [v2] In-Reply-To: References: Message-ID: > Hi, > Can you review this minor patch to improve MD5 instrinsic? > Thanks! > > ## Test > > tests (`find test/ -iname "*md5*.java"`) passed. 
Hamlin Li has updated the pull request incrementally with two additional commits since the last revision: - add space - Add some comment ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17123/files - new: https://git.openjdk.org/jdk/pull/17123/files/2b7855a8..606b6248 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17123&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17123&range=00-01 Stats: 8 lines in 1 file changed: 7 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/17123.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17123/head:pull/17123 PR: https://git.openjdk.org/jdk/pull/17123 From mli at openjdk.org Mon Dec 18 11:03:06 2023 From: mli at openjdk.org (Hamlin Li) Date: Mon, 18 Dec 2023 11:03:06 GMT Subject: RFR: 8322195: RISC-V: Minor improvement of MD5 instrinsic [v2] In-Reply-To: References: Message-ID: On Mon, 18 Dec 2023 07:43:49 GMT, Fei Yang wrote: >> Hamlin Li has updated the pull request incrementally with two additional commits since the last revision: >> >> - add space >> - Add some comment > > src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 4165: > >> 4163: __ ld(state2, Address(state, 8)); >> 4164: __ srli(state3, state2, 32); >> 4165: __ andr(state2, state2, t0); > > Better to add some code comment for this tweak. Or turn the four succeeding `mv` [1] into `addw` with `zr` to make it explicit that only the lower half of `state` is needed? > > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp#L4170 Agree, added some comments. I don't change `mv` to `addw`, as anyway it will be sign-extened to higher 32 bits, so maybe comment make it more clear. 
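A minimal Java sketch of the split being discussed, assuming `t0` holds the mask `0xFFFFFFFF` (its setup is not shown in the quoted lines) and that the 64-bit `ld` reads two adjacent 32-bit MD5 state words on a little-endian target. The class and method names are hypothetical; this mirrors only the arithmetic of `srli` and `andr`, not the stub generator code.

```java
// Hypothetical model of splitting one 64-bit load into two 32-bit state words.
final class Md5StateSplitSketch {
    // Models a little-endian 64-bit load: the word at the lower address
    // ends up in the low 32 bits, the next word in the high 32 bits.
    static long load64(int lowWord, int highWord) {
        return ((long) highWord << 32) | (lowWord & 0xFFFF_FFFFL);
    }

    // Models andr(state2, state2, t0) with t0 assumed to be 0xFFFFFFFF.
    static long lowHalf(long packed) {
        return packed & 0xFFFF_FFFFL;
    }

    // Models srli(state3, state2, 32): logical shift, so no sign bits leak in.
    static long highHalf(long packed) {
        return packed >>> 32;
    }

    public static void main(String[] args) {
        // Standard MD5 initial values for the third and fourth state words.
        long packed = load64(0x98badcfe, 0x10325476);
        System.out.println(Long.toHexString(lowHalf(packed)));
        System.out.println(Long.toHexString(highHalf(packed)));
    }
}
```

One load plus a shift and a mask replaces two separate 32-bit loads, which is the kind of saving the patch title calls a minor improvement.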
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17123#discussion_r1429928918 From aph-open at littlepinkcloud.com Mon Dec 18 11:04:42 2023 From: aph-open at littlepinkcloud.com (Andrew Haley) Date: Mon, 18 Dec 2023 11:04:42 +0000 Subject: Invalid code generated by C2 compiler in OpenJDK 21 In-Reply-To: References: <5757b046-33d5-45d1-8912-b7b5978b7670@oracle.com> Message-ID: <0b2a2d31-74cc-49a1-9140-b84ea6f9ac0a@littlepinkcloud.com> On 12/18/23 10:08, Antoine DESSAIGNE wrote: > > Once the reference to Interval is copied into my local variable > storedValueVT, no change on the other class can change it. I'm not sure about that. C2 contains logic to rematerialize a value rather than spilling, if it's cheaper. That could be happening here. At least, I know of no place in the JMM that forbids it. As an experiment, you could try making validTime volatile. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From luhenry at openjdk.org Mon Dec 18 11:29:41 2023 From: luhenry at openjdk.org (Ludovic Henry) Date: Mon, 18 Dec 2023 11:29:41 GMT Subject: RFR: 8322195: RISC-V: Minor improvement of MD5 instrinsic [v2] In-Reply-To: References: Message-ID: On Mon, 18 Dec 2023 11:03:03 GMT, Hamlin Li wrote: >> Hi, >> Can you review this minor patch to improve MD5 instrinsic? >> Thanks! >> >> ## Test >> >> tests (`find test/ -iname "*md5*.java"`) passed. > > Hamlin Li has updated the pull request incrementally with two additional commits since the last revision: > > - add space > - Add some comment Marked as reviewed by luhenry (Committer). 
------------- PR Review: https://git.openjdk.org/jdk/pull/17123#pullrequestreview-1786673650 From antoine.dessaigne at gmail.com Mon Dec 18 11:41:24 2023 From: antoine.dessaigne at gmail.com (Antoine DESSAIGNE) Date: Mon, 18 Dec 2023 12:41:24 +0100 Subject: Invalid code generated by C2 compiler in OpenJDK 21 In-Reply-To: <0b2a2d31-74cc-49a1-9140-b84ea6f9ac0a@littlepinkcloud.com> References: <5757b046-33d5-45d1-8912-b7b5978b7670@oracle.com> <0b2a2d31-74cc-49a1-9140-b84ea6f9ac0a@littlepinkcloud.com> Message-ID: Hello everyone, Good news, I've found the typo in https://github.com/openjdk/jdk/commit/10737e168c967a08e257927251861bf2c14795ab, it's in loaderConstraints.cpp. Previously it was } else if (pp1 == NULL) { pp2->extend_loader_constraint(class_name, class_loader1, klass); } else if (pp2 == NULL) { pp1->extend_loader_constraint(class_name, class_loader2, klass); Now it is (and still is on master but line numbers have changed) } else if (pp1 == NULL) { pp2->extend_loader_constraint(class_name, loader1, klass); } else if (pp2 == NULL) { pp1->extend_loader_constraint(class_name, loader1, klass); The last line should be using loader2 pp1->extend_loader_constraint(class_name, loader2, klass); If I do the fix locally then I no longer have the issue. Now, I don't know what to do next to have it fixed. Can someone either do it or tell me how to do it? Thank you. On Mon, 18 Dec 2023 at 12:04, Andrew Haley wrote: > > On 12/18/23 10:08, Antoine DESSAIGNE wrote: > > > > > Once the reference to Interval is copied into my local variable > > storedValueVT, no change on the other class can change it. > > I'm not sure about that. C2 contains logic to rematerialize a value > rather than spilling, if it's cheaper. That could be happening here. > At least, I know of no place in the JMM that forbids it. > > As an experiment, you could try making validTime volatile. > > -- > Andrew Haley (he/him) > Java Platform Lead Engineer > Red Hat UK Ltd.
> https://keybase.io/andrewhaley > EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 > From antoine.dessaigne at gmail.com Mon Dec 18 12:07:15 2023 From: antoine.dessaigne at gmail.com (Antoine DESSAIGNE) Date: Mon, 18 Dec 2023 13:07:15 +0100 Subject: Invalid code generated by C2 compiler in OpenJDK 21 In-Reply-To: References: <5757b046-33d5-45d1-8912-b7b5978b7670@oracle.com> <0b2a2d31-74cc-49a1-9140-b84ea6f9ac0a@littlepinkcloud.com> Message-ID: Quick follow-up, Aleksey Shipilev created the bug [1] and the pull request [2] as I have no idea how long it will take for me to have the signed OCA :) Thank you all Antoine [1] https://bugs.openjdk.org/browse/JDK-8322282 [2] https://github.com/openjdk/jdk/pull/17140 On Mon, 18 Dec 2023 at 12:41, Antoine DESSAIGNE wrote: > > Hello everyone, > > Good news, I've found the typo in > https://github.com/openjdk/jdk/commit/10737e168c967a08e257927251861bf2c14795ab, > it's in loaderConstraints.cpp. > > Previously it was > } else if (pp1 == NULL) { > pp2->extend_loader_constraint(class_name, class_loader1, klass); > } else if (pp2 == NULL) { > pp1->extend_loader_constraint(class_name, class_loader2, klass); > > Now it is (and still is on master but line numbers have changed) > } else if (pp1 == NULL) { > pp2->extend_loader_constraint(class_name, loader1, klass); > } else if (pp2 == NULL) { > pp1->extend_loader_constraint(class_name, loader1, klass); > > The last line should be using loader2 > pp1->extend_loader_constraint(class_name, loader2, klass); > > If I do the fix locally then I no longer have the issue. > > Now, I don't know what to do next to have it fixed. Can someone either > do it or tell me how to do it? Thank you. > > On Mon, 18 Dec 2023 at 12:04, Andrew Haley > wrote: > > > > On 12/18/23 10:08, Antoine DESSAIGNE wrote: > > > > > > > > Once the reference to Interval is copied into my local variable > > > storedValueVT, no change on the other class can change it. > > > > I'm not sure about that.
C2 contains logic to rematerialize a value > > rather than spilling, if it's cheaper. That could be happening here. > > At least, I know of no place in the JMM that forbids it. > > > > As an experiment, you could try making validTime volatile. > > > > -- > > Andrew Haley (he/him) > > Java Platform Lead Engineer > > Red Hat UK Ltd. > > https://keybase.io/andrewhaley > > EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 > > From david.holmes at oracle.com Mon Dec 18 12:21:58 2023 From: david.holmes at oracle.com (David Holmes) Date: Mon, 18 Dec 2023 22:21:58 +1000 Subject: Invalid code generated by C2 compiler in OpenJDK 21 In-Reply-To: References: <5757b046-33d5-45d1-8912-b7b5978b7670@oracle.com> <0b2a2d31-74cc-49a1-9140-b84ea6f9ac0a@littlepinkcloud.com> Message-ID: <9261ffb0-1b0c-4fc5-bff7-dc87cb247282@oracle.com> I will need to follow up on this tomorrow. Even with the mistake I cannot see how it can possibly lead to the NPE. Thanks, David On 18/12/2023 10:07 pm, Antoine DESSAIGNE wrote: > Quick follow-up, Aleksey Shipilev created the bug [1] and the pull > request [2] as I have no idea how long it will take for me to have the > signed OCA :) > > Thank you all > > Antoine > > [1] https://bugs.openjdk.org/browse/JDK-8322282 > [2] https://github.com/openjdk/jdk/pull/17140 > > On Mon, 18 Dec 2023 at 12:41, Antoine DESSAIGNE > wrote: >> >> Hello everyone, >> >> Good news, I've found the typo in >> https://github.com/openjdk/jdk/commit/10737e168c967a08e257927251861bf2c14795ab, >> it's in loaderConstraints.cpp.
>> >> Previously it was >> } else if (pp1 == NULL) { >> pp2->extend_loader_constraint(class_name, class_loader1, klass); >> } else if (pp2 == NULL) { >> pp1->extend_loader_constraint(class_name, class_loader2, klass); >> >> Now it is (and still is on master but line numbers have changed) >> } else if (pp1 == NULL) { >> pp2->extend_loader_constraint(class_name, loader1, klass); >> } else if (pp2 == NULL) { >> pp1->extend_loader_constraint(class_name, loader1, klass); >> >> The last line should be using loader2 >> pp1->extend_loader_constraint(class_name, loader2, klass); >> >> If I do the fix locally then I no longer have the issue. >> >> Now, I don't know what to do next to have it fixed. Can someone either >> do it or tell me how to do it? Thank you. >> >> On Mon, 18 Dec 2023 at 12:04, Andrew Haley >> wrote: >>> >>> On 12/18/23 10:08, Antoine DESSAIGNE wrote: >>> >>> > >>> > Once the reference to Interval is copied into my local variable >>> > storedValueVT, no change on the other class can change it. >>> >>> I'm not sure about that. C2 contains logic to rematerialize a value >>> rather than spilling, if it's cheaper. That could be happening here. >>> At least, I know of no place in the JMM that forbids it. >>> >>> As an experiment, you could try making validTime volatile. >>> >>> -- >>> Andrew Haley (he/him) >>> Java Platform Lead Engineer >>> Red Hat UK Ltd.
>>> https://keybase.io/andrewhaley >>> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 >>> From kbarrett at openjdk.org Mon Dec 18 17:18:13 2023 From: kbarrett at openjdk.org (Kim Barrett) Date: Mon, 18 Dec 2023 17:18:13 GMT Subject: RFR: 8282365: Consolidate and improve division by constant idealizations [v35] In-Reply-To: References: Message-ID: <0vhJhWdel74-hzt5aQ7ykhOFvaVMBTNjH3daKZCjdM8=.97b6817b-f5f3-4b91-9e94-c6794a21b198@github.com> On Tue, 12 Dec 2023 16:11:05 GMT, Quan Anh Mai wrote: >> src/hotspot/share/opto/divnode.hpp line 40: >> >>> 38: template >>> 39: void magic_divide_constants(T d, T N_neg, T N_pos, juint min_s, T& c, bool& c_ovf, juint& s); >>> 40: void magic_divide_constants_round_down(juint d, juint& c, juint& s); >> >> The definitions of these are in new file divconstants.cpp. So I think these should be in divconstants.hpp. > > I see there are multiple cases where a header is defined in multiple source files, and these are used exclusively for `DivNode`s so putting them here seems logical. If it were true these are only used in divnode.cpp then the declarations could be in that file. But these are also used in the gtest. That gtest duplicates the declaration. Better would be for it to include the suggested divconstants.hpp. Having examples of an unusual style doesn't mean we want more, at least not without some fairly good reason. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/9947#discussion_r1430436221 From kbarrett at openjdk.org Mon Dec 18 17:18:10 2023 From: kbarrett at openjdk.org (Kim Barrett) Date: Mon, 18 Dec 2023 17:18:10 GMT Subject: RFR: 8282365: Consolidate and improve division by constant idealizations [v39] In-Reply-To: <4aXL_qh1epRWCwufaHiKXJ3wuPqG0xZSF6i-8r6OgcU=.97a5ff7e-5a19-47e5-b14d-af16ef5c56d5@github.com> References: <4aXL_qh1epRWCwufaHiKXJ3wuPqG0xZSF6i-8r6OgcU=.97a5ff7e-5a19-47e5-b14d-af16ef5c56d5@github.com> Message-ID: On Wed, 13 Dec 2023 16:01:42 GMT, Quan Anh Mai wrote: >> This patch implements idealisation for unsigned divisions to change a division by a constant into a series of multiplication and shift. I also change the idealisation of `DivI` to get a more efficient series when the magic constant overflows an int32. >> >> In general, the idea behind a signed division transformation is that for a positive constant `d`, we would need to find constants `c` and `m` so that: >> >> floor(x / d) = floor(x * c / 2**m) for 0 < x < 2**(N - 1) (1) >> ceil(x / d) = floor(x * c / 2**m) + 1 for -2**(N - 1) <= x < 0 (2) >> >> The implementation in the original book takes into consideration that the machine may not be able to perform the full multiplication `x * c`, so the constant overflow and we need to add back the dividend as in `DivLNode::Ideal` cases. However, for int32 division, `x * c` cannot overflow an int64. As a result, it is always feasible to just calculate the product and shift the result. >> >> For unsigned multiplication, the situation is somewhat trickier because the condition needs to be twice as strong (the condition (1) and (2) above are mostly the same). This results in the magic constant `c` calculated based on the method presented in Hacker's Delight by Henry S. Warren, Jr. may overflow an uintN. For int division, we can depend on the theorem devised by Arch D. 
Robison in N-Bit Unsigned Division Via N-Bit Multiply-Add, which states that there exists either: >> >> c1 in uint32 and m1, such that floor(x / d) = floor(x * c1 / 2**m1) for 0 < x < 2**32 (3) >> c2 in uint32 and m2, such that floor(x / d) = floor((x + 1) * c2 / 2**m2) for 0 < x < 2**32 (4) >> >> which means that either `x * c1` never overflows an uint64 or `(x + 1) * c2` never overflows an uint64. And we can perform a full multiplication. >> >> For longs, there is no way to do a full multiplication so we do some basic transformations to achieve a computable formula. The details I have written as comments in the overflow case. >> >> More tests are added to cover the possible patterns. >> >> Please take a look and have some reviews. Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > missing revert Changes requested by kbarrett (Reviewer). src/hotspot/share/opto/divconstants.cpp line 28: > 26: #include > 27: #include > 28: #include "utilities/powerOfTwo.hpp" Comment about include order was marked resolved, but no change was made. test/hotspot/gtest/opto/test_constant_division.cpp line 33: > 31: > 32: template > 33: UT random(); Please make this (file scoped) static. (If I remember correctly, only needed (and maybe only allowed), here.) And can you add a comment explaining what this function is supposed to do? I failed to puzzle that out from the various implementations. test/hotspot/gtest/opto/test_constant_division.cpp line 66: > 64: > 65: template > 66: void magic_divide_constants(T d, T N_neg, T N_pos, juint min_s, T& c, bool& c_ovf, juint& s); I don't see any tests here of magic_divide_constants_round_down. 
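For readers following this thread, the multiply-and-shift idea from the quoted PR description (condition (1), the nonnegative half of a signed int32 division) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the `magic_divide_constants` implementation under review: `MagicConstants`, `magic_for_nonneg_int32` and `divide_by_constant` are hypothetical names, and the constant is chosen by the classic round-up rule c = ceil(2**s / d) with s = 31 + ceil(log2(d)), which keeps `x * c` within 64 bits for 0 <= x < 2**31. Negative dividends and the unsigned cases discussed in the PR need additional handling that is not shown here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical illustration, not HotSpot's magic_divide_constants().
struct MagicConstants {
  uint64_t c;  // multiplier, at most 2^32 for the inputs accepted below
  unsigned s;  // total right shift, at most 62
};

// Computes c and s such that floor(x / d) == (x * c) >> s
// for every 0 <= x < 2^31, given 1 <= d < 2^31.
static MagicConstants magic_for_nonneg_int32(uint32_t d) {
  assert(d >= 1 && d < (1u << 31));
  unsigned l = 0;
  while ((uint64_t(1) << l) < d) {
    l++;                                  // l = ceil(log2(d))
  }
  unsigned s = 31 + l;                    // 2^s still fits in uint64_t
  uint64_t pow = uint64_t(1) << s;
  uint64_t c = (pow + d - 1) / d;         // c = ceil(2^s / d)
  return {c, s};
}

// Replaces the division x / d by one 64-bit multiply and one shift.
static uint32_t divide_by_constant(uint32_t x, MagicConstants m) {
  assert(x < (1u << 31));
  return static_cast<uint32_t>((uint64_t(x) * m.c) >> m.s);
}
```

With d = 7 this yields s = 34 and c = 2454267027, and `divide_by_constant(x, m)` agrees with `x / 7` over the whole nonnegative int32 range. The product never overflows because c <= 2**32 and x < 2**31, which mirrors the remark in the PR description that `x * c` cannot overflow an int64 for int32 division.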
------------- PR Review: https://git.openjdk.org/jdk/pull/9947#pullrequestreview-1787305924 PR Review Comment: https://git.openjdk.org/jdk/pull/9947#discussion_r1430437772 PR Review Comment: https://git.openjdk.org/jdk/pull/9947#discussion_r1430396012 PR Review Comment: https://git.openjdk.org/jdk/pull/9947#discussion_r1430442297 From sviswanathan at openjdk.org Mon Dec 18 22:30:38 2023 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Mon, 18 Dec 2023 22:30:38 GMT Subject: RFR: 8321648: Integral gather optimized mask computation. In-Reply-To: References: Message-ID: On Mon, 11 Dec 2023 07:26:31 GMT, Jatin Bhateja wrote: > Hi, > > This bug fix patch optimizes integral gather mask computation using cheaper instruction and fixes incorrect instruction attributes in legacy integral gather instructions. > > All Vector API JTREG tests are passing with this at various AVX levels. > > Kindly review and share feedback. > > Best Regards, > Jatin LGTM ------------- Marked as reviewed by sviswanathan (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/17048#pullrequestreview-1787810545 From fyang at openjdk.org Tue Dec 19 01:40:42 2023 From: fyang at openjdk.org (Fei Yang) Date: Tue, 19 Dec 2023 01:40:42 GMT Subject: RFR: 8322195: RISC-V: Minor improvement of MD5 instrinsic [v2] In-Reply-To: References: Message-ID: On Mon, 18 Dec 2023 11:03:03 GMT, Hamlin Li wrote: >> Hi, >> Can you review this minor patch to improve MD5 instrinsic? >> Thanks! >> >> ## Test >> >> tests (`find test/ -iname "*md5*.java"`) passed. > > Hamlin Li has updated the pull request incrementally with two additional commits since the last revision: > > - add space > - Add some comment Looks good to me except for the nit. Thanks. src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 4163: > 4161: // in the following code, it does not care about the content of > 4162: // higher 32-bits in state[x]. 
Based on this observation, > 4163: // we can apply futher optimization, which is to just ignore the Nit: /futher/further/ ------------- Marked as reviewed by fyang (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/17123#pullrequestreview-1787955905 PR Review Comment: https://git.openjdk.org/jdk/pull/17123#discussion_r1430811402 From duke at openjdk.org Tue Dec 19 07:31:05 2023 From: duke at openjdk.org (Raphael Mosaner) Date: Tue, 19 Dec 2023 07:31:05 GMT Subject: RFR: 8320139: [JVMCI] VmObjectAlloc is not generated by intrinsics methods which allocate objects [v2] In-Reply-To: References: Message-ID: <0Um1zwgyjW8IXcISUsf7s2fvfsD03JD7esf112bGa_g=.7b483632-9193-49ec-87d4-0ad5adbab234@github.com> > This PR exports a pointer to `JvmtiExport::_should_notify_object_alloc` via JVMCI to enable intrinsification of unsafe allocations in accordance to C2. Raphael Mosaner has updated the pull request incrementally with one additional commit since the last revision: [JVMCI] Documentation for _should_notify_object_alloc export. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/16980/files - new: https://git.openjdk.org/jdk/pull/16980/files/0b430d8b..35b2fd9c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16980&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16980&range=00-01 Stats: 5 lines in 1 file changed: 5 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/16980.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16980/head:pull/16980 PR: https://git.openjdk.org/jdk/pull/16980 From jbhateja at openjdk.org Tue Dec 19 07:54:49 2023 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 19 Dec 2023 07:54:49 GMT Subject: Integrated: 8321648: Integral gather optimized mask computation. 
In-Reply-To: References: Message-ID: On Mon, 11 Dec 2023 07:26:31 GMT, Jatin Bhateja wrote: > Hi, > > This bug fix patch optimizes integral gather mask computation using cheaper instruction and fixes incorrect instruction attributes in legacy integral gather instructions. > > All Vector API JTREG tests are passing with this at various AVX levels. > > Kindly review and share feedback. > > Best Regards, > Jatin This pull request has now been integrated. Changeset: 76637c53 Author: Jatin Bhateja URL: https://git.openjdk.org/jdk/commit/76637c53c56d39cc534ecaa9e9ff55413173b15c Stats: 31 lines in 3 files changed: 11 ins; 14 del; 6 mod 8321648: Integral gather optimized mask computation. Reviewed-by: thartmann, sviswanathan ------------- PR: https://git.openjdk.org/jdk/pull/17048 From mli at openjdk.org Tue Dec 19 08:48:00 2023 From: mli at openjdk.org (Hamlin Li) Date: Tue, 19 Dec 2023 08:48:00 GMT Subject: RFR: 8322195: RISC-V: Minor improvement of MD5 instrinsic [v3] In-Reply-To: References: Message-ID: > Hi, > Can you review this minor patch to improve MD5 instrinsic? > Thanks! > > ## Test > > tests (`find test/ -iname "*md5*.java"`) passed. 
Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: Fix typo ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17123/files - new: https://git.openjdk.org/jdk/pull/17123/files/606b6248..301f35ad Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17123&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17123&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/17123.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17123/head:pull/17123 PR: https://git.openjdk.org/jdk/pull/17123 From mli at openjdk.org Tue Dec 19 08:48:03 2023 From: mli at openjdk.org (Hamlin Li) Date: Tue, 19 Dec 2023 08:48:03 GMT Subject: RFR: 8322195: RISC-V: Minor improvement of MD5 instrinsic [v2] In-Reply-To: References: Message-ID: On Tue, 19 Dec 2023 01:36:39 GMT, Fei Yang wrote: >> Hamlin Li has updated the pull request incrementally with two additional commits since the last revision: >> >> - add space >> - Add some comment > > src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 4163: > >> 4161: // in the following code, it does not care about the content of >> 4162: // higher 32-bits in state[x]. Based on this observation, >> 4163: // we can apply futher optimization, which is to just ignore the > > Nit: /futher/further/ Oh, thanks for catching! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17123#discussion_r1431087694 From mli at openjdk.org Tue Dec 19 08:48:00 2023 From: mli at openjdk.org (Hamlin Li) Date: Tue, 19 Dec 2023 08:48:00 GMT Subject: RFR: 8322195: RISC-V: Minor improvement of MD5 instrinsic [v2] In-Reply-To: References: Message-ID: On Mon, 18 Dec 2023 11:26:55 GMT, Ludovic Henry wrote: >> Hamlin Li has updated the pull request incrementally with two additional commits since the last revision: >> >> - add space >> - Add some comment > > Marked as reviewed by luhenry (Committer). 
Thanks @luhenry @RealFYang for your reviewing. ------------- PR Comment: https://git.openjdk.org/jdk/pull/17123#issuecomment-1862343341 From mli at openjdk.org Tue Dec 19 08:48:05 2023 From: mli at openjdk.org (Hamlin Li) Date: Tue, 19 Dec 2023 08:48:05 GMT Subject: Integrated: 8322195: RISC-V: Minor improvement of MD5 instrinsic In-Reply-To: References: Message-ID: On Fri, 15 Dec 2023 14:40:59 GMT, Hamlin Li wrote: > Hi, > Can you review this minor patch to improve MD5 instrinsic? > Thanks! > > ## Test > > tests (`find test/ -iname "*md5*.java"`) passed. This pull request has now been integrated. Changeset: fff2e580 Author: Hamlin Li URL: https://git.openjdk.org/jdk/commit/fff2e580cdab90ea828c1c300440471981646c51 Stats: 10 lines in 1 file changed: 6 ins; 2 del; 2 mod 8322195: RISC-V: Minor improvement of MD5 instrinsic Reviewed-by: luhenry, fyang ------------- PR: https://git.openjdk.org/jdk/pull/17123 From qamai at openjdk.org Tue Dec 19 10:42:54 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Tue, 19 Dec 2023 10:42:54 GMT Subject: Integrated: 8319451: PhaseIdealLoop::conditional_move is too conservative In-Reply-To: References: Message-ID: On Mon, 6 Nov 2023 19:10:42 GMT, Quan Anh Mai wrote: > Hi, > > When transforming a Phi into a CMove, the threshold is set to be approximately BlockLayoutMinDiamondPercentage, the reason is given: > > // BlockLayoutByFrequency optimization moves infrequent branch > // from hot path. No point in CMOV'ing in such case > > This sets the default value of the threshold to be around 18%, which is too conservative. The reason also does not make a lot of sense since the important property which makes jumping expensive is not code layout. We should remove this. > > Please kindly review, thank you very much. This pull request has now been integrated. 
Changeset: ac968c36 Author: Quan Anh Mai URL: https://git.openjdk.org/jdk/commit/ac968c36d7cc2e13270d28c9310178f6b654d7dc Stats: 74 lines in 2 files changed: 61 ins; 9 del; 4 mod 8319451: PhaseIdealLoop::conditional_move is too conservative Reviewed-by: redestad, thartmann, kvn ------------- PR: https://git.openjdk.org/jdk/pull/16524 From epeter at openjdk.org Tue Dec 19 11:56:26 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 19 Dec 2023 11:56:26 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v48] In-Reply-To: References: Message-ID: > I want to push this in JDK23. > After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). > > To calm your nerves: most of the changes are in auto-generated tests, and tests in general. > > **Context** > > `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). > > Alignment is split into two tasks: > - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. > - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width. > > **Problem** > > I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). > In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. > Thanks @fg1417 for confirming this! 
> Hence, we need to fix the alignment correctness checks. > > While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. > > **Problem Details** > > Reproducer: > > > static void test(short[] a, short[] b, short mask) { > for (int i = 0; i < RANGE; i+=8) { > // Problematic for AlignVector > b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 > > b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes > b[i+4] = (short)(a[i+4] & mask); > b[i+5] = (short)(a[i+5] & mask); > b[i+6] = (short)(a[i+6] & mask); > } > } > > > During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. > > This is problematic as shown in this example. We have references at index offset `0, 3, 4, 5, 6`, and by... 
Emanuel Peter has updated the pull request incrementally with three additional commits since the last revision: - fix whitespace - fix to last commit - Refactoring of AlignmentSolver ------------- Changes: - all: https://git.openjdk.org/jdk/pull/14785/files - new: https://git.openjdk.org/jdk/pull/14785/files/0ae53186..4acae448 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=47 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=46-47 Stats: 517 lines in 3 files changed: 277 ins; 70 del; 170 mod Patch: https://git.openjdk.org/jdk/pull/14785.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14785/head:pull/14785 PR: https://git.openjdk.org/jdk/pull/14785 From epeter at openjdk.org Tue Dec 19 11:56:28 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 19 Dec 2023 11:56:28 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v43] In-Reply-To: References: Message-ID: On Fri, 15 Dec 2023 10:46:53 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> renamings and proof improvement in adjust_pre_loop_limit_to_align_main_loop_vectors > > Thanks for addressing my other comments! I really like the new structure of having an `AlignmentSolution` interface with different alignment solution classes. I have some more comments but mostly fine tuning - it's already in a quite good shape and the comments really add a lot of benefit to understand the idea behind the code. We get there :-) @chhagedorn Thanks for your already very detailed review! I noticed that the whole section with the "constraints" was not very clear, it was too hand-wavy. I refactored the solution quite a bit, and even fixed a bug with `init` (there should never be a constraint because of it). I state the solution more precise now, and the constraints follow more directly out of it. 
I think the solution is now even intuitively understandable. ------------- PR Comment: https://git.openjdk.org/jdk/pull/14785#issuecomment-1862614106 From epeter at openjdk.org Tue Dec 19 11:56:29 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 19 Dec 2023 11:56:29 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v43] In-Reply-To: References: Message-ID: On Fri, 15 Dec 2023 10:33:49 GMT, Christian Hagedorn wrote: >> src/hotspot/share/opto/vectorization.cpp line 940: >> >>> 938: // hence we have to add a dependency for invar, and scale (pre_stride is the >>> 939: // same for all mem_refs in the loop). If there is no invariant, then we add >>> 940: // a dependency that there is no invariant. >> >> I found it a little difficult to understand this part. I had to jump back and forth between the different equations and also had to write down some things to verify that we indeed are not missing a dependency. Maybe we can be more explicit here. How about something like that? >> >> >> // Hence, the solution for pre_iter depends on: >> // - Always: pre_r and pre_q >> // - If an init is present, we have a dependency on it. But since init is fixed and given by the loop for all mem_refs >> // of all packs, we can drop it. >> // - If init is a constant, we do not have another dependency. >> // - If init is a non-constant, we have a dependency on scale since C_init = scale. We need to keep since it is not >> // given by the loop and could vary from pack to pack. We do not have to add another dependency since (5a*): >> // >> // C_init % abs(C_pre) = 0 <=> C_init = C_pre * X >> // >> // can only be satisfied if X is constant: >> // >> // C_init = C_pre * X >> // X = C_init / C_pre >> // X = scale / (scale * pre_stride) >> // X = 1 / pre_stride >> // >> // - If an invariant is present, we have a dependency on it which we need to keep since it is not given by the loop >> // and could vary from pack to pack. 
We additionally have to add another dependency on scale since (5b*): >> // >> // C_invar % abs(C_pre) = 0 <=> C_invar = C_pre * Y >> // >> // can only be satisfied if we have the following Y: >> // >> // C_invar = C_pre * Y >> // Y = (C_invar / C_pre) >> // Y = abs(invar_factor) / (scale * pre_stride) >> // >> // A dependency for pre_stride is not required as it is fixed and given by the loop for all mem_refs of all packs >> // (similar to init). > > Another thought: Should we assert that if init is a non-constant that `abs(stride) = 1` if there is a solution? I am refactoring this, to make the proof more tight and even correct a mistake. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1431296353 From epeter at openjdk.org Tue Dec 19 11:56:33 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 19 Dec 2023 11:56:33 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v40] In-Reply-To: References: Message-ID: On Fri, 8 Dec 2023 14:55:30 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> remove dead code > > src/hotspot/share/opto/vectorization.hpp line 572: > >> 570: >> 571: private: >> 572: #ifdef ASSERT > > Since the `Tracer` class in this file is `NOT_PRODUCT`, I suggest to also go with that here and where the trace methods are used. > Suggestion: > > #ifndef PRODUCT Actually, I want to go towards DEBUG/ASSERT only printing, and make the trace flags debug-only (they currently only have effects in debug anyway). I will refactor this in https://github.com/openjdk/jdk/pull/16620, and for now I will only add ASSERT guarded tracing. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1431298284 From epeter at openjdk.org Tue Dec 19 13:50:07 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 19 Dec 2023 13:50:07 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v48] In-Reply-To: References: Message-ID: On Tue, 19 Dec 2023 11:56:26 GMT, Emanuel Peter wrote: >> I want to push this in JDK23. >> After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). >> >> To calm your nerves: most of the changes are in auto-generated tests, and tests in general. >> >> **Context** >> >> `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). >> >> Alignment is split into two tasks: >> - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. >> - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width. >> >> **Problem** >> >> I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). >> In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. >> Thanks @fg1417 for confirming this! >> Hence, we need to fix the alignment correctness checks. 
>> >> While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. >> >> **Problem Details** >> >> Reproducer: >> >> >> static void test(short[] a, short[] b, short mask) { >> for (int i = 0; i < RANGE; i+=8) { >> // Problematic for AlignVector >> b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 >> >> b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes >> b[i+4] = (short)(a[i+4] & mask); >> b[i+5] = (short)(a[i+5] & mask); >> b[i+6] = (short)(a[i+6] & mask); >> } >> } >> >> >> During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. >> >> This is problemati... > > Emanuel Peter has updated the pull request incrementally with three additional commits since the last revision: > > - fix whitespace > - fix to last commit > - Refactoring of AlignmentSolver src/hotspot/share/opto/vectorization.cpp line 814: > 812: // limit. We decompose pre_iter: > 813: // > 814: // pre_iter = pre_iter_C_const + pre_iter_C_invar + pre_iter_C_init integer or fractional? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1431425693 From epeter at openjdk.org Tue Dec 19 14:31:17 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 19 Dec 2023 14:31:17 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v49] In-Reply-To: References: Message-ID: <0YNH5HQlDDHNCcByYQTpGsWEpSJFQjUzIzLdl6P2MPw=.25dddf0f-2511-40b6-889b-f8944b475e22@github.com> > I want to push this in JDK23. > After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). 
> > To calm your nerves: most of the changes are in auto-generated tests, and tests in general. > > **Context** > > `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). > > Alignment is split into two tasks: > - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. > - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width. > > **Problem** > > I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). > In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. > Thanks @fg1417 for confirming this! > Hence, we need to fix the alignment correctness checks. > > While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. > > **Problem Details** > > Reproducer: > > > static void test(short[] a, short[] b, short mask) { > for (int i = 0; i < RANGE; i+=8) { > // Problematic for AlignVector > b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 > > b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes > b[i+4] = (short)(a[i+4] & mask); > b[i+5] = (short)(a[i+5] & mask); > b[i+6] = (short)(a[i+6] & mask); > } > } > > > During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. 
For that, we look at each "group" of references (e.g. all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. > > This is problematic as shown in this example. We have references at index offset `0, 3, 4, 5, 6`, and by... Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: improve comments with fractional / integer question ------------- Changes: - all: https://git.openjdk.org/jdk/pull/14785/files - new: https://git.openjdk.org/jdk/pull/14785/files/4acae448..8fbac11d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=48 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=47-48 Stats: 10 lines in 1 file changed: 4 ins; 0 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/14785.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14785/head:pull/14785 PR: https://git.openjdk.org/jdk/pull/14785 From sgibbons at openjdk.org Tue Dec 19 18:46:22 2023 From: sgibbons at openjdk.org (Scott Gibbons) Date: Tue, 19 Dec 2023 18:46:22 GMT Subject: RFR: JDK-8321599 Data loss in AVX3 Base64 decoding [v2] In-Reply-To: References: Message-ID: > Fix for looking for padding characters within the encoded string. Was not adding start offset to length, so was looking at potentially freed or uninitialized memory. > > Tested tier1 and with the test case supplied with the JBS issue. > > The problem will only occur when all of the following are true: > 1. The source offset of the string to be decoded is != 0. > 2. The characters at the beginning of the string (minus the offset) plus the string length mod 64 are either "=" or "==". > 3. The string is >= 32 characters. > 4. The string is not MIME encoded. > > If any of these conditions are not met, the decode works as expected. This was due to omitting the source offset of the string when checking for padding characters.
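The class of bug described in the Base64 fix above can be sketched in plain Java. This is only an illustration under stated assumptions: `PaddingOffsetSketch` and both helper methods are hypothetical names, not JDK code, and the real fix lives in the AVX-512 intrinsic stub rather than in Java. The sketch shows why the end of the input must be computed as start offset plus length before looking for `=` padding.

```java
import java.nio.charset.StandardCharsets;

class PaddingOffsetSketch {
    // Hypothetical helper: counts up to two trailing '=' bytes in src[sp, sp + len).
    static int countPadding(byte[] src, int sp, int len) {
        int end = sp + len; // correct: the end index includes the start offset
        int pad = 0;
        while (pad < 2 && pad < len && src[end - 1 - pad] == '=') {
            pad++;
        }
        return pad;
    }

    // Buggy variant for comparison: ignores sp, so it inspects bytes that
    // do not belong to the tail of the requested range whenever sp != 0.
    static int countPaddingIgnoringOffset(byte[] src, int sp, int len) {
        int end = len; // wrong: start offset omitted
        int pad = 0;
        while (pad < 2 && pad < len && src[end - 1 - pad] == '=') {
            pad++;
        }
        return pad;
    }

    public static void main(String[] args) {
        // The range [4, 8) is "QUJD" and carries no padding, but the bytes
        // just before it happen to be "==".
        byte[] buf = "QQ==QUJD".getBytes(StandardCharsets.US_ASCII);
        System.out.println("with offset:    " + countPadding(buf, 4, 4));
        System.out.println("without offset: " + countPaddingIgnoringOffset(buf, 4, 4));
    }
}
```

With a non-zero offset the two variants disagree, which matches condition 1 in the list above: a zero source offset hides the problem.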
Scott Gibbons has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: - Merge branch 'openjdk:master' into Base64-fix - Merge branch 'Base64-fix' of https://github.com/asgibbons/jdk into Base64-fix - Merge branch 'openjdk:master' into Base64-fix - Added tests for proper length and padding checks - Fix for JDK-8321599 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17039/files - new: https://git.openjdk.org/jdk/pull/17039/files/ead0cd20..4a730d25 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17039&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17039&range=00-01 Stats: 15777 lines in 358 files changed: 12178 ins; 2320 del; 1279 mod Patch: https://git.openjdk.org/jdk/pull/17039.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17039/head:pull/17039 PR: https://git.openjdk.org/jdk/pull/17039 From sgibbons at openjdk.org Tue Dec 19 18:46:22 2023 From: sgibbons at openjdk.org (Scott Gibbons) Date: Tue, 19 Dec 2023 18:46:22 GMT Subject: RFR: JDK-8321599 Data loss in AVX3 Base64 decoding In-Reply-To: References: Message-ID: On Fri, 8 Dec 2023 20:56:52 GMT, Scott Gibbons wrote: > Fix for looking for padding characters within the encoded string. Was not adding start offset to length, so was looking at potentially freed or uninitialized memory. > > Tested tier1 and with the test case supplied with the JBS issue. > > The problem will only occur when all of the following are true: > 1. The source offset of the string to be decoded is != 0. > 2. The characters at the beginning of the string (minus the offset) plus the string length mod 64 are either "=" or "==". > 3. The string is >= 32 characters. > 4. The string is not MIME encoded. > > If any of these conditions are not met, the decode works as expected.
This was due to omitting the source offset of the string when checking for padding characters. Security issue has been resolved. Re-opening. ------------- PR Comment: https://git.openjdk.org/jdk/pull/17039#issuecomment-1863297432 From duke at openjdk.org Tue Dec 19 20:31:55 2023 From: duke at openjdk.org (Joshua Cao) Date: Tue, 19 Dec 2023 20:31:55 GMT Subject: RFR: 8322490: CastNode constructors accepts control node as input Message-ID: It is a common pattern to have: Node* n = new CastNode(...); n->set_req(control_node); We can modify the constructor to set the control node. It makes the code a little tidier. Passes tier1 locally on my Linux machine ------------- Commit messages: - 8322490: CastNode constructors accepts control node as input Changes: https://git.openjdk.org/jdk/pull/17162/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=17162&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8322490 Stats: 65 lines in 6 files changed: 1 ins; 28 del; 36 mod Patch: https://git.openjdk.org/jdk/pull/17162.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17162/head:pull/17162 PR: https://git.openjdk.org/jdk/pull/17162 From sgibbons at openjdk.org Tue Dec 19 23:03:02 2023 From: sgibbons at openjdk.org (Scott Gibbons) Date: Tue, 19 Dec 2023 23:03:02 GMT Subject: RFR: JDK-8321599 Data loss in AVX3 Base64 decoding [v3] In-Reply-To: References: Message-ID: > Fix for looking for padding characters within the encoded string. Was not adding start offset to length, so was looking at potentially freed or uninitialized memory. > > Tested tier1 and with the test case supplied with the JBS issue. > > The problem will only occur when all of the following are true: > 1. The source offset of the string to be decoded is != 0. > 2. The characters at the beginning of the string (minus the offset) plus the string length mod 64 are either "=" or "==". > 3. The string is >= 32 characters. > 4. The string is not MIME encoded.
> > If any of these conditions are not met, the decode works as expected. This was due to omitting the source offset of the string when checking for padding characters. Scott Gibbons has updated the pull request incrementally with one additional commit since the last revision: Added some comments to the test ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17039/files - new: https://git.openjdk.org/jdk/pull/17039/files/4a730d25..61b5de8a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17039&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17039&range=01-02 Stats: 11 lines in 1 file changed: 11 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/17039.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17039/head:pull/17039 PR: https://git.openjdk.org/jdk/pull/17039 From sgibbons at openjdk.org Tue Dec 19 23:26:51 2023 From: sgibbons at openjdk.org (Scott Gibbons) Date: Tue, 19 Dec 2023 23:26:51 GMT Subject: RFR: JDK-8321599 Data loss in AVX3 Base64 decoding [v4] In-Reply-To: References: Message-ID: > Fix for looking for padding characters within the encoded string. Was not adding start offset to length, so was looking at potentially freed or uninitialized memory. > > Tested tier1 and with the test case supplied with the JBS issue. > > The problem will only occur when all of the following are true: > 1. The source offset of the string to be decoded is != 0. > 2. The characters at the beginning of the string (minus the offset) plus the string length mod 64 are either "=" or "==". > 3. The string is >= 32 characters. > 4. The string is not MIME encoded. > > If any of these conditions are not met, the decode works as expected. This was due to omitting the source offset of the string when checking for padding characters. Scott Gibbons has updated the pull request incrementally with one additional commit since the last revision: Revert code size change - was for an experiment only.
------------- Changes: - all: https://git.openjdk.org/jdk/pull/17039/files - new: https://git.openjdk.org/jdk/pull/17039/files/61b5de8a..40e5cbce Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17039&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17039&range=02-03 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/17039.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17039/head:pull/17039 PR: https://git.openjdk.org/jdk/pull/17039 From sviswanathan at openjdk.org Tue Dec 19 23:59:39 2023 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Tue, 19 Dec 2023 23:59:39 GMT Subject: RFR: JDK-8321599 Data loss in AVX3 Base64 decoding [v4] In-Reply-To: References: Message-ID: <2OoHPxglxCJi7vVjjj3rVz_lPM4rSr0nmlG9A_j7Kz0=.a6b59ccc-1a52-4e60-99ce-6604e3c050a5@github.com> On Tue, 19 Dec 2023 23:26:51 GMT, Scott Gibbons wrote: >> Fix for looking for padding characters within the encoded string. Was not adding start offset to length, so was looking at potentially freed or uninitialized memory. >> >> Tested tier1 and with the test case supplied with the JBS issue. >> >> The problem will only occur when all of the following are true: >> 1. The source offset of the string to be decoded is != 0. >> 2. The characters at the beginning of the string (minus the offset) plus the string length mod 64 are either "=" or "==". >> 3. The string is >= 32 characters. >> 4. The string is not MIME encoded. >> >> If any of these conditions are not met, the decode works as expected. This was due to omitting the source offset of the string when checking for padding characters. > > Scott Gibbons has updated the pull request incrementally with one additional commit since the last revision: > > Revert code size change - was for an experiment only. The fix looks good to me. Could you please update the copyright year in TestBase64.java?
------------- PR Comment: https://git.openjdk.org/jdk/pull/17039#issuecomment-1863632268 From sgibbons at openjdk.org Wed Dec 20 00:09:01 2023 From: sgibbons at openjdk.org (Scott Gibbons) Date: Wed, 20 Dec 2023 00:09:01 GMT Subject: RFR: JDK-8321599 Data loss in AVX3 Base64 decoding [v5] In-Reply-To: References: Message-ID: > Fix for looking for padding characters within the encoded string. Was not adding start offset to length, so was looking at potentially freed or uninitialized memory. > > Tested tier1 and with the test case supplied with the JBS issue. > > The problem will only occur when all of the following are true: > 1. The source offset of the string to be decoded is != 0. > 2. The characters at the beginning of the string (minus the offset) plus the string length mod 64 are either "=" or "==". > 3. The string is >= 32 characters. > 4. The string is not MIME encoded. > > If any of these conditions are not met, the decode works as expected. This was due to omitting the source offset of the string when checking for padding characters.
Scott Gibbons has updated the pull request incrementally with two additional commits since the last revision: - Updated copyright year - Updated copyright year ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17039/files - new: https://git.openjdk.org/jdk/pull/17039/files/40e5cbce..f7d4705e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17039&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17039&range=03-04 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/17039.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17039/head:pull/17039 PR: https://git.openjdk.org/jdk/pull/17039 From sviswanathan at openjdk.org Wed Dec 20 00:09:01 2023 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Wed, 20 Dec 2023 00:09:01 GMT Subject: RFR: JDK-8321599 Data loss in AVX3 Base64 decoding [v5] In-Reply-To: References: Message-ID: On Wed, 20 Dec 2023 00:06:39 GMT, Scott Gibbons wrote: >> Fix for looking for padding characters within the encoded string. Was not adding start offset to length, so was looking at potentially freed or uninitialized memory. >> >> Tested tier1 and with the test case supplied with the JBS issue. >> >> The problem will only occur when all of the following are true: >> 1. The source offset of the string to be decoded is != 0. >> 2. The characters at the beginning of the string (minus the offset) plus the string length mod 64 are either "=" or "==". >> 3. The string is >= 32 characters. >> 4. The string is not MIME encoded. >> >> If any of these conditions are not met, the decode works as expected. This was due to omitting the source offset of the string when checking for padding characters. > > Scott Gibbons has updated the pull request incrementally with two additional commits since the last revision: > > - Updated copyright year > - Updated copyright year Marked as reviewed by sviswanathan (Reviewer).
------------- PR Review: https://git.openjdk.org/jdk/pull/17039#pullrequestreview-1789883368 From gcao at openjdk.org Wed Dec 20 02:33:54 2023 From: gcao at openjdk.org (Gui Cao) Date: Wed, 20 Dec 2023 02:33:54 GMT Subject: RFR: 8317721: RISC-V: Implement CRC32 intrinsic [v2] In-Reply-To: References: Message-ID: On Fri, 15 Dec 2023 11:49:50 GMT, ArsenyBochkarev wrote: >> Hi everyone! Please review this port of [AArch64](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp#L4224) `_updateBytesCRC32`, `_updateByteBufferCRC32` and `_updateCRC32` intrinsics. This patch introduces only the plain (non-vectorized, no Zbc) version. >> >> ### Correctness checks >> >> Tier 1/2 tests are ok. >> >> ### Performance results on T-Head board >> >> #### Results for enabled intrinsic: >> >> Used test is `test/micro/org/openjdk/bench/java/util/TestCRC32.java` >> >> | Benchmark | (count) | Mode | Cnt | Score | Error | Units | >> | --- | ---- | ----- | --- | ---- | --- | ---- | >> | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 24 | 3730.929 | 37.773 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 24 | 2126.673 | 2.032 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 24 | 1134.330 | 6.714 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 24 | 584.017 | 2.267 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 24 | 151.173 | 0.346 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 24 | 19.113 | 0.008 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 24 | 4.647 | 0.022 | ops/ms | >> >> #### Results for disabled intrinsic: >> >> | Benchmark | (count) | Mode | Cnt | Score | Error | Units | >> | --------------------------------------------------- | ---------- | --------- | ---- | ----------- | --------- | ---------- | >> | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 15 | 798.365 | 35.486 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 15 | 
677.756 | 46.619 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 15 | 552.781 | 27.143 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 15 | 429.304 | 12.518 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 15 | 166.738 | 0.935 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 15 | 25.060 | 0.034 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 15 | 6.196 | 0.030 | ops/ms | > > ArsenyBochkarev has updated the pull request incrementally with three additional commits since the last revision: > > - Use zero_extend instead of shifts where possible > - Use andn instead of notr + andr where possible > - Replace shNadd with one instruction in most cases src/hotspot/cpu/riscv/macroAssembler_riscv.cpp line 3717: > 3715: andi(tmp1, v, bits8); > 3716: shadd(tmp1, tmp1, table3, tmp2, 2); > 3717: Assembler::lwu(crc, tmp1, 0); Why not use `MacroAssembler::lwu` instead ? I see low difference in stub code emitted. Like: ``` diff diff --git a/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp b/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp index 06026b98bfa..eb9362ca531 100644 --- a/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp +++ b/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp @@ -3696,26 +3696,26 @@ void MacroAssembler::update_word_crc32(Register crc, Register v, Register tmp1, andi(tmp1, v, bits8); shadd(tmp1, tmp1, table3, tmp2, 2); - Assembler::lwu(crc, tmp1, 0); + lwu(crc, Address(tmp1, 0)); srli(tmp1, v, 6); andi(tmp1, tmp1, (bits8 << 2)); add(tmp1, tmp1, table2); - Assembler::lwu(tmp2, tmp1, 0); + lwu(tmp2, Address(tmp1, 0)); srli(tmp1, v, 14); xorr(crc, crc, tmp2); andi(tmp1, tmp1, (bits8 << 2)); add(tmp1, tmp1, table1); - Assembler::lwu(tmp2, tmp1, 0); + lwu(tmp2, Address(tmp1, 0)); srli(tmp1, v, 22); xorr(crc, crc, tmp2); andi(tmp1, tmp1, (bits8 << 2)); add(tmp1, tmp1, table0); - Assembler::lwu(tmp2, tmp1, 0); + lwu(tmp2, Address(tmp1, 0)); xorr(crc, crc, tmp2); } ------------- PR Review 
Comment: https://git.openjdk.org/jdk/pull/17046#discussion_r1432182715 From chagedorn at openjdk.org Wed Dec 20 07:44:46 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 20 Dec 2023 07:44:46 GMT Subject: RFR: 8322490: CastNode constructors accepts control node as input In-Reply-To: References: Message-ID: On Tue, 19 Dec 2023 20:27:06 GMT, Joshua Cao wrote: > It is a common pattern to have: > > > Node* n = new CastNode(...); > n->set_req(control_node); > > > We can modify the constructor to set the control node. It makes the code a little tidier. > > Passes tier1 locally on my Linux machine I think that's a good idea but you've missed some cases of creating new `CastIINodes` that could use the control-based constructor. Example (there are more cases when you search for "new CastIINode"): https://github.com/openjdk/jdk/blob/f7dc257a206d3104d6d24c2079ef1fe349368c49/src/hotspot/share/opto/loopTransform.cpp#L3164-L3166 src/hotspot/share/opto/castnode.cpp line 126: > 124: } > 125: > 126: Node* ConstraintCastNode::make_cast(int opcode, Node* c, Node* n, const Type* t, DependencyType dependency, I'm wondering if this method is still worth keeping as it would simply replace a "new CastXXNode" line which is just as expressive/readable. As far as I can see, it's currently only used with statically known Opcodes. So, we could replace them. ------------- Changes requested by chagedorn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/17162#pullrequestreview-1790330182 PR Review Comment: https://git.openjdk.org/jdk/pull/17162#discussion_r1432357733 From chagedorn at openjdk.org Wed Dec 20 07:44:48 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 20 Dec 2023 07:44:48 GMT Subject: RFR: 8322490: CastNode constructors accepts control node as input In-Reply-To: References: Message-ID: <5zIO68nQy5YxQs69531jlEBgLHxa3len3vHapEQrkGg=.77636ff7-1cbe-4234-9a91-a8128c778aad@github.com> On Wed, 20 Dec 2023 07:39:31 GMT, Christian Hagedorn wrote: >> It is a common pattern to have: >> >> >> Node* n = new CastNode(...); >> n->set_req(control_node); >> >> >> We can modify the constructor to set the control node. It makes the code a little tidier. >> >> Passes tier1 locally on my Linux machine > > src/hotspot/share/opto/castnode.cpp line 126: > >> 124: } >> 125: >> 126: Node* ConstraintCastNode::make_cast(int opcode, Node* c, Node* n, const Type* t, DependencyType dependency, > > I'm wondering if this method is still worth keeping as it would simply replace a "new CastXXNode" line which is just as expressive/readable. As far as I can see, it's currently only used with statically known Opcodes. So, we could replace them. On a separate note and might be worth to do in this change as well, we have `ConstraintCastNode::make_cast_for_type()` and `ConstraintCastNode::make()`. The intent of the former is clear but the latter switches on the provided basic type. Maybe we can rename the latter to `make_cast_for_basic_type()` to better align it with `make_cast_for_type()`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17162#discussion_r1432359469 From epeter at openjdk.org Wed Dec 20 10:23:20 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 20 Dec 2023 10:23:20 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v50] In-Reply-To: References: Message-ID: > I want to push this in JDK23. 
> After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). > > To calm your nerves: most of the changes are in auto-generated tests, and tests in general. > > **Context** > > `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). > > Alignment is split into two tasks: > - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. > - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width. > > **Problem** > > I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). > In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. > Thanks @fg1417 for confirming this! > Hence, we need to fix the alignment correctness checks. > > While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. 
> > **Problem Details** > > Reproducer: > > > static void test(short[] a, short[] b, short mask) { > for (int i = 0; i < RANGE; i+=8) { > // Problematic for AlignVector > b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 > > b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes > b[i+4] = (short)(a[i+4] & mask); > b[i+5] = (short)(a[i+5] & mask); > b[i+6] = (short)(a[i+6] & mask); > } > } > > > During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. > > This is problematic as shown in this example. We have references at index offset `0, 3, 4, 5, 6`, and by... Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: necessary and sufficient (3) <-> (4) ------------- Changes: - all: https://git.openjdk.org/jdk/pull/14785/files - new: https://git.openjdk.org/jdk/pull/14785/files/8fbac11d..39aaf95a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=49 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=48-49 Stats: 71 lines in 1 file changed: 49 ins; 6 del; 16 mod Patch: https://git.openjdk.org/jdk/pull/14785.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14785/head:pull/14785 PR: https://git.openjdk.org/jdk/pull/14785 From duke at openjdk.org Wed Dec 20 13:26:50 2023 From: duke at openjdk.org (ArsenyBochkarev) Date: Wed, 20 Dec 2023 13:26:50 GMT Subject: RFR: 8317721: RISC-V: Implement CRC32 intrinsic [v2] In-Reply-To: <-ESASip8OA080tvie3uCIlfszW2DvtPxGT_jri-6m4U=.e2800f29-650c-4e17-9c89-fcaae0627f8e@github.com> References: <-ESASip8OA080tvie3uCIlfszW2DvtPxGT_jri-6m4U=.e2800f29-650c-4e17-9c89-fcaae0627f8e@github.com> Message-ID: On Mon, 18 Dec 2023 00:58:33 GMT, Feilong Jiang wrote: >> 
ArsenyBochkarev has updated the pull request incrementally with three additional commits since the last revision: >> >> - Use zero_extend instead of shifts where possible >> - Use andn instead of notr + andr where possible >> - Replace shNadd with one instruction in most cases > > src/hotspot/cpu/riscv/c1_LIRAssembler_riscv.cpp line 1643: > >> 1641: __ zero_extend(crc, crc, 32); >> 1642: __ update_byte_crc32(crc, val, res); >> 1643: __ notr(res, crc); // ~crc > > Do you miss the `zero_extend(crc, crc, 32)`? As far as I can see this is unneeded, actually. I used the `test/hotspot/jtreg/compiler/codegen/CRCTest.java` test as a sanity check (it uses the `emit_updatecrc32` stub) and it's ok. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17046#discussion_r1432711074 From chagedorn at openjdk.org Wed Dec 20 14:39:05 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 20 Dec 2023 14:39:05 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v49] In-Reply-To: <0YNH5HQlDDHNCcByYQTpGsWEpSJFQjUzIzLdl6P2MPw=.25dddf0f-2511-40b6-889b-f8944b475e22@github.com> References: <0YNH5HQlDDHNCcByYQTpGsWEpSJFQjUzIzLdl6P2MPw=.25dddf0f-2511-40b6-889b-f8944b475e22@github.com> Message-ID: On Tue, 19 Dec 2023 14:31:17 GMT, Emanuel Peter wrote: >> I want to push this in JDK23. >> After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). >> >> To calm your nerves: most of the changes are in auto-generated tests, and tests in general. >> >> **Context** >> >> `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). 
>> >> Alignment is split into two tasks: >> - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. >> - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width. >> >> **Problem** >> >> I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). >> In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. >> Thanks @fg1417 for confirming this! >> Hence, we need to fix the alignment correctness checks. >> >> While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. >> >> **Problem Details** >> >> Reproducer: >> >> >> static void test(short[] a, short[] b, short mask) { >> for (int i = 0; i < RANGE; i+=8) { >> // Problematic for AlignVector >> b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 >> >> b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes >> b[i+4] = (short)(a[i+4] & mask); >> b[i+5] = (short)(a[i+5] & mask); >> b[i+6] = (short)(a[i+6] & mask); >> } >> } >> >> >> During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. >> >> This is problemati... 
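The reproducer quoted above can be checked by hand. The following is a minimal Java sketch, not HotSpot code: the class `AlignmentCheckSketch` and its methods are hypothetical, and it assumes 2-byte `short` elements, an 8-byte alignment width, and that the address of `a[0]` is itself 8-byte aligned. Under those assumptions it computes the address residue of the reference at index offset 0 and of the pack starting at index offset 3 for successive loop iterations.

```java
class AlignmentCheckSketch {
    static final int ELEMENT_SIZE = 2; // bytes per short element
    static final int ALIGN_WIDTH = 8;  // assumed alignment width in bytes

    // Byte distance of a[i + indexOffset] from a[i].
    static int byteOffset(int indexOffset) {
        return indexOffset * ELEMENT_SIZE;
    }

    // Address residue (mod ALIGN_WIDTH) of a[iter * stride + indexOffset],
    // assuming a[0] sits at an ALIGN_WIDTH-aligned address.
    static int residue(int indexOffset, int stride, int iter) {
        return Math.floorMod((iter * stride + indexOffset) * ELEMENT_SIZE, ALIGN_WIDTH);
    }

    public static void main(String[] args) {
        int stride = 8; // i += 8 in the reproducer
        for (int iter = 0; iter < 4; iter++) {
            System.out.println("iter " + iter
                + ": offset 0 residue = " + residue(0, stride, iter)
                + ", offset 3 residue = " + residue(3, stride, iter));
        }
    }
}
```

Because each iteration advances the address by 16 bytes, the residues never change: the reference at offset 0 stays aligned while the pack at offset 3 stays 6 bytes off. Accepting a whole group because its lowest-offset reference is alignable therefore says nothing about the other packs in that group.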
> > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > improve comments with fractional / integer question src/hotspot/share/opto/vectorization.cpp line 741: > 739: // main_iter: number of main-loop iterations (main_iter >= 0) > 740: // > 741: // In the following, we restate the simple form of the address expression, by first You might want to have "Simple" in upper case as you introduce it above as a term: "The Simple form of the address.." Suggestion: // In the following, we restate the Simple form of the address expression, by first src/hotspot/share/opto/vectorization.cpp line 756: > 754: // We describe the 6 terms: > 755: // 1) The "base" of the address is the address of a Java object (e.g. array), > 756: // and hence can be assumed to already be aw-aligned (base % aw = 0). IIRC, you've had an explanation of why that is the case in an earlier version somewhere above. But now that you've moved the definition of `_aw` to the constructor, it might be good to restate here that objects are at least `ObjectAlignmentInBytes` aligned and that aw is a power of two and at most `ObjectAlignmentInBytes` by definition which implies `base % aw = 0`. It might not be evidently clear otherwise. src/hotspot/share/opto/vectorization.cpp line 758: > 756: // and hence can be assumed to already be aw-aligned (base % aw = 0). > 757: // 2) The "C_const" term is the sum of all constant terms. This is "offset", > 758: // plus "init" if it is constant. Should we also mention here `scale` in case `init` is constant? Suggestion: // 2) The "C_const" term is the sum of all constant terms. This is "offset", // plus "scale * init" if it is constant. src/hotspot/share/opto/vectorization.cpp line 762: > 760: // and variable term. If there is no invariant, then "C_invar" is zero. > 761: // > 762: // invar = C_invar * var_invar (FAC_INVAR) I suggest to move `(FAC_INVAR)` and `(FAC_INIT`) a little bit closer to the left. 
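The review comment on line 756 above asks for the reasoning behind `base % aw = 0` to be restated. As a small sanity check of that argument, here is a Java sketch under stated assumptions: `BaseAlignmentSketch` and its methods are hypothetical names, and `objectAlignment` stands in for `ObjectAlignmentInBytes`. It verifies by brute force that any address that is a multiple of the object alignment is also a multiple of every power-of-two `aw` that does not exceed it.

```java
class BaseAlignmentSketch {
    // True if `base` is a multiple of `aw`.
    static boolean isAligned(long base, int aw) {
        return base % aw == 0;
    }

    // Checks every power-of-two aw <= objectAlignment against the first
    // maxMultiple + 1 multiples of objectAlignment.
    static boolean baseAlwaysAwAligned(int objectAlignment, int maxMultiple) {
        for (int aw = 1; aw <= objectAlignment; aw *= 2) {
            for (long k = 0; k <= maxMultiple; k++) {
                long base = k * objectAlignment;
                if (!isAligned(base, aw)) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(baseAlwaysAwAligned(8, 1000));
        // Both conditions matter: an aw that is not a power of two can fail.
        System.out.println(isAligned(8, 6));
    }
}
```

The power-of-two requirement is what makes the implication hold: if `objectAlignment` and `aw` are both powers of two and `aw <= objectAlignment`, then `aw` divides `objectAlignment`, and so it divides every multiple of it.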
src/hotspot/share/opto/vectorization.cpp line 827: > 825: // > 826: // While we can now attribute the (fractional) amount of iterations required for the C_const, > 827: // invar and init terms, this does not give us a way to align these terms independendly. Suggestion: // invar and init terms, this does not give us a way to align these terms independently. src/hotspot/share/opto/vectorization.hpp line 103: > 101: > 102: // Biggest detectable factor of the invariant. > 103: int invar_factor(); Was like that before but you could make them all `const`. src/hotspot/share/opto/vectorization.hpp line 575: > 573: _mem_ref( mem_ref), > 574: _vector_length( vector_length), > 575: _element_size( mem_ref->memory_size()), You should assert here that `mem_ref` is non-null. src/hotspot/share/opto/vectorization.hpp line 598: > 596: class EQ4 { > 597: private: > 598: const int _C_const; Nit: Members should be indented by two spaces. I then usually indent the `private/public` keyword with one space. src/hotspot/share/opto/vectorization.hpp line 630: > 628: int C_const_mod_abs_C_pre() const { return AlignmentSolution::mod(_C_const, abs(_C_pre)); } > 629: int C_invar_mod_abs_C_pre() const { return AlignmentSolution::mod(_C_invar, abs(_C_pre)); } > 630: int C_init__mod_abs_C_pre() const { return AlignmentSolution::mod(_C_init, abs(_C_pre)); } Two consecutive `_`: Suggestion: int C_init_mod_aw() const { return AlignmentSolution::mod(_C_init, _aw); } int C_const_mod_abs_C_pre() const { return AlignmentSolution::mod(_C_const, abs(_C_pre)); } int C_invar_mod_abs_C_pre() const { return AlignmentSolution::mod(_C_invar, abs(_C_pre)); } int C_init_mod_abs_C_pre() const { return AlignmentSolution::mod(_C_init, abs(_C_pre)); } src/hotspot/share/opto/vectorization.hpp line 635: > 633: #ifdef ASSERT > 634: void trace() const; > 635: const char* state_to_str(State s) const { Could be made static src/hotspot/share/opto/vectorization.hpp line 663: > 661: const int C_pre, > 662: const int q, > 663: 
const int r) const; I think you can remove `const` from the declaration here in the header file as it only has an effect in the definition in the source file. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1432382602 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1432387018 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1432392449 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1432442598 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1432443208 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1432376331 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1432367404 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1432366482 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1432369245 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1432370031 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1432371196 From chagedorn at openjdk.org Wed Dec 20 14:39:15 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 20 Dec 2023 14:39:15 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v50] In-Reply-To: References: Message-ID: On Wed, 20 Dec 2023 10:23:20 GMT, Emanuel Peter wrote: >> I want to push this in JDK23. >> After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). >> >> To calm your nerves: most of the changes are in auto-generated tests, and tests in general. >> >> **Context** >> >> `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. 
However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). >> >> Alignment is split into two tasks: >> - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. >> - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width. >> >> **Problem** >> >> I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). >> In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. >> Thanks @fg1417 for confirming this! >> Hence, we need to fix the alignment correctness checks. >> >> While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. >> >> **Problem Details** >> >> Reproducer: >> >> >> static void test(short[] a, short[] b, short mask) { >> for (int i = 0; i < RANGE; i+=8) { >> // Problematic for AlignVector >> b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 >> >> b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes >> b[i+4] = (short)(a[i+4] & mask); >> b[i+5] = (short)(a[i+5] & mask); >> b[i+6] = (short)(a[i+6] & mask); >> } >> } >> >> >> During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. >> >> This is problemati... 
> > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > necessary and sufficient (3) <-> (4) src/hotspot/share/opto/vectorization.cpp line 769: > 767: // > 768: // scale * init = C_init * var_init + scale * C_const_init (FAC_INIT) > 769: // C_init = (init is constant) ? 0 : (scale * init / var_init) Suggestion: // C_init = (init is constant) ? 0 : scale src/hotspot/share/opto/vectorization.cpp line 825: > 823: // (C_init * var_init + C_pre * pre_iter_C_init ) % aw = 0 (4c) > 824: // > 825: // We now prove that (4a, b, c) are sufficient as well as necessary go guarantee (3) Suggestion: // We now prove that (4a, b, c) are sufficient as well as necessary to guarantee (3) src/hotspot/share/opto/vectorization.cpp line 827: > 825: // We now prove that (4a, b, c) are sufficient as well as necessary go guarantee (3) > 826: // for any runtime value of var_invar and var_init (i.e. for any invar and init). > 827: // This tells us that the "strengthening" did not restrict the algorithm more than Suggestion: // This tells us that the "strengthening" does not restrict the algorithm more than src/hotspot/share/opto/vectorization.cpp line 836: > 834: // Adding up (4a, b, c): > 835: // > 836: // 0 The zero kinda looks lost. I suggest to align it like this: ``` 0 = ( C_const ... ... ) % aw = 0 = ( C_const ...) = ... src/hotspot/share/opto/vectorization.cpp line 906: > 904: // abs(C_pre) < aw AND C_const % abs(C_pre) != 0 > 905: // -> alignment has effect > 906: // -> But C_const cannot be aligned with C_pre -> empty As we have discussed offline, I suggest the following: // We look at (4a): // // abs(C_pre) >= aw // -> Since C_pre is a power of two, we have C_pre % aw = 0. Therefore, any multiple of C_pre // (i.e. choosing any value for pre_iter_C_Const) is also aw aligned. 
In this case, we can // only satisfy (4a) if C_Const is aw aligned: // // C_const % aw == 0: // -> (4a) has a trivial solution since we can choose any value for pre_iter_C_Const. // // C_const % aw != 0: // -> (4a) has an empty solution since no pre_iter_C_Const can achieve aw alignment. // // abs(C_pre) < aw: // -> Then for some x > 1: aw = abs(C_pre) * x since aw and C_pre are power of twos. // If (4a) holds, then the following also holds: // (C_const + C_pre * pre_iter_C_const) % abs(C_pre) <=> // (C_const ) % abs(C_pre) // // C_const % abs(C_pre) == 0: // -> pre_iter_C_const is chosen accordingly such that (4a) is satisfied with the given C_const value. // -> (4a) has a constrained solution. // // C_const % abs(C_pre) != 0: // -> Not "C_const % abs(C_pre) == 0" implies not (4a). Therefore, (4a) has an empty solution since no // pre_iter_C_Const can achieve aw alignment. src/hotspot/share/opto/vectorization.cpp line 985: > 983: // C_init = Z * abs(C_pre) ==> Z = C_init / abs(C_pre) (6c) > 984: // > 985: // Futher, we define: Suggestion: // Further, we define: src/hotspot/share/opto/vectorization.cpp line 1025: > 1023: // > 1024: // Having solved the equations using the division, we can re-substitute X, Y, and Z, and apply (FAC_INVAR) as > 1025: // well as (FAC_INIT): Suggestion: // well as (FAC_INIT). We use the fact that sign(x) == 1 / sign(x) and sign(x) * abs(x) == x: src/hotspot/share/opto/vectorization.cpp line 1044: > 1042: // pre_iter_C_invar = my2 * q (11b, no invar) > 1043: // > 1044: // If init is variable (i.e. C_init = scale * init / var_init): Suggestion: // If init is variable (i.e. 
C_init = scale, init = var_init): src/hotspot/share/opto/vectorization.cpp line 1048: > 1046: // pre_iter_C_init = mz2 * q - sign(C_pre) * Z * var_init > 1047: // = mz2 * q - sign(C_pre) * C_init * var_init / abs(C_pre) > 1048: // = mz2 * q - sign(C_pre) * scale * init / abs(C_pre) Suggestion: // = mz2 * q - sign(C_pre) * scale * init / abs(C_pre) src/hotspot/share/opto/vectorization.cpp line 1053: > 1051: // = mz2 * q - init / pre_stride (11c, variable init) > 1052: // > 1053: // If init is variable (i.e. C_init = 0 ==> Z = 0): Suggestion: // If init is constant (i.e. C_init = 0 ==> Z = 0): src/hotspot/share/opto/vectorization.cpp line 1070: > 1068: // [- init / pre_stride ] (align variable init term, if present) (12) > 1069: // > 1070: // We can still simply simplifiy this solution, with: Suggestion: // We can further simply this solution by introducing integer 0 <= r < q: src/hotspot/share/opto/vectorization.cpp line 1133: > 1131: // -> apply (8): q = aw / (abs(C_pre)) = aw / abs(scale * pre_stride) > 1132: // -> and hence: (scale * pre_stride * q) % aw = 0 > 1133: // -> all terms are cancled out Suggestion: // -> all terms are canceled out ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1432764090 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1432685117 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1432686749 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1432688473 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1432650703 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1432701526 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1432730252 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1432765450 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1432766121 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1432732903 PR Review 
Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1432770695 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1432788762 From epeter at openjdk.org Wed Dec 20 15:13:19 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 20 Dec 2023 15:13:19 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v51] In-Reply-To: References: Message-ID: <1tzO7gAtDi50soOc-hTAGPBD_utQ84Hf0e2INC2ODnk=.08f747b8-8574-48b2-9de9-a14fd9a9c534@github.com> > I want to push this in JDK23. > After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). > > To calm your nerves: most of the changes are in auto-generated tests, and tests in general. > > **Context** > > `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). > > Alignment is split into two tasks: > - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. > - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width. > > **Problem** > > I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). > In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. > Thanks @fg1417 for confirming this! > Hence, we need to fix the alignment correctness checks. 
> > While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. > > **Problem Details** > > Reproducer: > > > static void test(short[] a, short[] b, short mask) { > for (int i = 0; i < RANGE; i+=8) { > // Problematic for AlignVector > b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 > > b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes > b[i+4] = (short)(a[i+4] & mask); > b[i+5] = (short)(a[i+5] & mask); > b[i+6] = (short)(a[i+6] & mask); > } > } > > > During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. > > This is problematic as shown in this example. We have references at index offset `0, 3, 4, 5, 6`, and by... 
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: Apply suggestions from code review by Christian Thanks Christian Co-authored-by: Christian Hagedorn ------------- Changes: - all: https://git.openjdk.org/jdk/pull/14785/files - new: https://git.openjdk.org/jdk/pull/14785/files/39aaf95a..2ee10c38 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=50 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=49-50 Stats: 11 lines in 1 file changed: 0 ins; 0 del; 11 mod Patch: https://git.openjdk.org/jdk/pull/14785.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14785/head:pull/14785 PR: https://git.openjdk.org/jdk/pull/14785 From epeter at openjdk.org Wed Dec 20 15:13:22 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 20 Dec 2023 15:13:22 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v49] In-Reply-To: References: <0YNH5HQlDDHNCcByYQTpGsWEpSJFQjUzIzLdl6P2MPw=.25dddf0f-2511-40b6-889b-f8944b475e22@github.com> Message-ID: On Wed, 20 Dec 2023 07:55:30 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> improve comments with fractional / integer question > > src/hotspot/share/opto/vectorization.hpp line 663: > >> 661: const int C_pre, >> 662: const int q, >> 663: const int r) const; > > I think you can remove `const` from the declaration here in the header file as it only has an effect in the definition in the source file. I would rather be explicit with everything. 
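For readers following the `const` discussion: a small sketch of the language rule behind Christian's remark, using a hypothetical free function rather than the actual member in `vectorization.hpp`. Top-level `const` on by-value parameters is not part of the function type, so both declarations below name the same function; the qualifier only constrains the body of the definition.

```cpp
#include <type_traits>

// Declaration without top-level const on the by-value parameters.
int scaled_sum(int q, int r);

// Re-declaration with top-level const: this declares the very same function.
int scaled_sum(const int q, const int r);

// Definition: here const has an effect, q and r cannot be modified in the body.
int scaled_sum(const int q, const int r) {
  return 2 * q + r;
}

// The two spellings denote the same function type.
static_assert(std::is_same<int(int, int), int(const int, const int)>::value,
              "top-level const on by-value parameters is not part of the function type");
```

Keeping `const` in the header is therefore a readability choice, not a semantic one.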
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1432837848 From chagedorn at openjdk.org Wed Dec 20 15:33:09 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 20 Dec 2023 15:33:09 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v50] In-Reply-To: References: Message-ID: On Wed, 20 Dec 2023 10:23:20 GMT, Emanuel Peter wrote: >> I want to push this in JDK23. >> After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). >> >> To calm your nerves: most of the changes are in auto-generated tests, and tests in general. >> >> **Context** >> >> `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). >> >> Alignment is split into two tasks: >> - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. >> - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width. >> >> **Problem** >> >> I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). >> In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. >> Thanks @fg1417 for confirming this! >> Hence, we need to fix the alignment correctness checks. 
>> >> While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. >> >> **Problem Details** >> >> Reproducer: >> >> >> static void test(short[] a, short[] b, short mask) { >> for (int i = 0; i < RANGE; i+=8) { >> // Problematic for AlignVector >> b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 >> >> b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes >> b[i+4] = (short)(a[i+4] & mask); >> b[i+5] = (short)(a[i+5] & mask); >> b[i+6] = (short)(a[i+6] & mask); >> } >> } >> >> >> During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. >> >> This is problemati... > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > necessary and sufficient (3) <-> (4) src/hotspot/share/opto/vectorization.cpp line 1145: > 1143: > 1144: #ifdef ASSERT > 1145: void print_icon_or_idx(const Node* n) { Or coni: Suggestion: void print_con_or_idx(const Node* n) { src/hotspot/share/opto/vectorization.cpp line 1252: > 1250: tty->print_cr(" -> %s", state_to_str(eq4a_state())); > 1251: > 1252: tty->print_cr(" EQ(4a): (C_invar(%3d) * var_invar + C_pre(%d) * pre_iter_C_invar) %% aw(%d) = 0 (align invar term individually)", Suggestion: tty->print_cr(" EQ(4b): (C_invar(%3d) * var_invar + C_pre(%d) * pre_iter_C_invar) %% aw(%d) = 0 (align invar term individually)", src/hotspot/share/opto/vectorization.cpp line 1256: > 1254: tty->print_cr(" -> %s", state_to_str(eq4b_state())); > 1255: > 1256: tty->print_cr(" EQ(4a): (C_init( %3d) * var_init + C_pre(%d) * pre_iter_C_init ) %% aw(%d) = 0 (align init term individually)", Suggestion: tty->print_cr(" 
EQ(4c): (C_init( %3d) * var_init + C_pre(%d) * pre_iter_C_init ) %% aw(%d) = 0 (align init term individually)", src/hotspot/share/opto/vectorization.hpp line 422: > 420: // [- init / pre_stride ] > 421: // > 422: // Note: pre_stride and init are idential for all mem_refs in the loop. Suggestion: // Note: pre_stride and init are identical for all mem_refs in the loop. src/hotspot/share/opto/vectorization.hpp line 425: > 423: // > 424: // The init alignment term either does not exist for both mem_refs, or exists identically > 425: // for both. The init alignment term is thus triviall identical. Suggestion: // for both. The init alignment term is thus trivially identical. src/hotspot/share/opto/vectorization.hpp line 459: > 457: // > 458: // Since q1 and q2 are both powers of 2, and q1 <= q2, we know there > 459: // is an integer a: a * q1 = q1. Thus, it remains to check if there Suggestion: // is an integer a: a * q1 = q2. Thus, it remains to check if there src/hotspot/share/opto/vectorization.hpp line 484: > 482: > 483: // When strict alignment is required (e.g. -XX:+AlignVector), then we must ensure > 484: // that all vector memory accesses can be aligned. We acheive this alignment by Suggestion: // that all vector memory accesses can be aligned. 
We achieve this alignment by ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1432806809 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1432810574 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1432810700 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1432821557 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1432821841 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1432849340 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1432862529 From duke at openjdk.org Wed Dec 20 15:37:50 2023 From: duke at openjdk.org (ArsenyBochkarev) Date: Wed, 20 Dec 2023 15:37:50 GMT Subject: RFR: 8317721: RISC-V: Implement CRC32 intrinsic [v2] In-Reply-To: References: Message-ID: On Wed, 20 Dec 2023 02:30:40 GMT, Gui Cao wrote: >> ArsenyBochkarev has updated the pull request incrementally with three additional commits since the last revision: >> >> - Use zero_extend instead of shifts where possible >> - Use andn instead of notr + andr where possible >> - Replace shNadd with one instruction in most cases > > src/hotspot/cpu/riscv/macroAssembler_riscv.cpp line 3717: > >> 3715: andi(tmp1, v, bits8); >> 3716: shadd(tmp1, tmp1, table3, tmp2, 2); >> 3717: Assembler::lwu(crc, tmp1, 0); > > Why not use `MacroAssembler::lwu` instead ? I see no difference in stub code emitted. 
> Like: > ``` diff > diff --git a/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp b/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp > index 06026b98bfa..eb9362ca531 100644 > --- a/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp > +++ b/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp > @@ -3696,26 +3696,26 @@ void MacroAssembler::update_word_crc32(Register crc, Register v, Register tmp1, > > andi(tmp1, v, bits8); > shadd(tmp1, tmp1, table3, tmp2, 2); > - Assembler::lwu(crc, tmp1, 0); > + lwu(crc, Address(tmp1, 0)); > > srli(tmp1, v, 6); > andi(tmp1, tmp1, (bits8 << 2)); > add(tmp1, tmp1, table2); > - Assembler::lwu(tmp2, tmp1, 0); > + lwu(tmp2, Address(tmp1, 0)); > > srli(tmp1, v, 14); > xorr(crc, crc, tmp2); > > andi(tmp1, tmp1, (bits8 << 2)); > add(tmp1, tmp1, table1); > - Assembler::lwu(tmp2, tmp1, 0); > + lwu(tmp2, Address(tmp1, 0)); > > srli(tmp1, v, 22); > xorr(crc, crc, tmp2); > > andi(tmp1, tmp1, (bits8 << 2)); > add(tmp1, tmp1, table0); > - Assembler::lwu(tmp2, tmp1, 0); > + lwu(tmp2, Address(tmp1, 0)); > xorr(crc, crc, tmp2); > } When I tried `MacroAssembler::lwu` I got the following instructions on T-head: 0.47% ? 0x0000003fac6a8738: li t3,1 0.51% ? 0x0000003fac6a873a: slli t3,t3,0x20 0.00% ? 0x0000003fac6a873c: addi t3,t3,-1 ... 2.68% ? 0x0000003fac6a8752: lw a0,0(t1) 5.25% ? 0x0000003fac6a8756: and a0,a0,t3 ... ? 0x0000003fac6a876a: lw t4,0(t1) 1.78% ? 0x0000003fac6a876e: and t1,t4,t3 ... 0.49% ? 0x0000003fac6a8786: lw t4,0(t1) 2.62% ? 0x0000003fac6a878a: and t1,t4,t3 ... 0.41% ? 0x0000003fac6a87a2: lw t4,0(t1) 3.97% ? 0x0000003fac6a87a6: and t1,t4,t3 instead of just 4.52% ?? 0x0000003fb49e96f6: lwu a0,0(t1) ... ?? 0x0000003fb49e970a: lwu t3,0(t1) ... ?? 0x0000003fb49e9722: lwu t3,0(t1) ... 0.02% ?? 
0x0000003fb49e973a: lwu t3,0(t1) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17046#discussion_r1432869971 From sgibbons at openjdk.org Wed Dec 20 16:28:03 2023 From: sgibbons at openjdk.org (Scott Gibbons) Date: Wed, 20 Dec 2023 16:28:03 GMT Subject: RFR: JDK-8321599 Data loss in AVX3 Base64 decoding [v6] In-Reply-To: References: Message-ID: > Fix for looking for padding characters within the encoded string. Was not adding start offset to length, so was looking at potentially freed or uninitialized memory. > > Tested tier1 and with testcase supplied with JBS issue. > > The problem will only occur when all of the following are true: > 1. The source offset of the string to be decoded is != 0. > 2. The characters at the beginning of the string (minus the offset) plus the string length mod 64 are either "=" or "==". > 3. The string is >= 32 characters. > 4. The string is not MIME encoded. > > If any of these conditions are not met, the decode works as expected. This was due to omitting the source offset of the string when checking for padding characters. Scott Gibbons has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 10 additional commits since the last revision: - Merge branch 'openjdk:master' into Base64-fix - Updated copyright year - Updated copyright year - Revert code size change - wa for an experiment only.
- Added some comments to the test - Merge branch 'openjdk:master' into Base64-fix - Merge branch 'Base64-fix' of https://github.com/asgibbons/jdk into Base64-fix - Merge branch 'openjdk:master' into Base64-fix - Added tests for proper length and padding checks - Fix for JDK-8321599 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17039/files - new: https://git.openjdk.org/jdk/pull/17039/files/f7d4705e..ba60ac59 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17039&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17039&range=04-05 Stats: 136 lines in 10 files changed: 117 ins; 8 del; 11 mod Patch: https://git.openjdk.org/jdk/pull/17039.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17039/head:pull/17039 PR: https://git.openjdk.org/jdk/pull/17039 From rrich at openjdk.org Wed Dec 20 20:36:14 2023 From: rrich at openjdk.org (Richard Reingruber) Date: Wed, 20 Dec 2023 20:36:14 GMT Subject: RFR: 8322294: Cleanup NativePostCallNop Message-ID: <6LS57mCF2fgaosnyfnNydaqfT3cD3F42xsDOujG5SgY=.2db5f614-f64d-4fe4-8e68-1c06e70205d3@github.com> This is a refactoring/cleanup of `NativePostCallNop` that simplifies the ppc64 port. * `frame::get_oop_map()` is moved to shared code * encoding / decoding details of the oopmap slot and the CodeBlob offset are moved from shared code to the platform dependent implementations of `bool NativePostCallNop::patch(int32_t oopmap_slot, int32_t cb_offset)` and `bool NativePostCallNop::decode(int32_t& oopmap_slot, int32_t& cb_offset)` The change passed our CI testing. JTReg tests: tier1-4 of hotspot and jdk. All of Langtools and jaxp. SPECjvm2008, SPECjbb2015, Renaissance Suite, and SAP specific tests. All testing was done with fastdebug and release builds on the main platforms and also on Linux/PPC64le and AIX. 
------------- Commit messages: - 8322294: Cleanup NativePostCallNop Changes: https://git.openjdk.org/jdk/pull/17150/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=17150&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8322294 Stats: 203 lines in 30 files changed: 53 ins; 114 del; 36 mod Patch: https://git.openjdk.org/jdk/pull/17150.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17150/head:pull/17150 PR: https://git.openjdk.org/jdk/pull/17150 From rrich at openjdk.org Wed Dec 20 20:36:54 2023 From: rrich at openjdk.org (Richard Reingruber) Date: Wed, 20 Dec 2023 20:36:54 GMT Subject: RFR: 8290965: PPC64: Implement post-call NOPs Message-ID: #### Implementation of post call nops (PCNs) on ppc64. Depends on https://github.com/openjdk/jdk/pull/17150 About post call nops: - instruction(s) at return addresses of compiled java calls - emitted iff vm continuations are enabled to support virtual threads - encode data that can be used to find the corresponding CodeBlob and oop map faster - mt-safe patchable to trigger deoptimization Background: - Frames in continuation StackChunks are not visited if their compiled method is made not entrant (in contrast to frames on stack). Instead all PCNs of the compiled method are patched to trigger deoptimization when control returns to such frames. - With vm continuations, stacks are walked and inspected more frequently. This requires lookup of metadata like frame size and oop maps. As an optimization the offset of the CodeBlob to the PCN and the oop map slot are encoded as data in the PCN. Post call nops on ppc64 - 1 instruction, i.e. 4 bytes (either CMPI or CMPLI[1]) x86_64: 1 instruction, 8 bytes aarch64: 3 instructions, 12 bytes [1] 3.1.10 Fixed Point Compare Instructions in Power ISA 3.1B https://openpowerfoundation.org/specifications/isa/ - 26 bits data payload x86_64: 32 bits; aarch64: 32 bits - 9 bits dedicated to oop map slot.
With 8 bits there were cases with SPECjvm2008 where the slot could not be encoded (on ppc64 and x86_64). x86_64: 8 bits; aarch64: 8 bits - 17 bits dedicated to cb offset. Effectively 19 bits due to instruction alignment. x86_64: 24 bits; aarch64: 24 bits - Also used when reconstructing the back chain after thawing continuation frames (see `Thaw::patch_caller_links`) - Refactored frame constructors to make use of fast CodeBlob lookup based on PCNs. The fast lookup may only be used if the pc is known to be in the code cache because `CodeCache::find_blob_fast` can yield wrong results if it finds instructions outside the code cache that look just like PCNs. Callers of the frame class constructors need to pass `frame::kind::native` in that case to avoid errors. Other platforms don't make this explicit which is a problem in my eyes. Picking the wrong constructor can cause errors when porting and in future development. - Currently only the PCNs in nmethods are initialized. Therefore we don't even try to make a fast lookup based on PCNs if we know the CodeBlob is, e.g., a RuntimeStub. To achieve this we call the frame constructor passing `frame::kind::code_blob`.
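To make the payload split described above concrete (26 bits of data, 9 bits for the oop map slot, 17 bits for the cb offset, which cover 19 bits of byte offset because instructions are 4-byte aligned), here is a minimal sketch. The bit positions, names and helper functions are assumptions for illustration only; they are not the encoding used by the actual ppc64 `NativePostCallNop::patch` / `decode`, and the embedding of the payload into the CMPI/CMPLI instruction is left out.

```cpp
#include <cstdint>

// Hypothetical layout of the 26-bit payload (illustration only):
//   bits 25..17 : oop map slot                (9 bits)
//   bits 16..0  : cb offset in 4-byte units   (17 bits, i.e. 19 bits of byte offset)
constexpr int      kSlotBits   = 9;
constexpr int      kOffsetBits = 17;
constexpr uint32_t kSlotMax    = (1u << kSlotBits) - 1;
constexpr uint32_t kOffsetMax  = (1u << kOffsetBits) - 1;

// Packs both values into the payload. Returns false if a value does not fit,
// which corresponds to the "patch ... failure" rows in the statistics.
bool pcn_pack(int32_t oopmap_slot, int32_t cb_offset, uint32_t& payload) {
  if (oopmap_slot < 0 || cb_offset < 0) return false;
  if ((cb_offset & 3) != 0) return false;  // offset must be instruction aligned
  const uint32_t slot   = static_cast<uint32_t>(oopmap_slot);
  const uint32_t scaled = static_cast<uint32_t>(cb_offset) >> 2;
  if (slot > kSlotMax || scaled > kOffsetMax) return false;
  payload = (slot << kOffsetBits) | scaled;
  return true;
}

// Inverse of pcn_pack for a payload produced by it.
void pcn_unpack(uint32_t payload, int32_t& oopmap_slot, int32_t& cb_offset) {
  oopmap_slot = static_cast<int32_t>((payload >> kOffsetBits) & kSlotMax);
  cb_offset   = static_cast<int32_t>((payload & kOffsetMax) << 2);
}
```

Under this assumed layout the largest encodable slot is 511 and the largest encodable cb offset is 524284 bytes; anything beyond that has to fall back to the slow lookup.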
#### Statistics

| SpecJVM2008 compiler.compiler with fix iterations | ppc64le | x86_64 |
|---------------------------------------------------|---------|---------|
| PCN lookup success | 3715494 | 3410337 |
| PCN lookup failure | 220987 | 235436 |
| PCN decode success | 3660675 | 3320496 |
| PCN decode failure (C1) | 53539 | 46816 |
| PCN patch success | 63848 | 42310 |
| PCN patch cb offset failure | 0 | 0 |
| PCN patch oopmap slot failure | 0 | 298 |

| test/jdk/java/lang/Thread/virtual/stress/Skynet.java | ppc64le | x86_64 |
|------------------------------------------------------|-----------|-----------|
| PCN lookup success | 306955525 | 247185016 |
| PCN lookup failure | 500975 | 421098 |
| PCN decode success (C2) | 306951893 | 247181691 |
| PCN decode failure | 3168 | 59 |
| PCN patch success | 2080 | 2662 |
| PCN patch cb offset failure | 0 | 0 |
| PCN patch oopmap slot failure | 0 | 0 |

Comments

C1: We get decode failures even if patching always succeeded because not all PCNs are patched. Only PCNs in nmethods are actually patched. E.g. C2 runtime stubs like `_new_array_nozero_Java` have PCNs that are not patched.

C2: With Skynet.java there are 100x more PCN lookups. This is because it stresses virtual threads.

C2: With Skynet.java there are more PCN lookups on ppc64le. They originate from `Thaw::patch_caller_links`.

### Testing

The change passed our CI testing. JTReg tests: tier1-4 of hotspot and jdk. All of Langtools and jaxp. SPECjvm2008, SPECjbb2015, Renaissance Suite, and SAP specific tests. All testing was done with fastdebug and release builds on the main platforms and also on Linux/PPC64le and AIX.
------------- Depends on: https://git.openjdk.org/jdk/pull/17150 Commit messages: - 8290965: PPC64: Implement post-call NOPs Changes: https://git.openjdk.org/jdk/pull/17171/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=17171&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8290965 Stats: 133 lines in 13 files changed: 96 ins; 0 del; 37 mod Patch: https://git.openjdk.org/jdk/pull/17171.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17171/head:pull/17171 PR: https://git.openjdk.org/jdk/pull/17171 From duke at openjdk.org Thu Dec 21 00:54:55 2023 From: duke at openjdk.org (Joshua Cao) Date: Thu, 21 Dec 2023 00:54:55 GMT Subject: RFR: 8322490: CastNode constructors accepts control node as input In-Reply-To: <5zIO68nQy5YxQs69531jlEBgLHxa3len3vHapEQrkGg=.77636ff7-1cbe-4234-9a91-a8128c778aad@github.com> References: <5zIO68nQy5YxQs69531jlEBgLHxa3len3vHapEQrkGg=.77636ff7-1cbe-4234-9a91-a8128c778aad@github.com> Message-ID: On Wed, 20 Dec 2023 07:41:44 GMT, Christian Hagedorn wrote: >> src/hotspot/share/opto/castnode.cpp line 126: >> >>> 124: } >>> 125: >>> 126: Node* ConstraintCastNode::make_cast(int opcode, Node* c, Node* n, const Type* t, DependencyType dependency, >> >> I'm wondering if this method is still worth keeping as it would simply replace a "new CastXXNode" line which is just as expressive/readable. As far as I can see, it's currently only used with statically known Opcodes. So, we could replace them. > > On a separate note and might be worth to do in this change as well, we have `ConstraintCastNode::make_cast_for_type()` and `ConstraintCastNode::make()`. The intent of the former is clear but the latter switches on the provided basic type. Maybe we can rename the latter to `make_cast_for_basic_type()` to better align it with `make_cast_for_type()`. Agree that `make_cast` probably is not needed. And your renaming suggestion makes sense. Can we have this in a separate PR? I generally prefer to keep PRs as small of a unit as possible. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17162#discussion_r1433317682 From duke at openjdk.org Thu Dec 21 05:23:11 2023 From: duke at openjdk.org (Joshua Cao) Date: Thu, 21 Dec 2023 05:23:11 GMT Subject: RFR: 8322490: CastNode constructors accepts control node as input [v2] In-Reply-To: References: Message-ID: > It is a common pattern to have: > > > Node* n = new CastNode(...); > n->set_req(control_node); > > > We can modify the constructor to set the control node. It makes the code a little tidier. > > Passes tier1 locally on my Linux machine Joshua Cao has updated the pull request incrementally with one additional commit since the last revision: Convert some CastIINode instantiations to use the constructor with ctrl node ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17162/files - new: https://git.openjdk.org/jdk/pull/17162/files/3c35568f..d9a40fd7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17162&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17162&range=00-01 Stats: 10 lines in 4 files changed: 0 ins; 5 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/17162.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17162/head:pull/17162 PR: https://git.openjdk.org/jdk/pull/17162 From chagedorn at openjdk.org Thu Dec 21 07:06:46 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 21 Dec 2023 07:06:46 GMT Subject: RFR: 8322490: CastNode constructors accepts control node as input [v2] In-Reply-To: References: <5zIO68nQy5YxQs69531jlEBgLHxa3len3vHapEQrkGg=.77636ff7-1cbe-4234-9a91-a8128c778aad@github.com> Message-ID: On Thu, 21 Dec 2023 00:51:35 GMT, Joshua Cao wrote: >> On a separate note and might be worth to do in this change as well, we have `ConstraintCastNode::make_cast_for_type()` and `ConstraintCastNode::make()`. The intent of the former is clear but the latter switches on the provided basic type. 
Maybe we can rename the latter to `make_cast_for_basic_type()` to better align it with `make_cast_for_type()`. > > Agree that `make_cast` probably is not needed. And your renaming suggestion makes sense. > > Can we have this in a separate PR? I generally prefer to keep PRs as small of a unit as possible. If you change the title of the PR into something like "cleanup CastNode construction" then I guess it's fine to do it all in once. But I leave it up to you to decide if you want to split it or not. Both is fine. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17162#discussion_r1433629316 From chagedorn at openjdk.org Thu Dec 21 07:09:42 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 21 Dec 2023 07:09:42 GMT Subject: RFR: 8322490: CastNode constructors accepts control node as input [v2] In-Reply-To: References: Message-ID: On Thu, 21 Dec 2023 05:23:11 GMT, Joshua Cao wrote: >> It is a common pattern to have: >> >> >> Node* n = new CastNode(...); >> n->set_req(control_node); >> >> >> We can modify the constructor to set the control node. It makes the code a little tidier. >> >> Passes tier1 locally on my Linux machine > > Joshua Cao has updated the pull request incrementally with one additional commit since the last revision: > > Convert some CastIINode instantiations to use the constructor with ctrl > node src/hotspot/share/opto/compile.cpp line 4478: > 4476: // node from floating above the range check during loop optimizations. Otherwise, the > 4477: // ConvI2L node may be eliminated independently of the range check, causing the data path > 4478: // to become TOP while the control path is still there (although it's unreachable). 
Might be cleaner to move this comment block above the creation of the cast node ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17162#discussion_r1433631487 From chagedorn at openjdk.org Thu Dec 21 07:12:39 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 21 Dec 2023 07:12:39 GMT Subject: RFR: 8322490: CastNode constructors accepts control node as input [v2] In-Reply-To: References: Message-ID: On Thu, 21 Dec 2023 05:23:11 GMT, Joshua Cao wrote: >> It is a common pattern to have: >> >> >> Node* n = new CastNode(...); >> n->set_req(control_node); >> >> >> We can modify the constructor to set the control node. It makes the code a little tidier. >> >> Passes tier1 locally on my Linux machine > > Joshua Cao has updated the pull request incrementally with one additional commit since the last revision: > > Convert some CastIINode instantiations to use the constructor with ctrl > node Otherwise, the cleanup looks good. ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/17162#pullrequestreview-1792341125 From duke at openjdk.org Thu Dec 21 09:17:16 2023 From: duke at openjdk.org (Joshua Cao) Date: Thu, 21 Dec 2023 09:17:16 GMT Subject: RFR: 8322490: CastNode constructors accepts control node as input [v3] In-Reply-To: References: Message-ID: > It is a common pattern to have: > > > Node* n = new CastNode(...); > n->set_req(control_node); > > > We can modify the constructor to set the control node. It makes the code a little tidier. 
> > Passes tier1 locally on my Linux machine Joshua Cao has updated the pull request incrementally with one additional commit since the last revision: Move comment ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17162/files - new: https://git.openjdk.org/jdk/pull/17162/files/d9a40fd7..69c796b6 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17162&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17162&range=01-02 Stats: 2 lines in 1 file changed: 1 ins; 1 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/17162.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17162/head:pull/17162 PR: https://git.openjdk.org/jdk/pull/17162 From mli at openjdk.org Thu Dec 21 11:40:52 2023 From: mli at openjdk.org (Hamlin Li) Date: Thu, 21 Dec 2023 11:40:52 GMT Subject: RFR: 8317721: RISC-V: Implement CRC32 intrinsic [v2] In-Reply-To: References: Message-ID: <9DaZ8Dup4ZpiZBjtoGl9KTyHJZXQPuk0ux6oVa3jBLo=.10fda31c-ae9c-4d66-86b4-595996da5b56@github.com> On Fri, 15 Dec 2023 11:49:50 GMT, ArsenyBochkarev wrote: >> Hi everyone! Please review this port of [AArch64](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp#L4224) `_updateBytesCRC32`, `_updateByteBufferCRC32` and `_updateCRC32` intrinsics. This patch introduces only the plain (non-vectorized, no Zbc) version. >> >> ### Correctness checks >> >> Tier 1/2 tests are ok. 
>>
>> ### Performance results on T-Head board
>>
>> #### Results for enabled intrinsic:
>>
>> Used test is `test/micro/org/openjdk/bench/java/util/TestCRC32.java`
>>
>> | Benchmark | (count) | Mode | Cnt | Score | Error | Units |
>> | --- | ---- | ----- | --- | ---- | --- | ---- |
>> | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 24 | 3730.929 | 37.773 | ops/ms |
>> | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 24 | 2126.673 | 2.032 | ops/ms |
>> | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 24 | 1134.330 | 6.714 | ops/ms |
>> | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 24 | 584.017 | 2.267 | ops/ms |
>> | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 24 | 151.173 | 0.346 | ops/ms |
>> | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 24 | 19.113 | 0.008 | ops/ms |
>> | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 24 | 4.647 | 0.022 | ops/ms |
>>
>> #### Results for disabled intrinsic:
>>
>> | Benchmark | (count) | Mode | Cnt | Score | Error | Units |
>> | --- | ---- | ----- | --- | ---- | --- | ---- |
>> | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 15 | 798.365 | 35.486 | ops/ms |
>> | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 15 | 677.756 | 46.619 | ops/ms |
>> | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 15 | 552.781 | 27.143 | ops/ms |
>> | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 15 | 429.304 | 12.518 | ops/ms |
>> | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 15 | 166.738 | 0.935 | ops/ms |
>> | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 15 | 25.060 | 0.034 | ops/ms |
>> | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 15 | 6.196 | 0.030 | ops/ms |
>
> ArsenyBochkarev has updated the pull request incrementally with three additional commits since the last revision:
>
> - Use zero_extend instead of shifts where possible
> - Use andn instead of notr + andr where possible
> - Replace shNadd with one instruction in most cases

Some minor comments

src/hotspot/cpu/riscv/c1_LIRGenerator_riscv.cpp line 775:

> 773:
> 774: void LIRGenerator::do_update_CRC32(Intrinsic* x) {
> 775: assert(UseCRC32Intrinsics, "why are we here?");

I suppose the performance data is for the C2 intrinsic; is there performance data for C1?

src/hotspot/cpu/riscv/macroAssembler_riscv.cpp line 3719:

> 3717: Assembler::lwu(crc, tmp1, 0);
> 3718:
> 3719: srli(tmp1, v, 6);

The comment at the beginning is `crc = table3[v&0xff]^table2[(v>>8)&0xff]^table1[(v>>16)&0xff]^table0[v>>24]`, i.e. shifts of v are `6/14/22`, but in the code here shift is `6/14/22` + `bits8 << 2`, is this intended? Can you add some comments here?

src/hotspot/cpu/riscv/macroAssembler_riscv.cpp line 3760:

> 3758: add(table1, table0, 1*256*sizeof(juint), tmp);
> 3759: add(table2, table0, 2*256*sizeof(juint), tmp);
> 3760: add(table3, table0, 3*256*sizeof(juint), tmp);

With `add(table3, table2, 256*sizeof(juint))`, it might save one instruction.

src/hotspot/cpu/riscv/macroAssembler_riscv.cpp line 3762:

> 3760: add(table3, table0, 3*256*sizeof(juint), tmp);
> 3761:
> 3762: bind(L_by16);

Seems `L_by16` is not necessary.

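For readers following the table-lookup question, here is a minimal Java sketch of the four-table ("slice-by-4") CRC32 update that the quoted formula describes. It is not the RISC-V stub. The class `Crc32SliceBy4Sketch` and its helper names are hypothetical, and the table construction is the standard reflected CRC-32 one. The helper `scaledByteOffset` shows one plausible reading of the shift question, stated here as an assumption about the stub rather than a confirmed fact: with 4-byte table entries, the byte offset `((v >> 8) & 0xff) * 4` equals `(v >> 6) & (0xff << 2)`, so a shift by 6 combined with a mask of `0xff << 2` addresses the same entry.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative slice-by-4 CRC32; not the intrinsic from the patch. */
final class Crc32SliceBy4Sketch {
    private static final int[][] TABLES = buildTables();

    /** Builds table0 (classic byte table) and table1..table3 derived from it. */
    private static int[][] buildTables() {
        int[][] t = new int[4][256];
        for (int i = 0; i < 256; i++) {
            int c = i;
            for (int bit = 0; bit < 8; bit++) {
                c = ((c & 1) != 0) ? (c >>> 1) ^ 0xEDB88320 : (c >>> 1);
            }
            t[0][i] = c;
        }
        for (int i = 0; i < 256; i++) {
            for (int k = 1; k < 4; k++) {
                int prev = t[k - 1][i];
                t[k][i] = (prev >>> 8) ^ t[0][prev & 0xff];
            }
        }
        return t;
    }

    /** Byte offset into a table of 4-byte entries, computed with a reduced shift. */
    static int scaledByteOffset(int v, int shift) {
        return (v >>> (shift - 2)) & (0xff << 2);
    }

    /** Updates a CRC32 value (same convention as java.util.zip.CRC32) with data. */
    static int update(int crc, byte[] data) {
        int c = ~crc;
        int i = 0;
        // Consume four bytes at a time, assembled as a little-endian word.
        for (; i + 4 <= data.length; i += 4) {
            int word = (data[i] & 0xff)
                     | ((data[i + 1] & 0xff) << 8)
                     | ((data[i + 2] & 0xff) << 16)
                     | ((data[i + 3] & 0xff) << 24);
            int v = c ^ word;
            c = TABLES[3][v & 0xff]
              ^ TABLES[2][(v >>> 8) & 0xff]
              ^ TABLES[1][(v >>> 16) & 0xff]
              ^ TABLES[0][v >>> 24];
        }
        // Remaining bytes use the classic one-table update.
        for (; i < data.length; i++) {
            c = TABLES[0][(c ^ data[i]) & 0xff] ^ (c >>> 8);
        }
        return ~c;
    }

    public static void main(String[] args) {
        byte[] input = "123456789".getBytes(StandardCharsets.US_ASCII);
        System.out.println(Integer.toHexString(update(0, input)));
    }
}
```

The result can be cross-checked against `java.util.zip.CRC32` on the same input; for the ASCII string `123456789` the standard CRC-32 check value is `0xCBF43926`.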
------------- PR Review: https://git.openjdk.org/jdk/pull/17046#pullrequestreview-1792785890 PR Review Comment: https://git.openjdk.org/jdk/pull/17046#discussion_r1433950783 PR Review Comment: https://git.openjdk.org/jdk/pull/17046#discussion_r1433950997 PR Review Comment: https://git.openjdk.org/jdk/pull/17046#discussion_r1433951041 PR Review Comment: https://git.openjdk.org/jdk/pull/17046#discussion_r1433951122 From mli at openjdk.org Thu Dec 21 11:40:53 2023 From: mli at openjdk.org (Hamlin Li) Date: Thu, 21 Dec 2023 11:40:53 GMT Subject: RFR: 8317721: RISC-V: Implement CRC32 intrinsic [v2] In-Reply-To: References: <-ESASip8OA080tvie3uCIlfszW2DvtPxGT_jri-6m4U=.e2800f29-650c-4e17-9c89-fcaae0627f8e@github.com> Message-ID: On Wed, 20 Dec 2023 13:24:23 GMT, ArsenyBochkarev wrote: >> src/hotspot/cpu/riscv/c1_LIRAssembler_riscv.cpp line 1643: >> >>> 1641: __ zero_extend(crc, crc, 32); >>> 1642: __ update_byte_crc32(crc, val, res); >>> 1643: __ notr(res, crc); // ~crc >> >> Do you miss the `zero_extend(crc, crc, 32)`? > > As far as I can see this is unneeded, actually. I used the `test/hotspot/jtreg/compiler/codegen/CRCTest.java` test as a sanity check (it uses the `emit_updatecrc32` stub) and it's ok. > Do you miss the zero_extend(crc, crc, 32)? seems not, it needs 32 bits only. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17046#discussion_r1433956324 From fjiang at openjdk.org Thu Dec 21 12:59:50 2023 From: fjiang at openjdk.org (Feilong Jiang) Date: Thu, 21 Dec 2023 12:59:50 GMT Subject: RFR: 8317721: RISC-V: Implement CRC32 intrinsic [v2] In-Reply-To: References: Message-ID: <5CP-jGXy0kE9XAAcxu25cc5ZIZCiXoPgvV8ZYJrQ6hw=.6e9eb5b6-d31e-4a4e-b18f-3b997eb79549@github.com> On Fri, 15 Dec 2023 11:49:50 GMT, ArsenyBochkarev wrote: >> Hi everyone! 
Please review this port of [AArch64](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp#L4224) `_updateBytesCRC32`, `_updateByteBufferCRC32` and `_updateCRC32` intrinsics. This patch introduces only the plain (non-vectorized, no Zbc) version. >> >> ### Correctness checks >> >> Tier 1/2 tests are ok. >> >> ### Performance results on T-Head board >> >> #### Results for enabled intrinsic: >> >> Used test is `test/micro/org/openjdk/bench/java/util/TestCRC32.java` >> >> | Benchmark | (count) | Mode | Cnt | Score | Error | Units | >> | --- | ---- | ----- | --- | ---- | --- | ---- | >> | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 24 | 3730.929 | 37.773 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 24 | 2126.673 | 2.032 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 24 | 1134.330 | 6.714 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 24 | 584.017 | 2.267 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 24 | 151.173 | 0.346 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 24 | 19.113 | 0.008 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 24 | 4.647 | 0.022 | ops/ms | >> >> #### Results for disabled intrinsic: >> >> | Benchmark | (count) | Mode | Cnt | Score | Error | Units | >> | --------------------------------------------------- | ---------- | --------- | ---- | ----------- | --------- | ---------- | >> | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 15 | 798.365 | 35.486 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 15 | 677.756 | 46.619 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 15 | 552.781 | 27.143 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 15 | 429.304 | 12.518 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 15 | 166.738 | 0.935 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 15 | 25.060 | 0.034 | ops/ms | >> | 
CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 15 | 6.196 | 0.030 | ops/ms | > > ArsenyBochkarev has updated the pull request incrementally with three additional commits since the last revision: > > - Use zero_extend instead of shifts where possible > - Use andn instead of notr + andr where possible > - Replace shNadd with one instruction in most cases src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 4637: > 4635: * > 4636: * Output: > 4637: * rax - int crc result should be `a0` or `x10`? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17046#discussion_r1434045300 From chagedorn at openjdk.org Thu Dec 21 14:07:10 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 21 Dec 2023 14:07:10 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v51] In-Reply-To: <1tzO7gAtDi50soOc-hTAGPBD_utQ84Hf0e2INC2ODnk=.08f747b8-8574-48b2-9de9-a14fd9a9c534@github.com> References: <1tzO7gAtDi50soOc-hTAGPBD_utQ84Hf0e2INC2ODnk=.08f747b8-8574-48b2-9de9-a14fd9a9c534@github.com> Message-ID: On Wed, 20 Dec 2023 15:13:19 GMT, Emanuel Peter wrote: >> I want to push this in JDK23. >> After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). >> >> To calm your nerves: most of the changes are in auto-generated tests, and tests in general. >> >> **Context** >> >> `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). >> >> Alignment is split into two tasks: >> - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. 
We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. >> - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width. >> >> **Problem** >> >> I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). >> In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. >> Thanks @fg1417 for confirming this! >> Hence, we need to fix the alignment correctness checks. >> >> While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. >> >> **Problem Details** >> >> Reproducer: >> >> >> static void test(short[] a, short[] b, short mask) { >> for (int i = 0; i < RANGE; i+=8) { >> // Problematic for AlignVector >> b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 >> >> b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes >> b[i+4] = (short)(a[i+4] & mask); >> b[i+5] = (short)(a[i+5] & mask); >> b[i+6] = (short)(a[i+6] & mask); >> } >> } >> >> >> During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. >> >> This is problemati... > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > Apply suggestions from code review by Christian > > Thanks Christian > > Co-authored-by: Christian Hagedorn src/hotspot/share/opto/superword.cpp line 3465: > 3463: } > 3464: > 3465: // Ensure that the main loop vectors are aligned by adjusting the pre loop limit. 
We memory align Suggestion: // Ensure that the main loop vectors are aligned by adjusting the pre loop limit. We memory-align src/hotspot/share/opto/superword.cpp line 3491: > 3489: // For the main-loop, we want the address of align_to_ref to be memory aligned > 3490: // with some alignment width (aw, a power of 2). When we enter the main-loop, > 3491: // we know that iv is equals to the pre-loop limit. If we adjust the pre-loop Suggestion: // we know that iv is equal to the pre-loop limit. If we adjust the pre-loop src/hotspot/share/opto/superword.cpp line 3540: > 3538: // alignment is required (i.e. -XX:+AlignVector), this is guaranteed by the filtering > 3539: // done with the AlignmentSolver / AlignmentSolution. If strict alignment is not > 3540: // required, then alignment is still preferrable for performance, but not necessary. Suggestion: // required, then alignment is still preferable for performance, but not necessary. src/hotspot/share/opto/superword.cpp line 3557: > 3555: // where: sign(scale) = scale / abs(scale) = (scale > 0 ? 1 : -1) > 3556: // > 3557: // Note, (9) allows for periodic solutons of adjust_pre_iter, with periodicity AW. Suggestion: // Note, (9) allows for periodic solutions of adjust_pre_iter, with periodicity AW. src/hotspot/share/opto/vectorization.cpp line 706: > 704: } > 705: #endif > 706: Remove extra new line: Suggestion: src/hotspot/share/opto/vectorization.hpp line 370: > 368: const int _r = 0; > 369: const Node* _invar = nullptr; > 370: const int _scale = 0; You don't need to set an initial value here as you always set the fields in the constructor. src/hotspot/share/opto/vectorization.hpp line 516: > 514: // > 515: // For each vector memory access, we can find the set of pre_iter (number of pre-loop > 516: // iterations) which would align its address. The AlignmentSolver finds such a Suggestion: // iterations) which would align its address. 
The AlignmentSolver finds such an src/hotspot/share/opto/vectornode.hpp line 1042: > 1040: }; > 1041: > 1042: // Verify that memory address (adr) is alignemd. The mask specifies the Suggestion: // Verify that memory address (adr) is aligned. The mask specifies the src/hotspot/share/opto/vectornode.hpp line 1056: > 1054: public: > 1055: VerifyVectorAlignmentNode(Node* adr, Node* mask) : Node(nullptr, adr, mask) {} > 1056: public: Repeated and can be removed: Suggestion: ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1433802873 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1433804787 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1433884021 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1433942138 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1433783372 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1433789388 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1433794623 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1433799981 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1433800962 From epeter at openjdk.org Thu Dec 21 14:14:20 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 21 Dec 2023 14:14:20 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v52] In-Reply-To: References: Message-ID: > I want to push this in JDK23. > After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). > > To calm your nerves: most of the changes are in auto-generated tests, and tests in general. > > **Context** > > `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. 
However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). > > Alignment is split into two tasks: > - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. > - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width. > > **Problem** > > I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). > In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. > Thanks @fg1417 for confirming this! > Hence, we need to fix the alignment correctness checks. > > While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. > > **Problem Details** > > Reproducer: > > > static void test(short[] a, short[] b, short mask) { > for (int i = 0; i < RANGE; i+=8) { > // Problematic for AlignVector > b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 > > b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes > b[i+4] = (short)(a[i+4] & mask); > b[i+5] = (short)(a[i+5] & mask); > b[i+6] = (short)(a[i+6] & mask); > } > } > > > During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. > > This is problematic as shown in this example. We have references at index offset `0, 3, 4, 5, 6`, and by... 
Emanuel Peter has updated the pull request incrementally with five additional commits since the last revision: - improve case analysis empty / constrained / trivial - for Christian: fix base alignment and FAC_* whitespace - for Christian: made some methods const - for Christian: assert mem_ref not nullptr - Christian's suggestions for EQ4 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/14785/files - new: https://git.openjdk.org/jdk/pull/14785/files/2ee10c38..40833a89 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=51 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=50-51 Stats: 127 lines in 2 files changed: 60 ins; 4 del; 63 mod Patch: https://git.openjdk.org/jdk/pull/14785.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14785/head:pull/14785 PR: https://git.openjdk.org/jdk/pull/14785 From epeter at openjdk.org Thu Dec 21 14:14:20 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 21 Dec 2023 14:14:20 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v50] In-Reply-To: References: Message-ID: On Wed, 20 Dec 2023 12:24:26 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> necessary and sufficient (3) <-> (4) > > src/hotspot/share/opto/vectorization.cpp line 906: > >> 904: // abs(C_pre) < aw AND C_const % abs(C_pre) != 0 >> 905: // -> alignment has effect >> 906: // -> But C_const cannot be aligned with C_pre -> empty > > As we have discussed offline, I suggest the following: > > > // We look at (4a): > // > // abs(C_pre) >= aw > // -> Since C_pre is a power of two, we have C_pre % aw = 0. Therefore, any multiple of C_pre > // (i.e. choosing any value for pre_iter_C_Const) is also aw aligned. 
In this case, we can > // only satisfy (4a) if C_Const is aw aligned: > // > // C_const % aw == 0: > // -> (4a) has a trivial solution since we can choose any value for pre_iter_C_Const. > // > // C_const % aw != 0: > // -> (4a) has an empty solution since no pre_iter_C_Const can achieve aw alignment. > // > // abs(C_pre) < aw: > // -> Then for some x > 1: aw = abs(C_pre) * x since aw and C_pre are power of twos. > // If (4a) holds, then the following also holds: > // (C_const + C_pre * pre_iter_C_const) % abs(C_pre) <=> > // (C_const ) % abs(C_pre) > // > // C_const % abs(C_pre) == 0: > // -> pre_iter_C_const is chosen accordingly such that (4a) is satisfied with the given C_const value. > // -> (4a) has a constrained solution. > // > // C_const % abs(C_pre) != 0: > // -> Not "C_const % abs(C_pre) == 0" implies not (4a). Therefore, (4a) has an empty solution since no > // pre_iter_C_Const can achieve aw alignment. I took your part as inspiration, and wrote this part with more detail ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1434125129 From epeter at openjdk.org Thu Dec 21 14:21:39 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 21 Dec 2023 14:21:39 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v53] In-Reply-To: References: Message-ID: > I want to push this in JDK23. > After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). > > To calm your nerves: most of the changes are in auto-generated tests, and tests in general. > > **Context** > > `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. 
However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). > > Alignment is split into two tasks: > - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. > - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width. > > **Problem** > > I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). > In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. > Thanks @fg1417 for confirming this! > Hence, we need to fix the alignment correctness checks. > > While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. > > **Problem Details** > > Reproducer: > > > static void test(short[] a, short[] b, short mask) { > for (int i = 0; i < RANGE; i+=8) { > // Problematic for AlignVector > b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 > > b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes > b[i+4] = (short)(a[i+4] & mask); > b[i+5] = (short)(a[i+5] & mask); > b[i+6] = (short)(a[i+6] & mask); > } > } > > > During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. > > This is problematic as shown in this example. We have references at index offset `0, 3, 4, 5, 6`, and by... 
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: fix for yesterday's reviews by Christian ------------- Changes: - all: https://git.openjdk.org/jdk/pull/14785/files - new: https://git.openjdk.org/jdk/pull/14785/files/40833a89..fba19f81 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=52 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=51-52 Stats: 19 lines in 2 files changed: 0 ins; 1 del; 18 mod Patch: https://git.openjdk.org/jdk/pull/14785.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14785/head:pull/14785 PR: https://git.openjdk.org/jdk/pull/14785 From epeter at openjdk.org Thu Dec 21 14:27:22 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 21 Dec 2023 14:27:22 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v54] In-Reply-To: References: Message-ID: > I want to push this in JDK23. > After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). > > To calm your nerves: most of the changes are in auto-generated tests, and tests in general. > > **Context** > > `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). > > Alignment is split into two tasks: > - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. > - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. 
We would like to align the vectors with their vector width. > > **Problem** > > I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). > In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. > Thanks @fg1417 for confirming this! > Hence, we need to fix the alignment correctness checks. > > While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. > > **Problem Details** > > Reproducer: > > > static void test(short[] a, short[] b, short mask) { > for (int i = 0; i < RANGE; i+=8) { > // Problematic for AlignVector > b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 > > b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes > b[i+4] = (short)(a[i+4] & mask); > b[i+5] = (short)(a[i+5] & mask); > b[i+6] = (short)(a[i+6] & mask); > } > } > > > During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. > > This is problematic as shown in this example. We have references at index offset `0, 3, 4, 5, 6`, and by... 
Emanuel Peter has updated the pull request incrementally with two additional commits since the last revision: - Apply suggestions from code review from Christian Co-authored-by: Christian Hagedorn - more small fixes by Christian ------------- Changes: - all: https://git.openjdk.org/jdk/pull/14785/files - new: https://git.openjdk.org/jdk/pull/14785/files/fba19f81..7b55eea2 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=53 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=52-53 Stats: 13 lines in 3 files changed: 0 ins; 1 del; 12 mod Patch: https://git.openjdk.org/jdk/pull/14785.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14785/head:pull/14785 PR: https://git.openjdk.org/jdk/pull/14785 From chagedorn at openjdk.org Thu Dec 21 14:27:22 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 21 Dec 2023 14:27:22 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v52] In-Reply-To: References: Message-ID: <5bbdOUC2BFJFTLcBl_5l4wkGcjL_JREY-iItT3dAqjM=.f0a067b6-d870-4636-b79a-57d577398a1a@github.com> On Thu, 21 Dec 2023 14:14:20 GMT, Emanuel Peter wrote: >> I want to push this in JDK23. >> After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). >> >> To calm your nerves: most of the changes are in auto-generated tests, and tests in general. >> >> **Context** >> >> `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). >> >> Alignment is split into two tasks: >> - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. 
Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. >> - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width. >> >> **Problem** >> >> I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). >> In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. >> Thanks @fg1417 for confirming this! >> Hence, we need to fix the alignment correctness checks. >> >> While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. >> >> **Problem Details** >> >> Reproducer: >> >> >> static void test(short[] a, short[] b, short mask) { >> for (int i = 0; i < RANGE; i+=8) { >> // Problematic for AlignVector >> b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 >> >> b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes >> b[i+4] = (short)(a[i+4] & mask); >> b[i+5] = (short)(a[i+5] & mask); >> b[i+6] = (short)(a[i+6] & mask); >> } >> } >> >> >> During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. >> >> This is problemati... 
> > Emanuel Peter has updated the pull request incrementally with five additional commits since the last revision: > > - improve case analysis empty / constrained / trivial > - for Christian: fix base alignment and FAC_* whitespace > - for Christian: made some methods const > - for Christian: assert mem_ref not nullptr > - Christian's suggestions for EQ4 src/hotspot/share/opto/vectorization.cpp line 909: > 907: // > 908: // C_const % aw == 0: > 909: // -> (4a) has a trivial solution since we can choose any value for pre_iter_C_Const. Suggestion: // -> (4a) has a trivial solution since we can choose any value for pre_iter_C_const. src/hotspot/share/opto/vectorization.cpp line 912: > 910: // > 911: // C_const % aw != 0: > 912: // -> (4a) has an empty solution since no pre_iter_C_Const can achieve aw alignment. Suggestion: // -> (4a) has an empty solution since no pre_iter_C_const can achieve aw alignment. src/hotspot/share/opto/vectorization.cpp line 920: > 918: // > 919: // C_const % abs(C_pre) == 0: > 920: // -> Exists integer z: C_const = C_pre * z For consistency to the line above: Suggestion: // -> There exists integer z: C_const = C_pre * z ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1434132022 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1434132231 PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1434133130 From epeter at openjdk.org Thu Dec 21 14:41:24 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 21 Dec 2023 14:41:24 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v55] In-Reply-To: References: Message-ID: > I want to push this in JDK23. > After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). > > To calm your nerves: most of the changes are in auto-generated tests, and tests in general. 
> > **Context** > > `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). > > Alignment is split into two tasks: > - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. > - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width. > > **Problem** > > I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). > In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. > Thanks @fg1417 for confirming this! > Hence, we need to fix the alignment correctness checks. > > While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. > > **Problem Details** > > Reproducer: > > > static void test(short[] a, short[] b, short mask) { > for (int i = 0; i < RANGE; i+=8) { > // Problematic for AlignVector > b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 > > b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes > b[i+4] = (short)(a[i+4] & mask); > b[i+5] = (short)(a[i+5] & mask); > b[i+6] = (short)(a[i+6] & mask); > } > } > > > During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. 
For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. > > This is problematic as shown in this example. We have references at index offset `0, 3, 4, 5, 6`, and by... Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: comments about modulo positive / negative values ------------- Changes: - all: https://git.openjdk.org/jdk/pull/14785/files - new: https://git.openjdk.org/jdk/pull/14785/files/7b55eea2..fcc12466 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=54 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=53-54 Stats: 15 lines in 3 files changed: 13 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/14785.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14785/head:pull/14785 PR: https://git.openjdk.org/jdk/pull/14785 From epeter at openjdk.org Thu Dec 21 15:37:46 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 21 Dec 2023 15:37:46 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v56] In-Reply-To: References: Message-ID: <32--7t8Z7f0stK3Xp3hhj3Vl9NwLAic-1p_TEMzLSCk=.35953e56-4ab6-4cc2-98cd-baf7189ed0d8@github.com> > I want to push this in JDK23. > After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). > > To calm your nerves: most of the changes are in auto-generated tests, and tests in general. > > **Context** > > `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. 
However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). > > Alignment is split into two tasks: > - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. > - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width. > > **Problem** > > I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). > In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. > Thanks @fg1417 for confirming this! > Hence, we need to fix the alignment correctness checks. > > While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. > > **Problem Details** > > Reproducer: > > > static void test(short[] a, short[] b, short mask) { > for (int i = 0; i < RANGE; i+=8) { > // Problematic for AlignVector > b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 > > b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes > b[i+4] = (short)(a[i+4] & mask); > b[i+5] = (short)(a[i+5] & mask); > b[i+6] = (short)(a[i+6] & mask); > } > } > > > During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. > > This is problematic as shown in this example. We have references at index offset `0, 3, 4, 5, 6`, and by... 
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: more comments in SuperWord::adjust_pre_loop_limit_to_align_main_loop_vectors ------------- Changes: - all: https://git.openjdk.org/jdk/pull/14785/files - new: https://git.openjdk.org/jdk/pull/14785/files/fcc12466..d2a06990 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=55 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=54-55 Stats: 42 lines in 1 file changed: 29 ins; 1 del; 12 mod Patch: https://git.openjdk.org/jdk/pull/14785.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14785/head:pull/14785 PR: https://git.openjdk.org/jdk/pull/14785 From epeter at openjdk.org Thu Dec 21 15:40:16 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 21 Dec 2023 15:40:16 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v57] In-Reply-To: References: Message-ID: <4g4SbB2RBLU-ZFcrH_ukdqC_QSoSvibNGanasAFl-lw=.731266a6-9974-402e-954e-e441706426ab@github.com> > I want to push this in JDK23. > After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). > > To calm your nerves: most of the changes are in auto-generated tests, and tests in general. > > **Context** > > `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). > > Alignment is split into two tasks: > - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. 
We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. > - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width. > > **Problem** > > I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). > In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. > Thanks @fg1417 for confirming this! > Hence, we need to fix the alignment correctness checks. > > While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly. > > **Problem Details** > > Reproducer: > > > static void test(short[] a, short[] b, short mask) { > for (int i = 0; i < RANGE; i+=8) { > // Problematic for AlignVector > b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0 > > b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes > b[i+4] = (short)(a[i+4] & mask); > b[i+5] = (short)(a[i+5] & mask); > b[i+6] = (short)(a[i+6] & mask); > } > } > > > During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all. > > This is problematic as shown in this example. We have references at index offset `0, 3, 4, 5, 6`, and by... 
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: Apply suggestions from code review by Christian Co-authored-by: Christian Hagedorn ------------- Changes: - all: https://git.openjdk.org/jdk/pull/14785/files - new: https://git.openjdk.org/jdk/pull/14785/files/d2a06990..0070ec22 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=56 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=55-56 Stats: 3 lines in 1 file changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/14785.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14785/head:pull/14785 PR: https://git.openjdk.org/jdk/pull/14785 From mli at openjdk.org Thu Dec 21 15:40:51 2023 From: mli at openjdk.org (Hamlin Li) Date: Thu, 21 Dec 2023 15:40:51 GMT Subject: RFR: 8317721: RISC-V: Implement CRC32 intrinsic [v2] In-Reply-To: References: Message-ID: On Wed, 20 Dec 2023 15:34:40 GMT, ArsenyBochkarev wrote: >> src/hotspot/cpu/riscv/macroAssembler_riscv.cpp line 3717: >> >>> 3715: andi(tmp1, v, bits8); >>> 3716: shadd(tmp1, tmp1, table3, tmp2, 2); >>> 3717: Assembler::lwu(crc, tmp1, 0); >> >> Why not use `MacroAssembler::lwu` instead ? I see no difference in stub code emitted. 
>> Like: >> ``` diff >> diff --git a/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp b/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp >> index 06026b98bfa..eb9362ca531 100644 >> --- a/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp >> +++ b/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp >> @@ -3696,26 +3696,26 @@ void MacroAssembler::update_word_crc32(Register crc, Register v, Register tmp1, >> >> andi(tmp1, v, bits8); >> shadd(tmp1, tmp1, table3, tmp2, 2); >> - Assembler::lwu(crc, tmp1, 0); >> + lwu(crc, Address(tmp1, 0)); >> >> srli(tmp1, v, 6); >> andi(tmp1, tmp1, (bits8 << 2)); >> add(tmp1, tmp1, table2); >> - Assembler::lwu(tmp2, tmp1, 0); >> + lwu(tmp2, Address(tmp1, 0)); >> >> srli(tmp1, v, 14); >> xorr(crc, crc, tmp2); >> >> andi(tmp1, tmp1, (bits8 << 2)); >> add(tmp1, tmp1, table1); >> - Assembler::lwu(tmp2, tmp1, 0); >> + lwu(tmp2, Address(tmp1, 0)); >> >> srli(tmp1, v, 22); >> xorr(crc, crc, tmp2); >> >> andi(tmp1, tmp1, (bits8 << 2)); >> add(tmp1, tmp1, table0); >> - Assembler::lwu(tmp2, tmp1, 0); >> + lwu(tmp2, Address(tmp1, 0)); >> xorr(crc, crc, tmp2); >> } > > When I tried `MacroAssembler::lwu` I got the following instructions on T-head: > > 0.47% ? 0x0000003fac6a8738: li t3,1 > 0.51% ? 0x0000003fac6a873a: slli t3,t3,0x20 > 0.00% ? 0x0000003fac6a873c: addi t3,t3,-1 > ... > 2.68% ? 0x0000003fac6a8752: lw a0,0(t1) > 5.25% ? 0x0000003fac6a8756: and a0,a0,t3 > ... > ? 0x0000003fac6a876a: lw t4,0(t1) > 1.78% ? 0x0000003fac6a876e: and t1,t4,t3 > ... > 0.49% ? 0x0000003fac6a8786: lw t4,0(t1) > 2.62% ? 0x0000003fac6a878a: and t1,t4,t3 > ... > 0.41% ? 0x0000003fac6a87a2: lw t4,0(t1) > 3.97% ? 0x0000003fac6a87a6: and t1,t4,t3 > > instead of just > > 4.52% ?? 0x0000003fb49e96f6: lwu a0,0(t1) > ... > ?? 0x0000003fb49e970a: lwu t3,0(t1) > ... > ?? 0x0000003fb49e9722: lwu t3,0(t1) > ... > 0.02% ?? 0x0000003fb49e973a: lwu t3,0(t1) Interesting, I tried on qemu and `T-HEAD Light Lichee Pi 4A`, I don't get this code generated with just `lwu(tmp2, Address(tmp1, 0));`. 
Do you know how this happens? I mean, how does it happen only on specific hardware? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17046#discussion_r1434221095 From epeter at openjdk.org Thu Dec 21 15:43:07 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 21 Dec 2023 15:43:07 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v43] In-Reply-To: References: Message-ID: On Fri, 15 Dec 2023 10:46:53 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> renamings and proof improvement in adjust_pre_loop_limit_to_align_main_loop_vectors > > Thanks for addressing my other comments! I really like the new structure of having an `AlignmentSolution` interface with different alignment solution classes. I have some more comments but mostly fine tuning - it's already in quite good shape and the comments really add a lot of benefit to understand the idea behind the code. We get there :-) @chhagedorn thank you very much for the detailed review! I think now we are at a point that is stable enough, so reviewers can jump in ;) @fg1417 @vnkozlov @TobiHartmann ------------- PR Comment: https://git.openjdk.org/jdk/pull/14785#issuecomment-1866513361 From sviswanathan at openjdk.org Thu Dec 21 16:56:51 2023 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Thu, 21 Dec 2023 16:56:51 GMT Subject: RFR: JDK-8321599 Data loss in AVX3 Base64 decoding In-Reply-To: References: Message-ID: On Sat, 9 Dec 2023 23:36:45 GMT, Vladimir Kozlov wrote: >>> @asgibbons, am I correct that the problem is that padding '=' characters were not found and not processed? This happens because a source offset is not taken into account.
A test is: >>> >>> ``` >>> A, B: String >>> Buf: ByteBuffer >>> C := base64_encode(A) + base64_encode(B) # encode(B) should have '=' or '==' >>> put C in Buf >>> A' := base64_decode(Buf) >>> B' := base64_decode(Buf) >>> assert(A.equals(A')) >>> assert(B.equals(B')) >>> ``` >> >> No. The padding '=' character was found and terminated the decoding, which is expected. The issue is that the input string (encoded) is quite long in this case and the test is decoding a substring of the full string. The parameters passed to Decode are a pointer to the start of the (long) string and a (large) offset. I was looking for padding characters relative to the start of the long string instead of the substring (start plus the starting offset). Example: >> >> >> Encoded string: >> . . . = = . . . a a a a a a a ... a a a a >> ^ ^ >> | | >> start start + offset >> >> I was asked to decode the bytes at ```(start + offset)```. When the algorithm gets to the last 31 bytes of ```a a a a ... a a a a```, it looks for padding at ```(start + remaining_length - 1)``` instead of ```(start + start_offset + remaining_length - 1)```. It actually found a padding byte at ```(start + remaining_length - 1)``` and decided that the output length should be reduced by one character (or 2 if there were 2 padding bytes found). A very specific edge case (so good catch by testers). > >> @asgibbons, thank you for the quick fix. >> I think it's worth adding the reproducer for the JBS issue as a test. > > Yes, we need a regression test with these changes. @vnkozlov Please advise if we can go ahead and integrate this fix.
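The scenario described above (decoding a substring that starts at a non-zero offset, with padding characters sitting earlier in the same buffer) could be exercised roughly as in the following sketch. This is not the regression test from the PR: the class name, helper names, and the chosen lengths are assumptions made for illustration, and it only uses the public `java.util.Base64` API with a heap `ByteBuffer` positioned past the prefix.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical sketch: decode only the tail of a buffer whose prefix
// contains '=' padding, and check that no tail bytes are lost.
final class Base64OffsetDecodeSketch {
    // Encodes prefix and tail separately, concatenates the encodings, then
    // decodes only the tail by positioning the buffer past the prefix.
    static byte[] decodeTailAtOffset(byte[] prefixPlain, byte[] tailPlain) {
        byte[] prefix = Base64.getEncoder().encode(prefixPlain);
        byte[] tail = Base64.getEncoder().encode(tailPlain);
        byte[] all = Arrays.copyOf(prefix, prefix.length + tail.length);
        System.arraycopy(tail, 0, all, prefix.length, tail.length);

        ByteBuffer src = ByteBuffer.wrap(all);
        src.position(prefix.length);           // decoding starts at an offset
        ByteBuffer out = Base64.getDecoder().decode(src);
        byte[] decoded = new byte[out.remaining()];
        out.get(decoded);
        return decoded;
    }

    static byte[] filled(int length, byte value) {
        byte[] bytes = new byte[length];
        Arrays.fill(bytes, value);
        return bytes;
    }

    public static void main(String[] args) {
        // 300 plain bytes encode to 400 characters without padding.
        byte[] tail = filled(300, (byte) 'a');
        // Prefix lengths not divisible by 3 end in '=' or '==' when encoded,
        // so the padding lands at varying positions before the tail.
        for (int prefixLen = 1; prefixLen <= 64; prefixLen++) {
            byte[] got = decodeTailAtOffset(filled(prefixLen, (byte) 'x'), tail);
            if (!Arrays.equals(tail, got)) {
                throw new AssertionError("mismatch at prefix length " + prefixLen);
            }
        }
        System.out.println("all offsets decoded correctly");
    }
}
```

Whether a given run actually reaches the AVX3 code path depends on the hardware and JIT warm-up, so a real regression test would likely need many iterations and larger inputs.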
------------- PR Comment: https://git.openjdk.org/jdk/pull/17039#issuecomment-1866641380 From duke at openjdk.org Thu Dec 21 17:46:07 2023 From: duke at openjdk.org (Zhiqiang Zang) Date: Thu, 21 Dec 2023 17:46:07 GMT Subject: RFR: 8322589: Add Ideal transformation: (~a) & (~b) => ~(a | b) Message-ID: <-pwjKEB97C-bM068JQN0PY1hl65IzcuQZfHzRoKu92g=.d6118773-20d2-46cd-9284-5168c9334bb5@github.com> Hello, `(~a) & (~b) => ~(a | b)` is a widely seen pattern, for example it is implemented for LLVM [here](https://github.com/llvm/llvm-project/blob/397f1ce9efb4eea1ee10fe4833f733b8c7abd878/llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp#L1616C28-L1616C28); however it is missing in current implementation of hotspot. This pull request adds this transformation and associated tests. Thanks. ------------- Commit messages: - connect test with correct bug id. - remove tabs. - include new optimization and tests. Changes: https://git.openjdk.org/jdk/pull/16333/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=16333&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8322589 Stats: 107 lines in 4 files changed: 100 ins; 0 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/16333.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16333/head:pull/16333 PR: https://git.openjdk.org/jdk/pull/16333 From duke at openjdk.org Thu Dec 21 17:46:07 2023 From: duke at openjdk.org (Zhiqiang Zang) Date: Thu, 21 Dec 2023 17:46:07 GMT Subject: RFR: 8322589: Add Ideal transformation: (~a) & (~b) => ~(a | b) In-Reply-To: <-pwjKEB97C-bM068JQN0PY1hl65IzcuQZfHzRoKu92g=.d6118773-20d2-46cd-9284-5168c9334bb5@github.com> References: <-pwjKEB97C-bM068JQN0PY1hl65IzcuQZfHzRoKu92g=.d6118773-20d2-46cd-9284-5168c9334bb5@github.com> Message-ID: On Tue, 24 Oct 2023 04:49:20 GMT, Zhiqiang Zang wrote: > Hello, > > `(~a) & (~b) => ~(a | b)` is a widely seen pattern, for example it is implemented for LLVM 
[here](https://github.com/llvm/llvm-project/blob/397f1ce9efb4eea1ee10fe4833f733b8c7abd878/llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp#L1616C28-L1616C28); however it is missing in current implementation of hotspot. This pull request adds this transformation and associated tests. > > Thanks. Hi, can I get a review? ------------- PR Comment: https://git.openjdk.org/jdk/pull/16333#issuecomment-1821175827 From jpai at openjdk.org Thu Dec 21 17:46:07 2023 From: jpai at openjdk.org (Jaikiran Pai) Date: Thu, 21 Dec 2023 17:46:07 GMT Subject: RFR: 8322589: Add Ideal transformation: (~a) & (~b) => ~(a | b) In-Reply-To: References: <-pwjKEB97C-bM068JQN0PY1hl65IzcuQZfHzRoKu92g=.d6118773-20d2-46cd-9284-5168c9334bb5@github.com> Message-ID: On Tue, 21 Nov 2023 15:43:49 GMT, Zhiqiang Zang wrote: > Hi, can I get a review? Hello @CptGit, can you create an enhancement request here https://bugreport.java.com/bugreport/? Someone with knowledge of this area will then be able to decide if it's a valid enhancement, in which case there will be a corresponding JDK issue created against which you can then link this PR for review. ------------- PR Comment: https://git.openjdk.org/jdk/pull/16333#issuecomment-1851753646 From duke at openjdk.org Thu Dec 21 17:46:08 2023 From: duke at openjdk.org (Zhiqiang Zang) Date: Thu, 21 Dec 2023 17:46:08 GMT Subject: RFR: 8322589: Add Ideal transformation: (~a) & (~b) => ~(a | b) In-Reply-To: References: <-pwjKEB97C-bM068JQN0PY1hl65IzcuQZfHzRoKu92g=.d6118773-20d2-46cd-9284-5168c9334bb5@github.com> Message-ID: On Tue, 12 Dec 2023 10:24:49 GMT, Jaikiran Pai wrote: > > Hi, can I get a review? > > Hello @CptGit, can you create an enhancement request here https://bugreport.java.com/bugreport/? Someone with knowledge of this area will then be able to decide if it's a valid enhancement, in which case there will be a corresponding JDK issue created against which you can then link this PR for review. Thanks for the comment. I have created a bug report. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/16333#issuecomment-1852819397 From never at openjdk.org Thu Dec 21 19:25:57 2023 From: never at openjdk.org (Tom Rodriguez) Date: Thu, 21 Dec 2023 19:25:57 GMT Subject: RFR: 8320139: [JVMCI] VmObjectAlloc is not generated by intrinsics methods which allocate objects [v2] In-Reply-To: <0Um1zwgyjW8IXcISUsf7s2fvfsD03JD7esf112bGa_g=.7b483632-9193-49ec-87d4-0ad5adbab234@github.com> References: <0Um1zwgyjW8IXcISUsf7s2fvfsD03JD7esf112bGa_g=.7b483632-9193-49ec-87d4-0ad5adbab234@github.com> Message-ID: On Tue, 19 Dec 2023 07:31:05 GMT, Raphael Mosaner wrote: >> This PR exports a pointer to `JvmtiExport::_should_notify_object_alloc` via JVMCI to enable intrinsification of unsafe allocations in accordance to C2. > > Raphael Mosaner has updated the pull request incrementally with one additional commit since the last revision: > > [JVMCI] Documentation for _should_notify_object_alloc export. Marked as reviewed by never (Reviewer). Looks good. ------------- PR Review: https://git.openjdk.org/jdk/pull/16980#pullrequestreview-1793541357 PR Comment: https://git.openjdk.org/jdk/pull/16980#issuecomment-1866817021 From duke at openjdk.org Thu Dec 21 19:25:58 2023 From: duke at openjdk.org (Raphael Mosaner) Date: Thu, 21 Dec 2023 19:25:58 GMT Subject: Integrated: 8320139: [JVMCI] VmObjectAlloc is not generated by intrinsics methods which allocate objects In-Reply-To: References: Message-ID: On Tue, 5 Dec 2023 18:26:57 GMT, Raphael Mosaner wrote: > This PR exports a pointer to `JvmtiExport::_should_notify_object_alloc` via JVMCI to enable intrinsification of unsafe allocations in accordance to C2. This pull request has now been integrated. 
Changeset: 84c23792 Author: Raphael Mosaner Committer: Tom Rodriguez URL: https://git.openjdk.org/jdk/commit/84c23792856c5c2374963d78a7a734a467bbb79b Stats: 13 lines in 3 files changed: 13 ins; 0 del; 0 mod 8320139: [JVMCI] VmObjectAlloc is not generated by intrinsics methods which allocate objects Reviewed-by: never, dnsimon ------------- PR: https://git.openjdk.org/jdk/pull/16980 From maurizio.cimadamore at oracle.com Thu Dec 21 21:54:33 2023 From: maurizio.cimadamore at oracle.com (Maurizio Cimadamore) Date: Thu, 21 Dec 2023 21:54:33 +0000 Subject: Fwd: Invalid code generated by C2 compiler in OpenJDK 21 In-Reply-To: References: Message-ID: <9c5d6867-d63a-4558-a3ad-f9b127768cb0@oracle.com> Adding hotspot-compiler-dev, as I don't think this is a javac compiler issue? Maurizio -------- Forwarded Message -------- Subject: Invalid code generated by C2 compiler in OpenJDK 21 Date: Fri, 15 Dec 2023 11:12:15 +0100 From: Antoine DESSAIGNE To: compiler-dev at openjdk.org Hello everyone, I've found an issue while migrating to OpenJDK 21. A valued local variable (effectively final) has its value removed and it throws a NullPointerException. Unfortunately, I cannot provide the source code and the data to reproduce the issue, and I couldn't create a smaller code snippet to show the issue. That said, I'll happily show the code and perform many tests during calls. Here's what I did so far to diagnose the issue. I bisected the repository to find where the regression comes from. I found this commit 3696711efa5 [1] but it's a merge so I bisected the branch and found 10737e168c9 [2]. Looking at this commit, I have no idea how it could introduce this kind of regression.
Then, thanks to the guidance from Aleksey Shipilëv, I tested many things * Issue does *not* happen with the following flags: -Xint, -XX:-TieredCompilation, -XX:TieredStopAtLevel=1, -XX:TieredStopAtLevel=2, -XX:TieredStopAtLevel=3 * Issue also happens with fastdebug builds of OpenJDK, without crashing due to assertions * Issue still happens in the latest version of the code (commit b31454e3623) * Issue happens no matter which GC is used, I tried SerialGC, ParallelGC, G1GC, and ShenandoahGC The tests were performed in Docker containers running on 4 different hosts. Therefore it looks like C2 is generating an invalid assembly code. Unfortunately, I'm not great with assembly and the generated assembly is quite big (main code is around 20k). Do you have an idea of why this is happening? Do you know what test I can run? If one of you is available, we can schedule calls for me to show you the code and my tests. Thank you very much for your assistance. Have a nice day, Antoine DESSAIGNE [1] https://github.com/openjdk/jdk/commit/3696711efa566fb776d6923da86e17b0e1e22964 [2] https://github.com/openjdk/jdk/commit/10737e168c967a08e257927251861bf2c14795ab From duke at openjdk.org Thu Dec 21 22:20:15 2023 From: duke at openjdk.org (ArsenyBochkarev) Date: Thu, 21 Dec 2023 22:20:15 GMT Subject: RFR: 8317721: RISC-V: Implement CRC32 intrinsic [v3] In-Reply-To: References: Message-ID: <_CysHDX3CV-ZM4ilLgHSRrcDk4DHDNe1ClAKFCV_uoM=.751d91bf-e7e0-4b78-8ff5-2b864c38dd73@github.com> > Hi everyone! Please review this port of [AArch64](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp#L4224) `_updateBytesCRC32`, `_updateByteBufferCRC32` and `_updateCRC32` intrinsics. This patch introduces only the plain (non-vectorized, no Zbc) version. > > ### Correctness checks > > Tier 1/2 tests are ok.
> > ### Performance results on T-Head board > > #### Results for enabled intrinsic: > > Used test is `test/micro/org/openjdk/bench/java/util/TestCRC32.java` > > | Benchmark | (count) | Mode | Cnt | Score | Error | Units | > | --- | ---- | ----- | --- | ---- | --- | ---- | > | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 24 | 3730.929 | 37.773 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 24 | 2126.673 | 2.032 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 24 | 1134.330 | 6.714 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 24 | 584.017 | 2.267 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 24 | 151.173 | 0.346 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 24 | 19.113 | 0.008 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 24 | 4.647 | 0.022 | ops/ms | > > #### Results for disabled intrinsic: > > | Benchmark | (count) | Mode | Cnt | Score | Error | Units | > | --------------------------------------------------- | ---------- | --------- | ---- | ----------- | --------- | ---------- | > | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 15 | 798.365 | 35.486 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 15 | 677.756 | 46.619 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 15 | 552.781 | 27.143 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 15 | 429.304 | 12.518 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 15 | 166.738 | 0.935 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 15 | 25.060 | 0.034 | ops/ms | > | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 15 | 6.196 | 0.030 | ops/ms | ArsenyBochkarev has updated the pull request incrementally with five additional commits since the last revision: - Use MacroAssembler::lwu instead of Assembler::lwu - Save instruction when getting table3 address - Left note on how table elements are accessed - Fix comment for result register - Remove 
unused L_by16 label ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17046/files - new: https://git.openjdk.org/jdk/pull/17046/files/f7a4f0c7..a59481b4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17046&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17046&range=01-02 Stats: 29 lines in 2 files changed: 11 ins; 1 del; 17 mod Patch: https://git.openjdk.org/jdk/pull/17046.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17046/head:pull/17046 PR: https://git.openjdk.org/jdk/pull/17046 From duke at openjdk.org Thu Dec 21 22:20:15 2023 From: duke at openjdk.org (ArsenyBochkarev) Date: Thu, 21 Dec 2023 22:20:15 GMT Subject: RFR: 8317721: RISC-V: Implement CRC32 intrinsic [v2] In-Reply-To: <9DaZ8Dup4ZpiZBjtoGl9KTyHJZXQPuk0ux6oVa3jBLo=.10fda31c-ae9c-4d66-86b4-595996da5b56@github.com> References: <9DaZ8Dup4ZpiZBjtoGl9KTyHJZXQPuk0ux6oVa3jBLo=.10fda31c-ae9c-4d66-86b4-595996da5b56@github.com> Message-ID: On Thu, 21 Dec 2023 11:32:11 GMT, Hamlin Li wrote: >> ArsenyBochkarev has updated the pull request incrementally with three additional commits since the last revision: >> >> - Use zero_extend instead of shifts where possible >> - Use andn instead of notr + andr where possible >> - Replace shNadd with one instruction in most cases > > src/hotspot/cpu/riscv/c1_LIRGenerator_riscv.cpp line 775: > >> 773: >> 774: void LIRGenerator::do_update_CRC32(Intrinsic* x) { >> 775: assert(UseCRC32Intrinsics, "why are we here?"); > > I suppose the performance data is for C2 intrinsic, is there performance data for C1? Measured and provided it in a comment below. I used the `-XX:TieredStopAtLevel=1` flag with the same benchmark on the same machine. > src/hotspot/cpu/riscv/macroAssembler_riscv.cpp line 3719: > >> 3717: Assembler::lwu(crc, tmp1, 0); >> 3718: >> 3719: srli(tmp1, v, 6); > > The comment at the beginning is `crc = table3[v&0xff]^table2[(v>>8)&0xff]^table1[(v>>16)&0xff]^table0[v>>24]`, i.e. 
shifts of v are `8/16/24`, but in the code here shift is `6/14/22` + `bits8 << 2`, is this intended? Can you add some comments here?

It is intended, yes. I just thought that since the table access needs the index to be shifted left by 2 it can be optimized a bit. I left the note in the comments.

> src/hotspot/cpu/riscv/macroAssembler_riscv.cpp line 3760:
>
>> 3758: add(table1, table0, 1*256*sizeof(juint), tmp);
>> 3759: add(table2, table0, 2*256*sizeof(juint), tmp);
>> 3760: add(table3, table0, 3*256*sizeof(juint), tmp);
>
> With `add(table3, table2, 256*(junit))`, it might save one instruction.

Good catch, thanks! Fixed.

> src/hotspot/cpu/riscv/macroAssembler_riscv.cpp line 3762:
>
>> 3760: add(table3, table0, 3*256*sizeof(juint), tmp);
>> 3761:
>> 3762: bind(L_by16);
>
> seems `L_by16` is not necessary.

Removed.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/17046#discussion_r1434555369
PR Review Comment: https://git.openjdk.org/jdk/pull/17046#discussion_r1434553356
PR Review Comment: https://git.openjdk.org/jdk/pull/17046#discussion_r1434553888
PR Review Comment: https://git.openjdk.org/jdk/pull/17046#discussion_r1434554086

From duke at openjdk.org Thu Dec 21 22:20:15 2023
From: duke at openjdk.org (ArsenyBochkarev)
Date: Thu, 21 Dec 2023 22:20:15 GMT
Subject: RFR: 8317721: RISC-V: Implement CRC32 intrinsic [v2]
In-Reply-To:
References:
Message-ID:

On Thu, 21 Dec 2023 15:36:58 GMT, Hamlin Li wrote:

>> When I tried `MacroAssembler::lwu` I got the following instructions on T-head:
>>
>> 0.47%  0x0000003fac6a8738: li t3,1
>> 0.51%  0x0000003fac6a873a: slli t3,t3,0x20
>> 0.00%  0x0000003fac6a873c: addi t3,t3,-1
>> ...
>> 2.68%  0x0000003fac6a8752: lw a0,0(t1)
>> 5.25%  0x0000003fac6a8756: and a0,a0,t3
>> ...
>>        0x0000003fac6a876a: lw t4,0(t1)
>> 1.78%  0x0000003fac6a876e: and t1,t4,t3
>> ...
>> 0.49%  0x0000003fac6a8786: lw t4,0(t1)
>> 2.62%  0x0000003fac6a878a: and t1,t4,t3
>> ...
>> 0.41%  0x0000003fac6a87a2: lw t4,0(t1)
>> 3.97%  0x0000003fac6a87a6: and t1,t4,t3
>>
>> instead of just
>>
>> 4.52%  0x0000003fb49e96f6: lwu a0,0(t1)
>> ...
>>        0x0000003fb49e970a: lwu t3,0(t1)
>> ...
>>        0x0000003fb49e9722: lwu t3,0(t1)
>> ...
>> 0.02%  0x0000003fb49e973a: lwu t3,0(t1)
>
> Interesting, I tried on qemu and `T-HEAD Light Lichee Pi 4A`, I don't get this code generated with just `lwu(tmp2, Address(tmp1, 0));`.
> Do you know how does it happen? I mean how does this happen on a specific hardware only.

Whoops, I tried to reproduce behavior above and it turned out that it was just some old profiling data I collected before implementing version with `lwu` (previously I used the `lw` + `andi` pair of instructions). The `MacroAssembler::lwu` version actually works fine. Thanks everyone for pointing it out!

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/17046#discussion_r1434553637

From duke at openjdk.org Thu Dec 21 22:20:16 2023
From: duke at openjdk.org (ArsenyBochkarev)
Date: Thu, 21 Dec 2023 22:20:16 GMT
Subject: RFR: 8317721: RISC-V: Implement CRC32 intrinsic [v2]
In-Reply-To: <5CP-jGXy0kE9XAAcxu25cc5ZIZCiXoPgvV8ZYJrQ6hw=.6e9eb5b6-d31e-4a4e-b18f-3b997eb79549@github.com>
References: <5CP-jGXy0kE9XAAcxu25cc5ZIZCiXoPgvV8ZYJrQ6hw=.6e9eb5b6-d31e-4a4e-b18f-3b997eb79549@github.com>
Message-ID:

On Thu, 21 Dec 2023 12:56:59 GMT, Feilong Jiang wrote:

>> ArsenyBochkarev has updated the pull request incrementally with three additional commits since the last revision:
>>
>> - Use zero_extend instead of shifts where possible
>> - Use andn instead of notr + andr where possible
>> - Replace shNadd with one instruction in most cases
>
> src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 4637:
>
>> 4635: *
>> 4636: * Output:
>> 4637: * rax - int crc result
>
> should be `a0` or `x10`?

`a0`, of course. Thanks, fixed.
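
The `6/14/22` shifts discussed earlier in this review thread can be illustrated in isolation. The following is a minimal C++ sketch, not the HotSpot code: it assumes 4-byte table entries (as with `juint`), and the function names `offset_plain`, `offset_folded` and `check_folding` are hypothetical. It shows that extracting byte `k` of `v` and then scaling by 4 gives the same byte offset as shifting by `8*k - 2` and masking with `0xff << 2`.

```cpp
#include <cassert>
#include <cstdint>

// Byte offset into a table of 4-byte entries for byte k of v (k = 1, 2, 3),
// computed in two steps: extract the byte, then scale by the entry size.
constexpr uint32_t offset_plain(uint32_t v, unsigned k) {
    return ((v >> (8 * k)) & 0xffu) << 2;
}

// The same offset with the scaling folded into the shift: shifting by
// 8*k - 2 (that is 6, 14, 22) and masking with 0xff << 2 keeps bits 2..9.
constexpr uint32_t offset_folded(uint32_t v, unsigned k) {
    return (v >> (8 * k - 2)) & (0xffu << 2);
}

// Compile-time spot check for one value.
static_assert(offset_plain(0x12345678u, 1) == offset_folded(0x12345678u, 1),
              "folded shift must match the two-step computation");

// Runtime check of all three byte positions for an arbitrary value.
inline void check_folding(uint32_t v) {
    for (unsigned k = 1; k <= 3; ++k) {
        assert(offset_plain(v, k) == offset_folded(v, k));
    }
}
```

Byte 0 is left out of the sketch on purpose: there the folded form would need a left shift rather than a right shift.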
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17046#discussion_r1434553222 From duke at openjdk.org Thu Dec 21 22:33:47 2023 From: duke at openjdk.org (ArsenyBochkarev) Date: Thu, 21 Dec 2023 22:33:47 GMT Subject: RFR: 8317721: RISC-V: Implement CRC32 intrinsic [v3] In-Reply-To: <_CysHDX3CV-ZM4ilLgHSRrcDk4DHDNe1ClAKFCV_uoM=.751d91bf-e7e0-4b78-8ff5-2b864c38dd73@github.com> References: <_CysHDX3CV-ZM4ilLgHSRrcDk4DHDNe1ClAKFCV_uoM=.751d91bf-e7e0-4b78-8ff5-2b864c38dd73@github.com> Message-ID: On Thu, 21 Dec 2023 22:20:15 GMT, ArsenyBochkarev wrote: >> Hi everyone! Please review this port of [AArch64](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp#L4224) `_updateBytesCRC32`, `_updateByteBufferCRC32` and `_updateCRC32` intrinsics. This patch introduces only the plain (non-vectorized, no Zbc) version. >> >> ### Correctness checks >> >> Tier 1/2 tests are ok. >> >> ### Performance results on T-Head board >> >> #### Results for enabled intrinsic: >> >> Used test is `test/micro/org/openjdk/bench/java/util/TestCRC32.java` >> >> | Benchmark | (count) | Mode | Cnt | Score | Error | Units | >> | --- | ---- | ----- | --- | ---- | --- | ---- | >> | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 24 | 3730.929 | 37.773 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 24 | 2126.673 | 2.032 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 24 | 1134.330 | 6.714 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 24 | 584.017 | 2.267 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 24 | 151.173 | 0.346 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 24 | 19.113 | 0.008 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 24 | 4.647 | 0.022 | ops/ms | >> >> #### Results for disabled intrinsic: >> >> | Benchmark | (count) | Mode | Cnt | Score | Error | Units | >> | --------------------------------------------------- | ---------- | 
--------- | ---- | ----------- | --------- | ---------- | >> | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 15 | 798.365 | 35.486 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 15 | 677.756 | 46.619 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 15 | 552.781 | 27.143 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 15 | 429.304 | 12.518 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 15 | 166.738 | 0.935 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 15 | 25.060 | 0.034 | ops/ms | >> | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 15 | 6.196 | 0.030 | ops/ms | > > ArsenyBochkarev has updated the pull request incrementally with five additional commits since the last revision: > > - Use MacroAssembler::lwu instead of Assembler::lwu > - Save instruction when getting table3 address > - Left note on how table elements are accessed > - Fix comment for result register > - Remove unused L_by16 label Performance measurements on the same benchmark, on T-Head board. 
I used the `-XX:TieredStopAtLevel=1` flag: | Benchmark | (count) | Mode | Cnt | Score | Error | Units | | ------------------------------------------------ | ----------- | ------ | ---- | ---------- | ---------- | ---------- | | CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 12 | 3617.860 | 17.463 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 12 | 2253.626 | 4.739 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 12 | 1242.516 | 83.245 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 12 | 687.339 | 1.712 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 12 | 183.016 | 0.258 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 12 | 23.368 | 0.133 | ops/ms | | CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 12 | 5.640 | 0.023 | ops/ms | ------------- PR Comment: https://git.openjdk.org/jdk/pull/17046#issuecomment-1867010402 From phh at openjdk.org Thu Dec 21 23:49:48 2023 From: phh at openjdk.org (Paul Hohensee) Date: Thu, 21 Dec 2023 23:49:48 GMT Subject: RFR: 8322490: CastNode constructors accepts control node as input [v3] In-Reply-To: References: Message-ID: On Thu, 21 Dec 2023 09:17:16 GMT, Joshua Cao wrote: >> It is a common pattern to have: >> >> >> Node* n = new CastNode(...); >> n->set_req(control_node); >> >> >> We can modify the constructor to set the control node. It makes the code a little tidier. >> >> Passes tier1 locally on my Linux machine > > Joshua Cao has updated the pull request incrementally with one additional commit since the last revision: > > Move comment Marked as reviewed by phh (Reviewer). 
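
The constructor change described in the quoted PR text can be sketched with stand-in types. This is only an illustration under stated assumptions: `Node`, `CastNodeOld` and `CastNodeNew` below are hypothetical miniature classes, not the C2 classes, and the sketch assumes that input slot 0 holds the control input.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for a graph node. Assumption of this sketch: slot 0 is control.
struct Node {
    std::vector<Node*> in;
    explicit Node(std::size_t req) : in(req, nullptr) {}
    void set_req(std::size_t i, Node* n) { in.at(i) = n; }
};

// Before: the constructor only takes the value, the caller wires control.
struct CastNodeOld : Node {
    explicit CastNodeOld(Node* value) : Node(2) { set_req(1, value); }
};

// After: the constructor accepts the control node as well.
struct CastNodeNew : Node {
    CastNodeNew(Node* ctrl, Node* value) : Node(2) {
        set_req(0, ctrl);
        set_req(1, value);
    }
};

// Both construction styles produce the same inputs.
inline void demo_cast_construction() {
    Node ctrl(1);
    Node value(1);

    CastNodeOld two_step(&value);
    two_step.set_req(0, &ctrl);        // separate call, easy to forget

    CastNodeNew one_step(&ctrl, &value);

    assert(two_step.in == one_step.in);
}
```

The one-step form removes the window in which a cast node exists without its control input.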
------------- PR Review: https://git.openjdk.org/jdk/pull/17162#pullrequestreview-1793792668 From phh at openjdk.org Thu Dec 21 23:54:48 2023 From: phh at openjdk.org (Paul Hohensee) Date: Thu, 21 Dec 2023 23:54:48 GMT Subject: RFR: 8322490: CastNode constructors accepts control node as input [v2] In-Reply-To: References: Message-ID: On Thu, 21 Dec 2023 07:09:55 GMT, Christian Hagedorn wrote: >> Joshua Cao has updated the pull request incrementally with one additional commit since the last revision: >> >> Convert some CastIINode instantiations to use the constructor with ctrl >> node > > Otherwise, the cleanup looks good. @chhagedorn, Josh applied your requested changes, please re-review. ------------- PR Comment: https://git.openjdk.org/jdk/pull/17162#issuecomment-1867061593 From duke at openjdk.org Fri Dec 22 00:46:04 2023 From: duke at openjdk.org (Joshua Cao) Date: Fri, 22 Dec 2023 00:46:04 GMT Subject: RFR: 8322490: cleanup CastNode construction [v4] In-Reply-To: References: Message-ID: > It is a common pattern to have: > > > Node* n = new CastNode(...); > n->set_req(control_node); > > > We can modify the constructor to set the control node. It makes the code a little tidier. 
> > Passes tier1 locally on my Linux machine Joshua Cao has updated the pull request incrementally with one additional commit since the last revision: Cleanup make_cast functions ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17162/files - new: https://git.openjdk.org/jdk/pull/17162/files/69c796b6..574617ef Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17162&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17162&range=02-03 Stats: 59 lines in 5 files changed: 7 ins; 31 del; 21 mod Patch: https://git.openjdk.org/jdk/pull/17162.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17162/head:pull/17162 PR: https://git.openjdk.org/jdk/pull/17162 From duke at openjdk.org Fri Dec 22 00:46:04 2023 From: duke at openjdk.org (Joshua Cao) Date: Fri, 22 Dec 2023 00:46:04 GMT Subject: RFR: 8322490: cleanup CastNode construction [v2] In-Reply-To: References: Message-ID: On Thu, 21 Dec 2023 07:09:55 GMT, Christian Hagedorn wrote: >> Joshua Cao has updated the pull request incrementally with one additional commit since the last revision: >> >> Convert some CastIINode instantiations to use the constructor with ctrl >> node > > Otherwise, the cleanup looks good. @chhagedorn I changed my mind. Added the make_cast cleanups to this PR as well. ------------- PR Comment: https://git.openjdk.org/jdk/pull/17162#issuecomment-1867091490 From jiefu at openjdk.org Fri Dec 22 02:57:07 2023 From: jiefu at openjdk.org (Jie Fu) Date: Fri, 22 Dec 2023 02:57:07 GMT Subject: RFR: 8322661: Missing jvmtiExport.hpp after JDK-8320139 Message-ID: Add jvmtiExport.hpp in jvmciCompilerToVMInit.cpp to fix the build failure. Thanks. 
------------- Commit messages: - 8322661: Missing jvmtiExport.hpp after JDK-8320139 Changes: https://git.openjdk.org/jdk/pull/17182/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=17182&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8322661 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/17182.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17182/head:pull/17182 PR: https://git.openjdk.org/jdk/pull/17182 From epeter at openjdk.org Fri Dec 22 09:19:47 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 22 Dec 2023 09:19:47 GMT Subject: RFR: 8310711: [IR Framework] Remove safepoint while printing handling In-Reply-To: References: Message-ID: On Fri, 1 Dec 2023 12:47:48 GMT, Christian Hagedorn wrote: > This clean-up PR removes the handling of the `` message in the IR framework. It is no longer required since we dump the output of `PrintIdeal` to the hotspot_pid file differently since [JDK-8306922](https://bugs.openjdk.org/browse/JDK-8306922). There is no interrupting `` message anymore. I removed the corresponding now unneeded code together with the previously added test case for it. > > Testing: tier1-4 > > Thanks, > Christian Thanks for the cleanup @chhagedorn , LGTM! ------------- Marked as reviewed by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/16921#pullrequestreview-1794196568 From epeter at openjdk.org Fri Dec 22 09:46:48 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 22 Dec 2023 09:46:48 GMT Subject: RFR: 8305638: Refactor Template Assertion Predicate Bool creation and Predicate code in Split If and Loop Unswitching In-Reply-To: References: Message-ID: On Wed, 29 Nov 2023 08:42:41 GMT, Christian Hagedorn wrote: > This patch is intended for JDK 23. 
> > While preparing the patch for the full fix for Assertion Predicates [JDK-8288981](https://bugs.openjdk.org/browse/JDK-8288981), I still noticed that some changes are not required for the actual fix and could be split off and reviewed separately in this PR. > > The patch applies the following cleanup changes: > - The complete fix had to add slightly different cloning cases in `PhaseIdealLoop::create_bool_from_template_assertion_predicate()` which already has quite some logic to switch between different cases. Additionally, the algorithm in the method itself was already hard to understand and difficult to adapt. I therefore re-implemented it in a separate class `CloneTemplateAssertionPredicateBool` together with some helper classes like `DFSNodeStack`. To use it, I've added a `TemplateAssertionPredicateBool` class that offers three cloning possibilities: > - `clone()`: Clone without modification > - `clone_and_replace_opaque_loop_nodes()`: Clone and replace the `OpaqueLoop*Nodes` with a new init and stride node. > - `clone_and_replace_init()`: Special case of `clone_and_replace_opaque_loop_nodes()` which only replaces `OpaqueLoopInitNode` and clones `OpaqueLoopStrideNode`. > > This refactoring could be extracted from the complete fix. > - The Split If code to detect (`subgraph_has_opaque()`) and clone Template Assertion Predicate Bools was extracted to a separate class `CloneTemplateAssertionPredicateBoolDown` and uses the new `TemplateAssertionPredicateBool` class to do the actual cloning. > - In the process of coding the complete fix, I've refactored the Loop Unswitching code quite a bit. This change could also be extracted into a separate RFE. 
Changes include: > - Renaming > - Extracting code to separate classes/methods > - Adding comments > - Some small refactoring including: > - Removing unused parameters > - Renaming variables/parameters/methods > > Thanks, > Christian src/hotspot/share/opto/loopPredicate.cpp line 398: > 396: Deoptimization::DeoptReason reason, > 397: ParsePredicateSuccessProj* parse_predicate_proj) { > 398: Node* opaque4_node = iff->in(1); `assert(opaque4->Opcode() == Op_Opaque4, "must be Opaque4");` from `create_bool_from_template_assertion_predicate` ? src/hotspot/share/opto/predicates.hpp line 297: > 295: opcode == Op_ConvI2L || > 296: opcode == Op_CastII); > 297: } How did you come up with this exact list? Could this list ever change? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1434890967 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1434896675 From epeter at openjdk.org Fri Dec 22 09:53:52 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 22 Dec 2023 09:53:52 GMT Subject: RFR: 8305638: Refactor Template Assertion Predicate Bool creation and Predicate code in Split If and Loop Unswitching In-Reply-To: References: Message-ID: On Wed, 29 Nov 2023 08:42:41 GMT, Christian Hagedorn wrote: > This patch is intended for JDK 23. > > While preparing the patch for the full fix for Assertion Predicates [JDK-8288981](https://bugs.openjdk.org/browse/JDK-8288981), I still noticed that some changes are not required for the actual fix and could be split off and reviewed separately in this PR. > > The patch applies the following cleanup changes: > - The complete fix had to add slightly different cloning cases in `PhaseIdealLoop::create_bool_from_template_assertion_predicate()` which already has quite some logic to switch between different cases. Additionally, the algorithm in the method itself was already hard to understand and difficult to adapt. 
I therefore re-implemented it in a separate class `CloneTemplateAssertionPredicateBool` together with some helper classes like `DFSNodeStack`. To use it, I've added a `TemplateAssertionPredicateBool` class that offers three cloning possibilities:
>   - `clone()`: Clone without modification
>   - `clone_and_replace_opaque_loop_nodes()`: Clone and replace the `OpaqueLoop*Nodes` with a new init and stride node.
>   - `clone_and_replace_init()`: Special case of `clone_and_replace_opaque_loop_nodes()` which only replaces `OpaqueLoopInitNode` and clones `OpaqueLoopStrideNode`.
>
>   This refactoring could be extracted from the complete fix.
> - The Split If code to detect (`subgraph_has_opaque()`) and clone Template Assertion Predicate Bools was extracted to a separate class `CloneTemplateAssertionPredicateBoolDown` and uses the new `TemplateAssertionPredicateBool` class to do the actual cloning.
> - In the process of coding the complete fix, I've refactored the Loop Unswitching code quite a bit. This change could also be extracted into a separate RFE. Changes include:
>   - Renaming
>   - Extracting code to separate classes/methods
>   - Adding comments
>   - Some small refactoring including:
>     - Removing unused parameters
>     - Renaming variables/parameters/methods
>
> Thanks,
> Christian

src/hotspot/share/opto/predicates.cpp line 370:

> 368: };
> 369:
> 370: // This class caches a single OpaqueLoopInitNode and OpaqueLoopStrideNode. If the node is not cached, yet, we clone it

Suggestion:

// This class caches a single OpaqueLoopInitNode and OpaqueLoopStrideNode. If the node is not cached yet, we clone it

src/hotspot/share/opto/predicates.cpp line 372:

> 370: // This class caches a single OpaqueLoopInitNode and OpaqueLoopStrideNode. If the node is not cached, yet, we clone it
> 371: // and store the clone in the cache to be returned for subsequent calls.
> 372: class CachedOpaqueLoopNodes {

Maybe it should say that it caches the cloned opaque loop nodes in the name?
Right now I would think you are caching the original loop nodes, at least at first.

Suggestion:

class ClonedOpaqueLoopNodesCache {

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1434900721
PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1434902410

From mli at openjdk.org Fri Dec 22 10:20:50 2023
From: mli at openjdk.org (Hamlin Li)
Date: Fri, 22 Dec 2023 10:20:50 GMT
Subject: RFR: 8317721: RISC-V: Implement CRC32 intrinsic
In-Reply-To:
References:
Message-ID:

On Mon, 11 Dec 2023 15:56:24 GMT, ArsenyBochkarev wrote:

> Performance comparison for disabling/enabling Zba on StarFive VisionFive 2 board:
> `-XX:-UseZba`:

Benchmark                        (count)  Mode   Cnt  Score    Error  Units
CRC32.TestCRC32.testCRC32Update  512      thrpt  12   512.550  1.718  ops/ms
CRC32.TestCRC32.testCRC32Update  2048     thrpt  12   130.396  0.341  ops/ms
CRC32.TestCRC32.testCRC32Update  16384    thrpt  12   16.319   0.073  ops/ms
CRC32.TestCRC32.testCRC32Update  65536    thrpt  12   3.913    0.011  ops/ms

`-XX:+UseZba`:

Benchmark                        (count)  Mode   Cnt  Score    Error  Units
CRC32.TestCRC32.testCRC32Update  512      thrpt  12   623.173  0.651  ops/ms
CRC32.TestCRC32.testCRC32Update  2048     thrpt  12   158.965  0.376  ops/ms
CRC32.TestCRC32.testCRC32Update  16384    thrpt  12   19.934   0.055  ops/ms
CRC32.TestCRC32.testCRC32Update  65536    thrpt  12   4.730    0.007  ops/ms

`-XX:-UseCRC32Intrinsics`:

Benchmark                        (count)  Mode   Cnt  Score    Error  Units
CRC32.TestCRC32.testCRC32Update  512      thrpt  12   520.965  5.651  ops/ms
CRC32.TestCRC32.testCRC32Update  2048     thrpt  12   169.591  0.747  ops/ms
CRC32.TestCRC32.testCRC32Update  16384    thrpt  12   22.624   0.139  ops/ms
CRC32.TestCRC32.testCRC32Update  65536    thrpt  12   5.430    0.016  ops/ms

Seems there is regression when `count >= 512`, especially when `count >= 2048`. And I suppose that big message is common case for CRC32 usage?
------------- PR Comment: https://git.openjdk.org/jdk/pull/17046#issuecomment-1867505012 From davleopo at openjdk.org Fri Dec 22 11:23:05 2023 From: davleopo at openjdk.org (David Leopoldseder) Date: Fri, 22 Dec 2023 11:23:05 GMT Subject: RFR: 8322636: [JVMCI] HotSpotSpeculationLog can be inconsistent across a single compile Message-ID: This PR fixes a subtle inconsistency in `HotSpotSpeculationLog` . Normal uses of `HotSpotSpeculationLog` work by using a `SpeculationReason` and asking the speculation log via `maySpeculate` if the speculation can be performed, i.e., if it failed before for the given method. An example for this can be seen in Graal https://github.com/oracle/graal/blob/master/compiler/src/jdk.graal.compiler/src/jdk/graal/compiler/nodes/loop/CountedLoopInfo.java#L591C15-L591C15 The implicit assumption is that the speculation log, `HotSpotSpeculationLog` in particular collects failed speculations at the beginning of a compile and then stays consistent during the compile. Why is that? - Because if there are new failed speculations added to the failed speculations during the compile - the compiler would speculate again on those in an inconsistent way. E.g. at the beginning of a compile a certain speculation has not failed yet and the compiler thinks it can do optimization xyz using a speculation - later during the compilation process it consults the speculation log but gets a different answer. All those inconsistent speculations that already failed will anyway later fail code installation in jvmci (they will throw a bailout during `HotSpotCodeCacheProvider#installCode` https://github.com/openjdk/jdk/blob/master/src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotSpeculationLog.java#L192 ). Thus, we should at least return a consistent result during a compile. 
The problem for consistency here, that also makes troubles on the graal side, is that `maySpeculate` itself can collect failed speculations if there have not been any previously, i.e., `failedSpeculations == null`. In order to make the speculation log consistent across an entire JVMCI compile this PR removes the collection of failed speculations in `maySpeculate`. ------------- Commit messages: - 8322636: [JVMCI] HotSpotSpeculationLog can be inconsistent across a single compile Changes: https://git.openjdk.org/jdk/pull/17183/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=17183&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8322636 Stats: 3 lines in 1 file changed: 0 ins; 3 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/17183.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17183/head:pull/17183 PR: https://git.openjdk.org/jdk/pull/17183 From davleopo at openjdk.org Fri Dec 22 11:23:05 2023 From: davleopo at openjdk.org (David Leopoldseder) Date: Fri, 22 Dec 2023 11:23:05 GMT Subject: RFR: 8322636: [JVMCI] HotSpotSpeculationLog can be inconsistent across a single compile In-Reply-To: References: Message-ID: On Fri, 22 Dec 2023 09:55:16 GMT, David Leopoldseder wrote: > This PR fixes a subtle inconsistency in `HotSpotSpeculationLog` . > > Normal uses of `HotSpotSpeculationLog` work by using a `SpeculationReason` and asking the speculation log via `maySpeculate` if the speculation can be performed, i.e., if it failed before for the given method. An example for this can be seen in Graal https://github.com/oracle/graal/blob/master/compiler/src/jdk.graal.compiler/src/jdk/graal/compiler/nodes/loop/CountedLoopInfo.java#L591C15-L591C15 > The implicit assumption is that the speculation log, `HotSpotSpeculationLog` in particular collects failed speculations at the beginning of a compile and then stays consistent during the compile. Why is that? 
- Because if there are new failed speculations added to the failed speculations during the compile - the compiler would speculate again on those in an inconsistent way. E.g. at the beginning of a compile a certain speculation has not failed yet and the compiler thinks it can do optimization xyz using a speculation - later during the compilation process it consults the speculation log but gets a different answer. All those inconsistent speculations that already failed will anyway later fail code installation in jvmci (they will throw a bailout during `HotSpotCodeCacheProvider#installCode` https://github.com/openjdk/jdk/blob/master/src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotSpeculationLog.java#L192 ). Thus, we should at least return a consistent result during a compile. > The problem for consistency here, that also makes troubles on the graal side, is that `maySpeculate` itself can collect failed speculations if there have not been any previously, i.e., `failedSpeculations == null`. > In order to make the speculation log consistent across an entire JVMCI compile this PR removes the collection of failed speculations in `maySpeculate`. cc @dougxc @tkrodriguez can you have a look ------------- PR Comment: https://git.openjdk.org/jdk/pull/17183#issuecomment-1867567504 From mdoerr at openjdk.org Fri Dec 22 11:31:49 2023 From: mdoerr at openjdk.org (Martin Doerr) Date: Fri, 22 Dec 2023 11:31:49 GMT Subject: RFR: 8322294: Cleanup NativePostCallNop In-Reply-To: <6LS57mCF2fgaosnyfnNydaqfT3cD3F42xsDOujG5SgY=.2db5f614-f64d-4fe4-8e68-1c06e70205d3@github.com> References: <6LS57mCF2fgaosnyfnNydaqfT3cD3F42xsDOujG5SgY=.2db5f614-f64d-4fe4-8e68-1c06e70205d3@github.com> Message-ID: On Mon, 18 Dec 2023 22:05:32 GMT, Richard Reingruber wrote: > This is a refactoring/cleanup of `NativePostCallNop` that simplifies the ppc64 port (dependent pr https://github.com/openjdk/jdk/pull/17171). 
> > * `frame::get_oop_map()` is moved to shared code > > * encoding / decoding details of the oopmap slot and the CodeBlob offset are moved from shared code to the platform dependent implementations of `bool NativePostCallNop::patch(int32_t oopmap_slot, int32_t cb_offset)` and `bool NativePostCallNop::decode(int32_t& oopmap_slot, int32_t& cb_offset)` > > The change passed our CI testing. JTReg tests: tier1-4 of hotspot and jdk. All of Langtools and jaxp. SPECjvm2008, SPECjbb2015, Renaissance Suite, and SAP specific tests. > All testing was done with fastdebug and release builds on the main platforms and also on Linux/PPC64le and AIX. Nice refactoring! I couldn't spot any bug. Only minor suggestions. src/hotspot/cpu/s390/nativeInst_s390.hpp line 661: > 659: bool check() const { Unimplemented(); return false; } > 660: bool decode(int32_t& oopmap_slot, int32_t& cb_offset) const { return false; } > 661: bool patch(int32_t oopmap_slot, int32_t cb_offset) { Unimplemented() ; return false; } Whitespace between `()` and `;`. src/hotspot/share/runtime/frame.inline.hpp line 109: > 107: inline const ImmutableOopMap* frame::get_oop_map() const { > 108: if (_cb == nullptr) return nullptr; > 109: if (_cb->oop_maps() != nullptr) { Could be shorter: `if (_cb == nullptr || _cb->oop_maps() == nullptr) return nullptr;` ------------- Marked as reviewed by mdoerr (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/17150#pullrequestreview-1794338004 PR Review Comment: https://git.openjdk.org/jdk/pull/17150#discussion_r1434966035 PR Review Comment: https://git.openjdk.org/jdk/pull/17150#discussion_r1434968074 From epeter at openjdk.org Fri Dec 22 11:59:57 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 22 Dec 2023 11:59:57 GMT Subject: RFR: 8305638: Refactor Template Assertion Predicate Bool creation and Predicate code in Split If and Loop Unswitching In-Reply-To: References: Message-ID: On Wed, 29 Nov 2023 08:42:41 GMT, Christian Hagedorn wrote: > This patch is intended for JDK 23. > > While preparing the patch for the full fix for Assertion Predicates [JDK-8288981](https://bugs.openjdk.org/browse/JDK-8288981), I still noticed that some changes are not required for the actual fix and could be split off and reviewed separately in this PR. > > The patch applies the following cleanup changes: > - The complete fix had to add slightly different cloning cases in `PhaseIdealLoop::create_bool_from_template_assertion_predicate()` which already has quite some logic to switch between different cases. Additionally, the algorithm in the method itself was already hard to understand and difficult to adapt. I therefore re-implemented it in a separate class `CloneTemplateAssertionPredicateBool` together with some helper classes like `DFSNodeStack`. To use it, I've added a `TemplateAssertionPredicateBool` class that offers three cloning possibilities: > - `clone()`: Clone without modification > - `clone_and_replace_opaque_loop_nodes()`: Clone and replace the `OpaqueLoop*Nodes` with a new init and stride node. > - `clone_and_replace_init()`: Special case of `clone_and_replace_opaque_loop_nodes()` which only replaces `OpaqueLoopInitNode` and clones `OpaqueLoopStrideNode`. > > This refactoring could be extracted from the complete fix. 
> - The Split If code to detect (`subgraph_has_opaque()`) and clone Template Assertion Predicate Bools was extracted to a separate class `CloneTemplateAssertionPredicateBoolDown` and uses the new `TemplateAssertionPredicateBool` class to do the actual cloning. > - In the process of coding the complete fix, I've refactored the Loop Unswitching code quite a bit. This change could also be extracted into a separate RFE. Changes include: > - Renaming > - Extracting code to separate classes/methods > - Adding comments > - Some small refactoring including: > - Removing unused parameters > - Renaming variables/parameters/methods > > Thanks, > Christian Nice work! I'm sending out a first batch of comments before lunch. Will continue through more later. src/hotspot/share/opto/loopPredicate.cpp line 401: > 399: TemplateAssertionPredicateBool template_assertion_predicate_bool(opaque4_node->in(1)); > 400: BoolNode* bol = template_assertion_predicate_bool.clone(parse_predicate_proj, this); > 401: opaque4_node = clone_and_register(opaque4_node, parse_predicate_proj); Why do you now clone the opaque node here, and not as part of `template_assertion_predicate_bool.clone`? src/hotspot/share/opto/loopTransform.cpp line 1390: > 1388: // Is 'n' a node that can be found on the input chain of a Template Assertion Predicate bool (i.e. between a Template > 1389: // Assertion Predicate If node and the OpaqueLoop* nodes)? > 1390: static bool is_part_of_template_assertion_predicate_bool(Node* n) { Yeah, this name was not very good, "expression" would have been better than "bool". src/hotspot/share/opto/loopTransform.cpp line 1462: > 1460: } > 1461: opaque4_node = clone_and_register(opaque4_node, control); > 1462: _igvn.replace_input_of(opaque4_node, 1, new_bool); Again, why do we need to separately clone the `opaque4_node`? And would it not be better to unify the two clone methods, and just pass the `nullptr`, which means clone, and non-nullptr means replace? 
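A minimal sketch of the unification asked about above, where one entry point treats `nullptr` as "clone" and a non-null argument as "replace". `ToyNode` and `clone_or_replace` are hypothetical stand-ins for illustration only; they are not the HotSpot `Node` class or any method from the PR.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a C2 node; not the HotSpot Node class.
struct ToyNode {
  std::string name;
  ToyNode* input;  // single input edge, may be nullptr
};

// One entry point instead of two clone methods: a nullptr replacement means
// "clone the given node", a non-null replacement means "use this node instead".
// When a clone is made, the caller owns the returned node.
ToyNode* clone_or_replace(const ToyNode* opaque, ToyNode* replacement) {
  assert(opaque != nullptr && "need a node to transform");
  if (replacement != nullptr) {
    return replacement;
  }
  return new ToyNode{opaque->name + "_clone", opaque->input};
}
```

The trade-off is that the call site no longer says which of the two behaviours it wants; the reader has to know what a `nullptr` argument means.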
src/hotspot/share/opto/predicates.cpp line 180: > 178: // Let node s be the next node being visited after node n in the DFS traversal. The following holds: > 179: // n->in(i) = s > 180: class DFSNodeStack : public StackObj { It could be nice to generalize this to a `DFSInputIterator`, which takes some filter function (either as lambda / functional or as a template argument for better inlining). I suspect we would use this in quite a few other places in the code, and this could reduce code duplication in the future, and make code much easier to read. src/hotspot/share/opto/predicates.cpp line 182: > 180: class DFSNodeStack : public StackObj { > 181: Node_Stack _stack; > 182: static const uint _no_inputs_visited_yet = 0; Suggestion: static const uint NO_INPUTS_VISITED_YET = 0; src/hotspot/share/opto/predicates.cpp line 205: > 203: } > 204: } > 205: return false; What if we iterate over a whole list of inputs, and none `could_be_part`? Then we only advance the `index` by one inside `increment_top_node_input_index`. Would it not be better to simply read the `index` at the beginning, and set it just before `return false;`? src/hotspot/share/opto/predicates.cpp line 228: > 226: void increment_top_node_input_index() { > 227: _stack.set_index(_stack.index() + 1); > 228: } could be private src/hotspot/share/opto/predicates.cpp line 237: > 235: // Interface to transform OpaqueLoop* nodes of a Template Assertion Predicate Bool. The transformations must return a > 236: // new or different existing node. > 237: class TransformOpaqueLoopNodes : public StackObj { Suggestion: class TransformStrategyForOpaqueLoopNodes : public StackObj { I would like to have the Strategy in it. 
And then you could rename: `CloneOpaqueLoopNodes` -> `CloneInitAndStride(Transform)StrategyForOpaqueLoopNodes` `CloneWithNewInit` -> `ReplaceInitAndCloneStride(Transform)StrategyForOpaqueLoopNodes` `ReplaceOpaqueLoopNodes` -> `ReplaceInitAndStride(Transform)StrategyForOpaqueLoopNodes` Yes, the names are longer, but it would be much clearer what they are for, in the places where they are used. I was struggling to read `TemplateAssertionPredicateBool::clone`, and had to dig inside `clone_opaque_loop_nodes` to understand. src/hotspot/share/opto/predicates.cpp line 349: > 347: _index_before_cloning(phase->C->unique()), > 348: _ctrl_for_clones(ctrl_for_clones) > 349: DEBUG_ONLY(COMMA _found_init(false)) {} Detail, but I would prefer if the constructor came right after the member fields. It would help me know how things are initialized quicker. src/hotspot/share/opto/predicates.cpp line 401: > 399: > 400: // The transformations of this class clone the existing OpaqueLoop* nodes without any other update. > 401: class CloneOpaqueLoopNodes : public TransformOpaqueLoopNodes { Would it not be much simpler, if you just had one such Transform class, which can have both a new init and stride node as input, but if they are nullptr, then we just clone instead? It would also simplify the code in `clone_assertion_predicate_and_initialize`. src/hotspot/share/opto/predicates.hpp line 270: > 268: // A Template Assertion Predicate Bool represents the BoolNode for the initial value or the last value of a > 269: // Template Assertion Predicate and all the nodes up to and including the OpaqueLoop* nodes. > 270: class TemplateAssertionPredicateBool : public StackObj { The current name targets the "bool" node, and it is not clear that you actually mean to represent the whole assertion-predicate expression. Hence: `TemplateAssertionPredicateBool` -> `TemplateAssertionPredicateExpression` Then `could_be_part` makes more sense. Maybe still, you could rename it to `maybe_contains`? 
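The renaming suggested above can be sketched roughly as follows, with the optional `(Transform)` part of the names left out. This is a toy model under stated assumptions: `ToyNode`, `toy_clone` and `describe` are hypothetical helpers, and only the class names follow the suggestion; the real classes in `predicates.cpp` work on C2 nodes and look different.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-in for an OpaqueLoop* node.
struct ToyNode { std::string name; };

inline ToyNode toy_clone(const ToyNode& n) { return ToyNode{n.name + "_clone"}; }

// Interface: each strategy says what happens to the init and the stride node.
class TransformStrategyForOpaqueLoopNodes {
 public:
  virtual ~TransformStrategyForOpaqueLoopNodes() = default;
  virtual ToyNode transform_init(const ToyNode& init) const = 0;
  virtual ToyNode transform_stride(const ToyNode& stride) const = 0;
};

// Clones both nodes without any other update.
class CloneInitAndStrideStrategyForOpaqueLoopNodes : public TransformStrategyForOpaqueLoopNodes {
 public:
  ToyNode transform_init(const ToyNode& init) const override { return toy_clone(init); }
  ToyNode transform_stride(const ToyNode& stride) const override { return toy_clone(stride); }
};

// Replaces the init node with a new one and clones the stride node.
class ReplaceInitAndCloneStrideStrategyForOpaqueLoopNodes : public TransformStrategyForOpaqueLoopNodes {
  ToyNode _new_init;
 public:
  explicit ReplaceInitAndCloneStrideStrategyForOpaqueLoopNodes(ToyNode new_init)
      : _new_init(std::move(new_init)) {
    assert(!_new_init.name.empty() && "replacement init must be a real node");
  }
  ToyNode transform_init(const ToyNode&) const override { return _new_init; }
  ToyNode transform_stride(const ToyNode& stride) const override { return toy_clone(stride); }
};

// Hypothetical call site: the strategy name alone tells the reader what is done.
std::string describe(const TransformStrategyForOpaqueLoopNodes& strategy,
                     const ToyNode& init, const ToyNode& stride) {
  return strategy.transform_init(init).name + "," + strategy.transform_stride(stride).name;
}
```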
But is it really only "maybe" or "could"? Maybe add some explanation in what cases it looks like it could be part of it, but is actually not? Also, expression would help to make more sense of clone, since it clones the whole expression! ------------- Changes requested by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/16877#pullrequestreview-1794251755 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1434968527 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1434988447 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1434993226 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1434950332 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1434945062 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1434952558 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1434958185 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1434919725 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1434910049 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1434991996 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1434986923 From epeter at openjdk.org Fri Dec 22 11:59:59 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 22 Dec 2023 11:59:59 GMT Subject: RFR: 8305638: Refactor Template Assertion Predicate Bool creation and Predicate code in Split If and Loop Unswitching In-Reply-To: References: Message-ID: On Fri, 22 Dec 2023 10:52:13 GMT, Emanuel Peter wrote: >> This patch is intended for JDK 23. >> >> While preparing the patch for the full fix for Assertion Predicates [JDK-8288981](https://bugs.openjdk.org/browse/JDK-8288981), I still noticed that some changes are not required for the actual fix and could be split off and reviewed separately in this PR. 
>> >> The patch applies the following cleanup changes: >> - The complete fix had to add slightly different cloning cases in `PhaseIdealLoop::create_bool_from_template_assertion_predicate()` which already has quite some logic to switch between different cases. Additionally, the algorithm in the method itself was already hard to understand and difficult to adapt. I therefore re-implemented it in a separate class `CloneTemplateAssertionPredicateBool` together with some helper classes like `DFSNodeStack`. To use it, I've added a `TemplateAssertionPredicateBool` class that offers three cloning possibilities: >> - `clone()`: Clone without modification >> - `clone_and_replace_opaque_loop_nodes()`: Clone and replace the `OpaqueLoop*Nodes` with a new init and stride node. >> - `clone_and_replace_init()`: Special case of `clone_and_replace_opaque_loop_nodes()` which only replaces `OpaqueLoopInitNode` and clones `OpaqueLoopStrideNode`. >> >> This refactoring could be extracted from the complete fix. >> - The Split If code to detect (`subgraph_has_opaque()`) and clone Template Assertion Predicate Bools was extracted to a separate class `CloneTemplateAssertionPredicateBoolDown` and uses the new `TemplateAssertionPredicateBool` class to do the actual cloning. >> - In the process of coding the complete fix, I've refactored the Loop Unswitching code quite a bit. This change could also be extracted into a separate RFE. Changes include: >> - Renaming >> - Extracting code to separate classes/methods >> - Adding comments >> - Some small refactoring including: >> - Removing unused parameters >> - Renaming variables/parameters/methods >> >> Thanks, >> Christian > > src/hotspot/share/opto/predicates.cpp line 180: > >> 178: // Let node s be the next node being visited after node n in the DFS traversal. 
The following holds: >> 179: // n->in(i) = s >> 180: class DFSNodeStack : public StackObj { > > It could be nice to generalize this to a `DFSInputIterator`, which takes some filter function (either as lambda / functional or as a template argument for better inlining). > I suspect we would use this in quite a few other places in the code, and this could reduce code duplication in the future, and make code much easier to read. Hmm. You are doing more than a simple traversal though. so not sure how feasible this is. > src/hotspot/share/opto/predicates.cpp line 237: > >> 235: // Interface to transform OpaqueLoop* nodes of a Template Assertion Predicate Bool. The transformations must return a >> 236: // new or different existing node. >> 237: class TransformOpaqueLoopNodes : public StackObj { > > Suggestion: > > class TransformStrategyForOpaqueLoopNodes : public StackObj { > > I would like to have the Strategy in it. > And then you could rename: > `CloneOpaqueLoopNodes` -> `CloneInitAndStride(Transform)StrategyForOpaqueLoopNodes` > `CloneWithNewInit` -> `ReplaceInitAndCloneStride(Transform)StrategyForOpaqueLoopNodes` > `ReplaceOpaqueLoopNodes` -> `ReplaceInitAndStride(Transform)StrategyForOpaqueLoopNodes` > Yes, the names are longer, but it would be much clearer what they are for, in the places where they are used. > > I was struggling to read `TemplateAssertionPredicateBool::clone`, and had to dig inside `clone_opaque_loop_nodes` to understand. 
Also look at this comment: https://github.com/openjdk/jdk/pull/16877#discussion_r1434991996 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1434960576 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1434996269 From jiefu at openjdk.org Fri Dec 22 12:10:57 2023 From: jiefu at openjdk.org (Jie Fu) Date: Fri, 22 Dec 2023 12:10:57 GMT Subject: RFR: 8320139: [JVMCI] VmObjectAlloc is not generated by intrinsics methods which allocate objects [v2] In-Reply-To: <0Um1zwgyjW8IXcISUsf7s2fvfsD03JD7esf112bGa_g=.7b483632-9193-49ec-87d4-0ad5adbab234@github.com> References: <0Um1zwgyjW8IXcISUsf7s2fvfsD03JD7esf112bGa_g=.7b483632-9193-49ec-87d4-0ad5adbab234@github.com> Message-ID: On Tue, 19 Dec 2023 07:31:05 GMT, Raphael Mosaner wrote: >> This PR exports a pointer to `JvmtiExport::_should_notify_object_alloc` via JVMCI to enable intrinsification of unsafe allocations in accordance to C2. > > Raphael Mosaner has updated the pull request incrementally with one additional commit since the last revision: > > [JVMCI] Documentation for _should_notify_object_alloc export. Hi, the build failure was observed after this patch. Please take a look: https://github.com/openjdk/jdk/pull/17182 . Thanks. ------------- PR Comment: https://git.openjdk.org/jdk/pull/16980#issuecomment-1867612318 From chagedorn at openjdk.org Fri Dec 22 13:17:51 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 22 Dec 2023 13:17:51 GMT Subject: RFR: 8322490: cleanup CastNode construction [v4] In-Reply-To: References: Message-ID: On Fri, 22 Dec 2023 00:46:04 GMT, Joshua Cao wrote: >> It is a common pattern to have: >> >> >> Node* n = new CastNode(...); >> n->set_req(control_node); >> >> >> We can modify the constructor to set the control node. It makes the code a little tidier. 
>> >> Passes tier1 locally on my Linux machine > > Joshua Cao has updated the pull request incrementally with one additional commit since the last revision: > > Cleanup make_cast functions Otherwise, the update looks good! You can directly integrate/get it sponsored afterward. src/hotspot/share/opto/castnode.cpp line 126: > 124: } > 125: > 126: Node* ConstraintCastNode::make_cast_for_basic_type(Node* c, Node *n, const Type *t, DependencyType dependency, BasicType bt) { While at it, you can also fix the asterisk positions: Suggestion: Node* ConstraintCastNode::make_cast_for_basic_type(Node* c, Node* n, const Type* t, DependencyType dependency, BasicType bt) { src/hotspot/share/opto/castnode.hpp line 72: > 70: bool carry_dependency() const { return _dependency != RegularDependency; } > 71: TypeNode* dominating_cast(PhaseGVN* gvn, PhaseTransform* pt) const; > 72: static Node* make_cast_for_basic_type(Node* c, Node *n, const Type *t, DependencyType dependency, BasicType bt); Suggestion: static Node* make_cast_for_basic_type(Node* c, Node* n, const Type* t, DependencyType dependency, BasicType bt); src/hotspot/share/opto/library_call.cpp line 1142: > 1140: // length is now known positive, add a cast node to make this explicit > 1141: jlong upper_bound = _gvn.type(length)->is_integer(bt)->hi_as_long(); > 1142: Node *casted_length = ConstraintCastNode::make_cast_for_basic_type( Suggestion: Node* casted_length = ConstraintCastNode::make_cast_for_basic_type( src/hotspot/share/opto/library_call.cpp line 1172: > 1170: > 1171: // index is now known to be >= 0 and < length, cast it > 1172: Node *result = ConstraintCastNode::make_cast_for_basic_type( Suggestion: Node* result = ConstraintCastNode::make_cast_for_basic_type( src/hotspot/share/opto/loopTransform.cpp line 3423: > 3421: > 3422: // We need to pin the exact limit to prevent it from floating above the zero trip guard. 
> 3423: Node *cast_ii = ConstraintCastNode::make_cast_for_basic_type( Suggestion: Node* cast_ii = ConstraintCastNode::make_cast_for_basic_type( ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/17162#pullrequestreview-1794200611 PR Review Comment: https://git.openjdk.org/jdk/pull/17162#discussion_r1434877087 PR Review Comment: https://git.openjdk.org/jdk/pull/17162#discussion_r1434881955 PR Review Comment: https://git.openjdk.org/jdk/pull/17162#discussion_r1434880223 PR Review Comment: https://git.openjdk.org/jdk/pull/17162#discussion_r1434880324 PR Review Comment: https://git.openjdk.org/jdk/pull/17162#discussion_r1434880759 From chagedorn at openjdk.org Fri Dec 22 14:02:48 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 22 Dec 2023 14:02:48 GMT Subject: RFR: 8310711: [IR Framework] Remove safepoint while printing handling In-Reply-To: References: Message-ID: On Fri, 1 Dec 2023 12:47:48 GMT, Christian Hagedorn wrote: > This clean-up PR removes the handling of the `` message in the IR framework. It is no longer required since we dump the output of `PrintIdeal` to the hotspot_pid file differently since [JDK-8306922](https://bugs.openjdk.org/browse/JDK-8306922). There is no interrupting `` message anymore. I removed the corresponding now unneeded code together with the previously added test case for it. > > Testing: tier1-4 > > Thanks, > Christian Thanks Emanuel for your review! I will integrate this in the new year when I'm back again. 
-------------

PR Comment: https://git.openjdk.org/jdk/pull/16921#issuecomment-1867720364

From duke at openjdk.org Fri Dec 22 15:09:41 2023
From: duke at openjdk.org (ArsenyBochkarev)
Date: Fri, 22 Dec 2023 15:09:41 GMT
Subject: RFR: 8317721: RISC-V: Implement CRC32 intrinsic
In-Reply-To:
References:
Message-ID:

On Fri, 22 Dec 2023 10:18:16 GMT, Hamlin Li wrote:

> > Performance comparison for disabling/enabling Zba on StarFive VisionFive 2 board:
>
> `-XX:-UseZba`:
>
> ```
> Benchmark                        (count)   Mode  Cnt    Score  Error   Units
>
> CRC32.TestCRC32.testCRC32Update      512  thrpt   12  512.550  1.718  ops/ms
> CRC32.TestCRC32.testCRC32Update     2048  thrpt   12  130.396  0.341  ops/ms
> CRC32.TestCRC32.testCRC32Update    16384  thrpt   12   16.319  0.073  ops/ms
> CRC32.TestCRC32.testCRC32Update    65536  thrpt   12    3.913  0.011  ops/ms
> ```
>
> `-XX:+UseZba`:
>
> ```
> Benchmark                        (count)   Mode  Cnt    Score  Error   Units
>
> CRC32.TestCRC32.testCRC32Update      512  thrpt   12  623.173  0.651  ops/ms
> CRC32.TestCRC32.testCRC32Update     2048  thrpt   12  158.965  0.376  ops/ms
> CRC32.TestCRC32.testCRC32Update    16384  thrpt   12   19.934  0.055  ops/ms
> CRC32.TestCRC32.testCRC32Update    65536  thrpt   12    4.730  0.007  ops/ms
> ```
>
> `-XX:-UseCRC32Intrinsics`:
>
> ```
> Benchmark                        (count)   Mode  Cnt    Score  Error   Units
>
> CRC32.TestCRC32.testCRC32Update      512  thrpt   12  520.965  5.651  ops/ms
> CRC32.TestCRC32.testCRC32Update     2048  thrpt   12  169.591  0.747  ops/ms
> CRC32.TestCRC32.testCRC32Update    16384  thrpt   12   22.624  0.139  ops/ms
> CRC32.TestCRC32.testCRC32Update    65536  thrpt   12    5.430  0.016  ops/ms
> ```
>
> Seems there is regression when `count >= 512`, especially when `count >= 2048`. And I suppose that big message is common case for CRC32 usage?

Hmm, I don't know about common CRC32 `count` parameter sizes, maybe others know? Or maybe anyone knows if there are other ways to optimize such plain version of intrinsic more?
------------- PR Comment: https://git.openjdk.org/jdk/pull/17046#issuecomment-1867791244 From chagedorn at openjdk.org Fri Dec 22 15:46:56 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 22 Dec 2023 15:46:56 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v57] In-Reply-To: <4g4SbB2RBLU-ZFcrH_ukdqC_QSoSvibNGanasAFl-lw=.731266a6-9974-402e-954e-e441706426ab@github.com> References: <4g4SbB2RBLU-ZFcrH_ukdqC_QSoSvibNGanasAFl-lw=.731266a6-9974-402e-954e-e441706426ab@github.com> Message-ID: On Thu, 21 Dec 2023 15:40:16 GMT, Emanuel Peter wrote: >> I want to push this in JDK23. >> After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620). >> >> To calm your nerves: most of the changes are in auto-generated tests, and tests in general. >> >> **Context** >> >> `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs). >> >> Alignment is split into two tasks: >> - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`. >> - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width. >> >> **Problem** >> >> I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190). >> In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine. 
>> Thanks @fg1417 for confirming this!
>> Hence, we need to fix the alignment correctness checks.
>>
>> While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly.
>>
>> **Problem Details**
>>
>> Reproducer:
>>
>>
>>     static void test(short[] a, short[] b, short mask) {
>>         for (int i = 0; i < RANGE; i+=8) {
>>             // Problematic for AlignVector
>>             b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0
>>
>>             b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes
>>             b[i+4] = (short)(a[i+4] & mask);
>>             b[i+5] = (short)(a[i+5] & mask);
>>             b[i+6] = (short)(a[i+6] & mask);
>>         }
>>     }
>>
>>
>> During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all.
>>
>> This is problemati...
>
> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:
>
>   Apply suggestions from code review by Christian
>
>   Co-authored-by: Christian Hagedorn

Marked as reviewed by chagedorn (Reviewer).

Thanks a lot Emanuel for all the discussions and for addressing all my comments online and offline :-) It looks very good now and it's easy to follow the logic. The proofs are great and really helpful to better understand the (rather simple in the end) code for proving and calculating the alignment solutions. Thanks for putting the extra effort in here.

I will have another complete look at the entire PR in the new year. But I think it looks good!
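The "pack at offset 6 bytes" in the quoted reproducer can be checked with a little arithmetic. The sketch below is only an illustration under assumptions that are not taken from the PR: the first array element is assumed to sit on an 8-byte boundary, 8 bytes (the width of the four-`short` pack) is assumed to be the required alignment, and `byte_offset` and `can_ever_be_aligned` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>

// Byte offset, relative to the first array element, of element (i + elem_offset)
// in an array whose elements are elem_size bytes wide.
constexpr std::size_t byte_offset(std::size_t i, std::size_t elem_offset, std::size_t elem_size) {
  return (i + elem_offset) * elem_size;
}

// Returns true if, for some i in {0, stride, 2*stride, ...} (checked for the
// given number of iterations), the access lands on an alignment boundary.
// ASSUMPTION: the first array element itself is aligned to `alignment`.
bool can_ever_be_aligned(std::size_t elem_offset, std::size_t elem_size,
                         std::size_t stride, std::size_t alignment,
                         std::size_t iterations) {
  assert(alignment != 0 && stride != 0);
  for (std::size_t k = 0; k < iterations; k++) {
    if (byte_offset(k * stride, elem_offset, elem_size) % alignment == 0) {
      return true;
    }
  }
  return false;
}
```

With `short` elements (2 bytes) and `i += 8`, every iteration advances by 16 bytes, so the pack starting at `b[i+3]` stays at 6 modulo 8 no matter how many iterations run first, while the access at `b[i+0]` is on a boundary in every iteration. Under these assumptions, checking only the reference with the lowest offset says nothing about the pack at `i+3`.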
------------- PR Review: https://git.openjdk.org/jdk/pull/14785#pullrequestreview-1794654777 PR Comment: https://git.openjdk.org/jdk/pull/14785#issuecomment-1867826841 From epeter at openjdk.org Fri Dec 22 15:48:11 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 22 Dec 2023 15:48:11 GMT Subject: RFR: 8305638: Refactor Template Assertion Predicate Bool creation and Predicate code in Split If and Loop Unswitching In-Reply-To: References: Message-ID: On Fri, 22 Dec 2023 13:12:52 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/loopUnswitch.cpp line 117: >> >>> 115: >>> 116: // Perform Loop Unswitching on the loop containing an invariant test that does not exit the loop. The loop is cloned >>> 117: // such that we have two identical loops next to each other - a fast and a slow loop. We modify the loops as follows: >> >> It sounds like they are somehow identical and at the same time modified. >> Suggestion: say that they are "first" cloned to be identical, and then "second" modified such that one becomes the "true-path" and the other the "false-path" loop. I wonder if we can eliminate the "fast / slow" wording, because who really knows which path is faster or slower, right? > > Also: why can the invariant if not be eliminated in the loops as a dominating if, with and independent optimization? Would that not be cleaner? Well I guess maybe we just want to be sure that it happens. Yeah, the naming of "old / new", "true / false", "fast / slow" is confusing and redundant. Would be nice if you said which are the same. Or maybe remove some of these names entirely. I suggest: Old -> True, New -> False. >> src/hotspot/share/opto/loopUnswitch.cpp line 266: >> >>> 264: _selector(create_selector_if(loop, unswitch_if_candidate)), >>> 265: _fast_loop_proj(create_fast_loop_proj()), >>> 266: _slow_loop_proj(create_slow_loop_proj()) {} >> >> It seems to me you could make all fields `const` of some sort, right? > > You will never modify the pointers again. 
Just for good measure: assert that the `unswitch_if_candidate` is inside the `loop`? >> src/hotspot/share/opto/loopUnswitch.cpp line 325: >> >>> 323: _loop(loop), >>> 324: _old_new(old_new), >>> 325: _phase(loop->_phase) {} >> >> Again, I think it would be nice to have the constructor and public member methods at the beginning, right after the field definition. It just helps to see how they belong together better. > > And the fields could be const. And hence the methods could be const as well ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1435061059 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1435107903 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1435132160 From chagedorn at openjdk.org Fri Dec 22 15:48:08 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 22 Dec 2023 15:48:08 GMT Subject: RFR: 8305638: Refactor Template Assertion Predicate Bool creation and Predicate code in Split If and Loop Unswitching In-Reply-To: References: Message-ID: On Wed, 29 Nov 2023 08:42:41 GMT, Christian Hagedorn wrote: > This patch is intended for JDK 23. > > While preparing the patch for the full fix for Assertion Predicates [JDK-8288981](https://bugs.openjdk.org/browse/JDK-8288981), I still noticed that some changes are not required for the actual fix and could be split off and reviewed separately in this PR. > > The patch applies the following cleanup changes: > - The complete fix had to add slightly different cloning cases in `PhaseIdealLoop::create_bool_from_template_assertion_predicate()` which already has quite some logic to switch between different cases. Additionally, the algorithm in the method itself was already hard to understand and difficult to adapt. I therefore re-implemented it in a separate class `CloneTemplateAssertionPredicateBool` together with some helper classes like `DFSNodeStack`. 
To use it, I've added a `TemplateAssertionPredicateBool` class that offers three cloning possibilities: > - `clone()`: Clone without modification > - `clone_and_replace_opaque_loop_nodes()`: Clone and replace the `OpaqueLoop*Nodes` with a new init and stride node. > - `clone_and_replace_init()`: Special case of `clone_and_replace_opaque_loop_nodes()` which only replaces `OpaqueLoopInitNode` and clones `OpaqueLoopStrideNode`. > > This refactoring could be extracted from the complete fix. > - The Split If code to detect (`subgraph_has_opaque()`) and clone Template Assertion Predicate Bools was extracted to a separate class `CloneTemplateAssertionPredicateBoolDown` and uses the new `TemplateAssertionPredicateBool` class to do the actual cloning. > - In the process of coding the complete fix, I've refactored the Loop Unswitching code quite a bit. This change could also be extracted into a separate RFE. Changes include: > - Renaming > - Extracting code to separate classes/methods > - Adding comments > - Some small refactoring including: > - Removing unused parameters > - Renaming variables/parameters/methods > > Thanks, > Christian Thanks Emanuel for your careful review! I will get back to this PR in the new year. ------------- PR Comment: https://git.openjdk.org/jdk/pull/16877#issuecomment-1867831054 From epeter at openjdk.org Fri Dec 22 15:48:08 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 22 Dec 2023 15:48:08 GMT Subject: RFR: 8305638: Refactor Template Assertion Predicate Bool creation and Predicate code in Split If and Loop Unswitching In-Reply-To: References: Message-ID: On Wed, 29 Nov 2023 08:42:41 GMT, Christian Hagedorn wrote: > This patch is intended for JDK 23. > > While preparing the patch for the full fix for Assertion Predicates [JDK-8288981](https://bugs.openjdk.org/browse/JDK-8288981), I still noticed that some changes are not required for the actual fix and could be split off and reviewed separately in this PR. 
> > The patch applies the following cleanup changes: > - The complete fix had to add slightly different cloning cases in `PhaseIdealLoop::create_bool_from_template_assertion_predicate()` which already has quite some logic to switch between different cases. Additionally, the algorithm in the method itself was already hard to understand and difficult to adapt. I therefore re-implemented it in a separate class `CloneTemplateAssertionPredicateBool` together with some helper classes like `DFSNodeStack`. To use it, I've added a `TemplateAssertionPredicateBool` class that offers three cloning possibilities: > - `clone()`: Clone without modification > - `clone_and_replace_opaque_loop_nodes()`: Clone and replace the `OpaqueLoop*Nodes` with a new init and stride node. > - `clone_and_replace_init()`: Special case of `clone_and_replace_opaque_loop_nodes()` which only replaces `OpaqueLoopInitNode` and clones `OpaqueLoopStrideNode`. > > This refactoring could be extracted from the complete fix. > - The Split If code to detect (`subgraph_has_opaque()`) and clone Template Assertion Predicate Bools was extracted to a separate class `CloneTemplateAssertionPredicateBoolDown` and uses the new `TemplateAssertionPredicateBool` class to do the actual cloning. > - In the process of coding the complete fix, I've refactored the Loop Unswitching code quite a bit. This change could also be extracted into a separate RFE. Changes include: > - Renaming > - Extracting code to separate classes/methods > - Adding comments > - Some small refactoring including: > - Removing unused parameters > - Renaming variables/parameters/methods > > Thanks, > Christian Next batch of comments. More coming. src/hotspot/share/opto/loopUnswitch.cpp line 117: > 115: > 116: // Perform Loop Unswitching on the loop containing an invariant test that does not exit the loop. The loop is cloned > 117: // such that we have two identical loops next to each other - a fast and a slow loop. 
We modify the loops as follows: It sounds like they are somehow identical and at the same time modified. Suggestion: say that they are "first" cloned to be identical, and then "second" modified such that one becomes the "true-path" and the other the "false-path" loop. I wonder if we can eliminate the "fast / slow" wording, because who really knows which path is faster or slower, right? src/hotspot/share/opto/loopUnswitch.cpp line 232: > 230: IfFalseNode* _slow_loop_proj; > 231: > 232: IfNode* create_selector_if(IdealLoopTree* loop, IfNode* unswitch_if_candidate) { I'd make this `const` if possible. You call this during initialization in the constructor. It would be nice to know that it does not have side-effects. src/hotspot/share/opto/loopUnswitch.cpp line 241: > 239: unswitch_if_candidate->_fcnt) : > 240: new IfNode(_original_loop_entry, unswitching_candidate_bool, unswitch_if_candidate->_prob, > 241: unswitch_if_candidate->_fcnt); Could be nice to have some utility method that creates either, based on opcode, right? Same pattern exists in `PhaseIdealLoop::insert_if_before_proj`. IfNode* new_if = (opcode == Op_If) ? new IfNode(proj2, bol, iff->_prob, iff->_fcnt): new RangeCheckNode(proj2, bol, iff->_prob, iff->_fcnt); src/hotspot/share/opto/loopUnswitch.cpp line 256: > 254: _phase->register_node(slow_loop_proj, _outer_loop, _selector, _dom_depth); > 255: return slow_loop_proj; > 256: } Looks like code duplication. Suggestion: `create_selector_proj(bool con)` src/hotspot/share/opto/loopUnswitch.cpp line 266: > 264: _selector(create_selector_if(loop, unswitch_if_candidate)), > 265: _fast_loop_proj(create_fast_loop_proj()), > 266: _slow_loop_proj(create_slow_loop_proj()) {} It seems to me you could make all fields `const` of some sort, right? src/hotspot/share/opto/loopUnswitch.cpp line 296: > 294: IdealLoopTree* _loop; > 295: Node_List* _old_new; > 296: PhaseIdealLoop* _phase; Now you add yet another term: Original We already have Old/New, True/False, Slow/Fast. 
I would call this `OldLoop` src/hotspot/share/opto/loopUnswitch.cpp line 300: > 298: void fix_loop_entries(IfProjNode* iffast_pred, IfProjNode* ifslow_pred) { > 299: _phase->replace_loop_entry(_strip_mined_loop_head, iffast_pred); > 300: LoopNode* slow_loop_strip_mined_head = _old_new->at(_strip_mined_loop_head->_idx)->as_Loop(); utility method `old_to_new` could be helpful, and eliminate this line. src/hotspot/share/opto/loopUnswitch.cpp line 325: > 323: _loop(loop), > 324: _old_new(old_new), > 325: _phase(loop->_phase) {} Again, I think it would be nice to have the constructor and public member methods at the beginning, right after the field definition. It just helps to see how they belong together better. src/hotspot/share/opto/loopUnswitch.cpp line 334: > 332: const uint first_slow_loop_node_index = _phase->C->unique(); > 333: _phase->clone_loop(_loop, *_old_new, _phase->dom_depth(_loop_head), > 334: PhaseIdealLoop::CloneIncludesStripMined, loop_selector); I think this used to be: `clone_loop(loop, old_new, dom_depth(head->skip_strip_mined()), mode, iff);` Should it be `dom_depth` of `_strip_mined_loop_head`? src/hotspot/share/opto/loopUnswitch.cpp line 337: > 335: // Fast (true) and Slow (false) control > 336: IfProjNode* iffast_pred = unswitched_loop_selector.fast_loop_proj(); > 337: IfProjNode* ifslow_pred = unswitched_loop_selector.slow_loop_proj(); I'm also not happy that we now have different names: `fast_loop_proj` vs `iffast_pred`. Maybe you need to replace `iffast_pred` wherever it is used? And you could take these 2 lines up to the other one that asks for the selector. src/hotspot/share/opto/loopUnswitch.cpp line 344: > 342: DEBUG_ONLY(verify_unswitched_loops(_loop_head, unswitched_loop_selector, _old_new);) > 343: return loop_selector; > 344: } Should we return the `unswitched_loop_selector` instead? Because where it is used, we will need the projections again, right? 
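The `old_to_new` utility mentioned above could look roughly like the following. This is a self-contained toy, not HotSpot code: `ToyNode` and `OldNewMap` are hypothetical, and in C2 the mapping lives in a `Node_List` indexed by `_idx` rather than a `std::vector`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a C2 node with a unique index.
struct ToyNode { unsigned idx; };

// Maps each old (original) node to its clone, indexed by the old node's idx.
class OldNewMap {
  std::vector<ToyNode*> _new_nodes;

 public:
  void map(const ToyNode* old_node, ToyNode* new_node) {
    if (old_node->idx >= _new_nodes.size()) {
      _new_nodes.resize(static_cast<std::size_t>(old_node->idx) + 1, nullptr);
    }
    _new_nodes[old_node->idx] = new_node;
  }

  // Names the lookup once, so callers do not repeat the indexing by idx.
  ToyNode* old_to_new(const ToyNode* old_node) const {
    assert(old_node->idx < _new_nodes.size() && "old node was never cloned");
    return _new_nodes[old_node->idx];
  }
};
```

A typed variant that also performs the cast to the expected node class would remove the trailing `->as_Loop()` at the call site as well, but that part is left out of this sketch.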
src/hotspot/share/opto/loopnode.cpp line 4141: > 4139: void PhaseIdealLoop::collect_useful_template_assertion_predicates_for_loop(IdealLoopTree* loop, > 4140: Unique_Node_List &useful_predicates) { > 4141: Node* entry = loop->_head->as_Loop()->skip_strip_mined()->in(LoopNode::EntryControl); looks like a bug-fix? src/hotspot/share/opto/predicates.cpp line 152: > 150: } > 151: > 152: TemplateAssertionPredicateBool::TemplateAssertionPredicateBool(Node* source_bool) : _source_bool(source_bool->as_Bool()) { Would be cleaner to just require the input to be a `BoolNode` right? src/hotspot/share/opto/predicates.cpp line 166: > 164: } > 165: } > 166: assert(has_template_output, "must find Template Assertion Predicate as output"); Idea: pack this in a method `has_opaque4_output`. Then you do not need the `has_template_output` variable, but can directly `return true` and `return false` at the end. And call like that: `DEBUG_ONLY(source_bool->has_opaque4_output();)` The nice thing: the constructor could be moved to the `hpp` file. src/hotspot/share/opto/predicates.cpp line 212: > 210: } > 211: > 212: uint node_index_to_previously_visited_parent() const { This is the "parent" of the current node, right? Parent is also a bit confusing, I think. Maybe use "input" instead? Because at the beginning you say the DFS traverses the inputs. Why do you say "previously visited"? src/hotspot/share/opto/predicates.cpp line 230: > 228: } > 229: > 230: void replace_top_with(Node* node) { `replace_top_node_with` src/hotspot/share/opto/predicates.cpp line 245: > 243: // Class to clone a Template Assertion Predicate Bool. The BoolNode and all the nodes up to but excluding the OpaqueLoop* > 244: // nodes are cloned. The OpaqueLoop* nodes are transformed by the provided strategy (e.g. cloned or replaced). 
> 245: class CloneTemplateAssertionPredicateBool : public StackObj { Suggestion: class CloneTemplateAssertionPredicateExpression : public StackObj { src/hotspot/share/opto/predicates.cpp line 250: > 248: uint _index_before_cloning; > 249: Node* _ctrl_for_clones; > 250: DEBUG_ONLY(bool _found_init;) what about stride? src/hotspot/share/opto/predicates.hpp line 271: > 269: // Template Assertion Predicate and all the nodes up to and including the OpaqueLoop* nodes. > 270: class TemplateAssertionPredicateBool : public StackObj { > 271: BoolNode* _source_bool; could be a const pointer ------------- Changes requested by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/16877#pullrequestreview-1794461358 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1435043499 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1435093011 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1435100821 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1435105058 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1435088483 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1435082399 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1435135799 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1435062571 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1435125509 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1435120810 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1435117695 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1435141647 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1435153913 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1435157506 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1435162747 PR Review Comment: 
https://git.openjdk.org/jdk/pull/16877#discussion_r1435166023 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1435167410 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1435167783 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1435154277 From epeter at openjdk.org Fri Dec 22 15:48:10 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 22 Dec 2023 15:48:10 GMT Subject: RFR: 8305638: Refactor Template Assertion Predicate Bool creation and Predicate code in Split If and Loop Unswitching In-Reply-To: References: Message-ID: On Fri, 22 Dec 2023 13:10:56 GMT, Emanuel Peter wrote: >> This patch is intended for JDK 23. >> >> While preparing the patch for the full fix for Assertion Predicates [JDK-8288981](https://bugs.openjdk.org/browse/JDK-8288981), I still noticed that some changes are not required for the actual fix and could be split off and reviewed separately in this PR. >> >> The patch applies the following cleanup changes: >> - The complete fix had to add slightly different cloning cases in `PhaseIdealLoop::create_bool_from_template_assertion_predicate()` which already has quite some logic to switch between different cases. Additionally, the algorithm in the method itself was already hard to understand and difficult to adapt. I therefore re-implemented it in a separate class `CloneTemplateAssertionPredicateBool` together with some helper classes like `DFSNodeStack`. To use it, I've added a `TemplateAssertionPredicateBool` class that offers three cloning possibilities: >> - `clone()`: Clone without modification >> - `clone_and_replace_opaque_loop_nodes()`: Clone and replace the `OpaqueLoop*Nodes` with a new init and stride node. >> - `clone_and_replace_init()`: Special case of `clone_and_replace_opaque_loop_nodes()` which only replaces `OpaqueLoopInitNode` and clones `OpaqueLoopStrideNode`. >> >> This refactoring could be extracted from the complete fix. 
>> - The Split If code to detect (`subgraph_has_opaque()`) and clone Template Assertion Predicate Bools was extracted to a separate class `CloneTemplateAssertionPredicateBoolDown` and uses the new `TemplateAssertionPredicateBool` class to do the actual cloning. >> - In the process of coding the complete fix, I've refactored the Loop Unswitching code quite a bit. This change could also be extracted into a separate RFE. Changes include: >> - Renaming >> - Extracting code to separate classes/methods >> - Adding comments >> - Some small refactoring including: >> - Removing unused parameters >> - Renaming variables/parameters/methods >> >> Thanks, >> Christian > > src/hotspot/share/opto/loopUnswitch.cpp line 117: > >> 115: >> 116: // Perform Loop Unswitching on the loop containing an invariant test that does not exit the loop. The loop is cloned >> 117: // such that we have two identical loops next to each other - a fast and a slow loop. We modify the loops as follows: > > It sounds like they are somehow identical and at the same time modified. > Suggestion: say that they are "first" cloned to be identical, and then "second" modified such that one becomes the "true-path" and the other the "false-path" loop. I wonder if we can eliminate the "fast / slow" wording, because who really knows which path is faster or slower, right? Also: why can the invariant if not be eliminated in the loops as a dominating if, with and independent optimization? Would that not be cleaner? Well I guess maybe we just want to be sure that it happens. > src/hotspot/share/opto/loopUnswitch.cpp line 266: > >> 264: _selector(create_selector_if(loop, unswitch_if_candidate)), >> 265: _fast_loop_proj(create_fast_loop_proj()), >> 266: _slow_loop_proj(create_slow_loop_proj()) {} > > It seems to me you could make all fields `const` of some sort, right? You will never modify the pointers again. 
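To illustrate the `const` suggestion, here is a minimal standalone sketch. `ToyNode` and `ToyLoopSelector` are hypothetical stand-ins invented for this example, not the HotSpot classes under review. The sketch only shows the general C++ pattern of pointer members that are set once in the initializer list and never reseated.

```cpp
#include <cassert>

// Hypothetical stand-in for illustration; not a HotSpot node class.
struct ToyNode { int id; };

// Hypothetical selector holding three pointers that are fixed at construction.
class ToyLoopSelector {
  // 'ToyNode* const': the pointer itself cannot be reseated after construction,
  // while the node it points to can still be modified through it.
  ToyNode* const _selector;
  ToyNode* const _true_path_proj;
  ToyNode* const _false_path_proj;

 public:
  ToyLoopSelector(ToyNode* selector, ToyNode* true_proj, ToyNode* false_proj)
      : _selector(selector), _true_path_proj(true_proj), _false_path_proj(false_proj) {}

  ToyNode* selector() const { return _selector; }
  ToyNode* true_path_proj() const { return _true_path_proj; }
  ToyNode* false_path_proj() const { return _false_path_proj; }

  // Allowed: mutating the pointee. Writing '_selector = other;' here would not compile.
  void renumber_selector(int id) const { _selector->id = id; }
};
```

One trade-off of this pattern: a class with `const` data members loses its implicit copy assignment operator, which is usually acceptable for short-lived stack objects.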
> src/hotspot/share/opto/loopUnswitch.cpp line 300: > >> 298: void fix_loop_entries(IfProjNode* iffast_pred, IfProjNode* ifslow_pred) { >> 299: _phase->replace_loop_entry(_strip_mined_loop_head, iffast_pred); >> 300: LoopNode* slow_loop_strip_mined_head = _old_new->at(_strip_mined_loop_head->_idx)->as_Loop(); > > utility method `old_to_new` could be helpful, and eliminate this line. It would also make the other code below easier to read, I think > src/hotspot/share/opto/loopUnswitch.cpp line 325: > >> 323: _loop(loop), >> 324: _old_new(old_new), >> 325: _phase(loop->_phase) {} > > Again, I think it would be nice to have the constructor and public member methods at the beginning, right after the field definition. It just helps to see how they belong together better. And the fields could be const. > src/hotspot/share/opto/loopUnswitch.cpp line 334: > >> 332: const uint first_slow_loop_node_index = _phase->C->unique(); >> 333: _phase->clone_loop(_loop, *_old_new, _phase->dom_depth(_loop_head), >> 334: PhaseIdealLoop::CloneIncludesStripMined, loop_selector); > > I think this used to be: > `clone_loop(loop, old_new, dom_depth(head->skip_strip_mined()), mode, iff);` > Should it be `dom_depth` of `_strip_mined_loop_head`? Also there used to be an assert in the old code after the clone: `assert(old_new[head->_idx]->is_Loop(), "" );` not sure if worth keeping? > src/hotspot/share/opto/predicates.cpp line 212: > >> 210: } >> 211: >> 212: uint node_index_to_previously_visited_parent() const { > > This is the "parent" of the current node, right? > Parent is also a bit confusing, I think. Maybe use "input" instead? Because at the beginning you say the DFS traverses the inputs. > Why do you say "previously visited"? Ah. I think it should simply be `top_input_index`. You could then also rename `top` -> `top_node`. And `increment_top_node_input_index` -> `increment_top_index`, if you even think it is worth keeping. 
> src/hotspot/share/opto/predicates.hpp line 270: > >> 268: // A Template Assertion Predicate Bool represents the BoolNode for the initial value or the last value of a >> 269: // Template Assertion Predicate and all the nodes up to and including the OpaqueLoop* nodes. >> 270: class TemplateAssertionPredicateBool : public StackObj { > > The current name targets the "bool" node, and it is not clear that you actually mean to represent the whole assertion-predicate expression. Hence: > `TemplateAssertionPredicateBool` -> `TemplateAssertionPredicateExpression` > > Then `could_be_part` makes more sense. Maybe still, you could rename it to `maybe_contains`? > But is it really only "maybe" or "could"? Maybe add some explanation in what cases it looks like it could be part of it, but is actually not? > > Also, expression would help to make more sense of clone, since it clones the whole expression! And why not also include the `Opaque4` node below the bool node? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1435045548 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1435090723 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1435136176 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1435131946 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1435138897 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1435165087 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1435155525 From epeter at openjdk.org Fri Dec 22 15:48:12 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 22 Dec 2023 15:48:12 GMT Subject: RFR: 8305638: Refactor Template Assertion Predicate Bool creation and Predicate code in Split If and Loop Unswitching In-Reply-To: References: Message-ID: On Fri, 22 Dec 2023 13:34:15 GMT, Emanuel Peter wrote: >> Also: why can the invariant if not be eliminated in the loops as a dominating if, with and 
independent optimization? Would that not be cleaner? Well I guess maybe we just want to be sure that it happens. > > Yeah, the naming of "old / new", "true / false", "fast / slow" is confusing and redundant. Would be nice if you said which are the same. Or maybe remove some of these names entirely. > I suggest: Old -> True, New -> False. `create_slow_version_of_loop` should probably be renamed as well. Maybe `clone_loop_and_create_loop_selector`? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1435079338 From epeter at openjdk.org Fri Dec 22 15:52:03 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 22 Dec 2023 15:52:03 GMT Subject: RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v57] In-Reply-To: References: <4g4SbB2RBLU-ZFcrH_ukdqC_QSoSvibNGanasAFl-lw=.731266a6-9974-402e-954e-e441706426ab@github.com> Message-ID: On Fri, 22 Dec 2023 15:40:47 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> Apply suggestions from code review by Christian >> >> Co-authored-by: Christian Hagedorn > > Thanks a lot Emanuel for all the discussions and for addressing all my comments online and offline :-) > > It looks very good now and it's easy to follow the logic. The proofs are great and really helpful to better understand the (rather simple in the end) code for proving and calculating the alignment solutions. Thanks for putting the extra effort in here. > > I will have another complete look at the entire PR in the new year. But I think it looks good! Thank you very much @chhagedorn for the very extensive review process here.
------------- PR Comment: https://git.openjdk.org/jdk/pull/14785#issuecomment-1867833613 From chagedorn at openjdk.org Fri Dec 22 15:59:45 2023 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 22 Dec 2023 15:59:45 GMT Subject: RFR: 8322661: Build broken due to missing jvmtiExport.hpp after JDK-8320139 In-Reply-To: References: Message-ID: <1z0sCU5CTD4Jhsq_W5SD-Q6SuC02OdSNslBJ-Dvs0i0=.851755dc-3cd5-45ed-8189-ff6b4868140d@github.com> On Fri, 22 Dec 2023 02:51:07 GMT, Jie Fu wrote: > Add jvmtiExport.hpp in jvmciCompilerToVMInit.cpp to fix the build failure. > Thanks. Looks good. ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/17182#pullrequestreview-1794670264 From epeter at openjdk.org Fri Dec 22 17:28:55 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 22 Dec 2023 17:28:55 GMT Subject: RFR: 8305638: Refactor Template Assertion Predicate Bool creation and Predicate code in Split If and Loop Unswitching In-Reply-To: References: Message-ID: On Wed, 29 Nov 2023 08:42:41 GMT, Christian Hagedorn wrote: > This patch is intended for JDK 23. > > While preparing the patch for the full fix for Assertion Predicates [JDK-8288981](https://bugs.openjdk.org/browse/JDK-8288981), I still noticed that some changes are not required for the actual fix and could be split off and reviewed separately in this PR. > > The patch applies the following cleanup changes: > - The complete fix had to add slightly different cloning cases in `PhaseIdealLoop::create_bool_from_template_assertion_predicate()` which already has quite some logic to switch between different cases. Additionally, the algorithm in the method itself was already hard to understand and difficult to adapt. I therefore re-implemented it in a separate class `CloneTemplateAssertionPredicateBool` together with some helper classes like `DFSNodeStack`. 
To use it, I've added a `TemplateAssertionPredicateBool` class that offers three cloning possibilities: > - `clone()`: Clone without modification > - `clone_and_replace_opaque_loop_nodes()`: Clone and replace the `OpaqueLoop*Nodes` with a new init and stride node. > - `clone_and_replace_init()`: Special case of `clone_and_replace_opaque_loop_nodes()` which only replaces `OpaqueLoopInitNode` and clones `OpaqueLoopStrideNode`. > > This refactoring could be extracted from the complete fix. > - The Split If code to detect (`subgraph_has_opaque()`) and clone Template Assertion Predicate Bools was extracted to a separate class `CloneTemplateAssertionPredicateBoolDown` and uses the new `TemplateAssertionPredicateBool` class to do the actual cloning. > - In the process of coding the complete fix, I've refactored the Loop Unswitching code quite a bit. This change could also be extracted into a separate RFE. Changes include: > - Renaming > - Extracting code to separate classes/methods > - Adding comments > - Some small refactoring including: > - Removing unused parameters > - Renaming variables/parameters/methods > > Thanks, > Christian Ok, this is it for now. I think it is awesome how you are refactoring the code, and packing it into classes to break up large methods. src/hotspot/share/opto/predicates.cpp line 273: > 271: if (must_clone_node_on_top(transformed_opaque_loop_node)) { > 272: clone_and_replace_top_node(); > 273: } Why do you touch the output node here? Is it not post-visited later anyway? src/hotspot/share/opto/predicates.cpp line 276: > 274: // Rewire the current node on top (child of old OpaqueLoop*Node) to the newly transformed node. > 275: rewire_node_on_top_to(transformed_opaque_loop_node); > 276: } I would fuse these two methods: `transform_opaque_loop_node` + `pop_transformed_opaque_loop_node`. That way you do not have to replace what is on the stack (which is nasty IMHO). src/hotspot/share/opto/predicates.cpp line 286: > 284: // Predicate Bool.
> 285: // If (1) is true then previously_visited_parent is part of the Template Assertion Predicate Bool. But if top was > 286: // already cloned, we do not need to clone it again to avoid duplicates. I would make `top is not a clone` the first condition: obviously, if it is already cloned we do not need to clone again. The other condition is a little worrying / more difficult to understand. I think the idea is that `previously_visited_parent` would be the clone, if there was cloning on the current input. src/hotspot/share/opto/predicates.cpp line 353: > 351: // Look for the OpaqueLoop* nodes to transform them with the strategy defined with 'transform_opaque_loop_nodes'. > 352: // Clone all nodes in between. > 353: BoolNode* clone(TransformOpaqueLoopNodes* transform_opaque_loop_nodes) { This is a DFS. But how are the nodes cloned? In post-order, so after all inputs are processed and cloned, right? src/hotspot/share/opto/predicates.cpp line 361: > 359: pop_transformed_opaque_loop_node(); > 360: } else if (!_stack.push_next_unvisited_input()) { > 361: pop_node(); So this happens when no new input can be found, right? So a post-order traversal. src/hotspot/share/opto/split_if.cpp line 98: > 96: } > 97: > 98: clone_template_assertion_predicate_bool_down_if_related(n); We need a better name. I don't understand what it does from the name. src/hotspot/share/opto/split_if.cpp line 413: > 411: > 412: // This class clones Template Assertion Predicates Bools down as part of the Split If optimization. > 413: class CloneTemplateAssertionPredicateBoolDown { What does the `down` mean? ------------- Changes requested by epeter (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/16877#pullrequestreview-1794664577 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1435190332 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1435188973 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1435198144 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1435184937 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1435185761 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1435236754 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1435234621 From epeter at openjdk.org Fri Dec 22 17:28:56 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 22 Dec 2023 17:28:56 GMT Subject: RFR: 8305638: Refactor Template Assertion Predicate Bool creation and Predicate code in Split If and Loop Unswitching In-Reply-To: References: Message-ID: On Fri, 22 Dec 2023 15:36:18 GMT, Emanuel Peter wrote: >> This patch is intended for JDK 23. >> >> While preparing the patch for the full fix for Assertion Predicates [JDK-8288981](https://bugs.openjdk.org/browse/JDK-8288981), I still noticed that some changes are not required for the actual fix and could be split off and reviewed separately in this PR. >> >> The patch applies the following cleanup changes: >> - The complete fix had to add slightly different cloning cases in `PhaseIdealLoop::create_bool_from_template_assertion_predicate()` which already has quite some logic to switch between different cases. Additionally, the algorithm in the method itself was already hard to understand and difficult to adapt. I therefore re-implemented it in a separate class `CloneTemplateAssertionPredicateBool` together with some helper classes like `DFSNodeStack`. 
To use it, I've added a `TemplateAssertionPredicateBool` class that offers three cloning possibilities: >> - `clone()`: Clone without modification >> - `clone_and_replace_opaque_loop_nodes()`: Clone and replace the `OpaqueLoop*Nodes` with a new init and stride node. >> - `clone_and_replace_init()`: Special case of `clone_and_replace_opaque_loop_nodes()` which only replaces `OpaqueLoopInitNode` and clones `OpaqueLoopStrideNode`. >> >> This refactoring could be extracted from the complete fix. >> - The Split If code to detect (`subgraph_has_opaque()`) and clone Template Assertion Predicate Bools was extracted to a separate class `CloneTemplateAssertionPredicateBoolDown` and uses the new `TemplateAssertionPredicateBool` class to do the actual cloning. >> - In the process of coding the complete fix, I've refactored the Loop Unswitching code quite a bit. This change could also be extracted into a separate RFE. Changes include: >> - Renaming >> - Extracting code to separate classes/methods >> - Adding comments >> - Some small refactoring including: >> - Removing unused parameters >> - Renaming variables/parameters/methods >> >> Thanks, >> Christian > > src/hotspot/share/opto/predicates.cpp line 245: > >> 243: // Class to clone a Template Assertion Predicate Bool. The BoolNode and all the nodes up to but excluding the OpaqueLoop* >> 244: // nodes are cloned. The OpaqueLoop* nodes are transformed by the provided strategy (e.g. cloned or replaced). >> 245: class CloneTemplateAssertionPredicateBool : public StackObj { > > Suggestion: > > class CloneTemplateAssertionPredicateExpression : public StackObj { I left quite a few comments below, because I think the algorithm is difficult to understand. I like the idea of splitting it into smaller parts. If I understand the basic algorithm, we do this: DFS, where we traverse the inputs recursively, using the `DFSNodeStack`, which uses the filter `TemplateAssertionPredicateBool::could_be_part`. 
If you see an init/stride `Opaque1` node, then you obviously clone it. If you come back from an input to an output (use to def), then you may get back a cloned input node. If the input node is not a clone, then we have to do nothing. If the input node is a clone, we also have to clone the output node. Clone it if it is not already a clone. And then update the current input slot, so we do not point to the pre-cloned input, but to the cloned input. > src/hotspot/share/opto/predicates.cpp line 273: > >> 271: if (must_clone_node_on_top(transformed_opaque_loop_node)) { >> 272: clone_and_replace_top_node(); >> 273: } > > Why do you touch the output node here? Is it not post-visited later anyway? Oh, I see. We are eagerly cloning the output node, if it is not yet cloned. Hmm. I guess that is better than trying to figure out in the post-visit if we need to clone or not, because that would require querying if any input node is a clone? > src/hotspot/share/opto/predicates.cpp line 361: > >> 359: pop_transformed_opaque_loop_node(); >> 360: } else if (!_stack.push_next_unvisited_input()) { >> 361: pop_node(); > > So this happens when no new input can be found, right? So a post-order traversal. I think it would be better to have two methods: `post_visit_opaque_loop_node` and `post_visit_other_node`. And then call pop explicitly here. That way, we only push and pop here, and the traversal is a bit easier to understand. > src/hotspot/share/opto/split_if.cpp line 413: > >> 411: >> 412: // This class clones Template Assertion Predicates Bools down as part of the Split If optimization. >> 413: class CloneTemplateAssertionPredicateBoolDown { > > What does the `down` mean? I have not yet reviewed this part of the code.
I'm out of time for today / this year ;) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1435214644 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1435192504 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1435188044 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1435237075 From epeter at openjdk.org Fri Dec 22 17:28:56 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 22 Dec 2023 17:28:56 GMT Subject: RFR: 8305638: Refactor Template Assertion Predicate Bool creation and Predicate code in Split If and Loop Unswitching In-Reply-To: References: Message-ID: On Fri, 22 Dec 2023 16:47:52 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/predicates.cpp line 245: >> >>> 243: // Class to clone a Template Assertion Predicate Bool. The BoolNode and all the nodes up to but excluding the OpaqueLoop* >>> 244: // nodes are cloned. The OpaqueLoop* nodes are transformed by the provided strategy (e.g. cloned or replaced). >>> 245: class CloneTemplateAssertionPredicateBool : public StackObj { >> >> Suggestion: >> >> class CloneTemplateAssertionPredicateExpression : public StackObj { > > I left quite a few comments below, because I think the algorithm is difficult to understand. > I like the idea of splitting it into smaller parts. > > If I understand the basic algorithm, we do this: > DFS, where we traverse the inputs recursively, using the `DFSNodeStack`, which uses the filter `TemplateAssertionPredicateBool::could_be_part`. > > If you see a init/stride `Opaque1` node, then you obviously clone it. > > If you come back from an input to a output (use to def), then you may get back a cloned input node. > If the input node is not a clone, then we have to do nothing. > If the input node is a clone, we also have to clone the output node. Clone it if it is not already a clone. 
And then update the current input slot, so we do not point to the pre-cloned input, but to the cloned input. So we need to do work whenever we walk back down an input->output edge. We can traverse this edge by taking the post-order traversal, and then walking from the post-order visited node to its output node.

Node* current;
while (_stack.is_not_empty()) {
  current = _stack.top();
  if (current->is_Opaque1()) {
    Node* transformed_node = transform_opaque_loop_node(current, transform_opaque_loop_nodes);
    _stack.replace_top(transformed_node);
    traverse_edge_back_to_output_and_pop();
  } else if (!_stack.push_next_unvisited_input()) {
    traverse_edge_back_to_output_and_pop();
  } // else: we just pushed a new input, go and visit it first
}

traverse_edge_back_to_output_and_pop:
  Node* maybe_cloned_input = _stack.pop();
  if (_stack.is_empty()) {
    return; // we just visited the root node, and are now done. Maybe verify we have the root here?
  }
  Node* output = _stack.top();
  int i = _stack.top_index();
  if (!is_clone(output)) {
    output = output->clone();
    _stack.replace_top(output);
  }
  output->set_req(i, maybe_cloned_input);

>> src/hotspot/share/opto/predicates.cpp line 273: >> >>> 271: if (must_clone_node_on_top(transformed_opaque_loop_node)) { >>> 272: clone_and_replace_top_node(); >>> 273: } >> >> Why do you touch the output node here? Is it not post-visited later anyway? > > Oh, I see. We are eagerly cloning the output node, if it is not yet cloned. Hmm. I guess that is better than trying to figure out in the post-visit if we need to clone or not, because that would require querying if any input node is a clone? Ah oh dear. And that is why you need to then always rewire below. Hmm.
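The traversal described above can be tried out on a toy graph. The following is only a sketch under stated assumptions: `ToyNode` and `ToyCloner` are hypothetical names invented here, the "transform" strategy is a plain clone, and the user node is cloned only when the input we return from is itself a clone (following the prose description rather than the pseudocode literally). It is not the code in the PR, and it does not deduplicate nodes reachable over several paths.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Toy graph node; a hypothetical stand-in, not HotSpot's Node class.
struct ToyNode {
  std::vector<ToyNode*> inputs;
  bool is_opaque = false;  // plays the role of an OpaqueLoop* node
  bool is_clone = false;   // set on nodes created during the traversal
};

class ToyCloner {
  std::vector<std::unique_ptr<ToyNode>> _owned;           // keeps clones alive
  std::vector<std::pair<ToyNode*, std::size_t>> _stack;   // node + next input index

  ToyNode* clone_node(const ToyNode* n) {
    _owned.push_back(std::make_unique<ToyNode>(*n));
    _owned.back()->is_clone = true;
    return _owned.back().get();
  }

  // Pop the finished node. If it is a clone, make sure its user (the new top)
  // is a clone as well and rewire the input slot we just came back from.
  ToyNode* walk_back_to_output_and_pop() {
    ToyNode* input = _stack.back().first;
    _stack.pop_back();
    if (_stack.empty() || !input->is_clone) {
      return input;  // root reached, or nothing changed below this edge
    }
    std::pair<ToyNode*, std::size_t>& top = _stack.back();
    if (!top.first->is_clone) {
      top.first = clone_node(top.first);  // replace the user on the stack by its clone
    }
    top.first->inputs[top.second - 1] = input;
    return input;
  }

 public:
  // Returns the root of the cloned expression, or 'root' itself if no opaque node was found.
  ToyNode* clone_expression(ToyNode* root) {
    ToyNode* last_popped = root;
    _stack.emplace_back(root, 0);
    while (!_stack.empty()) {
      std::pair<ToyNode*, std::size_t>& top = _stack.back();
      if (top.first->is_opaque) {
        top.first = clone_node(top.first);  // "transform" strategy: plain clone
        last_popped = walk_back_to_output_and_pop();
      } else if (top.second < top.first->inputs.size()) {
        ToyNode* input = top.first->inputs[top.second++];
        _stack.emplace_back(input, 0);      // visit the next input first
      } else {
        last_popped = walk_back_to_output_and_pop();  // post-visit
      }
    }
    return last_popped;
  }
};
```

In this sketch the original graph is left untouched, and nodes that do not lead to an opaque node (for example constants) are shared between the original and the cloned expression.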
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1435231073 PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1435193899 From epeter at openjdk.org Fri Dec 22 17:28:57 2023 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 22 Dec 2023 17:28:57 GMT Subject: RFR: 8305638: Refactor Template Assertion Predicate Bool creation and Predicate code in Split If and Loop Unswitching In-Reply-To: References: Message-ID: On Fri, 22 Dec 2023 17:14:36 GMT, Emanuel Peter wrote: >> I left quite a few comments below, because I think the algorithm is difficult to understand. >> I like the idea of splitting it into smaller parts. >> >> If I understand the basic algorithm, we do this: >> DFS, where we traverse the inputs recursively, using the `DFSNodeStack`, which uses the filter `TemplateAssertionPredicateBool::could_be_part`. >> >> If you see a init/stride `Opaque1` node, then you obviously clone it. >> >> If you come back from an input to a output (use to def), then you may get back a cloned input node. >> If the input node is not a clone, then we have to do nothing. >> If the input node is a clone, we also have to clone the output node. Clone it if it is not already a clone. And then update the current input slot, so we do not point to the pre-cloned input, but to the cloned input. > > So we need to do work whenever we walk back down a input->output edge. We can traverse this edge, by taking the post-order traversal, and then waling from the post-order visited node to its output node. 
>
>
> Node* current;
> while (_stack.is_not_empty()) {
>   current = _stack.top();
>   if (current->is_Opaque1()) {
>     Node* transformed_node = transform_opaque_loop_node(current, transform_opaque_loop_nodes);
>     _stack.replace_top(transformed_node);
>     traverse_edge_back_to_output_and_pop();
>   } else if (!_stack.push_next_unvisited_input()) {
>     traverse_edge_back_to_output_and_pop();
>   } // else: we just pushed a new input, go and visit it first
> }
>
> traverse_edge_back_to_output_and_pop:
>   Node* maybe_cloned_input = _stack.pop();
>   if (_stack.is_empty()) {
>     return; // we just visited the root node, and are now done. Maybe verify we have the root here?
>   }
>   Node* output = _stack.top();
>   int i = _stack.top_index();
>   if (!is_clone(output)) {
>     output = output->clone();
>     _stack.replace_top(output);
>   }
>   output->set_req(i, maybe_cloned_input);

An even easier algorithm would be to have an additional `old_new` data-structure that knows the clones for the old nodes. Then you simply post-order traverse, and whenever you post-visit a node (current, all its inputs were already traversed, and if need be cloned), you check all the inputs, and see if any of them were cloned. If so, clone current also, and add it to `old_new`. That would mean that you do not have to hack the DFS iterator, and could use it in read-only mode. And it would be a much simpler algorithm. But the whole thing comes at the cost of this extra `old_new` data-structure.

------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16877#discussion_r1435232649 From never at openjdk.org Fri Dec 22 17:32:47 2023 From: never at openjdk.org (Tom Rodriguez) Date: Fri, 22 Dec 2023 17:32:47 GMT Subject: RFR: 8322661: Build broken due to missing jvmtiExport.hpp after JDK-8320139 In-Reply-To: References: Message-ID: On Fri, 22 Dec 2023 02:51:07 GMT, Jie Fu wrote: > Add jvmtiExport.hpp in jvmciCompilerToVMInit.cpp to fix the build failure. > Thanks. This looks good. Thanks for fixing it.
------------- Marked as reviewed by never (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/17182#pullrequestreview-1794772159 From duke at openjdk.org Fri Dec 22 19:09:59 2023 From: duke at openjdk.org (Joshua Cao) Date: Fri, 22 Dec 2023 19:09:59 GMT Subject: RFR: 8322490: cleanup CastNode construction [v5] In-Reply-To: References: Message-ID: <-tnaIIQR_4bwezZ_9uOfOZkFv1ZJO-zMN71FSGzwGWY=.141d3f85-d1b1-4113-9828-7aa7bc57f8ae@github.com> > It is a common pattern to have: > > > Node* n = new CastNode(...); > n->set_req(control_node); > > > We can modify the constructor to set the control node. It makes the code a little tidier. > > Passes tier1 locally on my Linux machine Joshua Cao has updated the pull request incrementally with one additional commit since the last revision: Fix asterisk formatting ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17162/files - new: https://git.openjdk.org/jdk/pull/17162/files/574617ef..d3bd0b85 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17162&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17162&range=03-04 Stats: 5 lines in 4 files changed: 0 ins; 0 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/17162.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17162/head:pull/17162 PR: https://git.openjdk.org/jdk/pull/17162 From phh at openjdk.org Fri Dec 22 19:20:48 2023 From: phh at openjdk.org (Paul Hohensee) Date: Fri, 22 Dec 2023 19:20:48 GMT Subject: RFR: 8322490: cleanup CastNode construction [v5] In-Reply-To: <-tnaIIQR_4bwezZ_9uOfOZkFv1ZJO-zMN71FSGzwGWY=.141d3f85-d1b1-4113-9828-7aa7bc57f8ae@github.com> References: <-tnaIIQR_4bwezZ_9uOfOZkFv1ZJO-zMN71FSGzwGWY=.141d3f85-d1b1-4113-9828-7aa7bc57f8ae@github.com> Message-ID: On Fri, 22 Dec 2023 19:09:59 GMT, Joshua Cao wrote: >> It is a common pattern to have: >> >> >> Node* n = new CastNode(...); >> n->set_req(control_node); >> >> >> We can modify the constructor to set the control node. It makes the code a little tidier. 
>> >> Passes tier1 locally on my Linux machine > > Joshua Cao has updated the pull request incrementally with one additional commit since the last revision: > > Fix asterisk formatting Marked as reviewed by phh (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/17162#pullrequestreview-1794901822 From duke at openjdk.org Fri Dec 22 21:11:54 2023 From: duke at openjdk.org (Joshua Cao) Date: Fri, 22 Dec 2023 21:11:54 GMT Subject: Integrated: 8322490: cleanup CastNode construction In-Reply-To: References: Message-ID: On Tue, 19 Dec 2023 20:27:06 GMT, Joshua Cao wrote: > It is a common pattern to have: > > > Node* n = new CastNode(...); > n->set_req(control_node); > > > We can modify the constructor to set the control node. It makes the code a little tidier. > > Passes tier1 locally on my Linux machine This pull request has now been integrated. Changeset: 7263e25d Author: Joshua Cao Committer: Paul Hohensee URL: https://git.openjdk.org/jdk/commit/7263e25d9b69d67697992a284c75454c479b6ec3 Stats: 120 lines in 10 files changed: 8 ins; 64 del; 48 mod 8322490: cleanup CastNode construction Reviewed-by: chagedorn, phh ------------- PR: https://git.openjdk.org/jdk/pull/17162 From jiefu at openjdk.org Fri Dec 22 23:55:53 2023 From: jiefu at openjdk.org (Jie Fu) Date: Fri, 22 Dec 2023 23:55:53 GMT Subject: RFR: 8322661: Build broken due to missing jvmtiExport.hpp after JDK-8320139 In-Reply-To: <1z0sCU5CTD4Jhsq_W5SD-Q6SuC02OdSNslBJ-Dvs0i0=.851755dc-3cd5-45ed-8189-ff6b4868140d@github.com> References: <1z0sCU5CTD4Jhsq_W5SD-Q6SuC02OdSNslBJ-Dvs0i0=.851755dc-3cd5-45ed-8189-ff6b4868140d@github.com> Message-ID: On Fri, 22 Dec 2023 15:56:37 GMT, Christian Hagedorn wrote: >> Add jvmtiExport.hpp in jvmciCompilerToVMInit.cpp to fix the build failure. >> Thanks. > > Looks good. Thanks @chhagedorn and @tkrodriguez for the review. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/17182#issuecomment-1868135994 From jiefu at openjdk.org Fri Dec 22 23:55:54 2023 From: jiefu at openjdk.org (Jie Fu) Date: Fri, 22 Dec 2023 23:55:54 GMT Subject: Integrated: 8322661: Build broken due to missing jvmtiExport.hpp after JDK-8320139 In-Reply-To: References: Message-ID: <8lKpmf3MxiI9Pym9mYLhZVX1d5vd419JGvzrKpmBa7I=.74a15a04-aa97-407c-9233-ff0988c3809f@github.com> On Fri, 22 Dec 2023 02:51:07 GMT, Jie Fu wrote: > Add jvmtiExport.hpp in jvmciCompilerToVMInit.cpp to fix the build failure. > Thanks. This pull request has now been integrated. Changeset: 28c82bf1 Author: Jie Fu URL: https://git.openjdk.org/jdk/commit/28c82bf18d85be00bea45daf81c6a9d665ac676f Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod 8322661: Build broken due to missing jvmtiExport.hpp after JDK-8320139 Reviewed-by: chagedorn, never ------------- PR: https://git.openjdk.org/jdk/pull/17182 From dnsimon at openjdk.org Sat Dec 23 04:19:41 2023 From: dnsimon at openjdk.org (Doug Simon) Date: Sat, 23 Dec 2023 04:19:41 GMT Subject: RFR: 8322636: [JVMCI] HotSpotSpeculationLog can be inconsistent across a single compile In-Reply-To: References: Message-ID: On Fri, 22 Dec 2023 09:55:16 GMT, David Leopoldseder wrote: > This PR fixes a subtle inconsistency in `HotSpotSpeculationLog` . > > Normal uses of `HotSpotSpeculationLog` work by using a `SpeculationReason` and asking the speculation log via `maySpeculate` if the speculation can be performed, i.e., if it failed before for the given method. An example for this can be seen in Graal https://github.com/oracle/graal/blob/master/compiler/src/jdk.graal.compiler/src/jdk/graal/compiler/nodes/loop/CountedLoopInfo.java#L591C15-L591C15 > The implicit assumption is that the speculation log, `HotSpotSpeculationLog` in particular collects failed speculations at the beginning of a compile and then stays consistent during the compile. Why is that? 
- Because if there are new failed speculations added to the failed speculations during the compile - the compiler would speculate again on those in an inconsistent way. E.g. at the beginning of a compile a certain speculation has not failed yet and the compiler thinks it can do optimization xyz using a speculation - later during the compilation process it consults the speculation log but gets a different answer. All those inconsistent speculations that already failed will anyway later fail code installation in jvmci (they will throw a bailout during `HotSpotCodeCacheProvider#installCode` https://github.com/openjdk/jdk/blob/master/src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotSpeculationLog.java#L192 ). Thus, we should at least return a consistent result during a compile. > The problem for consistency here, that also makes troubles on the graal side, is that `maySpeculate` itself can collect failed speculations if there have not been any previously, i.e., `failedSpeculations == null`. > In order to make the speculation log consistent across an entire JVMCI compile this PR removes the collection of failed speculations in `maySpeculate`. I think it's worth updating the javadoc for maySpeculate to clarify that it returns consistent results for any given speculation for the lifetime of a SpeculationLog object. ------------- Marked as reviewed by dnsimon (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/17183#pullrequestreview-1795390189 From rrich at openjdk.org Sat Dec 23 08:13:02 2023 From: rrich at openjdk.org (Richard Reingruber) Date: Sat, 23 Dec 2023 08:13:02 GMT Subject: RFR: 8322294: Cleanup NativePostCallNop [v2] In-Reply-To: <6LS57mCF2fgaosnyfnNydaqfT3cD3F42xsDOujG5SgY=.2db5f614-f64d-4fe4-8e68-1c06e70205d3@github.com> References: <6LS57mCF2fgaosnyfnNydaqfT3cD3F42xsDOujG5SgY=.2db5f614-f64d-4fe4-8e68-1c06e70205d3@github.com> Message-ID: > This is a refactoring/cleanup of `NativePostCallNop` that simplifies the ppc64 port (dependent pr https://github.com/openjdk/jdk/pull/17171). > > * `frame::get_oop_map()` is moved to shared code > > * encoding / decoding details of the oopmap slot and the CodeBlob offset are moved from shared code to the platform dependent implementations of `bool NativePostCallNop::patch(int32_t oopmap_slot, int32_t cb_offset)` and `bool NativePostCallNop::decode(int32_t& oopmap_slot, int32_t& cb_offset)` > > The change passed our CI testing. JTReg tests: tier1-4 of hotspot and jdk. All of Langtools and jaxp. SPECjvm2008, SPECjbb2015, Renaissance Suite, and SAP specific tests. > All testing was done with fastdebug and release builds on the main platforms and also on Linux/PPC64le and AIX. > > EDIT 2023-12-22: Statistics > > The statistical numbers were generated with release builds. For riscv64 I used qemu. > The variance is high on all platforms. Up to 80% I think. Numbers with fastdebug are also very different. > Nevertheless, they are consistent within one run, and I'd expect errors in encoding or decoding to manifest in the numbers. 
>
> | test/jdk/java/lang/Thread/virtual/stress/Skynet.java | x86_64: base | x86_64: pr | aarch64: base | aarch64: pr | riscv64: base | riscv64: pr |
> |------------------------------------------------------|--------------|------------|---------------|-------------|---------------|-------------|
> | PCN lookup success                                   | 17517455     | 15339681   | 13179049      | 15980253    | 19400110      | 30017193    |
> | PCN lookup failure                                   | 328164       | 372555     | 237617        | 138164      | 415341        | 586476      |
> | PCN decode success                                   | 17513991     | 15336485   | 13176061      | 15977651    | 19397398      | 30014226    |
> | PCN decode failure                                   | 3464         | 3196       | 2988          | 2602        | 2712          | 2967        |
> | PCN patch success                                    | 2676         | 2465       | 2459          | 2089        | 2214          | 2259        |
> | PCN patch cb offset failure                          | 0            | 0          | 0             | 0           | 0             | 0           |
> | PCN patch oopmap slot failure                        | 0            | 0          | 0             | 0           | 0             | 0           |
>
>
> | SpecJVM2008 compiler.compiler with fix iterations | x86_64: base | x8...

Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision:

  Review Martin

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/17150/files
  - new: https://git.openjdk.org/jdk/pull/17150/files/01af2d16..904c2337

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=17150&range=01
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17150&range=00-01

  Stats: 14 lines in 2 files changed: 1 ins; 4 del; 9 mod
  Patch: https://git.openjdk.org/jdk/pull/17150.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/17150/head:pull/17150

PR: https://git.openjdk.org/jdk/pull/17150

From rrich at openjdk.org Sat Dec 23 08:13:02 2023
From: rrich at openjdk.org (Richard Reingruber)
Date: Sat, 23 Dec 2023 08:13:02 GMT
Subject: RFR: 8322294: Cleanup NativePostCallNop
In-Reply-To: <6LS57mCF2fgaosnyfnNydaqfT3cD3F42xsDOujG5SgY=.2db5f614-f64d-4fe4-8e68-1c06e70205d3@github.com>
References: <6LS57mCF2fgaosnyfnNydaqfT3cD3F42xsDOujG5SgY=.2db5f614-f64d-4fe4-8e68-1c06e70205d3@github.com>
Message-ID:
<7WC-kCksFSCiiB_Y1-x4kW8NgNBnaqYCMXOJsi-67mI=.b75a86f8-143b-440c-9333-41c4cf987a48@github.com> On Mon, 18 Dec 2023 22:05:32 GMT, Richard Reingruber wrote: > This is a refactoring/cleanup of `NativePostCallNop` that simplifies the ppc64 port (dependent pr https://github.com/openjdk/jdk/pull/17171). > > * `frame::get_oop_map()` is moved to shared code > > * encoding / decoding details of the oopmap slot and the CodeBlob offset are moved from shared code to the platform dependent implementations of `bool NativePostCallNop::patch(int32_t oopmap_slot, int32_t cb_offset)` and `bool NativePostCallNop::decode(int32_t& oopmap_slot, int32_t& cb_offset)` > > The change passed our CI testing. JTReg tests: tier1-4 of hotspot and jdk. All of Langtools and jaxp. SPECjvm2008, SPECjbb2015, Renaissance Suite, and SAP specific tests. > All testing was done with fastdebug and release builds on the main platforms and also on Linux/PPC64le and AIX. > > EDIT 2023-12-22: Statistics > > The statistical numbers were generated with release builds. For riscv64 I used qemu. > The variance is high on all platforms. Up to 80% I think. Numbers with fastdebug are also very different. > Nevertheless, they are consistent within one run, and I'd expect errors in encoding or decoding to manifest in the numbers. 
>
> | test/jdk/java/lang/Thread/virtual/stress/Skynet.java | x86_64: base | x86_64: pr | aarch64: base | aarch64: pr | riscv64: base | riscv64: pr |
> |------------------------------------------------------|--------------|------------|---------------|-------------|---------------|-------------|
> | PCN lookup success                                   | 17517455     | 15339681   | 13179049      | 15980253    | 19400110      | 30017193    |
> | PCN lookup failure                                   | 328164       | 372555     | 237617        | 138164      | 415341        | 586476      |
> | PCN decode success                                   | 17513991     | 15336485   | 13176061      | 15977651    | 19397398      | 30014226    |
> | PCN decode failure                                   | 3464         | 3196       | 2988          | 2602        | 2712          | 2967        |
> | PCN patch success                                    | 2676         | 2465       | 2459          | 2089        | 2214          | 2259        |
> | PCN patch cb offset failure                          | 0            | 0          | 0             | 0           | 0             | 0           |
> | PCN patch oopmap slot failure                        | 0            | 0          | 0             | 0           | 0             | 0           |
>
>
> | SpecJVM2008 compiler.compiler with fix iterations | x86_64: base | x8...

Thanks for the review Martin.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/17150#issuecomment-1868237724

From rrich at openjdk.org Sat Dec 23 08:13:04 2023
From: rrich at openjdk.org (Richard Reingruber)
Date: Sat, 23 Dec 2023 08:13:04 GMT
Subject: RFR: 8322294: Cleanup NativePostCallNop [v2]
In-Reply-To:
References: <6LS57mCF2fgaosnyfnNydaqfT3cD3F42xsDOujG5SgY=.2db5f614-f64d-4fe4-8e68-1c06e70205d3@github.com>
Message-ID:

On Fri, 22 Dec 2023 11:10:07 GMT, Martin Doerr wrote:

>> Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Review Martin
>
> src/hotspot/cpu/s390/nativeInst_s390.hpp line 661:
>
>> 659:   bool check() const { Unimplemented(); return false; }
>> 660:   bool decode(int32_t& oopmap_slot, int32_t& cb_offset) const { return false; }
>> 661:   bool patch(int32_t oopmap_slot, int32_t cb_offset) { Unimplemented() ; return false; }
>
> Whitespace between `()` and `;`.

Done. I also reverted the change of `make_deopt`.
> src/hotspot/share/runtime/frame.inline.hpp line 109: > >> 107: inline const ImmutableOopMap* frame::get_oop_map() const { >> 108: if (_cb == nullptr) return nullptr; >> 109: if (_cb->oop_maps() != nullptr) { > > Could be shorter: `if (_cb == nullptr || _cb->oop_maps() == nullptr) return nullptr;` Yes, that's better. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17150#discussion_r1435525912 PR Review Comment: https://git.openjdk.org/jdk/pull/17150#discussion_r1435525925 From aph at openjdk.org Sat Dec 23 11:43:37 2023 From: aph at openjdk.org (Andrew Haley) Date: Sat, 23 Dec 2023 11:43:37 GMT Subject: RFR: 8290965: PPC64: Implement post-call NOPs In-Reply-To: References: Message-ID: On Wed, 20 Dec 2023 19:56:28 GMT, Richard Reingruber wrote: > #### Implementation of post call nops (PCNs) on ppc64. > > Depends on https://github.com/openjdk/jdk/pull/17150 > > About post call nops: > > - instruction(s) at return addresses of compiled java calls > - emitted iff vm continuations are enabled to support virtual threads > - encode data that can be be used to find the corresponding CodeBlob and oop map faster > - mt-safe patchable to trigger deoptimization > > Background: > > - Frames in continuation StackChunks are not visited if their compiled method is made not entrant (in contrast to frames on stack). > Instead all PCNs of the compiled method are patched to trigger deoptimization when control returns to such frames. > - With vm continuations, stacks are walked and inspected more frequently. This requires lookup of metadata like frame size and oop maps. As an optimization the offset of the CodeBlob to the PCN and the oop map slot are encoded as data in the PCN. > > Post call nops on ppc64 > > - 1 instruction, i.e. 
4 bytes (either CMPI or CMPLI[1])
>   x86_64: 1 instruction, 8 bytes
>   aarch64: 3 instructions, 12 bytes
>   [1] 3.1.10 Fixed Point Compare Instructions in Power ISA 3.1B
>   https://openpowerfoundation.org/specifications/isa/
>
> - 26 bits data payload
>   x86_64: 32 bits; aarch64: 32 bits
> - 9 bits dedicated to oop map slot. With 8 bits there were cases with SPECjvm2008 where the slot could not be encoded (on ppc64 and x86_64).
>   x86_64: 8 bits; aarch64: 8 bits
> - 17 bits dedicated to cb offset. Effectively 19 bits due to instruction alignment.
>   x86_64: 24 bits; aarch64: 24 bits
> - Also used when reconstructing the back chain after thawing continuation frames (see `Thaw::patch_caller_links`)
>
> - Refactored frame constructors to make use of fast CodeBlob lookup based on PCNs.
>   The fast lookup may only be used if the pc is known to be in the code cache because `CodeCache::find_blob_fast` can yield wrong results if it finds instructions outside the code cache that look just like PCNs. Callers of the frame class constructors need to pass `frame::kind::native` in that case to avoid errors. Other platforms don't make this explicit which is a problem in my eyes. Picking the wrong constructor can cause errors when porting and in future development.
>
> - Currently only the PCNs in nmethods are initialized. Therefore we don't even try to make a fast lookup based on PCNs if we know the CodeBlob is, e.g., a RuntimeStub. To achieve this we call the frame constructor passing `frame::kind::code_blob`.
>
> #### Statistics
>
>
> | SpecJVM2008...

src/hotspot/cpu/ppc/frame_ppc.hpp line 399:

> 397:     native,     // The frame's pc is not necessarily in the CodeCache.
> 398:                 // CodeCache::find_blob_fast(void* pc) can yield wrong results in this case and must not be used.
> 399:     code_blob,  // The frames pc is known to be in the CodeCache but it is likely not an nmethod.

Suggestion:

    code_blob,  // The frame's pc is known to be in the CodeCache but it is likely not in an nmethod.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17171#discussion_r1435578193 From rrich at openjdk.org Sat Dec 23 11:56:10 2023 From: rrich at openjdk.org (Richard Reingruber) Date: Sat, 23 Dec 2023 11:56:10 GMT Subject: RFR: 8290965: PPC64: Implement post-call NOPs [v2] In-Reply-To: References: Message-ID: > #### Implementation of post call nops (PCNs) on ppc64. > > Depends on https://github.com/openjdk/jdk/pull/17150 > > About post call nops: > > - instruction(s) at return addresses of compiled java calls > - emitted iff vm continuations are enabled to support virtual threads > - encode data that can be be used to find the corresponding CodeBlob and oop map faster > - mt-safe patchable to trigger deoptimization > > Background: > > - Frames in continuation StackChunks are not visited if their compiled method is made not entrant (in contrast to frames on stack). > Instead all PCNs of the compiled method are patched to trigger deoptimization when control returns to such frames. > - With vm continuations, stacks are walked and inspected more frequently. This requires lookup of metadata like frame size and oop maps. As an optimization the offset of the CodeBlob to the PCN and the oop map slot are encoded as data in the PCN. > > Post call nops on ppc64 > > - 1 instruction, i.e. 4 bytes (either CMPI or CMPLI[1]) > x86_64: 1 instruction, 8 bytes > aarch64: 3 instruction, 12 bytes > [1] 3.1.10 Fixed Point Compare Instructions in Power ISA 3.1B > https://openpowerfoundation.org/specifications/isa/ > > - 26 bits data payload > x86_64: 32 bits; aarch64: 32 bits > - 9 bits dedicated to oop map slot. With 8 bits there where cases with SPECjvm2008 where the slot could not be encoded (on ppc64 and x86_64). > x86_64: 8 bits; aarch64: 8 bits > - 17 bits dedicated to cb offset. Effectively 19 bits due to instruction alignment. 
>   x86_64: 24 bits; aarch64: 24 bits
> - Also used when reconstructing the back chain after thawing continuation frames (see `Thaw::patch_caller_links`)
>
> - Refactored frame constructors to make use of fast CodeBlob lookup based on PCNs.
>   The fast lookup may only be used if the pc is known to be in the code cache because `CodeCache::find_blob_fast` can yield wrong results if it finds instructions outside the code cache that look just like PCNs. Callers of the frame class constructors need to pass `frame::kind::native` in that case to avoid errors. Other platforms don't make this explicit which is a problem in my eyes. Picking the wrong constructor can cause errors when porting and in future development.
>
> - Currently only the PCNs in nmethods are initialized. Therefore we don't even try to make a fast lookup based on PCNs if we know the CodeBlob is, e.g., a RuntimeStub. To achieve this we call the frame constructor passing `frame::kind::code_blob`.
>
> #### Statistics
>
>
> | SpecJVM2008...
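
The bit budget in the description above (26 bits of payload, 9 for the oop map slot, 17 for the cb offset) can be pictured with a small packing sketch. This is not the ppc64 implementation. The field order, the helper names `pack_payload` and `unpack_payload`, and the rule that values which do not fit are rejected are all assumptions made for illustration. Only the field widths come from the description, and "effectively 19 bits due to instruction alignment" is read here as storing the offset in units of 4 bytes.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout, for illustration only:
//   [ oop map slot : 9 bits | cb offset / 4 : 17 bits ]  = 26 bits
constexpr int      kSlotBits   = 9;
constexpr int      kOffsetBits = 17;
constexpr uint32_t kSlotMax    = (1u << kSlotBits) - 1;
constexpr uint32_t kOffsetMax  = (1u << kOffsetBits) - 1;
constexpr uint32_t kPayloadMax = (1u << (kSlotBits + kOffsetBits)) - 1;

// Hypothetical helper: returns false (leaving payload untouched) when the
// slot or the offset cannot be represented in the assumed layout.
inline bool pack_payload(int32_t oopmap_slot, int32_t cb_offset, uint32_t& payload) {
  if (oopmap_slot < 0 || uint32_t(oopmap_slot) > kSlotMax) return false;
  if (cb_offset < 0 || (cb_offset & 3) != 0) return false;  // must be 4-byte aligned
  uint32_t scaled = uint32_t(cb_offset) >> 2;               // drop the two alignment bits
  if (scaled > kOffsetMax) return false;
  payload = (uint32_t(oopmap_slot) << kOffsetBits) | scaled;
  return true;
}

// Hypothetical helper: inverse of pack_payload.
inline void unpack_payload(uint32_t payload, int32_t& oopmap_slot, int32_t& cb_offset) {
  assert(payload <= kPayloadMax);
  oopmap_slot = int32_t(payload >> kOffsetBits);
  cb_offset   = int32_t((payload & kOffsetMax) << 2);       // restore byte offset
}
```

Under these assumptions the largest encodable slot is 511 and the largest encodable offset is 524284 bytes; anything beyond that has to fall back to a slower lookup, which is what the "patch ... failure" counters in the statistics would count.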
Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision: Fix comment Co-authored-by: Andrew Haley ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17171/files - new: https://git.openjdk.org/jdk/pull/17171/files/bad3ab7f..2d743469 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17171&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17171&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/17171.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17171/head:pull/17171 PR: https://git.openjdk.org/jdk/pull/17171 From duke at openjdk.org Sat Dec 23 15:30:48 2023 From: duke at openjdk.org (ExE Boss) Date: Sat, 23 Dec 2023 15:30:48 GMT Subject: RFR: 8322294: Cleanup NativePostCallNop [v2] In-Reply-To: References: <6LS57mCF2fgaosnyfnNydaqfT3cD3F42xsDOujG5SgY=.2db5f614-f64d-4fe4-8e68-1c06e70205d3@github.com> Message-ID: On Sat, 23 Dec 2023 08:13:02 GMT, Richard Reingruber wrote: >> This is a refactoring/cleanup of `NativePostCallNop` that simplifies the ppc64 port (dependent pr https://github.com/openjdk/jdk/pull/17171). >> >> * `frame::get_oop_map()` is moved to shared code >> >> * encoding / decoding details of the oopmap slot and the CodeBlob offset are moved from shared code to the platform dependent implementations of `bool NativePostCallNop::patch(int32_t oopmap_slot, int32_t cb_offset)` and `bool NativePostCallNop::decode(int32_t& oopmap_slot, int32_t& cb_offset)` >> >> The change passed our CI testing. JTReg tests: tier1-4 of hotspot and jdk. All of Langtools and jaxp. SPECjvm2008, SPECjbb2015, Renaissance Suite, and SAP specific tests. >> All testing was done with fastdebug and release builds on the main platforms and also on Linux/PPC64le and AIX. >> >> EDIT 2023-12-22: Statistics >> >> The statistical numbers were generated with release builds. For riscv64 I used qemu. >> The variance is high on all platforms. 
Up to 80% I think. Numbers with fastdebug are also very different.
>> Nevertheless, they are consistent within one run, and I'd expect errors in encoding or decoding to manifest in the numbers.
>>
>> | test/jdk/java/lang/Thread/virtual/stress/Skynet.java | x86_64: base | x86_64: pr | aarch64: base | aarch64: pr | riscv64: base | riscv64: pr |
>> |------------------------------------------------------|--------------|------------|---------------|-------------|---------------|-------------|
>> | PCN lookup success                                   | 17517455     | 15339681   | 13179049      | 15980253    | 19400110      | 30017193    |
>> | PCN lookup failure                                   | 328164       | 372555     | 237617        | 138164      | 415341        | 586476      |
>> | PCN decode success                                   | 17513991     | 15336485   | 13176061      | 15977651    | 19397398      | 30014226    |
>> | PCN decode failure                                   | 3464         | 3196       | 2988          | 2602        | 2712          | 2967        |
>> | PCN patch success                                    | 2676         | 2465       | 2459          | 2089        | 2214          | 2259        |
>> | PCN patch cb offset failure                          | 0            | 0          | 0             | 0           | 0             | 0           |
>> | PCN patch oopmap slot failure                        | 0            | 0          | 0             | 0           | 0             | 0           |
>>
>>
>> | SpecJVM2008 compil...
>
> Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision:
>
>   Review Martin

src/hotspot/share/runtime/frame.inline.hpp line 108:

> 106:
> 107: inline const ImmutableOopMap* frame::get_oop_map() const {
> 108:   if (_cb == nullptr || _cb->oop_maps() == nullptr) return nullptr;

Maybe add a newline after this to visually separate the guard from the body:
Suggestion:

    if (_cb == nullptr || _cb->oop_maps() == nullptr) return nullptr;

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/17150#discussion_r1435631692

From rrich at openjdk.org Sat Dec 23 23:29:11 2023
From: rrich at openjdk.org (Richard Reingruber)
Date: Sat, 23 Dec 2023 23:29:11 GMT
Subject: RFR: 8322294: Cleanup NativePostCallNop [v3]
In-Reply-To: <6LS57mCF2fgaosnyfnNydaqfT3cD3F42xsDOujG5SgY=.2db5f614-f64d-4fe4-8e68-1c06e70205d3@github.com>
References: <6LS57mCF2fgaosnyfnNydaqfT3cD3F42xsDOujG5SgY=.2db5f614-f64d-4fe4-8e68-1c06e70205d3@github.com>
Message-ID: <7X0tS6YuApHTHrhicD7cZ6fXc7o3lg54gQ2Y7LiwA7w=.a8e50e61-3e9a-44be-996c-43a476e63f13@github.com>

> This is a refactoring/cleanup of `NativePostCallNop` that simplifies the ppc64 port (dependent pr https://github.com/openjdk/jdk/pull/17171).
>
> * `frame::get_oop_map()` is moved to shared code
>
> * encoding / decoding details of the oopmap slot and the CodeBlob offset are moved from shared code to the platform dependent implementations of `bool NativePostCallNop::patch(int32_t oopmap_slot, int32_t cb_offset)` and `bool NativePostCallNop::decode(int32_t& oopmap_slot, int32_t& cb_offset)`
>
> The change passed our CI testing. JTReg tests: tier1-4 of hotspot and jdk. All of Langtools and jaxp. SPECjvm2008, SPECjbb2015, Renaissance Suite, and SAP specific tests.
> All testing was done with fastdebug and release builds on the main platforms and also on Linux/PPC64le and AIX.
>
> EDIT 2023-12-22: Statistics
>
> The statistical numbers were generated with release builds.
For riscv64 I used qemu.
> The variance is high on all platforms. Up to 80% I think. Numbers with fastdebug are also very different.
> Nevertheless, they are consistent within one run, and I'd expect errors in encoding or decoding to manifest in the numbers.
>
> | test/jdk/java/lang/Thread/virtual/stress/Skynet.java | x86_64: base | x86_64: pr | aarch64: base | aarch64: pr | riscv64: base | riscv64: pr |
> |------------------------------------------------------|--------------|------------|---------------|-------------|---------------|-------------|
> | PCN lookup success                                   | 17517455     | 15339681   | 13179049      | 15980253    | 19400110      | 30017193    |
> | PCN lookup failure                                   | 328164       | 372555     | 237617        | 138164      | 415341        | 586476      |
> | PCN decode success                                   | 17513991     | 15336485   | 13176061      | 15977651    | 19397398      | 30014226    |
> | PCN decode failure                                   | 3464         | 3196       | 2988          | 2602        | 2712          | 2967        |
> | PCN patch success                                    | 2676         | 2465       | 2459          | 2089        | 2214          | 2259        |
> | PCN patch cb offset failure                          | 0            | 0          | 0             | 0           | 0             | 0           |
> | PCN patch oopmap slot failure                        | 0            | 0          | 0             | 0           | 0             | 0           |
>
>
> | SpecJVM2008 compiler.compiler with fix iterations | x86_64: base | x8...
Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision:

  Add newline

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/17150/files
  - new: https://git.openjdk.org/jdk/pull/17150/files/904c2337..bbeac689

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=17150&range=02
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17150&range=01-02

  Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod
  Patch: https://git.openjdk.org/jdk/pull/17150.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/17150/head:pull/17150

PR: https://git.openjdk.org/jdk/pull/17150

From qamai at openjdk.org Sun Dec 24 00:07:17 2023
From: qamai at openjdk.org (Quan Anh Mai)
Date: Sun, 24 Dec 2023 00:07:17 GMT
Subject: RFR: 8282365: Consolidate and improve division by constant idealizations [v40]
In-Reply-To:
References:
Message-ID: <6_ulNkpSH-sOMnC2MzYRsrYkLs_Kaq63u2fqvYXizpQ=.ff7e2531-8524-4ebf-8c5a-97feb842839e@github.com>

> This patch implements idealisation for unsigned divisions to change a division by a constant into a series of multiplications and shifts. I also change the idealisation of `DivI` to get a more efficient series when the magic constant overflows an int32.
>
> In general, the idea behind a signed division transformation is that for a positive constant `d`, we would need to find constants `c` and `m` so that:
>
>     floor(x / d) = floor(x * c / 2**m)      for 0 < x < 2**(N - 1)      (1)
>     ceil(x / d)  = floor(x * c / 2**m) + 1  for -2**(N - 1) <= x < 0    (2)
>
> The implementation in the original book takes into consideration that the machine may not be able to perform the full multiplication `x * c`, so the constant overflows and we need to add back the dividend as in the `DivLNode::Ideal` cases. However, for int32 division, `x * c` cannot overflow an int64. As a result, it is always feasible to just calculate the product and shift the result.
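
As a rough illustration of conditions (1) and (2) for int32 dividends, the sketch below derives a multiplier and a shift for a positive divisor and applies them with a single 64-bit multiplication. It is not the code from this PR: the names `SignedMagic`, `signed_magic` and `div_by_constant` are made up for the example, and the choice m = 31 + ceil(log2(d)), c = floor(2**m / d) + 1 is one textbook-style choice that keeps `x * c` within an int64. It is assumed here rather than taken from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical pair (c, m) for dividing an int32 by a positive constant d.
struct SignedMagic {
  int64_t c;  // multiplier, at most 2^32
  int     m;  // right shift, 31 + ceil(log2(d))
};

// Assumed derivation: m = 31 + ceil(log2(d)), c = floor(2^m / d) + 1.
inline SignedMagic signed_magic(int32_t d) {
  assert(d > 0);
  int l = 0;
  while ((int64_t(1) << l) < d) l++;          // l = ceil(log2(d))
  int m = 31 + l;                             // at most 62, so 2^m fits in int64
  int64_t c = (int64_t(1) << m) / d + 1;
  return {c, m};
}

// Computes x / d (truncating, like Java and C++) without a division.
inline int32_t div_by_constant(int32_t x, SignedMagic k) {
  // |x| <= 2^31 and c <= 2^32, so the product fits in an int64.
  // The shift must be arithmetic (floor); that is guaranteed since C++20
  // and is what mainstream compilers do for earlier standards.
  int64_t q = (int64_t(x) * k.c) >> k.m;      // floor(x * c / 2^m)
  return int32_t(x < 0 ? q + 1 : q);          // (2) for negative x, (1) otherwise
}
```

For d = 7 this yields m = 34 and c = 2454267027, so `div_by_constant(-7, signed_magic(7))` takes the floor of a value slightly below -1, which is -2, and the `+ 1` from condition (2) brings it back to -1.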
> > For unsigned multiplication, the situation is somewhat trickier because the condition needs to be twice as strong (the condition (1) and (2) above are mostly the same). This results in the magic constant `c` calculated based on the method presented in Hacker's Delight by Henry S. Warren, Jr. may overflow an uintN. For int division, we can depend on the theorem devised by Arch D. Robison in N-Bit Unsigned Division Via N-Bit Multiply-Add, which states that there exists either: > > c1 in uint32 and m1, such that floor(x / d) = floor(x * c1 / 2**m1) for 0 < x < 2**32 (3) > c2 in uint32 and m2, such that floor(x / d) = floor((x + 1) * c2 / 2**m2) for 0 < x < 2**32 (4) > > which means that either `x * c1` never overflows an uint64 or `(x + 1) * c2` never overflows an uint64. And we can perform a full multiplication. > > For longs, there is no way to do a full multiplication so we do some basic transformations to achieve a computable formula. The details I have written as comments in the overflow case. > > More tests are added to cover the possible patterns. > > Please take a look and have some reviews. Thank you very much. Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 83 commits: - address reviews - Merge branch 'master' into unsignedDiv - missing revert - missing include - remove static - address reviews - missing include - isolate javaArithmetic changes - Merge branch 'master' into unsignedDiv - fix proof - ... 
and 73 more: https://git.openjdk.org/jdk/compare/28c82bf1...05981133 ------------- Changes: https://git.openjdk.org/jdk/pull/9947/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9947&range=39 Stats: 2323 lines in 13 files changed: 1847 ins; 289 del; 187 mod Patch: https://git.openjdk.org/jdk/pull/9947.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/9947/head:pull/9947 PR: https://git.openjdk.org/jdk/pull/9947 From qamai at openjdk.org Sun Dec 24 00:07:17 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Sun, 24 Dec 2023 00:07:17 GMT Subject: RFR: 8282365: Consolidate and improve division by constant idealizations [v39] In-Reply-To: References: <4aXL_qh1epRWCwufaHiKXJ3wuPqG0xZSF6i-8r6OgcU=.97a5ff7e-5a19-47e5-b14d-af16ef5c56d5@github.com> Message-ID: <9oni8cblZSM7V9glr8OwMwN18kfNVwKu_9KOgUEVMXk=.e5b2af24-ee85-403f-8541-77b9ca6e90c2@github.com> On Mon, 18 Dec 2023 17:09:06 GMT, Kim Barrett wrote: >> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: >> >> missing revert > > src/hotspot/share/opto/divconstants.cpp line 28: > >> 26: #include >> 27: #include >> 28: #include "utilities/powerOfTwo.hpp" > > Comment about include order was marked resolved, but no change was made. Sorry I missed this file. It is fixed now. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/9947#discussion_r1435742898 From qamai at openjdk.org Sun Dec 24 00:07:17 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Sun, 24 Dec 2023 00:07:17 GMT Subject: RFR: 8282365: Consolidate and improve division by constant idealizations [v35] In-Reply-To: <0vhJhWdel74-hzt5aQ7ykhOFvaVMBTNjH3daKZCjdM8=.97b6817b-f5f3-4b91-9e94-c6794a21b198@github.com> References: <0vhJhWdel74-hzt5aQ7ykhOFvaVMBTNjH3daKZCjdM8=.97b6817b-f5f3-4b91-9e94-c6794a21b198@github.com> Message-ID: On Mon, 18 Dec 2023 17:07:32 GMT, Kim Barrett wrote: >> I see there are multiple cases where a header is defined in multiple source files, and these are used exclusively for `DivNode`s so putting them here seems logical. > > If it were true these are only used in divnode.cpp then the declarations could > be in that file. But these are also used in the gtest. That gtest duplicates > the declaration. Better would be for it to include the suggested > divconstants.hpp. > > Having examples of an unusual style doesn't mean we want more, at least not > without some fairly good reason. I see, I moved it to another file then. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/9947#discussion_r1435742851 From qamai at openjdk.org Sun Dec 24 00:12:11 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Sun, 24 Dec 2023 00:12:11 GMT Subject: RFR: 8282365: Consolidate and improve division by constant idealizations [v39] In-Reply-To: References: <4aXL_qh1epRWCwufaHiKXJ3wuPqG0xZSF6i-8r6OgcU=.97a5ff7e-5a19-47e5-b14d-af16ef5c56d5@github.com> Message-ID: On Mon, 18 Dec 2023 17:13:48 GMT, Kim Barrett wrote: >> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: >> >> missing revert > > test/hotspot/gtest/opto/test_constant_division.cpp line 66: > >> 64: >> 65: template >> 66: void magic_divide_constants(T d, T N_neg, T N_pos, juint min_s, T& c, bool& c_ovf, juint& s); > > I don't see any tests here of magic_divide_constants_round_down. I took the formula quite literally from the paper so I don't think there is a need for a separate test for those cases. It is also covered in the transformation tests from the Java side. ![image](https://github.com/openjdk/jdk/assets/49088128/cc1b1c5e-a37b-4510-9aff-839f0337a532) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/9947#discussion_r1435743303 From jrose at openjdk.org Sun Dec 24 01:26:11 2023 From: jrose at openjdk.org (John R Rose) Date: Sun, 24 Dec 2023 01:26:11 GMT Subject: RFR: 8282365: Consolidate and improve division by constant idealizations [v39] In-Reply-To: References: <4aXL_qh1epRWCwufaHiKXJ3wuPqG0xZSF6i-8r6OgcU=.97a5ff7e-5a19-47e5-b14d-af16ef5c56d5@github.com> Message-ID: On Sun, 24 Dec 2023 00:08:50 GMT, Quan Anh Mai wrote: >> test/hotspot/gtest/opto/test_constant_division.cpp line 66: >> >>> 64: >>> 65: template >>> 66: void magic_divide_constants(T d, T N_neg, T N_pos, juint min_s, T& c, bool& c_ovf, juint& s); >> >> I don't see any tests here of magic_divide_constants_round_down. 
> > I took the formula quite literally from the paper so I don't think there is a need for a separate test for those cases. It is also covered in the transformation tests from the Java side. > > ![image](https://github.com/openjdk/jdk/assets/49088128/cc1b1c5e-a37b-4510-9aff-839f0337a532) It is great that we are factoring out these algorithm steps into their own separately reviewable and maintainable API points. I would even say it is *necessary* to do this, compared with the alternatives, such as (what we used to do) write random file-local helper functions or even hand-inlined statements. A big part of the advantage is that we can catch bugs at the subroutine level, rather than at the system level. But this only works if we write (or at least try hard to write) unit tests for each subroutine. It doesn't matter very much whether the subroutine comes from a published source or whether we created it ourselves somehow. (The risk model for where failures come from differs a little, since the published source is presumably better reviewed than our own work.) In either case a unit test (gtest) adds a lot of value. The gtest can detect either of two interesting problems: 1. an error in the algorithm (this happens even if it is published) or 2. an error in our encoding of the algorithm. The second is probably more likely. It can happen due to source code errors or due to compiler bugs. Either way, the gtest adds value by making it more likely that, if something goes wrong, we will find it before system integration and deployment. So, based on what I can see here, I recommend writing at least a simple gtest for each subroutine we are writing, regardless of its source. (Reminder: When using random numbers as test inputs, please ensure that the gtests are seeded reproducibly. Double check for pre-existing uses of random generators in the gtests that satisfy this requirement.) Thanks for this very good work; I'm very glad you tackled it.
Getting numerics correct is always tricky, but this will pay off. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/9947#discussion_r1435748637 From aph at openjdk.org Sun Dec 24 10:27:51 2023 From: aph at openjdk.org (Andrew Haley) Date: Sun, 24 Dec 2023 10:27:51 GMT Subject: RFR: 8322294: Cleanup NativePostCallNop [v3] In-Reply-To: <7X0tS6YuApHTHrhicD7cZ6fXc7o3lg54gQ2Y7LiwA7w=.a8e50e61-3e9a-44be-996c-43a476e63f13@github.com> References: <6LS57mCF2fgaosnyfnNydaqfT3cD3F42xsDOujG5SgY=.2db5f614-f64d-4fe4-8e68-1c06e70205d3@github.com> <7X0tS6YuApHTHrhicD7cZ6fXc7o3lg54gQ2Y7LiwA7w=.a8e50e61-3e9a-44be-996c-43a476e63f13@github.com> Message-ID: On Sat, 23 Dec 2023 23:29:11 GMT, Richard Reingruber wrote: >> This is a refactoring/cleanup of `NativePostCallNop` that simplifies the ppc64 port (dependent pr https://github.com/openjdk/jdk/pull/17171). >> >> * `frame::get_oop_map()` is moved to shared code >> >> * encoding / decoding details of the oopmap slot and the CodeBlob offset are moved from shared code to the platform dependent implementations of `bool NativePostCallNop::patch(int32_t oopmap_slot, int32_t cb_offset)` and `bool NativePostCallNop::decode(int32_t& oopmap_slot, int32_t& cb_offset)` >> >> The change passed our CI testing. JTReg tests: tier1-4 of hotspot and jdk. All of Langtools and jaxp. SPECjvm2008, SPECjbb2015, Renaissance Suite, and SAP specific tests. >> All testing was done with fastdebug and release builds on the main platforms and also on Linux/PPC64le and AIX. >> >> EDIT 2023-12-22: Statistics >> >> The statistical numbers were generated with release builds. For riscv64 I used qemu. >> The variance is high on all platforms. Up to 80% I think. Numbers with fastdebug are also very different. >> Nevertheless, they are consistent within one run, and I'd expect errors in encoding or decoding to manifest in the numbers. 
>> >> | test/jdk/java/lang/Thread/virtual/stress/Skynet.java | x86_64: base | x86_64: pr | aarch64: base | aarch64: pr | riscv64: base | riscv64: pr | >> |------------------------------------------------------|--------------|------------|---------------|-------------|---------------|-------------| >> | PCN lookup success | 17517455 | 15339681 | 13179049 | 15980253 | 19400110 | 30017193 | >> | PCN lookup failure | 328164 | 372555 | 237617 | 138164 | 415341 | 586476 | >> | PCN decode success | 17513991 | 15336485 | 13176061 | 15977651 | 19397398 | 30014226 | >> | PCN decode failure | 3464 | 3196 | 2988 | 2602 | 2712 | 2967 | >> | PCN patch success | 2676 | 2465 | 2459 | 2089 | 2214 | 2259 | >> | PCN patch cb offset failure | 0 | 0 | 0 | 0 | 0 | 0 | >> | PCN patch oopmap slot failure | 0 | 0 | 0 | 0 | 0 | 0 | >> >> >> | SpecJVM2008 compil... > > Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision: > > Add newline src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp line 567: > 565: return false; // cannot encode > 566: } > 567: uint32_t data = (oopmap_slot << 24) | cb_offset; Suggestion: uint32_t data = ((uint32_t)oopmap_slot << 24) | cb_offset; ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17150#discussion_r1435800388 From aph at openjdk.org Sun Dec 24 10:43:48 2023 From: aph at openjdk.org (Andrew Haley) Date: Sun, 24 Dec 2023 10:43:48 GMT Subject: RFR: 8322294: Cleanup NativePostCallNop [v3] In-Reply-To: <7X0tS6YuApHTHrhicD7cZ6fXc7o3lg54gQ2Y7LiwA7w=.a8e50e61-3e9a-44be-996c-43a476e63f13@github.com> References: <6LS57mCF2fgaosnyfnNydaqfT3cD3F42xsDOujG5SgY=.2db5f614-f64d-4fe4-8e68-1c06e70205d3@github.com> <7X0tS6YuApHTHrhicD7cZ6fXc7o3lg54gQ2Y7LiwA7w=.a8e50e61-3e9a-44be-996c-43a476e63f13@github.com> Message-ID: On Sat, 23 Dec 2023 23:29:11 GMT, Richard Reingruber wrote: >> This is a refactoring/cleanup of `NativePostCallNop` that simplifies the ppc64 port (dependent pr 
https://github.com/openjdk/jdk/pull/17171). >> >> * `frame::get_oop_map()` is moved to shared code >> >> * encoding / decoding details of the oopmap slot and the CodeBlob offset are moved from shared code to the platform dependent implementations of `bool NativePostCallNop::patch(int32_t oopmap_slot, int32_t cb_offset)` and `bool NativePostCallNop::decode(int32_t& oopmap_slot, int32_t& cb_offset)` >> >> The change passed our CI testing. JTReg tests: tier1-4 of hotspot and jdk. All of Langtools and jaxp. SPECjvm2008, SPECjbb2015, Renaissance Suite, and SAP specific tests. >> All testing was done with fastdebug and release builds on the main platforms and also on Linux/PPC64le and AIX. >> >> EDIT 2023-12-22: Statistics >> >> The statistical numbers were generated with release builds. For riscv64 I used qemu. >> The variance is high on all platforms. Up to 80% I think. Numbers with fastdebug are also very different. >> Nevertheless, they are consistent within one run, and I'd expect errors in encoding or decoding to manifest in the numbers. >> >> | test/jdk/java/lang/Thread/virtual/stress/Skynet.java | x86_64: base | x86_64: pr | aarch64: base | aarch64: pr | riscv64: base | riscv64: pr | >> |------------------------------------------------------|--------------|------------|---------------|-------------|---------------|-------------| >> | PCN lookup success | 17517455 | 15339681 | 13179049 | 15980253 | 19400110 | 30017193 | >> | PCN lookup failure | 328164 | 372555 | 237617 | 138164 | 415341 | 586476 | >> | PCN decode success | 17513991 | 15336485 | 13176061 | 15977651 | 19397398 | 30014226 | >> | PCN decode failure | 3464 | 3196 | 2988 | 2602 | 2712 | 2967 | >> | PCN patch success | 2676 | 2465 | 2459 | 2089 | 2214 | 2259 | >> | PCN patch cb offset failure | 0 | 0 | 0 | 0 | 0 | 0 | >> | PCN patch oopmap slot failure | 0 | 0 | 0 | 0 | 0 | 0 | >> >> >> | SpecJVM2008 compil... 
> > Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision: > > Add newline Looks good. I was initially a bit worried it'd restrict the flexibility of ports to use custom encodings, but I don't think that's a problem. ------------- Marked as reviewed by aph (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/17150#pullrequestreview-1795553240 From ddong at openjdk.org Mon Dec 25 15:40:54 2023 From: ddong at openjdk.org (Denghui Dong) Date: Mon, 25 Dec 2023 15:40:54 GMT Subject: RFR: 8322735: C2: minor improvements of bubble sort used in SuperWord::packset_sort Message-ID: A minor improvement could be made for bubble sort in SuperWord::packset_sort to reduce the comparison count in bad cases. See https://en.wikipedia.org/wiki/Bubble_sort ------------- Commit messages: - 8322735: C2: minor improvements of bubble sort used in SuperWord::packset_sort Changes: https://git.openjdk.org/jdk/pull/17190/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=17190&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8322735 Stats: 6 lines in 1 file changed: 0 ins; 1 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/17190.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17190/head:pull/17190 PR: https://git.openjdk.org/jdk/pull/17190 From ddong at openjdk.org Mon Dec 25 15:48:07 2023 From: ddong at openjdk.org (Denghui Dong) Date: Mon, 25 Dec 2023 15:48:07 GMT Subject: RFR: 8322694: C1: Handle Constant and IfOp in NullCheckEliminator Message-ID: <1vvyuwLRjlWItKwCyighjCSM5SNbO4CSEE59hQtCU24=.b4783e52-328a-4ce3-8c92-7b736cea7546@github.com> This patch adds support for Constant and IfOp in NullCheckEliminator to eliminate more null checks.
testing: tier1-4 in progress ------------- Commit messages: - 8322694: C1: Handle Constant and IfOp in NullCheckEliminator Changes: https://git.openjdk.org/jdk/pull/17191/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=17191&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8322694 Stats: 28 lines in 1 file changed: 25 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/17191.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17191/head:pull/17191 PR: https://git.openjdk.org/jdk/pull/17191 From qamai at openjdk.org Mon Dec 25 17:15:28 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Mon, 25 Dec 2023 17:15:28 GMT Subject: RFR: 8282365: Consolidate and improve division by constant idealizations [v41] In-Reply-To: References: Message-ID: > This patch implements idealisation for unsigned divisions to change a division by a constant into a series of multiplication and shift. I also change the idealisation of `DivI` to get a more efficient series when the magic constant overflows an int32. > > In general, the idea behind a signed division transformation is that for a positive constant `d`, we would need to find constants `c` and `m` so that: > > floor(x / d) = floor(x * c / 2**m) for 0 < x < 2**(N - 1) (1) > ceil(x / d) = floor(x * c / 2**m) + 1 for -2**(N - 1) <= x < 0 (2) > > The implementation in the original book takes into consideration that the machine may not be able to perform the full multiplication `x * c`, so the constant overflow and we need to add back the dividend as in `DivLNode::Ideal` cases. However, for int32 division, `x * c` cannot overflow an int64. As a result, it is always feasible to just calculate the product and shift the result. > > For unsigned multiplication, the situation is somewhat trickier because the condition needs to be twice as strong (the condition (1) and (2) above are mostly the same). This results in the magic constant `c` calculated based on the method presented in Hacker's Delight by Henry S. 
Warren, Jr. may overflow an uintN. For int division, we can depend on the theorem devised by Arch D. Robison in N-Bit Unsigned Division Via N-Bit Multiply-Add, which states that there exists either: > > c1 in uint32 and m1, such that floor(x / d) = floor(x * c1 / 2**m1) for 0 < x < 2**32 (3) > c2 in uint32 and m2, such that floor(x / d) = floor((x + 1) * c2 / 2**m2) for 0 < x < 2**32 (4) > > which means that either `x * c1` never overflows an uint64 or `(x + 1) * c2` never overflows an uint64. And we can perform a full multiplication. > > For longs, there is no way to do a full multiplication so we do some basic transformations to achieve a computable formula. The details I have written as comments in the overflow case. > > More tests are added to cover the possible patterns. > > Please take a look and have some reviews. Thank you very much. Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: test for round down ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9947/files - new: https://git.openjdk.org/jdk/pull/9947/files/05981133..57265bbd Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9947&range=40 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9947&range=39-40 Stats: 107 lines in 2 files changed: 75 ins; 25 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/9947.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/9947/head:pull/9947 PR: https://git.openjdk.org/jdk/pull/9947 From qamai at openjdk.org Mon Dec 25 17:57:28 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Mon, 25 Dec 2023 17:57:28 GMT Subject: RFR: 8282365: Consolidate and improve division by constant idealizations [v42] In-Reply-To: References: Message-ID: > This patch implements idealisation for unsigned divisions to change a division by a constant into a series of multiplication and shift. I also change the idealisation of `DivI` to get a more efficient series when the magic constant overflows an int32. 
> > In general, the idea behind a signed division transformation is that for a positive constant `d`, we would need to find constants `c` and `m` so that: > > floor(x / d) = floor(x * c / 2**m) for 0 < x < 2**(N - 1) (1) > ceil(x / d) = floor(x * c / 2**m) + 1 for -2**(N - 1) <= x < 0 (2) > > The implementation in the original book takes into consideration that the machine may not be able to perform the full multiplication `x * c`, so the constant overflow and we need to add back the dividend as in `DivLNode::Ideal` cases. However, for int32 division, `x * c` cannot overflow an int64. As a result, it is always feasible to just calculate the product and shift the result. > > For unsigned multiplication, the situation is somewhat trickier because the condition needs to be twice as strong (the condition (1) and (2) above are mostly the same). This results in the magic constant `c` calculated based on the method presented in Hacker's Delight by Henry S. Warren, Jr. may overflow an uintN. For int division, we can depend on the theorem devised by Arch D. Robison in N-Bit Unsigned Division Via N-Bit Multiply-Add, which states that there exists either: > > c1 in uint32 and m1, such that floor(x / d) = floor(x * c1 / 2**m1) for 0 < x < 2**32 (3) > c2 in uint32 and m2, such that floor(x / d) = floor((x + 1) * c2 / 2**m2) for 0 < x < 2**32 (4) > > which means that either `x * c1` never overflows an uint64 or `(x + 1) * c2` never overflows an uint64. And we can perform a full multiplication. > > For longs, there is no way to do a full multiplication so we do some basic transformations to achieve a computable formula. The details I have written as comments in the overflow case. > > More tests are added to cover the possible patterns. > > Please take a look and have some reviews. Thank you very much. 
Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: power of 2 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9947/files - new: https://git.openjdk.org/jdk/pull/9947/files/57265bbd..0f2c57c7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9947&range=41 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9947&range=40-41 Stats: 3 lines in 1 file changed: 3 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/9947.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/9947/head:pull/9947 PR: https://git.openjdk.org/jdk/pull/9947 From rrich at openjdk.org Tue Dec 26 07:29:55 2023 From: rrich at openjdk.org (Richard Reingruber) Date: Tue, 26 Dec 2023 07:29:55 GMT Subject: RFR: 8322294: Cleanup NativePostCallNop [v4] In-Reply-To: <6LS57mCF2fgaosnyfnNydaqfT3cD3F42xsDOujG5SgY=.2db5f614-f64d-4fe4-8e68-1c06e70205d3@github.com> References: <6LS57mCF2fgaosnyfnNydaqfT3cD3F42xsDOujG5SgY=.2db5f614-f64d-4fe4-8e68-1c06e70205d3@github.com> Message-ID: > This is a refactoring/cleanup of `NativePostCallNop` that simplifies the ppc64 port (dependent pr https://github.com/openjdk/jdk/pull/17171). > > * `frame::get_oop_map()` is moved to shared code > > * encoding / decoding details of the oopmap slot and the CodeBlob offset are moved from shared code to the platform dependent implementations of `bool NativePostCallNop::patch(int32_t oopmap_slot, int32_t cb_offset)` and `bool NativePostCallNop::decode(int32_t& oopmap_slot, int32_t& cb_offset)` > > The change passed our CI testing. JTReg tests: tier1-4 of hotspot and jdk. All of Langtools and jaxp. SPECjvm2008, SPECjbb2015, Renaissance Suite, and SAP specific tests. > All testing was done with fastdebug and release builds on the main platforms and also on Linux/PPC64le and AIX. > > EDIT 2023-12-22: Statistics > > The statistical numbers were generated with release builds. For riscv64 I used qemu. > The variance is high on all platforms. 
Up to 80% I think. Numbers with fastdebug are also very different. > Nevertheless, they are consistent within one run, and I'd expect errors in encoding or decoding to manifest in the numbers. > > | test/jdk/java/lang/Thread/virtual/stress/Skynet.java | x86_64: base | x86_64: pr | aarch64: base | aarch64: pr | riscv64: base | riscv64: pr | > |------------------------------------------------------|--------------|------------|---------------|-------------|---------------|-------------| > | PCN lookup success | 17517455 | 15339681 | 13179049 | 15980253 | 19400110 | 30017193 | > | PCN lookup failure | 328164 | 372555 | 237617 | 138164 | 415341 | 586476 | > | PCN decode success | 17513991 | 15336485 | 13176061 | 15977651 | 19397398 | 30014226 | > | PCN decode failure | 3464 | 3196 | 2988 | 2602 | 2712 | 2967 | > | PCN patch success | 2676 | 2465 | 2459 | 2089 | 2214 | 2259 | > | PCN patch cb offset failure | 0 | 0 | 0 | 0 | 0 | 0 | > | PCN patch oopmap slot failure | 0 | 0 | 0 | 0 | 0 | 0 | > > > | SpecJVM2008 compiler.compiler with fix iterations | x86_64: base | x8... 
Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision: Suggstion Andrew Co-authored-by: Andrew Haley ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17150/files - new: https://git.openjdk.org/jdk/pull/17150/files/bbeac689..6c1fd588 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17150&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17150&range=02-03 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/17150.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17150/head:pull/17150 PR: https://git.openjdk.org/jdk/pull/17150 From qamai at openjdk.org Tue Dec 26 07:34:12 2023 From: qamai at openjdk.org (Quan Anh Mai) Date: Tue, 26 Dec 2023 07:34:12 GMT Subject: RFR: 8282365: Consolidate and improve division by constant idealizations [v39] In-Reply-To: References: <4aXL_qh1epRWCwufaHiKXJ3wuPqG0xZSF6i-8r6OgcU=.97a5ff7e-5a19-47e5-b14d-af16ef5c56d5@github.com> Message-ID: On Sun, 24 Dec 2023 01:23:08 GMT, John R Rose wrote: >> I took the formula quite literally from the paper so I don't think there is a need for a separate test for those cases. It is also covered in the transformation tests from the Java side. >> >> ![image](https://github.com/openjdk/jdk/assets/49088128/cc1b1c5e-a37b-4510-9aff-839f0337a532) > > It is great that we are factoring out these algorithm steps into their own separately reviewable and maintainable API points. I would even say it is *necessary* to do this, compared with the alternatives, such as (what we used to do) write random file-local helper functions or even hand-inlined statements. > > A big part of the advantage is that we can catch bugs at the subroutine level, rather than at the system level. But this only works if we write (or at least try hard to write) unit tests for each subroutine. 
It doesn't matter very much whether the subroutine comes from a published source or whether we created it ourselves somehow. (The risk model for where failures come from differs a little, since the published source is presumably better reviewed than our own work.) In either case a unit test (gtest) adds a lot of value. The gtest can detect either of two interesting problems: 1. an error in the algorithm (this happens even if it is published) or 2. an error in our encoding of the algorithm. The second is probably more likely. It can happen due to source code errors or due to compiler bugs. Either way, the gtest adds value by making it more likely that, if something goes wrong, we will find it before system integration and deployment. > > So, based on what I can see here, I recommend writing at least a simple gtest for each subroutine we are writing, regardless of its source. > > (Reminder: When using random numbers as test inputs, please ensure that the gtests are seeded reproducibly. Double check for pre-existing uses of random generators in the gtests that satisfy this requirement.) > > Thanks for this very good work; I'm very glad you tackled it. Getting numerics correct is always tricky, but this will pay off. > > P.S. another reason to use a unit test on a published algorithm: the system validity can be demonstrated without reference to the publication. Self contained proof is better for us. @rose00 Thanks for your input, I have added a unit test for this case.
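
For readers following the unit-test discussion, here is a minimal sketch of what a subroutine-level check of a division-by-constant formula can look like. It is not the gtest from the PR and does not use `magic_divide_constants`; the helpers `ceil_log2`, `magic_u16` and `check_all_u16_dividends` are hypothetical names invented for illustration. It assumes the textbook round-up construction for unsigned 16-bit dividends (`m = 16 + ceil(log2 d)`, `c = ceil(2**m / d)`) and checks every dividend exhaustively, which avoids the random-seed question entirely.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: smallest s such that 2**s >= d.
static unsigned ceil_log2(uint32_t d) {
  unsigned s = 0;
  while ((uint64_t(1) << s) < d) {
    s++;
  }
  return s;
}

// Magic constant c and shift m for unsigned 16-bit division by d.
struct Magic {
  uint64_t c;
  unsigned m;
};

// Textbook round-up construction (an assumption of this sketch,
// not the algorithm used in the patch).
static Magic magic_u16(uint16_t d) {
  assert(d != 0);
  unsigned m = 16 + ceil_log2(d);
  uint64_t c = ((uint64_t(1) << m) + d - 1) / d;  // ceil(2**m / d)
  return {c, m};
}

// Exhaustively verify floor(x / d) == (x * c) >> m for all 16-bit x.
// The product fits easily in 64 bits, so no overflow handling is needed.
static bool check_all_u16_dividends(uint16_t d) {
  Magic mg = magic_u16(d);
  for (uint32_t x = 0; x <= 0xFFFF; x++) {
    uint64_t via_multiply = (x * mg.c) >> mg.m;
    if (via_multiply != x / d) {
      return false;
    }
  }
  return true;
}
```

A real gtest would wrap calls such as `check_all_u16_dividends(7)` in `ASSERT_TRUE`. For 32-bit and 64-bit dividends an exhaustive loop is impractical, which is where reproducibly seeded random inputs plus boundary values come in.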
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/9947#discussion_r1436298767 From rrich at openjdk.org Tue Dec 26 08:52:43 2023 From: rrich at openjdk.org (Richard Reingruber) Date: Tue, 26 Dec 2023 08:52:43 GMT Subject: RFR: 8322294: Cleanup NativePostCallNop [v4] In-Reply-To: References: <6LS57mCF2fgaosnyfnNydaqfT3cD3F42xsDOujG5SgY=.2db5f614-f64d-4fe4-8e68-1c06e70205d3@github.com> Message-ID: <6dWfyA4aksWNB4DuNyTwBK2F-06ovAa_bR7jQm6g-04=.c7e30d54-9a7b-4282-92b4-b83336ada5b3@github.com> On Tue, 26 Dec 2023 07:29:55 GMT, Richard Reingruber wrote: >> This is a refactoring/cleanup of `NativePostCallNop` that simplifies the ppc64 port (dependent pr https://github.com/openjdk/jdk/pull/17171). >> >> * `frame::get_oop_map()` is moved to shared code >> >> * encoding / decoding details of the oopmap slot and the CodeBlob offset are moved from shared code to the platform dependent implementations of `bool NativePostCallNop::patch(int32_t oopmap_slot, int32_t cb_offset)` and `bool NativePostCallNop::decode(int32_t& oopmap_slot, int32_t& cb_offset)` >> >> The change passed our CI testing. JTReg tests: tier1-4 of hotspot and jdk. All of Langtools and jaxp. SPECjvm2008, SPECjbb2015, Renaissance Suite, and SAP specific tests. >> All testing was done with fastdebug and release builds on the main platforms and also on Linux/PPC64le and AIX. >> >> EDIT 2023-12-22: Statistics >> >> The statistical numbers were generated with release builds. For riscv64 I used qemu. >> The variance is high on all platforms. Up to 80% I think. Numbers with fastdebug are also very different. >> Nevertheless, they are consistent within one run, and I'd expect errors in encoding or decoding to manifest in the numbers. 
>> >> | test/jdk/java/lang/Thread/virtual/stress/Skynet.java | x86_64: base | x86_64: pr | aarch64: base | aarch64: pr | riscv64: base | riscv64: pr | >> |------------------------------------------------------|--------------|------------|---------------|-------------|---------------|-------------| >> | PCN lookup success | 17517455 | 15339681 | 13179049 | 15980253 | 19400110 | 30017193 | >> | PCN lookup failure | 328164 | 372555 | 237617 | 138164 | 415341 | 586476 | >> | PCN decode success | 17513991 | 15336485 | 13176061 | 15977651 | 19397398 | 30014226 | >> | PCN decode failure | 3464 | 3196 | 2988 | 2602 | 2712 | 2967 | >> | PCN patch success | 2676 | 2465 | 2459 | 2089 | 2214 | 2259 | >> | PCN patch cb offset failure | 0 | 0 | 0 | 0 | 0 | 0 | >> | PCN patch oopmap slot failure | 0 | 0 | 0 | 0 | 0 | 0 | >> >> >> | SpecJVM2008 compil... > > Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision: > > Suggstion Andrew > > Co-authored-by: Andrew Haley Thanks for the review! ------------- PR Comment: https://git.openjdk.org/jdk/pull/17150#issuecomment-1869376562 From mdoerr at openjdk.org Wed Dec 27 18:39:55 2023 From: mdoerr at openjdk.org (Martin Doerr) Date: Wed, 27 Dec 2023 18:39:55 GMT Subject: RFR: 8290965: PPC64: Implement post-call NOPs [v2] In-Reply-To: References: Message-ID: On Sat, 23 Dec 2023 11:56:10 GMT, Richard Reingruber wrote: >> #### Implementation of post call nops (PCNs) on ppc64. 
>> >> Depends on https://github.com/openjdk/jdk/pull/17150 >> >> About post call nops: >> >> - instruction(s) at return addresses of compiled java calls >> - emitted iff vm continuations are enabled to support virtual threads >> - encode data that can be be used to find the corresponding CodeBlob and oop map faster >> - mt-safe patchable to trigger deoptimization >> >> Background: >> >> - Frames in continuation StackChunks are not visited if their compiled method is made not entrant (in contrast to frames on stack). >> Instead all PCNs of the compiled method are patched to trigger deoptimization when control returns to such frames. >> - With vm continuations, stacks are walked and inspected more frequently. This requires lookup of metadata like frame size and oop maps. As an optimization the offset of the CodeBlob to the PCN and the oop map slot are encoded as data in the PCN. >> >> Post call nops on ppc64 >> >> - 1 instruction, i.e. 4 bytes (either CMPI or CMPLI[1]) >> x86_64: 1 instruction, 8 bytes >> aarch64: 3 instruction, 12 bytes >> [1] 3.1.10 Fixed Point Compare Instructions in Power ISA 3.1B >> https://openpowerfoundation.org/specifications/isa/ >> >> - 26 bits data payload >> x86_64: 32 bits; aarch64: 32 bits >> - 9 bits dedicated to oop map slot. With 8 bits there where cases with SPECjvm2008 where the slot could not be encoded (on ppc64 and x86_64). >> x86_64: 8 bits; aarch64: 8 bits >> - 17 bits dedicated to cb offset. Effectively 19 bits due to instruction alignment. >> x86_64: 24 bits; aarch64: 24 bits >> - Also used when reconstructing the back chain after thawing continuation frames (see `Thaw::patch_caller_links`) >> >> - Refactored frame constructors to make use of fast CodeBlob lookup based on PCNs. >> The fast lookup may only be used if the pc is known to be in the code cache because `CodeCache::find_blob_fast` can yield wrong results if it finds instructions outside the code cache that look just like PCNs. 
Callers of the frame class constructors need to pass `frame::kind::native` in that case to avoid errors. Other platforms don't make this explicit which is a problem in my eyes. Picking the wrong constructor can cause errors when porting and in future development. >> >> - Currently only the PCNs in nmethods are initialized. Therefore we don't even try to make a fast lookup based on PCNs if we know the CodeBlob is, e.g., a RuntimeStub. To achieve this we call the frame cons... > > Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision: > > Fix comment > > Co-authored-by: Andrew Haley Usage of CMPI/CMPLI looks great. Assuming `kind::nmethod` by default will likely work, but I wonder if we could avoid that without measurable performance loss (see comments below). src/hotspot/cpu/ppc/frame_ppc.hpp line 398: > 396: enum class kind { > 397: native, // The frame's pc is not necessarily in the CodeCache. > 398: // CodeCache::find_blob_fast(void* pc) can yield wrong results in this case and must not be used. I'd probably call it `unknown`. src/hotspot/cpu/ppc/frame_ppc.hpp line 414: > 412: // Constructors > 413: inline frame(intptr_t* sp, intptr_t* fp, address pc); > 414: inline frame(intptr_t* sp, address pc, kind knd = kind::nmethod); I think using `kind::nmethod` by default is potentially dangerous. The pc may be outside of the code cache and calling find_blob_fast would be unreliable. It's used by pns for debugging code. It doesn't look performance critical and we could use a conservative default. I guess that we don't see issues because native code doesn't set bit 9 in CMPI/CMPLI. src/hotspot/cpu/ppc/frame_ppc.inline.hpp line 44: > 42: } > 43: > 44: if (_cb == nullptr ) { Please remove the whitespace! src/hotspot/cpu/ppc/frame_ppc.inline.hpp line 45: > 43: > 44: if (_cb == nullptr ) { > 45: _cb = knd == kind::nmethod ? 
CodeCache::find_blob_fast(_pc) : CodeCache::find_blob(_pc); `(knd == kind::nmethod)` would look better. src/hotspot/cpu/ppc/frame_ppc.inline.hpp line 92: > 90: _on_heap(false), DEBUG_ONLY(_frame_index(-1) COMMA) _unextended_sp(nullptr), _fp(nullptr) {} > 91: > 92: inline frame::frame(intptr_t* sp) : frame(sp, nullptr, kind::nmethod) {} Same here. Potentially dangerous default value. Not performance critical AFAICS. src/hotspot/cpu/ppc/frame_ppc.inline.hpp line 105: > 103: : _sp(sp), _pc(pc), _cb(cb), _oop_map(nullptr), > 104: _on_heap(false), DEBUG_ONLY(_frame_index(-1) COMMA) _unextended_sp(unextended_sp), _fp(fp) { > 105: setup(kind::nmethod); I think `kind::nmethod` should only be used if cb != nullptr which is not checked, here. Is this one performance critical? src/hotspot/cpu/ppc/macroAssembler_ppc.cpp line 1191: > 1189: } > 1190: // We use CMPI/CMPLI instructions to encode post call nops. > 1191: // We set bit 9 to distinguish post call nops from real CMPI/CMPI instructions Should be CMPI/CMPLI. Maybe add that CMPI and CMPLI opcodes only differ in one bit which we use to encode data. src/hotspot/cpu/ppc/nativeInst_ppc.hpp line 519: > 517: // | |4 bits | | 22 bits | > 518: // > 519: // Bit 9 is alwys 1 for PCNs to distinguish them from CMPI/CMPLI `always`, maybe distinguish from "regular CMPI/CMPLI". 
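
The point about a conservative default can be illustrated with a toy model. This is not HotSpot code: `Kind`, `ToyCodeCache`, `Lookup` and `choose_lookup` are hypothetical stand-ins, and the sketch only assumes what the review says, namely that the fast lookup is reliable only for a pc known to be in the code cache.

```cpp
#include <cassert>
#include <cstdint>

// Toy model only; every name here is hypothetical.
enum class Kind { unknown, nmethod };
enum class Lookup { slow, fast };

// Stand-in for the code cache address range [low, high).
struct ToyCodeCache {
  uintptr_t low;
  uintptr_t high;
  bool contains(uintptr_t pc) const { return low <= pc && pc < high; }
};

// Conservative default: callers must opt in to the fast path explicitly,
// and even then it is only taken when pc lies inside the code cache.
static Lookup choose_lookup(const ToyCodeCache& cc, uintptr_t pc,
                            Kind knd = Kind::unknown) {
  if (knd == Kind::nmethod && cc.contains(pc)) {
    return Lookup::fast;
  }
  return Lookup::slow;
}
```

With this shape, a caller that forgets to pass a kind gets the slow but reliable lookup, while hot paths that know they hold an nmethod pc pass `Kind::nmethod` explicitly.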
------------- PR Review: https://git.openjdk.org/jdk/pull/17171#pullrequestreview-1797379733 PR Review Comment: https://git.openjdk.org/jdk/pull/17171#discussion_r1437165683 PR Review Comment: https://git.openjdk.org/jdk/pull/17171#discussion_r1437192737 PR Review Comment: https://git.openjdk.org/jdk/pull/17171#discussion_r1437161661 PR Review Comment: https://git.openjdk.org/jdk/pull/17171#discussion_r1437161983 PR Review Comment: https://git.openjdk.org/jdk/pull/17171#discussion_r1437192999 PR Review Comment: https://git.openjdk.org/jdk/pull/17171#discussion_r1437167565 PR Review Comment: https://git.openjdk.org/jdk/pull/17171#discussion_r1437171154 PR Review Comment: https://git.openjdk.org/jdk/pull/17171#discussion_r1437172336 From aph-open at littlepinkcloud.com Thu Dec 28 19:58:13 2023 From: aph-open at littlepinkcloud.com (Andrew Haley) Date: Thu, 28 Dec 2023 19:58:13 +0000 Subject: RFR: 8290965: PPC64: Implement post-call NOPs In-Reply-To: References: Message-ID: <08fd98f7-4a8b-458d-a2da-7ef615bf94e9@littlepinkcloud.com> On 12/20/23 20:36, Richard Reingruber wrote: > | test/jdk/java/lang/Thread/virtual/stress/Skynet.java | ppc64le | x86_64 | > |------------------------------------------------------|-----------|-----------| > | PCN lookup success | 306955525 | 247185016 | > | PCN lookup failure | 500975 | 421098 | > | PCN decode success (C2) | 306951893 | 247181691 | > | PCN decode failure | 3168 | 59 | > | PCN patch success | 2080 | 2662 | > | PCN patch cb offset failure | 0 | 0 | > | PCN patch oopmap slot failure | 0 | 0 | These data are really interesting. How did you gather them? Thanks. From kbarrett at openjdk.org Fri Dec 29 02:07:02 2023 From: kbarrett at openjdk.org (Kim Barrett) Date: Fri, 29 Dec 2023 02:07:02 GMT Subject: RFR: 8322758: Eliminate -Wparentheses warnings in C2 code Message-ID: Please review this change to eliminate some -Wparentheses warnings. 
In most cases, this involved simply adding a few parentheses to make some implicit operator precedence explicit. In PhaseIdealLoop::rc_predicate, I also added a comment describing the test being performed, since it didn't seem obvious even with the additional parentheses. Testing: mach5 tier1 Also ran mach5 tier1 with these changes in conjunction enabling -Wparentheses and other changes needed to make that work. ------------- Commit messages: - fix -Wparentheses warnings in C2 code Changes: https://git.openjdk.org/jdk/pull/17199/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=17199&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8322758 Stats: 18 lines in 9 files changed: 2 ins; 0 del; 16 mod Patch: https://git.openjdk.org/jdk/pull/17199.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17199/head:pull/17199 PR: https://git.openjdk.org/jdk/pull/17199 From kbarrett at openjdk.org Fri Dec 29 03:39:08 2023 From: kbarrett at openjdk.org (Kim Barrett) Date: Fri, 29 Dec 2023 03:39:08 GMT Subject: RFR: 8322759: Eliminate -Wparentheses warnings in compiler code Message-ID: <496tGkQ1KUCrW1IHOETyvhqopkNYjsEoupxjo0Ze3Wg=.0223f494-fd34-4e4a-a31a-5030603f2113@github.com> Please review this change to eliminate some -Wparentheses warnings. This involved simply adding a few parentheses to make some implicit operator precedence explicit. This change addresses non-C2 parts of the compiler component. Testing: mach5 tier1 Also ran mach5 tier1 with these changes in conjunction enabling -Wparentheses and other changes needed to make that work. 
------------- Commit messages: - fix -Wparentheses warnings in non-C2 compiler code Changes: https://git.openjdk.org/jdk/pull/17200/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=17200&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8322759 Stats: 12 lines in 5 files changed: 0 ins; 0 del; 12 mod Patch: https://git.openjdk.org/jdk/pull/17200.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17200/head:pull/17200 PR: https://git.openjdk.org/jdk/pull/17200 From ddong at openjdk.org Fri Dec 29 14:48:07 2023 From: ddong at openjdk.org (Denghui Dong) Date: Fri, 29 Dec 2023 14:48:07 GMT Subject: RFR: 8322779: C1: Remove the unused counter 'totalInstructionNodes' Message-ID: Hi, Could I have a review of this small cleanup patch that removes the unused counter 'totalInstructionNodes'. JDK-8058968 refactored the Compiler time traces and deleted the only place that read the counter. Thanks ------------- Commit messages: - 8322779: C1: Remove the unused counter 'totalInstructionNodes' Changes: https://git.openjdk.org/jdk/pull/17204/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=17204&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8322779 Stats: 2 lines in 1 file changed: 0 ins; 2 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/17204.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17204/head:pull/17204 PR: https://git.openjdk.org/jdk/pull/17204 From ddong at openjdk.org Fri Dec 29 15:08:13 2023 From: ddong at openjdk.org (Denghui Dong) Date: Fri, 29 Dec 2023 15:08:13 GMT Subject: RFR: 8322781: C1: Debug build crash in GraphBuilder::vmap() when print stats Message-ID: Hi, Could I have a review of this fix patch that fixes a crash problem in the debug build when -XX:+PrintValueNumbering -XX:+Verbose -XX:-UseLocalValueNumbering Thanks ------------- Commit messages: - 8322781: C1: Debug build crash in GraphBuilder::vmap() when print stats Changes: https://git.openjdk.org/jdk/pull/17205/files Webrev: 
https://webrevs.openjdk.org/?repo=jdk&pr=17205&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8322781 Stats: 3 lines in 1 file changed: 2 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/17205.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17205/head:pull/17205 PR: https://git.openjdk.org/jdk/pull/17205 From aph at openjdk.org Fri Dec 29 18:23:38 2023 From: aph at openjdk.org (Andrew Haley) Date: Fri, 29 Dec 2023 18:23:38 GMT Subject: RFR: 8322758: Eliminate -Wparentheses warnings in C2 code In-Reply-To: References: Message-ID: On Fri, 29 Dec 2023 02:01:08 GMT, Kim Barrett wrote: > Please review this change to eliminate some -Wparentheses warnings. In most > cases, this involved simply adding a few parentheses to make some implicit > operator precedence explicit. > > In PhaseIdealLoop::rc_predicate, I also added a comment describing the test > being performed, since it didn't seem obvious even with the additional > parentheses. > > Testing: mach5 tier1 > > Also ran mach5 tier1 with these changes in conjunction enabling -Wparentheses > and other changes needed to make that work. Marked as reviewed by aph (Reviewer). src/hotspot/share/opto/loopPredicate.cpp line 801: > 799: const TypeInt* idx_type = TypeInt::INT; > 800: // same signs and upper, or different signs and not upper. > 801: if (((stride > 0) == (scale > 0)) == upper) { This is rather l33t code, but I guess it's OK with the comment. This Suggestion: _Bool same_signs = (stride > 0) == (scale > 0); if ((same_signs & upper) || (!same_signs && !upper)) { generates slightly more code with GCC -O2. I'd be happy with either. 
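
To make the equivalence of the two forms concrete, here is a minimal C++ sketch (not taken from the patch) that checks the compact test against the expanded form for every sign combination. The function names and the `int` parameter types are assumptions for illustration only; the actual types used in `PhaseIdealLoop::rc_predicate` may differ.

```cpp
// Hypothetical helper: true when stride and scale are both positive or both
// non-positive. Note that zero is grouped with the negative values here,
// exactly as in the expression (stride > 0) == (scale > 0).
constexpr bool same_signs(int stride, int scale) {
  return (stride > 0) == (scale > 0);
}

// Compact form, as in the reviewed line: compare the predicate with `upper`.
constexpr bool compact_form(int stride, int scale, bool upper) {
  return same_signs(stride, scale) == upper;
}

// Expanded form: same signs and upper, or different signs and not upper.
constexpr bool expanded_form(int stride, int scale, bool upper) {
  return (same_signs(stride, scale) && upper) ||
         (!same_signs(stride, scale) && !upper);
}

// Both forms must agree for either value of `upper`.
constexpr bool forms_agree(int stride, int scale) {
  return compact_form(stride, scale, true) == expanded_form(stride, scale, true) &&
         compact_form(stride, scale, false) == expanded_form(stride, scale, false);
}

// Compile-time check over all four sign combinations.
static_assert(forms_agree(1, 1) && forms_agree(1, -1) &&
              forms_agree(-1, 1) && forms_agree(-1, -1),
              "compact and expanded forms must be equivalent");
```

Comparing two `bool` values with `==` is a logical XNOR, which is why the compact form needs no branches; the expanded form spells out the same truth table at the cost of a little more code.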
------------- PR Review: https://git.openjdk.org/jdk/pull/17199#pullrequestreview-1799153875 PR Review Comment: https://git.openjdk.org/jdk/pull/17199#discussion_r1438363723 From duke at openjdk.org Fri Dec 29 20:36:56 2023 From: duke at openjdk.org (duke) Date: Fri, 29 Dec 2023 20:36:56 GMT Subject: Withdrawn: 8315361: C2: Create a superclass of SuperWord In-Reply-To: <1TvO1Yb11BjAn4X6jux459nNuDfFrc_6-8lkHgcNigs=.8607af09-c25c-4f41-845c-c9a5900de1a0@github.com> References: <1TvO1Yb11BjAn4X6jux459nNuDfFrc_6-8lkHgcNigs=.8607af09-c25c-4f41-845c-c9a5900de1a0@github.com> Message-ID: On Wed, 25 Oct 2023 01:58:13 GMT, Fei Gao wrote: > As discussed in [JDK-8308994](https://bugs.openjdk.org/browse/JDK-8308994), we should first do some refactoring work before proceeding with the new post loop vectorization. In this patch, we have done the following refactoring. (Most of changes are just moving the code around without real change on logic.) > > 1) We have created a superclass for shared data structures and utilities for C2's auto-vectorization. > > 2) We have moved data structures for basic loop info and the field _vector_loop_debug to the superclass. We also drop the class member "_visited" and "_post_visited", and instead use local variables, namely allocating them when using them. > > 3) Both two vectorizers traverse and store loop body nodes in RPO (Reverse Post-Order) separately. So we withdraw the logic into a new function `collect_nodes_in_reverse_postorder()`, and move the function and related data structures to the superclass. Before, the code for counting the number of reduction uses is mixed in the RPO logic. Now, we have decoupled the code and put it into SuperWord separately. > > Tested tier1~3 on x86 and AArch64. This pull request has been closed without being integrated. 
------------- PR: https://git.openjdk.org/jdk/pull/16353 From kbarrett at openjdk.org Sat Dec 30 18:48:03 2023 From: kbarrett at openjdk.org (Kim Barrett) Date: Sat, 30 Dec 2023 18:48:03 GMT Subject: RFR: 8282365: Consolidate and improve division by constant idealizations [v42] In-Reply-To: References: Message-ID: On Mon, 25 Dec 2023 17:57:28 GMT, Quan Anh Mai wrote: >> This patch implements idealisation for unsigned divisions to change a division by a constant into a series of multiplication and shift. I also change the idealisation of `DivI` to get a more efficient series when the magic constant overflows an int32. >> >> In general, the idea behind a signed division transformation is that for a positive constant `d`, we would need to find constants `c` and `m` so that: >> >> floor(x / d) = floor(x * c / 2**m) for 0 < x < 2**(N - 1) (1) >> ceil(x / d) = floor(x * c / 2**m) + 1 for -2**(N - 1) <= x < 0 (2) >> >> The implementation in the original book takes into consideration that the machine may not be able to perform the full multiplication `x * c`, so the constant overflow and we need to add back the dividend as in `DivLNode::Ideal` cases. However, for int32 division, `x * c` cannot overflow an int64. As a result, it is always feasible to just calculate the product and shift the result. >> >> For unsigned multiplication, the situation is somewhat trickier because the condition needs to be twice as strong (the condition (1) and (2) above are mostly the same). This results in the magic constant `c` calculated based on the method presented in Hacker's Delight by Henry S. Warren, Jr. may overflow an uintN. For int division, we can depend on the theorem devised by Arch D. 
Robison in N-Bit Unsigned Division Via N-Bit Multiply-Add, which states that there exists either: >> >> c1 in uint32 and m1, such that floor(x / d) = floor(x * c1 / 2**m1) for 0 < x < 2**32 (3) >> c2 in uint32 and m2, such that floor(x / d) = floor((x + 1) * c2 / 2**m2) for 0 < x < 2**32 (4) >> >> which means that either `x * c1` never overflows an uint64 or `(x + 1) * c2` never overflows an uint64. And we can perform a full multiplication. >> >> For longs, there is no way to do a full multiplication so we do some basic transformations to achieve a computable formula. The details I have written as comments in the overflow case. >> >> More tests are added to cover the possible patterns. >> >> Please take a look and have some reviews. Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > power of 2 Changes requested by kbarrett (Reviewer). test/hotspot/gtest/opto/test_constant_division.cpp line 33: > 31: > 32: // Generate a random positive integer of type T in a way that biases > 33: // towards smaller values Why is there a bias toward smaller numbers? Maybe it should be named differently to indicate that bias? test/hotspot/gtest/opto/test_constant_division.cpp line 54: > 52: template <> > 53: julong random() { > 54: juint bits = juint(os::random()) % 63 + 1; This change (`&` => `%`, and the similar change below) go a long way toward explaining why I couldn't puzzle out what this function was intended to do. Note that `&` has lower precedence than `+`, so the earlier version was masking with 64. The new version doesn't have that operator precedence mistake, though I'd prefer the precedence be made explicit using parens. test/hotspot/gtest/opto/test_constant_division.cpp line 132: > 130: for (int i = 0; i < iter_num;) { > 131: UT d = random(); > 132: if ((d & (d - 1)) == 0) { We have `is_power_of_2` for this. 
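
The two points above (the precedence of `&` relative to `+`, and the bit trick for powers of two) can be illustrated with a small standalone C++ sketch. It is not code from the test: `uint32_t`/`uint64_t` stand in for `juint`/`julong`, and all function names are hypothetical. The power-of-two helper is a local stand-in written for this example, not HotSpot's `is_power_of_2`.

```cpp
#include <cstdint>

// How `r & 63 + 1` is parsed: `+` binds tighter than `&`, so the mask is 64.
// The result can only be 0 or 64.
constexpr uint32_t as_parsed(uint32_t r) { return r & (63 + 1); }

// The revised expression with the precedence made explicit: range [1, 63].
constexpr uint32_t with_parens(uint32_t r) { return (r % 63) + 1; }

// The open-coded bit trick from the test. It is also true for d == 0.
constexpr bool is_pow2_or_zero(uint64_t d) { return (d & (d - 1)) == 0; }

// Local stand-in for a named power-of-two predicate that rejects zero.
constexpr bool is_pow2(uint64_t d) { return d != 0 && (d & (d - 1)) == 0; }

static_assert(as_parsed(100u) == 64u, "bit 6 of 100 is set");
static_assert(as_parsed(63u) == 0u, "bit 6 of 63 is clear");
static_assert(with_parens(62u) == 63u, "largest value in range");
static_assert(with_parens(63u) == 1u, "wraps back to the smallest value");
static_assert(is_pow2_or_zero(0) && !is_pow2(0), "the bit trick alone accepts zero");
static_assert(is_pow2(64) && !is_pow2(96), "64 is a power of two, 96 is not");
```

A named predicate documents the intent and makes the zero case an explicit decision rather than an accident of the bit trick.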
test/hotspot/gtest/opto/test_constant_division.cpp line 139:

> 137: UT N_pos = random();
> 138: if (N_neg < d && N_pos < d) {
> 139: continue;

With sufficiently bad luck, we could spin here for a long time. (Similarly, though much less likely above with the power-of-2 case.) That doesn't seem great. Of course, if one does count these skipped cases against the iteration limit then with sufficiently bad luck one might not test anything. Rather than skipping the test here, could you instead modify one of the values and proceed with the test?

-------------

PR Review: https://git.openjdk.org/jdk/pull/9947#pullrequestreview-1799579259
PR Review Comment: https://git.openjdk.org/jdk/pull/9947#discussion_r1438680076
PR Review Comment: https://git.openjdk.org/jdk/pull/9947#discussion_r1438679875
PR Review Comment: https://git.openjdk.org/jdk/pull/9947#discussion_r1438680376
PR Review Comment: https://git.openjdk.org/jdk/pull/9947#discussion_r1438681980

From igavrilin at openjdk.org Sat Dec 30 20:12:56 2023
From: igavrilin at openjdk.org (Ilya Gavrilin)
Date: Sat, 30 Dec 2023 20:12:56 GMT
Subject: RFR: 8322790: RISC-V: Tune costs for shuffles with no conversion
Message-ID:

Hi all, please review this small change to RISC-V node insertion costs.

We currently have several nodes that implement shuffles without conversion: https://github.com/openjdk/jdk/blob/32d80e2caf6063b58128bd5f3dc87b276f3bd0cb/src/hotspot/cpu/riscv/riscv.ad#L8525-L8741

On most RISC-V CPUs we prefer reg<->reg operations because they are faster, but stack<->reg operations are currently used (for details about the reasons, please see the linked JBS issue).
After changing the insertion costs, reg<->reg operations are selected, and we can see performance improvements in benchmarks that use such shuffles (tested on a T-Head C910 board):

| Benchmark                           | Upstream build (ops/ms) | Patched build (ops/ms) | Difference (%) |
|:-----------------------------------:|:-----------------------:|:----------------------:|:--------------:|
| MathBench.doubleToRawLongBitsDouble | 30935.139               | 32171.761              | +4.00          |
| StrictMathBench.ceilDouble          | 24682.810               | 29782.050              | +20.66         |
| StrictMathBench.cosDouble           | 6948.309                | 6938.276               | -0.14          |
| StrictMathBench.expDouble           | 6816.143                | 7211.021               | +5.79          |
| StrictMathBench.floorDouble         | 30699.630               | 34189.509              | +11.37         |
| StrictMathBench.maxDouble           | 35157.355               | 34675.191              | -1.37          |
| StrictMathBench.minDouble           | 35192.135               | 35183.015              | -0.03          |
| StrictMathBench.sinDouble           | 6698.405                | 6721.809               | +0.35          |

New benchmark for the changed nodes:

--- a/test/micro/org/openjdk/bench/java/lang/MathBench.java
+++ b/test/micro/org/openjdk/bench/java/lang/MathBench.java
@@ -540,4 +540,11 @@ public class MathBench {
         return Math.ulp(float7);
     }

+    @Benchmark
+    public long doubleToRawLongBitsDouble() {
+        double dbl162Dot5 = double81 * 2.0d + double0Dot5;
+        double dbl3 = double2 + double1;
+        return Double.doubleToRawLongBits(dbl162Dot5) + Double.doubleToRawLongBits(dbl3);
+    }
+

-------------

Commit messages:
 - Change costs for shuffles with no conversion

Changes: https://git.openjdk.org/jdk/pull/17206/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=17206&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8322790
  Stats: 13 lines in 1 file changed: 1 ins; 0 del; 12 mod
  Patch: https://git.openjdk.org/jdk/pull/17206.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/17206/head:pull/17206

PR: https://git.openjdk.org/jdk/pull/17206