From kbarrett at openjdk.org Tue Oct 1 01:23:57 2024 From: kbarrett at openjdk.org (Kim Barrett) Date: Tue, 1 Oct 2024 01:23:57 GMT Subject: RFR: 8337389: Parallel: Remove unnecessary forward declarations in psScavenge.hpp In-Reply-To: <6pGuQAwh4aO2dyQcgugU0uz9ys8hulq1lyAckrKRDK0=.2f124adb-05d5-4a81-97ed-971a7d8b3a6b@github.com> References: <6pGuQAwh4aO2dyQcgugU0uz9ys8hulq1lyAckrKRDK0=.2f124adb-05d5-4a81-97ed-971a7d8b3a6b@github.com> Message-ID: On Tue, 30 Jul 2024 18:14:36 GMT, joejackson1993 wrote: > trivial cleanup Looks good, and trivial. ------------- Marked as reviewed by kbarrett (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/20393#pullrequestreview-2338952343 From tschatzl at openjdk.org Tue Oct 1 07:32:38 2024 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 1 Oct 2024 07:32:38 GMT Subject: RFR: 8337389: Parallel: Remove unnecessary forward declarations in psScavenge.hpp In-Reply-To: <6pGuQAwh4aO2dyQcgugU0uz9ys8hulq1lyAckrKRDK0=.2f124adb-05d5-4a81-97ed-971a7d8b3a6b@github.com> References: <6pGuQAwh4aO2dyQcgugU0uz9ys8hulq1lyAckrKRDK0=.2f124adb-05d5-4a81-97ed-971a7d8b3a6b@github.com> Message-ID: On Tue, 30 Jul 2024 18:14:36 GMT, joejackson1993 wrote: > trivial cleanup trivial ------------- Marked as reviewed by tschatzl (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/20393#pullrequestreview-2339339563 From rcastanedalo at openjdk.org Tue Oct 1 11:24:52 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Tue, 1 Oct 2024 11:24:52 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v27] In-Reply-To: References:

Message-ID: On Mon, 30 Sep 2024 16:56:30 GMT, Vladimir Kozlov wrote: > Good. Thanks, Vladimir! ------------- PR Comment: https://git.openjdk.org/jdk/pull/19746#issuecomment-2385515540 From shade at openjdk.org Tue Oct 1 13:32:41 2024 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 1 Oct 2024 13:32:41 GMT Subject: Integrated: 8341242: Shenandoah: LRB node is not matched as GC barrier after JDK-8340183 In-Reply-To: References: Message-ID: On Mon, 30 Sep 2024 14:54:30 GMT, Aleksey Shipilev wrote: > [JDK-8340183](https://bugs.openjdk.org/browse/JDK-8340183) introduced a regression: `ShenandoahBarrierSetC2::is_gc_barrier_node` is now answering `false` for the actual `ShenandoahLoadReferenceBarrierNode`. The fix reinstates the check for LRB node. > > Additional testing: > - [x] Linux x86_64 server fastdebug, `hotspot_gc_shenandoah` > - [x] Linux x86_64 server fastdebug, `jdk/jfr/api/consumer/ ` (100x) -- used to fail intermittently, now it does not This pull request has now been integrated. Changeset: 684d246c Author: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/684d246ccf497f599ffcd498f2fbe4b1b2357e27 Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod 8341242: Shenandoah: LRB node is not matched as GC barrier after JDK-8340183 Reviewed-by: rkennke, phh ------------- PR: https://git.openjdk.org/jdk/pull/21266 From duke at openjdk.org Tue Oct 1 14:05:43 2024 From: duke at openjdk.org (joejackson1993) Date: Tue, 1 Oct 2024 14:05:43 GMT Subject: Integrated: 8337389: Parallel: Remove unnecessary forward declarations in psScavenge.hpp In-Reply-To: <6pGuQAwh4aO2dyQcgugU0uz9ys8hulq1lyAckrKRDK0=.2f124adb-05d5-4a81-97ed-971a7d8b3a6b@github.com> References: <6pGuQAwh4aO2dyQcgugU0uz9ys8hulq1lyAckrKRDK0=.2f124adb-05d5-4a81-97ed-971a7d8b3a6b@github.com> Message-ID: On Tue, 30 Jul 2024 18:14:36 GMT, joejackson1993 wrote: > trivial cleanup This pull request has now been integrated. 
Changeset: 7b1e6f8e Author: joseph.jackson Committer: Zhengyu Gu URL: https://git.openjdk.org/jdk/commit/7b1e6f8ed9dbc07158717a32d341393afaa54b66 Stats: 2 lines in 1 file changed: 0 ins; 2 del; 0 mod 8337389: Parallel: Remove unnecessary forward declarations in psScavenge.hpp Reviewed-by: kbarrett, tschatzl ------------- PR: https://git.openjdk.org/jdk/pull/20393 From tschatzl at openjdk.org Tue Oct 1 14:57:36 2024 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 1 Oct 2024 14:57:36 GMT Subject: RFR: 8339668: Parallel: Adopt PartialArrayState to consolidate marking stack in compact GC In-Reply-To: References: Message-ID: On Thu, 19 Sep 2024 14:01:39 GMT, Zhengyu Gu wrote: > Please review this patch that adopts `PartialArrayState`introduced by [JDK-8337709](https://bugs.openjdk.org/browse/JDK-8337709) to consolidate `_oop_task_queues` and `_objarray_task_queues` into single `_marking_stacks`. > > The change mirrors Kim's [JDK-8311163](https://bugs.openjdk.org/browse/JDK-8311163) work, therefore, there are methods can be consolidated and simplified, but I would like defer to a followup CR. Changes requested by tschatzl (Reviewer). src/hotspot/share/gc/parallel/psCompactionManager.cpp line 33: > 31: #include "gc/parallel/psParallelCompact.inline.hpp" > 32: #include "gc/shared/partialArrayState.hpp" > 33: Suggestion: Unnecessary newline. src/hotspot/share/gc/parallel/psCompactionManager.hpp line 90: > 88: bool is_partial_array_state() const { return ((uintptr_t)_holder & PartialArrayStateBit) != 0; } > 89: }; > 90: Please use `ScannerTask` instead; it seems to be completely serviceable for that purpose. In fact, a search&replace seems just fine. 
https://github.com/openjdk/jdk/compare/pr/21089...tschatzl:jdk:pull/21089-recommendations?expand=1 I am going to experiment with refactoring the other duplicated (statistics) code src/hotspot/share/gc/parallel/psCompactionManager.hpp line 122: > 120: size_t _arrays_chunked; > 121: size_t _array_chunks_processed; > 122: #endif // TASKQUEUE_STATS This is a separate issue, but we did not add these counters when doing this change for g1 young gen. Filed [JDK-8341331](https://bugs.openjdk.org/browse/JDK-8341331) for that. There is a fair amount of code duplication (definition of members, management and printing) which is not great (but you mentioned it). For now I filed [JDK-8341332](https://bugs.openjdk.org/browse/JDK-8341332) for this, but I would really prefer some refactoring in this area in _this_ change. Initially I was kind of okay to do that separately, but the copy&paste seems too much. ------------- PR Review: https://git.openjdk.org/jdk/pull/21089#pullrequestreview-2339670822 PR Review Comment: https://git.openjdk.org/jdk/pull/21089#discussion_r1782454836 PR Review Comment: https://git.openjdk.org/jdk/pull/21089#discussion_r1782794766 PR Review Comment: https://git.openjdk.org/jdk/pull/21089#discussion_r1782710439 From rkennke at openjdk.org Tue Oct 1 15:48:54 2024 From: rkennke at openjdk.org (Roman Kennke) Date: Tue, 1 Oct 2024 15:48:54 GMT Subject: RFR: 8305895: Implement JEP 450: Compact Object Headers (Experimental) [v11] In-Reply-To: References:

Message-ID: On Mon, 30 Sep 2024 12:38:03 GMT, Roberto Castañeda Lozano wrote: > test/hotspot/jtreg/compiler/c2/irTests/TestVectorizationNotRun.java: I think I would disable the tests for now. Is there a good way to say 'run this when UCOH is off OR UseSSE>3? ------------- PR Comment: https://git.openjdk.org/jdk/pull/20677#issuecomment-2386370790 From kbarrett at openjdk.org Tue Oct 1 16:42:35 2024 From: kbarrett at openjdk.org (Kim Barrett) Date: Tue, 1 Oct 2024 16:42:35 GMT Subject: RFR: 8339668: Parallel: Adopt PartialArrayState to consolidate marking stack in compact GC In-Reply-To: References:

Message-ID: On Tue, 1 Oct 2024 13:22:52 GMT, Thomas Schatzl wrote: >> Please review this patch that adopts `PartialArrayState`introduced by [JDK-8337709](https://bugs.openjdk.org/browse/JDK-8337709) to consolidate `_oop_task_queues` and `_objarray_task_queues` into single `_marking_stacks`. >> >> The change mirrors Kim's [JDK-8311163](https://bugs.openjdk.org/browse/JDK-8311163) work, therefore, there are methods can be consolidated and simplified, but I would like defer to a followup CR. > > src/hotspot/share/gc/parallel/psCompactionManager.hpp line 90: > >> 88: bool is_partial_array_state() const { return ((uintptr_t)_holder & PartialArrayStateBit) != 0; } >> 89: }; >> 90: > > Please use `ScannerTask` instead; it seems to be completely serviceable for that purpose. In fact, a search&replace seems just fine. > > https://github.com/openjdk/jdk/compare/pr/21089...tschatzl:jdk:pull/21089-recommendations?expand=1 > > I am going to experiment with refactoring the other duplicated (statistics) code I ran into the same problem for G1 Full GC that @zhengyu123 has run into here. ScannerTask deals in _pointers_ to `oop` (and `narrowOop`). For the separate marking cases we have `oop`s in the tasks. I added a class just like this one in my work, except I put mine in gc/shared and gave it a different name (OopScannerTask, which I don't love). Clearly some coalescing is needed there. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/21089#discussion_r1783181344 From zgu at openjdk.org Tue Oct 1 17:04:39 2024 From: zgu at openjdk.org (Zhengyu Gu) Date: Tue, 1 Oct 2024 17:04:39 GMT Subject: RFR: 8339668: Parallel: Adopt PartialArrayState to consolidate marking stack in compact GC In-Reply-To: References:

Message-ID: On Tue, 1 Oct 2024 16:40:13 GMT, Kim Barrett wrote: > OopScannerTask I certainly can adopt your implementation once it is integrated. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/21089#discussion_r1783207830 From zgu at openjdk.org Tue Oct 1 19:23:35 2024 From: zgu at openjdk.org (Zhengyu Gu) Date: Tue, 1 Oct 2024 19:23:35 GMT Subject: RFR: 8339668: Parallel: Adopt PartialArrayState to consolidate marking stack in Full GC In-Reply-To: References:

Message-ID: <7ZdrpVVGG1hRYudHG_8VA8-lFIJGhSHy0YzmUgtFB9w=.246a5c89-7fb0-402d-92e8-3930c8d430cd@github.com> On Tue, 1 Oct 2024 17:02:19 GMT, Zhengyu Gu wrote: >> I ran into the same problem for G1 Full GC that @zhengyu123 has run into here. ScannerTask deals in >> _pointers_ to `oop` (and `narrowOop`). For the separate marking cases we have `oop`s in the tasks. >> I added a class just like this one in my work, except I put mine in gc/shared and gave it a different name >> (OopScannerTask, which I don't love). Clearly some coalescing is needed there. > >> OopScannerTask > > I certainly can adopt your implementation once it is integrated. > Please use `ScannerTask` instead; it seems to be completely serviceable for that purpose. In fact, a search&replace seems just fine. > > https://github.com/openjdk/jdk/compare/pr/21089...tschatzl:jdk:pull/21089-recommendations?expand=1 > > I am going to experiment with refactoring the other duplicated (statistics) code I believe I ran into alignment assertion failure with `ScannerTask` ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/21089#discussion_r1783371791 From zgu at openjdk.org Tue Oct 1 19:37:07 2024 From: zgu at openjdk.org (Zhengyu Gu) Date: Tue, 1 Oct 2024 19:37:07 GMT Subject: RFR: 8339668: Parallel: Adopt PartialArrayState to consolidate marking stack in Full GC [v2] In-Reply-To: References: Message-ID: > Please review this patch that adopts `PartialArrayState`introduced by [JDK-8337709](https://bugs.openjdk.org/browse/JDK-8337709) to consolidate `_oop_task_queues` and `_objarray_task_queues` into single `_marking_stacks`. > > The change mirrors Kim's [JDK-8311163](https://bugs.openjdk.org/browse/JDK-8311163) work, therefore, there are methods can be consolidated and simplified, but I would like defer to a followup CR. 
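The consolidation described in this PR stores two kinds of work item, plain objects and partial-array chunking state, in one queue. It tells them apart by a tag bit in the stored pointer, in the spirit of the `is_partial_array_state()` check quoted from `psCompactionManager.hpp`. Below is a minimal sketch of that technique. `Obj`, `PartialArrayState`, `MarkTask` and `drain` are hypothetical stand-ins, not the HotSpot classes. The sketch assumes both pointee types are at least 8-byte aligned so the low bit is free; the assert in `tag()` shows why tagging only works for suitably aligned pointers.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

// Hypothetical stand-ins for a heap object and for the state used to scan a
// large array in chunks. Both are 8-byte aligned, so the low bit of a pointer
// to either is always zero and can carry a tag.
struct alignas(8) Obj { int payload; };
struct alignas(8) PartialArrayState { Obj* array; int next_index; };

// One queue entry holding either an Obj* or a PartialArrayState*.
class MarkTask {
  static constexpr std::uintptr_t PartialArrayBit = 1;
  void* _holder;

  static void* tag(void* p, std::uintptr_t bits) {
    // Tagging is only safe if the pointer's low bit is unused.
    assert((reinterpret_cast<std::uintptr_t>(p) & PartialArrayBit) == 0 && "misaligned pointer");
    return reinterpret_cast<void*>(reinterpret_cast<std::uintptr_t>(p) | bits);
  }

 public:
  explicit MarkTask(Obj* obj) : _holder(tag(obj, 0)) {}
  explicit MarkTask(PartialArrayState* state) : _holder(tag(state, PartialArrayBit)) {}

  bool is_partial_array_state() const {
    return (reinterpret_cast<std::uintptr_t>(_holder) & PartialArrayBit) != 0;
  }

  Obj* obj() const {
    assert(!is_partial_array_state());
    return static_cast<Obj*>(_holder);
  }

  PartialArrayState* partial_array_state() const {
    assert(is_partial_array_state());
    return reinterpret_cast<PartialArrayState*>(
        reinterpret_cast<std::uintptr_t>(_holder) & ~PartialArrayBit);
  }
};

// Drains a single consolidated queue, dispatching on the task kind.
inline void drain(std::deque<MarkTask>& queue, int& objects, int& partial_arrays) {
  while (!queue.empty()) {
    MarkTask task = queue.back();
    queue.pop_back();
    if (task.is_partial_array_state()) {
      ++partial_arrays;  // a real collector would scan the next chunk here
    } else {
      ++objects;         // a real collector would mark/scan the object here
    }
  }
}
```

The design choice is one queue instead of two: a single steal/termination protocol covers both kinds of work, at the cost of an alignment requirement on everything stored in it.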
Zhengyu Gu has updated the pull request incrementally with two additional commits since the last revision: - @tschatzl's comment - v8 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/21089/files - new: https://git.openjdk.org/jdk/pull/21089/files/732cf8f7..15b998f6 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=21089&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=21089&range=00-01 Stats: 5 lines in 2 files changed: 2 ins; 1 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/21089.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/21089/head:pull/21089 PR: https://git.openjdk.org/jdk/pull/21089 From wkemper at openjdk.org Tue Oct 1 21:44:48 2024 From: wkemper at openjdk.org (William Kemper) Date: Tue, 1 Oct 2024 21:44:48 GMT Subject: RFR: 8332697: ubsan: shenandoahSimpleBitMap.inline.hpp:68:23: runtime error: signed integer overflow: -9223372036854775808 - 1 cannot be represented in type 'long int' [v4] In-Reply-To: References: Message-ID: > Use a template version of `right_n_bits` to use the same type for minuend and subtrahend. 
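The ubsan report shows `-9223372036854775808 - 1`, the most negative 64-bit value minus one, being evaluated in a signed type. One way to avoid that class of overflow is to build the "low n bits set" mask in the unsigned counterpart of the requested type and convert only the final result. The sketch below assumes the mask is formed as "bit n minus one". `right_n_bits_of` is a hypothetical name; this is not the HotSpot definition of `right_n_bits`.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>
#include <type_traits>

// Returns a T whose low n bits are set (all bits if n >= width of T).
// The shift and the "- 1" are evaluated in the unsigned counterpart of T, so
// for n == width - 1 the result is the largest positive value of a signed T
// instead of computing MIN - 1 in the signed type.
template <typename T>
constexpr T right_n_bits_of(unsigned n) {
  using U = typename std::make_unsigned<T>::type;
  return n >= sizeof(T) * CHAR_BIT ? static_cast<T>(~U(0))
                                   : static_cast<T>(static_cast<U>((U(1) << n) - U(1)));
}

// Compile-time check of the case that overflowed in the signed formulation.
static_assert(right_n_bits_of<std::int64_t>(63) == INT64_MAX, "low 63 bits set, no overflow");
```

Keeping the minuend and subtrahend in the same unsigned type is the key point, which matches the intent stated in the PR description.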
William Kemper has updated the pull request incrementally with one additional commit since the last revision: Inline unnecessary usages of right_n_bits ------------- Changes: - all: https://git.openjdk.org/jdk/pull/21236/files - new: https://git.openjdk.org/jdk/pull/21236/files/97d1272b..fdff7d68 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=21236&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=21236&range=02-03 Stats: 37 lines in 3 files changed: 7 ins; 4 del; 26 mod Patch: https://git.openjdk.org/jdk/pull/21236.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/21236/head:pull/21236 PR: https://git.openjdk.org/jdk/pull/21236 From wkemper at openjdk.org Tue Oct 1 21:55:36 2024 From: wkemper at openjdk.org (William Kemper) Date: Tue, 1 Oct 2024 21:55:36 GMT Subject: RFR: 8332697: ubsan: shenandoahSimpleBitMap.inline.hpp:68:23: runtime error: signed integer overflow: -9223372036854775808 - 1 cannot be represented in type 'long int' [v4] In-Reply-To: References:

Message-ID: On Tue, 1 Oct 2024 21:44:48 GMT, William Kemper wrote: >> Use a template version of `right_n_bits` to use the same type for minuend and subtrahend. > > William Kemper has updated the pull request incrementally with one additional commit since the last revision: > > Inline unnecessary usages of right_n_bits I'd prefer to keep the scope of these changes limited to this file. This is the only place where this seems to be a problem in the JDK and I think it's just because we abused a macro we shouldn't have. ------------- PR Comment: https://git.openjdk.org/jdk/pull/21236#issuecomment-2387142276 From kdnilsen at openjdk.org Wed Oct 2 00:54:53 2024 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Wed, 2 Oct 2024 00:54:53 GMT Subject: RFR: 8341379: Shenandoah: Improve lock contention during cleanup Message-ID: <5GMjFWaMjkSuxhPJ8TYJyw_LYdFfhJWrySSaRUjSNq4=.0ee7f896-35b2-43aa-a243-418b1ffb50b0@github.com> This change improves the efficiency of cleaning up (recycling) regions that have been trashed by GC effort. The affected code runs at the end of FinalMark to reclaim immediate garbage. It runs at the end of FinalUpdateRefs to reclaim the regions that comprised the collection set, from which all live objects have now been evacuated. Efficiency improvements include: 1. Rather than invoking the os (while holding the Heap lock) to obtain the time twice for every region recycled, we invoke the os only once for each batch of 32 regions that are to be processed. 2. Rather than enforcing that the loop runs no longer than 30 us, we refrain from starting a second batch of regions if more than 8 us has passed since the preceding batch was processed. Below, each trial runs for 1 hour, processing 28,000 transactions per second. 
Without this change, latency for 4 un-named business services is represented by the following chart: ![image](https://github.com/user-attachments/assets/0e36025b-7b76-4e7a-ab07-303ea49479c3) With this change, latency for the same services is much better: ![image](https://github.com/user-attachments/assets/aceaf185-6944-4c91-b98e-06ccd1bc2d64) A comparison of the two is provided by the following: ![image](https://github.com/user-attachments/assets/7145f7b5-2a65-44b0-a94a-ddbc871f236b) ------------- Commit messages: - Respond to reviewer feedback - Tidy up comments and remove debug instrumentation - Recycle multiple regions before checking deadline - Merge branch 'openjdk:master' into master - Merge branch 'openjdk:master' into master - Merge branch 'openjdk:master' into master - Merge branch 'openjdk:master' into master - Revert "Make GC logging less verbose" - Make GC logging less verbose - Merge branch 'openjdk:master' into master - ... and 15 more: https://git.openjdk.org/jdk/compare/5d062e24...acf517f5 Changes: https://git.openjdk.org/jdk/pull/21211/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=21211&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8341379 Stats: 28 lines in 1 file changed: 22 ins; 3 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/21211.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/21211/head:pull/21211 PR: https://git.openjdk.org/jdk/pull/21211 From wkemper at openjdk.org Wed Oct 2 00:54:53 2024 From: wkemper at openjdk.org (William Kemper) Date: Wed, 2 Oct 2024 00:54:53 GMT Subject: RFR: 8341379: Shenandoah: Improve lock contention during cleanup In-Reply-To: <5GMjFWaMjkSuxhPJ8TYJyw_LYdFfhJWrySSaRUjSNq4=.0ee7f896-35b2-43aa-a243-418b1ffb50b0@github.com> References: <5GMjFWaMjkSuxhPJ8TYJyw_LYdFfhJWrySSaRUjSNq4=.0ee7f896-35b2-43aa-a243-418b1ffb50b0@github.com> Message-ID: <_lvg95UW7HqXx5iAkndxKUNGHTmsXCLj7Ee5t-0SRCE=.47af2a9e-80b3-40d6-9348-407ee1518442@github.com> On Thu, 26 Sep 2024 21:08:11 GMT, Kelvin Nilsen wrote: 
> This change improves the efficiency of cleaning up (recycling) regions that have been trashed by GC effort. The affected code runs at the end of FinalMark to reclaim immediate garbage. It runs at the end of FinalUpdateRefs to reclaim the regions that comprised the collection set, from which all live objects have now been evacuated. > > Efficiency improvements include: > 1. Rather than invoking the os (while holding the Heap lock) to obtain the time twice for every region recycled, we invoke the os only once for each batch of 32 regions that are to be processed. > 2. Rather than enforcing that the loop runs no longer than 30 us, we refrain from starting a second batch of regions if more than 8 us has passed since the preceding batch was processed. > > Below, each trial runs for 1 hour, processing 28,000 transactions per second. > > Without this change, latency for 4 un-named business services is represented by the following chart: > ![image](https://github.com/user-attachments/assets/0e36025b-7b76-4e7a-ab07-303ea49479c3) > > With this change, latency for the same services is much better: > ![image](https://github.com/user-attachments/assets/aceaf185-6944-4c91-b98e-06ccd1bc2d64) > > A comparison of the two is provided by the following: > ![image](https://github.com/user-attachments/assets/7145f7b5-2a65-44b0-a94a-ddbc871f236b) src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 944: > 942: // yields. Yielding more frequently when there is heavy contention for the heap lock or for CPU cores is considered the > 943: // right thing to do. > 944: const size_t REGION_STRIDE = 32; Maybe call this `REGIONS_PER_BATCH`? When I see `stride` I think how far the loop index moves on each iteration. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/21211#discussion_r1783520423 From kdnilsen at openjdk.org Wed Oct 2 00:54:53 2024 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Wed, 2 Oct 2024 00:54:53 GMT Subject: RFR: 8341379: Shenandoah: Improve lock contention during cleanup In-Reply-To: <_lvg95UW7HqXx5iAkndxKUNGHTmsXCLj7Ee5t-0SRCE=.47af2a9e-80b3-40d6-9348-407ee1518442@github.com> References: <5GMjFWaMjkSuxhPJ8TYJyw_LYdFfhJWrySSaRUjSNq4=.0ee7f896-35b2-43aa-a243-418b1ffb50b0@github.com> <_lvg95UW7HqXx5iAkndxKUNGHTmsXCLj7Ee5t-0SRCE=.47af2a9e-80b3-40d6-9348-407ee1518442@github.com> Message-ID: On Tue, 1 Oct 2024 20:53:01 GMT, William Kemper wrote: >> This change improves the efficiency of cleaning up (recycling) regions that have been trashed by GC effort. The affected code runs at the end of FinalMark to reclaim immediate garbage. It runs at the end of FinalUpdateRefs to reclaim the regions that comprised the collection set, from which all live objects have now been evacuated. >> >> Efficiency improvements include: >> 1. Rather than invoking the os (while holding the Heap lock) to obtain the time twice for every region recycled, we invoke the os only once for each batch of 32 regions that are to be processed. >> 2. Rather than enforcing that the loop runs no longer than 30 us, we refrain from starting a second batch of regions if more than 8 us has passed since the preceding batch was processed. >> >> Below, each trial runs for 1 hour, processing 28,000 transactions per second. 
>> >> Without this change, latency for 4 un-named business services is represented by the following chart: >> ![image](https://github.com/user-attachments/assets/0e36025b-7b76-4e7a-ab07-303ea49479c3) >> >> With this change, latency for the same services is much better: >> ![image](https://github.com/user-attachments/assets/aceaf185-6944-4c91-b98e-06ccd1bc2d64) >> >> A comparison of the two is provided by the following: >> ![image](https://github.com/user-attachments/assets/7145f7b5-2a65-44b0-a94a-ddbc871f236b) > > src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 944: > >> 942: // yields. Yielding more frequently when there is heavy contention for the heap lock or for CPU cores is considered the >> 943: // right thing to do. >> 944: const size_t REGION_STRIDE = 32; > > Maybe call this `REGIONS_PER_BATCH`? When I see `stride` I think how far the loop index moves on each iteration. Good suggestion. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/21211#discussion_r1783668735 From kdnilsen at openjdk.org Wed Oct 2 00:54:53 2024 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Wed, 2 Oct 2024 00:54:53 GMT Subject: RFR: 8341379: Shenandoah: Improve lock contention during cleanup In-Reply-To: References: <5GMjFWaMjkSuxhPJ8TYJyw_LYdFfhJWrySSaRUjSNq4=.0ee7f896-35b2-43aa-a243-418b1ffb50b0@github.com> <_lvg95UW7HqXx5iAkndxKUNGHTmsXCLj7Ee5t-0SRCE=.47af2a9e-80b3-40d6-9348-407ee1518442@github.com> Message-ID: On Wed, 2 Oct 2024 00:24:00 GMT, Kelvin Nilsen wrote: >> src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 944: >> >>> 942: // yields. Yielding more frequently when there is heavy contention for the heap lock or for CPU cores is considered the >>> 943: // right thing to do. >>> 944: const size_t REGION_STRIDE = 32; >> >> Maybe call this `REGIONS_PER_BATCH`? When I see `stride` I think how far the loop index moves on each iteration. > > Good suggestion. Thanks. I've made this change. 
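The batching described in this PR reads the clock once per batch of `REGIONS_PER_BATCH` regions instead of per region, and it does not start another batch under the same lock hold once a time budget has elapsed. A standalone sketch of that pattern follows. It uses `std::mutex` and `std::chrono` and is not Shenandoah code. `Region`, `recycle_trash` and `MAX_HOLD_TIME` are hypothetical. Measuring the 8 us budget from the moment the lock is acquired is also an assumption about how the description should be read.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical region; recycle() stands in for the real per-region work.
struct Region {
  bool trash = false;
  void recycle() { trash = false; }
};

constexpr std::size_t REGIONS_PER_BATCH = 32;
constexpr std::chrono::microseconds MAX_HOLD_TIME{8};

// Recycles every trashed region. Under each lock hold the clock is read once at
// acquisition and once after every batch. When the budget is used up, the lock
// is released so waiting threads can get in, then reacquired for the rest.
// Returns how many times the lock was acquired.
inline std::size_t recycle_trash(std::vector<Region>& regions, std::mutex& heap_lock) {
  using Clock = std::chrono::steady_clock;
  std::size_t idx = 0;
  std::size_t acquisitions = 0;
  while (idx < regions.size()) {
    std::lock_guard<std::mutex> guard(heap_lock);
    ++acquisitions;
    const Clock::time_point start = Clock::now();
    do {
      const std::size_t batch_end = std::min(idx + REGIONS_PER_BATCH, regions.size());
      for (; idx < batch_end; ++idx) {
        if (regions[idx].trash) regions[idx].recycle();
      }
      // Do not start another batch under this lock hold once the budget is spent.
    } while (idx < regions.size() && Clock::now() - start < MAX_HOLD_TIME);
  }
  return acquisitions;
}
```

Every lock hold completes at least one full batch, so the number of acquisitions is bounded by the number of batches. The clock cost is amortized over up to 32 regions.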
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/21211#discussion_r1783698490 From thomas.schatzl at oracle.com Wed Oct 2 06:47:52 2024 From: thomas.schatzl at oracle.com (Thomas Schatzl) Date: Wed, 2 Oct 2024 08:47:52 +0200 Subject: Aligning the Serial collector with ZGC In-Reply-To: References: <49D0FB58-94AB-46B6-A8A6-F9F08773770E@kodewerk.com> <44230e90-d8bc-491f-a5c4-2c483646fc3e@oracle.com> Message-ID: <712c118e-642e-4bb7-a48b-b0b11ba3a4f2@oracle.com> Hi, On 28.09.24 00:55, Kirk Pepperdine wrote: > Hi Thomas, > > I wanted to respond to all of your comments but I thought better of it > given one response deserves its own email. The focus is mostly on that > one question. > >> > >> > - Introduce an adaptive size policy that takes into account memory and >> > CPU pressure along with global memory pressure. >> >     - Heap should be large enough to minimize GC overhead but not >> > large enough to trigger OOM. >> >> (probably meant "small enough" the second time) > > I actually did mean large but in the context of OOM killer... But to your > point, smaller but avoid OOME is also a concern. > >> >> >     - Introduce -XX:SerialPressure=[0-100] to support this work. >> >> (Fwiw, regards to the other discussion, I agree that if we have a flag >> with the same "meaning" across collectors it might be useful to use >> the same name). > > I think we have deadly agreement on this one. > >> >> >     - introduce a smoothing algorithm to avoid excessive small >> > resizes. >> >> One option is to split this further into parts: >> >> * list what actions Serial GC could do in reaction to memory pressure >> on an abstract level, and which make sense; from that see what >> functionality is needed. > > I built a chart some time ago and this is an expanded version of it. > [...] > > Some of my thoughts used to construct the table. > [...] > All of the resizing decisions need to be moderated by the availability > of (global) memory.
If global memory is scarce, then the decision should > favour releasing (uncommitting) memory. This may come at the expense of > higher GC overhead. Resizing to smaller pool sizes is not without risk > and in the case of young, both high global memory pressure and high > allocation pressure add to the risk. > Thank you for sharing your detailed thoughts. > > >> >> * provide functionality that tries to keep some kind of GC/mutator >> time ratio; I would start with looking at what G1 does because Serial GC's >> behaviour is probably closer to G1 than ZGC, but ymmv. >> (Obviously improvements are welcome :)) > > I would agree. Here's some old code for implementing https://bugs.openjdk.org/browse/JDK-8238687: Uncommit at every GC that improves a bit on the current G1 policy which implements both signalling for under/over-cpu usage ratio, which is maybe better (documented) than the existing code. https://github.com/openjdk/jdk/compare/master...tschatzl:jdk:investigate-memory-uncommit-every-gc-only [...] >> > - Introduce manageable flag SoftMaxHeapSize to define a target heap >> > size and set the default max heap size to 100% of available. >> >> I am a bit torn about SoftMaxHeapSize in Serial GC. What do you >> envision that Serial GC would do when the SoftMaxHeapSize has been >> reached, and what if old gen occupancy permanently stays above that value? > > At the moment, SoftMaxHeapSize is an implementation in Z. I'd first like > to pull a (rough) spec out of the implementation and then try to answer > your question. It's currently not clear to me how this should work with > any collector. >> >> The usefulness of SoftMaxHeapSize kind of relies on having a minimally >> invasive old gen collection that tries to get old gen usage back below >> that value. > > Well, the LDS is what it is and running a speculative collection would > likely clean up (prematurely) promoted transients... but that's about it.
> Whereas it would clean both transients and floating garbage for the > concurrent collectors. I'm not a fan of speculative collections given > all of the time I've spent getting rid of them :-) IMO, a DGC-triggered > full collection was rarely necessary (all overhead with very little > return). This also applied to the G1 patch that speculatively ran to > counter to-space overflows and it also applied to running a young gen > prior to remark with CMS collector. Long story short, loads of extra > overhead with very little to no payback. SoftMaxHeapSize is a bit different as it is non-speculative but supposedly based on the user's intent. >> Serial GC has no "minimally invasive" way to collect old generation. >> It is either Full GC or nothing. This is the only option for Serial, >> but always doing Full collections after reaching that threshold seems >> very heavy handed, expensive and undesirable to me (ymmv). >> >> That reaction would follow the spirit of the flag though. >> >> Maybe at the small heaps Serial GC targets, this makes sense, and full >> gc is not that costly anyway. > > Yeah, for small heap this shouldn't be a big deal. But this is one of > the reasons why I believe we should treat young and old separately. We > can cheaply and safely return memory from young gen and leave the sizing > of tenured to when a full is really needed. I grant you that this may > not be very timely but I'm not sure that we need this to happen on > demand? I think we can wait for natural cycles to take their course. > But, maybe I'm wrong on this point. We plan to experiment with this. Please do and report back. >> >> It might be useful to enumerate what actions could be performed on >> global pressure. > > That's in the table? > >> >> > - Add in the ability to uncommit memory (to reduce global memory >> > pressure).
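The sizing ideas in this thread combine three parts: a GC/mutator time-ratio target, moderation by global memory pressure, and smoothing to avoid many small resizes. Below is a minimal sketch of one possible young-generation sizing step that puts them together. Everything in it is hypothetical: the class name, the 2x growth cap, the 0.8 shrink factor and the way pressure is applied are illustrative choices. None of it is Serial GC or G1 code, and it does not define the semantics of the proposed `-XX:SerialPressure` flag.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical young-generation sizing step; not a HotSpot API.
class YoungGenSizer {
 public:
  // target_gc_ratio: desired fraction of time spent in GC (e.g. 0.05).
  // alpha: weight of the newest sample in the exponential moving average.
  // min_change: relative size changes smaller than this are skipped.
  YoungGenSizer(double target_gc_ratio, double alpha, double min_change)
      : _target(target_gc_ratio), _alpha(alpha), _min_change(min_change) {}

  // gc_ratio: measured fraction of recent time spent in GC (0..1).
  // pressure: global memory pressure from 0 (ample) to 1 (scarce).
  std::size_t next_size(std::size_t current, std::size_t min_size, std::size_t max_size,
                        double gc_ratio, double pressure) {
    // Smooth the overhead signal so a single outlier GC does not cause a resize.
    _smoothed = _has_sample ? _alpha * gc_ratio + (1.0 - _alpha) * _smoothed : gc_ratio;
    _has_sample = true;

    double factor = 1.0;
    if (_smoothed > _target) {
      factor = std::min(2.0, _smoothed / _target);  // too much GC time: grow, at most 2x
    } else if (_smoothed < _target / 2) {
      factor = 0.8;                                 // GC time well below target: shrink
    }
    factor *= 1.0 - 0.5 * pressure;                 // scarce memory damps growth, deepens shrink

    if (std::fabs(factor - 1.0) < _min_change) {
      return current;                               // skip small resizes
    }
    const double proposed = static_cast<double>(current) * factor;
    const double clamped = std::max(static_cast<double>(min_size),
                                    std::min(static_cast<double>(max_size), proposed));
    return static_cast<std::size_t>(clamped);
  }

 private:
  double _target;
  double _alpha;
  double _min_change;
  double _smoothed = 0.0;
  bool _has_sample = false;
};
```

Exponential smoothing trades responsiveness for stability: a larger `alpha` reacts faster to a change in allocation behaviour but resizes more often, while `min_change` directly suppresses the "excessive small resizes" mentioned above.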
>> > >> >> The following imo outlines a completely separate idea, and should >> be discussed separately: >> >> > >> > While working through the details of this work I noted that there >> > appear to be opportunities to offer new defaults for other settings. For >> > example, [...] >> >> That seems to be some more elaborate way of finding "optimal" >> generation size for a given heap size (which may follow from what the >> gc/mutator time ratio algorithm gives you). > > I'm trying to apply my years of experience tuning 100s of collectors > across 100s of applications. > Very much appreciated. Hth, Thomas From tschatzl at openjdk.org Wed Oct 2 07:22:37 2024 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Wed, 2 Oct 2024 07:22:37 GMT Subject: RFR: 8339668: Parallel: Adopt PartialArrayState to consolidate marking stack in Full GC [v2] In-Reply-To: <7ZdrpVVGG1hRYudHG_8VA8-lFIJGhSHy0YzmUgtFB9w=.246a5c89-7fb0-402d-92e8-3930c8d430cd@github.com> References:

<7ZdrpVVGG1hRYudHG_8VA8-lFIJGhSHy0YzmUgtFB9w=.246a5c89-7fb0-402d-92e8-3930c8d430cd@github.com> Message-ID: On Tue, 1 Oct 2024 19:21:11 GMT, Zhengyu Gu wrote: >>> OopScannerTask >> >> I certainly can adopt your implementation once it is integrated. > >> Please use `ScannerTask` instead; it seems to be completely serviceable for that purpose. In fact, a search&replace seems just fine. >> >> https://github.com/openjdk/jdk/compare/pr/21089...tschatzl:jdk:pull/21089-recommendations?expand=1 >> >> I am going to experiment with refactoring the other duplicated (statistics) code > > I believe I ran into alignment assertion failure with `ScannerTask` >I ran into the same problem for G1 Full GC that @zhengyu123 has run into here. ScannerTask deals in pointers to oop (and narrowOop). For the separate marking cases we have oops in the tasks. I added a class just like this one in my work, except I put mine in gc/shared and gave it a different name (OopScannerTask, which I don't love). Clearly some coalescing is needed there. The suggested patch just adds a constructor to allow a regular `oop` and the associated getter. Internally `ScannerTask` uses a `void*` anyway. Probably I am overlooking something trivial here why having an interface to store and retrieve an `oop` is not possible.... seems to pass very basic testing though. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/21089#discussion_r1783940286 From rcastanedalo at openjdk.org Wed Oct 2 08:29:55 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Wed, 2 Oct 2024 08:29:55 GMT Subject: RFR: 8305895: Implement JEP 450: Compact Object Headers (Experimental) [v11] In-Reply-To: References:

Message-ID: On Tue, 1 Oct 2024 15:46:01 GMT, Roman Kennke wrote: > > test/hotspot/jtreg/compiler/c2/irTests/TestVectorizationNotRun.java: > > I think I would disable the tests for now. Is there a good way to say 'run this when UCOH is off OR UseSSE>3? I don't think so, due to a [limitation in the IR framework precondition language](https://bugs.openjdk.org/browse/JDK-8294279): `UseCompactObjectHeaders` can only appear within a ["flag precondition"](https://github.com/openjdk/jdk/blob/efe3573b9b4ecec0630fdc1c61c765713a5b68e6/test/hotspot/jtreg/compiler/lib/ir_framework/IR.java#L109) whereas `UseSSE>3` needs to be expressed as a ["CPU feature precondition"](https://github.com/openjdk/jdk/blob/efe3573b9b4ecec0630fdc1c61c765713a5b68e6/test/hotspot/jtreg/compiler/lib/ir_framework/IR.java#L137C14-L137C31) for portability (`UseSSE` is not defined for aarch64), and these two cannot be combined with logical operators. I suggest to disable the IR checks of the failing tests using `applyIf = {"UseCompactObjectHeaders", "false"}` as you did for other similar tests (e.g. `TestMulAddS2I.java`), and document it in [JDK-8340010](https://bugs.openjdk.org/browse/JDK-8340010). Maybe also comment in the tests that the failure happens only with `-XX:UseSSE<=3`. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20677#issuecomment-2387906401 From tschatzl at openjdk.org Wed Oct 2 09:29:40 2024 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Wed, 2 Oct 2024 09:29:40 GMT Subject: RFR: 8339668: Parallel: Adopt PartialArrayState to consolidate marking stack in Full GC [v2] In-Reply-To: References:

<7ZdrpVVGG1hRYudHG_8VA8-lFIJGhSHy0YzmUgtFB9w=.246a5c89-7fb0-402d-92e8-3930c8d430cd@github.com> Message-ID: On Wed, 2 Oct 2024 07:18:49 GMT, Thomas Schatzl wrote: >>> Please use `ScannerTask` instead; it seems to be completely serviceable for that purpose. In fact, a search&replace seems just fine. >>> >>> https://github.com/openjdk/jdk/compare/pr/21089...tschatzl:jdk:pull/21089-recommendations?expand=1 >>> >>> I am going to experiment with refactoring the other duplicated (statistics) code >> >> I believe I ran into an alignment assertion failure with `ScannerTask` > >>I ran into the same problem for G1 Full GC that @zhengyu123 has run into here. ScannerTask deals in > pointers to oop (and narrowOop). For the separate marking cases we have oops in the tasks. > I added a class just like this one in my work, except I put mine in gc/shared and gave it a different name > (OopScannerTask, which I don't love). Clearly some coalescing is needed there. > > The suggested patch just adds a constructor to allow a regular `oop` and the associated getter. Internally `ScannerTask` uses a `void*` anyway. > Probably I am overlooking something trivial here as to why having an interface to store and retrieve an `oop` is not possible... It seems to pass very basic testing, though. (Fwiw, https://github.com/tschatzl/jdk/tree/submit/pull/21089-recommendations-test with the suggested changes seems to pass GHA....) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/21089#discussion_r1784114619 From tschatzl at openjdk.org Wed Oct 2 09:29:42 2024 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Wed, 2 Oct 2024 09:29:42 GMT Subject: RFR: 8339668: Parallel: Adopt PartialArrayState to consolidate marking stack in Full GC [v2] In-Reply-To: References:

Message-ID: On Tue, 1 Oct 2024 12:48:40 GMT, Thomas Schatzl wrote: >> Zhengyu Gu has updated the pull request incrementally with two additional commits since the last revision: >> >> - @tschatzl's comment >> - v8 > > src/hotspot/share/gc/parallel/psCompactionManager.hpp line 122: > >> 120: size_t _arrays_chunked; >> 121: size_t _array_chunks_processed; >> 122: #endif // TASKQUEUE_STATS > > This is a separate issue, but we did not add these counters when doing this change for g1 young gen. Filed [JDK-8341331](https://bugs.openjdk.org/browse/JDK-8341331) for that. > > There is a fair amount of code duplication (definition of members, management and printing) which is not great (but you mentioned it). For now I filed [JDK-8341332](https://bugs.openjdk.org/browse/JDK-8341332) for this, but I would really prefer some refactoring in this area in _this_ change. > > Initially I was kind of okay to do that separately, but the copy&paste seems too much. Here's a commit that only touches up the worst issues in this change: https://github.com/openjdk/jdk/commit/ab6e77ed909c13458a24e9663e830cfac9c4f18e (branch including the other suggestions so far: https://github.com/openjdk/jdk/compare/master...tschatzl:jdk:pull/21089-taskqueue-refactor) This makes the change a lot more readable imo (and in total reduces LOC count). The only missing improvement is to have the array statistics in the same log lines as the other statistics; that could be achieved by specializing the queues/queue set, but that requires the unified `ScannerTask` in the other change and more work. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/21089#discussion_r1784112754 From mli at openjdk.org Wed Oct 2 10:15:54 2024 From: mli at openjdk.org (Hamlin Li) Date: Wed, 2 Oct 2024 10:15:54 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v27] In-Reply-To: References:

Message-ID: <7L7jYDlFa0WnVvgiyNHI9KZrcffYwNnBB899AuMS56Q=.40b031e7-07b8-4a15-b319-c53b38a17a49@github.com> On Wed, 2 Oct 2024 10:10:12 GMT, Hamlin Li wrote: >> Roberto Castañeda Lozano has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 53 additional commits since the last revision: >> >> - Merge remote-tracking branch 'feilongjiang/JEP-475-RISC-V' into JDK-8334060-g1-late-barrier-expansion >> - riscv port refactor >> - Remove temporary support code >> - Merge jdk-24+17 >> - Relax 'must match' assertion in ppc's g1StoreN after limiting pinning bypass optimization >> - Remove unnecessary reg-to-reg moves in aarch64's g1CompareAndX instructions >> - Reintroduce matcher assert and limit pinning bypass optimization to non-shared EncodeP nodes >> - Merge jdk-24+16 >> - Ensure that detected encode-and-store patterns are matched >> - Merge remote-tracking branch 'snazarkin/arm32-JDK-8334060-g1-late-barrier-expansion' into JDK-8334060-g1-late-barrier-expansion >> - ... and 43 more: https://git.openjdk.org/jdk/compare/486c5b0d...14483b83 > > src/hotspot/cpu/riscv/gc/g1/g1_riscv.ad line 55: > >> 53: } >> 54: for (RegSetIterator<Register> reg = no_preserve.begin(); *reg != noreg; ++reg) { >> 55: stub->dont_preserve(*reg); > > Could `no_preserve` and `preserve` overlap? > If false, then it seems it's not necessary to do `dont_preserve` for `no_preserve`. > If true, it seems it's not safe to `dont_preserve` these regs? I'm not sure. In the G1 case, the use of `dont_preserve` is an optimization to avoid spilling and reloading, in the slow path of the pre-barrier, registers (`res`) that are not live at that point. It is not necessary for correctness, but saves a few bytes in the generated code.
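As a rough illustration of the relationship described here, the toy model below treats register sets as bitmasks: a stub saves the registers that are live out of the instruction, minus the ones explicitly marked don't-preserve. `RegMask`, `reg()` and `stub_preserve_set()` are hypothetical names, not HotSpot's `RegSet`/`BarrierStubC2` API, and the model ignores sub-registers.

```cpp
#include <cassert>
#include <cstdint>

// Toy register set: bit i set means "register i is in the set".
// Hypothetical model, not HotSpot's RegSet or RegMask types.
using RegMask = std::uint32_t;

constexpr RegMask reg(unsigned i) { return RegMask{1} << i; }

// Registers a slow-path stub would save and restore around its call:
// everything live out of the instruction, except registers explicitly
// marked "don't preserve" (e.g. a result register whose current value
// is dead at the point where the stub runs).
constexpr RegMask stub_preserve_set(RegMask live_out, RegMask dont_preserve) {
  return live_out & ~dont_preserve;
}
```

In this model, a register that appears in both sets is simply not saved, and marking a register that is not live out has no effect, so the mark only matters for registers that liveness analysis would otherwise keep.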
If `res` was not marked as `dont_preserve`, it would be included in the pre-barrier stub's preserve set (`BarrierStubC2::preserve_set()`) because it is live out of the entire AD instruction (as computed by `BarrierSetC2::compute_liveness_at_stubs()`). ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1784346898 From rcastanedalo at openjdk.org Wed Oct 2 11:53:52 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Wed, 2 Oct 2024 11:53:52 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v27] In-Reply-To: References:

Message-ID: On Wed, 2 Oct 2024 09:58:29 GMT, Hamlin Li wrote: >> Roberto Castañeda Lozano has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 53 additional commits since the last revision: >> >> - Merge remote-tracking branch 'feilongjiang/JEP-475-RISC-V' into JDK-8334060-g1-late-barrier-expansion >> - riscv port refactor >> - Remove temporary support code >> - Merge jdk-24+17 >> - Relax 'must match' assertion in ppc's g1StoreN after limiting pinning bypass optimization >> - Remove unnecessary reg-to-reg moves in aarch64's g1CompareAndX instructions >> - Reintroduce matcher assert and limit pinning bypass optimization to non-shared EncodeP nodes >> - Merge jdk-24+16 >> - Ensure that detected encode-and-store patterns are matched >> - Merge remote-tracking branch 'snazarkin/arm32-JDK-8334060-g1-late-barrier-expansion' into JDK-8334060-g1-late-barrier-expansion >> - ... and 43 more: https://git.openjdk.org/jdk/compare/0dc16d16...14483b83 > > src/hotspot/cpu/riscv/gc/g1/g1_riscv.ad line 169: > >> 167: predicate(UseG1GC && n->as_LoadStore()->barrier_data() != 0); >> 168: match(Set res (CompareAndExchangeP mem (Binary oldval newval))); >> 169: effect(TEMP res, TEMP tmp1, TEMP tmp2); > > should `res` be `TEMP_DEF`? It could, but the effect would be the same (see [JDK-8058880](https://bugs.openjdk.org/browse/JDK-8058880)). I went with `TEMP` for the x64 and aarch64 platforms for consistency with the analogous ZGC ADL code, see e.g. https://github.com/openjdk/jdk/blob/855c8a7def21025bc2fc47594f7285a55924c213/src/hotspot/cpu/aarch64/gc/z/z_aarch64.ad#L182-L204.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1784358586 From mbaesken at openjdk.org Wed Oct 2 12:05:05 2024 From: mbaesken at openjdk.org (Matthias Baesken) Date: Wed, 2 Oct 2024 12:05:05 GMT Subject: RFR: 8336911: ZGC: Division by zero in heuristics after JDK-8332717 Message-ID: When running with ubsan enabled binaries, the following issue is reported, e.g. in test compiler/uncommontrap/TestDeoptOOM_ZGenerational.jtr also in gc/z/TestSmallHeap.jtr jdk/src/hotspot/share/gc/z/zDirector.cpp:537:84: runtime error: division by zero #0 0x7f422495bd1f in calculate_young_to_old_worker_ratio src/hotspot/share/gc/z/zDirector.cpp:537 #1 0x7f422495bd1f in select_worker_threads src/hotspot/share/gc/z/zDirector.cpp:694 #2 0x7f42282a0d97 in select_worker_threads src/hotspot/share/gc/z/zDirector.cpp:689 #3 0x7f42282a0d97 in initial_workers src/hotspot/share/gc/z/zDirector.cpp:784 #4 0x7f42282a2485 in initial_workers src/hotspot/share/gc/z/zDirector.cpp:795 #5 0x7f42282a2485 in start_minor_gc src/hotspot/share/gc/z/zDirector.cpp:797 #6 0x7f42282a2485 in start_gc src/hotspot/share/gc/z/zDirector.cpp:826 #7 0x7f42282a2485 in ZDirector::run_thread() src/hotspot/share/gc/z/zDirector.cpp:912 #8 0x7f422840bdd8 in ZThread::run_service() src/hotspot/share/gc/z/zThread.cpp:29 #9 0x7f4225ab6979 in ConcurrentGCThread::run() src/hotspot/share/gc/shared/concurrentGCThread.cpp:48 #10 0x7f4227e1137a in Thread::call_run() src/hotspot/share/runtime/thread.cpp:225 #11 0x7f42274619b1 in thread_native_entry src/hotspot/os/linux/os_linux.cpp:858 #12 0x7f422c8d36e9 in start_thread (/lib64/libpthread.so.0+0xa6e9) (BuildId: 9a146bd267419cb6a8cf08d7c602953a0f2e12c5) #13 0x7f422c1dc58e in clone (/lib64/libc.so.6+0x11858e) (BuildId: f2d1cb1ef49f8c47d43a4053910ba6137673ccce) The division by 0 leads to 'infinity' on most of our platforms. So instead of relying on this behavior, we can add a small check and set 'infinity' for divisor == 0. 
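The check proposed here can be sketched as a standalone helper. The function name is hypothetical and this is not the actual `zDirector.cpp` change; it only shows the behavior of returning `std::numeric_limits<double>::infinity()` explicitly when the divisor is zero, instead of performing a floating-point division by zero that the UBSan build reports.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical standalone version of the proposed check: instead of
// dividing by zero, return +infinity explicitly whenever the divisor is 0.
double ratio_or_infinity(double numerator, double divisor) {
  if (divisor == 0.0) {
    return std::numeric_limits<double>::infinity();
  }
  return numerator / divisor;
}
```

Note that with this helper 0/0 also becomes +infinity, whereas IEEE 754 division would produce NaN. Infinity can still turn into NaN in later arithmetic (for example, infinity multiplied by zero), which is the concern raised in the review of this change.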
------------- Commit messages: - JDK-8336911 Changes: https://git.openjdk.org/jdk/pull/21304/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=21304&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8336911 Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/21304.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/21304/head:pull/21304 PR: https://git.openjdk.org/jdk/pull/21304 From mli at openjdk.org Wed Oct 2 12:57:48 2024 From: mli at openjdk.org (Hamlin Li) Date: Wed, 2 Oct 2024 12:57:48 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v27] In-Reply-To: <7L7jYDlFa0WnVvgiyNHI9KZrcffYwNnBB899AuMS56Q=.40b031e7-07b8-4a15-b319-c53b38a17a49@github.com> References:

<7L7jYDlFa0WnVvgiyNHI9KZrcffYwNnBB899AuMS56Q=.40b031e7-07b8-4a15-b319-c53b38a17a49@github.com> Message-ID: On Wed, 2 Oct 2024 11:40:18 GMT, Roberto Castañeda Lozano wrote: > If `res` was not marked as `dont_preserve`, it would be included in the pre-barrier stub's preserve set (`BarrierStubC2::preserve_set()`) because it is live out of the entire AD instruction (as computed by `BarrierSetC2::compute_liveness_at_stubs()`). Thanks for the explanation! I did not realize this; if that's the case, then it's good. >> src/hotspot/cpu/riscv/gc/g1/g1_riscv.ad line 169: >> >>> 167: predicate(UseG1GC && n->as_LoadStore()->barrier_data() != 0); >>> 168: match(Set res (CompareAndExchangeP mem (Binary oldval newval))); >>> 169: effect(TEMP res, TEMP tmp1, TEMP tmp2); >> >> should `res` be `TEMP_DEF`? > > It could, but the effect would be the same (see [JDK-8058880](https://bugs.openjdk.org/browse/JDK-8058880)). I went with `TEMP` for the x64 and aarch64 platforms for consistency with the analogous ZGC ADL code, see e.g. https://github.com/openjdk/jdk/blob/855c8a7def21025bc2fc47594f7285a55924c213/src/hotspot/cpu/aarch64/gc/z/z_aarch64.ad#L182-L204. I saw that the riscv one in z_riscv.ad is `effect(TEMP oldval_tmp, TEMP newval_tmp, TEMP tmp1, TEMP_DEF res);`; maybe it's good to change the riscv one? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1784479784 PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1784479526 From zgu at openjdk.org Wed Oct 2 14:26:15 2024 From: zgu at openjdk.org (Zhengyu Gu) Date: Wed, 2 Oct 2024 14:26:15 GMT Subject: RFR: 8339668: Parallel: Adopt PartialArrayState to consolidate marking stack in Full GC [v3] In-Reply-To: References: Message-ID: > Please review this patch that adopts `PartialArrayState` introduced by [JDK-8337709](https://bugs.openjdk.org/browse/JDK-8337709) to consolidate `_oop_task_queues` and `_objarray_task_queues` into a single `_marking_stacks`.
> > The change mirrors Kim's [JDK-8311163](https://bugs.openjdk.org/browse/JDK-8311163) work; therefore, there are methods that can be consolidated and simplified, but I would like to defer that to a follow-up CR. Zhengyu Gu has updated the pull request incrementally with one additional commit since the last revision: @tschatzl's ScannerTask changes ------------- Changes: - all: https://git.openjdk.org/jdk/pull/21089/files - new: https://git.openjdk.org/jdk/pull/21089/files/15b998f6..fd756b3b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=21089&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=21089&range=01-02 Stats: 55 lines in 5 files changed: 8 ins; 33 del; 14 mod Patch: https://git.openjdk.org/jdk/pull/21089.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/21089/head:pull/21089 PR: https://git.openjdk.org/jdk/pull/21089 From zgu at openjdk.org Wed Oct 2 14:26:15 2024 From: zgu at openjdk.org (Zhengyu Gu) Date: Wed, 2 Oct 2024 14:26:15 GMT Subject: RFR: 8339668: Parallel: Adopt PartialArrayState to consolidate marking stack in Full GC [v3] In-Reply-To: References:

<7ZdrpVVGG1hRYudHG_8VA8-lFIJGhSHy0YzmUgtFB9w=.246a5c89-7fb0-402d-92e8-3930c8d430cd@github.com>

Message-ID: <9lFtFPURRs-aTb2yCDbogZdvJnLd8bnH86SkWnY2WBw=.f5ef2cfe-3799-40ad-befc-5d8b532a2d4e@github.com> On Wed, 2 Oct 2024 09:26:48 GMT, Thomas Schatzl wrote: >>>I ran into the same problem for G1 Full GC that @zhengyu123 has run into here. ScannerTask deals in >> pointers to oop (and narrowOop). For the separate marking cases we have oops in the tasks. >> I added a class just like this one in my work, except I put mine in gc/shared and gave it a different name >> (OopScannerTask, which I don't love). Clearly some coalescing is needed there. >> >> The suggested patch just adds a constructor to allow a regular `oop` and the associated getter. Internally `ScannerTask` uses a `void*` anyway. >> Probably I am overlooking something trivial here as to why having an interface to store and retrieve an `oop` is not possible... It seems to pass very basic testing, though. > > (Fwiw, https://github.com/tschatzl/jdk/tree/submit/pull/21089-recommendations-test with the suggested changes seems to pass GHA....) I missed the `ScannerTask` constructor change :-( The suggested changes also passed tier1 with Parallel GC. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/21089#discussion_r1784622909 From zgu at openjdk.org Wed Oct 2 14:34:36 2024 From: zgu at openjdk.org (Zhengyu Gu) Date: Wed, 2 Oct 2024 14:34:36 GMT Subject: RFR: 8339668: Parallel: Adopt PartialArrayState to consolidate marking stack in Full GC [v3] In-Reply-To: References:

Message-ID: <90BJHj19LxcGg3LnDvZWKeNwkToEBcKuC9TfHo8BQkY=.cca23a09-b629-4a63-a506-0a28506bd6c1@github.com> On Wed, 2 Oct 2024 09:25:18 GMT, Thomas Schatzl wrote: >> src/hotspot/share/gc/parallel/psCompactionManager.hpp line 122: >> >>> 120: size_t _arrays_chunked; >>> 121: size_t _array_chunks_processed; >>> 122: #endif // TASKQUEUE_STATS >> >> This is a separate issue, but we did not add these counters when doing this change for g1 young gen. Filed [JDK-8341331](https://bugs.openjdk.org/browse/JDK-8341331) for that. >> >> There is a fair amount of code duplication (definition of members, management and printing) which is not great (but you mentioned it). For now I filed [JDK-8341332](https://bugs.openjdk.org/browse/JDK-8341332) for this, but I would really prefer some refactoring in this area in _this_ change. >> >> Initially I was kind of okay to do that separately, but the copy&paste seems too much. > > Here's a commit that only touches up the worst issues in this change: https://github.com/openjdk/jdk/commit/ab6e77ed909c13458a24e9663e830cfac9c4f18e > > (branch including the other suggestions so far: https://github.com/openjdk/jdk/compare/master...tschatzl:jdk:pull/21089-taskqueue-refactor) > > This makes the change a lot more readable imo (and in total reduces LOC count). > > The only missing improvement is to have the array statistics in the same log lines as the other statistics; that could be achieved by specializing the queues/queue set, but that requires the unified `ScannerTask` in the other change and more work. 
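The `ScannerTask` discussion in this thread turns on whether a task type that packs a tag into the low bits of a `void*` can also carry a plain object pointer. Below is a simplified, hypothetical model of that idea: the class, its tag values and the separate `ObjectTag` are assumptions for illustration, not the actual `ScannerTask` layout in HotSpot. Because heap objects are at least 8-byte aligned, an object pointer leaves the two tag bits free, just like a pointer to a slot does; the check in `encode()` is the kind of alignment assertion that fires when that is not the case. The thread does not say what triggered the failure Zhengyu saw.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins; not HotSpot types.
struct Object { std::uint64_t header; };  // a heap object; 8-byte aligned
struct ArrayState;                         // partial-array bookkeeping

// Simplified model of a tagged task: one void* whose two low bits record
// what kind of pointer it holds. The tag values, including the extra
// ObjectTag for a plain object pointer, are assumptions for illustration.
class TaggedTask {
 public:
  enum Tag : std::uintptr_t { SlotTag = 0, NarrowSlotTag = 1, ArrayTag = 2, ObjectTag = 3 };

  explicit TaggedTask(Object** slot) : _p(encode(slot, SlotTag)) {}
  explicit TaggedTask(std::uint32_t* narrow_slot) : _p(encode(narrow_slot, NarrowSlotTag)) {}
  explicit TaggedTask(ArrayState* state) : _p(encode(state, ArrayTag)) {}
  explicit TaggedTask(Object* obj) : _p(encode(obj, ObjectTag)) {}  // the "plain oop" case

  Tag tag() const { return static_cast<Tag>(bits() & kTagMask); }

  Object* to_object() const {
    assert(tag() == ObjectTag);
    return static_cast<Object*>(untagged());
  }
  Object** to_slot() const {
    assert(tag() == SlotTag);
    return static_cast<Object**>(untagged());
  }

 private:
  static constexpr std::uintptr_t kTagMask = 3;  // 2 tag bits: pointers must be 4-byte aligned

  static void* encode(void* p, std::uintptr_t tag) {
    // The tag only fits if the pointer's low bits are clear; an
    // under-aligned pointer trips this assertion.
    assert((reinterpret_cast<std::uintptr_t>(p) & kTagMask) == 0 && "misaligned pointer");
    return reinterpret_cast<void*>(reinterpret_cast<std::uintptr_t>(p) | tag);
  }
  std::uintptr_t bits() const { return reinterpret_cast<std::uintptr_t>(_p); }
  void* untagged() const { return reinterpret_cast<void*>(bits() & ~kTagMask); }

  void* _p;
};
```

Whether an object pointer needs its own tag or can share one with slot pointers depends on whether a given queue ever mixes the two; this sketch gives it a separate tag only to keep the round trip checkable.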
I will take a shot at [JDK-8341332](https://bugs.openjdk.org/browse/JDK-8341332) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/21089#discussion_r1784643598 From rkennke at openjdk.org Wed Oct 2 15:37:40 2024 From: rkennke at openjdk.org (Roman Kennke) Date: Wed, 2 Oct 2024 15:37:40 GMT Subject: RFR: 8305895: Implement JEP 450: Compact Object Headers (Experimental) [v29] In-Reply-To: References: Message-ID: > This is the main body of the JEP 450: Compact Object Headers (Experimental). > > It is also a follow-up to #20640, which now also includes (and supersedes) #20603 and #20605, plus the Tiny Class-Pointers parts that have been previously missing. > > Main changes: > - Introduction of the (experimental) flag UseCompactObjectHeaders. All changes in this PR are protected by this flag. The purpose of the flag is to provide a fallback, in case that users unexpectedly observe problems with the new implementation. The intention is that this flag will remain experimental and opt-in for at least one release, then make it on-by-default and diagnostic (?), and eventually deprecate and obsolete it. However, there are a few unknowns in that plan, specifically, we may want to further improve compact headers to 4 bytes, we are planning to enhance the Klass* encoding to support a virtually unlimited number of Klasses, at which point we could also obsolete UseCompressedClassPointers. > - The compressed Klass* can now be stored in the mark-word of objects. In order to be able to do this, we make some changes to GC forwarding (see below) to protect the relevant (upper 22) bits of the mark-word. Significant parts of this PR deal with loading the compressed Klass* from the mark-word. This PR also changes some code paths (mostly in GCs) to be more careful when accessing Klass* (or mark-word or size) to be able to fetch it from the forwardee in case the object is forwarded. > - Self-forwarding in GCs (which is used to deal with promotion failure) now uses a bit to indicate 'self-forwarding'. This is needed to preserve the crucial Klass* bits in the header. This also allows us to get rid of preserved-header machinery in SerialGC and G1 (Parallel GC abuses preserved-marks to also find all other relevant oops). > - Full GC forwarding now uses an encoding similar to compressed-oops.
> - Self-forwarding in GCs (which is used to deal with promotion failure) now uses a bit to indicate 'self-forwarding'. This is needed to preserve the crucial Klass* bits in the header. This also allows to get rid of preserved-header machinery in SerialGC and G1 (Parallel GC abuses preserved-marks to also find all other relevant oops). > - Full GC forwarding now uses an encoding similar to compressed-oops. We have 40 bits for that, and can encode up to 8TB of heap. When exceeding 8TB, we turn off UseCompressedClassPointers (except in ZGC, which doesn't use the GC forwarding at all). > - Instances can now have their base-offset (the offset where the field layouter starts to place fields) at offset 8 (instead of 12 or 16). > - Arrays will now store their length at offset 8. > - CDS can now write and read archives with the compressed header. However, it is not possible to read an archive that has been written with an opposite setting of UseCompactObjectHeaders. Some build machinery is added so that _coh variants of CDS archiv... Roman Kennke has updated the pull request incrementally with three additional commits since the last revision: - Revert "Disable TestSplitPacks::test4a, failing on aarch64" This reverts commit 059b1573db26d1d2902ca6dadc8413f445234c2a. 
- Simplify object init code in interpreter - Disable some vectorization tests that fail with +UCOH and UseSSE<=3 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/20677/files - new: https://git.openjdk.org/jdk/pull/20677/files/059b1573..aea8f00c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=20677&range=28 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20677&range=27-28 Stats: 47 lines in 6 files changed: 18 ins; 13 del; 16 mod Patch: https://git.openjdk.org/jdk/pull/20677.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20677/head:pull/20677 PR: https://git.openjdk.org/jdk/pull/20677 From aboldtch at openjdk.org Wed Oct 2 15:52:41 2024 From: aboldtch at openjdk.org (Axel Boldt-Christmas) Date: Wed, 2 Oct 2024 15:52:41 GMT Subject: RFR: 8336911: ZGC: Division by zero in heuristics after JDK-8332717 In-Reply-To: References: Message-ID: On Wed, 2 Oct 2024 12:00:19 GMT, Matthias Baesken wrote: > When running with ubsan enabled binaries, the following issue is reported, > e.g. 
in test > compiler/uncommontrap/TestDeoptOOM_ZGenerational.jtr > also in gc/z/TestSmallHeap.jtr > > > jdk/src/hotspot/share/gc/z/zDirector.cpp:537:84: runtime error: division by zero > #0 0x7f422495bd1f in calculate_young_to_old_worker_ratio src/hotspot/share/gc/z/zDirector.cpp:537 > #1 0x7f422495bd1f in select_worker_threads src/hotspot/share/gc/z/zDirector.cpp:694 > #2 0x7f42282a0d97 in select_worker_threads src/hotspot/share/gc/z/zDirector.cpp:689 > #3 0x7f42282a0d97 in initial_workers src/hotspot/share/gc/z/zDirector.cpp:784 > #4 0x7f42282a2485 in initial_workers src/hotspot/share/gc/z/zDirector.cpp:795 > #5 0x7f42282a2485 in start_minor_gc src/hotspot/share/gc/z/zDirector.cpp:797 > #6 0x7f42282a2485 in start_gc src/hotspot/share/gc/z/zDirector.cpp:826 > #7 0x7f42282a2485 in ZDirector::run_thread() src/hotspot/share/gc/z/zDirector.cpp:912 > #8 0x7f422840bdd8 in ZThread::run_service() src/hotspot/share/gc/z/zThread.cpp:29 > #9 0x7f4225ab6979 in ConcurrentGCThread::run() src/hotspot/share/gc/shared/concurrentGCThread.cpp:48 > #10 0x7f4227e1137a in Thread::call_run() src/hotspot/share/runtime/thread.cpp:225 > #11 0x7f42274619b1 in thread_native_entry src/hotspot/os/linux/os_linux.cpp:858 > #12 0x7f422c8d36e9 in start_thread (/lib64/libpthread.so.0+0xa6e9) (BuildId: 9a146bd267419cb6a8cf08d7c602953a0f2e12c5) > #13 0x7f422c1dc58e in clone (/lib64/libc.so.6+0x11858e) (BuildId: f2d1cb1ef49f8c47d43a4053910ba6137673ccce) > > > The division by 0 leads to 'infinity' on most of our platforms. So instead of relying on this behavior, we can add a small check and set 'infinity' for divisor == 0. I do not think `infinity` is the solution here. There are more problems with the heuristics when no young collection has reclaimed any memory. I added a comment about this in an earlier PR (JDK-8339648 / #20888) https://github.com/openjdk/jdk/pull/20888#discussion_r1758502503. I proposed a solution to this specific issue that makes more sense to me and avoids the NaN issues here.
But I will have to talk it over. Regardless, I think we need to do an overhaul of this code to handle the extreme case of no GC having reclaimed any memory. _Also, this must have been an issue before JDK-8332717 as well?_ src/hotspot/share/gc/z/zDirector.cpp line 539: > 537: const double current_old_bytes_freed_per_gc_time = double(reclaimed_per_old_gc) / double(old_gc_time); > 538: const double old_vs_young_efficiency_ratio = current_young_bytes_freed_per_gc_time == 0 ? std::numeric_limits<double>::infinity() > 539: : current_old_bytes_freed_per_gc_time / current_young_bytes_freed_per_gc_time; I think returning infinity here will cause problems with NaN down the line. It is also unclear what this means if both are `0`. To me something like the following makes sense. But I will discuss this with my team. Suggestion: if (current_young_bytes_freed_per_gc_time == 0.0) { if (current_old_bytes_freed_per_gc_time == 0.0) { // Neither young nor old collections have reclaimed any memory. // Give them equal priority. return 1.0; } // Only old collections have reclaimed memory. // Prioritize old. return ZOldGCThreads; } ------------- Changes requested by aboldtch (Reviewer).
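For readability, the suggested guard can be written as a standalone function. In the suggestion, the two `return` statements leave the enclosing heuristic function (`calculate_young_to_old_worker_ratio`, according to the stack trace above). This hypothetical sketch models that as an optional early result: `std::nullopt` means "continue with the normal computation", where dividing by the young rate is then safe. The function name is illustrative, and `old_gc_threads` stands in for the `ZOldGCThreads` flag.

```cpp
#include <cassert>
#include <optional>

// Hypothetical standalone version of the guard suggested in review.
// Returns an early result for the degenerate cases, or std::nullopt when
// the normal old/young efficiency computation can proceed safely.
std::optional<double> early_worker_ratio(double young_bytes_freed_per_gc_time,
                                         double old_bytes_freed_per_gc_time,
                                         unsigned old_gc_threads) {
  if (young_bytes_freed_per_gc_time == 0.0) {
    if (old_bytes_freed_per_gc_time == 0.0) {
      // Neither young nor old collections have reclaimed any memory:
      // give them equal priority.
      return 1.0;
    }
    // Only old collections have reclaimed memory: prioritize old.
    return static_cast<double>(old_gc_threads);
  }
  // The young rate is nonzero, so old / young is well defined.
  return std::nullopt;
}
```

Unlike returning infinity, this never produces a non-finite value, so later arithmetic on the result cannot turn into NaN.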
PR Review: https://git.openjdk.org/jdk/pull/21304#pullrequestreview-2343363648 PR Review Comment: https://git.openjdk.org/jdk/pull/21304#discussion_r1784803080 From kdnilsen at openjdk.org Wed Oct 2 16:47:37 2024 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Wed, 2 Oct 2024 16:47:37 GMT Subject: RFR: 8341379: Shenandoah: Improve lock contention during cleanup In-Reply-To: <5GMjFWaMjkSuxhPJ8TYJyw_LYdFfhJWrySSaRUjSNq4=.0ee7f896-35b2-43aa-a243-418b1ffb50b0@github.com> References: <5GMjFWaMjkSuxhPJ8TYJyw_LYdFfhJWrySSaRUjSNq4=.0ee7f896-35b2-43aa-a243-418b1ffb50b0@github.com> Message-ID: <_V337A0U9_TeBPs26Q6QyOfvmYoaqxOxl3MaRrhI16s=.dc7e86a3-a8ff-4617-995c-1417cac18bb8@github.com> On Thu, 26 Sep 2024 21:08:11 GMT, Kelvin Nilsen wrote: > This change improves the efficiency of cleaning up (recycling) regions that have been trashed by GC effort. The affected code runs at the end of FinalMark to reclaim immediate garbage. It runs at the end of FinalUpdateRefs to reclaim the regions that comprised the collection set, from which all live objects have now been evacuated. > > Efficiency improvements include: > 1. Rather than invoking the os (while holding the Heap lock) to obtain the time twice for every region recycled, we invoke the os only once for each batch of 32 regions that are to be processed. > 2. Rather than enforcing that the loop runs no longer than 30 us, we refrain from starting a second batch of regions if more than 8 us has passed since the preceding batch was processed. > > Below, each trial runs for 1 hour, processing 28,000 transactions per second. 
> > Without this change, latency for 4 un-named business services is represented by the following chart: > ![image](https://github.com/user-attachments/assets/0e36025b-7b76-4e7a-ab07-303ea49479c3) > > With this change, latency for the same services is much better: > ![image](https://github.com/user-attachments/assets/aceaf185-6944-4c91-b98e-06ccd1bc2d64) > > A comparison of the two is provided by the following: > ![image](https://github.com/user-attachments/assets/7145f7b5-2a65-44b0-a94a-ddbc871f236b) I've got a bit more information about the differences in behavior between no-batch trial 1 and trial 2: 1. Note that trial 2 has much worse p9999 latency than trial 1. 2. The difference is NOT safepoint behavior. Trial 1 actually had more safepoints that lasted longer than 1 ms, with the longest lasting 5.658220 ms. The longest safepoint in trial 2 was 3.420009 ms. 3. There is evidence to suggest that the difference stems from concurrent cleanup: trial 1 had 1 concurrent cleanup event taking more than 1 ms, with a time of 1.142 ms, and an average cleanup time of 85.1 us; trial 3 had 3 concurrent cleanup events taking more than 1 ms, with a max of 1.377 ms, and an average cleanup time of 85.8 us. 4. For comparison, the three runs with this fix had an average concurrent cleanup event time of 69.8 us. Qualitative assessment: This fix allows concurrent cleanup to take 18.3% less time on average. This means it is less likely to collide with a mutator thread when accessing the shared heap lock. When a collision does occur, it is resolved more quickly, allowing the mutator to proceed in no more than 8 us plus the time to process one batch of 32 regions, rather than having to wait up to 30 us.
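Assuming the 18.3% figure compares the runs with this fix against the mean of the two no-batch averages quoted in point 3 (the comment does not state the baseline explicitly), the arithmetic checks out:

```latex
\bar{t}_{\text{no-batch}} = \frac{85.1 + 85.8}{2} = 85.45\ \mu\text{s},
\qquad
1 - \frac{69.8}{85.45} \approx 1 - 0.817 = 0.183 \;(18.3\%)
```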
------------- PR Comment: https://git.openjdk.org/jdk/pull/21211#issuecomment-2389135297 From coleenp at openjdk.org Wed Oct 2 17:37:54 2024 From: coleenp at openjdk.org (Coleen Phillimore) Date: Wed, 2 Oct 2024 17:37:54 GMT Subject: RFR: 8305895: Implement JEP 450: Compact Object Headers (Experimental) [v29] In-Reply-To: References:

Message-ID: On Wed, 2 Oct 2024 15:37:40 GMT, Roman Kennke wrote: >> This is the main body of the JEP 450: Compact Object Headers (Experimental). >> >> It is also a follow-up to #20640, which now also includes (and supersedes) #20603 and #20605, plus the Tiny Class-Pointers parts that have been previously missing. >> >> Main changes: >> - Introduction of the (experimental) flag UseCompactObjectHeaders. All changes in this PR are protected by this flag. The purpose of the flag is to provide a fallback, in case that users unexpectedly observe problems with the new implementation. The intention is that this flag will remain experimental and opt-in for at least one release, then make it on-by-default and diagnostic (?), and eventually deprecate and obsolete it. However, there are a few unknowns in that plan, specifically, we may want to further improve compact headers to 4 bytes, we are planning to enhance the Klass* encoding to support a virtually unlimited number of Klasses, at which point we could also obsolete UseCompressedClassPointers. >> - The compressed Klass* can now be stored in the mark-word of objects. In order to be able to do this, we make some changes to GC forwarding (see below) to protect the relevant (upper 22) bits of the mark-word. Significant parts of this PR deal with loading the compressed Klass* from the mark-word. This PR also changes some code paths (mostly in GCs) to be more careful when accessing Klass* (or mark-word or size) to be able to fetch it from the forwardee in case the object is forwarded. >> - Self-forwarding in GCs (which is used to deal with promotion failure) now uses a bit to indicate 'self-forwarding'. This is needed to preserve the crucial Klass* bits in the header. This also allows us to get rid of preserved-header machinery in SerialGC and G1 (Parallel GC abuses preserved-marks to also find all other relevant oops). >> - Full GC forwarding now uses an encoding similar to compressed-oops.
We have 40 bits for that, and can encode up to 8TB of heap. When exceeding 8TB, we turn off UseCompressedClassPointers (except in ZGC, which doesn't use the GC forwarding at all). >> - Instances can now have their base-offset (the offset where the field layouter starts to place fields) at offset 8 (instead of 12 or 16). >> - Arrays will now store their length at offset 8. >> - CDS can now write and read archives with the compressed header. However, it is not possible to read an archive that has been written with an opposite setting of UseCompactObjectHeaders. Some build machinery is added so that _co... > > Roman Kennke has updated the pull request incrementally with three additional commits since the last revision: > > - Revert "Disable TestSplitPacks::test4a, failing on aarch64" > > This reverts commit 059b1573db26d1d2902ca6dadc8413f445234c2a. > - Simplify object init code in interpreter > - Disable some vectorization tests that fail with +UCOH and UseSSE<=3 Thanks for making this change. I've reviewed runtime, oops and metaspace code. It looks good. ------------- Marked as reviewed by coleenp (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/20677#pullrequestreview-2343632318 From wkemper at openjdk.org Wed Oct 2 18:26:35 2024 From: wkemper at openjdk.org (William Kemper) Date: Wed, 2 Oct 2024 18:26:35 GMT Subject: RFR: 8341379: Shenandoah: Improve lock contention during cleanup In-Reply-To: <5GMjFWaMjkSuxhPJ8TYJyw_LYdFfhJWrySSaRUjSNq4=.0ee7f896-35b2-43aa-a243-418b1ffb50b0@github.com> References: <5GMjFWaMjkSuxhPJ8TYJyw_LYdFfhJWrySSaRUjSNq4=.0ee7f896-35b2-43aa-a243-418b1ffb50b0@github.com> Message-ID: On Thu, 26 Sep 2024 21:08:11 GMT, Kelvin Nilsen wrote: > This change improves the efficiency of cleaning up (recycling) regions that have been trashed by GC effort. The affected code runs at the end of FinalMark to reclaim immediate garbage. 
It runs at the end of FinalUpdateRefs to reclaim the regions that comprised the collection set, from which all live objects have now been evacuated. > > Efficiency improvements include: > 1. Rather than invoking the os (while holding the Heap lock) to obtain the time twice for every region recycled, we invoke the os only once for each batch of 32 regions that are to be processed. > 2. Rather than enforcing that the loop runs no longer than 30 us, we refrain from starting a second batch of regions if more than 8 us has passed since the preceding batch was processed. > > Below, each trial runs for 1 hour, processing 28,000 transactions per second. > > Without this change, latency for 4 un-named business services is represented by the following chart: > ![image](https://github.com/user-attachments/assets/0e36025b-7b76-4e7a-ab07-303ea49479c3) > > With this change, latency for the same services is much better: > ![image](https://github.com/user-attachments/assets/aceaf185-6944-4c91-b98e-06ccd1bc2d64) > > A comparison of the two is provided by the following: > ![image](https://github.com/user-attachments/assets/7145f7b5-2a65-44b0-a94a-ddbc871f236b) Looks good. Appreciate the comprehensive analysis. ------------- Marked as reviewed by wkemper (Committer). PR Review: https://git.openjdk.org/jdk/pull/21211#pullrequestreview-2343780503 From rcastanedalo at openjdk.org Wed Oct 2 19:43:50 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Wed, 2 Oct 2024 19:43:50 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v27] In-Reply-To: References:

<7L7jYDlFa0WnVvgiyNHI9KZrcffYwNnBB899AuMS56Q=.40b031e7-07b8-4a15-b319-c53b38a17a49@github.com> Message-ID: On Wed, 2 Oct 2024 12:55:13 GMT, Hamlin Li wrote: >> It could, but the effect would be the same (see [JDK-8058880](https://bugs.openjdk.org/browse/JDK-8058880)). I went with `TEMP` for the x64 and aarch64 platforms for consistency with the analogous ZGC ADL code, see e.g. https://github.com/openjdk/jdk/blob/855c8a7def21025bc2fc47594f7285a55924c213/src/hotspot/cpu/aarch64/gc/z/z_aarch64.ad#L182-L204. > > I saw the riscv one in z_riscv.ad is: `effect(TEMP oldval_tmp, TEMP newval_tmp, TEMP tmp1, TEMP_DEF res);`, maybe it's good to change riscv one? I suggest to postpone these types of refactorings to follow-up enhancements, given that the pull request in its current form is stable, thoroughly tested, and approved by reviewers. I intend to integrate it within the following 24 hours, provided final test results look good. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1785135652 From xpeng at openjdk.org Wed Oct 2 19:47:34 2024 From: xpeng at openjdk.org (Xiaolong Peng) Date: Wed, 2 Oct 2024 19:47:34 GMT Subject: RFR: 8341379: Shenandoah: Improve lock contention during cleanup In-Reply-To: <5GMjFWaMjkSuxhPJ8TYJyw_LYdFfhJWrySSaRUjSNq4=.0ee7f896-35b2-43aa-a243-418b1ffb50b0@github.com> References: <5GMjFWaMjkSuxhPJ8TYJyw_LYdFfhJWrySSaRUjSNq4=.0ee7f896-35b2-43aa-a243-418b1ffb50b0@github.com> Message-ID: On Thu, 26 Sep 2024 21:08:11 GMT, Kelvin Nilsen wrote: > This change improves the efficiency of cleaning up (recycling) regions that have been trashed by GC effort. The affected code runs at the end of FinalMark to reclaim immediate garbage. It runs at the end of FinalUpdateRefs to reclaim the regions that comprised the collection set, from which all live objects have now been evacuated. > > Efficiency improvements include: > 1. 
Rather than invoking the os (while holding the Heap lock) to obtain the time twice for every region recycled, we invoke the os only once for each batch of 32 regions that are to be processed. > 2. Rather than enforcing that the loop runs no longer than 30 us, we refrain from starting a second batch of regions if more than 8 us has passed since the preceding batch was processed. > > Below, each trial runs for 1 hour, processing 28,000 transactions per second. > > Without this change, latency for 4 un-named business services is represented by the following chart: > ![image](https://github.com/user-attachments/assets/0e36025b-7b76-4e7a-ab07-303ea49479c3) > > With this change, latency for the same services is much better: > ![image](https://github.com/user-attachments/assets/aceaf185-6944-4c91-b98e-06ccd1bc2d64) > > A comparison of the two is provided by the following: > ![image](https://github.com/user-attachments/assets/7145f7b5-2a65-44b0-a94a-ddbc871f236b) Marked as reviewed by xpeng (Author). Looks good to me, thanks for the detailed analysis! My understanding of the major benefits of this PR: 1. Minimize the cost of calls to os::javaTimeNanos(). 2. Do not start the next batch if the heap lock has already been held for more than 8 us when the current batch finishes.
Maybe we should calculate the next batch size based on the remaining time and the observed speed after each batch. The assumption of 200 ns/region is unlikely to hold on all hardware; in the worst case, it may hold the lock for up to 15-16 us. ------------- PR Review: https://git.openjdk.org/jdk/pull/21211#pullrequestreview-2343932607 PR Comment: https://git.openjdk.org/jdk/pull/21211#issuecomment-2389542138 From kdnilsen at openjdk.org Wed Oct 2 19:55:35 2024 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Wed, 2 Oct 2024 19:55:35 GMT Subject: RFR: 8341379: Shenandoah: Improve lock contention during cleanup In-Reply-To: <5GMjFWaMjkSuxhPJ8TYJyw_LYdFfhJWrySSaRUjSNq4=.0ee7f896-35b2-43aa-a243-418b1ffb50b0@github.com> References: <5GMjFWaMjkSuxhPJ8TYJyw_LYdFfhJWrySSaRUjSNq4=.0ee7f896-35b2-43aa-a243-418b1ffb50b0@github.com> Message-ID: On Thu, 26 Sep 2024 21:08:11 GMT, Kelvin Nilsen wrote: > This change improves the efficiency of cleaning up (recycling) regions that have been trashed by GC effort. The affected code runs at the end of FinalMark to reclaim immediate garbage. It runs at the end of FinalUpdateRefs to reclaim the regions that comprised the collection set, from which all live objects have now been evacuated. > > Efficiency improvements include: > 1. Rather than invoking the os (while holding the Heap lock) to obtain the time twice for every region recycled, we invoke the os only once for each batch of 32 regions that are to be processed. > 2. Rather than enforcing that the loop runs no longer than 30 us, we refrain from starting a second batch of regions if more than 8 us has passed since the preceding batch was processed. > > Below, each trial runs for 1 hour, processing 28,000 transactions per second.
> > Without this change, latency for 4 un-named business services is represented by the following chart: > ![image](https://github.com/user-attachments/assets/0e36025b-7b76-4e7a-ab07-303ea49479c3) > > With this change, latency for the same services is much better: > ![image](https://github.com/user-attachments/assets/aceaf185-6944-4c91-b98e-06ccd1bc2d64) > > A comparison of the two is provided by the following: > ![image](https://github.com/user-attachments/assets/7145f7b5-2a65-44b0-a94a-ddbc871f236b) It's hard to "predict" the cost of the next batch. It depends on how many of the regions within the batch might have already been recycled by on-demand actions of mutator threads. In general, it seems the cost of a batch is typically less than 4 us: I gathered some stats with instrumentation (no longer present) showing that the typical number of batches processed between yields is 3. The first two batches, including the time to yield and acquire the lock, must have completed in less than 8 us, or we would not have allowed ourselves to start another batch. True, different hardware might take more or less time to process a batch, but I don't think the behavior will be too sensitive to this. If a batch takes less than 2.66 us, then we might get an average of 4 batches processed between yields. If it takes longer than 8 us, we'll only get 1 batch processed between yields. But we'll still be making good progress on recycling the trashed regions while remaining very responsive to mutators that need to access the heap lock. Thanks for the reviews.
------------- PR Comment: https://git.openjdk.org/jdk/pull/21211#issuecomment-2389557309 PR Comment: https://git.openjdk.org/jdk/pull/21211#issuecomment-2389559197 From phh at openjdk.org Wed Oct 2 19:56:36 2024 From: phh at openjdk.org (Paul Hohensee) Date: Wed, 2 Oct 2024 19:56:36 GMT Subject: RFR: 8332697: ubsan: shenandoahSimpleBitMap.inline.hpp:68:23: runtime error: signed integer overflow: -9223372036854775808 - 1 cannot be represented in type 'long int' [v4] In-Reply-To: References:

Message-ID: <55A-Hu86kNPGpQ24KiuIeefJ8ewqt2AJQXt-rYZTxbA=.6e61a005-4511-41fb-ace4-84fec3875cf9@github.com> On Tue, 1 Oct 2024 21:44:48 GMT, William Kemper wrote: >> Use a template version of `right_n_bits` to use the same type for minuend and subtrahend. > > William Kemper has updated the pull request incrementally with one additional commit since the last revision: > > Inline unnecessary usages of right_n_bits Thanks. ------------- Marked as reviewed by kdnilsen (Author). PR Review: https://git.openjdk.org/jdk/pull/21236#pullrequestreview-2343969312 From ysr at openjdk.org Wed Oct 2 21:03:36 2024 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Wed, 2 Oct 2024 21:03:36 GMT Subject: RFR: 8332697: ubsan: shenandoahSimpleBitMap.inline.hpp:68:23: runtime error: signed integer overflow: -9223372036854775808 - 1 cannot be represented in type 'long int' [v4] In-Reply-To: References:

Message-ID: On Wed, 2 Oct 2024 21:01:20 GMT, Y. Srinivas Ramakrishna wrote: >> William Kemper has updated the pull request incrementally with one additional commit since the last revision: >> >> Inline unnecessary usages of right_n_bits > > src/hotspot/share/gc/shenandoah/shenandoahSimpleBitMap.inline.hpp line 33: > >> 31: if (bit_number >= BitsPerWord) { >> 32: return -1; >> 33: } > > When would we call here with `bit_number >= BitsPerWord`? If never, maybe we should assert that? It's called with `bit_number == BitsPerWord` (I tried an assertion first). ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/21236#discussion_r1785256564 From phh at openjdk.org Wed Oct 2 21:16:37 2024 From: phh at openjdk.org (Paul Hohensee) Date: Wed, 2 Oct 2024 21:16:37 GMT Subject: RFR: 8341379: Shenandoah: Improve lock contention during cleanup In-Reply-To: <5GMjFWaMjkSuxhPJ8TYJyw_LYdFfhJWrySSaRUjSNq4=.0ee7f896-35b2-43aa-a243-418b1ffb50b0@github.com> References: <5GMjFWaMjkSuxhPJ8TYJyw_LYdFfhJWrySSaRUjSNq4=.0ee7f896-35b2-43aa-a243-418b1ffb50b0@github.com> Message-ID: On Thu, 26 Sep 2024 21:08:11 GMT, Kelvin Nilsen wrote: > This change improves the efficiency of cleaning up (recycling) regions that have been trashed by GC effort. The affected code runs at the end of FinalMark to reclaim immediate garbage. It runs at the end of FinalUpdateRefs to reclaim the regions that comprised the collection set, from which all live objects have now been evacuated. > > Efficiency improvements include: > 1. Rather than invoking the os (while holding the Heap lock) to obtain the time twice for every region recycled, we invoke the os only once for each batch of 32 regions that are to be processed. > 2. Rather than enforcing that the loop runs no longer than 30 us, we refrain from starting a second batch of regions if more than 8 us has passed since the preceding batch was processed. > > Below, each trial runs for 1 hour, processing 28,000 transactions per second.
> > Without this change, latency for 4 un-named business services is represented by the following chart: > ![image](https://github.com/user-attachments/assets/0e36025b-7b76-4e7a-ab07-303ea49479c3) > > With this change, latency for the same services is much better: > ![image](https://github.com/user-attachments/assets/aceaf185-6944-4c91-b98e-06ccd1bc2d64) > > A comparison of the two is provided by the following: > ![image](https://github.com/user-attachments/assets/7145f7b5-2a65-44b0-a94a-ddbc871f236b) I worry that we're hard-coding assumed task durations. I'm ok with this PR as it is, but I suggest we add a facility to GC initialization that does dummy tasks such as this in order to get somewhat-realistic times for use in this kind of situation. ------------- Marked as reviewed by phh (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/21211#pullrequestreview-2344117544 From xpeng at openjdk.org Wed Oct 2 21:29:35 2024 From: xpeng at openjdk.org (Xiaolong Peng) Date: Wed, 2 Oct 2024 21:29:35 GMT Subject: RFR: 8341379: Shenandoah: Improve lock contention during cleanup In-Reply-To: <5GMjFWaMjkSuxhPJ8TYJyw_LYdFfhJWrySSaRUjSNq4=.0ee7f896-35b2-43aa-a243-418b1ffb50b0@github.com> References: <5GMjFWaMjkSuxhPJ8TYJyw_LYdFfhJWrySSaRUjSNq4=.0ee7f896-35b2-43aa-a243-418b1ffb50b0@github.com> Message-ID: On Thu, 26 Sep 2024 21:08:11 GMT, Kelvin Nilsen wrote: > This change improves the efficiency of cleaning up (recycling) regions that have been trashed by GC effort. The affected code runs at the end of FinalMark to reclaim immediate garbage. It runs at the end of FinalUpdateRefs to reclaim the regions that comprised the collection set, from which all live objects have now been evacuated. > > Efficiency improvements include: > 1. Rather than invoking the os (while holding the Heap lock) to obtain the time twice for every region recycled, we invoke the os only once for each batch of 32 regions that are to be processed. > 2. 
Rather than enforcing that the loop runs no longer than 30 us, we refrain from starting a second batch of regions if more than 8 us has passed since the preceding batch was processed. > > Below, each trial runs for 1 hour, processing 28,000 transactions per second. > > Without this change, latency for 4 un-named business services is represented by the following chart: > ![image](https://github.com/user-attachments/assets/0e36025b-7b76-4e7a-ab07-303ea49479c3) > > With this change, latency for the same services is much better: > ![image](https://github.com/user-attachments/assets/aceaf185-6944-4c91-b98e-06ccd1bc2d64) > > A comparison of the two is provided by the following: > ![image](https://github.com/user-attachments/assets/7145f7b5-2a65-44b0-a94a-ddbc871f236b) src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 935: > 933: // Avoid another call to javaTimeNanos() if we already know time at which last batch ended > 934: batch_start_time = batch_end_time; > 935: const jlong deadline = batch_start_time + deadline_ns; Nit: Maybe before taking the heap lock? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/21211#discussion_r1785267490 From sviswanathan at openjdk.org Wed Oct 2 21:31:52 2024 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Wed, 2 Oct 2024 21:31:52 GMT Subject: RFR: 8305895: Implement JEP 450: Compact Object Headers (Experimental) [v26] In-Reply-To: References:

<6PTWMepIDuZDdPfN3xNKV1vqUyO_R4yCSeiSTpYIyyQ=.61a5b462-7114-4385-a6d7-40e5c7b0005d@github.com>

Message-ID: On Mon, 30 Sep 2024 17:48:13 GMT, Roman Kennke wrote: >> Wait a second, I've probably not been clear. `UseCompactObjectHeaders` is slated to become *on by default* and then slated to go away. That means that array base offsets <= 16 bytes will become the default. The generated code will be something like: >> >> >> if (haystack_len <= 8) { >> // Copy 8 bytes onto stack >> } else if (haystack_len <= 16) { >> // Copy 16 bytes onto stack >> } else { >> // Copy 32 bytes onto stack >> } >> >> >> So that is 2 branches in this prologue code instead of originally 1. >> >> However, I just noticed that what I proposed is not enough. Consider what happens when haystack_len is 17. This would take the last case and copy 32 bytes. But we only have 17+8=25 bytes that we can guarantee to be available for copying. If this happens to be the array at the very beginning of the heap (very rare/unlikely), this would segfault. >> >> I think I need to mull over it some more to come up with a correct fix. > > I changed the header<16 version to be a small loop: https://github.com/rkennke/jdk/commit/bcba264ea5c15581647933db1163ca1dae39b6c5 > > The idea is the same as before, except it's made as a small loop with a maximum of 4 iterations (backward-branches), and it copies 8 bytes at a time, such that 1. it may copy up to 7 bytes that precede the array and 2. doesn't run over the end of the array (which would potentially crash). > > I am not sure if using XMM_TMP1 and XMM_TMP2 there is ok, or if it would encode better to use one of the regular registers? > > Also, this new implementation could simply replace the old one, instead of being an alternative. I am not sure if it would make any difference performance-wise. @rkennke It looks to me like the small loop will run past the end of the array. Say haystack_len is 7: the index below would be 0 after the shrq instruction, and the movq(XMM_TMP1, Address(haystack, index, Address::times_8)) in the loop will read 8 bytes, i.e.
one byte past the end of the array:

// num_words (zero-based) = (haystack_len - 1) / 8;
__ movq(index, haystack_len);
__ subq(index, 1);
__ shrq(index, LogBytesPerWord);
__ bind(L_loop);
__ movq(XMM_TMP1, Address(haystack, index, Address::times_8));
__ movq(Address(rsp, index, Address::times_8), XMM_TMP1);
__ subq(index, 1);
__ jcc(Assembler::positive, L_loop);

------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20677#discussion_r1785269849 From wkemper at openjdk.org Wed Oct 2 22:56:42 2024 From: wkemper at openjdk.org (William Kemper) Date: Wed, 2 Oct 2024 22:56:42 GMT Subject: Integrated: 8332697: ubsan: shenandoahSimpleBitMap.inline.hpp:68:23: runtime error: signed integer overflow: -9223372036854775808 - 1 cannot be represented in type 'long int' In-Reply-To: References: Message-ID: On Fri, 27 Sep 2024 21:29:37 GMT, William Kemper wrote: > Use a template version of `right_n_bits` to use the same type for minuend and subtrahend. This pull request has now been integrated.
Changeset: 57c1db58 Author: William Kemper URL: https://git.openjdk.org/jdk/commit/57c1db5843db5f2c864318f3234767f436a836e3 Stats: 34 lines in 3 files changed: 7 ins; 0 del; 27 mod 8332697: ubsan: shenandoahSimpleBitMap.inline.hpp:68:23: runtime error: signed integer overflow: -9223372036854775808 - 1 cannot be represented in type 'long int' Reviewed-by: phh, kdnilsen ------------- PR: https://git.openjdk.org/jdk/pull/21236 From kdnilsen at openjdk.org Wed Oct 2 23:09:36 2024 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Wed, 2 Oct 2024 23:09:36 GMT Subject: RFR: 8341379: Shenandoah: Improve lock contention during cleanup In-Reply-To: <5GMjFWaMjkSuxhPJ8TYJyw_LYdFfhJWrySSaRUjSNq4=.0ee7f896-35b2-43aa-a243-418b1ffb50b0@github.com> References: <5GMjFWaMjkSuxhPJ8TYJyw_LYdFfhJWrySSaRUjSNq4=.0ee7f896-35b2-43aa-a243-418b1ffb50b0@github.com> Message-ID: On Thu, 26 Sep 2024 21:08:11 GMT, Kelvin Nilsen wrote: > This change improves the efficiency of cleaning up (recycling) regions that have been trashed by GC effort. The affected code runs at the end of FinalMark to reclaim immediate garbage. It runs at the end of FinalUpdateRefs to reclaim the regions that comprised the collection set, from which all live objects have now been evacuated. > > Efficiency improvements include: > 1. Rather than invoking the os (while holding the Heap lock) to obtain the time twice for every region recycled, we invoke the os only once for each batch of 32 regions that are to be processed. > 2. Rather than enforcing that the loop runs no longer than 30 us, we refrain from starting a second batch of regions if more than 8 us has passed since the preceding batch was processed. > > Below, each trial runs for 1 hour, processing 28,000 transactions per second. 
> > Without this change, latency for 4 un-named business services is represented by the following chart: > ![image](https://github.com/user-attachments/assets/0e36025b-7b76-4e7a-ab07-303ea49479c3) > > With this change, latency for the same services is much better: > ![image](https://github.com/user-attachments/assets/aceaf185-6944-4c91-b98e-06ccd1bc2d64) > > A comparison of the two is provided by the following: > ![image](https://github.com/user-attachments/assets/7145f7b5-2a65-44b0-a94a-ddbc871f236b) I'll modify the code to adjust for current localized behavior of the host computer. (e.g. not hard-code assumptions about task durations.) ------------- PR Comment: https://git.openjdk.org/jdk/pull/21211#issuecomment-2389933286 From kdnilsen at openjdk.org Wed Oct 2 23:09:37 2024 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Wed, 2 Oct 2024 23:09:37 GMT Subject: RFR: 8341379: Shenandoah: Improve lock contention during cleanup In-Reply-To: References: <5GMjFWaMjkSuxhPJ8TYJyw_LYdFfhJWrySSaRUjSNq4=.0ee7f896-35b2-43aa-a243-418b1ffb50b0@github.com> Message-ID: On Wed, 2 Oct 2024 21:26:28 GMT, Xiaolong Peng wrote: >> This change improves the efficiency of cleaning up (recycling) regions that have been trashed by GC effort. The affected code runs at the end of FinalMark to reclaim immediate garbage. It runs at the end of FinalUpdateRefs to reclaim the regions that comprised the collection set, from which all live objects have now been evacuated. >> >> Efficiency improvements include: >> 1. Rather than invoking the os (while holding the Heap lock) to obtain the time twice for every region recycled, we invoke the os only once for each batch of 32 regions that are to be processed. >> 2. Rather than enforcing that the loop runs no longer than 30 us, we refrain from starting a second batch of regions if more than 8 us has passed since the preceding batch was processed. >> >> Below, each trial runs for 1 hour, processing 28,000 transactions per second. 
>> >> Without this change, latency for 4 un-named business services is represented by the following chart: >> ![image](https://github.com/user-attachments/assets/0e36025b-7b76-4e7a-ab07-303ea49479c3) >> >> With this change, latency for the same services is much better: >> ![image](https://github.com/user-attachments/assets/aceaf185-6944-4c91-b98e-06ccd1bc2d64) >> >> A comparison of the two is provided by the following: >> ![image](https://github.com/user-attachments/assets/7145f7b5-2a65-44b0-a94a-ddbc871f236b) > > src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 935: > >> 933: // Avoid another call to javaTimeNanos() if we already know time at which last batch ended >> 934: batch_start_time = batch_end_time; >> 935: const jlong deadline = batch_start_time + deadline_ns; > > Nit: Maybe before taking the heap lock? I'll adjust this. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/21211#discussion_r1785368950 From mli at openjdk.org Thu Oct 3 06:50:48 2024 From: mli at openjdk.org (Hamlin Li) Date: Thu, 3 Oct 2024 06:50:48 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v27] In-Reply-To: References:

<7L7jYDlFa0WnVvgiyNHI9KZrcffYwNnBB899AuMS56Q=.40b031e7-07b8-4a15-b319-c53b38a17a49@github.com>

Message-ID: <4S2raWNwXSaEN1p2bAXEUKlHdqSY9AqrR7cBZDhs2QI=.e6ecddb3-be2b-4bda-88ac-8cd9fcb1301b@github.com> On Wed, 2 Oct 2024 19:41:26 GMT, Roberto Castañeda Lozano wrote: >> I saw the riscv one in z_riscv.ad is: `effect(TEMP oldval_tmp, TEMP newval_tmp, TEMP tmp1, TEMP_DEF res);`, maybe it's good to change riscv one? > > I suggest to postpone these types of refactorings to follow-up enhancements, given that the pull request in its current form is stable, thoroughly tested, and approved by reviewers. I intend to integrate it within the following 24 hours, provided final test results look good. Sounds good too. Thanks! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1785711504 From rcastanedalo at openjdk.org Thu Oct 3 08:35:54 2024 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Thu, 3 Oct 2024 08:35:54 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v27] In-Reply-To: References:

Message-ID: On Mon, 30 Sep 2024 05:02:12 GMT, Roberto Castañeda Lozano wrote: >> This changeset implements JEP 475 (Late Barrier Expansion for G1), including support for the x64 and aarch64 platforms. See the [JEP description](https://openjdk.org/jeps/475) for further detail. >> >> We aim to integrate this work in JDK 24. The purpose of this pull request is twofold: >> >> - to allow maintainers of the arm (32-bit), ppc, riscv, s390, and x86 (32-bit) ports to contribute a port of these platforms in time for JDK 24; and >> - to allow reviewers to review the platform-independent, x64 and aarch64, and test changes in parallel with the porting work. >> >> ## Summary of the Changes >> >> ### Platform-Independent Changes (`src/hotspot/share`) >> >> These consist mainly of: >> >> - a complete rewrite of `G1BarrierSetC2`, to instruct C2 to expand G1 barriers late instead of early; >> - a few minor changes to C2 itself, to support removal of redundant decompression operations and to address an OopMap construction issue triggered by this JEP's increased usage of ADL `TEMP` operands; and >> - temporary support for porting the JEP to the remaining platforms. >> >> The temporary support code (guarded by the pre-processor flag `G1_LATE_BARRIER_MIGRATION_SUPPORT`) will **not** be part of the final pull request, and hence does not need to be reviewed. >> >> ### Platform-Dependent Changes (`src/hotspot/cpu`) >> >> These include changes to the ADL instruction definitions and the `G1BarrierSetAssembler` class of the x64 and aarch64 platforms. >> >> #### ADL Changes >> >> The changeset uses ADL predicates to force C2 to implement memory accesses tagged with barrier information using G1-specific, barrier-aware instruction versions (e.g. `g1StoreP` instead of the GC-agnostic `storeP`).
These new instruction versions generate machine code according to the corresponding tagged barrier information, relying on the G1 barrier implementations provided by the `G1BarrierSetAssembler` class. In the aarch64 platform, the bulk of the ADL code is generated from a higher-level version using m4, to reduce redundancy. >> >> #### `G1BarrierSetAssembler` Changes >> >> Both platforms basically reuse the barrier implementation for the bytecode interpreter, with the different barrier tests and operations refactored into dedicated functions. Besides this, `G1BarrierSetAssembler` is extended with assembly-stub routines that implement the out-of-line, slow path of the barriers. These routines include calls from the barrier into the JVM, which require support for saving and restoring live ... > > Roberto Castañeda Lozano has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 53 additional commits since the last revision: > > - Merge remote-tracking branch 'feilongjiang/JEP-475-RISC-V' into JDK-8334060-g1-late-barrier-expansion > - riscv port refactor > - Remove temporary support code > - Merge jdk-24+17 > - Relax 'must match' assertion in ppc's g1StoreN after limiting pinning bypass optimization > - Remove unnecessary reg-to-reg moves in aarch64's g1CompareAndX instructions > - Reintroduce matcher assert and limit pinning bypass optimization to non-shared EncodeP nodes > - Merge jdk-24+16 > - Ensure that detected encode-and-store patterns are matched > - Merge remote-tracking branch 'snazarkin/arm32-JDK-8334060-g1-late-barrier-expansion' into JDK-8334060-g1-late-barrier-expansion > - ... and 43 more: https://git.openjdk.org/jdk/compare/0cf6df31...14483b83 Thanks to everyone who contributed to this JEP, integrating now.
------------- PR Comment: https://git.openjdk.org/jdk/pull/19746#issuecomment-2390833194 From rcastanedalo at openjdk.org Thu Oct 3 08:39:57 2024 From: rcastanedalo at openjdk.org (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Thu, 3 Oct 2024 08:39:57 GMT Subject: Integrated: 8334060: Implementation of Late Barrier Expansion for G1 In-Reply-To: References: Message-ID: On Mon, 17 Jun 2024 09:49:25 GMT, Roberto Castañeda Lozano wrote: > This changeset implements JEP 475 (Late Barrier Expansion for G1), including support for the x64 and aarch64 platforms. See the [JEP description](https://openjdk.org/jeps/475) for further detail. > > We aim to integrate this work in JDK 24. The purpose of this pull request is twofold: > > - to allow maintainers of the arm (32-bit), ppc, riscv, s390, and x86 (32-bit) ports to contribute a port of these platforms in time for JDK 24; and > - to allow reviewers to review the platform-independent, x64 and aarch64, and test changes in parallel with the porting work. > > ## Summary of the Changes > > ### Platform-Independent Changes (`src/hotspot/share`) > > These consist mainly of: > > - a complete rewrite of `G1BarrierSetC2`, to instruct C2 to expand G1 barriers late instead of early; > - a few minor changes to C2 itself, to support removal of redundant decompression operations and to address an OopMap construction issue triggered by this JEP's increased usage of ADL `TEMP` operands; and > - temporary support for porting the JEP to the remaining platforms. > > The temporary support code (guarded by the pre-processor flag `G1_LATE_BARRIER_MIGRATION_SUPPORT`) will **not** be part of the final pull request, and hence does not need to be reviewed. > > ### Platform-Dependent Changes (`src/hotspot/cpu`) > > These include changes to the ADL instruction definitions and the `G1BarrierSetAssembler` class of the x64 and aarch64 platforms.
> > #### ADL Changes > > The changeset uses ADL predicates to force C2 to implement memory accesses tagged with barrier information using G1-specific, barrier-aware instruction versions (e.g. `g1StoreP` instead of the GC-agnostic `storeP`). These new instruction versions generate machine code according to the corresponding tagged barrier information, relying on the G1 barrier implementations provided by the `G1BarrierSetAssembler` class. In the aarch64 platform, the bulk of the ADL code is generated from a higher-level version using m4, to reduce redundancy. > > #### `G1BarrierSetAssembler` Changes > > Both platforms basically reuse the barrier implementation for the bytecode interpreter, with the different barrier tests and operations refactored into dedicated functions. Besides this, `G1BarrierSetAssembler` is extended with assembly-stub routines that implement the out-of-line, slow path of the barriers. These routines include calls from the barrier into the JVM, which require support for saving and restoring live registers, provided by the `SaveLiveRegisters` class. This c... This pull request has now been integrated.
Changeset: 0b467e90 Author: Roberto Castañeda Lozano URL: https://git.openjdk.org/jdk/commit/0b467e902d591ae9feeec1669918d1588987cd1c Stats: 7372 lines in 58 files changed: 5924 ins; 985 del; 463 mod 8334060: Implementation of Late Barrier Expansion for G1 Co-authored-by: Roberto Castañeda Lozano Co-authored-by: Erik Österlund Co-authored-by: Siyao Liu Co-authored-by: Kim Barrett Co-authored-by: Amit Kumar Co-authored-by: Martin Doerr Co-authored-by: Feilong Jiang Co-authored-by: Sergey Nazarkin Reviewed-by: kvn, tschatzl, fyang, ayang, kbarrett ------------- PR: https://git.openjdk.org/jdk/pull/19746 From kdnilsen at openjdk.org Thu Oct 3 16:19:53 2024 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Thu, 3 Oct 2024 16:19:53 GMT Subject: RFR: 8341379: Shenandoah: Improve lock contention during cleanup [v2] In-Reply-To: <5GMjFWaMjkSuxhPJ8TYJyw_LYdFfhJWrySSaRUjSNq4=.0ee7f896-35b2-43aa-a243-418b1ffb50b0@github.com> References: <5GMjFWaMjkSuxhPJ8TYJyw_LYdFfhJWrySSaRUjSNq4=.0ee7f896-35b2-43aa-a243-418b1ffb50b0@github.com> Message-ID: On Thu, 26 Sep 2024 21:08:11 GMT, Kelvin Nilsen wrote: > This change improves the efficiency of cleaning up (recycling) regions that have been trashed by GC effort. The affected code runs at the end of FinalMark to reclaim immediate garbage. It runs at the end of FinalUpdateRefs to reclaim the regions that comprised the collection set, from which all live objects have now been evacuated. > > Efficiency improvements include: > 1. Rather than invoking the os (while holding the Heap lock) to obtain the time twice for every region recycled, we invoke the os only once for each batch of 32 regions that are to be processed. > 2. Rather than enforcing that the loop runs no longer than 30 us, we refrain from starting a second batch of regions if more than 8 us has passed since the preceding batch was processed. > > Below, each trial runs for 1 hour, processing 28,000 transactions per second.
> > Without this change, latency for 4 un-named business services is represented by the following chart: > ![image](https://github.com/user-attachments/assets/0e36025b-7b76-4e7a-ab07-303ea49479c3) > > With this change, latency for the same services is much better: > ![image](https://github.com/user-attachments/assets/aceaf185-6944-4c91-b98e-06ccd1bc2d64) > > A comparison of the two is provided by the following: > ![image](https://github.com/user-attachments/assets/7145f7b5-2a65-44b0-a94a-ddbc871f236b) Kelvin Nilsen has updated the pull request incrementally with one additional commit since the last revision: Predict next batch time and enforce predictive deadline ------------- Changes: - all: https://git.openjdk.org/jdk/pull/21211/files - new: https://git.openjdk.org/jdk/pull/21211/files/acf517f5..c3f1b080 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=21211&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=21211&range=00-01 Stats: 20 lines in 1 file changed: 11 ins; 2 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/21211.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/21211/head:pull/21211 PR: https://git.openjdk.org/jdk/pull/21211 From xpeng at openjdk.org Thu Oct 3 16:51:37 2024 From: xpeng at openjdk.org (Xiaolong Peng) Date: Thu, 3 Oct 2024 16:51:37 GMT Subject: RFR: 8341379: Shenandoah: Improve lock contention during cleanup [v2] In-Reply-To: References: <5GMjFWaMjkSuxhPJ8TYJyw_LYdFfhJWrySSaRUjSNq4=.0ee7f896-35b2-43aa-a243-418b1ffb50b0@github.com> Message-ID: On Thu, 3 Oct 2024 16:19:53 GMT, Kelvin Nilsen wrote: >> This change improves the efficiency of cleaning up (recycling) regions that have been trashed by GC effort. The affected code runs at the end of FinalMark to reclaim immediate garbage. It runs at the end of FinalUpdateRefs to reclaim the regions that comprised the collection set, from which all live objects have now been evacuated. >> >> Efficiency improvements include: >> 1. 
Rather than invoking the os (while holding the Heap lock) to obtain the time twice for every region recycled, we invoke the os only once for each batch of 32 regions that are to be processed. >> 2. Rather than enforcing that the loop runs no longer than 30 us, we refrain from starting a second batch of regions if more than 8 us has passed since the preceding batch was processed. >> >> Below, each trial runs for 1 hour, processing 28,000 transactions per second. >> >> Without this change, latency for 4 un-named business services is represented by the following chart: >> ![image](https://github.com/user-attachments/assets/0e36025b-7b76-4e7a-ab07-303ea49479c3) >> >> With this change, latency for the same services is much better: >> ![image](https://github.com/user-attachments/assets/aceaf185-6944-4c91-b98e-06ccd1bc2d64) >> >> A comparison of the two is provided by the following: >> ![image](https://github.com/user-attachments/assets/7145f7b5-2a65-44b0-a94a-ddbc871f236b) > > Kelvin Nilsen has updated the pull request incrementally with one additional commit since the last revision: > > Predict next batch time and enforce predictive deadline Marked as reviewed by xpeng (Author). src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 957: > 955: batch_end_time = os::javaTimeNanos(); > 956: // Estimate includes historic combination of yield times and heap lock acquisition times. > 957: batch_process_time_estimate = (batch_end_time - recycle_trash_start_time) / total_batches;; Nit: double semicolons ------------- PR Review: https://git.openjdk.org/jdk/pull/21211#pullrequestreview-2346133650 PR Review Comment: https://git.openjdk.org/jdk/pull/21211#discussion_r1786541498 From shade at openjdk.org Thu Oct 3 17:15:03 2024 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 3 Oct 2024 17:15:03 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v27] In-Reply-To: References:

Message-ID: On Wed, 2 Oct 2024 15:37:40 GMT, Roman Kennke wrote: >> This is the main body of the JEP 450: Compact Object Headers (Experimental). >> >> It is also a follow-up to #20640, which now also includes (and supersedes) #20603 and #20605, plus the Tiny Class-Pointers parts that have been previously missing. >> >> Main changes: >> - Introduction of the (experimental) flag UseCompactObjectHeaders. All changes in this PR are protected by this flag. The purpose of the flag is to provide a fallback, in case that users unexpectedly observe problems with the new implementation. The intention is that this flag will remain experimental and opt-in for at least one release, then make it on-by-default and diagnostic (?), and eventually deprecate and obsolete it. However, there are a few unknowns in that plan, specifically, we may want to further improve compact headers to 4 bytes, we are planning to enhance the Klass* encoding to support virtually unlimited number of Klasses, at which point we could also obsolete UseCompressedClassPointers. >> - The compressed Klass* can now be stored in the mark-word of objects. In order to be able to do this, we add some changes to GC forwarding (see below) to protect the relevant (upper 22) bits of the mark-word. Significant parts of this PR deal with loading the compressed Klass* from the mark-word. This PR also changes some code paths (mostly in GCs) to be more careful when accessing Klass* (or mark-word or size) to be able to fetch it from the forwardee in case the object is forwarded. >> - Self-forwarding in GCs (which is used to deal with promotion failure) now uses a bit to indicate 'self-forwarding'. This is needed to preserve the crucial Klass* bits in the header. This also allows us to get rid of preserved-header machinery in SerialGC and G1 (Parallel GC abuses preserved-marks to also find all other relevant oops). >> - Full GC forwarding now uses an encoding similar to compressed-oops.
We have 40 bits for that, and can encode up to 8TB of heap. When exceeding 8TB, we turn off UseCompressedClassPointers (except in ZGC, which doesn't use the GC forwarding at all). >> - Instances can now have their base-offset (the offset where the field layouter starts to place fields) at offset 8 (instead of 12 or 16). >> - Arrays will now store their length at offset 8. >> - CDS can now write and read archives with the compressed header. However, it is not possible to read an archive that has been written with an opposite setting of UseCompactObjectHeaders. Some build machinery is added so that _co... > > Roman Kennke has updated the pull request incrementally with three additional commits since the last revision: > > - Revert "Disable TestSplitPacks::test4a, failing on aarch64" > > This reverts commit 059b1573db26d1d2902ca6dadc8413f445234c2a. > - Simplify object init code in interpreter > - Disable some vectorization tests that fail with +UCOH and UseSSE<=3 I posted a patch for JDK-8341044 for CDSPluginTest.java that was failing in our testing with the Lilliput patch. ------------- PR Comment: https://git.openjdk.org/jdk/pull/20677#issuecomment-2392273233 From kdnilsen at openjdk.org Thu Oct 3 21:30:12 2024 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Thu, 3 Oct 2024 21:30:12 GMT Subject: RFR: 8341379: Shenandoah: Improve lock contention during cleanup [v3] In-Reply-To: <5GMjFWaMjkSuxhPJ8TYJyw_LYdFfhJWrySSaRUjSNq4=.0ee7f896-35b2-43aa-a243-418b1ffb50b0@github.com> References: <5GMjFWaMjkSuxhPJ8TYJyw_LYdFfhJWrySSaRUjSNq4=.0ee7f896-35b2-43aa-a243-418b1ffb50b0@github.com> Message-ID: <2QJqfhzQfJk9ZDN1AI3QR2KpLvGA-djz8U-Cv7OeKGo=.708970da-45e6-41db-a3cd-86ca272d8a14@github.com> > This change improves the efficiency of cleaning up (recycling) regions that have been trashed by GC effort. The affected code runs at the end of FinalMark to reclaim immediate garbage. 
It runs at the end of FinalUpdateRefs to reclaim the regions that comprised the collection set, from which all live objects have now been evacuated. > > Efficiency improvements include: > 1. Rather than invoking the os (while holding the Heap lock) to obtain the time twice for every region recycled, we invoke the os only once for each batch of 32 regions that are to be processed. > 2. Rather than enforcing that the loop runs no longer than 30 us, we refrain from starting a second batch of regions if more than 8 us has passed since the preceding batch was processed. > > Below, each trial runs for 1 hour, processing 28,000 transactions per second. > > Without this change, latency for 4 un-named business services is represented by the following chart: > ![image](https://github.com/user-attachments/assets/0e36025b-7b76-4e7a-ab07-303ea49479c3) > > With this change, latency for the same services is much better: > ![image](https://github.com/user-attachments/assets/aceaf185-6944-4c91-b98e-06ccd1bc2d64) > > A comparison of the two is provided by the following: > ![image](https://github.com/user-attachments/assets/7145f7b5-2a65-44b0-a94a-ddbc871f236b) Kelvin Nilsen has updated the pull request incrementally with one additional commit since the last revision: fix typo ------------- Changes: - all: https://git.openjdk.org/jdk/pull/21211/files - new: https://git.openjdk.org/jdk/pull/21211/files/c3f1b080..055ad411 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=21211&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=21211&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/21211.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/21211/head:pull/21211 PR: https://git.openjdk.org/jdk/pull/21211 From kdnilsen at openjdk.org Thu Oct 3 21:30:12 2024 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Thu, 3 Oct 2024 21:30:12 GMT Subject: RFR: 8341379: Shenandoah: Improve lock contention during cleanup 
[v2] In-Reply-To: References: <5GMjFWaMjkSuxhPJ8TYJyw_LYdFfhJWrySSaRUjSNq4=.0ee7f896-35b2-43aa-a243-418b1ffb50b0@github.com>

Message-ID: On Thu, 3 Oct 2024 16:48:52 GMT, Xiaolong Peng wrote: >> Kelvin Nilsen has updated the pull request incrementally with one additional commit since the last revision: >> >> Predict next batch time and enforce predictive deadline > > src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 957: > >> 955: batch_end_time = os::javaTimeNanos(); >> 956: // Estimate includes historic combination of yield times and heap lock acquisition times. >> 957: batch_process_time_estimate = (batch_end_time - recycle_trash_start_time) / total_batches;; > > Nit: double semicolons Thanks. Fixed. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/21211#discussion_r1786862316 From kdnilsen at openjdk.org Thu Oct 3 21:46:36 2024 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Thu, 3 Oct 2024 21:46:36 GMT Subject: RFR: 8341379: Shenandoah: Improve lock contention during cleanup [v3] In-Reply-To: <2QJqfhzQfJk9ZDN1AI3QR2KpLvGA-djz8U-Cv7OeKGo=.708970da-45e6-41db-a3cd-86ca272d8a14@github.com> References: <5GMjFWaMjkSuxhPJ8TYJyw_LYdFfhJWrySSaRUjSNq4=.0ee7f896-35b2-43aa-a243-418b1ffb50b0@github.com> <2QJqfhzQfJk9ZDN1AI3QR2KpLvGA-djz8U-Cv7OeKGo=.708970da-45e6-41db-a3cd-86ca272d8a14@github.com> Message-ID: On Thu, 3 Oct 2024 21:30:12 GMT, Kelvin Nilsen wrote: >> This change improves the efficiency of cleaning up (recycling) regions that have been trashed by GC effort. The affected code runs at the end of FinalMark to reclaim immediate garbage. It runs at the end of FinalUpdateRefs to reclaim the regions that comprised the collection set, from which all live objects have now been evacuated. >> >> Efficiency improvements include: >> 1. Rather than invoking the os (while holding the Heap lock) to obtain the time twice for every region recycled, we invoke the os only once for each batch of 32 regions that are to be processed. >> 2. 
Rather than enforcing that the loop runs no longer than 30 us, we refrain from starting a second batch of regions if more than 8 us has passed since the preceding batch was processed. >> >> Below, each trial runs for 1 hour, processing 28,000 transactions per second. >> >> Without this change, latency for 4 un-named business services is represented by the following chart: >> ![image](https://github.com/user-attachments/assets/0e36025b-7b76-4e7a-ab07-303ea49479c3) >> >> With this change, latency for the same services is much better: >> ![image](https://github.com/user-attachments/assets/aceaf185-6944-4c91-b98e-06ccd1bc2d64) >> >> A comparison of the two is provided by the following: >> ![image](https://github.com/user-attachments/assets/7145f7b5-2a65-44b0-a94a-ddbc871f236b) > > Kelvin Nilsen has updated the pull request incrementally with one additional commit since the last revision: > > fix typo I restructured the mechanism that decides whether to start processing another batch of regions. In the code as originally structured, we would start another batch whenever the time since previous yield is less than 8 us. In this new version of the code, we start another batch whenever the time since previous yield plus the predicted time to process another batch is less than 10 us. Prediction of how long a batch will take to process is based on most recent history. Note that this is highly dependent on the state of the application. If the application has lots of mutator threads that are urgently allocating, there will be contention for the global heap lock and there will be contention for CPU cores, and both of these will cause the batch processing time to increase. As expected, this new version does a better job of constraining the amount of time between yields. With this new code, we will never start processing another batch unless we have some confidence that the new batch will complete on schedule. 
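The two batch-start rules described above can be compared in a small, self-contained sketch. It is not the Shenandoah code: the function and constant names are hypothetical, and only the 8 us threshold, the 10 us deadline, and the running-average estimate `(batch_end_time - recycle_trash_start_time) / total_batches` are taken from the discussion.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical constants, taken from the thresholds quoted in the PR discussion.
constexpr int64_t kFirstRevisionLimitNs = 8000;   // 8 us since the last yield
constexpr int64_t kPredictiveDeadlineNs = 10000;  // 10 us deadline

// Average cost per batch, in the spirit of the quoted expression
// (batch_end_time - recycle_trash_start_time) / total_batches.
// Because it spans the whole recycling run, it includes time spent
// yielding and re-acquiring the heap lock.
int64_t estimate_batch_ns(int64_t recycle_start_ns, int64_t batch_end_ns,
                          int64_t total_batches) {
  assert(total_batches > 0);
  return (batch_end_ns - recycle_start_ns) / total_batches;
}

// First revision: start another batch while less than 8 us have elapsed
// since the last yield, regardless of how long that batch may take.
bool start_next_batch_first_revision(int64_t ns_since_yield) {
  return ns_since_yield < kFirstRevisionLimitNs;
}

// Predictive revision: start another batch only if it is expected to
// finish before the 10 us deadline.
bool start_next_batch_predictive(int64_t ns_since_yield, int64_t estimate_ns) {
  return ns_since_yield + estimate_ns < kPredictiveDeadlineNs;
}
```

Plugging in the numbers from the discussion: a first batch ending at 7.5 us with a 7.5 us estimate starts a second batch under the first rule (7.5 < 8) but not under the predictive rule (7.5 + 7.5 = 15 > 10), while a lightly loaded system at 8.25 us with a 1.5 us estimate is the reverse case (8.25 > 8, but 8.25 + 1.5 = 9.75 < 10).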
Previously, we might have completed processing a first batch at time 7.5 us. Then, we would have immediately started to process a second batch, even though a prediction of batch processing time would have told us that this second batch would not finish until time 15 us. The new version of the code will sometimes cause mutators to wait slightly longer for yield and the heap lock when the system is "lightly loaded". This is shown to affect p50 latencies. When the system is lightly loaded, the time to process a batch (on current hardware) is approximately 1.5 us. Suppose we finish processing a batch at time 8.25 us. In the original implementation, we would have immediately yielded. However, in this new version of the code, we'll take the next batch, because it is predicted to complete at time 9.75 us, which is less than the 10 us deadline. Making this choice allows the GC thread that is recycling trash regions to have higher throughput. Here are the results of testing this new version of the code: ![image](https://github.com/user-attachments/assets/5acacae4-0604-400c-a3f7-3899274414a8) and this is how this new version compares to mainline without this PR: ![image](https://github.com/user-attachments/assets/44df0fc2-04c7-4593-923e-05a50f6767ed) Most of the p99.99 latencies are considerably improved. There is a slight degradation of p50 latencies. Which would be considered the preferred solution? I'll vote for the new code, but can be persuaded either way.
------------- PR Comment: https://git.openjdk.org/jdk/pull/21211#issuecomment-2392391683 From xpeng at openjdk.org Thu Oct 3 22:14:36 2024 From: xpeng at openjdk.org (Xiaolong Peng) Date: Thu, 3 Oct 2024 22:14:36 GMT Subject: RFR: 8341379: Shenandoah: Improve lock contention during cleanup [v3] In-Reply-To: <2QJqfhzQfJk9ZDN1AI3QR2KpLvGA-djz8U-Cv7OeKGo=.708970da-45e6-41db-a3cd-86ca272d8a14@github.com> References: <5GMjFWaMjkSuxhPJ8TYJyw_LYdFfhJWrySSaRUjSNq4=.0ee7f896-35b2-43aa-a243-418b1ffb50b0@github.com> <2QJqfhzQfJk9ZDN1AI3QR2KpLvGA-djz8U-Cv7OeKGo=.708970da-45e6-41db-a3cd-86ca272d8a14@github.com> Message-ID: On Thu, 3 Oct 2024 21:30:12 GMT, Kelvin Nilsen wrote: >> This change improves the efficiency of cleaning up (recycling) regions that have been trashed by GC effort. The affected code runs at the end of FinalMark to reclaim immediate garbage. It runs at the end of FinalUpdateRefs to reclaim the regions that comprised the collection set, from which all live objects have now been evacuated. >> >> Efficiency improvements include: >> 1. Rather than invoking the os (while holding the Heap lock) to obtain the time twice for every region recycled, we invoke the os only once for each batch of 32 regions that are to be processed. >> 2. Rather than enforcing that the loop runs no longer than 30 us, we refrain from starting a second batch of regions if more than 8 us has passed since the preceding batch was processed. >> >> Below, each trial runs for 1 hour, processing 28,000 transactions per second. 
>> >> Without this change, latency for 4 un-named business services is represented by the following chart: >> ![image](https://github.com/user-attachments/assets/0e36025b-7b76-4e7a-ab07-303ea49479c3) >> >> With this change, latency for the same services is much better: >> ![image](https://github.com/user-attachments/assets/aceaf185-6944-4c91-b98e-06ccd1bc2d64) >> >> A comparison of the two is provided by the following: >> ![image](https://github.com/user-attachments/assets/7145f7b5-2a65-44b0-a94a-ddbc871f236b) > > Kelvin Nilsen has updated the pull request incrementally with one additional commit since the last revision: > > fix typo Marked as reviewed by xpeng (Author). ------------- PR Review: https://git.openjdk.org/jdk/pull/21211#pullrequestreview-2346777439 From xpeng at openjdk.org Thu Oct 3 22:14:37 2024 From: xpeng at openjdk.org (Xiaolong Peng) Date: Thu, 3 Oct 2024 22:14:37 GMT Subject: RFR: 8341379: Shenandoah: Improve lock contention during cleanup [v3] In-Reply-To: References: <5GMjFWaMjkSuxhPJ8TYJyw_LYdFfhJWrySSaRUjSNq4=.0ee7f896-35b2-43aa-a243-418b1ffb50b0@github.com> <2QJqfhzQfJk9ZDN1AI3QR2KpLvGA-djz8U-Cv7OeKGo=.708970da-45e6-41db-a3cd-86ca272d8a14@github.com> Message-ID: On Thu, 3 Oct 2024 21:42:51 GMT, Kelvin Nilsen wrote: > Which would be considered the preferred solution? I'll vote for the new code, but can be persuaded either way. New code looks good to me, the test result also looks pretty good. ------------- PR Comment: https://git.openjdk.org/jdk/pull/21211#issuecomment-2392428849 From mbaesken at openjdk.org Fri Oct 4 07:35:34 2024 From: mbaesken at openjdk.org (Matthias Baesken) Date: Fri, 4 Oct 2024 07:35:34 GMT Subject: RFR: 8336911: ZGC: Division by zero in heuristics after JDK-8332717 In-Reply-To: References:

Message-ID: On Wed, 2 Oct 2024 15:42:12 GMT, Axel Boldt-Christmas wrote: >> When running with ubsan enabled binaries, the following issue is reported, >> e.g. in test >> compiler/uncommontrap/TestDeoptOOM_ZGenerational.jtr >> also in gc/z/TestSmallHeap.jtr >> >> >> jdk/src/hotspot/share/gc/z/zDirector.cpp:537:84: runtime error: division by zero >> #0 0x7f422495bd1f in calculate_young_to_old_worker_ratio src/hotspot/share/gc/z/zDirector.cpp:537 >> #1 0x7f422495bd1f in select_worker_threads src/hotspot/share/gc/z/zDirector.cpp:694 >> #2 0x7f42282a0d97 in select_worker_threads src/hotspot/share/gc/z/zDirector.cpp:689 >> #3 0x7f42282a0d97 in initial_workers src/hotspot/share/gc/z/zDirector.cpp:784 >> #4 0x7f42282a2485 in initial_workers src/hotspot/share/gc/z/zDirector.cpp:795 >> #5 0x7f42282a2485 in start_minor_gc src/hotspot/share/gc/z/zDirector.cpp:797 >> #6 0x7f42282a2485 in start_gc src/hotspot/share/gc/z/zDirector.cpp:826 >> #7 0x7f42282a2485 in ZDirector::run_thread() src/hotspot/share/gc/z/zDirector.cpp:912 >> #8 0x7f422840bdd8 in ZThread::run_service() src/hotspot/share/gc/z/zThread.cpp:29 >> #9 0x7f4225ab6979 in ConcurrentGCThread::run() src/hotspot/share/gc/shared/concurrentGCThread.cpp:48 >> #10 0x7f4227e1137a in Thread::call_run() src/hotspot/share/runtime/thread.cpp:225 >> #11 0x7f42274619b1 in thread_native_entry src/hotspot/os/linux/os_linux.cpp:858 >> #12 0x7f422c8d36e9 in start_thread (/lib64/libpthread.so.0+0xa6e9) (BuildId: 9a146bd267419cb6a8cf08d7c602953a0f2e12c5) >> #13 0x7f422c1dc58e in clone (/lib64/libc.so.6+0x11858e) (BuildId: f2d1cb1ef49f8c47d43a4053910ba6137673ccce) >> >> >> The division by 0 leads to 'infinity' on most of our platforms. So instead of relying on this behavior, we can add a small check and set 'infinity' for divisor == 0. 
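A minimal sketch of the guard described above, written as a free-standing helper. The name `old_vs_young_ratio` is hypothetical and this is not the zDirector.cpp code. The divisor is tested before dividing because ISO C++ leaves division by zero undefined even for floating-point operands; that is why UBSan's `float-divide-by-zero` check reports it, although IEEE 754 hardware would simply produce infinity.

```cpp
#include <cassert>
#include <limits>

// Hypothetical helper mirroring the old-vs-young efficiency ratio in the
// report above. Both inputs are reclaimed-bytes-per-GC-time rates, so they
// cannot be negative.
double old_vs_young_ratio(double old_bytes_per_time, double young_bytes_per_time) {
  assert(old_bytes_per_time >= 0.0 && young_bytes_per_time >= 0.0);
  if (young_bytes_per_time == 0.0) {
    // Never evaluate x / 0.0; return infinity explicitly instead.
    return std::numeric_limits<double>::infinity();
  }
  return old_bytes_per_time / young_bytes_per_time;
}
```

Note that this guard also maps the 0/0 case to infinity (IEEE 754 would give NaN there), and any later product with zero (infinity times 0) still yields NaN, so callers must handle the infinite result explicitly.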
> > src/hotspot/share/gc/z/zDirector.cpp line 539: > >> 537: const double current_old_bytes_freed_per_gc_time = double(reclaimed_per_old_gc) / double(old_gc_time); >> 538: const double old_vs_young_efficiency_ratio = current_young_bytes_freed_per_gc_time == 0 ? std::numeric_limits<double>::infinity() >> 539: : current_old_bytes_freed_per_gc_time / current_young_bytes_freed_per_gc_time; > > I think returning infinity here will cause problems with NaN down the line. It is also unclear what this means if both are `0`. To me something like the following makes sense. But I will discuss this with my team. > Suggestion: > > > if (current_young_bytes_freed_per_gc_time == 0.0) { > if (current_old_bytes_freed_per_gc_time == 0.0) { > // Neither young nor old collections have reclaimed any memory. > // Give them equal priority. > return 1.0; > } > > // Only old collections have reclaimed memory. > // Prioritize old. > return ZOldGCThreads; > } > > const double old_vs_young_efficiency_ratio = current_old_bytes_freed_per_gc_time / current_young_bytes_freed_per_gc_time; Hi Axel, thanks for the suggestion. Please discuss it in your team and tell me the outcome :-) ! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/21304#discussion_r1787287674 From jsikstro at openjdk.org Fri Oct 4 08:03:36 2024 From: jsikstro at openjdk.org (Joel Sikström) Date: Fri, 4 Oct 2024 08:03:36 GMT Subject: RFR: 8340426: ZGC: Move defragment out of the allocation path [v2] In-Reply-To: References:

Message-ID: On Fri, 27 Sep 2024 08:34:19 GMT, Stefan Johansson wrote: >> Please review this change to move defragmentation of small pages out of the allocation path, >> >> **Summary** >> In ZGC small pages are allocated at lower addresses while medium and large pages are allocated at higher addresses. Sometimes when memory usage is high or the distribution of pages in the cache demand it, small pages can be split out from medium or large pages. When this happens the small page is defragmented by remapping it to a lower address if it resides at a too high address. This is done to avoid fragmentation, but doing it in the allocation path comes with a cost since the remapping involves system calls. >> >> This change moves the remapping away from the allocation path to when the pages are freed after being garbage collected. This can increase the fragmentation a bit, since small pages are allowed to reside at higher addresses for a period time. The expectation is that since small pages are frequently collected the period should be short enough to not cause any problems. >> >> I've done detailed experiments with temporary JFR events to look at the cost of doing the defrag/remapping of pages. One problem is that by default we don't get a lot of defrags (unless we run out of memory), and to better investigate this I also altered the allocation path to force more defragmentation. These tests showed that moving defrag out of the allocation path not only have the positive effect of reducing the time spent allocating, but also spaced out the defragmentation a bit more. The reason for this is that now we will defrag when a page is freed by the GC, instead of when the page is allocated and the tests show that often many threads allocate at the same time (and then also defrag at the same time). This in turn leads to many concurrent calls down to the system to remap the memory, and this increases the cost of the actual mapping (seen by adding JFR events). 
>> >> **Additional testing** >> >> - Functional testing in mach5 tier1-7 >> - Sanity performance testing in aurora > > Stefan Johansson has updated the pull request incrementally with two additional commits since the last revision: > > - Additional changes > - StefanK review Thanks for the reviews! ------------- PR Comment: https://git.openjdk.org/jdk/pull/21191#issuecomment-2393129646 From sjohanss at openjdk.org Fri Oct 4 08:29:42 2024 From: sjohanss at openjdk.org (Stefan Johansson) Date: Fri, 4 Oct 2024 08:29:42 GMT Subject: Integrated: 8340426: ZGC: Move defragment out of the allocation path In-Reply-To: References: Message-ID: On Wed, 25 Sep 2024 20:05:17 GMT, Stefan Johansson wrote: > Please review this change to move defragmentation of small pages out of the allocation path. > > **Summary** > In ZGC small pages are allocated at lower addresses while medium and large pages are allocated at higher addresses. Sometimes when memory usage is high or the distribution of pages in the cache demands it, small pages can be split out from medium or large pages. When this happens the small page is defragmented by remapping it to a lower address if it resides at too high an address. This is done to avoid fragmentation, but doing it in the allocation path comes with a cost since the remapping involves system calls. > > This change moves the remapping away from the allocation path to when the pages are freed after being garbage collected. This can increase the fragmentation a bit, since small pages are allowed to reside at higher addresses for a period of time. The expectation is that since small pages are frequently collected the period should be short enough to not cause any problems. > > I've done detailed experiments with temporary JFR events to look at the cost of doing the defrag/remapping of pages.
One problem is that by default we don't get a lot of defrags (unless we run out of memory), and to better investigate this I also altered the allocation path to force more defragmentation. These tests showed that moving defrag out of the allocation path not only has the positive effect of reducing the time spent allocating, but also spaces out the defragmentation a bit more. The reason for this is that now we will defrag when a page is freed by the GC, instead of when the page is allocated and the tests show that often many threads allocate at the same time (and then also defrag at the same time). This in turn leads to many concurrent calls down to the system to remap the memory, and this increases the cost of the actual mapping (seen by adding JFR events). > > **Additional testing** > > - Functional testing in mach5 tier1-7 > - Sanity performance testing in aurora This pull request has now been integrated. Changeset: ec020f3f Author: Stefan Johansson URL: https://git.openjdk.org/jdk/commit/ec020f3fc988553ad1eda460d889b5ba24e76e8e Stats: 94 lines in 5 files changed: 61 ins; 17 del; 16 mod 8340426: ZGC: Move defragment out of the allocation path Reviewed-by: aboldtch, jsikstro, eosterlund ------------- PR: https://git.openjdk.org/jdk/pull/21191 From rcastanedalo at openjdk.org Fri Oct 4 09:20:59 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Fri, 4 Oct 2024 09:20:59 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v27] In-Reply-To: References:

Message-ID: On Thu, 3 Oct 2024 17:12:04 GMT, Aleksey Shipilev wrote: >> Roberto Castañeda Lozano has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 53 additional commits since the last revision: >> >> - Merge remote-tracking branch 'feilongjiang/JEP-475-RISC-V' into JDK-8334060-g1-late-barrier-expansion >> - riscv port refactor >> - Remove temporary support code >> - Merge jdk-24+17 >> - Relax 'must match' assertion in ppc's g1StoreN after limiting pinning bypass optimization >> - Remove unnecessary reg-to-reg moves in aarch64's g1CompareAndX instructions >> - Reintroduce matcher assert and limit pinning bypass optimization to non-shared EncodeP nodes >> - Merge jdk-24+16 >> - Ensure that detected encode-and-store patterns are matched >> - Merge remote-tracking branch 'snazarkin/arm32-JDK-8334060-g1-late-barrier-expansion' into JDK-8334060-g1-late-barrier-expansion >> - ... and 43 more: https://git.openjdk.org/jdk/compare/0165cb32...14483b83 > > src/hotspot/share/gc/g1/c2/g1BarrierSetC2.cpp line 335: > >> 333: assert(!use_ReduceInitialCardMarks(), >> 334: "post-barriers are only needed for tightly-coupled initialization stores when ReduceInitialCardMarks is disabled"); >> 335: access.set_barrier_data(access.barrier_data() ^ G1C2BarrierPre); > > I have been looking at this code after integration, and I wonder if `^` is really correct here? Was the intent to remove `G1C2BarrierPre` from the barrier data? What happens if `get_store_barrier` does not actually set it? Do we flip the bit back? Yes, the intent (and actual effect) is to remove `G1C2BarrierPre` from the barrier data. Using an XOR (`^`) is correct because at that program point `G1C2BarrierPre` is guaranteed to be set.
This is because an `access` corresponding to a tightly-coupled initialization store is always of type `C2OptAccess`, hence `!access.is_parse_access()` and `get_store_barrier(access)` trivially returns `G1C2BarrierPre | G1C2BarrierPost`. Having said this, it would clearly be less convoluted to simply clear `G1C2BarrierPre` instead of flipping it. I will file an RFE, thanks. As a side note, this complexity is necessary to handle `!ReduceInitialCardMarks`. I keep wondering if the benefit of being able to disable `ReduceInitialCardMarks` [1,2,3] is worth the significant complexity required in the GC-C2 interface to deal with it. [1] https://docs.oracle.com/en/java/javase/23/gctuning/garbage-first-garbage-collector-tuning.html [2] https://bugs.openjdk.org/browse/JDK-8166899 [3] https://bugs.openjdk.org/browse/JDK-8167077 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1787425169 From rcastanedalo at openjdk.org Fri Oct 4 09:37:53 2024 From: rcastanedalo at openjdk.org (Roberto Castañeda Lozano) Date: Fri, 4 Oct 2024 09:37:53 GMT Subject: RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v27] In-Reply-To: References:

Message-ID: On Fri, 4 Oct 2024 09:17:47 GMT, Roberto Castañeda Lozano wrote: >> src/hotspot/share/gc/g1/c2/g1BarrierSetC2.cpp line 335: >> >>> 333: assert(!use_ReduceInitialCardMarks(), >>> 334: "post-barriers are only needed for tightly-coupled initialization stores when ReduceInitialCardMarks is disabled"); >>> 335: access.set_barrier_data(access.barrier_data() ^ G1C2BarrierPre); >> >> I have been looking at this code after integration, and I wonder if `^` is really correct here? Was the intent to remove `G1C2BarrierPre` from the barrier data? What happens if `get_store_barrier` does not actually set it? Do we flip the bit back? > > Yes, the intent (and actual effect) is to remove `G1C2BarrierPre` from the barrier data. Using an XOR (`^`) is correct because at that program point `G1C2BarrierPre` is guaranteed to be set. This is because an `access` corresponding to a tightly-coupled initialization store is always of type `C2OptAccess`, hence `!access.is_parse_access()` and `get_store_barrier(access)` trivially returns `G1C2BarrierPre | G1C2BarrierPost`. Having said this, it would clearly be less convoluted to simply clear `G1C2BarrierPre` instead of flipping it. I will file an RFE, thanks. > > As a side note, this complexity is necessary to handle `!ReduceInitialCardMarks`. I keep wondering if the benefit of being able to disable `ReduceInitialCardMarks` [1,2,3] is worth the significant complexity required in the GC-C2 interface to deal with it. > > [1] https://docs.oracle.com/en/java/javase/23/gctuning/garbage-first-garbage-collector-tuning.html > [2] https://bugs.openjdk.org/browse/JDK-8166899 > [3] https://bugs.openjdk.org/browse/JDK-8167077 Reported here: [JDK-8341525](https://bugs.openjdk.org/browse/JDK-8341525).
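The difference between flipping and clearing a flag, which the RFE above is about, can be shown in isolation. The flag values below are hypothetical stand-ins for `G1C2BarrierPre` and `G1C2BarrierPost` (the real HotSpot constants may differ); only the operators matter.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bit assignments standing in for the G1 C2 barrier flags.
constexpr uint8_t kBarrierPre  = 1 << 0;
constexpr uint8_t kBarrierPost = 1 << 1;

// XOR toggles the bit, so it only removes the flag when the flag is already
// set. The assert encodes the invariant argued in the review.
uint8_t drop_pre_by_xor(uint8_t data) {
  assert((data & kBarrierPre) != 0 && "XOR would set the flag instead");
  return static_cast<uint8_t>(data ^ kBarrierPre);
}

// AND with the complement clears the bit whatever its previous state.
uint8_t drop_pre_by_mask(uint8_t data) {
  return static_cast<uint8_t>(data & ~kBarrierPre);
}
```

When the pre-barrier flag is set, both functions give the same result. When it is not set, `drop_pre_by_mask` is a no-op while a plain XOR would set the flag, which is the hazard raised in the question about flipping the bit back.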
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1787448241 From rkennke at openjdk.org Fri Oct 4 10:44:53 2024 From: rkennke at openjdk.org (Roman Kennke) Date: Fri, 4 Oct 2024 10:44:53 GMT Subject: RFR: 8305895: Implement JEP 450: Compact Object Headers (Experimental) [v26] In-Reply-To: References:

<6PTWMepIDuZDdPfN3xNKV1vqUyO_R4yCSeiSTpYIyyQ=.61a5b462-7114-4385-a6d7-40e5c7b0005d@github.com>

Message-ID: On Wed, 2 Oct 2024 21:29:28 GMT, Sandhya Viswanathan wrote: >> I changed the header<16 version to be a small loop: https://github.com/rkennke/jdk/commit/bcba264ea5c15581647933db1163ca1dae39b6c5 >> >> The idea is the same as before, except it's made as a small loop with a maximum of 4 iterations (backward-branches), and it copies 8 bytes at a time, such that 1. it may copy up to 7 bytes that precede the array and 2. doesn't run over the end of the array (which would potentially crash). >> >> I am not sure if using XMM_TMP1 and XMM_TMP2 there is ok, or if it would encode better to use one of the regular registers. >> >> Also, this new implementation could simply replace the old one, instead of being an alternative. I am not sure if it would make any difference performance-wise. > > @rkennke It looks to me like the small loop will run over the end of the array. > Say the haystack_len is 7: the index below would be 0 after the shrq instruction, and the movq(XMM_TMP1, Address(haystack, index, Address::times_8)) in the loop will read 8 bytes, i.e. one byte past the end of the array: > // num_words (zero-based) = (haystack_len - 1) / 8; > __ movq(index, haystack_len); > __ subq(index, 1); > __ shrq(index, LogBytesPerWord); > > __ bind(L_loop); > __ movq(XMM_TMP1, Address(haystack, index, Address::times_8)); > __ movq(Address(rsp, index, Address::times_8), XMM_TMP1); > __ subq(index, 1); > __ jcc(Assembler::positive, L_loop); Yes, and that is intentional. Say haystack_len is 7; then the first block computes the adjustment of the haystack, which is 8 - (7 % 8) = 1. We adjust the haystack pointer one byte down, so that when we copy (a multiple of) 8 bytes, we land on the last byte. We do copy a few bytes that precede the array; these are part of the object header, which is guaranteed to be >= 8 bytes. Then we compute the number of words to copy, but make it 0-based. That is, '0' is 1 word, '1' is 2 words, etc. It makes the loop nicer.
In this example we get 0, which means we copy one word from the adjusted haystack, which is correct. Then comes the actual loop. Afterwards we adjust the haystack pointer so that it points to the first array element that we just copied onto the stack, ignoring the few garbage bytes that we also copied. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20677#discussion_r1787528501 From rkennke at openjdk.org Fri Oct 4 11:15:37 2024 From: rkennke at openjdk.org (Roman Kennke) Date: Fri, 4 Oct 2024 11:15:37 GMT Subject: RFR: 8305895: Implement JEP 450: Compact Object Headers (Experimental) [v30] In-Reply-To: References: Message-ID: > This is the main body of the JEP 450: Compact Object Headers (Experimental). > > It is also a follow-up to #20640, which now also includes (and supersedes) #20603 and #20605, plus the Tiny Class-Pointers parts that have been previously missing. > > Main changes: > - Introduction of the (experimental) flag UseCompactObjectHeaders. All changes in this PR are protected by this flag. The purpose of the flag is to provide a fallback, in case that users unexpectedly observe problems with the new implementation. The intention is that this flag will remain experimental and opt-in for at least one release, then make it on-by-default and diagnostic (?), and eventually deprecate and obsolete it. However, there are a few unknowns in that plan, specifically, we may want to further improve compact headers to 4 bytes, we are planning to enhance the Klass* encoding to support a virtually unlimited number of Klasses, at which point we could also obsolete UseCompressedClassPointers. > - The compressed Klass* can now be stored in the mark-word of objects. In order to be able to do this, we add some changes to GC forwarding (see below) to protect the relevant (upper 22) bits of the mark-word. Significant parts of this PR deal with loading the compressed Klass* from the mark-word.
This PR also changes some code paths (mostly in GCs) to be more careful when accessing Klass* (or mark-word or size) to be able to fetch it from the forwardee in case the object is forwarded. > - Self-forwarding in GCs (which is used to deal with promotion failure) now uses a bit to indicate 'self-forwarding'. This is needed to preserve the crucial Klass* bits in the header. This also allows us to get rid of preserved-header machinery in SerialGC and G1 (Parallel GC abuses preserved-marks to also find all other relevant oops). > - Full GC forwarding now uses an encoding similar to compressed-oops. We have 40 bits for that, and can encode up to 8TB of heap. When exceeding 8TB, we turn off UseCompressedClassPointers (except in ZGC, which doesn't use the GC forwarding at all). > - Instances can now have their base-offset (the offset where the field layouter starts to place fields) at offset 8 (instead of 12 or 16). > - Arrays will now store their length at offset 8. > - CDS can now write and read archives with the compressed header. However, it is not possible to read an archive that has been written with an opposite setting of UseCompactObjectHeaders. Some build machinery is added so that _coh variants of CDS archiv... Roman Kennke has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 76 commits: - Merge remote-tracking branch 'rkennke/JDK-8305895-v4' into JDK-8305895-v4 - Revert "Disable TestSplitPacks::test4a, failing on aarch64" This reverts commit 059b1573db26d1d2902ca6dadc8413f445234c2a. - Simplify object init code in interpreter - Disable some vectorization tests that fail with +UCOH and UseSSE<=3 - Fix for CDSPluginTest.java - Merge tag 'jdk-24+18' into JDK-8305895-v4 Added tag jdk-24+18 for changeset 19642bd3 - Disable TestSplitPacks::test4a, failing on aarch64 - @robcasloz review comments - Improve CollectedHeap::is_oop() - Allow LM_MONITOR on 32-bit platforms - ...
and 66 more: https://git.openjdk.org/jdk/compare/19642bd3...8742f3c1 ------------- Changes: https://git.openjdk.org/jdk/pull/20677/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=20677&range=29 Stats: 4560 lines in 196 files changed: 3207 ins; 724 del; 629 mod Patch: https://git.openjdk.org/jdk/pull/20677.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/20677/head:pull/20677 PR: https://git.openjdk.org/jdk/pull/20677 From mbaesken at openjdk.org Fri Oct 4 11:49:37 2024 From: mbaesken at openjdk.org (Matthias Baesken) Date: Fri, 4 Oct 2024 11:49:37 GMT Subject: RFR: 8340945: Ubsan: oopStorage.cpp:374:8: runtime error: applying non-zero offset 18446744073709551168 to null pointer In-Reply-To: <1uTdfa0c-Nrk1-6t0IPbYRmwy9URAHyz7-V4YHx5hDo=.377970e5-1b31-4658-94fc-e232bfbdec3b@github.com> References: <1uTdfa0c-Nrk1-6t0IPbYRmwy9URAHyz7-V4YHx5hDo=.377970e5-1b31-4658-94fc-e232bfbdec3b@github.com> Message-ID: <9zmxxLvVwPj0HZF0wbCB1MZ3zNuBzhDGJlw1GYBSxcE=.183e74d7-6a16-44f7-b65c-253eee196368@github.com> On Sat, 28 Sep 2024 05:20:23 GMT, Kim Barrett wrote: > Please review this change to the OopStorage handling of storage block lookup, > now being more careful about pointer arithmetic to avoid UB. > > As an initial cleanup, renamed OopStorage::find_block_or_null to > block_for_ptr, for consistency with the Block function that implements it. > Also moved the precondition assert that the argument is non-null into the > Block function, where the requirement is located. > > Changed OopStorage::Block::block_for_ptr to avoid pointer arithmetic that > might invoke UB, instead converting the pointer argument to uintptr_t and > performing arithmetic on it. Also fixed its description in the header file. > > Similarly changed OopStorage::Block::active_index_safe to avoid pointer > arithmetic, instead converting to uintptr_t for arithmetic. This avoids > potential problems when the Block argument is a "false positive" from > block_for_ptr. 
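The uintptr_t approach for block_for_ptr can be illustrated with a small C++ sketch. The constants are illustrative assumptions: 8 sections of 64 bytes, so the lowest candidate lies 7 * 64 = 448 bytes below the aligned address, which lines up with the 2^64 - 448 offset in the ubsan message. candidate_block_starts is a hypothetical helper, not the HotSpot code. Because the subtraction is done on uintptr_t, the wraparound for a near-null argument is well defined rather than undefined pointer arithmetic.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative layout constants (assumed): a block is kSectionCount sections of
// kSectionBytes each, and blocks start on a section-aligned address.
constexpr std::uintptr_t kSectionBytes = 64;
constexpr std::size_t kSectionCount = 8;

// Returns every address at which a block containing 'ptr' could start.
// All arithmetic is on uintptr_t, where wraparound is well defined, so no
// out-of-range pointer is ever formed. A candidate may be a false positive and
// must still be validated with a safe probe before it is dereferenced.
std::vector<std::uintptr_t> candidate_block_starts(const void* ptr) {
  assert(ptr != nullptr);  // precondition checked where it is needed
  const std::uintptr_t section_start =
      reinterpret_cast<std::uintptr_t>(ptr) & ~(kSectionBytes - 1);
  // Guess that 'ptr' lies in the last section, then walk upward one section at a time.
  const std::uintptr_t first = section_start - kSectionBytes * (kSectionCount - 1);
  std::vector<std::uintptr_t> candidates;
  for (std::size_t i = 0; i < kSectionCount; ++i) {
    candidates.push_back(first + i * kSectionBytes);
  }
  return candidates;
}
```

Computing the candidates is only half of the lookup; each one still has to be checked (the gtest notes in this thread mention SafeFetchN for that kind of probing) before it is treated as a real block.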
> > Changed OopStorage::allocation_status to check up front for a null argument, > immediately returning INVALID_ENTRY in that case. This avoids violating > block_for_ptr's precondition that the argument is non-null. Added a gtest for > this. Also added a gtest for the potential false-positive case. > > While updating gtests, removed #ifndef DISABLE_GARBAGE_ALLOCATION_STATUS_TESTS. > That macro was included when these tests were first added, because some tests > needed to be disabled on Windows, due to SafeFetchN in gtest context not working > on that platform. That was later fixed by JDK-8185734. The conditional #define > of that macro in test_oopStorage.cpp was removed, but the no longer needed > #ifndef was inadvertently not removed. > > Testing: mach5 tier1-5 > Locally (linux-x64) reproduced the reported ubsan failure, and verified it no > longer reproduces with these changes. > > While working on this change I noticed a related issue. The recently added > OopStorage::print_containing doesn't verify the block is not a false positive > before using it as a block. I'll file a JBS issue for this. Marked as reviewed by mbaesken (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/21240#pullrequestreview-2347863489 From coleenp at openjdk.org Fri Oct 4 12:53:53 2024 From: coleenp at openjdk.org (Coleen Phillimore) Date: Fri, 4 Oct 2024 12:53:53 GMT Subject: RFR: 8305895: Implement JEP 450: Compact Object Headers (Experimental) [v30] In-Reply-To: References:

Message-ID: On Fri, 4 Oct 2024 11:15:37 GMT, Roman Kennke wrote: >> This is the main body of the JEP 450: Compact Object Headers (Experimental). >> >> It is also a follow-up to #20640, which now also includes (and supersedes) #20603 and #20605, plus the Tiny Class-Pointers parts that have been previously missing. >> >> Main changes: >> - Introduction of the (experimental) flag UseCompactObjectHeaders. All changes in this PR are protected by this flag. The purpose of the flag is to provide a fallback, in case that users unexpectedly observe problems with the new implementation. The intention is that this flag will remain experimental and opt-in for at least one release, then make it on-by-default and diagnostic (?), and eventually deprecate and obsolete it. However, there are a few unknowns in that plan, specifically, we may want to further improve compact headers to 4 bytes, we are planning to enhance the Klass* encoding to support a virtually unlimited number of Klasses, at which point we could also obsolete UseCompressedClassPointers. >> - The compressed Klass* can now be stored in the mark-word of objects. In order to be able to do this, we add some changes to GC forwarding (see below) to protect the relevant (upper 22) bits of the mark-word. Significant parts of this PR deal with loading the compressed Klass* from the mark-word. This PR also changes some code paths (mostly in GCs) to be more careful when accessing Klass* (or mark-word or size) to be able to fetch it from the forwardee in case the object is forwarded. >> - Self-forwarding in GCs (which is used to deal with promotion failure) now uses a bit to indicate 'self-forwarding'. This is needed to preserve the crucial Klass* bits in the header. This also allows us to get rid of preserved-header machinery in SerialGC and G1 (Parallel GC abuses preserved-marks to also find all other relevant oops). >> - Full GC forwarding now uses an encoding similar to compressed-oops.
We have 40 bits for that, and can encode up to 8TB of heap. When exceeding 8TB, we turn off UseCompressedClassPointers (except in ZGC, which doesn't use the GC forwarding at all). >> - Instances can now have their base-offset (the offset where the field layouter starts to place fields) at offset 8 (instead of 12 or 16). >> - Arrays will now store their length at offset 8. >> - CDS can now write and read archives with the compressed header. However, it is not possible to read an archive that has been written with an opposite setting of UseCompactObjectHeaders. Some build machinery is added so that _co... > > Roman Kennke has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 76 commits: > > - Merge remote-tracking branch 'rkennke/JDK-8305895-v4' into JDK-8305895-v4 > - Revert "Disable TestSplitPacks::test4a, failing on aarch64" > > This reverts commit 059b1573db26d1d2902ca6dadc8413f445234c2a. > - Simplify object init code in interpreter > - Disable some vectorization tests that fail with +UCOH and UseSSE<=3 > - Fix for CDSPluginTest.java > - Merge tag 'jdk-24+18' into JDK-8305895-v4 > > Added tag jdk-24+18 for changeset 19642bd3 > - Disable TestSplitPacks::test4a, failing on aarch64 > - @robcasloz review comments > - Improve CollectedHeap::is_oop() > - Allow LM_MONITOR on 32-bit platforms > - ... 
and 66 more: https://git.openjdk.org/jdk/compare/19642bd3...8742f3c1 There's another test failure that we're seeing that's similar to this bug in mainline when running with -XX:+UseCompactObjectHeaders on aarch64: https://bugs.openjdk.org/browse/JDK-8340212 ------------- PR Comment: https://git.openjdk.org/jdk/pull/20677#issuecomment-2393637283 From zgu at openjdk.org Fri Oct 4 14:42:15 2024 From: zgu at openjdk.org (Zhengyu Gu) Date: Fri, 4 Oct 2024 14:42:15 GMT Subject: RFR: 8341332: Refactor array chunking statistics counters Message-ID: Please review this patch that consolidates array chunking statistics counting and reporting inside task queue and task queue set. Also consolidating partial array chunking and processing into `PartialArrayProcessor` to reduce duplicate code. ------------- Commit messages: - remove empty line - Cleanup - Cleanup - v2 - v1 - add new files - v0 Changes: https://git.openjdk.org/jdk/pull/21343/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=21343&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8341332 Stats: 448 lines in 10 files changed: 277 ins; 116 del; 55 mod Patch: https://git.openjdk.org/jdk/pull/21343.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/21343/head:pull/21343 PR: https://git.openjdk.org/jdk/pull/21343 From kbarrett at openjdk.org Fri Oct 4 16:01:38 2024 From: kbarrett at openjdk.org (Kim Barrett) Date: Fri, 4 Oct 2024 16:01:38 GMT Subject: RFR: 8340945: Ubsan: oopStorage.cpp:374:8: runtime error: applying non-zero offset 18446744073709551168 to null pointer In-Reply-To: References: <1uTdfa0c-Nrk1-6t0IPbYRmwy9URAHyz7-V4YHx5hDo=.377970e5-1b31-4658-94fc-e232bfbdec3b@github.com> Message-ID: On Mon, 30 Sep 2024 08:29:17 GMT, Thomas Schatzl wrote: >> Please review this change to the OopStorage handling of storage block lookup, >> now being more careful about pointer arithmetic to avoid UB. 
>> >> As an initial cleanup, renamed OopStorage::find_block_or_null to >> block_for_ptr, for consistency with the Block function that implements it. >> Also moved the precondition assert that the argument is non-null into the >> Block function, where the requirement is located. >> >> Changed OopStorage::Block::block_for_ptr to avoid pointer arithmetic that >> might invoke UB, instead converting the pointer argument to uintptr_t and >> performing arithmetic on it. Also fixed its description in the header file. >> >> Similarly changed OopStorage::Block::active_index_safe to avoid pointer >> arithmetic, instead converting to uintptr_t for arithmetic. This avoids >> potential problems when the Block argument is a "false positive" from >> block_for_ptr. >> >> Changed OopStorage::allocation_status to check up front for a null argument, >> immediately returning INVALID_ENTRY in that case. This avoids violating >> block_for_ptr's precondition that the argument is non-null. Added a gtest for >> this. Also added a gtest for the potential false-positive case. >> >> While updating gtests, removed #ifndef DISABLE_GARBAGE_ALLOCATION_STATUS_TESTS. >> That macro was included when these tests were first added, because some tests >> needed to be disabled on Windows, due to SafeFetchN in gtest context not working >> on that platform. That was later fixed by JDK-8185734. The conditional #define >> of that macro in test_oopStorage.cpp was removed, but the no longer needed >> #ifndef was inadvertently not removed. >> >> Testing: mach5 tier1-5 >> Locally (linux-x64) reproduced the reported ubsan failure, and verified it no >> longer reproduces with these changes. >> >> While working on this change I noticed a related issue. The recently added >> OopStorage::print_containing doesn't verify the block is not a false positive >> before using it as a block. I'll file a JBS issue for this. > > Marked as reviewed by tschatzl (Reviewer). Thanks for reviews @tschatzl and @MBaesken .
------------- PR Comment: https://git.openjdk.org/jdk/pull/21240#issuecomment-2394012889 From kbarrett at openjdk.org Fri Oct 4 16:01:39 2024 From: kbarrett at openjdk.org (Kim Barrett) Date: Fri, 4 Oct 2024 16:01:39 GMT Subject: Integrated: 8340945: Ubsan: oopStorage.cpp:374:8: runtime error: applying non-zero offset 18446744073709551168 to null pointer In-Reply-To: <1uTdfa0c-Nrk1-6t0IPbYRmwy9URAHyz7-V4YHx5hDo=.377970e5-1b31-4658-94fc-e232bfbdec3b@github.com> References: <1uTdfa0c-Nrk1-6t0IPbYRmwy9URAHyz7-V4YHx5hDo=.377970e5-1b31-4658-94fc-e232bfbdec3b@github.com> Message-ID: On Sat, 28 Sep 2024 05:20:23 GMT, Kim Barrett wrote: > Please review this change to the OopStorage handling of storage block lookup, > now being more careful about pointer arithmetic to avoid UB. > > As an initial cleanup, renamed OopStorage::find_block_or_null to > block_for_ptr, for consistency with the Block function that implements it. > Also moved the precondition assert that the argument is non-null into the > Block function, where the requirement is located. > > Changed OopStorage::Block::block_for_ptr to avoid pointer arithmetic that > might invoke UB, instead converting the pointer argument to uintptr_t and > performing arithmetic on it. Also fixed its description in the header file. > > Similarly changed OopStorage::Block::active_index_safe to avoid pointer > arithmetic, instead converting to uintptr_t for arithmetic. This avoids > potential problems when the Block argument is a "false positive" from > block_for_ptr. > > Changed OopStorage::allocation_status to check up front for a null argument, > immediately returning INVALID_ENTRY in that case. This avoids violating > block_for_ptr's precondition that the argument is non-null. Added a gtest for > this. Also added a gtest for the potential false-positive case. > > While updating gtests, removed #ifndef DISABLE_GARBAGE_ALLOCATION_STATUS_TESTS.
> That macro was included when these tests were first added, because some tests > needed to be disabled on Windows, due to SafeFetchN in gtest context not working > on that platform. That was later fixed by JDK-8185734. The conditional #define > of that macro in test_oopStorage.cpp was removed, but the no longer needed > #ifndef was inadvertently not removed. > > Testing: mach5 tier1-5 > Locally (linux-x64) reproduced the reported ubsan failure, and verified it no > longer reproduces with these changes. > > While working on this change I noticed a related issue. The recently added > OopStorage::print_containing doesn't verify the block is not a false positive > before using it as a block. I'll file a JBS issue for this. This pull request has now been integrated. Changeset: feb6a830 Author: Kim Barrett URL: https://git.openjdk.org/jdk/commit/feb6a830e291ff71e2803e37be6c35c237f7c1cf Stats: 69 lines in 4 files changed: 37 ins; 4 del; 28 mod 8340945: Ubsan: oopStorage.cpp:374:8: runtime error: applying non-zero offset 18446744073709551168 to null pointer Reviewed-by: tschatzl, mbaesken ------------- PR: https://git.openjdk.org/jdk/pull/21240 From phh at openjdk.org Fri Oct 4 16:39:41 2024 From: phh at openjdk.org (Paul Hohensee) Date: Fri, 4 Oct 2024 16:39:41 GMT Subject: RFR: 8341379: Shenandoah: Improve lock contention during cleanup [v3] In-Reply-To: <2QJqfhzQfJk9ZDN1AI3QR2KpLvGA-djz8U-Cv7OeKGo=.708970da-45e6-41db-a3cd-86ca272d8a14@github.com> References: <5GMjFWaMjkSuxhPJ8TYJyw_LYdFfhJWrySSaRUjSNq4=.0ee7f896-35b2-43aa-a243-418b1ffb50b0@github.com> <2QJqfhzQfJk9ZDN1AI3QR2KpLvGA-djz8U-Cv7OeKGo=.708970da-45e6-41db-a3cd-86ca272d8a14@github.com> Message-ID: On Thu, 3 Oct 2024 21:30:12 GMT, Kelvin Nilsen wrote: >> This change improves the efficiency of cleaning up (recycling) regions that have been trashed by GC effort. The affected code runs at the end of FinalMark to reclaim immediate garbage. 
It runs at the end of FinalUpdateRefs to reclaim the regions that comprised the collection set, from which all live objects have now been evacuated. >> >> Efficiency improvements include: >> 1. Rather than invoking the os (while holding the Heap lock) to obtain the time twice for every region recycled, we invoke the os only once for each batch of 32 regions that are to be processed. >> 2. Rather than enforcing that the loop runs no longer than 30 us, we refrain from starting a second batch of regions if more than 8 us has passed since the preceding batch was processed. >> >> Below, each trial runs for 1 hour, processing 28,000 transactions per second. >> >> Without this change, latency for 4 un-named business services is represented by the following chart: >> ![image](https://github.com/user-attachments/assets/0e36025b-7b76-4e7a-ab07-303ea49479c3) >> >> With this change, latency for the same services is much better: >> ![image](https://github.com/user-attachments/assets/aceaf185-6944-4c91-b98e-06ccd1bc2d64) >> >> A comparison of the two is provided by the following: >> ![image](https://github.com/user-attachments/assets/7145f7b5-2a65-44b0-a94a-ddbc871f236b) > > Kelvin Nilsen has updated the pull request incrementally with one additional commit since the last revision: > > fix typo Marked as reviewed by phh (Reviewer). 
------------- PR Review: https://git.openjdk.org/jdk/pull/21211#pullrequestreview-2348488078 From duke at openjdk.org Fri Oct 4 17:13:40 2024 From: duke at openjdk.org (duke) Date: Fri, 4 Oct 2024 17:13:40 GMT Subject: RFR: 8341379: Shenandoah: Improve lock contention during cleanup [v3] In-Reply-To: <2QJqfhzQfJk9ZDN1AI3QR2KpLvGA-djz8U-Cv7OeKGo=.708970da-45e6-41db-a3cd-86ca272d8a14@github.com> References: <5GMjFWaMjkSuxhPJ8TYJyw_LYdFfhJWrySSaRUjSNq4=.0ee7f896-35b2-43aa-a243-418b1ffb50b0@github.com> <2QJqfhzQfJk9ZDN1AI3QR2KpLvGA-djz8U-Cv7OeKGo=.708970da-45e6-41db-a3cd-86ca272d8a14@github.com> Message-ID: On Thu, 3 Oct 2024 21:30:12 GMT, Kelvin Nilsen wrote: >> This change improves the efficiency of cleaning up (recycling) regions that have been trashed by GC effort. The affected code runs at the end of FinalMark to reclaim immediate garbage. It runs at the end of FinalUpdateRefs to reclaim the regions that comprised the collection set, from which all live objects have now been evacuated. >> >> Efficiency improvements include: >> 1. Rather than invoking the os (while holding the Heap lock) to obtain the time twice for every region recycled, we invoke the os only once for each batch of 32 regions that are to be processed. >> 2. Rather than enforcing that the loop runs no longer than 30 us, we refrain from starting a second batch of regions if more than 8 us has passed since the preceding batch was processed. >> >> Below, each trial runs for 1 hour, processing 28,000 transactions per second. 
>> >> Without this change, latency for 4 un-named business services is represented by the following chart: >> ![image](https://github.com/user-attachments/assets/0e36025b-7b76-4e7a-ab07-303ea49479c3) >> >> With this change, latency for the same services is much better: >> ![image](https://github.com/user-attachments/assets/aceaf185-6944-4c91-b98e-06ccd1bc2d64) >> >> A comparison of the two is provided by the following: >> ![image](https://github.com/user-attachments/assets/7145f7b5-2a65-44b0-a94a-ddbc871f236b) > > Kelvin Nilsen has updated the pull request incrementally with one additional commit since the last revision: > > fix typo @kdnilsen Your change (at version 055ad41109b303ab474c2510cc496a3cf87135b8) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/21211#issuecomment-2394174956 From kbarrett at openjdk.org Fri Oct 4 17:32:38 2024 From: kbarrett at openjdk.org (Kim Barrett) Date: Fri, 4 Oct 2024 17:32:38 GMT Subject: RFR: 8341332: Refactor array chunking statistics counters In-Reply-To: References: Message-ID: On Fri, 4 Oct 2024 01:41:50 GMT, Zhengyu Gu wrote: > Please review this patch that consolidates array chunking statistics counting and reporting inside task queue and task queue set. Also consolidating partial array chunking and processing into `PartialArrayProcessor` to reduce duplicate code. I hadn't noticed the associated JBS issue and that you'd started working on it. I've also started working on the same problem, but taking a different approach. I think the approach being taken in this PR is continuing a problematic design approach already present in the existing code. I'd rather we didn't do that. Virtual functions tend to combine poorly with class templates. I think there's not actually any need for runtime polymorphism in the taskqueue stuff. Adding more is not appealing to me. 
I think the basic taskqueue and its associated statistics form a generic utility that could be used anywhere one needs parallel task queues with work stealing (so long as the tasks can meet the necessary requirements - see the discussion with the implementation of pop_global). I don't think inheritance is the best way to augment them. I think the existing use of inheritance for overflow queues, and inclusion of overflow stats in TaskQueueStats, are design errors. This PR proposes to do more of that. I think task specifics ought to be kept separate from the generic taskqueue, which doesn't really need to know anything about them. So I think, for example, that the various task types ought not be in the taskqueue files, but should be in their own files. ------------- PR Comment: https://git.openjdk.org/jdk/pull/21343#issuecomment-2394206125 From ysr at openjdk.org Fri Oct 4 17:32:42 2024 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Fri, 4 Oct 2024 17:32:42 GMT Subject: RFR: 8341379: Shenandoah: Improve lock contention during cleanup [v3] In-Reply-To: <2QJqfhzQfJk9ZDN1AI3QR2KpLvGA-djz8U-Cv7OeKGo=.708970da-45e6-41db-a3cd-86ca272d8a14@github.com> References: <5GMjFWaMjkSuxhPJ8TYJyw_LYdFfhJWrySSaRUjSNq4=.0ee7f896-35b2-43aa-a243-418b1ffb50b0@github.com> <2QJqfhzQfJk9ZDN1AI3QR2KpLvGA-djz8U-Cv7OeKGo=.708970da-45e6-41db-a3cd-86ca272d8a14@github.com> Message-ID: On Thu, 3 Oct 2024 21:30:12 GMT, Kelvin Nilsen wrote: >> This change improves the efficiency of cleaning up (recycling) regions that have been trashed by GC effort. The affected code runs at the end of FinalMark to reclaim immediate garbage. It runs at the end of FinalUpdateRefs to reclaim the regions that comprised the collection set, from which all live objects have now been evacuated. >> >> Efficiency improvements include: >> 1. 
Rather than invoking the os (while holding the Heap lock) to obtain the time twice for every region recycled, we invoke the os only once for each batch of 32 regions that are to be processed. >> 2. Rather than enforcing that the loop runs no longer than 30 us, we refrain from starting a second batch of regions if more than 8 us has passed since the preceding batch was processed. >> >> Below, each trial runs for 1 hour, processing 28,000 transactions per second. >> >> Without this change, latency for 4 un-named business services is represented by the following chart: >> ![image](https://github.com/user-attachments/assets/0e36025b-7b76-4e7a-ab07-303ea49479c3) >> >> With this change, latency for the same services is much better: >> ![image](https://github.com/user-attachments/assets/aceaf185-6944-4c91-b98e-06ccd1bc2d64) >> >> A comparison of the two is provided by the following: >> ![image](https://github.com/user-attachments/assets/7145f7b5-2a65-44b0-a94a-ddbc871f236b) > > Kelvin Nilsen has updated the pull request incrementally with one additional commit since the last revision: > > fix typo Nice! ------------- PR Comment: https://git.openjdk.org/jdk/pull/21211#issuecomment-2394202215 From kdnilsen at openjdk.org Fri Oct 4 17:32:43 2024 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Fri, 4 Oct 2024 17:32:43 GMT Subject: Integrated: 8341379: Shenandoah: Improve lock contention during cleanup In-Reply-To: <5GMjFWaMjkSuxhPJ8TYJyw_LYdFfhJWrySSaRUjSNq4=.0ee7f896-35b2-43aa-a243-418b1ffb50b0@github.com> References: <5GMjFWaMjkSuxhPJ8TYJyw_LYdFfhJWrySSaRUjSNq4=.0ee7f896-35b2-43aa-a243-418b1ffb50b0@github.com> Message-ID: On Thu, 26 Sep 2024 21:08:11 GMT, Kelvin Nilsen wrote: > This change improves the efficiency of cleaning up (recycling) regions that have been trashed by GC effort. The affected code runs at the end of FinalMark to reclaim immediate garbage. 
It runs at the end of FinalUpdateRefs to reclaim the regions that comprised the collection set, from which all live objects have now been evacuated. > > Efficiency improvements include: > 1. Rather than invoking the os (while holding the Heap lock) to obtain the time twice for every region recycled, we invoke the os only once for each batch of 32 regions that are to be processed. > 2. Rather than enforcing that the loop runs no longer than 30 us, we refrain from starting a second batch of regions if more than 8 us has passed since the preceding batch was processed. > > Below, each trial runs for 1 hour, processing 28,000 transactions per second. > > Without this change, latency for 4 un-named business services is represented by the following chart: > ![image](https://github.com/user-attachments/assets/0e36025b-7b76-4e7a-ab07-303ea49479c3) > > With this change, latency for the same services is much better: > ![image](https://github.com/user-attachments/assets/aceaf185-6944-4c91-b98e-06ccd1bc2d64) > > A comparison of the two is provided by the following: > ![image](https://github.com/user-attachments/assets/7145f7b5-2a65-44b0-a94a-ddbc871f236b) This pull request has now been integrated. Changeset: f5f0852f Author: Kelvin Nilsen URL: https://git.openjdk.org/jdk/commit/f5f0852f51d3dc1001bf3d68b89f4aab31e05e61 Stats: 40 lines in 1 file changed: 32 ins; 4 del; 4 mod 8341379: Shenandoah: Improve lock contention during cleanup Reviewed-by: xpeng, phh, wkemper ------------- PR: https://git.openjdk.org/jdk/pull/21211 From wkemper at openjdk.org Fri Oct 4 19:28:08 2024 From: wkemper at openjdk.org (William Kemper) Date: Fri, 4 Oct 2024 19:28:08 GMT Subject: RFR: 8341554: Shenandoah: Missing heap lock when updating usage for soft ref policy Message-ID: A recent change to avoid checking the log level under the lock inadvertently removed the lock from an operation that needs it. 
------------- Commit messages: - Restore missing heap lock when updating usage at last gc Changes: https://git.openjdk.org/jdk/pull/21362/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=21362&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8341554 Stats: 7 lines in 1 file changed: 3 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/21362.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/21362/head:pull/21362 PR: https://git.openjdk.org/jdk/pull/21362 From kdnilsen at openjdk.org Fri Oct 4 19:28:08 2024 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Fri, 4 Oct 2024 19:28:08 GMT Subject: RFR: 8341554: Shenandoah: Missing heap lock when updating usage for soft ref policy In-Reply-To: References: Message-ID: <8olxAEXHXd9rybQFElBljiWNI2pfyAk-jnsKptCDwsw=.8963ecb2-8d1c-444f-a7a2-417232a1690e@github.com> On Fri, 4 Oct 2024 19:18:42 GMT, William Kemper wrote: > A recent change to avoid checking the log level under the lock inadvertently removed the lock from an operation that needs it. Marked as reviewed by kdnilsen (Author). ------------- PR Review: https://git.openjdk.org/jdk/pull/21362#pullrequestreview-2348975142 From ysr at openjdk.org Fri Oct 4 20:43:36 2024 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Fri, 4 Oct 2024 20:43:36 GMT Subject: RFR: 8341554: Shenandoah: Missing heap lock when updating usage for soft ref policy In-Reply-To: References: Message-ID: On Fri, 4 Oct 2024 19:18:42 GMT, William Kemper wrote: > A recent change to avoid checking the log level under the lock inadvertently removed the lock from an operation that needs it. Good catch by @kdnilsen ! Sorry for missing this in the previous review at https://github.com/openjdk/jdk/pull/19915 for @pengxiaolong. 
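A minimal C++ sketch of the pattern this fix restores, with hypothetical names (HeapUsage, after_cycle) and std::mutex standing in for the heap lock; it is not Shenandoah's code. The cheap log-level check happens outside the lock, while the update of both at-GC fields always runs under the lock, so a reader holding the same lock sees capacity and used captured as a consistent pair.

```cpp
#include <cassert>
#include <cstddef>
#include <iostream>
#include <mutex>
#include <utility>

// Hypothetical stand-in for the heap state involved; not Shenandoah's classes.
struct HeapUsage {
  std::mutex lock;  // plays the role of the heap lock
  std::size_t capacity = 0;
  std::size_t used = 0;
  std::size_t capacity_at_gc = 0;
  std::size_t used_at_gc = 0;

  // Caller must hold 'lock', so both fields are captured at the same instant.
  void update_capacity_and_used_at_gc_locked() {
    assert(used <= capacity);  // sanity check for this illustration
    capacity_at_gc = capacity;
    used_at_gc = used;
  }

  // Readers take the same lock and therefore see a matching pair.
  std::pair<std::size_t, std::size_t> at_gc_snapshot() {
    std::lock_guard<std::mutex> guard(lock);
    return {capacity_at_gc, used_at_gc};
  }
};

void after_cycle(HeapUsage& heap, bool log_enabled) {
  // The log-level check itself needs no lock; take it only if we actually log.
  if (log_enabled) {
    std::lock_guard<std::mutex> guard(heap.lock);
    std::cout << "used " << heap.used << " of " << heap.capacity << " bytes\n";
  }
  // The update must hold the lock whether or not logging is enabled.
  {
    std::lock_guard<std::mutex> guard(heap.lock);
    heap.update_capacity_and_used_at_gc_locked();
  }
}
```

Keeping the logging lock and the update lock separate, as here, costs an extra lock/unlock only when logging is on, which matches the trade-off discussed in the review.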
src/hotspot/share/gc/shenandoah/shenandoahControlThread.cpp line 186: > 184: ShenandoahHeapLocker locker(heap->lock()); > 185: heap->update_capacity_and_used_at_gc(); > 186: } If free set logging is enabled this does a lock/unlock for logging and then another lock/unlock for updating capacity. But I guess it's unavoidable, and it likely won't be the case that we have free set logging enabled in performance critical production situations, so not worth wasting too much sleep over. Reviewed! ------------- Marked as reviewed by ysr (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/21362#pullrequestreview-2349091755 PR Review Comment: https://git.openjdk.org/jdk/pull/21362#discussion_r1788281685 From ysr at openjdk.org Fri Oct 4 20:52:37 2024 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Fri, 4 Oct 2024 20:52:37 GMT Subject: RFR: 8341554: Shenandoah: Missing heap lock when updating usage for soft ref policy In-Reply-To: References:

Message-ID: On Fri, 4 Oct 2024 20:34:58 GMT, Y. Srinivas Ramakrishna wrote: >> A recent change to avoid checking the log level under the lock inadvertently removed the lock from an operation that needs it. > > src/hotspot/share/gc/shenandoah/shenandoahControlThread.cpp line 186: > >> 184: ShenandoahHeapLocker locker(heap->lock()); >> 185: heap->update_capacity_and_used_at_gc(); >> 186: } > > If free set logging is enabled this does a lock/unlock for logging and then another lock/unlock for updating capacity. > > But I guess it's unavoidable, and it likely won't be the case that we have free set logging enabled in performance critical production situations, so not worth wasting too much sleep over. > > Reviewed! @earthling-amzn : I'd suggest leaving a comment either in the ticket or in the PR (or here in the code?) stating the race that we are vulnerable to if we don't hold the lock, namely the potential skew between used & capacity, and why it's important not to have that skew for consumers of these fields. I imagine that we can demonstrate the issue with a targeted regression test. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/21362#discussion_r1788297486 From wkemper at openjdk.org Fri Oct 4 21:58:44 2024 From: wkemper at openjdk.org (William Kemper) Date: Fri, 4 Oct 2024 21:58:44 GMT Subject: Integrated: 8341554: Shenandoah: Missing heap lock when updating usage for soft ref policy In-Reply-To: References: Message-ID: On Fri, 4 Oct 2024 19:18:42 GMT, William Kemper wrote: > A recent change to avoid checking the log level under the lock inadvertently removed the lock from an operation that needs it. This pull request has now been integrated. 
Changeset: bade041d Author: William Kemper URL: https://git.openjdk.org/jdk/commit/bade041db82a09cf33d4dbcc849f5784b3851f3d Stats: 7 lines in 1 file changed: 3 ins; 0 del; 4 mod 8341554: Shenandoah: Missing heap lock when updating usage for soft ref policy Reviewed-by: kdnilsen, ysr ------------- PR: https://git.openjdk.org/jdk/pull/21362 From wkemper at openjdk.org Fri Oct 4 21:58:44 2024 From: wkemper at openjdk.org (William Kemper) Date: Fri, 4 Oct 2024 21:58:44 GMT Subject: RFR: 8341554: Shenandoah: Missing heap lock when updating usage for soft ref policy In-Reply-To: References:

Message-ID: On Fri, 4 Oct 2024 20:48:35 GMT, Y. Srinivas Ramakrishna wrote: >> src/hotspot/share/gc/shenandoah/shenandoahControlThread.cpp line 186: >> >>> 184: ShenandoahHeapLocker locker(heap->lock()); >>> 185: heap->update_capacity_and_used_at_gc(); >>> 186: } >> >> If free set logging is enabled this does a lock/unlock for logging and then another lock/unlock for updating capacity. >> >> But I guess it's unavoidable, and it likely won't be the case that we have free set logging enabled in performance critical production situations, so not worth wasting too much sleep over. >> >> Reviewed! > > @earthling-amzn : I'd suggest leaving a comment either in the ticket or in the PR (or here in the code?) stating the race that we are vulnerable to if we don't hold the lock, namely the potential skew between used & capacity, and why it's important not to have that skew for consumers of these fields. I imagine that we can demonstrate the issue with a targeted regression test. Yeah, I think coupling the lock with the logging and `update_capacity_and_used` is what led us into trouble to begin with. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/21362#discussion_r1788371332 From jonathanjoo at google.com Fri Oct 4 22:39:56 2024 From: jonathanjoo at google.com (Jonathan Joo) Date: Fri, 4 Oct 2024 15:39:56 -0700 Subject: Further discussion on Adaptable Heap Sizing with G1 Message-ID: Hi All,

As Kirk mentioned in his email "Aligning the Serial collector with ZGC", we are also working on adding Adaptable Heap Sizing (AHS) to G1. I created a draft pull request and have already received some comments on it, including the following points:

1. I should convert CurrentMaxExpansionSize to CurrentMaxHeapSize.
2. SoftMaxHeapSize, as implemented in the PR, is different from the original intent.
3. We need some sort of global memory pressure to enforce heap shrinkage.
I have already addressed the first point on the pull request, and I agree that CurrentMaxHeapSize works well :)

Regarding the second point, we have already had some discussions outside of this mailing list, but to summarize the main ideas:

1. The original intent of SoftMaxHeapSize was for the GC to use this value as a guide for when to start a concurrent GC.
2. Our implementation of SoftMaxHeapSize (in the PR) currently behaves more like a ProposedHeapSize: whenever we shrink or expand the heap, we try to set the heap size to ProposedHeapSize, regardless of the value of MinHeapSize.
3. We need to ensure that the heap regularly approaches ProposedHeapSize by introducing some sort of periodic GC. We have a Google-internal patch for this, but it is not yet in the PR. If we agree that this makes sense, I can try adding it as a separate PR.

For the third point, similar to ZGC's -XX:ZGCPressure, we use target_gc_cpu_overhead (along with periodic GC) to control heap shrinkage by adjusting ProposedHeapSize. This allows users to balance RAM and CPU usage. Note that this isn't in the PR yet, since the PR only includes the JVM changes introducing the CurrentMaxHeapSize and ProposedHeapSize flags. The logic to dynamically calculate ProposedHeapSize would be a separate change, and can be iterated on.

As a separate point, Kirk mentioned in his email that he aims to introduce an adaptive size policy where "Heap should be large enough to minimize GC overhead but not large enough to trigger OOM". From our experience with G1, we don't actively try to minimize GC overhead, because we find that accepting a higher GC overhead often yields RAM savings that far outweigh the extra CPU usage.
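One possible shape for the ProposedHeapSize control loop described above, driven by target_gc_cpu_overhead and capped by CurrentMaxHeapSize, is the C++ sketch below. It is a hypothetical illustration, not the Google-internal patch or HotSpot code. `AhsInputs`, `next_proposed_heap_size`, the fixed 10% step, and `lower_bound_bytes` are all assumptions. The floor is deliberately kept separate from MinHeapSize, which the email says the proposal ignores.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical inputs for one adjustment step (not real HotSpot flags/types).
struct AhsInputs {
  double gc_cpu_overhead;               // measured fraction of CPU spent in GC
  double target_gc_cpu_overhead;        // user-chosen budget, e.g. 0.05
  std::size_t lower_bound_bytes;        // hypothetical floor for the proposal
  std::size_t current_max_heap_bytes;   // hard cap, analogous to CurrentMaxHeapSize
};

// Moves the proposed heap size one step toward the CPU-overhead target.
// Assumes step_percent <= 100, so the shrink step never exceeds `proposed`.
std::size_t next_proposed_heap_size(std::size_t proposed, const AhsInputs& in,
                                    std::size_t step_percent = 10) {
  assert(in.lower_bound_bytes <= in.current_max_heap_bytes);
  const std::size_t step = proposed * step_percent / 100;
  std::size_t next = proposed;
  if (in.gc_cpu_overhead < in.target_gc_cpu_overhead) {
    next = proposed - step;  // GC is cheaper than budgeted: trade CPU for RAM
  } else if (in.gc_cpu_overhead > in.target_gc_cpu_overhead) {
    next = proposed + step;  // GC is too expensive: let the heap grow
  }
  // The proposal is only a target; the maximum heap size remains the hard limit.
  return std::min(std::max(next, in.lower_bound_bytes), in.current_max_heap_bytes);
}
```

A lower `target_gc_cpu_overhead` makes the loop hold more memory, and a higher one gives memory back. Because a smaller proposal only takes effect when G1 actually resizes the heap, a controller like this would depend on the periodic GC mentioned in point 3 for the heap to track it.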
As a general summary, the way I see it, there's value in creating a simplified system where we control most JVM behavior with just two flags: the maximum heap size (to prevent OOMs), and a target heap size, which is our calculation of an "optimal" size based on our understanding of the environment. The exact calculation of this optimal size may change depending on workload or preference. What we are trying to do at this point is provide a way to pass in some calculation of "optimal heap size" and have G1 react to it in a meaningful way.

I acknowledge that the current JVM behavior (as implemented in my PR) may be suboptimal at getting the heap to reach and stay at this "optimal heap size". However, even with the basic implementation, which passes this value to shrinks/expands and only triggers resizes on Remarks/Full GCs, we've seen dramatic improvements in heap behavior at Google compared to the current G1.

I know there was some disagreement about adding this new "optimal heap size" flag, and I agree that SoftMaxHeapSize is probably not the right flag to represent this value. But I'd like to get some thoughts on whether the above summary is a reasonable way of reasoning about G1 AHS. If we agree, we can always iteratively improve the JVM logic to better adhere to the optimal heap size. But it's not yet clear to me whether people are on board with the idea of having this "optimal heap size" calculation at all, since perhaps this functionality could be covered in other, existing ways.

Thank you!
~ Jonathan
From tschatzl at openjdk.org Mon Oct 7 08:27:41 2024 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Mon, 7 Oct 2024 08:27:41 GMT Subject: RFR: 8341238: G1: Refactor G1Policy to move collection set selection methods into G1CollectionSet In-Reply-To: References: Message-ID: On Fri, 4 Oct 2024 07:25:32 GMT, Ivan Walulya wrote: > Hi, > > Please review this code migration patch to move collection set candidate selection methods out of G1Policy into G1CollectionSet. By relocating these methods, we can simplify the method signatures and reduce the interdependency between the classes. > > Testing: Tier 1 Marked as reviewed by tschatzl (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/21347#pullrequestreview-2351254930 From stefank at openjdk.org Mon Oct 7 08:55:59 2024 From: stefank at openjdk.org (Stefan Karlsson) Date: Mon, 7 Oct 2024 08:55:59 GMT Subject: RFR: 8305895: Implement JEP 450: Compact Object Headers (Experimental) [v30] In-Reply-To: References:

<-UEFgAIQjGBginN0JRoyuwwJLmDssUEQGE_tCP-tRkw=.01ef3f37-01fa-4931-b4f3-571d21252bbd@github.com> Message-ID: On Thu, 19 Sep 2024 05:36:41 GMT, Stefan Karlsson wrote: >> src/hotspot/share/gc/parallel/psParallelCompact.cpp line 787: >> >>> 785: // The gap is always equal to min-fill-size, so nothing to do. >>> 786: return; >>> 787: } >> >> Reading the comment, it is not obvious that this is correct if you set MinObjectAlignment to something larger than the default value: >> >> void PSParallelCompact::fill_dense_prefix_end(SpaceId id) { >> // Comparing two sizes to decide if filling is required: >> // >> // The size of the filler (min-obj-size) is 2 heap words with the default >> // MinObjAlignment, since both markword and klass take 1 heap word. >> // >> // The size of the gap (if any) right before dense-prefix-end is >> // MinObjAlignment. >> // >> // Need to fill in the gap only if it's smaller than min-obj-size, and the >> // filler obj will extend to next region. >> >> // Note: If min-fill-size decreases to 1, this whole method becomes redundant. >> if (UseCompactObjectHeaders) { >> // The gap is always equal to min-fill-size, so nothing to do. >> return; >> } >> assert(CollectedHeap::min_fill_size() >= 2, "inv"); > > Style note: The added code is inserted between a comment and the code that the comment refers to. It would be nice to tidy this up. Did you figure out if the code above is correct w.r.t. `MinObjectAlignment=16`? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20677#discussion_r1789797050 From stefank at openjdk.org Mon Oct 7 08:55:59 2024 From: stefank at openjdk.org (Stefan Karlsson) Date: Mon, 7 Oct 2024 08:55:59 GMT Subject: RFR: 8305895: Implement JEP 450: Compact Object Headers (Experimental) [v9] In-Reply-To: References: <6Rant6SjxpFIHHWNthWc_plOdnGpWPvqj3rxRe144po=.bcdbad7a-e93a-41a3-b958-6ae602c7e083@github.com>