From rkennke at openjdk.org Thu Dec 1 09:46:14 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 1 Dec 2022 09:46:14 GMT Subject: RFR: 8297285: Shenandoah pacing causes assertion failure during VM initialization [v2] In-Reply-To: References: Message-ID: <8akQ83Qu0QVUqLJWmXmIjK8q3rnkiPtj9C_H754etQU=.53980eeb-8c04-4b56-baf5-d588903206a8@github.com> On Thu, 24 Nov 2022 23:53:44 GMT, Ashutosh Mehra wrote: >> Please review the fix for the assertion failure seen during VM init due to pacing in shenandoah gc. >> The fix is to avoid pacing during VM initialization as the main thread is not yet an active java thread. >> >> Signed-off-by: Ashutosh Mehra > > Ashutosh Mehra has updated the pull request incrementally with one additional commit since the last revision: > > Include runtime/javaThread.inline.hpp for JavaThread::is_terminated() to > fix compile failure > > Signed-off-by: Ashutosh Mehra Looks good to me! Thank you, Ashu! ------------- Marked as reviewed by rkennke (Reviewer). PR: https://git.openjdk.org/jdk/pull/11360 From duke at openjdk.org Thu Dec 1 19:39:21 2022 From: duke at openjdk.org (Ashutosh Mehra) Date: Thu, 1 Dec 2022 19:39:21 GMT Subject: RFR: 8297285: Shenandoah pacing causes assertion failure during VM initialization [v2] In-Reply-To: <8akQ83Qu0QVUqLJWmXmIjK8q3rnkiPtj9C_H754etQU=.53980eeb-8c04-4b56-baf5-d588903206a8@github.com> References: <8akQ83Qu0QVUqLJWmXmIjK8q3rnkiPtj9C_H754etQU=.53980eeb-8c04-4b56-baf5-d588903206a8@github.com> Message-ID: On Thu, 1 Dec 2022 09:42:23 GMT, Roman Kennke wrote: >> Ashutosh Mehra has updated the pull request incrementally with one additional commit since the last revision: >> >> Include runtime/javaThread.inline.hpp for JavaThread::is_terminated() to >> fix compile failure >> >> Signed-off-by: Ashutosh Mehra > > Looks good to me! Thank you, Ashu! @rkennke thanks for suggesting and reviewing the fix. ------------- PR: https://git.openjdk.org/jdk/pull/11360 From phh at openjdk.org Fri Dec 2 00:19:14 2022 From: phh at openjdk.org (Paul Hohensee) Date: Fri, 2 Dec 2022 00:19:14 GMT Subject: RFR: 8297285: Shenandoah pacing causes assertion failure during VM initialization [v2] In-Reply-To: References: Message-ID: <_3PHeiuhtRZjH1XTXF6in0fwUQFjAssUB8AkDUY8jcM=.6a309505-1fda-493b-9fb5-0c3aa2d74566@github.com> On Thu, 24 Nov 2022 23:53:44 GMT, Ashutosh Mehra wrote: >> Please review the fix for the assertion failure seen during VM init due to pacing in shenandoah gc. >> The fix is to avoid pacing during VM initialization as the main thread is not yet an active java thread. >> >> Signed-off-by: Ashutosh Mehra > > Ashutosh Mehra has updated the pull request incrementally with one additional commit since the last revision: > > Include runtime/javaThread.inline.hpp for JavaThread::is_terminated() to > fix compile failure > > Signed-off-by: Ashutosh Mehra I'm not familiar with this code, so please bear with me. :) The comment on line 246 says "Thread which is not an active Java thread should also not block.", but the check at line 251 will return (i.e., looks like not block) if the current thread is an active Java thread. Should the check be !current->is_active_Java_thread() instead? 
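For concreteness, a minimal sketch of the guard as I read the intent — the enclosing method and surrounding pacer code are assumed from the discussion, not quoted from the patch:

    // Hypothetical shape of the early return in the pacer's blocking path
    // (e.g. ShenandoahPacer::pace_for_alloc); the context is assumed:
    JavaThread* current = JavaThread::current();
    if (!current->is_active_Java_thread()) {
      // A thread that is not yet (or no longer) an active Java thread,
      // e.g. the main thread during VM initialization, must not block here.
      return;
    }

Note that is_active_Java_thread() is implemented in terms of is_terminated(), which is presumably why the earlier revision needed to include runtime/javaThread.inline.hpp.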
------------- PR: https://git.openjdk.org/jdk/pull/11360 From duke at openjdk.org Fri Dec 2 03:46:25 2022 From: duke at openjdk.org (Ashutosh Mehra) Date: Fri, 2 Dec 2022 03:46:25 GMT Subject: RFR: 8297285: Shenandoah pacing causes assertion failure during VM initialization [v2] In-Reply-To: <_3PHeiuhtRZjH1XTXF6in0fwUQFjAssUB8AkDUY8jcM=.6a309505-1fda-493b-9fb5-0c3aa2d74566@github.com> References: <_3PHeiuhtRZjH1XTXF6in0fwUQFjAssUB8AkDUY8jcM=.6a309505-1fda-493b-9fb5-0c3aa2d74566@github.com> Message-ID: On Fri, 2 Dec 2022 00:17:00 GMT, Paul Hohensee wrote: >> Ashutosh Mehra has updated the pull request incrementally with one additional commit since the last revision: >> >> Include runtime/javaThread.inline.hpp for JavaThread::is_terminated() to >> fix compile failure >> >> Signed-off-by: Ashutosh Mehra > > I'm not familiar with this code, so please bear with me. :) The comment on line 246 says "Thread which is not an active Java thread should also not block.", but the check at line 251 will return (i.e., looks like not block) if the current thread is an active Java thread. Should the check be !current->is_active_Java_thread() instead? @phohensee you are right. It should be `!current->is_active_Java_thread()`, how did I miss that `!`! Thanks for catching it in time. ------------- PR: https://git.openjdk.org/jdk/pull/11360 From duke at openjdk.org Fri Dec 2 04:02:33 2022 From: duke at openjdk.org (Ashutosh Mehra) Date: Fri, 2 Dec 2022 04:02:33 GMT Subject: RFR: 8297285: Shenandoah pacing causes assertion failure during VM initialization [v3] In-Reply-To: References: Message-ID: > Please review the fix for the assertion failure seen during VM init due to pacing in shenandoah gc. > The fix is to avoid pacing during VM initialization as the main thread is not yet an active java thread. > > Signed-off-by: Ashutosh Mehra Ashutosh Mehra has updated the pull request incrementally with one additional commit since the last revision: Fix the condition that the current thread is not an active java thread Signed-off-by: Ashutosh Mehra ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11360/files - new: https://git.openjdk.org/jdk/pull/11360/files/60f174fc..17a7b3bf Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11360&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11360&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/11360.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11360/head:pull/11360 PR: https://git.openjdk.org/jdk/pull/11360 From phh at openjdk.org Fri Dec 2 13:32:09 2022 From: phh at openjdk.org (Paul Hohensee) Date: Fri, 2 Dec 2022 13:32:09 GMT Subject: RFR: 8297285: Shenandoah pacing causes assertion failure during VM initialization [v3] In-Reply-To: References: Message-ID: On Fri, 2 Dec 2022 04:02:33 GMT, Ashutosh Mehra wrote: >> Please review the fix for the assertion failure seen during VM init due to pacing in shenandoah gc. >> The fix is to avoid pacing during VM initialization as the main thread is not yet an active java thread. >> >> Signed-off-by: Ashutosh Mehra > > Ashutosh Mehra has updated the pull request incrementally with one additional commit since the last revision: > > Fix the condition that the current thread is not an active java thread > > Signed-off-by: Ashutosh Mehra Looks good now. ------------- Marked as reviewed by phh (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/11360 From rkennke at openjdk.org Fri Dec 2 14:06:20 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Fri, 2 Dec 2022 14:06:20 GMT Subject: RFR: 8297285: Shenandoah pacing causes assertion failure during VM initialization [v3] In-Reply-To: References: Message-ID: On Fri, 2 Dec 2022 04:02:33 GMT, Ashutosh Mehra wrote: >> Please review the fix for the assertion failure seen during VM init due to pacing in shenandoah gc. >> The fix is to avoid pacing during VM initialization as the main thread is not yet an active java thread. >> >> Signed-off-by: Ashutosh Mehra > > Ashutosh Mehra has updated the pull request incrementally with one additional commit since the last revision: > > Fix the condition that the current thread is not an active java thread > > Signed-off-by: Ashutosh Mehra Marked as reviewed by rkennke (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/11360 From duke at openjdk.org Fri Dec 2 14:25:32 2022 From: duke at openjdk.org (Ashutosh Mehra) Date: Fri, 2 Dec 2022 14:25:32 GMT Subject: Integrated: 8297285: Shenandoah pacing causes assertion failure during VM initialization In-Reply-To: References: Message-ID: <0oaTeJUG3fC-F1V499dwXovnPUGcgR6gILCgnKJd_mY=.6a5dcd3d-4b97-47ed-a029-67122be03ef8@github.com> On Thu, 24 Nov 2022 21:57:06 GMT, Ashutosh Mehra wrote: > Please review the fix for the assertion failure seen during VM init due to pacing in shenandoah gc. > The fix is to avoid pacing during VM initialization as the main thread is not yet an active java thread. > > Signed-off-by: Ashutosh Mehra This pull request has now been integrated. Changeset: 415cfd2e Author: Ashutosh Mehra Committer: Paul Hohensee URL: https://git.openjdk.org/jdk/commit/415cfd2e28e6b7613712ab63a1ab66522e9bf0f2 Stats: 8 lines in 1 file changed: 7 ins; 0 del; 1 mod 8297285: Shenandoah pacing causes assertion failure during VM initialization Reviewed-by: rkennke, phh ------------- PR: https://git.openjdk.org/jdk/pull/11360 From wkemper at openjdk.org Sat Dec 3 01:16:20 2022 From: wkemper at openjdk.org (William Kemper) Date: Sat, 3 Dec 2022 01:16:20 GMT Subject: RFR: Generation resizing Message-ID: These changes have the generational mode track the minimum mutator utilization (percentage of process time used by mutators). When it falls below a configuration percentage (GCTimeRatio), a heuristic will transfer memory capacity to whatever generation has been using more CPU time. The assumption here is that by increasing capacity, we will decrease the collection frequency and improve the MMU. ------------- Commit messages: - Document the class responsible for adjusting generation sizes - Revert unnecessary change - Remove unused time between cycle tracking - Remove vestigial mmu tracker instance - Clamp adjustments to min/max when increment is too large - Adjust generation sizes from safepoint - Fix crash in SATB mode, always log average MMU on scheduled interval - Limits on generation size adjustments, log young/old heap occupancy in generational mode - WIP: Transfer up to 10% capacity to undersized generation - WIP: Track idle gc time and mmu averages, rename confusing method name - ... 
and 3 more: https://git.openjdk.org/shenandoah/compare/998f68b2...b916a909 Changes: https://git.openjdk.org/shenandoah/pull/177/files Webrev: https://webrevs.openjdk.org/?repo=shenandoah&pr=177&range=00 Stats: 449 lines in 22 files changed: 419 ins; 18 del; 12 mod Patch: https://git.openjdk.org/shenandoah/pull/177.diff Fetch: git fetch https://git.openjdk.org/shenandoah pull/177/head:pull/177 PR: https://git.openjdk.org/shenandoah/pull/177 From mcimadamore at openjdk.org Mon Dec 5 10:31:52 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Mon, 5 Dec 2022 10:31:52 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v39] In-Reply-To: References: Message-ID: <-V_N0Cvh4J0vKNbBYdFcow9E8yFHRIjya8n69MpDSuY=.9626ee4d-95b6-41e4-b21e-395e79840388@github.com> > This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. > > [1] - https://openjdk.org/jeps/434 Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: Fix Preview annotation for JEP 434 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10872/files - new: https://git.openjdk.org/jdk/pull/10872/files/8b5dc0f0..33b834ca Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=38 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=37-38 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10872.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10872/head:pull/10872 PR: https://git.openjdk.org/jdk/pull/10872 From sundar at openjdk.org Mon Dec 5 11:03:15 2022 From: sundar at openjdk.org (Athijegannathan Sundararajan) Date: Mon, 5 Dec 2022 11:03:15 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v39] In-Reply-To: <-V_N0Cvh4J0vKNbBYdFcow9E8yFHRIjya8n69MpDSuY=.9626ee4d-95b6-41e4-b21e-395e79840388@github.com> References: <-V_N0Cvh4J0vKNbBYdFcow9E8yFHRIjya8n69MpDSuY=.9626ee4d-95b6-41e4-b21e-395e79840388@github.com> Message-ID: On Mon, 5 Dec 2022 10:31:52 GMT, Maurizio Cimadamore wrote: >> This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. >> >> [1] - https://openjdk.org/jeps/434 > > Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: > > Fix Preview annotation for JEP 434 LGTM ------------- Marked as reviewed by sundar (Reviewer). PR: https://git.openjdk.org/jdk/pull/10872 From rkennke at openjdk.org Mon Dec 5 11:07:01 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Mon, 5 Dec 2022 11:07:01 GMT Subject: RFR: Generation resizing In-Reply-To: References: Message-ID: On Sat, 3 Dec 2022 01:09:59 GMT, William Kemper wrote: > These changes have the generational mode track the minimum mutator utilization (percentage of process time used by mutators). When it falls below a configuration percentage (GCTimeRatio), a heuristic will transfer memory capacity to whatever generation has been using more CPU time. The assumption here is that by increasing capacity, we will decrease the collection frequency and improve the MMU. Thanks, William! The PR has merge conflicts, can you resolve them? 
Thanks, Roman ------------- PR: https://git.openjdk.org/shenandoah/pull/177 From rkennke at openjdk.org Mon Dec 5 11:16:15 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Mon, 5 Dec 2022 11:16:15 GMT Subject: RFR: Generation resizing In-Reply-To: References: Message-ID: On Sat, 3 Dec 2022 01:09:59 GMT, William Kemper wrote: > These changes have the generational mode track the minimum mutator utilization (percentage of process time used by mutators). When it falls below a configuration percentage (GCTimeRatio), a heuristic will transfer memory capacity to whatever generation has been using more CPU time. The assumption here is that by increasing capacity, we will decrease the collection frequency and improve the MMU. Thank you for implementing this useful change! I have a few questions and comments. src/hotspot/share/gc/shenandoah/shenandoahControlThread.cpp line 451: > 449: void ShenandoahControlThread::service_concurrent_normal_cycle( > 450: const ShenandoahHeap* heap, const GenerationMode generation, GCCause::Cause cause) { > 451: GCIdMark gc_id_mark; Why does the GCIdMark need to move around? src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp line 1791: > 1789: > 1790: void ShenandoahHeap::on_cycle_start(GCCause::Cause cause, ShenandoahGeneration* generation) { > 1791: log_info(gc)("on_cycle_start: %s", generation->name()); What is that logging for/ what does the log message mean? I'd either improve the log message or remove the logging (or make it dev+trace) if it was only for dev purposes. src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp line 1804: > 1802: > 1803: void ShenandoahHeap::on_cycle_end(ShenandoahGeneration* generation) { > 1804: log_info(gc)("on_cycle_end: %s", generation->name()); Same here. src/hotspot/share/gc/shenandoah/shenandoahMmuTracker.cpp line 25: > 23: */ > 24: > 25: #include "gc/shenandoah/shenandoahMmuTracker.hpp" You need to include precompiled.hpp here. src/hotspot/share/gc/shenandoah/shenandoahMmuTracker.hpp line 28: > 26: #define SHARE_GC_SHENANDOAH_SHENANDOAHMMUTRACKER_HPP > 27: > 28: #include "memory/iterator.hpp" What do we need the iterator.hpp for? ------------- PR: https://git.openjdk.org/shenandoah/pull/177 From mcimadamore at openjdk.org Mon Dec 5 13:49:46 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Mon, 5 Dec 2022 13:49:46 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v39] In-Reply-To: <-V_N0Cvh4J0vKNbBYdFcow9E8yFHRIjya8n69MpDSuY=.9626ee4d-95b6-41e4-b21e-395e79840388@github.com> References: <-V_N0Cvh4J0vKNbBYdFcow9E8yFHRIjya8n69MpDSuY=.9626ee4d-95b6-41e4-b21e-395e79840388@github.com> Message-ID: On Mon, 5 Dec 2022 10:31:52 GMT, Maurizio Cimadamore wrote: >> This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. 
>> >> [1] - https://openjdk.org/jeps/434 > > Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: > > Fix Preview annotation for JEP 434 Note: there are 4 tests failing in x86: * MemoryLayoutPrincipalTotalityTest * MemoryLayoutTypeRetentionTest * TestLargeSegmentCopy * TestLinker These failures are addressed in the dependent PR: https://git.openjdk.org/jdk/pull/11019, which will be integrated immediately after these changes ------------- PR: https://git.openjdk.org/jdk/pull/10872 From mcimadamore at openjdk.org Mon Dec 5 13:55:22 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Mon, 5 Dec 2022 13:55:22 GMT Subject: Integrated: 8295044: Implementation of Foreign Function and Memory API (Second Preview) In-Reply-To: References: Message-ID: <7Ara-NxY9rdQzABZPYR9T-N7b1XLY99_6J-dG3cr2NY=.4151c690-0138-4ffd-a763-ff2456754189@github.com> On Wed, 26 Oct 2022 13:11:50 GMT, Maurizio Cimadamore wrote: > This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. > > [1] - https://openjdk.org/jeps/434 This pull request has now been integrated. Changeset: 73baadce Author: Maurizio Cimadamore URL: https://git.openjdk.org/jdk/commit/73baadceb60029f6340c1327118aeb59971c2434 Stats: 13808 lines in 255 files changed: 5780 ins; 4448 del; 3580 mod 8295044: Implementation of Foreign Function and Memory API (Second Preview) Co-authored-by: Jorn Vernee Co-authored-by: Per Minborg Co-authored-by: Maurizio Cimadamore Reviewed-by: jvernee, pminborg, psandoz, alanb, sundar ------------- PR: https://git.openjdk.org/jdk/pull/10872 From wkemper at openjdk.org Mon Dec 5 17:04:12 2022 From: wkemper at openjdk.org (William Kemper) Date: Mon, 5 Dec 2022 17:04:12 GMT Subject: RFR: Generation resizing In-Reply-To: References: Message-ID: On Mon, 5 Dec 2022 11:07:23 GMT, Roman Kennke wrote: >> These changes have the generational mode track the minimum mutator utilization (percentage of process time used by mutators). When it falls below a configuration percentage (GCTimeRatio), a heuristic will transfer memory capacity to whatever generation has been using more CPU time. The assumption here is that by increasing capacity, we will decrease the collection frequency and improve the MMU. > > src/hotspot/share/gc/shenandoah/shenandoahControlThread.cpp line 451: > >> 449: void ShenandoahControlThread::service_concurrent_normal_cycle( >> 450: const ShenandoahHeap* heap, const GenerationMode generation, GCCause::Cause cause) { >> 451: GCIdMark gc_id_mark; > > Why does the GCIdMark need to move around? I pulled up the gcid mark because every old generation collection is preceded by a "bootstrap" young collection. Logically, this bootstrap phase belongs to the old collection so it should have the same GC id as the subsequent old marking phase. ------------- PR: https://git.openjdk.org/shenandoah/pull/177 From wkemper at openjdk.org Mon Dec 5 17:17:20 2022 From: wkemper at openjdk.org (William Kemper) Date: Mon, 5 Dec 2022 17:17:20 GMT Subject: RFR: Generation resizing In-Reply-To: References: Message-ID: On Mon, 5 Dec 2022 11:10:05 GMT, Roman Kennke wrote: >> These changes have the generational mode track the minimum mutator utilization (percentage of process time used by mutators). 
When it falls below a configuration percentage (GCTimeRatio), a heuristic will transfer memory capacity to whatever generation has been using more CPU time. The assumption here is that by increasing capacity, we will decrease the collection frequency and improve the MMU. > > src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp line 1791: > >> 1789: >> 1790: void ShenandoahHeap::on_cycle_start(GCCause::Cause cause, ShenandoahGeneration* generation) { >> 1791: log_info(gc)("on_cycle_start: %s", generation->name()); > > What is that logging for/ what does the log message mean? I'd either improve the log message or remove the logging (or make it dev+trace) if it was only for dev purposes. Will remove this. > src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp line 1804: > >> 1802: >> 1803: void ShenandoahHeap::on_cycle_end(ShenandoahGeneration* generation) { >> 1804: log_info(gc)("on_cycle_end: %s", generation->name()); > > Same here. Will remove this. ------------- PR: https://git.openjdk.org/shenandoah/pull/177 From wkemper at openjdk.org Mon Dec 5 17:29:47 2022 From: wkemper at openjdk.org (William Kemper) Date: Mon, 5 Dec 2022 17:29:47 GMT Subject: RFR: Generation resizing In-Reply-To: References: Message-ID: On Mon, 5 Dec 2022 11:10:54 GMT, Roman Kennke wrote: >> These changes have the generational mode track the minimum mutator utilization (percentage of process time used by mutators). When it falls below a configuration percentage (GCTimeRatio), a heuristic will transfer memory capacity to whatever generation has been using more CPU time. The assumption here is that by increasing capacity, we will decrease the collection frequency and improve the MMU. > > src/hotspot/share/gc/shenandoah/shenandoahMmuTracker.cpp line 25: > >> 23: */ >> 24: >> 25: #include "gc/shenandoah/shenandoahMmuTracker.hpp" > > You need to include precompiled.hpp here. Done (added this to my new file template as well). ------------- PR: https://git.openjdk.org/shenandoah/pull/177 From wkemper at openjdk.org Mon Dec 5 19:52:35 2022 From: wkemper at openjdk.org (William Kemper) Date: Mon, 5 Dec 2022 19:52:35 GMT Subject: RFR: Generation resizing [v2] In-Reply-To: References: Message-ID: > These changes have the generational mode track the minimum mutator utilization (percentage of process time used by mutators). When it falls below a configuration percentage (GCTimeRatio), a heuristic will transfer memory capacity to whatever generation has been using more CPU time. The assumption here is that by increasing capacity, we will decrease the collection frequency and improve the MMU. William Kemper has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 15 commits: - Remove unnecessary logging, clean up imports - Merge from shenandoah/master - Document the class responsible for adjusting generation sizes - Revert unnecessary change - Remove unused time between cycle tracking - Remove vestigial mmu tracker instance - Clamp adjustments to min/max when increment is too large - Adjust generation sizes from safepoint - Fix crash in SATB mode, always log average MMU on scheduled interval - Limits on generation size adjustments, log young/old heap occupancy in generational mode - ... 
and 5 more: https://git.openjdk.org/shenandoah/compare/f90a7701...41f057fa ------------- Changes: https://git.openjdk.org/shenandoah/pull/177/files Webrev: https://webrevs.openjdk.org/?repo=shenandoah&pr=177&range=01 Stats: 447 lines in 22 files changed: 417 ins; 18 del; 12 mod Patch: https://git.openjdk.org/shenandoah/pull/177.diff Fetch: git fetch https://git.openjdk.org/shenandoah pull/177/head:pull/177 PR: https://git.openjdk.org/shenandoah/pull/177 From wkemper at openjdk.org Mon Dec 5 23:25:08 2022 From: wkemper at openjdk.org (William Kemper) Date: Mon, 5 Dec 2022 23:25:08 GMT Subject: RFR: Merge openjdk/jdk:master Message-ID: <-xErRk5G6IkHidX1x7cFmyAHd5mU7DRXN0zqH8IT07Q=.40fd9a98-5f4c-48df-9d5e-55a5adcf7c65@github.com> Weekly merge from upstream. Looks fine in testing. ------------- Commit messages: - Merge tag 'jdk-20+26' into merge-jdk-20-26 - 8297731: Remove redundant check in MutableBigInteger.divide - 8287400: Make BitMap range parameter names consistent - 8297584: G1 parallel phase event for scan heap roots is sent too often - 8294924: JvmtiExport::post_exception_throw() doesn't deal well with concurrent stack processing - 8296875: Generational ZGC: Refactor loom code - 8297284: ResolutionErrorTable's key is wrong - 8297740: runtime/ClassUnload/UnloadTest.java failed with "Test failed: should still be live" - 8297644: RISC-V: Compilation error when shenandoah is disabled - 8297523: Various GetPrimitiveArrayCritical miss result - NULL check - ... and 86 more: https://git.openjdk.org/shenandoah/compare/f90a7701...bfd2f109 The webrevs contain the adjustments done while merging with regards to each parent branch: - master: https://webrevs.openjdk.org/?repo=shenandoah&pr=178&range=00.0 - openjdk/jdk:master: https://webrevs.openjdk.org/?repo=shenandoah&pr=178&range=00.1 Changes: https://git.openjdk.org/shenandoah/pull/178/files Stats: 15750 lines in 611 files changed: 10080 ins; 3270 del; 2400 mod Patch: https://git.openjdk.org/shenandoah/pull/178.diff Fetch: git fetch https://git.openjdk.org/shenandoah pull/178/head:pull/178 PR: https://git.openjdk.org/shenandoah/pull/178 From ysr at openjdk.org Tue Dec 6 03:57:11 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Tue, 6 Dec 2022 03:57:11 GMT Subject: RFR: JDK-8298138: Shenandoah: HdrSeq asserts "sub-bucket index (512) overflow for value ( 1.00)" Message-ID: JBS link: https://bugs.openjdk.org/browse/JDK-8298138 - Fixed a boundary condition that was triggering an assert. - Added a simple-minded gtest for HdrSeq, which allows one to exercise the asserting code in a debug build. 
- Tested with: `CONF=slowdebug make run-test TEST="gtest:BasicShenandoahNumberSeqTest"` ------------- Commit messages: - Merge branch 'master' into shen_numberseq - A simple-minded test of HdrSeq which also exercises the problematic - Fix a boundary condition issue w/HdrSeq Changes: https://git.openjdk.org/jdk/pull/11524/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11524&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8298138 Stats: 77 lines in 3 files changed: 74 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/11524.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11524/head:pull/11524 PR: https://git.openjdk.org/jdk/pull/11524 From rkennke at openjdk.org Tue Dec 6 11:37:49 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Tue, 6 Dec 2022 11:37:49 GMT Subject: RFR: JDK-8298138: Shenandoah: HdrSeq asserts "sub-bucket index (512) overflow for value ( 1.00)" In-Reply-To: References: Message-ID: On Tue, 6 Dec 2022 03:46:12 GMT, Y. Srinivas Ramakrishna wrote: > JBS link: https://bugs.openjdk.org/browse/JDK-8298138 > - Fixed a boundary condition that was triggering an assert. > - Added a simple-minded gtest for HdrSeq, which allows one to exercise the asserting code in a debug build. > - Tested with: `CONF=slowdebug make run-test TEST="gtest:BasicShenandoahNumberSeqTest"` Hi Ramki, The change looks good. I have a few minor comments. src/hotspot/share/gc/shenandoah/shenandoahNumberSeq.hpp line 58: > 56: // It has very low memory requirements, and is thread-safe. When accuracy > 57: // is not needed, it is preferred over HdrSeq. > 58: class BinaryMagnitudeSeq : public CHeapObj { What is the relevance of this change? Also, if it *is* necessary, then it should be mtGC. test/hotspot/gtest/gc/shenandoah/test_shenandoahNumberSeq.cpp line 2: > 1: /* > 2: * Copyright (c) 2016, 2017, Oracle and/or its affiliates. All rights reserved. The copyright should be 2022. ------------- Changes requested by rkennke (Reviewer). PR: https://git.openjdk.org/jdk/pull/11524 From kdnilsen at openjdk.org Tue Dec 6 15:48:18 2022 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Tue, 6 Dec 2022 15:48:18 GMT Subject: RFR: Generation resizing [v2] In-Reply-To: References: Message-ID: On Mon, 5 Dec 2022 19:52:35 GMT, William Kemper wrote: >> These changes have the generational mode track the minimum mutator utilization (percentage of process time used by mutators). When it falls below a configuration percentage (GCTimeRatio), a heuristic will transfer memory capacity to whatever generation has been using more CPU time. The assumption here is that by increasing capacity, we will decrease the collection frequency and improve the MMU. > > William Kemper has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 15 commits: > > - Remove unnecessary logging, clean up imports > - Merge from shenandoah/master > - Document the class responsible for adjusting generation sizes > - Revert unnecessary change > - Remove unused time between cycle tracking > - Remove vestigial mmu tracker instance > - Clamp adjustments to min/max when increment is too large > - Adjust generation sizes from safepoint > - Fix crash in SATB mode, always log average MMU on scheduled interval > - Limits on generation size adjustments, log young/old heap occupancy in generational mode > - ... and 5 more: https://git.openjdk.org/shenandoah/compare/f90a7701...41f057fa LGTM. I'm not yet convinced that this is the right heuristic, or the only heuristic for resizing generations. 
But this is a huge step making it possible to adjust generation sizes on the fly. I expect further refinement will be driven by additional experiments. ------------- Marked as reviewed by kdnilsen (Committer). PR: https://git.openjdk.org/shenandoah/pull/177 From wkemper at openjdk.org Tue Dec 6 16:48:37 2022 From: wkemper at openjdk.org (William Kemper) Date: Tue, 6 Dec 2022 16:48:37 GMT Subject: Integrated: Merge openjdk/jdk:master In-Reply-To: <-xErRk5G6IkHidX1x7cFmyAHd5mU7DRXN0zqH8IT07Q=.40fd9a98-5f4c-48df-9d5e-55a5adcf7c65@github.com> References: <-xErRk5G6IkHidX1x7cFmyAHd5mU7DRXN0zqH8IT07Q=.40fd9a98-5f4c-48df-9d5e-55a5adcf7c65@github.com> Message-ID: On Mon, 5 Dec 2022 23:16:57 GMT, William Kemper wrote: > Weekly merge from upstream. Looks fine in testing. This pull request has now been integrated. Changeset: 6ce5f226 Author: William Kemper URL: https://git.openjdk.org/shenandoah/commit/6ce5f226c9110a7c0262c02d86d2ac7539a4a81d Stats: 15750 lines in 611 files changed: 10080 ins; 3270 del; 2400 mod Merge openjdk/jdk:master ------------- PR: https://git.openjdk.org/shenandoah/pull/178 From wkemper at openjdk.org Tue Dec 6 17:26:08 2022 From: wkemper at openjdk.org (William Kemper) Date: Tue, 6 Dec 2022 17:26:08 GMT Subject: RFR: Generation resizing [v3] In-Reply-To: References: Message-ID: <8Fse7IxO14Uc0eJJoLMmGXSo8XYD9Qb144mCyrMX3-g=.1d5d7258-f06d-4a03-bf78-8102f22ada7d@github.com> > These changes have the generational mode track the minimum mutator utilization (percentage of process time used by mutators). When it falls below a configuration percentage (GCTimeRatio), a heuristic will transfer memory capacity to whatever generation has been using more CPU time. The assumption here is that by increasing capacity, we will decrease the collection frequency and improve the MMU. William Kemper has updated the pull request incrementally with one additional commit since the last revision: Remove vestigial lock, do not enroll periodic task while holding threads_lock ------------- Changes: - all: https://git.openjdk.org/shenandoah/pull/177/files - new: https://git.openjdk.org/shenandoah/pull/177/files/41f057fa..d7a01946 Webrevs: - full: https://webrevs.openjdk.org/?repo=shenandoah&pr=177&range=02 - incr: https://webrevs.openjdk.org/?repo=shenandoah&pr=177&range=01-02 Stats: 8 lines in 3 files changed: 2 ins; 3 del; 3 mod Patch: https://git.openjdk.org/shenandoah/pull/177.diff Fetch: git fetch https://git.openjdk.org/shenandoah pull/177/head:pull/177 PR: https://git.openjdk.org/shenandoah/pull/177 From ysr at openjdk.org Tue Dec 6 18:23:16 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Tue, 6 Dec 2022 18:23:16 GMT Subject: RFR: JDK-8298138: Shenandoah: HdrSeq asserts "sub-bucket index (512) overflow for value ( 1.00)" [v2] In-Reply-To: References: Message-ID: > JBS link: https://bugs.openjdk.org/browse/JDK-8298138 > - Fixed a boundary condition that was triggering an assert. > - Added a simple-minded gtest for HdrSeq, which allows one to exercise the asserting code in a debug build. > - Tested with: `CONF=slowdebug make run-test TEST="gtest:BasicShenandoahNumberSeqTest"` Y. Srinivas Ramakrishna has updated the pull request incrementally with one additional commit since the last revision: - Copyright dates etc. - include reorder to alphabetic; don't use/include std:: namespace. 
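To make the shape of such a test concrete, here is a minimal sketch of a boundary-value check for HdrSeq — the suite name is taken from the command quoted above, but the actual assertions in the PR may differ:

    #include "gc/shenandoah/shenandoahNumberSeq.hpp"
    #include "unittest.hpp"

    // Hypothetical regression test: a value that lands exactly on a
    // bucket boundary (1.0, per the assertion message in the bug title)
    // used to trip the sub-bucket index assert in debug builds.
    TEST(BasicShenandoahNumberSeqTest, add_boundary_value) {
      HdrSeq seq;
      seq.add(1.0);
      EXPECT_EQ(seq.num(), 1);
    }

Run under a debug build (e.g. via the slowdebug command above) so that asserts are compiled in.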
------------- Changes: - all: https://git.openjdk.org/jdk/pull/11524/files - new: https://git.openjdk.org/jdk/pull/11524/files/fb5cd5d0..a714630c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11524&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11524&range=00-01 Stats: 14 lines in 2 files changed: 2 ins; 2 del; 10 mod Patch: https://git.openjdk.org/jdk/pull/11524.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11524/head:pull/11524 PR: https://git.openjdk.org/jdk/pull/11524 From ysr at openjdk.org Tue Dec 6 18:23:17 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Tue, 6 Dec 2022 18:23:17 GMT Subject: RFR: JDK-8298138: Shenandoah: HdrSeq asserts "sub-bucket index (512) overflow for value ( 1.00)" [v2] In-Reply-To: References: Message-ID: On Tue, 6 Dec 2022 11:34:16 GMT, Roman Kennke wrote: >> Y. Srinivas Ramakrishna has updated the pull request incrementally with one additional commit since the last revision: >> >> - Copyright dates etc. >> - include reorder to alphabetic; don't use/include std:: namespace. > > src/hotspot/share/gc/shenandoah/shenandoahNumberSeq.hpp line 58: > >> 56: // It has very low memory requirements, and is thread-safe. When accuracy >> 57: // is not needed, it is preferred over HdrSeq. >> 58: class BinaryMagnitudeSeq : public CHeapObj { > > What is the relevance of this change? Also, if it *is* necessary, then it should be mtGC. Wanted a spec for allocation for correct accounting. Changed to mtGC; thanks! > test/hotspot/gtest/gc/shenandoah/test_shenandoahNumberSeq.cpp line 2: > >> 1: /* >> 2: * Copyright (c) 2016, 2017, Oracle and/or its affiliates. All rights reserved. > > The copyright should be 2022. Fixed; thanks for the catch! ------------- PR: https://git.openjdk.org/jdk/pull/11524 From kdnilsen at openjdk.org Tue Dec 6 18:42:58 2022 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Tue, 6 Dec 2022 18:42:58 GMT Subject: RFR: Enforce max regions Message-ID: This commit enforces upper bounds on the number of ShenandoahHeapRegions affiliated with each generation. Prior to this change, enforcement of generation sizes was by usage alone. This allowed situations in which so many sparsely populated regions were affiliated with old-gen that there were insufficient FREE regions available to satisfy legitimate young-gen allocation requests. This was resulting in excessive TLAB allocation failures and degenerated collections. ------------- Commit messages: - Fix whitespace - Merge remote-tracking branch 'GitFarmBranch/enforce-max-old-regions' into enforce-max-regions - Remove instrumentation - Fixup region budgeting errors - Fix spelling error in assertion symbol - Enforce bounds on regions per generation Changes: https://git.openjdk.org/shenandoah/pull/179/files Webrev: https://webrevs.openjdk.org/?repo=shenandoah&pr=179&range=00 Stats: 170 lines in 10 files changed: 103 ins; 7 del; 60 mod Patch: https://git.openjdk.org/shenandoah/pull/179.diff Fetch: git fetch https://git.openjdk.org/shenandoah pull/179/head:pull/179 PR: https://git.openjdk.org/shenandoah/pull/179 From wkemper at openjdk.org Tue Dec 6 21:35:11 2022 From: wkemper at openjdk.org (William Kemper) Date: Tue, 6 Dec 2022 21:35:11 GMT Subject: RFR: Enforce max regions In-Reply-To: References: Message-ID: <9me-r1yDWDfDp2MTKgO0QkdXwjJsMWcXAWg95oqHlS0=.6182ff46-8a88-413c-a455-05474923760d@github.com> On Tue, 6 Dec 2022 17:57:18 GMT, Kelvin Nilsen wrote: > This commit enforces upper bounds on the number of ShenandoahHeapRegions affiliated with each generation. 
Prior to this change, enforcement of generation sizes was by usage alone. This allowed situations in which so many sparsely populated regions were affiliated with old-gen that there were insufficient FREE regions available to satisfy legitimate young-gen allocation requests. This was resulting in excessive TLAB allocation failures and degenerated collections. I saw a pattern like this in a couple of places for young and old generations: size_t avail_young_regions = ((_heap->young_generation()->adjusted_capacity() - _heap->young_generation()->used_regions_size()) / ShenandoahHeapRegion::region_size_bytes()); We also have this method in `ShenandoahGeneration` called `free_unaffiliated_regions` which is similar, except that it uses soft max capacity, instead of adjusted capacity. Could these calculations be consolidated? ------------- PR: https://git.openjdk.org/shenandoah/pull/179 From wkemper at openjdk.org Tue Dec 6 22:03:18 2022 From: wkemper at openjdk.org (William Kemper) Date: Tue, 6 Dec 2022 22:03:18 GMT Subject: RFR: Enforce max regions In-Reply-To: References: Message-ID: On Tue, 6 Dec 2022 17:57:18 GMT, Kelvin Nilsen wrote: > This commit enforces upper bounds on the number of ShenandoahHeapRegions affiliated with each generation. Prior to this change, enforcement of generation sizes was by usage alone. This allowed situations in which so many sparsely populated regions were affiliated with old-gen that there were insufficient FREE regions available to satisfy legitimate young-gen allocation requests. This was resulting in excessive TLAB allocation failures and degenerated collections. src/hotspot/share/gc/shenandoah/shenandoahHeapRegion.cpp line 1031: > 1029: // affiliation to OLD_GENERATION and adjust the generation-use tallies. The remnant of memory > 1030: // in the last humongous region that is not spanned by obj is currently not used. > 1031: for (size_t i = index(); i < index_limit; i++) { Do we need to worry about races here? Could we have a separate evacuating thread take a new, old region to use for PLABs _after_ we checked old available regions? ------------- PR: https://git.openjdk.org/shenandoah/pull/179 From kdnilsen at openjdk.org Tue Dec 6 22:41:39 2022 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Tue, 6 Dec 2022 22:41:39 GMT Subject: RFR: Enforce max regions In-Reply-To: References: Message-ID: On Tue, 6 Dec 2022 17:57:18 GMT, Kelvin Nilsen wrote: > This commit enforces upper bounds on the number of ShenandoahHeapRegions affiliated with each generation. Prior to this change, enforcement of generation sizes was by usage alone. This allowed situations in which so many sparsely populated regions were affiliated with old-gen that there were insufficient FREE regions available to satisfy legitimate young-gen allocation requests. This was resulting in excessive TLAB allocation failures and degenerated collections. I'll also introduce a new method adjusted_unaffiliated_regions() to consolidate the code for the "common pattern" you identified. ------------- PR: https://git.openjdk.org/shenandoah/pull/179 From kdnilsen at openjdk.org Tue Dec 6 22:41:42 2022 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Tue, 6 Dec 2022 22:41:42 GMT Subject: RFR: Enforce max regions In-Reply-To: <_JDV1PL-12IKTMIyES68372b6-tj8Ps1ZzWPVohlkI0=.1d3d6b5b-c94c-4840-a8d4-1ceaae4dd60b@github.com> References: <_JDV1PL-12IKTMIyES68372b6-tj8Ps1ZzWPVohlkI0=.1d3d6b5b-c94c-4840-a8d4-1ceaae4dd60b@github.com> Message-ID: On Tue, 6 Dec 2022 21:43:17 GMT, William Kemper wrote: >> This commit enforces upper bounds on the number of ShenandoahHeapRegions affiliated with each generation.
Prior to this change, enforcement of generation sizes was by usage alone. This allowed situations in which so many sparsely populated regions were affiliated with old-gen that there were insufficient FREE regions available to satisfy legitimate young-gen allocation requests. This was resulting in excessive TLAB allocation failures and degenerated collections. > > src/hotspot/share/gc/shenandoah/shenandoahHeapRegion.cpp line 1031: > >> 1029: // affiliation to OLD_GENERATION and adjust the generation-use tallies. The remnant of memory >> 1030: // in the last humongous region that is not spanned by obj is currently not used. >> 1031: for (size_t i = index(); i < index_limit; i++) { > > Do we need to worry about races here? Could we have a separate evacuating thread take a new, old region to use for PLABs _after_ we checked old available regions? Good catch. I need to grab the heap lock for part of this code. ------------- PR: https://git.openjdk.org/shenandoah/pull/179 From kdnilsen at openjdk.org Tue Dec 6 22:41:43 2022 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Tue, 6 Dec 2022 22:41:43 GMT Subject: RFR: Enforce max regions In-Reply-To: <_JDV1PL-12IKTMIyES68372b6-tj8Ps1ZzWPVohlkI0=.1d3d6b5b-c94c-4840-a8d4-1ceaae4dd60b@github.com> References: <_JDV1PL-12IKTMIyES68372b6-tj8Ps1ZzWPVohlkI0=.1d3d6b5b-c94c-4840-a8d4-1ceaae4dd60b@github.com> Message-ID: On Tue, 6 Dec 2022 22:18:01 GMT, Kelvin Nilsen wrote: >> src/hotspot/share/gc/shenandoah/shenandoahHeapRegion.cpp line 1031: >> >>> 1029: // affiliation to OLD_GENERATION and adjust the generation-use tallies. The remnant of memory >>> 1030: // in the last humongous region that is not spanned by obj is currently not used. >>> 1031: for (size_t i = index(); i < index_limit; i++) { >> >> Do we need to worry about races here? Could we have a separate evacuating thread take a new, old region to use for PLABs _after_ we checked old available regions? > > Good catch. I need to grab the heap lock for part of this code. Good catch. I need to grab the heap lock for part of this work. ------------- PR: https://git.openjdk.org/shenandoah/pull/179 From kdnilsen at openjdk.org Wed Dec 7 15:39:34 2022 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Wed, 7 Dec 2022 15:39:34 GMT Subject: RFR: Enforce max regions [v2] In-Reply-To: References: Message-ID: > This commit enforces upper bounds on the number of ShenandoahHeapRegions affiliated with each generation. Prior to this change, enforcement of generation sizes was by usage alone. This allowed situations in which so many sparsely populated regions were affiliated with old-gen that there were insufficient FREE regions available to satisfy legitimate young-gen allocation requests. This was resulting in excessive TLAB allocation failures and degenerated collections. 
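For reference, the consolidated helper mentioned in the exchange above might look roughly like the following sketch — the expression is derived from the avail_young_regions pattern William quoted; the exact signature and assertion in the patch may differ:

    // Sketch of the consolidation discussed above (not the final patch):
    // how many still-FREE regions this generation may claim before its
    // region count reaches the generation's (adjusted) capacity budget.
    size_t ShenandoahGeneration::adjusted_unaffiliated_regions() const {
      assert(adjusted_capacity() >= used_regions_size(), "should not underflow");
      return (adjusted_capacity() - used_regions_size()) / ShenandoahHeapRegion::region_size_bytes();
    }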
Kelvin Nilsen has updated the pull request incrementally with one additional commit since the last revision: Respond to reviewer feedback ------------- Changes: - all: https://git.openjdk.org/shenandoah/pull/179/files - new: https://git.openjdk.org/shenandoah/pull/179/files/f5b3e0db..28a53a86 Webrevs: - full: https://webrevs.openjdk.org/?repo=shenandoah&pr=179&range=01 - incr: https://webrevs.openjdk.org/?repo=shenandoah&pr=179&range=00-01 Stats: 60 lines in 4 files changed: 22 ins; 10 del; 28 mod Patch: https://git.openjdk.org/shenandoah/pull/179.diff Fetch: git fetch https://git.openjdk.org/shenandoah pull/179/head:pull/179 PR: https://git.openjdk.org/shenandoah/pull/179 From kdnilsen at openjdk.org Wed Dec 7 16:14:18 2022 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Wed, 7 Dec 2022 16:14:18 GMT Subject: RFR: Enforce max regions [v3] In-Reply-To: References: Message-ID: > This commit enforces upper bounds on the number of ShenandoahHeapRegions affiliated with each generation. Prior to this change, enforcement of generation sizes was by usage alone. This allowed situations in which so many sparsely populated regions were affiliated with old-gen that there were insufficient FREE regions available to satisfy legitimate young-gen allocation requests. This was resulting in excessive TLAB allocation failures and degenerated collections. Kelvin Nilsen has updated the pull request incrementally with one additional commit since the last revision: Fix white space and add an assertion ------------- Changes: - all: https://git.openjdk.org/shenandoah/pull/179/files - new: https://git.openjdk.org/shenandoah/pull/179/files/28a53a86..4617913f Webrevs: - full: https://webrevs.openjdk.org/?repo=shenandoah&pr=179&range=02 - incr: https://webrevs.openjdk.org/?repo=shenandoah&pr=179&range=01-02 Stats: 2 lines in 2 files changed: 1 ins; 0 del; 1 mod Patch: https://git.openjdk.org/shenandoah/pull/179.diff Fetch: git fetch https://git.openjdk.org/shenandoah pull/179/head:pull/179 PR: https://git.openjdk.org/shenandoah/pull/179 From wkemper at openjdk.org Wed Dec 7 17:51:09 2022 From: wkemper at openjdk.org (William Kemper) Date: Wed, 7 Dec 2022 17:51:09 GMT Subject: RFR: Enforce max regions [v3] In-Reply-To: References: Message-ID: On Wed, 7 Dec 2022 16:14:18 GMT, Kelvin Nilsen wrote: >> This commit enforces upper bounds on the number of ShenandoahHeapRegions affiliated with each generation. Prior to this change, enforcement of generation sizes was by usage alone. This allowed situations in which so many sparsely populated regions were affiliated with old-gen that there were insufficient FREE regions available to satisfy legitimate young-gen allocation requests. This was resulting in excessive TLAB allocation failures and degenerated collections. > > Kelvin Nilsen has updated the pull request incrementally with one additional commit since the last revision: > > Fix white space and add an assertion Looks good - thank you! ------------- Marked as reviewed by wkemper (Committer). PR: https://git.openjdk.org/shenandoah/pull/179 From ysr at openjdk.org Wed Dec 7 18:59:08 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Wed, 7 Dec 2022 18:59:08 GMT Subject: RFR: JDK-8298138: Shenandoah: HdrSeq asserts "sub-bucket index (512) overflow for value ( 1.00)" [v2] In-Reply-To: References: Message-ID: On Tue, 6 Dec 2022 11:35:23 GMT, Roman Kennke wrote: >> Y. Srinivas Ramakrishna has updated the pull request incrementally with one additional commit since the last revision: >> >> - Copyright dates etc. 
>> - include reorder to alphabetic; don't use/include std:: namespace. > Hi Ramki, > The change looks good. I have a few minor comments. @rkennke & @shipilev : could you folks please review and approve? I made the changes requested by Roman. Thanks! -- Ramki ------------- PR: https://git.openjdk.org/jdk/pull/11524 From ysr at openjdk.org Wed Dec 7 21:15:06 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Wed, 7 Dec 2022 21:15:06 GMT Subject: RFR: Enforce max regions [v3] In-Reply-To: References: Message-ID: On Wed, 7 Dec 2022 16:14:18 GMT, Kelvin Nilsen wrote: >> This commit enforces upper bounds on the number of ShenandoahHeapRegions affiliated with each generation. Prior to this change, enforcement of generation sizes was by usage alone. This allowed situations in which so many sparsely populated regions were affiliated with old-gen that there were insufficient FREE regions available to satisfy legitimate young-gen allocation requests. This was resulting in excessive TLAB allocation failures and degenerated collections. > > Kelvin Nilsen has updated the pull request incrementally with one additional commit since the last revision: > > Fix white space and add an assertion Overall looks great to the extent that I understood it. I left a few questions/comments in a few places, typically because I may lack the complete picture of the design and its rationale. Are there performance numbers to share with these changes? Those could be added either here in the pull request, or in the associated JBS ticket, which can be linked to the PR. Thanks! src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 103: > 101: switch (req.affiliation()) { > 102: case ShenandoahRegionAffiliation::OLD_GENERATION: > 103: if (_heap->old_generation()->adjusted_unaffiliated_regions() <= 0) { Re "<=" : I am guessing this is because adjusted unaffiliated_regions can go negative for periods of time while GC is in progress in a tight heap situation? Unfortunately, the signature of this is a size_t (unsigned), so a "<=" comparison with "0" should have been flagged by the compiler? Or does the compiler silently treat it as "==", without issuing a warning about the comparison? In any case, worth thinking about a related question in the definition of adjusted_unaffiliated_count(), and adjusting accordingly. src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 126: > 124: for (size_t idx = _mutator_leftmost; idx <= _mutator_rightmost; idx++) { > 125: ShenandoahHeapRegion* r = _heap->get_region(idx); > 126: if (is_mutator_free(idx) && (allow_new_region || r->affiliation() != ShenandoahRegionAffiliation::FREE)) { Aside: Does Shenandoah have a concept of an allocation cursor per mutator in shared space independent of its TLAB? This is because firstly it might make first-fit searches more efficient, and secondly we might end up with spatial locality of allocations that are temporally in close proximity from the same mutator, which might help reduce fragmentation and potentially evacuation costs.
I suppose that's because regions freed by GC as a result of evacuation will be available to mutators, so the flipping to GC may be considered temporary in that sense. However, I suspect futile flipping may strand space in GC territory for no good reason. In any case, take my comments here with the right grain of salt because I am lacking the philosophical foundations of the need for this mutator & collector view dichotomy here. It would be good if in the `.hpp` file we expended a few sentences listing the rationale for that design choice; e.g. the allocate from left and allocate from right could still hold without necessarily having strict collector/mutator affiliations (as indicated by the `flip` above)? src/hotspot/share/gc/shenandoah/shenandoahFullGC.cpp line 198: > 196: > 197: if (heap->mode()->is_generational()) { > 198: // Since we probably have not yet reclaimed the most recently selected collection set, we have to defer I'd make the comment less tentative, and state: // Since the most recently selected collection set may not have been reclaimed at this stage, // we'll defer unadjust_avaliable() until after the full gc is completed. Question: is the adjusted available value (modulo the loaned size) used by full gc for any purpose, or is it to satisfy assertion checks / verification in some of the methods invoked during full gc work below? src/hotspot/share/gc/shenandoah/shenandoahGeneration.cpp line 924: > 922: size_t ShenandoahGeneration::decrement_affiliated_region_count() { > 923: _affiliated_region_count--; > 924: return _affiliated_region_count; Both these seem fine and probably more readable, but you'd save a line by returning the pre-{in,de}cremented result, e.g.: `return --_affiliated_region_count;` Would it be useful to assert that the region count is always non-zero? src/hotspot/share/gc/shenandoah/shenandoahGeneration.cpp line 986: > 984: } > 985: > 986: size_t ShenandoahGeneration::adjusted_unaffiliated_regions() { You can const this method too. src/hotspot/share/gc/shenandoah/shenandoahGeneration.cpp line 988: > 986: size_t ShenandoahGeneration::adjusted_unaffiliated_regions() { > 987: assert(adjusted_capacity() > used_regions_size(), "adjusted_unaffiliated_regions() cannot return negative"); > 988: return (adjusted_capacity() - used_regions_size()) / ShenandoahHeapRegion::region_size_bytes(); So, just to be clear, this is the number of unaffiliated regions that can _potentially_ be affiliated with this region. I assume it isn't the case that that number of unaffiliated free regions actually exist? If the answer is "no, that number of unaffiliated free regions do exist" would it be worth asserting that invariant here (or may be because this is all concurrent with allocations, no such guarantees will ever hold anyway, so it's futile to assert such invariants?). Indeed this question ties in with my comment further up where you do a "<=" comparison with 0 on the return value from here. src/hotspot/share/gc/shenandoah/shenandoahGeneration.hpp line 169: > 167: void scan_remembered_set(bool is_concurrent); > 168: > 169: size_t increment_affiliated_region_count(); Add a single line comment in the header file describing what a method returns: // Returns the affiliated region count following the operation. src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp line 1491: > 1489: // doing this work during a safepoint. We cannot put humongous regions into the collection set because that > 1490: // triggers the load-reference barrier (LRB) to copy on reference fetch. 
> 1491: if (r->promote_humongous() == 0) { See my comment in ::promote_humongous(). I think that method could directly call the requisite expansion code under those circumstances, so this code can move there, with (as I noted there) promotion always succeeding for humongous object arrays at least, but in general for all humongous objects that are deemed eligible for promotion by other criteria (see my note in ::promote_humongous() on potentially treating humongous primitive type arrays differently from humongous object arrays). src/hotspot/share/gc/shenandoah/shenandoahHeapRegion.cpp line 1040: > 1038: // Then fall through to finish the promotion after releasing the heap lock. > 1039: } else { > 1040: return 0; This is interesting. Doing some thinking out loud here. I realize we want to very strictly enforce the generation sizes (indicated by the affiliation of regions to generations in a formal sense of generation sizes), but I do wonder if humongous regions should not enter into that calculus at all? In this case, the reason we would typically want to designate a humongous object as old (via promotion via this method) is because we don't want to have to spend effort scanning its contents. After all we never spend any time copying it when it survives a minor collection. Under the circumstances, it appears as if we would always want humongous objects that are primitive type arrays to stay in young (never be promoted, although I admit that it might make sense to not pay even the cost of marking it if it's been around forever per generational hypothesis), and if a humongous object has references (i.e. ages into the old generation) then it's affiliated with old and is "promoted" even if there aren't any available regions in old. In other words, humongous objects, because they are never copied, have affiliations that do not affect the promotion calculus in a strict manner. For these reasons, I'd think that humongous object promotions should be treated specially and old generation size should not be a criterion for determining generational affiliation of humongous regions. src/hotspot/share/gc/shenandoah/shenandoahVerifier.cpp line 343: > 341: }; > 342: > 343: class ShenandoahCalculateRegionStatsClosure : public ShenandoahHeapRegionClosure { A one-line documentation spec here would be useful: // A closure used to accumulate the net used, committed, and garbage bytes, and number of regions; // typically associated with a generation in generational mode. ------------- Marked as reviewed by ysr (Author). PR: https://git.openjdk.org/shenandoah/pull/179 From kdnilsen at openjdk.org Wed Dec 7 21:20:00 2022 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Wed, 7 Dec 2022 21:20:00 GMT Subject: RFR: Enforce max regions [v3] In-Reply-To: References: Message-ID: On Wed, 7 Dec 2022 20:25:23 GMT, Y. Srinivas Ramakrishna wrote: >> Kelvin Nilsen has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix white space and add an assertion > src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 103: > >> 101: switch (req.affiliation()) { >> 102: case ShenandoahRegionAffiliation::OLD_GENERATION: >> 103: if (_heap->old_generation()->adjusted_unaffiliated_regions() <= 0) { > > Re "<=" : I am guessing this is because adjusted unaffiliated_regions can go negative for periods of time while GC is in progress in a tight heap situation?
> > Unfortunately, the signature of this is a size_t (unsigned), so a "<=" comparison with "0" should have been flagged by the compiler? Or does the compiler silently treat it as "==", without issuing a warning about the comparison? In any case, worth thinking about a related question in the definition of adjusted_unaffiliated_count(), and adjusting accordingly. This is really a test for ==, and the compiler doesn't complain because the test is meaningful as written (though perhaps confusing as written). OTOH, writing it this way makes the code more "future proof" in case someone changes the return type to signed. ------------- PR: https://git.openjdk.org/shenandoah/pull/179 From kdnilsen at openjdk.org Wed Dec 7 21:23:06 2022 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Wed, 7 Dec 2022 21:23:06 GMT Subject: RFR: Enforce max regions [v3] In-Reply-To: References: Message-ID: On Wed, 7 Dec 2022 18:36:39 GMT, Y. Srinivas Ramakrishna wrote: >> Kelvin Nilsen has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix white space and add an assertion > src/hotspot/share/gc/shenandoah/shenandoahGeneration.cpp line 986: > >> 984: } >> 985: >> 986: size_t ShenandoahGeneration::adjusted_unaffiliated_regions() { > > You can const this method too. Thanks. I'll change this. > src/hotspot/share/gc/shenandoah/shenandoahGeneration.cpp line 988: > >> 986: size_t ShenandoahGeneration::adjusted_unaffiliated_regions() { >> 987: assert(adjusted_capacity() > used_regions_size(), "adjusted_unaffiliated_regions() cannot return negative"); >> 988: return (adjusted_capacity() - used_regions_size()) / ShenandoahHeapRegion::region_size_bytes(); > > So, just to be clear, this is the number of unaffiliated regions that can _potentially_ be affiliated with this generation. I assume it isn't the case that that number of unaffiliated free regions actually exist? > > If the answer is "no, that number of unaffiliated free regions do exist" would it be worth asserting that invariant here (or maybe, because this is all concurrent with allocations, no such guarantees will ever hold anyway, so it's futile to assert such invariants?). > > Indeed this question ties in with my comment further up where you do a "<=" comparison with 0 on the return value from here. Yes. That is accurate. This is the number of regions that are currently affiliated with FREE, which are eligible to be affiliated as part of this generation if we have reason to do so. If this value is zero, then the entire adjusted_capacity is consumed by the regions already affiliated with this generation, and we are not allowed to move any more FREE regions into this generation. ------------- PR: https://git.openjdk.org/shenandoah/pull/179 From kdnilsen at openjdk.org Wed Dec 7 21:30:13 2022 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Wed, 7 Dec 2022 21:30:13 GMT Subject: RFR: Enforce max regions [v3] In-Reply-To: References: Message-ID: On Wed, 7 Dec 2022 18:19:50 GMT, Y. Srinivas Ramakrishna wrote: >> Kelvin Nilsen has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix white space and add an assertion > src/hotspot/share/gc/shenandoah/shenandoahHeapRegion.cpp line 1040: > >> 1038: // Then fall through to finish the promotion after releasing the heap lock. >> 1039: } else { >> 1040: return 0; > This is interesting. Doing some thinking out loud here.
> > I realize we want to very strictly enforce the generation sizes (indicated by the affiliation of regions to generations in a formal sense of generation sizes), but I do wonder if humongous regions should not enter into that calculus at all? In this case, the reason we would typically want to designate a humongous object as old (via promotion via this method) is because we don't want to have to spend effort scanning its contents. After all we never spend any time copying it when it survives a minor collection. Under the circumstances, it appears as if we would always want humongous objects that are primitive type arrays to stay in young (never be promoted, although I admit that it might make sense to not pay even the cost of marking it if it's been around forever per generational hypothesis), and if a humongous object that has references (i.e. ages into the old generation) then it's affiliated with old and is "promoted" even if there aren't any available regions in old. In other wor ds, humongous objects, because they are never copied, have affiliations that do not affect the promotion calculus in a strict manner. > > For these reasons, I'd think that humongous object promotions should be treated specially and old generation size should not be a criterion for determining generational affiliation of humongous regions. I'm going to add a TODO comment here, so that we can think about changing this behavior. I totally agree with your rationale. Problem is that we have "assumptions" and "invariants" scattered throughout the existing implementation that need to be carefully reconsidered if we allow the rules to bend. (For example: there are lots of size_t subtractions that may overflow to huge unmeaningful numbers, and if we run with ShenandoahVerify enabled, it will complain if the size of the generation exceeds it capacity. ------------- PR: https://git.openjdk.org/shenandoah/pull/179 From kdnilsen at openjdk.org Wed Dec 7 21:45:01 2022 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Wed, 7 Dec 2022 21:45:01 GMT Subject: RFR: Enforce max regions [v3] In-Reply-To: References: Message-ID: On Wed, 7 Dec 2022 20:50:45 GMT, Y. Srinivas Ramakrishna wrote: >> Kelvin Nilsen has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix white space and add an assertion > > src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 126: > >> 124: for (size_t idx = _mutator_leftmost; idx <= _mutator_rightmost; idx++) { >> 125: ShenandoahHeapRegion* r = _heap->get_region(idx); >> 126: if (is_mutator_free(idx) && (allow_new_region || r->affiliation() != ShenandoahRegionAffiliation::FREE)) { > > Aside: Does Shanandoah have a concept of an allocation cursor per mutator in shared space independent of its TLAB? This is because firstly it might make first fit searches more efficient, and secondly we might end up with spatial locality of allocations that are temporally in close proximity from the same mutator, which might help reduce fragmentation and potentially evacuation costs. > > One might consider resetting the cursors following each minor gc. There is no concept of an allocation cursor per mutator. In tracing some "anomalous" behaviors, I observed that the search for a heap region with memory available to be allocated can be very cumbersome. 
As young memory becomes more scarce, the effort consumed by each thread trying to allocate (under lock by the way) becomes more and more costly, having to sequentially examine large numbers of regions (possibly more than a thousand regions) to find the first region with sufficient space to satisfy the allocation request. We could definitely make some improvements here, especially because the allocating threads holds the lock throughout this traversal. Another possible improvement is to not require the global heap lock while searching for a region to serve tlab allocation request. > src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 171: > >> 169: ShenandoahHeapRegion* r = _heap->get_region(idx); >> 170: if (can_allocate_from(r)) { >> 171: flip_to_gc(r); > > Does the flipping have to strictly precede the allocation attempt? Otherwise the flip is futile and we steal space from mutators but to no advantage. > > I also notice the asymmetry in the existence of `flip_to_gc()` but no corresponding `flip_to_mutator()`. I suppose that's because regions freed by GC as a result of evacuation will be available to mutators, so the flipping to GC may be considered temporary in that sense. However, I suspect futile flipping may strand space in GC territory for no good reason. > > In any case, take my comments here with the right grain of salt because I am lacking the philosophical foundations of the need for this mutator & collector view dichotomy here. It would be good if in the `.hpp` file we expended a few sentences listing the rationale for that design choice; e.g. the allocate from left and allocate from right could still hold without necessarily having strict collector/mutator affiliations (as indicated by the `flip` above)? There's no flip to mutator because we do not allow a mutator allocation request to take memory that had been set aside for use by the collector (for evacuations). If a mutator alloc "fails", we can stall that single mutating thread. If a GC evacuation fails, we have to force all threads into a safepoint so that we can perform a FULL GC. This is the reason we don't allow mutators to flip_to_mutator(). ------------- PR: https://git.openjdk.org/shenandoah/pull/179 From kdnilsen at openjdk.org Wed Dec 7 22:00:06 2022 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Wed, 7 Dec 2022 22:00:06 GMT Subject: RFR: Enforce max regions [v3] In-Reply-To: References: Message-ID: On Wed, 7 Dec 2022 21:27:10 GMT, Kelvin Nilsen wrote: >> src/hotspot/share/gc/shenandoah/shenandoahHeapRegion.cpp line 1040: >> >>> 1038: // Then fall through to finish the promotion after releasing the heap lock. >>> 1039: } else { >>> 1040: return 0; >> >> This is interesting. Doing some thinking out loud here. >> >> I realize we want to very strictly enforce the generation sizes (indicated by the affiliation of regions to generations in a formal sense of generation sizes), but I do wonder if humongous regions should not enter into that calculus at all? In this case, the reason we would typically want to designate a humongous object as old (via promotion via this method) is because we don't want to have to spend effort scanning its contents. After all we never spend any time copying it when it survives a minor collection. 
Under the circumstances, it appears as if we would always want humongous objects that are primitive type arrays to stay in young (never be promoted, although I admit that it might make sense to not pay even the cost of marking it if it's been around forever per generational hypothesis), and if a humongous object that has references (i.e. ages into the old generation) then it's affiliated with old and is "promoted" even if there aren't any available regions in old. In other wo rds, humongous objects, because they are never copied, have affiliations that do not affect the promotion calculus in a strict manner. >> >> For these reasons, I'd think that humongous object promotions should be treated specially and old generation size should not be a criterion for determining generational affiliation of humongous regions. > > I'm going to add a TODO comment here, so that we can think about changing this behavior. I totally agree with your rationale. Problem is that we have "assumptions" and "invariants" scattered throughout the existing implementation that need to be carefully reconsidered if we allow the rules to bend. (For example: there are lots of size_t subtractions that may overflow to huge unmeaningful numbers, and if we run with ShenandoahVerify enabled, it will complain if the size of the generation exceeds it capacity. I also like your idea about just keeping primitive humongous objects in YOUNG. That would allow their memory to be reclaimed much more quickly if and when they do become garbage. OTOH, it may create an "unexpected surprise" to anyone who is carefully specifying the sizes of young-gen and old-gen. Once we have auto-sizing of old- and young- fully working, this would be a good tradeoff to make. ------------- PR: https://git.openjdk.org/shenandoah/pull/179 From kdnilsen at openjdk.org Wed Dec 7 22:18:38 2022 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Wed, 7 Dec 2022 22:18:38 GMT Subject: RFR: Enforce max regions [v3] In-Reply-To: References: Message-ID: On Wed, 7 Dec 2022 20:17:07 GMT, Y. Srinivas Ramakrishna wrote: >> Kelvin Nilsen has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix white space and add an assertion > > src/hotspot/share/gc/shenandoah/shenandoahFullGC.cpp line 198: > >> 196: >> 197: if (heap->mode()->is_generational()) { >> 198: // Since we probably have not yet reclaimed the most recently selected collection set, we have to defer > > I'd make the comment less tentative, and state: > > // Since the most recently selected collection set may not have been reclaimed at this stage, > // we'll defer unadjust_avaliable() until after the full gc is completed. > > Question: is the adjusted available value (modulo the loaned size) used by full gc for any purpose, or is it to satisfy assertion checks / verification in some of the methods invoked during full gc work below? It's not used by full GC, but the Shenandoah verifier which runs at the start of full gc and then again at the end of full gc enforces compliance with the adjusted budgets. The verifier didn't used to care. But I made it care with this commit, and then I had to change where we do the unadjusting... 
> src/hotspot/share/gc/shenandoah/shenandoahGeneration.cpp line 924: > >> 922: size_t ShenandoahGeneration::decrement_affiliated_region_count() { >> 923: _affiliated_region_count--; >> 924: return _affiliated_region_count; > > Both these seem fine and probably more readable, but you'd save a line by returning the pre-{in,de}cremented result, e.g.: > > `return --_affiliated_region_count;` > > Would it be useful to assert that the region count is always non-zero? Actually, affiliated region count can be zero. Often starts out that way for old-gen. > src/hotspot/share/gc/shenandoah/shenandoahGeneration.hpp line 169: > >> 167: void scan_remembered_set(bool is_concurrent); >> 168: >> 169: size_t increment_affiliated_region_count(); > > Add a single line comment in the header file describing what a method returns: > > // Returns the affiliated region count following the operation. Thanks. > src/hotspot/share/gc/shenandoah/shenandoahVerifier.cpp line 343: > >> 341: }; >> 342: >> 343: class ShenandoahCalculateRegionStatsClosure : public ShenandoahHeapRegionClosure { > > A one-line documentation spec here would be useful: > > // A closure used to accumulate the net used, committed, and garbage bytes, and number of regions; > // typically associated with a generation in generational mode. Thanks. ------------- PR: https://git.openjdk.org/shenandoah/pull/179 From kdnilsen at openjdk.org Wed Dec 7 22:18:39 2022 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Wed, 7 Dec 2022 22:18:39 GMT Subject: RFR: Enforce max regions [v3] In-Reply-To: References: Message-ID: On Wed, 7 Dec 2022 22:11:42 GMT, Kelvin Nilsen wrote: >> src/hotspot/share/gc/shenandoah/shenandoahFullGC.cpp line 198: >> >>> 196: >>> 197: if (heap->mode()->is_generational()) { >>> 198: // Since we probably have not yet reclaimed the most recently selected collection set, we have to defer >> >> I'd make the comment less tentative, and state: >> >> // Since the most recently selected collection set may not have been reclaimed at this stage, >> // we'll defer unadjust_avaliable() until after the full gc is completed. >> >> Question: is the adjusted available value (modulo the loaned size) used by full gc for any purpose, or is it to satisfy assertion checks / verification in some of the methods invoked during full gc work below? > > It's not used by full GC, but the Shenandoah verifier which runs at the start of full gc and then again at the end of full gc enforces compliance with the adjusted budgets. > > The verifier didn't used to care. But I made it care with this commit, and then I had to change where we do the unadjusting... I'll fix the comment. ------------- PR: https://git.openjdk.org/shenandoah/pull/179 From kdnilsen at openjdk.org Wed Dec 7 22:29:39 2022 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Wed, 7 Dec 2022 22:29:39 GMT Subject: RFR: Enforce max regions [v4] In-Reply-To: References: Message-ID: > This commit enforces upper bounds on the number of ShenandoahHeapRegions affiliated with each generation. Prior to this change, enforcement of generation sizes was by usage alone. This allowed situations in which so many sparsely populated regions were affiliated with old-gen that there were insufficient FREE regions available to satisfy legitimate young-gen allocation requests. This was resulting in excessive TLAB allocation failures and degenerated collections. 
Kelvin Nilsen has updated the pull request incrementally with one additional commit since the last revision: Comments in response to reviewer feedback ------------- Changes: - all: https://git.openjdk.org/shenandoah/pull/179/files - new: https://git.openjdk.org/shenandoah/pull/179/files/4617913f..8e23c321 Webrevs: - full: https://webrevs.openjdk.org/?repo=shenandoah&pr=179&range=03 - incr: https://webrevs.openjdk.org/?repo=shenandoah&pr=179&range=02-03 Stats: 28 lines in 6 files changed: 24 ins; 0 del; 4 mod Patch: https://git.openjdk.org/shenandoah/pull/179.diff Fetch: git fetch https://git.openjdk.org/shenandoah pull/179/head:pull/179 PR: https://git.openjdk.org/shenandoah/pull/179 From kdnilsen at openjdk.org Wed Dec 7 22:37:55 2022 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Wed, 7 Dec 2022 22:37:55 GMT Subject: RFR: Enforce max regions [v3] In-Reply-To: References: Message-ID: On Wed, 7 Dec 2022 18:22:40 GMT, Y. Srinivas Ramakrishna wrote: >> Kelvin Nilsen has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix white space and add an assertion > > src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp line 1491: > >> 1489: // doing this work during a safepoint. We cannot put humongous regions into the collection set because that >> 1490: // triggers the load-reference barrier (LRB) to copy on reference fetch. >> 1491: if (r->promote_humongous() == 0) { > > See my comment in ::promote_humongous(). > > I think that method could directly call the requisite expansion code under those circumstances, so this code can move there, with (as I noted there) promotion always succeeding for humongous object arrays at least, but in general for all humongous objects that are deemed eligible for promotion by other criteria (see my note in ::promote_humongous() on potentially treating humongous primitive type arrays differently from humongous object arrays). I like your ideas, but I'll suggest we tackle this in a future distinct pr. ------------- PR: https://git.openjdk.org/shenandoah/pull/179 From kdnilsen at openjdk.org Wed Dec 7 22:55:51 2022 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Wed, 7 Dec 2022 22:55:51 GMT Subject: Integrated: Enforce max regions In-Reply-To: References: Message-ID: <2pK2RGabJYCsXrLQnb7eTQT-NZ0itfK9QydBix_lY0M=.cf35f265-6df3-4c9e-ba20-7c4f1c0ff112@github.com> On Tue, 6 Dec 2022 17:57:18 GMT, Kelvin Nilsen wrote: > This commit enforces upper bounds on the number of ShenandoahHeapRegions affiliated with each generation. Prior to this change, enforcement of generation sizes was by usage alone. This allowed situations in which so many sparsely populated regions were affiliated with old-gen that there were insufficient FREE regions available to satisfy legitimate young-gen allocation requests. This was resulting in excessive TLAB allocation failures and degenerated collections. This pull request has now been integrated. Changeset: 25469283 Author: Kelvin Nilsen URL: https://git.openjdk.org/shenandoah/commit/25469283fbe14e85adeaf0e3a21d40faea5f7288 Stats: 202 lines in 10 files changed: 150 ins; 17 del; 35 mod Enforce max regions Reviewed-by: wkemper, ysr ------------- PR: https://git.openjdk.org/shenandoah/pull/179 From ysr at openjdk.org Thu Dec 8 00:59:31 2022 From: ysr at openjdk.org (Y. 
Srinivas Ramakrishna) Date: Thu, 8 Dec 2022 00:59:31 GMT Subject: RFR: JDK-8297796 GenShen: instrument the remembered set scan Message-ID: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com> **Note:** This pull request is a draft to share the diffs with the project team. The following additional work is planned before this is ready to commit. (Thanks to Kevin, Roman, William etc. for feedback & suggestions.) (1) Collect performance data from SpecJBB and from the pipeline to assess the impact of instrumentation on concurrent remembered set scanning and concurrent update refs phase durations, in addition to the existing data from Extermem mentioned in the ticket. (2) Make available the instrumentation only in non-product (optimized) mode until better performance is achieved. (3) Any improvements that come from further feedback on this draft (e.g. better or different logging of the metrics data), or other suggestions that I may have missed mentioning above. The fix to ShenandoahNumberSeq will be separated out and made into a separate pull request on mainline. **Summary:** The main change is card stats collection during RS scanning. The code is protected by a new diagnostic flag `ShenandoahEnableCardStats`, which is off by default. With the flag disabled there is a small performance impact (measured with extremem; more data will be collected, see above). With the flag enabled there is a larger performance impact because of the large number of clusters, with shared stats updates at the end of each cluster processed. Since we expect the loops in process_clusters() to change in the near future, informed by the learnings from these stats, we expect to work further on reducing the cost of the stats collection as well. Currently the stats are logged per thread at the end of each RS scan. I'm happy to refine both the stats that we collect as well as how frequently we log the data once we have gathered some experience on how we use this. **Details of files changed:** 1. shenandoahGeneration.cpp: add a call to log info at the end of remembered set scan when card stats are enabled. 2. shenandoahHeap.cpp: minor retsructuring of a loop for task claiming during update refs; introduce a worker id option to downstream code for card stats 3. shenandoahNumberSeq.cpp: fix a minor issue with a boundary condition check in code that tries to find the right bucket to increment. This was triggering an assert in the update code. 4. shenandoahNumberSeq.hpp: provide missing allocation spec for BinaryMagnitudeSeq 5. shenandoahScanRemembered.cpp: new class ShenandoahCardStats methods. Minor restructure of loop for task claiming during RS scanning (akin to the one for update refs in 2 above). 6. shenandoahScanRemembered.hpp: Diff looks large because of git-diff'ism having issues with indentation change in restructured if-else branches. Not sure how to make the diffs more easily readable. Updated some documentation comments that were slightly obsolete. New class ShenandoahCardStats and implementation of inline methods. Class ShenandoahScanRemembered keeps cumuative running histograms. Remove some inline declarations for larger methods that we shouldn't force inlining on. Update some old comments. 7. shenandoahScanRemembered.inline.hpp: As in 6, diff looks larger than it should because of the same indentation change. ShenandoahScanRemembered::process_clusters() is the method where the instrumentation probes have been inserted. 
A couple of variables were renamed for clarity, as well as ti update local variables rather than method arguments. The large diffs at (old) line 589 onwards is the git-diff'ism to do with indentation change. Delete some unused methods. 8. shenandoah_globals.hpp: new diagnostic flag `ShenandoahEnableCardStats` protects the stats collection code and is disabled by default. **Format of stats produced and how to interpret them: (sample)** [1211.515s][info][gc,task ] GC(7069) Using 10 of 20 workers for Concurrent remembered set scanning [1211.529s][info][gc,remset ] GC(7069) Worker 0 Card Stats Histo: [1211.529s][info][gc,remset ] GC(7069) dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] [1211.529s][info][gc,remset ] GC(7069) clean_run: [ 0.00 0.00 0.00 99.61 100.00 ] [1211.529s][info][gc,remset ] GC(7069) dirty_cards: [ 0.00 0.00 0.00 0.00 100.00 ] [1211.529s][info][gc,remset ] GC(7069) clean_cards: [ 0.00 99.61 99.61 99.61 100.00 ] [1211.529s][info][gc,remset ] GC(7069) max_dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] [1211.529s][info][gc,remset ] GC(7069) max_clean_run: [ 0.00 99.61 99.61 99.61 100.00 ] [1211.529s][info][gc,remset ] GC(7069) dirty_objs: [ 0.00 0.00 0.00 0.00 1245.00 ] [1211.529s][info][gc,remset ] GC(7069) clean_objs: [ 0.00 0.00 0.00 0.00 64.00 ] [1211.529s][info][gc,remset ] GC(7069) dirty_scans: [ 0.00 0.00 0.00 0.00 1157.00 ] [1211.529s][info][gc,remset ] GC(7069) clean_scans: [ 0.00 0.00 0.00 0.00 17.00 ] [1211.529s][info][gc,remset ] GC(7069) alternations: [ 0.00 0.00 0.00 0.00 39.00 ] [1211.529s][info][gc,remset ] GC(7069) Worker 1 Card Stats Histo: [1211.529s][info][gc,remset ] GC(7069) dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] [1211.529s][info][gc,remset ] GC(7069) clean_run: [ 0.00 0.00 0.00 99.61 100.00 ] [1211.529s][info][gc,remset ] GC(7069) dirty_cards: [ 0.00 0.00 0.00 0.00 100.00 ] [1211.529s][info][gc,remset ] GC(7069) clean_cards: [ 0.00 99.61 99.61 99.61 100.00 ] [1211.529s][info][gc,remset ] GC(7069) max_dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] [1211.529s][info][gc,remset ] GC(7069) max_clean_run: [ 0.00 99.61 99.61 99.61 100.00 ] [1211.529s][info][gc,remset ] GC(7069) dirty_objs: [ 0.00 0.00 0.00 0.00 1257.00 ] [1211.529s][info][gc,remset ] GC(7069) clean_objs: [ 0.00 0.00 0.00 0.00 64.00 ] [1211.529s][info][gc,remset ] GC(7069) dirty_scans: [ 0.00 0.00 0.00 0.00 1197.00 ] [1211.529s][info][gc,remset ] GC(7069) clean_scans: [ 0.00 0.00 0.00 0.00 17.00 ] [1211.529s][info][gc,remset ] GC(7069) alternations: [ 0.00 0.00 0.00 0.00 39.00 ] ... The rows represent the metric that's being tracked, and the columns are, respectively, minimum, the 3 quartiles (25%, 50%, 75%) and the maximum. The metrics are: - dirty_run: the length of an uninterrupted run of dirty cards, interpretedas a percentage of a chunk of work assignment (cluster) processed by a thread - clean_run: as above, but the length of an uninterrupted run of clean cards - dirty_cards, clean_cards: as above, but counts of cards as a percentage of chunk - max_dirty_run & max_clean_run: Similarly for the maximum of each. - dirty_objs, clean_objs: these are numbers of objects in any chunk walked, or scanned - dirty_scans, clean_scans: numbers of objects scanned by the closure - alternations: the number of times that we transitioned from clean to dirty or dirty to clean in a chunk The data above indicates that at least 75% of the chunks have no alternations at all, and cards are almost always mostly clean for this specific benchmark config (extremem). 
Comparing worker stats from worker 0 and worker 9 indicates very little difference between their statistics, as one might typically expect for well-balanced RS scans. **Questions:** 1. Would it make sense to print also, for example, the 1, 10, 90 and 99 percentiles for these metrics as well, in addition to the quartiles? 2. The distributions are per worker for the cumulative history of the run. Would data per RS scan or per Refs Update phase provide more useful information? 3. Any suggestions for a more easily consumable format? 4. I welcome any other feedback on the pull request. ------------- Commit messages: - Merge branch 'master' into JVM-1264 - Card stats only in non-product mode (until impact of stats collection is - Merge branch 'master' into JVM-1264 - Merge branch 'master' into JVM-1264 - jcheck whitespace fixes. - Fix card_stats() so it doesn't crash when card stats aren't enabled. - Fix comment. - Don't allocate stats arrays if not enabled. Should we decide we want - Disable card stats printing when disabled - Remove compile time preprocesor option. - ... and 25 more: https://git.openjdk.org/shenandoah/compare/25469283...f5669577 Changes: https://git.openjdk.org/shenandoah/pull/176/files Webrev: https://webrevs.openjdk.org/?repo=shenandoah&pr=176&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8297796 Stats: 738 lines in 8 files changed: 369 ins; 220 del; 149 mod Patch: https://git.openjdk.org/shenandoah/pull/176.diff Fetch: git fetch https://git.openjdk.org/shenandoah pull/176/head:pull/176 PR: https://git.openjdk.org/shenandoah/pull/176 From ysr at openjdk.org Thu Dec 8 00:59:33 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Thu, 8 Dec 2022 00:59:33 GMT Subject: RFR: JDK-8297796 GenShen: instrument the remembered set scan In-Reply-To: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com> References: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com> Message-ID: On Thu, 1 Dec 2022 19:55:45 GMT, Y. Srinivas Ramakrishna wrote: > **Note:** > This pull request is a draft to share the diffs with the project team. The following additional work is planned before this is ready to commit. (Thanks to Kevin, Roman, William etc. for feedback & suggestions.) > > (1) Collect performance data from SpecJBB and from the pipeline to assess the impact of instrumentation on concurrent remembered set scanning and concurrent update refs phase durations, in addition to the existing data from Extermem mentioned in the ticket. > > (2) Make available the instrumentation only in non-product (optimized) mode until better performance is achieved. > > (3) Any improvements that come from further feedback on this draft (e.g. better or different logging of the metrics data), or other suggestions that I may have missed mentioning above. > > The fix to ShenandoahNumberSeq will be separated out and made into a separate pull request on mainline. > > **Summary:** > The main change is card stats collection during RS scanning. The code is protected by a new diagnostic flag `ShenandoahEnableCardStats`, which is off by default. With the flag disabled there is a small performance impact (measured with extremem; more data will be collected, see above). With the flag enabled there is a larger performance impact because of the large number of clusters, with shared stats updates at the end of each cluster processed. 
Since we expect the loops in process_clusters() to change in the near future, informed by the learnings from these stats, we expect to work further on reducing the cost of the stats collection as well. Currently the stats are logged per thread at the end of each RS scan. I'm happy to refine both the stats that we collect as well as how frequently we log the data once we have gathered some experience on how we use this. > > **Details of files changed:** > > 1. shenandoahGeneration.cpp: add a call to log info at the end of remembered set scan when card stats are enabled. > 2. shenandoahHeap.cpp: minor retsructuring of a loop for task claiming during update refs; introduce a worker id option to downstream code for card stats > 3. shenandoahNumberSeq.cpp: fix a minor issue with a boundary condition check in code that tries to find the right bucket to increment. This was triggering an assert in the update code. > 4. shenandoahNumberSeq.hpp: provide missing allocation spec for BinaryMagnitudeSeq > 5. shenandoahScanRemembered.cpp: new class ShenandoahCardStats methods. Minor restructure of loop for task claiming during RS scanning (akin to the one for update refs in 2 above). > 6. shenandoahScanRemembered.hpp: Diff looks large because of git-diff'ism having issues with indentation change in restructured if-else branches. Not sure how to make the diffs more easily readable. Updated some documentation comments that were slightly obsolete. New class ShenandoahCardStats and implementation of inline methods. Class ShenandoahScanRemembered keeps cumuative running histograms. Remove some inline declarations for larger methods that we shouldn't force inlining on. Update some old comments. > 7. shenandoahScanRemembered.inline.hpp: As in 6, diff looks larger than it should because of the same indentation change. ShenandoahScanRemembered::process_clusters() is the method where the instrumentation probes have been inserted. A couple of variables were renamed for clarity, as well as ti update local variables rather than method arguments. The large diffs at (old) line 589 onwards is the git-diff'ism to do with indentation change. Delete some unused methods. > 8. shenandoah_globals.hpp: new diagnostic flag `ShenandoahEnableCardStats` protects the stats collection code and is disabled by default. 
> > **Format of stats produced and how to interpret them: (sample)** > > > [1211.515s][info][gc,task ] GC(7069) Using 10 of 20 workers for Concurrent remembered set scanning > [1211.529s][info][gc,remset ] GC(7069) Worker 0 Card Stats Histo: > [1211.529s][info][gc,remset ] GC(7069) dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_run: [ 0.00 0.00 0.00 99.61 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) dirty_cards: [ 0.00 0.00 0.00 0.00 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_cards: [ 0.00 99.61 99.61 99.61 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) max_dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) max_clean_run: [ 0.00 99.61 99.61 99.61 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) dirty_objs: [ 0.00 0.00 0.00 0.00 1245.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_objs: [ 0.00 0.00 0.00 0.00 64.00 ] > [1211.529s][info][gc,remset ] GC(7069) dirty_scans: [ 0.00 0.00 0.00 0.00 1157.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_scans: [ 0.00 0.00 0.00 0.00 17.00 ] > [1211.529s][info][gc,remset ] GC(7069) alternations: [ 0.00 0.00 0.00 0.00 39.00 ] > [1211.529s][info][gc,remset ] GC(7069) Worker 1 Card Stats Histo: > [1211.529s][info][gc,remset ] GC(7069) dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_run: [ 0.00 0.00 0.00 99.61 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) dirty_cards: [ 0.00 0.00 0.00 0.00 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_cards: [ 0.00 99.61 99.61 99.61 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) max_dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) max_clean_run: [ 0.00 99.61 99.61 99.61 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) dirty_objs: [ 0.00 0.00 0.00 0.00 1257.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_objs: [ 0.00 0.00 0.00 0.00 64.00 ] > [1211.529s][info][gc,remset ] GC(7069) dirty_scans: [ 0.00 0.00 0.00 0.00 1197.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_scans: [ 0.00 0.00 0.00 0.00 17.00 ] > [1211.529s][info][gc,remset ] GC(7069) alternations: [ 0.00 0.00 0.00 0.00 39.00 ] > ... > > > The rows represent the metric that's being tracked, and the columns are, respectively, minimum, the 3 quartiles (25%, 50%, 75%) and the maximum. The metrics are: > > - dirty_run: the length of an uninterrupted run of dirty cards, interpretedas a percentage of a chunk of work assignment (cluster) processed by a thread > - clean_run: as above, but the length of an uninterrupted run of clean cards > - dirty_cards, clean_cards: as above, but counts of cards as a percentage of chunk > - max_dirty_run & max_clean_run: Similarly for the maximum of each. > - dirty_objs, clean_objs: these are numbers of objects in any chunk walked, or scanned > - dirty_scans, clean_scans: numbers of objects scanned by the closure > - alternations: the number of times that we transitioned from clean to dirty or dirty to clean in a chunk > > The data above indicates that at least 75% of the chunks have no alternations at all, > and cards are almost always mostly clean for this specific benchmark config (extremem). > > Comparing worker stats from worker 0 and worker 9 indicates very little difference between > their statistics, as one might typically expect for well-balanced RS scans. > > **Questions:** > > 1. Would it make sense to print also, for example, the 1, 10, 90 and 99 percentiles for these metrics as well, in addition to the quartiles? > 2. 
The distributions are per worker for the cumulative history of the run. Would data per RS scan or per Refs Update phase provide more useful information? > 3. Any suggestions for a more easily consumable format? > 4. I welcome any other feedback on the pull request. Pulled code into non-product mode. Will verify that changes are performance-neutral in product mode. Built & tested slowdebug, fastdebug, optimized, and product builds, and verified that flag & code could be enabled only in non-product builds, and was off by default in all non-debug modes (including optimized where it was available, but disabled by default). Please see the draft pull request message above for further details. The PR is now open for review; thanks for your reviews/comments/feedback! src/hotspot/share/gc/shenandoah/shenandoahNumberSeq.cpp line 59: > 57: if (v > 0) { > 58: mag = 0; > 59: while (v >= 1) { You can safely ignore the changes in this file and the next. They are part of a separate PR to tip, and will eventually get reconciled when tip is merged into the project repo. ------------- PR: https://git.openjdk.org/shenandoah/pull/176 From ysr at openjdk.org Thu Dec 8 01:11:43 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Thu, 8 Dec 2022 01:11:43 GMT Subject: RFR: JDK-8297796 GenShen: instrument the remembered set scan [v2] In-Reply-To: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com> References: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com> Message-ID: <9E6NmFY5877JXtI7RKpqa1r2nXDaEJ7xxLG9q0hEP6U=.03c76ffe-bac9-4401-9091-4ee19d6a394e@github.com> > **Note:** > This pull request is a draft to share the diffs with the project team. The following additional work is planned before this is ready to commit. (Thanks to Kevin, Roman, William etc. for feedback & suggestions.) > > (1) Collect performance data from SpecJBB and from the pipeline to assess the impact of instrumentation on concurrent remembered set scanning and concurrent update refs phase durations, in addition to the existing data from Extermem mentioned in the ticket. > > (2) Make available the instrumentation only in non-product (optimized) mode until better performance is achieved. > > (3) Any improvements that come from further feedback on this draft (e.g. better or different logging of the metrics data), or other suggestions that I may have missed mentioning above. > > The fix to ShenandoahNumberSeq will be separated out and made into a separate pull request on mainline. > > **Summary:** > The main change is card stats collection during RS scanning. The code is protected by a new diagnostic flag `ShenandoahEnableCardStats`, which is off by default. With the flag disabled there is a small performance impact (measured with extremem; more data will be collected, see above). With the flag enabled there is a larger performance impact because of the large number of clusters, with shared stats updates at the end of each cluster processed. Since we expect the loops in process_clusters() to change in the near future, informed by the learnings from these stats, we expect to work further on reducing the cost of the stats collection as well. Currently the stats are logged per thread at the end of each RS scan. I'm happy to refine both the stats that we collect as well as how frequently we log the data once we have gathered some experience on how we use this. > > **Details of files changed:** > > 1. 
shenandoahGeneration.cpp: add a call to log info at the end of remembered set scan when card stats are enabled. > 2. shenandoahHeap.cpp: minor retsructuring of a loop for task claiming during update refs; introduce a worker id option to downstream code for card stats > 3. shenandoahNumberSeq.cpp: fix a minor issue with a boundary condition check in code that tries to find the right bucket to increment. This was triggering an assert in the update code. > 4. shenandoahNumberSeq.hpp: provide missing allocation spec for BinaryMagnitudeSeq > 5. shenandoahScanRemembered.cpp: new class ShenandoahCardStats methods. Minor restructure of loop for task claiming during RS scanning (akin to the one for update refs in 2 above). > 6. shenandoahScanRemembered.hpp: Diff looks large because of git-diff'ism having issues with indentation change in restructured if-else branches. Not sure how to make the diffs more easily readable. Updated some documentation comments that were slightly obsolete. New class ShenandoahCardStats and implementation of inline methods. Class ShenandoahScanRemembered keeps cumuative running histograms. Remove some inline declarations for larger methods that we shouldn't force inlining on. Update some old comments. > 7. shenandoahScanRemembered.inline.hpp: As in 6, diff looks larger than it should because of the same indentation change. ShenandoahScanRemembered::process_clusters() is the method where the instrumentation probes have been inserted. A couple of variables were renamed for clarity, as well as ti update local variables rather than method arguments. The large diffs at (old) line 589 onwards is the git-diff'ism to do with indentation change. Delete some unused methods. > 8. shenandoah_globals.hpp: new diagnostic flag `ShenandoahEnableCardStats` protects the stats collection code and is disabled by default. 
> > **Format of stats produced and how to interpret them: (sample)** > > > [1211.515s][info][gc,task ] GC(7069) Using 10 of 20 workers for Concurrent remembered set scanning > [1211.529s][info][gc,remset ] GC(7069) Worker 0 Card Stats Histo: > [1211.529s][info][gc,remset ] GC(7069) dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_run: [ 0.00 0.00 0.00 99.61 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) dirty_cards: [ 0.00 0.00 0.00 0.00 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_cards: [ 0.00 99.61 99.61 99.61 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) max_dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) max_clean_run: [ 0.00 99.61 99.61 99.61 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) dirty_objs: [ 0.00 0.00 0.00 0.00 1245.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_objs: [ 0.00 0.00 0.00 0.00 64.00 ] > [1211.529s][info][gc,remset ] GC(7069) dirty_scans: [ 0.00 0.00 0.00 0.00 1157.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_scans: [ 0.00 0.00 0.00 0.00 17.00 ] > [1211.529s][info][gc,remset ] GC(7069) alternations: [ 0.00 0.00 0.00 0.00 39.00 ] > [1211.529s][info][gc,remset ] GC(7069) Worker 1 Card Stats Histo: > [1211.529s][info][gc,remset ] GC(7069) dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_run: [ 0.00 0.00 0.00 99.61 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) dirty_cards: [ 0.00 0.00 0.00 0.00 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_cards: [ 0.00 99.61 99.61 99.61 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) max_dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) max_clean_run: [ 0.00 99.61 99.61 99.61 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) dirty_objs: [ 0.00 0.00 0.00 0.00 1257.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_objs: [ 0.00 0.00 0.00 0.00 64.00 ] > [1211.529s][info][gc,remset ] GC(7069) dirty_scans: [ 0.00 0.00 0.00 0.00 1197.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_scans: [ 0.00 0.00 0.00 0.00 17.00 ] > [1211.529s][info][gc,remset ] GC(7069) alternations: [ 0.00 0.00 0.00 0.00 39.00 ] > ... > > > The rows represent the metric that's being tracked, and the columns are, respectively, minimum, the 3 quartiles (25%, 50%, 75%) and the maximum. The metrics are: > > - dirty_run: the length of an uninterrupted run of dirty cards, interpretedas a percentage of a chunk of work assignment (cluster) processed by a thread > - clean_run: as above, but the length of an uninterrupted run of clean cards > - dirty_cards, clean_cards: as above, but counts of cards as a percentage of chunk > - max_dirty_run & max_clean_run: Similarly for the maximum of each. > - dirty_objs, clean_objs: these are numbers of objects in any chunk walked, or scanned > - dirty_scans, clean_scans: numbers of objects scanned by the closure > - alternations: the number of times that we transitioned from clean to dirty or dirty to clean in a chunk > > The data above indicates that at least 75% of the chunks have no alternations at all, > and cards are almost always mostly clean for this specific benchmark config (extremem). > > Comparing worker stats from worker 0 and worker 9 indicates very little difference between > their statistics, as one might typically expect for well-balanced RS scans. > > **Questions:** > > 1. Would it make sense to print also, for example, the 1, 10, 90 and 99 percentiles for these metrics as well, in addition to the quartiles? > 2. 
The distributions are per worker for the cumulative history of the run. Would data per RS scan or per Refs Update phase provide more useful information? > 3. Any suggestions for a more easily consumable format? > 4. I welcome any other feedback on the pull request. Y. Srinivas Ramakrishna has updated the pull request incrementally with one additional commit since the last revision: Moved some more methods into non-product mode. ------------- Changes: - all: https://git.openjdk.org/shenandoah/pull/176/files - new: https://git.openjdk.org/shenandoah/pull/176/files/f5669577..c0a4a9d7 Webrevs: - full: https://webrevs.openjdk.org/?repo=shenandoah&pr=176&range=01 - incr: https://webrevs.openjdk.org/?repo=shenandoah&pr=176&range=00-01 Stats: 4 lines in 2 files changed: 2 ins; 0 del; 2 mod Patch: https://git.openjdk.org/shenandoah/pull/176.diff Fetch: git fetch https://git.openjdk.org/shenandoah pull/176/head:pull/176 PR: https://git.openjdk.org/shenandoah/pull/176 From ysr at openjdk.org Thu Dec 8 09:13:40 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Thu, 8 Dec 2022 09:13:40 GMT Subject: RFR: JDK-8298138: Shenandoah: HdrSeq asserts "sub-bucket index (512) overflow for value ( 1.00)" [v3] In-Reply-To: References: Message-ID: > JBS link: https://bugs.openjdk.org/browse/JDK-8298138 > - Fixed a boundary condition that was triggering an assert. > - Added a simple-minded gtest for HdrSeq, which allows one to exercise the asserting code in a debug build. > - Tested with: `CONF=slowdebug make run-test TEST="gtest:BasicShenandoahNumberSeqTest"` Y. Srinivas Ramakrishna has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: - Merge branch 'master' into shen_numberseq - - Copyright dates etc. - include reorder to alphabetic; don't use/include std:: namespace. - Merge branch 'master' into shen_numberseq - A simple-minded test of HdrSeq which also exercises the problematic code. - Fix a boundary condition issue w/HdrSeq ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11524/files - new: https://git.openjdk.org/jdk/pull/11524/files/a714630c..a0edcbda Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11524&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11524&range=01-02 Stats: 14012 lines in 340 files changed: 9690 ins; 3152 del; 1170 mod Patch: https://git.openjdk.org/jdk/pull/11524.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11524/head:pull/11524 PR: https://git.openjdk.org/jdk/pull/11524 From shade at openjdk.org Thu Dec 8 10:07:26 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 8 Dec 2022 10:07:26 GMT Subject: RFR: JDK-8298138: Shenandoah: HdrSeq asserts "sub-bucket index (512) overflow for value ( 1.00)" [v3] In-Reply-To: References: Message-ID: On Thu, 8 Dec 2022 09:13:40 GMT, Y. Srinivas Ramakrishna wrote: >> JBS link: https://bugs.openjdk.org/browse/JDK-8298138 >> - Fixed a boundary condition that was triggering an assert. >> - Added a simple-minded gtest for HdrSeq, which allows one to exercise the asserting code in a debug build. >> - Tested with: `CONF=slowdebug make run-test TEST="gtest:BasicShenandoahNumberSeqTest"` > > Y. Srinivas Ramakrishna has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains five additional commits since the last revision: > > - Merge branch 'master' into shen_numberseq > - - Copyright dates etc. > - include reorder to alphabetic; don't use/include std:: namespace. > - Merge branch 'master' into shen_numberseq > - A simple-minded test of HdrSeq which also exercises the problematic > code. > - Fix a boundary condition issue w/HdrSeq So the failure is: we want the bucket to cover `[a; a+n)`, but current code makes it cover `[a; a+n]`, which means the right-most value would overflow its assignment for sub-bucket? If so, the fix looks good. ------------- Marked as reviewed by shade (Reviewer). PR: https://git.openjdk.org/jdk/pull/11524 From rkennke at openjdk.org Thu Dec 8 14:15:58 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 8 Dec 2022 14:15:58 GMT Subject: RFR: JDK-8298138: Shenandoah: HdrSeq asserts "sub-bucket index (512) overflow for value ( 1.00)" [v3] In-Reply-To: References: Message-ID: <49r00-alghwN1jXs4qCkuePRKAl2ZKEov5G0bWH7aQQ=.ec3b80c4-4b70-4251-a3f5-7492578c111a@github.com> On Thu, 8 Dec 2022 09:13:40 GMT, Y. Srinivas Ramakrishna wrote: >> JBS link: https://bugs.openjdk.org/browse/JDK-8298138 >> - Fixed a boundary condition that was triggering an assert. >> - Added a simple-minded gtest for HdrSeq, which allows one to exercise the asserting code in a debug build. >> - Tested with: `CONF=slowdebug make run-test TEST="gtest:BasicShenandoahNumberSeqTest"` > > Y. Srinivas Ramakrishna has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: > > - Merge branch 'master' into shen_numberseq > - - Copyright dates etc. > - include reorder to alphabetic; don't use/include std:: namespace. > - Merge branch 'master' into shen_numberseq > - A simple-minded test of HdrSeq which also exercises the problematic > code. > - Fix a boundary condition issue w/HdrSeq Looks good, thank you! ------------- Marked as reviewed by rkennke (Reviewer). PR: https://git.openjdk.org/jdk/pull/11524 From rkennke at openjdk.org Thu Dec 8 14:22:55 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 8 Dec 2022 14:22:55 GMT Subject: RFR: Generation resizing [v3] In-Reply-To: <8Fse7IxO14Uc0eJJoLMmGXSo8XYD9Qb144mCyrMX3-g=.1d5d7258-f06d-4a03-bf78-8102f22ada7d@github.com> References: <8Fse7IxO14Uc0eJJoLMmGXSo8XYD9Qb144mCyrMX3-g=.1d5d7258-f06d-4a03-bf78-8102f22ada7d@github.com> Message-ID: On Tue, 6 Dec 2022 17:26:08 GMT, William Kemper wrote: >> These changes have the generational mode track the minimum mutator utilization (percentage of process time used by mutators). When it falls below a configuration percentage (GCTimeRatio), a heuristic will transfer memory capacity to whatever generation has been using more CPU time. The assumption here is that by increasing capacity, we will decrease the collection frequency and improve the MMU. > > William Kemper has updated the pull request incrementally with one additional commit since the last revision: > > Remove vestigial lock, do not enroll periodic task while holding threads_lock Looks good to me. Thank you! EDIT: well actually, it is still complaining about conflict, and you should also make it ready for review ;-) ------------- Marked as reviewed by rkennke (Lead). 
PR: https://git.openjdk.org/shenandoah/pull/177 From kdnilsen at openjdk.org Thu Dec 8 14:45:17 2022 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Thu, 8 Dec 2022 14:45:17 GMT Subject: RFR: JDK-8297796 GenShen: instrument the remembered set scan [v2] In-Reply-To: <9E6NmFY5877JXtI7RKpqa1r2nXDaEJ7xxLG9q0hEP6U=.03c76ffe-bac9-4401-9091-4ee19d6a394e@github.com> References: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com> <9E6NmFY5877JXtI7RKpqa1r2nXDaEJ7xxLG9q0hEP6U=.03c76ffe-bac9-4401-9091-4ee19d6a394e@github.com> Message-ID: On Thu, 8 Dec 2022 01:11:43 GMT, Y. Srinivas Ramakrishna wrote: >> **Note:** >> This pull request is a draft to share the diffs with the project team. The following additional work is planned before this is ready to commit. (Thanks to Kevin, Roman, William etc. for feedback & suggestions.) >> >> (1) Collect performance data from SpecJBB and from the pipeline to assess the impact of instrumentation on concurrent remembered set scanning and concurrent update refs phase durations, in addition to the existing data from Extermem mentioned in the ticket. >> >> (2) Make available the instrumentation only in non-product (optimized) mode until better performance is achieved. >> >> (3) Any improvements that come from further feedback on this draft (e.g. better or different logging of the metrics data), or other suggestions that I may have missed mentioning above. >> >> The fix to ShenandoahNumberSeq will be separated out and made into a separate pull request on mainline. >> >> **Summary:** >> The main change is card stats collection during RS scanning. The code is protected by a new diagnostic flag `ShenandoahEnableCardStats`, which is off by default. With the flag disabled there is a small performance impact (measured with extremem; more data will be collected, see above). With the flag enabled there is a larger performance impact because of the large number of clusters, with shared stats updates at the end of each cluster processed. Since we expect the loops in process_clusters() to change in the near future, informed by the learnings from these stats, we expect to work further on reducing the cost of the stats collection as well. Currently the stats are logged per thread at the end of each RS scan. I'm happy to refine both the stats that we collect as well as how frequently we log the data once we have gathered some experience on how we use this. >> >> **Details of files changed:** >> >> 1. shenandoahGeneration.cpp: add a call to log info at the end of remembered set scan when card stats are enabled. >> 2. shenandoahHeap.cpp: minor retsructuring of a loop for task claiming during update refs; introduce a worker id option to downstream code for card stats >> 3. shenandoahNumberSeq.cpp: fix a minor issue with a boundary condition check in code that tries to find the right bucket to increment. This was triggering an assert in the update code. >> 4. shenandoahNumberSeq.hpp: provide missing allocation spec for BinaryMagnitudeSeq >> 5. shenandoahScanRemembered.cpp: new class ShenandoahCardStats methods. Minor restructure of loop for task claiming during RS scanning (akin to the one for update refs in 2 above). >> 6. shenandoahScanRemembered.hpp: Diff looks large because of git-diff'ism having issues with indentation change in restructured if-else branches. Not sure how to make the diffs more easily readable. Updated some documentation comments that were slightly obsolete. 
New class ShenandoahCardStats and implementation of inline methods. Class ShenandoahScanRemembered keeps cumuative running histograms. Remove some inline declarations for larger methods that we shouldn't force inlining on. Update some old comments. >> 7. shenandoahScanRemembered.inline.hpp: As in 6, diff looks larger than it should because of the same indentation change. ShenandoahScanRemembered::process_clusters() is the method where the instrumentation probes have been inserted. A couple of variables were renamed for clarity, as well as ti update local variables rather than method arguments. The large diffs at (old) line 589 onwards is the git-diff'ism to do with indentation change. Delete some unused methods. >> 8. shenandoah_globals.hpp: new diagnostic flag `ShenandoahEnableCardStats` protects the stats collection code and is disabled by default. >> >> **Format of stats produced and how to interpret them: (sample)** >> >> >> [1211.515s][info][gc,task ] GC(7069) Using 10 of 20 workers for Concurrent remembered set scanning >> [1211.529s][info][gc,remset ] GC(7069) Worker 0 Card Stats Histo: >> [1211.529s][info][gc,remset ] GC(7069) dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) clean_run: [ 0.00 0.00 0.00 99.61 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) dirty_cards: [ 0.00 0.00 0.00 0.00 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) clean_cards: [ 0.00 99.61 99.61 99.61 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) max_dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) max_clean_run: [ 0.00 99.61 99.61 99.61 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) dirty_objs: [ 0.00 0.00 0.00 0.00 1245.00 ] >> [1211.529s][info][gc,remset ] GC(7069) clean_objs: [ 0.00 0.00 0.00 0.00 64.00 ] >> [1211.529s][info][gc,remset ] GC(7069) dirty_scans: [ 0.00 0.00 0.00 0.00 1157.00 ] >> [1211.529s][info][gc,remset ] GC(7069) clean_scans: [ 0.00 0.00 0.00 0.00 17.00 ] >> [1211.529s][info][gc,remset ] GC(7069) alternations: [ 0.00 0.00 0.00 0.00 39.00 ] >> [1211.529s][info][gc,remset ] GC(7069) Worker 1 Card Stats Histo: >> [1211.529s][info][gc,remset ] GC(7069) dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) clean_run: [ 0.00 0.00 0.00 99.61 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) dirty_cards: [ 0.00 0.00 0.00 0.00 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) clean_cards: [ 0.00 99.61 99.61 99.61 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) max_dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) max_clean_run: [ 0.00 99.61 99.61 99.61 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) dirty_objs: [ 0.00 0.00 0.00 0.00 1257.00 ] >> [1211.529s][info][gc,remset ] GC(7069) clean_objs: [ 0.00 0.00 0.00 0.00 64.00 ] >> [1211.529s][info][gc,remset ] GC(7069) dirty_scans: [ 0.00 0.00 0.00 0.00 1197.00 ] >> [1211.529s][info][gc,remset ] GC(7069) clean_scans: [ 0.00 0.00 0.00 0.00 17.00 ] >> [1211.529s][info][gc,remset ] GC(7069) alternations: [ 0.00 0.00 0.00 0.00 39.00 ] >> ... >> >> >> The rows represent the metric that's being tracked, and the columns are, respectively, minimum, the 3 quartiles (25%, 50%, 75%) and the maximum. 
The metrics are: >> >> - dirty_run: the length of an uninterrupted run of dirty cards, interpretedas a percentage of a chunk of work assignment (cluster) processed by a thread >> - clean_run: as above, but the length of an uninterrupted run of clean cards >> - dirty_cards, clean_cards: as above, but counts of cards as a percentage of chunk >> - max_dirty_run & max_clean_run: Similarly for the maximum of each. >> - dirty_objs, clean_objs: these are numbers of objects in any chunk walked, or scanned >> - dirty_scans, clean_scans: numbers of objects scanned by the closure >> - alternations: the number of times that we transitioned from clean to dirty or dirty to clean in a chunk >> >> The data above indicates that at least 75% of the chunks have no alternations at all, >> and cards are almost always mostly clean for this specific benchmark config (extremem). >> >> Comparing worker stats from worker 0 and worker 9 indicates very little difference between >> their statistics, as one might typically expect for well-balanced RS scans. >> >> **Questions:** >> >> 1. Would it make sense to print also, for example, the 1, 10, 90 and 99 percentiles for these metrics as well, in addition to the quartiles? >> 2. The distributions are per worker for the cumulative history of the run. Would data per RS scan or per Refs Update phase provide more useful information? >> 3. Any suggestions for a more easily consumable format? >> 4. I welcome any other feedback on the pull request. > > Y. Srinivas Ramakrishna has updated the pull request incrementally with one additional commit since the last revision: > > Moved some more methods into non-product mode. Thanks for sharing this code. A few overview comments: 1. Yes, I think it would be useful to see the data collected for each mark scan and each update-reference scan independently. Sometimes, abnormal behavior of the application causes spikes in performance, and it would be nice to understand the degree to which remembered set scanning is part of this spike. 2. It is also useful to have a cumulative summary of all costs at the end of a run, probably still separating out the mark scans from the update-refs scans. 3. Is it possible to eliminate the overhead entirely of this instrumentation by compiling it out for release builds? ------------- PR: https://git.openjdk.org/shenandoah/pull/176 From wkemper at openjdk.org Thu Dec 8 21:46:54 2022 From: wkemper at openjdk.org (William Kemper) Date: Thu, 8 Dec 2022 21:46:54 GMT Subject: RFR: Generation resizing [v4] In-Reply-To: References: Message-ID: > These changes have the generational mode track the minimum mutator utilization (percentage of process time used by mutators). When it falls below a configuration percentage (GCTimeRatio), a heuristic will transfer memory capacity to whatever generation has been using more CPU time. The assumption here is that by increasing capacity, we will decrease the collection frequency and improve the MMU. William Kemper has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 17 commits:

 - Merge branch 'shenandoah-master' into mmu-instrumentation
 - Remove vestigial lock, do not enroll periodic task while holding threads_lock
 - Remove unnecessary logging, clean up imports
 - Merge from shenandoah/master
 - Document the class responsible for adjusting generation sizes
 - Revert unnecessary change
 - Remove unused time between cycle tracking
 - Remove vestigial mmu tracker instance
 - Clamp adjustments to min/max when increment is too large
 - Adjust generation sizes from safepoint
 - ... and 7 more: https://git.openjdk.org/shenandoah/compare/25469283...50896e31

-------------

Changes: https://git.openjdk.org/shenandoah/pull/177/files
Webrev: https://webrevs.openjdk.org/?repo=shenandoah&pr=177&range=03
Stats: 448 lines in 22 files changed: 418 ins; 18 del; 12 mod
Patch: https://git.openjdk.org/shenandoah/pull/177.diff
Fetch: git fetch https://git.openjdk.org/shenandoah pull/177/head:pull/177

PR: https://git.openjdk.org/shenandoah/pull/177

From ysr at openjdk.org Thu Dec 8 21:47:38 2022
From: ysr at openjdk.org (Y. Srinivas Ramakrishna)
Date: Thu, 8 Dec 2022 21:47:38 GMT
Subject: RFR: JDK-8297796 GenShen: instrument the remembered set scan [v2]
In-Reply-To: 
References: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com>
 <9E6NmFY5877JXtI7RKpqa1r2nXDaEJ7xxLG9q0hEP6U=.03c76ffe-bac9-4401-9091-4ee19d6a394e@github.com>
Message-ID: 

On Thu, 8 Dec 2022 14:42:53 GMT, Kelvin Nilsen wrote:

> Thanks for sharing this code. A few overview comments:
>
> 1. Yes, I think it would be useful to see the data collected for each mark scan and each update-reference scan independently. Sometimes, abnormal behavior of the application causes spikes in performance, and it would be nice to understand the degree to which remembered set scanning is part of this spike.
> 2. It is also useful to have a cumulative summary of all costs at the end of a run, probably still separating out the mark scans from the update-refs scans.

I'll make those changes.

> 3. Is it possible to eliminate the overhead entirely of this instrumentation by compiling it out for release builds?

It essentially is compiled out of product/release builds, and compiled only into optimized and *debug builds. I'll gather numbers to support that, as well as include the `.s` listing for process_clusters, which should be unaffected by the presence of the stat calls since they would be inlined and constant-folded out.

I'll make the changes for 1. and 2., and add the supporting data for 3.

-------------

PR: https://git.openjdk.org/shenandoah/pull/176

From wkemper at openjdk.org Thu Dec 8 21:48:57 2022
From: wkemper at openjdk.org (William Kemper)
Date: Thu, 8 Dec 2022 21:48:57 GMT
Subject: Integrated: Generation resizing
In-Reply-To: 
References: 
Message-ID: 

On Sat, 3 Dec 2022 01:09:59 GMT, William Kemper wrote:

> These changes have the generational mode track the minimum mutator utilization (percentage of process time used by mutators). When it falls below a configuration percentage (GCTimeRatio), a heuristic will transfer memory capacity to whatever generation has been using more CPU time. The assumption here is that by increasing capacity, we will decrease the collection frequency and improve the MMU.

This pull request has now been integrated.

Changeset: ee49a488
Author: William Kemper
URL: https://git.openjdk.org/shenandoah/commit/ee49a4888452196877911f10b9b40fb08b2ae293
Stats: 448 lines in 22 files changed: 418 ins; 18 del; 12 mod

Generation resizing

Reviewed-by: rkennke, kdnilsen

-------------

PR: https://git.openjdk.org/shenandoah/pull/177

From ysr at openjdk.org Thu Dec 8 21:55:17 2022
From: ysr at openjdk.org (Y. Srinivas Ramakrishna)
Date: Thu, 8 Dec 2022 21:55:17 GMT
Subject: RFR: JDK-8298138: Shenandoah: HdrSeq asserts "sub-bucket index (512) overflow for value ( 1.00)" [v3]
In-Reply-To: 
References: 
Message-ID: <0zjCcXrkBBY9nRmwqX1UJBdbrZu2A9-PLxOOoRE2Q90=.4a83bba1-4d70-4c23-a2f5-6ea3183a9966@github.com>

On Thu, 8 Dec 2022 09:13:40 GMT, Y. Srinivas Ramakrishna wrote:

>> JBS link: https://bugs.openjdk.org/browse/JDK-8298138
>> - Fixed a boundary condition that was triggering an assert.
>> - Added a simple-minded gtest for HdrSeq, which allows one to exercise the asserting code in a debug build.
>> - Tested with: `CONF=slowdebug make run-test TEST="gtest:BasicShenandoahNumberSeqTest"`
>
> Y. Srinivas Ramakrishna has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision:
>
> - Merge branch 'master' into shen_numberseq
> - - Copyright dates etc.
>   - include reorder to alphabetic; don't use/include std:: namespace.
> - Merge branch 'master' into shen_numberseq
> - A simple-minded test of HdrSeq which also exercises the problematic
>   code.
> - Fix a boundary condition issue w/HdrSeq

Thanks for the reviews, Roman and Alexey!

-------------

PR: https://git.openjdk.org/jdk/pull/11524

From ysr at openjdk.org Thu Dec 8 21:57:22 2022
From: ysr at openjdk.org (Y. Srinivas Ramakrishna)
Date: Thu, 8 Dec 2022 21:57:22 GMT
Subject: Integrated: JDK-8298138: Shenandoah: HdrSeq asserts "sub-bucket index (512) overflow for value ( 1.00)"
In-Reply-To: 
References: 
Message-ID: <3JQOWP5xYJOM9dBqfHQtQTCh2hNFAHCkJXxRAzxCUgA=.0f90495d-ef94-4258-bcef-34c0afd01e3b@github.com>

On Tue, 6 Dec 2022 03:46:12 GMT, Y. Srinivas Ramakrishna wrote:

> JBS link: https://bugs.openjdk.org/browse/JDK-8298138
> - Fixed a boundary condition that was triggering an assert.
> - Added a simple-minded gtest for HdrSeq, which allows one to exercise the asserting code in a debug build.
> - Tested with: `CONF=slowdebug make run-test TEST="gtest:BasicShenandoahNumberSeqTest"`

This pull request has now been integrated.

Changeset: c16eb89c
Author: Y. Srinivas Ramakrishna
URL: https://git.openjdk.org/jdk/commit/c16eb89ce0d59f2ff83b6db0bee3e384ec8d5efe
Stats: 77 lines in 3 files changed: 74 ins; 0 del; 3 mod

8298138: Shenandoah: HdrSeq asserts "sub-bucket index (512) overflow for value ( 1.00)"

Reviewed-by: rkennke, shade

-------------

PR: https://git.openjdk.org/jdk/pull/11524

From ysr at openjdk.org Thu Dec 8 23:25:27 2022
From: ysr at openjdk.org (Y. Srinivas Ramakrishna)
Date: Thu, 8 Dec 2022 23:25:27 GMT
Subject: RFR: Generation resizing [v4]
In-Reply-To: 
References: 
Message-ID: 

On Thu, 8 Dec 2022 21:46:54 GMT, William Kemper wrote:

>> These changes have the generational mode track the minimum mutator utilization (percentage of process time used by mutators). When it falls below a configuration percentage (GCTimeRatio), a heuristic will transfer memory capacity to whatever generation has been using more CPU time.
>> The assumption here is that by increasing capacity, we will decrease the collection frequency and improve the MMU.
>
> William Kemper has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 17 commits:
>
>  - Merge branch 'shenandoah-master' into mmu-instrumentation
>  - Remove vestigial lock, do not enroll periodic task while holding threads_lock
>  - Remove unnecessary logging, clean up imports
>  - Merge from shenandoah/master
>  - Document the class responsible for adjusting generation sizes
>  - Revert unnecessary change
>  - Remove unused time between cycle tracking
>  - Remove vestigial mmu tracker instance
>  - Clamp adjustments to min/max when increment is too large
>  - Adjust generation sizes from safepoint
>  - ... and 7 more: https://git.openjdk.org/shenandoah/compare/25469283...50896e31

@earthling-amzn : I have a partially completed review from last night, which I'll complete and post here in the next hour or so. Sorry for the delay.

-------------

PR: https://git.openjdk.org/shenandoah/pull/177

From ysr at openjdk.org Fri Dec 9 01:11:03 2022
From: ysr at openjdk.org (Y. Srinivas Ramakrishna)
Date: Fri, 9 Dec 2022 01:11:03 GMT
Subject: RFR: Generation resizing [v4]
In-Reply-To: 
References: 
Message-ID: 

On Thu, 8 Dec 2022 21:46:54 GMT, William Kemper wrote:

>> These changes have the generational mode track the minimum mutator utilization (percentage of process time used by mutators). When it falls below a configuration percentage (GCTimeRatio), a heuristic will transfer memory capacity to whatever generation has been using more CPU time. The assumption here is that by increasing capacity, we will decrease the collection frequency and improve the MMU.
>
> William Kemper has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 17 commits:
>
>  - Merge branch 'shenandoah-master' into mmu-instrumentation
>  - Remove vestigial lock, do not enroll periodic task while holding threads_lock
>  - Remove unnecessary logging, clean up imports
>  - Merge from shenandoah/master
>  - Document the class responsible for adjusting generation sizes
>  - Revert unnecessary change
>  - Remove unused time between cycle tracking
>  - Remove vestigial mmu tracker instance
>  - Clamp adjustments to min/max when increment is too large
>  - Adjust generation sizes from safepoint
>  - ... and 7 more: https://git.openjdk.org/shenandoah/compare/25469283...50896e31

Sorry again for not getting this review back to you in time. It looks good overall, but here are some comments you can use to perhaps improve a few things.

Reviewed & approved, modulo the above comments.

src/hotspot/share/gc/shenandoah/mode/shenandoahGenerationalMode.cpp line 39:

> 37: }
> 38:
> 39: SHENANDOAH_ERGO_OVERRIDE_DEFAULT(GCTimeRatio, 70);

Does this translate to a GC overhead of 1/71*100% = 1.4%?

src/hotspot/share/gc/shenandoah/shenandoahMmuTracker.cpp line 59:

> 57: ShenandoahHeap::heap()->gc_threads_do(&cl);
> 58: // Include VM thread? Compiler threads? or no - because there
> 59: // is nothing the collector can do about those threads.

Correct, we should not measure in the control signal that which we do not affect. I'd either delete this comment, or just state something like:

// We do not include non-GC vm threads, such as compiler threads, etc. in our measurement
// since we are using the tracker only to control (affect) the time spent in GC.
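(As a standalone sanity check of the 1/71 arithmetic above: a hypothetical snippet, not part of the patch, assuming the usual HotSpot convention that GC's share of total time is 1/(1 + GCTimeRatio).)

    #include <cstdio>

    int main() {
      const double gc_time_ratio = 70.0;                    // desired mutator : GC time ratio
      const double gc_share = 1.0 / (1.0 + gc_time_ratio);  // GC's fraction of total time
      std::printf("GC overhead target: %.1f%%\n", gc_share * 100.0);  // prints 1.4%
      return 0;
    }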
src/hotspot/share/gc/shenandoah/shenandoahMmuTracker.cpp line 88:

> 86: // This is only called by the control thread.
> 87: double collector_time_s = gc_thread_time_seconds();
> 88: double elapsed_gc_time_s = collector_time_s - _initial_collector_time_s;

Since "elapsed" has a different connotation, it would be less confusing for this variable to be called something like `delta_gc_time_s`, it being the delta between what was previously recorded and what has now been recorded.

src/hotspot/share/gc/shenandoah/shenandoahMmuTracker.cpp line 96:

> 94: // This is only called by the periodic thread.
> 95: double process_time_s = process_time_seconds();
> 96: double elapsed_process_time_s = process_time_s - _initial_process_time_s;

elapsed -> delta

src/hotspot/share/gc/shenandoah/shenandoahMmuTracker.cpp line 99:

> 97: _initial_process_time_s = process_time_s;
> 98: double verify_time_s = gc_thread_time_seconds();
> 99: double verify_elapsed = verify_time_s - _initial_verify_collector_time_s;

elapsed -> delta

src/hotspot/share/gc/shenandoah/shenandoahMmuTracker.cpp line 124:

> 122: if (old_time_s > young_time_s) {
> 123: return transfer_capacity(young, old);
> 124: } else {

In another place I had asked if this method was idempotent. It would be nice if it were. This is almost idempotent, but not quite. You can make it idempotent by changing the `else` to `else if (young_time_s > old_time_s)`, thus sidestepping the case where the two have just been reset and will be 0 (at least until the next gc).

src/hotspot/share/gc/shenandoah/shenandoahMmuTracker.cpp line 146:

> 144: size_t regions_to_transfer = MAX2(1UL, size_t(double(available_regions) * _resize_increment));
> 145: size_t bytes_to_transfer = regions_to_transfer * ShenandoahHeapRegion::region_size_bytes();
> 146: if (from->generation_mode() == YOUNG) {

I'd consider extracting the work in the `if` and `else` arms into a suitable smaller work method (or two, if one won't suffice for both arms) instead of doing it in line here. It might improve readability and maintainability of the code.

If you tried that and it didn't help, you can ignore this comment. The similarity in shape of the two arms and the "duplication" just seemed to be worth refactoring into a worker method.

src/hotspot/share/gc/shenandoah/shenandoahMmuTracker.hpp line 54:

> 52: double _initial_collector_time_s;
> 53: double _initial_process_time_s;
> 54: double _initial_verify_collector_time_s;

What does this field with `verify` in its name track?

For each of the data fields, I'd suggest adding a short comment; e.g.:

double _initial_collector_time_s; // tracks cumulative collector threads virtual cpu-time at last recording

etc.

-------------

PR: https://git.openjdk.org/shenandoah/pull/177

From ysr at openjdk.org Fri Dec 9 01:11:16 2022
From: ysr at openjdk.org (Y. Srinivas Ramakrishna)
Date: Fri, 9 Dec 2022 01:11:16 GMT
Subject: RFR: Generation resizing [v3]
In-Reply-To: <8Fse7IxO14Uc0eJJoLMmGXSo8XYD9Qb144mCyrMX3-g=.1d5d7258-f06d-4a03-bf78-8102f22ada7d@github.com>
References: <8Fse7IxO14Uc0eJJoLMmGXSo8XYD9Qb144mCyrMX3-g=.1d5d7258-f06d-4a03-bf78-8102f22ada7d@github.com>
Message-ID: 

On Tue, 6 Dec 2022 17:26:08 GMT, William Kemper wrote:

>> These changes have the generational mode track the minimum mutator utilization (percentage of process time used by mutators). When it falls below a configuration percentage (GCTimeRatio), a heuristic will transfer memory capacity to whatever generation has been using more CPU time. The assumption here is that by increasing capacity, we will decrease the collection frequency and improve the MMU.
>
> William Kemper has updated the pull request incrementally with one additional commit since the last revision:
>
>   Remove vestigial lock, do not enroll periodic task while holding threads_lock

src/hotspot/share/gc/shenandoah/shenandoahGeneration.cpp line 980:

> 978: }
> 979:
> 980: void ShenandoahGeneration::increase_capacity(size_t increment) {

Is there some sanity check done on this elsewhere to make sure the increase/decrease makes sense? Perhaps I'll see it in the caller(s) when I get to it.

src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp line 1095:

> 1093: }
> 1094:
> 1095: bool ShenandoahHeap::adjust_generation_sizes() {

Is this method idempotent? I guess it depends on the method of the same name in the MMU Tracker. I guess my question will be answered when I get to it.

src/hotspot/share/gc/shenandoah/shenandoahMmuTracker.hpp line 37:

> 35: * This class is responsible for tracking and adjusting the minimum mutator
> 36: * utilization (MMU). MMU is defined as the percentage of CPU time available
> 37: * to mutator threads over an arbitrary, fixed interval of time. MMU is measured

Where do we specify the fixed interval used as the basis for the MMU?

src/hotspot/share/gc/shenandoah/shenandoahMmuTracker.hpp line 38:

> 36: * utilization (MMU). MMU is defined as the percentage of CPU time available
> 37: * to mutator threads over an arbitrary, fixed interval of time. MMU is measured
> 38: * by summing all of the time given to the GC threads and comparing this too

too -> to

src/hotspot/share/gc/shenandoah/shenandoahMmuTracker.hpp line 44:

> 42: * The time spent by GC threads is attributed to the young or old generation.
> 43: * The time given to the controller and regulator threads is attributed to the
> 44: * global generation. At the end of every collection, the average MMU is inspected.

Average over ...? Average MMU over the most recently ended collection cycle? Or over the cumulative history of the run? Or over all of the collection cycles since the last adjustment of generation sizes? Etc.

src/hotspot/share/gc/shenandoah/shenandoahMmuTracker.hpp line 61:

> 59: TruncatedSeq _mmu_average;
> 60:
> 61: bool transfer_capacity(ShenandoahGeneration* from, ShenandoahGeneration* to);

Nit: shouldn't the adjustment be in a sizer object rather than in a tracker object? Maybe we should think of this class as an MmuBasedGenerationSizeController, which both tracks MMU and controls the size of the generations.

src/hotspot/share/gc/shenandoah/shenandoahMmuTracker.hpp line 90:

> 88: // allocators by taking the heap lock). The amount of capacity to move
> 89: // from one generation to another is controlled by YoungGenerationSizeIncrement
> 90: // and defaults to 20% of the heap. The minimum and maximum sizes of the

Is the transfer delta always 20%? Wouldn't that cause oscillations about an equilibrium point at steady load? But I should read on to see how this works.

src/hotspot/share/gc/shenandoah/shenandoahMmuTracker.hpp line 93:

> 91: // young generation are controlled by ShenandoahMinYoungPercentage and
> 92: // ShenandoahMaxYoungPercentage, respectively. The method returns true
> 93: // when and adjustment is made, false otherwise.

and -> an

src/hotspot/share/gc/shenandoah/shenandoahYoungGeneration.cpp line 95:

> 93:
> 94: void ShenandoahYoungGeneration::add_collection_time(double time_seconds) {
> 95: if (_old_gen_task_queues != NULL) {

This seems a bit subtle. Isn't there a better/official status flag to check, or a default second parm to leverage from caller?

src/hotspot/share/gc/shenandoah/shenandoahYoungGeneration.hpp line 59:

> 57: virtual ShenandoahHeuristics* initialize_heuristics(ShenandoahMode* gc_mode) override;
> 58:
> 59: virtual void add_collection_time(double time_seconds) override;

A 1-line documentation comment/spec for the method would be nice here.

-------------

PR: https://git.openjdk.org/shenandoah/pull/177

From ysr at openjdk.org Fri Dec 9 01:11:18 2022
From: ysr at openjdk.org (Y. Srinivas Ramakrishna)
Date: Fri, 9 Dec 2022 01:11:18 GMT
Subject: RFR: Generation resizing [v3]
In-Reply-To: 
References: <8Fse7IxO14Uc0eJJoLMmGXSo8XYD9Qb144mCyrMX3-g=.1d5d7258-f06d-4a03-bf78-8102f22ada7d@github.com>
Message-ID: 

On Thu, 8 Dec 2022 09:17:51 GMT, Y. Srinivas Ramakrishna wrote:

>> William Kemper has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Remove vestigial lock, do not enroll periodic task while holding threads_lock
>
> src/hotspot/share/gc/shenandoah/shenandoahGeneration.cpp line 980:
>
>> 978: }
>> 979:
>> 980: void ShenandoahGeneration::increase_capacity(size_t increment) {
>
> Is there some sanity check done on this elsewhere to make sure the increase/decrease makes sense? Perhaps I'll see it in the caller(s) when I get to it.

I see now that you do. Would it still be worthwhile asserting here as well that bounds are respected? It might make the code more maintainable in the face of changes.

> src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp line 1095:
>
>> 1093: }
>> 1094:
>> 1095: bool ShenandoahHeap::adjust_generation_sizes() {
>
> Is this method idempotent? I guess it depends on the method of the same name in the MMU Tracker. I guess my question will be answered when I get to it.

Left a related comment in `MmuTracker::adjust_generational_size()`.

> src/hotspot/share/gc/shenandoah/shenandoahMmuTracker.hpp line 37:
>
>> 35: * This class is responsible for tracking and adjusting the minimum mutator
>> 36: * utilization (MMU). MMU is defined as the percentage of CPU time available
>> 37: * to mutator threads over an arbitrary, fixed interval of time. MMU is measured
>
> Where do we specify the fixed interval used as the basis for the MMU?

I'd mention the interval `GCPauseIntervalMillis` here for clarity. (I'd say it's a curious naming of the interval, but it's already used in that sense, so we leave it as is.)

> src/hotspot/share/gc/shenandoah/shenandoahMmuTracker.hpp line 90:
>
>> 88: // allocators by taking the heap lock). The amount of capacity to move
>> 89: // from one generation to another is controlled by YoungGenerationSizeIncrement
>> 90: // and defaults to 20% of the heap. The minimum and maximum sizes of the
>
> Is the transfer delta always 20%? Wouldn't that cause oscillations about an equilibrium point at steady load? But I should read on to see how this works.

I think the way you use it, it's not 20% of the heap but rather 20% of the free space in the generation that will provide the transfer delta. Maybe reword for clarity?

-------------

PR: https://git.openjdk.org/shenandoah/pull/177

From ysr at openjdk.org Fri Dec 9 01:11:19 2022
From: ysr at openjdk.org (Y. Srinivas Ramakrishna)
Date: Fri, 9 Dec 2022 01:11:19 GMT
Subject: RFR: Generation resizing [v4]
In-Reply-To: 
References: 
Message-ID: 

On Thu, 8 Dec 2022 23:57:57 GMT, Y. Srinivas Ramakrishna wrote:

>> William Kemper has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 17 commits:
>>
>>  - Merge branch 'shenandoah-master' into mmu-instrumentation
>>  - Remove vestigial lock, do not enroll periodic task while holding threads_lock
>>  - Remove unnecessary logging, clean up imports
>>  - Merge from shenandoah/master
>>  - Document the class responsible for adjusting generation sizes
>>  - Revert unnecessary change
>>  - Remove unused time between cycle tracking
>>  - Remove vestigial mmu tracker instance
>>  - Clamp adjustments to min/max when increment is too large
>>  - Adjust generation sizes from safepoint
>>  - ... and 7 more: https://git.openjdk.org/shenandoah/compare/25469283...50896e31
>
> src/hotspot/share/gc/shenandoah/shenandoahMmuTracker.cpp line 99:
>
>> 97: _initial_process_time_s = process_time_s;
>> 98: double verify_time_s = gc_thread_time_seconds();
>> 99: double verify_elapsed = verify_time_s - _initial_verify_collector_time_s;
>
> elapsed -> delta

Why do you use the `verify_` prefix here? I'm sure I am missing something here...

-------------

PR: https://git.openjdk.org/shenandoah/pull/177

From ysr at openjdk.org Fri Dec 9 01:17:41 2022
From: ysr at openjdk.org (Y. Srinivas Ramakrishna)
Date: Fri, 9 Dec 2022 01:17:41 GMT
Subject: RFR: Generation resizing [v3]
In-Reply-To: 
References: <8Fse7IxO14Uc0eJJoLMmGXSo8XYD9Qb144mCyrMX3-g=.1d5d7258-f06d-4a03-bf78-8102f22ada7d@github.com>
Message-ID: 

On Thu, 8 Dec 2022 08:03:10 GMT, Y. Srinivas Ramakrishna wrote:

>> William Kemper has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Remove vestigial lock, do not enroll periodic task while holding threads_lock
>
> src/hotspot/share/gc/shenandoah/shenandoahMmuTracker.hpp line 44:
>
>> 42: * The time spent by GC threads is attributed to the young or old generation.
>> 43: * The time given to the controller and regulator threads is attributed to the
>> 44: * global generation. At the end of every collection, the average MMU is inspected.
>
> Average over ...? Average MMU over the most recently ended collection cycle? Or over the cumulative history of the run? Or over all of the collection cycles since the last adjustment of generation sizes? Etc.

I see now that it's a decaying average over samples at every 5 second interval. Maybe elaborate the comment accordingly.

-------------

PR: https://git.openjdk.org/shenandoah/pull/177

From ysr at openjdk.org Fri Dec 9 01:29:32 2022
From: ysr at openjdk.org (Y. Srinivas Ramakrishna)
Date: Fri, 9 Dec 2022 01:29:32 GMT
Subject: RFR: JDK-8297796 GenShen: instrument the remembered set scan [v3]
In-Reply-To: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com>
References: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com>
Message-ID: <7SIel7MTWCWMMqbkWEUVe2DvNyrmENS0RkxT6MhU-b0=.14e88010-d2fb-41cb-abba-debefde07292@github.com>

> **Note:**
> This pull request is a draft to share the diffs with the project team. The following additional work is planned before this is ready to commit. (Thanks to Kevin, Roman, William etc. for feedback & suggestions.)
>
> (1) Collect performance data from SpecJBB and from the pipeline to assess the impact of instrumentation on concurrent remembered set scanning and concurrent update refs phase durations, in addition to the existing data from Extremem mentioned in the ticket.
>
> (2) Make available the instrumentation only in non-product (optimized) mode until better performance is achieved.
>
> (3) Any improvements that come from further feedback on this draft (e.g. better or different logging of the metrics data), or other suggestions that I may have missed mentioning above.
>
> The fix to ShenandoahNumberSeq will be separated out and made into a separate pull request on mainline.
>
> **Summary:**
> The main change is card stats collection during RS scanning. The code is protected by a new diagnostic flag `ShenandoahEnableCardStats`, which is off by default. With the flag disabled there is a small performance impact (measured with extremem; more data will be collected, see above). With the flag enabled there is a larger performance impact because of the large number of clusters, with shared stats updates at the end of each cluster processed. Since we expect the loops in process_clusters() to change in the near future, informed by the learnings from these stats, we expect to work further on reducing the cost of the stats collection as well. Currently the stats are logged per thread at the end of each RS scan. I'm happy to refine both the stats that we collect and how frequently we log the data, once we have gathered some experience on how we use this.
>
> **Details of files changed:**
>
> 1. shenandoahGeneration.cpp: add a call to log info at the end of remembered set scan when card stats are enabled.
> 2. shenandoahHeap.cpp: minor restructuring of a loop for task claiming during update refs; introduce a worker id option to downstream code for card stats.
> 3. shenandoahNumberSeq.cpp: fix a minor issue with a boundary condition check in code that tries to find the right bucket to increment. This was triggering an assert in the update code.
> 4. shenandoahNumberSeq.hpp: provide missing allocation spec for BinaryMagnitudeSeq.
> 5. shenandoahScanRemembered.cpp: new class ShenandoahCardStats methods. Minor restructure of loop for task claiming during RS scanning (akin to the one for update refs in 2 above).
> 6. shenandoahScanRemembered.hpp: Diff looks large because of git-diff'ism having issues with indentation change in restructured if-else branches. Not sure how to make the diffs more easily readable. Updated some documentation comments that were slightly obsolete. New class ShenandoahCardStats and implementation of inline methods. Class ShenandoahScanRemembered keeps cumulative running histograms. Remove some inline declarations for larger methods that we shouldn't force inlining on. Update some old comments.
> 7. shenandoahScanRemembered.inline.hpp: As in 6, diff looks larger than it should because of the same indentation change. ShenandoahScanRemembered::process_clusters() is the method where the instrumentation probes have been inserted. A couple of variables were renamed for clarity, as well as to update local variables rather than method arguments. The large diffs at (old) line 589 onwards are the git-diff'ism to do with indentation change. Delete some unused methods.
> 8. shenandoah_globals.hpp: new diagnostic flag `ShenandoahEnableCardStats` protects the stats collection code and is disabled by default.
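(A minimal standalone sketch of the gating pattern that item 8 describes; the flag name is from this PR, but everything else here is a placeholder invented for illustration, not the actual patch.)

    #include <cstdio>

    // Stand-in for the HotSpot diagnostic flag; off by default.
    static bool ShenandoahEnableCardStats = false;

    // Hypothetical stats probe.
    static void record_dirty_run(int run_len) {
      std::printf("dirty run: %d cards\n", run_len);
    }

    int main() {
      // When the flag is off (the default), the probe reduces to a
      // test-and-branch and the stats code is never executed.
      if (ShenandoahEnableCardStats) {
        record_dirty_run(8);
      }
      return 0;
    }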
>
> **Format of stats produced and how to interpret them: (sample)**
>
> [1211.515s][info][gc,task ] GC(7069) Using 10 of 20 workers for Concurrent remembered set scanning
> [1211.529s][info][gc,remset ] GC(7069) Worker 0 Card Stats Histo:
> [1211.529s][info][gc,remset ] GC(7069) dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_run: [ 0.00 0.00 0.00 99.61 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) dirty_cards: [ 0.00 0.00 0.00 0.00 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_cards: [ 0.00 99.61 99.61 99.61 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) max_dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) max_clean_run: [ 0.00 99.61 99.61 99.61 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) dirty_objs: [ 0.00 0.00 0.00 0.00 1245.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_objs: [ 0.00 0.00 0.00 0.00 64.00 ]
> [1211.529s][info][gc,remset ] GC(7069) dirty_scans: [ 0.00 0.00 0.00 0.00 1157.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_scans: [ 0.00 0.00 0.00 0.00 17.00 ]
> [1211.529s][info][gc,remset ] GC(7069) alternations: [ 0.00 0.00 0.00 0.00 39.00 ]
> [1211.529s][info][gc,remset ] GC(7069) Worker 1 Card Stats Histo:
> [1211.529s][info][gc,remset ] GC(7069) dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_run: [ 0.00 0.00 0.00 99.61 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) dirty_cards: [ 0.00 0.00 0.00 0.00 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_cards: [ 0.00 99.61 99.61 99.61 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) max_dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) max_clean_run: [ 0.00 99.61 99.61 99.61 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) dirty_objs: [ 0.00 0.00 0.00 0.00 1257.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_objs: [ 0.00 0.00 0.00 0.00 64.00 ]
> [1211.529s][info][gc,remset ] GC(7069) dirty_scans: [ 0.00 0.00 0.00 0.00 1197.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_scans: [ 0.00 0.00 0.00 0.00 17.00 ]
> [1211.529s][info][gc,remset ] GC(7069) alternations: [ 0.00 0.00 0.00 0.00 39.00 ]
> ...
>
> The rows represent the metric that's being tracked, and the columns are, respectively, the minimum, the 3 quartiles (25%, 50%, 75%), and the maximum. The metrics are:
>
> - dirty_run: the length of an uninterrupted run of dirty cards, interpreted as a percentage of a chunk of work assignment (cluster) processed by a thread
> - clean_run: as above, but the length of an uninterrupted run of clean cards
> - dirty_cards, clean_cards: as above, but counts of cards as a percentage of chunk
> - max_dirty_run & max_clean_run: similarly, for the maximum of each
> - dirty_objs, clean_objs: these are numbers of objects in any chunk walked, or scanned
> - dirty_scans, clean_scans: numbers of objects scanned by the closure
> - alternations: the number of times that we transitioned from clean to dirty or dirty to clean in a chunk
>
> The data above indicates that at least 75% of the chunks have no alternations at all, and cards are almost always mostly clean for this specific benchmark config (extremem).
>
> Comparing worker stats from worker 0 and worker 9 indicates very little difference between their statistics, as one might typically expect for well-balanced RS scans.
>
> **Questions:**
>
> 1. Would it make sense to print also, for example, the 1, 10, 90 and 99 percentiles for these metrics as well, in addition to the quartiles?
> 2. The distributions are per worker for the cumulative history of the run. Would data per RS scan or per Refs Update phase provide more useful information?
> 3. Any suggestions for a more easily consumable format?
> 4. I welcome any other feedback on the pull request.

Y. Srinivas Ramakrishna has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 38 commits:

 - Merge branch 'master' into JVM-1264
 - NOT_PRODUCT() for the stats allocation. Although the stats collection calls were inlined and constant-folded away by the compiler, the allocation was not removed, go figure. Thus it made sense to remove them all via NOT_PRODUCT(). I might revisit this in later iterations as I work on the card-scan loop itself, but for now this is sufficient.
 - Moved some more methods into non-product mode.
 - Merge branch 'master' into JVM-1264
 - Card stats only in non-product mode (until the impact of stats collection is reduced).
 - Merge branch 'master' into JVM-1264
 - Merge branch 'master' into JVM-1264
 - jcheck whitespace fixes.
 - Fix card_stats() so it doesn't crash when card stats aren't enabled.
 - Fix comment.
 - ... and 28 more: https://git.openjdk.org/shenandoah/compare/ee49a488...488c9399

-------------

Changes: https://git.openjdk.org/shenandoah/pull/176/files
Webrev: https://webrevs.openjdk.org/?repo=shenandoah&pr=176&range=02
Stats: 740 lines in 8 files changed: 371 ins; 220 del; 149 mod
Patch: https://git.openjdk.org/shenandoah/pull/176.diff
Fetch: git fetch https://git.openjdk.org/shenandoah pull/176/head:pull/176

PR: https://git.openjdk.org/shenandoah/pull/176

From wkemper at openjdk.org Fri Dec 9 16:34:45 2022
From: wkemper at openjdk.org (William Kemper)
Date: Fri, 9 Dec 2022 16:34:45 GMT
Subject: RFR: Generation resizing [v4]
In-Reply-To: 
References: 
Message-ID: 

On Fri, 9 Dec 2022 00:59:24 GMT, Y. Srinivas Ramakrishna wrote:
> src/hotspot/share/gc/shenandoah/shenandoahMmuTracker.cpp line 146: > >> 144: size_t regions_to_transfer = MAX2(1UL, size_t(double(available_regions) * _resize_increment)); >> 145: size_t bytes_to_transfer = regions_to_transfer * ShenandoahHeapRegion::region_size_bytes(); >> 146: if (from->generation_mode() == YOUNG) { > > I'd consider extracting the work in the `if` and `else` arms into a suitable smaller work method (or two, if one won't suffice for both arms) instead of doing it in line here. It might improve readability and maintainability of the code. > > If you tried that and it didn't help, you can ignore this comment. The similarity in shape of the two arms and the "duplication" just seemed to be worth refactoring into a worker method. Yes, this method is a bit long. The similarity breaks a bit because it needs to enforce the min or max constraint depending on the direction of the transfer. I'll split it up in a subsequent PR. > src/hotspot/share/gc/shenandoah/shenandoahMmuTracker.hpp line 54: > >> 52: double _initial_collector_time_s; >> 53: double _initial_process_time_s; >> 54: double _initial_verify_collector_time_s; > > What does this field with `verify` in its name track? > > For each of the data fields, I'd suggest adding a short comment; e.g.: > > > double _initial_collector_time_s; // tracks cumulative collector threads virtual cpu-time at last recording > > > etc. It's vestigial - I was using it originally to "verify" the result of the per-generation based MMU. I'll rename it. ------------- PR: https://git.openjdk.org/shenandoah/pull/177 From wkemper at openjdk.org Fri Dec 9 16:34:48 2022 From: wkemper at openjdk.org (William Kemper) Date: Fri, 9 Dec 2022 16:34:48 GMT Subject: RFR: Generation resizing [v3] In-Reply-To: References: <8Fse7IxO14Uc0eJJoLMmGXSo8XYD9Qb144mCyrMX3-g=.1d5d7258-f06d-4a03-bf78-8102f22ada7d@github.com> Message-ID: On Thu, 8 Dec 2022 07:57:51 GMT, Y. Srinivas Ramakrishna wrote: >> William Kemper has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove vestigial lock, do not enroll periodic task while holding threads_lock > > src/hotspot/share/gc/shenandoah/shenandoahYoungGeneration.cpp line 95: > >> 93: >> 94: void ShenandoahYoungGeneration::add_collection_time(double time_seconds) { >> 95: if (_old_gen_task_queues != NULL) { > > This seems a bit subtle. Isn't there a better/official status flag to check, or a default second parm to leverage from caller? I'll turn this into an `is_bootstrapping()` method. ------------- PR: https://git.openjdk.org/shenandoah/pull/177 From kdnilsen at openjdk.org Fri Dec 9 23:41:39 2022 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Fri, 9 Dec 2022 23:41:39 GMT Subject: RFR: Shrink tlab to capacity Message-ID: When a TLAB request exceeds the currently available memory within young-gen, the existing behavior is to reject the TLAB request outright. This is recognized as a failed allocation request, which triggers degenerated GC. This change introduces code to reduce the likelihood that too-large TLAB requests will be issued, and when they are issued, it makes an effort to shrink the TLAB request in order to reduce the need for degenerated GC. The impact is difficult to measure because this situation is fairly rare. On one Extremem workload, the TLAB-shrinking code is exercised only once during a 16-minute run involving 500 concurrent GCs, a 45 GiB heap, and a 28 GiB young-gen size. The change reduces the degenerated GCs from 6 to 5. 
One reason that the remaining 5 degenerated GCs are not addressed by this change is that further work is required to handle a situation in which a requested TLAB is smaller than the available young-gen memory, but the available memory is set aside in the evacuation reserve and so cannot be provided to a mutator. Future work will address this condition.

-------------

Commit messages:
 - Fix whitespace
 - Merge branch 'master' into shrink-tlab-to-capacity
 - Experiments to confirm proper operation
 - Merge remote-tracking branch 'GitFarmBranch/shrink-tlab-to-capacity' into shrink-tlab-to-capacity
 - Remove some debug instrumentation
 - Fix log message to avoid assertion failure
 - Change <= to < in test for shrinking tlab request size
 - Fix spelling error in assertion
 - Restructure control to avoid goto statement
 - Resize tlab request if larger than adjusted available

Changes: https://git.openjdk.org/shenandoah/pull/180/files
Webrev: https://webrevs.openjdk.org/?repo=shenandoah&pr=180&range=00
Stats: 184 lines in 2 files changed: 71 ins; 39 del; 74 mod
Patch: https://git.openjdk.org/shenandoah/pull/180.diff
Fetch: git fetch https://git.openjdk.org/shenandoah pull/180/head:pull/180

PR: https://git.openjdk.org/shenandoah/pull/180

From ysr at openjdk.org Mon Dec 12 03:04:47 2022
From: ysr at openjdk.org (Y. Srinivas Ramakrishna)
Date: Mon, 12 Dec 2022 03:04:47 GMT
Subject: RFR: JDK-8297796 GenShen: instrument the remembered set scan [v4]
In-Reply-To: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com>
References: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com>
Message-ID: 

> **Note:**
> This pull request is a draft to share the diffs with the project team. The following additional work is planned before this is ready to commit. (Thanks to Kevin, Roman, William etc. for feedback & suggestions.)
>
> (1) Collect performance data from SpecJBB and from the pipeline to assess the impact of instrumentation on concurrent remembered set scanning and concurrent update refs phase durations, in addition to the existing data from Extremem mentioned in the ticket.
>
> (2) Make available the instrumentation only in non-product (optimized) mode until better performance is achieved.
>
> (3) Any improvements that come from further feedback on this draft (e.g. better or different logging of the metrics data), or other suggestions that I may have missed mentioning above.
>
> The fix to ShenandoahNumberSeq will be separated out and made into a separate pull request on mainline.
>
> **Summary:**
> The main change is card stats collection during RS scanning. The code is protected by a new diagnostic flag `ShenandoahEnableCardStats`, which is off by default. With the flag disabled there is a small performance impact (measured with extremem; more data will be collected, see above). With the flag enabled there is a larger performance impact because of the large number of clusters, with shared stats updates at the end of each cluster processed. Since we expect the loops in process_clusters() to change in the near future, informed by the learnings from these stats, we expect to work further on reducing the cost of the stats collection as well. Currently the stats are logged per thread at the end of each RS scan. I'm happy to refine both the stats that we collect and how frequently we log the data, once we have gathered some experience on how we use this.
>
> **Details of files changed:**
>
> 1. shenandoahGeneration.cpp: add a call to log info at the end of remembered set scan when card stats are enabled.
> 2. shenandoahHeap.cpp: minor restructuring of a loop for task claiming during update refs; introduce a worker id option to downstream code for card stats.
> 3. shenandoahNumberSeq.cpp: fix a minor issue with a boundary condition check in code that tries to find the right bucket to increment. This was triggering an assert in the update code.
> 4. shenandoahNumberSeq.hpp: provide missing allocation spec for BinaryMagnitudeSeq.
> 5. shenandoahScanRemembered.cpp: new class ShenandoahCardStats methods. Minor restructure of loop for task claiming during RS scanning (akin to the one for update refs in 2 above).
> 6. shenandoahScanRemembered.hpp: Diff looks large because of git-diff'ism having issues with indentation change in restructured if-else branches. Not sure how to make the diffs more easily readable. Updated some documentation comments that were slightly obsolete. New class ShenandoahCardStats and implementation of inline methods. Class ShenandoahScanRemembered keeps cumulative running histograms. Remove some inline declarations for larger methods that we shouldn't force inlining on. Update some old comments.
> 7. shenandoahScanRemembered.inline.hpp: As in 6, diff looks larger than it should because of the same indentation change. ShenandoahScanRemembered::process_clusters() is the method where the instrumentation probes have been inserted. A couple of variables were renamed for clarity, as well as to update local variables rather than method arguments. The large diffs at (old) line 589 onwards are the git-diff'ism to do with indentation change. Delete some unused methods.
> 8. shenandoah_globals.hpp: new diagnostic flag `ShenandoahEnableCardStats` protects the stats collection code and is disabled by default.
>
> **Format of stats produced and how to interpret them: (sample)**
>
> [1211.515s][info][gc,task ] GC(7069) Using 10 of 20 workers for Concurrent remembered set scanning
> [1211.529s][info][gc,remset ] GC(7069) Worker 0 Card Stats Histo:
> [1211.529s][info][gc,remset ] GC(7069) dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_run: [ 0.00 0.00 0.00 99.61 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) dirty_cards: [ 0.00 0.00 0.00 0.00 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_cards: [ 0.00 99.61 99.61 99.61 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) max_dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) max_clean_run: [ 0.00 99.61 99.61 99.61 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) dirty_objs: [ 0.00 0.00 0.00 0.00 1245.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_objs: [ 0.00 0.00 0.00 0.00 64.00 ]
> [1211.529s][info][gc,remset ] GC(7069) dirty_scans: [ 0.00 0.00 0.00 0.00 1157.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_scans: [ 0.00 0.00 0.00 0.00 17.00 ]
> [1211.529s][info][gc,remset ] GC(7069) alternations: [ 0.00 0.00 0.00 0.00 39.00 ]
> [1211.529s][info][gc,remset ] GC(7069) Worker 1 Card Stats Histo:
> [1211.529s][info][gc,remset ] GC(7069) dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_run: [ 0.00 0.00 0.00 99.61 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) dirty_cards: [ 0.00 0.00 0.00 0.00 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_cards: [ 0.00 99.61 99.61 99.61 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) max_dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) max_clean_run: [ 0.00 99.61 99.61 99.61 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) dirty_objs: [ 0.00 0.00 0.00 0.00 1257.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_objs: [ 0.00 0.00 0.00 0.00 64.00 ]
> [1211.529s][info][gc,remset ] GC(7069) dirty_scans: [ 0.00 0.00 0.00 0.00 1197.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_scans: [ 0.00 0.00 0.00 0.00 17.00 ]
> [1211.529s][info][gc,remset ] GC(7069) alternations: [ 0.00 0.00 0.00 0.00 39.00 ]
> ...
>
> The rows represent the metric that's being tracked, and the columns are, respectively, the minimum, the 3 quartiles (25%, 50%, 75%), and the maximum. The metrics are:
>
> - dirty_run: the length of an uninterrupted run of dirty cards, interpreted as a percentage of a chunk of work assignment (cluster) processed by a thread
> - clean_run: as above, but the length of an uninterrupted run of clean cards
> - dirty_cards, clean_cards: as above, but counts of cards as a percentage of chunk
> - max_dirty_run & max_clean_run: similarly, for the maximum of each
> - dirty_objs, clean_objs: these are numbers of objects in any chunk walked, or scanned
> - dirty_scans, clean_scans: numbers of objects scanned by the closure
> - alternations: the number of times that we transitioned from clean to dirty or dirty to clean in a chunk
>
> The data above indicates that at least 75% of the chunks have no alternations at all, and cards are almost always mostly clean for this specific benchmark config (extremem).
>
> Comparing worker stats from worker 0 and worker 9 indicates very little difference between their statistics, as one might typically expect for well-balanced RS scans.
>
> **Questions:**
>
> 1. Would it make sense to print also, for example, the 1, 10, 90 and 99 percentiles for these metrics as well, in addition to the quartiles?
> 2. The distributions are per worker for the cumulative history of the run. Would data per RS scan or per Refs Update phase provide more useful information?
> 3. Any suggestions for a more easily consumable format?
> 4. I welcome any other feedback on the pull request.

Y. Srinivas Ramakrishna has updated the pull request incrementally with one additional commit since the last revision:

  Extract ShenandoahCardStats into its own {.h,.c}pp files.

-------------

Changes:
 - all: https://git.openjdk.org/shenandoah/pull/176/files
 - new: https://git.openjdk.org/shenandoah/pull/176/files/488c9399..388a03da

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=shenandoah&pr=176&range=03
 - incr: https://webrevs.openjdk.org/?repo=shenandoah&pr=176&range=02-03

Stats: 523 lines in 5 files changed: 293 ins; 227 del; 3 mod
Patch: https://git.openjdk.org/shenandoah/pull/176.diff
Fetch: git fetch https://git.openjdk.org/shenandoah pull/176/head:pull/176

PR: https://git.openjdk.org/shenandoah/pull/176

From ysr at openjdk.org Mon Dec 12 09:20:49 2022
From: ysr at openjdk.org (Y. Srinivas Ramakrishna)
Date: Mon, 12 Dec 2022 09:20:49 GMT
Subject: RFR: JDK-8297796 GenShen: instrument the remembered set scan [v5]
In-Reply-To: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com>
References: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com>
Message-ID: 

> **Note:**
> This pull request is a draft to share the diffs with the project team. The following additional work is planned before this is ready to commit. (Thanks to Kevin, Roman, William etc. for feedback & suggestions.)
>
> (1) Collect performance data from SpecJBB and from the pipeline to assess the impact of instrumentation on concurrent remembered set scanning and concurrent update refs phase durations, in addition to the existing data from Extremem mentioned in the ticket.
>
> (2) Make available the instrumentation only in non-product (optimized) mode until better performance is achieved.
>
> (3) Any improvements that come from further feedback on this draft (e.g. better or different logging of the metrics data), or other suggestions that I may have missed mentioning above.
>
> The fix to ShenandoahNumberSeq will be separated out and made into a separate pull request on mainline.
>
> **Summary:**
> The main change is card stats collection during RS scanning. The code is protected by a new diagnostic flag `ShenandoahEnableCardStats`, which is off by default. With the flag disabled there is a small performance impact (measured with extremem; more data will be collected, see above). With the flag enabled there is a larger performance impact because of the large number of clusters, with shared stats updates at the end of each cluster processed. Since we expect the loops in process_clusters() to change in the near future, informed by the learnings from these stats, we expect to work further on reducing the cost of the stats collection as well. Currently the stats are logged per thread at the end of each RS scan. I'm happy to refine both the stats that we collect and how frequently we log the data, once we have gathered some experience on how we use this.
>
> **Details of files changed:**
>
> 1. shenandoahGeneration.cpp: add a call to log info at the end of remembered set scan when card stats are enabled.
> 2. shenandoahHeap.cpp: minor restructuring of a loop for task claiming during update refs; introduce a worker id option to downstream code for card stats.
> 3. shenandoahNumberSeq.cpp: fix a minor issue with a boundary condition check in code that tries to find the right bucket to increment. This was triggering an assert in the update code.
> 4. shenandoahNumberSeq.hpp: provide missing allocation spec for BinaryMagnitudeSeq.
> 5. shenandoahScanRemembered.cpp: new class ShenandoahCardStats methods. Minor restructure of loop for task claiming during RS scanning (akin to the one for update refs in 2 above).
> 6. shenandoahScanRemembered.hpp: Diff looks large because of git-diff'ism having issues with indentation change in restructured if-else branches. Not sure how to make the diffs more easily readable. Updated some documentation comments that were slightly obsolete. New class ShenandoahCardStats and implementation of inline methods. Class ShenandoahScanRemembered keeps cumulative running histograms. Remove some inline declarations for larger methods that we shouldn't force inlining on. Update some old comments.
> 7. shenandoahScanRemembered.inline.hpp: As in 6, diff looks larger than it should because of the same indentation change. ShenandoahScanRemembered::process_clusters() is the method where the instrumentation probes have been inserted. A couple of variables were renamed for clarity, as well as to update local variables rather than method arguments. The large diffs at (old) line 589 onwards are the git-diff'ism to do with indentation change. Delete some unused methods.
> 8. shenandoah_globals.hpp: new diagnostic flag `ShenandoahEnableCardStats` protects the stats collection code and is disabled by default.
>
> **Format of stats produced and how to interpret them: (sample)**
>
> [1211.515s][info][gc,task ] GC(7069) Using 10 of 20 workers for Concurrent remembered set scanning
> [1211.529s][info][gc,remset ] GC(7069) Worker 0 Card Stats Histo:
> [1211.529s][info][gc,remset ] GC(7069) dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_run: [ 0.00 0.00 0.00 99.61 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) dirty_cards: [ 0.00 0.00 0.00 0.00 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_cards: [ 0.00 99.61 99.61 99.61 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) max_dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) max_clean_run: [ 0.00 99.61 99.61 99.61 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) dirty_objs: [ 0.00 0.00 0.00 0.00 1245.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_objs: [ 0.00 0.00 0.00 0.00 64.00 ]
> [1211.529s][info][gc,remset ] GC(7069) dirty_scans: [ 0.00 0.00 0.00 0.00 1157.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_scans: [ 0.00 0.00 0.00 0.00 17.00 ]
> [1211.529s][info][gc,remset ] GC(7069) alternations: [ 0.00 0.00 0.00 0.00 39.00 ]
> [1211.529s][info][gc,remset ] GC(7069) Worker 1 Card Stats Histo:
> [1211.529s][info][gc,remset ] GC(7069) dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_run: [ 0.00 0.00 0.00 99.61 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) dirty_cards: [ 0.00 0.00 0.00 0.00 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_cards: [ 0.00 99.61 99.61 99.61 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) max_dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) max_clean_run: [ 0.00 99.61 99.61 99.61 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) dirty_objs: [ 0.00 0.00 0.00 0.00 1257.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_objs: [ 0.00 0.00 0.00 0.00 64.00 ]
> [1211.529s][info][gc,remset ] GC(7069) dirty_scans: [ 0.00 0.00 0.00 0.00 1197.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_scans: [ 0.00 0.00 0.00 0.00 17.00 ]
> [1211.529s][info][gc,remset ] GC(7069) alternations: [ 0.00 0.00 0.00 0.00 39.00 ]
> ...
>
> The rows represent the metric that's being tracked, and the columns are, respectively, the minimum, the 3 quartiles (25%, 50%, 75%), and the maximum. The metrics are:
>
> - dirty_run: the length of an uninterrupted run of dirty cards, interpreted as a percentage of a chunk of work assignment (cluster) processed by a thread
> - clean_run: as above, but the length of an uninterrupted run of clean cards
> - dirty_cards, clean_cards: as above, but counts of cards as a percentage of chunk
> - max_dirty_run & max_clean_run: similarly, for the maximum of each
> - dirty_objs, clean_objs: these are numbers of objects in any chunk walked, or scanned
> - dirty_scans, clean_scans: numbers of objects scanned by the closure
> - alternations: the number of times that we transitioned from clean to dirty or dirty to clean in a chunk
>
> The data above indicates that at least 75% of the chunks have no alternations at all, and cards are almost always mostly clean for this specific benchmark config (extremem).
>
> Comparing worker stats from worker 0 and worker 9 indicates very little difference between their statistics, as one might typically expect for well-balanced RS scans.
>
> **Questions:**
>
> 1. Would it make sense to print also, for example, the 1, 10, 90 and 99 percentiles for these metrics as well, in addition to the quartiles?
> 2. The distributions are per worker for the cumulative history of the run. Would data per RS scan or per Refs Update phase provide more useful information?
> 3. Any suggestions for a more easily consumable format?
> 4. I welcome any other feedback on the pull request.

Y. Srinivas Ramakrishna has updated the pull request incrementally with one additional commit since the last revision:

  Separated out stats for scan_rs and update_refs. Still need to carry cumulative stats, and merge stats from each round into the cumulative stats; the latter needs a "merge" method in NumberSeq, which will be a separate PR.

-------------

Changes:
 - all: https://git.openjdk.org/shenandoah/pull/176/files
 - new: https://git.openjdk.org/shenandoah/pull/176/files/388a03da..0d65158c

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=shenandoah&pr=176&range=04
 - incr: https://webrevs.openjdk.org/?repo=shenandoah&pr=176&range=03-04

Stats: 43 lines in 5 files changed: 19 ins; 0 del; 24 mod
Patch: https://git.openjdk.org/shenandoah/pull/176.diff
Fetch: git fetch https://git.openjdk.org/shenandoah pull/176/head:pull/176

PR: https://git.openjdk.org/shenandoah/pull/176

From fyang at openjdk.org Mon Dec 12 12:45:54 2022
From: fyang at openjdk.org (Fei Yang)
Date: Mon, 12 Dec 2022 12:45:54 GMT
Subject: RFR: 8298568: Fastdebug build fails after JDK-8296389
Message-ID: 

This is a trivial change fixing a typo introduced by JDK-8296389.

The correct version is "is_NeverBranch()" instead of "isNeverBranch()".

Testing: Fastdebug builds fine with this fix on linux-aarch64 platform.
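(In diff form, the fix amounts to something like the following; the surrounding expression is assumed for illustration, and only the renamed call comes from the change itself.)

    -  if (n->isNeverBranch()) {   // old spelling: fails to compile in fastdebug
    +  if (n->is_NeverBranch()) {  // correct accessor name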
------------- Commit messages: - 8298568: Fastdebug build fails after JDK-8296389 Changes: https://git.openjdk.org/jdk/pull/11631/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11631&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8298568 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/11631.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11631/head:pull/11631 PR: https://git.openjdk.org/jdk/pull/11631 From rkennke at openjdk.org Mon Dec 12 13:18:53 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Mon, 12 Dec 2022 13:18:53 GMT Subject: RFR: 8298568: Fastdebug build fails after JDK-8296389 In-Reply-To: References: Message-ID: On Mon, 12 Dec 2022 12:37:51 GMT, Fei Yang wrote: > This is a trivial change fixing an typo introduced by JDK-8296389. > > The correct version is "is_NeverBranch()" instead of "isNeverBranch()". > > Testing: Fastdebug builds fine with this fix on linux-aarch64 platform. Looks good to me! Thanks! ------------- Marked as reviewed by rkennke (Reviewer). PR: https://git.openjdk.org/jdk/pull/11631 From ysr at openjdk.org Mon Dec 12 20:43:48 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Mon, 12 Dec 2022 20:43:48 GMT Subject: RFR: JDK-8297796 GenShen: instrument the remembered set scan [v6] In-Reply-To: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com> References: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com> Message-ID: > **Note:** > This pull request is a draft to share the diffs with the project team. The following additional work is planned before this is ready to commit. (Thanks to Kevin, Roman, William etc. for feedback & suggestions.) > > (1) Collect performance data from SpecJBB and from the pipeline to assess the impact of instrumentation on concurrent remembered set scanning and concurrent update refs phase durations, in addition to the existing data from Extermem mentioned in the ticket. > > (2) Make available the instrumentation only in non-product (optimized) mode until better performance is achieved. > > (3) Any improvements that come from further feedback on this draft (e.g. better or different logging of the metrics data), or other suggestions that I may have missed mentioning above. > > The fix to ShenandoahNumberSeq will be separated out and made into a separate pull request on mainline. > > **Summary:** > The main change is card stats collection during RS scanning. The code is protected by a new diagnostic flag `ShenandoahEnableCardStats`, which is off by default. With the flag disabled there is a small performance impact (measured with extremem; more data will be collected, see above). With the flag enabled there is a larger performance impact because of the large number of clusters, with shared stats updates at the end of each cluster processed. Since we expect the loops in process_clusters() to change in the near future, informed by the learnings from these stats, we expect to work further on reducing the cost of the stats collection as well. Currently the stats are logged per thread at the end of each RS scan. I'm happy to refine both the stats that we collect as well as how frequently we log the data once we have gathered some experience on how we use this. > > **Details of files changed:** > > 1. shenandoahGeneration.cpp: add a call to log info at the end of remembered set scan when card stats are enabled. > 2. 
> 2. shenandoahHeap.cpp: minor restructuring of a loop for task claiming during update refs; introduce a worker id option to downstream code for card stats
> 3. shenandoahNumberSeq.cpp: fix a minor issue with a boundary condition check in code that tries to find the right bucket to increment. This was triggering an assert in the update code.
> 4. shenandoahNumberSeq.hpp: provide missing allocation spec for BinaryMagnitudeSeq
> 5. shenandoahScanRemembered.cpp: new class ShenandoahCardStats methods. Minor restructure of loop for task claiming during RS scanning (akin to the one for update refs in 2 above).
> 6. shenandoahScanRemembered.hpp: Diff looks large because of git-diff'ism having issues with indentation change in restructured if-else branches. Not sure how to make the diffs more easily readable. Updated some documentation comments that were slightly obsolete. New class ShenandoahCardStats and implementation of inline methods. Class ShenandoahScanRemembered keeps cumulative running histograms. Remove some inline declarations for larger methods that we shouldn't force inlining on. Update some old comments.
> 7. shenandoahScanRemembered.inline.hpp: As in 6, the diff looks larger than it should because of the same indentation change. ShenandoahScanRemembered::process_clusters() is the method where the instrumentation probes have been inserted. A couple of variables were renamed for clarity, as well as to update local variables rather than method arguments. The large diffs at (old) line 589 onwards are the git-diff'ism to do with indentation change. Delete some unused methods.
> 8. shenandoah_globals.hpp: new diagnostic flag `ShenandoahEnableCardStats` protects the stats collection code and is disabled by default.
>
> **Format of stats produced and how to interpret them: (sample)**
>
>
> [1211.515s][info][gc,task ] GC(7069) Using 10 of 20 workers for Concurrent remembered set scanning
> [1211.529s][info][gc,remset ] GC(7069) Worker 0 Card Stats Histo:
> [1211.529s][info][gc,remset ] GC(7069) dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_run: [ 0.00 0.00 0.00 99.61 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) dirty_cards: [ 0.00 0.00 0.00 0.00 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_cards: [ 0.00 99.61 99.61 99.61 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) max_dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) max_clean_run: [ 0.00 99.61 99.61 99.61 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) dirty_objs: [ 0.00 0.00 0.00 0.00 1245.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_objs: [ 0.00 0.00 0.00 0.00 64.00 ]
> [1211.529s][info][gc,remset ] GC(7069) dirty_scans: [ 0.00 0.00 0.00 0.00 1157.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_scans: [ 0.00 0.00 0.00 0.00 17.00 ]
> [1211.529s][info][gc,remset ] GC(7069) alternations: [ 0.00 0.00 0.00 0.00 39.00 ]
> [1211.529s][info][gc,remset ] GC(7069) Worker 1 Card Stats Histo:
> [1211.529s][info][gc,remset ] GC(7069) dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_run: [ 0.00 0.00 0.00 99.61 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) dirty_cards: [ 0.00 0.00 0.00 0.00 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_cards: [ 0.00 99.61 99.61 99.61 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) max_dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) max_clean_run: [ 0.00 99.61 99.61 99.61 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) dirty_objs: [ 0.00 0.00 0.00 0.00 1257.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_objs: [ 0.00 0.00 0.00 0.00 64.00 ]
> [1211.529s][info][gc,remset ] GC(7069) dirty_scans: [ 0.00 0.00 0.00 0.00 1197.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_scans: [ 0.00 0.00 0.00 0.00 17.00 ]
> [1211.529s][info][gc,remset ] GC(7069) alternations: [ 0.00 0.00 0.00 0.00 39.00 ]
> ...
>
>
> The rows represent the metric that's being tracked, and the columns are, respectively, the minimum, the 3 quartiles (25%, 50%, 75%) and the maximum. The metrics are:
>
> - dirty_run: the length of an uninterrupted run of dirty cards, interpreted as a percentage of a chunk of work assignment (cluster) processed by a thread
> - clean_run: as above, but the length of an uninterrupted run of clean cards
> - dirty_cards, clean_cards: as above, but counts of cards as a percentage of chunk
> - max_dirty_run & max_clean_run: similarly, but for the maximum of each
> - dirty_objs, clean_objs: these are numbers of objects in any chunk walked, or scanned
> - dirty_scans, clean_scans: numbers of objects scanned by the closure
> - alternations: the number of times that we transitioned from clean to dirty or dirty to clean in a chunk
>
> The data above indicates that at least 75% of the chunks have no alternations at all,
> and cards are almost always mostly clean for this specific benchmark config (extremem).
>
> Comparing worker stats from worker 0 and worker 9 indicates very little difference between
> their statistics, as one might typically expect for well-balanced RS scans.
>
> **Questions:**
>
> 1. Would it make sense to also print, for example, the 1, 10, 90 and 99 percentiles for these metrics, in addition to the quartiles?
> 2. The distributions are per worker for the cumulative history of the run. Would data per RS scan or per Refs Update phase provide more useful information?
> 3. Any suggestions for a more easily consumable format?
> 4. I welcome any other feedback on the pull request.

Y. Srinivas Ramakrishna has updated the pull request incrementally with one additional commit since the last revision:

  Cumulative card stats separated out for scan_rs and update_refs phases;
  merge of per-worker stats into phase-specific cumulative stats stubbed
  out for now until HdrSeq::merge() is done.

-------------

Changes:
  - all: https://git.openjdk.org/shenandoah/pull/176/files
  - new: https://git.openjdk.org/shenandoah/pull/176/files/0d65158c..a6b1a236

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=shenandoah&pr=176&range=05
 - incr: https://webrevs.openjdk.org/?repo=shenandoah&pr=176&range=04-05

Stats: 33 lines in 3 files changed: 29 ins; 0 del; 4 mod
Patch: https://git.openjdk.org/shenandoah/pull/176.diff
Fetch: git fetch https://git.openjdk.org/shenandoah pull/176/head:pull/176

PR: https://git.openjdk.org/shenandoah/pull/176

From ysr at openjdk.org  Mon Dec 12 21:36:36 2022
From: ysr at openjdk.org (Y. Srinivas Ramakrishna)
Date: Mon, 12 Dec 2022 21:36:36 GMT
Subject: RFR: Shrink tlab to capacity
In-Reply-To:
References:
Message-ID:

On Fri, 9 Dec 2022 23:23:43 GMT, Kelvin Nilsen wrote:

> When a TLAB request exceeds the currently available memory within young-gen, the existing behavior is to reject the TLAB request outright. This is recognized as a failed allocation request, which triggers degenerated GC.
>
> This change introduces code to reduce the likelihood that too-large TLAB requests will be issued, and when they are issued, it makes an effort to shrink the TLAB request in order to reduce the need for degenerated GC.
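As an editorial aside, the shrinking idea quoted above can be illustrated with a minimal, self-contained sketch. The names and the clamping policy below are assumptions for illustration only; the actual change lives in Shenandoah's allocation path (see the discussion of ShenandoahHeap::allocate_memory_under_lock() later in this thread):

    #include <algorithm>
    #include <cstddef>

    // Hedged sketch: a TLAB request larger than what young-gen can currently
    // supply is shrunk toward the available space instead of being rejected
    // outright (which would be treated as a failed allocation).
    struct TlabRequest {
      size_t desired_words;  // what the mutator asked for
      size_t min_words;      // smallest TLAB the mutator can accept
    };

    size_t satisfy_tlab_request(const TlabRequest& req, size_t available_words) {
      if (req.desired_words <= available_words) {
        return req.desired_words;          // common case: full-size TLAB
      }
      // Shrink toward available space, but never below the minimum.
      size_t shrunk = std::max(req.min_words,
                               std::min(req.desired_words, available_words));
      if (shrunk <= available_words) {
        return shrunk;                     // smaller TLAB, no degenerated GC
      }
      return 0;                            // caller must fall back to GC
    }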
> The impact is difficult to measure because this situation is fairly rare. On one Extremem workload, the TLAB-shrinking code is exercised only once during a 16-minute run involving 500 concurrent GCs, a 45 GiB heap, and a 28 GiB young-gen size. The change reduces the degenerated GCs from 6 to 5.
>
> One reason that the remaining 5 degenerated GCs are not addressed by this change is that further work is required to handle a situation in which a requested TLAB is smaller than the available young-gen memory, but available memory is set aside in the evacuation reserve so cannot be provided to a mutator. Future work will address this condition.

Looks good, modulo a comment I left inline in the ShenandoahHeap::allocate_memory_under_lock() method.

src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp line 1368:

> 1366: // satisfy the allocation request. The reality is the actual TLAB size is likely to be even smaller, because it will
> 1367: // depend on how much memory is available within mutator regions that are not yet fully used.
> 1368: HeapWord* result = allocate_memory_under_lock(smaller_req, in_new_region, is_promotion);

Can you help me understand the structure here? Would it not have been simpler to keep sufficient state at the point where the attempt to allocate the larger size failed, and we decided we would shrink the size of the request, to just make the smaller allocation request which would be guaranteed to succeed because we held the heap lock at that point already? Is there a reason to give up and reattempt the smaller allocation request afresh?

I realize you explicitly added a scope to make this re-attempt outside the scope of the locker and make the recursive call, but am trying to understand the rationale for doing so. Perhaps it's because I am missing the big picture of the work being done here from various callers to this method, but maybe you can help clarify that a bit.

-------------

Marked as reviewed by ysr (Author).

PR: https://git.openjdk.org/shenandoah/pull/180

From kvn at openjdk.org  Mon Dec 12 21:37:56 2022
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Mon, 12 Dec 2022 21:37:56 GMT
Subject: RFR: 8298568: Fastdebug build fails after JDK-8296389
In-Reply-To:
References:
Message-ID:

On Mon, 12 Dec 2022 12:37:51 GMT, Fei Yang wrote:

> This is a trivial change fixing a typo introduced by JDK-8296389.
>
> The correct version is "is_NeverBranch()" instead of "isNeverBranch()".
>
> Testing: Fastdebug builds fine with this fix on linux-aarch64 platform.

Good and trivial.

-------------

Marked as reviewed by kvn (Reviewer).

PR: https://git.openjdk.org/jdk/pull/11631

From ysr at openjdk.org  Mon Dec 12 21:46:04 2022
From: ysr at openjdk.org (Y. Srinivas Ramakrishna)
Date: Mon, 12 Dec 2022 21:46:04 GMT
Subject: RFR: JDK-8297796 GenShen: instrument the remembered set scan [v7]
In-Reply-To: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com>
References: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com>
Message-ID:

> **Note:**
> This pull request is a draft to share the diffs with the project team. The following additional work is planned before this is ready to commit. (Thanks to Kevin, Roman, William etc. for feedback & suggestions.)
>
> (1) Collect performance data from SpecJBB and from the pipeline to assess the impact of instrumentation on concurrent remembered set scanning and concurrent update refs phase durations, in addition to the existing data from Extremem mentioned in the ticket.
> (2) Make the instrumentation available only in non-product (optimized) mode until better performance is achieved.
>
> (3) Any improvements that come from further feedback on this draft (e.g. better or different logging of the metrics data), or other suggestions that I may have missed mentioning above.
>
> The fix to ShenandoahNumberSeq will be separated out and made into a separate pull request on mainline.
>
> **Summary:**
> The main change is card stats collection during RS scanning. The code is protected by a new diagnostic flag `ShenandoahEnableCardStats`, which is off by default. With the flag disabled there is a small performance impact (measured with extremem; more data will be collected, see above). With the flag enabled there is a larger performance impact because of the large number of clusters, with shared stats updates at the end of each cluster processed. Since we expect the loops in process_clusters() to change in the near future, informed by the learnings from these stats, we expect to work further on reducing the cost of the stats collection as well. Currently the stats are logged per thread at the end of each RS scan. I'm happy to refine both the stats that we collect and how frequently we log the data once we have gathered some experience on how we use this.
>
> **Details of files changed:**
>
> 1. shenandoahGeneration.cpp: add a call to log info at the end of remembered set scan when card stats are enabled.
> 2. shenandoahHeap.cpp: minor restructuring of a loop for task claiming during update refs; introduce a worker id option to downstream code for card stats
> 3. shenandoahNumberSeq.cpp: fix a minor issue with a boundary condition check in code that tries to find the right bucket to increment. This was triggering an assert in the update code.
> 4. shenandoahNumberSeq.hpp: provide missing allocation spec for BinaryMagnitudeSeq
> 5. shenandoahScanRemembered.cpp: new class ShenandoahCardStats methods. Minor restructure of loop for task claiming during RS scanning (akin to the one for update refs in 2 above).
> 6. shenandoahScanRemembered.hpp: Diff looks large because of git-diff'ism having issues with indentation change in restructured if-else branches. Not sure how to make the diffs more easily readable. Updated some documentation comments that were slightly obsolete. New class ShenandoahCardStats and implementation of inline methods. Class ShenandoahScanRemembered keeps cumulative running histograms. Remove some inline declarations for larger methods that we shouldn't force inlining on. Update some old comments.
> 7. shenandoahScanRemembered.inline.hpp: As in 6, the diff looks larger than it should because of the same indentation change. ShenandoahScanRemembered::process_clusters() is the method where the instrumentation probes have been inserted. A couple of variables were renamed for clarity, as well as to update local variables rather than method arguments. The large diffs at (old) line 589 onwards are the git-diff'ism to do with indentation change. Delete some unused methods.
> 8. shenandoah_globals.hpp: new diagnostic flag `ShenandoahEnableCardStats` protects the stats collection code and is disabled by default.
>
> **Format of stats produced and how to interpret them: (sample)**
>
>
> [1211.515s][info][gc,task ] GC(7069) Using 10 of 20 workers for Concurrent remembered set scanning
> [1211.529s][info][gc,remset ] GC(7069) Worker 0 Card Stats Histo:
> [1211.529s][info][gc,remset ] GC(7069) dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_run: [ 0.00 0.00 0.00 99.61 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) dirty_cards: [ 0.00 0.00 0.00 0.00 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_cards: [ 0.00 99.61 99.61 99.61 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) max_dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) max_clean_run: [ 0.00 99.61 99.61 99.61 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) dirty_objs: [ 0.00 0.00 0.00 0.00 1245.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_objs: [ 0.00 0.00 0.00 0.00 64.00 ]
> [1211.529s][info][gc,remset ] GC(7069) dirty_scans: [ 0.00 0.00 0.00 0.00 1157.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_scans: [ 0.00 0.00 0.00 0.00 17.00 ]
> [1211.529s][info][gc,remset ] GC(7069) alternations: [ 0.00 0.00 0.00 0.00 39.00 ]
> [1211.529s][info][gc,remset ] GC(7069) Worker 1 Card Stats Histo:
> [1211.529s][info][gc,remset ] GC(7069) dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_run: [ 0.00 0.00 0.00 99.61 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) dirty_cards: [ 0.00 0.00 0.00 0.00 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_cards: [ 0.00 99.61 99.61 99.61 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) max_dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) max_clean_run: [ 0.00 99.61 99.61 99.61 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) dirty_objs: [ 0.00 0.00 0.00 0.00 1257.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_objs: [ 0.00 0.00 0.00 0.00 64.00 ]
> [1211.529s][info][gc,remset ] GC(7069) dirty_scans: [ 0.00 0.00 0.00 0.00 1197.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_scans: [ 0.00 0.00 0.00 0.00 17.00 ]
> [1211.529s][info][gc,remset ] GC(7069) alternations: [ 0.00 0.00 0.00 0.00 39.00 ]
> ...
>
>
> The rows represent the metric that's being tracked, and the columns are, respectively, the minimum, the 3 quartiles (25%, 50%, 75%) and the maximum. The metrics are:
>
> - dirty_run: the length of an uninterrupted run of dirty cards, interpreted as a percentage of a chunk of work assignment (cluster) processed by a thread
> - clean_run: as above, but the length of an uninterrupted run of clean cards
> - dirty_cards, clean_cards: as above, but counts of cards as a percentage of chunk
> - max_dirty_run & max_clean_run: similarly, but for the maximum of each
> - dirty_objs, clean_objs: these are numbers of objects in any chunk walked, or scanned
> - dirty_scans, clean_scans: numbers of objects scanned by the closure
> - alternations: the number of times that we transitioned from clean to dirty or dirty to clean in a chunk
>
> The data above indicates that at least 75% of the chunks have no alternations at all,
> and cards are almost always mostly clean for this specific benchmark config (extremem).
>
> Comparing worker stats from worker 0 and worker 9 indicates very little difference between
> their statistics, as one might typically expect for well-balanced RS scans.
>
> **Questions:**
>
> 1. Would it make sense to also print, for example, the 1, 10, 90 and 99 percentiles for these metrics, in addition to the quartiles?
> 2. The distributions are per worker for the cumulative history of the run. Would data per RS scan or per Refs Update phase provide more useful information?
> 3. Any suggestions for a more easily consumable format?
> 4. I welcome any other feedback on the pull request.

Y. Srinivas Ramakrishna has updated the pull request incrementally with one additional commit since the last revision:

  jcheck clean

-------------

Changes:
  - all: https://git.openjdk.org/shenandoah/pull/176/files
  - new: https://git.openjdk.org/shenandoah/pull/176/files/a6b1a236..d5b337bc

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=shenandoah&pr=176&range=06
 - incr: https://webrevs.openjdk.org/?repo=shenandoah&pr=176&range=05-06

Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod
Patch: https://git.openjdk.org/shenandoah/pull/176.diff
Fetch: git fetch https://git.openjdk.org/shenandoah pull/176/head:pull/176

PR: https://git.openjdk.org/shenandoah/pull/176

From kdnilsen at openjdk.org  Mon Dec 12 23:17:11 2022
From: kdnilsen at openjdk.org (Kelvin Nilsen)
Date: Mon, 12 Dec 2022 23:17:11 GMT
Subject: RFR: Shrink tlab to capacity [v2]
In-Reply-To:
References:
Message-ID:

> When a TLAB request exceeds the currently available memory within young-gen, the existing behavior is to reject the TLAB request outright. This is recognized as a failed allocation request, which triggers degenerated GC.
>
> This change introduces code to reduce the likelihood that too-large TLAB requests will be issued, and when they are issued, it makes an effort to shrink the TLAB request in order to reduce the need for degenerated GC.
>
> The impact is difficult to measure because this situation is fairly rare. On one Extremem workload, the TLAB-shrinking code is exercised only once during a 16-minute run involving 500 concurrent GCs, a 45 GiB heap, and a 28 GiB young-gen size. The change reduces the degenerated GCs from 6 to 5.
>
> One reason that the remaining 5 degenerated GCs are not addressed by this change is that further work is required to handle a situation in which a requested TLAB is smaller than the available young-gen memory, but available memory is set aside in the evacuation reserve so cannot be provided to a mutator. Future work will address this condition.

Kelvin Nilsen has updated the pull request incrementally with one additional commit since the last revision:

  Clarify recursive implementation of allocate_memory_under_lock

  (with a comment)

-------------

Changes:
  - all: https://git.openjdk.org/shenandoah/pull/180/files
  - new: https://git.openjdk.org/shenandoah/pull/180/files/774e07a1..2d5da073

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=shenandoah&pr=180&range=01
 - incr: https://webrevs.openjdk.org/?repo=shenandoah&pr=180&range=00-01

Stats: 20 lines in 1 file changed: 17 ins; 0 del; 3 mod
Patch: https://git.openjdk.org/shenandoah/pull/180.diff
Fetch: git fetch https://git.openjdk.org/shenandoah pull/180/head:pull/180

PR: https://git.openjdk.org/shenandoah/pull/180

From kdnilsen at openjdk.org  Mon Dec 12 23:17:11 2022
From: kdnilsen at openjdk.org (Kelvin Nilsen)
Date: Mon, 12 Dec 2022 23:17:11 GMT
Subject: RFR: Shrink tlab to capacity [v2]
In-Reply-To:
References:
Message-ID:
On Mon, 12 Dec 2022 21:31:22 GMT, Y. Srinivas Ramakrishna wrote:

>> Kelvin Nilsen has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Clarify recursive implementation of allocate_memory_under_lock
>>
>>   (with a comment)
>
> src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp line 1368:
>
>> 1366: // satisfy the allocation request. The reality is the actual TLAB size is likely to be even smaller, because it will
>> 1367: // depend on how much memory is available within mutator regions that are not yet fully used.
>> 1368: HeapWord* result = allocate_memory_under_lock(smaller_req, in_new_region, is_promotion);
>
> Can you help me understand the structure here? Would it not have been simpler to keep sufficient state at the point where the attempt to allocate the larger size failed, and we decided we would shrink the size of the request, to just make the smaller allocation request which would be guaranteed to succeed because we held the heap lock at that point already? Is there a reason to give up and reattempt the smaller allocation request afresh?
>
> I realize you explicitly added a scope to make this re-attempt outside the scope of the locker and make the recursive call, but am trying to understand the rationale for doing so. Perhaps it's because I am missing the big picture of the work being done here from various callers to this method, but maybe you can help clarify that a bit.

Thanks for your review. I'm adding a comment to clarify the recursive algorithm and the use of the secondary ShenandoahAllocationRequest argument.

-------------

PR: https://git.openjdk.org/shenandoah/pull/180

From haosun at openjdk.org  Tue Dec 13 00:16:49 2022
From: haosun at openjdk.org (Hao Sun)
Date: Tue, 13 Dec 2022 00:16:49 GMT
Subject: RFR: 8298568: Fastdebug build fails after JDK-8296389
In-Reply-To:
References:
Message-ID:

On Mon, 12 Dec 2022 12:37:51 GMT, Fei Yang wrote:

> This is a trivial change fixing a typo introduced by JDK-8296389.
>
> The correct version is "is_NeverBranch()" instead of "isNeverBranch()".
>
> Testing: Fastdebug builds fine with this fix on linux-aarch64 platform.

LGTM. (I'm not a Reviewer)

-------------

Marked as reviewed by haosun (Author).

PR: https://git.openjdk.org/jdk/pull/11631

From fyang at openjdk.org  Tue Dec 13 00:52:19 2022
From: fyang at openjdk.org (Fei Yang)
Date: Tue, 13 Dec 2022 00:52:19 GMT
Subject: RFR: 8298568: Fastdebug build fails after JDK-8296389
In-Reply-To:
References:
Message-ID:

On Mon, 12 Dec 2022 12:37:51 GMT, Fei Yang wrote:

> This is a trivial change fixing a typo introduced by JDK-8296389.
>
> The correct version is "is_NeverBranch()" instead of "isNeverBranch()".
>
> Testing: Fastdebug builds fine with this fix on linux-aarch64 platform.

Thank you! Let's /integrate

-------------

PR: https://git.openjdk.org/jdk/pull/11631

From fyang at openjdk.org  Tue Dec 13 01:01:40 2022
From: fyang at openjdk.org (Fei Yang)
Date: Tue, 13 Dec 2022 01:01:40 GMT
Subject: Integrated: 8298568: Fastdebug build fails after JDK-8296389
In-Reply-To:
References:
Message-ID:

On Mon, 12 Dec 2022 12:37:51 GMT, Fei Yang wrote:

> This is a trivial change fixing a typo introduced by JDK-8296389.
>
> The correct version is "is_NeverBranch()" instead of "isNeverBranch()".
>
> Testing: Fastdebug builds fine with this fix on linux-aarch64 platform.

This pull request has now been integrated.
Changeset: 173778e2
Author: Fei Yang
URL: https://git.openjdk.org/jdk/commit/173778e2fee58e47d35197b78eb23f46154b5b2b
Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod

8298568: Fastdebug build fails after JDK-8296389

Reviewed-by: rkennke, kvn, haosun

-------------

PR: https://git.openjdk.org/jdk/pull/11631

From ysr at openjdk.org  Tue Dec 13 01:17:20 2022
From: ysr at openjdk.org (Y. Srinivas Ramakrishna)
Date: Tue, 13 Dec 2022 01:17:20 GMT
Subject: RFR: Shrink tlab to capacity [v2]
In-Reply-To:
References:
Message-ID:

On Mon, 12 Dec 2022 23:13:29 GMT, Kelvin Nilsen wrote:

>> src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp line 1368:
>>
>>> 1366: // satisfy the allocation request. The reality is the actual TLAB size is likely to be even smaller, because it will
>>> 1367: // depend on how much memory is available within mutator regions that are not yet fully used.
>>> 1368: HeapWord* result = allocate_memory_under_lock(smaller_req, in_new_region, is_promotion);
>>
>> Can you help me understand the structure here? Would it not have been simpler to keep sufficient state at the point where the attempt to allocate the larger size failed, and we decided we would shrink the size of the request, to just make the smaller allocation request which would be guaranteed to succeed because we held the heap lock at that point already? Is there a reason to give up and reattempt the smaller allocation request afresh?
>>
>> I realize you explicitly added a scope to make this re-attempt outside the scope of the locker and make the recursive call, but am trying to understand the rationale for doing so. Perhaps it's because I am missing the big picture of the work being done here from various callers to this method, but maybe you can help clarify that a bit.
>
> Thanks for your review. I'm adding a comment to clarify the recursive algorithm and the use of the secondary ShenandoahAllocationRequest argument.

ok, thanks!

-------------

PR: https://git.openjdk.org/shenandoah/pull/180

From wkemper at openjdk.org  Tue Dec 13 18:52:17 2022
From: wkemper at openjdk.org (William Kemper)
Date: Tue, 13 Dec 2022 18:52:17 GMT
Subject: RFR: Allow adjusted capacity and used regions size to be equal
Message-ID:

Fix assertion which requires adjusted capacity to be larger than the used regions size (they may be equal).
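Since the patch is a single changed line, the shape of the fix is presumably a relaxed comparison; a hedged before/after sketch follows (the variable and message names are invented for illustration, not taken from the source):

    // Before: fired even when the two quantities were merely equal.
    assert(used_regions_size < adjusted_capacity, "cannot use more than adjusted capacity");
    // After: equality is a legal state.
    assert(used_regions_size <= adjusted_capacity, "cannot use more than adjusted capacity");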
-------------

Commit messages:
 - Allow adjusted capacity and used regions size to be equal

Changes: https://git.openjdk.org/shenandoah/pull/181/files
Webrev: https://webrevs.openjdk.org/?repo=shenandoah&pr=181&range=00
Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
Patch: https://git.openjdk.org/shenandoah/pull/181.diff
Fetch: git fetch https://git.openjdk.org/shenandoah pull/181/head:pull/181

PR: https://git.openjdk.org/shenandoah/pull/181

From wkemper at openjdk.org  Tue Dec 13 18:55:35 2022
From: wkemper at openjdk.org (William Kemper)
Date: Tue, 13 Dec 2022 18:55:35 GMT
Subject: RFR: Generation sizing fixes
Message-ID: <5u8AOWojAzGbSIL0W9XpeA1hTTXFUA2Dq_crBbJ6Z9o=.4587ca91-fc93-4073-96a2-93f6e63a06ef@github.com>

Two small fixes:
 * Fix windows build
 * Need gc id for resumed old generation marking

-------------

Commit messages:
 - Fix MAX2 arguments
 - Use GCIdMark when resuming old marking

Changes: https://git.openjdk.org/shenandoah/pull/182/files
Webrev: https://webrevs.openjdk.org/?repo=shenandoah&pr=182&range=00
Stats: 2 lines in 2 files changed: 1 ins; 0 del; 1 mod
Patch: https://git.openjdk.org/shenandoah/pull/182.diff
Fetch: git fetch https://git.openjdk.org/shenandoah pull/182/head:pull/182

PR: https://git.openjdk.org/shenandoah/pull/182

From ysr at openjdk.org  Tue Dec 13 23:24:09 2022
From: ysr at openjdk.org (Y. Srinivas Ramakrishna)
Date: Tue, 13 Dec 2022 23:24:09 GMT
Subject: RFR: Allow adjusted capacity and used regions size to be equal
In-Reply-To:
References:
Message-ID:

On Tue, 13 Dec 2022 18:46:37 GMT, William Kemper wrote:

> Fix assertion which requires adjusted capacity to be larger than the used regions size (they may be equal).

LGTM!

-------------

Marked as reviewed by ysr (Author).

PR: https://git.openjdk.org/shenandoah/pull/181

From ysr at openjdk.org  Tue Dec 13 23:26:09 2022
From: ysr at openjdk.org (Y. Srinivas Ramakrishna)
Date: Tue, 13 Dec 2022 23:26:09 GMT
Subject: RFR: Generation sizing fixes
In-Reply-To: <5u8AOWojAzGbSIL0W9XpeA1hTTXFUA2Dq_crBbJ6Z9o=.4587ca91-fc93-4073-96a2-93f6e63a06ef@github.com>
References: <5u8AOWojAzGbSIL0W9XpeA1hTTXFUA2Dq_crBbJ6Z9o=.4587ca91-fc93-4073-96a2-93f6e63a06ef@github.com>
Message-ID:

On Tue, 13 Dec 2022 18:48:59 GMT, William Kemper wrote:

> Two small fixes:
> * Fix windows build
> * Need gc id for resumed old generation marking

Marked as reviewed by ysr (Author).

-------------

PR: https://git.openjdk.org/shenandoah/pull/182

From kdnilsen at openjdk.org  Wed Dec 14 00:26:34 2022
From: kdnilsen at openjdk.org (Kelvin Nilsen)
Date: Wed, 14 Dec 2022 00:26:34 GMT
Subject: RFR: Allow adjusted capacity and used regions size to be equal
In-Reply-To:
References:
Message-ID:

On Tue, 13 Dec 2022 18:46:37 GMT, William Kemper wrote:

> Fix assertion which requires adjusted capacity to be larger than the used regions size (they may be equal).

Marked as reviewed by kdnilsen (Committer).
-------------

PR: https://git.openjdk.org/shenandoah/pull/181

From kdnilsen at openjdk.org  Wed Dec 14 00:28:45 2022
From: kdnilsen at openjdk.org (Kelvin Nilsen)
Date: Wed, 14 Dec 2022 00:28:45 GMT
Subject: RFR: Generation sizing fixes
In-Reply-To: <5u8AOWojAzGbSIL0W9XpeA1hTTXFUA2Dq_crBbJ6Z9o=.4587ca91-fc93-4073-96a2-93f6e63a06ef@github.com>
References: <5u8AOWojAzGbSIL0W9XpeA1hTTXFUA2Dq_crBbJ6Z9o=.4587ca91-fc93-4073-96a2-93f6e63a06ef@github.com>
Message-ID: <0Jz3hrh-_sJeoEmTFhaJJIRVLmepJatoKGBipVBD-i0=.3f24f850-06fc-48c2-bc8d-37e79a139946@github.com>

On Tue, 13 Dec 2022 18:48:59 GMT, William Kemper wrote:

> Two small fixes:
> * Fix windows build
> * Need gc id for resumed old generation marking

Marked as reviewed by kdnilsen (Committer).

-------------

PR: https://git.openjdk.org/shenandoah/pull/182

From wkemper at openjdk.org  Wed Dec 14 00:35:35 2022
From: wkemper at openjdk.org (William Kemper)
Date: Wed, 14 Dec 2022 00:35:35 GMT
Subject: RFR: Generation resizing [v3]
In-Reply-To:
References: <8Fse7IxO14Uc0eJJoLMmGXSo8XYD9Qb144mCyrMX3-g=.1d5d7258-f06d-4a03-bf78-8102f22ada7d@github.com>
Message-ID:

On Fri, 9 Dec 2022 00:55:04 GMT, Y. Srinivas Ramakrishna wrote:

>> src/hotspot/share/gc/shenandoah/shenandoahGeneration.cpp line 980:
>>
>>> 978: }
>>> 979:
>>> 980: void ShenandoahGeneration::increase_capacity(size_t increment) {
>>
>> Is there some sanity check done on this elsewhere to make sure the increase/decrease makes sense? Perhaps I'll see it in the caller(s) when I get to it.
>
> I see now that you do. Would it still be worthwhile asserting here as well that bounds are respected? Might make the code more maintainable in the face of changes.

I will add asserts here.

>> src/hotspot/share/gc/shenandoah/shenandoahMmuTracker.hpp line 90:
>>
>>> 88: // allocators by taking the heap lock). The amount of capacity to move
>>> 89: // from one generation to another is controlled by YoungGenerationSizeIncrement
>>> 90: // and defaults to 20% of the heap. The minimum and maximum sizes of the
>>
>> Is the transfer delta always 20%? Wouldn't that cause oscillations about an equilibrium point at steady load? But I should read on to see how this works.
>
> I think the way you use it, it's not 20% of the heap but rather 20% of the free space in the generation that will provide the transfer delta. Maybe reword for clarity?

Yes, it's 20% of the available capacity.

-------------

PR: https://git.openjdk.org/shenandoah/pull/177

From wkemper at openjdk.org  Wed Dec 14 00:49:47 2022
From: wkemper at openjdk.org (William Kemper)
Date: Wed, 14 Dec 2022 00:49:47 GMT
Subject: RFR: Generation resizing [v4]
In-Reply-To:
References:
Message-ID:

On Thu, 8 Dec 2022 23:32:14 GMT, Y. Srinivas Ramakrishna wrote:

>> William Kemper has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 17 commits:
>>
>>  - Merge branch 'shenandoah-master' into mmu-instrumentation
>>  - Remove vestigial lock, do not enroll periodic task while holding threads_lock
>>  - Remove unnecessary logging, clean up imports
>>  - Merge from shenandoah/master
>>  - Document the class responsible for adjusting generation sizes
>>  - Revert unnecessary change
>>  - Remove unused time between cycle tracking
>>  - Remove vestigial mmu tracker instance
>>  - Clamp adjustments to min/max when increment is too large
>>  - Adjust generation sizes from safepoint
>>  - ... and 7 more: https://git.openjdk.org/shenandoah/compare/25469283...50896e31
>
> src/hotspot/share/gc/shenandoah/mode/shenandoahGenerationalMode.cpp line 39:
>
>> 37: }
>> 38:
>> 39: SHENANDOAH_ERGO_OVERRIDE_DEFAULT(GCTimeRatio, 70);
>
> Does this translate to a GC overhead of 1/71*100% = 1.4%? I think it is a confusingly named parameter, but I'm interpreting it based on the description:
> "Adaptive size policy application time to GC time ratio"

Any time the average MMU drops below this number, it attempts to resize the generations.

-------------

PR: https://git.openjdk.org/shenandoah/pull/177

From wkemper at openjdk.org  Wed Dec 14 00:54:44 2022
From: wkemper at openjdk.org (William Kemper)
Date: Wed, 14 Dec 2022 00:54:44 GMT
Subject: RFR: Generation resizing [v4]
In-Reply-To:
References:
Message-ID: <-VY3oESa76Aaivt-cx5T9CjmLrMDsm3lOM6aSm_w3iw=.76afd677-35be-4305-91a6-3338a6935e1c@github.com>

On Thu, 8 Dec 2022 23:57:18 GMT, Y. Srinivas Ramakrishna wrote:

>> William Kemper has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 17 commits:
>>
>>  - Merge branch 'shenandoah-master' into mmu-instrumentation
>>  - Remove vestigial lock, do not enroll periodic task while holding threads_lock
>>  - Remove unnecessary logging, clean up imports
>>  - Merge from shenandoah/master
>>  - Document the class responsible for adjusting generation sizes
>>  - Revert unnecessary change
>>  - Remove unused time between cycle tracking
>>  - Remove vestigial mmu tracker instance
>>  - Clamp adjustments to min/max when increment is too large
>>  - Adjust generation sizes from safepoint
>>  - ... and 7 more: https://git.openjdk.org/shenandoah/compare/25469283...50896e31
>
> src/hotspot/share/gc/shenandoah/shenandoahMmuTracker.cpp line 96:
>
>> 94: // This is only called by the periodic thread.
>> 95: double process_time_s = process_time_seconds();
>> 96: double elapsed_process_time_s = process_time_s - _initial_process_time_s;
>
> elapsed -> delta

I prefer 'elapsed' here when dealing with time deltas.
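For readers following the arithmetic in these two review threads, here is a rough, self-contained sketch of how process-time and GC-thread-time deltas can be turned into the average MMU that is compared against GCTimeRatio. The composition below is an assumption for illustration; only the delta computation mirrors the quoted snippet. Note that with GCTimeRatio = 70, the implied target GC overhead is about 1/(70+1), roughly the 1.4% asked about above.

    // Hedged sketch: average mutator utilization over a sampling interval,
    // computed from the elapsed process time and elapsed GC-thread time.
    double average_mmu(double process_delta_s, double gc_delta_s) {
      if (process_delta_s <= 0.0) {
        return 1.0;                           // nothing has elapsed yet
      }
      double mutator_s = process_delta_s - gc_delta_s;
      return mutator_s / process_delta_s;     // fraction of time left to mutators
    }

    // Per the reply above ("any time the average MMU drops below this number"),
    // the tracker would then consider resizing generations when, for example:
    //   average_mmu(process_delta, gc_delta) * 100.0 < GCTimeRatio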
-------------

PR: https://git.openjdk.org/shenandoah/pull/177

From wkemper at openjdk.org  Wed Dec 14 00:54:46 2022
From: wkemper at openjdk.org (William Kemper)
Date: Wed, 14 Dec 2022 00:54:46 GMT
Subject: RFR: Generation resizing [v4]
In-Reply-To:
References:
Message-ID:

On Fri, 9 Dec 2022 00:12:56 GMT, Y. Srinivas Ramakrishna wrote:

>> src/hotspot/share/gc/shenandoah/shenandoahMmuTracker.cpp line 99:
>>
>>> 97: _initial_process_time_s = process_time_s;
>>> 98: double verify_time_s = gc_thread_time_seconds();
>>> 99: double verify_elapsed = verify_time_s - _initial_verify_collector_time_s;
>>
>> elapsed -> delta
>
> Why do you use the `verify_` prefix here? I'm sure I am missing something here...

It's vestigial, I'll change it.

-------------

PR: https://git.openjdk.org/shenandoah/pull/177

From mennen at openjdk.org  Wed Dec 14 03:47:06 2022
From: mennen at openjdk.org (Michael Ennen)
Date: Wed, 14 Dec 2022 03:47:06 GMT
Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v39]
In-Reply-To:
References: <-V_N0Cvh4J0vKNbBYdFcow9E8yFHRIjya8n69MpDSuY=.9626ee4d-95b6-41e4-b21e-395e79840388@github.com>
Message-ID: <91WlM45Ykemls6D5vtXZMIIqjjECQTLVuJFhTLYXq-I=.4e670ac0-f1ff-480e-b18e-cea98d01bd6f@github.com>

On Mon, 5 Dec 2022 13:46:09 GMT, Maurizio Cimadamore wrote:

>> Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Fix Preview annotation for JEP 434
>
> Note: there are 4 tests failing in x86:
> * MemoryLayoutPrincipalTotalityTest
> * MemoryLayoutTypeRetentionTest
> * TestLargeSegmentCopy
> * TestLinker
>
> These failures are addressed in the dependent PR: https://git.openjdk.org/jdk/pull/11019, which will be integrated immediately after these changes

@mcimadamore This PR made my code in [java-vulkan](https://github.com/brcolow/java-vulkan/commit/171f167782eea538b19b60d5fa73e9f75a112f6d) much cleaner! Nice work!

-------------

PR: https://git.openjdk.org/jdk/pull/10872

From ysr at openjdk.org  Wed Dec 14 08:22:56 2022
From: ysr at openjdk.org (Y. Srinivas Ramakrishna)
Date: Wed, 14 Dec 2022 08:22:56 GMT
Subject: RFR: JDK-8297796 GenShen: instrument the remembered set scan [v8]
In-Reply-To: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com>
References: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com>
Message-ID:

> **Note:**
> This pull request is a draft to share the diffs with the project team. The following additional work is planned before this is ready to commit. (Thanks to Kevin, Roman, William etc. for feedback & suggestions.)
>
> (1) Collect performance data from SpecJBB and from the pipeline to assess the impact of instrumentation on concurrent remembered set scanning and concurrent update refs phase durations, in addition to the existing data from Extremem mentioned in the ticket.
>
> (2) Make the instrumentation available only in non-product (optimized) mode until better performance is achieved.
>
> (3) Any improvements that come from further feedback on this draft (e.g. better or different logging of the metrics data), or other suggestions that I may have missed mentioning above.
>
> The fix to ShenandoahNumberSeq will be separated out and made into a separate pull request on mainline.
>
> **Summary:**
> The main change is card stats collection during RS scanning. The code is protected by a new diagnostic flag `ShenandoahEnableCardStats`, which is off by default. With the flag disabled there is a small performance impact (measured with extremem; more data will be collected, see above). With the flag enabled there is a larger performance impact because of the large number of clusters, with shared stats updates at the end of each cluster processed. Since we expect the loops in process_clusters() to change in the near future, informed by the learnings from these stats, we expect to work further on reducing the cost of the stats collection as well. Currently the stats are logged per thread at the end of each RS scan. I'm happy to refine both the stats that we collect and how frequently we log the data once we have gathered some experience on how we use this.
>
> **Details of files changed:**
>
> 1. shenandoahGeneration.cpp: add a call to log info at the end of remembered set scan when card stats are enabled.
> 2. shenandoahHeap.cpp: minor restructuring of a loop for task claiming during update refs; introduce a worker id option to downstream code for card stats
> 3. shenandoahNumberSeq.cpp: fix a minor issue with a boundary condition check in code that tries to find the right bucket to increment. This was triggering an assert in the update code.
> 4. shenandoahNumberSeq.hpp: provide missing allocation spec for BinaryMagnitudeSeq
> 5. shenandoahScanRemembered.cpp: new class ShenandoahCardStats methods. Minor restructure of loop for task claiming during RS scanning (akin to the one for update refs in 2 above).
> 6. shenandoahScanRemembered.hpp: Diff looks large because of git-diff'ism having issues with indentation change in restructured if-else branches. Not sure how to make the diffs more easily readable. Updated some documentation comments that were slightly obsolete. New class ShenandoahCardStats and implementation of inline methods. Class ShenandoahScanRemembered keeps cumulative running histograms. Remove some inline declarations for larger methods that we shouldn't force inlining on. Update some old comments.
> 7. shenandoahScanRemembered.inline.hpp: As in 6, the diff looks larger than it should because of the same indentation change. ShenandoahScanRemembered::process_clusters() is the method where the instrumentation probes have been inserted. A couple of variables were renamed for clarity, as well as to update local variables rather than method arguments. The large diffs at (old) line 589 onwards are the git-diff'ism to do with indentation change. Delete some unused methods.
> 8. shenandoah_globals.hpp: new diagnostic flag `ShenandoahEnableCardStats` protects the stats collection code and is disabled by default.
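To illustrate the cost model implied by item 8, here is a hedged, self-contained sketch of the diagnostic-flag guard pattern; every name other than the flag itself is a stand-in for illustration, not the patch's API:

    #include <cstdio>

    // Sketch: both the per-cluster recording and the end-of-scan logging are
    // gated on the flag, so the default (off) configuration pays only a
    // predictable branch.
    static bool ShenandoahEnableCardStats = false;  // diagnostic, default off

    struct CardStats { long dirty_cards = 0, clean_cards = 0, alternations = 0; };

    void record_cluster(CardStats& s, long dirty, long clean, long alts) {
      if (!ShenandoahEnableCardStats) return;       // near-zero cost when disabled
      s.dirty_cards += dirty;
      s.clean_cards += clean;
      s.alternations += alts;
    }

    void log_after_scan(const CardStats& s, int worker_id) {
      if (!ShenandoahEnableCardStats) return;
      std::printf("Worker %d: dirty=%ld clean=%ld alternations=%ld\n",
                  worker_id, s.dirty_cards, s.clean_cards, s.alternations);
    }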
>
> **Format of stats produced and how to interpret them: (sample)**
>
>
> [1211.515s][info][gc,task ] GC(7069) Using 10 of 20 workers for Concurrent remembered set scanning
> [1211.529s][info][gc,remset ] GC(7069) Worker 0 Card Stats Histo:
> [1211.529s][info][gc,remset ] GC(7069) dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_run: [ 0.00 0.00 0.00 99.61 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) dirty_cards: [ 0.00 0.00 0.00 0.00 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_cards: [ 0.00 99.61 99.61 99.61 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) max_dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) max_clean_run: [ 0.00 99.61 99.61 99.61 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) dirty_objs: [ 0.00 0.00 0.00 0.00 1245.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_objs: [ 0.00 0.00 0.00 0.00 64.00 ]
> [1211.529s][info][gc,remset ] GC(7069) dirty_scans: [ 0.00 0.00 0.00 0.00 1157.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_scans: [ 0.00 0.00 0.00 0.00 17.00 ]
> [1211.529s][info][gc,remset ] GC(7069) alternations: [ 0.00 0.00 0.00 0.00 39.00 ]
> [1211.529s][info][gc,remset ] GC(7069) Worker 1 Card Stats Histo:
> [1211.529s][info][gc,remset ] GC(7069) dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_run: [ 0.00 0.00 0.00 99.61 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) dirty_cards: [ 0.00 0.00 0.00 0.00 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_cards: [ 0.00 99.61 99.61 99.61 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) max_dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) max_clean_run: [ 0.00 99.61 99.61 99.61 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) dirty_objs: [ 0.00 0.00 0.00 0.00 1257.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_objs: [ 0.00 0.00 0.00 0.00 64.00 ]
> [1211.529s][info][gc,remset ] GC(7069) dirty_scans: [ 0.00 0.00 0.00 0.00 1197.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_scans: [ 0.00 0.00 0.00 0.00 17.00 ]
> [1211.529s][info][gc,remset ] GC(7069) alternations: [ 0.00 0.00 0.00 0.00 39.00 ]
> ...
>
>
> The rows represent the metric that's being tracked, and the columns are, respectively, the minimum, the 3 quartiles (25%, 50%, 75%) and the maximum. The metrics are:
>
> - dirty_run: the length of an uninterrupted run of dirty cards, interpreted as a percentage of a chunk of work assignment (cluster) processed by a thread
> - clean_run: as above, but the length of an uninterrupted run of clean cards
> - dirty_cards, clean_cards: as above, but counts of cards as a percentage of chunk
> - max_dirty_run & max_clean_run: similarly, but for the maximum of each
> - dirty_objs, clean_objs: these are numbers of objects in any chunk walked, or scanned
> - dirty_scans, clean_scans: numbers of objects scanned by the closure
> - alternations: the number of times that we transitioned from clean to dirty or dirty to clean in a chunk
>
> The data above indicates that at least 75% of the chunks have no alternations at all,
> and cards are almost always mostly clean for this specific benchmark config (extremem).
>
> Comparing worker stats from worker 0 and worker 9 indicates very little difference between
> their statistics, as one might typically expect for well-balanced RS scans.
>
> **Questions:**
>
> 1. Would it make sense to also print, for example, the 1, 10, 90 and 99 percentiles for these metrics, in addition to the quartiles?
> 2. The distributions are per worker for the cumulative history of the run. Would data per RS scan or per Refs Update phase provide more useful information?
> 3. Any suggestions for a more easily consumable format?
> 4. I welcome any other feedback on the pull request.

Y. Srinivas Ramakrishna has updated the pull request incrementally with five additional commits since the last revision:

 - Remove stubs (guarantee/false).

   This still needs formal work on merging decayed stats, but is OK to
   ignore for now because no one currently uses the decayed stats. The
   non-decayed stats also need further review and correction. So this is
   still an interim checkin. To do:
   -- print final summary at exit; consider if a periodic cumulative
      summary might be useful as well (every major collection cycle?)
   -- check correctness of merged data (ignoring decayed statistics for now)
 - Merge branch 'stats_merge' into JVM-1264
 - More merge() implementation.
   -- Need to think about merge of decaying stats in AbsSeq.
   -- Need to add tests.
 - Interim checkin of code w/beginnings of merge() support. Some
   implementations are still stubbed out and need to be written.
 - First cut at merge. More changes to come. May not build yet.

-------------

Changes:
  - all: https://git.openjdk.org/shenandoah/pull/176/files
  - new: https://git.openjdk.org/shenandoah/pull/176/files/d5b337bc..695851da

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=shenandoah&pr=176&range=07
 - incr: https://webrevs.openjdk.org/?repo=shenandoah&pr=176&range=06-07

Stats: 84 lines in 5 files changed: 83 ins; 0 del; 1 mod
Patch: https://git.openjdk.org/shenandoah/pull/176.diff
Fetch: git fetch https://git.openjdk.org/shenandoah pull/176/head:pull/176

PR: https://git.openjdk.org/shenandoah/pull/176

From rkennke at openjdk.org  Wed Dec 14 15:47:00 2022
From: rkennke at openjdk.org (Roman Kennke)
Date: Wed, 14 Dec 2022 15:47:00 GMT
Subject: RFR: 8291555: Replace stack-locking with fast-locking [v8]
In-Reply-To:
References:
Message-ID:

On Fri, 28 Oct 2022 09:32:58 GMT, Roman Kennke wrote:

>> This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turns out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable.
>>
>> What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock.
>>
>> This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typically remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks.
>> Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols.
>>
>> In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When recursive locking is attempted, the fast-lock gets inflated to a full monitor. It is not clear if it is worth adding support for recursive fast-locking.
>>
>> One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, thus handing it over to the contending thread.
>>
>> As an alternative, I considered removing stack-locking altogether, and only using heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc. as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it.
>>
>> This change makes it possible to simplify (and speed up!) a lot of code:
>>
>> - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header.
>> - Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach through to the displaced header. This is safe because Java threads participate in the monitor deflation protocol. This would be implemented in a separate PR.
>>
>> ### Benchmarks
>>
>> All benchmarks are run on server-class metal machines. The JVM settings are always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less is better.
>>
>> #### DaCapo/AArch64
>>
>> Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. Benchmarks that showed results far off the baseline or showed high variance have been repeated and I am reporting results with the most bias *against* fast-locking.
>> The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean what is it actually doing that a change in locking shows >30% perf difference?
>>
>> benchmark | baseline | fast-locking | % | size
>> -- | -- | -- | -- | --
>> avrora | 27859 | 27563 | 1.07% | large
>> batik | 20786 | 20847 | -0.29% | large
>> biojava | 27421 | 27334 | 0.32% | default
>> eclipse | 59918 | 60522 | -1.00% | large
>> fop | 3670 | 3678 | -0.22% | default
>> graphchi | 2088 | 2060 | 1.36% | default
>> h2 | 297391 | 291292 | 2.09% | huge
>> jme | 8762 | 8877 | -1.30% | default
>> jython | 18938 | 18878 | 0.32% | default
>> luindex | 1339 | 1325 | 1.06% | default
>> lusearch | 918 | 936 | -1.92% | default
>> pmd | 58291 | 58423 | -0.23% | large
>> sunflow | 32617 | 24961 | 30.67% | large
>> tomcat | 25481 | 25992 | -1.97% | large
>> tradebeans | 314640 | 311706 | 0.94% | huge
>> tradesoap | 107473 | 110246 | -2.52% | huge
>> xalan | 6047 | 5882 | 2.81% | default
>> zxing | 970 | 926 | 4.75% | default
>>
>> #### DaCapo/x86_64
>>
>> The following measurements have been taken on an Intel Xeon Scalable Processor (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same settings and considerations as in the measurements above.
>>
>> benchmark | baseline | fast-locking | % | size
>> -- | -- | -- | -- | --
>> avrora | 127690 | 126749 | 0.74% | large
>> batik | 12736 | 12641 | 0.75% | large
>> biojava | 15423 | 15404 | 0.12% | default
>> eclipse | 41174 | 41498 | -0.78% | large
>> fop | 2184 | 2172 | 0.55% | default
>> graphchi | 1579 | 1560 | 1.22% | default
>> h2 | 227614 | 230040 | -1.05% | huge
>> jme | 8591 | 8398 | 2.30% | default
>> jython | 13473 | 13356 | 0.88% | default
>> luindex | 824 | 813 | 1.35% | default
>> lusearch | 962 | 968 | -0.62% | default
>> pmd | 40827 | 39654 | 2.96% | large
>> sunflow | 53362 | 43475 | 22.74% | large
>> tomcat | 27549 | 28029 | -1.71% | large
>> tradebeans | 190757 | 190994 | -0.12% | huge
>> tradesoap | 68099 | 67934 | 0.24% | huge
>> xalan | 7969 | 8178 | -2.56% | default
>> zxing | 1176 | 1148 | 2.44% | default
>>
>> #### Renaissance/AArch64
>>
>> This tests Renaissance/JMH version 0.14.1 on same machines as DaCapo above, with same JVM settings.
>>
>> benchmark | baseline | fast-locking | %
>> -- | -- | --
>> AkkaUct | 2558.832 | 2513.594 | 1.80%
>> Reactors | 14715.626 | 14311.246 | 2.83%
>> Als | 1851.485 | 1869.622 | -0.97%
>> ChiSquare | 1007.788 | 1003.165 | 0.46%
>> GaussMix | 1157.491 | 1149.969 | 0.65%
>> LogRegression | 717.772 | 733.576 | -2.15%
>> MovieLens | 7916.181 | 8002.226 | -1.08%
>> NaiveBayes | 395.296 | 386.611 | 2.25%
>> PageRank | 4294.939 | 4346.333 | -1.18%
>> FjKmeans | 496.076 | 493.873 | 0.45%
>> FutureGenetic | 2578.504 | 2589.255 | -0.42%
>> Mnemonics | 4898.886 | 4903.689 | -0.10%
>> ParMnemonics | 4260.507 | 4210.121 | 1.20%
>> Scrabble | 139.37 | 138.312 | 0.76%
>> RxScrabble | 320.114 | 322.651 | -0.79%
>> Dotty | 1056.543 | 1068.492 | -1.12%
>> ScalaDoku | 3443.117 | 3449.477 | -0.18%
>> ScalaKmeans | 259.384 | 258.648 | 0.28%
>> Philosophers | 24333.311 | 23438.22 | 3.82%
>> ScalaStmBench7 | 1102.43 | 1115.142 | -1.14%
>> FinagleChirper | 6814.192 | 6853.38 | -0.57%
>> FinagleHttp | 4762.902 | 4807.564 | -0.93%
>>
>> #### Renaissance/x86_64
>>
>> benchmark | baseline | fast-locking | %
>> -- | -- | --
>> AkkaUct | 1117.185 | 1116.425 | 0.07%
>> Reactors | 11561.354 | 11812.499 | -2.13%
>> Als | 1580.838 | 1575.318 | 0.35%
>> ChiSquare | 459.601 | 467.109 | -1.61%
>> GaussMix | 705.944 | 685.595 | 2.97%
>> LogRegression | 659.944 | 656.428 | 0.54%
>> MovieLens | 7434.303 | 7592.271 | -2.08%
>> NaiveBayes | 413.482 | 417.369 | -0.93%
>> PageRank | 3259.233 | 3276.589 | -0.53%
>> FjKmeans | 946.429 | 938.991 | 0.79%
>> FutureGenetic | 1760.672 | 1815.272 | -3.01%
>> ParMnemonics | 2016.917 | 2033.101 | -0.80%
>> Scrabble | 147.996 | 150.084 | -1.39%
>> RxScrabble | 177.755 | 177.956 | -0.11%
>> Dotty | 673.754 | 683.919 | -1.49%
>> ScalaDoku | 2193.562 | 1958.419 | 12.01%
>> ScalaKmeans | 165.376 | 168.925 | -2.10%
>> ScalaStmBench7 | 1080.187 | 1049.184 | 2.95%
>> Philosophers | 14268.449 | 13308.87 | 7.21%
>> FinagleChirper | 4722.13 | 4688.3 | 0.72%
>> FinagleHttp | 3497.241 | 3605.118 | -2.99%
>>
>> Some Renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and probably addressing by running them much more often).
>>
>> I have also run another benchmark, which is a popular Java JVM benchmark, with workloads wrapped in JMH and very slightly modified to run with newer JDKs, but I won't publish the results because I am not sure about the licensing terms. They look similar to the measurements above (i.e. +/- 2%, nothing very suspicious).
>>
>> Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here.
>>
>> ### Testing
>> - [x] tier1 (x86_64, aarch64, x86_32)
>> - [x] tier2 (x86_64, aarch64)
>> - [x] tier3 (x86_64, aarch64)
>> - [x] tier4 (x86_64, aarch64)
>> - [x] jcstress 3-days -t sync -af GLOBAL (x86_64, aarch64)
>
> Roman Kennke has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 37 commits:
>
>  - Merge remote-tracking branch 'upstream/master' into fast-locking
>  - Merge remote-tracking branch 'upstream/master' into fast-locking
>  - Merge remote-tracking branch 'upstream/master' into fast-locking
>  - More RISC-V fixes
>  - Merge remote-tracking branch 'origin/fast-locking' into fast-locking
>  - RISC-V port
>  - Revert "Re-use r0 in call to unlock_object()"
>
>    This reverts commit ebbcb615a788998596f403b47b72cf133cb9de46.
> - Merge remote-tracking branch 'origin/fast-locking' into fast-locking > - Fix number of rt args to complete_monitor_locking_C, remove some comments > - Re-use r0 in call to unlock_object() > - ... and 27 more: https://git.openjdk.org/jdk/compare/4b89fce0...3f0acba4 Closing this in favour of #10907. ------------- PR: https://git.openjdk.org/jdk/pull/10590 From rkennke at openjdk.org Wed Dec 14 15:47:01 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Wed, 14 Dec 2022 15:47:01 GMT Subject: Withdrawn: 8291555: Replace stack-locking with fast-locking In-Reply-To: References: Message-ID: On Thu, 6 Oct 2022 10:23:04 GMT, Roman Kennke wrote: > This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turns out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. > > What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock. > > This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typically remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols. > > In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth adding support for recursive fast-locking. > > One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, and thus handing over to the contending thread. > > As an alternative, I considered removing stack-locking altogether, and only using heavy monitors.
In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc. as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it. > > This change enables us to simplify (and speed up!) a lot of code: > > - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header. > - Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. This is safe because Java threads participate in the monitor deflation protocol. This would be implemented in a separate PR. > > ### Benchmarks > > All benchmarks are run on server-class metal machines. The JVM settings are always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less is better. > > #### DaCapo/AArch64 > > Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. Benchmarks that showed results far off the baseline or showed high variance have been repeated and I am reporting results with the most bias *against* fast-locking. The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean what is it actually doing that a change in locking shows >30% perf difference? > > benchmark | baseline | fast-locking | % | size > -- | -- | -- | -- | -- > avrora | 27859 | 27563 | 1.07% | large > batik | 20786 | 20847 | -0.29% | large > biojava | 27421 | 27334 | 0.32% | default > eclipse | 59918 | 60522 | -1.00% | large > fop | 3670 | 3678 | -0.22% | default > graphchi | 2088 | 2060 | 1.36% | default > h2 | 297391 | 291292 | 2.09% | huge > jme | 8762 | 8877 | -1.30% | default > jython | 18938 | 18878 | 0.32% | default > luindex | 1339 | 1325 | 1.06% | default > lusearch | 918 | 936 | -1.92% | default > pmd | 58291 | 58423 | -0.23% | large > sunflow | 32617 | 24961 | 30.67% | large > tomcat | 25481 | 25992 | -1.97% | large > tradebeans | 314640 | 311706 | 0.94% | huge > tradesoap | 107473 | 110246 | -2.52% | huge > xalan | 6047 | 5882 | 2.81% | default > zxing | 970 | 926 | 4.75% | default > > #### DaCapo/x86_64 > > The following measurements have been taken on an Intel Xeon Scalable Processor (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same settings and considerations as in the measurements above.
> > benchmark | baseline | fast-locking | % | size > -- | -- | -- | -- | -- > avrora | 127690 | 126749 | 0.74% | large > batik | 12736 | 12641 | 0.75% | large > biojava | 15423 | 15404 | 0.12% | default > eclipse | 41174 | 41498 | -0.78% | large > fop | 2184 | 2172 | 0.55% | default > graphchi | 1579 | 1560 | 1.22% | default > h2 | 227614 | 230040 | -1.05% | huge > jme | 8591 | 8398 | 2.30% | default > jython | 13473 | 13356 | 0.88% | default > luindex | 824 | 813 | 1.35% | default > lusearch | 962 | 968 | -0.62% | default > pmd | 40827 | 39654 | 2.96% | large > sunflow | 53362 | 43475 | 22.74% | large > tomcat | 27549 | 28029 | -1.71% | large > tradebeans | 190757 | 190994 | -0.12% | huge > tradesoap | 68099 | 67934 | 0.24% | huge > xalan | 7969 | 8178 | -2.56% | default > zxing | 1176 | 1148 | 2.44% | default > > #### Renaissance/AArch64 > > This tests Renaissance/JMH version 0.14.1 on the same machines as DaCapo above, with the same JVM settings. > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 2558.832 | 2513.594 | 1.80% > Reactors | 14715.626 | 14311.246 | 2.83% > Als | 1851.485 | 1869.622 | -0.97% > ChiSquare | 1007.788 | 1003.165 | 0.46% > GaussMix | 1157.491 | 1149.969 | 0.65% > LogRegression | 717.772 | 733.576 | -2.15% > MovieLens | 7916.181 | 8002.226 | -1.08% > NaiveBayes | 395.296 | 386.611 | 2.25% > PageRank | 4294.939 | 4346.333 | -1.18% > FjKmeans | 496.076 | 493.873 | 0.45% > FutureGenetic | 2578.504 | 2589.255 | -0.42% > Mnemonics | 4898.886 | 4903.689 | -0.10% > ParMnemonics | 4260.507 | 4210.121 | 1.20% > Scrabble | 139.37 | 138.312 | 0.76% > RxScrabble | 320.114 | 322.651 | -0.79% > Dotty | 1056.543 | 1068.492 | -1.12% > ScalaDoku | 3443.117 | 3449.477 | -0.18% > ScalaKmeans | 259.384 | 258.648 | 0.28% > Philosophers | 24333.311 | 23438.22 | 3.82% > ScalaStmBench7 | 1102.43 | 1115.142 | -1.14% > FinagleChirper | 6814.192 | 6853.38 | -0.57% > FinagleHttp | 4762.902 | 4807.564 | -0.93% > > #### Renaissance/x86_64 > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 1117.185 | 1116.425 | 0.07% > Reactors | 11561.354 | 11812.499 | -2.13% > Als | 1580.838 | 1575.318 | 0.35% > ChiSquare | 459.601 | 467.109 | -1.61% > GaussMix | 705.944 | 685.595 | 2.97% > LogRegression | 659.944 | 656.428 | 0.54% > MovieLens | 7434.303 | 7592.271 | -2.08% > NaiveBayes | 413.482 | 417.369 | -0.93% > PageRank | 3259.233 | 3276.589 | -0.53% > FjKmeans | 946.429 | 938.991 | 0.79% > FutureGenetic | 1760.672 | 1815.272 | -3.01% > ParMnemonics | 2016.917 | 2033.101 | -0.80% > Scrabble | 147.996 | 150.084 | -1.39% > RxScrabble | 177.755 | 177.956 | -0.11% > Dotty | 673.754 | 683.919 | -1.49% > ScalaDoku | 2193.562 | 1958.419 | 12.01% > ScalaKmeans | 165.376 | 168.925 | -2.10% > ScalaStmBench7 | 1080.187 | 1049.184 | 2.95% > Philosophers | 14268.449 | 13308.87 | 7.21% > FinagleChirper | 4722.13 | 4688.3 | 0.72% > FinagleHttp | 3497.241 | 3605.118 | -2.99% > > Some Renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and probably addressing by running them much more often). > > I have also run another benchmark, which is a popular Java JVM benchmark, with workloads wrapped in JMH and very slightly modified to run with newer JDKs, but I won't publish the results because I am not sure about the licensing terms. They look similar to the measurements above (i.e. +/- 2%, nothing very suspicious).
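For readers following this thread without the patch open, here is a minimal sketch of the fast-locking scheme the description above lays out: a CAS on only the two low header bits, with ownership recorded on a small per-thread lock stack. All names here (`Obj`, `LockStack`, `try_fast_lock`) and the simplified header encoding are illustrative assumptions, not the identifiers or encoding used in the actual change:

```c++
#include <atomic>
#include <cstdint>

// Illustrative header encoding; the real one lives in HotSpot's markWord.
static const uintptr_t LOCK_MASK   = 0x3;  // two low bits of the object header
static const uintptr_t UNLOCKED    = 0x1;  // neutral/unlocked bit pattern
static const uintptr_t FAST_LOCKED = 0x0;  // 00 == fast-locked, per the description

struct Obj { std::atomic<uintptr_t> header; };

// Small per-thread array of object references; per the description it
// typically stays at 3-5 elements. No overflow handling in this sketch.
struct LockStack {
  Obj* _elems[16];
  int  _top = 0;
  void push(Obj* o) { _elems[_top++] = o; }
  // The common query "does the current thread own me?" is a linear scan.
  bool contains(const Obj* o) const {
    for (int i = 0; i < _top; i++) { if (_elems[i] == o) return true; }
    return false;
  }
};

// Fast path: CAS the two low header bits from unlocked to fast-locked and
// record ownership on the current thread's lock stack.
bool try_fast_lock(Obj* o, LockStack& ls) {
  uintptr_t h = o->header.load(std::memory_order_relaxed);
  uintptr_t expected = (h & ~LOCK_MASK) | UNLOCKED;
  uintptr_t desired  = (h & ~LOCK_MASK) | FAST_LOCKED;
  if (o->header.compare_exchange_strong(expected, desired)) {
    ls.push(o);
    return true;
  }
  return false;  // already locked (incl. recursively) or contended: inflate
}
```

The property the PR argues for is visible in the sketch: the CAS touches only the low bits and ownership lives in the per-thread array, so the upper header bits stay free for Lilliput.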
> > Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here. > > ### Testing > - [x] tier1 (x86_64, aarch64, x86_32) > - [x] tier2 (x86_64, aarch64) > - [x] tier3 (x86_64, aarch64) > - [x] tier4 (x86_64, aarch64) > - [x] jcstress 3-days -t sync -af GLOBAL (x86_64, aarch64) This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/10590 From andrew at openjdk.org Wed Dec 14 16:44:55 2022 From: andrew at openjdk.org (Andrew John Hughes) Date: Wed, 14 Dec 2022 16:44:55 GMT Subject: RFR: Merge jdk8u:master Message-ID: Merge jdk8u332-b02 ------------- Commit messages: - Merge jdk8u332-b02 - Merge - 8273575: memory leak in appendBootClassPath(), paths must be deallocated - 8141508: java.lang.invoke.LambdaConversionException: Invalid receiver type - 8209178: Proxied HttpsURLConnection doesn't send BODY when retrying POST request - 8273341: Update Siphash to version 1.0 - 8273229: Update OS detection code to recognize Windows Server 2022 - Added tag jdk8u332-b01 for changeset b81aa0cb6267 The merge commit only contains trivial merges, so no merge-specific webrevs have been generated. Changes: https://git.openjdk.org/shenandoah-jdk8u/pull/7/files Stats: 536 lines in 10 files changed: 502 ins; 11 del; 23 mod Patch: https://git.openjdk.org/shenandoah-jdk8u/pull/7.diff Fetch: git fetch https://git.openjdk.org/shenandoah-jdk8u pull/7/head:pull/7 PR: https://git.openjdk.org/shenandoah-jdk8u/pull/7 From wkemper at openjdk.org Wed Dec 14 16:54:31 2022 From: wkemper at openjdk.org (William Kemper) Date: Wed, 14 Dec 2022 16:54:31 GMT Subject: Integrated: Generation sizing fixes In-Reply-To: <5u8AOWojAzGbSIL0W9XpeA1hTTXFUA2Dq_crBbJ6Z9o=.4587ca91-fc93-4073-96a2-93f6e63a06ef@github.com> References: <5u8AOWojAzGbSIL0W9XpeA1hTTXFUA2Dq_crBbJ6Z9o=.4587ca91-fc93-4073-96a2-93f6e63a06ef@github.com> Message-ID: <8h3ssMXXg9m4MZsp8WXOHULK2ANszaMShty_a8GSHtM=.304d95db-06a3-4045-aa52-e58ad3ee3af5@github.com> On Tue, 13 Dec 2022 18:48:59 GMT, William Kemper wrote: > Two small fixes: > * Fix windows build > * Need gc id for resumed old generation marking This pull request has now been integrated. Changeset: 7a3ebbcd Author: William Kemper URL: https://git.openjdk.org/shenandoah/commit/7a3ebbcdae1659ba9bb9be04e6419aa3b34cc8c9 Stats: 2 lines in 2 files changed: 1 ins; 0 del; 1 mod Generation sizing fixes Reviewed-by: ysr, kdnilsen ------------- PR: https://git.openjdk.org/shenandoah/pull/182 From wkemper at openjdk.org Wed Dec 14 16:54:40 2022 From: wkemper at openjdk.org (William Kemper) Date: Wed, 14 Dec 2022 16:54:40 GMT Subject: Integrated: Allow adjusted capacity and used regions size to be equal In-Reply-To: References: Message-ID: On Tue, 13 Dec 2022 18:46:37 GMT, William Kemper wrote: > Fix assertion which requires adjusted capacity to be larger than the used regions size (they may be equal). This pull request has now been integrated. Changeset: 35b26d60 Author: William Kemper URL: https://git.openjdk.org/shenandoah/commit/35b26d605a61cba864458e4493bf80fe7fda31ad Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Allow adjusted capacity and used regions size to be equal Reviewed-by: ysr, kdnilsen ------------- PR: https://git.openjdk.org/shenandoah/pull/181 From ysr at openjdk.org Thu Dec 15 09:08:24 2022 From: ysr at openjdk.org (Y. 
Srinivas Ramakrishna) Date: Thu, 15 Dec 2022 09:08:24 GMT Subject: RFR: JDK-8297796 GenShen: instrument the remembered set scan [v9] In-Reply-To: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com> References: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com> Message-ID: > **Note:** > This pull request is a draft to share the diffs with the project team. The following additional work is planned before this is ready to commit. (Thanks to Kevin, Roman, William etc. for feedback & suggestions.) > > (1) Collect performance data from SpecJBB and from the pipeline to assess the impact of instrumentation on concurrent remembered set scanning and concurrent update refs phase durations, in addition to the existing data from Extremem mentioned in the ticket. > > (2) Make available the instrumentation only in non-product (optimized) mode until better performance is achieved. > > (3) Any improvements that come from further feedback on this draft (e.g. better or different logging of the metrics data), or other suggestions that I may have missed mentioning above. > > The fix to ShenandoahNumberSeq will be separated out and made into a separate pull request on mainline. > > **Summary:** > The main change is card stats collection during RS scanning. The code is protected by a new diagnostic flag `ShenandoahEnableCardStats`, which is off by default. With the flag disabled there is a small performance impact (measured with Extremem; more data will be collected, see above). With the flag enabled there is a larger performance impact because of the large number of clusters, with shared stats updates at the end of each cluster processed. Since we expect the loops in process_clusters() to change in the near future, informed by the learnings from these stats, we expect to work further on reducing the cost of the stats collection as well. Currently the stats are logged per thread at the end of each RS scan. I'm happy to refine both the stats that we collect as well as how frequently we log the data once we have gathered some experience on how we use this. > > **Details of files changed:** > > 1. shenandoahGeneration.cpp: add a call to log info at the end of remembered set scan when card stats are enabled. > 2. shenandoahHeap.cpp: minor restructuring of a loop for task claiming during update refs; introduce a worker id option to downstream code for card stats > 3. shenandoahNumberSeq.cpp: fix a minor issue with a boundary condition check in code that tries to find the right bucket to increment. This was triggering an assert in the update code. > 4. shenandoahNumberSeq.hpp: provide missing allocation spec for BinaryMagnitudeSeq > 5. shenandoahScanRemembered.cpp: new class ShenandoahCardStats methods. Minor restructure of loop for task claiming during RS scanning (akin to the one for update refs in 2 above). > 6. shenandoahScanRemembered.hpp: Diff looks large because of git-diff'ism having issues with indentation change in restructured if-else branches. Not sure how to make the diffs more easily readable. Updated some documentation comments that were slightly obsolete. New class ShenandoahCardStats and implementation of inline methods. Class ShenandoahScanRemembered keeps cumulative running histograms. Remove some inline declarations for larger methods that we shouldn't force inlining on. Update some old comments. > 7.
shenandoahScanRemembered.inline.hpp: As in 6, diff looks larger than it should because of the same indentation change. ShenandoahScanRemembered::process_clusters() is the method where the instrumentation probes have been inserted. A couple of variables were renamed for clarity, as well as to update local variables rather than method arguments. The large diffs at (old) line 589 onwards are the git-diff'ism to do with indentation change. Delete some unused methods. > 8. shenandoah_globals.hpp: new diagnostic flag `ShenandoahEnableCardStats` protects the stats collection code and is disabled by default. > > **Format of stats produced and how to interpret them: (sample)** > > > [1211.515s][info][gc,task ] GC(7069) Using 10 of 20 workers for Concurrent remembered set scanning > [1211.529s][info][gc,remset ] GC(7069) Worker 0 Card Stats Histo: > [1211.529s][info][gc,remset ] GC(7069) dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_run: [ 0.00 0.00 0.00 99.61 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) dirty_cards: [ 0.00 0.00 0.00 0.00 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_cards: [ 0.00 99.61 99.61 99.61 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) max_dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) max_clean_run: [ 0.00 99.61 99.61 99.61 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) dirty_objs: [ 0.00 0.00 0.00 0.00 1245.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_objs: [ 0.00 0.00 0.00 0.00 64.00 ] > [1211.529s][info][gc,remset ] GC(7069) dirty_scans: [ 0.00 0.00 0.00 0.00 1157.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_scans: [ 0.00 0.00 0.00 0.00 17.00 ] > [1211.529s][info][gc,remset ] GC(7069) alternations: [ 0.00 0.00 0.00 0.00 39.00 ] > [1211.529s][info][gc,remset ] GC(7069) Worker 1 Card Stats Histo: > [1211.529s][info][gc,remset ] GC(7069) dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_run: [ 0.00 0.00 0.00 99.61 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) dirty_cards: [ 0.00 0.00 0.00 0.00 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_cards: [ 0.00 99.61 99.61 99.61 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) max_dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) max_clean_run: [ 0.00 99.61 99.61 99.61 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) dirty_objs: [ 0.00 0.00 0.00 0.00 1257.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_objs: [ 0.00 0.00 0.00 0.00 64.00 ] > [1211.529s][info][gc,remset ] GC(7069) dirty_scans: [ 0.00 0.00 0.00 0.00 1197.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_scans: [ 0.00 0.00 0.00 0.00 17.00 ] > [1211.529s][info][gc,remset ] GC(7069) alternations: [ 0.00 0.00 0.00 0.00 39.00 ] > ... > > > The rows represent the metric that's being tracked, and the columns are, respectively, minimum, the 3 quartiles (25%, 50%, 75%) and the maximum. The metrics are: > > - dirty_run: the length of an uninterrupted run of dirty cards, interpreted as a percentage of a chunk of work assignment (cluster) processed by a thread > - clean_run: as above, but the length of an uninterrupted run of clean cards > - dirty_cards, clean_cards: as above, but counts of cards as a percentage of chunk > - max_dirty_run & max_clean_run: Similarly for the maximum of each.
> - dirty_objs, clean_objs: these are numbers of objects in any chunk walked, or scanned > - dirty_scans, clean_scans: numbers of objects scanned by the closure > - alternations: the number of times that we transitioned from clean to dirty or dirty to clean in a chunk > > The data above indicates that at least 75% of the chunks have no alternations at all, > and cards are almost always mostly clean for this specific benchmark config (Extremem). > > Comparing worker stats from worker 0 and worker 9 indicates very little difference between > their statistics, as one might typically expect for well-balanced RS scans. > > **Questions:** > > 1. Would it make sense to print also, for example, the 1, 10, 90 and 99 percentiles for these metrics as well, in addition to the quartiles? > 2. The distributions are per worker for the cumulative history of the run. Would data per RS scan or per Refs Update phase provide more useful information? > 3. Any suggestions for a more easily consumable format? > 4. I welcome any other feedback on the pull request. Y. Srinivas Ramakrishna has updated the pull request incrementally with one additional commit since the last revision: Tested and fixed some bugs; printing frequency of cumulative stats controlled by command-line option. Ready for review. ------------- Changes: - all: https://git.openjdk.org/shenandoah/pull/176/files - new: https://git.openjdk.org/shenandoah/pull/176/files/695851da..75c09268 Webrevs: - full: https://webrevs.openjdk.org/?repo=shenandoah&pr=176&range=08 - incr: https://webrevs.openjdk.org/?repo=shenandoah&pr=176&range=07-08 Stats: 64 lines in 7 files changed: 32 ins; 6 del; 26 mod Patch: https://git.openjdk.org/shenandoah/pull/176.diff Fetch: git fetch https://git.openjdk.org/shenandoah pull/176/head:pull/176 PR: https://git.openjdk.org/shenandoah/pull/176 From ysr at openjdk.org Thu Dec 15 09:13:39 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Thu, 15 Dec 2022 09:13:39 GMT Subject: RFR: JDK-8297796 GenShen: instrument the remembered set scan [v10] In-Reply-To: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com> References: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com> Message-ID: > **Note:** > This pull request is a draft to share the diffs with the project team. The following additional work is planned before this is ready to commit. (Thanks to Kevin, Roman, William etc. for feedback & suggestions.) > > (1) Collect performance data from SpecJBB and from the pipeline to assess the impact of instrumentation on concurrent remembered set scanning and concurrent update refs phase durations, in addition to the existing data from Extremem mentioned in the ticket. > > (2) Make available the instrumentation only in non-product (optimized) mode until better performance is achieved. > > (3) Any improvements that come from further feedback on this draft (e.g. better or different logging of the metrics data), or other suggestions that I may have missed mentioning above. > > The fix to ShenandoahNumberSeq will be separated out and made into a separate pull request on mainline. > > **Summary:** > The main change is card stats collection during RS scanning. The code is protected by a new diagnostic flag `ShenandoahEnableCardStats`, which is off by default. With the flag disabled there is a small performance impact (measured with Extremem; more data will be collected, see above).
With the flag enabled there is a larger performance impact because of the large number of clusters, with shared stats updates at the end of each cluster processed. Since we expect the loops in process_clusters() to change in the near future, informed by the learnings from these stats, we expect to work further on reducing the cost of the stats collection as well. Currently the stats are logged per thread at the end of each RS scan. I'm happy to refine both the stats that we collect as well as how frequently we log the data once we have gathered some experience on how we use this. > > **Details of files changed:** > > 1. shenandoahGeneration.cpp: add a call to log info at the end of remembered set scan when card stats are enabled. > 2. shenandoahHeap.cpp: minor restructuring of a loop for task claiming during update refs; introduce a worker id option to downstream code for card stats > 3. shenandoahNumberSeq.cpp: fix a minor issue with a boundary condition check in code that tries to find the right bucket to increment. This was triggering an assert in the update code. > 4. shenandoahNumberSeq.hpp: provide missing allocation spec for BinaryMagnitudeSeq > 5. shenandoahScanRemembered.cpp: new class ShenandoahCardStats methods. Minor restructure of loop for task claiming during RS scanning (akin to the one for update refs in 2 above). > 6. shenandoahScanRemembered.hpp: Diff looks large because of git-diff'ism having issues with indentation change in restructured if-else branches. Not sure how to make the diffs more easily readable. Updated some documentation comments that were slightly obsolete. New class ShenandoahCardStats and implementation of inline methods. Class ShenandoahScanRemembered keeps cumulative running histograms. Remove some inline declarations for larger methods that we shouldn't force inlining on. Update some old comments. > 7. shenandoahScanRemembered.inline.hpp: As in 6, diff looks larger than it should because of the same indentation change. ShenandoahScanRemembered::process_clusters() is the method where the instrumentation probes have been inserted. A couple of variables were renamed for clarity, as well as to update local variables rather than method arguments. The large diffs at (old) line 589 onwards are the git-diff'ism to do with indentation change. Delete some unused methods. > 8. shenandoah_globals.hpp: new diagnostic flag `ShenandoahEnableCardStats` protects the stats collection code and is disabled by default.
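As an aid to reading the per-worker histograms in the sample that follows, this is roughly what tallying one chunk's card metrics amounts to. The sketch below is illustrative only (invented `CardStatsSample`/`tally_chunk` names, a plain boolean array instead of the actual card table and `HdrSeq` plumbing), not the PR's `ShenandoahCardStats` code:

```c++
#include <cstddef>
#include <algorithm>

// Illustrative per-chunk tally of some of the metrics in the log sample below.
struct CardStatsSample {
  size_t dirty_cards = 0, clean_cards = 0;
  size_t max_dirty_run = 0, max_clean_run = 0;
  size_t alternations = 0;
};

// cards[i] == true means card i in this chunk (cluster) is dirty.
CardStatsSample tally_chunk(const bool* cards, size_t n) {
  CardStatsSample s;
  size_t run = 0;
  for (size_t i = 0; i < n; i++) {
    if (i > 0 && cards[i] != cards[i - 1]) {
      s.alternations++;  // transitioned clean<->dirty or dirty<->clean
      run = 0;
    }
    run++;
    if (cards[i]) {
      s.dirty_cards++;
      s.max_dirty_run = std::max(s.max_dirty_run, run);
    } else {
      s.clean_cards++;
      s.max_clean_run = std::max(s.max_clean_run, run);
    }
  }
  // The PR reports runs and counts as percentages of the chunk size
  // before feeding them into per-worker histograms.
  return s;
}
```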
> > **Format of stats produced and how to interpret them: (sample)** > > > [1211.515s][info][gc,task ] GC(7069) Using 10 of 20 workers for Concurrent remembered set scanning > [1211.529s][info][gc,remset ] GC(7069) Worker 0 Card Stats Histo: > [1211.529s][info][gc,remset ] GC(7069) dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_run: [ 0.00 0.00 0.00 99.61 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) dirty_cards: [ 0.00 0.00 0.00 0.00 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_cards: [ 0.00 99.61 99.61 99.61 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) max_dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) max_clean_run: [ 0.00 99.61 99.61 99.61 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) dirty_objs: [ 0.00 0.00 0.00 0.00 1245.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_objs: [ 0.00 0.00 0.00 0.00 64.00 ] > [1211.529s][info][gc,remset ] GC(7069) dirty_scans: [ 0.00 0.00 0.00 0.00 1157.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_scans: [ 0.00 0.00 0.00 0.00 17.00 ] > [1211.529s][info][gc,remset ] GC(7069) alternations: [ 0.00 0.00 0.00 0.00 39.00 ] > [1211.529s][info][gc,remset ] GC(7069) Worker 1 Card Stats Histo: > [1211.529s][info][gc,remset ] GC(7069) dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_run: [ 0.00 0.00 0.00 99.61 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) dirty_cards: [ 0.00 0.00 0.00 0.00 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_cards: [ 0.00 99.61 99.61 99.61 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) max_dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) max_clean_run: [ 0.00 99.61 99.61 99.61 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) dirty_objs: [ 0.00 0.00 0.00 0.00 1257.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_objs: [ 0.00 0.00 0.00 0.00 64.00 ] > [1211.529s][info][gc,remset ] GC(7069) dirty_scans: [ 0.00 0.00 0.00 0.00 1197.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_scans: [ 0.00 0.00 0.00 0.00 17.00 ] > [1211.529s][info][gc,remset ] GC(7069) alternations: [ 0.00 0.00 0.00 0.00 39.00 ] > ... > > > The rows represent the metric that's being tracked, and the columns are, respectively, minimum, the 3 quartiles (25%, 50%, 75%) and the maximum. The metrics are: > > - dirty_run: the length of an uninterrupted run of dirty cards, interpreted as a percentage of a chunk of work assignment (cluster) processed by a thread > - clean_run: as above, but the length of an uninterrupted run of clean cards > - dirty_cards, clean_cards: as above, but counts of cards as a percentage of chunk > - max_dirty_run & max_clean_run: Similarly for the maximum of each. > - dirty_objs, clean_objs: these are numbers of objects in any chunk walked, or scanned > - dirty_scans, clean_scans: numbers of objects scanned by the closure > - alternations: the number of times that we transitioned from clean to dirty or dirty to clean in a chunk > > The data above indicates that at least 75% of the chunks have no alternations at all, > and cards are almost always mostly clean for this specific benchmark config (extremem). > > Comparing worker stats from worker 0 and worker 9 indicates very little difference between > their statistics, as one might typically expect for well-balanced RS scans. > > **Questions:** > > 1. Would it make sense to print also, for example, the 1, 10, 90 and 99 percentiles for these metrics as well, in addition to the quartiles? > 2.
The distributions are per worker for the cumulative history of the run. Would data per RS scan or per Refs Update phase provide more useful information? > 3. Any suggestions for a more easily consumable format? > 4. I welcome any other feedback on the pull request. Y. Srinivas Ramakrishna has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 49 commits: - Merge branch 'master' into JVM-1264 - Tested and fixed some bugs; printing frequency of cumulative stats controlled by command-line option. Ready for review. - Remove stubs (guarantee/false). This still needs formal work on merging decayed stats, but is OK to ignore for now because no one currently uses the decayed stats. The non-decayed stats also need further review and correction. So this is still an interim checkin. To do: -- print final summary at exit; consider if periodic cumulative summary might be useful as well (every major collection cycle?) -- check correctness of merged data (ignoring decayed statistics for now) - Merge branch 'stats_merge' into JVM-1264 - More merge() implementation. -- Need to think about merge of decaying stats in AbsSeq. -- Need to add tests. - Interim checkin of code w/beginnings of merge() support. Some implementations are still stubbed out and need to be written. - First cut at merge. More changes to come. May not build yet. - jcheck clean - Cumulative card stats separated out for scan_rs and update_refs phases; merge of per-worker stats into phase-specific cumulative stats stubbed out for now until HdrSeq::merge() is done. - Separated out stats for scan_rs and update_refs Still need to carry cumulative stats, and merge stats from each round into cumulative. The latter needs a "merge" method in NumberSeq, which will be a separate PR. - ... and 39 more: https://git.openjdk.org/shenandoah/compare/35b26d60...75933a59 ------------- Changes: https://git.openjdk.org/shenandoah/pull/176/files Webrev: https://webrevs.openjdk.org/?repo=shenandoah&pr=176&range=09 Stats: 948 lines in 12 files changed: 578 ins; 204 del; 166 mod Patch: https://git.openjdk.org/shenandoah/pull/176.diff Fetch: git fetch https://git.openjdk.org/shenandoah pull/176/head:pull/176 PR: https://git.openjdk.org/shenandoah/pull/176 From luhenry at openjdk.org Thu Dec 15 13:27:11 2022 From: luhenry at openjdk.org (Ludovic Henry) Date: Thu, 15 Dec 2022 13:27:11 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13] In-Reply-To: References: Message-ID: On Wed, 16 Nov 2022 18:18:55 GMT, Claes Redestad wrote: >> Claes Redestad has updated the pull request incrementally with one additional commit since the last revision: >> >> Missing & 0xff in StringLatin1::hashCode > > I'm getting pulled into other tasks and would request for this to be either accepted as-is, rejected or picked up by someone else to rewrite it to something that can be accepted. > > Obviously I'm biased towards acceptance: While imperfect, it provides improved testing - both functional and performance-wise - and establishes a significantly improved benchmark for more future-proof solutions to beat. There are many ways to iteratively improve upon this solution, some of which would even simplify the implementation. But in the face of upcoming changes that might allow C2 to optimize these kinds of loops without intrinsic support I am not sure spending more time on perfecting the current patch is worth our while.
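For context on the loop under discussion in this thread: it is Java's classic `h = 31*h + v[i]` polynomial hash, and the hand-unrolling breaks the serial dependency on `h` by folding in precomputed powers of 31, which is what makes the loop amenable to pipelining and vectorization. A sketch of the transformation (written here in C++ purely for illustration; the actual change lives in the JDK's Java sources plus a C2 intrinsic):

```c++
#include <cstdint>
#include <cstddef>

// Scalar form: a serial dependency chain on h.
uint32_t hash_scalar(const uint8_t* v, size_t n) {
  uint32_t h = 0;
  for (size_t i = 0; i < n; i++) h = 31 * h + v[i];
  return h;
}

// 4-way unrolled form: h*31^4 + v[i]*31^3 + v[i+1]*31^2 + v[i+2]*31 + v[i+3].
// Same result modulo 2^32, but the four element terms are mutually independent.
uint32_t hash_unrolled(const uint8_t* v, size_t n) {
  uint32_t h = 0;
  size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    h = h * 923521            // 31^4
      + v[i]     * 29791      // 31^3
      + v[i + 1] * 961        // 31^2
      + v[i + 2] * 31
      + v[i + 3];
  }
  for (; i < n; i++) h = 31 * h + v[i];  // tail elements
  return h;
}
```

Both functions compute the same 32-bit value; the unrolled form merely regroups the arithmetic (31^2 = 961, 31^3 = 29791, 31^4 = 923521), which is the property the benchmarks quoted below are measuring.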
> > Rejecting it might be the reasonable thing to do, too, especially if the C2 loop optimizations @iwanowww points out might be coming around sooner rather than later. Even if that's not coming soon, the PR at hand adds a chunk of complexity for the compiler team to maintain. @cl4es @iwanowww is that change still good to go forward? What else would you like to see for it to be merged? Thanks! ------------- PR: https://git.openjdk.org/jdk/pull/10847 From duke at openjdk.org Thu Dec 15 15:59:11 2022 From: duke at openjdk.org (Ismael Juma) Date: Thu, 15 Dec 2022 15:59:11 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13] In-Reply-To: References: Message-ID: On Fri, 11 Nov 2022 13:00:06 GMT, Claes Redestad wrote: >> Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops. >> >> I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases. >> >> Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front. >> >> The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads. >> >> With the most recent fixes the x64 intrinsic results on my workstation look like this: >> >> Benchmark (size) Mode Cnt Score Error Units >> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.199 ± 0.017 ns/op >> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 6.933 ± 0.049 ns/op >> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 29.935 ± 0.221 ns/op >> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1596.982 ± 7.020 ns/op >> >> Baseline: >> >> Benchmark (size) Mode Cnt Score Error Units >> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.200 ± 0.013 ns/op >> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 9.424 ± 0.122 ns/op >> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 90.541 ± 0.512 ns/op >> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 9425.321 ± 67.630 ns/op >> >> I.e. no measurable overhead compared to baseline even for `size == 1`. >> >> The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good. >> >> Benchmark for `Arrays.hashCode`: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArraysHashCode.bytes 1 avgt 5 1.884 ± 0.013 ns/op >> ArraysHashCode.bytes 10 avgt 5 6.955 ± 0.040 ns/op >> ArraysHashCode.bytes 100 avgt 5 87.218 ± 0.595 ns/op >> ArraysHashCode.bytes 10000 avgt 5 9419.591 ± 38.308 ns/op >> ArraysHashCode.chars 1 avgt 5 2.200 ± 0.010 ns/op >> ArraysHashCode.chars 10 avgt 5 6.935 ± 0.034 ns/op >> ArraysHashCode.chars 100 avgt 5 30.216 ± 0.134 ns/op >> ArraysHashCode.chars 10000 avgt 5 1601.629 ± 6.418 ns/op >> ArraysHashCode.ints 1 avgt 5 2.200 ± 0.007 ns/op >> ArraysHashCode.ints 10 avgt 5 6.936 ± 0.034 ns/op >> ArraysHashCode.ints 100 avgt 5 29.412 ± 0.268 ns/op >> ArraysHashCode.ints 10000 avgt 5 1610.578 ± 7.785 ns/op >> ArraysHashCode.shorts 1 avgt 5 1.885 ± 0.012 ns/op >> ArraysHashCode.shorts 10 avgt 5 6.961 ±
0.034 ns/op >> ArraysHashCode.shorts 100 avgt 5 87.095 ± 0.417 ns/op >> ArraysHashCode.shorts 10000 avgt 5 9420.617 ± 50.089 ns/op >> >> Baseline: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArraysHashCode.bytes 1 avgt 5 3.213 ± 0.207 ns/op >> ArraysHashCode.bytes 10 avgt 5 8.483 ± 0.040 ns/op >> ArraysHashCode.bytes 100 avgt 5 90.315 ± 0.655 ns/op >> ArraysHashCode.bytes 10000 avgt 5 9422.094 ± 62.402 ns/op >> ArraysHashCode.chars 1 avgt 5 3.040 ± 0.066 ns/op >> ArraysHashCode.chars 10 avgt 5 8.497 ± 0.074 ns/op >> ArraysHashCode.chars 100 avgt 5 90.074 ± 0.387 ns/op >> ArraysHashCode.chars 10000 avgt 5 9420.474 ± 41.619 ns/op >> ArraysHashCode.ints 1 avgt 5 2.827 ± 0.019 ns/op >> ArraysHashCode.ints 10 avgt 5 7.727 ± 0.043 ns/op >> ArraysHashCode.ints 100 avgt 5 89.405 ± 0.593 ns/op >> ArraysHashCode.ints 10000 avgt 5 9426.539 ± 51.308 ns/op >> ArraysHashCode.shorts 1 avgt 5 3.071 ± 0.062 ns/op >> ArraysHashCode.shorts 10 avgt 5 8.168 ± 0.049 ns/op >> ArraysHashCode.shorts 100 avgt 5 90.399 ± 0.292 ns/op >> ArraysHashCode.shorts 10000 avgt 5 9420.171 ± 44.474 ns/op >> >> >> As we can see the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized). I aim to fix `byte` and `short` cases before integrating, though it might be acceptable to hand that off as follow-up enhancements to not further delay integration of this enhancement. > > Claes Redestad has updated the pull request incrementally with one additional commit since the last revision: > > Missing & 0xff in StringLatin1::hashCode Are the C2 loop optimizations happening any time soon? If not, it seems pretty sensible to take this very significant win for a very common path. We can always remove it once the C2 loop optimizations can achieve results that are as good. ------------- PR: https://git.openjdk.org/jdk/pull/10847 From wkemper at openjdk.org Thu Dec 15 17:55:11 2022 From: wkemper at openjdk.org (William Kemper) Date: Thu, 15 Dec 2022 17:55:11 GMT Subject: RFR: merge openjdk/jdk:master Message-ID: Merge jdk+21-0. There is a small change in the generational mode to use a different API for iterating java threads to resolve a merge conflict. ------------- Commit messages: - Replace use of removed API - Merge tag 'jdk-21+0' into merge-jdk21-0 - 8297642: PhaseIdealLoop::only_has_infinite_loops must detect all loops that never lead to termination - 8298255: JFR provide information about dynamization of number of compiler threads - 8298383: JFR: GenerateJfrFiles.java lacks copyright header - 8298379: JFR: Some UNTIMED events only sets endTime - 8298129: Let checkpoint event sizes grow beyond u4 limit - 8297718: Make NMT free:ing protocol more granular - 8298173: GarbageCollectionNotificationContentTest test failed: no decrease in Eden usage - 8298272: Clean up ProblemList - ... and 163 more: https://git.openjdk.org/shenandoah/compare/35b26d60...63a32877 Changes: https://git.openjdk.org/shenandoah/pull/183/files Webrev: https://webrevs.openjdk.org/?repo=shenandoah&pr=183&range=00 Stats: 81449 lines in 1217 files changed: 38340 ins; 35581 del; 7528 mod Patch: https://git.openjdk.org/shenandoah/pull/183.diff Fetch: git fetch https://git.openjdk.org/shenandoah pull/183/head:pull/183 PR: https://git.openjdk.org/shenandoah/pull/183 From ysr at openjdk.org Thu Dec 15 21:47:43 2022 From: ysr at openjdk.org (Y.
Srinivas Ramakrishna) Date: Thu, 15 Dec 2022 21:47:43 GMT Subject: RFR: Merge openjdk/jdk:master In-Reply-To: References: Message-ID: On Thu, 15 Dec 2022 17:47:55 GMT, William Kemper wrote: > Merge jdk+21-0. There is a small change in the generational mode to use a different API for iterating java threads to resolve a merge conflict. Thanks for the sync! ------------- Marked as reviewed by ysr (Author). PR: https://git.openjdk.org/shenandoah/pull/183 From wkemper at openjdk.org Thu Dec 15 21:51:33 2022 From: wkemper at openjdk.org (William Kemper) Date: Thu, 15 Dec 2022 21:51:33 GMT Subject: Integrated: Merge openjdk/jdk:master In-Reply-To: References: Message-ID: <2aRvlZHsG6YOBTCy2kcXWCB5t033UQhm5TuMipoZqX4=.9ff84b6d-86e7-46e8-a671-9d0aec95b50d@github.com> On Thu, 15 Dec 2022 17:47:55 GMT, William Kemper wrote: > Merge jdk+21-0. There is a small change in the generational mode to use a different API for iterating java threads to resolve a merge conflict. This pull request has now been integrated. Changeset: 3901a719 Author: William Kemper URL: https://git.openjdk.org/shenandoah/commit/3901a719a212dd83a7e32c9d3f3a3d298eb6fb81 Stats: 81449 lines in 1217 files changed: 38340 ins; 35581 del; 7528 mod Merge openjdk/jdk:master Reviewed-by: ysr ------------- PR: https://git.openjdk.org/shenandoah/pull/183 From andrew at openjdk.org Fri Dec 16 00:22:38 2022 From: andrew at openjdk.org (Andrew John Hughes) Date: Fri, 16 Dec 2022 00:22:38 GMT Subject: git: openjdk/shenandoah-jdk8u: Added tag jdk8u332-b02 for changeset c84adc4e Message-ID: <382b0ba2-ede8-4017-9334-2650a39a689b@openjdk.org> Tagged by: Andrew John Hughes Date: 2022-02-08 16:47:38 +0000 Changeset: c84adc4e Author: Andrew John Hughes Date: 2022-02-05 16:34:22 +0000 URL: https://git.openjdk.org/shenandoah-jdk8u/commit/c84adc4e7624f263cd06e2df19286bbc4ed82d41 From andrew at openjdk.org Fri Dec 16 00:22:43 2022 From: andrew at openjdk.org (Andrew John Hughes) Date: Fri, 16 Dec 2022 00:22:43 GMT Subject: git: openjdk/shenandoah-jdk8u: Added tag shenandoah8u332-b02 for changeset f02bb443 Message-ID: <3425fc29-78d6-42de-a836-7585cc532604@openjdk.org> Tagged by: Andrew John Hughes Date: 2022-12-16 00:19:03 +0000 Added tag shenandoah8u332-b02 for changeset f02bb443067 Changeset: f02bb443 Author: Andrew John Hughes Date: 2022-11-16 15:32:18 +0000 URL: https://git.openjdk.org/shenandoah-jdk8u/commit/f02bb443067363151d33fc64691846afb3292eb9 From andrew at openjdk.org Fri Dec 16 00:23:16 2022 From: andrew at openjdk.org (Andrew John Hughes) Date: Fri, 16 Dec 2022 00:23:16 GMT Subject: git: openjdk/shenandoah-jdk8u: master: 8 new changesets Message-ID: Changeset: 8d5c7386 Author: Andrew John Hughes Date: 2022-02-01 19:55:42 +0000 URL: https://git.openjdk.org/shenandoah-jdk8u/commit/8d5c7386c619a2602d9731c4adbbb1b01aeb449f Added tag jdk8u332-b01 for changeset b81aa0cb6267 ! .hgtags Changeset: 4618dfdd Author: Matthias Baesken Date: 2021-09-02 11:22:49 +0000 URL: https://git.openjdk.org/shenandoah-jdk8u/commit/4618dfdda5b1d8ac0afbf7c8d5a53fa7b431ab25 8273229: Update OS detection code to recognize Windows Server 2022 Reviewed-by: alanb, dholmes ! hotspot/src/os/windows/vm/os_windows.cpp ! jdk/src/windows/native/java/lang/java_props_md.c Changeset: 83fbd1c6 Author: Coleen Phillimore Date: 2021-11-22 18:08:13 +0000 URL: https://git.openjdk.org/shenandoah-jdk8u/commit/83fbd1c6d8a60400e1140aa0b0bd00a298af0b5d 8273341: Update Siphash to version 1.0 Reviewed-by: dholmes ! hotspot/src/share/vm/classfile/altHashing.cpp ! 
hotspot/src/share/vm/classfile/altHashing.hpp Changeset: 7812e1ac Author: Julia Boes Date: 2019-11-15 11:39:02 +0000 URL: https://git.openjdk.org/shenandoah-jdk8u/commit/7812e1ac0fda027cfe12290ac73cc05c64a89ead 8209178: Proxied HttpsURLConnection doesn't send BODY when retrying POST request Preserve BODY in poster output stream before sending CONNECT request Reviewed-by: bae ! jdk/src/share/classes/sun/net/www/http/HttpClient.java + jdk/test/sun/net/www/http/HttpClient/B8209178.java Changeset: f935c7be Author: Srikanth Adayapalam Date: 2015-11-11 18:46:03 +0000 URL: https://git.openjdk.org/shenandoah-jdk8u/commit/f935c7bef2e58ea681ca3170b7833e9b1cb6a23d 8141508: java.lang.invoke.LambdaConversionException: Invalid receiver type Incorrect handling of intersection type parameter of functional interface descriptor results in call site initialization exception Reviewed-by: mcimadamore ! langtools/src/share/classes/com/sun/tools/javac/comp/LambdaToMethod.java + langtools/test/tools/javac/lambda/methodReference/IntersectionTypeReceiverTest.java Changeset: aae25adc Author: Serguei Spitsyn Date: 2021-09-15 20:00:21 +0000 URL: https://git.openjdk.org/shenandoah-jdk8u/commit/aae25adc4dccef6c55cada1a70cca0ecf1a8b641 8273575: memory leak in appendBootClassPath(), paths must be deallocated Reviewed-by: dholmes, amenkov ! jdk/src/share/instrument/InvocationAdapter.c Changeset: c84adc4e Author: Andrew John Hughes Date: 2022-02-05 16:34:22 +0000 URL: https://git.openjdk.org/shenandoah-jdk8u/commit/c84adc4e7624f263cd06e2df19286bbc4ed82d41 Merge Changeset: f02bb443 Author: Andrew John Hughes Date: 2022-11-16 15:32:18 +0000 URL: https://git.openjdk.org/shenandoah-jdk8u/commit/f02bb443067363151d33fc64691846afb3292eb9 Merge jdk8u332-b02 From iris at openjdk.org Fri Dec 16 00:24:52 2022 From: iris at openjdk.org (Iris Clark) Date: Fri, 16 Dec 2022 00:24:52 GMT Subject: Withdrawn: Merge jdk8u:master In-Reply-To: References: Message-ID: On Wed, 14 Dec 2022 16:39:37 GMT, Andrew John Hughes wrote: > Merge jdk8u332-b02 This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/shenandoah-jdk8u/pull/7 From andrew at openjdk.org Fri Dec 16 00:24:50 2022 From: andrew at openjdk.org (Andrew John Hughes) Date: Fri, 16 Dec 2022 00:24:50 GMT Subject: RFR: Merge jdk8u:master [v2] In-Reply-To: References: Message-ID: <2UqT2UwDRAIB8V54lc4esImuMyTLba9Mq_9RW17hG_4=.119d5ccd-a6fb-49c5-96bc-b3549d5bec30@github.com> > Merge jdk8u332-b02 Andrew John Hughes has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
------------- Changes: - all: https://git.openjdk.org/shenandoah-jdk8u/pull/7/files - new: https://git.openjdk.org/shenandoah-jdk8u/pull/7/files/f02bb443..f02bb443 Webrevs: - full: https://webrevs.openjdk.org/?repo=shenandoah-jdk8u&pr=7&range=01 - incr: https://webrevs.openjdk.org/?repo=shenandoah-jdk8u&pr=7&range=00-01 Stats: 0 lines in 0 files changed: 0 ins; 0 del; 0 mod Patch: https://git.openjdk.org/shenandoah-jdk8u/pull/7.diff Fetch: git fetch https://git.openjdk.org/shenandoah-jdk8u pull/7/head:pull/7 PR: https://git.openjdk.org/shenandoah-jdk8u/pull/7 From andrew at openjdk.org Fri Dec 16 00:37:40 2022 From: andrew at openjdk.org (Andrew John Hughes) Date: Fri, 16 Dec 2022 00:37:40 GMT Subject: RFR: Merge jdk8u:master Message-ID: <7AiNiAh1EJtQcsRIhBIiBESeY4i2w22rAotd7dky4-g=.0c73f373-3815-49c6-a2ae-9fda87352e53@github.com> Merge jdk8u332-b03 ------------- Commit messages: - Merge jdk8u332-b03 - 8280060: The sun/rmi/server/Activation.java class use Thread.dumpStack() - 8037259: xerces update: xpointer update - 8210283: Support git as an SCM alternative in the build - Added tag jdk8u332-b02 for changeset 4eff168ecdd9 The merge commit only contains trivial merges, so no merge-specific webrevs have been generated. Changes: https://git.openjdk.org/shenandoah-jdk8u/pull/8/files Stats: 354 lines in 14 files changed: 238 ins; 27 del; 89 mod Patch: https://git.openjdk.org/shenandoah-jdk8u/pull/8.diff Fetch: git fetch https://git.openjdk.org/shenandoah-jdk8u pull/8/head:pull/8 PR: https://git.openjdk.org/shenandoah-jdk8u/pull/8 From ysr at openjdk.org Fri Dec 16 03:35:47 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Fri, 16 Dec 2022 03:35:47 GMT Subject: RFR: JDK-8298597 : HdrSeq: support for a merge() method Message-ID: Merge functionality on stats (distributions) was needed for the remembered set scan that I was using in some companion work. This PR implements a first cut at that, which is sufficient for our first (and only) use case. Unfortunately, for expediency, I am deferring work on decaying statistics, as a result of which users that want decaying statistics will get incorrect results. In the short term, before I open this draft for review, I'll: - [x] add tests - [x] ensure that if a merge action has been taken on a distribution, then any attempt to access a decayed statistic causes an error - [x] open a linked ticket to take care of the decayed statistics An important goal here was to have an API that would be efficient and correct. The API shape may change when we have considered how to handle decaying statistics. ------------- Commit messages: - Safety tests for decayed stats, until implemented. - gtest for merge. - Vanilla merge test for ShenandoahNumberSeq; needs to be extended some. - Changes based on experience with uses in RS scan stats. - Merge branch 'master' into stats_merge - More merge() implementation. - Interim checkin of code w/beginnings of merge() support. Some - First cut at merge. More changes to come. May not build yet. Changes: https://git.openjdk.org/shenandoah/pull/184/files Webrev: https://webrevs.openjdk.org/?repo=shenandoah&pr=184&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8298597 Stats: 227 lines in 5 files changed: 225 ins; 0 del; 2 mod Patch: https://git.openjdk.org/shenandoah/pull/184.diff Fetch: git fetch https://git.openjdk.org/shenandoah pull/184/head:pull/184 PR: https://git.openjdk.org/shenandoah/pull/184 From ysr at openjdk.org Fri Dec 16 03:35:51 2022 From: ysr at openjdk.org (Y.
Srinivas Ramakrishna) Date: Fri, 16 Dec 2022 03:35:51 GMT Subject: RFR: JDK-8298597 : HdrSeq: support for a merge() method In-Reply-To: References: Message-ID: On Thu, 15 Dec 2022 19:33:36 GMT, Y. Srinivas Ramakrishna wrote: > Merge functionality on stats (distributions) was needed for the remembered set scan that I was using in some companion work. This PR implements a first cut at that, which is sufficient for our first (and only) use case. > > Unfortunately, for expediency, I am deferring work on decaying statistics, as a result of which users that want decaying statistics will get incorrect results. > > In the short term, before I open this draft for review, I'll: > > - [x] add tests > - [x] ensure that if a merge action has been taken on a distribution, then any attempt to access a decayed statistic causes an error > - [x] open a linked ticket to take care of the decayed statistics > > An important goal here was to have an API that would be efficient and correct. The API shape may change when we have considered how to handle decaying statistics. Will leave these comments here in the draft PR, until the last two steps are completed and the PR opened for formal review. This PR is open for review. src/hotspot/share/gc/shenandoah/shenandoahNumberSeq.cpp line 59: > 57: if (v > 0) { > 58: mag = 0; > 59: while (v > 1) { This is a bug fix that has been independently pushed to tip. You can ignore it and it'll find its way into shenandoah in due course. src/hotspot/share/gc/shenandoah/shenandoahNumberSeq.hpp line 55: > 53: > 54: // Merge this HdrSeq into hdr2, optionally clearing this HdrSeq > 55: void merge(HdrSeq& hdr2, bool clear_this = true); The default setting here is based on the way its only current client (RS scan instrumentation) makes use of it, but I am happy to change it if reviewers feel that might be better for API hygiene reasons. src/hotspot/share/utilities/numberSeq.cpp line 124: > 122: // Decaying stats need a bit more thought > 123: assert(abs2._alpha == _alpha, "Caution: merge incompatible?"); > 124: // guarantee(false, "NYI"); This will expand into setting some breadcrumbs in `abs2`, such that any attempt to query a decayed stat from the object will result in an error. ------------- PR: https://git.openjdk.org/shenandoah/pull/184 From ysr at openjdk.org Fri Dec 16 03:40:33 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Fri, 16 Dec 2022 03:40:33 GMT Subject: RFR: JDK-8298597 : HdrSeq: support for a merge() method In-Reply-To: References: Message-ID: On Thu, 15 Dec 2022 19:33:36 GMT, Y. Srinivas Ramakrishna wrote: > Merge functionality on stats (distributions) was needed for the remembered set scan that I was using in some companion work. This PR implements a first cut at that, which is sufficient for our first (and only) use case. > > Unfortunately, for expediency, I am deferring work on decaying statistics, as a result of which users that want decaying statistics will get NaNs instead (or trigger guarantees). > > In the short term, before I open this draft for review, I'll: > > - [x] add tests > - [x] ensure that if a merge action has been taken on a distribution, then any attempt to access a decayed statistic causes an error > - [x] open a linked ticket to take care of the decayed statistics > > An important goal here was to have an API that would be efficient and correct. The API shape may change when we have considered how to handle decaying statistics.
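To clarify the merge semantics under review here: for plain (non-decaying) statistics the merge is exact, since histogram buckets and the count/sum/sum-of-squares accumulators simply add, whereas exponentially decayed statistics have no exact closed-form combination, which is why they are deferred to the linked ticket. An illustrative sketch with assumed field names (not HdrSeq's real internals):

```c++
#include <cstddef>

// Illustrative histogram-backed sequence; HdrSeq's actual layout differs.
struct SeqSketch {
  static const int BUCKETS = 64;
  size_t _hist[BUCKETS] = {};   // value histogram
  int    _num = 0;              // number of samples
  double _sum = 0.0, _sum_of_squares = 0.0;

  // Merge 'this' into 'other', optionally clearing 'this' (mirroring the
  // merge(HdrSeq&, bool clear_this = true) signature quoted above).
  void merge_into(SeqSketch& other, bool clear_this = true) {
    for (int i = 0; i < BUCKETS; i++) other._hist[i] += _hist[i];
    other._num            += _num;   // count, sum and sum-of-squares add
    other._sum            += _sum;   // exactly, so avg/variance stay exact
    other._sum_of_squares += _sum_of_squares;
    // Decayed (EWMA-style) statistics are intentionally NOT merged: two
    // independently decayed averages have no exact combination, so the PR
    // guards those accessors after a merge instead.
    if (clear_this) *this = SeqSketch();
  }
};
```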
test/hotspot/gtest/gc/shenandoah/test_shenandoahNumberSeq.cpp line 1: > 1: /* An earlier version of this test is in tip for an earlier bug fix. I am happy to consult if there is any confusion during a merge from tip. In this specific case, the contents of this file should take precedence. ------------- PR: https://git.openjdk.org/shenandoah/pull/184 From ysr at openjdk.org Fri Dec 16 03:56:17 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Fri, 16 Dec 2022 03:56:17 GMT Subject: RFR: JDK-8298597 : HdrSeq: support for a merge() method [v2] In-Reply-To: References: Message-ID: > Merge functionality on stats (distributions) was needed for the remembered set scan that I was using in some companion work. This PR implements a first cut at that, which is sufficient for our first (and only) use case. > > Unfortunately, for expediency, I am deferring work on decaying statistics, as a result of which users that want decaying statistics will get NaNs instead (or trigger guarantees). > > In the short term, before I open this draft for review, I'll: > > - [x] add tests > - [x] ensure that if a merge action has been taken on a distribution, then any attempt to access a decayed statistic causes an error > - [x] open a linked ticket to take care of the decayed statistics > > An important goal here was to have an API that would be efficient and correct. The API shape may change when we have considered how to handle decaying statistics. Y. Srinivas Ramakrishna has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains ten additional commits since the last revision: - Merge branch 'master' into stats_merge - Safety tests for decayed stats, until implemented. - gtest for merge. - Vanilla merge test for ShenandoahNumberSeq; needs to be extended some. - Changes based on experience with uses in RS scan stats. Fixed some bugs. -- We still need to implement a few vanilla tests for the merge method. -- Planning to defer the work on decayed stats (which will be delivered separately in a lower-priority sibling ticket) - Merge branch 'master' into stats_merge - More merge() implementation. -- Need to think about merge of decaying stats in AbsSeq. -- Need to add tests. - Interim checkin of code w/beginnings of merge() support. Some implementations are still stubbed out and need to be written. - First cut at merge. More changes to come. May not build yet. ------------- Changes: - all: https://git.openjdk.org/shenandoah/pull/184/files - new: https://git.openjdk.org/shenandoah/pull/184/files/f2263402..06c7e983 Webrevs: - full: https://webrevs.openjdk.org/?repo=shenandoah&pr=184&range=01 - incr: https://webrevs.openjdk.org/?repo=shenandoah&pr=184&range=00-01 Stats: 81449 lines in 1217 files changed: 38340 ins; 35581 del; 7528 mod Patch: https://git.openjdk.org/shenandoah/pull/184.diff Fetch: git fetch https://git.openjdk.org/shenandoah pull/184/head:pull/184 PR: https://git.openjdk.org/shenandoah/pull/184 From ysr at openjdk.org Fri Dec 16 04:18:14 2022 From: ysr at openjdk.org (Y.
Srinivas Ramakrishna) Date: Fri, 16 Dec 2022 04:18:14 GMT Subject: RFR: JDK-8297796 GenShen: instrument the remembered set scan [v11] In-Reply-To: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com> References: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com> Message-ID: <6zlIIGLrIdEUYC9JCBaDBPxCxgYYNKJtlsolG2BY6VI=.97d1e26b-502b-46bc-a40b-2b77d111f4fa@github.com> > **Note:** > This pull request is a draft to share the diffs with the project team. The following additional work is planned before this is ready to commit. (Thanks to Kevin, Roman, William etc. for feedback & suggestions.) > > (1) Collect performance data from SpecJBB and from the pipeline to assess the impact of instrumentation on concurrent remembered set scanning and concurrent update refs phase durations, in addition to the existing data from Extremem mentioned in the ticket. > > (2) Make available the instrumentation only in non-product (optimized) mode until better performance is achieved. > > (3) Any improvements that come from further feedback on this draft (e.g. better or different logging of the metrics data), or other suggestions that I may have missed mentioning above. > > The fix to ShenandoahNumberSeq will be separated out and made into a separate pull request on mainline. > > **Summary:** > The main change is card stats collection during RS scanning. The code is protected by a new diagnostic flag `ShenandoahEnableCardStats`, which is off by default. With the flag disabled there is a small performance impact (measured with extremem; more data will be collected, see above). With the flag enabled there is a larger performance impact because of the large number of clusters, with shared stats updates at the end of each cluster processed. Since we expect the loops in process_clusters() to change in the near future, informed by the learnings from these stats, we expect to work further on reducing the cost of the stats collection as well. Currently the stats are logged per thread at the end of each RS scan. I'm happy to refine both the stats that we collect as well as how frequently we log the data once we have gathered some experience on how we use this. > > **Details of files changed:** > > 1. shenandoahGeneration.cpp: add a call to log info at the end of remembered set scan when card stats are enabled. > 2. shenandoahHeap.cpp: minor restructuring of a loop for task claiming during update refs; introduce a worker id option to downstream code for card stats > 3. shenandoahNumberSeq.cpp: fix a minor issue with a boundary condition check in code that tries to find the right bucket to increment. This was triggering an assert in the update code. > 4. shenandoahNumberSeq.hpp: provide missing allocation spec for BinaryMagnitudeSeq > 5. shenandoahScanRemembered.cpp: new class ShenandoahCardStats methods. Minor restructure of loop for task claiming during RS scanning (akin to the one for update refs in 2 above). > 6. shenandoahScanRemembered.hpp: Diff looks large because of git-diff'ism having issues with indentation change in restructured if-else branches. Not sure how to make the diffs more easily readable. Updated some documentation comments that were slightly obsolete. New class ShenandoahCardStats and implementation of inline methods. Class ShenandoahScanRemembered keeps cumulative running histograms. Remove some inline declarations for larger methods that we shouldn't force inlining on. Update some old comments. > 7.
shenandoahScanRemembered.inline.hpp: As in 6, diff looks larger than it should because of the same indentation change. ShenandoahScanRemembered::process_clusters() is the method where the instrumentation probes have been inserted. A couple of variables were renamed for clarity, as well as to update local variables rather than method arguments. The large diffs at (old) line 589 onwards are the git-diff'ism to do with indentation change. Delete some unused methods. > 8. shenandoah_globals.hpp: new diagnostic flag `ShenandoahEnableCardStats` protects the stats collection code and is disabled by default. > > **Format of stats produced and how to interpret them: (sample)** > > > [1211.515s][info][gc,task ] GC(7069) Using 10 of 20 workers for Concurrent remembered set scanning > [1211.529s][info][gc,remset ] GC(7069) Worker 0 Card Stats Histo: > [1211.529s][info][gc,remset ] GC(7069) dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_run: [ 0.00 0.00 0.00 99.61 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) dirty_cards: [ 0.00 0.00 0.00 0.00 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_cards: [ 0.00 99.61 99.61 99.61 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) max_dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) max_clean_run: [ 0.00 99.61 99.61 99.61 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) dirty_objs: [ 0.00 0.00 0.00 0.00 1245.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_objs: [ 0.00 0.00 0.00 0.00 64.00 ] > [1211.529s][info][gc,remset ] GC(7069) dirty_scans: [ 0.00 0.00 0.00 0.00 1157.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_scans: [ 0.00 0.00 0.00 0.00 17.00 ] > [1211.529s][info][gc,remset ] GC(7069) alternations: [ 0.00 0.00 0.00 0.00 39.00 ] > [1211.529s][info][gc,remset ] GC(7069) Worker 1 Card Stats Histo: > [1211.529s][info][gc,remset ] GC(7069) dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_run: [ 0.00 0.00 0.00 99.61 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) dirty_cards: [ 0.00 0.00 0.00 0.00 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_cards: [ 0.00 99.61 99.61 99.61 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) max_dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) max_clean_run: [ 0.00 99.61 99.61 99.61 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) dirty_objs: [ 0.00 0.00 0.00 0.00 1257.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_objs: [ 0.00 0.00 0.00 0.00 64.00 ] > [1211.529s][info][gc,remset ] GC(7069) dirty_scans: [ 0.00 0.00 0.00 0.00 1197.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_scans: [ 0.00 0.00 0.00 0.00 17.00 ] > [1211.529s][info][gc,remset ] GC(7069) alternations: [ 0.00 0.00 0.00 0.00 39.00 ] > ... > > > The rows represent the metric that's being tracked, and the columns are, respectively, minimum, the 3 quartiles (25%, 50%, 75%) and the maximum. The metrics are: > > - dirty_run: the length of an uninterrupted run of dirty cards, interpreted as a percentage of a chunk of work assignment (cluster) processed by a thread > - clean_run: as above, but the length of an uninterrupted run of clean cards > - dirty_cards, clean_cards: as above, but counts of cards as a percentage of chunk > - max_dirty_run & max_clean_run: Similarly for the maximum of each.
> - dirty_objs, clean_objs: these are numbers of objects in any chunk walked, or scanned > - dirty_scans, clean_scans: numbers of objects scanned by the closure > - alternations: the number of times that we transitioned from clean to dirty or dirty to clean in a chunk > > The data above indicates that at least 75% of the chunks have no alternations at all, > and cards are almost always mostly clean for this specific benchmark config (extremem). > > Comparing worker stats from worker 0 and worker 9 indicates very little difference between > their statistics, as one might typically expect for well-balanced RS scans. > > **Questions:** > > 1. Would it make sense to print also, for example, the 1, 10, 90 and 99 percentiles for these metrics as well, in addition to the quartiles? > 2. The distributions are per worker for the cumulative history of the run. Would data per RS scan or per Refs Update phase provide more useful information? > 3. Any suggestions for a more easily consumable format? > 4. I welcome any other feedback on the pull request. Y. Srinivas Ramakrishna has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 50 commits: - Merge branch 'master' into JVM-1264 - Merge branch 'master' into JVM-1264 - Tested and fixed some bugs; printing frequency of cumulative stats controlled by command-line option. Ready for review. - Remove stubs (guarantee/false). This still needs formal work on merging decayed stats, but is OK to ignore for now because no one currently uses the decayed stats. The non-decayed stats also need further review and correction. So this is still an interim checkin. To do: -- print final summary at exit; consider if periodic cumulative summary might be useful as well (Every major collection cycles?) -- check correctness of merged data (ignoring decayed statistics for now) - Merge branch 'stats_merge' into JVM-1264 - More merge() implementation. -- Need to think about merge of decaying stats in AbsSeq. -- Need to add tests. - Interim checkin of code w/beginnings of merge() support. Some implementations are still stubbed out and need to be written. - First cut at merge. More changes to come. May not build yet. - jcheck clean - Cumulative card stats separated out for scan_rs and update_refs phases; merge of per-worker stats into phase-specific cumulative stats stubbed out for now until HdrSeq::merge() is done. - ... 
and 40 more: https://git.openjdk.org/shenandoah/compare/3901a719...616547d6 ------------- Changes: https://git.openjdk.org/shenandoah/pull/176/files Webrev: https://webrevs.openjdk.org/?repo=shenandoah&pr=176&range=10 Stats: 948 lines in 12 files changed: 578 ins; 204 del; 166 mod Patch: https://git.openjdk.org/shenandoah/pull/176.diff Fetch: git fetch https://git.openjdk.org/shenandoah pull/176/head:pull/176 PR: https://git.openjdk.org/shenandoah/pull/176 From wkemper at openjdk.org Fri Dec 16 18:26:21 2022 From: wkemper at openjdk.org (William Kemper) Date: Fri, 16 Dec 2022 18:26:21 GMT Subject: RFR: Use CardTable::card_size_in_words rather than hard coded constant Message-ID: Calculation assumed 64 words per card, which does not hold on platforms with 32-bit words ------------- Commit messages: - Use CardTable::card_size_in_words rather than hard coded constant - Add more detail to assertion message Changes: https://git.openjdk.org/shenandoah/pull/186/files Webrev: https://webrevs.openjdk.org/?repo=shenandoah&pr=186&range=00 Stats: 17 lines in 2 files changed: 6 ins; 4 del; 7 mod Patch: https://git.openjdk.org/shenandoah/pull/186.diff Fetch: git fetch https://git.openjdk.org/shenandoah pull/186/head:pull/186 PR: https://git.openjdk.org/shenandoah/pull/186 From ysr at openjdk.org Fri Dec 16 19:07:28 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Fri, 16 Dec 2022 19:07:28 GMT Subject: RFR: [Redirect to new dependent version, see below: https://github.com/ysramakrishna/shenandoah/pull/1 ] [v11] In-Reply-To: <6zlIIGLrIdEUYC9JCBaDBPxCxgYYNKJtlsolG2BY6VI=.97d1e26b-502b-46bc-a40b-2b77d111f4fa@github.com> References: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com> <6zlIIGLrIdEUYC9JCBaDBPxCxgYYNKJtlsolG2BY6VI=.97d1e26b-502b-46bc-a40b-2b77d111f4fa@github.com> Message-ID: On Fri, 16 Dec 2022 04:18:14 GMT, Y. Srinivas Ramakrishna wrote: >> **Note:** >> This pull request is a draft to share the diffs with the project team. The following additional work is planned before this is ready to commit. (Thanks to Kevin, Roman, William etc. for feedback & suggestions.) >> >> (1) Collect performance data from SpecJBB and from the pipeline to assess the impact of instrumentation on concurrent remembered set scanning and concurrent update refs phase durations, in addition to the existing data from Extremem mentioned in the ticket. >> >> (2) Make available the instrumentation only in non-product (optimized) mode until better performance is achieved. >> >> (3) Any improvements that come from further feedback on this draft (e.g. better or different logging of the metrics data), or other suggestions that I may have missed mentioning above. >> >> The fix to ShenandoahNumberSeq will be separated out and made into a separate pull request on mainline. >> >> **Summary:** >> The main change is card stats collection during RS scanning. The code is protected by a new diagnostic flag `ShenandoahEnableCardStats`, which is off by default. With the flag disabled there is a small performance impact (measured with extremem; more data will be collected, see above). With the flag enabled there is a larger performance impact because of the large number of clusters, with shared stats updates at the end of each cluster processed. Since we expect the loops in process_clusters() to change in the near future, informed by the learnings from these stats, we expect to work further on reducing the cost of the stats collection as well.
Currently the stats are logged per thread at the end of each RS scan. I'm happy to refine both the stats that we collect as well as how frequently we log the data once we have gathered some experience on how we use this. >> >> **Details of files changed:** >> >> 1. shenandoahGeneration.cpp: add a call to log info at the end of remembered set scan when card stats are enabled. >> 2. shenandoahHeap.cpp: minor restructuring of a loop for task claiming during update refs; introduce a worker id option to downstream code for card stats >> 3. shenandoahNumberSeq.cpp: fix a minor issue with a boundary condition check in code that tries to find the right bucket to increment. This was triggering an assert in the update code. >> 4. shenandoahNumberSeq.hpp: provide missing allocation spec for BinaryMagnitudeSeq >> 5. shenandoahScanRemembered.cpp: new class ShenandoahCardStats methods. Minor restructure of loop for task claiming during RS scanning (akin to the one for update refs in 2 above). >> 6. shenandoahScanRemembered.hpp: Diff looks large because of git-diff'ism having issues with indentation change in restructured if-else branches. Not sure how to make the diffs more easily readable. Updated some documentation comments that were slightly obsolete. New class ShenandoahCardStats and implementation of inline methods. Class ShenandoahScanRemembered keeps cumulative running histograms. Remove some inline declarations for larger methods that we shouldn't force inlining on. Update some old comments. >> 7. shenandoahScanRemembered.inline.hpp: As in 6, diff looks larger than it should because of the same indentation change. ShenandoahScanRemembered::process_clusters() is the method where the instrumentation probes have been inserted. A couple of variables were renamed for clarity, as well as to update local variables rather than method arguments. The large diffs at (old) line 589 onwards are the git-diff'ism to do with indentation change. Delete some unused methods. >> 8. shenandoah_globals.hpp: new diagnostic flag `ShenandoahEnableCardStats` protects the stats collection code and is disabled by default.
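As an aside for readers following the card-stats discussion: the per-cluster quantities tracked here (dirty/clean card counts, maximum run lengths, alternations) can be computed in a single pass over a cluster's card states, as the sample output and metric definitions quoted next illustrate. The following self-contained sketch is illustrative only; `Card`, `ClusterStats`, and `scan_cluster` are invented names and not the actual `ShenandoahCardStats` code, which additionally normalizes the counts to percentages of the cluster.

#include <cstdint>
#include <cstdio>

// Illustrative only: single-pass computation of per-cluster card stats
// over a toy card array. Not the real process_clusters() instrumentation.
enum Card : uint8_t { CLEAN = 0, DIRTY = 1 };

struct ClusterStats {
  int dirty_cards = 0, clean_cards = 0;
  int max_dirty_run = 0, max_clean_run = 0;
  int alternations = 0;
};

ClusterStats scan_cluster(const Card* cards, int n) {
  ClusterStats s;
  int run = 0;
  for (int i = 0; i < n; i++) {
    if (cards[i] == DIRTY) s.dirty_cards++; else s.clean_cards++;
    if (i > 0 && cards[i] != cards[i - 1]) {
      s.alternations++;  // a clean<->dirty transition ends the current run
      run = 0;
    }
    run++;
    if (cards[i] == DIRTY) { if (run > s.max_dirty_run) s.max_dirty_run = run; }
    else                   { if (run > s.max_clean_run) s.max_clean_run = run; }
  }
  return s;
}

int main() {
  Card c[8] = {DIRTY, DIRTY, CLEAN, CLEAN, CLEAN, DIRTY, CLEAN, CLEAN};
  ClusterStats s = scan_cluster(c, 8);
  printf("dirty=%d clean=%d max_dirty_run=%d max_clean_run=%d alt=%d\n",
         s.dirty_cards, s.clean_cards, s.max_dirty_run, s.max_clean_run,
         s.alternations);  // dirty=3 clean=5 max_dirty_run=2 max_clean_run=3 alt=3
  return 0;
}

In the real collector the per-cluster numbers feed per-worker histograms, which is where the HdrSeq::merge() support discussed in the companion thread comes in.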
>> >> **Format of stats produced and how to interpret them: (sample)** >> >> >> [1211.515s][info][gc,task ] GC(7069) Using 10 of 20 workers for Concurrent remembered set scanning >> [1211.529s][info][gc,remset ] GC(7069) Worker 0 Card Stats Histo: >> [1211.529s][info][gc,remset ] GC(7069) dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) clean_run: [ 0.00 0.00 0.00 99.61 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) dirty_cards: [ 0.00 0.00 0.00 0.00 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) clean_cards: [ 0.00 99.61 99.61 99.61 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) max_dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) max_clean_run: [ 0.00 99.61 99.61 99.61 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) dirty_objs: [ 0.00 0.00 0.00 0.00 1245.00 ] >> [1211.529s][info][gc,remset ] GC(7069) clean_objs: [ 0.00 0.00 0.00 0.00 64.00 ] >> [1211.529s][info][gc,remset ] GC(7069) dirty_scans: [ 0.00 0.00 0.00 0.00 1157.00 ] >> [1211.529s][info][gc,remset ] GC(7069) clean_scans: [ 0.00 0.00 0.00 0.00 17.00 ] >> [1211.529s][info][gc,remset ] GC(7069) alternations: [ 0.00 0.00 0.00 0.00 39.00 ] >> [1211.529s][info][gc,remset ] GC(7069) Worker 1 Card Stats Histo: >> [1211.529s][info][gc,remset ] GC(7069) dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) clean_run: [ 0.00 0.00 0.00 99.61 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) dirty_cards: [ 0.00 0.00 0.00 0.00 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) clean_cards: [ 0.00 99.61 99.61 99.61 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) max_dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) max_clean_run: [ 0.00 99.61 99.61 99.61 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) dirty_objs: [ 0.00 0.00 0.00 0.00 1257.00 ] >> [1211.529s][info][gc,remset ] GC(7069) clean_objs: [ 0.00 0.00 0.00 0.00 64.00 ] >> [1211.529s][info][gc,remset ] GC(7069) dirty_scans: [ 0.00 0.00 0.00 0.00 1197.00 ] >> [1211.529s][info][gc,remset ] GC(7069) clean_scans: [ 0.00 0.00 0.00 0.00 17.00 ] >> [1211.529s][info][gc,remset ] GC(7069) alternations: [ 0.00 0.00 0.00 0.00 39.00 ] >> ... >> >> >> The rows represent the metric that's being tracked, and the columns are, respectively, minimum, the 3 quartiles (25%, 50%, 75%) and the maximum. The metrics are: >> >> - dirty_run: the length of an uninterrupted run of dirty cards, interpreted as a percentage of a chunk of work assignment (cluster) processed by a thread >> - clean_run: as above, but the length of an uninterrupted run of clean cards >> - dirty_cards, clean_cards: as above, but counts of cards as a percentage of chunk >> - max_dirty_run & max_clean_run: Similarly for the maximum of each. >> - dirty_objs, clean_objs: these are numbers of objects in any chunk walked, or scanned >> - dirty_scans, clean_scans: numbers of objects scanned by the closure >> - alternations: the number of times that we transitioned from clean to dirty or dirty to clean in a chunk >> >> The data above indicates that at least 75% of the chunks have no alternations at all, >> and cards are almost always mostly clean for this specific benchmark config (extremem). >> >> Comparing worker stats from worker 0 and worker 9 indicates very little difference between >> their statistics, as one might typically expect for well-balanced RS scans. >> >> **Questions:** >> >> 1.
Would it make sense to print also, for example, the 1, 10, 90 and 99 percentiles for these metrics as well, in addition to the quartiles? >> 2. The distributions are per worker for the cumulative history of the run. Would data per RS scan or per Refs Update phase provide more useful information? >> 3. Any suggestions for a more easily consumable format? >> 4. I welcome any other feedback on the pull request. > > Y. Srinivas Ramakrishna has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 50 commits: > > - Merge branch 'master' into JVM-1264 > - Merge branch 'master' into JVM-1264 > - Tested and fixed some bugs; printing frequency of cumulative stats controlled by > command-line option. > > Ready for review. > - Remove stubs (guarantee/false). > This still needs formal work on merging decayed stats, but is OK to > ignore for now because no one currently uses the decayed stats. The > non-decayed stats also need further review and correction. So this is > still an interim checkin. > > To do: > -- print final summary at exit; consider if periodic cumulative summary might be > useful as well (Every major collection cycles?) > -- check correctness of merged data (ignoring decayed statistics for > now) > - Merge branch 'stats_merge' into JVM-1264 > - More merge() implementation. > -- Need to think about merge of decaying stats in AbsSeq. > -- Need to add tests. > - Interim checkin of code w/beginnings of merge() support. Some > implementations are still stubbed out and need to be written. > - First cut at merge. More changes to come. May not build yet. > - jcheck clean > - Cumulative card stats separated out for scan_rs and update_refs phases; > merge of per-worker stats into phase-specific cumulative stats stubbed > out for now until HdrSeq::merge() is done. > - ... and 40 more: https://git.openjdk.org/shenandoah/compare/3901a719...616547d6 Redirect to a version of these changes that is published as a dependent PR: https://github.com/ysramakrishna/shenandoah/pull/1 ------------- PR: https://git.openjdk.org/shenandoah/pull/176 From wkemper at openjdk.org Fri Dec 16 19:44:58 2022 From: wkemper at openjdk.org (William Kemper) Date: Fri, 16 Dec 2022 19:44:58 GMT Subject: RFR: Use CardTable::card_size_in_words rather than hard coded constant [v2] In-Reply-To: References: Message-ID: > Calculation assumed 64 words per card, which does not hold on platforms with 32-bit words William Kemper has updated the pull request incrementally with one additional commit since the last revision: Fix warning ------------- Changes: - all: https://git.openjdk.org/shenandoah/pull/186/files - new: https://git.openjdk.org/shenandoah/pull/186/files/0d87f14a..521c1491 Webrevs: - full: https://webrevs.openjdk.org/?repo=shenandoah&pr=186&range=01 - incr: https://webrevs.openjdk.org/?repo=shenandoah&pr=186&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/shenandoah/pull/186.diff Fetch: git fetch https://git.openjdk.org/shenandoah pull/186/head:pull/186 PR: https://git.openjdk.org/shenandoah/pull/186 From ysr at openjdk.org Fri Dec 16 20:05:24 2022 From: ysr at openjdk.org (Y.
Srinivas Ramakrishna) Date: Fri, 16 Dec 2022 20:05:24 GMT Subject: Withdrawn: [Redirect to new dependent version, see below: https://github.com/ysramakrishna/shenandoah/pull/1 ] In-Reply-To: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com> References: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com> Message-ID: On Thu, 1 Dec 2022 19:55:45 GMT, Y. Srinivas Ramakrishna wrote: > **Note:** > This pull request is a draft to share the diffs with the project team. The following additional work is planned before this is ready to commit. (Thanks to Kevin, Roman, William etc. for feedback & suggestions.) > > (1) Collect performance data from SpecJBB and from the pipeline to assess the impact of instrumentation on concurrent remembered set scanning and concurrent update refs phase durations, in addition to the existing data from Extremem mentioned in the ticket. > > (2) Make available the instrumentation only in non-product (optimized) mode until better performance is achieved. > > (3) Any improvements that come from further feedback on this draft (e.g. better or different logging of the metrics data), or other suggestions that I may have missed mentioning above. > > The fix to ShenandoahNumberSeq will be separated out and made into a separate pull request on mainline. > > **Summary:** > The main change is card stats collection during RS scanning. The code is protected by a new diagnostic flag `ShenandoahEnableCardStats`, which is off by default. With the flag disabled there is a small performance impact (measured with extremem; more data will be collected, see above). With the flag enabled there is a larger performance impact because of the large number of clusters, with shared stats updates at the end of each cluster processed. Since we expect the loops in process_clusters() to change in the near future, informed by the learnings from these stats, we expect to work further on reducing the cost of the stats collection as well. Currently the stats are logged per thread at the end of each RS scan. I'm happy to refine both the stats that we collect as well as how frequently we log the data once we have gathered some experience on how we use this. > > **Details of files changed:** > > 1. shenandoahGeneration.cpp: add a call to log info at the end of remembered set scan when card stats are enabled. > 2. shenandoahHeap.cpp: minor restructuring of a loop for task claiming during update refs; introduce a worker id option to downstream code for card stats > 3. shenandoahNumberSeq.cpp: fix a minor issue with a boundary condition check in code that tries to find the right bucket to increment. This was triggering an assert in the update code. > 4. shenandoahNumberSeq.hpp: provide missing allocation spec for BinaryMagnitudeSeq > 5. shenandoahScanRemembered.cpp: new class ShenandoahCardStats methods. Minor restructure of loop for task claiming during RS scanning (akin to the one for update refs in 2 above). > 6. shenandoahScanRemembered.hpp: Diff looks large because of git-diff'ism having issues with indentation change in restructured if-else branches. Not sure how to make the diffs more easily readable. Updated some documentation comments that were slightly obsolete. New class ShenandoahCardStats and implementation of inline methods. Class ShenandoahScanRemembered keeps cumulative running histograms. Remove some inline declarations for larger methods that we shouldn't force inlining on. Update some old comments. > 7.
shenandoahScanRemembered.inline.hpp: As in 6, diff looks larger than it should because of the same indentation change. ShenandoahScanRemembered::process_clusters() is the method where the instrumentation probes have been inserted. A couple of variables were renamed for clarity, as well as to update local variables rather than method arguments. The large diffs at (old) line 589 onwards are the git-diff'ism to do with indentation change. Delete some unused methods. > 8. shenandoah_globals.hpp: new diagnostic flag `ShenandoahEnableCardStats` protects the stats collection code and is disabled by default. > > **Format of stats produced and how to interpret them: (sample)** > > > [1211.515s][info][gc,task ] GC(7069) Using 10 of 20 workers for Concurrent remembered set scanning > [1211.529s][info][gc,remset ] GC(7069) Worker 0 Card Stats Histo: > [1211.529s][info][gc,remset ] GC(7069) dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_run: [ 0.00 0.00 0.00 99.61 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) dirty_cards: [ 0.00 0.00 0.00 0.00 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_cards: [ 0.00 99.61 99.61 99.61 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) max_dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) max_clean_run: [ 0.00 99.61 99.61 99.61 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) dirty_objs: [ 0.00 0.00 0.00 0.00 1245.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_objs: [ 0.00 0.00 0.00 0.00 64.00 ] > [1211.529s][info][gc,remset ] GC(7069) dirty_scans: [ 0.00 0.00 0.00 0.00 1157.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_scans: [ 0.00 0.00 0.00 0.00 17.00 ] > [1211.529s][info][gc,remset ] GC(7069) alternations: [ 0.00 0.00 0.00 0.00 39.00 ] > [1211.529s][info][gc,remset ] GC(7069) Worker 1 Card Stats Histo: > [1211.529s][info][gc,remset ] GC(7069) dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_run: [ 0.00 0.00 0.00 99.61 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) dirty_cards: [ 0.00 0.00 0.00 0.00 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_cards: [ 0.00 99.61 99.61 99.61 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) max_dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) max_clean_run: [ 0.00 99.61 99.61 99.61 100.00 ] > [1211.529s][info][gc,remset ] GC(7069) dirty_objs: [ 0.00 0.00 0.00 0.00 1257.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_objs: [ 0.00 0.00 0.00 0.00 64.00 ] > [1211.529s][info][gc,remset ] GC(7069) dirty_scans: [ 0.00 0.00 0.00 0.00 1197.00 ] > [1211.529s][info][gc,remset ] GC(7069) clean_scans: [ 0.00 0.00 0.00 0.00 17.00 ] > [1211.529s][info][gc,remset ] GC(7069) alternations: [ 0.00 0.00 0.00 0.00 39.00 ] > ... > > > The rows represent the metric that's being tracked, and the columns are, respectively, minimum, the 3 quartiles (25%, 50%, 75%) and the maximum. The metrics are: > > - dirty_run: the length of an uninterrupted run of dirty cards, interpreted as a percentage of a chunk of work assignment (cluster) processed by a thread > - clean_run: as above, but the length of an uninterrupted run of clean cards > - dirty_cards, clean_cards: as above, but counts of cards as a percentage of chunk > - max_dirty_run & max_clean_run: Similarly for the maximum of each.
> - dirty_objs, clean_objs: these are numbers of objects in any chunk walked, or scanned > - dirty_scans, clean_scans: numbers of objects scanned by the closure > - alternations: the number of times that we transitioned from clean to dirty or dirty to clean in a chunk > > The data above indicates that at least 75% of the chunks have no alternations at all, > and cards are almost always mostly clean for this specific benchmark config (extremem). > > Comparing worker stats from worker 0 and worker 9 indicates very little difference between > their statistics, as one might typically expect for well-balanced RS scans. > > **Questions:** > > 1. Would it make sense to print also, for example, the 1, 10, 90 and 99 percentiles for these metrics as well, in addition to the quartiles? > 2. The distributions are per worker for the cumulative history of the run. Would data per RS scan or per Refs Update phase provide more useful information? > 3. Any suggestions for a more easily consumable format? > 4. I welcome any other feedback on the pull request. This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/shenandoah/pull/176 From ysr at openjdk.org Fri Dec 16 20:05:24 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Fri, 16 Dec 2022 20:05:24 GMT Subject: RFR: [Redirect to new dependent version, see below: https://github.com/ysramakrishna/shenandoah/pull/1 ] [v11] In-Reply-To: <6zlIIGLrIdEUYC9JCBaDBPxCxgYYNKJtlsolG2BY6VI=.97d1e26b-502b-46bc-a40b-2b77d111f4fa@github.com> References: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com> <6zlIIGLrIdEUYC9JCBaDBPxCxgYYNKJtlsolG2BY6VI=.97d1e26b-502b-46bc-a40b-2b77d111f4fa@github.com> Message-ID: On Fri, 16 Dec 2022 04:18:14 GMT, Y. Srinivas Ramakrishna wrote: >> **Note:** >> This pull request is a draft to share the diffs with the project team. The following additional work is planned before this is ready to commit. (Thanks to Kevin, Roman, William etc. for feedback & suggestions.) >> >> (1) Collect performance data from SpecJBB and from the pipeline to assess the impact of instrumentation on concurrent remembered set scanning and concurrent update refs phase durations, in addition to the existing data from Extremem mentioned in the ticket. >> >> (2) Make available the instrumentation only in non-product (optimized) mode until better performance is achieved. >> >> (3) Any improvements that come from further feedback on this draft (e.g. better or different logging of the metrics data), or other suggestions that I may have missed mentioning above. >> >> The fix to ShenandoahNumberSeq will be separated out and made into a separate pull request on mainline. >> >> **Summary:** >> The main change is card stats collection during RS scanning. The code is protected by a new diagnostic flag `ShenandoahEnableCardStats`, which is off by default. With the flag disabled there is a small performance impact (measured with extremem; more data will be collected, see above). With the flag enabled there is a larger performance impact because of the large number of clusters, with shared stats updates at the end of each cluster processed. Since we expect the loops in process_clusters() to change in the near future, informed by the learnings from these stats, we expect to work further on reducing the cost of the stats collection as well. Currently the stats are logged per thread at the end of each RS scan.
I'm happy to refine both the stats that we collect as well as how frequently we log the data once we have gathered some experience on how we use this. >> >> **Details of files changed:** >> >> 1. shenandoahGeneration.cpp: add a call to log info at the end of remembered set scan when card stats are enabled. >> 2. shenandoahHeap.cpp: minor restructuring of a loop for task claiming during update refs; introduce a worker id option to downstream code for card stats >> 3. shenandoahNumberSeq.cpp: fix a minor issue with a boundary condition check in code that tries to find the right bucket to increment. This was triggering an assert in the update code. >> 4. shenandoahNumberSeq.hpp: provide missing allocation spec for BinaryMagnitudeSeq >> 5. shenandoahScanRemembered.cpp: new class ShenandoahCardStats methods. Minor restructure of loop for task claiming during RS scanning (akin to the one for update refs in 2 above). >> 6. shenandoahScanRemembered.hpp: Diff looks large because of git-diff'ism having issues with indentation change in restructured if-else branches. Not sure how to make the diffs more easily readable. Updated some documentation comments that were slightly obsolete. New class ShenandoahCardStats and implementation of inline methods. Class ShenandoahScanRemembered keeps cumulative running histograms. Remove some inline declarations for larger methods that we shouldn't force inlining on. Update some old comments. >> 7. shenandoahScanRemembered.inline.hpp: As in 6, diff looks larger than it should because of the same indentation change. ShenandoahScanRemembered::process_clusters() is the method where the instrumentation probes have been inserted. A couple of variables were renamed for clarity, as well as to update local variables rather than method arguments. The large diffs at (old) line 589 onwards are the git-diff'ism to do with indentation change. Delete some unused methods. >> 8. shenandoah_globals.hpp: new diagnostic flag `ShenandoahEnableCardStats` protects the stats collection code and is disabled by default.
>> >> **Format of stats produced and how to interpret them: (sample)** >> >> >> [1211.515s][info][gc,task ] GC(7069) Using 10 of 20 workers for Concurrent remembered set scanning >> [1211.529s][info][gc,remset ] GC(7069) Worker 0 Card Stats Histo: >> [1211.529s][info][gc,remset ] GC(7069) dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) clean_run: [ 0.00 0.00 0.00 99.61 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) dirty_cards: [ 0.00 0.00 0.00 0.00 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) clean_cards: [ 0.00 99.61 99.61 99.61 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) max_dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) max_clean_run: [ 0.00 99.61 99.61 99.61 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) dirty_objs: [ 0.00 0.00 0.00 0.00 1257.00 ] >> [1211.529s][info][gc,remset ] GC(7069) clean_objs: [ 0.00 0.00 0.00 0.00 64.00 ] >> [1211.529s][info][gc,remset ] GC(7069) dirty_scans: [ 0.00 0.00 0.00 0.00 1197.00 ] >> [1211.529s][info][gc,remset ] GC(7069) clean_scans: [ 0.00 0.00 0.00 0.00 17.00 ] >> [1211.529s][info][gc,remset ] GC(7069) alternations: [ 0.00 0.00 0.00 0.00 39.00 ] >> [1211.529s][info][gc,remset ] GC(7069) Worker 1 Card Stats Histo: >> [1211.529s][info][gc,remset ] GC(7069) dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) clean_run: [ 0.00 0.00 0.00 99.61 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) dirty_cards: [ 0.00 0.00 0.00 0.00 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) clean_cards: [ 0.00 99.61 99.61 99.61 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) max_dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) max_clean_run: [ 0.00 99.61 99.61 99.61 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) dirty_objs: [ 0.00 0.00 0.00 0.00 1257.00 ] >> [1211.529s][info][gc,remset ] GC(7069) clean_objs: [ 0.00 0.00 0.00 0.00 64.00 ] >> [1211.529s][info][gc,remset ] GC(7069) dirty_scans: [ 0.00 0.00 0.00 0.00 1197.00 ] >> [1211.529s][info][gc,remset ] GC(7069) clean_scans: [ 0.00 0.00 0.00 0.00 17.00 ] >> [1211.529s][info][gc,remset ] GC(7069) alternations: [ 0.00 0.00 0.00 0.00 39.00 ] >> ... >> >> >> The rows represent the metric that's being tracked, and the columns are, respectively, minimum, the 3 quartiles (25%, 50%, 75%) and the maximum. The metrics are: >> >> - dirty_run: the length of an uninterrupted run of dirty cards, interpreted as a percentage of a chunk of work assignment (cluster) processed by a thread >> - clean_run: as above, but the length of an uninterrupted run of clean cards >> - dirty_cards, clean_cards: as above, but counts of cards as a percentage of chunk >> - max_dirty_run & max_clean_run: Similarly for the maximum of each. >> - dirty_objs, clean_objs: these are numbers of objects in any chunk walked, or scanned >> - dirty_scans, clean_scans: numbers of objects scanned by the closure >> - alternations: the number of times that we transitioned from clean to dirty or dirty to clean in a chunk >> >> The data above indicates that at least 75% of the chunks have no alternations at all, >> and cards are almost always mostly clean for this specific benchmark config (extremem). >> >> Comparing worker stats from worker 0 and worker 9 indicates very little difference between >> their statistics, as one might typically expect for well-balanced RS scans. >> >> **Questions:** >> >> 1.
Would it make sense to print also, for example, the 1, 10, 90 and 99 percentiles for these metrics as well, in addition to the quartiles? >> 2. The distributions are per worker for the cumulative history of the run. Would data per RS scan or per Refs Update phase provide more useful information? >> 3. Any suggestions for a more easily consumable format? >> 4. I welcome any other feedback on the pull request. > > Y. Srinivas Ramakrishna has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 50 commits: > > - Merge branch 'master' into JVM-1264 > - Merge branch 'master' into JVM-1264 > - Tested and fixed some bugs; printing frequency of cumulative stats controlled by > command-line option. > > Ready for review. > - Remove stubs (guarantee/false). > This still needs formal work on merging decayed stats, but is OK to > ignore for now because no one currently uses the decayed stats. The > non-decayed stats also need further review and correction. So this is > still an interim checkin. > > To do: > -- print final summary at exit; consider if periodic cumulative summary might be > useful as well (Every major collection cycles?) > -- check correctness of merged data (ignoring decayed statistics for > now) > - Merge branch 'stats_merge' into JVM-1264 > - More merge() implementation. > -- Need to think about merge of decaying stats in AbsSeq. > -- Need to add tests. > - Interim checkin of code w/beginnings of merge() support. Some > implementations are still stubbed out and need to be written. > - First cut at merge. More changes to come. May not build yet. > - jcheck clean > - Cumulative card stats separated out for scan_rs and update_refs phases; > merge of per-worker stats into phase-specific cumulative stats stubbed > out for now until HdrSeq::merge() is done. > - ... and 40 more: https://git.openjdk.org/shenandoah/compare/3901a719...616547d6 Closing in favor of https://github.com/ysramakrishna/shenandoah/pull/1 ------------- PR: https://git.openjdk.org/shenandoah/pull/176 From sviswanathan at openjdk.org Fri Dec 16 21:43:56 2022 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Fri, 16 Dec 2022 21:43:56 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13] In-Reply-To: References: Message-ID: <35kCBSJh0N-kNpJyUDG7jr7p0aejbNDnVbZw6XdxZlM=.f0c1314e-707d-407f-b7bf-ea016d6800f1@github.com> On Fri, 11 Nov 2022 13:00:06 GMT, Claes Redestad wrote: >> Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops. >> >> I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases. >> >> Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front. >> >> The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads. 
>> >> With the most recent fixes the x64 intrinsic results on my workstation look like this: >> >> Benchmark (size) Mode Cnt Score Error Units >> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.199 ± 0.017 ns/op >> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 6.933 ± 0.049 ns/op >> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 29.935 ± 0.221 ns/op >> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1596.982 ± 7.020 ns/op >> >> Baseline: >> >> Benchmark (size) Mode Cnt Score Error Units >> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.200 ± 0.013 ns/op >> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 9.424 ± 0.122 ns/op >> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 90.541 ± 0.512 ns/op >> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 9425.321 ± 67.630 ns/op >> >> I.e. no measurable overhead compared to baseline even for `size == 1`. >> >> The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good. >> >> Benchmark for `Arrays.hashCode`: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArraysHashCode.bytes 1 avgt 5 1.884 ± 0.013 ns/op >> ArraysHashCode.bytes 10 avgt 5 6.955 ± 0.040 ns/op >> ArraysHashCode.bytes 100 avgt 5 87.218 ± 0.595 ns/op >> ArraysHashCode.bytes 10000 avgt 5 9419.591 ± 38.308 ns/op >> ArraysHashCode.chars 1 avgt 5 2.200 ± 0.010 ns/op >> ArraysHashCode.chars 10 avgt 5 6.935 ± 0.034 ns/op >> ArraysHashCode.chars 100 avgt 5 30.216 ± 0.134 ns/op >> ArraysHashCode.chars 10000 avgt 5 1601.629 ± 6.418 ns/op >> ArraysHashCode.ints 1 avgt 5 2.200 ± 0.007 ns/op >> ArraysHashCode.ints 10 avgt 5 6.936 ± 0.034 ns/op >> ArraysHashCode.ints 100 avgt 5 29.412 ± 0.268 ns/op >> ArraysHashCode.ints 10000 avgt 5 1610.578 ± 7.785 ns/op >> ArraysHashCode.shorts 1 avgt 5 1.885 ± 0.012 ns/op >> ArraysHashCode.shorts 10 avgt 5 6.961 ± 0.034 ns/op >> ArraysHashCode.shorts 100 avgt 5 87.095 ± 0.417 ns/op >> ArraysHashCode.shorts 10000 avgt 5 9420.617 ± 50.089 ns/op >> >> Baseline: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArraysHashCode.bytes 1 avgt 5 3.213 ± 0.207 ns/op >> ArraysHashCode.bytes 10 avgt 5 8.483 ± 0.040 ns/op >> ArraysHashCode.bytes 100 avgt 5 90.315 ± 0.655 ns/op >> ArraysHashCode.bytes 10000 avgt 5 9422.094 ± 62.402 ns/op >> ArraysHashCode.chars 1 avgt 5 3.040 ± 0.066 ns/op >> ArraysHashCode.chars 10 avgt 5 8.497 ± 0.074 ns/op >> ArraysHashCode.chars 100 avgt 5 90.074 ± 0.387 ns/op >> ArraysHashCode.chars 10000 avgt 5 9420.474 ± 41.619 ns/op >> ArraysHashCode.ints 1 avgt 5 2.827 ± 0.019 ns/op >> ArraysHashCode.ints 10 avgt 5 7.727 ± 0.043 ns/op >> ArraysHashCode.ints 100 avgt 5 89.405 ± 0.593 ns/op >> ArraysHashCode.ints 10000 avgt 5 9426.539 ± 51.308 ns/op >> ArraysHashCode.shorts 1 avgt 5 3.071 ± 0.062 ns/op >> ArraysHashCode.shorts 10 avgt 5 8.168 ± 0.049 ns/op >> ArraysHashCode.shorts 100 avgt 5 90.399 ± 0.292 ns/op >> ArraysHashCode.shorts 10000 avgt 5 9420.171 ± 44.474 ns/op >> >> >> As we can see the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized). I aim to fix `byte` and `short` cases before integrating, though it might be acceptable to hand that off as follow-up enhancements to not further delay integration of this enhancement.
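For context on what is being unrolled in the benchmarks above: the scalar hash loop `h = 31 * h + (a[i] & 0xff)` can be hand-unrolled so that several elements are folded in per iteration using precomputed powers of 31, which is the shape that lends itself to vectorization (compare the `_arrays_hashcode_powers_of_31` table further down in this thread). What follows is a minimal standalone sketch of the transformation, not the actual JDK or HotSpot code; it is written in C++ with wrap-around unsigned arithmetic standing in for Java's int overflow semantics.

#include <cstdint>
#include <cstdio>
#include <cstring>

// Illustrative only: 4-way hand-unrolled polynomial hash, equivalent to
// folding one element at a time with h = 31*h + a[i].
static uint32_t poly_hash(const uint8_t* a, size_t n) {
  const uint32_t p2 = 31u * 31u, p3 = p2 * 31u, p4 = p3 * 31u;
  uint32_t h = 0;
  size_t i = 0;
  for (; i + 4 <= n; i += 4) {  // body: four elements folded in per step
    h = p4 * h + p3 * a[i] + p2 * a[i + 1] + 31u * a[i + 2] + a[i + 3];
  }
  for (; i < n; i++) {          // tail: remaining 0..3 elements
    h = 31u * h + a[i];
  }
  return h;
}

int main() {
  const char* s = "hello, world";
  size_t n = strlen(s);
  uint32_t expected = 0;        // reference: the plain scalar loop
  for (size_t i = 0; i < n; i++) expected = 31u * expected + (uint8_t)s[i];
  printf("%s\n", poly_hash((const uint8_t*)s, n) == expected ? "match" : "mismatch");
  return 0;
}

The identity behind the unroll: applying h = 31h + x four times gives 31^4*h + 31^3*x0 + 31^2*x1 + 31*x2 + x3, so the per-iteration work becomes independent multiply-adds that a SIMD unit can evaluate in parallel.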
> > Claes Redestad has updated the pull request incrementally with one additional commit since the last revision: > > Missing & 0xff in StringLatin1::hashCode src/java.base/share/classes/java/lang/StringUTF16.java line 418: > 416: return 0; > 417: } else { > 418: return ArraysSupport.vectorizedHashCode(value, ArraysSupport.UTF16); Special case for 1 missing here. ------------- PR: https://git.openjdk.org/jdk/pull/10847 From sviswanathan at openjdk.org Fri Dec 16 23:29:57 2022 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Fri, 16 Dec 2022 23:29:57 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13] In-Reply-To: References: Message-ID: On Sun, 13 Nov 2022 20:57:44 GMT, Claes Redestad wrote: >> src/hotspot/cpu/x86/x86_64.ad line 12073: >> >>> 12071: legRegD tmp_vec13, rRegI tmp1, rRegI tmp2, rRegI tmp3, rFlagsReg cr) >>> 12072: %{ >>> 12073: predicate(UseAVX >= 2 && ((VectorizedHashCodeNode*)n)->mode() == VectorizedHashCodeNode::LATIN1); >> >> If you represent `VectorizedHashCodeNode::mode()` as an input, it would allow abstracting over supported modes and come up with a single AD instruction.
Take a look at `VectorMaskCmp` for an example (not a perfect one though since it has both _predicate member and constant input which is redundant). > > Thanks for the pointer, I'll check it out! I agree with Vladimir: adding mode as another input will help. Please take a look at the RoundDoubleModeNode. ------------- PR: https://git.openjdk.org/jdk/pull/10847 From sviswanathan at openjdk.org Fri Dec 16 23:29:55 2022 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Fri, 16 Dec 2022 23:29:55 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13] In-Reply-To: References: Message-ID: <_2HYrbXOe6zVuLXHywoy_AjCcGMYR266BcwKUZEA5fs=.1e6f7640-7580-4ff3-ace6-f18f27efbb23@github.com> On Fri, 11 Nov 2022 13:00:06 GMT, Claes Redestad wrote: >> Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops. >> >> I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases. >> >> Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front. >> >> The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads. >> >> With the most recent fixes the x64 intrinsic results on my workstation look like this: >> >> Benchmark (size) Mode Cnt Score Error Units >> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.199 ± 0.017 ns/op >> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 6.933 ± 0.049 ns/op >> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 29.935 ± 0.221 ns/op >> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1596.982 ± 7.020 ns/op >> >> Baseline: >> >> Benchmark (size) Mode Cnt Score Error Units >> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.200 ± 0.013 ns/op >> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 9.424 ± 0.122 ns/op >> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 90.541 ± 0.512 ns/op >> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 9425.321 ± 67.630 ns/op >> >> I.e. no measurable overhead compared to baseline even for `size == 1`. >> >> The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good. >> >> Benchmark for `Arrays.hashCode`: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArraysHashCode.bytes 1 avgt 5 1.884 ± 0.013 ns/op >> ArraysHashCode.bytes 10 avgt 5 6.955 ± 0.040 ns/op >> ArraysHashCode.bytes 100 avgt 5 87.218 ± 0.595 ns/op >> ArraysHashCode.bytes 10000 avgt 5 9419.591 ± 38.308 ns/op >> ArraysHashCode.chars 1 avgt 5 2.200 ± 0.010 ns/op >> ArraysHashCode.chars 10 avgt 5 6.935 ± 0.034 ns/op >> ArraysHashCode.chars 100 avgt 5 30.216 ± 0.134 ns/op >> ArraysHashCode.chars 10000 avgt 5 1601.629 ± 6.418 ns/op >> ArraysHashCode.ints 1 avgt 5 2.200 ± 0.007 ns/op >> ArraysHashCode.ints 10 avgt 5 6.936 ± 0.034 ns/op >> ArraysHashCode.ints 100 avgt 5 29.412 ± 0.268 ns/op >> ArraysHashCode.ints 10000 avgt 5 1610.578 ± 7.785 ns/op >> ArraysHashCode.shorts 1 avgt 5 1.885 ±
0.012 ns/op >> ArraysHashCode.shorts 10 avgt 5 6.961 ± 0.034 ns/op >> ArraysHashCode.shorts 100 avgt 5 87.095 ± 0.417 ns/op >> ArraysHashCode.shorts 10000 avgt 5 9420.617 ± 50.089 ns/op >> >> Baseline: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArraysHashCode.bytes 1 avgt 5 3.213 ± 0.207 ns/op >> ArraysHashCode.bytes 10 avgt 5 8.483 ± 0.040 ns/op >> ArraysHashCode.bytes 100 avgt 5 90.315 ± 0.655 ns/op >> ArraysHashCode.bytes 10000 avgt 5 9422.094 ± 62.402 ns/op >> ArraysHashCode.chars 1 avgt 5 3.040 ± 0.066 ns/op >> ArraysHashCode.chars 10 avgt 5 8.497 ± 0.074 ns/op >> ArraysHashCode.chars 100 avgt 5 90.074 ± 0.387 ns/op >> ArraysHashCode.chars 10000 avgt 5 9420.474 ± 41.619 ns/op >> ArraysHashCode.ints 1 avgt 5 2.827 ± 0.019 ns/op >> ArraysHashCode.ints 10 avgt 5 7.727 ± 0.043 ns/op >> ArraysHashCode.ints 100 avgt 5 89.405 ± 0.593 ns/op >> ArraysHashCode.ints 10000 avgt 5 9426.539 ± 51.308 ns/op >> ArraysHashCode.shorts 1 avgt 5 3.071 ± 0.062 ns/op >> ArraysHashCode.shorts 10 avgt 5 8.168 ± 0.049 ns/op >> ArraysHashCode.shorts 100 avgt 5 90.399 ± 0.292 ns/op >> ArraysHashCode.shorts 10000 avgt 5 9420.171 ± 44.474 ns/op >> >> >> As we can see the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized). I aim to fix `byte` and `short` cases before integrating, though it might be acceptable to hand that off as follow-up enhancements to not further delay integration of this enhancement. > > Claes Redestad has updated the pull request incrementally with one additional commit since the last revision: > > Missing & 0xff in StringLatin1::hashCode src/hotspot/cpu/x86/stubRoutines_x86.cpp line 230: > 228: #endif // _LP64 > 229: > 230: jint StubRoutines::x86::_arrays_hashcode_powers_of_31[] = This should be declared only for LP64. src/hotspot/cpu/x86/vm_version_x86.cpp line 1671: > 1669: } > 1670: if (UseAVX >= 2) { > 1671: FLAG_SET_ERGO_IF_DEFAULT(UseVectorizedHashCodeIntrinsic, true); This could be just FLAG_SET_DEFAULT instead of FLAG_SET_ERGO_IF_DEFAULT. ------------- PR: https://git.openjdk.org/jdk/pull/10847 From duke at openjdk.org Sat Dec 17 06:07:56 2022 From: duke at openjdk.org (duke) Date: Sat, 17 Dec 2022 06:07:56 GMT Subject: Withdrawn: 8294954: Remove superfluous ResourceMarks when using LogStream In-Reply-To: References: Message-ID: On Fri, 7 Oct 2022 11:19:55 GMT, Johan Sjölen wrote: > Hi, > > I went through all of the places where LogStreams are created and removed the unnecessary ResourceMarks. I also added a ResourceMark in one place, where it was needed because of a call to `::name_and_sig_as_C_string` and moved one to the smallest scope where it is used. This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/10602 From ysr at openjdk.org Mon Dec 19 17:46:18 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Mon, 19 Dec 2022 17:46:18 GMT Subject: RFR: JDK-8298597 : HdrSeq: support for a merge() method [v3] In-Reply-To: References: Message-ID: > Merge functionality on stats (distributions) was needed for the remembered set scan that I was using in some companion work. This PR implements a first cut at that, which is sufficient for our first (and only) use case. > > Unfortunately, for expediency, I am deferring work on decaying statistics, as a result of which users that want decaying statistics will get NaNs instead (or trigger guarantees).
> > In the short term, before I open this draft for review, I'll: > > - [x] add tests > - [x] ensure that if a merge action has been taken on a distribution, then any attempt to access a decayed statistic causes an error > - [x] open a linked ticket to take care of the decayed statistics > > An important goal here was to have an API that would be efficient and correct. The API shape may change when we have considered how to handle decaying statistics. Y. Srinivas Ramakrishna has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 10 commits: - Merge branch 'master' into stats_merge - Merge branch 'master' into stats_merge - Safety tests for decayed stats, until implemented. - gtest for merge. - Vanilla merge test for ShenandoahNumberSeq; needs to be extended some. - Changes based on experience with uses in RS scan stats. Fixed some bugs. -- We still need to implement a few vanilla tests for the merge method. -- Planning to defer the work on decayed stats (which will be delivered separately in a lower-priority sibling ticket) - Merge branch 'master' into stats_merge - More merge() implementation. -- Need to think about merge of decaying stats in AbsSeq. -- Need to add tests. - Interim checkin of code w/beginnings of merge() support. Some implementations are still stubbed out and need to be written. - First cut at merge. More changes to come. May not build yet. ------------- Changes: https://git.openjdk.org/shenandoah/pull/184/files Webrev: https://webrevs.openjdk.org/?repo=shenandoah&pr=184&range=02 Stats: 180 lines in 5 files changed: 153 ins; 2 del; 25 mod Patch: https://git.openjdk.org/shenandoah/pull/184.diff Fetch: git fetch https://git.openjdk.org/shenandoah pull/184/head:pull/184 PR: https://git.openjdk.org/shenandoah/pull/184 From kdnilsen at openjdk.org Mon Dec 19 17:56:30 2022 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Mon, 19 Dec 2022 17:56:30 GMT Subject: RFR: JDK-8298597 : HdrSeq: support for a merge() method [v3] In-Reply-To: References: Message-ID: On Mon, 19 Dec 2022 17:46:18 GMT, Y. Srinivas Ramakrishna wrote: >> Merge functionality on stats (distributions) was needed for the remembered set scan that I was using in some companion work. This PR implements a first cut at that, which is sufficient for our first (and only) use case. >> >> Unfortunately, for expediency, I am deferring work on decaying statistics, as a result of which users that want decaying statistics will get NaNs instead (or trigger guarantees). >> >> In the short term, before I open this draft for review, I'll: >> >> - [x] add tests >> - [x] ensure that if a merge action has been taken on a distribution, then any attempt to access a decayed statistic causes an error >> - [x] open a linked ticket to take care of the decayed statistics >> >> An important goal here was to have an API that would be efficient and correct. The API shape may change when we have considered how to handle decaying statistics. > > Y. Srinivas Ramakrishna has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 10 commits: > > - Merge branch 'master' into stats_merge > - Merge branch 'master' into stats_merge > - Safety tests for decayed stats, until implemented. > - gtest for merge. > - Vanilla merge test for ShenandoahNumberSeq; needs to be extended some. > - Changes based on experience with uses in RS scan stats. > Fixed some bugs. > > -- We still need to implement a few vanilla tests for the merge method.
> -- Planning to defer the work on decayed stats (which will be delivered > separately in a lower-priority sibling ticket) > - Merge branch 'master' into stats_merge > - More merge() implementation. > -- Need to think about merge of decaying stats in AbsSeq. > -- Need to add tests. > - Interim checkin of code w/beginnings of merge() support. Some > implementations are still stubbed out and need to be written. > - First cut at merge. More changes to come. May not build yet. Marked as reviewed by kdnilsen (Committer). Marked as reviewed by kdnilsen (Committer). ------------- PR: https://git.openjdk.org/shenandoah/pull/184 From ysr at openjdk.org Mon Dec 19 18:35:28 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Mon, 19 Dec 2022 18:35:28 GMT Subject: RFR: JDK-8298597 : HdrSeq: support for a merge() method [v3] In-Reply-To: References: Message-ID: <2h4E0exn8ZLbvpR395hvvrwacHDF0HHIkrgV_CMDgp8=.58f71ddd-6928-46c0-b0ce-9af41ca26e97@github.com> On Mon, 19 Dec 2022 17:53:34 GMT, Kelvin Nilsen wrote: >> Y. Srinivas Ramakrishna has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 10 commits: >> >> - Merge branch 'master' into stats_merge >> - Merge branch 'master' into stats_merge >> - Safety tests for decayed stats, until implemented. >> - gtest for merge. >> - Vanilla merge test for ShenandoahNumberSeq; needs to be extended some. >> - Changes based on experience with uses in RS scan stats. >> Fixed some bugs. >> >> -- We still need to implement a few vanilla tests for the merge method. >> -- Planning to defer the work on decayed stats (which will be delivered >> separately in a lower-priority sibling ticket) >> - Merge branch 'master' into stats_merge >> - More merge() implementation. >> -- Need to think about merge of decaying stats in AbsSeq. >> -- Need to add tests. >> - Interim checkin of code w/beginnings of merge() support. Some >> implementations are still stubbed out and need to be written. >> - First cut at merge. More changes to come. May not build yet. > > Marked as reviewed by kdnilsen (Committer). Pending sponsorship; thanks for the review @kdnilsen! ------------- PR: https://git.openjdk.org/shenandoah/pull/184 From ysr at openjdk.org Mon Dec 19 18:35:29 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Mon, 19 Dec 2022 18:35:29 GMT Subject: RFR: JDK-8298597 : HdrSeq: support for a merge() method [v3] In-Reply-To: References: Message-ID: On Fri, 16 Dec 2022 03:37:29 GMT, Y. Srinivas Ramakrishna wrote: >> Y. Srinivas Ramakrishna has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 10 commits: >> >> - Merge branch 'master' into stats_merge >> - Merge branch 'master' into stats_merge >> - Safety tests for decayed stats, until implemented. >> - gtest for merge. >> - Vanilla merge test for ShenandoahNumberSeq; needs to be extended some. >> - Changes based on experience with uses in RS scan stats. >> Fixed some bugs. >> >> -- We still need to implement a few vanilla tests for the merge method. >> -- Planning to defer the work on decayed stats (which will be delivered >> separately in a lower-priority sibling ticket) >> - Merge branch 'master' into stats_merge >> - More merge() implementation. >> -- Need to think about merge of decaying stats in AbsSeq. >> -- Need to add tests. >> - Interim checkin of code w/beginnings of merge() support. Some >> implementations are still stubbed out and need to be written. >> - First cut at merge. More changes to come. 
May not build yet. > > test/hotspot/gtest/gc/shenandoah/test_shenandoahNumberSeq.cpp line 1: > >> 1: /* > > An earlier version of this test is in tip for an earlier bug fix. I am happy to consult if there is any confusion during a merge from tip. In this specific case, the contents of this file should take precedence. merged from master; resolving. ------------- PR: https://git.openjdk.org/shenandoah/pull/184 From wkemper at openjdk.org Mon Dec 19 18:54:12 2022 From: wkemper at openjdk.org (William Kemper) Date: Mon, 19 Dec 2022 18:54:12 GMT Subject: RFR: Use CardTable::card_size_in_words rather than hard coded constant [v3] In-Reply-To: References: Message-ID: > Calculation assumed 64 words per card, which does not hold on 32 bit word platforms William Kemper has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: Do not assume cards will have 64 words ------------- Changes: - all: https://git.openjdk.org/shenandoah/pull/186/files - new: https://git.openjdk.org/shenandoah/pull/186/files/521c1491..9ebacb5d Webrevs: - full: https://webrevs.openjdk.org/?repo=shenandoah&pr=186&range=02 - incr: https://webrevs.openjdk.org/?repo=shenandoah&pr=186&range=01-02 Stats: 20 lines in 2 files changed: 5 ins; 3 del; 12 mod Patch: https://git.openjdk.org/shenandoah/pull/186.diff Fetch: git fetch https://git.openjdk.org/shenandoah pull/186/head:pull/186 PR: https://git.openjdk.org/shenandoah/pull/186 From ysr at openjdk.org Mon Dec 19 20:46:17 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Mon, 19 Dec 2022 20:46:17 GMT Subject: RFR: Use CardTable::card_size_in_words rather than hard coded constant [v3] In-Reply-To: References: Message-ID: On Mon, 19 Dec 2022 18:54:12 GMT, William Kemper wrote: >> Calculation assumed 64 words per card, which does not hold on 32 bit word platforms > > William Kemper has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: > > Do not assume cards will have 64 words Marked as reviewed by ysr (Author). ------------- PR: https://git.openjdk.org/shenandoah/pull/186 From ysr at openjdk.org Mon Dec 19 20:49:39 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Mon, 19 Dec 2022 20:49:39 GMT Subject: RFR: Use CardTable::card_size_in_words rather than hard coded constant [v3] In-Reply-To: References: Message-ID: On Mon, 19 Dec 2022 18:54:12 GMT, William Kemper wrote: >> Calculation assumed 64 words per card, which does not hold on 32 bit word platforms > > William Kemper has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: > > Do not assume cards will have 64 words src/hotspot/share/gc/shenandoah/shenandoahScanRemembered.hpp line 1033: > 1031: // ShenandoahCardCluster::CardsPerCluster; > 1032: // We can't perform this computation here, because of encapsulation and initialization constraints. We paste > 1033: // the magic number here, and assert that this number matches the intended computation in constructor. Is the portion of the comment beginning "We paste the maginc number ... etc." 
not obsolete now, and should there be deleted? ------------- PR: https://git.openjdk.org/shenandoah/pull/186 From ysr at openjdk.org Mon Dec 19 20:59:17 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Mon, 19 Dec 2022 20:59:17 GMT Subject: Integrated: JDK-8298597 : HdrSeq: support for a merge() method In-Reply-To: References: Message-ID: On Thu, 15 Dec 2022 19:33:36 GMT, Y. Srinivas Ramakrishna wrote: > A merge functionality on stats (distributions) was needed for the remembered set scan that I was using in some companion work. This PR implements a first cut at that, which is sufficient for our first (and only) use case. > > Unfortunately, for expediency, I am deferring work on decaying statistics, as a result of which users that want decaying statistics will get NaNs instead (or trigger guarantees). > > In the sort term, before I open this draft for review, I'll: > > - [x] add tests > - [x] ensure that if a merge action has been taken on a distribution, then any attempt to access a decayed statistic causes an error > - [x] open a linked ticket to take care of the decayed statistics > > An important goal here was to have an API that would be efficient and correct. The API shape may change when we have considered how to handle decaying statistics. This pull request has now been integrated. Changeset: bbd4ef34 Author: Y. Srinivas Ramakrishna Committer: Kelvin Nilsen URL: https://git.openjdk.org/shenandoah/commit/bbd4ef345122aeb2277c5f24269bda11846ec6ef Stats: 180 lines in 5 files changed: 153 ins; 2 del; 25 mod 8298597: HdrSeq: support for a merge() method Reviewed-by: kdnilsen ------------- PR: https://git.openjdk.org/shenandoah/pull/184 From wkemper at openjdk.org Mon Dec 19 21:13:18 2022 From: wkemper at openjdk.org (William Kemper) Date: Mon, 19 Dec 2022 21:13:18 GMT Subject: RFR: Use CardTable::card_size_in_words rather than hard coded constant [v3] In-Reply-To: References: Message-ID: On Mon, 19 Dec 2022 20:47:00 GMT, Y. Srinivas Ramakrishna wrote: >> William Kemper has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: >> >> Do not assume cards will have 64 words > > src/hotspot/share/gc/shenandoah/shenandoahScanRemembered.hpp line 1033: > >> 1031: // ShenandoahCardCluster::CardsPerCluster; >> 1032: // We can't perform this computation here, because of encapsulation and initialization constraints. We paste >> 1033: // the magic number here, and assert that this number matches the intended computation in constructor. > > Is the portion of the comment beginning "We paste the maginc number ... etc." not obsolete now, and should there be deleted? Yes - I'll clean that up. ------------- PR: https://git.openjdk.org/shenandoah/pull/186 From wkemper at openjdk.org Mon Dec 19 21:21:47 2022 From: wkemper at openjdk.org (William Kemper) Date: Mon, 19 Dec 2022 21:21:47 GMT Subject: RFR: Use CardTable::card_size_in_words rather than hard coded constant [v4] In-Reply-To: References: Message-ID: > Calculation assumed 64 words per card, which does not hold on 32 bit word platforms. The number of words per card also depends on `GCCardSizeInBytes` command line parameter. 
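As an aside, the arithmetic behind that statement is just a division by the platform word size: the old constant 64 only holds for 512-byte cards on an 8-byte word. A small stand-alone sketch of the relationship (the names below are local to this example; in the VM the value comes from the card table, not from hand computation):

#include <cstdio>

int main() {
  // 512 is the default value of GCCardSizeInBytes; the flag makes the
  // card size configurable, which is the other reason the constant is
  // unsafe to hard-code.
  const int card_size_in_bytes = 512;
  const int word_sizes[] = { 8, 4 };  // 64-bit vs. 32-bit HeapWordSize
  for (int ws : word_sizes) {
    printf("word size %d bytes -> %d words per card\n",
           ws, card_size_in_bytes / ws);  // 64 on 64-bit, 128 on 32-bit
  }
  return 0;
}

So with the default 512-byte card and a 4-byte word, a card covers 128 words, which is why the hard-coded 64 broke on 32-bit platforms.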
William Kemper has updated the pull request incrementally with one additional commit since the last revision:

  Remove stale comment

-------------

Changes:
  - all: https://git.openjdk.org/shenandoah/pull/186/files
  - new: https://git.openjdk.org/shenandoah/pull/186/files/9ebacb5d..5a4f973d

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=shenandoah&pr=186&range=03
 - incr: https://webrevs.openjdk.org/?repo=shenandoah&pr=186&range=02-03

  Stats: 4 lines in 1 file changed: 0 ins; 3 del; 1 mod
  Patch: https://git.openjdk.org/shenandoah/pull/186.diff
  Fetch: git fetch https://git.openjdk.org/shenandoah pull/186/head:pull/186

PR: https://git.openjdk.org/shenandoah/pull/186

From kdnilsen at openjdk.org Mon Dec 19 23:12:24 2022
From: kdnilsen at openjdk.org (Kelvin Nilsen)
Date: Mon, 19 Dec 2022 23:12:24 GMT
Subject: RFR: Use CardTable::card_size_in_words rather than hard coded constant [v4]
In-Reply-To:
References:
Message-ID:

On Mon, 19 Dec 2022 21:21:47 GMT, William Kemper wrote:

>> Calculation assumed 64 words per card, which does not hold on 32-bit word platforms. The number of words per card also depends on the `GCCardSizeInBytes` command line parameter.
>
> William Kemper has updated the pull request incrementally with one additional commit since the last revision:
>
>   Remove stale comment

Thanks.

-------------

Marked as reviewed by kdnilsen (Committer).

PR: https://git.openjdk.org/shenandoah/pull/186

From wkemper at openjdk.org Mon Dec 19 23:22:15 2022
From: wkemper at openjdk.org (William Kemper)
Date: Mon, 19 Dec 2022 23:22:15 GMT
Subject: Integrated: Use CardTable::card_size_in_words rather than hard coded constant
In-Reply-To:
References:
Message-ID: <8Wu-3C0Kw5kxZ-4VVDnAZqZOuq7e3jeIa5rrSsUD-A4=.b68f7daf-10fb-4d59-bc98-1117f02033d1@github.com>

On Fri, 16 Dec 2022 18:20:20 GMT, William Kemper wrote:

> Calculation assumed 64 words per card, which does not hold on 32-bit word platforms. The number of words per card also depends on the `GCCardSizeInBytes` command line parameter.

This pull request has now been integrated.

Changeset: 1a962380
Author: William Kemper
URL: https://git.openjdk.org/shenandoah/commit/1a9623802c3595300fd796539e1a75aa3533cda6
Stats: 16 lines in 2 files changed: 3 ins; 2 del; 11 mod

Use CardTable::card_size_in_words rather than hard coded constant

Reviewed-by: ysr, kdnilsen

-------------

PR: https://git.openjdk.org/shenandoah/pull/186

From wkemper at openjdk.org Mon Dec 19 23:33:18 2022
From: wkemper at openjdk.org (William Kemper)
Date: Mon, 19 Dec 2022 23:33:18 GMT
Subject: RFR: Initial sizing refactor
Message-ID:

Some things to highlight here:
* This change borrows a bit of code from G1 to handle processing of command line arguments used to size the young generation.
* A (hard coded for now) threshold on the difference between young/old time has been added to reduce resizing churn.
* The adaptive heuristic doesn't consider the `soft_tail` anymore. `available` is already adjusted for the soft max capacity.
* `SoftMaxHeapSize` is used to compute the soft max size and max size for the young generation.

-------------

Commit messages:
 - Initial young generation to maximum allowed size
 - Always transfer bytes in multiple of region size
 - Fix invalid assertion
 - Initial soft max size based on young generation's minimum size.
 - Factor out methods to check young generation size limits
 - WIP: Refactor handling of generation parameters

Changes: https://git.openjdk.org/shenandoah/pull/185/files
 Webrev: https://webrevs.openjdk.org/?repo=shenandoah&pr=185&range=00
  Stats: 362 lines in 12 files changed: 233 ins; 68 del; 61 mod
  Patch: https://git.openjdk.org/shenandoah/pull/185.diff
  Fetch: git fetch https://git.openjdk.org/shenandoah pull/185/head:pull/185

PR: https://git.openjdk.org/shenandoah/pull/185

From wkemper at openjdk.org Mon Dec 19 23:33:18 2022
From: wkemper at openjdk.org (William Kemper)
Date: Mon, 19 Dec 2022 23:33:18 GMT
Subject: RFR: Initial sizing refactor
In-Reply-To:
References:
Message-ID:

On Fri, 16 Dec 2022 00:36:41 GMT, William Kemper wrote:

> Some things to highlight here:
> * This change borrows a bit of code from G1 to handle processing of command line arguments used to size the young generation.
> * A (hard coded for now) threshold on the difference between young/old time has been added to reduce resizing churn.
> * The adaptive heuristic doesn't consider the `soft_tail` anymore. `available` is already adjusted for the soft max capacity.
> * `SoftMaxHeapSize` is used to compute the soft max size and max size for the young generation.

Converted this to a draft because the changes are tripping an assert:

# Internal Error (/codebuild/output/src797/src/s3/00/src/hotspot/share/gc/shenandoah/shenandoahHeapRegion.cpp:986), pid=1289, tid=1312
# assert(regions * ShenandoahHeapRegion::region_size_bytes() <= heap->young_generation()->adjusted_capacity()) failed: Number of young regions cannot exceed adjusted capacity

Ready for review - the assertion noted earlier is unrelated to these changes and will be fixed under a separate PR.

-------------

PR: https://git.openjdk.org/shenandoah/pull/185

From ysr at openjdk.org Mon Dec 19 23:42:04 2022
From: ysr at openjdk.org (Y. Srinivas Ramakrishna)
Date: Mon, 19 Dec 2022 23:42:04 GMT
Subject: RFR: JDK-8297796 GenShen: instrument the remembered set scan [v12]
In-Reply-To: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com>
References: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com>
Message-ID:

> **Note:**
> This pull request is a draft to share the diffs with the project team. The following additional work is planned before this is ready to commit. (Thanks to Kevin, Roman, William etc. for feedback & suggestions.)
>
> (1) Collect performance data from SpecJBB and from the pipeline to assess the impact of instrumentation on concurrent remembered set scanning and concurrent update refs phase durations, in addition to the existing data from Extremem mentioned in the ticket.
>
> (2) Make available the instrumentation only in non-product (optimized) mode until better performance is achieved.
>
> (3) Any improvements that come from further feedback on this draft (e.g. better or different logging of the metrics data), or other suggestions that I may have missed mentioning above.
>
> The fix to ShenandoahNumberSeq will be separated out into its own pull request on mainline.
>
> **Summary:**
> The main change is card stats collection during RS scanning. The code is protected by a new diagnostic flag `ShenandoahEnableCardStats`, which is off by default. With the flag disabled there is a small performance impact (measured with extremem; more data will be collected, see above). With the flag enabled there is a larger performance impact because of the large number of clusters, with shared stats updates at the end of each cluster processed. Since we expect the loops in process_clusters() to change in the near future, informed by the learnings from these stats, we expect to work further on reducing the cost of the stats collection as well. Currently the stats are logged per thread at the end of each RS scan. I'm happy to refine both the stats that we collect as well as how frequently we log the data once we have gathered some experience on how we use this.
>
> **Details of files changed:**
>
> 1. shenandoahGeneration.cpp: add a call to log info at the end of remembered set scan when card stats are enabled.
> 2. shenandoahHeap.cpp: minor restructuring of a loop for task claiming during update refs; introduce a worker id option to downstream code for card stats
> 3. shenandoahNumberSeq.cpp: fix a minor issue with a boundary condition check in code that tries to find the right bucket to increment. This was triggering an assert in the update code.
> 4. shenandoahNumberSeq.hpp: provide missing allocation spec for BinaryMagnitudeSeq
> 5. shenandoahScanRemembered.cpp: new class ShenandoahCardStats methods. Minor restructure of loop for task claiming during RS scanning (akin to the one for update refs in 2 above).
> 6. shenandoahScanRemembered.hpp: Diff looks large because of git-diff'ism having issues with indentation change in restructured if-else branches. Not sure how to make the diffs more easily readable. Updated some documentation comments that were slightly obsolete. New class ShenandoahCardStats and implementation of inline methods. Class ShenandoahScanRemembered keeps cumulative running histograms. Remove some inline declarations for larger methods that we shouldn't force inlining on. Update some old comments.
> 7. shenandoahScanRemembered.inline.hpp: As in 6, diff looks larger than it should because of the same indentation change. ShenandoahScanRemembered::process_clusters() is the method where the instrumentation probes have been inserted. A couple of variables were renamed for clarity, as well as to update local variables rather than method arguments. The large diffs at (old) line 589 onwards are the git-diff'ism to do with indentation change. Delete some unused methods.
> 8. shenandoah_globals.hpp: new diagnostic flag `ShenandoahEnableCardStats` protects the stats collection code and is disabled by default.
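To make item 7 above concrete: the per-cluster bookkeeping described there amounts to a single pass over the cards of one cluster, maintaining run lengths and counting dirty/clean alternations. Below is a simplified, self-contained sketch of that walk; the names and types are hypothetical stand-ins, not the PR's ShenandoahCardStats code:

#include <cstdio>

// Per-cluster counters gathered in one pass over the card values.
struct ClusterStats {
  int dirty_cards, clean_cards;
  int max_dirty_run, max_clean_run;
  int alternations;
};

ClusterStats scan_cluster(const bool* dirty, int num_cards) {
  ClusterStats s = {0, 0, 0, 0, 0};
  int run = 0;  // length of the current run of same-valued cards
  for (int i = 0; i < num_cards; i++) {
    if (i > 0 && dirty[i] != dirty[i - 1]) {
      s.alternations++;  // transitioned dirty<->clean
      run = 0;
    }
    run++;
    if (dirty[i]) {
      s.dirty_cards++;
      if (run > s.max_dirty_run) s.max_dirty_run = run;
    } else {
      s.clean_cards++;
      if (run > s.max_clean_run) s.max_clean_run = run;
    }
  }
  return s;
}

int main() {
  const bool cards[] = { true, true, false, false, false, true };
  ClusterStats s = scan_cluster(cards, 6);
  printf("dirty=%d clean=%d max_dirty_run=%d max_clean_run=%d alternations=%d\n",
         s.dirty_cards, s.clean_cards, s.max_dirty_run, s.max_clean_run,
         s.alternations);  // dirty=3 clean=3 max_dirty_run=2 max_clean_run=3 alternations=2
  return 0;
}

In the PR, counters of this kind are accumulated into per-worker histograms at the end of each cluster, which is where the "shared stats updates at the end of each cluster processed" cost mentioned in the summary comes from.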
>
> **Format of stats produced and how to interpret them: (sample)**
>
>
> [1211.515s][info][gc,task ] GC(7069) Using 10 of 20 workers for Concurrent remembered set scanning
> [1211.529s][info][gc,remset ] GC(7069) Worker 0 Card Stats Histo:
> [1211.529s][info][gc,remset ] GC(7069) dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_run: [ 0.00 0.00 0.00 99.61 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) dirty_cards: [ 0.00 0.00 0.00 0.00 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_cards: [ 0.00 99.61 99.61 99.61 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) max_dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) max_clean_run: [ 0.00 99.61 99.61 99.61 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) dirty_objs: [ 0.00 0.00 0.00 0.00 1245.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_objs: [ 0.00 0.00 0.00 0.00 64.00 ]
> [1211.529s][info][gc,remset ] GC(7069) dirty_scans: [ 0.00 0.00 0.00 0.00 1157.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_scans: [ 0.00 0.00 0.00 0.00 17.00 ]
> [1211.529s][info][gc,remset ] GC(7069) alternations: [ 0.00 0.00 0.00 0.00 39.00 ]
> [1211.529s][info][gc,remset ] GC(7069) Worker 1 Card Stats Histo:
> [1211.529s][info][gc,remset ] GC(7069) dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_run: [ 0.00 0.00 0.00 99.61 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) dirty_cards: [ 0.00 0.00 0.00 0.00 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_cards: [ 0.00 99.61 99.61 99.61 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) max_dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) max_clean_run: [ 0.00 99.61 99.61 99.61 100.00 ]
> [1211.529s][info][gc,remset ] GC(7069) dirty_objs: [ 0.00 0.00 0.00 0.00 1257.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_objs: [ 0.00 0.00 0.00 0.00 64.00 ]
> [1211.529s][info][gc,remset ] GC(7069) dirty_scans: [ 0.00 0.00 0.00 0.00 1197.00 ]
> [1211.529s][info][gc,remset ] GC(7069) clean_scans: [ 0.00 0.00 0.00 0.00 17.00 ]
> [1211.529s][info][gc,remset ] GC(7069) alternations: [ 0.00 0.00 0.00 0.00 39.00 ]
> ...
>
>
> The rows represent the metric that's being tracked, and the columns are, respectively, the minimum, the 3 quartiles (25%, 50%, 75%), and the maximum. The metrics are:
>
> - dirty_run: the length of an uninterrupted run of dirty cards, interpreted as a percentage of a chunk of work assignment (cluster) processed by a thread
> - clean_run: as above, but the length of an uninterrupted run of clean cards
> - dirty_cards, clean_cards: as above, but counts of cards as a percentage of chunk
> - max_dirty_run & max_clean_run: similarly, for the maximum of each
> - dirty_objs, clean_objs: these are numbers of objects in any chunk walked, or scanned
> - dirty_scans, clean_scans: numbers of objects scanned by the closure
> - alternations: the number of times that we transitioned from clean to dirty or dirty to clean in a chunk
>
> The data above indicates that at least 75% of the chunks have no alternations at all,
> and cards are almost always mostly clean for this specific benchmark config (extremem).
>
> Comparing worker stats from worker 0 and worker 9 indicates very little difference between
> their statistics, as one might typically expect for well-balanced RS scans.
>
> **Questions:**
>
> 1. Would it make sense to also print, for example, the 1, 10, 90 and 99 percentiles for these metrics, in addition to the quartiles?
> 2. The distributions are per worker for the cumulative history of the run. Would data per RS scan or per Refs Update phase provide more useful information?
> 3. Any suggestions for a more easily consumable format?
> 4. I welcome any other feedback on the pull request.

Y. Srinivas Ramakrishna has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 65 commits:

 - Merge branch 'master' into JVM-1264-dependent
 - Add a previously missed ticket#. Doing it here rather than in parent to
   avoid an otherwise unnecessary re-review touchpoint.
 - Merge branch 'stats_merge' into JVM-1264-dependent
 - Merge branch 'master' into stats_merge
 - jcheck space fix
 - Fix compiler error on windows.
 - Fix some tier1 tests.
 - Remove an unnecessary include, fix some type incorrectness.
 - Merge branch 'JVM-1264' into JVM-1264-dependent
 - Merge branch 'master' into JVM-1264
 - ... and 55 more: https://git.openjdk.org/shenandoah/compare/bbd4ef34...9c5c741f

-------------

Changes: https://git.openjdk.org/shenandoah/pull/176/files
 Webrev: https://webrevs.openjdk.org/?repo=shenandoah&pr=176&range=11
  Stats: 864 lines in 9 files changed: 495 ins; 206 del; 163 mod
  Patch: https://git.openjdk.org/shenandoah/pull/176.diff
  Fetch: git fetch https://git.openjdk.org/shenandoah pull/176/head:pull/176

PR: https://git.openjdk.org/shenandoah/pull/176

From wkemper at openjdk.org Mon Dec 19 23:43:27 2022
From: wkemper at openjdk.org (William Kemper)
Date: Mon, 19 Dec 2022 23:43:27 GMT
Subject: RFR: Shrink tlab to capacity [v2]
In-Reply-To:
References:
Message-ID:

On Mon, 12 Dec 2022 23:17:11 GMT, Kelvin Nilsen wrote:

>> When a TLAB request exceeds the currently available memory within young-gen, the existing behavior is to reject the TLAB request outright. This is recognized as a failed allocation request, which triggers degenerated GC.
>>
>> This change introduces code to reduce the likelihood that too-large TLAB requests will be issued, and when they are issued, it makes an effort to shrink the TLAB request in order to reduce the need for degenerated GC.
>>
>> The impact is difficult to measure because this situation is fairly rare. On one Extremem workload, the TLAB-shrinking code is exercised only once during a 16-minute run involving 500 concurrent GCs, a 45 GiB heap, and a 28 GiB young-gen size. The change reduces the degenerated GCs from 6 to 5.
>>
>> One reason that the remaining 5 degenerated GCs are not addressed by this change is that further work is required to handle a situation in which a requested TLAB is smaller than the available young-gen memory, but available memory is set aside in the evacuation reserve so cannot be provided to a mutator. Future work will address this condition.
>
> Kelvin Nilsen has updated the pull request incrementally with one additional commit since the last revision:
>
>   Clarify recursive implementation of allocate_memory_under_lock
>
>   (with a comment)

Looks good. Thank you.

-------------

Marked as reviewed by wkemper (Committer).

PR: https://git.openjdk.org/shenandoah/pull/180

From ysr at openjdk.org Mon Dec 19 23:46:17 2022
From: ysr at openjdk.org (Y. Srinivas Ramakrishna)
Date: Mon, 19 Dec 2022 23:46:17 GMT
Subject: RFR: JDK-8297796 GenShen: instrument the remembered set scan [v12]
In-Reply-To:
References:
Message-ID:

On Mon, 19 Dec 2022 23:42:04 GMT, Y. Srinivas Ramakrishna wrote:

>> **Note:**
>> This pull request is a draft to share the diffs with the project team. The following additional work is planned before this is ready to commit. (Thanks to Kevin, Roman, William etc. for feedback & suggestions.)
>>
>> (1) Collect performance data from SpecJBB and from the pipeline to assess the impact of instrumentation on concurrent remembered set scanning and concurrent update refs phase durations, in addition to the existing data from Extremem mentioned in the ticket.
>>
>> (2) Make available the instrumentation only in non-product (optimized) mode until better performance is achieved.
>>
>> (3) Any improvements that come from further feedback on this draft (e.g. better or different logging of the metrics data), or other suggestions that I may have missed mentioning above.
>>
>> The fix to ShenandoahNumberSeq will be separated out into its own pull request on mainline.
>>
>> **Summary:**
>> The main change is card stats collection during RS scanning. The code is protected by a new diagnostic flag `ShenandoahEnableCardStats`, which is off by default. With the flag disabled there is a small performance impact (measured with extremem; more data will be collected, see above). With the flag enabled there is a larger performance impact because of the large number of clusters, with shared stats updates at the end of each cluster processed. Since we expect the loops in process_clusters() to change in the near future, informed by the learnings from these stats, we expect to work further on reducing the cost of the stats collection as well. Currently the stats are logged per thread at the end of each RS scan. I'm happy to refine both the stats that we collect as well as how frequently we log the data once we have gathered some experience on how we use this.
>>
>> **Details of files changed:**
>>
>> 1. shenandoahGeneration.cpp: add a call to log info at the end of remembered set scan when card stats are enabled.
>> 2. shenandoahHeap.cpp: minor restructuring of a loop for task claiming during update refs; introduce a worker id option to downstream code for card stats
>> 3. shenandoahNumberSeq.cpp: fix a minor issue with a boundary condition check in code that tries to find the right bucket to increment. This was triggering an assert in the update code.
>> 4. shenandoahNumberSeq.hpp: provide missing allocation spec for BinaryMagnitudeSeq
>> 5. shenandoahScanRemembered.cpp: new class ShenandoahCardStats methods. Minor restructure of loop for task claiming during RS scanning (akin to the one for update refs in 2 above).
>> 6. shenandoahScanRemembered.hpp: Diff looks large because of git-diff'ism having issues with indentation change in restructured if-else branches. Not sure how to make the diffs more easily readable. Updated some documentation comments that were slightly obsolete. New class ShenandoahCardStats and implementation of inline methods. Class ShenandoahScanRemembered keeps cumulative running histograms. Remove some inline declarations for larger methods that we shouldn't force inlining on. Update some old comments.
>> 7. shenandoahScanRemembered.inline.hpp: As in 6, diff looks larger than it should because of the same indentation change. ShenandoahScanRemembered::process_clusters() is the method where the instrumentation probes have been inserted. A couple of variables were renamed for clarity, as well as to update local variables rather than method arguments. The large diffs at (old) line 589 onwards are the git-diff'ism to do with indentation change. Delete some unused methods.
>> 8. shenandoah_globals.hpp: new diagnostic flag `ShenandoahEnableCardStats` protects the stats collection code and is disabled by default.
>>
>> **Format of stats produced and how to interpret them: (sample)**
>>
>>
>> [1211.515s][info][gc,task ] GC(7069) Using 10 of 20 workers for Concurrent remembered set scanning
>> [1211.529s][info][gc,remset ] GC(7069) Worker 0 Card Stats Histo:
>> [1211.529s][info][gc,remset ] GC(7069) dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ]
>> [1211.529s][info][gc,remset ] GC(7069) clean_run: [ 0.00 0.00 0.00 99.61 100.00 ]
>> [1211.529s][info][gc,remset ] GC(7069) dirty_cards: [ 0.00 0.00 0.00 0.00 100.00 ]
>> [1211.529s][info][gc,remset ] GC(7069) clean_cards: [ 0.00 99.61 99.61 99.61 100.00 ]
>> [1211.529s][info][gc,remset ] GC(7069) max_dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ]
>> [1211.529s][info][gc,remset ] GC(7069) max_clean_run: [ 0.00 99.61 99.61 99.61 100.00 ]
>> [1211.529s][info][gc,remset ] GC(7069) dirty_objs: [ 0.00 0.00 0.00 0.00 1245.00 ]
>> [1211.529s][info][gc,remset ] GC(7069) clean_objs: [ 0.00 0.00 0.00 0.00 64.00 ]
>> [1211.529s][info][gc,remset ] GC(7069) dirty_scans: [ 0.00 0.00 0.00 0.00 1157.00 ]
>> [1211.529s][info][gc,remset ] GC(7069) clean_scans: [ 0.00 0.00 0.00 0.00 17.00 ]
>> [1211.529s][info][gc,remset ] GC(7069) alternations: [ 0.00 0.00 0.00 0.00 39.00 ]
>> [1211.529s][info][gc,remset ] GC(7069) Worker 1 Card Stats Histo:
>> [1211.529s][info][gc,remset ] GC(7069) dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ]
>> [1211.529s][info][gc,remset ] GC(7069) clean_run: [ 0.00 0.00 0.00 99.61 100.00 ]
>> [1211.529s][info][gc,remset ] GC(7069) dirty_cards: [ 0.00 0.00 0.00 0.00 100.00 ]
>> [1211.529s][info][gc,remset ] GC(7069) clean_cards: [ 0.00 99.61 99.61 99.61 100.00 ]
>> [1211.529s][info][gc,remset ] GC(7069) max_dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ]
>> [1211.529s][info][gc,remset ] GC(7069) max_clean_run: [ 0.00 99.61 99.61 99.61 100.00 ]
>> [1211.529s][info][gc,remset ] GC(7069) dirty_objs: [ 0.00 0.00 0.00 0.00 1257.00 ]
>> [1211.529s][info][gc,remset ] GC(7069) clean_objs: [ 0.00 0.00 0.00 0.00 64.00 ]
>> [1211.529s][info][gc,remset ] GC(7069) dirty_scans: [ 0.00 0.00 0.00 0.00 1197.00 ]
>> [1211.529s][info][gc,remset ] GC(7069) clean_scans: [ 0.00 0.00 0.00 0.00 17.00 ]
>> [1211.529s][info][gc,remset ] GC(7069) alternations: [ 0.00 0.00 0.00 0.00 39.00 ]
>> ...
>>
>>
>> The rows represent the metric that's being tracked, and the columns are, respectively, the minimum, the 3 quartiles (25%, 50%, 75%), and the maximum. The metrics are:
>>
>> - dirty_run: the length of an uninterrupted run of dirty cards, interpreted as a percentage of a chunk of work assignment (cluster) processed by a thread
>> - clean_run: as above, but the length of an uninterrupted run of clean cards
>> - dirty_cards, clean_cards: as above, but counts of cards as a percentage of chunk
>> - max_dirty_run & max_clean_run: similarly, for the maximum of each
>> - dirty_objs, clean_objs: these are numbers of objects in any chunk walked, or scanned
>> - dirty_scans, clean_scans: numbers of objects scanned by the closure
>> - alternations: the number of times that we transitioned from clean to dirty or dirty to clean in a chunk
>>
>> The data above indicates that at least 75% of the chunks have no alternations at all,
>> and cards are almost always mostly clean for this specific benchmark config (extremem).
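A side note on reading these tables: each column is obtained by walking the histogram's bucket counts until the cumulative count reaches the target rank. A generic stand-alone sketch of that lookup follows; the bucket layout here is hypothetical and is not HdrSeq's actual bucket scheme:

#include <cstdio>
#include <cstdint>

// Walk the buckets until the cumulative count crosses p% of the total,
// then report that bucket's representative value.
double percentile(const uint64_t* buckets, const double* bucket_value,
                  int n, uint64_t total, double p) {
  uint64_t target = (uint64_t)(p / 100.0 * total);
  uint64_t seen = 0;
  for (int i = 0; i < n; i++) {
    seen += buckets[i];
    if (seen > target) return bucket_value[i];
  }
  return bucket_value[n - 1];
}

int main() {
  const uint64_t counts[] = { 90, 5, 3, 2 };  // heavily skewed to bucket 0
  const double   values[] = { 0.0, 25.0, 50.0, 100.0 };
  uint64_t total = 100;
  printf("p50=%.2f p99=%.2f\n",
         percentile(counts, values, 4, total, 50.0),   // 0.00
         percentile(counts, values, 4, total, 99.0));  // 100.00
  return 0;
}

A skew like the one in this toy input is exactly what the sample above shows: when most chunks are clean, the minimum and all three quartiles of the dirty metrics sit at zero and only the maximum column is informative.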
>>
>> Comparing worker stats from worker 0 and worker 9 indicates very little difference between
>> their statistics, as one might typically expect for well-balanced RS scans.
>>
>> **Questions:**
>>
>> 1. Would it make sense to also print, for example, the 1, 10, 90 and 99 percentiles for these metrics, in addition to the quartiles?
>> 2. The distributions are per worker for the cumulative history of the run. Would data per RS scan or per Refs Update phase provide more useful information?
>> 3. Any suggestions for a more easily consumable format?
>> 4. I welcome any other feedback on the pull request.
>
> Y. Srinivas Ramakrishna has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 65 commits:
>
>  - Merge branch 'master' into JVM-1264-dependent
>  - Add a previously missed ticket#. Doing it here rather than in parent to
>    avoid an otherwise unnecessary re-review touchpoint.
>  - Merge branch 'stats_merge' into JVM-1264-dependent
>  - Merge branch 'master' into stats_merge
>  - jcheck space fix
>  - Fix compiler error on windows.
>  - Fix some tier1 tests.
>  - Remove an unnecessary include, fix some type incorrectness.
>  - Merge branch 'JVM-1264' into JVM-1264-dependent
>  - Merge branch 'master' into JVM-1264
>  - ... and 55 more: https://git.openjdk.org/shenandoah/compare/bbd4ef34...9c5c741f

Reopening as my dependent PR had an issue with how I had done the PR. Since the parent PR is integrated, I figured it was easiest to update this branch and reopen the original PR. Sorry for the attendant noise.

This PR is now open for review. Thanks!

-------------

PR: https://git.openjdk.org/shenandoah/pull/176

From wkemper at openjdk.org Mon Dec 19 23:46:17 2022
From: wkemper at openjdk.org (William Kemper)
Date: Mon, 19 Dec 2022 23:46:17 GMT
Subject: RFR: JDK-8297796 GenShen: instrument the remembered set scan [v12]
In-Reply-To:
References:
Message-ID:

On Mon, 19 Dec 2022 23:42:04 GMT, Y. Srinivas Ramakrishna wrote:

>> **Note:**
>> This pull request is a draft to share the diffs with the project team. The following additional work is planned before this is ready to commit. (Thanks to Kevin, Roman, William etc. for feedback & suggestions.)
>>
>> (1) Collect performance data from SpecJBB and from the pipeline to assess the impact of instrumentation on concurrent remembered set scanning and concurrent update refs phase durations, in addition to the existing data from Extremem mentioned in the ticket.
>>
>> (2) Make available the instrumentation only in non-product (optimized) mode until better performance is achieved.
>>
>> (3) Any improvements that come from further feedback on this draft (e.g. better or different logging of the metrics data), or other suggestions that I may have missed mentioning above.
>>
>> The fix to ShenandoahNumberSeq will be separated out into its own pull request on mainline.
>>
>> **Summary:**
>> The main change is card stats collection during RS scanning. The code is protected by a new diagnostic flag `ShenandoahEnableCardStats`, which is off by default. With the flag disabled there is a small performance impact (measured with extremem; more data will be collected, see above). With the flag enabled there is a larger performance impact because of the large number of clusters, with shared stats updates at the end of each cluster processed. Since we expect the loops in process_clusters() to change in the near future, informed by the learnings from these stats, we expect to work further on reducing the cost of the stats collection as well. Currently the stats are logged per thread at the end of each RS scan. I'm happy to refine both the stats that we collect as well as how frequently we log the data once we have gathered some experience on how we use this.
>>
>> **Details of files changed:**
>>
>> 1. shenandoahGeneration.cpp: add a call to log info at the end of remembered set scan when card stats are enabled.
>> 2. shenandoahHeap.cpp: minor restructuring of a loop for task claiming during update refs; introduce a worker id option to downstream code for card stats
>> 3. shenandoahNumberSeq.cpp: fix a minor issue with a boundary condition check in code that tries to find the right bucket to increment. This was triggering an assert in the update code.
>> 4. shenandoahNumberSeq.hpp: provide missing allocation spec for BinaryMagnitudeSeq
>> 5. shenandoahScanRemembered.cpp: new class ShenandoahCardStats methods. Minor restructure of loop for task claiming during RS scanning (akin to the one for update refs in 2 above).
>> 6. shenandoahScanRemembered.hpp: Diff looks large because of git-diff'ism having issues with indentation change in restructured if-else branches. Not sure how to make the diffs more easily readable. Updated some documentation comments that were slightly obsolete. New class ShenandoahCardStats and implementation of inline methods. Class ShenandoahScanRemembered keeps cumulative running histograms. Remove some inline declarations for larger methods that we shouldn't force inlining on. Update some old comments.
>> 7. shenandoahScanRemembered.inline.hpp: As in 6, diff looks larger than it should because of the same indentation change. ShenandoahScanRemembered::process_clusters() is the method where the instrumentation probes have been inserted. A couple of variables were renamed for clarity, as well as to update local variables rather than method arguments. The large diffs at (old) line 589 onwards are the git-diff'ism to do with indentation change. Delete some unused methods.
>> 8. shenandoah_globals.hpp: new diagnostic flag `ShenandoahEnableCardStats` protects the stats collection code and is disabled by default.
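For readers skimming the details above, the guard pattern in items 1 and 8 is worth spelling out: with the diagnostic flag off, the only cost on the logging path is a branch. A schematic sketch follows; the flag name mirrors the description, while the types and function bodies are stand-ins rather than the PR's code:

// Stand-in for the diagnostic VM flag described above (off by default).
static bool ShenandoahEnableCardStats = false;

struct WorkerCardStats { /* per-metric histograms would live here */ };

static void log_worker_card_stats(int worker_id, const WorkerCardStats& s) {
  // In the real code this would emit the [gc,remset] histogram lines
  // shown in the sample output below.
  (void)worker_id; (void)s;
}

void maybe_log_card_stats(const WorkerCardStats* stats, int num_workers) {
  if (!ShenandoahEnableCardStats) return;  // common, cheap path
  for (int i = 0; i < num_workers; i++) {
    log_worker_card_stats(i, stats[i]);
  }
}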
>>
>> **Format of stats produced and how to interpret them: (sample)**
>>
>>
>> [1211.515s][info][gc,task ] GC(7069) Using 10 of 20 workers for Concurrent remembered set scanning
>> [1211.529s][info][gc,remset ] GC(7069) Worker 0 Card Stats Histo:
>> [1211.529s][info][gc,remset ] GC(7069) dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ]
>> [1211.529s][info][gc,remset ] GC(7069) clean_run: [ 0.00 0.00 0.00 99.61 100.00 ]
>> [1211.529s][info][gc,remset ] GC(7069) dirty_cards: [ 0.00 0.00 0.00 0.00 100.00 ]
>> [1211.529s][info][gc,remset ] GC(7069) clean_cards: [ 0.00 99.61 99.61 99.61 100.00 ]
>> [1211.529s][info][gc,remset ] GC(7069) max_dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ]
>> [1211.529s][info][gc,remset ] GC(7069) max_clean_run: [ 0.00 99.61 99.61 99.61 100.00 ]
>> [1211.529s][info][gc,remset ] GC(7069) dirty_objs: [ 0.00 0.00 0.00 0.00 1245.00 ]
>> [1211.529s][info][gc,remset ] GC(7069) clean_objs: [ 0.00 0.00 0.00 0.00 64.00 ]
>> [1211.529s][info][gc,remset ] GC(7069) dirty_scans: [ 0.00 0.00 0.00 0.00 1157.00 ]
>> [1211.529s][info][gc,remset ] GC(7069) clean_scans: [ 0.00 0.00 0.00 0.00 17.00 ]
>> [1211.529s][info][gc,remset ] GC(7069) alternations: [ 0.00 0.00 0.00 0.00 39.00 ]
>> [1211.529s][info][gc,remset ] GC(7069) Worker 1 Card Stats Histo:
>> [1211.529s][info][gc,remset ] GC(7069) dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ]
>> [1211.529s][info][gc,remset ] GC(7069) clean_run: [ 0.00 0.00 0.00 99.61 100.00 ]
>> [1211.529s][info][gc,remset ] GC(7069) dirty_cards: [ 0.00 0.00 0.00 0.00 100.00 ]
>> [1211.529s][info][gc,remset ] GC(7069) clean_cards: [ 0.00 99.61 99.61 99.61 100.00 ]
>> [1211.529s][info][gc,remset ] GC(7069) max_dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ]
>> [1211.529s][info][gc,remset ] GC(7069) max_clean_run: [ 0.00 99.61 99.61 99.61 100.00 ]
>> [1211.529s][info][gc,remset ] GC(7069) dirty_objs: [ 0.00 0.00 0.00 0.00 1257.00 ]
>> [1211.529s][info][gc,remset ] GC(7069) clean_objs: [ 0.00 0.00 0.00 0.00 64.00 ]
>> [1211.529s][info][gc,remset ] GC(7069) dirty_scans: [ 0.00 0.00 0.00 0.00 1197.00 ]
>> [1211.529s][info][gc,remset ] GC(7069) clean_scans: [ 0.00 0.00 0.00 0.00 17.00 ]
>> [1211.529s][info][gc,remset ] GC(7069) alternations: [ 0.00 0.00 0.00 0.00 39.00 ]
>> ...
>>
>>
>> The rows represent the metric that's being tracked, and the columns are, respectively, the minimum, the 3 quartiles (25%, 50%, 75%), and the maximum. The metrics are:
>>
>> - dirty_run: the length of an uninterrupted run of dirty cards, interpreted as a percentage of a chunk of work assignment (cluster) processed by a thread
>> - clean_run: as above, but the length of an uninterrupted run of clean cards
>> - dirty_cards, clean_cards: as above, but counts of cards as a percentage of chunk
>> - max_dirty_run & max_clean_run: similarly, for the maximum of each
>> - dirty_objs, clean_objs: these are numbers of objects in any chunk walked, or scanned
>> - dirty_scans, clean_scans: numbers of objects scanned by the closure
>> - alternations: the number of times that we transitioned from clean to dirty or dirty to clean in a chunk
>>
>> The data above indicates that at least 75% of the chunks have no alternations at all,
>> and cards are almost always mostly clean for this specific benchmark config (extremem).
>>
>> Comparing worker stats from worker 0 and worker 9 indicates very little difference between
>> their statistics, as one might typically expect for well-balanced RS scans.
>>
>> **Questions:**
>>
>> 1. Would it make sense to also print, for example, the 1, 10, 90 and 99 percentiles for these metrics, in addition to the quartiles?
>> 2. The distributions are per worker for the cumulative history of the run. Would data per RS scan or per Refs Update phase provide more useful information?
>> 3. Any suggestions for a more easily consumable format?
>> 4. I welcome any other feedback on the pull request.
>
> Y. Srinivas Ramakrishna has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 65 commits:
>
>  - Merge branch 'master' into JVM-1264-dependent
>  - Add a previously missed ticket#. Doing it here rather than in parent to
>    avoid an otherwise unnecessary re-review touchpoint.
>  - Merge branch 'stats_merge' into JVM-1264-dependent
>  - Merge branch 'master' into stats_merge
>  - jcheck space fix
>  - Fix compiler error on windows.
>  - Fix some tier1 tests.
>  - Remove an unnecessary include, fix some type incorrectness.
>  - Merge branch 'JVM-1264' into JVM-1264-dependent
>  - Merge branch 'master' into JVM-1264
>  - ... and 55 more: https://git.openjdk.org/shenandoah/compare/bbd4ef34...9c5c741f

src/hotspot/share/gc/shenandoah/shenandoahCardStats.cpp line 57:

> 55:   if (record) {
> 56:     // Update global stats for distribution of dirty/clean card %ge
> 57:     _local_card_stats[DIRTY_CARDS].add((double)_dirty_card_cnt*100/(double)_cards_in_cluster);

typo? `%` -> `a`

-------------

PR: https://git.openjdk.org/shenandoah/pull/176

From ysr at openjdk.org Mon Dec 19 23:50:18 2022
From: ysr at openjdk.org (Y. Srinivas Ramakrishna)
Date: Mon, 19 Dec 2022 23:50:18 GMT
Subject: RFR: JDK-8297796 GenShen: instrument the remembered set scan [v12]
In-Reply-To:
References:
Message-ID:

On Mon, 19 Dec 2022 23:43:38 GMT, William Kemper wrote:

>> Y. Srinivas Ramakrishna has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 65 commits:
>>
>>  - Merge branch 'master' into JVM-1264-dependent
>>  - Add a previously missed ticket#. Doing it here rather than in parent to
>>    avoid an otherwise unnecessary re-review touchpoint.
>>  - Merge branch 'stats_merge' into JVM-1264-dependent
>>  - Merge branch 'master' into stats_merge
>>  - jcheck space fix
>>  - Fix compiler error on windows.
>>  - Fix some tier1 tests.
>>  - Remove an unnecessary include, fix some type incorrectness.
>>  - Merge branch 'JVM-1264' into JVM-1264-dependent
>>  - Merge branch 'master' into JVM-1264
>>  - ... and 55 more: https://git.openjdk.org/shenandoah/compare/bbd4ef34...9c5c741f
>
> src/hotspot/share/gc/shenandoah/shenandoahCardStats.cpp line 57:
>
>> 55:   if (record) {
>> 56:     // Update global stats for distribution of dirty/clean card %ge
>> 57:     _local_card_stats[DIRTY_CARDS].add((double)_dirty_card_cnt*100/(double)_cards_in_cluster);
>
> typo? `%` -> `a`

I mean percentage where I said `%ge`. I'll clarify the comments a bit more. Please continue the review and I'll improve some of the documentation comments for clarity.
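For the record, the quantity on the line under discussion is dirty cards as a percentage of the cards in one cluster, which is what gets added to the running distribution. A sketch of the computation, with a guard for the degenerate empty-cluster case, might look like this (hypothetical free function, not the PR's code):

#include <cstddef>
#include <cstdio>

double dirty_card_percentage(size_t dirty_card_cnt, size_t cards_in_cluster) {
  if (cards_in_cluster == 0) return 0.0;  // avoid division by zero
  return (double)dirty_card_cnt * 100.0 / (double)cards_in_cluster;
}

int main() {
  printf("%.2f\n", dirty_card_percentage(13, 64));  // 20.31
  return 0;
}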
------------- PR: https://git.openjdk.org/shenandoah/pull/176 From wkemper at openjdk.org Mon Dec 19 23:50:18 2022 From: wkemper at openjdk.org (William Kemper) Date: Mon, 19 Dec 2022 23:50:18 GMT Subject: RFR: JDK-8297796 GenShen: instrument the remembered set scan [v12] In-Reply-To: References: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com> Message-ID: <_a7yKm6N4ztQ-u70-Ds38Vg2_AepwHAPQxWfZHsTyOY=.3f7615c3-6b02-4915-aa58-b7ebcbdb2b56@github.com> On Mon, 19 Dec 2022 23:42:04 GMT, Y. Srinivas Ramakrishna wrote: >> **Note:** >> This pull request is a draft to share the diffs with the project team. The following additional work is planned before this is ready to commit. (Thanks to Kevin, Roman, William etc. for feedback & suggestions.) >> >> (1) Collect performance data from SpecJBB and from the pipeline to assess the impact of instrumentation on concurrent remembered set scanning and concurrent update refs phase durations, in addition to the existing data from Extermem mentioned in the ticket. >> >> (2) Make available the instrumentation only in non-product (optimized) mode until better performance is achieved. >> >> (3) Any improvements that come from further feedback on this draft (e.g. better or different logging of the metrics data), or other suggestions that I may have missed mentioning above. >> >> The fix to ShenandoahNumberSeq will be separated out and made into a separate pull request on mainline. >> >> **Summary:** >> The main change is card stats collection during RS scanning. The code is protected by a new diagnostic flag `ShenandoahEnableCardStats`, which is off by default. With the flag disabled there is a small performance impact (measured with extremem; more data will be collected, see above). With the flag enabled there is a larger performance impact because of the large number of clusters, with shared stats updates at the end of each cluster processed. Since we expect the loops in process_clusters() to change in the near future, informed by the learnings from these stats, we expect to work further on reducing the cost of the stats collection as well. Currently the stats are logged per thread at the end of each RS scan. I'm happy to refine both the stats that we collect as well as how frequently we log the data once we have gathered some experience on how we use this. >> >> **Details of files changed:** >> >> 1. shenandoahGeneration.cpp: add a call to log info at the end of remembered set scan when card stats are enabled. >> 2. shenandoahHeap.cpp: minor retsructuring of a loop for task claiming during update refs; introduce a worker id option to downstream code for card stats >> 3. shenandoahNumberSeq.cpp: fix a minor issue with a boundary condition check in code that tries to find the right bucket to increment. This was triggering an assert in the update code. >> 4. shenandoahNumberSeq.hpp: provide missing allocation spec for BinaryMagnitudeSeq >> 5. shenandoahScanRemembered.cpp: new class ShenandoahCardStats methods. Minor restructure of loop for task claiming during RS scanning (akin to the one for update refs in 2 above). >> 6. shenandoahScanRemembered.hpp: Diff looks large because of git-diff'ism having issues with indentation change in restructured if-else branches. Not sure how to make the diffs more easily readable. Updated some documentation comments that were slightly obsolete. New class ShenandoahCardStats and implementation of inline methods. Class ShenandoahScanRemembered keeps cumuative running histograms. 
Remove some inline declarations for larger methods that we shouldn't force inlining on. Update some old comments. >> 7. shenandoahScanRemembered.inline.hpp: As in 6, diff looks larger than it should because of the same indentation change. ShenandoahScanRemembered::process_clusters() is the method where the instrumentation probes have been inserted. A couple of variables were renamed for clarity, as well as ti update local variables rather than method arguments. The large diffs at (old) line 589 onwards is the git-diff'ism to do with indentation change. Delete some unused methods. >> 8. shenandoah_globals.hpp: new diagnostic flag `ShenandoahEnableCardStats` protects the stats collection code and is disabled by default. >> >> **Format of stats produced and how to interpret them: (sample)** >> >> >> [1211.515s][info][gc,task ] GC(7069) Using 10 of 20 workers for Concurrent remembered set scanning >> [1211.529s][info][gc,remset ] GC(7069) Worker 0 Card Stats Histo: >> [1211.529s][info][gc,remset ] GC(7069) dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) clean_run: [ 0.00 0.00 0.00 99.61 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) dirty_cards: [ 0.00 0.00 0.00 0.00 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) clean_cards: [ 0.00 99.61 99.61 99.61 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) max_dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) max_clean_run: [ 0.00 99.61 99.61 99.61 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) dirty_objs: [ 0.00 0.00 0.00 0.00 1245.00 ] >> [1211.529s][info][gc,remset ] GC(7069) clean_objs: [ 0.00 0.00 0.00 0.00 64.00 ] >> [1211.529s][info][gc,remset ] GC(7069) dirty_scans: [ 0.00 0.00 0.00 0.00 1157.00 ] >> [1211.529s][info][gc,remset ] GC(7069) clean_scans: [ 0.00 0.00 0.00 0.00 17.00 ] >> [1211.529s][info][gc,remset ] GC(7069) alternations: [ 0.00 0.00 0.00 0.00 39.00 ] >> [1211.529s][info][gc,remset ] GC(7069) Worker 1 Card Stats Histo: >> [1211.529s][info][gc,remset ] GC(7069) dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) clean_run: [ 0.00 0.00 0.00 99.61 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) dirty_cards: [ 0.00 0.00 0.00 0.00 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) clean_cards: [ 0.00 99.61 99.61 99.61 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) max_dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) max_clean_run: [ 0.00 99.61 99.61 99.61 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) dirty_objs: [ 0.00 0.00 0.00 0.00 1257.00 ] >> [1211.529s][info][gc,remset ] GC(7069) clean_objs: [ 0.00 0.00 0.00 0.00 64.00 ] >> [1211.529s][info][gc,remset ] GC(7069) dirty_scans: [ 0.00 0.00 0.00 0.00 1197.00 ] >> [1211.529s][info][gc,remset ] GC(7069) clean_scans: [ 0.00 0.00 0.00 0.00 17.00 ] >> [1211.529s][info][gc,remset ] GC(7069) alternations: [ 0.00 0.00 0.00 0.00 39.00 ] >> ... >> >> >> The rows represent the metric that's being tracked, and the columns are, respectively, minimum, the 3 quartiles (25%, 50%, 75%) and the maximum. The metrics are: >> >> - dirty_run: the length of an uninterrupted run of dirty cards, interpretedas a percentage of a chunk of work assignment (cluster) processed by a thread >> - clean_run: as above, but the length of an uninterrupted run of clean cards >> - dirty_cards, clean_cards: as above, but counts of cards as a percentage of chunk >> - max_dirty_run & max_clean_run: Similarly for the maximum of each. 
>> - dirty_objs, clean_objs: these are numbers of objects in any chunk walked, or scanned >> - dirty_scans, clean_scans: numbers of objects scanned by the closure >> - alternations: the number of times that we transitioned from clean to dirty or dirty to clean in a chunk >> >> The data above indicates that at least 75% of the chunks have no alternations at all, >> and cards are almost always mostly clean for this specific benchmark config (extremem). >> >> Comparing worker stats from worker 0 and worker 9 indicates very little difference between >> their statistics, as one might typically expect for well-balanced RS scans. >> >> **Questions:** >> >> 1. Would it make sense to print also, for example, the 1, 10, 90 and 99 percentiles for these metrics as well, in addition to the quartiles? >> 2. The distributions are per worker for the cumulative history of the run. Would data per RS scan or per Refs Update phase provide more useful information? >> 3. Any suggestions for a more easily consumable format? >> 4. I welcome any other feedback on the pull request. > > Y. Srinivas Ramakrishna has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 65 commits: > > - Merge branch 'master' into JVM-1264-dependent > - Add a previously missed ticket#. Doing it here rather than in parent to > avoid an otherwise unnecessary re-review touchpoint. > - Merge branch 'stats_merge' into JVM-1264-dependent > - Merge branch 'master' into stats_merge > - jcheck space fix > - Fix compiler error on windows. > - Fix some tier1 tests. > - Remove an unnecessary include, fix some type incorrectness. > - Merge branch 'JVM-1264' into JVM-1264-dependent > - Merge branch 'master' into JVM-1264 > - ... and 55 more: https://git.openjdk.org/shenandoah/compare/bbd4ef34...9c5c741f src/hotspot/share/gc/shenandoah/shenandoahCardStats.hpp line 61: > 59: _cards_in_cluster(cards_in_cluster), > 60: _local_card_stats(card_stats), > 61: _last_dirty(false), Should it always be the case that `_last_dirty != _last_clean`? Could we use one variable here instead of two? ------------- PR: https://git.openjdk.org/shenandoah/pull/176 From wkemper at openjdk.org Mon Dec 19 23:59:13 2022 From: wkemper at openjdk.org (William Kemper) Date: Mon, 19 Dec 2022 23:59:13 GMT Subject: RFR: JDK-8297796 GenShen: instrument the remembered set scan [v12] In-Reply-To: References: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com> Message-ID: On Mon, 19 Dec 2022 23:47:22 GMT, Y. Srinivas Ramakrishna wrote: >> src/hotspot/share/gc/shenandoah/shenandoahCardStats.cpp line 57: >> >>> 55: if (record) { >>> 56: // Update global stats for distribution of dirty/clean card %ge >>> 57: _local_card_stats[DIRTY_CARDS].add((double)_dirty_card_cnt*100/(double)_cards_in_cluster); >> >> typo? `%` -> `a` > > I mean percentage where I said `%ge`. I'll clarify the comments a bit more. Please continue the review and I'll improve some of the documentation comments for clarity. Got it - I had `age` on my brain. ------------- PR: https://git.openjdk.org/shenandoah/pull/176 From ysr at openjdk.org Mon Dec 19 23:59:15 2022 From: ysr at openjdk.org (Y. 
Srinivas Ramakrishna) Date: Mon, 19 Dec 2022 23:59:15 GMT Subject: RFR: JDK-8297796 GenShen: instrument the remembered set scan [v12] In-Reply-To: <_a7yKm6N4ztQ-u70-Ds38Vg2_AepwHAPQxWfZHsTyOY=.3f7615c3-6b02-4915-aa58-b7ebcbdb2b56@github.com> References: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com> <_a7yKm6N4ztQ-u70-Ds38Vg2_AepwHAPQxWfZHsTyOY=.3f7615c3-6b02-4915-aa58-b7ebcbdb2b56@github.com> Message-ID: On Mon, 19 Dec 2022 23:46:19 GMT, William Kemper wrote: >> Y. Srinivas Ramakrishna has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 65 commits: >> >> - Merge branch 'master' into JVM-1264-dependent >> - Add a previously missed ticket#. Doing it here rather than in parent to >> avoid an otherwise unnecessary re-review touchpoint. >> - Merge branch 'stats_merge' into JVM-1264-dependent >> - Merge branch 'master' into stats_merge >> - jcheck space fix >> - Fix compiler error on windows. >> - Fix some tier1 tests. >> - Remove an unnecessary include, fix some type incorrectness. >> - Merge branch 'JVM-1264' into JVM-1264-dependent >> - Merge branch 'master' into JVM-1264 >> - ... and 55 more: https://git.openjdk.org/shenandoah/compare/bbd4ef34...9c5c741f > > src/hotspot/share/gc/shenandoah/shenandoahCardStats.hpp line 61: > >> 59: _cards_in_cluster(cards_in_cluster), >> 60: _local_card_stats(card_stats), >> 61: _last_dirty(false), > > Should it always be the case that `_last_dirty != _last_clean`? Could we use one variable here instead of two? I believe we switch into one of two modes based on the first card we encounter. So there are 3 states: an initial (neither), and then subsequently either dirty or clean. So there are 3 states, which is 2 bits. It's possible I could shrink it to 2 bits by some cleverness, but figured I wouldn't try too hard as this is still all non-product. I'll think some more about it. ------------- PR: https://git.openjdk.org/shenandoah/pull/176 From wkemper at openjdk.org Mon Dec 19 23:59:15 2022 From: wkemper at openjdk.org (William Kemper) Date: Mon, 19 Dec 2022 23:59:15 GMT Subject: RFR: JDK-8297796 GenShen: instrument the remembered set scan [v12] In-Reply-To: References: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com> Message-ID: On Mon, 19 Dec 2022 23:42:04 GMT, Y. Srinivas Ramakrishna wrote: >> **Note:** >> This pull request is a draft to share the diffs with the project team. The following additional work is planned before this is ready to commit. (Thanks to Kevin, Roman, William etc. for feedback & suggestions.) >> >> (1) Collect performance data from SpecJBB and from the pipeline to assess the impact of instrumentation on concurrent remembered set scanning and concurrent update refs phase durations, in addition to the existing data from Extermem mentioned in the ticket. >> >> (2) Make available the instrumentation only in non-product (optimized) mode until better performance is achieved. >> >> (3) Any improvements that come from further feedback on this draft (e.g. better or different logging of the metrics data), or other suggestions that I may have missed mentioning above. >> >> The fix to ShenandoahNumberSeq will be separated out and made into a separate pull request on mainline. >> >> **Summary:** >> The main change is card stats collection during RS scanning. The code is protected by a new diagnostic flag `ShenandoahEnableCardStats`, which is off by default. 
With the flag disabled there is a small performance impact (measured with extremem; more data will be collected, see above). With the flag enabled there is a larger performance impact because of the large number of clusters, with shared stats updates at the end of each cluster processed. Since we expect the loops in process_clusters() to change in the near future, informed by the learnings from these stats, we expect to work further on reducing the cost of the stats collection as well. Currently the stats are logged per thread at the end of each RS scan. I'm happy to refine both the stats that we collect as well as how frequently we log the data once we have gathered some experience on how we use this. >> >> **Details of files changed:** >> >> 1. shenandoahGeneration.cpp: add a call to log info at the end of remembered set scan when card stats are enabled. >> 2. shenandoahHeap.cpp: minor restructuring of a loop for task claiming during update refs; introduce a worker id option to downstream code for card stats >> 3. shenandoahNumberSeq.cpp: fix a minor issue with a boundary condition check in code that tries to find the right bucket to increment. This was triggering an assert in the update code. >> 4. shenandoahNumberSeq.hpp: provide missing allocation spec for BinaryMagnitudeSeq >> 5. shenandoahScanRemembered.cpp: methods of the new class ShenandoahCardStats. Minor restructure of loop for task claiming during RS scanning (akin to the one for update refs in 2 above). >> 6. shenandoahScanRemembered.hpp: Diff looks large because of git-diff'ism having issues with indentation change in restructured if-else branches. Not sure how to make the diffs more easily readable. Updated some documentation comments that were slightly obsolete. New class ShenandoahCardStats and implementation of inline methods. Class ShenandoahScanRemembered keeps cumulative running histograms. Remove some inline declarations for larger methods that we shouldn't force inlining on. Update some old comments. >> 7. shenandoahScanRemembered.inline.hpp: As in 6, the diff looks larger than it should because of the same indentation change. ShenandoahScanRemembered::process_clusters() is the method where the instrumentation probes have been inserted. A couple of variables were renamed for clarity, as well as to update local variables rather than method arguments. The large diffs at (old) line 589 onwards are the git-diff'ism to do with the indentation change. Delete some unused methods. >> 8. shenandoah_globals.hpp: new diagnostic flag `ShenandoahEnableCardStats` protects the stats collection code and is disabled by default.
>> >> **Format of stats produced and how to interpret them: (sample)** >> >> >> [1211.515s][info][gc,task ] GC(7069) Using 10 of 20 workers for Concurrent remembered set scanning >> [1211.529s][info][gc,remset ] GC(7069) Worker 0 Card Stats Histo: >> [1211.529s][info][gc,remset ] GC(7069) dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) clean_run: [ 0.00 0.00 0.00 99.61 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) dirty_cards: [ 0.00 0.00 0.00 0.00 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) clean_cards: [ 0.00 99.61 99.61 99.61 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) max_dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) max_clean_run: [ 0.00 99.61 99.61 99.61 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) dirty_objs: [ 0.00 0.00 0.00 0.00 1245.00 ] >> [1211.529s][info][gc,remset ] GC(7069) clean_objs: [ 0.00 0.00 0.00 0.00 64.00 ] >> [1211.529s][info][gc,remset ] GC(7069) dirty_scans: [ 0.00 0.00 0.00 0.00 1157.00 ] >> [1211.529s][info][gc,remset ] GC(7069) clean_scans: [ 0.00 0.00 0.00 0.00 17.00 ] >> [1211.529s][info][gc,remset ] GC(7069) alternations: [ 0.00 0.00 0.00 0.00 39.00 ] >> [1211.529s][info][gc,remset ] GC(7069) Worker 1 Card Stats Histo: >> [1211.529s][info][gc,remset ] GC(7069) dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) clean_run: [ 0.00 0.00 0.00 99.61 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) dirty_cards: [ 0.00 0.00 0.00 0.00 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) clean_cards: [ 0.00 99.61 99.61 99.61 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) max_dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) max_clean_run: [ 0.00 99.61 99.61 99.61 100.00 ] >> [1211.529s][info][gc,remset ] GC(7069) dirty_objs: [ 0.00 0.00 0.00 0.00 1257.00 ] >> [1211.529s][info][gc,remset ] GC(7069) clean_objs: [ 0.00 0.00 0.00 0.00 64.00 ] >> [1211.529s][info][gc,remset ] GC(7069) dirty_scans: [ 0.00 0.00 0.00 0.00 1197.00 ] >> [1211.529s][info][gc,remset ] GC(7069) clean_scans: [ 0.00 0.00 0.00 0.00 17.00 ] >> [1211.529s][info][gc,remset ] GC(7069) alternations: [ 0.00 0.00 0.00 0.00 39.00 ] >> ... >> >> >> The rows represent the metric that's being tracked, and the columns are, respectively, minimum, the 3 quartiles (25%, 50%, 75%) and the maximum. The metrics are: >> >> - dirty_run: the length of an uninterrupted run of dirty cards, interpreted as a percentage of a chunk of work assignment (cluster) processed by a thread >> - clean_run: as above, but the length of an uninterrupted run of clean cards >> - dirty_cards, clean_cards: as above, but counts of cards as a percentage of chunk >> - max_dirty_run & max_clean_run: Similarly for the maximum of each. >> - dirty_objs, clean_objs: these are numbers of objects in any chunk walked, or scanned >> - dirty_scans, clean_scans: numbers of objects scanned by the closure >> - alternations: the number of times that we transitioned from clean to dirty or dirty to clean in a chunk >> >> The data above indicates that at least 75% of the chunks have no alternations at all, >> and cards are almost always mostly clean for this specific benchmark config (extremem). >> >> Comparing worker stats from worker 0 and worker 9 indicates very little difference between >> their statistics, as one might typically expect for well-balanced RS scans. >> >> **Questions:** >> >> 1.
Would it make sense to print also, for example, the 1, 10, 90 and 99 percentiles for these metrics as well, in addition to the quartiles? >> 2. The distributions are per worker for the cumulative history of the run. Would data per RS scan or per Refs Update phase provide more useful information? >> 3. Any suggestions for a more easily consumable format? >> 4. I welcome any other feedback on the pull request. > Y. Srinivas Ramakrishna has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 65 commits: > - Merge branch 'master' into JVM-1264-dependent > - Add a previously missed ticket#. Doing it here rather than in parent to > avoid an otherwise unnecessary re-review touchpoint. > - Merge branch 'stats_merge' into JVM-1264-dependent > - Merge branch 'master' into stats_merge > - jcheck space fix > - Fix compiler error on windows. > - Fix some tier1 tests. > - Remove an unnecessary include, fix some type incorrectness. > - Merge branch 'JVM-1264' into JVM-1264-dependent > - Merge branch 'master' into JVM-1264 > - ... and 55 more: https://git.openjdk.org/shenandoah/compare/bbd4ef34...9c5c741f src/hotspot/share/gc/shenandoah/shenandoah_globals.hpp line 548: > 546: "Enable statistics collection related to clean & dirty cards") \ > 547: \ > 548: notproduct(int, ShenandoahCardStatsLogInterval, 50, \ This isn't really cycles right? It's number of workers that completed a card scan? ------------- PR: https://git.openjdk.org/shenandoah/pull/176 From ysr at openjdk.org Tue Dec 20 00:27:17 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Tue, 20 Dec 2022 00:27:17 GMT Subject: RFR: JDK-8297796 GenShen: instrument the remembered set scan [v12] In-Reply-To: References: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com> Message-ID: On Mon, 19 Dec 2022 23:54:52 GMT, William Kemper wrote: >> Y. Srinivas Ramakrishna has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 65 commits: >> >> - Merge branch 'master' into JVM-1264-dependent >> - Add a previously missed ticket#. Doing it here rather than in parent to >> avoid an otherwise unnecessary re-review touchpoint. >> - Merge branch 'stats_merge' into JVM-1264-dependent >> - Merge branch 'master' into stats_merge >> - jcheck space fix >> - Fix compiler error on windows. >> - Fix some tier1 tests. >> - Remove an unnecessary include, fix some type incorrectness. >> - Merge branch 'JVM-1264' into JVM-1264-dependent >> - Merge branch 'master' into JVM-1264 >> - ... and 55 more: https://git.openjdk.org/shenandoah/compare/bbd4ef34...9c5c741f > > src/hotspot/share/gc/shenandoah/shenandoah_globals.hpp line 548: > >> 546: "Enable statistics collection related to clean & dirty cards") \ >> 547: \ >> 548: notproduct(int, ShenandoahCardStatsLogInterval, 50, \ > > This isn't really cycles right? It's number of workers that completed a card scan? It's a number of card-scan rounds, independently for either RS scan or Update refs, where a round consists of a RS cycle or Update refs cycle by however many worker threads participate. The logging is oblivious to workers; it simply uses whatever number of workers was used. Let me attach an example log in the PR summary to illustrate, but roughly speaking, it's as follows: ... ... ... ... Every `ShenandoahCardStatsLogInterval` such rounds, we also produce a cumulative historical log across all workers and rounds to date, but one each for RS and UR.
Let me know if that makes sense, and if "cycles" makes sense in the documentation for what I have called "rounds" above. ------------- PR: https://git.openjdk.org/shenandoah/pull/176 From ysr at openjdk.org Tue Dec 20 00:36:14 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Tue, 20 Dec 2022 00:36:14 GMT Subject: RFR: JDK-8297796 GenShen: instrument the remembered set scan [v12] In-Reply-To: References: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com> Message-ID: On Tue, 20 Dec 2022 00:24:53 GMT, Y. Srinivas Ramakrishna wrote: >> src/hotspot/share/gc/shenandoah/shenandoah_globals.hpp line 548: >> >>> 546: "Enable statistics collection related to clean & dirty cards") \ >>> 547: \ >>> 548: notproduct(int, ShenandoahCardStatsLogInterval, 50, \ >> >> This isn't really cycles right? It's number of workers that completed a card scan? > > It's a number of card-scan rounds, independently for either RS scan or Update refs, where a round consists of a RS cycle or Update refs cycle by however many worker threads participate. The logging is oblivious to workers; it simply uses whatever number of workers was used. > > Let me attach an example log in the PR summary to illustrate, but roughly speaking, it's as follows: > > (start of remembered set (RS) scan) > (end of remembered set scan) > > (log of card stats for this round of RS by worker #1) > ... > (log of card stats for this round by RS worker #k1) > ... > (start of update refs (UR) scan) > > (end of update refs scan) > > (log of card stats for this round of UR by worker #1) > ... > (log of card stats for this round of UR by worker #k2) > ... > > Every `ShenandoahCardStatsLogInterval` such rounds, in addition to the per round, per worker stats like we did above, we also produce cumulative statistics across all workers and all rounds to date, but one each for RS and UR. > > Let me know if that makes sense, and if "cycles" makes sense in the documentation for what I have called "rounds" above. See https://github.com/openjdk/shenandoah/pull/176#issuecomment-1342840919. ------------- PR: https://git.openjdk.org/shenandoah/pull/176 From ysr at openjdk.org Tue Dec 20 00:54:22 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Tue, 20 Dec 2022 00:54:22 GMT Subject: RFR: Initial sizing refactor In-Reply-To: References: Message-ID: On Fri, 16 Dec 2022 00:36:41 GMT, William Kemper wrote: > Some things to highlight here: > * This change borrows a bit of code from G1 to handle processing of command line arguments used to size the young generation. > * A (hard coded for now) threshold on the difference between young/old time has been added to reduce resizing churn. > * The adaptive heuristic doesn't consider the `soft_tail` anymore. `available` is already adjusted for the soft max capacity. > * `SoftMaxHeapSize` is used to compute the soft max size and max size for the young generation. src/hotspot/share/gc/shenandoah/shenandoahGeneration.cpp line 995: > 993: shenandoah_assert_heaplocked_or_safepoint(); > 994: #ifdef ASSERT > 995: if (generation_mode() == YOUNG) { Why the special treatment of young here and in the next method? Is that the only one where max capacity matters? I might have expected an assertion oblivious to the youth of a generation, which would simply check upon an increment or a decrement that the floor and ceiling (min and max) capacities of that generation were being respected, irrespective of whether it was a young or old generation?
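Illustratively, such a generation-agnostic check might look something like the following sketch (the names assert_capacity_bounds(), min_capacity() and max_capacity() are hypothetical here, not necessarily what this patch uses):

    // Hypothetical sketch only: bounds-check any capacity change the same way
    // for young, old and global generations alike.
    void ShenandoahGeneration::assert_capacity_bounds(size_t proposed) {
      shenandoah_assert_heaplocked_or_safepoint();
      assert(proposed >= min_capacity(), "generation capacity below its floor");
      assert(proposed <= max_capacity(), "generation capacity above its ceiling");
    }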
------------- PR: https://git.openjdk.org/shenandoah/pull/185 From ysr at openjdk.org Tue Dec 20 01:02:15 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Tue, 20 Dec 2022 01:02:15 GMT Subject: RFR: Initial sizing refactor In-Reply-To: References: Message-ID: On Fri, 16 Dec 2022 00:36:41 GMT, William Kemper wrote: > Some things to highlight here: > * This change borrows a bit of code from G1 to handle processing of command line arguments used to size the young generation. > * A (hard coded for now) threshold on the difference between young/old time has been added to reduce resizing churn. > * The adaptive heuristic doesn't consider the `soft_tail` anymore. `available` is already adjusted for the soft max capacity. > * `SoftMaxHeapSize` is used to compute the soft max size and max size for the young generation. src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp line 459: > 457: _young_generation = new ShenandoahYoungGeneration(_max_workers, max_capacity_young, initial_capacity_young); > 458: _old_generation = new ShenandoahOldGeneration(_max_workers, max_capacity_old, initial_capacity_old); > 459: _global_generation = new ShenandoahGlobalGeneration(_max_workers, soft_max_capacity(), soft_max_capacity()); A single line of comment would be helpful here. It sounds as if the idea is that for the so-called global generation (which I assume is identified with the entirety of the committed heap at any time), the max and initial (ceiling and floor) are both set at `soft_max_capacity`? What does that mean? I might have naively expected these to be, respectively, `max_old + max_young` and `initial_old + initial_young` like you had it before. ------------- PR: https://git.openjdk.org/shenandoah/pull/185 From ysr at openjdk.org Tue Dec 20 01:10:15 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Tue, 20 Dec 2022 01:10:15 GMT Subject: RFR: Initial sizing refactor In-Reply-To: References: Message-ID: On Fri, 16 Dec 2022 00:36:41 GMT, William Kemper wrote: > Some things to highlight here: > * This change borrows a bit of code from G1 to handle processing of command line arguments used to size the young generation. > * A (hard coded for now) threshold on the difference between young/old time has been added to reduce resizing churn. > * The adaptive heuristic doesn't consider the `soft_tail` anymore. `available` is already adjusted for the soft max capacity. > * `SoftMaxHeapSize` is used to compute the soft max size and max size for the young generation. Overall I really like these refactorings/changes. I've done a quick overview review and left a few comments/suggestions, but will work through some of the remaining details tomorrow. Thanks! ------------- PR: https://git.openjdk.org/shenandoah/pull/185 From ysr at openjdk.org Tue Dec 20 01:15:14 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Tue, 20 Dec 2022 01:15:14 GMT Subject: RFR: Initial sizing refactor In-Reply-To: References: Message-ID: On Fri, 16 Dec 2022 00:36:41 GMT, William Kemper wrote: > Some things to highlight here: > * This change borrows a bit of code from G1 to handle processing of command line arguments used to size the young generation. > * A (hard coded for now) threshold on the difference between young/old time has been added to reduce resizing churn. > * The adaptive heuristic doesn't consider the `soft_tail` anymore. `available` is already adjusted for the soft max capacity. > * `SoftMaxHeapSize` is used to compute the soft max size and max size for the young generation.
src/hotspot/share/gc/shenandoah/shenandoahMmuTracker.cpp line 85: > 83: > 84: void ShenandoahMmuTracker::record(ShenandoahGeneration* generation) { > 85: // This is only called by the control thread or the VM thread. Would it be worthwhile asserting the calling thread's identity here, just to catch the problem early if someone were to try to do it from a different thread in the future? (although I can't imagine why anyone would.) ------------- PR: https://git.openjdk.org/shenandoah/pull/185 From eosterlund at openjdk.org Tue Dec 20 07:12:29 2022 From: eosterlund at openjdk.org (Erik Österlund) Date: Tue, 20 Dec 2022 07:12:29 GMT Subject: RFR: 8299072: java_lang_ref_Reference::clear_referent should be GC agnostic Message-ID: The current java_lang_ref_Reference::clear_referent implementation performs a raw reference clear. That doesn't work well with upcoming GC algorithms. It should be made GC agnostic by going through the normal access API. ------------- Commit messages: - 8299072: java_lang_ref_Reference::clear_referent should be GC agnostic Changes: https://git.openjdk.org/jdk/pull/11736/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11736&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8299072 Stats: 8 lines in 5 files changed: 5 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/11736.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11736/head:pull/11736 PR: https://git.openjdk.org/jdk/pull/11736 From dholmes at openjdk.org Tue Dec 20 08:15:50 2022 From: dholmes at openjdk.org (David Holmes) Date: Tue, 20 Dec 2022 08:15:50 GMT Subject: RFR: 8299072: java_lang_ref_Reference::clear_referent should be GC agnostic In-Reply-To: References: Message-ID: <8je3w2XaNdQEAKx0lLHp2T2UXkOUqEV0ks-2TFL2AJE=.fbf307d0-9833-4465-a914-3a7e9f05d12b@github.com>
------------- PR: https://git.openjdk.org/jdk/pull/11736 From kbarrett at openjdk.org Tue Dec 20 09:15:49 2022 From: kbarrett at openjdk.org (Kim Barrett) Date: Tue, 20 Dec 2022 09:15:49 GMT Subject: RFR: 8299072: java_lang_ref_Reference::clear_referent should be GC agnostic In-Reply-To: References: Message-ID: On Tue, 20 Dec 2022 07:05:34 GMT, Erik ?sterlund wrote: > The current java_lang_ref_Reference::clear_referent implementation performs a raw reference clear. That doesn't work well with upcoming GC algorithms. It should be made GC agnostic by going through the normal access API. Looks good. Mea culpa I think. ------------- PR: https://git.openjdk.org/jdk/pull/11736 From eosterlund at openjdk.org Tue Dec 20 09:20:49 2022 From: eosterlund at openjdk.org (Erik =?UTF-8?B?w5ZzdGVybHVuZA==?=) Date: Tue, 20 Dec 2022 09:20:49 GMT Subject: RFR: 8299072: java_lang_ref_Reference::clear_referent should be GC agnostic In-Reply-To: References: Message-ID: On Tue, 20 Dec 2022 09:12:48 GMT, Kim Barrett wrote: >> The current java_lang_ref_Reference::clear_referent implementation performs a raw reference clear. That doesn't work well with upcoming GC algorithms. It should be made GC agnostic by going through the normal access API. > > Looks good. Mea culpa I think. Thanks for the review, @kimbarrett! ------------- PR: https://git.openjdk.org/jdk/pull/11736 From wkemper at openjdk.org Tue Dec 20 19:19:23 2022 From: wkemper at openjdk.org (William Kemper) Date: Tue, 20 Dec 2022 19:19:23 GMT Subject: RFR: Initial sizing refactor In-Reply-To: References: Message-ID: <0aHvoWKYrsxlTTZzxXIC6phNFpSSFvC6U28Wtn5OKA8=.3a0050b6-b648-47c3-a4f3-93fbaae3b222@github.com> On Tue, 20 Dec 2022 00:59:34 GMT, Y. Srinivas Ramakrishna wrote: >> Some things to highlight here: >> * This change borrows a bit of code from G1 to handle processing of command line arguments used to size the young generation. >> * A (hard coded for now) threshold on the difference between young/old time has been added to reduce resizing churn. >> * The adaptive heuristic doesn't consider the `soft_tail` anymore. `available` is already adjusted for the soft max capacity. >> * `SoftMaxHeapSize` is used to compute the soft max size and max size for the young generation. > > src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp line 459: > >> 457: _young_generation = new ShenandoahYoungGeneration(_max_workers, max_capacity_young, initial_capacity_young); >> 458: _old_generation = new ShenandoahOldGeneration(_max_workers, max_capacity_old, initial_capacity_old); >> 459: _global_generation = new ShenandoahGlobalGeneration(_max_workers, soft_max_capacity(), soft_max_capacity()); > > A single line of comment here would be helpful here. It sounds as if the idea is that for the so-called global generation (which I assume is identified with the entirety of the committed heap at any time), the initial and max (floor and ceiling) are both set at `soft_max_capacity` ? What does that mean? I might have naively expected this to be, respectivley, `max_old + max_young` and `initial_old + initial_young` like you had it before. I've been thinking of the max capacity as the maximum _allowed_ capacity. For example, the maximum _allowed_ capacity for old would be `total heap - minimum capacity of young`. So, the sum of the maximum allowed for old and young could exceed the total. If that makes sense, I will put the explanation in a comment here. 
------------- PR: https://git.openjdk.org/shenandoah/pull/185 From redestad at openjdk.org Tue Dec 20 19:57:55 2022 From: redestad at openjdk.org (Claes Redestad) Date: Tue, 20 Dec 2022 19:57:55 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13] In-Reply-To: <_2HYrbXOe6zVuLXHywoy_AjCcGMYR266BcwKUZEA5fs=.1e6f7640-7580-4ff3-ace6-f18f27efbb23@github.com> References: <_2HYrbXOe6zVuLXHywoy_AjCcGMYR266BcwKUZEA5fs=.1e6f7640-7580-4ff3-ace6-f18f27efbb23@github.com> Message-ID: On Fri, 16 Dec 2022 22:58:23 GMT, Sandhya Viswanathan wrote: >> Claes Redestad has updated the pull request incrementally with one additional commit since the last revision: >> >> Missing & 0xff in StringLatin1::hashCode > > src/hotspot/cpu/x86/vm_version_x86.cpp line 1671: > >> 1669: } >> 1670: if (UseAVX >= 2) { >> 1671: FLAG_SET_ERGO_IF_DEFAULT(UseVectorizedHashCodeIntrinsic, true); > > This could be just FLAG_SET_DEFAULT instead of FLAG_SET_ERGO_IF_DEFAULT. Right, it seems HW-dependent intrinsics in generally doesn't mark that they've been enabled ergonomically, rather just make it on "by default" when support is available. > src/java.base/share/classes/java/lang/StringUTF16.java line 418: > >> 416: return 0; >> 417: } else { >> 418: return ArraysSupport.vectorizedHashCode(value, ArraysSupport.UTF16); > > Special case for 1 missing here. Intentionally left out. Array length is always even for `UTF16` arrays, but we could add a case for `2` that'd return `getChar(bytes, 0)` but I didn't see much of a win when I tested this. ------------- PR: https://git.openjdk.org/jdk/pull/10847 From wkemper at openjdk.org Tue Dec 20 20:12:39 2022 From: wkemper at openjdk.org (William Kemper) Date: Tue, 20 Dec 2022 20:12:39 GMT Subject: RFR: Initial sizing refactor [v2] In-Reply-To: References: Message-ID: > Some things to highlight here: > * This change borrows a bit of code from G1 to handle processing of command line arguments used to size the young generation. > * A (hard coded for now) threshold on the difference between young/old time has been added to reduce resizing churn. > * The adaptive heuristic doesn't consider the `soft_tail` anymore. `available` is already adjusted for the soft max capacity. > * `SoftMaxHeapSize` is used to compute the soft max size and max size for the young generation. 
William Kemper has updated the pull request incrementally with one additional commit since the last revision: Improve assertions and comments ------------- Changes: - all: https://git.openjdk.org/shenandoah/pull/185/files - new: https://git.openjdk.org/shenandoah/pull/185/files/193f0975..30caeadc Webrevs: - full: https://webrevs.openjdk.org/?repo=shenandoah&pr=185&range=01 - incr: https://webrevs.openjdk.org/?repo=shenandoah&pr=185&range=00-01 Stats: 46 lines in 5 files changed: 34 ins; 9 del; 3 mod Patch: https://git.openjdk.org/shenandoah/pull/185.diff Fetch: git fetch https://git.openjdk.org/shenandoah pull/185/head:pull/185 PR: https://git.openjdk.org/shenandoah/pull/185 From redestad at openjdk.org Tue Dec 20 20:21:57 2022 From: redestad at openjdk.org (Claes Redestad) Date: Tue, 20 Dec 2022 20:21:57 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13] In-Reply-To: <_2HYrbXOe6zVuLXHywoy_AjCcGMYR266BcwKUZEA5fs=.1e6f7640-7580-4ff3-ace6-f18f27efbb23@github.com> References: <_2HYrbXOe6zVuLXHywoy_AjCcGMYR266BcwKUZEA5fs=.1e6f7640-7580-4ff3-ace6-f18f27efbb23@github.com> Message-ID: <_h335iIGqDY-NVIC2k0TYzwb6gZS06ynM76d4-nJaUk=.eb491368-9c6f-4edd-8527-ef8f28c45d20@github.com> On Fri, 16 Dec 2022 23:00:53 GMT, Sandhya Viswanathan wrote: >> Claes Redestad has updated the pull request incrementally with one additional commit since the last revision: >> >> Missing & 0xff in StringLatin1::hashCode > > src/hotspot/cpu/x86/stubRoutines_x86.cpp line 230: > >> 228: #endif // _LP64 >> 229: >> 230: jint StubRoutines::x86::_arrays_hashcode_powers_of_31[] = > > This should be declared only for LP64. Hmm, I guess same goes for all the new `arrays_hashcode` methods in `c2_MacroAssembler_x86`, since we only wire this up properly on 64-bit. I'll make a pass to put `_LP64` guards around all new methods. ------------- PR: https://git.openjdk.org/jdk/pull/10847 From redestad at openjdk.org Tue Dec 20 21:11:40 2022 From: redestad at openjdk.org (Claes Redestad) Date: Tue, 20 Dec 2022 21:11:40 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v14] In-Reply-To: References: Message-ID: > Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops. > > I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases. > > Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front. > > The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads. > > With the most recent fixes the x64 intrinsic results on my workstation look like this: > > Benchmark (size) Mode Cnt Score Error Units > StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.199 ± 0.017 ns/op > StringHashCode.Algorithm.defaultLatin1 10 avgt 5 6.933 ± 0.049 ns/op > StringHashCode.Algorithm.defaultLatin1 100 avgt 5 29.935 ± 0.221 ns/op > StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1596.982 ±
7.020 ns/op > > Baseline: > > Benchmark (size) Mode Cnt Score Error Units > StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.200 ± 0.013 ns/op > StringHashCode.Algorithm.defaultLatin1 10 avgt 5 9.424 ± 0.122 ns/op > StringHashCode.Algorithm.defaultLatin1 100 avgt 5 90.541 ± 0.512 ns/op > StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 9425.321 ± 67.630 ns/op > > I.e. no measurable overhead compared to baseline even for `size == 1`. > > The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good. > > Benchmark for `Arrays.hashCode`: > > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.bytes 1 avgt 5 1.884 ± 0.013 ns/op > ArraysHashCode.bytes 10 avgt 5 6.955 ± 0.040 ns/op > ArraysHashCode.bytes 100 avgt 5 87.218 ± 0.595 ns/op > ArraysHashCode.bytes 10000 avgt 5 9419.591 ± 38.308 ns/op > ArraysHashCode.chars 1 avgt 5 2.200 ± 0.010 ns/op > ArraysHashCode.chars 10 avgt 5 6.935 ± 0.034 ns/op > ArraysHashCode.chars 100 avgt 5 30.216 ± 0.134 ns/op > ArraysHashCode.chars 10000 avgt 5 1601.629 ± 6.418 ns/op > ArraysHashCode.ints 1 avgt 5 2.200 ± 0.007 ns/op > ArraysHashCode.ints 10 avgt 5 6.936 ± 0.034 ns/op > ArraysHashCode.ints 100 avgt 5 29.412 ± 0.268 ns/op > ArraysHashCode.ints 10000 avgt 5 1610.578 ± 7.785 ns/op > ArraysHashCode.shorts 1 avgt 5 1.885 ± 0.012 ns/op > ArraysHashCode.shorts 10 avgt 5 6.961 ± 0.034 ns/op > ArraysHashCode.shorts 100 avgt 5 87.095 ± 0.417 ns/op > ArraysHashCode.shorts 10000 avgt 5 9420.617 ± 50.089 ns/op > > Baseline: > > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.bytes 1 avgt 5 3.213 ± 0.207 ns/op > ArraysHashCode.bytes 10 avgt 5 8.483 ± 0.040 ns/op > ArraysHashCode.bytes 100 avgt 5 90.315 ± 0.655 ns/op > ArraysHashCode.bytes 10000 avgt 5 9422.094 ± 62.402 ns/op > ArraysHashCode.chars 1 avgt 5 3.040 ± 0.066 ns/op > ArraysHashCode.chars 10 avgt 5 8.497 ± 0.074 ns/op > ArraysHashCode.chars 100 avgt 5 90.074 ± 0.387 ns/op > ArraysHashCode.chars 10000 avgt 5 9420.474 ± 41.619 ns/op > ArraysHashCode.ints 1 avgt 5 2.827 ± 0.019 ns/op > ArraysHashCode.ints 10 avgt 5 7.727 ± 0.043 ns/op > ArraysHashCode.ints 100 avgt 5 89.405 ± 0.593 ns/op > ArraysHashCode.ints 10000 avgt 5 9426.539 ± 51.308 ns/op > ArraysHashCode.shorts 1 avgt 5 3.071 ± 0.062 ns/op > ArraysHashCode.shorts 10 avgt 5 8.168 ± 0.049 ns/op > ArraysHashCode.shorts 100 avgt 5 90.399 ± 0.292 ns/op > ArraysHashCode.shorts 10000 avgt 5 9420.171 ± 44.474 ns/op > > > As we can see the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized). I aim to fix `byte` and `short` cases before integrating, though it might be acceptable to hand that off as follow-up enhancements to not further delay integration of this enhancement. Claes Redestad has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 64 commits: - Pass the constant mode node through, removing need for all but one instruct declarations - FLAG_SET_DEFAULT - Merge branch 'master' into 8282664-polyhash - Merge branch 'master' into 8282664-polyhash - Missing & 0xff in StringLatin1::hashCode - Qualified guess on shenandoahSupport fix-up - Whitespace - Final touch-ups, restored 2-stride with dependency chain breakage - Minor cleanup - Revert accidental ModuleHashes change - ...
and 54 more: https://git.openjdk.org/jdk/compare/8dfb6d76...c9e7c561 ------------- Changes: https://git.openjdk.org/jdk/pull/10847/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10847&range=13 Stats: 1021 lines in 33 files changed: 962 ins; 8 del; 51 mod Patch: https://git.openjdk.org/jdk/pull/10847.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10847/head:pull/10847 PR: https://git.openjdk.org/jdk/pull/10847 From redestad at openjdk.org Tue Dec 20 21:13:55 2022 From: redestad at openjdk.org (Claes Redestad) Date: Tue, 20 Dec 2022 21:13:55 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13] In-Reply-To: References: <6lAQI6kDDTGbskylHcWReX8ExaB6qkwgqoai7E6ikZY=.8a69a63c-453d-4bbd-8c76-4d477bfb77fe@github.com> Message-ID: On Mon, 14 Nov 2022 18:28:53 GMT, Vladimir Ivanov wrote: >>> Also, I'd like to note that C2 auto-vectorization support is not too far away from being able to optimize hash code computations. At some point, I was able to achieve some promising results with modest tweaking of SuperWord pass: https://github.com/iwanowww/jdk/blob/superword/notes.txt http://cr.openjdk.java.net/~vlivanov/superword.reduction/webrev.00/ >> >> Intriguing. How far off is this - and do you think it'll be able to match the efficiency we see here with a memoized coefficient table etc? >> >> If we turn this intrinsic into a stub we might also be able to reuse the optimization in other places, including from within the VM (calculating String hashCodes happen in a couple of places, including String deduplication). So I think there are still a few compelling reasons to go the manual route and continue on this path. > >> How far off is this ...? > > Back then it looked way too constrained (tight constraints on code shapes). But I considered it as a generally applicable optimization. > >> ... do you think it'll be able to match the efficiency we see here with a memoized coefficient table etc? > > Yes, it is able to build the constant table at runtime when folding multiplications of constant coefficients produced during loop unrolling and then packing scalars into a constant vector. > > Moreover, briefly looking at the code shape, the vectorizer would produce a more optimal loop shape (pre-loop would align vector accesses and would use 512-bit vectors when available; vector post-loop could help as well). Passing the constant node through as an input as suggested by @iwanowww and @sviswa7 meant we could eliminate most of the `instruct` blocks, removing a significant chunk of code and a little bit of complexity from the proposed patch. ------------- PR: https://git.openjdk.org/jdk/pull/10847 From dholmes at openjdk.org Tue Dec 20 21:21:51 2022 From: dholmes at openjdk.org (David Holmes) Date: Tue, 20 Dec 2022 21:21:51 GMT Subject: RFR: 8299072: java_lang_ref_Reference::clear_referent should be GC agnostic In-Reply-To: References: Message-ID: On Tue, 20 Dec 2022 07:05:34 GMT, Erik ?sterlund wrote: > The current java_lang_ref_Reference::clear_referent implementation performs a raw reference clear. That doesn't work well with upcoming GC algorithms. It should be made GC agnostic by going through the normal access API. Okay - thanks for the explanation. Looks good. ------------- Marked as reviewed by dholmes (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/11736 From kdnilsen at openjdk.org Tue Dec 20 21:48:31 2022 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Tue, 20 Dec 2022 21:48:31 GMT Subject: Integrated: Shrink tlab to capacity In-Reply-To: References: Message-ID: On Fri, 9 Dec 2022 23:23:43 GMT, Kelvin Nilsen wrote: > When a TLAB request exceeds the currently available memory within young-gen, the existing behavior is to reject the TLAB request outright. This is recognized as a failed allocation request, which triggers degenerated GC. > > This change introduces code to reduce the likelihood that too-large TLAB requests will be issued, and when they are issued, it makes an effort to shrink the TLAB request in order to reduce the need for degenerated GC. > > The impact is difficult to measure because this situation is fairly rare. On one Extremem workload, the TLAB-shrinking code is exercised only once during a 16-minute run involving 500 concurrent GCs, a 45 GiB heap, and a 28 GiB young-gen size. The change reduces the degenerated GCs from 6 to 5. > > One reason that the remaining 5 degenerated GCs are not addressed by this change is that further work is required to handle a situation in which a requested TLAB is smaller than the available young-gen memory, but available memory is set aside in the evacuation reserve so cannot be provided to a mutator. Future work will address this condition. This pull request has now been integrated. Changeset: 9114616c Author: Kelvin Nilsen URL: https://git.openjdk.org/shenandoah/commit/9114616c01bdeeddad50bec93869decee90f5a58 Stats: 201 lines in 2 files changed: 88 ins; 39 del; 74 mod Shrink tlab to capacity Reviewed-by: ysr, wkemper ------------- PR: https://git.openjdk.org/shenandoah/pull/180 From sviswanathan at openjdk.org Wed Dec 21 00:14:54 2022 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Wed, 21 Dec 2022 00:14:54 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13] In-Reply-To: References: <6lAQI6kDDTGbskylHcWReX8ExaB6qkwgqoai7E6ikZY=.8a69a63c-453d-4bbd-8c76-4d477bfb77fe@github.com> Message-ID: On Tue, 20 Dec 2022 21:11:18 GMT, Claes Redestad wrote: >>> How far off is this ...? >> >> Back then it looked way too constrained (tight constraints on code shapes). But I considered it as a generally applicable optimization. >> >>> ... do you think it'll be able to match the efficiency we see here with a memoized coefficient table etc? >> >> Yes, it is able to build the constant table at runtime when folding multiplications of constant coefficients produced during loop unrolling and then packing scalars into a constant vector. >> >> Moreover, briefly looking at the code shape, the vectorizer would produce a more optimal loop shape (pre-loop would align vector accesses and would use 512-bit vectors when available; vector post-loop could help as well). > > Passing the constant node through as an input as suggested by @iwanowww and @sviswa7 meant we could eliminate most of the `instruct` blocks, removing a significant chunk of code and a little bit of complexity from the proposed patch. @cl4es Thanks for passing the constant node through, the code looks much cleaner now. The attached patch should handle the signed bytes/shorts as well. Please take a look. 
[signed.patch](https://github.com/openjdk/jdk/files/10273480/signed.patch) ------------- PR: https://git.openjdk.org/jdk/pull/10847 From sviswanathan at openjdk.org Wed Dec 21 01:58:59 2022 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Wed, 21 Dec 2022 01:58:59 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v14] In-Reply-To: References: Message-ID: <-qoflnp34219qc7cA_xaazdxkbFkEOzdZfCbOeYPCxA=.5f57ad5e-a099-425d-81e1-87e2eda09cf2@github.com> On Tue, 20 Dec 2022 21:11:40 GMT, Claes Redestad wrote: >> Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops. >> >> I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases. >> >> Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front. >> >> The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads. >> >> With the most recent fixes the x64 intrinsic results on my workstation look like this: >> >> Benchmark (size) Mode Cnt Score Error Units >> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.199 ± 0.017 ns/op >> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 6.933 ± 0.049 ns/op >> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 29.935 ± 0.221 ns/op >> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1596.982 ± 7.020 ns/op >> >> Baseline: >> >> Benchmark (size) Mode Cnt Score Error Units >> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.200 ± 0.013 ns/op >> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 9.424 ± 0.122 ns/op >> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 90.541 ± 0.512 ns/op >> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 9425.321 ± 67.630 ns/op >> >> I.e. no measurable overhead compared to baseline even for `size == 1`. >> >> The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good. >> >> Benchmark for `Arrays.hashCode`: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArraysHashCode.bytes 1 avgt 5 1.884 ± 0.013 ns/op >> ArraysHashCode.bytes 10 avgt 5 6.955 ± 0.040 ns/op >> ArraysHashCode.bytes 100 avgt 5 87.218 ± 0.595 ns/op >> ArraysHashCode.bytes 10000 avgt 5 9419.591 ± 38.308 ns/op >> ArraysHashCode.chars 1 avgt 5 2.200 ± 0.010 ns/op >> ArraysHashCode.chars 10 avgt 5 6.935 ± 0.034 ns/op >> ArraysHashCode.chars 100 avgt 5 30.216 ± 0.134 ns/op >> ArraysHashCode.chars 10000 avgt 5 1601.629 ± 6.418 ns/op >> ArraysHashCode.ints 1 avgt 5 2.200 ± 0.007 ns/op >> ArraysHashCode.ints 10 avgt 5 6.936 ± 0.034 ns/op >> ArraysHashCode.ints 100 avgt 5 29.412 ± 0.268 ns/op >> ArraysHashCode.ints 10000 avgt 5 1610.578 ± 7.785 ns/op >> ArraysHashCode.shorts 1 avgt 5 1.885 ± 0.012 ns/op >> ArraysHashCode.shorts 10 avgt 5 6.961 ± 0.034 ns/op >> ArraysHashCode.shorts 100 avgt 5 87.095 ± 0.417 ns/op >> ArraysHashCode.shorts 10000 avgt 5 9420.617 ±
50.089 ns/op >> >> Baseline: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArraysHashCode.bytes 1 avgt 5 3.213 ± 0.207 ns/op >> ArraysHashCode.bytes 10 avgt 5 8.483 ± 0.040 ns/op >> ArraysHashCode.bytes 100 avgt 5 90.315 ± 0.655 ns/op >> ArraysHashCode.bytes 10000 avgt 5 9422.094 ± 62.402 ns/op >> ArraysHashCode.chars 1 avgt 5 3.040 ± 0.066 ns/op >> ArraysHashCode.chars 10 avgt 5 8.497 ± 0.074 ns/op >> ArraysHashCode.chars 100 avgt 5 90.074 ± 0.387 ns/op >> ArraysHashCode.chars 10000 avgt 5 9420.474 ± 41.619 ns/op >> ArraysHashCode.ints 1 avgt 5 2.827 ± 0.019 ns/op >> ArraysHashCode.ints 10 avgt 5 7.727 ± 0.043 ns/op >> ArraysHashCode.ints 100 avgt 5 89.405 ± 0.593 ns/op >> ArraysHashCode.ints 10000 avgt 5 9426.539 ± 51.308 ns/op >> ArraysHashCode.shorts 1 avgt 5 3.071 ± 0.062 ns/op >> ArraysHashCode.shorts 10 avgt 5 8.168 ± 0.049 ns/op >> ArraysHashCode.shorts 100 avgt 5 90.399 ± 0.292 ns/op >> ArraysHashCode.shorts 10000 avgt 5 9420.171 ± 44.474 ns/op >> >> >> As we can see the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized). I aim to fix `byte` and `short` cases before integrating, though it might be acceptable to hand that off as follow-up enhancements to not further delay integration of this enhancement. > Claes Redestad has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 64 commits: > - Pass the constant mode node through, removing need for all but one instruct declarations > - FLAG_SET_DEFAULT > - Merge branch 'master' into 8282664-polyhash > - Merge branch 'master' into 8282664-polyhash > - Missing & 0xff in StringLatin1::hashCode > - Qualified guess on shenandoahSupport fix-up > - Whitespace > - Final touch-ups, restored 2-stride with dependency chain breakage > - Minor cleanup > - Revert accidental ModuleHashes change > - ... and 54 more: https://git.openjdk.org/jdk/compare/8dfb6d76...c9e7c561 src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 3420: > 3418: arrays_hashcode_elload(tmp3, Address(ary1, index, Address::times(elsize), -elsize), eltype, is_string_hashcode); > 3419: addl(result, tmp3); > 3420: jmp(END); This jmp can be removed. ------------- PR: https://git.openjdk.org/jdk/pull/10847 From sviswanathan at openjdk.org Wed Dec 21 01:59:01 2022 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Wed, 21 Dec 2022 01:59:01 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13] In-Reply-To: References: <_2HYrbXOe6zVuLXHywoy_AjCcGMYR266BcwKUZEA5fs=.1e6f7640-7580-4ff3-ace6-f18f27efbb23@github.com> Message-ID: <2l2l7-2EKQr3UORegtVHtWN0uf9AUH8awtUPvC0MfS0=.8f94982f-f399-4ffd-b494-117cb73bf606@github.com> On Tue, 20 Dec 2022 19:52:34 GMT, Claes Redestad wrote: >> src/java.base/share/classes/java/lang/StringUTF16.java line 418: >> >>> 416: return 0; >>> 417: } else { >>> 418: return ArraysSupport.vectorizedHashCode(value, ArraysSupport.UTF16); >> >> Special case for 1 missing here. > > Intentionally left out. Array length is always even for `UTF16` arrays. We could add a case for `2` that'd return `getChar(bytes, 0)`, but I didn't see much of a win when I tested this.
I do see a 1.5x gain with this special case added:

    return switch (value.length) {
        case 0 -> 0;
        case 2 -> getChar(value, 0);
        default -> ArraysSupport.vectorizedHashCode(value, ArraysSupport.UTF16);
    };

before: 0.987 ns/op
after: 0.640 ns/op

------------- PR: https://git.openjdk.org/jdk/pull/10847 From redestad at openjdk.org Wed Dec 21 17:01:17 2022 From: redestad at openjdk.org (Claes Redestad) Date: Wed, 21 Dec 2022 17:01:17 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v15] In-Reply-To: References: Message-ID: > Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops. > > I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases. > > Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front. > > The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads. > > With the most recent fixes the x64 intrinsic results on my workstation look like this: > > Benchmark (size) Mode Cnt Score Error Units > StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.199 ± 0.017 ns/op > StringHashCode.Algorithm.defaultLatin1 10 avgt 5 6.933 ± 0.049 ns/op > StringHashCode.Algorithm.defaultLatin1 100 avgt 5 29.935 ± 0.221 ns/op > StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1596.982 ± 7.020 ns/op > > Baseline: > > Benchmark (size) Mode Cnt Score Error Units > StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.200 ± 0.013 ns/op > StringHashCode.Algorithm.defaultLatin1 10 avgt 5 9.424 ± 0.122 ns/op > StringHashCode.Algorithm.defaultLatin1 100 avgt 5 90.541 ± 0.512 ns/op > StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 9425.321 ± 67.630 ns/op > > I.e. no measurable overhead compared to baseline even for `size == 1`. > > The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good. > > Benchmark for `Arrays.hashCode`: > > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.bytes 1 avgt 5 1.884 ± 0.013 ns/op > ArraysHashCode.bytes 10 avgt 5 6.955 ± 0.040 ns/op > ArraysHashCode.bytes 100 avgt 5 87.218 ± 0.595 ns/op > ArraysHashCode.bytes 10000 avgt 5 9419.591 ± 38.308 ns/op > ArraysHashCode.chars 1 avgt 5 2.200 ± 0.010 ns/op > ArraysHashCode.chars 10 avgt 5 6.935 ± 0.034 ns/op > ArraysHashCode.chars 100 avgt 5 30.216 ± 0.134 ns/op > ArraysHashCode.chars 10000 avgt 5 1601.629 ± 6.418 ns/op > ArraysHashCode.ints 1 avgt 5 2.200 ± 0.007 ns/op > ArraysHashCode.ints 10 avgt 5 6.936 ± 0.034 ns/op > ArraysHashCode.ints 100 avgt 5 29.412 ± 0.268 ns/op > ArraysHashCode.ints 10000 avgt 5 1610.578 ± 7.785 ns/op > ArraysHashCode.shorts 1 avgt 5 1.885 ± 0.012 ns/op > ArraysHashCode.shorts 10 avgt 5 6.961 ± 0.034 ns/op > ArraysHashCode.shorts 100 avgt 5 87.095 ± 0.417 ns/op > ArraysHashCode.shorts 10000 avgt 5 9420.617 ± 50.089 ns/op > > Baseline: > > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.bytes 1 avgt 5 3.213 ±
0.207 ns/op > ArraysHashCode.bytes 10 avgt 5 8.483 ± 0.040 ns/op > ArraysHashCode.bytes 100 avgt 5 90.315 ± 0.655 ns/op > ArraysHashCode.bytes 10000 avgt 5 9422.094 ± 62.402 ns/op > ArraysHashCode.chars 1 avgt 5 3.040 ± 0.066 ns/op > ArraysHashCode.chars 10 avgt 5 8.497 ± 0.074 ns/op > ArraysHashCode.chars 100 avgt 5 90.074 ± 0.387 ns/op > ArraysHashCode.chars 10000 avgt 5 9420.474 ± 41.619 ns/op > ArraysHashCode.ints 1 avgt 5 2.827 ± 0.019 ns/op > ArraysHashCode.ints 10 avgt 5 7.727 ± 0.043 ns/op > ArraysHashCode.ints 100 avgt 5 89.405 ± 0.593 ns/op > ArraysHashCode.ints 10000 avgt 5 9426.539 ± 51.308 ns/op > ArraysHashCode.shorts 1 avgt 5 3.071 ± 0.062 ns/op > ArraysHashCode.shorts 10 avgt 5 8.168 ± 0.049 ns/op > ArraysHashCode.shorts 100 avgt 5 90.399 ± 0.292 ns/op > ArraysHashCode.shorts 10000 avgt 5 9420.171 ± 44.474 ns/op > > > As we can see the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized). I aim to fix `byte` and `short` cases before integrating, though it might be acceptable to hand that off as follow-up enhancements to not further delay integration of this enhancement. Claes Redestad has updated the pull request incrementally with two additional commits since the last revision: - Handle signed subword arrays, contributed by @sviswa7 - @sviswa7 comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10847/files - new: https://git.openjdk.org/jdk/pull/10847/files/c9e7c561..16733c4d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10847&range=14 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10847&range=13-14 Stats: 51 lines in 3 files changed: 36 ins; 6 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/10847.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10847/head:pull/10847 PR: https://git.openjdk.org/jdk/pull/10847 From redestad at openjdk.org Wed Dec 21 17:04:07 2022 From: redestad at openjdk.org (Claes Redestad) Date: Wed, 21 Dec 2022 17:04:07 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v14] In-Reply-To: <-qoflnp34219qc7cA_xaazdxkbFkEOzdZfCbOeYPCxA=.5f57ad5e-a099-425d-81e1-87e2eda09cf2@github.com> References: <-qoflnp34219qc7cA_xaazdxkbFkEOzdZfCbOeYPCxA=.5f57ad5e-a099-425d-81e1-87e2eda09cf2@github.com> Message-ID: On Wed, 21 Dec 2022 01:02:35 GMT, Sandhya Viswanathan wrote: >> Claes Redestad has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 64 commits: >> >> - Pass the constant mode node through, removing need for all but one instruct declarations >> - FLAG_SET_DEFAULT >> - Merge branch 'master' into 8282664-polyhash >> - Merge branch 'master' into 8282664-polyhash >> - Missing & 0xff in StringLatin1::hashCode >> - Qualified guess on shenandoahSupport fix-up >> - Whitespace >> - Final touch-ups, restored 2-stride with dependency chain breakage >> - Minor cleanup >> - Revert accidental ModuleHashes change >> - ... and 54 more: https://git.openjdk.org/jdk/compare/8dfb6d76...c9e7c561 > > src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 3420: > >> 3418: arrays_hashcode_elload(tmp3, Address(ary1, index, Address::times(elsize), -elsize), eltype, is_string_hashcode); >> 3419: addl(result, tmp3); >> 3420: jmp(END); > > This jmp can be removed. Ok, special-cased for `value.length == 2`, removed the superfluous `jmp`, and committed your patch to implement the vectorization for `short`s and signed `byte`s.
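For context, the scalar idea behind `_arrays_hashcode_powers_of_31` is that with memoized powers of 31 the hash loop can consume 8 elements per step using mutually independent multiply-adds. A self-contained sketch of that shape (illustrative C++ only; the real code is a C2 intrinsic, and the POW31 values below are assumed reductions mod 2^32):

    #include <cstdint>

    // 31^7 .. 31^0, reduced mod 2^32 (31^7 wraps in 32 bits).
    static const uint32_t POW31[8] = {
      1742810335u, 887503681u, 28629151u, 923521u, 29791u, 961u, 31u, 1u
    };

    // One 8-way step: h = h*31^8 + a[0]*31^7 + ... + a[7]*31^0.
    // The eight products do not depend on each other, which is what makes
    // this shape profitable to vectorize.
    static int32_t hash8(int32_t h, const uint16_t* a) {
      uint32_t next = (uint32_t)h * (POW31[0] * 31u);  // h * 31^8 (mod 2^32)
      for (int k = 0; k < 8; k++) {
        next += a[k] * POW31[k];
      }
      return (int32_t)next;
    }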
------------- PR: https://git.openjdk.org/jdk/pull/10847 From ysr at openjdk.org Wed Dec 21 17:16:30 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Wed, 21 Dec 2022 17:16:30 GMT Subject: RFR: JDK-8297796 GenShen: instrument the remembered set scan [v12] In-Reply-To: References: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com> Message-ID: On Tue, 20 Dec 2022 00:33:29 GMT, Y. Srinivas Ramakrishna wrote: >> It's a number of card-scan rounds, independently for either RS scan or Update refs, where a round consists of a RS cycle or Update refs cycle by however many worker threads participate. The logging is oblivious to workers; it simply uses whatever number of workers was used. >> >> Let me attach an example log in the PR summary to illustrate, but roughly speaking, it's as follows: >> >> (start of remembered set (RS) scan) >> (end of remembered set scan) >> >> (log of card stats for this round of RS by worker #1) >> ... >> (log of card stats for this round by RS worker #k1) >> ... >> (start of update refs (UR) scan) >> >> (end of update refs scan) >> >> (log of card stats for this round of UR by worker #1) >> ... >> (log of card stats for this round of UR by worker #k2) >> ... >> >> Every `ShenandoahCardStatsLogInterval` such rounds, in addition to the per round, per worker stats like we did above, we also produce cumulative statistics across all workers and all rounds to date, but one each for RS and UR. >> >> Let me know if that makes sense, and if "cycles" makes sense in the documentation for what I have called "rounds" above. > > See https://github.com/openjdk/shenandoah/pull/176#issuecomment-1342840919. I updated the summary comment at the top of the PR at https://github.com/openjdk/shenandoah/pull/176#issue-1471869802 with the new format based on Kelvin's suggestion in https://github.com/openjdk/shenandoah/pull/176#issuecomment-1342840919 above. It shows that the per round stats, separated by worker and scan type (UR or RS), are needed, and that the cumulative stats may have lost some of the nuance present in the per round/per scan type stats. ------------- PR: https://git.openjdk.org/shenandoah/pull/176 From redestad at openjdk.org Wed Dec 21 17:29:23 2022 From: redestad at openjdk.org (Claes Redestad) Date: Wed, 21 Dec 2022 17:29:23 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v16] In-Reply-To: References: Message-ID: > Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops. > > I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases. > > Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front. > > The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads.
> With the most recent fixes the x64 intrinsic results on my workstation look like this: > > Benchmark (size) Mode Cnt Score Error Units > StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.199 ± 0.017 ns/op > StringHashCode.Algorithm.defaultLatin1 10 avgt 5 6.933 ± 0.049 ns/op > StringHashCode.Algorithm.defaultLatin1 100 avgt 5 29.935 ± 0.221 ns/op > StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1596.982 ± 7.020 ns/op > > Baseline: > > Benchmark (size) Mode Cnt Score Error Units > StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.200 ± 0.013 ns/op > StringHashCode.Algorithm.defaultLatin1 10 avgt 5 9.424 ± 0.122 ns/op > StringHashCode.Algorithm.defaultLatin1 100 avgt 5 90.541 ± 0.512 ns/op > StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 9425.321 ± 67.630 ns/op > > I.e. no measurable overhead compared to baseline even for `size == 1`. > > The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good. > > Benchmark for `Arrays.hashCode`: > > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.bytes 1 avgt 5 1.884 ± 0.013 ns/op > ArraysHashCode.bytes 10 avgt 5 6.955 ± 0.040 ns/op > ArraysHashCode.bytes 100 avgt 5 87.218 ± 0.595 ns/op > ArraysHashCode.bytes 10000 avgt 5 9419.591 ± 38.308 ns/op > ArraysHashCode.chars 1 avgt 5 2.200 ± 0.010 ns/op > ArraysHashCode.chars 10 avgt 5 6.935 ± 0.034 ns/op > ArraysHashCode.chars 100 avgt 5 30.216 ± 0.134 ns/op > ArraysHashCode.chars 10000 avgt 5 1601.629 ± 6.418 ns/op > ArraysHashCode.ints 1 avgt 5 2.200 ± 0.007 ns/op > ArraysHashCode.ints 10 avgt 5 6.936 ± 0.034 ns/op > ArraysHashCode.ints 100 avgt 5 29.412 ± 0.268 ns/op > ArraysHashCode.ints 10000 avgt 5 1610.578 ± 7.785 ns/op > ArraysHashCode.shorts 1 avgt 5 1.885 ± 0.012 ns/op > ArraysHashCode.shorts 10 avgt 5 6.961 ± 0.034 ns/op > ArraysHashCode.shorts 100 avgt 5 87.095 ± 0.417 ns/op > ArraysHashCode.shorts 10000 avgt 5 9420.617 ± 50.089 ns/op > > Baseline: > > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.bytes 1 avgt 5 3.213 ±
Claes Redestad has updated the pull request incrementally with one additional commit since the last revision: Treat Op_VectorizedHashCode as other similar Ops in split_unique_types ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10847/files - new: https://git.openjdk.org/jdk/pull/10847/files/16733c4d..62e98e1b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10847&range=15 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10847&range=14-15 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10847.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10847/head:pull/10847 PR: https://git.openjdk.org/jdk/pull/10847 From ysr at openjdk.org Wed Dec 21 17:29:55 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Wed, 21 Dec 2022 17:29:55 GMT Subject: RFR: JDK-8297796 GenShen: instrument the remembered set scan [v13] In-Reply-To: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com> References: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com> Message-ID: > **Updated 12/21** > > **Summary:** > The main change is card stats collection during remembered set (RS) and update refs (UR) phases when the card-table is scanned. The code is protected by a new non-product-only flag `ShenandoahEnableCardStats`, which is on by default in debug builds and off in the optimized build. > > We tested the impact of the code with the flag enabled in product mode and felt the impact was non-trivial. We might, in the future, enable the code in product mode if performance can be improved. > > Stats are logged per worker thread at the end of each RS and UR scan. These stats are specific to the most recent round of scanning. Global cumulative stats across all threads (but specific to RS or UR) are also maintained, and these are logged at periodic intervals as determined by the setting of `ShenandoahCardStatsLogInterval`. > > **Format of stats produced and how to interpret them: (sample)** > > The following format is an example from a slowdebug run where the logging is enabled. In this case there are 2 concurrent GC worker threads, and `ShenandoahCardStatsLogInterval` was set at 2. The first two logs show the per-worker stats for those particular scans for each of the two worker threads; the next set shows the same per-worker, per-scan stats, with each scan also followed (because the log interval of 2 has been reached) by cumulative stats for that type of scan (RS or UR) across all workers and all scans of that type.
> > > [560.766s][info][gc,remset ] GC(13) Scan Remembered Set > [560.766s][info][gc,remset ] GC(13) Worker 0 Card Stats Histo: > [560.766s][info][gc,remset ] GC(13) dirty_run: [ 0.00 0.00 0.00 99.61 100.00 ] > [560.766s][info][gc,remset ] GC(13) clean_run: [ 0.00 0.00 0.00 0.00 53.12 ] > [560.766s][info][gc,remset ] GC(13) dirty_cards: [ 0.00 99.61 99.61 99.61 100.00 ] > [560.766s][info][gc,remset ] GC(13) clean_cards: [ 0.00 0.00 0.00 0.00 53.12 ] > [560.766s][info][gc,remset ] GC(13) max_dirty_run: [ 0.00 99.61 99.61 99.61 100.00 ] > [560.766s][info][gc,remset ] GC(13) max_clean_run: [ 0.00 0.00 0.00 0.00 53.12 ] > [560.766s][info][gc,remset ] GC(13) dirty_objs: [ 0.00 0.00 0.00 818.36 1366.00 ] > [560.766s][info][gc,remset ] GC(13) clean_objs: [ 0.00 0.00 0.00 0.00 8.00 ] > [560.766s][info][gc,remset ] GC(13) dirty_scans: [ 0.00 0.00 0.00 705.08 1365.00 ] > [560.766s][info][gc,remset ] GC(13) clean_scans: [ 0.00 0.00 0.00 0.00 0.00 ] > [560.766s][info][gc,remset ] GC(13) alternations: [ 0.00 0.00 0.00 0.00 16.00 ] > [560.766s][info][gc,remset ] GC(13) Worker 1 Card Stats Histo: > [560.766s][info][gc,remset ] GC(13) dirty_run: [ 0.00 0.00 0.00 96.88 100.00 ] > [560.766s][info][gc,remset ] GC(13) clean_run: [ 0.00 0.00 0.00 0.00 46.88 ] > [560.766s][info][gc,remset ] GC(13) dirty_cards: [ 18.75 82.81 98.44 99.61 100.00 ] > [560.766s][info][gc,remset ] GC(13) clean_cards: [ 0.00 0.00 0.00 0.00 46.88 ] > [560.766s][info][gc,remset ] GC(13) max_dirty_run: [ 18.75 82.81 98.44 99.61 100.00 ] > [560.766s][info][gc,remset ] GC(13) max_clean_run: [ 0.00 0.00 0.00 0.00 46.88 ] > [560.766s][info][gc,remset ] GC(13) dirty_objs: [ 0.00 269.53 353.52 814.45 1366.00 ] > [560.766s][info][gc,remset ] GC(13) clean_objs: [ 0.00 0.00 0.00 0.00 3.00 ] > [560.766s][info][gc,remset ] GC(13) dirty_scans: [ 0.00 263.67 351.56 671.88 1365.00 ] > [560.766s][info][gc,remset ] GC(13) clean_scans: [ 0.00 0.00 0.00 0.00 0.00 ] > [560.766s][info][gc,remset ] GC(13) alternations: [ 0.00 0.00 0.00 0.00 3.00 ] > [560.766s][info][gc ] GC(13) Concurrent remembered set scanning 1150.359ms > [560.766s][info][gc,start ] GC(13) Concurrent marking roots > ... 
> [585.433s][info][gc ] GC(13) Concurrent evacuation 6225.829ms > [585.433s][info][gc,start ] GC(13) Pause Init Update Refs > [585.434s][info][gc ] GC(13) Pause Init Update Refs 0.264ms > [585.434s][info][gc,start ] GC(13) Concurrent update references > [585.434s][info][gc,task ] GC(13) Using 2 of 4 workers for concurrent reference update > [585.567s][info][gc ] Average MMU = 2.925 > [590.583s][info][gc ] Average MMU = 1.509 > [595.600s][info][gc ] Average MMU = 0.835 > [600.618s][info][gc ] Average MMU = 0.447 > [605.635s][info][gc ] Average MMU = 0.253 > [610.651s][info][gc ] Average MMU = 0.114 > [615.669s][info][gc ] Average MMU = 0.130 > [620.686s][info][gc ] Average MMU = 0.129 > [622.209s][info][gc,remset ] GC(13) Update Refs > [622.209s][info][gc,remset ] GC(13) Worker 0 Card Stats Histo: > [622.209s][info][gc,remset ] GC(13) dirty_run: [ 0.00 0.00 0.00 3.12 100.00 ] > [622.209s][info][gc,remset ] GC(13) clean_run: [ 0.00 0.00 0.00 3.12 100.00 ] > [622.209s][info][gc,remset ] GC(13) dirty_cards: [ 0.00 3.12 50.00 99.61 100.00 ] > [622.209s][info][gc,remset ] GC(13) clean_cards: [ 0.00 0.00 26.56 92.19 100.00 ] > [622.209s][info][gc,remset ] GC(13) max_dirty_run: [ 0.00 1.56 29.69 99.61 100.00 ] > [622.209s][info][gc,remset ] GC(13) max_clean_run: [ 0.00 0.00 9.38 70.31 100.00 ] > [622.209s][info][gc,remset ] GC(13) dirty_objs: [ 0.00 0.00 0.00 50.00 1366.00 ] > [622.209s][info][gc,remset ] GC(13) clean_objs: [ 0.00 0.00 3.98 54.88 64.00 ] > [622.209s][info][gc,remset ] GC(13) dirty_scans: [ 0.00 0.00 0.00 33.98 1365.00 ] > [622.209s][info][gc,remset ] GC(13) clean_scans: [ 0.00 0.00 0.00 1.00 16.00 ] > [622.209s][info][gc,remset ] GC(13) alternations: [ 0.00 0.00 0.00 2.99 33.00 ] > [622.209s][info][gc,remset ] GC(13) Worker 1 Card Stats Histo: > [622.209s][info][gc,remset ] GC(13) dirty_run: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.209s][info][gc,remset ] GC(13) clean_run: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc,remset ] GC(13) dirty_cards: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc,remset ] GC(13) clean_cards: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc,remset ] GC(13) max_dirty_run: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc,remset ] GC(13) max_clean_run: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc,remset ] GC(13) dirty_objs: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc,remset ] GC(13) clean_objs: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc,remset ] GC(13) dirty_scans: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc,remset ] GC(13) clean_scans: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc,remset ] GC(13) alternations: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc ] GC(13) Concurrent update references 36776.258ms > ... 
> [627.626s][info][gc,remset ] GC(15) Scan Remembered Set
> [627.626s][info][gc,remset ] GC(15) Worker 0 Card Stats Histo:
> [627.626s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 1.56 100.00 ]
> [627.626s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 4.69 100.00 ]
> [627.626s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 0.00 6.25 32.81 100.00 ]
> [627.626s][info][gc,remset ] GC(15) clean_cards: [ 0.00 48.44 90.62 98.44 100.00 ]
> [627.626s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 0.00 3.12 15.62 100.00 ]
> [627.626s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 23.44 60.94 95.31 100.00 ]
> [627.626s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 0.00 45.90 164.06 1366.00 ]
> [627.626s][info][gc,remset ] GC(15) clean_objs: [ 0.00 11.91 53.91 60.94 63.00 ]
> [627.626s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 0.00 31.84 150.39 1365.00 ]
> [627.626s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 1.00 1.99 11.00 ]
> [627.626s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 1.99 6.00 24.00 ]
> [627.627s][info][gc,remset ] GC(15) Worker 1 Card Stats Histo:
> [627.627s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 3.12 100.00 ]
> [627.627s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 1.56 100.00 ]
> [627.627s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 6.25 99.61 99.61 100.00 ]
> [627.627s][info][gc,remset ] GC(15) clean_cards: [ 0.00 0.00 0.00 70.31 100.00 ]
> [627.627s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 3.12 99.61 99.61 100.00 ]
> [627.627s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 0.00 0.00 53.12 100.00 ]
> [627.627s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 0.00 0.00 0.00 1365.00 ]
> [627.627s][info][gc,remset ] GC(15) clean_objs: [ 0.00 0.00 0.00 40.82 64.00 ]
> [627.627s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 0.00 0.00 0.00 1364.00 ]
> [627.627s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 0.00 0.00 16.00 ]
> [627.627s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 0.00 0.00 33.00 ]
> [627.627s][info][gc,remset ] GC(15) Cumulative stats
> [627.627s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 6.25 100.00 ]
> [627.627s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 1.56 100.00 ]
> [627.627s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 40.62 99.61 99.61 100.00 ]
> [627.627s][info][gc,remset ] GC(15) clean_cards: [ 0.00 0.00 0.00 31.25 100.00 ]
> [627.627s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 23.44 99.61 99.61 100.00 ]
> [627.627s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 0.00 0.00 12.50 100.00 ]
> [627.627s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 0.00 0.00 326.17 1366.00 ]
> [627.627s][info][gc,remset ] GC(15) clean_objs: [ 0.00 0.00 0.00 3.98 64.00 ]
> [627.627s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 0.00 0.00 314.45 1365.00 ]
> [627.627s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 0.00 0.00 16.00 ]
> [627.627s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 0.00 0.00 33.00 ]
> [627.627s][info][gc ] GC(15) Concurrent remembered set scanning 1119.698ms
> ...
> [631.875s][info][gc,remset ] GC(15) Update Refs > [631.875s][info][gc,remset ] GC(15) Worker 0 Card Stats Histo: > [631.875s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 3.12 100.00 ] > [631.875s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 3.12 100.00 ] > [631.875s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 4.69 99.61 99.61 100.00 ] > [631.875s][info][gc,remset ] GC(15) clean_cards: [ 0.00 0.00 0.00 90.62 100.00 ] > [631.875s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 3.12 99.61 99.61 100.00 ] > [631.875s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 0.00 0.00 68.75 100.00 ] > [631.875s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 0.00 0.00 29.88 1365.00 ] > [631.875s][info][gc,remset ] GC(15) clean_objs: [ 0.00 0.00 0.00 52.93 64.00 ] > [631.875s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 0.00 0.00 22.85 1364.00 ] > [631.875s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 0.00 0.00 11.00 ] > [631.875s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 0.00 1.99 24.00 ] > [631.875s][info][gc,remset ] GC(15) Worker 1 Card Stats Histo: > [631.875s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 26.56 100.00 ] > [631.875s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 0.00 100.00 ] > [631.875s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 62.50 99.61 99.61 100.00 ] > [631.875s][info][gc,remset ] GC(15) clean_cards: [ 0.00 0.00 0.00 0.00 100.00 ] > [631.875s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 59.38 99.61 99.61 100.00 ] > [631.875s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 0.00 0.00 0.00 100.00 ] > [631.875s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 230.47 818.36 871.09 1366.00 ] > [631.875s][info][gc,remset ] GC(15) clean_objs: [ 0.00 0.00 0.00 0.00 63.00 ] > [631.875s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 181.64 707.03 796.88 1365.00 ] > [631.875s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 0.00 0.00 16.00 ] > [631.875s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 0.00 0.00 33.00 ] > [631.876s][info][gc,remset ] GC(15) Cumulative stats > [631.876s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 6.25 100.00 ] > [631.876s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 1.56 100.00 ] > [631.876s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 32.81 99.61 99.61 100.00 ] > [631.876s][info][gc,remset ] GC(15) clean_cards: [ 0.00 0.00 0.00 43.75 100.00 ] > [631.876s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 15.62 99.61 99.61 100.00 ] > [631.876s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 0.00 0.00 20.31 100.00 ] > [631.876s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 0.00 20.90 695.31 1366.00 ] > [631.876s][info][gc,remset ] GC(15) clean_objs: [ 0.00 0.00 0.00 11.91 64.00 ] > [631.876s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 0.00 11.91 562.50 1365.00 ] > [631.876s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 0.00 0.00 16.00 ] > [631.876s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 0.00 0.00 33.00 ] > [631.876s][info][gc ] GC(15) Concurrent update references 1953.893ms > ... > > > The rows represent the metric that's being tracked, and the columns are, respectively, minimum, the 3 quartiles (25%, 50%, 75%) and the maximum. 
The metrics are: > > - dirty_run: the length of an uninterrupted run of dirty cards, interpreted as a percentage of a chunk of work assignment (cluster) processed by a thread > - clean_run: as above, but the length of an uninterrupted run of clean cards > - dirty_cards, clean_cards: as above, but counts of cards as a percentage of the chunk > - max_dirty_run & max_clean_run: Similarly for the maximum of each. > - dirty_objs, clean_objs: these are numbers of objects in any chunk walked or scanned > - dirty_scans, clean_scans: numbers of objects scanned by the closure > - alternations: the number of times that we transitioned from clean to dirty or dirty to clean in a chunk > > For example, the last cumulative log data (for UR) above indicates that at least 75% of the chunks have no alternations at all, > and cards are almost always mostly clean for this specific prefix of the run. > > Comparing worker stats from worker 0 and worker 1 indicates that in particular scans they may see different distributions of dirty cards for specific benchmarks, based on their promotion and mutation behavior. > > **Question:** > Would it make sense to also print, for example, the 1, 10, 90 and 99 percentiles for these metrics, in addition to the quartiles, min, and max? Y. Srinivas Ramakrishna has updated the pull request incrementally with one additional commit since the last revision: Reword some code comments for greater clarity. ------------- Changes: - all: https://git.openjdk.org/shenandoah/pull/176/files - new: https://git.openjdk.org/shenandoah/pull/176/files/9c5c741f..1bc59f89 Webrevs: - full: https://webrevs.openjdk.org/?repo=shenandoah&pr=176&range=12 - incr: https://webrevs.openjdk.org/?repo=shenandoah&pr=176&range=11-12 Stats: 3 lines in 1 file changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.org/shenandoah/pull/176.diff Fetch: git fetch https://git.openjdk.org/shenandoah pull/176/head:pull/176 PR: https://git.openjdk.org/shenandoah/pull/176 From ysr at openjdk.org Wed Dec 21 17:29:56 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Wed, 21 Dec 2022 17:29:56 GMT Subject: RFR: JDK-8297796 GenShen: instrument the remembered set scan [v12] In-Reply-To: References: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com> Message-ID: On Mon, 19 Dec 2022 23:56:23 GMT, William Kemper wrote: >> I mean percentage where I said `%ge`. I'll clarify the comments a bit more. Please continue the review and I'll improve some of the documentation comments for clarity. > > Got it - I had `age` on my brain. Slightly reworded some of the code comments for greater clarity. ------------- PR: https://git.openjdk.org/shenandoah/pull/176 From ysr at openjdk.org Wed Dec 21 17:43:22 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Wed, 21 Dec 2022 17:43:22 GMT Subject: RFR: JDK-8297796 GenShen: instrument the remembered set scan [v12] In-Reply-To: References: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com> <_a7yKm6N4ztQ-u70-Ds38Vg2_AepwHAPQxWfZHsTyOY=.3f7615c3-6b02-4915-aa58-b7ebcbdb2b56@github.com> Message-ID: <9XSE_9xYUeivBuMphNIxLbN-qviEjMK8KhjFIz9lp9E=.389475e2-b9b6-41cd-9409-8ebb75e40861@github.com> On Mon, 19 Dec 2022 23:52:39 GMT, Y.
Srinivas Ramakrishna wrote: >> src/hotspot/share/gc/shenandoah/shenandoahCardStats.hpp line 61: >> >>> 59: _cards_in_cluster(cards_in_cluster), >>> 60: _local_card_stats(card_stats), >>> 61: _last_dirty(false), >> >> Should it always be the case that `_last_dirty != _last_clean`? Could we use one variable here instead of two? > > I believe we switch into one of two modes based on the first card we encounter. So there are 3 states: an initial (neither), and then subsequently either dirty or clean. That's 3 states, which takes 2 bits. It's possible I could shrink it to 1 bit with some cleverness, but figured I wouldn't try too hard as this is still all non-product. I'll think some more about it. I re-examined the code with an eye to reducing the use of flags. Although I think it might be possible, I believe it would complicate the structure of the code a bit because of the asymmetric treatment of clean and dirty it would entail. As things stand, the symmetric treatment leads to a more uniform and consistent code structure that makes the code more maintainable. We can revisit the reduction in state separately in the fullness of time. I'll resolve this comment based on the above reasoning. ------------- PR: https://git.openjdk.org/shenandoah/pull/176 From sviswanathan at openjdk.org Wed Dec 21 18:13:54 2022 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Wed, 21 Dec 2022 18:13:54 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v16] In-Reply-To: References: Message-ID: On Wed, 21 Dec 2022 17:29:23 GMT, Claes Redestad wrote: >> Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops. >> >> I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly, do the null-check outside of the intrinsic for the `Arrays.hashCode` cases. >> >> Having a centralized entry point means it'll be easier to parameterize the factor and start values, which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front. >> >> The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic.
We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads.
>> >> With the most recent fixes the x64 intrinsic results on my workstation look like this:
>> >> Benchmark (size) Mode Cnt Score Error Units
>> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.199 ± 0.017 ns/op
>> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 6.933 ± 0.049 ns/op
>> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 29.935 ± 0.221 ns/op
>> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1596.982 ± 7.020 ns/op
>> >> Baseline:
>> >> Benchmark (size) Mode Cnt Score Error Units
>> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.200 ± 0.013 ns/op
>> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 9.424 ± 0.122 ns/op
>> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 90.541 ± 0.512 ns/op
>> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 9425.321 ± 67.630 ns/op
>> >> I.e. no measurable overhead compared to baseline even for `size == 1`.
>> >> The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good.
>> >> Benchmark for `Arrays.hashCode`:
>> >> Benchmark (size) Mode Cnt Score Error Units
>> ArraysHashCode.bytes 1 avgt 5 1.884 ± 0.013 ns/op
>> ArraysHashCode.bytes 10 avgt 5 6.955 ± 0.040 ns/op
>> ArraysHashCode.bytes 100 avgt 5 87.218 ± 0.595 ns/op
>> ArraysHashCode.bytes 10000 avgt 5 9419.591 ± 38.308 ns/op
>> ArraysHashCode.chars 1 avgt 5 2.200 ± 0.010 ns/op
>> ArraysHashCode.chars 10 avgt 5 6.935 ± 0.034 ns/op
>> ArraysHashCode.chars 100 avgt 5 30.216 ± 0.134 ns/op
>> ArraysHashCode.chars 10000 avgt 5 1601.629 ± 6.418 ns/op
>> ArraysHashCode.ints 1 avgt 5 2.200 ± 0.007 ns/op
>> ArraysHashCode.ints 10 avgt 5 6.936 ± 0.034 ns/op
>> ArraysHashCode.ints 100 avgt 5 29.412 ± 0.268 ns/op
>> ArraysHashCode.ints 10000 avgt 5 1610.578 ± 7.785 ns/op
>> ArraysHashCode.shorts 1 avgt 5 1.885 ± 0.012 ns/op
>> ArraysHashCode.shorts 10 avgt 5 6.961 ± 0.034 ns/op
>> ArraysHashCode.shorts 100 avgt 5 87.095 ± 0.417 ns/op
>> ArraysHashCode.shorts 10000 avgt 5 9420.617 ± 50.089 ns/op
>> >> Baseline:
>> >> Benchmark (size) Mode Cnt Score Error Units
>> ArraysHashCode.bytes 1 avgt 5 3.213 ± 0.207 ns/op
>> ArraysHashCode.bytes 10 avgt 5 8.483 ± 0.040 ns/op
>> ArraysHashCode.bytes 100 avgt 5 90.315 ± 0.655 ns/op
>> ArraysHashCode.bytes 10000 avgt 5 9422.094 ± 62.402 ns/op
>> ArraysHashCode.chars 1 avgt 5 3.040 ± 0.066 ns/op
>> ArraysHashCode.chars 10 avgt 5 8.497 ± 0.074 ns/op
>> ArraysHashCode.chars 100 avgt 5 90.074 ± 0.387 ns/op
>> ArraysHashCode.chars 10000 avgt 5 9420.474 ± 41.619 ns/op
>> ArraysHashCode.ints 1 avgt 5 2.827 ± 0.019 ns/op
>> ArraysHashCode.ints 10 avgt 5 7.727 ± 0.043 ns/op
>> ArraysHashCode.ints 100 avgt 5 89.405 ± 0.593 ns/op
>> ArraysHashCode.ints 10000 avgt 5 9426.539 ± 51.308 ns/op
>> ArraysHashCode.shorts 1 avgt 5 3.071 ± 0.062 ns/op
>> ArraysHashCode.shorts 10 avgt 5 8.168 ± 0.049 ns/op
>> ArraysHashCode.shorts 100 avgt 5 90.399 ± 0.292 ns/op
>> ArraysHashCode.shorts 10000 avgt 5 9420.171 ± 44.474 ns/op
>> >> >> As we can see the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized). I aim to fix the `byte` and `short` cases before integrating, though it might be acceptable to hand that off as follow-up enhancements so as not to further delay integration of this enhancement.
> > Claes Redestad has updated the pull request incrementally with one additional commit since the last revision: > > Treat Op_VectorizedHashCode as other similar Ops in split_unique_types The PR looks good to me. ------------- Marked as reviewed by sviswanathan (Reviewer). PR: https://git.openjdk.org/jdk/pull/10847 From kdnilsen at openjdk.org Wed Dec 21 18:57:25 2022 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Wed, 21 Dec 2022 18:57:25 GMT Subject: RFR: Initial sizing refactor [v2] In-Reply-To: References: Message-ID: On Tue, 20 Dec 2022 20:12:39 GMT, William Kemper wrote: >> Some things to highlight here: >> * This change borrows a bit of code from G1 to handle processing of command line arguments used to size the young generation. >> * A (hard coded for now) threshold on the difference between young/old time has been added to reduce resizing churn. >> * The adaptive heuristic doesn't consider the `soft_tail` anymore. `available` is already adjusted for the soft max capacity. >> * `SoftMaxHeapSize` is used to compute the soft max size and max size for the young generation. > > William Kemper has updated the pull request incrementally with one additional commit since the last revision: > > Improve assertions and comments Marked as reviewed by kdnilsen (Committer). src/hotspot/share/gc/shenandoah/shenandoahMmuTracker.cpp line 235: > 233: } > 234: } > 235: I still question whether this is the right trigger for when to enlarge old-gen. A properly running generational GC will spend nearly all its time doing young-gen and very little time doing old-gen. The trigger for enlarging old-gen should be that we experience promotion failures (and/or that we identify at the end of init mark that we have more live data in aged regions than will fit in the current old-gen). Old-gen collection triggers need to be refined when we are auto-sizing. We can't use "percent free in old" or even "time to collect old > time to exhaust old", because we are trying to auto-tune to maintain that the percent free in old is very small. We need a new way to trigger old-gen GCs. Maybe, we trigger old-gen GC (rather than enlarging old-gen) if an old-gen enlargement request would cause us to exceed a target max-size for old-gen? This doesn't have to be addressed in current PR. But I'm just thinking that refinements of these heuristics may be necessary eventually. ------------- PR: https://git.openjdk.org/shenandoah/pull/185 From kdnilsen at openjdk.org Wed Dec 21 18:57:26 2022 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Wed, 21 Dec 2022 18:57:26 GMT Subject: RFR: Initial sizing refactor [v2] In-Reply-To: References: Message-ID: On Wed, 21 Dec 2022 18:44:38 GMT, Kelvin Nilsen wrote: >> William Kemper has updated the pull request incrementally with one additional commit since the last revision: >> >> Improve assertions and comments > > src/hotspot/share/gc/shenandoah/shenandoahMmuTracker.cpp line 235: > >> 233: } >> 234: } >> 235: > > I still question whether this is the right trigger for when to enlarge old-gen. A properly running generational GC will spend nearly all its time doing young-gen and very little time doing old-gen. > > The trigger for enlarging old-gen should be that we experience promotion failures (and/or that we identify at the end of init mark that we have more live data in aged regions than will fit in the current old-gen). > > Old-gen collection triggers need to be refined when we are auto-sizing. 
We can't use "percent free in old" or even "time to collect old > time to exhaust old", because we are trying to auto-tune to maintain that the percent free in old is very small. We need a new way to trigger old-gen GCs. > > Maybe we trigger old-gen GC (rather than enlarging old-gen) if an old-gen enlargement request would cause us to exceed a target max-size for old-gen? > > This doesn't have to be addressed in the current PR. But I'm just thinking that refinements of these heuristics may be necessary eventually. Maybe the target max-size for old-gen is auto-tuned to, e.g., 10% larger than the maximum old-gen live memory following old-gen concurrent mark, with some bias given to more recently observed measurements of old-gen live memory. ------------- PR: https://git.openjdk.org/shenandoah/pull/185 From wkemper at openjdk.org Wed Dec 21 19:17:28 2022 From: wkemper at openjdk.org (William Kemper) Date: Wed, 21 Dec 2022 19:17:28 GMT Subject: RFR: Initial sizing refactor [v2] In-Reply-To: References: Message-ID: On Tue, 20 Dec 2022 00:51:43 GMT, Y. Srinivas Ramakrishna wrote: >> William Kemper has updated the pull request incrementally with one additional commit since the last revision: >> >> Improve assertions and comments > > src/hotspot/share/gc/shenandoah/shenandoahGeneration.cpp line 995: > >> 993: shenandoah_assert_heaplocked_or_safepoint(); >> 994: #ifdef ASSERT >> 995: if (generation_mode() == YOUNG) { > > Why the special treatment of young here and in the next method? Is that the only one where max capacity matters? > > I might have expected an assertion oblivious of the youth of a generation, which would simply check upon an increment or a decrement that the floor and ceiling (min and max) capacities of that generation were being respected, irrespective of whether it was a young or old generation? I'll fix this. ------------- PR: https://git.openjdk.org/shenandoah/pull/185 From wkemper at openjdk.org Wed Dec 21 19:17:28 2022 From: wkemper at openjdk.org (William Kemper) Date: Wed, 21 Dec 2022 19:17:28 GMT Subject: RFR: Initial sizing refactor [v2] In-Reply-To: References: Message-ID: On Wed, 21 Dec 2022 18:48:06 GMT, Kelvin Nilsen wrote: >> src/hotspot/share/gc/shenandoah/shenandoahMmuTracker.cpp line 235: >> >>> 233: } >>> 234: } >>> 235: >> >> I still question whether this is the right trigger for when to enlarge old-gen. A properly running generational GC will spend nearly all its time doing young-gen and very little time doing old-gen. >> >> The trigger for enlarging old-gen should be that we experience promotion failures (and/or that we identify at the end of init mark that we have more live data in aged regions than will fit in the current old-gen). >> >> Old-gen collection triggers need to be refined when we are auto-sizing. We can't use "percent free in old" or even "time to collect old > time to exhaust old", because we are trying to auto-tune to maintain that the percent free in old is very small. We need a new way to trigger old-gen GCs. >> >> Maybe we trigger old-gen GC (rather than enlarging old-gen) if an old-gen enlargement request would cause us to exceed a target max-size for old-gen? >> >> This doesn't have to be addressed in the current PR. But I'm just thinking that refinements of these heuristics may be necessary eventually. > > Maybe the target max-size for old-gen is auto-tuned to, e.g., 10% larger than the maximum old-gen live memory following old-gen concurrent mark, with some bias given to more recently observed measurements of old-gen live memory.
I agree we can wire up more signals to the resizing mechanism. In the scenario you describe, where old generation has become _too small_ and old collections are running _too frequently_, the MMU based resizing would enlarge the old generation. ------------- PR: https://git.openjdk.org/shenandoah/pull/185 From wkemper at openjdk.org Wed Dec 21 19:45:18 2022 From: wkemper at openjdk.org (William Kemper) Date: Wed, 21 Dec 2022 19:45:18 GMT Subject: RFR: JDK-8297796 GenShen: instrument the remembered set scan [v12] In-Reply-To: <9XSE_9xYUeivBuMphNIxLbN-qviEjMK8KhjFIz9lp9E=.389475e2-b9b6-41cd-9409-8ebb75e40861@github.com> References: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com> <_a7yKm6N4ztQ-u70-Ds38Vg2_AepwHAPQxWfZHsTyOY=.3f7615c3-6b02-4915-aa58-b7ebcbdb2b56@github.com> <9XSE_9xYUeivBuMphNIxLbN-qviEjMK8KhjFIz9lp9E=.389475e2-b9b6-41cd-9409-8ebb75e40861@github.com> Message-ID: On Wed, 21 Dec 2022 17:40:44 GMT, Y. Srinivas Ramakrishna wrote: >> I believe we switch into one of two modes based on the first card we encounter. So there are 3 states: an initial (neither), and then subsequently either dirty or clean. So there are 3 states, which is 2 bits. It's possible I could shrink it to 1 bit with some cleverness, but figured I wouldn't try too hard as this is still all non-product. I'll think some more about it. > > I re-examined the code with an eye to reducing the use of flags. Although I think it might be possible, I believe it would complicate the structure of the code a bit because of teh asymmetric treatment of clean and dirty. As things stand the symmetry of the treatment leads to more uniform and consistent code structure that makes the code more maintainable. We can revisit the reduction in state separately in the fullness of time. > > I'll resolve this comment based on the above reasoning. Okay, I was thinking that code of the form: if (_last_dirty) { // ... } else if (_last_clean) { // ... } Could be: if (_last_dirty) { // ... } else { // ... } But, as you point out, this doesn't handle the initial case. ------------- PR: https://git.openjdk.org/shenandoah/pull/176 From wkemper at openjdk.org Wed Dec 21 19:58:17 2022 From: wkemper at openjdk.org (William Kemper) Date: Wed, 21 Dec 2022 19:58:17 GMT Subject: RFR: JDK-8297796 GenShen: instrument the remembered set scan [v12] In-Reply-To: References: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com> Message-ID: On Wed, 21 Dec 2022 17:43:55 GMT, Y. Srinivas Ramakrishna wrote: >> I updated the summary comment at the top of the PR at https://github.com/openjdk/shenandoah/pull/176#issue-1471869802 with the new format based on Kelvin's suggestion in https://github.com/openjdk/shenandoah/pull/176#issuecomment-1342840919 above. It shows that the per round stats, separated by worker and scan type (UR or RS) is needed, and that the cumulative stats may have lost some of the nuance present in the per round/per scan type stats. > > Please let me know if there is still any confusion wrt the documentation of `ShenandoahCardStatsLogInterval` or if you'd prefer a rewording. Thanks! Maybe "Log cumulative card stats every so many scans of the remembered set"? "Cycle" is a bit overloaded. If I read this, I would expect to see a log message every 50 GC cycles, but with (probably) two rset scans per GC cycle, it would be closer to every 25 GC cycles. 
------------- PR: https://git.openjdk.org/shenandoah/pull/176 From ysr at openjdk.org Wed Dec 21 21:46:36 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Wed, 21 Dec 2022 21:46:36 GMT Subject: RFR: Initial sizing refactor [v2] In-Reply-To: <0aHvoWKYrsxlTTZzxXIC6phNFpSSFvC6U28Wtn5OKA8=.3a0050b6-b648-47c3-a4f3-93fbaae3b222@github.com> References: <0aHvoWKYrsxlTTZzxXIC6phNFpSSFvC6U28Wtn5OKA8=.3a0050b6-b648-47c3-a4f3-93fbaae3b222@github.com> Message-ID: On Tue, 20 Dec 2022 19:16:45 GMT, William Kemper wrote: >> src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp line 459: >> >>> 457: _young_generation = new ShenandoahYoungGeneration(_max_workers, max_capacity_young, initial_capacity_young); >>> 458: _old_generation = new ShenandoahOldGeneration(_max_workers, max_capacity_old, initial_capacity_old); >>> 459: _global_generation = new ShenandoahGlobalGeneration(_max_workers, soft_max_capacity(), soft_max_capacity()); >> >> A single line of comment here would be helpful here. It sounds as if the idea is that for the so-called global generation (which I assume is identified with the entirety of the committed heap at any time), the initial and max (floor and ceiling) are both set at `soft_max_capacity` ? What does that mean? I might have naively expected this to be, respectivley, `max_old + max_young` and `initial_old + initial_young` like you had it before. > > I've been thinking of the max capacity as the maximum _allowed_ capacity. For example, the maximum _allowed_ capacity for old would be `total heap - minimum capacity of young`. So, the sum of the maximum allowed for old and young could exceed the total. If that makes sense, I will put the explanation in a comment here. Makes sense; thanks for the updated comments! ------------- PR: https://git.openjdk.org/shenandoah/pull/185 From wkemper at openjdk.org Wed Dec 21 21:56:17 2022 From: wkemper at openjdk.org (William Kemper) Date: Wed, 21 Dec 2022 21:56:17 GMT Subject: Integrated: Initial sizing refactor In-Reply-To: References: Message-ID: On Fri, 16 Dec 2022 00:36:41 GMT, William Kemper wrote: > Some things to highlight here: > * This change borrows a bit of code from G1 to handle processing of command line arguments used to size the young generation. > * A (hard coded for now) threshold on the difference between young/old time has been added to reduce resizing churn. > * The adaptive heuristic doesn't consider the `soft_tail` anymore. `available` is already adjusted for the soft max capacity. > * `SoftMaxHeapSize` is used to compute the soft max size and max size for the young generation. This pull request has now been integrated. Changeset: da950117 Author: William Kemper URL: https://git.openjdk.org/shenandoah/commit/da9501170820b3e32b903228bd921aaa860e90c0 Stats: 389 lines in 13 files changed: 259 ins; 69 del; 61 mod Initial sizing refactor Reviewed-by: kdnilsen ------------- PR: https://git.openjdk.org/shenandoah/pull/185 From ysr at openjdk.org Wed Dec 21 22:05:20 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Wed, 21 Dec 2022 22:05:20 GMT Subject: RFR: Initial sizing refactor [v2] In-Reply-To: References: Message-ID: On Tue, 20 Dec 2022 20:12:39 GMT, William Kemper wrote: >> Some things to highlight here: >> * This change borrows a bit of code from G1 to handle processing of command line arguments used to size the young generation. >> * A (hard coded for now) threshold on the difference between young/old time has been added to reduce resizing churn. 
>> * The adaptive heuristic doesn't consider the `soft_tail` anymore. `available` is already adjusted for the soft max capacity. >> * `SoftMaxHeapSize` is used to compute the soft max size and max size for the young generation. > > William Kemper has updated the pull request incrementally with one additional commit since the last revision: > > Improve assertions and comments src/hotspot/share/gc/shenandoah/shenandoahMmuTracker.cpp line 268: > 266: } > 267: > 268: size_t round_down_to_multiple_of_region_size(size_t bytes) { I could have sworn there was a rounding utility/macro extensively used in sizing code, but the only one I found was a power of 2 rounder. The alternative, if one maintained a log of heap region size (being a power of 2), would be to use a bit-mask here. Anyway, nothing to do here; this looks good for now. ------------- PR: https://git.openjdk.org/shenandoah/pull/185 From ysr at openjdk.org Wed Dec 21 22:20:19 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Wed, 21 Dec 2022 22:20:19 GMT Subject: RFR: Initial sizing refactor [v2] In-Reply-To: References: Message-ID: On Tue, 20 Dec 2022 20:12:39 GMT, William Kemper wrote: >> Some things to highlight here: >> * This change borrows a bit of code from G1 to handle processing of command line arguments used to size the young generation. >> * A (hard coded for now) threshold on the difference between young/old time has been added to reduce resizing churn. >> * The adaptive heuristic doesn't consider the `soft_tail` anymore. `available` is already adjusted for the soft max capacity. >> * `SoftMaxHeapSize` is used to compute the soft max size and max size for the young generation. > > William Kemper has updated the pull request incrementally with one additional commit since the last revision: > > Improve assertions and comments src/hotspot/share/gc/shenandoah/shenandoahMmuTracker.cpp line 222: > 220: log_info(gc)("Thread Usr+Sys YOUNG = %.3f, OLD = %.3f, GLOBAL = %.3f", young_time_s, old_time_s, global_time_s); > 221: > 222: if (abs(delta) <= transfer_threshold) { I thought the original idea was to use the difference in MMU's for old and young as the error signal to drive the (direction of the) transfer, rather than the difference in the actual times? Am I misinterpreting what `reset_collection_time` returns? You do refer to it as `thread utilization` (akin to MMU) in the log message below. src/hotspot/share/gc/shenandoah/shenandoahYoungGeneration.hpp line 57: > 55: // Returns true if the young generation is configured to enqueue old > 56: // oops for the old generation mark queues. > 57: bool is_bootstrap_cycle() { Why is this called a `bootstrap cycle`? I must be missing some big-picture background of nomenclature here. ------------- PR: https://git.openjdk.org/shenandoah/pull/185 From ysr at openjdk.org Wed Dec 21 22:20:20 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Wed, 21 Dec 2022 22:20:20 GMT Subject: RFR: Initial sizing refactor [v2] In-Reply-To: References: Message-ID: On Wed, 21 Dec 2022 19:14:14 GMT, William Kemper wrote: >> Maybe the target max-size for old-gen is auto-tuned to, e.g., 10% larger than the maximum old-gen live memory following old-gen concurrent mark, with some bias given to more recently observed measurements of old-gen live memory. > > I agree we can wire up more signals to the resizing mechanism. In the scenario you describe, where old generation has become _too small_ and old collections are running _too frequently_, the MMU based resizing would enlarge the old generation.
What do you do before any data is available for one of the MMU trackers? (for example before the first old collection cycle has happened.) I'd assume that the control algorithm wouldn't kick in until the data driving the control signal was valid and available. Where is that done? ------------- PR: https://git.openjdk.org/shenandoah/pull/185 From ysr at openjdk.org Wed Dec 21 22:25:28 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Wed, 21 Dec 2022 22:25:28 GMT Subject: RFR: Initial sizing refactor [v2] In-Reply-To: References: Message-ID: <6RCKf5W_EcmS_hB1-ypkoyKoxwMakPl61oJtoUdeoG8=.d9c53e13-be73-4ece-9108-8f2e81810e98@github.com> On Tue, 20 Dec 2022 20:12:39 GMT, William Kemper wrote: >> Some things to highlight here: >> * This change borrows a bit of code from G1 to handle processing of command line arguments used to size the young generation. >> * A (hard coded for now) threshold on the difference between young/old time has been added to reduce resizing churn. >> * The adaptive heuristic doesn't consider the `soft_tail` anymore. `available` is already adjusted for the soft max capacity. >> * `SoftMaxHeapSize` is used to compute the soft max size and max size for the young generation. > > William Kemper has updated the pull request incrementally with one additional commit since the last revision: > > Improve assertions and comments General direction looks good to me, as well as the refactoring which greatly improved the structure of the code. I just had one question about the error signal that drives the control actuation for resizing. I'd expected that to be the _difference of mmu averages_ measured whenever new data is available for these control signals, rather than _difference of collection times_, which it looked like to me (although it's possible I misunderstood). Also, it'd be great to compare some basic performance numbers with the control algorithm off vs on to show its efficacy. Thanks! ------------- Marked as reviewed by ysr (Author). PR: https://git.openjdk.org/shenandoah/pull/185 From wkemper at openjdk.org Wed Dec 21 22:26:20 2022 From: wkemper at openjdk.org (William Kemper) Date: Wed, 21 Dec 2022 22:26:20 GMT Subject: RFR: Avoid divide by zero error, improve variable names Message-ID: Depending on when the periodic thread runs the accounting task and how much CPU time the process receives, we may see a very small elapsed process time. Such a time should not be used to compute MMU. ------------- Commit messages: - Avoid divide by zero error, improve variable names Changes: https://git.openjdk.org/shenandoah/pull/188/files Webrev: https://webrevs.openjdk.org/?repo=shenandoah&pr=188&range=00 Stats: 12 lines in 1 file changed: 6 ins; 1 del; 5 mod Patch: https://git.openjdk.org/shenandoah/pull/188.diff Fetch: git fetch https://git.openjdk.org/shenandoah pull/188/head:pull/188 PR: https://git.openjdk.org/shenandoah/pull/188 From wkemper at openjdk.org Wed Dec 21 22:31:17 2022 From: wkemper at openjdk.org (William Kemper) Date: Wed, 21 Dec 2022 22:31:17 GMT Subject: RFR: Initial sizing refactor [v2] In-Reply-To: References: Message-ID: On Wed, 21 Dec 2022 22:17:54 GMT, Y. Srinivas Ramakrishna wrote: >> William Kemper has updated the pull request incrementally with one additional commit since the last revision: >> >> Improve assertions and comments > > src/hotspot/share/gc/shenandoah/shenandoahYoungGeneration.hpp line 57: > >> 55: // Returns true if the young generation is configured to enqueue old >> 56: // oops for the old generation mark queues. 
>> 57: bool is_bootstrap_cycle() { > > Why is this called a `bootstrap cycle`? I must be missing some big picture background of nomenclature here. Every old generation cycle is preceded by a young collection. We call this a bootstrap cycle because it populates the old generation mark queues with old objects it encountered during the marking of young. Otherwise, we'd have to maintain a reverse-remembered set for young->old pointers. ------------- PR: https://git.openjdk.org/shenandoah/pull/185 From wkemper at openjdk.org Wed Dec 21 23:09:17 2022 From: wkemper at openjdk.org (William Kemper) Date: Wed, 21 Dec 2022 23:09:17 GMT Subject: RFR: Initial sizing refactor [v2] In-Reply-To: References: Message-ID: <83ZqvSMvhjIoNc3-w21hG5ssMhCTzdWODEdSKgFYlM4=.b2da43e0-e722-4cd9-a80b-9bd2afc389c5@github.com> On Wed, 21 Dec 2022 22:13:42 GMT, Y. Srinivas Ramakrishna wrote: >> William Kemper has updated the pull request incrementally with one additional commit since the last revision: >> >> Improve assertions and comments > > src/hotspot/share/gc/shenandoah/shenandoahMmuTracker.cpp line 222: > >> 220: log_info(gc)("Thread Usr+Sys YOUNG = %.3f, OLD = %.3f, GLOBAL = %.3f", young_time_s, old_time_s, global_time_s); >> 221: >> 222: if (abs(delta) <= transfer_threshold) { > > I thought the original idea was to use the difference in MMU's for old and young as the error signal to drive the (direction of the) transfer, rather than the difference in the actual times? Am I misinterpreting what `reset_collection_time` returns? You do refer to it as `thread utilization` (akin to MMU) in the log message below. I reasoned that the denominator in a comparison between MMU's (i.e., process time) would be the same on both sides of the comparison, so I omitted it. ------------- PR: https://git.openjdk.org/shenandoah/pull/185 From kdnilsen at openjdk.org Wed Dec 21 23:22:17 2022 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Wed, 21 Dec 2022 23:22:17 GMT Subject: RFR: Avoid divide by zero error, improve variable names In-Reply-To: References: Message-ID: On Wed, 21 Dec 2022 22:19:13 GMT, William Kemper wrote: > Depending on when the periodic thread runs the accounting task and how much CPU time the process receives, we may see a very small elapsed process time. Such a time should not be used to compute MMU. Marked as reviewed by kdnilsen (Committer). ------------- PR: https://git.openjdk.org/shenandoah/pull/188 From wkemper at openjdk.org Wed Dec 21 23:44:19 2022 From: wkemper at openjdk.org (William Kemper) Date: Wed, 21 Dec 2022 23:44:19 GMT Subject: Integrated: Avoid divide by zero error, improve variable names In-Reply-To: References: Message-ID: On Wed, 21 Dec 2022 22:19:13 GMT, William Kemper wrote: > Depending on when the periodic thread runs the accounting task and how much CPU time the process receives, we may see a very small elapsed process time. Such a time should not be used to compute MMU. This pull request has now been integrated. Changeset: d793fd16 Author: William Kemper URL: https://git.openjdk.org/shenandoah/commit/d793fd1620e430be6fcba83361b421856f5bbfd5 Stats: 12 lines in 1 file changed: 6 ins; 1 del; 5 mod Avoid divide by zero error, improve variable names Reviewed-by: kdnilsen ------------- PR: https://git.openjdk.org/shenandoah/pull/188 From ysr at openjdk.org Wed Dec 21 23:54:22 2022 From: ysr at openjdk.org (Y. 
Srinivas Ramakrishna) Date: Wed, 21 Dec 2022 23:54:22 GMT Subject: RFR: Avoid divide by zero error, improve variable names In-Reply-To: References: Message-ID: On Wed, 21 Dec 2022 22:19:13 GMT, William Kemper wrote: > Depending on when the periodic thread runs the accounting task and how much CPU time the process receives, we may see a very small elapsed process time. Such a time should not be used to compute MMU. Marked as reviewed by ysr (Author). src/hotspot/share/gc/shenandoah/shenandoahMmuTracker.cpp line 97: > 95: double process_time_s = process_time_seconds(); > 96: double elapsed_process_time_s = process_time_s - _process_reference_time_s; > 97: if (elapsed_process_time_s <= 0.01) { Is there a different mechanism one could use to check if there was an update? E.g. a change in the count of GC cycles started or ended, rather than relying on the magnitude of the difference, although the magnitude of the difference may still be sufficient. One related question: is this unit "seconds" or some smaller unit? The `_s` implies it's a second and 0.01 s is 10 ms which is not insubstantial... ------------- PR: https://git.openjdk.org/shenandoah/pull/188 From wkemper at openjdk.org Wed Dec 21 23:55:18 2022 From: wkemper at openjdk.org (William Kemper) Date: Wed, 21 Dec 2022 23:55:18 GMT Subject: RFR: Initial sizing refactor [v2] In-Reply-To: References: Message-ID: On Wed, 21 Dec 2022 22:03:03 GMT, Y. Srinivas Ramakrishna wrote: >> William Kemper has updated the pull request incrementally with one additional commit since the last revision: >> >> Improve assertions and comments > > src/hotspot/share/gc/shenandoah/shenandoahMmuTracker.cpp line 268: > >> 266: } >> 267: >> 268: size_t round_down_to_multiple_of_region_size(size_t bytes) { > > I could have sworn there was a rounding utility/macro extensively used in sizing code, but the only one I found was a power of 2 rounder. The alternative, if one maintained a log of heap region size (being a power of 2) would be to use a bit-mask here. > > Anyway, nothing to do here; this looks good for now. I wasn't sure if region size was coerced to a power of a 2, but it seems to be the case: size_t ShenandoahHeapRegion::setup_sizes(size_t max_heap_size) { // ... int region_size_log = log2i(region_size); // Recalculate the region size to make sure it's a power of // 2. This means that region_size is the largest power of 2 that's // <= what we've calculated so far. region_size = size_t(1) << region_size_log; I'll make that change. ------------- PR: https://git.openjdk.org/shenandoah/pull/185 From ysr at openjdk.org Thu Dec 22 00:02:21 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Thu, 22 Dec 2022 00:02:21 GMT Subject: RFR: Avoid divide by zero error, improve variable names In-Reply-To: References: Message-ID: <5oG7-XLGQR1sNYJHGEDZyuHH-xwiU7Rw7gztUYnMSBA=.c2d69cf1-b488-4214-9723-0337617a7691@github.com> On Wed, 21 Dec 2022 23:49:26 GMT, Y. Srinivas Ramakrishna wrote: >> Depending on when the periodic thread runs the accounting task and how much CPU time the process receives, we may see a very small elapsed process time. Such a time should not be used to compute MMU. > > src/hotspot/share/gc/shenandoah/shenandoahMmuTracker.cpp line 97: > >> 95: double process_time_s = process_time_seconds(); >> 96: double elapsed_process_time_s = process_time_s - _process_reference_time_s; >> 97: if (elapsed_process_time_s <= 0.01) { > > Is there a different mechanism one could use to check if there was an update? E.g. 
a change in the count of GC cycles started or ended, rather than relying on the magnitude of the difference, although the magnitude of the difference may still be sufficient. One related question: is this unit "seconds" or some smaller unit? The `_s` implies it's a second and 0.01 s is 10 ms, which is not insubstantial... ------------- PR: https://git.openjdk.org/shenandoah/pull/188 From wkemper at openjdk.org Wed Dec 21 23:55:18 2022 From: wkemper at openjdk.org (William Kemper) Date: Wed, 21 Dec 2022 23:55:18 GMT Subject: RFR: Initial sizing refactor [v2] In-Reply-To: References: Message-ID: On Wed, 21 Dec 2022 22:03:03 GMT, Y. Srinivas Ramakrishna wrote: >> William Kemper has updated the pull request incrementally with one additional commit since the last revision: >> >> Improve assertions and comments > > src/hotspot/share/gc/shenandoah/shenandoahMmuTracker.cpp line 268: > >> 266: } >> 267: >> 268: size_t round_down_to_multiple_of_region_size(size_t bytes) { > > I could have sworn there was a rounding utility/macro extensively used in sizing code, but the only one I found was a power of 2 rounder. The alternative, if one maintained a log of heap region size (being a power of 2), would be to use a bit-mask here. > > Anyway, nothing to do here; this looks good for now. I wasn't sure if region size was coerced to a power of 2, but it seems to be the case: size_t ShenandoahHeapRegion::setup_sizes(size_t max_heap_size) { // ... int region_size_log = log2i(region_size); // Recalculate the region size to make sure it's a power of // 2. This means that region_size is the largest power of 2 that's // <= what we've calculated so far. region_size = size_t(1) << region_size_log; I'll make that change. ------------- PR: https://git.openjdk.org/shenandoah/pull/185 From ysr at openjdk.org Thu Dec 22 00:02:21 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Thu, 22 Dec 2022 00:02:21 GMT Subject: RFR: Avoid divide by zero error, improve variable names In-Reply-To: References: Message-ID: <5oG7-XLGQR1sNYJHGEDZyuHH-xwiU7Rw7gztUYnMSBA=.c2d69cf1-b488-4214-9723-0337617a7691@github.com> On Wed, 21 Dec 2022 23:49:26 GMT, Y. Srinivas Ramakrishna wrote: >> src/hotspot/share/gc/shenandoah/shenandoahMmuTracker.cpp line 97: >> >>> 95: double process_time_s = process_time_seconds(); >>> 96: double elapsed_process_time_s = process_time_s - _process_reference_time_s; >>> 97: if (elapsed_process_time_s <= 0.01) { >> >> Is there a different mechanism one could use to check if there was an update? E.g. a change in the count of GC cycles started or ended, rather than relying on the magnitude of the difference, although the magnitude of the difference may still be sufficient. One related question: is this unit "seconds" or some smaller unit? The `_s` implies it's a second and 0.01 s is 10 ms, which is not insubstantial... > > In general the synchronous, time-based sampling of the MMU suffers from this issue of high variance by catching the co-initial or co-terminal portion of a cycle.
Have you looked at the variance of MMUs to see if you should sample only at the end of complete cycles, _synchronous with cycles_, rather than using a _time-based sampling trigger_? Anyway, something to think about a little bit. Each has its advantages and disadvantages, but how we gather these stats would matter if used as the basis for the error signal that drives your sizing control loop. I didn't want to use a change in the GC count, because then it would miss 100% MMU (which is something to celebrate). This should be called once every GCPauseIntervalMillis (5000 by default). It's hard to imagine the process didn't have any CPU time over such an interval, but not impossible. Seems more likely the scheduled task thread fell behind and executed the task twice in quick succession. I'm not totally sure how this happens. ------------- PR: https://git.openjdk.org/shenandoah/pull/188 From wkemper at openjdk.org Thu Dec 22 00:27:16 2022 From: wkemper at openjdk.org (William Kemper) Date: Thu, 22 Dec 2022 00:27:16 GMT Subject: RFR: Avoid divide by zero error, improve variable names In-Reply-To: <-yKh8JLLN8TuF-BUHc5yf2wWgTCy2eXrGGqgxWF7kw8=.a502bbf8-b472-4108-a722-11608c3e98e9@github.com> References: <-yKh8JLLN8TuF-BUHc5yf2wWgTCy2eXrGGqgxWF7kw8=.a502bbf8-b472-4108-a722-11608c3e98e9@github.com> Message-ID: <9UYuMzyQNh6yAXe7cmJojAo0Kq5uz26JkcHJDy04fkE=.a4c9f4ad-42a7-44c3-a520-1056e77575f9@github.com> On Wed, 21 Dec 2022 23:59:40 GMT, Y. Srinivas Ramakrishna wrote: >> Depending on when the periodic thread runs the accounting task and how much CPU time the process receives, we may see a very small elapsed process time. Such a time should not be used to compute MMU. > > src/hotspot/share/gc/shenandoah/shenandoahMmuTracker.cpp line 106: > >> 104: double elapsed_collector_time_s = collector_time_s - _collector_reference_time_s; >> 105: _collector_reference_time_s = collector_time_s; >> 106: double minimum_mutator_utilization = ((elapsed_process_time_s - elapsed_collector_time_s) / elapsed_process_time_s) * 100; > > Alternative: You could check the result with `isnan()` and not use it, although I don't know if divide by zero is costly. I'd just as soon catch it earlier and I'm not 100% certain that other values won't also cause this error. Also, `isnan` won't flag the result of division by zero (`isinf` would though). ------------- PR: https://git.openjdk.org/shenandoah/pull/188 From ysr at openjdk.org Thu Dec 22 00:40:16 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Thu, 22 Dec 2022 00:40:16 GMT Subject: RFR: Initial sizing refactor [v2] In-Reply-To: <83ZqvSMvhjIoNc3-w21hG5ssMhCTzdWODEdSKgFYlM4=.b2da43e0-e722-4cd9-a80b-9bd2afc389c5@github.com> References: <83ZqvSMvhjIoNc3-w21hG5ssMhCTzdWODEdSKgFYlM4=.b2da43e0-e722-4cd9-a80b-9bd2afc389c5@github.com> Message-ID: On Wed, 21 Dec 2022 23:06:38 GMT, William Kemper wrote: >> src/hotspot/share/gc/shenandoah/shenandoahMmuTracker.cpp line 222: >> >>> 220: log_info(gc)("Thread Usr+Sys YOUNG = %.3f, OLD = %.3f, GLOBAL = %.3f", young_time_s, old_time_s, global_time_s); >>> 221: >>> 222: if (abs(delta) <= transfer_threshold) { >> >> I thought the original idea was to use the difference in MMU's for old and young as the error signal to drive the (direction of the) transfer, rather than the difference in the actual times? Am I misinterpreting what `reset_collection_time` returns? You do refer to it as `thread utilization` (akin to MMU) in the log message below.
> > I reasoned that the denominator in a comparison between MMU's (i.e., process time) would be the same on both sides of the comparison, so I omitted it. My worry is that using the latest sample might lead to too much variance. Using a TruncatedSeq's decaying average would help with damping the error signal and lead to smoother control over sizing changes. ------------- PR: https://git.openjdk.org/shenandoah/pull/185 From wkemper at openjdk.org Thu Dec 22 01:05:18 2022 From: wkemper at openjdk.org (William Kemper) Date: Thu, 22 Dec 2022 01:05:18 GMT Subject: RFR: Initial sizing refactor [v2] In-Reply-To: References: <83ZqvSMvhjIoNc3-w21hG5ssMhCTzdWODEdSKgFYlM4=.b2da43e0-e722-4cd9-a80b-9bd2afc389c5@github.com> Message-ID: On Thu, 22 Dec 2022 00:37:05 GMT, Y. Srinivas Ramakrishna wrote: >> I reasoned that the denominator in a comparison between MMU's (i.e., process time) would be the same on both sides of the comparison, so I omitted it. > > My worry is that using the latest sample might lead to too much variance. Using a TruncatedSeq's decaying average would help with damping the error signal and lead to smoother control over sizing changes. I'll experiment with that. I was going for responsiveness. Smoothing out the signal may also introduce a delay in resizing. ------------- PR: https://git.openjdk.org/shenandoah/pull/185 From wkemper at openjdk.org Thu Dec 22 01:05:18 2022 From: wkemper at openjdk.org (William Kemper) Date: Thu, 22 Dec 2022 01:05:18 GMT Subject: RFR: Initial sizing refactor [v2] In-Reply-To: References: Message-ID: On Wed, 21 Dec 2022 22:15:58 GMT, Y. Srinivas Ramakrishna wrote: >> I agree we can wire up more signals to the resizing mechanism. In the scenario you describe, where old generation has become _too small_ and old collections are running _too frequently_, the MMU based resizing would enlarge the old generation. > > What do you do before any data is available for one of the MMU trackers? (for example before the first old collection cycle has happened.) I'd assume that the control algorithm wouldn't kick in until the data driving the control signal was valid and available. Where is that done? We generally have a flurry of "learning cycles" at startup, but it makes sense to add an explicit check to avoid premature resizing. To Kelvin's earlier point, it may also make sense to re-initiate the learning phase after resizing the generations. ------------- PR: https://git.openjdk.org/shenandoah/pull/185 From ysr at openjdk.org Thu Dec 22 04:03:16 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Thu, 22 Dec 2022 04:03:16 GMT Subject: RFR: JDK-8297796 GenShen: instrument the remembered set scan [v12] In-Reply-To: References: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com> Message-ID: On Wed, 21 Dec 2022 19:55:18 GMT, William Kemper wrote: >> Please let me know if there is still any confusion wrt the documentation of `ShenandoahCardStatsLogInterval` or if you'd prefer a rewording. Thanks! > > Maybe "Log cumulative card stats every so many scans of the remembered set"? "Cycle" is a bit overloaded. If I read this, I would expect to see a log message every 50 GC cycles, but with (probably) two rset scans per GC cycle, it would be closer to every 25 GC cycles. I'll reword along the lines of your suggestion. For the specific example you gave, we will in fact see one cumulative RS log message every 50 RS scans, and one cumulative UR log message every 50 UR scans, thus roughly one each every 50 GC cycles, if you will. 
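To make that cadence concrete, a hedged sketch of the two-counter scheme just described (hypothetical function and variable names; only `ShenandoahCardStatsLogInterval` is the real flag, and its value here is assumed):

    #include <cstdint>

    // Assumed value for illustration; in the JVM this is a command-line flag.
    static const uint32_t ShenandoahCardStatsLogInterval = 50;

    // Sketch: RS and UR scans advance independent counters, so each scan
    // type logs its cumulative histogram once every
    // ShenandoahCardStatsLogInterval scans of that type.
    void maybe_log_cumulative_stats(bool is_remembered_set_scan) {
      static uint32_t rs_scans = 0;
      static uint32_t ur_scans = 0;
      uint32_t& scans = is_remembered_set_scan ? rs_scans : ur_scans;
      if (++scans % ShenandoahCardStatsLogInterval == 0) {
        // emit the cumulative card stats histogram for this scan type
      }
    }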
The case when they may not be in lockstep might be if there were full gc's or degenerated cycles that did one (e.g. RS) but skipped the other (e.g. UR -- can this happen?), because we maintain two independent counters one for RS scans and one for UR scans. ------------- PR: https://git.openjdk.org/shenandoah/pull/176 From ysr at openjdk.org Thu Dec 22 04:16:41 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Thu, 22 Dec 2022 04:16:41 GMT Subject: RFR: JDK-8297796 GenShen: instrument the remembered set scan [v14] In-Reply-To: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com> References: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com> Message-ID: > **Updated 12/21** > > **Summary:** > The main change is card stats collection during remembered set (RS) and update refs (UR) phases when the card-table is scanned. The code is protected by a new non-product only flag `ShenandoahEnableCardStats`, which is on by default in debug builds and off in the optimized build. > > We tested the impact of the code with the flag enabled in product mode and felt the impact was non-trivial. We might, in the future, enable the code in product mode if performance can be improved. > > Stats are logged per worker thread at the end of each RS and UR scan. These stats are specific to the most recent round of scanning. Global cumulative stats across all threads (but specific to RS or UR) are also maintained, and these are logged at periodic intervals as determined by the setting of `ShenandoahCardStatsLogInterval`. > > **Format of stats produced and how to interpret them: (sample)** > > The following format is an example from a slowdebug run where the logging is enabled. In this case there are 2 concurrent gc worker threads, and `ShenandoahCardStatsLogInterval` was set at 2. The first two logs show the stats for those particular scans for each of the two worker threads, and the next set show the stats for particular scans for the two worker threads, followed by a cumulative one for that type of scan (RS or UR) across all workers and scans of that type, respectively. 
> > > [560.766s][info][gc,remset ] GC(13) Scan Remembered Set > [560.766s][info][gc,remset ] GC(13) Worker 0 Card Stats Histo: > [560.766s][info][gc,remset ] GC(13) dirty_run: [ 0.00 0.00 0.00 99.61 100.00 ] > [560.766s][info][gc,remset ] GC(13) clean_run: [ 0.00 0.00 0.00 0.00 53.12 ] > [560.766s][info][gc,remset ] GC(13) dirty_cards: [ 0.00 99.61 99.61 99.61 100.00 ] > [560.766s][info][gc,remset ] GC(13) clean_cards: [ 0.00 0.00 0.00 0.00 53.12 ] > [560.766s][info][gc,remset ] GC(13) max_dirty_run: [ 0.00 99.61 99.61 99.61 100.00 ] > [560.766s][info][gc,remset ] GC(13) max_clean_run: [ 0.00 0.00 0.00 0.00 53.12 ] > [560.766s][info][gc,remset ] GC(13) dirty_objs: [ 0.00 0.00 0.00 818.36 1366.00 ] > [560.766s][info][gc,remset ] GC(13) clean_objs: [ 0.00 0.00 0.00 0.00 8.00 ] > [560.766s][info][gc,remset ] GC(13) dirty_scans: [ 0.00 0.00 0.00 705.08 1365.00 ] > [560.766s][info][gc,remset ] GC(13) clean_scans: [ 0.00 0.00 0.00 0.00 0.00 ] > [560.766s][info][gc,remset ] GC(13) alternations: [ 0.00 0.00 0.00 0.00 16.00 ] > [560.766s][info][gc,remset ] GC(13) Worker 1 Card Stats Histo: > [560.766s][info][gc,remset ] GC(13) dirty_run: [ 0.00 0.00 0.00 96.88 100.00 ] > [560.766s][info][gc,remset ] GC(13) clean_run: [ 0.00 0.00 0.00 0.00 46.88 ] > [560.766s][info][gc,remset ] GC(13) dirty_cards: [ 18.75 82.81 98.44 99.61 100.00 ] > [560.766s][info][gc,remset ] GC(13) clean_cards: [ 0.00 0.00 0.00 0.00 46.88 ] > [560.766s][info][gc,remset ] GC(13) max_dirty_run: [ 18.75 82.81 98.44 99.61 100.00 ] > [560.766s][info][gc,remset ] GC(13) max_clean_run: [ 0.00 0.00 0.00 0.00 46.88 ] > [560.766s][info][gc,remset ] GC(13) dirty_objs: [ 0.00 269.53 353.52 814.45 1366.00 ] > [560.766s][info][gc,remset ] GC(13) clean_objs: [ 0.00 0.00 0.00 0.00 3.00 ] > [560.766s][info][gc,remset ] GC(13) dirty_scans: [ 0.00 263.67 351.56 671.88 1365.00 ] > [560.766s][info][gc,remset ] GC(13) clean_scans: [ 0.00 0.00 0.00 0.00 0.00 ] > [560.766s][info][gc,remset ] GC(13) alternations: [ 0.00 0.00 0.00 0.00 3.00 ] > [560.766s][info][gc ] GC(13) Concurrent remembered set scanning 1150.359ms > [560.766s][info][gc,start ] GC(13) Concurrent marking roots > ... 
> [585.433s][info][gc ] GC(13) Concurrent evacuation 6225.829ms > [585.433s][info][gc,start ] GC(13) Pause Init Update Refs > [585.434s][info][gc ] GC(13) Pause Init Update Refs 0.264ms > [585.434s][info][gc,start ] GC(13) Concurrent update references > [585.434s][info][gc,task ] GC(13) Using 2 of 4 workers for concurrent reference update > [585.567s][info][gc ] Average MMU = 2.925 > [590.583s][info][gc ] Average MMU = 1.509 > [595.600s][info][gc ] Average MMU = 0.835 > [600.618s][info][gc ] Average MMU = 0.447 > [605.635s][info][gc ] Average MMU = 0.253 > [610.651s][info][gc ] Average MMU = 0.114 > [615.669s][info][gc ] Average MMU = 0.130 > [620.686s][info][gc ] Average MMU = 0.129 > [622.209s][info][gc,remset ] GC(13) Update Refs > [622.209s][info][gc,remset ] GC(13) Worker 0 Card Stats Histo: > [622.209s][info][gc,remset ] GC(13) dirty_run: [ 0.00 0.00 0.00 3.12 100.00 ] > [622.209s][info][gc,remset ] GC(13) clean_run: [ 0.00 0.00 0.00 3.12 100.00 ] > [622.209s][info][gc,remset ] GC(13) dirty_cards: [ 0.00 3.12 50.00 99.61 100.00 ] > [622.209s][info][gc,remset ] GC(13) clean_cards: [ 0.00 0.00 26.56 92.19 100.00 ] > [622.209s][info][gc,remset ] GC(13) max_dirty_run: [ 0.00 1.56 29.69 99.61 100.00 ] > [622.209s][info][gc,remset ] GC(13) max_clean_run: [ 0.00 0.00 9.38 70.31 100.00 ] > [622.209s][info][gc,remset ] GC(13) dirty_objs: [ 0.00 0.00 0.00 50.00 1366.00 ] > [622.209s][info][gc,remset ] GC(13) clean_objs: [ 0.00 0.00 3.98 54.88 64.00 ] > [622.209s][info][gc,remset ] GC(13) dirty_scans: [ 0.00 0.00 0.00 33.98 1365.00 ] > [622.209s][info][gc,remset ] GC(13) clean_scans: [ 0.00 0.00 0.00 1.00 16.00 ] > [622.209s][info][gc,remset ] GC(13) alternations: [ 0.00 0.00 0.00 2.99 33.00 ] > [622.209s][info][gc,remset ] GC(13) Worker 1 Card Stats Histo: > [622.209s][info][gc,remset ] GC(13) dirty_run: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.209s][info][gc,remset ] GC(13) clean_run: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc,remset ] GC(13) dirty_cards: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc,remset ] GC(13) clean_cards: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc,remset ] GC(13) max_dirty_run: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc,remset ] GC(13) max_clean_run: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc,remset ] GC(13) dirty_objs: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc,remset ] GC(13) clean_objs: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc,remset ] GC(13) dirty_scans: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc,remset ] GC(13) clean_scans: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc,remset ] GC(13) alternations: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc ] GC(13) Concurrent update references 36776.258ms > ... 
> (init[627.626s][info][gc,remset ] GC(15) Scan Remembered Set > [627.626s][info][gc,remset ] GC(15) Worker 0 Card Stats Histo: > [627.626s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 1.56 100.00 ] > [627.626s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 4.69 100.00 ] > [627.626s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 0.00 6.25 32.81 100.00 ] > [627.626s][info][gc,remset ] GC(15) clean_cards: [ 0.00 48.44 90.62 98.44 100.00 ] > [627.626s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 0.00 3.12 15.62 100.00 ] > [627.626s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 23.44 60.94 95.31 100.00 ] > [627.626s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 0.00 45.90 164.06 1366.00 ] > [627.626s][info][gc,remset ] GC(15) clean_objs: [ 0.00 11.91 53.91 60.94 63.00 ] > [627.626s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 0.00 31.84 150.39 1365.00 ] > [627.626s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 1.00 1.99 11.00 ] > [627.626s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 1.99 6.00 24.00 ] > [627.627s][info][gc,remset ] GC(15) Worker 1 Card Stats Histo: > [627.627s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 3.12 100.00 ] > [627.627s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 1.56 100.00 ] > [627.627s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 6.25 99.61 99.61 100.00 ] > [627.627s][info][gc,remset ] GC(15) clean_cards: [ 0.00 0.00 0.00 70.31 100.00 ] > [627.627s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 3.12 99.61 99.61 100.00 ] > [627.627s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 0.00 0.00 53.12 100.00 ] > [627.627s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 0.00 0.00 0.00 1365.00 ] > [627.627s][info][gc,remset ] GC(15) clean_objs: [ 0.00 0.00 0.00 40.82 64.00 ] > [627.627s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 0.00 0.00 0.00 1364.00 ] > [627.627s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 0.00 0.00 16.00 ] > [627.627s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 0.00 0.00 33.00 ] > [627.627s][info][gc,remset ] GC(15) Cumulative stats > [627.627s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 6.25 100.00 ] > [627.627s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 1.56 100.00 ] > [627.627s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 40.62 99.61 99.61 100.00 ] > [627.627s][info][gc,remset ] GC(15) clean_cards: [ 0.00 0.00 0.00 31.25 100.00 ] > [627.627s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 23.44 99.61 99.61 100.00 ] > [627.627s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 0.00 0.00 12.50 100.00 ] > [627.627s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 0.00 0.00 326.17 1366.00 ] > [627.627s][info][gc,remset ] GC(15) clean_objs: [ 0.00 0.00 0.00 3.98 64.00 ] > [627.627s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 0.00 0.00 314.45 1365.00 ] > [627.627s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 0.00 0.00 16.00 ] > [627.627s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 0.00 0.00 33.00 ] > [627.627s][info][gc ] GC(15) Concurrent remembered set scanning 1119.698ms > ... 
> [631.875s][info][gc,remset ] GC(15) Update Refs > [631.875s][info][gc,remset ] GC(15) Worker 0 Card Stats Histo: > [631.875s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 3.12 100.00 ] > [631.875s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 3.12 100.00 ] > [631.875s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 4.69 99.61 99.61 100.00 ] > [631.875s][info][gc,remset ] GC(15) clean_cards: [ 0.00 0.00 0.00 90.62 100.00 ] > [631.875s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 3.12 99.61 99.61 100.00 ] > [631.875s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 0.00 0.00 68.75 100.00 ] > [631.875s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 0.00 0.00 29.88 1365.00 ] > [631.875s][info][gc,remset ] GC(15) clean_objs: [ 0.00 0.00 0.00 52.93 64.00 ] > [631.875s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 0.00 0.00 22.85 1364.00 ] > [631.875s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 0.00 0.00 11.00 ] > [631.875s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 0.00 1.99 24.00 ] > [631.875s][info][gc,remset ] GC(15) Worker 1 Card Stats Histo: > [631.875s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 26.56 100.00 ] > [631.875s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 0.00 100.00 ] > [631.875s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 62.50 99.61 99.61 100.00 ] > [631.875s][info][gc,remset ] GC(15) clean_cards: [ 0.00 0.00 0.00 0.00 100.00 ] > [631.875s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 59.38 99.61 99.61 100.00 ] > [631.875s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 0.00 0.00 0.00 100.00 ] > [631.875s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 230.47 818.36 871.09 1366.00 ] > [631.875s][info][gc,remset ] GC(15) clean_objs: [ 0.00 0.00 0.00 0.00 63.00 ] > [631.875s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 181.64 707.03 796.88 1365.00 ] > [631.875s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 0.00 0.00 16.00 ] > [631.875s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 0.00 0.00 33.00 ] > [631.876s][info][gc,remset ] GC(15) Cumulative stats > [631.876s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 6.25 100.00 ] > [631.876s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 1.56 100.00 ] > [631.876s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 32.81 99.61 99.61 100.00 ] > [631.876s][info][gc,remset ] GC(15) clean_cards: [ 0.00 0.00 0.00 43.75 100.00 ] > [631.876s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 15.62 99.61 99.61 100.00 ] > [631.876s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 0.00 0.00 20.31 100.00 ] > [631.876s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 0.00 20.90 695.31 1366.00 ] > [631.876s][info][gc,remset ] GC(15) clean_objs: [ 0.00 0.00 0.00 11.91 64.00 ] > [631.876s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 0.00 11.91 562.50 1365.00 ] > [631.876s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 0.00 0.00 16.00 ] > [631.876s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 0.00 0.00 33.00 ] > [631.876s][info][gc ] GC(15) Concurrent update references 1953.893ms > ... > > > The rows represent the metric that's being tracked, and the columns are, respectively, minimum, the 3 quartiles (25%, 50%, 75%) and the maximum. 
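As a worked reading of one of these rows (an illustration derived from the GC(13) remembered-set logs above): `dirty_cards: [ 0.00 99.61 99.61 99.61 100.00 ]` for Worker 0 says that the least-dirty chunk that worker scanned had no dirty cards, that a chunk at each of the 25th, 50th and 75th percentiles was about 99.6% dirty, and that the dirtiest chunk was entirely dirty. In other words, at least three quarters of the chunks it scanned were almost completely dirty in that scan.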
The metrics are:
> 
> - dirty_run: the length of an uninterrupted run of dirty cards, interpreted as a percentage of a chunk of work assignment (cluster) processed by a thread
> - clean_run: as above, but the length of an uninterrupted run of clean cards
> - dirty_cards, clean_cards: as above, but counts of cards as a percentage of chunk
> - max_dirty_run & max_clean_run: Similarly for the maximum of each.
> - dirty_objs, clean_objs: these are numbers of objects in any chunk walked, or scanned
> - dirty_scans, clean_scans: numbers of objects scanned by the closure
> - alternations: the number of times that we transitioned from clean to dirty or dirty to clean in a chunk
> 
> For example, the last cumulative log data (for UR) above indicates that at least 75% of the chunks have no alternations at all,
> and cards are almost always mostly clean for this specific prefix of the run.
> 
> Comparing worker stats from worker 0 and worker 1 indicates that in particular scans they may see different distributions of dirty cards for specific benchmarks based on their promotion and mutation behavior.
> 
> **Question:**
> Would it make sense to print also, for example, the 1, 10, 90 and 99 percentiles for these metrics as well, in addition to the quartiles, min, and max?

Y. Srinivas Ramakrishna has updated the pull request incrementally with one additional commit since the last revision:

  A couple of changes based on review feedback.

-------------

Changes:
  - all: https://git.openjdk.org/shenandoah/pull/176/files
  - new: https://git.openjdk.org/shenandoah/pull/176/files/1bc59f89..bfddb220

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=shenandoah&pr=176&range=13
 - incr: https://webrevs.openjdk.org/?repo=shenandoah&pr=176&range=12-13

  Stats: 3 lines in 2 files changed: 1 ins; 0 del; 2 mod
  Patch: https://git.openjdk.org/shenandoah/pull/176.diff
  Fetch: git fetch https://git.openjdk.org/shenandoah pull/176/head:pull/176

PR: https://git.openjdk.org/shenandoah/pull/176

From ysr at openjdk.org Thu Dec 22 04:34:00 2022
From: ysr at openjdk.org (Y. Srinivas Ramakrishna)
Date: Thu, 22 Dec 2022 04:34:00 GMT
Subject: RFR: JDK-8297796 GenShen: instrument the remembered set scan [v15]
In-Reply-To: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com>
References: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com>
Message-ID: 

> **Updated 12/21**
> 
> **Summary:**
> The main change is card stats collection during remembered set (RS) and update refs (UR) phases when the card-table is scanned. The code is protected by a new non-product only flag `ShenandoahEnableCardStats`, which is on by default in debug builds and off in the optimized build.
> 
> We tested the impact of the code with the flag enabled in product mode and felt the impact was non-trivial. We might, in the future, enable the code in product mode if performance can be improved.
> 
> Stats are logged per worker thread at the end of each RS and UR scan. These stats are specific to the most recent round of scanning. Global cumulative stats across all threads (but specific to RS or UR) are also maintained, and these are logged at periodic intervals as determined by the setting of `ShenandoahCardStatsLogInterval`.
> 
> **Format of stats produced and how to interpret them: (sample)**
> 
> The following format is an example from a slowdebug run where the logging is enabled. In this case there are 2 concurrent gc worker threads, and `ShenandoahCardStatsLogInterval` was set at 2.
The first two logs show the stats for those particular scans for each of the two worker threads, and the next set show the stats for particular scans for the two worker threads, followed by a cumulative one for that type of scan (RS or UR) across all workers and scans of that type, respectively. > > > [560.766s][info][gc,remset ] GC(13) Scan Remembered Set > [560.766s][info][gc,remset ] GC(13) Worker 0 Card Stats Histo: > [560.766s][info][gc,remset ] GC(13) dirty_run: [ 0.00 0.00 0.00 99.61 100.00 ] > [560.766s][info][gc,remset ] GC(13) clean_run: [ 0.00 0.00 0.00 0.00 53.12 ] > [560.766s][info][gc,remset ] GC(13) dirty_cards: [ 0.00 99.61 99.61 99.61 100.00 ] > [560.766s][info][gc,remset ] GC(13) clean_cards: [ 0.00 0.00 0.00 0.00 53.12 ] > [560.766s][info][gc,remset ] GC(13) max_dirty_run: [ 0.00 99.61 99.61 99.61 100.00 ] > [560.766s][info][gc,remset ] GC(13) max_clean_run: [ 0.00 0.00 0.00 0.00 53.12 ] > [560.766s][info][gc,remset ] GC(13) dirty_objs: [ 0.00 0.00 0.00 818.36 1366.00 ] > [560.766s][info][gc,remset ] GC(13) clean_objs: [ 0.00 0.00 0.00 0.00 8.00 ] > [560.766s][info][gc,remset ] GC(13) dirty_scans: [ 0.00 0.00 0.00 705.08 1365.00 ] > [560.766s][info][gc,remset ] GC(13) clean_scans: [ 0.00 0.00 0.00 0.00 0.00 ] > [560.766s][info][gc,remset ] GC(13) alternations: [ 0.00 0.00 0.00 0.00 16.00 ] > [560.766s][info][gc,remset ] GC(13) Worker 1 Card Stats Histo: > [560.766s][info][gc,remset ] GC(13) dirty_run: [ 0.00 0.00 0.00 96.88 100.00 ] > [560.766s][info][gc,remset ] GC(13) clean_run: [ 0.00 0.00 0.00 0.00 46.88 ] > [560.766s][info][gc,remset ] GC(13) dirty_cards: [ 18.75 82.81 98.44 99.61 100.00 ] > [560.766s][info][gc,remset ] GC(13) clean_cards: [ 0.00 0.00 0.00 0.00 46.88 ] > [560.766s][info][gc,remset ] GC(13) max_dirty_run: [ 18.75 82.81 98.44 99.61 100.00 ] > [560.766s][info][gc,remset ] GC(13) max_clean_run: [ 0.00 0.00 0.00 0.00 46.88 ] > [560.766s][info][gc,remset ] GC(13) dirty_objs: [ 0.00 269.53 353.52 814.45 1366.00 ] > [560.766s][info][gc,remset ] GC(13) clean_objs: [ 0.00 0.00 0.00 0.00 3.00 ] > [560.766s][info][gc,remset ] GC(13) dirty_scans: [ 0.00 263.67 351.56 671.88 1365.00 ] > [560.766s][info][gc,remset ] GC(13) clean_scans: [ 0.00 0.00 0.00 0.00 0.00 ] > [560.766s][info][gc,remset ] GC(13) alternations: [ 0.00 0.00 0.00 0.00 3.00 ] > [560.766s][info][gc ] GC(13) Concurrent remembered set scanning 1150.359ms > [560.766s][info][gc,start ] GC(13) Concurrent marking roots > ... 
> [585.433s][info][gc ] GC(13) Concurrent evacuation 6225.829ms > [585.433s][info][gc,start ] GC(13) Pause Init Update Refs > [585.434s][info][gc ] GC(13) Pause Init Update Refs 0.264ms > [585.434s][info][gc,start ] GC(13) Concurrent update references > [585.434s][info][gc,task ] GC(13) Using 2 of 4 workers for concurrent reference update > [585.567s][info][gc ] Average MMU = 2.925 > [590.583s][info][gc ] Average MMU = 1.509 > [595.600s][info][gc ] Average MMU = 0.835 > [600.618s][info][gc ] Average MMU = 0.447 > [605.635s][info][gc ] Average MMU = 0.253 > [610.651s][info][gc ] Average MMU = 0.114 > [615.669s][info][gc ] Average MMU = 0.130 > [620.686s][info][gc ] Average MMU = 0.129 > [622.209s][info][gc,remset ] GC(13) Update Refs > [622.209s][info][gc,remset ] GC(13) Worker 0 Card Stats Histo: > [622.209s][info][gc,remset ] GC(13) dirty_run: [ 0.00 0.00 0.00 3.12 100.00 ] > [622.209s][info][gc,remset ] GC(13) clean_run: [ 0.00 0.00 0.00 3.12 100.00 ] > [622.209s][info][gc,remset ] GC(13) dirty_cards: [ 0.00 3.12 50.00 99.61 100.00 ] > [622.209s][info][gc,remset ] GC(13) clean_cards: [ 0.00 0.00 26.56 92.19 100.00 ] > [622.209s][info][gc,remset ] GC(13) max_dirty_run: [ 0.00 1.56 29.69 99.61 100.00 ] > [622.209s][info][gc,remset ] GC(13) max_clean_run: [ 0.00 0.00 9.38 70.31 100.00 ] > [622.209s][info][gc,remset ] GC(13) dirty_objs: [ 0.00 0.00 0.00 50.00 1366.00 ] > [622.209s][info][gc,remset ] GC(13) clean_objs: [ 0.00 0.00 3.98 54.88 64.00 ] > [622.209s][info][gc,remset ] GC(13) dirty_scans: [ 0.00 0.00 0.00 33.98 1365.00 ] > [622.209s][info][gc,remset ] GC(13) clean_scans: [ 0.00 0.00 0.00 1.00 16.00 ] > [622.209s][info][gc,remset ] GC(13) alternations: [ 0.00 0.00 0.00 2.99 33.00 ] > [622.209s][info][gc,remset ] GC(13) Worker 1 Card Stats Histo: > [622.209s][info][gc,remset ] GC(13) dirty_run: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.209s][info][gc,remset ] GC(13) clean_run: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc,remset ] GC(13) dirty_cards: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc,remset ] GC(13) clean_cards: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc,remset ] GC(13) max_dirty_run: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc,remset ] GC(13) max_clean_run: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc,remset ] GC(13) dirty_objs: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc,remset ] GC(13) clean_objs: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc,remset ] GC(13) dirty_scans: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc,remset ] GC(13) clean_scans: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc,remset ] GC(13) alternations: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc ] GC(13) Concurrent update references 36776.258ms > ... 
> (init[627.626s][info][gc,remset ] GC(15) Scan Remembered Set > [627.626s][info][gc,remset ] GC(15) Worker 0 Card Stats Histo: > [627.626s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 1.56 100.00 ] > [627.626s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 4.69 100.00 ] > [627.626s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 0.00 6.25 32.81 100.00 ] > [627.626s][info][gc,remset ] GC(15) clean_cards: [ 0.00 48.44 90.62 98.44 100.00 ] > [627.626s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 0.00 3.12 15.62 100.00 ] > [627.626s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 23.44 60.94 95.31 100.00 ] > [627.626s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 0.00 45.90 164.06 1366.00 ] > [627.626s][info][gc,remset ] GC(15) clean_objs: [ 0.00 11.91 53.91 60.94 63.00 ] > [627.626s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 0.00 31.84 150.39 1365.00 ] > [627.626s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 1.00 1.99 11.00 ] > [627.626s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 1.99 6.00 24.00 ] > [627.627s][info][gc,remset ] GC(15) Worker 1 Card Stats Histo: > [627.627s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 3.12 100.00 ] > [627.627s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 1.56 100.00 ] > [627.627s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 6.25 99.61 99.61 100.00 ] > [627.627s][info][gc,remset ] GC(15) clean_cards: [ 0.00 0.00 0.00 70.31 100.00 ] > [627.627s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 3.12 99.61 99.61 100.00 ] > [627.627s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 0.00 0.00 53.12 100.00 ] > [627.627s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 0.00 0.00 0.00 1365.00 ] > [627.627s][info][gc,remset ] GC(15) clean_objs: [ 0.00 0.00 0.00 40.82 64.00 ] > [627.627s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 0.00 0.00 0.00 1364.00 ] > [627.627s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 0.00 0.00 16.00 ] > [627.627s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 0.00 0.00 33.00 ] > [627.627s][info][gc,remset ] GC(15) Cumulative stats > [627.627s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 6.25 100.00 ] > [627.627s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 1.56 100.00 ] > [627.627s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 40.62 99.61 99.61 100.00 ] > [627.627s][info][gc,remset ] GC(15) clean_cards: [ 0.00 0.00 0.00 31.25 100.00 ] > [627.627s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 23.44 99.61 99.61 100.00 ] > [627.627s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 0.00 0.00 12.50 100.00 ] > [627.627s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 0.00 0.00 326.17 1366.00 ] > [627.627s][info][gc,remset ] GC(15) clean_objs: [ 0.00 0.00 0.00 3.98 64.00 ] > [627.627s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 0.00 0.00 314.45 1365.00 ] > [627.627s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 0.00 0.00 16.00 ] > [627.627s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 0.00 0.00 33.00 ] > [627.627s][info][gc ] GC(15) Concurrent remembered set scanning 1119.698ms > ... 
> [631.875s][info][gc,remset ] GC(15) Update Refs > [631.875s][info][gc,remset ] GC(15) Worker 0 Card Stats Histo: > [631.875s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 3.12 100.00 ] > [631.875s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 3.12 100.00 ] > [631.875s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 4.69 99.61 99.61 100.00 ] > [631.875s][info][gc,remset ] GC(15) clean_cards: [ 0.00 0.00 0.00 90.62 100.00 ] > [631.875s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 3.12 99.61 99.61 100.00 ] > [631.875s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 0.00 0.00 68.75 100.00 ] > [631.875s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 0.00 0.00 29.88 1365.00 ] > [631.875s][info][gc,remset ] GC(15) clean_objs: [ 0.00 0.00 0.00 52.93 64.00 ] > [631.875s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 0.00 0.00 22.85 1364.00 ] > [631.875s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 0.00 0.00 11.00 ] > [631.875s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 0.00 1.99 24.00 ] > [631.875s][info][gc,remset ] GC(15) Worker 1 Card Stats Histo: > [631.875s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 26.56 100.00 ] > [631.875s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 0.00 100.00 ] > [631.875s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 62.50 99.61 99.61 100.00 ] > [631.875s][info][gc,remset ] GC(15) clean_cards: [ 0.00 0.00 0.00 0.00 100.00 ] > [631.875s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 59.38 99.61 99.61 100.00 ] > [631.875s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 0.00 0.00 0.00 100.00 ] > [631.875s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 230.47 818.36 871.09 1366.00 ] > [631.875s][info][gc,remset ] GC(15) clean_objs: [ 0.00 0.00 0.00 0.00 63.00 ] > [631.875s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 181.64 707.03 796.88 1365.00 ] > [631.875s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 0.00 0.00 16.00 ] > [631.875s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 0.00 0.00 33.00 ] > [631.876s][info][gc,remset ] GC(15) Cumulative stats > [631.876s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 6.25 100.00 ] > [631.876s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 1.56 100.00 ] > [631.876s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 32.81 99.61 99.61 100.00 ] > [631.876s][info][gc,remset ] GC(15) clean_cards: [ 0.00 0.00 0.00 43.75 100.00 ] > [631.876s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 15.62 99.61 99.61 100.00 ] > [631.876s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 0.00 0.00 20.31 100.00 ] > [631.876s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 0.00 20.90 695.31 1366.00 ] > [631.876s][info][gc,remset ] GC(15) clean_objs: [ 0.00 0.00 0.00 11.91 64.00 ] > [631.876s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 0.00 11.91 562.50 1365.00 ] > [631.876s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 0.00 0.00 16.00 ] > [631.876s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 0.00 0.00 33.00 ] > [631.876s][info][gc ] GC(15) Concurrent update references 1953.893ms > ... > > > The rows represent the metric that's being tracked, and the columns are, respectively, minimum, the 3 quartiles (25%, 50%, 75%) and the maximum. 
The metrics are:
> 
> - dirty_run: the length of an uninterrupted run of dirty cards, interpreted as a percentage of a chunk of work assignment (cluster) processed by a thread
> - clean_run: as above, but the length of an uninterrupted run of clean cards
> - dirty_cards, clean_cards: as above, but counts of cards as a percentage of chunk
> - max_dirty_run & max_clean_run: Similarly for the maximum of each.
> - dirty_objs, clean_objs: these are numbers of objects in any chunk walked, or scanned
> - dirty_scans, clean_scans: numbers of objects scanned by the closure
> - alternations: the number of times that we transitioned from clean to dirty or dirty to clean in a chunk
> 
> For example, the last cumulative log data (for UR) above indicates that at least 75% of the chunks have no alternations at all,
> and cards are almost always mostly clean for this specific prefix of the run.
> 
> Comparing worker stats from worker 0 and worker 1 indicates that in particular scans they may see different distributions of dirty cards for specific benchmarks based on their promotion and mutation behavior.
> 
> **Question:**
> Would it make sense to print also, for example, the 1, 10, 90 and 99 percentiles for these metrics as well, in addition to the quartiles, min, and max?

Y. Srinivas Ramakrishna has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 68 commits:

 - Merge branch 'master' into JVM-1264
 - A couple of changes based on review feedback.
 - Reword some code comments for greater clarity.
 - Merge branch 'master' into JVM-1264-dependent
 - Add a previously missed ticket#. Doing it here rather than in parent to
   avoid an otherwise unnecessary re-review touchpoint.
 - Merge branch 'stats_merge' into JVM-1264-dependent
 - Merge branch 'master' into stats_merge
 - jcheck space fix
 - Fix compiler error on windows.
 - Fix some tier1 tests.
 - ... and 58 more: https://git.openjdk.org/shenandoah/compare/d793fd16...4e5ad4ca

-------------

Changes: https://git.openjdk.org/shenandoah/pull/176/files
Webrev: https://webrevs.openjdk.org/?repo=shenandoah&pr=176&range=14
Stats: 865 lines in 9 files changed: 496 ins; 206 del; 163 mod
Patch: https://git.openjdk.org/shenandoah/pull/176.diff
Fetch: git fetch https://git.openjdk.org/shenandoah pull/176/head:pull/176

PR: https://git.openjdk.org/shenandoah/pull/176

From redestad at openjdk.org Thu Dec 22 13:12:54 2022
From: redestad at openjdk.org (Claes Redestad)
Date: Thu, 22 Dec 2022 13:12:54 GMT
Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]
In-Reply-To: 
References: <6lAQI6kDDTGbskylHcWReX8ExaB6qkwgqoai7E6ikZY=.8a69a63c-453d-4bbd-8c76-4d477bfb77fe@github.com>
Message-ID: 

On Wed, 21 Dec 2022 00:11:34 GMT, Sandhya Viswanathan wrote:

>> Passing the constant node through as an input as suggested by @iwanowww and @sviswa7 meant we could eliminate most of the `instruct` blocks, removing a significant chunk of code and a little bit of complexity from the proposed patch.
> 
> @cl4es Thanks for passing the constant node through, the code looks much cleaner now. The attached patch should handle the signed bytes/shorts as well. Please take a look.
> [signed.patch](https://github.com/openjdk/jdk/files/10273480/signed.patch)

I ran tests and some quick microbenchmarking to validate @sviswa7's patch to activate vectorization for `short` and `byte` arrays and it looks good:

Before:

Benchmark                   (size)  Mode  Cnt     Score    Error  Units
ArraysHashCode.bytes         10000  avgt    5  7845.586 ± 23.440  ns/op
ArraysHashCode.chars         10000  avgt    5  1203.163 ± 11.995  ns/op
ArraysHashCode.ints          10000  avgt    5  1131.915 ±  7.843  ns/op
ArraysHashCode.multibytes    10000  avgt    5  4136.487 ±  5.790  ns/op
ArraysHashCode.multichars    10000  avgt    5   671.328 ± 17.629  ns/op
ArraysHashCode.multiints     10000  avgt    5   699.051 ±  8.135  ns/op
ArraysHashCode.multishorts   10000  avgt    5  4139.300 ± 10.633  ns/op
ArraysHashCode.shorts        10000  avgt    5  7844.019 ± 26.071  ns/op

After:

Benchmark                   (size)  Mode  Cnt     Score    Error  Units
ArraysHashCode.bytes         10000  avgt    5  1193.208 ±  1.965  ns/op
ArraysHashCode.chars         10000  avgt    5  1193.311 ±  5.941  ns/op
ArraysHashCode.ints          10000  avgt    5  1132.592 ± 10.410  ns/op
ArraysHashCode.multibytes    10000  avgt    5   657.343 ± 25.343  ns/op
ArraysHashCode.multichars    10000  avgt    5   672.668 ±  5.229  ns/op
ArraysHashCode.multiints     10000  avgt    5   697.143 ±  3.929  ns/op
ArraysHashCode.multishorts   10000  avgt    5   666.738 ± 12.236  ns/op
ArraysHashCode.shorts        10000  avgt    5  1193.563 ±  5.449  ns/op

-------------

PR: https://git.openjdk.org/jdk/pull/10847

From eosterlund at openjdk.org Thu Dec 22 14:46:47 2022
From: eosterlund at openjdk.org (Erik Österlund)
Date: Thu, 22 Dec 2022 14:46:47 GMT
Subject: RFR: 8299072: java_lang_ref_Reference::clear_referent should be GC agnostic
In-Reply-To: <8je3w2XaNdQEAKx0lLHp2T2UXkOUqEV0ks-2TFL2AJE=.fbf307d0-9833-4465-a914-3a7e9f05d12b@github.com>
References: <8je3w2XaNdQEAKx0lLHp2T2UXkOUqEV0ks-2TFL2AJE=.fbf307d0-9833-4465-a914-3a7e9f05d12b@github.com>
Message-ID: 

On Tue, 20 Dec 2022 08:12:41 GMT, David Holmes wrote:

>> The current java_lang_ref_Reference::clear_referent implementation performs a raw reference clear. That doesn't work well with upcoming GC algorithms. It should be made GC agnostic by going through the normal access API.
> 
> So `clear_referent` is made GC agnostic, but then all the existing GC's are changed to use the original raw version? Why do they not use the GC agnostic version - performance?

Thanks for the review, @dholmes-ora!

-------------

PR: https://git.openjdk.org/jdk/pull/11736

From wkemper at openjdk.org Thu Dec 22 20:04:25 2022
From: wkemper at openjdk.org (William Kemper)
Date: Thu, 22 Dec 2022 20:04:25 GMT
Subject: RFR: JDK-8297796 GenShen: instrument the remembered set scan [v15]
In-Reply-To: 
References: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com>
Message-ID: 

On Thu, 22 Dec 2022 04:34:00 GMT, Y. Srinivas Ramakrishna wrote:

>> **Updated 12/21**
>> 
>> **Summary:**
>> The main change is card stats collection during remembered set (RS) and update refs (UR) phases when the card-table is scanned. The code is protected by a new non-product only flag `ShenandoahEnableCardStats`, which is on by default in debug builds and off in the optimized build.
>> 
>> We tested the impact of the code with the flag enabled in product mode and felt the impact was non-trivial. We might, in the future, enable the code in product mode if performance can be improved.
>> 
>> Stats are logged per worker thread at the end of each RS and UR scan. These stats are specific to the most recent round of scanning. Global cumulative stats across all threads (but specific to RS or UR) are also maintained, and these are logged at periodic intervals as determined by the setting of `ShenandoahCardStatsLogInterval`.
>> 
>> **Format of stats produced and how to interpret them: (sample)**
>> 
>> The following format is an example from a slowdebug run where the logging is enabled.
In this case there are 2 concurrent gc worker threads, and `ShenandoahCardStatsLogInterval` was set at 2. The first two logs show the stats for those particular scans for each of the two worker threads, and the next set show the stats for particular scans for the two worker threads, followed by a cumulative one for that type of scan (RS or UR) across all workers and scans of that type, respectively. >> >> >> [560.766s][info][gc,remset ] GC(13) Scan Remembered Set >> [560.766s][info][gc,remset ] GC(13) Worker 0 Card Stats Histo: >> [560.766s][info][gc,remset ] GC(13) dirty_run: [ 0.00 0.00 0.00 99.61 100.00 ] >> [560.766s][info][gc,remset ] GC(13) clean_run: [ 0.00 0.00 0.00 0.00 53.12 ] >> [560.766s][info][gc,remset ] GC(13) dirty_cards: [ 0.00 99.61 99.61 99.61 100.00 ] >> [560.766s][info][gc,remset ] GC(13) clean_cards: [ 0.00 0.00 0.00 0.00 53.12 ] >> [560.766s][info][gc,remset ] GC(13) max_dirty_run: [ 0.00 99.61 99.61 99.61 100.00 ] >> [560.766s][info][gc,remset ] GC(13) max_clean_run: [ 0.00 0.00 0.00 0.00 53.12 ] >> [560.766s][info][gc,remset ] GC(13) dirty_objs: [ 0.00 0.00 0.00 818.36 1366.00 ] >> [560.766s][info][gc,remset ] GC(13) clean_objs: [ 0.00 0.00 0.00 0.00 8.00 ] >> [560.766s][info][gc,remset ] GC(13) dirty_scans: [ 0.00 0.00 0.00 705.08 1365.00 ] >> [560.766s][info][gc,remset ] GC(13) clean_scans: [ 0.00 0.00 0.00 0.00 0.00 ] >> [560.766s][info][gc,remset ] GC(13) alternations: [ 0.00 0.00 0.00 0.00 16.00 ] >> [560.766s][info][gc,remset ] GC(13) Worker 1 Card Stats Histo: >> [560.766s][info][gc,remset ] GC(13) dirty_run: [ 0.00 0.00 0.00 96.88 100.00 ] >> [560.766s][info][gc,remset ] GC(13) clean_run: [ 0.00 0.00 0.00 0.00 46.88 ] >> [560.766s][info][gc,remset ] GC(13) dirty_cards: [ 18.75 82.81 98.44 99.61 100.00 ] >> [560.766s][info][gc,remset ] GC(13) clean_cards: [ 0.00 0.00 0.00 0.00 46.88 ] >> [560.766s][info][gc,remset ] GC(13) max_dirty_run: [ 18.75 82.81 98.44 99.61 100.00 ] >> [560.766s][info][gc,remset ] GC(13) max_clean_run: [ 0.00 0.00 0.00 0.00 46.88 ] >> [560.766s][info][gc,remset ] GC(13) dirty_objs: [ 0.00 269.53 353.52 814.45 1366.00 ] >> [560.766s][info][gc,remset ] GC(13) clean_objs: [ 0.00 0.00 0.00 0.00 3.00 ] >> [560.766s][info][gc,remset ] GC(13) dirty_scans: [ 0.00 263.67 351.56 671.88 1365.00 ] >> [560.766s][info][gc,remset ] GC(13) clean_scans: [ 0.00 0.00 0.00 0.00 0.00 ] >> [560.766s][info][gc,remset ] GC(13) alternations: [ 0.00 0.00 0.00 0.00 3.00 ] >> [560.766s][info][gc ] GC(13) Concurrent remembered set scanning 1150.359ms >> [560.766s][info][gc,start ] GC(13) Concurrent marking roots >> ... 
>> [585.433s][info][gc ] GC(13) Concurrent evacuation 6225.829ms >> [585.433s][info][gc,start ] GC(13) Pause Init Update Refs >> [585.434s][info][gc ] GC(13) Pause Init Update Refs 0.264ms >> [585.434s][info][gc,start ] GC(13) Concurrent update references >> [585.434s][info][gc,task ] GC(13) Using 2 of 4 workers for concurrent reference update >> [585.567s][info][gc ] Average MMU = 2.925 >> [590.583s][info][gc ] Average MMU = 1.509 >> [595.600s][info][gc ] Average MMU = 0.835 >> [600.618s][info][gc ] Average MMU = 0.447 >> [605.635s][info][gc ] Average MMU = 0.253 >> [610.651s][info][gc ] Average MMU = 0.114 >> [615.669s][info][gc ] Average MMU = 0.130 >> [620.686s][info][gc ] Average MMU = 0.129 >> [622.209s][info][gc,remset ] GC(13) Update Refs >> [622.209s][info][gc,remset ] GC(13) Worker 0 Card Stats Histo: >> [622.209s][info][gc,remset ] GC(13) dirty_run: [ 0.00 0.00 0.00 3.12 100.00 ] >> [622.209s][info][gc,remset ] GC(13) clean_run: [ 0.00 0.00 0.00 3.12 100.00 ] >> [622.209s][info][gc,remset ] GC(13) dirty_cards: [ 0.00 3.12 50.00 99.61 100.00 ] >> [622.209s][info][gc,remset ] GC(13) clean_cards: [ 0.00 0.00 26.56 92.19 100.00 ] >> [622.209s][info][gc,remset ] GC(13) max_dirty_run: [ 0.00 1.56 29.69 99.61 100.00 ] >> [622.209s][info][gc,remset ] GC(13) max_clean_run: [ 0.00 0.00 9.38 70.31 100.00 ] >> [622.209s][info][gc,remset ] GC(13) dirty_objs: [ 0.00 0.00 0.00 50.00 1366.00 ] >> [622.209s][info][gc,remset ] GC(13) clean_objs: [ 0.00 0.00 3.98 54.88 64.00 ] >> [622.209s][info][gc,remset ] GC(13) dirty_scans: [ 0.00 0.00 0.00 33.98 1365.00 ] >> [622.209s][info][gc,remset ] GC(13) clean_scans: [ 0.00 0.00 0.00 1.00 16.00 ] >> [622.209s][info][gc,remset ] GC(13) alternations: [ 0.00 0.00 0.00 2.99 33.00 ] >> [622.209s][info][gc,remset ] GC(13) Worker 1 Card Stats Histo: >> [622.209s][info][gc,remset ] GC(13) dirty_run: [ 0.00 0.00 0.00 0.00 0.00 ] >> [622.209s][info][gc,remset ] GC(13) clean_run: [ 0.00 0.00 0.00 0.00 0.00 ] >> [622.210s][info][gc,remset ] GC(13) dirty_cards: [ 0.00 0.00 0.00 0.00 0.00 ] >> [622.210s][info][gc,remset ] GC(13) clean_cards: [ 0.00 0.00 0.00 0.00 0.00 ] >> [622.210s][info][gc,remset ] GC(13) max_dirty_run: [ 0.00 0.00 0.00 0.00 0.00 ] >> [622.210s][info][gc,remset ] GC(13) max_clean_run: [ 0.00 0.00 0.00 0.00 0.00 ] >> [622.210s][info][gc,remset ] GC(13) dirty_objs: [ 0.00 0.00 0.00 0.00 0.00 ] >> [622.210s][info][gc,remset ] GC(13) clean_objs: [ 0.00 0.00 0.00 0.00 0.00 ] >> [622.210s][info][gc,remset ] GC(13) dirty_scans: [ 0.00 0.00 0.00 0.00 0.00 ] >> [622.210s][info][gc,remset ] GC(13) clean_scans: [ 0.00 0.00 0.00 0.00 0.00 ] >> [622.210s][info][gc,remset ] GC(13) alternations: [ 0.00 0.00 0.00 0.00 0.00 ] >> [622.210s][info][gc ] GC(13) Concurrent update references 36776.258ms >> ... 
>> (init[627.626s][info][gc,remset ] GC(15) Scan Remembered Set >> [627.626s][info][gc,remset ] GC(15) Worker 0 Card Stats Histo: >> [627.626s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 1.56 100.00 ] >> [627.626s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 4.69 100.00 ] >> [627.626s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 0.00 6.25 32.81 100.00 ] >> [627.626s][info][gc,remset ] GC(15) clean_cards: [ 0.00 48.44 90.62 98.44 100.00 ] >> [627.626s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 0.00 3.12 15.62 100.00 ] >> [627.626s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 23.44 60.94 95.31 100.00 ] >> [627.626s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 0.00 45.90 164.06 1366.00 ] >> [627.626s][info][gc,remset ] GC(15) clean_objs: [ 0.00 11.91 53.91 60.94 63.00 ] >> [627.626s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 0.00 31.84 150.39 1365.00 ] >> [627.626s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 1.00 1.99 11.00 ] >> [627.626s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 1.99 6.00 24.00 ] >> [627.627s][info][gc,remset ] GC(15) Worker 1 Card Stats Histo: >> [627.627s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 3.12 100.00 ] >> [627.627s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 1.56 100.00 ] >> [627.627s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 6.25 99.61 99.61 100.00 ] >> [627.627s][info][gc,remset ] GC(15) clean_cards: [ 0.00 0.00 0.00 70.31 100.00 ] >> [627.627s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 3.12 99.61 99.61 100.00 ] >> [627.627s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 0.00 0.00 53.12 100.00 ] >> [627.627s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 0.00 0.00 0.00 1365.00 ] >> [627.627s][info][gc,remset ] GC(15) clean_objs: [ 0.00 0.00 0.00 40.82 64.00 ] >> [627.627s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 0.00 0.00 0.00 1364.00 ] >> [627.627s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 0.00 0.00 16.00 ] >> [627.627s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 0.00 0.00 33.00 ] >> [627.627s][info][gc,remset ] GC(15) Cumulative stats >> [627.627s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 6.25 100.00 ] >> [627.627s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 1.56 100.00 ] >> [627.627s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 40.62 99.61 99.61 100.00 ] >> [627.627s][info][gc,remset ] GC(15) clean_cards: [ 0.00 0.00 0.00 31.25 100.00 ] >> [627.627s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 23.44 99.61 99.61 100.00 ] >> [627.627s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 0.00 0.00 12.50 100.00 ] >> [627.627s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 0.00 0.00 326.17 1366.00 ] >> [627.627s][info][gc,remset ] GC(15) clean_objs: [ 0.00 0.00 0.00 3.98 64.00 ] >> [627.627s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 0.00 0.00 314.45 1365.00 ] >> [627.627s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 0.00 0.00 16.00 ] >> [627.627s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 0.00 0.00 33.00 ] >> [627.627s][info][gc ] GC(15) Concurrent remembered set scanning 1119.698ms >> ... 
>> [631.875s][info][gc,remset ] GC(15) Update Refs >> [631.875s][info][gc,remset ] GC(15) Worker 0 Card Stats Histo: >> [631.875s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 3.12 100.00 ] >> [631.875s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 3.12 100.00 ] >> [631.875s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 4.69 99.61 99.61 100.00 ] >> [631.875s][info][gc,remset ] GC(15) clean_cards: [ 0.00 0.00 0.00 90.62 100.00 ] >> [631.875s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 3.12 99.61 99.61 100.00 ] >> [631.875s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 0.00 0.00 68.75 100.00 ] >> [631.875s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 0.00 0.00 29.88 1365.00 ] >> [631.875s][info][gc,remset ] GC(15) clean_objs: [ 0.00 0.00 0.00 52.93 64.00 ] >> [631.875s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 0.00 0.00 22.85 1364.00 ] >> [631.875s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 0.00 0.00 11.00 ] >> [631.875s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 0.00 1.99 24.00 ] >> [631.875s][info][gc,remset ] GC(15) Worker 1 Card Stats Histo: >> [631.875s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 26.56 100.00 ] >> [631.875s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 0.00 100.00 ] >> [631.875s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 62.50 99.61 99.61 100.00 ] >> [631.875s][info][gc,remset ] GC(15) clean_cards: [ 0.00 0.00 0.00 0.00 100.00 ] >> [631.875s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 59.38 99.61 99.61 100.00 ] >> [631.875s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 0.00 0.00 0.00 100.00 ] >> [631.875s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 230.47 818.36 871.09 1366.00 ] >> [631.875s][info][gc,remset ] GC(15) clean_objs: [ 0.00 0.00 0.00 0.00 63.00 ] >> [631.875s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 181.64 707.03 796.88 1365.00 ] >> [631.875s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 0.00 0.00 16.00 ] >> [631.875s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 0.00 0.00 33.00 ] >> [631.876s][info][gc,remset ] GC(15) Cumulative stats >> [631.876s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 6.25 100.00 ] >> [631.876s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 1.56 100.00 ] >> [631.876s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 32.81 99.61 99.61 100.00 ] >> [631.876s][info][gc,remset ] GC(15) clean_cards: [ 0.00 0.00 0.00 43.75 100.00 ] >> [631.876s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 15.62 99.61 99.61 100.00 ] >> [631.876s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 0.00 0.00 20.31 100.00 ] >> [631.876s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 0.00 20.90 695.31 1366.00 ] >> [631.876s][info][gc,remset ] GC(15) clean_objs: [ 0.00 0.00 0.00 11.91 64.00 ] >> [631.876s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 0.00 11.91 562.50 1365.00 ] >> [631.876s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 0.00 0.00 16.00 ] >> [631.876s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 0.00 0.00 33.00 ] >> [631.876s][info][gc ] GC(15) Concurrent update references 1953.893ms >> ... >> >> >> The rows represent the metric that's being tracked, and the columns are, respectively, minimum, the 3 quartiles (25%, 50%, 75%) and the maximum. 
The metrics are:
>> 
>> - dirty_run: the length of an uninterrupted run of dirty cards, interpreted as a percentage of a chunk of work assignment (cluster) processed by a thread
>> - clean_run: as above, but the length of an uninterrupted run of clean cards
>> - dirty_cards, clean_cards: as above, but counts of cards as a percentage of chunk
>> - max_dirty_run & max_clean_run: Similarly for the maximum of each.
>> - dirty_objs, clean_objs: these are numbers of objects in any chunk walked, or scanned
>> - dirty_scans, clean_scans: numbers of objects scanned by the closure
>> - alternations: the number of times that we transitioned from clean to dirty or dirty to clean in a chunk
>> 
>> For example, the last cumulative log data (for UR) above indicates that at least 75% of the chunks have no alternations at all,
>> and cards are almost always mostly clean for this specific prefix of the run.
>> 
>> Comparing worker stats from worker 0 and worker 1 indicates that in particular scans they may see different distributions of dirty cards for specific benchmarks based on their promotion and mutation behavior.
>> 
>> **Question:**
>> Would it make sense to print also, for example, the 1, 10, 90 and 99 percentiles for these metrics as well, in addition to the quartiles, min, and max?
> 
> Y. Srinivas Ramakrishna has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 68 commits:
> 
>  - Merge branch 'master' into JVM-1264
>  - A couple of changes based on review feedback.
>  - Reword some code comments for greater clarity.
>  - Merge branch 'master' into JVM-1264-dependent
>  - Add a previously missed ticket#. Doing it here rather than in parent to
>    avoid an otherwise unnecessary re-review touchpoint.
>  - Merge branch 'stats_merge' into JVM-1264-dependent
>  - Merge branch 'master' into stats_merge
>  - jcheck space fix
>  - Fix compiler error on windows.
>  - Fix some tier1 tests.
>  - ... and 58 more: https://git.openjdk.org/shenandoah/compare/d793fd16...4e5ad4ca

Thank you. Excited to see what we can learn and optimize from these metrics.

-------------

Marked as reviewed by wkemper (Committer).

PR: https://git.openjdk.org/shenandoah/pull/176

From wkemper at openjdk.org Thu Dec 22 20:04:25 2022
From: wkemper at openjdk.org (William Kemper)
Date: Thu, 22 Dec 2022 20:04:25 GMT
Subject: RFR: JDK-8297796 GenShen: instrument the remembered set scan [v12]
In-Reply-To: 
References: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com>
Message-ID: 

On Thu, 22 Dec 2022 04:00:10 GMT, Y. Srinivas Ramakrishna wrote:

>> Maybe "Log cumulative card stats every so many scans of the remembered set"? "Cycle" is a bit overloaded. If I read this, I would expect to see a log message every 50 GC cycles, but with (probably) two rset scans per GC cycle, it would be closer to every 25 GC cycles.
> 
> I reworded along the lines of your suggestion.
> 
> For the specific example you gave, we will in fact see one cumulative RS log message every 50 RS scans, and one cumulative UR log message every 50 UR scans, thus roughly one each every 50 GC cycles, if you will. The case when they may not be in lockstep might be if there were full gc's or degenerated cycles that did one (e.g. RS) but skipped the other (e.g. UR -- can this happen?), because we maintain two independent counters one for RS scans and one for UR scans.

Okay - thank you.
Yes, Shenandoah will skip evacuation (and update references) if it finds a sufficient number of regions with no live objects after final mark.

-------------

PR: https://git.openjdk.org/shenandoah/pull/176

From ysr at openjdk.org Thu Dec 22 23:50:24 2022
From: ysr at openjdk.org (Y. Srinivas Ramakrishna)
Date: Thu, 22 Dec 2022 23:50:24 GMT
Subject: RFR: JDK-8297796 GenShen: instrument the remembered set scan [v15]
In-Reply-To: 
References: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com>
Message-ID: 

On Thu, 22 Dec 2022 20:01:16 GMT, William Kemper wrote:

>> Y. Srinivas Ramakrishna has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 68 commits:
>> 
>>  - Merge branch 'master' into JVM-1264
>>  - A couple of changes based on review feedback.
>>  - Reword some code comments for greater clarity.
>>  - Merge branch 'master' into JVM-1264-dependent
>>  - Add a previously missed ticket#. Doing it here rather than in parent to
>>    avoid an otherwise unnecessary re-review touchpoint.
>>  - Merge branch 'stats_merge' into JVM-1264-dependent
>>  - Merge branch 'master' into stats_merge
>>  - jcheck space fix
>>  - Fix compiler error on windows.
>>  - Fix some tier1 tests.
>>  - ... and 58 more: https://git.openjdk.org/shenandoah/compare/d793fd16...4e5ad4ca
> 
> Thank you. Excited to see what we can learn and optimize from these metrics.

Thank you for your review, @earthling-amzn !

-------------

PR: https://git.openjdk.org/shenandoah/pull/176

From ysr at openjdk.org Thu Dec 22 23:50:24 2022
From: ysr at openjdk.org (Y. Srinivas Ramakrishna)
Date: Thu, 22 Dec 2022 23:50:24 GMT
Subject: RFR: JDK-8297796 GenShen: instrument the remembered set scan [v2]
In-Reply-To: 
References: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com>
 <9E6NmFY5877JXtI7RKpqa1r2nXDaEJ7xxLG9q0hEP6U=.03c76ffe-bac9-4401-9091-4ee19d6a394e@github.com>
Message-ID: 

On Thu, 8 Dec 2022 14:42:53 GMT, Kelvin Nilsen wrote:

> Thanks for sharing this code. A few overview comments:
> 
> 1. Yes, I think it would be useful to see the data collected for each mark scan and each update-reference scan independently. Sometimes, abnormal behavior of the application causes spikes in performance, and it would be nice to understand the degree to which remembered set scanning is part of this spike.
> 2. It is also useful to have a cumulative summary of all costs at the end of a run, probably still separating out the mark scans from the update-refs scans.
> 3. Is it possible to eliminate the overhead entirely of this instrumentation by compiling it out for release builds?

@kdnilsen : The above have all been taken care of. Please re-review and approve/sponsor. Thank you!

-------------

PR: https://git.openjdk.org/shenandoah/pull/176

From eosterlund at openjdk.org Fri Dec 23 14:58:37 2022
From: eosterlund at openjdk.org (Erik Österlund)
Date: Fri, 23 Dec 2022 14:58:37 GMT
Subject: RFR: 8299312: Clean up BarrierSetNMethod
Message-ID: 

The terminology in BarrierSetNMethod is not crisp. In platform code we talk about a per-nmethod "guard value", but on shared level we call the same value arm value or disarm value in different contexts. But it really depends on the value whether the nmethod is disarmed or armed. We should embrace the "guard value" terminology and lift it into the shared code level. We also have more functionality than we need on platform level.
The platform level only needs to know how to deoptimize, and how to set/get the guard value of an nmethod. The more specific functionality should be moved to the shared code and be expressed in terms of said setter/getter.

-------------

Commit messages:
 - Fix Shenandoah build
 - 8299312: Clean up BarrierSetNMethod

Changes: https://git.openjdk.org/jdk/pull/11774/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11774&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8299312
Stats: 159 lines in 26 files changed: 10 ins; 73 del; 76 mod
Patch: https://git.openjdk.org/jdk/pull/11774.diff
Fetch: git fetch https://git.openjdk.org/jdk pull/11774/head:pull/11774

PR: https://git.openjdk.org/jdk/pull/11774

From luhenry at openjdk.org Fri Dec 23 22:53:55 2022
From: luhenry at openjdk.org (Ludovic Henry)
Date: Fri, 23 Dec 2022 22:53:55 GMT
Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v16]
In-Reply-To:
References:
Message-ID:

On Wed, 21 Dec 2022 17:29:23 GMT, Claes Redestad wrote:

>> Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops.
>>
>> I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases.
>>
>> Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front.
>>
>> The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads.
>>
>> With the most recent fixes the x64 intrinsic results on my workstation look like this:
>>
>> Benchmark (size) Mode Cnt Score Error Units
>> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.199 ± 0.017 ns/op
>> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 6.933 ± 0.049 ns/op
>> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 29.935 ± 0.221 ns/op
>> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1596.982 ± 7.020 ns/op
>>
>> Baseline:
>>
>> Benchmark (size) Mode Cnt Score Error Units
>> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.200 ± 0.013 ns/op
>> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 9.424 ± 0.122 ns/op
>> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 90.541 ± 0.512 ns/op
>> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 9425.321 ± 67.630 ns/op
>>
>> I.e. no measurable overhead compared to baseline even for `size == 1`.
>>
>> The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good.
>>
>> Benchmark for `Arrays.hashCode`:
>>
>> Benchmark (size) Mode Cnt Score Error Units
>> ArraysHashCode.bytes 1 avgt 5 1.884 ± 0.013 ns/op
>> ArraysHashCode.bytes 10 avgt 5 6.955 ± 0.040 ns/op
>> ArraysHashCode.bytes 100 avgt 5 87.218 ± 0.595 ns/op
>> ArraysHashCode.bytes 10000 avgt 5 9419.591 ± 38.308 ns/op
>> ArraysHashCode.chars 1 avgt 5 2.200 ± 0.010 ns/op
>> ArraysHashCode.chars 10 avgt 5 6.935 ± 0.034 ns/op
>> ArraysHashCode.chars 100 avgt 5 30.216 ± 0.134 ns/op
>> ArraysHashCode.chars 10000 avgt 5 1601.629 ± 6.418 ns/op
>> ArraysHashCode.ints 1 avgt 5 2.200 ± 0.007 ns/op
>> ArraysHashCode.ints 10 avgt 5 6.936 ± 0.034 ns/op
>> ArraysHashCode.ints 100 avgt 5 29.412 ± 0.268 ns/op
>> ArraysHashCode.ints 10000 avgt 5 1610.578 ± 7.785 ns/op
>> ArraysHashCode.shorts 1 avgt 5 1.885 ± 0.012 ns/op
>> ArraysHashCode.shorts 10 avgt 5 6.961 ± 0.034 ns/op
>> ArraysHashCode.shorts 100 avgt 5 87.095 ± 0.417 ns/op
>> ArraysHashCode.shorts 10000 avgt 5 9420.617 ± 50.089 ns/op
>>
>> Baseline:
>>
>> Benchmark (size) Mode Cnt Score Error Units
>> ArraysHashCode.bytes 1 avgt 5 3.213 ± 0.207 ns/op
>> ArraysHashCode.bytes 10 avgt 5 8.483 ± 0.040 ns/op
>> ArraysHashCode.bytes 100 avgt 5 90.315 ± 0.655 ns/op
>> ArraysHashCode.bytes 10000 avgt 5 9422.094 ± 62.402 ns/op
>> ArraysHashCode.chars 1 avgt 5 3.040 ± 0.066 ns/op
>> ArraysHashCode.chars 10 avgt 5 8.497 ± 0.074 ns/op
>> ArraysHashCode.chars 100 avgt 5 90.074 ± 0.387 ns/op
>> ArraysHashCode.chars 10000 avgt 5 9420.474 ± 41.619 ns/op
>> ArraysHashCode.ints 1 avgt 5 2.827 ± 0.019 ns/op
>> ArraysHashCode.ints 10 avgt 5 7.727 ± 0.043 ns/op
>> ArraysHashCode.ints 100 avgt 5 89.405 ± 0.593 ns/op
>> ArraysHashCode.ints 10000 avgt 5 9426.539 ± 51.308 ns/op
>> ArraysHashCode.shorts 1 avgt 5 3.071 ± 0.062 ns/op
>> ArraysHashCode.shorts 10 avgt 5 8.168 ± 0.049 ns/op
>> ArraysHashCode.shorts 100 avgt 5 90.399 ± 0.292 ns/op
>> ArraysHashCode.shorts 10000 avgt 5 9420.171 ± 44.474 ns/op
>>
>> As we can see the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized). I aim to fix `byte` and `short` cases before integrating, though it might be acceptable to hand that off as follow-up enhancements to not further delay integration of this enhancement.
>
> Claes Redestad has updated the pull request incrementally with one additional commit since the last revision:
>
> Treat Op_VectorizedHashCode as other similar Ops in split_unique_types

Marked as reviewed by luhenry (Committer).

-------------

PR: https://git.openjdk.org/jdk/pull/10847

From kdnilsen at openjdk.org Tue Dec 27 22:10:26 2022
From: kdnilsen at openjdk.org (Kelvin Nilsen)
Date: Tue, 27 Dec 2022 22:10:26 GMT
Subject: RFR: JDK-8297796 GenShen: instrument the remembered set scan [v15]
In-Reply-To:
References: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com>
Message-ID:

On Thu, 22 Dec 2022 04:34:00 GMT, Y. Srinivas Ramakrishna wrote:

>> **Updated 12/21**
>>
>> **Summary:**
>> The main change is card stats collection during remembered set (RS) and update refs (UR) phases when the card-table is scanned. The code is protected by a new non-product only flag `ShenandoahEnableCardStats`, which is on by default in debug builds and off in the optimized build.
>>
>> We tested the impact of the code with the flag enabled in product mode and felt the impact was non-trivial. We might, in the future, enable the code in product mode if performance can be improved.
>>
>> Stats are logged per worker thread at the end of each RS and UR scan. These stats are specific to the most recent round of scanning. Global cumulative stats across all threads (but specific to RS or UR) are also maintained, and these are logged at periodic intervals as determined by the setting of `ShenandoahCardStatsLogInterval`.
>>
>> **Format of stats produced and how to interpret them: (sample)**
>>
>> The following format is an example from a slowdebug run where the logging is enabled. In this case there are 2 concurrent gc worker threads, and `ShenandoahCardStatsLogInterval` was set at 2.
The first two logs show the stats for those particular scans for each of the two worker threads, and the next set show the stats for particular scans for the two worker threads, followed by a cumulative one for that type of scan (RS or UR) across all workers and scans of that type, respectively. >> >> >> [560.766s][info][gc,remset ] GC(13) Scan Remembered Set >> [560.766s][info][gc,remset ] GC(13) Worker 0 Card Stats Histo: >> [560.766s][info][gc,remset ] GC(13) dirty_run: [ 0.00 0.00 0.00 99.61 100.00 ] >> [560.766s][info][gc,remset ] GC(13) clean_run: [ 0.00 0.00 0.00 0.00 53.12 ] >> [560.766s][info][gc,remset ] GC(13) dirty_cards: [ 0.00 99.61 99.61 99.61 100.00 ] >> [560.766s][info][gc,remset ] GC(13) clean_cards: [ 0.00 0.00 0.00 0.00 53.12 ] >> [560.766s][info][gc,remset ] GC(13) max_dirty_run: [ 0.00 99.61 99.61 99.61 100.00 ] >> [560.766s][info][gc,remset ] GC(13) max_clean_run: [ 0.00 0.00 0.00 0.00 53.12 ] >> [560.766s][info][gc,remset ] GC(13) dirty_objs: [ 0.00 0.00 0.00 818.36 1366.00 ] >> [560.766s][info][gc,remset ] GC(13) clean_objs: [ 0.00 0.00 0.00 0.00 8.00 ] >> [560.766s][info][gc,remset ] GC(13) dirty_scans: [ 0.00 0.00 0.00 705.08 1365.00 ] >> [560.766s][info][gc,remset ] GC(13) clean_scans: [ 0.00 0.00 0.00 0.00 0.00 ] >> [560.766s][info][gc,remset ] GC(13) alternations: [ 0.00 0.00 0.00 0.00 16.00 ] >> [560.766s][info][gc,remset ] GC(13) Worker 1 Card Stats Histo: >> [560.766s][info][gc,remset ] GC(13) dirty_run: [ 0.00 0.00 0.00 96.88 100.00 ] >> [560.766s][info][gc,remset ] GC(13) clean_run: [ 0.00 0.00 0.00 0.00 46.88 ] >> [560.766s][info][gc,remset ] GC(13) dirty_cards: [ 18.75 82.81 98.44 99.61 100.00 ] >> [560.766s][info][gc,remset ] GC(13) clean_cards: [ 0.00 0.00 0.00 0.00 46.88 ] >> [560.766s][info][gc,remset ] GC(13) max_dirty_run: [ 18.75 82.81 98.44 99.61 100.00 ] >> [560.766s][info][gc,remset ] GC(13) max_clean_run: [ 0.00 0.00 0.00 0.00 46.88 ] >> [560.766s][info][gc,remset ] GC(13) dirty_objs: [ 0.00 269.53 353.52 814.45 1366.00 ] >> [560.766s][info][gc,remset ] GC(13) clean_objs: [ 0.00 0.00 0.00 0.00 3.00 ] >> [560.766s][info][gc,remset ] GC(13) dirty_scans: [ 0.00 263.67 351.56 671.88 1365.00 ] >> [560.766s][info][gc,remset ] GC(13) clean_scans: [ 0.00 0.00 0.00 0.00 0.00 ] >> [560.766s][info][gc,remset ] GC(13) alternations: [ 0.00 0.00 0.00 0.00 3.00 ] >> [560.766s][info][gc ] GC(13) Concurrent remembered set scanning 1150.359ms >> [560.766s][info][gc,start ] GC(13) Concurrent marking roots >> ... 
>> [585.433s][info][gc ] GC(13) Concurrent evacuation 6225.829ms >> [585.433s][info][gc,start ] GC(13) Pause Init Update Refs >> [585.434s][info][gc ] GC(13) Pause Init Update Refs 0.264ms >> [585.434s][info][gc,start ] GC(13) Concurrent update references >> [585.434s][info][gc,task ] GC(13) Using 2 of 4 workers for concurrent reference update >> [585.567s][info][gc ] Average MMU = 2.925 >> [590.583s][info][gc ] Average MMU = 1.509 >> [595.600s][info][gc ] Average MMU = 0.835 >> [600.618s][info][gc ] Average MMU = 0.447 >> [605.635s][info][gc ] Average MMU = 0.253 >> [610.651s][info][gc ] Average MMU = 0.114 >> [615.669s][info][gc ] Average MMU = 0.130 >> [620.686s][info][gc ] Average MMU = 0.129 >> [622.209s][info][gc,remset ] GC(13) Update Refs >> [622.209s][info][gc,remset ] GC(13) Worker 0 Card Stats Histo: >> [622.209s][info][gc,remset ] GC(13) dirty_run: [ 0.00 0.00 0.00 3.12 100.00 ] >> [622.209s][info][gc,remset ] GC(13) clean_run: [ 0.00 0.00 0.00 3.12 100.00 ] >> [622.209s][info][gc,remset ] GC(13) dirty_cards: [ 0.00 3.12 50.00 99.61 100.00 ] >> [622.209s][info][gc,remset ] GC(13) clean_cards: [ 0.00 0.00 26.56 92.19 100.00 ] >> [622.209s][info][gc,remset ] GC(13) max_dirty_run: [ 0.00 1.56 29.69 99.61 100.00 ] >> [622.209s][info][gc,remset ] GC(13) max_clean_run: [ 0.00 0.00 9.38 70.31 100.00 ] >> [622.209s][info][gc,remset ] GC(13) dirty_objs: [ 0.00 0.00 0.00 50.00 1366.00 ] >> [622.209s][info][gc,remset ] GC(13) clean_objs: [ 0.00 0.00 3.98 54.88 64.00 ] >> [622.209s][info][gc,remset ] GC(13) dirty_scans: [ 0.00 0.00 0.00 33.98 1365.00 ] >> [622.209s][info][gc,remset ] GC(13) clean_scans: [ 0.00 0.00 0.00 1.00 16.00 ] >> [622.209s][info][gc,remset ] GC(13) alternations: [ 0.00 0.00 0.00 2.99 33.00 ] >> [622.209s][info][gc,remset ] GC(13) Worker 1 Card Stats Histo: >> [622.209s][info][gc,remset ] GC(13) dirty_run: [ 0.00 0.00 0.00 0.00 0.00 ] >> [622.209s][info][gc,remset ] GC(13) clean_run: [ 0.00 0.00 0.00 0.00 0.00 ] >> [622.210s][info][gc,remset ] GC(13) dirty_cards: [ 0.00 0.00 0.00 0.00 0.00 ] >> [622.210s][info][gc,remset ] GC(13) clean_cards: [ 0.00 0.00 0.00 0.00 0.00 ] >> [622.210s][info][gc,remset ] GC(13) max_dirty_run: [ 0.00 0.00 0.00 0.00 0.00 ] >> [622.210s][info][gc,remset ] GC(13) max_clean_run: [ 0.00 0.00 0.00 0.00 0.00 ] >> [622.210s][info][gc,remset ] GC(13) dirty_objs: [ 0.00 0.00 0.00 0.00 0.00 ] >> [622.210s][info][gc,remset ] GC(13) clean_objs: [ 0.00 0.00 0.00 0.00 0.00 ] >> [622.210s][info][gc,remset ] GC(13) dirty_scans: [ 0.00 0.00 0.00 0.00 0.00 ] >> [622.210s][info][gc,remset ] GC(13) clean_scans: [ 0.00 0.00 0.00 0.00 0.00 ] >> [622.210s][info][gc,remset ] GC(13) alternations: [ 0.00 0.00 0.00 0.00 0.00 ] >> [622.210s][info][gc ] GC(13) Concurrent update references 36776.258ms >> ... 
>> (init[627.626s][info][gc,remset ] GC(15) Scan Remembered Set >> [627.626s][info][gc,remset ] GC(15) Worker 0 Card Stats Histo: >> [627.626s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 1.56 100.00 ] >> [627.626s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 4.69 100.00 ] >> [627.626s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 0.00 6.25 32.81 100.00 ] >> [627.626s][info][gc,remset ] GC(15) clean_cards: [ 0.00 48.44 90.62 98.44 100.00 ] >> [627.626s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 0.00 3.12 15.62 100.00 ] >> [627.626s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 23.44 60.94 95.31 100.00 ] >> [627.626s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 0.00 45.90 164.06 1366.00 ] >> [627.626s][info][gc,remset ] GC(15) clean_objs: [ 0.00 11.91 53.91 60.94 63.00 ] >> [627.626s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 0.00 31.84 150.39 1365.00 ] >> [627.626s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 1.00 1.99 11.00 ] >> [627.626s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 1.99 6.00 24.00 ] >> [627.627s][info][gc,remset ] GC(15) Worker 1 Card Stats Histo: >> [627.627s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 3.12 100.00 ] >> [627.627s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 1.56 100.00 ] >> [627.627s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 6.25 99.61 99.61 100.00 ] >> [627.627s][info][gc,remset ] GC(15) clean_cards: [ 0.00 0.00 0.00 70.31 100.00 ] >> [627.627s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 3.12 99.61 99.61 100.00 ] >> [627.627s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 0.00 0.00 53.12 100.00 ] >> [627.627s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 0.00 0.00 0.00 1365.00 ] >> [627.627s][info][gc,remset ] GC(15) clean_objs: [ 0.00 0.00 0.00 40.82 64.00 ] >> [627.627s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 0.00 0.00 0.00 1364.00 ] >> [627.627s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 0.00 0.00 16.00 ] >> [627.627s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 0.00 0.00 33.00 ] >> [627.627s][info][gc,remset ] GC(15) Cumulative stats >> [627.627s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 6.25 100.00 ] >> [627.627s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 1.56 100.00 ] >> [627.627s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 40.62 99.61 99.61 100.00 ] >> [627.627s][info][gc,remset ] GC(15) clean_cards: [ 0.00 0.00 0.00 31.25 100.00 ] >> [627.627s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 23.44 99.61 99.61 100.00 ] >> [627.627s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 0.00 0.00 12.50 100.00 ] >> [627.627s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 0.00 0.00 326.17 1366.00 ] >> [627.627s][info][gc,remset ] GC(15) clean_objs: [ 0.00 0.00 0.00 3.98 64.00 ] >> [627.627s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 0.00 0.00 314.45 1365.00 ] >> [627.627s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 0.00 0.00 16.00 ] >> [627.627s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 0.00 0.00 33.00 ] >> [627.627s][info][gc ] GC(15) Concurrent remembered set scanning 1119.698ms >> ... 
>> [631.875s][info][gc,remset ] GC(15) Update Refs >> [631.875s][info][gc,remset ] GC(15) Worker 0 Card Stats Histo: >> [631.875s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 3.12 100.00 ] >> [631.875s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 3.12 100.00 ] >> [631.875s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 4.69 99.61 99.61 100.00 ] >> [631.875s][info][gc,remset ] GC(15) clean_cards: [ 0.00 0.00 0.00 90.62 100.00 ] >> [631.875s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 3.12 99.61 99.61 100.00 ] >> [631.875s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 0.00 0.00 68.75 100.00 ] >> [631.875s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 0.00 0.00 29.88 1365.00 ] >> [631.875s][info][gc,remset ] GC(15) clean_objs: [ 0.00 0.00 0.00 52.93 64.00 ] >> [631.875s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 0.00 0.00 22.85 1364.00 ] >> [631.875s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 0.00 0.00 11.00 ] >> [631.875s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 0.00 1.99 24.00 ] >> [631.875s][info][gc,remset ] GC(15) Worker 1 Card Stats Histo: >> [631.875s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 26.56 100.00 ] >> [631.875s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 0.00 100.00 ] >> [631.875s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 62.50 99.61 99.61 100.00 ] >> [631.875s][info][gc,remset ] GC(15) clean_cards: [ 0.00 0.00 0.00 0.00 100.00 ] >> [631.875s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 59.38 99.61 99.61 100.00 ] >> [631.875s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 0.00 0.00 0.00 100.00 ] >> [631.875s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 230.47 818.36 871.09 1366.00 ] >> [631.875s][info][gc,remset ] GC(15) clean_objs: [ 0.00 0.00 0.00 0.00 63.00 ] >> [631.875s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 181.64 707.03 796.88 1365.00 ] >> [631.875s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 0.00 0.00 16.00 ] >> [631.875s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 0.00 0.00 33.00 ] >> [631.876s][info][gc,remset ] GC(15) Cumulative stats >> [631.876s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 6.25 100.00 ] >> [631.876s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 1.56 100.00 ] >> [631.876s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 32.81 99.61 99.61 100.00 ] >> [631.876s][info][gc,remset ] GC(15) clean_cards: [ 0.00 0.00 0.00 43.75 100.00 ] >> [631.876s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 15.62 99.61 99.61 100.00 ] >> [631.876s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 0.00 0.00 20.31 100.00 ] >> [631.876s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 0.00 20.90 695.31 1366.00 ] >> [631.876s][info][gc,remset ] GC(15) clean_objs: [ 0.00 0.00 0.00 11.91 64.00 ] >> [631.876s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 0.00 11.91 562.50 1365.00 ] >> [631.876s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 0.00 0.00 16.00 ] >> [631.876s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 0.00 0.00 33.00 ] >> [631.876s][info][gc ] GC(15) Concurrent update references 1953.893ms >> ... >> >> >> The rows represent the metric that's being tracked, and the columns are, respectively, minimum, the 3 quartiles (25%, 50%, 75%) and the maximum. 
The metrics are:
>>
>> - dirty_run: the length of an uninterrupted run of dirty cards, interpreted as a percentage of a chunk of work assignment (cluster) processed by a thread
>> - clean_run: as above, but the length of an uninterrupted run of clean cards
>> - dirty_cards, clean_cards: as above, but counts of cards as a percentage of chunk
>> - max_dirty_run & max_clean_run: similarly, but for the maximum of each
>> - dirty_objs, clean_objs: the numbers of objects in any chunk walked or scanned
>> - dirty_scans, clean_scans: the numbers of objects scanned by the closure
>> - alternations: the number of times that we transitioned from clean to dirty or dirty to clean in a chunk
>>
>> For example, the last cumulative log data (for UR) above indicates that at least 75% of the chunks have no alternations at all,
>> and cards are almost always mostly clean for this specific prefix of the run.
>>
>> Comparing worker stats from worker 0 and worker 1 indicates that in particular scans they may see different distributions of dirty cards for specific benchmarks, based on their promotion and mutation behavior.
>>
>> **Question:**
>> Would it make sense to print also, for example, the 1, 10, 90 and 99 percentiles for these metrics, in addition to the quartiles, min, and max?
>
> Y. Srinivas Ramakrishna has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 68 commits:
>
>  - Merge branch 'master' into JVM-1264
>  - A couple of changes based on review feedback.
>  - Reword some code comments for greater clarity.
>  - Merge branch 'master' into JVM-1264-dependent
>  - Add a previously missed ticket#. Doing it here rather than in parent to
>    avoid an otherwise unnecessary re-review touchpoint.
>  - Merge branch 'stats_merge' into JVM-1264-dependent
>  - Merge branch 'master' into stats_merge
>  - jcheck space fix
>  - Fix compiler error on windows.
>  - Fix some tier1 tests.
>  - ... and 58 more: https://git.openjdk.org/shenandoah/compare/d793fd16...4e5ad4ca

Thank you. This is very thorough.

src/hotspot/share/gc/shenandoah/shenandoahScanRemembered.cpp line 83:

> 81: _rp->set_mark_closure(worker_id, &cl);
> 82: struct ShenandoahRegionChunk assignment;
> 83: while (_work_list->next(&assignment)) {

No need for a code change here, but just want to make clear that we may want to enable cancellation of rem-set scanning at some future time. The primary benefit would be to allow quicker transition to Full GC in the case that we have to degenerate.

src/hotspot/share/gc/shenandoah/shenandoahScanRemembered.hpp line 321:

> 319: // 3. Non-array objects are precisely dirtied by the interpreter and the compilers
> 320: // (why? Are offsets of a field in an object that expensive to determine?).
> 321: // For such objects that extend over multiple cards, or even multiple clusters,

Historically, we borrowed the card-marking barrier from existing generational GC implementations and did not want to burden ourselves with trying to change it. Presumably, experience with other GCs demonstrates that this works "well enough". It would appear that non-array objects are usually not "extremely large".

src/hotspot/share/gc/shenandoah/shenandoahScanRemembered.hpp line 323:

> 321: // For such objects that extend over multiple cards, or even multiple clusters,
> 322: // the entire object is scanned by the worker that processes the (dirty) card on
> 323: // which the object's header lies. However, GC workers then precisley dirty the

typo: precisely

-------------

Marked as reviewed by kdnilsen (Committer).

PR: https://git.openjdk.org/shenandoah/pull/176

From kdnilsen at openjdk.org Tue Dec 27 22:10:26 2022
From: kdnilsen at openjdk.org (Kelvin Nilsen)
Date: Tue, 27 Dec 2022 22:10:26 GMT
Subject: RFR: JDK-8297796 GenShen: instrument the remembered set scan [v15]
In-Reply-To:
References: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com>
Message-ID:

On Tue, 27 Dec 2022 21:53:12 GMT, Kelvin Nilsen wrote:

>> Y. Srinivas Ramakrishna has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 68 commits:
>>
>>  - Merge branch 'master' into JVM-1264
>>  - A couple of changes based on review feedback.
>>  - Reword some code comments for greater clarity.
>>  - Merge branch 'master' into JVM-1264-dependent
>>  - Add a previously missed ticket#. Doing it here rather than in parent to
>>    avoid an otherwise unnecessary re-review touchpoint.
>>  - Merge branch 'stats_merge' into JVM-1264-dependent
>>  - Merge branch 'master' into stats_merge
>>  - jcheck space fix
>>  - Fix compiler error on windows.
>>  - Fix some tier1 tests.
>>  - ... and 58 more: https://git.openjdk.org/shenandoah/compare/d793fd16...4e5ad4ca
>
> src/hotspot/share/gc/shenandoah/shenandoahScanRemembered.cpp line 83:
>
>> 81: _rp->set_mark_closure(worker_id, &cl);
>> 82: struct ShenandoahRegionChunk assignment;
>> 83: while (_work_list->next(&assignment)) {
>
> No need for a code change here, but just want to make clear that we may want to enable cancellation of rem-set scanning at some future time. The primary benefit would be to allow quicker transition to Full GC in the case that we have to degenerate.

This would also accelerate the transition to degenerated mode, so we can get ourselves out of the STW pause more quickly.

-------------

PR: https://git.openjdk.org/shenandoah/pull/176

From ysr at openjdk.org Tue Dec 27 23:16:32 2022
From: ysr at openjdk.org (Y. Srinivas Ramakrishna)
Date: Tue, 27 Dec 2022 23:16:32 GMT
Subject: RFR: JDK-8297796 GenShen: instrument the remembered set scan [v15]
In-Reply-To:
References: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com>
Message-ID: <0kj10C2G8dpG_wSQIV4O-N1Q9mlRNrK8_aTH65jPzHs=.3fd5e140-98a0-4e24-ac88-378f41672395@github.com>

On Tue, 27 Dec 2022 21:54:24 GMT, Kelvin Nilsen wrote:

>> src/hotspot/share/gc/shenandoah/shenandoahScanRemembered.cpp line 83:
>>
>>> 81: _rp->set_mark_closure(worker_id, &cl);
>>> 82: struct ShenandoahRegionChunk assignment;
>>> 83: while (_work_list->next(&assignment)) {
>>
>> No need for a code change here, but just want to make clear that we may want to enable cancellation of rem-set scanning at some future time. The primary benefit would be to allow quicker transition to Full GC in the case that we have to degenerate.
>
> This would also accelerate the transition to degenerated mode, so we can get ourselves out of the STW pause more quickly.

When a work item is picked up, it should be completed since it can't currently be placed back on the work list. Hence the code change, moving the check to the end of the loop. My expectation had been that this would fix the issue with prompt cancellation causing crashes previously. However, testing revealed that this was still causing crashes, so I left the cancellation commented out with the intention of following up on this in the fullness of time.
I'll leave a comment to that effect as you suggest, and mark it with a TODO so it's easily flagged/found.

-------------

PR: https://git.openjdk.org/shenandoah/pull/176

From ysr at openjdk.org Tue Dec 27 23:25:14 2022
From: ysr at openjdk.org (Y. Srinivas Ramakrishna)
Date: Tue, 27 Dec 2022 23:25:14 GMT
Subject: RFR: JDK-8297796 GenShen: instrument the remembered set scan [v15]
In-Reply-To:
References: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com>
Message-ID:

On Tue, 27 Dec 2022 22:04:49 GMT, Kelvin Nilsen wrote:

>> Y. Srinivas Ramakrishna has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 68 commits:
>>
>>  - Merge branch 'master' into JVM-1264
>>  - A couple of changes based on review feedback.
>>  - Reword some code comments for greater clarity.
>>  - Merge branch 'master' into JVM-1264-dependent
>>  - Add a previously missed ticket#. Doing it here rather than in parent to
>>    avoid an otherwise unnecessary re-review touchpoint.
>>  - Merge branch 'stats_merge' into JVM-1264-dependent
>>  - Merge branch 'master' into stats_merge
>>  - jcheck space fix
>>  - Fix compiler error on windows.
>>  - Fix some tier1 tests.
>>  - ... and 58 more: https://git.openjdk.org/shenandoah/compare/d793fd16...4e5ad4ca
>
> src/hotspot/share/gc/shenandoah/shenandoahScanRemembered.hpp line 321:
>
>> 319: // 3. Non-array objects are precisely dirtied by the interpreter and the compilers
>> 320: // (why? Are offsets of a field in an object that expensive to determine?).
>> 321: // For such objects that extend over multiple cards, or even multiple clusters,
>
> Historically, we borrowed the card-marking barrier from existing generational GC implementations and did not want to burden ourselves with trying to change it. Presumably, experience with other GCs demonstrates that this works "well enough". It would appear that non-array objects are usually not "extremely large".

I realize that my comments in lines 324-328 were aspirational, describing changes I wanted to make but which are not in place today. I'll correct those and a few other such "aspirational" comments that got left behind in the code as I was working on subsequent changes.

It is true what you say that non-array objects are usually not very large (unless the result of code generated by frameworks such as, e.g., protobufs).

-------------

PR: https://git.openjdk.org/shenandoah/pull/176

From ysr at openjdk.org Wed Dec 28 00:21:44 2022
From: ysr at openjdk.org (Y. Srinivas Ramakrishna)
Date: Wed, 28 Dec 2022 00:21:44 GMT
Subject: RFR: JDK-8297796 GenShen: instrument the remembered set scan [v16]
In-Reply-To:
References: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com>
Message-ID:

> **Updated 12/21**
>
> **Summary:**
> The main change is card stats collection during remembered set (RS) and update refs (UR) phases when the card-table is scanned. The code is protected by a new non-product only flag `ShenandoahEnableCardStats`, which is on by default in debug builds and off in the optimized build.
>
> We tested the impact of the code with the flag enabled in product mode and felt the impact was non-trivial. We might, in the future, enable the code in product mode if performance can be improved.
>
> Stats are logged per worker thread at the end of each RS and UR scan. These stats are specific to the most recent round of scanning.
Global cumulative stats across all threads (but specific to RS or UR) are also maintained, and these are logged at periodic intervals as determined by the setting of `ShenandoahCardStatsLogInterval`. > > **Format of stats produced and how to interpret them: (sample)** > > The following format is an example from a slowdebug run where the logging is enabled. In this case there are 2 concurrent gc worker threads, and `ShenandoahCardStatsLogInterval` was set at 2. The first two logs show the stats for those particular scans for each of the two worker threads, and the next set show the stats for particular scans for the two worker threads, followed by a cumulative one for that type of scan (RS or UR) across all workers and scans of that type, respectively. > > > [560.766s][info][gc,remset ] GC(13) Scan Remembered Set > [560.766s][info][gc,remset ] GC(13) Worker 0 Card Stats Histo: > [560.766s][info][gc,remset ] GC(13) dirty_run: [ 0.00 0.00 0.00 99.61 100.00 ] > [560.766s][info][gc,remset ] GC(13) clean_run: [ 0.00 0.00 0.00 0.00 53.12 ] > [560.766s][info][gc,remset ] GC(13) dirty_cards: [ 0.00 99.61 99.61 99.61 100.00 ] > [560.766s][info][gc,remset ] GC(13) clean_cards: [ 0.00 0.00 0.00 0.00 53.12 ] > [560.766s][info][gc,remset ] GC(13) max_dirty_run: [ 0.00 99.61 99.61 99.61 100.00 ] > [560.766s][info][gc,remset ] GC(13) max_clean_run: [ 0.00 0.00 0.00 0.00 53.12 ] > [560.766s][info][gc,remset ] GC(13) dirty_objs: [ 0.00 0.00 0.00 818.36 1366.00 ] > [560.766s][info][gc,remset ] GC(13) clean_objs: [ 0.00 0.00 0.00 0.00 8.00 ] > [560.766s][info][gc,remset ] GC(13) dirty_scans: [ 0.00 0.00 0.00 705.08 1365.00 ] > [560.766s][info][gc,remset ] GC(13) clean_scans: [ 0.00 0.00 0.00 0.00 0.00 ] > [560.766s][info][gc,remset ] GC(13) alternations: [ 0.00 0.00 0.00 0.00 16.00 ] > [560.766s][info][gc,remset ] GC(13) Worker 1 Card Stats Histo: > [560.766s][info][gc,remset ] GC(13) dirty_run: [ 0.00 0.00 0.00 96.88 100.00 ] > [560.766s][info][gc,remset ] GC(13) clean_run: [ 0.00 0.00 0.00 0.00 46.88 ] > [560.766s][info][gc,remset ] GC(13) dirty_cards: [ 18.75 82.81 98.44 99.61 100.00 ] > [560.766s][info][gc,remset ] GC(13) clean_cards: [ 0.00 0.00 0.00 0.00 46.88 ] > [560.766s][info][gc,remset ] GC(13) max_dirty_run: [ 18.75 82.81 98.44 99.61 100.00 ] > [560.766s][info][gc,remset ] GC(13) max_clean_run: [ 0.00 0.00 0.00 0.00 46.88 ] > [560.766s][info][gc,remset ] GC(13) dirty_objs: [ 0.00 269.53 353.52 814.45 1366.00 ] > [560.766s][info][gc,remset ] GC(13) clean_objs: [ 0.00 0.00 0.00 0.00 3.00 ] > [560.766s][info][gc,remset ] GC(13) dirty_scans: [ 0.00 263.67 351.56 671.88 1365.00 ] > [560.766s][info][gc,remset ] GC(13) clean_scans: [ 0.00 0.00 0.00 0.00 0.00 ] > [560.766s][info][gc,remset ] GC(13) alternations: [ 0.00 0.00 0.00 0.00 3.00 ] > [560.766s][info][gc ] GC(13) Concurrent remembered set scanning 1150.359ms > [560.766s][info][gc,start ] GC(13) Concurrent marking roots > ... 
> [585.433s][info][gc ] GC(13) Concurrent evacuation 6225.829ms > [585.433s][info][gc,start ] GC(13) Pause Init Update Refs > [585.434s][info][gc ] GC(13) Pause Init Update Refs 0.264ms > [585.434s][info][gc,start ] GC(13) Concurrent update references > [585.434s][info][gc,task ] GC(13) Using 2 of 4 workers for concurrent reference update > [585.567s][info][gc ] Average MMU = 2.925 > [590.583s][info][gc ] Average MMU = 1.509 > [595.600s][info][gc ] Average MMU = 0.835 > [600.618s][info][gc ] Average MMU = 0.447 > [605.635s][info][gc ] Average MMU = 0.253 > [610.651s][info][gc ] Average MMU = 0.114 > [615.669s][info][gc ] Average MMU = 0.130 > [620.686s][info][gc ] Average MMU = 0.129 > [622.209s][info][gc,remset ] GC(13) Update Refs > [622.209s][info][gc,remset ] GC(13) Worker 0 Card Stats Histo: > [622.209s][info][gc,remset ] GC(13) dirty_run: [ 0.00 0.00 0.00 3.12 100.00 ] > [622.209s][info][gc,remset ] GC(13) clean_run: [ 0.00 0.00 0.00 3.12 100.00 ] > [622.209s][info][gc,remset ] GC(13) dirty_cards: [ 0.00 3.12 50.00 99.61 100.00 ] > [622.209s][info][gc,remset ] GC(13) clean_cards: [ 0.00 0.00 26.56 92.19 100.00 ] > [622.209s][info][gc,remset ] GC(13) max_dirty_run: [ 0.00 1.56 29.69 99.61 100.00 ] > [622.209s][info][gc,remset ] GC(13) max_clean_run: [ 0.00 0.00 9.38 70.31 100.00 ] > [622.209s][info][gc,remset ] GC(13) dirty_objs: [ 0.00 0.00 0.00 50.00 1366.00 ] > [622.209s][info][gc,remset ] GC(13) clean_objs: [ 0.00 0.00 3.98 54.88 64.00 ] > [622.209s][info][gc,remset ] GC(13) dirty_scans: [ 0.00 0.00 0.00 33.98 1365.00 ] > [622.209s][info][gc,remset ] GC(13) clean_scans: [ 0.00 0.00 0.00 1.00 16.00 ] > [622.209s][info][gc,remset ] GC(13) alternations: [ 0.00 0.00 0.00 2.99 33.00 ] > [622.209s][info][gc,remset ] GC(13) Worker 1 Card Stats Histo: > [622.209s][info][gc,remset ] GC(13) dirty_run: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.209s][info][gc,remset ] GC(13) clean_run: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc,remset ] GC(13) dirty_cards: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc,remset ] GC(13) clean_cards: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc,remset ] GC(13) max_dirty_run: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc,remset ] GC(13) max_clean_run: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc,remset ] GC(13) dirty_objs: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc,remset ] GC(13) clean_objs: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc,remset ] GC(13) dirty_scans: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc,remset ] GC(13) clean_scans: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc,remset ] GC(13) alternations: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc ] GC(13) Concurrent update references 36776.258ms > ... 
> (init[627.626s][info][gc,remset ] GC(15) Scan Remembered Set > [627.626s][info][gc,remset ] GC(15) Worker 0 Card Stats Histo: > [627.626s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 1.56 100.00 ] > [627.626s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 4.69 100.00 ] > [627.626s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 0.00 6.25 32.81 100.00 ] > [627.626s][info][gc,remset ] GC(15) clean_cards: [ 0.00 48.44 90.62 98.44 100.00 ] > [627.626s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 0.00 3.12 15.62 100.00 ] > [627.626s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 23.44 60.94 95.31 100.00 ] > [627.626s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 0.00 45.90 164.06 1366.00 ] > [627.626s][info][gc,remset ] GC(15) clean_objs: [ 0.00 11.91 53.91 60.94 63.00 ] > [627.626s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 0.00 31.84 150.39 1365.00 ] > [627.626s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 1.00 1.99 11.00 ] > [627.626s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 1.99 6.00 24.00 ] > [627.627s][info][gc,remset ] GC(15) Worker 1 Card Stats Histo: > [627.627s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 3.12 100.00 ] > [627.627s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 1.56 100.00 ] > [627.627s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 6.25 99.61 99.61 100.00 ] > [627.627s][info][gc,remset ] GC(15) clean_cards: [ 0.00 0.00 0.00 70.31 100.00 ] > [627.627s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 3.12 99.61 99.61 100.00 ] > [627.627s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 0.00 0.00 53.12 100.00 ] > [627.627s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 0.00 0.00 0.00 1365.00 ] > [627.627s][info][gc,remset ] GC(15) clean_objs: [ 0.00 0.00 0.00 40.82 64.00 ] > [627.627s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 0.00 0.00 0.00 1364.00 ] > [627.627s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 0.00 0.00 16.00 ] > [627.627s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 0.00 0.00 33.00 ] > [627.627s][info][gc,remset ] GC(15) Cumulative stats > [627.627s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 6.25 100.00 ] > [627.627s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 1.56 100.00 ] > [627.627s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 40.62 99.61 99.61 100.00 ] > [627.627s][info][gc,remset ] GC(15) clean_cards: [ 0.00 0.00 0.00 31.25 100.00 ] > [627.627s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 23.44 99.61 99.61 100.00 ] > [627.627s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 0.00 0.00 12.50 100.00 ] > [627.627s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 0.00 0.00 326.17 1366.00 ] > [627.627s][info][gc,remset ] GC(15) clean_objs: [ 0.00 0.00 0.00 3.98 64.00 ] > [627.627s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 0.00 0.00 314.45 1365.00 ] > [627.627s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 0.00 0.00 16.00 ] > [627.627s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 0.00 0.00 33.00 ] > [627.627s][info][gc ] GC(15) Concurrent remembered set scanning 1119.698ms > ... 
> [631.875s][info][gc,remset ] GC(15) Update Refs > [631.875s][info][gc,remset ] GC(15) Worker 0 Card Stats Histo: > [631.875s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 3.12 100.00 ] > [631.875s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 3.12 100.00 ] > [631.875s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 4.69 99.61 99.61 100.00 ] > [631.875s][info][gc,remset ] GC(15) clean_cards: [ 0.00 0.00 0.00 90.62 100.00 ] > [631.875s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 3.12 99.61 99.61 100.00 ] > [631.875s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 0.00 0.00 68.75 100.00 ] > [631.875s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 0.00 0.00 29.88 1365.00 ] > [631.875s][info][gc,remset ] GC(15) clean_objs: [ 0.00 0.00 0.00 52.93 64.00 ] > [631.875s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 0.00 0.00 22.85 1364.00 ] > [631.875s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 0.00 0.00 11.00 ] > [631.875s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 0.00 1.99 24.00 ] > [631.875s][info][gc,remset ] GC(15) Worker 1 Card Stats Histo: > [631.875s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 26.56 100.00 ] > [631.875s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 0.00 100.00 ] > [631.875s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 62.50 99.61 99.61 100.00 ] > [631.875s][info][gc,remset ] GC(15) clean_cards: [ 0.00 0.00 0.00 0.00 100.00 ] > [631.875s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 59.38 99.61 99.61 100.00 ] > [631.875s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 0.00 0.00 0.00 100.00 ] > [631.875s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 230.47 818.36 871.09 1366.00 ] > [631.875s][info][gc,remset ] GC(15) clean_objs: [ 0.00 0.00 0.00 0.00 63.00 ] > [631.875s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 181.64 707.03 796.88 1365.00 ] > [631.875s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 0.00 0.00 16.00 ] > [631.875s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 0.00 0.00 33.00 ] > [631.876s][info][gc,remset ] GC(15) Cumulative stats > [631.876s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 6.25 100.00 ] > [631.876s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 1.56 100.00 ] > [631.876s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 32.81 99.61 99.61 100.00 ] > [631.876s][info][gc,remset ] GC(15) clean_cards: [ 0.00 0.00 0.00 43.75 100.00 ] > [631.876s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 15.62 99.61 99.61 100.00 ] > [631.876s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 0.00 0.00 20.31 100.00 ] > [631.876s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 0.00 20.90 695.31 1366.00 ] > [631.876s][info][gc,remset ] GC(15) clean_objs: [ 0.00 0.00 0.00 11.91 64.00 ] > [631.876s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 0.00 11.91 562.50 1365.00 ] > [631.876s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 0.00 0.00 16.00 ] > [631.876s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 0.00 0.00 33.00 ] > [631.876s][info][gc ] GC(15) Concurrent update references 1953.893ms > ... > > > The rows represent the metric that's being tracked, and the columns are, respectively, minimum, the 3 quartiles (25%, 50%, 75%) and the maximum. 
The metrics are:
>
> - dirty_run: the length of an uninterrupted run of dirty cards, interpreted as a percentage of a chunk of work assignment (cluster) processed by a thread
> - clean_run: as above, but the length of an uninterrupted run of clean cards
> - dirty_cards, clean_cards: as above, but counts of cards as a percentage of chunk
> - max_dirty_run & max_clean_run: similarly, but for the maximum of each
> - dirty_objs, clean_objs: the numbers of objects in any chunk walked or scanned
> - dirty_scans, clean_scans: the numbers of objects scanned by the closure
> - alternations: the number of times that we transitioned from clean to dirty or dirty to clean in a chunk
>
> For example, the last cumulative log data (for UR) above indicates that at least 75% of the chunks have no alternations at all,
> and cards are almost always mostly clean for this specific prefix of the run.
>
> Comparing worker stats from worker 0 and worker 1 indicates that in particular scans they may see different distributions of dirty cards for specific benchmarks, based on their promotion and mutation behavior.
>
> **Question:**
> Would it make sense to print also, for example, the 1, 10, 90 and 99 percentiles for these metrics, in addition to the quartiles, min, and max?

Y. Srinivas Ramakrishna has updated the pull request incrementally with one additional commit since the last revision:

  Fix some comments based on review feedback.

-------------

Changes:
  - all: https://git.openjdk.org/shenandoah/pull/176/files
  - new: https://git.openjdk.org/shenandoah/pull/176/files/4e5ad4ca..cf8c7e54

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=shenandoah&pr=176&range=15
 - incr: https://webrevs.openjdk.org/?repo=shenandoah&pr=176&range=14-15

Stats: 16 lines in 2 files changed: 3 ins; 5 del; 8 mod
Patch: https://git.openjdk.org/shenandoah/pull/176.diff
Fetch: git fetch https://git.openjdk.org/shenandoah pull/176/head:pull/176

PR: https://git.openjdk.org/shenandoah/pull/176

From ysr at openjdk.org Wed Dec 28 00:40:14 2022
From: ysr at openjdk.org (Y. Srinivas Ramakrishna)
Date: Wed, 28 Dec 2022 00:40:14 GMT
Subject: RFR: JDK-8297796 GenShen: instrument the remembered set scan [v16]
In-Reply-To:
References: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com>
Message-ID: <5t4esBi2u-J1zWtF1yked2BEwa59Pef-WUWY6_Q-gso=.61f4867a-af39-4d53-b6ad-45f8d37a272d@github.com>

On Wed, 28 Dec 2022 00:21:44 GMT, Y. Srinivas Ramakrishna wrote:

>> **Updated 12/21**
>>
>> **Summary:**
>> The main change is card stats collection during remembered set (RS) and update refs (UR) phases when the card-table is scanned. The code is protected by a new non-product only flag `ShenandoahEnableCardStats`, which is on by default in debug builds and off in the optimized build.
>>
>> We tested the impact of the code with the flag enabled in product mode and felt the impact was non-trivial. We might, in the future, enable the code in product mode if performance can be improved.
>>
>> Stats are logged per worker thread at the end of each RS and UR scan. These stats are specific to the most recent round of scanning. Global cumulative stats across all threads (but specific to RS or UR) are also maintained, and these are logged at periodic intervals as determined by the setting of `ShenandoahCardStatsLogInterval`.
>>
>> **Format of stats produced and how to interpret them: (sample)**
>>
>> The following format is an example from a slowdebug run where the logging is enabled.
In this case there are 2 concurrent gc worker threads, and `ShenandoahCardStatsLogInterval` was set at 2. The first two logs show the stats for those particular scans for each of the two worker threads, and the next set show the stats for particular scans for the two worker threads, followed by a cumulative one for that type of scan (RS or UR) across all workers and scans of that type, respectively. >> >> >> [560.766s][info][gc,remset ] GC(13) Scan Remembered Set >> [560.766s][info][gc,remset ] GC(13) Worker 0 Card Stats Histo: >> [560.766s][info][gc,remset ] GC(13) dirty_run: [ 0.00 0.00 0.00 99.61 100.00 ] >> [560.766s][info][gc,remset ] GC(13) clean_run: [ 0.00 0.00 0.00 0.00 53.12 ] >> [560.766s][info][gc,remset ] GC(13) dirty_cards: [ 0.00 99.61 99.61 99.61 100.00 ] >> [560.766s][info][gc,remset ] GC(13) clean_cards: [ 0.00 0.00 0.00 0.00 53.12 ] >> [560.766s][info][gc,remset ] GC(13) max_dirty_run: [ 0.00 99.61 99.61 99.61 100.00 ] >> [560.766s][info][gc,remset ] GC(13) max_clean_run: [ 0.00 0.00 0.00 0.00 53.12 ] >> [560.766s][info][gc,remset ] GC(13) dirty_objs: [ 0.00 0.00 0.00 818.36 1366.00 ] >> [560.766s][info][gc,remset ] GC(13) clean_objs: [ 0.00 0.00 0.00 0.00 8.00 ] >> [560.766s][info][gc,remset ] GC(13) dirty_scans: [ 0.00 0.00 0.00 705.08 1365.00 ] >> [560.766s][info][gc,remset ] GC(13) clean_scans: [ 0.00 0.00 0.00 0.00 0.00 ] >> [560.766s][info][gc,remset ] GC(13) alternations: [ 0.00 0.00 0.00 0.00 16.00 ] >> [560.766s][info][gc,remset ] GC(13) Worker 1 Card Stats Histo: >> [560.766s][info][gc,remset ] GC(13) dirty_run: [ 0.00 0.00 0.00 96.88 100.00 ] >> [560.766s][info][gc,remset ] GC(13) clean_run: [ 0.00 0.00 0.00 0.00 46.88 ] >> [560.766s][info][gc,remset ] GC(13) dirty_cards: [ 18.75 82.81 98.44 99.61 100.00 ] >> [560.766s][info][gc,remset ] GC(13) clean_cards: [ 0.00 0.00 0.00 0.00 46.88 ] >> [560.766s][info][gc,remset ] GC(13) max_dirty_run: [ 18.75 82.81 98.44 99.61 100.00 ] >> [560.766s][info][gc,remset ] GC(13) max_clean_run: [ 0.00 0.00 0.00 0.00 46.88 ] >> [560.766s][info][gc,remset ] GC(13) dirty_objs: [ 0.00 269.53 353.52 814.45 1366.00 ] >> [560.766s][info][gc,remset ] GC(13) clean_objs: [ 0.00 0.00 0.00 0.00 3.00 ] >> [560.766s][info][gc,remset ] GC(13) dirty_scans: [ 0.00 263.67 351.56 671.88 1365.00 ] >> [560.766s][info][gc,remset ] GC(13) clean_scans: [ 0.00 0.00 0.00 0.00 0.00 ] >> [560.766s][info][gc,remset ] GC(13) alternations: [ 0.00 0.00 0.00 0.00 3.00 ] >> [560.766s][info][gc ] GC(13) Concurrent remembered set scanning 1150.359ms >> [560.766s][info][gc,start ] GC(13) Concurrent marking roots >> ... 
>> [585.433s][info][gc ] GC(13) Concurrent evacuation 6225.829ms >> [585.433s][info][gc,start ] GC(13) Pause Init Update Refs >> [585.434s][info][gc ] GC(13) Pause Init Update Refs 0.264ms >> [585.434s][info][gc,start ] GC(13) Concurrent update references >> [585.434s][info][gc,task ] GC(13) Using 2 of 4 workers for concurrent reference update >> [585.567s][info][gc ] Average MMU = 2.925 >> [590.583s][info][gc ] Average MMU = 1.509 >> [595.600s][info][gc ] Average MMU = 0.835 >> [600.618s][info][gc ] Average MMU = 0.447 >> [605.635s][info][gc ] Average MMU = 0.253 >> [610.651s][info][gc ] Average MMU = 0.114 >> [615.669s][info][gc ] Average MMU = 0.130 >> [620.686s][info][gc ] Average MMU = 0.129 >> [622.209s][info][gc,remset ] GC(13) Update Refs >> [622.209s][info][gc,remset ] GC(13) Worker 0 Card Stats Histo: >> [622.209s][info][gc,remset ] GC(13) dirty_run: [ 0.00 0.00 0.00 3.12 100.00 ] >> [622.209s][info][gc,remset ] GC(13) clean_run: [ 0.00 0.00 0.00 3.12 100.00 ] >> [622.209s][info][gc,remset ] GC(13) dirty_cards: [ 0.00 3.12 50.00 99.61 100.00 ] >> [622.209s][info][gc,remset ] GC(13) clean_cards: [ 0.00 0.00 26.56 92.19 100.00 ] >> [622.209s][info][gc,remset ] GC(13) max_dirty_run: [ 0.00 1.56 29.69 99.61 100.00 ] >> [622.209s][info][gc,remset ] GC(13) max_clean_run: [ 0.00 0.00 9.38 70.31 100.00 ] >> [622.209s][info][gc,remset ] GC(13) dirty_objs: [ 0.00 0.00 0.00 50.00 1366.00 ] >> [622.209s][info][gc,remset ] GC(13) clean_objs: [ 0.00 0.00 3.98 54.88 64.00 ] >> [622.209s][info][gc,remset ] GC(13) dirty_scans: [ 0.00 0.00 0.00 33.98 1365.00 ] >> [622.209s][info][gc,remset ] GC(13) clean_scans: [ 0.00 0.00 0.00 1.00 16.00 ] >> [622.209s][info][gc,remset ] GC(13) alternations: [ 0.00 0.00 0.00 2.99 33.00 ] >> [622.209s][info][gc,remset ] GC(13) Worker 1 Card Stats Histo: >> [622.209s][info][gc,remset ] GC(13) dirty_run: [ 0.00 0.00 0.00 0.00 0.00 ] >> [622.209s][info][gc,remset ] GC(13) clean_run: [ 0.00 0.00 0.00 0.00 0.00 ] >> [622.210s][info][gc,remset ] GC(13) dirty_cards: [ 0.00 0.00 0.00 0.00 0.00 ] >> [622.210s][info][gc,remset ] GC(13) clean_cards: [ 0.00 0.00 0.00 0.00 0.00 ] >> [622.210s][info][gc,remset ] GC(13) max_dirty_run: [ 0.00 0.00 0.00 0.00 0.00 ] >> [622.210s][info][gc,remset ] GC(13) max_clean_run: [ 0.00 0.00 0.00 0.00 0.00 ] >> [622.210s][info][gc,remset ] GC(13) dirty_objs: [ 0.00 0.00 0.00 0.00 0.00 ] >> [622.210s][info][gc,remset ] GC(13) clean_objs: [ 0.00 0.00 0.00 0.00 0.00 ] >> [622.210s][info][gc,remset ] GC(13) dirty_scans: [ 0.00 0.00 0.00 0.00 0.00 ] >> [622.210s][info][gc,remset ] GC(13) clean_scans: [ 0.00 0.00 0.00 0.00 0.00 ] >> [622.210s][info][gc,remset ] GC(13) alternations: [ 0.00 0.00 0.00 0.00 0.00 ] >> [622.210s][info][gc ] GC(13) Concurrent update references 36776.258ms >> ... 
>> (init[627.626s][info][gc,remset ] GC(15) Scan Remembered Set >> [627.626s][info][gc,remset ] GC(15) Worker 0 Card Stats Histo: >> [627.626s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 1.56 100.00 ] >> [627.626s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 4.69 100.00 ] >> [627.626s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 0.00 6.25 32.81 100.00 ] >> [627.626s][info][gc,remset ] GC(15) clean_cards: [ 0.00 48.44 90.62 98.44 100.00 ] >> [627.626s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 0.00 3.12 15.62 100.00 ] >> [627.626s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 23.44 60.94 95.31 100.00 ] >> [627.626s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 0.00 45.90 164.06 1366.00 ] >> [627.626s][info][gc,remset ] GC(15) clean_objs: [ 0.00 11.91 53.91 60.94 63.00 ] >> [627.626s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 0.00 31.84 150.39 1365.00 ] >> [627.626s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 1.00 1.99 11.00 ] >> [627.626s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 1.99 6.00 24.00 ] >> [627.627s][info][gc,remset ] GC(15) Worker 1 Card Stats Histo: >> [627.627s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 3.12 100.00 ] >> [627.627s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 1.56 100.00 ] >> [627.627s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 6.25 99.61 99.61 100.00 ] >> [627.627s][info][gc,remset ] GC(15) clean_cards: [ 0.00 0.00 0.00 70.31 100.00 ] >> [627.627s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 3.12 99.61 99.61 100.00 ] >> [627.627s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 0.00 0.00 53.12 100.00 ] >> [627.627s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 0.00 0.00 0.00 1365.00 ] >> [627.627s][info][gc,remset ] GC(15) clean_objs: [ 0.00 0.00 0.00 40.82 64.00 ] >> [627.627s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 0.00 0.00 0.00 1364.00 ] >> [627.627s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 0.00 0.00 16.00 ] >> [627.627s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 0.00 0.00 33.00 ] >> [627.627s][info][gc,remset ] GC(15) Cumulative stats >> [627.627s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 6.25 100.00 ] >> [627.627s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 1.56 100.00 ] >> [627.627s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 40.62 99.61 99.61 100.00 ] >> [627.627s][info][gc,remset ] GC(15) clean_cards: [ 0.00 0.00 0.00 31.25 100.00 ] >> [627.627s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 23.44 99.61 99.61 100.00 ] >> [627.627s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 0.00 0.00 12.50 100.00 ] >> [627.627s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 0.00 0.00 326.17 1366.00 ] >> [627.627s][info][gc,remset ] GC(15) clean_objs: [ 0.00 0.00 0.00 3.98 64.00 ] >> [627.627s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 0.00 0.00 314.45 1365.00 ] >> [627.627s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 0.00 0.00 16.00 ] >> [627.627s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 0.00 0.00 33.00 ] >> [627.627s][info][gc ] GC(15) Concurrent remembered set scanning 1119.698ms >> ... 
>> [631.875s][info][gc,remset ] GC(15) Update Refs >> [631.875s][info][gc,remset ] GC(15) Worker 0 Card Stats Histo: >> [631.875s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 3.12 100.00 ] >> [631.875s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 3.12 100.00 ] >> [631.875s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 4.69 99.61 99.61 100.00 ] >> [631.875s][info][gc,remset ] GC(15) clean_cards: [ 0.00 0.00 0.00 90.62 100.00 ] >> [631.875s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 3.12 99.61 99.61 100.00 ] >> [631.875s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 0.00 0.00 68.75 100.00 ] >> [631.875s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 0.00 0.00 29.88 1365.00 ] >> [631.875s][info][gc,remset ] GC(15) clean_objs: [ 0.00 0.00 0.00 52.93 64.00 ] >> [631.875s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 0.00 0.00 22.85 1364.00 ] >> [631.875s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 0.00 0.00 11.00 ] >> [631.875s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 0.00 1.99 24.00 ] >> [631.875s][info][gc,remset ] GC(15) Worker 1 Card Stats Histo: >> [631.875s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 26.56 100.00 ] >> [631.875s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 0.00 100.00 ] >> [631.875s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 62.50 99.61 99.61 100.00 ] >> [631.875s][info][gc,remset ] GC(15) clean_cards: [ 0.00 0.00 0.00 0.00 100.00 ] >> [631.875s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 59.38 99.61 99.61 100.00 ] >> [631.875s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 0.00 0.00 0.00 100.00 ] >> [631.875s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 230.47 818.36 871.09 1366.00 ] >> [631.875s][info][gc,remset ] GC(15) clean_objs: [ 0.00 0.00 0.00 0.00 63.00 ] >> [631.875s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 181.64 707.03 796.88 1365.00 ] >> [631.875s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 0.00 0.00 16.00 ] >> [631.875s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 0.00 0.00 33.00 ] >> [631.876s][info][gc,remset ] GC(15) Cumulative stats >> [631.876s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 6.25 100.00 ] >> [631.876s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 1.56 100.00 ] >> [631.876s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 32.81 99.61 99.61 100.00 ] >> [631.876s][info][gc,remset ] GC(15) clean_cards: [ 0.00 0.00 0.00 43.75 100.00 ] >> [631.876s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 15.62 99.61 99.61 100.00 ] >> [631.876s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 0.00 0.00 20.31 100.00 ] >> [631.876s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 0.00 20.90 695.31 1366.00 ] >> [631.876s][info][gc,remset ] GC(15) clean_objs: [ 0.00 0.00 0.00 11.91 64.00 ] >> [631.876s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 0.00 11.91 562.50 1365.00 ] >> [631.876s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 0.00 0.00 16.00 ] >> [631.876s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 0.00 0.00 33.00 ] >> [631.876s][info][gc ] GC(15) Concurrent update references 1953.893ms >> ... >> >> >> The rows represent the metric that's being tracked, and the columns are, respectively, minimum, the 3 quartiles (25%, 50%, 75%) and the maximum. 
The metrics are: >> >> - dirty_run: the length of an uninterrupted run of dirty cards, interpretedas a percentage of a chunk of work assignment (cluster) processed by a thread >> - clean_run: as above, but the length of an uninterrupted run of clean cards >> - dirty_cards, clean_cards: as above, but counts of cards as a percentage of chunk >> - max_dirty_run & max_clean_run: Similarly for the maximum of each. >> - dirty_objs, clean_objs: these are numbers of objects in any chunk walked, or scanned >> - dirty_scans, clean_scans: numbers of objects scanned by the closure >> - alternations: the number of times that we transitioned from clean to dirty or dirty to clean in a chunk >> >> For example, the last cumulative log data (for UR) above indicates that at least 75% of the chunks have no alternations at all, >> and cards are almost always mostly clean for this specific prefix of the run. >> >> Comparing worker stats from worker 0 and worker 1 indicates that in particular scans they may see different distributions of dirty cards for specific benchmarks based on their promotion and mutation behavior. >> >> **Question:** >> Would it make sense to print also, for example, the 1, 10, 90 and 99 percentiles for these metrics as well, in addition to the quartiles, min, and max? > > Y. Srinivas Ramakrishna has updated the pull request incrementally with one additional commit since the last revision: > > Fix some comments based on review feedback. Took care of review feedback. This just needs a sponsor, thanks @kdnilsen / @earthling-amzn ! ------------- PR: https://git.openjdk.org/shenandoah/pull/176 From ysr at openjdk.org Wed Dec 28 00:40:15 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Wed, 28 Dec 2022 00:40:15 GMT Subject: RFR: JDK-8297796 GenShen: instrument the remembered set scan [v15] In-Reply-To: <0kj10C2G8dpG_wSQIV4O-N1Q9mlRNrK8_aTH65jPzHs=.3fd5e140-98a0-4e24-ac88-378f41672395@github.com> References: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com> <0kj10C2G8dpG_wSQIV4O-N1Q9mlRNrK8_aTH65jPzHs=.3fd5e140-98a0-4e24-ac88-378f41672395@github.com> Message-ID: On Tue, 27 Dec 2022 23:13:05 GMT, Y. Srinivas Ramakrishna wrote: >> This would also accelerate the transition to degenerated mode, so we can get ourselves out of the STW pause more quickly. > > When a work item is picked up, it should be completed since it can't currently be placed back on the work list. Hence the code change, moving the check to the end of the loop. My expectation had been that this would fix the issue with prompt cancellation causing crashes previously. However, testing revealed that this was still causing crashes, so I left the cancellation commented out with the intention of following up on this in the fullness of time. > > I'll leave a comment to that effect as you suggest, and mark it with a TODO so it's easily flagged/found. Fixed. ------------- PR: https://git.openjdk.org/shenandoah/pull/176 From ysr at openjdk.org Wed Dec 28 00:40:15 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Wed, 28 Dec 2022 00:40:15 GMT Subject: RFR: JDK-8297796 GenShen: instrument the remembered set scan [v15] In-Reply-To: References: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com> Message-ID: On Tue, 27 Dec 2022 23:22:52 GMT, Y. Srinivas Ramakrishna wrote: >> src/hotspot/share/gc/shenandoah/shenandoahScanRemembered.hpp line 321: >> >>> 319: // 3. 
Non-array objects are precisely dirtied by the interpreter and the compilers >>> 320: // (why? Are offsets of a field in an object that expensive to determine?). >>> 321: // For such objects that extend over multiple cards, or even multiple clusters, >> >> Historically, we borrowed the card-marking barrier from existing generational GC implementations and did not want to burden ourselves with trying to change it. Presumably, experience with other GCs demonstrates that this works "well enough". It would appear that non-array objects are usually not "extremely large". > I realize that my comments in lines 324-328 were aspirational: they described changes I wanted to make but which are not in place today. I'll correct those and a few other such "aspirational" comments that got left behind in the code as I was working on subsequent changes. > > It is true, as you say, that non-array objects are usually not very large (unless they are the result of code generated by frameworks such as protobufs). Fixed. ------------- PR: https://git.openjdk.org/shenandoah/pull/176 From ysr at openjdk.org Wed Dec 28 00:40:15 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Wed, 28 Dec 2022 00:40:15 GMT Subject: RFR: JDK-8297796 GenShen: instrument the remembered set scan [v15] In-Reply-To: References: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com> Message-ID: On Tue, 27 Dec 2022 22:00:32 GMT, Kelvin Nilsen wrote: >> Y. Srinivas Ramakrishna has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 68 commits: >> >> - Merge branch 'master' into JVM-1264 >> - A couple of changes based on review feedback. >> - Reword some code comments for greater clarity. >> - Merge branch 'master' into JVM-1264-dependent >> - Add a previously missed ticket#. Doing it here rather than in parent to >> avoid an otherwise unnecessary re-review touchpoint. >> - Merge branch 'stats_merge' into JVM-1264-dependent >> - Merge branch 'master' into stats_merge >> - jcheck space fix >> - Fix compiler error on windows. >> - Fix some tier1 tests. >> - ... and 58 more: https://git.openjdk.org/shenandoah/compare/d793fd16...4e5ad4ca > > src/hotspot/share/gc/shenandoah/shenandoahScanRemembered.hpp line 323: > >> 321: // For such objects that extend over multiple cards, or even multiple clusters, >> 322: // the entire object is scanned by the worker that processes the (dirty) card on >> 323: // which the object's header lies. However, GC workers then precisley dirty the > > typo: precisely fixed. ------------- PR: https://git.openjdk.org/shenandoah/pull/176 From ysr at openjdk.org Wed Dec 28 00:44:31 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Wed, 28 Dec 2022 00:44:31 GMT Subject: Integrated: JDK-8297796 GenShen: instrument the remembered set scan In-Reply-To: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com> References: <3-iFBSR1DHkrBgskzogR_KdmBvQtPQXb3MiHuqd-y7c=.7ae6200d-ed99-4766-b1a5-e331c4dcbb13@github.com> Message-ID: On Thu, 1 Dec 2022 19:55:45 GMT, Y. Srinivas Ramakrishna wrote: > **Updated 12/21** > > **Summary:** > The main change is card stats collection during remembered set (RS) and update refs (UR) phases when the card-table is scanned. The code is protected by a new non-product-only flag `ShenandoahEnableCardStats`, which is on by default in debug builds and off in the optimized build.
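> As a rough illustration only (the launcher path, the application name, and the generational-mode flag spelling below are assumptions for this sketch, not taken from the patch), a debug-build run that surfaces these stats might look like:
>
>   $DEBUG_JDK/bin/java -XX:+UseShenandoahGC -XX:ShenandoahGCMode=generational \
>     -XX:+ShenandoahEnableCardStats -XX:ShenandoahCardStatsLogInterval=2 \
>     -Xlog:gc+remset=info MyApp
>
> Since the flag is non-product, this assumes a debug build, where it can be set directly.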
> > We tested the impact of the code with the flag enabled in product mode and found the impact to be non-trivial. We might, in the future, enable the code in product mode if performance can be improved. > > Stats are logged per worker thread at the end of each RS and UR scan. These stats are specific to the most recent round of scanning. Global cumulative stats across all threads (but specific to RS or UR) are also maintained, and these are logged at periodic intervals as determined by the setting of `ShenandoahCardStatsLogInterval`. > > **Format of stats produced and how to interpret them: (sample)** > > The following is an example from a slowdebug run with the logging enabled. In this case there are two concurrent GC worker threads, and `ShenandoahCardStatsLogInterval` was set to 2. The first two blocks show per-worker stats for individual scans; the later blocks show per-worker stats for subsequent scans, each followed by cumulative stats for that type of scan (RS or UR) across all workers and scans of that type. > > > [560.766s][info][gc,remset ] GC(13) Scan Remembered Set > [560.766s][info][gc,remset ] GC(13) Worker 0 Card Stats Histo: > [560.766s][info][gc,remset ] GC(13) dirty_run: [ 0.00 0.00 0.00 99.61 100.00 ] > [560.766s][info][gc,remset ] GC(13) clean_run: [ 0.00 0.00 0.00 0.00 53.12 ] > [560.766s][info][gc,remset ] GC(13) dirty_cards: [ 0.00 99.61 99.61 99.61 100.00 ] > [560.766s][info][gc,remset ] GC(13) clean_cards: [ 0.00 0.00 0.00 0.00 53.12 ] > [560.766s][info][gc,remset ] GC(13) max_dirty_run: [ 0.00 99.61 99.61 99.61 100.00 ] > [560.766s][info][gc,remset ] GC(13) max_clean_run: [ 0.00 0.00 0.00 0.00 53.12 ] > [560.766s][info][gc,remset ] GC(13) dirty_objs: [ 0.00 0.00 0.00 818.36 1366.00 ] > [560.766s][info][gc,remset ] GC(13) clean_objs: [ 0.00 0.00 0.00 0.00 8.00 ] > [560.766s][info][gc,remset ] GC(13) dirty_scans: [ 0.00 0.00 0.00 705.08 1365.00 ] > [560.766s][info][gc,remset ] GC(13) clean_scans: [ 0.00 0.00 0.00 0.00 0.00 ] > [560.766s][info][gc,remset ] GC(13) alternations: [ 0.00 0.00 0.00 0.00 16.00 ] > [560.766s][info][gc,remset ] GC(13) Worker 1 Card Stats Histo: > [560.766s][info][gc,remset ] GC(13) dirty_run: [ 0.00 0.00 0.00 96.88 100.00 ] > [560.766s][info][gc,remset ] GC(13) clean_run: [ 0.00 0.00 0.00 0.00 46.88 ] > [560.766s][info][gc,remset ] GC(13) dirty_cards: [ 18.75 82.81 98.44 99.61 100.00 ] > [560.766s][info][gc,remset ] GC(13) clean_cards: [ 0.00 0.00 0.00 0.00 46.88 ] > [560.766s][info][gc,remset ] GC(13) max_dirty_run: [ 18.75 82.81 98.44 99.61 100.00 ] > [560.766s][info][gc,remset ] GC(13) max_clean_run: [ 0.00 0.00 0.00 0.00 46.88 ] > [560.766s][info][gc,remset ] GC(13) dirty_objs: [ 0.00 269.53 353.52 814.45 1366.00 ] > [560.766s][info][gc,remset ] GC(13) clean_objs: [ 0.00 0.00 0.00 0.00 3.00 ] > [560.766s][info][gc,remset ] GC(13) dirty_scans: [ 0.00 263.67 351.56 671.88 1365.00 ] > [560.766s][info][gc,remset ] GC(13) clean_scans: [ 0.00 0.00 0.00 0.00 0.00 ] > [560.766s][info][gc,remset ] GC(13) alternations: [ 0.00 0.00 0.00 0.00 3.00 ] > [560.766s][info][gc ] GC(13) Concurrent remembered set scanning 1150.359ms > [560.766s][info][gc,start ] GC(13) Concurrent marking roots > ...
> [585.433s][info][gc ] GC(13) Concurrent evacuation 6225.829ms > [585.433s][info][gc,start ] GC(13) Pause Init Update Refs > [585.434s][info][gc ] GC(13) Pause Init Update Refs 0.264ms > [585.434s][info][gc,start ] GC(13) Concurrent update references > [585.434s][info][gc,task ] GC(13) Using 2 of 4 workers for concurrent reference update > [585.567s][info][gc ] Average MMU = 2.925 > [590.583s][info][gc ] Average MMU = 1.509 > [595.600s][info][gc ] Average MMU = 0.835 > [600.618s][info][gc ] Average MMU = 0.447 > [605.635s][info][gc ] Average MMU = 0.253 > [610.651s][info][gc ] Average MMU = 0.114 > [615.669s][info][gc ] Average MMU = 0.130 > [620.686s][info][gc ] Average MMU = 0.129 > [622.209s][info][gc,remset ] GC(13) Update Refs > [622.209s][info][gc,remset ] GC(13) Worker 0 Card Stats Histo: > [622.209s][info][gc,remset ] GC(13) dirty_run: [ 0.00 0.00 0.00 3.12 100.00 ] > [622.209s][info][gc,remset ] GC(13) clean_run: [ 0.00 0.00 0.00 3.12 100.00 ] > [622.209s][info][gc,remset ] GC(13) dirty_cards: [ 0.00 3.12 50.00 99.61 100.00 ] > [622.209s][info][gc,remset ] GC(13) clean_cards: [ 0.00 0.00 26.56 92.19 100.00 ] > [622.209s][info][gc,remset ] GC(13) max_dirty_run: [ 0.00 1.56 29.69 99.61 100.00 ] > [622.209s][info][gc,remset ] GC(13) max_clean_run: [ 0.00 0.00 9.38 70.31 100.00 ] > [622.209s][info][gc,remset ] GC(13) dirty_objs: [ 0.00 0.00 0.00 50.00 1366.00 ] > [622.209s][info][gc,remset ] GC(13) clean_objs: [ 0.00 0.00 3.98 54.88 64.00 ] > [622.209s][info][gc,remset ] GC(13) dirty_scans: [ 0.00 0.00 0.00 33.98 1365.00 ] > [622.209s][info][gc,remset ] GC(13) clean_scans: [ 0.00 0.00 0.00 1.00 16.00 ] > [622.209s][info][gc,remset ] GC(13) alternations: [ 0.00 0.00 0.00 2.99 33.00 ] > [622.209s][info][gc,remset ] GC(13) Worker 1 Card Stats Histo: > [622.209s][info][gc,remset ] GC(13) dirty_run: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.209s][info][gc,remset ] GC(13) clean_run: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc,remset ] GC(13) dirty_cards: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc,remset ] GC(13) clean_cards: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc,remset ] GC(13) max_dirty_run: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc,remset ] GC(13) max_clean_run: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc,remset ] GC(13) dirty_objs: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc,remset ] GC(13) clean_objs: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc,remset ] GC(13) dirty_scans: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc,remset ] GC(13) clean_scans: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc,remset ] GC(13) alternations: [ 0.00 0.00 0.00 0.00 0.00 ] > [622.210s][info][gc ] GC(13) Concurrent update references 36776.258ms > ... 
> (init[627.626s][info][gc,remset ] GC(15) Scan Remembered Set > [627.626s][info][gc,remset ] GC(15) Worker 0 Card Stats Histo: > [627.626s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 1.56 100.00 ] > [627.626s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 4.69 100.00 ] > [627.626s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 0.00 6.25 32.81 100.00 ] > [627.626s][info][gc,remset ] GC(15) clean_cards: [ 0.00 48.44 90.62 98.44 100.00 ] > [627.626s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 0.00 3.12 15.62 100.00 ] > [627.626s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 23.44 60.94 95.31 100.00 ] > [627.626s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 0.00 45.90 164.06 1366.00 ] > [627.626s][info][gc,remset ] GC(15) clean_objs: [ 0.00 11.91 53.91 60.94 63.00 ] > [627.626s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 0.00 31.84 150.39 1365.00 ] > [627.626s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 1.00 1.99 11.00 ] > [627.626s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 1.99 6.00 24.00 ] > [627.627s][info][gc,remset ] GC(15) Worker 1 Card Stats Histo: > [627.627s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 3.12 100.00 ] > [627.627s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 1.56 100.00 ] > [627.627s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 6.25 99.61 99.61 100.00 ] > [627.627s][info][gc,remset ] GC(15) clean_cards: [ 0.00 0.00 0.00 70.31 100.00 ] > [627.627s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 3.12 99.61 99.61 100.00 ] > [627.627s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 0.00 0.00 53.12 100.00 ] > [627.627s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 0.00 0.00 0.00 1365.00 ] > [627.627s][info][gc,remset ] GC(15) clean_objs: [ 0.00 0.00 0.00 40.82 64.00 ] > [627.627s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 0.00 0.00 0.00 1364.00 ] > [627.627s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 0.00 0.00 16.00 ] > [627.627s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 0.00 0.00 33.00 ] > [627.627s][info][gc,remset ] GC(15) Cumulative stats > [627.627s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 6.25 100.00 ] > [627.627s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 1.56 100.00 ] > [627.627s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 40.62 99.61 99.61 100.00 ] > [627.627s][info][gc,remset ] GC(15) clean_cards: [ 0.00 0.00 0.00 31.25 100.00 ] > [627.627s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 23.44 99.61 99.61 100.00 ] > [627.627s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 0.00 0.00 12.50 100.00 ] > [627.627s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 0.00 0.00 326.17 1366.00 ] > [627.627s][info][gc,remset ] GC(15) clean_objs: [ 0.00 0.00 0.00 3.98 64.00 ] > [627.627s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 0.00 0.00 314.45 1365.00 ] > [627.627s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 0.00 0.00 16.00 ] > [627.627s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 0.00 0.00 33.00 ] > [627.627s][info][gc ] GC(15) Concurrent remembered set scanning 1119.698ms > ... 
> [631.875s][info][gc,remset ] GC(15) Update Refs > [631.875s][info][gc,remset ] GC(15) Worker 0 Card Stats Histo: > [631.875s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 3.12 100.00 ] > [631.875s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 3.12 100.00 ] > [631.875s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 4.69 99.61 99.61 100.00 ] > [631.875s][info][gc,remset ] GC(15) clean_cards: [ 0.00 0.00 0.00 90.62 100.00 ] > [631.875s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 3.12 99.61 99.61 100.00 ] > [631.875s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 0.00 0.00 68.75 100.00 ] > [631.875s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 0.00 0.00 29.88 1365.00 ] > [631.875s][info][gc,remset ] GC(15) clean_objs: [ 0.00 0.00 0.00 52.93 64.00 ] > [631.875s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 0.00 0.00 22.85 1364.00 ] > [631.875s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 0.00 0.00 11.00 ] > [631.875s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 0.00 1.99 24.00 ] > [631.875s][info][gc,remset ] GC(15) Worker 1 Card Stats Histo: > [631.875s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 26.56 100.00 ] > [631.875s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 0.00 100.00 ] > [631.875s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 62.50 99.61 99.61 100.00 ] > [631.875s][info][gc,remset ] GC(15) clean_cards: [ 0.00 0.00 0.00 0.00 100.00 ] > [631.875s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 59.38 99.61 99.61 100.00 ] > [631.875s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 0.00 0.00 0.00 100.00 ] > [631.875s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 230.47 818.36 871.09 1366.00 ] > [631.875s][info][gc,remset ] GC(15) clean_objs: [ 0.00 0.00 0.00 0.00 63.00 ] > [631.875s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 181.64 707.03 796.88 1365.00 ] > [631.875s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 0.00 0.00 16.00 ] > [631.875s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 0.00 0.00 33.00 ] > [631.876s][info][gc,remset ] GC(15) Cumulative stats > [631.876s][info][gc,remset ] GC(15) dirty_run: [ 0.00 0.00 0.00 6.25 100.00 ] > [631.876s][info][gc,remset ] GC(15) clean_run: [ 0.00 0.00 0.00 1.56 100.00 ] > [631.876s][info][gc,remset ] GC(15) dirty_cards: [ 0.00 32.81 99.61 99.61 100.00 ] > [631.876s][info][gc,remset ] GC(15) clean_cards: [ 0.00 0.00 0.00 43.75 100.00 ] > [631.876s][info][gc,remset ] GC(15) max_dirty_run: [ 0.00 15.62 99.61 99.61 100.00 ] > [631.876s][info][gc,remset ] GC(15) max_clean_run: [ 0.00 0.00 0.00 20.31 100.00 ] > [631.876s][info][gc,remset ] GC(15) dirty_objs: [ 0.00 0.00 20.90 695.31 1366.00 ] > [631.876s][info][gc,remset ] GC(15) clean_objs: [ 0.00 0.00 0.00 11.91 64.00 ] > [631.876s][info][gc,remset ] GC(15) dirty_scans: [ 0.00 0.00 11.91 562.50 1365.00 ] > [631.876s][info][gc,remset ] GC(15) clean_scans: [ 0.00 0.00 0.00 0.00 16.00 ] > [631.876s][info][gc,remset ] GC(15) alternations: [ 0.00 0.00 0.00 0.00 33.00 ] > [631.876s][info][gc ] GC(15) Concurrent update references 1953.893ms > ... > > > The rows represent the metric that's being tracked, and the columns are, respectively, minimum, the 3 quartiles (25%, 50%, 75%) and the maximum. 
The metrics are: > > - dirty_run: the length of an uninterrupted run of dirty cards, interpreted as a percentage of a chunk of work assignment (cluster) processed by a thread > - clean_run: as above, but the length of an uninterrupted run of clean cards > - dirty_cards, clean_cards: as above, but counts of cards as a percentage of the chunk > - max_dirty_run & max_clean_run: as above, but the maximum such run in a chunk > - dirty_objs, clean_objs: the number of objects walked in a chunk > - dirty_scans, clean_scans: the number of objects scanned by the closure > - alternations: the number of times that we transitioned from clean to dirty or dirty to clean in a chunk > > For example, the last cumulative log data (for UR) above indicates that at least 75% of the chunks have no alternations at all, > and cards are almost always mostly clean for this specific prefix of the run. > > Comparing the stats for worker 0 and worker 1 indicates that, in a given scan, workers may see different distributions of dirty cards, depending on a benchmark's promotion and mutation behavior. > > **Question:** > Would it make sense to also print, for example, the 1, 10, 90 and 99 percentiles for these metrics, in addition to the quartiles, min, and max? This pull request has now been integrated. Changeset: 6c8fa0f7 Author: Y. Srinivas Ramakrishna Committer: Kelvin Nilsen URL: https://git.openjdk.org/shenandoah/commit/6c8fa0f735bbc9f80f628145867db8bed5d074f4 Stats: 867 lines in 9 files changed: 497 ins; 209 del; 161 mod 8297796: GenShen: instrument the remembered set scan Reviewed-by: wkemper, kdnilsen ------------- PR: https://git.openjdk.org/shenandoah/pull/176 From wkemper at openjdk.org Thu Dec 29 23:12:57 2022 From: wkemper at openjdk.org (William Kemper) Date: Thu, 29 Dec 2022 23:12:57 GMT Subject: RFR: Merge openjdk/jdk:master Message-ID: This merges tag jdk-21+3 ------------- Commit messages: - Use new constant for two operand lir form - Merge tag 'jdk-21+3' into merge-jdk-21-3 - 8299061: Using lambda to optimize GraphKit::compute_stack_effects() - 8269736: Optimize CDS PatchEmbeddedPointers::do_bit() - 8297724: Loop strip mining prevents some empty loops from being eliminated - 8299015: Ensure that HttpResponse.BodySubscribers.ofFile writes all bytes - 8296275: Write a test to verify setAccelerator method of JMenuItem - 8297682: Use Collections.emptyIterator where applicable - 8299025: BMPImageReader.java readColorPalette could use staggeredReadByteStream - 8299146: No copyright statement on ArtifactResolverException.java - ...
and 100 more: https://git.openjdk.org/shenandoah/compare/6c8fa0f7...1acda05b The webrevs contain the adjustments done while merging with regard to each parent branch: - master: https://webrevs.openjdk.org/?repo=shenandoah&pr=189&range=00.0 - openjdk/jdk:master: https://webrevs.openjdk.org/?repo=shenandoah&pr=189&range=00.1 Changes: https://git.openjdk.org/shenandoah/pull/189/files Stats: 7642 lines in 349 files changed: 4348 ins; 1449 del; 1845 mod Patch: https://git.openjdk.org/shenandoah/pull/189.diff Fetch: git fetch https://git.openjdk.org/shenandoah pull/189/head:pull/189 PR: https://git.openjdk.org/shenandoah/pull/189 From wkemper at openjdk.org Thu Dec 29 23:21:22 2022 From: wkemper at openjdk.org (William Kemper) Date: Thu, 29 Dec 2022 23:21:22 GMT Subject: Integrated: Merge openjdk/jdk:master In-Reply-To: References: Message-ID: On Thu, 29 Dec 2022 23:05:57 GMT, William Kemper wrote: > This merges tag jdk-21+3 This pull request has now been integrated. Changeset: 301d8226 Author: William Kemper URL: https://git.openjdk.org/shenandoah/commit/301d822681afd71e5f817f23aaee30e81200d9ee Stats: 7642 lines in 349 files changed: 4348 ins; 1449 del; 1845 mod Merge openjdk/jdk:master ------------- PR: https://git.openjdk.org/shenandoah/pull/189 From wkemper at openjdk.org Fri Dec 30 00:18:55 2022 From: wkemper at openjdk.org (William Kemper) Date: Fri, 30 Dec 2022 00:18:55 GMT Subject: RFR: Allow heuristic trigger to increase capacity instead of running a collection Message-ID: Before the adaptive heuristic starts a collection, it will attempt to increase the capacity of its generation. If the capacity is increased, the heuristic will re-evaluate the trigger criteria. There is also a change here to attempt to increase the size of the old generation in response to a promotion failure. (See the sketch at the end of this digest.) ------------- Commit messages: - Remove trailing whitespace - Centralize resetting gc learning count, increase old for first promotion failure. - Use consistent assertions for locking or safepoint - WIP: Allow heuristics to resize generation instead of collecting - Capacity changes should also apply to adjusted capacity Changes: https://git.openjdk.org/shenandoah/pull/190/files Webrev: https://webrevs.openjdk.org/?repo=shenandoah&pr=190&range=00 Stats: 97 lines in 10 files changed: 73 ins; 9 del; 15 mod Patch: https://git.openjdk.org/shenandoah/pull/190.diff Fetch: git fetch https://git.openjdk.org/shenandoah pull/190/head:pull/190 PR: https://git.openjdk.org/shenandoah/pull/190 From jamil.j.nimeh at oracle.com Mon Dec 19 21:51:58 2022 From: jamil.j.nimeh at oracle.com (Jamil Nimeh) Date: Mon, 19 Dec 2022 21:51:58 -0000 Subject: Calls array and intrinsic stub routines in shenandoahSupport.cpp Message-ID: <98fa25ef-ddf9-20bf-69f9-90f7d54f9588@oracle.com> Hello all, Volodymyr and I have implemented some new intrinsics for JDK 20 (see OpenJDK PRs https://github.com/openjdk/jdk/pull/7702 and https://github.com/openjdk/jdk/pull/10582). Volodymyr recently came across the calls[] array in shenandoahSupport.cpp and was wondering whether our new intrinsics need to be added to this array, and what the impact of having them there (or not) would be. I'm not on this list (and I'm guessing Volodymyr is not either), so if you could please include us directly in the reply, that would be helpful. Thanks, --Jamil
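A minimal sketch of the trigger flow described in the "Allow heuristic trigger to increase capacity" RFR above, in HotSpot-style C++. All names here (AdaptiveHeuristics, trigger_criteria_met, try_expand_generation, should_start_gc) are invented for illustration and are not the names used in the actual patch:

  // Hedged sketch: try to grow the generation before committing to a cycle.
  struct AdaptiveHeuristics {
    bool trigger_criteria_met();   // e.g., allocation pressure vs. available memory
    bool try_expand_generation();  // returns true if this generation's capacity grew

    bool should_start_gc() {
      if (!trigger_criteria_met()) {
        return false;              // no pressure: keep mutators running
      }
      // Before starting a collection, first attempt to increase capacity.
      if (try_expand_generation()) {
        // Capacity changed: re-evaluate the trigger against the new capacity.
        return trigger_criteria_met();
      }
      return true;                 // could not expand: start the collection
    }
  };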