From rkennke at openjdk.org Wed Oct 5 11:44:43 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Wed, 5 Oct 2022 11:44:43 GMT Subject: [master] RFR: Replace stack-locking with non-racy fast-locking [v2] In-Reply-To: References: <9py6Ws-DEAgQ6OUKvSqgm3KVcbThu1DOAtTXgV8rjC0=.b8a0ec8b-5183-4294-a958-8ac6922b724d@github.com> Message-ID: On Tue, 16 Aug 2022 13:49:25 GMT, Roman Kennke wrote: >> This PR isolates the change to replace stack-locking with fast-locking (see #51) from the additional improvements to load-Klass* and hashcode and a few other minor places. It is derived from the proposed upstream change (https://github.com/openjdk/jdk/pull/9680) and has undergone a lot more additional testing. >> >> This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turns out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. >> >> What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock. >> >> This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oop that is associated with each thread. 
Experience shows that this array typically remains very small (3-5 elements). Using this lock stack, it is possible to query which thread owns which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are performed only outside performance-critical paths (usually in code like JVMTI or deadlock detection), where it is ok to do more complex operations. >> >> In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When recursive locking occurs, the fast-lock gets inflated to a full monitor. It is not clear if it is worth adding support for recursive fast-locking. >> >> One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, thereby handing over to the contending thread. >> >> As an alternative, I considered removing stack-locking altogether, and only using heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. But alas, such code exists, and we probably don't want to punish it if we can avoid it. 
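A minimal sketch of the lock-stack idea described above, in C++. This is not the HotSpot code from the PR: `Object`, `LockStack`, `fast_lock` and `fast_unlock` are hypothetical names, the lock stack is a plain vector, locking is assumed to be properly nested, and inflation to a full monitor (including the ANONYMOUS_OWNER hand-over) is left out. It only shows the two header bits being CASed to 00 while ownership is recorded in a per-thread array instead of in the header.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Low two header bits in this sketch: 01 = unlocked, 00 = fast-locked.
constexpr uintptr_t kLockMask   = 0x3;
constexpr uintptr_t kUnlocked   = 0x1;
constexpr uintptr_t kFastLocked = 0x0;

// Hypothetical stand-in for a Java object; only the header word matters here.
struct Object {
    std::atomic<uintptr_t> header{kUnlocked};
};

// Hypothetical per-thread lock stack: a small array of object references.
struct LockStack {
    std::vector<Object*> entries;

    // "Does the current thread own me?" is a linear scan of a tiny array.
    bool contains(const Object* o) const {
        for (const Object* e : entries) {
            if (e == o) return true;
        }
        return false;
    }
};

// Returns true if the fast path succeeded. Returns false when the object is
// already locked (contended or recursive); a real VM would inflate then.
bool fast_lock(Object& obj, LockStack& ls) {
    uintptr_t old_header = obj.header.load();
    if ((old_header & kLockMask) != kUnlocked) return false;
    uintptr_t new_header = (old_header & ~kLockMask) | kFastLocked;
    if (!obj.header.compare_exchange_strong(old_header, new_header)) return false;
    ls.entries.push_back(&obj);  // ownership lives in the lock stack, not the header
    return true;
}

// Returns true if the fast path succeeded. Returns false if the header no
// longer says "fast-locked" (e.g. another thread inflated the lock meanwhile).
bool fast_unlock(Object& obj, LockStack& ls) {
    assert(!ls.entries.empty() && ls.entries.back() == &obj);  // nested locking assumed
    ls.entries.pop_back();
    uintptr_t expected = obj.header.load();
    if ((expected & kLockMask) != kFastLocked) return false;
    return obj.header.compare_exchange_strong(expected, (expected & ~kLockMask) | kUnlocked);
}
```

In this model a second `fast_lock` on the same object by the same thread fails, which corresponds to the recursive case that gets inflated in the description above.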
> > Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: > > Add testcase to verify hand-over-hand-locking Closing in favour of eventual merging from upstream. ------------- PR: https://git.openjdk.org/lilliput/pull/54 From rkennke at openjdk.org Wed Oct 5 11:44:44 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Wed, 5 Oct 2022 11:44:44 GMT Subject: [master] Withdrawn: Replace stack-locking with non-racy fast-locking In-Reply-To: <9py6Ws-DEAgQ6OUKvSqgm3KVcbThu1DOAtTXgV8rjC0=.b8a0ec8b-5183-4294-a958-8ac6922b724d@github.com> References: <9py6Ws-DEAgQ6OUKvSqgm3KVcbThu1DOAtTXgV8rjC0=.b8a0ec8b-5183-4294-a958-8ac6922b724d@github.com> Message-ID: On Mon, 15 Aug 2022 19:14:29 GMT, Roman Kennke wrote: > This PR isolates the change to replace stack-locking with fast-locking (see #51) from the additional improvements to load-Klass* and hashcode and a few other minor places. It is derived from the proposed upstream change (https://github.com/openjdk/jdk/pull/9680) and has undergone a lot more additional testing. > > This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turns out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. > > What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock. 
> > This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oop that is associated with each thread. Experience shows that this array typically remains very small (3-5 elements). Using this lock stack, it is possible to query which thread owns which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are performed only outside performance-critical paths (usually in code like JVMTI or deadlock detection), where it is ok to do more complex operations. > > In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When recursive locking occurs, the fast-lock gets inflated to a full monitor. It is not clear if it is worth adding support for recursive fast-locking. > > One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, thereby handing over to the contending thread. > > As an alternative, I considered removing stack-locking altogether, and only using heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. 
All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. But alas, such code exists, and we probably don't want to punish it if we can avoid it. This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/lilliput/pull/54 From rkennke at openjdk.org Thu Oct 6 06:43:26 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 6 Oct 2022 06:43:26 GMT Subject: [master] RFR: Merge jdk Message-ID: <2dWPq1zbzVmx869NcO5H4HV1Hz866vawLkhRvcn8cDo=.7ac2db5b-499d-4989-8889-a937c2b55592@github.com> I'd like to merge from upstream at tag jdk-20+17 (latest tag). Testing: - [x] tier1 (x86_64, x86_32, aarch64) - [x] tier2 (x86_64, x86_32, aarch64) ------------- Commit messages: - Merge tag 'jdk-20+17' into merge-jdk-20+17 - 8293613: need to properly handle and hide tmp VTMS transitions - 8290920: sspi_bridge.dll not built if BUILD_CRYPTO is false - 8294430: RISC-V: Small refactoring for movptr_with_offset - 8292158: AES-CTR cipher state corruption with AVX-512 - 8290482: Update JNI Specification of DestroyJavaVM for better alignment with JLS, JVMS, and Java SE API Specifications - 8294483: Remove vmTestbase/nsk/jvmti/GetThreadState tests. - 8293143: Workaround for JDK-8292217 when doing "step over" of bytecode with unresolved cp reference - 8294471: SpecTaglet is inconsistent with SpecTree for inline property - 8293592: Remove JVM_StopThread, stillborn, and related cleanup - ... 
and 702 more: https://git.openjdk.org/lilliput/compare/ceeaefe3...3c14d484 The webrevs contain the adjustments done while merging with regards to each parent branch: - master: https://webrevs.openjdk.org/?repo=lilliput&pr=55&range=00.0 - jdk: https://webrevs.openjdk.org/?repo=lilliput&pr=55&range=00.1 Changes: https://git.openjdk.org/lilliput/pull/55/files Stats: 179288 lines in 3090 files changed: 85536 ins; 74846 del; 18906 mod Patch: https://git.openjdk.org/lilliput/pull/55.diff Fetch: git fetch https://git.openjdk.org/lilliput pull/55/head:pull/55 PR: https://git.openjdk.org/lilliput/pull/55 From stuefe at openjdk.org Thu Oct 6 07:55:46 2022 From: stuefe at openjdk.org (Thomas Stuefe) Date: Thu, 6 Oct 2022 07:55:46 GMT Subject: [master] RFR: Merge jdk In-Reply-To: <2dWPq1zbzVmx869NcO5H4HV1Hz866vawLkhRvcn8cDo=.7ac2db5b-499d-4989-8889-a937c2b55592@github.com> References: <2dWPq1zbzVmx869NcO5H4HV1Hz866vawLkhRvcn8cDo=.7ac2db5b-499d-4989-8889-a937c2b55592@github.com> Message-ID: On Wed, 5 Oct 2022 19:26:42 GMT, Roman Kennke wrote: > I'd like to merge from upstream at tag jdk-20+17 (latest tag). > > Testing: > - [x] tier1 (x86_64, x86_32, aarch64) > - [x] tier2 (x86_64, x86_32, aarch64) cursory glance: ok ------------- Marked as reviewed by stuefe (Committer). PR: https://git.openjdk.org/lilliput/pull/55 From rkennke at openjdk.org Thu Oct 6 10:32:24 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 6 Oct 2022 10:32:24 GMT Subject: [master] RFR: Merge jdk [v2] In-Reply-To: <2dWPq1zbzVmx869NcO5H4HV1Hz866vawLkhRvcn8cDo=.7ac2db5b-499d-4989-8889-a937c2b55592@github.com> References: <2dWPq1zbzVmx869NcO5H4HV1Hz866vawLkhRvcn8cDo=.7ac2db5b-499d-4989-8889-a937c2b55592@github.com> Message-ID: > I'd like to merge from upstream at tag jdk-20+17 (latest tag). > > Testing: > - [x] tier1 (x86_64, x86_32, aarch64) > - [x] tier2 (x86_64, x86_32, aarch64) Roman Kennke has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 66 commits: - Merge tag 'jdk-20+17' into merge-jdk-20+17 Added tag jdk-20+17 for changeset 79ccc791 - Merge jdk Reviewed-by: stuefe - Implement Shenandoah support Reviewed-by: zgu - Merge tag 'jdk-19+20' into merge-jdk-19+20 Added tag jdk-19+20 for changeset 16a8ebbf - Relax array elements alignment Reviewed-by: stuefe - Merge remote-tracking branch 'jdk-upstream/master' - Merge remote-tracking branch 'jdk-upstream/master' - Simplify stable_mark() routine Reviewed-by: zgu, stuefe - Merge remote-tracking branch 'jdk-upstream/master' - Merge tag 'jdk-19+17' Added tag jdk-19+17 for changeset dd4a1bba - ... and 56 more: https://git.openjdk.org/lilliput/compare/79ccc791...3c14d484 ------------- Changes: https://git.openjdk.org/lilliput/pull/55/files Webrev: https://webrevs.openjdk.org/?repo=lilliput&pr=55&range=01 Stats: 6138 lines in 209 files changed: 4813 ins; 772 del; 553 mod Patch: https://git.openjdk.org/lilliput/pull/55.diff Fetch: git fetch https://git.openjdk.org/lilliput pull/55/head:pull/55 PR: https://git.openjdk.org/lilliput/pull/55 From rkennke at openjdk.org Thu Oct 6 10:32:24 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 6 Oct 2022 10:32:24 GMT Subject: [master] RFR: Merge jdk In-Reply-To: <2dWPq1zbzVmx869NcO5H4HV1Hz866vawLkhRvcn8cDo=.7ac2db5b-499d-4989-8889-a937c2b55592@github.com> References: <2dWPq1zbzVmx869NcO5H4HV1Hz866vawLkhRvcn8cDo=.7ac2db5b-499d-4989-8889-a937c2b55592@github.com> Message-ID: On Wed, 5 Oct 2022 19:26:42 GMT, Roman Kennke wrote: > I'd like to merge from upstream at tag jdk-20+17 (latest tag). > > Testing: > - [x] tier1 (x86_64, x86_32, aarch64) > - [x] tier2 (x86_64, x86_32, aarch64) Thanks, Thomas! 
------------- PR: https://git.openjdk.org/lilliput/pull/55 From rkennke at openjdk.org Thu Oct 6 10:33:58 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 6 Oct 2022 10:33:58 GMT Subject: [master] Integrated: Merge jdk In-Reply-To: <2dWPq1zbzVmx869NcO5H4HV1Hz866vawLkhRvcn8cDo=.7ac2db5b-499d-4989-8889-a937c2b55592@github.com> References: <2dWPq1zbzVmx869NcO5H4HV1Hz866vawLkhRvcn8cDo=.7ac2db5b-499d-4989-8889-a937c2b55592@github.com> Message-ID: On Wed, 5 Oct 2022 19:26:42 GMT, Roman Kennke wrote: > I'd like to merge from upstream at tag jdk-20+17 (latest tag). > > Testing: > - [x] tier1 (x86_64, x86_32, aarch64) > - [x] tier2 (x86_64, x86_32, aarch64) This pull request has now been integrated. Changeset: b61bf6aa Author: Roman Kennke URL: https://git.openjdk.org/lilliput/commit/b61bf6aaf11107b9f6f9a7a87f96e9c0738b1f51 Stats: 179288 lines in 3090 files changed: 85536 ins; 74846 del; 18906 mod Merge jdk Reviewed-by: stuefe ------------- PR: https://git.openjdk.org/lilliput/pull/55 From rkennke at openjdk.org Thu Oct 6 19:20:44 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 6 Oct 2022 19:20:44 GMT Subject: [master] RFR: Replace stack-locking with fast-locking Message-ID: I'd like to integrate the changes from the corresponding upstream PR into Lilliput upfront. This allows to get some testing in context of Lilliput, and also allows to work on follow-up changes, e.g. eliminate a lot of the load-klass stuff, improve hash-code, and simplification in GCs (especially Shenandoah and ZGC), as well as implement/enable some features in serviceability that haven't been possible because SA couldn't safely access object's Klass*. See upstream PR for description: https://github.com/openjdk/jdk/pull/10590 This PR is a 1:1 mirror of the upstream change, with some additional changes that touch Lilliput parts (e.g. synchronizer.cpp and Shenandoah). The upstream PR has received an enormous amount of testing, perf testing, discussion and design reviews, etc. 
Ok to go into Lilliput? Thanks, Roman ------------- Commit messages: - Required changes in Lilliput and Shenandoah code after merging fast-locking - Merge remote-tracking branch 'jdk-rkennke/fast-locking' into fast-locking3 - Merge tag 'jdk-20+17' into fast-locking - Fix OSR packing in AArch64, part 2 - Fix OSR packing in AArch64 - Merge remote-tracking branch 'upstream/master' into fast-locking - Fix register in interpreter unlock x86_32 - Support unstructured locking in interpreter (x86 parts) - Support unstructured locking in interpreter (aarch64 and shared parts) - Merge branch 'master' into fast-locking - ... and 19 more: https://git.openjdk.org/lilliput/compare/b61bf6aa...5aadb9c5 Changes: https://git.openjdk.org/lilliput/pull/56/files Webrev: https://webrevs.openjdk.org/?repo=lilliput&pr=56&range=00 Stats: 3834 lines in 130 files changed: 660 ins; 2625 del; 549 mod Patch: https://git.openjdk.org/lilliput/pull/56.diff Fetch: git fetch https://git.openjdk.org/lilliput pull/56/head:pull/56 PR: https://git.openjdk.org/lilliput/pull/56 From shade at openjdk.org Tue Oct 11 11:32:10 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 11 Oct 2022 11:32:10 GMT Subject: [master] RFR: Replace stack-locking with fast-locking In-Reply-To: References: Message-ID: On Thu, 6 Oct 2022 15:00:24 GMT, Roman Kennke wrote: > I'd like to integrate the changes from the corresponding upstream PR into Lilliput upfront. This allows to get some testing in context of Lilliput, and also allows to work on follow-up changes, e.g. eliminate a lot of the load-klass stuff, improve hash-code, and simplification in GCs (especially Shenandoah and ZGC), as well as implement/enable some features in serviceability that haven't been possible because SA couldn't safely access object's Klass*. > > See upstream PR for description: > https://github.com/openjdk/jdk/pull/10590 > > This PR is a 1:1 mirror of the upstream change, with some additional changes that touch Lilliput parts (e.g. 
synchronizer.cpp and Shenandoah). The upstream PR has received an enormous amount of testing, perf testing, discussion and design reviews, etc. > > Ok to go into Lilliput? > > Thanks, > Roman OK for Lilliput. ------------- Marked as reviewed by shade (Committer). PR: https://git.openjdk.org/lilliput/pull/56 From rkennke at openjdk.org Thu Oct 13 11:14:01 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 13 Oct 2022 11:14:01 GMT Subject: [master] RFR: Replace stack-locking with fast-locking [v2] In-Reply-To: References: Message-ID: <-keJQ-r27e1ES0Axe53_U4tynp7CsPPdO04iNxPETmo=.02ef55c0-7a6f-4ecd-8de6-a525eb0f5c63@github.com> > I'd like to integrate the changes from the corresponding upstream PR into Lilliput upfront. This allows to get some testing in context of Lilliput, and also allows to work on follow-up changes, e.g. eliminate a lot of the load-klass stuff, improve hash-code, and simplification in GCs (especially Shenandoah and ZGC), as well as implement/enable some features in serviceability that haven't been possible because SA couldn't safely access object's Klass*. > > See upstream PR for description: > https://github.com/openjdk/jdk/pull/10590 > > This PR is a 1:1 mirror of the upstream change, with some additional changes that touch Lilliput parts (e.g. synchronizer.cpp and Shenandoah). The upstream PR has received an enormous amount of testing, perf testing, discussion and design reviews, etc. > > Ok to go into Lilliput? > > Thanks, > Roman Roman Kennke has updated the pull request incrementally with seven additional commits since the last revision: - Merge remote-tracking branch 'jdk-rkennke/fast-locking' into fast-locking3 - Merge remote-tracking branch 'origin/fast-locking' into fast-locking - RISC-V port - Revert "Re-use r0 in call to unlock_object()" This reverts commit ebbcb615a788998596f403b47b72cf133cb9de46. 
- Merge remote-tracking branch 'origin/fast-locking' into fast-locking - Fix number of rt args to complete_monitor_locking_C, remove some comments - Re-use r0 in call to unlock_object() ------------- Changes: - all: https://git.openjdk.org/lilliput/pull/56/files - new: https://git.openjdk.org/lilliput/pull/56/files/5aadb9c5..71f8cd8e Webrevs: - full: https://webrevs.openjdk.org/?repo=lilliput&pr=56&range=01 - incr: https://webrevs.openjdk.org/?repo=lilliput&pr=56&range=00-01 Stats: 381 lines in 17 files changed: 89 ins; 222 del; 70 mod Patch: https://git.openjdk.org/lilliput/pull/56.diff Fetch: git fetch https://git.openjdk.org/lilliput pull/56/head:pull/56 PR: https://git.openjdk.org/lilliput/pull/56 From stuefe at openjdk.org Mon Oct 17 12:10:14 2022 From: stuefe at openjdk.org (Thomas Stuefe) Date: Mon, 17 Oct 2022 12:10:14 GMT Subject: [master] RFR: Replace stack-locking with fast-locking [v2] In-Reply-To: <-keJQ-r27e1ES0Axe53_U4tynp7CsPPdO04iNxPETmo=.02ef55c0-7a6f-4ecd-8de6-a525eb0f5c63@github.com> References: <-keJQ-r27e1ES0Axe53_U4tynp7CsPPdO04iNxPETmo=.02ef55c0-7a6f-4ecd-8de6-a525eb0f5c63@github.com> Message-ID: On Thu, 13 Oct 2022 11:14:01 GMT, Roman Kennke wrote: >> I'd like to integrate the changes from the corresponding upstream PR into Lilliput upfront. This allows to get some testing in context of Lilliput, and also allows to work on follow-up changes, e.g. eliminate a lot of the load-klass stuff, improve hash-code, and simplification in GCs (especially Shenandoah and ZGC), as well as implement/enable some features in serviceability that haven't been possible because SA couldn't safely access object's Klass*. >> >> See upstream PR for description: >> https://github.com/openjdk/jdk/pull/10590 >> >> This PR is a 1:1 mirror of the upstream change, with some additional changes that touch Lilliput parts (e.g. synchronizer.cpp and Shenandoah). The upstream PR has received an enormous amount of testing, perf testing, discussion and design reviews, etc. 
>> >> Ok to go into Lilliput? >> >> Thanks, >> Roman > > Roman Kennke has updated the pull request incrementally with seven additional commits since the last revision: > > - Merge remote-tracking branch 'jdk-rkennke/fast-locking' into fast-locking3 > - Merge remote-tracking branch 'origin/fast-locking' into fast-locking > - RISC-V port > - Revert "Re-use r0 in call to unlock_object()" > > This reverts commit ebbcb615a788998596f403b47b72cf133cb9de46. > - Merge remote-tracking branch 'origin/fast-locking' into fast-locking > - Fix number of rt args to complete_monitor_locking_C, remove some comments > - Re-use r0 in call to unlock_object() +1. Let's cook this. ------------- Marked as reviewed by stuefe (Committer). PR: https://git.openjdk.org/lilliput/pull/56 From rkennke at openjdk.org Mon Oct 17 15:09:46 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Mon, 17 Oct 2022 15:09:46 GMT Subject: [master] RFR: Replace stack-locking with fast-locking [v3] In-Reply-To: References: Message-ID: > I'd like to integrate the changes from the corresponding upstream PR into Lilliput upfront. This allows to get some testing in context of Lilliput, and also allows to work on follow-up changes, e.g. eliminate a lot of the load-klass stuff, improve hash-code, and simplification in GCs (especially Shenandoah and ZGC), as well as implement/enable some features in serviceability that haven't been possible because SA couldn't safely access object's Klass*. > > See upstream PR for description: > https://github.com/openjdk/jdk/pull/10590 > > This PR is a 1:1 mirror of the upstream change, with some additional changes that touch Lilliput parts (e.g. synchronizer.cpp and Shenandoah). The upstream PR has received an enormous amount of testing, perf testing, discussion and design reviews, etc. > > Ok to go into Lilliput? 
> > Thanks, > Roman Roman Kennke has updated the pull request incrementally with two additional commits since the last revision: - Merge remote-tracking branch 'jdk-rkennke/fast-locking' into fast-locking3 - More RISC-V fixes ------------- Changes: - all: https://git.openjdk.org/lilliput/pull/56/files - new: https://git.openjdk.org/lilliput/pull/56/files/71f8cd8e..470a0701 Webrevs: - full: https://webrevs.openjdk.org/?repo=lilliput&pr=56&range=02 - incr: https://webrevs.openjdk.org/?repo=lilliput&pr=56&range=01-02 Stats: 37 lines in 5 files changed: 0 ins; 8 del; 29 mod Patch: https://git.openjdk.org/lilliput/pull/56.diff Fetch: git fetch https://git.openjdk.org/lilliput pull/56/head:pull/56 PR: https://git.openjdk.org/lilliput/pull/56 From rkennke at openjdk.org Tue Oct 18 14:55:38 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Tue, 18 Oct 2022 14:55:38 GMT Subject: [master] RFR: Replace stack-locking with fast-locking [v3] In-Reply-To: References: Message-ID: <43vTYjPbeFks1ARFTTOc8pFhuLUORN5vjdWlCGLf_co=.66464ef3-34b1-4bf6-9b04-d873f0aa5524@github.com> On Mon, 17 Oct 2022 15:09:46 GMT, Roman Kennke wrote: >> I'd like to integrate the changes from the corresponding upstream PR into Lilliput upfront. This allows to get some testing in context of Lilliput, and also allows to work on follow-up changes, e.g. eliminate a lot of the load-klass stuff, improve hash-code, and simplification in GCs (especially Shenandoah and ZGC), as well as implement/enable some features in serviceability that haven't been possible because SA couldn't safely access object's Klass*. >> >> See upstream PR for description: >> https://github.com/openjdk/jdk/pull/10590 >> >> This PR is a 1:1 mirror of the upstream change, with some additional changes that touch Lilliput parts (e.g. synchronizer.cpp and Shenandoah). The upstream PR has received an enormous amount of testing, perf testing, discussion and design reviews, etc. >> >> Ok to go into Lilliput? 
>> >> Thanks, >> Roman > > Roman Kennke has updated the pull request incrementally with two additional commits since the last revision: > > - Merge remote-tracking branch 'jdk-rkennke/fast-locking' into fast-locking3 > - More RISC-V fixes Thanks everybody. Let's ------------- PR: https://git.openjdk.org/lilliput/pull/56 From rkennke at openjdk.org Tue Oct 18 14:58:31 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Tue, 18 Oct 2022 14:58:31 GMT Subject: [master] Integrated: Replace stack-locking with fast-locking In-Reply-To: References: Message-ID: On Thu, 6 Oct 2022 15:00:24 GMT, Roman Kennke wrote: > I'd like to integrate the changes from the corresponding upstream PR into Lilliput upfront. This allows to get some testing in context of Lilliput, and also allows to work on follow-up changes, e.g. eliminate a lot of the load-klass stuff, improve hash-code, and simplification in GCs (especially Shenandoah and ZGC), as well as implement/enable some features in serviceability that haven't been possible because SA couldn't safely access object's Klass*. > > See upstream PR for description: > https://github.com/openjdk/jdk/pull/10590 > > This PR is a 1:1 mirror of the upstream change, with some additional changes that touch Lilliput parts (e.g. synchronizer.cpp and Shenandoah). The upstream PR has received an enormous amount of testing, perf testing, discussion and design reviews, etc. > > Ok to go into Lilliput? > > Thanks, > Roman This pull request has now been integrated. 
Changeset: cce9e5d8 Author: Roman Kennke URL: https://git.openjdk.org/lilliput/commit/cce9e5d870344e832e991d4aaa261bfefd3d3e32 Stats: 4205 lines in 140 files changed: 741 ins; 2847 del; 617 mod Replace stack-locking with fast-locking Reviewed-by: shade, stuefe ------------- PR: https://git.openjdk.org/lilliput/pull/56 From rkennke at openjdk.org Thu Oct 20 13:28:28 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 20 Oct 2022 13:28:28 GMT Subject: [master] RFR: ObjectMonitor Storage In-Reply-To: References: Message-ID: On Thu, 17 Feb 2022 08:53:41 GMT, Thomas Stuefe wrote: > Hi, > > This prepares the way for an idea Roman had last year: to store OM references in a compressed form in the header, instead of as 64-bit pointer. Similar to narrow Klass pointers, we need an index or an offset into a memory region. > > With this patch, OMs live in an array now. That new OM array is pre-reserved, gets committed on demand, has a freelist to manage released OMs. In a way, it is very similar to the old TSM solution before [JDK-8253064](https://bugs.openjdk.java.net/browse/JDK-8253064). Thanks a lot to @dcubed-ojdk for patiently explaining the details of the old solution to me([1], [2], [3]). > > ### Performance and memory use > > I found that the renaissance "philosophers" benchmark is a good tool for measuring ObjectMonitor memory usage and performance. The benchmark, if you run with default VM options, does a lot of synchronization and creates millions of OMs. Running with a lower value for `-XX:MonitorUsedDeflationThreshold` diminishes the effect, and according to Dan [3] it would be a typical case for threshold reduction too. > > This is not a typical use case. Typically we have only a few thousand OMs, and OM storage management does not matter that much. But we don't want huge performance drops in these outlier scenarios. > > The first version of my patch was very naive and used a single global allocator and synchronized each access. 
Performance loss was brutal, about 15% compared to malloc. > > I improved the patch to reduce contention: > - OMs are now preallocated in bulk on a per-thread basis (by default, 64, adjustable via `-XX:PreallocatedObjectMonitors`). > - OMs are released in bulk by preparing the OM freelist off-lock and only taking the lock when appending the list to the central free list. > > Again, all somewhat similar to the old TSM solution. > > The improved version now has similar or even better performance and memory use than the stock VM, see below. > > #### Measurements > > Running renaissance philosophers benchmark with both Stock (Lilliput) VM and patched Lilliput VM. > > Options: `-XX:+UnlockDiagnosticVMOptions -Xmx2g -Xms2g -XX:NativeMemoryTracking=summary -XX:+PrintNMTStatistics -XX:+DumpVitalsAtExit` > > We compare two allocators, one outside our control, in the libc. So I measure RSS, not Committed memory use, because I have no idea how much memory the libc commits in order to fulfill its mallocs. It may actually be a lot - to prevent contention, glibc at least uses thread-local arenas, which can get huge but are often mostly unused. > > For the same reason I do not use `AlwaysPretouch`: we would pretouch committed mmaped memory in the patched version, but neither os::malloc() nor whatever overhead the libc produces would be touched. `AlwaysPretouch` would hence bias against stock. 
> > The performance numbers and RSS numbers wobbled a bit, but the following run is average (smaller numbers better): > > ##### Default run (`-XX:MonitorUsedDeflationThreshold=90`, `-XX:PreallocatedObjectMonitors=64`) > > 1) Stock > > Benchmark result: 4333,61537 > Highest rss: 3,7g > > 2) New ObjectMonitorStorage > > Benchmark result: 4122,14034 (+4.9%) > Highest rss: 3,7g > > > ##### Run with `-XX:MonitorUsedDeflationThreshold=50` > > 1) Stock > > Benchmark result: 4202,40876 > Highest rss: rss=2,1g > > 2) New ObjectMonitorStorage > > Benchmark result: 4142,66361 (+1.4%) > Highest rss: 1.9g (-9%) > > > ##### Run with `-XX:PreallocatedObjectMonitors=1024` > > 2) New ObjectMonitorStorage > > Benchmark result: 4322,59806 (-2.8%) > Highest rss: 3.5g (+66%) > > #### Analysis > > Patched version uses a bit less RSS and is 1..5% faster than unpatched version. > > Enlarging the number of thread-local preallocated OMs to 1024 was not so hot. Reduced contention comes at a high memory price and an actual performance loss too. There is an optimal point, maybe even the default of 64 is too high. I ran out of time though, but this is certainly a knob to optimize. > > I suspect that if we go down this road, we'll need to spend more time optimizing memory and performance of OM storage, even though my results are encouraging. This is an area where a lot of tweaking happened upstream over time. > > ### Patch details: > > The patch introduces the general-purpose class `AddressStableArray`: a templatized array that is pre-reserved, address-stable, contiguous, committed on demand, and keeps a freelist of released items. Code is well tested (see the new gtests) and can serve as building block for similar uses. E.g. Roman played with the idea of placing Thread into such an array too. > > ### What this patch does not do: > > This patch does not change the way OMs are stored in the markword. 
I had a quick glance, but it's not trivial to find all places in generated code that read OMs from the mark word (e.g. `C2_MacroAssembler::rtm_inflated_locking`). I ran out of time and leave this for another day. > > ### Tests: > > - GHAs > - SAP nightlies (scheduled) > - manual test on Linux X64, x86, aarch64, Windows x64 > > [1] https://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2022-January/053683.html > [2] https://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2022-February/053903.html > [3] https://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2022-March/054187.html I realized that, at least in the long run, mapping oops to OMs in the header is not feasible. The problem is that it is not compatible with compact-hashcode. Compact hashcode can expand an object to include an additional field for storing the hashcode. This is done during GC. This means that the size of the original copy and the new copy of an object may differ. The size of an object is determined by looking at the object's header. If the header is displaced, e.g. into an OM, then we only have a single header for both copies of the object. This means that we cannot safely determine the object size during GC if we displace the header. ------------- PR: https://git.openjdk.org/lilliput/pull/39
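For readers following the storage part of this thread, here is a rough C++ sketch of the general shape of an address-stable, index-addressed array with a freelist, as described in the quoted proposal. It is not the `AddressStableArray` from the patch: `StableSlotArray` and its members are hypothetical names, the backing store is one up-front heap allocation rather than reserved memory committed on demand, and there is no locking or per-thread preallocation. It only illustrates why a narrow index can stand in for a 64-bit pointer: slots never move, and released slots are recycled through a freelist.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical sketch: fixed-capacity array whose elements never move, so a
// 32-bit index identifies an element for its whole lifetime.
template <typename T>
class StableSlotArray {
public:
    static constexpr uint32_t kNone = UINT32_MAX;

    explicit StableSlotArray(uint32_t capacity)
        : capacity_(capacity), slots_(new T[capacity]()) {}

    // Returns the index of a value-initialized slot, or kNone if the array is full.
    uint32_t allocate() {
        if (!free_.empty()) {
            uint32_t idx = free_.back();  // reuse the most recently released slot
            free_.pop_back();
            slots_[idx] = T();
            return idx;
        }
        if (used_ == capacity_) return kNone;
        return used_++;
    }

    // Puts a slot back on the freelist; its address stays valid for reuse.
    void release(uint32_t idx) {
        assert(idx < used_);
        free_.push_back(idx);
    }

    // Resolves a narrow index to the element's (stable) address.
    T* get(uint32_t idx) {
        assert(idx < used_);
        return &slots_[idx];
    }

private:
    uint32_t capacity_;
    uint32_t used_ = 0;             // number of slots handed out at least once
    std::unique_ptr<T[]> slots_;    // allocated once, never reallocated
    std::vector<uint32_t> free_;    // indices of released slots
};
```

A caller would keep only the `uint32_t` returned by `allocate()` and resolve it with `get()` when needed.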