From mdoerr at openjdk.org Tue Oct 4 09:37:57 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Tue, 4 Oct 2022 09:37:57 GMT Subject: RFR: 8293782: Shenandoah: some tests failed on lock rank check In-Reply-To: References: Message-ID: On Wed, 14 Sep 2022 07:01:52 GMT, Tongbao Zhang wrote: > After [JDK-8290025](https://bugs.openjdk.org/browse/JDK-8290025), some tests using ShenandoahGC failed on the lock rank check between AdapterHandlerLibrary_lock and ShenandoahRequestedGC_lock > > Symptom > > # > # A fatal error has been detected by the Java Runtime Environment: > # > # Internal Error (/data1/ws/jdk/src/hotspot/share/runtime/mutex.cpp:454), pid=2018566, tid=2022220 > # assert(false) failed: Attempting to acquire lock ShenandoahRequestedGC_lock/safepoint-1 out of order with lock AdapterHandlerLibrary_lock/safepoint-1 -- possible deadlock > # > # JRE version: OpenJDK Runtime Environment (20.0) (slowdebug build 20-internal-adhoc.root.jdk) > # Java VM: OpenJDK 64-Bit Server VM (slowdebug 20-internal-adhoc.root.jdk, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, shenandoah gc, linux-amd64) > # Problematic frame: > # V [libjvm.so+0x106fd6a] Mutex::check_rank(Thread*)+0x426 We have tested it for a couple of days and there were no new failures. LGTM. ------------- Marked as reviewed by mdoerr (Reviewer). PR: https://git.openjdk.org/jdk/pull/10264 From shade at openjdk.org Tue Oct 4 10:50:14 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 4 Oct 2022 10:50:14 GMT Subject: RFR: 8293782: Shenandoah: some tests failed on lock rank check In-Reply-To: References: Message-ID: On Wed, 14 Sep 2022 07:01:52 GMT, Tongbao Zhang wrote: > After [JDK-8290025](https://bugs.openjdk.org/browse/JDK-8290025), some tests using ShenandoahGC failed on the lock rank check between AdapterHandlerLibrary_lock and ShenandoahRequestedGC_lock > > Symptom > > # > # A fatal error has been detected by the Java Runtime Environment: > # > # Internal Error (/data1/ws/jdk/src/hotspot/share/runtime/mutex.cpp:454), pid=2018566, tid=2022220 > # assert(false) failed: Attempting to acquire lock ShenandoahRequestedGC_lock/safepoint-1 out of order with lock AdapterHandlerLibrary_lock/safepoint-1 -- possible deadlock > # > # JRE version: OpenJDK Runtime Environment (20.0) (slowdebug build 20-internal-adhoc.root.jdk) > # Java VM: OpenJDK 64-Bit Server VM (slowdebug 20-internal-adhoc.root.jdk, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, shenandoah gc, linux-amd64) > # Problematic frame: > # V [libjvm.so+0x106fd6a] Mutex::check_rank(Thread*)+0x426 So we can now enter this code when holding `AdapterHandlerLibrary_lock`, which has a rank of `safepoint-1`. These locks should probably match the rank of `Heap_lock`, which is `safepoint-2` now. Please update `_alloc_failure_waiters_lock` rank as well. ------------- Changes requested by shade (Reviewer). PR: https://git.openjdk.org/jdk/pull/10264 From nick.gasson at arm.com Tue Oct 4 12:56:12 2022 From: nick.gasson at arm.com (Nick Gasson) Date: Tue, 04 Oct 2022 13:56:12 +0100 Subject: Improving the scalability of the evac OOM protocol Message-ID: Hi, I've been running SPECjbb with Shenandoah on some large multi-socket Arm systems and I noticed the concurrent evacuation OOM protocol is a bit of a bottleneck. The problem here is that we have a single variable, _threads_in_evac, shared between all threads. To enter the protocol we do a CAS to increment the counter and to leave we do an atomic decrement. For the GC threads this isn't really an issue as they only enter/leave once per cycle, but Java threads have to enter/leave every time they help evacuate an object on the load barrier slow path. This means _threads_in_evac is very heavily contended and we effectively serialise Java thread execution through access to this variable: I counted several million CAS failures per second in ShenandoahEvacOOMHandler::register_thread() on one Arm N1 system while running SPECjbb. This is especially problematic on multi-socket systems where the communication overhead of the cache coherency protocol can be high. I tried fixing this in a fairly simple way by replicating the counter N times on separate cache lines (N=64, somewhat arbitrarily). See the draft patch below: https://github.com/nick-arm/jdk/commit/ca78e77f0c6 Each thread hashes to a particular counter based on its Thread*. To signal an OOM we CAS in OOM_MARKER_MASK on every counter and then in wait_for_no_evac_threads() we wait for every counter to go to zero (and also to see OOM_MARKER_MASK set in that counter). I think this is safe and race-free based on the fact that, once OOM_MARKER_MASK is set, the counter can only ever decrease. So once we've seen a particular counter go to zero we know that the value will never change except when clear() is called at a safepoint. This means we can just iterate over all the counters, and if we see that they are all zero, then we know no more threads are inside or can enter the evacuation path. On a 160-core dual-socket Arm N1 system this improves SPECjbb max-jOPS by ~8% and critical-jOPS by ~98% (!), averaged over 10 runs. On a 32-core dual-socket Xeon system I get +0.4% max-jOPS and +43% critical-jOPS. There's also some benefit on single-socket systems: with AWS c7g.16xlarge I see +0.3% max-jOPS and +3% critical-jOPS. I've also tested SPECjbb on a fastdebug build with -XX:+ShenandoahOOMDuringEvacALot and didn't see any errors. I experimented with taking this to its logical conclusion and giving each thread its own counter in ShenandoahThreadLocalData, but it's difficult to avoid races with thread creation and this simple approach seems to give most of the benefit anyway. Any thoughts on this? -- Thanks, Nick From shade at redhat.com Tue Oct 4 14:42:11 2022 From: shade at redhat.com (Aleksey Shipilev) Date: Tue, 4 Oct 2022 16:42:11 +0200 Subject: Improving the scalability of the evac OOM protocol In-Reply-To: References: Message-ID: <6b3eface-94b9-744d-bc7d-ec5bb4b05c90@redhat.com> On 10/4/22 14:56, Nick Gasson wrote: > I tried fixing this in a fairly simple way by replicating the counter N > times on separate cache lines (N=64, somewhat arbitrarily). See the > draft patch below: > > https://github.com/nick-arm/jdk/commit/ca78e77f0c6 > > Each thread hashes to a particular counter based on its Thread*. Yes, stripped counter works here fine. > I've also tested SPECjbb on a fastdebug build with > -XX:+ShenandoahOOMDuringEvacALot and didn't see any errors. make test TEST=hotspot_gc_shenandoah exercises evac paths a lot, consider running it on affected platforms. > I experimented with taking this to its logical conclusion and giving > each thread its own counter in ShenandoahThreadLocalData, but it's > difficult to avoid races with thread creation and this simple approach > seems to give most of the benefit anyway. > > Any thoughts on this? Looks very good, please PR this. There are minor improvements we can do to this patch. -- Thanks, -Aleksey From ngasson at openjdk.org Wed Oct 5 11:20:53 2022 From: ngasson at openjdk.org (Nick Gasson) Date: Wed, 5 Oct 2022 11:20:53 GMT Subject: RFR: 8294775: Shenandoah: reduce contention on _threads_in_evac Message-ID: The idea here is to reduce contention on the shared `_threads_in_evac` counter by splitting its state over multiple independent cache lines. Each thread hashes to one particular counter based on its `Thread*`. This helps improve throughput of concurrent evacuation where many Java threads may be attempting to update this counter on the load barrier slow path. See this earlier thread for details and SPECjbb results: https://mail.openjdk.org/pipermail/shenandoah-dev/2022-October/017494.html Also tested `hotspot_gc_shenandoah` on x86 and AArch64. ------------- Commit messages: - 8294775: Shenandoah: reduce contention on _threads_in_evac Changes: https://git.openjdk.org/jdk/pull/10573/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10573&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8294775 Stats: 87 lines in 4 files changed: 62 ins; 6 del; 19 mod Patch: https://git.openjdk.org/jdk/pull/10573.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10573/head:pull/10573 PR: https://git.openjdk.org/jdk/pull/10573 From ngasson at openjdk.org Wed Oct 5 11:20:53 2022 From: ngasson at openjdk.org (Nick Gasson) Date: Wed, 5 Oct 2022 11:20:53 GMT Subject: RFR: 8294775: Shenandoah: reduce contention on _threads_in_evac In-Reply-To: References: Message-ID: On Wed, 5 Oct 2022 11:10:29 GMT, Nick Gasson wrote: > The idea here is to reduce contention on the shared `_threads_in_evac` counter by splitting its state over multiple independent cache lines. Each thread hashes to one particular counter based on its `Thread*`. This helps improve throughput of concurrent evacuation where many Java threads may be attempting to update this counter on the load barrier slow path. > > See this earlier thread for details and SPECjbb results: https://mail.openjdk.org/pipermail/shenandoah-dev/2022-October/017494.html > > Also tested `hotspot_gc_shenandoah` on x86 and AArch64. src/hotspot/share/gc/shenandoah/shenandoahEvacOOMHandler.inline.hpp line 35: > 33: > 34: void ShenandoahEvacOOMHandler::enter_evacuation(Thread* thr) { > 35: jint threads_in_evac = Atomic::load_acquire(&_threads_in_evac); This load seems to be redundant. I don't think it has any ordering effects and we will load it again immediately either below or in `register_thread()`. ------------- PR: https://git.openjdk.org/jdk/pull/10573 From rkennke at openjdk.org Thu Oct 6 07:47:02 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 6 Oct 2022 07:47:02 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking Message-ID: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turns out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock. This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typcially remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols. In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth to add support for recursive fast-locking. One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, and thus handing over to the contending thread. As an alternative, I considered to remove stack-locking altogether, and only use heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it. This change enables to simplify (and speed-up!) a lot of code: - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header. - Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. This is safe because Java threads participate in monitor deflation protocol. This would be implemented in a separate PR ### Benchmarks All benchmarks are run on server-class metal machines. The JVM settings are always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less is better. #### DaCapo/AArch64 Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. Benchmarks that showed results far off the baseline or showed high variance have been repeated and I am reporting results with the most bias *against* fast-locking. The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean what is it actually doing that a change in locking shows >30% perf difference? benchmark | baseline | fast-locking | % | size -- | -- | -- | -- | -- avrora | 27859 | 27563 | 1.07% | large batik | 20786 | 20847 | -0.29% | large biojava | 27421 | 27334 | 0.32% | default eclipse | 59918 | 60522 | -1.00% | large fop | 3670 | 3678 | -0.22% | default graphchi | 2088 | 2060 | 1.36% | default h2 | 297391 | 291292 | 2.09% | huge jme | 8762 | 8877 | -1.30% | default jython | 18938 | 18878 | 0.32% | default luindex | 1339 | 1325 | 1.06% | default lusearch | 918 | 936 | -1.92% | default pmd | 58291 | 58423 | -0.23% | large sunflow | 32617 | 24961 | 30.67% | large tomcat | 25481 | 25992 | -1.97% | large tradebeans | 314640 | 311706 | 0.94% | huge tradesoap | 107473 | 110246 | -2.52% | huge xalan | 6047 | 5882 | 2.81% | default zxing | 970 | 926 | 4.75% | default #### DaCapo/x86_64 The following measurements have been taken on an Intel Xeon Scalable Processors (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same settings and considerations as in the measurements above. benchmark | baseline | fast-Locking | % | size -- | -- | -- | -- | -- avrora | 127690 | 126749 | 0.74% | large batik | 12736 | 12641 | 0.75% | large biojava | 15423 | 15404 | 0.12% | default eclipse | 41174 | 41498 | -0.78% | large fop | 2184 | 2172 | 0.55% | default graphchi | 1579 | 1560 | 1.22% | default h2 | 227614 | 230040 | -1.05% | huge jme | 8591 | 8398 | 2.30% | default jython | 13473 | 13356 | 0.88% | default luindex | 824 | 813 | 1.35% | default lusearch | 962 | 968 | -0.62% | default pmd | 40827 | 39654 | 2.96% | large sunflow | 53362 | 43475 | 22.74% | large tomcat | 27549 | 28029 | -1.71% | large tradebeans | 190757 | 190994 | -0.12% | huge tradesoap | 68099 | 67934 | 0.24% | huge xalan | 7969 | 8178 | -2.56% | default zxing | 1176 | 1148 | 2.44% | default #### Renaissance/AArch64 This tests Renaissance/JMH version 0.14.1 on same machines as DaCapo above, with same JVM settings. benchmark | baseline | fast-locking | % -- | -- | -- | -- AkkaUct | 2558.832 | 2513.594 | 1.80% Reactors | 14715.626 | 14311.246 | 2.83% Als | 1851.485 | 1869.622 | -0.97% ChiSquare | 1007.788 | 1003.165 | 0.46% GaussMix | 1157.491 | 1149.969 | 0.65% LogRegression | 717.772 | 733.576 | -2.15% MovieLens | 7916.181 | 8002.226 | -1.08% NaiveBayes | 395.296 | 386.611 | 2.25% PageRank | 4294.939 | 4346.333 | -1.18% FjKmeans | 519.2 | 498.357 | 4.18% FutureGenetic | 2578.504 | 2589.255 | -0.42% Mnemonics | 4898.886 | 4903.689 | -0.10% ParMnemonics | 4260.507 | 4210.121 | 1.20% Scrabble | 139.37 | 138.312 | 0.76% RxScrabble | 320.114 | 322.651 | -0.79% Dotty | 1056.543 | 1068.492 | -1.12% ScalaDoku | 3443.117 | 3449.477 | -0.18% ScalaStmBench7 | 1102.43 | 1115.142 | -1.14% FinagleChirper | 6814.192 | 6853.38 | -0.57% FinagleHttp | 4762.902 | 4807.564 | -0.93% #### Renaissance/x86_64 benchmark | baseline | fast-locking | % -- | -- | -- | -- AkkaUct | 1117.185 | 1116.425 | 0.07% Reactors | 11561.354 | 11812.499 | -2.13% Als | 1580.838 | 1575.318 | 0.35% ChiSquare | 459.601 | 467.109 | -1.61% GaussMix | 705.944 | 685.595 | 2.97% LogRegression | 659.944 | 656.428 | 0.54% MovieLens | 7434.303 | 7592.271 | -2.08% NaiveBayes | 413.482 | 417.369 | -0.93% PageRank | 3259.233 | 3276.589 | -0.53% FjKmeans | 946.429 | 938.991 | 0.79% FutureGenetic | 1760.672 | 1815.272 | -3.01% Scrabble | 147.996 | 150.084 | -1.39% RxScrabble | 177.755 | 177.956 | -0.11% Dotty | 673.754 | 683.919 | -1.49% ScalaKmeans | 165.376 | 168.925 | -2.10% ScalaStmBench7 | 1080.187 | 1049.184 | 2.95% Some renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and probably addressing with running them much more often. Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here. ### Testing - [x] tier1 (x86_64, aarch64, x86_32) - [x] tier2 (x86_64, aarch64) - [x] tier3 (x86_64, aarch64) - [x] tier4 (x86_64, aarch64) ------------- Commit messages: - Merge tag 'jdk-20+17' into fast-locking - Fix OSR packing in AArch64, part 2 - Fix OSR packing in AArch64 - Merge remote-tracking branch 'upstream/master' into fast-locking - Fix register in interpreter unlock x86_32 - Support unstructured locking in interpreter (x86 parts) - Support unstructured locking in interpreter (aarch64 and shared parts) - Merge branch 'master' into fast-locking - Merge branch 'master' into fast-locking - Added test for hand-over-hand locking - ... and 17 more: https://git.openjdk.org/jdk/compare/79ccc791...3ed51053 Changes: https://git.openjdk.org/jdk/pull/9680/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9680&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8291555 Stats: 3660 lines in 127 files changed: 650 ins; 2481 del; 529 mod Patch: https://git.openjdk.org/jdk/pull/9680.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9680/head:pull/9680 PR: https://git.openjdk.org/jdk/pull/9680 From stuefe at openjdk.org Thu Oct 6 07:47:02 2022 From: stuefe at openjdk.org (Thomas Stuefe) Date: Thu, 6 Oct 2022 07:47:02 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> Message-ID: On Thu, 28 Jul 2022 19:58:34 GMT, Roman Kennke wrote: > This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turns out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. > > What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock. > > This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typcially remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols. > > In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth to add support for recursive fast-locking. > > One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, and thus handing over to the contending thread. > > As an alternative, I considered to remove stack-locking altogether, and only use heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it. > > This change enables to simplify (and speed-up!) a lot of code: > > - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header. > - Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. This is safe because Java threads participate in monitor deflation protocol. This would be implemented in a separate PR > > ### Benchmarks > > All benchmarks are run on server-class metal machines. The JVM settings are always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less is better. > > #### DaCapo/AArch64 > > Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. Benchmarks that showed results far off the baseline or showed high variance have been repeated and I am reporting results with the most bias *against* fast-locking. The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean what is it actually doing that a change in locking shows >30% perf difference? > > benchmark | baseline | fast-locking | % | size > -- | -- | -- | -- | -- > avrora | 27859 | 27563 | 1.07% | large > batik | 20786 | 20847 | -0.29% | large > biojava | 27421 | 27334 | 0.32% | default > eclipse | 59918 | 60522 | -1.00% | large > fop | 3670 | 3678 | -0.22% | default > graphchi | 2088 | 2060 | 1.36% | default > h2 | 297391 | 291292 | 2.09% | huge > jme | 8762 | 8877 | -1.30% | default > jython | 18938 | 18878 | 0.32% | default > luindex | 1339 | 1325 | 1.06% | default > lusearch | 918 | 936 | -1.92% | default > pmd | 58291 | 58423 | -0.23% | large > sunflow | 32617 | 24961 | 30.67% | large > tomcat | 25481 | 25992 | -1.97% | large > tradebeans | 314640 | 311706 | 0.94% | huge > tradesoap | 107473 | 110246 | -2.52% | huge > xalan | 6047 | 5882 | 2.81% | default > zxing | 970 | 926 | 4.75% | default > > #### DaCapo/x86_64 > > The following measurements have been taken on an Intel Xeon Scalable Processors (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same settings and considerations as in the measurements above. > > benchmark | baseline | fast-Locking | % | size > -- | -- | -- | -- | -- > avrora | 127690 | 126749 | 0.74% | large > batik | 12736 | 12641 | 0.75% | large > biojava | 15423 | 15404 | 0.12% | default > eclipse | 41174 | 41498 | -0.78% | large > fop | 2184 | 2172 | 0.55% | default > graphchi | 1579 | 1560 | 1.22% | default > h2 | 227614 | 230040 | -1.05% | huge > jme | 8591 | 8398 | 2.30% | default > jython | 13473 | 13356 | 0.88% | default > luindex | 824 | 813 | 1.35% | default > lusearch | 962 | 968 | -0.62% | default > pmd | 40827 | 39654 | 2.96% | large > sunflow | 53362 | 43475 | 22.74% | large > tomcat | 27549 | 28029 | -1.71% | large > tradebeans | 190757 | 190994 | -0.12% | huge > tradesoap | 68099 | 67934 | 0.24% | huge > xalan | 7969 | 8178 | -2.56% | default > zxing | 1176 | 1148 | 2.44% | default > > #### Renaissance/AArch64 > > This tests Renaissance/JMH version 0.14.1 on same machines as DaCapo above, with same JVM settings. > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 2558.832 | 2513.594 | 1.80% > Reactors | 14715.626 | 14311.246 | 2.83% > Als | 1851.485 | 1869.622 | -0.97% > ChiSquare | 1007.788 | 1003.165 | 0.46% > GaussMix | 1157.491 | 1149.969 | 0.65% > LogRegression | 717.772 | 733.576 | -2.15% > MovieLens | 7916.181 | 8002.226 | -1.08% > NaiveBayes | 395.296 | 386.611 | 2.25% > PageRank | 4294.939 | 4346.333 | -1.18% > FjKmeans | 519.2 | 498.357 | 4.18% > FutureGenetic | 2578.504 | 2589.255 | -0.42% > Mnemonics | 4898.886 | 4903.689 | -0.10% > ParMnemonics | 4260.507 | 4210.121 | 1.20% > Scrabble | 139.37 | 138.312 | 0.76% > RxScrabble | 320.114 | 322.651 | -0.79% > Dotty | 1056.543 | 1068.492 | -1.12% > ScalaDoku | 3443.117 | 3449.477 | -0.18% > ScalaStmBench7 | 1102.43 | 1115.142 | -1.14% > FinagleChirper | 6814.192 | 6853.38 | -0.57% > FinagleHttp | 4762.902 | 4807.564 | -0.93% > > #### Renaissance/x86_64 > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 1117.185 | 1116.425 | 0.07% > Reactors | 11561.354 | 11812.499 | -2.13% > Als | 1580.838 | 1575.318 | 0.35% > ChiSquare | 459.601 | 467.109 | -1.61% > GaussMix | 705.944 | 685.595 | 2.97% > LogRegression | 659.944 | 656.428 | 0.54% > MovieLens | 7434.303 | 7592.271 | -2.08% > NaiveBayes | 413.482 | 417.369 | -0.93% > PageRank | 3259.233 | 3276.589 | -0.53% > FjKmeans | 946.429 | 938.991 | 0.79% > FutureGenetic | 1760.672 | 1815.272 | -3.01% > Scrabble | 147.996 | 150.084 | -1.39% > RxScrabble | 177.755 | 177.956 | -0.11% > Dotty | 673.754 | 683.919 | -1.49% > ScalaKmeans | 165.376 | 168.925 | -2.10% > ScalaStmBench7 | 1080.187 | 1049.184 | 2.95% > > Some renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and probably addressing with running them much more often. > > Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here. > > ### Testing > - [x] tier1 (x86_64, aarch64, x86_32) > - [x] tier2 (x86_64, aarch64) > - [x] tier3 (x86_64, aarch64) > - [x] tier4 (x86_64, aarch64) When I run renaissance philosophers benchmark (no arguments, just the default settings) on my 12 core machine the VM intermittently hangs after the benchmark is done. Always, two threads keep running at 100% CPU. I have been able to attach gdb once and we were in a tight loop in (gdb) bt #0 Atomic::PlatformLoad<8ul>::operator() (dest=0x7f991c119e80, this=) at src/hotspot/share/runtime/atomic.hpp:614 #1 Atomic::LoadImpl, void>::operator() (dest=0x7f991c119e80, this=) at src/hotspot/share/runtime/atomic.hpp:392 #2 Atomic::load (dest=0x7f991c119e80) at src/hotspot/share/runtime/atomic.hpp:615 #3 ObjectMonitor::owner_raw (this=0x7f991c119e40) at src/hotspot/share/runtime/objectMonitor.inline.hpp:66 #4 ObjectMonitor::owner (this=0x7f991c119e40) at src/hotspot/share/runtime/objectMonitor.inline.hpp:61 #5 ObjectSynchronizer::monitors_iterate (thread=0x7f9a30027230, closure=) at src/hotspot/share/runtime/synchronizer.cpp:983 #6 ObjectSynchronizer::release_monitors_owned_by_thread (current=current at entry=0x7f9a30027230) at src/hotspot/share/runtime/synchronizer.cpp:1492 #7 0x00007f9a351bc320 in JavaThread::exit (this=this at entry=0x7f9a30027230, destroy_vm=destroy_vm at entry=false, exit_type=exit_type at entry=JavaThread::jni_detach) at src/hotspot/share/runtime/javaThread.cpp:851 #8 0x00007f9a352445ca in jni_DetachCurrentThread (vm=) at src/hotspot/share/prims/jni.cpp:3962 #9 0x00007f9a35f9ac7e in JavaMain (_args=) at src/java.base/share/native/libjli/java.c:555 #10 0x00007f9a35f9e30d in ThreadJavaMain (args=) at src/java.base/unix/native/libjli/java_md.c:650 #11 0x00007f9a35d47609 in start_thread (arg=) at pthread_create.c:477 #12 0x00007f9a35ea3133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 in one thread. Which points to a misformed monitor list. I tried to reproduce it with a debug build, but no such luck. I was able to reproduce it once again with a release build. I'll see if I can find out more. Happens when the main thread detaches itself upon VM exit. VM attempts to release OMs that are owned by the finished main thread (side note: if that is the sole surviving thread, maybe that step could be skipped?). That happens before DestroyVM, so OM final audit did not yet run. Problem here is the OM in use list is circular (and very big, ca 11mio entries). I was able to reproduce it with a fastdebug build in 1 out of 5-6 runs. Also with less benchmark cycles (-r 3). Offlist questions from Roman: -"Does it really not happen with Stock?" no, I could not reproduce it with stock VM (built from f5d1b5bda27c798347ae278cbf69725ed4be895c, the commit preceding the PR) -"Do we now have more OMs than before?" I cannot see that effect. Running philosophers with -r 3 causes the VM in the end to have between 800k and ~2mio open OMs *if the error does not happen*, no difference between stock and PR VM. In cases where the PR-VM hangs we have a lot more, as I wrote, about 11-12mio OMs. ------------- PR: https://git.openjdk.org/jdk/pull/9680 From rkennke at openjdk.org Thu Oct 6 07:47:03 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 6 Oct 2022 07:47:03 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> Message-ID: On Wed, 3 Aug 2022 07:17:51 GMT, Thomas Stuefe wrote: > Happens when the main thread detaches itself upon VM exit. VM attempts to release OMs that are owned by the finished main thread (side note: if that is the sole surviving thread, maybe that step could be skipped?). That happens before DestroyVM, so OM final audit did not yet run. > > Problem here is the OM in use list is circular (and very big, ca 11mio entries). > > I was able to reproduce it with a fastdebug build in 1 out of 5-6 runs. Also with less benchmark cycles (-r 3). Hi Thomas, thanks for testing and reporting the issue. I just pushed an improvement (and simplification) of the monitor-enter-inflate path, and cannot seem to reproduce the problem anymore. Can you please try again with the latest change? ------------- PR: https://git.openjdk.org/jdk/pull/9680 From stuefe at openjdk.org Thu Oct 6 07:47:04 2022 From: stuefe at openjdk.org (Thomas Stuefe) Date: Thu, 6 Oct 2022 07:47:04 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com>

Message-ID: <87vYXa_Uu88sb8ldFeGdHfeqPCMPxhhzqbVOooXle7A=.09d21ecc-8910-464d-b164-88b8322ebd34@github.com> On Sun, 7 Aug 2022 12:50:01 GMT, Roman Kennke wrote: > > Happens when the main thread detaches itself upon VM exit. VM attempts to release OMs that are owned by the finished main thread (side note: if that is the sole surviving thread, maybe that step could be skipped?). That happens before DestroyVM, so OM final audit did not yet run. > > Problem here is the OM in use list is circular (and very big, ca 11mio entries). > > I was able to reproduce it with a fastdebug build in 1 out of 5-6 runs. Also with less benchmark cycles (-r 3). > > Hi Thomas, thanks for testing and reporting the issue. I just pushed an improvement (and simplification) of the monitor-enter-inflate path, and cannot seem to reproduce the problem anymore. Can you please try again with the latest change? New version ran for 30 mins without crashing. Not a solid proof, but its better :-) ------------- PR: https://git.openjdk.org/jdk/pull/9680 From dholmes at openjdk.org Thu Oct 6 07:47:05 2022 From: dholmes at openjdk.org (David Holmes) Date: Thu, 6 Oct 2022 07:47:05 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> Message-ID: On Thu, 28 Jul 2022 19:58:34 GMT, Roman Kennke wrote: > This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turns out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. > > What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock. > > This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typcially remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols. > > In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth to add support for recursive fast-locking. > > One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, and thus handing over to the contending thread. > > As an alternative, I considered to remove stack-locking altogether, and only use heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it. > > This change enables to simplify (and speed-up!) a lot of code: > > - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header. > - Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. This is safe because Java threads participate in monitor deflation protocol. This would be implemented in a separate PR > > ### Benchmarks > > All benchmarks are run on server-class metal machines. The JVM settings are always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less is better. > > #### DaCapo/AArch64 > > Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. Benchmarks that showed results far off the baseline or showed high variance have been repeated and I am reporting results with the most bias *against* fast-locking. The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean what is it actually doing that a change in locking shows >30% perf difference? > > benchmark | baseline | fast-locking | % | size > -- | -- | -- | -- | -- > avrora | 27859 | 27563 | 1.07% | large > batik | 20786 | 20847 | -0.29% | large > biojava | 27421 | 27334 | 0.32% | default > eclipse | 59918 | 60522 | -1.00% | large > fop | 3670 | 3678 | -0.22% | default > graphchi | 2088 | 2060 | 1.36% | default > h2 | 297391 | 291292 | 2.09% | huge > jme | 8762 | 8877 | -1.30% | default > jython | 18938 | 18878 | 0.32% | default > luindex | 1339 | 1325 | 1.06% | default > lusearch | 918 | 936 | -1.92% | default > pmd | 58291 | 58423 | -0.23% | large > sunflow | 32617 | 24961 | 30.67% | large > tomcat | 25481 | 25992 | -1.97% | large > tradebeans | 314640 | 311706 | 0.94% | huge > tradesoap | 107473 | 110246 | -2.52% | huge > xalan | 6047 | 5882 | 2.81% | default > zxing | 970 | 926 | 4.75% | default > > #### DaCapo/x86_64 > > The following measurements have been taken on an Intel Xeon Scalable Processors (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same settings and considerations as in the measurements above. > > benchmark | baseline | fast-Locking | % | size > -- | -- | -- | -- | -- > avrora | 127690 | 126749 | 0.74% | large > batik | 12736 | 12641 | 0.75% | large > biojava | 15423 | 15404 | 0.12% | default > eclipse | 41174 | 41498 | -0.78% | large > fop | 2184 | 2172 | 0.55% | default > graphchi | 1579 | 1560 | 1.22% | default > h2 | 227614 | 230040 | -1.05% | huge > jme | 8591 | 8398 | 2.30% | default > jython | 13473 | 13356 | 0.88% | default > luindex | 824 | 813 | 1.35% | default > lusearch | 962 | 968 | -0.62% | default > pmd | 40827 | 39654 | 2.96% | large > sunflow | 53362 | 43475 | 22.74% | large > tomcat | 27549 | 28029 | -1.71% | large > tradebeans | 190757 | 190994 | -0.12% | huge > tradesoap | 68099 | 67934 | 0.24% | huge > xalan | 7969 | 8178 | -2.56% | default > zxing | 1176 | 1148 | 2.44% | default > > #### Renaissance/AArch64 > > This tests Renaissance/JMH version 0.14.1 on same machines as DaCapo above, with same JVM settings. > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 2558.832 | 2513.594 | 1.80% > Reactors | 14715.626 | 14311.246 | 2.83% > Als | 1851.485 | 1869.622 | -0.97% > ChiSquare | 1007.788 | 1003.165 | 0.46% > GaussMix | 1157.491 | 1149.969 | 0.65% > LogRegression | 717.772 | 733.576 | -2.15% > MovieLens | 7916.181 | 8002.226 | -1.08% > NaiveBayes | 395.296 | 386.611 | 2.25% > PageRank | 4294.939 | 4346.333 | -1.18% > FjKmeans | 519.2 | 498.357 | 4.18% > FutureGenetic | 2578.504 | 2589.255 | -0.42% > Mnemonics | 4898.886 | 4903.689 | -0.10% > ParMnemonics | 4260.507 | 4210.121 | 1.20% > Scrabble | 139.37 | 138.312 | 0.76% > RxScrabble | 320.114 | 322.651 | -0.79% > Dotty | 1056.543 | 1068.492 | -1.12% > ScalaDoku | 3443.117 | 3449.477 | -0.18% > ScalaStmBench7 | 1102.43 | 1115.142 | -1.14% > FinagleChirper | 6814.192 | 6853.38 | -0.57% > FinagleHttp | 4762.902 | 4807.564 | -0.93% > > #### Renaissance/x86_64 > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 1117.185 | 1116.425 | 0.07% > Reactors | 11561.354 | 11812.499 | -2.13% > Als | 1580.838 | 1575.318 | 0.35% > ChiSquare | 459.601 | 467.109 | -1.61% > GaussMix | 705.944 | 685.595 | 2.97% > LogRegression | 659.944 | 656.428 | 0.54% > MovieLens | 7434.303 | 7592.271 | -2.08% > NaiveBayes | 413.482 | 417.369 | -0.93% > PageRank | 3259.233 | 3276.589 | -0.53% > FjKmeans | 946.429 | 938.991 | 0.79% > FutureGenetic | 1760.672 | 1815.272 | -3.01% > Scrabble | 147.996 | 150.084 | -1.39% > RxScrabble | 177.755 | 177.956 | -0.11% > Dotty | 673.754 | 683.919 | -1.49% > ScalaKmeans | 165.376 | 168.925 | -2.10% > ScalaStmBench7 | 1080.187 | 1049.184 | 2.95% > > Some renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and probably addressing with running them much more often. > > Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here. > > ### Testing > - [x] tier1 (x86_64, aarch64, x86_32) > - [x] tier2 (x86_64, aarch64) > - [x] tier3 (x86_64, aarch64) > - [x] tier4 (x86_64, aarch64) The bar for acceptance for a brand new locking scheme with no fallback is extremely high and needs a lot of bake time and broad performance measurements, to watch for pathologies. That bar is lower if the scheme can be reverted to the old code if needed; and even lower still if the scheme is opt-in in the first place. For Java Object Monitors I made the new mechanism opt-in so the same could be done here. Granted it is not a trivial effort to do that, but I think a phased approach to transition to the new scheme is essential. It could be implemented as an experimental feature initially. I am not aware, please refresh my memory if you know different, of any core hotspot subsystem just being replaced in one fell swoop in one single release. Yes this needs a lot of testing but customers are not beta-testers. If this goes into a release on by default then there must be a way for customers to turn it off. UseHeavyMonitors is not a fallback as it is not for production use itself. So the new code has to co-exist along-side the old code as we make a transition across 2-3 releases. And yes that means a double-up on some testing as we already do for many things. Any fast locking scheme benefits the uncontended sync case. So if you have a lot of contention and therefore a lot of inflation, the fast locking won't show any benefit. What "modern workloads" are you using to measure this? We eventually got rid of biased-locking because it no longer showed any benefit, so it is possible that fast locking (of whichever form) could go the same way. And we may have moved past heavy use of synchronized in general for that matter, especially as Loom instigated many changes over to java.util.concurrent locks. Is UseHeavyMonitors in good enough shape to reliably be used for benchmark comparisons? I don't have github notification enabled so I missed this discussion. The JVMS permits lock A, lock B, unlock A, unlock B, in bytecode - i.e it passes verification and it does not violate the structured locking rules. It probably also passes verification if there is no exception table entries such that the unlocks are guaranteed to happen - regardless of the order. IIUC from above the VM will actually unlock all monitors for which there is a lock-record in the activation when the activation returns. The order in which it does that may be different to how the program would have done it but I don't see how that makes any difference to anything. ------------- PR: https://git.openjdk.org/jdk/pull/9680 From stuefe at openjdk.org Thu Oct 6 07:47:07 2022 From: stuefe at openjdk.org (Thomas Stuefe) Date: Thu, 6 Oct 2022 07:47:07 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: <8MGsPdlBSWGR-pgF8_fLo_mez67z7nHWXg8UOcjJxIY=.38bd9c0f-3ba0-4ebe-867d-b54608f01e63@github.com> References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> <8MGsPdlBSWGR-pgF8_fLo_mez67z7nHWXg8UOcjJxIY=.38bd9c0f-3ba0-4ebe-867d-b54608f01e63@github.com> Message-ID: On Mon, 8 Aug 2022 13:45:06 GMT, Roman Kennke wrote: > The bar for acceptance for a brand new locking scheme with no fallback is extremely high and needs a lot of bake time and broad performance measurements, to watch for pathologies. That bar is lower if the scheme can be reverted to the old code if needed; and even lower still if the scheme is opt-in in the first place. For Java Object Monitors I made the new mechanism opt-in so the same could be done here. Granted it is not a trivial effort to do that, but I think a phased approach to transition to the new scheme is essential. It could be implemented as an experimental feature initially. I fully agree that have to be careful, but I share Roman's viewpoint. If this work is something we want to happen and which is not in doubt in principle, then we also want the broadest possible test front. In my experience, opt-in coding is tested poorly. A runtime switch is fine as an emergency measure when you have customer problems, but then both standard and fallback code paths need to be very well tested. With something as ubiquitous as locking this would mean running almost the full test set with and without the new fast locking mechanism, and that is not feasible. Or even if it is, not practical: the cycles are better invested in hardening out the new locking mechanism. And arguably, we already have an opt-out mechanism in the form of UseHeavyMonitors. It's not ideal, but as Roman wrote, in most scenarios, this does not show any regression. So in a pinch, it could serve as a short-term solution if the new fast lock mechanism is broken. In my opinion, the best time for such an invasive change is the beginning of the development cycle for a non-LTS-release, like now. And we don't have to push the PR in a rush, we can cook it in its branch and review it very thoroughly. Cheers, Thomas ------------- PR: https://git.openjdk.org/jdk/pull/9680 From rehn at openjdk.org Thu Oct 6 07:47:07 2022 From: rehn at openjdk.org (Robbin Ehn) Date: Thu, 6 Oct 2022 07:47:07 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> Message-ID: On Thu, 28 Jul 2022 19:58:34 GMT, Roman Kennke wrote: > This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turns out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. > > What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock. > > This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typcially remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols. > > In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth to add support for recursive fast-locking. > > One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, and thus handing over to the contending thread. > > As an alternative, I considered to remove stack-locking altogether, and only use heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it. > > This change enables to simplify (and speed-up!) a lot of code: > > - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header. > - Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. This is safe because Java threads participate in monitor deflation protocol. This would be implemented in a separate PR > > ### Benchmarks > > All benchmarks are run on server-class metal machines. The JVM settings are always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less is better. > > #### DaCapo/AArch64 > > Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. Benchmarks that showed results far off the baseline or showed high variance have been repeated and I am reporting results with the most bias *against* fast-locking. The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean what is it actually doing that a change in locking shows >30% perf difference? > > benchmark | baseline | fast-locking | % | size > -- | -- | -- | -- | -- > avrora | 27859 | 27563 | 1.07% | large > batik | 20786 | 20847 | -0.29% | large > biojava | 27421 | 27334 | 0.32% | default > eclipse | 59918 | 60522 | -1.00% | large > fop | 3670 | 3678 | -0.22% | default > graphchi | 2088 | 2060 | 1.36% | default > h2 | 297391 | 291292 | 2.09% | huge > jme | 8762 | 8877 | -1.30% | default > jython | 18938 | 18878 | 0.32% | default > luindex | 1339 | 1325 | 1.06% | default > lusearch | 918 | 936 | -1.92% | default > pmd | 58291 | 58423 | -0.23% | large > sunflow | 32617 | 24961 | 30.67% | large > tomcat | 25481 | 25992 | -1.97% | large > tradebeans | 314640 | 311706 | 0.94% | huge > tradesoap | 107473 | 110246 | -2.52% | huge > xalan | 6047 | 5882 | 2.81% | default > zxing | 970 | 926 | 4.75% | default > > #### DaCapo/x86_64 > > The following measurements have been taken on an Intel Xeon Scalable Processors (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same settings and considerations as in the measurements above. > > benchmark | baseline | fast-Locking | % | size > -- | -- | -- | -- | -- > avrora | 127690 | 126749 | 0.74% | large > batik | 12736 | 12641 | 0.75% | large > biojava | 15423 | 15404 | 0.12% | default > eclipse | 41174 | 41498 | -0.78% | large > fop | 2184 | 2172 | 0.55% | default > graphchi | 1579 | 1560 | 1.22% | default > h2 | 227614 | 230040 | -1.05% | huge > jme | 8591 | 8398 | 2.30% | default > jython | 13473 | 13356 | 0.88% | default > luindex | 824 | 813 | 1.35% | default > lusearch | 962 | 968 | -0.62% | default > pmd | 40827 | 39654 | 2.96% | large > sunflow | 53362 | 43475 | 22.74% | large > tomcat | 27549 | 28029 | -1.71% | large > tradebeans | 190757 | 190994 | -0.12% | huge > tradesoap | 68099 | 67934 | 0.24% | huge > xalan | 7969 | 8178 | -2.56% | default > zxing | 1176 | 1148 | 2.44% | default > > #### Renaissance/AArch64 > > This tests Renaissance/JMH version 0.14.1 on same machines as DaCapo above, with same JVM settings. > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 2558.832 | 2513.594 | 1.80% > Reactors | 14715.626 | 14311.246 | 2.83% > Als | 1851.485 | 1869.622 | -0.97% > ChiSquare | 1007.788 | 1003.165 | 0.46% > GaussMix | 1157.491 | 1149.969 | 0.65% > LogRegression | 717.772 | 733.576 | -2.15% > MovieLens | 7916.181 | 8002.226 | -1.08% > NaiveBayes | 395.296 | 386.611 | 2.25% > PageRank | 4294.939 | 4346.333 | -1.18% > FjKmeans | 519.2 | 498.357 | 4.18% > FutureGenetic | 2578.504 | 2589.255 | -0.42% > Mnemonics | 4898.886 | 4903.689 | -0.10% > ParMnemonics | 4260.507 | 4210.121 | 1.20% > Scrabble | 139.37 | 138.312 | 0.76% > RxScrabble | 320.114 | 322.651 | -0.79% > Dotty | 1056.543 | 1068.492 | -1.12% > ScalaDoku | 3443.117 | 3449.477 | -0.18% > ScalaStmBench7 | 1102.43 | 1115.142 | -1.14% > FinagleChirper | 6814.192 | 6853.38 | -0.57% > FinagleHttp | 4762.902 | 4807.564 | -0.93% > > #### Renaissance/x86_64 > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 1117.185 | 1116.425 | 0.07% > Reactors | 11561.354 | 11812.499 | -2.13% > Als | 1580.838 | 1575.318 | 0.35% > ChiSquare | 459.601 | 467.109 | -1.61% > GaussMix | 705.944 | 685.595 | 2.97% > LogRegression | 659.944 | 656.428 | 0.54% > MovieLens | 7434.303 | 7592.271 | -2.08% > NaiveBayes | 413.482 | 417.369 | -0.93% > PageRank | 3259.233 | 3276.589 | -0.53% > FjKmeans | 946.429 | 938.991 | 0.79% > FutureGenetic | 1760.672 | 1815.272 | -3.01% > Scrabble | 147.996 | 150.084 | -1.39% > RxScrabble | 177.755 | 177.956 | -0.11% > Dotty | 673.754 | 683.919 | -1.49% > ScalaKmeans | 165.376 | 168.925 | -2.10% > ScalaStmBench7 | 1080.187 | 1049.184 | 2.95% > > Some renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and probably addressing with running them much more often. > > Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here. > > ### Testing > - [x] tier1 (x86_64, aarch64, x86_32) > - [x] tier2 (x86_64, aarch64) > - [x] tier3 (x86_64, aarch64) > - [x] tier4 (x86_64, aarch64) I ran some test locally, 4 JDI fails and 3 JVM TI, all seems to fail in: #7 0x00007f7cefc5c1ce in Thread::is_lock_owned (this=this at entry=0x7f7ce801dd90, adr=adr at entry=0x1 ) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/runtime/thread.cpp:549 #8 0x00007f7cef22c062 in JavaThread::is_lock_owned (this=0x7f7ce801dd90, adr=0x1 ) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/runtime/javaThread.cpp:979 #9 0x00007f7cefc79ab0 in Threads::owning_thread_from_monitor_owner (t_list=, owner=owner at entry=0x1 ) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/runtime/threads.cpp:1382 I didn't realize you still also is using the frame basic lock area. (in other projects this is removed and all cases are handled via the threads lock stack) So essentially we have two lock stacks when running in interpreter the frame area and the LockStack. That explains why I have not heard anything about popframe and friends :) ------------- PR: https://git.openjdk.org/jdk/pull/9680 From rehn at openjdk.org Thu Oct 6 07:47:09 2022 From: rehn at openjdk.org (Robbin Ehn) Date: Thu, 6 Oct 2022 07:47:09 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com>

Message-ID: On Mon, 8 Aug 2022 18:29:54 GMT, Roman Kennke wrote: > > I ran some test locally, 4 JDI fails and 3 JVM TI, all seems to fail in: > > ``` > > #7 0x00007f7cefc5c1ce in Thread::is_lock_owned (this=this at entry=0x7f7ce801dd90, adr=adr at entry=0x1 ) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/runtime/thread.cpp:549 > > #8 0x00007f7cef22c062 in JavaThread::is_lock_owned (this=0x7f7ce801dd90, adr=0x1 ) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/runtime/javaThread.cpp:979 > > #9 0x00007f7cefc79ab0 in Threads::owning_thread_from_monitor_owner (t_list=, owner=owner at entry=0x1 ) > > at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/runtime/threads.cpp:1382 > > ``` > > Thanks, Robbin! That was a bug in JvmtiBase::get_owning_thread() where an anonymous owner must be converted to the oop address before passing down to Threads::owning_thread_from_monitor_owner(). I pushed a fix. Can you re-test? Testing com/sun/jdi passes for me, now. Yes, that fixed it. I'm running more tests also. I got this build problem on aarch64: open/src/hotspot/share/asm/assembler.hpp:168), pid=3387376, tid=3387431 # assert(is_bound() || is_unused()) failed: Label was never bound to a location, but it was used as a jmp target V [libjvm.so+0x4f4788] Label::~Label()+0x48 V [libjvm.so+0x424a44] cmpFastLockNode::emit(CodeBuffer&, PhaseRegAlloc*) const+0x764 V [libjvm.so+0x1643888] PhaseOutput::fill_buffer(CodeBuffer*, unsigned int*)+0x538 V [libjvm.so+0xa85fcc] Compile::Code_Gen()+0x3bc ------------- PR: https://git.openjdk.org/jdk/pull/9680 From rkennke at openjdk.org Thu Oct 6 07:47:10 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 6 Oct 2022 07:47:10 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com>

Message-ID: On Tue, 9 Aug 2022 09:19:54 GMT, Robbin Ehn wrote: > I got this build problem on aarch64: Thanks for giving this PR a spin. I pushed a fix for the aarch64 build problem (seems weird that GHA did not catch it). ------------- PR: https://git.openjdk.org/jdk/pull/9680 From rkennke at openjdk.org Thu Oct 6 07:47:06 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 6 Oct 2022 07:47:06 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> Message-ID: <8MGsPdlBSWGR-pgF8_fLo_mez67z7nHWXg8UOcjJxIY=.38bd9c0f-3ba0-4ebe-867d-b54608f01e63@github.com> On Mon, 8 Aug 2022 12:14:38 GMT, David Holmes wrote: > The bar for acceptance for a brand new locking scheme with no fallback is extremely high and needs a lot of bake time and broad performance measurements, to watch for pathologies. That bar is lower if the scheme can be reverted to the old code if needed; and even lower still if the scheme is opt-in in the first place. For Java Object Monitors I made the new mechanism opt-in so the same could be done here. Granted it is not a trivial effort to do that, but I think a phased approach to transition to the new scheme is essential. It could be implemented as an experimental feature initially. Reverting a change should not be difficult. (Unless maybe another major change arrived in the meantime, which makes reverse-applying a patch non-trivial.) I'm skeptical to implement an opt-in runtime-switch, though. - Keeping the old paths side-by-side with the new paths is an engineering effort in itself, as you point out. It means that it, too, introduces significant risks to break locking, one way or the other (or both). - Making the new path opt-in means that we achieve almost nothing by it: testing code would still normally run the old paths (hopefully we didn't break it by making the change), and only use the new paths when explicitely told so, and I don't expect that many people voluntarily do that. It *may* be more useful to make it opt-out, as a quick fix if anybody experiences troubles with it. - Do we need runtime-switchable opt-in or opt-out flag for the initial testing and baking? I wouldn't think so: it seems better and cleaner to take the Git branch of this PR and put it through all relevant testing before the change goes in. - For how long do you think the runtime switch should stay? Because if it's all but temporary, it means we better test both paths thoroughly and automated. And it may also mean extra maintenance work (with extra avenues for bugs, see above), too. > I am not aware, please refresh my memory if you know different, of any core hotspot subsystem just being replaced in one fell swoop in one single release. Yes this needs a lot of testing but customers are not beta-testers. If this goes into a release on by default then there must be a way for customers to turn it off. UseHeavyMonitors is not a fallback as it is not for production use itself. So the new code has to co-exist along-side the old code as we make a transition across 2-3 releases. And yes that means a double-up on some testing as we already do for many things. I believe the least risky path overall is to make UseHeavyMonitors a production flag. Then it can act as a kill-switch for the new locking code, should anything go bad. I even considered to remove stack-locking altogether, and could only show minor performance impact, and always only in code that uses obsolete synchronized Java collections like Vector, Stack and StringBuffer. If you'd argue that it's too risky to use UseHeavyMonitors for that - then certainly you understand that the risk of introducing a new flag and manage two stack-locking subsystems would be even higher. There's a lot of code that is risky in itself to keep both paths. For example, I needed to change register allocation in the C2 .ad declarations and also in the interpreter/generated assembly code. It's hard enough to see that it is correct for one of the implementations, and much harder to implement and verify this correctly for two. > Any fast locking scheme benefits the uncontended sync case. So if you have a lot of contention and therefore a lot of inflation, the fast locking won't show any benefit. Not only that. As far as I can tell, 'heavy monitors' would only be worse off in workloads that 1. use uncontended sync and 2. churns monitors. Lots of uncontended sync on the same monitor object is not actually worse than fast-locking (it boils down to a single CAS in both cases). It only gets bad when code keeps allocating short-lived objects and syncs on them once or a few times only, and then moves on to the next new sync objects. > What "modern workloads" are you using to measure this? So far I tested with SPECjbb and SPECjvm-workloads-transplanted-into-JMH, dacapo and renaissance. I could only measure regressions with heavy monitors in workloads that use XML/XSLT, which I found out is because the XSTL compiler generates code that uses StringBuffer for (single-threaded) parsing. I also found a few other places in XML where usage of Stack and Vector has some impact. I can provide fixes for those, if needed (but I'm not sure whether this should go into JDK, upstream Xalan/Xerxes or both). > We eventually got rid of biased-locking because it no longer showed any benefit, so it is possible that fast locking (of whichever form) could go the same way. And we may have moved past heavy use of synchronized in general for that matter, especially as Loom instigated many changes over to java.util.concurrent locks. Yup. > Is UseHeavyMonitors in good enough shape to reliably be used for benchmark comparisons? Yes, except that the flag would have to be made product. Also, it is useful to use this PR instead of upstream JDK, because it simplifies the inflation protocol pretty much like it would be simplified without any stack-locking. I can make a standalone PR that gets rid of stack-locking altogether, if that is useful. Also keep in mind that both this fast-locking PR and total removal of stack-locking would enable some follow-up improvements: we'd no longer have to inflate monitors in order to install or read an i-hashcode. And GC code similarily may benefit from easier read/write of object age bits. This might benefit generational concurrent GC efforts. ------------- PR: https://git.openjdk.org/jdk/pull/9680 From rkennke at openjdk.org Thu Oct 6 07:47:08 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 6 Oct 2022 07:47:08 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> Message-ID: On Mon, 8 Aug 2022 15:44:50 GMT, Robbin Ehn wrote: > I ran some test locally, 4 JDI fails and 3 JVM TI, all seems to fail in: > > ``` > #7 0x00007f7cefc5c1ce in Thread::is_lock_owned (this=this at entry=0x7f7ce801dd90, adr=adr at entry=0x1 ) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/runtime/thread.cpp:549 > #8 0x00007f7cef22c062 in JavaThread::is_lock_owned (this=0x7f7ce801dd90, adr=0x1 ) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/runtime/javaThread.cpp:979 > #9 0x00007f7cefc79ab0 in Threads::owning_thread_from_monitor_owner (t_list=, owner=owner at entry=0x1 ) > at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/runtime/threads.cpp:1382 > ``` Thanks, Robbin! That was a bug in JvmtiBase::get_owning_thread() where an anonymous owner must be converted to the oop address before passing down to Threads::owning_thread_from_monitor_owner(). I pushed a fix. Can you re-test? Testing com/sun/jdi passes for me, now. > I didn't realize you still also is using the frame basic lock area. (in other projects this is removed and all cases are handled via the threads lock stack) So essentially we have two lock stacks when running in interpreter the frame area and the LockStack. > > That explains why I have not heard anything about popframe and friends :) Hmm yeah, I also realized this recently :-D I will have to clean this up before going further. And I'll also will work to support the unstructured locking in the interpreter. ------------- PR: https://git.openjdk.org/jdk/pull/9680 From rehn at openjdk.org Thu Oct 6 07:47:11 2022 From: rehn at openjdk.org (Robbin Ehn) Date: Thu, 6 Oct 2022 07:47:11 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com>

Message-ID: On Tue, 9 Aug 2022 10:46:51 GMT, Roman Kennke wrote: > Thanks for giving this PR a spin. I pushed a fix for the aarch64 build problem (seems weird that GHA did not catch it). NP, thanks. I notice some other user of owning_thread_from_monitor_owner() such as DeadlockCycle::print_on_with() which asserts on "assert(adr != reinterpret_cast(1)) failed: must convert to lock object". ------------- PR: https://git.openjdk.org/jdk/pull/9680 From rkennke at openjdk.org Thu Oct 6 07:47:12 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 6 Oct 2022 07:47:12 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: <8MGsPdlBSWGR-pgF8_fLo_mez67z7nHWXg8UOcjJxIY=.38bd9c0f-3ba0-4ebe-867d-b54608f01e63@github.com> References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> <8MGsPdlBSWGR-pgF8_fLo_mez67z7nHWXg8UOcjJxIY=.38bd9c0f-3ba0-4ebe-867d-b54608f01e63@github.com> Message-ID: <9P1YMHdrh0hBVSsynwUQ5PVpU14yaF5V-00H5uWGLek=.fc7c13d9-3601-4f2e-8846-0b66eb0a13df@github.com> On Tue, 9 Aug 2022 09:32:47 GMT, Roman Kennke wrote: > I am not aware, please refresh my memory if you know different, of any core hotspot subsystem just being replaced in one fell swoop in one single release. Yes this needs a lot of testing but customers are not beta-testers. If this goes into a release on by default then there must be a way for customers to turn it off. UseHeavyMonitors is not a fallback as it is not for production use itself. So the new code has to co-exist along-side the old code as we make a transition across 2-3 releases. And yes that means a double-up on some testing as we already do for many things. Maybe it's worth to step back a little and discuss whether or not we actually want stack-locking (or a replacement) *at all*. My measurements seem to indicate that a majority of modern workloads (i.e. properly synchronized, not using legacy collections) actually benefit from running without stack-locking (or the fast-locking replacement). The workloads that suffer seem to be only such workloads which make heavy use of always-synchronized collections, code that we'd nowadays probably not consider 'idiomatic Java' anymore. This means that support for faster legacy code costs modern Java code actual performance points. Do we really want this? It may be wiser overall to simply drop stack-locking without replacement, and go and fix the identified locations where using of legacy collections affects performance negatively in the JDK (I found a few places in XML/XSLT code, for example). I am currently re-running my benchmarks to show this. ------------- PR: https://git.openjdk.org/jdk/pull/9680 From rkennke at openjdk.org Thu Oct 6 07:47:13 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 6 Oct 2022 07:47:13 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com>

Message-ID: On Tue, 9 Aug 2022 11:05:45 GMT, Robbin Ehn wrote: > > Thanks for giving this PR a spin. I pushed a fix for the aarch64 build problem (seems weird that GHA did not catch it). > > NP, thanks. I notice some other user of owning_thread_from_monitor_owner() such as DeadlockCycle::print_on_with() which asserts on "assert(adr != reinterpret_cast(1)) failed: must convert to lock object". Do you know by any chance which tests trigger this? ------------- PR: https://git.openjdk.org/jdk/pull/9680 From rehn at openjdk.org Thu Oct 6 07:47:13 2022 From: rehn at openjdk.org (Robbin Ehn) Date: Thu, 6 Oct 2022 07:47:13 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com>

Message-ID: <4eWUSTg-0XMN2ON3FYCM5uAUeIlarRNXgPJquBCXTQs=.5272e1a1-dcbb-4360-bb6d-9c0bc9d35313@github.com> On Thu, 11 Aug 2022 11:19:31 GMT, Roman Kennke wrote: > > > Thanks for giving this PR a spin. I pushed a fix for the aarch64 build problem (seems weird that GHA did not catch it). > > > > > > NP, thanks. I notice some other user of owning_thread_from_monitor_owner() such as DeadlockCycle::print_on_with() which asserts on "assert(adr != reinterpret_cast(1)) failed: must convert to lock object". > > Do you know by any chance which tests trigger this? Yes, there is a couple of to choose from, I think the jstack cmd may be easiest: jstack/DeadlockDetectionTest.java ------------- PR: https://git.openjdk.org/jdk/pull/9680 From rkennke at openjdk.org Thu Oct 6 07:47:14 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 6 Oct 2022 07:47:14 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> Message-ID: On Thu, 28 Jul 2022 19:58:34 GMT, Roman Kennke wrote: > This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turns out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. > > What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock. > > This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typcially remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols. > > In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth to add support for recursive fast-locking. > > One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, and thus handing over to the contending thread. > > As an alternative, I considered to remove stack-locking altogether, and only use heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it. > > This change enables to simplify (and speed-up!) a lot of code: > > - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header. > - Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. This is safe because Java threads participate in monitor deflation protocol. This would be implemented in a separate PR > > ### Benchmarks > > All benchmarks are run on server-class metal machines. The JVM settings are always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less is better. > > #### DaCapo/AArch64 > > Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. Benchmarks that showed results far off the baseline or showed high variance have been repeated and I am reporting results with the most bias *against* fast-locking. The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean what is it actually doing that a change in locking shows >30% perf difference? > > benchmark | baseline | fast-locking | % | size > -- | -- | -- | -- | -- > avrora | 27859 | 27563 | 1.07% | large > batik | 20786 | 20847 | -0.29% | large > biojava | 27421 | 27334 | 0.32% | default > eclipse | 59918 | 60522 | -1.00% | large > fop | 3670 | 3678 | -0.22% | default > graphchi | 2088 | 2060 | 1.36% | default > h2 | 297391 | 291292 | 2.09% | huge > jme | 8762 | 8877 | -1.30% | default > jython | 18938 | 18878 | 0.32% | default > luindex | 1339 | 1325 | 1.06% | default > lusearch | 918 | 936 | -1.92% | default > pmd | 58291 | 58423 | -0.23% | large > sunflow | 32617 | 24961 | 30.67% | large > tomcat | 25481 | 25992 | -1.97% | large > tradebeans | 314640 | 311706 | 0.94% | huge > tradesoap | 107473 | 110246 | -2.52% | huge > xalan | 6047 | 5882 | 2.81% | default > zxing | 970 | 926 | 4.75% | default > > #### DaCapo/x86_64 > > The following measurements have been taken on an Intel Xeon Scalable Processors (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same settings and considerations as in the measurements above. > > benchmark | baseline | fast-Locking | % | size > -- | -- | -- | -- | -- > avrora | 127690 | 126749 | 0.74% | large > batik | 12736 | 12641 | 0.75% | large > biojava | 15423 | 15404 | 0.12% | default > eclipse | 41174 | 41498 | -0.78% | large > fop | 2184 | 2172 | 0.55% | default > graphchi | 1579 | 1560 | 1.22% | default > h2 | 227614 | 230040 | -1.05% | huge > jme | 8591 | 8398 | 2.30% | default > jython | 13473 | 13356 | 0.88% | default > luindex | 824 | 813 | 1.35% | default > lusearch | 962 | 968 | -0.62% | default > pmd | 40827 | 39654 | 2.96% | large > sunflow | 53362 | 43475 | 22.74% | large > tomcat | 27549 | 28029 | -1.71% | large > tradebeans | 190757 | 190994 | -0.12% | huge > tradesoap | 68099 | 67934 | 0.24% | huge > xalan | 7969 | 8178 | -2.56% | default > zxing | 1176 | 1148 | 2.44% | default > > #### Renaissance/AArch64 > > This tests Renaissance/JMH version 0.14.1 on same machines as DaCapo above, with same JVM settings. > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 2558.832 | 2513.594 | 1.80% > Reactors | 14715.626 | 14311.246 | 2.83% > Als | 1851.485 | 1869.622 | -0.97% > ChiSquare | 1007.788 | 1003.165 | 0.46% > GaussMix | 1157.491 | 1149.969 | 0.65% > LogRegression | 717.772 | 733.576 | -2.15% > MovieLens | 7916.181 | 8002.226 | -1.08% > NaiveBayes | 395.296 | 386.611 | 2.25% > PageRank | 4294.939 | 4346.333 | -1.18% > FjKmeans | 519.2 | 498.357 | 4.18% > FutureGenetic | 2578.504 | 2589.255 | -0.42% > Mnemonics | 4898.886 | 4903.689 | -0.10% > ParMnemonics | 4260.507 | 4210.121 | 1.20% > Scrabble | 139.37 | 138.312 | 0.76% > RxScrabble | 320.114 | 322.651 | -0.79% > Dotty | 1056.543 | 1068.492 | -1.12% > ScalaDoku | 3443.117 | 3449.477 | -0.18% > ScalaStmBench7 | 1102.43 | 1115.142 | -1.14% > FinagleChirper | 6814.192 | 6853.38 | -0.57% > FinagleHttp | 4762.902 | 4807.564 | -0.93% > > #### Renaissance/x86_64 > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 1117.185 | 1116.425 | 0.07% > Reactors | 11561.354 | 11812.499 | -2.13% > Als | 1580.838 | 1575.318 | 0.35% > ChiSquare | 459.601 | 467.109 | -1.61% > GaussMix | 705.944 | 685.595 | 2.97% > LogRegression | 659.944 | 656.428 | 0.54% > MovieLens | 7434.303 | 7592.271 | -2.08% > NaiveBayes | 413.482 | 417.369 | -0.93% > PageRank | 3259.233 | 3276.589 | -0.53% > FjKmeans | 946.429 | 938.991 | 0.79% > FutureGenetic | 1760.672 | 1815.272 | -3.01% > Scrabble | 147.996 | 150.084 | -1.39% > RxScrabble | 177.755 | 177.956 | -0.11% > Dotty | 673.754 | 683.919 | -1.49% > ScalaKmeans | 165.376 | 168.925 | -2.10% > ScalaStmBench7 | 1080.187 | 1049.184 | 2.95% > > Some renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and probably addressing with running them much more often. > > Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here. > > ### Testing > - [x] tier1 (x86_64, aarch64, x86_32) > - [x] tier2 (x86_64, aarch64) > - [x] tier3 (x86_64, aarch64) > - [x] tier4 (x86_64, aarch64) I added implementation for arm, ppc and s390 blindly. @shipilev, @tstuefe maybe you could sanity-check them? most likely they are buggy. I also haven't checked riscv at all, yet. ------------- PR: https://git.openjdk.org/jdk/pull/9680 From rkennke at openjdk.org Thu Oct 6 07:47:15 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 6 Oct 2022 07:47:15 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: <4eWUSTg-0XMN2ON3FYCM5uAUeIlarRNXgPJquBCXTQs=.5272e1a1-dcbb-4360-bb6d-9c0bc9d35313@github.com> References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com>

<4eWUSTg-0XMN2ON3FYCM5uAUeIlarRNXgPJquBCXTQs=.5272e1a1-dcbb-4360-bb6d-9c0bc9d35313@github.com> Message-ID: On Thu, 11 Aug 2022 11:39:01 GMT, Robbin Ehn wrote: >>> > Thanks for giving this PR a spin. I pushed a fix for the aarch64 build problem (seems weird that GHA did not catch it). >>> >>> NP, thanks. I notice some other user of owning_thread_from_monitor_owner() such as DeadlockCycle::print_on_with() which asserts on "assert(adr != reinterpret_cast(1)) failed: must convert to lock object". >> >> Do you know by any chance which tests trigger this? > >> > > Thanks for giving this PR a spin. I pushed a fix for the aarch64 build problem (seems weird that GHA did not catch it). >> > >> > >> > NP, thanks. I notice some other user of owning_thread_from_monitor_owner() such as DeadlockCycle::print_on_with() which asserts on "assert(adr != reinterpret_cast(1)) failed: must convert to lock object". >> >> Do you know by any chance which tests trigger this? > > Yes, there is a couple of to choose from, I think the jstack cmd may be easiest: jstack/DeadlockDetectionTest.java > @robehn or @dholmes-ora I believe one of you mentioned somewhere (can't find the comment, though) that we might need to support the bytecode sequence monitorenter A; monitorenter B; monitorexit A; monitorexit B; properly. I have now made a testcase that checks this, and it does indeed fail with this PR, while passing with upstream. Also, the JVM spec doesn't mention anywhere that it is required that monitorenter/exit are properly nested. I'll have to fix this in the interpreter (JIT compilers refuse to compile not-properly-nested monitorenter/exit anyway). > > See https://github.com/rkennke/jdk/blob/fast-locking/test/hotspot/jtreg/runtime/locking/TestUnstructuredLocking.jasm jvms-2.11.10 > Structured locking is the situation when, during a method invocation, every exit on a given monitor matches a preceding entry on that monitor. Since there is no assurance that all code submitted to the Java Virtual Machine will perform structured locking, implementations of the Java Virtual Machine are permitted but not required to enforce both of the following two rules guaranteeing structured locking. Let T be a thread and M be a monitor. Then: > > The number of monitor entries performed by T on M during a method invocation must equal the number of monitor exits performed by T on M during the method invocation whether the method invocation completes normally or abruptly. > > At no point during a method invocation may the number of monitor exits performed by T on M since the method invocation exceed the number of monitor entries performed by T on M since the method invocation. > > Note that the monitor entry and exit automatically performed by the Java Virtual Machine when invoking a synchronized method are considered to occur during the calling method's invocation. I think the intent of above was to allow enforcing structured locking. In relevant other projects, we support only structured locking in Java, but permit some unstructured locking when done via JNI. In that project JNI monitor enter/exit do not use the lockstack. I don't think we today fully support unstructured locking either: void foo_lock() { monitorenter(this); // If VM abruptly returns here 'this' will be unlocked // Because VM assumes structured locking. // see e.g. remove_activation(...) } *I scratch this as it was a bit off topic.* ------------- PR: https://git.openjdk.org/jdk/pull/9680 From rkennke at openjdk.org Thu Oct 6 07:47:16 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 6 Oct 2022 07:47:16 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com>

<4eWUSTg-0XMN2ON3FYCM5uAUeIlarRNXgPJquBCXTQs=.5272e1a1-dcbb-4360-bb6d-9c0bc9d35313@github.com> Message-ID: On Tue, 16 Aug 2022 15:47:58 GMT, Robbin Ehn wrote: > > @robehn or @dholmes-ora I believe one of you mentioned somewhere (can't find the comment, though) that we might need to support the bytecode sequence monitorenter A; monitorenter B; monitorexit A; monitorexit B; properly. I have now made a testcase that checks this, and it does indeed fail with this PR, while passing with upstream. Also, the JVM spec doesn't mention anywhere that it is required that monitorenter/exit are properly nested. I'll have to fix this in the interpreter (JIT compilers refuse to compile not-properly-nested monitorenter/exit anyway). > > See https://github.com/rkennke/jdk/blob/fast-locking/test/hotspot/jtreg/runtime/locking/TestUnstructuredLocking.jasm > > jvms-2.11.10 > > > Structured locking is the situation when, during a method invocation, every exit on a given monitor matches a preceding entry on that monitor. Since there is no assurance that all code submitted to the Java Virtual Machine will perform structured locking, implementations of the Java Virtual Machine are permitted but not required to enforce both of the following two rules guaranteeing structured locking. Let T be a thread and M be a monitor. Then: > > The number of monitor entries performed by T on M during a method invocation must equal the number of monitor exits performed by T on M during the method invocation whether the method invocation completes normally or abruptly. > > At no point during a method invocation may the number of monitor exits performed by T on M since the method invocation exceed the number of monitor entries performed by T on M since the method invocation. > > Note that the monitor entry and exit automatically performed by the Java Virtual Machine when invoking a synchronized method are considered to occur during the calling method's invocation. > > I think the intent of above was to allow enforcing structured locking. TBH, I don't see how this affects the scenario that I'm testing. The scenario: monitorenter A; monitorenter B; monitorexit A; monitorexit B; violates any of the two conditions: - the number of monitorenters and -exits during the execution always matches - the number of monitorexits for each monitor does not exceed the number of monitorenters for the same monitor Strictly speaking, I believe the conditions check for the (weaker) balanced property, but not for the (stronger) structured property. > In relevant other projects, we support only structured locking in Java, but permit some unstructured locking when done via JNI. In that project JNI monitor enter/exit do not use the lockstack. Yeah, JNI locking always inflate and uses full monitors. My proposal hasn't changed this. > I don't think we today fully support unstructured locking either: > > ``` > void foo_lock() { > monitorenter(this); > // If VM abruptly returns here 'this' will be unlocked > // Because VM assumes structured locking. > // see e.g. remove_activation(...) > } > ``` > > _I scratch this as it was a bit off topic._ Hmm yeah, this is required for properly handling exceptions. I have seen this making a bit of a mess in C1 code. That said, unstructured locking today only ever works in the interpreter, the JIT compilers would refuse to compile unstructured locking code. So if somebody would come up with a language and compiler that emits unstructured (e.g. hand-over-hand) locks, it would run, but only very slowly. I think I know how to make my proposal handle unstructured locking properly: In the interpreter monitorexit, I can check the top of the lock-stack, and if it doesn't match, call into the runtime, and there it's easy to implement the unstructured scenario. ------------- PR: https://git.openjdk.org/jdk/pull/9680 From rehn at openjdk.org Thu Oct 6 07:47:17 2022 From: rehn at openjdk.org (Robbin Ehn) Date: Thu, 6 Oct 2022 07:47:17 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com>

<4eWUSTg-0XMN2ON3FYCM5uAUeIlarRNXgPJquBCXTQs=.5272e1a1-dcbb-4360-bb6d-9c0bc9d35313@github.com>

Message-ID: On Tue, 16 Aug 2022 16:21:04 GMT, Roman Kennke wrote: > Strictly speaking, I believe the conditions check for the (weaker) balanced property, but not for the (stronger) structured property. I know but the text says: - "every exit on a given monitor matches a preceding entry on that monitor." - "implementations of the Java Virtual Machine are permitted but not required to enforce both of the following two rules guaranteeing structured locking" I read this as if the rules do not guarantee structured locking the rules are not correct. The VM is allowed to enforce it. But thats just my take on it. EDIT: Maybe I'm reading to much into it. Lock A,B then unlock A,B maybe is considered structured locking? But then again what if: void foo_lock() { monitorenter(A); monitorenter(B); // If VM abruptly returns here // VM can unlock them in reverse order first B and then A ? monitorexit(A); monitorexit(B); } ------------- PR: https://git.openjdk.org/jdk/pull/9680 From rkennke at openjdk.org Thu Oct 6 07:47:17 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 6 Oct 2022 07:47:17 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com>

<4eWUSTg-0XMN2ON3FYCM5uAUeIlarRNXgPJquBCXTQs=.5272e1a1-dcbb-4360-bb6d-9c0bc9d35313@github.com>

Message-ID: On Wed, 17 Aug 2022 07:29:23 GMT, Robbin Ehn wrote: > > Strictly speaking, I believe the conditions check for the (weaker) balanced property, but not for the (stronger) structured property. > > I know but the text says: > > * "every exit on a given monitor matches a preceding entry on that monitor." > > * "implementations of the Java Virtual Machine are permitted but not required to enforce both of the following two rules guaranteeing structured locking" > > > I read this as if the rules do not guarantee structured locking the rules are not correct. The VM is allowed to enforce it. But thats just my take on it. > > EDIT: Maybe I'm reading to much into it. Lock A,B then unlock A,B maybe is considered structured locking? > > But then again what if: > > ``` > void foo_lock() { > monitorenter(A); > monitorenter(B); > // If VM abruptly returns here > // VM can unlock them in reverse order first B and then A ? > monitorexit(A); > monitorexit(B); > } > ``` Do you think there would be any chance to clarify the spec there? Or even outright disallow unstructured/not-properly-nested locking altogether (and maybe allow the verifier to check it)? That would certainly be the right thing to do. And, afaict, it would do no harm because no compiler of any language would ever emit unstructured locking anyway - because if it did, the resulting code would crawl interpreted-only). ------------- PR: https://git.openjdk.org/jdk/pull/9680 From kvn at openjdk.org Thu Oct 6 07:47:18 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 6 Oct 2022 07:47:18 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com>

<4eWUSTg-0XMN2ON3FYCM5uAUeIlarRNXgPJquBCXTQs=.5272e1a1-dcbb-4360-bb6d-9c0bc9d35313@github.com>

Message-ID: On Wed, 17 Aug 2022 15:34:01 GMT, Roman Kennke wrote: >>> Strictly speaking, I believe the conditions check for the (weaker) balanced property, but not for the (stronger) structured property. >> >> I know but the text says: >> - "every exit on a given monitor matches a preceding entry on that monitor." >> - "implementations of the Java Virtual Machine are permitted but not required to enforce both of the following two rules guaranteeing structured locking" >> >> I read this as if the rules do not guarantee structured locking the rules are not correct. >> The VM is allowed to enforce it. >> But thats just my take on it. >> >> EDIT: >> Maybe I'm reading to much into it. >> Lock A,B then unlock A,B maybe is considered structured locking? >> >> But then again what if: >> >> >> void foo_lock() { >> monitorenter(A); >> monitorenter(B); >> // If VM abruptly returns here >> // VM can unlock them in reverse order first B and then A ? >> monitorexit(A); >> monitorexit(B); >> } > >> > Strictly speaking, I believe the conditions check for the (weaker) balanced property, but not for the (stronger) structured property. >> >> I know but the text says: >> >> * "every exit on a given monitor matches a preceding entry on that monitor." >> >> * "implementations of the Java Virtual Machine are permitted but not required to enforce both of the following two rules guaranteeing structured locking" >> >> >> I read this as if the rules do not guarantee structured locking the rules are not correct. The VM is allowed to enforce it. But thats just my take on it. >> >> EDIT: Maybe I'm reading to much into it. Lock A,B then unlock A,B maybe is considered structured locking? >> >> But then again what if: >> >> ``` >> void foo_lock() { >> monitorenter(A); >> monitorenter(B); >> // If VM abruptly returns here >> // VM can unlock them in reverse order first B and then A ? >> monitorexit(A); >> monitorexit(B); >> } >> ``` > > Do you think there would be any chance to clarify the spec there? Or even outright disallow unstructured/not-properly-nested locking altogether (and maybe allow the verifier to check it)? That would certainly be the right thing to do. And, afaict, it would do no harm because no compiler of any language would ever emit unstructured locking anyway - because if it did, the resulting code would crawl interpreted-only). We need to understand performance effects of these changes. I don't see data here or new JMH benchmarks which can show data. @rkennke can you show data you have? And, please, update RFE description with what you have in PR description. @ericcaspole do we have JMH benchmarks to test performance for different lock scenarios? I see few tests in `test/micro` which use `synchronized`. Are they enough? Or we need more? Do we have internal benchmarks we could use for such testing? I would prefer to have "opt-in" but looking on scope of changes it may introduce more issues. Without "opt-in" I want performance comparison of VMs with different implementation instead of using `UseHeavyMonitors` to make judgement about this implementation. `UseHeavyMonitors` (product flag) should be tested separately to make sure when it is used as fallback mechanism by customers they would not get significant performance penalty. I agree with @tstuefe that we should test this PR a lot (all tiers on all supported platforms) including performance testing before integration. In addition we need full testing of this implementation with `UseHeavyMonitors` ON. And I should repeat that integration happens when changes are ready (no issues). We should not rush for particular JDK release. ------------- PR: https://git.openjdk.org/jdk/pull/9680 From rkennke at openjdk.org Thu Oct 6 07:47:19 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 6 Oct 2022 07:47:19 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com>

Message-ID: On Tue, 30 Aug 2022 11:52:24 GMT, Roman Kennke wrote: >> I didn't realize you still also is using the frame basic lock area. (in other projects this is removed and all cases are handled via the threads lock stack) >> So essentially we have two lock stacks when running in interpreter the frame area and the LockStack. >> >> That explains why I have not heard anything about popframe and friends :) > >> I didn't realize you still also is using the frame basic lock area. (in other projects this is removed and all cases are handled via the threads lock stack) So essentially we have two lock stacks when running in interpreter the frame area and the LockStack. >> >> That explains why I have not heard anything about popframe and friends :) > > Hmm yeah, I also realized this recently :-D > I will have to clean this up before going further. And I'll also will work to support the unstructured locking in the interpreter. > We need to understand performance effects of these changes. I don't see data here or new JMH benchmarks which can show data. @rkennke can you show data you have? And, please, update RFE description with what you have in PR description. I did run macro benchmarks (SPECjvm, SPECjbb, renaissance, dacapo) and there performance is most often <1% from baseline, some better, some worse. However, I noticed that I made a mistake in my benchmark setup, and I have to re-run them again. So far it doesn't look like the results will be much different - only more reliable. Before I do proper re-runs, I first want to work on removing the interpreter lock-stack, and also to support 'weird' locking (see discussion above). I don't expect those to affect performance very much, because it will only change the interpreter paths. I haven't run any microbenchmarks, yet, but it may be useful. If you have any, please point me in the direction. > I would prefer to have "opt-in" but looking on scope of changes it may introduce more issues. Without "opt-in" I want performance comparison of VMs with different implementation instead of using `UseHeavyMonitors` to make judgement about this implementation. `UseHeavyMonitors` (product flag) should be tested separately to make sure when it is used as fallback mechanism by customers they would not get significant performance penalty. Yes, I can do that. > I agree with @tstuefe that we should test this PR a lot (all tiers on all supported platforms) including performance testing before integration. In addition we need full testing of this implementation with `UseHeavyMonitors` ON. Ok. I'd also suggest to run relevant (i.e. what relates to synchronized) jcstress tests. > And I should repeat that integration happens when changes are ready (no issues). We should not rush for particular JDK release. Sure, I am not planning on rushing this. ;-) > I didn't realize you still also is using the frame basic lock area. (in other projects this is removed and all cases are handled via the threads lock stack) So essentially we have two lock stacks when running in interpreter the frame area and the LockStack. > > That explains why I have not heard anything about popframe and friends :) I'm now wondering if what I kinda accidentally did there is not the sane thing to do. The 'real' lock-stack (the one that I added) holds all the (fast-)locked oops. The frame basic lock area also holds oops now (before it was oop-lock pairs), and in addition to the per-thread lock-stack it also holds the association frame->locks, which is useful when popping interpreter frames, so that we can exit all active locks easily. C1 and C2 don't need this, because 1. the monitor enter and exit there is always symmetric and 2. they have their own and more efficient ways to remove activations. How have you handled the interpreter lock-stack-area in your implementation? Is it worth to get rid of it and consolidate with the per-thread lock-stack? ------------- PR: https://git.openjdk.org/jdk/pull/9680 From rehn at openjdk.org Thu Oct 6 07:47:20 2022 From: rehn at openjdk.org (Robbin Ehn) Date: Thu, 6 Oct 2022 07:47:20 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com>

Message-ID: On Fri, 9 Sep 2022 19:01:14 GMT, Roman Kennke wrote: > How have you handled the interpreter lock-stack-area in your implementation? Is it worth to get rid of it and consolidate with the per-thread lock-stack? At the moment I had to store a "frame id" for each entry in the lock stack. The frame id is previous fp, grabbed from "link()" when entering the locking code. private static final void monitorEnter(Object o) { .... long monitorFrameId = getCallerFrameId(); ``` When popping we can thus check if there is still monitors/locks for the frame to be popped. Remove activation reads the lock stack, with a bunch of assembly, e.g.: ` access_load_at(T_INT, IN_HEAP, rax, Address(rax, java_lang_Thread::lock_stack_pos_offset()), noreg, noreg); ` If we would keep this, loom freezing would need to relativize and derelativize these values. (we only have interpreter) But, according to JVMS 2.11.10. the VM only needs to automatically unlock synchronized method. This code that unlocks all locks in the frame seems to have been added for JLS 17.1. I have asked for clarification and we only need and should care about JVMS. So if we could make popframe do more work (popframe needs to unlock all), there seems to be way forward allowing more flexibility. Still working on trying to make what we have public, even if it's in roughly shape and it's very unclear if that is the correct approach at all. ------------- PR: https://git.openjdk.org/jdk/pull/9680 From rkennke at openjdk.org Thu Oct 6 07:47:21 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 6 Oct 2022 07:47:21 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com>

Message-ID: On Mon, 12 Sep 2022 06:37:19 GMT, Robbin Ehn wrote: > > How have you handled the interpreter lock-stack-area in your implementation? Is it worth to get rid of it and consolidate with the per-thread lock-stack? > > At the moment I had to store a "frame id" for each entry in the lock stack. The frame id is previous fp, grabbed from "link()" when entering the locking code. > > ``` > private static final void monitorEnter(Object o) { > .... > long monitorFrameId = getCallerFrameId(); > ``` > > When popping we can thus check if there is still monitors/locks for the frame to be popped. Remove activation reads the lock stack, with a bunch of assembly, e.g.: ` access_load_at(T_INT, IN_HEAP, rax, Address(rax, java_lang_Thread::lock_stack_pos_offset()), noreg, noreg);` If we would keep this, loom freezing would need to relativize and derelativize these values. (we only have interpreter) Hmm ok. I was thinking something similar, but instead of storing pairs (oop/frame-id), push frame-markers on the lock-stack. But given that we only need all this for the interpreter, I am wondering if keeping what we have now (e.g. the per-frame-lock-stack in interpreter frame) is the saner thing to do. The overhead seems very small, perhaps very similar to keeping track of frames in the per-thread lock-stack. > But, according to JVMS 2.11.10. the VM only needs to automatically unlock synchronized method. This code that unlocks all locks in the frame seems to have been added for JLS 17.1. I have asked for clarification and we only need and should care about JVMS. > > So if we could make popframe do more work (popframe needs to unlock all), there seems to be way forward allowing more flexibility. > Still working on trying to make what we have public, even if it's in roughly shape and it's very unclear if that is the correct approach at all. Nice! >From your snippets above I am gleaning that your implementation has the actual lock-stack in Java. Is that correct? Is there a particular reason why you need this? Is this for Loom? Would the implementation that I am proposing here also work for your use-case(s)? Thanks, Roman ------------- PR: https://git.openjdk.org/jdk/pull/9680 From rehn at openjdk.org Thu Oct 6 07:47:22 2022 From: rehn at openjdk.org (Robbin Ehn) Date: Thu, 6 Oct 2022 07:47:22 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com>

Message-ID: On Mon, 12 Sep 2022 07:54:48 GMT, Roman Kennke wrote: > Nice! From your snippets above I am gleaning that your implementation has the actual lock-stack in Java. Is that correct? Is there a particular reason why you need this? Is this for Loom? Would the implementation that I am proposing here also work for your use-case(s)? > Yes, the entire implementation is in Java. void push(Object lockee, long fid) { if (this != Thread.currentThread()) Monitor.abort("invariant"); if (lockStackPos == lockStack.length) { grow(); } frameId[lockStackPos] = fid; lockStack[lockStackPos++] = lockee; } We are starting from the point of let's do everything be in Java. I want smart people to being able to change the implementation. So I really don't like the hardcoded assembly in remove_activation which do this check on frame id on the lock stack. If we can make the changes to e.g. popframe and take a bit different approach to JVMS we may have a total flexible Java implementation. But a flexible Java implementation means compiler can't have intrinsics, so what will the performance be.... We have more loose-ends than we can handle at the moment. Your code may be useable for JOM if we lock the implementation to using a lock-stack and we are going to write intrinsics to it. There is no point of it being in Java if so IMHO. ------------- PR: https://git.openjdk.org/jdk/pull/9680 From rehn at openjdk.org Thu Oct 6 08:13:09 2022 From: rehn at openjdk.org (Robbin Ehn) Date: Thu, 6 Oct 2022 08:13:09 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com>

Message-ID: On Mon, 12 Sep 2022 07:54:48 GMT, Roman Kennke wrote: >>> How have you handled the interpreter lock-stack-area in your implementation? Is it worth to get rid of it and consolidate with the per-thread lock-stack? >> >> At the moment I had to store a "frame id" for each entry in the lock stack. >> The frame id is previous fp, grabbed from "link()" when entering the locking code. >> >> private static final void monitorEnter(Object o) { >> .... >> long monitorFrameId = getCallerFrameId(); >> ``` >> When popping we can thus check if there is still monitors/locks for the frame to be popped. >> Remove activation reads the lock stack, with a bunch of assembly, e.g.: >> ` access_load_at(T_INT, IN_HEAP, rax, Address(rax, java_lang_Thread::lock_stack_pos_offset()), noreg, noreg); >> ` >> If we would keep this, loom freezing would need to relativize and derelativize these values. >> (we only have interpreter) >> >> But, according to JVMS 2.11.10. the VM only needs to automatically unlock synchronized method. >> This code that unlocks all locks in the frame seems to have been added for JLS 17.1. >> I have asked for clarification and we only need and should care about JVMS. >> >> So if we could make popframe do more work (popframe needs to unlock all), there seems to be way forward allowing more flexibility. >> >> Still working on trying to make what we have public, even if it's in roughly shape and it's very unclear if that is the correct approach at all. > >> > How have you handled the interpreter lock-stack-area in your implementation? Is it worth to get rid of it and consolidate with the per-thread lock-stack? >> >> At the moment I had to store a "frame id" for each entry in the lock stack. The frame id is previous fp, grabbed from "link()" when entering the locking code. >> >> ``` >> private static final void monitorEnter(Object o) { >> .... >> long monitorFrameId = getCallerFrameId(); >> ``` >> >> When popping we can thus check if there is still monitors/locks for the frame to be popped. Remove activation reads the lock stack, with a bunch of assembly, e.g.: ` access_load_at(T_INT, IN_HEAP, rax, Address(rax, java_lang_Thread::lock_stack_pos_offset()), noreg, noreg);` If we would keep this, loom freezing would need to relativize and derelativize these values. (we only have interpreter) > > Hmm ok. I was thinking something similar, but instead of storing pairs (oop/frame-id), push frame-markers on the lock-stack. > > But given that we only need all this for the interpreter, I am wondering if keeping what we have now (e.g. the per-frame-lock-stack in interpreter frame) is the saner thing to do. The overhead seems very small, perhaps very similar to keeping track of frames in the per-thread lock-stack. > >> But, according to JVMS 2.11.10. the VM only needs to automatically unlock synchronized method. This code that unlocks all locks in the frame seems to have been added for JLS 17.1. I have asked for clarification and we only need and should care about JVMS. >> >> So if we could make popframe do more work (popframe needs to unlock all), there seems to be way forward allowing more flexibility. > >> Still working on trying to make what we have public, even if it's in roughly shape and it's very unclear if that is the correct approach at all. > > Nice! > From your snippets above I am gleaning that your implementation has the actual lock-stack in Java. Is that correct? Is there a particular reason why you need this? Is this for Loom? Would the implementation that I am proposing here also work for your use-case(s)? > > Thanks, > Roman > @rkennke I will have a look, but may I suggest to open a new PR and just reference this as background discussion? I think most of the comments above is not relevant enough for a new reviewer to struggle through. What do you think? Ok, will do that. Thanks! ------------- PR: https://git.openjdk.org/jdk/pull/9680 From rkennke at openjdk.org Thu Oct 6 10:22:14 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 6 Oct 2022 10:22:14 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> Message-ID: On Thu, 28 Jul 2022 19:58:34 GMT, Roman Kennke wrote: > This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turns out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. > > What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock. > > This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typcially remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols. > > In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth to add support for recursive fast-locking. > > One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, and thus handing over to the contending thread. > > As an alternative, I considered to remove stack-locking altogether, and only use heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it. > > This change enables to simplify (and speed-up!) a lot of code: > > - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header. > - Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. This is safe because Java threads participate in monitor deflation protocol. This would be implemented in a separate PR > > ### Benchmarks > > All benchmarks are run on server-class metal machines. The JVM settings are always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less is better. > > #### DaCapo/AArch64 > > Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. Benchmarks that showed results far off the baseline or showed high variance have been repeated and I am reporting results with the most bias *against* fast-locking. The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean what is it actually doing that a change in locking shows >30% perf difference? > > benchmark | baseline | fast-locking | % | size > -- | -- | -- | -- | -- > avrora | 27859 | 27563 | 1.07% | large > batik | 20786 | 20847 | -0.29% | large > biojava | 27421 | 27334 | 0.32% | default > eclipse | 59918 | 60522 | -1.00% | large > fop | 3670 | 3678 | -0.22% | default > graphchi | 2088 | 2060 | 1.36% | default > h2 | 297391 | 291292 | 2.09% | huge > jme | 8762 | 8877 | -1.30% | default > jython | 18938 | 18878 | 0.32% | default > luindex | 1339 | 1325 | 1.06% | default > lusearch | 918 | 936 | -1.92% | default > pmd | 58291 | 58423 | -0.23% | large > sunflow | 32617 | 24961 | 30.67% | large > tomcat | 25481 | 25992 | -1.97% | large > tradebeans | 314640 | 311706 | 0.94% | huge > tradesoap | 107473 | 110246 | -2.52% | huge > xalan | 6047 | 5882 | 2.81% | default > zxing | 970 | 926 | 4.75% | default > > #### DaCapo/x86_64 > > The following measurements have been taken on an Intel Xeon Scalable Processors (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same settings and considerations as in the measurements above. > > benchmark | baseline | fast-Locking | % | size > -- | -- | -- | -- | -- > avrora | 127690 | 126749 | 0.74% | large > batik | 12736 | 12641 | 0.75% | large > biojava | 15423 | 15404 | 0.12% | default > eclipse | 41174 | 41498 | -0.78% | large > fop | 2184 | 2172 | 0.55% | default > graphchi | 1579 | 1560 | 1.22% | default > h2 | 227614 | 230040 | -1.05% | huge > jme | 8591 | 8398 | 2.30% | default > jython | 13473 | 13356 | 0.88% | default > luindex | 824 | 813 | 1.35% | default > lusearch | 962 | 968 | -0.62% | default > pmd | 40827 | 39654 | 2.96% | large > sunflow | 53362 | 43475 | 22.74% | large > tomcat | 27549 | 28029 | -1.71% | large > tradebeans | 190757 | 190994 | -0.12% | huge > tradesoap | 68099 | 67934 | 0.24% | huge > xalan | 7969 | 8178 | -2.56% | default > zxing | 1176 | 1148 | 2.44% | default > > #### Renaissance/AArch64 > > This tests Renaissance/JMH version 0.14.1 on same machines as DaCapo above, with same JVM settings. > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 2558.832 | 2513.594 | 1.80% > Reactors | 14715.626 | 14311.246 | 2.83% > Als | 1851.485 | 1869.622 | -0.97% > ChiSquare | 1007.788 | 1003.165 | 0.46% > GaussMix | 1157.491 | 1149.969 | 0.65% > LogRegression | 717.772 | 733.576 | -2.15% > MovieLens | 7916.181 | 8002.226 | -1.08% > NaiveBayes | 395.296 | 386.611 | 2.25% > PageRank | 4294.939 | 4346.333 | -1.18% > FjKmeans | 519.2 | 498.357 | 4.18% > FutureGenetic | 2578.504 | 2589.255 | -0.42% > Mnemonics | 4898.886 | 4903.689 | -0.10% > ParMnemonics | 4260.507 | 4210.121 | 1.20% > Scrabble | 139.37 | 138.312 | 0.76% > RxScrabble | 320.114 | 322.651 | -0.79% > Dotty | 1056.543 | 1068.492 | -1.12% > ScalaDoku | 3443.117 | 3449.477 | -0.18% > ScalaStmBench7 | 1102.43 | 1115.142 | -1.14% > FinagleChirper | 6814.192 | 6853.38 | -0.57% > FinagleHttp | 4762.902 | 4807.564 | -0.93% > > #### Renaissance/x86_64 > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 1117.185 | 1116.425 | 0.07% > Reactors | 11561.354 | 11812.499 | -2.13% > Als | 1580.838 | 1575.318 | 0.35% > ChiSquare | 459.601 | 467.109 | -1.61% > GaussMix | 705.944 | 685.595 | 2.97% > LogRegression | 659.944 | 656.428 | 0.54% > MovieLens | 7434.303 | 7592.271 | -2.08% > NaiveBayes | 413.482 | 417.369 | -0.93% > PageRank | 3259.233 | 3276.589 | -0.53% > FjKmeans | 946.429 | 938.991 | 0.79% > FutureGenetic | 1760.672 | 1815.272 | -3.01% > Scrabble | 147.996 | 150.084 | -1.39% > RxScrabble | 177.755 | 177.956 | -0.11% > Dotty | 673.754 | 683.919 | -1.49% > ScalaKmeans | 165.376 | 168.925 | -2.10% > ScalaStmBench7 | 1080.187 | 1049.184 | 2.95% > > Some renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and probably addressing with running them much more often. > > Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here. > > ### Testing > - [x] tier1 (x86_64, aarch64, x86_32) > - [x] tier2 (x86_64, aarch64) > - [x] tier3 (x86_64, aarch64) > - [x] tier4 (x86_64, aarch64) Closing this PR in favour of a new, clean PR. ------------- PR: https://git.openjdk.org/jdk/pull/9680 From rkennke at openjdk.org Thu Oct 6 10:22:14 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 6 Oct 2022 10:22:14 GMT Subject: Withdrawn: 8291555: Replace stack-locking with fast-locking In-Reply-To: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> Message-ID: On Thu, 28 Jul 2022 19:58:34 GMT, Roman Kennke wrote: > This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turns out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. > > What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock. > > This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typcially remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols. > > In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth to add support for recursive fast-locking. > > One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, and thus handing over to the contending thread. > > As an alternative, I considered to remove stack-locking altogether, and only use heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it. > > This change enables to simplify (and speed-up!) a lot of code: > > - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header. > - Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. This is safe because Java threads participate in monitor deflation protocol. This would be implemented in a separate PR > > ### Benchmarks > > All benchmarks are run on server-class metal machines. The JVM settings are always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less is better. > > #### DaCapo/AArch64 > > Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. Benchmarks that showed results far off the baseline or showed high variance have been repeated and I am reporting results with the most bias *against* fast-locking. The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean what is it actually doing that a change in locking shows >30% perf difference? > > benchmark | baseline | fast-locking | % | size > -- | -- | -- | -- | -- > avrora | 27859 | 27563 | 1.07% | large > batik | 20786 | 20847 | -0.29% | large > biojava | 27421 | 27334 | 0.32% | default > eclipse | 59918 | 60522 | -1.00% | large > fop | 3670 | 3678 | -0.22% | default > graphchi | 2088 | 2060 | 1.36% | default > h2 | 297391 | 291292 | 2.09% | huge > jme | 8762 | 8877 | -1.30% | default > jython | 18938 | 18878 | 0.32% | default > luindex | 1339 | 1325 | 1.06% | default > lusearch | 918 | 936 | -1.92% | default > pmd | 58291 | 58423 | -0.23% | large > sunflow | 32617 | 24961 | 30.67% | large > tomcat | 25481 | 25992 | -1.97% | large > tradebeans | 314640 | 311706 | 0.94% | huge > tradesoap | 107473 | 110246 | -2.52% | huge > xalan | 6047 | 5882 | 2.81% | default > zxing | 970 | 926 | 4.75% | default > > #### DaCapo/x86_64 > > The following measurements have been taken on an Intel Xeon Scalable Processors (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same settings and considerations as in the measurements above. > > benchmark | baseline | fast-Locking | % | size > -- | -- | -- | -- | -- > avrora | 127690 | 126749 | 0.74% | large > batik | 12736 | 12641 | 0.75% | large > biojava | 15423 | 15404 | 0.12% | default > eclipse | 41174 | 41498 | -0.78% | large > fop | 2184 | 2172 | 0.55% | default > graphchi | 1579 | 1560 | 1.22% | default > h2 | 227614 | 230040 | -1.05% | huge > jme | 8591 | 8398 | 2.30% | default > jython | 13473 | 13356 | 0.88% | default > luindex | 824 | 813 | 1.35% | default > lusearch | 962 | 968 | -0.62% | default > pmd | 40827 | 39654 | 2.96% | large > sunflow | 53362 | 43475 | 22.74% | large > tomcat | 27549 | 28029 | -1.71% | large > tradebeans | 190757 | 190994 | -0.12% | huge > tradesoap | 68099 | 67934 | 0.24% | huge > xalan | 7969 | 8178 | -2.56% | default > zxing | 1176 | 1148 | 2.44% | default > > #### Renaissance/AArch64 > > This tests Renaissance/JMH version 0.14.1 on same machines as DaCapo above, with same JVM settings. > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 2558.832 | 2513.594 | 1.80% > Reactors | 14715.626 | 14311.246 | 2.83% > Als | 1851.485 | 1869.622 | -0.97% > ChiSquare | 1007.788 | 1003.165 | 0.46% > GaussMix | 1157.491 | 1149.969 | 0.65% > LogRegression | 717.772 | 733.576 | -2.15% > MovieLens | 7916.181 | 8002.226 | -1.08% > NaiveBayes | 395.296 | 386.611 | 2.25% > PageRank | 4294.939 | 4346.333 | -1.18% > FjKmeans | 519.2 | 498.357 | 4.18% > FutureGenetic | 2578.504 | 2589.255 | -0.42% > Mnemonics | 4898.886 | 4903.689 | -0.10% > ParMnemonics | 4260.507 | 4210.121 | 1.20% > Scrabble | 139.37 | 138.312 | 0.76% > RxScrabble | 320.114 | 322.651 | -0.79% > Dotty | 1056.543 | 1068.492 | -1.12% > ScalaDoku | 3443.117 | 3449.477 | -0.18% > ScalaStmBench7 | 1102.43 | 1115.142 | -1.14% > FinagleChirper | 6814.192 | 6853.38 | -0.57% > FinagleHttp | 4762.902 | 4807.564 | -0.93% > > #### Renaissance/x86_64 > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 1117.185 | 1116.425 | 0.07% > Reactors | 11561.354 | 11812.499 | -2.13% > Als | 1580.838 | 1575.318 | 0.35% > ChiSquare | 459.601 | 467.109 | -1.61% > GaussMix | 705.944 | 685.595 | 2.97% > LogRegression | 659.944 | 656.428 | 0.54% > MovieLens | 7434.303 | 7592.271 | -2.08% > NaiveBayes | 413.482 | 417.369 | -0.93% > PageRank | 3259.233 | 3276.589 | -0.53% > FjKmeans | 946.429 | 938.991 | 0.79% > FutureGenetic | 1760.672 | 1815.272 | -3.01% > Scrabble | 147.996 | 150.084 | -1.39% > RxScrabble | 177.755 | 177.956 | -0.11% > Dotty | 673.754 | 683.919 | -1.49% > ScalaKmeans | 165.376 | 168.925 | -2.10% > ScalaStmBench7 | 1080.187 | 1049.184 | 2.95% > > Some renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and probably addressing with running them much more often. > > Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here. > > ### Testing > - [x] tier1 (x86_64, aarch64, x86_32) > - [x] tier2 (x86_64, aarch64) > - [x] tier3 (x86_64, aarch64) > - [x] tier4 (x86_64, aarch64) This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/9680 From rkennke at openjdk.org Thu Oct 6 10:30:19 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 6 Oct 2022 10:30:19 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking Message-ID: This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turns out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock. This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typcially remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols. In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth to add support for recursive fast-locking. One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, and thus handing over to the contending thread. As an alternative, I considered to remove stack-locking altogether, and only use heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it. This change enables to simplify (and speed-up!) a lot of code: - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header. - Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. This is safe because Java threads participate in monitor deflation protocol. This would be implemented in a separate PR ### Benchmarks All benchmarks are run on server-class metal machines. The JVM settings are always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less is better. #### DaCapo/AArch64 Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. Benchmarks that showed results far off the baseline or showed high variance have been repeated and I am reporting results with the most bias *against* fast-locking. The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean what is it actually doing that a change in locking shows >30% perf difference? benchmark | baseline | fast-locking | % | size -- | -- | -- | -- | -- avrora | 27859 | 27563 | 1.07% | large batik | 20786 | 20847 | -0.29% | large biojava | 27421 | 27334 | 0.32% | default eclipse | 59918 | 60522 | -1.00% | large fop | 3670 | 3678 | -0.22% | default graphchi | 2088 | 2060 | 1.36% | default h2 | 297391 | 291292 | 2.09% | huge jme | 8762 | 8877 | -1.30% | default jython | 18938 | 18878 | 0.32% | default luindex | 1339 | 1325 | 1.06% | default lusearch | 918 | 936 | -1.92% | default pmd | 58291 | 58423 | -0.23% | large sunflow | 32617 | 24961 | 30.67% | large tomcat | 25481 | 25992 | -1.97% | large tradebeans | 314640 | 311706 | 0.94% | huge tradesoap | 107473 | 110246 | -2.52% | huge xalan | 6047 | 5882 | 2.81% | default zxing | 970 | 926 | 4.75% | default #### DaCapo/x86_64 The following measurements have been taken on an Intel Xeon Scalable Processors (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same settings and considerations as in the measurements above. benchmark | baseline | fast-Locking | % | size -- | -- | -- | -- | -- avrora | 127690 | 126749 | 0.74% | large batik | 12736 | 12641 | 0.75% | large biojava | 15423 | 15404 | 0.12% | default eclipse | 41174 | 41498 | -0.78% | large fop | 2184 | 2172 | 0.55% | default graphchi | 1579 | 1560 | 1.22% | default h2 | 227614 | 230040 | -1.05% | huge jme | 8591 | 8398 | 2.30% | default jython | 13473 | 13356 | 0.88% | default luindex | 824 | 813 | 1.35% | default lusearch | 962 | 968 | -0.62% | default pmd | 40827 | 39654 | 2.96% | large sunflow | 53362 | 43475 | 22.74% | large tomcat | 27549 | 28029 | -1.71% | large tradebeans | 190757 | 190994 | -0.12% | huge tradesoap | 68099 | 67934 | 0.24% | huge xalan | 7969 | 8178 | -2.56% | default zxing | 1176 | 1148 | 2.44% | default #### Renaissance/AArch64 This tests Renaissance/JMH version 0.14.1 on same machines as DaCapo above, with same JVM settings. benchmark | baseline | fast-locking | % -- | -- | -- | -- AkkaUct | 2558.832 | 2513.594 | 1.80% Reactors | 14715.626 | 14311.246 | 2.83% Als | 1851.485 | 1869.622 | -0.97% ChiSquare | 1007.788 | 1003.165 | 0.46% GaussMix | 1157.491 | 1149.969 | 0.65% LogRegression | 717.772 | 733.576 | -2.15% MovieLens | 7916.181 | 8002.226 | -1.08% NaiveBayes | 395.296 | 386.611 | 2.25% PageRank | 4294.939 | 4346.333 | -1.18% FjKmeans | 519.2 | 498.357 | 4.18% FutureGenetic | 2578.504 | 2589.255 | -0.42% Mnemonics | 4898.886 | 4903.689 | -0.10% ParMnemonics | 4260.507 | 4210.121 | 1.20% Scrabble | 139.37 | 138.312 | 0.76% RxScrabble | 320.114 | 322.651 | -0.79% Dotty | 1056.543 | 1068.492 | -1.12% ScalaDoku | 3443.117 | 3449.477 | -0.18% Philosophers | 24333.311 | 23438.22 | 3.82% ScalaStmBench7 | 1102.43 | 1115.142 | -1.14% FinagleChirper | 6814.192 | 6853.38 | -0.57% FinagleHttp | 4762.902 | 4807.564 | -0.93% #### Renaissance/x86_64 benchmark | baseline | fast-locking | % -- | -- | -- | -- AkkaUct | 1117.185 | 1116.425 | 0.07% Reactors | 11561.354 | 11812.499 | -2.13% Als | 1580.838 | 1575.318 | 0.35% ChiSquare | 459.601 | 467.109 | -1.61% GaussMix | 705.944 | 685.595 | 2.97% LogRegression | 659.944 | 656.428 | 0.54% MovieLens | 7434.303 | 7592.271 | -2.08% NaiveBayes | 413.482 | 417.369 | -0.93% PageRank | 3259.233 | 3276.589 | -0.53% FjKmeans | 946.429 | 938.991 | 0.79% FutureGenetic | 1760.672 | 1815.272 | -3.01% Scrabble | 147.996 | 150.084 | -1.39% RxScrabble | 177.755 | 177.956 | -0.11% Dotty | 673.754 | 683.919 | -1.49% ScalaDoku | 2193.562 | 1958.419 | 12.01% ScalaKmeans | 165.376 | 168.925 | -2.10% ScalaStmBench7 | 1080.187 | 1049.184 | 2.95% Some renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and probably addressing with running them much more often. I have also run another benchmark, which is a popular Java JVM benchmark, with workloads wrapped in JMH and very slightly modified to run with newer JDKs, but I won't publish the results because I am not sure about the licensing terms. They look similar to the measurements above (i.e. +/- 2%, nothing very suspicious). Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here. ### Testing - [x] tier1 (x86_64, aarch64, x86_32) - [x] tier2 (x86_64, aarch64) - [x] tier3 (x86_64, aarch64) - [x] tier4 (x86_64, aarch64) ------------- Commit messages: - Merge tag 'jdk-20+17' into fast-locking - Fix OSR packing in AArch64, part 2 - Fix OSR packing in AArch64 - Merge remote-tracking branch 'upstream/master' into fast-locking - Fix register in interpreter unlock x86_32 - Support unstructured locking in interpreter (x86 parts) - Support unstructured locking in interpreter (aarch64 and shared parts) - Merge branch 'master' into fast-locking - Merge branch 'master' into fast-locking - Added test for hand-over-hand locking - ... and 17 more: https://git.openjdk.org/jdk/compare/79ccc791...3ed51053 Changes: https://git.openjdk.org/jdk/pull/10590/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10590&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8291555 Stats: 3660 lines in 127 files changed: 650 ins; 2481 del; 529 mod Patch: https://git.openjdk.org/jdk/pull/10590.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10590/head:pull/10590 PR: https://git.openjdk.org/jdk/pull/10590 From duke at openjdk.org Thu Oct 6 13:08:32 2022 From: duke at openjdk.org (JervenBolleman) Date: Thu, 6 Oct 2022 13:08:32 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> Message-ID: On Thu, 28 Jul 2022 19:58:34 GMT, Roman Kennke wrote: > This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turns out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. > > What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock. > > This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typcially remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols. > > In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth to add support for recursive fast-locking. > > One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, and thus handing over to the contending thread. > > As an alternative, I considered to remove stack-locking altogether, and only use heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it. > > This change enables to simplify (and speed-up!) a lot of code: > > - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header. > - Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. This is safe because Java threads participate in monitor deflation protocol. This would be implemented in a separate PR > > ### Benchmarks > > All benchmarks are run on server-class metal machines. The JVM settings are always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less is better. > > #### DaCapo/AArch64 > > Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. Benchmarks that showed results far off the baseline or showed high variance have been repeated and I am reporting results with the most bias *against* fast-locking. The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean what is it actually doing that a change in locking shows >30% perf difference? > > benchmark | baseline | fast-locking | % | size > -- | -- | -- | -- | -- > avrora | 27859 | 27563 | 1.07% | large > batik | 20786 | 20847 | -0.29% | large > biojava | 27421 | 27334 | 0.32% | default > eclipse | 59918 | 60522 | -1.00% | large > fop | 3670 | 3678 | -0.22% | default > graphchi | 2088 | 2060 | 1.36% | default > h2 | 297391 | 291292 | 2.09% | huge > jme | 8762 | 8877 | -1.30% | default > jython | 18938 | 18878 | 0.32% | default > luindex | 1339 | 1325 | 1.06% | default > lusearch | 918 | 936 | -1.92% | default > pmd | 58291 | 58423 | -0.23% | large > sunflow | 32617 | 24961 | 30.67% | large > tomcat | 25481 | 25992 | -1.97% | large > tradebeans | 314640 | 311706 | 0.94% | huge > tradesoap | 107473 | 110246 | -2.52% | huge > xalan | 6047 | 5882 | 2.81% | default > zxing | 970 | 926 | 4.75% | default > > #### DaCapo/x86_64 > > The following measurements have been taken on an Intel Xeon Scalable Processors (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same settings and considerations as in the measurements above. > > benchmark | baseline | fast-Locking | % | size > -- | -- | -- | -- | -- > avrora | 127690 | 126749 | 0.74% | large > batik | 12736 | 12641 | 0.75% | large > biojava | 15423 | 15404 | 0.12% | default > eclipse | 41174 | 41498 | -0.78% | large > fop | 2184 | 2172 | 0.55% | default > graphchi | 1579 | 1560 | 1.22% | default > h2 | 227614 | 230040 | -1.05% | huge > jme | 8591 | 8398 | 2.30% | default > jython | 13473 | 13356 | 0.88% | default > luindex | 824 | 813 | 1.35% | default > lusearch | 962 | 968 | -0.62% | default > pmd | 40827 | 39654 | 2.96% | large > sunflow | 53362 | 43475 | 22.74% | large > tomcat | 27549 | 28029 | -1.71% | large > tradebeans | 190757 | 190994 | -0.12% | huge > tradesoap | 68099 | 67934 | 0.24% | huge > xalan | 7969 | 8178 | -2.56% | default > zxing | 1176 | 1148 | 2.44% | default > > #### Renaissance/AArch64 > > This tests Renaissance/JMH version 0.14.1 on same machines as DaCapo above, with same JVM settings. > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 2558.832 | 2513.594 | 1.80% > Reactors | 14715.626 | 14311.246 | 2.83% > Als | 1851.485 | 1869.622 | -0.97% > ChiSquare | 1007.788 | 1003.165 | 0.46% > GaussMix | 1157.491 | 1149.969 | 0.65% > LogRegression | 717.772 | 733.576 | -2.15% > MovieLens | 7916.181 | 8002.226 | -1.08% > NaiveBayes | 395.296 | 386.611 | 2.25% > PageRank | 4294.939 | 4346.333 | -1.18% > FjKmeans | 519.2 | 498.357 | 4.18% > FutureGenetic | 2578.504 | 2589.255 | -0.42% > Mnemonics | 4898.886 | 4903.689 | -0.10% > ParMnemonics | 4260.507 | 4210.121 | 1.20% > Scrabble | 139.37 | 138.312 | 0.76% > RxScrabble | 320.114 | 322.651 | -0.79% > Dotty | 1056.543 | 1068.492 | -1.12% > ScalaDoku | 3443.117 | 3449.477 | -0.18% > ScalaStmBench7 | 1102.43 | 1115.142 | -1.14% > FinagleChirper | 6814.192 | 6853.38 | -0.57% > FinagleHttp | 4762.902 | 4807.564 | -0.93% > > #### Renaissance/x86_64 > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 1117.185 | 1116.425 | 0.07% > Reactors | 11561.354 | 11812.499 | -2.13% > Als | 1580.838 | 1575.318 | 0.35% > ChiSquare | 459.601 | 467.109 | -1.61% > GaussMix | 705.944 | 685.595 | 2.97% > LogRegression | 659.944 | 656.428 | 0.54% > MovieLens | 7434.303 | 7592.271 | -2.08% > NaiveBayes | 413.482 | 417.369 | -0.93% > PageRank | 3259.233 | 3276.589 | -0.53% > FjKmeans | 946.429 | 938.991 | 0.79% > FutureGenetic | 1760.672 | 1815.272 | -3.01% > Scrabble | 147.996 | 150.084 | -1.39% > RxScrabble | 177.755 | 177.956 | -0.11% > Dotty | 673.754 | 683.919 | -1.49% > ScalaKmeans | 165.376 | 168.925 | -2.10% > ScalaStmBench7 | 1080.187 | 1049.184 | 2.95% > > Some renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and probably addressing with running them much more often. > > Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here. > > ### Testing > - [x] tier1 (x86_64, aarch64, x86_32) > - [x] tier2 (x86_64, aarch64) > - [x] tier3 (x86_64, aarch64) > - [x] tier4 (x86_64, aarch64) For those following along the new PR is https://github.com/openjdk/jdk/pull/10590 ------------- PR: https://git.openjdk.org/jdk/pull/9680 From jsjolen at openjdk.org Fri Oct 7 11:35:09 2022 From: jsjolen at openjdk.org (Johan =?UTF-8?B?U2rDtmxlbg==?=) Date: Fri, 7 Oct 2022 11:35:09 GMT Subject: RFR: 8294954: Remove superfluous ResourceMarks when using LogStream Message-ID: Hi, I went through all of the places where LogStreams are created and removed the unnecessary ResourceMarks. I also added a ResourceMark in one place, where it was needed because of a call to `::name_and_sig_as_C_string` and moved one to the smallest scope where it is used. ------------- Commit messages: - Remove unnecessary ResourceMarks Changes: https://git.openjdk.org/jdk/pull/10602/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10602&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8294954 Stats: 59 lines in 41 files changed: 2 ins; 57 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/10602.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10602/head:pull/10602 PR: https://git.openjdk.org/jdk/pull/10602 From dholmes at openjdk.org Fri Oct 7 13:21:21 2022 From: dholmes at openjdk.org (David Holmes) Date: Fri, 7 Oct 2022 13:21:21 GMT Subject: RFR: 8294954: Remove superfluous ResourceMarks when using LogStream In-Reply-To: References: Message-ID: On Fri, 7 Oct 2022 11:19:55 GMT, Johan Sj?len wrote: > Hi, > > I went through all of the places where LogStreams are created and removed the unnecessary ResourceMarks. I also added a ResourceMark in one place, where it was needed because of a call to `::name_and_sig_as_C_string` and moved one to the smallest scope where it is used. How are you defining "unnecessary"? Are these unnecessary because there is zero resource allocation involved? Or "unnecessary" because a ResourceMark higher up the call stack covers it? ------------- PR: https://git.openjdk.org/jdk/pull/10602 From dholmes at openjdk.org Fri Oct 7 13:32:11 2022 From: dholmes at openjdk.org (David Holmes) Date: Fri, 7 Oct 2022 13:32:11 GMT Subject: RFR: 8294954: Remove superfluous ResourceMarks when using LogStream In-Reply-To: References: Message-ID: On Fri, 7 Oct 2022 11:19:55 GMT, Johan Sj?len wrote: > Hi, > > I went through all of the places where LogStreams are created and removed the unnecessary ResourceMarks. I also added a ResourceMark in one place, where it was needed because of a call to `::name_and_sig_as_C_string` and moved one to the smallest scope where it is used. I see now the bug report suggests these RM were in place because the stream itself may have needed them but that this is no longer the case. So was that the only reason for all these RMs? ------------- PR: https://git.openjdk.org/jdk/pull/10602 From jsjolen at openjdk.org Fri Oct 7 13:41:08 2022 From: jsjolen at openjdk.org (Johan =?UTF-8?B?U2rDtmxlbg==?=) Date: Fri, 7 Oct 2022 13:41:08 GMT Subject: RFR: 8294954: Remove superfluous ResourceMarks when using LogStream In-Reply-To: References:

Message-ID: On Fri, 7 Oct 2022 13:28:58 GMT, David Holmes wrote: >I see now the bug report suggests these RM were in place because the stream itself may have needed them but that this is no longer the case. So was that the only reason for all these RMs? There are RMs that I've looked at but left intact because they did have other reasons for being there (typically: string allocating functions). So yes, `LogStream` should be the only reason for all these RMs. ------------- PR: https://git.openjdk.org/jdk/pull/10602 From jsjolen at openjdk.org Fri Oct 7 13:51:15 2022 From: jsjolen at openjdk.org (Johan =?UTF-8?B?U2rDtmxlbg==?=) Date: Fri, 7 Oct 2022 13:51:15 GMT Subject: RFR: 8294954: Remove superfluous ResourceMarks when using LogStream In-Reply-To: References: Message-ID: <3RXwTxz1C1mjzFvf-yKczgP4lCERhQQsJdCej7iXrFE=.38a314e4-70b5-4356-8360-1fbbbf68230b@github.com> On Fri, 7 Oct 2022 11:19:55 GMT, Johan Sj?len wrote: > Hi, > > I went through all of the places where LogStreams are created and removed the unnecessary ResourceMarks. I also added a ResourceMark in one place, where it was needed because of a call to `::name_and_sig_as_C_string` and moved one to the smallest scope where it is used. This PR does remove the RM in `VM_Operation::evaluate`, and I haven't checked all of the VM operations to see if anyone uses it. ------------- PR: https://git.openjdk.org/jdk/pull/10602 From duke at openjdk.org Sun Oct 9 06:45:10 2022 From: duke at openjdk.org (Tongbao Zhang) Date: Sun, 9 Oct 2022 06:45:10 GMT Subject: RFR: 8293782: Shenandoah: some tests failed on lock rank check [v2] In-Reply-To: References: Message-ID: > After [JDK-8290025](https://bugs.openjdk.org/browse/JDK-8290025), some tests using ShenandoahGC failed on the lock rank check between AdapterHandlerLibrary_lock and ShenandoahRequestedGC_lock > > Symptom > > # > # A fatal error has been detected by the Java Runtime Environment: > # > # Internal Error (/data1/ws/jdk/src/hotspot/share/runtime/mutex.cpp:454), pid=2018566, tid=2022220 > # assert(false) failed: Attempting to acquire lock ShenandoahRequestedGC_lock/safepoint-1 out of order with lock AdapterHandlerLibrary_lock/safepoint-1 -- possible deadlock > # > # JRE version: OpenJDK Runtime Environment (20.0) (slowdebug build 20-internal-adhoc.root.jdk) > # Java VM: OpenJDK 64-Bit Server VM (slowdebug 20-internal-adhoc.root.jdk, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, shenandoah gc, linux-amd64) > # Problematic frame: > # V [libjvm.so+0x106fd6a] Mutex::check_rank(Thread*)+0x426 Tongbao Zhang has updated the pull request incrementally with one additional commit since the last revision: update rank of _alloc_failure_waiters_lock ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10264/files - new: https://git.openjdk.org/jdk/pull/10264/files/23f44fbd..87675608 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10264&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10264&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10264.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10264/head:pull/10264 PR: https://git.openjdk.org/jdk/pull/10264 From duke at openjdk.org Sun Oct 9 06:45:11 2022 From: duke at openjdk.org (Tongbao Zhang) Date: Sun, 9 Oct 2022 06:45:11 GMT Subject: RFR: 8293782: Shenandoah: some tests failed on lock rank check In-Reply-To: References: Message-ID: <7erfXFkhlNdrcP0Pfuw_BzaY0T7g1GqD5dIBDoAMfTE=.2798b236-d61a-484d-a8dc-d2b8f311cb0c@github.com> On Wed, 14 Sep 2022 07:01:52 GMT, Tongbao Zhang wrote: > After [JDK-8290025](https://bugs.openjdk.org/browse/JDK-8290025), some tests using ShenandoahGC failed on the lock rank check between AdapterHandlerLibrary_lock and ShenandoahRequestedGC_lock > > Symptom > > # > # A fatal error has been detected by the Java Runtime Environment: > # > # Internal Error (/data1/ws/jdk/src/hotspot/share/runtime/mutex.cpp:454), pid=2018566, tid=2022220 > # assert(false) failed: Attempting to acquire lock ShenandoahRequestedGC_lock/safepoint-1 out of order with lock AdapterHandlerLibrary_lock/safepoint-1 -- possible deadlock > # > # JRE version: OpenJDK Runtime Environment (20.0) (slowdebug build 20-internal-adhoc.root.jdk) > # Java VM: OpenJDK 64-Bit Server VM (slowdebug 20-internal-adhoc.root.jdk, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, shenandoah gc, linux-amd64) > # Problematic frame: > # V [libjvm.so+0x106fd6a] Mutex::check_rank(Thread*)+0x426 > Thanks for reminding! updated the rank of `_alloc_failure_waiters_lock ` ------------- PR: https://git.openjdk.org/jdk/pull/10264 From shade at openjdk.org Mon Oct 10 12:56:51 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 10 Oct 2022 12:56:51 GMT Subject: RFR: 8293782: Shenandoah: some tests failed on lock rank check [v2] In-Reply-To: References:

Message-ID: On Sun, 9 Oct 2022 06:45:10 GMT, Tongbao Zhang wrote: >> After [JDK-8290025](https://bugs.openjdk.org/browse/JDK-8290025), some tests using ShenandoahGC failed on the lock rank check between AdapterHandlerLibrary_lock and ShenandoahRequestedGC_lock >> >> Symptom >> >> # >> # A fatal error has been detected by the Java Runtime Environment: >> # >> # Internal Error (/data1/ws/jdk/src/hotspot/share/runtime/mutex.cpp:454), pid=2018566, tid=2022220 >> # assert(false) failed: Attempting to acquire lock ShenandoahRequestedGC_lock/safepoint-1 out of order with lock AdapterHandlerLibrary_lock/safepoint-1 -- possible deadlock >> # >> # JRE version: OpenJDK Runtime Environment (20.0) (slowdebug build 20-internal-adhoc.root.jdk) >> # Java VM: OpenJDK 64-Bit Server VM (slowdebug 20-internal-adhoc.root.jdk, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, shenandoah gc, linux-amd64) >> # Problematic frame: >> # V [libjvm.so+0x106fd6a] Mutex::check_rank(Thread*)+0x426 > > Tongbao Zhang has updated the pull request incrementally with one additional commit since the last revision: > > update rank of _alloc_failure_waiters_lock This looks good, thank you! (I tested `hotspot:tier1` with Shenandoah, and it now passes cleanly) ------------- Marked as reviewed by shade (Reviewer). PR: https://git.openjdk.org/jdk/pull/10264 From duke at openjdk.org Tue Oct 11 09:57:36 2022 From: duke at openjdk.org (Tongbao Zhang) Date: Tue, 11 Oct 2022 09:57:36 GMT Subject: RFR: 8293782: Shenandoah: some tests failed on lock rank check [v2] In-Reply-To: References:

Message-ID: On Mon, 10 Oct 2022 12:52:55 GMT, Aleksey Shipilev wrote: >> Tongbao Zhang has updated the pull request incrementally with one additional commit since the last revision: >> >> update rank of _alloc_failure_waiters_lock > > This looks good, thank you! (I tested `hotspot:tier1` with Shenandoah, and it now passes cleanly) Thanks for reviews! @shipilev @TheRealMDoerr ------------- PR: https://git.openjdk.org/jdk/pull/10264 From duke at openjdk.org Tue Oct 11 10:07:57 2022 From: duke at openjdk.org (Tongbao Zhang) Date: Tue, 11 Oct 2022 10:07:57 GMT Subject: Integrated: 8293782: Shenandoah: some tests failed on lock rank check In-Reply-To: References: Message-ID: On Wed, 14 Sep 2022 07:01:52 GMT, Tongbao Zhang wrote: > After [JDK-8290025](https://bugs.openjdk.org/browse/JDK-8290025), some tests using ShenandoahGC failed on the lock rank check between AdapterHandlerLibrary_lock and ShenandoahRequestedGC_lock > > Symptom > > # > # A fatal error has been detected by the Java Runtime Environment: > # > # Internal Error (/data1/ws/jdk/src/hotspot/share/runtime/mutex.cpp:454), pid=2018566, tid=2022220 > # assert(false) failed: Attempting to acquire lock ShenandoahRequestedGC_lock/safepoint-1 out of order with lock AdapterHandlerLibrary_lock/safepoint-1 -- possible deadlock > # > # JRE version: OpenJDK Runtime Environment (20.0) (slowdebug build 20-internal-adhoc.root.jdk) > # Java VM: OpenJDK 64-Bit Server VM (slowdebug 20-internal-adhoc.root.jdk, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, shenandoah gc, linux-amd64) > # Problematic frame: > # V [libjvm.so+0x106fd6a] Mutex::check_rank(Thread*)+0x426 This pull request has now been integrated. Changeset: 6053bf0f Author: Tongbao Zhang Committer: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/6053bf0f6a754bf3943ba6169316513055a5a3b2 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod 8293782: Shenandoah: some tests failed on lock rank check Reviewed-by: mdoerr, shade ------------- PR: https://git.openjdk.org/jdk/pull/10264 From rkennke at openjdk.org Tue Oct 11 12:16:23 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Tue, 11 Oct 2022 12:16:23 GMT Subject: RFR: 8294775: Shenandoah: reduce contention on _threads_in_evac In-Reply-To: References: Message-ID: On Wed, 5 Oct 2022 11:10:29 GMT, Nick Gasson wrote: > The idea here is to reduce contention on the shared `_threads_in_evac` counter by splitting its state over multiple independent cache lines. Each thread hashes to one particular counter based on its `Thread*`. This helps improve throughput of concurrent evacuation where many Java threads may be attempting to update this counter on the load barrier slow path. > > See this earlier thread for details and SPECjbb results: https://mail.openjdk.org/pipermail/shenandoah-dev/2022-October/017494.html > > Also tested `hotspot_gc_shenandoah` on x86 and AArch64. Hi Nick, Thank you, that is a useful change! I verified performance and it does improve both throughput and latency on several machines (not as much as for you - but I also have not thrown so many CPUs at it.. ) I do have a few suggestions. src/hotspot/share/gc/shenandoah/shenandoahEvacOOMHandler.cpp line 40: > 38: > 39: volatile jint *ShenandoahEvacOOMHandler::threads_in_evac_ptr(Thread* t) { > 40: uint64_t key = (uintptr_t)t; Maybe put that in a separate hash(Thread*) function? Also, is that a particular documented hash-function?(Related: In Lilliput project, I am working on a different identity-hash-code implementation, and part of it will be a hash-implementation to hash arbitrary pointers to 32 or 64 bit hash, currently using murmur3. Maybe this could be reused for here, when it happens?) src/hotspot/share/gc/shenandoah/shenandoahEvacOOMHandler.cpp line 55: > 53: // *and* the counter is zero. > 54: while (Atomic::load_acquire(ptr) != OOM_MARKER_MASK) { > 55: os::naked_short_sleep(1); Not sure if SpinPause() may be better here? @shipilev probably knows more. src/hotspot/share/gc/shenandoah/shenandoahEvacOOMHandler.hpp line 88: > 86: static const jint OOM_MARKER_MASK; > 87: > 88: static constexpr jint EVAC_COUNTER_BUCKETS = 64; Maybe it'd be useful to not hardwire this? It could be a runtime option, possibly diagnostic (not sure). Many workloads would not even use so many threads... src/hotspot/share/gc/shenandoah/shenandoahEvacOOMHandler.hpp line 92: > 90: shenandoah_padding(0); > 91: struct { > 92: volatile jint bits; The bits field needs a comment saying that it combines a counter with an OOM bit. In-fact, it would probably benefit from a little bit of refactoring, make it a class, and move accessors and relevant methods into it, and avoid public access to the field? ------------- Changes requested by rkennke (Reviewer). PR: https://git.openjdk.org/jdk/pull/10573 From ngasson at openjdk.org Tue Oct 11 12:36:28 2022 From: ngasson at openjdk.org (Nick Gasson) Date: Tue, 11 Oct 2022 12:36:28 GMT Subject: RFR: 8294775: Shenandoah: reduce contention on _threads_in_evac In-Reply-To: References: Message-ID: On Wed, 5 Oct 2022 11:10:29 GMT, Nick Gasson wrote: > The idea here is to reduce contention on the shared `_threads_in_evac` counter by splitting its state over multiple independent cache lines. Each thread hashes to one particular counter based on its `Thread*`. This helps improve throughput of concurrent evacuation where many Java threads may be attempting to update this counter on the load barrier slow path. > > See this earlier thread for details and SPECjbb results: https://mail.openjdk.org/pipermail/shenandoah-dev/2022-October/017494.html > > Also tested `hotspot_gc_shenandoah` on x86 and AArch64. > Thank you, that is a useful change! I verified performance and it does improve both throughput and latency on several machines (not as much as for you - but I also have not thrown so many CPUs at it.. ) Thanks for testing! The improvement is quite dependent on the machine you're using (the 160-core one is probably an outlier ;-), and there's a marked difference between NUMA and non-NUMA systems. ------------- PR: https://git.openjdk.org/jdk/pull/10573 From ngasson at openjdk.org Tue Oct 11 12:36:31 2022 From: ngasson at openjdk.org (Nick Gasson) Date: Tue, 11 Oct 2022 12:36:31 GMT Subject: RFR: 8294775: Shenandoah: reduce contention on _threads_in_evac In-Reply-To: References:

Message-ID: <7mpUhXJtmnGLJ1qqMtbAYNnGPIdTaYVjQjEIAhecNds=.7d5d02c4-6506-440f-969e-4a26e5f057ca@github.com> On Tue, 11 Oct 2022 12:02:43 GMT, Roman Kennke wrote: >> The idea here is to reduce contention on the shared `_threads_in_evac` counter by splitting its state over multiple independent cache lines. Each thread hashes to one particular counter based on its `Thread*`. This helps improve throughput of concurrent evacuation where many Java threads may be attempting to update this counter on the load barrier slow path. >> >> See this earlier thread for details and SPECjbb results: https://mail.openjdk.org/pipermail/shenandoah-dev/2022-October/017494.html >> >> Also tested `hotspot_gc_shenandoah` on x86 and AArch64. > > src/hotspot/share/gc/shenandoah/shenandoahEvacOOMHandler.cpp line 40: > >> 38: >> 39: volatile jint *ShenandoahEvacOOMHandler::threads_in_evac_ptr(Thread* t) { >> 40: uint64_t key = (uintptr_t)t; > > Maybe put that in a separate hash(Thread*) function? Also, is that a particular documented hash-function?(Related: In Lilliput project, I am working on a different identity-hash-code implementation, and part of it will be a hash-implementation to hash arbitrary pointers to 32 or 64 bit hash, currently using murmur3. Maybe this could be reused for here, when it happens?) It is actually the bit mixing function from MurmurHash3. The particular algorithm doesn't matter too much though - I just couldn't find an existing one in the shared code. > src/hotspot/share/gc/shenandoah/shenandoahEvacOOMHandler.cpp line 55: > >> 53: // *and* the counter is zero. >> 54: while (Atomic::load_acquire(ptr) != OOM_MARKER_MASK) { >> 55: os::naked_short_sleep(1); > > Not sure if SpinPause() may be better here? @shipilev probably knows more. I think we'd probably want some back-off here rather than spinning indefinitely? E.g. spin N times and then start sleeping. > src/hotspot/share/gc/shenandoah/shenandoahEvacOOMHandler.hpp line 88: > >> 86: static const jint OOM_MARKER_MASK; >> 87: >> 88: static constexpr jint EVAC_COUNTER_BUCKETS = 64; > > Maybe it'd be useful to not hardwire this? It could be a runtime option, possibly diagnostic (not sure). Many workloads would not even use so many threads... If we're going to make it dynamic maybe it should be set to the number of physical CPUs? ------------- PR: https://git.openjdk.org/jdk/pull/10573 From shade at openjdk.org Tue Oct 11 18:14:17 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 11 Oct 2022 18:14:17 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: References: Message-ID: On Thu, 6 Oct 2022 10:23:04 GMT, Roman Kennke wrote: > This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turns out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. > > What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock. > > This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typcially remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols. > > In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth to add support for recursive fast-locking. > > One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, and thus handing over to the contending thread. > > As an alternative, I considered to remove stack-locking altogether, and only use heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it. > > This change enables to simplify (and speed-up!) a lot of code: > > - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header. > - Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. This is safe because Java threads participate in monitor deflation protocol. This would be implemented in a separate PR > > ### Benchmarks > > All benchmarks are run on server-class metal machines. The JVM settings are always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less is better. > > #### DaCapo/AArch64 > > Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. Benchmarks that showed results far off the baseline or showed high variance have been repeated and I am reporting results with the most bias *against* fast-locking. The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean what is it actually doing that a change in locking shows >30% perf difference? > > benchmark | baseline | fast-locking | % | size > -- | -- | -- | -- | -- > avrora | 27859 | 27563 | 1.07% | large > batik | 20786 | 20847 | -0.29% | large > biojava | 27421 | 27334 | 0.32% | default > eclipse | 59918 | 60522 | -1.00% | large > fop | 3670 | 3678 | -0.22% | default > graphchi | 2088 | 2060 | 1.36% | default > h2 | 297391 | 291292 | 2.09% | huge > jme | 8762 | 8877 | -1.30% | default > jython | 18938 | 18878 | 0.32% | default > luindex | 1339 | 1325 | 1.06% | default > lusearch | 918 | 936 | -1.92% | default > pmd | 58291 | 58423 | -0.23% | large > sunflow | 32617 | 24961 | 30.67% | large > tomcat | 25481 | 25992 | -1.97% | large > tradebeans | 314640 | 311706 | 0.94% | huge > tradesoap | 107473 | 110246 | -2.52% | huge > xalan | 6047 | 5882 | 2.81% | default > zxing | 970 | 926 | 4.75% | default > > #### DaCapo/x86_64 > > The following measurements have been taken on an Intel Xeon Scalable Processors (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same settings and considerations as in the measurements above. > > benchmark | baseline | fast-Locking | % | size > -- | -- | -- | -- | -- > avrora | 127690 | 126749 | 0.74% | large > batik | 12736 | 12641 | 0.75% | large > biojava | 15423 | 15404 | 0.12% | default > eclipse | 41174 | 41498 | -0.78% | large > fop | 2184 | 2172 | 0.55% | default > graphchi | 1579 | 1560 | 1.22% | default > h2 | 227614 | 230040 | -1.05% | huge > jme | 8591 | 8398 | 2.30% | default > jython | 13473 | 13356 | 0.88% | default > luindex | 824 | 813 | 1.35% | default > lusearch | 962 | 968 | -0.62% | default > pmd | 40827 | 39654 | 2.96% | large > sunflow | 53362 | 43475 | 22.74% | large > tomcat | 27549 | 28029 | -1.71% | large > tradebeans | 190757 | 190994 | -0.12% | huge > tradesoap | 68099 | 67934 | 0.24% | huge > xalan | 7969 | 8178 | -2.56% | default > zxing | 1176 | 1148 | 2.44% | default > > #### Renaissance/AArch64 > > This tests Renaissance/JMH version 0.14.1 on same machines as DaCapo above, with same JVM settings. > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 2558.832 | 2513.594 | 1.80% > Reactors | 14715.626 | 14311.246 | 2.83% > Als | 1851.485 | 1869.622 | -0.97% > ChiSquare | 1007.788 | 1003.165 | 0.46% > GaussMix | 1157.491 | 1149.969 | 0.65% > LogRegression | 717.772 | 733.576 | -2.15% > MovieLens | 7916.181 | 8002.226 | -1.08% > NaiveBayes | 395.296 | 386.611 | 2.25% > PageRank | 4294.939 | 4346.333 | -1.18% > FjKmeans | 496.076 | 493.873 | 0.45% > FutureGenetic | 2578.504 | 2589.255 | -0.42% > Mnemonics | 4898.886 | 4903.689 | -0.10% > ParMnemonics | 4260.507 | 4210.121 | 1.20% > Scrabble | 139.37 | 138.312 | 0.76% > RxScrabble | 320.114 | 322.651 | -0.79% > Dotty | 1056.543 | 1068.492 | -1.12% > ScalaDoku | 3443.117 | 3449.477 | -0.18% > ScalaKmeans | 259.384 | 258.648 | 0.28% > Philosophers | 24333.311 | 23438.22 | 3.82% > ScalaStmBench7 | 1102.43 | 1115.142 | -1.14% > FinagleChirper | 6814.192 | 6853.38 | -0.57% > FinagleHttp | 4762.902 | 4807.564 | -0.93% > > #### Renaissance/x86_64 > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 1117.185 | 1116.425 | 0.07% > Reactors | 11561.354 | 11812.499 | -2.13% > Als | 1580.838 | 1575.318 | 0.35% > ChiSquare | 459.601 | 467.109 | -1.61% > GaussMix | 705.944 | 685.595 | 2.97% > LogRegression | 659.944 | 656.428 | 0.54% > MovieLens | 7434.303 | 7592.271 | -2.08% > NaiveBayes | 413.482 | 417.369 | -0.93% > PageRank | 3259.233 | 3276.589 | -0.53% > FjKmeans | 946.429 | 938.991 | 0.79% > FutureGenetic | 1760.672 | 1815.272 | -3.01% > ParMnemonics | 2016.917 | 2033.101 | -0.80% > Scrabble | 147.996 | 150.084 | -1.39% > RxScrabble | 177.755 | 177.956 | -0.11% > Dotty | 673.754 | 683.919 | -1.49% > ScalaDoku | 2193.562 | 1958.419 | 12.01% > ScalaKmeans | 165.376 | 168.925 | -2.10% > ScalaStmBench7 | 1080.187 | 1049.184 | 2.95% > Philosophers | 14268.449 | 13308.87 | 7.21% > FinagleChirper | 4722.13 | 4688.3 | 0.72% > FinagleHttp | 3497.241 | 3605.118 | -2.99% > > Some renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and probably addressing with running them much more often. > > I have also run another benchmark, which is a popular Java JVM benchmark, with workloads wrapped in JMH and very slightly modified to run with newer JDKs, but I won't publish the results because I am not sure about the licensing terms. They look similar to the measurements above (i.e. +/- 2%, nothing very suspicious). > > Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here. > > ### Testing > - [x] tier1 (x86_64, aarch64, x86_32) > - [x] tier2 (x86_64, aarch64) > - [x] tier3 (x86_64, aarch64) > - [x] tier4 (x86_64, aarch64) I have a few questions after porting this to RISC-V... src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp line 272: > 270: // SharedRuntime::OSR_migration_begin() packs BasicObjectLocks in > 271: // the OSR buffer using 2 word entries: first the lock and then > 272: // the oop. This comment is now irrelevant? src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp line 432: > 430: if (method()->is_synchronized()) { > 431: monitor_address(0, FrameMap::r0_opr); > 432: __ ldr(r4, Address(r0, BasicObjectLock::obj_offset_in_bytes())); Do we have to use a new register here, or can we just reuse `r0`? src/hotspot/cpu/aarch64/sharedRuntime_aarch64.cpp line 1886: > 1884: > 1885: __ mov(c_rarg0, obj_reg); > 1886: __ mov(c_rarg1, rthread); Now that you dropped an argument here, you need to do `__ call_VM_leaf` with `2`, not with `3` arguments? ------------- PR: https://git.openjdk.org/jdk/pull/10590 From rkennke at openjdk.org Tue Oct 11 19:49:32 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Tue, 11 Oct 2022 19:49:32 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking [v2] In-Reply-To: References: Message-ID: <4G3892Q41Qwlt15Y1dmLWkNUmyIEusWVJH2fdb3K0eM=.5ff1859b-baa1-4d60-866b-8e9747a79180@github.com> > This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turns out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. > > What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock. > > This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typcially remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols. > > In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth to add support for recursive fast-locking. > > One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, and thus handing over to the contending thread. > > As an alternative, I considered to remove stack-locking altogether, and only use heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it. > > This change enables to simplify (and speed-up!) a lot of code: > > - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header. > - Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. This is safe because Java threads participate in monitor deflation protocol. This would be implemented in a separate PR > > ### Benchmarks > > All benchmarks are run on server-class metal machines. The JVM settings are always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less is better. > > #### DaCapo/AArch64 > > Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. Benchmarks that showed results far off the baseline or showed high variance have been repeated and I am reporting results with the most bias *against* fast-locking. The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean what is it actually doing that a change in locking shows >30% perf difference? > > benchmark | baseline | fast-locking | % | size > -- | -- | -- | -- | -- > avrora | 27859 | 27563 | 1.07% | large > batik | 20786 | 20847 | -0.29% | large > biojava | 27421 | 27334 | 0.32% | default > eclipse | 59918 | 60522 | -1.00% | large > fop | 3670 | 3678 | -0.22% | default > graphchi | 2088 | 2060 | 1.36% | default > h2 | 297391 | 291292 | 2.09% | huge > jme | 8762 | 8877 | -1.30% | default > jython | 18938 | 18878 | 0.32% | default > luindex | 1339 | 1325 | 1.06% | default > lusearch | 918 | 936 | -1.92% | default > pmd | 58291 | 58423 | -0.23% | large > sunflow | 32617 | 24961 | 30.67% | large > tomcat | 25481 | 25992 | -1.97% | large > tradebeans | 314640 | 311706 | 0.94% | huge > tradesoap | 107473 | 110246 | -2.52% | huge > xalan | 6047 | 5882 | 2.81% | default > zxing | 970 | 926 | 4.75% | default > > #### DaCapo/x86_64 > > The following measurements have been taken on an Intel Xeon Scalable Processors (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same settings and considerations as in the measurements above. > > benchmark | baseline | fast-Locking | % | size > -- | -- | -- | -- | -- > avrora | 127690 | 126749 | 0.74% | large > batik | 12736 | 12641 | 0.75% | large > biojava | 15423 | 15404 | 0.12% | default > eclipse | 41174 | 41498 | -0.78% | large > fop | 2184 | 2172 | 0.55% | default > graphchi | 1579 | 1560 | 1.22% | default > h2 | 227614 | 230040 | -1.05% | huge > jme | 8591 | 8398 | 2.30% | default > jython | 13473 | 13356 | 0.88% | default > luindex | 824 | 813 | 1.35% | default > lusearch | 962 | 968 | -0.62% | default > pmd | 40827 | 39654 | 2.96% | large > sunflow | 53362 | 43475 | 22.74% | large > tomcat | 27549 | 28029 | -1.71% | large > tradebeans | 190757 | 190994 | -0.12% | huge > tradesoap | 68099 | 67934 | 0.24% | huge > xalan | 7969 | 8178 | -2.56% | default > zxing | 1176 | 1148 | 2.44% | default > > #### Renaissance/AArch64 > > This tests Renaissance/JMH version 0.14.1 on same machines as DaCapo above, with same JVM settings. > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 2558.832 | 2513.594 | 1.80% > Reactors | 14715.626 | 14311.246 | 2.83% > Als | 1851.485 | 1869.622 | -0.97% > ChiSquare | 1007.788 | 1003.165 | 0.46% > GaussMix | 1157.491 | 1149.969 | 0.65% > LogRegression | 717.772 | 733.576 | -2.15% > MovieLens | 7916.181 | 8002.226 | -1.08% > NaiveBayes | 395.296 | 386.611 | 2.25% > PageRank | 4294.939 | 4346.333 | -1.18% > FjKmeans | 496.076 | 493.873 | 0.45% > FutureGenetic | 2578.504 | 2589.255 | -0.42% > Mnemonics | 4898.886 | 4903.689 | -0.10% > ParMnemonics | 4260.507 | 4210.121 | 1.20% > Scrabble | 139.37 | 138.312 | 0.76% > RxScrabble | 320.114 | 322.651 | -0.79% > Dotty | 1056.543 | 1068.492 | -1.12% > ScalaDoku | 3443.117 | 3449.477 | -0.18% > ScalaKmeans | 259.384 | 258.648 | 0.28% > Philosophers | 24333.311 | 23438.22 | 3.82% > ScalaStmBench7 | 1102.43 | 1115.142 | -1.14% > FinagleChirper | 6814.192 | 6853.38 | -0.57% > FinagleHttp | 4762.902 | 4807.564 | -0.93% > > #### Renaissance/x86_64 > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 1117.185 | 1116.425 | 0.07% > Reactors | 11561.354 | 11812.499 | -2.13% > Als | 1580.838 | 1575.318 | 0.35% > ChiSquare | 459.601 | 467.109 | -1.61% > GaussMix | 705.944 | 685.595 | 2.97% > LogRegression | 659.944 | 656.428 | 0.54% > MovieLens | 7434.303 | 7592.271 | -2.08% > NaiveBayes | 413.482 | 417.369 | -0.93% > PageRank | 3259.233 | 3276.589 | -0.53% > FjKmeans | 946.429 | 938.991 | 0.79% > FutureGenetic | 1760.672 | 1815.272 | -3.01% > ParMnemonics | 2016.917 | 2033.101 | -0.80% > Scrabble | 147.996 | 150.084 | -1.39% > RxScrabble | 177.755 | 177.956 | -0.11% > Dotty | 673.754 | 683.919 | -1.49% > ScalaDoku | 2193.562 | 1958.419 | 12.01% > ScalaKmeans | 165.376 | 168.925 | -2.10% > ScalaStmBench7 | 1080.187 | 1049.184 | 2.95% > Philosophers | 14268.449 | 13308.87 | 7.21% > FinagleChirper | 4722.13 | 4688.3 | 0.72% > FinagleHttp | 3497.241 | 3605.118 | -2.99% > > Some renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and probably addressing with running them much more often. > > I have also run another benchmark, which is a popular Java JVM benchmark, with workloads wrapped in JMH and very slightly modified to run with newer JDKs, but I won't publish the results because I am not sure about the licensing terms. They look similar to the measurements above (i.e. +/- 2%, nothing very suspicious). > > Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here. > > ### Testing > - [x] tier1 (x86_64, aarch64, x86_32) > - [x] tier2 (x86_64, aarch64) > - [x] tier3 (x86_64, aarch64) > - [x] tier4 (x86_64, aarch64) Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: Fix number of rt args to complete_monitor_locking_C, remove some comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10590/files - new: https://git.openjdk.org/jdk/pull/10590/files/3ed51053..34bed54f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10590&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10590&range=00-01 Stats: 13 lines in 6 files changed: 0 ins; 11 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/10590.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10590/head:pull/10590 PR: https://git.openjdk.org/jdk/pull/10590 From rkennke at openjdk.org Tue Oct 11 20:01:32 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Tue, 11 Oct 2022 20:01:32 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking [v3] In-Reply-To: References: Message-ID: > This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turns out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. > > What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock. > > This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typcially remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols. > > In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth to add support for recursive fast-locking. > > One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, and thus handing over to the contending thread. > > As an alternative, I considered to remove stack-locking altogether, and only use heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it. > > This change enables to simplify (and speed-up!) a lot of code: > > - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header. > - Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. This is safe because Java threads participate in monitor deflation protocol. This would be implemented in a separate PR > > ### Benchmarks > > All benchmarks are run on server-class metal machines. The JVM settings are always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less is better. > > #### DaCapo/AArch64 > > Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. Benchmarks that showed results far off the baseline or showed high variance have been repeated and I am reporting results with the most bias *against* fast-locking. The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean what is it actually doing that a change in locking shows >30% perf difference? > > benchmark | baseline | fast-locking | % | size > -- | -- | -- | -- | -- > avrora | 27859 | 27563 | 1.07% | large > batik | 20786 | 20847 | -0.29% | large > biojava | 27421 | 27334 | 0.32% | default > eclipse | 59918 | 60522 | -1.00% | large > fop | 3670 | 3678 | -0.22% | default > graphchi | 2088 | 2060 | 1.36% | default > h2 | 297391 | 291292 | 2.09% | huge > jme | 8762 | 8877 | -1.30% | default > jython | 18938 | 18878 | 0.32% | default > luindex | 1339 | 1325 | 1.06% | default > lusearch | 918 | 936 | -1.92% | default > pmd | 58291 | 58423 | -0.23% | large > sunflow | 32617 | 24961 | 30.67% | large > tomcat | 25481 | 25992 | -1.97% | large > tradebeans | 314640 | 311706 | 0.94% | huge > tradesoap | 107473 | 110246 | -2.52% | huge > xalan | 6047 | 5882 | 2.81% | default > zxing | 970 | 926 | 4.75% | default > > #### DaCapo/x86_64 > > The following measurements have been taken on an Intel Xeon Scalable Processors (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same settings and considerations as in the measurements above. > > benchmark | baseline | fast-Locking | % | size > -- | -- | -- | -- | -- > avrora | 127690 | 126749 | 0.74% | large > batik | 12736 | 12641 | 0.75% | large > biojava | 15423 | 15404 | 0.12% | default > eclipse | 41174 | 41498 | -0.78% | large > fop | 2184 | 2172 | 0.55% | default > graphchi | 1579 | 1560 | 1.22% | default > h2 | 227614 | 230040 | -1.05% | huge > jme | 8591 | 8398 | 2.30% | default > jython | 13473 | 13356 | 0.88% | default > luindex | 824 | 813 | 1.35% | default > lusearch | 962 | 968 | -0.62% | default > pmd | 40827 | 39654 | 2.96% | large > sunflow | 53362 | 43475 | 22.74% | large > tomcat | 27549 | 28029 | -1.71% | large > tradebeans | 190757 | 190994 | -0.12% | huge > tradesoap | 68099 | 67934 | 0.24% | huge > xalan | 7969 | 8178 | -2.56% | default > zxing | 1176 | 1148 | 2.44% | default > > #### Renaissance/AArch64 > > This tests Renaissance/JMH version 0.14.1 on same machines as DaCapo above, with same JVM settings. > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 2558.832 | 2513.594 | 1.80% > Reactors | 14715.626 | 14311.246 | 2.83% > Als | 1851.485 | 1869.622 | -0.97% > ChiSquare | 1007.788 | 1003.165 | 0.46% > GaussMix | 1157.491 | 1149.969 | 0.65% > LogRegression | 717.772 | 733.576 | -2.15% > MovieLens | 7916.181 | 8002.226 | -1.08% > NaiveBayes | 395.296 | 386.611 | 2.25% > PageRank | 4294.939 | 4346.333 | -1.18% > FjKmeans | 496.076 | 493.873 | 0.45% > FutureGenetic | 2578.504 | 2589.255 | -0.42% > Mnemonics | 4898.886 | 4903.689 | -0.10% > ParMnemonics | 4260.507 | 4210.121 | 1.20% > Scrabble | 139.37 | 138.312 | 0.76% > RxScrabble | 320.114 | 322.651 | -0.79% > Dotty | 1056.543 | 1068.492 | -1.12% > ScalaDoku | 3443.117 | 3449.477 | -0.18% > ScalaKmeans | 259.384 | 258.648 | 0.28% > Philosophers | 24333.311 | 23438.22 | 3.82% > ScalaStmBench7 | 1102.43 | 1115.142 | -1.14% > FinagleChirper | 6814.192 | 6853.38 | -0.57% > FinagleHttp | 4762.902 | 4807.564 | -0.93% > > #### Renaissance/x86_64 > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 1117.185 | 1116.425 | 0.07% > Reactors | 11561.354 | 11812.499 | -2.13% > Als | 1580.838 | 1575.318 | 0.35% > ChiSquare | 459.601 | 467.109 | -1.61% > GaussMix | 705.944 | 685.595 | 2.97% > LogRegression | 659.944 | 656.428 | 0.54% > MovieLens | 7434.303 | 7592.271 | -2.08% > NaiveBayes | 413.482 | 417.369 | -0.93% > PageRank | 3259.233 | 3276.589 | -0.53% > FjKmeans | 946.429 | 938.991 | 0.79% > FutureGenetic | 1760.672 | 1815.272 | -3.01% > ParMnemonics | 2016.917 | 2033.101 | -0.80% > Scrabble | 147.996 | 150.084 | -1.39% > RxScrabble | 177.755 | 177.956 | -0.11% > Dotty | 673.754 | 683.919 | -1.49% > ScalaDoku | 2193.562 | 1958.419 | 12.01% > ScalaKmeans | 165.376 | 168.925 | -2.10% > ScalaStmBench7 | 1080.187 | 1049.184 | 2.95% > Philosophers | 14268.449 | 13308.87 | 7.21% > FinagleChirper | 4722.13 | 4688.3 | 0.72% > FinagleHttp | 3497.241 | 3605.118 | -2.99% > > Some renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and probably addressing with running them much more often. > > I have also run another benchmark, which is a popular Java JVM benchmark, with workloads wrapped in JMH and very slightly modified to run with newer JDKs, but I won't publish the results because I am not sure about the licensing terms. They look similar to the measurements above (i.e. +/- 2%, nothing very suspicious). > > Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here. > > ### Testing > - [x] tier1 (x86_64, aarch64, x86_32) > - [x] tier2 (x86_64, aarch64) > - [x] tier3 (x86_64, aarch64) > - [x] tier4 (x86_64, aarch64) Roman Kennke has updated the pull request incrementally with two additional commits since the last revision: - Merge remote-tracking branch 'origin/fast-locking' into fast-locking - Re-use r0 in call to unlock_object() ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10590/files - new: https://git.openjdk.org/jdk/pull/10590/files/34bed54f..4ccdab8f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10590&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10590&range=01-02 Stats: 7 lines in 3 files changed: 0 ins; 1 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/10590.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10590/head:pull/10590 PR: https://git.openjdk.org/jdk/pull/10590 From rkennke at openjdk.org Tue Oct 11 20:01:33 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Tue, 11 Oct 2022 20:01:33 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking [v3] In-Reply-To: References:

Message-ID: On Tue, 11 Oct 2022 13:25:30 GMT, Aleksey Shipilev wrote: >> Roman Kennke has updated the pull request incrementally with two additional commits since the last revision: >> >> - Merge remote-tracking branch 'origin/fast-locking' into fast-locking >> - Re-use r0 in call to unlock_object() > > src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp line 272: > >> 270: // SharedRuntime::OSR_migration_begin() packs BasicObjectLocks in >> 271: // the OSR buffer using 2 word entries: first the lock and then >> 272: // the oop. > > This comment is now irrelevant? Yes, removed it there and in same files in other arches. > src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp line 432: > >> 430: if (method()->is_synchronized()) { >> 431: monitor_address(0, FrameMap::r0_opr); >> 432: __ ldr(r4, Address(r0, BasicObjectLock::obj_offset_in_bytes())); > > Do we have to use a new register here, or can we just reuse `r0`? r0 is used below in call to unlock_object(), but not actually used there. I shuffled it a little and re-use r0 now. > src/hotspot/cpu/aarch64/sharedRuntime_aarch64.cpp line 1886: > >> 1884: >> 1885: __ mov(c_rarg0, obj_reg); >> 1886: __ mov(c_rarg1, rthread); > > Now that you dropped an argument here, you need to do `__ call_VM_leaf` with `2`, not with `3` arguments? Good catch! Yes. ------------- PR: https://git.openjdk.org/jdk/pull/10590 From rehn at openjdk.org Tue Oct 11 20:44:06 2022 From: rehn at openjdk.org (Robbin Ehn) Date: Tue, 11 Oct 2022 20:44:06 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking [v3] In-Reply-To: References:

Message-ID: On Tue, 11 Oct 2022 20:01:32 GMT, Roman Kennke wrote: >> This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turns out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. >> >> What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock. >> >> This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typcially remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols. >> >> In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth to add support for recursive fast-locking. >> >> One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, and thus handing over to the contending thread. >> >> As an alternative, I considered to remove stack-locking altogether, and only use heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it. >> >> This change enables to simplify (and speed-up!) a lot of code: >> >> - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header. >> - Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. This is safe because Java threads participate in monitor deflation protocol. This would be implemented in a separate PR >> >> ### Benchmarks >> >> All benchmarks are run on server-class metal machines. The JVM settings are always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less is better. >> >> #### DaCapo/AArch64 >> >> Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. Benchmarks that showed results far off the baseline or showed high variance have been repeated and I am reporting results with the most bias *against* fast-locking. The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean what is it actually doing that a change in locking shows >30% perf difference? >> >> benchmark | baseline | fast-locking | % | size >> -- | -- | -- | -- | -- >> avrora | 27859 | 27563 | 1.07% | large >> batik | 20786 | 20847 | -0.29% | large >> biojava | 27421 | 27334 | 0.32% | default >> eclipse | 59918 | 60522 | -1.00% | large >> fop | 3670 | 3678 | -0.22% | default >> graphchi | 2088 | 2060 | 1.36% | default >> h2 | 297391 | 291292 | 2.09% | huge >> jme | 8762 | 8877 | -1.30% | default >> jython | 18938 | 18878 | 0.32% | default >> luindex | 1339 | 1325 | 1.06% | default >> lusearch | 918 | 936 | -1.92% | default >> pmd | 58291 | 58423 | -0.23% | large >> sunflow | 32617 | 24961 | 30.67% | large >> tomcat | 25481 | 25992 | -1.97% | large >> tradebeans | 314640 | 311706 | 0.94% | huge >> tradesoap | 107473 | 110246 | -2.52% | huge >> xalan | 6047 | 5882 | 2.81% | default >> zxing | 970 | 926 | 4.75% | default >> >> #### DaCapo/x86_64 >> >> The following measurements have been taken on an Intel Xeon Scalable Processors (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same settings and considerations as in the measurements above. >> >> benchmark | baseline | fast-Locking | % | size >> -- | -- | -- | -- | -- >> avrora | 127690 | 126749 | 0.74% | large >> batik | 12736 | 12641 | 0.75% | large >> biojava | 15423 | 15404 | 0.12% | default >> eclipse | 41174 | 41498 | -0.78% | large >> fop | 2184 | 2172 | 0.55% | default >> graphchi | 1579 | 1560 | 1.22% | default >> h2 | 227614 | 230040 | -1.05% | huge >> jme | 8591 | 8398 | 2.30% | default >> jython | 13473 | 13356 | 0.88% | default >> luindex | 824 | 813 | 1.35% | default >> lusearch | 962 | 968 | -0.62% | default >> pmd | 40827 | 39654 | 2.96% | large >> sunflow | 53362 | 43475 | 22.74% | large >> tomcat | 27549 | 28029 | -1.71% | large >> tradebeans | 190757 | 190994 | -0.12% | huge >> tradesoap | 68099 | 67934 | 0.24% | huge >> xalan | 7969 | 8178 | -2.56% | default >> zxing | 1176 | 1148 | 2.44% | default >> >> #### Renaissance/AArch64 >> >> This tests Renaissance/JMH version 0.14.1 on same machines as DaCapo above, with same JVM settings. >> >> benchmark | baseline | fast-locking | % >> -- | -- | -- | -- >> AkkaUct | 2558.832 | 2513.594 | 1.80% >> Reactors | 14715.626 | 14311.246 | 2.83% >> Als | 1851.485 | 1869.622 | -0.97% >> ChiSquare | 1007.788 | 1003.165 | 0.46% >> GaussMix | 1157.491 | 1149.969 | 0.65% >> LogRegression | 717.772 | 733.576 | -2.15% >> MovieLens | 7916.181 | 8002.226 | -1.08% >> NaiveBayes | 395.296 | 386.611 | 2.25% >> PageRank | 4294.939 | 4346.333 | -1.18% >> FjKmeans | 496.076 | 493.873 | 0.45% >> FutureGenetic | 2578.504 | 2589.255 | -0.42% >> Mnemonics | 4898.886 | 4903.689 | -0.10% >> ParMnemonics | 4260.507 | 4210.121 | 1.20% >> Scrabble | 139.37 | 138.312 | 0.76% >> RxScrabble | 320.114 | 322.651 | -0.79% >> Dotty | 1056.543 | 1068.492 | -1.12% >> ScalaDoku | 3443.117 | 3449.477 | -0.18% >> ScalaKmeans | 259.384 | 258.648 | 0.28% >> Philosophers | 24333.311 | 23438.22 | 3.82% >> ScalaStmBench7 | 1102.43 | 1115.142 | -1.14% >> FinagleChirper | 6814.192 | 6853.38 | -0.57% >> FinagleHttp | 4762.902 | 4807.564 | -0.93% >> >> #### Renaissance/x86_64 >> >> benchmark | baseline | fast-locking | % >> -- | -- | -- | -- >> AkkaUct | 1117.185 | 1116.425 | 0.07% >> Reactors | 11561.354 | 11812.499 | -2.13% >> Als | 1580.838 | 1575.318 | 0.35% >> ChiSquare | 459.601 | 467.109 | -1.61% >> GaussMix | 705.944 | 685.595 | 2.97% >> LogRegression | 659.944 | 656.428 | 0.54% >> MovieLens | 7434.303 | 7592.271 | -2.08% >> NaiveBayes | 413.482 | 417.369 | -0.93% >> PageRank | 3259.233 | 3276.589 | -0.53% >> FjKmeans | 946.429 | 938.991 | 0.79% >> FutureGenetic | 1760.672 | 1815.272 | -3.01% >> ParMnemonics | 2016.917 | 2033.101 | -0.80% >> Scrabble | 147.996 | 150.084 | -1.39% >> RxScrabble | 177.755 | 177.956 | -0.11% >> Dotty | 673.754 | 683.919 | -1.49% >> ScalaDoku | 2193.562 | 1958.419 | 12.01% >> ScalaKmeans | 165.376 | 168.925 | -2.10% >> ScalaStmBench7 | 1080.187 | 1049.184 | 2.95% >> Philosophers | 14268.449 | 13308.87 | 7.21% >> FinagleChirper | 4722.13 | 4688.3 | 0.72% >> FinagleHttp | 3497.241 | 3605.118 | -2.99% >> >> Some renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and probably addressing with running them much more often. >> >> I have also run another benchmark, which is a popular Java JVM benchmark, with workloads wrapped in JMH and very slightly modified to run with newer JDKs, but I won't publish the results because I am not sure about the licensing terms. They look similar to the measurements above (i.e. +/- 2%, nothing very suspicious). >> >> Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here. >> >> ### Testing >> - [x] tier1 (x86_64, aarch64, x86_32) >> - [x] tier2 (x86_64, aarch64) >> - [x] tier3 (x86_64, aarch64) >> - [x] tier4 (x86_64, aarch64) > > Roman Kennke has updated the pull request incrementally with two additional commits since the last revision: > > - Merge remote-tracking branch 'origin/fast-locking' into fast-locking > - Re-use r0 in call to unlock_object() Here is the basic support for RISC-V: https://cr.openjdk.java.net/~shade/8291555/riscv-patch-1.patch -- I adapted this from AArch64 changes, and tested it very lightly. @RealFYang, can I leave the testing and follow up fixes to you? ------------- PR: https://git.openjdk.org/jdk/pull/10590 From rkennke at openjdk.org Thu Oct 13 07:33:48 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 13 Oct 2022 07:33:48 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking [v4] In-Reply-To: References: Message-ID: > This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turns out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. > > What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock. > > This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typcially remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols. > > In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth to add support for recursive fast-locking. > > One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, and thus handing over to the contending thread. > > As an alternative, I considered to remove stack-locking altogether, and only use heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it. > > This change enables to simplify (and speed-up!) a lot of code: > > - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header. > - Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. This is safe because Java threads participate in monitor deflation protocol. This would be implemented in a separate PR > > ### Benchmarks > > All benchmarks are run on server-class metal machines. The JVM settings are always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less is better. > > #### DaCapo/AArch64 > > Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. Benchmarks that showed results far off the baseline or showed high variance have been repeated and I am reporting results with the most bias *against* fast-locking. The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean what is it actually doing that a change in locking shows >30% perf difference? > > benchmark | baseline | fast-locking | % | size > -- | -- | -- | -- | -- > avrora | 27859 | 27563 | 1.07% | large > batik | 20786 | 20847 | -0.29% | large > biojava | 27421 | 27334 | 0.32% | default > eclipse | 59918 | 60522 | -1.00% | large > fop | 3670 | 3678 | -0.22% | default > graphchi | 2088 | 2060 | 1.36% | default > h2 | 297391 | 291292 | 2.09% | huge > jme | 8762 | 8877 | -1.30% | default > jython | 18938 | 18878 | 0.32% | default > luindex | 1339 | 1325 | 1.06% | default > lusearch | 918 | 936 | -1.92% | default > pmd | 58291 | 58423 | -0.23% | large > sunflow | 32617 | 24961 | 30.67% | large > tomcat | 25481 | 25992 | -1.97% | large > tradebeans | 314640 | 311706 | 0.94% | huge > tradesoap | 107473 | 110246 | -2.52% | huge > xalan | 6047 | 5882 | 2.81% | default > zxing | 970 | 926 | 4.75% | default > > #### DaCapo/x86_64 > > The following measurements have been taken on an Intel Xeon Scalable Processors (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same settings and considerations as in the measurements above. > > benchmark | baseline | fast-Locking | % | size > -- | -- | -- | -- | -- > avrora | 127690 | 126749 | 0.74% | large > batik | 12736 | 12641 | 0.75% | large > biojava | 15423 | 15404 | 0.12% | default > eclipse | 41174 | 41498 | -0.78% | large > fop | 2184 | 2172 | 0.55% | default > graphchi | 1579 | 1560 | 1.22% | default > h2 | 227614 | 230040 | -1.05% | huge > jme | 8591 | 8398 | 2.30% | default > jython | 13473 | 13356 | 0.88% | default > luindex | 824 | 813 | 1.35% | default > lusearch | 962 | 968 | -0.62% | default > pmd | 40827 | 39654 | 2.96% | large > sunflow | 53362 | 43475 | 22.74% | large > tomcat | 27549 | 28029 | -1.71% | large > tradebeans | 190757 | 190994 | -0.12% | huge > tradesoap | 68099 | 67934 | 0.24% | huge > xalan | 7969 | 8178 | -2.56% | default > zxing | 1176 | 1148 | 2.44% | default > > #### Renaissance/AArch64 > > This tests Renaissance/JMH version 0.14.1 on same machines as DaCapo above, with same JVM settings. > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 2558.832 | 2513.594 | 1.80% > Reactors | 14715.626 | 14311.246 | 2.83% > Als | 1851.485 | 1869.622 | -0.97% > ChiSquare | 1007.788 | 1003.165 | 0.46% > GaussMix | 1157.491 | 1149.969 | 0.65% > LogRegression | 717.772 | 733.576 | -2.15% > MovieLens | 7916.181 | 8002.226 | -1.08% > NaiveBayes | 395.296 | 386.611 | 2.25% > PageRank | 4294.939 | 4346.333 | -1.18% > FjKmeans | 496.076 | 493.873 | 0.45% > FutureGenetic | 2578.504 | 2589.255 | -0.42% > Mnemonics | 4898.886 | 4903.689 | -0.10% > ParMnemonics | 4260.507 | 4210.121 | 1.20% > Scrabble | 139.37 | 138.312 | 0.76% > RxScrabble | 320.114 | 322.651 | -0.79% > Dotty | 1056.543 | 1068.492 | -1.12% > ScalaDoku | 3443.117 | 3449.477 | -0.18% > ScalaKmeans | 259.384 | 258.648 | 0.28% > Philosophers | 24333.311 | 23438.22 | 3.82% > ScalaStmBench7 | 1102.43 | 1115.142 | -1.14% > FinagleChirper | 6814.192 | 6853.38 | -0.57% > FinagleHttp | 4762.902 | 4807.564 | -0.93% > > #### Renaissance/x86_64 > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 1117.185 | 1116.425 | 0.07% > Reactors | 11561.354 | 11812.499 | -2.13% > Als | 1580.838 | 1575.318 | 0.35% > ChiSquare | 459.601 | 467.109 | -1.61% > GaussMix | 705.944 | 685.595 | 2.97% > LogRegression | 659.944 | 656.428 | 0.54% > MovieLens | 7434.303 | 7592.271 | -2.08% > NaiveBayes | 413.482 | 417.369 | -0.93% > PageRank | 3259.233 | 3276.589 | -0.53% > FjKmeans | 946.429 | 938.991 | 0.79% > FutureGenetic | 1760.672 | 1815.272 | -3.01% > ParMnemonics | 2016.917 | 2033.101 | -0.80% > Scrabble | 147.996 | 150.084 | -1.39% > RxScrabble | 177.755 | 177.956 | -0.11% > Dotty | 673.754 | 683.919 | -1.49% > ScalaDoku | 2193.562 | 1958.419 | 12.01% > ScalaKmeans | 165.376 | 168.925 | -2.10% > ScalaStmBench7 | 1080.187 | 1049.184 | 2.95% > Philosophers | 14268.449 | 13308.87 | 7.21% > FinagleChirper | 4722.13 | 4688.3 | 0.72% > FinagleHttp | 3497.241 | 3605.118 | -2.99% > > Some renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and probably addressing with running them much more often. > > I have also run another benchmark, which is a popular Java JVM benchmark, with workloads wrapped in JMH and very slightly modified to run with newer JDKs, but I won't publish the results because I am not sure about the licensing terms. They look similar to the measurements above (i.e. +/- 2%, nothing very suspicious). > > Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here. > > ### Testing > - [x] tier1 (x86_64, aarch64, x86_32) > - [x] tier2 (x86_64, aarch64) > - [x] tier3 (x86_64, aarch64) > - [x] tier4 (x86_64, aarch64) Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: RISC-V port ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10590/files - new: https://git.openjdk.org/jdk/pull/10590/files/4ccdab8f..d9153be5 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10590&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10590&range=02-03 Stats: 368 lines in 11 files changed: 89 ins; 211 del; 68 mod Patch: https://git.openjdk.org/jdk/pull/10590.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10590/head:pull/10590 PR: https://git.openjdk.org/jdk/pull/10590 From rehn at openjdk.org Thu Oct 13 08:50:27 2022 From: rehn at openjdk.org (Robbin Ehn) Date: Thu, 13 Oct 2022 08:50:27 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking [v4] In-Reply-To: References:

Message-ID: On Thu, 13 Oct 2022 07:33:48 GMT, Roman Kennke wrote: >> This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turns out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. >> >> What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock. >> >> This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typcially remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols. >> >> In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth to add support for recursive fast-locking. >> >> One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, and thus handing over to the contending thread. >> >> As an alternative, I considered to remove stack-locking altogether, and only use heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it. >> >> This change enables to simplify (and speed-up!) a lot of code: >> >> - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header. >> - Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. This is safe because Java threads participate in monitor deflation protocol. This would be implemented in a separate PR >> >> ### Benchmarks >> >> All benchmarks are run on server-class metal machines. The JVM settings are always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less is better. >> >> #### DaCapo/AArch64 >> >> Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. Benchmarks that showed results far off the baseline or showed high variance have been repeated and I am reporting results with the most bias *against* fast-locking. The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean what is it actually doing that a change in locking shows >30% perf difference? >> >> benchmark | baseline | fast-locking | % | size >> -- | -- | -- | -- | -- >> avrora | 27859 | 27563 | 1.07% | large >> batik | 20786 | 20847 | -0.29% | large >> biojava | 27421 | 27334 | 0.32% | default >> eclipse | 59918 | 60522 | -1.00% | large >> fop | 3670 | 3678 | -0.22% | default >> graphchi | 2088 | 2060 | 1.36% | default >> h2 | 297391 | 291292 | 2.09% | huge >> jme | 8762 | 8877 | -1.30% | default >> jython | 18938 | 18878 | 0.32% | default >> luindex | 1339 | 1325 | 1.06% | default >> lusearch | 918 | 936 | -1.92% | default >> pmd | 58291 | 58423 | -0.23% | large >> sunflow | 32617 | 24961 | 30.67% | large >> tomcat | 25481 | 25992 | -1.97% | large >> tradebeans | 314640 | 311706 | 0.94% | huge >> tradesoap | 107473 | 110246 | -2.52% | huge >> xalan | 6047 | 5882 | 2.81% | default >> zxing | 970 | 926 | 4.75% | default >> >> #### DaCapo/x86_64 >> >> The following measurements have been taken on an Intel Xeon Scalable Processors (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same settings and considerations as in the measurements above. >> >> benchmark | baseline | fast-Locking | % | size >> -- | -- | -- | -- | -- >> avrora | 127690 | 126749 | 0.74% | large >> batik | 12736 | 12641 | 0.75% | large >> biojava | 15423 | 15404 | 0.12% | default >> eclipse | 41174 | 41498 | -0.78% | large >> fop | 2184 | 2172 | 0.55% | default >> graphchi | 1579 | 1560 | 1.22% | default >> h2 | 227614 | 230040 | -1.05% | huge >> jme | 8591 | 8398 | 2.30% | default >> jython | 13473 | 13356 | 0.88% | default >> luindex | 824 | 813 | 1.35% | default >> lusearch | 962 | 968 | -0.62% | default >> pmd | 40827 | 39654 | 2.96% | large >> sunflow | 53362 | 43475 | 22.74% | large >> tomcat | 27549 | 28029 | -1.71% | large >> tradebeans | 190757 | 190994 | -0.12% | huge >> tradesoap | 68099 | 67934 | 0.24% | huge >> xalan | 7969 | 8178 | -2.56% | default >> zxing | 1176 | 1148 | 2.44% | default >> >> #### Renaissance/AArch64 >> >> This tests Renaissance/JMH version 0.14.1 on same machines as DaCapo above, with same JVM settings. >> >> benchmark | baseline | fast-locking | % >> -- | -- | -- | -- >> AkkaUct | 2558.832 | 2513.594 | 1.80% >> Reactors | 14715.626 | 14311.246 | 2.83% >> Als | 1851.485 | 1869.622 | -0.97% >> ChiSquare | 1007.788 | 1003.165 | 0.46% >> GaussMix | 1157.491 | 1149.969 | 0.65% >> LogRegression | 717.772 | 733.576 | -2.15% >> MovieLens | 7916.181 | 8002.226 | -1.08% >> NaiveBayes | 395.296 | 386.611 | 2.25% >> PageRank | 4294.939 | 4346.333 | -1.18% >> FjKmeans | 496.076 | 493.873 | 0.45% >> FutureGenetic | 2578.504 | 2589.255 | -0.42% >> Mnemonics | 4898.886 | 4903.689 | -0.10% >> ParMnemonics | 4260.507 | 4210.121 | 1.20% >> Scrabble | 139.37 | 138.312 | 0.76% >> RxScrabble | 320.114 | 322.651 | -0.79% >> Dotty | 1056.543 | 1068.492 | -1.12% >> ScalaDoku | 3443.117 | 3449.477 | -0.18% >> ScalaKmeans | 259.384 | 258.648 | 0.28% >> Philosophers | 24333.311 | 23438.22 | 3.82% >> ScalaStmBench7 | 1102.43 | 1115.142 | -1.14% >> FinagleChirper | 6814.192 | 6853.38 | -0.57% >> FinagleHttp | 4762.902 | 4807.564 | -0.93% >> >> #### Renaissance/x86_64 >> >> benchmark | baseline | fast-locking | % >> -- | -- | -- | -- >> AkkaUct | 1117.185 | 1116.425 | 0.07% >> Reactors | 11561.354 | 11812.499 | -2.13% >> Als | 1580.838 | 1575.318 | 0.35% >> ChiSquare | 459.601 | 467.109 | -1.61% >> GaussMix | 705.944 | 685.595 | 2.97% >> LogRegression | 659.944 | 656.428 | 0.54% >> MovieLens | 7434.303 | 7592.271 | -2.08% >> NaiveBayes | 413.482 | 417.369 | -0.93% >> PageRank | 3259.233 | 3276.589 | -0.53% >> FjKmeans | 946.429 | 938.991 | 0.79% >> FutureGenetic | 1760.672 | 1815.272 | -3.01% >> ParMnemonics | 2016.917 | 2033.101 | -0.80% >> Scrabble | 147.996 | 150.084 | -1.39% >> RxScrabble | 177.755 | 177.956 | -0.11% >> Dotty | 673.754 | 683.919 | -1.49% >> ScalaDoku | 2193.562 | 1958.419 | 12.01% >> ScalaKmeans | 165.376 | 168.925 | -2.10% >> ScalaStmBench7 | 1080.187 | 1049.184 | 2.95% >> Philosophers | 14268.449 | 13308.87 | 7.21% >> FinagleChirper | 4722.13 | 4688.3 | 0.72% >> FinagleHttp | 3497.241 | 3605.118 | -2.99% >> >> Some renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and probably addressing with running them much more often. >> >> I have also run another benchmark, which is a popular Java JVM benchmark, with workloads wrapped in JMH and very slightly modified to run with newer JDKs, but I won't publish the results because I am not sure about the licensing terms. They look similar to the measurements above (i.e. +/- 2%, nothing very suspicious). >> >> Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here. >> >> ### Testing >> - [x] tier1 (x86_64, aarch64, x86_32) >> - [x] tier2 (x86_64, aarch64) >> - [x] tier3 (x86_64, aarch64) >> - [x] tier4 (x86_64, aarch64) > > Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: > > RISC-V port On aarch64 (linux and mac) I see these variations of crashes in random tests: # Internal Error .... src/hotspot/share/c1/c1_Runtime1.cpp:768), pid=2884803, tid=2884996 # assert(oopDesc::is_oop(oop(obj))) failed: must be NULL or an object: 0x000000000000dead # V [libjvm.so+0x7851d4] Runtime1::monitorexit(JavaThread*, oopDesc*)+0x110 # SIGSEGV (0xb) at pc=0x0000fffc9d4e3de8, pid=1842880, tid=1842994 # V [libjvm.so+0xbf3de8] SharedRuntime::monitor_exit_helper(oopDesc*, JavaThread*)+0x24 # SIGSEGV (0xb) at pc=0x0000fffca9f00394, pid=959883, tid=959927 # V [libjvm.so+0xc90394] ObjectSynchronizer::exit(oopDesc*, JavaThread*)+0x54 ------------- PR: https://git.openjdk.org/jdk/pull/10590 From rkennke at openjdk.org Thu Oct 13 10:35:16 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 13 Oct 2022 10:35:16 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking [v5] In-Reply-To: References: Message-ID: > This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turns out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. > > What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock. > > This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typcially remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols. > > In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth to add support for recursive fast-locking. > > One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, and thus handing over to the contending thread. > > As an alternative, I considered to remove stack-locking altogether, and only use heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it. > > This change enables to simplify (and speed-up!) a lot of code: > > - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header. > - Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. This is safe because Java threads participate in monitor deflation protocol. This would be implemented in a separate PR > > ### Benchmarks > > All benchmarks are run on server-class metal machines. The JVM settings are always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less is better. > > #### DaCapo/AArch64 > > Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. Benchmarks that showed results far off the baseline or showed high variance have been repeated and I am reporting results with the most bias *against* fast-locking. The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean what is it actually doing that a change in locking shows >30% perf difference? > > benchmark | baseline | fast-locking | % | size > -- | -- | -- | -- | -- > avrora | 27859 | 27563 | 1.07% | large > batik | 20786 | 20847 | -0.29% | large > biojava | 27421 | 27334 | 0.32% | default > eclipse | 59918 | 60522 | -1.00% | large > fop | 3670 | 3678 | -0.22% | default > graphchi | 2088 | 2060 | 1.36% | default > h2 | 297391 | 291292 | 2.09% | huge > jme | 8762 | 8877 | -1.30% | default > jython | 18938 | 18878 | 0.32% | default > luindex | 1339 | 1325 | 1.06% | default > lusearch | 918 | 936 | -1.92% | default > pmd | 58291 | 58423 | -0.23% | large > sunflow | 32617 | 24961 | 30.67% | large > tomcat | 25481 | 25992 | -1.97% | large > tradebeans | 314640 | 311706 | 0.94% | huge > tradesoap | 107473 | 110246 | -2.52% | huge > xalan | 6047 | 5882 | 2.81% | default > zxing | 970 | 926 | 4.75% | default > > #### DaCapo/x86_64 > > The following measurements have been taken on an Intel Xeon Scalable Processors (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same settings and considerations as in the measurements above. > > benchmark | baseline | fast-Locking | % | size > -- | -- | -- | -- | -- > avrora | 127690 | 126749 | 0.74% | large > batik | 12736 | 12641 | 0.75% | large > biojava | 15423 | 15404 | 0.12% | default > eclipse | 41174 | 41498 | -0.78% | large > fop | 2184 | 2172 | 0.55% | default > graphchi | 1579 | 1560 | 1.22% | default > h2 | 227614 | 230040 | -1.05% | huge > jme | 8591 | 8398 | 2.30% | default > jython | 13473 | 13356 | 0.88% | default > luindex | 824 | 813 | 1.35% | default > lusearch | 962 | 968 | -0.62% | default > pmd | 40827 | 39654 | 2.96% | large > sunflow | 53362 | 43475 | 22.74% | large > tomcat | 27549 | 28029 | -1.71% | large > tradebeans | 190757 | 190994 | -0.12% | huge > tradesoap | 68099 | 67934 | 0.24% | huge > xalan | 7969 | 8178 | -2.56% | default > zxing | 1176 | 1148 | 2.44% | default > > #### Renaissance/AArch64 > > This tests Renaissance/JMH version 0.14.1 on same machines as DaCapo above, with same JVM settings. > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 2558.832 | 2513.594 | 1.80% > Reactors | 14715.626 | 14311.246 | 2.83% > Als | 1851.485 | 1869.622 | -0.97% > ChiSquare | 1007.788 | 1003.165 | 0.46% > GaussMix | 1157.491 | 1149.969 | 0.65% > LogRegression | 717.772 | 733.576 | -2.15% > MovieLens | 7916.181 | 8002.226 | -1.08% > NaiveBayes | 395.296 | 386.611 | 2.25% > PageRank | 4294.939 | 4346.333 | -1.18% > FjKmeans | 496.076 | 493.873 | 0.45% > FutureGenetic | 2578.504 | 2589.255 | -0.42% > Mnemonics | 4898.886 | 4903.689 | -0.10% > ParMnemonics | 4260.507 | 4210.121 | 1.20% > Scrabble | 139.37 | 138.312 | 0.76% > RxScrabble | 320.114 | 322.651 | -0.79% > Dotty | 1056.543 | 1068.492 | -1.12% > ScalaDoku | 3443.117 | 3449.477 | -0.18% > ScalaKmeans | 259.384 | 258.648 | 0.28% > Philosophers | 24333.311 | 23438.22 | 3.82% > ScalaStmBench7 | 1102.43 | 1115.142 | -1.14% > FinagleChirper | 6814.192 | 6853.38 | -0.57% > FinagleHttp | 4762.902 | 4807.564 | -0.93% > > #### Renaissance/x86_64 > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 1117.185 | 1116.425 | 0.07% > Reactors | 11561.354 | 11812.499 | -2.13% > Als | 1580.838 | 1575.318 | 0.35% > ChiSquare | 459.601 | 467.109 | -1.61% > GaussMix | 705.944 | 685.595 | 2.97% > LogRegression | 659.944 | 656.428 | 0.54% > MovieLens | 7434.303 | 7592.271 | -2.08% > NaiveBayes | 413.482 | 417.369 | -0.93% > PageRank | 3259.233 | 3276.589 | -0.53% > FjKmeans | 946.429 | 938.991 | 0.79% > FutureGenetic | 1760.672 | 1815.272 | -3.01% > ParMnemonics | 2016.917 | 2033.101 | -0.80% > Scrabble | 147.996 | 150.084 | -1.39% > RxScrabble | 177.755 | 177.956 | -0.11% > Dotty | 673.754 | 683.919 | -1.49% > ScalaDoku | 2193.562 | 1958.419 | 12.01% > ScalaKmeans | 165.376 | 168.925 | -2.10% > ScalaStmBench7 | 1080.187 | 1049.184 | 2.95% > Philosophers | 14268.449 | 13308.87 | 7.21% > FinagleChirper | 4722.13 | 4688.3 | 0.72% > FinagleHttp | 3497.241 | 3605.118 | -2.99% > > Some renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and probably addressing with running them much more often. > > I have also run another benchmark, which is a popular Java JVM benchmark, with workloads wrapped in JMH and very slightly modified to run with newer JDKs, but I won't publish the results because I am not sure about the licensing terms. They look similar to the measurements above (i.e. +/- 2%, nothing very suspicious). > > Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here. > > ### Testing > - [x] tier1 (x86_64, aarch64, x86_32) > - [x] tier2 (x86_64, aarch64) > - [x] tier3 (x86_64, aarch64) > - [x] tier4 (x86_64, aarch64) Roman Kennke has updated the pull request incrementally with two additional commits since the last revision: - Merge remote-tracking branch 'origin/fast-locking' into fast-locking - Revert "Re-use r0 in call to unlock_object()" This reverts commit ebbcb615a788998596f403b47b72cf133cb9de46. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10590/files - new: https://git.openjdk.org/jdk/pull/10590/files/d9153be5..8d146b99 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10590&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10590&range=03-04 Stats: 7 lines in 3 files changed: 1 ins; 0 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/10590.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10590/head:pull/10590 PR: https://git.openjdk.org/jdk/pull/10590 From rkennke at openjdk.org Thu Oct 13 10:36:34 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 13 Oct 2022 10:36:34 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking [v4] In-Reply-To: References:

Message-ID: On Thu, 13 Oct 2022 08:46:45 GMT, Robbin Ehn wrote: > On aarch64 (linux and mac) I see these variations of crashes in random tests: (asserts in debug, crash in release it looks like) > > ``` > # Internal Error .... src/hotspot/share/c1/c1_Runtime1.cpp:768), pid=2884803, tid=2884996 > # assert(oopDesc::is_oop(oop(obj))) failed: must be NULL or an object: 0x000000000000dead > # V [libjvm.so+0x7851d4] Runtime1::monitorexit(JavaThread*, oopDesc*)+0x110 > ``` > > ``` > # SIGSEGV (0xb) at pc=0x0000fffc9d4e3de8, pid=1842880, tid=1842994 > # V [libjvm.so+0xbf3de8] SharedRuntime::monitor_exit_helper(oopDesc*, JavaThread*)+0x24 > ``` > > ``` > # SIGSEGV (0xb) at pc=0x0000fffca9f00394, pid=959883, tid=959927 > # V [libjvm.so+0xc90394] ObjectSynchronizer::exit(oopDesc*, JavaThread*)+0x54 > ``` Ugh. That is most likely caused by the recent change: https://github.com/rkennke/jdk/commit/ebbcb615a788998596f403b47b72cf133cb9de46 It used to be very stable before that. I have backed out that change, can you try again? Thanks, Roman ------------- PR: https://git.openjdk.org/jdk/pull/10590 From rkennke at openjdk.org Thu Oct 13 10:42:03 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 13 Oct 2022 10:42:03 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking [v3] In-Reply-To: References:

Message-ID: On Tue, 11 Oct 2022 20:41:57 GMT, Robbin Ehn wrote: > Regarding benchmarks, is it possible to get some indication what fast-locking+lillput result will be? FinagleHttp seems to suffer a bit, will Lillput give some/all of that back, or more? That particular benchmark, as some others, exhibit relatively high run-to-run variance. I have run it again many more times to average-out the variance, and I'm now getting the following results: baseline: 3503.844 ms/ops, fast-locking: 3546.344 ms/ops, percent: -1.20% That is still a slight regression, but with more confidence. Regarding Lilliput, I cannot really say at the moment. Some workloads are actually regressing with Lilliput, presumably because they are sensitive on the performance of loading the Klass* out of objects, and that is currently more complex in Lilliput (because it needs to coordinate with monitor locking). FinagleHttp seems to be one of those workloads. I am working to get rid of this limitation, and then I can be more specific. ------------- PR: https://git.openjdk.org/jdk/pull/10590 From fyang at openjdk.org Fri Oct 14 01:22:58 2022 From: fyang at openjdk.org (Fei Yang) Date: Fri, 14 Oct 2022 01:22:58 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking [v3] In-Reply-To: References:

Message-ID: On Wed, 12 Oct 2022 11:26:16 GMT, Aleksey Shipilev wrote: > Here is the basic support for RISC-V: https://cr.openjdk.java.net/~shade/8291555/riscv-patch-1.patch > > -- I adapted this from AArch64 changes, and tested it very lightly. @RealFYang, can I leave the testing and follow up fixes to you? @shipilev : Sure, I am happy to to that! Thanks for porting this to RISC-V :-) ------------- PR: https://git.openjdk.org/jdk/pull/10590 From rehn at openjdk.org Fri Oct 14 06:45:08 2022 From: rehn at openjdk.org (Robbin Ehn) Date: Fri, 14 Oct 2022 06:45:08 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking [v4] In-Reply-To: References:

Message-ID: On Thu, 13 Oct 2022 10:34:04 GMT, Roman Kennke wrote: > It used to be very stable before that. I have backed out that change, can you try again? Seems fine now, thanks. ------------- PR: https://git.openjdk.org/jdk/pull/10590 From fyang at openjdk.org Fri Oct 14 13:47:13 2022 From: fyang at openjdk.org (Fei Yang) Date: Fri, 14 Oct 2022 13:47:13 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking [v3] In-Reply-To: References:

Message-ID: <05W6k3vqT1b5IGhd653G8zPjCbtiN7HFg8KzZsiMorQ=.38f418d5-540e-46af-a72c-9d6b4471428a@github.com> On Fri, 14 Oct 2022 01:19:27 GMT, Fei Yang wrote: > > Here is the basic support for RISC-V: https://cr.openjdk.java.net/~shade/8291555/riscv-patch-1.patch > > -- I adapted this from AArch64 changes, and tested it very lightly. @RealFYang, can I leave the testing and follow up fixes to you? > > @shipilev : Sure, I am happy to to that! Thanks for porting this to RISC-V :-) @shipilev : After applying this on today's jdk master, linux-riscv64 fastdebug fail to build on HiFive Unmatched. I see JVM crash happens during the build process. I suppose you carried out the test with some release build, right? ------------- PR: https://git.openjdk.org/jdk/pull/10590 From rkennke at openjdk.org Fri Oct 14 14:30:00 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Fri, 14 Oct 2022 14:30:00 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking [v3] In-Reply-To: <05W6k3vqT1b5IGhd653G8zPjCbtiN7HFg8KzZsiMorQ=.38f418d5-540e-46af-a72c-9d6b4471428a@github.com> References:

<05W6k3vqT1b5IGhd653G8zPjCbtiN7HFg8KzZsiMorQ=.38f418d5-540e-46af-a72c-9d6b4471428a@github.com> Message-ID: On Fri, 14 Oct 2022 13:45:07 GMT, Fei Yang wrote: > > > Here is the basic support for RISC-V: https://cr.openjdk.java.net/~shade/8291555/riscv-patch-1.patch > > > -- I adapted this from AArch64 changes, and tested it very lightly. @RealFYang, can I leave the testing and follow up fixes to you? > > > > > > @shipilev : Sure, I am happy to to that! Thanks for porting this to RISC-V :-) > > @shipilev : After applying this on today's jdk master, linux-riscv64 fastdebug fail to build on HiFive Unmatched. I see JVM crash happens during the build process. I suppose you carried out the test with some release build, right? Have you applied the whole PR? Or only the patch that @shipilev provided. Because only the patch without the rest of the PR is bound to fail. ------------- PR: https://git.openjdk.org/jdk/pull/10590 From fyang at openjdk.org Fri Oct 14 14:35:07 2022 From: fyang at openjdk.org (Fei Yang) Date: Fri, 14 Oct 2022 14:35:07 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking [v3] In-Reply-To: References:

<05W6k3vqT1b5IGhd653G8zPjCbtiN7HFg8KzZsiMorQ=.38f418d5-540e-46af-a72c-9d6b4471428a@github.com> Message-ID: <9KWs3-ICjuSPKWkcn-hTz0V2rMUrn8B6aqmE2spm5es=.cc94175e-a8f9-468a-991a-656ee2c8c581@github.com> On Fri, 14 Oct 2022 14:26:20 GMT, Roman Kennke wrote: > > > > Here is the basic support for RISC-V: https://cr.openjdk.java.net/~shade/8291555/riscv-patch-1.patch > > > > -- I adapted this from AArch64 changes, and tested it very lightly. @RealFYang, can I leave the testing and follow up fixes to you? > > > > > > > > > @shipilev : Sure, I am happy to to that! Thanks for porting this to RISC-V :-) > > > > > > @shipilev : After applying this on today's jdk master, linux-riscv64 fastdebug fail to build on HiFive Unmatched. I see JVM crash happens during the build process. I suppose you carried out the test with some release build, right? > > Have you applied the whole PR? Or only the patch that @shipilev provided. Because only the patch without the rest of the PR is bound to fail. Yes, the whole PR: https://patch-diff.githubusercontent.com/raw/openjdk/jdk/pull/10590.diff ------------- PR: https://git.openjdk.org/jdk/pull/10590 From rkennke at openjdk.org Fri Oct 14 14:41:07 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Fri, 14 Oct 2022 14:41:07 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking [v3] In-Reply-To: <9KWs3-ICjuSPKWkcn-hTz0V2rMUrn8B6aqmE2spm5es=.cc94175e-a8f9-468a-991a-656ee2c8c581@github.com> References:

<05W6k3vqT1b5IGhd653G8zPjCbtiN7HFg8KzZsiMorQ=.38f418d5-540e-46af-a72c-9d6b4471428a@github.com> <9KWs3-ICjuSPKWkcn-hTz0V2rMUrn8B6aqmE2spm5es=.cc94175e-a8f9-468a-991a-656ee2c8c581@github.com> Message-ID: On Fri, 14 Oct 2022 14:39:01 GMT, Roman Kennke wrote: > > > > > > Here is the basic support for RISC-V: https://cr.openjdk.java.net/~shade/8291555/riscv-patch-1.patch > > > > > > -- I adapted this from AArch64 changes, and tested it very lightly. @RealFYang, can I leave the testing and follow up fixes to you? > > > > > > > > > > > > > > > @shipilev : Sure, I am happy to to that! Thanks for porting this to RISC-V :-) > > > > > > > > > > > > @shipilev : After applying this on today's jdk master, linux-riscv64 fastdebug fail to build on HiFive Unmatched. I see JVM crash happens during the build process. I suppose you carried out the test with some release build, right? > > > > > > > > > Have you applied the whole PR? Or only the patch that @shipilev provided. Because only the patch without the rest of the PR is bound to fail. > > > > > > Yes, the whole PR: https://patch-diff.githubusercontent.com/raw/openjdk/jdk/pull/10590.diff > > The PR reports a merge conflict in risc-v code, when applied vs latest tip. Have you resolved that? GHA (which includes risc-v) is happy, otherwise. @rkennke : I did see some "Hunk succeeded" messages for the risc-v part when applying the change with: $ patch -p1 < ~/10590.diff But I didn't check whether that will cause a problem here. patching file src/hotspot/cpu/riscv/c1_CodeStubs_riscv.cpp patching file src/hotspot/cpu/riscv/c1_LIRAssembler_riscv.cpp patching file src/hotspot/cpu/riscv/c1_LIRGenerator_riscv.cpp patching file src/hotspot/cpu/riscv/c1_MacroAssembler_riscv.cpp Hunk #1 succeeded at 58 (offset -1 lines). Hunk #2 succeeded at 67 (offset -1 lines). patching file src/hotspot/cpu/riscv/c1_Runtime1_riscv.cpp patching file src/hotspot/cpu/riscv/interp_masm_riscv.cpp patching file src/hotspot/cpu/riscv/macroAssembler_riscv.cpp Hunk #1 succeeded at 2499 (offset 324 lines). Hunk #2 succeeded at 4474 (offset 330 lines). patching file src/hotspot/cpu/riscv/macroAssembler_riscv.hpp Hunk #1 succeeded at 869 with fuzz 2 (offset 313 lines). Hunk #2 succeeded at 1252 (offset 325 lines). patching file src/hotspot/cpu/riscv/riscv.ad Hunk #1 succeeded at 2385 (offset 7 lines). Hunk #2 succeeded at 2407 (offset 7 lines). Hunk #3 succeeded at 2433 (offset 7 lines). Hunk #4 succeeded at 10403 (offset 33 lines). Hunk #5 succeeded at 10417 (offset 33 lines). patching file src/hotspot/cpu/riscv/sharedRuntime_riscv.cpp Hunk #1 succeeded at 975 (offset 21 lines). Hunk #2 succeeded at 1030 (offset 21 lines). Hunk #3 succeeded at 1042 (offset 21 lines). Hunk #4 succeeded at 1058 (offset 21 lines). Hunk #5 succeeded at 1316 (offset 24 lines). Hunk #6 succeeded at 1416 (offset 24 lines). Hunk #7 succeeded at 1492 (offset 24 lines). Hunk #8 succeeded at 1517 (offset 24 lines). Hunk #9 succeeded at 1621 (offset 24 lines). ------------- PR: https://git.openjdk.org/jdk/pull/10590 From rkennke at openjdk.org Fri Oct 14 15:42:01 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Fri, 14 Oct 2022 15:42:01 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking [v3] In-Reply-To: References:

<05W6k3vqT1b5IGhd653G8zPjCbtiN7HFg8KzZsiMorQ=.38f418d5-540e-46af-a72c-9d6b4471428a@github.com> <9KWs3-ICjuSPKWkcn-hTz0V2rMUrn8B6aqmE2spm5es=.cc94175e-a8f9-468a-991a-656ee2c8c581@github.com>

Message-ID: <2abWu-ITUoN-hNBTy6f0qQN-Q5XuAF3XXbTe7Kz63iU=.350a2155-f2ef-4909-98d8-350306413f74@github.com> On Fri, 14 Oct 2022 14:53:57 GMT, Fei Yang wrote: > > > > > > > Here is the basic support for RISC-V: https://cr.openjdk.java.net/~shade/8291555/riscv-patch-1.patch > > > > > > > -- I adapted this from AArch64 changes, and tested it very lightly. @RealFYang, can I leave the testing and follow up fixes to you? > > > > > > > > > > > > > > > > > > @shipilev : Sure, I am happy to to that! Thanks for porting this to RISC-V :-) > > > > > > > > > > > > > > > @shipilev : After applying this on today's jdk master, linux-riscv64 fastdebug fail to build on HiFive Unmatched. I see JVM crash happens during the build process. I suppose you carried out the test with some release build, right? > > > > > > > > > > > > Have you applied the whole PR? Or only the patch that @shipilev provided. Because only the patch without the rest of the PR is bound to fail. > > > > > > > > > Yes, the whole PR: https://patch-diff.githubusercontent.com/raw/openjdk/jdk/pull/10590.diff > > > > > > The PR reports a merge conflict in risc-v code, when applied vs latest tip. Have you resolved that? GHA (which includes risc-v) is happy, otherwise. > > @rkennke : I did see some "Hunk succeeded" messages for the risc-v part when applying the change with: $ patch -p1 < ~/10590.diff But I didn't check whether that will cause a problem here. If you take the latest code from this PR, it would already have the patch applied. No need to patch it again. ------------- PR: https://git.openjdk.org/jdk/pull/10590 From fyang at openjdk.org Mon Oct 17 04:33:08 2022 From: fyang at openjdk.org (Fei Yang) Date: Mon, 17 Oct 2022 04:33:08 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking [v3] In-Reply-To: <2abWu-ITUoN-hNBTy6f0qQN-Q5XuAF3XXbTe7Kz63iU=.350a2155-f2ef-4909-98d8-350306413f74@github.com> References:

<05W6k3vqT1b5IGhd653G8zPjCbtiN7HFg8KzZsiMorQ=.38f418d5-540e-46af-a72c-9d6b4471428a@github.com> <9KWs3-ICjuSPKWkcn-hTz0V2rMUrn8B6aqmE2spm5es=.cc94175e-a8f9-468a-991a-656ee2c8c581@github.com>

<2abWu-ITUoN-hNBTy6f0qQN-Q5XuAF3XXbTe7Kz63iU=.350a2155-f2ef-4909-98d8-350306413f74@github.com> Message-ID: On Fri, 14 Oct 2022 15:39:41 GMT, Roman Kennke wrote: >>> > > > > > Here is the basic support for RISC-V: https://cr.openjdk.java.net/~shade/8291555/riscv-patch-1.patch >>> > > > > > -- I adapted this from AArch64 changes, and tested it very lightly. @RealFYang, can I leave the testing and follow up fixes to you? >>> > > > > >>> > > > > >>> > > > > @shipilev : Sure, I am happy to to that! Thanks for porting this to RISC-V :-) >>> > > > >>> > > > >>> > > > @shipilev : After applying this on today's jdk master, linux-riscv64 fastdebug fail to build on HiFive Unmatched. I see JVM crash happens during the build process. I suppose you carried out the test with some release build, right? >>> > > >>> > > >>> > > Have you applied the whole PR? Or only the patch that @shipilev provided. Because only the patch without the rest of the PR is bound to fail. >>> > >>> > >>> > Yes, the whole PR: https://patch-diff.githubusercontent.com/raw/openjdk/jdk/pull/10590.diff >>> >>> The PR reports a merge conflict in risc-v code, when applied vs latest tip. Have you resolved that? GHA (which includes risc-v) is happy, otherwise. >> >> @rkennke : >> I did see some "Hunk succeeded" messages for the risc-v part when applying the change with: $ patch -p1 < ~/10590.diff >> But I didn't check whether that will cause a problem here. >> >> >> patching file src/hotspot/cpu/riscv/c1_CodeStubs_riscv.cpp >> patching file src/hotspot/cpu/riscv/c1_LIRAssembler_riscv.cpp >> patching file src/hotspot/cpu/riscv/c1_LIRGenerator_riscv.cpp >> patching file src/hotspot/cpu/riscv/c1_MacroAssembler_riscv.cpp >> Hunk #1 succeeded at 58 (offset -1 lines). >> Hunk #2 succeeded at 67 (offset -1 lines). >> patching file src/hotspot/cpu/riscv/c1_Runtime1_riscv.cpp >> patching file src/hotspot/cpu/riscv/interp_masm_riscv.cpp >> patching file src/hotspot/cpu/riscv/macroAssembler_riscv.cpp >> Hunk #1 succeeded at 2499 (offset 324 lines). >> Hunk #2 succeeded at 4474 (offset 330 lines). >> patching file src/hotspot/cpu/riscv/macroAssembler_riscv.hpp >> Hunk #1 succeeded at 869 with fuzz 2 (offset 313 lines). >> Hunk #2 succeeded at 1252 (offset 325 lines). >> patching file src/hotspot/cpu/riscv/riscv.ad >> Hunk #1 succeeded at 2385 (offset 7 lines). >> Hunk #2 succeeded at 2407 (offset 7 lines). >> Hunk #3 succeeded at 2433 (offset 7 lines). >> Hunk #4 succeeded at 10403 (offset 33 lines). >> Hunk #5 succeeded at 10417 (offset 33 lines). >> patching file src/hotspot/cpu/riscv/sharedRuntime_riscv.cpp >> Hunk #1 succeeded at 975 (offset 21 lines). >> Hunk #2 succeeded at 1030 (offset 21 lines). >> Hunk #3 succeeded at 1042 (offset 21 lines). >> Hunk #4 succeeded at 1058 (offset 21 lines). >> Hunk #5 succeeded at 1316 (offset 24 lines). >> Hunk #6 succeeded at 1416 (offset 24 lines). >> Hunk #7 succeeded at 1492 (offset 24 lines). >> Hunk #8 succeeded at 1517 (offset 24 lines). >> Hunk #9 succeeded at 1621 (offset 24 lines). > >> > > > > > > Here is the basic support for RISC-V: https://cr.openjdk.java.net/~shade/8291555/riscv-patch-1.patch >> > > > > > > -- I adapted this from AArch64 changes, and tested it very lightly. @RealFYang, can I leave the testing and follow up fixes to you? >> > > > > > >> > > > > > >> > > > > > @shipilev : Sure, I am happy to to that! Thanks for porting this to RISC-V :-) >> > > > > >> > > > > >> > > > > @shipilev : After applying this on today's jdk master, linux-riscv64 fastdebug fail to build on HiFive Unmatched. I see JVM crash happens during the build process. I suppose you carried out the test with some release build, right? >> > > > >> > > > >> > > > Have you applied the whole PR? Or only the patch that @shipilev provided. Because only the patch without the rest of the PR is bound to fail. >> > > >> > > >> > > Yes, the whole PR: https://patch-diff.githubusercontent.com/raw/openjdk/jdk/pull/10590.diff >> > >> > >> > The PR reports a merge conflict in risc-v code, when applied vs latest tip. Have you resolved that? GHA (which includes risc-v) is happy, otherwise. >> >> @rkennke : I did see some "Hunk succeeded" messages for the risc-v part when applying the change with: $ patch -p1 < ~/10590.diff But I didn't check whether that will cause a problem here. > > If you take the latest code from this PR, it would already have the patch applied. No need to patch it again. @rkennke : Could you please add this follow-up fix for RISC-V? I can build fastdebug on HiFive Unmatched board with this fix now and run non-trivial benchmark workloads. I will carry out more tests. [riscv-patch-2.txt](https://github.com/openjdk/jdk/files/9796886/riscv-patch-2.txt) ------------- PR: https://git.openjdk.org/jdk/pull/10590 From rkennke at openjdk.org Mon Oct 17 10:13:13 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Mon, 17 Oct 2022 10:13:13 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking [v6] In-Reply-To: References: Message-ID: > This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turns out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. > > What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock. > > This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typcially remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols. > > In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth to add support for recursive fast-locking. > > One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, and thus handing over to the contending thread. > > As an alternative, I considered to remove stack-locking altogether, and only use heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it. > > This change enables to simplify (and speed-up!) a lot of code: > > - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header. > - Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. This is safe because Java threads participate in monitor deflation protocol. This would be implemented in a separate PR > > ### Benchmarks > > All benchmarks are run on server-class metal machines. The JVM settings are always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less is better. > > #### DaCapo/AArch64 > > Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. Benchmarks that showed results far off the baseline or showed high variance have been repeated and I am reporting results with the most bias *against* fast-locking. The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean what is it actually doing that a change in locking shows >30% perf difference? > > benchmark | baseline | fast-locking | % | size > -- | -- | -- | -- | -- > avrora | 27859 | 27563 | 1.07% | large > batik | 20786 | 20847 | -0.29% | large > biojava | 27421 | 27334 | 0.32% | default > eclipse | 59918 | 60522 | -1.00% | large > fop | 3670 | 3678 | -0.22% | default > graphchi | 2088 | 2060 | 1.36% | default > h2 | 297391 | 291292 | 2.09% | huge > jme | 8762 | 8877 | -1.30% | default > jython | 18938 | 18878 | 0.32% | default > luindex | 1339 | 1325 | 1.06% | default > lusearch | 918 | 936 | -1.92% | default > pmd | 58291 | 58423 | -0.23% | large > sunflow | 32617 | 24961 | 30.67% | large > tomcat | 25481 | 25992 | -1.97% | large > tradebeans | 314640 | 311706 | 0.94% | huge > tradesoap | 107473 | 110246 | -2.52% | huge > xalan | 6047 | 5882 | 2.81% | default > zxing | 970 | 926 | 4.75% | default > > #### DaCapo/x86_64 > > The following measurements have been taken on an Intel Xeon Scalable Processors (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same settings and considerations as in the measurements above. > > benchmark | baseline | fast-Locking | % | size > -- | -- | -- | -- | -- > avrora | 127690 | 126749 | 0.74% | large > batik | 12736 | 12641 | 0.75% | large > biojava | 15423 | 15404 | 0.12% | default > eclipse | 41174 | 41498 | -0.78% | large > fop | 2184 | 2172 | 0.55% | default > graphchi | 1579 | 1560 | 1.22% | default > h2 | 227614 | 230040 | -1.05% | huge > jme | 8591 | 8398 | 2.30% | default > jython | 13473 | 13356 | 0.88% | default > luindex | 824 | 813 | 1.35% | default > lusearch | 962 | 968 | -0.62% | default > pmd | 40827 | 39654 | 2.96% | large > sunflow | 53362 | 43475 | 22.74% | large > tomcat | 27549 | 28029 | -1.71% | large > tradebeans | 190757 | 190994 | -0.12% | huge > tradesoap | 68099 | 67934 | 0.24% | huge > xalan | 7969 | 8178 | -2.56% | default > zxing | 1176 | 1148 | 2.44% | default > > #### Renaissance/AArch64 > > This tests Renaissance/JMH version 0.14.1 on same machines as DaCapo above, with same JVM settings. > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 2558.832 | 2513.594 | 1.80% > Reactors | 14715.626 | 14311.246 | 2.83% > Als | 1851.485 | 1869.622 | -0.97% > ChiSquare | 1007.788 | 1003.165 | 0.46% > GaussMix | 1157.491 | 1149.969 | 0.65% > LogRegression | 717.772 | 733.576 | -2.15% > MovieLens | 7916.181 | 8002.226 | -1.08% > NaiveBayes | 395.296 | 386.611 | 2.25% > PageRank | 4294.939 | 4346.333 | -1.18% > FjKmeans | 496.076 | 493.873 | 0.45% > FutureGenetic | 2578.504 | 2589.255 | -0.42% > Mnemonics | 4898.886 | 4903.689 | -0.10% > ParMnemonics | 4260.507 | 4210.121 | 1.20% > Scrabble | 139.37 | 138.312 | 0.76% > RxScrabble | 320.114 | 322.651 | -0.79% > Dotty | 1056.543 | 1068.492 | -1.12% > ScalaDoku | 3443.117 | 3449.477 | -0.18% > ScalaKmeans | 259.384 | 258.648 | 0.28% > Philosophers | 24333.311 | 23438.22 | 3.82% > ScalaStmBench7 | 1102.43 | 1115.142 | -1.14% > FinagleChirper | 6814.192 | 6853.38 | -0.57% > FinagleHttp | 4762.902 | 4807.564 | -0.93% > > #### Renaissance/x86_64 > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 1117.185 | 1116.425 | 0.07% > Reactors | 11561.354 | 11812.499 | -2.13% > Als | 1580.838 | 1575.318 | 0.35% > ChiSquare | 459.601 | 467.109 | -1.61% > GaussMix | 705.944 | 685.595 | 2.97% > LogRegression | 659.944 | 656.428 | 0.54% > MovieLens | 7434.303 | 7592.271 | -2.08% > NaiveBayes | 413.482 | 417.369 | -0.93% > PageRank | 3259.233 | 3276.589 | -0.53% > FjKmeans | 946.429 | 938.991 | 0.79% > FutureGenetic | 1760.672 | 1815.272 | -3.01% > ParMnemonics | 2016.917 | 2033.101 | -0.80% > Scrabble | 147.996 | 150.084 | -1.39% > RxScrabble | 177.755 | 177.956 | -0.11% > Dotty | 673.754 | 683.919 | -1.49% > ScalaDoku | 2193.562 | 1958.419 | 12.01% > ScalaKmeans | 165.376 | 168.925 | -2.10% > ScalaStmBench7 | 1080.187 | 1049.184 | 2.95% > Philosophers | 14268.449 | 13308.87 | 7.21% > FinagleChirper | 4722.13 | 4688.3 | 0.72% > FinagleHttp | 3497.241 | 3605.118 | -2.99% > > Some renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and probably addressing with running them much more often. > > I have also run another benchmark, which is a popular Java JVM benchmark, with workloads wrapped in JMH and very slightly modified to run with newer JDKs, but I won't publish the results because I am not sure about the licensing terms. They look similar to the measurements above (i.e. +/- 2%, nothing very suspicious). > > Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here. > > ### Testing > - [x] tier1 (x86_64, aarch64, x86_32) > - [x] tier2 (x86_64, aarch64) > - [x] tier3 (x86_64, aarch64) > - [x] tier4 (x86_64, aarch64) Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: More RISC-V fixes ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10590/files - new: https://git.openjdk.org/jdk/pull/10590/files/8d146b99..57403ad1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10590&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10590&range=04-05 Stats: 37 lines in 5 files changed: 0 ins; 8 del; 29 mod Patch: https://git.openjdk.org/jdk/pull/10590.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10590/head:pull/10590 PR: https://git.openjdk.org/jdk/pull/10590 From shade at openjdk.org Mon Oct 17 18:27:44 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 17 Oct 2022 18:27:44 GMT Subject: RFR: 8294438: Fix misleading-indentation warnings in hotspot [v2] In-Reply-To: References: Message-ID: > There are number of places where misleading-indentation is reported by GCC. Currently, the warning is disabled for the entirety of Hotspot, which is not good. > > C1 does an unusual style here. Changing it globally would touch a lot of lines. Instead of doing that, I fit the existing style while also resolving the warnings. Note this actually solves a bug in `lir_alloc_array`, where `do_temp` are called without a check. > > Build-tested this with product of: > - GCC 10 > - {i686, x86_64, aarch64, powerpc64le, s390x, armhf, riscv64} > - {server, zero} > - {release, fastdebug} > > Linux x86_64 fastdebug `tier1` is fine. Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits: - Merge branch 'master' into JDK-8294438-misleading-indentation - Merge branch 'master' into JDK-8294438-misleading-indentation - Also javaClasses.cpp - Fix ------------- Changes: https://git.openjdk.org/jdk/pull/10444/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10444&range=01 Stats: 56 lines in 5 files changed: 7 ins; 20 del; 29 mod Patch: https://git.openjdk.org/jdk/pull/10444.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10444/head:pull/10444 PR: https://git.openjdk.org/jdk/pull/10444 From stefank at openjdk.org Tue Oct 18 13:04:57 2022 From: stefank at openjdk.org (Stefan Karlsson) Date: Tue, 18 Oct 2022 13:04:57 GMT Subject: RFR: 8295475: Move non-resource allocation strategies out of ResourceObj Message-ID: <4RakidFUe7jYYkY_1XkaBRuwJCxPd90CO1trC7QNzno=.18335453-ebc7-42b3-8973-d2ffefc47b53@github.com> Background to this patch: This prototype/patch has been discussed with a few HotSpot devs, and I've gotten feedback that I should send it out for broader discussion/review. It could be a first step to make it easier to talk about our allocation super classes and strategies. This in turn would make it easier to have further discussions around how to make our allocation strategies more flexible. E.g. do we really need to tie down utility classes to a specific allocation strategy? Do we really have to provide MEMFLAGS as compile time flags? Etc. PR RFC: HotSpot has a few allocation classes that other classes can inherit from to get different dynamic-allocation strategies: MetaspaceObj - allocates in the Metaspace CHeap - uses malloc ResourceObj - ... The last class sounds like it provide an allocation strategy to allocate inside a thread's resource area. This is true, but it also provides functions to allow the instances to be allocated in Areanas or even CHeap allocated memory. This is IMHO misleading, and often leads to confusion among HotSpot developers. I propose that we simplify ResourceObj to only provide an allocation strategy for resource allocations, and move the multi-allocation strategy feature to another class, which isn't named ResourceObj. In my proposal and prototype I've used the name AnyObj, as short, simple name. I'm open to changing the name to something else. The patch also adds a new class named ArenaObj, which is for objects only allocated in provided arenas. The patch also removes the need to provide ResourceObj/AnyObj::C_HEAP to `operator new`. If you pass in a MEMFLAGS argument it now means that you want to allocate on the CHeap. ------------- Commit messages: - Remove AnyObj new operator taking an allocation_type - Use more specific allocation types Changes: https://git.openjdk.org/jdk/pull/10745/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10745&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8295475 Stats: 458 lines in 152 files changed: 67 ins; 37 del; 354 mod Patch: https://git.openjdk.org/jdk/pull/10745.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10745/head:pull/10745 PR: https://git.openjdk.org/jdk/pull/10745 From stefank at openjdk.org Tue Oct 18 13:42:40 2022 From: stefank at openjdk.org (Stefan Karlsson) Date: Tue, 18 Oct 2022 13:42:40 GMT Subject: RFR: 8295475: Move non-resource allocation strategies out of ResourceObj [v2] In-Reply-To: <4RakidFUe7jYYkY_1XkaBRuwJCxPd90CO1trC7QNzno=.18335453-ebc7-42b3-8973-d2ffefc47b53@github.com> References: <4RakidFUe7jYYkY_1XkaBRuwJCxPd90CO1trC7QNzno=.18335453-ebc7-42b3-8973-d2ffefc47b53@github.com> Message-ID: > Background to this patch: > > This prototype/patch has been discussed with a few HotSpot devs, and I've gotten feedback that I should send it out for broader discussion/review. It could be a first step to make it easier to talk about our allocation super classes and strategies. This in turn would make it easier to have further discussions around how to make our allocation strategies more flexible. E.g. do we really need to tie down utility classes to a specific allocation strategy? Do we really have to provide MEMFLAGS as compile time flags? Etc. > > PR RFC: > > HotSpot has a few allocation classes that other classes can inherit from to get different dynamic-allocation strategies: > > MetaspaceObj - allocates in the Metaspace > CHeap - uses malloc > ResourceObj - ... > > The last class sounds like it provide an allocation strategy to allocate inside a thread's resource area. This is true, but it also provides functions to allow the instances to be allocated in Areanas or even CHeap allocated memory. > > This is IMHO misleading, and often leads to confusion among HotSpot developers. > > I propose that we simplify ResourceObj to only provide an allocation strategy for resource allocations, and move the multi-allocation strategy feature to another class, which isn't named ResourceObj. > > In my proposal and prototype I've used the name AnyObj, as short, simple name. I'm open to changing the name to something else. > > The patch also adds a new class named ArenaObj, which is for objects only allocated in provided arenas. > > The patch also removes the need to provide ResourceObj/AnyObj::C_HEAP to `operator new`. If you pass in a MEMFLAGS argument it now means that you want to allocate on the CHeap. Stefan Karlsson has updated the pull request incrementally with one additional commit since the last revision: Fix Shenandoah ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10745/files - new: https://git.openjdk.org/jdk/pull/10745/files/bafa0229..4e8ac797 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10745&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10745&range=00-01 Stats: 4 lines in 4 files changed: 0 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/10745.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10745/head:pull/10745 PR: https://git.openjdk.org/jdk/pull/10745 From ngasson at openjdk.org Wed Oct 19 17:05:26 2022 From: ngasson at openjdk.org (Nick Gasson) Date: Wed, 19 Oct 2022 17:05:26 GMT Subject: RFR: 8294775: Shenandoah: reduce contention on _threads_in_evac [v2] In-Reply-To: References: Message-ID: > The idea here is to reduce contention on the shared `_threads_in_evac` counter by splitting its state over multiple independent cache lines. Each thread hashes to one particular counter based on its `Thread*`. This helps improve throughput of concurrent evacuation where many Java threads may be attempting to update this counter on the load barrier slow path. > > See this earlier thread for details and SPECjbb results: https://mail.openjdk.org/pipermail/shenandoah-dev/2022-October/017494.html > > Also tested `hotspot_gc_shenandoah` on x86 and AArch64. Nick Gasson has updated the pull request incrementally with one additional commit since the last revision: Refactor ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10573/files - new: https://git.openjdk.org/jdk/pull/10573/files/2303fbed..14cec5ed Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10573&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10573&range=00-01 Stats: 184 lines in 3 files changed: 109 ins; 45 del; 30 mod Patch: https://git.openjdk.org/jdk/pull/10573.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10573/head:pull/10573 PR: https://git.openjdk.org/jdk/pull/10573 From ngasson at openjdk.org Wed Oct 19 17:05:26 2022 From: ngasson at openjdk.org (Nick Gasson) Date: Wed, 19 Oct 2022 17:05:26 GMT Subject: RFR: 8294775: Shenandoah: reduce contention on _threads_in_evac [v2] In-Reply-To: References:

Message-ID: <8AVo6UxTFFl8JY9i2Oy9XrWm50OUYEhFYOuv1-7mslA=.6995e072-5dca-4366-a7b4-a80245b4ff98@github.com> On Tue, 11 Oct 2022 11:57:55 GMT, Roman Kennke wrote: >> Nick Gasson has updated the pull request incrementally with one additional commit since the last revision: >> >> Refactor > > src/hotspot/share/gc/shenandoah/shenandoahEvacOOMHandler.hpp line 92: > >> 90: shenandoah_padding(0); >> 91: struct { >> 92: volatile jint bits; > > The bits field needs a comment saying that it combines a counter with an OOM bit. In-fact, it would probably benefit from a little bit of refactoring, make it a class, and move accessors and relevant methods into it, and avoid public access to the field? I'm not sure if it's exactly what you intended, but I had a go at refactoring it in the last commit. ------------- PR: https://git.openjdk.org/jdk/pull/10573 From shade at openjdk.org Wed Oct 19 19:11:42 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 19 Oct 2022 19:11:42 GMT Subject: RFR: 8294438: Fix misleading-indentation warnings in hotspot [v3] In-Reply-To: References: Message-ID: > There are number of places where misleading-indentation is reported by GCC. Currently, the warning is disabled for the entirety of Hotspot, which is not good. > > C1 does an unusual style here. Changing it globally would touch a lot of lines. Instead of doing that, I fit the existing style while also resolving the warnings. Note this actually solves a bug in `lir_alloc_array`, where `do_temp` are called without a check. > > Build-tested this with product of: > - GCC 10 > - {i686, x86_64, aarch64, powerpc64le, s390x, armhf, riscv64} > - {server, zero} > - {release, fastdebug} > > Linux x86_64 fastdebug `tier1` is fine. Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains five commits: - Merge branch 'master' into JDK-8294438-misleading-indentation - Merge branch 'master' into JDK-8294438-misleading-indentation - Merge branch 'master' into JDK-8294438-misleading-indentation - Also javaClasses.cpp - Fix ------------- Changes: https://git.openjdk.org/jdk/pull/10444/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10444&range=02 Stats: 56 lines in 5 files changed: 7 ins; 20 del; 29 mod Patch: https://git.openjdk.org/jdk/pull/10444.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10444/head:pull/10444 PR: https://git.openjdk.org/jdk/pull/10444 From shade at openjdk.org Thu Oct 20 07:16:55 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 20 Oct 2022 07:16:55 GMT Subject: RFR: 8294438: Fix misleading-indentation warnings in hotspot [v3] In-Reply-To: References: